Comparing the Effect of Smoothing and N-gram Order: Finding the Best Combination of Smoothing Method and N-gram Order
Abstract
SRILM is a toolkit for building and applying statistical language models (LMs), designed and developed primarily for use in speech recognition, statistical tagging and segmentation, and machine translation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995. The toolkit has also greatly benefited from its use and enhancements during the Johns Hopkins University/CLSP summer workshops in 1995, 1996, 1997, and 2002. In this thesis, I study the effect of the smoothing method and the n-gram order on language models built with the SRILM toolkit.
My primary method is comparison. First, a training corpus and a testing corpus are downloaded from the web.
Then, using the command window, I train language models on the training corpus with different smoothing methods and n-gram orders, and evaluate each model on the downloaded testing corpus. This yields a perplexity for each model, which measures how well the model fits the test data. I list every perplexity and compare them across smoothing methods and n-gram orders to see which language model has the minimal perplexity; that model is the best one.
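As a minimal sketch of this step (assuming SRILM is installed and on the PATH; the file names train.txt and test.txt are placeholders for the downloaded corpora), one model per smoothing method can be trained and scored like this:

    # Train a trigram LM with modified Kneser-Ney smoothing.
    ngram-count -text train.txt -order 3 -kndiscount -interpolate -lm kn3.lm
    # Train a trigram LM with Witten-Bell smoothing for comparison.
    ngram-count -text train.txt -order 3 -wbdiscount -lm wb3.lm
    # Score each model on the held-out test corpus; ngram prints the perplexity.
    ngram -lm kn3.lm -order 3 -ppl test.txt
    ngram -lm wb3.lm -order 3 -ppl test.txt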
I also repeat the experiment with two other corpora, one for training and one for testing, to see the effect of the corpus on the language model. If the two groups of perplexities are the same, the choice of corpus does not affect perplexity; otherwise, it does.
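For reference, the perplexity of a test corpus W = w_1 ... w_N under an n-gram model follows the standard definition (SRILM's ngram additionally accounts for sentence boundaries and out-of-vocabulary words):

    \mathrm{PP}(W) = P(w_1 w_2 \cdots w_N)^{-1/N}
                   = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})}}

A lower perplexity means the model predicts the test corpus better.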
In conclusion, my method is to calculate the perplexity of each language model under every combination of smoothing method and n-gram order, and to compare the perplexities to find the combination that works best for the language model. At the same time, we learn how the choice of corpus affects a language model with the same smoothing and order.
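The full comparison can be sketched as a sweep over orders and smoothing methods (again assuming SRILM is on the PATH; the file names are placeholders, and Kneser-Ney discounting is only applied for order 2 and above):

    #!/bin/sh
    # Train and score one LM per (order, smoothing) pair, printing each perplexity.
    for order in 2 3 4 5; do
        # Modified Kneser-Ney (interpolated)
        ngram-count -text train.txt -order $order -kndiscount -interpolate -lm kn$order.lm
        echo "order=$order smoothing=kneser-ney"
        ngram -lm kn$order.lm -order $order -ppl test.txt
        # Witten-Bell
        ngram-count -text train.txt -order $order -wbdiscount -lm wb$order.lm
        echo "order=$order smoothing=witten-bell"
        ngram -lm wb$order.lm -order $order -ppl test.txt
    done

The (order, smoothing) pair with the lowest reported perplexity identifies the best combination.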