Comparing the Effects of Smoothing and N-gram Order: Finding the Best Combination of Smoothing Method and N-gram Order
SRILM is a toolkit for building and applying statistical language models (LMs), designed and developed primarily for use in speech recognition, statistical tagging and segmentation, and machine translation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995, and it has also benefited greatly from its use and enhancement during the Johns Hopkins University/CLSP summer workshops in 1995, 1996, 1997, and 2002.

In this thesis, I study how the smoothing method and the n-gram order affect language models built with the SRILM toolkit. My primary method is comparison. First, a training corpus and a testing corpus are downloaded from the web and checked. Then, using the command line, I train language models on the training corpus with different smoothing methods and n-gram orders, and I evaluate each model on the testing corpus. Finally, I obtain the perplexity of each model, which measures its quality. I list every perplexity and compare them across smoothing methods and n-gram orders to see which language model achieves the minimal perplexity; that model is the best one. I then repeat the experiment with two different corpora, one for training and one for testing, to see how the choice of corpus affects the language model. If the two groups of perplexities are the same, the choice of corpus does not affect perplexity; otherwise, it does.

In summary, my approach is to compute the perplexity of each language model under different smoothing methods and n-gram orders, and to compare these perplexities to find the combination of smoothing and order that yields the best language model. At the same time, this reveals the effect of using different training corpora on language models with the same smoothing and order.
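The comparison described above can be sketched in miniature. The following is an illustrative Python toy, not SRILM itself: it trains n-gram models of several orders on a tiny made-up corpus, applies add-k smoothing (one of the simplest smoothing methods; the value of k and the example sentences are my own assumptions), and prints the perplexity of each model on a held-out sentence, mirroring the train/evaluate/compare loop of the thesis.

```python
import math
from collections import Counter

def train_counts(tokens, order):
    """Collect n-gram counts and (n-1)-gram context counts, with padding."""
    ngrams, contexts = Counter(), Counter()
    padded = ["<s>"] * (order - 1) + tokens + ["</s>"]
    for i in range(order - 1, len(padded)):
        gram = tuple(padded[i - order + 1 : i + 1])
        ngrams[gram] += 1
        contexts[gram[:-1]] += 1
    return ngrams, contexts

def perplexity(tokens, ngrams, contexts, order, vocab_size, k):
    """Perplexity under add-k smoothing: P(w|h) = (c(hw)+k) / (c(h)+k*V)."""
    padded = ["<s>"] * (order - 1) + tokens + ["</s>"]
    log_prob, n = 0.0, 0
    for i in range(order - 1, len(padded)):
        gram = tuple(padded[i - order + 1 : i + 1])
        p = (ngrams[gram] + k) / (contexts[gram[:-1]] + k * vocab_size)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Hypothetical toy corpora; a real experiment would use downloaded text files.
train = "the cat sat on the mat the dog sat on the rug".split()
held_out = "the cat sat on the rug".split()
vocab_size = len(set(train) | {"</s>"})

# Compare every combination of n-gram order and smoothing strength k.
for order in (1, 2, 3):
    ngrams, contexts = train_counts(train, order)
    for k in (1.0, 0.1):
        ppl = perplexity(held_out, ngrams, contexts, order, vocab_size, k)
        print(f"order={order} k={k}: perplexity={ppl:.2f}")
```

The combination printing the lowest perplexity would be chosen as the best, exactly as the thesis selects the best SRILM configuration; the real experiments swap this toy for SRILM's own training and evaluation tools and for much larger corpora.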