Enhanced suffix arrays as language models: Virtual k-testable languages
In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with practically unlimited
size n. This approach, which is used with synchronous back-off, allows
us to distinguish between alternative sequences using large contexts. We
also show that we can build this kind of model with additional information for each symbol, such as part-of-speech tags and dependency
information.
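
As a minimal sketch of why a suffix array removes the need to fix n in advance, the following Python code counts the occurrences of an n-gram of any length with two binary searches over the sorted suffixes. The function names (`build_suffix_array`, `count_ngram`) and the toy corpus are illustrative assumptions, not taken from the article, and the naive construction shown here is not the enhanced suffix array the article refers to.

```python
from typing import List


def build_suffix_array(tokens: List[str]) -> List[int]:
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction for illustration only; practical implementations
    use dedicated suffix-array construction algorithms.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens: List[str], sa: List[int], ngram: List[str]) -> int:
    """Count occurrences of `ngram` (any length) using the suffix array."""
    n = len(ngram)

    # Lower bound: first suffix whose first n tokens are >= ngram.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < ngram:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first n tokens are > ngram.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= ngram:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


# Hypothetical toy corpus, used only to exercise the functions.
corpus = "the cat sat on the mat the cat ran".split()
suffix_array = build_suffix_array(corpus)
unigram_count = count_ngram(corpus, suffix_array, ["the"])
bigram_count = count_ngram(corpus, suffix_array, ["the", "cat"])
trigram_count = count_ngram(corpus, suffix_array, ["the", "cat", "sat"])
```

The same index answers queries for every n, so no separate table per n-gram order is needed.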
The approach can also be viewed as a collection of virtual k-testable
automata. Once built, we can directly access the results of any k-testable
automaton generated from the input training data. Synchronous back-off
automatically identifies the k-testable automaton with the largest
feasible k. We have used this approach in several classification tasks.
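
The following Python sketch illustrates one possible reading of synchronous back-off, under the assumption that all alternative sequences are reduced to the same window size k at the same time, and that k is lowered until at least one alternative has been seen in the training data. The function names, the choice to take the last k symbols of each alternative, and the toy data are hypothetical. A naive linear scan stands in for the suffix-array lookup to keep the example short.

```python
from typing import List, Sequence, Tuple


def count_occurrences(corpus: List[str], ngram: List[str]) -> int:
    """Naive stand-in for a suffix-array lookup: count matches of `ngram`."""
    n = len(ngram)
    return sum(1 for i in range(len(corpus) - n + 1) if corpus[i:i + n] == ngram)


def synchronous_backoff(
    corpus: List[str], alternatives: Sequence[Sequence[str]]
) -> Tuple[int, List[int]]:
    """Return the largest k with a non-zero count, and the counts at that k.

    Assumption: every alternative is scored with the same k (its last k
    symbols), and k decreases for all alternatives together.
    """
    max_k = max(len(alt) for alt in alternatives)
    for k in range(max_k, 0, -1):
        counts = [count_occurrences(corpus, list(alt[-k:])) for alt in alternatives]
        if any(counts):
            return k, counts
    # No alternative shares even a single symbol with the training data.
    return 0, [0] * len(alternatives)


# Hypothetical training data and two competing alternatives.
training = "the cat sat on the mat and the cat ran to the mat".split()
candidates = [
    ["dog", "sat", "on", "the", "mat"],
    ["dog", "sat", "on", "the", "cat"],
]
best_k, best_counts = synchronous_backoff(training, candidates)
```

In this toy case no alternative matches at k = 5, so both back off together to k = 4, where only the first alternative is attested.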