Token merging in language model-based confusible disambiguation
In the context of confusible disambiguation (spelling correction that requires context), the synchronous
back-off strategy combined with traditional n-gram language models performs well. However, when the
alternatives consist of different numbers of tokens, this classification technique cannot be applied directly,
because the probability computations are skewed. Previous work has already shown that probabilities
based on n-grams of different orders should not be compared directly.
In this article, we propose new probability metrics in which the n-gram order is varied according to the
number of tokens in the confusible alternative. This requires access to n-grams of variable length. Results
show that the synchronous back-off method is extremely robust.
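To make the idea concrete, the following is a minimal sketch of synchronous back-off under such a varied-order metric, assuming a plain relative-frequency n-gram model held in a hypothetical dictionary NGRAM_COUNTS (token tuple to corpus count); the names and parameters are illustrative, not taken from the article.

from typing import Dict, List, Tuple

NGRAM_COUNTS: Dict[Tuple[str, ...], int] = {}


def ngram_count(tokens: Tuple[str, ...]) -> int:
    return NGRAM_COUNTS.get(tokens, 0)


def score(left: List[str], alternative: List[str], order: int) -> float:
    """Relative frequency of the alternative after `order` tokens of left context."""
    context = tuple(left)[-order:] if order > 0 else ()
    numerator = ngram_count(context + tuple(alternative))
    denominator = ngram_count(context)
    return numerator / denominator if denominator else 0.0


def disambiguate(left: List[str], alternatives: List[List[str]],
                 max_order: int = 2) -> List[str]:
    """Pick the most probable confusible alternative using synchronous back-off.

    All alternatives are scored with the same amount of left context, so the
    effective n-gram size grows with the token length of each alternative;
    only when every alternative scores zero is the context shortened for all
    of them at once.
    """
    for order in range(max_order, -1, -1):
        scored = [(score(left, alt, order), alt) for alt in alternatives]
        if any(s > 0 for s, _ in scored):
            return max(scored, key=lambda pair: pair[0])[1]
    return alternatives[0]  # no evidence at any order: keep the first candidate

In this sketch, a two-token alternative scored with the same left context as a one-token alternative implicitly uses a higher-order n-gram, which is the effect the varied-order metric is meant to capture.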
We discuss the use of suffix trees as a technique to store variable-length n-gram information efficiently.
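As a rough illustration of the storage idea, the sketch below uses a token-level trie over corpus suffixes as a simplified stand-in for a proper suffix tree: every n-gram up to max_order tokens is a path from the root, and its count is the number of corpus positions whose suffix passes through that node. All names here (SuffixTrie, build, ngram_count) are invented for the example.

from typing import Dict, List, Tuple


class SuffixTrie:
    def __init__(self) -> None:
        self.children: Dict[str, "SuffixTrie"] = {}
        self.count = 0  # corpus positions whose suffix reaches this node

    def insert(self, suffix: List[str], max_depth: int) -> None:
        node = self
        for token in suffix[:max_depth]:
            node = node.children.setdefault(token, SuffixTrie())
            node.count += 1

    def ngram_count(self, ngram: Tuple[str, ...]) -> int:
        node = self
        for token in ngram:
            child = node.children.get(token)
            if child is None:
                return 0
            node = child
        return node.count


def build(corpus: List[str], max_order: int = 5) -> SuffixTrie:
    """Index every suffix of the tokenised corpus, truncated to max_order tokens."""
    root = SuffixTrie()
    for i in range(len(corpus)):
        root.insert(corpus[i:], max_order)
    return root

For example, build("the cat sat on the mat".split()).ngram_count(("the",)) returns 2. A genuine suffix tree (or suffix array) additionally compresses the unary paths of such a trie and removes the need for a fixed max_order cut-off, which is what makes it attractive for variable-length n-gram queries.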