We generate bigram, trigram and other n-gram lists to the customer’s specifications. For typing prediction, trigrams perform much better than bigrams, and our corpora in many languages are large enough to yield a sufficient number of trigrams for this purpose.
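As a rough illustration of why trigrams suit typing prediction, the sketch below suggests the next word from the two preceding words. The function names and the toy frequencies are hypothetical and invented for this example; they are not taken from any delivered database.

```python
from collections import Counter, defaultdict

def build_trigram_index(trigram_freqs):
    """Map each two-word context to a Counter of possible next words."""
    index = defaultdict(Counter)
    for (w1, w2, w3), freq in trigram_freqs.items():
        index[(w1, w2)][w3] += freq
    return index

def predict_next(index, w1, w2, k=3):
    """Return up to k of the most frequent continuations of (w1, w2)."""
    # .get avoids creating an empty entry for an unseen context.
    candidates = index.get((w1, w2), Counter())
    return [word for word, _ in candidates.most_common(k)]

# Hypothetical toy frequencies, for illustration only.
toy_trigrams = {
    ("thank", "you", "very"): 50,
    ("thank", "you", "for"): 120,
    ("thank", "you", "so"): 80,
    ("see", "you", "soon"): 40,
}
index = build_trigram_index(toy_trigrams)
suggestions = predict_next(index, "thank", "you", k=2)
```

A bigram model would condition only on "you" and so could not tell "thank you" apart from "see you", which is one intuition for the difference in quality.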

All bigrams, trigrams and n-grams in a language

Our corpora are large enough to generate lists of all n-grams in use in a language. Such a language database can contain hundreds of millions of n-grams. The n-gram database can be filtered according to criteria specified by the customer and delivered as a download in many formats.

Typically, an n-gram database comes with frequency counts, but we can meet further requirements specified by the customer.
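A minimal sketch of what counting n-grams with frequencies and then filtering them can look like is given below. It assumes the text is already tokenized, and it uses a minimum-frequency threshold as one example of a filtering criterion. The function names and the sample sentence are hypothetical and do not describe the actual production pipeline.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every contiguous n-token sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def filter_by_min_freq(counts, min_freq):
    """Keep only the n-grams that occur at least min_freq times."""
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}

# Toy input, for illustration only; a real corpus is far larger.
tokens = "the cat sat on the mat and the cat ran".split()
bigrams = count_ngrams(tokens, 2)
frequent = filter_by_min_freq(bigrams, 2)
```

Changing `n` to 3 yields trigrams from the same function; other criteria (for example, excluding n-grams that contain punctuation) could be applied in the same way as the frequency filter.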

An example of an Arabic bigram list in XML format:

    <item>
      <str>عليه وسلم</str>
      <freq>60929</freq>
    </item>
    <item>
      <str>في هذا</str>
      <freq>56788</freq>
    </item>
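A list in this format could be read roughly as follows. The sample shows only `<item>` elements, so the sketch wraps the fragment in a hypothetical `<items>` root element; the root element of an actual delivered file may differ.

```python
import xml.etree.ElementTree as ET

# The two items from the sample above.
fragment = """
<item>
  <str>عليه وسلم</str>
  <freq>60929</freq>
</item>
<item>
  <str>في هذا</str>
  <freq>56788</freq>
</item>
"""

def parse_items(xml_fragment):
    """Return a list of (n-gram, frequency) pairs from <item> elements."""
    # Assumption: wrap in a made-up root so the fragment is well-formed XML.
    root = ET.fromstring("<items>" + xml_fragment + "</items>")
    return [
        (item.findtext("str"), int(item.findtext("freq")))
        for item in root.iter("item")
    ]

bigram_list = parse_items(fragment)
```

For a database with hundreds of millions of entries, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading the whole file into memory.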

Bigram, trigram & n-gram database in these languages

Bigram, trigram & n-gram databases for more languages can be developed on request.
