We generate bigram, trigram and other n-gram lists to the customer’s specifications. For typing prediction, trigrams perform much better than bigrams, and our corpora in many languages are large enough to yield a sufficient number of trigrams for this purpose.
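To illustrate what an n-gram list is and how trigram counts can drive typing prediction, here is a minimal Python sketch. The function names `count_ngrams` and `predict_next` and the toy sentence are assumptions made for illustration only; this is not a description of our production pipeline, which works on far larger corpora.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def predict_next(trigram_counts, w1, w2):
    """Return the word most often seen after (w1, w2), or None if unseen."""
    candidates = {t[2]: c for t, c in trigram_counts.items() if t[:2] == (w1, w2)}
    return max(candidates, key=candidates.get) if candidates else None

# Toy corpus (hypothetical); real lists are built from very large corpora.
tokens = "the cat sat on the mat and the cat sat down".split()
trigrams = count_ngrams(tokens, 3)

print(trigrams[("the", "cat", "sat")])      # how often this trigram occurs
print(predict_next(trigrams, "the", "cat"))  # suggested next word
```

With a bigram model the prediction would depend on the previous word alone; conditioning on the two previous words is what makes trigram suggestions more specific, provided the corpus is large enough to have seen them.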

All bigrams, trigrams and n-grams in a language

Our corpora are large enough to generate lists of all n-grams in use in a language. Such a database can contain hundreds of millions of n-grams. It can be filtered according to criteria specified by the customer and delivered as a download in many formats.

Typically, an n-gram database includes the frequency of each n-gram, but we can meet further requirements specified by the customer.
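As a rough illustration of filtering by frequency, the sketch below keeps only n-grams at or above a minimum count and optionally caps the list size. The function `filter_ngrams`, its parameters and the sample counts are hypothetical; actual filter criteria are agreed with each customer.

```python
def filter_ngrams(freqs, min_freq=1, max_items=None):
    """Keep n-grams with frequency >= min_freq, most frequent first.

    Ties are broken alphabetically so the output order is deterministic.
    """
    kept = [(ngram, f) for ngram, f in freqs.items() if f >= min_freq]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept if max_items is None else kept[:max_items]

# Made-up toy counts, for illustration only.
sample = {"of the": 120, "in the": 95, "the cat": 3, "cat sat": 1}

for ngram, freq in filter_ngrams(sample, min_freq=3, max_items=2):
    print(ngram, freq)
```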

An example of an Arabic bigram list in XML format:

<item>
  <str>عليه وسلم</str>
  <freq>60929</freq>
</item>
<item>
  <str>في هذا</str>
  <freq>56788</freq>
</item>
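A list in this shape can be read with a few lines of Python. The sketch below assumes the `<item>` elements look exactly like the excerpt above; because the excerpt shows no enclosing element, the code wraps it in a hypothetical `<ngrams>` root before parsing. A delivered file may use a different root element or carry additional fields.

```python
import xml.etree.ElementTree as ET

# The excerpt shown above, without its (unknown) enclosing element.
fragment = """
<item>
  <str>عليه وسلم</str>
  <freq>60929</freq>
</item>
<item>
  <str>في هذا</str>
  <freq>56788</freq>
</item>
"""

def parse_items(xml_fragment):
    """Return (n-gram, frequency) pairs from a run of <item> elements."""
    # "<ngrams>" is an assumed wrapper so the fragment is well-formed XML.
    root = ET.fromstring(f"<ngrams>{xml_fragment}</ngrams>")
    return [(item.findtext("str"), int(item.findtext("freq")))
            for item in root.iter("item")]

bigrams = parse_items(fragment)
for text, freq in bigrams:
    print(freq, text)
```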

Bigram, trigram & n-gram databases in these languages

Bigram, trigram and n-gram databases for additional languages can be developed on request.

Supported languages

Afrikaans
Albanian
Amharic
Arabic
Azerbaijani
Basque
Belarusian
Bengali
Bosnian
Bulgarian
Catalan
Chinese Simplified
Chinese Traditional
Croatian
Czech
Danish
Dutch
English
Estonian
Filipino
Finnish
French
Frisian
Georgian
German
Greek
Gujarati
Hausa (Boko)
Hebrew
Hindi
Hungarian
Icelandic
Igbo
Indonesian
Irish
Italian
Japanese
Kannada
Kazakh
Korean
Kyrgyz
Latin
Latvian
Lithuanian
Macedonian
Malayalam
Malay
Maltese
Maori
Mongolian
Nepali
N'Ko
Norwegian Bokmål
Norwegian
Norwegian Nynorsk
Oromo
Persian
Polish
Portuguese
Punjabi (Shahmukhi)
Romanian
Russian
Samoan
Scottish Gaelic
Serbian (Latin)
Serbian
Setswana
Slovak
Slovenian
Somali
Spanish
Swahili
Swedish
Tajik
Tamil
Tatar
Telugu
Thai
Tibetan
Tigrinya
Turkish
Turkmen
Ukrainian
Urdu
Uzbek
Vietnamese
Welsh
Yoruba