Bigram/n-gram databases and n-gram models

We provide high-quality bigram, trigram and other n-gram databases and n-gram models in many languages. The lists are generated from enormous collections of authentic text (text corpora) produced by real users of the language. Our corpora of major languages contain up to 40,000,000,000 words.

Data quality

Corpus size is not really an issue when generating a database of the 10,000 most frequent n-grams. The use of such a small language model is very limited, though. Any serious application needs a much larger database, typically one containing millions of n-grams.

An enormous text database (corpus) is required to ensure reliable frequency information even for rare n-grams. The only viable way to build corpora of billions of words is to download content from the web automatically. Lexical Computing developed a sophisticated procedure for collecting only linguistically valuable content from the web. A series of tools is used to focus on the right content and to perform deduplication and cleaning. This ensures that the statistics are not skewed. This blog post gives more details.

N-gram database size

We are able to generate frequency lists of millions of unique n-grams. The actual size of the n-gram model depends on the specifications. By default, we do not include any n-gram which appears fewer than 5 times in the corpus. Such n-grams are typically noise without any linguistic value. The client can specify different filtering options.
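The default frequency threshold can be illustrated with a minimal Python sketch. It is not the procedure used to build the databases: the function names `count_ngrams` and `filter_by_min_freq` and the toy token list are assumptions made for illustration only.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous run of n tokens in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def filter_by_min_freq(counts, min_freq=5):
    """Keep only the n-grams seen at least min_freq times."""
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}


# Toy corpus (invented): one phrase repeated 6 times, another repeated twice.
tokens = ("in the end " * 6 + "at the start " * 2).split()

bigrams = count_ngrams(tokens, 2)
kept = filter_by_min_freq(bigrams, min_freq=5)

for ngram, freq in sorted(kept.items(), key=lambda kv: -kv[1]):
    print(" ".join(ngram), freq)
```

Raising or lowering `min_freq` trades database size against coverage of rare n-grams.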

Enriched n-gram databases

We are also able to provide additional information such as POS tags, lemmas, probabilities of the next word, or any other statistics or morphological information.
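As an illustration of what a next-word probability means, the sketch below derives it from bigram frequencies by simple relative frequency. This is only one possible estimate and not necessarily the one delivered in an enriched database; the function name `next_word_probabilities` and the counts are invented for the example.

```python
from collections import Counter, defaultdict


def next_word_probabilities(bigram_freqs):
    """Estimate P(next | previous) as freq(previous, next) / sum of freq(previous, *)."""
    totals = Counter()
    for (prev, _nxt), freq in bigram_freqs.items():
        totals[prev] += freq

    probs = defaultdict(dict)
    for (prev, nxt), freq in bigram_freqs.items():
        probs[prev][nxt] = freq / totals[prev]
    return dict(probs)


# Invented counts, for illustration only.
bigram_freqs = {("in", "the"): 6, ("in", "a"): 2, ("the", "end"): 4}
probs = next_word_probabilities(bigram_freqs)

print(probs["in"])
```

Note that the denominator here covers only the bigrams present in the list, so any frequency filtering applied earlier affects the resulting probabilities.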

Sample n-gram model

The easiest way to obtain a sample is to register for a free trial account in Sketch Engine and use the n-gram tool to generate a list of n-grams. The advanced tab of the n-gram tool allows detailed specifications to be set.

Prices

We will provide a quotation based on the exact specifications and the intended use of the database.

Download

The database will be made available for download via a dedicated link within the agreed period of time. It normally takes a week or two to generate the data. Very complex databases can be computationally demanding and can take longer to produce.

An example of an Arabic bigram database in XML format:

    <item>
      <str>عليه وسلم</str>
      <freq>60929</freq>
    </item>
    <item>
      <str>في هذا</str>
      <freq>56788</freq>
    </item>
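A file in this format can be read with Python's standard library. The sketch below is a hedged example rather than an official loader: the sample shows only `<item>` elements, so the enclosing `<items>` root element and the function name `parse_bigram_items` are assumptions, and a delivered file may be structured differently.

```python
import xml.etree.ElementTree as ET


def parse_bigram_items(xml_fragment):
    """Return (n-gram string, frequency) pairs from a run of <item> elements."""
    # Assumption: the fragment has no root element, so wrap it in a
    # hypothetical <items> root to make it well-formed XML.
    root = ET.fromstring("<items>" + xml_fragment + "</items>")
    return [
        (item.findtext("str"), int(item.findtext("freq")))
        for item in root.iter("item")
    ]


sample = """
<item>
  <str>عليه وسلم</str>
  <freq>60929</freq>
</item>
<item>
  <str>في هذا</str>
  <freq>56788</freq>
</item>
"""

bigrams = parse_bigram_items(sample)
for text, freq in bigrams:
    print(freq, text)
```

For very large databases, a streaming parser such as `xml.etree.ElementTree.iterparse` would avoid loading the whole file into memory.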

Bigram, trigram & n-gram databases in these languages

Bigram, trigram and other n-gram databases in languages not listed below can be developed on request.

Supported languages

Afrikaans
Albanian
Amharic
Arabic
Azerbaijani
Basque
Belarusian
Bengali
Bosnian
Bulgarian
Catalan
Chinese Simplified
Chinese Traditional
Croatian
Czech
Danish
Dutch
English
Estonian
Filipino
Finnish
French
Frisian
Georgian
German
Greek
Gujarati
Hausa (Boko)
Hebrew
Hindi
Hungarian
Icelandic
Igbo
Indonesian
Irish
Italian
Japanese
Kannada
Kazakh
Korean
Kyrgyz
Latin
Latvian
Lithuanian
Macedonian
Malayalam
Malay
Maltese
Maori
Mongolian
Nepali
N'Ko
Norwegian Bokmål
Norwegian
Norwegian Nynorsk
Oromo
Persian
Polish
Portuguese
Punjabi (Shahmukhi)
Romanian
Russian
Samoan
Scottish Gaelic
Serbian (Latin)
Serbian
Setswana
Slovak
Slovenian
Somali
Spanish
Swahili
Swedish
Tajik
Tamil
Tatar
Telugu
Thai
Tibetan
Tigrinya
Turkish
Turkmen
Ukrainian
Urdu
Uzbek
Vietnamese
Welsh
Yoruba