English n-gram databases and n-gram models

We provide high-quality n-gram databases in English (and many other languages). Each n-gram model is generated from an enormous database of authentic text (a text corpus) produced by real users of English. Our largest English corpus contains texts with a total length of 60,000,000,000 (60 billion) words.
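As a rough illustration of what such a model contains, the Python sketch below counts bigrams in a tokenised text file. The file name and the simple whitespace tokenisation are assumptions for illustration only; the production pipeline works on far larger, carefully cleaned corpora.

```
# Minimal sketch: counting bigrams from a plain-text corpus.
# "corpus.txt" and whitespace tokenisation are illustrative assumptions.
from collections import Counter

def ngrams(tokens, n):
    """Yield consecutive n-grams as tuples of tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        tokens = line.split()             # naive tokenisation for illustration
        counts.update(ngrams(tokens, 2))  # bigrams; use n=3 for trigrams, etc.

for gram, freq in counts.most_common(10):
    print(" ".join(gram), freq)
```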

Data quality

Corpus size is not really an issue when generating an n-gram model of the 10,000 most frequent n-grams in English. The use of such a small database is very limited, though. Any serious application needs a much larger list, typically containing millions of n-grams.

An enormous text database (corpus) is required to ensure reliable frequencies even for rare and infrequently used n-grams. The only viable way of building corpora of billions of words is an automated procedure that downloads content from the web. Lexical Computing has developed a sophisticated procedure for collecting only linguistically valuable content from the web. A series of tools focuses the crawl on the right content and performs deduplication and cleaning, which ensures that the statistics are not skewed. This blog post gives more details.
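Deduplication is one of the cleaning steps mentioned above. The toy sketch below removes exact duplicate paragraphs by hashing their normalised text; the real tools are considerably more sophisticated (handling near-duplicates and boilerplate as well), so this only shows why the step matters for unskewed statistics.

```
# Toy illustration of one cleaning step: dropping exact duplicate
# paragraphs by hashing their whitespace- and case-normalised text.
import hashlib

def dedupe_paragraphs(paragraphs):
    seen = set()
    for para in paragraphs:
        key = hashlib.sha1(" ".join(para.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield para

docs = [
    "The cat sat on the mat.",
    "The cat  sat on the mat.",      # duplicate after normalisation
    "A completely different paragraph.",
]
print(list(dedupe_paragraphs(docs)))
```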

English n-gram model size

We are able to generate an n-gram database of millions of unique n-grams in English. The actual size depends on the specification. By default, we will not include any n-gram which appears fewer than 5 times in the corpus; such n-grams are typically noise without any linguistic value. The client can specify any filtering options.
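A minimal sketch of the default frequency threshold, assuming the n-gram counts are held in a dictionary or Counter as in the earlier sketch:

```
# Keep only n-grams seen at least 5 times (the default filter).
MIN_FREQ = 5

def filter_ngrams(counts, min_freq=MIN_FREQ):
    """Return only the n-grams whose corpus frequency meets the threshold."""
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}
```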

Enriched n-gram databases

We are also able to provide additional information such as POS tags, lemmas, probabilities of the next word, or any other statistical or morphological information.
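As one example of such enrichment, next-word probabilities can be estimated from bigram counts. The sketch below uses simple maximum-likelihood estimates without smoothing; the data structures and the example counts are illustrative assumptions, not the delivered format.

```
# Estimate P(w2 | w1) = count(w1 w2) / count(w1 *) from bigram counts.
from collections import Counter, defaultdict

def next_word_probabilities(bigram_counts):
    totals = Counter()
    for (w1, _w2), freq in bigram_counts.items():
        totals[w1] += freq
    probs = defaultdict(dict)
    for (w1, w2), freq in bigram_counts.items():
        probs[w1][w2] = freq / totals[w1]
    return probs

# Illustrative counts only.
bigram_counts = {("of", "the"): 120, ("of", "a"): 40, ("of", "course"): 40}
print(next_word_probabilities(bigram_counts)["of"])
# {'the': 0.6, 'a': 0.2, 'course': 0.2}
```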

Sample data

The easiest way to see sample data is to register a free trial account in Sketch Engine and use the n-gram tool to generate a list of n-grams. Use the advanced tab to specify detailed criteria and the output format.

Prices

We will provide a quotation based on the exact specifications and the intended use of the database.

N-gram database download

The database will be made available for download via a dedicated link within the agreed period of time. It normally takes a week or two to generate the data; very complex databases can be computationally demanding and can take longer to produce.

English bigram database

A sample of random items from the English bigram database, with part-of-speech tags and frequencies, is available below. The list can be supplied in various formats and enriched with statistical, morphological and other linguistic information.
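As an illustration only, the snippet below reads one plausible delivery format: a tab-separated file with the bigram, its part-of-speech tags and its frequency. The column layout and file name are assumptions; the actual format is agreed with each client.

```
# Read a hypothetical tab-separated bigram sample:
# bigram <TAB> POS tags <TAB> frequency
import csv

with open("english_bigrams_sample.tsv", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    for bigram, pos_tags, frequency in reader:
        print(bigram, pos_tags, int(frequency))
```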


English bigram model sample

Download a spreadsheet with a sample containing the last 100 bigrams in each thousand between ranks 1,000 and 100,000. The list is case sensitive. Lists with specific criteria and filtering options can be generated to your requirements.
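The sketch below shows how such a sample could be drawn from a frequency-ranked bigram list (the last 100 items of every block of 1,000 ranks, up to rank 100,000); the published spreadsheet may have been produced differently, so treat this only as an illustration of the selection scheme.

```
# Take the last 100 items of each block of 1,000 ranks, up to rank 100,000.
def sample_ranks(ranked_bigrams, block=1000, tail=100, limit=100_000):
    sample = []
    for rank, item in enumerate(ranked_bigrams[:limit], start=1):
        # keeps ranks 901-1000, 1901-2000, ..., 99901-100000
        if rank % block == 0 or rank % block > block - tail:
            sample.append((rank, item))
    return sample
```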