Word lists, n-gram models, lexicons and language databases

We use our multi-billion-word samples of authentic language, called text corpora, to generate tailor-made language databases and lexical data according to the customer’s requirements.

The size of the corpora is key to a truly representative sample of language. While it is easy to download databases of the top few thousand most frequent words in many languages, we can provide lists of millions of items. Such a list can be regarded as a complete lexicon of a language: for many languages, our corpora are large enough to generate a list of all the words in that language.

Even a small corpus will contain enough examples of the most frequent words, but size really matters with less frequent words or subject-specific vocabulary. A huge corpus is a must when information about such words is needed, so that a reasonable number of examples can be found and the analysis produces meaningful language data. Corpus size is also critical for the quality of collocations and thesaurus entries.

The language databases we supply range from simple word frequency lists and bigram or n-gram models to complex lexical data combining any of the language data types we offer.
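In principle, a word frequency list and an n-gram list are both counts over a tokenised corpus. The following minimal Python sketch illustrates the idea with a toy whitespace tokeniser; a production pipeline would of course use proper tokenisation, lemmatisation and a corpus far larger than one sentence.

```python
from collections import Counter

def ngram_counts(text, n=1):
    """Count n-grams over a naive lowercase, whitespace tokenisation.

    This toy tokeniser is for illustration only; real corpus
    processing handles punctuation, casing and lemmatisation.
    """
    tokens = text.lower().split()
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

text = "the cat sat on the mat and the cat slept"
unigrams = ngram_counts(text, 1)   # word frequency list
bigrams = ngram_counts(text, 2)    # bigram list
print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

With `n=1` this yields a word frequency list, with `n=2` a bigram list, and so on for higher-order n-grams.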

A complex lexical database can consist of base forms (lemmas) and

  • part of speech
  • grammar labels
  • all word forms
  • thesaurus (synonyms and similar words)
  • good dictionary examples

and any other statistical, morphological or linguistic information derived from the corpus.
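One entry in such a combined lexical database could be sketched as a simple record; the field names below are illustrative assumptions, not a fixed delivery schema.

```python
from dataclasses import dataclass

@dataclass
class LexicalEntry:
    # Illustrative fields only; actual deliveries follow the
    # customer's agreed schema and format.
    lemma: str            # base form
    pos: str              # part of speech
    grammar_labels: list  # e.g. transitivity, countability
    word_forms: list      # all inflected forms
    synonyms: list        # thesaurus data
    examples: list        # good dictionary examples

entry = LexicalEntry(
    lemma="run",
    pos="verb",
    grammar_labels=["intransitive", "transitive"],
    word_forms=["run", "runs", "ran", "running"],
    synonyms=["sprint", "jog"],
    examples=["She runs every morning."],
)
print(entry.lemma, entry.word_forms)
```

Any further statistical or morphological attributes derived from the corpus would simply be additional fields on such a record.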

Data samples

Word frequency lists: English, Spanish, French, Arabic, Russian, Portuguese, Hindi. Bigram databases: English, Spanish, German, Russian.

Synonym database

We can deliver a synonym database (thesaurus) automatically calculated from a large text corpus, for use in search solutions and other software.
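The core idea behind an automatically calculated thesaurus is distributional: words that occur in similar contexts tend to have similar meanings. The sketch below shows one simplified version of this, comparing co-occurrence vectors with cosine similarity; it is a toy illustration, not the actual production method, which relies on much richer collocation statistics.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build a context co-occurrence vector for each word,
    counting neighbours within +/- window positions."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        tokens = sent.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[k] * b[k] for k in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

sentences = [
    "the cat chased the mouse",
    "the dog chased the mouse",
    "the cat ate fish",
    "the dog ate fish",
]
vecs = cooccurrence_vectors(sentences)
# "cat" and "dog" share identical contexts here, so they score
# as close synonym candidates.
print(round(cosine(vecs["cat"], vecs["dog"]), 2))
```

Run over a multi-billion-word corpus instead of four toy sentences, ranking every word pair this way yields the kind of "synonyms and similar words" data listed above.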

Good Dictionary Examples

We use the GDEX technology to automatically identify sentences useful as example sentences for dictionaries and language teaching materials.
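GDEX ranks candidate sentences by how suitable they are as dictionary examples. The toy scorer below mimics the flavour of that ranking with a few simple heuristics (mid-range length, frequent vocabulary, full-sentence form); the thresholds and weights are invented for illustration, and the real GDEX technology uses a much richer set of features.

```python
def gdex_score(sentence, common_words, min_len=6, max_len=20):
    """Toy GDEX-style score in [0, 1]: favour mid-length sentences
    built from frequent words that look like complete sentences.
    Weights and thresholds are illustrative assumptions."""
    tokens = sentence.rstrip(".").split()
    if not tokens:
        return 0.0
    score = 0.0
    if min_len <= len(tokens) <= max_len:
        score += 0.5  # comfortable length for a dictionary example
    frequent = sum(t.lower() in common_words for t in tokens)
    score += 0.4 * frequent / len(tokens)  # avoid rare vocabulary
    if sentence[:1].isupper() and sentence.endswith("."):
        score += 0.1  # looks like a complete sentence
    return score

common = {"the", "a", "cat", "sat", "on", "mat", "and",
          "dog", "was", "there"}
good = "The cat sat on the mat and the dog was there."
bad = "cat mat"
print(gdex_score(good, common) > gdex_score(bad, common))
```

The highest-scoring sentences become candidates for dictionaries and language teaching materials; in practice a lexicographer still makes the final selection.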