We use our text corpora, multi-billion-word samples of authentic language, to generate tailor-made language databases and lexical data according to each customer's requirements.
The size of the corpora is key to obtaining a truly representative sample of language. While databases of the few thousand most frequent words are easy to download for many languages, we can provide lists of millions of items. For many languages, our corpora are large enough to generate a list of all the words in the language, which can be regarded as a complete database of its lexicon.
Even a small corpus will contain enough examples of the most frequent words, but size really matters for less frequent words and subject-specific vocabulary. A huge corpus is a must when information about such words is needed, so that a reasonable number of examples can be found and the analysis produces meaningful language data. Corpus size is also critical for the quality of collocations and thesaurus entries.
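To illustrate why size matters for collocations, here is a minimal sketch using pointwise mutual information (PMI), one common word-association measure; it is an illustrative choice, not necessarily the measure used in our pipeline. The score is computed from raw counts, so for rare words, where only a handful of co-occurrences exist even in a large corpus, a single extra hit can swing the result dramatically.

```python
import math

def pmi(pair_count: int, w1_count: int, w2_count: int, corpus_size: int) -> float:
    """Pointwise mutual information of a word pair.

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities
    estimated from raw corpus counts.
    """
    p_pair = pair_count / corpus_size
    p_w1 = w1_count / corpus_size
    p_w2 = w2_count / corpus_size
    return math.log2(p_pair / (p_w1 * p_w2))

# With frequent words the estimate is stable; with counts of 1 or 2
# it is dominated by chance, which is why rare-word collocations need
# a much larger corpus before the scores become meaningful.
print(pmi(10, 100, 100, 10_000))
```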
The language databases we supply range from simple word frequency lists, bigram lists or n-gram lists to complex lexical data combining any of the language data types we offer.
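The simplest of these outputs can be sketched as follows, assuming only naive whitespace tokenisation; a production pipeline would add language-specific tokenisation, lemmatisation and part-of-speech tagging on top of this.

```python
from collections import Counter

def frequency_lists(text: str):
    """Build a word frequency list and a bigram frequency list.

    Tokenisation here is a naive lowercase split on whitespace,
    purely for illustration.
    """
    tokens = text.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

unigrams, bigrams = frequency_lists("the cat sat on the mat the cat slept")
print(unigrams.most_common(2))  # [('the', 3), ('cat', 2)]
print(bigrams.most_common(1))   # [(('the', 'cat'), 2)]
```

Higher-order n-gram lists follow the same pattern by zipping the token sequence with longer offsets.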