German n-gram databases and n-gram models
We can supply high-quality databases of bigrams, trigrams and larger n-grams, as well as n-gram models, in German (and in many other languages). The lists are generated from a very large database of authentic texts (text corpora) produced by users of German. Our largest German corpus contains more than 16,000,000,000 words.
N-gram database quality
A relatively small collection of texts is enough to generate the 10,000 most frequent German n-grams because such n-grams appear frequently enough in any text.
However, a very large text database (corpus) is required to ensure reliable frequency information for rare n-grams as well. The only viable way to build corpora of billions of words is to automate the downloading of web content. Lexical Computing developed an advanced process for collecting only linguistically valuable texts from the web. A series of tools is used to gather the right content and to carry out deduplication and cleaning. This ensures that the statistics are not affected by content that is found in excessive quantities on the web but is not found as frequently in real life. Read this blog post for details about the procedure.
N-gram model size
We are able to generate German n-gram databases of millions of unique n-grams. The actual size depends on the specifications. By default, we exclude n-grams found fewer than 5 times in the corpus. These n-grams are typically only noise without much linguistic value. The customer can specify a multitude of filtering options.
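The default frequency threshold can be illustrated with a short Python sketch. It is only an illustration and not the pipeline used to produce the databases. The function names `count_ngrams` and `filter_by_min_frequency` and the toy token list are assumptions made for this example.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def filter_by_min_frequency(counts, min_freq=5):
    """Keep only n-grams found at least min_freq times."""
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}


# Toy "corpus" (hypothetical): a real corpus holds billions of tokens.
tokens = ("das ist gut " * 5 + "das ist ein Test").split()

bigrams = count_ngrams(tokens, 2)
kept = filter_by_min_frequency(bigrams, min_freq=5)

for ngram, freq in sorted(kept.items(), key=lambda item: -item[1]):
    print(" ".join(ngram), freq)
```

In this toy input, the bigrams "ist ein" and "ein Test" occur only once and are dropped, while the three bigrams that occur five or more times are kept.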
Additional linguistic data can be added such as POS tags, lemmas, probabilities of the next word, or any other statistical or morphological information.
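As a rough illustration of what a "probability of the next word" can mean, the sketch below derives a simple relative-frequency estimate from bigram counts: the count of a bigram divided by the total count of all bigrams starting with the same first word. The estimation method, the function name `next_word_probabilities` and the counts are assumptions for this example only. The figures are invented, and the probabilities delivered in a database may be computed differently.

```python
from collections import Counter, defaultdict


def next_word_probabilities(bigram_counts):
    """Estimate P(second | first) as count(first, second) / count(first, *)."""
    totals = Counter()
    for (first, _second), freq in bigram_counts.items():
        totals[first] += freq

    probs = defaultdict(dict)
    for (first, second), freq in bigram_counts.items():
        probs[first][second] = freq / totals[first]
    return dict(probs)


# Invented counts, for illustration only.
bigram_counts = {
    ("guten", "Morgen"): 30,
    ("guten", "Tag"): 60,
    ("guten", "Abend"): 10,
}

probs = next_word_probabilities(bigram_counts)
print(probs["guten"])
```

With these invented counts, the probabilities of the words following "guten" sum to 1, and "Tag" receives the largest share.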
N-gram database sample
The easiest way to obtain a sample is to register for a free trial account in Sketch Engine and use the n-gram tool to generate a list of n-grams to the required specifications. The advanced tab of the n-gram tool allows detailed specifications to be set.
N-gram database prices
We will provide a quotation based on the exact specifications and the intended use of the database.
N-gram database download
The n-gram database will be made available for download via a dedicated link within the agreed period of time. It normally takes a week or two to generate the data. Very complex databases can be computationally demanding and can take longer to produce.
A random sample of German bigrams of word forms with part-of-speech tags. We offer the list in various formats, optionally enriched with statistical, morphological and other linguistic information.
German bigram model sample
Download a spreadsheet with a sample of German bigrams ranked by frequency: the last 100 bigrams in each thousand between ranks 1,000 and 100,000. Lists with specific criteria and filtering options can be generated to your requirements.