French word frequency lists
We provide high-quality word frequency lists for French (and many other languages). The lists are generated from very large databases of authentic text (text corpora) produced by real users of French. Our largest French corpus contains 9,000,000,000 words.
A relatively small corpus is sufficient to generate a list of the 2,000, 3,000 or 5,000 most frequent French words, because such words appear often enough in any text that a correct frequency ranking can be obtained from a small sample.
However, an enormous text database (corpus) is required to obtain reliable frequency information for rare and infrequently used words. The only viable way to build corpora of billions of words is to download content from the web automatically. Lexical Computing developed a sophisticated procedure for collecting only linguistically valuable content from the web. A series of tools is used to focus on the right content and to perform deduplication and cleaning, which ensures that the statistics are not skewed. This blog post gives more details.
The French word lists we are able to generate can reach a size of millions of unique words. The actual size depends on the specifications. By default, we do not include any word which appears fewer than 5 times in the corpus. Such words are typically noise without any linguistic value. The client can specify a multitude of filtering options.
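To illustrate the default threshold described above, here is a minimal sketch of a frequency list that drops words occurring fewer than 5 times. It is not the production pipeline: the function name `frequency_list`, the `min_freq` parameter and the naive letter-based tokenizer are assumptions made for the example only.

```python
import re
from collections import Counter

def frequency_list(texts, min_freq=5):
    """Return (word, count) pairs, most frequent first, dropping rare words."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer (an assumption): lowercase runs of Unicode letters.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    # Keep only words that reach the minimum frequency.
    kept = [(word, n) for word, n in counts.items() if n >= min_freq]
    # Sort by descending count, then alphabetically to make ties deterministic.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept

sample = ["le chat dort"] * 5 + ["le chien dort"] * 2
ranked = frequency_list(sample, min_freq=5)
print(ranked)
```

In this toy sample, "chien" occurs only twice and is therefore left out of the list, while "chat" just reaches the threshold and is kept.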
Enriched frequency wordlists
We are also able to provide additional information such as POS tags, lemmas, probabilities of the next word, or any other statistics or morphological information.
We can also provide additional data such as collocations, example sentences, synonyms, spelling variants and other types of linguistic data.
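As an illustration of one of the statistics mentioned above, the probability of the next word can be estimated from simple bigram counts. The sketch below is an assumption about how such a figure might be computed, not a description of the actual tooling; `next_word_probabilities` is a hypothetical helper and the input is assumed to be already tokenized.

```python
from collections import Counter, defaultdict

def next_word_probabilities(tokens):
    """Estimate P(next word | current word) from adjacent token pairs."""
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    # Normalize each word's follower counts so they sum to 1.
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in following.items()
    }

tokens = "le chat dort le chien dort le chat mange".split()
probs = next_word_probabilities(tokens)
print(probs["le"])
```

With real data, such estimates only become reliable for words that occur often enough, which is one more reason a large corpus matters.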
The frequency list can be generated from the whole corpus or only from its parts. Each document in the corpus carries information about the top-level domain (TLD) from which it was downloaded, for example .fr, .ca or .dz. This information can be used to generate frequency lists of regional varieties of French.
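The idea of regional lists can be sketched as follows: group documents by the top-level domain of their source URL and count words separately for each group. The helper names `tld_of` and `counts_by_tld`, the `(url, text)` document format and the example URLs are hypothetical and serve only to illustrate the principle.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def tld_of(url):
    """Return the top-level domain of a URL, e.g. '.fr' (empty string if none)."""
    host = urlparse(url).hostname or ""
    return "." + host.rsplit(".", 1)[-1] if "." in host else ""

def counts_by_tld(documents):
    """Build one word counter per TLD from (url, text) pairs."""
    per_tld = defaultdict(Counter)
    for url, text in documents:
        # Whitespace tokenization is a simplification for this example.
        per_tld[tld_of(url)].update(text.lower().split())
    return per_tld

docs = [
    ("https://example.fr/a", "bonjour le monde"),
    ("https://example.ca/b", "bonjour le char"),
    ("https://example.fr/c", "le monde"),
]
regional = counts_by_tld(docs)
print(regional[".fr"].most_common(2))
```

Comparing the counters for different TLDs then shows which words are characteristic of one regional variety.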
The easiest way to obtain a wordlist is to register for a free trial account in Sketch Engine and use the wordlist tool to generate one. The advanced tab of the wordlist tool allows detailed specifications to be set.
We will provide a quotation based on the exact specifications and the intended use of the wordlist.
The database will be made available for you to download from a dedicated link within the agreed period of time. It normally takes a week or two to generate the data. Very complex wordlists can be computationally demanding and can take longer to produce.