Portuguese word frequency lists
We offer large word frequency lists of Portuguese (and many other languages). The lists (sometimes also called lexicons or dictionaries) are computed from gigantic databases of authentic text (text corpora) produced by Portuguese speakers. Our largest Portuguese corpus is made up of texts with a total length of 8,000,000,000 words.
A relatively small amount of text is sufficient to generate a list of the 2,000, 3,000 or 5,000 most frequent Portuguese words, because such words appear with high frequency in any text.
However, an enormous text database (corpus) is required to ensure reliable word frequency information even for rare and infrequently used words. The only reasonable way to compile multi-billion-word corpora is to download content from the web automatically. Lexical Computing developed a sophisticated procedure for collecting only linguistically valuable content from the web. A series of tools is used to focus on the right content and to perform deduplication and cleaning. This ensures that the statistics are not skewed by content that is overrepresented on the web but rare in real life. This blog post gives more details.
We are able to generate word frequency lists of millions of unique words in Portuguese. The actual size depends on the specifications. By default, we will not include words with fewer than 5 occurrences in the corpus. Such words are often noise without much linguistic value. The client can specify any filtering options.
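The default threshold described above can be illustrated with a minimal Python sketch. It is not the procedure used to build our lists: the function name `frequency_list`, the simple `\w+` tokenisation and the lowercasing are assumptions made only for this example.

```python
import re
from collections import Counter


def frequency_list(texts, min_count=5):
    """Count word tokens and drop words with fewer than min_count occurrences.

    Hypothetical helper for illustration; real corpora need proper
    tokenisation, deduplication and cleaning first.
    """
    counts = Counter()
    for text in texts:
        # Naive tokenisation: runs of word characters, lowercased
        counts.update(re.findall(r"\w+", text.lower()))
    # Keep only words at or above the threshold, most frequent first,
    # ties broken alphabetically
    return sorted(
        ((word, n) for word, n in counts.items() if n >= min_count),
        key=lambda item: (-item[1], item[0]),
    )
```

Calling `frequency_list(texts, min_count=1)` keeps every word, which shows how a different filtering option changes the size of the resulting list.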
Enriched frequency wordlists
The lexicons we produce can contain additional information such as POS tags, lemmas, probabilities of the next word, or any other statistics or morphological information.
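As an illustration of one such statistic, the probability of the next word can be estimated from counts of adjacent word pairs. The sketch below is a simplified, assumed approach (plain relative frequencies without smoothing), and `next_word_probabilities` is a hypothetical name, not part of any product.

```python
from collections import Counter, defaultdict


def next_word_probabilities(tokens):
    """Estimate P(next word | current word) from adjacent token pairs.

    Illustrative only: unsmoothed relative frequencies over one token list.
    """
    followers = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        followers[current][following] += 1
    return {
        word: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for word, nexts in followers.items()
    }
```

For the tokens of "o gato e o rato e o gato", the word "o" is followed by "gato" twice and by "rato" once, so the estimates for "gato" and "rato" after "o" are two thirds and one third.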
The frequency list can be generated from the whole corpus or only from its parts. Each document in the corpus carries information about the top-level domain (TLD) from which it was downloaded, for example .pt, .br or .ao. This information can be used to generate frequency lists of regional varieties of Portuguese.
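A minimal sketch of the idea, assuming each document is available as a (URL, text) pair: the TLD is taken from the URL's host name and a separate count is kept per TLD. The helper names `tld_of` and `frequency_by_tld` and the naive tokenisation are assumptions for illustration only.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse


def tld_of(url):
    """Return the top-level domain of a URL, e.g. '.br' (empty if none)."""
    host = urlparse(url).hostname or ""
    return "." + host.rsplit(".", 1)[-1] if "." in host else ""


def frequency_by_tld(documents):
    """Build one word counter per TLD from (url, text) pairs."""
    per_tld = defaultdict(Counter)
    for url, text in documents:
        # Naive tokenisation: runs of word characters, lowercased
        per_tld[tld_of(url)].update(re.findall(r"\w+", text.lower()))
    return per_tld
```

Comparing the counters for `.pt` and `.br` would then show, for example, which words are typical of European or Brazilian Portuguese.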
The easiest way to obtain a wordlist is to register for a free trial account in Sketch Engine and use the wordlist tool. The advanced tab of the wordlist tool allows detailed specifications to be set.
We will provide a quotation based on the exact specifications of the lexicon and its intended use.
The list will be made available for download via a dedicated link within the agreed period of time. It normally takes a week or two to generate the data. Very complex wordlists can be computationally demanding and can take longer to produce.