Portuguese word frequency lists
We offer large word frequency lists of Portuguese (and many other languages). The lists (sometimes also called lexicons or dictionaries) are computed from a very large database of authentic text (text corpora) produced by Portuguese speakers. Our largest Portuguese corpus consists of texts with a total length of 8,000,000,000 words.
A relatively small amount of text is sufficient to generate a list of the 2,000, 3,000 or 5,000 most frequent Portuguese words, because such words appear with high frequency in any text.
However, an enormous text database (corpus) is required to ensure reliable frequency information for rare and infrequently used words as well. The only reasonable way to compile multi-billion-word corpora is to download content from the web automatically. Lexical Computing developed a sophisticated procedure for collecting only linguistically valuable content from the web. A series of tools is used to focus on the right content and to perform deduplication and cleaning. This ensures that the statistics are not skewed by content that is overrepresented on the web but rare in real life. This blog post gives more details.
We are able to generate word frequency lists of millions of unique words in Portuguese. The wordlist size depends on the customer’s requirements. By default, we do not include words with fewer than 5 occurrences in the corpus, because such words are often noise without much linguistic value. The client can specify different filtering options.
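The default threshold described above can be illustrated with a minimal Python sketch. It is not Lexical Computing's actual pipeline: the function name `frequency_list`, the `min_count` parameter and the simple `\w+` tokenization are assumptions made for illustration only (a production tokenizer would handle hyphenated and contracted Portuguese forms more carefully).

```python
import re
from collections import Counter


def frequency_list(texts, min_count=5):
    """Return (word, count) pairs, most frequent first.

    Hypothetical helper: counts case-sensitive word forms and drops
    every form that occurs fewer than `min_count` times.
    """
    counts = Counter()
    for text in texts:
        # Assumption: a word is a run of Unicode letters, digits or underscores.
        counts.update(re.findall(r"\w+", text))
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Sort by descending frequency, then alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept
```

Calling `frequency_list(texts, min_count=1)` would keep every word form, which corresponds to a client asking for no frequency filtering at all.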
Enriched frequency wordlists
The frequency list can be generated from the whole corpus or only from selected parts of it. Each document in the corpus carries information about the top-level domain (TLD) from which it was downloaded, for example .pt, .br or .ao. This information can be used to generate frequency lists of regional varieties of Portuguese.
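As a rough sketch of how TLD metadata could drive regional lists, the Python below groups word counts by the top-level domain of each document's source URL. The input shape (pairs of URL and text) and the names `tld_of` and `counts_by_tld` are assumptions for illustration; the real corpus format is not described here.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def tld_of(url):
    """Return the top-level domain of a URL, e.g. '.pt' (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    if "." not in host:
        return ""
    # The TLD is the last dot-separated label of the host name.
    return "." + host.rsplit(".", 1)[-1]


def counts_by_tld(documents):
    """Build one word counter per TLD from (url, text) pairs."""
    per_tld = defaultdict(Counter)
    for url, text in documents:
        # Assumption: simple \w+ tokenization, case-sensitive.
        per_tld[tld_of(url)].update(re.findall(r"\w+", text))
    return per_tld
```

With such a grouping, `counts_by_tld(docs)[".br"]` would hold counts from Brazilian sites only and `counts_by_tld(docs)[".pt"]` counts from Portuguese ones, so the two lists can be compared directly.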
We will provide a quotation based on the exact specifications of the lexicon and its intended use.
The list will be made available for download via a dedicated link within the agreed period of time. It normally takes a week or two to generate the data. Complex wordlists can take longer to produce.
Portuguese word frequency list sample
Download a spreadsheet with a sample of the last 100 words in each thousand between 1,000 and 100,000. The list is case-sensitive. Lists with specific criteria and filtering options can be generated to your requirements.
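One plausible reading of this sampling scheme is that the sample contains ranks 901 to 1,000, then 1,901 to 2,000, and so on up to 99,901 to 100,000. The Python sketch below implements that reading; the interpretation itself, the function names and the parameters are assumptions, not a description of how the spreadsheet was actually produced.

```python
def sample_ranks(first_thousand=1, last_thousand=100, block=1000, tail=100):
    """Yield the last `tail` ranks of each block of `block` ranks (1-based)."""
    for k in range(first_thousand, last_thousand + 1):
        end = k * block
        # For k = 1 this yields ranks 901..1000, for k = 2 ranks 1901..2000, etc.
        yield from range(end - tail + 1, end + 1)


def sample(ranked_words):
    """Pick the sampled entries from a list where ranked_words[0] has rank 1."""
    return [
        (rank, ranked_words[rank - 1])
        for rank in sample_ranks()
        if rank <= len(ranked_words)
    ]
```

Under this reading the full sample would contain 100 blocks of 100 words, which is 10,000 entries in total.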