Building large text corpora
Lexical Computing has long experience in building large, high-quality text corpora from the web. We have both the expertise and the tools needed to crawl large amounts of text data and process them into a clean text corpus.
We specialize in building and processing corpora of billions of words. Our largest corpora reach the 73-billion-word mark.
On-demand corpus building
We can develop language data and word databases for major languages as well as for languages whose resources are scarce or non-existent. In the past, we have produced dozens of corpora to customers' specifications and processed them into lexical databases, word lists, n-gram lists and other types of data.