Lexical Computing has a long-standing cooperation with the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, Czech Republic. Together, we have developed a number of open-source NLP tools, which are available for download. The tools are also integrated into the Sketch Engine corpus query and management system, where they are applied to data automatically, so even users without the necessary technical knowledge can benefit from them.
NLP tools for non-technical users
We have integrated all the necessary NLP tools into our flagship product, Sketch Engine. Users can build and analyse large text corpora without any technical knowledge and without installing or configuring any tools.
Spiderling is a web spider designed for linguistic applications. It crawls text-rich parts of the web and collects data that are suitable for inclusion in text corpora. It is a key tool for our corpus building projects.
Onion (ONe Instance ONly) is designed for deduplicating large text collections (corpora) by measuring the similarity of paragraphs or whole documents. Duplicate texts are removed according to a similarity threshold set by the user.
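To illustrate the general idea of threshold-based deduplication, here is a minimal Python sketch. It is not Onion's actual code or interface: the function names, the use of word trigrams as the similarity measure, and the default threshold are all assumptions made for this example. A paragraph is dropped when the share of its word n-grams already seen in earlier kept paragraphs exceeds the threshold.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph unless too many of its n-grams were seen before.

    Hypothetical helper for illustration only; `n` and `threshold`
    are assumed parameters, not Onion's real options.
    """
    seen = set()   # n-grams from paragraphs kept so far
    kept = []
    for para in paragraphs:
        grams = ngrams(para.split(), n)
        if not grams:
            # Paragraph shorter than n tokens: nothing to compare, keep it.
            kept.append(para)
            continue
        duplicate_share = len(grams & seen) / len(grams)
        if duplicate_share <= threshold:
            kept.append(para)
            seen |= grams
    return kept


if __name__ == "__main__":
    texts = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
        "a completely different sentence about corpora",
    ]
    for para in deduplicate(texts):
        print(para)
```

With the default threshold, the second paragraph is treated as a near-duplicate of the first and dropped. Raising the threshold makes the filter more permissive, so fewer texts are removed.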
Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (vertical format), while preserving metadata in XML-like tags.
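The vertical format can be illustrated with a small Python sketch. This is not Unitok itself and does not reproduce its language-specific rules: the regular expression and the `to_vertical` function are assumptions made for this example. It emits one token per line and keeps XML-like tags intact as single lines.

```python
import re

# Assumed, simplified token pattern (not Unitok's rules):
#   1. an XML-like tag, kept whole
#   2. a word, optionally with internal hyphens or apostrophes
#   3. any single non-space, non-word character (punctuation)
TOKEN_RE = re.compile(r"<[^<>]+>|\w+(?:[-']\w+)*|[^\w\s]")


def to_vertical(text):
    """Return the text as newline-separated tokens (vertical format)."""
    return "\n".join(TOKEN_RE.findall(text))


if __name__ == "__main__":
    print(to_vertical('<doc id="1">Hello, world!</doc>'))
```

For the sample input, the output places `<doc id="1">`, `Hello`, `,`, `world`, `!` and `</doc>` on six separate lines. A production tokeniser also needs per-language handling of abbreviations, clitics, URLs and similar cases, which this sketch leaves out.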