Unitex/GramLab is an open source, multilingual corpus processing suite

Unitex/GramLab is a corpus processing suite for analyzing natural-language texts with linguistic resources and tools. These resources consist of language-processing dictionaries and grammars provided by contributors from several countries. The platform lets you store, manage, and apply these resources. It is open source, cross-platform, and modular, and it handles languages that use special writing systems (e.g. Arabic and many Asian languages). Several of its strengths stem from its resources: precision, completeness, and the ability to handle multi-word units through dictionaries and local grammars.

The dictionaries specify the simple and multi-word units of a language together with their lemmas and a set of grammatical, semantic, and inflectional codes. The availability of these dictionaries is a major advantage over the usual pattern-searching utilities: the information they contain can be used for searching and matching, so that large classes of words can be described with very simple patterns. The dictionaries are written in a specific formalism called DELA and are compiled internally into minimal finite-state automata. They were constructed by teams of linguists from different countries for several languages: French, English, Greek, Italian, Spanish, German, Arabic, Thai, Korean, Polish, Norwegian, Portuguese.
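To make the idea of dictionary-driven matching concrete, here is a minimal Python sketch. It assumes a simplified subset of the inflected-form (DELAF) line syntax, `form,lemma.POS+semantic:inflectional`, and ignores escaped characters and comments. The names `Entry`, `parse_delaf_line` and `matches` are hypothetical helpers written for this illustration and are not part of Unitex/GramLab; the sample entries are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    form: str            # inflected form as it appears in text
    lemma: str           # canonical form
    pos: str             # grammatical code, e.g. "N" or "V"
    semantic: tuple      # semantic codes, e.g. ("conc",)
    inflectional: tuple  # inflectional codes, e.g. ("p",) or ("P3s",)


def parse_delaf_line(line):
    """Parse one simplified DELAF-style line (no escapes, no comments)."""
    form, rest = line.strip().split(",", 1)
    lemma, codes = rest.split(".", 1)
    gram, *inflectional = codes.split(":")
    pos, *semantic = gram.split("+")
    # An empty lemma field means the lemma is identical to the form.
    return Entry(form, lemma or form, pos, tuple(semantic), tuple(inflectional))


def matches(entry, lemma=None, pos=None, infl=None):
    """Assumed matching rule: every constraint that is given must hold.

    An inflectional constraint holds if at least one inflectional code of
    the entry contains all of the requested characters.
    """
    if lemma is not None and entry.lemma != lemma:
        return False
    if pos is not None and entry.pos != pos:
        return False
    if infl is not None and not any(
        all(ch in code for ch in infl) for code in entry.inflectional
    ):
        return False
    return True


# Invented sample entries, including one multi-word unit.
SAMPLE = [
    "is,be.V:P3s",
    "are,be.V:P2s:P1p:P2p:P3p",
    "apple,.N+conc:s",
    "apples,apple.N+conc:p",
    "apple pies,apple pie.N+conc:p",
]
entries = [parse_delaf_line(line) for line in SAMPLE]

# One simple pattern covers every inflected form of a lemma.
forms_of_be = [e.form for e in entries if matches(e, lemma="be", pos="V")]
# Another covers every plural noun, simple or multi-word.
plural_nouns = [e.form for e in entries if matches(e, pos="N", infl="p")]
```

In this sketch a single constraint on the lemma or on the codes stands in for a whole class of surface forms, which is the advantage the paragraph above describes over plain string or regular-expression search.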

The grammars represent linguistic phenomena as recursive transition networks (RTNs), a formalism closely related to finite-state transducers (FSTs). Numerous studies have shown that automata are adequate for linguistic problems at all descriptive levels, from morphology and syntax to phonetics. Grammars created with Unitex/GramLab carry this approach further by using a formalism more powerful than finite-state automata. They are represented as graphs that the user can easily create and update in a visual integrated development environment, which is an essential part of the platform.
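The extra power of an RTN over a plain finite-state automaton comes from transitions that call another graph, possibly the graph itself. The following Python sketch illustrates this with a toy recognizer; the data layout (`GRAPHS`, with `initial`, `final` and `edges` keys) and the functions `end_positions` and `accepts` are assumptions made for this example and do not reflect how Unitex/GramLab stores or applies its grammars. The graph `S` accepts the sequences "a ... a b ... b" with equally many `a` and `b` tokens, a language that no finite-state automaton can recognize.

```python
# Each graph has an initial state, a set of final states, and edges.
# An edge is (kind, label, target): kind "token" consumes one input token
# equal to label; kind "call" runs the graph named label as a subroutine.
GRAPHS = {
    "S": {
        "initial": 0,
        "final": {3},
        "edges": {
            0: [("token", "a", 1)],
            1: [("token", "b", 3), ("call", "S", 2)],  # recursive call to S
            2: [("token", "b", 3)],
        },
    }
}


def end_positions(graphs, name, tokens, start):
    """Return every input position at which graph `name` can finish
    when it starts reading at position `start`.

    Assumption of this sketch: no graph calls itself before consuming a
    token (no left recursion), otherwise the recursion would not terminate.
    """
    graph = graphs[name]
    results = set()
    seen = set()
    stack = [(graph["initial"], start)]
    while stack:
        state, pos = stack.pop()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if state in graph["final"]:
            results.add(pos)
        for kind, label, target in graph["edges"].get(state, []):
            if kind == "token":
                if pos < len(tokens) and tokens[pos] == label:
                    stack.append((target, pos + 1))
            else:
                # Subgraph call: resume after each position it can reach.
                for after in end_positions(graphs, label, tokens, pos):
                    stack.append((target, after))
    return results


def accepts(graphs, name, tokens):
    """True if graph `name` can consume the whole token sequence."""
    return len(tokens) in end_positions(graphs, name, tokens, 0)
```

For instance, `accepts(GRAPHS, "S", list("aabb"))` succeeds because the inner `a b` pair is recognized by a nested call to `S`, while `list("aab")` is rejected. In a real grammar the tokens would be words or dictionary patterns rather than single letters.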

Technologies

  • Java
  • C++

Topics

2016