
1. INTRODUCTION AND BACKGROUND


  • 1.1. Methods and goal
  • 1.2. The treatment of texts
  • 1.3. A cumulative project for electronic lexicons

    1.0. Introduction

    Within the framework of the informal RELEX network of laboratories, a group of European teams in computational linguistics (French, German, Italian, Portuguese, Greek, Spanish) has been working in close cooperation for more than five years on the construction of electronic lexicons and grammars. Each team works on its national language, and all teams use IDENTICAL METHODS. At least once a year, they meet to compare their problems, present their results and adopt further lines of standardization.
     

    The results obtained so far are numerous and coherent. Dictionaries of significant size have been built for each language, and software that incorporates these dictionaries has been developed to process corpora. Prospects for constructing comparative grammars are quite promising. A very important feature of this work is that, at all levels, different teams or individuals have worked on the same items (dictionaries and grammars) and that their partial results have been merged without the slightest difficulty. The common methodology guarantees the cumulativity of the data. The first goal is to further enrich the dictionaries and grammars, and more specifically to develop aids for the detection of terminology in technical texts.

    The methodological position of the network members stems from the observation that commercial dictionaries and grammars are very far from having the coverage and content that provide basic insights into the nature of natural languages and that could be used in NLP. Since a very large number of words, and of constraints on their constructions, is not available, no large-scale automatic treatment of natural language can be envisioned in the near future. Fundamental research and computer applications make identical demands on the field: extremely detailed, formalized and complete descriptions. The fact that parallel formats of description have been adopted by the RELEX consortium, and that the emphasis is on linguistic rather than on computational properties, frees the computer representations from constraints that depend on particular applications.

    To a large extent, the exceptional homogeneity of this cooperative work reflects the unity of European languages. European languages have traditionally been the source of numerous comparative studies, monolingual (diachronic or synchronic) or multilingual. During the last century, specialists in these various domains have contributed to the description of a single culture, even if specialized subfields have progressively emerged.

    An interesting feature of European cooperative work is related to the classical comparative programme. Many studies have focused on differences between languages, but at a certain level of analysis, European languages are remarkably similar. However, a first step of analysis (mainly morphological) is needed to reduce the differences. It is interesting to note that different spelling habits entail important differences in the parsing procedures needed for each language. For example:

  • In German, compound words are written without separators, which is not the case in general for the other languages;
  • in Italian and Spanish, some pronouns are attached to certain verbal forms, extending the conjugation of verbs in an explosive way, whereas in the other languages the same pronouns are separated from the verb by a space (see the sketch after this list).
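
    To make this consequence concrete, the following is a minimal, purely illustrative sketch of the extra segmentation step that Italian or Spanish requires before dictionary lookup: attached pronouns are peeled off the verbal form. The clitic list, the minimal stem length and the example words are assumptions made for the illustration, not part of the RELEX resources; German compound words raise an analogous lookup problem.

        # Minimal sketch (illustrative data only): stripping attached pronouns from
        # an Italian verbal form so that the remaining stem can be looked up in a
        # dictionary of conjugated forms. The clitic list and the minimal stem
        # length are hypothetical simplifications.
        CLITICS = ["glielo", "gliela", "gli", "lo", "la", "le", "mi", "ti", "si", "ci", "vi", "ne"]

        def split_clitics(form, min_stem=3):
            """Peel attached pronouns off the end of a verbal form, longest match first."""
            pronouns = []
            changed = True
            while changed:
                changed = False
                for clitic in sorted(CLITICS, key=len, reverse=True):
                    if form.endswith(clitic) and len(form) - len(clitic) >= min_stem:
                        pronouns.insert(0, clitic)
                        form = form[:-len(clitic)]
                        changed = True
                        break
            return form, pronouns

        print(split_clitics("mandarglielo"))   # ('mandar', ['glielo'])  -- 'to send it to him'
        print(split_clitics("lavarsi"))        # ('lavar', ['si'])       -- 'to wash oneself'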



    1.1. Methods and goal

     

    Modern combinatorial methods in linguistics, that is, distributional and transformational methods, can be applied to describe languages systematically and to compare them. By formal systematic description, we mean studies that aim at representing all the constraints governing the combinations of all the units that constitute sentences. Representations should bear on all components of a natural language, and in particular on its lexicon and its grammar. One essential condition on the way rules are written is that their application be entirely mechanical, namely based on identifiable clues, not on human intuition. Besides its customary function, the lexicon has a complementary role with respect to the grammar: it lists all the exceptions to the rules. The degree of formalization of grammars and lexicons should be such that they can be incorporated into sentence parsers and generators.

    Coverage of words and rules should tend towards exhaustivity, both from the theoretical and the applied viewpoints:

  • models to be built should reflect the functioning of the human linguistic apparatus,
  • parsers should not encounter unknown words and structures when they are applied to varied texts.

    However, it is clear that exhaustivity cannot be completely reached: words are constantly created, both in the ordinary and in the technical vocabulary. But for the stable and most important part of a language, it is necessary to carry out an enumeration of words and rules that is as complete as possible. Such a programme requires both a quantitative and a qualitative change in the activities of lexicographers and grammarians.

    So far, the two descriptions, that of the lexicon and that of the grammar, have rarely been carried out simultaneously, although proper methods of representation have been available for the last twenty years.

    High-quality processing entails full coverage of languages, that is, low rates of failure due to a simple lack of linguistic information, such as the absence of a word from a dictionary; this in turn means handling large amounts of data. Simple and compound utterances have to be compiled in large numbers, and the syntactic structures that accommodate these large lexicons are also numerous. All this information must be coherently structured. In order to be accessible, the data must be represented in a coherent form and then translated into computer form. Tools have to be devised for classifying, comparing and retrieving the data. In fact, a rather powerful computer technology is required to handle lexicons and grammars of any significant size.

    1.2. The treatment of texts

     

    Studies have started from the existing lexical material elaborated for each language, but once dictionaries, textbooks and theoretical studies have been exploited, this material falls short of providing the kind of coverage needed for computational applications. Systematic application of productive rules (e.g. the production of adjectives in -able derived from transitive verbs, see 3.1) is a way of completing dictionaries, as sketched below, but new words and new uses of existing words can only be determined by processing large amounts of properly selected texts.
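
    As an illustration of such a productive rule, the short sketch below generates candidate -able adjectives from a list of transitive verbs. The verb list and the spelling adjustment are hypothetical simplifications, not the rule as stated in 3.1; every candidate would still have to be validated against corpora and by a linguist before entering the electronic dictionary.

        # Minimal sketch (hypothetical data): applying a productive derivational
        # rule to propose candidate entries for the electronic dictionary.
        TRANSITIVE_VERBS = ["read", "wash", "achieve", "debate"]

        def able_candidates(verbs):
            """Derive candidate '-able' adjectives from transitive verbs."""
            candidates = []
            for verb in verbs:
                # drop a final mute 'e' before the suffix: 'achieve' -> 'achievable'
                stem = verb[:-1] if verb.endswith("e") and not verb.endswith("ee") else verb
                candidates.append(stem + "able")
            return candidates

        print(able_candidates(TRANSITIVE_VERBS))
        # ['readable', 'washable', 'achievable', 'debatable']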

    The treatment of large amounts of text is a new area whose development is expected in many applications. It is only recently that texts have been produced in computer-readable form on a large scale by administrations and companies. A large variety of texts are thus stored, and tools to retrieve them by content rather than by indexing will have to be developed. So far, the only way to search a text for information is by means of a set of keywords, whose use is always limited by their ambiguity or vagueness. Sophisticated methods of retrieval require a linguistic approach that will recognize meaningful units in texts. We recall that a practical problem is rampant: the variety of encodings resulting from the variety of word processors hampers the construction of large corpora, and it is far from clear that the use of SGML representations is a general solution to this difficulty.

    Automatic processing of texts always starts from simple words taken as the basic units that compose documents. Simple words are the natural point of departure for analysis because of the high degree of formalization that has been introduced in the writing and description of European languages. However, simple words are not always the natural units of processing:

  • often they are ambiguous, that is, looking them up in a dictionary provides several entries, hence several meanings, for each of them;
  • often they are meaningless; this is obviously the case for grammatical items and even for general verbs such as to be or to have. Also, many technical phrases are composed of simple words, but the meaning of such terms cannot be deduced from the meaning of the individual words: for example, the noun electro-magnetic field corresponds to a very precise (mathematically defined) concept, which corresponds only vaguely to its component words (see the sketch after this list).
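
    The following minimal sketch illustrates the kind of lookup this implies: a compound dictionary is consulted by longest match, so that electro-magnetic field is recognized as one meaningful unit rather than as three unrelated simple words. The toy dictionary and the example sentence are assumptions made for the illustration; they are not RELEX resources.

        # Minimal sketch (toy dictionary): longest-match lookup of compound
        # terms, falling back to simple words when no compound applies.
        COMPOUNDS = {("electro-magnetic", "field")}
        MAX_COMPOUND_LEN = 2   # longest compound in the toy dictionary

        def segment(tokens):
            """Group tokens into meaningful units, preferring the longest compound."""
            units, i = [], 0
            while i < len(tokens):
                for n in range(MAX_COMPOUND_LEN, 1, -1):
                    if tuple(tokens[i:i + n]) in COMPOUNDS:
                        units.append(" ".join(tokens[i:i + n]))
                        i += n
                        break
                else:
                    units.append(tokens[i])   # simple word (possibly ambiguous)
                    i += 1
            return units

        print(segment("the electro-magnetic field is uniform".split()))
        # ['the', 'electro-magnetic field', 'is', 'uniform']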

    1.3. A cumulative project for electronic lexicons

     

    Linguistic theory is certainly better founded than any other human and social science. Methods and basic concepts have been developed across languages, and they can now be applied to the systematic description of languages. However, large-scale descriptive tasks have not yet been undertaken, for various reasons ranging from conflicts of interest between theoreticians and dictionary publishers to the immense size of the task. All descriptions published so far differ to some extent. Even dictionaries published by the same company are not compatible: they are aimed at different audiences and written by different authors. Even supplements to a given dictionary may be very difficult to merge with the main body of the dictionary, because no standard exists for representing information (e.g. the Oxford English Dictionary project).

    The project presented here is unique in that linguists devoted to the study of different European languages have reached a consensus about descriptive data, which should be:

  • reproducible enough so that they can be applied consistently by teams or generations of specialists (not only by talented individuals),
  • concentrated on comparable phenomena in the various languages. Data relate to areas of each language that are of comparable size (e.g. similar size vocabularies, same families of syntactic structures).

    The consensus is not only theoretical, but empirical as well. After more than five years of parallel experimental work on six languages, working principles have been arrived at that can be extrapolated to other languages and to other phenomena. The results obtained so far are given in the bibliography on lexicon-grammars (Leclère C. & Subirats C. 1991), and exemplified in annex 4.


