Réseau Relex

6. CORPORA


  • 6.1. Availability of corpora
  • 6.2. Coverage

    6.1. Availability of corpora

    The various teams are processing corpora. It is interesting to observe that very similar legal problems have arisen in the various countries, and attitudes are also quite similar: after several years of practice, it is clear that personal contacts with the owners of texts have played a crucial role in the acquisition of texts. On the organizational level, it is clear that an institution intending to build a large corpus will need a special team of specialists dealing with the owners of texts:

  • - these specialists should be able to explain convincingly the purpose of the corpus,
  • - they should be acquainted with the legal issues involved,
  • - they should have good expertise in text formats, from current word processors to sophisticated typesetting machines.

    So far, the texts used for processing have been of various types:

  • - each member of the network, through a special agreement with a private company, has had access to technical texts, newspapers or journals. The use of these texts is restricted to experimental work in a given team;
  • - commercially available texts which can be downloaded in ASCII form in order to be processed freely. Some of these texts are currently available on CD-ROM: newspapers such as Le Monde in French and Il Sole 24 Ore in Italian, and news from press agencies (in French and in German). The legal status of these texts is not clear: for example, to what extent can samples of several of these texts be merged in order to build a special-purpose corpus that would be made available to institutions which have not purchased the whole CD-ROMs?
  • - texts in the public domain. For example, literary texts for which no rights are required (they are often 50 or so years old), and theses written in the vicinity of the various laboratories. Many official texts published by governments could also be made available.

    There are also special opportunities, difficult to plan for, at least at the present time. For example, the multilingual corpus represented by the journal Scientific American and its translations is in principle available. However, securing all the texts in each country where they are produced is an intractable problem. On the other hand, there exist a few newspapers for German- and French-speaking readers published simultaneously in both languages. Samples made available to the German team have shown that bilingual processing could be extremely valuable.

    6.2. Coverage

    Large corpora are necessary to assess the quality of linguistic tools, and hence the feasibility of certain applications: for example, determining the degree of completeness of a dictionary, or the amount of human editing needed in an automatic translation process, since not all ambiguities can be solved by computer programmes.
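    The degree of completeness of a dictionary mentioned above can be estimated, in the simplest case, as the fraction of corpus tokens found in the dictionary. A minimal sketch (the lexicon and tokens below are illustrative, not taken from the project's resources):

```python
# Sketch: estimating the lexical coverage of a dictionary over a corpus.
# The dictionary and token list are illustrative stand-ins.

def coverage(corpus_tokens, dictionary):
    """Return the fraction of corpus tokens found in the dictionary."""
    if not corpus_tokens:
        return 0.0
    known = sum(1 for t in corpus_tokens if t.lower() in dictionary)
    return known / len(corpus_tokens)

dictionary = {"the", "cat", "sat", "on", "mat"}
tokens = ["The", "cat", "sat", "on", "the", "xyzzy"]
print(round(coverage(tokens, dictionary), 2))  # prints 0.83 (5 of 6 tokens known)
```

    A real implementation would of course tokenize raw text and consult the full simple- and compound-word dictionaries rather than a toy set.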

    Corpora are used to determine the coverage of existing dictionaries: the failure of a look-up procedure generally indicates that the corresponding word:

  • 1 : is a new word,
  • 2 : or is a proper name,
  • 3 : or is a non-alphabetic string (e.g. numerals),
  • 4 : or is misspelled.
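    A rough triage of look-up failures along these four lines can be sketched as follows (the heuristics and category labels are illustrative assumptions, not the project's procedure):

```python
import re

def classify_unknown(token, dictionary):
    """Heuristic triage of a failed dictionary look-up into the four
    cases above. The rules here are deliberately crude sketches."""
    if not re.search(r"[a-zA-Z]", token):
        return "non-alphabetic"          # case 3: numerals, symbols, etc.
    if token[0].isupper():
        return "possible proper name"    # case 2: capitalised, unknown
    # case 4, crudely: one character deletion away from a known word
    for i in range(len(token)):
        if (token[:i] + token[i + 1:]).lower() in dictionary:
            return "possible misspelling"
    return "candidate new word"          # case 1: none of the above

dictionary = {"corpus", "grammar", "dictionary"}
print(classify_unknown("1984", dictionary))       # non-alphabetic
print(classify_unknown("Grevisse", dictionary))   # possible proper name
print(classify_unknown("grammmar", dictionary))   # possible misspelling
print(classify_unknown("lexnetism", dictionary))  # candidate new word
```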

    Each of these situations can be handled, but each requires a different approach.

    T1. New words must be encoded and added to the existing dictionary. However, some assessment should be made as to their prospects of survival (some words are coined in specific contexts and may never be used again).

    T2. Catalogs of proper names exist. In some cases it is possible to build specialized dictionaries for them. In annex 3 we give an example of names of oceans and we propose to identify more such situations (e.g. places such as countries, islands) and to generalize this methodology to other areas.

    T3. Local grammars will be built for numerals of all types (Roman, alphabetic, etc.).
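    As an illustration of such a local grammar, a recognizer for Roman numerals can be written as a regular expression, which is equivalent to a finite automaton (the pattern below is a standard one for canonical Roman numerals up to 3999, not the project's actual grammar):

```python
import re

# A local grammar for Roman numerals expressed as a regular expression;
# regular expressions compile to finite automata, matching the
# finite-automaton formalism used for local grammars.
ROMAN = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

def is_roman(token):
    """Recognize canonical Roman numerals from I to MMMCMXCIX."""
    return bool(token) and bool(ROMAN.match(token))

print(is_roman("MCMXCIV"))  # True  (1994)
print(is_roman("IIII"))     # False (non-canonical form)
```

    Analogous automata can be built for alphabetic numerals ("twenty-three", "vingt-trois") and other numeral types.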

    We recall that the formalism of finite automata allows the automatic construction of parsers from the graphs of the local grammars. These parsers are applied to texts and, combined with the use of the various dictionaries (of simple and compound words), they allow the detection of a large variety of patterns that one can define by combining, practically at will, lexical and grammatical features. This procedure is an extremely promising tool for the detection of terminology not yet present in the current dictionaries. In annex 3 we give examples of such grammars used for the detection of new terms. The development of this approach is the subject of task L. This approach can be extended to the search for terms and their translations by processing bilingual texts. Such bilingual corpora are available (in particular between French and German); this extension is the subject of task B.
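    The combination of dictionary look-up and a local grammar for term detection can be sketched as follows: tokens are tagged via a lexicon, and maximal ADJ* NOUN+ runs are collected as term candidates. The tag set and the pattern are illustrative assumptions, not the grammars of annex 3:

```python
# Sketch of pattern-based term detection: a tiny "local grammar"
# (maximal ADJ* NOUN+ sequences) applied to dictionary-tagged text.

def term_candidates(tagged):
    """Collect maximal ADJ* NOUN+ runs from a list of (word, tag) pairs."""
    out, adjs, nouns = [], [], []
    for word, tag in tagged:
        if tag == "ADJ" and not nouns:
            adjs.append(word)            # adjectives before the first noun
            continue
        if tag == "NOUN":
            nouns.append(word)           # extend the noun run
            continue
        if nouns:
            out.append(" ".join(adjs + nouns))   # close a candidate term
        adjs, nouns = ([word] if tag == "ADJ" else []), []
    if nouns:
        out.append(" ".join(adjs + nouns))
    return out

tagged = [("the", "DET"), ("finite", "ADJ"), ("state", "NOUN"),
          ("automaton", "NOUN"), ("parses", "VERB"), ("text", "NOUN")]
print(term_candidates(tagged))
# prints ['finite state automaton', 'text']
```

    In the project's setting the tags would come from the DELA dictionaries and the pattern from a local-grammar graph rather than being hard-coded.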

    T4. Detection of spelling mistakes can be performed by using the DELA dictionaries already constructed, but the procedures of detection and correction should be improved. In the future RTD programme, industrial partners who are already engaged in the development of spelling programmes have shown interest in pursuing their work by using the tools of the LEXNET project.
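    The basic mechanism of dictionary-based detection and correction can be sketched as follows: an unknown token is flagged, and corrections are proposed among dictionary words one edit (insertion, deletion, substitution, or transposition) away. This is a generic sketch over a toy lexicon, not the DELA procedures themselves:

```python
# Sketch of dictionary-based spelling detection and correction.
# Candidates are dictionary words at edit distance 1 from the token.

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + substitutes + inserts)

def correct(token, dictionary):
    """Return (is_known, sorted candidate corrections)."""
    if token in dictionary:
        return True, []
    return False, sorted(edits1(token) & dictionary)

dictionary = {"corpus", "corpora", "parser", "grammar"}
print(correct("gramar", dictionary))   # prints (False, ['grammar'])
print(correct("corpus", dictionary))   # prints (True, [])
```

    Ranking the candidates (by frequency, or by keyboard-error likelihood) is where the improved procedures mentioned above would come in.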


