The various teams are processing corpora. It is interesting to observe that very similar legal problems have arisen in the various countries. Attitudes are also quite similar: after several years of practice, it is clear that personal contacts with the owners of texts have played a crucial role in obtaining them. On the organizational level, it is equally clear that an institution intending to build a large corpus will need a dedicated team of specialists to deal with the owners of texts:
So far, the texts used have been of various types:
There are also special opportunities, difficult to plan for, at least at the present time. For example, the multilingual corpus represented by the journal Scientific American and its translations is in principle available. However, securing all the texts in each country where they are produced is an intractable problem. On the other hand, there exist a few newspapers for German-speaking and Francophone readers published simultaneously in both languages. Samples made available to the German team have shown that bilingual processing could be extremely valuable.
Large corpora are necessary to assess the quality of linguistic tools, and hence the feasibility of certain applications: for example, determining the degree of completeness of a dictionary, or the amount of human editing needed in an automatic translation process, since not all ambiguities can be solved by computer programmes.
Corpora are used to determine the coverage of existing dictionaries: the failure of a look-up procedure indicates in general that the corresponding word is a new word, a proper name, a numeral or similar expression, or a misspelling.
Each of these situations can be handled, but each requires a different approach.
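The look-up step described above can be sketched as follows; the word list and corpus are illustrative placeholders, not the actual project dictionaries.

```python
# Sketch of a dictionary-coverage check over a tokenized corpus.
# The dictionary and corpus below are illustrative placeholders.
dictionary = {"the", "corpus", "contains", "words", "and", "names"}

def lookup_failures(tokens):
    """Return tokens not found in the dictionary (case-insensitive)."""
    return [t for t in tokens if t.lower() not in dictionary]

corpus = ["The", "corpus", "contains", "Atlantic", "XIV", "wrods", "and", "neologisms"]
failures = lookup_failures(corpus)
# `failures` now holds the candidates to be classified as new words,
# proper names, numerals, or spelling mistakes (tasks T1-T4 below).
```

Each failure is then routed to the appropriate treatment, as described in tasks T1 to T4.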
T1. New words must be encoded and added to the existing dictionary. However, some assessment should be made of their prospects of survival (some words are coined in specific contexts and may never be used again).
T2. Catalogs of proper names exist, and in some cases it is possible to build specialized dictionaries from them. In annex 3 we give, as an example, the names of oceans; we propose to identify more such situations (e.g. place names such as countries and islands) and to generalize this methodology to other areas.
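A specialized proper-name dictionary of this kind can be sketched as below, in the spirit of the annex 3 example; the entries, feature names, and matching strategy are assumptions for illustration only.

```python
# Hypothetical specialized dictionary of ocean names; the entries and
# the feature codes ("cat", "sem") are illustrative assumptions.
OCEANS = {
    "Atlantic Ocean": {"cat": "N", "sem": "toponym:ocean"},
    "Pacific Ocean": {"cat": "N", "sem": "toponym:ocean"},
    "Indian Ocean": {"cat": "N", "sem": "toponym:ocean"},
}

def tag_proper_names(text, lexicon):
    """List the multi-word proper names of the lexicon found in the text."""
    found = []
    for name, features in lexicon.items():
        if name in text:
            found.append((name, features["sem"]))
    return found

sentence = "Currents cross the Atlantic Ocean towards the Indian Ocean."
```

The same mechanism generalizes to other closed classes of place names (countries, islands) by swapping in another lexicon.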
T3. Local grammars will be built for numerals of all types (Roman, alphabetic, etc.).
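A local grammar for one such class, Roman numerals, can be written as a regular expression (equivalent to a finite automaton); this is a minimal sketch, not the project's actual grammar.

```python
import re

# A small "local grammar" for Roman numerals, expressed as a regular
# expression (equivalent to a finite automaton). Covers I..MMMCMXCIX.
ROMAN = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

def is_roman(token):
    """True if `token` is a well-formed Roman numeral."""
    return token != "" and bool(ROMAN.match(token))
```

Alphabetic and other numeral types would each get a similar automaton, and the automata can then be applied jointly to the corpus.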
We recall that the formalism of finite automata allows the automatic construction of parsers from the graphs of local grammars. These parsers are applied to texts and, combined with the various dictionaries (of simple and compound words), they allow the detection of a large variety of patterns that can be defined by combining, practically at will, lexical and grammatical features. This procedure is an extremely promising tool for the detection of terminology not yet present in the current dictionaries. In annex 3 we give examples of such grammars used for the detection of new terms. The development of this approach is the subject of task L. The approach can be extended to the search for terms and their translations by processing bilingual texts. Such bilingual corpora are available (in particular between French and German); this extension is the subject of task B.
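The combination of a local grammar compiled into a finite automaton with dictionary-assigned features can be sketched as follows; the tag set and the pattern (adjectives followed by nouns, as term candidates) are illustrative assumptions.

```python
# Minimal sketch: a local grammar compiled to a finite automaton is run
# over dictionary-tagged tokens. The tag set (A=adjective, N=noun,
# V=verb) and the pattern A* N+ are illustrative assumptions.
PATTERN = {      # state -> {tag: next_state}
    0: {"A": 0, "N": 1},
    1: {"N": 1},
}
ACCEPTING = {1}

def match_from(tagged, start):
    """Return the end index of the longest match starting at `start`, or None."""
    state, end = 0, None
    for i in range(start, len(tagged)):
        tag = tagged[i][1]
        if tag not in PATTERN.get(state, {}):
            break
        state = PATTERN[state][tag]
        if state in ACCEPTING:
            end = i + 1
    return end

def extract_terms(tagged):
    """Collect non-overlapping longest matches as term candidates."""
    terms, i = [], 0
    while i < len(tagged):
        end = match_from(tagged, i)
        if end:
            terms.append(" ".join(w for w, _ in tagged[i:end]))
            i = end
        else:
            i += 1
    return terms

tagged = [("finite", "A"), ("automaton", "N"), ("parses", "V"),
          ("large", "A"), ("text", "N"), ("corpora", "N")]
```

Richer patterns are obtained by adding states and transitions, i.e. by drawing larger local-grammar graphs and compiling them the same way.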
T4. Detection of spelling mistakes can be performed using the DELA dictionaries already constructed, but the procedures for detection and correction should be improved. In the future RTD programme, industrial partners already engaged in the development of spelling programmes have shown interest in pursuing their work using the tools of the LEXNET project.