We will propose a full scheme to be developed on a large scale at a later stage, and we will provide significant samples of the full-scale project in the present proposal.
The process of constructing linguistic databases is identical for all languages. However, linguistic differences among languages may change certain volumes of data, and hence the duration of certain phases of growth. These differences are also reflected in the sizes of the teams and in the starting date of the RELEX project. But for all teams concerned with the electronic dictionaries, the methods and the formats used are the same.
A distinction is made between the Kernel vocabulary, that is, the vocabulary for which syntactic and semantic information can be elicited from linguists, and the Satellite vocabulary, which includes technical terms not in use by the general public. Terminology belongs to the Satellite vocabulary. The Satellite vocabulary can be encoded by linguists at best at the morphological level; its detailed description will require the knowledge of specialists in each field.
By 'electronic dictionary', we mean a computerized dictionary intended for use in rather sophisticated computer operations, such as recognizing a complex technical term in a text to be automatically indexed, or parsing a text in order to translate it into another language and/or a phonetic form. Such elementary operations are the preliminary steps necessary in systems of translation or information retrieval. Electronic dictionaries are thus sharply distinguished from ordinary commercial dictionaries which have been computerized. In general, phototypesetting is the motivation for computerizing dictionaries, but there are now non-paper versions (i.e. CD-ROMs) derived from the paper version, to which an automatic look-up procedure has been added for use by the general reader.
Electronic dictionaries differ in several crucial ways from ordinary dictionaries:
The coverage provided by commercial dictionaries is determined by a compromise between several non-linguistic parameters. It even appears that coverage is approached rationally only in terms of marketing: a public for a dictionary and the corresponding selling price are determined, and the linguistic coverage then follows. Suppose, for example, that some publisher feels there is a need for a language dictionary rather than an encyclopedic one, and that the price is set at 30 ECUs. Since the amount of information attached to each entry (explanations and examples) is constrained by tradition, and mostly by competition among publishers, the size in number of characters is thereby determined, and the number of entries is going to be about 50,000.
Large numbers of entries are left out on various grounds. Although dictionaries do contain a substantial amount of scientific and technical vocabulary, it is clear that many terms are not included.
First of all, given the uses foreseen for them, electronic dictionaries should be complete. If a word in a sentence is not recognized, the chances are that the parsing of the sentence will fail. Syntactic computations may easily enter into unstable conditions, and there are many ways in which a recognition procedure may fail. For example, if one character is not recognized (by an optical reader, or because of a spelling or typing mistake, etc.), the word that contains this character will not be recognized (and spelling correctors may fail). But then the sentence containing the word will not be analyzed. This negative magnifying effect distinguishes computer processes from the corresponding human processes, where the effect of a mistake can often be kept local, so that the perception operation is not affected in its totality.
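The magnifying effect described above can be illustrated with a minimal sketch. It assumes a deliberately naive recognizer in which a sentence is analyzable only if every one of its tokens is found in the lexicon; the names `unknown_words`, `is_analyzable` and `LEXICON` are hypothetical and do not come from the RELEX tools.

```python
import re

def unknown_words(sentence, lexicon):
    """Return the tokens of `sentence` that are absent from `lexicon`."""
    # Naive tokenization (an assumption): lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in lexicon]

def is_analyzable(sentence, lexicon):
    """A sentence is analyzable only if no token is unknown."""
    return not unknown_words(sentence, lexicon)

# Toy lexicon of inflected forms, for illustration only.
LEXICON = {"the", "ocean", "is", "deep"}

print(is_analyzable("The ocean is deep.", LEXICON))
print(unknown_words("The ocaen is deep.", LEXICON))
```

In this sketch a single mistyped character ("ocaen") makes one word unknown, and that one unknown word is enough to reject the whole sentence.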
Another difference between electronic and commercial dictionaries lies in the nature of the information attached to lexical items. Common grammatical information such as gender, number, person, tense and case is often attached to endings, and it has to be put in computerized form. However, some formal requirements reveal deficiencies in ordinary dictionaries. Let us take the example of the French electronic dictionary DELA (cf. annex 1). The strategy adopted to describe the set of French words has been the following. Entries in the traditional form have been compiled. The information necessary for deriving all forms of each entry has been introduced. Thus a complete set of inflected words can be generated from the dictionary entries. At this point, it is easy to verify that even a mark such as the plural of nouns has been poorly represented in commercial dictionaries. In general, a default rule is used in ordinary dictionaries: "if unspecified, the plural is formed by suffixing an -s to the word". As a consequence, such a rule applies to all subentries of a given entry. For example, the plural of ocean is oceans; however, ocean in Pacific ocean, which has no plural, will have to be constrained. In the same way, the noun effect has the plural form effects, but not in the adverbial form in effect. There are thousands of such situations, which must be encoded individually.
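As an illustration of the default plural rule and of the individually encoded exceptions, here is a minimal sketch. It is not the DELA format: the `Entry` class, the `"+s"` code and the function names are hypothetical, and the English examples are those used above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Entry:
    lemma: str
    # Hypothetical encoding: "+s" means the default rule applies,
    # None means the entry has no plural, any other string is an explicit plural.
    plural: Optional[str] = "+s"

def inflect(entry):
    """Return the list of forms generated from one entry."""
    forms = [entry.lemma]
    if entry.plural is None:
        return forms                      # plural blocked for this entry
    if entry.plural == "+s":
        forms.append(entry.lemma + "s")   # default rule: suffix -s
    else:
        forms.append(entry.plural)        # explicitly encoded plural
    return forms

def generate_forms(lexicon):
    """Generate the full set of inflected forms from all entries."""
    forms = set()
    for entry in lexicon:
        forms.update(inflect(entry))
    return forms

LEXICON = [
    Entry("ocean"),
    Entry("Pacific ocean", plural=None),  # no plural
    Entry("effect"),
    Entry("in effect", plural=None),      # adverbial form, no plural
]

print(sorted(generate_forms(LEXICON)))
```

The point of the sketch is that the exception is attached to each subentry separately, so the default rule cannot silently produce forms such as "in effects".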
Building a computer dictionary involves the preliminary step of entering a list of words of the corresponding language. Such a step has always seemed trivial to most computer specialists, who have concerned themselves with the complex procedures of syntactic and semantic analysis since the early days of mechanical translation. In general, they considered that appending a dictionary to their programs simply involved "keying in" the entries of some commercially available dictionary.
Inflecting the entries of a commercial dictionary in order to get the full set of words found in texts (e.g. nouns in the plural, conjugated verbs) has never been a major concern to computer specialists, especially to those dealing with English, which is a language with a rather simple system of inflection. Again, it is generally thought that the information contained in available dictionaries and grammars is sufficient to build a program of inflection. But the problem is by no means simple, and numerous questions, both linguistic and computational, have to be solved in order to get a meaningful representation of what is commonly thought to exist under the term: the set of words of a given language.