Dictionaries and grammars have been recognized as crucial components of most applications of Natural Language Processing (NLP). Numerous prototypes of language analyzers and generators have been built, but practically none of them incorporates full scale dictionaries and grammars. This general situation has been dubbed processing with "toy dictionaries" and "toy grammars".
Defining a full scale dictionary is already a problem in itself. This question must be addressed in several steps, and it constitutes in fact the core of the project.
The first step is the level of graphically simple words, namely words as they appear as entries of commercial dictionaries. In order to match a dictionary of canonical entries with words as they are found in texts, entries must be inflected. The general inflection scheme consists in appending inflection codes to canonical entries in order to generate all inflected forms. This approach seems straightforward, and even well prepared by existing material such as conjugation dictionaries built for pedagogical purposes. However, few such dictionaries exist today, either in academic or in industrial environments. Various questions indeed remain to be solved, both at the practical and at the theoretical level, before a dictionary reaches an operational stage of coverage. The members of the RELEX group have each built such a dictionary (DELA) for their language (cf. annex 1). These dictionaries are to be completed by many derivatives and technical words. This is the subject of task T1.
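As a minimal sketch of the inflection scheme described above, the following Python fragment expands canonical entries into inflected forms and indexes them for lookup in texts. The code N1 is used in the sense given later in this document (nouns taking -s in the plural); the codes N2 and N3 and all suffix tables are illustrative assumptions, not the actual DELA codes.

```python
# Hypothetical inflection codes: each maps to (chars_to_strip, suffix, features).
# These tables are illustrative assumptions, not the real DELA codes.
INFLECTION_CODES = {
    "N1": [(0, "", "singular"), (0, "s", "plural")],    # board -> boards
    "N2": [(0, "", "singular"), (0, "es", "plural")],   # box -> boxes
    "N3": [(0, "", "singular"), (1, "ies", "plural")],  # city -> cities
}


def inflect(lemma, code):
    """Return every (form, features) pair generated from a canonical entry."""
    forms = []
    for strip, suffix, features in INFLECTION_CODES[code]:
        stem = lemma[: len(lemma) - strip]
        forms.append((stem + suffix, features))
    return forms


def build_form_index(entries):
    """Map each inflected form back to its (lemma, code, features) analyses."""
    index = {}
    for lemma, code in entries:
        for form, features in inflect(lemma, code):
            index.setdefault(form, []).append((lemma, code, features))
    return index


entries = [("board", "N1"), ("box", "N2"), ("city", "N3")]
index = build_form_index(entries)
print(index["cities"])  # analyses of the inflected form "cities"
```

Storing only the canonical entry and its code keeps the dictionary compact, while the generated index is what a text analyzer would actually consult.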
are easy to coin and to understand, they are not entered into commercial dictionaries; the reason is that, as morphological studies have shown, they are extremely numerous. The size of current dictionaries is about 100 000 canonical entries (to the extent that derived words can be listed); derived words increase the size of the lexicon by a factor of at least one hundred;
At this point, a first definition of a full scale dictionary can be given: a full coverage dictionary is a dictionary which recognizes all the ASCII sequences of a library of texts.
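This definition can be read as a test to run against a corpus. The sketch below is one possible reading, under a loud assumption: a "sequence" is taken to be a maximal run of letters (with internal apostrophes or hyphens), which the text does not specify and which leaves out numbers and other character sequences. The function names are hypothetical.

```python
import re


def unrecognized_sequences(text, dictionary):
    """Return the distinct sequences of `text` that the dictionary does not recognize."""
    # Assumption: a "sequence" is a maximal run of letters, with internal
    # apostrophes and hyphens kept; the source does not define tokenization.
    tokens = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    missing = []
    for token in tokens:
        form = token.lower()
        if form not in dictionary and form not in missing:
            missing.append(form)
    return missing


def has_full_coverage(texts, dictionary):
    """A dictionary covers a library if no sequence is left unrecognized."""
    return all(not unrecognized_sequences(text, dictionary) for text in texts)


dictionary = {"this", "glass", "is", "breakable"}
library = ["This glass is breakable.", "This glass is swallowable."]
print(unrecognized_sequences(library[1], dictionary))
print(has_full_coverage(library, dictionary))
```

In this toy run the derived adjective swallowable is the only unrecognized sequence, which illustrates why derivational morphology is the first problem to solve.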
To reach such a stage of completeness for a dictionary, the following problems must be solved:
1. The problem of derivational morphology. The present proposal includes a general approach to this question.
More precisely, consider the case of adjectives derived from verbs by suffixation of -able (from transitive verbs, in French and in English). For example, swallowable is a well-formed adjective built on the passivizable verb to swallow. It is not necessary to include it in a dictionary, since the transformational pattern of derivation and interpretation is fully regular:
For example, in the OED, we do not find abbreviatable and abjurable which are well-formed adjectives derived from transitive verbs, and could possibly occur in proper contexts. Such conventions depend on languages. For example, Italian and Spanish dictionaries omit numerous diminutive and augmentative forms of nouns and adjectives that are easily constructed from the standard form.
This implicit knowledge must be incorporated in any electronic dictionary: either full lists are compiled, or else rules are stated that will predict productive forms. Let us return to the example of -able. Deciding whether a given form in -able is "regular" or not can be a non-trivial problem (e.g. drinkable). Let us now consider the steps necessary to list all the possible forms in -able. Some verbs do not have derived adjectives, for example to remain (*remainable); others do. However, the search for -able forms does not reduce to a simple yes-no answer to the question: "Does the verb X have a derived adjective in -able?" Consider the verb to break: there is an adjective breakable, and the two are associated in the syntactic relation:
(1) (One + the book) can break this glass = This glass is breakable
but not the verb to break in the sentence:
(2) Bob broke with Jo
One could be satisfied with an "existential" answer: there exists one meaning (use or entry) of to break which has an adjectival form breakable. At this level, we do not perform a full lexico-syntactic description of the word break. However, a detailed description will be needed for other purposes such as sentence parsing or sentence generation. Moreover, a good knowledge of the formal conditions under which the adjectivization in -able occurs can only be obtained by examining a lexicon where the structures (and meanings) of verbs have been clearly separated. Common dictionaries do separate sentences such as (1) and (2) as being different uses and as subentries of a given verb form, but this separation is always far from complete. In particular, attempts to separate figurative or metaphoric meanings from proper ones have considerably blurred the notion of entry of a given word. Consider for example:
(3) The letter broke Jo
It is customary to call (3) a metaphor of (1). Again, the adjectivization :
Jo is breakable
is forbidden, for unclear reasons since the source of V-able: Jo can be easily broken
is accepted. The term metaphor suggests a relation between (1) and (3), but only one aspect of the relation can be made clear: (3) has a diachronic (etymological) origin; presumably, it was created after (1). This observation is not of much help for the morphosyntactic and semantic description. Too many questions remain:
- What is the degree of generality of the relation between (1) and (3)? Namely, there are other pairs, such as:
(1a) This book (crushed + hit + touched + shattered) Jo
which are intuitively similar, but there are also sentences such as :
(1b) The book (missed + reached + smashed) the glass
for which there is no corresponding metaphor:
(3a) This (letter + book) (missed + reached + ?smashed) Jo
Hence, before one can consider that there exists a semantic relation that associates sentences of types (1) and (3), one has to investigate the lexicon and separate the various meanings (or uses, or subentries) of each word;
- What is the nature of the relation? Why should a change of meaning block a morpho-syntactic process? As a matter of fact, since the pairs (1a)-(3a) have never been studied, it is not known whether the adjectivized form in -able is forbidden in the metaphoric meaning.
The consequence of this discussion should be clear: the answer to a question that seemed limited to simple words, that is a morphological question, turns out to be a bona fide syntactic question. The framework for solving this problem, in fact practically all problems of derivational morphology, is situated at the level of lexicon-grammar (cf. 4).
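The point of this discussion can be sketched as a data structure: the -able property is attached to each use (subentry) of a verb, not to the verb form. The acceptability values below follow the judgments given in the text for break, remain and swallow; the frame notation, the class name SubEntry and the naive spelling rule are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubEntry:
    verb: str          # canonical verb form
    structure: str     # sentence frame of this use (hypothetical notation)
    able: bool         # whether this use has an adjective in -able
    example: str = ""  # example sentence, when the text provides one


# Acceptability values follow the judgments in the text; the frame
# notation is an illustrative assumption.
LEXICON = [
    SubEntry("break", "N0 V N1", True, "The book can break this glass"),
    SubEntry("break", "N0 V with N1", False, "Bob broke with Jo"),
    SubEntry("break", "N0 V N1hum", False, "The letter broke Jo"),
    SubEntry("remain", "N0 V", False),
    SubEntry("swallow", "N0 V N1", True),
]


def able_uses(verb):
    """Return the subentries of `verb` that license an adjective in -able."""
    return [entry for entry in LEXICON if entry.verb == verb and entry.able]


def able_adjective(verb):
    """Return the -able adjective if at least one use licenses it, else None."""
    # Naive spelling rule (plain concatenation), sufficient for these verbs only.
    return verb + "able" if able_uses(verb) else None


print(able_adjective("break"), len(able_uses("break")))
print(able_adjective("remain"))
```

The "existential" answer corresponds to able_adjective, while parsing or generation would need able_uses, which keeps track of which use of the verb the adjective belongs to.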
So far we have dealt only with forms of simple words. This description must be completed by:
- syntactic information of a lexical nature. For example, every verb has its own argument structure which must be described in the dictionary entry (cf. lexicon-grammar).
- semantic information. The very first step consists in separating the different meanings a given word may have.
The following two questions encompass more than simple words; they also bear on compounds:
2. The problem of numbers and quantitative phrases. This question has to be refined, and we will classify the utterances involved. For example, we will consider dates and physical measurements as separate items. We will then propose a lexico-grammatical approach to most numerical expressions (Task T2).
3. The problem of proper names. Some can be listed (e.g. names of countries and other places or objects such as mountains, lakes, rivers). Others, for which a dictionary seems intractable, may in fact be constructed: telephone books could be a basis for the construction of dictionaries of proper names of persons and organizations. The present proposal partly addresses this question: we will show that the method proposed for numerical expressions can be extended to proper names, and we will construct significant examples (Task T3).
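To make the classification of numerical expressions concrete, here is a small sketch that separates dates, physical measurements and plain numbers. It uses regular expressions as a simple stand-in for the lexico-grammatical description announced above, not as the method of the proposal; the patterns, the unit list and the function name are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns; the classes "date" and "measurement" come from the
# text, but the patterns and the unit list are assumptions for illustration.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
UNITS = "km|kg|cm|mm|m|g|l"

PATTERNS = [
    ("date", re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b")),
    ("measurement", re.compile(rf"\b\d+(?:\.\d+)? ?(?:{UNITS})\b")),
    ("number", re.compile(r"\b\d+(?:\.\d+)?\b")),
]


def classify_numerical_expressions(text):
    """Return (class, matched_text) pairs, trying the most specific class first."""
    found = []
    covered = []  # character spans already assigned to a class
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            start, end = match.span()
            if any(start < e and s < end for s, e in covered):
                continue  # overlaps an expression of a more specific class
            covered.append((start, end))
            found.append((start, label, match.group()))
    # Report the expressions in their order of appearance in the text.
    return [(label, expr) for _, label, expr in sorted(found)]


sentence = "On 14 July 1989 the team walked 12.5 km and met 3 people."
print(classify_numerical_expressions(sentence))
```

Ordering the patterns from most to least specific ensures that the digits inside a date or a measurement are not counted again as bare numbers.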
The morphological dictionaries DELA under construction by each partner can be supplemented with a phonological component. In this domain it is not feasible to use only a universal alphabet like the International Phonetic Alphabet: too many specialists disagree on the basic sounds. But national alphabets can easily be designed according to foreseen applications: phonetization of written texts, use in spelling correction, etc. Constructing a phonetic system is a two-component task:
- first, all the entries of the component of simple words must be encoded. Then phonological rules of inflection must be devised;
- second, a phonetization of corpora using the phonetic dictionary as well as general rules must be constructed;
- other components will be necessary, for example in French rules of 'liaison' and elision are necessary to generate the phonetic forms of compounds starting from simple words.
The various teams have already acquired some experience in this area, in particular French and Italian have been partly described.
The next step in complexity is the construction of dictionaries of compound terms. The participants of the project have adopted the classification of parts of speech used for simple terms:
Within each of these major categories, subclasses are defined in terms of the categories that make up the compounds. From the point of view of the recognition of complex utterances in a text, one will have to distinguish at least two main types of entries, depending on the variability of the terms:
In annex 2, we present a classification of compound terms, as adopted by the consortium and currently used. We indicate the figures obtained in a first trial of compilation of French terms.
The DELA system common to 6 languages (cf. annex 1) is the most elementary form of dictionary: a list of words with, attached to each word, the grammatical information needed to inflect it or to keep it invariable. For European languages, this information is limited to gender, number, case, tense, mood, person. There is no limit to the amount of information that one may want to attach to words: syntactic, semantic, phonetic, stylistic, historical, encyclopedic data can be introduced, depending on applications. Already, the minimal information previously required allows for some syntactic computations: rules that establish elementary agreement between the inflected words of a text can be defined and used in a parser. This elementary information is also used to represent certain ambiguities (called homographs at this level of description). For example, the French word voile can be:
Thus, at the descriptive level of the parts of speech, we have 3 homographs: 2 nouns, 1 verb. At the more precise level of inflected forms, we have 5 verbal forms, hence 7 homographic forms.
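The two counts can be reproduced with a small lookup table. The sketch below lists the analyses of voile in line with the figures given above (two nouns and five forms of the verb voiler); the tag notation and the names Analysis, homographic_forms and homographic_entries are illustrative assumptions, not the DELA format.

```python
from collections import namedtuple

Analysis = namedtuple("Analysis", "lemma pos gender features")

# Analyses of the form "voile"; the tag notation is an assumption.
ANALYSES = {
    "voile": [
        Analysis("voile", "N", "m", "singular"),  # le voile
        Analysis("voile", "N", "f", "singular"),  # la voile
        Analysis("voiler", "V", None, "indicative present 1sg"),
        Analysis("voiler", "V", None, "indicative present 3sg"),
        Analysis("voiler", "V", None, "subjunctive present 1sg"),
        Analysis("voiler", "V", None, "subjunctive present 3sg"),
        Analysis("voiler", "V", None, "imperative 2sg"),
    ],
}


def homographic_forms(form):
    """All inflected-form analyses sharing the same spelling."""
    return ANALYSES.get(form, [])


def homographic_entries(form):
    """Distinct dictionary entries (lemma, part of speech, gender) behind a form."""
    return sorted({(a.lemma, a.pos, a.gender or "") for a in homographic_forms(form)})


print(len(homographic_entries("voile")))  # homographs at the part-of-speech level
print(len(homographic_forms("voile")))    # homographic inflected forms
```

Keeping one record per inflected form makes both levels of description available from the same table: entries are obtained by grouping, forms by simple enumeration.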
This level of description does not provide for semantic ambiguities. For example,
- the masculine noun voile can also mean "soft palate" (le voile du palais), a fabric (Swiss voile), difficulty in seeing (un voile devant les yeux), etc.
- the verb can mean to veil a statue, but also to buckle a wheel, etc.
At this point, the list of words which must be built does not require any evaluation of the use of words, ancient or modern, hypercorrect or slang, etc. Hence, one does not have to sort out more or less obsolete words, as for example many found among the entries of a dictionary such as the Oxford English Dictionary.
With respect to these minimal demands, a superficial examination of the best available dictionaries shows that they are incomplete in many respects, and thus unfit as a basis for automatic analysis.
Adding semantic features to dictionary entries is an important improvement from several points of view:
- in the lexicon: at the morphological level it is not possible to separate the two meanings of a noun such as board: 'group of persons' or 'concrete object'; the noun guard, which can be a person or a device for protection, has a similar ambiguity (or homography). Introducing semantic features such as Hum (for 'human') and Conc (for 'concrete') makes it possible to write entries such as:
board, N1, Hum
board, N1, Conc
guard, N1, Hum
guard, N1, Conc
where N1 is the inflection code which indicates that these nouns take an -s in the plural. Such features must be refined: for example, to account for the difference between a board 'group of humans' and a guard 'single individual', the feature HumColl (human collective) has to be introduced. This activity of classification of the vocabulary must be controlled in order to reach reproducible classes of words. The only general method that can be used today is to relate the semantic features to grammatical properties: for example, guard is clearly related to the interrogative pronoun who, whereas board is not;
- semantic features can be encoded in the lexicon-grammar of verbs. Thus, the description of the argument structure of a given verb includes categorial features such as Hum. This information is crucial for disambiguating combinations such as:
where board can only be Human, and the solution board Concrete is rejected, because the structure N0 approve N1 is described with only N0 =: Hum. At the same time, the object N1 is described as accepting both human and non-human nouns (including sentential objects).
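This disambiguation step can be sketched as the intersection of two descriptions: the features listed for the noun and the features a verb accepts in a given argument position. The entry for approve follows the description above (N0 restricted to Hum, N1 unrestricted); the noun plan, its feature Abst and the function name readings are hypothetical additions for the sake of the example.

```python
# Noun entries with semantic features, as in the board/guard entries above.
NOUN_FEATURES = {
    "board": {"Hum", "Conc"},
    "guard": {"Hum", "Conc"},
    "plan": {"Abst"},  # hypothetical noun and feature, not named in the text
}

# Lexicon-grammar entry: features accepted in each argument position.
# None means that the position is unrestricted.
VERB_FRAMES = {
    "approve": {"N0": {"Hum"}, "N1": None},
}


def readings(noun, verb, position):
    """Return the semantic features of `noun` compatible with the given position."""
    allowed = VERB_FRAMES[verb][position]
    features = NOUN_FEATURES[noun]
    return features if allowed is None else features & allowed


print(readings("board", "approve", "N0"))  # subject position: one reading left
print(readings("board", "approve", "N1"))  # object position: still ambiguous
```

An empty result for a position would signal that no listed reading of the noun fits the verb, which is itself useful information for a parser.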
Pursuing further semantic classification will involve the same dual description (G. Gross, 1991). Subclasses of nouns will be defined in relation to their combinatorial properties with verbs. For example, the complement N1 of the structure:
includes names of clothes, justifying the feature Cloth. Such verb-noun combinations are the basis of an operational approach to the construction of a classification of nouns and, as such, an alternative to the a priori schemes used in encyclopedic descriptions of the world, which are not directly related to NLP and more specifically to the question of disambiguation.
The presentation of such a classification scheme and examples for various languages will be the subject of task S.
In annex 1, the French samples are marked with a preliminary set of semantic features.