The RELEX network

7. AMBIGUITY OF CORPORA

7.1. Tagging corpora

Given a text, applying a look-up procedure to all its formally simple words provides a complete solution to the tagging problem: no approximation based on lack of information or on statistical rules needs to be proposed. Consulting a dictionary provides all the available information, and this information can be enriched at will, on general grounds or by customization. More precisely, the tagging procedure:

- does not ask for morphological analysis, except perhaps for some words not found in the dictionary. However, morphology can be used during a provisional phase when derivational morphology has not yet been fully described in the lexicon-grammar;

- includes the tagging of most compound nouns (e.g. technical terminology) and of frozen adverbs. These two types of units can be located by a simple look-up procedure (string matching). They are to be distinguished from compound verbs and compound adjectives which are in fact elementary sentences described in the lexicon-grammar. Hence a certain amount of syntactic analysis is required to locate them in texts;

- is only based on the use of dictionaries. As we will see below, ambiguities are numerous and distributed at various linguistic levels. Large amounts of information are needed to solve ambiguities, hence the approach proposed: dictionaries can be made to contain all the information necessary to explore the contexts of ambiguous words in order to solve ambiguities.
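The look-up step described above can be sketched as follows. This is only a minimal illustration, not the RELEX implementation: the toy dictionary, the function names and the tags 'PRO', 'PREP' and 'N:fs' are assumptions made for the example (the tags DET:fs, N:ms and the V:... forms are those used in section 7.2). The point is that every analysis listed in the dictionary is returned, so that nothing is decided at tagging time.

```python
import re

# Hypothetical toy dictionary: each simple word maps to ALL its tags.
# 'PRO', 'PREP' and 'N:fs' are illustrative tag names, not RELEX codes.
SIMPLE_WORDS = {
    "la": ["DET:fs", "PRO", "N:ms"],
    "place": ["N:fs", "V:Pres1s", "V:Pres3s", "V:Sub1s", "V:Sub3s", "V:Imp2s"],
    "pour": ["PREP", "N:ms"],
}


def tokenize(text):
    """Split a text into lower-cased simple words (runs of letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def tag(text, dictionary=SIMPLE_WORDS):
    """Return (token, tags) pairs; a word not in the dictionary gets []."""
    return [(tok, dictionary.get(tok, [])) for tok in tokenize(text)]
```

A word returned with an empty tag list is one of the cases where, as noted above, morphological analysis might be used as a fallback.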

7.2. Ambiguities

In principle, texts are meant to be unambiguous. More precisely, a given word, that is, a sequence of characters occurring between two consecutive separators (e.g. between two spaces), is not ambiguous in its context. However, when an isolated word is looked up in a dictionary, several solutions are often given for its interpretation.

A fundamental operation is involved in most advanced NLP applications: analyzing the words of a given text by looking up each of them in a dictionary. Since some of the words of a given sentence are ambiguous, practically all the corresponding sentences are ambiguous as well.

It should be clear that the more complete a dictionary is, the more solutions it provides for each word. Thus, large dictionaries include highly technical and obsolete meanings in their entries, which are potential uses for words even if they are thought implausible. More systematically, a compound utterance very often has an alternative analysis in terms of its simple words; even if this situation is rarely found in texts, one must always take into account that an idiom such as to take the bull by the horns can a priori be interpreted with its literal meaning, that of to take the bull by one of its horns.

We intend to elaborate a model of representation for lexical ambiguities, that is, for all the solutions provided by the dictionary used to process the text.

The following principle has been adopted: the representation of a text with all the information attached to the words is in the form of a finite automaton, more precisely a directed acyclic graph or DAG.

An example of application of this principle is the following. Let us consider the ambiguities present in the very simple French sentence:

La place compte pour son cousin (Space counts for her cousin)

when parsed by a computer programme which looks up each word in a dictionary:

-- the token 'la' can be a feminine singular determiner (#1 DET:fs), a pronoun (#2) or a masculine noun which represents the musical note A (#3 N:ms);

-- in the Larousse dictionary, the word 'place' has 30 meanings as a feminine noun (#1 to #30) and 20 meanings as the verb 'placer'; for each of these verbs, the token 'place' can represent a conjugated form in the present indicative or in the subjunctive (first person singular or third person singular: V:Pres1s, Pres3s, Sub1s, Sub3s) or in the second person of the imperative (V:Imp2s): nodes #31 to #130 in figure 1;

-- the word 'compte' is similar: it has 35 meanings as a masculine noun (#1 to #35), and 15 meanings as the verb 'compter'; for each verb 'compter', the form 'compte' stands for 5 conjugated forms (#36 to #110);

-- the token 'pour' is either a preposition (#1) or a masculine noun (#2);

-- the token 'son' can be a determiner (#1), a noun which means 'sound' (#2) or a noun which means 'bran' (#3);

-- the noun 'cousin' has two meanings: a cousin (#1) or a mosquito (#2).

The DAG in figure 1 represents the combination of all the possible interpretations that are obtained by reading the graph from left (the initial node) to right (the terminal node).
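As a rough illustration of this representation, the sketch below encodes the sentence as a DAG with one state per word boundary and one arc per dictionary analysis, then counts the left-to-right readings by dynamic programming. The encoding (states as integers, arcs labelled 'token#k') and the function names are assumptions made for the example; figure 1 may organise its nodes differently. The per-token counts are those listed above.

```python
from collections import defaultdict

# Number of dictionary analyses per token, as listed in the text.
ANALYSES = [("la", 3), ("place", 130), ("compte", 110),
            ("pour", 2), ("son", 3), ("cousin", 2)]


def build_dag(analyses):
    """Return (edges, final): edges[i] lists the (label, target) arcs of state i."""
    edges = defaultdict(list)
    for i, (token, n) in enumerate(analyses):
        for k in range(1, n + 1):
            edges[i].append((f"{token}#{k}", i + 1))
    return edges, len(analyses)


def count_paths(edges, final):
    """Count the paths from state 0 to the final state."""
    paths = defaultdict(int)
    paths[0] = 1
    for state in range(final):  # states are numbered in topological order
        for _label, target in edges[state]:
            paths[target] += paths[state]
    return paths[final]
```

Because every arc goes from state i to state i + 1, the number of paths is simply the product of the per-token counts; the dynamic-programming form is kept because it still works once compound words add arcs that skip over several states.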

At this step, before using the grammar rules, the total number of word-for-word translations would be: 3 x 130 x 110 x 2 x 3 x 2 = 514,800. Even a computer faced with such a combinatorial explosion would fail because of time or memory overflow. But this mechanical way of representing all the solutions is certainly an incorrect way of representing the sentence: no French speaker is aware of this degree of ambiguity, because each word occurs in a certain context which specifies its syntactic function and its meaning. In a more precise description, elementary grammar rules limit the number of possible combinations:

the token 'la' at the beginning of the sentence can only be the determiner,
the word following the determiner can only be a noun,
there must be a verb in the sentence, hence 'compte' can only be a verb. This verb is conjugated in the third person singular, since 'je' does not occur to the left of the verb and the sentence does not have an imperative structure. Since the token 'que' does not occur before the verb 'compte', the mood cannot be subjunctive,
the token 'pour' can only be a preposition,
the token 'son' can only be a determiner.

Hence, by using these constraints adequately, one can obtain the DAG of figure 2. The total number of combinations is now: 1 x 30 x 15 x 1 x 1 x 2 = 900.
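The arithmetic of this reduction can be checked with a small sketch. It does not implement the grammar rules themselves: it only encodes their outcome (the category kept at each position, and a single conjugated form for the verb). The category labels, the KEPT table and the function name are assumptions made for the example; the counts are those given in the text.

```python
from math import prod

# Analyses per token grouped by category, with the counts from the text.
SENTENCE = [
    ("la",     {"DET": 1, "PRO": 1, "N": 1}),
    ("place",  {"N": 30, "V": 100}),   # 20 verbs x 5 conjugated forms
    ("compte", {"N": 35, "V": 75}),    # 15 verbs x 5 conjugated forms
    ("pour",   {"PREP": 1, "N": 1}),
    ("son",    {"DET": 1, "N": 2}),
    ("cousin", {"N": 2}),
]

# Category retained at each position once the elementary rules are applied.
KEPT = ["DET", "N", "V", "PREP", "DET", "N"]
FORMS_PER_VERB = 5  # only one of the five forms (Pres3s) survives


def combinations(sentence, kept=None):
    """Product of the analyses per token, optionally after filtering."""
    counts = []
    for i, (_token, cats) in enumerate(sentence):
        if kept is None:
            counts.append(sum(cats.values()))
        else:
            n = cats[kept[i]]
            if kept[i] == "V":
                n //= FORMS_PER_VERB
            counts.append(n)
    return prod(counts)


print(combinations(SENTENCE, KEPT))  # 900
```

The remaining 900 readings are purely semantic: 30 meanings of the noun 'place', 15 meanings of the verb 'compter' and 2 meanings of 'cousin'.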

The linguistic information extracted from the dictionary and carried by the graph is largely standardized across the RELEX teams, but many questions about the formal representation are open. To a large extent, solutions and decisions depend on the role of this text representation inside the general parsing procedure. We intend to investigate various possibilities of disambiguating the graph of a sentence:

- by intersecting it with graphs of local grammars;

- by structuring and customizing the various subdictionaries.

In a parallel way, there should be linguistic work carried out at a systematic level which separates the meanings of each word providing at the same time contexts for their disambiguation. In annex 5, we give examples of detailed entries where contexts are classified according to their grammatical or lexical nature.

Two separate but parallel tasks are involved here:

TX: Formalization of ambiguous texts as analysed by dictionaries.

SM: Separating the meanings of words and characterizing the meanings by the context (lexical or syntactic) (cf. annex 5).

Heuristics for eliminating ambiguities will have to be developed. We intend to investigate two specific approaches in an experimental way:

- marking the dictionaries with 3 levels of plausibility of use for each word (current, less current, technical or rare),

- studying general rules such as 'longest matching', which applies when a given utterance (word or compound) is included in a longer utterance (e.g. tunnelling effect is included in tunnelling effect microscope).
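The 'longest matching' rule could be tried out along the following lines. This is only a sketch under simple assumptions: compounds are stored as tuples of simple words in a hypothetical set named COMPOUNDS, the text is already tokenized, and the greedy left-to-right strategy is one possible reading of the rule, not a decision of the project.

```python
# Hypothetical compound dictionary, stored as tuples of simple words.
COMPOUNDS = {
    ("tunnelling", "effect"),
    ("tunnelling", "effect", "microscope"),
}
MAX_LEN = max(len(c) for c in COMPOUNDS)


def longest_match(tokens, compounds=COMPOUNDS, max_len=MAX_LEN):
    """Segment tokens, preferring the longest compound at each position."""
    units, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word compounds.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in compounds:
                units.append(" ".join(candidate))
                i += n
                break
        else:
            units.append(tokens[i])  # no compound starts at this position
            i += 1
    return units
```

With this dictionary, the tokens of 'a tunnelling effect microscope' are segmented into 'a' and 'tunnelling effect microscope'; the shorter compound 'tunnelling effect' is selected only when 'microscope' does not follow.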
