Language & Information Lab.

Word Manager

Pius ten Hacken


This text was distributed as a handout at the DELIS end-of-project workshop in 1995.



Abstract

Word Manager (WM) is a system for morphological dictionaries. It has been developed by a group of six people, working in Basel, Lugano, and Amsterdam, under the direction of Marc Domenig. This text briefly introduces the purpose and general setup of the system.

The Lexical Bottleneck Problem

It is a well-known problem in NLP that many systems do not perform in practice as well as they could with adequate dictionary resources, due to the cost of production, adaptation, and maintenance of these resources. The contribution of WM to a solution of this problem is based on an analysis of the lexical recognition process as in Fig. 1:

Fig. 1: The WM-model of lexical recognition in NLP

The mapping between text words (strings found in a text) and the dictionary entries of a particular NLP-system is divided into two parts, mediated by the lexeme. Mapping 1 is the mapping between lexemes and the forms occurring in actual text. It involves inflection and, in some cases, wordformation, clitics, and multi-word units. Mapping 2 is the mapping between lexemes and readings, which are fully specified for all information required by the system. As can be seen in Fig. 1, the two mappings are independent. WM covers all of mapping 1 and nothing else; system-specific information is found only in mapping 2. The coverage of WM is therefore such that any NLP-system can use its results as a well-defined part of the solution to the lexical bottleneck problem.
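The separation of the two mappings can be illustrated with a minimal Python sketch. It is not WM's formalism or data model: the names `Lexeme`, `MAPPING_1`, `MAPPING_2`, and `recognize`, the hand-listed forms, and the reading labels are all hypothetical and serve only to show that a client system can combine a shared mapping 1 with its own mapping 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    """Hypothetical stand-in for a lexeme: a lemma plus a word class."""
    lemma: str
    category: str


GO = Lexeme("go", "V")

# Mapping 1 (the part WM covers): text word -> lexemes.
# The forms are listed by hand here; in WM they follow from rules.
MAPPING_1 = {
    "go": [GO],
    "goes": [GO],
    "went": [GO],
    "gone": [GO],
    "going": [GO],
}

# Mapping 2 (system-specific): lexeme -> readings of one hypothetical client.
MAPPING_2 = {
    GO: ["go/motion", "go/become"],
}


def recognize(text_word):
    """Compose the two mappings: text word -> lexemes -> readings."""
    readings = []
    for lexeme in MAPPING_1.get(text_word.lower(), []):
        readings.extend(MAPPING_2.get(lexeme, []))
    return readings
```

Because `MAPPING_2` is consulted only after the lexeme has been found, a different client could replace it with its own readings without touching `MAPPING_1`.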

Another important point of departure in the development of WM has been that linguists and lexicographers are maximally supported in expressing their expert knowledge in a natural way. This has resulted in specialized user interfaces that show only information that is relevant to the particular stage of coding, and in tools that recognize inconsistencies and potential errors with a high success ratio.

The Client-Server Architecture

Fig. 2 represents the flow of information between the different components of WM and the outside world. At the centre, there is a server, mediating between linguists, lexicographers, external NLP-systems, and databases. Different users can access the same databases in parallel. They need not be at the same machine as the server, as long as they have a network connection to it. The linguist's interface has full authorization to create, modify, and delete rules, entries, and databases. The lexicographer's interface has a more limited authorization, with task-oriented support. Client applications can access databases, but not change them. The system-specific information they need in the dictionary (e.g. pronunciation) can be added by a system of indices. This architecture, together with the separation of general information from system-specific information, enhances reusability.
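The three levels of authorization can be sketched as a small permission table. This is an illustration only, not WM's implementation: the names `Role`, `Action`, `PERMISSIONS`, and `is_allowed` are hypothetical, and the assumption that the lexicographer may edit entries but not rules or databases is inferred from the description of the lexicographer's task rather than stated as such.

```python
from enum import Enum, auto


class Role(Enum):
    LINGUIST = auto()
    LEXICOGRAPHER = auto()
    CLIENT = auto()


class Action(Enum):
    READ = auto()
    EDIT_ENTRY = auto()
    EDIT_RULE = auto()
    EDIT_DATABASE = auto()


# Assumed permission table: full rights for the linguist, entry-level
# rights for the lexicographer, read-only access for client applications.
PERMISSIONS = {
    Role.LINGUIST: {Action.READ, Action.EDIT_ENTRY,
                    Action.EDIT_RULE, Action.EDIT_DATABASE},
    Role.LEXICOGRAPHER: {Action.READ, Action.EDIT_ENTRY},
    Role.CLIENT: {Action.READ},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS[role]
```

A server built along these lines would check `is_allowed` before carrying out a request, so that all three kinds of user can share the same database safely.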

Fig. 2: The Client-Server model

The Development of a Database

The development of dictionaries in WM can be schematically represented as follows:

Fig. 3: Development stages

In the first step, a linguist describes the system of inflection, wordformation, clitics, and multi-word units of a language. The formalism for the former two is described in detail in Domenig & ten Hacken (1992), the formalism for the latter two in Pedrazzini (1994). The linguist's interface offers the possibility of a specification → compilation → test cycle, so that the coverage of the rules can gradually be extended and the formulation improved where necessary after testing the results. In the compilation step, syntactic and semantic consistency is checked. An example of a full morphological rule database is the one for Italian described by Bopp (1993). In addition, a complete rule database for German and a morphological rule database for English have been developed.

In the second step, a team of lexicographers adds entries to the rule database, by linking each lexeme to a rule in the database. This task is supported by a tailor-made lexicographer's interface, described below. The result is a lexical database with morphological and phrasal information, to which programmers, linguists, and lexicographers of client applications may add system-specific information. A German dictionary database is currently under development.
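The idea of linking a lexeme to a rule can be sketched as follows. The sketch does not use WM's rule formalism (for that, see Domenig & ten Hacken 1992); the rule name `REGULAR_VERB_IN_E`, its feature labels, and the function `paradigm` are hypothetical, and the rule is reduced to a plain table of suffixes.

```python
# A toy inflection rule: feature label -> suffix added to the stem.
# It stands for the class of English verbs whose citation form ends in -e.
REGULAR_VERB_IN_E = {
    "base": "e",
    "3sg": "es",
    "past": "ed",
    "past_participle": "ed",
    "present_participle": "ing",
}


def paradigm(stem, rule):
    """Generate all inflected forms of a stem under the given rule."""
    return {feature: stem + suffix for feature, suffix in rule.items()}


# The lexicographer's contribution is only the link between stem and rule.
estimate_forms = paradigm("estimat", REGULAR_VERB_IN_E)
```

Under these assumptions `estimate_forms` contains estimate, estimates, estimated, and estimating; the lexicographer supplies one stem and one rule choice rather than listing each form.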

The Lexicographer's Interface

The task of the WM-lexicographer is to decide which entries will appear in a dictionary, and what their internal structure and inflectional properties are. In the development of the lexicographer's interface it has been a guiding principle that the lexicographer is burdened as little as possible by details of the rules in the rule database. In order to provide maximal support to the lexicographer, we have divided the specification of new entries into three types: The latter type only concerns a limited number of items, because once the stems exist in the database the lexicographer can use them. Thus, go is specified by the linguist, and estimate by the lexicographer. Once these formatives and a rule for the prefix under exist, the system will propose analyses for undergo and underestimate, which the lexicographer can confirm or change. In each case, the system will generate the entire inflectional paradigm for inspection by the lexicographer. For the support of linguistic and lexicographic decisions concerning the classification of constructions and individual items, ten Hacken (1994) gives a system of tests and guidelines.
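How a system might propose analyses for undergo and underestimate once go, estimate, and a rule for the prefix under exist can be sketched in a few lines. This is a simplified illustration, not WM's algorithm: `PARADIGMS`, `PREFIXES`, and `propose` are hypothetical names, and the paradigms are listed by hand instead of being generated by rules.

```python
# Paradigms of formatives assumed to be in the database already.
PARADIGMS = {
    "go": {
        "base": "go", "3sg": "goes", "past": "went",
        "past_participle": "gone", "present_participle": "going",
    },
    "estimate": {
        "base": "estimate", "3sg": "estimates", "past": "estimated",
        "past_participle": "estimated", "present_participle": "estimating",
    },
}

# Prefixes for which a (toy) wordformation rule exists.
PREFIXES = ["under"]


def propose(word):
    """Propose prefix + base analyses, each with its inherited paradigm."""
    proposals = []
    for prefix in PREFIXES:
        base = word[len(prefix):]
        if word.startswith(prefix) and base in PARADIGMS:
            forms = {feature: prefix + form
                     for feature, form in PARADIGMS[base].items()}
            proposals.append(
                {"prefix": prefix, "base": base, "paradigm": forms})
    return proposals
```

For undergo the sketch proposes the analysis under + go and derives underwent and undergone from the paradigm of go. For a word whose remainder is not in the database it proposes nothing, which corresponds to the case where the lexicographer has to decide.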

Conclusion

The WM system offers a solution to the mapping problem between text words and dictionaries. Its client-server architecture enhances reusability, because general information is separated from (client-)system-specific types of information. The tailor-made interfaces for linguists and lexicographers allow these experts to express their knowledge in a natural way. The compilation step checks the consistency of the database. Thus, all conditions on a proper solution mentioned in the first section are fulfilled.

References

Domenig, Marc & ten Hacken, Pius (1992), Word Manager: A System for Morphological Dictionaries, Olms, Hildesheim.

Bopp, Stephan (1993), Computerimplementation der italienischen Flexions- und Wortbildungsmorphologie, Olms, Hildesheim.

Pedrazzini, Sandro (1994), Phrase Manager: A System for Phrasal and Idiomatic Dictionaries, Olms, Hildesheim.

ten Hacken, Pius (1994), Defining Morphology: A Principled Approach to Determining the Boundaries of Compounding, Derivation, and Inflection, Olms, Hildesheim.
