Language & Information Lab.
Word Manager
Pius ten Hacken
This text was distributed as a handout at the DELIS end-of-project
workshop in 1995.
Abstract
Word Manager (WM) is a system for morphological dictionaries. It has been
developed by a group of six people, working in Basel, Lugano, and Amsterdam,
under the direction of Marc Domenig. This text briefly introduces
the purpose and general setup of the system.
The Lexical Bottleneck Problem
It is a well-known problem in NLP that many systems do not perform in practice
as well as they could with adequate dictionary resources, due to the cost
of production, adaptation, and maintenance of these resources. The contribution
of WM to a solution of this problem is based on an analysis of the lexical
recognition process as in Fig. 1:
Fig. 1: The WM-model of lexical recognition in NLP
The mapping between text words (strings found in a text) and the dictionary
entries of a particular NLP-system is divided into two parts, mediated
by the lexeme. Mapping 1 is the mapping between lexemes and forms occurring
in actual text. This mapping involves inflection and in some cases wordformation,
clitics, and multi-word units. Mapping 2 is the mapping between lexemes
and readings, fully specified for all information required by the system.
As can be seen in Fig. 1, the two mappings are independent. WM covers all
and only mapping 1. System-specific information is found only in mapping
2. Therefore the coverage of WM is such that all NLP-systems can use its
results as a well-defined part of the solution to the lexical bottleneck
problem.
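The division into two mappings can be illustrated with a minimal Python sketch. It is not WM code: the class Lexeme, the tables MAPPING_1 and PARSER_READINGS, the function recognize, and the reading fields are all hypothetical names chosen for illustration, and a toy lookup table stands in for the morphological rules that WM actually uses for mapping 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    lemma: str
    category: str


# Mapping 1 (the part WM covers): text words -> lexemes.
# A toy lookup table stands in for the morphological rules.
MAPPING_1 = {
    "goes": [Lexeme("go", "verb")],
    "went": [Lexeme("go", "verb")],
    "estimates": [Lexeme("estimate", "verb"), Lexeme("estimate", "noun")],
}

# Mapping 2 (system-specific): lexemes -> readings.
# The reading fields are hypothetical examples of what one client might need.
PARSER_READINGS = {
    Lexeme("go", "verb"): [{"frame": "intransitive"}],
    Lexeme("estimate", "verb"): [{"frame": "transitive"}],
    Lexeme("estimate", "noun"): [{"countable": True}],
}


def recognize(text_word, mapping_1, mapping_2):
    """Compose the two mappings: text word -> lexemes -> readings."""
    results = []
    for lexeme in mapping_1.get(text_word, []):
        for reading in mapping_2.get(lexeme, []):
            results.append((lexeme, reading))
    return results


for lexeme, reading in recognize("estimates", MAPPING_1, PARSER_READINGS):
    print(lexeme.lemma, lexeme.category, reading)
```

Because recognize takes the two tables as separate arguments, a different NLP-system could pass its own mapping 2 while reusing MAPPING_1 unchanged, which is the kind of independence the model relies on.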
Another important point of departure in the development of WM has been
that linguists and lexicographers should be supported as fully as possible
in expressing their expert knowledge in a natural way. This has resulted in specialized
user interfaces that show only information that is relevant to the particular
stage of coding, and in tools that recognize inconsistencies and potential
errors with a high success ratio.
The Client-Server Architecture
Fig. 2 represents the flow of information between the different components
of WM and the outside world. At the centre, there is a server,
mediating between linguists, lexicographers, external NLP-systems, and
databases. Different users can access the same databases in parallel. They
need not be at the same machine as the server, as long as they have a network
connection to it. The linguist's interface has the full authorization to
create, modify, and delete rules, entries, and databases. The lexicographer's
interface has a more limited authorization, with task-oriented support.
Client applications can access databases, but not change them. The system-specific
information they need in the dictionary (e.g. pronunciation) can be added
by a system of indices. The architecture described here, together with the
separation of general information from system-specific information,
guarantees reusability.
Fig. 2: The Client-Server model
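The three levels of authorization and the use of indices can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the WM server: the names Role, PERMISSIONS, Server, and the action strings are hypothetical, and the exact limits of the lexicographer's authorization (here: reading and writing entries only) are assumed, since the text only says they are more limited than the linguist's.

```python
from enum import Enum


class Role(Enum):
    LINGUIST = "linguist"
    LEXICOGRAPHER = "lexicographer"
    CLIENT = "client"


# Hypothetical action names; the exact limits of the lexicographer's
# authorization are an assumption of this sketch.
PERMISSIONS = {
    Role.LINGUIST: {"read", "write_entry", "write_rule", "delete_database"},
    Role.LEXICOGRAPHER: {"read", "write_entry"},
    Role.CLIENT: {"read"},
}


class Server:
    """Holds the shared database and checks every request against the role."""

    def __init__(self):
        self.entries = {}  # entry id -> lemma

    def request(self, role, action, entry_id, lemma=None):
        if action not in PERMISSIONS[role]:
            raise PermissionError(f"{role.value} may not {action}")
        if action == "read":
            return self.entries.get(entry_id)
        if action == "write_entry":
            self.entries[entry_id] = lemma
            return lemma
        # Rule and database actions are left out of this sketch.
        raise NotImplementedError(action)


server = Server()
server.request(Role.LEXICOGRAPHER, "write_entry", "e1", "go")

# A client keeps its own information in a separate index keyed by entry id,
# so the shared database itself is never modified by the client.
pronunciation_index = {"e1": "<pronunciation of 'go'>"}  # placeholder value

lemma = server.request(Role.CLIENT, "read", "e1")
print(lemma, pronunciation_index["e1"])
```

Keeping the client's data in its own index, rather than in the shared entries, is what lets several client applications use the same database without interfering with one another.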
The Development of a Database
The development of dictionaries in WM can be schematically represented
as follows:
Fig. 3: Development stages
In the first step, a linguist describes the system of inflection, wordformation,
clitics, and multi-word units of a language. The formalism for the former
two is described in detail in Domenig & ten Hacken (1992), the formalism
for the latter two in Pedrazzini (1994). The linguist's interface offers
the possibility of a specification → compilation → test cycle, so that
the coverage of the rules can gradually be extended and the formulation
improved where necessary after testing the results. In the compilation
step, syntactic and semantic consistency is checked. An example of a full
morphological rule database is the one for Italian described by Bopp (1993).
In addition, a complete rule database for German and a morphological rule
database for English have been developed.
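The cycle of specifying rules, compiling them, and testing the results can be pictured with a small Python sketch. It does not use the WM formalism: the rule layout, the names REQUIRED_SLOTS, compile_rules, generate, and run_tests, and the single consistency check (every rule must define an ending for each slot of its category) are assumptions made for illustration only.

```python
# Slots every rule of a category must define (hypothetical, for illustration).
REQUIRED_SLOTS = {"noun": ("sg", "pl")}


def compile_rules(specs):
    """Check each rule specification; raise ValueError on an inconsistency."""
    compiled = {}
    for name, spec in specs.items():
        required = REQUIRED_SLOTS.get(spec["category"])
        if required is None:
            raise ValueError(f"{name}: unknown category {spec['category']!r}")
        missing = [slot for slot in required if slot not in spec["endings"]]
        if missing:
            raise ValueError(f"{name}: missing endings for {missing}")
        compiled[name] = spec
    return compiled


def generate(rule, stem):
    """Apply a compiled rule to a stem and return all inflected forms."""
    return {slot: stem + ending for slot, ending in rule["endings"].items()}


def run_tests(rules, cases):
    """Return the (rule name, stem) pairs whose output is not as expected."""
    failures = []
    for rule_name, stem, expected in cases:
        rule = rules.get(rule_name)
        actual = None if rule is None else generate(rule, stem)
        if actual != expected:
            failures.append((rule_name, stem))
    return failures


SPECS = {"N-s": {"category": "noun", "endings": {"sg": "", "pl": "s"}}}
CASES = [
    ("N-s", "book", {"sg": "book", "pl": "books"}),
    ("N-es", "box", {"sg": "box", "pl": "boxes"}),
]

# First pass: the rule for "box" has not been specified yet, so one test fails.
print(run_tests(compile_rules(SPECS), CASES))

# The linguist extends the specification, recompiles, and tests again.
SPECS["N-es"] = {"category": "noun", "endings": {"sg": "", "pl": "es"}}
print(run_tests(compile_rules(SPECS), CASES))
```

The first test run reports the uncovered case and the second reports none, which mirrors how coverage is extended step by step after each round of testing.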
In the second step, a team of lexicographers adds entries to the rule
database, by linking each lexeme to a rule in the database. This task is
supported by a tailor-made lexicographer's interface, described below.
The result is a lexical database with morphological and phrasal information,
to which programmers, linguists, and lexicographers of client applications
may add system-specific information. A German dictionary database is currently
under development.
The Lexicographer's Interface
The task of the WM-lexicographer is to decide which entries will appear
in a dictionary, and what their internal structure and inflectional properties
are. In the development of the lexicographer's interface it has been a
guiding principle that the lexicographer is burdened as little as possible
by details of the rules in the rule database. In order to provide maximal
support to the lexicographer, we have divided the specification of new
entries into three types:
- For entries based on existing formatives, combined in a regular way (i.e.
  by regular wordformation rules as defined by the linguist), the interface
  proposes a number of analyses, and the lexicographer decides which one
  is correct.
- For regular stems not in the database, the lexicographer is guided to the
  correct inflection rule for the new stem.
- Irregular stems are specified by the linguist rather than by the lexicographer.
The last type only concerns a limited number of items, because once the
stems exist in the database the lexicographer can use them. Thus, "go" is
specified by the linguist, and "estimate" by the lexicographer. Once
these formatives and a rule for the prefix "under" exist, the system
will propose analyses for "undergo" and "underestimate", which
the lexicographer can confirm or change. In each case, the system will
generate the entire inflectional paradigm for inspection by the lexicographer.
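How the system might propose analyses from existing formatives can be sketched in Python. The data layout (STEMS, PREFIXES), the function names propose_analyses and paradigm, and the slot labels are hypothetical, and the sketch assumes that a prefixed verb simply inflects like its stem; the actual WM rules are more general than this.

```python
# Formatives already in the database (hypothetical data layout).
STEMS = {
    "go": {"base": "go", "3sg": "goes", "past": "went",
           "past_part": "gone", "pres_part": "going"},
    "estimate": {"base": "estimate", "3sg": "estimates", "past": "estimated",
                 "past_part": "estimated", "pres_part": "estimating"},
}
PREFIXES = ("under",)


def propose_analyses(word):
    """Return every (prefix, stem) split that uses existing formatives."""
    analyses = []
    for prefix in PREFIXES:
        stem = word[len(prefix):]
        if word.startswith(prefix) and stem in STEMS:
            analyses.append((prefix, stem))
    return analyses


def paradigm(prefix, stem):
    """Assumption of this sketch: the prefixed word inflects like its stem."""
    return {slot: prefix + form for slot, form in STEMS[stem].items()}


for word in ("undergo", "underestimate", "understand"):
    for prefix, stem in propose_analyses(word):
        print(word, "=", prefix, "+", stem, paradigm(prefix, stem))
```

In this sketch "understand" yields no proposal because no stem "stand" has been entered, so it would have to be handled as one of the other two types of entry.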
For the support of linguistic and lexicographic decisions concerning the
classification of constructions and individual items, ten Hacken (1994)
gives a system of tests and guidelines.
Conclusion
The WM system offers a solution to the mapping problem between text words
and dictionaries. Its client-server architecture enhances reusability,
because general information is separated from (client-)system-specific
types of information. The tailor-made interfaces for linguists and lexicographers
allow these experts to express their knowledge in a natural way. The compilation
checks consistency of the database. Thus, all conditions on a proper solution
mentioned in section 1 are fulfilled.
References
Domenig, Marc & ten Hacken, Pius (1992), Word Manager: A
System for Morphological Dictionaries, Olms, Hildesheim.
Bopp, Stephan (1993), Computerimplementation der italienischen
Flexions- und Wortbildungsmorphologie, Olms, Hildesheim.
Pedrazzini, Sandro (1994), Phrase Manager: A System for Phrasal
and Idiomatic Dictionaries, Olms, Hildesheim.
ten Hacken, Pius (1994), Defining Morphology: A Principled
Approach to Determining the Boundaries of Compounding, Derivation, and
Inflection, Olms, Hildesheim.