Laura Alonso and Gemma Boleda
Universitat de Barcelona and Universitat Pompeu Fabra
The aim of our work is to establish lexical classes of adjectives for Catalan, seeking empirical grounds to validate the classes that have been proposed in the literature. To that end, we have carried out a number of experiments with clustering techniques.
Work in theoretical linguistics (see e.g. [Hamann 1991]; see [Boleda 2001] for Catalan) has often distinguished three classes of adjectives, which in Catalan have the following characteristics:
- relative: typically denominal, non-gradable and directly postnominal
- qualitative: typically scalar, gradable, can occur pre- and post-nominally, as well as predicatively
- non-predicative: non-gradable, only prenominal
Our hypotheses are the following:
- adjectives do not behave uniformly in naturally occurring text, but a number of classes can be identified on the basis of distinct syntactic behaviour
- these behaviours can be identified and characterized by way of data-driven methodologies, like clustering techniques
- generalizations on the syntactic behaviour of adjectives increase the descriptive adequacy of the lexical description of adjectives; this improvement can be tested via implementation in knowledge-requiring NLP tasks, such as machine translation (MT) and word sense disambiguation (WSD)
We have clustered a set of 3754 adjectives, described by their context of occurrence in a 1.3-million-word corpus. The use of a bigger corpus has been precluded by the difficulty of gathering large amounts of text in Catalan, a minoritized language. The starting set of features considered useful for distinguishing the lexical class of adjectives can be seen in Table 1. The software used to perform the cluster analysis was CLUTO [Karypis 2002], a stand-alone tool for clustering both high- and low-dimensional datasets and for analyzing the characteristics of the obtained clusters. We used a partitional algorithm for cluster discovery, and the clustering criterion function that performed best was H2, a hybrid function that combines an internal and an external clustering criterion [Zhao and Karypis 2001].
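To make the H2 criterion more concrete, the following is a minimal sketch of how such a score could be computed for a given partition. It assumes the definitions of [Zhao and Karypis 2001] as we understand them: vectors are scaled to unit length, similarity is the cosine, the internal criterion I2 sums the norms of the per-cluster composite vectors, the external criterion E1 weights each cluster's similarity to the whole collection by cluster size, and H2 is the ratio I2/E1 (higher is better). The function names and the toy vectors are illustrative assumptions; this is not CLUTO's implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def composite(vectors):
    """Component-wise sum of a list of vectors."""
    return [sum(col) for col in zip(*vectors)]


def h2_score(vectors, labels):
    """H2 = I2 / E1 for a partition (assumed definitions, see lead-in)."""
    unit = [normalize(v) for v in vectors]
    total = composite(unit)  # composite vector of the whole collection

    clusters = {}
    for v, lab in zip(unit, labels):
        clusters.setdefault(lab, []).append(v)

    i2 = 0.0  # internal criterion: sum of composite-vector norms
    e1 = 0.0  # external criterion: size-weighted similarity to the collection
    for members in clusters.values():
        d_r = composite(members)
        d_norm = math.sqrt(sum(x * x for x in d_r))
        if d_norm == 0:
            continue  # a cluster of all-zero vectors contributes nothing
        i2 += d_norm
        e1 += len(members) * sum(a * b for a, b in zip(d_r, total)) / d_norm
    return i2 / e1


# Toy data (hypothetical): two adjectives that load on the first feature,
# two that load on the second.
vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
good = h2_score(vectors, [0, 0, 1, 1])   # groups similar vectors together
mixed = h2_score(vectors, [0, 1, 0, 1])  # splits similar vectors apart
print(good > mixed)
```

In this toy case the partition that keeps similar vectors together receives the higher score, which is the behaviour a partitional algorithm optimizing this criterion would favour.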
By combining the various adjective-describing features, several clustering solutions have been obtained (see Table 2). The interpretation tools of CLUTO identified the most discriminating features for each solution, so that feature configurations could be progressively refined to achieve better classifications. The quality of each classification was assessed primarily by cluster quality, measured as cluster homogeneity. Secondarily, precision and recall measures were obtained by comparison with a group of 47 adjectives that prototypically represent the classes proposed in the literature.
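The comparison against the prototypical adjectives can be sketched as follows. This is a minimal illustration under assumptions that the text does not spell out: each cluster is taken to have been mapped to one class, precision is computed only over the gold-standard adjectives that fall into the cluster, and recall over all gold-standard adjectives of that class. The identifiers `q1`, `r1`, etc. are placeholders, not real adjectives from the experiment.

```python
def precision_recall(cluster_members, gold_labels, target_class):
    """Precision and recall of one cluster with respect to one gold class.

    cluster_members: set of adjectives assigned to the cluster
    gold_labels:     dict mapping each gold-standard adjective to its class
    target_class:    the class the cluster is taken to represent
    """
    # Only adjectives with a gold label can be evaluated.
    gold_in_cluster = [a for a in cluster_members if a in gold_labels]
    correct = sum(1 for a in gold_in_cluster if gold_labels[a] == target_class)
    gold_total = sum(1 for c in gold_labels.values() if c == target_class)

    precision = correct / len(gold_in_cluster) if gold_in_cluster else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return precision, recall


# Hypothetical gold standard and cluster, for illustration only.
gold = {
    "q1": "qualitative", "q2": "qualitative",
    "r1": "relative", "r2": "relative",
    "n1": "non-predicative",
}
cluster = {"q1", "r1", "r2", "x9"}  # "x9" has no gold label and is ignored

p, r = precision_recall(cluster, gold, "relative")
print(round(p, 2), round(r, 2))
```

Under this reading, a cluster that captures every prototypical member of a class but also attracts members of other classes shows high recall and low precision.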
Results partially confirm the initial 3-class division of adjectives sketched in [Boleda 2001] specifically for Catalan. In all the attribute configurations, CLUTO consistently succeeded in distinguishing non-predicative adjectives from the rest, but performed poorly in distinguishing between relative and qualitative adjectives (except for a subset of them; see below), which are the majority classes. One of the reasons for this poor performance is probably the sparseness affecting most of the features that discriminate qualitative adjectives (comparativity, gradability, predicative nature), as well as the features accounting for the syntactic categories of the words before and after each adjective.
The features that best described the obtained clusters were the ones signalling a certain syntactic relation, such as modifying function or predicative/attributive use, while features representing the bare context of occurrence of the adjective, like the preceding or following word, were found to introduce noise. This can be seen from the fact that when only bare context features were considered for clustering (Solution 3), cluster quality was much lower than when they were not taken into account at all (Solution 2). However, we believe a bigger corpus will provide a better representation of these features.
CONCLUSIONS
In this work, we have applied clustering, a methodology widely used in AI and specifically in NLP, to knowledge discovery in an area of linguistics that is relatively unexplored from a formal perspective, namely adjective classification. We have shown that this technique partially succeeds in motivating adjective lexical classes that had been proposed in the literature. Once established, these classes improve the adequacy of the description of these items in a lexicon, which is arguably useful for reducing the ambiguity (both morphosyntactic and semantic) of a lexical item in context. The adequacy of this information can therefore be tested by the improvements it yields in NLP tasks. Further refinements and enhancements in the methodology will presumably improve results.
Future work includes enhancing this initial experiment both in the number of adjectives and in the number of describing features. Concerning the first, the construction of a bigger corpus is in progress (ca. 7 million words), which will provide more representative data for each of the adjectives. A reduction in the error rate of the corpus tagging process (currently ca. 6%) will also be pursued. As for the second, we plan to refine the features already considered and to include further information, mainly on derivational morphology and selectional restrictions.
Last, we are planning to take advantage of clustering based on rules [Gibert et al. 1998], a clustering technique specifically oriented to ill-structured domains, such as natural language. This will enable us to maximize the information contained in useful features that are sparse in nature.
Table 1: Attributes for adjective characterisation and clustering (attribute values are the percentage of occurrences of the adjective with that attribute)
| Obtained from morphological analysis of the corpus by the Catalan CG [Alsina et al. 2002] | Used in Solution |
|---|---|
| the adjective is a modifier of a noun at the right | 1, 2, 4 |
| the adjective is a modifier of a noun at the left | 1, 2, 4 |
| the adjective has a possible function as attribute | 1, 2, 4 |
| the adjective has a possible function as a predicate | 1, 2, 4 |
| occurrences in a form different from masculine singular | 1, 2 |
| appreciative | 1, 2 |
| syntactical category of preceding word (13 attributes) | 1, 3, 4 |
| syntactical category of following word (13 attributes) | 1, 3, 4 |
| Contextual | |
| preceded by a definite article | 1, 2, 4 |
| preceded by an indefinite article | 1, 2, 4 |
| followed by the preposition 'de' ('of') | 1, 2, 4 |
| comparative grade | 4 |
| graduated | 4 |
| occurs at any time as graduated or comparative | 4 |
| occurs in coordination with another adjective | 4 |
| preceded by the verb 'estar' ('be') | 4 |
| preceded by the verb 'ser' ('be') | 4 |
| preceded by an attributive verb | 4 |
| occurrences at the end of sentence | 4 |
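As the caption of Table 1 states, each attribute value is the percentage of occurrences of the adjective that show that attribute. A minimal sketch of how such a vector could be built from annotated corpus occurrences is given below; the attribute names (`noun_right`, `noun_left`, `predicate`) and the input format are hypothetical simplifications, not the format actually produced by the Catalan CG.

```python
from collections import Counter


def percentage_features(occurrences, attributes):
    """Turn per-occurrence annotations into a percentage vector.

    occurrences: list of sets; each set holds the attributes observed
                 for one corpus occurrence of the adjective
    attributes:  ordered list of attribute names (the vector dimensions)
    """
    wanted = set(attributes)
    counts = Counter()
    for observed in occurrences:
        counts.update(observed & wanted)  # count each attribute once per occurrence
    total = len(occurrences)
    return [100.0 * counts[a] / total if total else 0.0 for a in attributes]


# Hypothetical annotations for four occurrences of one adjective.
attributes = ["noun_right", "noun_left", "predicate"]
occurrences = [
    {"noun_right"},
    {"noun_right"},
    {"predicate"},
    {"noun_left", "predicate"},
]
print(percentage_features(occurrences, attributes))  # [50.0, 25.0, 50.0]
```

Using percentages rather than raw counts keeps frequent and infrequent adjectives comparable, although for adjectives with very few occurrences the resulting values are necessarily coarse.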
Table 2: Obtained clustering solutions

| Adjective Class | Cluster homogeneity (internal similarity / deviation) | Cluster distinctivity (external similarity / deviation) | Recall | Precision |
|---|---|---|---|---|
| Solution 1 | | | | |
| Non-Predicative | .80 / .08 | .49 / .07 | 1 | 1 |
| Relative | .51 / .13 | .35 / .18 | .85 | .50 |
| Qualitative | .58 / .10 | .53 / .15 | .45 | .83 |
| Solution 2 | | | | |
| Non-Predicative | .85 / .09 | .61 / .10 | 1 | .50 |
| Relative | .82 / .11 | .60 / .13 | .71 | .71 |
| Qualitative | .30 / .15 | .14 / .21 | .72 | .88 |
| Solution 3 | | | | |
| Non-Predicative | .75 / .11 | .35 / .08 | 1 | 1 |
| Relative | .43 / .15 | .23 / .16 | .85 | .50 |
| Qualitative | .52 / .12 | .38 / .16 | .45 | .83 |
| Solution 4 | | | | |
| Non-Predicative | .71 / .10 | .33 / .08 | 1 | .66 |
| Relative | .46 / .12 | .23 / .18 | .71 | .71 |
| Qualitative | .45 / .12 | .37 / .17 | .72 | .80 |
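The homogeneity and distinctivity columns of Table 2 can be read with the help of the following minimal sketch. It assumes that internal similarity is the average cosine similarity between the objects of a cluster, that external similarity is the average cosine similarity between the objects of the cluster and all remaining objects, and that the deviations are standard deviations of the per-object averages. The function names and the toy vectors are illustrative assumptions and do not reproduce CLUTO's code or the figures in the table.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_similarity(vectors, labels, target):
    """Internal/external similarity statistics for the cluster `target`."""
    inside = [v for v, lab in zip(vectors, labels) if lab == target]
    outside = [v for v, lab in zip(vectors, labels) if lab != target]
    if not inside:
        raise ValueError("no objects carry the target label")

    # Per-object average similarity to the cluster (self-similarity included)
    internal = [mean(cosine(v, u) for u in inside) for v in inside]
    # Per-object average similarity to everything outside the cluster
    external = (
        [mean(cosine(v, u) for u in outside) for v in inside]
        if outside else [0.0]
    )
    return {
        "isim": mean(internal), "isdev": pstdev(internal),
        "esim": mean(external), "esdev": pstdev(external),
    }


# Toy data (hypothetical): a tight cluster 0 and a single outside object.
stats = cluster_similarity([[1, 0], [1, 0], [0, 1]], [0, 0, 1], target=0)
print(stats["isim"], stats["esim"])
```

Under these assumptions, a well-formed cluster combines a high internal similarity with a low external similarity, which is the pattern to look for when comparing the solutions in Table 2.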
[Alsina et al. 2002] Alsina, À., et al. (2002) CATCG: a general purpose parsing tool applied, in Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas, 29-31 May 2002
[Boleda 2001] Boleda, G. (2001) Sobre el(s) tipus semàntic(s) dels adjectius, ms., Universitat Pompeu Fabra
[Gibert et al. 1998] Gibert, K., et al. (1998) Knowledge discovery with clustering based on rules: Interpreting results, in Principles of Data Mining and Knowledge Discovery, Springer-Verlag
[Hamann 1991] Hamann, C. (1991) Adjectivsemantik / Adjectival Semantics, in von Stechow, A. and D. Wunderlich (1991) Semantik/Semantics. Ein internationales Handbuch der zeitgenössischen Forschung. An International Handbook of Contemporary Research, Berlin/NY: de Gruyter, 657-673
[Karypis 2002] Karypis, G. (2002) CLUTO, http://www-users.cs.umn.edu/~karypis/cluto/
[Zhao and Karypis 2001] Zhao, Y., and Karypis, G. (2001) Criterion Functions for Document Clustering: Experiments and Analysis, University of Minnesota, Technical Report 01-40