An approach to Catalan Adjective Lexical Classes by Clustering

Laura Alonso and Gemma Boleda

Universitat de Barcelona and Universitat Pompeu Fabra

 

The aim of our work is to establish lexical classes of adjectives for Catalan, seeking empirical grounds to validate the classes that have been proposed in the literature. To do that, we have carried out a number of experiments with clustering techniques.

 

Work in theoretical linguistics (see e.g. [Hamann 1991]; see [Boleda 2001] for Catalan) has often distinguished three classes of adjectives, which in Catalan have the following characteristics:

- relative: typically denominal, non-gradable and directly postnominal

- qualitative: typically scalar, gradable, can occur pre- and post-nominally, as well as predicatively

- non-predicative: non-gradable, only prenominal

 

Our hypotheses are the following:

- adjectives do not behave uniformly in naturally occurring text, but a number of classes can be identified on the basis of distinct syntactic behaviour

- these behaviours can be identified and characterized by way of data-driven methodologies, such as clustering techniques

- generalizations about the syntactic behaviour of adjectives increase the descriptive adequacy of their lexical description; this improvement can be tested via implementation in knowledge-requiring NLP tasks, such as MT, WSD, etc.

 

We have clustered a set of 3754 adjectives, each described by its contexts of occurrence in a 1.3-million-word corpus. The use of a bigger corpus has been precluded by the difficulty of gathering large amounts of text in Catalan, a minorised language. The starting set of features considered useful for distinguishing the lexical class of adjectives can be seen in Table 1. The software used to perform the cluster analysis was CLUTO ([Karypis 2002]), a stand-alone tool for clustering both high- and low-dimensional datasets and for analyzing the characteristics of the obtained clusters. We used a partitional algorithm for cluster discovery, and the optimal clustering criterion function was found to be H2, a combination of internal and external clustering criteria ([Zhao and Karypis 2001]).
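The H2 criterion can be illustrated with a minimal sketch. It assumes the formulation in [Zhao and Karypis 2001], where the internal criterion I2 sums the norms of the cluster composite vectors (to be maximized), the external criterion E1 weights each cluster's similarity to the whole collection by cluster size (to be minimized), and H2 is their ratio I2/E1. The function names below are hypothetical and are not part of CLUTO; the sketch only scores a given partition and does not perform the optimization itself.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def composite(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def h2_criterion(vectors, labels):
    """Return (I2, E1, H2) for a hard clustering of non-negative vectors."""
    units = [normalize(v) for v in vectors]
    total = composite(units)  # composite vector of the whole collection
    clusters = {}
    for unit, label in zip(units, labels):
        clusters.setdefault(label, []).append(unit)
    i2 = 0.0
    e1 = 0.0
    for members in clusters.values():
        d_r = composite(members)  # composite vector of this cluster
        norm_r = math.sqrt(dot(d_r, d_r))
        i2 += norm_r
        if norm_r:
            e1 += len(members) * dot(d_r, total) / norm_r
    h2 = i2 / e1 if e1 else float("inf")
    return i2, e1, h2
```

A partition that groups similar vectors together should receive a higher H2 value than one that mixes them, which is the property a partitional algorithm exploits when it searches for the best assignment.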

 

By combining the various adjective-describing features, several clustering solutions have been obtained (see Table 2). The interpretation tools of CLUTO elicited the most discriminating features for each solution, so that feature configurations could be progressively refined to achieve better classifications. The quality of the classifications was assessed primarily by cluster quality, in terms of cluster homogeneity. Secondarily, precision and recall measures were obtained by comparison with a group of 47 adjectives that prototypically represent the classes proposed in the literature.
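The comparison with the prototypical adjectives can be sketched as follows. This is only one plausible way of computing the measures, under the assumption that each cluster is first identified with one of the proposed classes and that only the gold-standard adjectives are scored; the function name, the argument names and the mapping from clusters to classes are hypothetical.

```python
def precision_recall(cluster_of, gold_class_of, cluster_to_class):
    """Per-class (precision, recall), scored over gold-standard adjectives only.

    cluster_of:       adjective -> cluster id assigned by the clustering
    gold_class_of:    adjective -> class label in the gold standard
    cluster_to_class: cluster id -> class label the cluster is taken to represent
    """
    scores = {}
    for cls in set(gold_class_of.values()):
        # Gold adjectives that the clustering places in a cluster of this class.
        predicted = {
            adj for adj in gold_class_of
            if cluster_to_class.get(cluster_of.get(adj)) == cls
        }
        # Gold adjectives that really belong to this class.
        actual = {adj for adj, gold in gold_class_of.items() if gold == cls}
        hits = len(predicted & actual)
        precision = hits / len(predicted) if predicted else 0.0
        recall = hits / len(actual) if actual else 0.0
        scores[cls] = (precision, recall)
    return scores
```

Under this reading, a class with recall 1 and precision below 1 corresponds to a cluster that contains all the prototypical members of the class together with prototypical members of other classes.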

 

Results confirm the initial 3-class division of adjectives sketched in [Boleda 2001] specifically for Catalan. However, in all the attribute configurations, CLUTO consistently succeeded in distinguishing non-predicative adjectives from the rest, but performed poorly in distinguishing between relative and qualitative adjectives (except for a subset of them; see below), which are the majority classes. One of the reasons for this poor performance is probably the sparseness affecting most of the features that discriminate qualitative adjectives (comparativity, gradability, predicative nature), as well as the features accounting for the syntactic categories of the words before and after each adjective.

 

The features that best described the obtained clusters were the ones signalling a certain syntactic relation, such as modifying function or predicative/attributive use, while features representing the bare context of occurrence of the adjective, like the preceding or following word, were found to be detrimental. This can be seen from the fact that when only bare context features were considered for clustering (Solution 3), cluster quality was much lower than when they were not taken into account at all (Solution 2). However, we believe a bigger corpus will provide a better representation of these features.

 

CONCLUSIONS

In this work, we have exploited clustering, a methodology that has been widely applied in AI and specifically in NLP, for knowledge discovery in an area of linguistics which is relatively unexplored from a formal perspective, namely adjective classification. We have shown that this technique partially succeeds in motivating adjective lexical classes that had been proposed in the literature. Once established, these classes improve the adequacy of the description of these items in a lexicon, which is arguably useful for reducing the ambiguity (both morphosyntactic and semantic) of a lexical item in context. The adequacy of this information can therefore be tested by the improvements it yields in NLP tasks. Further refinements and enhancements in the methodology will presumably improve results.

 

Future work includes enhancing this initial experiment both in the number of adjectives and in the number of describing features. Concerning the first, the construction of a bigger corpus (ca. 7 million words) is in progress, which will provide more representative data for each of the adjectives. A reduction in the error rate of the corpus tagging process (currently ca. 6%) will also be pursued. As for the second, we plan to refine the features already considered and to include further information, mainly on derivational morphology and selectional restrictions.

 

Last, we are planning to take advantage of clustering based on rules [Gibert et al. 1998], a clustering technique specifically oriented to ill-structured domains, such as natural language. This will enable us to maximize the information contained in useful features that are sparse in nature.

 

Table 1: Attributes for adjective characterisation and clustering

(attribute values are the percentage of occurrences of the adjective with that attribute)

Attribute                                                   Used in Solution(s)

Obtained from morphological analysis of the corpus by the Catalan CG [Alsina et al. 2002]
  the adjective is a modifier of a noun at the right        1, 2, 4
  the adjective is a modifier of a noun at the left         1, 2, 4
  the adjective has a possible function as attribute        1, 2, 4
  the adjective has a possible function as a predicate      1, 2, 4
  occurrences in a form different from masculine singular   1, 2
  appreciative                                              1, 2
  syntactical category of preceding word (13 attributes)    1, 3, 4
  syntactical category of following word (13 attributes)    1, 3, 4

Contextual
  preceded by a definite article                            1, 2, 4
  preceded by an indefinite article                         1, 2, 4
  followed by the preposition 'de' ('of')                   1, 2, 4
  comparative grade                                         4
  graduated                                                 4
  occurs at any time as graduated or comparative            4
  occurs in coordination with another adjective             4
  preceded by the verb 'estar' ('be')                       4
  preceded by the verb 'ser' ('be')                         4
  preceded by an attributive verb                           4
  occurrences at the end of sentence                        4
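Since the attribute values in Table 1 are percentages of an adjective's occurrences, the construction of the feature vectors can be sketched as below. This is a minimal illustration that assumes the corpus has already been reduced to one record per adjective occurrence, listing the attributes observed for that occurrence; the function name and the attribute labels used with it are hypothetical.

```python
from collections import Counter, defaultdict

def attribute_percentages(occurrences, attributes):
    """Build one percentage vector per adjective.

    occurrences: iterable of (adjective, set of attributes seen in that occurrence)
    attributes:  ordered list of attribute names defining the vector dimensions
    """
    totals = Counter()              # number of occurrences per adjective
    counts = defaultdict(Counter)   # adjective -> attribute -> occurrence count
    for adjective, observed in occurrences:
        totals[adjective] += 1
        for attribute in observed:
            counts[adjective][attribute] += 1
    return {
        adjective: [
            100.0 * counts[adjective][attribute] / totals[adjective]
            for attribute in attributes
        ]
        for adjective in totals
    }
```

Restricting the `attributes` list to the rows marked for a given solution in Table 1 would yield the feature configuration used for that solution.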

 

Table 2: Obtained clustering solutions

Adjective Class   Cluster homogeneity                Cluster distinctivity              Recall  Precision
                  (internal similarity / deviation)  (external similarity / deviation)

Solution 1
Non-Predicative   .80 / .08                          .49 / .07                          1       1
Relative          .51 / .13                          .35 / .18                          .85     .50
Qualitative       .58 / .10                          .53 / .15                          .45     .83

Solution 2
Non-Predicative   .85 / .09                          .61 / .10                          1       .50
Relative          .82 / .11                          .60 / .13                          .71     .71
Qualitative       .30 / .15                          .14 / .21                          .72     .88

Solution 3
Non-Predicative   .75 / .11                          .35 / .08                          1       1
Relative          .43 / .15                          .23 / .16                          .85     .50
Qualitative       .52 / .12                          .38 / .16                          .45     .83

Solution 4
Non-Predicative   .71 / .10                          .33 / .08                          1       .66
Relative          .46 / .12                          .23 / .18                          .71     .71
Qualitative       .45 / .12                          .37 / .17                          .72     .80
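The homogeneity and distinctivity columns of Table 2 can be read with the help of the following sketch. It assumes cosine similarity between adjective vectors, and takes the internal value of an adjective to be its average similarity to the other members of its cluster and the external value to be its average similarity to the adjectives outside the cluster, reporting the mean and standard deviation of each over the cluster. CLUTO's own computation may differ in detail (for instance in whether an object's similarity to itself is counted), and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if not norm_a or not norm_b:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def mean_and_deviation(values):
    """Mean and population standard deviation, or (0.0, 0.0) for no values."""
    return (mean(values), pstdev(values)) if values else (0.0, 0.0)

def cluster_similarity_stats(vectors, labels, cluster):
    """Return ((internal mean, deviation), (external mean, deviation)) for a cluster."""
    members = [v for v, label in zip(vectors, labels) if label == cluster]
    others = [v for v, label in zip(vectors, labels) if label != cluster]
    internal, external = [], []
    for i, vec in enumerate(members):
        peers = members[:i] + members[i + 1:]  # same cluster, excluding vec itself
        if peers:
            internal.append(mean([cosine(vec, p) for p in peers]))
        if others:
            external.append(mean([cosine(vec, o) for o in others]))
    return mean_and_deviation(internal), mean_and_deviation(external)
```

On this reading, a well-formed cluster shows a high internal similarity together with a low external similarity, and small deviations indicate that its members behave uniformly.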

 

References

 

[Alsina et al. 2002] Alsina, À., et al. (2002) CATCG: a general purpose parsing tool applied, in Proceedings of Third International Conference on Language Resources and Evaluation, Las Palmas, 29-31 May 2002

[Boleda 2001] Boleda, G. (2001) Sobre el(s) tipus semàntic(s) dels adjectius, ms., Universitat Pompeu Fabra

[Gibert et al. 1998] Gibert, K. et al. (1998) Knowledge discovery with clustering based on rules: Interpreting results. Principles of Data Mining and Knowledge Discovery. Springer-Verlag

[Hamann 1991] Hamann, C. (1991) Adjektivsemantik / Adjectival Semantics, in von Stechow, A. and D. Wunderlich (1991) Semantik/Semantics. Ein internationales Handbuch der zeitgenössischen Forschung. An International Handbook of Contemporary Research, Berlin/NY: de Gruyter, 657-673

[Karypis 2002] Karypis, G. (2002) CLUTO, http://www-users.cs.umn.edu/~karypis/cluto/

[Zhao and Karypis 2001] Zhao, Y. and Karypis, G. (2001) Criterion Functions for Document Clustering: Experiments and Analysis, University of Minnesota, Technical Report 01-40