Skip to main content
   
Andrew Roberts @ School of Computing
Roberts, Andrew
Automatic Acquisition of Word Classification using Distributional Analysis of Content Words with Respect to Function Words.
School of Computing, University of Leeds, United Kingdom
ps.gz | pdf | bibtex entry

Abstract

This project describes a method which can automatically infer word classification. Previous systems designed to assign parts-of-speech to words sought the use of training data or were built upon rules devised by experts in linguistics. The report details the use of an unsupervised approach that can reduce significantly the reliance on prior linguistic intuition.

The study looks in to how words behave relative to the function words. As these are the most common words, there is a great deal of information that can be attained. It was possible to analyse how the content words from a given body of text were distributed with respect to the function words. This information could be used as a profile, and therefore content words with a similar profile against the function words could be assumed to be of similar word class.

Agglomerative hierarchical clustering techniques were applied to partition words into different clusters. Words that were deemed similar were grouped together, and thus, each cluster should contain words that posses the same part-of-speech.

This project performed many experiments to investigate how the many factors affected the overall clustering performance, in order to find the optimal parameters. The results report an accuracy of 87% when performed on the LOB corpus. Experiments were also carried out with an alternative Spanish corpus and the clustering accuracy achieved 85%. Semantic clustering was also observed indicating the effectiveness of the described approach for the task of automatically acquiring word classification.