Krister LINDÉN

PUBLICATIONS
University of Helsinki
Department of General Linguistics
No. 37

P.O. Box 9
FIN-00014 University of Helsinki
Finland



© Krister Lindén


ISSN 0355-7170

ISBN 952-10-2471-2 (bound)

ISBN 952-10-2472-0 (PDF)


Helsinki 2005
Helsinki University Press

Suit the action to the word, the word to the action.

William Shakespeare (c. 1600-01), Hamlet iii

Abstract

The work is based on the assumption that words with similar syntactic usage have similar meaning, which was proposed by Zellig S. Harris (1954, 1968). We study his assumption from two aspects: firstly, different meanings (word senses) of a word should manifest themselves in different usages (contexts), and secondly, similar usages (contexts) should lead to similar meanings (word senses).

If we start with the different meanings of a word, we should be able to find distinct contexts for the meanings in text corpora. We separate the meanings by grouping and labeling contexts in an unsupervised or weakly supervised manner (Publications 1, 2 and 3). We are confronted with the question of how best to represent contexts in order to induce effective classifiers of contexts, because differences in context are the only means we have to separate word senses.

If we start with words in similar contexts, we should be able to discover similarities in meaning. We can do this monolingually or multilingually. In the monolingual material, we find synonyms and other related words in an unsupervised way (Publication 4). In the multilingual material, we find translations by supervised learning of transliterations (Publication 5). In both the monolingual and multilingual case, we first discover words with similar contexts, i.e., synonym or translation lists. In the monolingual case we also aim at finding structure in the lists by discovering groups of similar words, e.g., synonym sets.

In this introduction to the publications of the thesis, we consider the larger background issues of how meaning arises, how it is quantized into word senses, and how it is modeled. We also consider how to define, collect and represent contexts. We discuss how to evaluate the trained context classifiers and discovered word sense classifications, and finally we present the word sense discovery and disambiguation methods of the publications.

This work supports Harris’ hypothesis by implementing three new methods modeled on his hypothesis. The methods have practical consequences for creating thesauruses and translation dictionaries, e.g., for information retrieval and machine translation purposes.

–  –  –

One could say that the work on this thesis has its roots in my childhood when I was 9 years old. I had already made two significant decisions in my life. I wanted to be a scientist and I intended to invent a speech translation device so that people wouldn’t have to learn foreign languages. At that time PCs and mobile phones were still unknown. I already spoke two languages and was learning English at school, but it was the prospect of having to move to Finland and learn a fourth and radically different language, Finnish, that begot these lofty ideas.

Language Technology as a subject had not yet been invented when I began my studies at the Computer Science Department at Helsinki University. After graduation I joined the Department of Linguistics, where I was involved in an English-to-Finnish Machine Translation project under the supervision of Dr. Lauri Carlson. His vast knowledge of both applied and formal linguistics combined with his down-to-earth remarks paved my way to linguistics. During that period Dr. Kimmo Koskenniemi became the first professor of Computational Linguistics in Finland, and Prof. Fred Karlsson was Head of the Linguistics Department leading a project on constraint grammar. This environment was tremendously inspiring and their ideas and views on morphology, surface syntax, constraint grammar, and translation I will forever carry with me.

I was, however, thrown into the business world of Language Technology, where I had the opportunity to participate in the start-up of a company called Lingsoft, which I headed for a number of years, before I went on to be its Chief Technology Officer. At Lingsoft I took part in a range of interesting projects.

When doing language technology for information retrieval, I was fortunate to meet Prof. Kalervo Järvelin at Tampere University (UTA). During a project on the Finnish dictionary, Perussanakirja, I met Dr. Krista Lagus, now at the Helsinki University of Technology (HUT). Both were later to become supervisors of this Ph.D. thesis. As one of my projects at Lingsoft I also designed and supervised the implementation of a Finnish speech recognition system.

By that time I had taken part in the development of all the necessary components for the speech translation device I had set out to create in childhood. Why then a dissertation on word senses? Well, after having seen all the components, I was also acutely aware of how much work remained for a practical large-scale solution. I needed a new angle. Applied lexical semantics and a statistical approach seemed necessary. Prof. Lauri Carlson, at Helsinki University (HU), agreed to be the main supervisor of my thesis, and the project was made financially possible during 2001-2005 by the Graduate School of Language Technology in Finland and the Department of General Linguistics at the University of Helsinki, which I gratefully acknowledge.

During my time at the Graduate School of Language Technology, I had the privilege to cooperate with graduate students from three different universities. Foremost among those have been Mathias Creutz at HUT, Jussi Piitulainen at HU, and Heikki Keskustalo at UTA. Together with them I was able to make some of the ideas materialize into publications.

As a complement to the intellectual work, I have enjoyed folk dancing several times a week. So much so that I now also hold a degree as a folk dancing instructor. I am grateful to the folk dancers at Arbetets Vänner and Brage for providing me with relaxing, playful, but also challenging and stimulating environments for folk dancing. Merry motions always bring about a good mood.

I am also deeply indebted to my parents, Stig and Eva, without whose unfailing belief in my capabilities, and without whose decision long ago to move back to Finland, this thesis may never have happened, and to my sister, Lilian, for many discussions on the meaning of everything, and finally, to my partner in life, Juhani, without whose delicious food and good-natured support I would have felt much lonelier.


–  –  –

Word sense discovery and disambiguation are the essence of communication in a natural language. Discovery corresponds to growing or acquiring a vocabulary. Disambiguation is the basis for understanding. These processes are also key components of language evolution and development. In this work we will restrict ourselves to the core processes of word sense discovery and disambiguation in text-based computer applications.

We will try to demonstrate that word sense discovery and disambiguation are two sides of the same coin: you cannot have one without first having the other. The resolution of this paradox requires some form of external reference. For humans the reference is provided by the world and the language community we live in. Since we are dealing with computer programs analyzing text, we will refer to written representations of language communities, i.e., text corpora and machine-readable dictionaries.

In the introduction we outline the processes involved in word sense discovery and disambiguation and briefly touch on some of the main problems common to both. We then outline the organization of the work and give an account of the author’s contributions.


1.1 Word Sense Disambiguation

Word sense disambiguation is the task of selecting the appropriate senses of a word in a given context. An excellent survey of the history of ideas used in word sense disambiguation is provided by Ide and Veronis (1998). Word sense disambiguation is an intermediate task which is necessary in order to accomplish some other natural language processing task, e.g.,

- translation selection in machine translation,
- eliminating irrelevant hits in information retrieval,
- analyzing the distribution of predefined categories in thematic analysis,
- part-of-speech tagging, prepositional phrase attachment and parsing space restriction in grammatical analysis,
- phonetization of words in speech synthesis and homophone discrimination in speech recognition, and
- spelling correction, case changes and lexical access in text processing.

Word sense disambiguation (WSD) involves the association of a given word in a text or discourse with a definition or meaning which is distinguishable from other meanings potentially attributable to that word. The task therefore necessarily involves two steps according to Ide and Veronis (1998). The first step is to determine all the different senses for every word relevant to the text or discourse under consideration, i.e., to choose a sense inventory, e.g., from the lists of senses in everyday dictionaries, from the synonyms in a thesaurus, or from the translations in a translation dictionary.

The second step involves a means to assign the appropriate sense to each occurrence of a word in context. All disambiguation work involves matching the context of an instance of the word to be disambiguated either with information from external knowledge sources or with contexts of previously disambiguated instances of the word. For both of these sources we need preprocessing or knowledge-extraction procedures representing the information as context features. For some disambiguation tasks, there are already well-known procedures such as morpho-syntactic disambiguation and therefore WSD has largely focused on distinguishing senses among homographs belonging to the same syntactic category.

However, it is useful to recognize that a third step is also involved: the computer needs to learn how to associate a word sense with a word in context using either machine learning or manual creation of rules or metrics.


It is the third step which is the focus of this work and especially the machine learning aspect. Unless the associations between word senses and context features are given explicitly in the form of rules by a human being, the computer will need to use machine learning techniques to infer the associations from some training material. In order to avoid confusion, we will speak of manually created [1] disambiguation techniques as a separate category and only divide the machine learning techniques into the subcategories of supervised, semi-supervised and unsupervised.
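As a rough illustration of the three steps, the following Python sketch disambiguates one occurrence of a word by comparing its context with contexts of previously disambiguated instances. It is only a toy under stated assumptions: the word "bank", the two-sense inventory, the training sentences, the stopword list and the overlap score are all hypothetical choices made for this example, not the methods of the publications.

```python
from collections import Counter

# Step 1 (assumed): a fixed sense inventory for "bank", given here through
# hypothetical contexts that have already been disambiguated by hand.
LABELED_CONTEXTS = [
    ("finance", "she deposited money at the bank before noon"),
    ("finance", "the bank raised its interest rate on loans"),
    ("river", "they fished from the grassy bank of the river"),
    ("river", "the boat drifted toward the muddy bank"),
]

# Hypothetical stopword list; the target word itself is not a context feature.
STOPWORDS = {"the", "a", "of", "at", "its", "on", "from", "she", "they", "he", "bank"}


def context_features(text):
    """Represent a context as a bag of lowercased non-stopword tokens."""
    return Counter(tok for tok in text.lower().split() if tok not in STOPWORDS)


def train(labeled_contexts):
    """Step 3 (learning): accumulate context feature counts per sense."""
    profiles = {}
    for sense, text in labeled_contexts:
        profiles.setdefault(sense, Counter()).update(context_features(text))
    return profiles


def disambiguate(profiles, text):
    """Step 2 (assignment): choose the sense whose profile overlaps most."""
    feats = context_features(text)
    scores = {
        sense: sum(min(count, profile[feat]) for feat, count in feats.items())
        for sense, profile in profiles.items()
    }
    # Sorting first makes ties resolve alphabetically and deterministically.
    return max(sorted(scores), key=scores.get), scores


profiles = train(LABELED_CONTEXTS)
sense, scores = disambiguate(profiles, "he borrowed money from the bank for loans")
print(sense, scores)
```

Replacing the overlap score with a hand-written rule would turn this into a manually created technique in the sense used above; inferring the profiles from labeled material, as here, is the supervised case.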

1.2 Word Sense Discovery

Word sense discovery is defined as the task of learning what senses a word may have in different contexts. Word sense discovery is what lexicographers do by profession. Automated word sense discovery on a large scale in order to build a thesaurus has a much shorter history. Some of the first attempts were made in the 1960s by Karen Spärck Jones (1986). As sufficiently large corpora and efficient computers have become available, several attempts to automate the process have been undertaken.

In lexicography, when building mono- and multilingual dictionaries as well as thesauruses and ontologies, word sense discovery is regarded as a preprocessing stage (Kilgarriff et al., 2004; Kilgarriff and Tugwell, 2001). In various applications, it is seen as a part of the lexical acquisition and adaptation process, e.g., in

- translation discovery when training statistical machine translation systems,
- synonym discovery for information retrieval,
- document clustering providing a domain analysis,
- detecting neologisms [2] or rare uses of words in part-of-speech tagging and grammatical analysis,
- discovering ontological relations [3] for terminologies,
- named entity recognition, and
- automated discovery of morphology and syntax.

[1] In the word sense disambiguation literature, notably SENSEVAL-2 (2001), manually created metrics or disambiguation rules are referred to as unsupervised. From a machine learning point of view, this is perhaps technically correct because no final automated training was used to improve the performance with a training corpus. According to the same reasoning, e.g., a manually designed wide-coverage parser would be an unsupervised method from a machine learning point of view.

[2] Basic lexical acquisition is done all the time in most natural language applications. Often it is simply dismissed as part of the preprocessing heuristics for neologisms, i.e., new words or out-of-vocabulary items.

[3] Ontological relations are: type and subtype (isa), part-of and whole, etc.

Word sense discovery involves the grouping of words by their contexts into labeled sets of related words. This task can also be seen as consisting of three steps. The first step is to determine the groups of related words in context, i.e., to create a context clustering. It involves calculating the similarity of the word contexts to be clustered, or using similarity information from external knowledge sources.
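The first step can be pictured with a small Python sketch that computes the similarity of word contexts and groups them. It assumes bag-of-words context vectors, cosine similarity and a greedy threshold clustering; these are stand-in choices for illustration, and the example contexts and the threshold value are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a context as a bag of lowercased tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_contexts(contexts, threshold=0.3):
    """Greedy clustering: join the most similar cluster or open a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(contexts):
        vec = bag_of_words(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # centroid is the summed member vectors
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]


# Hypothetical contexts of one ambiguous word, with the word itself removed.
contexts = [
    "deposit money account loan",
    "river water fishing boat",
    "money loan interest account",
    "fishing boat river mud",
]
clusters = cluster_contexts(contexts)
print(clusters)  # [[0, 2], [1, 3]]
```

The resulting groups are unlabeled; attaching sense labels to them is the subject of the second and third steps.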

The second step is to determine a suitable inventory of word sense labels. There is no well-established convention for labeling the context-clustered word groups. The predefined labels are typically taken from sense descriptors in everyday dictionaries, labels in thesauruses and ontologies, or translations in a translation dictionary. The labeling varies according to purpose: in terminology mining the ontological relations are frequently used, in thesaurus discovery thesaurus relations are often used, and in statistical machine translation the translations are suitable labels of word clusters.

In word sense discovery, the third step involves a way to learn how to associate a word sense label with a word cluster using either machine learning or manually created rules or metrics.
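A manually created metric for this third step could look like the Python sketch below, which attaches to a word cluster the label whose descriptor words overlap most with the cluster. The sense inventory, its descriptor words and the function name `label_cluster` are hypothetical and serve only to make the idea concrete.

```python
# Hypothetical sense inventory: each label comes with a few descriptor words,
# as might be taken from a dictionary definition or a thesaurus entry.
SENSE_INVENTORY = {
    "financial institution": {"money", "loan", "account", "deposit", "interest"},
    "sloping land beside water": {"river", "water", "shore", "boat", "mud"},
}


def label_cluster(cluster_words, inventory):
    """Return the label with the largest word overlap, or None if none overlaps."""
    overlaps = {
        label: len(cluster_words & descriptors)
        for label, descriptors in inventory.items()
    }
    # Sorting first makes ties resolve alphabetically and deterministically.
    best = max(sorted(overlaps), key=overlaps.get)
    return best if overlaps[best] > 0 else None


finance_label = label_cluster({"money", "loan", "teller"}, SENSE_INVENTORY)
water_label = label_cluster({"river", "mud", "reeds"}, SENSE_INVENTORY)
unknown_label = label_cluster({"guitar", "chord"}, SENSE_INVENTORY)
print(finance_label, water_label, unknown_label)
```

A cluster that matches no label is left unlabeled here; a machine learning variant would instead infer the association between labels and clusters from training material.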
