The understanding of natural language by computational methods has been a continuing and elusive problem in artificial intelligence. In recent years there has been a resurgence in natural language processing research. Much of this work has focused on empirical, corpus-based methods, which use a data-driven approach to train systems on large amounts of real language data. Using corpus-based methods, the performance of part-of-speech (POS) taggers, which assign each word of a sentence its appropriate part-of-speech category (e.g., noun, verb, preposition), now rivals human performance levels, achieving accuracies exceeding 95%. These taggers have proved useful as preprocessors for tasks such as parsing, speech synthesis, and information retrieval.
Parsing remains, however, a difficult problem, even with the benefit of POS tagging. Moreover, as sentence length increases, there is a corresponding combinatorial explosion of alternative possible parses. Consider the following sentence from a New York Times online article:
After Salinas was arrested for murder in 1995 and lawyers for the bank had begun monitoring his accounts, his personal banker in New York quietly advised Salinas' wife to move the money elsewhere, apparently without the consent of the legal department.
To facilitate parsing and other tasks, we would like to decompose this sentence into the following three shorter sentences which, taken together, convey the same meaning as the original:
- Salinas was arrested for murder in 1995.
- Lawyers for the bank had begun monitoring his accounts.
- His personal banker in New York quietly advised Salinas' wife to move the money elsewhere, apparently without the consent of the legal department.
This study investigates the development of heuristics for decomposing such long sentences into sets of shorter sentences without affecting the meaning of the original sentences. The heuristic rules were developed without parsing or semantic analysis, and were based on: (1) the output of a POS tagger (Brill's tagger); (2) the punctuation contained in the input sentences; and (3) the words themselves. The heuristic algorithms were implemented in an intelligent editor program, which first augmented the POS tags and assigned tags to punctuation. The rules were then tested against a corpus of 25 New York Times online articles containing approximately 1,200 sentences and over 32,000 words, with good results.
Recommendations are made for improving the algorithms and for continuing this line of research.
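The kind of heuristic described above, which relies only on POS tags, punctuation, and the words themselves, can be illustrated with a minimal Python sketch. This is not the rule set developed in the thesis: the single rule used here (split at a comma or coordinating conjunction only when both sides contain a finite verb, and drop a leading subordinating word such as "After") is a hypothetical simplification, and the function names, the subordinator list, and the tag sets are assumptions made for illustration. The tags are hand-assigned in the Penn Treebank style and are not actual output from Brill's tagger.

```python
# Hypothetical illustration only: one simple splitting rule, not the thesis's rules.
# Tags below are hand-assigned Penn Treebank-style tags, not real tagger output.

FINITE_VERB_TAGS = {"VBD", "VBZ", "VBP", "MD"}   # assumed markers of a finite clause
BOUNDARY_TAGS = {",", "CC"}                      # candidate split points
SUBORDINATORS = {"after", "before", "when", "while", "although", "because"}

TAGGED = (
    "After/IN Salinas/NNP was/VBD arrested/VBN for/IN murder/NN in/IN 1995/CD "
    "and/CC lawyers/NNS for/IN the/DT bank/NN had/VBD begun/VBN monitoring/VBG "
    "his/PRP$ accounts/NNS ,/, his/PRP$ personal/JJ banker/NN in/IN New/NNP "
    "York/NNP quietly/RB advised/VBD Salinas/NNP '/POS wife/NN to/TO move/VB "
    "the/DT money/NN elsewhere/RB ,/, apparently/RB without/IN the/DT "
    "consent/NN of/IN the/DT legal/JJ department/NN ./."
)


def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


def has_finite_verb(pairs):
    """True if any token in the span carries a finite-verb tag."""
    return any(tag in FINITE_VERB_TAGS for _, tag in pairs)


def split_clauses(pairs):
    """Split a tagged sentence at boundaries that separate two finite clauses."""
    if pairs and pairs[-1][1] == ".":
        pairs = pairs[:-1]                       # drop the sentence-final period
    if pairs and pairs[0][0].lower() in SUBORDINATORS:
        pairs = pairs[1:]                        # drop a leading subordinator
    boundaries = [i for i, (_, tag) in enumerate(pairs) if tag in BOUNDARY_TAGS]
    clauses, start = [], 0
    for n, i in enumerate(boundaries):
        next_stop = boundaries[n + 1] if n + 1 < len(boundaries) else len(pairs)
        left, right = pairs[start:i], pairs[i + 1:next_stop]
        # Split only when both neighbours look like clauses in their own right.
        if has_finite_verb(left) and has_finite_verb(right):
            clauses.append(left)
            start = i + 1
    clauses.append(pairs[start:])
    return [clause for clause in clauses if clause]


def render(clause):
    """Join tokens back into a capitalized sentence ending in a period."""
    text = " ".join(word for word, _ in clause)
    text = text.replace(" ,", ",").replace(" '", "'")
    return text[0].upper() + text[1:] + "."


sentences = [render(c) for c in split_clauses(parse_tagged(TAGGED))]
for s in sentences:
    print(s)
```

On this hand-tagged input the sketch yields the three shorter sentences listed earlier. The final comma is not treated as a split point because the phrase that follows it contains no finite verb. A rule this simple would mis-split many real sentences (for example, coordinated verb phrases that share a subject), which is why a fuller set of heuristics and corpus testing is needed.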
Master of Science (M.S.)
College of Arts and Sciences
Written permission granted by copyright holder to the University of Central Florida Libraries to digitize and distribute for nonprofit, educational purposes.
Masters Thesis (Open Access)
Glinos, Demetrios George, "An Intelligent Editor for Natural Language Processing of Unrestricted Text" (1999). Retrospective Theses and Dissertations. 2134.