Text summarization is a rapidly growing field with many recent innovations. End-to-end models using the sequence-to-sequence architecture achieve high scores on automatic metrics for standard datasets. However, they frequently generate summaries that are factually inconsistent with the original article, a critical problem that must be solved before the summaries can be used in real-world applications. In addition, they do not generalize to new domains, especially those with few training examples. In this dissertation, we propose to explicitly separate the two steps of content selection and surface realization in summarization. Content selection is the process of choosing important words, phrases, or sentences from the document. Surface realization is the transformation of the selected content into a coherent, grammatical text summary. This paradigm more closely follows human patterns of summarization: a person will often first identify the important ideas in an article (content selection) and then write a summary based on those ideas (surface realization).
We make several contributions to the summarization field using this paradigm of separate content selection and surface realization steps. First, we present two techniques focusing on content selection: a model that ranks both single sentences and pairs of sentences in a unified space, and a cascade approach that highlights salient words and phrases within sentences. Second, we present several studies on sentence fusion in summarization: an analysis of how well state-of-the-art summarizers perform sentence fusion, a dataset containing points of correspondence between sentences, and a method that uses these points of correspondence to improve sentence fusion. Finally, we introduce two methods with separate content selection and surface realization steps for multi-document summarization: a technique based on the Maximal Marginal Relevance (MMR) algorithm that adapts single-document summarizers to the multi-document setting, and a conceptual framework that models asynchronous endorsement between synopses and documents.
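For readers unfamiliar with MMR, the following is a minimal sketch of the generic greedy MMR selection rule, which trades off a sentence's relevance to a query against its redundancy with sentences already selected. It is not the dissertation's adaptation technique. The bag-of-words cosine similarity, the function names (`bow`, `cosine`, `mmr_select`), and the parameters `k` and `lam` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector from whitespace tokens (illustrative assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, query, k=2, lam=0.5):
    """Greedily pick up to k sentences by the generic MMR criterion.

    score(i) = lam * sim(sentence_i, query)
               - (1 - lam) * max over selected j of sim(sentence_i, sentence_j)
    """
    vecs = [bow(s) for s in sentences]
    q = bow(query)
    selected = []                       # indices chosen so far, in order
    candidates = list(range(len(sentences)))

    def score(i):
        relevance = cosine(vecs[i], q)
        redundancy = max(
            (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


# Hypothetical toy input: two identical sentences and one distinct sentence.
toy = ["a b c", "a b c", "x y z"]
diverse = mmr_select(toy, query="a b c x", k=2, lam=0.5)
relevance_only = mmr_select(toy, query="a b c x", k=2, lam=1.0)
```

With `lam=0.5` the second pick skips the duplicate sentence in favor of the distinct one, while `lam=1.0` reduces to ranking by relevance alone and selects the duplicate.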
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Doctoral Dissertation (Campus-only Access)
Lebanoff, Logan, "Separating Content Selection from Surface Realization in Neural Text Summarization" (2020). Electronic Theses and Dissertations, 2020-. 375.