A Study on Extraction-based Multidocument Summarization
|Keywords:||multidocument summarization; generic summary; query-focused summary; sentence ranking; sentence extraction; redundancy filtering|
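This thesis investigates methods for automatic multidocument summarization. The research topics are (1) multidocument summarization and (2) query-focused multidocument summarization. Multidocument summarization produces a single summary from multiple topically related documents; query-focused multidocument summarization extracts the content relevant to the user's interests from multiple topically related documents and produces a single summary from it. This thesis adopts sentence extraction: it judges the importance of sentences and extracts the important ones verbatim to produce an extractive summary. The focus of the thesis is on measuring sentence importance and on sentence ranking methods.
For multidocument summarization, this thesis proposes a graph-based sentence ranking method, iSpreadRank. The method builds a sentence similarity network as the model for analyzing multiple documents and applies spreading activation theory to derive sentence importance as the basis for ranking. Important sentences are then selected in rank order to form the summary; during selection, preference is given to sentences with low information redundancy relative to those already selected. In the experiments, the method is applied to the DUC 2004 data set. The evaluation results show that, compared with the systems that competed at DUC 2004, the proposed method performs well on the ROUGE measures.
For query-focused multidocument summarization, this thesis proposes a measure of sentence importance that combines (1) the similarity between a sentence and the query topic and (2) the informativeness of the sentence. Latent semantic analysis is used to compute the similarity between a sentence and the query topic in a semantic space, and the surface-level features that traditional summarization methods use to capture sentence representativeness are used to assess informativeness. Building on the Maximum Marginal Relevance technique and taking information redundancy into account, the thesis also proposes a sentence extraction method suited to query-focused multidocument summarization. In the experiments, the method is applied to the DUC 2005 data set. The evaluation results show that, compared with the systems that competed at DUC 2005, the proposed method performs well on the ROUGE measures.|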
The rapid development of information technology over the past decades has dramatically increased the amount and availability of online information. This explosion of information has led to information overload: finding and using the information people really need, efficiently and effectively, has become a pressing practical problem in daily life. Text summarization, which automatically condenses the content of one or more documents while preserving the main points, is a natural technique for helping people interact with information. This thesis discusses work on two summarization tasks: (1) multidocument summarization and (2) query-focused multidocument summarization. The first produces a generic summary of a set of topically related documents. The second, a special case of the first, generates a query-focused summary that reflects the points relevant to the user's topic(s) of interest. Both tasks are addressed using the most common technique for summarization, sentence extraction: important sentences are identified, extracted verbatim from the documents, and composed into an extractive summary. The first step in sentence extraction is to score and rank sentences in order of importance, which is the major focus of this thesis. For the first task, a novel graph-based sentence ranking method, iSpreadRank, is proposed to rank sentences according to their likelihood of being part of the summary. The input documents are modeled as a sentence similarity network, and iSpreadRank applies spreading activation to infer the relative importance of sentences from the network structure. The method then extracts one sentence at a time into the summary, each time choosing a sentence that has both high importance and low redundancy with the sentences already extracted.
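The ranking-and-extraction idea described above can be illustrated with a minimal sketch. It is not the iSpreadRank implementation: the abstract does not give the activation formula, so the bag-of-words cosine similarity, the uniform initial activation, and the `edge_threshold`, `decay`, and `redundancy_threshold` parameters below are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_sentences(sentences, edge_threshold=0.1, decay=0.85, iterations=50):
    """Score sentences by spreading activation over a similarity network."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    if n == 0:
        return [], bags
    # Edge weight = cosine similarity; weak links are dropped (assumed threshold).
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim = cosine(bags[i], bags[j])
                if sim >= edge_threshold:
                    w[i][j] = sim
    out = [sum(row) for row in w]  # total outgoing weight of each node
    initial = [1.0 / n] * n  # assumed uniform starting activation
    act = initial[:]
    for _ in range(iterations):
        # Each node keeps its initial activation and receives a decayed share
        # of its neighbours' activation, split in proportion to edge weight.
        act = [
            initial[j]
            + decay * sum(act[i] * w[i][j] / out[i] for i in range(n) if out[i])
            for j in range(n)
        ]
    return act, bags


def extract_summary(sentences, max_sentences=2, redundancy_threshold=0.5):
    """Pick top-ranked sentences, skipping ones too similar to earlier picks."""
    scores, bags = rank_sentences(sentences)
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if all(cosine(bags[i], bags[j]) < redundancy_threshold for j in chosen):
            chosen.append(i)
        if len(chosen) == max_sentences:
            break
    return [sentences[i] for i in chosen]
```

Calling `extract_summary(list_of_sentences)` returns at most `max_sentences` sentences. With `decay` below 1 and outgoing weights normalised, the activation values stay bounded, which is why a fixed number of iterations is enough for this sketch.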
The proposed summarization method is evaluated on the DUC 2004 data set and found to perform well on various ROUGE measures. Experimental results show that the proposed method is competitive with the top systems at DUC 2004. For the second task, a new scoring method is proposed to measure the likelihood of a sentence being part of the summary; it combines (1) the degree of relevance of the sentence to the query and (2) the informativeness of the sentence. The query relevance of a sentence is assessed as the similarity between the sentence and the query computed in a latent semantic space, while the informativeness of a sentence is estimated using surface-level features. Moreover, a novel sentence extraction method, inspired by maximal marginal relevance (MMR), is developed to extract one sentence at a time into the summary, provided that it is not too similar to any sentence already extracted. The proposed summarization method is evaluated on the DUC 2005 data set and found to perform well on various ROUGE measures. Experimental results show that the proposed method is competitive with the top systems at DUC 2005.
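The query-focused scoring and MMR-inspired extraction described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's method: plain bag-of-words cosine similarity is substituted for similarity in a latent semantic space, sentence position stands in for the surface-level features, the selection step uses the standard MMR trade-off rather than the thesis's variant, and the `weight` and `lam` parameters are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def query_focused_summary(sentences, query, max_sentences=2, weight=0.7, lam=0.6):
    """Select sentences that are query-relevant, informative, and non-redundant."""
    bags = [Counter(s.lower().split()) for s in sentences]
    q = Counter(query.lower().split())
    n = len(sentences)
    scores = []
    for pos, bag in enumerate(bags):
        # Stand-in for latent-semantic similarity: plain term-overlap cosine.
        relevance = cosine(bag, q)
        # Stand-in surface feature: earlier sentences count as more informative.
        informativeness = 1.0 - pos / n
        scores.append(weight * relevance + (1 - weight) * informativeness)

    chosen = []
    candidates = list(range(n))
    while candidates and len(chosen) < max_sentences:

        def mmr(i):
            # Standard MMR: reward the combined score, penalise similarity
            # to the most similar sentence already in the summary.
            redundancy = max((cosine(bags[i], bags[j]) for j in chosen), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr)
        chosen.append(best)
        candidates.remove(best)
    return [sentences[i] for i in chosen]
```

A call such as `query_focused_summary(sentences, "bridge repair")` returns the selected sentences in the order they were picked. Raising `lam` favours relevance and informativeness; lowering it penalises redundancy more strongly.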
|Appears in Collections:||Thesis|