Title: Concept Cluster Based News Document Summarization (以概念分群為基礎之新聞文件自動摘要系統)
Authors: 劉政璋 (Liu, Cheng-Chang)
柯皓仁 (Hao-Ren Ke)
楊維邦 (Wei-Pang Yan)
Institute of Computer Science and Engineering
Keywords: English news summary; feature selection; concept clustering
Issue Date: 2004
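Abstract: A multi-document summarization system can substantially reduce the time spent reading large numbers of documents. A summarization system typically selects the sentences that carry the most semantic information and assembles them into a summary, so that a reader can grasp the concepts the documents cover from the summary alone and understand the content more quickly. Multi-document summarization must additionally take care to remove redundant information. This study analyzes the semantics of document content to help select the sentences with the most semantic information. The analysis has two main steps. 1) Find the important concepts hidden in the documents: a set of documents contains many smaller topics, which we call concepts. We describe a concept using description links: because a representative term usually co-occurs with similar nouns, adjectives, and verbs, we use the words that appear around a representative term to describe it, and we add a semantic network to make the description more accurate. 2) Analyzing content semantics first requires determining which concepts are the same and which differ, that is, word sense disambiguation, and placing identical concepts in the same cluster. We apply K-means clustering to the concepts found in the previous step to resolve sense ambiguity and remove duplicate concepts. Once the content semantics are established, the sentences that best represent the document set are selected using features such as the concept clustering result, the information content of each sentence, and the position of the sentence in its article. The experiments use the news documents and the evaluation software provided by DUC2003. The evaluation software measures the similarity between automatically generated summaries and summaries written by experts, and we use it to obtain an objective measure of the system's performance.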
A multi-document summarization system can reduce the time a user needs to read a large number of documents. A summarization system, in general, selects salient features from one or more documents to compose a summary, in the hope that the generated summary helps the user understand the meaning of the documents. This thesis proposes a method for analyzing the semantics of news documents. The method is divided into two phases. The first phase attempts to discover the subtle topics, called concepts, hidden in the documents. Because similar nouns, verbs, and adjectives usually co-occur with the same representative term, we describe a concept by the terms around it and use a semantic network to make the description more accurate. The second phase distinguishes the concepts discovered in the first phase by their word senses. The K-means clustering algorithm is used to gather concepts with the same sense into the same cluster. Clustering mitigates word sense ambiguity and reduces redundancy among concepts with similar senses. After these two phases, we use five features to weight sentences and then order the sentences by weight. The five features are the length of the cluster, the location of the sentence, tf*idf, the distance between the sentence and the center of the cluster to which it belongs, and the similarity between the sentence and that cluster. We use the news documents of the Document Understanding Conference 2003 (DUC2003) and its evaluation tool to evaluate the performance of our method.
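The two stages described in the abstract, clustering followed by feature-based sentence weighting, can be illustrated with a minimal Python sketch. It is not the thesis's implementation. Several points are assumptions made only for illustration: the toy vectors stand in for the concept descriptions, K-means is seeded with the first k vectors, the five features are combined as an equally weighted sum (the abstract does not say how they are combined), and the normalization of each feature is hypothetical. All function names are hypothetical as well.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(vectors, k, iterations=20):
    """Plain K-means. Assumption: the first k vectors are used as initial centers."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest center.
        labels = [min(range(k), key=lambda c: euclidean(v, centers[c]))
                  for v in vectors]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def score_sentences(vectors, positions, tfidf, labels, centers,
                    weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of five features per sentence (equal weights are an assumption)."""
    n = len(vectors)
    max_tfidf = max(tfidf) or 1.0
    scores = []
    for i, v in enumerate(vectors):
        center = centers[labels[i]]
        features = (
            labels.count(labels[i]) / n,         # relative size of the sentence's cluster
            1.0 / (1 + positions[i]),            # earlier sentences score higher
            tfidf[i] / max_tfidf,                # normalized tf*idf weight
            1.0 / (1.0 + euclidean(v, center)),  # closeness to the cluster center
            cosine(v, center),                   # similarity to the cluster
        )
        scores.append(sum(w * f for w, f in zip(weights, features)))
    return scores

# Toy data: four sentence vectors, their positions, and precomputed tf*idf sums.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
positions = [0, 1, 2, 3]
tfidf = [2.0, 1.0, 1.5, 0.5]

labels, centers = kmeans(vectors, k=2)
scores = score_sentences(vectors, positions, tfidf, labels, centers)
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(labels, order)
```

A summary would then be built by taking sentences in `order` until a length limit is reached. Real feature values would come from the concept clusters and the document statistics rather than from hand-written toy numbers.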
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009223573
http://hdl.handle.net/11536/76623
Appears in Collections: Thesis


Files in This Item:

  1. 357301.pdf
  2. 357302.pdf