Title: Corpus-Based Coherence Relation Tagging in Chinese Discourse (以語料為基礎的中文語篇連貫關係自動標記)
Authors: Shou-Yi Cheng (鄭守益)
Tyne Liang (梁婷)
Institute of Computer Science and Engineering
Keywords: Chinese; coherence relation; surface feature analysis; discourse tag; cue term mining
Publication Date: 2005
Abstract: Discourse analysis is an indispensable part of text understanding because it clarifies the topics and logical structure of a document. This thesis takes a corpus-based approach: it collects and expands the surface features of discourse coherence, formulates the corresponding rules, and proposes an automatic tagging procedure for Chinese discourse. The mining corpus consists of the written documents in the Academia Sinica Balanced Corpus 3.0, a total of 7,265 articles covering news reports, biographies and diaries, essays, letters, commentaries, and instruction manuals. For each of nine types of rhetorical relations in Chinese discourse (Coordinate, Continue, Option, Forward, Disjunctive, Cause and Effect, Conditions, Elaboration, and Goal), we mine cue terms together with auxiliary features such as consecutive POS tags and special punctuation marks. For evaluation, we used 100 newspaper editorials of about 1,500 words each (1,424 to 1,558) as the test corpus. Intra-sentence tagging achieves a precision of 91%, a recall of 95%, and a filtration precision of 98%. Inter-sentence tagging achieves a precision of 86%, a recall of 93%, and a filtration precision of 95%. We believe this work on discourse tagging can be applied to question answering systems, essay scoring systems, automatic summarization, and automatic slide generation systems.
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009323540
http://hdl.handle.net/11536/79065
Appears in Collections: Thesis


Files in This Item:

  1. 354001.pdf