Plagiarism Detection using N-gram Co-occurrence Statistics Based on ROUGE and WordNet
|Keywords:||Plagiarism Detection; ROUGE; WordNet; N-gram Co-occurrence Statistics|
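Most current plagiarism detection methods fall into two categories: fingerprinting and term occurrence. Although both have achieved solid results in the field, they still have shortcomings. Deliberate modification of the original text degrades their detection performance, and fingerprinting is affected especially severely. This thesis therefore proposes plagiarism detection algorithms that apply ROUGE and WordNet: the former includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS), while the latter serves as a thesaurus and also provides word-sense information. N-gram co-occurrence statistics can effectively detect verbatim copying and plagiarism that alters sentence structure; skip-bigram and LCS are unaffected by the mere insertion of words into the original text or the deletion of parts of it; and WordNet makes it possible to detect cases where words in the original are replaced with synonyms.
This thesis evaluates the methods on two manually constructed data sets (called abstract and paraphrased). For each method, suitable threshold values and preprocessing settings are recommended based on observations from the experimental results. Finally, several examples of different types of plagiarism are presented to support the earlier hypotheses about each method's strengths and weaknesses.|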
With the arrival of the digital era and the Internet, controlling the flow of information has become nearly impossible, and this lack of control gives Internet users and computer owners an incentive to freely copy and paste any content available to them. Plagiarism occurs when users fail to credit the original owner of borrowed content, and such behavior violates intellectual property rights. The two main approaches to plagiarism detection are fingerprinting and term occurrence. Although both approaches have yielded considerable results, they are not without faults. One weakness common to both, and especially severe for fingerprinting, is the inability to detect plagiarism in which the text has been modified. This research proposes adopting ROUGE and WordNet. The former includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS), while the latter acts as a thesaurus that also provides semantic information. N-gram co-occurrence statistics can detect verbatim copying and certain changes to sentence structure; skip-bigram and LCS are immune to text modifications such as the simple addition or deletion of words; and WordNet can handle word substitution. The proposed methods have been tested on two manually created corpora, an abstract set and a paraphrased set. An empirically derived threshold and preprocessing setting are recommended for each method, based on an evaluation of its performance. Examples of different types of plagiarism are shown to support the statements made about the strengths and weaknesses of the proposed methods.
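To make the three ROUGE-style measures named in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the thesis's implementation: the function names are hypothetical, texts are assumed to be already tokenized into lowercase word lists, only a recall-style score (matches divided by the source's count) is computed, and the WordNet synonym step and the recommended thresholds are not reproduced.

```python
from collections import Counter
from itertools import combinations


def ngram_counts(tokens, n):
    """Multiset of contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_recall(source, suspect, n=2):
    """Fraction of the source's n-grams that also occur in the suspect text."""
    src, sus = ngram_counts(source, n), ngram_counts(suspect, n)
    total = sum(src.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, sus[gram]) for gram, count in src.items())
    return matched / total


def skip_bigram_recall(source, suspect):
    """Same idea with skip-bigrams: in-order word pairs with any gap allowed."""
    src = Counter(combinations(source, 2))
    sus = Counter(combinations(suspect, 2))
    total = sum(src.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, sus[pair]) for pair, count in src.items())
    return matched / total


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_recall(source, suspect):
    """Share of the source covered by the longest common subsequence."""
    return lcs_length(source, suspect) / len(source) if source else 0.0


# Hypothetical example: the suspect sentence only inserts words into the source.
source = "the cat sat on the mat".split()
suspect = "the black cat sat quietly on the mat".split()
print(ngram_recall(source, suspect, n=2))
print(skip_bigram_recall(source, suspect))
print(lcs_recall(source, suspect))
```

In this example the inserted words break some contiguous bigrams, so the bigram score drops below 1, while the skip-bigram and LCS scores are unaffected because the source survives as an in-order subsequence of the suspect text. This matches the abstract's claim that skip-bigram and LCS tolerate simple additions.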
|Appears in Collections:||Thesis|