
Page 1:

Using TF-IDF Weight Ranking Model in CLINSS as Effective Similarity Measure to Identify Cases of Journalistic Text Re-use

Palkovskii Y., Belov A.

Zhytomyr State University, Institute of Foreign Philology

In affiliation with SkyLine LLC [Plagiarism Prevention Solutions]

Zhytomyr, Ukraine

Page 2:

Who we are \ what we do

• A small, devoted group of students and professors at ZSU.
• Focused on Plagiarism Detection (PD) and Cross-Language PD.
• We develop a core text comparison engine for a number of PD-related commercial products for SkyLine LLC:
  • Plagiarism Detector Accumulator Server [PDAS]
  • Plagiarism Detector Client [PDC]
• We like to participate in plagiarism detection competitions (especially in hot countries) and are proud to have taken part in: PAN09 Spain, PAN10 Italy, PAN11 Amsterdam, PAN12 Italy, CL!TR11 India (Mumbai, IIT).

In affiliation with Zhytomyr State Uni and SkyLine LLC

© Palkovskii, Belov et al. 2012. TF-IDF Weight Ranking Model as news similarity measure

Page 3:

CL!NSS proposed task

• What are we looking for? The “same news event” within a pair of documents
• Pair-wise document comparison
• Reasonable processing time
• Resolution issues for focal news events are not a requirement, at least at this point
• Focus on the final result and a “starting point” prototype

Page 4:

How does it work?

• Language normalization via Google Translate
• Text preprocessing that included removal of the most frequent words (harvested beforehand from both corpora and sorted by frequency)
• Comparison of each document against the test corpus, with the retrieved data saved for further analysis
• Each cached result for every pair is scored by a predefined set of filters
• The top-100 list is formed by ascending score value
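The steps above can be sketched roughly as follows. This is a minimal Python illustration, not the authors' C# engine: it assumes all documents have already been translated into one language, uses a naive whitespace tokenizer, a standard log-scaled IDF, and cosine similarity as the pair score. The function names, the `top_n_stopwords` parameter and the descending sort by similarity are assumptions made for the example; the slides do not define the score or the filter set.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()

def frequent_words(docs, top_n_stopwords):
    # Harvest the most frequent words across all documents, sorted by frequency.
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return {word for word, _ in counts.most_common(top_n_stopwords)}

def tfidf_vectors(docs, stopwords):
    # Build one sparse TF-IDF vector (a dict) per document.
    tokenized = [[t for t in tokenize(d) if t not in stopwords] for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            tok: (count / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok, count in tf.items()
        })
    return vectors

def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_candidates(query, candidates, top_n_stopwords=2, top_k=100):
    # Score every (query, candidate) pair and keep the top_k most similar.
    docs = [query] + candidates
    stopwords = frequent_words(docs, top_n_stopwords)
    vectors = tfidf_vectors(docs, stopwords)
    scored = [(cosine(vectors[0], vec), i) for i, vec in enumerate(vectors[1:])]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy documents invented for illustration only.
ranked = rank_candidates(
    "the rover curiosity landed on mars",
    [
        "the curiosity rover landed safely on mars today",
        "the new bollywood film opened on friday",
        "the cricket match ended on monday",
    ],
)
print(ranked)
```

In this toy run the Mars story is ranked first because it is the only candidate that shares non-stopword terms with the query.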

Page 5:

Our evaluation methods

Page 6:

News set about “Curiosity” landing on Mars

Page 7:


Latest Bollywood newsfeeds

Page 8:

In detail

• Inserting manually crafted news pairs into both corpora and evaluating their final ranking positions
• Different degrees of news story uniqueness, ranging from news about the Curiosity landing on Mars to the latest Bollywood film news (i.e. matching the character of the context and the exact vocabulary of the training set)
• 10 news stories planted; 9 out of 10 fell into the “top 10” ranking, which supported the initial hypothesis
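A check of this kind can be sketched as below. It is a hypothetical helper, not the authors' evaluation code: it assumes each planted story has a known matching document and that the system returns a best-first list of candidate ids per query. The ids and rankings in the example are invented for illustration.

```python
def planted_hits(rankings, planted, k=10):
    """Count planted pairs whose target appears in the top-k list of its query.

    rankings: dict mapping query id -> list of candidate ids, best first
    planted:  dict mapping query id -> id of the planted matching document
    """
    hits = sum(
        1 for query, target in planted.items()
        if target in rankings.get(query, [])[:k]
    )
    return hits, len(planted)

# Hypothetical data: one planted target ranked 3rd, the other ranked 12th.
rankings = {
    "mars-query": ["d7", "d2", "mars-target"] + [f"x{i}" for i in range(20)],
    "film-query": [f"y{i}" for i in range(11)] + ["film-target"],
}
planted = {"mars-query": "mars-target", "film-query": "film-target"}
hits, total = planted_hits(rankings, planted, k=10)
print(f"{hits} of {total} planted stories in the top 10")
```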

Page 9:

Detailed document comparison

• PAN 2012 prototype: the “iGTC” project, based on an n-gram matching principle with 3 levels of graphically based clusterization, already tuned by a GA for last year's FIRE\PAN to tackle medium-to-high degrees of obfuscation as well as translated and simulated plagiarism
• We did not use it. The main reason was to retain a purely statistical approach based on TF-IDF values

Page 10:

CL!NSS Results Achieved: Hindi\English

Rank  Run                             NDCG@1  NDCG@5  NDCG@10
1     run-1-english-hindi-palkovskii  0.3229  0.3259  0.3380
2     run-2-english-hindi-deriupm     0.2100  0.2136  0.2613
3     run-1-english-hindi-deriupm     0.1900  0.2110  0.2168
4     run-1-english-hindi-iiith       0.1939  0.1994  0.2154
5     run-3-english-hindi-deriupm     0.1500  0.1886  0.2030
6     run-3-english-hindi-iiith       0.1837  0.1557  0.1722
7     run-2-english-hindi-iiith       0.0204  0.0462  0.0512
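For readers unfamiliar with the metric, NDCG@k can be computed as in the sketch below. It assumes the common formulation with a linear gain and a log2 position discount; the exact variant used by the CL!NSS organizers is not stated on the slides, and the relevance grades in the example are invented.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: relevance at rank i is divided by log2(i + 1).
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(ranked_relevances, judged_relevances, k):
    # Normalize by the DCG of the ideal ordering of all judged documents.
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades of the returned documents, in ranked order.
system_ranking = [0, 2, 1]
judged = [2, 1, 0]
print(ndcg_at_k(system_ranking, judged, 1))
print(ndcg_at_k(system_ranking, judged, 3))
```

A ranking that places the judged documents in ideal order scores 1.0, and a system whose first result is irrelevant scores 0.0 at NDCG@1 no matter what follows.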

Page 11:

CL!NSS Results Achieved: Gujarati\English

Rank  Run                             NDCG@1  NDCG@5  NDCG@10
1     run-1-english-hindi-palkovskii  0.0541  0.0843  0.0955

Ideas to consider:

• Different efficiency for different source types and news types\structure\origin [according to Parth Gupta's analysis of CL!NSS]
• MT substitute for Gujarati

Page 12:

PAN2011\CLEF CLPD Baseline

• Manual: 0.37 P-det (R: .69, P: .26, G: 1)
• Automatic: 0.92 P-det (R: .97, P: .88, G: 1)

Comparison problem: NDCG* metrics vs P-det (any ideas?)
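Assuming “P-det” refers to the plagdet score used in the PAN evaluations (the F1 of precision and recall divided by log2 of one plus granularity), the baseline figures can be reproduced with the short sketch below. The function name is chosen for the example.

```python
import math

def plagdet(precision, recall, granularity):
    # PAN plagdet: harmonic mean of precision and recall,
    # penalized by detection granularity (no penalty when granularity is 1).
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

print(round(plagdet(0.88, 0.97, 1), 2))  # automatic run
print(round(plagdet(0.26, 0.69, 1), 2))  # manual run
```

With the rounded values from the slide, the automatic run comes out at about 0.92 and the manual run at about 0.38; the slide reports 0.37, presumably because the precision and recall shown are themselves rounded. The sketch also shows why the two metrics are hard to compare: plagdet scores a set of detections, while NDCG scores the order of a ranked list.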

Page 13:

Hardware\Runtime

• Moderately computationally intensive
• Single Intel 6-core 990 ex
• 6 GB RAM (RAM-intensive usage)
• Single SSD drive
• Total runtime of 12 hours for the test corpus (excluding the PAN2012 comparer filter)

Page 14:

Software used

• Microsoft Windows 7
• Microsoft Visual Studio 2010\C#

Page 15:

What we missed

• Exhaustive tuning of meta-parameters
• NER
• A hybrid approach that uses the PAN2012 text comparer prototype as an additional scoring mechanism (left out because of runtime limitations and the idea to stick to a single methodology)
• Post-analysis of successful and failed detections, including results visualization in the hope of further insights
• Our competitive colleagues from Austria, Romania, Chile, etc.!
• Layered analysis of each influencing scoring factor [ref. to PAN 2011\2012 analysis]

Page 16:

Things we’re happy to discuss

• Results evaluation
• Achieved baseline in comparison to PAN results
• The corpus size
• Automatic evaluation platform for result processing and evaluation
• Perspectives of machine learning
• Hybrid approaches
• Baseline comparison with other related tracks



Page 18:

Letters are powered by people:

Page 19:

Letters are powered by people:

Page 20:

I would like to thank these people for their assistance and help:

• Mandar Mitra
• Parth Gupta
• Anwar Shaikh

And an additional “thank you” for getting as far as Kolkata!