Detecting Web Spam Created with Markov Chains Text Generators Anton S. Pavlov, Moscow State...

26
Detecting Web Spam Created with Markov Chains Text Generators 19.09.2009 1 Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Transcript of Detecting Web Spam Created with Markov Chains Text Generators Anton S. Pavlov, Moscow State...

Detecting Web Spam Created with Markov Chains Text Generators

19.09.2009 1Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

Overview

1. Web Spam– Markov chains text generators

2. Proposed Approach– Method overview– Constraints and features– Machine learning

3. Experiments4. Conclusion

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

2

Web Spam (I)

• Web Spam – deliberate actions aimed at unjustifiably raising relevance of some pages in a search engine:– Decreases search quality and performance– Aims at specific algorithms, used by search

engines (PageRank, BM25, etc.)– Efficient if mass created

• Some types of web spam get into search results page (doorways)

19.09.2009 3Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

→ automatically generated

Web Spam (II)

• Spammers have to create many web pages, that will be indistinguishable from normal pages:– Create texts manually Too Expensive– Copying texts from other sources Duplicates can

be detected– Generate texts automatically: keyword stuffing,

Markov chains, text modification

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

4

Markov Chains Text Generators (I)

• Markov Chain – a sequence of random variables, each variable depends only on previous one

• Can be used as a generative model for texts:– States – word n-gramms– Transitions – probability of seeing a word, after

seeing n words

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

5

).|(),...,|(

,...,...,,

1111

21

NNNNNN

N

xXxXPxXxXxXP

XXX

Markov Chains Text Generators (II)

• Markov chains training:– Select a collection of documents– Collect probabilities of seeing a word x in every

state

• Generation process:– Use collected statistics to generate a text of

predefined length

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

6

Markov Chains Text Generators (III)

• Resulting texts retain topical coherence and local coherence

Again, the prevalence of spam within blog

comments. Our work in this paper extend our

previous work in this paper approximate what will

eventually be perceived by users of search

engines.

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

7

Overview

1. Web Spam– Markov chains text generators

2. Proposed Approach– Method overview– Constraints and features– Machine learning

3. Experiments4. Conclusion

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

8

Proposed Method (I)

• Generated texts can simulate only some constraints of the natural texts:– Local coherence– Topical coherence

• Presumably generated texts violate other constraints:– Genre coherence– Readability– Diversity– Etc…

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

9

Proposed Method (II)

• Collect various features:– Style/Genre– Authorship– Readability– Diversity

• Use machine learning to create automatic spam classifier

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

10

Style/Genre identification

features

Authorship identification

features

Readability metrics

Text diversity features

Machine learning

Automatic spam

classifier

Genre and Style

• Part of speech (POS) usage statistics– Especially rare POS statistics

• Punctuation statistics:– Expressive punctuation («!», «?»)– Smileys («:)»)– References ([23])

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

11

Authorship

• Part of speech statistics• Standard deviations of part of speech ratios– Helps identify texts with mixed authorship

• Sentences with several verbs

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

12

Readability

• Readability metrics:– Average word length– Ratio of long words– Average, minimum, maximum sentence length

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

13

Diversity

• Zipf’s law:– For words– For stemmed nouns

• Compression rates:– gzip– bz2

• Same words occurring in neighboring sentences

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

14

;)(

i

iFreq

Machine Learning

• Extracted 61 features for each document• Compared two popular ML algorithms:– SVM– C4.5 + Bagging

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

15

Decision Trees• Algorithm based on

C4.5:– Each split minimizes

informational entropy– Use different subsets of

the training set to build a tree and to select it’s weight Avoids overfitting

– Use bagging to merge multiple weak classifiers

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

17

Gzip compr. ratio > 0.6

Max sentence

length > 32

Spam

Ham Spam

N Y

N Y

Overview

1. Web Spam– Markov chains text generators

2. Proposed Approach– Method overview– Constraints and features– Machine learning

3. Experiments4. Conclusion

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

18

Experiments• Experiments were conducted on ROMIP By.Web

collection• Web spam sources:– Rusadult doorway generator (rusadult)– Doorway.su doorway generator (doorway_su)– Order 2 Markov chains text generator (markov2)

• Generated 3 training and 3 test sets:– Each contained 10000 By.Web documents (not spam)– Each contained 10000 documents from generators

(spam)• Measured precision, recall, and F-measure

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

19

Detecting Generated Web Spam

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

20

Real World Spam Examples• Trained classifiers on texts generated by

markov2 generator

• Checked if the same classifier can be used to detect real spam– Real spam samples from the Yandex Blog Search

–Manually collected 600 automatically generated spam texts

–Measured recall on the real spam samples, and F-measure on previously used test set

• Measured how adding samples of real spam affected classification

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

21

Real World Spam Examples

• Продажа автомашин. Каталог цен! автомобильная

акустика Большой выбор, хорошие цены Модель

Daewoo Nexia представляет собой последнюю

модификацию модели Opel Kadett E. В 1986 году

лицензированное производство этого автомобиля

началось в автомобильный видеорегистратор Купля-

продажа авто в Тюмени Много частных объявлений о

продаже автомобилей в Тюмени на Новые и.

автомобили.19.09.2009

Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains

Text Generator, RCDL'0922

Detecting Real World Spam

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

23

Conclusion

• Proposed a new approach to web spam detection

• Proved the possibility of detection web spam, generated using Markov chains

• Detected spam not in the original training datasets

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

24

The End

• Questions?

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

25

References1. Веб коллекция BY.Web, http://romip.ru/ru/collections/by.web-2007.html.2. Генератор дорвеев Doorway.Su, http://doorway.su/.3. Зеленков Ю.Г., Сегалович И.В., Сравнительный анализ методов определения нечетких

дубликатов для Web-документов // Труды 9ой Всероссийской научной конференции «Электронные библиотеки: перспективные методы и технологии, электронные коллекции» - RCDL’2007, Переславль, Россия, 2007. – Том 1, С. 166-174.

4. Парсер mystem http://company.yandex.ru/technology/mystem/.5. Серверный генератор дорвеев от RUSADULT.com, http://doorways.rusadult.com/ru/.6. Фоменко В.П., Фоменко Т.Г., Авторский инвариант русских литературных текстов, 1981. 7. Чжун Кай-лай, Однородные цепи Маркова. Перев. с англ. — М.: Мир, 1964. — 425 с.8. Яндекс.Поиск по блогам, http://blogs.yandex.ru/.9. Benczúr, A. A., Bíró, I., Csalogány, K. and Sárlós, T. Web Spam Detection via Commercial Intent

Analysis. In Proceedings of the 3rd international workshop on Adversarial Information Retrieval on the Web, Banff, Alberta, Canada, May 8th, 2007. Pages: 89–92.

10. Braslavski P. Document Style Recognition Using Shallow Statistical Analysis. In Proceedings of the ESSLLI 2004 Workshop on Combining Shallow and Deep Processing for NLP, Nancy, France, 2004, p. 1–9.

11. Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., Vigna, S. A Reference Collection for Web Spam. ACM SIGIR Forum Volume 40, Issue 2 (December 2006) Pages: 11–24.

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

26

References12. Dale, E. and J. S. Chall. 1949. “The concept of readability.” Elementary English 26: 23.13. Dubay, W.H.. 2004. The Principles of Readability. Costa Mesa, CA: Impact Information14. Fetterly, D., Manasse, M., Najork, M. Spam, damn spam, and statistics: using statistical analysis

to locate spam web pages. In Proceedings of WebDB’04, New York, USA, 2004.15. Fetterly, D., Manasse, M., Najork, M. Detecting phrase-level duplication on the World Wide Web.

In Proceedings of SIGIR’05, pages 170–177, New York, NY, USA, 2005. ACM.16. Gyöngyi, Z. and Garcia-Molina H., Web Spam Taxonomy. In Proceedings of AIRWeb 2005, May

2005.17. Joachims, T. Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support

Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT-Press, 1999.18. Mishne, G., Carmel, D., and Lempel, R. Blocking blog spam with language model disagreement. In

Proceedings of AIRWeb 2005, May 2005.19. Piskorski, J., Sydow, M., Weiss, D., Exploring Linguistic Features for Web Spam Detection: A

Preliminary Study. In Proceedings of the 4th international workshop on Adversarial Information Retrieval on the Web, Beijing, China, Pages 25-28.

20. Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.21. Urvoy T., Chauveau E., Filoche, P. Tracking Web Spam with HTML Style Similarities. ACM

Transactions on the Web, Vol. 2, No. 1, Article 3.

19.09.2009Anton S. Pavlov, Boris V. Dobrov, Detecting

Web Spam Created With Markov Chains Text Generator, RCDL'09

27