Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I...

75
Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document I Netspeak Query Log Analysis I Informative Linguistic Knowledge Extraction from Wikipedia I Elastic Search and the Clueweb I Passphone Protocol Analysis with Avispa I Beta Web I SimHash as a Service: Scaling Near-Duplicate Detection I One Class Classification of Vandalism in the Wikipedia

Transcript of Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I...

Page 1: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Webis Student Presentations WS2014/15

I Argumentation Analysis in Newspaper Articles

I Morning Morality

I The Super-document

I Netspeak Query Log Analysis

I Informative Linguistic Knowledge Extraction from Wikipedia

I Elastic Search and the Clueweb

I Passphone Protocol Analysis with Avispa

I Beta Web

I SimHash as a Service: Scaling Near-Duplicate Detection

I One Class Classification of Vandalism in the Wikipedia

Page 2: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Modeling Information Extraction Problems using Argumentation

Theory

Speakers:

Philip Drewes Jonas Köhler

1

Page 3: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Motivation:

○ Opinion mining

○ Summaries of large texts

○ Rating the validity of arguments in texts

○ Search for arguments for a given hypothesis

⇒ We want to have a computable model of argumentation for human language.

2

Page 4: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

A computational model of argumentation

We should open our borders!

Open borders will be a threat to inner security.

Our economy will benefit from new workers.

Studies show there is no correlation between immigration and crime rate

nodes: argumentative units (Claims, Premises)

arcs: relations between arguments(Attacks, Supports, ...)

Questions:

When do arguments contradict?

How are arguments related?

What are important arguments?3

directed graph

Page 5: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

A computational model of argumentation

Searching for arguments involves the task of detecting them

Classification:

Is a part of a text an argumentative unit? ⇒ binary { yes, no }

What type of argumentative unit? ⇒ nominal { claim, premise, ... }

Are two argumentative units related? ⇒ binary { yes, no }

What type of relation is it? ⇒ nominal { attack, support, ... }

⇒ Supervised learning problem

⇒ which features? 4

Page 6: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

A computational model of argumentation

Features (mostly NLP based):

Lexical: number of punctuation marks in a part of text

Syntactic: depth of the parse tree (linguistics)

Indicators: are discourse marker present?

Contextual: number of sub clauses in the sentences around the part of interest

Heavy use of the Stanford NLP Java library:

⇒ training data?

⇒ human annotation!5

Page 7: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Creating a corpus for an argument classifierAnnotation:

Humans will annotate argumentative texts by hand.

The texts are taken from online newspapers (opinion section).

The tool for annotation is web-based. The annotations are saved to XML files.

Question:

Don’t we need 1000s of annotations?

Who will do all this work?

⇒ Crowdsourcing!

6

Page 8: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Outlook:What we have done so far:

● Implementing a classification framework, which is

○ Calculating the feature vectors○ Reproducing the state of the art in classification

■ Stab et al.1 achieve ~72% precision on an essay corpus■ We are able to achieve ~68%

● Gathering the text data (automated web scraping)

● Designing the annotation job for the digital crowd.

71 Stab C., Gurevych, I., Identifying Argumentative Discourse Structures in Persuasive Essays Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), p. 46-56, Association for Computational Linguistics, October 2014.

Page 9: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Outlook:What we will do until February 2015 / What may come in the long term

● Let the crowd annotate our texts and build the training corpus

● Add additional features and improve the classification○ Extend the model? Refine the classification?

● Analyze the data○ Which questions may arise?

● Search for argument components○ Only possible if there is a good model + classification

8

Page 10: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Thank you for your listening!

Questions?

9

Page 11: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Morning Morality on the Web

Webis presentation2014-12-18

Page 12: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Morning Morality on the Web-

foundation

Project foundation and discussion starter:

• Kouchaki, Maryam, and Isaac H. Smith. "The Morning Morality Effect The Influence of Time of Day on Unethical Behavior." Psychological science 25 (2013):95-102

Content:• People's ethical behaviour is changing throughout the day.• There is a „self-regulatory“ resource, which depletes the longer someone

is behaving good.• Therefore, a person is more likely to cheat and lie in the afternoon or

evening than in the morning.

Page 13: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Morning Morality on the Web-

previous work

Is such a phenomenon measurable on the Web?

• In an effort to show such an effect, Wikipedia-Vandalism cases where analyzed.

What is Wikipedia-Vandalism?

Inappropriate change, addition or removal of Wikipedia content, like adding irrelevant, abusive words, deleting pages or purposely adding false information.

Page 14: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Morning Morality on the Web-

Wikipedia Vandalism

How to get Wikipedia-Vandalism data?

• Scan through the history of edits for a Simple Vandalism Pattern.• A revert back to a revision befor an edit(V) is most often a case of

vandalism.• Detection is done by users and bots.

Page 15: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Morning Morality on the Web-

previous work

Finding the „Morning Morality Effect“ in Wikipedia-Vandalism data.Work of the previous project group:

• Analyzing correlations between local time and vandalism.• Geolocation of vandal - IP addresses for local edit time.

Page 16: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Morning Morality on the Web-

current work

Finding more correlation between bad behaviour on the Web and exogenous/external factors, e.g. , weather, time and region.

What we have done so far/are working on:

• Geolocate the given vandalism and normal edit dumps of the United States for 2013.

• Correleated them with the NOAA National Weather Service data (hourly weather data from 1.700 weather stations in the US over the last 15 years).

Page 17: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Morning Morality on the Web-

current work

Early data -Work still in progress :

Page 18: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Morning Morality on the Web-

future work

• Analyze data for different Climate Zones and weather effects like rain and snow.

• Changing vandalism frequency in correlation with weather over time, e.g., annual and monthly time periods

• Different locations: comparisons of different states, rural and metropolitan areas.

Page 19: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

The Super DocumentA Result Presentation Paradigm for Exploratory Search Tasks

Participants:Kevin Reinartz, Janek Bevendorff,

Kristof Komlossy, Carsten Tetens, Sebastian Gottschlich

Tim Gollub Michael Völske Benno Stein

Web Technology & Information SystemsBauhaus-Universität WeimarWinter Term 2014/15

1 © webis 2014

Page 20: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Project: SuperSERPWeb Search: Problem Description

Generate SearchResult Page

query

x 103

Given the relevant documents for a query, how to present them to the user?

2 © webis 2014

Page 21: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Project: SuperSERPTraditional Presentation Paradigm: Ranked Result Lists

Weimar

1.

3.

2.

. . .

Search

q Compile a list of document descriptions linking to the original resources.q Order based on the likelihood that a document contains relevant information.

3 © webis 2014

Page 22: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Project: SuperSERPGeneral

q Alternative result presentation paradigms for open or undirectedinformational queries

q General Approach: Increase accessibility of resources in the limit of asearch result list.

q Observation: An effective domain independent paradigm is hard to find.

q We concentrate on two applications:

– Related Work Search– City Search

4 © webis 2014

Page 23: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Application 1: Related Work SearchCurrent State

q LUCENE Index of webis-csp corpus (approx. 177.000 papers)

q Keyphrase extraction (KpMinerExtractor from aitools)

q MUSTACHE Template-Engine for search result presentation

q Search result based on keyphrases (currently)

5 © webis 2014

Page 24: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Application 1: Related Work SearchCurrent user interface

6 © webis 2014

Page 25: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Application 1: Related Work SearchFuture Work

q Improved user interaction:

– Query by document– Manual “topic points”– Quality of clustering statistics

q Topic Model for Indexing & keyquery compositing

q Efficient clustering algorithm for outline generating

7 © webis 2014

Page 26: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Application 2: City SearchCurrent State

q collected Google Placesq using Bigdata as triple store (replacing Fuseki)q read Google Places as RDF triples into triple storeq generated random people at random locations

CityBricks:

q each place is a brickq sorted from north to southq highlight on search & similarity

8 © webis 2014

Page 27: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Application 2: City SearchCityBricks

9 © webis 2014

Page 28: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Application 2: City SearchCityTales

q take the user on a journey through the cityq create a mashup using content & statisticsq streets from a city + Random users and locationsq from various sources (Google Places, Flickr ...)

10 © webis 2014

Page 29: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Application 2: City SearchFuture Work

q Improve storytelling, infographic inspired UI

q Add sources like news, official statistics, social network, reviews

q Use focused crawling (Heritrix) to obtain web pages related to Weimar.

11 © webis 2014

Page 30: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Netspeak Query Log AnalysisAmir Othman

cvs:

code-in-progress/webisstud/wstud-netspeak-analysiscode-in-progress/webisstud/wstud-netspeak-analysis-query-detectioncode-in-progress/webisstud/wstud-netspeak-analysis-query-browser

data-in-progress/wstud-netspeak-analysis

Page 31: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

● Service to check usage of words● ~2000 Users a month● Log from March 2009 to February 2014

Page 32: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Query Detection

● Decision Tree, using log from 100 different IPs as groundtruth

● Features: overlapping characters, term overlap, character Jaccard coefficient, trigram character cosine similarity, Levenshtein distance, timegap

Page 33: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Netspeak Query Log Browser

● Facilitate analysis – added visualizations and interlinking

● Exploring● Add Notes

Page 34: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document
Page 35: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document
Page 36: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document
Page 37: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document
Page 38: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Ideas

● Learning effect● Identifiable user

Page 39: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Informative Linguistic Knowledge

Extraction from Wikipedia

Roxanne El Baff (1st Semester CSM Student)

Supervisior :Khalid El Khatib

Page 40: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Wikipidea and JWPL

Wikiperdia

JWPL (Java

Wikipedia

Library)

Natural

Language

Processing

High quality, up

to date

knowledge base

Page

Category

...

Title

Content

Links

...

Page 41: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Measuring Term Informativeness

Term Informativeness Measurments

Statistic Semantic

Term Frequency Document Frequency Semantic Relatedness

Measure the importance of a term based on Its context Importance of the term (Statistic) Importance of its context (How strong is the relation between term and context (Semantic Relatedness )

Context-Aware Term

Informativeness

Page 42: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Elasticsearch and the CluewebA Work-in-Progress Presentation

Janek Bevendorff

Web Technology & Information SystemsBauhaus-Universität Weimar

1 © webis 2014

Page 43: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Elasticsearch and the CluewebWhat is the Clueweb?

some data:

q web crawl of 1,040,809,705 documentsq 5TB of compressed data (25TB uncompressed)q 4,780,950,903 unique URLsq tons of spam

2 © webis 2014

Page 44: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Elasticsearch and the CluewebAnd what do we do with it?

3 © webis 2014

Page 45: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Elasticsearch and the CluewebNew backend: Elasticsearch

Elasticsearch is a. . .

q distributed and redundant Lucene indexq (RESTful) search server

Optionally, Elasticsearch comes with Hadoop integration for indexing largeamounts of data and performing real-time search on HDFS clusters.

4 © webis 2014

Page 46: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Elasticsearch and the CluewebChatnoir 2

5 © webis 2014

Page 47: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Elasticsearch and the CluewebFuture Work

q index the whole ClueWeb12 and ClueWeb09 datasets on our brandnewBetaweb cluster

q use more fields (title, URL, anchor texts etc.) for weighted searchq some more frontend magic

6 © webis 2014

Page 48: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Elasticsearch and the Clueweb

Thank you for your attention!

7 © webis 2014

Page 49: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Passphone Protocol Analysis with Avispa

Andre Karge

17. Dezember 2014

Andre Karge Avispa 17. Dezember 2014

Page 50: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Agenda

1 Passphone Protocol

2 AVISPA

Andre Karge Avispa 17. Dezember 2014

Page 51: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Passphone Protocol

Passphone Protocol

Protocol for two factor authentication at a service providerFactors:

Password as usualSmartphone

User enters his passwordGets a QR-Code in returnScans the QR-Code with his registered smartphone appAfter success the user is logged in

Andre Karge Avispa 17. Dezember 2014

Page 52: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Passphone Protocol

Passphone Protocol

Protocol for two factor authentication at a service providerFactors:

Password as usualSmartphone

User enters his passwordGets a QR-Code in returnScans the QR-Code with his registered smartphone appAfter success the user is logged in

Andre Karge Avispa 17. Dezember 2014

Page 53: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Passphone Protocol

In protocol: several communications with different parties:Service Provider (e.g. Facebook, Ebay, Amazon, ...)Trusted Third Party ServerUser at a browserUser at his smartphone

Communication save?

Andre Karge Avispa 17. Dezember 2014

Page 54: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

AVISPA

AVISPA

Approach: automatic proofing of the protocol with AVISPAAVISPA = Automated Validation of Internet Security Protocols andApplicationsProtocol has to be translated into special language HLPSLHLPSL = High Level Protocol Specification Language

Andre Karge Avispa 17. Dezember 2014

Page 55: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

AVISPA

AVISPA

Approach: automatic proofing of the protocol with AVISPAAVISPA = Automated Validation of Internet Security Protocols andApplicationsProtocol has to be translated into special language HLPSLHLPSL = High Level Protocol Specification Language

Andre Karge Avispa 17. Dezember 2014

Page 56: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

AVISPA

AVISPA Function

Abbildung : Normal Case: 1 Request & 1 Response.

Andre Karge Avispa 17. Dezember 2014

Page 57: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

AVISPA

Possible to choose the prooferoutput afeter proofing depends on criterias set in the hlspl file(e.g. security of a nonce)Proofer checks if the given protocol is safe or if notIf a Protocol is not safe the proofer gives an attack trace

Andre Karge Avispa 17. Dezember 2014

Page 59: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

135 Server 135 Server

= 27 x = 27 x

Page 60: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Disk SpaceDisk Space

2160 TB2160 TB

Page 61: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

10GbE 10GbE NetworkNetwork

Page 62: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

1620 Cores1620 Cores8.64 TB RAM8.64 TB RAM

Page 63: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Network BootNetwork Boot

Page 64: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Setup viaSetup viaConfiguration Configuration ManagementManagement

Page 65: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Current StatusCurrent Status

Page 66: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

SimHash as a ServiceScaling Near-Duplicate Detection

Jan Graßegger

Page 67: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Near-Duplicates

2

Page 68: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

SimHash [Cha02]

• Locality-Sensitive Hash

• embeds document text into a 64-bit hash

• correlates with Cos-Similarity

3

Page 69: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

SimHash as a Service

Searching for near-duplicates over a web service

• corpus: ClueWeb12 (over 700M docs)

• response time: < 1 second

• search tables allow fast candidate retrieval [MJS07]

• works with aitools-invertedindex3

4

Page 70: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Bibliography

[Cha02] Moses Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montr eal, Quebec, Canada, Seiten 380–388, 2002.

[MJS07] Gurmeet Singh Manku, Arvind Jain und Anish Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, WWW’07, Banff, Alberta, Canada, May 8-12, 2007, Seiten 141–150, 2007.

5

Page 71: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

One class classificationof vandalism in the

wikipedia

1

Speaker:

Jonas Köhler

Page 72: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

The classification problem:

Classify edits of wikipedia entries into regular edits and vandalism edits.

- Currently he is the Chairman of the [[World of Labor Institute]].

+ Currently he is the Chairman of the [[World of Labor Institute]], and wants to breed an army of termites to claim world domination..

The corpora:

PAN WVC 2010 and PAN WVC 20111 (humanly annotated edits: vandalism and regular)

PAN WVC 20102394 vandalism entries ⇒ imbalanced classes...30045 regular entries

Features: 54 few meta-data, few linguistic data ⇒ dimensionality will grow!

21 Martin Potthast. Crowdsourcing a Wikipedia Vandalism Corpus. In Fabio Crestani et al, editors, 33rd International ACM Conference on Research and Development in Information Retrieval (SIGIR 10), pages 789-790, July 2010. ACM. ISBN 978-1-4503-0153-4

Page 73: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

One class classification

Train a model with data of the positive class only.

The model shall detect if a data vector is positive or an outlier from this class.

Useful if: the negative class is hard to describe with feature model

the negative class is difficult to sample

the class cardinality is very imbalanced

⇒ There are two ways Wikipedia vandalism detection can be seen as a OCC:

1) vandalism can be modelled with features positive = vandalismregular entries probably can’t

2) a lot more regular entries exist positive = regularannotation of vandalism entries is expensive

3

Page 74: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Outlook

What we have tried:

applying two standard implementations (libsvm)

applying a method intended for high dimension OCC (based on Random Forest 1)

Results

standard implementations do not work on PAN-WVC-2010 and PAN-WVC-2011

there is a lot of research on OCC, but only few implementations of methods are available

implementing the methods by our own is not feasible

How we want to proceed now:

continue with the work on the features (meta-data, NLP, …)

analyze the „hard cases“ (there are ~280 entries which are always bad in recall)41 Chesner Désir, Simon Bernard, Caroline Petitjean, Heutte Laurent. One class random forests.Pattern Recognition, Elsevier, 2013, 46, pp.3490-3506.

Page 75: Webis Student Presentations WS2014/15 · 2020-05-22 · Webis Student Presentations WS2014/15 I Argumentation Analysis in Newspaper Articles I Morning Morality I The Super-document

Thank you!

Questions?

5