MIA at the 6th Data Science Day Berlin
![Page 1: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/1.jpg)
A Cloud-Based Marketplace for Information and Analyses
Peter Adolphs, Project Manager R&D, Neofonie GmbH 6th Data Science Day, 8 May 2014
![Page 2: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/2.jpg)
Research Project MIA
Data Mining Social Media Monitoring
Data Enrichment
Data Security, Reliable Applications
Data Acquisition & Enrichment, Text Mining
Media Publisher Services
Scalable Database Technologies, Data Cleansing
Real-Time Data Mining
Funding Period: 2012-2014
![Page 3: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/3.jpg)
Who owns the Web?
Image source: ©iStock.com/ahlobystov (Stock Photo: 4619850)
![Page 4: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/4.jpg)
Usage
![Page 5: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/5.jpg)
Usage Scenarios
Market Research
Brand Monitoring
Reputation Management
Internet
Data Extraction
MIA
![Page 6: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/6.jpg)
MIA Platform and Marketplace
Technology (Apps) Stored Queries €
Providers of Application
Providers of Analysis Algorithms
Technology (Algorithms) €
Analysts
Ad-Hoc Questions €
Analysis Results
German-Speaking Web
Data Providers
Data & Aggregation
€
Trust / Certification
Infrastructure: Cloud/Real-Time Processing + Storage
Marketplace: Distribution of Technologies, Purchase of Computing Capacity
Acquisition Cleansing
Enrichment Aggregation
Data Mining Storage
![Page 7: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/7.jpg)
Marketplace
![Page 8: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/8.jpg)
Using MIA as a Service
Neofonie Individual Configuration
Application Developers
Multiple Application Scenarios
Web Annotation Tool (WATT)
ZEITMASCHINE: a Search App for News Archives
Dashboards with Current Data
![Page 9: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/9.jpg)
Using MIA for Ad-Hoc Queries
![Page 10: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/10.jpg)
Using MIA for Providing Data / Applications
![Page 11: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/11.jpg)
Text Analysis
![Page 12: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/12.jpg)
Boilerplate Removal & Document Structure Analysis
Goals
✱ Extract the document core
✱ Remove ads and navigation
✱ Determine document structure
Approach
✱ Determine text core and title using SVMs
✱ Features: text characteristics, linguistic properties, DOM structure, link/anchor structure
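The feature groups listed above can be illustrated with a minimal sketch. It only computes a handful of per-block features; the SVM that would consume them is not shown. The function name `block_features`, the chosen features, and the sample blocks are assumptions made for illustration and are not taken from the MIA implementation.

```python
import re

def block_features(text: str, anchor_text: str, dom_depth: int) -> dict:
    """Compute a few illustrative features for one text block of a web page.

    text: all visible text of the block
    anchor_text: the part of that text that sits inside <a> elements
    dom_depth: depth of the block's node in the DOM tree
    """
    words = re.findall(r"\w+", text)
    n_words = len(words)
    n_chars = len(text)
    return {
        # text characteristics
        "n_words": n_words,
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # crude linguistic cue: running prose tends to end in sentence punctuation
        "ends_with_punct": text.rstrip().endswith((".", "!", "?")),
        # DOM structure
        "dom_depth": dom_depth,
        # link/anchor structure: navigation blocks consist mostly of link text
        "link_density": len(anchor_text) / n_chars if n_chars else 0.0,
    }

# Hypothetical blocks: a navigation bar and a sentence of article text.
nav = block_features("Home Sport Politik Kontakt", "Home Sport Politik Kontakt", 4)
body = block_features("Yahoo trennt sich von CEO Scott Thomson.", "", 7)
print(nav["link_density"], body["link_density"])
```

A classifier trained on such vectors would presumably learn that high link density and missing sentence punctuation point to navigation or ads rather than to the document core.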
![Page 13: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/13.jpg)
Named Entities
✱ Recognition of Names for Real-World Entities like
✱ People
✱ Locations
✱ Organizations
✱ Products
✱ ...
✱ Named Entity Recognition (NER)
![Page 14: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/14.jpg)
References for Names
✱ Names alone are not really useful
✱ The entity type alone is often not enough either
✱ We want a reference (e.g. a URI) with respect to some world model
Image sources: 1) Bundesarchiv, B 145 Bild-F074398-0021 / Engelbert Reineke / CC-BY-SA 3.0 Germany. Shortlink: http://goo.gl/hTzdkH 2) Brassica oleracea convar. capitata var. alba, spitskool (2).jpg by user Rasbak / CC-BY-SA 3.0 Unported. Shortlink: http://goo.gl/IQhqQC
![Page 15: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/15.jpg)
Knowledge Bases
![Page 16: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/16.jpg)
Ambiguities
“Peter Müller”
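One simple way to resolve such an ambiguous name to a reference is to compare the words around the mention with context words stored for each candidate. The sketch below is a toy illustration under that assumption, not the disambiguation method used in MIA. The knowledge base, the `example.org` URIs, and the context words are all invented.

```python
# Toy knowledge base: one surface name, several candidate referents.
# URIs and context words are invented for illustration.
KNOWLEDGE_BASE = {
    "Peter Müller": [
        {"uri": "http://example.org/entity/peter_mueller_politician",
         "context": {"politiker", "cdu", "landtag", "minister"}},
        {"uri": "http://example.org/entity/peter_mueller_skier",
         "context": {"ski", "abfahrt", "weltcup", "rennen"}},
    ],
}

def link_entity(name: str, sentence: str):
    """Return the URI of the candidate whose context words overlap most
    with the sentence, or None if the name is unknown or nothing overlaps."""
    candidates = KNOWLEDGE_BASE.get(name, [])
    words = set(sentence.lower().split())
    best_uri, best_overlap = None, 0
    for cand in candidates:
        overlap = len(cand["context"] & words)
        if overlap > best_overlap:
            best_uri, best_overlap = cand["uri"], overlap
    return best_uri

print(link_entity("Peter Müller", "Peter Müller gewinnt das Rennen im Weltcup"))
print(link_entity("Peter Müller", "Der CDU Politiker Peter Müller"))
```

Returning `None` when no context word matches keeps the linker from guessing; a real system would more likely fall back to a prior such as candidate popularity.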
![Page 17: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/17.jpg)
Conditional Random Fields for NER
✱ Supervised machine learning: requires labeled training data
✱ Sequence learning method
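To make "labeled training data" for a sequence learner concrete, here is a minimal sketch of one training sentence with one label per token, using the example sentence from the later slides. The slides do not say which label scheme was used; the BIO encoding and the helper `bio_to_spans` are assumptions for illustration. Training the CRF itself is not shown.

```python
# One labeled training sentence in BIO encoding:
# B-X starts an entity of type X, I-X continues it, O is outside any entity.
tokens = ["Yahoo", "trennt", "sich", "von", "CEO", "Scott", "Thomson"]
labels = ["B-ORG", "O", "O", "O", "O", "B-PERSON", "I-PERSON"]

def bio_to_spans(tokens, labels):
    """Collect (entity_text, entity_type) pairs from a BIO-labeled sentence."""
    spans, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # a new entity starts; close the previous one if still open
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_type:
            current.append(token)
        else:
            # an O label (or an inconsistent I- label) ends the open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

print(bio_to_spans(tokens, labels))
# prints [('Yahoo', 'ORG'), ('Scott Thomson', 'PERSON')]
```

Because every token gets exactly one label, recognition becomes a sequence labeling problem, which is the setting linear-chain CRFs are designed for.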
![Page 18: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/18.jpg)
Dependency Parsing
Token: Yahoo trennt sich von CEO Scott Thomson (English: Yahoo parts ways with CEO Scott Thomson)
Lemma: Yahoo trennen sich von CEO Scott Thomson
NE: Yahoo = ORG, Scott Thomson = PERSON
![Page 19: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/19.jpg)
Relation Extraction
Token: Yahoo trennt sich von CEO Scott Thomson
Lemma: Yahoo trennen sich von CEO Scott Thomson
NE: Yahoo = ORG, Scott Thomson = PERSON
![Page 20: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/20.jpg)
Sentiment Analysis
✱ Goal: determine positive or negative sentiments
✱ Basis: SentiWS (sentiment lexicon of Leipzig University; freely available for research)
✱ Simple approach: sentiment of a sentence = average of the sentiment weights of its words
Lemma: Stem|PoS Sentiment
Abhängigkeit|NN -0.3653
abfällig|ADJX -0.3197
abgedroschen|ADJX -0.1839
absolut|ADJX 0.2418
Ablehnung|NN -0.5118
Ablenkung|NN -0.0435
Anerkennung|NN 0.0855
anspruchsvoll|ADJX 0.2216
Freispruch|NN 0.0040
Freude|NN 0.6502
Freund|NN 0.0116
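The simple averaging approach can be sketched with the lexicon excerpt above. The weights are copied from the slide. The function name `sentence_sentiment` and the decision to average only over lemmas found in the lexicon (rather than over all words of the sentence) are assumptions; the slide does not specify this detail.

```python
# Weights copied from the SentiWS excerpt on the slide (lemma -> weight).
SENTIMENT = {
    "Abhängigkeit": -0.3653, "abfällig": -0.3197, "abgedroschen": -0.1839,
    "absolut": 0.2418, "Ablehnung": -0.5118, "Ablenkung": -0.0435,
    "Anerkennung": 0.0855, "anspruchsvoll": 0.2216, "Freispruch": 0.0040,
    "Freude": 0.6502, "Freund": 0.0116,
}

def sentence_sentiment(lemmas):
    """Average the weights of the lemmas found in the lexicon.
    Returns 0.0 if no lemma carries a weight."""
    weights = [SENTIMENT[lemma] for lemma in lemmas if lemma in SENTIMENT]
    return sum(weights) / len(weights) if weights else 0.0

# Hypothetical lemmatized sentences.
print(sentence_sentiment(["die", "Freude", "sein", "absolut"]))
print(sentence_sentiment(["Ablehnung", "und", "Abhängigkeit"]))
```

The input is expected to be lemmatized, since the lexicon is keyed by lemma rather than by inflected form.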
![Page 21: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/21.jpg)
Text Analysis Components
Topic Classification
Sentence Segmentation
Boilerplate Removal
PoS-Tagging
Tokenization
Lemmatization
Subjectivity Recognition
Dependency Parsing
NER & NERD
Quote Recognition
Relation Extraction
Brand Monitoring
Data Extraction
Reputation Management
Market Research
Sentiment Analysis
![Page 22: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/22.jpg)
An Application
![Page 23: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/23.jpg)
Are there Political Tendencies in the Coverage of German Online Media?
![Page 24: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/24.jpg)
Selected Data
![Page 25: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/25.jpg)
Some Statistics
Investigation Period: 12 Months
>18,000,000 Documents
6,543 Politicians
![Page 26: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/26.jpg)
Method
✱ Recognize names using named entity recognition, with reference linking and disambiguation
✱ Join the recognized references with a Freebase subset of German politicians and their parties
✱ Aggregate and count
✱ Inspect & Visualize in Excel
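The join and aggregation steps can be sketched in a few lines. This is an illustration under stated assumptions, not the project's pipeline: the `mentions` records, the `kb:` identifiers, and the `party_of` mapping are hypothetical stand-ins for the linked NER output and the Freebase subset.

```python
from collections import Counter

# Hypothetical output of the NER + linking step: (news source, entity URI).
mentions = [
    ("source-a", "kb:angela_merkel"),
    ("source-a", "kb:peer_steinbrueck"),
    ("source-b", "kb:angela_merkel"),
    ("source-b", "kb:some_athlete"),  # not a politician, dropped by the join
]

# Hypothetical knowledge-base subset: politician URI -> party.
party_of = {
    "kb:angela_merkel": "CDU",
    "kb:peer_steinbrueck": "SPD",
}

def party_shares(mentions, party_of):
    """Join mentions with the politician table and return each party's
    share of all politician mentions."""
    counts = Counter(party_of[uri] for _, uri in mentions if uri in party_of)
    total = sum(counts.values())
    return {party: n / total for party, n in counts.items()} if total else {}

print(party_shares(mentions, party_of))
```

Filtering `mentions` by source before calling `party_shares` would give per-outlet shares of the kind compared on the following slides.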
![Page 27: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/27.jpg)
Mentions of Politicians
184 online portals, 1 Aug 2012 to 31 Jul 2013
12%: Chancellor Angela Merkel
[Bar chart: share of mentions (axis 0% to 14%) for Angela Merkel, Peer Steinbrück, Philipp Rösler, Wolfgang Schäuble, Horst Seehofer]
![Page 28: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/28.jpg)
Media Coverage on Average
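63% about current coalition
[Pie chart: share of mentions by party. Legend: CDU, CSU, SPD, Grüne, FDP, Linke, Piraten, NPD, Übrige (other). Slice labels in extraction order: 7.92%, 39.85%, 8.72%, 2.15%, 14.26%, 0.12%, 0.01%, 26.95%]
184 online portals, 1 Aug 2012 to 31 Jul 2013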
![Page 29: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/29.jpg)
Media Coverage in Particular News Sources
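[Three pie charts: share of mentions by party per news source. Legend: CDU, CSU, SPD, Grüne, FDP, Linke, Piraten, NPD, Übrige (other). Slice labels in extraction order:]
Bild: 7%, 39%, 7%, 2%, 13%, 32%
die tageszeitung: 15%, 34%, 6%, 5%, 9%, 30%
Neues Deutschland: 8%, 25%, 6%, 20%, 7%, 34%
184 online portals, 1 Aug 2012 to 31 Jul 2013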
![Page 30: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/30.jpg)
Sentiments in Online Media
Averaged over all mentions in all news articles
[Chart: average sentiment per party for B90/Grüne, CDU, CSU, Linke, FDP, NPD, Piraten, SPD]
![Page 31: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/31.jpg)
Conclusions
![Page 32: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/32.jpg)
Summary & Conclusions
✱ Processing Web-scale amounts of textual data is a real challenge
✱ It requires the right tools, data and infrastructure
✱ MIA sketches a marketplace and execution platform that lets users, in essence, apply SQL to the Web
✱ The marketplace allows algorithm developers and data providers to share (and monetize) their assets
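What "applying SQL to the Web" might look like for the politician study can be sketched with an in-memory SQLite database. This is a stand-in only: the slides do not show MIA's query interface, and the table names, columns, and rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: linked entity mentions and a politician lookup.
    CREATE TABLE mention (source TEXT, entity TEXT);
    CREATE TABLE politician (entity TEXT PRIMARY KEY, party TEXT);
    INSERT INTO mention VALUES
        ('source-a', 'kb:angela_merkel'),
        ('source-a', 'kb:peer_steinbrueck'),
        ('source-b', 'kb:angela_merkel');
    INSERT INTO politician VALUES
        ('kb:angela_merkel', 'CDU'),
        ('kb:peer_steinbrueck', 'SPD');
""")

# Count mentions per party: a join followed by an aggregation.
rows = conn.execute("""
    SELECT p.party, COUNT(*) AS n
    FROM mention AS m
    JOIN politician AS p ON p.entity = m.entity
    GROUP BY p.party
    ORDER BY n DESC, p.party
""").fetchall()
print(rows)
# prints [('CDU', 2), ('SPD', 1)]
```

The point of the sketch is the shape of the query: once text analysis has turned pages into rows, questions about coverage reduce to ordinary joins and aggregations.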
![Page 33: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/33.jpg)
Monday, May 26, 2014, 7 PM / 19:00
Data Talk
Common Crawl meets MIA:
Gathering and Crunching Open Web Data
http://ow.ly/wC1Fm
CINIQ, Einsteinufer 37, Berlin, Tickets on Eventbrite
Save the Date!
![Page 34: MIA at the 6th Data Science Day Berlin](https://reader033.fdocuments.in/reader033/viewer/2022051514/5495fd71b479595b4d8b4e98/html5/thumbnails/34.jpg)
Peter Adolphs Project Manager R&D [email protected] T: +49 30 246 27 525
Neofonie GmbH Robert-Koch-Platz 4 10115 Berlin www.neofonie.de
Thank You For Your Attention!