Open Data & News Analytics #2
Presentation Outline – PART I
• Quick news-analytics case
• Technology approach
• FactForge-News: Data architecture
• Sample queries on Linked Open Data
• News analytics examples
• Today’s News Map
Mar 2016
Quick news-analytics case
• Our Dynamic Semantic Publishing platform already offers linking of text with big open data graphs
• One can navigate from text to concepts and get trends, related entities and news
• Try it at http://now.ontotext.com
Our approach to Big Data
1. Integrate relevant data from many sources
− Build a Big Knowledge Graph from proprietary databases and taxonomies integrated with millions of facts of Linked Data
2. Infer new facts and unveil relationships
− Perform reasoning across data from different sources
3. Interlink text with big data
− Use text-mining to automatically discover references to concepts and entities
4. Use a NoSQL graph database for metadata management, querying and search
NoSQL Graph Database
[Figure: example RDF graph. Instances: myData:Maria, myData:Ivan. Classes: ptop:Agent, ptop:Person, ptop:Woman. Properties: ptop:childOf, ptop:parentOf, owl:relativeOf. Schema links: rdf:type, rdfs:range, rdfs:subPropertyOf, owl:inverseOf, owl:SymmetricProperty. Some edges are marked as inferred.]
• The hottest NoSQL trend
• W3C standards
• Efficient Data Integration
− Using logical inference
− For data integration and BI
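The logical inference mentioned on this slide can be illustrated with a small forward-chaining sketch in Python. It is a toy illustration only, not how GraphDB implements reasoning. The triples are assumptions modelled on the Maria/Ivan diagram (including the choice that Ivan is the child and the `ptop:relativeOf` name), and only three rule families are covered: `owl:inverseOf`, `rdfs:subPropertyOf` and `owl:SymmetricProperty`.

```python
INVERSE_OF = "owl:inverseOf"
SUB_PROPERTY_OF = "rdfs:subPropertyOf"
TYPE = "rdf:type"
SYMMETRIC = "owl:SymmetricProperty"


def infer(triples):
    """Apply the three rules repeatedly until no new triple appears."""
    facts = set(triples)
    while True:
        # Collect the schema statements currently known.
        inverse = {(a, b) for a, p, b in facts if p == INVERSE_OF}
        inverse |= {(b, a) for a, b in inverse}  # inverseOf works both ways
        sub = {(a, b) for a, p, b in facts if p == SUB_PROPERTY_OF}
        symmetric = {a for a, p, b in facts if p == TYPE and b == SYMMETRIC}

        new = set()
        for s, p, o in facts:
            new |= {(o, q, s) for p1, q in inverse if p1 == p}
            new |= {(s, q, o) for p1, q in sub if p1 == p}
            if p in symmetric:
                new.add((o, p, s))
        if new <= facts:  # fixpoint reached
            return facts
        facts |= new


# Assumed explicit statements, loosely following the slide's diagram.
explicit = [
    ("ptop:childOf", INVERSE_OF, "ptop:parentOf"),
    ("ptop:parentOf", SUB_PROPERTY_OF, "ptop:relativeOf"),
    ("ptop:relativeOf", TYPE, SYMMETRIC),
    ("myData:Ivan", "ptop:childOf", "myData:Maria"),
]

inferred = infer(explicit) - set(explicit)
for triple in sorted(inferred):
    print(triple)
```

From one explicit fact about Ivan and three schema statements, the sketch derives that Maria is a parent of Ivan and that the two are relatives of each other, which is the sense in which inference helps integrate data without storing every statement by hand.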
Analyzing Text
• Full spectrum of NLP weaponry
• Semantic indexing
− Tag references with entity IDs
− Generate semantic metadata descriptions of documents
− Store metadata in GraphDB
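The semantic indexing steps can be sketched as follows. This is a deliberately simplified dictionary lookup, not the platform's text-mining pipeline: a real system also handles name variations and disambiguation. The gazetteer, the function names and the sample sentence are all hypothetical, and the final step of storing the metadata in GraphDB is left out.

```python
import re

# Hypothetical gazetteer: surface form -> entity ID (DBpedia-style URIs).
GAZETTEER = {
    "IBM": "http://dbpedia.org/resource/IBM",
    "Google": "http://dbpedia.org/resource/Google",
    "London": "http://dbpedia.org/resource/London",
}


def tag_mentions(text, gazetteer=GAZETTEER):
    """Return (start, end, surface, uri) for each gazetteer hit, in text order."""
    # Longest names first so that longer surface forms win over their prefixes.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return [
        (m.start(), m.end(), m.group(), gazetteer[m.group()])
        for m in pattern.finditer(text)
    ]


def document_metadata(doc_id, text):
    """Build a semantic metadata description of one document."""
    mentions = tag_mentions(text)
    return {
        "document": doc_id,
        "mentions": mentions,
        "entities": sorted({uri for *_, uri in mentions}),
    }


sample_text = "IBM and Google opened offices in London."
metadata = document_metadata("news-1", sample_text)
print(metadata["entities"])
```

Each mention keeps its character offsets, so the tags can link a span of text back to a node in the knowledge graph.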
The Web of Linked Data in 2007
[Figure: Linked Open Data cloud diagram, 2007. Callouts on individual datasets: structured database version of Wikipedia; database of all locations on Earth; product reviews; semantic synonym dictionary.]
Note: Each bubble represents a dataset. Arrows represent mappings across datasets; e.g. dbpedia:Paris owl:sameAs geo:2988507
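One way to picture what owl:sameAs mappings provide is to group identifiers into equivalence classes. The sketch below uses a small union-find in Python. Only the `dbpedia:Paris` / `geo:2988507` pair comes from the note above; the other identifiers are made up for illustration, and this is not a description of how any particular triple store handles sameAs.

```python
def same_as_classes(pairs):
    """Group identifiers connected by sameAs links into equivalence classes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())


links = [
    ("dbpedia:Paris", "geo:2988507"),      # from the slide's note
    ("geo:2988507", "ex:Paris_France"),    # hypothetical third dataset
    ("ex:Berlin_A", "ex:Berlin_B"),        # hypothetical, unrelated pair
]
classes = same_as_classes(links)
print(classes)
```

Once identifiers are grouped this way, a fact stated about any member of a class can be read as a fact about all of them, which is what lets a query combine properties from different datasets.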
The Web of Linked Data is Gaining Mass
• 2013 stats: 2 289 public datasets
− http://stats.lod2.eu/
• Growing exponentially
− see the dotted trend line
• Structured markup
− Schema.org; semantic SEO
• Enables better semantic tagging!
− As there are more concepts and richer descriptions to refer to
Linked Data Datasets (chart data)
Year    Datasets
2007    27
2008    43
2009    89
2010    162
2011    295
2012    822
2013    2,289
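The "growing exponentially" claim can be checked against the dataset counts charted on this slide. The short Python sketch below computes the year-over-year growth factors and the average annual factor over the whole period; the variable names are illustrative.

```python
# Dataset counts per year, as read from the slide's chart.
datasets = {2007: 27, 2008: 43, 2009: 89, 2010: 162, 2011: 295, 2012: 822, 2013: 2289}

years = sorted(datasets)

# Growth factor of each year relative to the previous one.
growth = {y: datasets[y] / datasets[p] for p, y in zip(years, years[1:])}

# Average annual growth factor (geometric mean over the whole period).
span = years[-1] - years[0]
average_factor = (datasets[years[-1]] / datasets[years[0]]) ** (1 / span)

for year, factor in growth.items():
    print(f"{year}: x{factor:.2f}")
print(f"average: x{average_factor:.2f} per year")
```

Every yearly factor is well above 1 and the average is close to a doubling per year, which is consistent with an exponential trend line.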
The FactForge Data
• DBpedia (the English version only): 496M statements
• Geonames: 150M statements
− SameAs links between DBpedia and Geonames: 471K statements
• NOW data – metadata about news: 128M statements
• Total size: 938M statements
− 656M explicit statements + 281M inferred statements
− RDFRank and geo-spatial indices enabled to allow for ranking and efficient geo-region constraints
News Metadata
• Metadata from Ontotext’s Dynamic Semantic Publishing platform
− Automatically generated as part of the NOW.ontotext.com semantic news showcase
• News corpus from Google since Feb 2015, about 10k news articles per month
• ~70 tags (annotations) per news article
• Tags link text mentions of concepts to the knowledge graph
− Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases
News Metadata
Category                  Count
International             52 074
Science and Technology    23 201
Sports                    20 714
Business                  15 155
Lifestyle                 11 684
Total                     122 828

Mentions / entity type    Count
Keyphrase                 2 589 676
Organization              1 276 441
Location                  1 260 972
Person                    1 248 784
Work                      309 093
Event                     258 388
RelationPersonRole        236 638
Species                   180 946
Class Hierarchy Map (by number of instances)
Left: The big picture
Right: dbo:Agent class (2.7M organizations and persons)
Sample queries
• There is a rich set of sample queries that allow exploration of this combination of DBpedia, GeoNames and news metadata
• We will showcase a few of those, starting from the simple ones
• In bold we marked the “parameters” of the queries
Query: Big Cities in Eastern Europe
# benefits from inference over transitive gn:parentFeature
# benefits from owl:sameAs mapping between DBpedia and Geonames
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX onto: <http://www.ontotext.com/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX dbo: <http://dbpedia.org/ontology/>

select *
from onto:disable-sameAs
where {
    ?loc gn:parentFeature dbr:Eastern_Europe ;
         gn:featureClass gn:P .
    ?loc dbo:populationTotal ?population ;
         dbo:country ?country .
    FILTER(?population > 300000)
} order by ?country
Query: People and Organizations related to Google
# benefits from inference over transitive dbo:parent
# RDFRank makes it easy to see the “top suspects” in a list of 93 entities
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
PREFIX dbr: <http://dbpedia.org/resource/>

select distinct ?related_entity ?rank
where {
    BIND (dbr:Google as ?entity)
    {
        ?related_entity a dbo:Person ;
                        ?p ?entity .
    } UNION {
        ?related_entity a dbo:Organisation ;
                        dbo:parent ?entity .
    }
    ?related_entity rank:hasRDFRank ?rank
} order by desc(?rank)
Query: Airports near London
# GraphDB’s geo-spatial plug-in allows efficient evaluation of near-by
# RDFRank brings the top 6 passenger airports to the top of a list of 80
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX geo-pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX gdb-geo: <http://www.ontotext.com/owlim/geo#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX gdb: <http://www.ontotext.com/owlim/>

SELECT distinct ?airport ?rrank
WHERE {
    {
        SELECT * {
            dbr:London geo-pos:lat ?lat ;
                       geo-pos:long ?long .
        } LIMIT 10
    }
    ?airport gdb-geo:nearby(?lat ?long "50mi") ;
             a dbo:Airport ;
             gdb:hasRDFRank ?rrank .
} ORDER BY DESC(?rrank)
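To make the "nearby within 50 miles" constraint concrete, the following Python sketch filters places by great-circle distance using the haversine formula. It is a brute-force illustration of the condition being evaluated, not the geo-spatial plug-in's algorithm, which relies on an index. The coordinates are approximate values supplied for illustration and the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles


def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, long) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MI * asin(sqrt(a))


def nearby(center, places, max_miles):
    """Names of all places within max_miles of center, by brute force."""
    lat, lon = center
    return [
        name
        for name, (p_lat, p_lon) in places.items()
        if haversine_mi(lat, lon, p_lat, p_lon) <= max_miles
    ]


# Approximate coordinates, for illustration only.
london = (51.5074, -0.1278)
airports = {
    "Heathrow": (51.4700, -0.4543),
    "Gatwick": (51.1537, -0.1821),
    "Manchester": (53.3650, -2.2727),
}

print(nearby(london, airports, 50))
```

Checking every candidate like this is linear in the number of places, which is why a dedicated geo-spatial index matters on a graph with millions of locations.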
Query: Top-level Industries by number of companies
# benefits from mapping and consolidation of industry classifications
# and predicates in DBpedia (ff-map)
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

select distinct ?topIndustry (count(?company) as ?companies)
where {
    ?company dbo:industry ?industry .
    ?industrySum ff-map:industryVariant ?industry .
    ?industrySum ff-map:industryCenter ?topIndustry .
} group by ?topIndustry order by desc(?companies)
Semantic Press-Clipping
• We can trace references to a specific company in the news
− This is fairly standard, but we can also handle syntactic variations in names, because state-of-the-art Named Entity Recognition technology is used
− More importantly, we correctly determine which entity each mention of “Paris” refers to: Paris (the capital of France), Paris in Texas, Paris Hilton or Paris (the Greek hero)
• We can trace and consolidate references to daughter companies
• We have a comprehensive industry classification
− The one from DBpedia, but refined to accommodate identifier variations and specialization (e.g. a company classified as dbr:Bank will also be considered classified as dbr:FinancialServices)
Query: News Mentioning IBM
# technical example to demonstrate how news metadata can be accessed
PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?news ?title ?date ?pub_entity
where {
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch dbr:IBM .
    ?news pub-old:creationDate ?date ;
          pub-old:title ?title .
    FILTER ( (?date > "2015-10-01T00:02:00Z"^^xsd:dateTime) &&
             (?date < "2015-11-01T00:02:00Z"^^xsd:dateTime) )
} limit 100
Query: News Mentioning Gazprom and Its Related Entities
# benefits from inference over transitive dbo:parent relation and mappings to it
# prefixes as declared in the previous queries
PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

select distinct ?news ?title ?date ?related_entity
where {
    {
        select distinct ?related_entity {
            BIND (dbr:Gazprom as ?entity)
            {
                ?related_entity a dbo:Person ;
                                ?p ?entity .
                FILTER NOT EXISTS { ?related_entity dbo:club ?entity }
            } UNION {
                ?related_entity a dbo:Organisation ;
                                dbo:parent ?entity .
            } UNION {
                BIND(?entity as ?related_entity)
            }
        }
    }
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch ?related_entity .
    ?news pub-old:creationDate ?date ;
          pub-old:title ?title .
} order by desc(?date) limit 1000
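Query: Most Popular in the News Automotive Companies
# benefits from mapping and consolidation of industry classifications
# prefixes as declared in the previous queries
PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

select distinct ?pub_entity (max(?entity_label) as ?label) (count(?news) as ?news_count)
where {
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch ?entity ;
                pub:preferredLabel ?entity_label .
    dbr:Automotive ff-map:industryVariant ?industry .
    ?entity dbo:industry ?industry .
    ?news pub-old:creationDate ?date .
} group by ?pub_entity order by desc(?news_count)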
Query: Most Popular in the News, including children
# benefits from mapping and consolidation of industry classifications
# prefixes as declared in the previous queries
PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

select distinct ?parent (count(?news) as ?news_count)
where {
    {
        select distinct ?parent ?entity {
            BIND(dbr:Software as ?industry)
            ?industry ff-map:industryVariant ?industryVar .
            ?parent dbo:industry ?industryVar .
            ?parent a dbo:Company .
            FILTER NOT EXISTS { ?parent dbo:parent / dbo:industry / ff-map:industryVariant ?industry }
            {
                ?entity dbo:parent ?parent .
            } UNION {
                BIND(?parent as ?entity)
            }
        }
    }
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch ?entity .
    ?news pub-old:creationDate ?date .
} group by ?parent order by desc(?news_count)
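The roll-up that this query performs, counting a mention of a child company towards its top-level parent, can be sketched in plain Python. The mention list and its counts are made up for illustration, the function name is hypothetical, and the parent mapping is assumed to be already resolved to the top-level company (in the query this comes from inference over the transitive dbo:parent relation).

```python
from collections import Counter


def news_counts(mentions, parent_of=None):
    """Count (news, entity) mentions per company.

    With parent_of given, each mention is credited to the entity's
    top-level parent; entities without an entry count for themselves.
    """
    parent_of = parent_of or {}
    counts = Counter()
    for _news_id, entity in mentions:
        counts[parent_of.get(entity, entity)] += 1
    return counts.most_common()


# Made-up mentions: (news id, mentioned company).
mentions = [
    ("n1", "Chevrolet"), ("n1", "General Motors"), ("n2", "Chevrolet"),
    ("n3", "Tesla Motors"), ("n4", "Tesla Motors"),
    ("n5", "Audi AG"), ("n5", "Volkswagen"),
]
# Assumed child -> top-level parent mapping.
parent_of = {
    "Chevrolet": "General Motors",
    "Audi AG": "Volkswagen Group",
    "Volkswagen": "Volkswagen Group",
}

direct = news_counts(mentions)
rolled_up = news_counts(mentions, parent_of)
print(direct)
print(rolled_up)
```

As in the query, an article that mentions both a parent and one of its children contributes two mentions to the parent's total, so the rolled-up figures count mentions rather than distinct articles.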
News Popularity Ranking: Automotive
Rank  Company                      News #
1     General Motors               2722
2     Tesla Motors                 2346
3     Volkswagen                   2299
4     Ford Motor Company           1934
5     Toyota                       1325
6     Chevrolet                    1264
7     Chrysler                     1054
8     Fiat Chrysler Automobiles    1011
9     Audi AG                      972
10    Honda                        717

Rank  Company incl. mentions of controlled    News #
1     General Motors                          4620
2     Volkswagen Group                        3999
3     Fiat Chrysler Automobiles               2658
4     Tesla Motors                            2370
5     Ford Motor Company                      2125
6     Toyota                                  1656
7     Renault-Nissan Alliance                 1332
8     Honda                                   864
9     BMW                                     715
10    Takata Corporation                      547
News Popularity: Finance
Rank  Company            News #
1     Bloomberg L.P.     3203
2     Goldman Sachs      1992
3     JP Morgan Chase    1712
4     Wells Fargo        1688
5     Citigroup          1557
6     HSBC Holdings      1546
7     Deutsche Bank      1414
8     Bank of America    1335
9     Barclays           1260
10    UBS                694

Rank  Company incl. mentions of controlled    News #
1     China Merchants Bank                    40940
2     Alphabet Inc.                           24219
3     Capital Group Companies                 4379
4     Bloomberg L.P.                          3893
5     Exor (company)                          2775
6     JP Morgan Chase                         2715
7     Nasdaq, Inc.                            2178
8     Oaktree Capital Management              1757
9     Goldman Sachs                           1085
10    Sentinel Capital Partners               1064
Note: Including investment funds, stock exchanges, agencies, etc.
News Popularity: Banking
Rank  Company            News #
1     Goldman Sachs      996
2     JP Morgan Chase    856
3     HSBC Holdings      773
4     Deutsche Bank      707
5     Barclays           630
6     Citigroup          519
7     Bank of America    445
8     Wells Fargo        422
9     UBS                347
10    Chase              126

Rank  Company incl. mentions of controlled    News #
1     China Merchants Bank *                  38288
2     JP Morgan Chase                         1972
3     Goldman Sachs                           1030
4     HSBC                                    966
5     Bank of America                         771
6     Deutsche Bank                           742
7     Barclays                                681
8     Citigroup                               630
9     Wells Fargo                             428
10    UBS                                     347
Note: including investment funds, stock exchanges, agencies, etc.
Expect in Part II
• Mentions of an entity and its related entities by month
• Most relevant co-occurring entities
• Most relevant co-occurring entities per month
• Related News
• and more
Thank you!
Experience the technology with NOW: Semantic News Portal
http://now.ontotext.com
Start using GraphDB and text-mining with S4 in the cloud
http://s4.ontotext.com
Learn more at our website or simply get in touch [email protected], @ontotext