Semnews - Euroscipy 2011
A semantic news aggregator in Python using Dbpedia, Cubicweb and Scikits-learn
Vincent Michel - Logilab
27 August 2011
Context
With a bunch of RSS news:
Google Borrows Apple Strategy: Google's deal underscores the allure of a business model pioneered
Japan disaster plant cold shutdown could face delay: TOKYO (Reuters) - Tokyo Electric Power Co said on Wednesday
Libya shows signs of slipping from Muammar Gaddafi's grasp: Supply lines to capital in peril as coastal cities fall
how can we analyze them in Python?
→ Clustering (grouping) RSS (e.g. Google News)
→ Extracting/synthesizing information
→ Providing semantic, useful/original visualisation and analytics tools
Overview
1. Introduction
2. Extracting information from RSS news
3. Analyzing RSS news
Semantic
[Figure: Linking Open Data cloud diagram, as of September 2010. Dataset bubbles (DBpedia, Freebase, GeoNames, MusicBrainz, UniProt, ...) are coloured by domain: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content.]
"Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/"
Tools
Fetching, storing and querying the data → CubicWeb (*)
✓ Semantic CMS: high-level database management with metadata
✓ Multiple sources: RSS, micro-blogging, SQL database
✓ Based on PostgreSQL: deals with (very) large databases
Data-mining and machine learning → Scikits-learn (*)
✓ Easy-to-use and general-purpose machine learning in Python
✓ Unsupervised learning, supervised learning, model selection
Semantic information database → Dbpedia (*)
✓ ≈ 8×10^6 articles with abstracts and images from Wikipedia
✓ ≈ 0.6×10^6 categories, 273 types (e.g. person, place)
✓ ≈ 100×10^6 links between articles, categories and types
(*) Open Source / Creative Commons
Storing RSS information with Cubicweb
Each object (or entity) is defined in a schema and may be displayed using different views.
Storing RSS
Define (in schema.py) how RSS should be stored in the database:

    from yams.buildobjs import EntityType, String

    class RSSArticle(EntityType):
        title = String()            # Title of the feed
        uri = String(unique=True)   # Uri of the rss feed
        content = String()          # Content of the feed

Fetching RSS (based on feedparser and BeautifulSoup)
Simply construct a source by giving a URL and a parser:

    url = u'http://feeds.bbci.co.uk/news/world/rss.xml'
    s = session.create_entity('CWSource', name=u'BBCNews-World', url=url,
                              type=u'datafeed', parser=u'rss-parser',
                              config=u'synchronization-interval=240min')
    s.pull_data(session)

7 English/American journals (The New York Times, ...)
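The slide relies on feedparser and BeautifulSoup for the fetching step. As a rough, standard-library-only sketch of what such a parser has to do, each RSS `<item>` can be turned into a dict carrying the three attributes of `RSSArticle`. The sample feed and the `parse_rss` helper are made up for illustration and are not part of CubicWeb or Semnews.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used instead of a network call.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Google is to buy Motorola Mobility</title>
      <link>http://example.org/news/1</link>
      <description>Google is to buy mobile phone manufacturer Motorola Mobility.</description>
    </item>
    <item>
      <title>Libya rebels fight</title>
      <link>http://example.org/news/2</link>
      <description>Supply lines to capital in peril.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict per <item>, keyed like the RSSArticle schema."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "uri": item.findtext("link", default=""),
            "content": item.findtext("description", default=""),
        })
    return articles


articles = parse_rss(SAMPLE_RSS)
print(len(articles), articles[0]["uri"])
```

Using the item link as `uri` fits the `unique=True` constraint of the schema, since fetching the same feed twice should not create duplicate articles.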
Storing Dbpedia information with Cubicweb
Storing Dbpedia page
Define (in schema.py) how Dbpedia pages should be stored in the database:

    class DbpediaPage(EntityType):
        uri = String(unique=True, indexed=True)  # Uri of the resource
        label = String(indexed=True)  # http://www.w3.org/2000/01/rdf-schema#
        pageid = String()     # http://dbpedia.org/ontology/wikiPageID
        abstract = String()   # http://dbpedia.org/ontology/abstract
        homepage = String()   # http://xmlns.com/foaf/0.1/homepage
        thumbnail = String()  # http://dbpedia.org/ontology/thumbnail
        depiction = String()  # http://xmlns.com/foaf/0.1/depiction
        wikipage = String()   # http://xmlns.com/foaf/0.1/page
        latitude = String()   # http://www.w3.org/2003/01/geo/wgs84_pos#
        longitude = String()  # http://www.w3.org/2003/01/geo/wgs84_pos#

Storing all Dbpedia information (≈ 9×10^6 pages, ≈ 100×10^6 links, 20 GB) in Cubicweb takes less than 24 hours.
→ See the full schema
Analyzing RSS news
What can we do with the RSS news stored in the database?
1. extract relevant features of the data
2. construct a usable (i.e. matrix) representation of the data
3. cluster (group) RSS together
4. deeper semantic analysis and visualization of the information
Example sentence
"Google is to buy mobile phone manufacturer Motorola Mobility, allowing it to mount a serious challenge to Apple Inc."
Extracting information: Classical approaches (Scikits-learn)
Char-N-gram
Extracts features of N characters from a text

    from scikits.learn.feature_extraction.text import CharNGramAnalyzer
    analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)

→ 450 features: 'goo', 'oog', 'ogl', 'mob', 'obi', 'bil', ...
Word-N-gram
Extracts features of N words from a text

    from scikits.learn.feature_extraction.text import WordNGramAnalyzer
    analyzer = WordNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)

→ 58 features: 'google is to', 'is to buy', 'serious challenge to', ...
✗ Many irrelevant features (tokens)
✗ Features do not carry much contextual information (i.e. they are not understandable by humans)
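To make the two classical analyzers concrete, here is a minimal pure-Python sketch of character and word n-gram extraction. It is not the Scikits-learn implementation: the lower-casing and the tokenization rule (words of at least two characters) are assumptions chosen for illustration.

```python
import re

SENTENCE = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")


def char_ngrams(text, min_n=3, max_n=6):
    """All substrings of min_n..max_n characters of the lower-cased text."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, min_n=3, max_n=6):
    """All runs of min_n..max_n consecutive words (tokens of 2+ characters)."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]


words = word_ngrams(SENTENCE)
print(len(words), words[:2])
```

Under these assumptions the example sentence yields 58 word n-grams, the same count as on the slide. The character n-gram count depends on preprocessing details (whitespace, punctuation, deduplication) that the slide does not give, so no figure is claimed for it here.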
Feature extraction: Dbpedia - Context
Main hypothesis: only things that exist in Dbpedia (i.e. Wikipedia) have some interest in news analysis → Named Entities Recognition (NER)
Dbpedia feature extraction (Cubicweb/Dbpedia)

    from cubes.semnews.views.nertools import DbpediaEntitiesAnalyzer
    analyzer = DbpediaEntitiesAnalyzer(session, lang='en')
    tokens = analyzer.analyze(sentence)

→ 3 features: 'Apple Inc', 'Google', 'Motorola'
<ENAMEX TYPE="ORGANIZATION">Google</ENAMEX> is to buy mobile phone manufacturer <ENAMEX TYPE="ORGANIZATION">Motorola</ENAMEX> Mobility, allowing it to mount a serious challenge to <ENAMEX TYPE="ORGANIZATION">Apple Inc</ENAMEX>
→ Try it
"DBpedia Spotlight: Shedding Light on the Web of Documents", Pablo N. Mendes et al., I-Semantics 2011
"Learning Named Entity Recognition from Wikipedia", Joel Nothman, 2008
"Large-Scale Named Entity Disambiguation Based on Wikipedia Data", Silviu Cucerzan, 2007
Feature extraction: Dbpedia - Properties
Efficient and robust feature extraction
Keep the meaning of a text → interpretable features
✗ 'said former soldier Larry'
✗ 'said student Larry'
✓ 'said Larry Page'
Robust features based on redirections, e.g. 'Obama', 'Barak Obamba', 'Pres. Obama' redirect to 'Barack Obama'
Simple RQL (Relation Query Language) queries
Fast: based on indexed SQL tables and regular expressions

    rset = rql('Any E WHERE E is DbpediaPage, E label %(token)s', {'token': token})

e.g. 19 entities extracted in 4 s from 765 words, among ≈ 8×10^6 Dbpedia entries
Different labels but same URI → cross-language feature extraction, e.g. Grenada/Grenade → http://dbpedia.org/resource/Grenada
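The lookup-plus-redirection idea can be sketched without a database. In the toy version below, two small dicts stand in for the indexed Dbpedia label and redirect tables, and a greedy longest-match scan over word windows replaces the RQL query. The data, the `extract_entities` helper and the matching strategy are all assumptions made for illustration, not the Semnews implementation.

```python
import re

# Toy stand-ins for the indexed Dbpedia label and redirect tables (hypothetical data).
LABELS = {
    "google": "http://dbpedia.org/resource/Google",
    "motorola": "http://dbpedia.org/resource/Motorola",
    "apple inc": "http://dbpedia.org/resource/Apple_Inc.",
    "barack obama": "http://dbpedia.org/resource/Barack_Obama",
}
REDIRECTS = {"obama": "barack obama", "pres obama": "barack obama"}


def extract_entities(text, max_words=3):
    """Greedy longest-match lookup of word windows against LABELS."""
    tokens = re.findall(r"\w+", text.lower())
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            candidate = REDIRECTS.get(candidate, candidate)  # follow redirection
            if candidate in LABELS:
                found.append(LABELS[candidate])
                i += n
                break
        else:
            i += 1  # no known label starts at this token
    return found


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")
for uri in extract_entities(sentence):
    print(uri)
```

Because features are URIs rather than surface strings, 'Obama' and 'Pres. Obama' collapse onto the same feature, which is what keeps the feature space small.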
Feature extraction: Matrix representation and storage
One row per news item, one column per entity:

    Obama's approval     →  0 0 1 0
    Protest as Spain     →  0 0 0 1
    Libya rebels fight   →  0 1 0 0
    Obama in Spain       →  0 0 1 1

E.g. the Obama-Merkel-Sarkozy space
Results stored in a relation (appears_in_rss) in Cubicweb

    rql('Any X WHERE X appears_in_rss Y')
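As a small illustration of the matrix representation, the sketch below builds a binary occurrence matrix from a mapping of news items to extracted entities. The entity assigned to each headline is an assumption inferred from the headline text, and `occurrence_matrix` is a hypothetical helper, not the `OccurrenceMatrix` class used in Semnews.

```python
def occurrence_matrix(news_entities):
    """Return (entities, matrix): one row per news item, one column per entity."""
    entities = sorted({e for found in news_entities.values() for e in found})
    column = {e: j for j, e in enumerate(entities)}
    matrix = []
    for found in news_entities.values():
        row = [0] * len(entities)
        for e in found:
            row[column[e]] = 1  # entity appears in this news item
        matrix.append(row)
    return entities, matrix


# Hypothetical extraction results for the four headlines on the slide.
news = {
    "Obama's approval": ["Obama"],
    "Protest as Spain": ["Spain"],
    "Libya rebels fight": ["Libya"],
    "Obama in Spain": ["Obama", "Spain"],
}
entities, X = occurrence_matrix(news)
print(entities)
for row in X:
    print(row)
```

Each column of `X` is an axis of the feature space, so a subset of three columns (for instance three politicians) gives a 3-D view such as the one named on the slide.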
Exploiting information: Clustering news
Creating clusters (groups) of news using the matrix representation of the data
Meanshift algorithm (Scikits-learn)
Based on locating the maxima of a density function
Automatically tunes the number of clusters

    from scikits.learn.cluster import MeanShift, estimate_bandwidth
    bandwidth = estimate_bandwidth(X, quantile=0.005)
    clustering = MeanShift(bandwidth=bandwidth)
    clustering.fit(X)
    labels = clustering.labels_

"Mean shift, mode seeking, and clustering", Yizong Cheng, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995
→ Try it
Other possible alternatives: Ward's algorithm, K-means
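To show what "locating the maxima of a density function" means in practice, here is a minimal pure-Python mean shift with a flat kernel. It is a teaching sketch, not the Scikits-learn implementation: the fixed bandwidth, the stopping rule and the way nearby modes are merged are assumptions.

```python
import math


def mean_shift(points, bandwidth, n_iter=50, tol=1e-6):
    """Flat-kernel mean shift: repeatedly move each mode to the mean of the
    data points lying within `bandwidth` of it, then merge nearby modes."""
    modes = [list(p) for p in points]
    for _ in range(n_iter):
        moved = 0.0
        for i, m in enumerate(modes):
            neigh = [p for p in points if math.dist(p, m) <= bandwidth]
            if not neigh:
                continue
            new = [sum(c) / len(neigh) for c in zip(*neigh)]
            moved = max(moved, math.dist(new, m))
            modes[i] = new
        if moved < tol:
            break
    # Modes closer than one bandwidth are treated as the same cluster.
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.dist(m, c) < bandwidth:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = mean_shift(points, bandwidth=3)
print(labels, len(centers))
```

The number of clusters is never passed in: it falls out of how many distinct modes survive, which is the property the slide highlights. The bandwidth still has to be chosen, which is the role of `estimate_bandwidth` in the slide's code.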
Exploiting information: Queries and Views
Each query returns a result set (rset). A view is called on a result set → it defines the representation rules.
Defining a view (short version)

    from cubicweb.view import View

    class MyEntitiesView(View):
        __regid__ = 'example-view'

        def call(self):
            rset = self.cw_rset
            for entity in rset.entities():
                # do whatever you want, e.g. write some HTML
                pass

You just have to plug in a rset from a query:

    rset = rql('Any X WHERE ...')
    self.wview('example-view', rset)

or within a URL:

    http://myapplication/?rql=Any X WHERE ...&vid=example-view
Plugging Scikits-learn in a view
Combine machine learning tools and a database query system in pure Python
Defining the view

    class RSSClustering(View):
        __regid__ = 'rss-clustering'

        def call(self):
            # Get the results set and create the data
            rset = self.cw_rset
            occurrence_matrix = OccurrenceMatrix()
            occurrence_matrix.construct_matrix(rset=rset)
            # Scale the data
            X = scale(occurrence_matrix, axis=0, with_std=True, copy=True)
            # Compute bandwidth for clustering
            bandwidth = estimate_bandwidth(X)
            # Clustering
            cluster = MeanShift(bandwidth=bandwidth)
            cluster.fit(X)
            labels = cluster.labels_
            # Perform some HTML rendering

→ Try it
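The view above scales the occurrence matrix column by column before clustering. As a rough sketch of what that step computes, the helper below standardizes each column to zero mean and unit variance. It assumes the population standard deviation and leaves constant columns unscaled; both choices are assumptions, and `scale_columns` is a hypothetical stand-in rather than the library's `scale` function.

```python
import math


def scale_columns(matrix):
    """Centre each column to zero mean and divide by its standard deviation."""
    n_rows = len(matrix)
    scaled_cols = []
    for col in zip(*matrix):
        mean = sum(col) / n_rows
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n_rows)
        std = std or 1.0  # constant column: keep centred values unscaled
        scaled_cols.append([(v - mean) / std for v in col])
    # Transpose back to row-major layout.
    return [list(row) for row in zip(*scaled_cols)]


X_scaled = scale_columns([[0, 1], [2, 1]])
print(X_scaled)
```

Scaling matters here because mean shift relies on distances: without it, an entity mentioned in most news items would dominate the distance between two articles.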
A new approach for querying information from RSS
All musical artists in the news

    rql('DISTINCT Any E, R WHERE E appears_in_rss R, '
        'E has_type T, T label "musical artist"')

All living office holder persons in the news

    rql('DISTINCT Any E WHERE E appears_in_rss R, '
        'E has_type T, T label "office holder", '
        'E has_subject C, C label "Living people"')

All news that talk about Barack Obama and any scientist

    rql('DISTINCT Any R WHERE E1 label "Barack Obama", '
        'E1 appears_in_rss R, E2 appears_in_rss R, '
        'E2 has_type T, T label "scientist"')

All news that talk about a drug

    rql('Any X, R WHERE X appears_in_rss R, '
        'X has_type T, T label "drug"')

Try it with an xml view or a thumbnail view
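For readers unfamiliar with RQL, the third query ("all news that talk about Barack Obama and any scientist") amounts to joining two relations on the shared news variable R. The sketch below mimics that join with plain Python sets. The relation contents are invented and `news_about` is a hypothetical helper; it only illustrates the logic and is not how CubicWeb evaluates RQL.

```python
# Toy in-memory relations (hypothetical data), mirroring the RQL relation names.
appears_in_rss = {("Barack Obama", "news-1"), ("Albert Einstein", "news-1"),
                  ("Barack Obama", "news-2"), ("Aspirin", "news-3")}
has_type = {("Barack Obama", "office holder"), ("Albert Einstein", "scientist"),
            ("Aspirin", "drug")}


def news_about(entity_label, type_label):
    """News mentioning `entity_label` together with any entity of `type_label`."""
    typed = {e for e, t in has_type if t == type_label}
    with_entity = {r for e, r in appears_in_rss if e == entity_label}
    with_typed = {r for e, r in appears_in_rss if e in typed}
    return sorted(with_entity & with_typed)


print(news_about("Barack Obama", "scientist"))
```

The set intersection plays the role of the repeated variable R in the RQL expression: a news item qualifies only if both entities appear in it.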
Visualization: Mapping information
View

    class EntityMapView(View):
        __regid__ = 'map'

        def call(self):
            rset = self.cw_rset
            self.init_map()
            for entity in rset.entities():
                self.add_marker(entity.latitude, entity.longitude, entity.dc_title())
            self.center_and_zoom(0, 0, 1.5)
            self.finish_map()

Based on mapstraction (JavaScript): http://mapstraction.com
✓ Automatically locate information from RSS news
Try it
Conclusion
Cubicweb and Scikits-learn
Efficient and easy-to-use tools for storing, querying and mining data
Easy to plug together: only Python tools, all Open Source
Using Dbpedia makes it possible to extract very few, highly relevant features
✓ Decrease the dimensionality of the data
✓ Link features to millions of pages of information
A new semantic way of querying information
✓ Simple information queries using RQL expressions
✓ Use Dbpedia types and categories to refine the selection
Future Improvements
Named Entities Recognition
→ use disambiguation links, more refined regular expressions
→ add new databases (MusicBrainz, diseasome)
→ closely follow Wikipedia with Dbpedia live update
Motorola Mobility (11:46, 15 August 2011): 'On the 15 August, Google announced that it agreed to acquire the company'
RSS news analysis
→ explore new algorithms: bi-clustering
→ add new data sources: Twitter, blogs
→ 'from scikits.learn predict __future__' → use matrix completion to predict new edges in the correlation graph
Thank you for your attention
Questions?
Context
With a bunch of RSS news
Google Borrows Apple Strategy Googlersquos deal underscores theallure of a business model pioneered
Japan disaster plant cold shutdown could face delay TOKYO(Reuters) - Tokyo Electric Power Co said on Wednesday
Libya shows signs of slipping from Muammar Gaddafirsquos grasp Supply lines to capital in peril as coastal cities fall
how can we analyze them in Python
rarr Clustering (grouping) RSS (eg Google News)
rarr Extractingsynthetizing information
rarr Providing semantic usefuloriginal visualisation andanalytics tools
Overview
1 Introduction
2 Extracting information from RSS news
3 Analyzing RSS news
Semantic
As of September 2010
MusicBrainz
(zitgist)
P20
YAGO
World Fact-book (FUB)
WordNet (W3C)
WordNet(VUA)
VIVO UFVIVO
Indiana
VIVO Cornell
VIAF
URIBurner
Sussex Reading
Lists
Plymouth Reading
Lists
UMBEL
UK Post-codes
legislationgovuk
Uberblic
UB Mann-heim
TWC LOGD
Twarql
transportdatagov
uk
totlnet
Tele-graphis
TCMGeneDIT
TaxonConcept
The Open Library (Talis)
t4gm
Surge Radio
STW
RAMEAU SH
statisticsdatagov
uk
St Andrews Resource
Lists
ECS South-ampton EPrints
Semantic CrunchBase
semanticweborg
SemanticXBRL
SWDog Food
rdfabout US SEC
Wiki
UNLOCODE
Ulm
ECS (RKB
Explorer)
Roma
RISKS
RESEX
RAE2001
Pisa
OS
OAI
NSF
New-castle
LAAS
KISTIJISC
IRIT
IEEE
IBM
Eureacutecom
ERA
ePrints
dotAC
DEPLOY
DBLP (RKB
Explorer)
Course-ware
CORDIS
CiteSeer
Budapest
ACM
riese
Revyu
researchdatagov
uk
referencedatagov
uk
Recht-spraak
nl
RDFohloh
LastFM (rdfize)
RDF Book
Mashup
PSH
ProductDB
PBAC
Pokeacute-peacutedia
Ord-nance Survey
Openly Local
The Open Library
OpenCyc
OpenCalais
OpenEI
New York
Times
NTU Resource
Lists
NDL subjects
MARC Codes List
Man-chesterReading
Lists
Lotico
The London Gazette
LOIUS
lobidResources
lobidOrgani-sations
LinkedMDB
LinkedLCCN
LinkedGeoData
LinkedCT
Linked Open
Numbers
lingvoj
LIBRIS
Lexvo
LCSH
DBLP (L3S)
Linked Sensor Data (Knoesis)
Good-win
Family
Jamendo
iServe
NSZL Catalog
GovTrack
GESIS
GeoSpecies
GeoNames
GeoLinkedData(es)
GTAA
STITCHSIDER
Project Guten-berg (FUB)
MediCare
Euro-stat
(FUB)
DrugBank
Disea-some
DBLP (FU
Berlin)
DailyMed
Freebase
flickr wrappr
Fishes of Texas
FanHubz
Event-Media
EUTC Produc-
tions
Eurostat
EUNIS
ESD stan-dards
Popula-tion (En-AKTing)
NHS (EnAKTing)
Mortality (En-
AKTing)Energy
(En-AKTing)
CO2(En-
AKTing)
educationdatagov
uk
ECS South-ampton
Gem Norm-datei
datadcs
MySpace(DBTune)
MusicBrainz
(DBTune)
Magna-tune
John Peel(DB
Tune)
classical(DB
Tune)
Audio-scrobbler (DBTune)
LastfmArtists
(DBTune)
DBTropes
dbpedia lite
DBpedia
Pokedex
Airports
NASA (Data Incu-bator)
MusicBrainz(Data
Incubator)
Moseley Folk
Discogs(Data In-cubator)
Climbing
Linked Data for Intervals
Cornetto
Chronic-ling
America
Chem2Bio2RDF
bizdata
govuk
UniSTS
UniRef
UniPath-way
UniParc
Taxo-nomy
UniProt
SGD
Reactome
PubMed
PubChem
PRO-SITE
ProDom
Pfam PDB
OMIM
OBO
MGI
KEGG Reaction
KEGG Pathway
KEGG Glycan
KEGG Enzyme
KEGG Drug
KEGG Cpd
InterPro
HomoloGene
HGNC
Gene Ontology
GeneID
GenBank
ChEBI
CAS
Affy-metrix
BibBaseBBC
Wildlife Finder
BBC Program
mesBBC
Music
rdfaboutUS Census
Media
Geographic
Publications
Government
Cross-domain
Life sciences
User-generated content
ldquoLinking Open Data cloud diagram by Richard Cyganiak and Anja Jentzschhttp lod-cloudnetrdquo
Tools
Fetching storing and querying the datararr CubicWeb ()
4 Semantic CMS high-level database management with metadata
4 Multiple sources RSS micro-blogging SQL database
4 Based on PostgreSQL deals with (very) large database
Data-mining and machine learningrarr Scikits-learn ()
4 Easy-to-use and general-purpose machine learning in Python
4 Unsupervised learning supervised learning model selection
Semantic information databaserarr Dbpedia ()
4 sim 8106 articles with abstracts images from Wikipedia
4 sim 06106 categories 273 types (eg person place )
4 sim 100106 links between articles categories and types
() Open SourceCreative Commons
Storing RSS information with Cubicweb
Each object (or entity) is defined in a schema and may be displayedusing different views
Storing RSS
Define (in schemapy ) how RSS should be stored in the database
class RSSArt ic le ( Ent i tyType ) t i t l e = S t r i n g ( ) T i t l e o f the feedu r i = S t r i n g ( unique=True ) Ur i o f the rss feedcontent = S t r i n g ( ) Content o f the feed
Fetching RSS (based on feedparser and BeautifulSoup)
Simply construct a source by giving an URL and a parser
u r l = u rsquo h t t p feeds bbc i co uk news wor ld rss xml rsquos = session c r e a t e _ e n t i t y ( rsquoCWSource rsquo name=u rsquoBBCNewsminusWorld rsquo u r l = u r l
type=u rsquo datafeed rsquo parser=u rsquo rssminusparser rsquo con f i g =u rsquo synchron iza t ionminusi n t e r v a l =240min rsquo )
s pu l l _da ta ( session )
7 englishamerican journals (The New York Times )
Storing Dbpedia information with Cubicweb
Storing Dbpedia page
Define (in schemapy ) how Dbpedia pages should be stored in thedatabase
class DbpediaPage ( Ent i tyType ) u r i = S t r i n g ( unique=True indexed=True ) Ur i o f the ressourcel a b e l = S t r i n g ( indexed=True ) h t t p wwww3 org 200001 rd fminusschemapageid = S t r i n g ( ) h t t p dbpedia org onto logy wikiPageIDabs t r ac t = S t r i n g ( ) h t t p dbpedia org onto logy abs t r ac thomepage = S t r i n g ( ) h t t p xmlns com f o a f 0 1 homepagethumbnai l = S t r i n g ( ) h t t p dbpedia org onto logy thumbnai ld e p i c t i on = S t r i n g ( ) h t t p xmlns com f o a f 0 1 d ep i c t i o nwikipage = S t r i n g ( ) h t t p xmlns com f o a f 0 1 pagel a t i t u d e = S t r i n g ( ) h t t p wwww3 org 200301 geo wgs84_posl ong i t ude = S t r i n g ( ) h t t p wwww3 org 200301 geo wgs84_pos
Storing all dbpedia information (sim 9106 pages sim 100106 links20Go) in Cubicweb takes less than 24 hours
rarr See the full schema
Analyzing RSS news
What can we do with the RSS news stored in database
1 extract relevant features of the data
2 construct a usable (ie matrix) representation of the data
3 cluster (group) RSS together
4 deeper semantic analyze and visualization of the information
Example sentence
ldquoGoogle is to buy mobile phone manufacturer Motorola Mobilityallowing it to mount a serious challenge to Apple Incrdquo
Overview
1 Introduction
2 Extracting information from RSS news
3 Analyzing RSS news
Extracting information Classical approaches (Scikits-learn)
Char-N-gram
Extracts features of N characters from a text
from s c i k i t s l ea rn f e a t u r e _ e x t r a c t i o n t e x t import CharNGramAnalyzeranalyzer = CharNGramAnalyzer ( min_n=3 max_n=6)fea tu res = analyzer analyze ( sentence )
450 features rsquogoorsquo rsquooogrsquo rsquooglrsquo rsquomobrsquo rsquoobirsquo rsquobilrsquo
Word-N-gram
Extracts features of N words from a text
from s c i k i t s l ea rn f e a t u r e _ e x t r a c t i o n t e x t import WordNGramAnalyzeranalyzer = WordNGramAnalyzer ( min_n=3 max_n=6)fea tu res = analyzer analyze ( sentence )
58 features rsquogoogle is torsquo rsquois to buyrsquo rsquoserious challenge torsquo
8 Many irrelevant features (tokens)8 Features do not carry lots of contextual information (ie
understandable by humans)
Feature extraction Dbpedia - Context
Main Hypothesis Only things that exist in Dbpedia (ieWikipedia) have some interest in news analysisrarr Named
Entities Recognition (NER)
Dbpedia feature extraction (CubicwebDbpedia)
from cubes semnews views ne r t oo l s import Dbped iaEnt i t iesAna lyzeranalyzer = Dbped iaEnt i t iesAna lyzer ( session lang= rsquo en rsquo )tokens = analyzer analyze ( sentence )
rarr 3 features rsquoApple Incrsquo rsquoGooglersquo rsquoMotorolarsquo
ltENAMEX TYPE=ORGANIZATIONgt GoogleltENAMEXgt is to buy mobile phone
manufacturer ltENAMEX TYPE=ORGANIZATIONgt MotorolaltENAMEXgt Mobility allowing it to
mount a serious challenge to ltENAMEX TYPE=ORGANIZATIONgt Apple IncltENAMEXgt
rarr Try it
ldquoDBpedia Spotlight Shedding Light on the Web of Documentsrdquo Pablo N Mendes et al I-Semantics 2011ldquoLearning Named Entity Recognition from Wikipediardquo Joel Nothman 2008
ldquoLarge-Scale Named Entity Disambiguation Based on Wikipedia Datardquo Silviu Cucerzan 2007
Feature extraction Dbpedia - Properties
Efficient and robust feature extraction
Keep the meaning of a textrarr interpretable features8 rsquo said former soldier Larry rsquo8 rsquo said student Larry rsquo4 rsquo said Larry Page rsquo
Robust features based on redirectionseg rsquoObamarsquo rsquoBarak Obambarsquo rsquoPres Obamarsquo redirect to rsquoBarack Obamarsquo
Simple RQL (Relation Query Language) queries
Fast based on indexed SQL tables and regular expressions rset = rql(rsquoAny E WHERE E is DbpediaPage
E label (token)srsquo rsquotokenrsquo token)
eg 19 entities extracted in 4s in 765 words among sim 8106 dbpedia entries
Different labels but same URIrarr cross-language featureextraction eg GrenadaGrenaderarr http dbpediaorgresourceGrenada
Overview
1 Introduction
2 Extracting information from RSS news
3 Analyzing RSS news
Feature extraction Matrix representation and storage
Obamarsquos approval Protest as Spain
Libya rebels fight Obama in Spain
rarr
0 0 1 00 0 0 1 0 1 0 00 0 1 1
Eg The Obama-Merkel-Sarkozy
space
Results stored in a relation (appears in rss) in Cubicweb
rql(rsquoAny X WHERE X appears_in_rss Yrsquo)
Exploiting information Clustering news
Creating clusters (groups) of news using the matrixrepresentation of the data
Meanshift algorithm (Scikits-learn)
Based on locating the maxima of a density function
Automatically tunes the number of clusters
from s c i k i t s l ea rn c l u s t e r import MeanShift est imate_bandwidthbandwidth = est imate_bandwidth (X q u a n t i l e =0005)c l u s t e r i n g = MeanShift ( bandwidth=bandwidth )c l u s t e r i n g f i t (X)l a b e l s = c l u s t e r i n g labe ls_
ldquoMean shift mode seeking and clusteringrdquo Yizong Cheng IEEE Transactions on Pattern Analysis and Machine Intelligence1995
rarr Try it
Other possible alternatives Wardrsquos algorithm K-means
Exploiting information Queries and Views
Each query returns a result set (rset) A view is called on a result setrarr define the representation rules
Defining a view (short version)
class MyEnt i t iesView ( View ) __regid__ = rsquo exampleminusview rsquo
def c a l l ( s e l f ) r s e t = s e l f cw_rsetfor e n t i t y in r s e t e n t i t i e s ( )
do_whatever_you_want_ write_some_html_
you just have to plug a rset from a query
r s e t = r q l ( rsquo Any X WHERE rsquo )s e l f wview ( rsquo exampleminusview rsquo r s e t )
or within an url
httpmyapplicationrql=Any X WHERE ampvid=example-view
Pluging Scikitslearn in a view
Combine machine learning tools and database query system in purePython
Defining the view
class RSSClustering ( View ) __regid__ = rsquo rssminusc l u s t e r i n g rsquo
def c a l l ( s e l f ) Get the r e s u l t s set and create the datar s e t = s e l f cw_rsetoccurrence_matr ix = OccurrenceMatr ix ( )occurrence_matr ix cons t ruc t_ma t r i x ( r s e t = r s e t ) Scale the dataX = scale ( occurrence_matr ix ax is =0 w i th_s td=True copy=True ) Compute bandwith f o r c l u s t e r i n gbandwidth = est imate_bandwidth (X) C lus te r i ngc l u s t e r = MeanShift ( bandwidth=bandwidth )c l u s t e r f i t (X)l a b e l s = c l u s t e r l abe ls_ Perform some HTML render ing
rarr Try it
A new approach for querying information from RSS
All musical artists in the newsrql(rsquoDISTINCT Any E R WHERE E appears_in_rss R
E has_type T T label musical artistrsquo)
All living office holder persons in the news
rql(rsquoDISTINCT Any E WHERE E appears_in_rss RE has_type T T label office holderE has_subject C C label Living peoplersquo)
All news that talk about Barack Obama and any scientist
rql(rsquoDISTINCT Any R WHERE E1 label Barack ObamaE1 appears_in_rss R E2 appears_in_rss RE2 has_type T T label scientistrsquo)
All news that talk about a drug
rql(rsquoAny X R WHERE X appears_in_rss RX has_type T T label drugrsquo)
Try it with an xml view or a thumbnail view
Vizualisation Mapping information
View
class EntityMapView ( View ) __regid__ = rsquomap rsquo
def c a l l ( s e l f ) r s e t = s e l f cw_rsets e l f in i t_map ( )for e n t i t y in r s e t e n t i t i e s ( )
s e l f add_marker ( e n t i t y l a t i t u d e e n t i t y long i tude e n t i t y d c _ t i t l e )
s e l f center_and_zoom (0 0 1 5 )s e l f f in ish_map ( )
Based on mapstraction (javascript) http mapstractioncom
4 Automatically locate information from RSS news
Try it
Conclusion
Cubicweb and Scikits-learn
Efficient and easy-to-use tools for data storing querying andmining
Easy to plug together only Python tools all Open Source
Using Dbpedia allows to extract very few highly relevant features
4 Decrease the dimensionality of the data
4 Link features to millions of pages of information
A new semantic way for querying information
4 Simple information queries using RQL expressions
4 Use Dbpedia types and categories to refine the selection
Future Improvements
Named Entities Recognition
rarr use disambiguations links more refined Regular expressions
rarr add new databases (MusicBrainz diseasome )
rarr closely follow Wikipedia with Dbpedia live update
Motorola Mobility (11 46 15 August 2011) rsquoOn the 15 AugustGoogle announced that it agreed to acquire the companyrsquo
RSS news analyzing
rarr explore new algorithms bi-clustering
rarr add new data sources Twitters Blogs
rarr rsquofrom scikitslearn predict __future__ rsquo rarr use MatrixCompletion to predict new edges in the correlation graph
Thanks you for your attention
Questions
- Introduction
- Extracting information from RSS news
- Analyzing RSS news
-
Overview
1 Introduction
2 Extracting information from RSS news
3 Analyzing RSS news
Semantic
As of September 2010
MusicBrainz
(zitgist)
P20
YAGO
World Fact-book (FUB)
WordNet (W3C)
WordNet(VUA)
VIVO UFVIVO
Indiana
VIVO Cornell
VIAF
URIBurner
Sussex Reading
Lists
Plymouth Reading
Lists
UMBEL
UK Post-codes
legislationgovuk
Uberblic
UB Mann-heim
TWC LOGD
Twarql
transportdatagov
uk
totlnet
Tele-graphis
TCMGeneDIT
TaxonConcept
The Open Library (Talis)
t4gm
Surge Radio
STW
RAMEAU SH
statisticsdatagov
uk
St Andrews Resource
Lists
ECS South-ampton EPrints
Semantic CrunchBase
semanticweborg
SemanticXBRL
SWDog Food
rdfabout US SEC
Wiki
UNLOCODE
Ulm
ECS (RKB
Explorer)
Roma
RISKS
RESEX
RAE2001
Pisa
OS
OAI
NSF
New-castle
LAAS
KISTIJISC
IRIT
IEEE
IBM
Eureacutecom
ERA
ePrints
dotAC
DEPLOY
DBLP (RKB
Explorer)
Course-ware
CORDIS
CiteSeer
Budapest
ACM
riese
Revyu
researchdatagov
uk
referencedatagov
uk
Recht-spraak
nl
RDFohloh
LastFM (rdfize)
RDF Book
Mashup
PSH
ProductDB
PBAC
Pokeacute-peacutedia
Ord-nance Survey
Openly Local
The Open Library
OpenCyc
OpenCalais
OpenEI
New York
Times
NTU Resource
Lists
NDL subjects
MARC Codes List
Man-chesterReading
Lists
Lotico
The London Gazette
LOIUS
lobidResources
lobidOrgani-sations
LinkedMDB
LinkedLCCN
LinkedGeoData
LinkedCT
Linked Open
Numbers
lingvoj
LIBRIS
Lexvo
LCSH
DBLP (L3S)
Linked Sensor Data (Knoesis)
Good-win
Family
Jamendo
iServe
NSZL Catalog
GovTrack
GESIS
GeoSpecies
GeoNames
GeoLinkedData(es)
GTAA
STITCHSIDER
Project Guten-berg (FUB)
MediCare
Euro-stat
(FUB)
DrugBank
Disea-some
DBLP (FU
Berlin)
DailyMed
Freebase
flickr wrappr
Fishes of Texas
FanHubz
Event-Media
EUTC Produc-
tions
Eurostat
EUNIS
ESD stan-dards
Popula-tion (En-AKTing)
NHS (EnAKTing)
Mortality (En-
AKTing)Energy
(En-AKTing)
CO2(En-
AKTing)
educationdatagov
uk
ECS South-ampton
Gem Norm-datei
datadcs
MySpace(DBTune)
MusicBrainz
(DBTune)
Magna-tune
John Peel(DB
Tune)
classical(DB
Tune)
Audio-scrobbler (DBTune)
LastfmArtists
(DBTune)
DBTropes
dbpedia lite
DBpedia
Pokedex
Airports
NASA (Data Incu-bator)
MusicBrainz(Data
Incubator)
Moseley Folk
Discogs(Data In-cubator)
Climbing
Linked Data for Intervals
Cornetto
Chronic-ling
America
Chem2Bio2RDF
bizdata
govuk
UniSTS
UniRef
UniPath-way
UniParc
Taxo-nomy
UniProt
SGD
Reactome
PubMed
PubChem
PRO-SITE
ProDom
Pfam PDB
OMIM
OBO
MGI
KEGG Reaction
KEGG Pathway
KEGG Glycan
KEGG Enzyme
KEGG Drug
KEGG Cpd
InterPro
HomoloGene
HGNC
Gene Ontology
GeneID
GenBank
ChEBI
CAS
Affy-metrix
BibBaseBBC
Wildlife Finder
BBC Program
mesBBC
Music
rdfaboutUS Census
Media
Geographic
Publications
Government
Cross-domain
Life sciences
User-generated content
ldquoLinking Open Data cloud diagram by Richard Cyganiak and Anja Jentzschhttp lod-cloudnetrdquo
Tools
Fetching storing and querying the datararr CubicWeb ()
4 Semantic CMS high-level database management with metadata
4 Multiple sources RSS micro-blogging SQL database
4 Based on PostgreSQL deals with (very) large database
Data-mining and machine learningrarr Scikits-learn ()
4 Easy-to-use and general-purpose machine learning in Python
4 Unsupervised learning supervised learning model selection
Semantic information databaserarr Dbpedia ()
4 sim 8106 articles with abstracts images from Wikipedia
4 sim 06106 categories 273 types (eg person place )
4 sim 100106 links between articles categories and types
() Open SourceCreative Commons
Storing RSS information with Cubicweb
Each object (or entity) is defined in a schema and may be displayedusing different views
Storing RSS
Define (in schemapy ) how RSS should be stored in the database
class RSSArt ic le ( Ent i tyType ) t i t l e = S t r i n g ( ) T i t l e o f the feedu r i = S t r i n g ( unique=True ) Ur i o f the rss feedcontent = S t r i n g ( ) Content o f the feed
Fetching RSS (based on feedparser and BeautifulSoup)
Simply construct a source by giving an URL and a parser
u r l = u rsquo h t t p feeds bbc i co uk news wor ld rss xml rsquos = session c r e a t e _ e n t i t y ( rsquoCWSource rsquo name=u rsquoBBCNewsminusWorld rsquo u r l = u r l
type=u rsquo datafeed rsquo parser=u rsquo rssminusparser rsquo con f i g =u rsquo synchron iza t ionminusi n t e r v a l =240min rsquo )
s pu l l _da ta ( session )
7 englishamerican journals (The New York Times )
Storing Dbpedia information with Cubicweb
Storing Dbpedia page
Define (in schema.py) how Dbpedia pages should be stored in the database:
    class DbpediaPage(EntityType):
        uri = String(unique=True, indexed=True)  # URI of the resource
        label = String(indexed=True)  # http://www.w3.org/2000/01/rdf-schema
        pageid = String()             # http://dbpedia.org/ontology/wikiPageID
        abstract = String()           # http://dbpedia.org/ontology/abstract
        homepage = String()           # http://xmlns.com/foaf/0.1/homepage
        thumbnail = String()          # http://dbpedia.org/ontology/thumbnail
        depiction = String()          # http://xmlns.com/foaf/0.1/depiction
        wikipage = String()           # http://xmlns.com/foaf/0.1/page
        latitude = String()           # http://www.w3.org/2003/01/geo/wgs84_pos
        longitude = String()          # http://www.w3.org/2003/01/geo/wgs84_pos
Storing all Dbpedia information (∼ 9 × 10^6 pages, ∼ 100 × 10^6 links, 20 GB) in Cubicweb takes less than 24 hours.
Analyzing RSS news
What can we do with the RSS news stored in the database?
1. Extract relevant features of the data
2. Construct a usable (i.e. matrix) representation of the data
3. Cluster (group) RSS news together
4. Deeper semantic analysis and visualization of the information
Example sentence
“Google is to buy mobile phone manufacturer Motorola Mobility, allowing it to mount a serious challenge to Apple Inc.”
Extracting information: Classical approaches (Scikits-learn)
Char-N-gram
Extracts features of N characters from a text
    from scikits.learn.feature_extraction.text import CharNGramAnalyzer
    analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)
450 features, e.g. 'goo', 'oog', 'ogl', 'mob', 'obi', 'bil'
Word-N-gram
Extracts features of N words from a text
    from scikits.learn.feature_extraction.text import WordNGramAnalyzer
    analyzer = WordNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)
58 features, e.g. 'google is to', 'is to buy', 'serious challenge to'
✗ Many irrelevant features (tokens)
✗ Features do not carry much contextual information (i.e. information understandable by humans)
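The two classical extractors can be approximated in plain Python to see why they produce so many features. This is a rough sketch, not the scikits.learn analyzers: the function names and the tokenization (lowercasing, words matched by a simple regular expression) are assumptions, so the feature counts will not necessarily match the 450 and 58 quoted on the slide.

```python
import re


def char_ngrams(text, min_n=3, max_n=6):
    """Return the set of character n-grams of the lowercased text."""
    text = text.lower()
    grams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.add(text[start:start + n])
    return grams


def word_ngrams(text, min_n=3, max_n=6):
    """Return the set of word n-grams (words joined by single spaces)."""
    words = re.findall(r"\w+", text.lower())
    grams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            grams.add(" ".join(words[start:start + n]))
    return grams


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")
chars = char_ngrams(sentence)
words = word_ngrams(sentence)
print(len(chars), "character n-grams,", len(words), "word n-grams")
```

Every window of the text becomes a feature, whether or not it means anything, which is the weakness the two ✗ items point at.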
Feature extraction: Dbpedia - Context
Main hypothesis: only things that exist in Dbpedia (i.e. Wikipedia) are of interest for news analysis → Named Entity Recognition (NER)
Dbpedia feature extraction (Cubicweb/Dbpedia)
    from cubes.semnews.views.nertools import DbpediaEntitiesAnalyzer
    analyzer = DbpediaEntitiesAnalyzer(session, lang='en')
    tokens = analyzer.analyze(sentence)
→ 3 features: 'Apple Inc', 'Google', 'Motorola'
<ENAMEX TYPE="ORGANIZATION">Google</ENAMEX> is to buy mobile phone manufacturer <ENAMEX TYPE="ORGANIZATION">Motorola</ENAMEX> Mobility, allowing it to mount a serious challenge to <ENAMEX TYPE="ORGANIZATION">Apple Inc</ENAMEX>
“DBpedia Spotlight: Shedding Light on the Web of Documents”, Pablo N. Mendes et al., I-Semantics 2011
“Learning Named Entity Recognition from Wikipedia”, Joel Nothman, 2008
“Large-Scale Named Entity Disambiguation Based on Wikipedia Data”, Silviu Cucerzan, 2007
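The idea behind dictionary-based entity extraction can be sketched with a toy label set. This is not the DbpediaEntitiesAnalyzer code, whose internals the slides do not show: the label set, the function name extract_entities and the greedy longest-match strategy are assumptions chosen for illustration.

```python
import re

# Toy stand-in for the indexed Dbpedia labels (hypothetical sample).
KNOWN_LABELS = {"google", "motorola", "apple inc", "apple", "barack obama"}


def extract_entities(text, labels=KNOWN_LABELS, max_words=3):
    """Greedy longest-match lookup of word sequences in a label set."""
    words = re.findall(r"\w+", text)
    found = []
    i = 0
    while i < len(words):
        # Try the longest window first so 'Apple Inc' wins over 'Apple'.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate.lower() in labels:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # no known label starts at this word
    return found


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")
entities = extract_entities(sentence)
print(sorted(entities))
```

With this toy label set the sentence yields the same three features as on the slide; a real label index would of course be far larger and queried through the database rather than held in a Python set.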
Feature extraction: Dbpedia - Properties
Efficient and robust feature extraction
Keep the meaning of a text → interpretable features
✗ 'said former soldier Larry'
✗ 'said student Larry'
✓ 'said Larry Page'
Robust features based on redirections, e.g. 'Obama', 'Barak Obamba', 'Pres. Obama' redirect to 'Barack Obama'
Simple RQL (Relation Query Language) queries
Fast: based on indexed SQL tables and regular expressions
    rset = rql('Any E WHERE E is DbpediaPage, E label %(token)s', {'token': token})
e.g. 19 entities extracted in 4 s from 765 words, among ∼ 8 × 10^6 Dbpedia entries
Different labels but same URI → cross-language feature extraction, e.g. Grenada/Grenade → http://dbpedia.org/resource/Grenada
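Redirections and shared URIs both amount to mapping several surface forms onto one feature. A minimal sketch with in-memory dictionaries, assuming hypothetical sample data (in the application these lookups would be RQL queries against the stored Dbpedia pages, and the language codes are an assumption about how the Grenada/Grenade pair is distinguished):

```python
# Hypothetical sample data: surface form -> canonical Dbpedia label.
REDIRECTS = {
    "Obama": "Barack Obama",
    "Barak Obamba": "Barack Obama",
    "Pres. Obama": "Barack Obama",
}

# Hypothetical sample data: (language, label) -> resource URI.
LABEL_TO_URI = {
    ("en", "Grenada"): "http://dbpedia.org/resource/Grenada",
    ("fr", "Grenade"): "http://dbpedia.org/resource/Grenada",
}


def canonical_label(token):
    """Follow a redirection if one exists, else keep the token unchanged."""
    return REDIRECTS.get(token, token)


def resource_uri(label, lang):
    """Return the URI shared by all language variants, or None if unknown."""
    return LABEL_TO_URI.get((lang, label))


mentions = ["Obama", "Barak Obamba", "Pres. Obama", "Barack Obama"]
features = {canonical_label(m) for m in mentions}
print(features)
print(resource_uri("Grenada", "en") == resource_uri("Grenade", "fr"))
```

Four different spellings collapse into a single feature, and two labels in different languages resolve to the same URI, which is what keeps the feature space small.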
Feature extraction: Matrix representation and storage
Obama's approval
Protest as Spain
Libya rebels fight
Obama in Spain
→
    0 0 1 0
    0 0 0 1
    0 1 0 0
    0 0 1 1
E.g. the Obama-Merkel-Sarkozy space
Results stored in a relation (appears_in_rss) in Cubicweb
    rql('Any X WHERE X appears_in_rss Y')
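Building such a binary occurrence matrix from extracted entities takes only a few lines. The sketch assumes the entities of each news item are already available as lists of labels; the sample data and the function name build_occurrence_matrix are hypothetical, and the column order (alphabetical) is a choice made for the example, not necessarily the one used in the slides.

```python
def build_occurrence_matrix(entities_per_news):
    """Return (sorted entity list, binary matrix with one row per news item)."""
    vocabulary = sorted({e for entities in entities_per_news for e in entities})
    column = {entity: j for j, entity in enumerate(vocabulary)}
    matrix = []
    for entities in entities_per_news:
        row = [0] * len(vocabulary)
        for entity in entities:
            row[column[entity]] = 1  # the entity appears in this news item
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical extraction results for four news items.
entities_per_news = [
    ["Barack Obama"],
    ["Spain"],
    ["Libya"],
    ["Barack Obama", "Spain"],
]
vocabulary, X = build_occurrence_matrix(entities_per_news)
print(vocabulary)
for row in X:
    print(row)
```

Each row is a news item and each column an entity, so two news items that mention the same entities end up close to each other in this space.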
Exploiting information: Clustering news
Creating clusters (groups) of news using the matrix representation of the data
Meanshift algorithm (Scikits-learn)
Based on locating the maxima of a density function
Automatically tunes the number of clusters
    from scikits.learn.cluster import MeanShift, estimate_bandwidth
    bandwidth = estimate_bandwidth(X, quantile=0.005)
    clustering = MeanShift(bandwidth=bandwidth)
    clustering.fit(X)
    labels = clustering.labels_
“Mean shift, mode seeking, and clustering”, Yizong Cheng, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995
Other possible alternatives: Ward's algorithm, K-means, ...
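How mean shift finds the number of clusters by itself can be illustrated in pure Python. The sketch is a simplified flat-kernel variant written for this transcript, not the scikits.learn implementation; the sample points and the hand-picked bandwidth are assumptions (the slides estimate the bandwidth with estimate_bandwidth).

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift; returns (modes, labels)."""
    def neighbours_mean(center):
        near = [p for p in points if math.dist(p, center) <= bandwidth]
        dim = len(center)
        return tuple(sum(p[d] for p in near) / len(near) for d in range(dim))

    # Move every point towards the local density maximum until it stops.
    shifted = []
    for p in points:
        center = tuple(p)
        for _ in range(max_iter):
            new_center = neighbours_mean(center)
            moved = math.dist(new_center, center)
            center = new_center
            if moved < tol:
                break
        shifted.append(center)

    # Converged positions closer than the bandwidth share one mode.
    modes, labels = [], []
    for center in shifted:
        for k, mode in enumerate(modes):
            if math.dist(center, mode) < bandwidth:
                labels.append(k)
                break
        else:
            modes.append(center)
            labels.append(len(modes) - 1)
    return modes, labels


# Made-up 2D sample with two well separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
modes, labels = mean_shift(points, bandwidth=1.0)
print(len(modes), "clusters, labels:", labels)
```

The number of clusters is never passed in: it falls out of how many distinct modes the points converge to, and it depends only on the bandwidth.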
Exploiting information: Queries and Views
Each query returns a result set (rset). A view is called on a result set → it defines the representation rules.
Defining a view (short version)
    class MyEntitiesView(View):
        __regid__ = 'example-view'

        def call(self):
            rset = self.cw_rset
            for entity in rset.entities():
                pass  # do whatever you want, write some HTML, ...
You just have to plug in a rset from a query
    rset = rql('Any X WHERE ...')
    self.wview('example-view', rset)
or within a URL
    http://myapplication/?rql=Any X WHERE ...&vid=example-view
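When a query is passed in a URL, its spaces and special characters have to be encoded. A small sketch using only the standard library; the helper name view_url and the concrete query are assumptions for illustration, and http://myapplication/ is the placeholder host from the slide.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://myapplication/"  # placeholder host, as on the slide


def view_url(rql, vid, base_url=BASE_URL):
    """Return base_url with the RQL query and view id as encoded parameters."""
    return base_url + "?" + urlencode({"rql": rql, "vid": vid})


url = view_url("Any X WHERE X is RSSArticle", "example-view")
print(url)

# Decoding the query string gives back the original values.
params = parse_qs(urlsplit(url).query)
print(params["rql"][0], "|", params["vid"][0])
```

The same result set can thus be rendered by a different view just by changing the vid parameter.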
Plugging Scikits-learn into a view
Combine machine learning tools and the database query system in pure Python
Defining the view
    class RSSClustering(View):
        __regid__ = 'rss-clustering'

        def call(self):
            # Get the result set and create the data
            rset = self.cw_rset
            occurrence_matrix = OccurrenceMatrix()
            occurrence_matrix.construct_matrix(rset=rset)
            # Scale the data
            X = scale(occurrence_matrix, axis=0, with_std=True, copy=True)
            # Compute bandwidth for clustering
            bandwidth = estimate_bandwidth(X)
            # Clustering
            cluster = MeanShift(bandwidth=bandwidth)
            cluster.fit(X)
            labels = cluster.labels_
            # Perform some HTML rendering ...
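The "Scale the data" step standardizes each column of the matrix before clustering. A pure-Python sketch of that operation, assuming that scale with axis=0 and with_std=True means centering every column and dividing it by its (population) standard deviation; the function name scale_columns and the sample matrix are made up for the example.

```python
import math


def scale_columns(matrix):
    """Center each column on zero and divide it by its standard deviation."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        variance = sum((row[j] - means[j]) ** 2 for row in matrix) / n_rows
        # A constant column has zero spread; divide by 1 to avoid 0/0.
        stds.append(math.sqrt(variance) or 1.0)
    return [[(row[j] - means[j]) / stds[j] for j in range(n_cols)]
            for row in matrix]


# Made-up occurrence matrix: 4 news items, 3 entities.
X = [[1, 0, 0],
     [0, 0, 1],
     [1, 0, 1],
     [0, 0, 0]]
X_scaled = scale_columns(X)
for row in X_scaled:
    print(row)
```

After scaling, an entity that appears in almost every news item no longer dominates the distances used by the clustering step.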
A new approach for querying information from RSS
All musical artists in the news
    rql('DISTINCT Any E, R WHERE E appears_in_rss R, '
        'E has_type T, T label "musical artist"')
All living office holder persons in the news
    rql('DISTINCT Any E WHERE E appears_in_rss R, '
        'E has_type T, T label "office holder", '
        'E has_subject C, C label "Living people"')
All news that talk about Barack Obama and any scientist
    rql('DISTINCT Any R WHERE E1 label "Barack Obama", '
        'E1 appears_in_rss R, E2 appears_in_rss R, '
        'E2 has_type T, T label "scientist"')
All news that talk about a drug
    rql('Any X, R WHERE X appears_in_rss R, '
        'X has_type T, T label "drug"')
Try it with an xml view or a thumbnail view
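The logic of the "Barack Obama and any scientist" query can be mimicked with plain Python sets to make the join explicit. This is only an emulation on hypothetical toy data; in the application the relations live in the CubicWeb database and the query is expressed in RQL.

```python
# Hypothetical toy data standing in for the relations stored in CubicWeb.
APPEARS_IN_RSS = {          # entity label -> set of news ids
    "Barack Obama": {1, 2, 3},
    "Stephen Hawking": {2},
    "Marie Curie": {3, 4},
    "Spain": {1},
}
HAS_TYPE = {                # entity label -> set of type labels
    "Barack Obama": {"office holder"},
    "Stephen Hawking": {"scientist"},
    "Marie Curie": {"scientist"},
    "Spain": {"place"},
}


def news_with_entity_and_type(label, type_label):
    """News ids mentioning `label` together with any entity of `type_label`."""
    news_with_label = APPEARS_IN_RSS.get(label, set())
    news_with_type = set()
    for entity, types in HAS_TYPE.items():
        if type_label in types:
            news_with_type |= APPEARS_IN_RSS.get(entity, set())
    # A news item qualifies only if both conditions hold for it.
    return news_with_label & news_with_type


result = news_with_entity_and_type("Barack Obama", "scientist")
print(sorted(result))
```

The shared variable R in the RQL query plays the role of the set intersection here: both entities must appear in the same news item.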
Visualization: Mapping information
View
    class EntityMapView(View):
        __regid__ = 'map'

        def call(self):
            rset = self.cw_rset
            self.init_map()
            for entity in rset.entities():
                self.add_marker(entity.latitude, entity.longitude, entity.dc_title())
            self.center_and_zoom(0, 0, 1.5)
            self.finish_map()
Based on mapstraction (javascript): http://mapstraction.com
✓ Automatically locate information from RSS news
Conclusion
Cubicweb and Scikits-learn
Efficient and easy-to-use tools for data storing, querying and mining
Easy to plug together: only Python tools, all Open Source
Using Dbpedia makes it possible to extract a small number of highly relevant features
✓ Decrease the dimensionality of the data
✓ Link features to millions of pages of information
A new semantic way of querying information
✓ Simple information queries using RQL expressions
✓ Use Dbpedia types and categories to refine the selection
Future Improvements
Named Entity Recognition
→ use disambiguation links, more refined regular expressions
→ add new databases (MusicBrainz, diseasome, ...)
→ closely follow Wikipedia with Dbpedia live update
Motorola Mobility (11:46, 15 August 2011): 'On the 15 August Google announced that it agreed to acquire the company'
RSS news analysis
→ explore new algorithms: bi-clustering
→ add new data sources: Twitter, blogs
→ 'from scikits.learn predict __future__' → use Matrix Completion to predict new edges in the correlation graph
Thank you for your attention
Questions?
Storing RSS information with Cubicweb
Each object (or entity) is defined in a schema and may be displayed using different views.
Storing RSS
Define (in schema.py) how RSS articles should be stored in the database:
    from yams.buildobjs import EntityType, String

    class RSSArticle(EntityType):
        title = String()            # Title of the feed
        uri = String(unique=True)   # Uri of the rss feed
        content = String()          # Content of the feed
Fetching RSS (based on feedparser and BeautifulSoup)
Simply construct a source by giving a URL and a parser:
    url = u'http://feeds.bbci.co.uk/news/world/rss.xml'
    s = session.create_entity('CWSource', name=u'BBCNews-World', url=url,
                              type=u'datafeed', parser=u'rss-parser',
                              config=u'synchronization-interval=240min')
    s.pull_data(session)
7 English/American journals (The New York Times, ...)
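To make the stored fields concrete, here is a minimal sketch that turns one RSS 2.0 document into (title, uri, content) records shaped like the RSSArticle attributes. It is only an illustration: it uses the standard library's xml.etree.ElementTree rather than the feedparser and BeautifulSoup pair the talk relies on, and the function name parse_rss and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed used only for this example.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Google is to buy Motorola Mobility</title>
      <link>http://example.org/news/1</link>
      <description>Google is to buy mobile phone manufacturer Motorola Mobility.</description>
    </item>
    <item>
      <title>Libya rebels fight</title>
      <link>http://example.org/news/2</link>
      <description>Supply lines to the capital are in peril.</description>
    </item>
  </channel>
</rss>
"""


def parse_rss(xml_text):
    """Return one dict per <item>, keyed like the RSSArticle attributes."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "uri": item.findtext("link", default=""),
            "content": item.findtext("description", default=""),
        })
    return articles


articles = parse_rss(SAMPLE_FEED)
for article in articles:
    print(article["uri"], "-", article["title"])
```

Using the item link as the uri mirrors the unique=True constraint in the schema: fetching the same feed twice should not create the same article twice.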
Storing Dbpedia information with Cubicweb
Storing Dbpedia page
Define (in schema.py) how Dbpedia pages should be stored in the database:
    class DbpediaPage(EntityType):
        uri = String(unique=True, indexed=True)  # Uri of the resource
        label = String(indexed=True)  # http://www.w3.org/2000/01/rdf-schema
        pageid = String()             # http://dbpedia.org/ontology/wikiPageID
        abstract = String()           # http://dbpedia.org/ontology/abstract
        homepage = String()           # http://xmlns.com/foaf/0.1/homepage
        thumbnail = String()          # http://dbpedia.org/ontology/thumbnail
        depiction = String()          # http://xmlns.com/foaf/0.1/depiction
        wikipage = String()           # http://xmlns.com/foaf/0.1/page
        latitude = String()           # http://www.w3.org/2003/01/geo/wgs84_pos
        longitude = String()          # http://www.w3.org/2003/01/geo/wgs84_pos
Storing all Dbpedia information (about 9 × 10^6 pages, about 100 × 10^6 links, 20 GB) in Cubicweb takes less than 24 hours.
→ See the full schema
Analyzing RSS news
What can we do with the RSS news stored in the database?
1. Extract relevant features of the data
2. Construct a usable (i.e. matrix) representation of the data
3. Cluster (group) RSS news together
4. Deeper semantic analysis and visualization of the information
Example sentence
“Google is to buy mobile phone manufacturer Motorola Mobility, allowing it to mount a serious challenge to Apple Inc.”
Extracting information: Classical approaches (Scikits-learn)
Char-N-gram
Extracts features of N characters from a text
    from scikits.learn.feature_extraction.text import CharNGramAnalyzer
    analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)
450 features, e.g. 'goo', 'oog', 'ogl', 'mob', 'obi', 'bil'
Word-N-gram
Extracts features of N words from a text
    from scikits.learn.feature_extraction.text import WordNGramAnalyzer
    analyzer = WordNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)
58 features, e.g. 'google is to', 'is to buy', 'serious challenge to'
✗ Many irrelevant features (tokens)
✗ Features carry little contextual information (i.e. information understandable by humans)
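To show what these two analyzers produce, here is a minimal pure-Python sketch of character and word n-gram extraction. It is not the Scikits-learn implementation: the function names char_ngrams and word_ngrams are hypothetical, and the tokenization is a plain lower-casing and whitespace split, so the feature counts may differ from the 450 and 58 quoted on the slide.

```python
def char_ngrams(text, min_n=3, max_n=6):
    """Return every run of min_n..max_n consecutive characters, lower-cased."""
    text = text.lower()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


def word_ngrams(text, min_n=3, max_n=6):
    """Return every run of min_n..max_n consecutive words, lower-cased."""
    words = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")

print(char_ngrams(sentence)[:3])   # ['goo', 'oog', 'ogl']
print(word_ngrams(sentence)[:2])   # ['google is to', 'is to buy']
```

Most of these features ('oog', 'is to buy') say nothing about what the news item is about, which is the drawback the slide points out.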
Feature extraction: Dbpedia - Context
Main hypothesis: only things that exist in Dbpedia (i.e. Wikipedia) are of interest for news analysis → Named Entities Recognition (NER)
Dbpedia feature extraction (Cubicweb/Dbpedia)
    from cubes.semnews.views.nertools import DbpediaEntitiesAnalyzer
    analyzer = DbpediaEntitiesAnalyzer(session, lang='en')
    tokens = analyzer.analyze(sentence)
→ 3 features: 'Apple Inc', 'Google', 'Motorola'
<ENAMEX TYPE="ORGANIZATION">Google</ENAMEX> is to buy mobile phone manufacturer <ENAMEX TYPE="ORGANIZATION">Motorola</ENAMEX> Mobility, allowing it to mount a serious challenge to <ENAMEX TYPE="ORGANIZATION">Apple Inc</ENAMEX>
→ Try it
“DBpedia Spotlight: Shedding Light on the Web of Documents”, Pablo N. Mendes et al., I-Semantics 2011
“Learning Named Entity Recognition from Wikipedia”, Joel Nothman, 2008
“Large-Scale Named Entity Disambiguation Based on Wikipedia Data”, Silviu Cucerzan, 2007
Feature extraction: Dbpedia - Properties
Efficient and robust feature extraction
Keep the meaning of a text → interpretable features
✗ 'said former soldier Larry'
✗ 'said student Larry'
✓ 'said Larry Page'
Robust features based on redirections, e.g. 'Obama', 'Barak Obamba', 'Pres Obama' redirect to 'Barack Obama'
Simple RQL (Relation Query Language) queries
Fast: based on indexed SQL tables and regular expressions
    rset = rql('Any E WHERE E is DbpediaPage, E label %(token)s', {'token': token})
e.g. 19 entities extracted in 4 s from 765 words, among about 8 × 10^6 Dbpedia entries
Different labels but same URI → cross-language feature extraction, e.g. Grenada/Grenade → http://dbpedia.org/resource/Grenada
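The lookup idea (match runs of words against known labels, then follow redirections to one canonical entry) can be sketched without a database. This is only an illustration under stated assumptions: the tiny LABELS and REDIRECTS tables and the extract_entities function are hypothetical stand-ins for the indexed DbpediaPage table, whereas the analyzer in the talk queries Cubicweb through RQL.

```python
import re

# Hypothetical miniature label index: lower-cased label -> Dbpedia URI.
LABELS = {
    "barack obama": "http://dbpedia.org/resource/Barack_Obama",
    "google": "http://dbpedia.org/resource/Google",
    "motorola": "http://dbpedia.org/resource/Motorola",
    "apple inc": "http://dbpedia.org/resource/Apple_Inc.",
}
# Hypothetical redirections: alternative label -> canonical label.
REDIRECTS = {
    "obama": "barack obama",
    "pres obama": "barack obama",
}
MAX_WORDS = 3  # longest label, in words, tried at each position


def extract_entities(text):
    """Greedy longest-match lookup of word sequences in the label index."""
    words = re.findall(r"\w+", text.lower())
    found, i = [], 0
    while i < len(words):
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            label = REDIRECTS.get(candidate, candidate)
            if label in LABELS:
                found.append(LABELS[label])
                i += n  # skip the words consumed by this entity
                break
        else:
            i += 1  # no known label starts at this word
    return found


print(extract_entities("Pres Obama met Google engineers"))
```

Because every alternative label resolves to the same URI, 'Obama' and 'Barack Obama' end up as one feature instead of two.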
Feature extraction: Matrix representation and storage
    Obama's approval      →   0 0 1 0
    Protest as Spain      →   0 0 0 1
    Libya rebels fight    →   0 1 0 0
    Obama in Spain        →   0 0 1 1
E.g. the Obama-Merkel-Sarkozy space
Results stored in a relation (appears_in_rss) in Cubicweb
    rql('Any X WHERE X appears_in_rss Y')
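A minimal sketch of how such a binary occurrence matrix could be built from per-news entity lists follows. The function name occurrence_matrix and the extraction results are hypothetical; in the talk the same information lives in the appears_in_rss relation.

```python
def occurrence_matrix(entities_per_news):
    """Build a binary news x entity matrix from per-news entity lists."""
    vocabulary = sorted({e for entities in entities_per_news for e in entities})
    column = {entity: j for j, entity in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in entities_per_news]
    for i, entities in enumerate(entities_per_news):
        for entity in entities:
            matrix[i][column[entity]] = 1
    return vocabulary, matrix


# Hypothetical extraction results for four headlines.
news = [
    ["Barack Obama"],            # "Obama's approval ..."
    ["Spain"],                   # "Protest as Spain ..."
    ["Libya"],                   # "Libya rebels fight ..."
    ["Barack Obama", "Spain"],   # "Obama in Spain ..."
]
vocabulary, X = occurrence_matrix(news)
print(vocabulary)   # ['Barack Obama', 'Libya', 'Spain']
for row in X:
    print(row)
```

Each row is one news item and each column one entity, so the number of columns is bounded by the number of distinct entities found rather than by the number of n-grams.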
Exploiting information: Clustering news
Creating clusters (groups) of news using the matrix representation of the data
Meanshift algorithm (Scikits-learn)
Based on locating the maxima of a density function
Automatically tunes the number of clusters
    from scikits.learn.cluster import MeanShift, estimate_bandwidth
    bandwidth = estimate_bandwidth(X, quantile=0.005)
    clustering = MeanShift(bandwidth=bandwidth)
    clustering.fit(X)
    labels = clustering.labels_
“Mean shift, mode seeking, and clustering”, Yizong Cheng, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995
→ Try it
Other possible alternatives: Ward's algorithm, K-means, ...
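To illustrate why mean shift does not need the number of clusters in advance, here is a small pure-Python sketch with a flat kernel. It is a teaching sketch, not the Scikits-learn implementation: the function name mean_shift, the fixed bandwidth and the toy points are assumptions made for this example.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift: move each point to the mean of its neighbours
    until nothing moves, then merge nearby end points into clusters."""
    points = [tuple(p) for p in points]
    modes = list(points)
    for _ in range(max_iter):
        moved = 0.0
        new_modes = []
        for m in modes:
            # Data points within one bandwidth of the current position.
            neigh = [p for p in points if math.dist(p, m) <= bandwidth]
            centre = tuple(sum(c) / len(neigh) for c in zip(*neigh))
            moved = max(moved, math.dist(centre, m))
            new_modes.append(centre)
        modes = new_modes
        if moved < tol:
            break
    # Positions that ended up closer than one bandwidth share a cluster.
    centres, labels = [], []
    for m in modes:
        for k, c in enumerate(centres):
            if math.dist(m, c) < bandwidth:
                labels.append(k)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return labels, centres


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = mean_shift(points, bandwidth=3.0)
print(labels)        # [0, 0, 0, 1, 1, 1]
print(len(centres))  # 2
```

The only parameter is the bandwidth; the two clusters emerge from the density of the data, which is the property the slide highlights.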
Exploiting information: Queries and Views
Each query returns a result set (rset). A view is called on a result set → define the representation rules
Defining a view (short version)
    from cubicweb.view import View

    class MyEntitiesView(View):
        __regid__ = 'example-view'

        def call(self):
            rset = self.cw_rset
            for entity in rset.entities():
                pass  # do whatever you want, write some HTML
Then you just have to plug in a rset from a query
    rset = rql('Any X WHERE ...')
    self.wview('example-view', rset)
or within a URL
    http://myapplication/?rql=Any X WHERE ...&vid=example-view
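Since an RQL query contains spaces, it has to be percent-encoded before it goes into such a URL. A minimal sketch with the standard library follows; the host name is the placeholder from the slide, and the helper name view_url is hypothetical.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = "http://myapplication/"  # placeholder host, as on the slide


def view_url(rql, vid, base_url=BASE_URL):
    """Build a URL that runs `rql` and renders the result with view `vid`."""
    return base_url + "?" + urlencode({"rql": rql, "vid": vid})


url = view_url("Any X WHERE X is DbpediaPage", "example-view")
print(url)

# Decoding the query string gives back the original parameters.
params = parse_qs(urlsplit(url).query)
print(params["rql"][0], "|", params["vid"][0])
```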
Plugging Scikits-learn into a view
Combine machine learning tools and a database query system in pure Python
Defining the view
    class RSSClustering(View):
        __regid__ = 'rss-clustering'

        def call(self):
            # Get the result set and create the data
            rset = self.cw_rset
            occurrence_matrix = OccurrenceMatrix()
            occurrence_matrix.construct_matrix(rset=rset)
            # Scale the data
            X = scale(occurrence_matrix, axis=0, with_std=True, copy=True)
            # Compute bandwidth for clustering
            bandwidth = estimate_bandwidth(X)
            # Clustering
            cluster = MeanShift(bandwidth=bandwidth)
            cluster.fit(X)
            labels = cluster.labels_
            # Perform some HTML rendering
→ Try it
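The scaling step in this view, scale(..., axis=0, with_std=True), standardizes each column before clustering. Here is a minimal sketch of that operation in pure Python. It assumes the population standard deviation and maps constant columns to zero; both are choices of this sketch, and the function name scale_columns is hypothetical.

```python
from statistics import mean, pstdev


def scale_columns(matrix):
    """Center each column on zero and divide it by its standard deviation."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; divide by 1 so it stays at zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in matrix
    ]


X = [[1, 0, 0],
     [0, 0, 1],
     [1, 0, 1],
     [0, 0, 0]]
for row in scale_columns(X):
    print(row)
```

After scaling, every entity column has the same spread, so a frequently mentioned entity does not dominate the distances used by the clustering.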
A new approach for querying information from RSS
All musical artists in the news
    rql('DISTINCT Any E, R WHERE E appears_in_rss R, '
        'E has_type T, T label "musical artist"')
All living office holder persons in the news
    rql('DISTINCT Any E WHERE E appears_in_rss R, '
        'E has_type T, T label "office holder", '
        'E has_subject C, C label "Living people"')
All news that talk about Barack Obama and any scientist
    rql('DISTINCT Any R WHERE E1 label "Barack Obama", '
        'E1 appears_in_rss R, E2 appears_in_rss R, '
        'E2 has_type T, T label "scientist"')
All news that talk about a drug
    rql('Any X, R WHERE X appears_in_rss R, '
        'X has_type T, T label "drug"')
Try it with an xml view or a thumbnail view
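To make the semantics of the "Barack Obama and any scientist" query concrete, here is a small in-memory emulation with Python sets. The two dictionaries are hypothetical stand-ins for the appears_in_rss and has_type relations, and the entity names other than Barack Obama are invented; this is a sketch of the join, not of how Cubicweb executes RQL.

```python
# Hypothetical in-memory stand-ins for the two relations used in the queries.
APPEARS_IN_RSS = {            # entity label -> ids of the news mentioning it
    "Barack Obama": {1, 2, 4},
    "Some Scientist": {2, 3},
    "Some Singer": {4},
}
HAS_TYPE = {                  # entity label -> type labels
    "Barack Obama": {"office holder"},
    "Some Scientist": {"scientist"},
    "Some Singer": {"musical artist"},
}


def news_about(label, type_label):
    """Ids of news mentioning `label` together with an entity of `type_label`."""
    with_label = APPEARS_IN_RSS.get(label, set())
    with_type = set()
    for entity, types in HAS_TYPE.items():
        if type_label in types:
            with_type |= APPEARS_IN_RSS.get(entity, set())
    return with_label & with_type


print(sorted(news_about("Barack Obama", "scientist")))        # [2]
print(sorted(news_about("Barack Obama", "musical artist")))   # [4]
```

The shared variable R in the RQL query plays the role of the set intersection here: a news item is kept only if both entities appear in it.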
Visualization: Mapping information
View
    class EntityMapView(View):
        __regid__ = 'map'

        def call(self):
            rset = self.cw_rset
            self.init_map()
            for entity in rset.entities():
                self.add_marker(entity.latitude, entity.longitude,
                                entity.dc_title())
            self.center_and_zoom(0 0 1 5)  # argument separators lost in the transcript
            self.finish_map()
Based on mapstraction (JavaScript): http://mapstraction.com
✓ Automatically locate information from RSS news
Try it
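The DbpediaPage schema stores latitude and longitude as strings, and many entities have no coordinates at all, so the markers need some filtering before they reach the map. A minimal sketch of that step follows, with entities represented as plain dicts; the function name markers and the sample data are hypothetical.

```python
def markers(entities):
    """Return (lat, lng, title) tuples for entities with usable coordinates."""
    result = []
    for entity in entities:
        try:
            lat = float(entity["latitude"])
            lng = float(entity["longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or malformed coordinates: nothing to draw
        if -90 <= lat <= 90 and -180 <= lng <= 180:
            result.append((lat, lng, entity["title"]))
    return result


# Hypothetical entities, as plain dicts for the example.
entities = [
    {"title": "Tokyo", "latitude": "35.6895", "longitude": "139.6917"},
    {"title": "Google", "latitude": None, "longitude": None},
    {"title": "Tripoli", "latitude": "32.8872", "longitude": "13.1913"},
]
print(markers(entities))
```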
Conclusion
Cubicweb and Scikits-learn
Efficient and easy-to-use tools for data storing, querying and mining
Easy to plug together: only Python tools, all Open Source
Using Dbpedia makes it possible to extract a small number of highly relevant features
✓ Decrease the dimensionality of the data
✓ Link features to millions of pages of information
A new semantic way of querying information
✓ Simple information queries using RQL expressions
✓ Use Dbpedia types and categories to refine the selection
Future Improvements
Named Entities Recognition
→ use disambiguation links, more refined regular expressions
→ add new databases (MusicBrainz, diseasome, ...)
→ closely follow Wikipedia with Dbpedia live update
Motorola Mobility (11:46, 15 August 2011): 'On the 15 August, Google announced that it agreed to acquire the company'
RSS news analysis
→ explore new algorithms: bi-clustering, ...
→ add new data sources: Twitter, blogs, ...
→ 'from scikits.learn predict __future__' → use Matrix Completion to predict new edges in the correlation graph
Thank you for your attention
Questions?
Overview
1 Introduction
2 Extracting information from RSS news
3 Analyzing RSS news
Extracting information Classical approaches (Scikits-learn)
Char-N-gram
Extracts features of N characters from a text
from s c i k i t s l ea rn f e a t u r e _ e x t r a c t i o n t e x t import CharNGramAnalyzeranalyzer = CharNGramAnalyzer ( min_n=3 max_n=6)fea tu res = analyzer analyze ( sentence )
450 features rsquogoorsquo rsquooogrsquo rsquooglrsquo rsquomobrsquo rsquoobirsquo rsquobilrsquo
Word-N-gram
Extracts features of N words from a text
from s c i k i t s l ea rn f e a t u r e _ e x t r a c t i o n t e x t import WordNGramAnalyzeranalyzer = WordNGramAnalyzer ( min_n=3 max_n=6)fea tu res = analyzer analyze ( sentence )
58 features rsquogoogle is torsquo rsquois to buyrsquo rsquoserious challenge torsquo
8 Many irrelevant features (tokens)8 Features do not carry lots of contextual information (ie
understandable by humans)
Feature extraction Dbpedia - Context
Main Hypothesis Only things that exist in Dbpedia (ieWikipedia) have some interest in news analysisrarr Named
Entities Recognition (NER)
Dbpedia feature extraction (CubicwebDbpedia)
from cubes semnews views ne r t oo l s import Dbped iaEnt i t iesAna lyzeranalyzer = Dbped iaEnt i t iesAna lyzer ( session lang= rsquo en rsquo )tokens = analyzer analyze ( sentence )
rarr 3 features rsquoApple Incrsquo rsquoGooglersquo rsquoMotorolarsquo
ltENAMEX TYPE=ORGANIZATIONgt GoogleltENAMEXgt is to buy mobile phone
manufacturer ltENAMEX TYPE=ORGANIZATIONgt MotorolaltENAMEXgt Mobility allowing it to
mount a serious challenge to ltENAMEX TYPE=ORGANIZATIONgt Apple IncltENAMEXgt
rarr Try it
ldquoDBpedia Spotlight Shedding Light on the Web of Documentsrdquo Pablo N Mendes et al I-Semantics 2011ldquoLearning Named Entity Recognition from Wikipediardquo Joel Nothman 2008
ldquoLarge-Scale Named Entity Disambiguation Based on Wikipedia Datardquo Silviu Cucerzan 2007
Feature extraction Dbpedia - Properties
Efficient and robust feature extraction
Keep the meaning of a textrarr interpretable features8 rsquo said former soldier Larry rsquo8 rsquo said student Larry rsquo4 rsquo said Larry Page rsquo
Robust features based on redirectionseg rsquoObamarsquo rsquoBarak Obambarsquo rsquoPres Obamarsquo redirect to rsquoBarack Obamarsquo
Simple RQL (Relation Query Language) queries
Fast based on indexed SQL tables and regular expressions rset = rql(rsquoAny E WHERE E is DbpediaPage
E label (token)srsquo rsquotokenrsquo token)
eg 19 entities extracted in 4s in 765 words among sim 8106 dbpedia entries
Different labels but same URIrarr cross-language featureextraction eg GrenadaGrenaderarr http dbpediaorgresourceGrenada
Overview
1 Introduction
2 Extracting information from RSS news
3 Analyzing RSS news
Feature extraction Matrix representation and storage
Obamarsquos approval Protest as Spain
Libya rebels fight Obama in Spain
rarr
0 0 1 00 0 0 1 0 1 0 00 0 1 1
Eg The Obama-Merkel-Sarkozy
space
Results stored in a relation (appears in rss) in Cubicweb
rql(rsquoAny X WHERE X appears_in_rss Yrsquo)
Exploiting information Clustering news
Creating clusters (groups) of news using the matrixrepresentation of the data
Meanshift algorithm (Scikits-learn)
Based on locating the maxima of a density function
Automatically tunes the number of clusters
from s c i k i t s l ea rn c l u s t e r import MeanShift est imate_bandwidthbandwidth = est imate_bandwidth (X q u a n t i l e =0005)c l u s t e r i n g = MeanShift ( bandwidth=bandwidth )c l u s t e r i n g f i t (X)l a b e l s = c l u s t e r i n g labe ls_
ldquoMean shift mode seeking and clusteringrdquo Yizong Cheng IEEE Transactions on Pattern Analysis and Machine Intelligence1995
rarr Try it
Other possible alternatives Wardrsquos algorithm K-means
Exploiting information Queries and Views
Each query returns a result set (rset) A view is called on a result setrarr define the representation rules
Defining a view (short version)
class MyEnt i t iesView ( View ) __regid__ = rsquo exampleminusview rsquo
def c a l l ( s e l f ) r s e t = s e l f cw_rsetfor e n t i t y in r s e t e n t i t i e s ( )
do_whatever_you_want_ write_some_html_
you just have to plug a rset from a query
r s e t = r q l ( rsquo Any X WHERE rsquo )s e l f wview ( rsquo exampleminusview rsquo r s e t )
or within an url
httpmyapplicationrql=Any X WHERE ampvid=example-view
Pluging Scikitslearn in a view
Combine machine learning tools and database query system in purePython
Defining the view
class RSSClustering ( View ) __regid__ = rsquo rssminusc l u s t e r i n g rsquo
def c a l l ( s e l f ) Get the r e s u l t s set and create the datar s e t = s e l f cw_rsetoccurrence_matr ix = OccurrenceMatr ix ( )occurrence_matr ix cons t ruc t_ma t r i x ( r s e t = r s e t ) Scale the dataX = scale ( occurrence_matr ix ax is =0 w i th_s td=True copy=True ) Compute bandwith f o r c l u s t e r i n gbandwidth = est imate_bandwidth (X) C lus te r i ngc l u s t e r = MeanShift ( bandwidth=bandwidth )c l u s t e r f i t (X)l a b e l s = c l u s t e r l abe ls_ Perform some HTML render ing
rarr Try it
A new approach for querying information from RSS
All musical artists in the newsrql(rsquoDISTINCT Any E R WHERE E appears_in_rss R
E has_type T T label musical artistrsquo)
All living office holder persons in the news
rql(rsquoDISTINCT Any E WHERE E appears_in_rss RE has_type T T label office holderE has_subject C C label Living peoplersquo)
All news that talk about Barack Obama and any scientist
rql(rsquoDISTINCT Any R WHERE E1 label Barack ObamaE1 appears_in_rss R E2 appears_in_rss RE2 has_type T T label scientistrsquo)
All news that talk about a drug
rql(rsquoAny X R WHERE X appears_in_rss RX has_type T T label drugrsquo)
Try it with an xml view or a thumbnail view
Vizualisation Mapping information
View
class EntityMapView ( View ) __regid__ = rsquomap rsquo
def c a l l ( s e l f ) r s e t = s e l f cw_rsets e l f in i t_map ( )for e n t i t y in r s e t e n t i t i e s ( )
s e l f add_marker ( e n t i t y l a t i t u d e e n t i t y long i tude e n t i t y d c _ t i t l e )
s e l f center_and_zoom (0 0 1 5 )s e l f f in ish_map ( )
Based on mapstraction (javascript) http mapstractioncom
4 Automatically locate information from RSS news
Try it
Conclusion
Cubicweb and Scikits-learn
Efficient and easy-to-use tools for storing, querying and mining data
Easy to plug together: only Python tools, all Open Source
Using Dbpedia makes it possible to extract very few, highly relevant features
✓ Decreases the dimensionality of the data
✓ Links features to millions of pages of information
A new semantic way of querying information
✓ Simple information queries using RQL expressions
✓ Use Dbpedia types and categories to refine the selection
Future Improvements
Named Entities Recognition
→ use disambiguation links, more refined regular expressions
→ add new databases (MusicBrainz, diseasome, ...)
→ closely follow Wikipedia with Dbpedia live update
  Motorola Mobility (11:46, 15 August 2011): 'On the 15 August Google announced that it agreed to acquire the company'
RSS news analysis
→ explore new algorithms: bi-clustering
→ add new data sources: Twitter, blogs
→ 'from scikits.learn predict __future__' → use Matrix Completion to predict new edges in the correlation graph
Thank you for your attention
Questions?
Extracting information: Classical approaches (Scikits-learn)
Char-N-gram
Extracts features of N characters from a text
    from scikits.learn.feature_extraction.text import CharNGramAnalyzer
    analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)
450 features: 'goo', 'oog', 'ogl', 'mob', 'obi', 'bil', ...
Word-N-gram
Extracts features of N words from a text
    from scikits.learn.feature_extraction.text import WordNGramAnalyzer
    analyzer = WordNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)
58 features: 'google is to', 'is to buy', 'serious challenge to', ...
✗ Many irrelevant features (tokens)
✗ Features do not carry much contextual information (i.e. information understandable by humans)
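What the two analyzers compute can be sketched in a few lines of plain Python. This is only an illustration of the n-gram idea, not the Scikits-learn code: the function names are hypothetical, and the text normalisation (lower-casing, splitting on whitespace) is an assumption that may differ from what the library does.

```python
def char_ngrams(text, min_n=3, max_n=6):
    """Return every character n-gram of length min_n..max_n, shortest first."""
    text = text.lower()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


def word_ngrams(text, min_n=3, max_n=6):
    """Return every run of min_n..max_n consecutive words, shortest first."""
    words = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


sentence = "Google is to buy Motorola"
char_features = char_ngrams(sentence, 3, 3)
word_features = word_ngrams(sentence, 3, 3)
```

Even on this five-word sentence the character trigrams alone number 23, which illustrates why the feature count grows so quickly and why most of the tokens are meaningless on their own.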
Feature extraction: Dbpedia - Context
Main hypothesis: only things that exist in Dbpedia (i.e. Wikipedia) are of interest for news analysis → Named Entities Recognition (NER)
Dbpedia feature extraction (Cubicweb/Dbpedia)
    from cubes.semnews.views.nertools import DbpediaEntitiesAnalyzer
    analyzer = DbpediaEntitiesAnalyzer(session, lang='en')
    tokens = analyzer.analyze(sentence)
→ 3 features: 'Apple Inc', 'Google', 'Motorola'
<ENAMEX TYPE="ORGANIZATION">Google</ENAMEX> is to buy mobile phone manufacturer <ENAMEX TYPE="ORGANIZATION">Motorola</ENAMEX> Mobility, allowing it to mount a serious challenge to <ENAMEX TYPE="ORGANIZATION">Apple Inc</ENAMEX>
→ Try it
"DBpedia Spotlight: Shedding Light on the Web of Documents", Pablo N. Mendes et al., I-Semantics 2011
"Learning Named Entity Recognition from Wikipedia", Joel Nothman, 2008
"Large-Scale Named Entity Disambiguation Based on Wikipedia Data", Silviu Cucerzan, 2007
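The annotated sentence above can be reproduced with a simple label-matching sketch. This is not the `DbpediaEntitiesAnalyzer` from the slides: it assumes the known entity labels are given as a plain list, uses a single hard-coded entity type for every match, and the function name `tag_entities` is hypothetical.

```python
import re


def tag_entities(sentence, labels, entity_type="ORGANIZATION"):
    """Wrap every known label found in the sentence in an ENAMEX tag.

    Returns the tagged sentence and the sorted list of labels found.
    """
    if not labels:
        return sentence, []
    # Longest labels first, so a long label wins over a shorter overlapping one.
    ordered = sorted(labels, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(label) for label in ordered) + r")\b")
    found = []

    def wrap(match):
        found.append(match.group(1))
        return '<ENAMEX TYPE="%s">%s</ENAMEX>' % (entity_type, match.group(1))

    return pattern.sub(wrap, sentence), sorted(set(found))


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc")
tagged, features = tag_entities(sentence, ["Google", "Motorola", "Apple Inc"])
```

Only three features come out of a sentence that produced hundreds of n-grams, and each of them is a name a human reader would recognise.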
Feature extraction: Dbpedia - Properties
Efficient and robust feature extraction
Keeps the meaning of a text → interpretable features
✗ 'said former soldier Larry'
✗ 'said student Larry'
✓ 'said Larry Page'
Robust features based on redirections
e.g. 'Obama', 'Barak Obamba', 'Pres. Obama' redirect to 'Barack Obama'
Simple RQL (Relation Query Language) queries
Fast: based on indexed SQL tables and regular expressions
    rset = rql('Any E WHERE E is DbpediaPage, E label %(token)s', {'token': token})
e.g. 19 entities extracted in 4s in 765 words, among ~8 × 10^6 dbpedia entries
Different labels but same URI → cross-language feature extraction
e.g. Grenada/Grenade → http://dbpedia.org/resource/Grenada
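The effect of redirections can be sketched with a small lookup table. The table below is a hypothetical stand-in for the Dbpedia redirect data (the real lookup in the slides goes through an RQL query on indexed labels), and the helper names are invented for illustration.

```python
# Hypothetical redirect table: normalised surface form -> canonical Dbpedia label.
REDIRECTS = {
    "obama": "Barack Obama",
    "barak obamba": "Barack Obama",
    "pres. obama": "Barack Obama",
    "barack obama": "Barack Obama",
}


def canonical_label(surface_form, redirects=REDIRECTS):
    """Return the canonical label for a surface form, or None if it is unknown."""
    key = " ".join(surface_form.lower().split())
    return redirects.get(key)


def canonical_features(surface_forms, redirects=REDIRECTS):
    """Map surface forms to canonical labels, dropping unknowns and duplicates."""
    features = []
    for form in surface_forms:
        label = canonical_label(form, redirects)
        if label is not None and label not in features:
            features.append(label)
    return features


features = canonical_features(["Obama", "Pres.  Obama", "Barak Obamba", "student Larry"])
```

Three different spellings collapse into one feature, and a candidate with no Dbpedia entry is simply discarded, which is what keeps the feature space small.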
Overview
1 Introduction
2 Extracting information from RSS news
3 Analyzing RSS news
Feature extraction: Matrix representation and storage
News:
    Obama's approval
    Protest as Spain
    Libya rebels fight
    Obama in Spain
→
    0 0 1 0
    0 0 0 1
    0 1 0 0
    0 0 1 1
E.g. the Obama-Merkel-Sarkozy space
Results stored in a relation (appears_in_rss) in Cubicweb
    rql('Any X WHERE X appears_in_rss Y')
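A binary occurrence matrix of this kind can be built directly from (entity, news) pairs such as those returned by the `appears_in_rss` relation. The sketch below is an assumption about the general approach, not the `OccurrenceMatrix` class used in the talk: the function name, the sorted ordering of rows and columns, and the toy pairs are all hypothetical.

```python
def occurrence_matrix(pairs):
    """Build a news x entities 0/1 matrix from (entity_label, news_id) pairs.

    Returns (news_ids, entities, matrix); rows follow news_ids, columns follow entities.
    """
    pairs = list(pairs)
    news_ids = sorted({news for _, news in pairs})
    entities = sorted({entity for entity, _ in pairs})
    row_of = {news: index for index, news in enumerate(news_ids)}
    col_of = {entity: index for index, entity in enumerate(entities)}
    matrix = [[0] * len(entities) for _ in news_ids]
    for entity, news in pairs:
        matrix[row_of[news]][col_of[entity]] = 1
    return news_ids, entities, matrix


# Hypothetical pairs for four news items.
toy_pairs = [
    ("Barack Obama", 1),
    ("Spain", 2),
    ("Libya", 3),
    ("Barack Obama", 4),
    ("Spain", 4),
]
news_ids, entities, matrix = occurrence_matrix(toy_pairs)
```

Each row is then the position of one news item in the space spanned by the entities, which is the representation the clustering step works on.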
Exploiting information: Clustering news
Creating clusters (groups) of news using the matrix representation of the data
Meanshift algorithm (Scikits-learn)
Based on locating the maxima of a density function
Automatically tunes the number of clusters
    from scikits.learn.cluster import MeanShift, estimate_bandwidth
    bandwidth = estimate_bandwidth(X, quantile=0.005)
    clustering = MeanShift(bandwidth=bandwidth)
    clustering.fit(X)
    labels = clustering.labels_
"Mean shift, mode seeking, and clustering", Yizong Cheng, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995
→ Try it
Other possible alternatives: Ward's algorithm, K-means, ...
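The idea behind mean shift, moving each point uphill until it reaches a density maximum, can be sketched in plain Python. This is a deliberately naive flat-kernel version written for illustration, not the Scikits-learn implementation: the bandwidth is passed by hand instead of being estimated, and the merge threshold of half a bandwidth is an arbitrary assumption.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift. Returns (modes, labels), one label per input point."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Step 1: shift every point to the mean of its neighbours until it stops moving.
    shifted = []
    for point in points:
        current = tuple(point)
        for _ in range(max_iter):
            neighbours = [p for p in points if dist(p, current) <= bandwidth]
            if not neighbours:
                break
            mean = tuple(sum(coords) / len(neighbours) for coords in zip(*neighbours))
            moved = dist(mean, current)
            current = mean
            if moved < tol:
                break
        shifted.append(current)

    # Step 2: points that converged to nearby positions share a mode (cluster).
    modes, labels = [], []
    for current in shifted:
        for index, mode in enumerate(modes):
            if dist(current, mode) < bandwidth / 2:
                labels.append(index)
                break
        else:
            modes.append(current)
            labels.append(len(modes) - 1)
    return modes, labels


toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
modes, labels = mean_shift(toy_points, bandwidth=1.0)
```

The number of clusters is never given as input: it falls out of how many distinct modes the points converge to, which is the property highlighted on the slide.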
Exploiting information: Queries and Views
Each query returns a result set (rset). A view is called on a result set → it defines the representation rules
Defining a view (short version):
    class MyEntitiesView(View):
        __regid__ = 'example-view'

        def call(self):
            rset = self.cw_rset
            for entity in rset.entities():
                # do whatever you want
                # write some HTML
You just have to plug in a rset from a query:
    rset = rql('Any X WHERE ...')
    self.wview('example-view', rset)
or within a URL:
    http://myapplication/?rql=Any X WHERE ...&vid=example-view
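When such a URL is built programmatically, the query text has to be percent-encoded. The sketch below uses only the Python standard library; the helper name `view_url` is hypothetical, the host name is the placeholder used above, and placing the parameters directly after the base URL is an assumption about the application's URL layout.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def view_url(base_url, rql, vid):
    """Build a URL that applies the view `vid` to the result set of `rql`."""
    return base_url.rstrip("/") + "/?" + urlencode({"rql": rql, "vid": vid})


url = view_url("http://myapplication/",
               "Any X WHERE X appears_in_rss Y",
               "example-view")

# Decoding the query string gives back the original parameters.
params = parse_qs(urlsplit(url).query)
```

Encoding matters most for the longer queries shown elsewhere in the talk, which contain commas and quoted labels.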
Exploiting information Clustering news
Creating clusters (groups) of news using the matrixrepresentation of the data
Meanshift algorithm (Scikits-learn)
Based on locating the maxima of a density function
Automatically tunes the number of clusters
from s c i k i t s l ea rn c l u s t e r import MeanShift est imate_bandwidthbandwidth = est imate_bandwidth (X q u a n t i l e =0005)c l u s t e r i n g = MeanShift ( bandwidth=bandwidth )c l u s t e r i n g f i t (X)l a b e l s = c l u s t e r i n g labe ls_
ldquoMean shift mode seeking and clusteringrdquo Yizong Cheng IEEE Transactions on Pattern Analysis and Machine Intelligence1995
rarr Try it
Other possible alternatives Wardrsquos algorithm K-means
Exploiting information Queries and Views
Each query returns a result set (rset) A view is called on a result setrarr define the representation rules
Defining a view (short version)
class MyEnt i t iesView ( View ) __regid__ = rsquo exampleminusview rsquo
def c a l l ( s e l f ) r s e t = s e l f cw_rsetfor e n t i t y in r s e t e n t i t i e s ( )
do_whatever_you_want_ write_some_html_
you just have to plug a rset from a query
r s e t = r q l ( rsquo Any X WHERE rsquo )s e l f wview ( rsquo exampleminusview rsquo r s e t )
or within an url
httpmyapplicationrql=Any X WHERE ampvid=example-view
Pluging Scikitslearn in a view
Combine machine learning tools and database query system in purePython
Defining the view
class RSSClustering ( View ) __regid__ = rsquo rssminusc l u s t e r i n g rsquo
def c a l l ( s e l f ) Get the r e s u l t s set and create the datar s e t = s e l f cw_rsetoccurrence_matr ix = OccurrenceMatr ix ( )occurrence_matr ix cons t ruc t_ma t r i x ( r s e t = r s e t ) Scale the dataX = scale ( occurrence_matr ix ax is =0 w i th_s td=True copy=True ) Compute bandwith f o r c l u s t e r i n gbandwidth = est imate_bandwidth (X) C lus te r i ngc l u s t e r = MeanShift ( bandwidth=bandwidth )c l u s t e r f i t (X)l a b e l s = c l u s t e r l abe ls_ Perform some HTML render ing
rarr Try it
A new approach for querying information from RSS
All musical artists in the news:

    rql('DISTINCT Any E, R WHERE E appears_in_rss R, '
        'E has_type T, T label "musical artist"')

All living office holder persons in the news:

    rql('DISTINCT Any E WHERE E appears_in_rss R, '
        'E has_type T, T label "office holder", '
        'E has_subject C, C label "Living people"')

All news that talk about Barack Obama and any scientist:

    rql('DISTINCT Any R WHERE E1 label "Barack Obama", '
        'E1 appears_in_rss R, E2 appears_in_rss R, '
        'E2 has_type T, T label "scientist"')

All news that talk about a drug:

    rql('Any X, R WHERE X appears_in_rss R, '
        'X has_type T, T label "drug"')

Try it with an XML view or a thumbnail view.
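To show what the "Barack Obama and any scientist" query computes, here is a plain-Python sketch over in-memory relations. The relation names mirror the queries above, but the facts and the helper `news_with_entity_and_type` are made up for illustration; this is not how RQL is executed.

```python
# Hypothetical facts mirroring the relations used in the queries above.
appears_in_rss = {
    ("Barack Obama", "rss1"), ("Scientist A", "rss1"),
    ("Barack Obama", "rss2"), ("Scientist A", "rss3"),
}
has_type = {
    ("Scientist A", "scientist"),
    ("Barack Obama", "office holder"),
}


def news_with_entity_and_type(label, type_label):
    """News items mentioning `label` together with any entity of `type_label`."""
    # R such that E1 label <label>, E1 appears_in_rss R
    with_label = {r for e, r in appears_in_rss if e == label}
    # E2 such that E2 has_type T, T label <type_label>
    typed = {e for e, t in has_type if t == type_label}
    # R such that E2 appears_in_rss R
    with_type = {r for e, r in appears_in_rss if e in typed}
    # Both constraints share the variable R, hence the intersection.
    return sorted(with_label & with_type)


result = news_with_entity_and_type("Barack Obama", "scientist")
print(result)
```

Each comma-separated clause of the query narrows the same variable `R`, which is why the sketch ends with a set intersection.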
Visualisation: Mapping information
View:

    class EntityMapView(View):
        __regid__ = 'map'

        def call(self):
            rset = self.cw_rset
            self.init_map()
            for entity in rset.entities():
                self.add_marker(entity.latitude, entity.longitude, entity.dc_title())
            self.center_and_zoom(0, 0, 1, 5)
            self.finish_map()

Based on mapstraction (JavaScript): http://mapstraction.com
✓ Automatically locate information from RSS news
Try it
Conclusion
Cubicweb and Scikits-learn
Efficient and easy-to-use tools for data storing, querying and mining
Easy to plug together: only Python tools, all Open Source
Using Dbpedia makes it possible to extract very few, highly relevant features
✓ Decrease the dimensionality of the data
✓ Link features to millions of pages of information
A new semantic way for querying information
✓ Simple information queries using RQL expressions
✓ Use Dbpedia types and categories to refine the selection
Future Improvements
Named Entity Recognition
→ use disambiguation links, more refined regular expressions
→ add new databases (MusicBrainz, Diseasome, ...)
→ closely follow Wikipedia with the Dbpedia live update
Motorola Mobility (11:46, 15 August 2011): 'On the 15 August Google announced that it agreed to acquire the company'
RSS news analysis
→ explore new algorithms: bi-clustering
→ add new data sources: Twitter, blogs
→ 'from scikits.learn predict __future__' → use matrix completion to predict new edges in the correlation graph
Thank you for your attention
Questions?