Mapping French Open Data actors on the web with Common Crawl
![Page 1: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/1.jpg)
Mapping French Open Data actors on the web with Common Crawl
[email protected] @glebourg
![Page 2: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/2.jpg)
Mining the Web at Data Publica
Different needs, different techniques
● Scraping
● Focused crawling
● Prospective crawling
![Page 3: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/3.jpg)
Mining the Web at Data Publica
Scraping
● Identified resources
● Configured extractors
● Structured content
● Not scalable
![Page 4: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/4.jpg)
Mining the Web at Data Publica
Focused crawling
● Identified entities
● Fuzzy extraction
● Structured content using text-mining
● Scalable
● Useful to get meta information on known entities
![Page 5: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/5.jpg)
Mining the Web at Data Publica
Prospective crawling
● No starting point
● Fuzzy extraction
● Structured content using text-mining
● Very hard to scale
● Heavy resources needed: CPU, RAM, HDD
Using a third party makes your life easier!
![Page 6: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/6.jpg)
From a crawl to a map
Goal: build a map of the French open data actors on the web
● As a graph
● Showing websites
![Page 7: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/7.jpg)
From a crawl to a map
Using Common Crawl
● Large web crawl archives, fully accessible
● Good coverage of the French web
● Easy access via AWS / MapReduce jobs
![Page 8: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/8.jpg)
From a crawl to a map
Working on the French web
● The .fr TLD is not a relevant criterion for detection
● Detecting page language
● Giving websites a "frenchness" score
  ○ Sw = number of French pages / total number of pages
  ○ Cutoff chosen manually by testing on French websites
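The "frenchness" score above can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: the stopword-based `looks_french` detector, the `site_scores` helper, the sample pages and the 0.5 cutoff are all assumptions made for the example. A real job would use a proper language detector and run over Common Crawl records.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical, tiny stopword list standing in for a real language detector.
FRENCH_STOPWORDS = {"le", "la", "les", "des", "et", "est", "une", "dans", "pour", "que"}


def looks_french(text, min_ratio=0.2):
    """Crude language check: share of words that are French stopwords."""
    words = text.lower().split()
    if not words:
        return False
    return sum(w in FRENCH_STOPWORDS for w in words) / len(words) >= min_ratio


def site_scores(pages, is_match):
    """Return {host: matching pages / total pages}, i.e. Sw per website."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for url, text in pages:
        host = urlparse(url).netloc.lower()
        total[host] += 1
        if is_match(text):
            hits[host] += 1
    return {host: hits[host] / total[host] for host in total}


# Toy input: (url, page text) pairs, invented for the example.
pages = [
    ("http://example.org/a", "les données sont dans le portail pour la ville"),
    ("http://example.org/b", "the data is published on the portal"),
    ("http://example.com/x", "open data for everyone"),
]

scores = site_scores(pages, looks_french)
CUTOFF = 0.5  # assumed value; the talk only says the cutoff was chosen manually
french_sites = {host for host, s in scores.items() if s >= CUTOFF}
```

Scoring at the website level rather than the page level lets a multilingual site pass the filter as long as enough of its pages are in French.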
![Page 9: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/9.jpg)
From a crawl to a map
Working on Open Data websites
● Building an Open Data "vocabulary"
● Detecting whether a page talks about Open Data
● Giving websites an "opendataness" score
  ○ Sw = number of Open Data pages / total number of pages
  ○ Cutoff chosen manually by testing on Open Data websites
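A vocabulary-based "opendataness" score could look like the sketch below. The term list, the `min_terms` threshold and the sample pages are assumptions for illustration; the actual vocabulary built at Data Publica is not given in the slides.

```python
import re

# Hypothetical vocabulary; the talk does not list the terms actually used.
OPEN_DATA_VOCABULARY = (
    "open data",
    "opendata",
    "données ouvertes",
    "données publiques",
    "réutilisation des données",
)


def mentions_open_data(text, min_terms=1):
    """True if the page contains at least min_terms distinct vocabulary terms."""
    normalized = re.sub(r"\s+", " ", text.lower())
    found = {term for term in OPEN_DATA_VOCABULARY if term in normalized}
    return len(found) >= min_terms


def opendataness(page_texts):
    """Sw = number of Open Data pages / total number of pages for one website."""
    if not page_texts:
        return 0.0
    return sum(mentions_open_data(t) for t in page_texts) / len(page_texts)


# Toy website with four pages, invented for the example.
site_pages = [
    "Le portail Open Data de la ville",
    "Mentions légales",
    "Réutilisation des données   publiques",
    "Contact",
]
score = opendataness(site_pages)
```

Raising `min_terms` makes page detection stricter, which trades recall for precision before the site-level cutoff is applied.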
![Page 10: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/10.jpg)
From a crawl to a map
Building graph
● Inside our subset
  ○ Inlinks
  ○ Outlinks
● Generating two files
  ○ nodes.csv (list of websites with an id)
  ○ edges.csv (directed links between websites)
[Diagram: node A with two inlinks and one outlink]
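Generating the two files can be sketched as below, assuming the links have already been extracted as (source URL, target URL) pairs. The `build_graph` and `write_csv` helpers, the sample links and the `Id`/`Label`/`Source`/`Target` column names are assumptions; the column names follow what Gephi's spreadsheet import commonly expects, but the slides do not state the exact file layout.

```python
import csv
import os
import tempfile
from urllib.parse import urlparse


def build_graph(links, kept_sites):
    """Keep only links whose two ends are websites of the selected subset."""
    node_ids = {host: i for i, host in enumerate(sorted(kept_sites))}
    edges = set()
    for src_url, dst_url in links:
        src = urlparse(src_url).netloc.lower()
        dst = urlparse(dst_url).netloc.lower()
        # Drop links leaving the subset and links from a site to itself.
        if src in node_ids and dst in node_ids and src != dst:
            edges.add((node_ids[src], node_ids[dst]))
    return node_ids, sorted(edges)


def write_csv(path, header, rows):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


# Toy page-level links, invented for the example.
links = [
    ("http://a.example/page1", "http://b.example/"),
    ("http://a.example/page2", "http://b.example/about"),
    ("http://b.example/", "http://c.example/"),
    ("http://b.example/", "http://outside.example/"),
]
node_ids, edges = build_graph(links, {"a.example", "b.example", "c.example"})

out_dir = tempfile.mkdtemp()  # written to a temp directory to keep the sketch side-effect free
write_csv(os.path.join(out_dir, "nodes.csv"), ["Id", "Label"],
          [(i, host) for host, i in node_ids.items()])
write_csv(os.path.join(out_dir, "edges.csv"), ["Source", "Target"], edges)
```

Page-level links are collapsed into one directed edge per pair of websites, which matches a map whose nodes are websites rather than pages.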
![Page 11: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/11.jpg)
From a crawl to a map
Building graph
● Links tell a lot about websites
  ○ Authorities
  ○ Hubs
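One standard way to compute hub and authority scores from a directed link graph is the HITS algorithm. The slides do not say which method was used (the Gephi step simply sizes nodes by inlinks), so the sketch below is an illustration under that assumption, with an invented toy graph.

```python
import math


def hits(edges, iterations=50):
    """Plain HITS: a good hub links to good authorities, and vice versa."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the sites linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of the authority scores of the sites linked to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Toy directed graph of websites (source, target), invented for the example.
edges = [
    ("blog", "portal"),
    ("ngo", "portal"),
    ("company", "portal"),
    ("blog", "ngo"),
]
hub, auth = hits(edges)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this toy graph the site receiving the most links comes out as the top authority, and the site pointing at several well-linked sites comes out as the top hub.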
![Page 12: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/12.jpg)
From a crawl to a map
Visualizing graph using Gephi
● Load graph
● Spatialize graph
  ○ links between websites create "attraction", so that linked websites appear near each other
  ○ the more inlinks, the bigger the node (= authority)
  ○ categorizing websites for better understanding (one color per category)
    ■ Companies, Non-profit/blogs, Government agencies
  ○ communities can now appear!
![Page 13: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/13.jpg)
From a crawl to a map
![Page 14: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/14.jpg)
From a crawl to a map
Visualizing graph on the web
● Sigma.js
● Uses Gephi files
● Gives better interactivity
![Page 15: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/15.jpg)
Analysis
● The final graph is a good way to understand interactions between actors
  ○ Open Data was definitely initiated by a non-profit movement
  ○ Companies are beginning to work on the subject
  ○ The French state has only had sporadic initiatives so far
● This graph is to be generated again in the near future, to see changes in this ecosystem
![Page 16: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/16.jpg)
Results
● Large-scale crawl made easy
  ○ Easy to focus on mining the results instead of finding/storing the data
● Nice workflow from raw data to an understandable visualisation
● The final graph is a good way to understand interactions between actors
![Page 17: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/17.jpg)
Feedback
● Common Crawl
  ○ Common Crawl doesn't have an exhaustive crawl of the French web for now
  ○ Data is not as fresh as it could be
  ○ It is missing an index to access at least domains, and maybe pages, in O(1)
● Methodology
  ○ Opendataness scoring can set aside websites that are relevant but not focused enough on open data
![Page 18: Mapping french open data actors on the web with common crawl](https://reader033.fdocuments.in/reader033/viewer/2022050816/5492ebc8b47959604d8b46f4/html5/thumbnails/18.jpg)
Resources
● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
  ○ poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
  ○ dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation,
  ○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
  ○ Project host page