Semantic Illegal Content Hunter - SICH project · Semantic Illegal Content Hunter ... COGITO...
Co-funded by the Prevention of and Fight against Crime Programme of the European Union
FINAL CONFERENCE
SICH PROJECT: Semantic Illegal Content Hunter
Luigi Laura, Sapienza Università di Roma
Rome, 20 November 2015
Building a search engine optimized
for finding illegal content.
Luigi Laura
Sapienza Università di Roma
SUMMARY
Challenges in the development of SICH:
• A comparison with traditional web search
and search engines
• A comparison with traditional
“Adversarial Web Search”
• The main components of SICH
• Future developments
EVOLUTION OF WEB SEARCH ENGINES
First Generation
Use mainly textual information
Second Generation
Need a good global ranking:
hyperlink analysis (Google’s PageRank)
Third Generation
The need behind the query
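As an illustration of the hyperlink analysis used by second-generation engines, here is a minimal sketch of a PageRank-style power iteration. The toy graph, the damping factor 0.85 and the iteration count are assumptions chosen for the example; nothing here is taken from SICH.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its rank to each page it links to.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked by three pages, "d" by none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
```

A page scores high when many well-ranked pages link to it, which is why this kind of ranking rewards pages that want to be found.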
OPEN SOURCE SEs
There are several good
open source SEs and SE
components…
Why didn’t we use them?
IN WEB SEARCH…
… everybody wants to be found, and
ranked high!!
Here we play a different game: the pages
we are looking for do not want to be found
(by us!!!)
Adversarial Web Search?
ADVERSARIAL
WEB SEARCH?
Adversarial Web Search
is by now a mature
field of research,
but it is still not what we
need!
SEARCH ENGINE COMPONENTS
THE CRAWLER
The Crawler is the component in charge of
collecting data from the Internet.
There are several problems to solve:
• one machine is not enough, we need
many…
• ... how do we share the load between
the machines?
• Which pages do we visit first?
• Freshness of the results
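Two of the problems above, sharing the load and choosing which pages to visit first, can be sketched as follows. This is a generic illustration, not the SICH design: assigning URLs to machines by hashing the host name, the `Frontier` class and the priority values are all assumptions made for the example. Freshness (when to revisit a page) is not covered.

```python
import hashlib
import heapq
from urllib.parse import urlsplit


def assign_machine(url, num_machines):
    """Map a URL to a crawler machine by hashing its host name.

    All pages of one host land on the same machine, so per-host
    politeness limits can be enforced locally.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


class Frontier:
    """Queue of URLs to visit; a lower priority value is visited first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal priorities pop in insertion order

    def add(self, url, priority):
        """Queue a URL once; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Return the queued URL with the lowest priority value."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


# Hypothetical URLs and priorities, for illustration only.
urls = ["http://example.org/a", "http://example.org/b", "http://example.net/"]
machines = [assign_machine(u, 4) for u in urls]

frontier = Frontier()
frontier.add("http://example.org/b", priority=2)
frontier.add("http://example.org/a", priority=1)
frontier.add("http://example.org/a", priority=0)  # ignored: already queued
first = frontier.pop()
```

Hashing the host rather than the full URL keeps each site on a single machine; the price is an uneven load when a few hosts are much larger than the rest.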
THE SICH CRAWLER
Inside SICH we used ESCrawler, the Expert System
Crawler. ESCrawler can:
• download documents from the Web;
• search and extract parts of documents;
• generate documents composed from different
documents;
• filter non-core parts of documents;
• populate HTML forms and retrieve the results;
• classify documents or parts of them by
calculating a hash signature.
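The slides do not say how ESCrawler computes its hash signature. As a hedged sketch of the general idea only, the example below normalizes a text fragment, hashes it with SHA-256, and uses the digest to recognize content that has already been labelled. The normalization rule, the `SignatureIndex` class and the labels are assumptions, not the ESCrawler API.

```python
import hashlib


def signature(text):
    """Hash a text fragment after normalizing case and whitespace (assumed rule)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SignatureIndex:
    """Hypothetical lookup table from hash signature to a class label."""

    def __init__(self):
        self._labels = {}

    def register(self, text, label):
        """Remember the label of a known document or fragment."""
        self._labels[signature(text)] = label

    def lookup(self, text):
        """Return the label of previously registered content, or None."""
        return self._labels.get(signature(text))


index = SignatureIndex()
index.register("Known   Sample\nText", "known-content")
same = index.lookup("known sample text")
other = index.lookup("completely different text")
```

An exact hash like this only matches content that is identical after normalization; recognizing near-duplicates would need a similarity-preserving signature instead.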
THE INDEXER
The indexer in the SICH engine is
developed using Expert System’s
COGITO Semantic Technology
THE INTERFACE
In a traditional web search engine the
User Interface is very simple…
both the query... and the results!
THE SICH INTERFACE
It allows:
• linguistic search
• spatial search
• events-based search
• search based on corpora selection
SUMMING UP: THE SICH SYSTEM
PROJECT SICH: SEMANTIC ILLEGAL CONTENT HUNTER
With the financial support of the Prevention of and Fight Against Crime Programme, ISEC 2012
European Commission, Directorate-General Home Affairs (HOME/2012/ISEC/AG/INT/4000003863)
The ESCrawler is in charge of:
• downloading documents from the Web;
• searching and extracting parts of documents;
• generating documents composed from different documents;
• filtering non-core parts of documents;
• populating HTML forms and retrieving the results;
• classifying documents or parts of them by calculating a hash signature.
Discovery & Categorization by:
• Parser
• Semantic Network
• Lexicon
• Knowledge Base
• Memory
Semantic Index
Conceptual Map
The SICH interface allows:
• linguistic search
• spatial search
• events-based search
• search based on corpora selection
Semantic Search; Text Mining; Automatic Categorization; Data Intelligence.
Contacts
Thank you
Luigi Laura
Sapienza Università di Roma