Primary Research Team & Capabilities
11 October 2013 1
Dept. of Parallel and Distributed Computing
Research and Development Areas:
– Large-scale HPCN, Grid and MapReduce applications
– Intelligent and Knowledge oriented Technologies
Experience from IST:
– 3 projects in FP5: ANFAS, CrossGrid, Pellucid
– 6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
– 4 projects in FP7: Commius, Admire, Secricom, EGEE III
Several National Projects (SPVV, VEGA, APVT)
IKT Group Focus:
– Information Processing (Large Scale)
– Graph Processing
– Information Extraction and Retrieval
– Semantic Web
– Knowledge oriented Technologies
– Parallel and Distributed Information Processing
Solutions:
– SGDB: Simple Graph Database
– gSemSearch: Graph based Semantic Search
– Ontea: Pattern-based Semantic Annotation
– ACoMA: KM tool in Email
– EMBET: Recommendation System
– Experts on MapReduce and IR (Nutch, Solr, Lucene)
Director & leader of PDC: Dr. Ladislav Hluchý
URL: http://ikt.ui.sav.sk
Towards Entity Search
• Current approaches
– Confirmed human knowledge
– Google Knowledge Graph
– Facebook Graph Search
• Data sets available
– Wikipedia
– DBpedia (111 languages)
– Freebase
– Linked Data cloud
• Our approach
– Unique mix of skills:
  • IR, Semantic Web, Graphs and Networks
– Networks, text, metadata
– Graph algorithms
– Information Retrieval techniques
– Anchor texts: aliases, properties, types
Entity Search Applications
https://www.linkedin.com/today/post/article/20130805134105-50510-search-what-s-cooking-in-the-lab
http://www.siliconrepublic.com/strategy/item/31182-global-enterprise-search-ma
• Online Advertising
– Query Categorization
– Keyword Extension
• Business Intelligence
– Enterprise Search
– Knowledge Management
– Text analytics
• Multilingual short text categorization
– Based on Wikipedia language versions, DBpedia, Freebase
– Query Categorization
– Social media (Twitter) categorization, analysis
• Security Domain
– Information Leakage prevention
– Categorization
Large scale Text and Graph data processing
Core Technology
• Web crawling
– Nutch + plugins
• Full text indexing and search
– Lucene, Solr
• Information Extraction
– Ontea, GATE
• All of the above at large scale
– Hadoop, S4
• Graph processing and querying
– Simple Graph Database (SGDB)
– gSemSearch
– Neo4j
– Blueprints
Underlined are the technologies developed by IISAS
Relation to Business Intelligence
• Old BI approaches
– Data integration from RDBMS
– Data warehouses
– OLAP
– …
• New BI approaches
– Data structures other than RDBMS: Networks, Semantics
  • Networks/Graphs in Telecom, Social Networks, Transactions, Linked Data …
  • NoSQL: key-value (Tokyo Cabinet), column stores (HBase), graph databases, RDF(S)
– In-memory computing
– Commodity PC solutions for large data:
  • MapReduce style: Hadoop; Pregel style: Giraph, Hama
– Big unstructured data processing (on Hadoop):
  • Sentiment analysis, topic detection, named entity detection
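To make the "MapReduce style" bullet concrete, here is a minimal in-process sketch of the map, shuffle and reduce phases for a token count. It does not use Hadoop; the function names and the two sample documents are assumptions made for illustration only.

```python
from collections import defaultdict

def map_phase(doc):
    """Mapper: emit a (token, 1) pair for every token in one document."""
    for token in doc.lower().split():
        yield token, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one token."""
    return key, sum(values)

# Hypothetical sample input, not data from the slides.
docs = ["graph data graph search", "entity search"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)
```

On a real cluster the mapper and reducer would run on different machines and the shuffle would be done by the framework; the sketch only shows the shape of the computation.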
Ontea: Information Extraction Tool
• Regex patterns
• Gazetteers
• Results
– Key-value pairs
– Structured into trees and graphs
• Transformers, Configuration
• Automatic loading of extractors
• Visual Annotation Tool
• Integration with external tools
– GATE, stemmers, Hadoop …
• Multilingual tests
– English, Slovak, Spanish, Italian
http://ontea.sf.net
Figure panels: Text with annotations; Tree of annotations; Network/Graph of annotations
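The combination of regex patterns and gazetteers producing key-value pairs can be sketched as follows. This is not Ontea's code or API: the patterns, the gazetteer entries, the `annotate` function and the sample sentence are all hypothetical, chosen only to show the idea.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical regex extractors: annotation type -> compiled pattern.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b"),
}

# Hypothetical gazetteer: known surface form -> annotation type.
GAZETTEER = {"Bratislava": "city", "Slovakia": "country"}

def annotate(text):
    """Return (type, value, start, end) annotations ordered by position."""
    found = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((key, match.group(0), match.start(), match.end()))
    for term, key in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            found.append((key, term, match.start(), match.end()))
    return sorted(found, key=lambda a: a[2])

text = "Contact jan.novak@example.org in Bratislava, Slovakia by 11 October 2013."
annotations = annotate(text)
pairs = [(key, value) for key, value, _, _ in annotations]
for key, value in pairs:
    print(key, "=", value)
```

The character offsets kept in each annotation are what would let the flat key-value pairs be nested into trees or linked into a graph in a later step.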
Named Entity Recognition (NER)
• Combination of existing NER tools
– ANNIE (GATE), Apache OpenNLP
– Illinois NER, Illinois Wikifier
– LingPipe, Open Calais
– Stanford NER, WikiMiner
– Miscinator
• Machine Learning
– Decision tree models
• Took second place at MSM 2013 among 17 participating teams worldwide, missing first place by 1%
http://ikt.ui.sav.sk/index.php?n=Main.IEChallenge2013
gSemSearch: Graph based Semantic Search
• Entity relation search in semantic networks/graphs
• Search, Navigation, Data Interaction
• Aiming at data integration of
– Structured data (relational data, LinkedData)
– Unstructured data (text, documents, communication)
• Applications:
– Email, Web, text documents, LinkedData
http://ikt.ui.sav.sk/esns/
SemSets: Semantic Search
• Answering list-type questions, e.g. "astronauts who walked on the Moon"
• Wikipedia as text and network/graph
• Text: IR methods, Lucene based
• Graph/network: spreading activation and SemSets
• Winning solution in the Semantic Search Challenge 2011
1. Eugene_Cernan
2. Alan_Bean
3. David_Scott
4. John_Young_(astronaut)
5. Neil_Armstrong
6. Pete_Conrad
7. Harrison_Schmitt
8. Alan_Shepard
9. Charles_Duke
10. Buzz_Aldrin
11. James_Irwin
12. Edgar_Mitchell
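Spreading activation on a graph can be illustrated with a small sketch. This is a generic textbook variant, not the SemSets implementation: the `decay` and `steps` parameters, the even split of energy among neighbours, and the toy graph are assumptions.

```python
from collections import defaultdict

def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate activation from seed nodes along outgoing edges.

    graph: node -> list of neighbour nodes (directed edges)
    seeds: node -> initial activation
    Each step, a node passes decay * energy, split evenly, to its neighbours.
    """
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = decay * energy / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)

# Toy graph for illustration only: a query node linked to two mission pages,
# each linked to the two astronauts who walked on the Moon on that mission.
graph = {
    "query": ["Apollo_11", "Apollo_12"],
    "Apollo_11": ["Neil_Armstrong", "Buzz_Aldrin"],
    "Apollo_12": ["Pete_Conrad", "Alan_Bean"],
}
result = spread_activation(graph, {"query": 1.0})
ranked = sorted(result.items(), key=lambda item: -item[1])
print(ranked)
```

Nodes that end up with high activation are candidate answers; in a real setting the scores would be combined with the text-based ranking rather than used alone.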
SGDB: Simple Graph Database
• Storage for graphs
• Optimized for graph traversal and spreading activation
• Faster than Neo4j for graph traversal operations
• Supports Blueprints API
• https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
• Graph Database Benchmarks
– Graph Traversal Benchmark for Graph Databases
– http://ups.savba.sk/~marek/gbench.html
– Blueprints API: any compliant graph database can be tested
Source: http://geza.kzoo.edu/bionet/html/scalefree.html
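A traversal benchmark of the kind mentioned for SGDB typically times a bounded breadth-first expansion from a start node. The sketch below shows that operation on an in-memory adjacency list; it is not the linked benchmark and does not use SGDB, Neo4j or Blueprints. The function names and the ring-shaped test graph are assumptions.

```python
import time
from collections import deque

def k_hop_neighbourhood(adjacency, start, max_hops):
    """Breadth-first traversal returning all nodes within max_hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return seen

def time_traversal(adjacency, start, max_hops, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        k_hop_neighbourhood(adjacency, start, max_hops)
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical test graph: a ring of 1000 nodes, each linked to both neighbours.
n = 1000
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
reached = k_hop_neighbourhood(ring, 0, 3)
elapsed = time_traversal(ring, 0, 3)
print(len(reached), "nodes reached, best time", elapsed, "s")
```

A meaningful comparison between graph stores would run the same traversal through a common interface on much larger, scale-free graphs with cold and warm caches; the ring is only there to keep the example easy to check.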
Community Detection in Complex Networks
• Task: Identify densely connected subgraphs in complex networks
• Community collapsing problem
• SCCD
– Near-linear time complexity
– Avoids the community collapsing problem (to a certain extent)
• KDD paper
– Re-weighting approach
– Better results on real networks
Marek Ciglan, Kjetil Nørvåg: Fast detection of size-constrained communities in large networks, Proceedings of WISE'10, LNCS Volume 6488/2010
Marek Ciglan, Michal Laclavík and Kjetil Nørvåg: On Community Detection in Real-World Networks and the Importance of Degree Assortativity, 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2013
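The community collapsing problem and the effect of a size constraint can be illustrated with a toy example. The code below is not the SCCD algorithm from the cited paper: it is plain label propagation with a hypothetical `max_size` cap and a deterministic tie-break (keep the current label, else take the smallest), both assumed here so the run is reproducible.

```python
from collections import Counter

def propagate_labels(adjacency, max_size=None, max_rounds=20):
    """Label propagation; with max_size set, full communities accept no new nodes."""
    labels = {node: node for node in adjacency}
    sizes = Counter(labels.values())
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            current = labels[node]
            counts = Counter(labels[nb] for nb in adjacency[node])
            # A label is allowed if it is the node's own or its community has room.
            allowed = {
                label: count for label, count in counts.items()
                if label == current or max_size is None or sizes[label] < max_size
            }
            if not allowed:
                continue
            best = max(allowed.values())
            candidates = sorted(l for l, c in allowed.items() if c == best)
            new = current if current in candidates else candidates[0]
            if new != current:
                sizes[current] -= 1
                sizes[new] += 1
                labels[node] = new
                changed = True
        if not changed:
            break
    return labels

# Toy network: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
adjacency = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
uncapped = propagate_labels(adjacency)
capped = propagate_labels(adjacency, max_size=3)
print("uncapped:", uncapped)
print("capped:  ", capped)
```

With this update order the uncapped run merges both triangles into one community, while the capped run keeps them apart. The outcome depends on the chosen order and tie-break, so it shows the failure mode rather than measuring how often it occurs.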
Future Direction: Entity Search in Large Graph Data
• Motivation
– Graph/network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
– Text can also be converted to a graph.
– Interconnecting graph data and searching for relations is crucial.
• Approach
– Forming semantic trees and graphs from text, web, communication, databases and LinkedData
– User interaction with graph data in order to achieve integration and data cleansing
– Users will do it if their effort has an immediate impact on search results