A Scalable Architecture for Extracting, Aligning, Linking...

20
A Scalable Architecture for Extracting, Aligning, Linking, and Visualizing Multi-Int Data Craig Knoblock & Pedro Szekely University of Southern California

Transcript of A Scalable Architecture for Extracting, Aligning, Linking...

Page 1: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

A Scalable Architecture for Extracting, Aligning, Linking, and

Visualizing Multi-Int Data

Craig Knoblock & Pedro Szekely

University of Southern California

Page 2: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Introduction

• Massive quantities of data available for analysis – OSING, HUMINT, SIGINT, MASINT, GEOINT, …

• Data is spread across multiple sources, multiple sites and multiple formats – Databases, text, web sites, XML, JSON, etc...

• If an analyst could exploit all of this data, it could transform analysis – Disruptive technology for analysis

University of Southern California 2

Page 3: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Solution: Domain-specific Insight Graphs

• Innovative architecture

– Extracting, aligning, linking, and visualizing massive amounts of data

– Domain-specific content from structured and unstructured sources

• State-of-the art open source software

– Open architecture with flexible APIs

– Cloud-based infrastructure (HDFS, Hadoop, ElasticSearch, etc.)

University of Southern California 3

Page 4: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Example Scenario

• Want to determine the nuclear know-how of a given country from open source data

• Analyze the universities, academics, publications, reports, articles within the country

University of Southern California 4

Page 5: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Scenario Results

• Exploit the data available from – Web pages,

publications, articles, etc.

• Produce a knowledge graph – Key people and

connections – Technical

capabilities and how they have changed over time

University of Southern California 5

Page 6: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

DIG Pipeline

• Crawling

• Extracting

• Cleaning

• Integration

• Computing simlarity

• Entity resolution

• Graph construction

• Query, analysis, and visualization

University of Southern California 6

Page 7: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Crawling

• Challenge: how to crawl just the relevant pages • Approach:

– Uses the Apache Nutch framework for Web pages – Uses Karma to extract pages from the deep Web

University of Southern California 7

Page 8: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Extracting

• Need to produce a structured representation for indexing and linking

• Highly configurable architecture for extractors – Learning of landmark

extractors for structured data – Trainable CRF-based

extractors for unstructured data

– Uses Mechanical Turk to crowd source training data

University of Southern California 8

Page 9: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Cleaning

• Cleaning and normalization to support analysis and linking

– Visualization showing data distribution

– Learned transformations from examples

– Cleaning programs written in Python

University of Southern California 9

Page 10: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Integration

• Need to align the data across extracted data and structured sources

• Performed using a data integration tool called Karma • Karma maps arbitrary sources into a shared domain vocabulary

(schema alignment)

• Uses machine learning to minimize user effort

University of Southern California 10

Page 11: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Integration Using Karma

University of Southern California 11

Page 12: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Similarity

• Computes similarity across text fields and images

– Image similarity done using DeepSentiBank

– Text similarity done using Minhash/LSH

University of Southern California 12

Page 13: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Entity Resolution

• Finds matching entities

• Reference source

– Match against source to disambiguate entities

– E.g., geonames for locations

• No reference source

– Combine entities by considering the similarity across multiple fields

University of Southern California 13

Page 14: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Graph Construction

• Data is integrated into a graph that can be queries and analyzed

– Data stored in HDFS

– Data represented in a common language JSON-LD

– Represented using a common terminology

University of Southern California 14

Page 15: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Query, Analysis and Visualization

• Challenge: support efficient querying against the graph

• Employ ElasticSearch to provide keyword querying, faceted browsing, and aggregation queries

University of Southern California 15

Page 16: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Query, Analysis & Visualization

• Visualization interface that provides faceted queries, timeslines, maps, etc.

University of Southern California 16

Page 17: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Discussion

• Technology that can provide dramatic new insights from data that is already available

• Applies to a wide range of problems – Determining the nuclear know-how of a given country

• Technologies, key scientists, relevant organizations

– Combating human trafficking – Understanding trends in technical areas

• E.g., Material Science

– Analyzing the competitive landscape of companies – and many other domains with massive quantities of

data

University of Southern California 17

Page 18: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

USC DIG Team

University of Southern California 18

Page 19: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Acknowledgements

• Collaborators

– Next Century Technologies

– InferLink Inc.

– JPL

– Columbia University

• Sponsor

– DARPA

• AFRL contract number FA8750-14-C-0240

University of Southern California 19

Page 20: A Scalable Architecture for Extracting, Aligning, Linking ...usc-isi-i2.github.io/slides/knoblock15-spie-slides.pdf · A Scalable Architecture for Extracting, Aligning, Linking, and

Thanks!

• More information:

– Homepage

• isi.edu/~knoblock

– DIG

• usc-isi-i2.github.io/dig

– Karma

• usc-isi-i2.github.io/karma

University of Southern California 20