Post on 05-Dec-2014
The Open Source document analysis platform
Or, how IKANOW uses MongoDB to help organizations solve really big problems
Agenda
• What is Document Analysis?
• The Infinit.e Solution
– Infinit.e’s Architecture
– Why and How we use MongoDB
• Analyzing #MongoDC
• Questions
This is what Big Data Looks Like
Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
What is Document Analysis?
"Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set."
Source: http://www.text-tech.com/docanalysis/definition.html
Document Analysis
• Common document source formats:
RSS JSON XML
HTML PDF TXT
RTF Word PPT
Multimedia Files RDBMS Records ETC.
Document Analysis
• The goal is to:
– Extract Entities (people, places, things)
– Create Associations between entities (in the form of noun-verb-noun), e.g.:
    • John Doe lives in Washington, D.C.
    • John Doe is married to Jane Doe
    • John Doe is a Virgo
    • John Doe traveled to Mexico on July 6th, 2011
• And…
Document Analysis
• Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.
Who: people, organizations, facilities, company
What: events, summaries, facts, themes
When: past, present, future dates
Where: city, state, country, coordinate
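One way to picture such a unified structure is sketched below. The class and field names (`Entity`, `Association`, `AnalyzedDocument`, `by_dimension`) are hypothetical and chosen only for illustration; they are not Infinit.e's actual object model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    name: str
    type: str        # e.g. "Person", "City", "Date"
    dimension: str   # one of "Who", "What", "When", "Where"


@dataclass
class Association:
    # A noun-verb-noun triple linking two entities.
    subject: str
    verb: str
    obj: str


@dataclass
class AnalyzedDocument:
    title: str
    entities: List[Entity] = field(default_factory=list)
    associations: List[Association] = field(default_factory=list)

    def by_dimension(self, dimension: str) -> List[str]:
        """Return the names of all entities in one dimension."""
        return [e.name for e in self.entities if e.dimension == dimension]


# Hypothetical document built from the John Doe examples.
doc = AnalyzedDocument(
    title="John Doe profile",
    entities=[
        Entity("John Doe", "Person", "Who"),
        Entity("Washington, D.C.", "City", "Where"),
        Entity("July 6th, 2011", "Date", "When"),
    ],
    associations=[
        Association("John Doe", "lives in", "Washington, D.C."),
        Association("John Doe", "traveled to", "Mexico"),
    ],
)
```

Grouping entities by dimension is what lets a single structure feed timelines (When), maps (Where) and link graphs (Who/What).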
• Infinit.e is an Open Source document discovery and analysis platform that has these very cool Open Source tools lurking under the hood.
The Infinit.e Solution
github.com/ikanow/Infinit.e
The Infinit.e Solution
Infinit.e is a scalable framework for Collecting, Storing, Enriching, Retrieving, Analyzing and Visualizing Structured and Unstructured Documents
IkanMeow
Document Collection
• Infinit.e harvests documents from:
– URLs
– File Shares
– Databases
Sample RSS Document
<rss version="2.0">
  <channel>
    …
    <item>
      <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
      <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
      <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most … </description>
      <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
      <dc:creator>unknown</dc:creator>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
    …
  </channel>
</rss>
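As a rough illustration of the harvesting step, the sketch below turns RSS items into plain dictionaries using only the Python standard library. It is not Infinit.e's harvester: the function name `harvest_items` and the `publishedDate` field are assumptions, and the sample feed adds the Dublin Core namespace declaration that the excerpt above elides.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Dublin Core namespace, needed to resolve the dc:* elements.
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia</title>
      <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
      <description>Report by egyptlastminute.com CAIRO: ...</description>
      <dc:creator>unknown</dc:creator>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
  </channel>
</rss>"""


def harvest_items(rss_text):
    """Convert each <item> of an RSS feed into a dictionary."""
    root = ET.fromstring(rss_text)
    docs = []
    for item in root.iter("item"):
        date_text = item.findtext(DC + "date")
        docs.append({
            "title": item.findtext("title", default="").strip(),
            "url": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
            # Hypothetical field name; None when the feed has no dc:date.
            "publishedDate": parsedate_to_datetime(date_text) if date_text else None,
        })
    return docs


docs = harvest_items(SAMPLE_RSS)
```

Each resulting dictionary could then be stored as a document and handed to the enrichment stage.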
Full Text Source
Source Ingestion Data Flow
Document DBs and Collections
Document Metadata
• doc_metadata.metadata
{
  "_id" : ObjectId("4f93638e0cf212156d0559d2"),
  "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
  "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
  "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
  "created" : ISODate("2012-04-22T01:49:02Z"),
  "metadata" : {…},
  "associations" : […],
  "entities" : […],
  ...
}
Harvested Document Metadata
• doc_metadata.metadata.metadata
"metadata" : {
  "location" : [
    {
      "region" : "South Asia",
      "citystateprovince" : {
        "stateprovince" : "Rolpa",
        "city" : "Newang"
      },
      "country" : "Nepal"
    }
  ],
  "icn" : [ "200573487" ],
  "incidentdate" : [ "07/25/2005" ],
  "organization" : [ "Communist Party of Nepal (Maoist)/United People's Front" ],
  ...
},
Note: It is okay to laugh at this
Document Enrichment
• Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and third-party NLP APIs including:
Harvested Entities
• feature.entity
{
  "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
  "alias" : [
    "Zine el Abidine Ben Ali",
    "Zine El Abidine Ben Ali",
    "Zine el Abidine ben Ali"
  ],
  "batch_resync" : true,
  "communityId" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(143),
  "db_sync_time" : "1338751174988",
  "dimension" : "Who",
  "disambiguated_name" : "Zine El Abidine Ben Ali",
  "doccount" : 152,
  "index" : "zine el abidine ben ali/person",
  "totalfreq" : 353,
  "type" : "Person"
}
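Judging only from the sample record, the `index` value looks like the lower-cased disambiguated name joined to the lower-cased type. The sketch below encodes that guess and uses it to map every alias to one entity key. Both helper names are hypothetical, and the real Infinit.e indexing rule may differ.

```python
def entity_index(disambiguated_name, entity_type):
    """Build a key such as 'zine el abidine ben ali/person' (assumed format)."""
    return f"{disambiguated_name.lower()}/{entity_type.lower()}"


def build_alias_table(entities):
    """Map each lower-cased alias to the index key of its entity."""
    table = {}
    for ent in entities:
        idx = entity_index(ent["disambiguated_name"], ent["type"])
        for alias in ent["alias"]:
            table[alias.lower()] = idx
    return table


entity = {
    "alias": [
        "Zine el Abidine Ben Ali",
        "Zine El Abidine Ben Ali",
        "Zine el Abidine ben Ali",
    ],
    "disambiguated_name": "Zine El Abidine Ben Ali",
    "type": "Person",
}
aliases = build_alias_table([entity])
```

Here the three spellings differ only in capitalization, so they collapse to a single lower-cased alias pointing at the same entity.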
Harvested Entities
Harvested Associations
• feature.association
{
  "_id" : ObjectId("4f9189d48baf188282a1ca24"),
  "assoc_type" : "Fact",
  "communityId" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(70),
  "db_sync_time" : "1338491609281",
  "doccount" : NumberLong(73),
  "entity1" : [
    "zine el abidine ben ali",
    "zine el abidine ben ali/person"
  ],
  "entity1_index" : "zine el abidine ben ali/person",
  "entity2" : ["president", "president/position"],
  "entity2_index" : "president/position",
  "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
  "verb" : ["career", "current", "past"],
  "verb_category" : "career"
}
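A counter like `doccount` could plausibly be derived by counting how many documents mention each entity-verb-entity combination. The sketch below does this in memory, assuming each document carries an `associations` list with the three fields shown. How Infinit.e actually maintains these counts is not stated in the slides.

```python
from collections import Counter


def count_associations(documents):
    """Count the number of documents in which each association appears."""
    counts = Counter()
    for doc in documents:
        # A set ensures an association is counted once per document.
        seen = set()
        for assoc in doc.get("associations", []):
            seen.add((
                assoc["entity1_index"],
                assoc["verb_category"],
                assoc["entity2_index"],
            ))
        counts.update(seen)
    return counts


# Hypothetical input documents.
career = {
    "entity1_index": "zine el abidine ben ali/person",
    "verb_category": "career",
    "entity2_index": "president/position",
}
documents = [
    {"associations": [career, career]},
    {"associations": [career]},
    {"associations": []},
]
assoc_counts = count_associations(documents)
```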
Harvested Associations
Geolocation of Entities/Events
• feature.geo
{
  "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
  "search_field" : "cairo",
  "country" : "Egypt",
  "country_code" : "EG",
  "city" : "cairo",
  "region" : "Al Qahirah",
  "region_code" : "EG11",
  "population" : 7734602,
  "latitude" : "30.05",
  "longitude" : "31.25",
  "geoindex" : {
    "lat" : 30.05,
    "lon" : 31.25
  }
}
Note: MongoDB 2d Index
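The sketch below shows the lookup idea only: a place name is normalized to match `search_field` and resolved to coordinates. It uses an in-memory list in place of the feature.geo collection, and the function name `geolocate` is hypothetical. In a real deployment this would presumably be a MongoDB query against the indexed collection.

```python
# In-memory stand-in for the feature.geo collection.
GAZETTEER = [
    {
        "search_field": "cairo",
        "country": "Egypt",
        "country_code": "EG",
        "geoindex": {"lat": 30.05, "lon": 31.25},
    },
]


def geolocate(place_name, gazetteer):
    """Return (lat, lon) for a place name, or None when it is unknown."""
    key = place_name.strip().lower()
    for record in gazetteer:
        if record["search_field"] == key:
            return record["geoindex"]["lat"], record["geoindex"]["lon"]
    return None
```

For example, `geolocate(" Cairo ", GAZETTEER)` resolves to the coordinates stored in the sample record, while an unknown name yields `None`.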
Geolocation of Entities/Events
Who, What, Where and When
Why MongoDB? – Reason #1
Document-Oriented Storage
• MongoDB’s document-oriented storage (i.e. schema-less) is perfectly suited to the data design requirements of a system that needs to ingest a wide variety of structured and unstructured document formats and normalize them into one unified, semi-structured format
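A minimal sketch of that normalization, assuming a small set of common top-level fields and a free-form `metadata` bucket for everything else. The `normalize` function and its `field_map` argument are hypothetical, not part of Infinit.e.

```python
COMMON_FIELDS = ("title", "url", "description")


def normalize(raw, field_map):
    """Map a source record onto common fields; keep the rest under 'metadata'."""
    doc = {"metadata": {}}
    for source_key, value in raw.items():
        target = field_map.get(source_key, source_key)
        if target in COMMON_FIELDS:
            doc[target] = value
        else:
            # Source-specific fields are stored as lists, as in the samples.
            doc["metadata"].setdefault(target, []).append(value)
    return doc


# Two differently shaped inputs (hypothetical values).
rss_item = {"title": "T", "link": "http://example.com/a", "description": "D"}
db_record = {"headline": "X", "icn": "200573487", "incidentdate": "07/25/2005"}

rss_doc = normalize(rss_item, {"link": "url"})
db_doc = normalize(db_record, {"headline": "title"})
```

Because the store does not enforce a schema, both results can sit in the same collection even though their `metadata` contents differ.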
Why MongoDB? – Reason #2
JSON
• The standard language of open document analysis
– JSON is a common interchange format supported by tools like elasticsearch and SaaS NLP engines
– BSON (Binary JSON) is MongoDB’s native data format
– Infinit.e ingests and exports JSON natively via the REST based API
Note: Infinit.e uses Google’s GSON Java library to convert JSON to POJOs and back
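The JSON-to-object round trip can be illustrated with a Python analogue. This is not GSON and not Infinit.e code: it uses the standard `json` module and a dataclass, and the `GeoPoint` type is hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class GeoPoint:
    lat: float
    lon: float


def from_json(text):
    """Deserialize a JSON object into a GeoPoint."""
    return GeoPoint(**json.loads(text))


def to_json(point):
    """Serialize a GeoPoint back to a JSON string."""
    return json.dumps(asdict(point), sort_keys=True)


point = from_json('{"lat": 30.05, "lon": 31.25}')
round_tripped = from_json(to_json(point))
```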
This is the JSON logo
Why MongoDB? – Reason #3
MongoDB Is Web Scale*
*Shards are the secret ingredients in the web scale sauce. They just work.
Why MongoDB? – Reason #3
Scalability
• Seriously, MongoDB Scales
– Harvesting and enriching documents requires a lot of disk space
– MongoDB scales to arbitrary sizes in both read/write dimensions
– Sophisticated sharding keys provide powerful/flexible balancing
BUT building an initial cluster can be complex and managing cluster changes is “fiddly”
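To give a feel for how a shard key spreads documents across machines, the sketch below hashes a key and assigns it to one of N shards. This is a deliberate simplification: MongoDB assigns ranges of key values (chunks) to shards and rebalances them, rather than taking a modulo. The function name `shard_for` and the example URLs are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Pick a shard by hashing the key (illustrative, not MongoDB's algorithm)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Hypothetical document URLs used as the shard key.
urls = [f"http://example.com/doc/{i}" for i in range(1000)]
distribution = Counter(shard_for(u, 4) for u in urls)
```

A key with many distinct values tends to spread writes roughly evenly, whereas a key with few distinct values would concentrate them on a handful of shards.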
Why MongoDB? – Reason #4
Integration with Apache Hadoop
• Hadoop is rapidly becoming the de facto standard for data analytics
– Open Source, very customizable
– Proven scalability
– Java libraries
• The MongoDB Hadoop Adaptor allows Hadoop to read from and write to MongoDB instead of HDFS
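The style of job this enables can be sketched in plain Python as a map phase and a reduce phase over stored documents. This does not use Hadoop or the adaptor, and the function names and sample documents are hypothetical. It only shows the shape of the computation, here counting how many documents mention each entity.

```python
from collections import defaultdict


def map_phase(doc):
    """Emit (entity index, 1) for every entity in a document."""
    for ent in doc.get("entities", []):
        yield ent["index"], 1


def reduce_phase(key, values):
    """Sum the counts emitted for one entity."""
    return key, sum(values)


def run_job(documents):
    """Group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input documents.
documents = [
    {"entities": [{"index": "cairo/city"}, {"index": "egypt/country"}]},
    {"entities": [{"index": "egypt/country"}]},
]
entity_counts = run_job(documents)
```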
Tweeting about MongoDC
• Source: http://search.twitter.com/search.rss?q=mongodc
– Who’s Tweeting?
– What are they Tweeting?
– What does basic document analysis of these Tweets tell us?
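The first two questions reduce to simple counting once the tweets are harvested. The sketch below tallies authors and hashtags. The `author` and `text` field names and the sample tweets are hypothetical, not data from the actual #MongoDC analysis.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def summarize(tweets):
    """Count tweets per author and occurrences of each hashtag."""
    authors = Counter(t["author"] for t in tweets)
    hashtags = Counter(
        tag.lower() for t in tweets for tag in HASHTAG.findall(t["text"])
    )
    return authors, hashtags


# Hypothetical tweets for illustration only.
tweets = [
    {"author": "alice", "text": "Heading to #MongoDC today"},
    {"author": "bob", "text": "Great sharding talk at #mongodc #MongoDB"},
    {"author": "alice", "text": "Slides are up #MongoDC"},
]
authors, hashtags = summarize(tweets)
```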
Who’s Tweeting about MongoDC?
How are Tweeters Connected?
What are they Tweeting About?
Sentiment?
Twitter has its Limits…
Thank You!
Craig Vitter
www.ikanow.com
cvitter@ikanow.com