DATA CENTER - Massachusetts Institute of Technology · 2013-10-30 · kbml lacito landxml ledes...

Page 1

DATA CENTER AN MIT RESEARCH PROGRAM

David Brock The Data Center Program Massachusetts Institute of Technology

Optimizing Resources through Connected Data Systems

Integrating Information in Industry and Government

Page 2

Data Explosion

Data doubles every 18 months.1

1. Gantz, John and Reinsel, David, “As the Economy Contracts, the Digital Universe Expands,” IDC – Multimedia White Paper, May 2009. (http://www.emc.com/digital_universe)

[Chart: Data Explosion. Horizontal axis: years 2009 to 2013. Vertical axis: exabytes, 0 to 2,500.]
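To make the doubling claim concrete, the short sketch below computes the growth factor implied by one doubling every 18 months. The function name and the four-year window (2009 to 2013, the span of the chart) are illustrative choices; no absolute data volumes are assumed.

```python
def growth_factor(months: float, doubling_period_months: float = 18.0) -> float:
    """Multiplicative growth after `months` when volume doubles every period."""
    return 2.0 ** (months / doubling_period_months)


# Four years (48 months) at one doubling every 18 months.
print(round(growth_factor(48), 2))  # → 6.35
```

Under this assumption the volume grows a little more than sixfold over the four years shown.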

Page 3

Data Explosion

The world's digital content is equivalent to a stack of books stretching from the Earth to Pluto and back.1

Page 4

Data Explosion

30 Tons

If we stick with the book analogy, then the digital universe in 2010 is equivalent to 30 tons of books for every man, woman and child on the Earth. 1

1. Gantz, John and Reinsel, David, “As the Economy Contracts, the Digital Universe Expands,” IDC – Multimedia White Paper, May 2009. (http://www.emc.com/digital_universe)

Page 5

Internet Usage

Online banking: 75% of US customers use online banking 1

Electronic Commerce: $32.4 billion 2

Social Networking: 500 million active users on Facebook 1

1. ComScore.com
2. U.S. Census Bureau, Quarterly Retail E-Commerce Sales, 2nd Quarter 2009

Page 6

Academic Research

“. . . people were drowning in scholarly information, and drowning in information in general. So it takes twice as much time for people to begin their research.” - Damon Zucca, Oxford University Press, Executive Editor

Page 7

Next Generation

•  The average teen texter exchanges 1,500 texts each month, with an average of 50 per day. 1

•  21% of teenagers access the Internet only from their cell phones.

1. Pew Research Center's Internet & American Life Project.

Page 8

Machine Understanding

Yet almost none of this information is understandable by a computer.

Page 9

Extensible Markup Language

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Twilight</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
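The bookstore sample can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the variable names are illustrative. The parser recovers the tree structure, but the knowledge that `price` holds a number has to be supplied by the programmer, which is the gap the following slides explore.

```python
import xml.etree.ElementTree as ET

# The bookstore document from the slide, as bytes so the declared
# ISO-8859-1 encoding is honored by the parser.
BOOKSTORE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Twilight</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE)

# The parser returns every value as text; converting price to a float is an
# interpretation added here, not something the markup itself states.
books = [
    (book.findtext("title"), float(book.findtext("price")))
    for book in root.findall("book")
]
print(books)  # → [('Twilight', 29.99), ('Learning XML', 39.95)]
```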

Page 10

Extensible Markup Language

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Twilight</title>

    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>

Page 11

4ML AML AML AML AML AML AML ABML ABML ACML ACML ACAP ACS X12 ADML AECM AFML AGML AHML AIML AIML AIF AL3 ANML ANNOTEA ANATML APML APPML AQL APPEL ARML ARML ASML ASML ASTM ARML ARML ASML

ARML ARML ASML ASML ASTM ATML ATML ATML ATML AWML AXML AXML AXML AXML BML BML BML BML BML BML BannerML BCXML BEEP BGML BHTML BIBLIOML BIOML BIPS BizCodes BLM XML BPML BRML BSML BCXML BEEP BGML BHTML

BiblioML BCXML BEEP BGML BHTML BIBLIOML BIOML BIPS BizCodes BLM XML BPML BRML BSML CML xCML CaXML CaseXML xCBL CBML CDA CDF CDISC CELLML ChessGML ChordML ChordQL CIM CIML CIDS CIDX xCIL CLT CNRP ComicsML CIM CIML CIDS

CIDX xCIL CLT CNRP ComicsML Covad xLink CPL CP eXchange CSS CVML CWMI CycML DML DAML DaliML DaqXML DAS DASL DCMI DOI DeltaV DIG35 DLML DMML DocBook DocScope DoD XML DPRL DRI DSML DSD DXS EML EML DLML EAD ebXML

eBIS-XML ECML eCo EcoKnow edaXML EMSA eosML ESML ETD-ML FieldML FINML FITS FIXML FLBC FLOWML FPML FSML GML GML GML GXML GAME GBXML GDML GEML GEDML GEN GeoLang GIML GXD GXL Hy XM HITIS HR-XML HRMML HTML HTTPL

HTTP-DRP HumanML HyTime IML ICML IDE IDML IDWG IEEE DTD IFX IMPP IMS Global InTML IOTP IRML IXML IXRetail JabberXML JDF JDox JECMM JLife JSML JSML JScoreML KBML LACITO LandXML LEDES LegalXML Life Data LitML LMML LogML LogML LTSC XML MAML

MatML MathML MBAM MISML MCF MDDL MDSI-XML Metarule MFDX MIX MMLL MML MML MML MoDL MOS MPML MPXML MRML MSAML MTML MTML MusicXML NAML xNAL NAA Ads Navy DTD NewsML NML NISO DTB NITF NLMXML NVML OAGIS OBI OCF ODF

ODRL OeBPS OFX OIL OIM OLifE OML ONIX DTD OOPML OPML OpenMath Office XML OPML OPX OSD OTA PML PML PML PML PML PML PML PML P3P PDML PDX PEF XML PetroML PGML PhysicsML PICS PMML PNML PNML PNG PrintML

PrintTalk ProductionML PSL PSI QML QAML QuickData RBAC RDDl RDF RDL RecipeML RELAX RELAX NG REXML REPML ResumeXML RETML RFML RightsLang RIXML RoadmOPS RosettaNet PIP RSS RuleML SML SML SML SML SAML SABLE SAE J2008 SBML Schemtron SDML SearchDM-XML SGML

SHOE SIF SMML SMBXML SMDL SDML SMIL SOAP SODL SOX SPML SpeechML SSML STML STEP STEPML SVG SWAP SWMS SyncML TML TML TML TalkML TaxML TDL TDML TEI ThML TIM TIM TMML TMX TP TPAML TREX TxLife

UML UBL UCLP UDDI UDEF UIML ULF UMLS UPnP URI/URL UXF VML vCalendar vCard VCML VHG VIML VISA XML VMML VocML VoiceXML VRML WAP WDDX WebML WebDAV WellML WeldingXML Wf-XML WIDL WITSML WorldOS WSML WSIA XML XML Court XML EDI

XML F XML Key XMLife XML MP XML News XML RPC XML Schema XML Sign XML Query XML P7C XML TP XMLVoc XML XCI XAML XACML XBL XSBEL XBN XBRL XCFF XCES Xchart Xdelta XDF XForms XGF XGL XGMML XHTML XIOP XLF XLIFF XLink XMI XMSG XMTP XNS

Standards

Thousands of incompatible standards make interoperability practically impossible!

Page 12

<alert xmlns="http://www.incident.com/cap/1.1">
  <identifier>23e090317621</identifier>
  <sender>[email protected]</sender>
  <sent>2007-010-09T14:39:01-05:00</sent>
  <status>Actual</status>
  <msgType>Alert</msgType>
  <scope>Public</scope>
  <info>
    <category>Security</category>
    <event>Security Update</event>
    <urgency>Past</urgency>
    <severity>Moderate</severity>
    <certainty>Observed</certainty>
    <senderName>Department of Homeland Security</senderName>
    <headline>Suspicious car spotted / Tucson, AZ </headline>
    <description>Blue sedan, license plate number 61SN52, caught on red light camera on intersection of Main St. and Pleasant St.</description>
    <area>
      <areaDesc>Tucson, AZ</areaDesc>
      <circle>32.1995,-110.8925 0</circle>
    </area>
  </info>
</alert>

Common Alerting Protocol (CAP)

The Common Alerting Protocol (CAP) is an XML-based data format for exchanging public warnings and emergencies between alerting technologies. CAP implementations have been demonstrated by agencies and companies including: United States Department of Homeland Security; National Weather Service; United States Geological Survey; California Office of Emergency Services; and many others.

CAP is the foundation technology for the proposed "Integrated Public Alert and Warning System," an all-hazard, all-media national warning architecture being developed by DHS, the National Weather Service and the Federal Communications Commission.

Page 13

<alert xmlns="http://www.incident.com/cap/1.1">
  <identifier>23e090317621</identifier>
  <sender>[email protected]</sender>
  <sent>2007-010-09T14:39:01-05:00</sent>
  <status>Actual</status>
  <msgType>Alert</msgType>
  <scope>Public</scope>
  <info>
    <category>Security</category>
    <event>Security Update</event>
    <urgency>Past</urgency>
    <severity>Moderate</severity>
    <certainty>Observed</certainty>

    <senderName>Department of Homeland Security</senderName>

    <headline>Suspicious car spotted / Tucson, AZ </headline>
    <description>Blue sedan, license plate number 61SN52, caught on red light camera on intersection of Main St. and Pleasant St.</description>
    <area>
      <areaDesc>Tucson, AZ</areaDesc>
      <circle>32.1995,-110.8925 0</circle>
    </area>
  </info>
</alert>

Compound Tag Names
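A compound tag name such as `senderName` packs several words into one token, so software has to split it before the words can be looked up. The sketch below is one possible splitter for camelCase and snake_case names; the function name and regular expression are illustrative and not part of CAP or the M Language.

```python
import re


def split_compound(tag: str) -> list[str]:
    """Split a camelCase or snake_case tag name into lowercase words."""
    # Alternatives: a run of capitals not followed by a lowercase letter
    # (an acronym), a capitalized or lowercase word, or a run of digits.
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", tag)
    return [word.lower() for word in words]


print(split_compound("senderName"))  # → ['sender', 'name']
print(split_compound("areaDesc"))    # → ['area', 'desc']
print(split_compound("msgType"))     # → ['msg', 'type']
```

Splitting only exposes the next problem: `desc` and `msg` are abbreviations that still have to be resolved to full words.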

Page 14

<alert xmlns="http://www.incident.com/cap/1.1">
  <identifier>23e090317621</identifier>
  <sender>[email protected]</sender>
  <sent>2007-010-09T14:39:01-05:00</sent>
  <status>Actual</status>

  <msgType>Alert</msgType>

  <scope>Public</scope>
  <info>
    <category>Security</category>
    <event>Security Update</event>
    <urgency>Past</urgency>
    <severity>Moderate</severity>
    <certainty>Observed</certainty>
    <senderName>Department of Homeland Security</senderName>
    <headline>Suspicious car spotted / Tucson, AZ </headline>
    <description>Blue sedan, license plate number 61SN52, caught on red light camera on intersection of Main St. and Pleasant St.</description>
    <area>
      <areaDesc>Tucson, AZ</areaDesc>
      <circle>32.1995,-110.8925 0</circle>
    </area>
  </info>
</alert>

Abbreviations

Page 15

<alert xmlns="http://www.incident.com/cap/1.1">
  ...
  <scope>Public</scope>
  <info>
    <category>Security</category>
    <event>Security Update</event>
    <urgency>Past</urgency>
    <severity>Moderate</severity>
    <certainty>Observed</certainty>
    <senderName>Department of Homeland Security</senderName>
    <headline>Suspicious car spotted / Tucson, AZ </headline>
    <description>Blue sedan, license plate number 61SN52, caught on red light camera on intersection of Main St. and Pleasant St.</description>
    <area>
      <areaDesc>Tucson, AZ</areaDesc>

      <circle>32.1995,-110.8925 0</circle>

    </area>
  </info>
</alert>

Embedded Data Structures
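The `<circle>` element hides a small data structure inside a single text value, so an XML parser alone cannot separate its parts. The sketch below unpacks it, assuming the CAP 1.1 convention of a "latitude,longitude" pair followed by a space and a radius in kilometers; the `Circle` type and `parse_circle` function are hypothetical names.

```python
from typing import NamedTuple


class Circle(NamedTuple):
    latitude: float
    longitude: float
    radius_km: float  # unit assumed from the CAP 1.1 convention


def parse_circle(text: str) -> Circle:
    """Unpack 'lat,lon radius' into its three numeric fields."""
    point, radius = text.split()
    latitude, longitude = point.split(",")
    return Circle(float(latitude), float(longitude), float(radius))


print(parse_circle("32.1995,-110.8925 0"))
# → Circle(latitude=32.1995, longitude=-110.8925, radius_km=0.0)
```

The second level of parsing lives in application code rather than in the schema, so every consumer of the format has to reimplement it.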

Page 16

<alert xmlns="http://www.incident.com/cap/1.1">
  <identifier>23e090317621</identifier>
  <sender>[email protected]</sender>
  <sent>2007-010-09T14:39:01-05:00</sent>
  <status>Actual</status>
  <msgType>Alert</msgType>
  <scope>Public</scope>
  <info>
    <category>Security</category>
    <event>Security Update</event>
    <urgency>Past</urgency>
    <severity>Moderate</severity>
    <certainty>Observed</certainty>
    <senderName>Department of Homeland Security</senderName>

    <headline>Suspicious car spotted / Tucson, AZ </headline>

    <description>Blue sedan, license plate number 61SN52, caught on red light camera on intersection of Main St. and Pleasant St.</description>
    <area>
    ...

Natural Language

Page 17

<Event xmlns:gml="http://www.opengis.net/gml/3.2" id="ForestFire">
  <Descriptor>Forest Fire</Descriptor>
  <Identifier label="Type">Pine Forest</Identifier>
  <SimpleProperty label="Containment">5%</SimpleProperty>
  <Location id="FF-Location">
    <Boundary>
      <Polygon>
        <gml:LinearRing>
          <gml:pos>39.09 -105.91</gml:pos>
          . . .
          <gml:pos>39.09 -105.91</gml:pos>
        </gml:LinearRing>
      </Polygon>
    </Boundary>
  </Location>
  <FireInformation>
    <Type>Pine Forest</Type>
    <Containment>5%</Containment>
    <Cause>Believed natural; arson unlikely</Cause>
  </FireInformation>
  <Conditions>
    <ReportedAt>2008-04-14T18:00:00Z</ReportedAt>
    <Temperature>90.2F</Temperature>
    <Wind>NW 3MPH</Wind>
    <Humidity>15%</Humidity>
    <Pressure>1014.8 hPa</Pressure>
    <Visibility>5 mi</Visibility>
  </Conditions>
</Event>

DoD Universal Core (UCore)

Universal Core (UCore) is a national security and preparedness initiative led jointly by the Department of Defense (DoD), Department of Justice (DOJ), Department of Homeland Security (DHS), and the Office of the Director of National Intelligence (ODNI).

Page 18

<Event xmlns:gml="http://www.opengis.net/gml/3.2" id="ForestFire">
  <Descriptor>Forest Fire</Descriptor>
  <Identifier label="Type">Pine Forest</Identifier>
  <SimpleProperty label="Containment">5%</SimpleProperty>
  <Location id="FF-Location">
    <Boundary>
      <Polygon>

        <gml:LinearRing>
          <gml:pos>39.09 -105.91</gml:pos>
          . . .
          <gml:pos>39.09 -105.91</gml:pos>
        </gml:LinearRing>
      </Polygon>
    </Boundary>
  </Location>
  <FireInformation>
    <Type>Pine Forest</Type>
    <Containment>5%</Containment>
    <Cause>Believed natural; arson unlikely</Cause>
  </FireInformation>
  <Conditions>
  ...

Composite Schema

Page 19

<Event xmlns:gml="http://www.opengis.net/gml/3.2" id="ForestFire">
  <Descriptor>Forest Fire</Descriptor>
  <Identifier label="Type">Pine Forest</Identifier>
  <SimpleProperty label="Containment">5%</SimpleProperty>
  <Location id="FF-Location">
    <Boundary>
      <Polygon>
        <gml:LinearRing>
          <gml:pos>39.09 -105.91</gml:pos>
          . . .
          <gml:pos>39.09 -105.91</gml:pos>
        </gml:LinearRing>
      </Polygon>
    </Boundary>
  </Location>
  <FireInformation>
    <Type>Pine Forest</Type>
    <Containment>5%</Containment>
    <Cause>Believed natural; arson unlikely</Cause>
  </FireInformation>
  <Conditions>
    <ReportedAt>2008-04-14T18:00:00Z</ReportedAt>

    <Temperature>90.2F</Temperature>
    <Wind>NW 3MPH</Wind>
    <Humidity>15%</Humidity>
    <Pressure>1014.8 hPa</Pressure>
    ...

Embedded Units
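Values such as `90.2F` or `NW 3MPH` fuse a number, a unit and sometimes a qualifier into one string. The sketch below separates them with a regular expression; the pattern, the group names and `split_measurement` are illustrative assumptions that cover only the five values shown in the sample.

```python
import re

# Optional uppercase qualifier (e.g. a compass direction), a number,
# then whatever remains is treated as the unit.
MEASUREMENT = re.compile(
    r"(?:(?P<qualifier>[A-Z]+)\s+)?(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>.+)"
)


def split_measurement(text: str):
    """Return (qualifier, value, unit) for strings like '90.2F' or 'NW 3MPH'."""
    match = MEASUREMENT.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized measurement: {text!r}")
    return match["qualifier"], float(match["value"]), match["unit"]


for raw in ["90.2F", "NW 3MPH", "15%", "1014.8 hPa", "5 mi"]:
    print(raw, "->", split_measurement(raw))
```

Each new unit spelling or layout would need another rule, which is why the deck treats embedded units as an obstacle to machine understanding.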

Page 20

<current_observation version="1.0">
  <location>Boston, Logan International Airport, MA</location>
  <latitude>42.38</latitude>
  <longitude>-71.03</longitude>
  <observation_time>Jul 25 2009, 3:54 pm EDT</observation_time>
  <weather>Partly Cloudy</weather>

  <temp_f>84.0</temp_f>
  <temp_c>28.9</temp_c>
  <relative_humidity>44</relative_humidity>
  <wind_dir>West</wind_dir>
  <wind_degrees>270</wind_degrees>
  <wind_mph>11.5</wind_mph>
  <wind_kt>10</wind_kt>
  <pressure_mb>1013.1</pressure_mb>
  <pressure_in>29.92</pressure_in>
  <dewpoint_f>60.1</dewpoint_f>
  <dewpoint_c>15.6</dewpoint_c>
  <visibility_mi>10.00</visibility_mi>
</current_observation>

National Weather Service (NOAA NWS)

NWS offers hourly weather observations formatted with XML tags to aid parsing by automated programs that populate databases, display information on web pages, or serve similar applications.
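In the NWS format the unit is carried in the tag name (`temp_f`, `temp_c`), so the same quantity appears more than once. The sketch below reads a four-element excerpt of the observation and checks that each Fahrenheit and Celsius pair agrees; the variable and function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Excerpt of the observation shown on the slide.
OBSERVATION = """<current_observation version="1.0">
  <temp_f>84.0</temp_f>
  <temp_c>28.9</temp_c>
  <dewpoint_f>60.1</dewpoint_f>
  <dewpoint_c>15.6</dewpoint_c>
</current_observation>"""


def fahrenheit_to_celsius(deg_f: float) -> float:
    return (deg_f - 32.0) * 5.0 / 9.0


root = ET.fromstring(OBSERVATION)
consistent = {}
for quantity in ("temp", "dewpoint"):
    deg_f = float(root.findtext(f"{quantity}_f"))
    deg_c = float(root.findtext(f"{quantity}_c"))
    # The feed reports one decimal place, so compare at that precision.
    consistent[quantity] = round(fahrenheit_to_celsius(deg_f), 1) == deg_c

print(consistent)  # → {'temp': True, 'dewpoint': True}
```

A program can only make this check if it already knows that the `_f` and `_c` suffixes denote units, and that knowledge is not stated anywhere in the document.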

Page 21

M Language

Common Vocabulary and Data Format

Page 22

M Language

•  Dictionary – Repository of well-defined terms

•  Ontologies – Connections between terms

•  Grammar – Rules for composing terms into documents and messages

Page 23

Dictionary

Repository of “concepts”

token: cell

definition: the basic structural and functional unit of all organisms.

Page 24

Dictionary

Keyed concepts

cell.0 (“cell.biology”) - the basic structural and functional unit of all organisms.

cell.1 (“cell.phone”) - a hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver.

cell.2 (“cell.manufacturing”) - a group of workers and/or machines that work together as a team to produce a dedicated set of products or assemblies.
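One way to picture the keyed dictionary is as a table in which each sense of a token has a numeric key and a readable alias. The sketch below is an assumption about how such a lookup could work, not the M Language implementation; the `Concept` class, the pairing of keys with aliases and the shortened definitions are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    key: str         # numeric key, e.g. "cell.1"
    alias: str       # readable alias, e.g. "cell.phone"
    definition: str  # shortened here for brevity


CONCEPTS = [
    Concept("cell.0", "cell.biology", "basic structural and functional unit of all organisms"),
    Concept("cell.1", "cell.phone", "hand-held mobile radiotelephone"),
    Concept("cell.2", "cell.manufacturing", "group of workers and/or machines working as a team"),
]
BY_KEY = {concept.key: concept for concept in CONCEPTS}
BY_ALIAS = {concept.alias: concept for concept in CONCEPTS}


def lookup(name: str) -> Concept:
    """Resolve either a numeric key or a readable alias to one concept."""
    return BY_KEY.get(name) or BY_ALIAS[name]


def senses(token: str) -> list[str]:
    """List every key whose token part (before the dot) matches."""
    return [c.key for c in CONCEPTS if c.key.split(".")[0] == token]


print(lookup("cell.phone").key)  # → cell.1
print(senses("cell"))            # → ['cell.0', 'cell.1', 'cell.2']
```

Keying each sense separately lets a message refer to exactly one meaning of an ambiguous word.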

Page 25

Compose Concepts - Words

cell.0 → cell

cell.0 plural.0 + → cells

cell.0 modifier.0 + → cellular

Page 26

Compose Concepts - Phrases

Phrases – sequences of concepts

design.0 cell.1 + research.0 + → cellular phone design research

company.0 chemical.0 + description.0 + → chemical company descriptions

annual.0 maximum.0 + temperature.0 + → maximum annual temperature

Page 27

Compose Concepts – Structured Documents

<Event id="ForestFire">
  <Description>Forest Fire</Description>
  <FireInformation>
    <Type>Pine Forest</Type>
    <Containment>5%</Containment>
    <Cause>Believed natural; arson unlikely</Cause>
  </FireInformation>
  <Conditions>
    <ReportedAt>2008-04-14T18:00:00Z</ReportedAt>
    <Temperature>90.2F</Temperature>
    <Wind>NW 3MPH</Wind>
    <Humidity>15%</Humidity>
    <Pressure>1014.8 hPa</Pressure>
    <Visibility>5 mi</Visibility>
  </Conditions>
</Event>

event

forest.trees fire.event

containment.bound

5 percent.unit

cause.effect

weather.conditions

2008-04-14T18:00:00Z

wind.air speed.velocity

northwest.direction

3 mph.unit

wind.air direction.angle

date.time

Believed natural

Page 28

Ontology

Connections between concepts

iron.0, cobalt.0, manganese.0, mercury.0, lead.0 → type-of → heavy metal.0
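The type-of connections on this slide can be held in a simple mapping from each concept to its parent. The sketch below is a minimal illustration under that assumption; `TYPE_OF`, `is_type_of` and `members` are hypothetical names, and only the five links shown on the slide are included.

```python
# Each concept points to the concept it is a type of.
TYPE_OF = {
    "iron.0": "heavy metal.0",
    "cobalt.0": "heavy metal.0",
    "manganese.0": "heavy metal.0",
    "mercury.0": "heavy metal.0",
    "lead.0": "heavy metal.0",
}


def is_type_of(concept: str, category: str) -> bool:
    """Follow type-of links upward until the category is found or links end."""
    while concept in TYPE_OF:
        concept = TYPE_OF[concept]
        if concept == category:
            return True
    return False


def members(category: str) -> list[str]:
    """List the concepts linked directly to a category."""
    return sorted(c for c, parent in TYPE_OF.items() if parent == category)


print(is_type_of("lead.0", "heavy metal.0"))  # → True
print(members("heavy metal.0"))
# → ['cobalt.0', 'iron.0', 'lead.0', 'manganese.0', 'mercury.0']
```

With links like these, a query for heavy metals can also match documents that mention only lead or mercury.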

Page 29

M Language

Challenge

Transform disparate structured and unstructured data into the M Language

Page 30

Transform Structured and Unstructured Data

<Event id="ForestFire">
  <Description>Forest Fire</Description>
  <FireInformation>
    <Type>Pine Forest</Type>
    <Containment>5%</Containment>
    <Cause>Believed natural; arson unlikely</Cause>
  </FireInformation>
  <Conditions>
    <ReportedAt>2008-04-14T18:00:00Z</ReportedAt>
    <Temperature>90.2F</Temperature>
    <Wind>NW 3MPH</Wind>
    <Humidity>15%</Humidity>
    <Pressure>1014.8 hPa</Pressure>
    <Visibility>5 mi</Visibility>
  </Conditions>
</Event>

event

fire information

cause

believed natural

Page 31

Natural Language Processing

Processing stages

Pre-process Text: load text, spell check, clean white space, remove meta characters, process capital letters, process apostrophes, tokenize

Generic Token Labeler: adjectives, adverbs, articles, auxiliary verbs, conjunctions, idioms, interrogatives, names (first), names (last), nouns, numbers, prepositions, pronouns, punctuation, verbs

Specialized Token Labeler: times / dates, locations, personalities, organizations, products, numbers, equations / formulas, acronyms / codes

Domain Specific Token Labeler: organizational labels, proprietary labels

Token Process: compound words, word function culling, token ranking

Token Cluster: noun phrases, verb phrases, prepositional phrases, conjunction, grammatical partitions, tree formation

Meta-data Analysis: source analysis, context analysis, tone analysis

Disambiguation: meta-data, word association, context analysis, concept abstraction, pattern abstraction

Link Process: modifier references, prepositional phrase references, pronoun references

Knowledge Storage: knowledge storage

RDF Triple Extraction: M Language to RDF Concept Map

Data passed between stages: Tokens, Labeled Tokens, Processed Tokens, Token graph, Token graph + meta-data, Concept graph, RDF Triples

Supporting resources

M Language Dictionary - generic concepts and representations

M Language Dictionary - specialized concepts and representations

M Language Dictionary - proprietary concepts and representations

M Language Knowledge Database
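The first two stages of the pipeline, pre-processing and generic token labeling, can be sketched in a few lines. This is a toy illustration only: the word lists, the character filter and the label names are assumptions, and the later stages (clustering, disambiguation, linking, RDF extraction) are not attempted.

```python
import re

# Tiny illustrative word lists; a real labeler would draw on a full dictionary.
ARTICLES = {"a", "an", "the"}
AUXILIARY_VERBS = {"is", "are", "was", "were", "be", "been"}
CONJUNCTIONS = {"and", "or", "but"}
PUNCTUATION = {";", ".", ","}

TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[;.,]")


def preprocess(text: str) -> list[str]:
    """Remove meta characters, clean white space and tokenize."""
    text = re.sub(r"[^A-Za-z0-9\s';.,]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return TOKEN.findall(text)


def label(tokens: list[str]) -> list[tuple[str, str]]:
    """Attach a coarse label to each token; unknown words stay unlabeled."""
    labeled = []
    for token in tokens:
        lowered = token.lower()
        if lowered in ARTICLES:
            kind = "article"
        elif lowered in AUXILIARY_VERBS:
            kind = "auxiliary verb"
        elif lowered in CONJUNCTIONS:
            kind = "conjunction"
        elif token[0].isdigit():
            kind = "number"
        elif token in PUNCTUATION:
            kind = "punctuation"
        else:
            kind = "unlabeled"
        labeled.append((token, kind))
    return labeled


tokens = preprocess("The fire was  believed natural; arson unlikely.")
print(label(tokens))
```

Tokens left unlabeled here are the ones the specialized and domain-specific labelers would pick up in the pipeline described above.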

Page 32

M Language

past.tense state.0

Fire natural believed

natural.state

thin perception. attribute

attribute.0 attribute.0

attribute.0 + attribute.0

Attribute.0

is

past.tense + be.state

past.tense + concept.state

fire.event

event.object

Emergency. event

object.0

object.0 → state.0 → attribute.0

statement.0

past.tense + believe.state

Page 33

Transform Structured and Unstructured Data

<Event id="ForestFire">
  <Description>Forest Fire</Description>
  <FireInformation>
    <Type>Pine Forest</Type>
    <Containment>5%</Containment>
    <Cause>Believed natural; arson unlikely</Cause>
  </FireInformation>
  <Conditions>
    <ReportedAt>2008-04-14T18:00:00Z</ReportedAt>
    <Temperature>90.2F</Temperature>
    <Wind>NW 3MPH</Wind>
    <Humidity>15%</Humidity>
    <Pressure>1014.8 hPa</Pressure>
    <Visibility>5 mi</Visibility>
  </Conditions>
</Event>

event

fire information

cause

believed natural

event.0

fire.1

cause.0

believe.0 past.0 natural.0

Page 34

M Language

Applications

Page 35

[Diagram: GMT, BFA, DIS and TDL, each paired with an M translator and joined through a shared Network]

Interoperation of existing government data protocols

Real-time Data Translation

From point-to-point to source translation - “n² to n”
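The "n² to n" claim can be checked with simple counting. The sketch below assumes that point-to-point integration needs one translator per ordered pair of protocols and that translation through a common language needs one bidirectional adapter per protocol; the function names are illustrative.

```python
def point_to_point_translators(n: int) -> int:
    """One translator for every ordered pair of distinct protocols."""
    return n * (n - 1)


def common_language_adapters(n: int) -> int:
    """One bidirectional adapter per protocol, to and from the shared language."""
    return n


protocols = ["GMT", "BFA", "DIS", "TDL"]
n = len(protocols)
print(point_to_point_translators(n), common_language_adapters(n))  # → 12 4
```

Adding a fifth protocol would require eight new point-to-point translators but only one new adapter.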

Page 36

[Diagram: DOJ, DHS, FBI and IBIS, each paired with an M translator and joined through a shared Network]

Fusion of existing government data sources

Data Fusion

Integrate and fuse data without changing database

Page 37

Data System Integration

Connect disparate systems without changing protocol

[Diagram: Processing Flow in M. Labeled components: Parts, Gas, Food; M-EDI Adapter; M-KML Adapter; Google Earth; Request Processor; Inventory Service; Requisition Algorithm; Procurement Processing System; Order Service; Fulfillment Report; EDI; KML; Vendors; DC Local Govt / Vendor Systems; M Adapters; Data Flow Management; Procurement Tracking System; M Dict. & Svcs]

Page 38

Enterprise Integration

Dedicated Solution Network

Command Centers

Oracle Fusion Middleware / Enterprise Service Bus Services: Security, Identity Management, SSO, UDDI, Others

Network

•  COP / SA •  Portals •  C2I Apps •  Intelligence

ISR Sensors Network

Mobile Users

External Users

External Organizations, Networks & Simulators

SOA

Web Portal

External Interfaces Public & Private Feeds

RSS CAP etc.

Public & Private Data Sources

ISR Sensors Network

Data Sources

Data Integration Layer (Oracle / AquaLogic)

Adaptors

Web Services Web Services

Web Portal Enhanced Data Interfaces

Simulated Sensor Network

Simulated Data Sources

Smart Data Archives

Smart Dynamic Repositories

•  Portals •  Mash-ups •  Wiki •  Web 2

Web Center

Pre-built Component

Collaboration •  IM •  Online Meetings

Intelligent Services (M Language Apps.) •  Information Fusion •  Monitors & Triggers

Rapid Composition Simulation Subsystem •  Scenario Generation •  Data Generation •  Simulation Control •  Monitors •  Training

Business Intelligence

•  Dashboards •  Ad-Hoc

BI Data Warehouse

Operations •  C2I •  Decision Support •  Operations Mgmt.

Business Systems •  HRMS / Training •  Performance Based Logistics

Enterprise Search •  COTS Tools

Page 39

Future

Page 40

Development

•  Vocabulary acquisition – automatically build vocabulary, definitions and relationships from parsed text

•  Context - use document context to improve parsing

•  Pattern extraction – identify common structural patterns in parsed data to improve processing

•  Semi-Structured data – apply parsing approach to semi-structured data such as generic web pages and scripting languages

Page 41

Thank you

Page 42

M Language

cell.biology -- the basic structural and functional unit of all organisms; they may exist as independent units of life (as in monads) or may form colonies or tissues as in.

cell.electric, electric_cell.1 -- A device that delivers an electric current as the result of a chemical reaction.

cell.jail, jail_cell.1, prison_cell.1 -- A room where a prisoner is kept.

cell.room, cubicle.3 -- Small room in which a monk or nun lives.

cell.compartment -- Any small compartment; the cells of a honeycomb.

cell.telephone, cellular_telephone.1, cellular_phone.1, cellphone.1, mobile_phone.1 -- A hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver.

cell.political, cadre.2 -- A small unit serving as part of or as the nucleus of a larger political movement.

cell.manufacturing -- Manufacturing cell, in which a group of workers and/or machines work together as a team to produce a dedicated set of products or assemblies.

Dictionary Data Structure

computer

IP Address

18.78.14.156

Location

Latitude

145.681

Page 43

Unstructured Data

<Shipping_Incidents>
  <Incident>
    <Date>04/26/2007</Date>
    <Reference_No>2007-130</Reference_No>
    <Subregion>93</Subregion>
    <Geolocation>9.566666667,111.9166667</Geolocation>
    <Aggressor>PIRATES</Aggressor>
    <Victim>FISHING VESSEL</Victim>
    <Description>Sprattly Islands, SOUTH CHINA SEA: Pirates boarded a large Japanese fishing vessel. The vessel was robbed of its catch, while it was taking shelter due to engine trouble. The master informed his family; about the robbery and that another vessel was approaching it. All contact with the fishing vessel was lost, since the master's last call. The fate of the vessel and its crewmembers are unknown.</Description>
  </Incident>
</Shipping_Incidents>

Page 44

M Query

[e.g. relation abbreviation:depth:+/-,relation abbreviation:depth:+/-,...] XML Query Language + M Language Predicates

Page 45

<alert xmlns="http://www.incident.com/cap/1.1">
  <identifier>23e090317621</identifier>
  <sender>[email protected]</sender>
  <sent>2007-010-09T14:39:01-05:00</sent>
  <status>Actual</status>
  <msgType>Alert</msgType>
  <scope>Public</scope>
  <info>
    <category>Security</category>
    <event>Security Update</event>
    <urgency>Past</urgency>
    <severity>Moderate</severity>
    <certainty>Observed</certainty>
    <senderName>Department of Homeland Security</senderName>
    <headline>Suspicious car spotted / Tucson, AZ </headline>
    <description>Blue sedan, license plate number 61SN52, caught on red light camera on intersection of Main St. and Pleasant St.</description>
    <area>
      <areaDesc>Tucson, AZ</areaDesc>
      <circle>32.1995,-110.8925 0</circle>
    </area>
  </info>
</alert>

Common Alerting Protocol (CAP)

The Common Alerting Protocol (CAP) is an XML-based data format for exchanging public warnings and emergencies between alerting technologies. CAP implementations have been demonstrated by agencies and companies including: United States Department of Homeland Security; National Weather Service; United States Geological Survey; California Office of Emergency Services; and many others.

CAP is the foundation technology for the proposed "Integrated Public Alert and Warning System," an all-hazard, all-media national warning architecture being developed by DHS, the National Weather Service and the Federal Communications Commission.
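As a rough illustration of what consuming such an alert involves, the Python sketch below reads a trimmed copy of the sample alert with the standard library. The function name `summarize_alert` and the output keys are hypothetical. The namespace URI is copied from the sample and may differ in real feeds. Reading the trailing number in `<circle>` as a radius in kilometers is an assumption based on the CAP specification's description of circles, not something the slide states.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the sample alert; real feeds may declare another one.
CAP_NS = {"cap": "http://www.incident.com/cap/1.1"}

# Trimmed copy of the sample alert (only the fields used below).
SAMPLE = """<alert xmlns="http://www.incident.com/cap/1.1">
  <identifier>23e090317621</identifier>
  <status>Actual</status>
  <info>
    <category>Security</category>
    <severity>Moderate</severity>
    <headline>Suspicious car spotted / Tucson, AZ </headline>
    <area>
      <areaDesc>Tucson, AZ</areaDesc>
      <circle>32.1995,-110.8925 0</circle>
    </area>
  </info>
</alert>"""


def summarize_alert(xml_text):
    """Pull a few fields out of a CAP-style alert (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    info = root.find("cap:info", CAP_NS)
    # <circle> holds "lat,lon radius"; the radius unit (km) is an assumption.
    circle = info.findtext("cap:area/cap:circle", namespaces=CAP_NS)
    center, radius = circle.split()
    lat, lon = (float(v) for v in center.split(","))
    return {
        "identifier": root.findtext("cap:identifier", namespaces=CAP_NS),
        "severity": info.findtext("cap:severity", namespaces=CAP_NS),
        "headline": info.findtext("cap:headline", namespaces=CAP_NS).strip(),
        "center": (lat, lon),
        "radius_km": float(radius),
    }


if __name__ == "__main__":
    print(summarize_alert(SAMPLE))
```

Even this small reader has to know that the coordinates and radius are packed into a single text field, which is the kind of embedded data structure the slide calls out.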

Annotations on the sample: 2 Abbreviations; CamelCase; Unstructured data; Natural language; Embedded data structures

<Event id="ForestFire">
  <Descriptor>Forest Fire</Descriptor>
  <Identifier label="Type">Pine Forest</Identifier>
  <SimpleProperty label="Containment">5%</SimpleProperty>
  <Location id="FF-Location">
    <GeoLocation><Polygon><exterior><LinearRing>
      <pos>39.09 -105.91</pos>
      . . .
      <pos>39.09 -105.91</pos>
    </LinearRing></exterior></Polygon></GeoLocation>
  </Location>
  <FireInformation>
    <Type>Pine Forest</Type>
    <Containment>5%</Containment>
    <Cause>Believed natural; arson unlikely</Cause>
  </FireInformation>
  <Conditions>
    <ReportedAt>2008-04-14T18:00:00Z</ReportedAt>
    <Temperature>90.2F</Temperature>
    <Wind>NW 3MPH</Wind>
    <Humidity>15%</Humidity>
    <Pressure>1014.8 hPa</Pressure>
    <Visibility>5 mi</Visibility>
  </Conditions>

DoD Universal Core (UCore)

Universal Core (UCore) is a national security and preparedness initiative led jointly by the Department of Defense (DoD), Department of Justice (DOJ), Department of Homeland Security (DHS), and the Office of the Director of National Intelligence (ODNI).

Annotations on the sample: 2 Abbreviations; CamelCase; Unstructured data; Natural language; Phrases; Embedded units of measure
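To make the "embedded units of measure" issue concrete, the Python sketch below separates the number from the unit in values such as `90.2F`, `15%`, and `1014.8 hPa` from the UCore sample. The helper `split_measure` and its regular expression are illustrative assumptions, not part of UCore. Values that do not start with a number, such as `NW 3MPH`, are returned as `None` rather than guessed at.

```python
import re

# A leading number, optional whitespace, then unit text that does not
# itself start with a digit, a dot, or whitespace.
_MEASURE = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*([^\d\s.].*?)\s*$")


def split_measure(text):
    """Return (value, unit) for strings like '90.2F', or None if no match."""
    m = _MEASURE.match(text)
    if m is None:
        return None
    return float(m.group(1)), m.group(2)


if __name__ == "__main__":
    for raw in ["90.2F", "15%", "1014.8 hPa", "5 mi", "NW 3MPH"]:
        print(repr(raw), "->", split_measure(raw))
```

The wind value shows why a single pattern is not enough: direction, speed, and unit share one text field, so each publisher's conventions have to be learned separately.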

<current_observation version="1.0">
  <location>Boston, Logan International Airport, MA</location>
  <latitude>42.38</latitude>
  <longitude>-71.03</longitude>
  <observation_time>Jul 25 2009, 3:54 pm EDT</observation_time>
  <weather>Partly Cloudy</weather>
  <temp_f>84.0</temp_f>
  <temp_c>28.9</temp_c>
  <relative_humidity>44</relative_humidity>
  <wind_dir>West</wind_dir>
  <wind_degrees>270</wind_degrees>
  <wind_mph>11.5</wind_mph>
  <wind_kt>10</wind_kt>
  <pressure_mb>1013.1</pressure_mb>
  <pressure_in>29.92</pressure_in>
  <dewpoint_f>60.1</dewpoint_f>
  <dewpoint_c>15.6</dewpoint_c>
  <visibility_mi>10.00</visibility_mi>
</current_observation>

National Weather Service (NOAA NWS)

NWS offers hourly weather observations formatted with XML tags so that automated programs can parse them to populate databases, display information on web pages, or feed similar applications.
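A minimal Python sketch of such an automated program follows, using a trimmed copy of the sample observation. The helpers `parse_observation` and `split_unit_suffix`, and the `KNOWN_UNITS` set, are hypothetical names introduced for illustration. The set lists only the unit suffixes visible in the sample, so it is an assumption about this one feed rather than a documented NWS vocabulary.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample observation.
SAMPLE = """<current_observation version="1.0">
  <location>Boston, Logan International Airport, MA</location>
  <temp_f>84.0</temp_f>
  <temp_c>28.9</temp_c>
  <relative_humidity>44</relative_humidity>
  <wind_mph>11.5</wind_mph>
  <wind_kt>10</wind_kt>
</current_observation>"""

# Unit suffixes seen in the sample's tag names (an assumption, not a standard).
KNOWN_UNITS = {"f", "c", "mph", "kt", "mb", "in", "mi", "degrees"}


def parse_observation(xml_text):
    """Return {tag: value}, converting numeric-looking text to float."""
    obs = {}
    for child in ET.fromstring(xml_text):
        text = (child.text or "").strip()
        try:
            obs[child.tag] = float(text)
        except ValueError:
            obs[child.tag] = text
    return obs


def split_unit_suffix(tag):
    """Split 'temp_f' into ('temp', 'f'); tags without a known unit keep unit None."""
    name, sep, unit = tag.rpartition("_")
    if sep and unit in KNOWN_UNITS:
        return name, unit
    return tag, None


if __name__ == "__main__":
    obs = parse_observation(SAMPLE)
    for tag, value in obs.items():
        print(split_unit_suffix(tag), value)
```

Here the unit lives in the tag name (`temp_f`, `wind_kt`) rather than in the value, so the parser needs a different rule than it would for a feed that writes `90.2F` in the element text.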

Annotations on the sample: 1 Phrases; Unstructured data; 4 Heterogeneous data structures; 5 Embedded units of measure

Unstructured Data – Grammatical Representation

“Pirates boarded a large Japanese fishing vessel.”

noun phrase
    noun, plural: pirate
verb phrase
    verb, past tense: board
noun phrase
    indefinite article: a
    adjective: large
    adjective: Japanese
    noun: fishing vessel
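One way to hold this grammatical representation in a program is as a small tree of labeled nodes. The Python sketch below is only an illustration. The `Node` class, its field names, and the top-level "sentence" node are assumptions introduced here, and the slide does not prescribe a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                  # grammatical category, e.g. "noun phrase"
    lemma: str = ""             # base form of the word (leaf nodes only)
    features: tuple = ()        # e.g. ("plural",) or ("past tense",)
    children: list = field(default_factory=list)

    def lemmas(self):
        """Return the base forms of all leaf words, left to right."""
        if not self.children:
            return [self.lemma]
        out = []
        for child in self.children:
            out.extend(child.lemmas())
        return out


# "Pirates boarded a large Japanese fishing vessel."
sentence = Node("sentence", children=[
    Node("noun phrase", children=[
        Node("noun", "pirate", ("plural",)),
    ]),
    Node("verb phrase", children=[
        Node("verb", "board", ("past tense",)),
    ]),
    Node("noun phrase", children=[
        Node("indefinite article", "a"),
        Node("adjective", "large"),
        Node("adjective", "Japanese"),
        Node("noun", "fishing vessel"),
    ]),
])

if __name__ == "__main__":
    print(sentence.lemmas())
```

Storing the base form plus features ("pirate" + plural, "board" + past tense) instead of the surface words is what lets free text from a `<Description>` field be compared with structured fields such as `<Aggressor>PIRATES</Aggressor>`.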

M Language

Jeff → boy.person → person.entity → entity.object → object.0
threw → past.tense + throw.move → past.tense + action.0 → action.0
the → the.article → reference.article → attribute.0
red → red.color → color.attribute → attribute.0
ball → ball.object → thing.object → object.0

Patterns:
attribute.0+ → object.0 gives object.0
object.0 → action.0 → object.0 gives statement.0
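The two patterns on this slide (`attribute.0+ → object.0` and `object.0 → action.0 → object.0`) can be read as reduction rules over a sequence of top-level categories. The Python sketch below is a guess at that reading, not the M Language implementation. The function `reduce_tags` is hypothetical, and the input tag sequence is an assumed ordering of the categories shown for the words Jeff, threw, the, red, ball.

```python
def reduce_tags(tags):
    """Apply the two slide patterns to a list of category tags.

    Pattern 1: one or more attribute.0 followed by object.0 -> object.0
    Pattern 2: object.0, action.0, object.0 -> statement.0
    """
    out = []
    pending_attrs = 0
    for tag in tags:
        if tag == "attribute.0":
            pending_attrs += 1
            continue
        if pending_attrs and tag != "object.0":
            # Attributes not followed by an object are kept unchanged.
            out.extend(["attribute.0"] * pending_attrs)
        # If tag is object.0, the pending attributes are absorbed into it.
        pending_attrs = 0
        out.append(tag)
    out.extend(["attribute.0"] * pending_attrs)

    if out == ["object.0", "action.0", "object.0"]:
        return ["statement.0"]
    return out


if __name__ == "__main__":
    # Assumed order: Jeff, threw, the, red, ball
    tags = ["object.0", "action.0", "attribute.0", "attribute.0", "object.0"]
    print(reduce_tags(tags))
```

A sequence that does not fit the second pattern is returned partially reduced, which gives a simple signal that the text was not recognized as a statement.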

Outline

•  Data Volume

•  Information Systems

•  on-line banking, commerce, social networking

•  XML and Web Services

•  Standardization

•  Mashups

•  Issues

•  M Language (un)-(semi)-structured data

•  Future