Structured Querying of Web Text: A Technical Challenge
Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko
Presenter: Shahina Ferdous
ID: 1000630375
Date: 03/23/10
Querying over Unstructured Data
Web (Text Documents)
The Web contains a vast amount of text documents, which are:
• Unstructured
• Accessed by keywords
• Limited in search quality
Web search is keyword-in, document-out. Example requests posed to it:
• “Show me some people, what they invented, and the years they died”
• “List some scientists with their inventions and the years they died”
Structured Querying of Web Text
“Show me some people, what they invented, and the years they died”

Scientist    Inventions         Year   Prob
Kepler       log books          1630   .7902
Heisenberg   matrix mechanics   1976   .7897
Galileo      telescope          1642   .7395
Newton       calculus           1727   .7366
In this paper, the authors propose a structured Web query system called the extraction database, ExDB.
ExDB uses an information extraction (IE) system to extract data. Because the extracted data can be erroneous, ExDB assigns a probability to each tuple.
ExDB Work Flow
[Figure: Web text such as “…In 1877, Edison invented the phonograph. Although he…” passes through three stages:
1. Run extractors
2. Populate data model (Facts, Types, and Synonyms tables in an RDBMS)
3. Query processing & applications (query middleware; example query: invented(Edison ?e, ?i))]

Facts
Obj1     Pred       Obj2         prob
Edison   invented   phonograph   0.97
Morgan   born-in    1837         0.85

Types
Type        Instance   prob
scientist   Einstein   0.99
city        Seattle    0.92

Synonyms
Pred1      Pred2        prob
invented   did-invent   0.85
invented   created      0.72
Information Extraction
ExDB extracts several base-level concepts through a combination of existing IE techniques:
• Objects are data values in the system. Examples: Einstein, telephone, Boston, light-bulb.
• Predicates represent binary relations between pairs of objects. Examples: discovered(Edison, phonograph), born-in(A. Einstein, Switzerland), and sells(Amazon, PlayStation).
• Semantic types represent unary relations over objects. Examples: city(Boston), city(New-York), and electronics(dvd-player).
Information Extraction
ExDB should also extract a further series of relationships to make queries even easier for the user:
• Synonyms denote equivalent objects, predicates, or types. Examples: Einstein and A. Einstein almost certainly refer to the same object; invented and has-invented refer to the same predicate.
• Inclusion dependencies describe a subset relationship between two predicates. Example: invented(?x, ?y) ⊆ discovered(?x, ?y).
• Functional dependencies are useful for answering queries with negation, or for explaining why an object is not an answer. For example, a probabilistic FD indicating that a person can be born in only one country:
  born-in(?x, <country> ?y): ?x -> ?y   p=0.95
  Suppose a user asks for “All scientists born in Germany that taught at Princeton” and, after receiving the answers, asks the system “Why is Einstein not an answer?” Using the FD above, the system can answer: the database contains born-in(Einstein, Switzerland), and the FD says a person can be born in only one country, so the probability of born-in(Einstein, Germany) is very low.
Information Extraction

Example                                Description        IE technique
invented(Edison, phonograph)           Arity-2 fact       TextRunner
<scientist> Einstein                   Type (hypernymy)   KnowItAll
has-invented = invented                Synonymy           DIRT
invented ⊆ discovered                  ID (troponymy)     ?
FD: has-capital(x, y) has-capital(y)   FD (rule)          ?
Populate Data Model

Facts
Obj1     Pred       Obj2         prob
Edison   invented   phonograph   0.97
Morgan   born-in    1837         0.85

Types
Type        Instance   prob
scientist   Einstein   0.99
city        Boston     0.92

Synonyms
Pred1      Pred2        prob
invented   did-invent   0.85
invented   created      0.72

IDs
Inclusion   Includer     prob
invented    discovered   0.81
Seattle     Washington   0.65

FDs
LHS             RHS          prob
capital(x, y)   capital(y)   0.77
born-in(x)      country(y)   0.95

Example source sentences:
• It was big news when Edison invented the phonograph…
• He visited cities such as Boston and New York.
• We all know that Edison did-invent the light bulb.
• …In 1877 Edison created the phonograph.
• Morgan was born-in 1837 into a prosperous mercantile-banking family…
• Einstein is one of the best known scientists and intellectuals of all time.
TextRunner
• For fact extraction, ExDB uses an unsupervised system called TextRunner.
• TextRunner generates a large set of extractions while running over the entire text corpus.
• Unlike other IE systems, it does not require a set of target predicates to be specified beforehand. Instead, it starts by using a heavyweight linguistic parser to generate high-quality extraction triples.
• These high-quality triples then serve as the training set for a lightweight extraction classifier that can run on the entire Web-scale corpus.
KnowItAll
• For type extraction, ExDB uses the KnowItAll system.
• KnowItAll searches the entire corpus to extract hypernym or “is-a” relationships. For example, it extracts city(Boston) from “cities such as Seattle and Boston”.
• It assigns each extraction a probability based on its frequency (or search engine hit count).
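The “such as” pattern in the KnowItAll example can be illustrated with a small sketch. This is not KnowItAll's actual implementation: the regular expression, the `PLURAL_TO_TYPE` lookup, and the `extract_types` function are hypothetical names introduced here, and the sketch only counts how often each (type, instance) pair is seen. How a count is turned into a probability is not specified in these slides, so that step is left out.

```python
import re
from collections import Counter

# Hypothetical plural-to-type lookup (an assumption for this sketch);
# a real system would not hard-code it.
PLURAL_TO_TYPE = {"cities": "city", "scientists": "scientist"}

# Matches "<plural noun> such as <list of instances>" up to the end of the clause.
SUCH_AS = re.compile(r"\b(\w+) such as ([^.;]+)")

# Splits "A, B and C" / "A, B, and C" / "A and B" into individual items.
LIST_SEP = re.compile(r",\s*(?:and\s+)?|\s+and\s+")


def extract_types(sentences):
    """Count (type, instance) pairs found by the 'such as' pattern."""
    counts = Counter()
    for sentence in sentences:
        for match in SUCH_AS.finditer(sentence):
            type_name = PLURAL_TO_TYPE.get(match.group(1).lower())
            if type_name is None:
                continue  # unknown plural noun: skip rather than guess
            for instance in LIST_SEP.split(match.group(2)):
                instance = instance.strip()
                if instance:
                    counts[(type_name, instance)] += 1
    return counts


if __name__ == "__main__":
    corpus = [
        "He visited cities such as Boston and New York.",
        "cities such as Seattle and Boston",
    ]
    for pair, freq in extract_types(corpus).items():
        print(pair, freq)
```

Pairs seen more often, such as city(Boston) in the two sample sentences, would receive more support under a frequency-based score.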
DIRT
• ExDB uses the DIRT algorithm to extract predicate synonyms.
• DIRT computes the degree to which the argument pairs of two predicates coincide. For example, invented and has-invented will share many argument pairs, such as Edison/light-bulb or Einstein/theory-of-relativity.
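The idea of comparing argument pairs can be sketched as follows. Note the swap: this uses plain Jaccard overlap as a simplified stand-in, whereas the published DIRT algorithm uses a more refined, weighted similarity. The function names and the sample facts beyond those named above are assumptions made for illustration.

```python
from collections import defaultdict


def argument_pairs(facts):
    """Group (obj1, obj2) argument pairs by predicate."""
    pairs = defaultdict(set)
    for obj1, pred, obj2 in facts:
        pairs[pred].add((obj1, obj2))
    return pairs


def overlap(pairs, pred_a, pred_b):
    """Jaccard overlap of two predicates' argument-pair sets (simplified, not DIRT's own measure)."""
    a = pairs.get(pred_a, set())
    b = pairs.get(pred_b, set())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    facts = [
        ("Edison", "invented", "light-bulb"),
        ("Einstein", "invented", "theory-of-relativity"),
        ("Edison", "invented", "phonograph"),
        ("Edison", "has-invented", "light-bulb"),
        ("Einstein", "has-invented", "theory-of-relativity"),
        ("Dukakis", "visited", "Boston"),
    ]
    pairs = argument_pairs(facts)
    print(overlap(pairs, "invented", "has-invented"))
    print(overlap(pairs, "invented", "visited"))
```

Predicate pairs whose overlap is high would be candidates for the Synonyms table; unrelated predicates share no argument pairs and score zero.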
ExDB Queries
ExDB lets users query the Web data model using a Datalog-like notation.
Example: q(?i) :- invented(Edison, ?i) returns all inventions by Edison.
Example with constraints: q(?x, ?y) :- died-in(<scientist> ?x, 1955 ?y)
Example query for locally available, inexpensive electronics: q(?x, ?y, ?z) :- for-sale-in(<electronics> ?x, Seattle ?y), costs(?x, ?z), (?z < 25)
Another example: q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z), (?z < 1900)
Example of a projection query: q(?s) :- invented(<scientist> ?s, ?i)
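To make the notation concrete, here is a minimal sketch of how a single query atom such as invented(Edison, ?i) could be matched against a facts table. The function `match_atom` and the in-memory list of tuples are assumptions for illustration; ExDB itself stores its tables in an RDBMS and answers queries through middleware. Type constraints such as <scientist> are not handled here.

```python
# Facts as (obj1, predicate, obj2, probability); values taken from the sample tables.
FACTS = [
    ("Edison", "invented", "phonograph", 0.97),
    ("Morgan", "born-in", "1837", 0.85),
    ("einstein", "invented", "relativity", 0.99),
]


def match_atom(facts, pred, arg1, arg2):
    """Match one atom pred(arg1, arg2); arguments starting with '?' are variables.

    Returns a list of (bindings, probability) for every matching fact.
    """
    results = []
    for obj1, fact_pred, obj2, prob in facts:
        if fact_pred != pred:
            continue
        bindings = {}
        matched = True
        for pattern, value in ((arg1, obj1), (arg2, obj2)):
            if pattern.startswith("?"):
                # A repeated variable must bind to the same value both times.
                if bindings.get(pattern, value) != value:
                    matched = False
                bindings[pattern] = value
            elif pattern != value:
                matched = False
        if matched:
            results.append((bindings, prob))
    return results


if __name__ == "__main__":
    # q(?i) :- invented(Edison, ?i)
    print(match_atom(FACTS, "invented", "Edison", "?i"))
```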
Query Processing
Non-projecting queries
• Involve a series of joins against tables in the Web data model.
• The probability of a joined tuple is the product of the individual tuples' probabilities.
• The top-k joined tuples, ranked by probability, are returned as results.

Types
Object       Class       prob
einstein     scientist   0.99
boston       city        0.98
bohr         scientist   0.95
france       country     0.92
curie        scientist   0.91
…            …           …
Bugs bunny   scientist   0.01

Facts
Object1    Predicate     Object2       prob
einstein   invented      relativity    0.99
1848       was-year-of   revolution    0.97
edison     invented      phonograph    0.96
dukakis    visited       boston        0.93
einstein   died-in       1955          0.92
…          …             …             …
humans     have          cold-fusion   0.01

Example: q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z).

Scientist   Invented     Died-in   prob
einstein    relativity   1955      0.90
…
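The join-and-multiply step can be sketched in a few lines. This is an illustration under stated assumptions, not ExDB's query processor: the tables are in-memory copies of the sample rows above, `invented_and_died` is a hypothetical function name, and the <year> type check is omitted because the sample Types table lists no year entries.

```python
import heapq

# Sample rows from the Types and Facts tables above.
TYPES = {
    ("einstein", "scientist"): 0.99,
    ("boston", "city"): 0.98,
    ("bohr", "scientist"): 0.95,
    ("france", "country"): 0.92,
    ("curie", "scientist"): 0.91,
}
FACTS = [
    ("einstein", "invented", "relativity", 0.99),
    ("edison", "invented", "phonograph", 0.96),
    ("dukakis", "visited", "boston", 0.93),
    ("einstein", "died-in", "1955", 0.92),
]


def invented_and_died(k=10):
    """q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, ?z), top-k by probability."""
    rows = []
    for x, pred1, y, prob_invented in FACTS:
        if pred1 != "invented":
            continue
        prob_type = TYPES.get((x, "scientist"))
        if prob_type is None:
            continue  # ?x is not known to be a scientist
        for x2, pred2, z, prob_died in FACTS:
            if pred2 == "died-in" and x2 == x:
                # Probability of the joined tuple: product of the parts.
                rows.append((x, y, z, prob_type * prob_invented * prob_died))
    return heapq.nlargest(k, rows, key=lambda row: row[3])


if __name__ == "__main__":
    for row in invented_and_died():
        print(row)
```

With these rows the only answer is (einstein, relativity, 1955), and its probability is 0.99 × 0.99 × 0.92, which rounds to the 0.90 shown in the result table.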
Projecting queries
• q(?s) :- invented(<scientist> ?s, ?i) ranks scientists by the probability that the scientist invented something, regardless of the actual invention.
• This requires computing a disjunction of m probabilistic events.
• A scientist such as Tesla appears in the output of q if a tuple invented(Tesla, I0) is in the database. There can be many inventions I1, …, Im for Tesla, each with a tuple invented(Tesla, Ii), and any one of them is sufficient to return Tesla as an answer for q.
• Because m can be very large, a large number of very low-probability extractions can unexpectedly add up to a quite large probability.
• Therefore, ExDB abstracts the tuples as a panel of experts, where each expert is a tuple with a score, such as invented(Tesla, Fluorescent-Lighting), 0.95, and the panel determines the probability that the scientist appears in q.
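The problem with a naive disjunction can be seen in a small numeric sketch. It assumes the m extractions are independent events, which is an assumption made here for illustration, and the probabilities 0.005 and the count 1000 are made-up values. The panel-of-experts scoring itself is not specified in these slides, so it is not implemented.

```python
import math


def prob_any(probs):
    """Probability that at least one of several independent events holds: 1 - prod(1 - p_i)."""
    return 1.0 - math.prod(1.0 - p for p in probs)


if __name__ == "__main__":
    # One confident extraction, e.g. invented(Tesla, Fluorescent-Lighting) with score 0.95.
    one_strong = prob_any([0.95])
    # Many very weak extractions (hypothetical values).
    many_weak = prob_any([0.005] * 1000)
    print(one_strong, many_weak)
```

Under this naive computation, a thousand extractions that are each almost certainly wrong outrank a single highly confident one, which is the behaviour the panel-of-experts abstraction is meant to avoid.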
Result of Projecting Queries
q(?s) :- invented(<scientist> ?s, x)
[Table with columns: Scientist, invented]
ExDB Prototype
• Web crawl: 90M pages
• Facts: 338M tuples, 102M objects
• Types: 6.6M instances
• Synonyms: 17k pairs
• No IDs or FDs yet
Applications
• ExDB's extracted data are not meant to be examined directly; rather, they are used to build topic-specific tables that a human user can appreciate.
• Example: a synthetic table about scientists, generated by merging answers from died-in(<scientist> ?x, ?y), invented(<scientist> ?x, ?y), published(<scientist> ?x, ?y), and taught(<scientist> ?x, ?y).
• If ExDB queries could be generated automatically from keywords, it would be possible to build a very powerful query system.
• A Web data cube could be built over the large amount of read-only structured data in ExDB.
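Merging the answers of several queries into one synthetic table about scientists could look roughly like this. The function `merge_answers` and the dictionary layout are assumptions for illustration; the slides do not describe how ExDB performs the merge. Only values that appear in the sample tables are used, so the published and taught columns are left empty.

```python
from collections import defaultdict


def merge_answers(answers_by_predicate):
    """Merge per-predicate (scientist, value) answers into one row per scientist."""
    table = defaultdict(dict)
    for predicate, answers in answers_by_predicate.items():
        for scientist, value in answers:
            table[scientist].setdefault(predicate, []).append(value)
    return dict(table)


if __name__ == "__main__":
    answers = {
        "died-in": [("einstein", "1955")],
        "invented": [("einstein", "relativity"), ("edison", "phonograph")],
        "published": [],
        "taught": [],
    }
    for scientist, row in merge_answers(answers).items():
        print(scientist, row)
```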
Alternative Models
The Schema Extraction Model aims to find the single best schema for the entire set of extractions, transforming the Web text into a traditional relational database.
Three good criteria for schema extraction are:
• Simplicity (few tables)
• Completeness (all extractions appear in the output)
• Fullness (the output database has no NULLs)
Alternative Models
The Text Query Model performs no information extraction at all; instead, it offers a descriptive query language that generates answers to a user's query very quickly.
User's query:
• Extract city/date tuples from the band's website.
• Indicate the city where she lives.
• Compute the dates when the band's city and her own city are within 100 miles of each other.
Questions?
Thank You