The Structure of (Computer) Scientific Revolutions
![Page 1: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/1.jpg)
The Structure of (Computer) Scientific Revolutions
Dow Jones Enterprise Ventures, May 2006
Michael Franklin
UC Berkeley & Amalgamated Insight
![Page 2: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/2.jpg)
Michael Franklin, Dow Jones EV Summit, May 2006
Data Management: Then
Structured Data Processing
![Page 3: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/3.jpg)
Data Management: Now
![Page 4: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/4.jpg)
The Structure Spectrum
• Structured data (schema-first)
  • regular, known, conforming, …
  • e.g., relational database
• Unstructured data (schema-never)
  • freeform, irregular
  • e.g., plain text, images, audio, …
• Semi-structured data (schema-later)
  • Provides structural information, but less constrained.
  • e.g., XML, tagged text/media
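As a rough illustration of the three points on the spectrum, the sketch below stores one made-up reading as a structured row, as an XML fragment, and as plain text, then recovers the same field from each. The record, field names, and helper functions are all hypothetical and only meant to show how much work shifts to the reader of the data as structure decreases.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical reading shown three ways; all values are invented for illustration.
structured_row = {"tag_id": "T42", "receptor_id": "shelf0", "seconds": 17}
semi_structured_doc = (
    "<reading><tag>T42</tag>"
    "<note>seen at shelf0 after 17 seconds</note></reading>"
)
unstructured_text = "Tag T42 was seen at shelf0 after 17 seconds."


def seconds_from_row(row):
    # Schema-first: the field is known in advance, so access is direct.
    return row["seconds"]


def seconds_from_xml(doc):
    # Schema-later: the tags locate the note, but its content still needs parsing.
    note = ET.fromstring(doc).findtext("note", default="")
    match = re.search(r"(\d+) seconds", note)
    return int(match.group(1)) if match else None


def seconds_from_text(text):
    # Schema-never: everything has to be recovered by pattern matching.
    match = re.search(r"(\d+) seconds", text)
    return int(match.group(1)) if match else None
```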
![Page 5: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/5.jpg)
Whither Structured Data?
• Conventional Wisdom: ~20% of data is structured currently.
• Consumer apps, enterprise search, media apps are placing downward pressure on this.
![Page 6: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/6.jpg)
A Contrarian View?
Two reasons why structured data is where the action will be:
• The “Data Industrial Revolution”: Data used to be “hand-crafted”, now it’s generated by computers!!!
• The Data Integration quagmire: structure provides crucial cues for making data usable.
![Page 7: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/7.jpg)
The New Landscape
Bell’s Law: Every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.
• Mainframes 1960s
• Minicomputers 1970s
• Microcomputers/PCs 1980s
• Web-based computing 1990s
• Devices (cell phones, PDAs, wireless sensors, RFID) 2000s
Enabling a new generation of applications for operational visibility, monitoring, and alerting.
![Page 8: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/8.jpg)
Data Streams Data Flood
Data sources: Clickstream, Barcodes, PoS System, Sensors, RFID, Telematics, Inventory, Phones, Transactional Systems
• Exponential data growth
• New challenges: continuous, inter-connected, distributed, physical
• Shrinking business cycles
• More complex decisions
![Page 9: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/9.jpg)
State of the Art
• Custom-coded implementations that are expensive and often unsuccessful.
• Can we develop the right infrastructure to support large-scale data streaming apps?
![Page 10: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/10.jpg)
High Fan In Systems
• A data management infrastructure for large-scale data streaming environments.
• Uniform Declarative Framework
  • Every node is a data stream processor that speaks SQL-ese: stream-oriented queries at all levels.
• Hierarchical, stream-based views as an organizing principle.
  • Can impose a “view” over messy devices.
![Page 11: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/11.jpg)
HiFi - Taming the Data Flood
Receptors
Warehouses, Stores
Dock doors, Shelves
Regional Centers
Headquarters
Hierarchical Aggregation
• Spatial
• Temporal
In-network Stream Query Processing and Storage
Fast Data Path vs. Slow Data Path
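The two kinds of hierarchical aggregation named on this slide can be sketched roughly as follows. This is not HiFi code: the function names, the `(timestamp_sec, tag_id)` reading layout, and the window size are assumptions made for illustration. An edge node rolls readings up over time (temporal), and a parent node combines the summaries of its children (spatial). Summing across children assumes a tag is only ever seen by one child.

```python
from collections import Counter


def temporal_aggregate(readings, window_sec):
    """Roll (timestamp_sec, tag_id) readings up into distinct-tag counts per window."""
    tags_per_window = {}
    for ts, tag_id in readings:
        window_start = (ts // window_sec) * window_sec
        tags_per_window.setdefault(window_start, set()).add(tag_id)
    return {w: len(tags) for w, tags in tags_per_window.items()}


def spatial_aggregate(child_summaries):
    """Combine per-window counts from several child nodes into one parent summary."""
    total = Counter()
    for summary in child_summaries:
        total.update(summary)  # adds counts window by window
    return dict(total)


# Hypothetical readings from two shelves feeding one store-level node.
shelf_a = temporal_aggregate([(0, "t1"), (1, "t1"), (2, "t2"), (6, "t1")], window_sec=5)
shelf_b = temporal_aggregate([(3, "t9"), (7, "t8"), (8, "t7")], window_sec=5)
store = spatial_aggregate([shelf_a, shelf_b])
```

Each level forwards only its summary upward, which is the point of doing the work in the network rather than shipping every raw reading to headquarters.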
![Page 12: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/12.jpg)
Device Issues: example
Shelf RFID Test - Ground Truth
![Page 13: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/13.jpg)
Actual RFID Readings
“Restock every time inventory goes below 5”
![Page 14: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/14.jpg)
Query-based Data Cleaning
Point
Smooth
CREATE VIEW smoothed_rfid_stream AS
(SELECT receptor_id, tag_id
 FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
 GROUP BY receptor_id, tag_id
 HAVING count(*) >= count_T)
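To make the Smooth stage concrete, here is a minimal Python sketch of what the view computes, under stated assumptions: readings arrive as `(timestamp_sec, receptor_id, tag_id)` tuples, the window is tumbling (range and slide are both 5 seconds), and the threshold `count_t` stands in for the slide's `count_T`, whose actual value is not given. The function name and data layout are hypothetical.

```python
from collections import Counter


def smooth(readings, window_sec=5, count_t=2):
    """Keep (receptor_id, tag_id) pairs read at least count_t times in a window.

    readings: iterable of (timestamp_sec, receptor_id, tag_id).
    Returns {window_start: set of (receptor_id, tag_id)}.
    """
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        window_start = int(ts // window_sec) * window_sec
        counts[(window_start, receptor_id, tag_id)] += 1  # GROUP BY per window

    kept = {}
    for (window_start, receptor_id, tag_id), n in counts.items():
        if n >= count_t:  # HAVING count(*) >= count_T
            kept.setdefault(window_start, set()).add((receptor_id, tag_id))
    return kept


# Hypothetical readings: tag "b" is seen only once in the first window.
sample = [
    (0.0, "r1", "a"), (1.0, "r1", "a"), (2.0, "r1", "b"),
    (5.5, "r1", "b"), (6.0, "r1", "b"), (9.9, "r1", "b"),
]
smoothed = smooth(sample, window_sec=5, count_t=2)
```

A single stray read of a tag inside a window is treated as noise and dropped, while tags read repeatedly survive.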
![Page 15: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/15.jpg)
Query-based Data Cleaning
Point
Smooth
Arbitrate
CREATE VIEW arbitrated_rfid_stream AS
(SELECT receptor_id, tag_id
 FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
 GROUP BY receptor_id, tag_id
 HAVING count(*) >= ALL (SELECT count(*)
                         FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
                         WHERE tag_id = rs.tag_id
                         GROUP BY receptor_id))
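The Arbitrate stage resolves tags that several receptors report at once. The sketch below mirrors the `>= ALL` subquery in plain Python, again assuming `(timestamp_sec, receptor_id, tag_id)` tuples and a tumbling 5-second window; the function name and sample data are hypothetical. As in the query, receptors that tie for the highest count are all kept.

```python
from collections import Counter


def arbitrate(readings, window_sec=5):
    """Assign each tag to the receptor(s) that read it most often in a window.

    readings: iterable of (timestamp_sec, receptor_id, tag_id).
    Returns {(window_start, tag_id): set of receptor_id}.
    """
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        window_start = int(ts // window_sec) * window_sec
        counts[(window_start, tag_id, receptor_id)] += 1

    # Highest count any receptor achieved for each (window, tag).
    best = {}
    for (window_start, tag_id, _receptor_id), n in counts.items():
        key = (window_start, tag_id)
        best[key] = max(best.get(key, 0), n)

    # Keep receptors whose count matches the best, i.e. count >= ALL others.
    assigned = {}
    for (window_start, tag_id, receptor_id), n in counts.items():
        if n == best[(window_start, tag_id)]:
            assigned.setdefault((window_start, tag_id), set()).add(receptor_id)
    return assigned


# Hypothetical readings: tag "x" is seen mostly by shelf0; tag "y" is a tie.
sample = [
    (0, "shelf0", "x"), (1, "shelf0", "x"), (2, "shelf1", "x"),
    (3, "shelf1", "y"), (4, "shelf0", "y"),
]
owners = arbitrate(sample)
```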
![Page 16: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/16.jpg)
After Query-based Cleaning
“Restock every time inventory goes below 5”
![Page 17: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/17.jpg)
Once you have the right abstractions…
• “Soft Sensors”
• Quality and lineage
• Optimization (power, etc.)
• Pushdown of external validation information
• Data archiving
• Model-based sensing
• Imperative processing
• …
![Page 18: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/18.jpg)
Data Integration
• Integration is the ultimate schema-first problem.
• Structure is both a key enabler and a key impediment here.
![Page 19: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/19.jpg)
Search vs. Query
What if you wanted to find out which actors donated to John Kerry’s presidential campaign?
![Page 20: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/20.jpg)
Search vs. Query
![Page 21: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/21.jpg)
Search vs. Query
What if you wanted to find out which actors donated to John Kerry’s presidential campaign?
![Page 22: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/22.jpg)
Search vs. Query
• “Search” can return only what’s been previously “stored”.
![Page 23: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/23.jpg)
Also…
• What if you wanted to find out the average donation of actors to each candidate?
• What if you wanted to compare actor donations this campaign to the last one?
• What if you wanted to find out who gave the most to each candidate?
• What if you wanted to know where the information came from, and how old it was?
![Page 24: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/24.jpg)
A “Deep-Web” Query Approach
SELECT y.name, f.occupation, …
FROM Yahoo_Actors y, FECInfo f
WHERE y.name = f.name
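A small Python sketch of the same idea: once both sources are exposed as rows, the join and the follow-up questions (such as the average actor donation per candidate) become a few lines of code. Every row below is placeholder data, and the `candidate` and `amount` fields are assumptions; the slide elides the remaining columns.

```python
from collections import defaultdict

# Hypothetical placeholder rows; not real people, candidates, or amounts.
yahoo_actors = [{"name": "Actor A"}, {"name": "Actor B"}]
fec_info = [
    {"name": "Actor A", "occupation": "Actor", "candidate": "Candidate X", "amount": 1000},
    {"name": "Actor A", "occupation": "Actor", "candidate": "Candidate Y", "amount": 500},
    {"name": "Person C", "occupation": "Engineer", "candidate": "Candidate X", "amount": 250},
    {"name": "Actor B", "occupation": "Actor", "candidate": "Candidate X", "amount": 300},
]


def join_on_name(actors, donations):
    """Equality join on name, as in WHERE y.name = f.name."""
    actor_names = {a["name"] for a in actors}
    return [d for d in donations if d["name"] in actor_names]


def average_by_candidate(rows):
    """Average donation amount per candidate over the joined rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row["candidate"]] += row["amount"]
        counts[row["candidate"]] += 1
    return {c: sums[c] / counts[c] for c in sums}


joined = join_on_name(yahoo_actors, fec_info)
averages = average_by_candidate(joined)
```

An exact-match join like this only finds rows whose names are spelled identically in both sources, so any difference in formatting between the two sites would silently drop matches.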
![Page 25: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/25.jpg)
“Yahoo Actors” JOIN “FECInfo”
Q: Did it Work?
![Page 26: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/26.jpg)
The Fundamental Tradeoff
Figure: Level of Functionality vs. Time (and cost), with curves for Structured (schema-first), Semi-Structured (schema-later), and Unstructured (schema-less).
Structure enables computers to help users manipulate and maintain the data.
![Page 27: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/27.jpg)
Dataspaces*
• Deal with all the data from an enterprise, in whatever form
• Data co-existence: no integrated schema, no single warehouse
• Pay-as-you-go services
  • Keyword search is bare minimum.
  • Data manipulation and increased consistency as you add work.
* “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
![Page 28: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/28.jpg)
Dataspaces vs. Databases
Dataspaces:
• Data Coexistence
• Autonomous Sources
• Search, Browse, Approximate Answer
• Best Effort Guarantees
Databases:
• Single Schema
• Centralized Administration
• Structured Query
• Strict Integrity Constraints
![Page 29: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/29.jpg)
The World of Dataspaces
Figure: systems placed by Semantic Integration (High to Low) and Administrative Proximity (Near to Far): Desktop Search, Web Search, Virtual Organization, Federated DBMS, DBMS.
![Page 30: The Structure of (Computer) Scientific Revolutions](https://reader035.fdocuments.in/reader035/viewer/2022062801/568143f7550346895db08633/html5/thumbnails/30.jpg)
Conclusions
• Structured data is not going away.
  • In fact, there will be lots more of it,
  • and it must be processed as fast as it is created.
• Structure is crucial for successful data integration and manipulation.
  • Much effort will be expended to add structural information to text and media.
• Traditional (structured) database technology is not up to the task.
• Great opportunities for innovation.
  • HiFi and Dataspaces are examples.