Course Project Ideas
Yanlei Diao, University of Massachusetts Amherst
Yanlei Diao, University of Massachusetts Amherst 2/6/2007
New Directions for DB Research
Sensor data: new architecture
XML: new data model
Streams: new execution model
Data quality and lineage: new services
…
Querying in Sensor Networks
• Store data locally at sensors and push queries into the sensor network
– Flash memory energy-efficiency.
– Limited capabilities of sensor platforms.
[Figure labels: Acoustic stream, Image stream, Flash Memory, Gateway, Internet, Push query to sensors]
Optimize for Flash and Limited RAM
• Flash Memory Constraints
– Data cannot be over-written, only erased
– Pages can often only be erased in blocks (16-64 KB)
– Unlike magnetic disks, cannot modify in-place
• Challenges:
– Energy: Organize data on flash to minimize read/write/erase operations
– Memory: Minimize use of memory for flash database.
[Figure: 1. Load block into memory; 2. Modify in-memory; 3. Save block back. Erase block: ~16-64 KB. Memory: ~4-10 KB]
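The load/modify/save cycle can be sketched as a toy model. This is not StonesDB code: `FlashBlock`, `update_page`, and the operation counters are hypothetical names chosen for illustration. The sketch also buffers a whole erase block in RAM, which the figures above (~16-64 KB block versus ~4-10 KB memory) suggest a real sensor platform may not be able to do.

```python
class FlashBlock:
    """Toy erase block: a page can be programmed once, then only erased."""

    def __init__(self, pages_per_block=32):
        self.pages = [None] * pages_per_block  # None means "erased"
        self.reads = 0
        self.writes = 0
        self.erases = 0

    def read_page(self, index):
        self.reads += 1
        return self.pages[index]

    def write_page(self, index, data):
        # Flash cannot overwrite a programmed page in place.
        if self.pages[index] is not None:
            raise ValueError("page already programmed; erase the block first")
        self.pages[index] = data
        self.writes += 1

    def erase(self):
        # Erasure works on the whole block, never on a single page.
        self.pages = [None] * len(self.pages)
        self.erases += 1


def update_page(block, index, new_data):
    """Change one page using the three steps from the figure."""
    # 1. Load the block into memory.
    buffer = [block.read_page(i) for i in range(len(block.pages))]
    # 2. Modify the in-memory copy.
    buffer[index] = new_data
    # 3. Save the block back: erase, then re-program every used page.
    block.erase()
    for i, data in enumerate(buffer):
        if data is not None:
            block.write_page(i, data)


block = FlashBlock(pages_per_block=4)
for i in range(3):
    block.write_page(i, b"old")
update_page(block, 1, b"new")
print(block.reads, block.writes, block.erases)
```

Changing one page costs a full block of reads, one erase, and a rewrite of every used page, which is the energy cost the first challenge refers to.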
StonesDB: System Operation
Image Retrieval: Return images taken last month with at least two birds, one of which is a bird of type A.
• Identify “best” sensors to forward query.
• Provide hints to reduce search complexity at sensor.
Proxy Cache of Image Summaries
StonesDB: System Operation
Image Retrieval: Return images taken last month with at least two birds, one of which is a bird of type A.
[Figure labels: Query Engine, Partitioned Access Methods]
Research Issues in StonesDB
• Local Database Layer
– Reduce updates for indexing and aging.
– New cost models for self-tuning sensor databases.
– Energy-optimized query processing.
– Query processing over aged data.
• Distributed Database Layer
– What summaries are relevant to queries?
– What remainder queries to send to sensors?
– What resolution of summaries to cache?
XML (Extensible Markup Language)
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
XML: a tagging mechanism to describe content.
XML Data Model (Graph)
[Figure: XML tree/graph. Element labels: db, book (b1, b2), title, author, pub, publisher, name, state, mkp. pcdata leaves: Complete..., Principles..., Chamberlin, Bernstein, Newcomer, Morgan..., CA. Node ids: #0 to #7]
Main structure: ordered, labeled tree
References between nodes: becoming a graph
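A minimal sketch of the two ideas on this slide, using Python's standard `xml.etree.ElementTree`. The document below is a hypothetical reconstruction from the figure labels: which author belongs to which book, and the `id` and `pub` attribute names, are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical document assembled from the figure labels (an assumption).
DOC = """
<db>
  <book id="b1" pub="mkp">
    <title>Complete...</title>
    <author>Chamberlin</author>
  </book>
  <book id="b2" pub="mkp">
    <title>Principles...</title>
    <author>Bernstein</author>
    <author>Newcomer</author>
  </book>
  <publisher id="mkp">
    <name>Morgan...</name>
    <state>CA</state>
  </publisher>
</db>
"""

root = ET.fromstring(DOC)


def child_labels(node):
    """Labels of the children, in document order (ordered, labeled tree)."""
    return [child.tag for child in node]


# Index nodes by id so a reference attribute can be followed like an edge.
by_id = {n.get("id"): n for n in root.iter() if n.get("id") is not None}


def publisher_of(book):
    """Follow the 'pub' reference; such cross edges turn the tree into a graph."""
    return by_id[book.get("pub")]


print(child_labels(by_id["b2"]))
print(publisher_of(by_id["b1"]).findtext("name"))
```

Both books point at the same publisher node, so the publisher has more than one incoming edge and the structure is no longer a pure tree.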
XQuery: XML Query Language
• A declarative language for querying XML data
• XPath: path expressions
– Patterns to be matched against an XML graph
– /bib/paper[author/lastname='Croft']/title
• FLWOR expressions
– Combining matching and restructuring of XML data
– for $p in distinct(document("bib.xml")//publisher)
  let $b := document("bib.xml")/book[publisher = $p]
  where count($b) > 100
  order by $p/name
  return $p
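For readers who want to experiment without an XQuery engine, the two queries can be approximated in plain Python. This is a sketch under assumptions: the small `BIB` document and both function names are invented for illustration, the threshold is a parameter rather than the fixed 100, and results are ordered by the publisher text instead of `$p/name`. The predicate is written as a Python loop because `ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data (not from the slides).
BIB = ET.fromstring("""
<bib>
  <paper>
    <title>Paper One</title>
    <author><lastname>Croft</lastname></author>
  </paper>
  <paper>
    <title>Paper Two</title>
    <author><lastname>Smith</lastname></author>
  </paper>
  <book><title>Book A</title><publisher>P1</publisher></book>
  <book><title>Book B</title><publisher>P1</publisher></book>
  <book><title>Book C</title><publisher>P2</publisher></book>
</bib>
""")


def titles_by_author_lastname(root, lastname):
    """Rough analogue of /bib/paper[author/lastname='...']/title."""
    return [
        paper.findtext("title")
        for paper in root.findall("paper")
        if any(a.findtext("lastname") == lastname for a in paper.findall("author"))
    ]


def publishers_with_more_books_than(root, threshold):
    """Rough analogue of the FLWOR query: group, count, filter, order."""
    counts = Counter(book.findtext("publisher") for book in root.iter("book"))
    return sorted(p for p, n in counts.items() if p is not None and n > threshold)


print(titles_by_author_lastname(BIB, "Croft"))
print(publishers_with_more_books_than(BIB, 1))
```

The declarative versions state only what to match; the Python versions spell out the iteration, grouping, and sorting that a query engine would otherwise plan on its own.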
Metadata Management using XML
• File systems for large-scale scientific simulations
– File systems: petabytes or even more
– Directory tree (metadata): large, can’t fit in memory
– Links between files: steps in a simulation, data derivation
• File Searches
– all the files generated on Oct 1, 2005
– all the files whose name is like ‘*simu*.txt’
– all the files that were generated from the file ‘basic-measures.txt’
• Build an XML store to manage directory trees!
– XML data model
– XML Query language
– XML Indices
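The three file searches can be tried on a tiny in-memory directory tree. This is only a sketch: the XML layout, the attribute names `name`, `created`, and `derived-from`, and the sample files other than `basic-measures.txt` are assumptions. A real store would need the indices mentioned above, since the tree does not fit in memory.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical directory tree encoded as XML (layout is an assumption).
TREE = ET.fromstring("""
<dir name="run1">
  <file name="basic-measures.txt" created="2005-09-30"/>
  <dir name="out">
    <file name="simu-a.txt" created="2005-10-01" derived-from="basic-measures.txt"/>
    <file name="notes.md" created="2005-10-01"/>
  </dir>
</dir>
""")


def files_created_on(root, date):
    """All files generated on a given day (ISO date string)."""
    return [f.get("name") for f in root.iter("file") if f.get("created") == date]


def files_matching(root, pattern):
    """All files whose name matches a shell-style pattern."""
    return [
        f.get("name")
        for f in root.iter("file")
        if fnmatch.fnmatchcase(f.get("name"), pattern)
    ]


def files_derived_from(root, source):
    """Files directly generated from `source` (one derivation step only)."""
    return [f.get("name") for f in root.iter("file") if f.get("derived-from") == source]


print(files_created_on(TREE, "2005-10-01"))
print(files_matching(TREE, "*simu*.txt"))
print(files_derived_from(TREE, "basic-measures.txt"))
```

Each search here is a full scan of the tree. Following derivation links transitively would require a graph traversal over the `derived-from` edges, which this sketch leaves out.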
XML Document Processing
• Multi-hierarchical XML markup of text documents
– Multi-hierarchies: part-of-speech, page-line
– Features in different hierarchies overlap in scope
– Need a query language & querying mechanism
– References [Nakov et al., 2005; Iacob & Dekhtyar, 2005]
• Querying and ranking of XML data
– XML fragments returned as results
– Fuzzy matches
– Ranking of matches
– References [Amer-Yahia et al., 2005; Luo et al., 2003]
• Well-defined problems identify your contributions!
Data Stream Management
Traditional Database
• Data at rest
• One-shot or periodic queries
• Query-driven execution
Data Stream Processor
• Data in motion, unending
• Continuous, long-running queries
• Data-driven execution
[Figure labels: Data (Attr1, Attr2, Attr3), Query; Queries, Rules, Event Specs, Subscriptions; Results]
In-Network XML Processing
• XML is becoming the wire format for data
• In-network XML processing
– Authentication
– Authorization
– Routing
– Transformation
– Pattern matching
• XPath widely used for in-network XML processing
• Applied directly to streaming XML data
• Line-speed performance
[Figure labels: Expedite traffic, Enhance security, Real-time monitoring & diagnosis]
Research Issues
• Gigabit rate XPath processing
– Take one look, process XPath, buffer data for future use if necessary
– Processing needs to be gigabit rate
– Memory usage needs to be minimized
• Time/space complexity of XPath stream processing
– Theoretical analysis for common features of XPath
• Minimizing memory usage of YFilter technology
– YFilter: state-of-the-art for multi-XPath processing
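The "take one look" idea can be illustrated with a single-pass matcher built on the standard library's `iterparse`. This is a sketch, not YFilter: it handles one query at a time, only child steps such as `/root/pml/tag`, and no predicates. The function name `stream_match` and the sample data are assumptions.

```python
import io
import xml.etree.ElementTree as ET


def stream_match(source, path):
    """Yield the text of each element whose root-to-node tag path equals `path`.

    The input is read once, front to back; only the stack of open tags and
    the element currently being closed are needed to decide a match.
    """
    target = [step for step in path.split("/") if step]
    open_tags = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_tags.append(elem.tag)
        else:
            if open_tags == target:
                yield (elem.text or "").strip()
            open_tags.pop()
            # Drop this element's text and children; the parent still holds
            # an empty stub, so memory is reduced rather than constant.
            elem.clear()


data = (
    b"<root>"
    b"<pml><tag>01.01298.6EF.0A</tag><location>shelf 2</location></pml>"
    b"<pml><tag>01.01267.60D.01</tag><location>exit1</location></pml>"
    b"</root>"
)
for text in stream_match(io.BytesIO(data), "/root/pml/tag"):
    print(text)
```

Supporting descendant steps, predicates, or many queries at once is where buffering and shared state become necessary, which is the memory question raised in the list above.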
RFID Technology
• RFID technology
01.01298.6EF.0A
01.01267.60D.01
04.0768E.001.F0
reader_id,tag_id,timestamp
RFID Stream Processing
<pml>
  <tag>01.01298.6EF.0A</tag>
  <time>00129038</time>
  <location>shelf 2</location>
</pml>
+
<pml>
  <tag>01.01298.6EF.0A</tag>
  <time>02183947</time>
  <location>exit1</location>
</pml>
[Figure labels: RFID tag, RFID reader]
Example Queries
• Shoplifting: an item was taken out of store without being checked out.
• Out of stocks: the number of items of product X on shelf ≤ 3.
• Misplacement: an item was moved from Shelf A to Shelf B without being purchased or put back.
• …
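Two of these queries can be sketched over a list of location readings. Everything here is an assumption made for illustration: the `Reading` record, the `"checkout"` location name, treating any location that starts with `"exit"` as leaving the store, and treating every tag as an item of product X.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    tag: str
    time: int
    location: str  # e.g. "shelf 2", "checkout", "exit1"


def shoplifted(readings, checkout="checkout", exit_prefix="exit"):
    """Tags seen at an exit without an earlier checkout reading."""
    paid = set()
    alerts = []
    for r in sorted(readings, key=lambda r: r.time):
        if r.location == checkout:
            paid.add(r.tag)
        elif r.location.startswith(exit_prefix) and r.tag not in paid:
            alerts.append(r.tag)
    return alerts


def out_of_stock(readings, shelf, threshold=3):
    """True if at most `threshold` tags were last seen on `shelf`."""
    last_location = {}
    for r in sorted(readings, key=lambda r: r.time):
        last_location[r.tag] = r.location
    on_shelf = sum(1 for loc in last_location.values() if loc == shelf)
    return on_shelf <= threshold


readings = [
    Reading("A", 1, "shelf 2"),
    Reading("B", 2, "shelf 2"),
    Reading("B", 3, "checkout"),
    Reading("A", 5, "exit1"),
    Reading("B", 6, "exit1"),
]
print(shoplifted(readings))
print(out_of_stock(readings, "shelf 2"))
```

Both functions sort a complete list, so they are batch versions of what would be continuous queries over an unending stream. The misplacement query needs the same kind of per-tag state plus a notion of "put back".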
RFID Processing: Global Tracking
+
<pml>
  <epc>01.001298.6EF.0A</epc>
  <ts type="begin"><date>…</date></ts>
  <entity type="maker"><name type="legal">X Ltd.</name></entity>
  …
<pml>
  <epc>01.001298.6EF.0A</epc>
  <ts><date>…</date></ts>
  <location>…</location>
  <msr label="temperature" max="2">90</msr>
  …
<pml>
  <epc>01.001298.6EF.0A</epc>
  <ts><date>…</date></ts>
  <location>…</location>
  <msr label="temperature" max="5">95</msr>
  …
<pml>
  <epc>01.001298.6EF.0A</epc>
  <ts><date>…</date></ts>
  <location>…</location>
  <msr label="temperature" max="2">80</msr>
  …
<pml>
  <epc>01.001298.6EF.0A</epc>
  <ts><date>…</date></ts>
  <location>…</location>
  <msr label="temperature" max="2">85</msr>
  …
<pml>
  <epc>01.001298.6EF.0A</epc>
  <ts type="end"><date>…</date></ts>
  <entity type="retailer"><name type="legal">CVS </name></entity>
  …
Example Queries
• Counterfeit drugs: a bottle is accepted at the retailer if it came from a legal manufacturer and followed all necessary steps in the distribution network
• Expired/spoiled drugs: a bottle is accepted at the retailer if it went through the distribution network in less than 3 months and was never exposed to temperature > 96 F
• Missing pallet, expected case, illegally cloned tags…
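The first two acceptance rules can be written as a check over a bottle's tracking history. This is a sketch under stated assumptions: the `Step` record, the set of legal makers, the list of required steps, and the reading of "3 months" as 90 days are all hypothetical choices, not part of the slides.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

LEGAL_MAKERS = {"X Ltd."}                               # assumption
REQUIRED_STEPS = ["maker", "distributor", "retailer"]   # assumption


@dataclass
class Step:
    entity: str                       # role of the party that handled the bottle
    day: date
    temperature_f: Optional[float] = None


def accept_bottle(history, maker_name, max_days=90, max_temp_f=96.0):
    """Apply the counterfeit and expired/spoiled rules to one bottle."""
    # Counterfeit: legal manufacturer and every required step, in order.
    if maker_name not in LEGAL_MAKERS:
        return False
    if [s.entity for s in history] != REQUIRED_STEPS:
        return False
    # Expired: the whole trip took less than max_days.
    if (history[-1].day - history[0].day).days >= max_days:
        return False
    # Spoiled: never exposed to a temperature above the limit.
    return all(s.temperature_f is None or s.temperature_f <= max_temp_f for s in history)


good = [
    Step("maker", date(2007, 1, 1)),
    Step("distributor", date(2007, 1, 10), 90.0),
    Step("retailer", date(2007, 2, 1), 85.0),
]
print(accept_bottle(good, "X Ltd."))
```

The check assumes a complete history for the bottle. With missed or duplicate readings, the same rules would have to run over inferred rather than observed steps.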
Challenges in RFID Management
• Data-Information Mismatch
– RFID raw data: (tag id, reader id, timestamp)
– Meaningful information: shoplifting, misplaced inventory, out-of-stocks; expired drugs, spoiled drugs…
• Incomplete, inaccurate data
– Readers miss tags
– Readers can pick up tags from overlapping areas
• High-volume data
– Readers read constantly, from all tags in range, without line-of-sight
– Can create up to millions of terabytes of data in a single day
• Low-latency processing
– Up-to-the-second information, time-critical actions
Research Issues
• Real-time event stream processing
– Handling duplicate readings/results
– Data cleaning
– Data compression
• Handling incomplete readings
– Inferences in event databases
– Inferences over event streams
• Distributed processing
– Real time anomaly detection
– Distributed inferences
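As one concrete case of handling duplicate readings, a reader that sees the same tag continuously can be reduced to one event per visit. This is a sketch of a simple time-window filter, one possible approach among many: the function name, the window size, and the tuple order (reader_id, tag_id, timestamp) are assumptions.

```python
def deduplicate(readings, window=5):
    """Yield a reading only if the same reader has not seen the same tag
    within the last `window` time units.

    `readings` must be ordered by timestamp. State is one timestamp per
    (reader, tag) pair, so memory grows with the number of distinct pairs.
    """
    last_seen = {}
    for reader_id, tag_id, timestamp in readings:
        key = (reader_id, tag_id)
        previous = last_seen.get(key)
        last_seen[key] = timestamp
        if previous is None or timestamp - previous > window:
            yield (reader_id, tag_id, timestamp)


raw = [
    ("r1", "t1", 0),
    ("r1", "t1", 2),
    ("r2", "t1", 3),
    ("r1", "t1", 4),
    ("r1", "t1", 20),
]
for event in deduplicate(raw):
    print(event)
```

Because the last-seen time is refreshed on every reading, a tag that stays in range produces a single event no matter how long it is read. It produces a new event only after a gap longer than the window.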
Adaptive Sensing of Atmosphere
• Environmental monitoring: real-time processing of huge-volume meteorological data
• Challenges
– Large volume but limited bandwidth
– Real-time processing
– Uncertain data
– Data archiving and querying the history
[Figure labels: Sense, Send, Merge, Detection, Prediction]
Managing Uncertain Data
• Sources of data uncertainty
1) Sensing noise and partial scanning
2) Data compression
3) Lossy wireless links
4) Incomplete merging
• Managing uncertain data
– Model sources of data uncertainty
– Develop uncertainty calculus to combine the effects of these sources
– Augment results with confidence values
[Figure labels: uncertainty sources (1), (2), (3); Merge (4); Tornado Detection; Prediction (confidence?)]
Managing Uncertain Data
• Sources of data uncertainty
1) Sensing noise and partial scanning
2) Data compression
3) Lossy wireless links
4) Incomplete merging
• Self diagnosis and tuning
– Compare prediction at t with observation at t+1 (no ground truth?!)
– System diagnosis when confidence value is low
– Automatically tune the system
[Figure labels: uncertainty sources (1), (2), (3); Merge (4); Tornado Detection; Prediction (confidence?)]
Questions