John Wu
Searching Large Scientific Data
John Wu
Scientific Data Management
Lawrence Berkeley National Laboratory
Outline
• Highlights of Accomplishments
  • Grid Collector (accelerating others’ work)
  • Query-Driven Visualization (enabling a new way of knowledge discovery)
  • Molecular docking (enabling others to accomplish great things)
• Outlook
  • More complex searches
  • Parallelization
  • Supporting more data formats
  • Integration with large frameworks
FastBit In a Nutshell
• FastBit is designed to search multi-dimensional append-only data
  • Conceptually in table format
    • rows correspond to objects
    • columns correspond to attributes
• FastBit uses vertical (column-oriented) organization for the data
  • Efficient for searching
• FastBit uses bitmap indices with our compression method
  • Proven in analysis to be optimal for one-dimensional queries
  • Faster than other optimal indexes for multi-dimensional queries
[Figure: table layout with labels "row" and "column"]
[Wu, Otoo, Shoshani 2006]
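To make the bitmap-index idea concrete, here is a minimal sketch, not FastBit's actual implementation. It assumes one uncompressed bitmap per distinct value, stored as a Python integer, whereas FastBit compresses its bitmaps. The function names and the sample columns are hypothetical.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)

def range_query(index, low, high):
    """OR together the bitmaps of all keys in the closed range [low, high]."""
    result = 0
    for key, bitmap in index.items():
        if low <= key <= high:
            result |= bitmap
    return result

def rows_of(bitmap):
    """List the row numbers whose bits are set, in ascending order."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Two columns of a small table (illustrative data); each row is one object.
temperature = [300, 900, 850, 1200, 950]
pressure = [1, 3, 2, 3, 1]

temperature_index = build_bitmap_index(temperature)
pressure_index = build_bitmap_index(pressure)

# "temperature between 800 and 1000 and pressure between 2 and 3":
# each one-dimensional condition yields a bitmap, and AND combines them.
hits = range_query(temperature_index, 800, 1000) & range_query(pressure_index, 2, 3)
matching_rows = rows_of(hits)
```

Each one-dimensional condition is answered from its own column index, and a multi-dimensional query only needs a bitwise AND of the partial results.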
Motivation
• Scientific datasets are growing rapidly
• Most data analysis algorithms cannot handle a whole dataset
• Therefore, most data analysis tasks are performed on a subset of the data
• Some examples of searches
  • Find the collision events with the most distinct features of Quark-Gluon Plasma in a high-energy physics experiment
  • Find and track ignition in a combustion simulation
  • Identify the puppet master behind a distributed denial-of-service attack on a computer network
Highlight 1 – Grid Collector
• Searching over billions of objects with hundreds of attributes each:
  • Distributed analysis over the Grid
  • Make petabytes of raw data available for worldwide analyses
• Benefits of the Grid Collector:
  • Transparent object access: select objects based on their attributes
  • Improvement of analysis system’s throughput
  • Best Paper Award (ISC’05) [Wu, Gu, Lauret, Poskanzer, Shoshani, Sim and Zhang 2005]
Grid Collector Speeds up Analyses
[Figure: speedup vs. selectivity, linear axes (selectivity 0 to 1, speedup 0 to 5), for Sample 1, Sample 2, Sample 3]
• Test machine: 2.8 GHz Xeon, 27 MB/s read speed
• When searching for rare events, say, selecting one event out of 1000, using GC is 20 to 50 times faster
• Using GC to read 1/2 of the events, speedup > 1.5; 1/10 of the events, speedup > 2
• Bottom line: improve the throughput of data analyses!
[Figure: speedup vs. selectivity, logarithmic axes (selectivity 0.00001 to 1, speedup 1 to 1000), for Sample 1, Sample 2, Sample 3]
Highlight 2 – Visualization
• Query-Driven Visualization – collaboration between SDM and VACET
  • Use FastBit indexes to efficiently select the most interesting data for visualization
• Above example: laser wakefield accelerator simulation
  • VORPAL produces 2D and 3D simulations of particles in a laser wakefield
  • Finding and tracking particles with large momentum is key to designing the accelerator
  • Brute-force algorithm is quadratic (taking 5 minutes on 0.5 million particles); FastBit time is linear in the number of results (taking 0.3 s, a 1000 X speedup)
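The difference between the quadratic and the result-linear approach can be sketched as follows. This is an illustration only: a Python dictionary stands in for the index on particle IDs (it is not a bitmap index and not the FastBit API), and all names and data are hypothetical.

```python
def track_brute_force(selected_ids, timestep_ids):
    """Quadratic: compare every selected ID against every particle in the timestep."""
    positions = []
    for wanted in selected_ids:
        for position, particle_id in enumerate(timestep_ids):
            if particle_id == wanted:
                positions.append(position)
                break
    return positions

def build_id_index(timestep_ids):
    """One-time pass that maps each particle ID to its position in the timestep."""
    return {particle_id: position for position, particle_id in enumerate(timestep_ids)}

def track_with_index(selected_ids, id_index):
    """Work proportional to the number of selected particles, not the timestep size."""
    return [id_index[pid] for pid in selected_ids if pid in id_index]

# Final timestep: pick the particles with large momentum (illustrative threshold).
ids_final = [7, 3, 9, 1, 4]
momentum_final = [0.2, 8.5, 0.1, 9.7, 0.3]
selected = [pid for pid, p in zip(ids_final, momentum_final) if p > 5.0]

# Earlier timestep: the same particles are stored in a different order.
ids_earlier = [1, 4, 3, 7, 9]
slow = track_brute_force(selected, ids_earlier)
fast = track_with_index(selected, build_id_index(ids_earlier))
```

Both functions return the same positions; only the amount of work per selected particle differs.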
Bin-Based Parallel Coordinate Display
• Integrate FastBit with H5Part, an HDF5 package for particle physics data
• Use FastBit to compute histograms efficiently
• Bin-based parallel coordinate display reduces the number of lines displayed on screen, reduces visual clutter, and reduces response time
• FastBit speeds up the response time further
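A bin-based parallel coordinate display can be sketched as one 2D histogram per pair of neighbouring axes, so that each non-empty cell is drawn once instead of one line per record. The sketch below assumes equal-width bins and uses hypothetical function names and sample data. It computes the counts directly from the raw columns rather than through FastBit.

```python
from collections import Counter

def bin_of(value, low, high, nbins):
    """Index of the equal-width bin holding value; values at `high` go in the last bin."""
    width = (high - low) / nbins
    return min(int((value - low) / width), nbins - 1)

def adjacent_axis_histograms(columns, ranges, nbins):
    """For each pair of neighbouring axes, count records per (left_bin, right_bin) cell."""
    names = list(columns)
    histograms = {}
    for left, right in zip(names, names[1:]):
        cells = Counter()
        for x, y in zip(columns[left], columns[right]):
            cell = (bin_of(x, *ranges[left], nbins), bin_of(y, *ranges[right], nbins))
            cells[cell] += 1
        histograms[(left, right)] = cells
    return histograms

# Three axes of a tiny particle dataset (illustrative values).
columns = {
    "x": [0.1, 0.2, 0.9, 0.15],
    "px": [1.0, 1.2, 9.0, 1.1],
    "y": [5.0, 5.0, 5.0, 5.0],
}
ranges = {"x": (0.0, 1.0), "px": (0.0, 10.0), "y": (0.0, 10.0)}
histograms = adjacent_axis_histograms(columns, ranges, nbins=2)
```

Here four records collapse into two cells between the x and px axes, so the display draws two weighted bands instead of four lines.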
FastBit Speeds up Histogramming
• Time needed to compute desired histograms
  • Custom code that uses the raw data directly
  • FastBit can be 1000 X faster than the custom code (left)
  • FastBit maintains the performance advantage on a parallel system
[Figure: time to compute histograms, FastBit vs. custom code; annotations: "Lower is better", "~ 104 X"]
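One reason an index can answer histogram requests quickly is that, with one bitmap per bin, a bin count is just the number of set bits, and a conditional histogram needs only an AND with the condition's bitmap. The following is a minimal sketch under that assumption, with uncompressed bitmaps held in Python integers. The names and data are hypothetical and this is not the FastBit API.

```python
def build_binned_bitmaps(values, edges):
    """One bitmap (Python int) per bin; bit i is set when edges[b] <= values[i] < edges[b+1]."""
    bitmaps = [0] * (len(edges) - 1)
    for row, value in enumerate(values):
        for b in range(len(bitmaps)):
            if edges[b] <= value < edges[b + 1]:
                bitmaps[b] |= 1 << row
                break
    return bitmaps

def histogram(bitmaps, mask=None):
    """Count set bits per bin, optionally restricted to the rows selected by mask."""
    return [bin(bm if mask is None else bm & mask).count("1") for bm in bitmaps]

# Illustrative particle data: momentum and position of six particles.
momentum = [0.5, 1.5, 2.5, 1.2, 0.1, 2.9]
x = [10, 20, 30, 40, 50, 60]

momentum_bitmaps = build_binned_bitmaps(momentum, [0, 1, 2, 3])
x_bitmaps = build_binned_bitmaps(x, [0, 25, 70])

full_histogram = histogram(momentum_bitmaps)
# Histogram of momentum restricted to particles with 25 <= x < 70.
conditional_histogram = histogram(momentum_bitmaps, mask=x_bitmaps[1])
```

No pass over the raw values is needed once the bitmaps exist, which is the property the timing comparison on this slide relies on.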
Highlight 3 – Molecular Docking
• Jochen Schlosser, Center for Bioinformatics, University of Hamburg
• Application: Structure-based virtual screening (ACS Fall 2007)
[Figure: n ligands are matched with the cavity of one target protein in n docking runs, producing a hit list]
Hit list:
Name    Score
1bef    -16,4
4dab    -12,3
4d2a    -11,6
…       …
• Standard approach: match every ligand with every target protein
• New approach: use FastBit indexes to avoid brute-force matching
Use of FastBit for Molecular Docking
Method
• Specification of the descriptor as triangle geometry
  • Types of interaction centers
  • Triangle side lengths
  • Interaction directions
  • 80 bulk dimensions
• Receptors
  • Receptor descriptors are generated similarly
  • Using complementary information where necessary
• Use of pharmacophore constraints on receptor triangles
  • Reduces number of queries
  • Improved query selectivity because the pharmacophore tends to be inside the protein cavity
Use of FastBit for Molecular Docking
Method
• Indexing system
  • Properties of the problem:
    • Billions of descriptors (~ 1,000 for each ligand)
    • High-dimensional query
  • Properties of bitmap indexes
    • Well suited for this kind of query
    • Can be run stand-alone
    • Further compression possible
    • FastBit uses compression
[Figure: bitmap index for attribute(i); one row per descriptor, one column per attribute value]
         [0]  …   …   …   [n]
desc1     0   1   0   0   0
desc2     0   0   0   1   0
desc3     0   1   0   0   0
desc4     0   0   0   0   1
desc5     1   0   0   0   0
Results
• TrixX-BMI is an efficient tool for virtual screening, with average runtime in the sub-second range
• Screens libraries of ligands 12 times faster than FlexX without pharmacophore constraints
• With pharmacophore constraints, speedup is 140 to 250
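A high-dimensional descriptor lookup of this kind can be sketched with one set of bitmaps per attribute: OR the bitmaps of the acceptable values within an attribute, then AND across attributes. The attribute names, the binned side lengths, and the ligand labels below are hypothetical. The sketch does not reproduce the TrixX-BMI descriptor layout or the FastBit API.

```python
def build_index(descriptors, attributes):
    """index[attr][value] is an int bitmap with bit i set when descriptor i has that value."""
    index = {attr: {} for attr in attributes}
    for i, descriptor in enumerate(descriptors):
        for attr in attributes:
            value = descriptor[attr]
            index[attr][value] = index[attr].get(value, 0) | (1 << i)
    return index

def query(index, allowed, n):
    """AND across attributes; within one attribute, OR over the allowed values."""
    result = (1 << n) - 1  # start with every descriptor selected
    for attr, values in allowed.items():
        attr_hits = 0
        for value in values:
            attr_hits |= index[attr].get(value, 0)
        result &= attr_hits
    return [i for i in range(n) if (result >> i) & 1]

# Ligand triangle descriptors: interaction-center types plus two binned side lengths.
descriptors = [
    {"ligand": "L1", "centers": "DAH", "side_a": 3, "side_b": 5},
    {"ligand": "L1", "centers": "DDA", "side_a": 3, "side_b": 4},
    {"ligand": "L2", "centers": "DAH", "side_a": 4, "side_b": 5},
    {"ligand": "L3", "centers": "DAH", "side_a": 7, "side_b": 5},
    {"ligand": "L3", "centers": "DAH", "side_a": 3, "side_b": 6},
]
index = build_index(descriptors, ["centers", "side_a", "side_b"])

# A receptor triangle accepts these center types and a small tolerance on side_a.
hits = query(index, {"centers": ["DAH"], "side_a": [3, 4], "side_b": [5]}, len(descriptors))
candidate_ligands = sorted({descriptors[i]["ligand"] for i in hits})
```

Only the ligands in `candidate_ligands` would then go on to the more expensive matching step, which is how the index avoids brute-force matching.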
Outline
• Highlights of Accomplishments
  • Grid Collector
  • Query-Driven Visualization
  • Molecular docking
• Outlook
  • More complex searches
  • Parallelization
  • Supporting more data formats
  • Integration with large frameworks
Complex Searches
• So far, FastBit software primarily handles range queries of the form “pressure > 105 and temperature between 800 and 1000”
• Need to support more complex types of searches
  • GTC data analysis: find all particles with a certain energy level that have passed through a region with specified properties of the electric field
  • Network security: find the hosts that have contacted all identified drones within an hour of the start of an attack
  • Protein sequences: identify known proteins with specified molecular weight
  • Catalog matching: match records of stars and galaxies from one survey or simulation to another
  • Subqueries: search the results of previous searches
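The network security example shows why such searches go beyond a range query: the condition "contacted all identified drones" applies to a group of records per host, not to a single row. A minimal sketch in plain Python follows. The record layout, the host names, and the reading of "within an hour" as one hour before or after the start are all assumptions.

```python
from collections import defaultdict

def hosts_contacting_all(connections, drones, attack_start, window=3600):
    """Return hosts that contacted every drone within `window` seconds of attack_start.

    connections is an iterable of (timestamp, source, destination) tuples.
    """
    drones = set(drones)
    contacted = defaultdict(set)
    for timestamp, source, destination in connections:
        # A plain range condition on time plus a membership test on the destination.
        if destination in drones and abs(timestamp - attack_start) <= window:
            contacted[source].add(destination)
    # The "all drones" requirement is checked per host, after grouping.
    return sorted(host for host, seen in contacted.items() if seen == drones)

attack_start = 10000
drones = ["d1", "d2"]
connections = [
    (9000, "h1", "d1"), (9500, "h1", "d2"),    # h1 reached both drones in time
    (9800, "h2", "d1"), (20000, "h2", "d2"),   # h2 reached d2 too late
    (10100, "h3", "d1"), (10200, "h3", "x9"),  # h3 never reached d2
    (10050, "h4", "d2"), (10060, "h4", "d1"),  # h4 reached both drones in time
]
suspects = hosts_contacting_all(connections, drones, attack_start)
```

The first step is the kind of filter an index already serves well; the grouping and the per-host comparison are the parts that need additional support.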
Complex Searches
• Extend the histogramming functionality: group by, top-k, automatic computation of derived fields
• Implement join algorithms
  • Existing bitmap indexes are efficient for filtering out the desired records for common join algorithms such as sort-merge join
  • Existing bitmap-index-based join algorithms appear promising from back-of-the-envelope calculations
• A* algorithm: for problems such as neighborhood expansion, formulating them as joins may not be as efficient as using alternative search algorithms such as A*
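The filter-then-join pattern mentioned above can be sketched as follows. The filtering step here is a plain list comprehension standing in for an index lookup, and the catalog fields and values are hypothetical. The join itself is a textbook sort-merge join, not a FastBit routine.

```python
def sort_merge_join(left, right, key):
    """Join two lists of dicts on `key`, handling duplicate keys on both sides."""
    left = sorted(left, key=lambda record: record[key])
    right = sorted(right, key=lambda record: record[key])
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with each matching left row.
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            while i < len(left) and left[i][key] == left_key:
                for record in right[j:j_end]:
                    joined.append({**left[i], **record})
                i += 1
            j = j_end
    return joined

# Two catalogs describing the same objects (illustrative values).
survey_a = [{"id": 3, "mag": 12.0}, {"id": 1, "mag": 18.0},
            {"id": 2, "mag": 14.5}, {"id": 5, "mag": 13.0}]
survey_b = [{"id": 2, "z": 0.3}, {"id": 3, "z": 0.05},
            {"id": 5, "z": 0.2}, {"id": 1, "z": 0.4}]

# Filter first (the step an index would accelerate), then join the survivors.
bright = [record for record in survey_a if record["mag"] < 15.0]
distant = [record for record in survey_b if record["z"] > 0.1]
matches = sort_merge_join(bright, distant, "id")
```

Filtering before the join shrinks both inputs, so the sort and merge phases touch only the records that can appear in the result.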
Parallelization
• For I/O-dominated tasks:
  • Take advantage of parallel I/O systems, e.g., PVFS
  • Better data layout to effectively utilize the I/O hardware
  • Active Storage, in-situ data processing
• For CPU-dominated tasks:
  • Devise new algorithms, e.g., parallel join algorithms, new join indexes
  • Algorithms for GPU, Cell processor, and many-core architectures
More Data Formats
• Working with application specialists to integrate FastBit with their data libraries
  • H5Part: HDF5
  • ROOT (?)
  • ADIOS
• Restructure FastBit to make it easier to work with different data formats
  • Virtualize data sources
Integrated Data Analysis Framework
• Iterator for coarse-grained data
  • Examples: ROOT and Map-Reduce
  • Indexing provides a way to implement a “smart iterator”, e.g., Grid Collector for the STAR data analysis framework (using ROOT)
• Framework for fine-grained data
  • Tighter integration with programmatic API
  • Provide scripting support for productivity layer (end user)
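The "smart iterator" idea can be illustrated with a small sketch: a plain iterator reads every event and tests it, while an index-backed iterator reads only the events known to match. The `EventStore` class, the `trigger` attribute, and the dictionary index are hypothetical stand-ins. They are not the Grid Collector or ROOT interfaces.

```python
class EventStore:
    """Hypothetical stand-in for event files; counts how many events are actually read."""

    def __init__(self, events):
        self._events = events
        self.reads = 0

    def __len__(self):
        return len(self._events)

    def read(self, position):
        self.reads += 1
        return self._events[position]

def plain_iterator(store, predicate):
    """Read every event and keep the ones that satisfy the predicate."""
    for position in range(len(store)):
        event = store.read(position)
        if predicate(event):
            yield event

def smart_iterator(store, index, wanted):
    """Read only the positions that the index lists for the wanted attribute value."""
    for position in index.get(wanted, []):
        yield store.read(position)

# 1000 events, of which every 250th carries the rare trigger (illustrative data).
events = [{"id": i, "trigger": "rare" if i % 250 == 0 else "common"} for i in range(1000)]

# The index is built once, ahead of time, from the attribute values.
index = {}
for position, event in enumerate(events):
    index.setdefault(event["trigger"], []).append(position)

plain_store, smart_store = EventStore(events), EventStore(events)
plain_result = list(plain_iterator(plain_store, lambda e: e["trigger"] == "rare"))
smart_result = list(smart_iterator(smart_store, index, "rare"))
```

Both iterators deliver the same events to the analysis code, but the index-backed one performs 4 reads instead of 1000 in this example.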
Indexes Facilitate Smart Analysis
Indexes go here!
Or
How to make your system smarter!