Supporting High-Performance Data Processing on Flat-Files
Xuan Zhang
Gagan Agrawal
Ohio State University
Motivation
• Challenges of bioinformatics integration
  – Data volume: overwhelming
    • DNA sequence: 100 gigabases (August, 2005)
  – Data growth: exponential
(Figure provided by PDB)
Existing Solutions
– (Relational) databases
  • Support for indexing and high-level queries
  • Not suitable for biological data
– Flat files with scripts
  • Compact, Perl scripts available
  • Lack indexing and high-level query processing
– Web services
  • Significant overhead
• Enhance information integration systems on
  – Functionality
    • On-the-fly data incorporation
    • Flat file data processing
  – Usability
    • Declarative interface
    • Low programming requirement
  – Performance
    • Incorporate indexing support
Our Approach
Approach Summary
• Metadata
  – Declarative description of data
  – Data mining algorithms for semi-automatic writing
  – Reusable by different requests on same data
• Code generation
  – Request analysis and execution separated
  – General modules with plug-in data module
System Overview
[Figure: system overview diagram. Labels: Understand Data (Layout Miner, Schema Miner); Process Data (Code Generation, Request Processor); Data File; User Request; Answer; Metadata Description (Layout Descriptor and Schema Descriptor pairs); Information Integration System.]
Advantages
• Simple interface
  – At metadata level, declarative
• General data model
  – Semi-structured data
  – Flat file data
• Low human involvement
  – Semi-automatic data incorporation
  – Low maintenance cost
• OK performance
  – Linear scaling guaranteed
  – Can be improved by using indexing
System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Wrapper generation
  – Query processing
  – Query processing with indices
Data Process Overview
• Automatic code generation approach
• Input
  – Metadata about datasets involved
  – Optional:
    • Implicit data transformation task
    • Request by users
    • Indexing functions
• Output
  – Executable programs
    • General modules
    • Task-specific data module
Metadata Description
• Two aspects of data in flat files
  – Logical view of the data
  – Physical data organization
• Two components of every data descriptor
  – Schema description
  – Layout description
• Design goals
  – Powerful
  – Easy to write and interpret
Schema Descriptors
• Follow XML DTD standard for semi-structured data
• Simple attribute list for relational data
<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>

[FASTA]               //Schema name
ID = string           //Data type definitions
DESCRIPTION = string
SEQ = string
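To illustrate how little machinery the attribute-list form needs, here is a minimal sketch of a parser for it. The function name `parse_schema_descriptor` and the returned structure are assumptions made for this example; the slides do not show the system's actual descriptor parser.

```python
def parse_schema_descriptor(text):
    """Parse an attribute-list schema descriptor into (schema_name, {attribute: type})."""
    name = None
    attributes = {}
    for raw_line in text.splitlines():
        # Drop trailing "//" comments and surrounding whitespace.
        line = raw_line.split("//", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
        elif "=" in line:
            attr, type_name = (part.strip() for part in line.split("=", 1))
            attributes[attr] = type_name
        else:
            raise ValueError(f"unrecognized schema line: {raw_line!r}")
    if name is None:
        raise ValueError("schema descriptor has no [NAME] header")
    return name, attributes


FASTA_SCHEMA = """
[FASTA]               //Schema name
ID = string           //Data type definitions
DESCRIPTION = string
SEQ = string
"""

schema_name, schema_attrs = parse_schema_descriptor(FASTA_SCHEMA)
print(schema_name, schema_attrs)
```

The dictionary keeps the attributes in declaration order, which matches the order of the fields in the DTD form of the same schema.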
Layout Descriptors
• Overall structure (FASTA example)
DATASET "FASTAData" {          //Dataset name
  DATATYPE {FASTA}             //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta}             //File location
}
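The layout details are elided on the slide, so the following is not the generated code. It is a hand-written sketch of the kind of mapping a layout descriptor has to capture: from the physical lines of a FASTA file to the logical fields ID, DESCRIPTION, and SEQ. It assumes the conventional FASTA layout (a header line starting with ">", the ID as the first whitespace-delimited token, and sequence lines until the next header); `read_fasta` is a hypothetical name.

```python
import io


def _make_record(header, chunks):
    # Assumption: the ID is the first whitespace-delimited token of the header,
    # and the rest of the header line is the description.
    parts = header.split(None, 1)
    return {
        "ID": parts[0] if parts else "",
        "DESCRIPTION": parts[1] if len(parts) > 1 else "",
        "SEQ": "".join(chunks),
    }


def read_fasta(stream):
    """Yield one dict per FASTA entry with the keys ID, DESCRIPTION, and SEQ."""
    header, chunks = None, []
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            # Sequence data may be wrapped over several lines; join them.
            chunks.append(line.strip())
    if header is not None:
        yield _make_record(header, chunks)


# Toy input, made up for this example.
sample = io.StringIO(">seq1 first test sequence\nACGT\nACGT\n>seq2\nTTGA\n")
records = list(read_fasta(sample))
print(records)
```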
Wrapper Generation: System Overview
[Figure: wrapper generation system diagram. Labels: Mapping File; Mapping Parser; Schema Mapping; Mapping Generator; Schema Descriptors; Layout Descriptor; Layout Parser; Data Entry Representation; Application Analyzer; WRAPINFO; wrapper (DataReader, Synchronizer, DataWriter); Source Dataset; Target Dataset.]
Query With Indices: Motivation
• Goal
  – Improve the performance of query-proc program
    • Index
  – Maintain the advantages
    • Flat file based
    • Low requirement on programming
Challenges & Approaches
• Various indexing algorithms for various biological data
  – User-defined indexing functions
  – Standard function interfaces
• Flat file data
  – Values parsed implicitly and ready to be indexed
  – Byte offset as pointer
• Metadata about indices
  – Layout descriptor
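The "byte offset as pointer" idea can be sketched as follows: while scanning the flat file once, record the byte offset at which each entry starts, keyed by the indexed value, and later seek straight to that offset. The function names, the in-memory dictionary, and the FASTA-style key extraction are all assumptions for this example; the slides do not specify the index file format.

```python
import os
import tempfile


def build_offset_index(path, key_of):
    """Map key -> byte offset of the line that starts the matching entry."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = key_of(line)
            if key is not None:
                index[key] = offset
    return index


def fasta_id(line):
    # Assumption: an entry starts with ">" and its ID is the first token.
    if line.startswith(b">"):
        tokens = line[1:].split()
        return tokens[0].decode("ascii") if tokens else None
    return None


def read_entry(path, offset):
    """Seek to a stored offset and return the lines of one entry."""
    lines = []
    with open(path, "rb") as f:
        f.seek(offset)
        lines.append(f.readline().decode("ascii").rstrip("\n"))
        for raw in iter(f.readline, b""):
            if raw.startswith(b">"):
                break
            lines.append(raw.decode("ascii").rstrip("\n"))
    return lines


# Toy file, made up for this example.
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">g1 alpha\nACGT\n>g2 beta\nTTGA\nCCAA\n")
    demo_path = tmp.name

offset_index = build_offset_index(demo_path, fasta_id)
entry = read_entry(demo_path, offset_index["g2"])
os.remove(demo_path)
print(offset_index, entry)
```

Because the pointer is a plain byte offset, the data stays in its original flat file and the index can be rebuilt at any time from a single scan.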
System Revisited
[Figure: query-processing system diagram with index support. Labels: query; Query parser; Metadata collection; Dataset descriptors; Descriptor parser; Application analyzer; QUERYINFOR; Source/target names; Schema & Layout information mappings; Index file; Index functions; DataReader; Synchronizer; DataWriter; Source data files; Target data file. The diagram is divided into a query analysis phase and a query execution phase.]
Language Enhancement
• Describe indices
  – Indexing is a property of dataset
  – Extend layout descriptors
  – Maintain query format

DATASET "name" {
  …
  INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
         [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
}

AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE …

New meaning of "=":
  If index available, use index retrieving function
  Else, compare values directly
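The revised meaning of "=" can be sketched as a join that takes an optional look-up function. This is an illustration only: `join_on_equal` and `index_lookup` are hypothetical names, a Python dictionary stands in for the index file, and the records are toy data.

```python
def join_on_equal(left_rows, left_attr, right_rows, right_attr, index_lookup=None):
    """Join two record lists on left_attr = right_attr.

    index_lookup, if given, is a callable value -> list of matching right rows
    (standing in for an index retrieving function). Without it, values are
    compared directly by scanning right_rows.
    """
    results = []
    for left in left_rows:
        value = left[left_attr]
        if index_lookup is not None:
            # Index available: ask the retrieving function for the matches.
            matches = index_lookup(value)
        else:
            # No index: compare values directly.
            matches = [r for r in right_rows if r[right_attr] == value]
        for right in matches:
            results.append({**left, **right})
    return results


# Toy records, made up for this example.
chip = [{"GENE": "geneA", "LEVEL": 2.5}, {"GENE": "geneX", "LEVEL": 0.1}]
genome = [{"ID": "geneA", "NAME": "alpha"}, {"ID": "geneB", "NAME": "beta"}]

# A dict plays the role of the index file here.
by_id = {}
for row in genome:
    by_id.setdefault(row["ID"], []).append(row)

with_index = join_on_equal(chip, "GENE", genome, "ID",
                           index_lookup=lambda v: by_id.get(v, []))
without_index = join_on_equal(chip, "GENE", genome, "ID")
print(with_index == without_index, with_index)
```

Both paths return the same rows, which is the point of keeping the query format unchanged: the presence of an index alters how the condition is evaluated, not what the query means.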
System Enhancement
• Metadata Descriptor Parser
  + parse index information
• Application Analyzer
  + index information: index look-up table
  + test condition: compare_field_indexing
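As a sketch of what parsing index information into a look-up table might involve, the code below splits the body of an INDEX clause into its five colon-separated fields. The field names come from the INDEX syntax in the slides; the `IndexInfo` type, the function name, and the sample values are hypothetical.

```python
from typing import NamedTuple


class IndexInfo(NamedTuple):
    attribute: str
    index_file_loc: str
    index_gen_fun: str
    index_retr_fun: str
    fun_loc: str


def parse_index_clause(body):
    """Turn the text inside INDEX { ... } into an attribute -> IndexInfo table."""
    table = {}
    for entry in body.split(","):
        entry = entry.strip()
        if not entry:
            continue
        fields = [field.strip() for field in entry.split(":")]
        if len(fields) != 5:
            raise ValueError(f"expected 5 colon-separated fields, got {entry!r}")
        info = IndexInfo(*fields)
        table[info.attribute] = info
    return table


# Hypothetical clause body; none of these file or function names are from the slides.
clause = ("ID:idx/id.idx:gen_id_index:get_by_id:lib/idx_funcs, "
          "NAME:idx/name.idx:gen_name_index:get_by_name:lib/idx_funcs")
index_table = parse_index_clause(clause)
print(index_table["ID"].index_retr_fun)
```

Splitting on ":" means that locations containing a colon would need escaping or a different separator, a detail the slides do not address.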
Microarray Gene Information Look-up
• Goal: gather information about genes (120)
• Query: microarray output join genome database
• Index: gene names in genome
[Figure: bar chart of performance (sec), axis from 0 to 90, for query analysis, index generation, query with indices, and query without indices. Data labels in the chart: 0.01, 0.72, 20.89, 81.59.]
BLAST-ENHANCE Query
• Goal: Add extra information to BLAST output
• Query: BLAST output join Swiss-Prot database
• Index: protein ID in Swiss-Prot
[Figure: bar chart of performance (sec), axis from 0 to 1200, for index generation, query with indices, and query without indices. The labels 3, 5, and 12 also appear in the chart.]
OMIM-PLUS Query
• Goal: add Swiss-Prot link to OMIM
• Query: OMIM join Swiss-Prot
• Index: protein ID in Swiss-Prot
[Figure: bar chart of performance (sec) on a logarithmic axis from 1 to 10,000,000, for index generation, query with indices, and query without indices.]
Homology Search Query
• Goal: find similar sequences
• Query: query sequence list * sequence database
• Indexing algorithm
  – Sequence-based
  – Transformation of sub-string composition
  – Indexing n-D numerical values
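To make "transformation of sub-string composition" and "indexing n-D numerical values" concrete, here is a simplified sketch. It counts the symbols in each sliding window of a sequence, which turns the sequence into a series of 4-D numeric points, and then computes a minimum bounding rectangle over those points. This is an assumption-laden illustration and not the algorithms cited on the following slides: those also apply wavelet transforms and, in one case, scalar quantization, which are omitted here. The function names and window size are made up.

```python
def composition_vectors(sequence, window, alphabet="ACGT"):
    """Return one symbol-count vector per sliding window of the given length."""
    sequence = sequence.upper()
    if window <= 0 or window > len(sequence):
        return []
    position = {symbol: i for i, symbol in enumerate(alphabet)}
    counts = [0] * len(alphabet)
    for symbol in sequence[:window]:
        if symbol in position:
            counts[position[symbol]] += 1
    vectors = [tuple(counts)]
    # Slide the window one symbol at a time, updating counts incrementally.
    for start in range(1, len(sequence) - window + 1):
        leaving, entering = sequence[start - 1], sequence[start + window - 1]
        if leaving in position:
            counts[position[leaving]] -= 1
        if entering in position:
            counts[position[entering]] += 1
        vectors.append(tuple(counts))
    return vectors


def bounding_rectangle(vectors):
    """Minimum bounding rectangle (per-dimension min and max) of a set of vectors."""
    lows = tuple(min(column) for column in zip(*vectors))
    highs = tuple(max(column) for column in zip(*vectors))
    return lows, highs


# Toy sequence, made up for this example.
vecs = composition_vectors("AACGTT", window=3)
mbr = bounding_rectangle(vecs)
print(vecs, mbr)
```

Once sequences are reduced to numeric points like these, any multidimensional index structure can store them, which is why the framework only needs a standard interface for user-defined indexing functions.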
Homology Search (1)
• Index (Singh’s algorithm)
  – Data: yeast genome
  – Wavelet coefficients
  – Minimum bounding rectangles
[Figure: chart of performance (sec), axis from 0 to 350, against database size in multiples of 9.8MB (1 to 5). Series labels: Index generation, 10, 20, 40.]
Homology Search (2)
• Index (Ferhatosmanoglu’s algorithm)
  – Data: GenBank
  – Wavelet coefficients
  – Scalar quantization
  – R-tree
[Figure: chart of performance (sec), axis from 0 to 30, against database size in multiples of 250MB (1 to 5). Series labels: 10, 20, 40.]
Conclusions
• A framework and a set of tools for on-the-fly flat file data integration
  – New data sources understood semi-automatically by data mining tools
  – New data processed automatically by generated programs
  – Support for indexing incorporated flexibly