Lazy Big Data Integration
Prof. Dr. Andreas Thor
Hochschule für Telekommunikation Leipzig (HfTL)
Martin-Luther-Universität Halle-Wittenberg, 16.12.2016
Agenda
• Data Integration
– Data analytics for domain-specific questions
– Use cases: Bibliometrics & Life Sciences
• Big Data Integration
– Techniques for efficient big data management
– Exploiting cloud infrastructures (MapReduce, NoSQL data stores)
• Lazy Big Data Integration
– (Efficient and) effective goal-oriented data integration
– Integrated analytical approach for big data analytics
Use Case: Bibliometrics
• Does the peer review process actually work? Does it select the "best" papers?
• Data from the reviewing process (e.g., EasyChair)
– Bibliographic information (title, authors, …) of submitted papers
– Review score(s) incl. editorial decision
• Data from bibliographic data sources (e.g., Google Scholar)
– Accepted papers and rejected papers that are published elsewhere
– Number of citations
• Determine the covariance between review score(s) and #citations
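This last step is ordinary sample covariance over paired observations. A minimal sketch (all numbers are hypothetical, not from the talk):

```python
def covariance(xs, ys):
    # Sample covariance of paired observations (review score, #citations).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: review score and later citation count per paper.
scores = [8, 6, 9, 4, 7]
citations = [120, 30, 200, 5, 80]
print(covariance(scores, citations))  # 140.5 -> positive association
```

A positive value suggests higher-scored papers tend to attract more citations; in practice one would normalize (Pearson correlation) to make the magnitude interpretable.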
Data Integration
• Combining data residing at different sources and providing the user with a unified view of this data
– Added value by linking & merging data
– Queries that can only be answered using multiple sources
• Schema Matching
– Finding mappings of corresponding attributes
• Entity Matching
– Finding equivalent data objects
Data Quality
• Can/should we use Google Scholar citations for ranking …
– Papers
– Researchers
– Institutions
– etc.
• Convergent validity of citation analyses?
– Comparison of analysis results for the sources' overlap

               Google Scholar             Web of Science
Coverage       Huge                       Limited
Data quality   Medium (fully automatic)   High (manually curated)
Costs          Free                       Expensive
Use Case: Life Sciences
• Gene Annotation Graph
– Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms
• Links form a graph that captures meaningful biological knowledge
• Making sense of the graph is important
• Prediction of new annotations – hypotheses for wet-lab experiments
[Figure callout: "Is this annotation likely to be added in the future?"]
Graph Summarization + Link Prediction
• Graph summary = Signature + Corrections
• Signature: graph pattern / structure
– Super nodes = partitioning of nodes
– Super edges = edges between super nodes = all edges between the super nodes' member nodes
• Corrections: edges e between individual nodes
– Additions: e ∈ G but e ∉ signature
– Deletions: e ∉ G but e ∈ signature
• p(PO_20030, CIB5) ≈ 0.96
– High prediction score because it is the "only missing piece" of a "perfect 4×6 pattern"
[Figure: annotation graph over the PO terms PO_20030, PO_9006, PO_37, PO_20038 and the genes HY5, PHOT1, CIB5, CRY2, COP1, CRY1, shown side by side with its equivalent graph summary (signature + corrections).]
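The prediction score can be illustrated with a toy scoring function: rate a candidate edge by how dense its super-edge "block" already is. This is a hedged sketch of the intuition, not the actual summarization algorithm from the talk:

```python
from itertools import product

def prediction_score(edges, super_u, super_v):
    # Fraction of the super-edge block that is already present; a block
    # missing only one edge (23 of 24 for the 4x6 pattern) scores ~0.96.
    present = sum((u, v) in edges for u, v in product(super_u, super_v))
    return present / (len(super_u) * len(super_v))

po_terms = ["PO_20030", "PO_9006", "PO_37", "PO_20038"]
genes = ["HY5", "PHOT1", "CIB5", "CRY2", "COP1", "CRY1"]
# All 24 annotations except (PO_20030, CIB5) are present.
edges = {(p, g) for p in po_terms for g in genes} - {("PO_20030", "CIB5")}
print(prediction_score(edges, po_terms, genes))  # 23/24 ~ 0.958
```

The single missing annotation is exactly the "only missing piece" case from the slide, which is why its predicted probability is close to 1.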
(Big) Data Analytics Pipeline
Data Acquisition → Data Extraction/Cleaning → Schema & Entity Matching → Data Fusion/Aggregation → Data Analytics/Visualization
Use cases: Bibliometrics, Life Sciences
Agenda
• Data Integration
– Data analytics for domain-specific questions
– Use cases: Bibliometrics & Life Sciences
• Big Data Integration
– Techniques for efficient big data management
– Exploiting cloud infrastructures (MapReduce, NoSQL data stores)
• Lazy Big Data Integration
– (Efficient and) effective goal-oriented data integration
– Integrated analytical approach for big data analytics
(Big) Data Analytics Pipeline
Data Acquisition → Data Extraction/Cleaning → Schema & Entity Matching → Data Fusion/Aggregation → Data Analytics/Visualization
Use cases: Bibliometrics, Life Sciences
Contributions along the pipeline: Query Strategies, Product Code Extraction, Load Balancing, NoSQL Sharding, Dedoop (Deduplication with Hadoop)
How to speed up entity matching?
• Entity matching is expensive (due to pair-wise comparisons)
• Blocking to reduce the search space
– Group similar entities into blocks based on a blocking key
– Restrict matching to entities from the same block
• Parallelization
– Split the match computation into sub-tasks to be executed in parallel
– Exploit cloud infrastructures and frameworks like MapReduce
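The blocking idea can be sketched in a few lines (the blocking key and match predicate below are illustrative stand-ins, not the talk's actual functions):

```python
from collections import defaultdict
from itertools import combinations

def block_and_match(entities, blocking_key, match):
    # Group entities by blocking key, then compare pairs only within blocks,
    # instead of all n*(n-1)/2 pairs over the whole input.
    blocks = defaultdict(list)
    for e in entities:
        blocks[blocking_key(e)].append(e)
    return [(a, b) for block in blocks.values()
            for a, b in combinations(block, 2) if match(a, b)]

# Hypothetical example: block paper titles by their first character.
papers = ["data integration", "data fusion", "entity matching", "big data"]
pairs = block_and_match(papers,
                        blocking_key=lambda t: t[0],
                        match=lambda a, b: a.split()[0] == b.split()[0])
print(pairs)  # only same-block candidates were ever compared
```

With a good blocking key most true matches share a block, so the quadratic comparison cost applies only within each (much smaller) block.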
Blocking + MapReduce: Naïve
• Data skew leads to an unbalanced workload

[Figure: naïve MapReduce dataflow. The map phase performs blocking: input partitions Π1 = {A, B, C, D, E, F, G} and Π2 = {H, I, K, L, M, N, O} are tagged with blocking keys (A, B, H, I → w; C, D, E → x; K, L → y; F, G, M, N, O → z). Partitioning by "BlockKey mod #ReduceTasks" sends blocks w and z to the same reduce task, which must compute 16 pairs (A-B, A-H, A-I, B-H, B-I, H-I, F-G, F-M, F-N, F-O, G-M, G-N, G-O, M-N, M-O, N-O), while the other two reduce tasks compute only C-D, C-E, D-E and K-L.]
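The skew can be reproduced with a toy model of the naïve partitioner (using `ord(key) % r` as a stand-in for "BlockKey mod #ReduceTasks"; the keys are taken from the slide's example):

```python
from collections import Counter

# Each block goes to one reduce task; a task that receives large blocks
# performs quadratically more pair comparisons than its peers.
blocking_keys = list("wwxxxzzwwyyzzz")  # keys of entities A..O
block_sizes = Counter(blocking_keys)    # w:4, x:3, y:2, z:5

r = 3
pairs_per_reducer = [0] * r
for key, n in block_sizes.items():
    reducer = ord(key) % r              # stand-in for "BlockKey mod r"
    pairs_per_reducer[reducer] += n * (n - 1) // 2

print(pairs_per_reducer)  # [3, 1, 16]: one reducer does 16 of 20 pairs
```

Blocks w and z happen to land on the same reduce task, which therefore computes 16 of the 20 candidate pairs, matching the imbalance shown in the figure.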
Load Balancing for MR-based EM
• Analysis to identify the blocking key distribution
• Global load balancing algorithms assign a similar number of pairs to each reduce task
[Figure: two MapReduce jobs. MR Job 1 (Analysis): map performs blocking on the input partitions Π1, Π2 and counts key occurrences; reduce builds the block distribution matrix. MR Job 2 (Matching): map uses the matrix for load balancing and forms match partitions M1–M3; reduce computes the similarities and emits the match result.]
Block distribution matrix (#E = entities, #P = pairs):

Block    Π1  Π2  #E  #P
w (Φ1)    2   2   4   6
y (Φ2)    0   2   2   1
x (Φ3)    3   0   3   3
z (Φ4)    2   3   5  10

Entities and blocking keys: Π1 = A/w, B/w, C/x, D/x, E/x, F/z, G/z; Π2 = H/w, I/w, K/y, L/y, M/z, N/z, O/z
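The analysis job's counting step reduces to a per-(key, partition) tally. A local sketch with the slide's entities (the MapReduce plumbing is replaced by a plain `Counter`):

```python
from collections import Counter

# Count, per (blocking key, input partition), how many entities fall into
# each block -- this yields the block distribution matrix.
partition1 = dict(A="w", B="w", C="x", D="x", E="x", F="z", G="z")
partition2 = dict(H="w", I="w", K="y", L="y", M="z", N="z", O="z")

matrix = Counter()
for pi, part in enumerate([partition1, partition2], start=1):
    for entity, key in part.items():
        matrix[(key, pi)] += 1

for key in "wyxz":
    n1, n2 = matrix[(key, 1)], matrix[(key, 2)]
    e = n1 + n2
    print(key, n1, n2, e, e * (e - 1) // 2)  # block, Pi1, Pi2, #E, #P
```

The printed rows reproduce the matrix above, including z's 10 pairs, which is why block z must be split.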
BlockSplit
• Large blocks are split into m sub-blocks
– according to the m input partitions
– a block is large if #PBlock > #POverall / #Reducers
• Two types of match tasks
– Single (small blocks and sub-blocks)
– Two sub-blocks (cross comparison)
• Greedy load balancing
– Sort match tasks by number of pairs in descending order
– Assign each match task to the reduce task with the lowest number of pairs so far
• Example
– r=3 reduce tasks; split Φ4 into m=2 sub-blocks
– Φ4's match tasks: Φ4.1, Φ4.2, and Φ4.1×2
Block distribution matrix (#E = entities, #P = pairs):

Block    Π1  Π2  #E  #P
w (Φ1)    2   2   4   6
y (Φ2)    0   2   2   1
x (Φ3)    3   0   3   3
z (Φ4)    2   3   5  10
Match task assignment:

Match task  #P  Reducer
Φ1           6  1
Φ4.1×2       6  2
Φ3           3  3
Φ4.2         3  3
Φ2           1  1
Φ4.1         1  2
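The greedy assignment above is short enough to sketch directly (task names are ASCII-fied; the pair counts are the slide's):

```python
def greedy_assign(match_tasks, r):
    # Sort match tasks by pair count (descending), then repeatedly give the
    # next task to the reduce task with the lowest load so far.
    loads = [0] * r
    assignment = {}
    for name, pairs in sorted(match_tasks.items(), key=lambda kv: -kv[1]):
        reducer = loads.index(min(loads))
        assignment[name] = reducer + 1   # 1-based reducer index
        loads[reducer] += pairs
    return assignment, loads

tasks = {"Phi1": 6, "Phi4.1x2": 6, "Phi3": 3, "Phi4.2": 3,
         "Phi2": 1, "Phi4.1": 1}
assignment, loads = greedy_assign(tasks, r=3)
print(assignment)
print(loads)  # [7, 7, 6] -- nearly equal pair counts per reduce task
```

The resulting loads of 7, 7, and 6 pairs reproduce the table: no reduce task is left with the naïve dataflow's 16-pair hotspot.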
BlockSplit: MR-Dataflow
[Figure: BlockSplit MR dataflow. Map tasks emit composite keys encoding ReducerIndex.Block.MatchTask (e.g., 1.1.* for block Φ1 on reducer 1, or 2.4.12 for the Φ4.1×2 sub-block comparison on reducer 2); entities of sub-blocks are replicated into every match task that needs them. Partitioning by reducer index routes each match task to its assigned reduce task, which computes the pairs, e.g., A-B, A-H, A-I, B-H, B-I, H-I and K-L on reducer 1.]

MapReduce techniques:
• MapKey = ReducerIndex + MatchTask
• Replicate entities of sub-blocks
Evaluation: Robustness + Scalability
• Robust against data skew (from "all entities in a single block" to "uniform distribution")
• Scalable
Agenda
• Data Integration
– Data analytics for domain-specific questions
– Use cases: Bibliometrics & Life Sciences
• Big Data Integration
– Techniques for efficient big data management
– Exploiting cloud infrastructures (MapReduce, NoSQL data stores)
• Lazy Big Data Integration
– (Efficient and) effective goal-oriented data integration
– Integrated analytical approach for big data analytics
(Big) Data Analytics Pipeline
Data Acquisition → Data Extraction/Cleaning → Schema & Entity Matching → Data Fusion/Aggregation → Data Analytics/Visualization
(The first four stages form Data Integration; the analytics/visualization stage is labeled Interpretation.)
Citation Analysis Pipeline
• For a given set of BibTeX entries
– Find matching Google Scholar entries
– Determine aggregated citation counts
• Analytical questions for a researcher
– Complete publication list + #citations
– Top-5 publications
– h-index, average number of citations
• Analytical questions for comparing
– Institutions
– Research fields
– …
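The per-researcher statistics are simple aggregations over the matched citation counts. A sketch with hypothetical numbers:

```python
def h_index(citations):
    # Largest h such that at least h papers have >= h citations each.
    cits = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(cits, start=1) if c >= i)

# Hypothetical citation counts from matched Google Scholar entries.
counts = [25, 8, 5, 3, 3, 1]
print("top-5:", sorted(counts, reverse=True)[:5])
print("h-index:", h_index(counts))           # 3
print("average:", sum(counts) / len(counts))
```

The counting trick works because, once sorted in descending order, the condition `c >= i` fails at the first position where rank exceeds citations and stays false afterwards.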
"Lazy Machine": Effectiveness
• Do the right thing! Do only things that are needed!
– Prioritization / filtering of the data objects to be processed
• Example: Top-5 publications of a researcher
– Entity matching for highly cited Google Scholar entries only
– Cut off data that does not contribute to the analytical result (anymore)
– "does not" → "is not likely to"
• Pipeline stages
– Data Acquisition: query strategies
– Data Extraction: on-demand
– Data Matching: relevant entities only
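The top-5 cutoff can be illustrated with a toy sketch: candidates arrive sorted by citation count, and expensive matching stops once the next candidate can no longer enter the current top-k. The candidate list and match predicate below are entirely hypothetical:

```python
def lazy_top_k(candidates, is_match, k=5):
    # Candidates arrive sorted by citation count (descending). Stop as soon
    # as the next candidate could no longer enter the current top-k, so the
    # expensive match step runs only on entries that can still matter.
    top = []
    compared = 0
    for entry, cits in candidates:
        if len(top) == k and cits <= min(c for _, c in top):
            break                      # cutoff: cannot change the top-k
        compared += 1
        if is_match(entry):            # expensive match vs. BibTeX entries
            top = sorted(top + [(entry, cits)], key=lambda t: -t[1])[:k]
    return top, compared

# Hypothetical: only titles starting with "p" belong to the researcher.
cands = [("p1", 90), ("q1", 80), ("p2", 70), ("p3", 50),
         ("p4", 40), ("p5", 30), ("p6", 20), ("q2", 10)]
top, compared = lazy_top_k(cands, lambda t: t.startswith("p"))
print(top, compared)  # stops after 6 comparisons, skipping the tail
```

Only 6 of the 8 candidates are ever matched; the low-cited tail is cut off because it could not displace the fifth-ranked publication.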
"Lazy User": Data Quality
• Automatic data integration does not achieve 100% data quality
– Data acquisition might miss relevant data
– Matching is imperfect (precision, recall)
– …
• The pipeline & integrated result should effectively point the user to the "weak points"
• Examples
– Which (non-)matching pairs have the greatest effect on the analytical result?
– Outlier detection → which pipeline stage caused the effect?
Lazy Big Data Integration
• Integrated approach for both
– the data integration workflow
– the analytical query
• Current work based on Gradoop (Graph Analytics on Hadoop)
– Graph model + operators for analytical pipelines
– Efficient execution in a distributed environment
• Next steps
– Operators for complex analytical queries / statistics (e.g., h-index)
– Data provenance model for measuring the impact of data objects on specific results
Summary
• Data Integration
– Data analytics for domain-specific questions
– Use cases: Bibliometrics & Life Sciences
• Big Data Integration
– Techniques for efficient big data management
– Exploiting cloud infrastructures (MapReduce, NoSQL data stores)
• Lazy Big Data Integration
– (Efficient and) effective goal-oriented data integration
– Integrated analytical approach for big data analytics