„Big data”
Benczúr András, MTA SZTAKI
Benczúr – Big Data, Szeged, 23 March 2012
Big Data – the new hype
• “big data” is when the size of the data itself becomes part of the problem
• “big data” is data that becomes large enough that it cannot be processed using conventional methods
• Google sorts 1PB in 33 minutes (07-09-2011)
• Amazon S3 store contains 499B objects (19-07-2011)
• New Relic: 20B+ application metrics/day (18-07-2011)
• Walmart monitors 100M entities in real time (12-09-2011)
Source: The Emerging Big Data slide from the Intelligent Information Management DG INFSO/E2 Objective ICT-2011.4.4 Info day in Luxembourg on 26 September 2011
Big Data Planes
[Chart: the Big Data tool landscape plotted along two axes, Speed (batch → real time) and Size (MegaByte → PetaByte). "Fast data" (real time): KDB, Esper, Progress, S4; "Big analytics" (batch, large size): Hadoop, MapR, HBase, Netezza, Vertica, Greenplum, InfoBright, MySQL, GraphLab, MOA; analytics tools: Matlab, R, Mahout, SciPy, Revolution, SAS, SPSS; "Big Data Services": custom hardware and custom software; example applications: power station sensors, IT logs, Web content, fraud detection, online reputation, news curation, navigation/mobility, media pricing.]
Overview
• Introduction – Buzzwords
• Part I: Background
  • Examples: mobility and navigation traces; sensors, smart city; IT logs; wind power
  • Scientific and Business Relevance
• Part II: Infrastructures
  • NoSQL, Key-value store, Hadoop, H*, Pregel, …
• Part III: Algorithms
  • Brief History of Algorithms
  • Web processing, PageRank with algorithms
  • Streaming algorithms
  • Entity resolution – detailed comparison (QDB 2011)
Navigation and Mobility Traces
• Streaming data at mobile base stations
• Privacy issues
  • Regulations allow only anonymized data to leave the network-operations and billing domain
  • Do regulation policy makers know about deanonymization attacks?
  • What your wife/husband will not know, your mobile provider will
Sensors – smart home, city, country, …
• Road and parking slot sensors
• Mobile parking traces
• Public transport, Oyster cards
• Bike hire schemes
Source: Internet of Things Comic Book, http://www.smartsantander.eu/images/IoT_Comic_Book.pdf
… even agriculture …
Non-conform situation detection: estimation of the gearbox bearing temperature by a neural network model
(Model validity: ambient temperature between 4 and 10 °C)
[Chart: model estimation error (%), limit ±17%, and temperatures plotted over one year; series: model inputs 1 and 2, modeled vs. measured gearbox bearing temperature, ambient temperature (for model validity), error %]
… and wind power stations
Our experience: 30-100+ GB/day, 3-60 M events
Corporate IT log processing
Aggregation into Data Warehouse
Identify bottlenecks, optimize procedures
Detect misuse, fraud, attacks
Traditional methods fail
Scientific and business relevance
• VLDB 2011 (~100 papers):
  • 6 papers on MapReduce/Hadoop, 10 on big data (+ keynote), 11 on NoSQL architectures, 6 on GPS/sensory data
  • tutorials, demos (Microsoft, SAP, IBM NoSQL tools)
  • sessions: Big Data Analysis, MapReduce, Scalable Infrastructures
• EWEA 2011: 28% of papers on wind power raise data size issues
• SIGMOD 2011: out of 70 papers, 10 on new architectures and extensions for analytics
• Gartner 2011 trend No. 5: Next Generation Analytics – "significant changes to existing operational and business intelligence infrastructures"
• The Economist, 27 February 2010: "Monstrous amounts of data … Information is transforming traditional businesses"
• News special issue on Big Data this April
New challenges in database technologies
Question of research and practice: Applicability to a specific problem? Applicability as a general technique?
Overview
• Part I: Background
  • Examples
  • Scientific and Business Relevance
• Part II: Infrastructures
  • NoSQL
  • Key-value stores
  • Hadoop and tools built on Hadoop
  • Bulk Synchronous Parallel, Pregel
  • Streaming, S4
• Part III: Algorithms, Examples
Now come many external slide shows …
• NoSQL introduction – www.intertech.com/resource/usergroup/NoSQL.ppt
• Key-value stores
  • BerkeleyDB – not distributed
  • Voldemort – behemoth.strlen.net/~alex/voldemort-nosql_live.ppt
  • Cassandra, Dynamo, …
  • also exists on top of Hadoop (below): HBase
• Hadoop – Erdélyi Miki's slides
• HBase – datasearch.ruc.edu.cn/course/cloudcomputing20102/slides/Lec07.ppt
• Cascading – not covered
• Mahout – cwiki.apache.org/MAHOUT/faq.data/Mahout%20Overview.ppt
• Why is something else needed? What else?
• Bulk Synchronous Parallel
• GraphLab – Danny Bickson's slides
• MOA – http://www.slideshare.net/abifet/moa-5636332/download
• Streaming
  • S4 – http://www.slideshare.net/alekbr/s4-stream-computing-platform
Bulk Synchronous Parallel architecture
HAMA: a Pregel clone
Use of large matrices
• Main step in all distributed algorithms
• Network based features in classification
• Partitioning for efficient algorithms
• Exploring the data, navigation (e.g. ranking to select a nice compact subgraph)
• Hadoop apps (e.g. PageRank) move the entire data around in each iteration
• Baseline C++ code keeps data local
[Chart legend: Hadoop; Hadoop + KeyValue store; Best C++ custom code]
BSP vs. MapReduce
• MapReduce: data locality is not preserved between Map and Reduce invocations or across MapReduce iterations.
• BSP: tailored towards processing data with locality.
  • Proprietary: Google Pregel
  • Open source (eventually? several flaws at present): HAMA
  • Home-developed C++ code base
• Both: easy parallelization and distribution.
Sidlo et al., Infrastructures and bounds for distributed entity resolution. QDB 2011
Overview
• Part I: Background
• Part II: Infrastructures
  • NoSQL, Key-value store, Hadoop, H*, Pregel, …
• Part III: Algorithms, Examples, Comparison
  • Data and computation intense tasks, architectures
  • History of Algorithms
  • Web processing, PageRank with algorithms
  • Entity resolution – detailed with algorithms
• Summary, conclusions
Types of Big Data problems
• Data intense
  • Web processing, info retrieval, classification
  • Log processing (telco, IT, supermarket, …)
• Compute intense
  • Expectation Maximization, Gaussian mixture decomposition, image retrieval, …
  • Genome matching, phylogenetic trees, …
• Data AND compute intense
  • Network (Web, friendship, …) partitioning, finding similarities, centers, hubs, …
  • Singular value decomposition
Hardware
Data intense:
• MapReduce (Hadoop), cloud, …
Compute intense:
• Shared memory
• Message passing
• Processor arrays, … → became an affordable choice recently, as graphics co-processors!
Data AND compute intense??
Big data: Why now?
• Hardware is just getting better, cheaper?
• But data is getting larger, easier to access
• Bad news for algorithms slower than ~ linear
Processor                               Year  No. of elements
4004                                    1971  2,300
8008                                    1972  2,500
8080                                    1974  4,500
8086                                    1978  29,000
Intel 286                               1982  134,000
Intel 386 processor                     1985  275,000
Intel 486 processor                     1989  1,200,000
Intel Pentium processor                 1993  3,100,000
Intel Pentium II processor              1997  7,500,000
Intel Pentium III processor             1999  9,500,000
Intel Pentium 4 processor               2000  42,000,000
Intel Itanium processor                 2001  25,000,000
Intel Itanium 2 processor               2003  220,000,000
Intel Itanium 2 processor (9MB cache)   2004  592,000,000

Moore's Law: doubling in 18 months
But in a key aspect, the trend has changed: from speed to number of cores
„Numbers Everyone Should Know”
RAM
• L1 cache reference: 0.5 ns
• L2 cache reference: 7 ns
• Main memory reference: 100 ns
• Read 1 MB sequentially from memory: 250,000 ns
Intra-process communication
• Mutex lock/unlock: 100 ns
• Read 1 MB sequentially from network: 10,000,000 ns
Disk
• Disk seek: 10,000,000 ns
• Read 1 MB sequentially from disk: 30,000,000 ns
(Jeff Dean, Google)

Typical sizes:
• Disk: 10+ TB
• RAM: 100+ GB
• CPU cache: L2 1+ MB, L1 10+ KB
• GPU onboard memory: global 4-8 GB, block-shared 10+ KB
Back to Databases, this means …
[Chart: transactions/second vs. number of CPUs; linear speed-up (ideal) reaches 1000/sec at 5 CPUs and 2000/sec at 10 CPUs, while the sub-linear curve reaches only 1600/sec at 16 CPUs]
Disadvantages of distributed DBMSs:
• Cost
• Security
• Integrity control more difficult
• Lack of standards
• Lack of experience
• Complexity of management and control
• Increased storage requirements
• Increased training cost
Read 1 MB sequentially from…
• memory: 250,000 ns
• network: 10,000,000 ns
• disk: 30,000,000 ns
[Diagram: shared-nothing nodes, each with its own memory and CPU, vs. a shared-memory multiprocessor with one memory and many CPUs]
Connolly, Begg: Database Systems: A Practical Approach to Design, Implementation, and Management. International Computer Science Series, Pearson Education, 2005
“The brief history of Algorithms”
• Theoretic models: P, NP, PRAM
• Cray: vector processors
• Thinking Machines: hypercube, …
• CM-5: many vector processors
• SIMD, MIMD, message passing
• External memory algorithms
• MapReduce (Google)
• Multi-core, many-core, cloud, flash disk
Earliest history: P, NP
• P: graph traversal, spanning tree
• NP: Steiner trees
[Figure: example graph with edge weights]
Why do we care about graphs, trees?
[Figure: Entity Resolution example records (name, e-mail, ID): Mary / [email protected] / 50071; M. Doe / [email protected] / 79216; Mary Doe / [email protected] / 50071; … / 34302; edges link matching records into clusters 1, 2, 3]
• Image segmentation
• Entity Resolution
History of algs: spanning trees in parallel
• Iterative minimum spanning forest
• Every node is a tree at the start; every iteration merges trees
Bentley: A parallel algorithm for constructing minimum spanning trees, 1980
Harish et al.: Fast Minimum Spanning Tree for Large Graphs on the GPU, 2009
[Figure: example graph with weighted edges]
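The merge-the-trees iteration described above can be sketched as a sequential Borůvka-style loop (the toy graph and all Python names are illustrative; distinct edge weights are assumed to sidestep tie-breaking):

```python
# Boruvka-style minimum spanning forest: every vertex starts as its own
# tree; in each round every tree picks its cheapest outgoing edge and
# the trees joined by those edges are merged (the parallelizable step).
# Assumes distinct edge weights, the standard Boruvka caveat.

def min_spanning_forest(n, edges):
    comp = list(range(n))            # union-find over components

    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]  # path halving
            x = comp[x]
        return x

    forest = []
    while True:
        cheapest = {}                # component -> cheapest outgoing edge
        for w, u, v in edges:
            cu, cv = find(u), find(v)
            if cu == cv:
                continue
            for c in (cu, cv):
                if c not in cheapest or w < cheapest[c][0]:
                    cheapest[c] = (w, u, v)
        if not cheapest:             # no tree has an outgoing edge: done
            return forest
        for w, u, v in cheapest.values():
            if find(u) != find(v):   # merge the two trees
                comp[find(u)] = find(v)
                forest.append((w, u, v))

# Toy graph: edges as (weight, u, v)
edges = [(1, 0, 1), (3, 1, 2), (8, 0, 2), (2, 2, 3)]
msf = min_spanning_forest(4, edges)
```

Each round is exactly the step the Bentley and GPU papers parallelize: the cheapest-edge scan per tree is independent across trees.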
Overview
• Part I: Background
• Part II: Infrastructures
  • NoSQL, Key-value store, Hadoop, H*, Pregel, …
• Part III: Algorithms, Examples, Comparison
  • Data and computation intense tasks, architectures
  • History of Algorithms
  • Web processing, PageRank with algorithms
  • Streaming algorithms
  • Entity resolution – detailed with algorithms
• Summary, conclusions
The Web is about data too (posted by John Klossner on Aug 03, 2009)
• WEB 1.0 (browsers) – users find data
• WEB 2.0 (social networks) – users find each other
• WEB 3.0 (semantic Web) – data find each other
• WEB 4.0 – data create their own Facebook page, restrict friends.
• WEB 5.0 – data decide they can work without humans, create their own language.
• WEB 6.0 – human users realize that they can no longer find data unless invited by data.
• WEB 7.0 – data get cheaper cell phone rates.
• WEB 8.0 – data hoard all the good YouTube videos, leaving human users with access to bad '80s music videos only.
• WEB 9.0 – data create and maintain their own blogs, which are more popular than human blogs.
• WEB 10.0 – all episodes of Battlestar Galactica will now be shown from the Cylons' point of view.
Big Data interpretation: recommenders, personalization, info extraction
Longitudinal Analytics of Web Archive Data
Building a Virtual Web Observatory on large temporal data of Internet archives
Partner approaches to hardware
• Hanzo Archives (UK): Amazon EC2 cloud + S3
• Internet Memory Foundation: 50 low-end servers
• We: indexing 3 TB compressed, 0.5B pages
  • Open source tools not yet mature
  • One week of processing on 50 old dual cores
  • Hardware worth approx. €10,000; Amazon price around €5,000
Text REtrieval Conference measurement
• Documents stored in HBase tables over the Hadoop file system (HDFS)
• Indexing:
  • 200 instances of our own C++ search engine
  • 40 Lucene instances, then the top 50,000 hits with our own engine
  • SolR? Katta? Do they really work in real time? Ranking?
• Realistic: even spam is important!
• Obvious parallelization: each node processes all pages of one host
• Link features (e.g. PageRank) cannot be computed in this way
M. Erdélyi, A. Garzó, and A. A. Benczúr: Web spam classification: a few features worth more (WebQuality 2011)
Distributed storage: HBase vs WARC files
• WARC
  • Many, many medium-sized files: very inefficient with Hadoop
  • Either huge block size, wasting space
  • Or data locality lost, as blocks may continue at a non-local HDFS node
• HBase
  • Data-locality-preserving ranges, in cooperation with Hadoop
  • Experiments up to 3 TB of compressed Web data
• WARC to HBase
  • One-time expensive step, no data locality
  • One-by-one inserts fail: very low performance, HBase insertion rate 100,000 per hour
  • MapReduce jobs to create HFiles, the native HBase format
  • HFile transfer
The Random Surfer Model
Nodes = Web pages
Edges = hyperlinks
Starts at a random page—arrives at quality page
PageRank: The Random Surfer Model
Chooses a random neighbor with probability 1 − ε
The Random Surfer Model
Or with probability ε “teleports” to a random page: gets bored and types a new URL
The Random Surfer Model
And continues with the random walk …
The Random Surfer Model
Until convergence … ?
[Brin, Page 98]
PR(k+1) = PR(k) · ((1 − ε) M + ε · U)
PR(k+1) = PR(1) · ((1 − ε) M + ε · U)^k
PageRank as Quality
A quality page is pointed to by several quality pages
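The iteration above can be sketched directly as power iteration (the three-page toy web and ε = 0.15 are illustrative):

```python
# Power-iteration PageRank for a tiny toy graph.
# eps is the teleport probability; assumes every page has out-links.

def pagerank(out_links, eps=0.15, iters=50):
    n = len(out_links)
    pr = [1.0 / n] * n
    for _ in range(iters):
        nxt = [eps / n] * n              # teleport mass, spread uniformly
        for u, targets in enumerate(out_links):
            share = (1 - eps) * pr[u] / len(targets)
            for v in targets:
                nxt[v] += share          # mass pushed along hyperlinks
        pr = nxt
    return pr

# Toy web: 0 -> 1, 1 -> 2, 2 -> {0, 1}
scores = pagerank([[1], [2], [0, 1]])
```

Page 1 ends up with the highest score: it is pointed to by both other pages, illustrating the “quality pages point to quality pages” reading.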
Or with probability ε “teleports” to a random page selected from her bookmarks
Personalized PageRank
Algorithmics
• Estimated 10+ billions of Web pages worldwide
• PageRank (as floats) fits into 40 GB of storage
• Personalization just to single pages:
  • 10 billions of PageRank scores for each page
  • storage exceeds several exabytes!
• NB single-page personalization is enough, by linearity:
  PPR(a1·v1 + … + ak·vk) = a1·PPR(v1) + … + ak·PPR(vk)
For certain things are just too big?
• For light to reach the other side of the Galaxy … takes rather longer: five hundred thousand years.
• The record for hitch hiking this distance is just under five years, but you don't get to see much on the way.
D. Adams, The Hitchhiker's Guide to the Galaxy, 1979
Markov Chain Monte Carlo
• Reformulation by simple tricks of linear algebra
• From u, simulate N independent random walks
• Database of fingerprints: the ending vertices of the walks from all vertices
• Query: PPR(u,v) := #(walks u→v) / N
• N ≈ 1000 approximates the top 100 well
Fogaras, Rácz: Towards Scaling Fully Personalized PageRank, WAW 2004
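A minimal sketch of the fingerprint idea (the toy graph, seed and function names are illustrative):

```python
import random

def fingerprints(out_links, u, N=1000, eps=0.15, seed=0):
    """Simulate N teleport-terminated walks from u: each step the walk
    continues with probability 1 - eps. Return end-vertex counts."""
    rnd = random.Random(seed)
    counts = {}
    for _ in range(N):
        v = u
        while rnd.random() > eps and out_links[v]:
            v = rnd.choice(out_links[v])    # follow a random out-link
        counts[v] = counts.get(v, 0) + 1    # record the walk's endpoint
    return counts

def ppr_estimate(counts, v, N=1000):
    # PPR(u, v) ~ #(walks from u ending at v) / N
    return counts.get(v, 0) / N

# Toy web: 0 -> 1, 1 -> 2, 2 -> {0, 1}; fingerprint database for u = 0
counts = fingerprints([[1], [2], [0, 1]], u=0)
```

A query then only needs the precomputed counts, not the graph: this is what makes the fingerprint database serveable at Web scale.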
SimRank: similarity in graphs
“Two pages are similar if pointed to by similar pages” [Jeh–Widom KDD 2002]
• Same trick: path-pair summation (can be sampled [Fogaras–Rácz WWW 2005]) over path pairs
  u = w0, w1, …, w(k−1), wk = v1
  u = w'0, w'1, …, w'(k−1), w'k = v2
• DB application e.g.: Yin, Han, Yu. LinkClus: efficient clustering via heterogeneous semantic links, VLDB '06
SimRank(u1, u2) = 1 if u1 = u2, and otherwise
SimRank(u1, u2) = (1 − c) / (indeg(u1) · indeg(u2)) · Σ over in-neighbors v1 of u1 and v2 of u2 of SimRank(v1, v2)
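The sampled variant can be sketched via the first-meeting-time formulation of the path-pair sum (the toy graph, decay constant c and walk count are illustrative; u ≠ v is assumed):

```python
import random

def simrank_mc(in_links, u, v, c=0.8, walks=2000, max_len=10, seed=0):
    """Estimate SimRank(u, v) as E[c^tau], where tau is the first step at
    which two backward random walks started at u and v meet
    (the Fogaras-Racz sampling idea; assumes u != v)."""
    rnd = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        a, b = u, v
        for step in range(1, max_len + 1):
            if not in_links[a] or not in_links[b]:
                break                      # a walk hit a source node
            a = rnd.choice(in_links[a])    # step backwards along in-links
            b = rnd.choice(in_links[b])
            if a == b:
                total += c ** step         # met: contribute c^tau
                break
    return total / walks

# Toy graph by in-neighbor lists: nodes 1 and 2 are both pointed to
# only by node 0, so their backward walks always meet in one step.
in_links = [[], [0], [0]]
s12 = simrank_mc(in_links, 1, 2)
```

On this toy graph every walk pair meets at step 1, so the estimate is exactly c = 0.8.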
Communication complexity bounding
• Bit-vector probing (BVP):
  Alice has a bit vector x = (x1, x2, …, xm); Bob has a number 1 ≤ k ≤ m.
  After B bits of communication, Bob must determine xk.
• Theorem: B ≥ m for any protocol
• Reduction from BVP to Exact-PPR-compare:
  Alice has x = (x1, x2, …, xm), encoded as a graph G with V vertices, where V² = m; she pre-computes an Exact PPR data structure of size D and sends it (D bits).
  Bob has 1 ≤ k ≤ m and vertices u, v, w; comparing PPR(u,v) ? PPR(u,w) answers xk = ?
• Thus D = B ≥ m = V²
Theory of Streaming algorithms
• Distinct values example – Motwani slides
• Sequential, RAM algorithms
• External memory algorithms
• Sampling: a negative result
• The “sketching” technique
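The distinct-values example can be illustrated with a Flajolet–Martin style sketch (the hash construction, sketch count and the φ ≈ 0.77351 correction follow the classic paper; the data and names are illustrative):

```python
import hashlib

def fm_distinct(stream, sketches=64):
    """Flajolet-Martin distinct-count sketch: per hash function, track R,
    the largest number of trailing zero bits seen in any item's hash.
    Memory is O(sketches), independent of the number of distinct items."""
    max_r = [0] * sketches
    for item in stream:
        for i in range(sketches):
            digest = hashlib.blake2b(f"{i}:{item}".encode(),
                                     digest_size=8).digest()
            h = int.from_bytes(digest, "big")
            r = (h & -h).bit_length() - 1 if h else 64  # trailing zeros
            if r > max_r[i]:
                max_r[i] = r
    avg_r = sum(max_r) / sketches
    return 2 ** avg_r / 0.77351       # phi correction from the FM paper

# 10,000 stream items but only 100 distinct values
est = fm_distinct(str(i % 100) for i in range(10_000))
```

The estimate lands near 100 using only 64 small counters, which is the point: exact distinct counting needs memory linear in the number of distinct values.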
Overview
• Part I: Background
  • Examples
  • Scientific and Business Relevance
• Part II: Foundations Illustrated
  • Data and computation intense tasks, architectures
  • History of Algorithms
  • Web processing, PageRank with algorithms
  • Streaming algorithms
  • Entity resolution – detailed with algorithms
• Summary, conclusions
Distributed Computing Paradigms and Tools
• Distributed Key-Value Stores: distributed B-tree index for all attributes (Project Voldemort)
• MapReduce: map → reduce operations (Apache Hadoop)
• Bulk Synchronous Parallel: supersteps of computation → communication → barrier sync (Apache Hama)
Entity resolution
Records with attributes (A1, A2, A3) – r1, …, r5:
  r1: Mary Major, 09.12.1979, 50071, …
  r2: J. Doe, 23.04.1965, 79216, …
  r3: John Doe, 23.04.1965, 79216, …
  r4: Richard Miles, 31.09.1980, 34302, …
  r5: Richard G. Miles, 21.09.1980, 34302, …
Entities formed by sets of records: e1 = {r1}, e2 = {r2, r3}, e3 = {r4, r5}
Entity Resolution problem = partition the records into customers, …
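A minimal sketch of the partitioning view: records sharing any attribute value are merged with union-find (the records and exact-match rule are illustrative; real ER uses fuzzier similarity):

```python
# Entity resolution as partitioning: merge records that share any
# attribute value, using union-find over record indices.

def resolve(records):
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    seen = {}                               # attribute value -> first record
    for i, rec in enumerate(records):
        for value in rec:
            if value in seen:
                parent[find(i)] = find(seen[value])   # union the two trees
            else:
                seen[value] = i
    return [find(i) for i in range(len(records))]     # entity label per record

# Toy records (name, ID); r2 and r3 share an ID, so they merge
records = [
    ("Mary Major", "50071"),
    ("J. Doe", "79216"),
    ("John Doe", "79216"),
    ("Richard Miles", "34302"),
]
labels = resolve(records)
```

Records end up in the same entity exactly when they are connected through shared values, which is the record-graph view the following slides build on.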
A communication complexity lower bound
• Set intersection cannot be decided by communicating less than Θ(n) bits [Kalyanasundaram, Schnitger 1992]
• Implication: if the data is spread over multiple servers, one needs Θ(n) communication to decide whether it may have a duplicate with another node
• The best we can do is communicate all the data
• Related area: Locality Sensitive Hashing
  • no LSH for “minimum”, i.e. to decide if two attributes agree
  • similar to negative results on Donoho's zero “norm” (number of non-zero coordinates)
Wait – how about blocking?
• Blocking speeds up shared-memory parallel algorithms [many in the literature, e.g. Whang, Menestrina, Koutrika, Theobald, Garcia-Molina: ER with Iterative Blocking, 2009]
• Even if we could partition the data with no duplicates split, the lower bound applies just to test that we are done
• Still, blocking is a good idea – otherwise we may have to communicate much more than Θ(n) bits
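Blocking itself is easy to sketch: compare only the record pairs that share a blocking key (the key function and toy records are illustrative):

```python
from itertools import combinations

def block_candidates(records, key):
    """Blocking: generate candidate pairs only within blocks of records
    sharing a blocking key, instead of all O(n^2) pairs."""
    blocks = {}
    for i, rec in enumerate(records):
        blocks.setdefault(key(rec), []).append(i)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))   # pairs inside one block
    return pairs

records = [("Doe", "79216"), ("Doe", "79217"), ("Miles", "34302")]
# Block on the first three letters of the surname (an illustrative key)
pairs = block_candidates(records, key=lambda r: r[0][:3])
```

Only the two "Doe" records form a candidate pair; the cross-block comparisons are skipped entirely, which is where the speed-up comes from.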
Distributed Key-Value Store (KVS)
• The record graph can be served from the KVS
• The KVS mainly provides random access to a huge graph that would not fit in main memory
• Computing nodes and many indexing nodes implement a graph traversal
• Breadth-First Search (BFS) with queues; spanning forest with Union-Find
• Basic textbook algorithms [Cormen, Leiserson, Rivest, …]
MapReduce (MR)
• Sorting as the prime MR app
  • For each feature, sort to form a graph of records
• MR has no data locality
  • All data is moved around the network in each iteration
• Find connected components of the graph
  • Again, the textbook matrix power method can be implemented in MR
  • Iterated matrix multiplication is not MR-friendly [Kang, Tsourakakis, Faloutsos: Pegasus framework, 2009]
  • This step will move huge amounts around: the whole data, even if we have small components only
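The connected-components step can be sketched in a map/reduce shape, with one label-propagation round per MR iteration (a single-process toy in the spirit of the MR formulation, not actual Hadoop code):

```python
# Connected components by iterated minimum-label propagation, written
# as explicit map and reduce phases. Note that every iteration re-emits
# all edges and labels: the "whole data moves each round" problem.

def map_phase(labels, edges):
    for u, v in edges:                 # each edge emits both endpoint labels
        yield u, labels[v]
        yield v, labels[u]
    for node, lab in labels.items():   # every node also keeps its own label
        yield node, lab

def reduce_phase(pairs):
    best = {}
    for node, lab in pairs:            # reducer keeps the minimum label
        if node not in best or lab < best[node]:
            best[node] = lab
    return best

def connected_components(nodes, edges):
    labels = {n: n for n in nodes}
    while True:
        new = reduce_phase(map_phase(labels, edges))
        if new == labels:              # fixpoint: components are labeled
            return labels
        labels = new

comp = connected_components(range(5), [(0, 1), (1, 2), (3, 4)])
```

The number of rounds grows with the component diameter, and every round shuffles all edges, which is exactly the inefficiency the slide points out.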
Bulk Synchronous Parallel (BSP)
• One master node + several processing nodes perform a merge sort
• Algorithm:
  • resolve local data at each node locally
  • send attribute values in sorted order
  • the central server identifies and sends back candidates
  • find connected components
• This is the prominent BSP app; very fast
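The sorted-order exchange can be sketched as a merge of per-node sorted streams (toy data and names; the real system runs this across HAMA supersteps):

```python
import heapq

# BSP-style candidate generation sketch: each processing node sends its
# attribute values in sorted order; the master merge-sorts the streams
# and flags values that appear on more than one node as candidates.

def candidates(node_values):
    streams = [
        sorted((v, node_id) for v in values)
        for node_id, values in enumerate(node_values)
    ]
    merged = heapq.merge(*streams)     # the master's merge-sort step
    out, prev = set(), None
    for value, node_id in merged:
        if prev is not None and value == prev[0] and node_id != prev[1]:
            out.add(value)             # same value seen on two nodes
        prev = (value, node_id)
    return out

# Two nodes holding the attribute values of their local records
cands = candidates([["79216", "50071"], ["79216", "34302"]])
```

Because each stream arrives pre-sorted, the master does a single linear merge pass: no node ever sends more than its own data, matching the Θ(n) communication bound from the previous slides.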
Experiments: scalability
• 15 older blade servers, 4 GB memory, 3 GHz CPU each
• Insurance client dataset (~2 records per entity)
Hadoop phases
HAMA phases
Conclusions
• Big Data is founded on several subfields
  • Architectures – processor arrays, many-core now affordable
  • Algorithms – design principles from the '90s
  • Databases – distributed, column oriented, NoSQL
  • Data mining, Information retrieval, Machine learning, Networks – for the top application
• Hadoop and Stream processing are the two main efficient techniques
  • Limitations for data AND compute intense problems
  • Many emerging alternatives (e.g. BSP)
• Selecting the right architecture is often the question
Questions?
András Benczúr
Head, Informatics Laboratory
Institute for Computer Science and Control, Hungarian Academy of Sciences
http://datamining.sztaki.hu/
Email: [email protected]