„Big data”
Benczúr András, MTA SZTAKI
Benczúr – Big Data, Szeged, 23 March 2012
Big Data – the new hype
• “big data” is when the size of the data itself becomes part of the problem
• “big data” is data that becomes large enough that it cannot be processed using conventional methods
• Google sorts 1PB in 33 minutes (07-09-2011)
• Amazon S3 store contains 499B objects (19-07-2011)
• New Relic: 20B+ application metrics/day (18-07-2011)
• Walmart monitors 100M entities in real time (12-09-2011)
Source: The Emerging Big Data slide from the Intelligent Information Management DG INFSO/E2 Objective ICT-2011.4.4 Info day in Luxembourg on 26 September 2011
Big Data Planes
[Chart: the Big Data tool landscape plotted along two axes, Speed (batch → real time) and Size (MegaByte → PetaByte). "Fast data" (real time): KDB, Esper, Progress, S4; "Big analytics" (batch, large size): Hadoop, MapR, HBase, Netezza, Vertica, Greenplum, InfoBright, MySQL, GraphLab, MOA; analytics tools: Matlab, R, Mahout, SciPy, Revolution, SAS, SPSS; "Big Data Services": custom hardware and custom software; example applications: power station sensors, IT logs, Web content, fraud detection, online reputation, news curation, navigation/mobility, media pricing.]
Overview
• Introduction – Buzzwords
• Part I: Background
  • Examples: mobility and navigation traces; sensors, smart city; IT logs; wind power
  • Scientific and Business Relevance
• Part II: Infrastructures
  • NoSQL, Key-value store, Hadoop, H*, Pregel, …
• Part III: Algorithms
  • Brief History of Algorithms
  • Web processing, PageRank with algorithms
  • Streaming algorithms
  • Entity resolution – detailed comparison (QDB 2011)
Navigation and Mobility Traces
• Streaming data at mobile base stations
• Privacy issues
  • Regulations allow only anonymized data to leave the network-operations and billing domain
  • Do regulation policy makers know about deanonymization attacks?
  • What your wife/husband will not know, your mobile provider will
Sensors – smart home, city, country, …
• Road and parking slot sensors
• Mobile parking traces
• Public transport, Oyster cards
• Bike hire schemes
Source: Internet of Things Comic Book, http://www.smartsantander.eu/images/IoT_Comic_Book.pdf
… even agriculture …
Non-conform situation detection: estimation of the gearbox bearing temperature by a neural network model
(Model validity: ambient temperature between 4 and 10 °C)
[Chart: model estimation error (%), limit ±17%, and temperatures plotted over one year; series: model inputs 1 and 2, modeled vs. measured gearbox bearing temperature, ambient temperature (for model validity), error %]
… and wind power stations
Our experience: 30-100+ GB/day, 3-60 M events
Corporate IT log processing
Aggregation into Data Warehouse
Identify bottlenecks, optimize procedures
Detect misuse, fraud, attacks
Traditional methods fail
Scientific and business relevance
• VLDB 2011 (~100 papers):
  • 6 papers on MapReduce/Hadoop, 10 on big data (+ keynote), 11 on NoSQL architectures, 6 on GPS/sensory data
  • tutorials, demos (Microsoft, SAP, IBM NoSQL tools)
  • sessions: Big Data Analysis, MapReduce, Scalable Infrastructures
• EWEA 2011: 28% of papers on wind power raise data size issues
• SIGMOD 2011: out of 70 papers, 10 on new architectures and extensions for analytics
• Gartner 2011 trend No. 5: Next Generation Analytics – "significant changes to existing operational and business intelligence infrastructures"
• The Economist, 27 February 2010: "Monstrous amounts of data … Information is transforming traditional businesses"
• News special issue on Big Data this April
New challenges in database technologies
Question of research and practice: Applicability to a specific problem? Applicability as a general technique?
Overview
• Part I: Background
  • Examples
  • Scientific and Business Relevance
• Part II: Infrastructures
  • NoSQL
  • Key-value stores
  • Hadoop and tools built on Hadoop
  • Bulk Synchronous Parallel, Pregel
  • Streaming, S4
• Part III: Algorithms, Examples
Now come many external slide shows …
• NoSQL introduction – www.intertech.com/resource/usergroup/NoSQL.ppt
• Key-value stores
  • BerkeleyDB – not distributed
  • Voldemort – behemoth.strlen.net/~alex/voldemort-nosql_live.ppt
  • Cassandra, Dynamo, …
  • also exists on top of Hadoop (below): HBase
• Hadoop – Erdélyi Miki's slides
• HBase – datasearch.ruc.edu.cn/course/cloudcomputing20102/slides/Lec07.ppt
• Cascading – not covered
• Mahout – cwiki.apache.org/MAHOUT/faq.data/Mahout%20Overview.ppt
• Why is something else needed? What else?
• Bulk Synchronous Parallel
• GraphLab – Danny Bickson's slides
• MOA – http://www.slideshare.net/abifet/moa-5636332/download
• Streaming
  • S4 – http://www.slideshare.net/alekbr/s4-stream-computing-platform
Bulk Synchronous Parallel architecture
HAMA: a Pregel clone
Use of large matrices
• Main step in all distributed algorithms
• Network based features in classification
• Partitioning for efficient algorithms
• Exploring the data, navigation (e.g. ranking to select a nice compact subgraph)
• Hadoop apps (e.g. PageRank) move the entire data around in each iteration
• Baseline C++ code keeps data local
[Chart legend: Hadoop; Hadoop + KeyValue store; Best C++ custom code]
BSP vs. MapReduce
• MapReduce: data locality is not preserved between Map and Reduce invocations or across MapReduce iterations.
• BSP: tailored towards processing data with locality.
  • Proprietary: Google Pregel
  • Open source (eventually? several flaws at present): HAMA
  • Home-developed C++ code base
• Both: easy parallelization and distribution.
Sidlo et al., Infrastructures and bounds for distributed entity resolution. QDB 2011
Overview
• Part I: Background
• Part II: Infrastructures
  • NoSQL, Key-value store, Hadoop, H*, Pregel, …
• Part III: Algorithms, Examples, Comparison
  • Data and computation intense tasks, architectures
  • History of Algorithms
  • Web processing, PageRank with algorithms
  • Entity resolution – detailed with algorithms
• Summary, conclusions
Types of Big Data problems
• Data intense
  • Web processing, info retrieval, classification
  • Log processing (telco, IT, supermarket, …)
• Compute intense
  • Expectation Maximization, Gaussian mixture decomposition, image retrieval, …
  • Genome matching, phylogenetic trees, …
• Data AND compute intense
  • Network (Web, friendship, …) partitioning, finding similarities, centers, hubs, …
  • Singular value decomposition
Hardware
Data intense:
• MapReduce (Hadoop), cloud, …
Compute intense:
• Shared memory
• Message passing
• Processor arrays, … → became an affordable choice recently, as graphics co-processors!
Data AND compute intense??
Big data: Why now?
• Hardware is just getting better, cheaper?
• But data is getting larger, easier to access
• Bad news for algorithms slower than ~ linear
Processor                               Year  No. of elements
4004                                    1971  2,300
8008                                    1972  2,500
8080                                    1974  4,500
8086                                    1978  29,000
Intel 286                               1982  134,000
Intel 386 processor                     1985  275,000
Intel 486 processor                     1989  1,200,000
Intel Pentium processor                 1993  3,100,000
Intel Pentium II processor              1997  7,500,000
Intel Pentium III processor             1999  9,500,000
Intel Pentium 4 processor               2000  42,000,000
Intel Itanium processor                 2001  25,000,000
Intel Itanium 2 processor               2003  220,000,000
Intel Itanium 2 processor (9MB cache)   2004  592,000,000

Moore's Law: doubling in 18 months
But in a key aspect, the trend has changed: from speed to number of cores
„Numbers Everyone Should Know”
RAM
• L1 cache reference: 0.5 ns
• L2 cache reference: 7 ns
• Main memory reference: 100 ns
• Read 1 MB sequentially from memory: 250,000 ns
Intra-process communication
• Mutex lock/unlock: 100 ns
• Read 1 MB sequentially from network: 10,000,000 ns
Disk
• Disk seek: 10,000,000 ns
• Read 1 MB sequentially from disk: 30,000,000 ns
(Jeff Dean, Google)

Typical sizes:
• Disk: 10+ TB
• RAM: 100+ GB
• CPU cache: L2 1+ MB, L1 10+ KB
• GPU onboard memory: global 4-8 GB, block-shared 10+ KB
Back to Databases, this means …
[Chart: transactions/second vs. number of CPUs; linear speed-up (ideal) reaches 1000/sec at 5 CPUs and 2000/sec at 10 CPUs, while the sub-linear curve reaches only 1600/sec at 16 CPUs]
Disadvantages of distributed DBMSs:
• Cost
• Security
• Integrity control more difficult
• Lack of standards
• Lack of experience
• Complexity of management and control
• Increased storage requirements
• Increased training cost
Read 1 MB sequentially from…
• memory: 250,000 ns
• network: 10,000,000 ns
• disk: 30,000,000 ns
[Diagram: shared-nothing nodes, each with its own memory and CPU, vs. a shared-memory multiprocessor with one memory and many CPUs]
Connolly, Begg: Database Systems: A Practical Approach to Design, Implementation, and Management. International Computer Science Series, Pearson Education, 2005
“The brief history of Algorithms”
• Theoretic models: P, NP, PRAM
• Cray: vector processors
• Thinking Machines: hypercube, …
• CM-5: many vector processors
• SIMD, MIMD, message passing
• External memory algorithms
• MapReduce (Google)
• Multi-core, many-core, cloud, flash disk
Earliest history: P, NP
• P: graph traversal, spanning tree
• NP: Steiner trees
[Figure: example graph with edge weights]
Why do we care about graphs, trees?
[Figure: Entity Resolution example records (name, e-mail, ID): Mary / [email protected] / 50071; M. Doe / [email protected] / 79216; Mary Doe / [email protected] / 50071; … / 34302; edges link matching records into clusters 1, 2, 3]
• Image segmentation
• Entity Resolution
History of algs: spanning trees in parallel
• Iterative minimum spanning forest
• Every node is a tree at the start; every iteration merges trees
Bentley: A parallel algorithm for constructing minimum spanning trees, 1980
Harish et al.: Fast Minimum Spanning Tree for Large Graphs on the GPU, 2009
[Figure: example graph with weighted edges]
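The merge-the-trees iteration described above can be sketched as a sequential Borůvka-style loop (the toy graph and all Python names are illustrative; distinct edge weights are assumed to sidestep tie-breaking):

```python
# Boruvka-style minimum spanning forest: every vertex starts as its own
# tree; in each round every tree picks its cheapest outgoing edge and
# the trees joined by those edges are merged (the parallelizable step).
# Assumes distinct edge weights, the standard Boruvka caveat.

def min_spanning_forest(n, edges):
    comp = list(range(n))            # union-find over components

    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]  # path halving
            x = comp[x]
        return x

    forest = []
    while True:
        cheapest = {}                # component -> cheapest outgoing edge
        for w, u, v in edges:
            cu, cv = find(u), find(v)
            if cu == cv:
                continue
            for c in (cu, cv):
                if c not in cheapest or w < cheapest[c][0]:
                    cheapest[c] = (w, u, v)
        if not cheapest:             # no tree has an outgoing edge: done
            return forest
        for w, u, v in cheapest.values():
            if find(u) != find(v):   # merge the two trees
                comp[find(u)] = find(v)
                forest.append((w, u, v))

# Toy graph: edges as (weight, u, v)
edges = [(1, 0, 1), (3, 1, 2), (8, 0, 2), (2, 2, 3)]
msf = min_spanning_forest(4, edges)
```

Each round is exactly the step the Bentley and GPU papers parallelize: the cheapest-edge scan per tree is independent across trees.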
Overview
• Part I: Background
• Part II: Infrastructures
  • NoSQL, Key-value store, Hadoop, H*, Pregel, …
• Part III: Algorithms, Examples, Comparison
  • Data and computation intense tasks, architectures
  • History of Algorithms
  • Web processing, PageRank with algorithms
  • Streaming algorithms
  • Entity resolution – detailed with algorithms
• Summary, conclusions
The Web is about data too (posted by John Klossner on Aug 03, 2009)
• WEB 1.0 (browsers) – users find data
• WEB 2.0 (social networks) – users find each other
• WEB 3.0 (semantic Web) – data find each other
• WEB 4.0 – data create their own Facebook page, restrict friends.
• WEB 5.0 – data decide they can work without humans, create their own language.
• WEB 6.0 – human users realize that they can no longer find data unless invited by data.
• WEB 7.0 – data get cheaper cell phone rates.
• WEB 8.0 – data hoard all the good YouTube videos, leaving human users with access to bad '80s music videos only.
• WEB 9.0 – data create and maintain their own blogs, which are more popular than human blogs.
• WEB 10.0 – all episodes of Battlestar Galactica will now be shown from the Cylons' point of view.
Big Data interpretation: recommenders, personalization, info extraction
Longitudinal Analytics of Web Archive Data
Building a Virtual Web Observatory on large temporal data of Internet archives
Partner approaches to hardware
• Hanzo Archives (UK): Amazon EC2 cloud + S3
• Internet Memory Foundation: 50 low-end servers
• We: indexing 3 TB compressed, 0.5B pages
  • Open source tools not yet mature
  • One week of processing on 50 old dual cores
  • Hardware worth approx. €10,000; Amazon price around €5,000
Text REtrieval Conference measurement
• Documents stored in HBase tables over the Hadoop file system (HDFS)
• Indexing:
  • 200 instances of our own C++ search engine
  • 40 Lucene instances, then the top 50,000 hits with our own engine
  • SolR? Katta? Do they really work in real time? Ranking?
• Realistic: even spam is important!
• Obvious parallelization: each node processes all pages of one host
• Link features (e.g. PageRank) cannot be computed in this way
M. Erdélyi, A. Garzó, and A. A. Benczúr: Web spam classification: a few features worth more (WebQuality 2011)
Distributed storage: HBase vs WARC files
• WARC
  • Many, many medium-sized files: very inefficient with Hadoop
  • Either huge block size, wasting space
  • Or data locality lost, as blocks may continue at a non-local HDFS node
• HBase
  • Data-locality-preserving ranges, in cooperation with Hadoop
  • Experiments up to 3 TB of compressed Web data
• WARC to HBase
  • One-time expensive step, no data locality
  • One-by-one inserts fail: very low performance, HBase insertion rate 100,000 per hour
  • MapReduce jobs to create HFiles, the native HBase format
  • HFile transfer
The Random Surfer Model
Nodes = Web pages
Edges = hyperlinks
Starts at a random page—arrives at quality page
PageRank: The Random Surfer Model
Chooses a random neighbor with probability 1 − ε
The Random Surfer Model
Or with probability ε “teleports” to a random page: gets bored and types a new URL
The Random Surfer Model
And continues with the random walk …
The Random Surfer Model
Until convergence … ?
[Brin, Page 98]
PR(k+1) = PR(k) · ((1 − ε) M + ε · U)
PR(k+1) = PR(1) · ((1 − ε) M + ε · U)^k
PageRank as Quality
A quality page is pointed to by several quality pages
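The iteration above can be sketched directly as power iteration (the three-page toy web and ε = 0.15 are illustrative):

```python
# Power-iteration PageRank for a tiny toy graph.
# eps is the teleport probability; assumes every page has out-links.

def pagerank(out_links, eps=0.15, iters=50):
    n = len(out_links)
    pr = [1.0 / n] * n
    for _ in range(iters):
        nxt = [eps / n] * n              # teleport mass, spread uniformly
        for u, targets in enumerate(out_links):
            share = (1 - eps) * pr[u] / len(targets)
            for v in targets:
                nxt[v] += share          # mass pushed along hyperlinks
        pr = nxt
    return pr

# Toy web: 0 -> 1, 1 -> 2, 2 -> {0, 1}
scores = pagerank([[1], [2], [0, 1]])
```

Page 1 ends up with the highest score: it is pointed to by both other pages, illustrating the “quality pages point to quality pages” reading.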
Or with probability ε “teleports” to a random page selected from her bookmarks
Personalized PageRank
Algorithmics
• Estimated 10+ billions of Web pages worldwide
• PageRank (as floats) fits into 40 GB of storage
• Personalization just to single pages:
  • 10 billions of PageRank scores for each page
  • storage exceeds several exabytes!
• NB single-page personalization is enough, by linearity:
  PPR(a1·v1 + … + ak·vk) = a1·PPR(v1) + … + ak·PPR(vk)
For certain things are just too big?
• For light to reach the other side of the Galaxy … takes rather longer: five hundred thousand years.
• The record for hitch hiking this distance is just under five years, but you don't get to see much on the way.
D. Adams, The Hitchhiker's Guide to the Galaxy, 1979
Markov Chain Monte Carlo
• Reformulation by simple tricks of linear algebra
• From u, simulate N independent random walks
• Database of fingerprints: the ending vertices of the walks from all vertices
• Query: PPR(u,v) := #(walks u→v) / N
• N ≈ 1000 approximates the top 100 well
Fogaras, Rácz: Towards Scaling Fully Personalized PageRank, WAW 2004
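A minimal sketch of the fingerprint idea (the toy graph, seed and function names are illustrative):

```python
import random

def fingerprints(out_links, u, N=1000, eps=0.15, seed=0):
    """Simulate N teleport-terminated walks from u: each step the walk
    continues with probability 1 - eps. Return end-vertex counts."""
    rnd = random.Random(seed)
    counts = {}
    for _ in range(N):
        v = u
        while rnd.random() > eps and out_links[v]:
            v = rnd.choice(out_links[v])    # follow a random out-link
        counts[v] = counts.get(v, 0) + 1    # record the walk's endpoint
    return counts

def ppr_estimate(counts, v, N=1000):
    # PPR(u, v) ~ #(walks from u ending at v) / N
    return counts.get(v, 0) / N

# Toy web: 0 -> 1, 1 -> 2, 2 -> {0, 1}; fingerprint database for u = 0
counts = fingerprints([[1], [2], [0, 1]], u=0)
```

A query then only needs the precomputed counts, not the graph: this is what makes the fingerprint database serveable at Web scale.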
SimRank: similarity in graphs
“Two pages are similar if pointed to by similar pages” [Jeh–Widom KDD 2002]
• Same trick: path-pair summation (can be sampled [Fogaras–Rácz WWW 2005]) over path pairs
  u = w0, w1, …, w(k−1), wk = v1
  u = w'0, w'1, …, w'(k−1), w'k = v2
• DB application e.g.: Yin, Han, Yu. LinkClus: efficient clustering via heterogeneous semantic links, VLDB '06
SimRank(u1, u2) = 1 if u1 = u2, and otherwise
SimRank(u1, u2) = (1 − c) / (indeg(u1) · indeg(u2)) · Σ over in-neighbors v1 of u1 and v2 of u2 of SimRank(v1, v2)
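The sampled variant can be sketched via the first-meeting-time formulation of the path-pair sum (the toy graph, decay constant c and walk count are illustrative; u ≠ v is assumed):

```python
import random

def simrank_mc(in_links, u, v, c=0.8, walks=2000, max_len=10, seed=0):
    """Estimate SimRank(u, v) as E[c^tau], where tau is the first step at
    which two backward random walks started at u and v meet
    (the Fogaras-Racz sampling idea; assumes u != v)."""
    rnd = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        a, b = u, v
        for step in range(1, max_len + 1):
            if not in_links[a] or not in_links[b]:
                break                      # a walk hit a source node
            a = rnd.choice(in_links[a])    # step backwards along in-links
            b = rnd.choice(in_links[b])
            if a == b:
                total += c ** step         # met: contribute c^tau
                break
    return total / walks

# Toy graph by in-neighbor lists: nodes 1 and 2 are both pointed to
# only by node 0, so their backward walks always meet in one step.
in_links = [[], [0], [0]]
s12 = simrank_mc(in_links, 1, 2)
```

On this toy graph every walk pair meets at step 1, so the estimate is exactly c = 0.8.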
Communication complexity bounding
• Bit-vector probing (BVP):
  Alice has a bit vector x = (x1, x2, …, xm); Bob has a number 1 ≤ k ≤ m.
  After B bits of communication, Bob must determine xk.
• Theorem: B ≥ m for any protocol
• Reduction from BVP to Exact-PPR-compare:
  Alice has x = (x1, x2, …, xm), encoded as a graph G with V vertices, where V² = m; she pre-computes an Exact PPR data structure of size D and sends it (D bits).
  Bob has 1 ≤ k ≤ m and vertices u, v, w; comparing PPR(u,v) ? PPR(u,w) answers xk = ?
• Thus D = B ≥ m = V²
Theory of Streaming algorithms
• Distinct values example – Motwani slides
• Sequential, RAM algorithms
• External memory algorithms
• Sampling: a negative result
• The “sketching” technique
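The distinct-values example can be illustrated with a Flajolet–Martin style sketch (the hash construction, sketch count and the φ ≈ 0.77351 correction follow the classic paper; the data and names are illustrative):

```python
import hashlib

def fm_distinct(stream, sketches=64):
    """Flajolet-Martin distinct-count sketch: per hash function, track R,
    the largest number of trailing zero bits seen in any item's hash.
    Memory is O(sketches), independent of the number of distinct items."""
    max_r = [0] * sketches
    for item in stream:
        for i in range(sketches):
            digest = hashlib.blake2b(f"{i}:{item}".encode(),
                                     digest_size=8).digest()
            h = int.from_bytes(digest, "big")
            r = (h & -h).bit_length() - 1 if h else 64  # trailing zeros
            if r > max_r[i]:
                max_r[i] = r
    avg_r = sum(max_r) / sketches
    return 2 ** avg_r / 0.77351       # phi correction from the FM paper

# 10,000 stream items but only 100 distinct values
est = fm_distinct(str(i % 100) for i in range(10_000))
```

The estimate lands near 100 using only 64 small counters, which is the point: exact distinct counting needs memory linear in the number of distinct values.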
Overview
• Part I: Background
  • Examples
  • Scientific and Business Relevance
• Part II: Foundations Illustrated
  • Data and computation intense tasks, architectures
  • History of Algorithms
  • Web processing, PageRank with algorithms
  • Streaming algorithms
  • Entity resolution – detailed with algorithms
• Summary, conclusions
Distributed Computing Paradigms and Tools
• Distributed Key-Value Stores: distributed B-tree index for all attributes (Project Voldemort)
• MapReduce: map → reduce operations (Apache Hadoop)
• Bulk Synchronous Parallel: supersteps of computation → communication → barrier sync (Apache Hama)
Entity resolution
Records with attributes (A1, A2, A3) – r1, …, r5:
  r1: Mary Major, 09.12.1979, 50071, …
  r2: J. Doe, 23.04.1965, 79216, …
  r3: John Doe, 23.04.1965, 79216, …
  r4: Richard Miles, 31.09.1980, 34302, …
  r5: Richard G. Miles, 21.09.1980, 34302, …
Entities formed by sets of records: e1 = {r1}, e2 = {r2, r3}, e3 = {r4, r5}
Entity Resolution problem = partition the records into customers, …
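A minimal sketch of the partitioning view: records sharing any attribute value are merged with union-find (the records and exact-match rule are illustrative; real ER uses fuzzier similarity):

```python
# Entity resolution as partitioning: merge records that share any
# attribute value, using union-find over record indices.

def resolve(records):
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    seen = {}                               # attribute value -> first record
    for i, rec in enumerate(records):
        for value in rec:
            if value in seen:
                parent[find(i)] = find(seen[value])   # union the two trees
            else:
                seen[value] = i
    return [find(i) for i in range(len(records))]     # entity label per record

# Toy records (name, ID); r2 and r3 share an ID, so they merge
records = [
    ("Mary Major", "50071"),
    ("J. Doe", "79216"),
    ("John Doe", "79216"),
    ("Richard Miles", "34302"),
]
labels = resolve(records)
```

Records end up in the same entity exactly when they are connected through shared values, which is the record-graph view the following slides build on.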
A communication complexity lower bound
• Set intersection cannot be decided by communicating less than Θ(n) bits [Kalyanasundaram, Schnitger 1992]
• Implication: if the data is spread over multiple servers, one needs Θ(n) communication to decide whether it may have a duplicate with another node
• The best we can do is communicate all the data
• Related area: Locality Sensitive Hashing
  • no LSH for “minimum”, i.e. to decide if two attributes agree
  • similar to negative results on Donoho's zero “norm” (number of non-zero coordinates)
Wait – how about blocking?
• Blocking speeds up shared-memory parallel algorithms [many in the literature, e.g. Whang, Menestrina, Koutrika, Theobald, Garcia-Molina: ER with Iterative Blocking, 2009]
• Even if we could partition the data with no duplicates split, the lower bound applies just to test that we are done
• Still, blocking is a good idea – otherwise we may have to communicate much more than Θ(n) bits
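Blocking itself is easy to sketch: compare only the record pairs that share a blocking key (the key function and toy records are illustrative):

```python
from itertools import combinations

def block_candidates(records, key):
    """Blocking: generate candidate pairs only within blocks of records
    sharing a blocking key, instead of all O(n^2) pairs."""
    blocks = {}
    for i, rec in enumerate(records):
        blocks.setdefault(key(rec), []).append(i)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))   # pairs inside one block
    return pairs

records = [("Doe", "79216"), ("Doe", "79217"), ("Miles", "34302")]
# Block on the first three letters of the surname (an illustrative key)
pairs = block_candidates(records, key=lambda r: r[0][:3])
```

Only the two "Doe" records form a candidate pair; the cross-block comparisons are skipped entirely, which is where the speed-up comes from.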
Distributed Key-Value Store (KVS)
• The record graph can be served from the KVS
• The KVS mainly provides random access to a huge graph that would not fit in main memory
• Computing nodes and many indexing nodes implement a graph traversal
• Breadth-First Search (BFS) with queues; spanning forest with Union-Find
• Basic textbook algorithms [Cormen, Leiserson, Rivest, …]
MapReduce (MR)
• Sorting as the prime MR app
  • For each feature, sort to form a graph of records
• MR has no data locality
  • All data is moved around the network in each iteration
• Find connected components of the graph
  • Again, the textbook matrix power method can be implemented in MR
  • Iterated matrix multiplication is not MR-friendly [Kang, Tsourakakis, Faloutsos: Pegasus framework, 2009]
  • This step will move huge amounts around: the whole data, even if we have small components only
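The connected-components step can be sketched in a map/reduce shape, with one label-propagation round per MR iteration (a single-process toy in the spirit of the MR formulation, not actual Hadoop code):

```python
# Connected components by iterated minimum-label propagation, written
# as explicit map and reduce phases. Note that every iteration re-emits
# all edges and labels: the "whole data moves each round" problem.

def map_phase(labels, edges):
    for u, v in edges:                 # each edge emits both endpoint labels
        yield u, labels[v]
        yield v, labels[u]
    for node, lab in labels.items():   # every node also keeps its own label
        yield node, lab

def reduce_phase(pairs):
    best = {}
    for node, lab in pairs:            # reducer keeps the minimum label
        if node not in best or lab < best[node]:
            best[node] = lab
    return best

def connected_components(nodes, edges):
    labels = {n: n for n in nodes}
    while True:
        new = reduce_phase(map_phase(labels, edges))
        if new == labels:              # fixpoint: components are labeled
            return labels
        labels = new

comp = connected_components(range(5), [(0, 1), (1, 2), (3, 4)])
```

The number of rounds grows with the component diameter, and every round shuffles all edges, which is exactly the inefficiency the slide points out.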
Bulk Synchronous Parallel (BSP)
• One master node + several processing nodes perform a merge sort
• Algorithm:
  • resolve local data at each node locally
  • send attribute values in sorted order
  • the central server identifies and sends back candidates
  • find connected components
• This is the prominent BSP app; very fast
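The sorted-order exchange can be sketched as a merge of per-node sorted streams (toy data and names; the real system runs this across HAMA supersteps):

```python
import heapq

# BSP-style candidate generation sketch: each processing node sends its
# attribute values in sorted order; the master merge-sorts the streams
# and flags values that appear on more than one node as candidates.

def candidates(node_values):
    streams = [
        sorted((v, node_id) for v in values)
        for node_id, values in enumerate(node_values)
    ]
    merged = heapq.merge(*streams)     # the master's merge-sort step
    out, prev = set(), None
    for value, node_id in merged:
        if prev is not None and value == prev[0] and node_id != prev[1]:
            out.add(value)             # same value seen on two nodes
        prev = (value, node_id)
    return out

# Two nodes holding the attribute values of their local records
cands = candidates([["79216", "50071"], ["79216", "34302"]])
```

Because each stream arrives pre-sorted, the master does a single linear merge pass: no node ever sends more than its own data, matching the Θ(n) communication bound from the previous slides.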
Experiments: scalability
• 15 older blade servers, 4 GB memory, 3 GHz CPU each
• Insurance client dataset (~2 records per entity)
Hadoop phases
HAMA phases
Conclusions
• Big Data is founded on several subfields
  • Architectures – processor arrays, many-core now affordable
  • Algorithms – design principles from the '90s
  • Databases – distributed, column oriented, NoSQL
  • Data mining, Information retrieval, Machine learning, Networks – for the top application
• Hadoop and Stream processing are the two main efficient techniques
  • Limitations for data AND compute intense problems
  • Many emerging alternatives (e.g. BSP)
• Selecting the right architecture is often the question
Questions?
András Benczúr
Head, Informatics Laboratory
Institute for Computer Science and Control, Hungarian Academy of Sciences
http://datamining.sztaki.hu/
Email: [email protected]