Database and Stream Mining using GPUs Naga K. Govindaraju UNC Chapel Hill.


2

Goal

• Utilize graphics processors for fast computation of common database operations
  • Conjunctive selections
  • Aggregations
  • Semi-linear queries
• Essential components

3

Motivation: Fast operations

• Increasing database sizes
• Faster processor speeds but low improvement in query execution time
  • Memory stalls
  • Branch mispredictions
  • Resource stalls
• Ref: [Ailamaki99,01] [Boncz99] [Manegold00,02] [Meki00] [Shatdal94] [Rao99] [Ross02] [Zhou02] …

4

Fast Database Operations

[Figure: system diagram showing CPU (3 GHz), System Memory (2 GB), AGP Memory (512 MB), PCI-e Bus (4 GB/s), Video Memory (256 MB), and GPU (500 MHz), with regions labeled "Ours" and "Others".]

5

                                 NVIDIA GeForceFX 6800 Ultra      NVIDIA GeForceFX 5900 Ultra     Intel Pentium 4
Memory bandwidth                 35.2 GBps                        27.2 GBps                       6.4 GBps (DDR2 400 RDRAM)
Peak SIMD instructions           6 vertex ops, 16 pixel ops       4 vertex ops, 4 pixel ops       4 float ops (SSE),
                                 (float)                          (float)                         2 double ops (SSE2)
Vector ops per clock             16 vector4 (float)               4 vector4 (float)               1 vector4 (float)
Peak comparison ops per clock    64                               16                              4
Clock                            400 MHz                          450 MHz                         3.4 GHz

6

Graphics Processors: Design Issues

• Relatively low bandwidth to CPU
  • Design database operations avoiding frame buffer readbacks
• No arbitrary writes
  • Design algorithms avoiding data rearrangements
• Programmable pipeline has poor branching
  • Design algorithms without branching in the programmable pipeline; evaluate branches using fixed function tests

7

Basic DB Operations

Basic SQL query:
Select A
From T
Where C

A = attributes or aggregations (SUM, COUNT, MAX, etc.)
T = relational table
C = Boolean combination of predicates (using operators AND, OR, NOT)

8

Database Operations

• Predicates
  • a_i op constant or a_i op a_j
  • op: <, >, <=, >=, !=, =, TRUE, FALSE
• Boolean combinations
  • Conjunctive Normal Form (CNF)
• Aggregations
  • COUNT, SUM, MAX, MEDIAN, AVG

9

Data Representation

• Attribute values a_i are stored in 2D textures on the GPU
• A fragment program is used to copy attributes to the depth buffer

10

Copy Time to the Depth Buffer

11

Data Representation: Issues

• Floating point and fixed point representations are different
  • Need to define scaling operations

12

Predicate Evaluation

• a_i op constant (d)
  • Copy the attribute values a_i into the depth buffer
  • Specify the comparison operation used in the depth test
  • Draw a screen-filling quad at depth d and perform the depth test
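The three steps above can be emulated on the CPU to see what the depth test computes. This is only a sketch under stated assumptions: the "depth buffer" is a plain Python list, `depth_test_select` and `DEPTH_FUNCS` are hypothetical names, and the stored value is used as the left operand of the comparison (a real depth test compares the incoming fragment depth against the stored depth, so the operator may need to be mirrored).

```python
import operator

# Comparison functions that a fixed-function depth test could be configured with.
DEPTH_FUNCS = {
    "<": operator.lt, "<=": operator.le, ">": operator.gt,
    ">=": operator.ge, "=": operator.eq, "!=": operator.ne,
}

def depth_test_select(depth_buffer, op, d):
    """Return one pass/fail flag per pixel for the predicate (a_i op d)."""
    compare = DEPTH_FUNCS[op]
    # Drawing a screen-filling quad at depth d tests d against every stored value.
    return [compare(a, d) for a in depth_buffer]

attribute = [3, 11, 13, 7, 5, 1, 7, 10, 2]   # a_i copied into the "depth buffer"
mask = depth_test_select(attribute, ">=", 8)
selected = [a for a, ok in zip(attribute, mask) if ok]
```

No per-record branch appears in a programmable stage here; the only decision is the comparison itself, which is the part the slides assign to the fixed-function depth test.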

13

[Figure: a screen-filling quad P at depth d is drawn over the screen. For each fragment: if (a_i op d), pass the fragment; else, reject the fragment.]

14

Predicate Evaluation

CPU implementation: Intel compiler 7.1 with SIMD optimizations

15

Predicate Evaluation

• a_i op a_j
  • Equivalent to (a_i - a_j) op 0
• Semi-linear queries
  • Defined as a linear combination of attribute values compared against a constant: (sum over i of s_i * a_i) op b, with coefficients s_i and constant b
  • Linear combination is computed as a dot product of two vectors
  • Utilize the vector processing capabilities of GPUs
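A small CPU sketch of a semi-linear query may help make the dot-product formulation concrete. The names `semi_linear_select`, `coeffs`, and `b` are assumptions made for illustration and do not come from the slides; each record is a tuple of attribute values.

```python
import operator

COMPARE = {
    "<": operator.lt, "<=": operator.le, ">": operator.gt,
    ">=": operator.ge, "=": operator.eq, "!=": operator.ne,
}

def semi_linear_select(rows, coeffs, op, b):
    """Evaluate (coeffs . row) op b for every record and return pass/fail flags."""
    compare = COMPARE[op]
    flags = []
    for row in rows:
        # The linear combination is a dot product of the two vectors.
        dot = sum(s * a for s, a in zip(coeffs, row))
        flags.append(compare(dot, b))
    return flags

# a_i > a_j rewritten as (a_i - a_j) > 0, i.e. coefficients (1, -1) and constant 0.
rows = [(5, 3), (2, 2), (1, 4)]
flags = semi_linear_select(rows, coeffs=(1, -1), op=">", b=0)
```

The attribute-to-attribute comparison from the first bullet is just the special case with coefficients (1, -1) and constant 0.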

16

Semi-linear Query

17

Boolean Combination

• CNF:
  • (A_1 AND A_2 AND … AND A_k) where A_i = (B_i^1 OR B_i^2 OR … OR B_i^(m_i))
• Performed using stencil test recursively
  • C_1 = (TRUE AND A_1) = A_1
  • C_i = (A_1 AND A_2 AND … AND A_i) = (C_(i-1) AND A_i)
• Different stencil values are used to code the outcome of C_i
  • Positive stencil values: pass predicate evaluation
  • Zero: fail predicate evaluation
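The recursion C_i = (C_(i-1) AND A_i) can be sketched on the CPU with a list standing in for the stencil buffer. This is an illustrative emulation, not the GPU code: `eval_cnf` is a hypothetical name, predicates are plain Python callables, and the exact stencil values chosen here (i and i + 1) are an assumption; only the convention "positive passes, zero fails" is taken from the slides.

```python
def eval_cnf(records, clauses):
    """Evaluate a CNF formula; clauses is a list of lists of predicates.

    A record satisfies a clause if any predicate in it holds. The returned
    list holds a positive value for records that satisfy every clause and
    zero for the rest.
    """
    stencil = [1] * len(records)              # C_0 = TRUE for every record
    for i, clause in enumerate(clauses, start=1):
        for pred in clause:
            for p, rec in enumerate(records):
                # Test only records that passed C_(i-1) and have not yet
                # satisfied a disjunct of A_i.
                if stencil[p] == i and pred(rec):
                    stencil[p] = i + 1
        # Records still holding i failed every disjunct of A_i.
        stencil = [0 if s == i else s for s in stencil]
    return stencil

records = [3, 11, 13, 7, 5]
clauses = [
    [lambda a: a > 4],                        # A_1
    [lambda a: a < 6, lambda a: a > 12],      # A_2 = (B_2^1 OR B_2^2)
]
stencil = eval_cnf(records, clauses)
selected = [r for r, s in zip(records, stencil) if s > 0]
```

Records are never moved or compacted; the outcome lives entirely in the per-record stencil value, which matches the "no data rearrangements" constraint stated earlier.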

18

A_1 AND A_2

[Figure: region A_1 and regions B_2^1, B_2^2, B_2^3, where A_2 = (B_2^1 OR B_2^2 OR B_2^3).]

19

A_1 AND A_2

[Figure: region A_1; stencil value = 1.]

20

A_1 AND A_2

[Figure: result of TRUE AND A_1. Labels: A_1, stencil value = 1, stencil value = 0.]

21

A_1 AND A_2

[Figure: A_1 with stencil = 1 and stencil = 0 regions; B_2^1, B_2^2, and B_2^3 are each labeled stencil = 2.]

22

A_1 AND A_2

[Figure: A_1 with stencil = 1 and stencil = 0 regions; B_2^1, B_2^2, and B_2^3 are each labeled stencil = 2.]

23

A_1 AND A_2

[Figure: final result. Stencil = 0 elsewhere; stencil = 2 for A_1 AND B_2^1, A_1 AND B_2^2, and A_1 AND B_2^3.]

24

Multi-Attribute Query

25

Range Query

• Compute a_i within [low, high]
• Evaluated as (a_i >= low) AND (a_i <= high)
• Use NVIDIA depth bounds test to evaluate both conditionals in a single clock cycle

26

Range Query

27

Aggregations

• COUNT, MAX, MIN, SUM, AVG

28

COUNT

• Use occlusion queries to get the number of pixels passing the tests
• Syntax:
  • Begin occlusion query
  • Perform database operation
  • End occlusion query
  • Get count of number of attributes that passed database operation
• Involves no additional overhead!
  • Efficient selectivity computation
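The begin / perform / end / get-count pattern can be mimicked on the CPU as follows. Everything here is a hypothetical stand-in: `PassCounter` and `occlusion_query` are illustrative names, not a graphics API, and the counter simply tallies how many values pass a test while a query is active.

```python
from contextlib import contextmanager

class PassCounter:
    """Hypothetical stand-in for a hardware occlusion query counter."""

    def __init__(self):
        self.count = 0
        self.active = False

    def draw(self, values, predicate):
        """Run a pass/fail test over all values; tally passes if a query is active."""
        mask = [predicate(a) for a in values]
        if self.active:
            self.count += sum(mask)
        return mask

@contextmanager
def occlusion_query(counter):
    counter.count = 0
    counter.active = True          # begin occlusion query
    try:
        yield counter
    finally:
        counter.active = False     # end occlusion query

values = [3, 11, 13, 7, 5, 1, 7, 10, 2]
counter = PassCounter()
with occlusion_query(counter):
    counter.draw(values, lambda a: 5 <= a <= 10)   # perform database operation
count = counter.count                               # get count
selectivity = count / len(values)
```

The count falls out of the same pass that evaluates the predicate, which is why the slide describes selectivity computation as involving no additional overhead.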

29

MAX, MIN, MEDIAN

• Kth-largest number
• Traditional algorithms require data rearrangements
• We perform
  • no data rearrangements
  • no frame buffer readbacks

30

K-th Largest Number

• Let v_k denote the k-th largest number
• How do we generate a number m equal to v_k?
• Without knowing v_k's value
• Using occlusion queries to obtain the number of values ≥ some given value
• Starting from the most significant bit, determine the value of one bit at a time

31

K-th Largest Number

• Given a set S of values
  • c(m): number of values ≥ m
  • v_k: the k-th largest number
• We have
  • If c(m) > k-1, then m ≤ v_k
  • If c(m) ≤ k-1, then m > v_k

32

0011 1011 1101

0111 0101 0001

0111 1010 0010

m = 0000, v_2 = 1011

2nd Largest in 9 Values

33

0011 1011 1101

0111 0101 0001

0111 1010 0010

m = 1000, v_2 = 1011

Draw a quad at depth 8; compute c(1000)

34

0011 1011 1101

0111 0101 0001

0111 1010 0010

m = 1000, v_2 = 1011

c(m) = 3

1st bit = 1

35

0011 1011 1101

0111 0101 0001

0111 1010 0010

m = 1100, v_2 = 1011

Draw a quad at depth 12; compute c(1100)

36

0011 1011 1101

0111 0101 0001

0111 1010 0010

m = 1100, v_2 = 1011

c(m) = 1

2nd bit = 0

37

0011 1011 1101

0111 0101 0001

0111 1010 0010

m = 1010, v_2 = 1011

Draw a quad at depth 10; compute c(1010)

38

0011 1011 1101

0111 0101 0001

0111 1010 0010

m = 1010, v_2 = 1011

c(m) = 3

3rd bit = 1

39

0011 1011 1101

0111 0101 0001

0111 1010 0010

m = 1011, v_2 = 1011

Draw a quad at depth 11; compute c(1011)

40

0011 1011 1101

0111 0101 0001

0111 1010 0010

m = 1011, v_2 = 1011

c(m) = 2

4th bit = 1

41

Our algorithm

• Initialize m to 0
• Start with the MSB and scan all bits down to the LSB
• At each bit, put 1 in the corresponding bit position of m
• If c(m) ≤ k-1, make that bit 0
• Proceed to the next bit
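The steps above can be sketched on the CPU, with a simple counting loop standing in for the occlusion query. The function names `count_ge` and `kth_largest` are chosen here for illustration, and the sketch assumes non-negative integers that fit in `num_bits` bits. The data is the nine-value example from the preceding slides.

```python
def count_ge(values, m):
    """c(m): number of values >= m (an occlusion query count on the GPU)."""
    return sum(1 for v in values if v >= m)

def kth_largest(values, k, num_bits):
    """Build the k-th largest value one bit at a time, MSB first."""
    m = 0
    for bit in range(num_bits - 1, -1, -1):
        m |= 1 << bit                        # tentatively set this bit to 1
        if count_ge(values, m) <= k - 1:     # then m > v_k, so the bit must be 0
            m &= ~(1 << bit)
    return m

values = [0b0011, 0b1011, 0b1101,
          0b0111, 0b0101, 0b0001,
          0b0111, 0b1010, 0b0010]
second_largest = kth_largest(values, k=2, num_bits=4)
median = kth_largest(values, k=5, num_bits=4)
```

Each bit costs one counting pass over the data, so the number of passes depends on the bit width rather than on k, and the values are only ever compared, never moved.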

42

Kth-Largest

43

Median

44

Top K Frequencies

• Given n values in frame buffer, compute the top k frequencies without performing data rearrangements and using comparisons

45

Accumulator, Mean

• Possible algorithms
  • Use fragment programs – requires very few renderings
  • Use mipmaps [Harris et al. 02], fragment programs [Coombe et al. 03]
• Issue: overflow in floating point values
• Our approach: bit-based algorithm
  • Mean computed using accumulator and divide by n

46

Accumulator

• Data representation is of the form 2^k a_k + 2^(k-1) a_(k-1) + … + a_0

Sum = 2^k Σ a_k + 2^(k-1) Σ a_(k-1) + … + Σ a_0

Σ a_i = number of values with i-th bit as 1

Current GPUs support no bit-masking operations

47

TestBit

• Read the data value from texture, say a_i
• F = frac(a_i / 2^k)
• If F >= 0.5, then the k-th bit of a_i is 1 (counting bits from 1 at the least significant bit)
• Set F to alpha value. Alpha test passes a fragment if alpha value >= 0.5

48

Accumulator

49

Stream Mining

• Streams are continuous sequences of data values arriving at a port
  • A few common examples include networking data, stock market and financial data, and data collected from sensors
• Goal: Efficiently approximate order statistics such as frequencies and quantiles on data streams
  • Exact computations require infinite memory

50

Issues

• Data streaming applications have real-time processing requirements
• Applications also require a small or limited memory footprint

51

Issues

• Efficient CPU algorithms perform histogram computations and are either
  • Compute-limited, and therefore cannot process data faster than its arrival rate
  • Memory-limited, and therefore use memory hierarchies on disks and are slow. Alternatively, load-shedding algorithms that drop excess items are also used

52

Histogram Computation

• Efficient sorting is fundamental for histogram computations
• Our new sorting network algorithm uses the texture mapping and blending functionality of GPUs to perform fast sorting
  • The comparator mapping is performed using texture mapping
  • The conditional assignments (MIN and MAX) are implemented using blending
  • Maps efficiently to rasterization and is fast!
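To illustrate how a sorting network reduces to a fixed pattern of MIN and MAX operations, here is a CPU sketch. The slides do not name the network used, so a bitonic network is swapped in as an assumption; the partner index `i ^ j` plays the role of the comparator mapping, and each inner pass reads one array and writes another, loosely like a render pass. It requires a power-of-two input length.

```python
def bitonic_sort(data):
    """Sort ascending with a bitonic network built only from min/max steps."""
    a = list(data)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                      # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:                  # comparator distance within this merge step
            out = list(a)              # each pass reads a, writes out
            for i in range(n):
                partner = i ^ j        # comparator mapping: who i is compared with
                if partner > i:
                    lo = min(a[i], a[partner])
                    hi = max(a[i], a[partner])
                    if i & k == 0:     # ascending block
                        out[i], out[partner] = lo, hi
                    else:              # descending block
                        out[i], out[partner] = hi, lo
            a = out
            j //= 2
        k *= 2
    return a

result = bitonic_sort([7, 3, 11, 1, 13, 5, 10, 2])
```

The comparison pattern depends only on the indices, never on the data, so every pass does the same work for every element and needs no data-dependent branching.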

53

Further details

Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors

Naga K. Govindaraju, Nikunj Raghuvanshi, Dinesh Manocha

Proc. of ACM SIGMOD 2005

54

Advantages

• Algorithms progress at GPU growth rate
• Offload CPU work
  • Streaming processor parallel to CPU
• Fast
  • Massive parallelism on GPUs
  • High memory bandwidth
  • No branch mispredictions
• Commodity hardware!

55

Conclusions

• Novel algorithms to perform database operations on GPUs
  • Evaluation of predicates, Boolean combinations of predicates, aggregations
• Algorithms take into account GPU limitations
  • No data rearrangements
  • No frame buffer readbacks

56

Conclusions

• Algorithms map well to rasterization and GPUs
• Preliminary comparisons with optimized CPU implementations are promising
• GPU as a useful co-processor

57

Future Work

• Improve performance of many of our algorithms
• More database operations such as join, sorting, classification, and clustering
• Queries on spatial and temporal databases