
Evaluating a Processing-in-Memory Architecture with the k-means Algorithm

Simon Bihel simon.bihel@ens-rennes.fr
Lesly-Ann Daniel lesly-ann.daniel@ens-rennes.fr
Florestan De Moor florestan.de-moor@ens-rennes.fr
Bastien Thomas bastien.thomas@ens-rennes.fr

May 4, 2017

University of Rennes I
École Normale Supérieure de Rennes

With Help From…

Dominique Lavenier dominique.lavenier@irisa.fr
CNRS / IRISA

David Furodet & the Upmem Team dfurodet@upmem.com

Context

BIG DATA Workloads

End of Dennard Scaling

End of Moore’s Law

Shift towards Data-Centric Architectures

Exascale

Bandwidth and Memory Walls


Table of contents

1. The Upmem Architecture

2. k-means Implementation for the Upmem Architecture

3. Experimental Evaluation


The Upmem Architecture

Upmem architecture overview

[Diagram: the CPU is connected through the DDR bus to a DIMM; the DIMM holds DPUs 0 to 255, each paired with a WRAM and an MRAM]

DPU: DRAM Processing Unit
WRAM: execution memory for programs
MRAM: main memory
DIMM: dual in-line memory module

A massively parallel architecture

Characteristics

• Several DIMMs can be added to a CPU
• A 16-GByte DIMM embeds 256 DPUs
• Each DPU can support up to 24 threads

The context is switched between DPU threads every clock cycle.

The programming approach has to consider this fine-grained parallelism.



Upmem Architecture Overview

On a programming level: two programs must be specified.

[Diagram: the Host program runs on the CPU and orchestrates the execution; Tasklet programs run on the DPUs and perform the data-intensive operations; they communicate through the MRAM and mailboxes]
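The split above can be sketched in plain Python as a toy analogy: threads stand in for DPUs, dictionaries for the MRAM and mailboxes. The names `host`, `tasklet`, `mram`, and `mailbox` are illustrative, not the Upmem SDK.

```python
from threading import Thread

def tasklet(mram, mailbox):
    # Data-intensive work runs next to the data (here: a local sum),
    # and the result goes back to the host through the mailbox.
    mailbox["sum"] = sum(mram["points"])

def host(all_points, n_dpus=4):
    # The host orchestrates: partition the data across the DPUs'
    # MRAMs, launch the tasklets, then collect the mailboxes.
    chunks = [all_points[i::n_dpus] for i in range(n_dpus)]
    mailboxes = [{} for _ in range(n_dpus)]
    threads = [Thread(target=tasklet, args=({"points": c}, m))
               for c, m in zip(chunks, mailboxes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(m["sum"] for m in mailboxes)

print(host(list(range(100))))  # 4950
```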

Drawbacks and advantages

Drawbacks: computation power

• Frequency around 750 MHz
• No floating-point operations
• Significant multiplication overhead (no hardware multiplier)
• Explicit memory management

Advantages: data access

• Parallelization power
• Minimum latency
• Increased bandwidth
• Reduced power consumption


k-means Implementation for the Upmem Architecture

k-means Clustering Problem

Partition data ∈ ℝ^(n×m) into k clusters C₁ … C_k
n (resp. m): number of points (resp. attributes)
d: Euclidean distance

argmin_C ∑_{i=1}^{k} ∑_{p ∈ C_i} d(p, mean(C_i))

Examples of applications:
• Segmentation
• Communities in social networks
• Market research
• Gene sequence analysis
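As a concrete reading of the objective, a small plain-Python sketch that evaluates it for a given partition (the name `kmeans_cost` is illustrative, not from the slides):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans_cost(clusters):
    """Objective value: sum over clusters C_i of d(p, mean(C_i))
    for every point p in C_i."""
    total = 0.0
    for pts in clusters:
        # mean(C_i), computed attribute by attribute
        m = [sum(xs) / len(pts) for xs in zip(*pts)]
        total += sum(dist(p, m) for p in pts)
    return total

# Two clusters: {(0,0), (2,0)} has mean (1,0), each point at
# distance 1; the singleton {(10,10)} contributes 0.
print(kmeans_cost([[(0, 0), (2, 0)], [(10, 10)]]))  # 2.0
```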

k-means Standard Algorithm [6]

1:  function k-means(k, data, δ)
2:      Choose C̃ := (c̃1 … c̃k) initial centroids
3:      repeat
4:          C := C̃
5:          for all points p ∈ data do
6:              j := argmin_i d(p, c̃i)    ▷ Find nearest cluster
7:              Assign p to cluster Cj
8:          end for
9:          for all i in {1 … k} do
10:             c̃i := mean(p ∈ Ci)        ▷ Compute new centroids
11:         end for
12:     until ‖C̃ − C‖ ≤ δ                 ▷ Convergence criterion
13:     return C̃                          ▷ Return the final centroids
14: end function
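The pseudocode translates directly into a sequential reference implementation (a plain-Python sketch; sampling the initial centroids from the data is one common choice the pseudocode leaves open):

```python
import random

def kmeans(k, data, delta, seed=0):
    """Lloyd's algorithm, mirroring the pseudocode above."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # initial centroids
    while True:
        old = centroids
        clusters = [[] for _ in range(k)]
        for p in data:  # assign each point to its nearest centroid
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c
                     else old[i]
                     for i, c in enumerate(clusters)]
        shift = max(sum((a - b) ** 2 for a, b in zip(co, cn)) ** 0.5
                    for co, cn in zip(old, centroids))
        if shift <= delta:  # convergence criterion
            return centroids

cs = kmeans(2, [(0, 0), (1, 0), (10, 10), (11, 10)], delta=1e-9)
print(sorted(cs))  # [(0.5, 0.0), (10.5, 10.0)]
```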

k-means algorithm on Upmem

[Flowchart — HOST: data input → choose initial centroids → distribute points across the DPUs → send centroids → DPUs run the computations → HOST: update centroids → convergence? If no, send the centroids again; if yes, output the results]

The points are distributed across the DPUs.
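One iteration of this flow can be sketched as follows (plain Python standing in for the DPUs; the function names are illustrative): each "DPU" assigns its local points and returns per-cluster partial sums and counts, and the host reduces those partials into the new centroids.

```python
def dpu_iteration(points, centroids):
    """Runs on each 'DPU': assign the local points to the nearest
    centroid, return per-cluster partial sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[i])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def host_update(partials, centroids):
    """Runs on the host: reduce the partial results from all DPUs
    into the new centroids."""
    k, dim = len(centroids), len(centroids[0])
    tot_s = [[0] * dim for _ in range(k)]
    tot_c = [0] * k
    for sums, counts in partials:
        for j in range(k):
            tot_c[j] += counts[j]
            for d in range(dim):
                tot_s[j][d] += sums[j][d]
    # an empty cluster keeps its previous centroid
    return [tuple(s / tot_c[j] for s in tot_s[j]) if tot_c[j]
            else centroids[j] for j in range(k)]

points = [(0, 0), (1, 0), (10, 10), (11, 10)]
chunks = [points[:2], points[2:]]   # 'distribute points'
cents = [(0, 0), (10, 10)]          # 'send centroids'
parts = [dpu_iteration(c, cents) for c in chunks]
print(host_update(parts, cents))  # [(0.5, 0.0), (10.5, 10.0)]
```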

Implementation & Memory Management

• int type to store distances (easy to overflow with distances)

MRAM layout:

• Global variables (e.g. number of points)
• Centers
• Points
• New centers
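The overflow risk is easy to see: with a 32-bit int accumulator, even modest coordinates overflow a squared Euclidean distance. A sketch that simulates 32-bit wraparound in plain Python (the helper names are illustrative):

```python
def to_int32(x):
    # Reduce to a signed 32-bit value (two's-complement wraparound).
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def sq_dist32(p, q):
    # Squared Euclidean distance accumulated in a simulated 32-bit int:
    # a single squared coordinate difference of 50000**2 = 2.5e9
    # already exceeds 2**31 - 1.
    acc = 0
    for a, b in zip(p, q):
        acc = to_int32(acc + to_int32((a - b) * (a - b)))
    return acc

p, q = (50000,) * 10, (0,) * 10
print(sq_dist32(p, q))                          # wrapped (negative)
print(sum((a - b) ** 2 for a, b in zip(p, q)))  # 25000000000 (true value)
```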

Experimental Evaluation

Experimental Setup

Simulator

• Architecture not yet manufactured
• Cycle-accurate simulator

Datasets

• int values
• Randomly generated (not uniformly: with clusters)

Could not find ready-to-use large integer datasets.

[Scatter plot: a randomly generated 2-D dataset, coordinates roughly −200 to 1000, with visible clusters]


Number of Threads

[Plot: runtime vs. number of threads (0 to 25), for three datasets with a high number of:]

• points (N=1,000,000, D=10, K=5)
• dimensions (N=500,000, D=34, K=3)
• centroids (N=100,000, D=2, K=10)

Note: the three curves are not on the same runtime scale.

Number of DPUs

[Plot: runtime in seconds (0 to 80) vs. number of DPUs (0 to 35), always with the same number of points]

The runtime is divided by the number of DPUs.

Comparison with sequential k-means

Dataset: Many Points

Algorithm     16 DPUs   1-core SeqC
Runtime (s)   1.568     0.268

Faster than SeqC with 94 DPUs.

Comparison with sequential k-means

Dataset: Many Dimensions

Algorithm     16 DPUs   1-core SeqC
Runtime (s)   4.534     0.119

Faster than SeqC with 610 DPUs.

A large number of dimensions yields a large number of multiplications to compute distances.

Comparison with sequential k-means

Dataset: Many Centers

Algorithm     16 DPUs   1-core SeqC
Runtime (s)   0.4353    0.0142

Faster than SeqC with 491 DPUs.

A large number of centers provides a large amount of computation per memory transfer [2].
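The "faster than SeqC with N DPUs" figures are consistent with assuming runtime scales inversely with the number of DPUs, as the earlier scaling plot suggests: the break-even count is then ⌈16 · T_16DPUs / T_SeqC⌉. A quick check against the three tables:

```python
from math import ceil

def break_even_dpus(t_16dpu, t_seqc, base=16):
    # Assuming ideal scaling, runtime on n DPUs ≈ base * t_16dpu / n,
    # so beating SeqC needs n ≥ base * t_16dpu / t_seqc.
    return ceil(base * t_16dpu / t_seqc)

print(break_even_dpus(1.568, 0.268))    # many points     -> 94
print(break_even_dpus(4.534, 0.119))    # many dimensions -> 610
print(break_even_dpus(0.4353, 0.0142))  # many centers    -> 491
```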

Conclusion

Conclusion

• Ideal use case: programs with very little computation (e.g. genomic text processing [4, 5])
• Even if there is no gain in time, power consumption might be reduced
• Overflows when computing distances
• Implemented k-means++ [1] with the GMP library (arbitrary-precision numbers), but what mattered was the time per iteration


Going Further with the Hardware

Actual Physical Device

• Evaluate how the program behaves at large scale
• Impact on the DDR bus & communications

Hardware Multiplication

• Currently: 40% multiplication instructions & 30 instructions per multiplication
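Without a hardware multiplier, each multiplication must be done in software. A shift-and-add loop of this generic shape (an illustration in plain Python, not Upmem's actual routine) makes the per-multiplication instruction count plausible: every bit of the operand costs a test, a conditional add, and two shifts.

```python
def soft_mul(a, b, width=32):
    """Shift-and-add multiplication: the kind of loop a core must
    run when it has no hardware multiplier."""
    result = 0
    for _ in range(width):
        if b & 1:          # test the low bit of b
            result += a    # conditional add
        a <<= 1            # shift a left
        b >>= 1            # shift b right
    return result

print(soft_mul(1234, 5678))  # 7006652
```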


Going Further with the k-means

Keep the distance to the current nearest centroid [3]
Easy to add to our implementation: keep the distance in the DPU

+ Avoids useless computations during the next iteration

− Reduces the number of points per DPU

Define a border made of points that can switch clusters [7]
Harder to integrate

+ Reduces the number of distance computations

− Might involve the CPU
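The first idea can be sketched as follows (a hedged reading of [3]; names are illustrative): keep each point's last assigned cluster and distance, and on the next iteration recompute only the distance to that centroid's new position. If the point got no farther from it, keep the assignment and skip the full scan over all centroids.

```python
def assign_with_cache(points, centroids, cache):
    """One assignment pass with a per-point cache of
    (last cluster, last distance)."""
    def sq(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    labels = []
    for idx, p in enumerate(points):
        prev_j, prev_d = cache.get(idx, (None, None))
        if prev_j is not None:
            d = sq(p, centroids[prev_j])
            if d <= prev_d:  # centroid moved closer (or not at all)
                cache[idx] = (prev_j, d)
                labels.append(prev_j)
                continue     # full scan skipped
        j = min(range(len(centroids)), key=lambda i: sq(p, centroids[i]))
        cache[idx] = (j, sq(p, centroids[j]))
        labels.append(j)
    return labels

pts = [(0, 0), (1, 0), (10, 10)]
cache = {}
print(assign_with_cache(pts, [(0, 0), (10, 10)], cache))      # [0, 0, 1]
# Next iteration: the cache lets some points skip the full scan.
print(assign_with_cache(pts, [(0.4, 0), (10.5, 10)], cache))  # [0, 0, 1]
```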


Thank You

References

References i

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[2] M. A. Bender, J. Berry, S. D. Hammond, B. Moore, B. Moseley, and C. A. Phillips. k-means clustering on two-level memory systems. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS '15, pages 197–205, New York, NY, USA, 2015. ACM.

References ii

[3] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan. An efficient enhanced k-means clustering algorithm. Journal of Zhejiang University-SCIENCE A, 7(10):1626–1633, 2006.

[4] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. BLAST on UPMEM. Research Report RR-8878, INRIA Rennes - Bretagne Atlantique, Mar. 2016.

[5] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. MAPPING on UPMEM. Research Report RR-8923, INRIA, June 2016.

References iii

[6] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[7] C. M. Poteraş, M. C. Mihăescu, and M. Mocanu. An optimized version of the k-means clustering algorithm. In Computer Science and Information Systems (FedCSIS), 2014 Federated Conference on, pages 695–699. IEEE, 2014.