Evaluating a Processing-in-Memory Architecture with the k-means Algorithm

Simon Bihel, Lesly-Ann Daniel, Florestan De Moor, Bastien Thomas

May 4, 2017

University of Rennes I, École Normale Supérieure de Rennes

Slides: simonbihel.me/static/PIM/slides/hipeac/slides.pdf


With Help From…

Dominique Lavenier

David Furodet & the Upmem Team


Context

BIG DATA Workloads

End of Dennard Scaling

End of Moore’s Law

Shift towards Data-Centric Architectures

Exascale

Bandwidth and Memory Walls


Table of contents

1. The Upmem Architecture

2. k-means Implementation for the Upmem Architecture

3. Experimental Evaluation


The Upmem Architecture


Upmem architecture overview

[Figure: a CPU connected through the DDR bus to a DIMM holding DPUs 0 to 255, each DPU with its own WRAM and MRAM.]

• DPU: DRAM processing unit
• WRAM: execution memory for programs
• MRAM: main memory
• DIMM: dual in-line memory module


A massively parallel architecture

Characteristics

• Several DIMMs can be added to a CPU
• A 16 GB DIMM embeds 256 DPUs
• Each DPU can support up to 24 threads

The context is switched between DPU threads every clock cycle.

The programming approach has to consider this fine-grained parallelism.
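One way to take this two-level parallelism into account is to give every thread of every DPU its own contiguous slice of the input. The sketch below assumes an even block partition; `split_range` and `partition` are illustrative names, and the slides do not say which partitioning scheme was actually used.

```python
def split_range(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous (start, stop) slices.

    Slice sizes differ by at most one item.
    """
    base, extra = divmod(n_items, n_parts)
    slices = []
    start = 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        slices.append((start, start + size))
        start += size
    return slices


def partition(n_points, n_dpus=256, n_threads=24):
    """Map each (dpu, thread) pair to the slice of point indices it would process.

    The defaults mirror the figures on the slide: 256 DPUs per DIMM,
    up to 24 threads per DPU.
    """
    plan = {}
    for dpu, (d_start, d_stop) in enumerate(split_range(n_points, n_dpus)):
        local = split_range(d_stop - d_start, n_threads)
        for thread, (t_start, t_stop) in enumerate(local):
            plan[(dpu, thread)] = (d_start + t_start, d_start + t_stop)
    return plan
```

With one DIMM this yields 256 × 24 = 6144 slices, so each thread only handles a small fraction of the points.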


Upmem Architecture Overview

On a programming level: two programs must be specified.

• Host program (CPU): orchestrates the execution
• Tasklet program (DPUs): performs data-intensive operations
• Communication between the two: MRAM and mailboxes


Drawbacks and advantages

Drawbacks: computation power

• Frequency around 750 MHz
• No floating point operations
• Significant multiplication overhead (no hardware multiplier)
• Explicit memory management

Advantages: data access

• Parallelization power
• Minimum latency
• Increased bandwidth
• Reduced power consumption
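To see why the absence of a hardware multiplier is costly, consider how a multiplication can be emulated with shifts and additions. The slides do not say how the Upmem toolchain emulates multiplication; shift-and-add is one classic software scheme, shown here only as an illustration, and `shift_add_multiply` is a hypothetical name.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers using only shifts, additions and bit tests."""
    result = 0
    while b:
        if b & 1:        # lowest bit of b is set: add the current shifted copy of a
            result += a
        a <<= 1          # double a for the next bit position
        b >>= 1          # move on to the next bit of b
    return result
```

Each bit of the second operand costs a test, a conditional addition and two shifts, so a single multiplication expands into a loop of several instructions per bit.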


k-means Implementation for the Upmem Architecture


k-means Clustering Problem

Partition data $\in \mathbb{R}^{n \times m}$ into $k$ clusters $C_1 \dots C_k$

$n$ (resp. $m$): number of points (resp. attributes)

$d$: Euclidean distance

$$\operatorname*{arg\,min}_{C} \sum_{i=1}^{k} \sum_{p \in C_i} d(p, \mathrm{mean}(C_i))$$

Examples of applications

• Segmentation
• Communities in social networks
• Market research
• Gene sequence analysis
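As a small worked illustration of the objective, the sketch below evaluates the sum of distances from each point to the mean of its cluster for a given partition. The helper names `mean` and `objective` are chosen here for illustration only.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def objective(clusters):
    """Sum over all clusters of the Euclidean distance from each point to its cluster mean."""
    return sum(math.dist(p, mean(c)) for c in clusters for p in c)
```

For the partition `[[(0, 0), (2, 0)], [(5, 5)]]` the first cluster has mean (1, 0), each of its points lies at distance 1 from it, and the singleton cluster contributes 0.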


k-means Standard Algorithm [6]

function k-means(k, data, δ)
    Choose C̃ := (c̃1 . . . c̃k) initial centroids
    repeat
        C := C̃
        for all points p ∈ data do
            j := Argmin_i d(p, ci)        ▷ Find nearest cluster
            Assign p to cluster Cj
        end for
        for all i in {1 . . . k} do
            c̃i := mean(p ∈ Ci)           ▷ Compute new centroids
        end for
    until ∥C̃ − C∥ ≤ δ                    ▷ Convergence criterion
    return C̃                             ▷ Return the final centroids
end function
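The pseudocode above can be turned into a short runnable sketch. This is a plain floating-point version for a regular CPU, not the Upmem implementation. Three choices are assumptions made here: the norm ∥C̃ − C∥ is taken as the largest displacement of any centroid, an empty cluster keeps its previous centroid, and `max_iter` bounds the loop.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(data, initial_centroids, delta=1e-9, max_iter=100):
    """Standard k-means iteration starting from the given initial centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in data:
            clusters[nearest(p, centroids)].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coords) / len(members) for coords in zip(*members)))
            else:
                new_centroids.append(old)  # assumption: keep the centroid of an empty cluster
        # Convergence test: largest centroid displacement (an assumed choice of norm).
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= delta:
            break
    return centroids
```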


k-means algorithm on Upmem

[Flowchart. Host: data input → choose initial centroids → distribute points → send centroids → convergence? (no: send centroids again; yes: output results). DPUs: start centroids update → computations → end centroids update.]

The points are distributed across the DPUs.
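The host/DPU split can be simulated on an ordinary machine. In the sketch below each "DPU" is just a function call on its shard of points. What a DPU sends back is not specified in the slides; returning per-cluster coordinate sums and counts is an assumption, as are the names `dpu_step` and `host_kmeans`, the round-robin distribution, and the use of integer division for the new centroids.

```python
def dpu_step(points, centroids):
    """Work of one DPU: per-cluster coordinate sums and counts over its own points."""
    dims = len(centroids[0])
    sums = [[0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # Squared distance keeps the computation in integers (no square root needed).
        j = min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        counts[j] += 1
        for d in range(dims):
            sums[j][d] += p[d]
    return sums, counts


def host_kmeans(points, centroids, n_dpus, max_iter=100):
    """Host loop: distribute points once, then send centroids and merge partial results."""
    centroids = [tuple(c) for c in centroids]
    shards = [points[i::n_dpus] for i in range(n_dpus)]  # distribute points
    for _ in range(max_iter):
        partials = [dpu_step(shard, centroids) for shard in shards]  # send centroids, compute
        new_centroids = []
        for j, old in enumerate(centroids):
            count = sum(c[j] for _, c in partials)
            if count == 0:
                new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
                continue
            new_centroids.append(tuple(
                sum(s[j][d] for s, _ in partials) // count for d in range(len(old))))
        if new_centroids == centroids:  # convergence test on the host
            return new_centroids
        centroids = new_centroids
    return centroids
```

Because only sums and counts travel back to the host, the result does not depend on how many shards the points were split into.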


Implementation & Memory Management

• int type to store distances (easy to overflow with distances)

MRAM

• Global variables (e.g. number of points)
• Centers
• Points
• New centers
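The overflow risk can be quantified with a little arithmetic. The sketch below assumes a signed 32-bit `int` and that the stored quantity is the squared Euclidean distance; neither detail is stated on the slide, and the function names are illustrative.

```python
import math

INT32_MAX = 2**31 - 1  # assumed width of the int type


def max_safe_delta(dims, int_max=INT32_MAX):
    """Largest per-coordinate difference for which dims * delta**2 still fits in int_max."""
    return math.isqrt(int_max // dims)


def squared_distance_fits(p, q, int_max=INT32_MAX):
    """True if the squared Euclidean distance between p and q does not exceed int_max."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) <= int_max
```

The bound shrinks as the number of dimensions grows, which is why datasets with many attributes reach the limit with smaller coordinate ranges.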


Experimental Evaluation


Experimental Setup

Simulator

• Architecture not yet manufactured
• Cycle-accurate simulator

Datasets

• int
• Randomly generated (not uniformly, with clusters)

Could not find ready-to-use large integer datasets.

[Figure: scatter plot of a generated dataset; axis ticks run up to 1000.]


Number of Threads

[Figure: runtime as a function of the number of threads (0 to 25), one curve per dataset.]

High number of

• points (N=1000000, D=10, K=5)
• dimensions (N=500000, D=34, K=3)
• centroids (N=100000, D=2, K=10)

The curves do not share the same runtime scale.


Number of DPUs

[Figure: runtime in seconds (0 to 80) as a function of the number of DPUs (0 to 35).]

Always the same number of points.

Time is divided by the number of DPUs.


Comparison with sequential k-means

Dataset            16 DPUs runtime (s)    1 core SeqC runtime (s)    Faster than SeqC with
Many Points        1.568                  0.268                      94 DPUs
Many Dimensions    4.534                  0.119                      610 DPUs
Many Centers       0.4353                 0.0142                     491 DPUs

• A large number of dimensions means a large number of multiplications to compute distances
• A large number of centers means a large amount of computation per memory transfer [2]
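The break-even DPU counts above are consistent with extrapolating the 16-DPU runtime under the earlier observation that time is divided by the number of DPUs. The slides do not state how the counts were derived, so the sketch below is a reconstruction under that linear-scaling assumption; `break_even_dpus` is an illustrative name.

```python
import math


def break_even_dpus(runtime_dpus, runtime_seq, measured_dpus=16):
    """Smallest DPU count whose extrapolated runtime is no worse than the sequential one.

    Assumes runtime is inversely proportional to the number of DPUs.
    """
    return math.ceil(runtime_dpus * measured_dpus / runtime_seq)


for name, t_dpu, t_seq in [("Many Points", 1.568, 0.268),
                           ("Many Dimensions", 4.534, 0.119),
                           ("Many Centers", 0.4353, 0.0142)]:
    print(name, break_even_dpus(t_dpu, t_seq))
```

For example, 1.568 × 16 / 0.268 ≈ 93.6, which rounds up to 94 DPUs.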


Conclusion


• Ideal use case: programs with very little computation (e.g. genomic text processing [4, 5])
• Even if there is no gain in time, power consumption might be reduced
• Overflows when computing distances
• Implemented k-means++ [1] with the GMP library (arbitrary-precision numbers), but what was interesting is the time for an iteration


Going Further with the Hardware

Actual Physical Device

• Evaluate how the program behaves at large scale
• Impact on the DDR bus & communications

Hardware Multiplication

• Now: multiplications account for 40% of instructions, at 30 instructions per multiplication


Going Further with the k-means

Keep the distance to the current nearest centroid [3]
Easy to add in our implementation: keep the distance in the DPU

+ Avoids useless computations during the next iteration

− Reduces the number of points per DPU

Define a border made of points that can switch cluster [7]
Harder to integrate

+ Reduces the number of distance computations

− Might involve the CPU
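The first idea can be sketched as follows: if a point is no farther from the updated centroid of its current cluster than it was before the update, it stays in that cluster and the search over all centroids is skipped. This is a simplified illustration in the spirit of [3], not the authors' code; `reassign`, `labels` and `best` are hypothetical names, and squared distances are used so that everything stays in integers.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def reassign(points, centroids, labels, best):
    """One assignment pass that reuses the distances kept from the previous pass.

    labels[i] is the current cluster of point i and best[i] the squared distance
    to that cluster's centroid at the previous pass; both are updated in place.
    Initialise best[i] to -1 to force a full search on the first pass.
    Returns the number of points that needed a full search.
    """
    full_searches = 0
    for i, p in enumerate(points):
        d = sq_dist(p, centroids[labels[i]])
        if d <= best[i]:
            best[i] = d  # the centroid did not move away: keep the point where it is
            continue
        full_searches += 1
        labels[i] = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        best[i] = sq_dist(p, centroids[labels[i]])
    return full_searches
```

The extra `best` array is the memory cost mentioned above: one stored distance per point leaves less room for points on each DPU.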


Thank You


References

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[2] M. A. Bender, J. Berry, S. D. Hammond, B. Moore, B. Moseley, and C. A. Phillips. k-means clustering on two-level memory systems. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS ’15, pages 197–205, New York, NY, USA, 2015. ACM.

[3] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan. An efficient enhanced k-means clustering algorithm. Journal of Zhejiang University-SCIENCE A, 7(10):1626–1633, 2006.

[4] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. BLAST on UPMEM. Research Report RR-8878, INRIA Rennes - Bretagne Atlantique, Mar. 2016.

[5] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. MAPPING on UPMEM. Research Report RR-8923, INRIA, June 2016.

[6] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[7] C. M. Poteraş, M. C. Mihăescu, and M. Mocanu. An optimized version of the k-means clustering algorithm. In Computer Science and Information Systems (FedCSIS), 2014 Federated Conference on, pages 695–699. IEEE, 2014.