
Evaluating a Processing-in-Memory Architecture with the k-means Algorithm

Simon Bihel simon.bihel@ens-rennes.fr
Lesly-Ann Daniel lesly-ann.daniel@ens-rennes.fr
Florestan De Moor florestan.de-moor@ens-rennes.fr
Bastien Thomas bastien.thomas@ens-rennes.fr

May 4, 2017

University of Rennes I
École Normale Supérieure de Rennes

With Help From…

Dominique Lavenier dominique.lavenier@irisa.fr
CNRS / IRISA

David Furodet & the Upmem Team dfurodet@upmem.com

Context

BIG DATA Workloads

End of Dennard Scaling

End of Moore’s Law

Shift towards Data-Centric Architectures

Exascale

Bandwidth and Memory Walls


Table of contents

1. The Upmem Architecture

2. k-means Implementation for the Upmem Architecture

3. Experimental Evaluation


The Upmem Architecture

Upmem architecture overview

[Diagram: the CPU is connected through the DDR bus to a DIMM; the DIMM holds DPUs 0 to 255, each paired with a WRAM and an MRAM]

DPU: DRAM Processing Unit
WRAM: execution memory for programs
MRAM: main memory
DIMM: dual in-line memory module

A massively parallel architecture

Characteristics

• Several DIMMs can be added to a CPU
• A 16-GByte DIMM embeds 256 DPUs
• Each DPU can support up to 24 threads

The context is switched between DPU threads every clock cycle.

The programming approach has to consider this fine-grained parallelism.



Upmem Architecture Overview

On a programming level: two programs must be specified.

[Diagram: the Host program runs on the CPU and orchestrates the execution; Tasklet programs run on the DPUs and perform the data-intensive operations; they communicate through the MRAM and mailboxes]
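The split above can be sketched in plain Python as a toy analogy: threads stand in for DPUs, dictionaries for the MRAM and mailboxes. The names `host`, `tasklet`, `mram`, and `mailbox` are illustrative, not the Upmem SDK.

```python
from threading import Thread

def tasklet(mram, mailbox):
    # Data-intensive work runs next to the data (here: a local sum),
    # and the result goes back to the host through the mailbox.
    mailbox["sum"] = sum(mram["points"])

def host(all_points, n_dpus=4):
    # The host orchestrates: partition the data across the DPUs'
    # MRAMs, launch the tasklets, then collect the mailboxes.
    chunks = [all_points[i::n_dpus] for i in range(n_dpus)]
    mailboxes = [{} for _ in range(n_dpus)]
    threads = [Thread(target=tasklet, args=({"points": c}, m))
               for c, m in zip(chunks, mailboxes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(m["sum"] for m in mailboxes)

print(host(list(range(100))))  # 4950
```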

Drawbacks and advantages

Drawbacks: computation power

• Frequency around 750 MHz
• No floating-point operations
• Significant multiplication overhead (no hardware multiplier)
• Explicit memory management

Advantages: data access

• Parallelization power
• Minimum latency
• Increased bandwidth
• Reduced power consumption


k-means Implementation for the Upmem Architecture

k-means Clustering Problem

Partition data ∈ ℝ^(n×m) into k clusters C₁ … C_k
n (resp. m): number of points (resp. attributes)
d: Euclidean distance

argmin_C ∑_{i=1}^{k} ∑_{p ∈ C_i} d(p, mean(C_i))

Examples of applications:
• Segmentation
• Communities in social networks
• Market research
• Gene sequence analysis
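As a concrete reading of the objective, a small plain-Python sketch that evaluates it for a given partition (the name `kmeans_cost` is illustrative, not from the slides):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans_cost(clusters):
    """Objective value: sum over clusters C_i of d(p, mean(C_i))
    for every point p in C_i."""
    total = 0.0
    for pts in clusters:
        # mean(C_i), computed attribute by attribute
        m = [sum(xs) / len(pts) for xs in zip(*pts)]
        total += sum(dist(p, m) for p in pts)
    return total

# Two clusters: {(0,0), (2,0)} has mean (1,0), each point at
# distance 1; the singleton {(10,10)} contributes 0.
print(kmeans_cost([[(0, 0), (2, 0)], [(10, 10)]]))  # 2.0
```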

k-means Standard Algorithm [6]

1:  function k-means(k, data, δ)
2:      Choose C̃ := (c̃1 … c̃k) initial centroids
3:      repeat
4:          C := C̃
5:          for all points p ∈ data do
6:              j := argmin_i d(p, c̃i)    ▷ Find nearest cluster
7:              Assign p to cluster Cj
8:          end for
9:          for all i in {1 … k} do
10:             c̃i := mean(p ∈ Ci)        ▷ Compute new centroids
11:         end for
12:     until ‖C̃ − C‖ ≤ δ                 ▷ Convergence criterion
13:     return C̃                          ▷ Return the final centroids
14: end function
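The pseudocode translates directly into a sequential reference implementation (a plain-Python sketch; sampling the initial centroids from the data is one common choice the pseudocode leaves open):

```python
import random

def kmeans(k, data, delta, seed=0):
    """Lloyd's algorithm, mirroring the pseudocode above."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # initial centroids
    while True:
        old = centroids
        clusters = [[] for _ in range(k)]
        for p in data:  # assign each point to its nearest centroid
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c
                     else old[i]
                     for i, c in enumerate(clusters)]
        shift = max(sum((a - b) ** 2 for a, b in zip(co, cn)) ** 0.5
                    for co, cn in zip(old, centroids))
        if shift <= delta:  # convergence criterion
            return centroids

cs = kmeans(2, [(0, 0), (1, 0), (10, 10), (11, 10)], delta=1e-9)
print(sorted(cs))  # [(0.5, 0.0), (10.5, 10.0)]
```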

k-means algorithm on Upmem

[Flowchart — HOST: data input → choose initial centroids → distribute points across the DPUs → send centroids → DPUs run the computations → HOST: update centroids → convergence? If no, send the centroids again; if yes, output the results]

The points are distributed across the DPUs.
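One iteration of this flow can be sketched as follows (plain Python standing in for the DPUs; the function names are illustrative): each "DPU" assigns its local points and returns per-cluster partial sums and counts, and the host reduces those partials into the new centroids.

```python
def dpu_iteration(points, centroids):
    """Runs on each 'DPU': assign the local points to the nearest
    centroid, return per-cluster partial sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[i])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def host_update(partials, centroids):
    """Runs on the host: reduce the partial results from all DPUs
    into the new centroids."""
    k, dim = len(centroids), len(centroids[0])
    tot_s = [[0] * dim for _ in range(k)]
    tot_c = [0] * k
    for sums, counts in partials:
        for j in range(k):
            tot_c[j] += counts[j]
            for d in range(dim):
                tot_s[j][d] += sums[j][d]
    # an empty cluster keeps its previous centroid
    return [tuple(s / tot_c[j] for s in tot_s[j]) if tot_c[j]
            else centroids[j] for j in range(k)]

points = [(0, 0), (1, 0), (10, 10), (11, 10)]
chunks = [points[:2], points[2:]]   # 'distribute points'
cents = [(0, 0), (10, 10)]          # 'send centroids'
parts = [dpu_iteration(c, cents) for c in chunks]
print(host_update(parts, cents))  # [(0.5, 0.0), (10.5, 10.0)]
```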

Implementation & Memory Management

• int type to store distances (easy to overflow with distances)

MRAM layout:

• Global variables (e.g. number of points)
• Centers
• Points
• New centers
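The overflow risk is easy to see: with a 32-bit int accumulator, even modest coordinates overflow a squared Euclidean distance. A sketch that simulates 32-bit wraparound in plain Python (the helper names are illustrative):

```python
def to_int32(x):
    # Reduce to a signed 32-bit value (two's-complement wraparound).
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def sq_dist32(p, q):
    # Squared Euclidean distance accumulated in a simulated 32-bit int:
    # a single squared coordinate difference of 50000**2 = 2.5e9
    # already exceeds 2**31 - 1.
    acc = 0
    for a, b in zip(p, q):
        acc = to_int32(acc + to_int32((a - b) * (a - b)))
    return acc

p, q = (50000,) * 10, (0,) * 10
print(sq_dist32(p, q))                          # wrapped (negative)
print(sum((a - b) ** 2 for a, b in zip(p, q)))  # 25000000000 (true value)
```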

Experimental Evaluation

Experimental Setup

Simulator

• Architecture not yet manufactured
• Cycle-accurate simulator

Datasets

• int values
• Randomly generated (not uniformly: with clusters)

Could not find ready-to-use large integer datasets.

[Scatter plot: a randomly generated 2-D dataset, coordinates roughly −200 to 1000, with visible clusters]


Number of Threads

[Plot: runtime vs. number of threads (0 to 25), for three datasets with a high number of:]

• points (N=1,000,000, D=10, K=5)
• dimensions (N=500,000, D=34, K=3)
• centroids (N=100,000, D=2, K=10)

Note: the three curves are not on the same runtime scale.

Number of DPUs

[Plot: runtime in seconds (0 to 80) vs. number of DPUs (0 to 35), always with the same number of points]

The runtime is divided by the number of DPUs.

Comparison with sequential k-means

Dataset: Many Points

Algorithm     16 DPUs   1-core SeqC
Runtime (s)   1.568     0.268

Faster than SeqC with 94 DPUs.

Comparison with sequential k-means

Dataset: Many Dimensions

Algorithm     16 DPUs   1-core SeqC
Runtime (s)   4.534     0.119

Faster than SeqC with 610 DPUs.

A large number of dimensions yields a large number of multiplications to compute distances.

Comparison with sequential k-means

Dataset: Many Centers

Algorithm     16 DPUs   1-core SeqC
Runtime (s)   0.4353    0.0142

Faster than SeqC with 491 DPUs.

A large number of centers provides a large amount of computation per memory transfer [2].
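The "faster than SeqC with N DPUs" figures are consistent with assuming runtime scales inversely with the number of DPUs, as the earlier scaling plot suggests: the break-even count is then ⌈16 · T_16DPUs / T_SeqC⌉. A quick check against the three tables:

```python
from math import ceil

def break_even_dpus(t_16dpu, t_seqc, base=16):
    # Assuming ideal scaling, runtime on n DPUs ≈ base * t_16dpu / n,
    # so beating SeqC needs n ≥ base * t_16dpu / t_seqc.
    return ceil(base * t_16dpu / t_seqc)

print(break_even_dpus(1.568, 0.268))    # many points     -> 94
print(break_even_dpus(4.534, 0.119))    # many dimensions -> 610
print(break_even_dpus(0.4353, 0.0142))  # many centers    -> 491
```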

Conclusion

Conclusion

• Ideal use case: programs with very little computation (e.g. genomic text processing [4, 5])
• Even if there is no gain in time, power consumption might be reduced
• Overflows when computing distances
• Implemented k-means++ [1] with the GMP library (arbitrary-precision numbers), but what mattered was the time per iteration


Going Further with the Hardware

Actual Physical Device

• Evaluate how the program behaves at large scale
• Impact on the DDR bus & communications

Hardware Multiplication

• Currently: 40% multiplication instructions & 30 instructions per multiplication
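Without a hardware multiplier, each multiplication must be done in software. A shift-and-add loop of this generic shape (an illustration in plain Python, not Upmem's actual routine) makes the per-multiplication instruction count plausible: every bit of the operand costs a test, a conditional add, and two shifts.

```python
def soft_mul(a, b, width=32):
    """Shift-and-add multiplication: the kind of loop a core must
    run when it has no hardware multiplier."""
    result = 0
    for _ in range(width):
        if b & 1:          # test the low bit of b
            result += a    # conditional add
        a <<= 1            # shift a left
        b >>= 1            # shift b right
    return result

print(soft_mul(1234, 5678))  # 7006652
```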


Going Further with the k-means

Keep the distance to the current nearest centroid [3]
Easy to add to our implementation: keep the distance in the DPU

+ Avoids useless computations during the next iteration

− Reduces the number of points per DPU

Define a border made of points that can switch clusters [7]
Harder to integrate

+ Reduces the number of distance computations

− Might involve the CPU
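The first idea can be sketched as follows (a hedged reading of [3]; names are illustrative): keep each point's last assigned cluster and distance, and on the next iteration recompute only the distance to that centroid's new position. If the point got no farther from it, keep the assignment and skip the full scan over all centroids.

```python
def assign_with_cache(points, centroids, cache):
    """One assignment pass with a per-point cache of
    (last cluster, last distance)."""
    def sq(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    labels = []
    for idx, p in enumerate(points):
        prev_j, prev_d = cache.get(idx, (None, None))
        if prev_j is not None:
            d = sq(p, centroids[prev_j])
            if d <= prev_d:  # centroid moved closer (or not at all)
                cache[idx] = (prev_j, d)
                labels.append(prev_j)
                continue     # full scan skipped
        j = min(range(len(centroids)), key=lambda i: sq(p, centroids[i]))
        cache[idx] = (j, sq(p, centroids[j]))
        labels.append(j)
    return labels

pts = [(0, 0), (1, 0), (10, 10)]
cache = {}
print(assign_with_cache(pts, [(0, 0), (10, 10)], cache))      # [0, 0, 1]
# Next iteration: the cache lets some points skip the full scan.
print(assign_with_cache(pts, [(0.4, 0), (10.5, 10)], cache))  # [0, 0, 1]
```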


Thank You

References

References i

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[2] M. A. Bender, J. Berry, S. D. Hammond, B. Moore, B. Moseley, and C. A. Phillips. k-means clustering on two-level memory systems. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS '15, pages 197–205, New York, NY, USA, 2015. ACM.

References ii

[3] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan. An efficient enhanced k-means clustering algorithm. Journal of Zhejiang University-SCIENCE A, 7(10):1626–1633, 2006.

[4] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. BLAST on UPMEM. Research Report RR-8878, INRIA Rennes - Bretagne Atlantique, Mar. 2016.

[5] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. MAPPING on UPMEM. Research Report RR-8923, INRIA, June 2016.

References iii

[6] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[7] C. M. Poteraş, M. C. Mihăescu, and M. Mocanu. An optimized version of the k-means clustering algorithm. In Computer Science and Information Systems (FedCSIS), 2014 Federated Conference on, pages 695–699. IEEE, 2014.