
Intel Corporation

2015 Intel Big Data Software Summit
http://goto/bigdatasoftware

Enabling Near-Data Accelerators in Datacenters
Dave Ojika, Jayson Strayer, Gaurav Kaul, Prashanth Thinakaran, Darin Acosta

• Motivation
  • Bring unconventional compute cores (especially FPGAs) into mainstream big data use
  • Abstract software complexity by introducing an efficient accelerator programming model
  • Enable a data-oriented framework for near-memory, distributed processing

• Approach
  • Accelerate computation on FPGA; transfer data over low-latency DDR bus
  • Provide in-memory storage using the open-source Tachyon framework
  • Offload Spark workload to accelerator

• Method
  • Use Compute Near Memory (CNM) architecture for design-space exploration
  • Map data to cores with affinity to specific memory regions
  • Integrate a Java-OpenCL middleware to support scheduling of tasks on accelerator
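The idea of mapping data to cores with affinity to specific memory regions can be sketched as follows. This is only an illustration: the region names, the core IDs, and the `map_regions_to_cores` helper are hypothetical and do not come from the poster, and the tie-breaking rule (least-loaded near core) is an assumption.

```python
from collections import defaultdict


def map_regions_to_cores(regions, core_affinity):
    """Assign each memory region to a core that is declared near it.

    regions:       iterable of region names (hypothetical, e.g. one per DIMM)
    core_affinity: dict mapping core_id -> set of region names near that core
    Returns a dict mapping core_id -> list of assigned regions.
    """
    # Invert the affinity table: region -> cores that are near it.
    near = defaultdict(list)
    for core, owned in core_affinity.items():
        for region in owned:
            near[region].append(core)

    assignment = {core: [] for core in core_affinity}
    for region in regions:
        candidates = near.get(region)
        if not candidates:
            raise ValueError(f"no core has affinity to region {region!r}")
        # Assumed policy: among near cores, pick the least-loaded one.
        core = min(candidates, key=lambda c: (len(assignment[c]), c))
        assignment[core].append(region)
    return assignment


# Hypothetical layout: two cores, each near two DIMMs.
layout = {0: {"dimm0", "dimm1"}, 1: {"dimm2", "dimm3"}}
placement = map_regions_to_cores(["dimm0", "dimm1", "dimm2", "dimm3"], layout)
```

A region that no core is near raises an error rather than falling back to a remote core, which keeps every access local in this toy model.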

Highlights

Accelerator Overview

• Memory-speed data access
• Memory-centric buffer synchronizes with underlying file system

[Figure: accelerator interface. Labels: Connect Object; Write Method, Read Method, Copy Method; Cmd Register with write_bit, read_bit, copy_bit (R/W); Data Register; 4 TB Image.]
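One way to read the interface figure is as a command register whose bits select a write, read, or copy operation, plus a data register in front of the memory image. The sketch below models that reading in software. The bit positions, the one-byte data register, the class names, and the small image size are all assumptions made for illustration; the poster does not specify them.

```python
from enum import IntFlag


class Cmd(IntFlag):
    # Bit positions are assumed; the poster only names the three bits.
    WRITE = 1 << 0  # write_bit
    READ = 1 << 1   # read_bit
    COPY = 1 << 2   # copy_bit


class ToyAccelerator:
    """Software stand-in for the command/data register interface."""

    def __init__(self, size):
        self.image = bytearray(size)  # stands in for the 4 TB image
        self.cmd = Cmd(0)             # command register, idle
        self.data = 0                 # data register (one byte here)

    def _check(self, addr, length=1):
        if addr < 0 or length < 0 or addr + length > len(self.image):
            raise IndexError("access outside the image")

    def write(self, addr, value):
        self._check(addr)
        self.cmd = Cmd.WRITE
        self.data = value & 0xFF
        self.image[addr] = self.data
        self.cmd = Cmd(0)

    def read(self, addr):
        self._check(addr)
        self.cmd = Cmd.READ
        self.data = self.image[addr]
        self.cmd = Cmd(0)
        return self.data

    def copy(self, src, dst, length):
        self._check(src, length)
        self._check(dst, length)
        self.cmd = Cmd.COPY
        self.image[dst:dst + length] = self.image[src:src + length]
        self.cmd = Cmd(0)


acc = ToyAccelerator(16)
acc.write(0, 0xAB)
acc.copy(0, 8, 1)
copied = acc.read(8)
```

The bounds check matters in this model because an out-of-range slice assignment would otherwise silently grow the `bytearray`.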

Workload Analysis

In-Memory Framework

Data and Compute Layer

Current Developments

• Boosted Decision Tree (BDT)
  • Latency-sensitive
  • Poor data locality
  • Fits in 4 TB memory
  • 7-fold cross-validation

Hit     1 GB     10 GB
L1      94.06%   89.75%
DRAM    0.74%    5.74%
(quad-core i7 CPU, 8 GB RAM)

• Fraction of store-bound stalls increases with size of dataset; memory bandwidth requirement too high for CPU
• Workload can be trivially parallelized across DIMMs
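The claim that the workload parallelizes trivially across DIMMs can be illustrated with a shard-and-reduce sketch: split the data into one contiguous slice per DIMM, process each slice independently, then combine the partial results. The `shard` and `parallel_sum_of_squares` helpers, and the use of host threads in place of per-DIMM compute, are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(items, n_shards):
    """Split items into n_shards contiguous, near-equal slices."""
    if n_shards <= 0:
        raise ValueError("n_shards must be positive")
    base, extra = divmod(len(items), n_shards)
    shards, start = [], 0
    for i in range(n_shards):
        # The first `extra` shards take one additional item.
        end = start + base + (1 if i < extra else 0)
        shards.append(items[start:end])
        start = end
    return shards


def parallel_sum_of_squares(values, n_dimms):
    """Process each shard independently, then reduce the partial sums."""
    def partial(chunk):
        return sum(x * x for x in chunk)

    with ThreadPoolExecutor(max_workers=n_dimms) as pool:
        return sum(pool.map(partial, shard(values, n_dimms)))


total = parallel_sum_of_squares(list(range(10)), 4)
```

No shard reads another shard's data, which is the property that makes the per-DIMM split trivial in this toy reduction.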

• 1st Place ATLAS ’14 Higgs ML Challenge:
  • Deep Learning from 0xdata’s H2O
• Where do FPGA accelerators stand?
  • Explore BDT on CNM accelerator

High-energy physics experiment at CERN’s LHC (collaboration with UF Physics)

• Simics Simulator
  • Functional model
  • Software stack
  • Apps & workload exploration

[Figure: host software stack. Labels: Task, Task; Host Middleware; Queue; Scheduler; Driver; FPGA.]
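The figure labels suggest that tasks enter a queue in the host middleware and a scheduler hands them to the FPGA driver. A minimal sketch of that flow is shown below. `FakeDriver`, `Scheduler`, and their methods are hypothetical names; a plain Python callable stands in for an OpenCL kernel, and first-in-first-out ordering is an assumed policy.

```python
import queue
import threading


class FakeDriver:
    """Stand-in for the FPGA driver: runs a Python callable as the 'kernel'."""

    def run(self, kernel, payload):
        return kernel(payload)


class Scheduler:
    """Drains a FIFO task queue on a worker thread and records results."""

    def __init__(self, driver):
        self.driver = driver
        self.tasks = queue.Queue()
        self.results = {}
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def submit(self, task_id, kernel, payload):
        self.tasks.put((task_id, kernel, payload))

    def _drain(self):
        while True:
            item = self.tasks.get()
            try:
                if item is None:  # sentinel: stop the worker
                    return
                task_id, kernel, payload = item
                try:
                    self.results[task_id] = self.driver.run(kernel, payload)
                except Exception as exc:  # keep the worker alive on failure
                    self.results[task_id] = exc
            finally:
                self.tasks.task_done()

    def close(self):
        """Wait for queued tasks to finish, then stop the worker."""
        self.tasks.put(None)
        self.tasks.join()
        self.worker.join()


scheduler = Scheduler(FakeDriver())
scheduler.submit("sum", sum, [1, 2, 3])
scheduler.submit("max", max, [4, 9, 2])
scheduler.close()
```

After `close()` returns, `scheduler.results` holds one entry per submitted task, with an exception object stored for any task whose kernel failed.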

[Figure labels: Tachyon; File System (Local or HDFS); API.]

• In-memory data exchange
• Reliable file sharing at memory-speed
• Caching of working set files in memory
• Fault-tolerant and distributed

Tachyon utilizes memory aggressively, leveraging data lineage
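The combination of in-memory caching and synchronization with an underlying file system can be illustrated with a small write-through cache. This is not Tachyon's API: `MemoryCentricStore` and its methods are hypothetical, a local directory stands in for the file system, and lineage-based recovery is not modeled at all.

```python
import os
import tempfile


class MemoryCentricStore:
    """Toy write-through cache: reads hit memory, writes also reach disk."""

    def __init__(self, root):
        self.root = root   # directory standing in for the file system
        self.cache = {}    # file name -> bytes held in memory

    def _path(self, name):
        return os.path.join(self.root, name)

    def write(self, name, data):
        self.cache[name] = data
        with open(self._path(name), "wb") as f:  # synchronize to disk
            f.write(data)

    def read(self, name):
        if name not in self.cache:               # miss: load and keep
            with open(self._path(name), "rb") as f:
                self.cache[name] = f.read()
        return self.cache[name]

    def evict(self, name):
        self.cache.pop(name, None)


with tempfile.TemporaryDirectory() as root:
    store = MemoryCentricStore(root)
    store.write("events.bin", b"higgs")
    store.evict("events.bin")                    # drop the in-memory copy
    reloaded = store.read("events.bin")          # served from disk, re-cached
    cached_again = "events.bin" in store.cache
```

Write-through is the simplest synchronization policy to show; a real memory-centric store may defer or batch the write to the underlying file system.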

• OpenCL driver integration

• Container enablement

• Cloud orchestration

• NVM support and NFV

[Figure: Compute Near Memory (CNM) stack. Labels: Application; API; Big Data Framework; API; JOC.]

JOC: Java-to-OpenCL Component

Prototype with PCI, DDR and Direct I/O interfaces

No-Higgs or Higgs
• BDT on CPU (2nd Place ATLAS ’14 Higgs ML Challenge)

Application to architecture transformation
• Utilize parallelism on FPGA
• Leverage low-latency DDR and 100 GB optical links

[Figure: FPGA data path. Labels: 100 GB Transceiver IP; FIFO; Decoder (data filter); Data Reassembly.]

Level 1: FPGA receives and pre-processes data in real time
Level 2: FPGA as high-performance accelerator
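The Level 1 path (FIFO, then a decoder acting as a data filter, then data reassembly) can be mimicked in software as below. The packet layout `(event_id, seq, payload)`, the `keep` predicate, and the function name are assumptions made for this sketch; the poster does not describe the wire format.

```python
from collections import defaultdict, deque


def level1_pipeline(packets, keep):
    """FIFO -> decoder (filter) -> reassembly, modeled in software.

    packets: iterable of (event_id, seq, payload) tuples (assumed format)
    keep:    predicate (event_id, payload) -> bool used as the data filter
    Returns a dict mapping event_id -> payload bytes joined in seq order.
    """
    fifo = deque(packets)            # packets leave in arrival order
    fragments = defaultdict(dict)    # event_id -> {seq: payload}

    while fifo:
        event_id, seq, payload = fifo.popleft()
        if not keep(event_id, payload):   # decoder drops unwanted data
            continue
        fragments[event_id][seq] = payload

    # Reassembly: concatenate each event's fragments by sequence number.
    return {
        event_id: b"".join(parts[seq] for seq in sorted(parts))
        for event_id, parts in fragments.items()
    }


arrivals = [(1, 1, b"lo"), (2, 0, b"xx"), (1, 0, b"hel")]
events = level1_pipeline(arrivals, keep=lambda event_id, payload: event_id != 2)
```

Fragments may arrive out of order, so reassembly sorts by sequence number rather than relying on arrival order.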

QSFP
• Direct I/O
  • Real-time
  • Low latency
  • Low power
• Compute Engine
  • Up to 3 TFLOPS
  • OpenCL kernel
  • BW of host memory

Altera Arria 10 FPGA

Generic implementation

[Figure labels: DRAM, DRAM, DRAM; Pre-processed dataset; To Datastore; Datastore; Synchronize.]

• In-memory data store
  • Memory-centric distributed storage
  • Reliable data sharing at memory speed

Development kit

Cloud Orchestration

Training time for 11 million events: 5 hours!

Xeon E5-2680 @ 2.8 GHz

BDT on MATLAB
• Prediction time: 370 ms
  • Okay for online, real-time prediction
• Training time: 5 hours
  • Grew with increasing data size
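The quoted figures can be turned into a rough training throughput. The arithmetic below assumes that the 5-hour training time and the 11 million events refer to the same MATLAB run on the Xeon, which the poster layout suggests but does not state outright.

```python
# Figures quoted on the poster.
EVENTS = 11_000_000
TRAINING_HOURS = 5

training_seconds = TRAINING_HOURS * 3600          # 18,000 s
events_per_second = EVENTS / training_seconds     # about 611 events/s
ms_per_event = 1000 * training_seconds / EVENTS   # about 1.64 ms per event

print(f"{events_per_second:.0f} events/s, {ms_per_event:.2f} ms per event")
```

At roughly 611 events per second, training throughput rather than the 370 ms prediction time is the figure an accelerator would need to improve.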

• Data affinity
  • Cores cooperate with each other for shared data accesses
• Shared Virtual Memory (SVM)

Accelerator model
CPU and device both access shared data using the same virtual addresses
No explicit data marshaling
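The difference between explicit marshaling and a shared-address model can be shown with a host-only analogy: in one case the data is copied to a staging buffer and copied back, in the other the "device" works on the very same bytes. This is an analogy in plain Python, not OpenCL SVM; `device_kernel`, `run_with_marshaling`, and `run_shared` are hypothetical names.

```python
def device_kernel(buf, factor):
    """Stand-in for an accelerator kernel: scale every byte in place."""
    for i in range(len(buf)):
        buf[i] = (buf[i] * factor) % 256


def run_with_marshaling(host_data, factor):
    """Copy in, compute, copy out; the host buffer is left untouched."""
    staged = bytearray(host_data)   # explicit copy to 'device' memory
    device_kernel(staged, factor)
    return bytes(staged)            # explicit copy back to the host


def run_shared(host_data, factor):
    """Host and 'device' address the same bytes; nothing is copied."""
    view = memoryview(host_data)    # a view of the same memory
    device_kernel(view, factor)


host = bytearray([1, 2, 3])
marshaled = run_with_marshaling(host, 2)   # host is unchanged here
host_before_shared = bytes(host)
run_shared(host, 3)                        # host is modified in place
```

In the shared case the result appears in `host` itself, which mirrors the poster's point that no explicit data marshaling is needed.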

Dave Ojika: Cloud Infrastructure
Jayson Strayer: Platform Silicon
Gaurav Kaul: Health and Life Sciences
Prashanth Thinakaran: Big Data
Darin Acosta: Physics Professor, UF