Exploiting Multithreaded Architectures to Improve Data Management Operations

Layali Rashid

The Advanced Computer Architecture Group @ U of C (ACAG)

Department of Electrical and Computer Engineering

University of Calgary

Outline

The SMT and the CMP Architectures

Join (Hash Join): Motivation, Algorithm, Results

Sort (Radix and Quick Sorts): Motivation, Algorithms, Results

Index (CSB+-Tree): Motivation, Algorithm, Results

Conclusions

The SMT and the CMP Architectures

Simultaneous Multithreading (SMT): multiple threads run simultaneously on a single processor.

Chip Multiprocessor (CMP): more than one processor is integrated on a single chip.

Hash Join Motivation

[Figure: x-axis Tuple Size (Byte): 20, 60, 100, 140]

Hash join is one of the most important operations commonly used in current commercial DBMSs.

The L2 cache load miss rate is a critical factor in main-memory hash join performance.

Goal: increase the level of parallelism in hash join.

[Figure: x-axis Tuple Size: 20, 60, 100, 140]

Architecture-Aware Hash Join (AA_HJ)

Build Index Partition Phase: Tuples are divided equally between threads; each thread has its own set of L2-cache-sized clusters.

The Build and Probe Index Partition Phase: One thread builds a hash table from each key range; the other threads index-partition the probe relation, as in the previous phase.

Probe Phase: See figure.
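The build index partition phase can be sketched roughly as follows. This is a minimal single-process illustration under stated assumptions, not the AA_HJ implementation: the name `index_partition`, the `num_clusters` parameter, and the modulo hash are hypothetical, cache sizing is reduced to a fixed cluster count, and "index partition" is assumed to mean storing references to tuples rather than copies. The per-thread loop runs sequentially here; in a threaded version each iteration of the outer loop would be one thread's work.

```python
def index_partition(tuples, num_threads, num_clusters, key=lambda t: t[0]):
    """Split tuples evenly across threads; each thread fills its own clusters.

    Returns clusters[thread_id][cluster_id] -> list of tuple indexes.
    Indexes are stored instead of tuple copies (an assumption about what
    "index partitioning" means in the slides).
    """
    clusters = [[[] for _ in range(num_clusters)] for _ in range(num_threads)]
    chunk = (len(tuples) + num_threads - 1) // num_threads  # ceiling division
    for tid in range(num_threads):  # one iteration per (simulated) thread
        start = tid * chunk
        end = min(start + chunk, len(tuples))
        for idx in range(start, end):
            # Hypothetical hash: integer keys map to a cluster by modulo.
            cid = hash(key(tuples[idx])) % num_clusters
            clusters[tid][cid].append(idx)
    return clusters


if __name__ == "__main__":
    rows = [(k, "payload") for k in range(10)]
    for tid, per_thread in enumerate(index_partition(rows, 2, 4)):
        print(tid, per_thread)
```

Because every thread writes only to its own set of clusters, this phase needs no locking between threads.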

AA_HJ Results

[Figure: timing vs. Tuple Size (Byte): 20, 60, 100, 140; series: PT, NPT, Index PT, 2, 4, 8, 12, 16]

We achieve speedups ranging from 2 to 4.6 compared to PT on a Quad Intel Xeon Dual Core server.

Speedups for the Pentium 4 with HT range from 2.1 to 2.9 compared to PT.

Memory-Analysis for Multithreaded AA_HJ

[Figure: series NPT, 2, 4, 8, 12, 16]

The decrease in the L2 load miss rate is due to cache-sized index partitioning, constructive cache sharing, and Group Prefetching.

The L1 data cache load miss rate increases slightly, from 1.5% to 4%.

[Figure: series NPT, 2, 4, 8, 12, 16]

The Sort Motivation

Some researchers find that sort algorithms suffer from high L2 cache miss rates, whereas others point out that radix sort has high TLB miss rates.

In addition, the fact that most sort algorithms are sequential makes it hard to produce efficient parallel sort algorithms.

In our work we target Radix Sort (distribution-based sort) and Quick Sort (comparison-based sort).

Our Parallel Sorts

Radix Sort

A hybrid radix sort between Partition Parallel Radix Sort and Cache-Conscious Radix Sort.

Repartition large destination buckets only when they are significantly larger than the L2 cache size.

Quick Sort

Use Fast Parallel Quick Sort.

Dynamically balance the load across threads.

Improve thread parallelism during the sequential clean-up sorting.

Stop the recursive partitioning process when the size of the subarray is almost equal to the largest cache size.
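The stopping rule for recursive partitioning can be illustrated with a small sequential sketch. It is not Fast Parallel Quick Sort and leaves out threading and load balancing entirely. The parameter `cutoff` is an assumption that stands in for the number of keys that fit in the largest cache, and Python's built-in `sorted` stands in for whatever in-cache sort finishes each small subarray.

```python
def partition(a, lo, hi):
    """Hoare-style partition of a[lo..hi] around the middle element.

    Returns (i, j) with j < i; a[lo..j] <= pivot and a[i..hi] >= pivot.
    """
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    return i, j


def cache_aware_quicksort(a, cutoff):
    """Sort list a in place; stop partitioning once a subarray has at most
    cutoff keys (cutoff >= 1, assumed to approximate the cache capacity)."""
    stack = [(0, len(a) - 1)]  # explicit stack instead of recursion
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= cutoff:
            # Subarray is assumed to fit in cache: finish it in one go.
            a[lo:hi + 1] = sorted(a[lo:hi + 1])
            continue
        i, j = partition(a, lo, hi)
        stack.append((lo, j))
        stack.append((i, hi))


if __name__ == "__main__":
    data = [5, 3, 9, 1, 7, 2, 8, 6, 4, 0]
    cache_aware_quicksort(data, cutoff=4)
    print(data)
```

In a parallel version, the entries on the explicit stack are the natural units of work to hand out to threads.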

The Sort Timing for the Random Datasets on the SMT Architecture

Radix Sort and Quick Sort show low L1 and L2 cache miss rates on our machines. Radix Sort has a DTLB store miss rate of up to 26%.

Radix Sort achieves only a slight speedup on SMT architectures, not exceeding 3%, due to its CPU-intensive nature.

Quick Sort execution time improves by about 25% to 30%.

[Figure: Quick Sort, Time vs. Number of Keys (1.E+07 to 6.E+07)]

[Figure: Radix Sort, Time vs. Number of Keys (1.E+07 to 6.E+07); series: LSB, 1, 2]

The Sort Timing for the Random Datasets on the CMP Architecture

[Figure: Quick Sort, x-axis Number of Keys (1.E+07 to 6.E+07); series: 1, 2, 4, 8, 12, 16]

[Figure: Radix Sort, x-axis Number of Keys (1.E+07 to 6.E+07); series: LSB, 1, 2, 4, 8, 12, 16]

Our speedups for Radix Sort range from 54% for two threads up to 300%, for thread counts from 2 to 8.

Our speedups for Quick Sort range from 34% to 417%.

The Index Motivation

Although the CSB+-tree shows a significant speedup over the B+-tree, experiments show that a large fraction of its execution time is still spent waiting for data.

The L2 load miss rate for single-threaded CSB+-tree is as high as 42%.

Dual-threaded CSB+-Tree

One CSB+-Tree.

Single thread for the bulkloading.

Two threads for probing.

Unlike inserts and deletes, search needs no synchronization since it involves reads only.
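The arrangement above (one shared index, one thread to build it, two threads to probe it) can be sketched as follows. This is only an illustration under assumptions: a sorted Python list searched with `bisect` stands in for the CSB+-tree, and the function names are hypothetical. In standard CPython builds the global interpreter lock keeps the two threads from executing bytecode in parallel, so the sketch shows the lock-free read-only structure, not the speedup.

```python
import bisect
import threading


def bulkload(keys):
    """Single-threaded build; a sorted list stands in for the CSB+-tree."""
    return sorted(keys)


def probe(index, probes, out, slot):
    """Read-only search over the shared index.

    No lock is taken: the index is never modified after bulkloading, and
    each thread writes only to its own slot in out.
    """
    hits = 0
    for k in probes:
        pos = bisect.bisect_left(index, k)
        if pos < len(index) and index[pos] == k:
            hits += 1
    out[slot] = hits


def dual_threaded_search(keys, probes):
    """Count how many probe keys are present, using two probing threads."""
    index = bulkload(keys)
    half = len(probes) // 2
    out = [0, 0]
    threads = [
        threading.Thread(target=probe, args=(index, probes[:half], out, 0)),
        threading.Thread(target=probe, args=(index, probes[half:], out, 1)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(out)


if __name__ == "__main__":
    print(dual_threaded_search(range(0, 100, 2), list(range(10))))
```

Inserts or deletes running at the same time would break the read-only assumption and would need synchronization, which matches the restriction to search stated above.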

Index Results

[Figures: two plots vs. Number of Keys (1.E+02 to 1.E+07); series: Single-Threaded, Dual-Threaded]

Speedups for dual-threaded CSB+-tree range from 19% to 68% compared to single-threaded CSB+-tree.

Running two threads for memory-bound operations offers more chances to keep the functional units working.

Sharing one CSB+-tree between both of our threads results in constructive behaviour and a reduction of 6% to 8% in the L2 miss rate.

Conclusions

State-of-the-art parallel architectures (SMT and CMP) have opened opportunities for the improvement of software operations to better utilize the underlying hardware resources.

It is essential to have efficient implementations of database operations.

We propose architecture-aware multithreaded database algorithms for the most important database operations (joins, sorts and indexes).

We characterize the timing and memory behaviour of these database operations.

The End

Backup Slides

Figure 1‑1: The SMT Architecture

Figure 1‑2: Comparison between the SMT and the Dual Core Architectures

Figure 1‑3: Combining the SMT and the CMP Architectures

Figure 2‑1: The L1 Data Cache Load Miss Rate for Hash Join

Figure 2‑2: The L2 Cache Load Miss Rate for Hash Join

Figure 2‑3: The Trace Cache Miss Rate for Hash Join

[Plot: x-axis Tuple Size: 20, 60, 100, 140]

Figure 2‑4: Typical Relational Table in RDBMS

Figure 2‑5: Database Join

Figure 2‑6: Hash Equi-join Process

Figure 2‑7: Hash Table Structure

Figure 2‑8: Hash Join Base Algorithm

partition R into R0, R1, …, Rn-1
partition S into S0, S1, …, Sn-1
for i = 0 until i = n-1
    use Ri to build hash-table i
for i = 0 until i = n-1
    probe Si using hash-table i
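The base algorithm above can be written out as a small runnable sketch. It is a single-threaded illustration only: the function name, the tuple layout (join key in position 0), and the modulo hash used for partitioning are assumptions made for the example, not details taken from the slides.

```python
from collections import defaultdict


def partitioned_hash_join(R, S, n, key=lambda t: t[0]):
    """Equi-join R and S on key using n partitions."""
    # Partition both relations with the same function, so tuples with
    # equal keys always land in partitions with the same number.
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for r in R:
        r_parts[hash(key(r)) % n].append(r)
    for s in S:
        s_parts[hash(key(s)) % n].append(s)

    # Build phase: one hash table per partition of R.
    tables = []
    for i in range(n):
        table = defaultdict(list)
        for r in r_parts[i]:
            table[key(r)].append(r)
        tables.append(table)

    # Probe phase: look up each tuple of Si in hash table i only.
    result = []
    for i in range(n):
        for s in s_parts[i]:
            for r in tables[i].get(key(s), []):
                result.append((r, s))
    return result


if __name__ == "__main__":
    R = [(1, "a"), (2, "b"), (3, "c")]
    S = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]
    for pair in partitioned_hash_join(R, S, n=2):
        print(pair)
```

Partitioning first keeps each hash table small, which is what makes it possible to size partitions to the cache.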

Figure 2‑9: AA_HJ Build Phase Executed by one Thread

Figure 2‑10: AA_HJ Probe Index Partitioning Phase Executed by one Thread

Figure 2‑11: AA_HJ S-Relation Partitioning and Probing Phases

Figure 2‑12: AA_HJ Multithreaded Probing Algorithm

Table 2‑1: Machines Specifications

Table 2‑2: Number of Tuples for Machine 1

Table 2‑3: Number of Tuples for Machine 2

Figure 2‑13: Timing for three Hash Join Partitioning Techniques

[Plot: x-axis Tuple Size (Byte): 20, 60, 100, 140; series: PT, NPT, Index PT]

Figure 2‑14: Memory Usage for three Hash Join Partitioning Techniques

[Plot: x-axis Tuple Size (Byte): 20, 60, 100, 140; series: PT, NPT, Index PT]

Figure 2‑15: Timing for Dual-threaded Hash Join

[Plot: x-axis Tuple Size (Byte): 20, 60, 100, 140; series: SMT+PT, SMT+NPT, SMT+Index PT]

Figure 2‑16: Memory Usage for Dual-threaded Hash Join

[Plot: x-axis Tuple Size (Byte): 20, 60, 100, 140; series: SMT+PT, SMT+NPT, SMT+Index PT]

Figure 2‑17: Timing Comparison of all Hash Join Algorithms

[Plot: x-axis Tuple Size (Byte): 20, 60, 100, 140; y-axis 0.0 to 4.0; series: AA_HJ+GP+SMT, AA_HJ+SMT, SMT+NPT, NPT, SMT+PT, PT, SMT+Index PT, Index PT]

Figure 2‑18: Memory Usage Comparison of all Hash Join Algorithms

[Plot: x-axis Tuple Size (Byte): 20, 60, 100, 140; series: AA_HJ+GP+SMT, AA_HJ+SMT, SMT+NPT, NPT, SMT+PT, PT, SMT+Index PT, Index PT]

Figure 2‑19: Speedups due to the AA_HJ+SMT and the AA_HJ+GP+SMT Algorithms

[Plot: x-axis Tuple Size (Byte): 20, 60, 100, 140; series: PT, SMT+PT, Index PT, SMT+Index PT, AA_HJ+SMT, AA_HJ+GP+SMT]

Figure 2‑20: Varying Number of Clusters for the AA_HJ+GP+SMT

[Plot: x-axis Number of Clusters: 32, 64, 128, 512, 1024, 2048]

Figure 2‑21: Varying the Selectivity for Tuple Size = 100Bytes

[Plot: x-axis Selectivity: 20, 40, 60, 80, 100; y-axis in seconds; series: PT, SMT+PT, AA_HJ+SMT, AA_HJ+GP+SMT]

Figure 2‑22: Time Breakdown Comparison for the Hash Join Algorithms for tuple sizes 20Bytes and 100Bytes

[Plot: bars for tuple sizes 100 and 20 for each of NPT, SMT+NPT, PT, SMT+PT, Index PT, SMT+Index PT, AA_HJ+SMT, AA_HJ+GP+SMT; segments: Build Index Partition, Probe Index Partition, Partition, Build, Probe]

Figure 2‑23: Timing for the Multi-threaded Architecture-Aware Hash Join

[Plot: x-axis Tuple Size (Byte): 20, 60, 100, 140; series: PT, NPT, Index PT, 2, 4, 8, 12, 16]

Figure 2‑24: Speedups for the Multi-Threaded Architecture-Aware Hash Join

[Plot: x-axis 20, 60, 100, 140; series: PT, Index PT, 2, 4, 8, 12, 16]

Figure 2‑25: Memory Usage for the Multi-Threaded Architecture-Aware Hash Join

Figure 2‑26: Time Breakdown Comparison for Hash Join Algorithms

[Plot: x-axis Tuple Size: 20, 60, 100, 140, each with bars for Index PT and 2, 4, 8, 12; y-axis in seconds; segments: Partition, Build Index Partition, Probe Index Partition, Build, Probe; annotated values: 35.91 second, 27.70 second]

Figure 2‑27: The L1 Data Cache Load Miss Rate for NPT and AA_HJ

[Plot: series NPT, 2, 4, 8, 12, 16]

Figure 2‑28: Number of Loads for NPT and AA_HJ

[Plot: y-axis 0.E+00 to 6.E+09; series NPT, 2, 4, 8, 12, 16]

Figure 2‑29: The L2 Cache Load Miss Rate for NPT and AA_HJ

[Plot: series NPT, 2, 4, 8, 12, 16]

Figure 2‑30: The Trace Cache Miss Rate for NPT and AA_HJ

[Plot: series NPT, 2, 4, 8, 12, 16]

Figure 2‑31: The DTLB Load Miss Rate for NPT and AA_HJ

[Plot: series NPT, 2, 4, 8, 12, 16]

Figure 3‑1: The LSD Radix Sort

for (i = 0; i < number_of_digits; i++)
    sort source-array based on digit i;
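The loop above becomes concrete once the per-digit sort is fixed. The sketch below uses a stable counting pass for each digit, which is one common way to do it. It is an illustration only: the function name and the parameters `bits_per_digit` and `key_bits` are assumptions, and keys are assumed to be non-negative integers below 2**key_bits.

```python
def lsd_radix_sort(keys, bits_per_digit=8, key_bits=32):
    """Return the keys sorted, processing the least significant digit first."""
    radix = 1 << bits_per_digit
    mask = radix - 1
    src = list(keys)
    for shift in range(0, key_bits, bits_per_digit):
        # Count how many keys fall into each digit value.
        counts = [0] * radix
        for k in src:
            counts[(k >> shift) & mask] += 1
        # Prefix sums give the starting offset of each bucket.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        # Scatter keys in input order, which keeps the pass stable.
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1
        src = dst
    return src


if __name__ == "__main__":
    print(lsd_radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```

The scatter step writes to `radix` different destination regions at once, which is the access pattern usually associated with the TLB pressure mentioned in the sort motivation.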

Figure 3‑2: The Counting LSD Radix Sort Algorithm

Figure 3‑3: Parallel Radix Sort Algorithm

Table 3‑1: Memory Characterization for LSD Radix Sort with Different Datasets

Figure 3‑4: Radix Sort Timing for the Random Datasets on Machine 2

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); y-axis in seconds; series: LSB, 1, 2, 4, 8, 12, 16]

Figure 3‑5: Radix Sort Timing for the Gaussian Datasets on Machine 2

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); series: LSB, 1, 2, 4, 8, 12, 16]

Figure 3‑6: Radix Sort Timing for Zero Datasets on Machine 2

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); series: LSB, 1, 2, 4, 8, 12, 16]

Figure 3‑7: Radix Sort Timing for the Random Datasets on Machine 1

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); series: LSB, 1, 2]

Figure 3‑8: Radix Sort Timing for the Gaussian Datasets on Machine 1

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); series: LSB, 1, 2]

Figure 3‑9: Radix Sort Timing for the Zero Datasets on Machine 1

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); series: LSB, 1, 2]

Figure 3‑10: The DTLB Stores Miss Rate for the Radix Sort on Machine 2 (Random Datasets)

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); series: LSB, 1, 2, 4, 8, 16]

Figure 3‑11: The L1 Data Cache Load Miss Rate for the Radix Sort on Machine 2 (Random Datasets)

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); series: LSB, 1, 2, 4, 8, 12, 16]

Table 3‑2: Memory Characterization for Memory-Tuned Quick Sort with Different Datasets

Figure 3‑12: Quicksort Timing for the Random Datasets on Machine 2

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); series: 1, 2, 4, 8, 12, 16]

Figure 3‑13: Quicksort Timing for the Random Dataset on Machine 1

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07)]

Figure 3‑14: Quicksort Timing for the Gaussian Datasets on Machine 2

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); series: 1, 2, 4, 8, 12, 16]

Figure 3‑15: Quicksort Timing for the Gaussian Dataset on Machine 1

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07)]

Figure 3‑16: Quicksort Timing for the Zero Datasets on Machine 2

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07); series: 1, 2, 4, 8, 12, 16]

Figure 3‑17: Quicksort Timing for the Zero Dataset on Machine 1

[Plot: x-axis Number of Keys (1.E+07 to 6.E+07)]

Table 3‑3: The Sort Results for Machine 1

Table 3‑4: The Sort Results for Machine 2

Figure 4‑1: Search Operation on an Index Tree

Figure 4‑2: Differences between the B+-Tree and the CSB+-Tree

Figure 4‑3: Dual-Threaded CSB+-Tree for the SMT Architectures

Figure 4‑4: Timing for the Single and Dual-Threaded CSB+-Tree


Figure 4‑5: The L1 Data Cache Load Miss Rate for the Single and Dual-Threaded CSB+-Tree

Figure 4‑6: The Trace Cache Miss Rate for the Single and Dual-Threaded CSB+-Tree

[Plot: x-axis Number of Keys (1.E+02 to 1.E+07)]

Figure 4‑7: The L2 Load Miss Rate for the Single and Dual-Threaded CSB+-Tree

[Plot: x-axis Number of Keys (1.E+02 to 1.E+07)]

Figure 4‑8: The DTLB Load Miss Rate for the Single and Dual-Threaded CSB+-Tree

[Plot: x-axis Number of Keys (1.E+02 to 1.E+07); y-axis 0% to 20%]

Figure 4‑9: The ITLB Load Miss Rate for the Single and Dual-Threaded CSB+-Tree

