Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

24
Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi

Transcript of Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Page 1: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Implementing a Speech Recognition System on a

GPU using CUDA

Presented by

Omid Talakoub

Astrid Yi

Page 2: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

OutlineBackground

Motivation

Speech recognition algorithm

Implementation steps

GPU implementation strategies

Data flow and representation

Profiler results

Floating point accuracy

Future optimizationsApril 23rd, 2009 2

Page 3: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

BackgroundSpeech recognition system:

1. Speaker-dependent or speaker-independent

2. Isolated words or continuous speech

Practical applications use isolated-word recognition systems

April 23rd, 2009 3

Page 4: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Motivation2 phases in speech recognition:

1. Memorize a set of reference templates

2. Take a test template and return the closest reference template match

Recognition accuracy improved with a larger set of reference templates

But, it takes more time to find the closest match

April 23rd, 2009 4

Page 5: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Algorithm Overview

April 23rd, 2009 5

Sound Recorder

Voice Activity

Detection (VAD)

Segmentation

Mel Frequency Cepstral

Coefficients (MFCC)

Dynamic Time

WarpingTest Word

Recognized Word

MFCC

Page 6: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

6

Words are parameterised on a frame-by-frame basis

Choose frame length, over which speech remains reasonably stationary

Overlap frames e.g. 25ms frames, 10ms frame shift

We want to compare frames of test and reference words i.e. calculate distances between them

25ms

10ms

Hamming Window

Page 7: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

7

• Problem:

Number of frames won’t always correspond

• Easy:

Sum differences between corresponding frames

Calculating Distances

Page 8: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Feature Extraction

Calculating Mel-frequency cepstral coefficients (MFCCs):

April 23rd, 2009 8

MFCCs are coefficients of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

Page 9: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

“seven”

Cepstral domain

DCT

Log energy

Mel-scaled filter bank

Fourier

x(t)

Time

Filter #

MFCC algorithm

Page 10: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Dynamic Time Warping

i

j

t(i-1)

r(j)

),1(

)1,1(

)1,(

min

)(),(),(

jiD

jiD

jiD

jritjiD

),( jiD

r(j-1)

t(i)

t: input MFCC matrix (Each row is a frame’s feature.)r: reference MFCC matrixLocal paths: 0-45-90 degrees

DTW recurrence:

Page 11: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Local Path Constraints

0-45-90 local paths

jiD ,

),1(

)1,1(

)1,(

min

)()(),(

jiD

jiD

jiD

jritjiD

1, jiD 1,1 jiD

jiD ,1

Page 12: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Other VariantsLocal constraints Start/ending area

Page 13: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

DTW Paths of “Match Corners”

i

j

Page 14: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

14

Dynamic Time Warping The Brute Force of the Engineering Approach

TE

MP

LA

TE

(W

OR

D 7

)

UNKNOWN WORD

Page 15: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Another Example of DTW Minimum Path

April 23rd, 2009 15

Page 16: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Reference ImplementationReference code implemented on CPU

Provided a base on which to compare results from GPU

Based on original MATLAB code

Used as a base to compare performance increase by shifting work to the GPU

Single threaded with no SIMD type execution used

April 23rd, 2009 16

Page 17: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

GPU Implementation StrategiesTwo main approaches to implementation

1. Do more work in the same amount of time

2. To the same amount of work in less time

How to choose an approach in general

1. Amdahl’s law states that performance is dictated by how much can be parallelized

2. If algorithm is serial, do more work simultaneously instead of faster

April 23rd, 2009 17

Page 18: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Data Flow and RepresentationMatrices stored in row major format with a pitch

equal to width

After initial data load, all operations on data take place on device and transform the data in place or to another location

Transformation implemented in the form of CUDA kernels

Kernels can be invoked to process all the data at a specific transformation stage

April 23rd, 2009 18

Page 19: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Profiler ResultsCPU implementation was used as a base

Profiled the generation of the MFCCs which is the computationally expensive portion

Averaging the results over 5 runs of 45 MFCC calls yields the following results:

1. GPU Time: 1.4080285 seconds

2. CPU Time: 8.6354016 seconds

3. Speedup: 5.8

April 23rd, 2009 19

Page 20: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Kernel GPU UtilizationHalf of GPU runtime is limited to a single kernel

Further optimizations should concentrate on first three listed kernels

April 23rd, 2009 20

Page 21: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Kernel Runtimes

April 23rd, 2009 21

Page 22: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Floating Point Accuracy

April 23rd, 2009 22

Floating point accuracy was an issue during verification

Two main problem areas:

1. FFT

2. DCT

Problems from the FFT could not be fixed due to the existing implementation in CUFFT

Problems in the DCT occurred due to cancellation

Page 23: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Floating Point Accuracy con’tDetermined that differences between

implementations did not affect results

To continue comparing against reference results, CPU data was loaded for the GPU during verification

April 23rd, 2009 23

Page 24: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi.

Future OptimizationsUse the already optimized CUBLAS library for

some operations

Requires reordering of data to use column major ordering

IIR filters used have coefficients such that their invocations can be better parallelized

Allocate memory for matrices based using pitches to align data, ie. cudaMalloc2D

April 23rd, 2009 24