Hybrid Time-Scale Modification of Audio Patrick-André Savard, Philippe Gournay and Roch Lefebvre Université de Sherbrooke, Québec, Canada

Page 1: Patrick-André Savard, Philippe Gournay and Roch Lefebvre Université de Sherbrooke, Québec, Canada.

Hybrid Time-Scale Modification of Audio

Patrick-André Savard, Philippe Gournay and Roch Lefebvre

Université de Sherbrooke, Québec, Canada

Page 2: Presentation content

Problem description
Prior art
◦ Synchronized overlap-add with fixed synthesis (SOLAFS)
◦ Improved phase vocoder
Hybrid time-scale modification
◦ High-level algorithm
◦ Classification
◦ Main algorithm
◦ Mode transition
Performance evaluation
◦ Classification performance
◦ Subjective testing results

Author note (Patrick-André): Main algorithm before classification?

Page 3: Problem description

What is time-scale modification?
Subject of interest:
◦ Subjective quality of time-scaled signals
Existing methods:
◦ Time vs frequency approaches
◦ High quality results on specific types of signals
TSM applied to various signal types
◦ Can be speech, music, or mixed-type signals
There is a need for a more “universal” method

Page 4: Prior art: Synchronized overlap-add with fixed synthesis (SOLAFS)

[Figure: SOLAFS diagram showing the input signal and the output signal. Labels: Sa, Ss, WLEN, delay.]
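
The SOLAFS principle can be approximated in a few lines of Python. This is a minimal sketch, not the authors' implementation. It assumes that Sa and Ss are the analysis and synthesis steps, that WLEN is the window length, and that the time-scale factor is `alpha = Ss/Sa` (so `alpha > 1` lengthens the signal). The overlap length `wov` and the search range `kmax` are hypothetical parameters that do not appear on the slide.

```python
import math

def norm_xcorr(a, b):
    """Normalized cross-correlation of two equal-length sequences."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0

def solafs(x, alpha, wlen=200, wov=100, kmax=100):
    """Time-scale x by alpha (assumed Ss/Sa). Returns (y, rmax_per_window)."""
    ss = wlen - wov              # fixed synthesis step Ss
    sa = ss / alpha              # analysis step Sa
    y = list(x[:wlen])           # the first window is copied unchanged
    rmax_list = []
    m = 1
    while int(round(m * sa)) + kmax + wlen <= len(x):
        start = int(round(m * sa))
        tail = y[-wov:]          # output samples the new window must overlap
        # Search for the lag whose input segment best matches the output tail.
        best_k, best_r = 0, -2.0
        for k in range(kmax + 1):
            r = norm_xcorr(tail, x[start + k:start + k + wov])
            if r > best_r:
                best_k, best_r = k, r
        seg = x[start + best_k:start + best_k + wlen]
        # Linear cross-fade over the overlap, then append the remainder.
        base = len(y) - wov
        for i in range(wov):
            w = i / wov
            y[base + i] = (1.0 - w) * tail[i] + w * seg[i]
        y.extend(seg[wov:])
        rmax_list.append(best_r)
        m += 1
    return y, rmax_list

# Usage: a tone with a period of 80 samples, stretched by a factor of 2.
tone = [math.sin(2 * math.pi * n / 80) for n in range(2000)]
stretched, rmax = solafs(tone, alpha=2.0)
```

Because the output position of each window is fixed, only the input read position is searched. The per-window maxima in `rmax` are the kind of quantity the classification stage described later relies on.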

Page 5: Prior art: Improved phase vocoder

Based on the block-by-block STFT analysis/synthesis model
STFT phases are updated so as to preserve instantaneous frequencies
STFT amplitudes are preserved
STFT modification improvements:
◦ Peak detection
◦ Compute instantaneous frequency for peaks
◦ Define regions of influence (ROIs)
◦ Update peak phases
◦ Apply phase locking to ROIs
[Figure: phase vocoder block diagram: FFT, STFT modification stage, IFFT, overlap-add and gain control. Labels: N, Ra, Rs.]
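
The phase update at the heart of the STFT modification stage can be sketched as follows. This is an illustrative sketch under stated assumptions: N is taken to be the FFT size, and Ra and Rs the analysis and synthesis hop sizes. The `lock_region` helper uses identity phase locking, one common rule from the phase vocoder literature; the slide does not say which locking rule is applied. All function names are hypothetical.

```python
import math

def wrap(phase):
    """Wrap a phase in radians to the interval [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi

def inst_freq(phi_prev, phi_cur, k, n_fft, ra):
    """Instantaneous frequency (rad/sample) of bin k from two analysis phases."""
    omega_k = 2.0 * math.pi * k / n_fft            # bin centre frequency
    dev = wrap(phi_cur - phi_prev - ra * omega_k)  # deviation from the bin centre
    return omega_k + dev / ra

def update_peak_phase(phi_syn_prev, omega_inst, rs):
    """Advance the synthesis phase of a peak by one synthesis step Rs."""
    return phi_syn_prev + rs * omega_inst

def lock_region(phi_ana, peak, phi_syn_peak):
    """Apply the peak's phase rotation to every bin of its region (assumed rule)."""
    rotation = phi_syn_peak - phi_ana[peak]
    return [p + rotation for p in phi_ana]

# Usage: a sinusoid at 0.065 rad/sample observed in bin 10 of a 1024-point FFT.
n_fft, ra, rs, true_omega = 1024, 128, 256, 0.065
phi0 = 0.3
phi1 = wrap(phi0 + true_omega * ra)
omega = inst_freq(phi0, phi1, 10, n_fft, ra)
phi_syn = update_peak_phase(phi0, omega, rs)
region = lock_region([0.1, phi1, 0.5], 1, phi_syn)
```

The amplitudes are left untouched, in line with the slide; only the phases are recomputed so that each peak keeps its estimated frequency at the new hop size.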

Page 6: Hybrid time-scale modification: High-level algorithm

Uses a frame-by-frame model
Each frame goes through a classifier
Signals identified as monophonic are processed using SOLAFS
Signals identified as polyphonic or noisy are processed using the phase vocoder
[Figure: flowchart. Read input frame, then classify signal. Monophonic: process samples using SOLAFS. Polyphonic or noisy: process samples using the phase vocoder. Then write output frame.]
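
The flowchart amounts to a per-frame dispatch loop. The sketch below is only a skeleton: `classify`, `solafs_frame` and `pv_frame` are hypothetical callables supplied by the caller, and the stand-ins in the usage example are placeholders rather than real time-scale processors.

```python
def hybrid_tsm(frames, classify, solafs_frame, pv_frame):
    """Route each input frame to SOLAFS or to the phase vocoder."""
    output, modes = [], []
    for frame in frames:                   # read input frame
        label = classify(frame)            # classify signal
        if label == "monophonic":
            out = solafs_frame(frame)      # time-domain path
            modes.append("SOLAFS")
        else:                              # polyphonic or noisy
            out = pv_frame(frame)          # frequency-domain path
            modes.append("PV")
        output.extend(out)                 # write output frame
    return output, modes

# Usage with stand-in callables (placeholders only).
frames = [[0.0, 1.0], [0.5, -0.5], [1.0, 1.0]]
labels = iter(["monophonic", "noisy", "polyphonic"])
out, modes = hybrid_tsm(
    frames,
    classify=lambda f: next(labels),
    solafs_frame=lambda f: f + f,   # placeholder: duplicates the frame
    pv_frame=lambda f: f + f,       # placeholder: duplicates the frame
)
```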

Author note (Patrick-André): reintroduce SOLAFS vs. phase vocoder = hybrid

Page 7: Hybrid time-scale modification: Classification

Goal:
◦ Discriminate monophonic/polyphonic/noise signals
Method used:
◦ Test the maximum of the normalized cross-correlation (C.C.) measure in SOLAFS for each analysis window
[Figure: two panels, "Music Signal" and "Speech Signal" (with unvoiced and voiced regions marked). Each panel plots amplitude against time (ms) and normalized cross-correlation against synthesis window number.]
Voiced speech: high C.C.
Music: low to medium C.C.
Unvoiced speech: low and high C.C.
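
The classification feature can be illustrated with a short sketch. It assumes the decision is a plain threshold test on the maximum normalized cross-correlation, and it borrows the value 0.6 from the Tmax setting quoted on the evaluation slides. The function names, the lag range `kmax` and the test signals are hypothetical.

```python
import math
import random

def max_norm_xcorr(ref, signal, kmax):
    """Largest normalized cross-correlation of ref with signal[k:k+len(ref)]."""
    e_ref = sum(v * v for v in ref)
    best = -1.0
    for k in range(kmax + 1):
        cand = signal[k:k + len(ref)]
        den = math.sqrt(e_ref * sum(v * v for v in cand))
        if den > 0.0:
            best = max(best, sum(a * b for a, b in zip(ref, cand)) / den)
    return best

def classify(rmax, threshold=0.6):
    """Threshold test on the correlation maximum (assumed decision rule)."""
    return "monophonic" if rmax >= threshold else "polyphonic_or_noisy"

# Usage: a periodic tone versus white noise.
tone = [math.sin(2 * math.pi * n / 80) for n in range(400)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]
r_tone = max_norm_xcorr(tone[300:400], tone, kmax=100)
r_noise = max_norm_xcorr(noise[300:400], noise, kmax=100)
```

A strongly periodic signal finds a near-perfect match within the lag range, while noise does not, which is the contrast the slide draws between voiced speech and the other signal classes.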

Author note (Patrick-André): features slide
Author note (Patrick-André): emphasize the variation of the cross-correlation for unvoiced speech and music

Page 8: Hybrid time-scale modification: Main algorithm

Default method: SOLAFS
Switches to phase vocoder when Rmax<Txcorr
Constraint on minimum length of a SOLAFS synthesis segment
[Figure: two timelines spanning Frame 1 and Frame 2. Labels: SOLAFS processing, Rmax<Txcorr, phase vocoder processing, discarded.]
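
The switching rule can be sketched as a small planning function. The slide does not spell out how the minimum-length constraint is enforced, so the sketch adopts one plausible reading of the diagram: if SOLAFS has produced fewer than `min_solafs_windows` windows when Rmax drops below Txcorr, its output for the frame is discarded and the whole frame goes to the phase vocoder. The function and parameter names are hypothetical.

```python
def plan_frame(rmax_values, t_xcorr, min_solafs_windows):
    """Return (n_solafs_windows, switch_to_pv) for one frame.

    rmax_values holds the maximum normalized cross-correlation of each
    SOLAFS window in the frame, in processing order.
    """
    for i, rmax in enumerate(rmax_values):
        if rmax < t_xcorr:                # SOLAFS match too weak: switch
            if i < min_solafs_windows:    # SOLAFS segment would be too short
                return 0, True            # discard it, phase-vocode the frame
            return i, True                # keep i SOLAFS windows, then switch
    return len(rmax_values), False        # the whole frame stays in SOLAFS

# Usage with made-up correlation values.
print(plan_frame([0.9, 0.8, 0.95], 0.6, 2))      # (3, False)
print(plan_frame([0.9, 0.8, 0.7, 0.4], 0.6, 2))  # (3, True)
print(plan_frame([0.9, 0.3, 0.8], 0.6, 2))       # (0, True)
```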

Page 9: Hybrid time-scale modification: SOLAFS to phase vocoder transition

Phase vocoder initialization:
◦ Synthesis padded with input samples
◦ Initialization based on matching input/output samples
Gain control:
◦ More padding needed
◦ Synthesis further padded and windowed to reproduce a phase vocoder output
[Figure: transition diagram. Labels: last SOLAFS synthesis window; output signal padded with input samples; initialization based on matching input/output samples; previously padded synthesis; more padding using input samples; resulting synthesis is windowed; first phase vocoder synthesis window overlaps coherently.]
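
The slide does not detail the gain-control computation. The sketch below only illustrates the general overlap-add effect that makes such padding necessary: the sum of overlapping windows is constant in steady state but ramps up over the first windows, so a phase vocoder that starts mid-signal needs already-windowed samples to overlap with. It uses a single periodic Hann window for simplicity, which is an assumption; a real phase vocoder typically involves both an analysis and a synthesis window.

```python
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def window_sum(n, hop, num_frames):
    """Sum of num_frames Hann windows of length n placed every hop samples."""
    total = [0.0] * (n + hop * (num_frames - 1))
    w = hann(n)
    for j in range(num_frames):
        for i in range(n):
            total[j * hop + i] += w[i]
    return total

# Usage: window length 8, hop 2 (75 % overlap), 6 frames.
gain = window_sum(n=8, hop=2, num_frames=6)
```

With these values the samples covered by four windows sum to a constant gain, while the first and last samples are covered by fewer windows and fall short of it.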

Page 10: Hybrid time-scale modification: Phase vocoder to SOLAFS transition

Current frame’s first analysis window is out of phase with current output signal
Assume that the current input frame contains a stationary signal
First input window is one phase vocoder analysis step ahead
First SOLAFS segment is overlap-added (OLA) at the last phase vocoder synthesis step
SOLAFS synthesis samples (after the first OLA region) replace synthesis samples obtained by the phase vocoder
[Figure: transition diagram. Labels: previous frame; current frame; synthesis signal (before transition); first SOLAFS synthesis window; subsequent SOLAFS synthesis windows; current frame’s first analysis window (not in phase with current output); approximately in phase with current output.]

Page 11: Performance evaluation: Classification of a speech signal

Signal length = 1 second
Tmax=0.6
Unvoiced speech is successfully detected
◦ Triggers phase vocoder processing
[Figure: two panels against time (s), 0 to 1: "Time-scaled speech signal (α=2, Tmax=0.6)" and "Classification results" (SOLAFS or phase vocoder).]

Page 12: Performance evaluation: Classification of a music signal

Signal length = 25 seconds
Tmax=0.6
Classification results:
◦ 91 % phase vocoder
◦ 9 % SOLAFS
[Figure: two panels against time (s), 0 to 25: "Time-scaled music signal (α=2, Tmax=0.6)" and "Classification results" (SOLAFS or phase vocoder).]

Page 13: Performance evaluation: Subjective testing

A/B method
Speech, music and mixed content (speech over music) samples tested
Hybrid method compared to stand-alone techniques
Comparisons performed on compressed and expanded signals
Eight listeners took part in the test
Samples evaluated using a 5-step scale

Author note (Patrick-André): cases where the individual methods fail; complementary signals

Page 14: Performance evaluation: Results, hybrid vs SOLAFS, α=1.75

[Figure: bar chart over the categories H >> SOLA, H > SOLA, H = SOLA, H < SOLA, H << SOLA; vertical axis from 0% to 70%; series: Speech.]

Page 15: Performance evaluation: Results, hybrid vs SOLAFS, α=1.75

[Figure: bar chart over the categories H >> SOLA, H > SOLA, H = SOLA, H < SOLA, H << SOLA; vertical axis from 0% to 70%; series: Speech, Music.]

Page 16: Performance evaluation: Results, hybrid vs SOLAFS, α=1.75

[Figure: bar chart over the categories H >> SOLA, H > SOLA, H = SOLA, H < SOLA, H << SOLA; vertical axis from 0% to 70%; series: Speech, Music, Mixed.]

Page 17: Performance evaluation: Results, hybrid vs phase vocoder, α=1.75

[Figure: bar chart over the categories H >> PV, H > PV, H = PV, H < PV, H << PV; vertical axis from 0% to 50%; series: Speech.]

Page 18: Performance evaluation: Results, hybrid vs phase vocoder, α=1.75

[Figure: bar chart over the categories H >> PV, H > PV, H = PV, H < PV, H << PV; vertical axis from 0% to 50%; series: Speech, Music.]

Page 19: Performance evaluation: Results, hybrid vs phase vocoder, α=1.75

[Figure: bar chart over the categories H >> PV, H > PV, H = PV, H < PV, H << PV; vertical axis from 0% to 50%; series: Speech, Music, Mixed.]

Page 20: Performance evaluation: Results, hybrid vs SOLAFS, α=0.75

[Figure: bar chart over the categories H >> SOLA, H > SOLA, H = SOLA, H < SOLA, H << SOLA; vertical axis from 0% to 60%; series: Speech, Music, Mixed.]

Page 21: Performance evaluation: Results, hybrid vs SOLAFS, α=0.75

[Figure: bar chart over the categories H >> SOLA, H > SOLA, H = SOLA, H < SOLA, H << SOLA; vertical axis from 0% to 60%; series: Speech, Music.]

Page 22: Performance evaluation: Results, hybrid vs SOLAFS, α=0.75

[Figure: bar chart over the categories H >> SOLA, H > SOLA, H = SOLA, H < SOLA, H << SOLA; vertical axis from 0% to 60%; series: Speech, Music, Mixed.]

Page 23: Performance evaluation: Results, hybrid vs phase vocoder, α=0.75

[Figure: bar chart over the categories H >> PV, H > PV, H = PV, H < PV, H << PV; vertical axis from 0% to 60%; series: Speech.]

Page 24: Performance evaluation: Results, hybrid vs phase vocoder, α=0.75

[Figure: bar chart over the categories H >> PV, H > PV, H = PV, H < PV, H << PV; vertical axis from 0% to 60%; series: Speech, Music.]

Page 25: Performance evaluation: Results, hybrid vs phase vocoder, α=0.75

[Figure: bar chart over the categories H >> PV, H > PV, H = PV, H < PV, H << PV; vertical axis from 0% to 60%; series: Speech, Music, Mixed.]

Page 26: Conclusion

A hybrid TSM method is presented
◦ Uses a frame-by-frame classification stage
◦ Selects the best method based on the monophonic/polyphonic/noise character of the input signal
◦ Mode transitions
High quality results are obtained
◦ Using speech, music and mixed-content signals
Future work
◦ Refine the classification criterion
◦ Use of phase flexibility to improve phase coherence would improve phase vocoder to SOLAFS transitions

Page 27:

Contact: [email protected]

Thank you.