Page 1:

Audio and (Multimedia) Methods for Video Analysis

Dr. Gerald Friedland, [email protected]

CS-298 Seminar Fall 2011

1

Page 2:

Introduction to Sound

2

• What is sound?
• How is it recorded and stored?
• What are its most important properties (to us)?
• Introduction to features 1
• Some frameworks and tools to work with sound

Page 3:

Where am I?

3

Page 4:

Most famous for...

4

Page 5:

Less famous for...

5

http://www.multimediaeval.org/

Page 6:

What is Sound?

6

Video from ViHart (YouTube): http://www.youtube.com/watch?v=i_0DXxNeaQ0

Page 7:

What is Sound?

7

“a traveling wave which is an oscillation of pressure transmitted through a solid, liquid, or gas, composed of frequencies within the range of hearing and of a level sufficiently strong to be heard, or the sensation stimulated in organs of hearing by such vibrations.” (AHD)

Page 8:

Visualizations of Sound

8

Time Domain aka Amplitude Space aka Waveform (x-axis: time)

Page 9:

Visualizations of Sound

9

Frequency Domain aka Fourier Space aka Spectrum (x-axis: frequency)

Page 10:

Visualizations of Sound

10

Spectrogram (axes: time and frequency; color intensity: energy)

Page 11:

Hearing Spectrum

Source: http://sound.westhost.com/articles/fadb.htm

Page 12:

dB SPL?

• decibel Sound Pressure Level
• NOT a physical unit, only a scale:

L_p = 20 · log10(p_rms / p_ref) dB

where p_ref is the reference sound pressure and p_rms is the RMS sound pressure being measured.

12

Page 13:

bits -> dB Range (Cheat sheet)

13

• 8 bits -> 48 dB SPL
• 11 bits -> 66 dB SPL
• 16 bits -> 96 dB SPL
• 24 bits -> 144 dB SPL
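The cheat sheet follows from the rule of thumb that each bit of a linear quantizer adds about 6.02 dB of dynamic range (20·log10(2) ≈ 6.02). A minimal Python sketch to reproduce the numbers:

# Dynamic range of an n-bit linear quantizer: 20*log10(2**n) ≈ n * 6.02 dB.
import math

for bits in (8, 11, 16, 24):
    print(f"{bits:2d} bits -> {20 * math.log10(2 ** bits):5.1f} dB")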

Page 14:

Frequency-Normalized Range (A-Weighting)

14

Figure 1. Sound pressure A-weighting scheme according to IEC 61672:2003.

Sound pressure levels weighted by the A scheme are usually labelled dBA or dB(A). Note that dB and dBA, like the percent sign ‘%’, express ratios rather than physical units of measurement: a value of 10 dB can refer to completely different sound pressures depending on the reference.

Observed Properties of Sound

As explained in the previous paragraph, sound is a pressure wave traveling through a medium. In practice, sound does not simply travel through a homogeneous medium from its source until it dies out. The environment is filled with objects, sounds are often produced in closed rooms, and sound pressure waves may collide with other sounds. The resulting effects play a large role when designing multimedia systems, and they are more significant for sound than for light waves. The three most important ones are echo, reverberation, and interference.

An echo is a reflection of sound, arriving at the listener some time after the original sound. Typical examples are the echo produced by the bottom of a well, by a building, or by the walls of an enclosed room. Sound is easily reflected by most materials, so echoes are present in virtually every environment. A true echo is a single reflection of the sound source; more often, however, many overlapping echoes form reverberation. The time delay is the extra distance divided by the speed of sound. When dealing with audible frequencies, the human ear cannot distinguish an echo from the original sound if the delay is below roughly a tenth of a second.

Page 15:

How is Sound Recorded?

Figure 3. Edison’s Phonograph on a US stamp.

Not surprisingly, today’s sound recording still obeys the same principles, with two main exceptions: first, the sound waves are converted to electrical waves by a microphone, and second, most of today’s storage media are digital, i.e., sound waves are converted into binary numbers before they are imprinted on the medium. The media themselves, such as CD-ROM or DAT, are a bit more sophisticated than Edison’s cylinders. Having said that, we are currently observing the replacement of all of these specialized media with generic media, such as hard disks and flash memory. We therefore do not explain the technical details of these media; the reader is referred to the bibliography for further information. The next paragraphs, however, explain the governing principles of modern sound processing.

Microphones

A microphone is an acoustic sensor that converts sound into an electrical signal. The general principle is that sound pressure acts on a membrane, and the membrane’s movement is converted into an electrical quantity. Most microphones in use for audio today rely on electromagnetic induction (dynamic microphones, where the membrane moves a coil inside a magnetic field), capacitance change (condenser microphones, where the membrane forms part of a capacitor whose capacitance varies with the movement), or piezoelectric generation (piezo crystals emit electricity when under pressure). Some modern microphones use light modulation to produce the electrical signal by “watching” the mechanical vibration (laser microphones). A single dynamic membrane will not respond linearly to all audio frequencies; some microphones therefore use multiple membranes for different parts of the audio spectrum and then combine the resulting signals. The different microphone types have different electrical properties. A complete

(Diagram: Microphone → Time Control → Media)

Page 16:

Modern Microphone

Source: http://www.mediacollege.com/audio/microphones/dynamic.html

Page 17:

Types of Microphones

17

• Nearfield: Close to sound source e.g., headset, boom microphone (movies, TV productions), singer microphones

• Farfield: Further away from sound source e.g., lapel microphone, stationary microphone, webcams, handheld cams.

Page 18:

Difference Farfield/Nearfield

Demo: http://www.icsi.berkeley.edu/Speech/mr/nearfar.html

• Nearfield: More energy, less distortion, captures sound source well.

• Farfield: Captures environment with sound source, “better for forensics”, processing often slower.

Page 19:

Microphone Directionality

19

microphone also includes a housing and a means of bringing the signal from the element to other equipment (e.g., wires or RF capability). These and other characteristics, such as diaphragm size, intended use, or the orientation of the principal sound input to the principal axis (end- or side-address), determine the properties of the recorded sound space. When planning a recording, it is therefore best to survey the currently available market and read vendor specifications. The most important characteristic of a microphone is its directionality.

Figure 4: Four common polar patterns of microphones. From left to right: omnidirectional, cardioid, supercardioid, shotgun (images from Wikimedia Commons).

A microphone's directionality indicates how sensitive it is to sounds arriving at different angles about its central axis. The directionality of a microphone is usually visualized using a polar pattern. A polar pattern represents the locus of points that produce the same signal level in the microphone when a constant sound pressure level is generated at each point. Figure 4 shows some idealized example patterns. The patterns are considered idealized because, in the real world, polar patterns are a function of frequency; manufacturers' diagrams therefore usually include multiple plots taken at different frequencies. Also, while an omnidirectional microphone's response is generally considered to be a perfect sphere in three dimensions, in the real world this is not the case.
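The first three idealized patterns in Figure 4 can be sketched with the standard first-order formula r(θ) = α + (1 − α)·cos θ; the α values below are common textbook approximations (an illustrative assumption, not vendor data), and the shotgun pattern is not captured by this simple model. A minimal Python sketch:

# Idealized first-order polar patterns: r(theta) = alpha + (1 - alpha) * cos(theta).
import math

PATTERNS = {"omnidirectional": 1.0, "cardioid": 0.5, "supercardioid": 0.37}

def sensitivity(pattern, theta_deg):
    alpha = PATTERNS[pattern]
    return alpha + (1.0 - alpha) * math.cos(math.radians(theta_deg))

for name in PATTERNS:
    # relative sensitivity toward the front, the side, and the rear
    print(name, [round(sensitivity(name, a), 2) for a in (0, 90, 180)])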

Page 20:

Digitization of Sound

20

Microphones, mixers, and pre-amplifiers are still analog, but the storage and processing is digital. With standards like the aforementioned AES 42 becoming more and more popular, digitization will soon become a much earlier part of the processing chain.

Digitizing is the representation of a signal by a discrete set of its samples. Instead of representing the sound signal by an electrical current proportional to its sound pressure, the signal is represented by on-off patterns that encode sample values of the analog signal at certain fixed points. The on-off patterns are much less susceptible to the distortions outlined above; in particular, copying is usually lossless. Conceptually, digitization works in two steps, illustrated in Figure 6:

- Discretization: The analog signal is read at regular time intervals (the sampling rate); one such reading is called a sample.
- Quantization: The sample values are rounded to a fixed set of numbers (such as integers), a process known as quantization.

Figure 6. Digital representation of an analog signal. Both amplitude and time axis are discretized.

A series of quantized samples can be transformed back into an analog output that approximates the original analog representation by generating the signal represented by each sample. The sampling rate and the number of bits used to represent the sample values determine how closely the digitized signal approximates the analog original.

The error introduced by the quantization is called quantization noise and affects how accurately the amplitude can be represented. Using very few bits per sample represents the signal only coarsely, which reduces the perceived dynamics of the sound and introduces high-frequency artifacts. Typical bit depths for audio are 8, 16, and 24 bits.
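A minimal sketch of the two steps on a synthetic signal (assuming NumPy is available); the measured signal-to-quantization-noise ratio lands near the 6 dB-per-bit rule from the cheat sheet above:

# Discretize and quantize a 440 Hz sine, then measure the quantization noise.
import numpy as np

fs, bits = 16000, 8                       # sampling rate (Hz) and bits per sample
t = np.arange(fs) / fs                    # discretization: one sample every 1/fs seconds
x = 0.9 * np.sin(2 * np.pi * 440 * t)     # stand-in for the analog signal, in [-1, 1]

levels = 2 ** (bits - 1)
x_q = np.round(x * levels) / levels       # quantization: round to 2**bits levels

noise = x - x_q
snr = 10 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))
print(f"{bits}-bit quantization SNR: {snr:.1f} dB")   # close to 6 dB per bit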

The error introduced by the sampling rate is called discretization error and determines the maximum frequency that can be represented in the signal. This upper frequency limit is given by the so-called Nyquist frequency. The Nyquist frequency, named after the Swedish-American engineer Harry Nyquist (see also the Nyquist–Shannon sampling theorem), is half the sampling

Page 21:

Remember: Nyquist Limit!

21

frequency of a discrete signal processing system. In other words, if a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart.

The proof of this fundamental theorem can be found in the research papers listed at the end of this chapter. In this overview we will stick with an illustrative example.

To illustrate the necessity of fs > 2B, consider the sinusoid:

x(t) = cos(2πB·t + θ)

With fs = 2B, or equivalently T = 1/(2B), the samples are given by:

x(nT) = cos(πn + θ) = cos(πn)·cos(θ) - sin(πn)·sin(θ) = (-1)^n · cos(θ)

Those samples cannot be distinguished from the samples of:

y(t) = cos(θ) · cos(2πB·t)

since y(nT) = cos(θ)·cos(πn) = (-1)^n · cos(θ) as well. But for any θ such that sin(θ) ≠ 0, x(t) and y(t) have different amplitudes and a different phase. Figure 7 illustrates this further.

Figure 7. Three possible analog signals for the same sampling points.

The Nyquist theorem is not at all limited to sound signals; it holds for the digitization of any signal. However, we discuss it here because, of all multimedia formats, sound formats are the most influenced by this limit. Since the maximum frequency perceptible by the human auditory system is about 20 kHz, compact discs sample at 44.1 kHz (giving a Nyquist frequency of 22.05 kHz). Human speech, which usually peaks at about 6-8 kHz, is considered completely represented by a 16 kHz sampling frequency. Sampling frequencies

Math: See Draft Chapter 3 of Friedland & Jain on mm-creole.org
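To make the aliasing effect concrete, a short sketch (assuming NumPy is available): a tone above the Nyquist frequency produces exactly the same samples as a lower-frequency tone, so content above fs/2 cannot be recovered.

# Aliasing at fs = 8000 Hz: a 7000 Hz tone is indistinguishable from a 1000 Hz tone.
import numpy as np

fs = 8000
n = np.arange(64)
above_nyquist = np.cos(2 * np.pi * 7000 * n / fs)   # 7000 Hz > fs/2 = 4000 Hz
alias         = np.cos(2 * np.pi * 1000 * n / fs)   # fs - 7000 = 1000 Hz
print(np.allclose(above_nyquist, alias))            # True: identical samples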

Page 22:

Common Recording Resolutions

22

• 8000 Hz, 8-bit logarithmically companded, ~11-bit uncompanded (A-/µ-law): telephone
• 16000 Hz, 16-bit linear: speech (Skype)
• 44100 Hz, 16-bit linear, stereo: Compact Disc, many camcorders
• 48000 Hz, 32-bit linear, stereo: Digital Audio Tape, Hard Disk Recorders

Page 23:

µ-law Companding

23
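The slide's companding curve is not reproduced here. As a sketch, the standard µ-law compressor with µ = 255 (the value used in North American telephony) maps a linear sample in [-1, 1] to a companded value before coarse quantization, and the expander inverts it on playback (assuming NumPy):

# mu-law companding (mu = 255): compress before quantization, expand on playback.
import numpy as np

MU = 255.0

def compress(x):                          # x in [-1, 1]
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def expand(y):
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.linspace(-1.0, 1.0, 9)
print(np.allclose(expand(compress(x)), x))   # True: companding is invertible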

Page 24:

Some Basic Audio Features

24

• Energy (usually RMS)
• Zero-Crossing Rate
• Entropy
• Pitch
• Voicedness/Unvoicedness (HNR)
• Long-Term Average Spectrum (LTAS)

Page 25:

Energy

25

Energy is usually calculated as the root-mean-square (RMS) of a window of samples:

E_rms = sqrt( (1/N) · Σ_{n=m-N+1}^{m} s[n]² )

• Most prevalent basic feature
• Typical filter use: squelching
• For analysis: Is there audio at all?
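A minimal sketch of frame-wise RMS energy (assuming NumPy); the frame length and hop size are arbitrary example values:

# Frame-wise RMS energy: square root of the mean squared sample value per window.
import numpy as np

def rms_energy(signal, frame_len=400, hop=160):
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 220 * t) * np.linspace(1.0, 0.0, fs)   # decaying tone
print(rms_energy(tone)[:5])        # energy falls as the tone decays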

Page 26:

Zero-Crossing Rate

26

a very fundamental algorithm that is used in many different versions in a whole range of speech processing devices; it also allows us to discuss a couple of fundamental speech processing techniques.

Voiced/Unvoiced Detection

How can we determine if a speech frame contains a vowel or not? The methods currently available for doing this are not perfectly accurate but work in about 99% of the cases. Usually, a bag of features is used, combining the results from several computations on the signal. The most obvious feature is energy: in order to have pitch, that is, in order to have periodicity, the signal must cross the zero-amplitude line several times, so the integral (or the sum of samples) of the signal should be close to zero. An unpitched signal can have any shape, and its integral might be heavily biased towards a positive or negative value. To determine the energy E of a frame of length N containing samples s and ending at instant m, the following simple equation is usually used:

E(m) = sqrt( Σ_{n=m-N+1}^{m} s[n]² )

Energy is always a positive number, hence the square root. Alternatively, the absolute value of each sample can be used, resulting in the so-called Magnitude-Sum Function:

M(m) = Σ_{n=m-N+1}^{m} |s[n]|

As already said, a voiced signal might cross the zero-amplitude line more often than a non-pitched signal. Of course, this can also be measured directly by calculating the Zero-Crossing Rate:

Z(m) = Σ_{n=m-N+1}^{m} |sgn(s[n]) - sgn(s[n-1])|

The sgn() function returns 1 or 0 depending on the sign of the operand. A third method is to calculate the prediction gain, i.e., the ratio between the energy of the signal and the energy of the linear-prediction residual.

Used, among other things, in overlapping speech detection, voiced/unvoiced estimation, and music analysis.
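A compact sketch of the frame features just described (energy, magnitude sum, and zero-crossing rate), assuming NumPy; decision thresholds for voiced/unvoiced classification are omitted because they are tuned per application:

# Frame features for voiced/unvoiced detection: energy, magnitude sum, zero-crossing count.
import numpy as np

def frame_features(s):
    energy  = np.sqrt(np.sum(s ** 2))            # E: square root of the summed squares
    mag_sum = np.sum(np.abs(s))                  # magnitude-sum function
    signs   = (s >= 0).astype(int)               # sgn() returning 1 or 0, as in the text
    zcr     = np.sum(np.abs(np.diff(signs)))     # number of zero crossings in the frame
    return energy, mag_sum, zcr

t = np.arange(400) / 8000
voiced   = np.sin(2 * np.pi * 120 * t)                              # periodic example frame
unvoiced = 0.1 * np.random.default_rng(0).standard_normal(400)      # noise-like example frame
print(frame_features(voiced), frame_features(unvoiced))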

Page 27:

Entropy

27

How noisy is the signal? Helps in many cases.
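The slide gives no formula; spectral entropy is one common realization of this feature (an assumption here, not necessarily the definition meant on the slide): a flat, noise-like spectrum has high entropy, a tonal one has low entropy. A sketch assuming NumPy:

# One possible "entropy" feature: normalized spectral entropy of a frame.
import numpy as np

def spectral_entropy(frame):
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    p = spectrum / np.sum(spectrum)            # treat the spectrum as a probability mass
    p = p[p > 0]
    return -np.sum(p * np.log2(p)) / np.log2(len(spectrum))   # normalized to [0, 1]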

Page 28:

Pitch

28

The autocorrelation value reflects the similarity between the frame s[n] and its time-shifted version s[n-l]. Again, the frame has length N and ends at instant m. The variable l is a positive integer representing a time lag, and n ∈ [m-N+1, m]. The range of lags is selected so that it covers a wide range of pitch period values. For example, at an 8 kHz sampling rate, if l is between 20 and 147 (2.5 ms to 18.3 ms), the possible pitch estimates range from 54.4 Hz to 400 Hz. By calculating the autocorrelation values for the entire range of lags, it is possible to find the value of l associated with the highest autocorrelation, which represents the pitch period estimate. In other words, the autocorrelation R is maximized when the lag l is equal to the pitch period.

The pseudo-code for the algorithm would look like this:

// Input:  last index in frame m, number of samples N,
//         sampling rate sr (samples per second)
// Output: the pitch period estimate in seconds
pitch(m, N, sr)
    peak := 0
    lag  := 0
    FOR l := 20 TO 147                       // lag range, see text (2.5 ms to 18.3 ms at 8 kHz)
        autoc := 0
        FOR n := m-N+1 TO m
            autoc := autoc + s[n]*s[n-l]     // autocorrelation of the frame at lag l
        IF (autoc > peak)                    // keep the lag with the highest autocorrelation
            peak := autoc
            lag  := l
    pitch := lag/sr                          // period = lag in samples divided by sampling rate

A second method for pitch estimation is the so-called Magnitude-Difference Function (MDF). The idea is that for short segments of voiced speech it is reasonable to expect that s[n]-s[n-l] is small for l = 0, ±T, ±2T, ..., with T being the signal's period. By computing the MDF for the lag range of interest, we can estimate the period by locating the lag value associated with the minimum magnitude difference:

MDF(l) = Σ_{n=m-N+1}^{m} |s[n] - s[n-l]|

Note that each additional accumulated term makes the result greater than or equal to the previous sum, since every term is non-negative. Thus, it is not necessary to compute the sum entirely: if the accumulated result at any point during the iteration exceeds the minimum found so far, the calculation stops and resumes with the next lag. Also, no multiplication is involved in this method, which is sometimes interesting for speed and memory reasons.
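A sketch of an MDF-based pitch estimator using the early-termination trick described above; the lag range mirrors the autocorrelation example (20-147 samples at 8 kHz) and is an illustrative choice:

# Magnitude-Difference Function pitch estimate with early termination (pure Python).
# Assumes s is indexable back to index m - N + 1 - lag_max.
def mdf_pitch(s, m, N, sr, lag_min=20, lag_max=147):
    best_lag, best_sum = lag_min, float("inf")
    for lag in range(lag_min, lag_max + 1):
        total = 0.0
        for n in range(m - N + 1, m + 1):
            total += abs(s[n] - s[n - lag])
            if total >= best_sum:          # partial sum already worse: skip this lag
                break
        else:                              # inner loop finished: new minimum found
            best_sum, best_lag = total, lag
    return best_lag / sr                   # pitch period estimate in seconds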

The most frequently used method is autocorrelation:

R(l) = Σ_{n=m-N+1}^{m} s[n] · s[n-l]

Fails in many situations, computationally intensive.

Pitch Estimation is ongoing research!

Page 29:

Harmonic-to-Noise Ratio

29

Used as an indicator of the type of audio (e.g., speech vs. music, types of music) and speaker age, and as a metric for quality distortion.

HNR = v/uv (ratio of voiced, i.e. harmonic, energy to unvoiced, i.e. noise, energy)
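The slide's formula is only hinted at; one common way to estimate HNR (used, for example, in Praat-style analysis) derives it from the peak r_max of the normalized autocorrelation of a frame as 10·log10(r_max / (1 - r_max)). The following is a sketch of that approach (assuming NumPy), not necessarily the exact definition intended on the slide:

# HNR sketch from the normalized autocorrelation peak: 10*log10(r_max / (1 - r_max)).
import numpy as np

def hnr_db(frame, lag_min=20, lag_max=400):
    frame = frame - np.mean(frame)
    r0 = np.dot(frame, frame)
    lags = range(lag_min, min(lag_max, len(frame) - 1) + 1)
    r_max = max(np.dot(frame[:-l], frame[l:]) / r0 for l in lags)
    r_max = min(r_max, 0.999)              # guard against division by zero for pure tones
    return 10 * np.log10(r_max / (1 - r_max)) if r_max > 0 else float("-inf")

t = np.arange(800) / 8000
print(hnr_db(np.sin(2 * np.pi * 200 * t)))                     # high: strongly harmonic
print(hnr_db(np.random.default_rng(1).standard_normal(800)))   # low: noise-like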

Page 30:

Long-Term Average Spectrum

30

Often used as an indicator of the type of audio (e.g., speech vs. music, types of music).
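A sketch of one way to compute a long-term average spectrum (assuming NumPy): average the magnitude spectra of many short windowed frames over the whole recording.

# Long-Term Average Spectrum: mean magnitude spectrum over all frames of a recording.
import numpy as np

def ltas(signal, fs, frame_len=1024, hop=512):
    window = np.hanning(frame_len)
    spectra = [np.abs(np.fft.rfft(signal[i:i + frame_len] * window))
               for i in range(0, len(signal) - frame_len + 1, hop)]
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    return freqs, np.mean(spectra, axis=0)     # one averaged spectrum for the whole signal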

Page 31:

Basic Audio Features

31

(Figure: a speech signal shown with its pitch, energy, formant, and HNR tracks)

Page 32:

More Features (next week)!

32

• Linear Prediction Coefficients (LPC)
• Mel-Frequency Cepstral Coefficients (MFCC): MFCC12, MFCC19, MFCCxx + delta + delta-delta
• PLP (Perceptual Linear Prediction)
• RASTA, RASTA-PLP, MSG

Page 33:

Audacity

33

Open Source wave editor and filter: http://audacity.sourceforge.net/

Page 34:

Wavesurfer

34

Open Source wave editor and filter, specialized for speech: http://www.speech.kth.se/wavesurfer/

Page 35:

Praat

35

Open Source toolkit for experimental speech research: http://www.fon.hum.uva.nl/praat/

Page 36:

Other useful tools

36

•sox -- command-line based audio filter and converter: http://sox.sourceforge.net/

•ffmpeg -- command-line based audio and video converter: http://ffmpeg.org/

•MPlayer (mencoder) -- open source video player and converter: http://www.mplayerhq.hu/

Page 37:

Next Week

37

• More on Features
• Useful Filters
• Some Machine Learning
• More Toolkits
• No Skype