Post on 31-Aug-2018

AudeoSynth: Music-Driven Video Montage

Zicheng Liao

Zhejiang University

Bingchen Gong

Zhejiang University

Lechao Cheng

Zhejiang University

Yizhou Yu

University of Hong Kong

ACM SIGGRAPH 2015

The success of visual media synthesis

Video textures [2000] Animating pictures [2005] De-animating video [2012]

Progressive video loop [2013] Cinemagraphs [2012] Cliplets [2012]

Video synopsis [2008]

The success of visual media synthesis

Image analogy [2001] Graph cut texture synth [2003] Texture synthesis [1999 & 2001] Pyramid blending [1983]

Gradient domain editing [2003] Digital photo montage [2004] Stitching & panorama [2003]

*Silent* pixels

Other dimensions of human sensation are absent

- hear, touch, smell or taste

- design for 5-sense [Jinsop Lee 2013]

Add sound to the game

- why sound?

Source: www.MontblancOneSecond.com [#NOT paper result]

Content Analysis

Optimization

Video Montage

Music Driven Video Montage

Applications

Video summary and online sharing

Timelapse photography [Louie Schwartzberg 2011]

Hyperlapse videos [Joshi et al. 2015, Kopf et al. 2014]

Smartphone app in Apple Store or Google Play

A challenging new task

How to formulate this task?

How to write an objective function?

How to find a solution?

How to evaluate?

How to translate the subtleties of an artistic process into a machine algorithm?

Principle I: Synchronization

The timing and pace of visual activity should follow the music

Audio-Visual Synchresis

- Mental fusion of a sound and a visual event that occur at the same time

- A survival instinct developed in ancient times

- Footsteps synchronized with the music beat, popping with a drum

- Film editing, animations, dancing (“dance to the beat”).

[Michel Chion 1994]

Principle II: Cut-to-the-Beat

Montage: A language of visual expression

Timing is KING

- Music transition points

- Beginning of music bars

“Mosaic, assembling, or a juxtaposition of imagery, … an orchestration” - Alfred Hitchcock

“to separate and punctuate an idea from what follows” - Walter Murch [Walter Murch 2001]

Formulation

[Figure labels: Music; Video clips; scaling factor; segment 1, segment 2, segment 3, segment 4]

Music-Driven Imagery

[Figure labels: music; videos; segment 1, segment 2, segment 3, segment 4]

Energy function

pairs

synchronization

Overview

Input: Music, Video clips

Analysis: motion, frequency, dynamism, segments, note onsets, saliency

Optimization: Pre-compute, Energy function, MCMC optimization

Output: Rendering

Music analysis

MIDI: Musical Instrument Digital Interface

- Music industry standard protocol (1983)

- Connects instruments, sequencers and software

- Online databases (free-midi.org; 8notes.com)

- Semantic encoding language of music

[Image: A MIDI controller. Source: http://wikipedia.org/MIDI]

MIDI format

MIDI event: TIME | EVENT ID | channel | P1 | P2

Event types:

Event                 ID    P1            P2
Note off              0x8   pitch         velocity
Note on               0x9   pitch         velocity
Note aftertouch       0xA   note #        value
Controller            0xB   controller #  value
Program change        0xC   program #     channel
Channel aftertouch    0xD   value         NA
Pitch Bend            0xE   value 1       value 2

Program change event: <0xC program# channel>

Program #

01 – 08: Piano Timbres

09 – 16: Chromatic percussion

17 – 24: Organ Timbres

25 – 32: Guitar Timbres

105 – 112: Ethnic Timbres

113 – 128: Sound Effects (Tinkle Bell, Breath noise, Bird Tweet, etc)
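The event layout above (event ID, channel, P1, P2) can be illustrated with a small decoder. This is a minimal sketch, not the talk's code: it assumes the usual MIDI 1.0 byte layout, where the high nibble of the status byte is the event ID, the low nibble is the channel, and Program change and Channel aftertouch carry a single data byte. The names `MidiEvent` and `decode_channel_event` are hypothetical.

```python
from typing import NamedTuple, Optional

# Event IDs as listed in the event-type table.
EVENT_NAMES = {
    0x8: "note_off",
    0x9: "note_on",
    0xA: "note_aftertouch",
    0xB: "controller",
    0xC: "program_change",
    0xD: "channel_aftertouch",
    0xE: "pitch_bend",
}
# Assumption (MIDI 1.0 wire format): these two events carry one data byte.
ONE_DATA_BYTE = {0xC, 0xD}


class MidiEvent(NamedTuple):
    name: str
    channel: int
    p1: int
    p2: Optional[int]


def decode_channel_event(raw: bytes) -> MidiEvent:
    """Decode one channel event: a status byte followed by 1 or 2 data bytes."""
    status = raw[0]
    event_id, channel = status >> 4, status & 0x0F
    if event_id not in EVENT_NAMES:
        raise ValueError(f"not a channel event: status 0x{status:02X}")
    n_data = 1 if event_id in ONE_DATA_BYTE else 2
    if len(raw) < 1 + n_data:
        raise ValueError("truncated event")
    p1 = raw[1]
    p2 = raw[2] if n_data == 2 else None
    return MidiEvent(EVENT_NAMES[event_id], channel, p1, p2)
```

For example, `decode_channel_event(bytes([0x90, 60, 100]))` yields a note-on for pitch 60 with velocity 100 on channel 0. Delta times and meta events of a real MIDI file are left out of this sketch.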

Music metadata

Clef, meter and tempo

Music segmentation

Bottom up hierarchical segmentation

“Agglomerative image segmentation with superpixels”

Music bars as “superpixels”

Bar 1 Bar 2 Bar 3 Bar 4 Bar 5
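The bottom-up idea can be sketched as follows: start with one segment per bar and repeatedly merge the most similar pair of adjacent segments. This is an illustration under assumptions, not the talk's algorithm: the per-bar feature vectors, the Euclidean distance between segment means, and the stopping rule (a target segment count) are all hypothetical choices.

```python
import numpy as np


def segment_bars(bar_features, n_segments):
    """Agglomerative merging of adjacent bars into n_segments segments.

    bar_features: 2-D array-like, one (hypothetical) feature row per bar.
    Returns a list of segments, each a list of consecutive bar indices.
    """
    feats = np.asarray(bar_features, dtype=float)
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    segments = [[i] for i in range(len(feats))]  # every bar is a "superpixel"
    while len(segments) > n_segments:
        means = [feats[s].mean(axis=0) for s in segments]
        # Distance between each pair of temporally adjacent segments.
        dists = [np.linalg.norm(means[i] - means[i + 1])
                 for i in range(len(segments) - 1)]
        j = int(np.argmin(dists))
        segments[j:j + 2] = [segments[j] + segments[j + 1]]  # merge closest pair
    return segments
```

Only neighbouring segments are ever merged, so every output segment is a contiguous run of bars.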

Music temporal saliency

For audio-visual alignment (synchronization)

8 note onset scores

pitch-peak, pitch-shift, deviated-pitch, before-a-long-interval, after-a-long-interval, start-of-a-bar, start-of-a-new-bar, start-of-a-different-bar

Convolve with Gaussian kernel

[Figure labels: saliency, MIDI]
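Turning note onset scores into a continuous saliency curve by Gaussian convolution might look like the sketch below. It assumes that the eight per-onset scores have already been combined into one score per onset; the sampling rate, the kernel width `sigma`, and the function name are illustrative choices, not values from the talk.

```python
import numpy as np


def temporal_saliency(onset_times, onset_scores, duration, rate=100, sigma=0.05):
    """Place each onset score on a time grid and smooth with a Gaussian.

    onset_times: onset times in seconds; onset_scores: one score per onset.
    rate: samples per second of the output curve; sigma: kernel width in seconds.
    """
    n = int(round(duration * rate)) + 1
    impulses = np.zeros(n)
    for t, s in zip(onset_times, onset_scores):
        idx = int(round(t * rate))
        if 0 <= idx < n:
            impulses[idx] += s
    # Gaussian kernel truncated at 3 sigma, with peak value 1.
    half = int(np.ceil(3 * sigma * rate))
    x = np.arange(-half, half + 1) / rate
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return np.convolve(impulses, kernel, mode="same")
```

With this kernel, an isolated onset keeps its score as the peak height of the curve, and nearby onsets add up.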

Optical flow as generic visual descriptor [Liu et al. 2005]

Motion change rate (MCR)

Iterative back propagation [Yang et al. 2011]

Visual temporal saliency

Video analysis cont’d

Motion frequency

- Project motions onto discretized directions

- Power spectral density (PSD) analysis over a time window

- Take the frequency with the largest PSD
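The three steps above can be sketched with a plain FFT periodogram. This is an assumption-laden illustration: the input is taken to be one mean 2-D motion vector per frame, the directions are evenly spaced over a half circle, and the DC bin is skipped; none of these details come from the talk.

```python
import numpy as np


def project_motion(mean_flow, n_dirs=4):
    """Project per-frame 2-D motion vectors onto n_dirs discretized directions.

    mean_flow: array of shape (T, 2). Returns an array of shape (T, n_dirs).
    """
    angles = np.pi * np.arange(n_dirs) / n_dirs
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return np.asarray(mean_flow, dtype=float) @ dirs.T


def dominant_motion_frequency(signal, fps):
    """Return (frequency in Hz, PSD value) of the strongest non-DC component."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                      # remove the constant offset
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    k = 1 + int(np.argmax(psd[1:]))       # skip bin 0 (DC)
    return freqs[k], psd[k]
```

A caller would run `dominant_motion_frequency` on each projected column over a time window and keep the frequency whose PSD value is largest.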

Flow peak and dynamism

[Figure: d = 0, d = 1, d = 2, d = 3, …]

Matching cost

- Synchronization cost

- Pace/frequency cost

Transition cost

- Pace/velocity compatibility

- # tracks/dynamism compatibility
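One way to write down how these costs could combine is sketched below. The symbols are assumptions introduced for illustration only: $m_i$ is the $i$-th music segment, $\ell_i$ the index of the video clip assigned to it, $\theta_i$ its start frame and scaling factor, and $\lambda$ a weight. The talk's actual energy may differ in form and weighting.

```latex
E(\ell, \theta) =
  \sum_{i=1}^{N} \Big[ E_{\mathrm{sync}}(m_i, v_{\ell_i}, \theta_i)
                     + E_{\mathrm{pace}}(m_i, v_{\ell_i}, \theta_i) \Big]
  + \lambda \sum_{i=1}^{N-1} E_{\mathrm{trans}}(m_i, m_{i+1}, v_{\ell_i}, v_{\ell_{i+1}})
```

The first sum is the matching cost over music-video pairs; the second is the transition cost between consecutive pairs.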

Optimization

A combination of continuous and discrete optimization

Non-convex

Cannot do gradient descent

Two-Stage Optimization

Stage 1

Stage 2

[Figure labels: music; segment 1, segment 2, segment 3, segment 4; MCR; scalable sliding window; frame (video timeline)]

Stage I

[Figure labels: start frame, scaling factor, end frame; music timeline; music temporal saliency]

Global alignment

Temporal snapping
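The scalable sliding window can be sketched as a brute-force search over start frame and scaling factor. This is a simplified illustration, not the talk's procedure: it assumes the scaling factor means source frames per output sample, scores a window by the dot product between the resampled MCR curve and the music saliency, and omits temporal snapping. All names are hypothetical.

```python
import numpy as np


def best_window(mcr, saliency, scales=(0.5, 1.0, 2.0)):
    """Search (start frame, scaling factor) maximizing MCR/saliency agreement.

    mcr: per-frame motion change rate of one video.
    saliency: music temporal saliency sampled over one music segment.
    Returns (score, start_frame, scale) of the best window found.
    """
    mcr = np.asarray(mcr, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    n_out = len(saliency)
    frames = np.arange(len(mcr))
    best = (-np.inf, None, None)
    for scale in scales:
        span = int(round(n_out * scale))   # source frames covered by the window
        if span < 2 or span > len(mcr):
            continue
        for start in range(len(mcr) - span + 1):
            # Resample the window to the length of the music segment.
            src = np.linspace(start, start + span - 1, n_out)
            window = np.interp(src, frames, mcr)
            score = float(np.dot(window, saliency))
            if score > best[0]:
                best = (score, start, scale)
    return best
```

The end frame follows from the result as `start + span - 1`; an exhaustive search like this is only practical for short clips and a handful of scales.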

Stage II

Metropolis-Hastings algorithm

- Two mutation options

  - Node label update

  - Two-node label swap

- Reversibility constraint

  - Uniform distribution for label update

[Figure labels: music; videos; segment 1, segment 2, segment 3, segment 4]
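A minimal Metropolis-Hastings sketch with the two mutations (label update and label swap) is given below. It makes simplifying assumptions: costs are precomputed tables, the transition cost depends only on the two clips, clips may repeat, and the temperature and iteration count are arbitrary. Both proposals are symmetric, so the acceptance test needs only the energy difference.

```python
import math
import random


def energy(labels, match_cost, trans_cost):
    """Matching cost per segment plus transition cost between neighbours."""
    e = sum(match_cost[i][l] for i, l in enumerate(labels))
    e += sum(trans_cost[a][b] for a, b in zip(labels, labels[1:]))
    return e


def metropolis_hastings(match_cost, trans_cost, n_iters=5000,
                        temperature=1.0, seed=0):
    """Assign a video clip label to each music segment; return the best found."""
    rng = random.Random(seed)
    n_seg, n_clip = len(match_cost), len(match_cost[0])
    labels = [rng.randrange(n_clip) for _ in range(n_seg)]
    cur = energy(labels, match_cost, trans_cost)
    best, best_e = list(labels), cur
    for _ in range(n_iters):
        proposal = list(labels)
        if n_seg >= 2 and rng.random() < 0.5:
            # Mutation 2: swap the labels of two nodes.
            i, j = rng.sample(range(n_seg), 2)
            proposal[i], proposal[j] = proposal[j], proposal[i]
        else:
            # Mutation 1: redraw one node's label uniformly at random.
            proposal[rng.randrange(n_seg)] = rng.randrange(n_clip)
        new = energy(proposal, match_cost, trans_cost)
        # Symmetric proposals: accept with probability min(1, exp(-dE / T)).
        if new <= cur or rng.random() < math.exp(-(new - cur) / temperature):
            labels, cur = proposal, new
            if cur < best_e:
                best, best_e = list(labels), cur
    return best, best_e
```

Drawing the new label uniformly keeps the update move reversible, which matches the reversibility constraint noted above.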

Result: Wild

Input: 35 wildlife videos; Music: Exploration (excerpt)

Result: Aurora

Input: 36 aurora videos; Music: Someone like you (excerpt)

Result: City timelapse

Input: 55 city timelapse videos; Music: Clocks (excerpt)

Comparison: sync/no-sync

Feature turned off | Feature turned on

Comparison: cut-to-the-beat

Without cut-to-the-beat | With cut-to-the-beat

User study

Experimental setup

- 5 groups: ours, ours without cut-to-the-beat, ours without sync, average user, expert user

- 6 examples (right)

- 29 participants

- Random order, rated from 1 to 5

- Subpopulation analysis by questionnaire

Examples: Aurora, City timelapse, Happy birthday, Adventure, Ballet, Wild

Methods: Ours, w/out sync., w/out cut-to-beat

Manual edits: Expert User, Avg user

User study results

Average rating of different methods

Average rating of different examples

Fraction of best vs. worst ratings for each method

Fraction of higher vs. lower ratings in pairs of methods

Limitations

Manual selection of music

Storyline is not preserved

The use of MIDI: bad side and good side

Future work

Replace MIDI with .wav and .mp3

Put a human in the loop

Video-based music recommendation

...

http://web.engr.illinois.edu/~liao17/montage.html

Thank You