Crowd++: Unsupervised Speaker Count with Smartphones
![Page 1: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/1.jpg)
Crowd++: Unsupervised Speaker Count with Smartphones
Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner
ACM UbiComp’13, September 10th, 2013
![Page 6: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/6.jpg)
6Chenren Xu [email protected]
Solutions?
Speaker count! Dinner time, where to go?
Find the place where the most people are talking!
Is your kid social? Find how many (different) people they talked with!
Which class is more attractive? Check how many students ask you questions!
Microphone + microcomputer
![Page 9: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/9.jpg)
Voice-centric sensing
Speech recognition
Emotion detection
Speaker identification
Speaker count
(Figure labels: Stressful, Bob, Family life, Alice, 2, 3)
![Page 10: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/10.jpg)
How to count?
Challenge
No prior knowledge of speakers
Background noise
Speech overlap
Energy efficiency
Privacy concern
![Page 11: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/11.jpg)
How to count?
| Challenge | Solution |
| --- | --- |
| No prior knowledge of speakers | Unique features extraction |
| Background noise | Frequency-based filter |
| Speech overlap | Overlap detection |
| Energy efficiency | Coarse-grained modeling |
| Privacy concern | On-device computation |
![Page 13: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/13.jpg)
Speech detection
Pitch-based filter: pitch is determined by the vibratory frequency of the vocal folds
Human voice statistics: pitch spans from 50 Hz to 450 Hz
(Figure: frequency axis f (Hz) with the human voice band marked from 50 to 450.)
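The band check behind the pitch-based filter can be sketched as follows. This is a minimal illustration, assuming a pitch estimate per audio frame is already available; the function names, the inclusive band edges, and the use of `None` for frames without a pitch estimate are assumptions, not details from the slides.

```python
# Band quoted on the slide for human voice pitch.
HUMAN_PITCH_MIN_HZ = 50.0
HUMAN_PITCH_MAX_HZ = 450.0


def is_human_voice(pitch_hz):
    """Return True when a pitch estimate lies inside the human voice band.

    pitch_hz is None when no pitch could be estimated for the frame
    (hypothetical convention for this sketch).
    """
    if pitch_hz is None:
        return False
    return HUMAN_PITCH_MIN_HZ <= pitch_hz <= HUMAN_PITCH_MAX_HZ


def keep_speech_frames(frame_pitches):
    """Return the indices of frames whose pitch falls inside the band."""
    return [i for i, p in enumerate(frame_pitches) if is_human_voice(p)]


# Example: unvoiced, too low, two voiced frames, too high, band edge.
kept = keep_speech_frames([None, 30.0, 120.0, 220.0, 900.0, 450.0])
```

Frames outside the band are treated as background noise and skipped before any speaker features are computed.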
![Page 14: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/14.jpg)
Speaker features
MFCC
Speaker identification/verification: Alice or Bob, or someone else? (Supervised)
Emotion/stress sensing: happy, sad, stressed, fearful, or angry? (Supervised)
Speaker counting: no prior information (Unsupervised)
![Page 15: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/15.jpg)
Speaker features
Same speaker or not?
MFCC + cosine similarity distance metric
(Figure: vectors MFCC 1 and MFCC 2 separated by the angle θ.)
We use the angle θ to capture the distance between speech segments.
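A minimal sketch of that angle computation, assuming each speech segment has already been summarized as one MFCC feature vector and that θ is reported in degrees (the slides do not state the unit). The function name `angle_degrees` is hypothetical.

```python
import math


def angle_degrees(mfcc_a, mfcc_b):
    """Angle between two feature vectors, derived from cosine similarity."""
    dot = sum(a * b for a, b in zip(mfcc_a, mfcc_b))
    norm_a = math.sqrt(sum(a * a for a in mfcc_a))
    norm_b = math.sqrt(sum(b * b for b in mfcc_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


theta = angle_degrees([1.0, 0.0], [1.0, 1.0])  # vectors 45 degrees apart
```

A small angle means the two segments have similar spectral shape; scaling either vector does not change the result, so loudness differences between segments do not affect the distance.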
![Page 16: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/16.jpg)
Speaker features
MFCC + cosine similarity distance metric
(Figure: θs is the angle between Bob’s MFCC in speech segment 1 and Bob’s MFCC in speech segment 2; θd is the angle between Bob’s MFCC and Alice’s MFCC in speech segment 3.)
θd > θs
![Page 17: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/17.jpg)
Speaker features
MFCC + cosine similarity distance metric
(Figure: histograms of θs and θd for 1-second, 2-second, 3-second, ..., 10-second speech segments.)
10 seconds is not natural in conversation!
We use 3-second segments as the basic speech unit.
![Page 18: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/18.jpg)
Speaker features
MFCC + cosine similarity distance metric
(Figure: histograms for the 3-second speech segment, with the values 15 and 30 marked.)
![Page 20: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/20.jpg)
Same speaker or not?
IF MFCC cosine similarity score < 15 AND pitch indicates they are the same gender: same speaker
ELSEIF MFCC cosine similarity score > 30 OR pitch indicates they are different genders: different speakers
ELSE: not sure
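The three-way rule on this slide can be written as a small function. This is a sketch, assuming the MFCC score and a pitch-based same-gender flag are computed elsewhere; the function name, parameter names, and return strings are hypothetical, while the thresholds 15 and 30 come from the slide.

```python
def compare_segments(mfcc_score, same_gender,
                     same_threshold=15.0, diff_threshold=30.0):
    """Classify a pair of speech segments.

    mfcc_score  -- MFCC cosine similarity score (the angle) for the pair
    same_gender -- True when pitch indicates both segments share a gender
    """
    if mfcc_score < same_threshold and same_gender:
        return "same speaker"
    if mfcc_score > diff_threshold or not same_gender:
        return "different speakers"
    return "not sure"


verdict = compare_segments(20.0, same_gender=True)  # falls between thresholds
```

Scores between the two thresholds with matching gender produce "not sure", which leaves room for the later stages to discard ambiguous segments instead of guessing.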
![Page 21: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/21.jpg)
Conversation example
Speaker A
Speaker B
Speaker C
Conversation
Overlap Overlap Silence
Time
![Page 22: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/22.jpg)
Unsupervised speaker counting
Phase 1: pre-clustering. Merge the speech segments from the same speaker.
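A rough sketch of pre-clustering, assuming each segment is one feature vector, that only adjacent segments are compared, and that a merged cluster is represented by the running mean of its members. The averaging step, the omission of the pitch check, and all names here are assumptions made for illustration; the slides only say that segments from the same speaker are merged.

```python
import math


def angle_degrees(u, v):
    """Angle between two non-zero feature vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))


def precluster(segments, same_threshold=15.0):
    """Merge each run of adjacent segments that look like one speaker.

    Returns a list of (mean_feature_vector, segments_merged) tuples.
    """
    clusters = []
    for feat in segments:
        if clusters:
            mean, n = clusters[-1]
            if angle_degrees(mean, feat) < same_threshold:
                # Same speaker as the previous cluster: fold the segment in.
                merged = [(m * n + f) / (n + 1) for m, f in zip(mean, feat)]
                clusters[-1] = (merged, n + 1)
                continue
        # Otherwise start a new cluster and keep going.
        clusters.append((list(feat), 1))
    return clusters


toy = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 0.0]]
sizes = [n for _, n in precluster(toy)]
```

Merging first gives the counting phase fewer, longer segments to compare, which should make each comparison more reliable than one based on a single 3-second unit.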
![Page 23: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/23.jpg)
Unsupervised speaker counting
Ground truth
Segmentation
Pre-clustering
Same speaker
![Page 24: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/24.jpg)
Unsupervised speaker counting
Ground truth
Segmentation
Pre-clustering
Merge
![Page 25: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/25.jpg)
Unsupervised speaker counting
Ground truth
Segmentation
Pre-clustering
Keep going …
![Page 26: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/26.jpg)
Unsupervised speaker counting
Ground truth
Segmentation
Pre-clustering
Same speaker
![Page 27: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/27.jpg)
Unsupervised speaker counting
Ground truth
Segmentation
Pre-clustering
Merge
![Page 28: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/28.jpg)
Unsupervised speaker counting
Ground truth
Segmentation
Pre-clustering
Keep going …
![Page 29: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/29.jpg)
Unsupervised speaker counting
Ground truth
Segmentation
Pre-clustering
Same speaker
![Page 30: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/30.jpg)
Unsupervised speaker counting
Ground truth
Segmentation
Pre-clustering
Merge
![Page 31: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/31.jpg)
Unsupervised speaker counting
Ground truth
Segmentation
Pre-clustering
Merge
Pre-clustering
...
![Page 32: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/32.jpg)
Unsupervised speaker counting
Phase 1: pre-clustering. Merge the speech segments from the same speaker.
Phase 2: counting. Only admit a new speaker when its speech segment is different from all the admitted speakers. Drop uncertain speech segments.
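The counting phase can be sketched as below. This is an illustration under stated assumptions: segments are toy feature vectors, the comparison uses only the angle thresholds 15 and 30 (the pitch check is left out), and a segment matching an admitted speaker is simply not counted again instead of being merged into that speaker. All function names are hypothetical.

```python
import math


def angle_degrees(u, v):
    """Angle between two non-zero feature vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))


def compare(u, v, same_threshold=15.0, diff_threshold=30.0):
    """Three-way verdict for a pair of segments (angle only, no pitch)."""
    theta = angle_degrees(u, v)
    if theta < same_threshold:
        return "same"
    if theta > diff_threshold:
        return "different"
    return "not sure"


def count_speakers(segments):
    """Admit a segment as a new speaker only if it differs from all admitted."""
    admitted = []
    for seg in segments:
        verdicts = [compare(seg, speaker) for speaker in admitted]
        # all() is True for an empty list, so the first segment is admitted.
        if all(v == "different" for v in verdicts):
            admitted.append(seg)
        # "same" matches an admitted speaker; "not sure" segments are dropped.
    return len(admitted)


toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [1.0, 0.4], [-1.0, 0.0]]
speakers = count_speakers(toy)
```

In the toy input, the third segment matches the first speaker, the fourth is ambiguous and gets dropped, and the fifth differs from both admitted speakers, so it is admitted as the third.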
![Page 37: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/37.jpg)
Unsupervised speaker counting
Counting: 1
Counting: 1
Different speaker
![Page 38: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/38.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 1
Admit new speaker!
![Page 39: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/39.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 2
Counting: 1
Same speaker
![Page 40: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/40.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 2
Counting: 1
Different speaker
Same speaker
![Page 41: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/41.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 2
Counting: 1
Merge
![Page 42: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/42.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 2
Counting: 1
Not sure
![Page 43: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/43.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 2
Counting: 1
Not sure
Not sure
![Page 44: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/44.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 2
Counting: 1
Drop
![Page 45: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/45.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 2
Counting: 1
Different speaker
![Page 46: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/46.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 2
Counting: 1
Different speaker
Different speaker
![Page 47: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/47.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 3
Counting: 1
Admit new speaker!
![Page 48: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/48.jpg)
Unsupervised speaker counting
Counting: 2
Counting: 3
Counting: 1
Ground truth
We have three speakers in this conversation!
![Page 49: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/49.jpg)
Evaluation metric
Error count distance: the difference between the estimated count and the ground truth.
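One way to compute this metric over a set of samples is sketched below. The slide only says "difference"; taking the absolute difference and averaging it across samples is an assumption made here so that over- and under-counts do not cancel out. The function name is hypothetical.

```python
def error_count_distance(estimated, ground_truth):
    """Mean absolute difference between estimated and true speaker counts."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(abs(e - g) for e, g in zip(estimated, ground_truth))
    return total / len(estimated)


# Three clips: exact, under-count by 2, over-count by 1.
score = error_count_distance([3, 2, 5], [3, 4, 4])
```

A value of 1.0 here means the estimate is off by one speaker per clip on average.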
![Page 52: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/52.jpg)
Benchmark results
Phones on the table give better results than phones inside pockets.
![Page 53: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/53.jpg)
Benchmark results
Collaboration among multiple phones provides better results.
![Page 54: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/54.jpg)
Large scale crowdsourcing effort
120 users from university and industry contributed 109 audio clips, 1034 minutes in total.
Private indoor Public indoor Outdoor
![Page 55: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/55.jpg)
Large scale crowdsourcing results
| Environment | Sample number | Error count distance |
| --- | --- | --- |
| Private indoor | 40 | 1.07 |
| Public indoor | 44 | 1.35 |
| Outdoor | 25 | 1.83 |
The error count results in all environments are reasonable.
![Page 56: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/56.jpg)
Computational latency
| Latency (msec) | HTC EVO 4G | Samsung Galaxy S2 | Samsung Galaxy S3 | Google Nexus 4 | Google Nexus 7 |
| --- | --- | --- | --- | --- | --- |
| MFCC | 42.90 | 36.71 | 24.41 | 22.86 | 23.14 |
| Pitch | 102.71 | 80.36 | 58.11 | 47.93 | 58.33 |
| Count | 175.16 | 150.47 | 89.01 | 83.53 | 70.23 |
| Total | 320.77 | 267.54 | 171.53 | 154.32 | 151.70 |

It takes less than 1 minute to process a 5-minute conversation.
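The claim can be checked with a little arithmetic on the latency figures from this slide. The slide does not say what amount of audio each latency covers; the sketch below assumes the figures are per 3-second speech segment and that cost scales linearly with the number of segments. Both are assumptions, as are the names used.

```python
# Per-stage latency in milliseconds, copied from the slide.
LATENCY_MS = {
    "HTC EVO 4G": {"MFCC": 42.90, "Pitch": 102.71, "Count": 175.16},
    "Samsung Galaxy S2": {"MFCC": 36.71, "Pitch": 80.36, "Count": 150.47},
    "Samsung Galaxy S3": {"MFCC": 24.41, "Pitch": 58.11, "Count": 89.01},
    "Google Nexus 4": {"MFCC": 22.86, "Pitch": 47.93, "Count": 83.53},
    "Google Nexus 7": {"MFCC": 23.14, "Pitch": 58.33, "Count": 70.23},
}

SEGMENT_SECONDS = 3  # assumption: latencies are per 3-second segment


def total_latency_ms(device):
    """Sum of the MFCC, pitch, and count stages for one device."""
    return sum(LATENCY_MS[device].values())


def processing_seconds(device, conversation_seconds):
    """Estimated processing time for a conversation of the given length."""
    segments = conversation_seconds / SEGMENT_SECONDS
    return total_latency_ms(device) * segments / 1000.0


# A 5-minute conversation is 100 segments under the assumption above.
estimates = {d: processing_seconds(d, 5 * 60) for d in LATENCY_MS}
```

Under these assumptions the slowest phone in the list needs roughly 32 seconds for a 5-minute conversation, which is consistent with the "less than 1 minute" statement.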
![Page 59: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/59.jpg)
Use case 2: Social log
Home
Class
Meeting
Lunch
Research Research
Sleeping Research Dinner
Ph.D. student life snapshot before the UbiComp’13 submission
His paper is accepted!
![Page 60: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/60.jpg)
Use case 3: Speaker count patterns
Different talks show very different speaker count patterns over time.
![Page 61: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/61.jpg)
Conclusion
Smartphones can count the number of speakers with reasonable accuracy in different environments.
Crowd++ can enable different social sensing applications.
![Page 62: Crowd++: Unsupervised Speaker Count with Smartphones](https://reader036.fdocuments.in/reader036/viewer/2022062814/5681677d550346895ddc8196/html5/thumbnails/62.jpg)
Thank you
Chenren Xu, WINLAB/ECE, Rutgers University
Sugang Li, WINLAB/ECE, Rutgers University
Gang Liu, CRSS, UT Dallas
Yanyong Zhang, WINLAB/ECE, Rutgers University
Emiliano Miluzzo, AT&T Labs Research
Yih-Farn Chen, AT&T Labs Research
Jun Li, Interdigital Communication
Bernhard Firner, WINLAB/ECE, Rutgers University