What can corpus phonetics tell us about ‘English’ phonology?
Jane Stuart-Smith, Glasgow University Laboratory of Phonetics (GULP)
The SPADE Consortium
English Corpus Phonetics and Phonology at ICAME (digital)
20th May 2020, hosted by Trier University
phonetics
phonology
English corpus
Text over time and space
Huge amounts of annotated speech exist…
Barriers:
• Cost ($$ ££)
• Software
• Ethics
corpus phonetics: overcome barriers and scale up scientific study of speech
Scientific and/or professional user questions, e.g.
• How variable are ‘English’ sounds across space/time?
http://spade.glasgow.ac.uk/
Postdocs & Doc
Michael McAuliffe: Software development
Rachel Macdonald: Project manager
James Tanner: Project PhD (submitted 11 May!)
Arlie Coles (U. de Montréal)
Elias Stengel-Eskin (Johns Hopkins)
Vanna Willerton (McGill)
Michael Goodale, Sarah Mihuc (McGill)
and many more! Stacey Harkin, Kirsty McCahill, Mitchell McGee, Edward Marshall, Julia Moreno, Jo Pearce, Niamh Walker, Ewa Wanat, Jordan Holley, Peter Andrews, Kaylynn Gunter
Software: large-scale speech analysis
Data from ~40 datasets, (socio)linguistic surveys
Corpus phonetics in practice
Research ’English’ sounds over time and space?
Datasets (speech corpora, lexicons) → import → Database (add measures & structure) → querying → Set of linguistic objects → export → Data file (CSV)
Implementation
• Python API
• Graphical User Interface
McAuliffe et al. Proc. ICPhS 2019
Michael McAuliffe
Integrated Speech Corpus ANalysis (ISCAN)
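The import, enrich, query, export flow on this slide can be illustrated with a minimal sketch. It uses only Python's standard library (sqlite3, csv) and is not the actual ISCAN or PolyglotDB API; the table layout, column names, function names, and toy tokens are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical toy tokens: (speaker, dialect, word, phone, begin_s, end_s)
TOKENS = [
    ("spk1", "Glasgow", "seat", "s", 0.10, 0.22),
    ("spk1", "Glasgow", "sheet", "ʃ", 1.00, 1.15),
    ("spk2", "Raleigh", "street", "s", 0.50, 0.61),
]


def build_database(tokens):
    """Import step: load token rows into an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE phone (speaker TEXT, dialect TEXT, word TEXT, "
        "label TEXT, begin_s REAL, end_s REAL)"
    )
    db.executemany("INSERT INTO phone VALUES (?, ?, ?, ?, ?, ?)", tokens)
    return db


def add_duration(db):
    """Enrichment step: add a derived measure (duration) to every token."""
    db.execute("ALTER TABLE phone ADD COLUMN duration_s REAL")
    db.execute("UPDATE phone SET duration_s = end_s - begin_s")


def query_sibilants(db, labels=("s", "ʃ")):
    """Query step: select the linguistic objects of interest."""
    placeholders = ", ".join("?" for _ in labels)
    cursor = db.execute(
        "SELECT speaker, dialect, word, label, duration_s FROM phone "
        f"WHERE label IN ({placeholders}) ORDER BY speaker, begin_s",
        labels,
    )
    return cursor.fetchall()


def export_csv(rows):
    """Export step: serialise the query result as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["speaker", "dialect", "word", "label", "duration_s"])
    writer.writerows(rows)
    return buffer.getvalue()


db = build_database(TOKENS)
add_duration(db)
rows = query_sibilants(db)
csv_text = export_csv(rows)
```

The point of the design is that measures are added once to the database and every later research question becomes a query plus an export, rather than a new pass over the audio.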
US and Canada; UK and Ireland
• 40 collected: public/private, 4 countries, 115 years
• 25 processed: 30 dialects, ~4500 speakers, ~2060 hours
• 18 measured
Datasets: https://spade.glasgow.ac.uk/the-spade-consortium/
What can we learn about English phonology?
Stops
Liquids: Scottish rhotics
Vowels: Scottish patterns
Sibilants
Vowels: formants
Vowel duration: voicing effect
Stuart-Smith et al. Proc. ICPhS 2019
Mielke et al. Proc. ICPhS 2019
Tanner et al. Toronto WP Ling 2019; Frontiers Comp. Slx 2020
https://spade.glasgow.ac.uk/news-outputs/
Vowels: dynamics (Tanner PhD 2020)
Stuart-Smith et al. Proc. ICPhS 2019
Updated analysis: https://osf.io/bknrg/
How does S-retraction vary across English dialects and speakers?
Data
• stressed, word-initial /s str ʃ/, e.g. seat, street, sheet
• 420 speakers
• 5 corpora ~ 10 dialects
• 98,000 tokens
• spectral Centre of Gravity (CoG)
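As a rough illustration of the measure named in the last bullet, spectral Centre of Gravity can be computed as the power-weighted mean frequency of a spectrum. The slides do not say how CoG was extracted (windowing, filtering, weighting), so the sketch below is a plain textbook version under that assumption, with hypothetical function names and a naive DFT that is only suitable for short toy signals.

```python
import cmath
import math


def power_spectrum(samples):
    """Naive DFT power spectrum for bins 0..N/2 (fine for short toy signals)."""
    n = len(samples)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        powers.append(abs(coeff) ** 2)
    return powers


def centre_of_gravity(samples, sample_rate):
    """Power-weighted mean frequency (Hz) of the signal's spectrum."""
    n = len(samples)
    powers = power_spectrum(samples)
    total = sum(powers)
    if total == 0:
        raise ValueError("signal has no energy")
    freqs = [k * sample_rate / n for k in range(len(powers))]
    return sum(f * p for f, p in zip(freqs, powers)) / total


# Toy check: a pure 1000 Hz tone should have its CoG at about 1000 Hz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(64)]
cog_hz = centre_of_gravity(tone, SAMPLE_RATE)
```

For real sibilants the energy is spread over a wide band, so a higher CoG indicates a more /s/-like fricative and a lower CoG a more /ʃ/-like one.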
[Figure: spectral CoG (Hz, y-axis, ticks at 4000, 5000, 6000; higher values are higher pitched) for seat and sheet by dialect (x-axis), in female and male panels. Dialects: columbus, US: West, US: N. Cities, raleigh, Canada, Glasgow, Scot: SW, Scot: E, Scot: Hi/Il, Scot: W, grouped as US, Canada, Scotland. 420 speakers, ~75k tokens.]
/s ʃ/ differ by dialect
S-retraction differs by dialect
… on a continuum (not a dichotomy)
[Figure: retraction ratio for /str/ (y-axis, ticks from 0.6 to 1.0, annotated ‘more like s’ and ‘more like sh’) by corpus (x-axis). Corpora in plotted order: raleigh, columbus, Glasgow, Scot: E, Canada, Scot: W, Scot: SW, US: N. Cities, Scot: Hi/Il, US: West, grouped as US, Scotland, Canada. 420 speakers, ~77k tokens.]
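The slides do not define the retraction ratio. One plausible definition, assumed here purely for illustration, places a speaker's /str/ CoG on a line between that speaker's /ʃ/ CoG (ratio 0) and /s/ CoG (ratio 1). The function name, the orientation of the scale, and the Hz values below are hypothetical.

```python
def retraction_ratio(cog_s, cog_sh, cog_str):
    """Position of /str/ between a speaker's /ʃ/ (0.0) and /s/ (1.0).

    Assumed definition for illustration; not taken from the slides.
    Inputs are per-speaker mean CoG values in Hz.
    """
    span = cog_s - cog_sh
    if span == 0:
        raise ValueError("/s/ and /ʃ/ have the same CoG; ratio undefined")
    return (cog_str - cog_sh) / span


# Hypothetical speaker means in Hz (illustrative values only).
ratio_like_s = retraction_ratio(6000.0, 4000.0, 5800.0)
ratio_midway = retraction_ratio(6000.0, 4000.0, 5000.0)
ratio_beyond_s = retraction_ratio(6000.0, 4000.0, 6400.0)
```

Because the ratio is continuous and can fall below 0 or above 1, it describes retraction as a gradient rather than a two-way split between ‘s’ and ‘sh’.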
S-retraction differs more by speaker
[Figure: density of speakers (y-axis, 0 to 5) over retraction ratio for /str/, e.g. street (x-axis, −0.5 to 2.0), one panel per dialect: columbus, US: West, US: N. Cities, raleigh, Canada, Glasgow, Scot: SW, Scot: E, Scot: Hi/Il, Scot: W, grouped as US, Canada, Scotland. 420 speakers, ~77k tokens.]
Tanner et al. Toronto WP Ling 2019; under review, Frontiers in Computational Sociolinguistics
How robust is the ‘English’ Voicing Effect?
Tanner et al. Toronto WP Ling 2019; Tanner et al. Frontiers in Computational Sociolinguistics 2020
Data
• Utterance-final CVC words, e.g. beat, bead
• 1964 speakers
• 15 corpora ~ 30 dialects
• ~230,000 tokens
• Vowel duration
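To make the quantity concrete: the Voicing Effect compares vowel duration before voiced and voiceless codas (bead vs. beat). The cited work estimates it with statistical models; the sketch below swaps in a much simpler raw ratio of mean durations, with a hypothetical function name and invented toy durations.

```python
from statistics import mean


def voicing_effect_ratio(tokens):
    """Mean vowel duration before voiced codas / before voiceless codas.

    `tokens` holds (coda_is_voiced, vowel_duration_s) pairs.
    A ratio of 1.0 corresponds to "bead = beat"; above 1.0 to "bead > beat".
    """
    tokens = list(tokens)  # allow generators, since we iterate twice
    voiced = [d for is_voiced, d in tokens if is_voiced]
    voiceless = [d for is_voiced, d in tokens if not is_voiced]
    if not voiced or not voiceless:
        raise ValueError("need tokens in both voicing contexts")
    return mean(voiced) / mean(voiceless)


# Invented toy durations in seconds: two "bead"-type, two "beat"-type tokens.
toy_tokens = [(True, 0.18), (True, 0.22), (False, 0.15), (False, 0.17)]
effect = voicing_effect_ratio(toy_tokens)
```

A raw ratio like this ignores speech rate, vowel identity, and speaker differences, which is why model-based estimates are preferable on real corpus data.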
James Tanner
Voicing Effect differs by English dialect
[Figure: estimated Voicing Effect size (y-axis, from bead = beat to bead > beat) by dialect (x-axis), grouped as North America and UK & Ireland. 1964 speakers, ~230k tokens.]
… and is much smaller than in lab speech
Voicing Effect differs by dialect
[Figure: estimated Voicing Effect size (y-axis, from bead = beat to bead > beat) by dialect (x-axis), grouped as North America and UK & Ireland, with annotations: Scottish (no Voicing Effect expected); African American Vernacular English (big Voicing Effect expected). 1964 speakers, ~230k tokens.]
Voicing Effect differs more by dialect than by speakers
[Figure: amount of speaker variability (y-axis) against amount of dialect variability (x-axis).]
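The comparison of dialect and speaker variability can be illustrated with a simple descriptive sketch: the spread of dialect means against the spread of speakers around their own dialect mean. This is a swapped-in simplification, not the model-based estimates behind the figure; the function name and the toy per-speaker values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def variability_summary(speaker_effects):
    """Return (dialect_sd, speaker_sd) from (dialect, effect) pairs.

    dialect_sd: standard deviation of the per-dialect mean effects.
    speaker_sd: standard deviation of each speaker's deviation from
                their own dialect mean, pooled over all dialects.
    """
    by_dialect = defaultdict(list)
    for dialect, effect in speaker_effects:
        by_dialect[dialect].append(effect)
    dialect_means = {d: mean(v) for d, v in by_dialect.items()}
    deviations = [
        effect - dialect_means[dialect]
        for dialect, effects in by_dialect.items()
        for effect in effects
    ]
    return pstdev(list(dialect_means.values())), pstdev(deviations)


# Invented per-speaker Voicing Effect ratios for two hypothetical dialects.
toy_effects = [("A", 1.0), ("A", 1.1), ("B", 1.4), ("B", 1.5)]
dialect_sd, speaker_sd = variability_summary(toy_effects)
```

In this toy data the dialect means are far apart and speakers sit close to their dialect mean, which is the pattern the slide reports for the Voicing Effect; the sibilant results would show the opposite relation.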
English speech over time and space: http://152.1.64.33/spade/latest/
Try out our Shiny app!
What do we learn about English phonology?
• confirm expected patterns from current-scale work
• identify new patterns of variability:
Certain features vary more by speaker, and less by dialect (e.g. sibilants)
Others vary more by dialect and less by speaker (e.g. vowel duration)
Why? e.g. Kleinschmidt (2018)
Challenges
• Ethics (multiple countries, GDPR)
• Data (collection, processing)
• Software development
• Some measures elusive (stops, e.g. p t k)
What next?
• analyse the sounds of ‘English’!
• prosody (intonation, voice quality, etc.)
• Expand ‘English’ to World/non-native Englishes
• Beyond English (ISCAN not language-specific)
Thank you!
and to the organizers of this workshop
Documentation
• GUI / server install: https://iscan.readthedocs.io/
• Can sign up as tutorial user
• Python API: https://polyglotdb.readthedocs.io/