Regression Analysis on Levenshtein-Pointwise Mutual
Information Segment Distance Across Languages and Acoustic
Distance
Eliza Margaretha, Martijn Wieling, John Nerbonne
University of Groningen
Abstract
We compare phonetic segment distances induced by the Levenshtein algorithm with pointwise mutual
information (PMI) weights across three languages: Dutch, German and Bulgarian. Our results
show that Dutch Levenshtein-PMI segment distances correlate significantly with those of Bulgarian. While the Dutch-Bulgarian pair yields a rather low correlation, the Dutch-German pair
yields a considerably higher one. Furthermore, we are interested in bridging the phonetic-linguistic and
distributional approaches. We examine how well vowel quality explains Levenshtein-PMI segment distances by reporting the correlation between Levenshtein-PMI distances and
acoustic vowel distances derived from formant measurements.
1 Introduction
A phonetic segment distance measures how far one phonetic segment is from another,
i.e. how dissimilar two segments are. It serves as fundamental information
for a wide range of research, for instance speech recognition and spoken language
processing. From transcribing spoken discourse (Geutner, et al., 1998) to
dialectology (Nerbonne, et al., 1996), segment distances are important for predicting
the pronunciation differences that arise from speaker differences such as geographic location,
gender, and the size and shape of the vocal tract.
In addition to the classic Levenshtein algorithm, Wieling, et al. (2009) proposed a
variant that uses PMI-generated weights, which performed best among the
variants they compared. In this work, we compare the segment
distances induced by this Levenshtein-PMI method across Dutch, German and Bulgarian. We
explore whether the segment distances of different languages correlate significantly with each other
and thus whether segment distances in one language can be used to predict those in other
languages.
The Levenshtein-PMI method attempts to estimate segment distances automatically. Wieling,
et al. (2009) report that the method is very strongly related to the PairHMM method,
whose segment distances correlate highly with acoustic distances. In this work, we look for a direct relationship
between the Levenshtein-PMI method and vowel quality. More generally, we want to
relate the distributional and the phonetic-linguistic approaches
by determining how well Levenshtein-PMI distances can be estimated from acoustic distances.
A previous study by Ellison (1992) made a similar attempt to derive phonological content from
the distribution of word information rather than from acoustic data. Ellison proposed a
methodology for constructing unsupervised, cipher-independent and language-independent
machine learning systems for learning phonology. The induction model learns from surface
information about words taken from a lexicon and, using only that information,
was able to perform a consonant-vowel classification.
This report is structured as follows. In the next section we briefly explain the Levenshtein-PMI
method, formant measurement and the Mantel test. We describe our datasets
and methodology in section 3 and report the results in section 4. Finally, we discuss
and summarize our work in sections 5 and 6.
2 Literature
This section highlights previous work related to our research. First, we describe the
Levenshtein-PMI method with which we obtain our segment distances. Second, we present
the concept of vowel quality and the formant measurements used to obtain the acoustic distances.
Last, we explain the Mantel test as a suitable method for assessing the significance of a correlation
between distance matrices.
2.1 Levenshtein using Pointwise Mutual Information-generated segment distances
The Levenshtein algorithm (Levenshtein, 1965) is a well-known and widely applied distance
measure. It can also be used to compute segment distances, based on how often
segment x is aligned with segment y. In the context of string alignment, an insertion is
regarded as an alignment of a segment against a gap, a deletion as an alignment of a gap
against a segment, and a substitution as an alignment of two segments (Wieling, et al., 2009).
The following is an example of a string alignment between two different pronunciations of
milk in Dutch. The first string is /molke/, from a Frisian dialect, and the second is /melek/,
from another dialect spoken in several regions of the Netherlands such as Limburg. In the example,
we find a substitution (S) of the vowel /o/ by /e/, a deletion (D) as the alignment
of a gap with /e/, and an insertion (I) as the alignment of /e/ with a gap. For computing segment distances,
we are most interested in the substitution operation.
m o l - k e
m e l e k -
  S   D   I
PMI was proposed by Church & Hanks (1990), originally to measure word association
norms. PMI compares the probability of observing two variables x and y together (joint
probability) with the probability of observing them independently (chance). Applied to the
Levenshtein algorithm, PMI adjusts an alignment distance with a weight
that reflects how frequently the alignment occurs. It is therefore able to
express whether an alignment is nearer or further than other similar alignments.
\[ \mathrm{PMI}(x, y) = \log_2 \left( \frac{p(x, y)}{p(x)\,p(y)} \right) \]
From the viewpoint of information distribution, a segment distance can be regarded as how far
the distribution of segment x is from the distribution of segment y. Wieling, et al. (2009)
define the components of the PMI of a segment pair x and y as follows with regard
to generating segment distances:
• p(x, y) is the relative occurrence of the aligned segments x and y in the whole
dataset. Specifically, p(x, y) is computed as the number of times x and y occur at the
same position in two aligned strings, divided by the total number of aligned
segments.
• p(x) and p(y) are the relative occurrences of x and y respectively in the whole
dataset, namely the number of occurrences of x or y divided by the total number
of segment occurrences.
The PMI value grows with the number of co-occurrences of the segment pair. If
segments x and y are likely to co-occur, p(x, y) will be much larger than p(x)p(y) and
consequently PMI(x, y) will be much larger than 0. Conversely, negative PMI values
indicate that the segments are unlikely to co-occur. So that frequently corresponding segments
receive a low distance, the PMI values are transformed into segment distances by subtracting
the PMI value from 0 and adding the maximum PMI value.
Segment distances are trained with an iterative procedure. First,
string alignments are generated using the Levenshtein algorithm with the restriction that
vowels may not be aligned with consonants. Second, the PMI value of every segment pair is
calculated and transformed. Third, the Levenshtein algorithm is applied with these segment
distances to create a new set of alignments. Steps 2 and 3 are repeated until convergence,
namely until the difference between two consecutive iterations is close to 0.
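As an illustration, the second step of the procedure (computing PMI values and transforming them into distances) can be sketched in Python. This is a minimal sketch under our own assumptions and not the implementation used in the L04 program: the function name pmi_distances and the input format (a list of aligned segment pairs) are hypothetical, alignments against gaps are left out, and the segment frequencies are estimated from the aligned pairs themselves rather than from the full transcriptions.

```python
import math
from collections import Counter


def pmi_distances(aligned_pairs):
    """Turn aligned segment pairs into PMI-based segment distances.

    aligned_pairs: iterable of (x, y) tuples, one per aligned position.
    Returns a dict mapping each (sorted) pair to max(PMI) - PMI.
    """
    # Treat (x, y) and (y, x) as the same alignment.
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    total_pairs = sum(pair_counts.values())

    # Assumption: count segment occurrences on both sides of the alignments.
    segment_counts = Counter()
    for (x, y), n in pair_counts.items():
        segment_counts[x] += n
        segment_counts[y] += n
    total_segments = sum(segment_counts.values())

    pmi = {}
    for (x, y), n in pair_counts.items():
        p_xy = n / total_pairs
        p_x = segment_counts[x] / total_segments
        p_y = segment_counts[y] / total_segments
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))

    # Subtract each PMI value from 0 and add the maximum PMI value,
    # so that the most strongly associated pair gets distance 0.
    max_pmi = max(pmi.values())
    return {pair: max_pmi - value for pair, value in pmi.items()}


# Toy data: /a/-/e/ is aligned often, the other pairs only once.
toy = [("a", "e")] * 8 + [("a", "o")] + [("e", "o")]
distances = pmi_distances(toy)
```

In the toy data the frequently aligned pair receives the lowest distance, which is the behaviour the transformation is meant to produce.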
2.2 Vowel Quality and Formant measurement
Vowel quality is the property that makes one vowel sound different from another, for
example /iː/ as in sheep from /ɪ/ as in ship (McArthur, 1998). The quality of a vowel is
determined by the configuration of the vocal tract (the parts of the anatomy that produce vocal
sounds) during pronunciation, i.e. the position of the tongue, lips and lower jaw, and the resulting size and
shape of the mouth and pharynx.
The most common way to measure vowel quality from acoustic signals is formant
measurement (Leinonen, 2010). Formants are the frequencies at which energy is concentrated in the
acoustic signal, i.e. the lowest resonance frequencies (Peterson & Barney, 1952). At a
resonance frequency, the signal oscillates at a larger amplitude than at other
frequencies. These vocal resonances characterize distinguishable vowel sounds.
The first two formants are the most distinguishing features, and the third formant is
useful when pronunciation is strongly affected by the position of the lips (Ladefoged,
2005). Figure 2.1 illustrates how vowels are distinguished by formant measurements. A formant
appears as a darker band in a spectrogram. The two vowels shown have similar first
formants but clearly different second formants. The third formant provides
additional information for the distinction.
Figure 2.1 Illustration of Formant Measurements (Leinonen, 2010)
An acoustic distance between two vowels can be obtained by calculating the Euclidean
distance between their formant values (Wieling, et al., 2007). To generalize the acoustic distance,
non-linguistic speaker-dependent differences in the acoustic signal, such as pitch, need to be
normalized. A common way to do so is band-pass filtering with Bark or
Mel filters. Because human pitch perception is not linear, the linear Hertz frequency
is transformed to the non-linear, almost logarithmic, Bark or Mel scale.
In addition to the Bark and Mel scales, Lobanov (1971) suggested a z-score transformation
to normalize per speaker. The z-score
transformation thus helps to even out the voice differences between men and women.
While the Bark and Mel scales are applied to one vowel token at a time, the z-score transformation makes use
of information across vowels. More vowel normalization methods, such as Gerstman's range
normalization and Miller's formant-ratio model, are discussed and compared by Adank
(2003).
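The acoustic distance described above can be sketched as follows. The conversion formulas are assumptions on our part, since the text does not state which Bark and Mel approximations were applied: the sketch uses Traunmüller's Bark approximation and a widely used logarithmic Mel formula. The function names and the formant values are hypothetical and serve only as an illustration.

```python
import math


def hertz_to_bark(f_hz):
    # Traunmueller's approximation (an assumption, not necessarily
    # the formula behind the results reported here).
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def hertz_to_mel(f_hz):
    # Common logarithmic Mel formula (also an assumption).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def acoustic_distance(formants_a, formants_b, transform=lambda f: f):
    """Euclidean distance between two vowels given their formant values.

    formants_a, formants_b: sequences such as (F1, F2) or (F1, F2, F3) in Hertz.
    transform: scale conversion applied to every formant before measuring.
    """
    a = [transform(f) for f in formants_a]
    b = [transform(f) for f in formants_b]
    return math.dist(a, b)


# Hypothetical (F1, F2) values in Hertz for two vowels.
vowel_1 = (300.0, 2300.0)
vowel_2 = (300.0, 2000.0)
d_hertz = acoustic_distance(vowel_1, vowel_2)
d_bark = acoustic_distance(vowel_1, vowel_2, hertz_to_bark)
d_mel = acoustic_distance(vowel_1, vowel_2, hertz_to_mel)
```

Passing the scale conversion as a parameter keeps the distance computation identical across the Hertz, Bark and Mel variants, so that only the scale differs between them.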
2.3 Mantel Test
Normally we compare independent objects when carrying out a regression analysis. In other
words, we assume that the observations to be correlated are independent. However, the entries
of a distance matrix are typically dependent in some way (Manly, 1994).
In the case of acoustic distances derived from
the first and second formants, the distances are dependent because they obey the
triangle inequality. According to this inequality, the direct distance between two
objects A and C can never exceed the sum of the distances from A to B and from B to C
for any third object B, so the direct distance is constrained by the other distances.
This concept is illustrated in Figure 2.2.
On the other hand, Levenshtein-PMI distances can be viewed as independent, as they do
not necessarily obey the triangle inequality. Moreover, since Levenshtein-PMI distances
come from information distribution theory, they are not guaranteed to be distances or metrics in
the mathematical sense.
Comparing Levenshtein-PMI distances with acoustic distances therefore means
comparing an independent matrix with a dependent matrix. In such a case, it is
essential to test whether the relationship between the matrices is truly significant.
Evaluating their correlation coefficient and testing its significance in the usual way is not sufficient.
The Mantel test, introduced by Mantel (1967), is a widely used test for this
purpose. It was originally proposed for identifying space and time clustering of
disease. The Mantel test is based on randomization and permutation. To assess the
significance of the correlation between two distance matrices, the observed correlation is compared with the
correlations obtained when one matrix is kept fixed and the rows and columns of the other
matrix are randomly permuted, ideally over all possible permutations.
The null hypothesis is that there is no relationship between the two matrices. If it
holds, the correlation coefficients of the permuted matrices are equally likely to be larger or
smaller than the observed one. An observation value can be used to show a positive relationship. Specifically, we
compute the observation value by adding 1 for every permutation with r(PD1, D2) ≥ r(D1, D2), where D1
and D2 are the distance matrices and PD1 is a permuted D1, and then dividing the count by the number of
replicates.
Figure 2.2 Triangle Inequality
To be exact, we would need to perform the comparison for all possible permutations. However,
the number of permutations grows enormously as the matrix grows larger.
Therefore, a Monte-Carlo test (Metropolis & Ulam, 1949) is a good alternative: taking
a random sample of the possible replicates should be sufficient.
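The Mantel test with Monte-Carlo sampling can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than the procedure of any particular statistics package: the function names are hypothetical, the matrices are assumed to be square and symmetric lists of lists, and the p-value counts the observed arrangement as one of the replicates, which is a common convention but not necessarily the one behind the values reported in this report.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Cells above the diagonal after reordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(d1, d2, n_permutations=10000, seed=0):
    """Return (observed correlation, Monte-Carlo p-value) for two distance matrices."""
    identity = list(range(len(d1)))
    flat_d2 = upper_triangle(d2, identity)
    observed = pearson(upper_triangle(d1, identity), flat_d2)

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        order = identity[:]
        rng.shuffle(order)  # the same permutation is used for rows and columns
        if pearson(upper_triangle(d1, order), flat_d2) >= observed:
            hits += 1
    # Assumption: the observed arrangement counts as one replicate.
    return observed, (hits + 1) / (n_permutations + 1)


# Toy example: six points on a line; d2 is simply d1 doubled.
positions = [0, 1, 3, 6, 10, 15]
toy_d1 = [[abs(a - b) for b in positions] for a in positions]
toy_d2 = [[2 * value for value in row] for row in toy_d1]
toy_r, toy_p = mantel_test(toy_d1, toy_d2, n_permutations=999, seed=1)
```

Only the cells above the diagonal enter the correlation, so each pair of objects is counted once.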
3 Dataset
Our Dutch data come from the digital Dutch dialect transcriptions of the
Goeman-Taeldeman-Van Reenen-Project (GTRP) as used by Wieling, et al. (2007). They cover
dialect varieties from 424 different locations in the Netherlands. For each variety, there are
562 transcriptions of different words, which altogether comprise 82 Dutch phonetic segment
types.
The Bulgarian data were collected from various sources, namely
students' theses at the University of Sofia, published monographs, dictionaries, and the
archive of the Ideographic Dictionary of Bulgarian Dialects (Prokić, et al., 2009). They contain
transcriptions of 152 words from 197 locations all over Bulgaria, with 67 different
segment types including diacritics and suprasegmentals, i.e. vocal effects such as emphasis or
prosody.
Additionally, we made use of the German dataset (Nerbonne & Siedle, 2005) which
consists of 78 segment types. The transcriptions of 196 words were collected from 186
locations in Germany for the Kleiner Deutscher Lautatlas project.
For each language, the L04 program¹ developed by Peter Kleiweg was employed to compute
the Levenshtein-PMI segment distances. The program produces a matrix whose
rows and columns designate segment types and in which each cell (i, j) holds the distance
between the two corresponding segment types i and j. Each pair of segments that is never aligned
receives a very high penalty, which results in a very high distance.
Since the segment labels of the different languages are written in different Extended Speech
Assessment Methods Phonetic Alphabet (X-SAMPA) formats, we transform the X-SAMPA
labels to their corresponding International Phonetic Alphabet (IPA) symbols. Then we identify
the segment labels shared by each language pair, namely Dutch and Bulgarian, and Dutch and
German. For each language pair, we collect all alignments of shared segments
that have low distances in both languages. Additionally, we count the
vowel alignments and consonant alignments separately.
Table 3.1 Dutch, Bulgarian, and German Segment Figures
Language Pair         Shared Types   Segment Alignments   Vowel Alignments   Consonant Alignments
Dutch and Bulgarian   43             235                  92                 143
Dutch and German      71             870                  261                609
Dutch and Bulgarian share 43 identical phonetic segments in our data. In total, there are
903 segment alignments, but there are only 235 alignments with low distances consisting of
92 vowel alignments and 143 consonant alignments. On the other hand, Dutch and German
data share 71 identical phonetic segments. There are 870 alignments with low distances
including 261 vowel alignments and 609 consonant alignments. These figures are
summarized in Table 3.1.
1 http://www.let.rug.nlkleiweg/L04/Manuals/leven.html
The normal Q-Q plots of the Bulgarian and Dutch low segment distances in Figure 3.2 (a) and (b),
which plot observed values against expected normal values, suggest that the data are
normally distributed, with one outlier in Dutch. Similarly, the Q-Q plot of the Dutch
and German low segment distances shows that the data are normally distributed
(see Appendix I.2). The segment distances vary from 0 to 5000.
The box plot in Figure 3.1 shows that the Dutch and
Bulgarian medians are close to each other and that most of the data overlap. Thus, we
expect the two data sets to be fairly similar, i.e. to show
no significant difference. The box plot of the
Dutch and German data shows a comparable pattern (see Appendix I.2).
Our acoustic data were obtained from Pols, et al. (1973), which contains the first three formants
of 50 Dutch male speakers, and Van Nierop, et al. (1973), which contains those of 25 female speakers.
The formants are averaged over all speakers, and the
acoustic distances are computed as the Euclidean distances between the averaged formant values. In total,
there are 36 vowel alignments in the acoustic data. All of these alignments also
appear in our Dutch Levenshtein-PMI data.
Besides the raw Hertz frequencies of the formants, we use formants transformed to the Bark
and Mel scales. Since raw Hertz frequency is linear whereas our perception is not, the
nonlinear Bark and Mel scales should fit the nature of our perception
better.
Additionally, we apply a z-score transformation to our acoustic data in the following
manner. The raw Hertz values are transformed to standardized z-scores for each speaker so as to
normalize the differences over all the vowels per speaker. Then, the z-scores of each
vowel are averaged over all speakers.
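The per-speaker z-score transformation can be sketched as follows. The sketch assumes a nested dictionary of formant values per speaker and vowel and uses the population standard deviation; both the data layout and that choice are our own assumptions, and the speaker labels and formant values are hypothetical.

```python
from statistics import mean, pstdev


def lobanov_normalize(speaker_formants):
    """Z-score each formant per speaker over all of that speaker's vowels.

    speaker_formants: {speaker: {vowel: (F1, F2, ...)}} in Hertz.
    """
    normalized = {}
    for speaker, vowels in speaker_formants.items():
        n_formants = len(next(iter(vowels.values())))
        # Mean and standard deviation of each formant for this speaker.
        means = [mean(v[k] for v in vowels.values()) for k in range(n_formants)]
        sds = [pstdev(v[k] for v in vowels.values()) for k in range(n_formants)]
        normalized[speaker] = {
            vowel: tuple((f[k] - means[k]) / sds[k] for k in range(n_formants))
            for vowel, f in vowels.items()
        }
    return normalized


def average_over_speakers(normalized):
    """Average the z-scores of each vowel over all speakers."""
    vowels = next(iter(normalized.values())).keys()
    return {
        vowel: tuple(
            mean(column)
            for column in zip(*(normalized[s][vowel] for s in normalized))
        )
        for vowel in vowels
    }


# Hypothetical data: speaker B has the same vowel system as A at twice the frequency.
toy_speakers = {
    "A": {"i": (300.0, 2300.0), "a": (700.0, 1300.0), "u": (300.0, 800.0)},
    "B": {"i": (600.0, 4600.0), "a": (1400.0, 2600.0), "u": (600.0, 1600.0)},
}
toy_z = lobanov_normalize(toy_speakers)
toy_avg = average_over_speakers(toy_z)
```

In the toy data the two speakers receive identical z-scores although their raw frequencies differ by a factor of two, which is the kind of speaker difference the transformation is meant to remove.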
4 Results and Analysis
In this section, we present regression analyses for different setups. First, we
compare Levenshtein-PMI distances across languages. Second, we compare
Levenshtein-PMI distances with several variants of acoustic distances.
Figure 3.1 Box Plot of Bulgarian and Dutch Data
Figure 3.2 Q-Q Plots of (a) Bulgarian and (b) Dutch Data
4.1 Comparing Dutch, Bulgarian and German Levenshtein-PMI Distances
We analyze Levenshtein-PMI distances across languages with the following setup.
The variable pairs are Dutch and Bulgarian, and Dutch and German. The cases are
all segment alignments occurring in both languages of a pair, and the value of each
variable for each case is the corresponding segment distance. For example, the Dutch value for a given
alignment involving /a/ is its Dutch distance, which we compare with the distance of the same alignment in
Bulgarian or in German. We perform a regression analysis on the two sets of
Levenshtein-PMI distances and compute their correlation coefficient to measure the
effect size. The setup is modeled in Figure 4.1 below.
Case                 Dutch                      Bulgarian/German
Segment alignment    Levenshtein-PMI distance   Levenshtein-PMI distance
Figure 4.1 Regression Analysis Model on Comparing Levenshtein-PMI Distances
The scatter plot in Figure 4.2 (a) visualizes the relationship between the Dutch and Bulgarian
data, and (b) that between the Dutch and German data. Each point in the scatter plots is a case, where x is
the segment distance in Dutch and y is the segment distance in Bulgarian or German. We
treat Dutch as the independent variable, which partly determines the values of the
dependent variable (Bulgarian or German).
A straight line (the regression line) is drawn in each scatter plot, suggesting a linear dependency.
The line has the formula y = a + bx, where a is the intercept and b is the slope. Each ŷ on
the regression line is the predicted segment distance in the other language (Bulgarian or
German) estimated from the corresponding Dutch segment distance. The difference between the
actual and predicted segment distances is the residual, e_i = y_i - ŷ_i. Least-squares
regression finds the line that minimizes the sum of squared residuals over all segment alignments.
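The least-squares fit can be sketched in a few lines of Python. This is a minimal sketch with hypothetical function names and toy data, not the SPSS procedure used for the results in this report; it returns the intercept, the slope and the Pearson correlation coefficient of a simple linear regression.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a, slope b and Pearson r for y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


def residuals(xs, ys, a, b):
    """Actual minus predicted value for every case."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Toy data lying exactly on y = 1 + 2x.
toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 7.0]
toy_a, toy_b, toy_r = fit_line(toy_x, toy_y)
toy_residuals = residuals(toy_x, toy_y, toy_a, toy_b)
```

With real distance data the residuals are not zero, and plotting them against the predicted values is a quick check of whether a straight line is an adequate model.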
The points for Dutch and Bulgarian scatter more than those for Dutch and
German, which are fairly concentrated around the regression line. Although the points look
moderately random, those for Dutch and German show a slight trend: the residuals
become larger as the distances become larger.
To examine the residuals more closely, we plot them against the predicted values, as
depicted in Figure 4.3 (a). The residuals support linearity, since the points are moderately
random and widely spread. They are also reasonably normally distributed, as shown by the P-P
plot in Figure 4.3 (b). The Dutch and German data behave similarly, with more data
points (see Appendix I.2). They also show the trend mentioned before, i.e. the residuals become
larger as the distances become larger.
Figure 4.2 Scatter Plots of (a) Dutch and Bulgarian, (b) Dutch and German
According to our SPSS results given in Figure 4.4, the regression line for Dutch and
Bulgarian is ŷ = 1568.562 + 0.3x. Using this regression line, we can calculate the
predicted Bulgarian segment distance for a given Dutch distance. For example,
if the distance of a particular segment alignment is 1556 in Dutch, the predicted distance of that alignment in
Bulgarian is ŷ = 1568.562 + 0.3(1556) = 2035.362. If we allow 5% error, i.e. with a 95%
confidence interval, the mean Bulgarian distance of this alignment should lie
within 2035.362 ± 1083 = (952, 3118), where 1083 is the standard error term for ŷ at the specific
x = 1556. In our data, the actual distance is 1675, which indeed lies in the interval.
Figure 4.4 Dutch and Bulgarian Regression Line Coefficients
The t-statistic is 5.454, showing that the relationship between the Dutch and Bulgarian data
is significant at p < 0.001. In other words, Dutch segment distances are a
statistically significant predictor of Bulgarian segment distances. The correlation coefficient r = 0.336 shows
that Dutch and Bulgarian have a low positive correlation.
The coefficient of determination, that is, the square of the correlation coefficient (r²), gives the
proportion of variability in a data set that is accounted for by a regression model (Moore & McCabe,
2006). It compares the variation explained by the explanatory variable, i.e. the Dutch segment
distances in our case, with the total variation in the whole data set. It therefore expresses the
explanatory size of the independent variable for the dependent variable. It also indicates how
well future outcomes can be predicted by the model.
From the ANOVA point of view, as depicted in Figure 4.5, the coefficient of determination is
computed as the regression sum of squares divided by the total sum of squares. For
Dutch and Bulgarian, the coefficient shows that Dutch segment distances account for
approximately 11% of the variation in Bulgarian segment distances. Figure 4.5 also shows that
Dutch distances have a significant effect on Bulgarian distances, as the F-statistic is
significant (p < 0.001).
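The ANOVA view of the coefficient of determination can be written out explicitly. The decomposition below is the standard one for simple linear regression; SSR, SSE and SST are labels introduced here for the regression, residual and total sums of squares, and the numeric check only reuses the Dutch-Bulgarian correlation coefficient reported in this section.

\[
r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \qquad
SST = \sum_i (y_i - \bar{y})^2, \quad
SSR = \sum_i (\hat{y}_i - \bar{y})^2, \quad
SSE = \sum_i (y_i - \hat{y}_i)^2
\]

\[
r = 0.336 \;\Rightarrow\; r^2 = 0.336^2 \approx 0.113
\]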
Figure 4.3 Plots of Dutch and Bulgarian Residuals: (a) residuals against predicted values, (b) normal P-P plot
Figure 4.5 ANOVA Summary of Dutch and Bulgarian Data
In the case of Dutch and German (see Appendix I.2), the t-statistic also indicates a
significant relationship, namely 23.925 (p < 0.001). The regression line is ŷ = 879.010 + 0.550x. The correlation (r = 0.630) is stronger than that of Dutch and Bulgarian. Almost
40% of the variation in German segment distances is accounted for by Dutch segment distances.
Furthermore, we compare vowel alignments and consonant alignments separately.
Both vowel and consonant alignments yield significant correlations at p < 0.001,
and vowel alignments obtain higher correlations than consonant alignments. Dutch
and Bulgarian vowel alignments correlate at r = 0.418, which means that nearly
18% of the variation in Bulgarian vowel distances can be predicted from Dutch vowel distances. Their
consonant alignments, on the other hand, correlate at r = 0.339, that is, roughly 11%
of the variation in Bulgarian consonant distances is accounted for by Dutch consonant distances.
Table 4.1 Dutch, Bulgarian and German Levenshtein-PMI Distances Correlations (p < 0.001)
Language Pair         Alignment Set   Pearson Correlation (r)   Explanatory Size (r²)
Dutch and Bulgarian   All             0.336                     0.113
Dutch and Bulgarian   Vowel           0.418                     0.178
Dutch and Bulgarian   Consonant       0.339                     0.115
Dutch and German      All             0.630                     0.397
Dutch and German      Vowel           0.620                     0.384
Dutch and German      Consonant       0.587                     0.345
For Dutch and German, the vowel distances correlate at r = 0.620, so Dutch vowel
distances account for over 38% of the variation in German vowel distances. The consonant distances have
a slightly lower correlation of r = 0.587, suggesting that approximately 35% of the variation in German
consonant distances is accounted for by Dutch consonant distances.
4.2 Comparing Levenshtein-PMI Distances to Acoustic Distances
Our second task is to compare the segment distances produced by the information distribution
approach with the common assessment of vowel quality, the phonetic-linguistic approach.
Specifically, we compare Dutch Levenshtein-PMI distances with acoustic distances.
Since we are interested in how well acoustic distances explain
Levenshtein-PMI distances, we set the acoustic distance as the explanatory variable and
the Levenshtein-PMI distance as the response variable. As in the previous task, the cases are
segment alignments and the values of the variables are the distances under the corresponding
approaches.
We evaluate each variant of the acoustic distances described in section 3, namely raw
Hertz frequency and frequency transformed to the Bark scale, the Mel scale and z-scores. For each
variant, we compute the Pearson correlation coefficient (r) and the coefficient of determination
showing the explanatory size (r²) for the first two and the first three formants. The results are
presented in Table 4.2.
Table 4.2 Dutch Levenshtein-PMI and Acoustic Distances Correlations
Acoustic Variation   Number of First Formants   Pearson Correlation (r)   Explanatory Size (r²)   Significance (p-value)
Hertz                2                          0.481                     0.231                   0.003
Hertz                3                          0.426                     0.181                   0.010
Z-scores             2                          0.720                     0.518                   < 0.001
Z-scores             3                          0.640                     0.410                   < 0.001
Bark Scale           2                          0.616                     0.379                   < 0.001
Bark Scale           3                          0.517                     0.267                   0.001
Mel Scale            2                          0.603                     0.364                   < 0.001
Mel Scale            3                          0.507                     0.257                   0.002
The correlations between the acoustic distances based on raw Hertz and the Levenshtein-PMI
distances are not remarkable. Raw Hertz with the first two formants correlates at r = 0.481,
which means it accounts for 23% of the variation in Levenshtein-PMI distance. Taking the
third formant into account slightly lowers the correlation coefficient to r = 0.426, meaning
that the acoustic distance accounts for 18% of the variation in Levenshtein-PMI distance.
Normalizing the raw Hertz values does improve the results. Our acoustic z-score distances
yield the best correlations, r = 0.720 for the first two formants and r = 0.640 for the first three
formants. Both results are significant at p < 0.001. With the first two formants, the acoustic distance accounts
for nearly 52% of the variation in Levenshtein-PMI distance. Including the third formant does
not refine the results but makes them poorer: the explanatory size is over 10 percentage points
lower than without it, and only 41% of the variation in Levenshtein-PMI distance is accounted
for by acoustic z-score distances with the first three formants.
The Bark and Mel scales produce similar results, although the Bark scale is marginally better
than the Mel scale. Almost 38% of the variation in Levenshtein-PMI distance is explained by acoustic
distances on the Bark scale with the first two formants (r = 0.616), and over 36% is explained on the Mel
scale, also with the first two formants (r = 0.603). The third formant again worsens the
results. With the third formant, acoustic distances on the Bark scale predict nearly 27% of the variation
in Levenshtein-PMI distance (r = 0.517) and distances on the Mel scale predict almost 26%
(r = 0.507). While the results with the first two formants on the Bark and Mel scales are significant at p < 0.001, the significance with the first three formants falls to p < 0.005.
As mentioned in section 2.3, the p-value is not sufficient for validating the significance of a
correlation coefficient between distance matrices. Although Levenshtein-PMI distances can be
regarded as independent, acoustic distances are not. Since we are comparing an
independent matrix with a dependent one, we need to apply the Mantel test to the
significance of their correlation coefficient. Instead of testing all possible permutations, we
use Monte-Carlo sampling with 10000 replicates. The outcomes of the Mantel test with
Monte-Carlo sampling are given in Table 4.3.
Table 4.3 Mantel Test Results of Dutch Levenshtein-PMI and Acoustic Distances
Acoustic Variation   Observation Value   Significance (p-value)
Hertz 2              0.168               0.013
Hertz 3              0.132               0.035
Z-score 2            0.410               1e-04
Z-score 3            0.317               3e-04
Bark 2               0.303               2e-04
Bark 3               0.206               0.002
Mel 2                0.286               2e-04
Mel 3                0.195               0.004
The significance levels in the Mantel test are in line with the significance levels of the
correlation coefficients in Table 4.2. That table shows that z-scores with the first two and
three formants, the Bark scale with two formants and the Mel scale with two formants are significant at p < 0.001.
Table 4.3 shows that these variants also have very low p-values in the Mantel test, which means
that randomly permuting the rows and columns rarely yields a correlation as high as
the observed one. Thus, the two kinds of distances are
genuinely related and their correlation is truly significant.
5 Discussion
Our results show that Dutch Levenshtein-PMI distances are able to predict distances in
German and Bulgarian. The Dutch prediction for German, which has characteristics similar to
Dutch, is much better than the prediction for Bulgarian, which has different characteristics.
Dutch and German are both Germanic languages. Since they
descend from the same parent language, they share a wide
range of similarities including types of consonants, vowels and accents (Auwera & König,
1994).
Bulgarian, on the other hand, is a Slavonic language; these languages are mainly spoken in
Eastern Europe. Bulgarian therefore has phonetic properties that differ from those of Dutch. Since the
sound systems of Slavonic languages are rich in consonants, their speakers are
less accustomed to pronouncing vowels. They typically have difficulty pronouncing vowels,
and they pronounce vowels differently from Dutch speakers.
Another issue to take into account is that a symbol of the
International Phonetic Alphabet (IPA) does not necessarily denote exactly the same
sound in different languages. The alphabet was originally defined based on English. A
sound that is similar to, but not exactly the same as, an English sound may be assigned the same
symbol. For instance, the /i/ sound in Bulgarian might be pronounced slightly differently from
English /i/, but it is labeled with the same symbol /i/.
Comparisons of Levenshtein-PMI distances with various transformations of acoustic
distances show that the z-score transformation yields the best results. The Bark and Mel scales help
to normalize the formants to match human perception, which is nonlinear. They improve the
explanatory size for Levenshtein-PMI distances by more than 10 percentage points. However, the z-score
transformation suits our acoustic data better, since the data were collected from male and
female speakers and the z-score transformation normalizes the differences over all vowels per
speaker. Z-scores therefore help to smooth out speaker differences with regard to
gender. They improve the explanatory size by nearly 30 percentage points.
The third formant appears not to be useful in our experiments and even leads to poorer
outcomes. This is not unusual, as it was also found by Wieling,
et al. (2007). We suspect that the pronunciations in our data are not strongly
determined by lip position, which greatly affects the third formant. Instead of helping
to distinguish vowels, the third formant seems to blur the differences among
them.
6 Summary
We have described two segment distance comparison tasks. First, we compared
Levenshtein-PMI segment distances between two pairs of languages, namely
Dutch-Bulgarian and Dutch-German. Second, we compared Levenshtein-PMI
distances with several variants of acoustic distances derived from formant measurements. For
both tasks, we found significant correlations between the compared variables.
Our results reveal that Levenshtein-PMI distances of Dutch are able to predict those of
Bulgarian and German. That is to say, Levenshtein-PMI distances of one language are able to
predict distances in other languages. We also report that prediction works better for a language
whose characteristics are similar to those of the predictor than for a language whose
characteristics differ. In our work in particular, Dutch predicts German better than it
predicts Bulgarian. Dutch distances account for up to 40% of the variation in German
distances. In the Bulgarian case, Dutch distances account for approximately 11% of the
variation.
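The "variation accounted for" figures above can be read as coefficients of determination from regressing the distances of one language on those of another. The sketch below shows that computation under the assumption of a simple linear regression with a single predictor; the function name `fit_simple_regression` and all numbers are hypothetical and are not data from this study.

```python
from statistics import mean


def fit_simple_regression(predictor, response):
    """Least-squares fit of response = intercept + slope * predictor.

    Returns (slope, intercept, r_squared), where r_squared is the share of
    the variation in the response that the predictor accounts for.
    """
    mx, my = mean(predictor), mean(response)
    sxx = sum((x - mx) ** 2 for x in predictor)
    sxy = sum((x - mx) * (y - my) for x, y in zip(predictor, response))
    slope = sxy / sxx
    intercept = my - slope * mx

    # Residual and total sums of squares give R^2 = 1 - SS_res / SS_tot.
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(predictor, response))
    ss_tot = sum((y - my) ** 2 for y in response)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical segment distances for the same segment pairs in two languages.
predictor_distances = [1.0, 2.0, 3.0, 4.0, 5.0]
response_distances = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = fit_simple_regression(
    predictor_distances, response_distances
)
```

With a single predictor, the coefficient of determination equals the squared Pearson correlation, so a reported share of variation can be converted to a correlation by taking its square root.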
Additionally, we show that vowel quality, as represented by acoustic distances, correlates
reasonably highly with Levenshtein-PMI distances. This implies that the phonetic-linguistic
approach has a significant relationship with the distribution information approach and that the
former can explain the latter to some extent. In our case, we evaluate how well Dutch acoustic
distances predict Dutch Levenshtein-PMI distances. We show that the former
can account for up to 52% of the variation in the latter.
Acoustic distances based on raw Hertz frequencies of the first two formants are able to estimate
about 23% of the variation in Levenshtein-PMI distances. Normalizing the raw Hertz frequencies
does improve the results. The Bark and Mel scales transform linear raw Hertz values into nonlinear
frequencies in order to match human perception. The acoustic distances on the Bark and Mel scales
produce comparable results, with Bark slightly better than Mel. They account for
about 37% of the variation in Levenshtein-PMI distances.
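The effect of rescaling formants before computing distances can be sketched as follows. The conversion formulas are widely used approximations (the O'Shaughnessy form for Mel and the Traunmüller form for Bark) and are an assumption here, as is the Euclidean distance over the first two formants; they are not necessarily the exact formulas behind the reported results. All function names and vowel values are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Hertz to Mel, using a common logarithmic approximation."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_bark(f_hz):
    """Hertz to Bark, using the Traunmueller approximation."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def acoustic_distance(vowel_a, vowel_b, scale=None):
    """Euclidean distance between two vowels given as (F1, F2) in Hz.

    scale: optional function applied to every formant before the distance
    is computed; None keeps the raw Hertz values.
    """
    if scale is None:
        scale = lambda f: f
    return math.sqrt(
        sum((scale(a) - scale(b)) ** 2 for a, b in zip(vowel_a, vowel_b))
    )


# Hypothetical (F1, F2) values in Hz, for illustration only.
vowel_i = (280.0, 2250.0)
vowel_a = (730.0, 1100.0)
for name, scale in (("Hz", None), ("Bark", hz_to_bark), ("Mel", hz_to_mel)):
    print(name, round(acoustic_distance(vowel_i, vowel_a, scale), 2))
```

On the raw Hertz scale the F2 difference dominates the distance because F2 spans a much wider frequency range; both perceptual scales compress the higher frequencies and so give F1 differences relatively more weight.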
The best prediction is attained by the z-score transformation. Since our acoustic data combine
male and female speakers and the z-score transformation normalizes the differences among all
vowels per speaker, it helps to smooth out the differences between men and women.
Since we compare independent Levenshtein-PMI distances to dependent acoustic
distances, we also test the significance of their correlations. We do so by carrying out a Mantel
test, which confirms their significance. In particular, for correlations with normalized
acoustic distances (z-score, Bark and Mel scale) using the first two formants, the p-values are very
low, indicating that there is a relationship between the two compared distances.
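The idea behind the Mantel test can be sketched as a permutation test on two distance matrices. This is a generic one-sided version written for illustration, not the implementation used for the reported p-values; the function names, the number of permutations and the toy matrices are all assumptions.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def upper_triangle(matrix, order):
    """Upper-triangle entries after relabeling rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """One-sided Mantel test; returns (observed correlation, p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    flat_a = upper_triangle(dist_a, identity)
    observed = pearson(flat_a, upper_triangle(dist_b, identity))

    # Shuffle the labels of one matrix (rows and columns together) and count
    # how often the permuted correlation reaches the observed one.
    at_least_as_large = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)
        if pearson(flat_a, upper_triangle(dist_b, order)) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)


# Toy example: two distance matrices built from the same hypothetical points.
positions = [0, 1, 3, 6, 10, 15]
dist_a = [[abs(p - q) for q in positions] for p in positions]
dist_b = [[2 * d + (1 if d else 0) for d in row] for row in dist_a]
r, p = mantel_test(dist_a, dist_b)
print(f"r = {r:.3f}, p = {p:.3f}")
```

Permuting whole rows and columns, rather than individual cells, preserves the dependence among the entries of a distance matrix, which is why an ordinary significance test for correlation is not appropriate for this kind of data.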
Appendix I Data
I.1 Dutch and Bulgarian Data
I.2 Dutch and German Data
Appendix II Results
II.1 Results of Levenshtein-PMI Dutch and German Segment Distance Comparison
Bibliography
Adank, P. M. (2003). Vowel Normalization: a perceptual acoustic study of Dutch vowels.
Wageningen: Ponsen & Looijen.
Auwera, J. v., & König, E. (1994). The Germanic Languages. London: Routledge.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and
lexicography. Comput. Linguist., 16(1), 22-29.
Ellison, T. M. (1992). The Machine Learning of Phonological Structure. PhD Thesis,
University of Western Australia, Department of Computer Science.
Geutner, P., Finke, M., & Waibel, A. (1998). Phonetic-Distance-Based Hypothesis Driven
Lexical Adaptation For Transcribing Multilingual Broadcast News. In Proceedings of the
International Conference on Spoken Language Processing.
Ladefoged, P. (2005). Vowels and Consonants: An Introduction to the Sounds of Languages
(2nd ed.). Malden, MA: Blackwell.
Leinonen, T. (2010). An Acoustic Analysis of Vowel Pronunciation in Swedish Dialects. PhD
Thesis, Groningen.
Levenshtein, V. (1965). Binary codes capable of correcting deletions, insertions and
reversals. Soviet Physics Doklady, 163(4), 845-848.
Lobanov, B. M. (1971). Classification of Russian Vowels Spoken by Different Speakers. J.
Acoust. Soc. Am., 49, 606-608.
Manly, B. F. (1994). Multivariate Statistical Methods: A Primer (2nd ed.). USA: Chapman
and Hall.
Mantel, N. (1967). The Detection of Disease Clustering and a Generalized Regression
Approach. Cancer Research, 27(2), 209-220.
McArthur, T. (1998). "VOWEL QUALITY" Concise Oxford Companion to the English
Language. Retrieved May 5, 2010, from Oxford Reference Online:
http://www.oxfordreference.com/views/ENTRY.html?subview=Main&entry=t29.e1288
Metropolis, N., & Ulam, S. (1949). The Monte Carlo Method. Journal of the American
Statistical Association, 44(247), 335-341.
Moore, D. S., & McCabe, G. P. (2006). Introduction to the Practice of Statistics (5th ed.).
New York: W. H. Freeman.
Nerbonne, J., & Siedle, C. (2005). Dialektklassifikation auf der Grundlage aggregierter
Ausspracheunterschiede. Zeitschrift für Dialektologie und Linguistik, 72(2), 129–147.
Nerbonne, J., Heeringa, W., van den Hout, E., van der Kooi, P., Otten, S., & van de Vis, W.
(1996). Phonetic Distance between Dutch Dialects. In G. Durieux, W. Daelemans, & S. Gillis
(Eds.), CLIN VI: Proc. of the Sixth CLIN Meeting (pp. 185-202). Antwerp, Centre for
Dutch Language and Speech (UIA).
Peterson, G., & Barney, H. (1952). Control methods used in a study of the vowels.
J. Acoust. Soc. Am., 24(2), 175-184.
Pols, L. C., Tromp, H. R., & Plomp, R. (1973). Frequency analysis of Dutch vowels from 50
male speakers. The Journal of the Acoustical Society of America, 43, 1093–1101.
Prokić, J., Nerbonne, J., Zhobov, V., Osenova, P., Simov, K., Zastrow, T., et al. (2009). The
computational analysis of Bulgarian dialect pronunciation. Serdica Journal of Computing.
Statistical Consulting Group. (n.d.). How can I perform a Mantel test in R? Retrieved May 8,
2010, from UCLA: Academic Technology Services:
http://www.ats.ucla.edu/stat/R/faq/mantel_test.htm
Van Nierop, D. J., Pols, L. C., & Plomp, R. (1973). Frequency analysis of Dutch vowels from
25 female speakers. Acustica, 29, 110–118.
Wieling, M., Heeringa, W., & Nerbonne, J. (2007). An Aggregate Analysis of Pronunciation
in the Goeman-Taeldeman-van Reenen-Project Data. Taal en Tongval, 59(1), 84-116.
Wieling, M., Leinonen, T., & Nerbonne, J. (2007). Inducing sound segment differences using
Pair Hidden Markov Models. SigMorPhon '07: Proceedings of Ninth Meeting of the ACL
Special Interest Group in Computational Morphology and Phonology (pp. 48-56).
Prague, Czech Republic: Association for Computational Linguistics.
Wieling, M., Prokić, J., & Nerbonne, J. (2009). Evaluating the pairwise string alignment of
pronunciations. Proceedings of the EACL 2009 Workshop on Language Technology and
Resources for Cultural Heritage, Social Sciences, Humanities, and Education (pp. 26-
34). Athens, Greece: Association for Computational Linguistics.
Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands. The
Journal of the Acoustical Society of America, 33(2), 248.