How many mechanisms? 1
Running Head: HOW MANY MECHANISMS?
How Many Mechanisms are Needed to Analyze Speech?:
A Connectionist Simulation of Structural Rule Learning in Language Acquisition
Paco Calvo
Universidad de Murcia
Aarre Laakso
University of Michigan – Dearborn
Paco Calvo, Departamento de Filosofía, Universidad de Murcia; Aarre Laakso, Department of
Behavioral Sciences, University of Michigan – Dearborn.
This research was supported by DGICYT Project HUM2006-11603-C02-01 (Spanish Ministry of
Science and Education and Feder Funds). This paper extends a preliminary study that appeared in the
Proceedings of the 30th Annual Meeting of the Cognitive Science Society.
Correspondence concerning this article should be addressed to Paco Calvo, Departamento de
Filosofía, Universidad de Murcia, Edificio Luis Vives, Campus de Espinardo, E-30100 Murcia – SPAIN. E-
mail: [email protected]
Abstract
According to the “dual-mechanism” hypothesis, the induction of structural information correlates negatively
with familiarization length, in such a way that two mechanisms are required in order to analyze speech: (1)
statistical computations based on nonadjacent transitional probabilities, and (2) an additional rule-following
mechanism. The argument is that, although statistical mechanisms may suffice for speech segmentation, an
additional rule-following mechanism is required in order to quickly extract structural information. We
present a set of neural network studies that shows how a single statistical mechanism can mimic the
apparent discovery of structural regularities, beyond the segmentation of speech. We argue that our results
undermine one argument for the dual-mechanism hypothesis.
Keywords: artificial language acquisition; dual-mechanism hypothesis; statistical learning
1. Introduction
Language acquisition is a central component of human development. A key question is whether
language can be acquired solely by domain-general (e.g., statistical) learning mechanisms or whether
domain-specific (e.g., algebraic) learning mechanisms are required.
Peña, Bonatti, Nespor, & Mehler (2002) consider whether statistical computations based on
transitional probabilities of the sort that are exploited in speech segmentation (Saffran, Aslin, & Newport,
1996) can also be used over nonadjacent dependencies in order to induce rule-like regularities in the speech
stream. Nonadjacent dependencies are an important test of any purported account of language acquisition
based on statistical learning because they are pervasive in language, but it is not obvious how a statistical
learning mechanism could acquire them. Examples of nonadjacent dependencies occur at many levels in
language. At the subsyllabic level, for example, the vowels of English words differ in pronunciation
depending on whether the word ends with a silent ‘e’ (as in “lop” versus “lope”). At the morphosyntactic
level, parasynthesis, infixation and agreement all involve non-adjacent dependencies. One example is
agreement between auxiliaries and inflectional morphemes, as in the English “is running.” Another is
agreement between articles and inflectional morphemes, as in the Spanish “el pollo.” A third is agreement in
number, as in the English “he runs.” The nonadjacent dependencies required for maintaining agreement can
span lengthy center embeddings, as in “The cats who the dog bites run.” Thus, mastering a language
requires behaving (e.g., producing utterances) in accordance with nonadjacent dependencies, including long-
distance dependencies.
Peña et al. report experimental evidence from French-speaking adults that they argue shows that
humans use both statistical learning (to segment speech) and algebraic computations (to induce structural
regularities such as grammatical rules). That is, Peña et al. argue that statistics are insufficient to support the
discovery of underlying grammatical regularities, and that their results imply knowledge of rules. Endress &
Bonatti (2007) go a step further and argue that participants may be tuning to rules at an even higher level of
abstraction than those considered by Peña et al. Endress & Bonatti replicate and extend the Peña et al.
results with Italian-speaking adults and attempt to model them using connectionist networks. They argue
that their failure to model the experimental results with connectionist networks demonstrates that associative
learning mechanisms are insufficient for language learning.
In this paper, we report a set of connectionist simulations – based on a single statistical mechanism –
that does model the experimental results of Peña et al. and Endress & Bonatti. We conclude that they have
not demonstrated that rule-governed structure learning mechanisms are necessary for language acquisition.
2. An overview of the evidence for the “dual-mechanism hypothesis”
2.1 Peña et al.’s (2002) experiments
Peña et al. (2002) tested adults’ abilities to segment speech based on non-adjacent dependencies and
to generalize beyond the familiarization corpus. The experiments were based on roughly the same method as
that used by Newport & Aslin (2000). Adult participants were asked to listen to a sequence of trisyllabic
artificial “words” for a certain period of time. The artificial “words” had the form AiXCi, where Ai, X and Ci
are syllables. The subscripts on A and C indicate that the nonadjacent syllables are matched, such that the
transitional probability between an Ai and the following Ci is 1.0. There are three X syllables, so the
transitional probabilities between an Ai and an intermediate X and between an X and the final Ci are each
0.33. There are three word classes (i ∈ {1, 2, 3}), and no two adjacent words in the speech stream may be from
the same class, so the transitional probability between the final syllable of one word Ci and the first syllable
of the next word Aj is 0.5. The three word classes are pu…ki, be…ga and ta…du. The three filler syllables
are li, ra and fo. Thus, the A1XC1 family consists of the words puliki, puraki and pufoki, the A2XC2 family
consists of the words beliga, beraga and befoga, and the A3XC3 family consists of the words talidu, taradu
and tafodu. Ten-minute and two-minute familiarization streams were produced by concatenating speech-
synthesized tokens of these nine words, randomly selected subject to the constraints that (a) a word of a
given family could not be immediately followed by another word of the same family, and (b) a word with a
given intermediate syllable could not be immediately followed by another word with the same intermediate
syllable. After familiarization, participants were asked to choose – between pairs of stimuli that could
belong to three kinds of test items: “words”, “part words” and “rule words” – which seemed more like a
word from the familiarization stream.
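As a check on these design probabilities, a short script (ours, purely illustrative) can estimate syllable-to-syllable transitional probabilities from a randomly generated stream. The adjacency constraints are ignored here, since they do not affect the within-word probabilities being checked:

```python
import random
from collections import Counter

# Peña et al.'s lexicon: families pu…ki, be…ga, ta…du with fillers li, ra, fo.
families = {"pu": "ki", "be": "ga", "ta": "du"}
fillers = ["li", "ra", "fo"]
lexicon = [a + x + c for a, c in families.items() for x in fillers]

# A simple random stream of word tokens, split into two-character syllables.
rng = random.Random(0)
syllables = []
for _ in range(3000):
    word = rng.choice(lexicon)
    syllables += [word[0:2], word[2:4], word[4:6]]

def transition_probs(seq, distance):
    """P(second | first) for syllable pairs `distance` positions apart."""
    pair_counts = Counter(zip(seq, seq[distance:]))
    first_counts = Counter(seq[:-distance])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

tp1 = transition_probs(syllables, 1)  # adjacent
tp2 = transition_probs(syllables, 2)  # nonadjacent, one intervening syllable

assert tp2[("pu", "ki")] == 1.0             # A_i predicts C_i perfectly two ahead
assert abs(tp1[("pu", "li")] - 1/3) < 0.05  # A_i -> X is about 0.33
```

The nonadjacent A→C probability comes out at exactly 1.0 because each A syllable occurs only word-initially, while the adjacent A→X probabilities hover around 0.33, as in the design.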
In Peña et al.’s experiment 1, participants were familiarized to a continuous speech stream, as
described above, for 10 minutes. In the test phase, they were asked to choose – between a “word” and a
“part word” – which item seemed more like a word of the imaginary language with which they had been
familiarized. The “words” were simply items of the form AiXCi, that is, words that had appeared in the
familiarization stream. The “part words” were also items that had appeared in the familiarization stream, but
ones that straddled a word boundary. These could be of two types: “type 12” part words consisted of items
having the form CiAjX, whereas “type 21” part words consisted of items having the form XCiAj. There are
18 type 12 part words: dubefo, dubeli, dubera, dupufo, dupuli, dupura, gapufo, gapuli, gapura, gatafo,
gatali, gatara, kibefo, kibeli, kibera, kitafo, kitali and kitara. There are also 18 type 21 part words: foduga,
foduki, fogapu, fogata, fokidu, fokiga, lidube, lidupu, ligapu, ligata, likibe, likita, radube, radupu, ragapu,
ragata, rakibe and rakita. The result was that participants preferred words over part words. Experiment 1 is
consistent with the hypothesis that statistics alone suffice for speech segmentation, since what counts as a
“word” is a function of the transitional probabilities between specific non-adjacent items (in this case,
between Ai and Ci).
In their experiment 2, Peña et al. investigated whether participants were simply segmenting the
stream by exploiting differential transitional probabilities between words and part words, or whether they
were attuning to some more abstract underlying grammatical regularity. In order to answer this question,
participants were asked, after having been familiarized for 10 minutes to the same speech stream as in
experiment 1, to choose between a part word and what Peña et al. dubbed a “rule word.” A rule word is a
sequence of three syllables of the form AiX′Ci, where X′ stands for a “familiar” syllable, that is, one that
occurs in familiarization, although never between Ai and Ci. In other words, X′ is another A or C syllable, so
that rule words have forms like A1C3C1 and A1A2C1. There are 12 rule words: beduga, bekiga, bepuga,
betaga, pubeki, puduki, pugaki, putaki, tabedu, tagadu, takidu and tapudu. They are called “rule words”
because participants who learned rules of the form “if the first syllable is Ai, then the last syllable is Ci”
should find them familiar. Although a rule-word is novel, it is congruent with the structure of actual words
by means of the preservation of the relationship between Ai and Ci. As it turned out, participants preferred
part words over rule words, suggesting that they had not extracted the rule from the training stream even
though (as shown in experiment 1) they had learned to segment the words of the language. Peña et al.
therefore claim that their experiment 2 shows that statistics are not sufficient to extract structural
information from a continuous familiarization corpus. In light of experiments 1 and 2 together, Peña et al.
conclude that a “computational mechanism sufficiently powerful to support segmentation on the basis of
nonadjacent transitional probabilities [experiment 1] is insufficient to support the discovery of the
underlying grammatical-like regularity embedded in a continuous speech stream [experiment 2]” (p. 605).
In Peña et al.’s experiment 3, a “subliminal” 25ms pause was inserted between each pair of words in
the familiarization stream. Although participants reported no awareness of such gaps, their presence did
affect the results. Specifically, Peña et al. found that, when participants were trained on a speech stream with
gaps, the participants subsequently preferred rule words to part words at test. In their view, these results
imply knowledge of rules, insofar as the very notion of an abstract rule-word underlies the successful
discrimination of rule words and part words. Thus, they write, “This seems to be due to the fact that the
selected items are compatible with a generalization of the kind ‘If there is a [pu] now, then there will be a
[ki] after an intervening X’” (p. 606). In other words, Peña et al. contend that two different computational
mechanisms must be responsible for the results of experiments 1-3: (i) a statistical mechanism for
performing speech segmentation (experiment 1), and (ii) a rule-governed mechanism responsible for the
induction of grammatical structural regularities in the corpus (experiment 3).
Peña et al.’s experiments 4 and 5 tested for preference between part words and rule words after
familiarization on a continuous and a segmented stream for 30 and 2 minutes, respectively (see Table 1). In
the first case, participants preferred part words over rule words, demonstrating that even lengthy
familiarization with a corpus that does not contain cues to segmentation does not lead to abstraction of the
rule. In the second case, the participants preferred rule words over part words, demonstrating that even very
short familiarization with a corpus that does contain cues to segmentation leads to abstraction of the rule.
Table 1: Summary of Peña et al.’s experimental results.

Exp.  Stream      Familiarization duration  Test choice
1     Continuous  10 minutes                Words over part words
2     Continuous  10 minutes                No preference between rule words and part words
3     Segmented   10 minutes                Rule words over part words
4     Continuous  30 minutes                Part words over rule words
5     Segmented   2 minutes                 Rule words over part words
Peña et al. interpret the results of Table 1 as evidence for a dual-mechanism hypothesis: a statistical
mechanism for segmenting the familiarization corpus (experiment 1), and a rule-governed mechanism that
accounts for the induction of the rule that prefers rule words over part words (experiments 3 and 5).
2.2 Endress and Bonatti’s (2007) experiments
Rule words, as defined by Peña et al., relate to the sort of dependency that obtains between specific
syllables (Ai and Ci). Parasynthesis, infixation and agreement are natural language counterparts of the sort of
non-adjacent dependencies that define artificial rule words, as in the agreement that obtains in Spanish
across the lexical root poll in el pollo (“the chicken”, masculine singular) versus los pollos (masculine
plural). However, morphosyntactic dependencies may not exhibit a one-to-one correspondence between
specific “Ai” and “Ci” items (el _ o; los _ os). Instead, language acquisition may call for tracking
dependencies that occur between the classes that items such as Ai and Ci belong to. Endress & Bonatti note,
for example, that in English, prefixes such as de-, re- and un- and suffixes such as -ed, -es and -ing can
attach to verbs in nearly any combination. They also note that one may form an English noun phrase by
combining a determiner, an adjective and a noun. These are not convincing examples of non-adjacent
dependencies, however, because it is not clear what the dependencies are between prefixes and suffixes or
between determiners and nouns. Circumfixes, affixes consisting of two parts that surround a root when
attached, may be better examples. However, languages having circumfixes (double affixes) are less frequent
than languages without them (Greenberg, 1963). Even in languages that allow circumfixation, circumfixes
are often very rare indeed. For example, in Dutch, the only circumfix is the discontinuous nominal ge__te,
which applies to underived nouns in order to obtain collective nouns, as in berg – ‘mountain’; gebergte –
‘mountains’ (Klamer, 2005). There are some languages in which circumfixes, although still less frequent
than prefixes or suffixes, are more widely employed, as in Georgian where, for instance, or means ‘two’ and
meore means ‘second’ (Colarusso, 2005), or in Malay-Indonesian where, for instance, baik means ‘good’
and kebaikan means ‘goodness’ (Tadmor, 2005). In any case, it is not clear whether circumfixation is ever a
unitary morphological process rather than a two-step derivation (Klamer, 2005).
Whether or not these examples are convincing, we may grant for argument’s sake that natural
language acquisition calls for tracking dependencies that occur between classes of items. Endress & Bonatti
thus go a step further and argue that participants may not prefer rule words themselves, but so-called “class
words,” which involve such dependencies between classes of items. Class words have the form AiX′Cj, that
is, an A syllable from one class, followed by a syllable that had appeared in the speech stream but never in
the middle of a word (as in rule words), followed by a C syllable from a different class. These are called
“class words” because they would be preferred if participants learned rules of the form “if the first syllable
is from the A class, then the last syllable is from the C class” (where the A class comprises syllables A1, A2
and A3, and the C class comprises syllables C1, C2 and C3). Whereas rule words are restricted to
dependencies between syllable tokens (e.g., between pu and ki in rule words pubeki and putaki), class words
are such that the transitional probability between the first and the third syllable is not 1.0. There are 18 class
words: beduki, bekidu, bepudu, bepuki, betadu, betaki, pubedu, pubega, puduga, pugadu, putadu, putaga,
tabega, tabeki, tagaki, takiga, tapuga and tapuki.
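For concreteness, these test-item inventories can be generated from the lexicon. The middle-syllable exclusions below are our reconstruction from the published item lists (the papers describe X′ only as a syllable that never occurs word-medially): reproducing the 12 rule words requires excluding Ai and Ci, and reproducing the 18 class words additionally requires excluding the C syllable of the initial syllable’s own family.

```python
# Generating the test-item inventories from Peña et al.'s lexicon. The
# middle-syllable exclusions are inferred from the published item lists,
# not stated explicitly in the papers.
A = {1: "pu", 2: "be", 3: "ta"}
C = {1: "ki", 2: "ga", 3: "du"}
fillers = ["li", "ra", "fo"]
edge = set(A.values()) | set(C.values())  # syllables never occurring word-medially

words = sorted(A[i] + x + C[i] for i in A for x in fillers)

# Rule words: A_i X' C_i, with X' an edge syllable other than A_i or C_i.
rule_words = sorted(A[i] + x + C[i] for i in A for x in edge - {A[i], C[i]})

# Class words: A_i X' C_j (i != j), with X' also excluding C_i.
class_words = sorted(A[i] + x + C[j] for i in A for j in C if i != j
                     for x in edge - {A[i], C[i], C[j]})

# Part words straddle a word boundary: type 12 is C_i A_j X, type 21 is X C_i A_j.
part_12 = sorted(C[i] + A[j] + x for i in C for j in A if i != j for x in fillers)
part_21 = sorted(x + C[i] + A[j] for i in C for j in A if i != j for x in fillers)

assert len(words) == 9 and len(rule_words) == 12
assert len(class_words) == 18 and len(part_12) == len(part_21) == 18
assert "pubeki" in rule_words and "beduki" in class_words and "dubefo" in part_12
```

Under these exclusions the generated rule-word and class-word sets match the 12 and 18 items listed above exactly.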
Table 2 summarizes some of Endress & Bonatti’s experimental results.1
1 Experiments 6 and 7 by Endress and Bonatti (2007) were designed to test whether subjects used only the initial or only the
final syllable, rather than both, to induce the rule. Their experiments 10 and 11 were designed to discount the possibility of
phonological confounds in their experiments 1 and 2. These experiments are omitted for present purposes. Their experiment 9, on
the other hand, was designed to test whether a single mechanism can exploit TPs over both syllables and gaps. It is omitted from
Table 2, because we discuss it separately below.
Table 2: Summary of some of Endress & Bonatti’s experimental results.

Exp.  Stream      Familiarization duration  Test choice
1     Segmented   10 minutes                Class words over part words
2     Continuous  10 minutes                No preference between class words and part words
3     Segmented   2 minutes                 Class words over part words
4     Segmented   30 minutes                No preference between class words and part words
5     Segmented   60 minutes                Part words over class words
8     Segmented   2 minutes                 Words over rule words
12    Segmented   2 minutes                 Rule words over class words
13    Continuous  10 minutes                Rule words over class words
Endress & Bonatti highlight the fact that participants’ responses to class words exhibit a negative correlation
between structural generalization and familiarization duration of much the same sort that Peña et al. (2002)
had found for rule words. This negative correlation is critical to the inference that there is more than one
mechanism at work, a hypothesis that Endress & Bonatti dub the “MOM” (more-than-one-mechanism)
hypothesis, and which parallels the dual-system hypothesis put forward by Peña et al. Endress & Bonatti
conjecture that, when confronted with a segmented corpus, participants may be freed from the burden of
having to parse the input stream to first extract the constituents, as in the case of a continuous stream. In this
way, participants are able to focus from the start upon the underlying structure itself. With more
familiarization duration, generalization is overwhelmed as participants have more time to track the statistical
relations that obtain between familiarization items. Endress & Bonatti thus interpret the data as showing that
a dependency is initially induced between classes of items but degrades with further familiarization.
2.3 A challenge for connectionist modelling
As we have seen, Peña et al. and Endress & Bonatti find a negative correlation between performance
on abstract items and familiarization duration.
Tables 3 and 4 list the experimental results from Peña et al. and Endress & Bonatti (Tables 1 and 2)
in order from shorter to longer familiarization times, with the goal of highlighting the relationship between
the experimental results and the dual-mechanism (MOM) hypothesis.
Table 3: Summary of Peña et al.’s and some of Endress & Bonatti’s experimental results with a segmented stream, in increasing order of familiarization duration.

Exp.  Study              Familiarization duration  Test choice
5     Peña et al.        2 minutes                 Rule words over part words
8     Endress & Bonatti  2 minutes                 Words over rule words
3     Endress & Bonatti  2 minutes                 Class words over part words
12    Endress & Bonatti  2 minutes                 Rule words over class words
3     Peña et al.        10 minutes                Rule words over part words
1     Endress & Bonatti  10 minutes                Class words over part words
4     Endress & Bonatti  30 minutes                No preference between class words and part words
5     Endress & Bonatti  60 minutes                Part words over class words
Table 4: Summary of Peña et al.’s and some of Endress & Bonatti’s experimental results with a continuous stream, in increasing order of familiarization duration.

Exp.  Study              Familiarization duration  Test choice
1     Peña et al.        10 minutes                Words over part words
2     Peña et al.        10 minutes                No preference between rule words and part words
2     Endress & Bonatti  10 minutes                No preference between class words and part words
13    Endress & Bonatti  10 minutes                Rule words over class words
4     Peña et al.        30 minutes                Part words over rule words
The results obtained by Peña et al. and Endress & Bonatti on segmented 2-minute familiarization
streams (experiments 5, 8, 3 and 12 in Table 3) indicate preferences for words over rule words, for rule
words over class words, and for class words over part words of types 12 and 21. Their results on segmented
10-minute familiarization streams (experiments 3 and 1 in Table 3) indicate preferences for class words and
rule words over part words. Finally, Endress & Bonatti’s results on segmented 30- and 60-minute
familiarization streams (experiments 4 and 5 in Table 3) indicate no preference between class words and
part words, and preference for part words over class words, respectively. On the other hand, the results of
Peña et al. and Endress & Bonatti on continuous 10-minute familiarization streams (experiments 1, 2, 2
and 13 in Table 4) indicate preferences for words over part words, no preference between rule/class words and part
words, and preference for rule words over class words. Finally, Peña et al.’s results on continuous 30-minute
familiarization streams (experiment 4 in Table 4) indicate preference for part words over rule words. The
overall reversal of the preferences observed (Tables 3 and 4) as we move from familiarization exposures of
2 minutes to those of 60 minutes reflects the negative correlation that underpins the MOM hypothesis.
Endress & Bonatti (2007) consider whether an SRN trained on a prediction task may account for the
results of their experiments (Table 2). In particular, they are interested in whether an SRN can induce class
words after being trained on segmented and continuous corpora of the sort employed in their experiments 1-
8. Endress & Bonatti purport to show that an associative connectionist network cannot account for this
pattern. They report a set of studies with Simple Recurrent Networks (SRNs; Elman, 1990) that they claim
shows that a single mechanism such as the one that underlies an SRN cannot account for the preference for
class words exhibited by humans in their experiments, and for the negative correlation observed between
class word induction and familiarization duration. In what follows, we aim to show that a single statistical
learning mechanism can in fact account for all the patterns of preference of Tables 1 and 2, and for the
negative correlation for both segmented and continuous corpora that Tables 3 and 4 summarize. In the
remainder of this article, we report and discuss a set of SRN studies that does model the experimental
results. Our goal is to determine if the rank-order preference observed (words > rule words > class words >
part words) and the reversal that obtains as familiarization durations are increased can be modeled
statistically.
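To make the single-mechanism alternative concrete, the following is a minimal SRN sketch for next-syllable prediction. It is a toy illustration: the layer sizes, learning rate, and one-step-truncated weight updates are our choices for exposition, not the architecture or training regime of Endress & Bonatti’s simulations or of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in = n_out = 9   # one-hot syllable vectors (10 with a gap symbol)
n_hid = 20         # illustrative hidden-layer size

W_xh = rng.normal(0.0, 0.1, (n_hid, n_in))   # input -> hidden
W_hh = rng.normal(0.0, 0.1, (n_hid, n_hid))  # context -> hidden
W_hy = rng.normal(0.0, 0.1, (n_out, n_hid))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(x, h):
    """One SRN step: new hidden state from input plus context, then a
    probability distribution over the next syllable."""
    h_new = np.tanh(W_xh @ x + W_hh @ h)
    return h_new, softmax(W_hy @ h_new)

def train(seq, epochs=5, lr=0.1):
    """One-step-truncated backpropagation: at each position, update the
    weights from the prediction error, treating the context layer as a
    fixed copy of the previous hidden state (Elman, 1990)."""
    global W_xh, W_hh, W_hy
    for _ in range(epochs):
        h = np.zeros(n_hid)
        for x, target in zip(seq[:-1], seq[1:]):
            h_prev = h
            h, y = step(x, h_prev)
            dy = y - target                    # softmax/cross-entropy gradient
            dh = (W_hy.T @ dy) * (1.0 - h**2)  # backprop through tanh
            W_hy -= lr * np.outer(dy, h)
            W_xh -= lr * np.outer(dh, x)
            W_hh -= lr * np.outer(dh, h_prev)
```

Familiarization would then consist of encoding the syllable stream as one-hot vectors and calling train; test preferences can be read off by comparing the network’s prediction probabilities over the syllables of competing test items.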
3. Preliminary analysis of types of dependencies in Peña et al.’s corpus
As others (e.g., Seidenberg, MacDonald, & Saffran, 2002) have observed, there is no principled
reason to exclude the possibility that subliminal segmentation gaps can be exploited statistically. The mere
fact that these gaps are subliminal does not prevent them from carrying potentially relevant information. It
simply means that their presence is not available to conscious access. As a matter of fact (as Peña et al. well
observe), they must carry the critical piece of information for the mastery of structural induction since the
inclusion of the gaps is the only difference between experiment 2, where part words are preferred, and
experiment 3, where rule words are favored (Table 1).
Although Peña et al. consider a potential rejoinder according to which a single statistical mechanism
may be responsible for the induction of the structural regularity in their experiment 3 by tracking transitional
probabilities over pauses as well as syllables, they dismiss that alternative in a footnote (note 27). The
rejoinder is the idea that participants might have computed transitional probabilities to and from the pauses
as well as to and from the audible syllables. In that case, they might estimate that a rule word at test (with
the structure #AiX′Ci#) was more likely than a part word at test (with the structure #CiAjX# or #XCiAj#).
The relevant transitional probabilities are shown in Figure 1. Note that the sum of transitional probabilities
for the rule word (3.66) is greater than the sum of transitional probabilities for part words of either type
(0.33 in both cases). Indeed, the sum of adjacent transitional probabilities for the rule word (1.33) is greater
than the sum of adjacent transitional probabilities for part words of either type. Thus, one might suppose
that participants would prefer rule words to part words based strictly on adjacent transitional probabilities.
(a) Rule word #AiX′Ci#: adjacent TPs 0.33, 0, 0, 1; TPs between items separated by one: 0, 1, 0; TPs between items separated by two: 0.33, 1.
(b) Part word of type 12 #CiAjX#: adjacent TPs 0, 0, 0.33, 0; separated by one: 0, 0, 0; separated by two: 0, 0.
(c) Part word of type 21 #XCiAj#: adjacent TPs 0, 0.33, 0, 0; separated by one: 0, 0, 0; separated by two: 0, 0.

Figure 1: Adjacent and non-adjacent transitional probabilities for (a) a rule word having the form #AiX′Ci#, (b) a part word of type 12 having the form #CiAjX# and (c) a part word of type 21 having the form #XCiAj#. In the original diagram, solid arrows indicate transitions between adjacent items, dashed arrows indicate transitions between non-adjacent items separated by one, and dotted arrows indicate transitions between non-adjacent items separated by two; each arrow is labeled with the corresponding transitional probability (values listed above).

In response to this potential rejoinder, Peña et al. performed a control experiment in which
participants were tested with test items that consisted of part words of type 21 including the internal pauses,
that is, items having the form #XCi#Aj#. The relevant transitional probabilities are shown in Figure 2. In this
case, the transitional probabilities between adjacent items sum to 1.66, higher than the sum of transitional
probabilities between adjacent items in the rule word. Thus, if participants were using only transitional
probabilities between adjacent items in calculating their preferences between test items, we would expect
them to prefer the part words of type 21 including the internal pauses to the rule words. However, Peña et al.
report that, on the contrary, participants still prefer the rule words to part words even when the part words
are presented with internal pauses.
Part word of type 21 with an internal gap, #XCi#Aj#: adjacent TPs 0, 0.33, 1, 0.33, 0; TPs between items separated by one: 0, 1, 0.5, 0; separated by two: 0, 0.33, 0; separated by three: 0, 0; separated by four: 0.

Figure 2: Adjacent and non-adjacent transitional probabilities for a part word of type 21 with an internal gap, i.e., an item having the form #XCi#Aj# (values listed above; diagram not reproduced).
However, there is a misunderstanding in Peña et al.’s line of reasoning that invalidates the
conclusion they reach. Peña et al. consider the prediction that participants would choose rule words
(#AiX′Ci#) over part words (#CiAjX#), once we consider “probabilities over syllables, pauses, and absence
of pauses in the stream and the test items”, since “[t]ransitional probabilities between adjacent elements
would favor rule words over part words” (note 27, p. 607, emphasis added). The problem with this comment
is that no reason is offered as to why the only transitional probabilities to be computed must be those
between adjacent elements in the speech stream. They assume that a statistical learning mechanism can only
be sensitive to immediately adjacent patterns. However, there is no reason to believe that such mechanisms
cannot be sensitive to nonadjacent regularities. This is especially so given that Peña et al.’s experimental
setting was precisely designed by constructing a lexicon mainly characterized in terms of nonadjacent
transitional probabilities; probabilities which, as they themselves acknowledge, are the cornerstone of the
segmentation task in their experiment 1: “[We] explore whether participants can segment a stream of speech
by means of nonadjacent transition probabilities, and we also ask whether the same computations are used
to promote the discovery of its underlying grammatical structure” (pp. 604-605; emphasis added).
We may then ask: which test items would participants choose if they were computing transitional
probabilities over both adjacent and nonadjacent items? The sum of adjacent and non-adjacent transitional
probabilities for a rule word is 3.66 (Figure 1a), whereas the sum of adjacent and non-adjacent transitional
probabilities for a part word with an internal pause is 3.5 (Figure 2). Thus, if participants are in fact
computing transitional probabilities over both adjacent and nonadjacent items, then there is no reason not to
expect them to prefer rule words over part words with an internal pause, exactly as Peña et al. report that
they do. It is of course not obvious that participants must be sensitive to the overall sum of TPs. Different
types of information may be weighted differently, and the underlying empirical question in dispute is
precisely what sources of information people track as they acquire a new language. Thus, what the current
discussion shows is that the possibility that statistics is behind it cannot be excluded in principle on the basis
of an alleged statistical inferiority on the part of rule words as opposed to part words with an internal pause.
Our point is that statistical computations based on nonadjacent transitional probabilities of the sort
that are exploited in speech segmentation may be used in order to induce existing grammatical regularities
in the speech stream. There is in fact a fairly clear literature demonstrating that recurrent networks can
induce grammars from examples of context-free and context-sensitive languages; grammars that are
precisely of a form in which there are long-distance dependencies (see, for example, Boden & Wiles, 2000;
Chalup & Blair, 2003). In order to empirically demonstrate these claims, and with an eye to undermining
one argument for the dual-mechanism (MOM) hypothesis, we ran a series of connectionist simulations that
illustrate the exploitation of statistically-driven information. The simulations were conducted in two separate
studies. In the first study, we aim to determine whether SRNs can exhibit the patterns of preference
contained in Table 3 when trained on a corpus that contains subliminal gaps (Peña et al.’s experiments 5 and
3, and Endress & Bonatti’s experiments 8, 3, 12, 1, 4, and 5; see Table 3). In the second study, we aim to
determine whether SRNs can exhibit the patterns of preference contained in Table 4 when trained on a
corpus that does not contain subliminal gaps (Peña et al.’s experiments 1, 2 and 4, and Endress & Bonatti’s
experiments 2 and 13; see Table 4).
4. Simulation studies
In our SRN studies, the familiarization corpus consisted of the same strings of syllables used by
Peña et al. (2002) and Endress & Bonatti (2007). In particular, syllables were represented by pair-wise
orthonormal nine- or ten-dimensional binary vectors (depending on whether segmentation gaps were
included in the familiarization). The familiarization corpora were as close as possible to those used by
Endress & Bonatti while still respecting the constraints described by Peña et al. Recall that in Peña et al., so-
called words have the form AiXCi, where Ai, X and Ci are syllables. The subscripts on A and C indicate that
the nonadjacent syllables are matched, such that the transitional probability between any Ai and the
following Ci is 1.0. There are three X syllables, so the transitional probabilities between an Ai and an
intermediate X and between an X and the final Ci are each 0.33. There are three word classes (i ∈ {1, 2, 3}),
and a word class may not repeat in the speech stream, so the transitional probability between Ci and the next
word’s first syllable is 0.5. The specific word classes used by Peña et al. (2002b) were:
i = 1 : pu . . . ki
i = 2 : be . . . ga
i = 3 : ta . . . du
and the filler syllables were:
li
ra
fo
Thus, the A1XC1 family consists of the words puliki, puraki and pufoki; the A2XC2 family consists of the
words beliga, beraga and befoga; and the A3XC3 family consists of the words talidu, taradu and tafodu. To
create the 10-minute familiarization stream, 100 tokens of each of the 9 words were randomly concatenated,
subject to two constraints: (a) a word of a family could not be followed by another word of the same family,
and (b) two words could not be adjacent if they had the same intermediate syllable. There seems to be
something of an issue here, because the online methods supplement (Peña et al., 2002b) states clearly that
rule (b) is in effect (“two words are not adjacent if they have the same intermediate syllable”, p. 1), while
Figure 1 of the original paper (Peña et al., 2002a) indicates that the familiarization stream 1 contains the
string puraki beliga tafodu pufoki talidu beraga, which clearly contains a violation of rule (b). Endress &
Bonatti (2007) apparently use the rule (viz. “Consecutive items could not belong to the same family, nor
could they have the same middle syllable,” in the description of the familiarization procedure for experiment
1, p. 255). Thus, for the purpose of the simulations, we made an attempt to constrain all training input
according to rule (b).
However, there is a further issue, in that observing these constraints makes it difficult to generate a
sequence of exactly the required length. Depending upon how the initial words are randomly ordered, it can
be impossible to place the last few words. As an extreme example, imagine that the sequence started as
follows (the superscripts indicate repetitions):
[puliki beraga]^100 [pufoki beliga]^100 [puraki befoga]^100 talidu
This sequence is only 601 words long, but it cannot be completed because all remaining words are from the
same family. Of course, pseudorandomly selecting words tends to avoid such extreme examples.
Nevertheless, a pseudorandom selection procedure remains susceptible to the less extreme forms of the
problem: the placement of words earlier in the sequence makes it impossible to place all words in the
sequence while obeying constraints (a) and (b). When the word sequence is generated stochastically, it is
highly unlikely for the problem to be this extreme. However, it is also unlikely that exactly 900 words can
be used. In one trial run, for example, only 48 of 1,000 randomly-generated sequences were able to use all
900 elements.
Perhaps of even greater concern, the sequences that do use all 900 elements tend to end with
relatively long sequences of alternating word pairs. This is because the procedure for selecting words is
essentially selection from a 900-element set without replacement. Thus, the 900th word of a 900-word
sequence has 0 entropy—it is the only option left in the pool. The first 900-word sequence we generated by
pseudorandom selection ended with 16 straight repetitions of “talidu, pufoki”.
Hence, in generating our familiarization corpus, we did not use the constraint that there must be
exactly 100 words of each type. Rather, for each set of simulations, we pseudorandomly generated 900
words according to the other constraints.
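The final generation procedure can be sketched as follows. This is a minimal Python illustration under our own naming; the rejection-free sampling scheme is one straightforward way to respect constraints (a) and (b) without forcing exactly 100 tokens per word:

```python
import random

# Sketch of the familiarization-stream generator described in the text.
# Function names and the sampling scheme are ours, not the authors'.
FAMILIES = {1: ("pu", "ki"), 2: ("be", "ga"), 3: ("ta", "du")}
FILLERS = ["li", "ra", "fo"]

def make_words():
    """All nine AiXCi words, tagged with family index and middle syllable."""
    return [(a + x + c, i, x)
            for i, (a, c) in FAMILIES.items() for x in FILLERS]

def generate_stream(n_words=900, seed=0):
    """Draw n_words words subject to constraints (a) and (b): consecutive
    words may not share a family or a middle syllable.  Per-word counts
    are not forced to be exactly equal (see text)."""
    rng = random.Random(seed)
    words = make_words()
    stream = []
    prev_family = prev_mid = None
    for _ in range(n_words):
        candidates = [w for w in words
                      if w[1] != prev_family and w[2] != prev_mid]
        word, prev_family, prev_mid = rng.choice(candidates)
        stream.append(word)
    return stream
```

Because at least four of the nine words are always eligible (excluding one family removes three words and excluding one middle syllable removes two more), this sampler never gets stuck, which is precisely why dropping the exact-count constraint avoids the dead-end problem.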
We created 7 test corpora to investigate the predictions of Peña et al. and Endress & Bonatti: (1) words
(AiXCi); (2) part words of type 12 (CkAiX); (3) part words of type 21 (XCiAj); (4) rule words (AiX′Ci); (5)
class words (AiX′Cj); (6) part words of type 12 that include internal gaps of the sort considered in footnote
27 of Peña et al. (Ck#AiX); and (7) part words of type 21 that include internal gaps of the sort considered in
footnote 27 of Peña et al. (XCi#Aj).
4.1 Study 1
Methods
Like Endress & Bonatti, we used an SRN (Elman, 1990). The syllables were coded as ten-bit
pairwise orthonormal binary vectors (a “1-of-c” encoding), with the tenth bit representing a gap. Endress &
Bonatti do not report the activation functions or objective function used in their simulations, but there is a
well-known issue with a common choice (sigmoid output units trained with the sum-squared error function)
on problems where the target patterns are mostly zeros, as they are here. Such networks
easily find a local minimum of the sum-squared error function by adjusting weights so that all output unit
activations are close to zero. Moreover, because the delta term in backpropagating sum-squared error
involves a multiplication by the derivative of the activation function (the “sigma prime term”), training
slows down dramatically whenever the output approaches 0 or 1, regardless of the target value (because the
derivative of the sigmoid approaches 0 in both cases). The usual procedure for problems using a 1-of-c
encoding is to use the softmax activation function at the output units combined with the cross-entropy
objective function (e.g., Bishop, 1995). The softmax activation function causes the activations of the output
units to always sum to unity, which is correct in the case of a 1-of-c encoding; a side effect is that one may
treat output activations as the network’s subjective assessments of the probability that each output unit
codes for the right category on a given input pattern. Using the cross-entropy objective function causes the
sigma prime term to drop out of the calculation of delta values, ensuring that weight updates approach zero
only as the activation value approaches the target value. Thus, we used the softmax activation function at the
output layer combined with the cross-entropy objective function. Based on the results of preliminary studies
(reported in Laakso & Calvo, 2008), we set momentum to 0 and used 54 hidden units.
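The cross-entropy simplification described above can be made explicit. In our notation (a sketch, not taken from the papers under discussion), with softmax outputs over activations $a_k$ and 1-of-c targets $t_k$:

```latex
y_k = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad
E = -\sum_k t_k \ln y_k
% Using \partial y_j/\partial a_k = y_j(\delta_{jk} - y_k)
% and \sum_j t_j = 1 for a 1-of-c target:
\frac{\partial E}{\partial a_k}
  = -\sum_j \frac{t_j}{y_j}\, y_j(\delta_{jk} - y_k)
  = -t_k + y_k \sum_j t_j
  = y_k - t_k
```

No derivative of the activation function appears in the delta, so the weight update vanishes only when $y_k$ reaches $t_k$, which is exactly the property exploited here.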
Presenting a word to the network consisted of sequentially presenting each of its three syllables,
followed by a gap. Networks had the same number of output units as input units and were trained to predict
the next syllable (or gap) from each syllable (or gap) presented as input.
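A minimal sketch of such a network, written in NumPy under our own assumptions (tanh hidden units, small random initialization, learning rate 0.1; none of these details are specified by the simulations described above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 10, 54          # 1-of-c syllables plus gap bit; 54 hidden units

W_xh = rng.normal(0, 0.1, (n_hid, n_in))    # input -> hidden
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))   # context (previous hidden) -> hidden
W_hy = rng.normal(0, 0.1, (n_in, n_hid))    # hidden -> softmax output
b_h, b_y = np.zeros(n_hid), np.zeros(n_in)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def step(x, h_prev):
    """One forward step: predict the next syllable (or gap) from the current one."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)
    return softmax(W_hy @ h + b_y), h

def train_step(x, target, h_prev, lr=0.1):
    """Elman-style truncated backprop: the context acts as a frozen extra
    input, so the gradients are ordinary one-hidden-layer backprop."""
    global W_xh, W_hh, W_hy, b_h, b_y
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)
    y = softmax(W_hy @ h + b_y)
    delta_y = y - target                    # softmax + cross-entropy delta
    delta_h = (W_hy.T @ delta_y) * (1 - h ** 2)
    W_hy -= lr * np.outer(delta_y, h)
    b_y -= lr * delta_y
    W_xh -= lr * np.outer(delta_h, x)
    W_hh -= lr * np.outer(delta_h, h_prev)
    b_h -= lr * delta_h
    return y, h
```

The softmax output can be read as the network's subjective probability distribution over the next syllable, and the output-layer delta is simply activation minus target, as discussed above.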
Trained networks were tested on five item types: training words (N=9), part words of type 12
(N=18), part words of type 21 (N=18), rule words (N=12) and class words (N=18).
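Four of the five item types and their counts fall out of the combinatorics directly. The sketch below is our own reconstruction; we omit class words because the exact middle-syllable inventory behind the reported N=18 is not fully determined by the description above:

```python
# Hypothetical reconstruction of the test-item types; syllable inventories
# follow the text, but the enumeration is ours.
FAMILIES = {1: ("pu", "ki"), 2: ("be", "ga"), 3: ("ta", "du")}
FILLERS = ["li", "ra", "fo"]

def words():
    # AiXCi: matched frame, filler middle (N = 9)
    return [a + x + c for a, c in FAMILIES.values() for x in FILLERS]

def part_words_12():
    # CkAiX: final syllable of one word plus the first two of the next;
    # constraint (a) forces k != i, giving two Ck per word (N = 18)
    return [FAMILIES[k][1] + FAMILIES[i][0] + x
            for i in FAMILIES for k in FAMILIES if k != i for x in FILLERS]

def part_words_21():
    # XCiAj: last two syllables of one word plus the first of the next,
    # with j != i (N = 18)
    return [x + FAMILIES[i][1] + FAMILIES[j][0]
            for i in FAMILIES for j in FAMILIES if j != i for x in FILLERS]

def rule_words():
    # AiX'Ci: matched frame, middle syllable drawn from the A/C inventory
    # of the other families -- one natural reading that yields N = 12
    ac = [s for pair in FAMILIES.values() for s in pair]
    return [a + xp + c for a, c in FAMILIES.values()
            for xp in ac if xp not in (a, c)]
```

Note that the counts produced by this enumeration match the Ns reported above (9, 18, 18, and 12).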
In their simulations, Endress & Bonatti considered 3 possibilities for representing and training on the
segmentation gaps: (1) representing the gaps with a vector of length 0 and training the SRN to predict the
gap just as it would predict a syllable; (2) representing the gaps with a vector of length 0 and training the
SRN to predict the syllable immediately after the gap; and (3) representing the gaps by an extra unit and
training the SRN to predict the silence immediately after the gap. In selecting their representational scheme,
however, Endress & Bonatti rely upon some of Peña et al.'s (2002) data (in particular, the data Peña et al.
report in footnote 27 of their paper that they claim suggests participants’ performance does not depend on
the gap being present in the test items). Endress & Bonatti interpret this to mean that the best way to model
participants’ representations of the test items is by using a network that always predicts the next syllable,
ignoring the gaps. In fact, there is a fourth possibility: representing the gaps by an extra input unit and
training the SRN to predict the syllable after the gap (Laakso and Calvo, 2008). However, for the purposes
of direct comparison with Endress & Bonatti’s results, we tested the networks, as they did, by recording the
network output for the second syllable of the test items (i.e., its prediction of what the third item would be)
and then comparing the network output with the actual third syllable of the test item using the cosine
similarity measure. That is, the cosine similarity measure was recorded between the third syllable of the test
item and the network output activation in response to the second syllable of the test item.
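The scoring procedure amounts to a single cosine comparison per test item. A sketch follows; the `predict` callable stands in for any trained SRN, and the names are ours:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two activation vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def score_item(predict, s1, s2, s3):
    """Feed the first two syllables of a test item and compare the output
    recorded after the second syllable with the actual third syllable.
    `predict` maps a list of input vectors to the output after the last one."""
    out = predict([s1, s2])
    return cosine(out, s3)
```

With 1-of-c syllable vectors, a perfect prediction scores 1.0 and a fully orthogonal one scores 0.0.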
Fifty networks were trained in order to simulate individual differences. After every 10 epochs of
training, the performance of each network was measured and recorded. Training was stopped after 300
epochs.
Results
The results are shown in Figure 3, which depicts cosine similarity values for words, part words, rule
words and class words over the course of 300 epochs of training averaged across our 50 network “subjects”.
Figure 3. Mean cosine similarity values for 50 networks trained and tested with gaps. (CW=class words, RW=rule words, W=words, PW=part words).
After about 50 epochs of training, the network outputs are more similar to the targets for words than they are
to the targets for rule words, more similar to the targets for rule words than they are to the targets for class
words, and more similar to the targets for class words than they are to the targets for part words. After about
75 epochs of training, the network outputs are more similar to the targets for rule words than they are to the
targets for part words as well as more similar to the targets for class words than they are to the targets for
part words. At about 100 epochs of training, the network outputs are equally similar to the targets for class
words and to the targets for part words. At about 150 epochs of training, the network outputs are more
similar to the targets for part words than they are to the targets for class words.
The results shown in Figure 3 show only a single line for part words (PW). Figure 4 shows the same
results except that the data for part words of type 12 and part words of type 21 are shown separately.
Figure 4. Mean cosine similarity values for 50 networks trained and tested with gaps. (CW=class words, RW=rule words, W=words, PW12=part words of type 12, PW21=part words of type 21).
Virtually the same pattern of results described above holds for both part words of type 12 and part words of
type 21 in Figure 4. Specifically, after about 50 epochs of training, the network outputs are more similar to
the targets for words than they are to the targets for rule words, more similar to the targets for rule words
than they are to the targets for class words, and more similar to the targets for class words than they are to
the targets for part words of either type. After about 60 epochs of training, the network outputs are more
similar to the targets for rule words than to the targets for part words of either type, as well as more similar
to the targets for class words than they are to the targets for part words of either type. After about 75 epochs
of training, the network outputs are equally similar to the targets for class words and to the targets for part
words of type 12, although the network outputs are more similar to the targets for class words than they are
to the targets for part words of type 21. After about 100 epochs of training, the network outputs are more
similar to the targets for part words of type 12 than they are to the targets for class words, although they are
more similar to the targets for class words than they are to the targets for part words of type 21.
Discussion
The results of our Study 1 (Figure 3) accurately model the behavior of human participants in Peña et
al.’s experiments 5 and 3, and Endress & Bonatti’s experiments 8, 3, 12, 1, 4, and 5; see Table 3.
Specifically, the results of Peña et al. and Endress & Bonatti on segmented 2-minute familiarization
streams (experiments 5, 8, 3 and 12 in Table 3) indicate preferences for words over rule words, for rule
words over class words, and for class words over part words of types 12 and 21 (compare network
performance in Figure 3 after about 50 epochs of training). Their results on segmented 10-minute
familiarization streams (experiments 3 and 1 in Table 3) indicate preferences for class words and rule words
over part words (compare network performance in Figure 3 after about 75 epochs of training). Their results
on segmented 30-minute familiarization streams (experiment 4 in Table 3) indicate no preference between
class words and part words (compare network performance in Figure 3 after about 100 epochs of training).
Finally, their results on segmented 60-minute familiarization streams (experiment 5 in Table 3) indicate
preference for part words over class words (compare network performance in Figure 3 after about 150
epochs of training).
However, as Figure 4 shows, part words of type 12 and part words of type 21 fare differently with
respect to class words in our simulations. Although part words of type 12 show the same qualitative pattern
as the average of all part words (that is, they are dispreferred to class words after fewer epochs of training
but preferred to class words after more epochs of training), part words of type 21 do not (they are never
preferred to class words). Figures 1(b) and 1(c) in section 3 above, where adjacent and non-adjacent
transitional probabilities for a part word of type 12 and a part word of type 21 were shown, may hint at the
divergence between the patterns of part words of different types in Figure 4. In the case of part words of
type 12, the transitional probability between the second and the third syllable is 0.33 (figure 1b). By
contrast, the transitional probability between the second and the third syllable in the case of part words of
type 21 is 0 (figure 1c). Figure 4 records the cosine similarity measure between the third syllable of the test
item and the network output activation in response to the second syllable of the test item. It is possible that
the divergence observed between the two types of part words is due to the particular representational
scheme that Endress & Bonatti chose to model participants’ representations of the test items, and that we
used for the purposes of direct comparison with Endress & Bonatti’s results. Nevertheless, Endress &
Bonatti, having observed something similar in one of their network simulations, wrote that “the network
predicts a difference between how part-words of type 12 and type 21 will stand the comparison with class-
words; yet, in none of our experiments have we observed it” (pp. 282-3). Endress & Bonatti, however, have
not reported experiments explicitly comparing preferences for part words of type 12 versus part words of
type 21. We predict that such experiments will show a preference for part words of type 12 over part words
of type 21 after 30 minutes or more of familiarization.
More important, though, is what our simulations reveal about the preference for class words. Endress
& Bonatti interpret human participants’ preference for class words over part words after 2 minutes and 10
minutes of familiarization as evidence that the participants have learned a class rule. Endress & Bonatti
conjecture that, when confronted with a segmented familiarization stream, participants may be freed from the
burden of having to parse the input to first extract the constituents, as they must with a continuous stream. In
this way, they are able to focus from the start upon the underlying structure itself. With longer
familiarization, generalization is overwhelmed as participants have more time to track the statistical
relations that obtain between items in the input stream. The interpretation of Endress & Bonatti is thus that
an initially induced dependency that takes place between classes of items degrades with familiarization
duration. But Figure 3 and Figure 4 make it clear that another interpretation of these results is possible:
perhaps people are not learning a class rule at all but only ruling it out rather slowly. That is what our
networks seem to be doing – although the networks disprefer part words right from the beginning, it takes
them some time (approximately 100 epochs – see Figure 3) to learn that class words are no better.
Now why would participants be slower to develop a dispreference for class words than to show a
dispreference for part words? Figure 5 shows adjacent and non-adjacent transitional probabilities for a class
word having the form #AiX′Cj#. In class words, statistical relations between adjacent items, and between
non-adjacent items separated by two, outweigh the corresponding relations in part words of both types. Transitional
probabilities between adjacent items sum to 1.33 in class words (Figure 5) but only to 0.33 in part words of
either type (Figure 1b,c). Similarly, transitional probabilities between non-adjacent items separated by two
sum to 1.33 in class words (Figure 5) but to 0 in part words of either type (Figure 1b,c). Transitional
probabilities between non-adjacent items separated by one are 0 in every case. An associative explanation
may thus underlie the diverging time spans of dispreference between class words and part words shown in
Figure 3 and Figure 4. We suggest that much the same might underlie human participants’ performance in
Endress & Bonatti’s experiments. That is, people may be ruling out an unlearned class rule rather slowly. In
fact, taking into account that the non-adjacent transitional probability between the first and the third syllable
is lost in the case of class words, it is difficult to grasp how participants might possess such powerful
generalization machinery that is triggered when released from having to parse the corpus, as Endress &
Bonatti suggest (see general discussion, below).
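The arithmetic behind this comparison can be tallied directly. The per-transition values below are read off Figures 1 and 5 as described in the text; the layout and names are ours:

```python
# Transitional-probability bookkeeping for the class-word vs part-word
# comparison.  Values are as reported in the text; the dictionary layout
# is our own illustration.
class_word = {                            # #AiX'Cj#
    "adjacent": [0.33, 0.0, 0.0, 1.0],    # #->Ai, Ai->X', X'->Cj, Cj->#
    "skip_one": [0.0, 0.0, 0.0],          # #->X', Ai->Cj, X'->#
    "skip_two": [0.33, 1.0],              # #->Cj, Ai->#
}
part_word = {                             # either type, per the text
    "adjacent_sum": 0.33,
    "skip_two_sum": 0.0,
}

# Class words carry more adjacent and skip-two statistical support:
assert sum(class_word["adjacent"]) > part_word["adjacent_sum"]
assert sum(class_word["skip_two"]) > part_word["skip_two_sum"]
```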
Figure 5. Adjacent and non-adjacent transitional probabilities for a class word having the form #AiX′Cj#. Solid arrows indicate transitions between adjacent items (0.33, 0, 0, and 1). Dashed arrows indicate transitions between non-adjacent items separated by one (all 0). Dotted arrows indicate transitions between non-adjacent items separated by two (0.33 and 1). Each arrow is labeled with the corresponding transitional probability.

A final caveat is in order. One might question the sense in which epochs of training in artificial
neural networks and familiarization duration with human participants relate to each other (recall that
network performance was probed every 10 epochs, and connection weights frozen after 300 epochs).
Indeed, it is not clear that there must be a linear relation between epochs of training and familiarization
duration. Our key point is that the networks do reproduce the initial patterns of preference, and that such a
preference does reverse in subsequent epoch intervals as a result of an increased learning of the prediction
task. The dual-mechanism hypothesis capitalizes on an observed negative correlation between the extraction
of structural regularities and familiarization duration. The longer the duration of the continuous
familiarization stream, the stronger the preference for part words over rule words. By contrast, a very
short familiarization with a segmented stream allows for the induction of the rule (preference of class and
rule words). Thus, Endress & Bonatti predict that a preference for class words will decrease over longer
familiarization durations. Demonstrating that, with sufficient training, networks can also show a reversal
(coming to prefer part words over class words) is thus critical to the debate. Such a reversal is exactly what
we find in study 1. The dual-mechanism hypothesis ignores the possibility that subliminal segmentation
gaps can be exploited statistically, as the present results with SRNs illustrate.
4.2 Study 2
The previous simulation used training data matching the familiarization stimuli used in Peña et al.’s
experiments 5 and 3, and Endress & Bonatti’s experiments 8, 3, 12, 1, 4, and 5 (Table 3). However, an
important aspect of Peña et al.'s and Endress & Bonatti's results is the fact that their experiments show
that participants cannot learn the so-called abstract rules without the segmentation gaps. How then do our
SRNs fare when trained on a familiarization corpus that does not include the gaps (Table 4)? The results of
Peña et al. and Endress & Bonatti on continuous 10-minute familiarization streams (the first four
experiments in Table 4) indicate preferences for words over part words, no preference between rule/class
words and part words, and a preference for rule words over class words. Finally, Peña et al.'s results on
continuous 30-minute familiarization streams (experiment 4 in Table 4) indicate a preference for part words
over rule words. Study 2 attempts to model the pattern of performance in Table 4.
Methods
Study 2 was identical to study 1, except that the networks were trained and tested without gaps. Thus,
the syllables were coded as nine-bit pairwise orthonormal binary vectors. As previously, the networks used
the softmax activation function at the output layer, the cross-entropy objective function, momentum of 0 and
54 hidden units.
Presenting a word to the network consisted of sequentially presenting each of its three syllables.
Networks had the same number of output units as input units and were trained to predict the next syllable
from each syllable presented as input. The testing procedures were the same as in study 1, except that the
test items did not contain gaps. As in study 1, 50 networks were trained in order to simulate individual
differences. After every 10 epochs of training, the performance of each network was measured and
recorded. Training was stopped after 300 epochs.
Results
The results are shown in Figure 6, which depicts cosine similarity values for words, part words of
type 12, part words of type 21, rule words and class words over the course of 300 epochs of training
averaged across our 50 network “subjects”.
Figure 6. Mean cosine similarity values for 50 networks trained and tested without gaps. (CW=class words, RW=rule words, W=words, PW12=part words of type 12, PW21=part words of type 21).
After about 60 epochs of training, the network outputs are clearly more similar to the targets for words than
they are to those for part words of either type, they are approximately equally similar to the targets for rule
words and those for part words of either type, they are clearly more similar to the targets for part words of
both types than they are to those for class words, and they are clearly more similar to the targets for rule
words than they are to those for class words. After about 150 epochs, the network outputs are more similar
to the targets for part words of either type than they are to those for rule words.
Discussion
In response to Perruchet et al. (2004), Bonatti et al. (2006) argue that they had “neglected other
crucial findings that we reported—for example, that prolonging exposure to the continuous stream induces a
preference for part words over rule words (Peña et al., 2002, experiment 4). It was the contrast between this
result and the preference for structurally correct but previously unheard items observed after short
familiarization with a segmented stream that led us to favor a dual-mechanism account. [W]e wonder what
alternative explanation they propose” (p. 319). In effect, the experiments in Table 4 complement those of
Table 3 insofar as no generalization obtains when the familiarization stream contains no gaps. The modeling
results in Study 1 would thus be less significant were we to fail to model statistically the fact that no
induction of rules obtains without the segmentation gaps.
The results shown in Figure 6 match the experimental results in Table 4 with one exception:
although Endress & Bonatti found that human participants had no preference between class words and part
words after 10 minutes of familiarization with a continuous stream (their experiment 2 – see Table 4 above),
our networks show a preference for part words over class words throughout the course of training on inputs
without gaps (Figure 6). We must remember, however, the rationale of Endress & Bonatti’s experiment 2 in
the context of testing the more-than-one mechanism hypothesis. Their concern was with whether
participants could learn a class rule. They took the results of their experiment 1 (showing a preference for
class words over part words after 10 minutes of familiarization on a segmented stream – see Table 3 above)
to show that participants could learn a class rule after familiarization with a stream containing segmentation
cues. By contrast, they took the results of their experiment 2 (showing no preference between class words
and part words after 10 minutes of familiarization on a continuous stream – see Table 4 above) to show that
participants could not learn a class rule after familiarization with a stream not containing segmentation cues.
Based on our network modeling results, however, we have suggested that participants may not be learning a
class rule even when the familiarization stream contains segmentation cues. Rather, participants may simply
be slower to develop a dispreference for class words than they are to show a dispreference for part words.
(See Figure 3 and Figure 4 and the discussion in the results section of study 1 above.) The same point applies
to streams not containing segmentation cues as to streams that do contain segmentation cues: networks
gradually develop a dispreference for class words (compare the lines for class words in Figure 4 and Figure
6), and we suggest that human participants may be doing the same. Figure 7 shows adjacent and non-
adjacent transitional probabilities for a class word without surrounding gaps, that is, a word having the form
AiX′Cj. Note that all of the transitional probabilities are 0. It is not clear why we would expect any learner to
find such items familiar.
Figure 7. Adjacent and non-adjacent transitional probabilities for a class word without surrounding gaps, i.e., having the form AiX′Cj. Solid arrows indicate transitions between adjacent items. The dashed arrow indicates the transition between non-adjacent items separated by one. Each arrow is labeled with the corresponding transitional probability; all three are 0.
If this is correct, we take the following lessons from the simulation results: (1) segmented inputs
lead statistical learners to find both words and rule words familiar while finding class words and part words
unfamiliar; and (2) unsegmented inputs lead statistical learners to find words familiar while finding rule
words, part words and class words unfamiliar. Other differences (i.e., the differences between part words of
type 12, part words of type 21 and class words in Figure 4, as well as the differences between part words of
either type, rule words and class words in Figure 6) may simply be uninteresting for participants as they are
all “below threshold” for familiarity. Thus, the negative correlation between duration of familiarization and
generalization of a class rule that is exhibited in the human experiments may be a red herring – rather than
providing strong evidence for the existence of two different learning mechanisms, it may simply reflect
overlearning of certain statistical patterns (part words of type 12) that are in any case irrelevant to the
projection of a rule. Note how low the cosine similarity values are for both kinds of part words and for class
words after about 50 epochs of training in Figure 6.
We conclude that Studies 1 and 2 successfully model the experimental results of Peña et al. and
Endress & Bonatti.
5. General Discussion
According to the dual-mechanism (MOM) hypothesis, both an associative and a rule-following
mechanism are needed to analyze speech. Although an associative mechanism suffices to segment speech,
structural generalization requires a non-associative mechanism. Endress & Bonatti argue that, insofar as
associative learning relies upon the piecemeal modification of synaptic connections, and is thus dependent
upon long exposures or cycles of training in order to adequately tune to whatever structure the environment
carries, the negative correlation between the generalization of structure and familiarization duration tells
against the competing “single-mechanism” hypothesis. It is assumed, by contrast, that an algebraic
mechanism can quickly extract the structure of the corpus, an ability that would conflict with longer
familiarization, as the actual chunks in the data (words and part words, but not rule words or class words)
will keep re-shaping the error landscape over successive cycles of processing. On this view, the analysis of
speech requires two mechanisms in order to account for the negative correlation observed between duration
of familiarization and the generalization of structure.
We are not the first to have criticized the dual-mechanism framework. Previous criticisms have been
raised by Newport & Aslin (2004), Onnis, Monaghan, Richmond, & Chater (2005), Perruchet, Tyler,
Galland, & Peereman (2004), Perruchet, Peereman, & Tyler (2006), and Seidenberg, MacDonald & Saffran
(2002). These replies targeted other aspects of Peña et al. and Endress & Bonatti’s results, such as the
possibility of phonological/phonotactic confounds, or the possibility that participants may be tuning to parts
of the stimuli (Ai or Ci) instead of generalizing the Ai_Ci rule itself. Endress & Bonatti designed control
experiments to deal with these and other concerns, one by one. Thus, for example, their experiments 6 and 7
were designed to control whether participants considered either the initial or the final syllable to induce the
rule, instead of both, and their experiments 10 and 11 to discount the possibility of phonological confounds
in their experiments 1 and 2. Having controlled for, and discarded one by one, the alternatives
proposed against the dual-mechanism hypothesis, Endress & Bonatti turned to connectionist simulations
with SRNs. The idea was to demonstrate that there was no possibility for associative language learning,
assuming that SRNs were, computationally speaking, representative of the associative mechanisms that may
underlie language acquisition in humans.
In this article, we have taken up the challenge put forward by Endress & Bonatti and reported the
results of simulations based on a connectionist model that account for the negative correlation observed in
earlier work. Fine-grained details aside, the current debate is not new to the cognitive science community,
and speech processing in adults is not the only arena in which the single vs. dual-mechanism hypotheses
have been contrasted. Notable precursors are the “great past-tense debate” (Pinker & Ullman, 2002;
Ramscar, 2002) and the debate over “rule learning” in infants (Marcus, Vijayan, Bandi Rao, & Vishton,
1999; Seidenberg & Elman, 1999). Speech processing in adults, thus, constitutes another venue in which to
assess the hypothesis that a single type of associative learning mechanism can explain human learning.
A few words of caution are needed before the single-mechanism hypothesis can be unconditionally
embraced. For one thing, concluding that algebraic manipulations do not take place based on the simple fact
that a neural network accounts for the data is a non sequitur. The hidden premise that delivers the goods is
that neural networks do not implement abstract relationships between variables in an explicit manner.
Marcus (2001) has pursued this line of reasoning in the case of the past tense debate and the debate over rule
learning in infants, and concluded that only connectionist models that explicitly implement abstract
relationships between variables can account for both the past-tense and infant results. We need to know then
which architectural features are behind our network’s successful modelling of the experimental results of
Peña et al. and Endress & Bonatti. The overall dispute among sympathizers and foes of associative models
of cognition boils down to an open empirical question as to what the architecture of cognition looks like. If
the required features include, for instance, the instantiation of variables, then a connectionist model of
speech generalization will serve to back up the dual-mechanism hypothesis (by implementing abstract
variables). We may put this somewhat differently: to warrant their claim that the MOM hypothesis is the
only explanation for the data, Endress & Bonatti need to show that only connectionist networks that
implement explicit rules in the form of abstract relationships between variables can account for their results.
The question then is: which models implement rules? A clear case of implementation occurs when
nodes in the architecture are used as variables. This is not the case with our SRNs, where the “1-of-c”
encoding is used to represent the actual syllables that form the “training” words in familiarization (PU or
KI), and not to represent word positional slots (Ai or Ci). A different way to become an ‘implementational
connectionist’ (Marcus, 2001) would be to train the SRN on a categorization task, where the teaching
pattern for gradient descent learning is provided externally. As in the case of the encoding of variables, a
rule is implemented in the non-ecological calculation of delta values in the form of a trainer that marks the
output categories explicitly. In our simulations, however, the cross-entropy objective function deployed in a
self-supervised task of prediction allows for the calculation of delta values in an ecological, non-
implementational, manner. That is, no output categories are fed externally to the network as targets, because
the target is the next input to the SRN, an input that stands for actual syllables and not variable slots.
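The self-supervised prediction scheme just described can be sketched in a few lines. The following is a minimal illustration, with a hypothetical syllable inventory, network sizes, and learning rate (not the parameters of our reported simulations): the network receives one-hot ("1-of-c") syllable vectors, outputs a softmax distribution over syllables, and the cross-entropy target is simply the next input.

```python
import numpy as np

rng = np.random.default_rng(0)

SYLLABLES = ["pu", "ki", "ra", "be", "ga", "du"]  # illustrative inventory
C = len(SYLLABLES)
H = 12  # hidden units (assumed for illustration)

W_ih = rng.normal(0, 0.1, (H, C))   # input -> hidden
W_hh = rng.normal(0, 0.1, (H, H))   # context (previous hidden) -> hidden
W_ho = rng.normal(0, 0.1, (C, H))   # hidden -> softmax output

def one_hot(i):
    v = np.zeros(C)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_epoch(stream, lr=0.1):
    """One pass over the syllable stream; returns mean cross-entropy.
    Gradients are truncated at one step (context treated as constant)."""
    global W_ih, W_hh, W_ho
    h = np.zeros(H)
    total = 0.0
    for t in range(len(stream) - 1):
        x = one_hot(stream[t])
        target = one_hot(stream[t + 1])        # self-supervised target
        h_new = np.tanh(W_ih @ x + W_hh @ h)
        y = softmax(W_ho @ h_new)
        total += -np.log(y[stream[t + 1]] + 1e-12)
        # softmax + cross-entropy: output delta is simply y - target
        d_out = y - target
        d_hid = (W_ho.T @ d_out) * (1 - h_new ** 2)
        W_ho -= lr * np.outer(d_out, h_new)
        W_ih -= lr * np.outer(d_hid, x)
        W_hh -= lr * np.outer(d_hid, h)
        h = h_new
    return total / (len(stream) - 1)

# Toy familiarization stream: syllable 0 is always followed, two positions
# later, by syllable 2, with a variable middle syllable in between.
stream = []
for _ in range(200):
    stream += [0, int(rng.integers(3, 6)), 2]

first = run_epoch(stream)
for _ in range(20):
    last = run_epoch(stream)
print(f"cross-entropy: {first:.3f} -> {last:.3f}")
```

Note that no category labels are supplied externally: the output deltas (`y - target`) are computed entirely from the upcoming input, which is what makes the calculation ecological in the sense discussed above.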
Certainly, the above two examples do not exhaust the landscape, and there may be other ways in
which the modeling results of an implementational connectionist network would serve to back up dual,
rather than single-mechanism hypotheses. But it is noteworthy that Endress & Bonatti’s failed attempt to
model their own experimental results was carried out with SRNs. We may thus assume that their effort was
directed towards a form of connectionism that they considered non-implementational. Because the only
architectural difference between their model and ours is the employment of the softmax activation function
at the output units combined with the cross-entropy objective function in an otherwise standard Elman net,
we take it that they would agree that we have built a connectionist model that accounts for the negative
correlation observed without implementing abstract relationships between variables in doing so.
Bearing in mind the distinction between connectionist models that explicitly implement abstract
relationships between variables and models that do not, the following three observations can help tip the
balance decisively against the dual-mechanism hypothesis: (i) types and tokens of mechanism must be told
apart; (ii) the claim that statistical mechanisms are slow must be qualified; and (iii) a distinction must be
honored between positive evidence that backs up the claim that algebraic mechanisms are fast, on the one
hand, and negative “evidence” that would back up the same claim more indirectly, on the other. We consider
points (i)-(iii) in turn.
(i) Types and tokens
In their reply to Perruchet et al. (2004), Bonatti et al. (2006) seem to conflate types of mechanisms
with specific mechanisms as tokens when they claim: “We never denied that a theory based on statistical
learning might account for fast learning; however, the thesis that all learning can be explained by statistical
computations is empty unless our critics can propose a single mechanism that is capable of simultaneously
explaining (a) segmentation of words after exposure to a long familiarization (but not to a short
familiarization) with a continuous stream, (b) extraction of structural information after a short
familiarization with a discontinuous stream, and (c) failure to extract the same information after
familiarization with any continuous stream” (p. 319; emphasis added).
The challenge thus posed, someone may contend, is unjustified. The challenge resides not in
accounting for (a)-(c) statistically, but rather in accounting for (a)-(c) simultaneously by means of a single
mechanism. In connectionist jargon, the challenge is then to obtain a single weight matrix through error-
driven training such that all the knowledge that gets induced is fully distributed and superposed in the
matrix. But why couldn’t separate or partially overlapping statistical mechanisms be responsible for (a), (b)
and (c)? Why, in short, is a single weight matrix needed where all knowledge is fully superposed? This
rendering of the situation amounts to raising the challenge in terms of a single (token) mechanism. Bonatti
et al. (2006) do not complain after all about the speed of one kind of learning over another (“We never
denied that a theory based on statistical learning might account for fast learning”), but rather about the
number of token mechanisms that can play a role. The MOM hypothesis entails “more than one type of
mechanism,” but nothing in their argument tells against the possibility that more than one associative
mechanism is in place. In short, someone may even wish to stick to the label “MOM” and defend on
empirical grounds the existence of “more-than-one-(statistical)-mechanism”, such that none of the statistical
mechanisms involved implements abstract relationships between variables.
In fact, the modeling results reported here account for the data in a non-implementational manner
while meeting the added challenge of accounting for (a)-(c) above in one single weight matrix. We imagine,
moreover, that the results could also be modeled by separate, serial connectionist mechanisms. A
sympathizer of the dual-mechanism hypothesis might then argue that such a more-than-one-(statistical)-
mechanism solution would result in a “modularization” of the architecture with activations getting
symbolized as they are output and fed from one statistical mechanism into another. In this way, the
connectionist network would amount to an implementation of a classical architecture. It is not clear, though,
that more-than-one-(statistical)-mechanism must amount to implementational connectionism. To take an
illustration where neurobiological plausibility is honored, O’Reilly (1998) shows the advantages of
combining Hebbian and biological forms of error-driven learning. We may, for instance, have an SRN
architecture supplemented with bidirectional (symmetric) propagation of activation, inhibitory competition,
an error-driven form of learning (McClelland, 1994), and Hebbian learning. The Hebbian component, in
combination with error-driven learning, makes its contribution by representing in hidden unit activation
space the first-order correlational structure of the input data. We fail to see where the rules and variables
would be in this network.
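The combination can be made concrete with a toy sketch. The following mixes an Oja-style Hebbian term with an error-driven delta-rule term in a single weight update; it is a schematic illustration under assumed parameters (the mixing coefficient `k_hebb`, the toy patterns, a single sigmoid layer), not O'Reilly's Leabra algorithm or the supplemented SRN just described.

```python
import numpy as np

rng = np.random.default_rng(1)

def train(X, T, k_hebb=0.1, lr=0.2, epochs=200):
    """X: inputs (n, d); T: targets (n, m). Single layer of sigmoid units.
    Each weight change is a convex combination of a Hebbian term (tracking
    input correlations) and an error-driven delta-rule term."""
    n, d = X.shape
    m = T.shape[1]
    W = rng.normal(0, 0.1, (m, d))
    for _ in range(epochs):
        for x, t in zip(X, T):
            y = 1.0 / (1.0 + np.exp(-(W @ x)))        # unit activations
            hebb = np.outer(y, x) - (y ** 2)[:, None] * W  # Oja-style term
            err = np.outer(t - y, x)                   # delta-rule term
            W += lr * (k_hebb * hebb + (1 - k_hebb) * err)
    return W

# Toy task: map two overlapping input patterns to distinct outputs.
X = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
T = np.array([[1.0, 0.0], [0.0, 1.0]])
W = train(X, T)
pred = 1.0 / (1.0 + np.exp(-(X @ W.T)))
```

The Hebbian component is unsupervised and purely correlational, and the error-driven component contains no variables or category symbols: nothing in the resulting weight matrix implements an explicit rule.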
(ii) Slow versus rapid associative mechanisms
Someone might suggest that a single-mechanism statistical model is necessarily at odds with quick
rule induction. But why couldn’t a non-implementational SRN quickly extract generalizations? After all,
one-shot Hebbian learning (Rolls & Treves, 1998) allows for a few event co-occurrences to contribute to
fast recall. Furthermore, taking into account that combined Hebbian and error-driven learning drives SRNs
to better generalization performance than backpropagation alone (O'Reilly & Munakata, 2000), we see no
reason to assume that statistical mechanisms are necessarily slow.2
As discussed above, the defender of the dual mechanism hypothesis needs to show that only
connectionist networks that explicitly implement abstract relationships between variables can account for
the Endress & Bonatti results. Presumably, the argument would be that the quick extraction of
generalizations can be achieved only by a mechanism that implements abstract relationships between
variables. However, the claim that statistical mechanisms are slow must be qualified. Even though Bonatti et
al. (2006) acknowledge the possibility of fast associative learning, the negative correlation between
familiarization duration and rule induction is critical for Endress & Bonatti’s case. What is the empirical
evidence that statistical mechanisms are slower than algebraic ones? Perhaps it is the distinction between
generalizing class words and generalizing rule words that motivates the dual-mechanism hypothesis.
Although the induction of a rule word might, for the sake of argument, be accomplished courtesy of quick
statistical mechanisms such as combined Hebbian and error-driven learning, the induction of class words
may not be obtained in the same way. Quick extraction of class words appears to be a more difficult pill for the
defender of associative mechanisms to swallow, insofar as most of the correlational information is lost as
one moves from items to classes of items. But, if Endress & Bonatti agree that rule words may be learned
statistically by fast associative mechanisms (mechanisms that they acknowledge exist), then class words no
longer constitute a threat to the single mechanism hypothesis. Based on our network modeling results, we
have suggested that participants may not be learning a class rule after all, even when the familiarization
stream contains segmentation cues. Rather, participants may simply be slower to develop a dispreference for
class words than they are to show a dispreference for part words. A single fast associative mechanism may
thus account for the Endress & Bonatti results.
2 A similar point is made by Perruchet et al. (2004) when they note that: “The assertion that associative learning proceeds slowly
does not stand up to empirical observations. For example, some associative forms of learning have been shown to develop over
one trial or so” (p. 582).
(iii) Negative vs. positive evidence
Finally, it is important to distinguish between positive evidence that supports the claim that algebraic
mechanisms are fast, on the one hand, and negative “evidence” that would support the same claim
somewhat more indirectly, on the other. This relates to the question of how parsimonious the contending
hypotheses are. Much of the Peña et al. and Endress & Bonatti strategy has the flavor of a negative, “only-
game-in-town” kind of argument: Given such a short familiarization duration, we cannot think of any way
that people could perform the way they do other than by having two different learning mechanisms. If our
working hypothesis is on the right track, only-game-in-town arguments must be accompanied by positive
evidence, not against single mechanisms, but in favor of dual ones. As Seidenberg et al. (2002) put it in
response to Peña et al. (2002), “The evidence for rule learning is mostly negative: cases where learning
occurs but there is no obvious statistical explanation. A theory explaining how rule learners arrive at exactly
the correct generalizations given the complexities of their experience would represent substantial progress”
(p. 554). Allowing for the statistical generalization of rule words and the slower development of a
dispreference for class words, together with the reversal of behavior we have modeled, we would like to see
an explicit presentation of the nuts-and-bolts of a dual mechanism that would make it a more attractive
alternative than our very simple model.
Endress & Bonatti (2007) try to make progress by shifting from a negative, only-game-in-town,
characterization of their working hypothesis to a positive characterization in terms of the operations that
may underlie the negative correlation they observed. They hypothesize the existence of “a general
mechanism representing syllables in words as variables, capable of operating under a variety of input
conditions. Such a mechanism would be able to extract relations between such variables within their
respective units… The silences may act as ‘markers’ that define the units of an analysis. Such markers may
be a prerequisite for dependencies between classes in speech to be analyzed, and this would explain [why]
the mechanism for generalization seems to only work over an already segmented input” (p. 291).
It is not clear what to make of Endress & Bonatti’s suggestion. For one thing, in Endress & Bonatti’s
view, one of the advantages of their framing of the debate in terms of adult speech (as opposed to the debate
about rules in infants or the past tense debate), is that they can control for training-independent factors, and
thereby focus upon (online) training-dependent factors. Connectionist attempts to explain rule learning in
infants (Seidenberg & Elman, 1999) have exploited the sort of knowledge that infants have already induced
during their short lives, and which they subsequently bring into the lab. In the case of adult speakers,
Endress & Bonatti control for such training-independent factors. But if there are no training-independent
biases that adults bring into the lab with them, it is far from obvious why the general mechanism they posit to
represent syllables as variables is ready to operate under the sort of higher-level abstraction demanded by
class words. The problem, put bluntly, is that it is not clear what makes algebraic mechanisms fast in the
first place. Or, somewhat more dramatically, if statistical mechanisms carry the burden of proving that they
can be fast, why are algebraic mechanisms relieved from such a burden?
Moreover, Bonatti et al. (2006) comment that Perruchet et al. (2004) “conflated the language-specific
nature of structure extraction (for which we did not argue) [and] its algebraic character, which is compatible
with but not mandated by our results” (p. 316). A subtle distinction seems to be at work here. We may have
lumped “dual-mechanisms”, “algebraic operations” and “domain-specificity” together. If the point that
proponents of the dual-mechanism hypotheses are making is after all about domain (language)-specificity,
what would a non-algebraic alternative (as suggested by the quote above) consist of? Again, no hints are
offered. And finally, it is not clear why a reversal in preferences must necessarily obtain. That is, why
should we ever stop using a rule once in place? The value of rule-following is precisely that, once acquired,
a rule can be applied indiscriminately (or so its virtue is usually sold).
All this shows that the dual-mechanism hypothesis is a less parsimonious explanation than a single-
mechanism (associative learning) hypothesis. The fact that dual-mechanism hypotheses are unparsimonious
and become speculative when one tries to flesh them out affects not only the current debate but also the past
tense debate and the debate about infant rule learning. In an attempt to spell out the “positive side” of rule-
following, Marcus (2001) claims, “My hunch is that the brain contains a stock of basic instructions, each
defined to operate over all possible values of registers” (p. 58); at another point, he goes on, “Fundamental
to my proposal are the assumptions that the mind has a large stock of empty treelets on hand and that new
knowledge can be represented by filling in an empty treelet or by adjusting the values contained in an
existing treelet” (p. 108); and last, “I have not proven that the mind/brain implements symbol manipulation.
Instead, I have merely described how symbol-manipulation could support the relevant cognitive phenomena.
All that we can do is to provisionally accept symbol-manipulation as providing the best explanation” (p.
170). We fail to see in what sense the work of Peña et al. and Endress & Bonatti provides any better positive
evidence for a dual-mechanism hypothesis (in the case of the analysis of speech in adults) than these
speculative remarks by Marcus might allow us to imagine (in the case of rule learning in infants). But why
should we favor such an unparsimonious model when one that is clearly more parsimonious can account for
the same data?
According to Bonatti et al. (2006), single-mechanism hypotheses are “empty unless our critics can
propose a single mechanism that is capable of simultaneously explaining (a) segmentation of words after
exposure to a long familiarization (but not to a short familiarization) with a continuous stream, (b) extraction
of structural information after a short familiarization with a discontinuous stream, and (c) failure to extract
the same information after familiarization with any continuous stream.” (p. 319). Our model, we contend,
meets this challenge: it explains the phenomena that Peña et al. and Endress & Bonatti report, and it is more
parsimonious than their alternative. The single-mechanism hypothesis, moreover, generates further
predictions for testing. As noted in study 1, Endress & Bonatti do not explicitly compare preferences for
part words of type 12 versus part words of type 21. Our associative learning-based prediction is that such
experiments will show a preference for part words of type 12 over part words of type 21. Parsimony and
novel testable predictions are an added value that in our view may end up tipping the balance against the
dual-mechanism (MOM) hypothesis.
6. Conclusion
How many different mechanisms are needed to analyze speech? According to a dual-mechanism
hypothesis (Peña et al., 2002; Endress & Bonatti, 2007), language learning is achieved by means of two
mechanisms: a statistical mechanism that permits the learner to extract words from the speech stream,
together with a non-statistical mechanism that is necessary for extracting higher-level structure. We have
presented a neural network study that shows how statistics alone can support the discovery of structural
regularities, beyond the segmentation of speech. We have argued that our results undermine Peña et al. and
Endress & Bonatti’s argument for the dual-mechanism (MOM) hypothesis, and we therefore conclude that
they have not demonstrated that rule-governed language-learning mechanisms are necessary for the
extraction of structural information. In addition, we believe that these modeling results go beyond the
idiosyncrasies of SRNs. Our work shows that a primitive, artificial statistical learning mechanism can learn
linguistic preferences that appear to be governed by abstract, structural rules. There is no reason to think that
the powerful statistical learning machinery that is the human brain could not do the same.
References
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Boden, M., & Wiles, J. (2000). Context-free and context-sensitive dynamics in recurrent neural networks.
Connection Science: Journal of Neural Computing, Artificial Intelligence & Cognitive Research,
12(3-4), 197-210.
Chalup, S. K., & Blair, A. D. (2003). Incremental training of first order recurrent neural networks to predict
a context-sensitive language. Neural Networks, 16(7), 955-972.
Colarusso, J. (2005). Georgian and Caucasian languages. In P. Strazny (Ed.), Encyclopedia of Linguistics
(pp. 380-383). New York: Fitzroy Dearborn.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.
Endress, A. D., & Bonatti, L. L. (2007). Rapid learning of syllable classes from a perceptually continuous
speech stream. Cognition, 105(2), 247-299.
Greenberg, J. H. (1963). Some universals of grammar, with particular reference to the order of meaningful
elements. In J. H. Greenberg (Ed.), Universals of language (pp. 73-113). Cambridge, MA: MIT
Press.
Klamer, M. (2005). Explaining some structural and semantic asymmetries in morphological typology. In G.
Booij, E. Guevara, A. Ralli, S. Sgroi & S. Scalise (Eds.), Morphology and Linguistic Typology: On-
line Proceedings of the Fourth Mediterranean Morphology Meeting (MMM4) (pp. 127-142).
Bologna: Università degli Studi di Bologna.
Laakso, A., & Calvo, P. (2008). A connectionist simulation of structural rule learning in language
acquisition. In B. C. Love, K. McRae & V. M. Sloutsky (Eds.), Proceedings of the 30th Annual
Meeting of the Cognitive Science Society (pp. 709-714). Austin, TX: Cognitive Science Society.
Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning by seven-month-old
infants. Science, 283, 77-80.
McClelland, J. L. (1994). The interaction of nature and nurture in development: A Parallel distributed
processing perspective. In P. Bertelson, P. Eelen & G. d'Ydewalle (Eds.), International Perspectives
on Psychological Science (Vol. I: Leading Themes). Hillsdale, NJ: Erlbaum.
Newport, E. L., & Aslin, R. N. (2000). Innately constrained learning: Blending old and new approaches to
language acquisition. In S. C. Howell, S. A. Fish & T. Keith-Lucas (Eds.), BUCLD 24: Proceedings
of the 24th annual Boston University Conference on Language Development. Boston: Cascadilla Press.
Newport, E. L., & Aslin, R. N. (2004). Learning at a distance: I. Statistical learning of non-adjacent
dependencies. Cognitive Psychology, 48(2), 127-162.
O'Reilly, R. C. (1998). Six principles for biologically based computational models of cortical cognition.
Trends in Cognitive Sciences, 2(11), 455-462.
O'Reilly, R. C., & Munakata, Y. (2000). Computational explorations in cognitive neuroscience:
Understanding the mind by simulating the brain. Cambridge, MA: MIT Press.
Onnis, L., Monaghan, P., Richmond, K., & Chater, N. (2005). Phonology impacts segmentation in online
speech processing. Journal of Memory and Language, 53(2), 225-237.
Peña, M., Bonatti, L. L., Nespor, M., & Mehler, J. (2002). Signal-driven computations in speech processing.
Science, 298(5593), 604-607.
Perruchet, P., Peereman, R., & Tyler, M. D. (2006). Do we need algebraic-like computations? A reply to
Bonatti, Peña, Nespor, and Mehler (2006). Journal of Experimental Psychology: General, 135(2),
322-326.
Perruchet, P., Tyler, M. D., Galland, N., & Peereman, R. (2004). Learning nonadjacent dependencies: No
need for algebraic-like computations. Journal of Experimental Psychology: General, 133(4), 573-
583.
Pinker, S., & Ullman, M. T. (2002). The past and future of the past tense. Trends in Cognitive Sciences,
6(11), 456-463.
Ramscar, M. (2002). The role of meaning in inflection: Why the past tense does not require a rule. Cognitive
Psychology, 45(1), 45-94.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. Oxford: Oxford University Press.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science,
274(5294), 1926-1928.
Seidenberg, M. S., & Elman, J. L. (1999). Networks are not 'hidden rules'. Trends in Cognitive Sciences,
3(8), 288-289.
Seidenberg, M. S., MacDonald, M. C., & Saffran, J. R. (2002). Does grammar start where statistics stop?
Science, 298, 553-554.
Tadmor, U. (2005). Malay-Indonesian and Malayic languages. In P. Strazny (Ed.), Encyclopedia of
Linguistics (pp. 644-647). New York: Fitzroy Dearborn.