How many mechanisms? 1
Running Head: HOW MANY MECHANISMS?
How Many Mechanisms are Needed to Analyze Speech?:
A Connectionist Simulation of Structural Rule Learning in Language Acquisition
Paco Calvo
Universidad de Murcia
Aarre Laakso
University of Michigan – Dearborn
Paco Calvo, Departamento de Filosofía, Universidad de Murcia; Aarre Laakso, Department of
Behavioral Sciences, University of Michigan – Dearborn.
This research was supported by DGICYT Project HUM2006-11603-C02-01 (Spanish Ministry of
Science and Education and Feder Funds). This paper extends a preliminary study that appeared in the
Proceedings of the 30th Annual Meeting of the Cognitive Science Society.
Correspondence concerning this article should be addressed to Paco Calvo, Departamento de
Filosofía, Universidad de Murcia, Edificio Luis Vives, Campus de Espinardo, E-30100 Murcia – SPAIN. E-
mail: [email protected]
Abstract
According to the “dual-mechanism” hypothesis, the induction of structural information correlates negatively
with familiarization length, in such a way that two mechanisms are required in order to analyze speech: (1)
statistical computations based on nonadjacent transitional probabilities, and (2) an additional rule-following
mechanism. The argument is that, although statistical mechanisms may suffice for speech segmentation, an
additional rule-following mechanism is required in order to quickly extract structural information. We
present a set of neural network studies that shows how a single statistical mechanism can mimic the
apparent discovery of structural regularities, beyond the segmentation of speech. We argue that our results
undermine one argument for the dual-mechanism hypothesis.
Keywords: artificial language acquisition; dual-mechanism hypothesis; statistical learning
1. Introduction
Language acquisition is a central component of human development. A key question is whether
language can be acquired solely by domain-general (e.g., statistical) learning mechanisms or whether
domain-specific (e.g., algebraic) learning mechanisms are required.
Peña, Bonatti, Nespor, & Mehler (2002) consider whether statistical computations based on
transitional probabilities of the sort that are exploited in speech segmentation (Saffran, Aslin, & Newport,
1996) can also be used over nonadjacent dependencies in order to induce rule-like regularities in the speech
stream. Nonadjacent dependencies are an important test of any purported account of language acquisition
based on statistical learning because they are pervasive in language, but it is not obvious how a statistical
learning mechanism could acquire them. Examples of nonadjacent dependencies occur at many levels in
language. At the subsyllabic level, for example, the vowels of English words differ in pronunciation
depending on whether the word ends with a silent ‘e’ (as in “lop” versus “lope”). At the morphosyntactic
level, parasynthesis, infixation and agreement all involve non-adjacent dependencies. One example is
agreement between auxiliaries and inflectional morphemes, as in the English “is running.” Another is
agreement between articles and inflectional morphemes, as in the Spanish “el pollo.” A third is agreement in
number, as in the English “he runs.” The nonadjacent dependencies required for maintaining agreement can
span lengthy center embeddings, as in “The cats who the dog bites run.” Thus, mastering a language
requires behaving (e.g., producing utterances) in accordance with nonadjacent dependencies, including long-
distance dependencies.
Peña et al. report experimental evidence from French-speaking adults that they argue shows that
humans use both statistical learning (to segment speech) and algebraic computations (to induce structural
regularities such as grammatical rules). That is, Peña et al. argue that statistics are insufficient to support the
discovery of underlying grammatical regularities, and that their results imply knowledge of rules. Endress &
Bonatti (2007) go a step further and argue that participants may be tuning to rules at an even higher level of
abstraction than those considered by Peña et al. Endress & Bonatti replicate and extend the Peña et al.
results with Italian-speaking adults and attempt to model them using connectionist networks. They argue
that their failure to model the experimental results with connectionist networks demonstrates that associative
learning mechanisms are insufficient for language learning.
In this paper, we report a set of connectionist simulations – based on a single statistical mechanism –
that does model the experimental results of Peña et al. and Endress & Bonatti. We conclude that they have
not demonstrated that rule-governed structure learning mechanisms are necessary for language acquisition.
2. An overview of the evidence for the “dual-mechanism hypothesis”
2.1 Peña et al.’s (2002) experiments
Peña et al. (2002) tested adults’ abilities to segment speech based on non-adjacent dependencies and
to generalize beyond the familiarization corpus. The experiments were based on roughly the same method as
that used by Newport & Aslin (2000). Adult participants were asked to listen to a sequence of trisyllabic
artificial “words” for a certain period of time. The artificial “words” had the form AiXCi, where Ai, X and Ci
are syllables. The subscripts on A and C indicate that the nonadjacent syllables are matched, such that the
transitional probability between an Ai and the following Ci is 1.0. There are three X syllables, so the
transitional probabilities between an Ai and an intermediate X and between an X and the final Ci are each
0.33. There are three word classes (i ∈ {1, 2, 3}), and no two adjacent words in the speech stream may be from
the same class, so the transitional probability between the final syllable of one word Ci and the first syllable
of the next word Aj is 0.5. The three word classes are pu…ki, be…ga and ta…du. The three filler syllables
are li, ra and fo. Thus, the A1XC1 family consists of the words puliki, puraki and pufoki, the A2XC2 family
consists of the words beliga, beraga and befoga, and the A3XC3 family consists of the words talidu, taradu
and tafodu. Ten-minute and two-minute familiarization streams were produced by concatenating speech-
synthesized tokens of these nine words, randomly selected subject to the constraints that (a) a word of a
given family could not be immediately followed by another word of the same family, and (b) a word with a
given intermediate syllable could not be immediately followed by another word with the same intermediate
syllable. After familiarization, participants were asked to choose – between pairs of stimuli that could
belong to three kinds of test items: “words”, “part words” and “rule words” – which seemed more like a
word from the familiarization stream.
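As a check on these design probabilities, a short script (ours, purely illustrative) can estimate syllable-to-syllable transitional probabilities from a randomly generated stream. The adjacency constraints are ignored here, since they do not affect the within-word probabilities being checked:

```python
import random
from collections import Counter

# Peña et al.'s lexicon: families pu…ki, be…ga, ta…du with fillers li, ra, fo.
families = {"pu": "ki", "be": "ga", "ta": "du"}
fillers = ["li", "ra", "fo"]
lexicon = [a + x + c for a, c in families.items() for x in fillers]

# A simple random stream of word tokens, split into two-character syllables.
rng = random.Random(0)
syllables = []
for _ in range(3000):
    word = rng.choice(lexicon)
    syllables += [word[0:2], word[2:4], word[4:6]]

def transition_probs(seq, distance):
    """P(second | first) for syllable pairs `distance` positions apart."""
    pair_counts = Counter(zip(seq, seq[distance:]))
    first_counts = Counter(seq[:-distance])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

tp1 = transition_probs(syllables, 1)  # adjacent
tp2 = transition_probs(syllables, 2)  # nonadjacent, one intervening syllable

assert tp2[("pu", "ki")] == 1.0             # A_i predicts C_i perfectly two ahead
assert abs(tp1[("pu", "li")] - 1/3) < 0.05  # A_i -> X is about 0.33
```

The nonadjacent A→C probability comes out at exactly 1.0 because each A syllable occurs only word-initially, while the adjacent A→X probabilities hover around 0.33, as in the design.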
In Peña et al.’s experiment 1, participants were familiarized to a continuous speech stream, as
described above, for 10 minutes. In the test phase, they were asked to choose – between a “word” and a
“part word” – which item seemed more like a word of the imaginary language with which they had been
familiarized. The “words” were simply items of the form AiXCi, that is, words that had appeared in the
familiarization stream. The “part words” were also items that had appeared in the familiarization stream, but
ones that straddled a word boundary. These could be of two types: “type 12” part words consisted of items
having the form CiAjX, whereas “type 21” part words consisted of items having the form XCiAj. There are
18 type 12 part words: dubefo, dubeli, dubera, dupufo, dupuli, dupura, gapufo, gapuli, gapura, gatafo,
gatali, gatara, kibefo, kibeli, kibera, kitafo, kitali and kitara. There are also 18 type 21 part words: foduga,
foduki, fogapu, fogata, fokidu, fokiga, lidube, lidupu, ligapu, ligata, likibe, likita, radube, radupu, ragapu,
ragata, rakibe and rakita. The result was that participants preferred words over part words. Experiment 1 is
consistent with the hypothesis that statistics alone suffice for speech segmentation, since what counts as a
“word” is a function of the transitional probabilities between specific non-adjacent items (in this case,
between Ai and Ci).
In their experiment 2, Peña et al. investigated whether participants were simply segmenting the
stream by exploiting differential transitional probabilities between words and part words, or whether they
were attuning to some more abstract underlying grammatical regularity. In order to answer this question,
participants were asked, after having been familiarized for 10 minutes to the same speech stream as in
experiment 1, to choose between a part word and what Peña et al. dubbed a “rule word.” A rule word is a
sequence of three syllables of the form AiX′Ci, where X′ stands for a “familiar” syllable, that is, one that
occurs in familiarization, although never between Ai and Ci. In other words, X′ is another A or C syllable, so
that rule words have forms like A1C3C1 and A1A2C1. There are 12 rule words: beduga, bekiga, bepuga,
betaga, pubeki, puduki, pugaki, putaki, tabedu, tagadu, takidu and tapudu. They are called “rule words”
because participants who learned rules of the form “if the first syllable is Ai, then the last syllable is Ci”
should find them familiar. Although a rule-word is novel, it is congruent with the structure of actual words
by means of the preservation of the relationship between Ai and Ci. As it turned out, participants preferred
part words over rule words, suggesting that they had not extracted the rule from the training stream even
though (as shown in experiment 1) they had learned to segment the words of the language. Peña et al.
therefore claim that their experiment 2 shows that statistics are not sufficient to extract structural
information from a continuous familiarization corpus. In light of experiments 1 and 2 together, Peña et al.
conclude that a “computational mechanism sufficiently powerful to support segmentation on the basis of
nonadjacent transitional probabilities [experiment 1] is insufficient to support the discovery of the
underlying grammatical-like regularity embedded in a continuous speech stream [experiment 2]” (p. 605).
In Peña et al.’s experiment 3, a “subliminal” 25ms pause was inserted between each pair of words in
the familiarization stream. Although participants reported no awareness of such gaps, their presence did
affect the results. Specifically, Peña et al. found that, when participants were trained on a speech stream with
gaps, the participants subsequently preferred rule words to part words at test. In their view, these results
imply knowledge of rules, insofar as the very notion of an abstract rule-word underlies the successful
discrimination of rule words and part words. Thus, they write, “This seems to be due to the fact that the
selected items are compatible with a generalization of the kind ‘If there is a [pu] now, then there will be a
[ki] after an intervening X’” (p. 606). In other words, Peña et al. contend that two different computational
mechanisms must be responsible for the results of experiments 1-3: (i) a statistical mechanism for
performing speech segmentation (experiment 1), and (ii) a rule-governed mechanism responsible for the
induction of grammatical structural regularities in the corpus (experiment 3).
Peña et al.’s experiments 4 and 5 tested for preference between part words and rule words after
familiarization on a continuous and a segmented stream for 30 and 2 minutes, respectively (see Table 1). In
the first case, participants preferred part words over rule words, demonstrating that even lengthy
familiarization with a corpus that does not contain cues to segmentation does not lead to abstraction of the
rule. In the second case, the participants preferred rule words over part words, demonstrating that even very
short familiarization with a corpus that does contain cues to segmentation leads to abstraction of the rule.
Table 1: Summary of Peña et al.’s experimental results.

Exp.  Stream      Familiarization duration  Test choice
1     Continuous  10 minutes                Words over part words
2     Continuous  10 minutes                No preference between rule words and part words
3     Segmented   10 minutes                Rule words over part words
4     Continuous  30 minutes                Part words over rule words
5     Segmented   2 minutes                 Rule words over part words
Peña et al. interpret the results of Table 1 as evidence for a dual-mechanism hypothesis: a statistical
mechanism for segmenting the familiarization corpus (experiment 1), and a rule-governed mechanism that
accounts for the induction of the rule that prefers rule words over part words (experiments 3 and 5).
2.2 Endress and Bonatti’s (2007) experiments
Rule words, as defined by Peña et al., relate to the sort of dependency that obtains between specific
syllables (Ai and Ci). Parasynthesis, infixation and agreement are natural language counterparts of the sort of
non-adjacent dependencies that define artificial rule words, as in the agreement that obtains in Spanish
across the lexical root poll in el pollo (“the chicken”, masculine singular) versus los pollos (masculine
plural). However, morphosyntactic dependencies may not exhibit a one-to-one correspondence between
specific “Ai” and “Ci” items (el _ o; los _ os). Instead, language acquisition may call for tracking
dependencies that occur between the classes that items such as Ai and Ci belong to. Endress & Bonatti note,
for example, that in English, prefixes such as de-, re- and un- and suffixes such as -ed, -es and -ing can
attach to verbs in nearly any combination. They also note that one may form an English noun phrase by
combining a determiner, an adjective and a noun. These are not convincing examples of non-adjacent
dependencies, however, because it is not clear what the dependencies are between prefixes and suffixes or
between determiners and nouns. Circumfixes, affixes consisting of two parts that surround a root when
attached, may be better examples. However, languages having circumfixes (double affixes) are less frequent
than languages without them (Greenberg, 1963). Even in languages that allow circumfixation, circumfixes
are often very rare indeed. For example, in Dutch, the only circumfix is the discontinuous nominal ge__te,
which applies to underived nouns in order to obtain collective nouns, as in berg – ‘mountain’; gebergte –
‘mountains’ (Klamer, 2005). There are some languages in which circumfixes, although still less frequent
than prefixes or suffixes, are more widely employed, as in Georgian where, for instance, or means ‘two’ and
meore means ‘second’ (Colarusso, 2005), or in Malay-Indonesian where, for instance, baik means ‘good’
and kebaikan means ‘goodness’ (Tadmor, 2005). In any case, it is not clear whether circumfixation is ever a
unitary morphological process rather than a two-step derivation (Klamer, 2005).
Whether or not these examples are convincing, we may grant for argument’s sake that natural
language acquisition calls for tracking dependencies that occur between classes of items. Endress & Bonatti
thus go a step further and argue that participants may not prefer rule words themselves, but so-called “class
words,” which involve such dependencies between classes of items. Class words have the form AiX′Cj, that
is, an A syllable from one class, followed by a syllable that had appeared in the speech stream but never in
the middle of a word (as in rule words), followed by a C syllable from a different class. These are called
“class words” because they would be preferred if participants learned rules of the form “if the first syllable
is from the A class, then the last syllable is from the C class” (where the A class comprises syllables A1, A2
and A3, and the C class comprises syllables C1, C2 and C3). Whereas rule words are restricted to
dependencies between syllable tokens (e.g., between pu and ki in rule words pubeki and putaki), class words
are such that the transitional probability between the first and the third syllable is not 1.0. There are 18 class
words: beduki, bekidu, bepudu, bepuki, betadu, betaki, pubedu, pubega, puduga, pugadu, putadu, putaga,
tabega, tabeki, tagaki, takiga, tapuga and tapuki.
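For concreteness, these test-item inventories can be generated from the lexicon. The middle-syllable exclusions below are our reconstruction from the published item lists (the papers describe X′ only as a syllable that never occurs word-medially): reproducing the 12 rule words requires excluding Ai and Ci, and reproducing the 18 class words additionally requires excluding the C syllable of the initial syllable’s own family.

```python
# Generating the test-item inventories from Peña et al.'s lexicon. The
# middle-syllable exclusions are inferred from the published item lists,
# not stated explicitly in the papers.
A = {1: "pu", 2: "be", 3: "ta"}
C = {1: "ki", 2: "ga", 3: "du"}
fillers = ["li", "ra", "fo"]
edge = set(A.values()) | set(C.values())  # syllables never occurring word-medially

words = sorted(A[i] + x + C[i] for i in A for x in fillers)

# Rule words: A_i X' C_i, with X' an edge syllable other than A_i or C_i.
rule_words = sorted(A[i] + x + C[i] for i in A for x in edge - {A[i], C[i]})

# Class words: A_i X' C_j (i != j), with X' also excluding C_i.
class_words = sorted(A[i] + x + C[j] for i in A for j in C if i != j
                     for x in edge - {A[i], C[i], C[j]})

# Part words straddle a word boundary: type 12 is C_i A_j X, type 21 is X C_i A_j.
part_12 = sorted(C[i] + A[j] + x for i in C for j in A if i != j for x in fillers)
part_21 = sorted(x + C[i] + A[j] for i in C for j in A if i != j for x in fillers)

assert len(words) == 9 and len(rule_words) == 12
assert len(class_words) == 18 and len(part_12) == len(part_21) == 18
assert "pubeki" in rule_words and "beduki" in class_words and "dubefo" in part_12
```

Under these exclusions the generated rule-word and class-word sets match the 12 and 18 items listed above exactly.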
Table 2 summarizes some of Endress & Bonatti’s experimental results.1
1 Experiments 6 and 7 by Endress and Bonatti (2007) were designed to test whether subjects used only the initial or only the
final syllable, rather than both, to induce the rule. Their experiments 10 and 11 were designed to discount the possibility of
phonological confounds in their experiments 1 and 2. These experiments are omitted for present purposes. Their experiment 9, on
the other hand, was designed to test whether a single mechanism can exploit TPs over both syllables and gaps. It is omitted from
Table 2, because we discuss it separately below.
Table 2: Summary of some of Endress & Bonatti’s experimental results.

Exp.  Stream      Familiarization duration  Test choice
1     Segmented   10 minutes                Class words over part words
2     Continuous  10 minutes                No preference between class words and part words
3     Segmented   2 minutes                 Class words over part words
4     Segmented   30 minutes                No preference between class words and part words
5     Segmented   60 minutes                Part words over class words
8     Segmented   2 minutes                 Words over rule words
12    Segmented   2 minutes                 Rule words over class words
13    Continuous  10 minutes                Rule words over class words
Endress & Bonatti highlight the fact that participants’ responses to class words exhibit a negative correlation
between structural generalization and familiarization duration of much the same sort that Peña et al. (2002)
had found for rule words. This negative correlation is critical to the inference that there is more than one
mechanism at work, a hypothesis that Endress & Bonatti dub the “MOM” (more-than-one-mechanism)
hypothesis, and which parallels the dual-system hypothesis put forward by Peña et al. Endress & Bonatti
conjecture that, when confronted with a segmented corpus, participants may be freed from the burden of
having to parse the input stream to first extract the constituents, as in the case of a continuous stream. In this
way, participants are able to focus from the start upon the underlying structure itself. With more
familiarization duration, generalization is overwhelmed as participants have more time to track the statistical
relations that obtain between familiarization items. Endress & Bonatti thus interpret the data as showing that
a dependency is initially induced between classes of items but degrades with further familiarization.
2.3 A challenge for connectionist modelling
As we have seen, Peña et al. and Endress & Bonatti find a negative correlation between performance
on abstract items and familiarization duration.
Tables 3 and 4 list the experimental results from Peña et al. and Endress & Bonatti (Tables 1 and 2)
in order from shorter to longer familiarization times, with the goal of highlighting the relationship between
the experimental results and the dual-mechanism (MOM) hypothesis.
Table 3: Summary of Peña et al.’s and some of Endress & Bonatti’s experimental results with a segmented stream, in increasing order of familiarization duration.

Exp.  Study              Familiarization duration  Test choice
5     Peña et al.        2 minutes                 Rule words over part words
8     Endress & Bonatti  2 minutes                 Words over rule words
3     Endress & Bonatti  2 minutes                 Class words over part words
12    Endress & Bonatti  2 minutes                 Rule words over class words
3     Peña et al.        10 minutes                Rule words over part words
1     Endress & Bonatti  10 minutes                Class words over part words
4     Endress & Bonatti  30 minutes                No preference between class words and part words
5     Endress & Bonatti  60 minutes                Part words over class words
Table 4: Summary of Peña et al.’s and some of Endress & Bonatti’s experimental results with a continuous stream, in increasing order of familiarization duration.

Exp.  Study              Familiarization duration  Test choice
1     Peña et al.        10 minutes                Words over part words
2     Peña et al.        10 minutes                No preference between rule words and part words
2     Endress & Bonatti  10 minutes                No preference between class words and part words
13    Endress & Bonatti  10 minutes                Rule words over class words
4     Peña et al.        30 minutes                Part words over rule words
The results obtained by Peña et al. and Endress & Bonatti on segmented 2-minute familiarization
streams (experiments 5, 8, 3 and 12 in Table 3) indicate preferences for words over rule words, for rule
words over class words, and for class words over part words of types 12 and 21. Their results on segmented
10-minute familiarization streams (experiments 3 and 1 in Table 3) indicate preferences for class words and
rule words over part words. Finally, Endress & Bonatti’s results on segmented 30- and 60-minute
familiarization streams (experiments 4 and 5 in Table 3) indicate no preference between class words and
part words, and preference for part words over class words, respectively. On the other hand, the results of
Peña et al. and Endress & Bonatti on continuous 10-minute familiarization streams (experiments 1, 2, 2
and 13 in Table 4) indicate preferences for words over part words, no preference between rule/class words and part
words, and preference for rule words over class words. Finally, Peña et al.’s results on continuous 30-minute
familiarization streams (experiment 4 in Table 4) indicate preference for part words over rule words. The
overall reversal of the preferences observed (Tables 3 and 4) as we move from familiarization exposures of
2 minutes to those of 60 minutes reflects the negative correlation that underpins the MOM hypothesis.
Endress & Bonatti (2007) consider whether an SRN trained on a prediction task may account for the
results of their experiments (Table 2). In particular, they are interested in whether an SRN can induce class
words after being trained on segmented and continuous corpora of the sort employed in their experiments 1-
8. Endress & Bonatti purport to show that an associative connectionist network cannot account for this
pattern. They report a set of studies with Simple Recurrent Networks (SRNs; Elman, 1990) that they claim
shows that a single mechanism such as the one that underlies an SRN cannot account for the preference for
class words exhibited by humans in their experiments, and for the negative correlation observed between
class word induction and familiarization duration. In what follows, we aim to show that a single statistical
learning mechanism can in fact account for all the patterns of preference of Tables 1 and 2, and for the
negative correlation for both segmented and continuous corpora that Tables 3 and 4 summarize. In the
remainder of this article, we report and discuss a set of SRN studies that does model the experimental
results. Our goal is to determine if the rank-order preference observed (words > rule words > class words >
part words) and the reversal that obtains as familiarization durations are increased can be modeled
statistically.
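To make the single-mechanism alternative concrete, the following is a minimal SRN sketch for next-syllable prediction. It is a toy illustration: the layer sizes, learning rate, and one-step-truncated weight updates are our choices for exposition, not the architecture or training regime of Endress & Bonatti’s simulations or of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in = n_out = 9   # one-hot syllable vectors (10 with a gap symbol)
n_hid = 20         # illustrative hidden-layer size

W_xh = rng.normal(0.0, 0.1, (n_hid, n_in))   # input -> hidden
W_hh = rng.normal(0.0, 0.1, (n_hid, n_hid))  # context -> hidden
W_hy = rng.normal(0.0, 0.1, (n_out, n_hid))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(x, h):
    """One SRN step: new hidden state from input plus context, then a
    probability distribution over the next syllable."""
    h_new = np.tanh(W_xh @ x + W_hh @ h)
    return h_new, softmax(W_hy @ h_new)

def train(seq, epochs=5, lr=0.1):
    """One-step-truncated backpropagation: at each position, update the
    weights from the prediction error, treating the context layer as a
    fixed copy of the previous hidden state (Elman, 1990)."""
    global W_xh, W_hh, W_hy
    for _ in range(epochs):
        h = np.zeros(n_hid)
        for x, target in zip(seq[:-1], seq[1:]):
            h_prev = h
            h, y = step(x, h_prev)
            dy = y - target                    # softmax/cross-entropy gradient
            dh = (W_hy.T @ dy) * (1.0 - h**2)  # backprop through tanh
            W_hy -= lr * np.outer(dy, h)
            W_xh -= lr * np.outer(dh, x)
            W_hh -= lr * np.outer(dh, h_prev)
```

Familiarization would then consist of encoding the syllable stream as one-hot vectors and calling train; test preferences can be read off by comparing the network’s prediction probabilities over the syllables of competing test items.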
3. Preliminary analysis of types of dependencies in Peña et al.’s corpus
As others (e.g., Seidenberg, MacDonald, & Saffran, 2002) have observed, there is no principled
reason to exclude the possibility that subliminal segmentation gaps can be exploited statistically. The mere
fact that these gaps are subliminal does not prevent them from carrying potentially relevant information. It
simply means that their presence is not available to conscious access. As a matter of fact (as Peña et al. well
observe), they must carry the critical piece of information for the mastery of structural induction since the
inclusion of the gaps is the only difference between experiment 2, where part words are preferred, and
experiment 3, where rule words are favored (Table 1).
Although Peña et al. consider a potential rejoinder according to which a single statistical mechanism
may be responsible for the induction of the structural regularity in their experiment 3 by tracking transitional
probabilities over pauses as well as syllables, they dismiss that alternative in a footnote (note 27). The
rejoinder is the idea that participants might have computed transitional probabilities to and from the pauses
as well as to and from the audible syllables. In that case, they might estimate that a rule word at test (with
the structure #AiX′Ci#) was more likely than a part word at test (with the structure #CiAjX# or #XCiAj#).
The relevant transitional probabilities are shown in Figure 1. Note that the sum of transitional probabilities
for the rule word (3.66) is greater than the sum of transitional probabilities for part words of either type
(0.33 in both cases). Indeed, the sum of adjacent transitional probabilities for the rule word (1.33) is greater
than the sum of adjacent transitional probabilities for part words of either type. Thus, one might suppose
that participants would prefer rule words to part words based strictly on adjacent transitional probabilities.
(a) Rule word #AiX′Ci#: adjacent TPs 0.33, 0, 0, 1; TPs between items separated by one: 0, 1, 0; TPs between items separated by two: 0.33, 1.
(b) Part word of type 12 #CiAjX#: adjacent TPs 0, 0, 0.33, 0; separated by one: 0, 0, 0; separated by two: 0, 0.
(c) Part word of type 21 #XCiAj#: adjacent TPs 0, 0.33, 0, 0; separated by one: 0, 0, 0; separated by two: 0, 0.

Figure 1: Adjacent and non-adjacent transitional probabilities for (a) a rule word having the form #AiX′Ci#, (b) a part word of type 12 having the form #CiAjX# and (c) a part word of type 21 having the form #XCiAj#. In the original diagram, solid arrows indicate transitions between adjacent items, dashed arrows indicate transitions between non-adjacent items separated by one, and dotted arrows indicate transitions between non-adjacent items separated by two; each arrow is labeled with the corresponding transitional probability (values listed above).

In response to this potential rejoinder, Peña et al. performed a control experiment in which
participants were tested with test items that consisted of part words of type 21 including the internal pauses,
that is, items having the form #XCi#Aj#. The relevant transitional probabilities are shown in Figure 2. In this
case, the transitional probabilities between adjacent items sum to 1.66, higher than the sum of transitional
probabilities between adjacent items in the rule word. Thus, if participants were using only transitional
probabilities between adjacent items in calculating their preferences between test items, we would expect
them to prefer the part words of type 21 including the internal pauses to the rule words. However, Peña et al.
report that, on the contrary, participants still prefer the rule words to part words even when the part words
are presented with internal pauses.
Part word of type 21 with an internal gap, #XCi#Aj#: adjacent TPs 0, 0.33, 1, 0.33, 0; TPs between items separated by one: 0, 1, 0.5, 0; separated by two: 0, 0.33, 0; separated by three: 0, 0; separated by four: 0.

Figure 2: Adjacent and non-adjacent transitional probabilities for a part word of type 21 with an internal gap, i.e., an item having the form #XCi#Aj# (values listed above; diagram not reproduced).
However, there is a misunderstanding in Peña et al.’s line of reasoning that invalidates the
conclusion they reach. Peña et al. consider the prediction that participants would choose rule words
(#AiX′Ci#) over part words (#CiAjX#), once we consider “probabilities over syllables, pauses, and absence
of pauses in the stream and the test items”, since “[t]ransitional probabilities between adjacent elements
would favor rule words over part words” (note 27, p. 607, emphasis added). The problem with this comment
is that no reason is offered as to why the only transitional probabilities to be computed must be those
between adjacent elements in the speech stream. They assume that a statistical learning mechanism can only
be sensitive to immediately adjacent patterns. However, there is no reason to believe that such mechanisms
cannot be sensitive to nonadjacent regularities. This is especially so given that Peña et al.’s experimental
setting was precisely designed by constructing a lexicon mainly characterized in terms of nonadjacent
transitional probabilities; probabilities which, as they themselves acknowledge, are the cornerstone of the
segmentation task in their experiment 1: “[We] explore whether participants can segment a stream of speech
by means of nonadjacent transition probabilities, and we also ask whether the same computations are used
to promote the discovery of its underlying grammatical structure” (pp. 604-605; emphasis added).
We may then ask: which test items would participants choose if they were computing transitional
probabilities over both adjacent and nonadjacent items? The sum of adjacent and non-adjacent transitional
probabilities for a rule word is 3.66 (Figure 1a), whereas the sum of adjacent and non-adjacent transitional
probabilities for a part word with an internal pause is 3.5 (Figure 2). Thus, if participants are in fact
computing transitional probabilities over both adjacent and nonadjacent items, then there is no reason not to
expect them to prefer rule words over part words with an internal pause, exactly as Peña et al. report that
they do. It is of course not obvious that participants must be sensitive to the overall sum of TPs. Different
types of information may be weighted differently, and the underlying empirical question in dispute is
precisely what sources of information people track as they acquire a new language. Thus, what the current
discussion shows is that the possibility that statistics is behind it cannot be excluded in principle on the basis
of an alleged statistical inferiority on the part of rule words as opposed to part words with an internal pause.
Our point is that statistical computations based on nonadjacent transitional probabilities of the sort
that are exploited in speech segmentation may be used in order to induce existing grammatical regularities
in the speech stream. There is in fact a fairly clear literature demonstrating that recurrent networks can
induce grammars from examples of context-free and context-sensitive languages; grammars that are
precisely of a form in which there are long-distance dependencies (see, for example, Boden & Wiles, 2000;
Chalup & Blair, 2003). In order to empirically demonstrate these claims, and with an eye to undermining
one argument for the dual-mechanism (MOM) hypothesis, we ran a series of connectionist simulations that
illustrate the exploitation of statistically-driven information. The simulations were conducted in two separate
studies. In the first study, we aim to determine whether SRNs can exhibit the patterns of preference
contained in Table 3 when trained on a corpus that contains subliminal gaps (Peña et al.’s experiments 5 and
3, and Endress & Bonatti’s experiments 8, 3, 12, 1, 4, and 5; see Table 3). In the second study, we aim to
determine whether SRNs can exhibit the patterns of preference contained in Table 4 when trained on a
corpus that does not contain subliminal gaps (Peña et al.’s experiments 1, 2 and 4, and Endress & Bonatti’s
experiments 2 and 13; see Table 4).
4. Simulation studies
In our SRN studies, the familiarization corpus consisted of the same strings of syllables used by
Peña et al. (2002) and Endress & Bonatti (2007). In particular, syllables were represented by pair-wise
orthonormal nine- or ten-dimensional binary vectors (depending on whether segmentation gaps were
included in the familiarization). The familiarization corpora were as close as possible to those used by
Endress & Bonatti while still respecting the constraints described by Peña et al. Recall that in Peña et al., so-
called words have the form AiXCi, where Ai, X and Ci are syllables. The subscripts on A and C indicate that
the nonadjacent syllables are matched, such that the transitional probability between any Ai and the
following Ci is 1.0. There are three X syllables, so the transitional probabilities between an Ai and an
intermediate X and between an X and the final Ci are each 0.33. There are three word classes (i ∈ {1, 2, 3}),
and a word class may not repeat in the speech stream, so the transitional probability between Ci and the next
word’s first syllable is 0.5. The specific word classes used by Peña et al. (2002b) were:
i = 1 : pu . . . ki
i = 2 : be . . . ga
i = 3 : ta . . . du
and the filler syllables were:
li
ra
fo
Thus, the A1XC1 family consists of the words puliki, puraki and pufoki; the A2XC2 family consists of the
words beliga, beraga and befoga; and the A3XC3 family consists of the words talidu, taradu and tafodu. To
create the 10-minute familiarization stream, 100 tokens of each of the 9 words were randomly concatenated,
subject to two constraints: (a) a word of a family could not be followed by another word of the same family,
and (b) two words could not be adjacent if they had the same intermediate syllable. There seems to be
something of an issue here, because the online methods supplement (Peña et al., 2002b) states clearly that
rule (b) is in effect (“two words are not adjacent if they have the same intermediate syllable”, p. 1), while
Figure 1 of the original paper (Peña et al., 2002a) indicates that the familiarization stream 1 contains the
string puraki beliga tafodu pufoki talidu beraga, which clearly contains a violation of rule (b). Endress &
Bonatti (2007) apparently use the rule (viz. “Consecutive items could not belong to the same family, nor
could they have the same middle syllable,” in the description of the familiarization procedure for experiment
1, p. 255). Thus, for the purpose of the simulations, we made an attempt to constrain all training input
according to rule (b).
However, there is a further issue, in that observing these constraints makes it difficult to generate a
sequence of exactly the required length. Depending upon how the initial words are randomly ordered, it can
be impossible to place the last few words. As an extreme example, imagine that the sequence started as
follows (the superscripts indicate repetitions):
[puliki beraga]^100 [pufoki beliga]^100 [puraki befoga]^100 talidu
This sequence is only 601 words long, but it cannot be completed because all remaining words are from the
same family. Of course, pseudorandomly selecting words tends to avoid such extreme examples.
Nevertheless, a pseudorandom selection procedure remains susceptible to the less extreme forms of the
problem: the placement of words earlier in the sequence makes it impossible to place all words in the
sequence while obeying constraints (a) and (b). When the word sequence is generated stochastically, it is
highly unlikely for the problem to be this extreme. However, it is also unlikely that exactly 900 words can
be used. In one trial run, for example, only 48 of 1,000 randomly-generated sequences were able to use all
900 elements.
Perhaps of even greater concern, the sequences that do use all 900 elements tend to end with
relatively long sequences of alternating word pairs. This is because the procedure for selecting words is
essentially selection from a 900-element set without replacement. Thus, the 900th word of a 900-word
sequence has 0 entropy—it is the only option left in the pool. The first 900-word sequence we generated by
pseudorandom selection ended with 16 straight repetitions of “talidu, pufoki”.
Hence, in generating our familiarization corpus, we did not use the constraint that there must be
exactly 100 words of each type. Rather, for each set of simulations, we pseudorandomly generated 900
words according to the other constraints.
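The final generation procedure can be sketched as follows. This is a minimal Python illustration under our own naming; the rejection-free sampling scheme is one straightforward way to respect constraints (a) and (b) without forcing exactly 100 tokens per word:

```python
import random

# Sketch of the familiarization-stream generator described in the text.
# Function names and the sampling scheme are ours, not the authors'.
FAMILIES = {1: ("pu", "ki"), 2: ("be", "ga"), 3: ("ta", "du")}
FILLERS = ["li", "ra", "fo"]

def make_words():
    """All nine AiXCi words, tagged with family index and middle syllable."""
    return [(a + x + c, i, x)
            for i, (a, c) in FAMILIES.items() for x in FILLERS]

def generate_stream(n_words=900, seed=0):
    """Draw n_words words subject to constraints (a) and (b): consecutive
    words may not share a family or a middle syllable.  Per-word counts
    are not forced to be exactly equal (see text)."""
    rng = random.Random(seed)
    words = make_words()
    stream = []
    prev_family = prev_mid = None
    for _ in range(n_words):
        candidates = [w for w in words
                      if w[1] != prev_family and w[2] != prev_mid]
        word, prev_family, prev_mid = rng.choice(candidates)
        stream.append(word)
    return stream
```

Because at least four of the nine words are always eligible (excluding one family removes three words and excluding one middle syllable removes two more), this sampler never gets stuck, which is precisely why dropping the exact-count constraint avoids the dead-end problem.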
We created 7 test corpora to investigate the predictions of Peña et al. and Endress & Bonatti: (1) words
(AiXCi); (2) part words of type 12 (CkAiX); (3) part words of type 21 (XCiAj); (4) rule words (AiX′Ci); (5)
class words (AiX′Cj); (6) part words of type 12 that include internal gaps of the sort considered in footnote
27 of Peña et al. (Ck#AiX); and (7) part words of type 21 that include internal gaps of the sort considered in
footnote 27 of Peña et al. (XCi#Aj).
4.1 Study 1
Methods
Like Endress & Bonatti, we used an SRN (Elman, 1990). The syllables were coded as ten-bit
pairwise orthonormal binary vectors (a “1-of-c” encoding), with the tenth bit representing a gap. Endress &
Bonatti do not report the activation functions or objective function used in their simulations, but there is a
well-known issue with a common choice (sigmoid output units trained with the sum-squared error function)
on problems where the target patterns are mostly zeros, as they are here. Such networks
easily find a local minimum of the sum-squared error function by adjusting weights so that all output unit
activations are close to zero. Moreover, because the delta term in backpropagating sum-squared error
involves a multiplication by the derivative of the activation function (the “sigma prime term”), training
slows down dramatically whenever the output approaches 0 or 1, regardless of the target value (because the
derivative of the sigmoid approaches 0 in both cases). The usual procedure for problems using a 1-of-c
encoding is to use the softmax activation function at the output units combined with the cross-entropy
objective function (e.g., Bishop, 1995). The softmax activation function causes the activations of the output
units to always sum to unity, which is correct in the case of a 1-of-c encoding; a side effect is that one may
treat output activations as the network’s subjective assessments of the probability that each output unit
codes for the right category on a given input pattern. Using the cross-entropy objective function causes the
sigma prime term to drop out of the calculation of delta values, ensuring that weight updates approach zero
only as the activation value approaches the target value. Thus, we used the softmax activation function at the
output layer combined with the cross-entropy objective function. Based on the results of preliminary studies
(reported in Laakso & Calvo, 2008), we set momentum to 0 and used 54 hidden units.
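The cross-entropy simplification described above can be made explicit. In our notation (a sketch, not taken from the papers under discussion), with softmax outputs over activations $a_k$ and 1-of-c targets $t_k$:

```latex
y_k = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad
E = -\sum_k t_k \ln y_k
% Using \partial y_j/\partial a_k = y_j(\delta_{jk} - y_k)
% and \sum_j t_j = 1 for a 1-of-c target:
\frac{\partial E}{\partial a_k}
  = -\sum_j \frac{t_j}{y_j}\, y_j(\delta_{jk} - y_k)
  = -t_k + y_k \sum_j t_j
  = y_k - t_k
```

No derivative of the activation function appears in the delta, so the weight update vanishes only when $y_k$ reaches $t_k$, which is exactly the property exploited here.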
Presenting a word to the network consisted of sequentially presenting each of its three syllables,
followed by a gap. Networks had the same number of output units as input units and were trained to predict
the next syllable (or gap) from each syllable (or gap) presented as input.
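A minimal sketch of such a network, written in NumPy under our own assumptions (tanh hidden units, small random initialization, learning rate 0.1; none of these details are specified by the simulations described above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 10, 54          # 1-of-c syllables plus gap bit; 54 hidden units

W_xh = rng.normal(0, 0.1, (n_hid, n_in))    # input -> hidden
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))   # context (previous hidden) -> hidden
W_hy = rng.normal(0, 0.1, (n_in, n_hid))    # hidden -> softmax output
b_h, b_y = np.zeros(n_hid), np.zeros(n_in)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def step(x, h_prev):
    """One forward step: predict the next syllable (or gap) from the current one."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)
    return softmax(W_hy @ h + b_y), h

def train_step(x, target, h_prev, lr=0.1):
    """Elman-style truncated backprop: the context acts as a frozen extra
    input, so the gradients are ordinary one-hidden-layer backprop."""
    global W_xh, W_hh, W_hy, b_h, b_y
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)
    y = softmax(W_hy @ h + b_y)
    delta_y = y - target                    # softmax + cross-entropy delta
    delta_h = (W_hy.T @ delta_y) * (1 - h ** 2)
    W_hy -= lr * np.outer(delta_y, h)
    b_y -= lr * delta_y
    W_xh -= lr * np.outer(delta_h, x)
    W_hh -= lr * np.outer(delta_h, h_prev)
    b_h -= lr * delta_h
    return y, h
```

The softmax output can be read as the network's subjective probability distribution over the next syllable, and the output-layer delta is simply activation minus target, as discussed above.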
Trained networks were tested on five item types: training words (N=9), part words of type 12
(N=18), part words of type 21 (N=18), rule words (N=12) and class words (N=18).
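Four of the five item types and their counts fall out of the combinatorics directly. The sketch below is our own reconstruction; we omit class words because the exact middle-syllable inventory behind the reported N=18 is not fully determined by the description above:

```python
# Hypothetical reconstruction of the test-item types; syllable inventories
# follow the text, but the enumeration is ours.
FAMILIES = {1: ("pu", "ki"), 2: ("be", "ga"), 3: ("ta", "du")}
FILLERS = ["li", "ra", "fo"]

def words():
    # AiXCi: matched frame, filler middle (N = 9)
    return [a + x + c for a, c in FAMILIES.values() for x in FILLERS]

def part_words_12():
    # CkAiX: final syllable of one word plus the first two of the next;
    # constraint (a) forces k != i, giving two Ck per word (N = 18)
    return [FAMILIES[k][1] + FAMILIES[i][0] + x
            for i in FAMILIES for k in FAMILIES if k != i for x in FILLERS]

def part_words_21():
    # XCiAj: last two syllables of one word plus the first of the next,
    # with j != i (N = 18)
    return [x + FAMILIES[i][1] + FAMILIES[j][0]
            for i in FAMILIES for j in FAMILIES if j != i for x in FILLERS]

def rule_words():
    # AiX'Ci: matched frame, middle syllable drawn from the A/C inventory
    # of the other families -- one natural reading that yields N = 12
    ac = [s for pair in FAMILIES.values() for s in pair]
    return [a + xp + c for a, c in FAMILIES.values()
            for xp in ac if xp not in (a, c)]
```

Note that the counts produced by this enumeration match the Ns reported above (9, 18, 18, and 12).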
In their simulations, Endress & Bonatti considered 3 possibilities for representing and training on the
segmentation gaps: (1) representing the gaps with a vector of length 0 and training the SRN to predict the
gap just as it would predict a syllable; (2) representing the gaps with a vector of length 0 and training the
SRN to predict the syllable immediately after the gap; and (3) representing the gaps by an extra unit and
training the SRN to predict the silence immediately after the gap. In selecting their representational scheme,
however, Endress & Bonatti rely upon some of Peña et al.'s (2002) data (in particular, the data Peña et al.
report in footnote 27 of their paper that they claim suggests participants’ performance does not depend on
the gap being present in the test items). Endress & Bonatti interpret this to mean that the best way to model
participants’ representations of the test items is by using a network that always predicts the next syllable,
ignoring the gaps. In fact, there is a fourth possibility: representing the gaps by an extra input unit and
training the SRN to predict the syllable after the gap (Laakso and Calvo, 2008). However, for the purposes
of direct comparison with Endress & Bonatti’s results, we tested the networks, as they did, by recording the
network output for the second syllable of the test items (i.e., its prediction of what the third item would be)
and then comparing the network output with the actual third syllable of the test item using the cosine
similarity measure. That is, the cosine similarity measure was recorded between the third syllable of the test
item and the network output activation in response to the second syllable of the test item.
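The scoring procedure amounts to a single cosine comparison per test item. A sketch follows; the `predict` callable stands in for any trained SRN, and the names are ours:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two activation vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def score_item(predict, s1, s2, s3):
    """Feed the first two syllables of a test item and compare the output
    recorded after the second syllable with the actual third syllable.
    `predict` maps a list of input vectors to the output after the last one."""
    out = predict([s1, s2])
    return cosine(out, s3)
```

With 1-of-c syllable vectors, a perfect prediction scores 1.0 and a fully orthogonal one scores 0.0.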
Fifty networks were trained in order to simulate individual differences. After every 10 epochs of
training, the performance of each network was measured and recorded. Training was stopped after 300
epochs.
Results
The results are shown in Figure 3, which depicts cosine similarity values for words, part words, rule
words and class words over the course of 300 epochs of training averaged across our 50 network “subjects”.
Figure 3. Mean cosine similarity values for 50 networks trained and tested with gaps. (CW=class words, RW=rule words, W=words, PW=part words).
After about 50 epochs of training, the network outputs are more similar to the targets for words than they are
to the targets for rule words, more similar to the targets for rule words than they are to the targets for class
words, and more similar to the targets for class words than they are to the targets for part words. After about
75 epochs of training, the network outputs are more similar to the targets for rule words than they are to the
targets for part words as well as more similar to the targets for class words than they are to the targets for
part words. At about 100 epochs of training, the network outputs are equally similar to the targets for class
words and to the targets for part words. At about 150 epochs of training, the network outputs are more
similar to the targets for part words than they are to the targets for class words.
The results shown in Figure 3 show only a single line for part words (PW). Figure 4 shows the same
results except that the data for part words of type 12 and part words of type 21 are shown separately.
Figure 4. Mean cosine similarity values for 50 networks trained and tested with gaps. (CW=class words, RW=rule words, W=words, PW12=part words of type 12, PW21=part words of type 21).
Virtually the same pattern of results described above holds for both part words of type 12 and part words of
type 21 in Figure 4. Specifically, after about 50 epochs of training, the network outputs are more similar to
the targets for words than they are to the targets for rule words, more similar to the targets for rule words
than they are to the targets for class words, and more similar to the targets for class words than they are to
the targets for part words of either type. After about 60 epochs of training, the network outputs are more
similar to the targets for rule words than to the targets for part words of either type, as well as more similar
to the targets for class words than they are to the targets for part words of either type. After about 75 epochs
of training, the network outputs are equally similar to the targets for class words and to the targets for part
words of type 12, although the network outputs are more similar to the targets for class words than they are
to the targets for part words of type 21. After about 100 epochs of training, the network outputs are more
similar to the targets for part words of type 12 than they are to the targets for class words, although they are
more similar to the targets for class words than they are to the targets for part words of type 21.
Discussion
The results of our Study 1 (Figure 3) accurately model the behavior of human participants in Peña et
al.’s experiments 5 and 3, and Endress & Bonatti’s experiments 8, 3, 12, 1, 4, and 5; see Table 3.
Specifically, the results of Peña et al. and Endress & Bonatti on segmented 2-minute familiarization
streams (experiments 5, 8, 3 and 12 in Table 3) indicate preferences for words over rule words, for rule
words over class words, and for class words over part words of types 12 and 21 (compare network
performance in Figure 3 after about 50 epochs of training). Their results on segmented 10-minute
familiarization streams (experiments 3 and 1 in Table 3) indicate preferences for class words and rule words
over part words (compare network performance in Figure 3 after about 75 epochs of training). Their results
on segmented 30-minute familiarization streams (experiment 4 in Table 3) indicate no preference between
class words and part words (compare network performance in Figure 3 after about 100 epochs of training).
Finally, their results on segmented 60-minute familiarization streams (experiment 5 in Table 3) indicate
preference for part words over class words (compare network performance in Figure 3 after about 150
epochs of training).
However, as Figure 4 shows, part words of type 12 and part words of type 21 fare differently with
respect to class words in our simulations. Although part words of type 12 show the same qualitative pattern
as the average of all part words (that is, they are dispreferred to class words after fewer epochs of training
but preferred to class words after more epochs of training), part words of type 21 do not (they are never
preferred to class words). Figures 1(b) and 1(c) in section 3 above, where adjacent and non-adjacent
transitional probabilities for a part word of type 12 and a part word of type 21 were shown, may hint at the
divergence between the patterns of part words of different types in Figure 4. In the case of part words of
type 12, the transitional probability between the second and the third syllable is 0.33 (figure 1b). By
contrast, the transitional probability between the second and the third syllable in the case of part words of
type 21 is 0 (figure 1c). Figure 4 records the cosine similarity measure between the third syllable of the test
item and the network output activation in response to the second syllable of the test item. It is possible that
the divergence observed between the two types of part words is due to the particular representational
scheme that Endress & Bonatti chose to model participants’ representations of the test items, and that we
used for the purposes of direct comparison with Endress & Bonatti’s results. Nevertheless, Endress &
Bonatti, having observed something similar in one of their network simulations, wrote that “the network
predicts a difference between how part-words of type 12 and type 21 will stand the comparison with class-
words; yet, in none of our experiments have we observed it” (pp. 282-3). Endress & Bonatti, however, have
not reported experiments explicitly comparing preferences for part words of type 12 versus part words of
type 21. We predict that such experiments will show a preference for part words of type 12 over part words
of type 21 after 30 minutes or more of familiarization.
More important, though, is what our simulations reveal about the preference for class words. Endress
& Bonatti interpret human participants’ preference for class words over part words after 2 minutes and 10
minutes of familiarization as evidence that the participants have learned a class rule. Endress & Bonatti
conjecture that, when confronted with a segmented familiarization stream, participants may be freed from the
burden of having to parse the input to first extract the constituents, as they must with a continuous stream. In
this way, they are able to focus from the start upon the underlying structure itself. With longer
familiarization, generalization is overwhelmed as participants have more time to track the statistical
relations that obtain between items in the input stream. The interpretation of Endress & Bonatti is thus that
an initially induced dependency that takes place between classes of items degrades with familiarization
duration. But Figure 3 and Figure 4 make it clear that another interpretation of these results is possible:
perhaps people are not learning a class rule at all but only ruling it out rather slowly. That is what our
networks seem to be doing – although the networks disprefer part words right from the beginning, it takes
them some time (approximately 100 epochs – see Figure 3) to learn that class words are no better.
Now why would participants be slower to develop a dispreference for class words than to show a
dispreference for part words? Figure 5 shows adjacent and non-adjacent transitional probabilities for a class
word having the form #AiX′Cj#. In class words, statistical relations between adjacent items, and between
non-adjacent items separated by two, outweigh the corresponding relations in part words of both types. Transitional
probabilities between adjacent items sum to 1.33 in class words (Figure 5) but only to 0.33 in part words of
either type (Figure 1b,c). Similarly, transitional probabilities between non-adjacent items separated by two
sum to 1.33 in class words (Figure 5) but to 0 in part words of either type (Figure 1b,c). Transitional
probabilities between non-adjacent items separated by one are 0 in every case. An associative explanation
may thus underlie the diverging time spans of dispreference between class words and part words shown in
Figure 3 and Figure 4. We suggest that much the same might underlie human participants’ performance in
Endress & Bonatti’s experiments. That is, people may be ruling out an unlearned class rule rather slowly. In
fact, taking into account that the non-adjacent transitional probability between the first and the third syllable
is lost in the case of class words, it is difficult to grasp how participants might possess such powerful
generalization machinery that is triggered when released from having to parse the corpus, as Endress &
Bonatti suggest (see general discussion, below).
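The arithmetic behind this comparison can be tallied directly. The per-transition values below are read off Figures 1 and 5 as described in the text; the layout and names are ours:

```python
# Transitional-probability bookkeeping for the class-word vs part-word
# comparison.  Values are as reported in the text; the dictionary layout
# is our own illustration.
class_word = {                            # #AiX'Cj#
    "adjacent": [0.33, 0.0, 0.0, 1.0],    # #->Ai, Ai->X', X'->Cj, Cj->#
    "skip_one": [0.0, 0.0, 0.0],          # #->X', Ai->Cj, X'->#
    "skip_two": [0.33, 1.0],              # #->Cj, Ai->#
}
part_word = {                             # either type, per the text
    "adjacent_sum": 0.33,
    "skip_two_sum": 0.0,
}

# Class words carry more adjacent and skip-two statistical support:
assert sum(class_word["adjacent"]) > part_word["adjacent_sum"]
assert sum(class_word["skip_two"]) > part_word["skip_two_sum"]
```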
Figure 5. Adjacent and non-adjacent transitional probabilities for a class word having the form #AiX′Cj#. Solid arrows indicate transitions between adjacent items (0.33, 0, 0, and 1). Dashed arrows indicate transitions between non-adjacent items separated by one (all 0). Dotted arrows indicate transitions between non-adjacent items separated by two (0.33 and 1). Each arrow is labeled with the corresponding transitional probability.

A final caveat is in order. One might question the sense in which epochs of training in artificial
neural networks and familiarization duration with human participants relate to each other (recall that
network performance was probed every 10 epochs, and connection weights frozen after 300 epochs).
Indeed, it is not clear that there must be a linear relation between epochs of training and familiarization
duration. Our key point is that the networks do reproduce the initial patterns of preference, and that such a
preference does reverse in subsequent epoch intervals as a result of an increased learning of the prediction
task. The dual-mechanism hypothesis capitalizes on an observed negative correlation between the extraction
of structural regularities and familiarization duration. The longer the duration of the continuous
familiarization stream, the stronger the preference for part words over rule words. By contrast, a very
short familiarization with a segmented stream allows for the induction of the rule (preference of class and
rule words). Thus, Endress & Bonatti predict that a preference for class words will decrease over longer
familiarization durations. Demonstrating that, with sufficient training, networks can also show a reversal
(coming to prefer part words over class words) is thus critical to the debate. Such a reversal is exactly what
we find in study 1. The dual-mechanism hypothesis ignores the possibility that subliminal segmentation
gaps can be exploited statistically, as the present results with SRNs illustrate.
4.2 Study 2
The previous simulation used training data matching the familiarization stimuli used in Peña et al.’s
experiments 5 and 3, and Endress & Bonatti’s experiments 8, 3, 12, 1, 4, and 5 (Table 3). However, an
important aspect of Peña et al.'s and Endress & Bonatti's results is the fact that their experiments show
that participants cannot learn the so-called abstract rules without the segmentation gaps. How then do our
SRNs fare when trained on a familiarization corpus that does not include the gaps (Table 4)? The results of
Peña et al. and Endress & Bonatti on continuous 10-minute familiarization streams (the first four
experiments in Table 4) indicate preferences for words over part words, no preference between rule/class
words and part words, and a preference for rule words over class words. Finally, Peña et al.'s results on
continuous 30-minute familiarization streams (experiment 4 in Table 4) indicate a preference for part words
over rule words. Study 2 attempts to model the pattern of performance in Table 4.
Methods
Study 2 was identical to study 1, except that the networks were trained and tested without gaps. Thus,
the syllables were coded as nine-bit pairwise orthonormal binary vectors. As previously, the networks used
the softmax activation function at the output layer, the cross-entropy objective function, momentum of 0 and
54 hidden units.
Presenting a word to the network consisted of sequentially presenting each of its three syllables.
Networks had the same number of output units as input units and were trained to predict the next syllable
from each syllable presented as input. The testing procedures were the same as in study 1, except that the
test items did not contain gaps. As in study 1, 50 networks were trained in order to simulate individual
differences. After every 10 epochs of training, the performance of each network was measured and
recorded. Training was stopped after 300 epochs.
Results
The results are shown in Figure 6, which depicts cosine similarity values for words, part words of
type 12, part words of type 21, rule words and class words over the course of 300 epochs of training
averaged across our 50 network “subjects”.
Figure 6. Mean cosine similarity values for 50 networks trained and tested without gaps. (CW=class words, RW=rule words, W=words, PW12=part words of type 12, PW21=part words of type 21).
After about 60 epochs of training, the network outputs are clearly more similar to the targets for words than
they are to those for part words of either type, they are approximately equally similar to the targets for rule
words and those for part words of either type, they are clearly more similar to the targets for part words of
both types than they are to those for class words, and they are clearly more similar to the targets for rule
words than they are to those for class words. After about 150 epochs, the network outputs are more similar
to the targets for part words of either type than they are to those for rule words.
Discussion
In response to Perruchet et al. (2004), Bonatti et al. (2006) argue that they had “neglected other
crucial findings that we reported—for example, that prolonging exposure to the continuous stream induces a
preference for part words over rule words (Peña et al., 2002, experiment 4). It was the contrast between this
result and the preference for structurally correct but previously unheard items observed after short
familiarization with a segmented stream that led us to favor a dual-mechanism account. [W]e wonder what
alternative explanation they propose” (p. 319). In effect, the experiments in Table 4 complement those of
Table 3 insofar as no generalization obtains when the familiarization stream contains no gaps. The modeling
results in Study 1 would thus be less significant were we to fail to model statistically the fact that no
induction of rules obtains without the segmentation gaps.
The results shown in Figure 6 match the experimental results in Table 4 with one exception:
although Endress & Bonatti found that human participants had no preference between class words and part
words after 10 minutes of familiarization with a continuous stream (their experiment 2 – see Table 4 above),
our networks show a preference for part words over class words throughout the course of training on inputs
without gaps (Figure 6). We must remember, however, the rationale of Endress & Bonatti’s experiment 2 in
the context of testing the more-than-one mechanism hypothesis. Their concern was with whether
participants could learn a class rule. They took the results of their experiment 1 (showing a preference for
class words over part words after 10 minutes of familiarization on a segmented stream – see Table 3 above)
to show that participants could learn a class rule after familiarization with a stream containing segmentation
cues. By contrast, they took the results of their experiment 2 (showing no preference between class words
and part words after 10 minutes of familiarization on a continuous stream – see Table 4 above) to show that
participants could not learn a class rule after familiarization with a stream not containing segmentation cues.
Based on our network modeling results, however, we have suggested that participants may not be learning a
class rule even when the familiarization stream contains segmentation cues. Rather, participants may simply
be slower to develop a dispreference for class words than they are to show a dispreference for part words.
(See Figure 3 and Figure 4 and the discussion in the results section of study 1 above.) The same point applies
to streams not containing segmentation cues as to streams that do contain segmentation cues: networks
gradually develop a dispreference for class words (compare the lines for class words in Figure 4 and Figure
6), and we suggest that human participants may be doing the same. Figure 7 shows adjacent and non-
adjacent transitional probabilities for a class word without surrounding gaps, that is, a word having the form
AiX′Cj. Note that all of the transitional probabilities are 0. It is not clear why we would expect any learner to
find such items familiar.
Figure 7. Adjacent and non-adjacent transitional probabilities for a class word without surrounding gaps, i.e., having the form AiX′Cj. Solid arrows indicate transitions between adjacent items. The dashed arrow indicates the transition between non-adjacent items separated by one. Each arrow is labeled with the corresponding transitional probability; all three are 0.
If this is correct, we take the following lessons from the simulation results: (1) segmented inputs
lead statistical learners to find both words and rule words familiar while finding class words and part words
unfamiliar; and (2) unsegmented inputs lead statistical learners to find words familiar while finding rule
words, part words and class words unfamiliar. Other differences (i.e., the differences between part words of
type 12, part words of type 21 and class words in Figure 4, as well as the differences between part words of
either type, rule words and class words in Figure 6) may simply be uninteresting for participants as they are
all “below threshold” for familiarity. Thus, the negative correlation between duration of familiarization and
generalization of a class rule that is exhibited in the human experiments may be a red herring – rather than
providing strong evidence for the existence of two different learning mechanisms, it may simply reflect
overlearning of certain statistical patterns (part words of type 12) that are in any case irrelevant to the
projection of a rule. Note how low the cosine similarity values are for both kinds of part words and for class
words after about 50 epochs of training in Figure 6.
We conclude that Studies 1 and 2 successfully model the experimental results of Peña et al. and
Endress & Bonatti.
5. General Discussion
According to the dual-mechanism (MOM) hypothesis, both an associative and a rule-following
mechanism are needed to analyze speech. Although an associative mechanism suffices to segment speech,
structural generalization requires a non-associative mechanism. Endress & Bonatti argue that, insofar as
associative learning relies upon the piecemeal modification of synaptic connections, and is thus dependent
upon long exposures or cycles of training in order to adequately tune to whatever structure the environment
carries, the negative correlation between the generalization of structure and familiarization duration tells
against the competing “single-mechanism” hypothesis. It is assumed, by contrast, that an algebraic
mechanism can quickly extract the structure of the corpus, an ability that would conflict with longer
familiarization, as the actual chunks in the data (words and part words, but not rule words or class words)
will keep re-shaping the error landscape over successive cycles of processing. On this view, the analysis of
speech requires two mechanisms in order to account for the negative correlation observed between duration
of familiarization and the generalization of structure.
We are not the first to have criticized the dual-mechanism framework. Previous criticisms have been
raised by Newport & Aslin (2004), Onnis, Monaghan, Richmond, & Chater (2005), Perruchet, Tyler,
Galland, & Peereman (2004), Perruchet, Peereman, & Tyler (2006), and Seidenberg, MacDonald & Saffran
(2002). These replies targeted other aspects of Peña et al. and Endress & Bonatti’s results, such as the
possibility of phonological/phonotactic confounds, or the possibility that participants may be tuning to parts
of the stimuli (Ai or Ci) instead of generalizing the Ai_Ci rule itself. Endress & Bonatti designed control
experiments to deal with these and other concerns, one by one. Thus, for example, their experiments 6 and 7
were designed to control whether participants considered either the initial or the final syllable to induce the
rule, instead of both, and their experiments 10 and 11 to discount the possibility of phonological confounds
in their experiments 1 and 2. Having controlled for, and discarded one by one, the alternatives
proposed against the dual-mechanism hypothesis, Endress & Bonatti turned to connectionist simulations
with SRNs. The idea was to demonstrate that there was no possibility for associative language learning,
assuming that SRNs were, computationally speaking, representative of the associative mechanisms that may
underlie language acquisition in humans.
In this article, we have taken up the challenge put forward by Endress & Bonatti and reported the
results of simulations based on a connectionist model that account for the negative correlation observed in
earlier work. Fine-grained details aside, the current debate is not new to the cognitive science community,
and speech processing in adults is not the only arena in which the single vs. dual-mechanism hypotheses
have been contrasted. Notable precursors are the “great past-tense debate” (Pinker & Ullman, 2002;
Ramscar, 2002) and the debate over “rule learning” in infants (Marcus, Vijayan, Bandi Rao, & Vishton,
1999; Seidenberg & Elman, 1999). Speech processing in adults, thus, constitutes another venue in which to
assess the hypothesis that a single type of associative learning mechanism can explain human learning.
A few words of caution are needed before the single-mechanism hypothesis can be unconditionally
embraced. For one thing, concluding that algebraic manipulations do not take place based on the simple fact
that a neural network accounts for the data is a non sequitur. The hidden premise that delivers the goods is
that neural networks do not implement abstract relationships between variables in an explicit manner.
Marcus (2001) has pursued this line of reasoning in the case of the past tense debate and the debate over rule
learning in infants, and concluded that only connectionist models that explicitly implement abstract
relationships between variables can account for both the past-tense and infant results. We need to know then
which architectural features are behind our network’s successful modelling of the experimental results of
Peña et al. and Endress & Bonatti. The overall dispute among sympathizers and foes of associative models
of cognition boils down to an open empirical question as to what the architecture of cognition looks like. If
the required features include, for instance, the instantiation of variables, then a connectionist model of
speech generalization will serve to back up the dual-mechanism hypothesis (by implementing abstract
variables). We may put this somewhat differently: to warrant their claim that the MOM hypothesis is the
only explanation for the data, Endress & Bonatti need to show that only connectionist networks that
implement explicit rules in the form of abstract relationships between variables can account for their results.
The question then is: which models implement rules? A clear case of implementation occurs when
nodes in the architecture are used as variables. This is not the case with our SRNs, where the “1-of-c”
encoding is used to represent the actual syllables that form the “training” words in familiarization (PU or
KI), and not to represent word positional slots (Ai or Ci). A different way to become an ‘implementational
connectionist’ (Marcus, 2001) would be to train the SRN on a categorization task, where the teaching
pattern for gradient descent learning is provided externally. As in the case of the encoding of variables, a
rule is implemented in the non-ecological calculation of delta values in the form of a trainer that marks the
output categories explicitly. In our simulations, however, the cross-entropy objective function deployed in a
self-supervised task of prediction allows for the calculation of delta values in an ecological, non-
implementational, manner. That is, no output categories are fed externally to the network as targets, because
the target is the next input to the SRN, an input that stands for actual syllables and not variable slots.
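The self-supervised prediction scheme just described can be sketched in a few lines. The following is a minimal illustration, with a hypothetical syllable inventory, network sizes, and learning rate (not the parameters of our reported simulations): the network receives one-hot ("1-of-c") syllable vectors, outputs a softmax distribution over syllables, and the cross-entropy target is simply the next input.

```python
import numpy as np

rng = np.random.default_rng(0)

SYLLABLES = ["pu", "ki", "ra", "be", "ga", "du"]  # illustrative inventory
C = len(SYLLABLES)
H = 12  # hidden units (assumed for illustration)

W_ih = rng.normal(0, 0.1, (H, C))   # input -> hidden
W_hh = rng.normal(0, 0.1, (H, H))   # context (previous hidden) -> hidden
W_ho = rng.normal(0, 0.1, (C, H))   # hidden -> softmax output

def one_hot(i):
    v = np.zeros(C)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_epoch(stream, lr=0.1):
    """One pass over the syllable stream; returns mean cross-entropy.
    Gradients are truncated at one step (context treated as constant)."""
    global W_ih, W_hh, W_ho
    h = np.zeros(H)
    total = 0.0
    for t in range(len(stream) - 1):
        x = one_hot(stream[t])
        target = one_hot(stream[t + 1])        # self-supervised target
        h_new = np.tanh(W_ih @ x + W_hh @ h)
        y = softmax(W_ho @ h_new)
        total += -np.log(y[stream[t + 1]] + 1e-12)
        # softmax + cross-entropy: output delta is simply y - target
        d_out = y - target
        d_hid = (W_ho.T @ d_out) * (1 - h_new ** 2)
        W_ho -= lr * np.outer(d_out, h_new)
        W_ih -= lr * np.outer(d_hid, x)
        W_hh -= lr * np.outer(d_hid, h)
        h = h_new
    return total / (len(stream) - 1)

# Toy familiarization stream: syllable 0 is always followed, two positions
# later, by syllable 2, with a variable middle syllable in between.
stream = []
for _ in range(200):
    stream += [0, int(rng.integers(3, 6)), 2]

first = run_epoch(stream)
for _ in range(20):
    last = run_epoch(stream)
print(f"cross-entropy: {first:.3f} -> {last:.3f}")
```

Note that no category labels are supplied externally: the output deltas (`y - target`) are computed entirely from the upcoming input, which is what makes the calculation ecological in the sense discussed above.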
Certainly, the above two examples do not exhaust the landscape, and there may be other ways in
which the modeling results of an implementational connectionist network would serve to back up dual,
rather than single-mechanism hypotheses. But it is noteworthy that Endress & Bonatti’s failed attempt to
model their own experimental results was carried out with SRNs. We may thus assume that their effort was
directed towards a form of connectionism that they considered non-implementational. Because the only
architectural difference between their model and ours is the employment of the softmax activation function
at the output units combined with the cross-entropy objective function in an otherwise standard Elman net,
we take it that they would agree that we have built a connectionist model that accounts for the negative
correlation observed without implementing abstract relationships between variables in doing so.
Bearing in mind the distinction between connectionist models that explicitly implement abstract
relationships between variables and models that do not, the following three observations can help tip the
balance decisively against the dual-mechanism hypothesis: (i) types and tokens of mechanism must be told
apart; (ii) the claim that statistical mechanisms are slow must be qualified; and (iii) a distinction must be
honored between positive evidence that backs up the claim that algebraic mechanisms are fast, on the one
hand, and negative “evidence” that would back up the same claim more indirectly, on the other. We consider
points (i)-(iii) in turn.
(i) Types and tokens
In their reply to Perruchet et al. (2004), Bonatti et al. (2006) seem to conflate types of mechanisms
with specific mechanisms as tokens when they claim: “We never denied that a theory based on statistical
learning might account for fast learning; however, the thesis that all learning can be explained by statistical
computations is empty unless our critics can propose a single mechanism that is capable of simultaneously
explaining (a) segmentation of words after exposure to a long familiarization (but not to a short
familiarization) with a continuous stream, (b) extraction of structural information after a short
familiarization with a discontinuous stream, and (c) failure to extract the same information after
familiarization with any continuous stream” (p. 319; emphasis added).
The challenge thus posed, someone may contend, is unjustified. The challenge resides not in
accounting for (a)-(c) statistically, but rather in accounting for (a)-(c) simultaneously by means of a single
mechanism. In connectionist jargon, the challenge is then to obtain a single weight matrix through error-
driven training such that all the knowledge that gets induced is fully distributed and superposed in the
matrix. But why couldn’t separate or partially overlapping statistical mechanisms be responsible for (a), (b)
and (c)? Why, in short, is a single weight matrix needed where all knowledge is fully superposed? This
rendering of the situation amounts to raising the challenge in terms of a single (token) mechanism. Bonatti
et al. (2006) do not complain after all about the speed of one kind of learning over another (“We never
denied that a theory based on statistical learning might account for fast learning”), but rather about the
number of token mechanisms that can play a role. The MOM hypothesis entails “more than one type of
mechanism,” but nothing in their argument tells against the possibility that more than one associative
mechanism is in place. In short, someone may even wish to stick to the label “MOM” and defend on
empirical grounds the existence of “more-than-one-(statistical)-mechanism”, such that none of the statistical
mechanisms involved implements abstract relationships between variables.
In fact, the modeling results reported here account for the data in a non-implementational manner
while meeting the added challenge of accounting for (a)-(c) above in one single weight matrix. We imagine,
moreover, that the results could also be modeled by separate, serial connectionist mechanisms. A
sympathizer of the dual-mechanism hypothesis might then argue that such a more-than-one-(statistical)-
mechanism solution would result in a “modularization” of the architecture with activations getting
symbolized as they are output and fed from one statistical mechanism into another. In this way, the
connectionist network would amount to an implementation of a classical architecture. It is not clear, though,
that more-than-one-(statistical)-mechanism must amount to implementational connectionism. To take an
illustration where neurobiological plausibility is honored, O’Reilly (1998) shows the advantages of
combining Hebbian and biological forms of error-driven learning. We may, for instance, have an SRN
architecture supplemented with bidirectional (symmetric) propagation of activation, inhibitory competition,
an error-driven form of learning (McClelland, 1994), and Hebbian learning. The Hebbian component, in
combination with error-driven learning, makes its contribution by representing in hidden unit activation
space the first-order correlational structure of the input data. We fail to see where the rules and variables
would be in this network.
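The combination can be made concrete with a toy sketch. The following mixes an Oja-style Hebbian term with an error-driven delta-rule term in a single weight update; it is a schematic illustration under assumed parameters (the mixing coefficient `k_hebb`, the toy patterns, a single sigmoid layer), not O'Reilly's Leabra algorithm or the supplemented SRN just described.

```python
import numpy as np

rng = np.random.default_rng(1)

def train(X, T, k_hebb=0.1, lr=0.2, epochs=200):
    """X: inputs (n, d); T: targets (n, m). Single layer of sigmoid units.
    Each weight change is a convex combination of a Hebbian term (tracking
    input correlations) and an error-driven delta-rule term."""
    n, d = X.shape
    m = T.shape[1]
    W = rng.normal(0, 0.1, (m, d))
    for _ in range(epochs):
        for x, t in zip(X, T):
            y = 1.0 / (1.0 + np.exp(-(W @ x)))        # unit activations
            hebb = np.outer(y, x) - (y ** 2)[:, None] * W  # Oja-style term
            err = np.outer(t - y, x)                   # delta-rule term
            W += lr * (k_hebb * hebb + (1 - k_hebb) * err)
    return W

# Toy task: map two overlapping input patterns to distinct outputs.
X = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
T = np.array([[1.0, 0.0], [0.0, 1.0]])
W = train(X, T)
pred = 1.0 / (1.0 + np.exp(-(X @ W.T)))
```

The Hebbian component is unsupervised and purely correlational, and the error-driven component contains no variables or category symbols: nothing in the resulting weight matrix implements an explicit rule.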
(ii) Slow versus rapid associative mechanisms
Someone might suggest that a single-mechanism statistical model is necessarily at odds with quick
rule induction. But why couldn’t a non-implementational SRN quickly extract generalizations? After all,
one-shot Hebbian learning (Rolls & Treves, 1998) allows for a few event co-occurrences to contribute to
fast recall. Furthermore, taking into account that combined Hebbian and error-driven learning drives SRNs
to better generalization performance than backpropagation alone (O'Reilly & Munakata, 2000), we see no
reason to assume that statistical mechanisms are necessarily slow.2
As discussed above, the defender of the dual mechanism hypothesis needs to show that only
connectionist networks that explicitly implement abstract relationships between variables can account for
the Endress & Bonatti results. Presumably, the argument would be that the quick extraction of
generalizations can be achieved only by a mechanism that implements abstract relationships between
variables. However, the claim that statistical mechanisms are slow must be qualified. Even though Bonatti et
al. (2006) acknowledge the possibility of fast associative learning, the negative correlation between
familiarization duration and rule induction is critical for Endress & Bonatti’s case. What is the empirical
evidence that statistical mechanisms are slower than algebraic ones? Perhaps it is the distinction between
generalizing class words and generalizing rule words that motivates the dual-mechanism hypothesis.
Although the induction of a rule word might, for the sake of argument, be accomplished courtesy of quick
statistical mechanisms such as combined Hebbian and error-driven learning, the induction of class words
may not be obtained in the same way. Quick extraction of class words appears to be a more difficult pill for the
defender of associative mechanisms to swallow, insofar as most of the correlational information is lost as
one moves from items to classes of items. But, if Endress & Bonatti agree that rule words may be learned
statistically by fast associative mechanisms (mechanisms that they acknowledge exist), then class words no
longer constitute a threat to the single mechanism hypothesis. Based on our network modeling results, we
have suggested that participants may not be learning a class rule after all, even when the familiarization
stream contains segmentation cues. Rather, participants may simply be slower to develop a dispreference for
class words than they are to show a dispreference for part words. A single fast associative mechanism may
thus account for the Endress & Bonatti results.
2 A similar point is made by Perruchet et al. (2004) when they note that: “The assertion that associative learning proceeds slowly
does not stand up to empirical observations. For example, some associative forms of learning have been shown to develop over
one trial or so” (p. 582).
(iii) Negative vs. positive evidence
Finally, it is important to distinguish between positive evidence that supports the claim that algebraic
mechanisms are fast, on the one hand, and negative “evidence” that would support the same claim
somewhat more indirectly, on the other. This relates to the question of how parsimonious the contending
hypotheses are. Much of the Peña et al. and Endress & Bonatti strategy has the flavor of a negative, “only-
game-in-town” kind of argument: Given such a short familiarization duration, we cannot think of any way
that people could perform the way they do other than by having two different learning mechanisms. If our
working hypothesis is on the right track, only-game-in-town arguments must be accompanied by positive
evidence, not against single mechanisms, but in favor of dual ones. As Seidenberg et al. (2002) put it in
response to Peña et al. (2002), “The evidence for rule learning is mostly negative: cases where learning
occurs but there is no obvious statistical explanation. A theory explaining how rule learners arrive at exactly
the correct generalizations given the complexities of their experience would represent substantial progress”
(p. 554). Allowing for the statistical generalization of rule words and the slower development of a
dispreference for class words, together with the reversal of behavior we have modeled, we would like to see
an explicit presentation of the nuts-and-bolts of a dual mechanism that would make it a more attractive
alternative than our very simple model.
Endress & Bonatti (2007) try to make progress by shifting from a negative, only-game-in-town,
characterization of their working hypothesis to a positive characterization in terms of the operations that
may underlie the negative correlation they observed. They hypothesize the existence of “a general
mechanism representing syllables in words as variables, capable of operating under a variety of input
conditions. Such a mechanism would be able to extract relations between such variables within their
respective units… The silences may act as ‘markers’ that define the units of an analysis. Such markers may
be a prerequisite for dependencies between classes in speech to be analyzed, and this would explain [why]
the mechanism for generalization seems to only work over an already segmented input” (p. 291).
It is not clear what to make of Endress & Bonatti’s suggestion. For one thing, in Endress & Bonatti’s
view, one of the advantages of their framing of the debate in terms of adult speech (as opposed to the debate
about rules in infants or the past tense debate), is that they can control for training-independent factors, and
thereby focus upon (online) training-dependent factors. Connectionist attempts to explain rule learning in
infants (Seidenberg & Elman, 1999) have exploited the sort of knowledge that infants have already induced
during their short lives, and which they subsequently bring into the lab. In the case of adult speakers,
Endress & Bonatti control for such training-independent factors. But if there are no training-independent
biases that adults bring into the lab with them, it is far from obvious why the general mechanism they posit to
represent syllables as variables is ready to operate under the sort of higher-level abstraction demanded by
class words. The problem, put bluntly, is that it is not clear what makes algebraic mechanisms fast in the
first place. Or, somewhat more dramatically, if statistical mechanisms carry the burden of proving that they
can be fast, why are algebraic mechanisms relieved from such a burden?
Moreover, Bonatti et al. (2006) comment that Perruchet et al. (2004) “conflated the language-specific
nature of structure extraction (for which we did not argue) [and] its algebraic character, which is compatible
with but not mandated by our results” (p. 316). A subtle distinction seems to be at work here. We may have
lumped “dual-mechanisms”, “algebraic operations” and “domain-specificity” together. If the point that
proponents of the dual-mechanism hypotheses are making is after all about domain (language)-specificity,
what would a non-algebraic alternative (as suggested by the quote above) consist of? Again, no hints are
offered. And finally, it is not clear why a reversal in preferences must necessarily obtain. That is, why
should we ever stop using a rule once in place? The value of rule-following is precisely that, once acquired,
a rule can be applied indiscriminately (or so its virtue is usually sold).
All this shows that the dual-mechanism hypothesis is a less parsimonious explanation than a single-
mechanism (associative learning) hypothesis. The fact that dual-mechanism hypotheses are unparsimonious
and become speculative when one tries to flesh them out affects not only the current debate but also the past
tense debate and the debate about infant rule learning. In an attempt to spell out the “positive side” of rule-
following, Marcus (2001) claims, “My hunch is that the brain contains a stock of basic instructions, each
defined to operate over all possible values of registers” (p. 58); at another point, he goes on, “Fundamental
to my proposal are the assumptions that the mind has a large stock of empty treelets on hand and that new
knowledge can be represented by filling in an empty treelet or by adjusting the values contained in an
existing treelet” (p. 108); and last, “I have not proven that the mind/brain implements symbol manipulation.
Instead, I have merely described how symbol-manipulation could support the relevant cognitive phenomena.
All that we can do is to provisionally accept symbol-manipulation as providing the best explanation” (p.
170). We fail to see in what sense the work of Peña et al. and Endress & Bonatti provides any better positive
evidence for a dual-mechanism hypothesis (in the case of the analysis of speech in adults) than these
speculative remarks by Marcus might allow us to imagine (in the case of rule learning in infants). But why
should we favor such an unparsimonious model when one that is clearly more parsimonious can account for
the same data?
According to Bonatti et al. (2006), single-mechanism hypotheses are “empty unless our critics can
propose a single mechanism that is capable of simultaneously explaining (a) segmentation of words after
exposure to a long familiarization (but not to a short familiarization) with a continuous stream, (b) extraction
of structural information after a short familiarization with a discontinuous stream, and (c) failure to extract
the same information after familiarization with any continuous stream.” (p. 319). Our model, we contend,
meets this challenge: it explains the phenomena that Peña et al. and Endress & Bonatti report, and it is more
parsimonious than their alternative. The single-mechanism hypothesis, moreover, generates further
predictions for testing. As noted in study 1, Endress & Bonatti do not explicitly compare preferences for
part words of type 12 versus part words of type 21. Our associative learning-based prediction is that such
experiments will show a preference for part words of type 12 over part words of type 21. Parsimony and
novel testable predictions are an added value that in our view may end up tipping the balance against the
dual-mechanism (MOM) hypothesis.
6. Conclusion
How many different mechanisms are needed to analyze speech? According to a dual-mechanism
hypothesis (Peña et al., 2002; Endress & Bonatti, 2007), language learning is achieved by means of two
mechanisms: a statistical mechanism that permits the learner to extract words from the speech stream,
together with a non-statistical mechanism that is necessary for extracting higher-level structure. We have
presented a neural network study that shows how statistics alone can support the discovery of structural
regularities, beyond the segmentation of speech. We have argued that our results undermine Peña et al. and
Endress & Bonatti’s argument for the dual-mechanism (MOM) hypothesis, and we therefore conclude that
they have not demonstrated that rule-governed language-learning mechanisms are necessary for the
extraction of structural information. In addition, we believe that these modeling results go beyond the
idiosyncrasies of SRNs. Our work shows that a primitive, artificial statistical learning mechanism can learn
linguistic preferences that appear to be governed by abstract, structural rules. There is no reason to think that
the powerful statistical learning machinery that is the human brain could not do the same.
References
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Boden, M., & Wiles, J. (2000). Context-free and context-sensitive dynamics in recurrent neural networks.
Connection Science: Journal of Neural Computing, Artificial Intelligence & Cognitive Research,
12(3-4), 197-210.
Chalup, S. K., & Blair, A. D. (2003). Incremental training of first order recurrent neural networks to predict
a context-sensitive language. Neural Networks, 16(7), 955-972.
Colarusso, J. (2005). Georgian and Caucasian languages. In P. Strazny (Ed.), Encyclopedia of Linguistics
(pp. 380-383). New York: Fitzroy Dearborn.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.
Endress, A. D., & Bonatti, L. L. (2007). Rapid learning of syllable classes from a perceptually continuous
speech stream. Cognition, 105(2), 247-299.
Greenberg, J. H. (1963). Some universals of grammar, with particular reference to the order of meaningful
elements. In J. H. Greenberg (Ed.), Universals of language (pp. 73-113). Cambridge, MA: MIT
Press.
Klamer, M. (2005). Explaining some structural and semantic asymmetries in morphological typology. In G.
Booij, E. Guevara, A. Ralli, S. Sgroi & S. Scalise (Eds.), Morphology and Linguistic Typology: On-
line Proceedings of the Fourth Mediterranean Morphology Meeting (MMM4) (pp. 127-142).
Bologna: Università degli Studi di Bologna.
Laakso, A., & Calvo, P. (2008). A connectionist simulation of structural rule learning in language
acquisition. In B. C. Love, K. McRae & V. M. Sloutsky (Eds.), Proceedings of the 30th Annual
Meeting of the Cognitive Science Society (pp. 709-714). Austin, TX: Cognitive Science Society.
Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning by seven-month-old
infants. Science, 283, 77-80.
McClelland, J. L. (1994). The interaction of nature and nurture in development: A Parallel distributed
processing perspective. In P. Bertelson, P. Eelen & G. d'Ydewalle (Eds.), International Perspectives
on Psychological Science (Vol. I: Leading Themes). Hillsdale, NJ: Erlbaum.
Newport, E. L., & Aslin, R. N. (2000). Innately constrained learning: Blending old and new approaches to
language acquisition. In S. C. Howell, S. A. Fish & T. Keith-Lucas (Eds.), BUCLD 24: Proceedings
of the 24th annual Boston University Conference on Language Development. Boston: Cascadilla Press.
Newport, E. L., & Aslin, R. N. (2004). Learning at a distance: I. Statistical learning of non-adjacent
dependencies. Cognitive Psychology, 48(2), 127-162.
O'Reilly, R. C. (1998). Six principles for biologically based computational models of cortical cognition.
Trends in Cognitive Sciences, 2(11), 455-462.
O'Reilly, R. C., & Munakata, Y. (2000). Computational explorations in cognitive neuroscience:
Understanding the mind by simulating the brain. Cambridge, MA: MIT Press.
Onnis, L., Monaghan, P., Richmond, K., & Chater, N. (2005). Phonology impacts segmentation in online
speech processing. Journal of Memory and Language, 53(2), 225-237.
Peña, M., Bonatti, L. L., Nespor, M., & Mehler, J. (2002). Signal-driven computations in speech processing.
Science, 298(5593), 604-607.
Perruchet, P., Peereman, R., & Tyler, M. D. (2006). Do we need algebraic-like computations? A reply to
Bonatti, Peña, Nespor, and Mehler (2006). Journal of Experimental Psychology: General, 135(2),
322-326.
Perruchet, P., Tyler, M. D., Galland, N., & Peereman, R. (2004). Learning nonadjacent dependencies: No
need for algebraic-like computations. Journal of Experimental Psychology: General, 133(4), 573-
583.
Pinker, S., & Ullman, M. T. (2002). The past and future of the past tense. Trends in Cognitive Sciences,
6(11), 456-463.
Ramscar, M. (2002). The role of meaning in inflection: Why the past tense does not require a rule. Cognitive
Psychology, 45(1), 45-94.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. Oxford: Oxford University Press.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science,
274(5294), 1926-1928.
Seidenberg, M. S., & Elman, J. L. (1999). Networks are not 'hidden rules'. Trends in Cognitive Sciences,
3(8), 288-289.
Seidenberg, M. S., MacDonald, M. C., & Saffran, J. R. (2002). Does grammar start where statistics stop?
Science, 298, 553-554.
Tadmor, U. (2005). Malay-Indonesian and Malayic languages. In P. Strazny (Ed.), Encyclopedia of
Linguistics (pp. 644-647). New York: Fitzroy Dearborn.