The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Morphology: the last 40 years. Mark Aronoff. January 3, 2014


Page 1: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Morphology: the last 40 years

Mark Aronoff

January 3, 2014

Page 2: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Preface: Technology and theory

• The relation between technology and theory goes both ways

• We like to believe that theory leads technology

• At least as often it is the other way round

• Many of the successes of early science were technology driven

Page 3: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Galileo Galilei

Sidereus Nuncius (1610)

Page 4: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Antoni van Leeuwenhoek

In the year of 1675 I discover’d living creatures in Rain water

Page 5: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

A Case Study

Morphological Productivity

• Morphological productivity was rarely investigated until the 1980s

• Newly available electronic tools made the quantitative study of morphological productivity possible

• New tools have led to breakthroughs in our understanding of both synchronic and diachronic morphology

• The tools lead us to question fundamental assumptions about the discreteness of language and the value of the competence/performance distinction

Page 6: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Counting Words

Data Resources and English Morphology

• Fundamental discoveries in linguistic morphology over the last half-century have depended on improvements in our ability to count English words

• As the resources for counting words have changed and improved, so have our ideas about morphology changed and (we hope) our understanding improved

Page 7: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Laying the Foundations for Studying Morphological Productivity

• Early linguistic word data resources were not designed for linguistics, though they were focused on language

– Walker 1775

– Thorndike 1921, 1932, 1944

• Only in the 1960s did the first truly linguistically driven electronic word data resources appear

– Brown 1963 (word counts)

– Kučera and Francis 1967 (frequency counts)

Page 8: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

John Walker

1732 – 1807

Page 9: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

John Walker

The Godfather of Modern Morphology

• Walker’s Rhyming Dictionary, 1775

• Walker’s dictionary has gone through many editions and remains in print

• The term rhyming dictionary was misleading, though it was a good selling point

• Walker’s dictionary was meant for linguists as much as for poets, though few linguists used it

Page 10: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Notable linguistic remarks from Walker’s original Introduction

• As in other Dictionaries words follow each other in an alphabetical order according to the letters they begin with, in this they follow each other according to the letters they end with.

• The English Language, it may be said, has hitherto been seen through but one end of the perspective; and though terminations form the distinguishing character and specific difference of every language in the world, we have never before had a prospect of our own, in this point of view.
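The ordering Walker describes, by the letters words end with, can be sketched in a few lines. This is only an illustration of the principle, not a reconstruction of how the dictionary was compiled; the function name and the sample words are assumptions made for the example.

```python
def reverse_alphabetical(words):
    """Order words by their endings: compare the last letters first,
    then the second-to-last, and so on."""
    return sorted(words, key=lambda w: w[::-1])

# Sample words chosen for illustration only.
sample = ["nation", "kindness", "station", "darkness", "ration"]
ordered = reverse_alphabetical(sample)
print(ordered)
# ['nation', 'ration', 'station', 'kindness', 'darkness']
```

Sorting on the reversed string groups words that share a termination, which is what makes suffixes visible as sets.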

Page 11: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Edward Thorndike, 1874-1949

The Godfather, Part II

Page 12: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

The Father of Educational Psychology

• Thorndike was one of the first American experimental psychologists

• Thorndike’s work was a precursor to both behaviorism and modern cognitive psychology

• Thorndike spent almost his entire career at Teachers College, Columbia University

• Thorndike is regarded as a founding figure in educational psychology

Page 13: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Thorndike’s word books

• Between 1921 and 1944, Thorndike published three frequency-based word books for teachers, to be used in curriculum design

• The last edition (Thorndike and Lorge) contained 30,000 words

• The books consisted almost entirely of frequency lists: 1/1,000,000; 1/4,000,000; 1000 most frequent

• These were the first frequency lists published for any language

Page 14: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

A. F. Brown

• A. F. Brown was one of the first computational linguists, working at Penn and then at Lehigh

• In 1963, he published his Normal and Reverse English Word List, prepared under contract with the Air Force Office of Scientific Research

• The list was collated from 18 dictionaries

– Each list runs to 400 pages of computer printout, with 100 words per page = 400,000 entries

Page 15: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Kučera and Francis

Francis and Kučera

• The Brown Corpus (1964)

– 1,014,312 words of running text of edited English prose printed in the United States during the calendar year 1961

– 500 samples of 2000+ words each

– Tagged in a variety of ways

• Computational Analysis of Present-Day American English (1967)

• Frequency Analysis of English Usage (1982)

– Approximately 45,000 distinct lemmas listed with their frequencies

– Lemmas with adjusted frequency >5/m in rank order

Page 16: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

The last 25 years

Large-scale electronic resources

• The availability in the last quarter century of large-scale electronic resources has made it possible to study English morphology in hitherto unimagined ways

• These resources have changed our perspective on how morphology works

• Two types of resources:

– Electronic dictionaries

– Large corpora

Page 17: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

The Oxford English Dictionary

Page 18: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

The Oxford English Dictionary

• The largest, longest, and most expensive academic publishing project in history

• 1857 Inaugurated

• 1879 Work begins in earnest

• 1933 First full edition

• 1989 OED2

• 1992 CD-ROM of OED2

• 2000 OED Online (by subscription)

Page 19: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

À quoi ça sert (l’amour)?

• The OED, unlike Webster’s II and others, is a historical dictionary

• Recent editions of the OED were designed from the bottom up as electronic resources

• The combination allows us to ask questions that we could never before expect to find answers for

• We can even ask questions that we might never before have imagined

Page 20: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

OED Tools

• The OED prides itself on the accuracy of its first citations

• The first citations provide the most accurate historical record available in any language of the first use of a word

• The ability to use wild cards permits the simple construction of historical timelines for individual affixes

• The timelines allow easy and accurate study for the first time of the growth and decline of patterns of affixation in English over the last millennium
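A minimal sketch of how a wildcard search over first citations could be turned into an affix timeline. It assumes the search results have already been exported as (word, first-citation year) pairs; it does not call any OED interface. The function name, the century-sized bins, and the records (nonce words with invented years) are all assumptions made for the example.

```python
from collections import Counter
from fnmatch import fnmatchcase

def affix_timeline(first_citations, pattern, bin_size=100):
    """Count the words matching a wildcard pattern, grouped by the
    period (bin_size years wide) of their first citation."""
    counts = Counter()
    for word, year in first_citations:
        if fnmatchcase(word, pattern):
            counts[(year // bin_size) * bin_size] += 1
    return dict(sorted(counts.items()))

# Hypothetical records: nonce words and invented years, not OED data.
records = [("wugment", 1310), ("blickment", 1390),
           ("daxment", 1450), ("wugness", 1320)]
timeline = affix_timeline(records, "*ment")
print(timeline)  # {1300: 2, 1400: 1}
```

A wildcard matches letter strings, not morphemes, so a real list would still need hand checking before the counts are read as the history of a suffix.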

Page 21: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

What the OED shows us

• The system is self-organizing

• We can track the emergence of “borrowed” affixes from the borrowing of large numbers of individual words to the productive use of an affix (e.g., -ment, -ation, -ity, -able)

• Homonymous affixes compete

• The competition between affixes is resolved through competition

Page 22: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Sample affix histories from the OED (Anshen & Aronoff 1999)

Page 23: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Sample affix histories from the OED (Marine Lasserre)

Page 24: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Corpora

• The Brown corpus, compiled 50 years ago, contained a total of 1 million words

• The Google Books database currently contains over 30 million books and over 150 billion words

• Other modern large corpora are comparably large and are tagged for part of speech

• The COCA corpus contains over 450 million words

• Corpora allow for the counting of individual words/lemmas and their frequencies in a corpus

Page 25: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Harald and the Elusive Index

Page 26: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Baayen’s Productivity Indices

• In a series of publications from 1989 on, Harald Baayen developed a number of corpus-based indices intended to capture the intuitive notion of morphological productivity

• Baayen’s indices are based on the idea that words that only occur once in a corpus, hapax legomena, are a window into morphological productivity

• This idea makes no sense in the absence of a searchable corpus of reasonable size

• The general method becomes less useful as the corpus grows in size

Page 27: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

P = n1 / N

• The best known of Baayen’s indices is P, which measures the “growth rate” of the affix: the probability that an encounter with a word containing the affix is a new type.

• In the equation, n1 represents the total number of hapaxes containing the affix, and N represents the total number of tokens containing the affix.

• P fits linguists’ intuitions about productivity reasonably well in corpora < 100M words, except when both n’s are small (for unproductive affixes)
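The calculation of P = n1 / N can be sketched as follows. The sketch assumes a corpus that is already tokenized and a caller-supplied test for whether a word contains the affix; the function name, the crude suffix test, and the toy corpus are assumptions made for the example.

```python
from collections import Counter

def productivity_p(tokens, has_affix):
    """Return (n1, N, P) for the tokens that contain the affix."""
    freq = Counter(t for t in tokens if has_affix(t))
    N = sum(freq.values())                        # tokens containing the affix
    n1 = sum(1 for c in freq.values() if c == 1)  # affixed types seen exactly once
    P = n1 / N if N else 0.0
    return n1, N, P

# Invented toy corpus, for illustration only.
toy_tokens = ["kindness", "kindness", "kindness", "darkness",
              "sadness", "oddness", "the", "the"]
n1, N, P = productivity_p(toy_tokens, lambda w: w.endswith("ness"))
print(n1, N, P)  # 3 6 0.5
```

Testing for the affix with a plain string match is the weakest part of the sketch; real counts depend on a morphological analysis of each type.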

Page 28: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

V and P*

• V is the total number of lexeme types containing a given affix

– Differences in V between affixes reflect the extent to which relevant base words have been used

• Baayen plots P against V to obtain P*, the relative “global productivity” of affixes

– This measure is problematic, as Baayen notes, because there is no principled way of scaling the axes

Page 29: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Hapax vs. Hapax

• Baayen’s final measure is P *, the hapax-conditioned degree of productivity

• P * = n1 / h1, where h1 is the total number of hapaxes across all types in the corpus

• Since h1 is the same for all affixes in a corpus, this measure simply counts the numbers of hapaxes for each affix identified in a corpus

• The difference in P * yields intuitively satisfactory results for Baayen’s corpora

• The greatest weakness of P * is that it cannot easily be compared across corpora
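A minimal sketch of the hapax-conditioned measure, P * = n1 / h1, for several affixes in one corpus. As before, the suffix test is a plain string match, and the function name and toy corpus are assumptions made for the example.

```python
from collections import Counter

def hapax_conditioned(tokens, suffixes):
    """For each suffix, return n1 / h1: its share of all hapaxes in the corpus."""
    freq = Counter(tokens)
    hapaxes = [w for w, c in freq.items() if c == 1]
    h1 = len(hapaxes)  # identical for every affix in this corpus
    return {s: (sum(w.endswith(s) for w in hapaxes) / h1 if h1 else 0.0)
            for s in suffixes}

# Invented toy corpus, for illustration only.
toy_tokens = ["kindness", "kindness", "darkness", "sadness",
              "purity", "the", "the", "cat"]
scores = hapax_conditioned(toy_tokens, ["ness", "ity"])
print(scores)  # {'ness': 0.5, 'ity': 0.25}
```

Because h1 is shared, the ranking of affixes within one corpus depends only on their hapax counts, which is also why the scores do not transfer to a corpus with a different h1.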

Page 30: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Where hapaxes fail

• Both P and P * measurements are dependent on the size (N) of the corpus

• The number of hapaxes in a corpus is a decreasing function of N

– The rate of increase in the number of hapaxes slows as the size of the corpus increases

– Very large corpora show few if any hapaxes

• There is no way to know what the “proper” size of a corpus is for hapax-based measures to be useful

• It is not clear what the value of a measure of global productivity is
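One way to inspect how hapax counts depend on corpus size is to record the number of hapaxes at regular intervals while reading through a corpus. The sketch below does only that bookkeeping; the function name, the sampling step, and the toy token sequence are assumptions, and the toy output says nothing about real corpora.

```python
from collections import Counter

def hapax_curve(tokens, step):
    """Return (tokens read, current hapax count) after every `step` tokens."""
    freq = Counter()
    hapaxes = 0
    curve = []
    for i, tok in enumerate(tokens, start=1):
        freq[tok] += 1
        if freq[tok] == 1:
            hapaxes += 1      # a new type starts out as a hapax
        elif freq[tok] == 2:
            hapaxes -= 1      # a second occurrence removes it from the hapaxes
        if i % step == 0:
            curve.append((i, hapaxes))
    return curve

# Invented toy sequence, for illustration only.
toy_tokens = "a b a c b d a e".split()
curve = hapax_curve(toy_tokens, step=2)
print(curve)  # [(2, 2), (4, 2), (6, 2), (8, 3)]
```

Running the same function over samples of different sizes from one source makes the size dependence of any hapax-based index directly visible.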

Page 31: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

So far, so good

• We gain insights into morphological productivity if we use quantitative tools

• We cannot treat productivity as a discrete phenomenon if we want to learn about it

• The methods and measures we use depend on the machinery that we have

• The notion of an absolute measure of productivity that is valid across corpora is elusive and problematic

Page 32: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Escape from Hapax

• The number of hapaxes decreases as the size of the corpus increases

• With very large corpora hapaxes are not helpful

• We can learn a great deal from very large corpora if we confine ourselves to the direct comparison of pairs of competing affixes

• This method is not based on hapaxes

• This line of research does not address the question of global productivity at all

• Google Fight!

Page 33: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Using Google Search

• We use Google Search Estimated Total Matches (ETM) as a measure of usage

• PROBLEMS

– Google is very noisy and must be used with great caution

– ETM is not an actual count but an estimate based on a proprietary method

• SOLUTIONS

– Little weight is placed on raw numbers or on individual word pairs

– Only large differences between affixes are taken into account
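The two safeguards can be sketched as a tally over paired counts: each pair contributes one vote at most, and only when one form outnumbers the other by a wide margin. The sketch assumes the ETM figures have already been collected by hand; it makes no call to any search service. The function name, the tenfold threshold, and the counts are assumptions made for the example, not values from the study.

```python
def tally_rivals(pair_counts, min_ratio=10.0):
    """Tally which of two rival forms wins across many stems,
    ignoring pairs whose counts differ by less than min_ratio."""
    tally = {"first": 0, "second": 0, "undecided": 0}
    for first, second in pair_counts.values():
        if first > 0 and first >= min_ratio * second:
            tally["first"] += 1
        elif second > 0 and second >= min_ratio * first:
            tally["second"] += 1
        else:
            tally["undecided"] += 1
    return tally

# Hypothetical (first form, second form) counts for placeholder stems.
toy_counts = {"stem1": (1000, 10), "stem2": (50, 40),
              "stem3": (5, 900), "stem4": (0, 0)}
result = tally_rivals(toy_counts)
print(result)  # {'first': 1, 'second': 1, 'undecided': 2}
```

Counting wins per stem rather than summing the raw figures keeps a single very frequent pair from dominating the comparison.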

Page 34: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

A test case

Comparing -ic and -ical

• Sample ETM counts for high frequency doublets (Lindsay & Aronoff 2013)

Page 35: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Comparing -ic and -ical

• Sample ETM counts for high frequency singletons (Lindsay & Aronoff 2013)

Page 36: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Usually -ic wins

Sometimes -ical wins

• -ical is productive in stems ending in -olog (from Lindsay and Aronoff 2013)

Page 37: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Usually -ic wins

Sometimes -ical wins

• -ical is productive in stems ending in -olog (from Lindsay and Aronoff 2013)

Page 38: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Why -olog?

• -olog defines the largest set by far of stems with neighborhood length 4 preceding either of the two suffixes (475 members)

• The -olog set contains 2/3 of all stems in -g

• The -olog set is thus a very large morphologically defined subsystem with very few neighbors

• The -olog set is uniquely suited to sustain -ical as a productive suffix, in spite of the clear dominance of -ic overall
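A minimal sketch of grouping stems by their final letters, on the assumption that "neighborhood length 4" refers to the last four letters of the stem before -ic or -ical. That reading, the function name, and the short word list are assumptions made for the example; the 475-member figure above comes from the study, not from this code.

```python
from collections import defaultdict

def stem_neighborhoods(words, suffixes=("ical", "ic"), length=4):
    """Strip -ical or -ic and group the resulting stems by their final `length` letters."""
    groups = defaultdict(set)
    for word in words:
        for suffix in suffixes:  # try the longer suffix first
            if word.endswith(suffix) and len(word) - len(suffix) >= length:
                stem = word[:-len(suffix)]
                groups[stem[-length:]].add(stem)
                break
    return dict(groups)

# Short word list chosen for illustration only.
sample = ["biological", "geological", "psychological",
          "historic", "historical", "electric"]
sizes = {ending: len(stems)
         for ending, stems in stem_neighborhoods(sample).items()}
print(sizes)  # {'olog': 3, 'stor': 1, 'ectr': 1}
```

Stems are kept in sets so that a doublet such as historic/historical counts once toward its neighborhood.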

Page 39: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

Conclusion

• The combination of rich computational resources and quantitative methods allows us to make progress in understanding questions that could not be profitably studied a quarter century ago

• As the resources change, so do the questions, the methods, and the theories that they drive

Page 40: The Ninetieth Anniversary of the LSA: A Commemorative Symposium

THANK YOU

Special thanks to those who have joined in my personal struggle over the last 40 years to understand morphological productivity by counting

Morris Halle

Frank Anshen

Mark Lindsay

La lotta continua!