
ScotLex-1, Edinburgh, 08.04.2016

Carolin Müller-Spitzer & Sascha Wolfer

A QUANTITATIVE VIEW ON DICTIONARY USE: POTENTIALS AND LIMITATIONS OF LOG FILE ANALYSES

• Lew (2015a): “Until fairly recently, dictionary users were not usually of central concern in the process of dictionary making […].”

• Advantages of focusing on the user:

Discover the challenges users face when accessing and using dictionaries (relevant for user instruction and usability)

Learn how users are working with the dictionary

Discover what users are interested in the most/least

Test preconceptions of the lexicographer about the users

User studies enable us to make better dictionaries.

RESEARCH INTO DICTIONARY USE

04.04.2016 ScotLex-1 - Müller-Spitzer & Wolfer - Log file analyses 2

Lew, R. (2015a). Dictionaries and their users. In P. Hanks & G.-M. De Schryver (Eds.), International Handbook of Modern Lexis and Lexicography (pp. 1–9). Berlin/Heidelberg: Springer.

• Main aim: Collect empirical data to gain insights into dictionary usage

• Multiple methods of data collection: (Web) questionnaires, eye tracking studies, usability studies, log file analyses, …

• Choice of method depends on the research question we want to address.

Lew, R. (2015b). Opportunities and limitations of user studies. In C. Tiberius & C. Müller-Spitzer (Eds.), Research into dictionary use / Wörterbuchbenutzungsforschung. 5. Arbeitsbericht des wissenschaftlichen Netzwerks „Internetlexikografie“ (Vol. 2/2015, pp. 6–16). Mannheim: Institut für Deutsche Sprache. Retrieved from http://pub.ids-mannheim.de/laufend/opal/pdf/opal15-2.pdf

Müller-Spitzer, C. (2014). Using Online Dictionaries. Berlin, New York: De Gruyter.

RESEARCH INTO DICTIONARY USE


• Log files: Protocols of search requests or article look-ups.

• Varying amount of information:

Minimum: Article ID, Timestamp

User information, article history, technical information (e.g., browser, device), ...

Some log files are already aggregated (e.g., per hour).

• Observe the legal framework of your country: what kind of information are you allowed to use without explicit user consent?

LOG FILE ANALYSES


• Bergenholtz, H., & Johnsen, M. (2005). Log Files as a Tool for Improving Internet Dictionaries. Hermes, 34, 117–141.

• Bergenholtz, H., & Johnsen, M. (2007). Log files can and should be prepared for a functionalistic approach. Lexikos, 17, 1–21.

• Verlinde, S., & Binon, J. (2010). Monitoring Dictionary Use in the Electronic Age. In A. Dykstra & T. Schoonheim (Eds.), Proceedings of the XIV Euralex International Congress (pp. 1144–1151). Ljouwert: Afûk.

• Hult, A.-K. (2012). Old and New User Study Methods Combined ‒ Linking Web Questionnaires with Log Files from the Swedish Lexin Dictionary. In J. M. Torjusen & R. V. Fjeld (Eds.), Proceedings of the 15th EURALEX International Congress 2012 (pp. 922–928). Oslo: Universitetet i Oslo, Institutt for lingvistiske og nordiske studier. Retrieved from http://www.euralex.org/elx_proceedings/Euralex2012/pp922-928%20Hult.pdf

• Schoonheim, T., Tiberius, C., Niestadt, J., & Tempelaars, R. (2012). Dictionary Use and Language Games: Getting to Know the Dictionary as Part of the Game. In R. Vatvedt Fjeld & J. M. Torjusen (Eds.), Proceedings of the 15th EURALEX International Congress. 7-11 August 2012 (pp. 974–979). Oslo: Department of Linguistics and Scandinavian Studies, University of Oslo.

• De Schryver, G.-M., Joffe, D., Joffe, P., & Hillewaert, S. (2006). Do dictionary users really look up frequent words?—on the overestimation of the value of corpus-based lexicography. Lexikos, 16, 67–83.

• Koplenig, A., Meyer, P., & Müller-Spitzer, C. (2014). Dictionary users do look up frequent words. A log file analysis. In C. Müller-Spitzer (Ed.), Using Online Dictionaries (pp. 229–250). Berlin, Boston: De Gruyter.

LOG FILE ANALYSES: PREVIOUS RESEARCH


• The Wikimedia Foundation provides log files for all its sites, including all the different language editions of Wiktionary.

https://dumps.wikimedia.org/other/pagecounts-raw/

STUDIES USING WIKTIONARY LOG FILES


• One file per hour with all projects.

• Approx. 66 GB (gzipped) per month.

DATA PREPARATION


Downloaded files → Relevant rows (e.g., "de.d") → Daily aggregates → Weekly aggregates → Yearly aggregates

Additional information (some extracted from Wiktionary):

• part-of-speech

• # of senses

• headword frequency

• ...

DATA PREPARATION
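
The first steps of this pipeline (selecting the relevant rows and summing them into aggregates) could look roughly like the following Python sketch. It assumes the space-separated row layout `project page_title view_count bytes` for the hourly files; the function names and the sample rows are hypothetical and not taken from the talk.

```python
from collections import Counter


def relevant_rows(lines, project="de.d"):
    """Yield (page_title, views) for rows of one project, e.g. 'de.d'."""
    for line in lines:
        parts = line.split()
        # Assumed row layout: project, page title, view count, bytes.
        if len(parts) != 4 or parts[0] != project:
            continue
        if not parts[2].isdigit():
            continue  # skip rows with a malformed view count
        yield parts[1], int(parts[2])


def aggregate(hourly_files, project="de.d"):
    """Sum views per page over any number of hourly files."""
    totals = Counter()
    for lines in hourly_files:
        for page, views in relevant_rows(lines, project):
            totals[page] += views
    return totals


# Hypothetical sample rows standing in for two hourly files.
hour_1 = ["de.d Tribunal 3 40000", "en.d tribunal 7 90000", "de.d Grandezza 1 12000"]
hour_2 = ["de.d Tribunal 2 27000", "de Tribunal 9 99000"]
daily = aggregate([hour_1, hour_2])
print(daily["Tribunal"], daily["Grandezza"])
```

Weekly and yearly aggregates would then follow by adding up the daily `Counter` objects.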


Page          POS        Frequency  Visits 2013  Visits per 1 million visits
Tribüne       Noun          11,072      230,720                      1,723.3
fakultativ    Adjective        497      133,381                        996.3
Tribunal      Noun          11,072       61,728                        461.1
Grandezza     Noun           1,222       20,475                        153.0
reflektieren  Verb           7,961       19,736                        147.4
...           ...              ...          ...                          ...
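
The normalised column makes counts comparable across periods with different overall traffic. A minimal sketch of that scaling, assuming the denominator is the total number of visits in the same period (the table header does not spell this out); the function name and the numbers are hypothetical:

```python
def per_million(visits, total_visits):
    """Scale a raw visit count to visits per one million visits."""
    if total_visits <= 0:
        raise ValueError("total_visits must be positive")
    # Multiply first so that simple inputs stay exact in floating point.
    return visits * 1_000_000 / total_visits


# Hypothetical numbers, not the Wiktionary totals behind the table.
example = per_million(visits=250, total_visits=2_000_000)
print(example)
```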

• Are more frequent words visited more frequently?

• Are polysemic words visited more frequently than monosemic words?

• How can we investigate temporal effects on visiting frequency?

• What portions of Wiktionary stay “in the dark” (i.e., are not visited at all or very seldom)?

• Data basis: German language edition of Wiktionary

RESEARCH QUESTIONS


• If we compile a general dictionary from scratch, does it make sense to include more frequent words first?

• Analyses of Wiktionary and DWDS log files suggest: yes, words that occur more frequently in everyday language are also visited more frequently.

CORPUS AND LOOK-UP FREQUENCY


• Corpus frequency still matters if the most frequent words are excluded.

CORPUS AND LOOK-UP FREQUENCY


[Figure: the 10,000 most frequent words are set aside. A: 10,000 words randomly sampled from the rest; B: 10,000 most frequent words from the rest. Successful searches: 34% and 56%.]
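
The share of successful searches for a given word list can be sketched as below. This is a simplified illustration, not the procedure behind the figure: it skips the step of setting aside the top words, and all names and numbers are hypothetical.

```python
def coverage(lookups, headwords):
    """Share of all look-ups that hit a word contained in `headwords`."""
    total = sum(lookups.values())
    if total == 0:
        return 0.0
    hits = sum(n for word, n in lookups.items() if word in headwords)
    return hits / total


# Hypothetical look-up counts and corpus frequencies for five words.
lookups = {"a": 50, "b": 30, "c": 10, "d": 6, "e": 4}
corpus_freq = {"a": 900, "b": 700, "c": 300, "d": 20, "e": 5}

# Frequency-based list: the two most corpus-frequent words.
by_freq = sorted(corpus_freq, key=corpus_freq.get, reverse=True)
frequent_list = set(by_freq[:2])
# Stand-in for a random sample: two arbitrary, less frequent words.
sampled_list = {"d", "e"}

print(coverage(lookups, frequent_list), coverage(lookups, sampled_list))
```

In this toy data the frequency-based list covers a far larger share of look-ups than the arbitrary list of the same size.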

• Are polysemic words visited more often than monosemic words?

• Challenge: Polysemic words are also more frequent. So, we have to control for the effect of frequency just shown.

POLYSEMIC AND MONOSEMIC WORDS


[Figure legend: monosemic vs. polysemic]

POLYSEMIC AND MONOSEMIC WORDS


• Effect of frequency still visible.

• Effect of polysemy.

• Interaction effect: the polysemy contrast tends to be more pronounced in higher frequency bands (especially in the highest decile).
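
One simple way to control for frequency is to compare monosemic and polysemic words within the same frequency band. The sketch below does this with rank-based bands; it is an assumed illustration, not necessarily the analysis used in the talk, and the function names and toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def band_of(rank, n_words, n_bands=10):
    """Map a 0-based frequency rank (0 = rarest) to a band 0..n_bands-1."""
    return min(rank * n_bands // n_words, n_bands - 1)


def mean_visits_by_band(words, n_bands=10):
    """words: list of (corpus_freq, n_senses, visits) tuples.

    Returns {band: {"monosemic": mean, "polysemic": mean}} so that the
    polysemy contrast can be read off within each frequency band.
    """
    ranked = sorted(words, key=lambda w: w[0])
    groups = defaultdict(lambda: defaultdict(list))
    for rank, (_, n_senses, visits) in enumerate(ranked):
        label = "polysemic" if n_senses > 1 else "monosemic"
        groups[band_of(rank, len(ranked), n_bands)][label].append(visits)
    return {b: {k: mean(v) for k, v in g.items()} for b, g in groups.items()}


# Hypothetical toy data: (corpus frequency, number of senses, visits).
toy = [(1, 1, 10), (2, 2, 12), (50, 1, 40), (60, 3, 80)]
result = mean_visits_by_band(toy, n_bands=2)
print(result)
```

With `n_bands=10` the bands correspond to frequency deciles.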

• If we want to extract temporary effects, we have to take time into consideration.

Interactive visualisation (German Wiktionary, more to come): http://www.owid.de/plus/wikivi2015/

• We employed a trend-residualisation technique.

Calculate the current trend of visitation frequency.

Calculate the deviations from this trend (“residuals”) at specific points in time.
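
The two steps can be sketched as follows. How the trend is estimated is not stated on the slide; the mean of the preceding days used here is an assumption chosen for simplicity, and the function name and data are hypothetical.

```python
def trend_residuals(series, window=3):
    """Return (trend, residuals) for a list of visit counts.

    The trend at position i is the mean of the `window` preceding values
    (an assumed, deliberately simple choice); the residual is the observed
    value minus that trend. The first `window` positions have no trend.
    """
    trend, residuals = [], []
    for i, value in enumerate(series):
        if i < window:
            trend.append(None)
            residuals.append(None)
            continue
        past = series[i - window:i]
        t = sum(past) / window
        trend.append(t)
        residuals.append(value - t)
    return trend, residuals


# Hypothetical daily visits with a one-day spike.
visits = [10, 10, 10, 10, 70, 10]
trend, residuals = trend_residuals(visits)
print(residuals)
```

A large positive residual marks a day on which an article was visited far more often than its recent trend would suggest.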

TEMPORARY EFFECTS ON LOOK-UP FREQUENCY


TEMPORARY EFFECTS: EXAMPLE


TEMPORARY EFFECTS: EXAMPLES


TEMPORARY EFFECTS: ‘LARMOYANT’


„Der ist jetzt aber richtig sauer. Das passt dem gar nicht. Und wenn ich das richtig deute, blickt er da eher Richtung Toni Kroos. Das ist ihm ein bisschen zu larmoyant... und ... der ist vielleicht noch eher im Freundschaftsspielmodus …“

“He is really peeved now. That really doesn’t suit him. And if I interpret this correctly, he is looking in the direction of Toni Kroos. That’s a little too lachrymose for him ... and ... maybe he is still more in exhibition-match mode …”

• How many and which articles are not visited at all?

We consider the years 2013, 2014 and 2015.

Account for the fact that the number of articles is rising.

THE DARK SIDE OF WIKTIONARY
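
A minimal sketch of how never-visited articles might be identified. Restricting the check to articles that already existed before the period is one assumed way of accounting for the rising number of articles; the function name, the data layout and the sample entries are hypothetical.

```python
def never_visited(created, visits, period_start):
    """Articles that existed before `period_start` and have no recorded visit.

    created: {title: creation year}; visits: {title: total visits in period}.
    Articles created during the period are left out because they could not
    have been visited for the whole period.
    """
    return sorted(
        title
        for title, year in created.items()
        if year < period_start and visits.get(title, 0) == 0
    )


# Hypothetical article metadata and visit totals for 2013 to 2015.
created = {"alpha": 2006, "beta": 2010, "gamma": 2014}
visits = {"alpha": 1200, "gamma": 0}
dark = never_visited(created, visits, period_start=2013)
print(dark)
```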


• Approx. 25,000 articles were not visited during 2013, 2014 and 2015.

Mostly newer

Mostly non-German

German idioms

Inflected forms

THE DARK SIDE OF WIKTIONARY


• Log files are well suited to investigate effects on the “macro user” level:

Corpus frequency and look-up frequency

Polysemy and look-up frequency

Temporary effects

“Dark side” of dictionaries

Collaborative dictionaries: Look-up and revision frequency

SUMMARY


• Lew (2015b: 11-12): “[…] we need to be aware of the limitations of the approach.

One such limitation is that server log files will rarely tell us what the context of dictionary use is:

what activity the user is involved in,

what particular problem they are trying to solve,

and the levels of success and satisfaction achieved in the consultation.

Nothing is known about the user, either, such as their age, languages spoken, proficiency in them, or professional background. […]

Issues of data privacy can also be a limiting factor in log file analysis.”

OUTLOOK / LIMITATIONS


• Little can be inferred from a small number of log file events.

Research based on individual cases is virtually impossible.

Log file analyses work best if many cases are available for longer periods.

Quantitative methods

• Log files might be integrated with other methodologies to gain an even broader insight into dictionary usage.

Test hypotheses generated by log file analyses with methods that assess individual performances or preferences.

OUTLOOK / LIMITATIONS


THANK YOU.


BONUS SLIDE: REVISIONS


[Figure panels: English Wiktionary, German Wiktionary]