Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary use: Potentials and...
ScotLex-1, Edinburgh, 08.04.2016
Carolin Müller-Spitzer & Sascha Wolfer
A QUANTITATIVE VIEW ON DICTIONARY USE: POTENTIALS AND LIMITATIONS OF LOG FILE ANALYSES
• Lew (2015a): „Until fairly recently, dictionary users were not usually of central concern in the process of dictionary making […].”
• Advantages of focusing on the user:
Discover the challenges users face when accessing and using dictionaries (→ user instruction, usability)
Learn how users are working with the dictionary
Discover what users are interested in the most/least
Test preconceptions of the lexicographer about the users
User studies enable us to make better dictionaries.
RESEARCH INTO DICTIONARY USE
04.04.2016 ScotLex-1 - Müller-Spitzer & Wolfer - Log file analyses 2
Lew, R. (2015a). Dictionaries and their users. In P. Hanks & G.-M. De Schryver (Eds.), International Handbook of Modern Lexis and Lexicography (pp. 1–9). Berlin/Heidelberg: Springer.
• Main aim: Collect empirical data to gain insights into dictionary usage
• Multiple methods of data collection: (Web) questionnaires, eye tracking studies, usability studies, log file analyses, …
• Choice of method depends on the research question we want to address.
Lew, R. (2015b). Opportunities and limitations of user studies. In C. Tiberius & C. Müller-Spitzer (Eds.), Research into dictionary use / Wörterbuchbenutzungsforschung. 5. Arbeitsbericht des wissenschaftlichen Netzwerks „Internetlexikografie“ (Vol. 2/2015, pp. 6–16). Mannheim: Institut für Deutsche Sprache. Retrieved from http://pub.ids-mannheim.de/laufend/opal/pdf/opal15-2.pdf
Müller-Spitzer, C. (Ed.) (2014). Using Online Dictionaries. Berlin, Boston: de Gruyter.
RESEARCH INTO DICTIONARY USE
• Log files: Protocols of search requests or article look-ups.
• Varying amount of information:
Minimum: Article ID, Timestamp
User information, article history, technical information (e.g., browser, device), ...
Some log files are already aggregated (e.g., per hour).
• Take care of the legal framework of your country: What kind of information are you allowed to use without explicit user consent?
LOG FILE ANALYSES
• Bergenholtz, H., & Johnsen, M. (2005). Log Files as a Tool for Improving Internet Dictionaries. Hermes, 34, 117–141.
• Bergenholtz, H., & Johnsen, M. (2007). Log files can and should be prepared for a functionalistic approach. Lexikos, 17, 1–21.
• Verlinde, S., & Binon, J. (2010). Monitoring Dictionary Use in the Electronic Age. In A. Dykstra & T. Schoonheim (Eds.), Proceedings of the XIV Euralex International Congress (pp. 1144–1151). Ljouwert: Afûk.
• Hult, A.-K. (2012). Old and New User Study Methods Combined ‒ Linking Web Questionnaires with Log Files from the Swedish Lexin Dictionary. In J. M. Torjusen & R. V. Fjeld (Eds.), Proceedings of the 15th EURALEX International Congress 2012 (pp. 922–928). Oslo: Universitetet i Oslo, Institutt for lingvistiske og nordiske studier. Retrieved from http://www.euralex.org/elx_proceedings/Euralex2012/pp922-928%20Hult.pdf
• Schoonheim, T., Tiberius, C., Niestadt, J., & Tempelaars, R. (2012). Dictionary Use and Language Games: Getting to Know the Dictionary as Part of the Game. In R. Vatvedt Fjeld & J. M. Torjusen (Eds.), Proceedings of the 15th EURALEX International Congress. 7-11 August 2012 (pp. 974–979). Oslo: Department of Linguistics and Scandinavian Studies, University of Oslo.
• De Schryver, G.-M., Joffe, D., Joffe, P., & Hillewaert, S. (2006). Do dictionary users really look up frequent words?—on the overestimation of the value of corpus-based lexicography. Lexikos, 16, 67–83.
• Koplenig, A., Meyer, P., & Müller-Spitzer, C. (2014). Dictionary users do look up frequent words. A log file analysis. In C. Müller-Spitzer (Ed.), Using Online Dictionaries (pp. 229–250). Berlin, Boston: de Gruyter.
LOG FILE ANALYSES: PREVIOUS RESEARCH
• The Wikimedia Foundation provides log files for all their sites, including all the different language editions of Wiktionary.
https://dumps.wikimedia.org/other/pagecounts-raw/
STUDIES USING WIKTIONARY LOG FILES
• One file per hour with all projects.
• Approx. 66 GB (gzipped) per month.
DATA PREPARATION
Downloaded files → relevant rows (e.g. „de.d“) → daily aggregates → weekly aggregates → yearly aggregates
Additional information (some extracted from Wiktionary):
• part-of-speech
• # of senses
• headword frequency
• ...
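The preparation steps above (keep only the relevant rows, then aggregate) could be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: it assumes each line of an hourly file has four space-separated fields (project code, page title, view count, response size), and the sample lines are made up. The function names `parse_line` and `aggregate` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def parse_line(line):
    """Split one log line into (project, page, views); return None if malformed.

    ASSUMPTION: four space-separated fields per line
    (project code, page title, view count, response size).
    """
    parts = line.split(" ")
    if len(parts) != 4 or not parts[2].isdigit():
        return None
    project, page, views, _size = parts
    return project, page, int(views)


def aggregate(hourly_files, project="de.d"):
    """Sum views per page and day, and per page and ISO week.

    hourly_files: iterable of (timestamp, lines) pairs, one pair per hourly file.
    """
    daily = defaultdict(int)
    weekly = defaultdict(int)
    for stamp, lines in hourly_files:
        year, week, _weekday = stamp.isocalendar()
        for line in lines:
            record = parse_line(line.rstrip("\n"))
            if record is None or record[0] != project:
                continue  # skip malformed rows and other projects
            _project, page, views = record
            daily[(stamp.date(), page)] += views
            weekly[((year, week), page)] += views
    return daily, weekly


# Made-up sample data standing in for three hourly files.
sample = [
    (datetime(2013, 1, 1, 0), ["de.d Tribüne 3 12000", "en.d tribune 5 9000"]),
    (datetime(2013, 1, 1, 1), ["de.d Tribüne 2 8000", "de.d fakultativ 1 4000"]),
    (datetime(2013, 1, 8, 0), ["de.d Tribüne 4 16000"]),
]
daily, weekly = aggregate(sample)
```

Yearly aggregates would follow the same pattern with the year as the key. With real data, the lines would be streamed from the gzipped files rather than held in memory.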
DATA PREPARATION
Page          POS        Frequency  Visits 2013  Visits per 1 million visits
Tribüne       Noun          11,072      230,720                      1,723.3
fakultativ    Adjective        497      133,381                        996.3
Tribunal      Noun          11,072       61,728                        461.1
Grandezza     Noun           1,222       20,475                        153.0
reflektieren  Verb           7,961       19,736                        147.4
...           ...              ...          ...                          ...
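The "visits per 1 million visits" column is a plain normalisation of the raw counts. A minimal sketch follows. The overall total is not stated on the slide, so the value used here (about 133.9 million) is only an assumption, chosen because the table rows roughly imply it.

```python
def visits_per_million(visits, total_visits):
    """Scale a raw visit count to visits per 1 million visits overall."""
    return visits / total_visits * 1_000_000


# ASSUMPTION: the overall number of visits in 2013 is not given in the source;
# roughly 133.9 million is only what the table rows approximately imply.
TOTAL_VISITS_2013 = 133_900_000

visits_2013 = {"Tribüne": 230_720, "fakultativ": 133_381, "Tribunal": 61_728}
rates = {
    page: visits_per_million(n, TOTAL_VISITS_2013)
    for page, n in visits_2013.items()
}
```

Normalising this way keeps counts comparable across periods in which overall traffic differs.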
• Are more frequent words visited more frequently?
• Are polysemic words visited more frequently than monosemic words?
• How can we investigate temporal effects on visiting frequency?
• What portions of Wiktionary stay „in the dark“ (i.e., are not visited at all or very seldom)?
• Data base: German language edition of Wiktionary
RESEARCH QUESTIONS
• If we compile a general dictionary from scratch, does it make sense to include more frequent words first?
• Analyses of Wiktionary and DWDS log files suggest: Yes, words that occur more frequently in everyday language are also visited more frequently.
CORPUS AND LOOK-UP FREQUENCY
• Corpus frequency still matters if the most frequent words are excluded.
CORPUS AND LOOK-UP FREQUENCY
[Figure, not reproduced. Labels: 10,000 most frequent words; A; B; 10,000 words randomly sampled from rest; 10,000 most frequent words from rest; 34%; 56% successful searches.]
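The kind of comparison labelled above can be illustrated with a small sketch: start from the most frequent words, extend the list either by a random sample or by the next most frequent words, and measure what share of look-ups each list covers. The data below are synthetic, the list sizes are scaled down from 10,000 to 10, and treating a "successful search" as a look-up whose word is on the list is an assumption made for this example.

```python
import random


def coverage(headwords, lookups):
    """Share of all look-ups whose word is contained in `headwords`."""
    total = sum(lookups.values())
    hits = sum(n for word, n in lookups.items() if word in headwords)
    return hits / total


# Synthetic data: 100 words whose look-up counts grow with corpus frequency.
corpus_freq = {f"w{i}": 100 - i for i in range(100)}
lookups = {word: 2 * freq for word, freq in corpus_freq.items()}

ranked = sorted(corpus_freq, key=corpus_freq.get, reverse=True)
base, rest = ranked[:10], ranked[10:]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
ext_random = set(base) | set(rng.sample(rest, 10))   # random extension
ext_frequent = set(base) | set(rest[:10])            # next most frequent words

cov_base = coverage(set(base), lookups)
cov_random = coverage(ext_random, lookups)
cov_frequent = coverage(ext_frequent, lookups)
```

Because look-ups rise with frequency in this toy data, the frequency-based extension can never cover less than the random one. Whether that holds for a real dictionary is the empirical question.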
• Are polysemic words visited more often than monosemic words?
• Challenge: Polysemic words are also more frequent. So, we have to control for the effect of frequency just shown.
POLYSEMIC AND MONOSEMIC WORDS
[Figure, not reproduced. Legend: monosemic, polysemic.]
POLYSEMIC AND MONOSEMIC WORDS
• Effect of frequency still visible.
• Effect of polysemy
• Interaction effect: Polysemy contrast tends to be more pronounced in higher frequency bands (especially in the highest decile)
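One simple way to look at polysemy while holding frequency roughly constant is to split the headwords into frequency deciles and compare monosemic and polysemic words within each band. The sketch below is descriptive only and is not necessarily the analysis behind the slide. The field names `freq`, `senses` and `visits` and the toy entries are assumptions.

```python
from statistics import mean


def decile_of(rank, n):
    """Map a 0-based rank among n items to a decile from 1 to 10."""
    return rank * 10 // n + 1


def polysemy_contrast(entries):
    """Mean visits of monosemic vs. polysemic words per frequency decile.

    entries: list of dicts with the (hypothetical) keys
    'freq', 'senses' and 'visits'.
    Returns {decile: (mean_monosemic, mean_polysemic)}; None marks an empty group.
    """
    ordered = sorted(entries, key=lambda e: e["freq"])
    n = len(ordered)
    bands = {}
    for rank, entry in enumerate(ordered):
        band = bands.setdefault(decile_of(rank, n), {"mono": [], "poly": []})
        group = "poly" if entry["senses"] > 1 else "mono"
        band[group].append(entry["visits"])
    return {
        decile: (
            mean(band["mono"]) if band["mono"] else None,
            mean(band["poly"]) if band["poly"] else None,
        )
        for decile, band in bands.items()
    }


# Made-up entries: 20 words, alternating between one and two senses.
entries = [
    {"freq": i + 1, "senses": 1 + i % 2, "visits": 10 * (i + 1) + 5 * (i % 2)}
    for i in range(20)
]
contrast = polysemy_contrast(entries)
```

An interaction like the one described above would show up as a gap between the two means that widens towards the upper deciles.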
• If we want to extract temporary effects, we have to take time into consideration.
Interactive visualisation (German Wiktionary, more to come): http://www.owid.de/plus/wikivi2015/
• We employed a trend-residualisation technique.
Calculate the current trend of visitation frequency.
Calculate the deviations from this trend („residuals“) at specific points in time.
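The two steps above can be illustrated with a small sketch. The slides do not say how the trend is estimated, so the trailing mean used here is only a stand-in, and the weekly counts are made up.

```python
def trailing_mean_trend(series, window=4):
    """Trend at time t = mean of up to `window` preceding values.

    ASSUMPTION: a trailing mean stands in for whatever trend estimate
    was actually used. The first point has no history, so its trend is None.
    """
    trend = []
    for t in range(len(series)):
        past = series[max(0, t - window):t]
        trend.append(sum(past) / len(past) if past else None)
    return trend


def residuals(series, trend):
    """Deviation of each observation from its trend value (None if no trend)."""
    return [None if tr is None else obs - tr for obs, tr in zip(series, trend)]


# Made-up weekly visit counts with one sudden peak in week 5.
weekly_visits = [20, 22, 19, 21, 20, 180, 40, 22, 21]
trend = trailing_mean_trend(weekly_visits)
resid = residuals(weekly_visits, trend)

# Week with the largest positive deviation from the trend.
peak_week = max(
    (i for i, r in enumerate(resid) if r is not None),
    key=lambda i: resid[i],
)
```

A large positive residual marks a week in which a word was looked up far more often than its recent history would suggest.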
TEMPORARY EFFECTS ON LOOK-UP FREQUENCY
TEMPORARY EFFECTS: ‚LARMOYANT‘
„Der ist jetzt aber richtig sauer. Das passt dem gar nicht. Und wenn ich das richtig deute, blickt er da eher Richtung Toni Kroos. Das ist ihm ein bisschen zu larmoyant... und ... der ist vielleicht noch eher im Freundschaftsspielmodus …“
„He is really peeved now. That really doesn’t suit him. And if I interpret this correctly, he is looking in the direction of Toni Kroos. That’s a little too lachrymose for him. And ... maybe he’s more in exhibition mode …“
• How many and which articles are not visited at all?
We consider the years 2013, 2014 and 2015.
Account for the fact that the number of articles is rising.
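A minimal sketch of how unvisited articles could be identified follows. Counting only articles that already existed at the start of the period is one possible way to account for the growing number of articles, and it is an assumption of this example. The titles, dates and counts are made up, and `never_visited` is a hypothetical helper.

```python
from datetime import date


def never_visited(created, yearly_visits, period_start):
    """Titles that existed before `period_start` and had no visits in any year.

    created: {title: creation date}
    yearly_visits: {year: {title: visit count}}
    ASSUMPTION: articles created during the period are left out, so that
    newer articles are not counted as "dark" merely because they are new.
    """
    visited = {
        title
        for counts in yearly_visits.values()
        for title, n in counts.items()
        if n > 0
    }
    return sorted(
        title
        for title, created_on in created.items()
        if created_on < period_start and title not in visited
    )


# Made-up example data.
created = {
    "Haus": date(2010, 5, 1),
    "larmoyant": date(2011, 3, 2),
    "xyz-form": date(2012, 7, 9),
    "neu": date(2014, 6, 1),
}
yearly_visits = {
    2013: {"Haus": 500, "larmoyant": 30},
    2014: {"Haus": 450, "larmoyant": 900},
    2015: {"Haus": 470},
}
dark = never_visited(created, yearly_visits, date(2013, 1, 1))
```

Here only "xyz-form" is reported: "neu" was also never visited, but it did not exist for the whole period.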
THE DARK SIDE OF WIKTIONARY
• Approx. 25,000 articles were not visited during 2013, 2014 and 2015.
Mostly newer
Mostly non-German
German idioms
Inflected forms
THE DARK SIDE OF WIKTIONARY
• Log files are well suited to investigate effects on the „macro user“ level:
Corpus frequency and look-up frequency
Polysemy and look-up frequency
Temporary effects
„Dark side“ of dictionaries
Collaborative dictionaries: Look-up and revision frequency
…
SUMMARY
• Lew (2015b: 11-12): „[…] we need to be aware of the limitations of the approach.
One such limitation is that server log files will rarely tell us what the context of dictionary use is:
what activity the user is involved in,
what particular problem they are trying to solve,
and the levels of success and satisfaction achieved in the consultation.
Nothing is known about the user, either, such as their age, languages spoken, proficiency in them, or professional background. […]
Issues of data privacy can also be a limiting factor in log file analysis.“
OUTLOOK / LIMITATIONS
• Little can be inferred from a small number of log file events.
Research based on individual cases is virtually impossible.
Log file analyses work best if many cases are available for longer periods.
→ Quantitative methods
• Log files might be integrated with other methodologies to gain an even broader insight into dictionary usage.
Test hypotheses generated by log file analyses with methods that assess individual performances or preferences.
OUTLOOK / LIMITATIONS