Crowdsourcing Ling 240

What is crowdsourcing?

Crowdsourcing: definition
“the practice of obtaining information or services by soliciting input from a large number of [non-expert] people, typically via the internet” (OED)

Examples:
• Wikipedia
• Google Translate
• FamilySearch Indexing

COCA's registers based on publication type

Crowdsourcing
• What are the benefits of collecting data through crowdsourcing?
• What are the limitations/weaknesses?
• What can be done to ensure that crowdsourcing workers are giving quality data?

Crowdsourcing in linguistics
• Wilhelm Kaeding (1897)
  • Thousands of non-experts helped compile and analyze an 11 million word corpus of German
• Oxford English Dictionary (1858 – 1928)
  • Hundreds of non-expert readers submitted 6 million quotation slips
• Perceptual dialectology
  • Dialect perceptions elicited from non-experts

Mechanical Turk (Amazon)
• Strengths
  • Inexpensive
  • Fast
  • Quality control
  • Access to thousands of people
• Growing body of research strongly supports the quality of MTurk data
  • E.g., Buhrmester et al., 2011; Kittur et al., 2008; Suri & Watts, 2011; Urbano et al., 2010

Case study--

Register classification
• Traditional ‘user’-based approach
  • ‘Expert’ classifies texts into registers by simply sampling from the publication type of interest
• Limitations
  • ‘Publication type’ is not a meaningful criterion for web documents
  • Experts can’t agree on register category for internet texts

Corpus
• Extracted from the Corpus of Global Web-based English (GloWbE), constructed by Mark Davies
• (Near) random sampling methods used to build the corpus
  • Google searches of highly frequent English 3-grams (e.g., is not the, and from the) used to identify URLs
  • 800-1000 links for each n-gram (i.e., 80-100 Google results pages)
• Davies randomly extracted c. 49,300 URLs from GloWbE
  • Only web pages from USA, UK, Canada, Aus., and NZ
  • Documents < 75 words were excluded
  • Non-textual material was removed from all web pages (HTML scrubbing and boilerplate removal) using JusText
• 1,445 URLs were excluded from subsequent analysis because they consisted mostly of photos or graphics.
• Final corpus for the study: 48,555 web documents.
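The country and length filters listed on this slide could be sketched roughly as follows. This is only an illustration: the `country` and `text` field names, the country codes, and the whitespace-based word count are assumptions, not details given in the slides.

```python
# Hypothetical sketch of the two document filters named on the slide.
ALLOWED_COUNTRIES = {"US", "GB", "CA", "AU", "NZ"}  # assumed country codes
MIN_WORDS = 75  # documents with fewer than 75 words are excluded


def keep_document(doc: dict) -> bool:
    """Return True if a document passes the country and length filters."""
    # Assumption: words are counted by splitting on whitespace.
    word_count = len(doc["text"].split())
    return doc["country"] in ALLOWED_COUNTRIES and word_count >= MIN_WORDS


# Toy documents, invented for illustration only.
docs = [
    {"country": "US", "text": "word " * 80},   # long enough, allowed country
    {"country": "US", "text": "too short"},    # under 75 words
    {"country": "IN", "text": "word " * 200},  # country not in the sample
]
kept = [d for d in docs if keep_document(d)]
```

Only the first toy document survives both filters.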

People were asked to determine the mode of the passage, then the participants, purpose, etc. This led to 7 sub-registers.

Crowdsourcing end-user data: Classification
• Developed a computer-adaptive survey for register classification
• Tested the tool through 10 rounds of piloting, resulting in numerous revisions
• Recruited 908 raters through Mechanical Turk
• 6 responses x 4 raters x 49,300 texts ≈ 1.2 million individual ratings

Agreement results for the general register classification of 48,147 web documents (Fleiss’ Kappa = .47, moderate agreement)

• 69% of documents achieved majority agreement
• Additional 11.8% are potential 2-way hybrids

Pattern         Documents   Percent
4 agree            17,511     36.4%
3 agree            15,684     32.6%
2-2 split           5,682     11.8%
2-1-1 split         8,515     17.7%
No agreement          755      1.6%
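A minimal sketch of how four ratings per document can be sorted into the agreement patterns above, and how Fleiss' kappa is computed from them. The counts are the ones reported on the slide; the function names are hypothetical, and this is not the study's own code.

```python
from collections import Counter

# Counts as reported on the slide (4 raters per document).
SLIDE_COUNTS = {
    "4 agree": 17511,
    "3 agree": 15684,
    "2-2 split": 5682,
    "2-1-1 split": 8515,
    "No agreement": 755,
}
TOTAL = sum(SLIDE_COUNTS.values())
PERCENTS = {k: round(100 * v / TOTAL, 1) for k, v in SLIDE_COUNTS.items()}

# Vote splits for four raters, largest group first.
PATTERNS = {
    (4,): "4 agree",
    (3, 1): "3 agree",
    (2, 2): "2-2 split",
    (2, 1, 1): "2-1-1 split",
    (1, 1, 1, 1): "No agreement",
}


def agreement_pattern(labels):
    """Name the vote split for one document's four ratings."""
    counts = tuple(sorted(Counter(labels).values(), reverse=True))
    return PATTERNS[counts]


def fleiss_kappa(ratings):
    """Fleiss' kappa for per-document label lists with equal raters per document."""
    n = len(ratings[0])   # raters per document
    N = len(ratings)      # number of documents
    totals = Counter()    # how often each category was chosen overall
    p_bar = 0.0           # mean observed agreement per document
    for labels in ratings:
        c = Counter(labels)
        totals.update(c)
        p_bar += (sum(v * v for v in c.values()) - n) / (n * (n - 1))
    p_bar /= N
    p_e = sum((t / (N * n)) ** 2 for t in totals.values())  # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```

For instance, `agreement_pattern(["a", "a", "b", "b"])` yields the 2-2 split that the slides treat as a potential hybrid register.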

Frequencies of general register categories (i.e., documents where 3 or 4 raters were in agreement)

Systematic patterns of disagreement

• 28 different 2-2 combinations are possible in theory

• But, only 7 of those combinations occurred > 100 times in our corpus of 48,000 documents

• Because these are widely attested user-based patterns, we are able to interpret disagreement as a special pattern of agreement
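The figure of 28 possible 2-2 combinations is what you get from choosing 2 categories out of 8, so it implies eight general register categories. A quick check, with placeholder category names that are hypothetical rather than the study's labels:

```python
from itertools import combinations
from math import comb

# Assumption: eight general register categories (inferred from the 28 pairs);
# the names below are placeholders, not the study's category labels.
categories = [f"register_{i}" for i in range(1, 9)]

# Every unordered pair of distinct categories is one possible 2-2 split.
pairs = list(combinations(categories, 2))
n_pairs = len(pairs)          # 28
n_expected = comb(8, 2)       # 8 * 7 / 2 = 28
```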

Frequencies of 2-way hybrids that occur 100+ times

Multi-Dimensional analysis
• Factor analysis to identify dimensions based on co-occurrence among a large set of linguistic features
• Interpret dimensions functionally
• Calculate scores for each text on each dimension

Features adopted from Biber:

Positive features:

Verbs: present tense verbs, mental verbs, do as pro-verb, be as main verb, possibility modals

Pronouns: 1st person pronouns, 2nd person pronouns, it, demonstrative pronouns, indefinite pronouns

Adverbs: general emphatics, hedges, amplifiers

Dependent clauses: that complement clauses (with that deletion), causative adverbial clauses, WH clauses

Other: contractions, analytic negation, discourse particles, sentence relatives, WH questions, clause coordination

Negative features: nouns, long words, prepositional phrases, attributive adjectives, lexical diversity
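In multi-dimensional analysis, a text's score on a dimension is typically the sum of its standardized (z-scored) rates for the positive features minus the sum for the negative features. A minimal sketch, assuming per-text feature rates are already available as dictionaries; the feature names, toy numbers, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def dimension_scores(texts, positive, negative):
    """Score each text: sum of z-scores for positive features minus negative ones.

    texts: list of dicts mapping feature name -> rate (e.g. per 1,000 words).
    """
    features = positive + negative
    mu = {f: mean([t[f] for t in texts]) for f in features}
    # Assumption: population SD; a sample SD would work the same way.
    sd = {f: pstdev([t[f] for t in texts]) for f in features}

    def z(text, feature):
        # A feature with no variation contributes nothing to the score.
        if sd[feature] == 0:
            return 0.0
        return (text[feature] - mu[feature]) / sd[feature]

    return [
        sum(z(t, f) for f in positive) - sum(z(t, f) for f in negative)
        for t in texts
    ]


# Toy rates, invented for illustration only.
toy_texts = [
    {"contractions": 30, "nouns": 200},  # more involved, conversational
    {"contractions": 10, "nouns": 300},  # more informational
]
toy_scores = dimension_scores(toy_texts, ["contractions"], ["nouns"])
```

With these toy numbers the first text gets a positive score and the second a negative one, which is the kind of contrast plotted along Dimension 1.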

The results
• Linguistic (use-based) variation across user-based register categories

Web registers along Dimension 1

Web registers along Dimension 1

What have we learned?
• Non-expert users can reliably classify web documents
• At least 1 in 10 internet texts belongs to a hybrid register category
• Publication type ≠ register (at least for the web)
  • E.g., blogs showed up in several register categories
• Triangulating end-user classifications with linguistic analysis gives us a more complete understanding of register variation on the web

Web register research: Next steps
• Comprehensive linguistic description of the patterns of register variation on the web
• A new multi-dimensional analysis of web registers
• Detailed linguistic descriptions of ‘unique’ web registers
• Automatic prediction of register (‘AGI’)
• Automatically coded large corpus of web documents
• Extend descriptions to include ‘private’ web registers

Areas for future user-based research
• Register classification of printed texts
• Reader/listener perceptions
• Corpus annotation
• Word sense disambiguation

5. The future of crowdsourcing in user-based linguistics
• User-based analyses have always happened; now we can do them in a more valid way using crowdsourcing
• Triangulating use-based linguistic data offers a more complete understanding of discourse
• Linguists are often unable to fully analyze and interpret patterns in use-based datasets, particularly those that are very large
• Harnessing the power of user-based data via crowdsourcing could help us tackle big, difficult problems in linguistics

Mechanical Turk
• The name comes from an 18th century machine that played chess.
• A person actually hid inside and played

Mechanical Turk
• Amazon's Mechanical Turk is a crowdsourcing tool.
• Researchers who need human evaluation can get data
• People who want to make some money help with the project (less than minimum wage)
  – Image recognition
  – Speech processing
  – Subjective evaluation
  – Giving opinions
  – Tagging corpora
  – Match picture with product

Mechanical Turk
• Example: word sense disambiguation in corpora
  – What should head be tagged as? Noun or verb?
  – What does head mean in a sentence?
    • They charged the head of finances with the crime. (person with office)
    • The beer was flat with no head. (froth)
    • They were going head first (manner of movement)
• Computers can't do it well but people can
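One common way to turn several workers' sense labels into a single tag is a majority vote. The sketch below assumes each sentence was labeled by a handful of workers; the function name, the threshold, and the example labels are hypothetical.

```python
from collections import Counter


def majority_sense(labels, min_share=0.5):
    """Return the most common label if it has more than min_share of the votes.

    Returns None when no label clears the threshold (e.g. a tie).
    """
    sense, votes = Counter(labels).most_common(1)[0]
    return sense if votes / len(labels) > min_share else None


# Invented worker labels for "The beer was flat with no head."
beer_labels = ["froth", "froth", "person with office", "froth"]
beer_sense = majority_sense(beer_labels)  # "froth" (3 of 4 votes)

# A 2-2 tie does not clear the threshold.
tied = majority_sense(["froth", "froth", "manner of movement", "manner of movement"])
```

Items that come back as `None` can be sent out for more ratings or reviewed by hand.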

How does it work?

Couldn't people cheat?
• After reviewing results the requester can reject a worker
• When rejected, they don't get paid
• Workers have approval rates
• Requesters can choose only workers with good rates
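The approval-rate idea can be illustrated with a few lines of code. This sketch does not use the Mechanical Turk API; the worker records and the 95% cutoff are invented for illustration.

```python
def approval_rate(approved: int, rejected: int) -> float:
    """Share of a worker's reviewed tasks that were approved."""
    reviewed = approved + rejected
    return approved / reviewed if reviewed else 0.0


# Hypothetical worker histories: worker id -> (approved, rejected).
workers = {
    "worker_a": (980, 20),   # 98% approved
    "worker_b": (450, 50),   # 90% approved
    "worker_c": (0, 0),      # no history yet
}

MIN_RATE = 0.95  # assumed cutoff; requesters choose their own

eligible = [
    worker_id
    for worker_id, (approved, rejected) in workers.items()
    if approval_rate(approved, rejected) >= MIN_RATE
]
```

Here only `worker_a` would be allowed to take the task.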

Advantages
• Thousands of potential workers available
• You can get results fast
• Demographic variety (not just undergrads)
• Cheap (average $1.40 per hour)

Disadvantages
• Cheating
  • Some studies show it's at same rates as in lab
  • Ways to test
    • “While exercising how often have you had a fatal heart attack?”
• It requires money
• Can't do many types of experiments (e.g., reaction time, RT)
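A check question like the one above has only one sensible answer, so responses that give anything else can be screened out. A minimal sketch, assuming each response is a dictionary; the key name and answer wording are hypothetical.

```python
# Hypothetical key for the check item in each response record.
CHECK_ITEM = "fatal_heart_attack_frequency"
EXPECTED_ANSWER = "Never"  # the only answer an attentive respondent can give


def passes_check(response: dict) -> bool:
    """True if the worker answered the check question sensibly."""
    return response.get(CHECK_ITEM) == EXPECTED_ANSWER


# Invented responses for illustration.
responses = [
    {"worker": "w1", CHECK_ITEM: "Never"},
    {"worker": "w2", CHECK_ITEM: "Sometimes"},  # not reading the question
    {"worker": "w3"},                           # skipped the check item
]
usable = [r["worker"] for r in responses if passes_check(r)]
```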

Go look at it

Mechanical Turk website