Using AI to keep Wikipedia open

Aaron Halfaker -- Wikimania 2018

Wikipedia-logo-v2.svg (CC-BY-SA 3.0)

https://meta.wikimedia.org/wiki/User:Halfak_(WMF) (CC-BY-SA 3.0)

Halfaker,_Aaron_Sept_2013.jpg (CC-BY-SA 3.0)

1. Openness vs. quality control (“AHHHHHH! Turn it OFF!”; new content, new editors)

2. Scaling review process (5s, 10m, 24h)

3. Using AI to re-shape new article review (ORES)

Part 1: Openness vs. quality control

AHHHHHH! Turn it OFF!

[Diagram: new content flows in from “The Internet”.]

OMG NEW CONTENT (enwiki)

● 160k edits to review / day[1]

● 1-2k new article creations / day

  ○ ~40k drafts need review & 200 new / day[2]

● 330k articles need cleanup[3]

● 1600 new editors / day[4]

1. http://quarry.wmflabs.org/query/4387

2. http://quarry.wmflabs.org/query/4386

3. http://quarry.wmflabs.org/query/4388

4. http://quarry.wmflabs.org/query/4389

[Diagram sequence: new content arrives from “The Internet”, raising the question “quality?”. “Curators” answer with “QUALITY!” by sorting new content into “good stuff” and “bad stuff”. As the volume of new content grows, a “...backlog...” forms in front of the curators and swells into “BACKLOG!”, ending in “AHHHHHH! Turn it OFF!”]

In summary...

● Few curators + Small amount of new content → Small backlog

● More curators + Large amount of new content → Small backlog

● More content than curators can handle → Pressure to close the wiki (“AHHHHHH! Turn it OFF!”)
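
The three cases in the summary can be sketched as a toy queue model. This is only an illustration of the dynamics, not something from the talk: the function name `simulate_backlog`, the arrival rates, the curator counts, and the per-curator capacity are all hypothetical numbers chosen to make the three cases visible.

```python
def simulate_backlog(arrivals_per_day, curators, reviews_per_curator_per_day, days):
    """Toy queue model: return the backlog size at the end of each day.

    All parameters are hypothetical. The model assumes every curator
    reviews a fixed number of items per day and nothing else changes.
    """
    capacity = curators * reviews_per_curator_per_day
    backlog = 0
    history = []
    for _ in range(days):
        backlog += arrivals_per_day          # new content arrives
        backlog -= min(backlog, capacity)    # curators review what they can
        history.append(backlog)
    return history


# Few curators, small amount of new content: the backlog stays empty.
small = simulate_backlog(arrivals_per_day=100, curators=5,
                         reviews_per_curator_per_day=20, days=30)

# More curators, large amount of new content: the backlog still stays empty.
scaled = simulate_backlog(arrivals_per_day=1500, curators=75,
                          reviews_per_curator_per_day=20, days=30)

# More content than curators can handle: the backlog grows every day.
overloaded = simulate_backlog(arrivals_per_day=1500, curators=50,
                              reviews_per_curator_per_day=20, days=30)

print(small[-1], scaled[-1], overloaded[-1])
```

With these made-up numbers the first two runs end with an empty backlog, while the third grows by 500 items a day and ends at 15000 after 30 days. That steadily growing number is the pressure to close the wiki.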

If we don’t get this right, it’s very tempting to close Wikipedia off.

OOjs_UI_icon_lock.svg (MIT license)

#1 reason listed: “This would reduce the workload on New Page Patrollers”

Part 2: Scaling review process

[Diagram: new content from “The Internet” is sorted over three review time scales.]

Strategy: Filter out everything in Special:RecentChanges

● Obviously bad stuff (5 seconds): racial slurs, curse words, key mashes

● Bad stuff (1-10 minutes): “___ is gay”, dubious claims, unexplained deletions

● Subtle problems (24+ hours): hoaxes, undue weight, MOS issues

● Good stuff

RC Patrollers

Hoax or Not

Is this edit good or bad?

Can you tell?

HOAX

HOAX

Insight: Routing based on interest makes subtle judgement calls easier.

New content → Watchlists → Topic-interested individuals

[Diagram: RC Patrollers remove “obviously bad stuff” (5 seconds) and “bad stuff” (1-10 minutes). What passes is “probably good” and goes on to Watchlists, which catch “subtle problems” (hoaxes, undue weight, MOS issues) over 24+ hours, leaving “good stuff”.]

RC Patrollers

Watchlists

https://commons.wikimedia.org/wiki/File:Flickr_-_Official_U.S._Navy_Imagery_-_Sailor%27s_daughter_operates_a_fire_hose_with_crew_member_assistance..jpg

AHHHHHH! Turn it OFF!

ORES.editquality.recentchanges_split_(enwiki).svg (CC-BY-SA 3.0)

Edit review filter

[Diagram: new content from “The Internet” passes three review stages at 5 seconds, 1-10 minutes, and 24+ hours, staffed by bots, RC patrollers, and watch-list-ers. The stages remove “obviously bad stuff”, “bad stuff”, and “subtle problems”. What passes each stage is “not obviously bad”, then “probably good”, then “good stuff”.]
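
A minimal sketch of how a filter like the one pictured might route edits once each edit has a model score. It assumes every edit carries a probability of being damaging between 0 and 1. The thresholds, the queue names, and the functions `route_edit` and `route_all` are hypothetical and are not taken from ORES or from the talk.

```python
from collections import defaultdict

# Hypothetical thresholds on a "probability the edit is damaging" score.
# These numbers are illustrative only; they are not taken from ORES.
OBVIOUSLY_BAD = 0.90
NOT_OBVIOUSLY_BAD = 0.20


def route_edit(damaging_probability):
    """Pick a review queue for one edit based on its score."""
    if not 0.0 <= damaging_probability <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    if damaging_probability >= OBVIOUSLY_BAD:
        return "bots"            # fast automated handling
    if damaging_probability >= NOT_OBVIOUSLY_BAD:
        return "rc_patrollers"   # a human takes a quick look
    return "watchlists"          # probably good; left to interested editors


def route_all(scored_edits):
    """Group (edit_id, score) pairs by the queue they are routed to."""
    queues = defaultdict(list)
    for edit_id, score in scored_edits:
        queues[route_edit(score)].append(edit_id)
    return dict(queues)


# Made-up edit ids and scores, for illustration.
example = [(101, 0.97), (102, 0.45), (103, 0.03), (104, 0.91)]
print(route_all(example))
```

The point of the split is that the fastest and cheapest reviewer sees the clearest cases, so slower human attention is spent only on what remains. Where the thresholds sit is a policy choice for each wiki, not a property of the model.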

Artificial.intelligence.jpg (public domain)

RC patrollers, watch-list-ers, bots

Geiger, R. S., & Halfaker, A. (2013, August). When the levee breaks: without bots, what happens to Wikipedia's quality control processes? In Proceedings of the 9th International Symposium on Open Collaboration (p. 6). ACM.

Edit review filter

Strainer_MET_98065.jpg (CC0)

Edit review process

● Fast

● High capacity

Reduces the pressure to close off contribution

New page patrol process

[Diagram sequence: a firehose of 1500 new articles per day arrives from “The Internet”. A single pool of patrollers sorts it into “good stuff”, “subtle problems”, “bad stuff”, and “really bad stuff”. A “Backlog” builds up in front of the patrollers, ending in “AHHHHHH! Turn it OFF!”]

https://commons.wikimedia.org/wiki/File:Strainer_MET_98065.jpg (CC0)

https://commons.wikimedia.org/wiki/File:Gootsteenstop.png (PD)

[Diagram: side-by-side comparison. Edit review process: content from “The Internet” is filtered in stages (“really bad stuff”, “bad stuff”, “subtle problems”). New page patrol process: the same categories sit behind a single “Backlog”.]

Part 3: Using AI to re-shape new article review

ORES

Obviously bad stuff, Bad stuff

● G3: Pure vandalism and blatant hoaxes

● G11: Unambiguous advertising or promotion

● G10: Attack pages

1.5k new pages

Subtle problems

Probably good edits → Watchlists → Topic-interested individuals

Very difficult to watch pages before they are created...

Topic modeling in ORES

https://en.wikipedia.org/w/index.php?title=Alan_Turing&oldid=234785 (CC-BY-SA)

[Diagram: ORES labels this revision with the topics Biography, Technology, Philosophy, and Europe.]

● Culture and arts: Arts (Music, Performing, Plastic, Visual), Broadcasting, Crafts and hobbies, Entertainment (Games and toys), Food and drink, Internet culture, Language and lit. (Linguistics, Biography), Media, Philosophy and rel., Sports

● Geography: Bodies of water, Cities, Countries (Africa, Americas, Asia, Europe, Oceania), Landforms, Maps, Parks & cons.

● History and society: History and society, Business and econ., Education, Military and warfare, Politics and gov., Transportation

● STEM: Science, Biology, Chemistry, Economics, Engineering, Geosciences, Medicine, Information science, Mathematics, Meteorology, Physics, Space, Technology, Time, Women's health

Hand-wavey Feature Engineering for my ML Friends

● Features: 300 cell vector based on word2vec

● Estimator: GradientBoosting (one vs rest strategy)
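
The two bullets above can be illustrated with a small, dependency-free sketch. Several things here are assumptions rather than the project's actual method: the slide only says the features are “based on word2vec”, so averaging word vectors into one document vector is a guess at a common approach; the tiny 3-cell `EMBEDDINGS` table stands in for a trained 300-cell model; and a plain dot product per topic stands in for the gradient boosting estimator. What the sketch does show faithfully is the one-vs-rest idea: each topic gets its own independent yes/no scorer, so an article can receive several topics at once.

```python
# Toy embedding table standing in for a trained word2vec model.
# Real vectors would have 300 cells; 3 are used here to stay readable.
EMBEDDINGS = {
    "computer": [0.9, 0.1, 0.0],
    "algorithm": [0.8, 0.2, 0.0],
    "born": [0.0, 0.9, 0.1],
    "mathematician": [0.5, 0.5, 0.0],
    "river": [0.0, 0.0, 1.0],
}

# One hypothetical binary scorer per topic (the "one vs rest" part).
# A dot product stands in for the gradient boosting estimator on the slide.
TOPIC_WEIGHTS = {
    "Technology": [1.0, 0.0, 0.0],
    "Biography": [0.0, 1.0, 0.0],
    "Geography": [0.0, 0.0, 1.0],
}


def document_vector(text):
    """Average the vectors of known words into one fixed-length vector."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * 3
    return [sum(cells) / len(vectors) for cells in zip(*vectors)]


def predict_topics(text, threshold=0.3):
    """Ask every topic scorer independently; keep topics above threshold."""
    vec = document_vector(text)
    scores = {
        topic: sum(w * x for w, x in zip(weights, vec))
        for topic, weights in TOPIC_WEIGHTS.items()
    }
    return sorted(t for t, s in scores.items() if s >= threshold)


print(predict_topics("Mathematician born in London who designed a computer algorithm"))
```

Because the document vector has a fixed length no matter how short the text is, the same scorers can run on a one-paragraph first draft as on a finished article.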

Hand-wavey statistics for my ML Friends

TL;DR: It really works well -- even on the drafty first revisions of articles!

The ORES vision for article draft review

Released: July, 2017

Released: March, 2018

Roger_Bamkin_y_Rosie_Stephenson-Goodknight_en_Wikimania_2015_22.JPG (CC-BY-SA 4.0)

Mary Wollstonecraft by John Opie (c. 1797).jpg (PD)

People like: [photo] + New Editors = More New Editors and less hard work for page patrollers

Disclaimer: Rosiestep has not reviewed or endorsed this presentation or the modeling project I have described.

I had one conversation with Rosiestep at Wikimania’17 and I happen to find her and her work inspiring -- so I thought she made for a great example.

She deserves some of the credit, but none of the blame for what I’m talking to you about.

We have the models...

Now we need product integration.

User:SQLBot

Thanks!

Props to Sumit Asthana for doing most of the hard work.

Aaron Halfaker

ahalfaker@wikimedia.org

EpochFail/halfak

SCIENCE

● Expanding beyond English Wikipedia

● Other workflows that are struggling

● Demo some of these AIs -- what is ORES anyway?