Using AI to keep Wikipedia open
Aaron Halfaker -- Wikimania 2018
Wikipedia-logo-v2.svg (CC-BY-SA 3.0)
https://meta.wikimedia.org/wiki/User:Halfak_(WMF) (CC-BY-SA 3.0)
Halfaker,_Aaron_Sept_2013.jpg (CC-BY-SA 3.0)
1. Openness vs. quality control ("AHHHHHH! Turn it OFF!"; New content, New editors)
2. Scaling review process (5s, 10m, 24h)
3. Using AI to re-shape new article review (ORES)
Part 1: Openness vs. quality control
OMG NEW CONTENT (enwiki)
● 160k edits to review / day[1]
● 1-2k new article creations / day
  ○ ~40k drafts need review & 200 new / day[2]
● 330k articles need cleanup[3]
● 1600 new editors / day[4]
1. http://quarry.wmflabs.org/query/4387
2. http://quarry.wmflabs.org/query/4386
3. http://quarry.wmflabs.org/query/4388
4. http://quarry.wmflabs.org/query/4389
[Diagram sequence: New content from The Internet → “curators” → Good stuff / Bad stuff. Across the builds the labels go from "quality?" to "QUALITY!", then "...backlog..." appears, and the last build reads "BACKLOG!" and "AHHHHHH! Turn it OFF!"]
In summary...
● Few curators + Small amount of new content → Small backlog
● More curators + Large amount of new content → Small backlog
● More content than curators can handle → Pressure to close the wiki ("AHHHHHH! Turn it OFF!")
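The summary above can be read as simple queue arithmetic: each day the backlog grows by whatever arrives and shrinks by whatever the curators can review. The sketch below is only an illustration of that reading, not something from the talk; the function name `simulate_backlog` and the review capacities are hypothetical, and only the 1500 new articles per day figure comes from the slides.

```python
from typing import List


def simulate_backlog(new_per_day: int, reviews_per_day: int,
                     days: int, start: int = 0) -> List[int]:
    """Return the backlog size at the end of each day.

    Assumption (not from the talk): arrivals and review capacity are
    constant, and the backlog can never drop below zero.
    """
    backlog = start
    history = []
    for _ in range(days):
        # Unreviewed items carry over; surplus capacity is simply unused.
        backlog = max(0, backlog + new_per_day - reviews_per_day)
        history.append(backlog)
    return history


# Few curators, small amount of new content: capacity keeps up.
small_wiki = simulate_backlog(new_per_day=100, reviews_per_day=120, days=5)

# More content than curators can handle (capacity of 1300 is made up).
overloaded = simulate_backlog(new_per_day=1500, reviews_per_day=1300, days=5)
```

Under these made-up numbers the first backlog stays at zero while the second grows by 200 items every day, which is the situation that creates pressure to close the wiki.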
If we don’t get this right, it’s very tempting to close Wikipedia off.
OOjs_UI_icon_lock.svg (MIT license)
#1 reason listed: “This would reduce the workload on New Page Patrollers”
Part 2: Scaling review process
Strategy: Filter out everything in Special:RecentChanges
[Diagram: New content from The Internet is filtered at three time scales (5 seconds, 1-10 minutes, 24+ hours) into Obviously bad stuff, Bad stuff, Subtle problems, and Good stuff. Examples shown in successive builds:]
● Racial slurs, curse words, key mashes
● “___ is gay”, dubious claims, unexplained deletions
● Hoaxes, undue weight, MOS issues
[Final build: RC Patrollers; Hoaxes, undue weight, MOS issues]
Hoax or Not
Is this edit good or bad?
Can you tell?
[Answers shown: HOAX, HOAX, ✓]
Insight: Routing based on interest makes subtle judgement calls easier.
New content → Watchlists → Topic-interested individuals
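As a rough illustration of interest-based routing through watchlists, the sketch below inverts a set of watchlists so that an edit to a page can be sent to the people who chose to watch it. It is a toy model, not MediaWiki's implementation; the function names and the editor names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_watchers(watchlists: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert {editor: [page titles]} into {page title: {editors}}."""
    watchers: Dict[str, Set[str]] = defaultdict(set)
    for editor, pages in watchlists.items():
        for page in pages:
            watchers[page].add(editor)
    return watchers


def route_edit(page: str, watchers: Dict[str, Set[str]]) -> List[str]:
    """Return the editors watching `page`, sorted; empty if nobody watches it."""
    return sorted(watchers.get(page, ()))


# Hypothetical watchlists for two editors.
watchers = build_watchers({
    "Alice": ["Alan Turing", "Enigma machine"],
    "Bob": ["Alan Turing"],
})
reviewers = route_edit("Alan Turing", watchers)
nobody = route_edit("A page that does not exist yet", watchers)
```

Here `reviewers` contains both editors, while `nobody` is empty because no one can have watchlisted a page before it exists.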
[Diagram: New content from The Internet first passes RC Patrollers (5 seconds, 1-10 minutes), who remove Obviously bad stuff and Bad stuff, leaving content that is Probably good. Watchlists (24+ hours) then pick up Subtle problems (hoaxes, undue weight, MOS issues), leaving Good stuff.]
RC Patrollers
Watchlists
https://commons.wikimedia.org/wiki/File:Flickr_-_Official_U.S._Navy_Imagery_-_Sailor%27s_daughter_operates_a_fire_hose_with_crew_member_assistance..jpg
AHHHHHH! Turn it OFF!
ORES.editquality.recentchanges_split_(enwiki).svg (CC-BY-SA 3.0)
Edit review filter
[Diagram: New content from The Internet passes through the edit review filter (RC patrollers, watch-list-ers, bots) at 5 seconds, 1-10 minutes, and 24+ hours. Labels: Obviously bad stuff, Bad stuff, Not obviously bad, Subtle problems, Probably good, Good stuff.]
Artificial.intelligence.jpg (public domain)
RC patrollers, watch-list-ers, bots
Geiger, R. S., & Halfaker, A. (2013, August). When the levee breaks: without bots, what happens to Wikipedia's quality control processes?. In Proceedings of the 9th International Symposium on Open Collaboration (p. 6). ACM.
Edit review filter
Strainer_MET_98065.jpg (CC0)
Edit review process
● Fast
● High capacity
Reduces the pressure to close off contribution
New page patrol process
[Diagram sequence: a firehose of 1500 new articles per day from The Internet reaches the patrollers, who sort it into Really bad stuff, Bad stuff, Subtle problems, and Good stuff. In later builds a Backlog appears, and in the final build the Backlog is labelled with Subtle problems and Bad stuff next to "AHHHHHH! Turn it OFF!"]
https://commons.wikimedia.org/wiki/File:Strainer_MET_98065.jpg (CC0)
https://commons.wikimedia.org/wiki/File:Gootsteenstop.png (PD)
[Side-by-side comparison]
Edit review process: The Internet; Really bad stuff, Bad stuff, Subtle problems
New page patrol process: The Internet; Really bad stuff, Bad stuff, Subtle problems; Backlog
Part 3: Using AI to re-shape new article review
ORES
Obviously bad stuff, Bad stuff
● G3: Pure vandalism and blatant hoaxes
● G11: Unambiguous advertising or promotion
● G10: Attack pages
1.5k new pages
Subtle problems
Probably good edits → Watchlists → Topic-interested individuals
Very difficult to watch pages before they are created...
Topic modeling in ORES
https://en.wikipedia.org/w/index.php?title=Alan_Turing&oldid=234785 (CC-BY-SA)
ORES → Biography, Technology, Philosophy, Europe
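One way topic predictions could stand in for watchlists on brand-new pages is to match a page's predicted topics against reviewers' declared interests. The sketch below assumes the model yields a probability per topic and that a fixed threshold decides which topics count; the probabilities, the threshold, the reviewer names, and the function `route_by_topic` are all hypothetical and are not taken from ORES or the talk.

```python
from typing import Dict, List


def route_by_topic(topic_probs: Dict[str, float],
                   interests: Dict[str, List[str]],
                   threshold: float = 0.5) -> List[str]:
    """Return reviewers whose interests overlap the predicted topics.

    A topic counts as predicted when its probability is at least
    `threshold` (an assumed cut-off, chosen only for illustration).
    """
    predicted = {topic for topic, p in topic_probs.items() if p >= threshold}
    return sorted(
        reviewer
        for reviewer, topics in interests.items()
        if predicted.intersection(topics)
    )


# Made-up probabilities for a first draft of a page about Alan Turing.
draft_topics = {
    "Biography": 0.9,
    "Technology": 0.8,
    "Philosophy": 0.6,
    "Europe": 0.7,
    "Sports": 0.02,
}
# Made-up reviewer interests.
reviewer_interests = {
    "reviewer_a": ["Biography"],
    "reviewer_b": ["Sports"],
    "reviewer_c": ["Technology", "Europe"],
}
matched = route_by_topic(draft_topics, reviewer_interests)
```

With these inputs `matched` holds `reviewer_a` and `reviewer_c`; `reviewer_b` is skipped because "Sports" falls below the threshold.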
Culture and arts: Arts, Music, Performing, Plastic, Visual, Broadcasting, Crafts and hobbies, Entertainment, Games and toys, Food and drink, Internet culture, Language and lit., Linguistics, Biography, Media, Philosophy and rel., Sports
Geography: Bodies of water, Cities, Countries, Africa, Americas, Asia, Europe, Oceania, Landforms, Maps, Parks & cons.
History and society: History and society, Business and econ., Education, Military and warfare, Politics and gov., Transportation
STEM: Science, Biology, Chemistry, Economics, Engineering, Geosciences, Medicine, Information science, Mathematics, Meteorology, Physics, Space, Technology, Time, Women's health
Hand-wavey Feature Engineering for my ML Friends
● Features: 300 cell vector based on word2vec
…
● Estimator: GradientBoosting (one vs rest strategy)
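The two bullets above can be illustrated with a toy, standard-library-only sketch. Several things here are assumptions rather than the talk's method: word vectors are averaged into one document vector (the slides only say the vector is "based on word2vec"), the vectors have 3 cells instead of 300, and a nearest-centroid scorer stands in for gradient boosting so the example stays self-contained. Only the one-vs-rest structure, one binary model per topic, is what the sketch is meant to show.

```python
from typing import Dict, List, Sequence, Set, Tuple

Vector = List[float]


def mean_vector(tokens: Sequence[str], embeddings: Dict[str, Vector],
                dim: int) -> Vector:
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def centroid(vectors: Sequence[Vector], dim: int) -> Vector:
    """Component-wise mean of a list of vectors (zeros if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_one_vs_rest(X: Sequence[Vector], Y: Sequence[Set[str]],
                      topics: Sequence[str],
                      dim: int) -> Dict[str, Tuple[Vector, Vector]]:
    """Fit one binary model per topic: that topic's rows vs. all the rest.

    Each 'model' is just (positive centroid, negative centroid), a
    stand-in for the gradient boosting estimator named on the slide.
    """
    models = {}
    for topic in topics:
        pos = [x for x, labels in zip(X, Y) if topic in labels]
        neg = [x for x, labels in zip(X, Y) if topic not in labels]
        models[topic] = (centroid(pos, dim), centroid(neg, dim))
    return models


def predict_topics(models: Dict[str, Tuple[Vector, Vector]],
                   x: Vector) -> List[str]:
    """Return every topic whose positive centroid is closer than its negative."""
    return sorted(t for t, (pos, neg) in models.items()
                  if sq_dist(x, pos) < sq_dist(x, neg))


# Toy 3-cell "embeddings" and four labelled documents (all made up).
EMB = {"turing": [1.0, 0.0, 0.0], "computer": [0.9, 0.1, 0.0],
       "river": [0.0, 1.0, 0.0], "lake": [0.0, 0.9, 0.1]}
docs = [["turing", "computer"], ["computer"], ["river", "lake"], ["lake"]]
labels = [{"Technology"}, {"Technology"}, {"Geography"}, {"Geography"}]

X = [mean_vector(d, EMB, 3) for d in docs]
models = train_one_vs_rest(X, labels, ["Technology", "Geography"], 3)
prediction = predict_topics(models, mean_vector(["turing"], EMB, 3))
```

Because each topic gets its own yes/no model, a page can be assigned several topics at once or none at all, which suits an article that is both a biography and about technology.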
Hand-wavey statistics for my ML Friends
TL;DR: It really works well -- even on the drafty first revisions of articles!
The ORES vision for article draft review
Released: July, 2017
Released: March, 2018
Roger_Bamkin_y_Rosie_Stephenson-Goodknight_en_Wikimania_2015_22.JPG (CC-BY-SA 4.0)
Mary Wollstonecraft by John Opie (c. 1797).jpg (PD)
Rescuesquad.png (CC-BY-SA)
[Equation-style diagram: New Editors + "People like:" (photo) = More New Editors, and Less hard work for page patrollers]
Disclaimer: Rosiestep has not reviewed or endorsed this presentation or the modeling project I have described.
I had one conversation with Rosiestep at Wikimania’17 and I happen to find her and her work inspiring -- so I thought she made for a great example.
She deserves some of the credit, but none of the blame for what I’m talking to you about.
We have the models...
Now we need product integration.
User:SQLBot
Thanks! Props to Sumit Asthana for doing most of the hard work.
Aaron Halfaker
EpochFail/halfak
SCIENCE
● Expanding beyond English Wikipedia
● Other workflows that are struggling
● Demo some of these AIs -- what is ORES anyway?