Efficient and flexible text manipulation, spelling correction and page collections with Pywikibot...

Efficient and flexible text manipulation, spelling correction and page collections with Pywikibot. User:Bináris, Hungarian Wikipedia & Pywikipedia developer team. Wikimania 2012. From Budapest.


Efficient and flexible text manipulation, spelling correction and page collections with Pywikibot

User:Bináris

Hungarian Wikipedia & Pywikipedia developer team

Wikimania 2012

From Budapest


Useful links

[[meta:User:Bináris]]

Open it on your laptop now to follow along.


What is this about?

My spellchecker underlined "occurence".

• Wiktionary: occurence (noun): 1. Common misspelling of occurrence.

• A search in English Wikipedia: "Results 1–20 of 333,623 for occurence"

Does this include every erroneous form?


We speak about

• Pywikipedia bot framework

• replace.py

• fixes.py

This works on every MediaWiki installation!


Some ideas

• Spelling corrections

• Linking and unlinking

• Mass change of section titles

• Enforcing naming conventions

• Replacing templates

• Replacing template parameters

• Placing templates

• Correcting link errors


Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?


The two-pass model of replacement

1. Gathering candidates (texts that may need replacing) into a file: -save / -savenew

Relatively slow and automatic

– Optionally uploading the list to your wiki (line numbers help with cleaning)

2. Making the actual replacements

Faster (or very fast) and attended
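The two passes can be sketched in plain Python. This is only an illustration of the idea, not how replace.py implements -save: the `PAGES` dictionary stands in for the wiki, and the function names and the one-title-per-line file layout are assumptions made for the example.

```python
import os
import re
import tempfile

# Hypothetical stand-in for the wiki: title -> page text.
PAGES = {
    'Alpha': 'The first occurence of the word.',
    'Beta': 'Nothing to fix here.',
}


def gather_candidates(pages, pattern, path):
    """Pass 1: write the titles of pages matching the pattern to a file."""
    with open(path, 'w', encoding='utf-8') as out:
        for title, text in pages.items():
            if re.search(pattern, text):
                out.write('[[%s]]\n' % title)


def read_titles(path):
    """Pass 2 starts here: read the saved titles back."""
    with open(path, encoding='utf-8') as src:
        return [line.strip()[2:-2] for line in src if line.strip()]


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'candidates.txt')
    gather_candidates(PAGES, r'occurence', path)
    print(read_titles(path))
```

The slow scan runs unattended once; the attended pass then only visits the titles in the file.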


Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?


What is a fix?

• A fix contains a replacement task.

• See the links on my Meta page for description & examples
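As a rough illustration, a fix is a dictionary entry that bundles the replacement pairs with their exceptions and an edit summary. The sketch below assumes the usual layout of entries in fixes.py; the fix name `occurrence`, its summary text and the helper `apply_replacements` are made up for this example, and the helper ignores the exceptions entirely.

```python
import re

# Hypothetical fix, laid out like an entry in fixes.py.
fixes = {
    'occurrence': {
        'regex': True,
        'msg': {'en': 'Bot: spelling correction'},
        'replacements': [
            (r'\b([Oo])ccurence', r'\1ccurrence'),
        ],
        'exceptions': {
            'inside-tags': ['hyperlink', 'interwiki'],
        },
    },
}


def apply_replacements(text, fix):
    """Apply every (pattern, replacement) pair; exceptions are ignored here."""
    for old, new in fix['replacements']:
        text = re.sub(old, new, text)
    return text


if __name__ == '__main__':
    print(apply_replacements('Every occurence counts.', fixes['occurrence']))
```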


The magic of regular expressions


Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?


Regular expressions

• color → colour: this is concrete and accidental (and uninteresting :-P)

• What about changing [[január 4]]. to [[január 4.]] and [[január 4]]-én to [[január 4.|január 4]]-én? (For all dates, of course)

• Or July 13, 2012 and 13 July 2012 to 2012-07-13, and 7/13/2012 to 2012-07-13 (ISO 8601) within tables?

• Or color, Color, c/Colorful, c/Colorfulness to colour… (but not Colorado and colorectal cancer)?

Note! Colorful (film) and (manga) and CSS colors go to exceptions! (Why? Sure? How to decide?)
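Two of these tasks can be sketched with Python's `re` module. The patterns are illustrative assumptions, not the author's fixes: the colour rule uses a negative lookahead to skip Colorado and colorectal, and the date rules cover only január for brevity.

```python
import re

# color/Color/colorful... -> colour..., but not Colorado or colorectal.
COLOUR = (r'\b([Cc])olor(?!ado|ectal)', r'\1olour')

# [[január 4]]. -> [[január 4.]]
HU_DATE_DOT = (r'\[\[(január \d{1,2})\]\]\.', r'[[\1.]]')

# [[január 4]]-én -> [[január 4.|január 4]]-én
HU_DATE_SUFFIX = (r'\[\[(január \d{1,2})\]\]-', r'[[\1.|\1]]-')


def apply_rule(text, rule):
    """Apply one (pattern, replacement) pair to the text."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, text)


if __name__ == '__main__':
    print(apply_rule('Colorful colors in Colorado', COLOUR))
    print(apply_rule('[[január 4]].', HU_DATE_DOT))
    print(apply_rule('[[január 4]]-én', HU_DATE_SUFFIX))
```

The film, the manga and CSS colours still need exceptions: a lookahead cannot tell them apart from ordinary prose.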


Regular expressions

• Regular expressions form a simple programming language that searches for patterns and replaces with patterns.

• Learn them, they are worth it! Another dimension of efficiency.


Example: search for a date

July 13, 2012 (a regex-like analysis)

1. A month name (possibly in lower case or abbreviated as Jul)

2. Zero or more spaces

3. 1…9 OR 0 followed by 1…9 OR 1 or 2 followed by 0…9 OR 3 followed by 0 or 1

4. Comma?

5. Zero or more spaces (at least one if there is no comma)

6. At most four digits (are 1- and 2-digit years worth matching?)
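The six steps translate almost literally into a pattern. The following is a minimal sketch, not the author's fix: the names `DATE_RE` and `to_iso` are invented for the example, and the year is restricted to three or four digits, which is one possible answer to the question in step 6.

```python
import re

MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

DATE_RE = re.compile(
    r'\b(?P<month>Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|'
    r'June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|'
    r'Dec(?:ember)?)'                      # 1. month name, full or abbreviated
    r' *'                                  # 2. zero or more spaces
    r'(?P<day>[12][0-9]|3[01]|0?[1-9])'    # 3. day of month, 1..31
    r'(?:, *| +)'                          # 4-5. comma, or at least one space
    r'(?P<year>\d{3,4})\b',                # 6. year (assumed 3-4 digits)
    re.IGNORECASE)                         # lower-case month names too


def to_iso(text):
    """Rewrite every matched date as YYYY-MM-DD."""
    def repl(match):
        month = MONTHS.index(match.group('month')[:3].lower()) + 1
        day = int(match.group('day'))
        year = int(match.group('year'))
        return '%04d-%02d-%02d' % (year, month, day)
    return DATE_RE.sub(repl, text)


if __name__ == '__main__':
    print(to_iso('Opened on July 13, 2012 in Washington.'))
```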


First theorem

The more hits and the more precise matching you want, the more complex the regex will be.

(Do you want to find july? Do you want to find July 13,2012? Do you want to find Jul 13, 2012?)


Example: agents (search & replace)

'replacements': [
    (r'(FBI|CIA|KGB|MI ?\d) [üÜ]gynök(?!e)', r'\1-ügynök'),
    # The inner group makes the closing ]] apply to every agency, not only to MI.
    (r'((?:FBI|CIA|KGB|MI ?\d)\]\]) [üÜ]gynök(?!e)', r'\1-ügynök'),
],

1. An agency (MI followed by an optional space and a digit)

2. A space

3. Ügynök OR ügynök, but NOT ügynöke (hyphen prohibited)

Second line: a linked agency

Result: a hyphenated, lower case agent (=ügynök in Hungarian)

NB: it was preceded by some searches! Not all agencies are here.
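The behaviour described above can be checked outside the bot with plain `re.sub`. One assumption to note: in this sketch the linked pattern groups the agencies so that `]]` applies to all of them, which is a reading of the intent ("a linked agency"). The helper name `hyphenate_agents` is invented for the example.

```python
import re

REPLACEMENTS = [
    # Plain agency name followed by a space and ügynök/Ügynök.
    (r'(FBI|CIA|KGB|MI ?\d) [üÜ]gynök(?!e)', r'\1-ügynök'),
    # Linked agency: the name is immediately followed by ]].
    (r'((?:FBI|CIA|KGB|MI ?\d)\]\]) [üÜ]gynök(?!e)', r'\1-ügynök'),
]


def hyphenate_agents(text):
    """Apply both replacement pairs in order."""
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == '__main__':
    for sample in ('FBI ügynök', 'MI 5 Ügynök', 'CIA ügynöke', '[[KGB]] ügynök'):
        print(sample, '->', hyphenate_agents(sample))
```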


Example: exceptions with regexes

BaseExceptions = {
    'inside-tags': [
        'hyperlink',
        'interwiki',
    ],
    'text-contains': [
        r'(?i)(\{\{szinnyei|\{\{pallas\}|\{\{fényes\}|\{\{vályi\}|Vályi András|Fényes Elek|\{\{sicc\})',
    ],
    'inside': [
        r'\{\{DEFAULTSORT:.*?\}\}',  # DEFAULTSORT intentionally contains words without diacritics.
        r'<ref name.*?>',
        # All kinds of citation templates:
        r'(?is)\{\{cite.*?\}\}',  # All the cite-whatever templates (there is not always a space)
        r'(?is)\{\{cit(lib|per).*?\}\}',  # CitLib and CitPer (a space is not guaranteed, may be |)
        r'(?is)\{\{citation .*?\}\}',
    ],
    'title': [
        r'\d{4} a jogalkotásban',
    ],
}
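To show what an 'inside' exception does, here is a simplified sketch in plain Python. It is not the framework's implementation; `replace_outside` is a hypothetical helper that leaves a match untouched when it lies entirely within a span matched by one of the exception regexes.

```python
import re


def replace_outside(text, pattern, replacement, inside_exceptions):
    """Replace matches of pattern unless they lie inside a protected span."""
    # Collect the (start, end) spans matched by the exception regexes.
    protected = [m.span()
                 for exc in inside_exceptions
                 for m in re.finditer(exc, text)]

    def substitute(match):
        for start, end in protected:
            if start <= match.start() and match.end() <= end:
                return match.group(0)  # inside an exception: keep as is
        return match.expand(replacement)

    return re.sub(pattern, substitute, text)


if __name__ == '__main__':
    sample = 'color {{cite web|title=color}} color'
    print(replace_outside(sample, r'color', 'colour',
                          [r'(?is)\{\{cite.*?\}\}']))
```

Every exception regex is searched on every page, which is one reason exceptions slow a fix down.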


What is to be excepted?

• Keywords


Advanced level

• Fixes and functions – own Python functions
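A function can compute the replacement from each match when a fixed string is not enough. The sketch below uses plain `re.sub` with a callable; how a function is wired into a fix depends on the framework version, so that part is left out. The names `US_DATE` and `us_to_iso` are invented, and the example is the 7/13/2012 to ISO 8601 conversion mentioned earlier.

```python
import re

US_DATE = re.compile(r'\b(\d{1,2})/(\d{1,2})/(\d{4})\b')


def us_to_iso(match):
    """Turn month/day/year into YYYY-MM-DD; leave implausible dates alone."""
    month, day, year = (int(group) for group in match.groups())
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return match.group(0)
    return '%04d-%02d-%02d' % (year, month, day)


if __name__ == '__main__':
    print(US_DATE.sub(us_to_iso, 'Released on 7/13/2012.'))
```

The range check is the kind of logic that a pure regex replacement cannot express comfortably.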


Workflow


Simple replacement tasks

• Find an idea

• Create the replacement

• Find a good selector (search*, category…)

• Do the work with two fingers (y/enter, then /enter) (asynchronous save!)

• Imagine that this slide and the next one form a flowchart.

*Unfortunately, no regexes in MediaWiki search engine


Advanced replacement tasks

• Find an idea

• Create the first version of the replacement

• Test it as usual in software development

– Watch it working during collection

– Create a test page with purposeful errors

– Take care of [[link]]ed & [[link|piped]] versions!

• Found false positives? Missing replacements? Is it too slow? Are the previous problems solved as far as possible? Refine your regexes and/or exceptions

• Press ctrl C, and da capo al fine

• If the fix is good enough, begin the work.

• Maintain fixes & exceptions continuously
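The "test page with purposeful errors" can also live next to the fix as a small table of inputs and expected outputs. This is a sketch under assumptions: the replacement pair, the cases and the helper names are invented, and whether a link target should be changed at all is a decision the bot owner still has to make.

```python
import re

REPLACEMENTS = [(r'\b([Oo])ccurence', r'\1ccurrence')]

# (input, expected output), including linked and piped versions.
TEST_CASES = [
    ('An occurence.', 'An occurrence.'),
    ('Occurences', 'Occurrences'),
    ('[[occurence]]', '[[occurrence]]'),
    ('[[occurence|occurences]]', '[[occurrence|occurrences]]'),
    ('occurrence', 'occurrence'),  # already correct, must stay untouched
]


def apply_fix(text):
    """Apply all replacement pairs in order."""
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text


def failed_cases():
    """Return (input, expected, actual) for every case that goes wrong."""
    return [(source, expected, apply_fix(source))
            for source, expected in TEST_CASES
            if apply_fix(source) != expected]


if __name__ == '__main__':
    print(failed_cases() or 'all cases pass')
```

Rerunning the table after each refinement of the regexes is the "da capo al fine" loop in miniature.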


Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?


Why manually?

• Color as CSS property

• % next to a number – may be an operation

• Misspelled word – may be an example in a linguistic article or a quotation

• RESPONSIBILITY! RESPONSIBILITY!


Second theorem

Spelling corrections must be done manually.

Period.


Semiautomatic running

• Ingredients:

– A replacement task that runs almost always correctly

– One or more pizzas (depending on running time) (possibly a bottle of beer, if you like it)

– Your favourite music

– Stable knowledge of where your Pause button is


Errors

• False positives

• Conflicts (originating from false positives)

• Missed matches

• Simply bad replacement expression

• Slow fix

• Inappropriate automatic running

• Unnecessary changes because of fatigue

• Unnecessary changes because of incompetence

Change the bot owner!


Third theorem

The more hits you want, the more conflicts you get.

This is the game.

Find the balance.


Speed


Speed

• Complex fixes may run slower

• Exceptions make it slower

• Lookbehinds make it slower

• Recursive run and allowoverlap are definitely slow (risk of infinite loop!)

• A fix will be slow if the beginning of the expression matches far more often than the trailing part (see examples in fixes.py)


Speed

Fast replacements take the titles from

• -search

• -cat & al

• -links

• -transcludes

• -file

etc.


The two-pass model of replacement

1. Gathering candidates (texts that may need replacing) into a file: -save / -savenew

Relatively slow and automatic

– Optionally uploading the list to your wiki

2. Making the actual replacements

Faster (or very fast) and attended


Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?


Efficiency


What does it mean?

• Find as many occurrences as possible (even if agglutinated)

• Find as few false positives as possible

• Face as few correction conflicts as possible

• Give the appropriate replacement always

• Let the bot work quickly — don’t wait in front of the screen


Keys to efficiency

• If you find a very efficient replacement (close to 100%), run it separately before the others in the same package: you will have fewer conflicts (but you may collect them together)

• Packages that are too big may run slowly and have a greater chance of causing correction conflicts. Sometimes it is worth splitting them into smaller parts.

• Packages that are too small will waste more dead time during preparation and execution. Sometimes it is worth merging them.

• How to decide then? Just watch.


Keys to efficiency

• Use exceptions when appropriate. They will decrease false positives as well as correction conflicts. E.g.

– Cite book, cite web, cite anything templates

– URLs, image names (even as template parameters and gallery images!)

– Templates marking pages out of your scope (old authors in Hungarian Wikipedia whose quotations contain old-style spelling)

– Titles marking pages out of your scope (year numbers in law in Hungarian Wikipedia)

• …and first of all: improve your regexes continuously!


Keys to efficiency

• Once you have found a false positive, save it for later use! -saveexc / -saveexcnew

• Then insert these titles into your exceptions.

• Run searches before/during creation of a fix.

• Don’t deal with tasks that are not worth a bot!

• Use the two-pass model and the dump whenever possible!


An ugly example

I have a fix to correct short and long i (i/í).

Argentína has an í, but Argentina often occurs in English and Spanish titles → no regex for it, title exceptions must be used → separate fix.

But they may be collected together.


A less ugly example

• replace.py ásnéven "ás néven" -search:másnéven -ns:0 -summary:"Helyesírás javítása kézi botszerkesztéssel: más néven"

(The summary means "Spelling correction by manual bot editing: más néven".)

live demo


Character encoding problems

• Keep your files in UTF-8, and don’t use Windows Notepad

• E.g. setting in Notepad++:


Character encoding problems

• If it doesn’t work in command line, write a fix

• If you can’t solve it with a fix, use URL encoding

– replace.py -catr:Венгрия . @ -lang:ru -excepttext:"[[hu:" -save:magyarok.txt -always

– replace.py -catr:%D0%92%D0%B5%D0%BD%D0%B3%D1%80%D0%B8%D1%8F . @ -lang:ru -excepttext:"[[hu:" -save:magyarok.txt -always

live demo

• You may store this in a script (import replace.py)

This is the way to make page collections
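The URL-encoded form of the category name does not have to be typed by hand. A minimal sketch with the standard library, where the variable names are chosen only for the example:

```python
from urllib.parse import quote, unquote

category = 'Венгрия'

# Percent-encode the UTF-8 bytes of the category name.
encoded = quote(category)

if __name__ == '__main__':
    print(encoded)
    print(unquote(encoded))
```

The encoded string is pure ASCII, so it survives a console whose code page cannot represent Cyrillic.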


Page collections


The two-pass model of replacement

1. Gathering candidates (texts that may need replacing) into a file: -save / -savenew

Relatively slow and automatic

– Optionally uploading the list to your wiki

2. Making the actual replacements

Faster (or very fast) and attended


A simple idea

1. Gathering candidates (texts that may need replacing) into a file

Relatively slow and automatic

– Uploading the list to your wiki (this is the result!)

2. Nothing. You are ready.


Some ideas for page collections

• Scheme: some existing/missing text

• Articles related to Hungary in other Wikipedias (see above for ruwiki)

• The Redlist Project for animals and plants

• Articles with {{commons}} template, but without any image

• …let your imagination run free!
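The "some existing/missing text" scheme boils down to two patterns: one that must match and one that must not. A minimal sketch, assuming a hypothetical `pages` dictionary in place of the wiki and an illustrative image pattern:

```python
import re


def collect(pages, present, absent):
    """Return titles whose text matches `present` but not `absent`."""
    return [title for title, text in pages.items()
            if re.search(present, text) and not re.search(absent, text)]


if __name__ == '__main__':
    # Hypothetical page texts: title -> wikitext.
    pages = {
        'A': '{{commons}} [[File:x.jpg]]',
        'B': '{{commons}} text only',
        'C': 'plain text',
    }
    # Pages with a {{commons}} template but without any image.
    print(collect(pages, r'(?i)\{\{commons', r'(?i)\[\[(?:File|Image):'))
```

The ruwiki command shown earlier follows the same scheme: match anything, except pages that contain "[[hu:".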


Useful links

[[meta:User:Bináris]]

Thank you for your attention!


PS – some thoughts months later

• Lookahead is faster than recursion or overlapping.

• If a function is called for each match, that makes the bot run really slowly.

• In such cases a separate "fellow fix" without a function call is useful for faster searching.
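The lookahead point can be illustrated with plain `re.sub`. A pattern that consumes the text after the separator misses every second hit in one pass, which is what recursion or overlapping would have to repair; a lookahead only peeks at that text. The thousands-separator task and the names below are assumptions made for the example.

```python
import re

TEXT = '1 000 000'

# Consuming version: the digits after the space are part of the match,
# so they cannot start the next match in the same pass.
CONSUMING = (r'(\d) (\d{3})', r'\1&nbsp;\2')

# Lookahead version: the following digits are only checked, not consumed.
LOOKAHEAD = (r'(\d) (?=\d{3})', r'\1&nbsp;')


def one_pass(text, rule):
    """Apply one (pattern, replacement) pair in a single pass."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, text)


if __name__ == '__main__':
    print(one_pass(TEXT, CONSUMING))
    print(one_pass(TEXT, LOOKAHEAD))
```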