Jeremie Charlet
25th May 2016
Presentation of Taxonomy Applications and their development to the BBC
Introduction
– Categorisation was initially done with Autonomy: 2 years' work by the Taxonomy team to write and perfect the category queries
– When we migrated our search engine to Solr, we had to rebuild the taxonomy tools from scratch
Example category query for “air force”:
"Air Force" OR "air forces" OR "Air Ministry" OR "Air Historical Branch" OR "Air Department" OR "Air Board" OR "Air Council" OR "Department of the Air Member" OR "air army" …
Plan
Introduction
1. Solution
2. How we implemented it
3. Attempt at Machine Learning
Conclusion: learnings and next steps
http://discovery.nationalarchives.gov.uk/
Categories displayed on Discovery, our archives portal
Administration User Interface for taxonomists
Command Line Interface to categorise everything once
Batch Job to categorise documents every day
1. Solution
1. Solution / Discovery
1. Solution / admin GUI
Application to categorise documents every day:
1. to categorise new documents
2. to re-categorise documents when they are updated
1. Solution / daily updates
Application to categorise everything once:
1. to do it for the first time
2. to apply the latest modifications from taxonomists to all documents
1. Solution / categorise all docs
Under the hood of taxonomy-batch-cat-all
1. Solution / categorise all docs
Categorisation and updates on Solr are decoupled
1. Solution / categorise all docs
Architecture diagram for daily updates (Java side)
1. Solution
Plan
Introduction
1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day
2. How we implemented it
3. Attempt at Machine Learning
Conclusion: learnings and next steps
http://discovery.nationalarchives.gov.uk/
To get it right
To get it fast
• Algorithm
• Fine tuning
• Distributed system with Akka
2. Implementation
Many parameters to take into account:
• Is case sensitivity important?
• Use punctuation?
• Use synonyms?
• Ignore stop words (of, the, a, …)?
• Use wildcards?
• Which metadata to use?
= Iterative process
How to evaluate whether our results are valid?
> Use documents and categories from the former system
> Categorise them again and compare results
To do that quickly, we created a Command Line Interface
[jcharlet@server ~]$ ./runCli.sh -EVALcategoriseEvalDataSet --lucene.index.useSynonymFilter=true
2. Implementation / get it right
The answers we settled on:
• Case sensitivity: it depends
• Punctuation: it depends
• Synonyms: yes
• Stop words: no, keep the stop words
• Wildcards: * and ?
• Metadata used: title, description, context description, categories, people, places, corporate bodies
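As an illustration only (not the team's actual code), those choices could be wired into a custom Lucene analyzer roughly like this, assuming the Lucene 5.x API of the time; the class name and synonym entry are made up for the example:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class TaxonomyAnalyzer extends Analyzer {

    private final SynonymMap synonyms;

    public TaxonomyAnalyzer() throws Exception {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        // One made-up entry; the real synonym lists come from the taxonomists
        builder.add(new CharsRef("gaol"), new CharsRef("prison"), true);
        this.synonyms = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);   // fold case where it does not matter
        stream = new SynonymFilter(stream, synonyms, true); // synonyms: yes
        // deliberately no StopFilter: stop words are kept, as decided above
        return new TokenStreamComponents(source, stream);
    }
}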
We apply our 136 categories to 22 million records in 1.5 days (~5 ms per doc)
• We create an in-memory index containing a single document and run our category queries against it. Then we run the matching queries against the complete index to obtain a score that lets us rank matches (sketched below)
• Distributed system with Akka (13 processes running on 2 servers)
2 × 24-core CPUs, 40 GB RAM
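A minimal sketch of that single-document trick, using Lucene's MemoryIndex; the field names and surrounding class are illustrative, not the actual taxonomy code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SingleDocMatcher {

    private final StandardAnalyzer analyzer = new StandardAnalyzer();

    /** Returns the names of the categories whose query matches this one document. */
    public List<String> matchingCategories(String title, String description,
                                           Map<String, Query> categoryQueries) {
        // Index just this one document, entirely in memory
        MemoryIndex index = new MemoryIndex();
        index.addField("TITLE", title, analyzer);
        index.addField("DESCRIPTION", description, analyzer);

        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Query> entry : categoryQueries.entrySet()) {
            // search() scores the query against the single in-memory document;
            // 0.0f means no match
            if (index.search(entry.getValue()) > 0.0f) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}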
2. Implementation / get it fast
Use the right directory implementation for your system (NRTCachingDirectory instead of the default one)
> 1 line in 1 file = 20% faster on search queries
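That one line looks roughly like this (a sketch assuming Lucene's NRTCachingDirectory; the path and cache sizes are illustrative):

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NRTCachingDirectory;
import java.io.IOException;
import java.nio.file.Paths;

public class DirectoryFactory {

    /** Wraps the default filesystem directory in a near-real-time cache;
     *  5.0 and 60.0 are the max merge size and max cached size, in MB. */
    public static Directory open(String path) throws IOException {
        return new NRTCachingDirectory(FSDirectory.open(Paths.get(path)), 5.0, 60.0);
    }
}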
Use a filter instead of a query to search on only 1 document, and use the low-level API carefully
Profile your application frequently
> Identify ugly code, where to add caching, where to add concurrency
> We spent 7% of the time creating Query objects for every document: instead, create them once and store them in memory (see the sketch below)
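A minimal sketch of that query cache (class and field names are made up, not taken from the taxonomy repository): parse each category query once, then reuse the Query object for every document.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CategoryQueryCache {

    private final ConcurrentMap<String, Query> cache = new ConcurrentHashMap<>();
    private final StandardAnalyzer analyzer = new StandardAnalyzer();

    /** Parses a category query on first use only; later calls hit the cache. */
    public Query get(String categoryName, String queryString) {
        return cache.computeIfAbsent(categoryName, name -> {
            try {
                return new QueryParser("TEXT", analyzer).parse(queryString);
            } catch (ParseException e) {
                throw new IllegalArgumentException("Bad query for category " + name, e);
            }
        });
    }
}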
2. Implementation / get it fast
How to transmit documents to categorise efficiently?
By sending messages to workers
See the problem?
[Diagram: a Categorisation Supervisor pushes batches of document IDs (C456321;C65465;C654879;C56879;…) to three Categorisation Workers; the messages pile up in the workers' mailboxes faster than they can be processed]
2. Implementation / get it fast
Solution: the work-pulling pattern — http://www.michaelpollmeier.com/akka-work-pulling-pattern/
2. Implementation / get it fast
Applied to the Taxonomy applications
https://github.com/nationalarchives/taxonomy
There are 2 types of batch applications (each runs in its own application server):
• 1 instance of Taxonomy-cat-all-supervisor
• N instances of Taxonomy-cat-all-worker
The categorisation supervisor browses the whole index and retrieves 1000 documents at a time.
The categorisation workers receive categorisation requests that contain a list of documents to categorise (see the sketch below).
2. Implementation / get it fast
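A minimal sketch of the work-pulling idea applied to this supervisor/worker split, assuming Akka's classic Java API (2.5+); the message and class names are illustrative, the real protocol lives in the taxonomy repository:

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// worker -> supervisor: "my mailbox is empty, give me a batch"
final class GiveMeWork {}

// supervisor -> worker: one batch of document ids to categorise
final class Batch {
    final List<String> docIds;
    Batch(List<String> docIds) { this.docIds = docIds; }
}

// The supervisor hands out a batch only when a worker asks for one,
// so no worker mailbox ever holds more than the batch being processed.
class CategorisationSupervisor extends AbstractActor {
    // batches of ~1000 ids read from the index while browsing it
    private final Queue<List<String>> pending = new ArrayDeque<>();

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(GiveMeWork.class, msg -> {
                    List<String> next = pending.poll();
                    if (next != null) {
                        getSender().tell(new Batch(next), getSelf());
                    } // else: nothing left right now; in the full pattern the
                      // supervisor announces new work when it arrives
                })
                .build();
    }
}

// A worker pulls: it asks for work at startup and again after each batch.
class CategorisationWorker extends AbstractActor {
    private final ActorRef supervisor;
    CategorisationWorker(ActorRef supervisor) { this.supervisor = supervisor; }

    @Override
    public void preStart() { supervisor.tell(new GiveMeWork(), getSelf()); }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(Batch.class, batch -> {
                    categorise(batch.docIds);                     // run the category queries
                    supervisor.tell(new GiveMeWork(), getSelf()); // then pull the next batch
                })
                .build();
    }

    private void categorise(List<String> docIds) { /* categorisation logic */ }
}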
Plan
Introduction
1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day
2. How we implemented it
– Get it right
– Get it fast
• Fine tuning
• Distributed system with Akka
3. Attempt at Machine Learning
Conclusion: learnings and next steps
http://discovery.nationalarchives.gov.uk/
We researched a training-set-based solution for 2 months
1. Take a data set of known (already classified) documents
2. Split it into a test set and a training set
– Train the system with the training set
– Evaluate it using the test set
– Iterate until satisfactory
3. Move it to production
– Classify new documents using the trained system
3. Attempt at Machine Learning
Why it did not work:
1. Using category queries to create the training set
– Highly dependent on the validity/accuracy of the category queries
2. The nature of our categories
– far too many (136)
– categories too vague/broad or too similar (“Poverty”, “Military”): they do not suit such a system
3. Not the right tool? We used Lucene's (our search engine's) built-in classification tool (see the sketch below)
4. The nature of the data? The quality of the metadata?
3. Attempt at Machine Learning
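For reference, a minimal sketch of what the training-set experiment looks like with Lucene's built-in classification module, assuming the Lucene 5.x API; the index path and field names are illustrative:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import java.nio.file.Paths;

public class TrainingSetExperiment {
    public static void main(String[] args) throws Exception {
        // The training set: an index of documents that already carry a category
        DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/data/training-index")));
        LeafReader leafReader = SlowCompositeReaderWrapper.wrap(reader);

        // Train a naive Bayes classifier on the text and category fields
        SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
        classifier.train(leafReader, "TEXT", "CATEGORY", new StandardAnalyzer());

        // Classify a new, unseen document
        ClassificationResult<BytesRef> result =
                classifier.assignClass("letter concerning the Air Ministry ...");
        System.out.println(result.getAssignedClass().utf8ToString()
                + " (score " + result.getScore() + ")");

        reader.close();
    }
}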
Plan
Introduction
1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day
2. How we implemented it
– Get it right
– Get it fast
• Fine tuning
• Distributed system with Akka
3. Attempt at Machine Learning
Conclusion: learnings and next steps
http://discovery.nationalarchives.gov.uk/
Conclusion: learnings and next steps
Gains and losses
– Loss: no * wildcard within words
– Gain: categorisation 10 times faster
– Gain: use of free solutions (Solr and Lucene are open source)
– Gain: admin interface more fluid and usable
Conclusion: learnings and next steps
Possible improvements
– Update documents for 1 category on demand
– Create a more generic solution
– Add missing GUIs (reporting, categorise all)
– Build the solution upon Solr, not Lucene
– Use cloud services instead of onsite servers
Next steps
– Categorise other archives
– Work on new born-digital records: new categories? new research on machine learning?
Thank you for listening
Any questions?