DOCUMENT DIGITIZATION - QCon.ai · @nischalhp | Document Digitization | QconAI SFO 2019 Human and...
DOCUMENT DIGITIZATION
Rethinking it with Machine Learning
Nischal Harohalli Padmanabha QConAI SFO 2019
@nischalhp | Document Digitization | QconAI SFO 2019
“The brain sure as hell doesn’t work by somebody programming in rules.”
- Geoffrey Hinton
PROBLEM
Understanding unstructured documents and extracting semantic information to automate claims handling.
DOCUMENT CLASS
Policy
POLICY NUMBER
H 54/16 307 728
CUSTOMER
Renolate GmbH
10115 Berlin
AGENT
pma Insurance Broker
48149 Nurnberg
RISK DESCRIPTION / INSURED LOCATION
Private liability insurance comfort plus
Dog liability
Environmental damage insurance
Employees on premises
POLICY
Liability Protection
EFFECTIVE DATE OF CHANGE
22.12.2016 12:00
TERMINATION
22.12.2019 12:00
ANNUAL CHARGE
EUR 424,63
COVERAGES
Persons & property damage flat: EUR 3.000.000
Financial losses: EUR 100.000
Environmental damage basic flat: EUR 3.000.000
REWIND
TABULAR INFORMATION EXTRACTION
COURSE OF ACTION - ROUND 1
Writing a lot of rules
Initial results gave us a lot of happiness. Evaluation on known data.
RESULT
In production 58% accuracy
We failed, miserably.
Rules became cumbersome & brittle.
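To picture why hand-written rules turn brittle, here is a minimal sketch. It is not the rule set from the talk: the regular expression, the function name `extract_policy_number` and the second layout are all hypothetical, chosen only to mirror the sample policy shown earlier.

```python
import re

# Hypothetical rule: the policy number sits on the line after its label and
# has exactly the "H 54/16 307 728" shape seen in the sample policy.
POLICY_NUMBER_RULE = re.compile(
    r"POLICY NUMBER\s*\n\s*([A-Z] \d{2}/\d{2} \d{3} \d{3})"
)

def extract_policy_number(text):
    """Return the policy number if the rule matches, otherwise None."""
    match = POLICY_NUMBER_RULE.search(text)
    return match.group(1) if match else None

# Layout the rule was written against.
known_layout = "POLICY NUMBER\nH 54/16 307 728\nCUSTOMER\nRenolate GmbH"
# Invented variant: different label, separators and grouping.
unseen_layout = "Policy No.: H-54/16-307728\nCustomer: Renolate GmbH"

print(extract_policy_number(known_layout))
print(extract_policy_number(unseen_layout))
```

The rule works on the layout it was written for and silently returns nothing on the variant. Every new layout needs another rule, which is how a rule base grows cumbersome.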
Life or death situation for the project (and us engineers)
ADAPTIVE LEARNING THOUGHT PROCESS
How does a human solve the same problem?
Identifies groupings of text to build context
E.g. tables, paragraphs, passages
Given the context, uses domain knowledge and semantic understanding of the text
Sounds straightforward, right?
TECH STACK CHECK
NEXT STEPS
Which algorithms to use?
What should we feed as input to the algorithm?
What to annotate?
What are our deadlines?
Human and computation resources required?
How to agile this?
COURSE OF ACTION - ROUND 2
Which algorithms to use?
Supervised Learning
Computer Vision
● Object detection
● Messaging parsing networks
● Custom CNN networks
NLP
● Implementation of Deep Topic modeling
● Custom RNN + CNN networks with domain adaptation
Unsupervised Learning
Using this technique to generate data for supervised training. Wrote implementations of deep clustering and word / sentence / page / document embeddings.
EMPHASIS ON SUPERVISED LEARNING
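As an illustration of using unsupervised learning to generate data for supervised training, the sketch below clusters toy embeddings and treats the cluster ids as pseudo-labels. It uses plain k-means rather than the deep clustering mentioned in the talk, and the 2-D "page embeddings", page ids and starting centroids are invented for the example.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means on tuples of floats; returns one cluster id per point."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
            )
            for point in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# Hypothetical 2-D page embeddings; real embeddings would have many more dimensions.
page_ids = ["page-a", "page-b", "page-c", "page-d"]
embeddings = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]

labels = kmeans(embeddings, initial_centroids=[(0.0, 0.0), (1.0, 1.0)])
# Cluster ids become pseudo-labels that annotators can review and correct.
pseudo_labelled = list(zip(page_ids, labels))
print(pseudo_labelled)
```

Pseudo-labels of this kind are only a starting point. They still need human review before they are used as training data.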
COURSE OF ACTION - ROUND 2
What should we feed as input to the algorithm?
What to annotate?
Computer Vision
● Drawing polygon bounding boxes
● Labeling pages
● Labeling documents
NLP
Complex annotation of passages, phrases, tables, line items and the hierarchical nature of textual information
Built an in-house Annotation System
Workflows support huge annotation jobs
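The annotation types listed above could be stored roughly as follows. This is a sketch only: the class names `PolygonAnnotation` and `PageAnnotation`, their fields and the coordinates are assumptions, not the schema of the in-house annotation system.

```python
from dataclasses import dataclass, field

@dataclass
class PolygonAnnotation:
    """One labelled region on a page, drawn as a polygon (hypothetical schema)."""
    label: str
    points: list  # [(x, y), ...] in page pixel coordinates

    def bounding_box(self):
        """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the polygon."""
        xs = [x for x, _ in self.points]
        ys = [y for _, y in self.points]
        return (min(xs), min(ys), max(xs), max(ys))

@dataclass
class PageAnnotation:
    """Page-level label plus the regions annotated on that page."""
    page_label: str
    regions: list = field(default_factory=list)

# Invented coordinates for a slightly skewed table region.
table = PolygonAnnotation("table", [(10, 20), (200, 25), (195, 120), (12, 118)])
page = PageAnnotation(page_label="policy", regions=[table])
print(table.bounding_box())
```

Keeping the polygon and deriving the box on demand preserves skewed or rotated regions while still supporting models that expect rectangles.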
COURSE OF ACTION - ROUND 2
Human and computation resources required?
Data Scientists
● Data Scientists from Academia
● Deep learning engineers
● Research programme with Universities
● Master Thesis sponsorship at omni:us
Engineers
● Full stack engineers
● Data Engineers
● Devops
Leadership & Mentors
● Team leads with experience in AI
● Identifying and convincing industry experts to mentor
● Devops
Cloud startup programmes
● Credits to support memory and GPU training algorithms
● Mentoring to scale operations
COURSE OF ACTION - ROUND 2
What are our deadlines?
How to agile this?
Sprint planning for research
Quick turnaround of POCs
Engineer AI systems to run experiments in a systematic and automated way
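Running experiments "in a systematic and automated way" might look like the sketch below: every configuration in a grid is run, given a reproducible id and ranked. The function names, the grid and the scoring function are placeholders, not the system described in the talk.

```python
import hashlib
import itertools
import json

def run_experiments(grid, train_and_evaluate):
    """Run every configuration in the grid and return results, best score first."""
    keys = sorted(grid)
    results = []
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        # Deterministic id so the same configuration always maps to the same run.
        run_id = hashlib.sha1(
            json.dumps(config, sort_keys=True).encode("utf-8")
        ).hexdigest()[:8]
        results.append(
            {"run_id": run_id, "config": config, "score": train_and_evaluate(config)}
        )
    return sorted(results, key=lambda r: r["score"], reverse=True)

def toy_train_and_evaluate(config):
    # Placeholder scoring, NOT a real model: stands in for a training job
    # that returns a validation metric.
    return -abs(config["learning_rate"] - 0.01) + 0.001 * config["layers"]

grid = {"learning_rate": [0.1, 0.01], "layers": [2, 4]}
leaderboard = run_experiments(grid, toy_train_and_evaluate)
for run in leaderboard:
    print(run["run_id"], run["config"], round(run["score"], 3))
```

Because every run is recorded with its configuration, a sprint review can compare results directly instead of relying on ad hoc notebooks.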
RESULT
In production 94% accuracy
Successful AI delivery
TECH STACK CHECK
GO LIVE OR GO HOME
AI IN PRODUCTION
Trained models predict
Human in the loop fixes the errors and validates corrections
Train on the corrections for continuous improvement
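The predict, correct, retrain loop can be sketched as below. The `review` function, the field names and the wrong predicted value are hypothetical. The only idea taken from the slide is that human corrections become new training examples.

```python
def review(predictions, corrections):
    """Apply human corrections to model predictions and collect training examples."""
    validated = {}
    training_examples = []
    for field_name, predicted in predictions.items():
        # A missing correction means the reviewer accepted the prediction.
        final = corrections.get(field_name, predicted)
        validated[field_name] = final
        if final != predicted:
            training_examples.append(
                {"field": field_name, "predicted": predicted, "label": final}
            )
    return validated, training_examples

# Invented model output with one wrong digit in the annual charge.
predictions = {"policy_number": "H 54/16 307 728", "annual_charge": "EUR 424,68"}
# What the human reviewer fixed in the validation step.
corrections = {"annual_charge": "EUR 424,63"}

validated, training_examples = review(predictions, corrections)
print(validated)
print(training_examples)
```

The `training_examples` list is what would feed the next retraining run, so each reviewed document can improve the model a little.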
DO NOT IGNORE
Domain Knowledge is essential
Educate your customers on AI
Engineer end-to-end AI systems to solve a business use case, not a dataset
PLATFORM
Training Platform
Prediction Platform with human in the loop
Management Console of Infrastructure, Applications & Users
COURSE OF ACTION - ROUND 3
Training Platform
Annotation System
System to define data models, annotate data, manage annotation jobs, audit the annotated data and version control the datasets
Ability to train and evaluate models
Mechanism and system to trigger training, retraining, evaluation and versioning of different types of models, in a managed way across various infrastructures supporting CPU and GPU
Console connecting the two together
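One simple way to version control datasets is to derive the version from the content itself, so that any change to an annotation produces a new version. This is an assumption about how it could be done, not a description of the omni:us platform. The record layout and the `dataset_version` helper are hypothetical.

```python
import hashlib
import json

def dataset_version(records):
    """Content-derived version id: same annotations give the same id, in any order."""
    canonical = json.dumps(sorted(records, key=lambda r: r["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Hypothetical annotated records.
v1 = [
    {"id": "doc-1", "label": "policy"},
    {"id": "doc-2", "label": "claim"},
]
# Same content in a different order.
v1_reordered = list(reversed(v1))
# One annotation changed after an audit.
v2 = [
    {"id": "doc-1", "label": "policy"},
    {"id": "doc-2", "label": "invoice"},
]

print(dataset_version(v1) == dataset_version(v1_reordered))
print(dataset_version(v1) == dataset_version(v2))
```

Storing this id alongside each trained model records exactly which annotations the model was trained on.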
COURSE OF ACTION - ROUND 3
Prediction Platform with human in the loop
Async API for Ingestion
REST API that supports asynchronous data upload
Data Pipelines
Robust data pipelines connecting the services, providing high throughput, reliability and retry mechanisms
Validation UI
User interface to fix prediction errors
AI microservices
Scaling deep learning models as microservices
Prediction console connects all.
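The retry mechanisms mentioned for the data pipelines could follow a pattern like the one below: retry a failing step with exponential backoff and give up after a fixed number of attempts. The names `call_with_retry` and `TransientError` and the flaky step are illustrative assumptions. The talk does not say how retries were implemented.

```python
import time

class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a timeout)."""

def call_with_retry(step, payload, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call step(payload), retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return step(payload)
        except TransientError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            sleep(base_delay * 2 ** attempt)

# Invented pipeline step that fails twice before succeeding.
calls = {"count": 0}

def flaky_step(payload):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("service not ready")
    return payload.upper()

# Record the delays instead of sleeping so the example runs instantly.
delays = []
result = call_with_retry(flaky_step, "page-1", sleep=delays.append)
print(result, delays)
```

Passing `sleep` in as a parameter keeps the backoff policy testable without waiting in real time.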
COURSE OF ACTION - ROUND 3
Management Console of Infrastructure, Applications & Users
Configuration management
Central management of configuration of various systems, consoles and services
Application logs
Monitoring logs of applications and setting up dashboards for internal and external stakeholders
User management
Managing users and providing authentication and authorisation capabilities for services
Infrastructure logs
Monitoring infrastructure usage and patterns to set up alerts and notifications
Management and monitoring console
TECH STACK CHECK
omni:us platform console
Learnings
● It is very important for the entire organization to believe that AI can solve problems.
● Engineer AI products; do not believe that having just AI models is good enough.
● Agile for AI works; choose an interpretation that works for your team.
● Pay attention to details, domain knowledge and the use case to be solved.
● A combination of multiple technologies has to be used to solve a use case, not just one hammer for all.
● Do not try to “AI” everything; certain mature technologies are capable of solving certain problems well. Use them wisely.
● Believe in human in the loop; it builds trust with the business.
● Educate internal and external stakeholders about the possibilities and limitations of AI.
● Visualisation is a powerful tool to understand and explain AI to everybody. Use it.
● AI is no longer a black box; it can be fine-tuned, managed and configured appropriately.
● Automate your current processes as much as possible; this gives more room for research.