Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013
![Page 1: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/1.jpg)
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Consumer Analytics in Real Time: How InfoScout Tracks Purchase Behavior with Mechanical Turk
Jon Brelig, CTO, InfoScout
Sharon Chiarella, Vice President, Amazon Mechanical Turk
November 13, 2013
![Page 2: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/2.jpg)
Overview
– Receipt workflow
– Quality control
– Analytics
![Page 3: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/3.jpg)
![Page 4: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/4.jpg)
Wish I knew who that shopper was!
![Page 5: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/5.jpg)
Helping brands answer…
• Who’s buying my product?
• Who’s the end consumer?
• Why did they buy?
• When and where?
• How many?
• At what price?
• With what else?
Who’s the shopper? What’s their motive?
![Page 6: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/6.jpg)
How do we build a better panel?
Capture receipts through mobile
![Page 7: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/7.jpg)
Our mobile apps
• Receipt Hog: Put $ in your pocket!
• Shoparoo: Fundraise for a cause!
![Page 8: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/8.jpg)
Architecture
1. Capture receipt
2. Convert to structured data: Computer vision + OCR + MTurk
3. Link to master data: Scraping + classification models + human training
– Example: receipt text "GAT G2 LMN LIME" = UPC 052000209648
4. Data warehouse & prematerialize: MySQL, Amazon Redshift, Hadoop (Amazon EMR)
5. Build cool stuff on top of it! Analytics, data firehose, hacks, etc.
[Diagram labels: Tlog (Redshift), Masterdata (MySQL), target.com]
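A minimal sketch of the "link to master data" step, assuming a plain lookup table keyed by retailer and normalized receipt text. The slide describes scraping, classification models, and human training; this sketch swaps in a simple dictionary lookup with a review queue as the fallback. The function names, the `review_queue` list, and the pairing of the example with the retailer "target" are all assumptions for illustration.

```python
import re

# Hypothetical master data: (retailer, normalized receipt text) -> UPC.
# The text/UPC pair is the example on the slide; the retailer key is assumed.
MASTERDATA = {("target", "GAT G2 LMN LIME"): "052000209648"}


def normalize(text):
    """Collapse whitespace and uppercase so OCR spacing noise does not break lookups."""
    return re.sub(r"\s+", " ", text).strip().upper()


def link_to_upc(retailer, receipt_text, masterdata, review_queue):
    """Return the UPC for a receipt line, or None and queue the line for human training."""
    key = (retailer.lower(), normalize(receipt_text))
    upc = masterdata.get(key)
    if upc is None:
        review_queue.append(key)
    return upc


queue = []
print(link_to_upc("Target", " gat g2  lmn lime ", MASTERDATA, queue))
print(link_to_upc("Target", "UNKNOWN ITEM 123", MASTERDATA, queue), queue)
```

Lines that miss the lookup are the ones a classifier or a human would label next, which grows the table over time.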
![Page 9: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/9.jpg)
Digitizing Receipts
Task is to convert image(s) of receipts => structured data
![Page 10: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/10.jpg)
Amazon Mechanical Turk
![Page 11: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/11.jpg)
Transcribing Receipts
[Workflow: Auto Extract (OpenCV, OCR, Regex) => Summary Extraction (Mechanical Turk) => Itemized Extraction (Mechanical Turk) => Score & Audit (Staff / Mechanical Turk) => Complete; side paths: "Can we skip?" and "User marks or staff rejects HIT"]
• Isn’t OCR good enough?
– It is a solved problem… for books
– Low recognition on wrinkled receipts from mobile
• Hybrid of computer + human
– Leverage OCR & computer vision, fill gaps with humans
• Human = MTurk + small audit staff
– We leverage a 6-person team to act as the top audit layer of the system
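The "Auto Extract" and "Can we skip?" steps of the workflow could be sketched roughly as follows. This assumes the OCR output is already available as a string; the regex patterns, the field names, and the rule "skip the summary HIT when every field was found" are hypothetical and not taken from the talk.

```python
import re

# Hypothetical patterns; real receipts vary widely by retailer.
TOTAL_RE = re.compile(r"\bTOTAL\b\s*\$?\s*(\d+\.\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2,4})\b")


def auto_extract(ocr_text):
    """Pull summary fields out of raw OCR text; a field is None when not found."""
    total = TOTAL_RE.search(ocr_text)
    date = DATE_RE.search(ocr_text)
    return {
        "total": float(total.group(1)) if total else None,
        "date": date.group(1) if date else None,
    }


def can_skip_summary_hit(fields):
    """Assumed rule: skip the human summary step only if every field was extracted."""
    return all(value is not None for value in fields.values())


clean = "TARGET\n11/13/2013\nGAT G2 LMN LIME 1.49\nSUBTOTAL 1.49\nTAX 0.12\nTOTAL $1.61"
wrinkled = "TARGET\nGAT G2 LMN L1ME 1.49\nT0TAL 1.6l"

for text in (clean, wrinkled):
    fields = auto_extract(text)
    print(fields, "skip HIT" if can_skip_summary_hit(fields) else "send to MTurk")
```

The wrinkled sample shows why the hybrid design matters: one misread character defeats the pattern, and the receipt falls through to a human Worker.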
![Page 12: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/12.jpg)
Summary Transcription
![Page 13: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/13.jpg)
Summary Transcription
[Chart: Receipts by Month; y-axis from 0 to 1,200,000 receipts]
How do we scale quality control with growing volume?
![Page 14: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/14.jpg)
Known Answers
• Publish HIT with at least one known answer to audit Worker accuracy
• Additional support provided by Amazon API
• Most effective when there is a concrete, expected answer
– e.g., multiple-choice answers
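As a rough illustration of auditing Workers with Known Answers, the sketch below scores each Worker only on questions whose expected answer is known in advance. It does not call the Mechanical Turk API; the data shapes, function names, and the 0.9 threshold are assumptions.

```python
from collections import defaultdict


def known_answer_accuracy(submissions, gold):
    """submissions: iterable of (worker_id, question_id, answer).
    gold: dict of question_id -> expected answer.
    Returns worker_id -> accuracy, measured on known-answer questions only."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, question_id, answer in submissions:
        if question_id not in gold:
            continue  # not an audit question, so it cannot be scored
        seen[worker_id] += 1
        if answer.strip().lower() == gold[question_id].strip().lower():
            correct[worker_id] += 1
    return {worker: correct[worker] / seen[worker] for worker in seen}


def flag_workers(accuracy, threshold=0.9):
    """Workers whose known-answer accuracy falls below the (assumed) threshold."""
    return sorted(w for w, acc in accuracy.items() if acc < threshold)


gold = {"q1": "Target", "q2": "Walmart"}
submissions = [
    ("w1", "q1", "Target"),
    ("w1", "q2", "walmart"),
    ("w2", "q1", "Costco"),
    ("w2", "q3", "anything"),
]
accuracy = known_answer_accuracy(submissions, gold)
print(accuracy, flag_workers(accuracy))
```

Exact-match comparison is what makes this approach fit multiple-choice style questions: there is one concrete expected answer to compare against.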
![Page 15: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/15.jpg)
Known Answers
[Chart: Net Cost per Receipt; y-axis from $0 to $0.03; series: InfoScout Review Cost, MTurk Cost; annotations: "Developed more efficient review process", "Transitioned to Known Answers"]
Known Answers lowered our net cost from 2 cents to 1 cent per receipt
![Page 16: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/16.jpg)
Itemized Extraction
![Page 17: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/17.jpg)
Itemized Extraction
• Transcribe every item on receipt
• HITs audited by review team, priority scored by:
– Comparing output to known OCR extraction
– Comparison to master data (e.g., did they “fat finger” a price or UPC?)
– Worker approval history
– Worker tenure (for InfoScout HITs)
– Additional features
• Not a great candidate for Known Answers…
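One way the audit priority score described above might be computed is a weighted sum over the listed signals. The feature definitions, the weights, and the tenure decay below are invented for illustration; the talk does not say how InfoScout combines these features.

```python
from dataclasses import dataclass


@dataclass
class HitFeatures:
    ocr_mismatch_rate: float     # share of lines that differ from the OCR extraction, 0..1
    masterdata_mismatches: int   # prices or UPCs that disagree with master data
    worker_approval_rate: float  # historical approval rate, 0..1
    worker_hits_completed: int   # tenure on this requester's HITs


def audit_priority(f):
    """Higher score = review sooner. All weights are hypothetical."""
    tenure_risk = 1.0 / (1.0 + f.worker_hits_completed / 100.0)
    return (
        0.4 * f.ocr_mismatch_rate
        + 0.3 * min(f.masterdata_mismatches, 3) / 3.0
        + 0.2 * (1.0 - f.worker_approval_rate)
        + 0.1 * tenure_risk
    )


risky = HitFeatures(0.5, 2, 0.80, 10)
safe = HitFeatures(0.0, 0, 0.99, 5000)
review_order = sorted([safe, risky], key=audit_priority, reverse=True)
print([round(audit_priority(f), 3) for f in review_order])
```

Sorting submissions by this score lets a small review team spend its time on the HITs most likely to contain errors.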
How do we scale quality control for itemized extraction?
![Page 18: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/18.jpg)
Plurality
• HIT completed by >1 Worker
– InfoScout only sends HITs with low confidence to multiple Workers
• Higher quality, higher cost
– Limit costs by scientifically selecting HITs to send to a second Worker
• Multiple strategies when an answer discrepancy is found
– Ask a third Worker
– Leverage internal auditors
[Flow: Publish HIT => Worker 1 Submits => Worker 2 Submits => Match? => YES => Accept]
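The plurality flow can be sketched as a small decision function: low-confidence HITs go to a second Worker, matching answers are accepted, and a discrepancy triggers a third Worker or an internal auditor. The confidence threshold, the status strings, and the function names are assumptions.

```python
from collections import Counter


def needs_second_worker(confidence, threshold=0.9):
    """Only low-confidence HITs are sent to more than one Worker (threshold assumed)."""
    return confidence < threshold


def resolve(answers, min_agree=2, max_workers=3):
    """Decide what to do with the answers collected so far for one HIT.
    Returns (status, answer) where status is 'accept', 'need_more', or 'escalate'."""
    if not answers:
        return ("need_more", None)
    normalized = [a.strip().lower() for a in answers]
    value, count = Counter(normalized).most_common(1)[0]
    if count >= min_agree:
        return ("accept", value)
    if len(answers) < max_workers:
        return ("need_more", None)   # e.g. ask a third Worker
    return ("escalate", None)        # hand off to internal auditors


print(resolve(["4.99", "4.99"]))
print(resolve(["4.99", "4.89"]))
print(resolve(["4.99", "4.89", "4.99"]))
print(resolve(["4.99", "4.89", "4.79"]))
```

`resolve` is meant for HITs that `needs_second_worker` has already flagged; high-confidence HITs would be accepted after a single submission.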
![Page 19: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/19.jpg)
HIT Acceptance Latency
[Chart: HIT acceptance latency in minutes to accept (y-axis 0 to 700), 12/22/12 through 6/22/13; annotation: "Changed Template"]
• Measures HIT demand
• Template change decreased demand temporarily, but Workers acclimated
![Page 20: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/20.jpg)
Worker Retention
[Chart: % completed by retained Workers (0% to 100%) and total HITs completed (0 to 700,000); series: HITs Complete (New Workers), HITs Complete (Retained Workers)]
Within two months, 80% of HITs were completed by returning Workers
![Page 21: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/21.jpg)
Pareto of Worker Volume
[Chart: % of all HITs completed (y-axis 0% to 90%) by Worker percentile; buckets: Top 5%, 6-10%, 11-20%, 21-50%, 51-100%]
Our top 5% (~500) active Workers account for >80% of all HITs completed
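The concentration figure on this slide is a share-of-total calculation over per-Worker HIT counts. A minimal sketch, using made-up counts rather than InfoScout's data:

```python
def top_share(hits_per_worker, top_percent=5):
    """Fraction of all HITs completed by the top `top_percent` percent of Workers."""
    counts = sorted(hits_per_worker, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Integer arithmetic avoids float rounding when sizing the top group.
    top_n = max(1, len(counts) * top_percent // 100)
    return sum(counts[:top_n]) / total


# Synthetic example: 5 heavy Workers and 95 occasional ones.
synthetic = [1000] * 5 + [10] * 95
print(f"Top 5% share: {top_share(synthetic):.1%}")
```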
![Page 22: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/22.jpg)
Analytics Demo
![Page 23: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/23.jpg)
Please give us your feedback on this presentation
As a thank you, we will select prize winners daily for completed surveys!
BDT206
![Page 24: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/24.jpg)
Appendix
![Page 25: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/25.jpg)
Quality Control Strategies
• Filter incoming Workers
– Qualifications
• Increase quality during completion
– Template validation
– Template instructions
• Post submission
– Plurality (multiple HITs per task)
– Known Answers
– Workers audit Workers
Multiple strategies can yield high accuracy
[Diagram labels: HIT, Enhance, Approve/Reject?]
![Page 26: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013](https://reader034.fdocuments.in/reader034/viewer/2022052600/557cf542d8b42a071b8b47b5/html5/thumbnails/26.jpg)
HIT templates
• Clear & concise instructions
– The first time, each Worker sees detailed instructions and can hide them once comfortable
• Keyboard shortcuts
• Maximize validation
– Client-side and/or AJAX validation
• Bonus rewards
– Nice option for rewarding Workers, especially when HITs are variable in length & time