Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013

Consumer Analytics in Real Time: How InfoScout Tracks Purchase Behavior with Mechanical Turk

Jon Brelig, CTO, InfoScout
Sharon Chiarella, Vice President, Amazon Mechanical Turk

November 13, 2013

Overview

– Receipt workflow
– Quality control
– Analytics

Wish I knew who that shopper was!

Helping brands answer…
• Who’s buying my product?
• Who’s the end consumer?
• Why did they buy?
• When and where?
• How many?
• At what price?
• With what else?

Who’s the shopper? What’s their motive?

How do we build a better panel? Capture receipts through mobile

Our mobile apps
• Receipt Hog: Put $ in your pocket!
• Shoparoo: Fundraise for a cause!

Architecture

1. Capture receipt
2. Convert to structured data: computer vision + OCR + MTurk
3. Link to masterdata: scraping + classification models + human training
   Example: GAT G2 LMN LIME = UPC 052000209648 (masterdata stored in MySQL)
4. Data warehouse & prematerialize: MySQL, Amazon Redshift, Hadoop (Amazon EMR); Tlog stored in Redshift
5. Build cool stuff on top of it! Analytics, data firehose, hacks, etc.

[Diagram also shows: target.com]
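As a rough illustration of step 3, the sketch below links an abbreviated receipt description to a UPC through a normalized lookup. The single table entry is the example from the slide; the names `MASTERDATA`, `normalize`, and `link_to_upc` are hypothetical, and the real system is described as using scraping, classification models, and human training rather than an exact-match dictionary.

```python
import re
from typing import Optional

# Toy masterdata table. The one entry is the slide's example; the deck says the
# real masterdata lives in MySQL. Everything else here is an assumption.
MASTERDATA = {
    "GAT G2 LMN LIME": "052000209648",
}


def normalize(description: str) -> str:
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", description.upper())
    return " ".join(cleaned.split())


def link_to_upc(description: str) -> Optional[str]:
    """Return the UPC for a receipt line, or None when no link is known yet."""
    return MASTERDATA.get(normalize(description))
```

A description that returns `None` would be a candidate for the classification models or human training mentioned in step 3.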

Digitizing Receipts

Task is to convert image(s) of receipts => structured data

Workflow (with Amazon Mechanical Turk):
1. Auto Extract: OpenCV, OCR, Regex
2. Summary Extraction: Mechanical Turk
3. Itemized Extraction: Mechanical Turk
4. Score & Audit: Staff / Mechanical Turk
5. Complete

(Decision point in diagram: “e skip?”)

• Isn’t OCR good enough?
  – It is a solved problem… for books
  – Low recognition on wrinkled receipts from mobile
• Hybrid of computer + human
  – Leverage OCR & computer vision, fill gaps with humans
• Human = MTurk + small audit staff
  – We leverage a 6-person team to act as the top audit layer of the system
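The hybrid idea can be sketched as: try regex extraction on the OCR text first, and route only the fields that could not be read to a human task. The patterns and the function names `auto_extract` and `fields_for_humans` below are assumptions for illustration, not InfoScout's actual rules.

```python
import re

# Hypothetical patterns; real receipts would need many more variants.
TOTAL_RE = re.compile(r"\bTOTAL\b\s*\$?\s*(\d+\.\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2,4})\b")


def auto_extract(ocr_text: str) -> dict:
    """Pull summary fields out of OCR text; a field is None when no pattern matches."""
    total = TOTAL_RE.search(ocr_text)
    date = DATE_RE.search(ocr_text)
    return {
        "total": total.group(1) if total else None,
        "date": date.group(1) if date else None,
    }


def fields_for_humans(fields: dict) -> list:
    """List the fields the machine could not read, i.e. the gaps to fill with a HIT."""
    return [name for name, value in fields.items() if value is None]
```

A clean scan would yield no gaps, while a wrinkled receipt whose OCR reads "T0TAL" would leave `total` for a Worker to transcribe.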

Transcribing Receipts


Summary Transcription

[Chart: Receipts by Month; y-axis from 200,000 to 1,200,000]

How do we scale quality control with growing volume?

Known Answers

• Publish HIT with at least one known answer to audit Worker accuracy

• Additional support provided by Amazon API

• Most effective when there is a concrete, expected answer
  – e.g. multiple-choice answers
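A minimal sketch of the Known Answers audit: each Worker's accuracy is measured only on the questions whose answer is already known. The data shapes and the name `worker_accuracy` are assumptions; the slide does not describe how InfoScout stores or scores these.

```python
from collections import defaultdict


def worker_accuracy(submissions, known_answers):
    """Score each Worker on known-answer questions only.

    submissions: iterable of (worker_id, question_id, answer) tuples.
    known_answers: dict mapping question_id to the expected answer.
    Returns a dict mapping worker_id to accuracy in [0, 1].
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, question_id, answer in submissions:
        if question_id not in known_answers:
            continue  # regular question, nothing to audit against
        seen[worker_id] += 1
        if answer == known_answers[question_id]:
            correct[worker_id] += 1
    return {worker: correct[worker] / seen[worker] for worker in seen}
```

Workers who never saw a known-answer question are absent from the result, so the caller can tell "no evidence" apart from "0% accurate".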

Known Answers

[Chart: Net Cost per Receipt; y-axis from $0.0050 to $0.0300; series: InfoScout Review Cost, MTurk Cost]

Known Answers lowered our net cost per receipt from 2 cents to 1 cent

Developed more efficient review process

Transitioned to Known Answers


Itemized Extraction
• Transcribe every item on receipt
• HITs audited by review team, priority scored by:
  – Comparing output to known OCR extraction
  – Comparison to master data (i.e. did they “fat finger” a price or UPC?)
  – Worker approval history
  – Worker tenure (for InfoScout HITs)
  – Additional features
• Not a great candidate for Known Answers…
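One way to picture the priority score is a weighted combination of the signals listed above, with low-confidence HITs audited first. The weights, the tenure cap, and the names `HitSignals` and `audit_priority` are invented for this sketch; the deck does not say how the score is actually computed.

```python
from dataclasses import dataclass


@dataclass
class HitSignals:
    ocr_agreement: float     # 0..1, share of fields agreeing with OCR extraction
    masterdata_match: float  # 0..1, share of items whose price/UPC fits master data
    approval_rate: float     # 0..1, Worker's historical approval rate
    tenure_hits: int         # number of InfoScout HITs the Worker has completed


def audit_priority(signals: HitSignals, tenure_cap: int = 1000) -> float:
    """Return a score in [0, 1]; higher means review this HIT sooner.

    The weights are placeholders chosen for illustration only.
    """
    tenure = min(signals.tenure_hits, tenure_cap) / tenure_cap
    confidence = (
        0.35 * signals.ocr_agreement
        + 0.35 * signals.masterdata_match
        + 0.20 * signals.approval_rate
        + 0.10 * tenure
    )
    return 1.0 - confidence
```

The review team would then sort submitted HITs by descending `audit_priority` and work from the top.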

How do we scale quality control for itemized extraction?

Plurality
• HIT completed by >1 Worker
  – InfoScout only sends HITs with low confidence to multiple Workers
• Higher quality, higher cost
  – Limit costs by scientifically selecting HITs to send to a second Worker
• Multiple strategies when an answer discrepancy is found
  – Ask a third Worker
  – Leverage internal auditors

Flow: Publish HIT => Worker 1 Submits => Worker 2 Submits => Match? => Accept
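The plurality flow can be sketched as a small decision function: accept when two Workers agree, ask a third on a discrepancy, and hand off to internal auditors if the third does not break the tie. The status strings, the confidence threshold, and the function names are assumptions for illustration.

```python
def needs_second_worker(confidence: float, threshold: float = 0.8) -> bool:
    """Only low-confidence HITs go to another Worker; the threshold is hypothetical."""
    return confidence < threshold


def resolve(answers: list) -> tuple:
    """Decide the next step from the answers submitted so far, in submission order.

    Returns (status, accepted_answer_or_None).
    """
    if len(answers) < 2:
        return ("need_second_worker", None)
    first, second = answers[0], answers[1]
    if first == second:
        return ("accept", first)
    if len(answers) == 2:
        return ("need_third_worker", None)
    third = answers[2]
    if third in (first, second):
        return ("accept", third)  # the third Worker breaks the tie
    return ("escalate_to_auditor", None)
```

Combining both discrepancy strategies in one function is a design choice of this sketch; the slide lists them as alternatives.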

HIT Acceptance Latency

[Chart: HIT acceptance latency, 12/22/12 to 6/22/13; annotation: Changed Template]

• Measures HIT demand
• Template change decreased demand temporarily, but Workers acclimated

Worker Retention

[Chart: HITs completed, y-axis from 100,000 to 700,000; series: HITs Complete (New Workers), HITs Complete (Retained Workers)]

Within two months, 80% of HITs were completed by returning Workers

Pareto of Worker Volume

[Chart: share of HITs completed (0% to 90%) by Worker Percentile; buckets: Top 5%, 6-10%, 10-20%, 21-50%, 51-100%]

Our top 5% (~500) active Workers account for >80% of all HITs completed
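A figure like this can be computed from per-Worker HIT counts. The sketch below is one way to do it; the input shape and the name `top_share` are assumptions, and no real InfoScout data is used.

```python
import math


def top_share(hits_per_worker: dict, top_fraction: float = 0.05) -> float:
    """Share of all HITs completed by the top `top_fraction` of Workers by volume."""
    counts = sorted(hits_per_worker.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one Worker so small pools still give an answer.
    k = max(1, math.ceil(len(counts) * top_fraction))
    return sum(counts[:k]) / total
```

Calling `top_share(counts, 0.05)` on the active Worker pool would give the number quoted on the slide, and other fractions give the remaining Pareto buckets cumulatively.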

Analytics Demo


Appendix

Quality Control Strategies
• Filter incoming Workers
  – Qualifications
• Increase quality during completion
  – Template validation
  – Template instructions
• Post submission
  – Plurality (multiple HITs per task)
  – Known Answers
  – Workers audit Workers

Multiple strategies can yield high accuracy


HIT templates
• Clear & concise instructions
  – First-time Workers see detailed instructions and can hide them once they’re comfortable
• Keyboard shortcuts
• Maximize validation
  – Client-side and/or AJAX validation
• Bonus rewards
  – Nice option for rewarding Workers, especially when HITs are variable in length & time
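As an example of the kind of check an AJAX validation endpoint could run before a HIT is submitted, the sketch below validates a price format and a UPC-A check digit, which catches many “fat finger” errors. The function names and the price format are assumptions; the check-digit rule is the standard UPC-A one, and the slide's example UPC 052000209648 passes it.

```python
import re


def upc_a_is_valid(upc: str) -> bool:
    """Validate a 12-digit UPC-A code using its check digit."""
    if len(upc) != 12 or not (upc.isascii() and upc.isdigit()):
        return False
    digits = [int(ch) for ch in upc]
    odd_sum = sum(digits[0:11:2])   # 1st, 3rd, ..., 11th digits
    even_sum = sum(digits[1:10:2])  # 2nd, 4th, ..., 10th digits
    check = (10 - (3 * odd_sum + even_sum) % 10) % 10
    return check == digits[11]


def price_is_valid(price: str) -> bool:
    """Accept prices written as dollars and exactly two cents digits, e.g. 4.86."""
    return re.fullmatch(r"\d+\.\d{2}", price) is not None
```

Swapping the last two digits of the example UPC, a typical typing slip, makes `upc_a_is_valid` return `False`, so the template can prompt the Worker to recheck the entry.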
