Shipment address classification in logistics, Ravindra Babu, Flipkart


Transcript of Shipment address classification in logistics, Ravindra Babu, Flipkart

Motivation, Problem Definition and Solution Overview

Data Challenges, Modeling, Solutions and Deployment

Summary

Shipment Address Classification in Logistics in the absence of Geolocation Information

Dr. T. Ravindra Babu, Data Scientist,

Flipkart

August 1, 2015

Dr. T. Ravindra Babu, Data Scientist, Flipkart Shipment Address Classification in Logistics in the absence of Geolocation Information

Presentation Plan

Motivation, Problem Definition and Solution Overview

Data Challenges, Modeling, Solutions and Deployment

Summary

Motivation and Problem Definition

- Motivation

- Problem Definition

- Typical Operations Scenario at Delivery Hub without a model
  - Inscan of shipments received from Mother Hub
  - Manual reading of address; Assign to the Route/FE
  - Sorting and Delivery

- Overview of Proposed Solution
  - Capturing FEs' domain knowledge and modelling around it
  - Classifying an address as belonging to a pre-defined subarea
  - Allocation of the shipments to Route/FE based on a Machine Learning based Classifier
  - Sorting and Delivery

Delivery Hub and Subareas

Figure: Hub and Subareas

Insights into Address Data

- The number of words in an address ranges from 4 to 75, leaving a few outliers of more than 100.

- A word like Apartments is spelt in 263 different ways; whitefield 24 ways, industrial 25 ways, Bangalore 161 ways, karnataka 70 ways, etc.

- Structure in addresses is lacking even in a city like Bangalore. A few examples.

- Some words are specific to certain places/states. Examples: halli, hobli; bawdi, kuan; society; layout; etc.

- Addressing systems across the world: US, Europe, Korea, Japan; countries like Brazil, and India

Proposed Model

Figure: example caption

Preprocessing

- An elaborate preprocessing model was necessary that accounts for the following.
  - Retaining only those terms that possibly help classification (discriminability)
  - Merging of terms by empirical statistical models as well as domain knowledge based rules, n-grams, abbreviating, etc.
  - Developing data dependent dictionaries based on pattern clustering (Machine Learning) and forming an equivalent set

- Preprocessing reduces the vocabulary size by 65% as measured on a large dataset

Preprocessing for Data Compaction

Figure: Impact of Preprocessing

Models in Preprocessing :: Fraud Address Classification - Address Strings

Sl.No.  Address
1       adf6546s54f6sadfsd6dsa4f6sd54f6sd46fasd54sd6f
2       gasdfashagadfasmejastic
3       fdgdf
4       hjsdhaddsdsasdsa
5       dsfadafadsasdfsdafsda
6       hjsdhaddsdsasdsa
7       asd
8       lmflvml
9       assasfsafasfsasfsfsafashaphilomena
10      faskjbdasdlkjbsaasd

Models in Preprocessing :: Fraud Address Classification - Address Strings - Heatmap

Figure: MonkeyType Addresses

Models in Preprocessing :: Fraud Address Classification - Items Bought

Figure: Items bought by such people

Models in Preprocessing :: Probabilistic Separation of Compound Words

- To a large extent, addresses are not amenable to English dictionaries

- While writing addresses, it is often found that the customer inadvertently misses the space, or the space is removed during storage/retrieval

- Separating such compound words
  - Compute empirical probabilities of words
  - Assuming conditional independence, if the joint probability of a compound word is less than the product of the probabilities of the individual words, separate the words
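
The separation rule above can be sketched in a few lines. This is only an illustration under stated assumptions: probabilities are estimated from whitespace-token counts, every split point of a token is tried, and the most probable pair is kept. The names `unigram_probs` and `split_compound` and the toy corpus are hypothetical and do not come from the slides.

```python
from collections import Counter


def unigram_probs(addresses):
    """Empirical probability of each whitespace-separated token."""
    counts = Counter(tok for addr in addresses for tok in addr.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def split_compound(token, probs):
    """Return [left, right] if a split satisfies the rule, else [token].

    Rule from the slide: separate when P(compound) < P(left) * P(right).
    Trying every split point and keeping the best pair is an assumption
    made for this sketch.
    """
    best, best_score = None, probs.get(token, 0.0)
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        score = probs.get(left, 0.0) * probs.get(right, 0.0)
        if score > best_score:
            best, best_score = [left, right], score
    return best if best is not None else [token]


# Toy corpus (hypothetical): "main" and "road" are common, "mainroad" is rare.
corpus = ["main road"] * 10 + ["mainroad"]
probs = unigram_probs(corpus)
print(split_compound("mainroad", probs))  # ['main', 'road']
print(split_compound("main", probs))      # ['main']
```

Tokens whose parts never occur on their own are left untouched, since the product of the part probabilities is then zero.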

Models in Preprocessing :: Frequent Pattern Tree for n-gram Generation

- Frequent pattern tree is a celebrated approach in mining large datasets

- We implement a modified version of the tree to generate n-grams

- Conventional method

- New approach
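
The slides do not spell out how the tree was modified. One plausible reading, offered only as a sketch, is a count-annotated prefix tree over token sequences: each address contributes every run of up to `max_n` consecutive tokens, and paths whose count reaches a support threshold are read off as n-grams. The names `Node`, `build_tree`, `frequent_ngrams`, `max_n`, and `min_support` are hypothetical.

```python
class Node:
    """Prefix-tree node: how often the path ending here occurred."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_tree(addresses, max_n=3):
    """Insert every run of up to max_n consecutive tokens into the tree."""
    root = Node()
    for addr in addresses:
        tokens = addr.lower().split()
        for start in range(len(tokens)):
            node = root
            for tok in tokens[start:start + max_n]:
                node = node.children.setdefault(tok, Node())
                node.count += 1
    return root


def frequent_ngrams(root, min_support=2, min_n=2):
    """Collect token paths of length >= min_n seen at least min_support times."""
    found = {}
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        for tok, child in node.children.items():
            if child.count < min_support:
                continue  # longer paths through this node cannot be more frequent
            new_path = path + (tok,)
            if len(new_path) >= min_n:
                found[new_path] = child.count
            stack.append((child, new_path))
    return found


# Toy addresses (hypothetical).
addresses = [
    "12 bannerghatta road bangalore",
    "bannerghatta road near lake",
    "5 bannerghatta road",
]
ngrams = frequent_ngrams(build_tree(addresses), min_support=2)
print(ngrams)  # {('bannerghatta', 'road'): 3}
```

A frequent n-gram such as `bannerghatta road` could then be merged into a single term during preprocessing.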

Models in Preprocessing :: Clustering for equivalent set of words with spell variations - Ex. koramangala, electronics

koramanagala koromangala kormanagala koramnagala
koramangalato kanamangala koramanagla koremangala
koaramangala koramamgala karamangala tkoramangala
kormangalla koramongala koarmangala korammangala
koramangalla koramangale koramanagal

electronice eclectronic elelctronic eelectronic electronica electroincs
electronics electroninc electrinics electroncis electronincs

Models in Preprocessing :: Clustering for ... spell variations - Ex. Bannerghattaroad (61 variations)

bannerghattaroad, bannergattaroad, banerghattaroad, bannerghataroad,

bannerughattaroad, bannarghattaroad, banergattaroad,

banneraghattaroad, bannerghettaroad, bannerugattaroad,

bhannerghattaroad, bennerghattaroad, bannerghttaroad,

bannargattaroad, banarghattaroad, banneghattaroad, banneragattaroad,

bennarghattaroad, baneerghattaroad, bannergettaroad,

banngerghattaroad, banerghataroad, bannerghuttaroad, bannergatharoad,

benerghattaroad, bannerghattaroadto, bannergataroad,

bannergattharoad, banerghettaroad, bannerguttaroad, bannarghataroad,

bannnerghattaroad, bannarghettaroad, banerughattaroad,

bannergahttaroad, bhannerughattaroad, bennergattaroad,

bannerghattroad, bannaraghattaroad, bannerhattaroad,

bannerghatharoad, banneerghattaroad, bannaerghattaroad,

baneergattaroad, bhannergattaroad, bhanerghattaroad,

bannerughataroad, baneerghataroad, bannerghatroad, baneghattaroad,

bannerghtaroad, bannerghatttaroad, bannerghattharoad,

banneraghataroad, bannergahattaroad, bangerghattaroad,

banerghttaroad, bannegattaroad, baneraghattaroad, banngergattaroad,

bannerghatteroad
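
Equivalent sets like the ones above can be formed in many ways. The slides do not name the clustering algorithm or the distance used, so the following is only a sketch under assumptions: a single-pass leader clustering with Levenshtein edit distance and a hypothetical threshold `max_dist`. The names `edit_distance` and `leader_clusters` are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def leader_clusters(words, max_dist=2):
    """A word joins the first leader within max_dist, else becomes a leader."""
    clusters = {}
    for w in words:
        for leader in clusters:
            if edit_distance(w, leader) <= max_dist:
                clusters[leader].append(w)
                break
        else:
            clusters[w] = [w]
    return clusters


# Canonical spellings are listed first so that they become the leaders;
# this ordering is an assumption of the sketch.
words = ["koramangala", "electronics",
         "koramanagala", "koromangala", "koramangale",
         "electronice", "electrinics"]
clusters = leader_clusters(words)
for leader, members in clusters.items():
    print(leader, members)
```

Leader clustering depends on input order, and a fixed threshold may be too loose for short words and too tight for long ones such as bannerghattaroad; both would need tuning on real data.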

Models in Post-processing :: Semi-Supervised Methods

Discussion

Revisiting The Model

- Supervised Classification

Figure: example caption

Summary

- Novelty
  - Solution is novel and developed in-house
  - No similar solution found in the Literature

Thank You
