Beyond Kaggle: Solving Data Science Challenges at Scale
Transcript of Beyond Kaggle: Solving Data Science Challenges at Scale
![Page 1: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/1.jpg)
DRAFT
Think Big, Start Smart, Scale Fast
Dato Conference
Data Matching and Deduplication using Dato Toolkits
July 21st, 2015
Guillermo Breto Rangel, PhD
![Page 2: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/2.jpg)
Entity Resolution: Multiple Definitions
Entity Resolution (ER)
Extract, match and disambiguate entity records in data.
![Page 3: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/3.jpg)
Extract, match and disambiguate entity records in data.
Entity Resolution: Real World Entity
Matching real-world entities with profiles, mentions...
You: Facebook account(s), LinkedIn profile(s), Tweets, Google Searches
Many records → ER → Unique identities
![Page 4: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/4.jpg)
Entity Resolution: Use Cases
◆ Network Analysis
◆ Vocabulary Normalization: different organizations report different names for the same entities
◆ Network Security: finding user actions/intents
◆ Data Cleaning: removing duplicated records
◆ Metadata Enrichment: matched records append metadata to the entity
![Page 5: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/5.jpg)
Entity Resolution: Challenges
◆ Missing Values
◆ Data entry errors
◆ Abbreviations and formatting
◆ Data volume
◆ Variety of raw data sources
  o free text, semi-structured, streaming
◆ Data integration from multiple sources
◆ Preprocessing
◆ Normalization
◆ Choosing similarity metrics
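Several of the challenges above (missing values, abbreviations and formatting, normalization) are usually handled before any matching takes place. The following is a minimal sketch of that idea in plain Python, not the Dato toolkit's own code; the `ABBREVIATIONS` table and the `normalize` helper are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical abbreviation table; a real project would curate its own.
ABBREVIATIONS = {"inc": "incorporated", "corp": "corporation", "st": "street"}


def normalize(value):
    """Lowercase, strip punctuation and expand abbreviations.

    Missing or blank values are returned as None so that later steps
    can treat them explicitly instead of matching on empty strings.
    """
    if value is None or not value.strip():
        return None
    text = re.sub(r"[^\w\s]", " ", value.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


print(normalize("Dato, Inc."))  # dato incorporated
print(normalize("   "))         # None
```

Keeping missing values as `None` rather than an empty string avoids treating two blank fields as a perfect match.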
![Page 6: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/6.jpg)
Dataset: DBpedia/Amazon-Google Products
Putting a schema to Wikipedia
Crowd-sourced community project
Queries against Wikipedia data
Match data sets on the Web to Wikipedia data
A set of triples → <dbpedia:Luc_Besson> <dbpedia-owl:spouse> <dbpedia:Milla_Jovovich>
Matching Amazon Products and Google Products
Deich Library and
![Page 7: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/7.jpg)
Preprocessing: Steps
1) Extract tokens
2) Clean triplets
3) Pivot table
4) Select relevant features
5) Normalization
6) Choosing similarity metrics
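The first three steps can be sketched on the prefixed triple format shown on the dataset slide. This is an illustrative sketch in plain Python, not the code used in the talk; the helper names `parse_triple`, `clean` and `pivot` are assumptions, and real DBpedia dumps typically use full URIs rather than the short prefixes used here.

```python
import re
from collections import defaultdict

# Matches three <...> tokens, with or without whitespace between them.
TRIPLE = re.compile(r"<([^>]+)>\s*<([^>]+)>\s*<([^>]+)>")


def parse_triple(line):
    """Step 1: extract the three tokens of a triple, or None if malformed."""
    match = TRIPLE.search(line)
    return match.groups() if match else None


def clean(token):
    """Step 2: drop the namespace prefix and replace underscores with spaces."""
    return token.split(":", 1)[-1].replace("_", " ")


def pivot(lines):
    """Step 3: build one row per subject with one column per predicate.

    Simplification: a repeated predicate overwrites the earlier value.
    """
    table = defaultdict(dict)
    for line in lines:
        parsed = parse_triple(line)
        if parsed is None:
            continue  # skip malformed lines
        subject, predicate, obj = (clean(t) for t in parsed)
        table[subject][predicate] = obj
    return dict(table)


lines = ["<dbpedia:Luc_Besson> <dbpedia-owl:spouse><dbpedia:Milla_Jovovich>"]
print(pivot(lines))
# {'Luc Besson': {'spouse': 'Milla Jovovich'}}
```

Once the data is in this row-per-entity shape, feature selection and normalization (steps 4 and 5) operate on ordinary columns.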
![Page 8: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/8.jpg)
Algorithm: Nearest Neighbors
● The entity resolution problem is approached as a network problem
  ○ Nodes: entity records
  ○ Edges: similarity measures
● Define a distance between entities to find the nearest neighbors. Composite distances can be built from Euclidean, squared Euclidean, Levenshtein, Jaccard, Manhattan, cosine, and dot product distances
● Compute the distance between all entities and find the nearest neighbors
● Duplicates are the connected components of the graph; each component is labeled as one entity
● Some parameters to keep in mind are:
  ○ Grouping_features
  ○ k (number of neighbors to compare)
  ○ Radius (the distance threshold)
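The approach above can be sketched end to end in plain Python. This is a brute-force illustration under stated assumptions, not the Dato toolkit implementation: the record fields (`name`, `brand`), the weights, and the function names are hypothetical, the sample records are made up, and the grouping step (blocking on grouping features) is omitted for brevity.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over lowercase whitespace tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def composite_distance(r1, r2, w_name=0.7, w_brand=0.3):
    """Weighted sum of two distances; the weights are illustrative only."""
    name_d = jaccard_distance(r1["name"], r2["name"])
    longest = max(len(r1["brand"]), len(r2["brand"]), 1)
    brand_d = levenshtein(r1["brand"].lower(), r2["brand"].lower()) / longest
    return w_name * name_d + w_brand * brand_d


def deduplicate(records, k=2, radius=0.5):
    """Link each record to its k nearest neighbors within `radius`,
    then return the connected components as lists of record indices."""
    n = len(records)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        dists = sorted(
            (composite_distance(records[i], records[j]), j)
            for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            if d <= radius:
                parent[find(i)] = find(j)  # add an edge i - j

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up product records, for illustration only.
records = [
    {"name": "Adobe Photoshop CS3", "brand": "Adobe"},
    {"name": "adobe photoshop cs3 upgrade", "brand": "adobe"},
    {"name": "Microsoft Office 2007", "brand": "Microsoft"},
]
print(deduplicate(records))  # [[0, 1], [2]]
```

The all-pairs loop is quadratic in the number of records, which is why grouping features and a bounded k matter once the data volume grows.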
![Page 9: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/9.jpg)
Results:
The benchmark results can be found at:
https://github.com/cubreto/dataDeduplication
![Page 10: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/10.jpg)
Lessons Learned:
◆ Most of the time is spent on preprocessing
◆ Hard to define the distance threshold
◆ Weighting the composite distance
◆ Data volume
◆ Dealing with missing values
◆ Tuning the parameters
◆ Finding exact matches
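One common way to approach the threshold question, assuming a small set of labeled pairs is available, is to sweep the radius and watch precision and recall move in opposite directions. The sketch below is an illustration of that idea only; the function name `precision_recall` is hypothetical and the distances and labels are made up, not results from the benchmark.

```python
def precision_recall(scored_pairs, radius):
    """scored_pairs is a list of (distance, is_true_match) tuples.

    A pair is predicted as a duplicate when its distance is <= radius.
    By convention, precision is 1.0 when nothing is predicted.
    """
    predicted = [is_match for d, is_match in scored_pairs if d <= radius]
    true_pos = sum(predicted)
    total_matches = sum(is_match for _, is_match in scored_pairs)
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / total_matches if total_matches else 1.0
    return precision, recall


# Made-up distances and labels, for illustration only.
pairs = [(0.10, True), (0.20, True), (0.35, False), (0.40, True), (0.80, False)]
for radius in (0.15, 0.30, 0.45):
    p, r = precision_recall(pairs, radius)
    print(f"radius={radius:.2f} precision={p:.2f} recall={r:.2f}")
```

A larger radius recovers more true duplicates but also admits more false links, and each false link can merge two connected components into one entity.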
![Page 11: Beyond Kaggle: Solving Data Science Challenges at Scale](https://reader034.fdocuments.in/reader034/viewer/2022042717/55d09badbb61ebb0058b4607/html5/thumbnails/11.jpg)
Some Resources/Bibliography
◆ Ricardo Vasquez Sierra, PhD: Senior Data Scientist from Ooyala
◆ Kevin Glynn, MS: Data Scientist and Khan Academy Instructor
◆ Vince Gonzalez: MapR Software Engineer
◆ Alexey Svyatkovskiy, PhD: Big Data Scientist, Princeton University
◆ Ashwin Machanavajjhala, PhD: Professor of Computer Science, Duke University
◆ Lise Getoor, PhD: Professor of Computer Science, UC Santa Cruz
  o KDD Tutorial on Entity Resolution in Big Data
  o Deduplication and Group Detection using Links, Indrajit Bhattacharya and Lise Getoor, The 10th ACM SIGKDD Workshop on Link Analysis and Group Detection (LinkKDD-04).
  o Collective Entity Resolution in Relational Data, Indrajit Bhattacharya and Lise Getoor, ACM Transactions on Knowledge Discovery from Data (ACM-TKDD), 2007
◆ The Dato Team
◆ My colleagues at Think Big