Democratization of Data @Indix


Transcript of Democratization of Data @Indix

Page 1: Democratization of Data @Indix

Democratization of Data

Why and how we built an internal data pipeline platform @Indix

Page 2: Democratization of Data @Indix

About me

Manoj Mahalingam

Principal Engineer @Indix

Page 3: Democratization of Data @Indix

Six Business Critical Indexes

● People

● Documents

● Businesses

● Places

● Products

● Connected Devices

Page 4: Democratization of Data @Indix

Enabling businesses to build location-aware software.

~3.6 million websites use Google Maps

Enabling businesses to build product-aware software.

Indix catalogs over 2.1 billion product offers

Indix - The “Google Maps” of Products

Page 5: Democratization of Data @Indix

Data Pipeline @Indix (diagram; components as labeled on the slide)

● Crawling Pipeline: Brand & Retailer Websites, Seed, Crawl, Parse, Crawl Data

● Feeds Pipeline: Brand & Retailer Feeds, Transform, Clean, Connect, Feed Data

● Data Pipeline (ML): Dedupe, Classify, Extract Attributes, Standardize, Match, Aggregate

● Indix Product Catalog

● Indexing Pipeline (Real Time): Index, Analyze, Derive, Join

● Other labels: Customizable Feeds, Search & Analytics, Index, API (Bulk & Synchronous), Product Data Transformation Service

Page 6: Democratization of Data @Indix

Democratization of Data

Enable everyone in the organization to know what data is available, and then understand and work with it.

Page 7: Democratization of Data @Indix

At Indix, we have and work with a lot of data.

Page 8: Democratization of Data @Indix

Scale of Data @ Indix

● 2.1 Billion Product URLs

● 8 TB HTML Data Crawled Daily

● 1B Unique Products

● 7000 Categories

● 120 B Price Points

● 3000 Sites

Page 9: Democratization of Data @Indix

● We have data in different shapes and sizes.

● HTML pages, Thrift and Avro records.

● And also the usual suspects - CSVs and plain text data.

Page 10: Democratization of Data @Indix

● Datasets can be in TBs or a few hundred KBs.

● A few billion records or a couple of hundred.

Page 11: Democratization of Data @Indix

But...the data’s potential couldn’t be realized

Page 12: Democratization of Data @Indix

Data wasn’t discoverable

● The biggest problem was in knowing what data exists and where.

● Some of the data was in S3. Some in HDFS. Some in Google Sheets.

● There was no way to know when or how frequently the data changed or was updated.

Page 13: Democratization of Data @Indix

The schema wasn’t readily known

● The schema of the data, as expected, kept changing and it was difficult to keep track of which version of data had which schema.

● While Thrift and Avro alleviate this to an extent, access to data wasn’t simple, especially for non-engineers.

Page 14: Democratization of Data @Indix

Writing code limited scope

● We use Scalding and Spark for our MR jobs. Having to code and tweak the jobs limited who could write and run them.

● “Readymade” jobs may not allow the tweaks a user needs, which hurts productivity and increases dependencies on other teams.

● Having to write code and ship jars hinders ad hoc data experimentation.

Page 15: Democratization of Data @Indix

Cost control wasn’t trivial

● While data came in various sizes and shapes, what people did with the data also varied - some use cases needed a sample of the data, while others wanted aggregations on the entire data.

● It wasn’t trivial to handle all the different workloads while minimizing costs.

● There was also the problem of ad hoc jobs starving production jobs in our existing Hadoop clusters.

Page 16: Democratization of Data @Indix

Goals of Internal Data Pipeline Platform

● Enable easy discovery of data.

● Allow schema to be transparent and easy to create while also allowing introspection.

● Minimal coding - have prebuilt transformations for common tasks and enable SQL based workflow.
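
The discovery and schema goals above can be pictured as a small dataset catalog. This is only a minimal sketch in plain Scala, not how MDA is implemented: `DatasetCatalog`, `Field`, `DatasetVersion`, the field names and the `s3://example-bucket/...` paths are all hypothetical; only the `Indix` / `PriceHistoryProfile` dataset name is taken from the notebook slide.

```scala
// Hypothetical sketch: none of these types or paths come from MDA itself.
object DatasetCatalog {
  // One field in a dataset schema, e.g. Field("price", "double").
  final case class Field(name: String, dataType: String)

  // A registered dataset version: where it lives and what it looks like.
  final case class DatasetVersion(
      namespace: String,
      name: String,
      version: Int,
      location: String, // placeholder path, e.g. in S3 or HDFS
      schema: List[Field]
  )

  // Discovery: find datasets whose name contains the term (case-insensitive).
  def search(catalog: Seq[DatasetVersion], term: String): Seq[DatasetVersion] =
    catalog.filter(_.name.toLowerCase.contains(term.toLowerCase))

  // Latest registered version of a dataset, if any.
  def latest(catalog: Seq[DatasetVersion], namespace: String, name: String): Option[DatasetVersion] =
    catalog
      .filter(d => d.namespace == namespace && d.name == name)
      .sortBy(_.version)
      .lastOption

  // Introspection: fields present in `newer` but not in `older`.
  def addedFields(older: DatasetVersion, newer: DatasetVersion): List[Field] =
    newer.schema.filter(f => !older.schema.contains(f))

  // Made-up example entries for illustration.
  val example: Seq[DatasetVersion] = Seq(
    DatasetVersion("Indix", "PriceHistoryProfile", 1, "s3://example-bucket/price-history/v1",
      List(Field("productId", "string"), Field("price", "double"))),
    DatasetVersion("Indix", "PriceHistoryProfile", 2, "s3://example-bucket/price-history/v2",
      List(Field("productId", "string"), Field("price", "double"), Field("currency", "string")))
  )
}
```

With entries like these, `DatasetCatalog.latest(DatasetCatalog.example, "Indix", "PriceHistoryProfile")` answers "which version is current and where is it", and `addedFields` answers "what changed in the schema between versions".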

Page 17: Democratization of Data @Indix

Goals of Internal Data Pipeline Platform

● UI and Wizard based workflow to enable ANYONE in the organization to run pipelines and extract data.

● Manage underlying clusters and resources transparently while optimizing for costs.

● Support data experimentation and also production / customer use cases.

Page 18: Democratization of Data @Indix

MDA - Marketplace of Datasets and Algorithms

Page 19: Democratization of Data @Indix

Tech Stack

Page 20: Democratization of Data @Indix

MDA - DEMO!!!

Page 21: Democratization of Data @Indix

MDA with our Data Pipeline

Match, Attributes, Brand, Classify, Dedup

Page 22: Democratization of Data @Indix

MDA with our Data Pipeline

Match, Attributes, Brand, Classify, Dedup

Feed data from Customer, Enrich Data, Classify, Brand, Feed output to customer

Page 23: Democratization of Data @Indix

MDA for ML Training Data

Filter, Sample, Preprocess

Training Data
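
The Filter, Sample and Preprocess steps on this slide can be sketched as three small functions chained together. This is a minimal sketch in plain Scala collections under stated assumptions, not MDA's actual transforms: the `Product` record, the category filter, the seeded per-record sampling and the title normalisation are all hypothetical stand-ins.

```scala
import scala.util.Random

// Hypothetical sketch: record shape and step logic are illustrative only.
object TrainingData {
  final case class Product(title: String, category: String)

  // Filter: keep only records in the category of interest.
  def filterByCategory(records: Seq[Product], category: String): Seq[Product] =
    records.filter(_.category == category)

  // Sample: keep each record with probability `fraction`.
  // A fixed seed makes the sample repeatable across runs.
  def sample(records: Seq[Product], fraction: Double, seed: Long): Seq[Product] = {
    val rng = new Random(seed)
    records.filter(_ => rng.nextDouble() < fraction)
  }

  // Preprocess: normalise titles (trim, lowercase, collapse whitespace).
  def preprocess(records: Seq[Product]): Seq[Product] =
    records.map(p => p.copy(title = p.title.trim.toLowerCase.replaceAll("\\s+", " ")))

  // The three steps chained in the order shown on the slide.
  def build(records: Seq[Product], category: String, fraction: Double, seed: Long): Seq[Product] =
    preprocess(sample(filterByCategory(records, category), fraction, seed))
}
```

Filtering before sampling keeps the sample fraction relative to the records of interest rather than to the whole dataset, which is usually what a training set needs.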

Page 24: Democratization of Data @Indix

Notebooks

// Set up the MDA client
import com.indix.holonet.core.client.SDKClient

val host = "holonet.force.io"
val port = 80
// `spark` is assumed to be the SparkSession provided by the notebook
val client = SDKClient(host, port, spark)

// Create a DataFrame from any MDA dataset
val df = client.toDF("Indix", "PriceHistoryProfile")
df.show
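
Once a dataset is available as a DataFrame, the SQL based workflow mentioned in the goals can be followed with standard Spark calls. The sketch below is an assumption-laden illustration: it does not use `SDKClient` (which is internal to Indix), the `productId` and `price` columns are made up rather than the real `PriceHistoryProfile` schema, and `NotebookSketch`, `standIn` and the `price_history` view name are hypothetical. Only `createOrReplaceTempView`, `spark.sql` and `toDF` are actual Spark APIs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NotebookSketch {
  // Register the DataFrame as a temporary view and query it with SQL.
  // Column names here are assumptions, not the real dataset schema.
  def averagePriceByProduct(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("price_history")
    spark.sql(
      "SELECT productId, AVG(price) AS avgPrice " +
        "FROM price_history GROUP BY productId ORDER BY productId")
  }

  // Stand-in data so the sketch runs without MDA;
  // in a notebook this would be the result of client.toDF(...).
  def standIn(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("p1", 10.0), ("p1", 20.0), ("p2", 5.0)).toDF("productId", "price")
  }
}
```

In a notebook, the call would look like `NotebookSketch.averagePriceByProduct(spark, df).show`, with `df` coming from the MDA client.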

Page 25: Democratization of Data @Indix

Dec 2015: Start work on MDA

Mar 2016: First release

Jul 2016: Lot more transforms including sampling, full Hive SQL support and UX fixes

Late 2016: Performance improvements, Spark and infra upgrades.

Early 2017: Completely redesign the UI based on over a year of feedback and learnings. GraphQL for the UI.

June 2017: Ability to run pipelines in customer’s cloud infra

Aug 2017: First closed preview of MDA for a customer

Page 26: Democratization of Data @Indix

What does the future hold?

● We are far from done - things like automatic schema inference and better caching are already planned.

● And as is the original vision, make it fully self-served for our customers (internal and external).

● Integration with other tools out there like Superset.

● Open source as much as possible. First cut - http://github.com/indix/sparkplug

Page 27: Democratization of Data @Indix

Questions?

I blog at https://stacktoheap.com

Twitter and most other platforms @manojlds