[Slide 1]
Real-Time Insights for IoT and Applications
Max Zuckerman
Sr. Director, Solutions Sales
Alooma
#TC18
[Slide 2]
First of all, let's talk cloud. But I already use the cloud...
[Slide 3]
Expand Who Can Use the Data
● Analysts
● Marketing
● Data Eng
● Business Units
● Product
● Advertising
● Customer Service
[Slide 4]
OK, so how do I get my data there? I'll just write a simple script…
[Slide 5]
Getting Data to the Cloud
Data Sources → ETL → Data Warehouses → Analytics & BI
70% of time is spent on ETL, which is unscalable, rigid, leaky, and slow.
[Slide 6]
Basic Architecture of a Unified Data Pipeline
Sources: Data Stores, SaaS Providers, Apps & Agents, WebHooks
● Extract: Fetch Tasks; Receive Endpoints (one per input type)
● Transform: Normalize; Fix & Cleanse; Enrich; Merge, Split & Join; Update Schemas; Convert to Output Format
● Load: Copy to Staging Env; Load; Merge
● Transform (in the warehouse): Retention; Cohorts
[Slide 11]
The Main Challenge of a Data Pipeline Is Integrity
● Managing dynamic schemas
● Handling errors
● Transmitting data exactly once
[Slide 12]
Let's tackle dynamic schemas. Because the only constant is change.
[Slide 13]
Challenge: Schemas Always Change
● NoSQL == no schema
● New features, new events
● Backend apps require monitoring
● Weekly A/B tests
● New marketing tools
● Salesforce data
● New tables, new columns
[Slide 14]
Partial Solution: “Guesstimate”
● Very early on we built our Automapper:
○ Periodic process: Detect schema changes → Modify schemas on the data warehouse → Update schema mappings
● Detecting schema changes was based on sampling the events and guessing the data types of the different values
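A minimal sketch of how such type guessing might work, assuming JSON-like events; `guess_type`, `guess_schema`, and the warehouse type names are illustrative assumptions, not Alooma's actual code:

```python
# Hypothetical "guesstimate" automapper: sample events, tally the observed
# type of every field, and pick the most common type per field.
from collections import Counter, defaultdict

def guess_type(value):
    """Map a single JSON value to a coarse warehouse column type."""
    if isinstance(value, bool):   # check bool before int (bool is an int subclass)
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "VARCHAR"              # fall back to text for everything else

def guess_schema(sampled_events):
    """Vote on a column type per field across a sample of events."""
    votes = defaultdict(Counter)
    for event in sampled_events:
        for field, value in event.items():
            votes[field][guess_type(value)] += 1
    # Majority wins; dirty values in a minority of events don't flip the type.
    return {field: counter.most_common(1)[0][0]
            for field, counter in votes.items()}
```

The majority vote is what makes this a "guesstimate": a handful of malformed values won't change a column's inferred type, but a genuinely ambiguous field can still be guessed wrong, which motivates the schema-import approach on the next slide.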
[Slide 15]
More Complete Solution: “Guesstimate” + Schema Import
● Import schemas from any data source that supports it
○ Added an import_schema(schema_id) function to our inputs: MySQL, Postgres, MSSQL, Oracle, Mongo, Salesforce, Google AdWords, ...
○ Added a schema_id metadata field to all events
○ New Automapper: Detect schema change → Try to extract schema_id and import the updated schema → Translate source schema to data warehouse schema → Modify schemas on the data warehouse → Update schema mappings
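The "translate source schema to data warehouse schema" step might look like the sketch below; the type map is a made-up illustration for a MySQL-like source, not the real mapping, and `translate_schema` is an assumed name:

```python
# Illustrative source-to-warehouse type translation. Real mappings depend on
# the specific source database and target warehouse.
MYSQL_TO_WAREHOUSE = {
    "int": "BIGINT",
    "varchar": "VARCHAR",
    "text": "VARCHAR",
    "datetime": "TIMESTAMP",
    "double": "DOUBLE",
}

def translate_schema(source_columns):
    """Map (name, source_type) pairs to warehouse column types,
    falling back to VARCHAR for types we don't recognize."""
    return {name: MYSQL_TO_WAREHOUSE.get(src_type.lower(), "VARCHAR")
            for name, src_type in source_columns}
```

Because the source declares its own types, this avoids the guessing errors of sampling; the guesstimate path remains only as a fallback for sources with no importable schema.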
[Slide 16]
OK, but what about errors?
[Slide 17]
Challenge: The Pipeline May Fail at Any Stage
● Failed value conversions: 37%
● Data warehouse errors: 19%
● Missing required values: 11%
● Changed schemas: 11%
● Custom logic failures: 9%
● Strings to integers: 9%
● Overly long strings: 5%
* Statistics taken from 1B events in our Restream Queues
[Slide 18]
Solution: Store All Failed Events and Failure Details
● Build a central, “catch-all” event store that can also save error details
● Program all pipeline components to write to it in case of error
● Always store the original, unmodified event to allow simple re-processing
● Design a querying interface to enable easy classifying, grouping, and prioritizing of errors
● Beware of:
○ pipeline loops
○ multiple pipeline destinations
○ complex processing (event splitting / joining / enrichment)
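As a rough illustration of such a catch-all store, here is a sketch with an in-memory list standing in for the real queue; all function and field names are assumptions, and the failure counter shows one way to guard against the pipeline loops mentioned above:

```python
# Hypothetical catch-all failed-event store. Key properties from the slide:
# the original event is kept unmodified, and failure metadata rides alongside.
import json
import time

MAX_FAILURES = 5  # after this many retries, stop re-processing (loop guard)

def record_failure(queue, original_event, component, reason, failure_count=0):
    """Append a failed event, in its original form, plus error details."""
    if failure_count >= MAX_FAILURES:
        # Too many retries: park the event so it can't loop forever.
        component = component + ":parked"
    queue.append({
        "original_event": json.dumps(original_event),  # unmodified payload
        "component": component,          # which pipeline stage failed
        "reason": reason,                # classifiable failure reason
        "failure_count": failure_count + 1,
        "failed_at": time.time(),
    })
```

Storing the `reason` and `component` as structured fields is what makes the querying interface practical: errors can be grouped by type and prioritized without parsing free-text logs.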
[Slide 19]
Implementation V1
● Kafka queue to store all failed events (in original format)
● Metadata on each failed event:
○ Failure reason
○ Number of failures (to avoid loops)
● Notification service that aggregates failure notifications by error type and similar parameters
● Redis to store samples of failed events
[Slide 20]
Implementation V2 (architecture diagram; details not captured in the transcript)
[Slide 21]
Uh oh, won't this mean duplicates? Or accidentally overwriting newer data with old data?
[Slide 22]
Challenge: Transmit All Events Once, and Only Once
At-least-once is straightforward; only-once is not.
[Slide 23]
Solution: Idempotency
● In simple words: a duplicate event will overwrite itself
● In practice, this is still difficult:
○ What identifies an event? Its content? Its ID?
○ Not all data warehouses support idempotency
○ Every component in the pipeline may fail and resend its last few events upon resuming
[Slide 24]
Implementation: Track IDs and Batches End-to-End
● Identifying events & duplicates
○ Generate an event ID & batch ID as early as possible (at the data source)
○ Where possible, make the IDs sequential
● Idempotency with cloud data warehouses
○ Primary key constraints are not validated on cloud data warehouses, so overwriting is not an option
○ We must keep track of loaded events (or batches) in a separate table and:
■ Verify writing to both tables is atomic (using a transaction)
■ Check whether events have been written before writing every new batch
○ Expand this to encompass your whole pipeline, either end-to-end or step-by-step
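The batch-tracking pattern above can be sketched as follows, using SQLite purely for illustration (cloud warehouses differ in syntax and capabilities, but the lookup table plus atomic dual write is the point); table and function names are assumptions:

```python
# Batch-level idempotency sketch: skip batches we've already loaded, and
# write the rows and the batch record in one transaction so a crash can't
# leave the two tables inconsistent.
import sqlite3

def load_batch(conn, batch_id, rows):
    """Load a batch exactly once. Returns True if loaded, False if skipped."""
    cur = conn.execute(
        "SELECT 1 FROM loaded_batches WHERE batch_id = ?", (batch_id,))
    if cur.fetchone():
        return False  # duplicate delivery: batch already loaded, do nothing
    with conn:  # one transaction: both inserts commit together or not at all
        conn.executemany(
            "INSERT INTO events (event_id, payload) VALUES (?, ?)", rows)
        conn.execute(
            "INSERT INTO loaded_batches (batch_id) VALUES (?)", (batch_id,))
    return True
```

An upstream component that crashes and resends its last batch then causes a harmless skip instead of duplicate rows, which is exactly the failure mode described on slide 23.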
[Slide 25]
So, does all of this work?
[Slide 26]
“Alooma just works. We're able to leverage the platform's real-time architecture to iterate on solutions in minutes rather than days, giving us the agility our business demands.”
[Slide 27]
“Alooma makes it easy for us to join data from across all marketing channels without draining our internal engineering resources on untested solutions.”
[Slide 28]
“Alooma gives our marketing organization real-time data analytics pulled from across all app and ad sources without draining our internal engineering resources or sacrificing our security practices.”
[Slide 29]
“Alooma provides us with the scalable, robust, and flexible infrastructure needed to accelerate our real-time analytics across the entire company.”
[Slide 30]
Can I see this in action?
[Slide 31]
Closing Thoughts
[Slide 32]
Closing Thoughts
● By leveraging a cloud-based data warehouse and ETL platform, you can dramatically improve data usage across the organization while making the data more reliable
● 99.999% data integrity requires deliberate design and a tested implementation
● When you plan your next analytics platform:
○ Plan for a highly scalable, simple-to-manage data warehouse
○ Include a schema repository & automated mapping tasks
○ Integrate sources with end-to-end idempotency in mind
○ Design a catch-all error queue & an iterative manual fixing process
● Come say hi and catch a live demo at our booth: #417