Schema registries and Snowplow
Schema Registries
Mike Robins | Co-founder
linkedin.com/in/mikerobins
Snowplow Philosophy
- Open source (ALv2) or managed (paid)
- Batch or real time
- Collect everything (web, mobile, IoT, webhooks)
- Ownership of data matters
- Data modelling should be first class and flexible
- BYO toolset (Spark, Drill, Beam etc)
Hazards of many languages
● Imagine all employees are required to speak only in their native language.
● Either everyone has to be multilingual, or expensive translators must be added for every pair of languages spoken.
  ○ Even if you have a sophisticated and efficient way of getting messages from place to place, you’re still stuck with the overhead of constant translation.
A schema
● A shared contract between a consumer and a producer
● Prior art
  ○ Avro, Thrift, Protobuf etc
Key attributes of schema technologies
● Code generation – for bindings to your schemas in a given programming language
● Data encodings
● Validation rules - for calibration and sanity
● Types – a description of the type of data
● Schema evolution
Copyright Frank Drake, NASA (1977) License: CC BY-NC-ND 2.0
iglu:com.<myco>/<event>/jsonschema/1-0-0
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for <…>",
  "self": {
    "vendor": "<com.myco>",
    "name": "<event>",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "action_context": { "type": "string" }
    ...
  },
  "required": ["subject", "event_id"],
  "additionalProperties": false
}
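Parts of the schema URI iglu:com.<myco>/<event>/jsonschema/1-0-0
● iglu: the URI scheme is Iglu
● com.<myco>: the vendor of this schema
● <event>: the name of this schema
● jsonschema: the schema format
● 1-0-0: the schema version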
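The URI anatomy above can be sketched as a small parser. This is a minimal illustration, not Iglu's own client code: the regular expression, the `IgluUri` type, and the `checkout` event name are assumptions made for the example. The three version numbers are named after Snowplow's SchemaVer convention (MODEL-REVISION-ADDITION).

```python
import re
from typing import NamedTuple


class IgluUri(NamedTuple):
    """Hypothetical container for the parts of an Iglu schema URI."""
    vendor: str
    name: str
    format: str
    model: int
    revision: int
    addition: int


# Illustrative pattern only; it is an assumption, not the official Iglu grammar.
_IGLU_RE = re.compile(
    r"^iglu:(?P<vendor>[A-Za-z0-9_.-]+)/(?P<name>[A-Za-z0-9_-]+)/"
    r"(?P<format>[A-Za-z0-9_-]+)/"
    r"(?P<model>\d+)-(?P<revision>\d+)-(?P<addition>\d+)$"
)


def parse_iglu_uri(uri: str) -> IgluUri:
    """Split an Iglu URI into vendor, name, format and version parts."""
    match = _IGLU_RE.match(uri)
    if match is None:
        raise ValueError(f"not a valid Iglu schema URI: {uri!r}")
    parts = match.groupdict()
    return IgluUri(
        parts["vendor"],
        parts["name"],
        parts["format"],
        int(parts["model"]),
        int(parts["revision"]),
        int(parts["addition"]),
    )


# "com.myco" and "checkout" are placeholder values for the example.
example = parse_iglu_uri("iglu:com.myco/checkout/jsonschema/1-0-0")
print(example)
```

Keeping the version as three integers rather than one string makes it straightforward to compare versions of the same schema later on.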
Schema storage
● Option 1: Send the entire definition with the record
Record Record Record Record
Schema Schema Schema Schema
Schema storage
● Option 2: Send a pointer to the definition
*Schema *Schema *Schema *Schema
Record Record Record Record
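The two storage options can be compared with a short sketch. The schema and event contents below are made up for illustration. The `schema`/`data` envelope in option 2 follows the shape of Snowplow's self-describing JSON, while the exact URI is a placeholder.

```python
import json

# Hypothetical schema definition and event, for illustration only.
schema_definition = {
    "type": "object",
    "properties": {
        "subject": {"type": "string"},
        "event_id": {"type": "string"},
    },
    "required": ["subject", "event_id"],
    "additionalProperties": False,
}
event = {"subject": "order", "event_id": "e-1"}

# Option 1: the entire definition travels with every record.
embedded = json.dumps({"schema": schema_definition, "data": event})

# Option 2: each record carries only a pointer to the definition.
referenced = json.dumps(
    {"schema": "iglu:com.myco/checkout/jsonschema/1-0-0", "data": event}
)

print(len(embedded), len(referenced))
```

With option 2 the per-record overhead stays constant however large the schema grows, but every consumer needs somewhere to resolve the pointer.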
Schema storage
Schema registry
● A canonical, shared source of truth
● Within and between organisations
Why?
● Data governance
  ○ Safe schema evolution
  ○ Policy enforcement
● Data pipeline resilience
● Data discovery
● Efficiency
  ○ Cost
  ○ Storage
  ○ Computation
● Shares principles with software engineering CI/CD
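As a rough sketch of the idea, a registry can be modelled as a lookup from schema URI to definition, plus one governance policy: a published version is never silently changed. The `SchemaRegistry` class and its methods are hypothetical and do not reflect the API of Iglu or any other real registry.

```python
class SchemaRegistry:
    """Hypothetical in-memory registry: one definition per schema URI."""

    def __init__(self):
        self._schemas = {}

    def register(self, uri, definition):
        """Publish a schema; refuse to change an already published version."""
        existing = self._schemas.get(uri)
        if existing is not None and existing != definition:
            raise ValueError(f"{uri} is already published with a different definition")
        self._schemas[uri] = definition

    def resolve(self, uri):
        """Return the definition that a pointer refers to."""
        try:
            return self._schemas[uri]
        except KeyError:
            raise LookupError(f"unknown schema: {uri}") from None


registry = SchemaRegistry()
uri = "iglu:com.myco/checkout/jsonschema/1-0-0"  # placeholder URI
registry.register(uri, {"type": "object", "required": ["event_id"]})

try:
    # Changing a published version in place is rejected; publish a new version instead.
    registry.register(uri, {"type": "object", "required": []})
    overwrite_rejected = False
except ValueError:
    overwrite_rejected = True

print(registry.resolve(uri), overwrite_rejected)
```

Treating published versions as immutable is what lets producers and consumers upgrade independently, much as versioned artefacts do in a CI/CD pipeline.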
Key takeaways
Schemas are critical, and a shared repository of every schema the organisation uses makes siloed knowledge shared and explicit.
By using schemas, the data definition for a particular kind of data exists in a single place.
Schemas serve as self-contained and automatically enforceable contracts between producers and consumers of data.
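To illustrate what "automatically enforceable" can mean, here is a deliberately tiny validator. It is a sketch covering only three JSON Schema keywords (`required`, `additionalProperties`, and the `string` type), not a full JSON Schema implementation. The `validate` function and the example schema are assumptions for this illustration.

```python
def validate(record, schema):
    """Return a list of contract violations (empty if the record conforms)."""
    errors = []
    properties = schema.get("properties", {})

    # Every required field must be present.
    for field in schema.get("required", []):
        if field not in record:
            errors.append(f"missing required field: {field}")

    # With additionalProperties set to false, unknown fields are rejected.
    if schema.get("additionalProperties") is False:
        for field in record:
            if field not in properties:
                errors.append(f"unexpected field: {field}")

    # Only the "string" type is checked in this sketch.
    for field, rules in properties.items():
        if field in record and rules.get("type") == "string":
            if not isinstance(record[field], str):
                errors.append(f"field {field} must be a string")

    return errors


# Hypothetical schema, loosely modelled on the example earlier in the deck.
schema = {
    "type": "object",
    "properties": {
        "subject": {"type": "string"},
        "event_id": {"type": "string"},
        "action_context": {"type": "string"},
    },
    "required": ["subject", "event_id"],
    "additionalProperties": False,
}

good = validate({"subject": "order", "event_id": "e-1"}, schema)
bad = validate({"subject": 42, "colour": "red"}, schema)
print(good)
print(bad)
```

A producer can run the same check before sending and a consumer before loading, so a record that breaks the contract is caught at the boundary rather than downstream.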
Demo
● Snowplow (github.com/snowplow/snowplow)
● Schemas (Iglu Central)
● Kinesis (Amazon Web Services)
● Pusher (pusher.com)