Gobbin config-meetup-june-2016

Min Tu, Pradhan Cadabam. Gobblin Configuration Management. Gobblin Meetup June 2016

Transcript of Gobbin config-meetup-june-2016

Page 1: Gobbin config-meetup-june-2016

Min Tu, Pradhan Cadabam

Gobblin Configuration Management

Gobblin Meetup June 2016

Page 2: Gobbin config-meetup-june-2016

Agenda

1. Current Solutions and Motivation – Why we built Gobblin config?
2. Architecture – Gobblin config internals
3. Retention Example – How retention is configured using Gobblin config?

Page 3: Gobbin config-meetup-june-2016

Agenda

1. Current Solutions and Motivation – Why we built Gobblin config?
2. Architecture – Gobblin config internals
3. Retention Example – How retention is configured using Gobblin config?

Page 4: Gobbin config-meetup-june-2016

Job Configs Vs. Dataset Configs

Copy Job

Source: /events/loginEvent, /events/logoutEvent
Destination: /events/loginEvent - 700, /events/logoutEvent - 777

- Permission for loginEvent 700
- Permission for logoutEvent 777

Option 1 : One job per dataset
- Too many jobs
- Long whitelist
- Difficult to maintain

Copy Job 1
    dest.permission = 700
    whitelist = loginEvent

Copy Job 2
    dest.permission = 777
    whitelist = logoutEvent

Option 2 : Prefix
- Too many configs
- Cannot have a single config for all datasets with the same permissions

Copy Job with prefix
    loginEvent.dest.permission = 700
    logoutEvent.dest.permission = 777

Page 5: Gobbin config-meetup-june-2016

Data Life Cycle Management Configs

[Diagram: a Conversion Job turns /events/loginEvent_Avro into /events/loginEvent_Orc, a Copy Job copies /events/loginEvent_Orc, and Retention Jobs run on /events/loginEvent_Orc]

• Shared configs across jobs
• Destination path of conversion job is source path of copy job
• Retention job works on destination path of copy job
• Dataset needs to be enabled in all jobs

Page 6: Gobbin config-meetup-june-2016

Other Motivations

• New version of configs should be deployable without deploying new binaries

• Should be easy to rollback to previous stable version of configs

• Config changes should have an audit trail

• Complex value types and substitution resolution support

Page 7: Gobbin config-meetup-june-2016

Agenda

1. Current Solutions and Motivation – Why we built Gobblin config?
2. Architecture – Gobblin config internals
3. Retention Example – How retention is configured using Gobblin config?

Page 8: Gobbin config-meetup-june-2016

Dataset Configuration Management

At a very high-level, we extend typesafe config with:

• Abstraction of a Config Store

• Config versioning

• Support for logical “import” URIs

• Ability to traverse the “import” relationships

Page 9: Gobbin config-meetup-june-2016

Architecture

Client Application

ConfigClient API

ConfigStore API

HadoopFS Store, HiveMetaStoreAdapter, MySQLAdapter, Zookeeper Adapter, …

Page 10: Gobbin config-meetup-june-2016

Data Model

Config Store

Dataset config key (URI): /events

Dataset config key (URI): /events/loginEvent
    Key1: value1
    Key2: value2
    …
    KeyM: valueM

Tag config key (URI): /tags

Tag config key (URI): /tags/highPriority
    keyA: valueX
    keyB: valueY

[Diagram labels: "imports" and "Imported by" link dataset config keys and tag config keys; "Implicit import" links each key to its parent key]

Page 11: Gobbin config-meetup-june-2016

HOCON format

• Support Java Properties file

• Support Json file

• Value substitution

• “+=” syntax to append elements to arrays, e.g. path += "/bin"

• …

gobblin.retention : {
  selection {
    timeBased.lookbackTime=3y
  }
}

Page 12: Gobbin config-meetup-june-2016

Using Configs in code

ConfigClient client = ConfigClient.createConfigClient(VersionStabilityPolicy policy);

Config config = client.getConfig(URI uri);

Collection<URI> imports = client.getImports(URI dataset, boolean recursive);

Collection<URI> importedBy = client.getImportedBy(URI tag, boolean recursive);
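The two traversal calls above can be pictured with a small in-memory model. The sketch below is not the Gobblin ConfigClient implementation: `ImportGraph` and its methods are hypothetical names chosen for illustration, config keys are plain strings rather than URIs, and only explicit import links are modelled.

```java
import java.util.*;

// Hypothetical in-memory model of "import" links between config keys.
// Not the Gobblin ConfigClient; it only illustrates the two traversal directions.
class ImportGraph {
    private final Map<String, List<String>> imports = new HashMap<>();

    // Record that config key `from` imports config key `to`.
    void addImport(String from, String to) {
        imports.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Keys imported by `key`; with recursive=true, imports of imports are followed too.
    Set<String> getImports(String key, boolean recursive) {
        Set<String> result = new LinkedHashSet<>();
        collect(key, recursive, result);
        return result;
    }

    private void collect(String key, boolean recursive, Set<String> result) {
        for (String target : imports.getOrDefault(key, Collections.emptyList())) {
            // add() returns false for keys already seen, which also guards against cycles
            if (result.add(target) && recursive) {
                collect(target, true, result);
            }
        }
    }

    // Keys that import `key`, found by scanning every key that has recorded imports.
    Set<String> getImportedBy(String key, boolean recursive) {
        Set<String> result = new LinkedHashSet<>();
        for (String candidate : imports.keySet()) {
            if (getImports(candidate, recursive).contains(key)) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ImportGraph graph = new ImportGraph();
        graph.addImport("/events/shareEvent", "/tags/highPriority");
        graph.addImport("/events/clickEvent", "/tags/blacklist");
        System.out.println(graph.getImports("/events/shareEvent", false));
        System.out.println(graph.getImportedBy("/tags/blacklist", false));
    }
}
```

Scanning every key in `getImportedBy` keeps the sketch short; a real store would more likely keep a reverse index.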

Page 13: Gobbin config-meetup-june-2016

Config lifecycle at LinkedIn

Page 14: Gobbin config-meetup-june-2016

Example of a config store on HDFS

ROOT
├── _CONFIG_STORE                // contents = latest non-rolled-back version
└── 1.0.53                       // version directory
    ├── events
    │   ├── main.conf
    │   ├── loginEvent
    │   │   ├── main.conf        // configuration file for /events/loginEvent
    │   │   └── includes.conf    // specify import links for /events/loginEvent
    │   ├── shareEvent
    │   │   └── includes.conf
    │   └── clickEvent
    │       └── includes.conf
    │
    └── tags
        ├── highPriority
        │   ├── main.conf        // configuration file for /tags/highPriority
        │   └── includes.conf    // specify import links for /tags/highPriority
        ├── blacklist
        └── 10Days

Page 15: Gobbin config-meetup-june-2016

Agenda

1. Current Solutions and Motivation – Why we built Gobblin config?
2. Architecture – Gobblin config internals
3. Retention Example – How retention is configured using Gobblin config?

Page 16: Gobbin config-meetup-june-2016

Retention

• Deleting data that is not required

• Most common retention policy is to delete data older than some days

Example

• Retention policy of 10 days for loginEvent

• Retention policy of 30 days for logoutEvent

Before Retention

events
├── loginEvent
│   ├── 2016-06-20.avro
│   └── 2016-06-25.avro
└── logoutEvent
    ├── 2016-05-10.avro
    └── 2016-06-10.avro

After Retention

events
├── loginEvent
│   └── 2016-06-25.avro
└── logoutEvent
    └── 2016-06-10.avro
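The "delete data older than some days" rule in this example can be sketched in a few lines of Java. This is an illustration only, not Gobblin's retention code: `TimeBasedRetention` and `selectForDeletion` are hypothetical names, the file names are assumed to start with an ISO date, and the run date of 2016-07-01 is an assumption chosen so that the result matches the before and after listings.

```java
import java.time.LocalDate;
import java.util.*;

// Sketch of "delete data older than N days" on date-named files.
// Not Gobblin's retention code; class and method names are hypothetical.
class TimeBasedRetention {
    // Returns the file names whose date is before (today - retentionDays).
    static List<String> selectForDeletion(List<String> fileNames, int retentionDays, LocalDate today) {
        LocalDate cutoff = today.minusDays(retentionDays);
        List<String> toDelete = new ArrayList<>();
        for (String name : fileNames) {
            // File names are assumed to look like 2016-06-20.avro
            LocalDate fileDate = LocalDate.parse(name.substring(0, name.indexOf('.')));
            if (fileDate.isBefore(cutoff)) {
                toDelete.add(name);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2016, 7, 1); // assumed run date, not stated on the slide
        // loginEvent: 10 days of retention
        System.out.println(selectForDeletion(
            Arrays.asList("2016-06-20.avro", "2016-06-25.avro"), 10, today));
        // logoutEvent: 30 days of retention
        System.out.println(selectForDeletion(
            Arrays.asList("2016-05-10.avro", "2016-06-10.avro"), 30, today));
    }
}
```

With the assumed run date, 2016-06-20.avro and 2016-05-10.avro are selected for deletion, which leaves the two files shown after retention.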

Page 17: Gobbin config-meetup-june-2016

More complex use cases in Production

• Default retention policy of 30 days for all events

• Retention policy of 10 days for loginEvent

• Blacklist retention for clickEvent

• 3 years retention for high priority events like shareEvent

Page 18: Gobbin config-meetup-june-2016

Retention Config

● “events” is the common parent block for “shareEvent”, “loginEvent”, “logoutEvent”, “clickEvent”

● Each block implicitly imports configs from the parent block; for example, “logoutEvent” implicitly imports “events” (dashed lines)

● Any block can explicitly import any other block (solid lines)

● A child block overrides any key-value pairs specified in the parent block

Page 19: Gobbin config-meetup-june-2016

● “logoutEvent” inherits the default retention of 30 days from implicit import, “events”

logoutEvent 30 Days

Page 20: Gobbin config-meetup-june-2016

● “loginEvent” inherits the default retention of 30 days from implicit import, “events”

● “loginEvent” defines a 10 days policy which overrides the 30 days inherited from “events”

loginEvent 10 Days

Page 21: Gobbin config-meetup-june-2016

Retention Config for share/clickEvent

● “shareEvent” explicitly imports a high priority tag which has retention of 3 years

● “clickEvent” explicitly imports blacklist tag which disables retention for “clickEvent”
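The import rules on the preceding slides can be sketched as a small resolver. This is a hypothetical illustration, not Gobblin's resolution code: `ConfigResolver` and the key name `lookbackTime` are made up, import cycles are assumed not to occur, and the precedence order (own keys over explicit imports over the implicit parent import) is an assumption inferred from the shareEvent example, where the 3 years from the tag wins over the 30 days from “events”. The blacklist tag is left out because the slides do not show its contents.

```java
import java.util.*;

// Hypothetical resolver for the import rules on the preceding slides.
// Assumption: own keys win over explicit imports, which win over the implicit parent import.
class ConfigResolver {
    private final Map<String, Map<String, String>> ownConfigs = new HashMap<>();
    private final Map<String, List<String>> explicitImports = new HashMap<>();

    // Set a key-value pair in the config block stored at `key`.
    void put(String key, String name, String value) {
        ownConfigs.computeIfAbsent(key, k -> new HashMap<>()).put(name, value);
    }

    // Record an explicit import link from `key` to `imported`.
    void addImport(String key, String imported) {
        explicitImports.computeIfAbsent(key, k -> new ArrayList<>()).add(imported);
    }

    // Lowest priority is applied first so that later putAll calls override earlier ones.
    Map<String, String> resolve(String key) {
        Map<String, String> resolved = new HashMap<>();
        int slash = key.lastIndexOf('/');
        if (slash > 0) {
            resolved.putAll(resolve(key.substring(0, slash))); // implicit import of the parent
        }
        for (String imported : explicitImports.getOrDefault(key, Collections.emptyList())) {
            resolved.putAll(resolve(imported)); // explicit imports
        }
        resolved.putAll(ownConfigs.getOrDefault(key, Collections.emptyMap())); // own keys
        return resolved;
    }

    public static void main(String[] args) {
        ConfigResolver store = new ConfigResolver();
        store.put("/events", "lookbackTime", "30d");
        store.put("/events/loginEvent", "lookbackTime", "10d");
        store.put("/tags/highPriority", "lookbackTime", "3y");
        store.addImport("/events/shareEvent", "/tags/highPriority");
        for (String dataset : Arrays.asList("logoutEvent", "loginEvent", "shareEvent")) {
            System.out.println(dataset + " -> " + store.resolve("/events/" + dataset).get("lookbackTime"));
        }
    }
}
```

Under these assumptions logoutEvent resolves to 30d, loginEvent to 10d, and shareEvent to 3y, matching the slides.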

Page 22: Gobbin config-meetup-june-2016

HDFS Config store

├── events
│   ├── main.conf              // Default 30 Days
│   ├── loginEvent
│   │   └── main.conf          // 10 Days
│   ├── shareEvent
│   │   └── includes.conf      // Import /tags/highPriority
│   └── clickEvent
│       └── includes.conf      // Import /tags/blacklist
│
└── tags
    ├── highPriority
    │   └── main.conf          // Define 3 Years retention
    └── blacklist

Page 23: Gobbin config-meetup-june-2016

Retention Config Examples

/events/main.conf

gobblin.retention : {
  dataset : {
    finder.class=gobblin.data.management.retention.CleanableDatasetFinder
    pattern="/events/*"
  }
  selection {
    policy.class = gobblin.data.management.SelectBeforeTimeBasedSelectionPolicy
    timeBased.lookbackTime=30d
  }
  version : {
    finder.class=gobblin.data.management.DateTimeDatasetVersionFinder
  }
}

/tags/highPriority/main.conf

gobblin.retention : {
  selection {
    timeBased.lookbackTime=3y
  }
}
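To make the two lookbackTime values concrete, here is a minimal sketch that turns strings such as 30d and 3y into a cutoff date. It is not how Gobblin parses this setting: `LookbackTime.parse` is a hypothetical helper that handles only the two suffixes shown on the slide, and the run date is an assumption.

```java
import java.time.LocalDate;
import java.time.Period;

// Hypothetical helper, not Gobblin's parser: handles only the "d" and "y" suffixes from the slide.
class LookbackTime {
    static Period parse(String value) {
        int amount = Integer.parseInt(value.substring(0, value.length() - 1));
        char unit = value.charAt(value.length() - 1);
        switch (unit) {
            case 'd': return Period.ofDays(amount);
            case 'y': return Period.ofYears(amount);
            default: throw new IllegalArgumentException("Unsupported unit in: " + value);
        }
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2016, 7, 1); // assumed run date
        // Data dated before the cutoff would be selected by a time-based policy.
        System.out.println(today.minus(parse("30d"))); // cutoff for the default in /events/main.conf
        System.out.println(today.minus(parse("3y")));  // cutoff for /tags/highPriority/main.conf
    }
}
```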

Page 24: Gobbin config-meetup-june-2016

Supported Policies

• SelectBeforeTimeBasedSelectionPolicy

• NewestKSelectionPolicy

• DailyDependentHourlyPolicy

• CombineSelectionPolicy

More policies: http://gobblin.readthedocs.io/en/latest/data-management/Gobblin-Retention/

Page 25: Gobbin config-meetup-june-2016

Future work

• Config stores other than the HDFS-based config store

• Improve tooling, validation and UI for config store deployment

Page 26: Gobbin config-meetup-june-2016

Questions