Test strategies for data processing pipelines, v2.0


Transcript of Test strategies for data processing pipelines, v2.0

Page 1: Test strategies for data processing pipelines, v2.0


Test strategies for data processing pipelines


Lars Albertsson, independent consultant (Mapflat)

Page 2: Test strategies for data processing pipelines, v2.0


Who's talking
● Swedish Institute of Computer Science (test & debug tools)
● Sun Microsystems (large machine verification)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup) (data integrations)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling, productivity)
● Schibsted Products & Tech (data processing & modelling)
● Mapflat (independent data engineering consultant)


Page 3: Test strategies for data processing pipelines, v2.0


Agenda
● Data applications from a test perspective
● Testing batch processing products
● Testing stream processing products
● Data quality testing

The main focus is functional regression testing.

Prerequisites: Backend dev testing, basic data experience, reading Scala


Page 4: Test strategies for data processing pipelines, v2.0

Batch pipeline anatomy

[Diagram: services and DBs feed an ingress (import) step, via a unified log, into cluster storage - the data lake; ETL jobs form pipelines that produce datasets; egress (export) delivers results to DBs, services, and business intelligence.]

Page 5: Test strategies for data processing pipelines, v2.0

Workflow orchestrator
● Dataset “build tool”
● Runs a job instance when
  ○ input is available
  ○ output is missing
  ○ resources are available
● Backfill for previous failures
● DSL describes the DAG
  ○ Includes ingress & egress
● Examples: Luigi / Airflow (scheduling rule sketched below)
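As a minimal sketch of that scheduling rule - not Luigi's or Airflow's actual API; JobInstance and Dataset are hypothetical stand-ins:

  case class Dataset(exists: Boolean)
  case class JobInstance(inputs: Seq[Dataset], output: Dataset, requiredSlots: Int)

  // Run a job instance only when its inputs exist, its output does not,
  // and the cluster has capacity:
  def shouldRun(job: JobInstance, freeSlots: Int): Boolean =
    job.inputs.forall(_.exists) &&
      !job.output.exists &&
      freeSlots >= job.requiredSlots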


Page 6: Test strategies for data processing pipelines, v2.0

Stream pipeline anatomy
● Unified log - a bus of all business events
● Pub/sub with history
  ○ Kafka
  ○ AWS Kinesis, Google Pub/Sub, Azure Event Hub
● Decoupled producers/consumers
  ○ In source/deployment
  ○ In space
  ○ In time
● Publish results back to the log
● Recovery from link failures
● Replay on job bug fix

[Diagram: apps (ads, search, feed) publish events to streams; jobs consume streams and publish derived streams, feeding further jobs, the data lake, and business intelligence.]

Page 7: Test strategies for data processing pipelines, v2.0

Online vs offline

[Diagram: online services emit events through ETL into a cold store; offline pipelines of jobs produce datasets, including AI features, which are served back through online services.]

Page 8: Test strategies for data processing pipelines, v2.0

Online failure vs offline failure

Online: 10000s of customers, imprecise feedback. Failures must have low probability => proactive prevention => low ROI.

Offline: 10s of employees, precise feedback. Medium failure probability is ok => reactive repair => high ROI.

Risk = probability * impact

Page 9: Test strategies for data processing pipelines, v2.0

Value of testing

For data-centric (batch) applications, in this order:
● Productivity
  ○ Move fast without breaking things
● Fast experimentation
  ○ 10% good ideas, 90% neutral or bad
● Data quality
  ○ Challenging, and more important than technical quality
● Technical quality
  ○ Technical failure =>
    ■ Operations hassle.
    ■ Stale data. Often acceptable.
    ■ No customers were harmed. Usually.

Significant harm to external customers is rare enough to be handled reactively - driven by observed bug frequency.

Page 10: Test strategies for data processing pipelines, v2.0

Test concepts

[Diagram: a test framework (e.g. JUnit, Scalatest) drives test input through the system under test (SUT) and validates results with a test oracle. The SUT and 3rd party components (e.g. a DB) run inside a test fixture; fixture, input, and oracle together form the test harness, integrated with IDEs and build tools. A seam is an interface where the test connects to the SUT.]

Page 11: Test strategies for data processing pipelines, v2.0

Data pipeline properties
● Output = function(input, code)
  ○ No external factors => deterministic
  ○ Easy to craft input, even in large tests
  ○ Perfect for test!
● Pipeline and job endpoints are stable
  ○ Correspond to business value
● Internal abstractions are volatile
  ○ Reslicing in different dimensions is common

Page 12: Test strategies for data processing pipelines, v2.0

Potential test scopes
● Unit/component
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client

Choose stable interfaces. Each scope has a cost.

[Diagram: nested test scopes drawn around a chain of stream => job => stream => job, a service, and an app.]

Page 13: Test strategies for data processing pipelines, v2.0

Recommended scopes
● Single job
● Multiple jobs
● Pipeline, including service

[Diagram: the same scope picture, with the recommended scopes highlighted.]

Page 14: Test strategies for data processing pipelines, v2.0

Scopes to avoid
● Unit/component
  ○ Few stable interfaces
  ○ Not necessary
  ○ Avoid mocks, DI rituals
● Full system, including client
  ○ Client automation is fragile

“Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg


Page 15: Test strategies for data processing pipelines, v2.0

Testing a single batch job

Standard Scalatest harness:
1. Generate input (file://test_input/)
2. Run the job in local mode
3. Verify output (file://test_output/)

[Diagram: the job as a function f(), verified by an oracle predicate p().]

Runs well in CI / from the IDE.

Page 16: Test strategies for data processing pipelines, v2.0

Testing batch pipelines - two options

A: Standard Scalatest harness with a custom multi-job test
1. Generate input (file://test_input/)
2. Run a custom test job wrapping the sequence of jobs
3. Verify output (file://test_output/)
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job maintenance

B: Customised workflow manager setup
+ Tests workflow logic
+ More authentic
- Workflow manager setup for testability
- Difficult to debug
- Dataset handling with Python

● Both can be extended with ingress (Kafka) and egress DBs. Option A is sketched below.
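A minimal sketch of option A, chaining jobs through intermediate files and reusing the writeInput/readOutput helpers shown on page 18; AggregateUserJob and cleanFile are hypothetical:

  def runPipeline(input: Seq[User]): Seq[User] = {
    writeInput(input)
    // First job cleans the raw input
    CleanUserJob.main(Array(
      "--user", inFile.path.toString,
      "--output", cleanFile.path.toString))
    // Second job consumes the first job's output
    AggregateUserJob.main(Array(
      "--input", cleanFile.path.toString,
      "--output", outFile.path.toString))
    readOutput()
  }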

Page 17: Test strategies for data processing pipelines, v2.0

Test data

Test input data is code
● Tied to production code
● Should live in version control
● Eliminate duplication
● Generation gives more power (see the sketch below)
  ○ Randomness can be desirable
  ○ Larger scale
  ○ Statistical distribution
● Maintainable

> git add src/resources/test-data.json

userInputFile.overwrite(
  Json.toJson(
    userTemplate.copy(
      age = 27,
      product = "premium",
      country = "NO"))
    .toString)

Production data: beware of privacy!
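A minimal sketch of generated input under these constraints - a fixed seed keeps the run deterministic; the field choices are illustrative:

  import scala.util.Random

  def generateUsers(n: Int, seed: Long = 1L): Seq[User] = {
    val rnd = new Random(seed)  // fixed seed => reproducible test data
    val countries = Seq("NO", "SE", "DK")
    (1 to n).map { _ =>
      userTemplate.copy(
        age = 18 + rnd.nextInt(60),                       // larger scale, spread ages
        country = countries(rnd.nextInt(countries.size))) // rough statistical distribution
    }
  }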

Page 18: Test strategies for data processing pipelines, v2.0

Putting batch tests together

class CleanUserTest extends FlatSpec {
  val inFile = tmpDir / "test_user.json"
  val outFile = tmpDir / "test_output.json"

  def writeInput(users: Seq[User]) =
    inFile.overwrite(users.map(u => Json.toJson(u).toString).mkString("\n"))

  def readOutput(): Seq[User] =
    outFile.contentAsString.split("\n")
      .map(line => Json.fromJson[User](Json.parse(line)).get)
      .toSeq

  def runJob(input: Seq[User]): Seq[User] = {
    writeInput(input)
    val args = Seq(
      "--user", inFile.path.toString,
      "--output", outFile.path.toString)
    // Calling main() directly works for some processing frameworks, e.g. Spark in local mode
    CleanUserJob.main(args.toArray)
    readOutput()
  }

  "Clean users" should "remain untouched" in {
    val output = runJob(Seq(TestInput.userTemplate))
    assert(output === Seq(TestInput.userTemplate))
  }
}

Page 19: Test strategies for data processing pipelines, v2.0

Test oracle

class CleanUserTest extends FlatSpec {
  // ...

  "Lower case country" should "translate to upper case" in {
    val input = Seq(userTemplate.copy(country = "se"))
    val output = runJob(input)

    assert(output.size === 1)
    assert(output.head.country === "SE")

    // Equivalent check with a lens:
    val lens = GenLens[User](_.country)
    assert(lens.get(output.head) === "SE")
  }
}

● Avoid full record comparisons
  ○ Except for a few tests
● Examine only the fields relevant to the test
  ○ Or tests break for unrelated changes
● Lenses are your friend
  ○ JSON: JsonPath (Java), Play (Scala), ...
  ○ Case classes: Monocle, Shapeless, Scalaz, Sauron, Quicklens

Page 20: Test strategies for data processing pipelines, v2.0

Inspecting counters

class CleanUserTest extends FlatSpec {
  def runJob(input: Seq[User]): (Seq[User], Map[String, Long]) = {
    writeInput(input)
    val args = Seq(
      "--user", inFile.path.toString,
      "--output", outFile.path.toString)
    CleanUserJob.main(args.toArray)
    // Ugly way to expose counters
    val counters = Map(
      "invalid" -> CleanUserJob.invalidCount,
      "upper-cased" -> CleanUserJob.upperCased)
    (readOutput(), counters)
  }

  "Lower case country" should "translate to upper case" in {
    val input = Seq(userTemplate.copy(country = "se"))
    val (output, counters) = runJob(input)

    assert(output.size === 1)
    assert(output.head.country === "SE")
    assert(counters("upper-cased") === 1)
    // No other counters should be bumped
    assert((counters - "upper-cased").filter(_._2 != 0).isEmpty)
  }
}

● Counters (accumulators in Spark) are critical for monitoring and quality (exposure sketched below)
● Test that the expected counters are bumped
  ○ But no other counters
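For reference, a sketch of how a Spark job might expose such counters through accumulators - the job structure is an assumption, not the actual CleanUserJob:

  import org.apache.spark.sql.SparkSession

  object CleanUserJob {
    // Exposed for tests - the "ugly way" referred to above
    var invalidCount: Long = 0
    var upperCased: Long = 0

    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder.master("local[*]").getOrCreate()
      val invalid = spark.sparkContext.longAccumulator("invalid")
      val upper = spark.sparkContext.longAccumulator("upper-cased")
      // ... transformations call invalid.add(1) / upper.add(1) on odd code paths ...
      invalidCount = invalid.value
      upperCased = upper.value
    }
  }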

Page 21: Test strategies for data processing pipelines, v2.0

Invariants

class CleanUserTest extends FlatSpec {
  def validateInvariants(
      input: Seq[User],
      output: Seq[User],
      counters: Map[String, Long]) = {
    output.foreach(recordInvariant)

    // Dataset invariants
    assert(input.size === output.size)
    assert(input.size >= counters("upper-cased"))
  }

  def recordInvariant(u: User) = assert(u.country.size === 2)

  def runJob(input: Seq[User]): (Seq[User], Map[String, Long]) = {
    // Same as before ...
    validateInvariants(input, output, counters)
    (output, counters)
  }

  // Test case is the same
}

● Some things are true
  ○ For every record
  ○ For every job invocation
● Not necessarily in production
  ○ Reuse invariant predicates as quality probes

Page 22: Test strategies for data processing pipelines, v2.0

Streaming SUT, example harness

[Diagram: Scalatest drives Spark Streaming jobs from the IDE / Gradle; a Kafka topic and a DB run as Docker fixture containers; test input is pushed to Kafka, a service writes to the DB, and the test oracle polls the database. Everything runs as a JVM monolith for test, enabling IDE, CI, and debug integration.]

Page 23: Test strategies for data processing pipelines, v2.0

Test lifecycle
1. Initiate from IDE / build system
2. Start fixture containers
3. Await fixture ready
4. Allocate test case resources
5. Start jobs
6. Push input data
7. While (!done && !timeout) { pollDatabase(); sleep(1 ms) }
8. While (moreTests) { Goto 4 }
9. Tear down fixture

A polling sketch for step 7 follows below.
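A sketch of the polling step (7); db.query and outputKey are hypothetical helpers against the egress database:

  import scala.concurrent.duration._

  def awaitCondition(timeout: FiniteDuration)(done: () => Boolean): Unit = {
    val deadline = timeout.fromNow
    while (!done() && deadline.hasTimeLeft()) {
      Thread.sleep(1)  // poll roughly every millisecond, as above
    }
    assert(done(), s"condition not met within $timeout")
  }

  // Usage: block until the streaming job's output reaches the DB
  awaitCondition(30.seconds)(() => db.query(outputKey).nonEmpty)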


Page 24: Test strategies for data processing pipelines, v2.0

Testing for absence
1. Send test input
2. Send dummy input
3. Await the effect of the dummy input
4. Verify absence of test output

Assumes in-order semantics
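A sketch using hypothetical harness helpers (sendInput, awaitOutput, queryOutput):

  "Invalid events" should "produce no output" in {
    sendInput(invalidEvent)        // 1. test input, expected to be dropped
    sendInput(markerEvent)         // 2. dummy input with a known, observable effect
    awaitOutput(markerEvent.id)    // 3. once the marker's effect is visible, ...
    assert(queryOutput(invalidEvent.id).isEmpty)  // 4. ... the test input must have produced nothing
  }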


Page 25: Test strategies for data processing pipelines, v2.0

Testing in the process
1. Integration tests for happy paths
2. Likely odd cases, e.g.
  ○ Empty inputs
  ○ Missing matching records, e.g. in joins
  ○ Odd encodings
3. Dark corners
  ○ Motivated for externally generated input

● Minimal test cases triggering the desired paths
● On production failure:
  ○ Debug in production
  ○ Create a minimal test case
  ○ Add it to the regression suite
● Cannot trigger the bug in test?
  ○ Consider a new scope, significantly different from the old scopes
  ○ Staging pipelines?
  ○ Automatic code inspection?

Page 26: Test strategies for data processing pipelines, v2.0

Quality testing variants
● Functional regression
  ○ Binary, key to productivity
● Golden set
  ○ Extreme inputs => obvious output
  ○ No regressions tolerated
● (Saved) production data input
  ○ Individual regressions ok
  ○ Weighted sum must not decline (sketched below)
  ○ Beware of privacy
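A self-contained sketch of the weighted-sum gate; the Result shape, weights, and sample data are illustrative assumptions:

  case class Result(segment: String, correct: Boolean)

  def weightedQuality(results: Seq[Result], weights: Map[String, Double]): Double =
    results.map(r => weights(r.segment) * (if (r.correct) 1.0 else 0.0)).sum

  val weights = Map("paying" -> 10.0, "free" -> 1.0)
  val previous = Seq(Result("paying", true), Result("free", false))  // saved baseline run
  val current = Seq(Result("paying", true), Result("free", true))    // this run
  // Individual regressions are ok; the weighted sum must not decline:
  assert(weightedQuality(current, weights) >= weightedQuality(previous, weights))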


Page 27: Test strategies for data processing pipelines, v2.0

Quality metrics
● Processing tool (Spark/Hadoop) counters
  ○ Odd code path => bump counter
● Dedicated quality assessment pipelines
● Workflow quality predicate gating consumption (sketched below)
  ○ Depends on the consumer use case

[Diagram: Hadoop / Spark counters and a DB feed a quality assessment job, which produces a tiny quality metadataset.]
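A sketch of such a predicate; QualityMetrics and the thresholds are hypothetical, chosen per consumer use case:

  case class QualityMetrics(rowCount: Long, invalidCount: Long, historicalMedianRows: Long)

  def fitForConsumption(m: QualityMetrics): Boolean =
    m.invalidCount.toDouble / m.rowCount < 0.01 &&  // few invalid records
      m.rowCount > 0.9 * m.historicalMedianRows     // no large drop vs. history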

Page 28: Test strategies for data processing pipelines, v2.0

Quality testing in the process
● Binary, self-contained checks
  ○ Validate in CI
● Relative to history
  ○ E.g. detect large drops
  ○ Precondition for consuming the dataset?
● Push aggregates to a DB
  ○ Standard ops: monitor, alert


Page 29: Test strategies for data processing pipelines, v2.0

Data pipeline = yet another program

Don't veer from best practices:
● Regression testing
● Design: separation of concerns, modularity, etc.
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: global state, hard-coded locations, duplication, ...

In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers
● Document the “golden path”

Page 30: Test strategies for data processing pipelines, v2.0

Test code = not yet another program
● Shared between data engineers & QA engineers
  ○ Best results come with mutual respect and collaboration
● Readability >> abstractions
  ○ Create (Scala, Python) DSLs
● Some software bad practices are benign here
  ○ Duplication
  ○ Inconsistencies
  ○ Lazy error handling

Page 31: Test strategies for data processing pipelines, v2.0

Honey traps

The cloud
● PaaS components do not work locally
  ○ Cloud providers should provide fake implementations
  ○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
  ○ Distributing access tokens, etc.
  ○ Pay $ or $$$. Much better with per-second billing.


Vendor batch test frameworks
● Spark, Scalding, Crunch variants
● Seam == internal data structure
  ○ Omits I/O - a common bug source
● Vendor lock-in - when switching batch framework:
  ○ You need tests for protection
  ○ A test rewrite is an unnecessary burden

[Diagram: test input and test output connect directly to the job, bypassing I/O.]

Page 32: Test strategies for data processing pipelines, v2.0

Top anti-patterns
1. Test as an afterthought or in production. Data processing applications are suited for test!
2. Developer testing requiring a cluster
3. Static test input in version control
4. Exact expected output as test oracle
5. Unit testing volatile interfaces
6. Using mocks & dependency injection
7. Tool-specific test framework - vendor lock-in
8. Using wall clock time (see the sketch below)
9. Embedded fixture components, e.g. in-memory Kafka/Cassandra/RDBMS
10. Performance testing (low ROI for offline)
11. Data quality not measured, not monitored
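For anti-pattern 8, the usual remedy is to inject the clock; SessionJob and its fields are illustrative:

  import java.time.Instant

  class SessionJob(now: () => Instant) {
    def isExpired(lastSeen: Instant, ttlSeconds: Long): Boolean =
      now().isAfter(lastSeen.plusSeconds(ttlSeconds))
  }

  // Production: new SessionJob(() => Instant.now())
  // Test: new SessionJob(() => Instant.parse("2017-01-01T00:00:00Z"))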


Page 33: Test strategies for data processing pipelines, v2.0

Thank you. Questions?

Credits:

Øyvind Løkling, Schibsted Media Group

Images:

Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/).


Page 34: Test strategies for data processing pipelines, v2.0


Bonus slides


Page 35: Test strategies for data processing pipelines, v2.0

Deployment, example

[Diagram: an Hg/git repo holds the Luigi DSL, jars, and config, packaged as my-pipe-7.tar.gz on HDFS; the Luigi daemon and worker nodes install it.]

> pip install my-pipe-7.tar.gz

Redundant cron schedule, higher frequency + backfill (Luigi range tools):

* 10 * * * bin/my_pipe_daily \
    --backfill 14

All that a pipeline needs, installed atomically.

Page 36: Test strategies for data processing pipelines, v2.0

Continuous deployment, example

● Poll and pull the latest package on worker nodes
  ○ One virtualenv per package/version
    ■ No need to sync environments & versions
  ○ Cron runs package/latest/bin/*
    ■ Old versions run their pipelines to completion, then exit

[Diagram: the Hg/git repo publishes my-pipe-7.tar.gz (Luigi DSL, jars, config) to HDFS; my_cd.py on each worker pulls from hdfs://pipelines/.]

> virtualenv my_pipe/7
> pip install my-pipe-7.tar.gz

* 10 * * * my_pipe/7/bin/*