Test strategies for data processing pipelines, v2.0
Transcript of Test strategies for data processing pipelines, v2.0
www.mapflat.com
Test strategies for data processing pipelines
Lars Albertsson, independent consultant (Mapflat)
Who's talking
● Swedish Institute of Computer Science (test & debug tools)
● Sun Microsystems (large machine verification)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup) (data integrations)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling, productivity)
● Schibsted Products & Tech (data processing & modelling)
● Mapflat (independent data engineering consultant)
Agenda
● Data applications from a test perspective
● Testing batch processing products
● Testing stream processing products
● Data quality testing

Main focus is functional, regression testing.
Prerequisites: backend dev testing, basic data experience, reading Scala.
Batch pipeline anatomy

[Diagram: data flows from DBs and services through import/ingress into the data lake (cluster storage + unified log), through ETL jobs and pipelines producing datasets, and out via egress/export to DBs, services, and business intelligence.]
Workflow orchestrator
● Dataset “build tool”
● Run job instance when
  ○ input is available
  ○ output is missing
  ○ resources are available
● Backfill for previous failures
● DSL describes DAG
  ○ Includes ingress & egress
● Examples: Luigi / Airflow
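The run conditions above can be sketched as a simple predicate. This is an illustrative Python sketch, not actual Luigi or Airflow API; `should_run` and `backfill_candidates` are hypothetical names:

```python
def should_run(input_available: bool, output_exists: bool,
               resources_available: bool) -> bool:
    """An orchestrator launches a job instance only when its input
    datasets exist, its output is still missing, and resources are free."""
    return input_available and not output_exists and resources_available

def backfill_candidates(dates, input_ready, output_done):
    """Backfill applies the same rule over past partitions: any date whose
    input is ready but whose output is missing gets (re)run."""
    return [d for d in dates
            if should_run(input_ready(d), output_done(d), True)]
```

The same predicate drives both the regular schedule and backfills for previous failures, which is what makes the orchestrator behave like a build tool for datasets.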
Stream pipeline anatomy
● Unified log - bus of all business events
● Pub/sub with history
  ○ Kafka
  ○ AWS Kinesis, Google Pub/Sub, Azure Event Hub
● Decoupled producers/consumers
  ○ In source/deployment
  ○ In space
  ○ In time
● Publish results to log
● Recovery from link failures
● Replay on job bug fix
[Diagram: apps (ads, search, feed) publish to streams; jobs consume streams and publish derived streams; results also land in the data lake and feed business intelligence.]
Online vs offline

[Diagram: online services interact with users; offline ETL jobs and pipelines read from cold storage, produce datasets and AI features, and feed results back to online services.]
Online failure vs offline failure

Risk = probability * impact

● Online: 10000s of customers, imprecise feedback. Need low probability => proactive prevention => low ROI.
● Offline: 10s of employees, precise feedback. Ok with medium probability => reactive repair => high ROI.
Value of testing

For data-centric (batch) applications, in this order:
● Productivity
  ○ Move fast without breaking things
● Fast experimentation
  ○ 10% good ideas, 90% neutral or bad
● Data quality
  ○ Challenging, more important than...
● Technical quality
  ○ Technical failure =>
    ■ Operations hassle.
    ■ Stale data. Often acceptable.
    ■ No customers were harmed. Usually.

Significant harm to external customers is rare enough to be handled reactively - data-driven by bug frequency.
Test concepts

[Diagram: a test framework (e.g. JUnit, Scalatest) drives a test fixture containing the system under test (SUT) and 3rd party components (e.g. a DB); test input is fed in and a test oracle checks the output. Seams separate the SUT from its dependencies; the harness integrates with IDEs and build tools.]
Data pipeline properties
● Output = function(input, code)
  ○ No external factors => deterministic
  ○ Easy to craft input, even in large tests
  ○ Perfect for test!
● Pipeline and job endpoints are stable
  ○ Correspond to business value
● Internal abstractions are volatile
  ○ Reslicing in different dimensions is common
Potential test scopes
● Unit/component
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client

Choose stable interfaces. Each scope has a cost.
Recommended scopes
● Single job
● Multiple jobs
● Pipeline, including service
Scopes to avoid
● Unit/component
  ○ Few stable interfaces
  ○ Not necessary
  ○ Avoid mocks, DI rituals
● Full system, including client
  ○ Client automation fragile

“Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg
Testing a single batch job

Standard Scalatest harness, reading file://test_input/ and writing file://test_output/:
1. Generate input
2. Run in local mode
3. Verify output

Runs well in CI / from an IDE.
Testing batch pipelines - two options

A: Standard Scalatest harness, with a test job running the sequence of jobs:
1. Generate input
2. Run custom multi-job
3. Verify output
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job maintenance

B: Customised workflow manager setup:
+ Tests workflow logic
+ More authentic
- Workflow mgr setup for testability
- Difficult to debug
- Dataset handling with Python

● Both can be extended with ingress (Kafka), egress DBs
Test data

Test input data is code:
● Tied to production code
● Should live in version control
● Duplication to eliminate
● Generation gives more power
  ○ Randomness can be desirable
  ○ Larger scale
  ○ Statistical distribution
● Maintainable

Static test data:
> git add src/resources/test-data.json

Generated test data:
userInputFile.overwrite(
  Json.toJson(
    userTemplate.copy(
      age = 27,
      product = "premium",
      country = "NO"))
    .toString)

Beware of production data as test input - privacy!
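The template-plus-overrides pattern used in the Scala snippet above translates directly to other languages. A minimal Python sketch, assuming a hypothetical User record mirroring the deck's User case class:

```python
from dataclasses import dataclass, replace

# Hypothetical record type mirroring the User case class in the slides.
@dataclass(frozen=True)
class User:
    age: int = 30
    product: str = "free"
    country: str = "SE"

# One shared template plus per-test overrides keeps test data maintainable
# and eliminates duplication, like userTemplate.copy(...) in Scala.
user_template = User()

def make_user(**overrides) -> User:
    """Return a copy of the template with only the test-relevant fields changed."""
    return replace(user_template, **overrides)
```

Each test then states only the fields it cares about, so unrelated schema changes touch one template instead of every test.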
Putting batch tests together

class CleanUserTest extends FlatSpec {
  val inFile = (tmpDir / "test_user.json")
  val outFile = (tmpDir / "test_output.json")

  def writeInput(users: Seq[User]) =
    inFile.overwrite(users.map(u => Json.toJson(u).toString).mkString("\n"))

  def readOutput() =
    outFile.contentAsString.split("\n")
      .map(line => Json.fromJson[User](Json.parse(line)))
      .toSeq

  def runJob(input: Seq[User]): Seq[User] = {
    writeInput(input)
    val args = Seq(
      "--user", inFile.path.toString,
      "--output", outFile.path.toString)
    CleanUserJob.main(args)  // Works for some processing frameworks, e.g. Spark
    readOutput()
  }

  "Clean users" should "remain untouched" in {
    val output = runJob(Seq(TestInput.userTemplate))
    assert(output === Seq(TestInput.userTemplate))
  }
}
Test oracle

class CleanUserTest extends FlatSpec {
  // ...

  "Lower case country" should "translate to upper case" in {
    val input = Seq(userTemplate.copy(country = "se"))
    val output = runJob(input)

    assert(output.size === 1)
    assert(output.head.country === "SE")

    // Equivalent, with a lens:
    val lens = GenLens[User](_.country)
    assert(lens.get(output.head) === "SE")
  }
}

● Avoid full record comparisons
  ○ Except for a few tests
● Examine only fields relevant to the test
  ○ Or tests break for unrelated changes
● Lenses are your friend
  ○ JSON: JsonPath (Java), Play (Scala), ...
  ○ Case classes: Monocle, Shapeless, Scalaz, Sauron, Quicklens
Inspecting counters

class CleanUserTest extends FlatSpec {
  def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
    writeInput(input)
    val args = Seq(
      "--user", inFile.path.toString,
      "--output", outFile.path.toString)
    CleanUserJob.main(args)
    // Ugly way to expose counters
    val counters = Map(
      "invalid" -> CleanUserJob.invalidCount,
      "upper-cased" -> CleanUserJob.upperCased)
    (readOutput(), counters)
  }

  "Lower case country" should "translate to upper case" in {
    val input = Seq(userTemplate.copy(country = "se"))
    val (output, counters) = runJob(input)

    assert(output.size === 1)
    assert(output.head.country === "SE")
    assert(counters("upper-cased") === 1)
    assert((counters - "upper-cased").filter(_._2 != 0).isEmpty)
  }
}

● Counters (accumulators in Spark) are critical for monitoring and quality
● Test that expected counters are bumped
  ○ But no other counters
Invariants

class CleanUserTest extends FlatSpec {
  def validateInvariants(
      input: Seq[User],
      output: Seq[User],
      counters: Map[String, Int]) = {
    output.foreach(recordInvariant)

    // Dataset invariants
    assert(input.size === output.size)
    assert(input.size >= counters("upper-cased"))
  }

  def recordInvariant(u: User) =
    assert(u.country.size === 2)

  def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
    // Same as before
    ...
    validateInvariants(input, output, counters)
    (output, counters)
  }

  // Test case is the same
}

● Some things are true
  ○ For every record
  ○ For every job invocation
● Not necessarily in production
  ○ Reuse invariant predicates as quality probes
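The "reuse invariant predicates as quality probes" idea means writing the checks as plain functions that return violations rather than asserting, so tests can fail hard while production jobs merely count. A Python sketch with hypothetical names:

```python
def record_invariant(user: dict) -> bool:
    # Country codes are always two letters after cleaning.
    return len(user.get("country", "")) == 2

def dataset_invariants(inp, out, counters) -> list:
    """Return a list of violation descriptions; empty means all invariants hold.
    Tests assert the list is empty; a production quality probe would instead
    bump counters or emit a quality metadataset."""
    violations = []
    if len(inp) != len(out):
        violations.append("size mismatch")
    if counters.get("upper-cased", 0) > len(inp):
        violations.append("counter exceeds input size")
    violations += ["bad record"
                   for u in out if not record_invariant(u)]
    return violations
```

Returning data instead of raising is the design choice that lets one predicate serve both the regression suite and production monitoring.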
Streaming SUT, example harness
● Scalatest driving Spark Streaming jobs
● IDE, CI, debug integration; launched from IDE / Gradle
● Kafka topics and the DB run as Docker fixture components
● Test input is pushed to Kafka; the test oracle polls the DB
● Jobs and service run as a JVM monolith for test
Test lifecycle
1. Initiate from IDE / build system
2. Start fixture containers
3. Await fixture ready
4. Allocate test case resources
5. Start jobs
6. Push input data
7. While (!done && !timeout) {
     pollDatabase()
     sleep(1ms)
   }
8. While (moreTests) { Goto 4 }
9. Tear down fixture
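Step 7's poll-until-done loop is the heart of the harness. A minimal Python sketch (function names, poll interval, and timeout are illustrative):

```python
import time

def await_output(poll, predicate, timeout_s=30.0, interval_s=0.001):
    """Poll the fixture DB until the oracle's predicate holds or we time out.
    `poll` returns the current output rows; `predicate` decides done-ness."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rows = poll()
        if predicate(rows):
            return rows
        time.sleep(interval_s)
    raise TimeoutError("expected output never appeared")
```

Polling with a deadline, rather than sleeping a fixed amount, keeps the test fast when the pipeline is quick and avoids the wall-clock-time anti-pattern listed later in the deck.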
Testing for absence
1. Send test input
2. Send dummy input
3. Await effect of dummy input
4. Verify test output absence

Assumes in-order semantics.
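The dummy-record trick above can be sketched against an in-memory "stream" (a plain list standing in for Kafka; the filter job and record shapes are illustrative, not from the deck):

```python
def run_filter_job(stream):
    # Job under test: drop invalid records, pass everything else through.
    return [r for r in stream if r.get("valid", True)]

def test_invalid_record_absent():
    invalid = {"id": "bad", "valid": False}
    dummy = {"id": "dummy"}           # known-good marker record, sent second
    out = run_filter_job([invalid, dummy])
    # With in-order semantics, once the dummy shows up, the invalid record
    # would already have appeared if it were ever going to.
    assert any(r["id"] == "dummy" for r in out)
    assert not any(r["id"] == "bad" for r in out)
```

Without the dummy, "output absent" is indistinguishable from "output not arrived yet"; the marker converts an unbounded wait into a bounded one.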
Testing in the process
1. Integration tests for happy paths
2. Likely odd cases, e.g.
  ○ Empty inputs
  ○ Missing matching records, e.g. in joins
  ○ Odd encodings
3. Dark corners
  ○ Motivated for externally generated input

● Minimal test cases triggering desired paths
● On production failure:
  ○ Debug in production
  ○ Create minimal test case
  ○ Add to regression suite
● Cannot trigger bug in test?
  ○ Consider a new scope? Significantly different from old scopes.
  ○ Staging pipelines?
  ○ Automatic code inspection?
Quality testing variants
● Functional regression
  ○ Binary, key to productivity
● Golden set
  ○ Extreme inputs => obvious output
  ○ No regressions tolerated
● (Saved) production data input
  ○ Individual regressions ok
  ○ Weighted sum must not decline
  ○ Beware of privacy
Quality metrics
● Processing tool (Spark/Hadoop) counters
  ○ Odd code path => bump counter
  ○ Push counters to a DB
● Dedicated quality assessment pipelines
  ○ A quality assessment job emits a tiny quality metadataset
● Workflow quality predicate for consumption
  ○ Depends on consumer use case
Quality testing in the process
● Binary self-contained
  ○ Validate in CI
● Relative vs history
  ○ E.g. large drops
  ○ Precondition for consuming dataset?
● Push aggregates to DB
  ○ Standard ops: monitor, alert
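A "relative vs history" check can gate dataset consumption with a few lines. A sketch in Python (the 10% drop threshold and the mean-of-recent-runs baseline are illustrative choices, not prescribed by the deck):

```python
def quality_gate(history, current, max_drop=0.10):
    """history: quality scores of recent runs; current: this run's score.
    Returns True if the new dataset may be consumed, i.e. its quality has
    not dropped more than max_drop relative to the recent baseline."""
    if not history:
        return True                      # no baseline yet; let it through
    baseline = sum(history) / len(history)
    return current >= baseline * (1 - max_drop)
```

Wiring such a predicate into the workflow orchestrator as a precondition turns quality from a dashboard number into an enforced contract between producer and consumer.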
Data pipeline = yet another program

Don’t veer from best practices:
● Regression testing
● Design: separation of concerns, modularity, etc.
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: global state, hard-coding locations, duplication, ...

In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers
● Document the “golden path”
Test code = not yet another program
● Shared between data engineers & QA engineers
  ○ Best result with mutual respect and collaboration
● Readability >> abstractions
  ○ Create (Scala, Python) DSLs
● Some software bad practices are benign
  ○ Duplication
  ○ Inconsistencies
  ○ Lazy error handling
Honey traps

The cloud
● PaaS components do not work locally
  ○ Cloud providers should provide fake implementations
  ○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
  ○ Distribute access tokens, etc.
  ○ Pay $ or $$$. Much better with per-second billing.

Vendor batch test frameworks
● Spark, Scalding, Crunch variants
● Seam == internal data structure
  ○ Omits I/O - a common bug source
● Vendor lock-in - when switching batch framework:
  ○ Need tests for protection
  ○ Test rewrite is unnecessary burden
Top anti-patterns
1. Test as afterthought or in production. Data processing applications are suited for test!
2. Developer testing requires a cluster
3. Static test input in version control
4. Exact expected output as test oracle
5. Unit testing volatile interfaces
6. Using mocks & dependency injection
7. Tool-specific test framework - vendor lock-in
8. Using wall clock time
9. Embedded fixture components, e.g. in-memory Kafka/Cassandra/RDBMS
10. Performance testing (low ROI for offline)
11. Data quality not measured, not monitored
Thank you. Questions?

Credits:
Øyvind Løkling, Schibsted Media Group

Images:
Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/).
Bonus slides
Deployment, example
● Hg/git repo holds the Luigi DSL, jars, and config, packaged as my-pipe-7.tar.gz on HDFS
  ○ All that a pipeline needs, installed atomically
● Luigi daemon and workers install the package:
  > pip install my-pipe-7.tar.gz
● Redundant cron schedule, higher frequency + backfill (Luigi range tools):
  * 10 * * * bin/my_pipe_daily \
      --backfill 14
Continuous deployment, example
● my_cd.py polls hdfs://pipelines/ and pulls the latest my-pipe-N.tar.gz (Luigi DSL, jars, config from the Hg/git repo) on worker nodes
  ○ virtualenv package/version
    ■ No need to sync environment & versions
  ○ Cron runs package/latest/bin/*
    ■ Old versions run pipelines to completion, then exit

On each worker:
> virtualenv my_pipe/7
> pip install my-pipe-7.tar.gz

* 10 * * * my_pipe/7/bin/*