Lessons learned from designing a QA automation event streaming platform (IoT & Big Data)


Transcript of Lessons learned from designing a QA automation event streaming platform (IoT & Big Data)

Page 1: Lessons learned from designing a QA automation event streaming platform (IoT & Big Data)

Lessons learned from designing a QA automation event streaming platform (IoT & Big Data) Omid Vahdaty, Big Data ninja.

Page 2:

Agenda

●Introduction to Big Data Testing

●Challenges of Big Data Testing

●Methodological approach for a QA automation road map

●True Story

●The Solution

●Performance and Maintenance

Page 3:

Introduction to Big Data Testing

Page 4:

How Big is Big Data?

●100M?

●1B?

●10B?

●100B?

●1Tr?

Page 5:

How Hard Can it be (aggregation)?

●Select Count(*) from t;

○ Assume

■ 1 Trillion records

■ ad-hoc query (no indexes)

■ Full Scan (no cache)

○ Challenges

■ Time it takes to compute?

■ IO bottleneck? What is IO pattern?

■ CPU bottleneck?

■ Scale up limitations?
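A quick back-of-envelope calculation makes the scale of the challenge concrete. The figures below (100 bytes per record, 200 MB/s sequential read per node, a 100-node cluster) are illustrative assumptions, not measurements:

```python
# Back-of-envelope: full-scan time for SELECT COUNT(*) over 1 trillion rows.
# All figures are illustrative assumptions, not measurements.
RECORDS = 1_000_000_000_000   # 1 trillion rows
BYTES_PER_RECORD = 100        # assumed average row size
THROUGHPUT = 200 * 10**6      # assumed sequential read rate per node, bytes/s

total_bytes = RECORDS * BYTES_PER_RECORD          # 100 TB to scan
single_node_secs = total_bytes / THROUGHPUT       # ~500,000 s, almost 6 days
cluster_secs = single_node_secs / 100             # 100 nodes: under 1.5 hours

print(f"data to scan: {total_bytes / 10**12:.0f} TB")
print(f"single node:  {single_node_secs / 86400:.1f} days")
print(f"100 nodes:    {cluster_secs / 60:.0f} minutes")
```

Even under these generous assumptions, one machine cannot answer an ad-hoc count in reasonable time, which is why scale-out (and its limits) dominates the discussion.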

Page 6:

How Hard Can it be (join)?

●Select * from t1 join t2 where t1.id = t2.id

○ Assume

■ 1 Trillion records in both tables.

■ ad-hoc query (no indexes)

■ Full Scan (no cache)

○ Challenges

■ Time it takes to compute?

■ IO bottleneck? What is IO pattern?

■ CPU bottleneck?

■ Scale out limitations?
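The join is harder than the aggregation because a distributed join must first repartition both tables by join key so that matching ids meet on the same node. A rough sketch of that shuffle cost, again with assumed figures (100-byte rows, 100 nodes, 25 Gbit/s NICs, even key distribution):

```python
# Rough shuffle-cost sketch for a distributed join of two 1-trillion-row
# tables. A sort-merge or hash join on a cluster must repartition BOTH
# tables by join key before any matching happens. Figures are assumptions.
ROWS = 10**12
BYTES_PER_ROW = 100                  # assumed
NODES = 100
NET_BPS = 25 * 10**9 / 8             # assumed 25 Gbit/s NICs, in bytes/s

shuffle_bytes = 2 * ROWS * BYTES_PER_ROW       # both tables move once
per_node_bytes = shuffle_bytes / NODES         # assumes even key distribution
shuffle_secs = per_node_bytes / NET_BPS

print(f"shuffled: {shuffle_bytes / 10**12:.0f} TB total")
print(f"per node: {per_node_bytes / 10**12:.1f} TB, {shuffle_secs / 60:.0f} min on the wire")
```

And this counts only network transfer, before any sorting, matching, or skewed-key hotspots.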

Page 7:

Big Data Challenges

Page 8:

3 Pain Points in Big Data DBMS

A Big Data system has three rate constraints:

●Ingress data rate

●Egress data rate

●Maintenance rate

Page 9:

Big Data Ingress Challenges

●Parsing

●Rate and Amount of data coming into the system

●ACID: Atomicity, Consistency, Isolation, Durability

●Compression on the fly (non unique values)

●On the fly analytics

●Time constraints (target: x rows per sec/hour)

●High Availability Scenarios

○ Can you afford to lose events/data?

○ How much downtime can you sustain?

○ How much time does it take to recover from downtime?

Page 10:

Big Data Egress Challenges

● Sort

● Group by

● Reduce

● Join (sortMergeJoin)

● Data distribution

● Compression

● Theoretical Bottlenecks of hardware.

What is your engine algorithm doing? What is the impact on the system: CPU? IO? RAM? Network? What is the impact on the cluster?
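To make one of these engine algorithms concrete, here is a minimal, illustrative sort-merge join in Python. This is not the product's engine, just the textbook algorithm the slide names, so its costs (two sorts, then a linear merge) can be reasoned about:

```python
def sort_merge_join(left, right, key=lambda row: row[0]):
    """Minimal sort-merge join: sort both inputs by key, then advance two
    cursors in lockstep, emitting the cross product of equal-key runs."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # find the run of rows sharing this key on each side
            i2 = i
            while i2 < len(left) and key(left[i2]) == kl:
                i2 += 1
            j2 = j
            while j2 < len(right) and key(right[j2]) == kl:
                j2 += 1
            # emit the cross product of the two runs
            for l in left[i:i2]:
                for r in right[j:j2]:
                    out.append(l + r)
            i, j = i2, j2
    return out

pairs = sort_merge_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y")])
# pairs == [(2, "b", 2, "x")]
```

The sorts dominate CPU and IO (spilling to disk at scale), and the merge is sequential, which is exactly why the slide asks about IO pattern and hardware bottlenecks.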

Page 11:

Egress while Ingress Challenges

●Input never stops

●Input rate oscillations

●Bottlenecks

●System gets stuck at different places in a complex cluster

●Ingress performance impact on egress.

●High Availability Scenarios

○ Can you afford to lose data?

○ How much downtime can you sustain?

○ How much time does it take to recover from downtime?

Page 12:

A Methodological Approach to Event Streaming Platform Testing

Page 13:

Business constraints?

●Budget?

●Time to market?

●Resources available?

●Skill set available?

Page 14:

Product requirements?

●What are the product’s supported use cases?

●Supported Scale?

●Supported rate of ingress?

●Supported rate of egress?

●Complexity of Cluster?

●High availability?

●Cloud or onsite?

●Regulation? GMP?

●Security?

●HIPAA?

Page 15:

The Automation: Must-have Requirements

●Scale out/up?

●Fault tolerance!

●Reproducibility from scratch!

●Reporting!

●“Views” to avoid duplication for testing?

●Orchestration of same environment per Developer!

●Debugging options

●Versioning!!!

Page 16:

The Method

● Phase 0: Get ready

○ Product requirements & business constraints

○ Automation requirements

○ Creating a testing matrix based on insight and product features.

● Phase 1: Get insights (some baselines)

○ Test Ingress separately, find your baseline.

○ Test Egress separately, find your baseline.

○ Test Ingress while Egress in progress, find your baseline.

○ Stability and Performance

● Phase 2: Agile: Design an automation system that satisfies the requirements

○ Prioritize automation features based on your insights from phase 1.

○ Implement an automation infrastructure as soon as possible.

○ Update your testing matrix as you go (insight will keep coming)

○ Analyze the test results daily.

● Phase 3: Cost reduction

○ How to reduce compute time/ IO pattern/ network pattern/storage footprint

○ Maintenance time (build/package/deploy/monitor)

○ Hardware costs

Page 17:

True Story: Hadoop-based Event Streaming Platform (Peta Scale)

Page 18:

True Story

●Product: Hadoop-based event streaming platform [peta scale]

●Technical Constraints

○ What is the expected impact of big data on the platform?

○ How to debug scenario issues?

○ QA???

○ Running time - how long does it take to run all the tests?

●Company Challenges

○ Cost of hardware? High-end 6U server

○ Expertise? Skill set?

○ Human resources: Size of QA Automation team?

○ When do you start testing? How to incorporate testing in SCRUM?

○ How do you manage automation?

○ What is your MVP? (Chicken and Egg)

○ TIME TO MARKET!!! The product needs to be released.

Page 19:

Technical Flow

1. Start by testing the simplest flow, see that it behaves as expected, and then add another layer of complexity. For this step you must understand the product's state machine from A to Z.

2. Start with a single thread (one of each component).

3. Continue with small data, in gradual increments. Understand the machine metrics of each component and all the different configuration fine-tuning.

4. Get to the point where you are convinced you have big data (what does your product support?)

a. 0 data, 1M, 10M, 100M, 1GB, 10GB, 100GB, 1TB, 10TB, 100TB, 1PB

5. Start over, this time using multiple threads/components.

6. Start over, this time with negative testing (failover scenarios, DataNode failures, Flume failures, NameNode failures, Spark failures, etc.).

7. Make sure logs reflect the accurate status of the product (error/warn/info/debug).
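The gradual-increment idea in steps 2-5 can be sketched as a parameterized loop: grow data size and concurrency one rung at a time, recording metrics at every rung so a regression is attributable to exactly one changed dimension. The `run_ingress_test` and `collect_metrics` callables here are hypothetical hooks into your own harness:

```python
# Sketch of steps 2-5: grow data size and concurrency step by step.
# run_ingress_test / collect_metrics are hypothetical hooks into your
# own test harness; the sizes are illustrative event counts.
DATA_SIZES = [0, 10**6, 10**7, 10**8, 10**9, 10**10]   # events per run
THREAD_COUNTS = [1, 2, 4, 8]

def run_matrix(run_ingress_test, collect_metrics):
    results = []
    for threads in THREAD_COUNTS:        # outer loop: single thread first
        for size in DATA_SIZES:          # inner loop: small data first
            run_ingress_test(events=size, threads=threads)
            results.append((threads, size, collect_metrics()))
    return results
```

Keeping the loop explicit (rather than jumping straight to the biggest configuration) is what makes the later baselines on this deck's method slide meaningful.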

Page 20:

Testing Matrix Dimensions

● Multiple components: scale-out of the same component

● Data

○ Different data rates: slow/fast/bursts (delay between packets slow/fast)

○ What happens if we change the data distribution?

○ Different input sizes (small, large, changing)

● Performance & Metrics

○ What are the machine metrics (collector/Flume/HDFS) over time? Get a rule of thumb for processing power.

○ Ingress rate?

○ Heap allocations per Hadoop ecosystem component?

● Stability and persistence

○ What happens if Flume fails? Data loss / stuck on Flume?

○ Recovery from HA

○ Data node loss?

○ NameNode recovery? Persistence of Flume?

Page 21:

Ingress Testing Matrix

Columns (topologies): single collector + single Flume; multi collector + single Flume; multi collector + multi Flume; HA HDFS

Rows (cluster states): new cluster; existing cluster with small data; existing cluster with big data; stability (failure of components)
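A matrix like this is easy to enumerate mechanically, which keeps the automation honest about which cells it has actually covered. A minimal sketch (the labels are illustrative shorthand for the matrix's rows and columns):

```python
from itertools import product

# Enumerate the ingress testing matrix as (topology, cluster_state) cells.
# Labels are illustrative shorthand for the slide's columns and rows.
TOPOLOGIES = [
    "single collector + single Flume",
    "multi collector + single Flume",
    "multi collector + multi Flume",
    "HA HDFS",
]
CLUSTER_STATES = [
    "new cluster",
    "existing cluster, small data",
    "existing cluster, big data",
    "component-failure stability",
]

cells = list(product(TOPOLOGIES, CLUSTER_STATES))   # 16 test cells
```

Adding one dimension multiplies the cell count, which is exactly why the method slide insists on prioritizing matrix cells from phase-1 insights instead of running everything.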

Page 22:

Egress Testing Matrix

1. Hadoop

a. HDFS performance/failover/data replication/block size

b. YARN availability / metrics of jobs

c. Spark availability / metrics of jobs / performance tuning

2. Assuming some sort of Engine

a. State (marks which files were already processed)

b. Poller (polling of new events)

Page 23:

Egress Testing Matrix

Columns (layers): HDFS; YARN; Spark; Engine state & poller; any other layers

Rows (cluster states): new cluster; existing cluster with small data; existing cluster with big data; stability (failure of components)

Page 24:

The Baseline Concept

●Requirements

○ Events data (simulated or real)

○ Cluster

○ Metrics - to understand if behaviour is expected.

○ Do I need to QA the Hadoop ecosystem?

●Steps

○ Create a golden standard, per event type, per building block

○ Permute

○ If equal to the baseline → test passes.
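The baseline concept can be sketched in a few lines: record a golden-standard result once per (event type, building block), then declare any later run passed only if its output matches the stored baseline. The file layout and names here are hypothetical, not the framework's actual storage:

```python
import json
from pathlib import Path

# Sketch of the baseline ("golden standard") concept. One stored result per
# (event_type, building_block); a run passes iff it equals the baseline.
# The directory layout and naming scheme are hypothetical.
BASELINE_DIR = Path("baselines")

def baseline_path(event_type, block):
    return BASELINE_DIR / f"{event_type}__{block}.json"

def record_baseline(event_type, block, result):
    """Store the golden standard once, from a run you have verified by hand."""
    BASELINE_DIR.mkdir(exist_ok=True)
    baseline_path(event_type, block).write_text(json.dumps(result, sort_keys=True))

def check_against_baseline(event_type, block, result):
    """Equal to the baseline -> the test passes."""
    expected = json.loads(baseline_path(event_type, block).read_text())
    return expected == result
```

The hard part is not the comparison itself but earning trust in the golden standard: it is only as good as the manual verification of the run that produced it.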

Page 25:

The solution

Page 26:

The Solution: Event Flow testing framework.

●TBD

Page 27:

Challenges with Ingress

●Number of collectors

●Event streaming cluster

○ Kafka vs. Flume

○ Hadoop tuning and maintenance

○ Data loss - how to handle

○ Performance of HDFS sink

○ Channel persistency

○ Channel performance

○ Downtime + recovery time

○ unreadable error logs (open source components)

○ Rate of data (events per second) → where is the bottleneck.

○ Debugging performance of a cluster

○ Continuous input (not stopping)

●Product - Batch vs. Real Time (acceptable latency?)

Page 28:

Challenges with Egress

●Hadoop tuning

●Spark Tuning

●Engine Tuning

○ Size of file on HDFS - bigger is better?

○ Processing

■ Recovery from crashes

■ Delayed data

● In order /out of order

● timestamp

■ Cluster getting stuck due to bottlenecks on the pipeline

Page 29:

Challenges with Big Data Testing

●Generating event times → genUtil with JSON for user input

●Reproducibility → genUtil again.

●Disk space for testing data → peta scale product → peta scale testing data

●Results comparison mechanism

●Network bottlenecks on a scale out system...

●ORDER IS NOT GUARANTEED!

●Replications

●Cluster utilization
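Because order is not guaranteed, comparing results line by line against a baseline fails even when the data is correct. One workaround is an order-insensitive comparison that also catches replication bugs, sketched here as a minimal multiset check (not the deck's actual comparison mechanism):

```python
from collections import Counter

# Order-insensitive result comparison: two outputs are equal if they
# contain the same events with the same multiplicities, regardless of
# the order a distributed pipeline happened to emit them in.
def same_events(actual, expected):
    return Counter(actual) == Counter(expected)

same_events(["e2", "e1", "e1"], ["e1", "e1", "e2"])   # True: order ignored
same_events(["e1"], ["e1", "e1"])                     # False: count mismatch
```

Counting multiplicities rather than using a set is deliberate: a set comparison would silently pass when replication duplicates or drops events.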

Page 30:

The Solution: Continuous Testing

1. Extra-large sanity testing per commit (!)

2. Hourly testing on the latest commit (24x7)

3. Weekly testing

4. Monthly testing

Page 31:

The Solution: Hardware Perspective

●Cluster utilization (Mesos, Docker)?

●Developer sandbox: Docker

●Uniform OS

●Uniform Hardware

●Get rid of weak links

Page 32:

Performance Challenges: App Perspective

●Architecture bottlenecks [count(*), group by, compression, sort-merge join]

●What is the IO pattern?

○ OLTP vs. OLAP

○ Columnar or Row based?

○ How big is your data?

○ READ % vs WRITE %.

○ Sequential? Random?

○ Temporary vs. permanent

○ Latency vs. throughput

○ Multi-threaded or single-threaded?

○ Power query? Or production?

○ Cold data? Hot data?

○ Number of threads per IO request

Page 33:

Performance and Maintenance

Page 34:

Performance Challenges: Ops Perspective

●Metrics

○ What is the expected total RAM required per query?

○ OS RAM cache, CPU cache hits?

○ Swap? Yes/no? How much?

○ OS metrics - open files, stack size, realtime priority?

●Theoretical limits?

○ Disk type selection - limitations, expected throughput

○ RAID controller, RAID type? Configuration? Caching?

○ File system used? DFS? PFS? Ext4? XFS?

○ Recommended file system block size? File system metadata?

○ Recommended File size on disk?

○ CPU selection - number of threads, cache levels, hyper-threading, silicon

○ RAM selection - and placement in the chassis

○ Network - the difference between 40Gbit and 25Gbit. What is NIC bonding good for?

○ PCI Express 3, 16 lanes. PCI switch.

Page 35:

Maintenance Challenges: Ops Perspective

How to optimize your hardware selection? (Analytics on your testing data)

a. Compute intensive

i. GPU-core intensive

ii. CPU-core intensive

1. Frequency

2. Number of threads for concurrency

b. IO intensive?

i. SSD?

ii. Nearline SATA?

iii. Partitions?

c. Hardware uniformity - use the same hardware everywhere.

d. Get rid of weak links

Page 36:

Maintenance Challenges: DevOps Perspective

●Lightweight - Python code

○ Advantage → focus on a generic skill set (Python coding)

○ Disadvantage → reinventing the wheel

●Fault tolerance - scale-out topology

○ Cheap “desktops” (weak machines)

■ Advantages →

● quick recovery from Image

● Cheap, low risk, pay as you go.

● Gets 90% of the job DONE!

■ Disadvantages

● Footprint / “cloud scale-out” is hell.

● Management of “desktops” - requires creativity

●Rely heavily on automation feature: reproducibility from scratch

●Validation automation on the infrastructure & deployment mechanism.

Page 37:

Maintenance Challenges: DevOps Perspective

●Infrastructure

○ Continuous integration: continuous build, continuous packaging, installer.

○ Continuous Deployment: pre flight check, remote machine (on site, private cloud, cloud)

○ Continuous testing: Sanity, Hourly, Daily, weekly

●Reporting and Monitoring

○ Extensive report server and analytics on current testing.

○ Monitoring: Green/Red and system real time metrics.

Page 38:

Maintenance Challenges: Innovation Perspective

1. Data generation (using the GPU to generate data)

2. (hashing)

3. Workload (one dispatcher, many compute nodes)

4. Saving running time: data/compute time

a. Hashing

b. Caching

c. Smart data creation - no need to create 100B events to run a 100B-event test.

d. Logs: test results were managed in a DB

5. Docker
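The hashing/caching ideas in items 4a-4b can be sketched as: derive a cache key from the generator's parameters and reuse previously generated data whenever the key matches, so identical test configurations never regenerate the same (potentially huge) dataset. The directory layout and names here are hypothetical:

```python
import hashlib
import json
from pathlib import Path

# Sketch of items 4a/4b: cache generated test data under a hash of the
# generator parameters. Identical parameters -> identical key -> reuse.
# The cache directory and layout are hypothetical.
CACHE_DIR = Path("genutil-cache")

def dataset_key(params):
    # sort_keys makes the key order-independent: {"a":1,"b":2} == {"b":2,"a":1}
    canonical = json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def get_or_generate(params, generate):
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / dataset_key(params)
    if not path.exists():                 # cache miss: generate exactly once
        path.write_text(generate(params))
    return path.read_text()
```

At peta scale the same idea is applied to test results as well: hash the inputs, and skip any run whose inputs have already been verified.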

Page 39:

Lessons Learned (Insights)

● Very hard to guarantee 100% coverage

● Quality at a reasonable cost is a huge challenge

● Very easy to miss milestones with QA of big data

● Each mistake creates a huge ripple effect in timelines

● Main costs:

○ Employees: a team of 2 people working 220 hours per month over 1 year

○ Running time: coding of innovative algorithms inside the testing framework

○ Maintenance: Automation, Monitoring, Analytics

○ Hardware: innovative DevOps, innovative Ops, research

○ Skill set and knowledge base take time

● Most of our innovation was spent on:

○ Doing more tests with less resources

○ Agile improvement of the backbone

○ Avoiding big data complexity problems in our testing!

● Industry tools? Great, but not always enough. Know their philosophy...

● Sometimes you need to get your hands dirty

● Keep a fine balance between business and technology

Page 40:

Stay in touch...

●Omid Vahdaty

●+972-54-2384178