Ingesting Healthcare Data, Micah Whitacre

INGESTING COMPLEX HEALTHCARE DATA WITH APACHE KAFKA

Micah Whitacre (@mkwhit)

#kafkasummit

Leader in Healthcare IT

~30% of all US Healthcare Data in a Cerner Solution

Sepsis Alerting (minutes)

Doctor’s Office

Minute Clinic

ER

Hospital

Specialist

Ambulatory (<2 seconds)

[Diagram: Google Percolator-style notifications on NoSQL. Labels: Table paired with Table.NOTIFY (three pairs), Collector, HTTP]

Was successful… for a while

Progressed from minutes to seconds

Hit a wall that prevented going faster (missed SLAs)

[Diagram: three NoSQL clusters, each with its own Collector and Crawler]

[Diagram: Solution A, Solution B, and Solution C, each with its own Collector and Crawler]

Use the right tool for the job!

NoSQL != Distributed Queue

Anti-patterns apply to everyone eventually

Our scalability should not impact crawlers

Cluster sprawl should be avoided

Reduce the number of copies

[Diagram: in NoSQL, the Table / Table.NOTIFY pair becomes Table / Kafka Topic]

Kafka Based Notifications

● Kafka topic per listener
● Small Google Protobuf payloads
  ○ Gzip-based compression for a higher compression ratio
● Could minimize to fewer listeners
  ○ Single topic and partition vs 100s of NoSQL rows
● Able to give up fairness concerns in favor of speed
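Small payloads compress poorly one at a time, because the gzip header and trailer alone cost about 18 bytes per message. Kafka producers compress whole batches of messages, which is where gzip pays off. The sketch below illustrates the effect with the standard library only. The payload is a `struct`-packed stand-in for a Protobuf message, and its three fields (`source_id`, `row_id`, `version`) are hypothetical, not taken from the talk.

```python
import gzip
import struct


def encode_notification(source_id: int, row_id: int, version: int) -> bytes:
    """Pack a notification into a 16-byte binary payload.

    Stand-in for a small Protobuf message; the fields are hypothetical.
    """
    return struct.pack(">IQI", source_id, row_id, version)


def compressed_sizes(payloads: list[bytes]) -> tuple[int, int]:
    """Return (total bytes when gzipping each payload alone,
    total bytes when gzipping all payloads as one batch)."""
    individually = sum(len(gzip.compress(p)) for p in payloads)
    batched = len(gzip.compress(b"".join(payloads)))
    return individually, batched


payloads = [encode_notification(7, row, 1) for row in range(1000)]
raw = sum(len(p) for p in payloads)
alone, batched = compressed_sizes(payloads)
print(f"raw={raw} gzip-per-message={alone} gzip-per-batch={batched}")
```

Per-message compression makes tiny payloads larger, while batch compression shrinks them, so the gain depends on how many notifications share a batch.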

[Diagram: a single Collector feeding three NoSQL clusters, each with its own Crawler]

Kafka Staging Area

● Single location for one copy of the data
● Consumption based on type and source of data
  ○ 500ish types and 100-1000 sources
  ○ Choose source-based topics to cut down on topics
  ○ Default to 8 partitions
● Snappy compression for low latency
● Huge variation in data sizes and frequency
  ○ Infrequent MB-GB file uploads (daily, weekly, monthly, yearly)
  ○ Streaming uploads of 100 B-10 MB
● Time-based retention to prevent data loss
  ○ Ambitiously set to 30 days but lowered to 7 days
  ○ Archive data to HDFS for reprocessing or lagging/offline consumers
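The staging-area choices above can be written down as a minimal sketch: one topic per source, 8 partitions, Snappy compression, and 7-day retention. `retention.ms` and `compression.type` are real Kafka topic-level settings. The `ingest.` topic prefix, the function names, and the idea that consumers filter by type after reading a source topic are assumptions for illustration.

```python
DEFAULT_PARTITIONS = 8   # default from the slide
RETENTION_DAYS = 7       # lowered from the original 30 days


def topic_for_source(source: str) -> str:
    """Map a data source to its topic name.

    Kafka topic names allow only alphanumerics, '.', '_' and '-', so any
    other character is replaced. The 'ingest.' prefix is hypothetical.
    """
    safe = "".join(c if c.isalnum() or c in "._-" else "_" for c in source)
    return f"ingest.{safe}"


def topic_settings(retention_days: int = RETENTION_DAYS) -> dict:
    """Settings to apply when creating a source topic."""
    return {
        "partitions": DEFAULT_PARTITIONS,
        "config": {
            "retention.ms": str(retention_days * 24 * 60 * 60 * 1000),
            "compression.type": "snappy",
        },
    }


def wanted(record_type: str, subscribed_types: set[str]) -> bool:
    """With source-based topics, a consumer filters by data type itself."""
    return record_type in subscribed_types


print(topic_for_source("hospital a/lab"), topic_settings())
```

Keying topics on source rather than on source and type keeps the topic count near the number of sources instead of multiplying it by the roughly 500 types.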

Kafka Payloads And Delivery

● Avro Schema to wrap ingested data
  ○ Source, Type, Id, Version, Value (byte[]), Metadata (byte[]), Properties
  ○ Common payload regardless of actual byte[]
● Set threshold for payloads stored in Kafka
  ○ Store 95-98% of data in Kafka
  ○ Data larger than 50 MB stored in HDFS with path stored in Avro wrapper
● Rate of ingestion changes with Kafka
  ○ Lack of backpressure can increase rate of ingestion
  ○ Capacity and retention planning could end up inaccurate
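A rough sketch of the wrapper and the size threshold follows. The field names come from the slide, but the Avro types, the record name `IngestedData`, the `hdfs.path` property key, and the `write_large` callback are all assumptions; the talk does not say where in the wrapper the HDFS path lives.

```python
import json

MAX_INLINE_BYTES = 50 * 1024 * 1024  # 50 MB threshold from the slide

# Avro schema expressed as a Python dict; the types are guesses.
WRAPPER_SCHEMA = {
    "type": "record",
    "name": "IngestedData",  # hypothetical record name
    "fields": [
        {"name": "source", "type": "string"},
        {"name": "type", "type": "string"},
        {"name": "id", "type": "string"},
        {"name": "version", "type": "long"},
        {"name": "value", "type": ["null", "bytes"], "default": None},
        {"name": "metadata", "type": ["null", "bytes"], "default": None},
        {"name": "properties", "type": {"type": "map", "values": "string"}},
    ],
}


def wrap(source, data_type, record_id, version, value: bytes,
         write_large, max_inline: int = MAX_INLINE_BYTES) -> dict:
    """Build a wrapper record, moving oversized values out of band.

    write_large is a hypothetical callable that stores the bytes
    (for example in HDFS) and returns the path it wrote to.
    """
    record = {
        "source": source, "type": data_type, "id": record_id,
        "version": version, "value": value, "metadata": None,
        "properties": {},
    }
    if len(value) > max_inline:
        record["value"] = None
        record["properties"]["hdfs.path"] = write_large(value)
    return record


print(json.dumps(WRAPPER_SCHEMA, indent=2))
```

Consumers then check for the path property before reading `value`, so the same wrapper covers both the inline and the out-of-band case.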

Most Surprising Lesson Learned

[Chart: Initial Crawl - NoSQL. x-axis: Weeks; y-axis: msg/sec. Phases: crawl all historical data, then crawling only recent changes]

Rate of Data Ingested Per Day By Source x Number of Sources x Number of Days to keep Kafka = Total Storage Needed in Kafka

[Chart: Initial Crawl - Kafka. x-axis: Days; y-axis: msg/sec. Crawls go from weeks to days. Phases: crawl all historical data, then crawling only recent changes. The same storage estimate is shown, annotated "10-30x"]
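The storage estimate on these slides can be worked through as simple arithmetic. This sketch assumes the terms are multiplied together, and it reads the 10-30x figure as a surge in ingestion rate once crawls stop being throttled by the NoSQL store; that reading and the `surge_factor` parameter are assumptions. The per-source rate in the example is hypothetical.

```python
def kafka_storage_bytes(rate_per_source_per_day: float,
                        sources: int,
                        retention_days: int,
                        surge_factor: float = 1.0) -> float:
    """Estimate bytes retained in Kafka, before replication.

    surge_factor models an initial crawl that ingests faster than the
    steady state (assumed interpretation of the 10-30x annotation).
    """
    return rate_per_source_per_day * sources * retention_days * surge_factor


GB = 10 ** 9
TB = 10 ** 12

# Hypothetical steady-state rate of 3 GB/day per source, 440 sources, 7 days.
steady = kafka_storage_bytes(3 * GB, 440, 7)
for factor in (1, 10, 30):
    needed = kafka_storage_bytes(3 * GB, 440, 7, surge_factor=factor)
    print(f"surge x{factor}: {needed / TB:.2f} TB")
```

A plan sized for the steady-state row is off by the full surge factor if every source crawls its history at once, which is why retention sized from NoSQL-era rates turned out too ambitious.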

Kafka Storage Woes

● Monitor ALL THE THINGS
  ○ Broker free space
  ○ Disk usage per topic
  ○ Consumer lag in message count and max latency
  ○ Rate of data per source to detect anomalies vs steady state
● Re-evaluate default retention with more evidence
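Two of the monitored signals above reduce to small calculations. The sketch below computes consumer lag in messages from per-partition offsets and flags a source whose rate is far above its steady state. The function names and the 5x threshold are hypothetical; the offsets would come from whatever tooling reports log-end and committed offsets.

```python
def consumer_lag(end_offsets: dict[int, int],
                 committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag in messages: log end offset minus committed offset.

    A partition with no committed offset is treated as committed at 0,
    which overstates lag if old segments have already been deleted.
    """
    return {p: max(end - committed.get(p, 0), 0)
            for p, end in end_offsets.items()}


def is_anomalous(current_rate: float, steady_state_rate: float,
                 factor: float = 5.0) -> bool:
    """Flag a source ingesting far faster than its steady state."""
    if steady_state_rate <= 0:
        return current_rate > 0
    return current_rate > factor * steady_state_rate


lag = consumer_lag({0: 1500, 1: 900}, {0: 1200, 1: 900})
print(lag, sum(lag.values()), is_anomalous(12_000, 1_000))
```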

Kafka Storage Woes Solution

● When storage gets tight, know your options
  ○ Automate building new servers
  ○ Adjust retention policy for one or more topics
● Balancing partitions is hard to do by hand
  ○ Balance in small batches
  ○ Automate, Automate, Automate
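One way to balance in small batches is to split a planned set of partition moves into several reassignment files and apply them one at a time. The sketch below emits JSON in the shape accepted by Kafka's partition reassignment tool (`version` plus a `partitions` list of topic, partition, and replicas). The function name, the batch size, and the example topic and broker ids are hypothetical.

```python
import json
from typing import Iterator


def reassignment_batches(moves: list[tuple[str, int, list[int]]],
                         batch_size: int = 5) -> Iterator[str]:
    """Yield one reassignment JSON document per batch of moves.

    Each move is (topic, partition, target replica broker ids).
    """
    for start in range(0, len(moves), batch_size):
        chunk = moves[start:start + batch_size]
        yield json.dumps({
            "version": 1,
            "partitions": [
                {"topic": topic, "partition": partition,
                 "replicas": list(replicas)}
                for topic, partition, replicas in chunk
            ],
        }, indent=2)


# Hypothetical plan: move 12 partitions of one topic onto brokers 4 and 5.
plan = [("ingest.source-a", p, [4, 5]) for p in range(12)]
batches = list(reassignment_batches(plan, batch_size=5))
print(len(batches), "batches")
print(batches[-1])
```

Applying a batch and waiting for it to complete before starting the next limits how much replication traffic hits the brokers at once.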

[Diagram sequence: the Collector, Crawlers, and NoSQL clusters, with Kafka introduced alongside NoSQL; the final diagram spans DataCenter A and DataCenter B]

Current Stats

● Deployed in 3 (soon to be 4) data centers
● 440 sources currently (⅓ of all clients)
● Ingesting 2 billion messages per day
  ○ Spiked as high as 6 billion
● Ingest 1.2 TB/day of raw data
● Archive job runs hourly and takes ~10 mins to pull ~50 GB of data
● Latency
  ○ NoSQL: 2-3 seconds (subset of data)
  ○ Replication (Kafka to Kafka): 700 milliseconds (all the data)
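A quick sanity check of these stats, assuming decimal units (1 TB = 10^12 bytes) and load spread evenly across the day, neither of which the slide states. The derived numbers are averages for orientation only.

```python
MESSAGES_PER_DAY = 2_000_000_000
PEAK_MESSAGES_PER_DAY = 6_000_000_000
RAW_BYTES_PER_DAY = 1_200_000_000_000   # 1.2 TB, decimal units assumed
SECONDS_PER_DAY = 24 * 60 * 60
ARCHIVE_RUN_SECONDS = 10 * 60           # ~10 minutes per hourly run

avg_msgs_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
peak_day_msgs_per_sec = PEAK_MESSAGES_PER_DAY / SECONDS_PER_DAY
avg_message_bytes = RAW_BYTES_PER_DAY / MESSAGES_PER_DAY
bytes_per_hour = RAW_BYTES_PER_DAY / 24
archive_bytes_per_sec = bytes_per_hour / ARCHIVE_RUN_SECONDS

print(f"average rate:       {avg_msgs_per_sec:,.0f} msg/sec")
print(f"peak-day rate:      {peak_day_msgs_per_sec:,.0f} msg/sec")
print(f"average message:    {avg_message_bytes:,.0f} bytes")
print(f"data per hour:      {bytes_per_hour / 1e9:,.0f} GB")
print(f"archive throughput: {archive_bytes_per_sec / 1e6:,.1f} MB/sec")
```

The hourly figure lines up with the ~50 GB the archive job pulls per run, so the two stats are consistent with each other.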

http://engineering.cerner.com/

References

● Percolator: http://research.google.com/pubs/pub36726.html
● Cassandra Queue Anti-pattern: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
● https://blog.cloudera.com/blog/2014/11/how-cerner-uses-cdh-with-apache-kafka/



