Ingesting Healthcare Data, Micah Whitacre
INGESTING COMPLEX HEALTHCARE DATA WITH APACHE KAFKA
Micah Whitacre (@mkwhit)
#kafkasummit
Leader Healthcare IT
~30% of all US Healthcare Data in a Cerner Solution
[Diagram labels: Doctor's Office, Minute Clinic, ER, Hospital, Specialist; Sepsis Alerting (minutes); Ambulatory (<2 seconds)]
[Diagram labels: Collector (HTTP); NoSQL holding Table and Table.NOTIFY pairs; Google Percolator]
Was successful… for a while
Progressed from minutes to seconds
Hit a wall preventing going faster (missed SLAs)
[Diagram labels: NoSQL (x3), Collector (x3), Crawler (x3)]
[Diagram labels: Solution A, Solution B, Solution C; Collector (x3), Crawler (x3)]
Use the right tool for the job!
NoSQL != Distributed Queue
Anti-patterns apply to everyone eventually
Our scalability should not impact crawlers
Cluster sprawl should be avoided
Reduce the number of copies
[Diagram labels: Table and Table.NOTIFY in NoSQL; Table and Kafka Topic]
Kafka Based Notifications
● Kafka topic per listener
● Small Google Protobuf payloads
  ○ Gzip compression for a higher compression ratio
● Could minimize to fewer listeners
  ○ Single topic and partition vs 100s of NoSQL rows
● Able to give up fairness concerns in favor of speed
[Diagram labels: NoSQL (x3), Collector, Crawler (x3)]
Kafka Staging Area
● Single location for one copy of the data
● Consumption based on type and source of data
  ○ ~500 types and 100-1000 sources
  ○ Chose source-based topics to cut down on topics
  ○ Default to 8 partitions
● Snappy compression for low latency
● Huge variation in data sizes and frequency
  ○ Infrequent MB-GB file uploads (daily, weekly, monthly, yearly)
  ○ Streaming uploads of 100 B-10 MB
● Time-based retention to prevent data loss
  ○ Ambitiously set to 30 days but lowered to 7 days
  ○ Archive data to HDFS for reprocessing or lagging/offline consumers
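The choice of source-based topics can be illustrated with a little arithmetic. The sketch below is not Cerner's code: the `staging.<source>` naming scheme, the CRC32 partitioner, and the function names are assumptions made for illustration. Only the counts (~500 types, up to 1000 sources, 8 partitions) come from the slide.

```python
import zlib

DEFAULT_PARTITIONS = 8  # default partition count from the slide


def topic_for(source):
    """Hypothetical naming scheme: one staging topic per data source."""
    return "staging." + source


def partition_for(record_id, partitions=DEFAULT_PARTITIONS):
    """Stable hash so records with the same id always land on the same partition."""
    return zlib.crc32(record_id.encode("utf-8")) % partitions


def topic_counts(types, sources):
    """Compare a topic per (type, source) pair with a topic per source."""
    return {"per_type_and_source": types * sources, "per_source": sources}


if __name__ == "__main__":
    counts = topic_counts(types=500, sources=1000)
    print(counts)
    print(topic_for("clinic-a"), partition_for("patient-123"))
```

With the slide's upper bounds, a topic per type and source would mean 500,000 topics, while a topic per source caps out at 1000. The data type then has to travel inside the payload so that consumers can filter on it.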
Kafka Payloads And Delivery
● Avro Schema to wrap ingested data
  ○ Source, Type, Id, Version, Value (byte[]), Metadata (byte[]), Properties
  ○ Common payload regardless of actual byte[]
● Set threshold for payloads stored in Kafka
  ○ Store 95-98% of data in Kafka
  ○ Data larger than 50 MB stored in HDFS with path stored in Avro wrapper
● Rate of ingestion changes with Kafka
  ○ Lack of backpressure can increase rate of ingestion
  ○ Capacity and retention planning could end up inaccurate
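The wrapper and the size threshold described above might look roughly like the following. This is a plain-Python sketch, not the actual Avro schema: the field types, the `hdfs.path` property key, and the `store_large` callback are assumptions, since the slide names the fields and the 50 MB cutoff but not where the path is stored.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

MAX_KAFKA_PAYLOAD_BYTES = 50 * 1024 * 1024  # the slide's 50 MB threshold


@dataclass
class Envelope:
    """Common wrapper around ingested data, whatever the raw bytes contain."""
    source: str
    type: str
    id: str
    version: int  # assumed to be an integer; the slide does not say
    value: bytes = b""
    metadata: bytes = b""
    properties: Dict[str, str] = field(default_factory=dict)


def wrap(source: str, type_: str, id_: str, version: int, raw: bytes,
         store_large: Callable[[bytes], str],
         threshold: int = MAX_KAFKA_PAYLOAD_BYTES) -> Envelope:
    """Build the envelope, moving oversized values out of band.

    store_large is a hypothetical callable that writes the bytes to HDFS
    and returns the path they were written to.
    """
    if len(raw) > threshold:
        path = store_large(raw)
        # Keep only a pointer in Kafka; "hdfs.path" is an assumed key name.
        return Envelope(source, type_, id_, version,
                        properties={"hdfs.path": path})
    return Envelope(source, type_, id_, version, value=raw)
```

Consumers would then check for the path property first and fall back to `value` when it is absent.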
Most Surprising Lesson Learned
[Chart: Initial Crawl - NoSQL. msg/sec over Weeks. Phases: crawl all historical data, then crawling only recent changes.]
[Storage estimate: Rate of Data Ingested Per Day By Source, Number of Sources, Number of Days to keep Kafka; result: Total Storage Needed in Kafka]
[Chart: Initial Crawl - Kafka. msg/sec over Days. Crawls go from weeks to days. Phases: crawl all historical data, then crawling only recent changes.]
[Storage estimate: same terms as above, annotated with 10-30x]
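The storage estimate on these slides lists three inputs and one result. A minimal sketch of that calculation follows, assuming the terms simply multiply. The transcript lost the operators, so the multiplication is an assumption. The example figures are hypothetical, the `replication_factor` parameter is not on the slide, and reading 10-30x as a multiplier on the ingest rate is also an assumption.

```python
def kafka_storage_bytes(bytes_per_day_per_source, sources, retention_days,
                        replication_factor=1):
    """Total storage = daily rate per source x sources x retention days.

    replication_factor is an addition for illustration, not a slide term.
    """
    return bytes_per_day_per_source * sources * retention_days * replication_factor


GB = 10 ** 9

if __name__ == "__main__":
    # Hypothetical plan: 1 GB/day per source, 100 sources, 7 days retention.
    planned = kafka_storage_bytes(1 * GB, 100, 7)
    print("planned:", planned // GB, "GB")
    # If removing backpressure raises the ingest rate 10-30x, so does the need.
    for surge in (10, 30):
        print(f"{surge}x surge:", kafka_storage_bytes(surge * GB, 100, 7) // GB, "GB")
```

Applying the same product to the aggregate figures on the Current Stats slide (1.2 TB/day over 7 days of retention) gives roughly 8.4 TB before replication.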
Kafka Storage Woes
● Monitor ALL THE THINGS
  ○ Broker free space
  ○ Disk usage per topic
  ○ Consumer lag in message count and max latency
  ○ Rate of data per source to detect anomalies vs steady state
● Re-evaluate default retention with more evidence
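Two of the monitoring signals above reduce to small calculations. The sketch below is illustrative only. The function names, the dictionary inputs, and the 3x tolerance are assumptions, and in practice the offsets and rates would come from whatever metrics system is in use.

```python
import statistics


def consumer_lag(end_offsets, committed_offsets):
    """Messages not yet consumed, per partition.

    Both arguments map partition number -> offset. A partition with no
    committed offset is treated as starting from 0, which is a simplification.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def is_anomalous(current_rate, baseline_rates, factor=3.0):
    """Flag a source whose rate is far above or below its steady state."""
    baseline = statistics.mean(baseline_rates)
    return current_rate > baseline * factor or current_rate < baseline / factor


if __name__ == "__main__":
    lag = consumer_lag({0: 100, 1: 50}, {0: 90, 1: 50})
    print(lag, "total:", sum(lag.values()))
    print(is_anomalous(1000, [100, 110, 90]))
```

A fixed multiple of the mean is the crudest possible anomaly check. It is used here only to make the "anomalies vs steady state" idea concrete.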
Kafka Storage Woes Solution
● When storage gets tight know your options
  ○ Automate building new servers
  ○ Adjust retention policy for a topic(s)
● Balancing partitions is hard to do by hand
  ○ Balance in small batches
  ○ Automate, Automate, Automate
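One way to automate balancing in small batches is to split a list of planned partition moves into several small reassignment plans and apply them one at a time. The sketch below is not the speaker's tooling: the `moves` tuples and the batch size are hypothetical. The JSON layout follows the format accepted by Kafka's `kafka-reassign-partitions` tool via `--reassignment-json-file`.

```python
import json


def reassignment_batches(moves, batch_size=5):
    """Split planned moves into small reassignment plans.

    moves is a list of (topic, partition, replica_broker_ids) tuples.
    Returns one JSON document per batch, to be applied and verified in turn.
    """
    plans = []
    for start in range(0, len(moves), batch_size):
        chunk = moves[start:start + batch_size]
        plans.append(json.dumps({
            "version": 1,
            "partitions": [
                {"topic": topic, "partition": partition, "replicas": list(replicas)}
                for topic, partition, replicas in chunk
            ],
        }))
    return plans


if __name__ == "__main__":
    # Hypothetical moves: shift 5 partitions of one topic onto brokers 4 and 5.
    planned = [("staging.clinic-a", p, [4, 5]) for p in range(5)]
    for plan in reassignment_batches(planned, batch_size=2):
        print(plan)
```

Waiting for each batch to finish before starting the next keeps the replication traffic caused by the move bounded.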
[Diagram labels: NoSQL (x3), Collector, Crawler (x3)]
[Diagram labels: NoSQL, Kafka, NoSQL, Collector, Crawler (x3)]
[Diagram labels: NoSQL (x3), Collector, Crawler (x3)]
[Diagram labels: NoSQL (x3), DataCenter A, DataCenter B, Collector, Crawler (x3)]
Current Stats
● Deployed in 3 (soon to be 4) data centers
● 440 sources currently (⅓ of all clients)
● Ingesting 2 billion messages per day
  ○ Spiked as high as 6 billion
● Ingest 1.2 TB/day of raw data
● Archive job runs hourly and takes ~10 mins to pull ~50 GB of data
● Latency
  ○ NoSQL: 2-3 seconds (subset of data)
  ○ Replication (Kafka to Kafka): 700 milliseconds (all the data)
http://engineering.cerner.com/
References
● Percolator: http://research.google.com/pubs/pub36726.html
● Cassandra Queue Anti-pattern: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
● How Cerner Uses CDH with Apache Kafka: https://blog.cloudera.com/blog/2014/11/how-cerner-uses-cdh-with-apache-kafka/