Apache Storm and Kafka Boston Storm User Group September 25, 2014 P. Taylor Goetz, Hortonworks...
Apache Storm and Kafka
Boston Storm User Group
September 25, 2014
P. Taylor Goetz, Hortonworks
@ptgoetz
Apache Kafka
Fast
“A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.”
http://kafka.apache.org
Apache Kafka
Scalable
“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.”
http://kafka.apache.org
Apache Kafka
Durable
“Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.”
http://kafka.apache.org
Apache Kafka
Distributed
“Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.”
http://kafka.apache.org
Apache Kafka: Use Cases
• Stream Processing
• Messaging
• Click Streams
• Metrics Collection and Monitoring
• Log Aggregation
Apache Kafka: Use Cases
• Greek-letter architectures (e.g., Lambda, Kappa)
• Which are really just streaming design patterns
Apache Kafka: Under the Hood
Producers write data to brokers.
Consumers read data from brokers.
This work is distributed across the cluster.
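The producer side of that flow can be sketched with the Java kafka-clients producer API (a minimal sketch, not from the deck: the broker addresses, topic name, and key/value strings are placeholders, and running it requires a reachable Kafka cluster):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
    public static void main(String[] args) {
        // Broker addresses are placeholders; point these at your own cluster.
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The write goes to a broker, which appends it to a partition
            // of the "clicks" topic; consumers then read it from the broker.
            producer.send(new ProducerRecord<>("clicks", "user-42", "click:home"));
        }
    }
}
```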
Apache Kafka: Under the Hood
Data is stored in topics.
Topics are divided into partitions.
Partitions are replicated.
Apache Kafka: Under the Hood
Topics are named feeds to which messages are published.
http://kafka.apache.org/documentation.html
Apache Kafka: Under the Hood
Topics consist of partitions.
http://kafka.apache.org/documentation.html
Apache Kafka: Under the Hood
A partition is an ordered, immutable sequence of messages that is continually appended to.
http://kafka.apache.org/documentation.html
Apache Kafka: Under the Hood
Sequential disk access can be faster than RAM!
http://kafka.apache.org/documentation.html
Apache Kafka: Under the Hood
Within a partition, each message is assigned a unique ID, called an offset, that identifies it.
http://kafka.apache.org/documentation.html
Apache Kafka: Under the Hood
http://kafka.apache.org/documentation.html
ZooKeeper is used to store cluster state information and consumer offsets.
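The log abstraction the slides describe (an ordered, immutable, append-only sequence addressed by offset, with consumers tracking their own position) can be illustrated with a small in-memory sketch. `PartitionLog` is a made-up class for illustration only, not a Kafka API, and it ignores persistence and replication:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal in-memory sketch of one Kafka partition: messages are only ever
// appended, never modified, and each one is identified by its offset.
class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Append a message and return the offset assigned to it.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Read the message at a given offset. A consumer remembers the last
    // offset it processed and resumes from there, which is why storing
    // offsets (e.g., in ZooKeeper) is enough to recover a consumer.
    String read(long offset) {
        return messages.get((int) offset);
    }

    // Offset the next appended message will receive.
    long nextOffset() {
        return messages.size();
    }
}

public class PartitionLogDemo {
    public static void main(String[] args) {
        PartitionLog p = new PartitionLog();
        System.out.println(p.append("click:home")); // 0
        System.out.println(p.append("click:cart")); // 1
        System.out.println(p.read(0));              // click:home
    }
}
```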
Data Source Reliability
• A data source is considered unreliable if there is no means to replay a previously-received message.
• A data source is considered reliable if it can somehow replay a message if processing fails at any point.
• A data source is considered durable if it can replay any message or set of messages given the necessary selection criteria.
Kafka is a durable data source.
Reliability in Storm
• Exactly once processing requires a durable data source.
• At least once processing requires a reliable data source.
• An unreliable data source can be wrapped to provide additional guarantees.
• With durable and reliable sources, Storm will not drop data.
• Common pattern: Back unreliable data sources with Apache Kafka (minor latency hit traded for 100% durability).
Storm and Kafka
Apache Kafka is an ideal source for Storm topologies. It provides everything necessary for:
• At most once processing
• At least once processing
• Exactly once processing
Apache Storm includes Kafka spout implementations for all levels of reliability.
Kafka supports a wide variety of languages and integration points for both producers and consumers.
Storm-Kafka Integration
• Included in Storm distribution since 0.9.2
• Core Storm Spout
• Trident Spouts (Transactional and Opaque-Transactional)
Storm-Kafka Integration
Features:
• Ingest from Kafka
• Configurable start time (offset position):
• Earliest, Latest, Last, Point-in-Time
• Write to Kafka (next release)
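Wiring the core Storm spout into a topology looks roughly like this with the storm-kafka module bundled since 0.9.2. This is a sketch: the ZooKeeper connect string, topic name, zkRoot, and spout id are placeholders, and exact class locations may differ between releases:

```java
import backtype.storm.spout.SchemeAsMultiScheme;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaSpoutBuilder {
    public static KafkaSpout build() {
        // ZooKeeper connect string, topic, zkRoot, and spout id are placeholders.
        BrokerHosts hosts = new ZkHosts("zkhost:2181");
        SpoutConfig config = new SpoutConfig(hosts, "clicks", "/kafka-spout", "click-spout");
        // Deserialize raw Kafka messages into single-field string tuples.
        config.scheme = new SchemeAsMultiScheme(new StringScheme());
        // Start position (cf. Earliest / Latest / Last / Point-in-Time):
        // here, replay from the earliest offset Kafka still retains.
        config.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
        return new KafkaSpout(config);
    }
}
```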
OpenSOC: Intrusion Detection
Breaches occur in seconds/minutes/hours, but take days/weeks/months to discover.
"Traditional security analytics tools scale up, not out."
"OpenSOC is a software application that turns a conventional big data platform into a security analytics platform."
- James Sirota, Cisco Security Solutions
https://www.youtube.com/watch?v=bQTZ8OgDayA
Health Market Science
• “Master File” database of every healthcare practitioner in the U.S.
• Kept up-to-date in near-real-time
• Represents the “truth” at any point in time (“Golden Record”)
Health Market Science
• Build products and services around the Master File
• Leverage those services to gather new data and updates
Why Trident?
• Aggregations and Joins
• Bulk update of persistence layer (Micro-batches)
• Throughput vs. Latency
CassandraCqlState
// Flush every statement accumulated during the Trident batch as one atomic CQL batch.
public void commit(Long txid) {
    BatchStatement batch = new BatchStatement(Type.LOGGED);
    batch.addAll(this.statements);
    clientFactory.getSession().execute(batch);
}

// Buffer a statement for execution at commit time.
public void addStatement(Statement statement) {
    this.statements.add(statement);
}

public ResultSet execute(Statement statement) {
    return clientFactory.getSession().execute(statement);
}
CassandraCqlStateUpdater
// Map each tuple in the Trident batch to a CQL statement and buffer it in the
// state; the statements execute together when the batch commits.
public void updateState(CassandraCqlState state,
                        List<TridentTuple> tuples,
                        TridentCollector collector) {
    for (TridentTuple tuple : tuples) {
        Statement statement = this.mapper.map(tuple);
        state.addStatement(statement);
    }
}
Mapper Implementation
// Build an INSERT statement for a (key, value) pair.
public Statement map(List<String> keys, Number value) {
    Insert statement = QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
    statement.value(KEY_NAME, keys.get(0));
    statement.value(VALUE_NAME, value);
    return statement;
}

// Build a SELECT statement for a key.
public Statement retrieve(List<String> keys) {
    Select statement = QueryBuilder.select()
            .column(KEY_NAME).column(VALUE_NAME)
            .from(KEYSPACE_NAME, TABLE_NAME)
            .where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
    return statement;
}
Storm Cassandra CQL
git@github.com:hmsonline/storm-cassandra-cql.git
{tuple} <-- <mapper> --> CQL Statement
Trident Batch == CQL Batch
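Putting the pieces together, a Trident topology wires the state and updater into a stream roughly as follows. This is a sketch only: `CassandraCqlStateFactory` and its construction are assumptions about the storm-cassandra-cql API rather than names confirmed by the deck, so consult the project's README for the actual classes; `kafkaSpout` and the field names are placeholders:

```java
import storm.trident.TridentTopology;
import backtype.storm.tuple.Fields;

public class TopologySketch {
    public static TridentTopology build(Object kafkaSpout) {
        TridentTopology topology = new TridentTopology();
        // Each Trident batch is mapped tuple-by-tuple to CQL statements and
        // flushed as a single CQL batch on commit (Trident batch == CQL batch).
        topology.newStream("kafka-stream", (storm.trident.spout.ITridentSpout) kafkaSpout)
                .partitionPersist(new CassandraCqlStateFactory(),   // assumed factory name
                                  new Fields("key", "value"),       // placeholder fields
                                  new CassandraCqlStateUpdater());  // updater from the slides
        return topology;
    }
}
```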