Online machine Learning with Divolte
GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
Online Machine Learning with Divolte Collector
Friso van Vollenhoven, CTO
GoDataDriven
The timeliness factor
• Apache Kafka
• Storm
• Apache Spark Streaming
• Apache Flink Streaming
• Low latency
• Real-time
• Event pipelines
99% of all data in Hadoop?
156.68.7.63 - - [28/Jul/1995:11:53:28 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669
137.244.160.140 - - [28/Jul/1995:11:53:29 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0
163.205.160.5 - - [28/Jul/1995:11:53:31 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 4324
163.205.160.5 - - [28/Jul/1995:11:53:40 -0400] "GET /shuttle/countdown/count70.gif HTTP/1.0" 200 46573
140.229.50.189 - - [28/Jul/1995:11:53:54 -0400] "GET /shuttle/missions/sts-67/images/images.html HTTP/1.0" 200 4464
163.206.89.4 - - [28/Jul/1995:11:54:02 -0400] "GET /shuttle/technology/sts-newsref/sts-mps.html HTTP/1.0" 200 215409
163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891
131.110.53.48 - - [28/Jul/1995:11:54:07 -0400] "GET /shuttle/technology/sts-newsref/stsref-toc.html HTTP/1.0" 200 84905
163.205.160.5 - - [28/Jul/1995:11:54:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
130.160.196.81 - - [28/Jul/1995:11:54:15 -0400] "GET /shuttle/resources/orbiters/challenger.html HTTP/1.0" 200 8089
131.110.53.48 - - [28/Jul/1995:11:54:16 -0400] "GET /images/shuttle-patch-small.gif HTTP/1.0" 200 4179
137.244.160.140 - - [28/Jul/1995:11:54:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10136
131.110.53.48 - - [28/Jul/1995:11:54:18 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
131.110.53.48 - - [28/Jul/1995:11:54:19 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713
130.160.196.81 - - [28/Jul/1995:11:54:19 -0400] "GET /shuttle/resources/orbiters/challenger-logo.gif HTTP/1.0" 200 4179
163.205.160.5 - - [28/Jul/1995:11:54:25 -0400] "GET /shuttle/missions/sts-70/images/images.html HTTP/1.0" 200 8657
130.181.4.158 - - [28/Jul/1995:11:54:26 -0400] "GET /history/rocket-history.txt HTTP/1.0" 200 26990
137.244.160.140 - - [28/Jul/1995:11:54:30 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 304 0
137.244.160.140 - - [28/Jul/1995:11:54:31 -0400] "GET /images/launch-logo.gif HTTP/1.0" 304 0
137.244.160.140 - - [28/Jul/1995:11:54:38 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 304 0
168.178.17.149 - - [28/Jul/1995:11:54:48 -0400] "GET /shuttle/missions/sts-65/mission-sts-65.html HTTP/1.0" 200 131165
140.229.50.189 - - [28/Jul/1995:11:54:53 -0400] "GET /shuttle/missions/sts-67/images/KSC-95EC-0390.jpg HTTP/1.0" 200 128881
131.110.53.48 - - [28/Jul/1995:11:54:58 -0400] "GET /shuttle/missions/missions.html HTTP/1.0" 200 8677
131.110.53.48 - - [28/Jul/1995:11:55:02 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853
131.110.53.48 - - [28/Jul/1995:11:55:05 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
128.159.111.141 - - [28/Jul/1995:11:55:09 -0400] "GET /procurement/procurement.html HTTP/1.0" 200 3499
128.159.111.141 - - [28/Jul/1995:11:55:10 -0400] "GET /images/op-logo-small.gif HTTP/1.0" 200 14915
128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
192.213.154.220 - - [28/Jul/1995:11:55:15 -0400] "GET /shuttle/countdown/tour.html HTTP/1.0" 200 4347
192.213.154.220 - - [28/Jul/1995:11:55:15 -0400] "GET /images/KSC-94EC-412-small.gif HTTP/1.0" 200 20484
Stream HTTP server logs
[Diagram: access.log -> tail -F -> EVENTS -> Message Queue or Event Transport (Kafka, Flume, etc.) -> EVENTS -> OTHER CONSUMERS]
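Tailing access.log yields plain text lines, so something downstream has to turn each line into a structured event before it is useful. The following is a minimal sketch of that parsing step, assuming the Common Log Format seen in the sample log lines earlier in the deck. The regular expression and the `parse_line` helper are illustrative names, not part of any tool mentioned in the talk.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Turn one access log line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event['time'] = datetime.strptime(event['time'], '%d/%b/%Y:%H:%M:%S %z')
    event['status'] = int(event['status'])
    event['size'] = 0 if event['size'] == '-' else int(event['size'])
    return event

line = ('163.205.160.5 - - [28/Jul/1995:11:53:31 -0400] '
        '"GET /shuttle/countdown/ HTTP/1.0" 200 4324')
event = parse_line(line)
```

Every consumer of the raw log stream would need logic like this, which is part of the motivation for emitting structured events at the source instead.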
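Tagging
[Diagram: the browser loads index.html and script.js from the web server (web page traffic), which writes access.log. The script sends tracking traffic (asynchronous) to a tracking server, which passes structured events to a Message Queue or Event Transport (Kafka, Flume, etc.); from there structured events flow on to OTHER CONSUMERS.]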
Tagging
• Not a new idea (Google Analytics, Omniture, etc.)
• Less garbage traffic, because a browser is required to evaluate the tag
• Event logging is asynchronous
• Easier to do inflight processing (apply a schema, add enrichments, etc.)
• Allows for custom events (other than page view)
Also…
• Manage session through cookies on the client side
  • Incoming data is already sessionised
• Extract additional information from clients
  • Screen resolution
  • Viewport size
  • Timezone
Divolte Collector
[Diagram: index.html and script.js are served by the web server (web page traffic), which writes access.log. Tracking traffic (asynchronous) goes to the tracking server, which sends structured events to a Message Queue or Event Transport (Kafka, Flume, etc.); structured events then flow on to OTHER CONSUMERS.]
JavaScript-based tag
<body>
  <!-- Your page content here. -->

  <!-- Include Divolte Collector just before the closing body tag. -->
  <script src="//example.com/divolte.js" defer async></script>
</body>
Data with a schema in Avro
{
  "namespace": "com.example.record",
  "type": "record",
  "name": "MyEventRecord",
  "fields": [
    { "name": "location", "type": "string" },
    { "name": "pageType", "type": "string" },
    { "name": "timestamp", "type": "long" }
  ]
}
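To make the schema concrete, here is a small stdlib-only sketch that checks whether a Python dict has the shape this schema describes. It is not an Avro implementation: a real pipeline would use an Avro library for encoding and validation. The `conforms` helper and the example event values are hypothetical.

```python
import json

SCHEMA = json.loads('''
{
  "namespace": "com.example.record",
  "type": "record",
  "name": "MyEventRecord",
  "fields": [
    { "name": "location", "type": "string" },
    { "name": "pageType", "type": "string" },
    { "name": "timestamp", "type": "long" }
  ]
}
''')

# Only the two primitive Avro types used by this schema are covered.
PYTHON_TYPES = {'string': str, 'long': int}

def conforms(record, schema):
    """Return True if record has exactly the schema's fields with matching types."""
    fields = {f['name']: PYTHON_TYPES[f['type']] for f in schema['fields']}
    if set(record) != set(fields):
        return False
    return all(isinstance(record[name], py_type)
               for name, py_type in fields.items())

# Hypothetical event that fits the schema.
event = {
    'location': 'http://example.com/checkout',
    'pageType': 'checkout',
    'timestamp': 1438084408000,
}
```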
Map incoming data onto Avro records
mapping {
  map clientTimestamp() onto 'timestamp'
  map location() onto 'location'

  def u = parse location() to uri
  section {
    when u.path().equalTo('/checkout') apply {
      map 'checkout' onto 'pageType'
      exit()
    }
    map 'normal' onto 'pageType'
  }
}
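For readers unfamiliar with the mapping DSL, the same decision logic can be sketched in plain Python. This is one reading of the mapping above, assuming `exit()` leaves the section so that 'normal' is only applied when the path is not '/checkout'. The `map_event` function is a hypothetical stand-in, not Divolte Collector code.

```python
from urllib.parse import urlparse

def map_event(location, client_timestamp):
    """Mirror the mapping: copy two fields and derive pageType from the URL path."""
    record = {'timestamp': client_timestamp, 'location': location}
    path = urlparse(location).path
    # '/checkout' gets its own page type; everything else is 'normal'.
    record['pageType'] = 'checkout' if path == '/checkout' else 'normal'
    return record

checkout = map_event('http://example.com/checkout?step=2', 1438084408000)
home = map_event('http://example.com/', 1438084409000)
```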
User agent parsing
map userAgent().family() onto 'browserName'
map userAgent().osFamily() onto 'operatingSystemName'
map userAgent().osVersion() onto 'operatingSystemVersion'
// Etc... More fields available
Useful performance
Requests per second:    14010.80 [#/sec] (mean)
Time per request:       0.571 [ms] (mean)
Time per request:       0.071 [ms] (mean, across all concurrent requests)
Transfer rate:          4516.55 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     0    0   0.2      0       3
Waiting:        0    0   0.2      0       3
Total:          0    1   0.2      1       3

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%      3 (longest request)
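The two "Time per request" figures are related by simple arithmetic, which the short sketch below works through. The output format appears to be that of ApacheBench (ab). The concurrency level is not shown on the slide, so the value derived here is an inference from the reported numbers, not a stated fact.

```python
requests_per_second = 14010.80
time_per_request_ms = 0.571  # mean, as seen by a single client

# Mean time per request across all concurrent requests is the inverse
# of the throughput, converted to milliseconds.
across_all_ms = 1000.0 / requests_per_second

# The per-client mean equals concurrency * across_all_ms, so the ratio
# recovers the concurrency level used for the benchmark (an inference).
implied_concurrency = time_per_request_ms / across_all_ms
```

Under these numbers the benchmark looks like it ran with about 8 concurrent clients.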
Custom events
In the page (JavaScript):
divolte.signal('addToBasket', {
  productId: 309125,
  count: 1
})

In the mapping (Groovy):
map eventParameter('productId') onto 'basketProductId'
map eventParameter('count') onto 'basketNumProducts'
Approach
1. Pick n images randomly
2. Optimise displayed image using bandit optimisation
3. After X iterations:
   • Pick n / 2 new images randomly
   • Select n / 2 images from existing set using learned distribution
   • Construct new set of images using half of existing set and newly selected random images
4. Go to 2
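Step 3 of the approach can be sketched with the standard library alone. This is a simplified illustration, assuming each image carries a (clicks, impressions) pair and that "using learned distribution" means ranking by one Beta draw per image. The `refresh` function, the integer catalog, and the seed are hypothetical.

```python
import random

def refresh(current, stats, catalog, rng):
    """Keep half of the current images by sampled value, add fresh random ones.

    stats maps image id -> (clicks, impressions); both must be > 0.
    """
    half = len(current) // 2
    # Rank existing images by a draw from their learned distribution.
    ranked = sorted(current,
                    key=lambda image: rng.betavariate(*stats[image]),
                    reverse=True)
    survivors = ranked[:half]
    # Pick new images that are not already in the set.
    candidates = [image for image in catalog if image not in current]
    newcomers = rng.sample(candidates, len(current) - half)
    # The new set is the survivors plus the newly selected random images.
    return survivors + newcomers

rng = random.Random(42)
catalog = list(range(100))
current = rng.sample(catalog, 4)
stats = {image: (1, 1) for image in current}
new_set = refresh(current, stats, catalog, rng)
```

Sampling rather than taking the strictly best half keeps some exploration in the selection: an image with few impressions still has a chance to survive.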
Bayesian Bandits
• For each image, keep track of:
  • Number of impressions
  • Number of clicks
• When serving an image:
  • Draw a random number from a Beta distribution with parameters alpha = # of clicks, beta = # of impressions, for each image
  • Show image where sample value is largest
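The serving rule above fits in a few lines of Python. This sketch follows the slide's parameterisation (alpha = clicks, beta = impressions); a common variant uses beta = impressions minus clicks instead. Both parameters must be strictly positive, so counts are assumed to start at 1. The `choose_image` helper and the example counts are made up for illustration.

```python
import random

def choose_image(counts, rng=random):
    """Pick the image whose Beta(clicks, impressions) draw is largest.

    counts maps image id -> (clicks, impressions); both must be > 0.
    """
    samples = {
        image: rng.betavariate(clicks, impressions)
        for image, (clicks, impressions) in counts.items()
    }
    return max(samples, key=samples.get)

# Hypothetical counts: 'b' has by far the best click rate.
counts = {'a': (1, 100), 'b': (50, 100), 'c': (2, 100)}
rng = random.Random(0)
picks = [choose_image(counts, rng) for _ in range(1000)]
```

With well-separated click rates the best image wins almost every draw, while images with few observations have wide distributions and still get shown occasionally.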
Bayesian Bandits
• https://en.wikipedia.org/wiki/Multi-armed_bandit
• http://tdunning.blogspot.nl/2012/02/bayesian-bandits.html
• https://www.chrisstucchio.com/blog/2013/bayesian_bandit.html
Prototype UI
from tornado.gen import coroutine

class HomepageHandler(ShopHandler):
    @coroutine
    def get(self):
        # Hard-coded ID for a pretty flower.
        # Later this ID will be decided by the bandit optimization.
        winner = '15442023790'

        # Grab the item details from our catalog service.
        top_item = yield self._get_json('catalog/item/%s' % winner)

        # Render the homepage.
        self.render(
            'index.html',
            top_item=top_item)
Prototype UI
<div class="col-md-6">
  <h4>Top pick:</h4>
  <p>
    <!-- Link to the product page with a source identifier for tracking -->
    <a href="/product/{{ top_item['id'] }}/#/?source=top_pick">
      <img class="img-responsive img-rounded" src="{{ top_item['variants']['Medium']['img_source'] }}">
      <!-- Signal that we served an impression of this image -->
      <script>divolte.signal('impression', { source: 'top_pick', productId: '{{ top_item['id'] }}'})</script>
    </a>
  </p>
  <p>
    Photo by {{ top_item['owner']['real_name'] or top_item['owner']['user_name']}}
  </p>
</div>
Data collection in Divolte Collector
{
  "name": "source",
  "type": ["null", "string"],
  "default": null
}

def locationUri = parse location() to uri
when eventType().equalTo('pageView') apply {
  def fragmentUri = parse locationUri.rawFragment() to uri
  map fragmentUri.query().value('source') onto 'source'
}

when eventType().equalTo('impression') apply {
  map eventParameters().value('productId') onto 'productId'
  map eventParameters().value('source') onto 'source'
}
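The pageView mapping parses the URL twice: once to get the fragment, and again to treat that fragment as a URI with its own query string. A small Python sketch of the same idea, using a link shaped like the one in the prototype template. The host name `shop.example` and the `source_from_location` helper are placeholders.

```python
from urllib.parse import urlparse, parse_qs

def source_from_location(location):
    """Return the 'source' query parameter inside the URL fragment, or None."""
    fragment = urlparse(location).fragment   # e.g. '/?source=top_pick'
    query = urlparse(fragment).query         # e.g. 'source=top_pick'
    values = parse_qs(query).get('source')
    return values[0] if values else None

location = 'http://shop.example/product/15442023790/#/?source=top_pick'
source = source_from_location(location)
```

Putting the marker in the fragment keeps it out of the request sent to the web server, while the tag in the page can still report it.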
Keep counts in Redis
{
  'c|14502147379': '2',
  'c|15106342717': '2',
  'c|15624953471': '1',
  'c|9609633287': '1',
  'i|14502147379': '2',
  'i|15106342717': '3',
  'i|15624953471': '2',
  'i|9609633287': '3'
}
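The flat hash is easier to reason about once it is grouped per item. The sketch below assumes, based on the key prefixes used in the consumer code, that 'c|' marks click counts and 'i|' marks impression counts. The `counts_by_item` helper is illustrative only.

```python
state = {
    'c|14502147379': '2', 'c|15106342717': '2',
    'c|15624953471': '1', 'c|9609633287': '1',
    'i|14502147379': '2', 'i|15106342717': '3',
    'i|15624953471': '2', 'i|9609633287': '3',
}

def counts_by_item(state):
    """Group 'c|<id>' and 'i|<id>' hash fields into {id: (clicks, impressions)}."""
    items = sorted({key[2:] for key in state})
    return {
        item: (int(state['c|' + item]), int(state['i|' + item]))
        for item in items
    }

counts = counts_by_item(state)
```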
Consuming Kafka in Python
import avro.io
import avro.schema
from kafka import KafkaConsumer

def start_consumer(args):
    # Load the Avro schema used for serialization.
    schema = avro.schema.Parse(open(args.schema).read())

    # Create a Kafka consumer and Avro reader. Note that
    # it is trivially possible to create a multi process
    # consumer.
    consumer = KafkaConsumer(args.topic,
                             client_id=args.client,
                             group_id=args.group,
                             metadata_broker_list=args.brokers)
    reader = avro.io.DatumReader(schema)

    # Consume messages.
    for message in consumer:
        handle_event(message, reader)
Consuming Kafka in Python
import io

def handle_event(message, reader):
    # Decode Avro bytes into a Python dictionary.
    message_bytes = io.BytesIO(message.value)
    decoder = avro.io.BinaryDecoder(message_bytes)
    event = reader.read(decoder)

    # Event logic.
    if 'top_pick' == event['source'] and 'pageView' == event['eventType']:
        # Register a click.
        redis_client.hincrby(
            ITEM_HASH_KEY,
            CLICK_KEY_PREFIX + ascii_bytes(event['productId']),
            1)
    elif 'top_pick' == event['source'] and 'impression' == event['eventType']:
        # Register an impression and increment experiment count.
        p = redis_client.pipeline()
        p.incr(EXPERIMENT_COUNT_KEY)
        p.hincrby(
            ITEM_HASH_KEY,
            IMPRESSION_KEY_PREFIX + ascii_bytes(event['productId']),
            1)
        experiment_count, ignored = p.execute()

        if experiment_count == REFRESH_INTERVAL:
            refresh_items()
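The event logic can be exercised without Kafka or Redis by swapping in plain Python containers. This is a test-style sketch, not the talk's code: the `Counter` stands in for the Redis hash, `apply_event` is a hypothetical name, and the `REFRESH_INTERVAL` value is made up because the slides do not give one.

```python
from collections import Counter

REFRESH_INTERVAL = 3  # hypothetical value; the talk does not give one

def apply_event(event, counts, state):
    """Update click/impression counts; return True when a refresh is due."""
    if event['source'] != 'top_pick':
        return False
    if event['eventType'] == 'pageView':
        # A page view arriving via the top pick link counts as a click.
        counts['c|' + event['productId']] += 1
    elif event['eventType'] == 'impression':
        counts['i|' + event['productId']] += 1
        state['experiment_count'] += 1
        return state['experiment_count'] == REFRESH_INTERVAL
    return False

counts, state = Counter(), {'experiment_count': 0}
events = [
    {'source': 'top_pick', 'eventType': 'impression', 'productId': '1'},
    {'source': 'top_pick', 'eventType': 'pageView', 'productId': '1'},
    {'source': None, 'eventType': 'pageView', 'productId': '2'},
    {'source': 'top_pick', 'eventType': 'impression', 'productId': '1'},
    {'source': 'top_pick', 'eventType': 'impression', 'productId': '2'},
]
due = [apply_event(event, counts, state) for event in events]
```

Only impressions advance the experiment counter, so the refresh is triggered by the number of times an image was shown, not by the number of clicks.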
import numpy

def refresh_items():
    # Fetch current model state. We convert everything to str.
    current_item_dict = redis_client.hgetall(ITEM_HASH_KEY)
    current_items = numpy.unique([k[2:] for k in current_item_dict.keys()])

    # Fetch random items from ElasticSearch. Note we fetch more than we need,
    # but we filter out items already present in the current set and truncate
    # the list to the desired size afterwards.
    random_items = [
        ascii_bytes(item)
        for item in random_item_set(NUM_ITEMS + NUM_ITEMS - len(current_items) // 2)
        if item not in current_items][:NUM_ITEMS - len(current_items) // 2]

    # Draw random samples.
    samples = [
        numpy.random.beta(
            int(current_item_dict[CLICK_KEY_PREFIX + item]),
            int(current_item_dict[IMPRESSION_KEY_PREFIX + item]))
        for item in current_items]

    # Select top half by sample values. current_items is conveniently
    # a Numpy array here.
    survivors = current_items[numpy.argsort(samples)[len(current_items) // 2:]]

    # New item set is survivors plus the random ones.
    new_items = numpy.concatenate([survivors, random_items])

    # Update model state to reflect new item set. This operation is atomic
    # in Redis.
    p = redis_client.pipeline(transaction=True)
    p.set(EXPERIMENT_COUNT_KEY, 1)
    p.delete(ITEM_HASH_KEY)
    for item in new_items:
        p.hincrby(ITEM_HASH_KEY, CLICK_KEY_PREFIX + item, 1)
        p.hincrby(ITEM_HASH_KEY, IMPRESSION_KEY_PREFIX + item, 1)
    p.execute()
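The size expressions in `refresh_items` are easy to misread because of operator precedence, so here is the arithmetic worked out for one case. The value of `NUM_ITEMS` is hypothetical. The worst-case argument in the comments also assumes that `random_item_set` returns distinct items, which the slides do not confirm.

```python
NUM_ITEMS = 8   # hypothetical target size of the item set
current = 8     # number of items currently in the set

keep = current // 2                 # survivors selected by sampling
needed = NUM_ITEMS - current // 2   # new random items required

# // binds tighter than + and -, so this reads as
# NUM_ITEMS + (NUM_ITEMS - current // 2).
fetched = NUM_ITEMS + NUM_ITEMS - current // 2

# Even if every current item shows up among the fetched ones, enough
# candidates remain after filtering to fill the needed slots.
worst_case_left = fetched - current
```

Here 12 items are fetched to fill 4 slots, leaving exactly 4 usable candidates in the worst case.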
Serving a recommendation
import numpy
from tornado import gen, web

class BanditHandler(web.RequestHandler):
    redis_client = None

    def initialize(self, redis_client):
        self.redis_client = redis_client

    @gen.coroutine
    def get(self):
        # Fetch model state.
        item_dict = yield gen.Task(self.redis_client.hgetall, ITEM_HASH_KEY)
        items = numpy.unique([k[2:] for k in item_dict.keys()])

        # Draw random samples.
        samples = [
            numpy.random.beta(
                int(item_dict[CLICK_KEY_PREFIX + item]),
                int(item_dict[IMPRESSION_KEY_PREFIX + item]))
            for item in items]

        # Select item with largest sample value.
        winner = items[numpy.argmax(samples)]

        self.write(winner)
Integrate
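from tornado.escape import json_decode
from tornado.gen import coroutine
from tornado.httpclient import AsyncHTTPClient, HTTPRequest

class HomepageHandler(ShopHandler):
    @coroutine
    def get(self):
        # Ask the bandit service which item to show.
        http = AsyncHTTPClient()
        request = HTTPRequest(url='http://localhost:8989/item', method='GET')
        response = yield http.fetch(request)
        winner = json_decode(response.body)

        # Grab the item details from the catalog service and render.
        top_item = yield self._get_json('catalog/item/%s' % winner)

        self.render(
            'index.html',
            top_item=top_item)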
References
• http://blog.godatadriven.com/rapid-prototyping-online-machine-learning-divolte-collector.html
• http://divolte.io
• https://github.com/divolte/divolte-collector
• https://github.com/divolte/divolte-examples