Fantasy League Sports with Big Data Technologies

Post on 15-Jul-2015

FANTASY LEAGUE SPORTS

Fantastical, fast, and furious fantasy stats.

By: Silvia Oliveros

WHY FANTASY LEAGUES?

• I like watching sports.

• Large fan base (41 million people).

• Simulate my own site with a user base of 5 million.

WEBSITE

PIPELINE

Data sources: User Data (Information, Roster); NFL (Play-by-Play)

DATA INGESTION: Kafka

REAL-TIME / SPEED LAYER: Spark Streaming

BATCH LAYER: HDFS, Spark

SERVING LAYER: Cassandra, Flask

DATA INGESTION

User Data (Information, Roster) and NFL (Play-by-Play) → Kafka

User Data (Roster):

Play-by-Play Data:

Why Kafka?

Two consumers to send data to HDFS and Spark Streaming.

Potential real-time changes in roster information (future).
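The two-consumer point can be illustrated with a minimal sketch. It is a toy in-memory log, not the Kafka client API: `TopicLog`, the group names, and the play records are all hypothetical. It only shows the idea that each consumer keeps its own offset, so the HDFS writer and Spark Streaming can both read every play independently.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log; a stand-in for a single-partition topic."""

    def __init__(self):
        self._records = []
        # Each consumer group tracks its own read position.
        self._offsets = defaultdict(int)

    def produce(self, record):
        self._records.append(record)

    def poll(self, group):
        """Return records this group has not seen yet and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:]
        self._offsets[group] = len(self._records)
        return batch


plays = TopicLog()
plays.produce({"player": "QB-12", "event": "passing_td"})
plays.produce({"player": "RB-28", "event": "rush", "yards": 14})

# Hypothetical group names: one consumer per downstream layer.
hdfs_batch = plays.poll("hdfs-writer")        # feeds the batch layer
stream_batch = plays.poll("spark-streaming")  # feeds the speed layer

plays.produce({"player": "WR-81", "event": "reception", "yards": 9})
late_batch = plays.poll("spark-streaming")    # only the new play
```

Because the offsets are independent, a slow HDFS writer does not hold back the streaming consumer, and either one can be restarted without affecting the other.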

REAL-TIME / SPEED LAYER

New play comes in:

Lookup (roster data):

Generate information:

Aggregate points by user
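The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark Streaming job from the talk: the scoring table `POINTS`, the `ROSTERS` lookup, and the record fields are hypothetical, since the slides do not show them.

```python
from collections import defaultdict

# Hypothetical scoring rules; the slides do not state the ones used.
POINTS = {"passing_td": 4.0, "rushing_td": 6.0, "reception": 1.0}

# Hypothetical roster lookup: NFL player id -> fantasy users who own that player.
ROSTERS = {
    "QB-12": ["alice", "bob"],
    "RB-28": ["alice"],
    "WR-81": ["carol"],
}


def score_play(play):
    """Look up the owners of the player and generate (user, points) pairs."""
    points = POINTS.get(play["event"], 0.0)
    owners = ROSTERS.get(play["player"], [])
    return [(user, points) for user in owners]


def aggregate(plays, totals=None):
    """Fold a micro-batch of plays into running per-user totals."""
    totals = defaultdict(float) if totals is None else totals
    for play in plays:
        for user, points in score_play(play):
            totals[user] += points
    return totals


micro_batch = [
    {"player": "QB-12", "event": "passing_td"},
    {"player": "RB-28", "event": "rushing_td"},
    {"player": "WR-81", "event": "reception"},
]
totals = aggregate(micro_batch)
```

Passing the previous `totals` back into `aggregate` carries the running score from one micro-batch to the next.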

BATCH LAYER

• Spark on top of HDFS

• Admin queries (updated once every 24 hours):

  • Top Users

  • Demographic Breakdown

• User and Player queries:

  • Historical fantasy points per game / week
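Two of the batch queries listed above, top users and historical points per week, reduce to group-and-sum aggregations. The sketch below uses plain Python as a stand-in for the Spark job; the `HISTORY` records and both function names are hypothetical.

```python
from collections import defaultdict
import heapq

# Hypothetical scored records as the batch layer might read them back:
# (user, week, fantasy points).
HISTORY = [
    ("alice", 1, 12.5),
    ("bob", 1, 20.0),
    ("alice", 2, 8.0),
    ("carol", 2, 30.5),
    ("bob", 2, 3.5),
]


def points_per_week(records):
    """User query: total fantasy points keyed by (user, week)."""
    totals = defaultdict(float)
    for user, week, points in records:
        totals[(user, week)] += points
    return dict(totals)


def top_users(records, n):
    """Admin query: the n users with the highest total points."""
    totals = defaultdict(float)
    for user, _week, points in records:
        totals[user] += points
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


weekly = points_per_week(HISTORY)
leaders = top_users(HISTORY, 2)
```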

SERVING LAYER

Cassandra

Flask

Multiple queries require different tables with efficient schemas.
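A minimal sketch of the "one table per query" idea, using dictionaries as stand-ins for Cassandra tables. The table names, keys, and `write_result` helper are hypothetical; the actual schemas are not shown in the slides.

```python
from collections import defaultdict

# One "table" per query, each keyed by the column that query filters on.
points_by_user = defaultdict(dict)    # user -> {week: points}
points_by_player = defaultdict(dict)  # player -> {week: points}


def write_result(user, player, week, points):
    """Write the same fact to every table that needs it (denormalization)."""
    points_by_user[user][week] = points_by_user[user].get(week, 0.0) + points
    points_by_player[player][week] = points_by_player[player].get(week, 0.0) + points


write_result("alice", "QB-12", 1, 4.0)
write_result("alice", "RB-28", 1, 6.0)
write_result("bob", "QB-12", 1, 4.0)

# Each read is a single key lookup, with no join or scan at query time.
user_view = points_by_user["alice"]
player_view = points_by_player["QB-12"]
```

The trade-off is extra writes and duplicated data in exchange for reads that touch exactly one key.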

API for both analysts and users of the website.

D3 graphs

LESSONS LEARNED

• Technologies: Spark, Spark Streaming, Cassandra

• Scalability in Spark Streaming for different operations (number of records vs number of nodes)

• Spark Streaming's saveAs functions write a lot of small files, even after repartitioning. To deal with that in HDFS, I wrote a function that appends to a single file.
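The append-to-one-file workaround might look roughly like the sketch below. It is not the function from the talk: the local filesystem stands in for HDFS, and `append_batch` and the record format are hypothetical.

```python
import os
import tempfile


def append_batch(path, records):
    """Append one micro-batch to a single output file, one record per line."""
    if not records:
        return  # skip empty batches instead of touching the file
    with open(path, "a", encoding="utf-8") as out:
        for record in records:
            out.write(record + "\n")


# Simulate three micro-batches, one of them empty.
out_path = os.path.join(tempfile.mkdtemp(), "plays.txt")
for batch in (["play-1", "play-2"], [], ["play-3"]):
    append_batch(out_path, batch)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

In a streaming job, a function like this would be called once per micro-batch, so the output grows as one file rather than one small file per batch and partition.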

SILVIA OLIVEROS

• M.S. Computer Engineering - Purdue University

• Developed Visual Analytics Tools for DHS Partners:

• Coast Guard

• Dietary Survey (NHANES)

soliverost@gmail.com

github.com/soliverost