CLICKSTREAM ANALYSIS WITH APACHE SPARK (schd.ws/.../09/apachebigdata-clickstream-analysis.pdf)
CLICKSTREAM ANALYSIS WITH APACHE SPARK
Andreas Zitzelsberger
THE CHALLENGE
ONE POT TO RULE THEM ALL
Web Tracking
▪ Clicks & Views
▪ Conversions
Ad Tracking
▪ Ad Impressions
▪ Ad Costs
ERP
▪ Products
▪ Inventory
▪ Margins
CRM
▪ Customer
▪ Orders
▪ Creditworthiness
ONE POT TO RULE THEM ALL
Retention, Reach, Monetization
steer …
▪ Campaigns
▪ Offers
▪ Contents
REACT TO WEBSITE TRAFFIC IN REAL TIME
Image: https://www.flickr.com/photos/nick-m/3663923048
SAMPLE RESULTS
Geolocated and gender-specific conversions.
Frequency of visits
Performance of an ad campaign
THE CONCEPTS
Image: Randy Paulino
THE FIRST SKETCH
(= real-time)
SQL
CALCULATING USER JOURNEYS
C V VT VT VT C X
C V
V V V V V V V
C V V C V V V
VT VT V V V VT C
V X
Event stream: User journeys:
Web / Ad tracking
KPIs:
▪ Unique users
▪ Conversions
▪ Ad costs / conversion value
▪ …
Legend: C = Click, V = View, VT = View Time, X = Conversion
THE ARCHITECTURE
Big Data
„LARRY & FRIENDS“ ARCHITECTURE
Does not run well for more than 1 TB of data in terms of ingestion speed, query time, and optimization effort
Image: adweek.com
Nope. Sorry, no Big Data.
„HADOOP & FRIENDS“ ARCHITECTURE
Aggregation takes too long
Cumbersome programming model (can be solved with Pig, Cascading et al.)
Not interactive enough
Nope. Too sluggish.
Κ-ARCHITECTURE
Cumbersome programming model
Over-engineered: We only need 15 min real-time ;-)
Stateful aggregations (unique x, conversions) require a separate DB with high throughput and fast aggregations & lookups.
Λ-ARCHITECTURE
Cumbersome programming model
Complex architecture
Redundant logic
FEELS OVER-ENGINEERED…
http://www.brainlazy.com/article/random-nonsense/over-engineered
The Final Architecture*
*) Maybe called µ-architecture one day ;-)
FUNCTIONAL ARCHITECTURE
Strange Events
Ingestion
Raw Event Stream
Collection Events Processing Analytics Warehouse
Fact Entries
Atomic Event Frames
Data Lake
Master Data Integration
▪ Buffers load peaks
▪ Ensures message delivery (fire & forget for client)
▪ Create user journeys and unique user sets
▪ Enrich dimensions
▪ Aggregate events to KPIs
▪ Ability to replay for schema evolution
▪ The representation of truth
▪ Multidimensional data model
▪ Interactive queries for actions in real time and data exploration
▪ Eternal memory for all events (even strange ones)
▪ One schema per event type. Time partitioned.
▪ Fault tolerant message handling
▪ Event handling: apply schema, time-partitioning, de-dup, sanity checks, pre-aggregation, filtering, fraud detection
▪ Tolerates delayed events
▪ High throughput, moderate latency (~ 1 min)
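A few of the event-handling steps named on this slide (apply schema, sanity checks, de-dup, time partitioning, keeping strange events) could be sketched roughly as follows. Everything here is an assumption for illustration: the `Collector` class, the raw record format `eventId;userId;epochMillis;type`, the validity rules and the hourly partition key are not taken from the talk.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical collection step; not the talk's implementation.
final class Collector {
    // Hour-granular partition key in UTC, e.g. "1970-01-01T00" (assumed layout).
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH").withZone(ZoneOffset.UTC);
    // Event type codes from the deck's legend: Click, View, View Time, Conversion.
    private static final Set<String> KNOWN_TYPES =
            new HashSet<>(Arrays.asList("C", "V", "VT", "X"));

    // De-dup state; a real system would bound this, e.g. per time window.
    private final Set<String> seenIds = new HashSet<>();
    // Accepted events, grouped by time partition.
    final Map<String, List<String[]>> partitions = new TreeMap<>();
    // Records that fail the schema or sanity checks are kept, not dropped.
    final List<String> strangeEvents = new ArrayList<>();

    // Handles one raw record of the assumed form "eventId;userId;epochMillis;type".
    void handle(String raw) {
        String[] f = raw.split(";", -1);
        // Apply schema (exactly four fields) and sanity checks.
        if (f.length != 4 || !isSane(f)) {
            strangeEvents.add(raw);
            return;
        }
        // De-dup: ignore an event id that was already accepted.
        if (!seenIds.add(f[0])) {
            return;
        }
        // Time partitioning by event time.
        String partition = HOUR.format(Instant.ofEpochMilli(Long.parseLong(f[2])));
        partitions.computeIfAbsent(partition, k -> new ArrayList<>()).add(f);
    }

    private static boolean isSane(String[] f) {
        try {
            return !f[0].isEmpty() && !f[1].isEmpty()
                    && Long.parseLong(f[2]) >= 0 && KNOWN_TYPES.contains(f[3]);
        } catch (NumberFormatException ex) {
            return false;
        }
    }
}
```

Routing malformed records into a separate list instead of discarding them mirrors the "eternal memory for all events (even strange ones)" idea: they stay available for a later replay once the schema or the checks change.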
SERIAL CONNECTION OF STREAMING AND BATCHING
Ingestion
Raw Event Stream
Collection Event Data Lake Processing Analytics Warehouse
Fact Entries
SQL Interface
Atomic Event Frames
▪ Cool programming model
▪ Uniform dev&ops
▪ Simple solution
▪ High compression ratio due to column-oriented storage
▪ High scan speed
▪ Cool programming model
▪ Uniform dev&ops
▪ High performance
▪ Interface to R out-of-the-box
▪ Useful libs: MLlib, GraphX, NLP, …
▪ Good connectivity (JDBC, ODBC, …)
▪ Interactive queries
▪ Uniform ops
▪ Can easily be replaced due to Hive Metastore
▪ Obvious choice for cloud-scale messaging
▪ By far the best throughput and scalability of all evaluated alternatives
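```java
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// AtomicEvent and UserJourneyAcc are the talk's own classes; their definitions are not shown.
// The slide declares Map<Long, UserJourney> as return type, but the chain below yields
// accumulators, so the signature is adjusted here to match what collectAsMap() returns.
public Map<Long, UserJourneyAcc> sessionize(JavaRDD<AtomicEvent> events) {
    return events
        // Convert to a pair RDD with the userId as key
        .mapToPair(e -> new Tuple2<>(e.getUserId(), e))
        // Build user journeys: create an accumulator from a user's first event,
        // add further events to it, and combine accumulators across partitions
        .<UserJourneyAcc>combineByKey(
            UserJourneyAcc::create,
            UserJourneyAcc::add,
            UserJourneyAcc::combine)
        // Convert to a Java map (collects all journeys on the driver)
        .collectAsMap();
}
```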
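The slide does not show `UserJourneyAcc`. As a rough idea of what the three functions passed to `combineByKey` have to do, here is a hypothetical accumulator together with a plain-Java imitation of two partitions being combined. `AtomicEvent` is reduced to an assumed minimum (only `getUserId()` appears in the original), and `CombineByKeyDemo` merely mimics Spark's behaviour so the sketch runs without a cluster.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the talk's AtomicEvent.
final class AtomicEvent {
    private final long userId;
    private final long timestamp;
    private final String type;

    AtomicEvent(long userId, long timestamp, String type) {
        this.userId = userId;
        this.timestamp = timestamp;
        this.type = type;
    }

    long getUserId() { return userId; }
    long getTimestamp() { return timestamp; }
    String getType() { return type; }
}

// Hypothetical accumulator offering the three functions combineByKey expects.
final class UserJourneyAcc {
    private final List<AtomicEvent> events = new ArrayList<>();

    // createCombiner: start an accumulator from the first event seen for a key.
    static UserJourneyAcc create(AtomicEvent first) {
        UserJourneyAcc acc = new UserJourneyAcc();
        acc.events.add(first);
        return acc;
    }

    // mergeValue: add one more event of the same key within a partition.
    static UserJourneyAcc add(UserJourneyAcc acc, AtomicEvent e) {
        acc.events.add(e);
        return acc;
    }

    // mergeCombiners: merge partial results of the same key from two partitions.
    static UserJourneyAcc combine(UserJourneyAcc a, UserJourneyAcc b) {
        a.events.addAll(b.events);
        return a;
    }

    // Events in event-time order; arrival order across partitions is arbitrary.
    List<AtomicEvent> journey() {
        List<AtomicEvent> sorted = new ArrayList<>(events);
        sorted.sort(Comparator.comparingLong(AtomicEvent::getTimestamp));
        return sorted;
    }
}

final class CombineByKeyDemo {
    // Imitates what combineByKey does inside a single partition.
    static Map<Long, UserJourneyAcc> combinePartition(List<AtomicEvent> partition) {
        Map<Long, UserJourneyAcc> out = new HashMap<>();
        for (AtomicEvent e : partition) {
            UserJourneyAcc acc = out.get(e.getUserId());
            out.put(e.getUserId(),
                    acc == null ? UserJourneyAcc.create(e) : UserJourneyAcc.add(acc, e));
        }
        return out;
    }

    // Imitates the shuffle step: merges per-partition results by key.
    static Map<Long, UserJourneyAcc> merge(Map<Long, UserJourneyAcc> a,
                                           Map<Long, UserJourneyAcc> b) {
        Map<Long, UserJourneyAcc> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, UserJourneyAcc::combine));
        return out;
    }
}
```

Because `combine` may receive partial journeys in any order, the accumulator only collects events and sorts them when the journey is read. A real accumulator would probably also track KPIs such as conversions while adding events.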
STREAM VERSUS BATCH
https://en.wikipedia.org/wiki/Tanker_(ship)#/media/File:Sirius_Star_2008b.jpg
https://blog.allstate.com/top-5-safety-tips-at-the-gas-pump/
APACHE FLINK
■ Also has a nice, Spark-like API
■ Promises similar or better performance than Spark
■ Looks like the best solution for a κ-Architecture
■ But it’s also the newest kid on the block
EVENT VERSUS PROCESSING TIME
■ There’s a difference between event time (te) and processing time (tp).
■ Events arrive out of order even during normal operation.
■ Events may arrive arbitrarily late.
Apply a grace period before processing events.
Allow arbitrary update windows of metrics.
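One way to read "apply a grace period": keep each event-time window open until processing time has passed the window end plus the grace period, and only then hand it on. The following is a minimal sketch under that reading. The class `GracePeriodBuffer`, the millisecond units and the use of a simple event count as the metric are assumptions, not the talk's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical buffer that delays windows by a grace period before emitting them.
final class GracePeriodBuffer {
    private final long windowMillis; // aggregation time frame
    private final long graceMillis;  // grace period
    // Window start (event time) -> number of buffered events, sorted by window start.
    private final TreeMap<Long, Long> open = new TreeMap<>();

    GracePeriodBuffer(long windowMillis, long graceMillis) {
        this.windowMillis = windowMillis;
        this.graceMillis = graceMillis;
    }

    // Buffers an event under the window its event time falls into,
    // regardless of the order in which events arrive.
    void accept(long eventTime) {
        long windowStart = eventTime - Math.floorMod(eventTime, windowMillis);
        open.merge(windowStart, 1L, Long::sum);
    }

    // Emits and removes every window whose end plus grace period is not after
    // the given processing time. Each result is {windowStart, eventCount}.
    List<long[]> flush(long processingTime) {
        List<long[]> closed = new ArrayList<>();
        while (!open.isEmpty()
                && open.firstKey() + windowMillis + graceMillis <= processingTime) {
            Map.Entry<Long, Long> e = open.pollFirstEntry();
            closed.add(new long[] {e.getKey(), e.getValue()});
        }
        return closed;
    }
}
```

An event that arrives after its window was flushed simply opens that window again, so the window is emitted a second time with the late events only. Downstream, this second emission has to be applied as an update to the existing metric rather than as a new one, which is the point of allowing arbitrary update windows.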
EXAMPLE
[Figure: axis "Resolution in Time" (Minute, Hour, Day, Week, Month, Quarter, Year) against "Time"; cells are marked I or U.]
tp: Processing time
ti: Ingestion time
te: Event time
dtp: Aggregation time frame
dtw: Grace period
I: Insert fact
U: Update fact
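The insert/update pattern of this example can be thought of as an upsert per time resolution: the first event that falls into a time bucket inserts a fact, every further event for that bucket (including late ones) updates it. The sketch below is an assumption-laden illustration. `FactTable`, its string keys and the event count as the only measure are hypothetical, and only the fixed-length resolutions are covered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory fact table with one counter per resolution and time bucket.
final class FactTable {
    // Fixed-length resolutions only; calendar-based ones (month, quarter, year) are left out.
    static final String[] NAMES = {"minute", "hour", "day"};
    static final long[] MILLIS = {60_000L, 3_600_000L, 86_400_000L};

    // Key "resolution@bucketStart" -> number of events in that bucket.
    private final Map<String, Long> facts = new HashMap<>();

    // Applies one event with event time te (epoch millis) to every resolution.
    // Returns "I" (insert) or "U" (update) per resolution, finest first.
    List<String> apply(long te) {
        List<String> ops = new ArrayList<>();
        for (int i = 0; i < NAMES.length; i++) {
            long bucketStart = te - Math.floorMod(te, MILLIS[i]);
            String key = NAMES[i] + "@" + bucketStart;
            Long previous = facts.put(key, facts.getOrDefault(key, 0L) + 1);
            ops.add(previous == null ? "I" : "U");
        }
        return ops;
    }

    // Current event count of a bucket, 0 if no fact exists yet.
    long count(String resolution, long bucketStart) {
        return facts.getOrDefault(resolution + "@" + bucketStart, 0L);
    }
}
```

An event in a new minute of an already known hour and day inserts the minute fact and updates the coarser ones. Because the bucket is derived from the event time te, a late event lands in the bucket it belongs to, no matter when it is processed.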
LESSONS LEARNED
Image: http://hochmeister-alpin.at
BEST-OF-BREED INSTEAD OF COMMODITY SOLUTIONS
ETL
Analytics
Realtime Analytics
Slice & Dice
Data Exploration
Polyglot Processing
http://datadventures.ghost.io/2014/07/06/polyglot-processing
POLYGLOT ANALYTICS
Data Lake Analytics Warehouse
SQL lane
R lane
Time series lane
Reporting, Data Exploration, Data Science
NO RETENTION PARANOIA
Data Lake
Analytics Warehouse
▪ Eternal memory
▪ Close to raw events
▪ Allows replays and refills into warehouse
Aggressive forgetting with clearly defined retention policy per aggregation level like:
▪ 15min:30d
▪ 1h:4m
▪ …
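A retention policy per aggregation level could be expressed as a simple lookup from level to retention period. In this sketch, `15min:30d` is read as "keep 15-minute aggregates for 30 days" and `1h:4m` as "keep hourly aggregates for 4 months"; that reading, the `RetentionPolicy` class and the use of UTC are assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.Period;
import java.time.ZoneOffset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical retention policy: aggregation level -> how long its facts are kept.
final class RetentionPolicy {
    private final Map<Duration, Period> keepFor = new LinkedHashMap<>();

    // Registers a retention period for one aggregation level.
    RetentionPolicy keep(Duration level, Period retention) {
        keepFor.put(level, retention);
        return this;
    }

    // True if a fact of the given level whose bucket starts at bucketStart
    // may be deleted at time now. Levels without a policy never expire.
    boolean isExpired(Duration level, Instant bucketStart, Instant now) {
        Period retention = keepFor.get(level);
        if (retention == null) {
            return false;
        }
        Instant cutoff = now.atZone(ZoneOffset.UTC).minus(retention).toInstant();
        return bucketStart.isBefore(cutoff);
    }
}
```

A policy matching the two examples on the slide would be built as `new RetentionPolicy().keep(Duration.ofMinutes(15), Period.ofDays(30)).keep(Duration.ofHours(1), Period.ofMonths(4))`. Deleting expired facts is safe in this architecture because the data lake keeps the raw events and allows a refill.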
Events
Strange Events
BEWARE OF THE HIPSTERS
Image: h&m
ENSURE YOUR SOFTWARE RUNS LOCALLY
The entire architecture must be able to run locally. Keep the round trips low for development and testing.
Throughput and reaction times need to be monitored continuously. Tune your software and the underlying frameworks as needed.
TUNE CONTINUOUSLY
Ingestion
Raw Event Stream
Collection Event Data Lake Processing Analytics Warehouse
Fact Entries
SQL Interface
Atomic Event Frames
Load generator
Throughput & latency probes
System, container and process monitoring
IN NUMBERS
Overall dev effort until the first release: 250 person days
Dimensions: 10
KPIs: 26
Integrated 3rd party systems: 7
Inbound data volume per day: 80 GB
New data in DWH per day: 2 GB
Total price of cheapest cluster which is able to handle production load:
BONUS SLIDES
CALCULATING UNIQUE USERS
■ We need an exact unique user count.
■ If you can, you should use an approximation such as HyperLogLog.
[Figure: users U1, U2, U3, U1, U4 over time (axes "Users" and "Time"); counts shown: 3 UU, 2 UU, 4 UU.]
Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm". AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.
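The counts on this slide (3 UU, 2 UU, 4 UU) illustrate why exact unique users are a stateful aggregation: window counts cannot be added up, because a user may appear in more than one window. The sketch below assumes that the first window contains U1, U2, U3 and the second U1, U4. The `UniqueUsers` helper is hypothetical and shows the exact approach, not HyperLogLog.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for exact unique-user counts.
final class UniqueUsers {
    // The unique users of one window are the set of user ids seen in it.
    static Set<String> window(String... userIds) {
        return new HashSet<>(Arrays.asList(userIds));
    }

    // Exact unique users across windows: size of the union, not the sum of the sizes.
    static int countAcross(List<Set<String>> windows) {
        Set<String> union = new HashSet<>();
        for (Set<String> w : windows) {
            union.addAll(w);
        }
        return union.size();
    }
}
```

With the windows `window("U1", "U2", "U3")` and `window("U1", "U4")`, the sizes are 3 and 2, while `countAcross` yields 4 because U1 is counted once. The price of exactness is that the user-id sets themselves must be stored and merged. HyperLogLog replaces them with a small fixed-size sketch, with a standard error of about 1.04/√m for m registers.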
CHARTING TECHNOLOGY
https://github.com/qaware/big-data-landscape
CHOOSING WHERE TO AGGREGATE
Ingestion Event Data Lake Processing Analytics Warehouse
Fact Entries
Analytics
Atomic Event Frames
1:
- Enrichment
- Preprocessing
- Validation
2: The heavy lifting.
3:
- Processing steps that can be done at query time.
- Interactive queries.