Using machine learning to determine drivers of bounce and conversion (part 2)
Velocity 2016 New York
Pat Meenan @patmeenan
Tammy Everts @tameverts
What we did (and why we did it)
Get the code
https://github.com/WPO-Foundation/beacon-ml
Deep learning
weights
Random forest
Lots of random decision trees
Vectorizing the data
• Everything needs to be numeric
• Strings converted to several inputs as yes/no (1/0)
• e.g. Device manufacturer
• “Apple” would be a discrete input
• Watch out for input explosion (UA String)
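A minimal sketch of the yes/no conversion described above, using only the standard library. The `one_hot` helper and the sample manufacturer values are made up for illustration; they are not taken from the beacon-ml code.

```python
def one_hot(values):
    """Turn a list of strings into 1/0 columns, one column per distinct value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical sample of a "device manufacturer" field.
manufacturers = ["Apple", "Samsung", "Apple", "LG"]
columns, encoded = one_hot(manufacturers)

print(columns)
for row in encoded:
    print(row)
```

Each distinct string becomes its own input, which is why a field with many unique values, such as a raw UA string, can blow up the number of inputs.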
Balancing the data
• 3% conversion rate
• 97% accurate by always guessing no
• Subsample the data for 50/50 mix
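One way the 50/50 subsampling could look, as a sketch on synthetic data. The `balance` helper, the fixed seed, and the 100-session data set are assumptions for illustration only.

```python
import random


def balance(rows, labels, seed=0):
    """Subsample the larger class so both classes end up with equal counts."""
    rng = random.Random(seed)
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    size = min(len(positives), len(negatives))
    keep = rng.sample(positives, size) + rng.sample(negatives, size)
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Synthetic data: 3 conversions in 100 sessions (a 3% conversion rate).
labels = [1] * 3 + [0] * 97
rows = [[i] for i in range(100)]

# Always guessing "no" on the raw data scores well without learning anything.
baseline_accuracy = labels.count(0) / len(labels)

balanced_rows, balanced_labels = balance(rows, labels)
print(baseline_accuracy, len(balanced_labels), sum(balanced_labels))
```

After balancing, always guessing "no" is only right half the time, so any accuracy above 50% has to come from the inputs.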
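Smoothing the data
ML works best on normally distributed data
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training set only, then reuse that scaling for validation
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)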
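For readers who want to see what the scaling step computes, here is a standard-library sketch of the same idea: subtract the training mean and divide by the training standard deviation. This rescales each input to zero mean and unit variance; it does not change the shape of the distribution. The helper names and sample load times are invented for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and (population) standard deviation from training data."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Apply a previously learned scaling to any column of values."""
    return [(value - mu) / sigma for value in column]


# Hypothetical load times in seconds.
train_load_times = [1.0, 2.0, 3.0, 4.0, 5.0]
val_load_times = [3.0, 6.0]

mu, sigma = fit_scaler(train_load_times)
scaled_train = transform(train_load_times, mu, sigma)
# The validation data reuses the training statistics; it is never re-fitted.
scaled_val = transform(val_load_times, mu, sigma)

print(scaled_train)
print(scaled_val)
```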
Validation data
• Train on 80% of the data
• Validate on 20% to prevent overfitting
– Training accuracy is measured on the validation set
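A rough sketch of an 80/20 split with the standard library. The `split_80_20` helper, the seed, and the toy rows are assumptions; the original project may split its data differently.

```python
import random


def split_80_20(rows, labels, seed=0):
    """Shuffle once, then hold out the last 20% of rows for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * 0.8)
    train_idx, val_idx = indices[:cut], indices[cut:]
    x_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_val = [rows[i] for i in val_idx]
    y_val = [labels[i] for i in val_idx]
    return x_train, y_train, x_val, y_val


# Toy data: 10 sessions with one input each and alternating labels.
rows = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]

x_train, y_train, x_val, y_val = split_80_20(rows, labels)
print(len(x_train), len(x_val))
```

Because the held-out rows are never used for fitting, accuracy on them shows whether the model generalizes or has simply memorized the training set.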
Input/output relationships
• SSL highly correlated with conversions
• Long sessions highly correlated with not bouncing
• Remove correlated features from training
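One possible way to spot inputs that are close to a proxy for the outcome is to compute the Pearson correlation between each input and the label and flag the strong ones. This is a sketch only: the feature names, the toy sessions, and the 0.8 threshold are all invented, and the slides do not say how the correlated features were actually identified.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical toy sessions: 1 = bounced, 0 = did not bounce.
bounced = [1, 1, 1, 0, 0, 0]
features = {
    "session_length_pages": [1, 1, 1, 4, 6, 3],
    "dom_ready_ms": [900, 2500, 1200, 2600, 1000, 1900],
}

THRESHOLD = 0.8  # arbitrary cut-off chosen for this example
flagged = [name for name, column in features.items()
           if abs(pearson(column, bounced)) > THRESHOLD]
print(flagged)
```

An input like session length is nearly a restatement of "did not bounce", so leaving it in would let the model score well without saying anything about performance.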
Training random forest
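from sklearn.ensemble import RandomForestClassifier
# max_features='sqrt' behaves like the older 'auto' setting for classifiers,
# which recent scikit-learn releases no longer accept
clf = RandomForestClassifier(n_estimators=FOREST_SIZE,
                             criterion='gini',
                             max_depth=None,
                             min_samples_split=2,
                             min_samples_leaf=1,
                             min_weight_fraction_leaf=0.0,
                             max_features='sqrt',
                             max_leaf_nodes=None,
                             bootstrap=True,
                             oob_score=False,
                             n_jobs=12,
                             random_state=None,
                             verbose=2,
                             warm_start=False,
                             class_weight=None)
clf.fit(x_train, y_train)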
Feature importances
clf.feature_importances_
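`clf.feature_importances_` is an array with one value per input, in the same order as the training columns, so it has to be paired with the column names to be readable. A small sketch of that pairing follows; the names and importance values are placeholders, not results from the talk.

```python
def rank_features(names, importances, top=3):
    """Pair each input name with its importance and sort, largest first."""
    ranked = sorted(zip(names, importances),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top]


# Placeholder names and values; with a fitted forest the second argument
# would be clf.feature_importances_, in training-column order.
names = ["input_a", "input_b", "input_c", "input_d"]
importances = [0.10, 0.45, 0.05, 0.40]

for name, weight in rank_features(names, importances):
    print(f"{name}: {weight:.2f}")
```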
Training deep learning
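from keras.models import Sequential
model = Sequential()
model.add(...)
model.compile(optimizer='adagrad',
              loss='binary_crossentropy',
              metrics=["accuracy"])
# Keras 2 and later take epochs=; the Keras 1 release used at the time called it nb_epoch=
model.fit(x_train, y_train,
          epochs=EPOCH_COUNT,
          batch_size=32,
          validation_data=(x_val, y_val),
          verbose=2,
          shuffle=True)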
Understanding deep learning
Brute force FTW
• 93 input “features”
• Train 93 models with 1 input
– Measuring the prediction accuracy of each
• Train 92 models with 2 inputs
– Top feature from first round
– Measure combined prediction accuracy
• Lather, rinse, repeat…
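The rounds above amount to a greedy forward search: each round keeps the winners so far and tries adding every remaining input. A sketch of that loop follows. `toy_score` stands in for "train a model on these inputs and return its validation accuracy", and its weights are invented purely so the example runs; none of this is taken from the beacon-ml code.

```python
def forward_select(feature_names, score, rounds):
    """Greedy search: each round adds the input that gives the best score."""
    selected = []
    remaining = list(feature_names)
    history = []
    for _ in range(rounds):
        # One candidate model per remaining input (93, then 92, and so on).
        best = max(remaining, key=lambda name: score(selected + [name]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return history


# Invented weights; a real run would train and validate a model here.
TOY_WEIGHTS = {"dom_ready": 0.30, "session_load_time": 0.25, "start_render": 0.05}


def toy_score(names):
    return 0.5 + sum(TOY_WEIGHTS[name] for name in names) / 2


history = forward_select(list(TOY_WEIGHTS), toy_score, rounds=2)
for chosen, accuracy in history:
    print(chosen, round(accuracy, 3))
```

The cost is one full training run per candidate per round, which is why the slide calls it brute force.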
Visualizing the model
• Take trained model (X inputs)
• Vary inputs
– 100ms to 20 seconds in 100ms intervals
• Apply the data smoothing from training set
• model.predict_proba
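A sketch of the sweep for a single-input model: generate values from 100ms to 20 seconds in 100ms steps, scale each one with the training statistics, and record the predicted probability. `toy_predict` is a stand-in logistic curve, not a trained model, and the training mean and standard deviation are made-up numbers.

```python
from math import exp


def sweep(predict, mu, sigma, start_ms=100, stop_ms=20000, step_ms=100):
    """Return (raw value, predicted probability) pairs across the range."""
    points = []
    for raw in range(start_ms, stop_ms + 1, step_ms):
        scaled = (raw - mu) / sigma  # same scaling as the training set
        points.append((raw, predict(scaled)))
    return points


def toy_predict(scaled_value):
    """Stand-in for a trained model's probability output."""
    return 1.0 / (1.0 + exp(-scaled_value))


# Made-up training statistics for the input being varied.
TRAIN_MEAN_MS, TRAIN_STD_MS = 3000.0, 2000.0

curve = sweep(toy_predict, TRAIN_MEAN_MS, TRAIN_STD_MS)
print(len(curve), curve[0], curve[-1])
```

Plotting the resulting pairs gives a curve of predicted probability against the raw timing value, which is easier to read than the model's weights.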
What we learned
What’s in our beacon?
• Top-level – domain, timestamp, SSL
• Session – start time, length (in pages), total load time
• User agent – browser, OS, mobile ISP
• Geo – country, city, organization, ISP, network speed
• Bandwidth
• Timers – base, custom, user-defined
• Custom metrics
• HTTP headers
https://docs.soasta.com/whatsinbeacon/
Finding 1
Maybe everything doesn’t matter after all
Bounce rate
Finding 2
DOM ready (aka DOM content loaded) and average session load time were the best indicators of bounce rate
Up to 89.5% accuracy
Finding 3
When it came to getting high predictability, conversion data was tougher than bounce data
81% prediction accuracy was as high as we got
Finding 4
Pages with more scripts were less likely to convert
Finding 5
The number of DOM elements matters…a lot
Finding 6
Mobile-related measurements weren’t meaningful predictors of conversions
Finding 7
Some conventional metrics were not as important as we thought
Feature Importance (bounce)
Start render 69 ~top 3
Things to watch out for
(other than dangling prepositions)
Yep, checkout pages are SLOW
Takeaways
1. YMMV
2. Do try this at home
3. Gather your RUM data (lots of it)
4. Run the machine learning against it
5. If you get unexpected results, keep digging
Thanks!
@patmeenan @tameverts