Enterprise Link Spam Analysis- Estudio34 Presents Ian Lurie In LinkLove2013

Ian Lurie @portentint

Enterprise Link Spam Analysis

MACHINE LEARNING: How hard can it be?

Ian Lurie Portent, Inc @portentint

http://portent.co/machine-spam

yes, this is right, no ‘m’

TIME

NERDOSITY

1: There’s gotta be a better way
2: How hard can it be?
3: Lessons learned

ONE: THESE ARE GOOD LINKS!!!!

try to relax...

Some of these are real

BUILD A REALLY BIG SPREADSHEET.

GET ALL LINKS FROM:
GOOGLE WEBMASTER TOOLS
OPEN SITE EXPLORER
MAJESTIC SEO

URL, TITLE, ANCHOR TEXT, MAJESTIC TRUSTFLOW, MOZ DA AND PA
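The "really big spreadsheet" step can be sketched roughly as follows. This is a minimal illustration, not the workflow used in the talk: the column names mirror the slide, but the row layout, the `merge_link_exports` and `to_csv` helpers, and the sample rows are all assumptions made up for the example.

```python
import csv
import io

# Columns named on the slide; the export layouts below are hypothetical.
COLUMNS = ["url", "title", "anchor_text", "trust_flow", "moz_da", "moz_pa"]


def merge_link_exports(exports):
    """Merge rows from several link sources into one row per linking URL."""
    merged = {}
    for rows in exports:
        for row in rows:
            # Crude normalization (an assumption): lowercase, drop trailing slash.
            url = row["url"].strip().lower().rstrip("/")
            record = merged.setdefault(url, dict.fromkeys(COLUMNS, ""))
            record["url"] = url
            for column in COLUMNS[1:]:
                # Keep the first non-empty value seen for each column.
                if not record[column] and row.get(column):
                    record[column] = row[column]
    return list(merged.values())


def to_csv(records):
    """Serialize merged records as CSV text, ready for a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up sample rows standing in for the three exports.
webmaster_tools = [{"url": "http://freelinkshere.com/page/"}]
open_site_explorer = [{
    "url": "http://freelinkshere.com/page",
    "title": "Free links",
    "anchor_text": "cheap widgets",
    "moz_da": "12",
    "moz_pa": "8",
}]
majestic = [{"url": "http://FreeLinksHere.com/page", "trust_flow": "3"}]

links = merge_link_exports([webmaster_tools, open_site_explorer, majestic])
print(to_csv(links))
```

Keying on a normalized URL is only one way to de-duplicate; real exports would likely need per-tool column mapping before a merge like this works.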

FREELINKSHERE.COM
ARTICLEFUNHOSTING.COM
GETYERGREATARTICLESHERE.COM
NEWJERSEYPRESSRELEASES.COM

NO

EVALUATE URLS

BAD GRAMMAR
JUST STUPID
MAKES YOU ITCH

NO

EVALUATE TITLES

CONTACT DISAVOW

So cute. so harmless.

A FILTRATION PROBLEM

500 LINKS: THE LOSERS ARE OBVIOUS

500,000 LINKS: YOU GET A MIGRAINE

A KNOWLEDGE PROBLEM

[Chart: Anchor Text Distribution. Axis from 0 to 100,000. Anchor texts shown: The Lonely Mountain, Pegasus, Little Round Top, Bagshot Row, Galactica, Broad Street Pump, John Snow, The Devil's Den, Mithril Shirt, Grunter von Agony, Sufferlandria, Alp d'Huez, Col Von Miseryburgen]
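An anchor text distribution like the one charted above can be tallied with a simple counter. The sketch below is an assumption about how such a tally might be computed, not the tool behind the slide; the `anchor_text_distribution` helper and the sample list are hypothetical.

```python
from collections import Counter


def anchor_text_distribution(anchor_texts):
    """Return (anchor, count, share) tuples, most common first."""
    # Normalize case and whitespace so near-identical anchors are pooled.
    normalized = [text.strip().lower() for text in anchor_texts if text.strip()]
    counts = Counter(normalized)
    total = sum(counts.values())
    return [(anchor, count, count / total) for anchor, count in counts.most_common()]


# Made-up sample reusing anchor texts from the chart.
sample = [
    "The Lonely Mountain",
    "the lonely mountain",
    "Pegasus",
    "The Lonely Mountain ",
    "Bagshot Row",
]
for anchor, count, share in anchor_text_distribution(sample):
    print(f"{anchor:<22} {count:>3} {share:6.1%}")
```

A profile where one commercial anchor dominates the list is the kind of pattern this view makes visible at a glance.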

BUT THESE ARE GOOD LINKS!!!!

WE CAN’T READ GOOGLE’S MIND

MACHINE LEARNING

TWO: HOW HARD CAN IT BE?

MACHINE LEARNING IN 60 SECONDS

START WITH A QUESTION

IS THIS PAGE SPAM?

THE ANSWER IS A CLASSIFICATION

TRAINING SET + ALGORITHM = CLASSIFICATION

TRAINING SET + ALGORITHM = PREDICTION

TEXT-BASED?

SUPERVISED? UNSUPERVISED?

CORRECT? OR MORONIC?

CLASSIFICATION = QUESTION

CLASSIFICATION = SPAM? TRUE OR FALSE

TRAINING SET 1: JUST WORDS

A home cooking blog featuring healthy low-glycemic recipes with step-by-step photos, as well as cooking tips, vegetable gardening, and products Kalyn loves. Mouse Trap (originally titled Mouse Trap Game) is a board game first published by Ideal in 1963 for two or more players. Over the course of the game, players at first. A delicious and refreshing cherry pie and a story about making friends of enemies :). Brady Bunch Punch drink recipe made with Amaretto,Cranberry juice,Orange juice,Triple Sec,Vodka,. How to make a Brady Bunch Punch with all the instructions and. A blog about a foreigner's life in Japan, on a mountainside above Lake Biwa. SEE the world's greatest collection of tattoo designs! Sample FREE Downloads! Cutting Edge Art by Famous Tattoo Artists! YOUR TATTOO DESIGN IS HERE!.

TRAINING SET + BAYESIAN

FAIL
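For readers unfamiliar with the words-only Bayesian attempt that the slides mark as a failure, here is a minimal sketch of the general technique: a multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. It is not the speaker's code (the deck lists nltk and scikit-learn as the actual tools), and the training snippets and labels are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {label: Counter() for label in set(labels)}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = set()
        for counts in self.word_counts.values():
            self.vocabulary.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods per word.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocabulary))
                )
            scores[label] = score
        return max(scores, key=scores.get)


# Invented training snippets and labels, for illustration only.
texts = [
    "free links here get your free links now",
    "best free article hosting submit articles free",
    "a cooking blog with healthy recipes and photos",
    "a blog about life on a mountainside in japan",
]
labels = ["spam", "spam", "not spam", "not spam"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("get free links"))
print(model.predict("healthy recipes blog"))
```

A classifier like this sees only vocabulary, so it knows nothing about readability or link metrics, which is what the next training set adds.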

TRAINING SET 2: WORDS INTO NUMBERS

TRAINING SET + LOGISTIC REGRESSION

WIN

who da nerd?!!!

WHO DA NERD?!!!!!

logistic regression

python
nltk
scikit-learn
mongodb

Flesch-Kincaid (FK)
FK grade level
FK reading ease
word count
sentence count
syllable count
links/word
MajesticSEO
Page TrustFlow
Domain TrustFlow
Unique c-blocks
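Several of the features listed above can be computed from page text alone. The sketch below uses the standard Flesch reading ease and Flesch-Kincaid grade level formulas; the vowel-group syllable counter is a rough heuristic, and the function names, the `link_count` parameter, and the treatment of a c-block as the first three octets of an IPv4 address are assumptions for this example rather than details taken from the talk. TrustFlow values come from MajesticSEO and are not computed here.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_features(text, link_count=0):
    """Turn page text into numbers; link_count is supplied by the caller."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Guard against division by zero on empty input.
    sentence_count = max(1, len(sentences))
    word_count = max(1, len(words))
    syllable_count = sum(count_syllables(w) for w in words)
    words_per_sentence = word_count / sentence_count
    syllables_per_word = syllable_count / word_count
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "syllable_count": syllable_count,
        "fk_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "fk_grade_level": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
        "links_per_word": link_count / word_count,
    }


def unique_c_blocks(ip_addresses):
    """Count distinct first-three-octet prefixes among IPv4 addresses."""
    return len({".".join(ip.split(".")[:3]) for ip in ip_addresses})


features = text_features("The cat sat. The dog ran!", link_count=3)
for name, value in features.items():
    print(f"{name}: {value}")
print(unique_c_blocks(["192.0.2.10", "192.0.2.77", "198.51.100.5"]))
```

Each page then becomes one row of numbers, which is the form a logistic regression needs.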

THE TRAINING SET

seogadget.co.uk

Is seogadget.co.uk spam?

true = 1.93%
false = 98.07%
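A true/false probability pair like the one above is the natural output of a logistic regression. The following is a minimal standard-library sketch of that technique, trained by batch gradient descent; it is not the scikit-learn model from the talk. The two features (links per word and TrustFlow scaled to 0..1), the training rows, the labels, and the hyperparameters are all made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, clamped to avoid overflow in math.exp."""
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def spam_probability(features, weights, bias):
    """Probability that a feature row belongs to the spam class."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_logistic_regression(rows, labels, learning_rate=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(rows, labels):
            error = spam_probability(features, weights, bias) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [
            w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


# Hypothetical rows: [links per word, TrustFlow / 100]; label 1 means spam.
rows = [
    [0.30, 0.05],
    [0.25, 0.10],
    [0.20, 0.02],
    [0.02, 0.60],
    [0.01, 0.75],
    [0.03, 0.50],
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(rows, labels)
p = spam_probability([0.02, 0.70], weights, bias)
print(f"true = {p:.2%}")
print(f"false = {1 - p:.2%}")
```

Reporting a probability rather than a hard yes/no leaves the cutoff decision to whoever reviews the links.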

THREE: LESSONS LEARNED

ABOUT GOOGLE

HREF=“HTTP://GETYERLINKSHEREFREE.COM”

IT’S ABOUT LINKS. NOT PAGES.

TRUST FLOW 71, DA 97, PA 47

LESSON 1: THERE IS NO SPAM

HOW LIKELY IS IT THAT THIS LINK, FROM THIS PAGE, IN THE CONTEXT OF ALL OTHER LINKS TO THIS SITE, MIGHT SEEM SPAM-LIKE?

[Chart labels: TRUSTWORTHY, MANIPULATIVE, USEFUL, NOPE. Sites shown: CNN.COM, GODX.NET, INTERFLORA, DAILY SQUEE]

LINKS FROM EDU SITES 45

LESSON 2: DECLINING SPAM TOLERANCE

[Chart: Percent spam links, axis from 0% to 90%]
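The "percent spam links" figure for a link profile is simple arithmetic once each link has a spam probability. The sketch below assumes a hypothetical 0.5 cutoff and made-up probabilities; the slides do not state what threshold, if any, was used.

```python
def percent_spam_links(spam_probabilities, threshold=0.5):
    """Percentage of links whose spam probability meets the threshold."""
    if not spam_probabilities:
        return 0.0
    flagged = sum(1 for p in spam_probabilities if p >= threshold)
    return 100.0 * flagged / len(spam_probabilities)


# Made-up per-link probabilities for one site's link profile.
profile = [0.9, 0.2, 0.7, 0.1]
print(f"{percent_spam_links(profile):.0f}% spam links")
```

Tracking this number for the same site over time is one way to see how much margin is left as tolerance declines.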

THIS WAS SPAM IN APRIL

THIS MAY BE SPAM NOW

http://www.cs.hiram.edu/~oliphantlt/cpsc171/links.html

GOOGLE’S GETTING GRUMPIER

CLEAN UP. NOW.

ABOUT MACHINE LEARNING

NEED A BIGGER TRAINING SET

TOO BIG.

USE WORDS, TOO

SPECIALIZE BY VERTICAL

ian wtf!!!!

MACHINE LEARNING IS GROWING

MACHINE LEARNING IS BECOMING EASIER

GOOGLE PREDICTION API
BIGML
EXCEL DATASCOPE
….?

How hard can it be?

EASY TO UNDERSTAND

How hard can it be?

SO UNDERSTAND IT

http://portent.co/machine-spam

yes, this is right, no ‘m’

THE END.

Ian Lurie Portent, Inc @portentint