Building an Identity Extraction Engine
![Page 1: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/1.jpg)
Identity Data Mining
Building an Identity Data Mining Engine in PHP
Jonathan LeBlanc (@jcleblanc)
![Page 2: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/2.jpg)
Premise
You can determine the personality profile of a person based on their browsing habits
![Page 3: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/3.jpg)
Technology was the Solution!
![Page 4: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/4.jpg)
Then I Read This…
Us & Them
The Science of Identity
By David Berreby
![Page 5: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/5.jpg)
The Different States of Knowledge
What a person knows
What a person knows they don’t know
What a person doesn’t know they don’t know
![Page 6: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/6.jpg)
Technology was NOT the Solution
Identity and discovery are
NOT a technology solution
![Page 7: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/7.jpg)
Our Subject Material
![Page 8: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/8.jpg)
Our Subject Material
HTML content is unstructured
There are some pretty bad web practices on the interwebz
You can’t trust that anything semantically valid will be present
![Page 9: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/9.jpg)
How We’ll Capture This Data
Start with base linguistics
Extend with available extras
![Page 10: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/10.jpg)
The Components
![Page 11: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/11.jpg)
The Basic Pieces
Page Data: Scrapey Scrapey
Keywords: Without all the fluff
Weighting: Word diets FTW
![Page 12: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/12.jpg)
Capture Raw Page Data
Semantic data on the web is sucktastic
Assume 5 year olds built the sites
Language is the key
![Page 13: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/13.jpg)
Extract Keywords
We now have a big jumble of words. Let’s extract
Why is “and” a top word? Stop words = sad panda
![Page 14: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/14.jpg)
Weight Keywords
All content is not created equal
Meta and headers and semantics oh my!
This is where we leech off the work of others
![Page 15: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/15.jpg)
Simple Extraction Engine
![Page 16: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/16.jpg)
Questions to Keep in Mind
Should I use regex to parse web content?
How do users interact with page content?
What key identifiers can be monitored to detect interest?
![Page 17: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/17.jpg)
Fetching the Data: The Request
The Simple Way
$html = file_get_contents('URL');
The Controlled Way
$c = curl_init('URL');
![Page 18: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/18.jpg)
Fetching the Data: cURL
$req = curl_init($url);
//$header: true to include the response headers in the returned output
$options = array(
    CURLOPT_URL => $url,
    CURLOPT_HEADER => $header,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_AUTOREFERER => true,
    CURLOPT_TIMEOUT => 15,
    CURLOPT_MAXREDIRS => 10
);
curl_setopt_array($req, $options);
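The slide stops after setting the options. A minimal sketch of the remaining step, executing the request and checking for failure, might look like the following. The function name `fetch_page()` and the choice to return `null` on failure are assumptions for illustration, not part of the original deck.

```php
<?php
// Hypothetical helper (not from the slides): fetch a page body with cURL.
// Returns the response body as a string, or null if the transfer failed.
function fetch_page(string $url, int $timeout = 15): ?string
{
    $req = curl_init();
    curl_setopt_array($req, array(
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => false,   // body only, no response headers
        CURLOPT_RETURNTRANSFER => true,    // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_MAXREDIRS      => 10,
    ));

    // curl_exec() returns false on failure when RETURNTRANSFER is set
    $body = curl_exec($req);
    if ($body === false) {
        return null;
    }
    return $body;
}

// Example call (assumed URL): $page_content = fetch_page('https://example.com/');
```

Returning `null` rather than `false` lets the caller separate "no page" from an empty page with a simple `=== null` check before the content is handed to the cleanup step.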
![Page 19: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/19.jpg)
//list of findable / replaceable string characters
$find = array('/\r/', '/\n/', '/\s\s+/');
$replace = array(' ', ' ', ' ');
//perform page content modification: drop script and style blocks first
$mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
$mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
//strip remaining tags, normalize case and whitespace, then split into words
$mod_content = strip_tags($mod_content);
$mod_content = strtolower($mod_content);
$mod_content = preg_replace($find, $replace, $mod_content);
$mod_content = trim($mod_content);
$mod_content = explode(' ', $mod_content);
natcasesort($mod_content);
![Page 20: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/20.jpg)
//set up list of stop words and the final found stopped list
$common_words = array('a', ..., 'zero');
$searched_words = array();
//extract list of keywords with number of occurrences
foreach($mod_content as $word) {
    $word = trim($word);
    if(strlen($word) > 2 && !in_array($word, $common_words)){
        //initialize the counter on first sight to avoid an undefined index warning
        $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
    }
}
//sort by occurrence count, highest first
arsort($searched_words, SORT_NUMERIC);
![Page 21: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/21.jpg)
Scraping Site Meta Data
//load scraped page data as a valid DOM document
$dom = new DOMDocument();
@$dom->loadHTML($page_content);
//scrape title (the page may not have one)
$titles = $dom->getElementsByTagName("title");
$title = $titles->length > 0 ? $titles->item(0)->nodeValue : '';
![Page 22: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/22.jpg)
//loop through all found meta tags
$metas = $dom->getElementsByTagName("meta");
for ($i = 0; $i < $metas->length; $i++){
    $meta = $metas->item($i);
    if($meta->getAttribute("property")){
        //Open Graph description
        if ($meta->getAttribute("property") == "og:description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        }
    } else {
        //standard description and keywords meta tags
        if($meta->getAttribute("name") == "description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        } else if($meta->getAttribute("name") == "keywords"){
            $dataReturn["keywords"] = $meta->getAttribute("content");
        }
    }
}
![Page 23: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/23.jpg)
Extending the Engine
![Page 24: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/24.jpg)
Weighting Important Data
Tags you should care about: meta (include OG), title, description, h1+, header
Bonus points for adding in content location modifiers
![Page 25: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/25.jpg)
Weighting Important Tags
//our keyword weights
$weights = array("keywords" => 3.0, "meta" => 2.0, "header1" => 1.5, "header2" => 1.2);
//add modifier here
if(strlen($word) > 2 && !in_array($word, $common_words)){
    $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
}
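The slide leaves the "add modifier here" step open. One possible way to fill it in, sketched below, is to add the weight of the source tag to a word's score instead of incrementing by one. The helper `add_weighted_words()`, the `body` weight of 1.0, and the short stop-word list are assumptions made for this example.

```php
<?php
// Weights per content source; "body" => 1.0 is an assumed baseline, not from the slides.
$weights = array('keywords' => 3.0, 'meta' => 2.0, 'header1' => 1.5, 'header2' => 1.2, 'body' => 1.0);

// Hypothetical helper: add $weight to the score of every non-stop word in $text.
function add_weighted_words(array $scores, string $text, float $weight, array $stop_words): array
{
    // split on anything that is not a lowercase letter or digit
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
            $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
        }
    }
    return $scores;
}

// Illustrative stop words and page fragments
$stop_words = array('and', 'the', 'for');
$scores = array();
$scores = add_weighted_words($scores, 'PHP data mining', $weights['keywords'], $stop_words);
$scores = add_weighted_words($scores, 'Mining identity data for the web', $weights['body'], $stop_words);

// highest weighted score first
arsort($scores, SORT_NUMERIC);
```

With these inputs, a word found in both the keywords meta tag and the body ("data", "mining") outranks a word found in only one of them.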
![Page 26: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/26.jpg)
Expanding to Phrases
2-3 adjacent words, making up a direct relevant callout
Seems easy right? Just like single words
Language gets wonky without stop words
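A rough sketch of phrase extraction, assuming the word list still contains its stop words: count every run of 2 to 3 adjacent words, but skip runs that begin or end on a stop word so that fragments like "state of" are not counted. The function name `extract_phrases()` and the boundary rule are illustrative choices, not the approach from the talk.

```php
<?php
// Hypothetical helper: count 2-3 word phrases in an ordered list of words.
function extract_phrases(array $words, array $stop_words, int $min = 2, int $max = 3): array
{
    $words = array_values($words);   // make sure the list is indexed 0..n-1
    $total = count($words);
    $counts = array();

    for ($i = 0; $i < $total; $i++) {
        for ($n = $min; $n <= $max && $i + $n <= $total; $n++) {
            $gram = array_slice($words, $i, $n);
            // a phrase may contain a stop word in the middle, but not at either edge
            if (in_array($gram[0], $stop_words, true) || in_array($gram[$n - 1], $stop_words, true)) {
                continue;
            }
            $phrase = implode(' ', $gram);
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    // most frequent phrases first
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

$words = explode(' ', 'state of the art data mining and data mining tools');
$phrases = extract_phrases($words, array('of', 'the', 'and'));
```

Keeping stop words inside a phrase is what lets a callout such as "mining and data" survive; removing them before building phrases would glue unrelated words together.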
![Page 27: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/27.jpg)
Working with Unknown Users
The majority of users won’t be immediately targetable
Use HTML5 LocalStorage & Cookie backup
![Page 28: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/28.jpg)
Adding in Time Interactions
Interaction with a site does not necessarily mean interest in it
Time needs to also include an interaction component
Gift buying seasons see interest variations
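One way to read the first two points above is that time on a page should only count when the user also did something on it. The sketch below is an assumption-heavy illustration of that idea: the function `engagement_score()`, the 300 second cap, and the cap of 10 interactions are all invented for the example, and seasonal variation is not modeled.

```php
<?php
// Hypothetical scoring function: combine time on page with an interaction count.
// Returns a value between 0.0 (no evidence of interest) and 1.0 (strong evidence).
function engagement_score(int $seconds_on_page, int $interactions, int $max_seconds = 300): float
{
    // an idle tab: time alone is not treated as interest
    if ($interactions === 0) {
        return 0.0;
    }
    // cap both parts so one very long visit cannot dominate
    $time_part = min($seconds_on_page, $max_seconds) / $max_seconds;
    $interaction_part = min($interactions, 10) / 10;

    return $time_part * $interaction_part;
}

$idle   = engagement_score(600, 0);   // long visit, no clicks or scrolling
$casual = engagement_score(150, 5);
$active = engagement_score(300, 10);
```

The resulting score could then be used as an extra multiplier on the keyword weights for that page.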
![Page 29: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/29.jpg)
Grouping Using Commonality
Interests: User A
Interests: User B
Common Interests
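The overlap in the diagram can be sketched as a key intersection of two users' keyword scores. The function `common_interests()`, the sample profiles, and the choice to keep the lower of the two scores for each shared keyword are assumptions made for illustration.

```php
<?php
// Hypothetical helper: keywords present in both profiles, scored by the weaker of the two.
function common_interests(array $a, array $b): array
{
    $common = array();
    // array_intersect_key() keeps the entries of $a whose keys also exist in $b
    foreach (array_intersect_key($a, $b) as $word => $score) {
        $common[$word] = min($score, $b[$word]);
    }
    // strongest shared interest first
    arsort($common, SORT_NUMERIC);
    return $common;
}

// Illustrative keyword => weighted score profiles
$user_a = array('php' => 4.0, 'hiking' => 2.0, 'coffee' => 1.0);
$user_b = array('php' => 3.0, 'coffee' => 5.0, 'chess' => 2.5);

$shared = common_interests($user_a, $user_b);
```

Taking the minimum means a keyword only ranks high in the common set when both users show strong interest in it.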
![Page 30: Building an Identity Extraction Engine](https://reader034.fdocuments.in/reader034/viewer/2022051819/54c8442c4a79591e158b4573/html5/thumbnails/30.jpg)
Thank You!
Questions?
www.slideshare.com/jcleblanc