Big Data and Data Standardization at LinkedIn
Reading the Tea Leaves:
Big Data at LinkedIn
Alexis Baird, Product Manager, LinkedIn
What is LinkedIn?
§ LinkedIn’s mission: “Connect the world’s professionals to make them more productive and successful”
§ The site officially launched on May 5, 2003
§ Now has >187 million members worldwide
§ LinkedIn has >3,000 employees in offices all around the world
§ Headquartered in Mountain View, CA
§ Three different lines of revenue:
– Subscriptions
– Talent Solutions
– Marketing Solutions
Who am I?
The Age of Big Data
Big Data at LinkedIn
§ 187+ million members from >200 countries
§ Each month, 52 million members come to the site, generating ~2 billion page views:
– Performing searches
– Connecting with other members
– Editing their profile
– Sharing, commenting on, or liking news articles
– Participating in group discussions
– And much more…
Big Data Challenges
§ Storage and processing constraints
§ Noisy signal
– Variation
– People are not always rational or consistent
Data Messiness
§ Job titles: “programmer”, “software developer”, “engineer”, “coding ninja”
§ Schools: “Connecticut College”, “Conn College”, “Conn”, “CC”, “Conn College (NOT Uconn)”
§ Companies: “Microsoft”, “MSFT”, “Bing”, “Microsoft/Bing”, “Microsoft-Mountain View”
Data Standardization
§ Take an input (usually a user-entered string) and turn it into a meaningful abstract id
“Microsoft”, “MSFT”, “Bing”, “Microsoft/Bing”, “Microsoft-Mountain View”
=> Company_id = 1035 (“Microsoft Corporation”)
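The mapping on this slide can be sketched as a lookup from normalized strings to an ID. This is only a minimal illustration: the five spellings and the ID 1035 come from the slide, while the table layout, the `normalize` helper, and the function names are hypothetical and not LinkedIn's actual system.

```python
import re

# Hypothetical alias table. Only the five spellings and the ID 1035
# ("Microsoft Corporation") are taken from the slide.
COMPANY_ALIASES = {
    "microsoft": 1035,
    "msft": 1035,
    "bing": 1035,
    "microsoft/bing": 1035,
    "microsoft-mountain view": 1035,
}
COMPANY_NAMES = {1035: "Microsoft Corporation"}


def normalize(raw: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share one key."""
    return re.sub(r"\s+", " ", raw.strip().lower())


def standardize_company(raw: str):
    """Return the company ID for a user-entered string, or None if unknown."""
    return COMPANY_ALIASES.get(normalize(raw))
```

For example, `standardize_company("  MSFT ")` resolves to the same ID as `standardize_company("Microsoft/Bing")`, while an unseen string returns `None`. A fixed table cannot cover every spelling users invent, which is why the deck goes on to describe learned approaches.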
Why is this important?
Search
Structured data > Unstructured data
P(“linkedin” = company_id 1337) = .87
P(“ceo” = title_id 238) = .92
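One way to read these probabilities is as a per-token entity tagger for search queries. The sketch below assumes a hypothetical lookup table and an arbitrary 0.5 cutoff. Only the two entries and their probabilities (.87 and .92) come from the slide.

```python
# Hypothetical lookup: token -> list of (entity_type, entity_id, probability).
# Only these two entries and their probabilities appear on the slide.
INTERPRETATIONS = {
    "linkedin": [("company_id", 1337, 0.87)],
    "ceo": [("title_id", 238, 0.92)],
}


def tag_query(query, threshold=0.5):
    """Tag each token with its most probable entity, or None if nothing qualifies.

    The threshold is an assumed cutoff, not a value from the talk.
    """
    tagged = []
    for token in query.lower().split():
        candidates = INTERPRETATIONS.get(token, [])
        best = max(candidates, key=lambda c: c[2], default=None)
        if best is not None and best[2] < threshold:
            best = None
        tagged.append((token, best))
    return tagged
```

With the tokens tagged, a query such as `"linkedin ceo"` can be run against structured company and title fields instead of free text.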
Recommendations
Recommendation products at LinkedIn
Similar Profiles
Events You May Be Interested In
News
Network updates
Connections
LinkedIn’s recommender ecosystem
Recommendations drive:
§ >50% of connections
§ >50% of job applications
§ >50% of group joins
Jobs You Might Be Interested In
How LinkedIn matches people to jobs
[Diagram: job and candidate fields feeding derived matching features]
§ Job: title, geo, company, industry, description, functional area, …
§ User base (filtered)
§ Candidate:
– General: expertise, specialties, education, headline, geo, experience
– Current position: title, summary, tenure length, industry, functional area, …
§ Corpus stats (derived): transition probabilities, connectivity, yrs of experience to reach title, education needed for this title, …
§ Matching:
– Binary: exact matches (geo, industry, …)
– Soft: transition probabilities, similarity, …
– Text
§ Example feature scores:
– Similarity (candidate expertise, job description): 0.56
– Similarity (candidate specialties, job description): 0.2
– Transition probability (candidate industry, job industry): 0.43
– Title similarity: 0.8
– Similarity (headline, title): 0.7
– …
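The binary and soft features in the diagram can be illustrated with a small sketch. The slide does not say which similarity measure is used, so bag-of-words cosine similarity stands in here as an assumption. The dictionary keys and the `match_features` helper are likewise hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity (an assumed stand-in for 'Similarity')."""
    a, b = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match_features(candidate, job):
    """Build a few binary (exact match) and soft (similarity) features."""
    return {
        "geo_exact": float(candidate["geo"] == job["geo"]),
        "industry_exact": float(candidate["industry"] == job["industry"]),
        "expertise_vs_description": cosine_similarity(
            candidate["expertise"], job["description"]
        ),
        "headline_vs_title": cosine_similarity(candidate["headline"], job["title"]),
    }
```

Each feature lands in the range 0 to 1, like the example scores in the diagram. How the features are weighted into a final match score is not described in the slides.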
Data Standardization: Occupations
§ How do we know a “senior software developer” and a “software developer” are the same occupation?
– Strip a special set of words known to indicate seniority
§ How do we know a “software developer” and a “software engineer” are the same occupation?
– Term similarity
§ How do we know a “programmer” and a “software developer” are the same occupation but a “programmer” and a “program director” are not?
– Need something more complicated
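The first and third questions can be made concrete with a short sketch. The seniority word list is a hypothetical example, and character-trigram Jaccard overlap is only a stand-in for a surface-level string comparison, not the term similarity the talk refers to.

```python
# Hypothetical seniority word list; the talk does not enumerate the real one.
SENIORITY_WORDS = {"senior", "sr", "junior", "jr", "lead", "principal", "staff"}


def strip_seniority(title):
    """Lowercase a title and drop words that only indicate seniority."""
    words = title.lower().replace(".", " ").split()
    return " ".join(w for w in words if w not in SENIORITY_WORDS)


def same_after_stripping(title_a, title_b):
    """True when two titles are identical once seniority words are removed."""
    return strip_seniority(title_a) == strip_seniority(title_b)


def trigrams(text):
    """Set of character trigrams of the lowercased text."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_jaccard(text_a, text_b):
    """Surface-level overlap between two strings, from 0.0 to 1.0."""
    ta, tb = trigrams(text_a), trigrams(text_b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0
```

Stripping handles “senior software developer” versus “software developer”. Surface overlap, however, rates “programmer” closer to “program director” than to “software developer”, which shares no trigrams with it at all. That is the failure the last bullet points at.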
Data Standardization: Occupations
1. Rule-based string clean up:
– ~2 million different titles => 24,000 different “cleaned” titles
– E.g. “Sr software dev” => “senior software developer”
2. Create “virtual profiles” for each title using various extracted and normalized profile features (e.g. skills, degree, field of study, summary, job description, honors)
3. Cluster similar titles
4. Get rid of uninformative titles spread across too many different topics
5. Apply hand QA to tune and name the clusters
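Steps 1 to 3 can be sketched end to end on toy data. Everything below is an assumption made for illustration: the abbreviation rules (apart from the slide's “Sr software dev” example), the use of skills alone as the virtual profile, cosine similarity, the greedy clustering, and the 0.5 threshold. Steps 4 and 5 are not covered.

```python
import math
from collections import Counter, defaultdict

# Hypothetical cleanup rules; only "Sr software dev" =>
# "senior software developer" is an example from the slide.
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "dev": "developer", "eng": "engineer"}


def clean_title(raw):
    """Step 1: lowercase, drop periods, expand known abbreviations."""
    words = raw.lower().replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def build_virtual_profiles(members):
    """Step 2: pool the skills of all members sharing a cleaned title."""
    profiles = defaultdict(Counter)
    for raw_title, skills in members:
        profiles[clean_title(raw_title)].update(skills)
    return dict(profiles)


def cosine(a, b):
    """Cosine similarity between two skill-count bags."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_titles(profiles, threshold=0.5):
    """Step 3: greedy pass; a title joins the first cluster whose seed is similar."""
    clusters = []  # list of (seed_profile, [titles])
    for title in sorted(profiles):
        for seed, titles in clusters:
            if cosine(seed, profiles[title]) >= threshold:
                titles.append(title)
                break
        else:
            clusters.append((profiles[title], [title]))
    return [titles for _, titles in clusters]
```

Because clustering compares what people with a title actually list on their profiles, “programmer” can end up with “senior software developer” while “program director” stays apart, even though the strings suggest otherwise.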
Lessons learned
§ Know your machine learning!
§ Know your success metric!
§ Need to allow for ambiguity within a given title:
– “Head of production”
– DDS
§ Some titles are not standardizable:
Takeaways
§ The more information you give, the better your standardization will be
§ Why do you want LinkedIn to do a good job standardizing the data on your profile?
– Better recommendations: news, jobs, groups, connections, etc.
– Recruiters can find you more easily
– Potential connections can find you
[Chart: LinkedIn Members (Millions), 2004 to 2011; plotted values: 2, 4, 8, 17, 32, 55, 90]
§ 175M+
§ 25th most visited website worldwide (Comscore 6-12)
§ Company pages: >2M
§ 62% non-U.S.
§ 2/sec
§ 85% of Fortune 500 companies use LinkedIn to hire
Thank You!
We’re Hiring!
Learn more at http://data.linkedin.com/