Big Data and Data Standardization at LinkedIn

Post on 24-May-2015

From a talk I gave to a group of Connecticut College students in November 2012. It looks at some of the challenges of dealing with huge amounts of member-entered data, the techniques used to address those challenges, and product applications of that data.

Transcript of Big Data and Data Standardization at LinkedIn

Reading the Tea Leaves: Big Data at LinkedIn

Alexis Baird, Product Manager, LinkedIn

What is LinkedIn?

§  LinkedIn’s mission: “Connect the world’s professionals to make them more productive and successful”
§  The site officially launched on May 5, 2003
§  Now has >187 million members worldwide
§  LinkedIn has >3,000 employees in offices all around the world
§  Headquartered in Mountain View, CA
§  Three different lines of revenue:
    –  Subscriptions
    –  Talent Solutions
    –  Marketing Solutions

Who am I?

The Age of Big Data

Big Data at LinkedIn

§  187+ million members from >200 countries
§  Each month, 52 million members come to the site, generating ~2 billion page views:
    –  Performing searches
    –  Connecting with other members
    –  Editing their profile
    –  Sharing, commenting on, or liking news articles
    –  Participating in group discussions
    –  And much more…

Big Data Challenges

§  Storage and processing constraints

§  Noisy signal
    –  Variation
    –  People are not always rational or consistent

Data Messiness

§  Job titles:
    §  “programmer”
    §  “software developer”
    §  “engineer”
    §  “coding ninja”
§  Schools:
    §  “Connecticut College”
    §  “Conn College”
    §  “Conn”
    §  “CC”
    §  “Conn College (NOT Uconn)”
§  Companies:
    §  “Microsoft”
    §  “MSFT”
    §  “Bing”
    §  “Microsoft/Bing”
    §  “Microsoft-Mountain View”

Data Standardization

§  Take an input (usually a user-entered string) and turn it into a meaningful abstract id

“Microsoft”, “MSFT”, “Bing”, “Microsoft/Bing”, “Microsoft-Mountain View” => Company_id = 1035 (“Microsoft Corporation”)
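
As a rough illustration of the mapping above, here is a minimal sketch of an alias lookup. Only the pairing of id 1035 with “Microsoft Corporation” comes from the slide; the alias table, the normalization rules, and the function names are assumptions for illustration, not LinkedIn's implementation.

```python
import re

# Hypothetical alias table; only the id 1035 / "Microsoft Corporation"
# pairing comes from the slide.
CANONICAL_NAMES = {1035: "Microsoft Corporation"}

ALIASES = {
    "microsoft": 1035,
    "msft": 1035,
    "bing": 1035,
}


def normalize(raw):
    """Lowercase, replace punctuation with spaces, and trim whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", raw.lower())
    return cleaned.strip()


def standardize_company(raw):
    """Return a company id for a user-entered string, or None if unknown."""
    text = normalize(raw)
    if text in ALIASES:
        return ALIASES[text]
    # Fall back to the first token that is a known alias, so that
    # "Microsoft/Bing" and "Microsoft-Mountain View" still resolve.
    for token in text.split():
        if token in ALIASES:
            return ALIASES[token]
    return None
```

A pure lookup like this only covers variants someone has already listed, which is part of why the later slides turn to similarity and clustering.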

Why is this important?

Search

Structured data > Unstructured data

P(“linkedin” = company_id 1337) = .87
P(“ceo” = title_id 238) = .92
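
One way to read these probabilities is as a query tagger: each search token is mapped to a structured entity when the probability is high enough, and is otherwise left as free text. The sketch below assumes a simple per-token table and a fixed threshold; both are illustrative assumptions, and only the two example entries come from the slide.

```python
# Hypothetical per-token entity probabilities. The two entries mirror the
# slide's example; the threshold value is an assumption.
ENTITY_PROBS = {
    "linkedin": ("company_id", 1337, 0.87),
    "ceo": ("title_id", 238, 0.92),
}


def tag_query(query, threshold=0.5):
    """Split a query into structured filters and leftover free-text tokens."""
    filters = {}
    free_text = []
    for token in query.lower().split():
        entry = ENTITY_PROBS.get(token)
        if entry is not None and entry[2] >= threshold:
            field, entity_id, _prob = entry
            filters[field] = entity_id
        else:
            free_text.append(token)
    return filters, free_text
```

With the default threshold, the query “LinkedIn CEO” becomes two structured filters; raising the threshold to 0.9 keeps only the title filter and leaves “linkedin” as free text.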

Recommendations

Recommendation products at LinkedIn

Similar Profiles

Events You May Be Interested In

News

Network updates

Connections

LinkedIn’s recommender ecosystem

Recommendations drive:
>50% of connections
>50% of job applications
>50% of group joins

Jobs You Might Be Interested In

How LinkedIn matches people to jobs

Job: title, geo, company, industry, description, functional area

Candidate (filtered from User Base):
–  General: expertise, specialties, education, headline, geo, experience
–  Current Position: title, summary, tenure length, industry, functional area, …

Matching features:
–  Similarity (candidate expertise, job description): 0.56
–  Similarity (candidate specialties, job description): 0.2
–  Transition probability (candidate industry, job industry): 0.43
–  Title Similarity: 0.8
–  Similarity (headline, title): 0.7
–  . . .

Matching:
–  Binary: exact matches (geo, industry, …)
–  Soft: transition probabilities, similarity, …
–  Text

Corpus Stats (derived): transition probabilities, connectivity, yrs of experience to reach title, education needed for this title, …
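
To make the feature values above concrete, here is a minimal sketch of how a text-similarity feature might be computed and how several features might be combined into one score. The talk does not say which similarity measure or combination rule LinkedIn uses; bag-of-words cosine similarity and an equal-weight average are stand-ins chosen for illustration, and all function and feature names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words counts of two strings."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_score(features, weights=None):
    """Weighted average of feature scores; equal weights unless given."""
    if weights is None:
        weights = {name: 1.0 for name in features}
    total_weight = sum(weights[name] for name in features)
    return sum(features[name] * weights[name] for name in features) / total_weight


# Feature values copied from the slide's example; the key names are made up.
slide_features = {
    "expertise_vs_description": 0.56,
    "specialties_vs_description": 0.2,
    "industry_transition": 0.43,
    "title_similarity": 0.8,
    "headline_vs_title": 0.7,
}
```

Under these assumptions, `match_score(slide_features)` averages the five slide values. A production system would more plausibly learn the weights from data such as past job applications.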

Data Standardization: Occupations

§  How do we know a “senior software developer” and a “software developer” are the same occupation?
    –  Strip a special set of words known to indicate seniority
§  How do we know a “software developer” and a “software engineer” are the same occupation?
    –  Term similarity
§  How do we know a “programmer” and a “software developer” are the same occupation but a “programmer” and a “program director” are not?
    –  Need something more complicated
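
The first two techniques can be sketched in a few lines. The seniority word list below is an assumption (the talk does not give the actual list), and Jaccard token overlap is just one simple choice of term similarity.

```python
# Assumed seniority vocabulary; the talk does not list the actual words used.
SENIORITY_WORDS = {"senior", "sr", "junior", "jr", "lead", "principal", "staff"}


def strip_seniority(title):
    """Drop tokens that signal seniority rather than occupation."""
    tokens = title.lower().replace(".", "").split()
    return " ".join(t for t in tokens if t not in SENIORITY_WORDS)


def term_similarity(title_a, title_b):
    """Jaccard overlap between the token sets of two titles."""
    a = set(strip_seniority(title_a).split())
    b = set(strip_seniority(title_b).split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

With this sketch, “senior software developer” and “software developer” score 1.0, and “software developer” and “software engineer” share one of three terms. “programmer” shares no tokens with either “software developer” or “program director”, so token overlap cannot tell the two cases apart, which is the gap the third bullet points at.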

Data Standardization: Occupations

1.  Rule-based string clean up:
    –  ~2 million different titles => 24,000 different “cleaned” titles
    –  E.g. “Sr software dev” => “senior software developer”
2.  Create “virtual profiles” for each title using various extracted and normalized profile features (e.g. skills, degree, field of study, summary, job description, honors, etc.)
3.  Cluster similar titles
4.  Get rid of uninformative titles spread across too many different topics
5.  Apply hand QA to tune and name the clusters
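
Steps 1 to 3 can be sketched on toy data as follows. Everything here is a simplified assumption: the abbreviation rules (beyond the slide's “Sr software dev” example), the use of skills alone as the virtual profile, the greedy clustering, and the 0.5 threshold. The talk does not specify the actual features or clustering algorithm, and steps 4 and 5 are not covered.

```python
import math
from collections import Counter

# Assumed abbreviation rules; only "Sr software dev" comes from the slide.
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "dev": "developer", "eng": "engineer"}


def clean_title(raw):
    """Step 1: lowercase, drop periods, and expand known abbreviations."""
    tokens = raw.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def build_virtual_profiles(members):
    """Step 2: pool skill counts over all members sharing a cleaned title."""
    profiles = {}
    for raw_title, skills in members:
        profiles.setdefault(clean_title(raw_title), Counter()).update(skills)
    return profiles


def cosine(a, b):
    """Cosine similarity between two skill-count profiles."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_titles(profiles, threshold=0.5):
    """Step 3: greedy clustering; a title joins the first cluster whose
    seed profile is at least `threshold` similar, else starts a new one."""
    clusters = []  # list of (seed_profile, [titles])
    for title in sorted(profiles):
        for seed, titles in clusters:
            if cosine(seed, profiles[title]) >= threshold:
                titles.append(title)
                break
        else:
            clusters.append((profiles[title], [title]))
    return [titles for _seed, titles in clusters]


# Toy member data, invented for illustration.
members = [
    ("Programmer", ["python", "java", "sql"]),
    ("Software Dev", ["java", "python", "git"]),
    ("Program Director", ["budgeting", "fundraising", "staff management"]),
]
```

On this toy data, `cluster_titles(build_virtual_profiles(members))` groups “programmer” with “software developer” and leaves “program director” on its own, because the clustering looks at what people with each title list on their profiles rather than at the title strings themselves.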

Lessons learned

§  Know your machine learning!
§  Know your success metric!
§  Need to allow for ambiguity within a given title
    §  “Head of production”
    §  DDS
§  Some titles are not standardizable:

Takeaways

§  The more information you give, the better your standardization will be
§  Why do you want LinkedIn to do a good job standardizing the data on your profile?
    –  Better recommendations:
        §  News
        §  Jobs
        §  Groups
        §  Connections
        §  Etc.
    –  Recruiters can find you more easily
    –  Potential connections can find you

LinkedIn Members (Millions), 2004 to 2011: 2, 4, 8, 17, 32, 55, 90

175M+

25th most visited website worldwide (Comscore 6-12)

Company pages: >2M

62% non-U.S.

2/sec

85% of Fortune 500 companies use LinkedIn to hire

Thank You!

We’re Hiring!

Learn more at http://data.linkedin.com/