Post on 12-Apr-2017
go.indeed.com/IndeedEngTalks
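[Screenshots: Indeed search forms in three languages. English: "job title, keywords or company name" and "city, state or zip code". German: "was" / "wo" (what / where), query "produktionshelfer" (production assistant), location "münchen" (Munich), button "Jobs finden" (Find jobs). Greek: "τι" / "που" (what / where), query "βοηθός λογιστή" (assistant accountant), location "Αθήνα" (Athens), button "Εύρεση θέσεων εργασίας" (Find jobs).]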
Precision
Job seeker searches for “architect”
10 jobs returned:
● 8 building architect jobs (Relevant)
● 2 software architect jobs (Not Relevant)
Precision: 8 / 10
Recall
Job seeker searches for “hr”
Jobs that mention “hr” or “human resources” are both relevant to the job seeker.
10 jobs are relevant:
● 7 hr jobs (Returned)
● 3 human resources jobs (Not Returned)
Recall: 7 / 10
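The two ratios above can be reproduced with a small sketch. The job identifiers are made up for illustration; only the counts (8 of 10, 7 of 10) come from the slides.

```python
def precision(returned, relevant):
    """Fraction of returned jobs that are relevant."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0


def recall(returned, relevant):
    """Fraction of relevant jobs that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0


# "architect": 10 jobs returned, only the 8 building architect jobs are relevant.
building = [f"building-{i}" for i in range(8)]
software = [f"software-{i}" for i in range(2)]
architect_precision = precision(building + software, building)  # 0.8

# "hr": 10 jobs are relevant, only the 7 that literally say "hr" are returned.
hr = [f"hr-{i}" for i in range(7)]
human_resources = [f"human-resources-{i}" for i in range(3)]
hr_recall = recall(hr, hr + human_resources)  # 0.7
```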
Job Description - English
Senior Software Engineer - Search
Indeed - Austin, TX
Indeed.com is seeking a Senior Software Engineer responsible for the information retrieval system that powers Indeed’s job search website. If you are an engineer who's passionate about building innovative products...
Tokenization
Inverted Index
Token Job A Job B Job C
assistant ✔
developer ✔
engineer ✔
lawyer ✔ ✔
paralegal ✔ ✔
retrieval ✔
Inverted Indexes
Allow you to:
● Quickly find all documents containing a token
● Perform boolean queries, e.g. “java AND developer”
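A minimal sketch of the idea, assuming whitespace tokenization and three made-up job titles (neither is from the talk): each token maps to the set of jobs that contain it, and an AND query is a set intersection.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_and(index, *tokens):
    """Boolean AND: documents that contain every token."""
    postings = [index.get(token.lower(), set()) for token in tokens]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents, for illustration only.
docs = {
    "A": "senior java developer",
    "B": "java architect",
    "C": "paralegal assistant",
}
index = build_inverted_index(docs)
java_and_developer = search_and(index, "java", "developer")  # {"A"}
```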
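Job Description - French
Secrétaire
Saclay
Au sein de la direction de la Qualité et de l'Environnement (DQE) vous seconderez la secrétaire-assistante. Vos principales missions seront :
- organisation de réunions
- l'accueil téléphonique
- la gestion des missions ..
English translation: Secretary, Saclay. “Within the Quality and Environment Department (DQE), you will assist the secretary-assistant. Your main tasks will be: organizing meetings, answering the phone, managing assignments ..”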
Job Description - Chinese
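岗位描述:1、全厂电气设备的日常检查、记录,在操作工或主操的指导下进行工艺操作. 2、现场液体充装,现场充装安全的管理. 3、负责现场工作环境的整洁. ...
English translation: “Job description: 1. Daily inspection and recording of electrical equipment plant-wide; perform process operations under the guidance of the operator or lead operator. 2. On-site liquid filling; management of on-site filling safety. 3. Responsible for keeping the on-site work environment tidy. ...”
全厂电气设备的日常检查、记录,
“Daily inspection and recording of electrical equipment plant-wide”
在操作工或主操的指导下进行工艺操作.
“Perform process operations under the guidance of the operator or lead operator”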
Chinese using JobAnalyzer
Language Detection options
● HTTP Content-Language response header
  ○ Most sites don’t provide this header
  ○ May not be accurate
Language Detection - ICU4J
● ICU4J’s CharsetDetector
  ○ Works well for languages with single-byte encoded characters
  ○ Detects that the language is one of Danish, Dutch, English, French, German, Italian, Portuguese, Swedish
Naive Bayesian classifier
● Features - words
● Strong independence assumption
● Class label - language
Naive Bayesian Language detector
For each language, calculate P(w_i ∈ L_j)
● P(“experience” ∈ en) = 0.85
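As a rough sketch of a word-based naive Bayes language detector: the code below uses a standard multinomial formulation with Laplace smoothing and a uniform prior, which may differ from how the talk's detector estimates P(w_i ∈ L_j). The training sentences and the function names `train` and `detect` are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: iterable of (language, text). Returns word counts per language."""
    counts = defaultdict(Counter)
    for language, text in samples:
        counts[language].update(text.lower().split())
    return counts


def detect(counts, text, alpha=1.0):
    """Return the language with the highest summed log-probability of the words.

    Words are treated as independent given the language (the naive assumption).
    """
    vocabulary = {word for c in counts.values() for word in c}
    best_language, best_score = None, -math.inf
    for language, c in counts.items():
        total = sum(c.values())
        denominator = total + alpha * (len(vocabulary) + 1)
        score = sum(
            math.log((c[word] + alpha) / denominator)  # Laplace smoothing
            for word in text.lower().split()
        )
        if score > best_score:
            best_language, best_score = language, score
    return best_language


# Toy hand-labeled data, far smaller than any real training set.
counts = train([
    ("en", "software engineer with experience in search"),
    ("en", "we are seeking an engineer with experience"),
    ("fr", "ingénieur logiciel avec expérience"),
    ("fr", "nous recherchons une secrétaire avec expérience"),
])
english_guess = detect(counts, "experience with software")
french_guess = detect(counts, "secrétaire avec expérience")
```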
CJ language detection
● Strongly weight Hiragana and Katakana
● Some characters (Kanji) common between Chinese and Japanese
● p(卒 ∈ ja) = 0.99, p(卒 ∈ zh) = 0.000001
Language Results
● Did cross-validation on hand-labeled testing data
● 99% accurate for text > 30 characters
  ○ Average job description is 200 characters
● Fast - 0.6ms per job
Dictionary-based tokenizers
● Dictionary of words in language
● Scan input sentence, return all possible tokenizations
北京 大学生 前来 应聘 → “Beijing college students come to apply for jobs”
北京大学 生前 来应聘 → “Peking University, before death, come to apply for jobs”
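The ambiguity above falls out of a simple dictionary scan. This is a minimal sketch with a tiny hand-picked dictionary (an assumption, not a real lexicon); it enumerates every way to split the sentence into dictionary words. Unlike the slide, it splits 来应聘 into 来 and 应聘.

```python
def all_tokenizations(text, dictionary):
    """Return every segmentation of text into words from the dictionary."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Keep this word and segment the remainder recursively.
            for rest in all_tokenizations(text[end:], dictionary):
                results.append([word] + rest)
    return results


# Hypothetical mini-dictionary covering only this sentence.
DICTIONARY = {"北京", "北京大学", "大学生", "生前", "前来", "来", "应聘"}
segmentations = all_tokenizations("北京大学生前来应聘", DICTIONARY)
```

Picking the intended segmentation out of the candidates needs extra signal (word frequencies or a statistical model), which this sketch does not attempt.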
CJK tokenizers
● Chinese - Imdict
● Japanese - Sen
● Korean - LuceneKorean
Chinese tokenization
http://nlp.stanford.edu/projects/chinese-nlp.shtml
What is stemming?
The process of turning multiple variations of a word into a single equivalent root.
Stemming examples
● driver, drivers → driver
● secretaire, secrétaire → secretaire
● vendeur, vendeuse → vendeur
Why stemming matters
● Return all possible relevant jobs given the user’s query, not just exact matches
Stemming - Lucene Analyzers
● Do stemming before adding to inverted index
● Examples
  ○ PorterStemFilter
  ○ SnowballAnalyzer
  ○ EnglishMinimalStemmer
Inverted Index
Job A: Directrice de documentaires
Job B: Directeur de production

Token          Job A  Job B
de             ✔      ✔
directeur      ✔      ✔
documentaires  ✔
production            ✔
Search with stemming tokenizers
● At search time, use the same analyzer on the query
  ○ “directrice” → “directeur”
● Search for “directrice” returns both jobs
Term Expansion Maps
● Map from String -> List<String>
● Key is root, values are tokens that stem to that root
  ○ driver → driver, drivers
  ○ vendeur → vendeur, vendeuse
Stemmer interface
● One method
  ○ String stem(String token)
● Many implementations
  ○ EnglishStemmer
  ○ FrenchStemmer
  ○ GermanStemmer
  ○ SpanishStemmer
Building term expansion map
for each language
    for each term in language
        root = Stemmer.stem(term)
        termMap[root].append(term)
● Takes ~1.5 minutes on index with 2 million tokens and 18 languages
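A runnable sketch of the loop above. `toy_stem` is a hypothetical stand-in for the real per-language stemmers, with just enough rules to reproduce the driver/vendeur examples; the input shape (`terms_by_language`) is also an assumption.

```python
from collections import defaultdict


def toy_stem(token):
    """Hypothetical stemmer: two suffix rules, just enough for the examples."""
    for suffix, replacement in (("euse", "eur"), ("s", "")):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: len(token) - len(suffix)] + replacement
    return token


def build_term_expansion_map(terms_by_language, stemmers):
    """Group every index term under the root it stems to."""
    term_map = defaultdict(list)
    for language, terms in terms_by_language.items():
        stem = stemmers[language]
        for term in sorted(terms):
            root = stem(term)
            if term not in term_map[root]:
                term_map[root].append(term)
    return dict(term_map)


term_map = build_term_expansion_map(
    {"en": {"driver", "drivers"}, "fr": {"vendeur", "vendeuse"}},
    {"en": toy_stem, "fr": toy_stem},
)
# term_map == {"driver": ["driver", "drivers"], "vendeur": ["vendeur", "vendeuse"]}
```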
Job A: Directrice de documentaires
Job B: Directeur de production

Token          Job A  Job B
de             ✔      ✔
directeur             ✔
directrice     ✔
documentaires  ✔
production            ✔
[Diagram: “directrice” → French Stemmer → “directeur” → Term Expansion Map → Query Rewriter → “directrice” OR “directeur”]
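A sketch of the query-time step, assuming the index stores unstemmed tokens. The function names and the single-rule `french_stem` are hypothetical; only the “directrice” example and its OR expansion come from the slides.

```python
def french_stem(token):
    """Hypothetical one-rule stemmer: "trice" -> "teur"."""
    if token.endswith("trice"):
        return token[: -len("trice")] + "teur"
    return token


def rewrite_query(token, stem, term_map):
    """Stem the query token, then OR together every token sharing that root."""
    root = stem(token)
    expansions = term_map.get(root, [token])
    return " OR ".join(f'"{term}"' for term in expansions)


# Term expansion map as built at index time (contents taken from the example).
term_map = {"directeur": ["directrice", "directeur"]}
rewritten = rewrite_query("directrice", french_stem, term_map)
# rewritten == '"directrice" OR "directeur"'
```

Because the index itself is unstemmed, changing a stem rule only changes the map and the rewritten query, not the indexed tokens.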
Benefits
● Modifying stem rules doesn’t require index rebuilds
  ○ Rebuilding the term expansion map takes minutes on an index with millions of jobs
  ○ Gave us the flexibility to implement stemming rules iteratively as we came across different use cases
Scale Stemming
● Indeed continued international expansion
● Needed stemming to scale without code deploys and coordination between developers and country managers.
Goal
● Comprehensive
  ○ Support all use cases we care about:
    ■ plurals
    ■ synonyms
    ■ abbreviations
    ■ accent collation
    ■ gender suffixes
Substring rule
English - é → e
  résumé → resume
  café → cafe
German - ä → a
  verkäufer → verkaufer
French - ô → o
  hôtesse → hotesse
Suffix Rule - English
● ies → y
  ○ families → family
  ○ policies → policy
● s → ‘’
  ○ nurses → nurse
  ○ drivers → driver
Prevent over-stemming
● s → ‘’ can cause this → thi
● Min Length - special terminal rule
● Usually set to anywhere from 3 to 5
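The substring, suffix and min-length rules can be combined in one generic stemmer. This is a sketch under assumptions: rules are plain (from, to) pairs, only the first matching suffix rule fires, and min length is read as "tokens this short or shorter are never suffix-stemmed". The talk does not spell out these details.

```python
def stem(token, substring_rules, suffix_rules, min_length=4):
    """Apply substring rules, then at most one suffix rule."""
    # Substring rules apply anywhere in the token (e.g. accent collation).
    for old, new in substring_rules:
        token = token.replace(old, new)
    # Min length acts as a terminal rule: short tokens are left alone.
    if len(token) <= min_length:
        return token
    # First matching suffix rule wins, so "ies" must be listed before "s".
    for suffix, replacement in suffix_rules:
        if token.endswith(suffix):
            return token[: len(token) - len(suffix)] + replacement
    return token


EN_SUBSTRING_RULES = [("é", "e")]
EN_SUFFIX_RULES = [("ies", "y"), ("s", "")]


def stem_en(token, min_length=4):
    return stem(token, EN_SUBSTRING_RULES, EN_SUFFIX_RULES, min_length)


stems = [stem_en(t) for t in ("families", "nurses", "résumé", "this")]
# stems == ["family", "nurse", "resume", "this"]
```

With `min_length=0` the same rules turn “this” into “thi”, which is the over-stemming case the terminal rule is meant to prevent.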
Babelfish: Stem rule editor
● Webapp to edit and publish rules
● Rules interpreted by generic stemmer
● 27 languages
directrices → directrice: suffix rule “s” → “”
directrice → directeur: suffix rule “trice” → “teur”
ingénieur → ingenieur: substring rule “é” → “e”
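[Diagram: Country Managers edit rules in the Stem Rule Editor (EN: s → ‘’, ces → y, …; FR: e → é, u → ù, …). The Jobs Index Builder uses the rules to build the Term Expansion Map (sale → sale, sales; policy → policy, policies). The Search Service uses the map to serve Job Seekers, who send a query and receive results.]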
Term expansion map storage
● Custom serialization format
  ○ Store string array as UTF-8 bytes and offsets
  ○ Front encoding for additional compression
● 2X smaller than using Java native serialization
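Front encoding (front coding) stores each term in a sorted list as the length of the prefix it shares with the previous term plus the remaining suffix. The sketch below shows only that idea, on Python strings; it is not Indeed's serialization format, and it works on characters rather than UTF-8 bytes and offsets.

```python
import os


def front_encode(sorted_terms):
    """Encode each term as (shared prefix length with previous term, suffix)."""
    encoded, previous = [], ""
    for term in sorted_terms:
        shared = len(os.path.commonprefix([previous, term]))
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the original terms from (shared length, suffix) pairs."""
    terms, previous = [], ""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms


terms = ["policies", "policy", "sale", "sales"]
encoded = front_encode(terms)
# encoded == [(0, "policies"), (5, "y"), (0, "sale"), (4, "s")]
```

Sorted term lists share long prefixes between neighbours, which is why this helps on a stemming map where each root sits next to its variants.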
Scalable
27 languages use stemming rules
Re-used language detection and stemming libraries in resume search
Efficient
● Term expansion map in Europe index has 2 million terms in 18 languages - 60MB on disk
● Building term expansion maps takes ~ 1.5 minutes
● Doing boolean query for stemming adds ~5ms to median search time (~35ms)
Stemming helps job seekers
Searches that return no jobs reduced by 60% with stemming
3% to 5% more clicks
Sponsored Jobs at Indeed
Real-time auction used to determine Sponsored Job impressions
Auction winner based on expected value
Job   Bid     x   eCTR   =   Value   →   Rank
B     $2.00       10%        $0.20       1
A     $3.00       5%         $0.15       2

B could win the auction with a lower bid...
…only charge what’s needed to win!
$1.50 x 10% = $0.15
Cost = $1.51
Sponsored Jobs at Indeed
“Generalized Second Price Auction”
● Fair for employers
● Ensures sponsored results are relevant and useful for job seekers
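A sketch of the ranking and pricing from the example, using integer cents and exact fractions. The one-cent increment matches the slide's $1.51; the reserve price for the last-ranked job (`RESERVE_CENTS`) and the tuple layout are assumptions for illustration.

```python
import math
from fractions import Fraction

RESERVE_CENTS = 1  # hypothetical minimum price when no job ranks below


def run_auction(bids):
    """bids: list of (job, bid_cents, ectr). Returns [(job, cost_cents)] by rank."""
    # Rank by expected value: bid x eCTR.
    ranked = sorted(bids, key=lambda b: b[1] * b[2], reverse=True)
    results = []
    for position, (job, bid, ectr) in enumerate(ranked):
        if position + 1 < len(ranked):
            _, next_bid, next_ectr = ranked[position + 1]
            next_value = next_bid * next_ectr
            # Smallest whole-cent bid whose value still beats the next job.
            cost = math.floor(next_value / ectr) + 1
            cost = min(cost, bid)  # never charge more than the bid
        else:
            cost = RESERVE_CENTS
        results.append((job, cost))
    return results


results = run_auction([
    ("A", 300, Fraction(5, 100)),   # $3.00 x 5%  = $0.15
    ("B", 200, Fraction(10, 100)),  # $2.00 x 10% = $0.20
])
# results == [("B", 151), ("A", 1)]
```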
Sponsored Jobs at Indeed
Employers set their bid & budget
employer_id int(10) unsigned,
bid decimal(10,2) unsigned,
daily_budget decimal(10,2) unsigned,
Sponsored Jobs at Indeed
A builder process creates read-optimized data structures for the auction system
Sponsored Jobs at Indeed
When job seeker clicks on sponsored result, log information from the auction
employerId
jobId
bid
cost
…
Sponsored Jobs at Indeed
Process click logs to update budgets and charge employers
Apply business rules during click processing:
● Fraud detection
● Duplicate click detection
SJ outside the US
Non-US employers wanted their jobs in sponsored results...
...but they don’t have US Dollars
SJ outside the US
Credit Cards
+ No changes needed
- Bad UX for employers
- Disadvantageous exchange rates
- Employers bear currency risk
Credit Cards: Currency Risk
Desired Daily Budget: CA $100.00
Exchange rate on Jan 1: 0.9351
Set Daily Budget to: $93.51
Exchange rate on Jan 31: 0.8970
Effective Daily Budget: CA $104.25 (+4.25%)
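The drift can be checked with decimal arithmetic. The sketch assumes the quoted rates are USD per CAD and that amounts round half-up to the cent; the variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

desired_cad = Decimal("100.00")
rate_jan_1 = Decimal("0.9351")   # USD per CAD (assumed direction)
rate_jan_31 = Decimal("0.8970")

# The employer converts once and sets a USD budget on Jan 1.
budget_usd = (desired_cad * rate_jan_1).quantize(CENT, rounding=ROUND_HALF_UP)

# By Jan 31 the same USD budget costs more in CAD.
effective_cad = (budget_usd / rate_jan_31).quantize(CENT, rounding=ROUND_HALF_UP)
drift_percent = ((effective_cad - desired_cad) / desired_cad * 100).quantize(CENT)
# budget_usd == 93.51, effective_cad == 104.25, drift_percent == 4.25
```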
Multi-currency SJ
Employers can set bids and budgets in preferred currency
Canadian Dollars CAD
Australian Dollars AUD
Japanese Yen JPY
Euro EUR
British Pounds GBP
Swiss Francs CHF
Millicents
Exchange rate between USD and millicents is fixed:
$0.01 == 1,000 millicents
$1.00 == 10^5 (100,000) millicents
Millicents
Exchange rates between other currencies and millicents can vary over time:
€1.00 == 136,170 millicents
¥100 == 98,350 millicents
Millicents
Provide enough granularity to differentiate similar values in different currencies
All of these are about $1.00 (USD). Which is larger?
£0.60 (GBP)
€0.73 (EUR)
¥102 (JPY)
Millicents
Converting to USD doesn’t help
USD: $1.00 → $1.00
GBP: £0.60 → $1.00
EUR: €0.73 → $1.00
JPY: ¥102 → $1.00
Millicents
Millicents provide granularity to rank values
USD: $1.00 → 100000 mc
GBP: £0.60 → 100450 mc
EUR: €0.73 → 99519 mc
JPY: ¥102 → 100317 mc
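A sketch of ranking mixed-currency amounts by millicents. It uses the EUR and JPY rates quoted on the earlier “Millicents” slide, so the EUR result differs from the 99519 mc shown above, which evidently came from a different rate; GBP is omitted because no GBP rate is given. Amounts are (currency, minor units) pairs, and rounding to the nearest millicent is an assumption.

```python
from fractions import Fraction

# Millicents per minor unit (cent, euro cent, yen).
MILLICENTS_PER_MINOR_UNIT = {
    "USD": Fraction(1000),         # fixed: $0.01 == 1,000 millicents
    "EUR": Fraction(136170, 100),  # €1.00 == 136,170 millicents
    "JPY": Fraction(98350, 100),   # ¥100 == 98,350 millicents
}


def to_millicents(currency, minor_units):
    """Convert a local-currency amount to whole millicents."""
    return round(minor_units * MILLICENTS_PER_MINOR_UNIT[currency])


bids = [("USD", 100), ("EUR", 73), ("JPY", 102)]  # $1.00, €0.73, ¥102
ranked = sorted(bids, key=lambda bid: to_millicents(*bid), reverse=True)
# ranked == [("JPY", 102), ("USD", 100), ("EUR", 73)]
```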
Millicents
32-bit signed values: $21,474 USD equivalent
64-bit signed values: $92 trillion USD equivalent
Local Currency Values
Values in specific currency are represented with currency code and an integer
Integer represents the “minor unit”, which depends on the currency:
(USD, 543) == $5.43
(EUR, 543) == €5.43
(JPY, 543) == ¥543
Local Currency Values
For each currency, it is preferable that the “minor unit” is roughly equal to $0.01 USD
● Exchange rate representation
● Fairness in auction competition
Local Currency Values
32-bit signed values: $21 million USD (and others), ¥2.1 billion JPY
64-bit signed values: $90 quadrillion USD (and others), ¥9 quintillion JPY
Multi-currency SJ
Add multi-currency data to click logs:
Before:
employerId
jobId
bid
cost
...
After:
employerId
jobId
currency
exchangeRate
bidInCurrency
bidMillicents
costMillicents
...
Multi-currency SJ
During click processing, convert auction cost (in millicents) back to employer’s currency using same exchange rate
costInMillicents
currency
exchangeRate
→ costInCurrency
Revenue Reporting
If the auction millicent cost is used, there could be errors!
Millicent Cost: 53,826 millicents
Euro Cost: €0.39483 → €0.39
Actual Millicent Cost: 53,168 millicents
1.2% difference!
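The gap comes from rounding the charge to whole euro cents and converting back. The sketch below reproduces the figures; the exchange rate of 136,328 millicents per euro is not stated in the talk and is back-derived here from the slide's numbers, and round-to-nearest is likewise an assumption.

```python
from fractions import Fraction

# Assumed rate: millicents per euro cent (136,328 millicents per euro).
RATE = Fraction(136328, 100)

auction_cost_mc = 53826

# Convert the auction cost to euro cents, then round to a chargeable amount.
exact_euro_cents = Fraction(auction_cost_mc) / RATE  # about 39.483
charged_euro_cents = round(exact_euro_cents)         # 39, i.e. €0.39

# Converting the charged amount back gives the revenue actually collected.
actual_cost_mc = round(charged_euro_cents * RATE)    # 53168
difference = Fraction(auction_cost_mc - actual_cost_mc, auction_cost_mc)
difference_percent = round(float(difference) * 100, 1)  # 1.2
```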
International Success
United Kingdom 1.) Indeed 2.) Reed 3.) Totaljobs
France 1.) Indeed 2.) Cadremploi 3.) Monster
Netherlands 1.) Indeed 2.) NVB 3.) Monsterboard
Canada 1.) Indeed 2.) Workopolis 3.) Monster
Italy 1.) Indeed 2.) Infojobs 3.) Jobrapido
Brazil 1.) Indeed 2.) Catho 3.) Infojobs
Japan 1.) Rikunabi 2.) Indeed 3.) Rikunabi Next
Australia 1.) Seek 2.) Indeed 3.) Careerone
India 1.) Naukri 2.) Timesjobs 3.) Indeed
Next @IndeedEng Talk
August 27th, 2014
http://engineering.indeed.com/talks
https://twitter.com/IndeedEng