Relink - Tech Talk
Transcript of Relink - Tech Talk
relink
www.relinklabs.com
Bjarne Ørum Fruergaard, Lead Data Scientist
Machine Learning Match Making
1,000 employees
10% growth & 10% churn
Hire 200 people per year
Manually assess 15,000 profiles
1,200,000 EURO
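The numbers on this slide fit together with simple arithmetic. The sketch below reproduces the 200 hires per year and derives a few ratios. Reading the EURO figure as the yearly cost of this process is an assumption (the slide only gives the number), and the per-profile and per-hire ratios are derived here, not taken from the talk.

```python
# Figures taken from the slide.
employees = 1000
growth_rate = 0.10
churn_rate = 0.10
profiles_assessed = 15_000
total_eur = 1_200_000  # ASSUMPTION: treated as the yearly cost of the process

# Hires needed both to replace leavers and to grow the headcount.
hires_per_year = round(employees * (growth_rate + churn_rate))

# Derived ratios (not stated on the slide).
profiles_per_hire = profiles_assessed / hires_per_year
eur_per_profile = total_eur / profiles_assessed
eur_per_hire = total_eur / hires_per_year

print(hires_per_year, profiles_per_hire, eur_per_profile, eur_per_hire)
```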
Solution
Job CV & cover letter
Job <—> Applicant analyses empowering the recruiter
Large graphs connect jobs, educations & skills
Augment job descriptions and profiles
Relevant skills, job experience and education for the job
Probable skills with confidence on profiles
Ranking
Select Top N
Interesting challenge:
Given a batch of J job descriptions
Score 5M profiles (all jobs simultaneously)
For each job (sequentially):
  Top N in each partition
  Merge Top N from each partition
Sequential is a nuisance
Collect on driver
Results are not distributed
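A minimal single-machine sketch of the exact approach described above, with plain Python lists standing in for distributed partitions. The function names, the data layout (a profile id plus one score per job) and the sizes are all hypothetical; this is not the talk's Spark code. It shows why the approach is awkward: the merge happens in one place (the "driver"), once per job.

```python
import heapq
import random

def top_n_in_partition(partition, job, n):
    """Top n (score, profile_id) pairs for one job within one partition."""
    return heapq.nlargest(n, ((scores[job], pid) for pid, scores in partition))

def exact_top_n(partitions, job, n):
    """Collect each partition's top n on the 'driver' and merge them."""
    candidates = []
    for partition in partitions:
        candidates.extend(top_n_in_partition(partition, job, n))
    return heapq.nlargest(n, candidates)

# Hypothetical toy data: every profile has one score per job.
random.seed(0)
num_jobs, num_profiles, num_partitions, n = 3, 1000, 4, 10
rows = [(pid, [random.random() for _ in range(num_jobs)])
        for pid in range(num_profiles)]
partitions = [rows[i::num_partitions] for i in range(num_partitions)]

# Jobs are handled one after another, which is the nuisance noted above.
top_by_job = {job: exact_top_n(partitions, job, n) for job in range(num_jobs)}
```

Each partition contributes at most N candidates, so the driver sees at most N times the number of partitions per job, but the final result ends up as a local list rather than a distributed dataset.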
Tree Digests
[1] Dunning, T. "Computing Extremely Accurate Quantiles Using t-Digests". https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf
Images shamelessly copied from here (thanks Cam Davidson-Pilon!): [2] https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest
Compressing the CDF
Estimate quantiles or percentiles with low error
Associative and commutative
“I streamed 8mb of pareto-distributed data into a t-Digest. The resulting size was 5kb, and I could estimate any percentile or quantile desired. Accuracy was on the order of 0.002%.”
[1]
[2]
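To illustrate what "associative and commutative" buys you, here is a small sketch of a mergeable quantile summary. It is deliberately not a t-digest: it is a plain fixed-width histogram, used as a crude stand-in that has the same merge property. The class name, the bin count and the assumption that values lie in [0, 1] are all choices made for this example.

```python
import random

class HistogramSketch:
    """Fixed-width histogram over [lo, hi]; a crude stand-in for a t-digest."""

    def __init__(self, bins=1000, lo=0.0, hi=1.0):
        self.bins, self.lo, self.hi = bins, lo, hi
        self.counts = [0] * bins
        self.total = 0

    def add(self, x):
        # Clamp into range, then map the value to a bin index.
        frac = (min(max(x, self.lo), self.hi) - self.lo) / (self.hi - self.lo)
        self.counts[min(int(frac * self.bins), self.bins - 1)] += 1
        self.total += 1

    def merge(self, other):
        """Return a new sketch whose counts are the element-wise sum."""
        out = HistogramSketch(self.bins, self.lo, self.hi)
        out.counts = [a + b for a, b in zip(self.counts, other.counts)]
        out.total = self.total + other.total
        return out

    def quantile(self, q):
        """Approximate value below which a fraction q of the data falls."""
        target = q * self.total
        seen = 0
        width = (self.hi - self.lo) / self.bins
        for i, c in enumerate(self.counts):
            if c and seen + c >= target:
                # Interpolate linearly inside the bin that reaches the target.
                return self.lo + (i + (target - seen) / c) * width
            seen += c
        return self.hi

# Build one sketch per chunk of data, as separate workers would.
random.seed(1)
data = [random.random() for _ in range(30_000)]
chunks = [data[:10_000], data[10_000:20_000], data[20_000:]]
sketches = []
for chunk in chunks:
    s = HistogramSketch()
    for x in chunk:
        s.add(x)
    sketches.append(s)
a, b, c = sketches

# Any merge order or grouping gives the same summary.
left = a.merge(b).merge(c)
right = c.merge(b.merge(a))

estimate = left.quantile(0.99)
exact = sorted(data)[int(0.99 * len(data))]
```

A real t-digest adapts its resolution to the data and is much more accurate in the tails. The histogram only shares the property that matters here: summaries built independently can be combined in any order.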
Given a batch of J job descriptions
Score 5M profiles (all jobs simultaneously)
Compute t-Digests locally on executors
Sum t-Digests
Broadcast t-Digests
Filter partitions where score >= percentile
Approximate Top N remain in partitions

t-Digests are small
Collect on driver is comparatively small
Results remain distributed
Approximates Top N
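The steps above can be sketched on a single machine as follows, for one job. As assumptions for this example, a fixed-width histogram replaces the t-digest, scores lie in [0, 1], and all names and sizes are hypothetical. Only the small summaries travel to the "driver"; the filtered rows stay in their partitions.

```python
import random

BINS = 2000  # ASSUMPTION: histogram resolution; scores assumed to lie in [0, 1]

def bin_of(score):
    """Map a score in [0, 1] to its histogram bin."""
    return min(int(score * BINS), BINS - 1)

def local_histogram(partition):
    """Executor side: summarise one partition's scores as bin counts."""
    counts = [0] * BINS
    for _pid, score in partition:
        counts[bin_of(score)] += 1
    return counts

def merge(h1, h2):
    """Driver side: summing counts is associative and commutative."""
    return [x + y for x, y in zip(h1, h2)]

def threshold_bin_for_top_n(counts, n):
    """Highest bin index such that at least n scores fall in it or above it."""
    seen = 0
    for i in range(BINS - 1, -1, -1):
        seen += counts[i]
        if seen >= n:
            return i
    return 0

# Hypothetical toy data: (profile_id, score) rows spread over 8 partitions.
random.seed(2)
n = 50
rows = [(pid, random.random()) for pid in range(20_000)]
partitions = [rows[i::8] for i in range(8)]

# 1. Summarise locally; 2. sum the summaries; 3. derive the cut-off.
merged = [0] * BINS
for part in partitions:
    merged = merge(merged, local_histogram(part))
cut = threshold_bin_for_top_n(merged, n)

# 4. "Broadcast" the cut-off and filter every partition in place.
#    Comparing bins is the discretised form of `score >= percentile`.
filtered = [[row for row in part if bin_of(row[1]) >= cut] for part in partitions]
kept = sum(len(part) for part in filtered)
```

The filter keeps at least N rows and usually a few more, because the cut-off can only fall on a bin boundary. That is the sense in which the result approximates Top N.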
Let’s try it!
5 jobs simultaneously
~5M scored profiles
N=5000
Two methods:
getTopScoresWithSortAndLimit: ~28 seconds
getTopScoresWithTDigests: ~8 seconds