Transcript of Document

Page 1: Document

A Graph-Based Approach to Skill Extraction from Text

Higher School of Economics, School of Applied Mathematics and Information Science, Nizhny Novgorod, Russia

Ilkka Kivimäki 1, Alexander Panchenko 4,2, Adrien Dessy 1,2, Dries Verdegem 3, Pascal Francq 1, Cédrick Fairon 2,

Hugues Bersini 3 and Marco Saerens 1

[email protected]

1 ICTEAM, 2 CENTAL, Université catholique de Louvain, Belgium; 3 IRIDIA, Université libre de Bruxelles, Belgium;

4 Digital Society Laboratory LLC, Russia

December 18, 2013

Page 2: Document

Table of Contents

1 Expertise retrieval and skill extraction

2 The Elisit system for skill extraction
      Overview of the system
      Sample Queries
      Association with Wikipedia
      Spreading activation in Wikipedia

3 Evaluation of system

4 Conclusion and future work

Page 3: Document

Reference paper:

Kivimäki I., Panchenko A., Dessy A., Verdegem D., Francq P., Bersini H. and Saerens M. "A Graph-Based Approach to Skill Extraction from Text". In Proceedings of the 8th Workshop TextGraphs-8: Graph-based Methods for Natural Language Processing. EMNLP 2013: Conference on Empirical Methods in Natural Language Processing. Seattle, USA, October 18-21, 2013.

http://aclweb.org/anthology/W/W13/W13-5011.pdf

Page 4: Document

Expertise retrieval [Balog et al., 2012]

Expertise Retrieval vs. Expertise Seeking

Expertise retrieval: linking humans to expertise areas, and vice versa, from a system-centered perspective. Expertise retrieval has primarily focused on identifying good topical matches between a need for expertise on the one hand and the content of documents associated with candidate experts on the other hand.

Expertise seeking: linking humans to expertise areas from a human-centered perspective. Expertise seeking has been mainly investigated in the field of knowledge management, where the goal is to utilize human knowledge within an organization as well as possible.

Page 5: Document

Expertise retrieval [Balog et al., 2012]

Expertise retrieval: Expert Profiling vs. Expert Retrieval

Person: a set of (text) documents generated by an individual.

Expertise: a keyword or a keyphrase specifying a field of knowledge, e.g. "Machine Learning", "Hadoop", "NLP", etc.

Expert profiling: given a person, retrieve (profile) their expertise. Person → Expertise

Expert retrieval: given an expertise, retrieve persons with that expertise. Expertise → Person

Page 6: Document

Expertise Retrieval: Earlier Work

TREC Enterprise Track [Balog et al., 2008]
State-of-the-art overview [Balog et al., 2012]
A skill extraction system [Crow and DeSanto, 2004]
Skill extraction system [Skomoroch et al., 2012]
Expertise retrieval in universities [Balog et al., 2007]
Expert finding on DBLP data [Deng et al., 2008]
e-Human Resource Management system [Biesalski, 2003]

Page 7: Document

Expertise Retrieval: Earlier Work

Skill extraction System [Skomoroch et al., 2012]

http://www.freepatentsonline.com/20120197863.pdf

Page 8: Document

Expertise Retrieval: Applications

Expertise management systems

Knowledge management in enterprises
Employee profiling

Reviewer selection for articles

Recommendation systems for:

jobs
job applicants
websites, blog texts, articles

Page 9: Document

Expertise retrieval

Page 10: Document

Expertise retrieval

Page 11: Document

Skill extraction

We focus on skill extraction from texts, i.e. associating skills with text documents.

Page 12: Document

Table of Contents

1 Expertise retrieval and skill extraction

2 The Elisit system for skill extraction
      Overview of the system
      Sample Queries
      Association with Wikipedia
      Spreading activation in Wikipedia

3 Evaluation of system

4 Conclusion and future work

Page 13: Document

Overview of the system

Table of Contents

1 Expertise retrieval and skill extraction

2 The Elisit system for skill extraction
      Overview of the system
      Sample Queries
      Association with Wikipedia
      Spreading activation in Wikipedia

3 Evaluation of system

4 Conclusion and future work

Page 14: Document

Overview of the system

The Elisit system for skill extraction

Original goal of the system:

Associate professional skills with people, based on the texts that they produce (emails, blogs, forums, articles, etc.).

Tools:

List of skills extracted from LinkedIn.

The skills are linked to corresponding Wikipedia pages.

Method:

1 Find Wikipedia pages relevant to a target document.

2 Use spreading activation on Wikipedia's hyperlink network to find skills that are "close" or "central" to these relevant pages.

Page 15: Document

Overview of the system

Skill extraction using Wikipedia

Page 16: Document

Overview of the system

Example

Page 17: Document

Overview of the system

Example

Page 18: Document

Overview of the system

Example

Page 19: Document

Overview of the system

Size of the problem

Our current version of English Wikipedia consists of

n = 3 983 338 encyclopedia entries
m = 247 560 469 links

27 513 of the encyclopedia entries correspond to LinkedIn skills.

Page 20: Document

Overview of the system

Implementation

For computing the similarities between the target document and all Wikipedia pages, we use the Gensim library [Rehurek and Sojka, 2010].

This part of the Elisit system is called the text2wiki module. It is currently the bottleneck of the computation.

For performing spreading activation, we use the sparse matrix library of SciPy.

This part is called the wiki2skill module.

Page 21: Document

Overview of the system

The Elisit system

At the moment not fully functional...

Page 22: Document

Sample Queries

Table of Contents

1 Expertise retrieval and skill extraction

2 The Elisit system for skill extraction
      Overview of the system
      Sample Queries
      Association with Wikipedia
      Spreading activation in Wikipedia

3 Evaluation of system

4 Conclusion and future work

Page 23: Document

Sample Queries

Popular Article about Natural Language Understanding

Page 24: Document

Sample Queries

Popular Article about Natural Language Understanding

Page 25: Document

Sample Queries

Blog Article about SEO Marketing

Page 26: Document

Sample Queries

Blog Article about SEO Marketing

Page 27: Document

Sample Queries

Wikipedia Article about Geo Information Systems

Page 28: Document

Sample Queries

Wikipedia Article about Geo Information Systems

Page 29: Document

Sample Queries

Scientific Article about Graph Mining

Page 30: Document

Sample Queries

Scientific Article about Graph Mining

Page 31: Document

Sample Queries

Try it...

Elisit Web Interface: http://elisit.cental.be/

Elisit Web Service: http://elisit.cental.be:8080/

This is only a demo: not optimized for multiple-user queries, high load, fast response, etc.

Page 32: Document

Association with Wikipedia

Table of Contents

1 Expertise retrieval and skill extraction

2 The Elisit system for skill extraction
      Overview of the system
      Sample Queries
      Association with Wikipedia
      Spreading activation in Wikipedia

3 Evaluation of system

4 Conclusion and future work

Page 33: Document

Association with Wikipedia

Association with Wikipedia

1. Find Wikipedia pages relevant to a target document.

We compute the similarity between the input document and all Wikipedia pages.

We tried four different models:

1 TF-IDF (300,000 dimensions)
2 LogEntropy (300,000 dimensions)
3 LogEntropy + LSA (200 dimensions)
4 LogEntropy + LDA (200 topics)

⇒ the target document is represented as a semantic vector of size n, the number of Wikipedia pages (inspired by ESA [Gabrilovich and Markovitch, 2007]).
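To make the idea concrete, here is a minimal sketch of this step with Gensim, using TF-IDF on an invented three-page stand-in for the Wikipedia collection (illustration only, not the actual text2wiki code; the page texts and target document are toy data, and LogEntropyModel, LsiModel or LdaModel could be plugged in the same way):

    from gensim import corpora, models, similarities

    # Toy stand-in for the Wikipedia page collection (invented texts).
    wiki_pages = {
        "Machine learning":            "algorithms that learn statistical models from data",
        "Natural language processing": "computational processing of human language text",
        "Graph theory":                "the study of graphs their nodes edges and paths",
    }
    tokenized = [text.lower().split() for text in wiki_pages.values()]

    dictionary = corpora.Dictionary(tokenized)
    bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

    tfidf = models.TfidfModel(bow_corpus)                      # term weighting
    index = similarities.SparseMatrixSimilarity(
        tfidf[bow_corpus], num_features=len(dictionary))       # one row per page

    # The target document becomes a semantic vector (one cosine similarity
    # per Wikipedia page); its largest entries are the relevant pages.
    target = "we trained a statistical model to process text documents"
    target_bow = dictionary.doc2bow(target.lower().split())
    semantic_vector = index[tfidf[target_bow]]

    for title, score in zip(wiki_pages, semantic_vector):
        print(title, round(float(score), 3))

In the real system the index holds one row per Wikipedia page, so the query returns a vector of length n as described above.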

Page 34: Document

Spreading activation in Wikipedia

Table of Contents

1 Expertise retrieval and skill extraction

2 The Elisit system for skill extraction
      Overview of the system
      Sample Queries
      Association with Wikipedia
      Spreading activation in Wikipedia

3 Evaluation of system

4 Conclusion and future work

Page 35: Document

Spreading activation in Wikipedia

Spreading activation in Wikipedia

2. Use Wikipedia's hyperlink network to find skills that are "close" or "central" to these relevant pages.

[Figure: activation spreading from the INITIAL PAGES towards SKILLS in the hyperlink network]

Page 41: Document

Spreading activation in Wikipedia

Spreading activation in Wikipedia

Formalization of spreading activation by Shrager et al. [1987]: if a(0) is a vector of initial activations, then after each time step t, the vector of activations is

a(t) = γ a(t − 1) + λ W^T a(t − 1) + c(t)

Parameters

T, the number of time steps
γ ∈ [0, 1] is a decay factor
λ ∈ [0, 1] is a friction factor
c(t) is an activation source vector
The link weight, element w_ij of W, determines the amount of activation that is spread from i to j.

Page 42: Document

Spreading activation in Wikipedia

Spreading activation in Wikipedia

a(t) = γ a(t − 1) + λ W^T a(t − 1) + c(t)

Thorough model selection is difficult because of the size of the problem.

We experimented with three versions of the model (a code sketch follows below):

model 1: a(t) = W^T a(t − 1)
model 2: a(t) = W^T a(t − 1) + a(t − 1)
model 3: a(t) = W^T a(t − 1) + a(0)

In addition, W is constrained to be row-stochastic.

More focus on the selection of the link weights than on the other parameters.
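A minimal sketch of these update rules with SciPy's sparse matrices, on an invented four-page toy graph (illustration only, not the actual wiki2skill code; the function name spread_activation is ours):

    import numpy as np
    import scipy.sparse as sp

    def spread_activation(W, a0, steps=3, variant=1):
        """W: row-stochastic sparse link matrix; a0: initial activation vector."""
        a = a0.copy()
        for _ in range(steps):
            spread = W.T @ a                  # activation flowing along the links
            if variant == 1:                  # model 1: pure spreading
                a = spread
            elif variant == 2:                # model 2: keep the current activation
                a = spread + a
            else:                             # model 3: re-inject the initial activation
                a = spread + a0
        return a

    # Invented toy graph with 4 pages and links 0->1, 0->2, 1->2, 2->3, 3->0.
    rows, cols = [0, 0, 1, 2, 3], [1, 2, 2, 3, 0]
    A = sp.csr_matrix((np.ones(5), (rows, cols)), shape=(4, 4))
    W = sp.diags(1.0 / np.asarray(A.sum(axis=1)).ravel()) @ A   # row-stochastic

    a0 = np.array([1.0, 0.0, 0.0, 0.0])                         # activate page 0
    print(spread_activation(W, a0, steps=3, variant=3))

In terms of the general formula above, model 1 corresponds to γ = 0, λ = 1, c = 0; model 2 to γ = 1, λ = 1, c = 0; and model 3 to γ = 0, λ = 1, c(t) = a(0).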

Page 43: Document

Spreading activation in Wikipedia

Spreading activation in Wikipedia

Observation from initial results:

Hubs get easily activated even if they are not relevant.
Common phenomenon with large graphs [Brand, 2005; von Luxburg et al., 2010].

Solution:

we bias the spreading to avoid hubs by setting (see the code sketch below)

w_ij = π_j^α / Σ_(i,k)∈E π_k^α

πj is a popularity index of j (degree / PageRank / HITS).

If α = 0, no biasing; if α < 0 popular nodes are avoided.

Biased random walks have, e.g., shorter return times than unbiased random walks [Fronczak and Fronczak, 2009].
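A small sketch of this weighting, assuming a binary adjacency matrix and in-degree as the popularity index π (illustration only; PageRank or HITS scores could be passed in instead, and the function name is ours):

    import numpy as np
    import scipy.sparse as sp

    def biased_transition_matrix(A, alpha=-0.5):
        """A: binary sparse adjacency matrix of the hyperlink graph (i -> j)."""
        pi = np.asarray(A.sum(axis=0)).ravel()       # in-degree as popularity index
        pi = np.maximum(pi, 1.0)                     # guard against 0**negative
        B = sp.csr_matrix(A.multiply(pi ** alpha))   # scale column j by pi_j**alpha
        row_sums = np.asarray(B.sum(axis=1)).ravel()
        row_sums[row_sums == 0.0] = 1.0              # leave dangling rows as zeros
        return sp.diags(1.0 / row_sums) @ B          # row-normalise: w_ij as above

    # Invented toy graph in which node 2 is a hub (three incoming links).
    rows, cols = [0, 0, 1, 1, 3, 3], [1, 2, 2, 3, 2, 0]
    A = sp.csr_matrix((np.ones(6), (rows, cols)), shape=(4, 4))

    print(biased_transition_matrix(A, alpha=0.0).toarray().round(3))   # unbiased
    print(biased_transition_matrix(A, alpha=-1.0).toarray().round(3))  # hub-avoiding

With α = 0 this reduces to the usual row-normalised transition matrix; with α < 0 the weight of links pointing to the hub (node 2 in the toy graph) shrinks.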

Page 44: Document

Table of Contents

1 Expertise retrieval and skill extraction

2 The Elisit system for skill extraction
      Overview of the system
      Sample Queries
      Association with Wikipedia
      Spreading activation in Wikipedia

3 Evaluation of system

4 Conclusion and future work

Page 45: Document

Evaluation of system

We evaluated the biasing strategy by seeing how well the system activates related skills, as defined by LinkedIn (each skill has ≤ 20 related skills).

Page 46: Document

Evaluation of system

We tested the biasing strategy by seeing how well the system activates related skills, as defined by LinkedIn.

           Pre@5               Pre@10              R-Pre               Rec@100
  α    din    PR   HITS    din    PR   HITS    din    PR   HITS    din    PR   HITS
  0   0.119 0.119 0.119   0.156 0.156 0.156   0.154 0.154 0.154   0.439 0.439 0.439
 -0.2 0.206 0.238 0.206   0.222 0.216 0.213   0.172 0.193 0.185   0.469 0.469 0.494
 -0.4 0.225 0.263 0.169   0.203 0.200 0.150   0.185 0.204 0.148   0.503 0.498 0.476
 -0.6 0.238 0.225 0.119   0.200 0.197 0.141   0.186 0.193 0.119   0.511 0.517 0.418
 -0.8 0.213 0.181 0.075   0.191 0.197 0.113   0.171 0.185 0.109   0.515 0.524 0.384
 -1   0.169 0.156 0.063   0.178 0.197 0.091   0.154 0.172 0.097   0.493 0.518 0.336

Table: The effect of the biasing parameter α and the choice of popularity index on the results in the evaluation of the module.

E.g., the top 5 most activated skills of all the ≈ 27 000 skills contain 1-2 of the ≤ 20 related skills, on average.

Also, biasing definitely improves retrieval results.
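For reference, a small self-contained sketch of the metrics reported in the table (Pre@k, R-Pre, Rec@100), computed for a single query; the ranked skills and the gold set of related skills below are invented:

    def precision_at_k(ranked, relevant, k):
        return sum(1 for s in ranked[:k] if s in relevant) / k

    def r_precision(ranked, relevant):
        r = len(relevant)
        return precision_at_k(ranked, relevant, r) if r else 0.0

    def recall_at_k(ranked, relevant, k):
        if not relevant:
            return 0.0
        return sum(1 for s in ranked[:k] if s in relevant) / len(relevant)

    # Invented example: skills ranked by activation vs. a gold set of related skills.
    ranked = ["Machine Learning", "Hadoop", "NLP", "SEO", "Data Mining"]
    relevant = {"Machine Learning", "NLP", "Statistics"}

    print(precision_at_k(ranked, relevant, 5))   # Pre@5   -> 0.4
    print(r_precision(ranked, relevant))         # R-Pre   -> 0.666...
    print(recall_at_k(ranked, relevant, 100))    # Rec@100 -> 0.666...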

Page 47: Document

Evaluation of system

We also ran a test comparing the different language models.

VSM            Pre@5   Pre@10   R-Pre   Rec@100
TF-IDF         0.231   0.214    0.190   0.516
LogEntropy     0.216   0.212    0.193   0.525
LogEnt + LSA   0.180   0.181    0.163   0.491
LogEnt + LDA   0.193   0.174    0.159   0.470

Table: Comparison of the different vector space models in terms of the performance of the whole system.

Page 48: Document

Table of Contents

1 Expertise retrieval and skill extraction

2 The Elisit system for skill extraction
      Overview of the system
      Sample Queries
      Association with Wikipedia
      Spreading activation in Wikipedia

3 Evaluation of system

4 Conclusion and future work

Page 49: Document

Conclusion

The Elisit system extracts explicit skills that are related to an arbitrary text input.

Combination of ESA-style conceptual mapping and spreading activation on the Wikipedia network.

Evaluation experiments suggest that using popularity-biased spreading activation improves retrieval results.

Page 50: Document

Future work

Improvement of link weights, e.g. by

computing content similarity of the Wikipedia pages
trying other structural similarity measures
using the category memberships of pages

Comparison with other strategies

More sophisticated (e.g. hierarchical) representation of results.

Also, the methodology could be applied to other purposes, e.g. a general topic model, by replacing skills with topics.

Page 51: Document

References

Krisztian Balog, Toine Bogers, Leif Azzopardi, Maarten de Rijke, and Antal van den Bosch. Broad expertise retrieval in sparse data environments. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 551–558. ACM, 2007.

Krisztian Balog, Paul Thomas, Nick Craswell, Ian Soboroff, Peter Bailey, and Arjen P. de Vries. Overview of the TREC 2008 Enterprise Track. Technical report, DTIC Document, 2008.

Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, and Luo Si. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2-3):127–256, 2012.

Ernst Biesalski. Knowledge management and e-human resource management. FGWM 2003, 2003.

M. Brand. A random walks perspective on maximizing satisfaction and profit. Proceedings of the 2005 SIAM International Conference on Data Mining, 2005.

Dan Crow and John DeSanto. A hybrid approach to concept extraction and recognition-based matching in the domain of human resources. In Tools with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE International Conference on, pages 535–541. IEEE, 2004.

Hongbo Deng, Irwin King, and Michael R. Lyu. Formal models for expert finding on DBLP bibliography data. In Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on, pages 163–172. IEEE, 2008.

Agata Fronczak and Piotr Fronczak. Biased random walks in complex networks: The role of local navigation rules. Physical Review E, 80(1):016107, 2009.

Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI'07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.

Radim Rehurek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.

Jeff Shrager, Tad Hogg, and Bernardo A. Huberman. Observation of phase transitions in spreading activation networks. Science, 236(4805):1092–1094, 1987.

U. von Luxburg, A. Radl, and M. Hein. Getting lost in space: large sample analysis of the commute distance. Proceedings of the 23rd Neural Information Processing Systems Conference (NIPS 2010), pages 2622–2630, 2010.
