
Extracting Special Information to Improve the Efficiency of Resume Selection Process

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science (by Research) in

Computer Science

by

Abhishek Sainani
200502001

[email protected]

Center for Data Engineering (CDE)
International Institute of Information Technology

Hyderabad - 500 032, INDIA
June 2011


Copyright © Abhishek Sainani, 2011

All Rights Reserved


INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, entitled "Extracting Special Information to Improve the Efficiency of Resume Selection Process" by Abhishek Sainani, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date Advisor: Prof. P. Krishna Reddy



Acknowledgements

First and foremost, all praise to God, who gave me all the help, guidance, and courage to finish my work. May God help me to convey all that I have learned for the benefit of society.

I am indebted to many people without whom this work would not have been possible. First and foremost, I would like to thank my advisor, P. Krishna Reddy, for his guidance, suggestions and constructive criticism throughout this project. Through his support, care, and patience, he has transformed a normal graduate student into an experienced researcher. His insight and ideas formed the foundation of this dissertation as much as mine did, and his guidance and care helped me get over various hurdles during my graduate years. He not only gave me the technical support and supervision that a graduate student could expect from his supervisor, but also the encouragement and moral support without which I would never have made it this far.

A list that alas has far too many names on it to mention separately is that of all the researchers and colleagues at the IT in Agriculture Lab and the Center for Data Engineering Lab (IIIT-H), my working place. It was a pleasure, fun, and stimulating to be in such an environment, and I am definitely sure that I am going to miss all of them.

I would also like to give special thanks to some of my friends, Mohit, Sushanta, Sashidhar, Manish, Hemant, Asrar, and my seniors, Sumit and Uday Kiran, who were always with me in my ups and downs, especially during this period of one and a half years. Thanks for all of the encouragement and support throughout this time. Without these people the journey would have been incomplete.

I would like to thank my parents and my sister for being with me even from before the beginning, and sometimes giving everything they have and more. Thank you for all the love, support and guidance, and for always believing in me.


Abstract

With the advancement of the internet, more and more content is becoming available in digital form on a regular basis. Such an overwhelming amount of available data has led to the problem of information overload. Various research efforts have been going on to develop improved methods to automatically search and analyze such large amounts of information on the internet. Most of the data on the internet is in textual form; hence text mining and text analytics, particularly information extraction, have become active research areas in recent years for extracting interesting and useful information.

In this thesis, we have made an effort to propose an improved information extraction approach for better resume processing. One of the areas of interest in recent years, to both the corporate and research communities, has been information extraction from resumes. In the current scenario, big enterprises and corporations receive thousands of resumes on a regular basis. Currently available techniques or services filter thousands of resumes down to some hundred potential ones based on the requirements of the available job position. Since these filtered resumes are similar to each other as per the filtering criteria, recruiters have to manually look through each of the hundred resumes to select the appropriate candidate for the job. Hence, there is a need to automatically extract data from resumes and mine information from it to help in the selection process.

We have investigated the problem of resume selection from a set of similar resumes and proposed an efficient framework to solve it. We have extended the notion of special features employed in the product selection environment and proposed an improved information extraction approach for resume processing. A special feature is a feature of a given product which distinguishes it from other products. Normally a resume is a semi-structured document and contains sub-sections. Each sub-section contains information in a different format. For example, the skill information exists as a set of skill names, whereas the experience section contains paragraphs describing different types of jobs/projects. A resume may have specialty in skills and specialty in experience.

We have made an effort to extract special features from resumes and have proposed approaches to extract special skills and special experience. At first we improved the special feature extraction algorithm and extended it to the skills sub-section of the resume. To extend the notion of special features to the experience section, we first proposed a short-text labeling algorithm and used it to label each paragraph, capturing the nature of the experience. Next, we used the labels as features to find special experience information and organized the labels. We have performed experiments on a real-world data-set of resumes and obtained encouraging results.

Overall, information extraction from resumes is a complex task. Even though several products are available, not much related work has been published. The proposed work is a part of an initial effort.



Contents

1 Introduction
   1.1 Issues in resume processing and related efforts
   1.2 Overview of the proposed problem
      1.2.1 Special features in e-Commerce environment
      1.2.2 Problem Overview
   1.3 Summary of the proposed work
      1.3.1 Approach to find special skills information
      1.3.2 Approach to find special experience information
   1.4 Contributions of the thesis
   1.5 Outline of the Thesis

2 Improving Special Feature Extraction and Extracting Special Skills
   2.1 Special features and special feature extraction approach
      2.1.1 About special features
      2.1.2 Special feature extraction and organization
   2.2 Improved special feature extraction and organization approach
      2.2.1 Quality Threshold based clustering algorithm
      2.2.2 Performance Comparison
         2.2.2.1 Dataset Description
         2.2.2.2 Quality Criteria
         2.2.2.3 Experimental Results
   2.3 Extracting Special Skills Information from Resumes
      2.3.1 Skills Section Structure
      2.3.2 Proposed Approach
      2.3.3 Overall Framework
      2.3.4 About Using the Proposed Framework
      2.3.5 Experimental Results
         2.3.5.1 Performance Results for Skill Type Features (STFS)
         2.3.5.2 Results for Skill Value Features (SVFS)
      2.3.6 Discussion on errors
   2.4 Summary of the Chapter

3 Extracting Special Experience Information from the Resumes
   3.1 Issues in extracting special information from experience section
   3.2 Proposed labeling approach for short text
      3.2.1 Background and problem definition
      3.2.2 Related Research in Short Text Labeling
      3.2.3 Notion of Term-Label Affinity
      3.2.4 Proposed Labeling Approach
         3.2.4.1 Proposed Framework
         3.2.4.2 Proposed Approaches
      3.2.5 Performance comparison related to labeling approach
         3.2.5.1 Dataset Description
         3.2.5.2 Experiments 1: Normal labels
         3.2.5.3 Experiments 2: Higher level labels
      3.2.6 Discussion on errors
   3.3 Extracting Special Experience Information from Resumes
   3.4 Summary of the Chapter

4 Conclusion and future work

5 Publications

6 Appendix

Bibliography


List of Figures

1.1 Hierarchical structure of Resume in Table 1.1
2.1 Three-level feature organization
2.2 Performance of QT and Naive algorithms on Intra-Cluster Similarity criteria
2.3 Performance of QT and Naive algorithms on Inter-Cluster Similarity criteria
2.4 Performance of QT and Naive algorithms on Maximum Total Similarity criteria
2.5 Hierarchical structure of Skills
2.6 Flow diagram of overall framework
3.1 Hierarchical structure of Experience Section
3.2 Schematic Representation of Learning Framework
3.3 Schematic Representation of Learning Framework
3.4 Average accuracy scores by various algorithms for labeling short text paragraphs using normal labels
3.5 Average accuracy scores by various algorithms for labeling short text paragraphs using higher level labels


List of Tables

1.1 Sample Resume with corresponding sections and their respective features
2.1 Algorithm: Special feature extraction algorithm
2.2 Algorithm: Improved Three-level algorithm
2.3 Sample Product Features of N-77 and N-82 mobile phones separated by delimiter (,)
2.4 Sample Features for Skill Tag
2.5 Algorithm to calculate STFS and SVFS
2.6 All Skill Type Features
2.7 Reduction Factor values for skill types using proposed approach
2.8 Reduction Factor values for skill types using previous approach
2.9 Skill Type Feature Organization Statistics
2.10 Organization of features (skill types) using three-level approach for Resume data-set
2.11 Reduction Factor values for skill values for each skill type using Proposed Approach
2.12 Reduction Factor values for skill values for each skill type using Previous Approach
2.13 Organization of features (skill value :: database systems) using three-level approach for Resume data-set
2.14 Skill Value (Database Systems) Feature Statistics
2.15 Organization of features (skill value :: programming languages) using three-level approach for Resume data-set
2.16 Skill Value (Programming Languages) Feature Statistics
3.1 Sample Experience section of a resume with corresponding subsections and their respective features
3.2 P1 (left) and P2 (right) have no words in common, yet they could have the same label
3.3 Learning Algorithm - Creation of TLM with TF-ILF as affinity score
3.4 Predefined normal labels
3.5 Concept Hierarchy
3.6 Predefined higher level labels
3.7 Sample Experience section of a resume with corresponding subsections and their respective features
3.8 Training/Testing dataset ratio
3.9 Organization of features (experience types) using improved special feature extraction approach with normal labels
3.10 Experience Feature Statistics with normal labels
3.11 Reduction Factor values for experience types (normal labels)
3.12 Reduction Factor values for experience types (higher level labels)
3.13 Organization of features (experience types) using improved special feature extraction approach with higher level labels
3.14 Experience Feature Statistics with higher level labels
6.1 Organization of features (skill value :: Server Side Scripting) using three-level approach for Resume data-set
6.2 Skill Value (Server Side Scripting) Feature Statistics
6.3 Organization of features (skill value :: operating systems) using three-level approach for Resume data-set
6.4 Skill Value (Operating System) Feature Statistics
6.5 Organization of features (skill value :: Assembly Language) using three-level approach for Resume data-set
6.6 Skill Value (Assembly Language) Feature Statistics
6.7 Organization of features (skill value :: web technologies) using three-level approach for Resume data-set
6.8 Skill Value (Web Technologies) Feature Statistics
6.9 Organization of features (skill value :: Mobile Platforms) using three-level approach for Resume data-set
6.10 Skill Value (Mobile Platforms) Feature Statistics
6.11 Organization of features (skill value :: scripting languages) using three-level approach for Resume data-set
6.12 Skill Value (Scripting Languages) Feature Statistics
6.13 Organization of features (skill value :: Compiler Tools) using three-level approach for Resume data-set
6.14 Skill Value (Compiler Tools) Feature Statistics
6.15 Organization of features (skill value :: software tools) using three-level approach for Resume data-set
6.16 Skill Value (Software Tools) Feature Statistics
6.17 Organization of features (skill value :: Libraries/APIs) using three-level approach for Resume data-set
6.18 Skill Value (Libraries/APIs) Feature Statistics
6.19 Organization of features (skill value :: IDEs) using three-level approach for Resume data-set
6.20 Skill Value (IDEs) Feature Statistics


Chapter 1

Introduction

In this internet era, decision makers are facing the problem of information overload. To help in better decision making, researchers are making efforts to propose efficient approaches to extract useful information/knowledge from structured [1], semi-structured [17, 5] and unstructured [9] data.

Most of the information online is in the form of text, so it has become essential to develop better techniques and algorithms to extract useful and interesting information from this large amount of textual data. Hence, text mining, text analytics and information extraction have become popular areas of research in recent years. In this thesis, we make efforts to propose improved information extraction and organization methods to improve the resume selection process. A resume is a document used by individuals to present their background and skill-sets. In the current scenario, a large number of resumes are received online, through e-mails or through services provided by companies like Info Edge (India) Limited [8]. One of the research issues, in addition to efficient searching, is information extraction from resumes for better processing of resumes.

In this thesis, we have made an effort to extend the notion of special features [22] to extract special information from resumes. The notion of special features has been proposed in the literature to improve the process of product selection in the e-commerce environment. The proposed framework has the potential to help recruiters process a large number of resumes in a better manner.

The remaining part of the chapter is organized as follows. In the next section, we discuss the issues in resume processing and related research efforts. The overview of the proposed problem is discussed in Section 1.2. The summary of the thesis is presented in Section 1.3. In Section 1.4, we list the contributions of this thesis. The last section contains the thesis organization.

1.1 Issues in resume processing and related efforts

A resume is a multi-topic document where each section describes a different aspect of an individual. It is represented as a semi-structured document. We assume that each resume has a hierarchical organization of text [24], where the order of sections and the organization within each section may differ across resumes.

Table 1.1 shows a sample student resume that contains multiple topics like education, experience, skills and achievements as different sections. Each section contains words and sentences as features. The numbering in each section denotes a feature separated by a delimiter ('newline' in our case). Figure 1.1 shows the corresponding hierarchical structure for Table 1.1. The top layer, termed 'Layer 0', contains the Resume ID. It can be observed that sections like education, experience, skills and achievements form the first layer of the resume.


Table 1.1 Sample Resume with corresponding sections and their respective features

Education
1. b.tech. (computer science & engineering) iiit, hyderabad (expected may, 2009) 6.66/10 cgpa.
2. senior secondary instrumental school, kota (cbse board 2004) 72%.
3. secondary st. sr. sec. school, ajmer (cbse board 2002) 83%.

Skills
1. programming languages: c, c++
2. operating systems: windows 98/2000/xp, gnu/linux
4. scripting languages: shell, python
5. web technologies: html, cgi, php
6. software tools: microsoft office, latex, gnu/gcc, visual studio 2005/08
7. database technologies: mysql

Experience
1. audio-video conferencing over ip networks:
2. duration: nov. 2007 - nov. 2008 team size: 2. technical environment: c++ abstract: the objective of this project was to develop an audio/video conferencing system which enables multiple users to communicate with each other via a global server with improved efficiency in terms of voice clarity and low latency. the system is equipped with resources to facilitate text chat, voice chat and voice/video chat between multiple clients. this client server application was developed using c++ and .net framework in windows environment.
3. windows firewall
4. duration: july-nov 2007 team size: 1 technical environment: c abstract: packets from or to a network are analyzed and according to the user's settings actions are taken on how the packets would be handled. various options are provided to the user in accordance to which action is taken, ranging from what the packet contains to the source of the packet.
5. document request form automation
6. duration: sep-nov 2006 team size: 2 technical environment: php, mysql abstract: project developed for iiit hyderabad administration. this web-based tool automates the processing of the various documents.
7. implementation of outer loop join
8. duration: jan-march 2007 team size: 1 abstract: implementation of the above operation as a part of the database management systems course.
9. myshell
10. duration: aug-oct 2006 team size: 1 abstract: developed a program which acted as a shell, starting and running command line arguments as part of our operating systems course.

Achievements
1. secured 1573 air in all india engineering entrance examination, 2005.
2. secured 2216 air in iit-jee screening examination, 2005
3. cleared national talent search examination level 1 in 2002.
4. was among the finalists of the rajasthan state science talent search


Figure 1.1 Hierarchical structure of Resume in Table 1.1
[Figure: a tree with the Resume ID at Layer 0; the Education, Skills, Experience and Achievements sections at Layer 1; and each section's textual information (consisting of words and sentences) at Layer 2.]

Each section is described by text containing words and sentences, which forms the second layer of the resume. Based on the structure of the content, the text of each section in the second layer can also be organized into several layers.
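To make this hierarchy concrete, the following minimal Python sketch represents a resume under the 'newline' delimiter described above; the dictionary layout and helper name are illustrative, not taken from the thesis.

```python
# Layer 0: the resume ID; Layer 1: sections; Layer 2: per-section features.
# Illustrative sketch only; the thesis does not prescribe this data structure.
resume = {
    "resume_id": "R001",
    "sections": {
        "education": "b.tech. (computer science & engineering) iiit, hyderabad",
        "skills": "programming languages: c, c++\noperating systems: gnu/linux",
        "experience": "audio-video conferencing over ip networks",
        "achievements": "secured 1573 air in aieee, 2005",
    },
}

def section_features(section_text):
    """Split a section into features on the 'newline' delimiter."""
    return [line.strip() for line in section_text.split("\n") if line.strip()]

for name, text in resume["sections"].items():
    print(name, "->", section_features(text))
```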

The main issues in processing resumes are investigating better search and information extraction methods. Normally, both search and information extraction functionalities are included in a resume processing system. Information extraction is a process that automatically extracts predefined types of information from unstructured documents. Information extraction also includes structuring, grouping and preparing the found data to populate a database. Resume information extraction, an application area of information extraction, is also called resume parsing. It enables extraction of relevant information from resumes, which have a semi-structured form. Although there are many commercial products for resume information extraction, there has been little published research work in this area. Some of the commercial products include Sovren Resume/CV Parser [10], Akken Staffing [20], ALEX Resume Parsing [11], ResumeGrabber Suite [21], and Daxtra CVX [3]. The specifications of these products and the methods and algorithms used for resume information extraction are not available.

Following are some of the research efforts in resume information extraction.

A cascaded two-pass IE framework was designed in [24]. In the first pass, general information is extracted by segmenting the entire resume into consecutive blocks, and each block is annotated with a label indicating its category. In the second pass, detailed information pieces are further extracted within the boundaries of certain blocks. Moreover, for different types of information, the most appropriate extraction method is selected through experiments. For the first pass, since there exists a strong sequence among blocks, a Hidden Markov Model is applied to segment a resume, and each block is labeled with a category of general information. They also apply a Hidden Markov Model for extracting detailed educational information, for the same reason. In addition, a classification-based method is selected for extracting detailed personal information, where information items appear relatively independently.

In [14], the authors give an outlook of an ongoing project on deploying information extraction techniques in the process of converting any kind of raw application documents written in Polish, such as curricula vitae, motivation letters or application forms, into compact and highly structured data. They pinpoint the challenging issues to be faced and potential benefits in the area of learning systems, HR and recruitment modules of information systems.


A four-phase approach for resume processing is proposed in [6]. In the first step, a resume is segmented into blocks according to their information types. In the second step, named entities are found by using special chunkers for each information type. In the third step, the found named entities are clustered according to their distance in the text and their information type. In the fourth step, normalization methods are applied to the text. In the end, the extracted information is produced in JSON or XML format.

1.2 Overview of the proposed problem

We extended the notion of special features for better resume processing. We first discuss special features and explain the problem overview. Next, we present an overview of the proposed work.

1.2.1 Special features in e-Commerce environment

Due to information overload on the internet, a person often finds it difficult to distinguish between various similar objects or pieces of information. This is widely seen in the e-commerce environment, where a customer faces several difficulties while selecting a product because there are many variants of each product available in the market but with little deviation in their features. The customer has to carefully browse through a large amount of information about similar products to buy just one product. The authors in [22] propose the notion of 'special feature', according to which a special feature of a product helps distinguish it from other similar products. They show that if a customer starts browsing the products by looking at the special features first, and then at the common features, he can easily distinguish between similar products and make a decision with less effort.

1.2.2 Problem Overview

The existing approaches to resume processing focus on identifying and extracting all the information from each resume and storing each piece of information based on the nature of its content. This helps in further processing of the information in the resume. Currently available techniques or services employed by various large enterprises help their Human Resource (HR) managers filter thousands of resumes down to a few hundred potential ones. Since these filtered resumes are similar to each other, the managers have to manually look through each resume to select the appropriate candidate.

We define 'similar resumes' as the set of resumes that HR managers get after filtering through their own resume management systems. We have investigated the problem of resume selection from the set of similar resumes and proposed an efficient framework to solve it.

For this we extend the notion of 'special features' to the context of resumes, and call it 'special information'. We consider that there may exist special information in some resumes when compared to others. For example, a resume may contain specialty in education, specialty in experience, special skills or special achievements. Special information may exist in one or more sections of a resume. Thus identifying such special information and organizing it efficiently helps in improving the performance of the resume selection process.

For example, consider a group of students from the same streams, like computer science or electronics. There are some common skills possessed by all the students in the group, and each student also possesses some special skills that differentiate him/her from the rest of the students in the group. At the time of applying for a job, each student posts his/her resume to the companies of his/her choice. Each student tries to reflect his/her special skills in the resume.


Thus, if we can extract such special skills for each student and organize the information contained in the resumes in an efficient manner, it would help recruiters and HR managers to efficiently select the appropriate resumes from a set of similar resumes.

However, it is not a straightforward task to extend the notion of special features to the resume processing environment. In the product selection environment, a product is a set of atomic features, whereas a resume contains several sub-topics/sections, and each section contains a different kind of text. For example, the experience section contains long sentences in free-form text, while the skills section contains skill types (programming languages) and skill values (c++, java). So, developing an approach to process the resume dataset is a complex task, as separate approaches have to be developed for dealing with each type of text. The skills section contains different types of structure, and the experience section contains a set of paragraphs. So the main research issue is to convert the content of every section into a set of features and extend the special feature framework.

1.3 Summary of the proposed work

In this thesis we first examined the special feature extraction algorithm proposed for the product selection environment and proposed an improved algorithm by modifying the clustering approach. We use the notion of special information to address the resume selection problem for two sections of the resume: the skills section and the experience section.

1.3.1 Approach to find special skills information

First we proposed an alternative clustering approach to improve the performance of the special feature extraction algorithm proposed in the literature [22]. Next we extended the notion of special features to the skills section of the resume. In the skills section, a candidate lists his skills in a particular domain. The skills section is relatively well organized, with predefined standard terms specific to the area of expertise of the candidate. We use these terms as features and apply the improved special feature extraction framework to reduce the redundancy in the data and present it in a more desirable and effective form.

1.3.2 Approach to find special experience information

The experience-related information contains several paragraphs and is written in free-form text, having predefined standard terms specific to the area of expertise of the candidate. To convert the paragraphs into features, we have proposed a semi-supervised approach to label the paragraphs. Since each experience text is a short text, traditional classification/labeling techniques performed poorly, so we propose a technique to label each short text with a domain-specific label. Using the proposed labeling algorithm, each experience text is assigned a label corresponding to the domain of the project. The label assigned is the name of one of the domains in the candidate's field of expertise. The special feature extraction algorithm is then used on the label dataset to extract special experience information.

1.4 Contributions of the thesis

The major contributions of this thesis are as follows:


• Addressed the problem of resume selection from similar resumes.

• Proposed an improved special feature extraction algorithm.

• Proposed a framework to extract special skills information.

• Proposed a short text labeling algorithm.

• Proposed a framework to extract special experience information.

1.5 Outline of the Thesis

The rest of the thesis is organized as follows. In the next chapter, we propose an improved special feature extraction algorithm and present a framework to extract special skills information from resumes. In Chapter 3, we present an algorithm to label short text paragraphs and present a framework to extract special experience information from resumes. The last chapter contains conclusions and future work.


Chapter 2

Improving Special Feature Extraction and Extracting Special Skills

A resume contains several sections, which include information on education, skills, experience and other topics. In this chapter, we have extended the notion of special features to extract special skills from the resume dataset. At first, we propose a modification to the special feature extraction algorithm discussed in the literature. Next, we extended the improved algorithm and proposed a framework to extract special skills.

2.1 Special features and special feature extraction approach

In this section, we first explain the notion of special features and discuss the special feature extraction algorithm proposed in [22] for improved product selection in the e-commerce environment. Next, we present the improved algorithm by incorporating modifications to the special feature extraction algorithm.

2.1.1 About special features

Every product has a set of features. While some of its features are common with other products in the same category, every product has one or more features that distinguish it from other, seemingly similar products. We call these features 'special features' because they can help a customer distinguish between products and make a decision to choose the right product. We measure the specialness of a feature in a product using the 'Degree of Specialness', as described below.

Degree of Specialness: Let f_j be a feature such that f_j ∈ f(p_i), where p_i is the i-th object and f(p_i) is the set of features of the i-th object. The degree of specialness (DS) of a feature f_j is its capability of making the product p_i separate/distinct/unique/special from other products. The DS value of a feature varies between zero and one (both inclusive). The DS value of the feature f_j is denoted by DS(f_j). Then,

   DS(f_j) = 1                    if n(f_j) = 1
   DS(f_j) = 1 − n(f_j)/|P|       otherwise        (2.1)

where n(f_j) is the number of objects whose feature sets contain f_j and |P| is the total number of objects.

Based on their DS values, features can be classified as common features, common cluster features and special features. Features for which the DS value is 0 are called common features. Features for which the DS value is close to 1 are called special features. The remaining features are called common cluster features.
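As a quick check of Equation 2.1, the following Python sketch computes DS values and the induced common/special split for a toy feature set; the function name and data are illustrative, not from the thesis.

```python
from collections import Counter

def degree_of_specialness(feature_sets):
    """DS(f) = 1 if n(f) == 1, else 1 - n(f)/|P| (Equation 2.1), where n(f)
    counts the objects whose feature set contains f."""
    n = Counter(f for fs in feature_sets for f in set(fs))
    P = len(feature_sets)
    return {f: 1.0 if c == 1 else 1.0 - c / P for f, c in n.items()}

# Toy example: 'gsm' appears in every product (DS = 0, a common feature);
# 'dvb-h' and 'wifi' appear once each (DS = 1, special features);
# 'bluetooth' falls in between (a common cluster feature).
products = [{"gsm", "bluetooth", "dvb-h"}, {"gsm", "bluetooth"}, {"gsm", "wifi"}]
ds = degree_of_specialness(products)
print(sorted(f for f, v in ds.items() if v == 0.0))  # ['gsm']
print(sorted(f for f, v in ds.items() if v == 1.0))  # ['dvb-h', 'wifi']
```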


As per their DS values, the features are distributed into three levels: I-level, II-level and III-level. Figure 2.1 depicts the organization of features using the three-level approach. I-level contains the common features, II-level contains the common cluster features and III-level contains the special features. It can be noted that, for any object p_i, its complete set of features f(p_i) is a combination of (i) the common features at I-level, (ii) the common cluster features of the cluster in which p_i is a member, and (iii) the special features of object p_i at III-level.

Figure 2.1 Three-level feature organization
[Figure: I-level holds the common features of all the products; II-level holds the common cluster features (e.g., the common features of P1 and P2, and of P3 and P4); III-level holds the special features of each individual product P1–P4.]

2.1.2 Special feature extraction and organization

In this section, we present the approach proposed in [22] to extract and organize special features. The algorithm employs a naive clustering approach to group a set of objects into different clusters. The algorithm is given in Table 2.1. The input to the clustering algorithm is a set of objects P, a similarity threshold (ST) and a feature set F (the features of all the objects). The algorithm produces clusters of those objects.

Let p_i be the i-th object, f(p_i) the set of features of object p_i, CL(j) the j-th cluster and CF(j) the set of all the features of the j-th cluster.

The similarity between the object p_i and the cluster CL(j) is denoted by sim(p_i, CL(j)) and is calculated as follows:

   sim(p_i, CL(j)) = |f(p_i) ∩ CF(j)|

The clustering algorithm works as follows. The first cluster is initialized with the first object. For each other object p_i, if the similarity of p_i with one or more existing clusters is greater than the similarity threshold, the object p_i is put into the cluster with maximum similarity; otherwise, a new cluster is initialized with p_i. Note that once assigned to a cluster, an object cannot become a member of another cluster.

After forming the clusters, the features of each cluster are organized into three levels. I-level contains the features with DS value 0. II-level contains the common features of each cluster. III-level contains the remaining special features of each object. Thus, the algorithm computes the common features, common cluster features and special features of each object, with the formation of clusters as an intermediate step. Each cluster CL(j) is associated with a feature set CF(j), where CF(j) represents the features that are common among all the products present in the cluster CL(j).


Table 2.1 Algorithm: Special feature extraction algorithm

Input: n: number of products; P: set of n products; F: set of features of all n products; ST: similarity threshold.
Output: I-level / II-level / III-level features

1. Formation of Clusters
2. Notations used
2.1. nc: number of clusters; i, j: integers;
2.2. CL(i): the i-th cluster (i <= n);
2.3. CF(i): set of features of the i-th cluster;
2.4. f(p_i): set of features of product p_i.
3. nc = 0; for i = 1 to n { CL(i) = φ and CF(i) = φ }
4. Select the first product p_1;
4.1.   CL(1) = CL(1) ∪ {p_1};
4.2.   CF(1) = CF(1) ∪ f(p_1);
4.3.   nc = nc + 1
5. for each product p_i ∈ P − {p_1}
6.   for each cluster CL(j) (1 ≤ j ≤ nc)
7.     if sim(p_i, CL(j)) = max_j(sim(p_i, CL(j))) ≥ ST, then
8.       { CL(j) = CL(j) ∪ {p_i};
9.         CF(j) = CF(j) ∩ f(p_i) }
10.    else { nc = nc + 1;
11.      CL(nc) = CL(nc) ∪ {p_i};
12.      CF(nc) = CF(nc) ∪ f(p_i) }
13.  end
14. end

15. Calculate I-level Features.
16.   Select all f_k from F such that DS(f_k) = 0.

17. Calculate II-level Features.
18.   For each cluster CL(j) (1 ≤ j ≤ nc),
19.     the corresponding CF(j) contains the II-level features.

20. Calculate III-level Features.
21.   For each p_i ∈ P,
22.     calculate the cluster number clno of p_i;
23.     special features of p_i = f(p_i) − CF(clno) − common features.
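As one possible reading of Table 2.1, the Python sketch below clusters objects by feature overlap and then derives the three levels. It assumes CF(j) is maintained as the common features of cluster j (intersection on update), which is how the text above describes CF(j); all names and data are illustrative, not from the thesis.

```python
def naive_three_level(feature_sets, st):
    """Sketch of Table 2.1: one-pass clustering with sim(p, CL) = |f(p) ∩ CF|,
    then I-level (common), II-level (common cluster) and III-level (special)
    feature organization. feature_sets is a list of sets of feature strings."""
    clusters, cluster_feats = [[0]], [set(feature_sets[0])]
    for i in range(1, len(feature_sets)):
        f = set(feature_sets[i])
        sims = [len(f & cf) for cf in cluster_feats]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= st:                 # join the most similar cluster
            clusters[j].append(i)
            cluster_feats[j] &= f         # CF keeps only the common features
        else:                             # otherwise start a new cluster
            clusters.append([i])
            cluster_feats.append(f)
    common = set.intersection(*map(set, feature_sets))     # I-level: DS = 0
    level2 = [cf - common for cf in cluster_feats]         # II-level
    level3 = {i: set(feature_sets[i]) - cluster_feats[j] - common  # III-level
              for j, members in enumerate(clusters) for i in members}
    return common, level2, level3

phones = [{"gsm", "bt", "fm"}, {"gsm", "bt", "wifi"}, {"gsm", "dvb-h"}]
print(naive_three_level(phones, st=2))
# e.g. ({'gsm'}, [{'bt'}, {'dvb-h'}], {0: {'fm'}, 1: {'wifi'}, 2: set()})
```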


Organizing the features using the three-level method is an iterative process. The value of the similarity threshold should be chosen such that the products are clustered into a reasonable number of clusters and the number of features shown to the user is reduced. For example, ST could be chosen as fifty percent of the average number of features in a product, after eliminating the common features. The threshold can then be gradually increased, observing the number of clusters formed and the total number of features shown to the user. If the number of features to be shown decreases significantly, we can increase the threshold further and check again. It can be observed that if the threshold is decreased, the number of common features in each cluster decreases and, consequently, the number of features shown to the user increases. The objective of clustering the objects is to reduce the effort of users by providing them with a more convenient view containing fewer features; a large number of clusters leads to more confusion. Finally, the ST threshold can be set to the value which gives the minimum number of clusters and the minimum number of features to be shown to the user.
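This iterative tuning could be scripted, for example, by sweeping ST and watching the cluster count and the total number of displayed features; this reuses the hypothetical naive_three_level function and phones list from the sketch above.

```python
# Sweep the similarity threshold and report how the organization changes.
for st in range(1, 5):
    common, level2, level3 = naive_three_level(phones, st)
    shown = (len(common) + sum(len(cf) for cf in level2)
             + sum(len(sf) for sf in level3.values()))
    print(f"ST={st}: clusters={len(level2)}, features shown={shown}")
```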

2.2 Improved special feature extraction and organization approach

The clustering algorithm presented in the preceding section is naive. Its limitations are as follows.

• The order in which the objects are taken as input influences the cluster quality.

• Two objects which should be in the same cluster may end up in different clusters, since an object assigned to one cluster cannot be reassigned to another cluster.

In this section, we present an alternative clustering algorithm based on the notion of quality threshold and show the performance results.

2.2.1 Quality Threshold based clustering algorithm

It can be observed that, since all the data objects are available beforehand, each object can be compared with all the other objects to find the closest match and form better clusters. We extend the notion of Quality Threshold (QT) clustering to improve the special feature extraction process.

The goal of QT clustering is to form large clusters of objects with similar features. In [7], the quality of a cluster is defined by the cluster diameter and the minimum number of objects contained in each cluster. We adapt the QT clustering algorithm to the special feature extraction algorithm by redefining the notion of cluster diameter as follows: in the proposed context, the cluster diameter is the 'number of common features' among all the members of a cluster. We also fix a minimum threshold on the number of common features a cluster should have.

Description of the proposed algorithm
Similarity is defined as the number of common features between the cluster centroid and a new object. Here, the cluster centroid is simply the set of common features of all the objects in the cluster. The process of clustering is as follows. At first, we fix the similarity threshold (ST) value.

i. For each object p_i in the object list P, do the following. Initialize p_i as a cluster; from the remaining objects, if an object p_j has the maximum number of common features with this cluster, and if that number of common features is not lower than ST, then add p_j to the cluster. Continue as long as the above conditions are satisfied by some object p_j.


Table 2.2 Algorithm: Improved Three-level algorithm

Input: n: number of objects; P: set of n objects; F: set of features of all n objects; ST: similarity threshold; MCF: minimum number of common features.
Output: I-level / II-level / III-level features

1. Formation of Clusters
2. Notations used
2.1 nc: number of clusters; i, j: integers;
2.2 CL[i]: the i-th cluster (i <= n);
2.3 CF[i]: set of features of the i-th cluster;
2.4 f(p_i): set of features of object p_i.
3.1 if (|P| ≤ 1) then output P;
3.2 else do
3.3   /* Base Case */
      for each p_i ∈ P
        set flag = TRUE;
3.4     CL[i] = p_i  /* CL[i] is the cluster started by p_i */
3.5     while ((flag == TRUE) and (CL[i] ≠ P))
          find p_j ∈ (P − CL[i]) such that the number of common features of CF[i] ∪ f(p_j) is maximum
3.6       if the number of common features of CF[i] ∪ f(p_j) < MCF
            then set flag = FALSE;
3.7       else
            set CL[i] = CL[i] ∪ p_j  /* add p_j to cluster CL[i] */
      end for
3.8 identify the set C ∈ CL with maximum cardinality.
3.9 output C
3.10 P = {P − C}
3.11 Repeat from 3.1

4.1 Calculate I-level Features.
4.2   Select all f_k from F such that DS(f_k) = 0.

5.1 Calculate II-level Features.
5.2   For each cluster CL(j) (1 ≤ j ≤ nc),
5.3     the corresponding CF(j) contains the II-level features.

6.1 Calculate III-level Features.
6.2   For each p_i ∈ P,
6.3     calculate the cluster number clno of p_i;
6.4     special features of p_i = f(p_i) − CF(clno) − common features.


ii. At the end of the first step, one cluster has been formed for each object, by taking that object as the first member of the cluster. A QT cluster is then selected: the largest cluster which has the maximum number of common features amongst its members. The objects within this cluster are removed from consideration, i.e., from the object list P.

iii. Repeat steps (i) and (ii) until no more clusters can be formed.

iv. Objects that do not belong to any cluster are shown as singleton clusters.

The complexity of the proposed algorithm is O(n²).
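The following Python sketch is one way to realize steps (i)–(iv), under the same illustrative naming as before; candidate selection and tie-breaking details are not fixed by the thesis, so they are assumptions here.

```python
def qt_cluster(feature_sets, mcf):
    """QT-style clustering sketch: grow a candidate cluster around every
    remaining object, keep the largest, and repeat; objects that attract
    no neighbors end up as singleton clusters."""
    remaining = set(range(len(feature_sets)))
    clusters = []
    while remaining:
        best = None
        for seed in remaining:                      # step (i): grow a candidate
            members, common = [seed], set(feature_sets[seed])
            candidates = remaining - {seed}
            while candidates:
                pj = max(candidates,
                         key=lambda j: len(common & set(feature_sets[j])))
                if len(common & set(feature_sets[pj])) < mcf:
                    break                           # would drop below MCF: stop
                members.append(pj)
                common &= set(feature_sets[pj])
                candidates.remove(pj)
            if best is None or len(members) > len(best):
                best = members                      # step (ii): keep the largest
        clusters.append(sorted(best))
        remaining -= set(best)                      # steps (iii)-(iv): repeat
    return clusters

print(qt_cluster([{"a", "b"}, {"a", "b", "c"}, {"d"}], mcf=2))  # [[0, 1], [2]]
```

Unlike the naive algorithm, the result does not depend on the input order, because every remaining object gets a chance to seed a cluster before the best one is committed.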

2.2.2 Performance Comparison

Here we present the performance comparison of the proposed approach against the previous approach discussed earlier in this section. We conduct the experiments on real-world product feature data-sets of mobile phones and laptops. Specifications for different models of mobile phones and laptops were extracted from web sites. We then discuss the various quality criteria used to compare the performance of the proposed approach against the previous approach.

2.2.2.1 Dataset Description

The details of the data-sets are as follows:

• Mobile phones dataset [13]: It contains the details of 16 Nokia mobile phone models: N-70, N-72, N-73, N-75, N-77, N-80, N-81, N-82, N-85, N-90, N-91, N-92, N-93, N-95, N-96, N-97. The total number of features comes to 382.

• Laptop dataset [19]: It contains the details of 10 HP laptop models: CQ50Z, HDX, dv2700t, dv2800t, dv5t, dv7z, dv6700t, dv9700t, tx2000z and tx2500z. The total number of features comes to 320.

We present sample product features of N-77 and N-82 mobile phones in Table 2.3.

2.2.2.2 Quality Criteria

The quality criteria used for performance evaluation are described as follows. Let P_i represent an object and C_j represent a cluster. S(P_i, P_k) represents the similarity between two objects P_i and P_k, where P_i and P_k are represented by their frequency vectors and their similarity is calculated using the cosine coefficient.

1. Intra-Cluster Similarity: It represents the compactness of clusters. In this criterion, we maximize the average intra-cluster similarity. Intra-cluster similarity is computed using the following equation:

   ∑_{P_k ∈ C_j} S(P_i, P_k),   P_i ∈ C_j        (2.2)


Table 2.3 Sample Product Features of N-77 and N-82 mobile phones, separated by delimiter (,)

N-77 Features:
network umts gsm 900 gsm 1800 gsm 1900, announced 2007 1q (february), weight 114 g, display type tft 16m colors, display size 240 x 320 pixels 2.4 inches, ringtones type polyphonic (64 channels) mp3, vibration yes, phonebook yes, call records yes, card slot microsd (transflash) hotswap, operating system symbian os 9.2 s60 rel 3.1, camera 2 mp 1600x1200 pixels video(cif) flash secondary cif video call camera, gprs/data speed class 11, messaging sms mms email instant messaging, infrared port no, games yes java downloadable, colors black, 3g 384 kbps, bluetooth v1.2 with a2dp, dvb-h tv broadcast receiver, push to talk, java midp 2.0, mp3/m4a/aac/eaac+/wma player, t9, stereo fm radio, voice command/dial, browser wap 2.0/xhtml html, pim including calendar to-do list and printing, document viewer, photo/video editor

N-82 Features:
network gsm 850 900 1800 1900 hsdpa, announced 2007 4q (november), weight 114 g, display type tft 16m colors, display size 240 x 320 pixels 2.4 inches, ringtones type polyphonic monophonic true tones mp3, vibration yes, phonebook practically unlimited entries and fields photocall, call records detailed max 30 days, camera 5 mp 2592 x 1944 pixels carl zeiss optics autofocus video(vga 30fps), operating system symbian os 9.2 s60 rel 3.1, card slot microsd hot swap 2 gb card included, gprs/data speed class 32 107 kbps, messaging sms mms email instant messaging, infrared port no, games yes downloadable, video calling, wlan wi-fi 802.11 b/g upnp technology, bluetooth v2.0 with a2dp, t9, usb v2.0 microusb, built-in gps receiver, motion sensor (with ui auto-rotate), java midp 2.0, mp3/aac/aac+/eaac+/wma player

2. Inter-Cluster Similarity: It represents the isolation of clusters. In this criterion, we minimize the average inter-cluster similarity. The inter-cluster similarity is computed as follows:

   ∑_{P_m ∉ C_j} S(P_i, P_m),   P_i ∈ C_j        (2.3)

3. Maximum Total Similarity: This is a modified form of the Minimum Total Distance proposed in [16]. In this criterion, we maximize the total of the intra-cluster and inter-cluster similarities. The maximum total similarity is computed as follows:

   ∑_{P_k ∈ C_j} S(P_i, P_k) + ∑_{P_m ∉ C_j} S(P_i, P_m),   P_i ∈ C_j        (2.4)

where the first term is the intra-cluster similarity and the second term is the inter-cluster similarity.
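A Python sketch of these criteria using the cosine coefficient over frequency vectors is given below; since the thesis does not spell out the normalization, this version averages over object pairs, and all names and data are illustrative.

```python
import math

def cosine(u, v):
    """Cosine coefficient between two frequency vectors (term -> count dicts)."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_quality(vectors, clusters):
    """Average intra-cluster (Eq. 2.2) and inter-cluster (Eq. 2.3) similarity;
    their sum mirrors the total similarity of Eq. 2.4."""
    intra = inter = n_intra = n_inter = 0
    for members in clusters:
        inside = set(members)
        for pi in members:
            for pk in range(len(vectors)):
                if pk == pi:
                    continue
                s = cosine(vectors[pi], vectors[pk])
                if pk in inside:
                    intra, n_intra = intra + s, n_intra + 1
                else:
                    inter, n_inter = inter + s, n_inter + 1
    return (intra / n_intra if n_intra else 0.0,
            inter / n_inter if n_inter else 0.0)

vecs = [{"c++": 2, "java": 1}, {"c++": 1, "java": 2}, {"php": 3}]
print(cluster_quality(vecs, [[0, 1], [2]]))  # high intra, zero inter
```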

2.2.2.3 Experimental Results

The two algorithms were run on the Mobile dataset and the Laptop dataset, and their performance was measured on the three quality criteria discussed above. We set the value of MCF, the minimum number of common features, to each value in the set {0.1, 0.2, ..., 0.9} to measure performance across a range of thresholds. We label MCF as Similarity Threshold on the x-axis of each graph; the y-axis is the score of the algorithm on the particular criterion. We discuss the performance of the two algorithms on each criterion as follows.

i. Intra-Cluster Similarity:

Figure 2.2 Performance of QT and Naive algorithms on Intra-Cluster Similarity criteria
[Figure: two plots, (a) Mobile Dataset and (b) Laptop Dataset, showing Intra-Cluster Similarity values (y-axis) against Similarity Threshold values from 0.1 to 0.9 (x-axis) for QT Clustering and Naive Clustering.]

In Figure 2.2, it can be observed that both algorithms follow a similar trend on each dataset in their Intra-Cluster Similarity criterion scores across the range of threshold values. However, except for a few threshold values on the laptop dataset, QT clustering scores higher and hence performs better overall.

ii. Inter-Cluster Similarity:

In Figure 2.3, it can be observed that both algorithms follow a similar trend on each dataset in their Inter-Cluster Similarity criterion scores across the range of threshold values. Except for a few threshold values on the laptop dataset, QT clustering scores lower and hence performs better overall.

iii. Maximum Total Similarity:

In Figure 2.4, it can be observed that both algorithms follow a similar trend on each dataset in their Maximum Total Similarity criterion scores across the range of threshold values. Except for a few threshold values on the laptop dataset, QT clustering scores lower and hence performs better overall.

Figure 2.3 Performance of QT and Naive algorithms on Inter-Cluster Similarity criteria
[Figure: two plots, (a) Mobile Dataset and (b) Laptop Dataset, showing Inter-Cluster Similarity scores (y-axis) against Similarity Threshold values from 0.1 to 0.9 (x-axis) for QT Clustering and Naive Clustering.]

for a few threshold values for laptop dataset, QT clustering scores lowerand hence performs better

overall.

The performance of the two clustering algorithms on the three criteria shows that the QT clustering algorithm forms better quality clusters. Hence we use the QT clustering algorithm to cluster the data in the special feature extraction algorithm, to organize the resume information in both the skills section (next section) and the experience section (next chapter).

2.3 Extracting Special Skills Information from Resumes

In this section, we extend the special feature extraction algorithm to the resume dataset and propose a framework to extract special skills information from the resumes. We first present the structure of the skills section in a resume. Next, we extend the algorithm and present the results.

2.3.1 Skills Section Structure

Table 2.4 shows an example of the features of a skills section, extracted from Table 1.1. It can be observed that the skills section information contains an enumerated sequence of text pieces. Each text piece consists of a skill type and its skill values. For example, 'programming languages' is a skill type and c, c++, java are skill values. So, there is a two-layer organization in the skill information, as shown in Figure 2.5.


[Figure 2.4: Maximum Total Similarity scores (y-axis) vs. Similarity Threshold values (x-axis) for QT and Naive clustering; panel (a) Mobile Dataset, panel (b) Laptop Dataset]

Figure 2.4 Performance of QT and Naive algorithms on Maximum Total Similarity criteria

The skills information itself forms a hierarchy, where skill types form one layer and skill values form another layer. If we applied the notion of special features to the skills section directly, the comparison between the features would not be effective. So we exploit the inherent organization of the skills information.

Table 2.4 Sample Features for Skill Tag

Skill Type features           Skill Value features
1. Programming Languages      c, c++
2. Operating Systems          windows 98/2000/xp, gnu/linux
3. Scripting Languages        python, perl, shell
4. Web Technologies           html, cgi, php
5. Libraries/APIs             opengl, sdl
6. Database Technologies      mysql, mssql
7. Software Tools             microsoft office, latex, gnu/gcc, visual studio 2005/08


[Figure 2.5: tree diagram. Layer 0: Resume ID. Layer 1: Education, Skills, Experience, Achievements. Layer 2 (under Skills): ProgLang (C, C++), OS (Windows 98/2000/XP, GNU/Linux), Scripting (Shell, Python), WebTech (html, cgi, php), Others (MSOffice, VisualStudio, Latex), Database (MySQL)]

Figure 2.5 Hierarchical structure of Skills

2.3.2 Proposed Approach

We propose an approach by considering that the skill information in a resume is organized into "skill types" and their corresponding "skill values". Overall, the proposed approach consists of the following steps. First, we perform pre-processing on the skills information of the resumes. Then we extract the skill type and skill value features. After extracting the skill type and skill value features, we calculate the DS values of the features and organize them.

Hence, there are two types of features that can be extracted from the skills information. One type is the Skill-Type-Feature-Set (STFS) and the other is the Skill-Value-Feature-Sets (SVFSs). Each element in STFS is a two-attribute tuple <ResumeId, SkillType> and each element in an SVFS is defined as <ResumeId, SkillType, SkillValue>. Note that, for a given resume, there exists one STFS and several SVFSs. Across resumes, the same skill type often carries the same skill values, but in different forms. For example, in Table 2.3, the skill values for the skill type 'programming languages' are the same except for the presence of some special characters (a comma in this case). Thus, direct comparison cannot be done. So, both the STFS and the SVFSs are formed after carrying out the preprocessing steps and then applying the described algorithm (refer Table 2.5) to the skills information.

Extracting skill types and skill values

The algorithm to extract skill types and skill values is divided into two parts: in the first part we perform preprocessing, and in the second part we apply the algorithm described in Table 2.5 to identify the skill type and skill value features.


Table 2.5 Algorithm to calculate STFS and SVFS

Input: R: set of 'n' resumes; F: set of features for all 'n' resumes; S: dictionary of all the skill types, where |S| is the number of distinct skill types.
Output: STFS and SVFS

1. Notations used:
   i, j: integers
   S_ri: skills information for resume r_i
   STFS_i: array of skill types for resume r_i, where each tuple contains <r_i, SkillType>
   SVFS_ij: array of skill values for resume r_i and skill type s_j, where each tuple contains <r_i, s_j, SkillValue>
2. for i = 1 to n
3.    Get the skills section features of resume r_i in S_ri
4.    for each s_j in S
5.       if s_j in S_ri
6.          store the tuple <r_i, s_j> in STFS_i
7.          store the tuple <r_i, s_j, skillvalue> in SVFS_ij
8.    end
9. end

The preprocessing steps are as follows (a code sketch follows the list):

i. The entire input text is converted to lower case and special characters are removed.

ii. Stop words occurring in a general-purpose stop word list are removed.

iii. The skills section of a resume is identified by the keyword 'Skills' in the heading, irrespective of the position of the skills section in the resume.

iv. The skill type and its skill value(s) are identified and stored separately using a delimiter (':' in our case).

v. The skill value(s) corresponding to each skill type are sorted lexically and separated by commas. For a skill value having more than one word, the words are concatenated. For example, the skill values of the skill feature 'database technologies: ms sql, postgres sql, mysql' would be changed to the skill value string 'mssql, mysql, postgressql'.

vi. To resolve human errors like spelling mistakes, typos etc., we define a data structure called the 'skill values list', with the 'skill type' as a hash key and its possible 'skill values' as its values. Each skill value is checked against the skill values list. In the case of several partial matches, the skill value is replaced by the skill value from the list with which it has the longest match. In the case of no match,


the list is manually updated with the skill value after verification. The possible skill values are extracted from the resume dataset.

vii. A skill value can have more than one name referring to it. For example, "mssql" and "microsoft sql" refer to the same skill. To resolve such ambiguity, we identify the various possible redundant occurrences through data analysis and prepare a hash table with the canonical names as the hash keys and the various possible names as lists of hash values corresponding to each canonical name. All the different names are then replaced by the common, canonical name to resolve this issue.
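A minimal Python sketch of these steps follows. The two lookup tables (CANONICAL_NAMES here; the 'skill values list' with its longest-partial-match lookup of step vi is omitted for brevity) are hypothetical stand-ins for the data structures described above, which in the thesis are built from the resume dataset.

    import re

    # Hypothetical lookup table for step vii; built from the dataset in practice.
    CANONICAL_NAMES = {"mssql": ["microsoftsql", "mssqlserver"]}

    def canonicalize(value):
        # Step vii: replace alternative names by the canonical name.
        for canonical, variants in CANONICAL_NAMES.items():
            if value == canonical or value in variants:
                return canonical
        return value

    def preprocess_skill_line(line):
        # Steps i and iv: lower-case, strip special characters, split on ':'.
        skill_type, _, values = line.lower().partition(":")
        skill_type = re.sub(r"[^a-z0-9/\s]", " ", skill_type).strip()
        # Step v: concatenate multi-word values, then sort lexically.
        cleaned = [canonicalize("".join(v.split())) for v in values.split(",")]
        return skill_type, sorted(v for v in cleaned if v)

    print(preprocess_skill_line("Database Technologies: ms sql, postgres sql, mysql"))
    # -> ('database technologies', ['mssql', 'mysql', 'postgressql'])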

The description of the algorithm shown in Table 2.5 is as follows. The input to the second part of the algorithm is the set R consisting of 'n' resumes and the dictionary S that contains all the distinct skill types present in the set R, where |S| denotes the cardinality of S. The output consists of the skill type feature set (STFS) and the skill value feature sets (SVFSs). In STFS each element is a tuple consisting of a resume identifier and a skill type, whereas in an SVFS each element is a tuple consisting of a resume identifier, a skill type and a skill value. The steps of the algorithm, repeated for each resume, are as follows: (i) identify the skills section of the resume using the 'Skills' tag; (ii) process each line of the skills section to identify the skill type and the corresponding skill values; (iii) store the resume id (r_i) and skill type at index STFS_i of the STFS array, and store the resume id (r_i), skill type (s_j) and skill value at index SVFS_ij of the SVFS array. Thus, after performing the preprocessing steps and applying the above algorithm, we get the STFS and the SVFSs. The next task is to calculate the specialness values of all the features in the STFS and SVFSs and to organize them.
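A compact Python rendering of the algorithm in Table 2.5 follows; the input layout (a mapping from resume id to preprocessed (skill type, skill values) pairs) is an assumption about the data structures, not the thesis's exact representation.

    def build_stfs_svfs(resumes, skill_types):
        # resumes: dict resume_id -> list of (skill_type, [skill_value, ...])
        # skill_types: the dictionary S of distinct skill types
        stfs = []    # tuples <r_i, skill_type>            (line 6 of Table 2.5)
        svfs = {}    # (r_i, s_j) -> [<r_i, s_j, value>]   (line 7 of Table 2.5)
        for rid, section in resumes.items():
            for s_type, values in section:
                if s_type in skill_types:                  # line 5 of Table 2.5
                    stfs.append((rid, s_type))
                    svfs[(rid, s_type)] = [(rid, s_type, v) for v in values]
        return stfs, svfs

    stfs, svfs = build_stfs_svfs(
        {"R1": [("programming languages", ["c", "c++"])]},
        {"programming languages"})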

A. Calculating DS Value and Organizing the Special Skill Types:

Given the STFS, the problem is to identify the specialness value of each feature and then, on the basis of the specialness values, organize all the features in the set.

Computing Specialness Value for STFS: Let R be a set of 'n' similar resumes, where resume r_i ∈ R. Each resume r_i possesses a set of features. Let f(r_i) be the set of skill type features for resume r_i, and let F be the set of all skill type features over all resumes, such that F = ∪_{i=1}^{n} f(r_i). Each feature in F is denoted by f_j, where 0 ≤ j ≤ |F|, and n(f_j) denotes the number of resumes to which feature f_j belongs. The DS value for each feature in STFS is calculated as defined in Equation 2.1. The input consists of the feature set F, and the output consists of the feature set F along with the DS values of all the features in the set.


Note that the set F may contain duplicate features. We consider these as distinct features because they belong to distinct resumes.
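Equation 2.1 is defined earlier in the thesis and is not reproduced here. Purely as an illustration, the sketch below uses a stand-in formula that grows as a feature becomes rarer (DS(f) = 1 - n(f)/n); it should be read as a placeholder for the actual degree-of-specialness definition.

    from collections import Counter

    def ds_values(stfs, n_resumes):
        # n(f_j): number of resumes to which skill-type feature f_j belongs.
        counts = Counter(s_type for _, s_type in stfs)
        # Stand-in for Equation 2.1: rarer features get a higher DS value.
        return {f: 1 - c / n_resumes for f, c in counts.items()}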

Organization of STFS: We apply the special feature extraction algorithm described above to the features in STFS and organize the features as shown in Figure 2.1. The input to the algorithm consists of the feature set F, which contains all the features in STFS along with their DS values, the threshold ST and the set of resumes R; the output of the algorithm consists of a three-level organization of the STFS features.

B. Calculating DS Value and Organizing Special Skill Values:

Given the skill types and the SVFSs, the problem is to identify the specialness value of each feature and then, on the basis of the specialness values, organize all the features.

Computing Specialness Value for SVFSs: Let S be the set containing the distinct skill type features from all the resumes, and let s_j ∈ S denote a particular skill type. Let f(s_ij) denote the skill value features for skill type s_j ∈ S and resume r_i ∈ R, and let F(s_j) be the set of all skill value features for skill type s_j over all the resumes, such that F(s_j) = ∪_{i=1}^{n} f(s_ij). The DS value of each feature in the SVFSs is calculated using Equation 2.1. The input consists of the feature sets F(s_j) for all s_j ∈ S, and the output consists of the feature sets along with the DS values of all the features for each skill type.

Organization of SVFSs Features: We apply the special feature extraction algorithm described above to the features in the SVFSs and organize the features as shown in Figure 2.1. The special feature extraction algorithm is run for each distinct skill type s_j ∈ S. The input to the algorithm consists of the feature set F(s_j), which contains the features for skill type s_j along with their DS values, the threshold ST and the set of resumes R; the output of the algorithm consists of a three-level organization of the skill value features for each skill type s_j.

2.3.3 Overall Framework

[Figure 2.6: flow diagram. Input: resumes as text documents → identifying features from the skills section (STFS and their corresponding SVFSs) → calculating DS values and organizing STFS; calculating DS values and organizing SVFSs → Output: three-level feature organization for STFS; three-level feature organization for SVFSs]

Figure 2.6 Flow diagram of overall framework


In this section we explain the overall framework. The input to the proposed approach is the set of resumes stored as text documents, where each document contains different sections along with their descriptions (refer Table 2.1). The steps of the proposed framework are discussed below (refer Figure 2.6).

1. Identification of features from the skills section: We extract the Skill Type Feature Set and the Skill Value Feature Sets from the skills information of all the resumes.

• Identifying the Skill Type Feature Set (STFS): We identify the skill type features for all the resumes and form the STFS.

• Identifying the Skill Value Feature Sets (SVFSs): We identify the skill value features for each skill type and form the SVFSs for all the skill types.

2. Calculating DS values and organizing special skill type features: We compute the DS value for each skill type feature and, on the basis of the DS values, we organize the skill type features.

• Computing DS values for skill type features: We compute the DS value for the skill type features based on the notion of degree of specialness defined in Equation 2.1.

• Organization of skill type features: We organize the skill type features using the special feature extraction approach described in Section 2.2.1.

3. Calculating DS values and organizing special skill value features: We compute the DS value for the skill value features of each skill type and, on the basis of the DS values, we organize the skill value features for each skill type.

• Computing DS values for skill value features: We compute the DS value for the skill value features of each skill type based on the notion of degree of specialness defined in Equation 2.1.

• Organization of skill value features: We organize the skill value features for each skill type using the special feature extraction approach described in Section 2.2.1.

2.3.4 About Using the Proposed Framework

The user can request the system to organize the resumes according to skill types or according to the skill values of a selected skill type. The resumes are organized firstly on the basis of skill type, forming the first layer; the second layer consists of tables of skill values for each skill type. Now, if the recruiter wants to select a resume only on the basis of skill type, he/she can give only the skill type information as input and browse through the output containing the special skill type information only. And if he/she wants to select a resume based on the skill value for a particular skill type, he/she can do so by giving the skill type and skill value information as input and browsing through the special skill types and then the special skill values for that skill type.

2.3.5 Experimental Results

To evaluate the performance, we applied the proposed framework to a real-world data-set of resumes. The data-set contains 106 resumes from undergraduate students of the computer science department of a university. All the resumes are available in the same format as shown in Table 1.1. The total number of features in the skills section was 629. The skill types present in the data-set are shown in Table 2.6.

We define a performance metric called 'reduction factor' (rf) to measure the performance improvement. The rf denotes the reduction in the number of features that the HR manager needs to browse to select a resume from a set of 'n' similar resumes, as compared to one-level approaches. Let 'F' denote the total number of features for all the resumes, F(i) denote the number of features in level 'i', and 'L' denote the number of levels. The 'rf' is defined as

rf = 1 - \frac{\sum_{i=1}^{L} F(i)}{F}
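In code the metric is a one-liner; the sketch below reproduces the skill-type numbers reported for the proposed approach (2 common + 35 cluster + 101 special features displayed, out of 629 in total).

    def reduction_factor(total_features, displayed_per_level):
        # rf = 1 - (features displayed over all levels) / (total features)
        return 1 - sum(displayed_per_level) / total_features

    print(round(reduction_factor(629, [2, 35, 101]), 2))  # 0.78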

2.3.5.1 Performance Results for Skill Type Features (STFS)

Table 2.7 shows the reduction factor for skill type features. It can be seen that the rf value comes to 78%, which indicates that a 78% reduction in effort could be achieved in the resume selection process. The total number of skill type features present was 629, while the number of features displayed to the user is only 138. Table 2.10 shows the organization of the corresponding skill type features (set F) using the improved special feature extraction approach, and Table 2.9 shows the feature organization statistics. The I-level shows the common skill types, the II-level shows the common skill types of each cluster of resumes, and the III-level shows the special skill types of each resume. The resumes can thus be classified based on their skill types in one click. Since a number of resumes share the same special feature, we list them in the same row separated by commas, both for user convenience and to reduce space. The reduction in the number of features displayed is so large because many features are common features: instead of displaying them for each resume, each is displayed only


once. Similarly, the cluster features are displayed once for all the resumes present in a cluster instead of separately for each one. Note that in Table 2.10 we have shortened some labels for better presentation; for example, we present 'Mobile Platforms' as 'Mob Plat'.

Table 2.6 All Skill Type Features

1. programming languages            2. scripting languages
3. operating systems                4. web technologies
5. database systems                 6. libraries/apis
7. software tools                   8. compiler tools
9. mobile platforms                 10. middleware technologies
11. server side scripting           12. IDE
13. microsoft tools and services    14. documentation
15. java technologies               16. object oriented analysis and design
17. frameworks and content management systems
18. cms                             19. server technologies
20. version control system          21. virtualization tech. and tools
22. assembly languages              23. open source tools
24. open source frameworks

Table 2.7 Reduction Factor values for skill types using the proposed approach

Feature Type   |F|   Σ_{i=1}^{L} F(i)   rf
Skill Types    629   138                0.78

Table 2.8 Reduction Factor values for skill types using the previous approach

Feature Type   |F|   Σ_{i=1}^{L} F(i)   rf
Skill Types    629   189                0.69

2.3.5.2 Results for Skill Value Features (SVFSs)

The reduction factors for the skill value features of each skill type are shown in Table 2.11. It can be observed that the rf values of the SVFSs are very high for some skill types, low for a few, and medium in the remaining cases. The reason for the high reduction factor of some skill types is that many resumes share common skill values for these skill types, and thus the clusters formed are uniform.


Table 2.9 Skill Type Feature Organization Statistics

Number of Common Features = 2
Number of Clusters = 23

Features in Cluster 1 = 4, Resumes in Cluster 1 = 24
Features in Cluster 2 = 3, Resumes in Cluster 2 = 18
Features in Cluster 3 = 4, Resumes in Cluster 3 = 22
Features in Cluster 4 = 5, Resumes in Cluster 4 = 4
Features in Cluster 5 = 4, Resumes in Cluster 5 = 5
Features in Cluster 6 = 2, Resumes in Cluster 6 = 4
Features in Cluster 7 = 4, Resumes in Cluster 7 = 3
Features in Cluster 8 = 4, Resumes in Cluster 8 = 2
Features in Cluster 9 = 5, Resumes in Cluster 9 = 2
Features in Clusters 10-23 = 0, Resumes in each of Clusters 10-23 = 1

Total Common Cluster Features = 35
Special Features = 101
Total Features displayed = 2 (Common) + 35 (Common Cluster) + 101 (Special) = 138
F = (2 * 106) + (4*24 + 3*18 + 4*22 + 5*4 + 4*5 + 2*4 + 4*3 + 4*2 + 5*2) + 101 = 629

The reason for the low reduction factor of skill types such as compiler tools or mobile platforms is that the number of features in these sets is very small, so there is very little scope for clustering the resumes based on common features. In cases like IDEs or software tools, the variety of skill values across the resumes is very high, hence the low reduction factor. Still, in most cases the reduction factor is 50% or above. Thus we can say that, on average, there is a 50% reduction in the effort of HR managers in the resume selection process. For each skill type, its respective skill value features are organized using the three-level feature organization. Table 2.13 shows the organization of the skill values of the skill type 'database technologies', and Table 2.14 shows the corresponding feature organization statistics. The I-level shows the common skill values, the II-level shows the common skill values of each cluster of resumes, and the III-level shows the special skill values in 'database technologies' for each resume.

We also conducted experiments using the previous feature organization algorithm (the naive clustering algorithm). Table 2.12 shows the resulting rf values of the SVFSs for all skill types. It can be observed that the proposed clustering approach improves the rf values significantly, as shown in Table 2.11.

Table 2.15 shows the organization of the special skills related to the skill type 'programming languages', and Table 2.16 shows the corresponding feature organization statistics. It can be observed that the reduction factor comes to 88% with the proposed approach. Similarly, Table 2.13 shows the organization of the special skills related to the skill type 'database systems'; in this case, the reduction factor also comes to 88%.

The results of the special skill organization for the other skill types (operating systems, web technologies, scripting languages, software tools, libraries/APIs, IDEs, compiler tools, server side scripting, mobile platforms and assembly language) are shown in the Appendix.


Table 2.10 Organization of features (skill types) using the three-level approach for the Resume data-set

Common Features (I-level): Prog Lang, Oper Sys

Common cluster features (II-level) | ResumeId | Special Features (III-level)

Scrptng Lang, Web Tech, Lib/APIs, Soft Tools
    R14, R43, R45, R56, R6, R65, R71, R77, R8, R80, R81, R90 | DBMS
    R33, R74, R79, R91, R92, R96, R98 | DBMS, IDE
    R49 | DBMS, Mob Plat
    R55 | DBMS, Comp Tools
    R18, R28, R41 | Comp Tools

DBMS, Scrptng Lang, Web Tech
    R103, R105, R19, R23, R24, R25, R29, R3, R31, R4, R47, R5, R52, R58, R64, R68, R7, R88 | none

Scrptng Lang, Web Tech, DBMS, Lib/APIs
    R106, R11, R60, R82, R93 | IDE
    R30, R34 | IDE, Serv Side Scrptng, Soft Tools
    R33, R74, R79, R91, R92, R96, R98 | IDE, Soft Tools
    R63 | IDE, Serv Side Scrptng
    R69 | IDE, Comp Tools
    R70, R76, R84, R87, R9 | none
    R71 | Soft Tools

Scrptng Lang, Comp Tools, Lib/APIs, Web Tech, Soft Tools
    R28, R18 | none
    R55 | DBMS
    R59 | IDE

IDE, DBMS, Scrptng Lang, Web Tech
    R104 | Serv Side Scrptng
    R72 | Soft Tools, Comp Tools
    R57 | Soft Tools, Serv Side Scrptng
    R21 | Soft Tools, Java Tech
    R85 | none

DBMS, Scrptng Lang
    R32, R89, R94, R99 | none

Assmbly Lang, Soft Tools
    R13, R97 | none

Web Tech, Scrptng Lang
    R48 | IDE

IDE, Web Tech, Soft Tools, DBMS
    R1 | Lib/APIs
    R20 | none

Web Tech, DBMS, Lib/APIs, Scrptng Lang, Mob Plat
    R49 | Soft Tools
    R50 | Virt Tech. and Tools

none
    R86 | Soft Tools, DBMS
    R95 | Open Src Frmwrks, Comp Tools, Web Tech, Scrptng Lang, Lib/APIs, DBMS, Soft Tools
    R53 | IDE, Soft Tools, Lib/APIs, DBMS
    R26 | Scrptng Lang, DBMS, Soft Tools
    R78 | Web Tech, IDE, Soft Tools, Frmewrks and Content Mngmnt Sys
    R27 | Scrptng Lang, IDE, Web Tech, Serv Side Scrptng, Soft Tools, CMS, DBMS
    R66 | Middleware Tech, Web Tech, DBMS, Scrptng Lang, Lib/APIs, IDE, Mob Plat
    R38 | Web Tech, Soft Tools, Sim Tools, Assmbly Lang
    R73 | Scrptng Lang, DBMS, OOAD
    R2 | IDE, Scrptng Lang, Web Tech, Serv Side Scrptng, DBMS, Lib/APIs, Vers Cntrl Sys, Serv Tech
    R61 | Assmbly Lang, Scrptng Lang, Soft Tools, DBMS, Doc, Web Tech, Lib/APIs
    R44 | Open Src Tools, Soft Tools, Web Tech, DBMS
    R16 | Scrptng Lang, Web Tech, Serv Side Scrptng, DBMS, Assmbly Lang, IDE, Soft Tools, Lib/APIs
    R36 | Scrptng Lang


Overall, the results show that significant reduction factor values are obtained, which indicates that the proposed approach can reduce the effort of processing resumes by identifying the special information, if any, in an effective manner.

Table 2.11 Reduction Factor values for skill values of each skill type using the proposed approach

Feature Type             |F|   Σ_{i=1}^{L} F(i)   rf
programming languages    289   35                 0.88
database technologies    191   25                 0.88
operating systems        296   68                 0.77
web technologies         348   135                0.62
scripting languages      184   64                 0.67
software tools           302   208                0.31
libraries/apis           123   72                 0.42
IDEs                     89    52                 0.42
compiler tools           6     4                  0.34
server side scripting    27    15                 0.45
mobile platforms         4     2                  0.5
assembly language        10    5                  0.5

Table 2.12 Reduction Factor values for skill values of each skill type using the previous approach

Feature Type             |F|   Σ_{i=1}^{L} F(i)   rf
programming languages    289   35                 0.81
database systems         191   25                 0.87
operating systems        296   68                 0.70
web technologies         348   135                0.47
scripting languages      184   64                 0.67
software tools           302   208                0.30
libraries/apis           123   72                 0.42
IDEs                     89    52                 0.42
compiler tools           6     4                  0.34
server side scripting    27    15                 0.34
mobile platforms         4     2                  0.5
assembly language        10    5                  0.5


2.3.6 Discussion on errors

The special feature extraction algorithm extracts information and organizes it based on its degree of specialness. This means that every term in the skills section of a resume is assigned a degree of specialness and placed appropriately in the three-level organization table (Figure 2.1). So if an irrelevant term, say 'hello', is written as a skill value for 'programming languages', it is very likely that only one or very few resumes contain it, which makes the term 'hello' a special feature that is clearly visible in the list of special features. Such errors will not affect the output, since a human further analyses the three-level organization of features and can easily notice an anomaly in the result. Moreover, our approach is based on the assumption that every candidate tries to highlight their special skills (strengths, best qualities etc.), hence a resume as a text document correctly gives information about the candidate, as explained in detail in Section 1.2.2.

2.4 Summary of the Chapter

In this chapter, we have proposed an approach to enhance the quality of the clusters in the three-level feature organization by using a Quality Threshold based clustering approach. We extended the special feature extraction framework to the skills section of the resume for improved processing. Observing that the skills section contains a two-level organization of skill types and skill values, we extended the feature extraction algorithm to organize special skill types and to organize the special skill values of each skill type. The experiments on the resume dataset indicate that the proposed approach obtains significantly better reduction factor values for both the skill types and the skill values of each skill type, which indicates a reduction in the resume processing effort.

In the next chapter, we extend the special feature extraction approach to organizing experience related information.


Table 2.13 Organization of features (skill value :: database systems) using the three-level approach for the Resume data-set

Common Features (I-level): none

Common cluster features (II-level) | ResumeId | Special Features (III-level)

mysql, mssql
    R10, R11, R12, R13, R15, R19, R20, R24, R3, R30, R71, R72, R74, R75, R77, R8, R82, R83, R84, R87, R88, R89, R9, R91, R92, R93, R94, R32, R36, R38, R40, R41, R42, R43, R44, R48, R5, R50, R51, R53, R54, R55, R56, R58, R59, R6, R62, R63, R64, R67 | none
    R31, R33 | sqlite
    R69 | oracle

mysql
    R1, R14, R16, R17, R18, R2, R23, R25, R27, R29, R34, R37, R39, R45, R46, R49, R57, R60, R65, R68, R76, R78, R79, R80, R85, R86, R95, R96 | none

mysql, oracle
    R26, R52, R22, R61, R66, R7, R21, R70, R81 | none
    R28 | ms access
    R97 | postgresql

none
    R4 | sql, mysql
    R35 | mssql, oracle, ms access, mysql, pl/sql
    R47 | mssql
    R73 | mysql, postgresql
    R90 | oracle

Table 2.14 Skill Value (Database Systems) Feature Statistics

Number of Common Features = 0
Number of Clusters = 8

Features in Cluster 1 = 2, Resumes in Cluster 1 = 53
Features in Cluster 2 = 1, Resumes in Cluster 2 = 28
Features in Cluster 3 = 2, Resumes in Cluster 3 = 11
Features in Cluster 4 = 5, Resumes in Cluster 4 = 4
Features in Clusters 5-8 = 0, Resumes in each of Clusters 5-8 = 1

Total Common Cluster Features = 10
Special Features = 15
Total Features displayed = 0 (Common) + 10 (Common Cluster) + 15 (Special) = 25
F = (0 * 106) + (2*53 + 1*28 + 2*11 + 5*4) + 15 = 191


Table 2.15 Organization of features (skill value :: programming languages) using the three-level approach for the Resume data-set

Common Features (I-level): c, c++

Common cluster features (II-level) | ResumeId | Special Features (III-level)

java
    R1, R49, R69 | python
    R101, R39 | vb
    R102, R105, R50, R52, R53, R63, R66, R68, R16, R36 | none
    R106, R17 | perl
    R33 | symbian c++
    R34, R6, R76 | c#
    R37 | c#, .net
    R95 | j2me

python
    R20, R35, R42, R43, R46, R78 | none
    R44 | php, perl
    R60 | .net, c#, vc++, python, java applets
    R90 | matlab

open c++
    R12 | prolog
    R80 | symbian

matlab
    R47, R9 | none

vb
    R29, R40 | none
    R62 | action script 3.0, mxml
    R86 | perl
    R13 | c#
    R65 | stl

Table 2.16 Skill Value (Programming Languages) Feature Statistics

Number of Common Features = 2
Number of Clusters = 9

Features in Cluster 1 = 1, Resumes in Cluster 1 = 37
Features in Cluster 2 = 1, Resumes in Cluster 2 = 9
Features in Cluster 3 = 1, Resumes in Cluster 3 = 2
Features in Cluster 4 = 1, Resumes in Cluster 4 = 2
Features in Cluster 5 = 1, Resumes in Cluster 5 = 2
Features in Cluster 6 = 2, Resumes in Cluster 6 = 1
Features in Cluster 7 = 1, Resumes in Cluster 7 = 1
Features in Cluster 8 = 1, Resumes in Cluster 8 = 1
Features in Cluster 9 = 1, Resumes in Cluster 9 = 1

Total Common Cluster Features = 10
Special Features = 23
Total Features displayed = 2 (Common) + 10 (Common Cluster) + 23 (Special) = 35
F = (2 * 106) + (1*37 + 1*9 + 1*2 + 1*2 + 1*2 + 2*1) + 23 = 289


Chapter 3

Extracting Special Experience Information from the Resumes

As already explained, a resume contains several sections, including education, skills, experience and other information. In this chapter, we extend the notion of special features to extract special experience information from the resume dataset. Note that the experience section contains a set of paragraphs, where each paragraph contains information about some distinct experience of the candidate, so it is not straightforward to apply the special feature extraction algorithm to the experience section. In this chapter, we first discuss the issues in extending the special feature extraction algorithm to the experience section. Then we discuss the previous work on short text labeling. Next, we propose a short-text labeling algorithm by exploiting the notion of term-label affinity. Subsequently, with the help of the proposed labeling algorithm, we convert the experience section into a set of features and extend the special feature extraction algorithm proposed in the previous chapter to extract special experience information.

3.1 Issues in extracting special information from experience section

Table 3.1 shows the experience section of a resume, with each experience short text enumerated. It can be observed that it has 9 short texts describing the experience of the candidate. Figure 3.1 shows the hierarchical structure corresponding to the experience section shown in Table 3.1. The experience section of a resume is written in free-form text, and every experience text is a few lines long.

Since the information is in free-form text, a direct extension of the special feature extraction approaches to this data is not possible. We had to investigate methods to convert the experience-related information into the form of features.

One solution is to assign labels to the text. In the literature, several methods are available for labeling long documents. However, very few research efforts have been made to assign labels to short documents.


Table 3.1 Sample Experience section of a resume with corresponding subsections and their respective features

Major Projects
1. title: secure application design with focus on performance
   description: tcp congestion control mechanism can be detrimental in the face of active attacks of adversaries. an application that is completely dependant on tcp can suffer due to such attacks. we will examine the effect of such attacks in a secure application and propose a design that will overcome these limitations.
2. title: implemented extendible hashing as a dbms project.
   description: given a relation hash its records using extendible hashing technique and store it in a separate file then we can search, delete and insert a record on the basis of attribute value on which hash function has been applied.
3. title: flexi-search engine
   description: this project develops a tool, engine, which will be able to search documents regardless of the type of heterogeneity of the data in them.
4. title: ssf visualization and system analysis tool
   description: this system takes as input a ssf (shakti standard format) file and displays the output tree corresponding to that file. it also compares two ssf files and displays the two trees along with differences highlighted in different color.
5. title: multi-threaded file transfer
   description: this project involves sending a file from server side to the client. each packet has the information regarding parity and size. there is separate transmission thread for sending the corrupted packets again.

Minor Projects
6. description: game of carrom using open gl, c++
7. description: solar system using sdl, c++
8. description: developed web interfaces using php, javascript, ajax, mysql.
9. description: implemented unix utilities like top, talk, shell.

In the next section, we propose a new labeling approach to label short text documents using a semi-supervised learning technique.

3.2 Proposed labeling approach for short text

3.2.1 Background and problem definition

Labeling is a useful technique that represents the information content in a particular text by a label.

A label is a term or set of words either predefined or generated from within the text itself. Labeling helps

identify the class of information the text belongs to, and hence is closely related to classification. This

can help group together the short text information that belong to the same class, and in better analysis

and understanding of information of the short text paragraph on the basis of its label. Assigning a


[Figure 3.1: tree diagram. Layer 0: Resume ID. Layer 1: Education, Skills, Experience, Achievements. Layer 2 (under Experience): Project 1 (Secure Application Design with focus...), Project 2 (Implemented Extendible Hashing...), Project 3 (flexi-search engine), Project 4 (multi-threaded file transfer)]

Figure 3.1 Hierarchical structure of Experience Section

Assigning a representative domain-specific label to each short text paragraph can help in better understanding of the paragraphs. For example, the following text is basically about SQL:

    A user friendly and user interactive design which will help beginners understand what exactly a query is and helps him to learn the sql language easily.

A domain expert would identify this text as belonging to the domains 'database management system' (or 'DBMS') and 'User Interface' under the Computer Science stream. Hence, the text can be labeled "DBMS", "User Interface" or both, and can be represented by its labels during automatic text analysis.

In the current web scenario, new short texts are created on a daily basis: reviews of commercial products, comments on blog posts, comments on social network websites, tweets, publication abstracts, or short descriptions of information in a domain-specific document, such as the experience-related information in the resume of a candidate. The presence of short text in enormous volume on the internet, and its exponential growth, has created great interest in analyzing it in order to automatically extract useful information and fulfil specific user needs.

Current research efforts towards processing short texts use additional information related to a short text for short text data expansion. But procuring the required additional information becomes a tough task when no such information is available. The main challenge in processing and analyzing short texts is that each short text carries its information in far fewer words, or in less detail, than a long document, which also means that the words in a short text have relatively low absolute frequency compared to their usual frequency in a long document. This makes the tasks that involve comparison among short texts, such as text categorization/labeling, harder, and it makes traditional categorization/labeling methods ineffective, because there is very little overlap of terms among short texts as compared to long documents. Also, a short text is neither big enough for significant summarization, nor can a suitable label be generated from within it, since it has very few terms to use as features. We have investigated the issues in categorizing short text data by finding the affinity of the feature terms in a short text for the label assigned to that short text. This means we do not need to directly compare two short texts. Moreover, we only learn the connection between the feature terms and the label, hence we do not require any additional information.

The problem of labeling short texts can be defined as follows. Let S = {s_1, s_2, ..., s_n} be a set of n short texts. Each short text s_i contains a set of terms T_i = {t_{i1}, t_{i2}, ..., t_{id}}. Let T = {t_1, t_2, ..., t_m} = T_1 ∪ T_2 ∪ ... ∪ T_n be the set of all terms contained in the short text set S, where T_i is the set of terms in short text s_i. Also, let L = {l_1, l_2, ..., l_k} be a set of pre-decided labels, with k << n.

The problem is to assign a label l_j ∈ L to each s_i containing the terms t_1, t_2, ..., t_d, where i ∈ [1, n] and j ∈ [1, k].

3.2.2 Related Research in Short Text Labeling

In the literature, the following approaches relate to the problem of short text classification/labeling.

In [25], the authors use a combination of labeled training data and a secondary corpus of unlabeled but related longer documents to train a classifier for short text. During the learning stage, the unlabeled corpus is used as "background knowledge" that is similar to both the labeled examples and the new short texts, establishing a connection between them.

In [18], the problem of insufficient word occurrences in a short text is addressed by using a small set of domain-specific features extracted from the author's profile and text. A Naive Bayes classifier is used.

In [4], statistical data of the training set is used to generate rules and patterns in labeled short text data, on the basis of which new unlabeled short texts can be assigned a label.


In [2], the authors focus on the classification of texts with a high concentration of specific terminology and complex grammatical structures. Since those characteristics inevitably complicate standard feature engineering, which is done by language pre-processing (e.g., lemmatization, parsing) and is further complicated when the texts are short, the authors use a learning method that performs reasonably well without preliminary feature engineering. They use Prediction by Partial Matching (PPM), an adaptive finite-context method for text compression that obtains all information from the data without feature engineering, is easy to implement, is relatively fast, creates a language model adapted to a specific case, and can be used in a probabilistic text classifier. The authors use terminology-intense data: medical text. They build two versions of PPM-based classifiers, one calculating the probability of the next word and the other calculating the probability of the next character. As per their experimental results, the character-based method performed slightly better than the word-based method.

Since short text documents have little overlap in their feature terms, ICA does not work well for them. So the authors in [15] address the short-text problem in text classification by using Latent Semantic Analysis (LSA) as a data preprocessing method and then employing ICA on the preprocessed data.

The authors in [23] present a new model for classifying Chinese short texts that have a weak concept signal, in which three key factors of feature extension, which determine the classification performance for short text, are considered. To determine the three extension factors, the paper studies three key issues: (1) how to do feature extension for short text; (2) what influence different ways of feature extension have on the classification performance for short text; and (3) how to control the degree of feature extension for short text. In the classification stage, a short text is first extended by adding new features or modifying the weights of initial features according to the relationship between non-feature terms and the feature extension mode; meanwhile, the authors improve the effect of feature extension by controlling its degree, and then classify the extended short text with the new model. Their experimental results show that the new model for short-text classification with feature extension obtains higher classification performance than conventional classification methods.

The authors in [12] present a general framework for building classifiers that deal with short and sparse text and Web segments by making the most of hidden topics discovered from large-scale data collections. The idea is to gain external knowledge to make the data more related as well as to expand the coverage of the classifiers so that they handle future data better. The underlying idea of the framework is that for each classification task, a large-scale external data collection called a universal dataset is collected, and a classifier is then built on both a small set of labeled training data and a rich set of hidden topics discovered from that data collection. They show that this framework is general enough to be applied to different data domains and genres, ranging from Web search results to medical text.

Differences from existing approaches to short text classification/labeling

Current research efforts towards processing short text generally use additional related information for short text data expansion, so that traditional methods can be used to categorize the short text data. However, procuring such additional information becomes an issue, particularly when no additional information regarding the short text data is available.

3.2.3 Notion of Term-Label Affinity

We exploit the fact that even though a short text does not contain a sufficient number of terms that can be used as features, the label assigned by a human to a short text gives the notion of an affinity between the label and the terms in that short text. This notion of term-label affinity is exploited in assigning a label to a new short text. Note that each short text is considered as an independent document in our work.

It can be observed that even though any two short texts belonging to the same domain may not have common terms, there will be common terms between the set of terms in a short text and the set of terms in the rest of the short texts having the same label, or belonging to that same domain, in the training set. If k short texts have information on a particular domain, then the (k+1)-th short text of that same domain will contain one or more terms present in the set of all the terms in the previous k short texts. The value of k may vary, but the phenomenon always holds. Thus, we can consider the first k short texts together as one big text with enough features to characterize that domain. The presence of some of those feature terms in a new piece of short text can be used to suitably label the new short text. The label assigned to the new short text is the label of the term with the highest term-label affinity value, i.e., the term that most distinctly characterizes the label corresponding to a particular domain. Hence, we use the terms to represent a short text and do term-level comparison across many short texts instead of direct text-level comparison.


Table 3.2 P1 (left) and P2 (right) have no words in common, yet they could have the same label: Database Systems.

P1: extended the postgres database system with the skyline operation. the skyline is defined as those points which are not dominated by any other point. a point dominates another point if it is as good or better in all dimensions and better in at least one dimension.

P2: designed and developed microsoft sql learning suite to provide a learning platform for those who don't have basic knowledge of sql. it makes them aware of the basic keywords and query syntax of sql using the interactive gui.

We elaborate on the issue faced in labeling short text through an example. Let P1 (left) and P2 (right) be the abstracts of projects done by engineering students in Computer Science and Engineering, as shown in Table 3.2. Both P1 and P2 should be labeled the same, i.e., "database management system" or "DBMS", even though none of the words, except the stop words, are shared between P1 and P2. In P1, "postgres", "database" and "database system" are the key terms, while in P2, "sql" and "microsoft sql" are the key terms, that help a person decide that P1 and P2 should be labeled "DBMS". Hence, it is term-level rather than text-level processing of text pieces such as short texts that gives better classification results.

3.2.4 Proposed Labeling Approach

The problem we address is assigning labels to short texts. We assume that a fixed set of labeled short texts is available for learning. The proposed approach contains two phases: learning from the labeled short texts, and labeling new short texts. In the learning phase, the main issue is how to learn the association between a short text and its given label. For this, we identify the feature terms in the short texts and find the affinity of each feature term in every short text with the short text's given label. The affinity score of each term for each label is calculated based on the number of short texts, assigned that label, in which the term is present. In the labeling phase, the feature terms of each new short text are extracted; the feature term having the highest term-label affinity score with some label is identified, and that label is assigned to the new short text.

The main issues here are how to measure the distinguishing power of a feature term in a short text, and how to identify the feature term with the highest distinguishing power, i.e., the dominant term, which decides which label should be assigned to its short text.


[Figure 3.2: pipeline. Labeled Short Text → Feature Extraction → Feature Terms → Learning Algorithm → Term-Label Affinity Scores → Term-Label Affinity Matrix]

Figure 3.2 Schematic Representation of Learning Framework

3.2.4.1 Proposed Framework

Our proposed approach consists of the following framework, discussed in detail below:

I. Feature Extraction Phase:

The traditional 'bag of words' model for feature extraction fails to capture some important terms consisting of more than one word. For example, 'data mining' has two words and is an important term in the Computer Science and Engineering domain. A 'bag of words' model would consider 'data' and 'mining' separately, whereas for the real meaning to be retained the two words should appear together as 'data mining'. Hence, we use Part-of-Speech tagging as part of our preprocessing step. The feature terms are then extracted from the tagged text based on a set of rules that extracts all the noun phrases from the text.

The feature extraction phase has the following steps (a sketch follows the list):

1. The terms are identified in each short text by tagging the text using a Part-of-Speech tagger; noun phrases are then identified using rule-based parsing.

2. Stop words are removed.

3. Single words are lemmatized.

Thus the result of feature extraction is the following set of terms: lemmatized single words and noun phrases.
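A minimal sketch of this phase using NLTK. The thesis does not name its tagger or its noun-phrase rules, so the toolchain and the chunk grammar below are illustrative assumptions (and the punkt, tagger, stopwords and wordnet NLTK data packages must be downloaded first).

    from nltk import pos_tag, word_tokenize, RegexpParser
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # Illustrative rule: optional adjectives followed by nouns, so that
    # multi-word terms such as "data mining" survive as one feature.
    NP_GRAMMAR = "NP: {<JJ>*<NN.*>+}"

    def extract_features(short_text):
        tagged = pos_tag(word_tokenize(short_text.lower()))
        tree = RegexpParser(NP_GRAMMAR).parse(tagged)
        noun_phrases = [" ".join(word for word, _ in subtree.leaves())
                        for subtree in tree.subtrees()
                        if subtree.label() == "NP"]
        stop = set(stopwords.words("english"))
        lemmatizer = WordNetLemmatizer()
        single_words = [lemmatizer.lemmatize(word) for word, _ in tagged
                        if word.isalpha() and word not in stop]
        return noun_phrases + single_words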

II. Learning Phase: Creating Term-Label Affinity Matrix

During learning phase, we build Term-Label Affinity Matrix (TLM) and calculate term label affinity

scores. TLM is a data structure that is used to store for each termti and the corresponding labelslj

their Term-Label Affinity score. The corresponding labels are the onesassigned to the short text, one

label to one short text, in which the term appears. For calculation of affinityscores, two methods can be

followed:

i Frequency Score (FS) method


ii Term Frequency - Inverse Label Frequency (TF-ILF) method

Frequency Score (FS) method:

A high frequency of occurrence of a term in the short texts having a particular label shows the high affinity of that term for the label: whenever that label is assigned to a short text, the term has a high likelihood of being present. Thus, the Frequency Score FS(t_j, l_i), which is the number of short texts with label l_i in which term t_j occurs (more than one occurrence within a single short text is counted as one occurrence), is calculated for each term-label pair.

Term Frequency - Inverse Label Frequency (TF-ILF) method:

The Frequency Score can in some cases give a wrong impression of a particular term-label affinity. This is because many terms, especially general-purpose English words other than stop words, repeat across many short texts having different labels. These words therefore have low discriminative power for the labels, and hence should have a low term-label affinity score for any particular label.

To identify and ignore such words, we define a new weighting scheme called TF-ILF, which calculates the discriminative power of a word or phrase for a particular label. TF-ILF is similar to TF-IDF: just as words with a high TF-IDF weight imply a strong relationship with the document they appear in, terms with a high TF-ILF weight imply a strong relationship with the label assigned to the short texts they appear in.

TF-ILF is the product of the frequency of occurrence of a term in the short texts having a particular label and the inverse of the number of labels under whose short texts it occurs. This ensures that the more labels a term occurs under, the lower is its discriminative power, or its affinity for any particular label out of all the labels under whose short texts it appears; the inverse label frequency reduces the frequency score accordingly.

Let N_l denote the total number of distinct labels, and let n_l denote the number of labels for which a term t_j occurs in some short text. Then TF-ILF(t_j, l_i) denotes the TF-ILF score of term t_j with respect to label l_i:

\text{TF-ILF}(t_j, l_i) = FS(t_j, l_i) \times \frac{N_l}{n_l} \qquad (3.1)

So, if a term is associated with several labels, its TF-ILF score becomes low and, as a result, its capability to distinguish labels is reduced. The row corresponding to a term t_j in the Term-Label Affinity Matrix is thus a vector of scores in which each value represents the TF-ILF score of a particular label with respect to t_j.
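For illustration with hypothetical counts: suppose there are N_l = 24 distinct labels, the term 'sql' occurs in five short texts labeled 'DBMS' and under no other label (n_l = 1), while the term 'system' occurs in four 'DBMS' texts but appears under six labels in total (n_l = 6). Then TF-ILF(sql, DBMS) = 5 × 24/1 = 120, whereas TF-ILF(system, DBMS) = 4 × 24/6 = 16, so 'sql' is by far the stronger indicator of the DBMS label.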


Table 3.3 Learning Algorithm - Creation of TLM with TF-ILF as affinity score

Input: Set of labeled short texts, S
Output: Term-Label Affinity Matrix, TLM

1. Notations used:
   1.1 TLM[t_i, l_j]: Term-Label Affinity Matrix having terms as rows and labels as columns

/* Construct the term-label affinity matrix */
2. for each short text s_i ∈ S
   2.1 Extract each noun phrase n_j ∈ N in s_i
   2.3 if n_j of s_i exists in TLM
   2.4    if l_k of s_i exists in TLM[n_j]
   2.5       TLM[n_j, l_k] = TLM[n_j, l_k] + 1
   2.6 else
   2.7    TLM[n_j, l_k] = 1
   3.1 Lemmatize each single non-stop word w_j ∈ W in s_i
   3.2 if w_j of s_i exists in TLM
   3.3    if l_k of s_i exists in TLM[w_j]
   3.4       TLM[w_j, l_k] = TLM[w_j, l_k] + 1
   3.5 else
   3.6    TLM[w_j, l_k] = 1
   3.7 end
   3.8 end

/* Calculate the TF-ILF score for each term t_j in TLM */
4.1 for each t_j in TLM
4.2    for each l_i in TLM[t_j].keys()
4.3       TLM[t_j, l_i] = TLM[t_j, l_i] * num_labels / length(TLM[t_j])
4.4    end
4.5 end

As part of the learning phase, we calculate the affinity score for each term-label pair and store it in the Term-Label Affinity Matrix. The pseudo code for the learning phase of the algorithm using the TF-ILF method is shown in Table 3.3.
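As a concrete rendering of Table 3.3, the following minimal Python sketch builds the matrix and applies the TF-ILF weighting; noun-phrase extraction and lemmatization are abstracted away (the caller supplies the extracted terms), and all names are illustrative rather than the thesis code:

    from collections import defaultdict

    def learn_tlm(labeled_short_texts, num_labels):
        # labeled_short_texts: iterable of (terms, label) pairs, where terms
        # are the noun phrases and lemmatized non-stop words of one short text.
        tlm = defaultdict(lambda: defaultdict(int))
        for terms, label in labeled_short_texts:
            for term in set(terms):          # one count per short text
                tlm[term][label] += 1
        # Convert frequency scores to TF-ILF scores (Equation 3.1).
        for term, label_scores in tlm.items():
            n_l = len(label_scores)          # labels under which the term occurs
            for label in label_scores:
                label_scores[label] *= num_labels / n_l
        return tlm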

III. Labeling Phase: Assigning labels to short text

The terms in the new short text are looked up in the term-label affinity matrix, the term having the highest term-label affinity score with some label is identified, and that label is assigned to the short text. We call the corresponding term the dominant term of the short text.

The labeling phase has the following steps (a code sketch follows the list):

• In every new short text, terms are identified as in the Feature Extraction step above.

• The dominant term among the terms of the new short text that are present in the term-label matrix is identified, and the corresponding label is assigned to the short text.


Figure 3.3 Schematic Representation of Learning Framework. [An unlabeled short text passes through Feature Extraction to yield feature terms; the Labeling Algorithm looks these terms up in the Term-Label Affinity Matrix and produces the assigned label, giving a labeled short text.]

• If none of the terms of the short text are present in the term-label matrix or the term-label affinity

score is very low for the dominant term with its corresponding label, the short text is labeled as

“miscellaneous”.
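A minimal sketch of this labeling step, continuing the illustrative helpers above (the low-score threshold is an assumed parameter, since the thesis does not fix a value):

    def label_short_text(terms, tlm, min_score=1.0):
        # Pick the label of the dominant term; fall back to 'miscellaneous'
        # when no term is known or the best affinity score is too low.
        best_score, best_label = 0.0, "miscellaneous"
        for term in set(terms):
            for label, score in tlm.get(term, {}).items():
                if score > best_score:
                    best_score, best_label = score, label
        return best_label if best_score >= min_score else "miscellaneous"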

3.2.4.2 Proposed Approaches

We propose two approaches for labeling short text based on the framework proposed above. The two approaches differ in the kind of labels used to label the short text.

1. Normal Labels: Table 3.4 shows the set of labels that were assigned to the short texts in the training set. These labels are the domain names corresponding to the short texts; each short text in the dataset is assigned one of these labels. The proposed framework described in Section 3.2.4.1 is applied to this dataset, and the relationship between the terms in a short text and its assigned label is calculated to build the Term-Label Affinity Matrix.

Table 3.4 Predefined normal labels

Algorithm Optimization, Cognitive Science, Compilers, Computer Architecture, Computer Networks, Computer Graphics, Computer Vision, Data Mining, Database Management System, Image Processing, Information Retrieval, IT in Agriculture, Language Technology, Middleware System, Mobile Applications, Multi-agent Systems, Network Security, Operating System, Robotics, Search Application, Web Mining, Software Engineering, Spatial Informatics, Web Applications

2. Higher Level Labels: Some domains fall under a higher domain based on the concept hierarchy of that field of expertise. For example, in Computer Science and Engineering, ‘database management’, ‘data mining’, ‘web mining’, etc. fall under “Data Engineering”, so all the short texts with any of those labels can be assigned the label “Data Engineering”. Similarly, the short texts having the labels ‘computer vision’, ‘computer graphics’ and ‘image processing’ can be labeled “Visual IT”. Hence, we replace some of the labels in Table 3.4 by their higher domain name in


the concept hierarchy, as shown in Table 3.5.

Table 3.5 Concept Hierarchy

Normal labels                                           Higher level label
Computer Architecture, Operating System, Compilers      Computer Systems
Computer Networks, Network Security                     Computer Networks
Computer Graphics, Computer Vision, Image Processing    Visual IT
Data Mining, Database Management System, Web Mining     Data Engineering

The normal labels are replaced by the higher level labels for the respective short text in the training

set and the proposed framework is applied to this dataset to build the Term Label Affinity Matrix.

The set of labels used in this approach is shown in Table 3.6.

Table 3.6 Predefined higher level labels

Algorithm Optimization, Cognitive Science, Computer Architecture, Computer Networks, Data Engineering, IT in Agriculture, Language Technology, Middleware System, Mobile Applications, Operating System, Robotics, Search Application, Web Mining, Software Engineering, Spatial Informatics, Visual IT, Web Applications
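The label coarsening can be expressed as a simple lookup built from Table 3.5; labels outside the table keep their normal name (a sketch with an illustrative helper name):

    # Mappings taken from Table 3.5.
    CONCEPT_HIERARCHY = {
        "computer architecture": "computer systems",
        "operating system": "computer systems",
        "compilers": "computer systems",
        "network security": "computer networks",
        "computer graphics": "visual it",
        "computer vision": "visual it",
        "image processing": "visual it",
        "data mining": "data engineering",
        "database management system": "data engineering",
        "web mining": "data engineering",
    }

    def to_higher_level(label):
        return CONCEPT_HIERARCHY.get(label.lower(), label)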

3.2.5 Performance comparison related to labeling approach

We performed a series of experiments on Experience related information in two sets of resumes of

Computer Science and Engineering students. The purpose of the experiment was to label the short text

paragraphs in the experience section of each resume.

Each short text paragraph consists of only one or a few sentences, which is very small compared

to a document. We assume that all the sentences in each short text paragraph are related to the same

domain in Computer Science and Engineering. The proposed approach can be extended by breaking the short text paragraph into multiple smaller paragraphs or sentences, so as to assign multiple labels to the original short text paragraph.

We conducted our set of experiments by carrying out labeling using scores from the two weighting metrics: Frequency Score (FS) and TF-ILF. Using the FS method, the FS score is first calculated for each term-label pair; for a new short text, the term with the highest term-label FS score is identified and that label is assigned to the new short text. The TF-ILF method proceeds in the same way using the TF-ILF score for each term-label pair.


We also applied the traditional classification methods, Naive Bayes classification and Nearest Neighbor classification, on our datasets to compare the performance of the proposed approach with traditional classification methods. In Nearest Neighbor classification, we first compute similarity scores between every new short text and the short texts in the training set; we then find the k nearest neighbors of each new short text and assign it the label held by the majority of those neighbors. In Naive Bayes classification, we first calculate the probability of occurrence of a term in a short text having a particular label, and then the probability of each label given the terms in the new short text. We perform feature extraction using the same technique discussed in Section 3.2.4.1, and hence use the same set of features of each experience short text for all four algorithms.
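For reference, comparable baselines can be set up with off-the-shelf tools; the sketch below uses scikit-learn purely for illustration (the thesis used its own implementations, and the tiny training set here is invented):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    train_texts = ["tcp congestion control mechanism",
                   "multi-threaded file transfer server",
                   "extendible hashing for a dbms project",
                   "search, delete and insert records by attribute value"]
    train_labels = ["computer networks", "computer networks",
                    "database management", "database management"]

    for clf in (MultinomialNB(), KNeighborsClassifier(n_neighbors=3)):
        model = make_pipeline(CountVectorizer(), clf)
        model.fit(train_texts, train_labels)
        print(model.predict(["hash function on an attribute value"]))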

3.2.5.1 Dataset Description

Our resume dataset 1 and resume dataset 2 consisted of 106 resumes and 102 resumes, respectively, of students from two different batches. Each resume had 6 short text paragraphs on average. Each short text paragraph in each resume was manually assigned one label, as per the domain corresponding to that short text. We then apply the short text labeling technique proposed in Section 3.2.4 to learn the affinity of the terms in a short text paragraph for the label assigned to that paragraph. The experiments on the two datasets were done separately and independently of each other.

Table 3.7 shows a sample of our dataset: an Experience section of a resume, which describes the experience of a candidate along with the domain-specific label, the domains in our case being those that fall under Computer Science and Engineering. Each experience short text is enumerated. The sample has 9 short text paragraphs, which are independent of each other and are pre-processed independently. The “title” and “description” tags are ignored during pre-processing; the rest of the text is used as the short text.

Based on our preliminary experiments, we discovered that the accuracy saturated as we increased the training set beyond 30% (Table 3.8). So, we randomly choose 30% of the dataset as the training set and the rest as the test set. The results are shown as average accuracy scores over the set of experiments done on the two datasets. Note that an accuracy score equals the percentage of short text paragraphs correctly labeled.
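A minimal sketch of this evaluation protocol (random 30/70 split; accuracy as the percentage of correctly labeled paragraphs; illustrative code reusing the learn_tlm and label_short_text sketches above):

    import random

    def evaluate(dataset, train_frac=0.3, seed=0):
        # dataset: list of (terms, label) pairs, one per short text paragraph.
        data = dataset[:]
        random.Random(seed).shuffle(data)
        cut = int(train_frac * len(data))
        train, test = data[:cut], data[cut:]
        tlm = learn_tlm(train, num_labels=len({l for _, l in train}))
        correct = sum(label_short_text(t, tlm) == gold for t, gold in test)
        return 100.0 * correct / len(test)     # accuracy score in percent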


Table 3.7 Sample Experience section of a resume with corresponding subsections and their respective features

Label | Short Text

Major Projects

computer networks
1. title: secure application design with focus on performance
   description: tcp congestion control mechanism can be detrimental in the face of active attacks of adversaries. an application that is completely dependant on tcp can suffer due to such attacks. we will examine the effect of such attacks in a secure application and propose a design that will overcome these limitations.

database management
2. title: implemented extendible hashing as a dbms project.
   description: given a relation hash its records using extendible hashing technique and store it in a separate file then we can search, delete and insert a record on the basis of attribute value on which hash function has been applied.

search application
3. title: flexi-search engine
   description: this project develops a tool, engine, which will be able to search documents regardless of the type of heterogeneity of the data in them.

language technology
4. title: ssf visualization and system analysis tool
   description: this system takes as input a ssf (shakti standard format) file and displays the output tree corresponding to that file. it also compare two ssf files and display the two trees along with differences highlighted in different color.

computer networks
5. title: multi-threaded file transfer
   description: this project involves sending a file from server side to the client. each packet has the information regarding parity and size. there is separate transmission thread for sending the corrupted packets again.

Minor Projects

computer graphics   6. description: game of carrom using open gl, c++
computer graphics   7. description: solar system using sdl, c++
web application     8. description: developed web interfaces using php, javascript, ajax, mysql.
operating system    9. description: implemented unix utilities like top, talk, shell.

Table 3.8 Training/Testing dataset ratio

Training   Testing   Accuracy
20%        80%       67.5
30%        70%       70.0
40%        60%       69.4
50%        50%       71.8
60%        40%       70.0
70%        30%       70.3
80%        20%       69.7


Figure 3.4 Average accuracy scores by various algorithms (KNN, Naive Bayes, FS, TF-ILF) for labeling short text paragraphs using normal labels. [Two bar-chart panels: Experiments on Resume Dataset 1 and Experiments on Resume Dataset 2; y-axis: accuracy scores in percentage, 0-100.]

3.2.5.2 Experiments 1: Normal labels

We conduct a series of experiments on the Experience section of the resume datasets, using the set of normal labels (Table 3.4). It can be observed from Figure 3.4 that the proposed TF-ILF method assigns labels correctly in about 70% of the cases.

3.2.5.3 Experiments 2: Higher level labels

We conduct a series of experiments similar to those in Section 3.2.5.2 on the Experience section of the resume datasets, using the set of higher level labels (Table 3.6). It can be observed from Figure 3.5 that, with higher level labels, the proposed TF-ILF method assigns labels correctly in about 80% of the cases. This is because a domain higher in the concept hierarchy of a field of expertise has a larger set of corresponding terms to distinguish it from the other domains at that level, as compared to the specific domains.

In all our experiments, TF-ILF outperforms the Frequency Score (FS), Naive Bayes (NB) and K Nearest Neighbor (KNN) methods because the Inverse Label Frequency (ILF) factor filters out the common terms. Overall, the proposed TF-ILF labeling method has the potential to assign labels to short text paragraphs and to very short texts such as tweets.


Figure 3.5 Average accuracy scores by various algorithms (KNN, Naive Bayes, FS, TF-ILF) for labeling short text paragraphs using higher level labels. [Two bar-chart panels: Experiments on Resume Dataset 1 and Experiments on Resume Dataset 2; y-axis: accuracy scores in percentage, 0-100.]

3.2.6 Discussion on errors

The proposed short text labeling algorithm identifies a dominant term in the new short text based on the scores in the term-label affinity matrix calculated in the learning phase using the TF-ILF score. The label corresponding to this dominant term is then assigned to the short text. In a few cases, the label corresponding to the dominant term may not be the best label for the short text. For example, in Table 3.2 in Section 3.2.3, P1 and P2 are assigned the same label, i.e., ‘DBMS’. But if ‘query’ is the dominant term in P2 and the term-label affinity score for ‘query’ is highest with the label ‘search application’ and not ‘DBMS’, then P2 would not be labeled correctly. On analyzing the behavior of our algorithm, we concluded that such cases, where the same term is popularly used in more than one domain, are the reason our accuracy scores are not higher. Even though such cases are few, further analysis could enhance the accuracy scores.

3.3 Extracting Special Experience Information from Resumes

We now employ the proposed short text labeling approach and the improved special feature extraction algorithm to extract special experience related information from the resumes. After extracting the experience related sections of the resumes, we divide each experience related text into different paragraphs


and apply the proposed labeling algorithm. The output is a set of labels for each experience related text. Next, we apply the special feature extraction approach to organize the information; a sketch of the overall pipeline follows.
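A minimal sketch of this pipeline, with an assumed paragraph splitter and the labeling helper from the Section 3.2.4 sketches (names are illustrative, not the thesis code):

    def extract_paragraphs(text):
        # Simplified splitter: paragraphs are separated by blank lines.
        return [p.strip() for p in text.split("\n\n") if p.strip()]

    def label_resume_experience(resumes, tlm, label_fn):
        # resumes: dict of resume id -> experience section text;
        # label_fn: e.g. label_short_text from the labeling sketch above.
        # Returns resume id -> set of labels, which then act as the features
        # for the special feature extraction approach.
        return {rid: {label_fn(p.split(), tlm)
                      for p in extract_paragraphs(text)}
                for rid, text in resumes.items()}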

We applied the proposed framework to the resume dataset of 106 students. Table 3.1 shows the Experience section of a typical resume; each experience short text is enumerated, and it can be observed that there are 9 short texts describing the experience of the candidate. We extracted the short texts of all the resumes, applied the proposed labeling approach, and obtained the labels.

Table 3.9 shows the organization of experience types (normal labels) (set F) using the improved special feature extraction approach. The rf value obtained was 32% (Table 3.11). The total number of experience type features present was 607 and the number of features displayed to the user is 415. The I-level shows the common experience types (labels), the II-level shows the common cluster experience types (labels) for each cluster of resumes, and the III-level shows the special experience types (labels) for each resume. It can be observed that no I-level feature is found. Table 3.10 shows the feature statistics related to Table 3.9.

Table 3.13 shows the organization of experience types (higher level labels) (set F) using the special feature extraction approach. The total number of experience type features present was 490 and the number of features displayed to the user is 292. By using the set of higher level labels, we obtained an rf value of 41% (Table 3.12). Table 3.14 shows the feature statistics related to Table 3.13.
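These values are consistent with the reduction factor used in the earlier chapters, assuming rf = (|F| - Σ F(i)) / |F|, i.e., the fraction of features the screener no longer needs to examine: (607 - 415) / 607 ≈ 0.32 for normal labels and (490 - 292) / 490 ≈ 0.41 for higher level labels.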

Note that in Tables 3.9 and 3.13 we have abbreviated some labels for better presentation; for example, ‘image processing’ is presented as ‘img proc’.

3.4 Summary of the Chapter

In this chapter, we have made an effort to identify special information from the Experience section of a resume. For this, we have proposed a short text labeling approach that exploits the notion of a term-label affinity matrix. By applying the proposed short text labeling approach, we converted experience related text into labels. Using the labels as the features of the experience related section, we extended the special feature extraction approach and organized the features to identify special information. We could achieve reasonable values of the reduction factor. The results also indicated that the performance improves for higher level labels.

Overall, the results indicate that the proposed labeling approach and feature extraction approach provide the scope to identify special information from resumes for efficient processing.


Table 3.9 Organization of features (experience types) using improved special feature extraction approach with normal labels

Common Features (I-level): none

Common cluster features (II-level), followed by ResumeId: Special Features (III-level)

web app, comp net
  R35: search app, img proc, software engg
  R45: comp graphics, algo opti
  R5: lang tech, netw sec
  R65: lang tech, comp graphics
  R69: img proc, info retr, comp graphics, software engg
  R7: comp graphics, dbms, middleware sys
  R73: netw sec
  R79: dbms, data mining
  R86: dbms, search app, netw sec
  R87: dbms
  R92: img proc, robotics
  R94: netw sec, comp net, web app, comp graphics, dbms

dbms, comp graphics
  R24: lang tech, comp net, middleware sys
  R25: comp vis, img proc
  R32: web mining
  R34: web app, comp arch, oper sys
  R43: IT in agri
  R48: mobile app, netw sec, web app
  R50: oper sys, web app
  R75: multi-agent sys, comp net, oper sys
  R83: netw sec, web app
  R89: netw sec, middleware sys, web app

oper sys, comp graphics
  R17: lang tech, web mining, web app, dbms
  R30: data mining, algo opti, lang tech, dbms, oper sys
  R40: algo opti, comp net, oper sys, search app
  R46: algo opti, comp vis, img proc
  R49: comp net, netw sec
  R66: lang tech, cog sci, compilers, comp net
  R77, R84: lang tech, web app, dbms, compilers
  R99: lang tech, algo opti, web app, dbms, oper sys, compilers

web app, dbms
  R14: img proc, search app
  R2: graph theory, algo opti, cog sci
  R52: lang tech, web mining, middleware sys
  R58: netw sec, middleware sys, IT in Agri, web app, dbms, comp graphics, oper sys
  R59: oper sys, web app, dbms, algo opti
  R68: comp vis, img proc
  R80: img proc, comp graphics, cog sci, robotics, middleware sys
  R98: middleware sys

data mining, web app
  R1: comp arch, lang tech, dbms, comp net
  R11: comp arch, robotics, comp graphics, dbms
  R15: dbms, IT in Agri, web mining, comp graphics
  R21: comp vis, img proc
  R51: comp net, comp graphics, netw sec, dbms
  R55: comp graphics, img proc, dbms, comp net, oper sys
  R70: search app, comp net, dbms, oper sys, comp graphics
  R97: dbms, comp net, comp graphics, oper sys, web mining

comp net, dbms
  R36: middleware sys, lang tech, mobile app, multi-agent sys
  R12: middleware sys, lang tech
  R53: middleware sys, lang tech, comp graphics, multi-agent sys
  R103: middleware sys, lang tech, comp graphics, search app
  R74: middleware sys, lang tech, comp graphics, mobile app
  R10: middleware sys, netw sec, comp graphics, oper sys
  R37: IT in agri


Table 3.9 (continued)

search app, dbms
  R101: netw sec, lang tech, data mining
  R102: trans mngmnt, software engg
  R105: comp net, web app, IT in Agri, lang tech
  R29: comp net, lang tech, comp graphics, web app, oper sys
  R31: mobile app, search app, netw sec, middleware sys
  R63: mobile app, IT in Agri, comp graphics, oper sys, web app
  R81: comp vis, web app, comp graphics, oper sys

dbms, mobile app, comp graphics
  R26: web app, oper sys, comp arch
  R39: comp net, data mining, search app, web app, middleware sys, algo opti
  R41: web app, IT in Agri, comp net, mobile app, dbms, comp graphics, middleware sys
  R47: comp vis, data mining, IT in Agri, algo opti
  R76: comp arch, mobile app, data sim, web app, comp net, dbms, comp graphics
  R6: img proc, middleware sys, search app

comp net, web app, dbms
  R20: oper sys, cog sci, software engg
  R104: lang tech, robotics, comp graphics, oper sys
  R54: spat info, netw sec, data comp
  R67: lang tech, oper sys, comp arch, comp graphics
  R71: IT in Agri, text mining, dbms, comp graphics, lang tech
  R82: netw sec, oper sys, comp graphics, image proc

oper sys, web app, comp graphics
  R16: mobile app, img proc, comp net, netw sec
  R18: comp net, middleware sys, netw sec, web mining
  R23: algo opti, robotics, comp arch, dbms
  R3: algo opti, comp graphics, dbms, oper sys, middleware sys
  R90: comp net, middleware sys, software engg

comp vis, img proc, dbms
  R22: algo opti
  R27: trans mngmnt, web mining, dbms, search app
  R42: comp net, data mining, middleware sys, search app
  R64: comp graphics, web app, dbms, search app
  R88: lang tech

oper sys, netw sec
  R60: img proc, IT in Agri, middleware sys, dbms, comp graphics
  R8: data mining, IT in Agri, web mining, dbms, comp graphics, middleware sys
  R91: IT in Agri, lang tech, web mining, comp net, dbms, netw sec, data mining, text mining
  R93: algo opti

dbms, comp net, comp graphics, oper sys
  R44: comp vis, img proc, search app, comp arch
  R56: mobile app, search app, lang tech, web app, image proc
  R78: lang tech, data mining, mobile app, img proc
  R95: multi-agent sys, web mining, comp net, middleware sys, lang tech, comp arch

comp vis, comp graphics, img proc, search app
  R100: web app, dbms, lang tech, oper sys
  R96: robotics, comp arch, oper sys, dbms, data mining
  R9: mobile app, web app, middleware sys, comp net

web mining, comp graphics
  R57: mobile app, info retr, robotics, web app, IT in Agri, data mining, img proc, dbms, comp net, oper sys
  R62: middleware sys, data mining
  R72: IT in Agri, search app, web app, middleware sys, oper sys, netw sec

IT in Agri
  R106: mobile app, web app
  R13: none
  R33: web app, middleware sys, robotics

lang tech
  R85: graph theory
  R19: none

none
  R4: mobile app, middleware sys
  R61: robotics, cog sci, netw sec, comp arch, text mining, dbms, lang tech
  R38: algo opti, spat info, comp vis, web app, netw sec
  R28: compilers, lang tech


Table 3.10 Experience Feature Statistics with normal labels

Number of Common Features = 0
Number of Clusters = 21
Features in Cluster 1 = 2, Resumes in Cluster 1 = 12;   Features in Cluster 12 = 2, Resumes in Cluster 12 = 4
Features in Cluster 2 = 2, Resumes in Cluster 2 = 10;   Features in Cluster 13 = 4, Resumes in Cluster 13 = 4
Features in Cluster 3 = 2, Resumes in Cluster 3 = 9;    Features in Cluster 14 = 4, Resumes in Cluster 14 = 3
Features in Cluster 4 = 2, Resumes in Cluster 4 = 8;    Features in Cluster 15 = 2, Resumes in Cluster 15 = 3
Features in Cluster 5 = 2, Resumes in Cluster 5 = 8;    Features in Cluster 16 = 1, Resumes in Cluster 16 = 3
Features in Cluster 6 = 2, Resumes in Cluster 6 = 7;    Features in Cluster 17 = 1, Resumes in Cluster 17 = 2
Features in Cluster 7 = 2, Resumes in Cluster 7 = 7;    Features in Cluster 18 = 0, Resumes in Cluster 18 = 1
Features in Cluster 8 = 3, Resumes in Cluster 8 = 4;    Features in Cluster 19 = 0, Resumes in Cluster 19 = 1
Features in Cluster 9 = 3, Resumes in Cluster 9 = 6;    Features in Cluster 20 = 0, Resumes in Cluster 20 = 1
Features in Cluster 10 = 3, Resumes in Cluster 10 = 6;  Features in Cluster 21 = 0, Resumes in Cluster 21 = 1
Features in Cluster 11 = 3, Resumes in Cluster 11 = 5
Total Common Cluster Features = 40
Special Features = 375
Total Features displayed = 0 (Common Features) + 40 (Common Cluster Features) + 375 (Special Features) = 415
F = (0 * 106) + (2*12 + 2*10 + 2*9 + 2*8 + 2*8 + 2*7 + 2*7 + 3*4 + 3*6 + 3*6 + 3*5 + 2*4 + 4*4 + 4*3 + 2*3 + 1*3 + 1*2) + 375 = 607

Table 3.11 Reduction Factor values for experience types (normal labels)

Feature Type        |F|    Σ_{i=1}^{L} F(i)    rf
Experience Types    607    415                 0.32

Table 3.12 Reduction Factor values for experience types (higher level labels)

Feature Type        |F|    Σ_{i=1}^{L} F(i)    rf
Experience Types    490    292                 0.41


Table 3.13 Organization of features (experience types) using improved special feature extraction approach with higher level labels

Common Features (I-level): none

Common cluster features (II-level), followed by ResumeId: Special Features (III-level)

comp net, visual IT, web app
  R18: middleware sys, data engg, comp sys
  R45: algo opti
  R51, R83, R94: data engg
  R65: lang tech
  R7, R89: data engg, middleware sys
  R90: comp sys, middleware sys, software engg
  R92: robotics

visual IT, data engg, comp systems
  R17, R77, R84: lang tech, web app
  R30: lang tech, algo opti
  R34, R50: web app
  R99: lang tech, web app, algo opti

comp net, mobile app, web app, visual IT
  R16: comp sys
  R39: data engg, search app, middleware sys, algo opti
  R41: IT in Agri, comp net, mobile app, data engg, middleware sys
  R48: data engg

comp net, data engg, middleware sys
  R10: visual IT, comp sys
  R103: visual IT, lang tech, search app, comp net
  R24: visual IT, lang tech
  R12: lang tech
  R42: visual IT, search app

visual IT, web app, data engg
  R14: search app
  R15: IT in agri
  R64: search app
  R21, R68: none

data engg, web app, comp sys
  R1: lang tech, comp net
  R100: search app, visual IT, lang tech
  R105: comp net, search app, IT in agri, lang tech
  R29: comp net, data engg, search app, lang tech, visual IT
  R59: algo opti

comp net, data engg, lang tech
  R101: search app
  R104: web app, robotics, visual IT, comp sys
  R61: robotics, cog sci, comp sys
  R91: IT in agri, comp sys

search app, visual IT, data engg
  R27: trans management
  R70: comp net, web app, comp sys
  R81: web app, comp sys
  R96: comp sys, robotics

comp net, comp sys, visual IT
  R49: none
  R55: data engg, web app
  R75, R82, R97: data engg, multi-agent sys

data engg, multi-agent sys, comp net, middleware sys, lang tech
  R36: mobile app
  R53: visual IT
  R95: visual IT, comp sys

web app, data engg, middleware sys
  R52: lang tech
  R98: none

comp net, web app, data engg, visual IT, lang tech
  R67: comp sys
  R71: IT in agri

comp net, mobile app, data engg, visual IT, lang tech
  R74: middleware sys
  R78: comp sys

web app, visual IT, comp net, soft engg
  R35: search app
  R69: info ret

mobile app, search app, middleware sys, data engg
  R31: comp net
  R6: visual IT

data engg, visual IT
  R25, R32: none

algo opti, visual IT, search app, comp sys, comp net
  R44: data engg
  R40: none

web app, data engg, visual IT, comp sys
  R11: robotics
  R26: mobile app


Table 3.13 (continued)

web app, algo opti, visual IT, data engg, comp sys
  R23: robotics
  R3: middleware sys

comp sys, web app, data engg, visual IT, comp net
  R56: mobile app, search app, lang tech
  R58: middleware sys, IT in Agri
  R76: mobile app, data sim

search app, data engg, IT in Agri, visual IT, comp sys, web app
  R63: mobile app
  R72: comp net, middleware sys

data engg, IT in Agri, comp sys, visual IT, comp net, middleware sys
  R8, R60: none

web app, data engg, comp net
  R79, R87: none
  R86: search app

none
  R102: trans mngmnt, data engg, search app, software engg
  R106: mobile app, IT in Agri, web app
  R93: algo opti, comp net, comp sys
  R13: IT in Agri
  R4: mobile app, middleware sys
  R5: comp net, web app, lang tech
  R2: algo opti, cog sci, data engg, web app
  R30: middleware sys, data engg, visual IT
  R20: comp net, web app, data engg, comp sys, cog sci, software engg
  R22: visual IT, algo opti, data engg
  R46: algo opti, visual IT, comp sys
  R19: lang tech
  R88: visual IT, data engg, lang tech
  R47: visual IT, data engg, IT in Agri, mobile app, algo opti
  R73: comp net, web app
  R38: algo opti, spat info, visual IT, web app, comp net
  R66: lang tech, cog sci, comp sys, comp net, visual IT
  R43: IT in Agri, data engg, visual IT
  R28: comp sys, lang tech, data engg
  R80: web app, visual IT, cog sci, robotics, data engg, middleware sys
  R37: comp net, IT in Agri, data engg
  R85: algo opti, lang tech
  R57: mobile app, info retr, robotics, web app, IT in Agri, data engg, visual IT, comp net, comp sys
  R54: comp net, web app, data engg, spat info, data comp
  R33: web app, IT in Agri, middleware sys, robotics


Table 3.14 Experience Feature Statistics with higher level labels

Number of Common Features = 0
Number of Clusters = 48
Features in Cluster 1 = 3, Resumes in Cluster 1 = 10;   Features in Cluster 24 = 0, Resumes in Cluster 24 = 1
Features in Cluster 2 = 3, Resumes in Cluster 2 = 7;    Features in Cluster 25 = 0, Resumes in Cluster 25 = 1
Features in Cluster 3 = 4, Resumes in Cluster 3 = 4;    Features in Cluster 26 = 0, Resumes in Cluster 26 = 1
Features in Cluster 4 = 3, Resumes in Cluster 4 = 5;    Features in Cluster 27 = 0, Resumes in Cluster 27 = 1
Features in Cluster 5 = 3, Resumes in Cluster 5 = 5;    Features in Cluster 28 = 0, Resumes in Cluster 28 = 1
Features in Cluster 6 = 3, Resumes in Cluster 6 = 5;    Features in Cluster 29 = 0, Resumes in Cluster 29 = 1
Features in Cluster 7 = 3, Resumes in Cluster 7 = 4;    Features in Cluster 30 = 0, Resumes in Cluster 30 = 1
Features in Cluster 8 = 3, Resumes in Cluster 8 = 4;    Features in Cluster 31 = 0, Resumes in Cluster 31 = 1
Features in Cluster 9 = 3, Resumes in Cluster 9 = 5;    Features in Cluster 32 = 0, Resumes in Cluster 32 = 1
Features in Cluster 10 = 5, Resumes in Cluster 10 = 3;  Features in Cluster 33 = 0, Resumes in Cluster 33 = 1
Features in Cluster 11 = 3, Resumes in Cluster 11 = 2;  Features in Cluster 34 = 0, Resumes in Cluster 34 = 1
Features in Cluster 12 = 5, Resumes in Cluster 12 = 2;  Features in Cluster 35 = 0, Resumes in Cluster 35 = 1
Features in Cluster 13 = 5, Resumes in Cluster 13 = 2;  Features in Cluster 36 = 0, Resumes in Cluster 36 = 1
Features in Cluster 14 = 4, Resumes in Cluster 14 = 2;  Features in Cluster 37 = 0, Resumes in Cluster 37 = 1
Features in Cluster 15 = 4, Resumes in Cluster 15 = 2;  Features in Cluster 38 = 0, Resumes in Cluster 38 = 1
Features in Cluster 16 = 2, Resumes in Cluster 16 = 2;  Features in Cluster 39 = 0, Resumes in Cluster 39 = 1
Features in Cluster 17 = 5, Resumes in Cluster 17 = 2;  Features in Cluster 40 = 0, Resumes in Cluster 40 = 1
Features in Cluster 18 = 4, Resumes in Cluster 18 = 2;  Features in Cluster 41 = 0, Resumes in Cluster 41 = 1
Features in Cluster 19 = 5, Resumes in Cluster 19 = 2;  Features in Cluster 42 = 0, Resumes in Cluster 42 = 1
Features in Cluster 20 = 5, Resumes in Cluster 20 = 3;  Features in Cluster 43 = 0, Resumes in Cluster 43 = 1
Features in Cluster 21 = 6, Resumes in Cluster 21 = 2;  Features in Cluster 44 = 0, Resumes in Cluster 44 = 1
Features in Cluster 22 = 6, Resumes in Cluster 22 = 2;  Features in Cluster 45 = 0, Resumes in Cluster 45 = 1
Features in Cluster 23 = 3, Resumes in Cluster 23 = 3;  Features in Cluster 46 = 0, Resumes in Cluster 46 = 1
Features in Cluster 47 = 0, Resumes in Cluster 47 = 1;  Features in Cluster 48 = 0, Resumes in Cluster 48 = 1
Total Common Cluster Features = 90
Special Features = 202
Total Features displayed = 0 (Common Features) + 90 (Common Cluster Features) + 202 (Special Features) = 292
F = (0 * 106) + (3*10 + 3*7 + 4*4 + 3*5 + 3*5 + 3*5 + 3*4 + 3*5 + 3*5 + 5*3 + 3*2 + 5*2 + 5*2 + 4*2 + 4*2 + 2*2 + 5*2 + 4*2 + 5*2 + 5*3 + 6*2 + 6*2 + 3*3) + 202 = 490


Chapter 4

Conclusion and future work

Several commercial products [10][11][20] that are available on the web are employed by companies

for effective processing of resumes. Researchers are making efforts to help improve the process of

resume selection by extending data mining/information extraction methods. In this thesis, we have

extended the notion of special features, which has been proposed to improve the performance of product

selection in the e-commerce environment [22], to process resumes in a more effective manner.

There are several issues in resume processing. One of the issues is the processing of similar resumes. Normally, companies face the problem of selecting appropriate resumes from thousands of similar resumes (of students of the same batch or having the same expertise). For this problem, we have made an effort to extend the notion of special features and propose a solution. Extending the notion of special features to the resume selection problem is not simple, as a resume contains text paragraphs and complex features like multi-word terms (n-grams). In addition, the information in a resume is organized into different sections, and each section contains different types of features.

In this chapter, we summarize the contributions and mention some of the related future research

problems.

• In the second chapter, we have analyzed the existing special feature extraction algorithm proposed for the product selection environment, and we proposed a modified algorithm based on the notion of a quality threshold. We have extended the notion of special features to the ‘skills’ section of resumes. It was identified that the skills section contains a substructure, which has a set of skill types, and each skill type contains some skill values. We proposed a framework to extract special features from the skill type data and observed that the proposed framework could significantly reduce the effort of resume processing in identifying special skill types. A similar framework was applied to each


skill type to identify its special skill values. It was also shown through experimental results that the proposed framework could achieve significant savings in the effort to identify resumes with special skills. Based on the results, it can be concluded that the proposed framework shows great promise in reducing the effort to identify resumes with special skill types and skill values.

• In the third chapter, we have made an effort to extend special features to the ‘experience’ section of resumes. It was observed that the experience section contains a set of paragraphs, where each paragraph is related to a distinct kind of experience, so it is not possible to extend the feature selection framework directly. We therefore investigated approaches to convert each paragraph into a set of features. On surveying the literature, it was found that several research efforts have been made in the context of the web to identify labels for short paragraphs; the paragraphs in resumes are similarly small. We have proposed an improved short text labeling approach based on the notion of term-label affinity and showed that it improves the performance over the existing labeling approaches in the literature. We have proposed two approaches: a normal short text labeling algorithm, which assigns labels based on the notion of term-label affinity, and a short text labeling algorithm based on the same notion that assigns higher level labels. We then extended these algorithms to convert the text paragraphs of the experience portion of resumes into features (the assigned labels). First, we discussed the framework to extract special features from the experience section. We have also shown the performance results obtained by using the proposed algorithms for assigning normal labels and higher level labels. The results show that both algorithms significantly reduce the effort to process the resume text in identifying special information. Of the two, the higher level labels based algorithm gives higher performance but provides higher level special features. Based on the results, we can conclude that the proposed approach provides the scope to design resume processing systems for identifying special experience information.

As already mentioned, extracting information from resumes for effective processing is a complex problem. We have only made a small effort and shown the scope for extracting special information for resume selection. Further investigation is required to resolve the related research issues and to build an effective resume processing system. Some of the future directions of this research work are as follows.


• As we have already shown, the use of a concept hierarchy can improve the performance of identifying special experience information. It is interesting to see how concept hierarchies, taxonomy structures, dictionaries and other related repositories can be used to convert resume information into different conceptual levels and to extract the corresponding special information for effective decision making.

• In this thesis, we have only investigated extracting special information from the skills and experience sections, considering them separately. In addition to skills and experience, a resume also contains ‘education’, ‘achievements’ and other information. Investigating approaches to identify special information from the other sections of a resume, and developing an integrated framework that considers all the sections of a resume for effective decision making, is an interesting research problem.

• In this thesis, we only extracted special information for effective resume selection. Extracting other kinds of information by applying data mining or text mining methods to help the resume selection process is one of the future research directions.

• Several kinds of information can be extracted from the different sections of a resume, and for effective decision making the information from all sections has to be considered together. Since there are thousands of resumes, investigating effective visualization techniques to enable fast processing of thousands of resumes is a research challenge.


Chapter 5

Publications

1. Sumit Maheshwari, Abhishek Sainani, and P. Krishna Reddy; An Approach to Extract Special

Skills to Improve the Performance of Resume Selection, 6th International Workshop on Databases

in Networked Information Systems (DNIS 2010), The University of Aizu, JAPAN, Mar 29-31,

2010, Lecture Notes in Computer Science, vol. 5999, Springer-Verlag, 2010.

2. Abhishek Sainani, P. Krishna Reddy; Categorizing short text using term label affinity learning,

12th International Conference on Web-Age Information Management (WAIM 2011).

(Under review)


Chapter 6

Appendix

The tables for the feature statistics and feature organization of the other skill value features are presented here. Table 6.1 and Table 6.2 show the feature organization and feature statistics, respectively, for the Server Side Scripting skill value features. Similarly, Table 6.3 through Table 6.20 show the feature statistics and feature organization for the Operating Systems, Assembly Language, Web Technologies, Mobile Platforms, Scripting Languages, Compiler Tools, Software Tools, Libraries/APIs and IDEs skill value features.

Other Skill Types: middleware technologies, documentation, object oriented analysis and design, cms, version control system, open source frameworks, microsoft tools and services, java technologies, frameworks and content management systems, server technologies, virtualization tech. and tools, and open source tools are each present in only one resume.

Table 6.1 Organization of features (skill value :: Server Side Scripting) using three-level approach for Resume data-set

Common Features (I-level): none

Common cluster features (II-level), followed by ResumeId: Special Features (III-level)

cgi, modpython
  R2: jsp
  R6, R7, R8: php
  R1, R5: none

jsp, php
  R3: asp, .net
  R9: servlets, cgi

none
  R4: squid, dovecot, bind, apache, iptables


Table 6.2 Skill Value (Server Side Scripting) Feature Statistics

Number of Common Features = 0
Number of Clusters = 3
Features in Cluster 1 = 2, Resumes in Cluster 1 = 6;  Features in Cluster 3 = 0, Resumes in Cluster 3 = 1
Features in Cluster 2 = 2, Resumes in Cluster 2 = 2
Total Common Cluster Features = 4
Special Features = 11
Total Features displayed = 0 (Common Features) + 4 (Common Cluster Features) + 11 (Special Features) = 15
F = (0 * 9) + (2 * 6 + 2 * 2) + 11 = 27

Table 6.3 Organization of features (skill value :: operating systems) using three-level approach for Resume data-set

Common Features (I-level): none

Common cluster features (II-level), followed by ResumeId: Special Features (III-level)

win 2000, win xp, win vista, linux
  R11, R12, R20, R35, R48, R54, R63, R67, R70, R82, R92: none
  R69: win 9x, mac
  R50: win serv 2008
  R14, R16, R17, R18, R28, R34, R49, R59, R60, R62, R71, R75, R80, R95: win 9x

win, linux
  R1, R106, R2, R22, R30, R32, R33, R4, R47, R51, R58, R64, R65, R68, R73, R77, R86, R90, R93, R94: none

win xp, win vista, linux
  R10, R103, R104, R19, R45, R56, R57, R6, R66, R7, R76, R78, R85, R88: none

win 9x, win 2000, win xp, linux
  R13, R15, R36, R43, R53, R55, R61, R72, R74, R84, R98: none
  R26: win 2003 serv
  R25: win vista

win xp, linux
  R100, R101, R21, R24, R27, R38, R42, R44, R52, R91: none

win 2000, win xp, linux
  R23, R3, R46, R81, R83, R9, R97: none

linux, win xp, win 9x
  R79, R8, R89: none

none
  R5: linux fedora, linux suse, linux redhat, win vista, win xp, win 9x
  R105: linux, solaris, win xp
  R41: win xp, win vista, linux fedora, linux ubuntu
  R39: win vista, linux
  R40: win 9x, win 2000, linux, unix, solaris
  R29: unix, linux, win xp, win vista
  R99: linux fedora, win
  R96: linux, win 98, win xp, win vista
  R102: linux fedora, win xp
  R37: linux, win 2000, win 2003 serv, win xp
  R31: linux, unix, win, win ce 5.0
  R87: win 7


Table 6.4 Skill Value (Operating System) Feature Statistics

Number of Common Features = 0
Number of Clusters = 19
Features in Cluster 1 = 4, Resumes in Cluster 1 = 28;  Features in Cluster 11 = 0, Resumes in Cluster 11 = 1
Features in Cluster 2 = 2, Resumes in Cluster 2 = 20;  Features in Cluster 12 = 0, Resumes in Cluster 12 = 1
Features in Cluster 3 = 3, Resumes in Cluster 3 = 14;  Features in Cluster 13 = 0, Resumes in Cluster 13 = 1
Features in Cluster 4 = 4, Resumes in Cluster 4 = 13;  Features in Cluster 14 = 0, Resumes in Cluster 14 = 1
Features in Cluster 5 = 2, Resumes in Cluster 5 = 10;  Features in Cluster 15 = 0, Resumes in Cluster 15 = 1
Features in Cluster 6 = 3, Resumes in Cluster 6 = 7;   Features in Cluster 16 = 0, Resumes in Cluster 16 = 1
Features in Cluster 7 = 3, Resumes in Cluster 7 = 3;   Features in Cluster 17 = 0, Resumes in Cluster 17 = 1
Features in Cluster 8 = 0, Resumes in Cluster 8 = 1;   Features in Cluster 18 = 0, Resumes in Cluster 18 = 1
Features in Cluster 9 = 0, Resumes in Cluster 9 = 1;   Features in Cluster 19 = 0, Resumes in Cluster 19 = 1
Features in Cluster 10 = 0, Resumes in Cluster 10 = 1
Total Common Cluster Features = 21
Special Features = 47
Total Features displayed = 0 (Common Features) + 21 (Common Cluster Features) + 47 (Special Features) = 68
F = (0 * 106) + (4 * 28 + 2 * 20 + 3 * 14 + 4 * 13 + 2 * 10 + 3 * 7 + 3 * 3) + 47 = 296

Table 6.5 Organization of features (skill value :: Assembly Language) using three-level approach for Resume data-set

Common Features (I-level): none

Common cluster features (II-level), followed by ResumeId: Special Features (III-level)

mips, 8086
  R4: 8051
  R1, R6: none

mips
  R2, R5: none

none
  R3: masm

Table 6.6 Skill Value (Assembly Language) Feature Statistics

Number of Common Features = 0
Number of Clusters = 3
Features in Cluster 1 = 2, Resumes in Cluster 1 = 3;  Features in Cluster 3 = 0, Resumes in Cluster 3 = 1
Features in Cluster 2 = 1, Resumes in Cluster 2 = 2
Total Common Cluster Features = 3
Special Features = 2
Total Features displayed = 0 (Common Features) + 3 (Common Cluster Features) + 2 (Special Features) = 5
F = (0 * 6) + (2 * 3 + 1 * 2) + 2 = 10


Table 6.7 Organization of features (skill value :: web technologies) using three-level approach for Resume data-set

Common Features (I-level): html

Common cluster features (II-level), followed by ResumeId: Special Features (III-level)

cgi, php, css
  R11, R68, R79: xml
  R12: xml, javascript, modpython
  R14: javascript, modpython
  R33: javascript, ajax, xml
  R50: ajax, javascript, modpython, xml
  R56: xhtml
  R69, R71: javascript, jsp, xml
  R5, R65, R67, R81, R90: none
  R72, R84: xml
  R75: javascript, xml
  R98: python

xml, cgi, php
  R27: css, ajax, python
  R49: xhtml
  R54: jsp, servlets, css, ajax
  R63: jsp
  R76: css, ibm mq
  R82: mod python
  R85: none

cgi
  R25, R46, R47, R59, R95, R9: none

cgi, php
  R15, R3, R35, R79, R8, R88: none

cgi, modpython, php
  R20, R55, R96: none

javascript, servlets, jsp, xml, css
  R1: cgi, php, ajax
  R41: cgi, modpython
  R93: xhtml, ajax

cgi, javascript
  R18, R28, R43: none

cgi, servlets, jsp
  R60: php
  R102: j2ee

javascript
  R4, R103: none

css, php
  R17, R97: none

cgi, modpython
  R100, R61: none

javascript, css, mod python, cgi
  R21: ajax
  R91: xml

xml
  R19, R58: none

asp, .net, php, css, ajax
  R74: xhtml, javascript
  R85: xml

xml, css
  R51, R6: none

xhtml
  R34, R57: none

php
  R24, R42: none

cgi, xml
  R66, R62: none

none
  R16: css, ajax
  R45: xhtml, cgi, php
  R10: python, cgi
  R64: php, cgi, javascript
  R22: javascript, ajax
  R70: cgi, xml, css
  R52: xml, php, jsp
  R77: php, servlets, xml
  R78: xhtml, css, adobe flex, javascript, php, actionscript
  R101: xml, php
  R37: css, cgi, asp, .net
  R106: xml, php, servlets, ajax, gwt
  R87: css, javascript, adobe flex
  R104: css, xml, javascript


Table 6.8 Skill Value (Web Technologies) Feature Statistics

Number of Common Features = 1
Number of Clusters = 32
Features in Cluster 1 = 3, Resumes in Cluster 1 = 19;  Features in Cluster 17 = 1, Resumes in Cluster 17 = 2
Features in Cluster 2 = 3, Resumes in Cluster 2 = 7;   Features in Cluster 18 = 2, Resumes in Cluster 18 = 2
Features in Cluster 3 = 1, Resumes in Cluster 3 = 6;   Features in Cluster 19 = 0, Resumes in Cluster 19 = 1
Features in Cluster 4 = 2, Resumes in Cluster 4 = 6;   Features in Cluster 20 = 0, Resumes in Cluster 20 = 1
Features in Cluster 5 = 3, Resumes in Cluster 5 = 3;   Features in Cluster 21 = 0, Resumes in Cluster 21 = 1
Features in Cluster 6 = 5, Resumes in Cluster 6 = 3;   Features in Cluster 22 = 0, Resumes in Cluster 22 = 1
Features in Cluster 7 = 2, Resumes in Cluster 7 = 3;   Features in Cluster 23 = 0, Resumes in Cluster 23 = 1
Features in Cluster 8 = 3, Resumes in Cluster 8 = 2;   Features in Cluster 24 = 0, Resumes in Cluster 24 = 1
Features in Cluster 9 = 1, Resumes in Cluster 9 = 2;   Features in Cluster 25 = 0, Resumes in Cluster 25 = 1
Features in Cluster 10 = 2, Resumes in Cluster 10 = 2; Features in Cluster 26 = 0, Resumes in Cluster 26 = 1
Features in Cluster 11 = 2, Resumes in Cluster 11 = 2; Features in Cluster 27 = 0, Resumes in Cluster 27 = 1
Features in Cluster 12 = 4, Resumes in Cluster 12 = 2; Features in Cluster 28 = 0, Resumes in Cluster 28 = 1
Features in Cluster 13 = 1, Resumes in Cluster 13 = 2; Features in Cluster 29 = 0, Resumes in Cluster 29 = 1
Features in Cluster 14 = 5, Resumes in Cluster 14 = 2; Features in Cluster 30 = 0, Resumes in Cluster 30 = 1
Features in Cluster 15 = 2, Resumes in Cluster 15 = 2; Features in Cluster 31 = 0, Resumes in Cluster 31 = 1
Features in Cluster 16 = 1, Resumes in Cluster 16 = 2; Features in Cluster 32 = 0, Resumes in Cluster 32 = 1
Total Common Cluster Features = 43
Special Features = 91
Total Features displayed = 1 (Common Features) + 43 (Common Cluster Features) + 91 (Special Features) = 135
F = (1 * 83) + (3 * 19 + 3 * 7 + 1 * 6 + 2 * 6 + 3 * 3 + 5 * 3 + 2 * 3 + 3 * 2 + 1 * 2 + 2 * 2 + 2 * 2 + 4 * 2 + 1 * 2 + 5 * 2 + 2 * 2 + 1 * 2 + 1 * 2 + 2 * 2) + 91 = 348

Table 6.9 Organization of features (skill value :: Mobile Platforms) using three-level approach for Resume data-set

Common Features (I-level): android

Common cluster features (II-level), followed by ResumeId: Special Features (III-level)

none
  R60: symbian
  R49, R66: none

Table 6.10 Skill Value (Mobile Platforms) Feature Statistics

Number of Common Features = 1
Number of Clusters = 2
Features in Cluster 1 = 0, Resumes in Cluster 1 = 3
Total Common Cluster Features = 0
Special Features = 1
Total Features displayed = 1 (Common Features) + 0 (Common Cluster Features) + 1 (Special Features) = 2
F = (1 * 3) + (0 * 0) + 1 = 4


Table 6.11 Organization of features (skill value :: scripting languages) using three-level approach for Resume data-set

Common Features (I-level): none

Common cluster features (II-level), followed by ResumeId: Special Features (III-level)

shell, python
  R11, R13, R14, R15, R17, R18, R22, R25, R28, R3, R41, R47, R5, R50, R51, R66, R71, R74, R76, R80, R83, R84, R88, R9, R90, R92, R97, R98: none
  R21, R33, R37, R83: php
  R26: php, lisp
  R54: javascript
  R55, R59, R70: perl

python, bash shell
  R16, R2, R27, R30, R57, R63, R64, R67, R8, R82, R85, R91: none
  R45: javascript
  R6: powershell
  R56: perl
  R60, R72: php

perl, shell
  R31, R4, R40, R46, R68, R7, R81, R89: php
  R105, R35, R73, R94, R99: none
  R36: jsp

python, unix shell
  R12, R75, R61: none
  R95: socket programming, perl
  R100: socket programming
  R96: latex

python
  R10, R102, R103, R104: none

sed, awk, perl
  R24, R52: bash shell
  R19: bash shell, php
  R39: shell, php

python, php
  R79, R62, R101: none

perl, php
  R23: none
  R29: asp

bash shell
  R43, R49: none

shell
  R42, R48: none

perl
  R32, R77: none

none
  R58: perl, unix shell
  R69: shell, batch scripting
  R87: perl, php, cgi, action script
  R106: unix shell, awk
  R65: python, sed, awk, shell
  R34: bash shell, python, awk, sed, perl
  R93: bash shell, python, php, java server scripting
  R26: python, lisp, php, shell


Table 6.12 Skill Value (Scripting Languages) Feature Statistics

Number of Common Features = 0
Number of Clusters = 18
Features in Cluster 1 = 2, Resumes in Cluster 1 = 36;  Features in Cluster 10 = 1, Resumes in Cluster 10 = 2
Features in Cluster 2 = 2, Resumes in Cluster 2 = 17;  Features in Cluster 11 = 1, Resumes in Cluster 11 = 2
Features in Cluster 3 = 2, Resumes in Cluster 3 = 14;  Features in Cluster 12 = 0, Resumes in Cluster 12 = 1
Features in Cluster 4 = 2, Resumes in Cluster 4 = 9;   Features in Cluster 13 = 0, Resumes in Cluster 13 = 1
Features in Cluster 5 = 1, Resumes in Cluster 5 = 4;   Features in Cluster 14 = 0, Resumes in Cluster 14 = 1
Features in Cluster 6 = 3, Resumes in Cluster 6 = 4;   Features in Cluster 15 = 0, Resumes in Cluster 15 = 1
Features in Cluster 7 = 2, Resumes in Cluster 7 = 3;   Features in Cluster 16 = 0, Resumes in Cluster 16 = 1
Features in Cluster 8 = 2, Resumes in Cluster 8 = 2;   Features in Cluster 17 = 0, Resumes in Cluster 17 = 1
Features in Cluster 9 = 1, Resumes in Cluster 9 = 2;   Features in Cluster 18 = 0, Resumes in Cluster 18 = 1
Total Common Cluster Features = 19
Special Features = 45
Total Features displayed = 0 (Common Features) + 19 (Common Cluster Features) + 45 (Special Features) = 64
F = (0 * 99) + (2 * 36 + 2 * 17 + 2 * 14 + 2 * 9 + 1 * 4 + 3 * 4 + 2 * 3 + 2 * 2 + 1 * 2 + 1 * 2 + 1 * 2) + 45 = 184

Table 6.13 Organization of features (skill value :: Compiler Tools) using three-level approach for Resume data-set

Common Features (I-level): lex, yacc

Common cluster features (II-level), followed by ResumeId: Special Features (III-level)

none
  R5: phoenix
  R7: mx phoenix rdk

Table 6.14 Skill Value (Compiler Tools) Feature Statistics

Number of Common Features = 2
Number of Clusters = 2
Features in Cluster 1 = 0, Resumes in Cluster 1 = 1
Features in Cluster 2 = 0, Resumes in Cluster 2 = 1
Total Common Cluster Features = 0
Special Features = 2
Total Features displayed = 2 (Common Features) + 0 (Common Cluster Features) + 2 (Special Features) = 4
F = (2 * 2) + (0 * 0) + 2 = 6


Table 6.15 Organization of features (skill value :: software tools) using three-level approach for Resume data-set

Common Features (I-level): none

Common cluster features (II-level), followed by ResumeId: Special Features (III-level)

matlab
  R10, R17, R45, R51, R56, R6: latex
  R13: latex, spice, multisim, magic
  R34: latex, player/stage, octave, assembly language
  R16: latex, adobe photoshop, blender, adobe after effects, dreamweaver
  R57: adobe photoshop, gimp
  R61: adobe photoshop, gimp, multisim
  R83: adobe photoshop, gimp, netbeans
  R92: adobe photoshop
  R21: weka
  R65, R81: none

ms office
  R1: adobe photoshop, gimp, latex
  R14: gimp, matlab, vim, ms office, latex
  R15: latex
  R18: vim, matlab
  R28: vim, matlab, latex, hadoop, lucene
  R30: 3ds max, opengl, qt, matlab, latex
  R35: opengl, qt, latex
  R38: adobe photoshop, latex, ms visio, matlab
  R41: netbeans, ms office, gimp, latex
  R43: visual studio, microsoft phoenix
  R46: design tools, adobe photoshop, nlp tools, qt, gl
  R48: latex, magic, hspice
  R49: matlab, adobe photoshop, dreamweaver
  R53: adobe photoshop
  R54: matlab, adobe photoshop, netbeans, eclipse
  R59: kile
  R62: matlab, adobe flash professional cs3, adobe photoshop
  R74: matlab, adobe photoshop, latex, gimp, adobe fireworks, dreamweaver, visual studio
  R75: gimp, adobe photoshop
  R80: gimp, matlab, mobilesim, vim, latex
  R86: rasmol, chime, spdv, chemsketch, hex, isis draw, swiss-model
  R90: picasa, ms frontpage, adobe photoshop
  R91: idapro, latex
  R95: visual studio, vim, weka
  R96: gimp
  R97: matlab, pspice, multisim, magic, active hdl, ads, latex, vim
  R98: matlab, adobe photoshop, latex
  R42, R44: none

latex
  R1: photoshop, gimp, ms office
  R22: opengl, qt, ms office
  R27: netbeans, ms office, gimp
  R46: adobe photoshop
  R47: glomosim network simulator
  R53: gimp, matlab, mobilesim, vim, ms office
  R8: gimp, matlab, vim, ms office
  R33: jflex
  R20: none

matlab, latex
  R10: adobe photoshop, dreamweaver, adobe after effects
  R48: adobe photoshop, dreamweaver, ms office, gimp, adobe fireworks, visual studio
  R18: ms office, vim, hadoop, lucene
  R21: player/stage, octave, assembly language
  R31, R39: lex, yacc
  R7: spice, multisim, magic

adobe photoshop
  R42: matlab, multisim, gimp
  R45: vim, gimp, latex
  R55: matlab, gimp, netbeans
  R52: none


visual studio
  R23: netbeans
  R15, R65: none

magic, latex, ms office
  R33: hspice
  R62: matlab, pspice, multisim, magic, active hdl, ads, latex, ms office, vim

none
  R16: opengl, qt, multisim
  R17: gimp, wireshark, tcpdump, git
  R6: eclipse, carbide
  R2: perforce
  R25: dmtl, weka
  R60: ms office, microsoft visual studio, vim, weka
  R43: matlab, adobe flash professional cs3, adobe photoshop cs3, ms office
  R26: excel macros
  R50: google web toolkit
  R51: adobe photoshop, adobe illustrator, adobe fireworks, adobe dreamweaver, adobe flash, adobe after effects, adobe premiere pro, 3ds max 9
  R56: rasmol, chime, spdv, chemsketch, hex, isis draw, swiss-model, ms office
  R57: ms office, picasa, ms frontpage, adobe photoshop

Table 6.16 Skill Value (Software Tools) Feature Statistics
Number of Common Features = 0
Number of Clusters = 19

Features in Cluster 1 = 1, Resumes in Cluster 1 = 15
Features in Cluster 2 = 1, Resumes in Cluster 2 = 13
Features in Cluster 3 = 1, Resumes in Cluster 3 = 9
Features in Cluster 4 = 1, Resumes in Cluster 4 = 7
Features in Cluster 5 = 1, Resumes in Cluster 5 = 4
Features in Cluster 6 = 1, Resumes in Cluster 6 = 3
Features in Cluster 7 = 3, Resumes in Cluster 7 = 2
Features in Cluster 8 = 0, Resumes in Cluster 8 = 1
Features in Cluster 9 = 0, Resumes in Cluster 9 = 1
Features in Cluster 10 = 0, Resumes in Cluster 10 = 1
Features in Cluster 11 = 0, Resumes in Cluster 11 = 1
Features in Cluster 12 = 0, Resumes in Cluster 12 = 1
Features in Cluster 13 = 0, Resumes in Cluster 13 = 1
Features in Cluster 14 = 0, Resumes in Cluster 14 = 1
Features in Cluster 15 = 0, Resumes in Cluster 15 = 1
Features in Cluster 16 = 0, Resumes in Cluster 16 = 1
Features in Cluster 17 = 0, Resumes in Cluster 17 = 1
Features in Cluster 18 = 0, Resumes in Cluster 18 = 1
Features in Cluster 19 = 0, Resumes in Cluster 19 = 1

Total Common Cluster Features = 9
Special Features = 101
Total Features displayed = 2 (Common Features) + 35 (Common Cluster Features) + 101 (Special Features) = 138
F = (2 * 106) + (4 * 24 + 3 * 18 + 4 * 22 + 5 * 4 + 4 * 5 + 2 * 4 + 4 * 3 + 4 * 2 + 5 * 2) + 101 = 629


Table 6.17 Organization of features (skill value :: Libraries/APIs) using three-level approach for Resume data-set

Common Features (I-level): none

Common cluster features (II-level) / ResumeId / Special Features (III-level):

opengl, sdl
  R22, R9: shakti api
  R10, R16, R21, R27, R32, R34, R4, R6: none
  R26: glut
  R38: qt

opengl, glut
  R28, R29, R30: qt
  R42: qt, sdl, gtk
  R43: sdl, qt
  R14, R17, R19: none

opengl, qt
  R25, R3, R5, R8: none

opengl
  R12, R24, R37, R40, R7: none

cuda, sdl, glut, opencv, openscenegraph
  R2: matlab, qt, opengl
  R11: none

none
  R15: opengl, openmp, amigo, mpi, glut
  R13: qt, liboctave, aria
  R1: matlab, opengl
  R18: opengl, stl
  R45: opengl, glsl, opencv, glut, cg, cuda
  R44: qt, matlab
  R47: qt, cimg, gwt-diagrams, gwt-dnd, google web toolkit
  R46: opengl, qt
  R41: opengl, opencv, ffmpeg
  R39: qt, opengl, opencv, sdl, cuda
  R35: opengl, aria
  R20: mpi, openmp, opengl, shakti api
  R23: win32, .net managed libraries, opengl, open mp, sdl
  R31: win 32 api, qt, windows driver development kit
  R33: qt

Table 6.18 Skill Value (Libraries/APIs) Feature Statistics
Number of Common Features = 0
Number of Clusters = 20

Features in Cluster 1 = 2, Resumes in Cluster 1 = 12
Features in Cluster 2 = 2, Resumes in Cluster 2 = 8
Features in Cluster 3 = 2, Resumes in Cluster 3 = 4
Features in Cluster 4 = 1, Resumes in Cluster 4 = 5
Features in Cluster 5 = 5, Resumes in Cluster 5 = 2
Features in Cluster 6 = 0, Resumes in Cluster 6 = 3
Features in Cluster 7 = 0, Resumes in Cluster 7 = 1
Features in Cluster 8 = 0, Resumes in Cluster 8 = 1
Features in Cluster 9 = 0, Resumes in Cluster 9 = 1
Features in Cluster 10 = 0, Resumes in Cluster 10 = 1
Features in Cluster 11 = 0, Resumes in Cluster 11 = 1
Features in Cluster 12 = 0, Resumes in Cluster 12 = 1
Features in Cluster 13 = 0, Resumes in Cluster 13 = 1
Features in Cluster 14 = 0, Resumes in Cluster 14 = 1
Features in Cluster 15 = 0, Resumes in Cluster 15 = 1
Features in Cluster 16 = 0, Resumes in Cluster 16 = 1
Features in Cluster 17 = 0, Resumes in Cluster 17 = 1
Features in Cluster 18 = 0, Resumes in Cluster 18 = 1
Features in Cluster 19 = 0, Resumes in Cluster 19 = 1
Features in Cluster 20 = 0, Resumes in Cluster 20 = 1

Total Common Cluster Features = 12
Special Features = 60
Total Features displayed = 0 (Common Features) + 12 (Common Cluster Features) + 60 (Special Features) = 72
F = (0 * 47) + (2 * 12 + 2 * 8 + 2 * 4 + 1 * 5 + 5 * 2) + 60 = 123


Table 6.19 Organization of features (skill value :: IDEs) using three-level approach for Resume data-set

Common Features (I-level): none

Common cluster features (II-level) / ResumeId / Special Features (III-level):

gnu/gcc, visual studio
  R11, R15, R57: none
  R60: .net, cygwin
  R62: .net, flex builder 3
  R69, R74: matlab
  R85: netbeans

gnu/gcc
  R34, R72, R79, R91, R98: none

gnu/gcc, matlab
  R42: none
  R71: netbeans

vim, eclipse
  R54: netbeans, dreamweaver
  R59, R96: matlab
  R93: netbeans
  R20: none

vim, netbeans
  R1, R21, R63, R82: none

netbeans
  R104, R17: none

netbeans, eclipse
  R106: none
  R66: jbuilder

none
  R16: gnu/gcc, eclipse, visual studio, netbeans
  R2: vim, intellij
  R92: vim, visual studio
  R27: eclipse, gnu/gcc
  R43: bash shell
  R53: gnu/gcc, netbeans
  R78: vim, eclipse, adobe flex builder, netbeans, matlab
  R102: bloodshed dev c++, netbeans
  R30: vim, matlab, itk
  R48: matlab, multisim, network simulator
  R33: netbeans, carbide.c++

Table 6.20 Skill Value (IDEs) Feature Statistics
Number of Common Features = 0
Number of Clusters = 18

Features in Cluster 1 = 2, Resumes in Cluster 1 = 8
Features in Cluster 2 = 1, Resumes in Cluster 2 = 5
Features in Cluster 3 = 2, Resumes in Cluster 3 = 2
Features in Cluster 4 = 2, Resumes in Cluster 4 = 5
Features in Cluster 5 = 2, Resumes in Cluster 5 = 4
Features in Cluster 6 = 1, Resumes in Cluster 6 = 2
Features in Cluster 7 = 2, Resumes in Cluster 7 = 2
Features in Cluster 8 = 0, Resumes in Cluster 8 = 1
Features in Cluster 9 = 0, Resumes in Cluster 9 = 1
Features in Cluster 10 = 0, Resumes in Cluster 10 = 1
Features in Cluster 11 = 0, Resumes in Cluster 11 = 1
Features in Cluster 12 = 0, Resumes in Cluster 12 = 1
Features in Cluster 13 = 0, Resumes in Cluster 13 = 1
Features in Cluster 14 = 0, Resumes in Cluster 14 = 1
Features in Cluster 15 = 0, Resumes in Cluster 15 = 1
Features in Cluster 16 = 0, Resumes in Cluster 16 = 1
Features in Cluster 17 = 0, Resumes in Cluster 17 = 1
Features in Cluster 18 = 0, Resumes in Cluster 18 = 1

Total Common Cluster Features = 12
Special Features = 40
Total Features displayed = 0 (Common Features) + 12 (Common Cluster Features) + 40 (Special Features) = 52
F = (0 * 39) + (2 * 8 + 1 * 5 + 2 * 2 + 2 * 5 + 2 * 4 + 1 * 2 + 2 * 2) + 40 = 89
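To make the three-level construction in Tables 6.13 through 6.19 concrete, the Python sketch below derives the levels once resumes have been clustered on a skill value: features shared by every resume become I-level (common) features, features shared by every resume of a cluster become II-level (common cluster) features, and whatever remains in a resume is its III-level (special) set. The function, data layout, and toy input are assumptions for illustration, not the thesis implementation.

# Illustrative sketch: derive the three-level feature organization,
# assuming cluster assignments are already available.
def three_level_view(clusters):
    """clusters: {cluster_id: {resume_id: set of features}}."""
    all_feature_sets = [fs for members in clusters.values() for fs in members.values()]
    # I-level: features present in every resume of the data set.
    common = set.intersection(*all_feature_sets) if all_feature_sets else set()
    view = {"common": common, "clusters": {}}
    for cid, members in clusters.items():
        # II-level: features shared by every resume of this cluster.
        cluster_common = set.intersection(*members.values()) - common
        # III-level: whatever is left in each individual resume.
        special = {rid: fs - cluster_common - common for rid, fs in members.items()}
        view["clusters"][cid] = {"common": cluster_common, "special": special}
    return view

# Toy input shaped like two clusters of Table 6.19 (IDEs):
clusters = {
    "c1": {"R42": {"gnu/gcc", "matlab"}, "R71": {"gnu/gcc", "matlab", "netbeans"}},
    "c2": {"R104": {"netbeans"}, "R17": {"netbeans"}},
}
print(three_level_view(clusters))
# c1: common {gnu/gcc, matlab}, special R71 -> {netbeans}; c2: common {netbeans}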

