Set expansion

SET EXPANSION - Team 25 ROMIL PUNETHA DEEP GREWAL SANDEEP KASA 201505568 201364124 201301145


Page 1: Set expansion

SET EXPANSION - Team 25

ROMIL PUNETHA DEEP GREWAL SANDEEP KASA 201505568 201364124 201301145

Page 2: Set expansion

OUTLINE

• Introduction
• Related Work
• Approach
• Results
• References

Page 3: Set expansion

INTRODUCTION

• Set expansion refers to completing a given set with relevant terms corresponding to the given “seed terms”.
• The goal is to find other entities which could belong to the same set as the given input entities.
• For example: Input = mango, banana
• Output = strawberry, apples, etc.

Page 4: Set expansion

RELATED WORK

• Google Sets is a well-known example of a web-based set expansion system.
• Language-independent set expansion of named entities using the web.
• Set expansion using web-based crawling.

Page 5: Set expansion

APPROACH

• Tool used: Word2vec
- Finds the similarity between words by converting them into feature vectors and calculating the cosine distance between them.
• distance = vector(word1) · vector(word2), the dot product of the two length-normalized word vectors (their cosine similarity)
• The following link explains the working of word2vec: Word2Vec
• The model was trained using the dataset from the following link: Training set for the word2vec model
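The similarity measure on this slide can be illustrated with a minimal sketch. The three-dimensional vectors below are made-up toy values, not output from the trained model, and the function name cosine_similarity is an assumption chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_u * norm_v)

# Toy vectors invented for this example (real word2vec vectors
# typically have a hundred or more dimensions).
toy_vectors = {
    "mango": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}

sim_fruit = cosine_similarity(toy_vectors["mango"], toy_vectors["banana"])
sim_other = cosine_similarity(toy_vectors["mango"], toy_vectors["laptop"])
print(sim_fruit > sim_other)
```

Dividing by the two norms is what makes the plain dot product on the slide a cosine; if the vectors are already scaled to unit length, the division has no effect.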

Page 6: Set expansion

Crawler and Indexer
• Indexing the word2vec dataset
- Used the word2vec.Text8Corpus function to create the model from the wiki set.
• Web results from Google, Bing, DuckDuckGo, etc. have been used.
• Crawled web pages to obtain patterns containing seed terms (explained in the report).
• Edited the Python parser to parse specific parts of the data from the web pages.
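One possible way to pull candidate terms out of a crawled page is sketched below using only the standard library. This is not the project's parser: the class TagTextCollector, the function candidates_from_html, and the heuristic itself (any tag name whose text snippets include a seed is treated as relevant) are assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collects text snippets grouped by the name of the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open tags
        self.texts = defaultdict(list)  # tag name -> lowercased snippets

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip().lower()
        if text and self.stack:
            self.texts[self.stack[-1]].append(text)

def candidates_from_html(html, seeds):
    """Return non-seed snippets that share a tag name with a seed term."""
    seed_set = {s.lower() for s in seeds}
    collector = TagTextCollector()
    collector.feed(html)
    found = []
    for tag, texts in collector.texts.items():
        if seed_set & set(texts):  # this tag type holds at least one seed
            for text in texts:
                if text not in seed_set and text not in found:
                    found.append(text)
    return found

# Made-up page fragment for demonstration.
sample_html = "<ul><li>Mango</li><li>Banana</li><li>Papaya</li></ul><p>Contact us</p>"
page_candidates = candidates_from_html(sample_html, ["mango", "banana"])
print(page_candidates)
```

Grouping by tag name rather than hard-coding list or table tags keeps the sketch in the spirit of a heuristic approach, though a real crawler would likely also need to consider nesting and page-specific noise.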

Page 7: Set expansion

ALGORITHM

• Get web results using the input seeds.
• Crawl the web pages to search for the seed terms within tags (used a heuristics-based approach to identify relevant tags instead of focusing only on table, ul, li, ol).
• For each term in the seed set:
- if not a stopword:
i) find its cosine distance with each seed term
ii) if the word is also found using pattern matching, push the intersecting terms higher in the output.
• Display the top ‘n’ (10 here) results.
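The ranking steps above could be sketched roughly as follows. This is one interpretation, not the project's code: it assumes every vocabulary word is a candidate, scores it by its mean cosine similarity to the seeds, and adds a fixed bonus when the word was also found by pattern matching. The names expand, STOPWORDS and PATTERN_BOOST, the bonus value, and the toy vectors are all hypothetical.

```python
import math

STOPWORDS = {"the", "and", "of"}  # hypothetical, deliberately tiny list
PATTERN_BOOST = 0.5               # hypothetical bonus for pattern-matched terms

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand(seeds, vectors, pattern_terms, n=10):
    """Rank non-seed, non-stopword words by similarity to the seeds."""
    seed_vecs = [vectors[s] for s in seeds if s in vectors]
    if not seed_vecs:
        return []
    scores = {}
    for word, vec in vectors.items():
        if word in seeds or word in STOPWORDS:
            continue
        # Mean cosine similarity to all seed terms.
        score = sum(cosine(vec, sv) for sv in seed_vecs) / len(seed_vecs)
        # Push terms also found on crawled pages higher in the output.
        if word in pattern_terms:
            score += PATTERN_BOOST
        scores[word] = score
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Toy two-dimensional vectors invented for this example.
toy_vectors = {
    "mango": [0.9, 0.1], "banana": [0.8, 0.2], "papaya": [0.7, 0.3],
    "apple": [0.85, 0.15], "laptop": [0.1, 0.9], "the": [0.5, 0.5],
}
top_two = expand(["mango", "banana"], toy_vectors, {"papaya"}, n=2)
print(top_two)
```

An additive bonus is only one way to promote the intersecting terms; placing all pattern-matched terms ahead of the rest would be an equally plausible reading of the slide.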

Page 8: Set expansion

RESULTS

• Input: cricket, football, volleyball
• Output: rugby
• Soccer
• Hockey
• Squash
• Badminton
• Kabaddi
• Bowling
• Cricketers
• tennis

Page 9: Set expansion

RESULTS

• Input: Samsung, sony, hp
• Output: tdk
• Nokia
• Microsoft
• Video
• Motorola
• Oppo entertainment
• Asus

Page 10: Set expansion

RESULTS

• Input: java, python, Perl, php
• Output:
• JavaScript
• scripting
• mongo dB
• linux
• tcl
• lisp
• Cpan
• Numpy
• Doctest
• gnu

Page 11: Set expansion

RESULTS

• Input: mango, banana, orange
• Output: papaya
• Mangoes
• Coconut
• Pineapple
• Tomato
• Cashews
• Lemon
• Zucchini
• cinnamon
• watermelon

Page 12: Set expansion

CONCLUSION

• In this project, we have shown how to expand a set using seed terms and the word2vec tool.
• The program has been tested on various seed terms and the results have been found to be acceptable.
• Web search APIs such as Google and Bing have been used to tune the search results.

Page 13: Set expansion

REFERENCES

• A Cross-Lingual Dictionary for English Wikipedia Concepts.
• https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/icdm-2007.pdf
• word2vec tool for creating vectors of the words.
• Identifying the Sets of Related Words from the World Wide Web.
• Entity List Completion Using Set Expansion Technique.

Page 14: Set expansion

PROJECT LINKS

• GitHub
• Dropbox
• Presentation
• Video
• Website