Set expansion
SET EXPANSION - Team 25
ROMIL PUNETHA, DEEP GREWAL, SANDEEP KASA
201505568, 201364124, 201301145
OUTLINE
• Introduction
• Related Work
• Approach
• Results
• References
INTRODUCTION
• Set expansion refers to completing a given set with relevant terms corresponding to the given “seed terms”.
• The goal is to find other entities that could belong to the same set as the given input entities.
• For example: Input = mango, banana
• Output = strawberry, apples, etc.
RELATED WORK
• Google Sets is a well-known example of a web-based set expansion system.
• Language-independent set expansion of named entities using the web.
• Set expansion using web-based crawling.
APPROACH
• Tool used: Word2vec
- Finds similarity between words by converting them into feature vectors and calculating the cosine distance between them.
• distance = vector(word1) · vector(word2) / (|vector(word1)| × |vector(word2)|), i.e. the cosine of the angle between the two vectors; a higher value means more similar words.
• The following link explains the working of word2vec:
- Word2Vec
• The model was trained using the dataset from the following link:
- Training set for the word2vec model
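The measure on this slide can be sketched in a few lines of plain Python. This is only an illustration of the formula (the quantity usually called cosine similarity, which the slides call cosine distance), not the project's code. The three-dimensional vectors and the names `cosine_similarity` and `toy_vectors` are hypothetical; real word2vec vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, made up for illustration only.
toy_vectors = {
    "mango": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "python": [0.0, 0.1, 0.9],
}

sim_fruit = cosine_similarity(toy_vectors["mango"], toy_vectors["banana"])
sim_mixed = cosine_similarity(toy_vectors["mango"], toy_vectors["python"])
```

With these made-up vectors, `sim_fruit` comes out larger than `sim_mixed`, which is the behaviour the approach relies on: words from the same set should score higher against each other than against unrelated words.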
Crawler and Indexer
• Indexing the word2vec dataset
- Used the word2vec.Text8Corpus function to create the model from the wiki set.
• Web results from Google, Bing, DuckDuckGo, etc. have been used.
• Crawled web pages to obtain patterns containing seed terms (explained in the report).
• Edited the Python parser to parse specific parts of the data from the web pages.
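As a rough illustration of the pattern idea, the sketch below uses Python's standard `html.parser` module to group item texts by their enclosing list-like tag and returns the siblings of any seed term. It is a simplification under stated assumptions: the fixed tag sets, the class `ListItemCollector`, and the function `candidates_from_html` are hypothetical, and the project's own parser and tag heuristics are described in the report rather than here.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects item texts, grouped by the container tag that encloses them."""

    CONTAINERS = {"ul", "ol", "table", "select"}  # assumed tag set
    ITEMS = {"li", "td", "option"}                # assumed tag set

    def __init__(self):
        super().__init__()
        self.groups = []     # one list of item texts per container seen
        self._stack = []     # indices into self.groups for open containers
        self._buffer = None  # text pieces of the item currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self._flush()  # close any item text before a nested container
            self.groups.append([])
            self._stack.append(len(self.groups) - 1)
        elif tag in self.ITEMS and self._stack:
            self._flush()
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.ITEMS:
            self._flush()
        elif tag in self.CONTAINERS and self._stack:
            self._flush()
            self._stack.pop()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        # Store the normalised, lowercased item text in the innermost container.
        if self._buffer is not None and self._stack:
            text = " ".join("".join(self._buffer).split()).lower()
            if text:
                self.groups[self._stack[-1]].append(text)
        self._buffer = None


def candidates_from_html(html, seeds):
    """Return items that share a container with at least one seed term."""
    seed_set = {s.lower() for s in seeds}
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    found = []
    for group in parser.groups:
        if seed_set & set(group):
            for item in group:
                if item not in seed_set and item not in found:
                    found.append(item)
    return found


# Hypothetical page fragment, made up for illustration.
page = """
<ul><li>Mango</li><li>Banana</li><li>Papaya</li></ul>
<ul><li>Home</li><li>Contact</li></ul>
"""
fruit_candidates = candidates_from_html(page, ["mango", "banana"])
```

Only the first list contains a seed, so the navigation-style second list contributes nothing. A real crawler would also need to fetch pages and cope with malformed markup, which this sketch leaves out.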
ALGORITHM
• Get web results using the input seeds.
• Crawl the web pages to search for the seed terms within tags (used a heuristics-based approach to identify relevant tags instead of focusing only on table, ul, li, ol).
• For each candidate term:
- if it is not a stopword:
i) find its cosine distance with each seed term
ii) if the word is also found using pattern matching, push the intersecting terms higher in the output.
• Display the top ‘n’ (10 here) results.
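The ranking steps above might look roughly like the following sketch. The slides do not say how the per-seed scores are combined or how large the pattern-match boost is, so averaging the scores and adding a fixed `boost` are assumptions made here. The stopword list, the toy vectors, and the names `expand_set` and `pattern_terms` are likewise hypothetical.

```python
import math

STOPWORDS = {"the", "and", "of", "a", "in"}  # tiny illustrative list


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def expand_set(seeds, vectors, pattern_terms, top_n=10, boost=1.0):
    """Rank candidate terms by their mean cosine score against the seeds.

    Terms also found by pattern matching get `boost` added to their score,
    which moves them up the ranking.
    """
    seed_set = set(seeds)
    seed_vecs = [vectors[s] for s in seeds if s in vectors]
    if not seed_vecs:
        return []
    scored = []
    for term, vec in vectors.items():
        if term in seed_set or term in STOPWORDS:
            continue
        score = sum(cosine(vec, sv) for sv in seed_vecs) / len(seed_vecs)
        if term in pattern_terms:
            score += boost
        scored.append((score, term))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_n]]


# Hypothetical two-dimensional vectors, made up for illustration.
vectors = {
    "cricket": [0.9, 0.1],
    "football": [0.8, 0.2],
    "rugby": [0.85, 0.15],
    "tennis": [0.7, 0.3],
    "the": [0.8, 0.2],
    "java": [0.1, 0.9],
}
pattern_terms = {"tennis"}  # pretend the crawler found "tennis" next to a seed
ranked = expand_set(["cricket", "football"], vectors, pattern_terms, top_n=3)
```

Here "tennis" ranks first because of the pattern-match boost, "rugby" follows on vector similarity alone, and the stopword "the" is dropped even though its toy vector is close to the seeds.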
RESULTS
• Input: cricket, football, volleyball
• Output: rugby, soccer, hockey, squash, badminton, kabaddi, bowling, cricketers, tennis
RESULTS
• Input: Samsung, Sony, HP
• Output: TDK, Nokia, Microsoft, video, Motorola, Oppo entertainment, Asus
RESULTS
• Input: Java, Python, Perl, PHP
• Output: JavaScript, scripting, MongoDB, Linux, Tcl, Lisp, CPAN, NumPy, doctest, GNU
RESULTS
• Input: mango, banana, orange
• Output: papaya, mangoes, coconut, pineapple, tomato, cashews, lemon, zucchini, cinnamon, watermelon
CONCLUSION
• In this project, we have shown how to expand a set using seed terms and the word2vec tool.
• The program has been tested on various seed terms, and the results have been found to be acceptable.
• Various web search APIs, such as Google and Bing, have been used to tune the search results.
REFERENCES
• A Cross-Lingual Dictionary for English Wikipedia Concepts.
• https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/icdm-2007.pdf
• word2vec tool for creating vectors of the words.
• Identifying the Sets of Related Words from World Wide Web.
• Entity List Completion using Set Expansion Technique.
PROJECT LINKS
• GitHub
• Dropbox
• Presentation
• Video
• Website