Bc. Anton Balucha: Assignment for the Information Retrieval course.

IDENTIFICATION OF PEOPLE

Bc. Anton Balucha, http://www.tonyb.sk/

Assignment for the Information Retrieval course

MOTIVATION

search engine results contain a lot of information about many people
the information is scattered, not integrated

Available at http://www.tonyb.sk/

TASK

create an application that identifies occurrences of a person on various websites

EXISTING SOLUTIONS

http://www.pipl.com (easy to use, clear list of results)
http://www.zabasearch.com (searches for people in the USA only)
http://www.wink.com (searches for people on social networks)
http://www.people.yahoo.com (searches for people by entered parameters: name, surname, town, state, e-mail)
https://addons.mozilla.org/sk/firefox/addon/3167 (plugin for the Firefox browser)
http://www.peoplesearch.com (searches for people in the USA only, in the entered state)
http://www.peekyou.com (searches for people on various portals: Google+, Wikipedia, LinkedIn, Flickr, Twitter)
http://www.123people.com (searches for people on various portals: Google+, Wikipedia, LinkedIn, Flickr, Twitter)
http://www.bestpeoplesearch.com (searches for people in the USA only, in the entered state; option to hire a person to do the search)

DESCRIPTION OF SOLUTION - ARCHITECTURE

programmed in Java
web application available at http://www.tonyb.sk/
no static data
active use of results from search engines

DESCRIPTION OF SOLUTION - IMPLEMENTATION

Google results
web pages
remove HTML
remove stop words
remove diacritics
stemming
TF-IDF
identify keywords
show results
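The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the class name `KeywordPipeline`, the tiny stop-word list, the regex-based HTML stripping and the particular TF-IDF weighting (term frequency times ln(N / document frequency)) are all assumptions. Stemming is left out to keep the sketch short; a stemmer such as the Porter stemmer could be plugged in at the marked line.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class KeywordPipeline {
    // Tiny illustrative stop-word list (an assumption; the real list is not given).
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of", "in", "is", "on");

    // Strip tags with a crude regex; a real HTML parser would be more robust.
    static String removeHtml(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Decompose accented letters and drop the combining marks, e.g. "á" becomes "a".
    static String removeDiacritics(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // HTML page -> lower-cased tokens without diacritics or stop words.
    static List<String> tokenize(String html) {
        String plain = removeDiacritics(removeHtml(html)).toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        for (String t : plain.split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t); // a stemmer would be applied to t here
            }
        }
        return tokens;
    }

    // TF-IDF of every term in one document: tf = count / length, idf = ln(N / df).
    static Map<String, Double> tfIdf(List<List<String>> docs, int docIndex) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<String> doc = docs.get(docIndex);
        Map<String, Double> scores = new HashMap<>();
        for (String term : doc) {
            scores.merge(term, 1.0 / doc.size(), Double::sum);
        }
        scores.replaceAll((term, tf) -> tf * Math.log((double) docs.size() / df.get(term)));
        return scores;
    }

    // The k highest-scoring terms of one document (order among ties is unspecified).
    static List<String> topKeywords(List<List<String>> docs, int docIndex, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(tfIdf(docs, docIndex).entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> keywords = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            keywords.add(entries.get(i).getKey());
        }
        return keywords;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            tokenize("<p>Mária Bieliková is a professor of informatics</p>"),
            tokenize("<p>A professor and the students</p>"));
        System.out.println(docs.get(0));
        System.out.println(topKeywords(docs, 0, 3));
    }
}
```

In this toy input the word "professor" appears in both pages, so its IDF is zero and it does not rank among the keywords of the first page, which is the effect TF-IDF is meant to have.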

USED DATA

Anton Balucha
Mária Bieliková
Pavol Návrat
Peter Borga
Petra Majzúnová
Miloš Blaško

RESULTS

#   Name and surname   |D|  |R|  |I|  |RI|  Precision  Recall
1.  Anton Balucha      8    2    4    1     0.25       0.5
2.  Mária Bieliková    9    8    5    5     1          0.625
3.  Pavol Návrat       10   10   8    8     1          0.8
4.  Peter Borga        10   2    4    1     0.25       0.5
5.  Petra Majzúnová    8    8    7    7     1          0.875
6.  Miloš Blaško       10   8    4    4     1          0.5
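The slide does not define the column symbols, so the following reading is an assumption inferred from the numbers: |R| is the number of relevant pages, |I| the number of pages the application identified, and |RI| the number of pages that are both relevant and identified (|D| is presumably the number of pages retrieved). Under that reading, all six rows are consistent with precision = |RI| / |I| and recall = |RI| / |R|. A minimal sketch, with the hypothetical class name `RetrievalMetrics`:

```java
class RetrievalMetrics {
    // Precision: share of the identified pages that are relevant, |RI| / |I|.
    static double precision(int relevantIdentified, int identified) {
        return identified == 0 ? 0.0 : (double) relevantIdentified / identified;
    }

    // Recall: share of the relevant pages that were identified, |RI| / |R|.
    static double recall(int relevantIdentified, int relevant) {
        return relevant == 0 ? 0.0 : (double) relevantIdentified / relevant;
    }

    public static void main(String[] args) {
        // Row 2 of the table (Mária Bieliková): |R| = 8, |I| = 5, |RI| = 5.
        System.out.println(precision(5, 5)); // prints 1.0
        System.out.println(recall(5, 8));    // prints 0.625
    }
}
```

Returning 0.0 for an empty denominator is a convention chosen for this sketch; the slides do not say how such a case would be handled.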

SAMPLE OUTPUT

Anton Balucha
http://www2.fiit.stuba.sk/research/pewe/program-2008-2009/
http://dlznik.zoznam.sk/socialna-siet/anton-balucha-1
http://dlznik.zoznam.sk/socialna-siet/anton-balucha-clen-1
http://sk.linkedin.com/pub/anton-balucha/36/52/42a

Mária Bieliková
http://www2.fiit.stuba.sk/~bielik/
http://www2.fiit.stuba.sk/~bielik/books/index.html
http://www.fhv.umb.sk/app/user.php?ACTION=PUBLICATION&user=bielikova.maria
http://mariabielik.zenfolio.com/
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/b/Bielikov=aacute=:M=aacute=ria.html

Peter Borga
http://en-gb.facebook.com/peterborga
http://www.facebook.com/peter.borgneal
http://uk.linkedin.com/in/peterborgneal
http://peter-borg.com.au/

IMPROVEMENT

better text processor
better stemming
better keyword identification
choosing just the right number of keywords

IN THE END…

I found out:
what stemming and lemmatization are
what TF-IDF is
what precision and recall are
how interesting text research is

INSTALLATION OF APPLICATION

installation of Java
installation of Apache Tomcat
deployment of external applications
access to the Internet
access to the application

USED LITERATURE

[1] Michal Laclavík, Martin Šeleng: Vyhľadávanie informácií. Available at <http://vi.ikt.ui.sav.sk/> (01.12.2011)

[2] Porter Stemmer. Available at <http://tartarus.org/martin/PorterStemmer/> (01.12.2011)
