Posted: 19-Dec-2015
Search Bootstrapping
How / Where to get started
Crawling
• Start with Nutch
  – http://nutch.apache.org/
• Index directly to SOLR
  – http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
• Create a seed list from DMOZ RDF
  – http://www.dmoz.org/rdf.html
  – http://wiki.apache.org/nutch/NutchTutorial
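
The seed-list step can be sketched roughly as follows. This is a minimal Python sketch, not the Nutch tooling itself. It assumes the DMOZ content dump lists each site as an `<ExternalPage about="...">` element; the names `extract_seed_urls`, `write_seed_file`, and the `every_nth` sampling parameter are hypothetical. The output is one URL per line, the plain-text seed format the Nutch tutorial works with.

```python
import re
from html import unescape
from typing import Iterable, Iterator

# Assumption: each site in the DMOZ content dump appears as
# <ExternalPage about="http://example.com/">.
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]+)"')


def extract_seed_urls(lines: Iterable[str], every_nth: int = 1) -> Iterator[str]:
    """Yield every n-th distinct http(s) URL found in ExternalPage elements."""
    seen = set()
    count = 0
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if not match:
            continue
        # Attribute values are XML-escaped (e.g. &amp;), so unescape them.
        url = unescape(match.group(1)).strip()
        if not url.startswith(("http://", "https://")) or url in seen:
            continue
        seen.add(url)
        count += 1
        # Keep only a sample so the first crawl stays small.
        if (count - 1) % every_nth == 0:
            yield url


def write_seed_file(rdf_path: str, seed_path: str, every_nth: int = 1000) -> int:
    """Stream the dump and write one URL per line; return the number written."""
    written = 0
    with open(rdf_path, encoding="utf-8", errors="replace") as src, \
            open(seed_path, "w", encoding="utf-8") as dst:
        for url in extract_seed_urls(src, every_nth):
            dst.write(url + "\n")
            written += 1
    return written
```

A call such as `write_seed_file("content.rdf.u8", "urls/seed.txt")` (both paths are placeholders) would stream the dump line by line, so the full file never has to fit in memory.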
Understanding Content
• Entity Extraction
  – LingPipe http://alias-i.com/lingpipe/
  – OpenNLP http://incubator.apache.org/opennlp/
• Entity Identification / Taxonomies
  – Freebase http://www.freebase.com/
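
To make the two steps concrete: extraction finds mentions in text, and identification maps each mention to an entry in a taxonomy. The sketch below is a plain dictionary (gazetteer) lookup in Python, not LingPipe or OpenNLP, which rely on trained statistical models. The tiny taxonomy, its IDs, and the names `TAXONOMY` and `identify_entities` are made up for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical mini-taxonomy: lowercase surface form -> (entity id, type).
# Real IDs would come from a knowledge base such as Freebase.
TAXONOMY = {
    "apache nutch": ("ent:nutch", "software"),
    "nutch": ("ent:nutch", "software"),
    "solr": ("ent:solr", "software"),
    "san francisco": ("ent:san_francisco", "location"),
}

# Longest names first so "apache nutch" wins over the shorter "nutch".
_NAMES = sorted(TAXONOMY, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _NAMES) + r")\b",
    re.IGNORECASE,
)


def identify_entities(text: str) -> List[Tuple[str, str, str, int]]:
    """Return (mention, entity id, type, character offset) for each match."""
    results = []
    for match in _PATTERN.finditer(text):
        entity_id, entity_type = TAXONOMY[match.group(1).lower()]
        results.append((match.group(1), entity_id, entity_type, match.start()))
    return results
```

A lookup like this only finds names it already knows; a model-based extractor is what lets a pipeline pick up names that are not yet in the taxonomy.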
Some Additional Links
• Basic Web Page Parser
  – https://github.com/pjaol/Webcrawler
• Example of OpenNLP usage
  – https://github.com/pjaol/entity_extractor
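
For orientation, a basic web page parser can be sketched with Python's standard library alone. This is not the code in the repositories linked above; the class name `PageParser` and the helper `parse_page` are hypothetical. It collects the page title, the visible text, and absolute outgoing links, which is roughly what a crawler needs in order to index a page and find the next URLs.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect title, visible text, and absolute links from one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.title = ""
        self.links: List[str] = []
        self.text_parts: List[str] = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        if self._in_title:
            self.title += data.strip()
        elif data.strip():
            self.text_parts.append(data.strip())


def parse_page(html: str, base_url: str) -> Tuple[str, str, List[str]]:
    """Return (title, visible text, absolute links) for an HTML string."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.text_parts), parser.links
```

The extracted text could then be passed to an entity extractor, and the links fed back into the crawl frontier.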