Hydra - Content Processing Framework for Search Driven Solutions
![Page 1: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/1.jpg)
© FINDWISE 2012
Introducing Hydra
An Open Source Document Processing Framework
Joel Westberg
![Page 2: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/2.jpg)
About Findwise
• Founded in 2005
• Offices in Sweden, Denmark, Norway and Poland
• 80 employees (April 2012)
• Our objective is to be a leading provider of Findability solutions, utilising the full potential of search technology to create customer business value
![Page 3: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/3.jpg)
![Page 4: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/4.jpg)
Technology independent
Creating search-driven Findability solutions based on market-leading
commercial and open source search technology platforms:
Autonomy IDOL
Microsoft (SharePoint and FAST Search products)
Google GSA
IBM ICA/OmniFind
LucidWorks
Apache Lucene/Solr
Elasticsearch and more…
![Page 5: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/5.jpg)
Generic Search Architecture
![Page 6: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/6.jpg)
Connecting source to search
Garbage in, garbage out. But what about unstructured data in?
• Flat data is richer than it appears
• Don’t discard information too soon!
The unstructured structured data paradox
Example: News articles
Plain text that contains invaluable metadata for search, such as:
• Title
• Author byline
• Lead paragraph
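As a rough illustration of how such metadata can be recovered from flat text, the sketch below splits a plain-text article into title, author, lead and body. It assumes a hypothetical layout (blocks separated by blank lines: title first, an optional "By ..." line, then the lead paragraph). The function and field names are invented for this example and are not part of Hydra.

```python
import re

# Hypothetical byline convention: a block consisting only of "By <name>".
BYLINE = re.compile(r"^by\s+(.+)$", re.IGNORECASE)


def structure_article(text):
    """Split a plain-text news article into title, author, lead and body."""
    # Blocks are separated by one or more blank lines.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    doc = {"title": None, "author": None, "lead": None, "body": ""}
    if not blocks:
        return doc
    doc["title"] = blocks.pop(0)
    if blocks:
        match = BYLINE.match(blocks[0])
        if match:
            doc["author"] = match.group(1).strip()
            blocks.pop(0)
    if blocks:
        doc["lead"] = blocks.pop(0)
    # Whatever remains is kept as the body, so nothing is discarded.
    doc["body"] = "\n\n".join(blocks)
    return doc
```

Real articles rarely follow one layout, so in practice each source would likely need its own rules. The point is only that the fields can be kept separate instead of being indexed as one undifferentiated text blob.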
![Page 7: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/7.jpg)
Enrichment and structuring possibilities
• Enrich your documents with metadata, to power your search
• Language detection
• Sentiment analysis
• Headline extraction
• Regular expression matching and extraction
• Filter out unwanted documents
• Collect statistics
• Export to Staging environments
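A few of the possibilities above (regular expression extraction, filtering, statistics) can be sketched as small functions that each take a document and either return it or drop it. This is a minimal Python illustration, not Hydra's stage API: the stage names, the `text` and `emails` fields, and the simplified e-mail pattern are all assumptions made for the example.

```python
import re

# Deliberately simplified e-mail pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(doc):
    """Enrichment stage: add every e-mail address found in the text."""
    doc["emails"] = EMAIL.findall(doc.get("text", ""))
    return doc


def drop_empty(doc):
    """Filter stage: returning None removes the document from the flow."""
    return doc if doc.get("text", "").strip() else None


def run_pipeline(docs, stages):
    """Apply the stages in order and collect simple statistics."""
    kept = []
    stats = {"processed": 0, "dropped": 0}
    for doc in docs:
        stats["processed"] += 1
        for stage in stages:
            doc = stage(doc)
            if doc is None:
                stats["dropped"] += 1
                break
        else:
            # No stage dropped the document.
            kept.append(doc)
    return kept, stats
```

Calling `run_pipeline(docs, [drop_empty, extract_emails])` would return the surviving documents together with the counters.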
![Page 8: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/8.jpg)
Classic Pipeline
![Page 9: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/9.jpg)
Classic Architecture
![Page 10: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/10.jpg)
The Hydra Architecture
![Page 11: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/11.jpg)
Main Design Objectives
Scalability
• Horizontally scalable central repository
• Independent processing nodes
Failure tolerance
• Failure of a stage affects only a single document
• Failure of a node affects at most n documents
• Failures can be automatically detected
Robustness
• Independent stages
Development ease
• Debug stages from IDE against actual data
• Allow test driven pipeline development
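To make the objectives above more concrete, here is a minimal sketch of independent stages working against a shared repository, where a failure is recorded on the one affected document and the stage carries on. The in-memory `Repository` class, the `touched` and `failed` markers, and `run_stage` are hypothetical names chosen for this example. They stand in for a real central repository and do not describe Hydra's actual implementation.

```python
class Repository:
    """In-memory stand-in for a central document repository (hypothetical)."""

    def __init__(self, docs):
        self.docs = docs

    def fetch_for(self, stage_name):
        """Return a healthy document this stage has not handled, or None."""
        for doc in self.docs:
            if stage_name not in doc["touched"] and "failed" not in doc:
                return doc
        return None


def run_stage(repo, name, process):
    """Run one stage until no unhandled documents remain for it."""
    while True:
        doc = repo.fetch_for(name)
        if doc is None:
            break
        try:
            process(doc["fields"])
        except Exception as exc:
            # The failure is recorded on this document only.
            doc["failed"] = f"{name}: {exc}"
        finally:
            # Mark the document so this stage never fetches it again.
            doc["touched"].add(name)
```

Because a stage only needs the repository and a processing function, it can be run on its own, for example from a unit test or an IDE debugger, against a handful of real documents.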
![Page 12: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/12.jpg)
The Hydra Architecture
![Page 13: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/13.jpg)
Hadoop/Big Data integration
Use cases for document enrichment
• PageRank
• Analytics
Hadoop & Map/Reduce advantages
• Huge scalability
• Ability to work on entire document set at once
Hadoop & Map/Reduce drawbacks
• Batch processing
• Time-to-index
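The trade-off above comes from enrichments that need the whole document set at once. As a simplified stand-in for PageRank (it is neither PageRank itself nor Hadoop code), the sketch below counts inbound links per document in plain Python. The `id` and `links` fields are assumed for the example.

```python
from collections import Counter


def inbound_link_counts(docs):
    """Count, for each document, how many other documents link to it."""
    counts = Counter()
    for doc in docs:
        # Count each source document at most once per target,
        # and ignore links a document makes to itself.
        for target in set(doc.get("links", [])):
            if target != doc["id"]:
                counts[target] += 1
    # Documents nobody links to get an explicit count of 0.
    return {doc["id"]: counts[doc["id"]] for doc in docs}
```

No single document can supply its own count, so the value is only available after a pass over the full set. That is what makes this kind of enrichment a batch job and what delays its arrival in the index.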
![Page 14: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/14.jpg)
Hadoop/Big Data integration
Blue – First round of indexing only
Red – Second round of indexing
Purple – All documents
![Page 15: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/15.jpg)
Future Configuration UI
![Page 16: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/16.jpg)
Open Source initiative
• Other committers
• The role of Findwise
For more information:
• http://www.findwise.com/hydra
• http://findwise.github.com/Hydra
• Email: [email protected]
![Page 17: Hydra - Content Processing Framework for Search Driven Solutions](https://reader033.fdocuments.in/reader033/viewer/2022052908/55946be81a28ab7f2b8b4707/html5/thumbnails/17.jpg)
Questions?