Apache Tika

Apache Tika, 2007-11-15

Jukka Zitting <jukka@apache.org>

Apache Tika

An extensible, configurable content analysis framework / toolkit

Agenda

The Problem

The Solution

The Project

The Design

The Problem

PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.

Lucene index

It’s Worse Than That

Licensing

Dependencies

Metadata extraction

Structured content

Encryption/Compression

Package formats

Streaming

Processing of digital media

Agenda

The Problem

The Solution

The Project

The Design

The Solution: Technical

• Generic API for extracting metadata and structured text content from a document
  – Input: byte stream + optional metadata
  – Output: XHTML SAX events + metadata

• Automatic content type detection
  – Magic bytes
  – File name patterns
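
The "byte stream + metadata in, XHTML SAX events + metadata out" contract above can be sketched with the JDK's own SAX classes. This is only an illustration under assumptions: the names `SketchParser`, `PlainTextParser` and `extractText` are hypothetical and are not Tika's API, the metadata is modelled as a plain `Map`, and the text is assumed to be UTF-8 rather than having its encoding detected.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class ParserSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical interface for illustration; not the actual Tika API.
    interface SketchParser {
        void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException;
    }

    // Emits each input line as an XHTML <p> element. Assumes UTF-8 input.
    static class PlainTextParser implements SketchParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException {
            metadata.put("Content-Type", "text/plain"); // output metadata
            AttributesImpl none = new AttributesImpl();
            handler.startDocument();
            handler.startElement(XHTML, "html", "html", none);
            handler.startElement(XHTML, "body", "body", none);
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                handler.startElement(XHTML, "p", "p", none);
                char[] chars = line.toCharArray();
                handler.characters(chars, 0, chars.length);
                handler.endElement(XHTML, "p", "p");
            }
            handler.endElement(XHTML, "body", "body");
            handler.endElement(XHTML, "html", "html");
            handler.endDocument();
        }
    }

    // A consumer that keeps only the text, adding a newline after each paragraph.
    static String extractText(SketchParser parser, InputStream stream, Map<String, String> metadata)
            throws IOException, SAXException {
        StringBuilder text = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(localName)) {
                    text.append('\n');
                }
            }
        };
        parser.parse(stream, collector, metadata);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("resourceName", "notes.txt"); // optional input metadata
        InputStream in = new ByteArrayInputStream("Hello\nTika".getBytes(StandardCharsets.UTF_8));
        System.out.print(extractText(new PlainTextParser(), in, metadata));
        System.out.println(metadata);
    }
}
```

Because the output is a stream of SAX events, a consumer that only wants plain text for indexing can ignore the markup, while one that cares about structure can react to individual elements.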

The Solution: Legal / Social

• Apache License
  – (L)GPL projects can implement the Tika API

• Pooling of efforts
  – Active development and maintenance
  – Already beyond the functionality of most custom solutions
  – Cool future goals: OCR, speech recognition, …

Agenda

The Problem

The Solution

The Project

The Design

Project Status

• Initially planned in early 2006

• Incubating since March 2007

• Sponsoring PMC: Apache Lucene

• No releases yet
  – 0.1 release being planned

• Small development team
  – 6 committers, 3-4 currently active

Current Features

• Media type framework
  – Shared MIME info spec (freedesktop.org)
  – Default media type registry (incl. glob and magic patterns)

• Parser components
  – PDF (PDFBox)
  – Plain text (ICU4J)
  – XML (SAX)
  – HTML (NekoHTML)
  – Word, PowerPoint, Excel (POI)
  – ODF (SAX)
  – RTF (Swing)

Project Statistics

Codebase History

Lius

Nutch

Lius Lite

Tika

textmining

Jackrabbit

Andy Clark

Jukka Zitting

Rida Benjelloun

Chris Mattmann

Jerome Charron

Sami Siren

Bertrand Delacretaz

Keith Bennett

Agenda

The Problem

The Solution

The Project

The Design

Content Extraction

PPT

Type: application/vnd.ms-powerpoint

Title: Apache Tika

Author: Jukka Zitting

new PowerPointParser().parse(…);

Media Type Detection

application/vnd.ms-powerpoint

MimeTypes types = …;
MimeType type = types.getMimeType(…);

tika-mimetypes.xml

/etc/magic

mime.types
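
To make the two detection inputs concrete, here is a minimal, self-contained sketch that combines magic bytes with file name patterns. It does not use the `MimeTypes` class from the slide; `MediaTypeSketch` and its hard-coded tables are hypothetical stand-ins for what a registry file such as tika-mimetypes.xml would supply, and only a handful of types are covered.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class MediaTypeSketch {

    static final String OCTET_STREAM = "application/octet-stream";

    // Magic byte prefixes: "%PDF-" for PDF, and the OLE2 compound document header.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // File name patterns, reduced here to simple extensions (illustrative subset).
    private static final Map<String, String> EXTENSIONS = new LinkedHashMap<>();
    static {
        EXTENSIONS.put(".ppt", "application/vnd.ms-powerpoint");
        EXTENSIONS.put(".doc", "application/msword");
        EXTENSIONS.put(".xls", "application/vnd.ms-excel");
        EXTENSIONS.put(".pdf", "application/pdf");
        EXTENSIONS.put(".xml", "application/xml");
        EXTENSIONS.put(".txt", "text/plain");
    }

    // Types that are stored inside an OLE2 container.
    private static final Set<String> OLE2_TYPES = new HashSet<>(Arrays.asList(
        "application/vnd.ms-powerpoint", "application/msword", "application/vnd.ms-excel"));

    static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // Returns the type suggested by the file name, or null if no pattern matches.
    static String byName(String fileName) {
        if (fileName == null) {
            return null;
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : EXTENSIONS.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Magic bytes take priority; the file name refines or replaces them when needed.
    static String detect(byte[] prefix, String fileName) {
        String named = byName(fileName);
        if (startsWith(prefix, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(prefix, OLE2_MAGIC)) {
            // The same header is shared by several Office formats,
            // so the name is needed to tell them apart.
            return OLE2_TYPES.contains(named) ? named : OCTET_STREAM;
        }
        return named != null ? named : OCTET_STREAM;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "misnamed.txt"));
        System.out.println(detect(new byte[0], "notes.txt"));
    }
}
```

Letting magic bytes win over the name is one possible policy; the OLE2 case shows why both inputs are useful, since the header alone cannot distinguish PowerPoint from Word or Excel.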

Combined Detection and Extraction

PPT

Type: application/vnd.ms-powerpoint

Title: Apache Tika

Author: Jukka Zitting

TXT

PDF

XML

new AutoDetectParser().parse(…);
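
The idea behind a combined parser is delegation: detect the type first, then hand the document to whichever parser is registered for it. The sketch below illustrates only that dispatch step. `AutoDetectSketch` is a hypothetical class, not the `AutoDetectParser` from the slide; its detection rule and its two toy "parsers" (including a crude regex that strips XML tags) are assumptions chosen to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class AutoDetectSketch {

    // Registry of media type -> extraction function (toy stand-ins for real parsers).
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    AutoDetectSketch() {
        parsers.put("text/plain", data -> new String(data, StandardCharsets.UTF_8));
        // Crude tag stripping, for illustration only; a real parser would use SAX.
        parsers.put("application/xml",
            data -> new String(data, StandardCharsets.UTF_8).replaceAll("<[^>]*>", " ").trim());
    }

    // Minimal detection: an XML declaration means XML, anything else is treated as text.
    static String detect(byte[] data) {
        String head = new String(data, 0, Math.min(data.length, 5), StandardCharsets.US_ASCII);
        return head.startsWith("<?xml") ? "application/xml" : "text/plain";
    }

    // Detects the type, records it in the metadata, and delegates to the matching parser.
    String parse(byte[] data, Map<String, String> metadata) {
        String type = detect(data);
        metadata.put("Content-Type", type);
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(data);
    }

    public static void main(String[] args) {
        AutoDetectSketch auto = new AutoDetectSketch();
        Map<String, String> metadata = new LinkedHashMap<>();
        byte[] xml = "<?xml version=\"1.0\"?><doc>Hi</doc>".getBytes(StandardCharsets.UTF_8);
        System.out.println(auto.parse(xml, metadata));
        System.out.println(metadata);
    }
}
```

The caller never names a format: supporting a new one means registering another entry, which is what lets a single entry point serve PPT, TXT, PDF and XML alike.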

Agenda

The Problem

The Solution

The Project

The Design

Thank You!