Apache Tika

2007-11-15

Jukka Zitting <jukka@apache.org>

Apache Tika

An extensible, configurable content analysis framework / toolkit

Agenda

The Problem

The Solution

The Project

The Design

The Problem

PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.

Lucene index

It’s Worse Than That

Licensing
Dependencies
Metadata extraction
Structured content
Encryption/Compression
Package formats
Streaming
Processing of digital media

The Solution: Technical

• Generic API for extracting metadata and structured text content from a document
  – Input: byte stream + optional metadata
  – Output: XHTML SAX events + metadata

• Automatic content type detection
  – Magic bytes
  – File name patterns
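The input/output contract of the generic API can be sketched with JDK classes only. This is an illustration under assumptions, not Tika's actual interface: `SketchParser`, `PlainTextSketchParser`, `TextCollector`, and `GenericApiSketch` are hypothetical names, and the toy parser handles only UTF-8 plain text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical interface (not Tika's API): byte stream + metadata in,
// XHTML SAX events + metadata out.
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

// Toy parser for UTF-8 plain text: emits the whole text inside a single <p>.
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        metadata.put("Content-Type", "text/plain"); // the parser adds output metadata

        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        handler.startElement(XHTML, "p", "p", none);
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }
}

// A consumer that keeps only character data, as a full-text indexer might.
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

class GenericApiSketch {
    static String extract(byte[] data, Map<String, String> metadata)
            throws IOException, SAXException {
        TextCollector collector = new TextCollector();
        new PlainTextSketchParser().parse(new ByteArrayInputStream(data), collector, metadata);
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = new HashMap<>();
        String text = extract("Hello, Tika".getBytes(StandardCharsets.UTF_8), metadata);
        System.out.println(metadata.get("Content-Type") + ": " + text);
    }
}
```

Because the output is a stream of SAX events, the same parser can feed a text collector, an XML serializer, or any other `ContentHandler` without buffering the whole document.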

The Solution: Legal / Social

• Apache License
  – (L)GPL projects can implement the Tika API

• Pooling of efforts
  – Active development and maintenance
  – Already beyond the functionality of most custom solutions
  – Cool future goals: OCR, speech recognition, …

Project Status

• First planned in early 2006

• Incubating since March 2007

• Sponsoring PMC: Apache Lucene

• No releases yet
  – 0.1 release being planned

• Small development team
  – 6 committers, 3-4 currently active

Current Features

• Media type framework
  – Shared MIME info spec (freedesktop.org)
  – Default media type registry (incl. glob and magic patterns)

• Parser components
  – PDF (PDFBox)
  – Plain text (ICU4J)
  – XML (SAX)
  – HTML (NekoHTML)
  – Word, PowerPoint, Excel (POI)
  – ODF (SAX)
  – RTF (Swing)

Project Statistics

Codebase History

Lius, Nutch, Lius Lite, textmining, Jackrabbit

Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett

Content Extraction

Type: application/vnd.ms-powerpoint
Title: Apache Tika
Author: Jukka Zitting

new PowerPointParser().parse(…);

Media Type Detection

application/vnd.ms-powerpoint

MimeTypes types = …;
MimeType type = types.getMimeType(…);

tika-mimetypes.xml
/etc/magic
mime.types
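How magic bytes and file name patterns can work together is sketched below. This is not the `MimeTypes` class from the slide: `MediaTypeSketch` and its hard-coded rules are hypothetical, standing in for what a registry file would normally supply. The assumption is that magic bytes are checked first and the file name is used to narrow or to fall back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical detector with hard-coded rules; a real registry would load them from a file.
class MediaTypeSketch {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // Header of the OLE2 container used by legacy Word, Excel and PowerPoint files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // prefix: the first bytes of the document; fileName: optional, may be null.
    static String detect(byte[] prefix, String fileName) {
        String name = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);

        if (startsWith(prefix, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(prefix, OLE2_MAGIC)) {
            // The magic identifies only the container; the name narrows the type.
            if (name.endsWith(".ppt")) return "application/vnd.ms-powerpoint";
            if (name.endsWith(".doc")) return "application/msword";
            if (name.endsWith(".xls")) return "application/vnd.ms-excel";
            return "application/octet-stream";
        }
        // No magic match: fall back to file name patterns alone.
        if (name.endsWith(".pdf")) return "application/pdf";
        if (name.endsWith(".html") || name.endsWith(".htm")) return "text/html";
        if (name.endsWith(".txt")) return "text/plain";
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "report.bin"));
        System.out.println(detect(new byte[0], "notes.txt"));
    }
}
```

The PowerPoint case shows why both signals are useful: the three legacy Office formats share one container header, so the magic bytes alone cannot tell them apart.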

Combined Detection and Extraction

Type: application/vnd.ms-powerpoint
Title: Apache Tika
Author: Jukka Zitting

new AutoDetectParser().parse(…);
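The idea of combining the two steps is detect first, then delegate to whichever parser is registered for the detected type. A minimal sketch of that dispatch follows; it is not the implementation of `AutoDetectParser`. `Extractor` and `AutoDetectSketch` are hypothetical names, and detection is reduced to one magic check and one file name pattern.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parser: turns raw bytes into text.
interface Extractor {
    String extract(byte[] data);
}

// Sketch of detect-then-delegate dispatch.
class AutoDetectSketch {
    private final Map<String, Extractor> extractors = new HashMap<>();

    void register(String mediaType, Extractor extractor) {
        extractors.put(mediaType, extractor);
    }

    // Toy detection: one magic check, then one file name pattern.
    static String detect(byte[] data, String fileName) {
        if (data.length >= 5
                && new String(data, 0, 5, StandardCharsets.US_ASCII).equals("%PDF-")) {
            return "application/pdf";
        }
        if (fileName != null && fileName.endsWith(".txt")) {
            return "text/plain";
        }
        return "application/octet-stream";
    }

    // Records the detected type in the metadata, then hands off to the matching extractor.
    String parse(byte[] data, String fileName, Map<String, String> metadata) {
        String type = detect(data, fileName);
        metadata.put("Content-Type", type);
        Extractor extractor = extractors.get(type);
        if (extractor == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return extractor.extract(data);
    }

    public static void main(String[] args) {
        AutoDetectSketch parser = new AutoDetectSketch();
        parser.register("text/plain", data -> new String(data, StandardCharsets.UTF_8));

        Map<String, String> metadata = new HashMap<>();
        String text = parser.parse("hello".getBytes(StandardCharsets.UTF_8), "a.txt", metadata);
        System.out.println(metadata.get("Content-Type") + ": " + text);
    }
}
```

The caller never names a format-specific parser; adding support for a new format means registering one more entry.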

Thank You!
