
Transcript
Page 1: Luís Sarmento Universidade do Porto (NIAD&R) and Linguateca las@fe.up.pt


Setup

* 2.8 GHz Pentium IV

* 2 GB RAM

* 160 GB IDE HD

* Fedora Core 2

* Perl 5.6

* MySQL 5.0.15

* DBI + DBD::mysql

Optimize Queries…

* Text at sentence level: QA, definition extraction

* 1-4 word window contexts: finding MWEs, collocations

* Word co-occurrence data: WSD, context clustering
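The 1-4 word window contexts mentioned above can be sketched as follows. This is a minimal illustration in Python (the original system is implemented in Perl); the helper name `window_contexts` is an assumption, not part of BACO.

```python
# Sketch: extracting 1-4 word window contexts around a target word,
# the kind of data used for finding MWEs and collocations.
# (Hypothetical helper; not BACO's actual Perl code.)
def window_contexts(tokens, target, max_window=4):
    """Return (left_context, right_context) tuples of up to max_window words."""
    out = []
    for i, tok in enumerate(tokens):
        if tok == target:
            for w in range(1, max_window + 1):
                left = tuple(tokens[max(0, i - w):i])
                right = tuple(tokens[i + 1:i + 1 + w])
                out.append((left, right))
    return out

sent = "the big red car stopped".split()
ctxs = window_contexts(sent, "red", max_window=2)
```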

Global Motivation

* Obtain fast text query methods for a variety of “data-driven” NLP techniques

* Develop practical methods for querying current gigabyte corpora (web collections…)

* Experiment with scalable methods for querying the next generation of terabyte corpora

Table           # tuples (millions)   Table size (GB)   Index size (GB)
Metadata        1.529                 0.20              0.05
Sentences       35.575                6.55              5.90
Dictionary      6.834                 0.18              0.27
2-grams         54.610                1.50              0.92
3-grams         173.608               5.43              2.97
4-grams         293.130               10.40             6.35
Co-occurrence   761.044               20.10             7.56
BACO total      -                     44.4              ~24

Statistics

BACO: A large database of text and co-occurrences

Some Practical Problems

* How to compile lists of n-grams (2, 3, 4…) over a 1B-word collection?

* How to obtain co-occurrence information for all pairs of words in a 1B-word collection?

* Which data structures are best (and readily available in Perl)? Hash tables? Trees? Others (Judy? T-trees?)

* How should all this data be stored and indexed in a standard RDBMS?
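One standard answer to the n-gram compilation problem, consistent with the sorted temporary files mentioned in the Stage 2 pipeline below, is an external (disk-based) count: spill sorted batches to disk, then merge. A minimal sketch in Python (the original uses Perl); the batch size and function names are illustrative assumptions:

```python
# Sketch: counting 2-grams in a collection too large for an in-memory hash,
# using sorted temporary files and a k-way merge.
import heapq
import itertools
import os
import tempfile

def ngrams(tokens, n):
    # All n-grams of a token list, as tuples.
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams_external(docs, n=2, batch_size=2):
    """Count n-grams across docs, spilling sorted runs to disk."""
    tmp_files = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        grams = sorted(" ".join(g)
                       for doc in batch
                       for g in ngrams(doc.split(), n))
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("\n".join(grams))
        f.close()
        tmp_files.append(f.name)
    # Merge the sorted runs and aggregate counts.
    streams = [open(p) for p in tmp_files]
    merged = heapq.merge(*(map(str.strip, s) for s in streams))
    counts = {g: sum(1 for _ in grp) for g, grp in itertools.groupby(merged)}
    for s in streams:
        s.close()
    for p in tmp_files:
        os.remove(p)
    return counts
```

Because each run is sorted before being written, the merge step sees identical n-grams adjacently and can aggregate them in a single streaming pass, regardless of collection size.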

Some conclusions

* RDBMSs are a good alternative for querying gigabyte text collections for NLP purposes

* Complex data pre-processing, data modeling and system tuning may be required

* The current implementation deals with raw text, but the models may be extended to annotated corpora

* Query speed depends on internal details of MySQL's indexing mechanism

* Current performance may be improved by a more efficient database schema and by parallelization

Current Deliverables

* A MySQL-encoded database of text, n-grams and information about co-occurrence pairs

* A Perl module to easily query BACO instances
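To illustrate the kind of query such a module exposes, here is a minimal sketch using Python's built-in sqlite3 in place of MySQL/DBI; the table and column names (`ngrams2`, `w1`, `w2`, `freq`) and the sample rows are illustrative assumptions, not BACO's actual schema:

```python
# Sketch: an indexed bigram table and a prefix-frequency query,
# the style of lookup a BACO client would issue.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ngrams2 (w1 TEXT, w2 TEXT, freq INTEGER)")
con.execute("CREATE INDEX idx_w1 ON ngrams2 (w1)")  # index makes first-word lookups fast
con.executemany("INSERT INTO ngrams2 VALUES (?, ?, ?)",
                [("data", "driven", 42), ("data", "base", 17), ("web", "corpora", 9)])

# All bigrams starting with "data", most frequent first.
rows = con.execute(
    "SELECT w2, freq FROM ngrams2 WHERE w1 = ? ORDER BY freq DESC",
    ("data",)).fetchall()
```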

Stage 1: Data preparation and loading

WPT03 (12 GB) → duplicate removal (by Nuno Seco, [email protected]) → 6 GB, 1.5M docs → sentence splitting → document metadata + text sentences, in tabular format → load data → index data → indexed database (metadata + text sentences)

Stage 2: Compiling dictionary + 2,3,4-grams + co-occurrence pairs

* Dictionary and 2-grams: single pass; 13 iterations over a disjoint division based on the number of chars → DIC, 2GRAMS

* 3-grams, 4-grams and co-occurrence pairs: multiple iterations, N documents per iteration; temp files are sorted → 3GRAMS, 4GRAMS, CO-OCPAIRS

* Load data → index data → BACO

Final tables: metadata, text sentences, dictionary, 2,3,4-grams, co-occurrence pairs
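The co-occurrence pairs fed into the final table can be sketched as a windowed pair count. A minimal Python illustration (the original pipeline is in Perl); the window size and the choice of ordered pairs are illustrative assumptions:

```python
# Sketch: counting word co-occurrence pairs within a fixed window
# to the right of each token, the data behind a co-occurrence table.
from collections import Counter

def cooccurrence_pairs(tokens, window=4):
    """Count (word, neighbour) pairs within `window` words to the right."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pairs[(w, tokens[j])] += 1
    return pairs

pairs = cooccurrence_pairs("a b a".split(), window=1)
```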

Linguateca

* Improving processing and research on the Portuguese language

* Fostering collaboration among researchers

* Providing public and free-of-charge tools and resources to the community

http://www.linguateca.pt

WPT03 - A public resource

* WPT03 is a resource built by the XLDB Group (xldb.di.fc.ul.pt) and distributed by Linguateca (www.linguateca.pt)

* 12 GB, 3.7M web documents, ~1.6B words

* Obtained from the Portuguese web search engine TUMBA! (http://www.tumba.pt)

NIAD&R

* Research group started in 1998 as part of LIACC (AI Lab) at Universidade do Porto

* Research topics: Multi-Agent Systems, E-business Technology, Machine Learning, Robotics, Ontologies

http://www.fe.up.pt/~eol/

BACO: BAse de Co-Ocorrências (Base of Co-occurrences)