Luís Sarmento, Universidade do Porto (NIAD&R) and Linguateca ([email protected])


Setup

* 2.8 GHz Pentium IV

* 2 GB RAM

* 160 GB IDE HD

* Fedora Core 2

* Perl 5.6

* MySQL 5.0.15

* DBI + DBD::mysql (connection sketch below)
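
All database access in this setup goes through DBI with the DBD::mysql driver. A minimal connection sketch, assuming placeholder database name, host and credentials (not BACO's real settings):

    use strict;
    use warnings;
    use DBI;

    # Connect to a local MySQL 5.0 server via DBI + DBD::mysql.
    # Database name, host, user and password are placeholders.
    my $dbh = DBI->connect(
        'DBI:mysql:database=baco;host=localhost',
        'user', 'secret',
        { RaiseError => 1, AutoCommit => 1 },
    );

    print "connected\n";
    $dbh->disconnect;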

Optimize Queries…

* Text at sentence level: QA, definition extraction

* 1-4 word window contexts: finding MWEs, collocations

* Word co-occurrence data: WSD, context clustering
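
To make the three cases concrete, here is a hedged sketch of such queries through DBI. The schema details (tables sentences, ngrams2 and cooccurrence; columns text, w1, w2, freq; a FULLTEXT index on sentences.text) are assumptions for illustration, not BACO's actual layout:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=baco', 'user', 'secret',
                           { RaiseError => 1 });

    # Sentence-level query (QA, definition extraction):
    # fetch sentences mentioning a term, via an assumed FULLTEXT index.
    my $sents = $dbh->selectcol_arrayref(
        q{SELECT text FROM sentences WHERE MATCH(text) AGAINST (?) LIMIT 10},
        undef, 'ontologia');

    # Window-context query (MWE / collocation finding):
    # most frequent right neighbours of a word in the 2-gram table.
    my $bigrams = $dbh->selectall_arrayref(
        q{SELECT w2, freq FROM ngrams2 WHERE w1 = ? ORDER BY freq DESC LIMIT 20},
        undef, 'banco');

    # Co-occurrence query (WSD, context clustering):
    # strongest co-occurring words for a target word.
    my $cooc = $dbh->selectall_arrayref(
        q{SELECT w2, freq FROM cooccurrence WHERE w1 = ? ORDER BY freq DESC LIMIT 20},
        undef, 'banco');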

Global Motivation

* Obtain fast text query methods for a variety of “data-driven” NLP techniques

* Develop practical methods for querying current gigabyte corpora (web collections…)

* Experiment with scalable methods for querying the next generation of terabyte corpora

Statistics

Table           # tuples (millions)   Table size (GB)   Index size (GB)
Metadata              1.529               0.20               0.05
Sentences            35.575               6.55               5.90
Dictionary            6.834               0.18               0.27
2-grams              54.610               1.50               0.92
3-grams             173.608               5.43               2.97
4-grams             293.130              10.40               6.35
Co-occurrence       761.044              20.10               7.56
BACO total                -              44.4                ~24

BACO: A large database of text and co-occurrences

Some Practical Problems

* How to compile lists of n-grams (2, 3, 4…) over a 1B-word collection?

* How to obtain co-occurrence information for all pairs of words in a 1B-word collection?

* Which data structures are best (and easily available in Perl): hash tables? Trees? Others (Judy? T-trees?)… (a counting sketch follows this list)

* How should all this data be stored and indexed in a standard RDBS?
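
As a sketch of the hash-table option raised above: counting 2-grams with a plain Perl hash is trivial to write, which is exactly the point, since for a 1B-word collection the hash outgrows 2 GB of RAM and forces the partitioned, sorted-temp-file strategies described in Stage 2 below (tokenization simplified):

    use strict;
    use warnings;

    # Count 2-grams from tokenized text on STDIN, one sentence per line.
    my %bigram;
    while (my $line = <STDIN>) {
        my @tok = split ' ', $line;
        $bigram{"$tok[$_] $tok[$_ + 1]"}++ for 0 .. $#tok - 1;
    }

    # Emit counts, most frequent n-gram first.
    for my $ng (sort { $bigram{$b} <=> $bigram{$a} } keys %bigram) {
        print "$ng\t$bigram{$ng}\n";
    }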

Some conclusions

* RDBSs are a good alternative for querying gigabyte-scale text collections for NLP purposes

* Complex data pre-processing, data modeling and system tuning may be required

* The current implementation deals with raw text, but the models may be extended to annotated corpora

* Query speed depends on internal details of MySQL's indexing mechanism

* Current performance may be improved by a more efficient database schema and by parallelization

Current Deliverables

* A MySQL-encoded database of text, n-grams and co-occurrence pair information

* A Perl module to easily query BACO instances (hypothetical usage sketched below)
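
Purely as an illustration of the second deliverable, the sketch below invents a module name and methods, since the module's real interface is not shown here; every identifier is hypothetical:

    use strict;
    use warnings;
    use BACO;   # hypothetical module name, for illustration only

    my $baco = BACO->new(db => 'baco', user => 'user', password => 'secret');

    # Hypothetical convenience methods wrapping SQL queries like those above.
    my @sentences  = $baco->sentences_with('ontologia');
    my @collocates = $baco->ngrams(2, first => 'banco', limit => 20);
    my @cooc       = $baco->cooccurrences('banco', limit => 20);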

Stage 1: Data preparation and loading

[Diagram: WPT03 (12 GB) → duplicate removal (by Nuno Seco, [email protected]) → 6 GB, 1.5M docs → sentence splitting → document metadata + text sentences in tabular format → load data → index data → indexed database (metadata + text sentences). A bulk-load sketch follows.]
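
The "load data" and "index data" steps map naturally onto MySQL's bulk-load path. A hedged sketch, assuming a tab-separated file and a sentences(doc_id, sent_id, text) table, both placeholders; building indexes after the load is typically much faster than maintaining them row by row during it:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=baco;mysql_local_infile=1',
                           'user', 'secret', { RaiseError => 1 });

    # Bulk-load the tabular file produced in Stage 1 (path is a placeholder).
    $dbh->do(q{
        LOAD DATA LOCAL INFILE '/data/wpt03/sentences.tsv'
        INTO TABLE sentences
        FIELDS TERMINATED BY '\t'
        (doc_id, sent_id, text)
    });

    # Index after loading: one bulk index build instead of per-row upkeep.
    $dbh->do(q{ALTER TABLE sentences ADD INDEX idx_doc (doc_id)});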

Stage 2: Compiling dictionary + 2,3,4-grams + co-occurrence pairs

[Diagram: text sentences → DIC and 2-GRAMS (single pass; 13 iterations with a disjoint division based on number of chars) → 3,4-grams + co-occurrence pairs (multiple iterations, N documents per iteration, temp files are sorted) → 3-GRAMS, 4-GRAMS, CO-OC PAIRS → load data → index data → BACO. An iteration sketch follows.]
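
A hedged sketch of the "multiple iterations, temp files are sorted" strategy, shown for 3-grams: count a batch of documents in memory, flush the counts as a sorted temporary run, and merge the runs at the end. Batch size, file names and the GNU sort merge are assumptions of this sketch:

    use strict;
    use warnings;

    my $batch_size = 100_000;   # "N documents per iteration" (placeholder)
    my %count;
    my @runs;
    my ($docs, $run) = (0, 0);

    # Dump the in-memory counts as a sorted temporary run, then reset.
    sub flush_run {
        my $file = sprintf 'run_%03d.tmp', $run++;
        open my $fh, '>', $file or die "cannot write $file: $!";
        print {$fh} "$_\t$count{$_}\n" for sort keys %count;
        close $fh;
        push @runs, $file;
        %count = ();
    }

    while (my $doc = <STDIN>) {          # one tokenized document per line
        my @tok = split ' ', $doc;
        $count{"$tok[$_] $tok[$_ + 1] $tok[$_ + 2]"}++ for 0 .. $#tok - 2;
        flush_run() if ++$docs % $batch_size == 0;
    }
    flush_run() if %count;

    # Merge the sorted runs: identical 3-grams become adjacent, so one
    # sequential pass can sum their partial counts before LOAD DATA.
    system('sort', '-m', '-o', 'merged.tsv', @runs) == 0 or die 'merge failed';

Because each run is written in sorted order, the final merge is a single sequential pass; the same pattern extends to 4-grams and co-occurrence pairs.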

Final Tables:

* metadata

* text sentences

* dictionary

* 2,3,4-grams

* co-occurrence pairs

Linguateca

* Improving processing and research on the Portuguese language

* Fostering collaboration among researchers

* Providing public and free-of-charge tools and resources to the community

http://www.linguateca.pt

WPT03 - A public resource

* WPT03 is a resource built by the XLDB Group (xldb.di.fc.ul.pt) and distributed by Linguateca (www.linguateca.pt)

* 12 GB, 3.7M web documents and ~1.6B words

* Obtained from the Portuguese web search engine TUMBA! http://www.tumba.pt

NIAD&R

* A research group started in 1998 as part of LIACC (the AI Lab) at Universidade do Porto

* Research topics: Multi-Agent Systems, E-business Technology, Machine Learning, Robotics, Ontologies

http://www.fe.up.pt/~eol/

BACO: BAse de Co-Ocorrências ("base of co-occurrences")