Luís Sarmento, Universidade do Porto (NIAD&R) and Linguateca ([email protected])


Setup

* 2.8 GHz Pentium IV

* 2 GB RAM

* 160 GB IDE HD

* Fedora Core 2

* Perl 5.6

* MySQL 5.0.15

* DBI + DBD::mysql (connection sketch below)
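
All database access in this setup goes through DBI with the DBD::mysql driver. A minimal connection sketch, assuming placeholder database name, host and credentials (not BACO's real settings):

    use strict;
    use warnings;
    use DBI;

    # Connect to a local MySQL 5.0 server via DBI + DBD::mysql.
    # Database name, host, user and password are placeholders.
    my $dbh = DBI->connect(
        'DBI:mysql:database=baco;host=localhost',
        'user', 'secret',
        { RaiseError => 1, AutoCommit => 1 },
    );

    print "connected\n";
    $dbh->disconnect;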

Optimize Queries…

* Text at sentence level: QA, definition extraction

* 1-4 word window contexts: finding MWEs, collocations

* Word co-occurrence data: WSD, context clustering
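
To make the three cases concrete, here is a hedged sketch of such queries through DBI. The schema details (tables sentences, ngrams2 and cooccurrence; columns text, w1, w2, freq; a FULLTEXT index on sentences.text) are assumptions for illustration, not BACO's actual layout:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=baco', 'user', 'secret',
                           { RaiseError => 1 });

    # Sentence-level query (QA, definition extraction):
    # fetch sentences mentioning a term, via an assumed FULLTEXT index.
    my $sents = $dbh->selectcol_arrayref(
        q{SELECT text FROM sentences WHERE MATCH(text) AGAINST (?) LIMIT 10},
        undef, 'ontologia');

    # Window-context query (MWE / collocation finding):
    # most frequent right neighbours of a word in the 2-gram table.
    my $bigrams = $dbh->selectall_arrayref(
        q{SELECT w2, freq FROM ngrams2 WHERE w1 = ? ORDER BY freq DESC LIMIT 20},
        undef, 'banco');

    # Co-occurrence query (WSD, context clustering):
    # strongest co-occurring words for a target word.
    my $cooc = $dbh->selectall_arrayref(
        q{SELECT w2, freq FROM cooccurrence WHERE w1 = ? ORDER BY freq DESC LIMIT 20},
        undef, 'banco');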

Global Motivation

* Obtain fast text query methods for a variety of “data-driven” NLP techniques

* Develop practical methods for querying current gigabyte corpora (web collections…)

* Experiment with scalable methods for querying the next generation of terabyte corpora

Statistics

Table           # tuples (millions)   Table size (GB)   Index size (GB)
Metadata              1.529               0.20               0.05
Sentences            35.575               6.55               5.90
Dictionary            6.834               0.18               0.27
2-grams              54.610               1.50               0.92
3-grams             173.608               5.43               2.97
4-grams             293.130              10.40               6.35
Co-occurrence       761.044              20.10               7.56
BACO total                -              44.4                ~24

BACO: A large database of text and co-occurrences

Some Practical Problems

* How to compile lists of n-grams (2, 3, 4…) over a 1B-word collection?

* How to obtain co-occurrence information for all pairs of words in a 1B-word collection?

* Which data structures are best (and easily available in Perl): hash tables? Trees? Others (Judy? T-trees?)… (a counting sketch follows this list)

* How should all this data be stored and indexed in a standard RDBS?
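
As a sketch of the hash-table option raised above: counting 2-grams with a plain Perl hash is trivial to write, which is exactly the point, since for a 1B-word collection the hash outgrows 2 GB of RAM and forces the partitioned, sorted-temp-file strategies described in Stage 2 below (tokenization simplified):

    use strict;
    use warnings;

    # Count 2-grams from tokenized text on STDIN, one sentence per line.
    my %bigram;
    while (my $line = <STDIN>) {
        my @tok = split ' ', $line;
        $bigram{"$tok[$_] $tok[$_ + 1]"}++ for 0 .. $#tok - 1;
    }

    # Emit counts, most frequent n-gram first.
    for my $ng (sort { $bigram{$b} <=> $bigram{$a} } keys %bigram) {
        print "$ng\t$bigram{$ng}\n";
    }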

Some conclusions

* RDBSs are a good alternative for querying gigabyte-scale text collections for NLP purposes

* Complex data pre-processing, data modeling and system tuning may be required

* The current implementation deals with raw text, but the models may be extended to annotated corpora

* Query speed depends on internal details of MySQL's indexing mechanism

* Current performance may be improved by a more efficient database schema and by parallelization

Current Deliverables

* A MySQL-encoded database of text, n-grams and co-occurrence pair information

* A Perl module to easily query BACO instances (hypothetical usage sketched below)
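
Purely as an illustration of the second deliverable, the sketch below invents a module name and methods, since the module's real interface is not shown here; every identifier is hypothetical:

    use strict;
    use warnings;
    use BACO;   # hypothetical module name, for illustration only

    my $baco = BACO->new(db => 'baco', user => 'user', password => 'secret');

    # Hypothetical convenience methods wrapping SQL queries like those above.
    my @sentences  = $baco->sentences_with('ontologia');
    my @collocates = $baco->ngrams(2, first => 'banco', limit => 20);
    my @cooc       = $baco->cooccurrences('banco', limit => 20);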

Stage 1: Data preparation and loading

[Diagram: WPT03 (12 GB) → duplicate removal (by Nuno Seco, [email protected]) → 6 GB, 1.5M docs → sentence splitting → document metadata + text sentences in tabular format → load data → index data → indexed database (metadata + text sentences). A bulk-load sketch follows.]
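
The "load data" and "index data" steps map naturally onto MySQL's bulk-load path. A hedged sketch, assuming a tab-separated file and a sentences(doc_id, sent_id, text) table, both placeholders; building indexes after the load is typically much faster than maintaining them row by row during it:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=baco;mysql_local_infile=1',
                           'user', 'secret', { RaiseError => 1 });

    # Bulk-load the tabular file produced in Stage 1 (path is a placeholder).
    $dbh->do(q{
        LOAD DATA LOCAL INFILE '/data/wpt03/sentences.tsv'
        INTO TABLE sentences
        FIELDS TERMINATED BY '\t'
        (doc_id, sent_id, text)
    });

    # Index after loading: one bulk index build instead of per-row upkeep.
    $dbh->do(q{ALTER TABLE sentences ADD INDEX idx_doc (doc_id)});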

Stage 2: Compiling dictionary + 2,3,4-grams + co-occurrence pairs

[Diagram: text sentences → DIC and 2-GRAMS (single pass; 13 iterations with a disjoint division based on number of chars) → 3,4-grams + co-occurrence pairs (multiple iterations, N documents per iteration, temp files are sorted) → 3-GRAMS, 4-GRAMS, CO-OC PAIRS → load data → index data → BACO. An iteration sketch follows.]
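
A hedged sketch of the "multiple iterations, temp files are sorted" strategy, shown for 3-grams: count a batch of documents in memory, flush the counts as a sorted temporary run, and merge the runs at the end. Batch size, file names and the GNU sort merge are assumptions of this sketch:

    use strict;
    use warnings;

    my $batch_size = 100_000;   # "N documents per iteration" (placeholder)
    my %count;
    my @runs;
    my ($docs, $run) = (0, 0);

    # Dump the in-memory counts as a sorted temporary run, then reset.
    sub flush_run {
        my $file = sprintf 'run_%03d.tmp', $run++;
        open my $fh, '>', $file or die "cannot write $file: $!";
        print {$fh} "$_\t$count{$_}\n" for sort keys %count;
        close $fh;
        push @runs, $file;
        %count = ();
    }

    while (my $doc = <STDIN>) {          # one tokenized document per line
        my @tok = split ' ', $doc;
        $count{"$tok[$_] $tok[$_ + 1] $tok[$_ + 2]"}++ for 0 .. $#tok - 2;
        flush_run() if ++$docs % $batch_size == 0;
    }
    flush_run() if %count;

    # Merge the sorted runs: identical 3-grams become adjacent, so one
    # sequential pass can sum their partial counts before LOAD DATA.
    system('sort', '-m', '-o', 'merged.tsv', @runs) == 0 or die 'merge failed';

Because each run is written in sorted order, the final merge is a single sequential pass; the same pattern extends to 4-grams and co-occurrence pairs.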

Final Tables:

* metadata

* text sentences

* dictionary

* 2,3,4-grams

* co-occurrence pairs

Linguateca

* Improving processing and research on the Portuguese language

* Fostering collaboration among researchers

* Providing public and free-of-charge tools and resources to the community

http://www.linguateca.pt

WPT03 - A public resource

* WPT03 is a resource built by the XLDB Group (xldb.di.fc.ul.pt) and distributed by Linguateca (www.linguateca.pt)

* 12 GB, 3.7M web documents and ~1.6B words

* Obtained from the Portuguese web search engine TUMBA! http://www.tumba.pt

NIAD&R

* A research group started in 1998 as part of LIACC (the AI Lab) at Universidade do Porto

* Research topics: Multi-Agent Systems, E-business Technology, Machine Learning, Robotics, Ontologies

http://www.fe.up.pt/~eol/

BACO: BAse de Co-Ocorrências ("base of co-occurrences")