Rocky2010 roeder full_textbiomedicalliteratureprocesing

Chris.Roeder@ucdenver.edu

http://compbio.uchsc.edu

Full Text Biomedical Literature Processing: More Than a Scaling Challenge

Christophe Roeder, Tom Christiansen, Helen Johnson, Karin Verspoor (UC Denver)

Gully Burns (ISI), Lawrence Hunter (UC Denver)

Obtaining Documents

• Identify documents by querying PubMed

– Challenging due to variations in names

• Not all documents are freely available

– One project identified 3034 documents

• 1253 (41%) licensed, available without charge

• 418 (14%) available in PubMed Central

– Availability affects experiment reproducibility

• Downloading can be problematic

– Manual download is slow. PMC Open Access is limited

– Arrange bulk download from publishers based on existing licenses
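As a rough illustration of the querying step, the sketch below builds a PubMed search URL for the NCBI E-utilities ESearch endpoint that ORs together several spellings of one name. The helper name `build_pubmed_query` and the example variants are hypothetical and not from the talk; the code only constructs the URL and does not contact the server. It also recomputes the availability shares quoted above.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint for PubMed.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(name_variants, retmax=100):
    """Return an ESearch URL that ORs together quoted spelling variants.

    build_pubmed_query is a hypothetical helper, not part of any library.
    """
    term = " OR ".join('"{}"'.format(v) for v in name_variants)
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


def share(count, total):
    """Percentage of total, rounded to the nearest whole number."""
    return round(100 * count / total)


# Hypothetical spelling variants of a single name (assumption, for illustration).
url = build_pubmed_query(["NF-kappaB", "NF kappa B", "NFkB"])
print(url)

# Availability figures quoted in the bullets above.
total = 3034
print("licensed:", share(1253, total), "%")
print("PubMed Central:", share(418, total), "%")
```

Listing the variants explicitly keeps the query reproducible, which matters when the retrieved set defines the experiment's corpus.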

File Formats

• Documents are available in many formats:

– HTML, XML, PDF, plain text

• Convert to plain text for NLP tool input

– Stripping XML or HTML markup is relatively

– ISI is working on PDF Extract to find correct

– Keep document zoning, other markup

• headings, sections, captions, italics

• Identify source character encoding properly

– XML stores the encoding in file, others do not
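One way to strip HTML markup while keeping zoning information is sketched below with Python's standard `html.parser`. The class name `ZonedTextExtractor`, the tag-to-zone mapping, and the sample document are assumptions made for illustration; this is not the pipeline used by the authors.

```python
from html.parser import HTMLParser


class ZonedTextExtractor(HTMLParser):
    """Collect visible text and label each chunk with a zone (hypothetical helper)."""

    # Assumed mapping from tags to zones; extend as the corpus requires.
    ZONES = {"h1": "heading", "h2": "heading", "h3": "heading",
             "figcaption": "caption", "i": "italic", "em": "italic"}
    SKIP = {"script", "style"}  # content that is never document text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []    # currently open zone or skip tags
        self.chunks = []   # (zone, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.ZONES or tag in self.SKIP:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or any(t in self.SKIP for t in self.stack):
            return
        zone = self.ZONES[self.stack[-1]] if self.stack else "body"
        self.chunks.append((zone, text))


sample = ("<html><head><style>p{}</style></head><body>"
          "<h1>Results</h1><p>We studied <i>E. coli</i> growth.</p>"
          "</body></html>")

parser = ZonedTextExtractor()
parser.feed(sample)
parser.close()

plain_text = " ".join(text for _, text in parser.chunks)
print(parser.chunks)
print(plain_text)
```

Keeping the `(zone, text)` pairs alongside the plain text lets downstream NLP tools treat headings, captions, and italics differently without re-parsing the markup.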

Character Representation

• Encoding is a mapping from bytes to characters

• Difficult to discern which encoding a file uses

– ASCII, UTF-8, MacRoman, ISO-8859-1, or other?

• Reading a file with the wrong encoding can produce unreported errors and spurious ‘?’ characters

• Java regular expression classes (\w, \s) don’t match non-ASCII characters by default

• Some characters look like others:

– dash, en dash, minus

– space, em space, non-breaking-space
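The points above can be reproduced in a few lines. The sketch below is in Python rather than Java: the `re.ASCII` flag is used only to imitate ASCII-only character classes, and the sample strings and the `LOOKALIKES` table are assumptions chosen for illustration.

```python
import re
import unicodedata

# Bytes as they might sit on disk: UTF-8 text with e-acute, en dash, micro sign.
raw = "caf\u00e9 \u2013 5 \u00b5m".encode("utf-8")

# Decoding with the wrong encoding raises no error; the text is silently garbled.
right = raw.decode("utf-8")
wrong = raw.decode("iso-8859-1")

# Forcing the text into ASCII replaces each unknown character with '?'.
lossy = right.encode("ascii", errors="replace").decode("ascii")

# ASCII-only \w splits words at accented letters; Unicode-aware \w does not.
ascii_words = re.findall(r"\w+", "na\u00efve caf\u00e9", flags=re.ASCII)
unicode_words = re.findall(r"\w+", "na\u00efve caf\u00e9")

# Look-alike characters have distinct code points and names.
names = [unicodedata.name(c) for c in "-\u2013\u2212 \u2003\u00a0"]

# Assumed folding table: map look-alikes onto plain ASCII hyphen and space.
LOOKALIKES = str.maketrans({
    "\u2013": "-",   # en dash
    "\u2212": "-",   # minus sign
    "\u2003": " ",   # em space
    "\u00a0": " ",   # no-break space
})


def normalize(text):
    """Fold look-alike dashes and spaces to ASCII (hypothetical helper)."""
    return text.translate(LOOKALIKES)


print(repr(wrong))
print(lossy)
print(ascii_words, unicode_words)
print(names)
print(normalize("5\u00a0\u2212\u00a03"))
```

Whether to fold look-alikes is a design choice: it simplifies tokenization, but a minus sign and an en dash can carry different meanings in biomedical text, so the original characters may be worth keeping alongside the normalized form.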

Scaling

• Use a cluster when you need more than a desktop

• Prefer an easy migration from desktop to cluster

• Concurrency (threading) issues are minimized since most NLP processes are independent

• Finding success using Sun/Oracle Grid Engine (SGE) and Network File System (NFS) on a small (48 core) cluster

– NFS shares disks between nodes

– SGE starts and manages processes on cluster
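Because documents can be processed independently, one simple way to move from desktop to cluster is to give each process its own slice of the file list. The sketch below assumes an SGE array job submitted with a task range starting at 1 and a step of 1, so that the `SGE_TASK_ID` and `SGE_TASK_LAST` environment variables give the task's position and the task count. The helper names `shard` and `task_from_env` and the file names are hypothetical.

```python
import os


def shard(paths, task_id, n_tasks):
    """Return the paths owned by 1-based task_id out of n_tasks (round-robin)."""
    if not 1 <= task_id <= n_tasks:
        raise ValueError("task_id must be in 1..n_tasks")
    # Sorting makes every task see the same order on the shared NFS disk.
    return sorted(paths)[task_id - 1::n_tasks]


def task_from_env(env=os.environ):
    """Read the array-job position from SGE, or fall back to one task.

    On a desktop the variables are absent; in a non-array SGE job they are
    set to the string "undefined". Both cases run everything in one task.
    """
    raw_id = env.get("SGE_TASK_ID", "1")
    raw_last = env.get("SGE_TASK_LAST", "1")
    task_id = int(raw_id) if raw_id.isdigit() else 1
    n_tasks = int(raw_last) if raw_last.isdigit() else 1
    return task_id, n_tasks


# Hypothetical document list; in practice this would come from a directory.
docs = ["d.txt", "a.txt", "c.txt", "b.txt", "e.txt"]

task_id, n_tasks = task_from_env()
for path in shard(docs, task_id, n_tasks):
    print("task", task_id, "of", n_tasks, "would process", path)
```

The same script runs unchanged on a desktop, where it handles every document, and under SGE, where each task handles a disjoint subset and no threading is needed.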

Acknowledgements

• UC Denver

– Helen Johnson

– Tom Christiansen

– Karin Verspoor, NIH grant R01 LM010120-01

– Larry Hunter,

• NIH 2R01LM009254-04

• NIH 2R01LM008111-04A1

• NIH 5R01GM083649-02

• ISI

– Gully Burns, NSF grant #0849977
