Rocky2010 roeder full_textbiomedicalliteratureprocesing


http://compbio.uchsc.edu

Full Text Biomedical Literature Processing: More Than a Scaling Challenge

Christophe Roeder, Tom Christiansen, Helen Johnson, Karin Verspoor (UC Denver),

Gully Burns (ISI), Lawrence Hunter (UC Denver)

Obtaining Documents

• Identify documents by querying PubMed

– Challenging due to variations in names

• Not all documents are freely available

– One project identified 3034 documents

• 1253 (41%) licensed, available without charge

• 418 (14%) available in PubMed Central

– Availability affects experiment reproducibility

• Downloading can be problematic

– Manual download is slow. PMC Open Access is limited

– Arrange bulk download from publishers based on existing licenses
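One way to cope with name variation when querying PubMed is to OR the known variants together in a single search term. The sketch below assumes the NCBI E-utilities ESearch endpoint is used (the slides do not say how PubMed was queried) and that the varying names are author names. The class name, the example names, and the `retmax` value are hypothetical. It only builds the request URL and makes no network call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

public class PubMedQuery {
    // NCBI E-utilities ESearch endpoint (assumed here; not named in the slides).
    static final String ESEARCH =
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    /** Joins name variants into one OR query, each restricted to a PubMed field tag. */
    public static String orTerm(List<String> variants, String fieldTag) {
        StringBuilder sb = new StringBuilder();
        for (String variant : variants) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append('"').append(variant).append('"')
              .append('[').append(fieldTag).append(']');
        }
        return sb.toString();
    }

    /** Builds an ESearch URL for the term; retmax caps the number of returned IDs. */
    public static String esearchUrl(String term, int retmax) {
        try {
            return ESEARCH + "?db=pubmed&retmax=" + retmax
                + "&term=" + URLEncoder.encode(term, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical spellings of one author's name.
        List<String> variants = Arrays.asList("Smith J", "Smith JA");
        String term = orTerm(variants, "Author");
        System.out.println(term);
        System.out.println(esearchUrl(term, 100));
    }
}
```

The resulting ID list still needs manual review, since an OR query over short name variants can also match unrelated people.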

File Formats

• Documents are available in many formats:

– HTML, XML, PDF, plain text

• Convert to plain text for NLP tool input

– Stripping XML or HTML markup is relatively easy

– ISI is working on PDF Extract to find the correct text flow in PDFs

– Keep document zoning, other markup

• headings, sections, captions, italics

• Identify source character encoding properly

– XML stores the encoding in file, others do not
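Because XML can name its encoding in the XML declaration, that value can be read before any text is decoded. The following is a minimal sketch using the JDK's StAX API; the class and method names are hypothetical, and it assumes the file has been read into a byte array. It returns only what the declaration claims, which may still differ from how the bytes were actually written.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DeclaredEncoding {
    /** Returns the encoding named in the XML declaration, or null if none is declared. */
    public static String declaredEncoding(byte[] xmlBytes) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xmlBytes));
            try {
                // The reader starts at START_DOCUMENT, after the declaration is parsed.
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        // A small Latin-1 document with a matching declaration.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>caf\u00e9</title>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(bytes));
    }
}
```

HTML, PDF, and plain text carry no equally reliable marker, so their encoding has to come from HTTP headers, metadata, or inspection of the bytes.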

Character Representation

• Encoding is a mapping from bytes to characters

• Difficult to discern which encoding a file uses

– ASCII, UTF-8, MacRoman, ISO-8859-1, or other?

• Reading a file with the wrong encoding can produce unreported errors and spurious ‘?’ characters

• Java regular expression classes (\w, \s) match only ASCII characters by default, so they miss accented letters and non-breaking spaces

• Some characters look like others:

– dash, en dash, minus

– space, em space, non-breaking-space
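The three pitfalls above can each be checked in a few lines of Java. This is a sketch under stated assumptions: the class and method names are hypothetical, and `Pattern.UNICODE_CHARACTER_CLASS` and `Character.getName` require Java 7 or later. A strict decoder makes a wrong-encoding read fail loudly, the regex flag widens `\w` beyond ASCII, and Unicode character names tell look-alike characters apart.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class CharacterChecks {
    /** True if the bytes decode under the charset with no malformed or unmappable input. */
    public static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** True if the whole token matches \w+ under the given regex flags (0 for defaults). */
    public static boolean isWord(String token, int flags) {
        return Pattern.compile("\\w+", flags).matcher(token).matches();
    }

    /** Unicode name of a character, which distinguishes look-alikes. */
    public static String nameOf(char c) {
        return Character.getName(c);
    }

    public static void main(String[] args) {
        // "café" written as Latin-1, then read as if it were UTF-8.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));
        // The String constructor does not report the problem; it substitutes U+FFFD.
        System.out.println(new String(latin1, StandardCharsets.UTF_8).indexOf('\uFFFD'));

        // "naïve" contains a non-ASCII letter.
        System.out.println(isWord("na\u00efve", 0));
        System.out.println(isWord("na\u00efve", Pattern.UNICODE_CHARACTER_CLASS));

        // Hyphen-minus, en dash, and minus sign look alike but are distinct characters.
        for (char c : new char[] {'-', '\u2013', '\u2212'}) {
            System.out.println(nameOf(c));
        }
    }
}
```

A clean strict decode is not proof that the encoding is right: ISO-8859-1 assigns a character to every byte value, so it accepts any input.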

Scaling

• Use a cluster when you need more than a desktop

• Prefer an easy migration from desktop to cluster

• Concurrency (threading) issues are minimized since most NLP processes are independent

• Having success with Sun/Oracle Grid Engine (SGE) and Network File System (NFS) on a small (48-core) cluster

– NFS shares disks between nodes

– SGE starts and manages processes on the cluster
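Since the per-document processes are independent, one simple way to split the work is an SGE array job in which every task reads the same document list from the shared NFS directory and keeps only its own share. The sketch below is an illustration, not the authors' setup: the class name and file names are hypothetical, and it assumes the job was submitted as an array starting at 1 with step 1 (for example `qsub -t 1-48`), so that the `SGE_TASK_LAST` environment variable equals the number of tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaskShare {
    /** Returns the items assigned to 1-based task taskId out of taskCount tasks. */
    public static <T> List<T> shareFor(List<T> items, int taskId, int taskCount) {
        if (taskCount < 1 || taskId < 1 || taskId > taskCount) {
            throw new IllegalArgumentException("taskId must be in 1.." + taskCount);
        }
        List<T> mine = new ArrayList<>();
        // Round-robin: task k takes items k-1, k-1+taskCount, k-1+2*taskCount, ...
        for (int i = taskId - 1; i < items.size(); i += taskCount) {
            mine.add(items.get(i));
        }
        return mine;
    }

    /** Reads an integer environment variable, falling back when unset or non-numeric. */
    static int envInt(String name, int fallback) {
        String value = System.getenv(name);
        if (value == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            // Outside an array job the variable is not a number.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // On a desktop both values fall back to 1, so one process handles everything.
        int taskId = envInt("SGE_TASK_ID", 1);
        int taskCount = envInt("SGE_TASK_LAST", 1);

        // Hypothetical document list; in practice it would be read from the NFS share.
        List<String> docs = Arrays.asList("doc1.txt", "doc2.txt", "doc3.txt", "doc4.txt");
        for (String doc : shareFor(docs, taskId, taskCount)) {
            System.out.println("task " + taskId + " processes " + doc);
        }
    }
}
```

The fallback to a single task keeps the desktop and cluster runs on the same code path, which fits the preference for an easy migration.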

Acknowledgements

• UC Denver

– Helen Johnson

– Tom Christiansen

– Karin Verspoor, NIH grant R01 LM010120-01

– Larry Hunter, NIH grants:

• NIH 2R01LM009254-04

• NIH 2R01LM008111-04A1

• NIH 5R01GM083649-02

• ISI

– Gully Burns, NSF grant #0849977
