Rocky2010 roeder full_textbiomedicalliteratureprocesing
http://compbio.uchsc.edu
Full Text Biomedical Literature Processing:
More Than a Scaling Challenge
Christophe Roeder, Tom Christiansen, Helen Johnson, Karin Verspoor (UC Denver),
Gully Burns (ISI), Lawrence Hunter (UC Denver)
Obtaining Documents
• Identify documents by querying PubMed
– Challenging due to variations in names
• Not all documents are freely available
– One project identified 3034 documents
• 1253 (41%) licensed, available without charge
• 418 (14%) available in PubMed Central
– Availability affects experiment reproducibility
• Downloading can be problematic
– Manual download is slow; the PMC Open Access subset is limited
– Arrange bulk download from publishers based on existing licenses
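One way to cope with name variations when querying PubMed is to OR several spellings together in a single search term. The sketch below only builds the request URL for NCBI's E-utilities ESearch service and sends nothing. The endpoint and the `[Author]` field tag are part of the public PubMed interface but are not named on the slide; the helper names and the example name variants are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumption: the slide does not say
# which PubMed interface was used).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def author_query(name_variants):
    """OR together several spellings of one author name.

    The [Author] field tag restricts each term to the author field.
    """
    return " OR ".join(f"{name}[Author]" for name in name_variants)


def esearch_url(term, retmax=100):
    """Build (but do not send) an ESearch request URL for PubMed."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical name variants, for illustration only.
    term = author_query(["Smith J", "Smith JA"])
    print(term)
    print(esearch_url(term))
```

Keeping URL construction separate from the network call makes the query logic easy to check before any documents are requested.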
File Formats
• Documents are available in many formats:
– HTML, XML, PDF, plain text
• Convert to plain text for NLP tool input
– Stripping XML or HTML markup is relatively easy
– ISI is working on PDF Extract to find the correct text flow
– Keep document zoning, other markup
• headings, sections, captions, italics
• Identify source character encoding properly
– XML stores the encoding in the file; others do not
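Stripping markup while keeping document zoning could look roughly like the following. This is a minimal sketch using Python's standard `html.parser`, not the tooling used in the project. The tag-to-zone table and the zone labels (`HEADING`, `CAPTION`, `ITALIC`, `BODY`) are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose text is labelled as a zone; this tag set is an assumption.
ZONE_TAGS = {"h1": "HEADING", "h2": "HEADING", "h3": "HEADING",
             "figcaption": "CAPTION", "i": "ITALIC", "em": "ITALIC"}


class ZonedTextExtractor(HTMLParser):
    """Collect (zone, text) pairs instead of discarding all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []    # open zone labels, innermost last
        self.chunks = []   # (zone, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in ZONE_TAGS:
            self.stack.append(ZONE_TAGS[tag])

    def handle_endtag(self, tag):
        # Simplification: assumes zone tags are properly nested.
        if tag in ZONE_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()  # drops surrounding whitespace; fine for a sketch
        if text:
            zone = self.stack[-1] if self.stack else "BODY"
            self.chunks.append((zone, text))


def extract(html):
    """Return the document text as a list of (zone, text) pairs."""
    parser = ZonedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def plain_text(chunks):
    """Join the chunks into plain text for NLP tool input."""
    return " ".join(text for _, text in chunks)
```

Keeping the zone label next to each chunk lets a later NLP step treat headings, captions, and italics differently, even though the tools themselves only see plain text.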
Character Representation
• Encoding is a mapping from bytes to characters
• Difficult to discern which encoding a file uses
– ASCII, UTF-8, MacRoman, ISO-8859-1, or other?
• Reading a file with the wrong encoding can
produce unreported errors and spurious ‘?’
characters
• Java regular expression classes (\w, \s) don’t match non-ASCII characters by default
• Some characters look like others:
– dash, en dash, minus
– space, em space, non-breaking space
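The points above can be reproduced in a few lines. The sketch below is in Python rather than Java, so the `re.ASCII` flag stands in for the ASCII-only behaviour the slide describes for Java's default `\w`. The sample string and the helper name `decodes_as` are made up for illustration.

```python
import re
import unicodedata

# Sample text with an en dash, a micro sign, and an i with diaeresis.
text = "5 \u2013 10 \u00b5m na\u00efve"
utf8_bytes = text.encode("utf-8")

# 1. Wrong decoder, no error: ISO-8859-1 assigns a character to every
#    byte value, so the mistake passes silently and yields garbage.
mojibake = utf8_bytes.decode("iso-8859-1")

# 2. Lossy conversion: characters missing from the target encoding
#    are replaced with '?'.
lossy = text.encode("ascii", errors="replace").decode("ascii")


def decodes_as(data, encoding):
    """Return True if the bytes decode without error in `encoding`.

    A True result does not prove the encoding is the right one.
    """
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# 3. ASCII-only character classes split words at non-ASCII letters.
ascii_words = re.findall(r"\w+", "na\u00efve", flags=re.ASCII)
unicode_words = re.findall(r"\w+", "na\u00efve")

# 4. Look-alike characters are distinct code points with distinct names.
LOOKALIKES = ["-", "\u2013", "\u2212", " ", "\u2003", "\u00a0"]
lookalike_names = [unicodedata.name(c) for c in LOOKALIKES]

if __name__ == "__main__":
    print(repr(mojibake))
    print(lossy)
    print(ascii_words, unicode_words)
    for char, name in zip(LOOKALIKES, lookalike_names):
        print(f"U+{ord(char):04X} {name}")
```

A strict UTF-8 decode is a useful first check because it rejects most ISO-8859-1 or MacRoman text containing non-ASCII bytes, whereas an ISO-8859-1 decode never fails and so tells you nothing.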
Scaling
• Use a cluster when you need more than a desktop
• Prefer an easy migration from desktop to cluster
• Concurrency (threading) issues are minimized since most NLP processes are independent
• Finding success using Sun/Oracle Grid Engine (SGE) and Network File System (NFS) on a small (48-core) cluster
– NFS shares disks between nodes
– SGE starts and manages processes on the cluster
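Because documents can be processed independently, one simple way to move from desktop to cluster is to have every process pick its own share of the files. The sketch below assumes an SGE array job submitted with `qsub -t 1-N` (step size 1), where each task sees `SGE_TASK_ID` and `SGE_TASK_LAST` in its environment. The function names are hypothetical, and this is not necessarily how the authors' pipeline divides work.

```python
import os


def task_slice(items, task_id, n_tasks):
    """Return the items handled by 1-based task `task_id` out of `n_tasks`.

    Round-robin assignment: every item goes to exactly one task.
    """
    return items[task_id - 1::n_tasks]


def current_task(env=os.environ):
    """Read the array-task position that SGE exports to each task.

    Outside an array job the variables are absent or set to 'undefined',
    so the desktop case falls back to a single task that does everything.
    Assumes the job was submitted as `qsub -t 1-N`, so SGE_TASK_LAST == N.
    """
    task_id = env.get("SGE_TASK_ID", "undefined")
    last = env.get("SGE_TASK_LAST", "undefined")
    if task_id.isdigit() and last.isdigit():
        return int(task_id), int(last)
    return 1, 1


if __name__ == "__main__":
    # Hypothetical file names; on the cluster these would sit on the
    # NFS-shared disk so that every node sees the same list.
    documents = sorted(f"doc{i:03d}.txt" for i in range(10))
    task_id, n_tasks = current_task()
    for path in task_slice(documents, task_id, n_tasks):
        print(f"task {task_id}/{n_tasks} would process {path}")
```

The same script runs unchanged on a desktop, where it processes every file, and no locking is needed because no two tasks ever receive the same document.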
Acknowledgements
• UC Denver
– Helen Johnson
– Tom Christiansen
– Karin Verspoor, NIH grant R01 LM010120-01
– Larry Hunter:
• NIH 2R01LM009254-04
• NIH 2R01LM008111-04A1
• NIH 5R01GM083649-02
• ISI
– Gully Burns, NSF grant #0849977