Post on 10-May-2015
Catalog magic: Behind the Scenes of Creating
a World Catalog of the Therevidae
Gail E. Kampmeier
Illinois Natural History Survey, Prairie Research Institute
University of Illinois at Urbana-Champaign
gkamp@illinois.edu
Irina Brake
National Museum of Natural History, London
Kristin Algmin
University of Illinois at Urbana-Champaign
Why is it so Difficult to get from Here… to… Here?
Therevidae
What Would Taxonomists Rather Be Doing?
What do Taxonomists Wish Would Happen?
1995 Freshmen of NSF PEET*
• Towards a World Monograph of the Therevidae (Insecta: Diptera) – 1995 – 2006
• Therevidae is a medium-sized family with (now)
  – 4 subfamilies
  – ~130 genera
  – ~1150 species
*National Science Foundation's Partnerships for Enhancing Expertise in Taxonomy
Products
• Trained
  – 9 dipterists, 7 through Ph.D.
  – Scientific illustrator
  – Dozens of students in databasing
• Publications
  – 71 publications during grant
  – 20 more since & counting
• Digitization
  – Mandala database 1995-
  – Website
  – Collaborations with DiscoverLife.org & GBIF
…and the world is unlikely to run out of flies to study!
Process: Specimens
• Collect, sort, curate, label, sex, determine, & database specimen information
  – Assign unique identifiers where none exist
• Visit & borrow material from museums
• Examine types
Is All that Work Worthwhile?
• "A taxonomic paper often plants the very seeds of its own obsolescence." (Johnson 2011)
• There is no getting around the work required to produce a catalog or any taxonomic treatment.
• What we can do is make sure that the information is accessible and reusable.
• Is it time to ditch traditional catalogs?
Henicomyia by J. Marie Metz
What Choices Do You Have?
• Last year's symposium on Arthropod Collections databases explored some of your options, but not all are suitable.
  – Online collections database platforms (not suitable for creating taxonomic catalogues that cross collections not included)
    • Arctos
    • Specify 6
  – Online taxonomic database platforms – optimize creation of species pages
    • Species File – taxonomic authority files
    • Scratchpads – community-oriented contributions
    • 3I – online revisions of taxa
    • Encyclopedia of Life – Expert LifeDesks
What Choices Do You Have?
• Last year's symposium on Arthropod Collections databases explored some of your options, but not all are suitable.
  – Online platforms designed to parse or take parsed data & repurpose it (incl. online taxonomic database platforms above)
    • GBIF's Integrated Publishing Toolkit (IPT) – not thought of as a workbench-level tool
    • LUCID – especially good for keys & descriptive data
    • Biodiversity Data Journal – will take in parsed data from Scratchpads and IPT & eventually databases (mechanism unclear)
  – Desktop or server-based platforms – usually in Filemaker or 4D or MSAccess
    • Mandala – http://www.inhs.illinois.edu/research/mandala/
    • Biota – http://viceroy.eeb.uconn.edu/biota/
    • Mantis – http://insects.oeb.harvard.edu/etypes/Downloads.htm
The Process: Decide on a Format
• It was decided to publish as a traditional Myia catalog
• Expectations about what is in a "traditional catalog" or taxonomic treatment & how it should be formatted
  – Print styles (italics, bold, centered, hanging indents)
  – Accented characters (for literature references, authority names, and localities)
  – Special characters (for ♂ and ♀ signs)
  – Notes kept with the taxon entry or as an appendix?
• Use Mandala to achieve retrievability & formatting of output
General Workflow: Therevid Mandala Database
• Input raw data: The Bulk of the Work is HERE!
• Link data in related tables
• Create fields for catalog output for
  – Taxa & their history
  – Literature (including disambiguation of similar citations)
  – List of countries (& selected states/provinces) by biogeographic region for valid taxa
• Create & number notes for listing in appendix
• Create a script that finds data to be exported
• Create scripts to format data including styles (bold, italics, codes for paragraph formatting)
• Export TaxonID & catalog output field only to Filemaker Pro to isolate output & preserve formatting including accented characters
[Workflow diagram: Mandala production db → Catalog Output to new FMP db → Acrobat → MSWord Catalog]
Things Can Get Messy
• Some operations require expert eyes to determine fitness-for-use
• A database can find, sort, & summarize, but ultimately does not "see" anomalies unless specifically programmed to do so
• Automation (scripting, creation of calculated fields) requires time, refinement, & expertise
• Parsed data are key to flexibility
Create Taxonomic Hierarchy
Use to automate searches & sort catalog output by classification hierarchy, rank, & alphabetically
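The sort described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Filemaker script used in Mandala: the `Taxon` class, the `RANK_ORDER` table, and the record IDs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical rank order; extend it for tribes, subgenera, etc.
RANK_ORDER = {"family": 0, "subfamily": 1, "genus": 2, "species": 3}


@dataclass
class Taxon:
    taxon_id: int
    name: str
    rank: str
    parent_id: Optional[int] = None  # None marks the top of the hierarchy


def catalog_order(taxa):
    """Return taxa parent-first; siblings sorted by rank, then by name."""
    children = {}
    for taxon in taxa:
        children.setdefault(taxon.parent_id, []).append(taxon)

    ordered = []

    def visit(parent_id):
        siblings = sorted(children.get(parent_id, []),
                          key=lambda t: (RANK_ORDER[t.rank], t.name))
        for taxon in siblings:
            ordered.append(taxon)
            visit(taxon.taxon_id)  # descend before moving to the next sibling

    visit(None)
    return ordered


# Illustrative records only; the IDs are made up.
taxa = [
    Taxon(3, "Thereva", "genus", parent_id=2),
    Taxon(2, "Therevinae", "subfamily", parent_id=1),
    Taxon(5, "Phycus", "genus", parent_id=4),
    Taxon(1, "Therevidae", "family"),
    Taxon(4, "Phycinae", "subfamily", parent_id=1),
]
print([t.name for t in catalog_order(taxa)])
```

Storing only a parent pointer per taxon keeps data entry simple; the catalog order is derived at output time rather than typed in.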
Use Reason for Status to Dictate Formatting
We Used the Specimen* Table to Define our Distribution
*based on 105,889 specimens with valid names & parsed localities
Script to Find & Sort Specimens
• Once sorted, export a summary for each taxon
• Summary can then be formatted in MSWord
• Bring back into Filemaker for final formatting
• Spot possible outliers
• Match TaxonID to import formatted information into production db
TaxonID x Biogeographic Region x Country x State/Province
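A rough sketch of the TaxonID x Biogeographic Region x Country x State/Province summary, assuming each specimen record has already been parsed into those four fields. The field names, the output punctuation, and the sample records are assumptions for illustration; they are not taken from Mandala.

```python
from collections import defaultdict


def summarize_distribution(specimens):
    """Group parsed localities as taxon -> region -> country -> set of states."""
    summary = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for s in specimens:
        # Looking up the country creates its entry even when no state is given.
        states = summary[s["taxon_id"]][s["region"]][s["country"]]
        if s.get("state"):
            states.add(s["state"])
    return summary


def format_distribution(regions):
    """Render one taxon's distribution as a single catalog line."""
    parts = []
    for region in sorted(regions):
        countries = []
        for country in sorted(regions[region]):
            states = sorted(regions[region][country])
            if states:
                countries.append(f"{country} ({', '.join(states)})")
            else:
                countries.append(country)
        parts.append(f"{region}: {', '.join(countries)}")
    return "; ".join(parts)


# Made-up specimen records keyed by a hypothetical TaxonID.
specimens = [
    {"taxon_id": 101, "region": "Nearctic", "country": "USA", "state": "Illinois"},
    {"taxon_id": 101, "region": "Nearctic", "country": "USA", "state": "California"},
    {"taxon_id": 101, "region": "Nearctic", "country": "Canada", "state": None},
    {"taxon_id": 101, "region": "Neotropical", "country": "Mexico", "state": None},
    {"taxon_id": 102, "region": "Australasian", "country": "Australia", "state": "Queensland"},
]
summary = summarize_distribution(specimens)
print(format_distribution(summary[101]))
```

A summary like this still needs expert review: a single mislabeled specimen shows up as a whole extra country, which is why spotting outliers is a separate step.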
Filling in the Cracks
• All taxa, literature, and specimens to be included in the catalog were marked by an expert with a code for easier retrieval
• Communication about scripts & field calculations was done in Google Docs
• Literature with the same authors and years had to be disambiguated with letters following the year.
  – Used in both the literature cited and text of the catalog
• After including the notes in the text flow, it was decided by the authors to number and put them into an appendix.
  – Finding & sorting of these could be automated
  – Replace with series allowed numbering of notes
  – Awkward (but necessary) to renumber notes when new ones were found to be needed.
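The year-letter disambiguation of citations mentioned above could look roughly like this in Python. It is a sketch under stated assumptions: the field names are invented, the references are placeholders, and ordering the letters by title is an arbitrary choice (a real catalog might order by publication date instead).

```python
import string
from collections import defaultdict


def disambiguate(refs):
    """Return one citation label per reference, in input order.

    References sharing authors and year get a letter after the year.
    Handles at most 26 references per author-year group.
    """
    groups = defaultdict(list)
    for index, ref in enumerate(refs):
        groups[(ref["authors"], ref["year"])].append(index)

    labels = [None] * len(refs)
    for (authors, year), indexes in groups.items():
        if len(indexes) == 1:
            labels[indexes[0]] = f"{authors} {year}"
            continue
        # Assumption: letters are assigned in title order.
        by_title = sorted(indexes, key=lambda i: refs[i]["title"])
        for letter, i in zip(string.ascii_lowercase, by_title):
            labels[i] = f"{authors} {year}{letter}"
    return labels


# Placeholder references, not real publications.
refs = [
    {"authors": "Doe", "year": 1999, "title": "Second paper"},
    {"authors": "Doe", "year": 1999, "title": "First paper"},
    {"authors": "Roe", "year": 2001, "title": "Only paper"},
]
print(disambiguate(refs))
```

Because the same label is used in both the literature cited and the catalog text, it is worth generating it once and storing it with the reference rather than recomputing it in two places.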
General Workflow
• TaxonID is for reference only
• Resize catalog output field (in layout mode) so all contents will always be seen (page size) & make sure to size the field to fit the contents
• Open in Preview to check
• Save as PDF
[Workflow diagram: Mandala production db → Catalog Output to new FMP db]
General Workflow
• This step mainly preserves catalog text styles & accented characters out of FMP
• Save As MS Word document after verifying expected results.
• Saving as Word will collapse the formatting into giant paragraphs
[Workflow diagram: Mandala production db → Catalog Output to new FMP db → Acrobat]
General Workflow
[Workflow diagram: Mandala production db → Catalog Output to new FMP db → Acrobat → MSWord Catalog]
• Create styles in MSWord for formatting text & paragraphs
• Search & replace special characters (%%, $$, zzz, ||, //) and ♂ and ♀ signs
• Clean up extra spaces, paragraphs, & punctuation
• Using Google Docs is not (yet) an option for a traditionally published catalog as the formatting tools aren't adequate
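The search-and-replace and clean-up steps above were done in MSWord; the same idea can be sketched in Python. The slides list the placeholder codes but not what each one stood for, so the mapping in `PLACEHOLDERS` below is purely hypothetical, as are the function name and the clean-up rules.

```python
import re

# Hypothetical mapping: the meaning of each code is NOT given in the slides.
PLACEHOLDERS = {
    "%%": "\u2642",  # assumed here to stand for the male sign
    "$$": "\u2640",  # assumed here to stand for the female sign
    "||": "\n",      # assumed here to stand for a paragraph break
}


def clean_catalog_text(text, placeholders=PLACEHOLDERS):
    """Swap placeholder codes for real characters, then tidy whitespace."""
    for code, replacement in placeholders.items():
        text = text.replace(code, replacement)
    text = re.sub(r"[ \t]{2,}", " ", text)     # collapse runs of spaces
    text = re.sub(r" +([,.;:])", r"\1", text)  # drop space before punctuation
    text = re.sub(r"\n{2,}", "\n", text)       # collapse empty paragraphs
    return text.strip()


print(clean_catalog_text("Holotype %%  , paratype $$ .||||Next entry"))
```

Plain-text codes survive export between programs where styled or special characters may not, which is presumably why they were carried through to the last step.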
Send Out to Experts
Consensus!
• When the experts are happy, we're done, right?
• Still have to update the database & web output online – complements printed catalog as it is dynamic
• Push corrections to public portals of data (own website, DiscoverLife, GBIF, etc.)
• So "magic" is a relative, kind of wishful term—the future is more likely in platforms such as those being coordinated by Pensoft.
References, Resources
• Miller, J. et al. 2012. From taxonomic literature to cybertaxonomic content. BMC Biology 10:87 http://www.biomedcentral.com/content/pdf/1741-7007-10-87.pdf
• Johnson, N.F. 2011. A collaborative, integrated and electronic future for taxonomy. Invertebrate Systematics 25: 471–475. http://www.publish.csiro.au/?act=view_file&file_id=IS11052.pdf
• Biodiversity Data Journal (publication debut Dec. 2012) http://www.pensoft.net/journals/bdj
• Symposium: Arthropod Collections Databases. 2011 ECN meeting, Reno, NV http://www.ecnweb.org/past/2011
• Darwin Core Standard http://rs.tdwg.org/dwc/
• Kampmeier, G. E. and M. E. Irwin. 2009. Meeting the interrelated challenges of tracking specimen, nomenclature, and literature data in Mandala. Chapter 15 in T. Pape, D. Bickel, and R. Meier (eds.) Diptera Diversity: Status, Challenges and Tools. Leiden: Brill Academic Publishers, pp. 407-437. http://www.inhs.illinois.edu/research/mandala/Ch15_Mandala_DiptDiv2009.pdf
More Refs & Resources
• Kennedy, J., R. Hyam, R. Kukla, T. Paterson. 2006. Standard data model representation for taxonomic information. OMICS: A Journal of Integrative Biology 10(2): 220-230. http://www.hyam.net/publications/omi.2006.10.220.pdf
• Penev, L., T. Georgiev, P. Stoev, D. Roberts, V. Smith. 2012. Making small data big! The Biodiversity Data Journal (BDJ). TDWG 2012, Beijing, 22-26 October. http://www.tdwg.org/fileadmin/2012conference/slides/Biodiversity_Data_Journal.pdf
• Catalog of Life http://www.catalogueoflife.org/colwebsite/sites/default/files/2012_CoL-Standard_Dataset_v6_3.pdf
Acknowledgements
• Michael E. Irwin
• F. Chris Thompson
• Neal Evenhuis
• Christine Lambkin
• Shaun Winterton
• Don Webb
• Mark Metz
• Martin Hauser
• Kevin Holston
• Steve Gaimari
• J. Marie Metz
• David Yeates
• Amanda Buck
• Brian Wiegmann
• Evert Schlinger
• John Pickering
• FMWebschool
• National Science Foundation
• Schlinger Foundation
• Illinois Natural History Survey
• University of Illinois
• Discover Life
• Biodiversity Information Standards (TDWG)
NSF Projects:
Therevid PEET: DEB-95-21925; 99-77958
Fiji Arthropod Survey: DEB-0425790
FLYTREE: EF-0334948
Tabanid PEET: DEB 07-31528
©2012 University of Illinois Board of Trustees. All rights reserved. For permission information,
contact the Illinois Natural History Survey.
References to commercial products are for informational purposes only and do not imply endorsement.
Appendix
Additional information for the curious: slides jettisoned for time
Why Use A Database?
• Flexibility
  – Finely parsed data may be pieced together for publication, labels
  – Scripting of often used functions
• Reuse/repurposing of data
  – Sharing with GBIF, DiscoverLife.org, museums
• Centralization of work environment
  – Workers can be anywhere, any time zone
  – Backup can be automated
• Individual work environment
  – Choice with platforms not required to be online (although trade-off)
Vision
• "Taxonomy should fully embrace electronic media and informatics tools. Particularly, this step requires the development and widespread implementation of community data standards. The barriers to progress in these areas are not technological, but are primarily social. The community needs to see clear evidence of the value added through these changes in procedures and insist upon their use as standard practice."
Johnson, N.F. 2011. A collaborative, integrated and electronic future for taxonomy. Invertebrate Systematics 25: 471.
Any Database Can Record the Basics, but…
• How the information is related is also key
  – defining taxonomic ranks as parent-child relationship
  – valid taxonomic entities related to their synonyms
  – types and specimens determined for a taxon
  – literature associated with a taxonomic name
  – collecting localities and collecting events
• Readability – if a published work rather than raw database output
• Format
  – Based on existing print models?
  – Print styles (italics, bold, centered, hanging indents)
  – Accented characters (for literature references, authority names, and localities)
  – Special characters (for ♂ and ♀ signs)
  – Notes kept with the taxon entry or as an appendix?
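To make the relationships listed on this slide concrete, here is a minimal relational sketch using Python's built-in `sqlite3`. It is not the Mandala schema: the table names, column names, and sample rows are assumptions chosen to show parent-child ranks, synonyms pointing at a valid name, and specimens (including types) determined for a taxon.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (
    taxon_id   INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    taxon_rank TEXT NOT NULL,
    parent_id  INTEGER REFERENCES taxon(taxon_id),  -- parent-child ranks
    valid_id   INTEGER REFERENCES taxon(taxon_id)   -- NULL when the name is valid
);
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    taxon_id    INTEGER NOT NULL REFERENCES taxon(taxon_id),
    is_type     INTEGER NOT NULL DEFAULT 0
);
""")

# Placeholder names; "Genus b" is entered as a synonym of "Genus a".
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?, ?, ?)", [
    (1, "Family X", "family", None, None),
    (2, "Genus a", "genus", 1, None),
    (3, "Genus b", "genus", 1, 2),
])
conn.executemany("INSERT INTO specimen VALUES (?, ?, ?)", [
    (10, 2, 1), (11, 2, 0), (12, 3, 1),
])


def synonyms_of(conn, taxon_id):
    """Names that point at the given valid taxon."""
    rows = conn.execute(
        "SELECT name FROM taxon WHERE valid_id = ? ORDER BY name", (taxon_id,))
    return [row[0] for row in rows]


def specimen_count(conn, taxon_id):
    """Specimens determined under the valid name or any of its synonyms."""
    row = conn.execute(
        """SELECT COUNT(*) FROM specimen s
           JOIN taxon t ON t.taxon_id = s.taxon_id
           WHERE t.taxon_id = ? OR t.valid_id = ?""",
        (taxon_id, taxon_id)).fetchone()
    return row[0]


print(synonyms_of(conn, 2), specimen_count(conn, 2))
```

Keeping synonyms as rows that point at a valid name, rather than as free text, is what lets a query pull together every specimen ever determined under any name for the taxon.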
Mandala Data Model
• Not all of this is required for a traditional catalog, but these tables contain a wealth of vital, interrelated data.
• Tables with rounded edges are authority files
Use the Classification Hierarchy to Automate Searches
Reason for Status Used for Formatting