Kampmeier ecn 2012

Catalog magic: Behind the Scenes of Creating a World Catalog of the Therevidae Gail E. Kampmeier Illinois Natural History Survey, Prairie Research Institute University of Illinois at Urbana-Champaign [email protected] Irina Brake National Museum of Natural History, London Kristin Algmin University of Illinois at Urbana-Champaign

Transcript of Kampmeier ecn 2012

Page 1: Kampmeier ecn 2012

Catalog magic: Behind the Scenes of Creating a World Catalog of the Therevidae

Gail E. Kampmeier
Illinois Natural History Survey, Prairie Research Institute
University of Illinois at Urbana-Champaign
[email protected]

Irina Brake
National Museum of Natural History, London

Kristin Algmin
University of Illinois at Urbana-Champaign

Page 2: Kampmeier ecn 2012

Why is it so Difficult to get from Here… to… Here?

Therevidae

Page 3: Kampmeier ecn 2012

What Would Taxonomists Rather Be Doing?

Page 4: Kampmeier ecn 2012

What do Taxonomists Wish Would Happen?

Page 5: Kampmeier ecn 2012

1995 Freshmen of NSF PEET*

• Towards a World Monograph of the Therevidae (Insecta: Diptera)
– 1995–2006

• Therevidae is a medium-sized family with (now)
– 4 subfamilies
– ~130 genera
– ~1150 species

*National Science Foundation's Partnerships for Enhancing Expertise in Taxonomy

Page 6: Kampmeier ecn 2012

Products

• Trained
– 9 dipterists, 7 through Ph.D.
– Scientific illustrator
– Dozens of students in databasing

• Publications
– 71 publications during grant
– 20 more since & counting

• Digitization
– Mandala database 1995-
– Website
– Collaborations with DiscoverLife.org & GBIF

…and the world is unlikely to run out of flies to study!

Page 7: Kampmeier ecn 2012

Process: Specimens

• Collect, sort, curate, label, sex, determine, & database specimen information
– Assign unique identifiers where none exist

• Visit & borrow material from museums

• Examine types

Page 8: Kampmeier ecn 2012

Is All that Work Worthwhile?

• "A taxonomic paper often plants the very seeds of its own obsolescence." (Johnson 2011)

• There is no getting around the work required to produce a catalog or any taxonomic treatment.

• What we can do is make sure that the information is accessible and reusable.

• Is it time to ditch traditional catalogs?

Henicomyia by J. Marie Metz

Page 9: Kampmeier ecn 2012

What Choices Do You Have?

• Last year's symposium on Arthropod Collections databases explored some of your options, but not all are suitable.
– Online collections database platforms (not suitable for creating taxonomic catalogues that cross collections not included)
  • Arctos
  • Specify 6
– Online taxonomic database platforms – optimize creation of species pages
  • Species File – taxonomic authority files
  • Scratchpads – community-oriented contributions
  • 3I – online revisions of taxa
  • Encyclopedia of Life – Expert LifeDesks

Page 10: Kampmeier ecn 2012

What Choices Do You Have?

• Last year's symposium on Arthropod Collections databases explored some of your options, but not all are suitable.
– Online platforms designed to parse or take parsed data & repurpose it (incl. online taxonomic database platforms above)
  • GBIF's Integrated Publishing Toolkit (IPT) – not thought of as a workbench-level tool
  • LUCID – especially good for keys & descriptive data
  • Biodiversity Informatics Journal – will take in parsed data from Scratchpads and IPT & eventually databases (mechanism unclear)
– Desktop or server-based platforms – usually in Filemaker or 4D or MSAccess
  • Mandala – http://www.inhs.illinois.edu/research/mandala/
  • Biota – http://viceroy.eeb.uconn.edu/biota/
  • Mantis – http://insects.oeb.harvard.edu/etypes/Downloads.htm

Page 11: Kampmeier ecn 2012

The Process: Decide on a Format

• We decided to publish as a traditional Myia catalog

• Expectations about what is in a "traditional catalog" or taxonomic treatment & how it should be formatted
– Print styles (italics, bold, centered, hanging indents)
– Accented characters (for literature references, authority names, and localities)
– Special characters (for ♂ and ♀ signs)
– Notes kept with the taxon entry or as an appendix?

• Use Mandala to achieve retrievability & formatting of output

Page 12: Kampmeier ecn 2012

General Workflow: Therevid Mandala Database

• Input raw data: The Bulk of the Work is HERE!
• Link data in related tables
• Create fields for catalog output for
– Taxa & their history
– Literature (including disambiguation of similar citations)
– List of countries (& selected states/provinces) by biogeographic region for valid taxa
– Create & number notes for listing in appendix
• Create a script that finds data to be exported
• Create scripts to format data including styles (bold, italics, codes for paragraph formatting)
• Export TaxonID & catalog output field only to Filemaker Pro to isolate output & preserve formatting including accented characters

Workflow diagram: Mandala production db → Catalog Output to new FMP db → Acrobat → MSWord Catalog

Page 13: Kampmeier ecn 2012

Things Can Get Messy

• Some operations require expert eyes to determine fitness-for-use

• A database can find, sort, & summarize, but ultimately does not "see" anomalies unless specifically programmed to do so

• Automation (scripting, creation of calculated fields) requires time, refinement, & expertise

• Parsed data are key to flexibility

Page 14: Kampmeier ecn 2012

Create Taxonomic Hierarchy

Use to automate searches & sort catalog output by classification hierarchy, rank, & alphabetically
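The sort described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Mandala (FileMaker) scripts themselves: the `Taxon` record, the `RANK_ORDER` table, and the placeholder taxon names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical rank order; the ranks stored in the real database may differ.
RANK_ORDER = {"family": 0, "subfamily": 1, "genus": 2, "species": 3}


@dataclass
class Taxon:
    taxon_id: int
    name: str
    rank: str
    parent_id: Optional[int]  # None marks the root of the hierarchy


def catalog_order(taxa):
    """Return taxa depth-first: each parent, then its children by rank and name."""
    children = {}
    for t in taxa:
        children.setdefault(t.parent_id, []).append(t)

    ordered = []

    def visit(parent_id):
        kids = sorted(children.get(parent_id, []),
                      key=lambda t: (RANK_ORDER[t.rank], t.name))
        for t in kids:
            ordered.append(t)
            visit(t.taxon_id)  # descend before moving to the next sibling

    visit(None)
    return ordered


# Placeholder names, entered deliberately out of order.
taxa = [
    Taxon(1, "Therevidae", "family", None),
    Taxon(2, "Subfamily-B", "subfamily", 1),
    Taxon(3, "Subfamily-A", "subfamily", 1),
    Taxon(4, "Zus", "genus", 3),
    Taxon(5, "Aus", "genus", 3),
    Taxon(6, "Mus", "genus", 2),
]

for t in catalog_order(taxa):
    print("  " * RANK_ORDER[t.rank] + t.name)
```

Storing only a parent link per taxon is enough to recover the full catalog order, which is why the hierarchy table can drive both searches and output sorting.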

Page 15: Kampmeier ecn 2012

Use Reason for Status to Dictate Formatting

Page 16: Kampmeier ecn 2012

We Used the Specimen* Table to Define our Distribution

*based on 105,889 specimens with valid names & parsed localities

Page 17: Kampmeier ecn 2012

Script to Find & Sort Specimens

• Once sorted, export a summary for each taxon

Page 18: Kampmeier ecn 2012

• Summary can then be formatted in MSWord

• Bring back into Filemaker for final formatting

• Spot possible outliers

• Match TaxonID to import formatted information into production db

TaxonID x Biogeographic Region x Country x State/Province
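A minimal Python sketch of the TaxonID x Biogeographic Region x Country x State/Province summary follows. It assumes specimen records are available as dictionaries with hypothetical keys `taxon_id`, `region`, `country`, and `state`; the actual work was done with FileMaker scripts and MSWord, so this only illustrates the grouping idea.

```python
from collections import defaultdict


def summarize_distribution(specimens):
    """Group records into {taxon_id: {region: {country: set(states)}}}."""
    summary = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for rec in specimens:
        # Indexing creates the country entry even when no state is recorded.
        states = summary[rec["taxon_id"]][rec["region"]][rec["country"]]
        if rec.get("state"):
            states.add(rec["state"])
    return summary


def format_distribution(summary, taxon_id):
    """Render one taxon's distribution as a single sorted line of text."""
    parts = []
    for region in sorted(summary[taxon_id]):
        countries = []
        for country in sorted(summary[taxon_id][region]):
            states = sorted(summary[taxon_id][region][country])
            if states:
                countries.append(f"{country} ({', '.join(states)})")
            else:
                countries.append(country)
        parts.append(f"{region}: {', '.join(countries)}")
    return "; ".join(parts)


# Invented example records for a hypothetical TaxonID 101.
specimens = [
    {"taxon_id": 101, "region": "Nearctic", "country": "USA", "state": "Illinois"},
    {"taxon_id": 101, "region": "Nearctic", "country": "USA", "state": "California"},
    {"taxon_id": 101, "region": "Nearctic", "country": "Canada", "state": None},
    {"taxon_id": 101, "region": "Neotropical", "country": "Mexico", "state": None},
]

print(format_distribution(summarize_distribution(specimens), 101))
```

A summary like this still needs an expert's eye: a single mislabeled specimen shows up as a plausible-looking country in the output.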

Page 19: Kampmeier ecn 2012

Filling in the Cracks

• All taxa, literature, and specimens to be included in the catalog were marked by an expert with a code for easier retrieval

• Communication about scripts & field calculations was done in Google Docs

• Literature with the same authors and years had to be disambiguated with letters following the year.
– Used in both the literature cited and text of the catalog

• After including the notes in the text flow, it was decided by the authors to number and put them into an appendix.
– Finding & sorting of these could be automated
– Replace with series allowed numbering of notes
– Awkward (but necessary) to renumber notes when new ones were found to be needed.
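The letter suffixes for same-author, same-year literature could be assigned along the following lines. This is a hedged sketch, not the authors' FileMaker calculation: the record keys (`ref_id`, `authors`, `year`, `title`) are hypothetical, and ordering the letters by title is an assumption, since the slides do not say how the order within a year was chosen.

```python
from collections import defaultdict
from string import ascii_lowercase


def disambiguate(citations):
    """Map each ref_id to a label such as 'Smith 2001a'."""
    groups = defaultdict(list)
    for c in citations:
        groups[(c["authors"], c["year"])].append(c)

    labels = {}
    for (authors, year), group in groups.items():
        if len(group) == 1:
            # A unique author/year pair needs no suffix.
            labels[group[0]["ref_id"]] = f"{authors} {year}"
            continue
        if len(group) > len(ascii_lowercase):
            raise ValueError("more than 26 citations share one author/year")
        # Assumed ordering: alphabetical by title within the group.
        for letter, c in zip(ascii_lowercase, sorted(group, key=lambda c: c["title"])):
            labels[c["ref_id"]] = f"{authors} {year}{letter}"
    return labels


# Invented placeholder citations.
citations = [
    {"ref_id": 1, "authors": "Smith", "year": 2001, "title": "Beta"},
    {"ref_id": 2, "authors": "Smith", "year": 2001, "title": "Alpha"},
    {"ref_id": 3, "authors": "Jones", "year": 1999, "title": "Gamma"},
]

print(disambiguate(citations))
```

Because the same label is used in the literature cited and in the catalog text, generating it once and storing it with the reference keeps the two in step.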

Page 20: Kampmeier ecn 2012

General Workflow

• TaxonID is for reference only
• Resize catalog output field (in layout mode) so all contents will always be seen (page size) & make sure to size the field to fit the contents
• Open in Preview to check
• Save as PDF

Workflow diagram: Mandala production db → Catalog Output to new FMP db

Page 21: Kampmeier ecn 2012

General Workflow

• This step mainly preserves catalog text styles & accented characters out of FMP
• Save As MS Word document after verifying expected results.
• Saving as Word will collapse the formatting into giant paragraphs

Workflow diagram: Mandala production db → Catalog Output to new FMP db → Acrobat

Page 22: Kampmeier ecn 2012

General Workflow

Workflow diagram: Mandala production db → Catalog Output to new FMP db → Acrobat → MSWord Catalog

• Create styles in MSWord for formatting text & paragraphs
• Search & replace special characters (%%, $$, zzz, ||, //); ♂ and ♀ signs
• Clean up extra spaces, paragraphs, & punctuation
• Using Google Docs is not (yet) an option for a traditionally published catalog as the formatting tools aren't adequate
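The search-and-replace step can be illustrated with a short Python sketch. The slides list the placeholder codes but not what each one stood for, so the meanings below (`%%` for italics, `$$` for bold, `zzz` for a paragraph break) are purely hypothetical, and HTML-style tags stand in for the MSWord styles that were actually applied.

```python
import re

# Hypothetical meanings: the slides do not say what each code represents.
PAIRED_CODES = {"%%": "i", "$$": "b"}  # assumed: codes wrapping italic / bold runs
PARAGRAPH_CODE = "zzz"                 # assumed: marks a paragraph break


def expand_codes(text):
    """Replace placeholder codes with markup, then tidy up whitespace."""
    for code, tag in PAIRED_CODES.items():
        pattern = re.escape(code) + r"(.+?)" + re.escape(code)
        text = re.sub(pattern, rf"<{tag}>\1</{tag}>", text)
    text = text.replace(PARAGRAPH_CODE, "\n")
    # Clean-up pass: collapse runs of spaces and trim spaces around line breaks.
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    return text.strip()


# Invented example of exported catalog text.
raw = "%%Aus bus%%  Author, 1900 zzz $$Valid name$$  "
print(expand_codes(raw))
```

Plain-text codes like these survive the FileMaker to PDF to Word round trip, which is presumably why they were used to carry formatting that the export would otherwise flatten.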

Page 23: Kampmeier ecn 2012

Send Out to Experts

Page 24: Kampmeier ecn 2012

Consensus!

• When the experts are happy, we're done, right?

• Still have to update the database & web output online – complements printed catalog as it is dynamic

• Push corrections to public portals of data (own website, DiscoverLife, GBIF, etc.)

• So "magic" is a relative, kind of wishful term—the future is more likely in platforms such as those being coordinated by Pensoft.

Page 25: Kampmeier ecn 2012

References, Resources

• Miller, J. et al. 2012. From taxonomic literature to cybertaxonomic content. BMC Biology 10:87. http://www.biomedcentral.com/content/pdf/1741-7007-10-87.pdf

• Johnson, N.F. 2011. A collaborative, integrated and electronic future for taxonomy. Invertebrate Systematics 25: 471–475. http://www.publish.csiro.au/?act=view_file&file_id=IS11052.pdf

• Biodiversity Data Journal (publication debut Dec. 2012) http://www.pensoft.net/journals/bdj

• Symposium: Arthropod Collections Databases. 2011 ECN meeting, Reno, NV http://www.ecnweb.org/past/2011

• Darwin Core Standard http://rs.tdwg.org/dwc/

• Kampmeier, G. E. and M. E. Irwin. 2009. Meeting the interrelated challenges of tracking specimen, nomenclature, and literature data in Mandala. Chapter 15 in T. Pape, D. Bickel, and R. Meier (eds.) Diptera Diversity: Status, Challenges and Tools. Leiden: Brill Academic Publishers, pp. 407-437. http://www.inhs.illinois.edu/research/mandala/Ch15_Mandala_DiptDiv2009.pdf

Page 26: Kampmeier ecn 2012

More Refs & Resources

• Kennedy, J., R. Hyam, R. Kukla, T. Paterson. 2006. Standard data model representation for taxonomic information. OMICS: A Journal of Integrative Biology 10(2): 220-230. http://www.hyam.net/publications/omi.2006.10.220.pdf

• Penev, L., T. Georgiev, P. Stoev, D. Roberts, V. Smith. 2012. Making small data big! The Biodiversity Data Journal (BDJ). TDWG 2012, Beijing, 22-26 October. http://www.tdwg.org/fileadmin/2012conference/slides/Biodiversity_Data_Journal.pdf

• Catalogue of Life http://www.catalogueoflife.org/colwebsite/sites/default/files/2012_CoL-Standard_Dataset_v6_3.pdf

Page 27: Kampmeier ecn 2012

Acknowledgements

• Michael E. Irwin
• F. Chris Thompson
• Neal Evenhuis
• Christine Lambkin
• Shaun Winterton
• Don Webb
• Mark Metz
• Martin Hauser
• Kevin Holston
• Steve Gaimari
• J. Marie Metz
• David Yeates
• Amanda Buck
• Brian Wiegmann
• Evert Schlinger
• John Pickering
• FMWebschool
• National Science Foundation
• Schlinger Foundation
• Illinois Natural History Survey
• University of Illinois
• Discover Life
• Biodiversity Information Standards (TDWG)

NSF Projects:

Therevid PEET: DEB-95-21925; 99-77958

Fiji Arthropod Survey: DEB-0425790

FLYTREE: EF-0334948

Tabanid PEET: DEB 07-31528

Page 28: Kampmeier ecn 2012

©2012 University of Illinois Board of Trustees. All rights reserved. For permission information, contact the Illinois Natural History Survey.

References to commercial products are for informational purposes only and do not imply endorsement.

Page 29: Kampmeier ecn 2012

Appendix

Additional information for the curious: slides jettisoned for time

Page 30: Kampmeier ecn 2012

Why Use A Database?

• Flexibility
– Finely parsed data may be pieced together for publication, labels
– Scripting of often used functions

• Reuse/repurposing of data
– Sharing with GBIF, DiscoverLife.org, museums

• Centralization of work environment
– Workers can be anywhere, any time zone
– Backup can be automated

• Individual work environment
– Choice with platforms not required to be online (although trade-off)

Page 31: Kampmeier ecn 2012

Vision

• "Taxonomy should fully embrace electronic media and informatics tools. Particularly, this step requires the development and widespread implementation of community data standards. The barriers to progress in these areas are not technological, but are primarily social. The community needs to see clear evidence of the value added through these changes in procedures and insist upon their use as standard practice."

Johnson, N.F. 2011. A collaborative, integrated and electronic future for taxonomy. Invertebrate Systematics 25: 471.

Page 32: Kampmeier ecn 2012

Any Database Can Record the Basics, but…

• How the information is related is also key
– defining taxonomic ranks as parent-child relationship
– valid taxonomic entities related to their synonyms
– types and specimens determined for a taxon
– literature associated with a taxonomic name
– collecting localities and collecting events

• Readability – if a published work rather than raw database output

• Format
– Based on existing print models?
– Print styles (italics, bold, centered, hanging indents)
– Accented characters (for literature references, authority names, and localities)
– Special characters (for ♂ and ♀ signs)
– Notes kept with the taxon entry or as an appendix?
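Two of the relationships listed on this slide, parent-child ranks and valid names linked to their synonyms, can be sketched as a single self-referencing record. This is an assumption-laden illustration in Python, not the Mandala schema: the `Name` class, its field names, and the placeholder names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Name:
    taxon_id: int
    name: str
    parent_id: Optional[int] = None  # parent-child link between ranks
    valid_id: Optional[int] = None   # set only on a synonym: points to the valid name


def synonyms_of(names, valid_taxon_id):
    """Return the names recorded as synonyms of one valid taxon, sorted."""
    return sorted(n.name for n in names if n.valid_id == valid_taxon_id)


# Placeholder names: one genus, one valid species, two synonyms of that species.
names = [
    Name(1, "Aus"),
    Name(2, "Aus bus", parent_id=1),
    Name(3, "Aus cus", parent_id=1, valid_id=2),
    Name(4, "Xus dus", valid_id=2),
]

print(synonyms_of(names, 2))
```

Keeping both links on the name record means a catalog entry for a valid taxon can be assembled by following identifiers rather than by matching text.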

Page 33: Kampmeier ecn 2012

Mandala Data Model

• Not all of this is required for a traditional catalog, but these tables contain a wealth of vital, interrelated data.

• Tables with rounded edges are authority files

Page 34: Kampmeier ecn 2012

Use the Classification Hierarchy to Automate Searches

Page 35: Kampmeier ecn 2012

Reason for Status Used for Formatting