Franz 2017 sols cbs seminar the limits of synthesis for integrative biology

The limits of synthesis for integrative biology Nico Franz School of Life Sciences, Arizona State University Center of Biology + Society Conversation Series October 11, 2017 School of Life Sciences, ASU @ http://www.slideshare.net/taxonbytes/franz-2017-sols-cbs-seminar-the-limits-of-synthesis-for-integrative-biology

The limits of synthesis

for integrative biology

Nico Franz

School of Life Sciences, Arizona State University

Center of Biology + Society Conversation Series

October 11, 2017 – School of Life Sciences, ASU

@ http://www.slideshare.net/taxonbytes/franz-2017-sols-cbs-seminar-the-limits-of-synthesis-for-integrative-biology

Premise: The notion of synthesis is appealing

doi:10.1371/journal.pbio.1001468

https://www.nsf.gov/funding/index.jsp

Implementation (in systematics): Synthesis = one view (at a time)

• Example: The Open Tree of Life project

doi:10.1073/pnas.1423041112

• Example: The Global Biodiversity Information Facility (GBIF)

https://www.slideshare.net/mdoering/gbif-checklist-bank-and-the-backbone

• "It is updated regularly through an automated process in which the Catalogue of Life acts as a starting point also providing the complete higher classification above families. The following 54 sources have been used to assemble the GBIF backbone: …"

doi:10.5072/hufs9m

Initial questions – How to integrate biological data?

• Does synthesis necessarily mean one view?

⇒ No. Most generally: "The combination of components or elements to form a connected whole" (~ Oxford).

• Is equating synthesis with one hierarchy empirically and socially adequate, or desirable?

⇒ Likely not if novel or conflicting views are thereby somehow suppressed.

• What are the consequences of synthesis = one view?

• What are the remedies?

• What are the incentives to conceive of synthesis differently?

• What are the obstacles to doing so?

⇒ To be explored for the use case of biological systematics / biodiversity data.

Background: Linnaean names refer to "non-types" contingently

[Figure labels: Language; Types; Non-types. Successive builds show Cleistes bifaria acc. to author 1, acc. to author 2, and acc. to author 3.]

Dubois. 2005. http://sciencepress.mnhn.fr/sites/default/files/articles/pdf/z2005n2a8.pdf

The Cleistes/Cleistesiopsis use case

⇒ 20 orchid occurrence records, 3 taxonomies, 1 synthesis

⇒ Let's map them!

Charly Lewis, CC BY-SA 3.0; doi:10.1101/157214

A. sec. Radford, Ahles & Bell 1968 – The Bible

Source: Radford, Ahles & Bell. 1968. Manual of the vascular flora of the Carolinas. UNC Press, Chapel Hill.

B. sec. Kartesz 2010 – The Federal Standard

Source: Kartesz. 2010. Floristic synthesis of North America, version 9-15-2010. Biota of North America Program, Chapel Hill.

C. sec. Weakley 2015 – The "Best" New Regional Flora

Source: Weakley. 2015. Flora of the Southern and Mid-Atlantic States. UNC Herbarium, Chapel Hill.

Expert views are in conflict.

One aggregate may distort any/all views!

D. sec. SERNEC Raw – Mid-Level Herbarium Aggregator

Source: SERNEC Data Portal. 2017. Available from http://sernecportal.org. Accessed 01 June 2017.

E. sec. SERNEC Synthesis – Mid-Level Herbarium Aggregator

Source: SERNEC Data Portal. 2017. Available from http://sernecportal.org. Accessed 01 June 2017.

What are the implications of "synthesis"?

⇒ The orchids are variously rare and red-listed

Charly Lewis, CC BY-SA 3.0

Individual expert views are in conflict; however...

doi:10.1101/157214

...the synthesis merges the conflicts "unevenly".

doi:10.1101/157214

One view yields novel inferences, with no expert provenance.

doi:10.1101/157214

How to remedy?

⇒ Synthesis as a conflict exposition and alignment service

Charly Lewis, CC BY-SA 3.0

Remedy: Representing taxonomic concepts and alignments

• 9 schemata for the Cleistes/Cleistesiopsis complex

• Vertical sections identify congruent taxonomic concept regions

• Colors identify lineages of taxonomic names (epithets) in use

• There is no consensus! Five incongruent schemata are used concurrently

doi:10.3897/rio.2.e10610

Further diagnosis:

If incongruent taxonomies are endorsed – locally, provisionally, and democratically – then what is the impact for aggregated biodiversity data?

⇒ Taxonomy becomes a variable that we need to represent, and thereby control for (at the system level).

"Controlling the taxonomic variable"

[Figure panels: The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora; "Just bad"]

Expert views are in conflict.

Solution: Instead of aggregating an artificial 'consensus', build translation services.

Expert views are reconciled.

doi:10.3897/rio.2.e10610

Challenge:

How can we redesign aggregation to yield high-quality biodiversity data packages?

(very abbreviated version)

Step 1 ⇒ Represent only taxonomic concept labels (TCLs) [1]

• Syntax (TCL): taxonomic name [author, year, page] sec. source

• Examples: Cleistes divaricata sec. Gregg & Catling 1993; Pogonia sec. Brown & Wunderlin 1997

[1] Multi-taxonomy input/alignment visualizations generated with Euler/X toolkit: https://github.com/EulerProject/EulerX
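The TCL syntax above can be sketched as a small data structure. This is a minimal illustration, not the Euler/X input format; the class `TCL`, its field names, and the helper `parse_tcl` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TCL:
    """A taxonomic concept label: a name as used according to (sec.) one source."""
    name: str                         # taxonomic name, e.g. "Cleistes divaricata"
    source: str                       # the "sec." source, e.g. "Gregg & Catling 1993"
    authorship: Optional[str] = None  # optional [author, year, page] of the name

    def __str__(self) -> str:
        middle = f" [{self.authorship}]" if self.authorship else ""
        return f"{self.name}{middle} sec. {self.source}"


def parse_tcl(label: str) -> TCL:
    """Split 'name sec. source' at the first ' sec. ' marker (hypothetical helper)."""
    name, sep, source = label.partition(" sec. ")
    if not sep or not name.strip() or not source.strip():
        raise ValueError(f"not a taxonomic concept label: {label!r}")
    return TCL(name=name.strip(), source=source.strip())


a = parse_tcl("Cleistes divaricata sec. Gregg & Catling 1993")
b = TCL("Cleistes divaricata", "Smith et al. 2004")

same_name = a.name == b.name   # the two labels share a taxonomic name
same_label = a == b            # but they are different concept labels
```

The point of the design is that equality is decided on the name and the source together, so two usages of the same name under different sources are never silently merged.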

Step 2 ⇒ Represent each source coherently (Parent-Child relationships)

• Syntax (PC): TCL1 is a child/parent of TCL2 [where TCL1/2 = same source]

• Example: Cleistesiopsis bifaria sec. Pans. & de Barr. 2008 is a child of Cleistesiopsis sec. Pans. & de Barr. 2008
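A minimal sketch of the parent-child rule, assuming each source is stored as its own hierarchy and that a PC statement is rejected when the two labels come from different sources. The class `Taxonomy` and the method `add_child` are hypothetical names for this illustration.

```python
from __future__ import annotations

from collections import defaultdict
from typing import NamedTuple


class TCL(NamedTuple):
    name: str
    source: str


class Taxonomy:
    """Parent-child (PC) relationships among the TCLs of a single source."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.parent_of: dict[TCL, TCL] = {}
        self.children_of: dict[TCL, list[TCL]] = defaultdict(list)

    def add_child(self, child: TCL, parent: TCL) -> None:
        # PC statements are only allowed within one source.
        for tcl in (child, parent):
            if tcl.source != self.source:
                raise ValueError(f"{tcl} does not belong to {self.source}")
        if child in self.parent_of:
            raise ValueError(f"{child} already has a parent")
        self.parent_of[child] = parent
        self.children_of[parent].append(child)


src = "Pans. & de Barr. 2008"
genus = TCL("Cleistesiopsis", src)
species = TCL("Cleistesiopsis bifaria", src)

taxonomy = Taxonomy(src)
taxonomy.add_child(species, genus)
```

Relationships across sources are deliberately excluded here; in the workflow described in these slides they are expressed as alignments rather than as parent-child statements.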

Step 3 ⇒ Align concepts with Region Connection Calculus (RCC–5)

• Two regions N, M are either:

• congruent (N == M)

• properly inclusive (N < M)

• inversely properly inclusive (N > M)

• overlapping (N >< M)

• exclusive of each other (N ! M)

• RCC–5 articulations answer the query: "Can we join regions N and M?"

• Taxonomies have multiple RCC–5 alignable components: nodes (parents, children), node-associated traits, even node-anchoring specimens.

Source: Thau, D.M. 2010. Reasoning about taxonomies. Thesis, UC Davis. http://gradworks.proquest.com/3422778.pdf
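The five RCC–5 relations can be illustrated by modeling each region as a finite, non-empty set of members. This is a sketch under that assumption only; the function name `rcc5` and the toy member ids are hypothetical, and "<" is read here in the set-theoretic sense (N lies properly inside M).

```python
def rcc5(n: frozenset, m: frozenset) -> str:
    """Return the RCC-5 symbol relating two non-empty regions modeled as sets."""
    if not n or not m:
        raise ValueError("regions must be non-empty")
    if n == m:
        return "=="   # congruent
    if n < m:
        return "<"    # N lies properly inside M
    if n > m:
        return ">"    # M lies properly inside N
    if n & m:
        return "><"   # overlapping: shared and unshared members on both sides
    return "!"        # exclusive of each other


# Toy regions; the member ids are placeholders, not real specimens.
broad = frozenset({"s1", "s2", "s3"})
narrow = frozenset({"s1", "s2"})
other = frozenset({"s3", "s4"})

relations = {
    "narrow vs broad": rcc5(narrow, broad),
    "broad vs narrow": rcc5(broad, narrow),
    "broad vs other": rcc5(broad, other),
    "narrow vs other": rcc5(narrow, other),
    "broad vs broad": rcc5(broad, broad),
}
```

Exactly one of the five relations holds for any pair of non-empty regions, which is why a single symbol is enough to answer whether, and how, two regions can be joined.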

Step 4⇒ Identify occurrence records only to TCLs

Records: EKY39235, MTSU003611, NCSC00040204

Records: BOON8098, CLEMS0061133, WILLI39399

Records: GMUF-0039355, IBE006808, USCH58399

Records: CONV0006268, MDKY00006482, NCU00038930

Records: BRYV0023582, BRYV0023584, KHD00032030, MISS0016604, MMNS000227, NCSC00040206, USMS_000002923, USMS_000002924, VSC0053223, VSC0065528

Records: ARIZ393087, DBG39049, USCH51217

Records: NCU00040710, USCH96248, VSC0053218

Records: CLEMS0012881, FUGR0003293, GA023130

Records: BOON8100, NCSC00040210, SJNM45487

Records: GA023144, LSU00012494, MISS0016608

Records: IBE006810, IND-0012374, MMNS000227

Records: NY8654

• Syntax (ID): Occurrence / organism is identified to TCL

• Example: "CLEMS0012881" is identified to Cleistes divaricata sec. Smith et al. 2004 [additional ID metadata]

Step 5 ⇒ Generate logically consistent RCC–5 alignments

• Euler/X is a toolkit that infers logically consistent RCC–5 alignments

• Value-added: MIR – set of Maximally Informative Relations containing the RCC–5 articulation for every possible TCL pair ⇒ Scalability

[Figure: Reasoner inference]
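To show what a MIR table contains, the sketch below derives one articulation for every TCL pair across two toy taxonomies. It assumes each concept's extent is already known as a set of members, which is a simplification: Euler/X infers these relations by logic reasoning over input constraints, not by comparing enumerated sets. All labels, member ids, and the function `articulation` are hypothetical.

```python
from itertools import product


def articulation(n: set, m: set) -> str:
    """RCC-5 symbol for two non-empty sets ('<' means n lies properly inside m)."""
    if n == m:
        return "=="
    if n < m:
        return "<"
    if n > m:
        return ">"
    return "><" if n & m else "!"


# Hypothetical extents: which toy members each concept label contains.
t1 = {"X sec. T1": {"s1", "s2", "s3", "s4"}, "Y sec. T1": {"s5"}}
t2 = {"X sec. T2": {"s1", "s2"}, "Z sec. T2": {"s3", "s4", "s5"}}

# One entry per (T1 label, T2 label) pair.
mir = {(a, b): articulation(t1[a], t2[b]) for a, b in product(t1, t2)}
```

With the pairwise table in hand, any later translation between the two sources becomes a lookup rather than a new inference.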

Step 6⇒ Integrate occurrence-to-TCL identifications & alignments

Records: BOON8098, CLEMS0061133, CONV0006268, EKY39235, GMUF-0039355, IBE006808, IBE006810, IND-0012374, MDKY00006482, MMNS000227, MTSU003611, NCSC00040204, NCU00038930, NY8654, USCH58399, WILLI39399

Records: ARIZ393087, BRYV0023582, BRYV0023584, DBG39049, KHD00032030, MISS0016604, MMNS00022, NCSC00040206, USMS_000002923, USMS_000002924, VSC0053223, VSC0065528

Records: BOON8100, CLEMS0012881, FUGR0003293, GA023130, GA023144, LSU00012494, MISS0016608, NCSC00040210, NCU00040710, SJNM45487, USCH96248, VSC0053218

• Specimen integration is fully driven by TCL-to-TCL RCC–5 signals

"Controlling the taxonomic variable"

[Figure panels: The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora]

Impact: "Please select your preference (A – D); we can perform all translations"

doi:10.3897/rio.2.e10610

Remedy: Aggregation as a translational service

• We can now respond to queries such as:

• "Show all specimens identified to the taxonomic name Cleistes divaricata"

• Returns many records ⇒ Resolves incongruent lineage of name usages

• "Now show specimens with the TCL Cleistesiopsis divaricata sec. Weakley 2015"

• Returns record subset ⇒ Resolves only one narrowly circumscribed concept

• "Now show specimens identified to the TCL Cleistes divaricata sec. RAB 1968, yet translated into the more granular TCLs sec. Weakley 2015"

• Returns (again) many records, yet represents and contrasts two treatments, as opposed to providing the ambiguous lineage view (above)

• "Show all specimens with ambiguous 2010/2015 TCL identifications…" (etc.)
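The three kinds of query above can be sketched as follows. The record ids are placeholders rather than SERNEC records, and the single alignment used here (the 1968 concept properly includes the 2015 concept) is an assumption made for illustration, not a result taken from the cited papers. The functions `by_name`, `by_tcl`, and `translate` are hypothetical.

```python
from typing import NamedTuple


class TCL(NamedTuple):
    name: str
    source: str


RAB = TCL("Cleistes divaricata", "RAB 1968")
SMITH = TCL("Cleistes divaricata", "Smith et al. 2004")
WEAKLEY = TCL("Cleistesiopsis divaricata", "Weakley 2015")

# Toy identifications: record id -> TCL.
identifications = {"rec1": RAB, "rec2": SMITH, "rec3": WEAKLEY, "rec4": RAB}

# Assumed alignment, stored as (broader, narrower) pairs.
includes = {(RAB, WEAKLEY)}


def by_name(name: str) -> list:
    """Name query: pools every usage of the name, whatever the source."""
    return sorted(r for r, t in identifications.items() if t.name == name)


def by_tcl(tcl: TCL) -> list:
    """TCL query: returns only records identified to this exact concept label."""
    return sorted(r for r, t in identifications.items() if t == tcl)


def translate(record: str, target_source: str) -> list:
    """Return (target TCL, certainty) pairs for one record in the target source."""
    tcl = identifications[record]
    if tcl.source == target_source:
        return [(tcl, "certain")]
    out = []
    for broader, narrower in sorted(includes):
        if tcl == narrower and broader.source == target_source:
            out.append((broader, "certain"))     # narrow to broad always holds
        elif tcl == broader and narrower.source == target_source:
            out.append((narrower, "possible"))   # broad to narrow is ambiguous
    return out
```

In this toy setup the name query pools three records across two sources, the TCL query returns one, and translating a record from the broad 1968 concept into the 2015 view yields only a "possible" match, which is the kind of ambiguity a translational service would expose instead of hiding.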

Synthesis, conflict, and integrative biology:

Incentives and obstacles

Understanding the attraction of synthesis = one view

• OK, so we have diagnosed an issue. How prevalent does it need to be for aggregation designs to actually change?

• Complication: Under the one-view design, we cannot measure the extent of the phenomenon very well.

• Is the threshold (of the prevalence of the phenomenon) shared universally between contributors and users? [⇒ Fitness for use]

• Are unitary aggregation systems designed to foster distrust particularly among career-advancing experts (e.g. graduate students, postdocs, early-career researchers) who tend to produce novel, "groundbreaking" views?

• Is the "sweeping under the rug" of conflict an expectation grounded in the long history of taxonomy? It's 2017, for crying out loud; shouldn't we have figured out orchids already? Why can't we have one unified "webpage" for every species? Or: We're so close, fund us once more and we promise to "get there".

• Is the quieting of conflict an increasingly acceptable design feature of big data?

Understanding the attraction of synthesis = one view

• Better integration – that accounts for past/present/future conflict – requires a kind of cognitive readjustment: "I need to ready my data now so that a dissenting view is more easily/scalably linkable to them." That may be asking for too much…

• Better integration will likely also force contributors and users to be more transparent upfront regarding the aims of integration, i.e., to make stronger and more transparent commitments about fitness for use. Again, asking a lot.

• It does seem that we are in the process of giving something up for the sake of big data integration. To some extent the integration designs are still too driven by technical feasibility constraints (which are a moving target, however).

• Dealing with ambiguity and conflict in the ways we humans are accustomed to in integrative biology is not something that we have translated well enough into the machine processing realm yet.

• Personal issue: At what point should my advocacy "stop"?

Acknowledgments

• CBS hosts: Kelle Dhein, Andrea Cottrell & Beckett Sterner – Thank you!

• Euler/X team: Bertram Ludäscher, Shizhuo Yu, Jessica Cheng, Ed Gilbert.

• NSF DEB–1155984 (PI Franz); IIS–118088, DBI–1147273 (PI Ludäscher).

• If you have to read one paper: https://doi.org/10.1093/sysbio/syw023

Products: Concept taxonomy in theory and in practice

ZooKeys. doi:10.3897/zookeys.528.6001

Semantic Web. doi:10.3233/SW-160220

Biological Theory. doi:10.1007/s13752-017-0259-5

PloS ONE. doi:10.1371/journal.pone.0118247

Systematics Biodiv. doi:10.1080/14772000.2013.806371

Systematic Biology. doi:10.1093/sysbio/syw023

Biodiversity Data Journal. doi:10.3897/BDJ.5.e10469

Research Ideas and Outcomes. doi:10.3897/rio.2.e10610