Identifying and Predicting Novelty in Microbiome Studies ......• Earth microbiome project 1:...

Identifying and Predicting Novelty in Microbiome Studies based on Microbiome Searching. Xiaoquan SU, Ph.D., Bioinformatics Group, QIBEBT, Chinese Academy of Sciences. [email protected] mse.ac.cn Probiota 2019, Denmark


Page 1: Identifying and Predicting Novelty in Microbiome Studies ......• Earth microbiome project 1: 27,715 samples (Nature, 2017) ... MetaHIT Another microbiome study Data-driven Research.

Identifying and Predicting Novelty in Microbiome Studies based on Microbiome Searching

Xiaoquan SU, Ph.D.

Bioinformatics Group, QIBEBT

Chinese Academy of Sciences

[email protected]

mse.ac.cn

Probiota 2019, Denmark

Page 2

Microbiome big-data

Thompson, et al., Nature, 2017

Gilbert, et al., Nature, 2016

Blaser et al., mBio., 2016

• Human health

• Environment

• Agriculture

• Industry

• Bioenergy

Microbiome

• The basic functional unit of microorganisms

• Enormous potential in solving crucial issues

Page 3

[Figure: the number of papers; y-axis ticks from 0 to 25,000]

The number of microbiome studies is increasing exponentially

Data from NSL-CAS, May 2018

Microbiome Big-data:

More information, more challenges

PB level data

1 PB = 1024 TB

1 TB = 1024 GB

1 GB = 1024 MB

Microbiome big-data

Page 4

http://metagenomics.anl.gov/

# of public metagenomes: 36,632

# of sequences: >10,000,000,000

# of studies: 1312

https://qiita.ucsd.edu/

# of public metagenomes: 242,593

# of sequences: >20,000,000,000

# of studies: 1163

https://img.jgi.doe.gov/

# of public metagenomes: 6,595

# of sequences: >2,000,000,000

# of studies: 252

Challenge 1: Single use of microbiome big-data

PB level data, however:

• Used only as a data repository

• Single use, no further mining

Microbiome big-data

Page 5

Query words → The Internet → Matched pages

Web search engine: keyword to keywords

Challenge 2: Difficult to search in microbiome big-data

Microbiome big-data

Query microbiome → Microbiome big-data → Matched samples

No solution for “community to communities” search

BLAST: sequence to sequences

Query reads → Reference genomes → Matched sequences

Page 6

New technique for Microbiome Big-data Science

Microbiome Search Engine (MSE)

Microbiome Database

Input: Microbiome structure

Output: Structurally similar microbiome(s) and meta-data

Identification based on microbiome structural similarity
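The "community to communities" idea can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say which similarity measure MSE uses, so Bray-Curtis similarity on relative abundances is swapped in here, and the sample names, taxa, and the functions `bray_curtis_similarity` and `search` are hypothetical. It is also a brute-force scan, not the indexed search that gives MSE its reported speed.

```python
from typing import Dict, List, Tuple

# A microbiome structure: taxon name -> relative abundance (hypothetical format).
Profile = Dict[str, float]


def bray_curtis_similarity(a: Profile, b: Profile) -> float:
    """Bray-Curtis similarity (1 - dissimilarity) between two abundance profiles."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return 2.0 * shared / total if total else 0.0


def search(query: Profile, database: Dict[str, Profile], top_k: int = 10) -> List[Tuple[str, float]]:
    """Return the top_k database samples most similar to the query, best first."""
    scored = [(sample_id, bray_curtis_similarity(query, profile))
              for sample_id, profile in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy database with made-up samples, for illustration only.
database = {
    "gut_1": {"Bacteroides": 0.6, "Prevotella": 0.4},
    "soil_1": {"Streptomyces": 0.7, "Bacillus": 0.3},
}
query = {"Bacteroides": 0.5, "Prevotella": 0.5}
matches = search(query, database)
print(matches)
```

The ranked list of (sample, similarity) pairs is the kind of output that the novelty and attention scores later in the talk are built on.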

Microbiome Search Engine

Page 7

New technique for Microbiome Big-data Science

Microbiome Search Engine (MSE)

Online system

http://mse.ac.cn

Standalone package & QIIME 2 plugin

Linux / Mac OS X / Embedded Linux on Win10

Microbiome Search Engine

Page 8

Efficiency evaluation of MSE Search against 1,000,000 samples

MSE enables in-depth data mining among microbiome big-data

• 0.29 s search time against over 1 million samples (340× speedup)

• Constant search speed, insensitive to database size

Quad Intel Xeon E7, 40 cores

Microbiome Search Engine

Page 9

A “bird's-eye view” of global microbiome patterns

• Earth Microbiome Project¹: 27,715 samples (Nature, 2017)

• Microbiome Search Engine: 101,983 samples, 301 studies (mBio, 2018)

¹ Thompson, et al., Nature, 2017

Data-driven Research

Page 10

The “Microbiome Data Space”

HMP

EMP

MetaHIT

Another microbiome study

Data-driven Research

Page 11

A “star” is a microbiome

MSE is a super telescope for discovering similar stars (microbiomes)

Data-driven Research

Page 12

Data-driven Research

The global microbiome data: an “infinite” space?

The total number of known bacteria is N (e.g., 1,000,000)

Theoretically, the number of microbiome structures with m (1 ≤ m ≤ N) bacteria is

$$\sum_{m=1}^{N} C_N^m \times Abd(m) \to \infty$$
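As a rough illustration of how fast this count grows, the sketch below counts only which bacteria are present, ignoring abundance. That is an assumption made here (it treats the abundance term as 1, since the slide does not define Abd(m)), so the result is best read as a lower bound on the expression above. The function name is hypothetical.

```python
from math import comb


def presence_absence_structures(n_taxa: int) -> int:
    """Count the non-empty subsets of n_taxa bacteria: sum of C(N, m) for m = 1..N."""
    return sum(comb(n_taxa, m) for m in range(1, n_taxa + 1))


for n in (10, 20, 50):
    # The sum equals 2**N - 1, so it doubles with every additional taxon.
    print(n, presence_absence_structures(n))
```

Even with abundance ignored, 50 taxa already give more than 10^15 possible structures, which is why the space is described as effectively infinite for N around 1,000,000.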

The phylogeny tree

A microbiome

Page 13

Microbiome Novelty Score (MNS)¹

MNS measures the uniqueness of a microbiome in a database: higher MNS = higher novelty

¹ Su, et al., mBio, 2018

Data-driven Research

$$MNS = 1 - \frac{\sum_{i=1}^{10} S_i \times (10 - i)}{\sum_{i=1}^{10} (10 - i)}$$
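A minimal sketch of the MNS calculation, assuming that S_i is the similarity of the i-th best match returned by a search (best match first) and that the weight of each match is (10 − i). The slide does not spell out either point, so both are assumptions, and the function name is hypothetical.

```python
from typing import Sequence


def microbiome_novelty_score(top_similarities: Sequence[float]) -> float:
    """Compute MNS from the similarities of the top 10 matches, ordered best first."""
    if len(top_similarities) != 10:
        raise ValueError("expected exactly 10 similarity values")
    # Weights 9, 8, ..., 0 for ranks i = 1..10 (assumed reading of the formula).
    weights = [10 - i for i in range(1, 11)]
    weighted_sum = sum(s * w for s, w in zip(top_similarities, weights))
    return 1.0 - weighted_sum / sum(weights)


# A sample whose top matches are all identical to it has no novelty.
print(microbiome_novelty_score([1.0] * 10))
# A sample with no similar matches at all is maximally novel.
print(microbiome_novelty_score([0.0] * 10))
```

Under this reading, close matches near the top of the list pull the score down the most, and the 10th match carries zero weight.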

Page 14

Historical trend of MNS of 101,983 samples from 2010-2017

Normal Distribution (Pearson r=0.92±0.07)

Therefore, we set the 2010 mean MNS (0.15) as the novelty baseline

Data-driven Research

Page 15

Human¹ vs Non-human²: 1:6

Thus, many more novel microbial patterns exist in natural environments

Trends of novel samples in each sub-category

¹ Human: Gut, Skin, Oral, Urogenital, etc.

² Non-human: Animal, Marine, Lake, River, Soil, House, etc.

[Figure legend: Total non-human; Total human; Novel non-human; Novel human]

Data-driven Research

Page 16

Human vs Non-human

• Structure of human microbiome is bounded, approaching saturation

• Turning point of human samples is in 2012, due to HMP publication

Trends of novel samples in each sub-category

Data-driven Research

Page 17

Our known microbiome data space

Human microbiome: with boundary

Environment microbiome: boundary unclear

Data-driven Research

Page 18

Microbiome Attention Score (MAS)¹

MAS measures the attention of a microbiome in a database: higher MAS = higher attention

¹ Su, et al., mBio, 2018

Data-driven Research

$$MAS = \sum_{i=1,\, i \neq m}^{10} Connectivity(m, i)$$

$$Connectivity(m, i) = \begin{cases} S_i, & \text{if } m \in \text{top 10 matches of } i \text{ and } S_i \geq 0.85 \\ 0, & \text{if } m \notin \text{top 10 matches of } i \end{cases}$$
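The connectivity rule can be sketched as below. Several points are assumptions rather than statements from the slide: S_i is taken to be the similarity between sample i and sample m, the sum runs over every other sample supplied to the function, a match below 0.85 contributes 0, and the input format (sample id mapped to its ranked match list) and function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical input: sample id -> its search results as (matched id, similarity), best first.
TopMatches = Dict[str, List[Tuple[str, float]]]


def microbiome_attention_score(m: str, top_matches: TopMatches,
                               min_similarity: float = 0.85) -> float:
    """Sum the similarities of all samples that list m among their top 10 matches."""
    score = 0.0
    for i, matches in top_matches.items():
        if i == m:
            continue  # a sample does not contribute to its own attention
        for matched_id, similarity in matches[:10]:
            if matched_id == m:
                if similarity >= min_similarity:
                    score += similarity
                break  # m appears at most once in the match list of i
    return score


# Toy data with made-up identifiers and similarities.
top_matches = {
    "a": [("b", 0.90), ("c", 0.80)],
    "b": [("a", 0.95)],
    "c": [("a", 0.80), ("b", 0.99)],
}
print(microbiome_attention_score("b", top_matches))
```

In this toy example, "b" is matched by both "a" and "c" above the 0.85 cutoff, so it collects attention from both, while "c" collects none.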

Page 19

The “80/20 Rule”:

Top 20% most frequently matched samples received “High Attention”

The MAS threshold of 14 is determined based on the top 20% MAS samples
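One way to derive such a cutoff from a list of scores is sketched below. It is an illustration only: the slide does not describe the exact procedure, and the function name and the toy scores are hypothetical.

```python
from typing import Sequence


def top_fraction_threshold(scores: Sequence[float], fraction: float = 0.2) -> float:
    """Return the lowest score that still falls within the top `fraction` of samples."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    # Number of samples counted as the top fraction, at least one.
    k = max(1, round(len(ranked) * fraction))
    return ranked[k - 1]


# Toy MAS values; samples at or above the returned value count as "High Attention".
toy_scores = [5, 1, 3, 2, 4, 10, 9, 8, 7, 6]
print(top_fraction_threshold(toy_scores))
```

Applied to the MAS values of the full database, a rule of this kind would yield the single cutoff that separates the top 20% from the rest.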

Data-driven Research

Distribution of MAS of 101,983 samples from 2010-2017

Page 20

Microbiome Focus Index (MFI) = f(MNS, MAS)

Habitat   Rate
Lake      22.29%
Animal    22.25%
Marine    15.32%
Soil      14.52%
Human      9.47%

High-MFI (focus) microbiome:

a. Novel when born (MNS ≥ 0.15)

b. Attention afterwards (MAS ≥ 14)

2,298 focus samples from 101,983 samples
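The two conditions for a focus microbiome can be written as a simple filter. The thresholds 0.15 and 14 come from the slides; the `Sample` record, the function names, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List

MNS_BASELINE = 0.15   # novelty baseline: the 2010 mean MNS
MAS_THRESHOLD = 14    # attention threshold: top 20% of MAS values


@dataclass
class Sample:
    sample_id: str
    mns: float  # Microbiome Novelty Score when the sample entered the database
    mas: float  # Microbiome Attention Score accumulated afterwards


def is_focus(sample: Sample) -> bool:
    """A focus sample is novel when born and receives high attention afterwards."""
    return sample.mns >= MNS_BASELINE and sample.mas >= MAS_THRESHOLD


def focus_samples(samples: Iterable[Sample]) -> List[Sample]:
    return [s for s in samples if is_focus(s)]


# Made-up examples: only the first one satisfies both conditions.
examples = [
    Sample("lake_01", mns=0.30, mas=20.0),
    Sample("gut_01", mns=0.05, mas=40.0),
    Sample("soil_01", mns=0.40, mas=2.0),
]
print([s.sample_id for s in focus_samples(examples)])
```

For scale, 2,298 focus samples out of 101,983 is roughly 2.3% of the database.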

Data-driven Research

Page 21

Prediction of high-MFI samples

Random Forest differentiates the focus samples by 4-year MAS: 98.8% accuracy

MAS development: wake up in the first 4 years

Focus Microbiome (Beauties):

a. Novel when born: constant

b. Attention afterwards: variable (Sleeping Beauties)

Data-driven Research

Page 22

Sleeping beauties: Marine & Indoor

Human microbiome: Mother-baby

Prediction of high-MFI samples

A hybrid Regression-Random-Forest

$$MAS_{max} = \frac{\sum_{i=1}^{Y-2014} MAS_i \times RF_i / Reg_i}{\sum_{i=1}^{Y-2014} RF_i}$$

Y is the sample's birth year

MAS_i is the i-th year's MAS

Reg_i is the i-th year's max-MAS-ratio

RF_i is the RF importance of the i-th year
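The weighted average in the MAS_max formula can be sketched as follows. The reading used here is an assumption: dividing each year's MAS by that year's max-MAS-ratio gives one estimate of the eventual maximum, and the Random Forest importances weight those yearly estimates. The function name and the example numbers are hypothetical, and the sketch takes the yearly values as given rather than deriving the number of years from Y.

```python
from typing import Sequence


def predict_max_mas(yearly_mas: Sequence[float],
                    max_mas_ratio: Sequence[float],
                    rf_importance: Sequence[float]) -> float:
    """Importance-weighted average of the yearly estimates MAS_i / Reg_i."""
    if not (len(yearly_mas) == len(max_mas_ratio) == len(rf_importance)):
        raise ValueError("all inputs must cover the same years")
    if not yearly_mas:
        raise ValueError("at least one year of data is required")
    numerator = sum(mas * rf / reg
                    for mas, reg, rf in zip(yearly_mas, max_mas_ratio, rf_importance))
    denominator = sum(rf_importance)
    return numerator / denominator


# Made-up values: two years of MAS, their max-MAS-ratios and RF importances.
print(predict_max_mas([2.0, 4.0], [0.5, 1.0], [1.0, 3.0]))
```

With these toy inputs both years imply a maximum of 4.0, so the weighted average is 4.0 regardless of the importances.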

Data-driven Research

Page 23

Data-driven Research

The treasure map of “Microbiome Data Space” by MSE

Page 24

http://mse.ac.cn

Page 25

Acknowledgement

• Bioinfo. Group, Single-Cell Center, QIBEBT-CAS

• D. McDonald, A. Gonzalez, J. Navas, Knight Lab, UCSD

Prof. Jian XU

SSC Director, QIBEBT-CAS

Prof. Rob KNIGHT

Knight Lab PI, UCSD

Gongchao JING

Assist. Prof.

Algorithm developer

Lu LIU

Post Doc.

Data manager

Zheng SUN

Post Doc.

Data analyst

Zengbin WANG

Technician

Web developer

Yufeng ZHANG

Graduate student

Algorithm developer

http://mse.ac.cn

Page 26

MNS-based detection without marker: Is a microbiome healthy or not?

MNS of unhealthy samples is significantly higher than that of healthy ones

IBD (inflammatory bowel disease)

HIV (AIDS)

CRC (colorectal cancer)

EDD (diarrheal dysentery)

Search-based diagnosis

• Baseline database: 15,704 healthy fecal samples from 56 studies

• Test dataset: 3,113 fecal samples from 9 studies, 5 statuses

Page 27

MSE offers precise diagnosis for multiple diseases, AUC = 0.81

IBD (inflammatory bowel disease)

HIV (AIDS)

CRC (colorectal cancer)

EDD (diarrheal dysentery)

Search-based diagnosis

MNS-based detection without marker: Is a microbiome healthy or not?
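To make the AUC figure concrete, the sketch below shows how an AUC can be computed when MNS against a healthy baseline is used directly as the detection score. It uses the standard pairwise (Mann-Whitney) definition of AUC; the function name and the scores are made up for illustration and are not data from the talk.

```python
from typing import Sequence


def auc_from_scores(healthy_scores: Sequence[float],
                    unhealthy_scores: Sequence[float]) -> float:
    """Probability that a random unhealthy sample scores higher than a random healthy one."""
    if not healthy_scores or not unhealthy_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for u in unhealthy_scores:
        for h in healthy_scores:
            if u > h:
                wins += 1.0
            elif u == h:
                wins += 0.5  # ties count as half a win
    return wins / (len(healthy_scores) * len(unhealthy_scores))


# Made-up MNS values, measured against a database of healthy samples.
healthy_mns = [0.05, 0.10, 0.20]
unhealthy_mns = [0.15, 0.30]
print(auc_from_scores(healthy_mns, unhealthy_mns))
```

An AUC of 0.5 would mean MNS carries no information about health status, and 1.0 would mean every unhealthy sample scores above every healthy one.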