The GeneCards suite - dors.weizmann.ac.il


The GeneCards suite

Gil Stelzer

Gene information consolidation

Pathways integration into SuperPaths

Gene set descriptor enrichment

NGS Gene-phenotype prioritization

Disease unification and annotation

www.genecards.org

The human gene digital compendium

>150 Data Sources

Data sources shown include: ProtoNet, HomoloGene, AceView, AB, KEGG, WormBase, FlyBase, InterPro, SOURCE, GeneLynx, NCBI, dbSNP, HORDE, GeneTests, BLOCKS, PDB, MIPS, RZPD, IMGT, LEIDEN, PupaSNP, euGenes, GeneAtlas, HGMD, BCGD, MTDB, MGD, DOts, UCSC, Doctors guide, GO, UniProt, SwissProt, TrEMBL, OMIM, GDB, Ensembl, EntrezGene, bioalma, GeneLoc, TGDB, ATLAS, PubMed, Crow21, GenBank, HUGO, UniGene, MINT, GAD, GeneNote

~148,000 gene cards entries with “deep links”

• Automatic mining
• Inter-source integration

Number of Genes    Category
21,360             Protein-coding
98,609             RNA genes
16,337             Pseudogenes
1,819              Genetic loci
18                 Gene clusters
9,770              Uncategorized
147,913            Total

Gene-centric information 18 sections – including Summaries, Aliases, Diseases, Genomics, Expression, Pathways, Localization, Publications

The human gene compendium

Stelzer et al, Cur. Prot. Bioinformatics 2016

ncRNA: 23 data sources

Heterogeneity of data

ncRNA only

All genes

All classes

ncRNA class-specific

Belinky et al, Bioinformatics 2013

127k entries

15k entries

Clustering

~99,000 ncRNAs

Positional integration

Unification of ncRNAs in GeneCards

The ncRNA Grand Unification

15 data sources v3

~70k genes

v4.7

~150k genes

[Figure: GeneCards gene counts by category: protein coding 21,360; RNA genes 98,609; pseudogenes 16,337; remaining values as extracted (980, 132, 7,350) for genetic loci, gene clusters and uncategorized]

[Figure: ncRNA counts per RNA gene class, before and after unification. x-axis: RNA gene class; y-axis: count (log scale, 10 to 100,000). Annotated fold increases: ×8, ×30, ×2]

Enhanced ncRNA classes in GeneCards

www.malacards.org

epileptic encephalopathy, early infantile, 1

early infantile epileptic encephalopathy with suppression bursts

spasticity - intellectual disability - x-linked epilepsy

infantile epileptic-dyskinetic encephalopathy

infantile spasms

All names for the same disease

Genes for this disease

No disease symbols

No systematic name management

CDKL5, ARX, POMC, LINC00581, RDXP2, FOXG1, PVALB, SLC25A22, PHGDH, PNKP, IDUA

TPH1, SPTAN1, MECP2, SLC1A3, STXBP1, ABAT, FLNA, PTCH1, RDX, TLE1

OMIM: 6,834

Disease Ontology: 6,047

NIH Rare Diseases: 6,416

[Venn diagram of the three sources; overlap counts shown: 2190, 512]

Three major disease name sources show low overlap: integration is needed!

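MalaCards: 19,552

[Venn diagram: OMIM (6,834), Disease Ontology (6,047) and NIH Rare Diseases (6,416) within MalaCards; region counts as extracted: 6497, 502, 614, 983, 2545, 3467, 2511, 405, 3963, 2533]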

Sources shown: Disease Ontology, Diseasecard, Wikipedia, Orphanet, MedlinePlus, Copenhagen DISEASES, GeneTests, GeneReviews, NIH Rare Diseases, Genetics Home Reference, OMIM, NovoSeek, GTR

Name integration from 15 sources

A disease compendium

Rappaport et al. (2014) Cur. Prot. Bioinformatics

85,000 disease strings (names+aliases)

~19,600 disease names

15 info sources

65,000 aliases

Source hierarchy

Name integration process

Textual canonicalization

Example:
• Liver Cancer
• Hepatic Carcinoma
• Neoplasm of Liver

• Remove non-alphanumeric characters
• Replace identities (juvenile = childhood)
• Remove suffixes ('s, s, ies)
• Eliminate less-informative words (syndrome)
• Eliminate prefixes (resistance to)
• Translate Greek characters
• Unify word order (liver failure = failure, liver)
• Unify, but leave in, Roman/Arabic/Latin numerals

Textual canonicalization: 85,000 disease strings → ~19,600 disease names
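The canonicalization rules listed above can be sketched roughly as follows. This is a minimal illustration, not the MalaCards implementation: the lookup tables are assumptions (only "juvenile = childhood", "syndrome" and "resistance to" come from the slide), the plural stripping is deliberately crude, and word order is unified simply by sorting tokens.

```python
import re

# Illustrative lookup tables; entries beyond the slide's own examples are assumed.
IDENTITIES = {"childhood": "juvenile"}
LOW_INFO_WORDS = {"syndrome"}
PREFIXES = ("resistance to ",)
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}


def strip_suffix(token: str) -> str:
    """Crudely reduce plural endings (ies, s)."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def canonicalize(name: str) -> str:
    """Map a disease string to a canonical key shared by equivalent names."""
    text = name.lower()
    for greek, latin in GREEK.items():          # translate Greek characters
        text = text.replace(greek, latin)
    for prefix in PREFIXES:                     # eliminate prefixes
        if text.startswith(prefix):
            text = text[len(prefix):]
    text = re.sub(r"['’]s\b", "", text)         # remove possessive 's
    text = re.sub(r"[^a-z0-9]+", " ", text)     # remove non-alphanumerics
    tokens = []
    for token in text.split():
        token = ROMAN.get(token, token)         # unify numerals, keep them in
        token = IDENTITIES.get(token, token)    # replace identities
        if token in LOW_INFO_WORDS:             # drop less-informative words
            continue
        tokens.append(strip_suffix(token))
    return " ".join(sorted(tokens))             # unify word order
```

Strings that yield the same key would be treated as names of one disease; for example, "Liver failure" and "Failure, liver" both reduce to the same key.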

MalaCards annotation schemes

• Interrogating disease resources for classifications, symptoms, variations, drugs…
• Searching GeneCards for publications, associated genes
• Querying disease-related gene-sets in GeneAnalytics / GeneCards for pathways, phenotypes, compounds, and GO terms
• Searching within MalaCards itself, e.g. for related diseases, organ categories…

Number of Genes    Category
4,470              Cancer diseases
4,821              Fetal diseases
17,334             Genetic diseases
712                Infectious diseases
2,594              Metabolic diseases
29,140             Rare diseases

Disease-centric information 14 sections – including Summaries, Aliases, Genes, Anatomical context, Pathways, Drugs & therapeutics, Publications

The human disease compendium

Stelzer et al, Cur. Prot. Bioinformatics 2016

MalaCards search results

38 Pancreatic Cancer types

Advanced Search


A disease family with “Parent” and “children”

MalaCards search results

General disease categories

Cancer

Fetal

Genetic (Rare)

Genetic (common)

Rare (Non-genetic)

MalaCards affords category overviews

Categories: Genetic (Rare), Genetic (common), Rare (Non-genetic), Infectious, Fetal, Cancer

Tissue-related disease categories

Neuronal

Skin

Eye

Endocrine

Cardio

Bone

18 tissues

Genes with multi-source implications

* Elite genes: based on source count and importance

Sources for gene information

Elite genes
Supported also by specific evidence such as:
OMIM: “Molecular basis known”
Orphanet: “Causative mutation”
Humsavar: “Causative variation”
GeneTests: “Genetic Tests”
ClinVar: “Pathogenic”

10,178 genes
Total: 18,864 diseases
7,987 with elite genes
7,338 without genes
3,593 with non-elite genes

Non-elite genes
GeneCards searches in sections such as: Summaries, Pathways, Function, Publications

The challenge of confederating diseases with genes

[Figure: the gene-disease matrix, a binary table of genes (gen 1 to gen 10) against diseases (dis 1 to dis 10) in which a 1 marks an association; the column positions of the entries were lost in extraction]

The gene-disease matrix

11,791 diseases

10,464 genes

The gene-disease matrix: “mutual monogamy”

The gene-disease matrix:

Disease with several genes

Gene with several diseases
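The three situations highlighted in the gene-disease matrix (mutual monogamy, a disease with several genes, a gene with several diseases) can be told apart with a few lines of code. The sketch below is only an illustration: the association pairs in the usage note are hypothetical, and the function names are not part of MalaCards.

```python
from collections import defaultdict


def build_indexes(pairs):
    """Index (gene, disease) pairs in both directions."""
    diseases_of = defaultdict(set)
    genes_of = defaultdict(set)
    for gene, disease in pairs:
        diseases_of[gene].add(disease)
        genes_of[disease].add(gene)
    return diseases_of, genes_of


def classify(pairs):
    """Return monogamous pairs, multi-gene diseases and multi-disease genes."""
    diseases_of, genes_of = build_indexes(pairs)
    # "Mutual monogamy": the gene has one disease and the disease has one gene.
    monogamous = sorted(
        (g, d) for g, d in set(pairs)
        if len(diseases_of[g]) == 1 and len(genes_of[d]) == 1
    )
    multi_gene_diseases = sorted(d for d, gs in genes_of.items() if len(gs) > 1)
    multi_disease_genes = sorted(g for g, ds in diseases_of.items() if len(ds) > 1)
    return monogamous, multi_gene_diseases, multi_disease_genes
```

For instance, with the made-up pairs (gen1, dis1), (gen1, dis2), (gen2, dis3), (gen3, dis2), only (gen2, dis3) is monogamous, dis2 has several genes and gen1 has several diseases.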

The challenge of confederating diseases with genes

[Figure: maximum number of genes per disease for three gene sets (before promiscuity filtering, filtered, elite genes); values shown: ~5000, ~300, ~60]

[Figure: number of genes with and without disease associations per gene category (y-axis: genes, log scale from 1 to 1,000,000)]

Total genes:            23k   123k   1.7k   135   20k   1k
Fraction with disease:  .36   .003   .73    .06   .02   .07

The ncRNA gene prospect

Gene category

Esteller M., Non-coding RNAs in human disease, Nature Reviews 2011

Disease                                 Involved ncRNAs                        ncRNA type
Beckwith–Wiedemann syndrome             lncRNAs H19 and KCNQ1OT1               lncRNA
Silver–Russell syndrome                 lncRNA H19                             lncRNA
Deafness                                miR-96                                 miRNA
Alzheimer's disease                     miR-29, miR-146 and miR-107            miRNA
Alzheimer's disease                     ncRNA antisense transcript for BACE1   lncRNA
Rheumatoid arthritis                    miR-146a                               miRNA
Transient neonatal diabetes mellitus    lncRNA HYMAI                           lncRNA
Amyotrophic lateral sclerosis           miR-206                                miRNA

ncRNAs in human diseases

http://pathcards.genecards.org/

Pathway sources

Existing redundancy and ambiguity:

Same genes with different pathway names

Different genes with same pathway name

Pathway unification


Each pathway data source shows a different view

Creating SuperPaths –Unification in GeneCards of 12 pathway sources

Belinky et al. Database (Oxford). 2015

Pairwise gene compositional similarity between pathways as a basis for combination of hierarchical and nearest neighborhood clustering

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]

Jaccard similarity index
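As a small worked example of the Jaccard index on pathway gene content, the sketch below compares two gene sets. The function name and the idea of returning 0 for two empty sets are assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared genes divided by all genes in either set."""
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return len(a & b) / len(union)
```

Two pathways with gene sets {TP53, MDM2, CDKN1A, ATM} and {TP53, MDM2, CHEK2} (a made-up example) share 2 of 5 distinct genes, giving J = 0.4.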

The unification algorithm

A combination of hierarchical clustering and nearest neighbor graph representation with 2 cutoffs, T1 and T2

[Figure: redundancy vs. informativeness optimizer over a grid of T1 and T2 values (0 to 1)]

T1 = 0.3, T2 = 0.7

(Join all J > T2 (H), and the best if J > T1 (NN))
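The two-cutoff rule can be sketched as below. This is a simplified reading of the slide, not the published PathCards procedure: every pathway pair with J above T2 is merged, each pathway is also merged with its single best neighbor when that J exceeds T1, and connected groups become SuperPaths. The union-find bookkeeping and all names are assumptions.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def unify(pathways: dict, t1: float = 0.3, t2: float = 0.7) -> list:
    """Group pathways (name -> gene set) into SuperPath-like clusters."""
    names = sorted(pathways)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    best = {}  # pathway name -> (best J so far, best neighbor)
    for a, b in combinations(names, 2):
        j = jaccard(pathways[a], pathways[b])
        if j > t2:                       # join all pairs above T2
            union(a, b)
        for x, y in ((a, b), (b, a)):    # track each pathway's best neighbor
            if j > t1 and j > best.get(x, (0.0, None))[0]:
                best[x] = (j, y)
    for x, (_, y) in best.items():       # join the best neighbor above T1
        union(x, y)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())
```

Lowering T1 merges more pathways (less redundancy, less specific SuperPaths), while raising it leaves more near-duplicates separate, which is the trade-off the optimizer explores.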

The SuperPath collection in PathCards

3215 pathways 1073 SuperPaths

[Figure: gene pairs per source, for optimal T1, T2 versus wrong T1, T2]

Novel gene pairs in SuperPaths

[Figure: gene pair count and count ratio as a function of shared publications]

Beneficial novel gene pairs in SuperPaths

https://ga.genecards.org/

A powerful gene set analysis tool

Categorized results
• Tissue/cell branded expression data
• Disease association
• Pathways and SuperPaths
• Gene Ontologies (GO) connection
• Compounds (from several sources)

Unique matching algorithms

Tissue Expression Example

Gene-Set Analysis

Gene-set: 193 genes related to Retinitis Pigmentosa

[Figure: gene-set members matched to tissues and cells (Eye; Photoreceptor-Like Cells; Mature Rod Cells). Gene labels shown: PRPH2; AIPL1, CRB1, FAM161A, UNC119; PHO, RP1, RPGRIP1; RPE65]

[Screenshots: GeneAnalytics result pages, labeled: Result Categories; Aggregation and Filters; Filters Applied; Results; Links and Additional information]

http://varelect.genecards.org/

The augmented functional genome

Requires effective bioinformatic tools

[Figure: genome fractions addressed by whole exome sequencing (exome) versus whole genome sequencing (exome, introns and UTRs, ncRNA genes, regulatory); percentages shown: 2% and 25%]

Challenges: annotation, filtration, gene-based interpretation

The NGS interpretation cycle

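[Figure labels: Patient; DNA; NGS; Genome mapping; VCF; Variant (V) calling, annotation and filtering; Genes (G) and other genomic elements; Phenotypes; Interpretation program; result: G4 (V5)]

Variant-to-gene mapping in the example: V1 → G1, V2 → G2, V3 → G2, V4 → G3, V5 → G4, V6 → G4, V7 → G5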

Two-tier variant-containing gene sifting

25k NGS coding variants

1) Filtering
• Genetic model
• Population frequency
• Protein damage
• Known variants

~100 genes (medium list)

2) Phenotype- and gene-based interpretation

1-5 genes
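The first tier (filtering variants down to a medium gene list) might look roughly like the sketch below. Everything here is an assumption for illustration: the `Variant` fields, the 1% frequency cutoff and the simple recessive rule are hypothetical, the "known variants" criterion is omitted, and compound heterozygotes are ignored.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str               # gene containing the variant
    population_freq: float  # allele frequency in a reference population
    damaging: bool          # predicted to damage the protein
    genotype: str           # "hom" or "het"


def tier1_filter(variants, max_freq=0.01, model="recessive"):
    """Keep rare, damaging variants compatible with the genetic model."""
    kept = []
    for v in variants:
        if v.population_freq > max_freq:   # population frequency filter
            continue
        if not v.damaging:                 # protein damage filter
            continue
        if model == "recessive" and v.genotype != "hom":  # genetic model
            continue
        kept.append(v)
    return kept


def medium_gene_list(variants):
    """Collapse surviving variants into a sorted list of distinct genes."""
    return sorted({v.gene for v in variants})
```

The resulting gene list is what the second tier would then rank against phenotype keywords.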

VarElect - A medium gene list interpreter

Gene A

Gene B

Phenotype

VarElect interpretation

Direct mode

Indirect mode

Gene-gene relations

Guilt by association via sharing of Super-Pathways, mouse phenotypes, tissue expression, paralogy, publications, protein-protein interactions, drugs/compounds, domains

         geneA   geneB   geneC
geneA            S1      S2
geneB    S1              S3
geneC    S2      S3

Stelzer et al, BMC Genomics 2016

The VarElect user web interface - data entry

User Name

diarrhea

Disease: Atypical syndromic congenital diarrhea

The VarElect Symbolizer

Resolving gene aliases erroneously entered into the VCF file
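Alias resolution of this kind amounts to a lookup from entered names to official symbols. The sketch below shows the idea only; the function name, the case-insensitive matching and the tiny alias table in the usage note are assumptions, and a real table would come from a gene nomenclature resource.

```python
def symbolize(names, official_symbols, alias_to_symbol):
    """Map entered gene names to official symbols; report what cannot be resolved."""
    resolved = {}
    unresolved = []
    for name in names:
        key = name.strip().upper()        # assumed: matching ignores case
        if key in official_symbols:       # already an official symbol
            resolved[name] = key
        elif key in alias_to_symbol:      # known alias, replace it
            resolved[name] = alias_to_symbol[key]
        else:
            unresolved.append(name)       # leave for manual review
    return resolved, unresolved
```

For example, with official symbols {"CDKL5", "TLN1"} and an illustrative alias table {"STK9": "CDKL5"}, the inputs "cdkl5" and "STK9" both resolve to CDKL5, while an unknown name is returned in the unresolved list.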

VarElect in action - Retinitis Pigmentosa with Epilepsy

Input 2: Phenotype keywords: epilepsy OR macular OR retinitis

Zhu X, Oz Levi D, Genetics in Medicine, 2015

Input 1: 63 Genes

Output: CLN6 is 1st

Terra Incognita

VarElect “minicards” show hit context for top gene

Top gene

37 Genes

No genes are DIRECTLY related

Phenotype keywords: “capillary leak”

Danit Oz Levi (with Lancet)

Example of indirect mode - Capillary leak syndrome

Implicated gene

Implicating genes

VarElect results

Implicating gene with “minicard” showing hit context

Implicated gene

VarElect results

TLN1 (talin 1) is an excellent candidate gene

• Strong splice site mutation
• Passed segregation analysis in family
• Talin binds integrins, coupling them to actin, and has a known role in vascular endothelium permeability

GeneHancer

What are genomic enhancers

• Distant-acting DNA elements that regulate transcription
• Mostly occur in intergenic genome regions
• Mediated by DNA looping via a large protein complex
• Contain binding site sequences for transcription factors
• Weak DNA signatures: identification is challenging

[Figure: an enhancer brought to a promoter by a DNA loop; transcription begins]

The significance of enhancers

Estimated count: 400,000 (10× the number of functional genes)

Gene expression fine-tuning much beyond promoters

Enhancers in development and human disease

Example: Preaxial polydactyly (“many fingers” in Greek)

Visel, A., Nature, 2009

Enhanceropathies – Enhancer-related diseases

SHH – Sonic Hedgehog

Motivation

• A unified compendium of enhancer elements
• Combined information on gene-enhancer relations
• Use in WGS interpretation

GeneHancer – an integrated enhancer database

Different data sources

Automated data mining

Part of the GeneCards Suite

Fishilevich et al, Database (Oxford), 2017

Enhancer integration from 4 sources

434,139 total entries from the 4 sources, by evidence type:
• Histone modifications, open chromatin, TFBSs
• Histone modifications, open chromatin
• Enhancer RNA (eRNA)
• Transgenic mouse assays (in-vivo validation)

284,834 unified scored enhancers

Inter-source overlaps are significant

• Elite enhancers: ~94,000 enhancers (33%) derived from multiple sources

• Enhancer confidence score: number of sources, source scores, conserved regions, TFBSs.

Rational prediction of an enhancer’s target genes is essential for biological interpretation

5 methods for enhancer-gene associations

• Methods are predictions/inferences

• Combination is optimal

Chromosome conformation capture

Hi-C - Combination of DNA proximity ligation with high-throughput sequencing

Capture Hi-C (CHi-C) - Promoter-targeted Hi-C

Mifsud, B., et al., Nature Genetics, 2015

Lieberman-Aiden, E., Science, 2009

Expression Quantitative Trait Loci (eQTLs)

Genetic association between an enhancer’s SNP genotype and the expression of a potential target gene

The GTEx Consortium, Science, 2015

Nica, A., et al., Philosophical Transactions of the Royal Society B, 2013

[Figure: gene expression intensity plotted against SNP genotype; a variant in an enhancer is linked to a gene]

Genomic distance

• Immediate neighbors were connected

• This is in line with the distance dependence of other methods

Enhancer – gene association by distance
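Connecting an enhancer to its immediate neighbors can be sketched as a sorted lookup along one chromosome. The representation below (one position per gene, for example its transcription start site, and a single coordinate for the enhancer) is an assumption for illustration, not the GeneHancer procedure.

```python
from bisect import bisect_left


def neighbouring_genes(enhancer_pos, genes):
    """Return the nearest gene on each side of an enhancer.

    genes: list of (position, symbol) tuples on the same chromosome.
    A side with no gene yields None.
    """
    ordered = sorted(genes)
    positions = [pos for pos, _ in ordered]
    i = bisect_left(positions, enhancer_pos)   # first gene at or after the enhancer
    left = ordered[i - 1][1] if i > 0 else None
    right = ordered[i][1] if i < len(ordered) else None
    return left, right
```

With hypothetical genes at 1,000 (geneA), 5,000 (geneB) and 9,000 (geneC), an enhancer at 6,000 is linked to geneB and geneC.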

Overlap among target gene predictions

Distance

TFs co-expression

eQTLs

CHi-C

eRNA co-expression

~1,000,000 scored enhancer-gene pairs
• ~75,000 Elite (supported by >1 method)
• ~40,000 Double elite (both enhancer and gene link are elite)
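The elite and double-elite definitions translate directly into counting supporting methods per enhancer-gene pair. The sketch below assumes a hypothetical input of (enhancer, gene, method) records plus a set of elite enhancer identifiers; none of these names come from GeneHancer itself.

```python
from collections import defaultdict


def elite_pairs(links, elite_enhancers):
    """Classify enhancer-gene pairs.

    links: iterable of (enhancer_id, gene, method) records.
    elite_enhancers: set of enhancer ids derived from multiple sources.
    Returns (elite, double_elite) as sets of (enhancer_id, gene).
    """
    methods = defaultdict(set)
    for enhancer, gene, method in links:
        methods[(enhancer, gene)].add(method)
    # Elite link: supported by more than one method.
    elite = {pair for pair, ms in methods.items() if len(ms) > 1}
    # Double elite: elite link whose enhancer is itself elite.
    double_elite = {pair for pair in elite if pair[0] in elite_enhancers}
    return elite, double_elite
```

A pair reported by both eQTLs and CHi-C would count as elite; it becomes double elite only if its enhancer is also in the elite enhancer set.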

Enhancer experimental validation

1. How good is our enhancer set?

121/175 (69%) overlap with GeneHancer

2. How good is our gene set?

100/121 (83%) overlap with GeneHancer

56% of the validated enhancer-gene pairs are Elite, as compared to 7.4% expected at random

Validated enhancers have higher enhancer scores

Comparison to 175 published single-enhancer studies

View in GeneCards - the human gene compendium

β globin

genecards.org

GeneHancer for sequence analyses of diseases

WGS – Whole Genome Sequencing

The interpretation challenge: which variant causes the disease?

[Figure: exome sequencing yields coding variants (25,000), linked via gene to phenotype; whole genome sequencing adds non-coding variants (4,000,000), linked via enhancer]

180 kb CNV

Input 1: 18 genes, 31 enhancers

Input 2: phenotype keywords: skin, dermatol*, …

Output: Highest scoring gene is a target of an enhancer

Candidate enhancer for a developmental skin disease

Enhancer Candidate target gene

Candidate target gene

CNV

59 enhancers in a 750kb interval

GeneCards UCSC custom track

GH01F011707
• Enhancer within the CNV
• Elite enhancer
• 445kb upstream from the target gene