From Precision Medicine Big Data to Intelligent Healthcare - msra.cn · 01.12.2016 ·...
Precision Medicine Initiative (USA)
“I want the country that eliminated polio and mapped the human genome to lead a new era of medicine – one that delivers the right treatment at the right time. … Tonight, I'm launching a new Precision Medicine Initiative to bring us closer to curing diseases like cancer and diabetes – and to give all of us access to the personalized information we need to keep ourselves and our families healthier.”
State of the Union Address, Tuesday, January 20, 2015
Problems in Drug Treatment
[Figure: problems in drug treatment, with the labels Dosage, Side Effect, and Efficacy; one value shown is 23%. Data from: Spear BB et al., Trends in Molecular Medicine 7:201-204, 2001]
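Precision medicine: concept and significance
November 2011, US NRC: "Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease"
Concept
Precision medicine brings together the knowledge and technology systems produced by many advances in modern medical science. It reflects the direction in which medical science is developing and represents the direction of clinical practice.
Basis
• A knowledge system of disease molecular mechanisms obtained from large-sample research
• Omics data
• Individual patient characteristics: genotype, phenotype, environment, lifestyle
Means
• Modern genetics
• Molecular imaging
• Bioinformatics
• Clinical medicine
Goals
• Precision prevention
• Precision diagnosis
• Precision treatment
Significance and necessity
• Master a new disease classification system and new diagnosis and treatment standards
• Improve national health: reduce ineffective, harmful, and excessive medical care; lower medical costs; optimize the allocation of national medical resources
• Promote the rapid development of related disciplines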
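Goals of the precision medicine special program
Goal 1: New-generation omics technologies
Goal 2: Large-scale population cohorts
• Variation data from healthy populations
• Variation data from disease populations
Goal 3: Key technologies for big data analysis
Goal 4: Precision prevention, diagnosis, and treatment schemes
Goal 5: Individualized treatment technologies
Steps shown in the figure:
• Information collection: whole-genome sequencing, other omics analyses
• Variant discovery: sequence data analysis, omics data analysis
• Variant analysis: the relationship between variants and disease
• Clinical pathway: personalized diagnosis and treatment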
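Main tasks
Task 1: Research and development of new-generation life-omics technologies
Task 2: Large-scale population cohort studies
Task 3: A platform for the integration, storage, use, and sharing of precision medicine big data resources
Task 4: Research and formulation of a series of precision prevention, diagnosis, and treatment schemes for diseases
Task 5: Discovery of individualized treatment targets and development of new technologies
Task 6: Construction of an integrated application demonstration system for precision medicine
Figure 1: Relationships among the tasks. The figure groups them under two headings: major basic research and common key technologies; clinical application demonstration and promotion.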
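Task 3: A platform for the integration, storage, use, and sharing of precision medicine big data resources
1. Service support system and standardization platform for managing and sharing biomedical big data
2. Effective mining and high-performance computing for biomedical big data
3. Key information technologies for biomedical big data processing
4. A large precision medicine database integrating clinical information with multi-level omics information
5. A large precision medicine knowledge base for translational applications
Figure 4: Relationships among the key projects of Task 3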
How do we differ? – Let's count the ways
• Single nucleotide polymorphisms (SNPs)
  • 1 every few hundred bp, mutation rate* ≈ 10^-9
  • Example:
    TGCATTGCGTAGGC
    TGCATTCCGTAGGC
• Short indels (insertions/deletions)
  • 1 every few kb, mutation rate very variable
  • Example:
    TGCATT---TAGGC
    TGCATTCCGTAGGC
• Microsatellite (STR) repeat number
  • 1 every few kb, mutation rate ≤ 10^-3
  • Example:
    TGCTCATCATCATCAGC
    TGCTCATCA------GC
• Minisatellites
  • 1 every few kb, mutation rate ≤ 10^-1
• Repeated genes
  • rRNA, histones
• Large inversions, deletions
  • Rare, e.g. Y chromosome
Size labels from the figure: ≤100bp, 1-5kb
*per generation
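The three alignment pairs above can be told apart mechanically. As a minimal sketch (the function names `compare_aligned`, `repeat_unit`, and `classify` are illustrative and not from the talk, and the tandem-repeat check with its 6 bp unit limit is an assumed toy heuristic), the following Python walks two gapped sequences of equal length and reports substitutions and gap runs:

```python
def repeat_unit(segment):
    """Return (unit, copies) for the shortest unit whose repetition spells segment."""
    for k in range(1, len(segment) + 1):
        if len(segment) % k == 0 and segment[:k] * (len(segment) // k) == segment:
            return segment[:k], len(segment) // k


def compare_aligned(a, b):
    """List differences between two aligned sequences; '-' marks a gap.

    Each event is (kind, zero_based_start, detail), where detail is
    'X>Y' for a substitution or the inserted/deleted bases for an indel.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    events = []
    i = 0
    while i < len(a):
        if a[i] == b[i]:
            i += 1
        elif "-" in (a[i], b[i]):
            gapped = b if b[i] == "-" else a      # sequence carrying the gap
            other = a if gapped is b else b       # sequence carrying the bases
            j = i
            while j < len(a) and gapped[j] == "-":
                j += 1
            events.append(("indel", i, other[i:j]))
            i = j
        else:
            events.append(("substitution", i, a[i] + ">" + b[i]))
            i += 1
    return events


def classify(event, max_unit=6):
    """Toy labelling: an indel made of >= 2 copies of a short unit is treated
    as a repeat-number change (assumed heuristic, not a validated caller)."""
    kind, _, detail = event
    if kind == "substitution":
        return "substitution"
    unit, copies = repeat_unit(detail)
    if copies >= 2 and len(unit) <= max_unit:
        return f"repeat-number change ({copies} x {unit})"
    return "short indel"


if __name__ == "__main__":
    pairs = [
        ("TGCATTGCGTAGGC", "TGCATTCCGTAGGC"),
        ("TGCATT---TAGGC", "TGCATTCCGTAGGC"),
        ("TGCTCATCATCATCAGC", "TGCTCATCA------GC"),
    ]
    for a, b in pairs:
        for event in compare_aligned(a, b):
            print(event, classify(event))
```

Real variant callers work from read alignments and quality scores rather than a single pairwise alignment, so this only illustrates the categories on the slide.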
Variants Filtering
• Using databases, data and domain knowledge
  • COSMIC
  • GWAS Catalog (NHGRI-EBI)
  • OMIM, Human genotype-phenotype relationship databases
  • HGMD
  • TCGA – cBioPortal
  • ICGC
  • IPA
  • …
Knowledge bases have become the bottleneck of omics data analysis
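Filtering against knowledge bases can be pictured as a sequence of lookups. The sketch below is illustrative only: the two in-memory tables stand in for resources such as a population-frequency database and a disease-association catalogue, the variant records are made up, and the 1% frequency cutoff is an assumed parameter rather than a value from the talk.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Hypothetical stand-ins for external knowledge bases (toy records, not real data).
POPULATION_FREQUENCY = {
    Variant("1", 1000, "A", "G"): 0.35,
    Variant("2", 2000, "C", "T"): 0.002,
}
KNOWN_ASSOCIATIONS = {
    Variant("2", 2000, "C", "T"): "example disease",
}


def filter_variants(variants, max_frequency=0.01):
    """Drop variants that are common in the population table and attach any
    known association to the ones that remain."""
    kept = []
    for v in variants:
        freq = POPULATION_FREQUENCY.get(v)  # None means the variant is unseen
        if freq is not None and freq > max_frequency:
            continue  # common polymorphism: unlikely to be the variant of interest
        kept.append((v, freq, KNOWN_ASSOCIATIONS.get(v)))
    return kept


if __name__ == "__main__":
    calls = [
        Variant("1", 1000, "A", "G"),
        Variant("2", 2000, "C", "T"),
        Variant("3", 3000, "G", "A"),
    ]
    for row in filter_variants(calls):
        print(row)
```

The result depends entirely on what the lookup tables contain, which is one way to read the claim that the knowledge base is the bottleneck.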
Figure 1. Framework for variation discovery and genotyping from next-generation DNA sequencing. Source: DePristo et al., Nat Genet. Author manuscript; available in PMC 2011 November 01.
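Knowledge base
Construction of a precision medicine knowledge base for disease research
Guideline direction 3: A platform for the integration, storage, use, and sharing of precision medicine big data resources. Specific guideline 3.2.1: Construction of a precision medicine knowledge base for disease research.
• Fudan University
• Institute of Medical Information, Chinese Academy of Medical Sciences
• Academy of Military Medical Sciences, China
• Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences
• Beijing Protein Center
• Beijing Institute of Genomics, Chinese Academy of Sciences
• Shanghai Center for Bioinformation Technology
• Zhejiang University
• Harbin Institute of Technology
• Dalian University of Technology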
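The importance of knowledge bases in precision medicine research
Inputs: large samples, big data
• Cohort studies: genotype, phenotype, environment, lifestyle, …
• Omics analysis: genomics, proteomics, metabolomics, …
Knowledge base
• Knowledge, bioinformatics standards, text resources, and massive heterogeneous data, organized as a knowledge network
Goals or applications of precision medicine
• Research: information retrieval, information analysis, knowledge re-creation, knowledge sharing
• Clinical: disease diagnosis, precision medicine, health management, resource allocation, case analysis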
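Biomedical knowledge bases have become a research hotspot
• Knowledge base research at the European Bioinformatics Institute (EBI)
• Knowledge base research at the US National Center for Biotechnology Information (NCBI)
• US National Library of Medicine (NLM) long-term strategic vision (2015): more standardized; broader information sources; more reliable knowledge; promoting the creation and dissemination of knowledge; more open
Large international bioinformatics centers such as NCBI and EBI continue to build biomedical knowledge bases on top of massive resources.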
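Foreign companies have formed a monopoly in biomedical knowledge bases
• They extract information and knowledge from documents using natural language processing
• They employ professionals to review the results, which ensures the reliability of the knowledge
• They focus on modeling, analyzing, and understanding omics data
Pricing
• IPA: ~140,000 yuan per year for 5 users
• GeneGo: ~250,000 yuan per year per user
• More than 200 institutions in China use these products, with annual fees of around 40 million yuan
The number of papers published using the IPA knowledge base increases year by year.
Foreign companies have invested heavily in commercial knowledge base software: the data are comprehensive and accurate, the core data are kept confidential, and prices are high. The resulting monopoly hinders the development of precision medicine in China.
[Figure labels: IPA software; GeneGo software; IPA and GeneGo software]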
Knowledge Management (KM): A Cross-Cutting Competency
Set of processes, methodologies and tools aimed at maximizing organizational efficiency through the curation, storage, dissemination and re-use of enterprise information and experiences
• Capture, represent, model, organize and synthesize the different types of knowledge to realize comprehensive, validated and accessible resources
• Access, share and disseminate current and case-specific knowledge to stakeholders in a usable format
• Operationalize and utilize knowledge, within existent organizational workflows, to provide pragmatic services at the point-of-need (e.g., point-of-care decision support)
Abidi SSR. Healthcare Knowledge Management: The Art of the Possible. In: Knowledge Management for Health Care Procedures: Springer Berlin/Heidelberg; 2008, 1-20.
Smaltz DH and RC Pinto. Organizational Knowledge – Can You Really Manage It? In: Proc HIMSS Annual Conference and Exhibition, 2004.
Slide Source: Tara Payne, “Knowledge Management for Research”
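Knowledge bases in cyberspace
250 Bio. SPO triples (RDF) and growing
Knowledge bases named in the figure: Cyc, TextRunner/ReVerb, WikiTaxonomy/WikiNet, SUMO, ConceptNet 5, BabelNet, ReadTheWeb, Probase, Probase+, DBpedia, Yago, Yago2, KnowItAll, Freebase, ……
Knowledge graph technical architecture
[Figure: four layers labelled data, integration, service, and application. The recoverable labels are grouped below.]
Data acquisition
• Web resources, data source selection
• 1. Anti-blocking distributed crawler
• 2. Social crawling
• 3. Entity crawling
Knowledge graph construction
• Entity recognition, entity/concept extraction, list extraction, attribute extraction
• Open relation extraction, IsA relation extraction, taxonomy fusion
Large graph data management
• Graph partitioning, graph repartitioning, partitioning with redundancy
• Distributed cache, Linked Data query processing
Data integration and knowledge interlinking
• Entity linking, category mapping, entity mapping, relation mapping, cross-lingual entity mapping, ……
Knowledge-graph-based semantic analysis
• Semantic disambiguation, word embedding, entity reference, semantic expansion, entity conceptualization, ……
Knowledge bases
• Chinese knowledge graph
• Integrated open data platform for knowledge graphs
• Internet domain knowledge bases
• Domain knowledge bases: military, book publishing, medical, ……
Applications
• QA, Weibo application analysis, deep reading, service matching, business card recognition, ……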
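Technical roadmap: precision medicine knowledge base
[Figure: layered architecture. The recoverable labels are grouped below.]
• Users: precision medicine users (scientific research and clinical application)
• Applications: workflow invocation; the "Precision Medicine Big Data Platform"; knowledge display, knowledge retrieval, portal website
• Interface: precision medicine knowledge base application interface
• Knowledge base: automated annotation; manual curation
• Knowledge: bioinformatics-based biomedical knowledge graph; text-mining-based biomedical knowledge network
• Technology: text entity recognition and semantic extraction; biomedical data mining; deep indexing, correlation mining, importance tagging, novelty analysis
• Resources: text resources; ontologies and the semantic web; external data (automatically updated); precision medicine special-program data; biomedical public data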
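Research content and project structure
Scope of the precision medicine knowledge base: drugs, diseases, pathways, genes, variants
Precision medicine knowledge base platform: workflow system, knowledge base website
1. Build a precision medicine ontology and semantic network (Project 1)
2. Build a precision medicine text knowledge network (Project 2)
3. Build a bioinformatics-based precision medicine knowledge graph (Project 3)
4. Carry out automated annotation and manual curation of precision medicine knowledge (Project 4)
5. Develop a management and sharing system platform for the precision medicine knowledge base (Project 5)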
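Ontologies at different levels of biology
[Figure: ontologies arranged along a length scale running from 10^0 down to 10^-9. The recoverable labels are listed below.]
Levels: technique, measurement, organism, organ, tissue, cell, organelle, virus, DNA, base
Ontologies and coding systems: SNOMED (disease); NCBI Taxonomy; SNOMED (organ); Mammalian Phenotype; SNOMED (morphology); ATCC (cell lines); Cell Ontology (cell types); Gene Ontology (subcellular); Gene Nomenclature; Quaternary code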
UMLS Semantic Network
• UMLS = Unified Medical Language System (NLM)
• Composed of:
o Metathesaurus
o Semantic Network
o Lexicon
• Contains approximately 5 million codes representing 1 million concepts derived from 100 source terminologies
PM Ontology and Semantic Networks
Entity types: Drug, Genome, Proteome, Symptom & Disease, Environment
Relations shown in the figure:
• treats, disrupts, ……
• Gene has Mutation, Mutation has Size, ……
• is a, affects, complicates, co-occurs with, reformulated to, tradename of, has precise ingredient, ingredient of, ……
• is associated with, interacts with, has part, ……
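Target: major diseases such as malignant tumors, metabolic diseases, respiratory diseases, and cardiovascular and cerebrovascular diseases
• Drug
• Disease/symptom
• Pathway
• Protein
• Gene/gene variant
• Environment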
Concept types related to genetic mutations
Concept | Definition
Gene | A segment of DNA that codes for a protein.
Mutation | A mutation is an alteration (deletion, insertion, substitution) of nucleotides (DNA, RNA) or amino acids (Protein).
Body part | An organ or anatomical location in a person.
Disease | An abnormal condition affecting the body of an organism.
Patient | An individual with a disease.
Cohort | A group of people; specifically any group or population of people that may be assigned a disease or characteristic.
Size | A number indicating the number of people in a cohort, or the number/frequency of a mutation.
Age | A number or range indicating how old a person/group of people is.
Gender | Terms indicating whether someone is male or female.
Geographical location | Terms indicating where a person/group of people comes from, either based on ethnic origin or where they live.
…… | ……
• Verspoor, K., A. Jimeno Yepes, L. Cavedon, T. McIntosh, A. Herten-Crabb, Z. Thomas and J. P. Plazzer (2013). "Annotating the biomedical literature for the human variome." Database (Oxford) 2013: bat019.
Relationship | Definition
Gene has Mutation | A mutation occurs in or near a gene, usually at a given position.
Patient/Cohort has Mutation | A patient or cohort has a specific genetic variation.
Mutation related to Disease | A mutation is associated with (or causes) a disease.
Mutation has Size | Indicates the number or frequency of mutations.
Disease related to Gene | A disease is associated with a gene; that is, a gene (when mutated) is linked to, or causes, a disease.
Disease related to Body Part | A disease may occur in a body part, or have a body part in its name.
Patient has Age | A patient has a given age.
Cohort has Age | A summary age for a cohort. Often listed as a mean or an age limit.
Patient/Cohort has Gender | A patient or cohort is male or female.
Patient/Cohort has Geographic Location | A patient or cohort has a given ethnicity or lives in a given place.
Patient/Cohort has Disease | A patient or cohort has a disease.
Cohort has Size | The size of a cohort group.
…… | ……
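The concept types and relationships above amount to a small schema for subject-predicate-object statements. As a rough sketch (the class and function names and the example entities are invented for illustration and are not taken from the cited annotation work), a few of the listed relations can be encoded with a type check on each statement:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str  # concept type, e.g. "Gene" or "Mutation"
    name: str  # surface form or identifier


# (subject type, predicate, object type) for a subset of the relations listed above.
ALLOWED = {
    ("Gene", "has", "Mutation"),
    ("Patient", "has", "Mutation"),
    ("Cohort", "has", "Mutation"),
    ("Mutation", "related to", "Disease"),
    ("Disease", "related to", "Gene"),
    ("Disease", "related to", "Body part"),
    ("Cohort", "has", "Size"),
}


def make_triple(subject, predicate, obj):
    """Build a (subject, predicate, object) triple, rejecting combinations
    that the schema does not list."""
    if (subject.kind, predicate, obj.kind) not in ALLOWED:
        raise ValueError(f"{subject.kind} {predicate} {obj.kind} is not in the schema")
    return (subject, predicate, obj)


def objects_of(triples, subject, predicate):
    """Return every object linked to subject by predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


if __name__ == "__main__":
    # Made-up example entities, for illustration only.
    gene = Entity("Gene", "GENE1")
    mutation = Entity("Mutation", "c.100A>G")
    disease = Entity("Disease", "example disease")
    triples = [
        make_triple(gene, "has", mutation),
        make_triple(mutation, "related to", disease),
    ]
    print(objects_of(triples, gene, "has"))
```

Keeping the allowed combinations in one table means a text-mining step that proposes, say, "Gene has Disease" fails loudly instead of silently entering the knowledge base.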
Ingenuity Pathway Analysis (IPA)
IPA is built on the Ingenuity Knowledge Base, which is highly structured and maintained by QIAGEN.
http://www.ingenuity.com/products/ipa
IPA Fall Release (2016): 2016.09.30
The Ingenuity Knowledge Base
• Pathway Analysis
• Disease and Function Analysis
• Molecular Regulatory Networks
• Causality Analysis
• From literature to biomedical analyses
Complex semantic relationships embedded in text: http://www.opennicta.com.au/home/health/variome
SNP Annotation
SNV in sample
• Reported SNP? Checked against general SNP databases: dbSNP, HapMap, the 1000 Genomes Project. A variant that is not reported is a de novo SNP.
• Known association?
  • Disease-related SNP: GWAS Catalog, ClinVar, GWASdb
  • Drug-related SNP: PharmGKB, VnD, Drug-SNPing
  • Phenotype-related SNP: clinical database; HPO (Human Phenotype Ontology); DO (Disease Ontology)
  • Gene-related SNP: disease-related genes from OMIM, GAD, HuGE
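The annotation flow in this figure can be read as a chain of lookups: first ask whether the SNV is already reported, then ask which kinds of association are known. A minimal sketch under stated assumptions: the tables below are toy stand-ins with made-up identifiers, not real records from the resources named above, and `annotate_snv` is a hypothetical helper.

```python
# Toy stand-in for the general SNP databases (made-up identifiers).
REPORTED = {"snv_a", "snv_b"}

# Toy stand-ins for the association resources, keyed by category.
ASSOCIATIONS = {
    "disease": {"snv_a": ["example disease"]},
    "drug": {"snv_a": ["example drug"]},
    "phenotype": {},
    "gene": {"snv_b": ["GENE1"]},
}


def annotate_snv(snv_id):
    """Follow the two questions from the figure: is the SNV reported,
    and if so, which associations are known?"""
    if snv_id not in REPORTED:
        # Not found in any general database: treated as de novo in the figure's sense.
        return {"status": "de novo", "associations": {}}
    found = {
        category: table[snv_id]
        for category, table in ASSOCIATIONS.items()
        if snv_id in table
    }
    return {"status": "reported", "associations": found}


if __name__ == "__main__":
    for snv in ("snv_a", "snv_b", "snv_c"):
        print(snv, annotate_snv(snv))
```

In practice each table would be a query against a separately maintained resource with its own identifiers and release schedule, so identifier mapping usually dominates the work.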
The PharmGKB Knowledge Pyramid
PharmGKB provides the following information:
• VA: Variant Annotations
• PW: Drug-Centered Pathway
• VIP: Very Important Pharmacogene Summaries
• CA: Clinical Annotations
• DG: Pharmacogenomics-Based Drug-Dosing Guidelines
• DL: Drug Labels with Pharmacogenomic Information
Publication record:
Catalogue of somatic mutations in cancer (COSMIC)
http://cancer.sanger.ac.uk/cosmic
The world's largest database of somatic mutations in cancer, developed and maintained by the Wellcome Trust Sanger Institute. Its data fall into two broad types:
Expert curation data
1. Manually input from peer reviewed publications by COSMIC expert curators
2. Consists of comprehensive literature curation of selected Census genes at release, followed by subsequent updates (Cancer Gene Census)
3. Includes additional data points relevant to each disease and publication
4. Provides accurate frequency data as mutation negative samples are specified
5. Also called non-systematic or targeted screen data
Genome-wide screen data
1. Uploaded from publications reporting large scale genome screening data or imported from other databases such as TCGA and ICGC
2. Provides unbiased molecular profiling of diseases while covering the whole genome
3. Provides objective frequency data by interpreting non mutant genes across each genome
4. Facilitates finding novel driver genes in cancer
Genomic Landscape of Cancer: select any region to zoom in on the details of cancer somatic mutations.
Data volume statistics in COSMIC (release v78, 5th September 2016):
Relationship summary
[Figure: relationships among SNV, Gene, and Phenotype. Labels: location, change type, location, function annotation, eQTL, protein damage, GWAS, gene annotation, disease, drug.]
Empowering Knowledge Workers
Driving Biological and Clinical Problems
Knowledge Management
Solutions to Real World Problems
Critical Issues:
• Workflows that enable engagement by Subject Matter Experts
• Tight coupling of engineering efforts and research programs that can define “real world” driving problems
• Incorporation of human and cognitive factors in all aspects of projects
Biomedical Informatics ≠ Engineering
KM-based Approaches To Interoperability and Usability Are Essential
A “Working” Central Dogma for BMI: From Data to Knowledge
Data + Context → Information; Information + Application → Knowledge
Big Data → Precision Medicine
• Clinical Data Warehouse + BioBank
• Bioinformatics + Data Analytics
• Knowledgebase + Clinical Decision Support
Application prospects of large-scale knowledge bases
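The UK startup Babylon Health will pit a program called "Check" against doctors and nurses to test who can handle a series of common health questions with the greatest speed and accuracy. The world's first artificial-intelligence doctor will face off against humans, and the contest could become a turning point for medicine.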