Scientific Data Infrastructure in CAS

Dr. Jianhui Li([email protected])

Scientific Data Center

Computer Network Information Center

Chinese Academy of Sciences

Scientific Data Infrastructure

• Middleware (scientific data grid middleware, internet-based storage service middleware…)
• Scientific databases
• Massive storage system; data-intensive computing facilities
• High speed network
• Application enabled environments and typical applications
• Software and toolkits (scientific data collection, curation, and publishing, data analysis and visualization…)

DRC: Data Resource Center

• A new organization responsible for data preservation, curation and access service in CAS

[Diagram: Data Resource Center]
– Functions shown: mass data backup; data online service; mass data analysis and processing; long-term preservation of important data
– Other labels: technology service; network; storage space; system environment; application service; mass data; management system; collaborator; staff

Infrastructure for DRC

• High Speed Network
– 2 Gbps linked with CSTNET
– 2 Gbps linked with CSTNET-CNGI
– GLORIAD
• Data Intensive Computing facilities
– ~1000 CPU core clusters + Scientific Computing Grid (~200 Tflops)
• Massive Storage System
– 1 PB online disk + 5 PB tape
– Construction of a storage network will start this year
  • 1 center + 1 archive center + 10 storage nodes around China
  • Over 20 PB
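To relate the link speeds to the storage sizes on this slide, a rough back-of-the-envelope calculation can help. The sketch below is illustrative only: the function name `transfer_days` and the `efficiency` parameter are assumptions, and it uses decimal units (1 PB = 10^15 bytes, 1 Gbps = 10^9 bit/s) with no protocol overhead by default.

```python
# Rough estimate of how long a bulk transfer takes over a network link.
# transfer_days and efficiency are hypothetical names chosen for this sketch.

PB = 10**15    # bytes in one petabyte (decimal)
GBPS = 10**9   # bits per second in one gigabit per second


def transfer_days(size_bytes: float, link_bits_per_s: float,
                  efficiency: float = 1.0) -> float:
    """Return the transfer time in days for size_bytes over the given link.

    efficiency is the assumed fraction of the nominal link rate that is
    actually usable (1.0 means the full nominal rate).
    """
    seconds = size_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 86_400  # seconds per day


# Moving the 1 PB online disk pool over one 2 Gbps link at full rate.
days_full_rate = transfer_days(1 * PB, 2 * GBPS)
print(f"1 PB over 2 Gbps: about {days_full_rate:.1f} days")
```

At these rates a full copy of the online disk pool takes on the order of weeks, which is one way to see why distributed storage nodes can matter at this scale.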

Scientific Databases (SDB)

• A long-term mission started in 1986 and funded by CAS
– Many institutes involved
– Long-term, large-scale collaboration
– Data from research, for research
• Collecting multi-discipline research data and promoting data sharing
– More than 350 research databases and 400 datasets from 61 institutes
– Over 60 TB of data available for open access and download

http://www.csdb.cn

Scientific Databases (cont.)

• SDB Contents
– Physics & Chemistry, Geosciences, Biosciences, Atmospheric & Ocean Science, Energy Science, Material Science, Astronomy & Space Science

[Pie chart: SDB contents by discipline]
– GeoScience 43%
– BioScience 18%
– Chemistry 9%
– Physics 6%
– ICT 6%
– Ocean 5%
– Material 5%
– Space 4%
– Energy 3%
– Astronomy 1%

Scientific Databases (cont.)

• Database integration
– Resource database
– Reference database
– Application-oriented database

[Diagram labels: research databases; resource database; reference database; application-oriented database]

Scientific Databases (cont.)

• 8 Resource databases
– Geo-Science
– Biodiversity
– Chemistry
– Astronomy
– Space Science
– Microbiology and virus
– Material science
– Environment
• 2 Reference databases
– China Species
– Compound
• 4 Application-oriented databases
– High Energy (ITER)
– Western Environment Research
– Ecology research
– Qinghai Lake Research

CAS Scientific Data Grid

• Based on Scientific Data Grid Middleware (SDG)
– SDG is built upon the Scientific Databases and supports finding and accessing large-scale, distributed and heterogeneous scientific data uniformly and conveniently, in a secure and proper way
• Building scientific data application grids according to domain requirements
– Integrating distributed data, analysis tools, and storage and computing facilities, providing a uniform data service interface
– 4 pilot grids
  • Bioscience grid
  • Geoscience grid
  • Chemistry grid
  • Astronomy and space science grid
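The idea of a uniform data service interface over heterogeneous sources can be illustrated with a minimal sketch. Nothing here is the actual SDG API: `DataSource`, `SqliteSource`, `ListSource` and `federated_search` are hypothetical names, and the sample titles are made up for the demonstration.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Iterator


class DataSource(ABC):
    """Hypothetical uniform interface; not the real SDG middleware API."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def search(self, keyword: str) -> Iterator[dict]:
        """Yield records whose title contains keyword (case-insensitive)."""


class SqliteSource(DataSource):
    """A relational source; the table name is assumed to be trusted."""

    def __init__(self, name: str, conn: sqlite3.Connection, table: str) -> None:
        super().__init__(name)
        self.conn, self.table = conn, table

    def search(self, keyword: str) -> Iterator[dict]:
        # SQLite's LIKE is case-insensitive for ASCII text by default.
        sql = f"SELECT title FROM {self.table} WHERE title LIKE ?"
        for (title,) in self.conn.execute(sql, (f"%{keyword}%",)):
            yield {"source": self.name, "title": title}


class ListSource(DataSource):
    """A non-relational source, e.g. a file listing held in memory."""

    def __init__(self, name: str, titles: list) -> None:
        super().__init__(name)
        self.titles = titles

    def search(self, keyword: str) -> Iterator[dict]:
        for title in self.titles:
            if keyword.lower() in title.lower():
                yield {"source": self.name, "title": title}


def federated_search(sources: list, keyword: str) -> list:
    """Query every source through the same interface and merge the results."""
    results = []
    for source in sources:
        results.extend(source.search(keyword))
    return results


# Demonstration with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (title TEXT)")
conn.executemany("INSERT INTO datasets VALUES (?)",
                 [("Qinghai Lake water temperature",), ("Soil moisture survey",)])
sources = [SqliteSource("geo-db", conn, "datasets"),
           ListSource("file-archive", ["Lake sediment cores", "Star catalogue"])]
results = federated_search(sources, "lake")
print(results)
```

The caller only sees `search`; how each source stores its data stays hidden behind the interface, which is the property the slide attributes to the middleware.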

Function Framework of SDG

• A scalable and integrated data sharing environment
– Providing services for grid users, grid managers and resource providers
– Operated by the operation center, science gateways and data nodes

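[Diagram: roles and tiers in SDG; labels translated from Chinese, listed in slide order]

Roles: end user (User); grid manager (Grid Manager); data resource provider (Resource Provider)
Tiers: grid operation service center (Operation Center); grid main node (Science Gateway); grid node (Data Node)
Row headings: services received; responsibilities undertaken

– Data navigation, data query and retrieval, user registration, single sign-on
– Domain application entry, monitoring and statistics information
– Data query and retrieval, domain applications, single sign-on
– Monitoring and statistics information
– Policy, standards and specification management; grid organization management
– Data management, storage management, service management, user management, operations management
– Monitoring and statistical analysis; operation center portal
– Domain standards and specification management, data management, user management, service management, operations management
– Monitoring and statistical analysis; thematic database portal
– Data quality assurance, data service maintenance
– Data query and retrieval, domain applications, single sign-on
– Application consulting service
– Hardware resource management, data service management
– Data growth and maintenance, data quality management
– Data-based grid applications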

Access Scientific Data Grid

[Diagram labels: software tool; science gateway and access portal; grid middleware; research databases; resource databases; reference databases; app-oriented databases; external data source]

VisualDB - Powering your database

• A toolkit to manage, publish and share scientific databases through a visual configuration interface, without writing code
• A database integration access broker
• A data quality assessment tool
• A database access and usage statistics tool
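As one illustration of what a data quality assessment might measure, the sketch below computes per-column completeness (the fraction of non-NULL values) for a table. This is an assumed metric chosen for the example, not a description of how VisualDB works; `column_completeness` and the `samples` table are hypothetical.

```python
import sqlite3


def column_completeness(conn: sqlite3.Connection, table: str) -> dict:
    """Return {column name: fraction of rows with a non-NULL value}.

    The table name is interpolated into SQL, so it is assumed to be trusted.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    report = {}
    for column in columns:
        # COUNT(column) counts only the rows where the column is not NULL.
        filled = conn.execute(f"SELECT COUNT({column}) FROM {table}").fetchone()[0]
        report[column] = filled / total if total else 0.0
    return report


# Demonstration with made-up records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER, species TEXT, latitude REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(1, "Gymnocypris przewalskii", 36.9),
     (2, "Anser indicus", None),
     (3, None, 37.1),
     (4, "Larus brunnicephalus", None)],
)
report = column_completeness(conn, "samples")
print(report)
```

A real assessment would likely add further checks, such as value ranges or consistency between fields; completeness is simply the easiest to show in a few lines.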

Function Framework of VisualDB

Catalog Builder

Security Center

Data Forge

vReport

Application enabled environments and typical applications

• Domain-specific, data-intensive application environment
– Supports one specific research area
– Integrates scientific data, storage, computing, and analysis models and tools
– An easy-to-use, friendly interactive interface
– Scalable, user-defined data processing workflows
• Typical pilot systems
– Remote sensing data on-demand accessing and processing service environment
– CFCI - China FLUX Cyber-Infrastructure
– DarwinTree: molecular data analysis and application environment
– Atmospheric science data integration analysis platform

Atmospheric science data integration analysis platform

• Status quo

[Diagram: current workflow, iterated by atmospheric scientists and researchers]
– Scientific data storage
– Data accessing: Web Service, SRB, FTP, HTTP
– Data preprocessing: NCL, Matlab, CDO
– Data computing: NCL, Matlab, CDO
– Data analysis: NCL, Matlab, CDO
– Data visualizing: NCL, Matlab, CDO
– Result output: result data

Atmospheric science data integration analysis platform

• Problems
– Atmospheric data has reached the TB level in size and is distributed.
– Research work is limited by the hard disk and memory of personal computers.
– Many algorithms written by researchers cannot be shared easily.

Scientific Data Analysis Online Platform

[Diagram: the researcher works iteratively through a web browser, which supports 1) customizing and 2) visualizing]
– Inputs: distributed data; algorithm models
– Data finding and algorithm choice
– Define workflow, combining data and model
– Computing for the workflow
– Result, used by the researcher

Architecture

[Diagram: iterative workflow]

Five steps
– Select data
– Choose algorithm
– Configure parameters
– Plot
– Analyse result
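The five steps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the platform's implementation: the registry `ALGORITHMS`, the `Workflow` class, the two sample algorithms and the text-based `text_plot` are all hypothetical, and the temperature values are made up.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable

# Hypothetical registry through which algorithms could be shared between users.
ALGORITHMS: dict[str, Callable[..., list[float]]] = {}


def register(name: str):
    """Decorator that adds a function to the shared algorithm registry."""
    def wrap(fn):
        ALGORITHMS[name] = fn
        return fn
    return wrap


@register("moving_average")
def moving_average(values: list[float], window: int = 3) -> list[float]:
    """Mean of each run of `window` consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


@register("anomaly")
def anomaly(values: list[float]) -> list[float]:
    """Difference of each value from the series mean."""
    base = mean(values)
    return [v - base for v in values]


@dataclass
class Workflow:
    dataset: list[float]                         # step 1: select data
    algorithm: str                               # step 2: choose algorithm
    params: dict = field(default_factory=dict)   # step 3: configure parameters

    def run(self) -> list[float]:
        return ALGORITHMS[self.algorithm](self.dataset, **self.params)


def text_plot(values: list[float], width: int = 20) -> str:
    """Step 4: a crude text bar chart, one line of '#' per value."""
    top = max((abs(v) for v in values), default=1.0) or 1.0
    return "\n".join("#" * round(abs(v) / top * width) for v in values)


temperatures = [10.0, 12.0, 14.0, 13.0, 11.0]   # made-up sample series
workflow = Workflow(temperatures, "moving_average", {"window": 3})
result = workflow.run()
print(text_plot(result))
peak = max(result)                               # step 5: analyse result
print("peak of smoothed series:", peak)
```

Because a workflow is just data (dataset, algorithm name, parameters), iterating means changing one field and calling `run` again, and registered algorithms are available to every workflow.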

Thank you!