EARTHCUBE CONCEPTUAL DESIGN
A Scalable Community Driven Architecture http://earthcube.org/group/scalable-community-driven-architecture
Overview
PI: G. Djorgovski (Caltech)
Co-I: D. Pilone, T. Pilone (Element 84), D. Crichton, E. Law (JPL)
Other key personnel: S. Caltagirone (E84), S. Hughes (JPL), T. Huang (JPL), A. Mahabal (Caltech)
1/7/16 1 2016 ESIP Winter Meeting
A high-level system blueprint for the definition, construction, and deployment of both existing and new components, ensuring that they can be unified and integrated into an evolutionary national infrastructure for EarthCube.
Methodology
• Identification of stakeholders, concerns and requirements
• Identification of architectural use cases and drivers
• Selection of an architectural framework
• Development of the architectural principles
• Development of the architectural models
• Capture of the architecture artifacts in a consolidated report
• Generation of recommendations for adopting the architecture for the EarthCube program
Stakeholders
Stakeholder/Actor: Concerns
• NSF Program Managers: Make decisions and provide guidance at the EarthCube program level. Provide sufficient funding to support the EarthCube mission.
• EarthCube Scientists: Use EarthCube resources and services to conduct scientific research. Publish scientific results & curate data as needed.
• EarthCube Developers: Develop technologies and services that can be integrated into EarthCube.
• EarthCube Architects: Establish EarthCube requirements, framework and operational concept. Develop information model (vocabulary, ontology). Establish standards guidelines. Ensure interoperability between EarthCube Building Blocks.
• External Data Users: Use EarthCube resources and services for research, education, and decision-making.
• Curator: Ensure data is properly captured in EarthCube-compliant data repositories.
• Data Owner: Responsible for producing the data. Concerned about its distribution and use.
• External Data Facility: Responsible for archiving data at other agencies (NASA, NOAA, USGS, etc.); interoperability with the EarthCube Cyberinfrastructure.
• EarthCube Governance Committees: Responsible for generating and monitoring the governance for the system, including data curation, access, use case priority, interoperability standards, etc.
• EarthCube Office Staff: Responsible for maintaining the community involvement within EarthCube and communicating changes and how to use the system.
Use Cases
• Big Science – Discovery, Comparison, Provenance, Model & visualization
• Collaborative Science
• Dark Data Contribution
• Tools Contribution
• Data Documentation
• Models Sharing
• High Performance Computing and Storage Resources
• Real Time Data
• Physical Sample Curation
Drivers
• Transform and accelerate research and discovery by turning data into knowledge and enabling interdisciplinary data integration.
• Provide critically needed data, tools, and computational resources and frameworks for cross-domain scientific collaboration and analysis, with long-term geoscience software and data preservation, discovery and use.
• Provide a geosciences cyberinfrastructure and architecture that is scalable, extensible and sustainable.
Frameworks
• Zachman Framework – For organizing stakeholder concerns and perspectives.
• ISO/IEC/IEEE 42010:2011 – For architectural description guidelines.
• Reference Model for Open Distributed Processing (RM-ODP) – For architectural patterns for distributed systems.
• Open Group Architecture Framework (TOGAF) – For managing the architecture.
• Federal Enterprise Architecture Framework (FEAF) – For classifying the architecture into architectural elements and viewpoints.
• ISO 14721:2003, Open Archival Information System (OAIS) Reference Model – Provides a standard for information objects.
• ISO/IEC 11179-3, Registry Metamodel and Basic Attributes specification – Provides a schema for a metadata registry.
• Scalability
• Community Driven
• Open Science
• Interoperability
• Sustainability
• Distributed
• Data Model Driven
[Figure: EarthCube CI context diagram, built up over four slides.
Build 1: Data Provider (Satellite Instrument Data Systems, Airborne Data, Agency Earth Data Archives, each with Science Data Manage); EarthCube CI; EarthCube Discovery.
Build 2 adds: Other Data Systems (e.g. NOAA; In-Situ, University); EarthCube Repository.
Build 3 adds: Data Science Infrastructure (Data, Algorithms, Machines); Science Teams.
Build 4 adds: Research, Applications, Decision Support; Data Analysis; EarthCube Data Analytics Centers.]
Benchmark
• Earth System Grid Federation (ESGF)
• Early Detection Research Network (EDRN)
• NASA’s Earth Observing System Data and Information System (EOSDIS)
ExArch Meeting, October 2012
Node Architecture
• Internally, each ESGF Node is composed of services and applications that collectively enable data and metadata access, and user management.
• ESGF software stack combines custom software components developed by ESGF with other freely available applications from eCommerce (Apache Tomcat, Solr, Postgres, ...) and geo-informatics (Thredds Data Server, LAS, ...)
• Software components are grouped into 4 areas of functionality (aka “flavors”):
  • Data Node: secure data publication and access
  • Index Node:
    ‣ metadata indexing and searching
    ‣ web portal UI to drive human interaction
    ‣ dashboard suite of admin applications
    ‣ model metadata viewer plugin
  • Identity Provider: user authentication and group membership
  • Compute Node: analysis and visualization
• Node flavors can be installed in various combinations depending on site needs, or to achieve higher performance and scalability
Software Stack: Node Manager
• Enables continuous exchange of service and state information among Nodes
• Internally, it collects Node health information and metrics (cpu, disk usage, etc.)
• Installed for all Node flavors
Peer-To-Peer (P2P) protocol
• Gossip protocol: information is exchanged randomly among peers
  ‣ Each Node receives information from one Node, merges it with its own information, and propagates it to two other Nodes at random
  ‣ No central coordination, no single point of failure
• Nodes can join/leave the federation dynamically
• Each Node is bootstrapped with knowledge of one default peer
• Each Node can belong to one or more peer groups within which information is exchanged
XML Registry
• XML document that is payload of P2P protocol
• Contains service endpoints and SSL public keys for all Nodes in the federation
• Derived products (list of search shards, trusted IdPs, location of Attribute Services, ...) are used by federation-wide services
Challenge: good news travels fast, bad news travels slow...
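The gossip exchange described on the Node Manager slide can be illustrated with a small simulation. This is only a sketch under stated assumptions, not ESGF code: the `Node` class, the versioned registry entries, and the round-based loop are all hypothetical, and for simplicity every node is allowed to pick any other node as a peer (the real protocol bootstraps from one default peer and uses peer groups).

```python
import random


class Node:
    """Hypothetical federation node that keeps a registry of known peers."""

    def __init__(self, name):
        self.name = name
        # registry maps node name -> (version, info); a higher version wins.
        self.registry = {name: (1, {"status": "up"})}

    def merge(self, incoming):
        """Merge an incoming registry, keeping the newest entry per node."""
        for peer, (version, info) in incoming.items():
            known = self.registry.get(peer)
            if known is None or known[0] < version:
                self.registry[peer] = (version, info)


def gossip_round(nodes, rng, fanout=2):
    """Each node pushes its registry to `fanout` randomly chosen other nodes."""
    for sender in nodes:
        others = [n for n in nodes if n is not sender]
        for receiver in rng.sample(others, min(fanout, len(others))):
            receiver.merge(dict(sender.registry))


def run_until_converged(nodes, seed=0, max_rounds=100):
    """Run rounds until every node knows every other node.

    Returns the number of rounds used, or None if max_rounds is exhausted.
    """
    rng = random.Random(seed)
    for round_no in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if all(len(n.registry) == len(nodes) for n in nodes):
            return round_no
    return None


if __name__ == "__main__":
    federation = [Node(f"node{i}") for i in range(8)]
    print("rounds to converge:", run_until_converged(federation, seed=1))
```

No node acts as a coordinator here: state spreads only through pairwise merges, which is the property the slide highlights as "no single point of failure".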
[Figure: map of EOSDIS data centers and Science Investigator-led Processing Systems (SIPSs). Key: SIPSs, Data Center, ECS Sites.]
Data Centers
• ASF DAAC: SAR Products, Sea Ice, Polar Processes
• SEDAC: Human Interactions in Global Change
• LP DAAC: Land Processes & Features
• PO.DAAC: Ocean Circulation, Air-Sea Interactions
• ASDC: Radiation Budget, Clouds, Aerosols, Tropo Chemistry
• ORNL DAAC: Biogeochemical Dynamics, EOS Land Validation
• GES DISC: Atmos Composition & Dynamics, Global Modeling, Hydrology, Radiance
• LAADS/MODAPS: Atmosphere
• OBPG: Ocean Biology & Biogeochemistry
• GHRC: Hydrological Cycle & Severe Weather
• CDDIS: Crustal Dynamics, Solid Earth
• NSIDC DAAC: Cryosphere, Polar Processes
SIPSs
• NCAR, U of Col.: HIRDLS, MOPITT, SORCE
• GSFC: GLAS, MODIS, OMI, OBPG
• LaRC: CERES, SAGE III
• GHRC: AMSR-E, LIS, AMSR2
• JPL: MLS, TES
• San Diego: ACRIM
Architecture Elements
[Figure: EarthCube System Architecture, decomposed into Process, Data, and Technology Architecture plus domain-crosscutting elements. Recoverable labels:]
Process Architecture
• Data Lifecycle: Data Generation, Data Curation, Data Transport, Data Ingest, Data Management, Search Distribution, Data Analytics, Visualization
• Software Lifecycle: Technology Planning, Software Development, Release
• Administrative: Governance, Standards, Technology, Policies, Resource Planning
Data Architecture
• Information Model, Archive Model, Query/Access, Data Formats, Archive Organization, Grammar, Data Dictionary
Technology Architecture
• Ingest (Receive, Validate, Accept), Catalog/Data Management, Storage (Repository), Processing, Search and Discovery, Data Integration, Data Analysis, Distribution, Visualization
• Distributed Architecture, Data Access, IT Security, Collaboration, Publication
Domain Crosscutting
• Research Software Lifecycle: Software Development, Software Versioning, Software Archiving, Software Search & Distribution, Algorithm Storage & Discovery
• Data Standards Evaluation
• User Roles, Support and Feedback
• Use metrics for data, software and site use
Data Lifecycle
• Data Generation: Original generation of data (from sensors, investigators, etc.)
• Data Curation and Preparation: Prepare data for use and submission into EarthCube
• Data Transport: Maximize information throughput against available bandwidth
• Data Ingest: Supports the capture and validation of data into EarthCube
• Data Management: Provides overall data management services for the data in EarthCube
• Discovery, Access & Distribution: Enables discovery, access and distribution of the data
• Data Analytics: Enables the analysis of massive, distributed heterogeneous data
• Visualization: Provides a platform for integrating analytics with rendering and understanding the data
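The lifecycle stages on this slide can be sketched as an ordered enumeration. This is a minimal illustration: it assumes the stages form a strictly linear sequence in slide order, which the slide suggests but does not state, and the names `Stage`, `next_stage`, and `remaining_path` are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages in slide order; the values are the slide labels."""

    GENERATION = "Data Generation"
    CURATION = "Data Curation and Preparation"
    TRANSPORT = "Data Transport"
    INGEST = "Data Ingest"
    MANAGEMENT = "Data Management"
    DISTRIBUTION = "Discovery, Access & Distribution"
    ANALYTICS = "Data Analytics"
    VISUALIZATION = "Visualization"


def next_stage(stage):
    """Return the stage after `stage`, or None at the end of the lifecycle."""
    stages = list(Stage)  # Enum iteration follows definition order
    position = stages.index(stage)
    return stages[position + 1] if position + 1 < len(stages) else None


def remaining_path(stage):
    """List the stages a dataset still has to pass through after `stage`."""
    path = []
    following = next_stage(stage)
    while following is not None:
        path.append(following)
        following = next_stage(following)
    return path


if __name__ == "__main__":
    for step in remaining_path(Stage.INGEST):
        print(step.value)
```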
Information Model Context
Framework
[Figure: Data Science Framework, made up of a Data Node and an Analytic Node. Recoverable labels:]
Context: Data Providers; Applied Science; Data, Information, Knowledge
Data Node
• Sources: Images, Measurements/Observations, Remote Sensing, Text file/ASCII, Spreadsheets, Metadata, etc.
• Data Ingest: Transfer, Validation, Metadata Harvesting, Packaging
• Data Management: Abstraction (Java, Python, Ruby, Groovy, Scala, …); OpenSearch, Lucene, Solr, ElasticSearch; RDBMS (Postgres, Oracle, MySQL); NoSQL (MongoDB, Cassandra); Array (SciDB); Storage (SAN, S3, SSD)
• Data Distribution: Search, Metadata Publication, Data Push, Data Access (OPeNDAP, W10N, THREDDS)
• Data Stewardship, Curation
Analytic Node
• Data Analysis: Science Workflow; Analytics (Machine Learning, Pattern Recognition, Climatologies, Data Reduction, Uncertainty Analysis, etc.)
• Visualization: OGC (WMS, WMTS, …), TWMS, Data Slices, Plots and Coordination, Integrated Views
• Query/Retrieval: Data Viewer and Interactive Query; Lucene, OpenSearch, SPARQL, etc.; Search, Query, Subset, etc.
• Analysis Platform: Hadoop/HDFS (MapReduce, ZooKeeper, Spark); Graph DB (TitanDB, Neo4J); Triple Store (Virtuoso, AllegroGraph, Sesame, Fuseki); Message Passing Interface; Single Machine; High Performance Computing; GPU; LAS; Virtual Machine; Container
Example Instantiation
[Figure: example instantiation of the EarthCube Cyberinfrastructure. Recoverable labels:]
• Research; Applications; Decision Support; Applied Science
• Data Provider: Satellite Information Data Systems, Airborne Data, Agency Earth Data Archives, Other Data Systems (In-Situ, University)
• EarthCube Data Science Infrastructure; EarthCube Data Analytics Centers
• Node instances: EarthCube Repository, EarthCube Data Management Node, Data Analytic Node, EarthCube Discipline-Specific Data Management with Data Analytic Node
• Each EarthCube Data Management Node shows: Sources (Images, Measurements/Observations, Remote Sensing, Text file/ASCII, Spreadsheets, Metadata, etc.); Data Ingest API (Transfer, Validation, Metadata Harvesting, Packaging); Data Management (Abstraction: Java, Python, Ruby, Groovy, Scala, …; OpenSearch, Lucene, Solr, ElasticSearch; RDBMS: Postgres, Oracle, MySQL; NoSQL: MongoDB, Cassandra; Array: SciDB; Storage: SAN, S3, SSD); Data Distribution (Search, Metadata Publication, Data Push, Data Access: OPeNDAP, W10N, THREDDS); Data Stewardship, Curation
• Each Data Analytic Node shows: Data Analysis (Science Workflow; Analytics: Machine Learning, Pattern Recognition, Climatologies, Data Reduction, Uncertainty Analysis, etc.); Visualization (OGC (WMS, WMTS, …), TWMS, Data Slices, Plots and Coordination, Integrated Views); Query/Retrieval API; Data Viewer and Interactive Query API (Lucene, OpenSearch, SPARQL, etc.; Search, Query, Subset, etc.); Analysis Platform (Hadoop/HDFS: MapReduce, ZooKeeper, Spark; Graph DB: TitanDB, Neo4J; Triple Store: Virtuoso, AllegroGraph, Sesame, Fuseki; Message Passing Interface; Single Machine; High Performance Computing; GPU; LAS; Virtual Machine; Container)
Thank You
Questions?
EarthCube Conceptual Architecture Discussion
The controversial bits…
THIS IS A DISCUSSION.
Please Talk.
EarthCube Architect
EarthCube Developer
EarthCube Scientist
Curator
NSF Program Manager
External Data Users
External Data Facility
EarthCube Staff
Governance Committee
Stakeholders
• Do we have the right stakeholders?
• Do they overlap at all? Too much?
• Are they useful to provide use cases and personas that help drive the system?
• Are we missing key stakeholders?
Stakeholders
NSF Program Managers
EarthCube Scientists
EarthCube Developers
EarthCube Architects
External Data Users
Curators
Data Owner
External Data Facility
EarthCube Governance Committees
EarthCube Office Staff
Architectural Principles
Federation
Sustainability
Standards
(Data) Model-Driven
Extensibility
Scalability
Provenance
Security
Standards…
• We do not advocate a particular standard…
• Our Conceptual Architecture emphasizes fully defined and self-contained data rather than prescribing standard(s).
• EarthCube’s heterogeneous data, applications, and systems appear to justify a possible increase in complexity.
• Common models and representations should be used.
EarthCube Software Lifecycle Processes
Technology Planning
Software Development
Release
Research Software Lifecycle Processes
Technology Planning
Software Development
Software Versioning
Software Search and Distribution
Algorithm Search and Distribution
Software Lifecycle Processes
• We place an emphasis on software versioning, discovery, etc. for Research Software. Should we treat “EarthCube proper” processes the same way?
• What about discovery and distribution?
Metrics
• Use Examples:
  • Product Searches
  • Products Downloaded
  • Services Accessed
  • Publications Cited
• Quality Examples:
  • Ingestion speed
  • Search Response Time
  • User “conversions”
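As a rough illustration of how a component could record the two kinds of metric listed above, here is a minimal sketch. The `MetricsLog` class, its method names, and the event names are hypothetical and not part of any EarthCube specification: use metrics are modelled as event counts and quality metrics as averaged timings.

```python
from collections import Counter
from statistics import mean


class MetricsLog:
    """Hypothetical collector for use counts and quality timings."""

    def __init__(self):
        self.use = Counter()  # e.g. "product_search", "product_download"
        self.timings = {}     # e.g. "search_response" -> list of seconds

    def record_use(self, event):
        """Count one occurrence of a use event."""
        self.use[event] += 1

    def record_timing(self, measure, seconds):
        """Store one timing sample for a quality measure."""
        self.timings.setdefault(measure, []).append(seconds)

    def summary(self):
        """Return total use counts and the mean of each timing series."""
        return {
            "use": dict(self.use),
            "quality": {name: mean(samples) for name, samples in self.timings.items()},
        }


if __name__ == "__main__":
    log = MetricsLog()
    log.record_use("product_search")
    log.record_use("product_search")
    log.record_use("product_download")
    log.record_timing("search_response", 0.25)
    log.record_timing("search_response", 0.75)
    print(log.summary())
```

Whether a conceptual architecture should prescribe specific events like these, or only the two categories, is the question the discussion slides raise.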
Metrics & Conceptual Architectures
• Is this the right place to advise / mandate metrics? (e.g. we’re not doing this for standards)
• Should we be specific or just provide categories?
• Do we go so far as to “mandate” it for EarthCube components / building blocks / etc?
[Figure: EarthCube CI context diagram. Labels: Research, Applications, Decision Support; Data Provider (Satellite Instrument Data Systems, Airborne Data, Agency Earth Data Archives, each with Science Data Manage); Data Analysis; EarthCube CI; Other Data Systems (e.g. NOAA; In-Situ, University); EarthCube Repository; Data Science Infrastructure (Data, Algorithms, Machines); EarthCube Data Analytics Centers; Science Teams; EarthCube Discovery.]
Places we haven’t expressed an opinion
• Cloud vs. on-premises hosting
• Data location (hosted vs. distributed)
• Compute location
Should we?
Best Practices
Common Software Stack
Common Data Model
Standard Interfaces
Service-Oriented Architecture
Decoupled Storage, Compute, and Data Management
Federated Search
Analytic Services
Visualization
Misc Questions
• How do we make this real?
• What’s the next thing you need to make EarthCube more valuable to you?
• How can the Conceptual Architecture effort help you get there?
Our Next Steps
1. Solicit Reviewers for Conceptual Architecture Document (NOW!)
2. Incorporate feedback and review comments
3. Write actionable recommendations and incorporate into final Conceptual Architecture
4. Prioritize and Deploy Key Architectural Components
We need reviewers! Please contact Emily Law if you’re interested.
Thank you!