Common solution for the (very-)large data challenge.


Transcript of Common solution for the (very-)large data challenge.

Page 1: Common solution for the (very-)large data challenge.

Common solution for the (very-)large data challenge.

VLDATA

Call: EINFRA-1 (Focus on Topics 4-5)
Deadline: Sep. 2nd 2014

Page 2:

1.1 Objectives

The mission of VLDATA is to provide common solutions for handling large and extremely large scientific data in a cost-effective way. The solution builds on existing pan-European e-Infrastructures and tools to provide an interoperable, efficient and sustainable platform for scientific user communities, in particular to support a new generation of data scientists. The success of this project will secure European leadership in the development and support of big data and global data science and will therefore contribute to the leadership of European scientists and enterprises in many research and innovation fields.

Page 3:

Objectives (I)

• O1: a "flexible and extendable" platform supporting common solutions for large-scale distributed data processing and analysis, ensuring interoperability among existing e-Infrastructure providers.
  – O1.1: (WP2,3,4,5) provide a common solution using generic e-Infrastructure for processing large-scale or extremely large-scale scientific data in a robust, efficient and cost-effective way.
  – O1.2: (WP6) provide a flexible and customizable platform that can be extended to cover the specific requirements of each community.

Page 4:

Objectives (II)

• O2: standardized solutions aiming at global interoperability of open access for large-scale data processing, minimizing unnecessary large transfers.
  – O2.1: (WP2) provide a common language and standard for handling large volumes of data.
  – O2.2: (WP2,3,4,5) improve the efficiency of distributed data processing by providing a smart data and computing management platform.
  – O2.3: (WP2,3,4,5) enable effective handling of big data samples by integrating new technologies.
  – O2.4: (WP8) assess the value of this generic solution for the relevant stakeholders: scientists, their management, funding agencies, policy makers, companies and society at large.

Page 5:

Objectives (III)

• O3: increase the number of users and Research Infrastructure projects making efficient use of existing e-Infrastructure resources, designing appropriate exploitation strategies and a long-term sustainability plan.
  – O3.1: (WP5,7) deliver ready-to-use, high-quality standard products for internal and external usage, enhancing interdisciplinary data science at a global scale.
  – O3.2: (WP6,9) increase the degree of open access to large-scale distributed data.
  – O3.3: (WP9) educate a new generation of data scientists and society in general.

Page 6:

1.2 Relation to the work programme

Page 7:

1.3 Concept and approach (ideas)

Make IT simple

• Simplicity: VLDATA provides an abstraction of the different resources, which are all made accessible to the end user via the same interfaces.
• Transparency: users can specify their workflows/pipelines at different levels of abstraction. The platform takes care of the resource allocation needed to fulfil the required specifications.
• Extensibility and flexibility: VLDATA provides an API that allows users to extend the provided functionality by developing new or customized components.
• Reliability: quality standards and extensive validation in several scientific domains ensure the readiness and robustness of VLDATA-based solutions.
• Scalability: a modular implementation allows horizontal (number of connected resources or users) and vertical (number of processed units) scaling, adapting VLDATA to the needs of each particular community or Research Infrastructure project.
• Smart and intelligent: building on collected experience and monitoring data, algorithms can search for optimized scheduling/searching strategies, including automated decision making based on usage traces and expectations.
• Cost-effective: building on existing, well-established solutions and incrementally extending them to address new challenges with an evolving, validated common solution, avoiding unnecessary duplication of effort.

Page 8:

1.3 Concept and approach (model)

• Model (building blocks):
  – Collaborative modular architecture, with multiple layers sharing the same Framework and Basic modules, allowing horizontal and vertical scaling to ensure scalability.
  – Open, iterative, incremental and parallel requirement-driven development process; Agile(?) methodology.
  – Standard procedures for quality assurance (including security, platform integration and validation, with reference benchmarks) and release procedures in accordance with the requirements for production-level services.
• Layers (the result of 10 years of evolution of the DIRAC development effort):
  – Framework: communication, security, access control, user/group management, DBs
  – Basic modules: SystemLogging, Configuration, Accounting, Monitoring
  – Low-level modules: File Catalog, Resource Status, Request Management, Workload Management
  – High-level modules: Data Management, Workflow Management
  – Interfaces: User, Resource
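
The layering above can be encoded as a simple dependency-direction rule: a module may only depend on modules in its own layer or a lower one. The module names follow the slide; the check itself is a generic sketch, not VLDATA or DIRAC code.

```python
# Layers from bottom (Framework) to top (interfaces), as listed on the slide.
LAYERS = [
    ["Framework"],
    ["SystemLogging", "Configuration", "Accounting", "Monitoring"],
    ["FileCatalog", "ResourceStatus", "RequestManagement", "WorkloadManagement"],
    ["DataManagement", "WorkflowManagement"],
    ["UserInterface", "ResourceInterface"],
]

# Map each module name to its layer index.
LEVEL = {m: i for i, layer in enumerate(LAYERS) for m in layer}

def depends_ok(module, dependency):
    """True if the dependency sits in the same layer or below the module."""
    return LEVEL[dependency] <= LEVEL[module]
```

So `DataManagement` may use the `FileCatalog`, but `Monitoring` must not reach up into `WorkflowManagement` — the rule that keeps the layering, and hence the horizontal/vertical scaling story, intact.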

Page 9:

1.3 Concept and approach (assumptions)

• The current solution can be evolved into a new general platform to be widely applied.
• Evolution from grids to clouds, but heterogeneity will increase.
• Large degree of commonality in low-level requirements and tools between different scientific domains.
• Fast growth of data and computing requirements, almost doubling every year. Aggregated estimates approach the exabyte level in 5 years from now (EGI expects 10,000,000 cores and 1?? exabyte of scientific data by 2020). (Ref: http://delaat.net/talks/cdl-2014-05-13.pdf)
• Similar growth in the number of data objects, computing units and end users (60% of ESFRI projects completed or launched by 2015).
• New scientific domains are entering the digital era; a 4th paradigm of science and a new data science are emerging (http://research.microsoft.com/en-us/collaboration/fourthparadigm/)
• Data is to be made openly available beyond the community that produced it, down to the citizens, who might also contribute to its further processing.
• Common development and validation provide robustness as well as cost savings and thus enable sustainability.
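
A quick sanity check of the "doubling every year" assumption above (illustrative arithmetic only): six doublings between 2014 and 2020 give a factor of 64, so roughly 16 PB of aggregated data today would already put the total near one exabyte by 2020.

```python
# Back-of-the-envelope check: with data volume doubling every year,
# how much data in 2014 would reach ~1 EB by 2020?
START_YEAR, TARGET_YEAR = 2014, 2020
TARGET_PB = 1000                      # ~1 exabyte expressed in petabytes

doublings = TARGET_YEAR - START_YEAR  # 6 doublings
growth = 2 ** doublings               # factor of 64
start_pb = TARGET_PB / growth         # 15.625 PB today would suffice
```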

Page 10:

1.4 Ambition

Page 11:

2.1 Expected impacts

• Direct impact:
  – scalability, robustness (for the Research Infrastructures)
  – Participating RI projects will be able to operate their distributed computing systems efficiently, processing their large volumes of research data and making them available to their end users in a reliable and cost-effective way that could not be achieved before. This may lead to new ways of organizing science activities and to significant scientific breakthroughs. By providing important functional components (e.g., ) missing from existing practices, the VLDATA platform will make possible the transparent integration of resources, hiding the complexity from users and extending the scale of the resources that Research Infrastructure projects can utilize. This will increase both the number of RIs using the project tools and the number of different types of resources reachable through them.
  – simplicity (for the user: scientist/operator)
  – cost-efficiency (for funding agencies)
  – reduced duplication of effort, maximizing the use of EU-invested e-Infrastructures, enlarging the user communities, providing efficient data processing services, and providing advanced technology by integrating the state of the art, which reduces development cost significantly (also the processing algorithms).

Page 12:

2.1 Expected impacts

• Indirect impact: a large user community
  – science, innovation
  – society, industry
  – citizens, policy makers, a new generation of data scientists
• On the other hand, the scale of the data challenge requires simple but intelligent solutions to integrate resources from different e-Infrastructure providers.

Page 13:

2.2 Measures to maximize impact

Page 14:

Research Infrastructures (I)

• Belle II:
  – Usage of DIRAC for the experiment; use case presented:
    • common access to various platforms: grid + cloud + cluster + HPC
    • support for monitoring of workflow management tools
    • integration of the needs of other participants
    • user interface
  – EU-T0: virtual data centers / new virtualization techniques?

Page 15:

Research Infrastructures (II)

• PAO:
  – Usage of DIRAC for the experiment; data taking -> 2022
    • using a standard solution will help sustainability
    • extend functionality for their use case
    • common access to various platforms: grid + cloud + cluster + HPC (following the evolution of providers), in particular OSG
    • open access to data
  – EU-T0: data locality

Page 16:

Research Infrastructures (III)

• LHCb:
  – should cover Run 2 needs and target the needs of Run 3 (DAQ upgrade)
    • the data rate will increase by a factor of ~5, to 10 PB/year
    • integration of cloud resources
    • massive data-driven workflows for users
    • data preservation (?)
    • resource (CPU/storage/network/...) description/monitoring/availability/management, smart allocation
    • smart/intelligent/dynamic data placement strategies (network)
  – EU-T0: new virtualization techniques, resource description/monitoring/availability, virtual data centers, data locality
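
A "smart data placement" decision of the kind listed above can be sketched as picking the storage site with the lowest network cost to the jobs that will read the dataset. The site names and costs below are made-up examples for illustration, not real LHCb or EU-T0 configuration.

```python
# Hypothetical symmetric network-cost table between sites (e.g., latency in ms).
NETWORK_COST = {
    ("CERN", "CERN"): 0,
    ("CERN", "IN2P3"): 12,
    ("CERN", "CNAF"): 15,
}

def placement(jobs_site, candidates):
    """Choose the candidate storage site that is cheapest to reach
    from the site where the jobs will run."""
    def cost(site):
        pair = (jobs_site, site)
        return NETWORK_COST.get(
            pair, NETWORK_COST.get((site, jobs_site), float("inf"))
        )
    return min(candidates, key=cost)
```

A real strategy would fold in storage load, replica counts and usage traces, but the core decision — rank candidate sites by an observed cost metric — has this shape.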

Page 17:

Research Infrastructures (IV)

• EISCAT_3D:
  – searching data (metadata catalog), intelligent searching (pattern recognition)
  – visualization
  – workflows to go from one data level to another with appropriate access rights
  – training
  – flexible interconnection of different resources, central (HPC) + distributed (grid/cloud)
  – time-constrained massive data reduction (10 PB -> 1 PB / month ??), including the possibility of user-defined algorithms
• EU-T0:
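
The EISCAT_3D reduction figure above implies a substantial sustained I/O rate. A rough estimate (illustrative arithmetic only, taking the slide's 10 PB-per-month input figure at face value):

```python
# Throughput implied by reading ~10 PB of input within one month.
INPUT_PB = 10
SECONDS_PER_MONTH = 30 * 24 * 3600            # 2,592,000 s

input_bytes = INPUT_PB * 10**15               # 10 PB in decimal bytes
read_rate_gb_s = input_bytes / SECONDS_PER_MONTH / 10**9
# roughly 3.9 GB/s of sustained reading, before any compute overhead
```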

Page 18:

Research Infrastructures (V)• BES III:

Page 19:

3.1 Work Plan (to be confirmed)

• WP1 Coordination (UB, Spain)
  – External Advisory Board (EUDAT, OGF, RDA, OSG, PRACE, XSEDE, CERN/HelixNebula)
• WP2 Requirement analysis & design (CU, UK)
• WP3 Data-driven development (UB, Spain)
• WP4 User-driven development (CYFRONET, Poland)
• WP5 Quality (UAB, Spain)
• WP6 Validation (????)
  – LHCb (CNRS/INFN)
  – Belle II (Institut Jozef Stefan, UniMB Maribor and UniLJ, Slovenia)
  – EISCAT_3D (SNIC, Sweden / EISCAT Scientific Association)
  – PAO (CESNET, Czech Republic)
  – BES III (IHEP, China / INFN-Torino, Italy)
  – DIRAC 4 EGI, multi-community EGI solution (EGI.eu, the Netherlands)
• WP7 Dissemination: outreach + training (CNRS, France)
• WP8 Exploitation (ASCAMM, Spain)
• WP9 Communication, internationalization (UvA, the Netherlands)

Page 20:

3.2 Management structure and procedures

[Organigram]
• Roles: Coordinator; Technical Coordinator; Project Manager; Communication/Exploitation Coordinator
• Boards: Consortium Board (all partners); Executive Board (1 representative from each area); External Advisory Board
• WP groupings: Design/Development WPs (2,3,4,5); Integration/Operations WPs (6); Communication/Sustainability WPs (7,8,9)
• Internal and External Communities' Coordinators

Page 21:

3.3 Consortium as a whole

Page 22:

Private Companies

• Bull/Dell (??)
• ETL (UK)
• AlpesLaser (CH)

Page 23:

3.4 Resources to be committed

Page 24:

Calendar (milestones)

• May 23: close the contractors
• June 11-13: all WPs ready; F2F meeting to close the work plan; deadline for RIs and third parties
• July 9-11: close proposal (I)
• July 25: proofread -> external review
• Aug 18 -> Sep 2: final updates