2.1.1.2020.02 IIASA - CODATA Big Data FAIR Data and Open ......horizontally scalable software,...

Big Data, FAIR Data and Open Data for Systems Analysis Alexei D. Gvishiani Geophysical Center of RAS, Vice chair of IIASA Russian NMO, Chair of Russian CODATA National Committee IIASA – CODATA TG, 24 – 25 February 2020


Page 1: 2.1.1.2020.02 IIASA - CODATA Big Data FAIR Data and Open ......horizontally scalable software, alternative to traditional DBMS. IIASA –CODATA TG 8 24 –25 February 2020. Big Data.

Big Data, FAIR Data and Open Data for Systems Analysis

Alexei D. Gvishiani Geophysical Center of RAS, Vice chair of IIASA Russian NMO,

Chair of Russian CODATA National Committee

IIASA – CODATA TG, 24 – 25 February 2020

Page 2:

CONTENT

1. Big Data

2. Systems Analysis of Big Data

3. Relation Between Big, FAIR and Open Data

Page 3:

Birth of Big Data

• The term “Big Data” is commonly attributed to Clifford Lynch, a contributor to the special issue of the journal “Nature” dated September 3, 2008, devoted to the topic “How can technologies that open up great possibilities for working with big data influence the future of science?”.

• “Big Data” is a term similar to the metaphors “Big Oil”, “Big Ore”, etc.


Explosive growth of data volume and variety

Leap from the amount of initial data to the quality of recognizable knowledge

Page 4:

The Internet: a basic example of Big Data

Activity during a single Internet minute:

• Facebook – 1 million logins
• Instagram – 347,222 people scrolling
• Apple Messages – 18.1 million texts sent
• YouTube – 4.5 million videos viewed
• Twitter – 87,500 people tweeting
• Email – 188 million emails sent
• WhatsApp – 41.6 million messages sent
• Netflix – 694,444 hours watched

Page 5:

Sources of Big Data today

• Internet
• Social networks
• Cellular location data streams
• Audio and video recording device data
• Continuously measuring device data
• Events from RFIDs
• Internal, previously not stored, information of enterprises and organizations generated in information environments

Page 6:

Big Data. Scalability

• Scalability is the property of a system to handle a growing amount of work by adding resources to the system.

• A system is called scalable if

1) it is able to increase its productivity in proportion to additional resources

2) it has the ability to incorporate additional resources without structural changes to the system

Page 7:

Vertical and horizontal scaling

There are two types of scaling:

1) Vertical (Scaling Up)

2) Horizontal (Scaling Out)

• Vertical – increasing the performance of each component of the system. Does not require program changes.

• Horizontal – adding new components to the system. May require modification of programs to make full use of the added resources.

Page 8:

Big Data. Definition

• Horizontal scaling is dividing the system into smaller structural components and distributing them across separate physical machines, and/or increasing the number of servers, nodes and processors that simultaneously perform the same function.

• Big Data is structured and unstructured data of huge volumes and significant diversity, efficiently processed by horizontally scalable software, alternative to traditional DBMS.
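
As an illustration of horizontal scaling, the sketch below spreads records across several nodes so that each node performs the same function on its own share of the data. The function names `partition` and `count_per_node`, the CRC32 hash and the node counts are assumptions made for this example and do not come from the talk.

```python
import zlib
from collections import Counter

def partition(key: str, num_nodes: int) -> int:
    """Map a record key to a node index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def count_per_node(keys, num_nodes: int) -> Counter:
    """Count how many records each node would receive."""
    return Counter(partition(k, num_nodes) for k in keys)

# Hypothetical record keys, e.g. readings from numbered sensors.
keys = [f"sensor-{i}" for i in range(10_000)]

# Scaling out: the same keys are redistributed when nodes are added.
for n in (2, 4, 8):
    load = count_per_node(keys, n)
    print(n, "nodes, largest share:", max(load.values()))
```

Simple modulo hashing moves most keys whenever the node count changes; production systems often use consistent hashing to limit that movement.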

Page 9:

Big Data. Scalability evaluation

• Scalability is measured through the ratio P1 / P2, where P1 is the gain in the system’s performance and P2 is the gain in the resources used.

• Always P2 ≥ P1. The system scales well if P1 / P2 is close to 1.

• In a system with poor scaling, P1 / P2 is much less than 1: adding resources gives only a slight increase in productivity, and beyond some “threshold” point adding resources has no beneficial effect.
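
The ratio P1 / P2 can be computed from two measurements, as in the minimal sketch below. Treating each "gain" as a multiplicative factor (after divided by before) is an assumption, since the slide does not define it precisely; the function name `scaling_ratio` and the sample numbers are hypothetical.

```python
def scaling_ratio(perf_before, perf_after, res_before, res_after):
    """Return P1 / P2, where P1 and P2 are relative gains (factors)."""
    if min(perf_before, res_before) <= 0:
        raise ValueError("baseline values must be positive")
    p1 = perf_after / perf_before   # gain in performance
    p2 = res_after / res_before     # gain in resources used
    return p1 / p2

# Hypothetical measurements: requests per second on 4 and then 8 nodes.
print(round(scaling_ratio(1000, 1900, 4, 8), 2))  # 0.95, close to 1
print(round(scaling_ratio(1000, 1100, 4, 8), 2))  # 0.55, poor scaling
```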

Page 10:

Big Data. Criterion

• Big Data is a data system that satisfies the 3V principle:
  • Volume
  • Velocity
  • Variety

• Other Vs are also considered; they are not decisive for Big Data but relate more to its analysis:
  • Variability
  • Veracity
  • Validity
  • Value

[Diagram: Big Data with Volume, Velocity and Variety; additional Vs: Variability, Veracity, Validity, Value]

Page 11:

Big Data. Advanced Criterion


• The Vs most often added to the main three are variability and veracity

• A researcher is always inside the system, which prevents them from determining where to start and how to build the stages of Big Data research in time and space

• Systems analysis is a methodology for how to act in this situation

[Diagrams: Big Data as Volume, Velocity, Variety plus Variability; Big Data as Volume, Velocity, Variety plus Veracity]

Page 12:

Big Data in Earth Sciences

Today:
• meteorological data
• Earth remote sensing data
• data from ecological superstations (SMEAR II)

In future:
• sensor network data, such as the CeNSE (Central Nervous System for the Earth) project. Hewlett Packard plans to install up to a trillion miniature sensors worldwide. First commercial application – joint seismic network with Shell
• the Arctic is a potential source of Big Data
• Geodata Fabric

Page 13:

Geodata Fabric

Jeff de La Beaujardière from the National Center for Atmospheric Research (NCAR) proposes a Geodata Fabric based on the following principles:

• an object (instead of file) storage model, as used by giants such as Facebook, YouTube, Netflix, Google and Amazon;

• consolidated data storage, which allows calculations to be performed on the objects and only the results to be provided to the user;

• cloud technologies (serverless computing) to optimize the use of resources and eliminate the need to maintain a complex storage infrastructure;

• simplification of data access through standardization and a higher level of abstraction of requests;

• automated data analysis inside the cloud to extract knowledge.
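
The idea of computing next to the stored objects and returning only the result can be sketched with a toy in-memory store. The `ObjectStore` class, its methods and the sample key and values are hypothetical stand-ins for a cloud object service; they are not part of the Geodata Fabric proposal itself.

```python
from statistics import mean

class ObjectStore:
    """Toy in-memory object store; stands in for a cloud object service."""

    def __init__(self):
        self._objects = {}

    def put(self, key, values):
        """Store a sequence of values under an object key."""
        self._objects[key] = list(values)

    def compute(self, key, func):
        """Run func next to the data and return only its result."""
        return func(self._objects[key])

store = ObjectStore()
# Hypothetical temperature readings stored as one object.
store.put("temperature/2020-02-24", [-12.5, -11.0, -13.2, -10.8])

# The user receives one number, not the underlying array.
print(store.compute("temperature/2020-02-24", mean))
```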

Page 14:

Arctic Circle Big Data, 66°33′ north latitude

The area within the Arctic Circle – 21 million sq. km

Circumference of the Arctic Circle – 15,948 km

Population – 4.6 million people, of whom 2.5 million live in the Russian Arctic

The warming rate is 5 times higher than the average on Earth
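
The area and circumference figures can be checked with spherical geometry. The sketch below assumes a spherical Earth with a mean radius of 6,371 km (an assumption, not stated on the slide), so the results land near, but not exactly on, the slide's values.

```python
import math

EARTH_RADIUS_KM = 6371.0       # mean radius, spherical approximation
LATITUDE_DEG = 66 + 33 / 60    # 66 degrees 33 minutes north, as on the slide

def parallel_circumference(lat_deg, radius_km=EARTH_RADIUS_KM):
    """Length of the circle of latitude: 2 * pi * R * cos(lat)."""
    return 2 * math.pi * radius_km * math.cos(math.radians(lat_deg))

def polar_cap_area(lat_deg, radius_km=EARTH_RADIUS_KM):
    """Area poleward of the latitude: 2 * pi * R^2 * (1 - sin(lat))."""
    return 2 * math.pi * radius_km**2 * (1 - math.sin(math.radians(lat_deg)))

print(f"circumference ~ {parallel_circumference(LATITUDE_DEG):,.0f} km")
print(f"area ~ {polar_cap_area(LATITUDE_DEG) / 1e6:.1f} million sq. km")
```

Using the equatorial radius instead of the mean radius brings the circumference closer to the 15,948 km quoted on the slide.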

Page 15:

Systems analysis of Big Data in the XXI century

• In the XX century the target of geophysics was to build adequate mathematical models of the magnetic, electric, gravity, seismo-tectonic and plate-tectonic fields of the Earth.

• With the appearance of BD, the challenge for geophysics in the XXI century is a holistic electromagnetic-gravitational-seismo-tectonic model of the Earth. The hope of building it rests on the new knowledge that systems analysis makes it possible to extract from BD.

• The mathematical modeling approach alone is no longer sufficient to meet this challenge.

Page 16:

Mathematics in XIX, XX and XXI centuries

XIX century
  Mathematics: mathematical analysis, linear and higher algebra, analytic geometry, etc.
  Source of development: physical, astronomic, geodesic and geographic discoveries.

XX century
  Mathematics: mathematical modeling, functional analysis, cybernetics, creation of computers.
  Source of development: physical, chemical, biological, medical, economic and linguistic discoveries. Fast development of the technosphere. Space exploration. Atomic energy applications.

XXI century
  Mathematics: systems analysis.
  Source of development: Big Data, FAIR Data.

Page 17:

Big Data Studies. 1

• Artificial neural networks, network analysis

• Optimization, including genetic algorithms

• Pattern recognition

• Predictive analytics

• Simulation

Page 18:

Big Data Studies. 2

• Spatial analysis using topological, geometric and geographic information

• Statistical analysis, including time series analysis

• Visualization of the database to obtain analytics and use images for further analysis

Page 19:

Data mining

Data Mining (DM) – information sifting, knowledge extraction, intelligent data analysis, deep data analysis, knowledge discovery in data.

DM is a set of methods for detecting previously unknown, non-trivial knowledge in data that is accessible to interpretation and useful for decision-making in various fields of activity (G. Piatetsky-Shapiro, 1989).

Methods:
• Association rule learning
• Classification: categorization of new data based on principles successfully applied to categorize existing data
• Cluster analysis
• Regression analysis
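
Of the methods listed, association rule learning rests on two simple measures, support and confidence. The sketch below computes both for a tiny data set; the observation sets and the function names are hypothetical, chosen only to illustrate the measures.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0

# Hypothetical observations: phenomena recorded together at a station.
observations = [
    {"storm", "aurora"},
    {"storm", "aurora", "outage"},
    {"storm"},
    {"aurora"},
]

print(support(observations, {"storm", "aurora"}))        # 0.5
print(confidence(observations, {"storm"}, {"aurora"}))   # two thirds
```

A rule such as "storm implies aurora" is usually kept only if both measures exceed thresholds chosen by the analyst.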

Page 20:

FAIR DATA

FAIR Principles:
• Findability
• Accessibility
• Interoperability
• Reusability

Page 21:

FAIR Principles

• Strengthening the reliability of the results necessitates a radical improvement in the infrastructure for reuse of scientific data

• Support for data discovery through effective management

• The importance of using machines (AI) in data-rich research environments

Page 22:

FAIR data principles – 1

To be Findable:

F1. (meta)data are assigned a globally unique and eternally persistent identifier.

F2. data are described with rich metadata.

F3. (meta)data are registered or indexed in a searchable resource.

F4. metadata specify the data identifier.
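
A minimal sketch of the findability principles: each metadata record carries the identifier of its data and is registered in a searchable index. The `make_record` function, the `Catalog` class and the sample record are hypothetical. A UUID gives uniqueness only; persistence would depend on a registry such as a DOI service, which this sketch does not model.

```python
import uuid

def make_record(title, keywords, data_id=None):
    """Build a metadata record that carries the identifier of its data."""
    data_id = data_id or f"urn:uuid:{uuid.uuid4()}"   # globally unique
    return {"data_id": data_id, "title": title, "keywords": list(keywords)}

class Catalog:
    """Toy searchable resource that indexes records by keyword."""

    def __init__(self):
        self._index = {}

    def register(self, record):
        """Index the record under each of its keywords."""
        for word in record["keywords"]:
            self._index.setdefault(word.lower(), []).append(record)

    def search(self, word):
        """Return all records registered under the keyword."""
        return self._index.get(word.lower(), [])

catalog = Catalog()
catalog.register(make_record("Geomagnetic observations", ["geomagnetism", "Arctic"]))
print([r["data_id"] for r in catalog.search("arctic")])
```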

Page 23:

FAIR data principles – 2

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol.

A2. the protocol is open, free, and universally implementable.

A3. the protocol allows for an authentication and authorization procedure, where necessary.

A4. metadata are accessible, even when the data are no longer available.
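
As a sketch of retrieval by identifier over a standardized, open protocol, the example below prepares an HTTP request to the public DOI resolver and asks for machine-readable metadata. The DOI `10.1234/example` is hypothetical, the bearer token only illustrates where authorization would go, and the request is built but not sent.

```python
from urllib.request import Request

RESOLVER = "https://doi.org/"   # public DOI resolver

def build_metadata_request(doi, token=None):
    """Prepare an HTTP GET for the metadata behind a DOI (not sent here)."""
    # Content negotiation: ask for citation metadata as JSON.
    headers = {"Accept": "application/vnd.citationstyles.csl+json"}
    if token:   # authorization only where necessary (illustrative)
        headers["Authorization"] = f"Bearer {token}"
    return Request(RESOLVER + doi, headers=headers, method="GET")

req = build_metadata_request("10.1234/example")   # hypothetical DOI
print(req.get_method(), req.full_url)
print(req.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual retrieval.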

Page 24:

FAIR data principles – 3

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles.

I3. (meta)data include qualified references to other (meta)data.
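
One common way to meet the interoperability principles is to express metadata as JSON-LD with a shared vocabulary. The sketch below uses schema.org terms; that choice, the `to_jsonld` function, the record layout and the example URLs are all assumptions made for illustration.

```python
import json

def to_jsonld(record):
    """Express a metadata record with schema.org terms as JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": record["data_id"],
        "name": record["title"],
        "keywords": record["keywords"],
        # Qualified reference: the relation type is stated explicitly.
        "isBasedOn": {"@id": record["source_id"]},
    }

record = {
    "data_id": "https://example.org/dataset/42",    # hypothetical
    "title": "Hourly geomagnetic indices",
    "keywords": ["geomagnetism"],
    "source_id": "https://example.org/dataset/7",   # hypothetical
}
print(json.dumps(to_jsonld(record), indent=2))
```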

Page 25:

FAIR data principles – 4

To be Reusable:

R1. (meta)data have a plurality of accurate and relevant attributes.

R2. (meta)data are released with a clear and accessible data usage license.

R3. (meta)data are associated with their provenance.

R4. (meta)data meet domain-relevant community standards.
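
A simple automated check can flag records that lack the reuse-related attributes. The field names `license`, `provenance` and `standard` and the sample record are hypothetical; real repositories define their own metadata schemas.

```python
REQUIRED_FOR_REUSE = ("license", "provenance", "standard")

def missing_for_reuse(metadata):
    """List the reuse-related fields that are absent or empty."""
    return [f for f in REQUIRED_FOR_REUSE if not metadata.get(f)]

metadata = {
    "title": "Hourly geomagnetic indices",
    "license": "CC-BY-4.0",
    "provenance": "Derived from observatory minute data",   # hypothetical
    "standard": "",                                         # not yet declared
}
print(missing_for_reuse(metadata))   # ['standard']
```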

Page 26:

Big, Open and FAIR Data

[Diagram: relation between Big Data, Open Data and FAIR Data]

Page 27:

Big Data – F.A.I.R. Data Interaction

Page 28:

Big Data – F.A.I.R. Data Interaction Through Archives

Page 29:

Science & Technological Data (ST-DATA)

Page 30:

Two Sides of the Coin

Page 31:

Mathematics of Big Data


“The mathematical method by which the consequences are derived from definitions, postulates and axioms ... is the best and most reliable way to find and generalize the truth”
Benedict Spinoza (1632–1677)

“The advancement and perfection of mathematics are intimately connected with the prosperity of the State”
Napoleon Bonaparte (1769–1821)

“Science only reaches perfection when it manages to use mathematics”
Karl Marx (1818–1883)

“The highest purpose of mathematics is to find order in the chaos which surrounds us”
Norbert Wiener (1894–1964)
