Transcript of: IIASA – CODATA Big Data, FAIR Data and Open Data for Systems Analysis
Big Data, FAIR Data and Open Data for Systems Analysis
Alexei D. Gvishiani Geophysical Center of RAS, Vice chair of IIASA Russian NMO,
Chair of Russian CODATA National Committee
IIASA – CODATA TG, 24 – 25 February 2020
CONTENT
1. Big Data
2. Systems Analysis of Big Data
3. Relation Between Big, FAIR and Open Data
Birth of Big Data
• The author of the term “Big Data” is Clifford Lynch, editor of the journal Nature. On September 3, 2008 he issued a special volume of the journal on the topic “How can technologies that open up high possibilities to work with big data influence the future of science?”.
• “Big Data” is a term similar to the metaphors “Big Oil”, “Big Ore”, etc.
Explosive growth of data volume and variety
Leap from the amount of initial data to the quality of recognizable knowledge
THE INTERNET. Basic example of Big Data
• Facebook – 1 million logins
• Instagram – 347,222 people scrolling
• Apple Messages – 18.1 million texts sent
• YouTube – 4.5 million videos viewed
• Twitter – 87,500 people tweeting
• Email – 188 million emails sent
• WhatsApp – 41.6 million messages sent
• Netflix – 694,444 hours watched
Sources of Big Data today
• Internet
• Social networks
• Cellular location data streams
• Audio and video recording device data
• Continuously measuring device data
• Events from RFIDs
• Internal, previously not stored, information of enterprises and organizations generated in information environments
Big Data. Scalability
• Scalability is the property of a system to handle a growing amount of work by adding resources to the system.
• A system is called scalable if:
1) it is able to increase its productivity in proportion to additional resources;
2) it has the ability to incorporate additional resources without structural changes to the system.
Vertical and horizontal scaling
There are two types of scaling:
1) Vertical (Scaling Up)
2) Horizontal (Scaling Out)
• Vertical – increase the performance of each component of the system. Does not require program changes.
• Horizontal – adding new components to the system. May require modification of programs to make full use of the added resources.
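The horizontal (scale-out) pattern can be sketched as a scatter–gather computation. This is a minimal illustrative sketch in Python, not part of the slides: the dataset, node count and function names are invented, and the "nodes" here are just a sequential loop standing in for parallel machines.

```python
# Scatter–gather sketch of horizontal scaling: the workload is split into
# shards, every "node" runs the same function on its shard, and the
# partial results are merged. (Names and data are illustrative.)

def split_into_shards(data, n_nodes):
    """Divide the workload among n_nodes roughly equally."""
    return [data[i::n_nodes] for i in range(n_nodes)]

def node_work(shard):
    """The function every node performs simultaneously (here: a partial sum)."""
    return sum(shard)

def scale_out(data, n_nodes):
    shards = split_into_shards(data, n_nodes)
    partials = [node_work(s) for s in shards]  # in reality, run in parallel
    return sum(partials)                       # gather / merge step

print(scale_out(list(range(100)), n_nodes=4))  # → 4950
```

Note that `node_work` must make sense on a shard in isolation; this is exactly why, as the slide says, horizontal scaling may require modifying programs.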
Big Data. Definition
• Horizontal scaling is dividing the system into smaller structural components and distributing them across separate physical machines, and/or increasing the number of servers, nodes and processors that simultaneously perform the same function.
• Big Data is structured and unstructured data of huge volume and significant diversity, efficiently processed by horizontally scalable software, as an alternative to traditional DBMSs.
Big Data. Scalability evaluation
• Scalability is measured through the ratio P1 / P2, where P1 is the gain in the system's performance and P2 is the gain in resources used.
Always P2 ≥ P1. The system scales well if P1 / P2 is close to 1.
• In a system where P1 / P2 ≪ 1, adding resources gives only a slight increase in productivity, and beyond some “threshold” point adding resources ceases to have a beneficial effect.
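The ratio can be made concrete with a short sketch; the throughput and server numbers below are hypothetical, chosen only to illustrate the calculation.

```python
# Scalability measure from the slide: efficiency = P1 / P2, where
# P1 is the performance gain and P2 is the resource gain (both ratios).

def scaling_efficiency(perf_before, perf_after, res_before, res_after):
    p1 = perf_after / perf_before   # gain in performance
    p2 = res_after / res_before     # gain in resources
    return p1 / p2                  # close to 1 => good scaling

# Hypothetical example: doubling the servers (P2 = 2)
# raises throughput from 1000 to 1900 requests/s (P1 = 1.9).
print(round(scaling_efficiency(1000, 1900, 4, 8), 2))  # → 0.95
```

An efficiency of 0.95 would indicate good scaling; values far below 1 signal the “threshold” effect described above.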
Big Data. Criterion
• Big Data is a data system which satisfies the 3V principle:
• Volume
• Velocity
• Variety
• Other Vs are also being considered, which are not decisive for Big Data but relate more to its analysis:
• Variability
• Veracity
• Validity
• Value
Big Data. Advanced Criterion
• Most often added to the main Vs are variability and veracity.
• A researcher is always inside the system, which prevents him from determining where to start and how to build the stages of Big Data research in time and space.
• Systems analysis is a methodology for how to act in this situation.
Big Data in Earth Sciences
Today:
• meteorological data
• Earth remote sensing data
• data of the ecological superstations SMEAR II
In future:
• sensor network data, such as the CeNSE (Central Nervous System for the Earth) project: Hewlett-Packard plans to install up to a trillion miniature sensors worldwide; first commercial application – a joint seismic network with Shell
• the Arctic as a potential source of BIG DATA
• Geodata Fabric
Geodata Fabric
Jeff de La Beaujardière from the National Center for Atmospheric Research (NCAR) proposes the Geodata Fabric built on the following principles:
• an object (instead of file) storage model, as used by giants such as Facebook, YouTube, Netflix, Google and Amazon;
• consolidated data storage, which makes it possible to perform calculations on the objects and provide just the results to a user;
• cloud technologies (serverless computing) to optimize the use of resources and eliminate the need to maintain a complex storage infrastructure;
• simplified data access through standardization and a higher level of abstraction of requests;
• automated data analysis inside the cloud to extract knowledge.
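The "compute on the objects, return only the result" idea above can be illustrated with a toy object store. The class, method names and readings below are invented for illustration and are not part of the NCAR proposal.

```python
# Toy sketch: a flat (non-hierarchical) object store where a reduction runs
# next to the data, so only the result crosses the network to the user.

class ObjectStore:
    """Flat object storage: keys map to data objects (no file hierarchy)."""
    def __init__(self):
        self._objects = {}

    def put(self, key, obj):
        self._objects[key] = obj

    def compute(self, keys, reducer):
        """Run the reduction next to the data; only the result leaves the store."""
        return reducer(self._objects[k] for k in keys)

def mean_of_all(objs):
    """Example server-side reduction: mean of every value in every object."""
    values = [v for o in objs for v in o]
    return round(sum(values) / len(values), 2)

store = ObjectStore()
store.put("temp/2020-02-24", [268.1, 270.4, 271.0])  # hypothetical readings, Kelvin
store.put("temp/2020-02-25", [269.5, 272.2, 270.8])

# The user receives one number instead of downloading both objects.
mean_temp = store.compute(["temp/2020-02-24", "temp/2020-02-25"], mean_of_all)
print(mean_temp)  # → 270.33
```

The point of the design is that the `reducer` travels to the data, not the data to the user, which is what makes consolidated storage cheap to query at scale.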
Big Data of the Arctic (Polar) Circle, 66° 33′ northern latitude
The area of the Arctic Circle – 21 million sq. km
Circumference of the Arctic Circle – 15,948 km
Population – 4.6 million people; 2.5 million live in the Russian Arctic
The warming rate there is 5 times higher than the average on Earth
Systems analysis of Big Data in the XXI century
• In the XX century the target of geophysics was adequate mathematical models of the magnetic, electric, gravity, seismo-tectonic and plate-tectonic fields of the Earth.
• With the appearance of BD, the challenge for geophysics in the XXI century is a holistic electromagnetic-gravitational-seismo-tectonic model of the Earth. The hope of building it rests on the new knowledge that systems analysis allows us to extract from BD.
• The mathematical modeling approach is no longer sufficient to meet this challenge.
Mathematics in the XIX, XX and XXI centuries
XIX century
• Source of development: physical, astronomic, geodesic and geographic discoveries
• Mathematics: mathematical analysis, linear and higher algebra, analytic geometry, etc.
XX century
• Source of development: physical, chemical, biological, medical, economic and linguistic discoveries; fast development of the technosphere; space exploration; atomic energy applications
• Mathematics: mathematical modeling, functional analysis, cybernetics, creation of computers
XXI century
• Source of development: BIG DATA, FAIR DATA
• Mathematics: systems analysis
Big Data Studies. 1
• Artificial neural networks, network analysis
• Optimization, including genetic algorithms
• Pattern recognition
• Predictive analytics
• Simulation
Big Data Studies. 2
• Spatial analysis using topological, geometric and geographic information
• Statistical analysis, including time series analysis
• Visualization of the database to obtain analytics and use images for further analysis
Data mining
Data Mining (DM) – information sifting, knowledge extraction, intelligent data analysis, deep data analysis, knowledge discovery in data.
DM is a set of methods for detecting previously unknown, non-trivial, accessible interpretations of knowledge in data that are useful for decision-making in various fields of activity (G. Pyatetsky-Shapiro, 1989).
Methods:
• Association rule learning
• Classification – categorization of new data based on principles successfully applied to categorize existing data
• Cluster analysis
• Regression analysis
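As a sketch of one of these methods, here is a minimal 1-D k-means procedure (cluster analysis) in plain Python. The dataset and starting centers are toy values chosen for illustration.

```python
# Minimal 1-D k-means: alternate between assigning each point to its
# nearest center and moving each center to the mean of its cluster.

def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1, 2, 3, 10, 11, 12]          # two obvious groups
print(kmeans_1d(data, centers=[1, 12]))  # → [2.0, 11.0]
```

The same assign-then-update loop generalizes to higher dimensions by replacing `abs(p - c)` with a distance function and the mean with a centroid.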
FAIR DATA
FAIR: Findability, Accessibility, Interoperability, Reusability
FAIR Principles:
• findability
• accessibility
• interoperability
• reusability
FAIR Principles
• Strengthening the reliability of the results necessitates a radical improvement in the infrastructure for reuse of scientific data
• Support for data discovery through effective management
• The importance of using machines (AI) in data-rich research environments
FAIR data principles – 1
To be Findable:
F1. (meta)data are assigned a globally unique and eternally persistent identifier.
F2. data are described with rich metadata.
F3. (meta)data are registered or indexed in a searchable resource.
F4. metadata specify the data identifier.
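Principles F1–F4 can be illustrated with a hypothetical metadata record; the DOI, field names and catalogue name below are invented for illustration, not a real registry schema.

```python
# Hypothetical metadata record annotated against findability principles F1-F4.
import json

record = {
    # F1: globally unique, persistent identifier (invented DOI).
    "identifier": "doi:10.0000/example-dataset",
    # F2: data are described with rich metadata.
    "metadata": {
        "title": "Example geomagnetic observatory dataset",
        "creator": "Example Observatory",
        "keywords": ["geomagnetism", "time series"],
        # F4: the metadata specify the data identifier.
        "data_identifier": "doi:10.0000/example-dataset",
    },
    # F3: (meta)data are registered or indexed in a searchable resource.
    "indexed_in": ["example-data-catalogue"],
}

print(json.dumps(record, indent=2))
```

A machine agent can then resolve the dataset from the identifier alone, which is what the Accessible principles on the next slide build on.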
FAIR data principles – 2
To be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
A2. the protocol is open, free, and universally implementable.
A3. the protocol allows for an authentication and authorization procedure, where necessary.
A4. metadata are accessible, even when the data are no longer available.
FAIR data principles – 3
To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles.
I3. (meta)data include qualified references to other (meta)data.
FAIR data principles – 4
To be Reusable:
R1. meta(data) have a plurality of accurate and relevant attributes.
R2. (meta)data are released with a clear and accessible data usage license.
R3. (meta)data are associated with their provenance.
R4. (meta)data meet domain-relevant community standards.
Big, Open and FAIR Data
(Diagram: the overlap of Big Data, Open Data and FAIR Data)
Big Data – F.A.I.R. Data Interaction
Big Data – F.A.I.R. Data Interaction Through Archives
Science & Technological Data (ST-DATA)
Two Sides of the Coin
Mathematics of Big Data
“The mathematical method by which the consequences are derived from definitions, postulates and axioms ... is the best and most reliable way to find and generalize the truth.”
Benedict Spinoza (1632–1677)
“The advancement and perfection of mathematics are intimately connected with the prosperity of the State.”
Napoleon Bonaparte (1769–1821)
“Science only reaches perfection when it manages to use mathematics.”
Karl Marx (1818–1883)
“The highest purpose of mathematics is to find order in the chaos which surrounds us.”
Norbert Wiener (1894–1964)