Big Data - SysFera presentation at the CSCI

Big Data Technologies SysFera Benjamin Depardon SysFera Tuesday 7 June 2022

description

Big Data is on everyone's lips, but what technical solutions are available to deal with it? We give a brief overview of several of them: distributed filesystems, NoSQL databases, and end-to-end solutions that also cover computation.

Transcript of Big Data - SysFera presentation at the CSCI

Page 1: Big Data - SysFera presentation at the CSCI

Big Data Technologies

SysFera
Benjamin Depardon

SysFera, 10 April 2023

Page 2: Big Data - SysFera presentation at the CSCI

SysFera
• 2001: Research project from the Graal team (Inria/ENS)
  – DIET: grid middleware
• 2007: SysFera-DS used within the Décrypthon project
  – Used in production 24/7/365 since then
  – Selected by IBM to replace Univa-UD
• 2010: Creation of SysFera, INRIA spin-off
• 2012: A team of 14 (R&D: 4 engineers and 5 PhD)
  – Supported by two experts from INRIA and ENS
  – SysFera-DS

Page 3: Big Data - SysFera presentation at the CSCI

What is Big Data?
• All kinds of data
• Valuable insight, but difficult to extract
• Several dimensions
  – Variety
    • Structured/unstructured
    • Text, audio, video…
  – Velocity
    • Time sensitivity
    • Streaming
  – Volume
    • Large files
    • Small files in large quantities
  – Variability
    • Different meanings/formats over different time periods

Page 4: Big Data - SysFera presentation at the CSCI

What can you do with Big Data?
• Analyze a Variety of Information
  – Social media/sentiment analysis
  – Geospatial analysis
  – Brand strategy
  – Scientific research
  – Epidemic early warning system
  – Market analysis
  – Video analysis
  – Audio analysis
• Analyze Information in Motion
  – Smart Grid management
  – Multimodal surveillance
  – Real-time promotions
  – Cyber security
  – ICU monitoring
  – Options trading
  – Click-stream analysis
  – CDR processing
  – IT log analysis
  – RFID tracking & analysis
• Analyze Extreme Volumes of Information
  – Transaction analysis to create insight-based product/service offerings
  – Fraud modeling & detection
  – Risk modeling & management
  – Social media/sentiment analysis
  – Environmental analysis
• Discovery & Experimentation
  – Sentiment analysis
  – Brand strategy
  – Scientific research
  – Ad-hoc analysis
  – Model development
  – Hypothesis testing
  – Transaction analysis to create insight-based product/service offerings
• Manage and Plan
  – Operational analytics – BI reporting
  – Planning and forecasting analysis
  – Predictive analysis
  – …

Page 5: Big Data - SysFera presentation at the CSCI

What can you do with Big Data?
• Utilities
  – Weather impact analysis on power generation
  – Transmission monitoring
  – Smart grid management
• Retail
  – 360° View of the Customer
  – Click-stream analysis
  – Real-time promotions
• Law Enforcement
  – Real-time multimodal surveillance
  – Situational awareness
  – Cyber security detection
• Transportation
  – Weather and traffic impact on logistics and fuel consumption
• Financial Services
  – Fraud detection
  – Risk management
  – 360° View of the Customer
• IT
  – Transition log analysis for multiple transactional systems
  – Cybersecurity
• Health & Life Sciences
  – Epidemic early warning system
  – ICU monitoring
  – Remote healthcare monitoring
• Telecommunications
  – CDR processing
  – Churn prediction
  – Geomapping / marketing
  – Network monitoring

Page 6: Big Data - SysFera presentation at the CSCI

What do you need?
• Hardware
  – Storage capacity
  – Computing power
• Software
  – Storage
    • Filesystems
    • Databases
  – Computation framework

Page 7: Big Data - SysFera presentation at the CSCI

DISTRIBUTED FILESYSTEMS

Page 8: Big Data - SysFera presentation at the CSCI

HDFS
• Hadoop Distributed File System
• Open source (Apache)
• Design
  – High throughput instead of low latency
  – Large data sets (large files), data locality
  – Fault tolerance (replication)
  – Write once, read many (WORM)
  – Userspace
• Limitations
  – Write-once model
  – Cannot be mounted by existing OS
  – No quotas/access permissions
  – Name node is a single point of failure
• Used by Yahoo, Twitter, Rackspace, LinkedIn, Facebook…
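
The replication and large-file points above can be pictured with a toy model: a file is cut into fixed-size blocks and each block is copied to several distinct data nodes. This is only a sketch; the block size, the replication factor, the node names and the round-robin placement are all assumptions (real HDFS placement is decided by the name node and is rack-aware).

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("not enough data nodes for the replication factor")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


MB = 1024 * 1024
blocks = split_into_blocks(file_size=200 * MB, block_size=64 * MB)
nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data node names
placement = place_replicas(len(blocks), nodes)
```

With these assumed sizes the 200 MB file becomes three full blocks and one 8 MB tail, and losing any single node leaves every block readable.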

Page 9: Big Data - SysFera presentation at the CSCI

GlusterFS
• Open source (GPLv3) NAS file system
• Runs in userspace
• File-based distributed mirroring, replication, striping, load balancing
• FUSE, POSIX compliant
• Storage quotas
• No metadata server (fully distributed architecture, elastic hash)
• Unified global namespace: aggregation of disk and memory in a single pool
• Data is stored in logical volumes that are abstracted from the hardware and logically partitioned from each other
• Multiprotocol client support: GlusterFS native, NFS, CIFS, HTTP, WebDAV, FTP
• Real-time self-healing
• VM live replication
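
The "no metadata server" point rests on one idea: any client can compute where a file lives from its name alone. The sketch below illustrates that idea only; the brick names, the MD5-based hash and the equal-sized ranges are assumptions, and the real elastic hashing in GlusterFS is more involved.

```python
import hashlib

HASH_SPACE = 2 ** 32


def name_hash(filename):
    """Map a file name to a 32-bit integer that is stable across processes."""
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_ranges(bricks):
    """Split the 32-bit hash space into one contiguous range per brick."""
    step = HASH_SPACE // len(bricks)
    ranges = []
    for i, brick in enumerate(bricks):
        start = i * step
        end = HASH_SPACE if i == len(bricks) - 1 else (i + 1) * step
        ranges.append((start, end, brick))
    return ranges


def locate(filename, ranges):
    """Return the brick whose hash range contains the file's hash."""
    h = name_hash(filename)
    for start, end, brick in ranges:
        if start <= h < end:
            return brick
    raise RuntimeError("hash ranges do not cover the hash space")


# Hypothetical brick names, for illustration only.
bricks = ["server1:/export/brick", "server2:/export/brick", "server3:/export/brick"]
ranges = build_ranges(bricks)
```

Two clients that share the same range table resolve the same file name to the same brick without asking a central server.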

Page 10: Big Data - SysFera presentation at the CSCI

LUSTRE
• Open source (GPL)
• Object based: separate metadata and file data
  – Metadata Server (MDS) nodes
  – Object Storage Server (OSS) nodes
• Consistency: Lustre distributed lock manager (MDS and OSS)
• Performance:
  – Data can be striped
  – MDT is only involved in pathname and permission checks, and is not involved in any file I/O operations
• POSIX interface
• Lustre Network (LNET): InfiniBand, TCP/IP, Myrinet…
• Targeted to manage large files
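
Striping can be made concrete with a little arithmetic: given a byte offset in a file, which storage target holds it, and at what offset inside that target's object? The sketch assumes a plain round-robin (RAID-0 style) layout; the function name and the parameter values are illustrative, not Lustre's API.

```python
def locate_stripe(offset, stripe_size, stripe_count):
    """Map a file offset to (target index, offset inside that target's object)."""
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0 and stripe parameters > 0")
    stripe_index = offset // stripe_size        # which stripe of the file
    target = stripe_index % stripe_count        # round-robin over the targets
    full_rounds = stripe_index // stripe_count  # stripes already on this target
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return target, object_offset


MiB = 1024 * 1024
first = locate_stripe(0, stripe_size=MiB, stripe_count=4)
second_round = locate_stripe(4 * MiB + 10, stripe_size=MiB, stripe_count=4)
```

Consecutive 1 MiB stripes land on different targets, which is why a large sequential read can draw bandwidth from several storage servers at once.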

Page 11: Big Data - SysFera presentation at the CSCI

DATABASES

Page 12: Big Data - SysFera presentation at the CSCI

CAP theorem (Brewer’s theorem)
It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
• Consistency
• Availability
• Partition tolerance
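
A toy model may help to see the trade-off. Two replicas normally apply every write on both sides; when the network is partitioned, a system has to either refuse writes (keeping consistency) or accept them locally (keeping availability). The classes and the "CP"/"AP" mode labels below are assumptions made for this illustration and do not model any specific database.

```python
class Replica:
    def __init__(self):
        self.data = {}


class Cluster:
    """Two replicas that apply every write on both sides while connected."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def write(self, replica_index, key, value):
        """Return True if the write was accepted."""
        if not self.partitioned:
            for replica in self.replicas:
                replica.data[key] = value
            return True
        if self.mode == "CP":
            # Stay consistent: refuse writes that cannot reach both replicas.
            return False
        # Stay available: accept locally and let the replicas diverge.
        self.replicas[replica_index].data[key] = value
        return True

    def read(self, replica_index, key):
        return self.replicas[replica_index].data.get(key)


cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.write(0, "x", 1)
    cluster.partitioned = True
cp_accepted = cp.write(0, "x", 2)
ap_accepted = ap.write(0, "x", 2)
```

During the partition the "CP" cluster rejects the write and both replicas still agree, while the "AP" cluster accepts it and its two replicas now return different values.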

Page 13: Big Data - SysFera presentation at the CSCI

NoSQL
• Relaxes ACID conditions
• 4 types of NoSQL databases
  – Key-value (Memcached, Voldemort): data agnostic
  – Document oriented (CouchDB, MongoDB): data conscious
  – Column oriented (Bigtable, HBase, Cassandra)
  – Graph (Neo4j)
• Requires more work on the client side

Page 14: Big Data - SysFera presentation at the CSCI

Memcached
• Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load
• Simple Key/Value Store
• Smarts Half in Client, Half in Server
• Servers are Disconnected From Each Other
• O(1) Everything
• Forgetting Data is a Feature
• Used by LiveJournal, Flickr, WordPress.org, Wikipedia, YouTube…
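
"Smarts half in client" and "servers are disconnected" can be sketched as follows: the client hashes the key to choose a server, and the application only queries the database on a cache miss. The classes below are stand-ins written for this example, not the memcached client API; the modulo-based server choice is an assumption (production clients often use consistent hashing instead).

```python
import zlib


class CacheServer:
    """Stand-in for one cache server: an in-memory dict with O(1) get/set."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


class CacheClient:
    """The client picks the server; servers never talk to each other."""

    def __init__(self, servers):
        self.servers = servers

    def _server_for(self, key):
        index = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        return self.servers[index]

    def get(self, key):
        return self._server_for(key).get(key)

    def set(self, key, value):
        self._server_for(key).set(key, value)


def cached_lookup(client, key, load_from_database):
    """Cache-aside read: only hit the database on a cache miss."""
    value = client.get(key)
    if value is None:
        value = load_from_database(key)
        client.set(key, value)
    return value


database_calls = []


def slow_query(key):
    """Hypothetical expensive database query."""
    database_calls.append(key)
    return key.upper()


client = CacheClient([CacheServer(), CacheServer(), CacheServer()])
first = cached_lookup(client, "user:42", slow_query)
second = cached_lookup(client, "user:42", slow_query)
```

The second lookup is served from the cache, so the database is queried only once.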

Page 15: Big Data - SysFera presentation at the CSCI

MongoDB
• Document oriented
• Transport and storage: BSON format (derived from JSON, but binary)
• Queries
  – No join
  – Map/reduce
• Database contains collections
• Collections contain documents
• Master-slave replication
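
The document model can be mimicked with plain Python structures: a database maps collection names to lists of documents, related data is embedded in the document rather than joined, and aggregation is expressed as map/reduce. This is a sketch in pure Python, not the MongoDB API; the `find` and `map_reduce` helpers and the sample data are made up for the illustration.

```python
# A "database" holds named collections; a collection holds documents.
database = {
    "posts": [
        {
            "_id": 1,
            "title": "Big Data",
            "author": {"name": "Ada"},
            # Comments are embedded, so no join is needed to read them.
            "comments": [{"by": "Bob", "text": "Nice"}, {"by": "Eve", "text": "+1"}],
        },
        {"_id": 2, "title": "NoSQL", "author": {"name": "Bob"}, "comments": []},
    ]
}


def find(collection, criteria):
    """Return documents whose top-level fields equal the criteria values."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def map_reduce(collection, map_fn, reduce_fn):
    """Group the (key, value) pairs emitted by map_fn, then reduce each group."""
    groups = {}
    for doc in collection:
        for key, value in map_fn(doc):
            groups.setdefault(key, []).append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def emit_comment_authors(doc):
    """Map step: emit (commenter, 1) for every embedded comment."""
    for comment in doc["comments"]:
        yield comment["by"], 1


nosql_posts = find(database["posts"], {"title": "NoSQL"})
comments_per_user = map_reduce(
    database["posts"], emit_comment_authors, lambda key, values: sum(values)
)
```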

Page 16: Big Data - SysFera presentation at the CSCI

Cassandra
• Column oriented (inspired by Bigtable & Dynamo)
• Notion of super-columns
  – (Sorted) associative array of columns
• Range queries on keys
• Low latency: sequential access to disk
• O(1) DHT
• Eventual consistency
• Values limited to 2 GB
• RPC with Thrift
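
The "O(1) DHT" bullet refers to a Dynamo-style ring: every node knows the whole ring, so the owner of a key is found in one network hop. Below is a minimal consistent-hashing sketch of that idea; the MD5-based token, the node names and the "next nodes clockwise" replica rule are assumptions for illustration, not Cassandra's actual partitioner or replication strategy.

```python
import bisect
import hashlib


def token(value):
    """Hash a string to a position on the ring (a 64-bit integer)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes):
        self._tokens = sorted((token(node), node) for node in nodes)
        self._positions = [position for position, _ in self._tokens]

    def owners(self, key, replicas=2):
        """Walk clockwise from the key's token and return `replicas` nodes."""
        if replicas > len(self._tokens):
            raise ValueError("more replicas requested than nodes on the ring")
        start = bisect.bisect_right(self._positions, token(key)) % len(self._tokens)
        return [
            self._tokens[(start + i) % len(self._tokens)][1]
            for i in range(replicas)
        ]


node_names = ["node-a", "node-b", "node-c"]  # hypothetical node names
ring = Ring(node_names)
owners = ring.owners("user:42")
```

The local lookup is a binary search over the sorted tokens; the point is that no other node has to be asked where the key lives.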

Page 17: Big Data - SysFera presentation at the CSCI

Neo4j
• Graph oriented
• Fully ACID transactions
• Data is stored as a graph/network
  – Nodes and relationships with properties
  – "Property graph" or "edge-labeled multidigraph"
• Queries
  – Indexing of nodes and properties
  – Graph traversal
• Disk-based, native storage
• Java, REST API
• Master-slave load balancing
• Use case: social network
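
A property graph and a traversal query can be sketched in a few lines of plain Python, using the social network use case. Nothing here is the Neo4j API; the data, the relationship types and the `traverse` helper are invented for the example.

```python
from collections import deque

# Nodes and relationships both carry properties (a "property graph").
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob": {"label": "Person", "age": 29},
    "carol": {"label": "Person", "age": 41},
    "dave": {"label": "Person", "age": 52},
    "acme": {"label": "Company"},
}
relationships = [
    ("alice", "KNOWS", "bob", {"since": 2009}),
    ("bob", "KNOWS", "carol", {"since": 2011}),
    ("carol", "KNOWS", "dave", {"since": 2012}),
    ("alice", "WORKS_AT", "acme", {"role": "engineer"}),
]


def neighbours(node_id, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [
        end for start, kind, end, _props in relationships
        if start == node_id and kind == rel_type
    ]


def traverse(start, rel_type, max_depth):
    """Breadth-first traversal: {node: depth} for nodes within max_depth hops."""
    depths = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in neighbours(current, rel_type):
            if nxt not in seen:
                seen.add(nxt)
                depths[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return depths


friends_of_friends = traverse("alice", "KNOWS", max_depth=2)
```

The traversal only touches the nodes it reaches, which is the property that makes "friends of friends" queries natural in a graph database.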

Page 18: Big Data - SysFera presentation at the CSCI

PaaS Databases
• Different providers
  – Amazon: RDS, SimpleDB
  – Google: AppEngine (GQL)
  – Microsoft: SQL Azure
• Different cost models
  – CPU hour
  – CPU hour + traffic
  – Monthly fee + CPU hour + traffic
All depend on the load (number of users)
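
Because every cost model depends on the load, the cheapest one can change as the number of users grows. The arithmetic below compares the three models; every price and per-user usage figure is hypothetical and chosen only to show the crossover, not taken from any provider.

```python
# All figures are hypothetical; prices are in cents per month.
CPU_HOURS_PER_USER = 2   # assumed monthly CPU hours consumed per user
TRAFFIC_GB_PER_USER = 1  # assumed monthly traffic per user, in GB


def monthly_costs(users):
    """Monthly cost of each model, in cents, for a given number of users."""
    cpu_hours = users * CPU_HOURS_PER_USER
    traffic_gb = users * TRAFFIC_GB_PER_USER
    return {
        "cpu_hour": 12 * cpu_hours,
        "cpu_hour_plus_traffic": 8 * cpu_hours + 10 * traffic_gb,
        "fee_plus_cpu_hour_plus_traffic": 5000 + 4 * cpu_hours + 5 * traffic_gb,
    }


def cheapest(users):
    """Name of the cheapest model for a given number of users."""
    costs = monthly_costs(users)
    return min(costs, key=costs.get)


small_load = cheapest(100)
large_load = cheapest(10_000)
```

With these assumed prices the pure CPU-hour model wins at 100 users, while the model with a monthly fee and lower unit prices wins at 10,000 users.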

Page 19: Big Data - SysFera presentation at the CSCI

SOLUTIONS

Page 20: Big Data - SysFera presentation at the CSCI

GO-Transfer: Data transfer as SaaS
• Reliable file transfer
  – Easy “fire-and-forget” transfers
  – Automatic fault recovery
  – High performance
  – Across multiple security domains
• No IT required
  – Software as a Service (SaaS)
    • No client software installation
    • New features automatically available
  – Consolidated support & troubleshooting
  – Works with existing GridFTP servers
  – Globus Connect solves “last mile problem”

GO-Transfer is the initial offering of the US National Science Foundation’s XSEDE User Access Services (XUAS)

© Ian Foster

Page 21: Big Data - SysFera presentation at the CSCI

Hadoop environment
• ZOOKEEPER (Coordination)
• PIG (Data Flow)
• HIVE (Batch SQL)
• SQOOP (Data Import)
• AVRO (Serialization)
• HDFS (Hadoop Distributed File System – Unstructured Storage)
• CHUKWA (Displaying, Monitoring, Analysing Logs)
• MAP REDUCE (Job scheduling – Raw processing)
• HBASE (Real Time Query)
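
The MapReduce component is easiest to grasp through the classic word count. The sketch below runs the three phases (map, shuffle, reduce) in a single Python process; it is not the Hadoop API, and in a real cluster the map and reduce tasks would run on many nodes, close to the HDFS blocks they read.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word of one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big compute", "data locality"])
```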

Page 22: Big Data - SysFera presentation at the CSCI

IBM Big Data Platform
• Platform categories: Hadoop, Stream Computing, MPP Data Warehouse, Information Integration
• Products
  – InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
  – InfoSphere Streams: Low latency analytics for streaming data
  – IBM Netezza High Capacity Appliance: Queryable archive, structured data
  – IBM Netezza 1000: BI + ad hoc analytics, structured data
  – IBM Smart Analytics System: Operational analytics on structured data
  – IBM Informix Timeseries: Time-structured analytics
  – IBM InfoSphere Warehouse: Large volume structured data analytics
  – InfoSphere Information Server: High volume data integration and transformation

Page 23: Big Data - SysFera presentation at the CSCI

SysFera-DS

Page 24: Big Data - SysFera presentation at the CSCI

Dataflows
• Iteration strategies
• Automatic parallelism
• Control structures (if/then/else, do/while)
• Fault tolerant
• Multi-workflow scheduling

Page 25: Big Data - SysFera presentation at the CSCI

DAGDA
• Meta data-manager
• Data management from end to end
• Data replication
  – Explicit
  – Implicit
• Data persistency
• Memory and disk quotas
• Replacement algorithms (LRU, LFU, FIFO)
• Best source selection
• Strong link with task manager
• Pluggable policies, local data managers
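
The combination of quotas, replacement algorithms and pluggable policies can be illustrated with a bounded store whose eviction rule is passed in as a function. This is a sketch of the general technique only, not DAGDA's interface; the class, the statistics it keeps and the policy functions are assumptions made for the example.

```python
class DataCache:
    """Bounded store that evicts one item, chosen by a pluggable policy."""

    def __init__(self, capacity, policy):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.policy = policy  # function: stats dict -> key to evict
        self.items = {}
        self.stats = {}       # key -> {"inserted", "last_used", "uses"}
        self.clock = 0        # logical time, incremented on every operation

    def _tick(self):
        self.clock += 1
        return self.clock

    def get(self, key):
        if key not in self.items:
            return None
        self.stats[key]["last_used"] = self._tick()
        self.stats[key]["uses"] += 1
        return self.items[key]

    def put(self, key, value):
        if key not in self.items and len(self.items) >= self.capacity:
            victim = self.policy(self.stats)
            del self.items[victim]
            del self.stats[victim]
        now = self._tick()
        if key in self.stats:
            self.stats[key]["last_used"] = now
        else:
            self.stats[key] = {"inserted": now, "last_used": now, "uses": 0}
        self.items[key] = value


def fifo(stats):
    """Evict the item that was inserted first."""
    return min(stats, key=lambda k: stats[k]["inserted"])


def lru(stats):
    """Evict the least recently used item."""
    return min(stats, key=lambda k: stats[k]["last_used"])


def lfu(stats):
    """Evict the least frequently used item (oldest access breaks ties)."""
    return min(stats, key=lambda k: (stats[k]["uses"], stats[k]["last_used"]))


def evicted_key(policy):
    """Run one fixed access pattern and report which key the policy evicts."""
    cache = DataCache(capacity=3, policy=policy)
    for key in ("a", "b", "c"):
        cache.put(key, key.upper())
    for key in ("a", "a", "b", "b", "c", "a"):
        cache.get(key)
    cache.put("d", "D")  # the cache is full, so one of a, b, c must go
    return ({"a", "b", "c"} - set(cache.items)).pop()
```

On this access pattern the three policies disagree: FIFO evicts "a" (oldest insert), LRU evicts "b" (least recently read) and LFU evicts "c" (fewest reads).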

Page 26: Big Data - SysFera presentation at the CSCI

Thank you!

Questions?

[email protected]
www.sysfera.com

Page 27: Big Data - SysFera presentation at the CSCI

Bibliography
• « Big Data & Open Source: Une convergence inévitable ? », Stefane Fermigier, http://www.fermigier.com/blog/2012/03/new-whitepaper-big-data-open-source/
• « Visual Guide to NoSQL Systems », http://blog.beany.co.kr/archives/275
• « The Cassandra Distributed Database », Eric Evans, http://www.parleys.com/#st=5&id=1866&sl=40
• « Big Data Architecture », Julio Philippe, http://www.slideshare.net/PhilippeJulio/big-data-architecture
• « Big Data in Real-Time analysis at Twitter », Nick Allen, http://www.slideshare.net/nkallen/q-con-3770885
• …