Big Data - SysFera presentation at the CSCI
Big Data Technologies
Benjamin Depardon
SysFera
10 April 2023
SysFera
• 2001: Research project from the Graal team (Inria/ENS)
  – DIET: grid middleware
• 2007: SysFera-DS used within the Décrypthon project
  – Used in production 24/7/365 since then
  – Selected by IBM to replace Univa-UD
• 2010: Creation of SysFera, an Inria spin-off
• 2012: A team of 14 (R&D: 4 engineers and 5 PhDs)
  – Supported by two experts from Inria and ENS
  – SysFera-DS
What is Big Data?
• All kinds of data
• Valuable insight, but difficult to extract
• Several dimensions
  – Variety
    • Structured/unstructured
    • Text, audio, video…
  – Velocity
    • Time sensitivity
    • Streaming
  – Volume
    • Large files
    • Small files in large quantities
  – Variability
    • Different meanings/formats over different time periods
What can you do with Big Data?
• Analyze a variety of information
  – Social media/sentiment analysis
  – Geospatial analysis
  – Brand strategy
  – Scientific research
  – Epidemic early warning system
  – Market analysis
  – Video analysis
  – Audio analysis
• Analyze information in motion
  – Smart grid management
  – Multimodal surveillance
  – Real-time promotions
  – Cyber security
  – ICU monitoring
  – Options trading
  – Click-stream analysis
  – CDR processing
  – IT log analysis
  – RFID tracking & analysis
• Analyze extreme volumes of information
  – Transaction analysis to create insight-based product/service offerings
  – Fraud modeling & detection
  – Risk modeling & management
  – Social media/sentiment analysis
  – Environmental analysis
• Discovery & experimentation
  – Sentiment analysis
  – Brand strategy
  – Scientific research
  – Ad-hoc analysis
  – Model development
  – Hypothesis testing
  – Transaction analysis to create insight-based product/service offerings
• Manage and plan
  – Operational analytics, BI reporting
  – Planning and forecasting analysis
  – Predictive analysis
  – …
What can you do with Big Data?
• Utilities
  – Weather impact analysis on power generation
  – Transmission monitoring
  – Smart grid management
• Retail
  – 360° view of the customer
  – Click-stream analysis
  – Real-time promotions
• Law Enforcement
  – Real-time multimodal surveillance
  – Situational awareness
  – Cyber security detection
• Transportation
  – Weather and traffic impact on logistics and fuel consumption
• Financial Services
  – Fraud detection
  – Risk management
  – 360° view of the customer
• IT
  – Transition log analysis for multiple transactional systems
  – Cybersecurity
• Health & Life Sciences
  – Epidemic early warning system
  – ICU monitoring
  – Remote healthcare monitoring
• Telecommunications
  – CDR processing
  – Churn prediction
  – Geomapping / marketing
  – Network monitoring
What do you need?
• Hardware
  – Storage capacity
  – Computing power
• Software
  – Storage
    • Filesystems
    • Databases
  – Computation framework
DISTRIBUTED FILESYSTEMS
HDFS
• Hadoop Distributed File System
• Open source (Apache)
• Design
  – High throughput instead of low latency
  – Large data sets (large files), data locality
  – Fault tolerance (replication)
  – Write once and read many (WORM)
  – Userspace
• Limitations
  – Write-once model
  – Cannot be mounted by existing OS
  – No quotas/access permissions
  – Name node is a single point of failure
• Used by Yahoo, Twitter, Rackspace, LinkedIn, Facebook…
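The "large files" and "fault tolerance (replication)" points can be illustrated with a small sketch. It cuts a file into fixed-size blocks and assigns each block to several data nodes. The 64 MiB block size, the replication factor of 3, the node names, and the round-robin placement are all assumptions made for illustration; real HDFS placement is rack-aware and decided by the name node.

```python
from itertools import cycle, islice

BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size (64 MiB), for illustration
REPLICATION = 3                # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    This is a simplification: it only guarantees that the copies of one
    block land on different nodes.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    placement = {}
    for index, _block in enumerate(blocks):
        start = index % len(data_nodes)
        placement[index] = list(islice(cycle(data_nodes), start, start + replication))
    return placement


# A hypothetical 200 MiB file on four hypothetical data nodes.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
for index, nodes in placement.items():
    print(index, blocks[index], nodes)
```

Losing any single node in this layout still leaves two copies of every block, which is the property the replication bullet refers to.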
GlusterFS
• Open source (GPLv3) NAS file system
• Runs in userspace
• File-based distributed mirroring, replication, striping, load balancing
• FUSE, POSIX compliant
• Storage quotas
• No metadata server (fully distributed architecture, elastic hash)
• Unified global namespace: aggregation of disk and memory in a single pool
• Data is stored in logical volumes that are abstracted from the hardware and logically partitioned from each other
• Multiprotocol client support: GlusterFS native, NFS, CIFS, HTTP, WebDAV, FTP
• Real-time self-healing
• VM live replication
Lustre
• Open source (GPL)
• Object based: separate metadata and file data
  – Metadata Servers (MDS) nodes
  – Object Storage Servers (OSS) nodes
• Consistency: Lustre distributed lock manager (MDS and OSS)
• Performance:
  – Data can be striped
  – MDT is only involved in pathname and permission checks, and is not involved in any file I/O operations
• POSIX interface
• Lustre Network (LNET): InfiniBand, TCP/IP, Myrinet…
• Targeted to manage large files
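The "data can be striped" point can be sketched as a simple address calculation. The function below maps a byte offset in a file to the storage object that holds it and the offset inside that object, using round-robin (RAID-0 style) striping. The function name `locate` and the default values (1 MiB stripes over 4 objects) are assumptions chosen for the example, not values taken from the slides.

```python
def locate(offset, stripe_size=1 << 20, stripe_count=4):
    """Map a file offset to (object index, offset inside that object).

    Stripes are laid out round-robin: stripe 0 goes to object 0,
    stripe 1 to object 1, and so on, wrapping after stripe_count.
    """
    stripe_number = offset // stripe_size
    object_index = stripe_number % stripe_count
    # Full rounds already written to this object, plus the position
    # inside the current stripe.
    object_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return object_index, object_offset


for offset in (0, 1 << 20, 4 << 20, (5 << 20) + 10):
    print(offset, locate(offset))
```

Because consecutive stripes live on different objects, a large sequential read can be served by several storage servers in parallel, without involving the metadata server.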
DATABASES
CAP theorem (Brewer's theorem)
It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
• Consistency
• Availability
• Partition tolerance
NoSQL
• Relaxes ACID constraints
• 4 types of NoSQL databases
  – Key-value (Memcached, Voldemort): data agnostic
  – Document oriented (CouchDB, MongoDB): data conscious
  – Column oriented (Bigtable, HBase, Cassandra)
  – Graph (Neo4j)
• Requires more work on the client side
Memcached
• Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load
• Simple key/value store
• Smarts half in client, half in server
• Servers are disconnected from each other
• O(1) everything
• Forgetting data is a feature
• Used by LiveJournal, Flickr, WordPress.org, Wikipedia, YouTube…
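A minimal sketch of "smarts half in client" and "forgetting data is a feature", using plain in-memory dictionaries as stand-ins for memcached servers rather than a real client library. The class names, the CRC32-modulo server choice, and the `load_from_database` callback are hypothetical; the point is only that the client alone decides which server holds a key, and that a miss is handled by falling back to the database.

```python
import zlib


class CacheServer:
    """Stand-in for one cache server: an isolated in-memory key/value store."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


class ShardedClient:
    """The client picks the server; servers never talk to each other."""

    def __init__(self, servers):
        self.servers = servers

    def _server_for(self, key):
        # Assumed scheme: stable hash of the key modulo the server count.
        return self.servers[zlib.crc32(key.encode("utf-8")) % len(self.servers)]

    def get(self, key):
        return self._server_for(key).get(key)

    def set(self, key, value):
        self._server_for(key).set(key, value)


def cached_lookup(client, key, load_from_database):
    """Return the cached value, reloading it when the cache has forgotten it."""
    value = client.get(key)
    if value is None:
        value = load_from_database(key)
        client.set(key, value)
    return value


servers = [CacheServer() for _ in range(3)]
client = ShardedClient(servers)
database_hits = []


def slow_query(key):
    database_hits.append(key)  # record every trip to the "database"
    return key.upper()


first = cached_lookup(client, "user:42", slow_query)
second = cached_lookup(client, "user:42", slow_query)  # served from the cache
```

Since the cache is never the only copy of the data, a server may drop entries at any time and the application stays correct; it only pays one extra database query.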
MongoDB
• Document oriented
• Transport and storage: BSON format (derived from JSON, but binary)
• Queries
  – No join
  – Map/reduce
• Database contains collections
• Collections contain documents
• Master-slave replication
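The document model and the map/reduce query style can be sketched in plain Python, without a MongoDB driver. The `orders` collection, its fields, and the helper names are invented for the example. Customer details are embedded in each order document, which is the usual way to live without joins, and the aggregation is expressed as a map function plus a reduce function.

```python
from collections import defaultdict

# A "database" holding one "collection" of "documents" (nested dictionaries).
database = {
    "orders": [
        {"_id": 1, "customer": {"name": "Ada", "city": "Lyon"}, "total": 40},
        {"_id": 2, "customer": {"name": "Bob", "city": "Paris"}, "total": 25},
        {"_id": 3, "customer": {"name": "Ada", "city": "Lyon"}, "total": 10},
    ]
}


def map_order(document):
    """Emit (key, value) pairs for one document."""
    yield document["customer"]["name"], document["total"]


def reduce_totals(key, values):
    """Combine all values emitted for one key."""
    return sum(values)


def map_reduce(collection, map_fn, reduce_fn):
    """Group the emitted pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for document in collection:
        for key, value in map_fn(document):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


totals = map_reduce(database["orders"], map_order, reduce_totals)
print(totals)
```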
Cassandra
• Column oriented (inspired by Bigtable & Dynamo)
• Notion of super-columns
  – (Sorted) associative array of columns
• Range queries on keys
• Low latency: sequential access to disk
• O(1) DHT
• Eventual consistency
• Values limited to 2 GB
• RPC with Thrift
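The DHT bullet can be illustrated with a consistent-hashing ring, the partitioning idea Cassandra inherits from Dynamo. This is a simplified sketch, not Cassandra's partitioner: the SHA-256 token function, one token per node, the node names, and a replication factor of 2 are assumptions. Each node owns a position on the ring, and a key is stored on the first nodes found clockwise from the key's own position.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (a 64-bit integer)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes, replication=2):
        pairs = sorted((token(node), node) for node in nodes)
        self.tokens = [position for position, _ in pairs]
        self.nodes = [node for _, node in pairs]
        self.replication = min(replication, len(self.nodes))

    def replicas(self, key):
        """Walk clockwise from the key's token and take the next nodes."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.nodes)
        return [self.nodes[(start + step) % len(self.nodes)]
                for step in range(self.replication)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user:42"))
```

Any client can compute the owners of a key locally, so no central directory is consulted. Adding a node only moves the keys that fall in the new node's arc of the ring; every other key keeps its primary owner.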
Neo4j
• Graph oriented
• Fully ACID transactions
• Data is stored as a graph/network
  – Nodes and relationships with properties
  – "Property graph" or "edge-labeled multidigraph"
• Queries
  – Indexing of nodes and properties
  – Graph traversal
• Disk-based, native storage
• Java, REST API
• Master-slave load balancing
• Use case: social network
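The property-graph model and the social-network use case can be sketched with plain Python structures instead of the Neo4j API. The people, the relationship types, and the helper names are made up for the example. Nodes and relationships both carry properties, and the query is a breadth-first traversal that finds friends of friends.

```python
from collections import deque

# Nodes with properties.
nodes = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
    4: {"name": "Dave"},
}

# Directed, typed relationships with their own properties:
# (start node, type, end node, properties).
relationships = [
    (1, "KNOWS", 2, {"since": 2010}),
    (2, "KNOWS", 3, {"since": 2011}),
    (1, "KNOWS", 4, {"since": 2012}),
    (3, "WORKS_WITH", 4, {}),
]


def neighbours(node_id, rel_type):
    """Nodes reachable from node_id over one relationship of the given type."""
    return [end for start, kind, end, _props in relationships
            if start == node_id and kind == rel_type]


def traverse(start, rel_type, max_depth):
    """Breadth-first traversal; returns {node id: depth} for reachable nodes."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if depth_of[current] == max_depth:
            continue
        for following in neighbours(current, rel_type):
            if following not in depth_of:
                depth_of[following] = depth_of[current] + 1
                queue.append(following)
    return depth_of


reach = traverse(1, "KNOWS", max_depth=2)
friends_of_friends = sorted(nodes[n]["name"] for n, depth in reach.items() if depth == 2)
print(friends_of_friends)
```

In a relational database the same question needs a self-join per hop; in a graph store it is a walk along stored relationships.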
PaaS Databases
• Different providers
  – Amazon: RDS, SimpleDB
  – Google: AppEngine (GQL)
  – Microsoft: SQL Azure
• Different cost models
  – CPU hour
  – CPU hour + traffic
  – Monthly fee + CPU hour + traffic
All depend on the load (number of users)
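The three cost models can be compared with a few lines of arithmetic. Every number below is hypothetical: the rates, the monthly fee, and the rule turning a number of users into CPU hours and traffic are placeholders, not prices from any provider. Amounts are kept in integer cents to avoid rounding surprises.

```python
def cpu_only(cpu_hours, cents_per_cpu_hour):
    """Model 1: pay per CPU hour."""
    return cpu_hours * cents_per_cpu_hour


def cpu_and_traffic(cpu_hours, traffic_gb, cents_per_cpu_hour, cents_per_gb):
    """Model 2: pay per CPU hour and per GB transferred."""
    return cpu_only(cpu_hours, cents_per_cpu_hour) + traffic_gb * cents_per_gb


def fee_cpu_and_traffic(monthly_fee_cents, cpu_hours, traffic_gb,
                        cents_per_cpu_hour, cents_per_gb):
    """Model 3: fixed monthly fee on top of CPU hours and traffic."""
    return monthly_fee_cents + cpu_and_traffic(
        cpu_hours, traffic_gb, cents_per_cpu_hour, cents_per_gb)


def estimate_load(users):
    """Hypothetical load rule: 2 CPU hours and 1 GB of traffic per user."""
    return users * 2, users * 1


cpu_hours, traffic_gb = estimate_load(100)
bills = {
    "cpu": cpu_only(cpu_hours, 10),
    "cpu+traffic": cpu_and_traffic(cpu_hours, traffic_gb, 10, 12),
    "fee+cpu+traffic": fee_cpu_and_traffic(1000, cpu_hours, traffic_gb, 10, 12),
}
print(bills)
```

Running the same functions for several user counts shows how each model scales with the load, which is the comparison the slide asks for.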
SOLUTIONS
GO-Transfer: Data transfer as SaaS
• Reliable file transfer
  – Easy "fire-and-forget" transfers
  – Automatic fault recovery
  – High performance
  – Across multiple security domains
• No IT required
  – Software as a Service (SaaS)
    • No client software installation
    • New features automatically available
  – Consolidated support & troubleshooting
  – Works with existing GridFTP servers
  – Globus Connect solves the "last mile problem"
• GO-Transfer is the initial offering of the US National Science Foundation's XSEDE User Access Services (XUAS)
© Ian Foster
Hadoop environment
• ZooKeeper (coordination)
• Pig (data flow)
• Hive (batch SQL)
• Sqoop (data import)
• Avro (serialization)
• HDFS (Hadoop Distributed File System, unstructured storage)
• Chukwa (displaying, monitoring, analysing logs)
• MapReduce (job scheduling, raw processing)
• HBase (real-time query)
IBM Big Data Platform
• Categories: Hadoop, Stream Computing, MPP Data Warehouse, Information Integration
• Products
  – InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
  – InfoSphere Streams: low latency analytics for streaming data
  – IBM Netezza High Capacity Appliance: queryable archive, structured data
  – IBM Netezza 1000: BI + ad hoc analytics, structured data
  – IBM Smart Analytics System: operational analytics on structured data
  – IBM Informix Timeseries: time-structured analytics
  – IBM InfoSphere Warehouse: large volume structured data analytics
  – InfoSphere Information Server: high volume data integration and transformation
SysFera-DS
Dataflows
• Iteration strategies
• Automatic parallelism
• Control structures (if/then/else, do/while)
• Fault tolerant
• Multi-workflow scheduling
DAGDA
• Meta data-manager
• Data management from end to end
• Data replication
  – Explicit
  – Implicit
• Data persistency
• Memory and disk quotas
• Replacement algorithms (LRU, LFU, FIFO)
• Best source selection
• Strong link with task manager
• Pluggable policies, local data managers
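The combination of quotas and replacement algorithms can be sketched with the LRU policy. This is an illustration of the idea only, not DAGDA's code or API: the class name, the byte-based quota, and the method names are assumptions. When a new item would exceed the quota, the least recently used items are evicted until it fits.

```python
from collections import OrderedDict


class LruDataStore:
    """Byte-quota store that evicts the least recently used items first."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self.items = OrderedDict()  # oldest access first, newest last

    def get(self, data_id):
        if data_id not in self.items:
            return None
        self.items.move_to_end(data_id)  # mark as most recently used
        return self.items[data_id]

    def put(self, data_id, payload):
        if len(payload) > self.quota_bytes:
            raise ValueError("item larger than the quota")
        if data_id in self.items:
            self.used_bytes -= len(self.items.pop(data_id))
        # Evict from the least recently used end until the item fits.
        while self.used_bytes + len(payload) > self.quota_bytes:
            _evicted_id, evicted = self.items.popitem(last=False)
            self.used_bytes -= len(evicted)
        self.items[data_id] = payload
        self.used_bytes += len(payload)


store = LruDataStore(quota_bytes=10)
store.put("a", b"aaaa")
store.put("b", b"bbbb")
store.get("a")           # "a" becomes the most recently used item
store.put("c", b"cccc")  # over quota: "b" is evicted, not "a"
```

Swapping the eviction order (insertion order for FIFO, access counts for LFU) changes the policy without touching the quota logic, which is one way to read "pluggable policies".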
Bibliography
• « Big Data & Open Source: Une convergence inévitable ? », Stefane Fermigier, http://www.fermigier.com/blog/2012/03/new-whitepaper-big-data-open-source/
• « Visual Guide to NoSQL Systems », http://blog.beany.co.kr/archives/275
• « The Cassandra Distributed Database », Eric Evans, http://www.parleys.com/#st=5&id=1866&sl=40
• « Big Data Architecture », Julio Philippe, http://www.slideshare.net/PhilippeJulio/big-data-architecture
• « Big Data in Real-Time analysis at Twitter », Nick Allen, http://www.slideshare.net/nkallen/q-con-3770885
• …