Semantic Web Meetup, 14 November 2013
Big Data & Hadoop
Jean-Pierre König
3 October 2013
Semantic Web Meetup
COMPANY PROFILE
WE ARE HERE
From its base in Kreuzlingen, Switzerland, YMC has served well-known national and international customers since 2001.
WE WORK WITH
Customers
WE WORK WITH
Partners
WE CREATE
WEB SOLUTIONS
§ Hosting & support
§ Custom web solutions
§ Social media applications (e.g. corporate blogs, wikis, Facebook apps)
§ Web strategies
§ Shop systems, websites, intranets
BIG DATA ANALYTICS
§ Recommender systems (e.g. for apps, web shops, websites and intranets)
§ Predictive models (e.g. for the interests of app users)
§ Integrated search systems (e.g. also for unstructured data)
§ Tailor-made web analytics systems (e.g. with real-time metrics and effects in social networks)
§ Training (Apache Hadoop)
MOBILE APPLICATIONS
§ Geolocation for location-specific services
§ Integration of social networks such as Facebook and Twitter
§ Apps for tablets and smartphones (iPhone, Android)
§ Mobile strategies
WHAT IS BIG DATA
§ More general
  § Data sets that have become so large and complex that they are difficult to process, including capture, curation, storage, search, sharing, transfer, analysis, and visualization
  § They are difficult to work with using most RDBMS, statistics and visualization systems
  § They require massively parallel software running on tens, hundreds, or even thousands of servers
§ The 3 V's by Gartner
  § "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (2012)
WHAT DRIVES BIG DATA
§ Human-generated data
  § Documents, transaction data, CRM, social media: your working life is devoted to looking at screens and typing more data into some system.
§ Sensor-generated data
  § The trend is that a large part of the physical world around us will eventually be online in some form: the Internet of Things.
§ Machine-generated data will quickly top human-generated data
BIG DATA BUSINESS DRIVERS
Web Archives
Data Aggregation
Video, Audio & Image Processing
Data Pre-processing
Infrastructure Management
Sampling
Predictive Analytics
360° Customer Experience Management
Social Media Analysis
(Mass) Personalization
Recommendation Engines
Data as a Service
Research
Fraud protection
Risk management
Environment Safety
Digital Security
Infrastructure Observation
Increase Revenue
Improve Decision-Making
Risk Prevention
THE EMERGING SOLUTIONS
§ NoSQL* movement
  § NoSQL databases are finding significant and growing industry use in big data and real-time web applications.
§ Hadoop and its ecosystem
  § Enterprise-grade solutions, consulting, support
  § Top 3 vendors: Cloudera, Hortonworks, MapR
  § Adoption throughout the software industry, e.g. IBM BigInsights, Microsoft HDInsight, Oracle Big Data Appliance, EMC/Spring/VMware Pivotal HD, HP HAVEn, Intel Distribution, Dell w/Cloudera
* Also referred to as "Not only SQL"
HADOOP IN A NUTSHELL
WHAT IS HADOOP
§ "An open-source implementation of frameworks for reliable, scalable, distributed computing and data storage" (official Hadoop website)
§ "A reliable shared storage and analysis system" (O'Reilly: Hadoop – The Definitive Guide)
§ "A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment" (Margaret Rouse)
§ "A complete, open-source ecosystem for capturing, organizing, storing, searching, sharing, analyzing, visualizing, and ..." (Jack Norris)
A BRIEF HISTORY OF HADOOP
§ In 2002 Doug Cutting* started Nutch, an open-source web search engine
§ Fortunately, Google published papers that
  § described the architecture of its distributed file system, called GFS (2003)
  § introduced MapReduce (2004)
§ In 2005 Nutch released a new version with NDFS and MapReduce; in 2006 these parts moved out of Nutch to form an independent subproject called Hadoop
§ Cutting joined Yahoo! to build and run Hadoop at web scale
§ In 2008 Hadoop became a top-level Apache project and was in use at Yahoo! (10k cores), Last.fm, Facebook and the New York Times
* Doug Cutting is also the creator of Apache Lucene
HADOOP IN A NUTSHELL
§ HDFS
  § A distributed file system for storage
  § Is highly fault-tolerant and is designed to be deployed on low-cost/commodity hardware
  § 1 master called NameNode, many DataNodes (10+)
§ MapReduce
  § A batch query processor to run an ad hoc query against your whole dataset and get the results in a reasonable time
  § 1 master called JobTracker, many TaskTrackers (10+)
HADOOP FACT-SHEET
MapReduce/distributed processing
§ Economical
  § Commodity hardware
§ Scalable
  § Add nodes to increase parallelism
§ Fault tolerant
  § Auto-recovers from job failures
§ Data locality
  § Process where the data resides
HDFS/distributed storage
§ Economical
  § Commodity hardware
§ Scalable
  § Rebalances data on new nodes
§ Fault tolerant
  § Detects faults and auto-recovers
§ Reliable
  § Maintains multiple copies of data
§ High throughput
  § Because data is distributed
HADOOP PRINCIPLES
§ Schema on read
§ Data locality
§ No shared memory or disks
§ Scales out to thousands of servers
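Schema on read means the raw data is stored untouched and a structure is imposed only when a job reads it. A minimal Python sketch of the idea, assuming a small comma-separated log; the sample records, the field names and the `read_with_schema` helper are hypothetical and not part of Hadoop:

```python
import csv
import io

# Raw records are stored exactly as they arrived; nothing is validated on write.
# (Hypothetical sample data.)
RAW_LOG = "2013-11-14,berlin,377\n2013-11-15,kreuzlingen,12\n"


def read_with_schema(raw, schema):
    """Apply a list of (field name, type) pairs to raw delimited text at read time."""
    for row in csv.reader(io.StringIO(raw)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


# Two readers interpret the same stored bytes with different schemas.
as_visits = list(read_with_schema(RAW_LOG, [("day", str), ("city", str), ("visits", int)]))
as_strings = list(read_with_schema(RAW_LOG, [("day", str), ("place", str), ("count", str)]))
```

The stored text never changes; only the schema passed in by each reader does, which is what lets new questions be asked of old data without reloading it.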
HADOOP SYSTEM COMPONENTS
            Masters                         Slaves (many of them)
HDFS        NameNode, Secondary NameNode    DataNode
MapReduce   JobTracker                      TaskTracker
WRITING FILES ON HDFS*
File.txt is split into Block A, Block B and Block C.
Client → NameNode: "Hey, I want to write A, B and C of my File.txt."
NameNode → Client: "OK, write to DataNodes 1, 5 and 9."
The client writes Block A, Block B and Block C to those DataNodes; the copies A`, B` and C` are replicated to further DataNodes (DataNode 1, 5, 6, 9, ... N) spread across Rack 1 and Rack 2.
* Replication factor of 3
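The write path above can be sketched in a few lines of Python. This is a simplified illustration, not HDFS code: the rack assignment of the DataNodes and the `place_block` helper are assumptions, and the rule used (first copy on the node named by the NameNode, remaining copies preferably on another rack) is only loosely modeled on HDFS's rack-aware placement. It does not reproduce the exact layout drawn on the slide.

```python
REPLICATION_FACTOR = 3

# Hypothetical cluster layout: DataNode id -> rack name.
RACKS = {1: "rack1", 5: "rack1", 6: "rack2", 9: "rack2"}


def place_block(first_node, racks, replication=REPLICATION_FACTOR):
    """Return the DataNodes that hold one block.

    The first copy goes to first_node; further copies prefer nodes on a
    different rack so that losing a whole rack does not lose the block.
    """
    local_rack = racks[first_node]
    remote = sorted(n for n, r in racks.items() if r != local_rack)
    local = sorted(n for n, r in racks.items() if r == local_rack and n != first_node)
    return ([first_node] + remote + local)[:replication]


# The NameNode answered "write to DataNodes 1, 5 and 9" for blocks A, B and C.
placement = {block: place_block(node, RACKS)
             for block, node in [("A", 1), ("B", 5), ("C", 9)]}
```

With this layout every block ends up on three distinct DataNodes that span both racks, which is the property the replication factor of 3 is meant to provide.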
READING FILES FROM HDFS
Client → NameNode: "Tell me the block locations of File.txt."
NameNode → Client:
A → DataNode 1, 5, 6
B → DataNode 1, 5, N
C → DataNode 5, 9, 6
The client then reads Block A, Block B and Block C directly from the DataNodes (DataNode 1, 5, 6, 9, ... N, spread across Rack 1 and Rack 2), which hold the blocks and their replicas A`, B` and C`.
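The read path is essentially a metadata lookup followed by direct reads from the DataNodes. A minimal Python sketch using the block locations from the slide; the `locate_blocks` and `read_plan` helpers and the `dead_nodes` parameter are hypothetical names chosen for illustration:

```python
# Block locations as listed on the slide ("N" stands for the last DataNode).
BLOCK_LOCATIONS = {
    "File.txt": {
        "A": [1, 5, 6],
        "B": [1, 5, "N"],
        "C": [5, 9, 6],
    }
}


def locate_blocks(filename):
    """Play the NameNode: return the replica locations of every block."""
    return BLOCK_LOCATIONS[filename]


def read_plan(filename, dead_nodes=frozenset()):
    """Play the client: pick the first reachable replica for each block."""
    plan = []
    for block, nodes in locate_blocks(filename).items():
        live = [n for n in nodes if n not in dead_nodes]
        if not live:
            raise IOError(f"no live replica for block {block}")
        plan.append((block, live[0]))
    return plan


healthy = read_plan("File.txt")
degraded = read_plan("File.txt", dead_nodes={1, 5})
```

Because each block has three replicas, the file stays readable in this sketch even when DataNodes 1 and 5 are both unavailable.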
MAPREDUCE IN A NUTSHELL
Word Count Example
Input:    Deer Bear River / Car Car River / Deer Car Bear
Split:    (1) Deer Bear River   (2) Car Car River   (3) Deer Car Bear
Map:      (1) Deer, 1; Bear, 1; River, 1   (2) Car, 1; Car, 1; River, 1   (3) Deer, 1; Car, 1; Bear, 1
Shuffle:  Bear, 1; Bear, 1 | Car, 1; Car, 1; Car, 1 | Deer, 1; Deer, 1 | River, 1; River, 1
Reduce:   Bear, 2 | Car, 3 | Deer, 2 | River, 2
Result:   Bear, 2; Car, 3; Deer, 2; River, 2
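The word count stages can be reproduced in plain Python. This is a single-process sketch of the map, shuffle and reduce steps, not Hadoop's Java API; the function names are chosen for illustration only:

```python
from collections import defaultdict

# The three input splits from the example.
SPLITS = ["Deer Bear River", "Car Car River", "Deer Car Bear"]


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


mapped = [pair for split in SPLITS for pair in map_phase(split)]
grouped = shuffle(mapped)
result = dict(reduce_phase(word, counts) for word, counts in grouped.items())
```

On a cluster, each call to `map_phase` would run on the node holding that split, and each call to `reduce_phase` could run on a different node; only the shuffle moves data between machines.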
MAPREDUCE VS. RDBMS
§ RDBMS
  § In a centralized database system, you've got one big disk connected to 4 or 8 or 16 big processors.
§ MapReduce
  § In a Hadoop cluster, every server has 2 or 4 or 8 CPUs. You can run your job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. You map the operation out to all of those servers and then you reduce the results back into a single result set.
§ Architecturally, the reason you're able to deal with lots of data is because Hadoop spreads it out. And the reason you're able to ask complicated computational questions is because you've got all of these processors, working in parallel, harnessed together.
HADOOP ECOSYSTEM
HBASE*: HADOOP'S DATABASE
§ Unlike an RDBMS
  § No secondary indexes
  § No transactions
  § De-normalized, schema-less
§ Random read/write access to big data
§ Billions of rows and millions of columns
§ Automatic data sharding
§ Integrates with MapReduce
* Modeled after Google's Bigtable
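HBase shards a table by row-key range: each range (a region) is served by one server, and a key lookup first finds the region that covers it. A minimal Python sketch of that lookup, assuming four regions; the start keys and the `region_for` helper are hypothetical, and real HBase chooses and splits regions automatically:

```python
import bisect

# Hypothetical start keys of four regions, in sorted order.
# Region i covers every row key from its start key up to the next start key.
REGION_START_KEYS = ["", "g", "n", "t"]


def region_for(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


regions = {key: region_for(key) for key in ["apple", "g", "hadoop", "zebra"]}
```

Because rows are kept sorted by key, finding the right shard is a binary search over the region boundaries rather than a scan.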
HADOOP USE CASES
USE CASES: Data Warehousing
§ Complementary ETL process
[Diagram. Sources: OLTP, CRM, ERP, File Server, ...; Logs, Social Media, Sensors, ... Ingestion: Sqoop, Flume, Java API. Hadoop: HDFS with Pig, Hive, MapReduce. Warehouse side: ETL, Data Warehouse, Data Marts, Data Cubes. Consumers: Analytics, Visualization, Reports.]
USE CASES: Data Warehousing
§ Substitutive ETL process
[Diagram. Sources: OLTP, CRM, ERP, File Server, ...; Logs, Social Media, Sensors, ... Processing: Hadoop. Target: Data Warehouse. Consumers: Analytics, Visualization, Reports.]
USE CASES: Data Warehousing
§ (Predictive) Analytics at scale
[Diagram. Sources: OLTP, CRM, ERP, File Server, ...; Logs, Social Media, Sensors, ... Components: Hadoop, Data Warehouse. Consumers: Analytics, Visualization, Reports.]
USE CASES: Data Warehousing
§ Machine learning, natural language processing, sentiment analysis at scale
[Diagram. Sources: OLTP, CRM, ERP, File Server, ...; Logs, Social Media, Sensors, ... Components: Hadoop with ML + NLP*, Data Warehouse. Consumers: Analytics, Visualization, Reports.]
* Personalized recommendations
  § content, products, services …
THANK YOU!
YMC AG
Sonnenstrasse 4
CH-8280 Kreuzlingen
Switzerland
Photo Credits:
Slide 03: Matterhorn and Lake by Noel Reynolds
Slide 24: Hadoop Ecosystem by Rishu Shrivastava
@YMC_Big_Data
CONTACT US [email protected]
Tel. +41 (0)71 508 24 86
www.ymc.ch