Hadoop and Big Data: Revealed

Hadoop & Big Data: Revealed Presenter: Sachin Holla Date: 08/29/2014

description

Content presented at a talk on Aug. 29th. The purpose was to inform a fairly technical audience about the primary tenets of Big Data and the Hadoop stack. The talk also included a walk-through of Hadoop and parts of the Hadoop stack, i.e. Pig, Hive, and HBase.

Transcript of Hadoop and Big Data: Revealed

  • 1. Hadoop & Big Data: Revealed Presenter: Sachin Holla Date: 08/29/2014
  • 2. Big Data: An Overview
      Big Data
      - High volume
      - High velocity
      - High variety information assets
      - High veracity
      - Requires new forms of processing, like NoSQL, MapReduce, Machine Learning
      Examples
      - Large Hadron Collider: 150 million sensors delivering data 40 million times/sec; data flow > 150 million petabytes (annual), or ~500 exabytes per day
      - Tipp24 (European lotteries): analyzes billions of transactions and hundreds of customer attributes; led to a 90% decrease in the time it took to build predictive models
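The two LHC figures on slide 2 can be cross-checked with a quick unit conversion. This is a minimal sketch that assumes decimal units (1 exabyte = 1,000 petabytes) and a 365-day year; neither assumption is stated on the slide, and the function name is made up for illustration.

```python
# Unit-conversion check for the LHC figures quoted on slide 2.
# Assumptions (not from the slide): decimal units and a 365-day year.
PB_PER_EB = 1_000
DAYS_PER_YEAR = 365


def annual_pb_to_daily_eb(petabytes_per_year: float) -> float:
    """Convert an annual volume in petabytes to a daily volume in exabytes."""
    return petabytes_per_year / PB_PER_EB / DAYS_PER_YEAR


daily_eb = annual_pb_to_daily_eb(150_000_000)
print(f"{daily_eb:.0f} EB/day")
```

Under these assumptions the annual figure works out to a little over 400 exabytes per day, so the slide's "~500 exabytes per day" is best read as an order-of-magnitude rounding.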
  • 3. DATA: ON A BIG SCALE
  • 4. Hadoop: Elephant in the Room
      Apache Hadoop
      - Open-source, Java-based software framework
      - Distributed processing of large data sets
      - Runs on clusters of computers based on commodity hardware
      Hadoop's Benefits (Historical context)
      - Don't rely on hardware to provide HA (Big Iron)
      - Failures are expected and assumed
      - Framework handles failures to provide an HA computing service
      - Scale up vs. scale out
      Key Components
      - Hadoop Distributed File System (HDFS): the file system
      - Hadoop MapReduce: the programming model
      - Hadoop (v2) YARN: the resource manager
      Year | Activity
      2002 | Nutch started
      2003 | GFS white paper published
      2004 | Google MapReduce white paper
      2005 | First MR implementation
      2006 | Hadoop project in Apache
      2008 | Hadoop in Y! production
      2009 | Wins 500GB sort contest
  • 5. What's the Hadoop Arch., Kenneth? (1/2)
  • 6. What's the Hadoop Arch., Kenneth? (2/2)
  • 7. Hadoop: FAQs
      What is a Map-Reduce job and why do I care?
      - The data-processing paradigm in Hadoop
      - Batch mode or real time
      - Written in Java or in a variety of other languages (see below)
      - Higher-level frameworks such as Pig and Hive help too
      I don't drink Java anymore; what do I do?
      - Hadoop is Java-based, but Hadoop Streaming supports Python, Ruby, R, etc.
      - I/O-bound jobs: no difference. CPU-bound jobs: Java is better.
      What is Hadoop2 and how will it affect my big data needs? (See slide #14)
      - Much more scalable
      - Programming models vs. cluster & resource management
      Under what scenarios should I not use Hadoop?
      - Need answers in a hurry
      - Queries are complex, needing optimization
      - Require random, interactive access to data
      - Store sensitive data
      - Replacing a data warehouse
      What are the differences between Hadoop and a traditional database?
      - Hadoop is not a DB
      - ACID properties
      - Unstructured / mixture of data sources
      - SQL access
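To make the Map-Reduce paradigm from slide 7 concrete, here is a minimal in-process sketch in Python. It is not Hadoop's API: `map_reduce`, `index_mapper`, `index_reducer`, and the sample documents are all hypothetical names and data chosen for illustration. It only shows the three phases (map, group by key, reduce), here used to build a small inverted index.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[Tuple[str, str]], Iterator[Tuple[str, str]]],
    reducer: Callable[[str, List[str]], List[str]],
) -> Dict[str, List[str]]:
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: one reducer call per key, in sorted key order.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def index_mapper(record: Tuple[str, str]) -> Iterator[Tuple[str, str]]:
    """Emit (word, doc_id) for every word in the document text."""
    doc_id, text = record
    for word in text.lower().split():
        yield word, doc_id


def index_reducer(word: str, doc_ids: List[str]) -> List[str]:
    """Collapse the doc ids for one word into a sorted, de-duplicated list."""
    return sorted(set(doc_ids))


# Hypothetical sample documents.
docs = [
    ("d1", "hadoop stores data"),
    ("d2", "hive queries data"),
    ("d3", "Hadoop runs Hive"),
]
index = map_reduce(docs, index_mapper, index_reducer)
print(index["hadoop"])  # ['d1', 'd3']
print(index["data"])    # ['d1', 'd2']
```

On a real cluster the map and reduce calls run on different machines and the grouping step moves data across the network; the sketch keeps everything in one dictionary purely to show the data flow.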
  • 8. Hadoop Stack: Snapshot
      Technology | Domain | Description
      HDFS | File Storage | Java-based file storage; reliable and scalable access
      MapReduce | Programming Framework | Original framework for distributed processing of data
      Hadoop YARN | Resource Mgmt | Next-generation framework; supports MR and non-MR models
      Pig | ETL / Data Flow | Allows high-level analysis of large data; generates MR jobs
      Hive | SQL Interface | Data warehouse (DW); allows data summarization and ad-hoc queries
      HBase | Columnar NoSQL storage | Column-oriented NoSQL data storage system
      Sqoop | Data Exchange | Easy data import/export from Hadoop clusters
      Zookeeper | Process Coordination | Highly available system for process coordination
      Oozie | Workflow Scheduler | Helps manage complex DAG job workflows
      Ambari | Cluster Monitoring | Installation, admin & monitoring for Hadoop clusters
      Avro | Serializer | Serializes data in an efficient binary format; schemas are defined in JSON
      Spark | Real-time data processing | Powerful processing engine: speed, ease of use, and sophisticated analytics (using ML)
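The "DAG job workflows" entry for Oozie on slide 8 refers to jobs whose dependencies form a directed acyclic graph. The sketch below illustrates only that idea using Python's standard `graphlib` module (Python 3.9+). It is not Oozie's API, which defines workflows in XML, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow: Dict[str, Set[str]] = {
    "import_sales": set(),
    "clean_sales": {"import_sales"},
    "aggregate_daily": {"clean_sales"},
    "load_lookup": set(),
    "build_report": {"aggregate_daily", "load_lookup"},
}


def run_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return one valid execution order in which every job follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(workflow)
print(order)
```

Several orders can be valid for the same graph (for example, `load_lookup` may run at any point before `build_report`), which is why a scheduler is free to run independent jobs in parallel.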
  • 9. Data Science: The Scoop
      What is Data Science or a Data Scientist?
      - To understand data, to process it, to extract value from it, to visualize it, to communicate it
      - Single source vs. disparate sources
      - Mine data for insight to extract business/competitive value
      What is Machine Learning then?
      - The science of getting computers to act without being explicitly programmed
      - Machine learning and statistics may be the stars, but DS orchestrates the whole show
      Practical Uses
      - Product recommendation
      - Medical diagnosis
      - Stock trading
      - Face detection
  • 10. Demo: Let's get dirty!
      - Hadoop running on a single-node pseudo-cluster (Linux VM): start Hadoop
      - HelloWorld, Hadoop style: run a MapReduce job (wordcount)
      - No Java here: use Python scripts to run a MapReduce job
      - Lipstick on a Pig: perform ETL on some stocks/dividend data
      - Give me Hive: calculate top batter scores
      - Can you feel the HBase: dump sales data into HBase and then access it via Hive
      - Use AWS to show a real cluster: connect to AWS and start up the cluster; demo performance using the wordcount example
      * All demos, installation guide, and references available @ GitHub
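The "No Java here" demo on slide 10 uses Python scripts for a MapReduce job. The demo scripts themselves are not part of this transcript, so the following is only a sketch of what a Streaming-style wordcount might look like, assuming the usual convention of tab-separated `key<TAB>value` lines and a reducer that receives its input sorted by key. The function names and the local `run_local` pipeline are illustrative assumptions.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of consecutive lines sharing a key (input must be sorted)."""
    def key_of(line: str) -> str:
        return line.split("\t", 1)[0]

    for word, group in groupby(sorted_lines, key=key_of):
        total = sum(int(line.split("\t", 1)[1]) for line in group)
        yield f"{word}\t{total}"


def run_local(lines: Iterable[str]) -> List[str]:
    """Simulate the map -> sort -> reduce pipeline in one process."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(["Hello Hadoop", "hello big data"])
print(result)  # ['big\t1', 'data\t1', 'hadoop\t1', 'hello\t2']
```

In an actual Streaming job the mapper and reducer would be separate scripts reading standard input and writing standard output, with the framework doing the sort between them; `run_local` stands in for that step so the logic can be tried without a cluster.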
  • 11. And that's a wrap!
  • 12. Backup
  • 13. Typical Hadoop Cluster
  • 14. Hadoop Stack: Visualized
  • 15. Hadoop: v1 -> v2