Store and Process Big Data with Hadoop and Cassandra


Transcript of Store and Process Big Data with Hadoop and Cassandra

Page 1: Store and Process Big Data with Hadoop and Cassandra

Store and Process Big Data with Hadoop and Cassandra

Apache BarCamp

By Deependra Ariyadewa

WSO2, Inc.

Page 2: Store and Process Big Data with Hadoop and Cassandra

Store Data with Cassandra

● Project site : http://cassandra.apache.org

● The latest release version is 1.0.7

● Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick and Ooyala

● Cassandra users: http://www.datastax.com/cassandrausers

● The largest known Cassandra cluster has over 300 TB of data on over 400 machines.

● Commercial support: http://wiki.apache.org/cassandra/ThirdPartySupport

Page 3: Store and Process Big Data with Hadoop and Cassandra

Cassandra Deployment Architecture

[Diagram: each row key maps to a set of column/value pairs — key => {(k,v), (k,v), (k,v)} — and hash(key) determines where the key is placed on the cluster's token ring, e.g. hash(key1), hash(key2).]
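As a rough illustration of that placement, here is a minimal consistent-hashing sketch in Java. It is not Cassandra's implementation; the node names and the TreeMap-based ring are assumptions for illustration, though the RandomPartitioner does derive tokens from an MD5 hash of the key.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Illustrative sketch only: each node owns a token; a key is stored on the
    // first node whose token is >= hash(key), wrapping around the ring.
    public class RingSketch {
        private final TreeMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();

        public void addNode(String nodeName, BigInteger token) {
            ring.put(token, nodeName);
        }

        public static BigInteger token(String key) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
            return new BigInteger(1, digest); // non-negative 128-bit token
        }

        public String nodeFor(String key) throws Exception {
            SortedMap<BigInteger, String> tail = ring.tailMap(token(key));
            BigInteger token = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
            return ring.get(token);
        }
    }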

Page 4: Store and Process Big Data with Hadoop and Cassandra

How to Install Cassandra

● Download the artifact apache-cassandra-1.0.7-bin.tar.gz from http://cassandra.apache.org/download/

● Extract the archive:
  tar -xzvf apache-cassandra-1.0.7-bin.tar.gz

● Set up folder paths:
  mkdir -p /var/log/cassandra
  chown -R `whoami` /var/log/cassandra
  mkdir -p /var/lib/cassandra
  chown -R `whoami` /var/lib/cassandra
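With the directories in place, the node can be started from the extracted distribution; a quick single-node check (the -f flag keeps Cassandra running in the foreground):

  $CASSANDRA_HOME/bin/cassandra -f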

Page 5: Store and Process Big Data with Hadoop and Cassandra

How to Configure Cassandra

Main configuration file: $CASSANDRA_HOME/conf/cassandra.yaml

  cluster_name: 'Test Cluster'
  seed_provider:
      - seeds: "192.168.0.121"
  storage_port: 7000
  listen_address: localhost
  rpc_address: localhost
  rpc_port: 9160

Page 6: Store and Process Big Data with Hadoop and Cassandra

Cassandra Clustering

initial_token:

partitioner: org.apache.cassandra.dht.RandomPartitioner

http://wiki.apache.org/cassandra/Operations
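When building a cluster with the RandomPartitioner, each node is usually given an evenly spaced initial_token of i * 2^127 / N for N nodes, so data spreads evenly around the ring. A small sketch of that calculation (the cluster size of 4 is just an example):

    import java.math.BigInteger;

    // Evenly spaced tokens for RandomPartitioner: token_i = i * 2^127 / N
    public class InitialTokens {
        public static void main(String[] args) {
            int nodeCount = 4; // example cluster size
            BigInteger ringSize = BigInteger.valueOf(2).pow(127);
            for (int i = 0; i < nodeCount; i++) {
                BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                           .divide(BigInteger.valueOf(nodeCount));
                System.out.println("node " + i + " initial_token: " + token);
            }
        }
    }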

Page 7: Store and Process Big Data with Hadoop and Cassandra

Cassandra DevOps

$CASSANDRA_HOME/bin$ ./cassandra-cli --host localhost

[default@unknown] show keyspaces;
Keyspace: system:
  Replication Strategy: org.apache.cassandra.locator.LocalStrategy
  Durable Writes: true
    Options: [replication_factor:1]
  Column Families:
    ColumnFamily: HintsColumnFamily (Super)
    "hinted handoff data"
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType/org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period in seconds / keys to save : 0.0/0/all
      Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
      Key cache size / save period in seconds: 0.01/0
      GC grace seconds: 0
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      Replicate on write: true
      Bloom Filter FP chance: default
      Built indexes: []
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy

Page 8: Store and Process Big Data with Hadoop and Cassandra

Cassandra CLI

[default@apache] create column family Location with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
f04561a0-60ed-11e1-0000-242d50cf1fbf
Waiting for schema agreement...
... schemas agree across the cluster

[default@apache] set Location[00001][City]='Colombo';
Value inserted.
Elapsed time: 140 msec(s).

[default@apache] list Location;
Using default limit of 100
-------------------
RowKey: 00001
=> (column=City, value=Colombo, timestamp=1330311097464000)

1 Row Returned.
Elapsed time: 122 msec(s).
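A single column can be read back as well; a quick sketch of the cassandra-cli get command against the row inserted above (the timestamp in the output will differ):

[default@apache] get Location['00001']['City'];
=> (column=City, value=Colombo, timestamp=...)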

Page 9: Store and Process Big Data with Hadoop and Cassandra

Store Data with Hector

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

import java.util.HashMap;
import java.util.Map;

public class ExampleHelper {

    public static final String CLUSTER_NAME = "ClusterOne";
    public static final String USERNAME_KEY = "username";
    public static final String PASSWORD_KEY = "password";
    public static final String RPC_PORT = "9160";
    public static final String CSS_NODE0 = "localhost";
    public static final String CSS_NODE1 = "css1.stratoslive.wso2.com";
    public static final String CSS_NODE2 = "css2.stratoslive.wso2.com";

    public static Cluster createCluster(String username, String password) {
        Map<String, String> credentials = new HashMap<String, String>();
        credentials.put(USERNAME_KEY, username);
        credentials.put(PASSWORD_KEY, password);
        String hostList = CSS_NODE0 + ":" + RPC_PORT + ","
                        + CSS_NODE1 + ":" + RPC_PORT + ","
                        + CSS_NODE2 + ":" + RPC_PORT;
        return HFactory.createCluster(CLUSTER_NAME,
                new CassandraHostConfigurator(hostList), credentials);
    }

}
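A brief usage sketch of the helper above; the credentials are placeholders for a real username and password:

    Cluster cluster = ExampleHelper.createCluster("myUsername", "myPassword");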

Page 10: Store and Process Big Data with Hadoop and Cassandra

Store Data with Hector

Create keyspace:

    KeyspaceDefinition definition = new ThriftKsDef(keyspaceName);
    cluster.addKeyspace(definition);

Add column family:

    ColumnFamilyDefinition familyDefinition = new ThriftCfDef(keyspaceName, columnFamily);
    cluster.addColumnFamily(familyDefinition);

Write data:

    Mutator<String> mutator = HFactory.createMutator(keyspace, new StringSerializer());
    String columnValue = UUID.randomUUID().toString();
    mutator.insert(rowKey, columnFamily, HFactory.createStringColumn(columnName, columnValue));

Read data:

    ColumnQuery<String, String, String> columnQuery = HFactory.createStringColumnQuery(keyspace);
    columnQuery.setColumnFamily(columnFamily).setKey(key).setName(columnName);
    QueryResult<HColumn<String, String>> result = columnQuery.execute();
    HColumn<String, String> hColumn = result.get();
    System.out.println("Column: " + hColumn.getName() + " Value : " + hColumn.getValue() + "\n");
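The snippets above assume cluster and keyspace handles already exist. A minimal sketch of how they might be obtained, reusing the helper from the previous slide (the keyspace name "MyKeyspace" is a placeholder):

    Cluster cluster = ExampleHelper.createCluster("myUsername", "myPassword");
    Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);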

Page 11: Store and Process Big Data with Hadoop and Cassandra

Variable Consistency

● ANY: Wait until some replica has responded.

● ONE: Wait until one replica has responded.

● TWO: Wait until two replicas have responded.

● THREE: Wait until three replicas have responded.

● LOCAL_QUORUM: Wait for a quorum of replicas in the datacenter where the connection was established.

● EACH_QUORUM: Wait for quorum on each datacenter.

● QUORUM: Wait for a quorum of replicas (no matter which datacenter).

● ALL: Wait until all replicas have responded before returning to the client.
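As a worked example of the quorum levels: a quorum is floor(replication_factor / 2) + 1 replicas, so with replication_factor = 3 a QUORUM read or write waits for 2 replicas, and a QUORUM write followed by a QUORUM read is guaranteed to overlap on at least one replica holding the latest value.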

Page 12: Store and Process Big Data with Hadoop and Cassandra

Variable Consistency

Create a customized Consistency Level:

ConfigurableConsistencyLevel configurableConsistencyLevel = new ConfigurableConsistencyLevel();
Map<String, HConsistencyLevel> clmap = new HashMap<String, HConsistencyLevel>();

// Read and write MyColumnFamily at consistency level ONE
clmap.put("MyColumnFamily", HConsistencyLevel.ONE);

configurableConsistencyLevel.setReadCfConsistencyLevels(clmap);
configurableConsistencyLevel.setWriteCfConsistencyLevels(clmap);

Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", myCluster, configurableConsistencyLevel);

Page 13: Store and Process Big Data with Hadoop and Cassandra

CQL

Insert data with CQL:

cqlsh> INSERT INTO Location (KEY, City) VALUES ('00001', 'Colombo');

Retrieve data with CQL:

cqlsh> select * from Location where KEY='00001';
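If the Location column family does not exist yet (it was created through cassandra-cli on an earlier slide), it can also be created from cqlsh. A sketch in the CQL 2 dialect that ships with Cassandra 1.0; the column types here are assumptions:

cqlsh> CREATE COLUMNFAMILY Location (KEY varchar PRIMARY KEY, City varchar);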

Page 14: Store and Process Big Data with Hadoop and Cassandra

Apache Hadoop

● Project Site: http://hadoop.apache.org

● The latest version is 1.0.1

● Hadoop is in use at Amazon, Yahoo, Adobe, eBay, Facebook

● Commercial support: http://hortonworks.com, http://www.cloudera.com

Page 15: Store and Process Big Data with Hadoop and Cassandra

Hadoop Deployment Architecture

Page 16: Store and Process Big Data with Hadoop and Cassandra

How to install Hadoop

● Download the artifact from: http://hadoop.apache.org/common/releases.html

● Extract: tar -xzvf hadoop-1.0.1.tar.gz

● Copy the distribution to each data node and extract it there: scp hadoop-1.0.1.tar.gz user@datanode01:/home/hadoop

● Start Hadoop: $HADOOP_HOME/bin/start-all.sh (a minimal configuration sketch follows below)
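Before start-all.sh is run on the master node, the cluster needs a minimal configuration under $HADOOP_HOME/conf. A sketch for a small Hadoop 1.0.x cluster; the host names and ports are placeholders, not values from the slides:

conf/core-site.xml:

  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode01:9000</value>
    </property>
  </configuration>

conf/mapred-site.xml:

  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>namenode01:9001</value>
    </property>
  </configuration>

conf/slaves simply lists the data nodes, one host name per line (for example, datanode01).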

Page 17: Store and Process Big Data with Hadoop and Cassandra

Hadoop CLI - HDFS

Format the NameNode:

$HADOOP_HOME/bin/hadoop namenode -format

File operations on HDFS:

$HADOOP_HOME/bin/hadoop dfs -lsr /
$HADOOP_HOME/bin/hadoop dfs -mkdir /users/deep/wso2
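Two more everyday operations, copying a local file into HDFS and reading it back; the file name here is a placeholder:

$HADOOP_HOME/bin/hadoop dfs -put data.txt /users/deep/wso2/
$HADOOP_HOME/bin/hadoop dfs -cat /users/deep/wso2/data.txt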

Page 18: Store and Process Big Data with Hadoop and Cassandra

MapReduce

Source: http://developer.yahoo.com/hadoop/tutorial/module4.html

Page 19: Store and Process Big Data with Hadoop and Cassandra

Simple MapReduce Job

Mapper:

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every token in the input line
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

Page 20: Store and Process Big Data with Hadoop and Cassandra

Simple MapReduce Job

Reducer:

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Sum the counts emitted by the mapper for each word
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

Page 21: Store and Process Big Data with Hadoop and Cassandra

Simple MapReduce Job

Job Runner:

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // the reducer doubles as a combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
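Together, the three snippets above form the classic WordCount example against the old org.apache.hadoop.mapred API. To compile them as a single WordCount class the following imports are needed, and the packaged jar can then be submitted to the cluster (the jar name and HDFS paths are placeholders):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

Run the job:

    $HADOOP_HOME/bin/hadoop jar wordcount.jar WordCount /users/deep/wso2/input /users/deep/wso2/output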

Page 22: Store and Process Big Data with Hadoop and Cassandra

High-level MapReduce Interfaces

● Hive: SQL-like queries (HiveQL) that are compiled into MapReduce jobs

● Pig: a data-flow scripting language (Pig Latin) that is compiled into MapReduce jobs

Page 23: Store and Process Big Data with Hadoop and Cassandra

Q & A