Store and Process Big Data with Hadoop and Cassandra
Apache BarCamp
By Deependra Ariyadewa
WSO2, Inc.
Store Data with Apache Cassandra
● Project site : http://cassandra.apache.org
● The latest release version is 1.0.7
● Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick and Ooyala
● Cassandra Users : http://www.datastax.com/cassandrausers
● The largest known Cassandra cluster holds over 300 TB of data across more than 400 machines.
● Commercial support http://wiki.apache.org/cassandra/ThirdPartySupport
Cassandra Deployment Architecture
(Diagram: a consistent-hash ring. Each row is key => {(k,v),(k,v),(k,v)}; hash(key) determines the key's position on the ring, e.g. hash(key1), hash(key2).)
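The ring placement in the diagram can be sketched in plain Java. This is an illustrative approximation of what Cassandra's RandomPartitioner does: it derives a token from the MD5 digest of the row key, and the node owning that token range stores the row. The class and method names here are hypothetical, not Cassandra APIs.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TokenSketch {
    // Token derived from the MD5 digest of the key, taken as a
    // non-negative integer, so every key maps to a fixed ring position.
    static BigInteger token(String key) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes());
        return new BigInteger(digest).abs();
    }

    public static void main(String[] args) throws Exception {
        // Different keys generally land at different ring positions.
        System.out.println("hash(key1) = " + token("key1"));
        System.out.println("hash(key2) = " + token("key2"));
    }
}
```

Because the token depends only on the key, any node can compute where a row lives without consulting a central directory.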
How to Install Cassandra
● Download the artifact apache-cassandra-1.0.7-bin.tar.gz from http://cassandra.apache.org/download/
● Extract tar -xzvf apache-cassandra-1.0.7-bin.tar.gz
● Set up folder paths:
  mkdir -p /var/log/cassandra
  chown -R `whoami` /var/log/cassandra
  mkdir -p /var/lib/cassandra
  chown -R `whoami` /var/lib/cassandra
How to Configure Cassandra
Main configuration file: $CASSANDRA_HOME/conf/cassandra.yaml
  cluster_name: 'Test Cluster'
  seed_provider:
    - seeds: "192.168.0.121"
  storage_port: 7000
  listen_address: localhost
  rpc_address: localhost
  rpc_port: 9160
Cassandra Clustering
initial_token:
partitioner: org.apache.cassandra.dht.RandomPartitioner
http://wiki.apache.org/cassandra/Operations
Cassandra DevOps
$CASSANDRA_HOME/bin$ ./cassandra-cli --host localhost
[default@unknown] show keyspaces;
Keyspace: system:
  Replication Strategy: org.apache.cassandra.locator.LocalStrategy
  Durable Writes: true
  Options: [replication_factor:1]
  Column Families:
    ColumnFamily: HintsColumnFamily (Super) "hinted handoff data"
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType/org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period in seconds / keys to save: 0.0/0/all
      Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
      Key cache size / save period in seconds: 0.01/0
      GC grace seconds: 0
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      Replicate on write: true
      Bloom Filter FP chance: default
      Built indexes: []
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Cassandra CLI
[default@apache] create column family Location with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
f04561a0-60ed-11e1-0000-242d50cf1fbf
Waiting for schema agreement...
... schemas agree across the cluster
[default@apache] set Location[00001][City]='Colombo';
Value inserted.
Elapsed time: 140 msec(s).
[default@apache] list Location;
Using default limit of 100
-------------------
RowKey: 00001
=> (column=City, value=Colombo, timestamp=1330311097464000)
1 Row Returned.
Elapsed time: 122 msec(s).
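Conceptually, the Location column family in the CLI session behaves like a nested map from row key to column name to column value. A plain-Java sketch of that model (illustrative only; `set` and `get` are hypothetical helpers, not Cassandra APIs):

```java
import java.util.HashMap;
import java.util.Map;

public class LocationModelSketch {
    // A column family modeled as Map<rowKey, Map<columnName, columnValue>>.
    static final Map<String, Map<String, String>> location = new HashMap<>();

    // Analogous to: set Location[rowKey][column]='value';
    static void set(String rowKey, String column, String value) {
        location.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
    }

    // Analogous to reading one column of one row with get/list.
    static String get(String rowKey, String column) {
        Map<String, String> row = location.get(rowKey);
        return row == null ? null : row.get(column);
    }

    public static void main(String[] args) {
        set("00001", "City", "Colombo"); // set Location[00001][City]='Colombo';
        System.out.println(get("00001", "City")); // prints Colombo
    }
}
```

Unlike a relational row, each row key can carry its own set of columns, which is why the CLI shows columns per row rather than a fixed schema.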
Store Data with Hector

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;
import java.util.HashMap;
import java.util.Map;

public class ExampleHelper {

    public static final String CLUSTER_NAME = "ClusterOne";
    public static final String USERNAME_KEY = "username";
    public static final String PASSWORD_KEY = "password";
    public static final String RPC_PORT = "9160";
    public static final String CSS_NODE0 = "localhost";
    public static final String CSS_NODE1 = "css1.stratoslive.wso2.com";
    public static final String CSS_NODE2 = "css2.stratoslive.wso2.com";

    public static Cluster createCluster(String username, String password) {
        Map<String, String> credentials = new HashMap<String, String>();
        credentials.put(USERNAME_KEY, username);
        credentials.put(PASSWORD_KEY, password);
        String hostList = CSS_NODE0 + ":" + RPC_PORT + ","
                        + CSS_NODE1 + ":" + RPC_PORT + ","
                        + CSS_NODE2 + ":" + RPC_PORT;
        return HFactory.createCluster(CLUSTER_NAME,
                new CassandraHostConfigurator(hostList), credentials);
    }
}
Store Data with Hector

Create keyspace:
    KeyspaceDefinition definition = new ThriftKsDef(keyspaceName);
    cluster.addKeyspace(definition);

Add column family:
    ColumnFamilyDefinition familyDefinition = new ThriftCfDef(keyspaceName, columnFamily);
    cluster.addColumnFamily(familyDefinition);

Write data:
    Mutator<String> mutator = HFactory.createMutator(keyspace, new StringSerializer());
    String columnValue = UUID.randomUUID().toString();
    mutator.insert(rowKey, columnFamily,
            HFactory.createStringColumn(columnName, columnValue));

Read data:
    ColumnQuery<String, String, String> columnQuery = HFactory.createStringColumnQuery(keyspace);
    columnQuery.setColumnFamily(columnFamily).setKey(key).setName(columnName);
    QueryResult<HColumn<String, String>> result = columnQuery.execute();
    HColumn<String, String> hColumn = result.get();
    System.out.println("Column: " + hColumn.getName()
            + " Value : " + hColumn.getValue() + "\n");
Variable Consistency
● ANY: Wait until some replica has responded.
● ONE: Wait until one replica has responded.
● TWO: Wait until two replicas have responded.
● THREE: Wait until three replicas have responded.
● LOCAL_QUORUM: Wait for a quorum in the datacenter where the connection was established.
● EACH_QUORUM: Wait for a quorum in each datacenter.
● QUORUM: Wait for a quorum of replicas, regardless of datacenter.
● ALL: Block until all replicas have responded before returning to the client.
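The quorum levels above matter because of a simple arithmetic rule: a read is guaranteed to see the latest write when the number of replicas consulted on read plus the number acknowledged on write exceeds the replication factor (R + W > N). A small sketch of that rule (the helper names are illustrative, not Cassandra APIs):

```java
public class ConsistencySketch {
    // A quorum is a majority of the replicas: floor(N / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Reads see the latest write when read + write replicas overlap,
    // i.e. when R + W > N.
    static boolean stronglyConsistent(int reads, int writes, int rf) {
        return reads + writes > rf;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf); // 2 for RF=3
        System.out.println("QUORUM for RF=3: " + q);
        // QUORUM reads + QUORUM writes: 2 + 2 > 3, so consistent.
        System.out.println(stronglyConsistent(q, q, rf)); // true
        // ONE read + ONE write: 1 + 1 <= 3, so a stale read is possible.
        System.out.println(stronglyConsistent(1, 1, rf)); // false
    }
}
```

This is why QUORUM reads combined with QUORUM writes give strong consistency, while ONE/ONE trades consistency for latency.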
Variable Consistency

Create a customized consistency level:

ConfigurableConsistencyLevel configurableConsistencyLevel =
        new ConfigurableConsistencyLevel();
Map<String, HConsistencyLevel> clmap = new HashMap<String, HConsistencyLevel>();
clmap.put("MyColumnFamily", HConsistencyLevel.ONE);
configurableConsistencyLevel.setReadCfConsistencyLevels(clmap);
configurableConsistencyLevel.setWriteCfConsistencyLevels(clmap);
HFactory.createKeyspace("MyKeyspace", myCluster, configurableConsistencyLevel);
CQL
Insert data with CQL:
cqlsh> INSERT INTO Location (KEY, City) VALUES ('00001', 'Colombo');

Retrieve data with CQL:
cqlsh> select * from Location where KEY='00001';
Apache Hadoop
● Project Site: http://hadoop.apache.org
● The latest release version is 1.0.1
● Hadoop is in use at Amazon, Yahoo, Adobe, eBay, Facebook
● Commercial support : http://hortonworks.com
http://www.cloudera.com
Hadoop deployment Architecture
How to install Hadoop
● Download the artifact from: http://hadoop.apache.org/common/releases.html
● Extract : tar -xzvf hadoop-1.0.1.tar.gz
● Copy and extract installation to each data node. scp hadoop-1.0.1.tar.gz user@datanode01:/home/hadoop
● Start Hadoop : $HADOOP_HOME/bin/start-all.sh
Hadoop CLI - HDFS
Format Namenode :
$HADOOP_HOME/bin/hadoop namenode -format
File operations on HDFS:
$HADOOP_HOME/bin/hadoop dfs -lsr /
$HADOOP_HOME/bin/hadoop dfs -mkdir /users/deep/wso2
Mapreduce
(Diagram source: http://developer.yahoo.com/hadoop/tutorial/module4.html)
Simple Mapreduce Job
Mapper:

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
Simple Mapreduce Job

Reducer:

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Simple Mapreduce Job
Job Runner:
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
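The whole word-count job above can be exercised without a cluster by collapsing its two phases into plain Java: the map phase tokenizes each line and emits (word, 1), and the reduce phase sums the 1s per word. This is a local sketch of the job's logic, not Hadoop code; `wordCount` is a hypothetical helper name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountSketch {
    // Map phase: tokenize each line and emit (word, 1);
    // reduce phase: sum the counts per word (merge does both here).
    static Map<String, Integer> wordCount(String... lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> result = wordCount("big data big cluster", "big data");
        System.out.println(result.get("big"));  // 3
        System.out.println(result.get("data")); // 2
    }
}
```

In the real job, Hadoop runs the map step on each input split in parallel and shuffles equal keys to the same reducer; the per-key summation is identical.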
High level Mapreduce Interfaces
● Hive
● Pig
Q & A