Questions



Given 1 PB of data to migrate into Hadoop, tell me every aspect and task involved in the migration.

My humble attempt:

Understand the vision - how the data will be used

Get data growth rate

Understand how future-proof the solution should be - 2 years, 3 years, etc.

Assume a disk capacity per node on commodity hardware

Assume a replication factor - the HDFS default is 3

Calculate the disk space for the projected years

Add 30% extra to the space calculation to allow for Hadoop's own usage, such as MR working space, and other overhead

Factor in NameNode (NN) server requirements and HA of the same

Factor in JobTracker (JT) server requirements and HA of the same

Factor in ZooKeeper requirements and HA of the same

Decide the data storage format - text file, SequenceFile, Avro, RCFile - and the compression codec. These choices affect the space requirement, so pick ones that minimize it

Factor in HBase master server requirements and HA of the same, if HBase is to run

Factor in racks/switches

Factor in the number of environments - dev, test, QA, parallel production, production - and the cluster size for each

Plan out users/groups/authentication/authorization and any data encryption needed.

Plan the load tasks and the data integrity checks
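
To make the sizing steps above concrete, here is a rough sketch of the arithmetic in Python. Everything numeric in it is an assumption for illustration only: the 20% annual growth, the 3-year horizon, the 24 TB of usable disk per node, and the optional compression ratio are placeholders, and the function name estimate_cluster is hypothetical. Only the replication factor of 3 and the 30% extra come from the list above.

```python
import math


def estimate_cluster(
    initial_tb: float,
    annual_growth_rate: float,
    years: int,
    replication_factor: int = 3,       # from the checklist: default of 3
    overhead_fraction: float = 0.30,   # from the checklist: 30% extra
    usable_tb_per_node: float = 24.0,  # assumption: usable disk per data node
    compression_ratio: float = 1.0,    # assumption: 1.0 means no compression
) -> dict:
    """Estimate raw disk space and data node count for a planning horizon."""
    # Logical data volume after compound growth over the planning horizon.
    projected_tb = initial_tb * (1 + annual_growth_rate) ** years
    # Compression shrinks what is actually written to disk.
    stored_tb = projected_tb / compression_ratio
    # Each block is stored replication_factor times.
    replicated_tb = stored_tb * replication_factor
    # Extra headroom for MR working space and other overhead.
    raw_tb = replicated_tb * (1 + overhead_fraction)
    # Round up: a partial node is still a whole machine.
    data_nodes = math.ceil(raw_tb / usable_tb_per_node)
    return {
        "projected_tb": projected_tb,
        "replicated_tb": replicated_tb,
        "raw_tb": raw_tb,
        "data_nodes": data_nodes,
    }


if __name__ == "__main__":
    # 1 PB taken as 1024 TB; growth rate and horizon are illustrative guesses.
    plan = estimate_cluster(initial_tb=1024, annual_growth_rate=0.20, years=3)
    for key, value in plan.items():
        print(f"{key}: {value:,.1f}" if isinstance(value, float) else f"{key}: {value}")
```

This covers data nodes only. The NameNode, JobTracker, ZooKeeper and HBase master servers, plus the extra environments, still have to be counted separately as listed above.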