A Formula to Estimate Hadoop Storage and Related Number of Data Nodes


Description

Estimate Hadoop storage and calculate the number of data nodes.

Transcript of "A Formula to Estimate Hadoop Storage and Related Number of Data Nodes"

Source: https://www.linkedin.com/groups/formula-estimate-Hadoop-storage-related-988957.S.237869374 (LinkedIn Hadoop Users group discussion, captured 6/19/2014)


Slim Baltagi

Sr. Big Data Architect at TransUnion

Hi

I would like to share with you a formula to estimate Hadoop storage and the related number of data nodes, and get your thoughts about it.

1. This is a formula to estimate Hadoop storage (H):

H=c*r*S/(1-i)

where:

c = average compression ratio. It depends on the type of compression used (Snappy, LZOP, ...) and on the size of the data. When no compression is used, c = 1.
r = replication factor. It is usually 3 in a production cluster.
S = size of the data to be moved to Hadoop. This could be a combination of historical data and incremental data. The incremental data can be daily, for example, and projected over a period of time (3 years, for example).
i = intermediate factor. It is usually 1/3 or 1/4. It accounts for Hadoop's working space dedicated to storing intermediate results of Map phases.

Example: with no compression (i.e. c = 1), a replication factor of 3, and an intermediate factor of 1/4 = 0.25:

H = 1*3*S/(1-1/4) = 3*S/(3/4) = 4*S

With the assumptions above, the Hadoop storage is estimated to be 4 times the initial data size.
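As a quick illustration, here is a minimal Python sketch of this storage estimate; the function name and the example value are illustrative, not part of the original post:

    def hadoop_storage_tb(s_tb, c=1.0, r=3, i=0.25):
        """Estimate raw Hadoop storage H = c * r * S / (1 - i).

        s_tb : size of the data moved to Hadoop, in TB
        c    : average compression ratio (1.0 = no compression)
        r    : replication factor (typically 3 in production)
        i    : intermediate factor (working space for Map output)
        """
        return c * r * s_tb / (1.0 - i)

    # No compression, replication 3, i = 1/4, S = 600 TB
    print(hadoop_storage_tb(600))  # 2400.0 TB, i.e. 4 * S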

2. This is the formula to estimate the number of data nodes (n):

n= H/d = c*r*S/(1-i)*d

where d = disk space available per node. All other parameters remain the same as in 1.

Example: if 8 TB is the available disk space per node (10 disks of 1 TB each; 2 disks for the operating system etc. were excluded) and the initial data size is 600 TB, then n = 600/8 = 75 data nodes are needed.

What are your thoughts?

Thanks.


Comments



Madhan Sundararajan Devaki

Assistant Consultant at Tata Consultancy Services

What is the intermediate factor, and how did you arrive at 0.25?

When compression is not enabled and the replication factor is 3, the required storage will be 3 times the size of the original data!


Slim Baltagi

Sr. Big Data Architect at TransUnion

i = intermediate factor. It is usually 1/3 or 1/4. It is Hadoop's working space dedicated to storing intermediate results of Map phases. With no compression (i.e. c = 1), a replication factor of 3, and an intermediate factor of 1/4 = 0.25, the Hadoop storage is estimated to be 4 times the initial data size, and not 3 times as you are mentioning.


Shahab Yunus

J2EE Software Consultant at iSpace Inc.

" i = intermediate factor. It is usually 1/3 or 1/4. "

How have you reached this number? Depending on the nature of the jobs, algorithms or M/R patterns (e.g. the number of mappers), don't you think it can vary? Thanks.


Chandra Sekhar Reddy

Software Developer at Teradata

It's really a good thought, coming up with a formula to calculate the number of nodes by considering these factors.

I think we also have to consider the number of processors per node and the number of tasks per node as factors.


Brian Macdonald

Enterprise Architect Specializing in Analytics using Big Data, Data Warehousing and Business Intelligence Technologies

25% for intermediate space is not unrealistic. This is a common guideline for many production applications. Even Cloudera has recommended 25% for intermediate results (http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/). Surely, they do know a thing or two about production Hadoop systems.

The formula still holds true even if you want to plan for less intermediate space and use a smaller value for i.

I think this value is good for estimating the available HDFS storage for the cluster. What I remind customers to consider is that you will need to plan for output files as well. So the value of S should include data being ingested as well as the output of any jobs.


Ted Dunning

Chief Application Architect at MapR

25% for intermediates is actually unrealistically large for very large datasets, if only because it would take a very long time to write so much intermediate data.

It is still good to have that much spare space, because almost all file systems behave much better with a fair bit of slack in their usage. Aiming at 75% fill is a good target to maintain very good performance. I have seen some users run consistently at mid-90s fill with good results on MapR, but I definitely don't recommend it.

Another consideration is how well you want your system to run under various failure scenarios. For small clusters, even a single node failure can be very significant (75% fill on a 5-node cluster results in 94% fill with a single node failure ... is this system still going to work?). For larger clusters, you may want to think in terms of single rack or switch failures.
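To make that failure arithmetic concrete, a minimal Python sketch (the function name and the numbers are illustrative):

    def fill_after_failure(fill, nodes, failed=1):
        """Approximate cluster fill after `failed` nodes are lost and their
        data is re-replicated across the surviving nodes (same data, fewer nodes)."""
        return fill * nodes / (nodes - failed)

    print(round(fill_after_failure(0.75, 5), 3))   # 0.938 -> ~94% fill, as above
    print(round(fill_after_failure(0.75, 20), 3))  # 0.789 on a larger cluster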


Michael Ware

TDB

Estimating compression ratios can be a very complex issue, depending on the types of data you're working with. Some data can be compressed quite a bit more than others. If the data type you're storing in Hadoop isn't monolithic, then this needs to be compensated for.


Jose Morales

Business Development | BigData | Storage Consultant | Cloud Computing | Europe | Looking for New Challenges

Getting the compression data is very complex, as mentioned by Michael Ware, but I also have to agree that 25% for intermediate data is unreasonably large for any dataset. However, if you feel that your cluster will be reaching anywhere near these numbers at deployment, then it is a good idea to plan for 75% allocated capacity, for some headroom to expand the cluster accordingly.

How about meeting a particular ingest bandwidth requirement? I believe that should also be part of the equation.


Anil Kumar

The Big Data Professional

Thanks for the information, dear Mr. Slim.


RAJDEEP MAZUMDER

Technical Architect at Verizon Communications

Thanks, Slim, for the nice formulation!

For heterogeneous hardware configurations across the nodes of a bigger cluster, will this flat formulation still hold, or will there be any difference? Though I can understand that here we are trying to estimate the number of nodes based on Hadoop storage, which mainly focuses on disk size.

Thanks!


Peter Jamack

Global Big Data PSOM at Teradata

I'd like to know more about your formula and examples.

You have the formula for nodes as n= H/d = c*r*S/(1-i)*d, but then your example is simply n=H/d and 75 data nodes. What is the rest of that formula for, then, in that piece?


Peter Jamack

Global Big Data PSOM at Teradata

I'm still confused about the number of data nodes needed according to your formulas.

The initial formula comes up with 2400, which is 4X the initial size of the data. The initial size was 600 TB. So if we then go and use 75 data nodes with only 8 TB each, that works out to 600 TB worth of data stored. Except shouldn't that account for the fact that the original formula needed 4X that amount?

That's where I'm confused. You have 75 data nodes (8 TB each) that can store 600 TB worth of data. But according to your original formula, you'd need 4X that amount of space.


Peter Jamack

Global Big Data PSOM at Teradata

That's where I'm confused.

If you look at n= H/d = c*r*S/(1-i)*d, shouldn't the second part be /d instead of *d? And shouldn't it be 300 nodes instead of 75?
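For reference, a small Python sketch of the node count when n is computed strictly as H/d, i.e. c*r*S/((1-i)*d); the function name and the ceiling rounding are illustrative:

    import math

    def data_nodes(s_tb, d_tb, c=1.0, r=3, i=0.25):
        """Number of data nodes if n = H / d, with H = c * r * S / (1 - i)."""
        h = c * r * s_tb / (1.0 - i)
        return math.ceil(h / d_tb)

    # 600 TB of raw data, 8 TB usable per node
    print(data_nodes(600, 8))  # 300, versus 75 when only S (not H) is divided by d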


Ted Dunning

Chief Application Architect at MapR

You also have to allow headroom in the file system underlying the DFS. For HDFS, this is usually ext3 or ext4, which gets very, very unhappy at much above 80% fill. For MapR FS, you can go up to 90% and still get good performance.

You also need to allow headroom for failure modes. For example, with 5 nodes, if you lose one, the other 4 will get an extra 20% of the data as the replicas are re-built. This should not leave you above the max fill rate mentioned in the first paragraph if you want to keep running efficiently.


Eswar Reddy

DWH and Big Data Developer at IBM

Hello Mr. Slim, a very nice explanation, with a formula, of how we can calculate the Hadoop storage and the number of clusters. It is very helpful. Thank you.


Bhushan Lakhe

Senior Vice President at Ipsos

I think this is a good starting point! The 'intermediate factor' - which seems to be a debatable variable - will of course vary as per the cluster activity (jobs etc.), and any job output that would add to the disk space should be accommodated as part of 'data growth', along with the normal expected increase in data volume.


Vijaya Tadepalli

Engineering Manager at Akamai Technologies

Great discussion here.

@Ted: How much does RAM contribute to processing speed? If we go with 64 GB RAM vs 128 GB RAM, can we expect more lines to be processed per minute (while dealing with log files, for example)?

Ted Dunning

Chief Application Architect at MapR

RAM contributes to processing speed in two ways for batch-oriented Hadoop programs.

The first way is to support mappers and reducers. You need to have enough space for your own processes to run, as well as buffer space for transferring data through the shuffle step. Small memory means that you can't run as many mappers in parallel as your CPU hardware will support, which will slow down your processing. The number of reducers is often more limited by how much random I/O the reducers cause on the source nodes than by memory, but some reducers are very memory hungry.

The second way that RAM contributes to processing speed is as file space buffers. With traditional map-reduce programs processing flat files (this includes Hive and Pig), you don't need to worry much about memory buffering, and 20% of total memory should be fine (this is the default on MapR, for instance). If you are running map-reduce against table-oriented code using, say, MapR's M7 offering, you should have considerably more memory for table caching. The default on MapR is 35% of memory allocated to file buffer space if tables are in use, but specialized uses I have seen have gone as high as 80% of memory for buffering.

64 GB should be enough for moderate-sized dual-socket motherboards, but there are definitely applications where 128 GB would improve speed. Moving above 128 GB is unlikely to improve speed for most motherboards.

Make sure you have enough spindles to feed the beast. The standard MapR recommendation lately is 24 spindles for the data, with an internal SSD or small drive for the OS. Some customers with specialized needs have pushed over 100 drives on a single node. You should do a careful design check before going extreme (above 36 drives) to make sure that you aren't compromising reliability. Up to 24 should be fine in most cases as long as you have enough nodes. Keep in mind that different distributions have different abilities to drive large drive counts. Only MapR can push 2+ GB/s, for instance, and many distributions have trouble with very large block counts on a single node.

Also look at networking speed. Network speed is moderately important during normal operations, but for the log processing that you suggest is your primary application, you can probably get by with lower bandwidth for normal operations. Where network speed becomes critical is when you are recovering from a node failure and need to move large amounts of data from node to node without killing normal operations.

The ability to control file locations can also mitigate network bandwidth requirements. With MapR, for instance, you can force all replicas of certain files to be collocated. This allows some programs to be optimized into merge-joins, which can sometimes eliminate several map-reduce steps and can essentially eliminate network traffic for those parts of the jobs. In one large installation with 1 Gbps networking, this resulted in 10x throughput improvements for that part of the data flow. This idiom is common in some forms of log processing.
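As a rough illustration of those buffer-split percentages (the fractions follow the defaults mentioned above; everything else is illustrative):

    def file_buffer_gb(total_ram_gb, fraction):
        """Memory reserved for file buffers at a given allocation fraction,
        e.g. ~0.20 for flat-file MapReduce or ~0.35 when tables are in use."""
        return total_ram_gb * fraction

    for ram in (64, 128):
        print(ram, "GB RAM:",
              file_buffer_gb(ram, 0.20), "GB for flat-file buffering,",
              file_buffer_gb(ram, 0.35), "GB with table caching")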


Vijaya Tadepalli

Engineering Manager at Akamai Technologies

@Ted: That's a lot of valuable information, thanks much. I am actually at a very preliminary stage of planning for 20 TB of data, and the ask is that the 20 TB should be processed in less than 1 hour. "Processing" here refers to a couple of lookups for a few fields and adding/removing additional fields. I have this setup in mind, with an hourly dump of 20 TB:

5 machines of 16 TB disk space with 64 GB RAM, and a replication factor of 3. The Hadoop job's output is set to Elasticsearch for indexing. Once the Hadoop job finishes, the data gets deleted and Elasticsearch holds the data from there on. Thanks again.


Guillermo Villamayor

New Project in Computer Vision

@Slim Baltagi, what is the difference between H and S? In the example, H = 4*S. If your r=3, you need r*H = 3*H disk space, or am I wrong?


Vijaya Tadepalli

Engineering Manager at Akamai Technologies

Any inputs on my proposed cluster setup for dealing with 20 TB hourly? Thanks.


Ted Dunning

Chief Application Architect at MapR

@Vijaya,

My advice is always to build a prototype before committing to SLAs.

When you say "lookups", what do you mean? Is the lookup against dynamic or static data? How much data is the lookup going against?

Also, note that you are running pretty close to the wind here on total speed. If you assume that you can read 1 GB/s/node (which is hard to maintain if you are doing *any* significant processing, or if you have to write that much as well), then each node can handle 3600 s/hr x 1 GB/s = 3.6 TB/hour. The cluster of 5 can handle 18 TB/hour. So simply ingesting this much data per hour is going to be doubtful, even without processing it.

How did you imagine that this would happen?

What kind of networking are you assuming?
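A back-of-the-envelope sketch of that ingest ceiling (the 1 GB/s read rate and node counts are the assumptions stated above; the helper name is illustrative):

    def hourly_ingest_tb(nodes, read_gb_per_s=1.0):
        """Rough cluster ingest ceiling per hour: nodes * read rate * 3600 s,
        converted from GB to TB and ignoring processing and write overhead."""
        return nodes * read_gb_per_s * 3600 / 1000.0

    print(hourly_ingest_tb(5))  # 18.0 TB/hour -- short of a 20 TB/hour target
    print(hourly_ingest_tb(7))  # 25.2 TB/hour with a couple of extra nodes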


Vijaya Tadepalli

Engineering Manager at Akamai Technologies

@Ted

Yes, definitely we will prototype and then commit to any SLAs.

At this moment, I just wanted to do a rough estimate of the hardware required, especially in terms of RAM, as the ask is really about finishing processing and pushing data into Elasticsearch within 1 hour. I understand a 5-node cluster would be really running it tight for 20 TB, but if I add a couple of nodes with the same 16 TB disk and 64 GB RAM, or even 128 GB RAM, that might do the job?

Lookups are against static data that is about 1 GB. Also, processing mostly involves massaging each line, and at least at this point I am not looking at aggregations or counts. For networking, if my setup can benefit from 10 Gb (looks like it) I would go with that option. Hope all this is making sense.

I haven't worked with Spark, but that is something I am considering exploring.

Thanks.


Ted Dunning

Chief Application Architect at MapR

I think that there is a good chance that more RAM would help by letting you use more cores. Fast networking will definitely help.

The key questions that I have are:

a) Can the source deliver this much data smoothly? If not, you will need more nodes, simply because you will need some peak bandwidth on the networking side. Systems like Flume are fairly notoriously difficult to run at very high data rates.

b) Do you have to persist this data during processing? Most forms of HA will require that. Many require that you have multiple persistent copies if you want to avoid data loss scenarios.

c) Do you have enough spindles to meet your persistence requirements? 16 TB of disk could be 16 x 1 TB (better) or 4 x 4 TB (very limiting, since the max I/O rate will only be about 400-600 MB/s at best due to disk limits).

d) Can the destination accept data smoothly (or even at all)? My experience with high data rates into Elasticsearch indicates that a sophisticated team can ingest just north of 2 million log events per second using about 30 nodes of ES. You sound like you are in the same territory or even higher, but you don't account for this level of hardware.


Vijaya Tadepalli

Engineering Manager at Akamai Technologies

@Ted:

#a: The expectation is that processing is done within 1 hour after all the data is available in the Hadoop cluster. I still have to work on establishing a robust pipeline into Hadoop. Maybe Scribe, Flume, or something like Logstash to apply basic regexes and push into HDFS; not finalized yet.

#b & #d: No need to persist on the Hadoop cluster, but whatever gets pushed to Elasticsearch needs to be, though. I need to plan the Elasticsearch cluster depending on how much data needs to be available. For example, if 12 hours of data (250 TB) is needed at any given point of time, and with a replication factor of 3, I need to look at how much space is needed for the index built by Elasticsearch and then take a call on the total disk space required. I am sure RAM would come into the picture in the case of Elasticsearch too. All this is still in the works.

#c: Will definitely look into getting more spindles to get better I/O performance.

Again, thank you for the valuable inputs and suggestions so far. I will add more details once I can finalize the collection and Elasticsearch aspects. Others facing similar asks might find it useful.


Raghavan Solium

Lead Architect - Big Data at OSSCube

I really appreciate the attempt to formulate the number of nodes. Overall, as a formula for calculation it looks good from the data point of view, though the intermediate factor is slightly high.

But this totally ignores the processing requirement. You need to keep in mind that the data nodes are also going to be processing nodes, hence you may need to increase the number of nodes beyond this calculation based on your job loads and performance requirements. A formula which captures all these aspects would do great. A slight nudging of your formula to include the processing aspects would be good.


Ted Dunning

Chief Application Architect at MapR

Raghavan,

You are correct that it is important to consider processing space as well, especially for nodes with relatively small amounts of storage.

Processing space should not, however, be computed as a fraction of total space, but in terms of how long the longest job runs. Each node can only write data at a bounded rate and thus the length of the job determines how much space you need.

If you assume, for instance, that the most extreme job writes for, say, 20 minutes (this is not processing time, just the more serious write), then a single node probably doesn't need more than about a TB of processing space (1000 s * 1 GB/s).

In any case, that should be included in the usable storage that I computed earlier.

And, as always, your mileage will vary.
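A tiny sketch of that write-bound estimate (the write duration and 1 GB/s rate are the assumptions above; the helper name is illustrative):

    def processing_space_tb(write_seconds, write_gb_per_s=1.0):
        """Per-node intermediate space bounded by how long the biggest job writes,
        rather than by a fraction of total capacity."""
        return write_seconds * write_gb_per_s / 1000.0

    print(processing_space_tb(1000))  # 1.0 TB, roughly the figure quoted above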

