Deduplication using Hadoop and Hbase
Uploaded by shilpa-kanhurkar
8/18/2019 Deduplication using Hadoop and Hbase
Client Side Data Duplication Detector using Hadoop Framework
Ms. Shilpa D. Kanhurkar, Exam No.: (not legible)
Under Guidance of Prof. P. B. Sahane
Department of Computer Engineering, P K Technical Campus, Chakan
Introduction
• What is Big Data?
• Need of De-Duplication?
Literature Survey
Sr. No. | Paper Name | Author Name | Approach | Advantage | Disadvantage
--- | --- | --- | --- | --- | ---
1 | Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality | M. Lillibridge, K. Eshghi | Content-based segmentation, sampling, sparse indexing | Excellent deduplication throughput, little RAM | Small loss of duplication, HP product
2 | Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup | D. Bhagwat, K. Eshghi | Chunk based (hash) | Parallelizes, file similarity | Restoration and storage require a larger number of random seeks
Literature Survey (continued)
Sr. No. | Paper Name | Author Name | Approach | Advantage | Disadvantage
--- | --- | --- | --- | --- | ---
3 | DeDu: Building a Deduplication Storage System over Cloud Computing | Z. Sun, J. Shen | Cloud based, sparse index | High throughput | Deduplication does not occur at the file level, and the results of de-duplication are not accurate
4 | Venti: A New Approach to Archival Data Storage | S. Quinlan and S. Dorward | Chunk based (hash) | Enforces a write-once policy to avoid damage of data | It is not suitable to deal with mass data, and the system is not scalable
Problem Statement
To develop a reliable, efficient client-side de-duplication system using efficient hash-based techniques, Hadoop, and HBase. It will help in offloading the processing power requirements of the target to the client nodes, reducing the amount of data that is to be sent onto the network.
Objective
To learn the technologies of de-duplication techniques for big data.
Existing System
Hash based duplication detection method
◦ MD5 and SHA-1 algorithms
◦ Data storage and analysis using HDFS with Pig, Hive.
Advantages
◦ It is easy to compute the hash value for a given file.
Disadvantages
◦ The MD5 hash function is severely compromised.
◦ Collision complexity of MD5 is 2^64, due to its 128-bit hash length.
◦ Pig and Hive run batch processes on Hadoop; they are not databases.
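The hash-based detection idea on this slide can be sketched with the JDK's own MessageDigest class. This is only an illustration: the class name FingerprintDemo and its helper methods are assumptions made for the example and do not come from the slides. Two byte sequences are treated as duplicates when their fingerprints are equal.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; names are illustrative, not taken from the slides.
class FingerprintDemo {

    // Returns the lowercase hex digest of data under a JDK algorithm name ("MD5" or "SHA-1").
    static String hexDigest(String algorithm, byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(algorithm + " not available", e);
        }
    }

    // Two inputs count as duplicates when their fingerprints match.
    static boolean sameFingerprint(String algorithm, byte[] a, byte[] b) {
        return hexDigest(algorithm, a).equals(hexDigest(algorithm, b));
    }

    public static void main(String[] args) {
        byte[] first = "hello world".getBytes(StandardCharsets.UTF_8);
        byte[] copy = "hello world".getBytes(StandardCharsets.UTF_8);
        byte[] other = "hello hadoop".getBytes(StandardCharsets.UTF_8);
        System.out.println("MD5 hex length:   " + hexDigest("MD5", first).length());
        System.out.println("SHA-1 hex length: " + hexDigest("SHA-1", first).length());
        System.out.println("copy duplicate:   " + sameFingerprint("MD5", first, copy));
        System.out.println("other duplicate:  " + sameFingerprint("MD5", first, other));
    }
}
```

An MD5 digest is 128 bits (32 hex characters) and a SHA-1 digest is 160 bits (40 hex characters), which is the length difference the slides refer to.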
Proposed System
Client Side Data Duplication Detector using Hadoop
◦ Hash based de-duplication technique with Hadoop framework.
◦ One improved hash algorithm
◦ HBase, which is a NoSQL database, to be used on top of Hadoop for storing and fast analyzing of big datasets.
Advantages
◦ HBase gives random, real-time read/write access to our data and a flexible data model.
◦ Allowing space to be saved on the storage resource as it compresses redundant data
◦ Fingerprints + HBase achieve high lookup efficiency with high security.
Implementation: System Architecture
[Figure: architecture diagram divided into CLIENT SIDE and SERVER SIDE. Labelled components: Files, MD5 Generator, Lookup for existing hash key, Passing the non-matched values, Unique files, HBase, MapReduce, HDFS.]
Algorithm
1) Data is written onto HDFS
    FileSystem fs = FileSystem.get(new Configuration());
2) In the map() function, Hadoop by default splits the input file into 64MB blocks.
    FileSplit filesplit = (FileSplit) context.getInputSplit();
3) Generating hash value of file
    String hxVal = MD5.toHexString(MD5.computeMD5(inputVal.getBytes()));
4) Instantiating Get class with hash value
    HTable hTable = new HTable(config, "Hash");
Algorithm (continued)
5) Hash values are read from HBase and compared with the current value
    Get g = new Get(Bytes.toBytes(hxVal));
    Result result = hTable.get(g);
5.1) If the hash is new, then the current hash value is updated into HBase and the input file is transferred to the backup server.
    Put p = new Put(Bytes.toBytes(hxVal));
    p.add(Bytes.toBytes("File"), ...
    FileSystem fs = FileSystem.get(conf);
    Path filenamePath = new Path("/user/shilpa/final/");
    FSDataOutputStream out = fs.create(filenamePath);
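The lookup-then-store flow of steps 3 to 5.1 can be sketched without a cluster. In this minimal sketch an in-memory HashMap stands in for the HBase "Hash" table and a second map stands in for the backup server's HDFS directory. Both stand-ins, the class name DedupFlowSketch and the method submit are assumptions made for illustration, and the JDK MessageDigest replaces the slides' MD5 helper class.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the client-side flow; no Hadoop or HBase calls are made here.
class DedupFlowSketch {
    // Assumption: replaces the HBase "Hash" table (row key = fingerprint).
    private final Map<String, String> hashTable = new HashMap<>();
    // Assumption: replaces the backup server's HDFS directory (file name -> content).
    private final Map<String, byte[]> backupStore = new HashMap<>();

    // Step 3: generate the hash value of the file content.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Steps 4 to 5.1: look the fingerprint up; store and transfer only when it is new.
    // Returns true if the file was transferred, false if it was detected as a duplicate.
    boolean submit(String fileName, byte[] content) {
        String hxVal = md5Hex(content);
        if (hashTable.containsKey(hxVal)) {
            return false;                       // duplicate: nothing is sent
        }
        hashTable.put(hxVal, fileName);         // corresponds to the Put in step 5.1
        backupStore.put(fileName, content);     // corresponds to the HDFS write
        return true;
    }

    int storedFiles() {
        return backupStore.size();
    }

    public static void main(String[] args) {
        DedupFlowSketch client = new DedupFlowSketch();
        System.out.println(client.submit("report.txt", "same bytes".getBytes()));
        System.out.println(client.submit("report-copy.txt", "same bytes".getBytes()));
        System.out.println("files stored: " + client.storedFiles());
    }
}
```

Because the check happens before any transfer, a duplicate file costs one hash computation and one lookup on the client rather than a full upload.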
Algorithm (continued)
One Improved Hash Algorithm, MD5plus, based on MD5 and SHA-1
Steps:
i. Information filling module
ii. Initialization module
iii. Hash value calculation module
    F(X,Y,Z) = XY v not(X) Z        G(X,Y,Z) = XZ v Y not(Z)
    H(X,Y,Z) = X xor Y xor Z        I(X,Y,Z) = Y xor (X v not(Z))
Operational functions: each process has 4 rounds and each round has 16 steps:
    FF(a,b,c,d,Mj,s,ti) means: a = b + ((a + F(b,c,d) + Mj + t[i]) <<< s)
    GG(a,b,c,d,Mj,s,ti) means: a = b + ((a + G(b,c,d) + Mj + t[i]) <<< s)
    HH(a,b,c,d,Mj,s,ti) means: a = b + ((a + H(b,c,d) + Mj + t[i]) <<< s)
    II(a,b,c,d,Mj,s,ti) means: a = b + ((a + I(b,c,d) + Mj + t[i]) <<< s)
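The four auxiliary functions and the FF step can be written directly on 32-bit words, as in standard MD5. The sketch below is only an illustration of those formulas: the class name Md5RoundSketch and the lowercase method names are assumptions, and it does not implement the full MD5 or MD5plus algorithm. Java int arithmetic wraps modulo 2^32, which matches the word-wise addition in the formulas.

```java
// Hypothetical illustration of the auxiliary functions and one FF step on 32-bit words.
class Md5RoundSketch {
    // F(X,Y,Z) = XY v not(X) Z : X selects between Y and Z bit by bit.
    static int f(int x, int y, int z) { return (x & y) | (~x & z); }

    // G(X,Y,Z) = XZ v Y not(Z) : Z selects between X and Y bit by bit.
    static int g(int x, int y, int z) { return (x & z) | (y & ~z); }

    // H(X,Y,Z) = X xor Y xor Z
    static int h(int x, int y, int z) { return x ^ y ^ z; }

    // I(X,Y,Z) = Y xor (X v not(Z))
    static int i(int x, int y, int z) { return y ^ (x | ~z); }

    // FF step: a = b + ((a + F(b,c,d) + Mj + ti) <<< s), where <<< is a left rotation.
    static int ff(int a, int b, int c, int d, int mj, int s, int ti) {
        return b + Integer.rotateLeft(a + f(b, c, d) + mj + ti, s);
    }

    public static void main(String[] args) {
        // Arbitrary sample words, chosen only to exercise the functions.
        int a = 0x01234567, b = 0x89abcdef, c = 0x0f0f0f0f, d = 0x33333333;
        System.out.println(Integer.toHexString(f(b, c, d)));
        System.out.println(Integer.toHexString(ff(a, b, c, d, 1, 7, 2)));
    }
}
```

GG, HH and II follow the same pattern as ff with g, h or i substituted for f.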
Algorithm (continued)
iv. Bit extending module
Special extending function (the bitwise majority function):
    (X AND Y) OR (X AND Z) OR (Y AND Z), with 40-bit input and output.
We only have to append the results with eight bits in front, then save them to the 40-bit registers AA, BB, CC, and DD.
The corresponding operational function on (a,b,c,d,Mj,s,ti) means: a = b + ((a + ext(b,c,d) + Mj + t[i]) << s), where ext is the special extending function above.
Output: AA, BB, CC, and DD
The MD5plus algorithm is based on MD5 and absorbs some excellent functions from SHA-1. In hash length, MD5plus has improved to 160-bit.
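The special extending function is a bitwise majority vote over its three inputs. The sketch below shows it on 40-bit values held in Java long variables. The 40-bit width, the class name ExtendSketch and, in particular, the extend helper are assumptions: extend is only one possible reading of "append eight bits in front" of a 32-bit word, not the confirmed MD5plus procedure.

```java
// Hypothetical sketch of the bit extending module on 40-bit values.
class ExtendSketch {
    // Assumption: registers AA, BB, CC, DD are 40 bits wide (8 extra bits plus a 32-bit word).
    static final long MASK_40 = (1L << 40) - 1;

    // Majority function: each output bit is 1 when at least two of the three input bits are 1.
    static long maj(long x, long y, long z) {
        return ((x & y) | (x & z) | (y & z)) & MASK_40;
    }

    // Assumed reading of "append eight bits in front": put 8 bits above a 32-bit word.
    static long extend(int word, int topBits) {
        return ((long) (topBits & 0xff) << 32) | (word & 0xffffffffL);
    }

    public static void main(String[] args) {
        long aa = extend(0x67452301, 0x12);   // arbitrary sample values
        long bb = extend(0xefcdab89, 0x34);
        long cc = extend(0x98badcfe, 0x56);
        System.out.println(Long.toHexString(maj(aa, bb, cc)));
    }
}
```

Four such 40-bit registers give 4 x 40 = 160 output bits, the same digest length as SHA-1.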
Data Tables and Discussions
Algorithmic comparison
HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable, big data store. It stores data as key/value pairs.
We leveraged the Hadoop framework to design and develop a duplication detection system that identifies multiple copies of the same data at the file level itself and eliminates duplicate or redundant files in their entirety before transmission, i.e. at the client (servers) end. This helps in eliminating unnecessary replicas and in controlling their number; the replicas that remain are managed and controlled as per the requirements. By using hash-based techniques, duplication is detected faster.
References and Bibliography
[1] Kapil Bakshi, "Considerations for Big Data: Architecture and Approach," in Aerospace Conference, 2012 IEEE, Big Sky, MT, 3-10 March 2012, pp. 1-7.
[2] P. Malik, "Governing Big Data: Principles and Practices," IBM Journal of Research and Development, vol. 57, pp. 1:1-1:13, 2013.
[3] D. Geer, "Reducing the Storage Burden via Data Deduplication," in Computer, the flagship publication of the IEEE Computer Society, vol. 41, pp. 15-17, 2008.
[4] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality," in 7th USENIX Conference on File and Storage Technologies, San Francisco, California.
[5] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the Data Domain deduplication file system," in Proceedings of the 6th USENIX Conference on File and Storage Technologies, San Jose, California, 2008, pp. 269-282.
References and Bibliography (continued)
[6] S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Data Storage," in Proceedings of the 1st USENIX Conference on File and Storage Technologies, Monterey, CA: USENIX Association, 2002.
[7] Z. Sun, J. Shen, and J. Yong, "DeDu: Building a Deduplication Storage System over Cloud Computing," in 2011 15th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Lausanne, 2011, pp. 348-355.
[8] C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki, "HYDRAstor: a Scalable Secondary Storage," in Proceedings of the 7th Conference on File and Storage Technologies, San Francisco, California, 2009, pp. 197-210.
[9] D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup," in 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS).
Thank You!