Deduplication using Hadoop and Hbase

Transcript of Deduplication using Hadoop and Hbase

  • 8/18/2019 Deduplication using Hadoop and Hbase

    Client-Side Data Duplication Detector Using Hadoop Framework

    Ms. Shilpa D. Kanhurkar

    Under the guidance of Prof. P. B. Sahane

    Department of Computer Engineering, P K Technical Campus, Chakan

    Introduction

    • What is Big Data?

    • Need for De-Duplication?

    Literature Survey

    Sr. No. 1
    Paper Name: Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality
    Author Name: M. Lillibridge, K. Eshghi
    Approach: Content-based segmentation, sampling, sparse indexing.
    Advantage: Excellent deduplication throughput, little RAM.
    Disadvantage: Small loss of deduplication; HP product.

    Sr. No. 2
    Paper Name: Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup
    Author Name: D. Bhagwat, K. Eshghi
    Approach: Chunk based (hash).
    Advantage: Parallelized, file similarity.
    Disadvantage: Restoration and storage require a larger number of random seeks.

    Literature Survey (continued)

    Sr. No. 3
    Paper Name: DeDu: Building a Deduplication Storage System over Cloud Computing
    Author Name: Z. Sun, J. Shen
    Approach: Cloud based, sparse index.
    Advantage: High throughput.
    Disadvantage: Deduplication does not occur at the file level, and the results of de-duplication are not accurate.

    Sr. No. 4
    Paper Name: Venti: A New Approach to Archival Data Storage
    Author Name: S. Quinlan and S. Dorward
    Approach: Chunk based (hash).
    Advantage: Enforces a write-once policy to avoid damage to data.
    Disadvantage: It is not suitable for dealing with mass data, and the system is not scalable.

    Problem Statement

    To develop a reliable, efficient client-side de-duplication system using efficient hash-based techniques, Hadoop, and HBase. It will help in offloading the processing power requirements of the target to the client nodes, reducing the amount of data that has to be sent over the network.

    Objective

    To learn the technologies behind de-duplication techniques for big data.

    Existing System

    Hash-based duplication detection method

    ◦ MD5 and SHA-1 algorithms

    ◦ Data storage and analysis using HDFS with Pig and Hive

    Advantages

    ◦ It is easy to compute the hash value for a given file.

    Disadvantages

    ◦ The MD5 hash function is severely compromised.

    ◦ The collision complexity of MD5 is 2^64, due to its 128-bit digest.

    ◦ Pig and Hive run batch processes on Hadoop; they are not databases.
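    The 2^64 figure follows from the generic birthday bound, which applies to any n-bit hash function. It is an upper bound on the cost of finding a collision and does not account for the much faster dedicated attacks known against MD5. As a short worked step, with n denoting the digest length in bits:

```latex
\[
  \text{work to find a collision} \;\approx\; \sqrt{2^{n}} \;=\; 2^{n/2},
  \qquad
  n = 128 \;\Rightarrow\; 2^{128/2} = 2^{64}.
\]
```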

    Proposed System

    Client-side data duplication detector using Hadoop

    ◦ Hash-based de-duplication technique with the Hadoop framework.

    ◦ One improved hash algorithm.

    ◦ HBase, a NoSQL database used on top of Hadoop for storing and quickly analyzing big datasets.

    Advantages

    ◦ HBase provides random, real-time read/write access to our data and a flexible data model.

    ◦ Space is saved on the storage resource because redundant data is compressed.

    ◦ Fingerprints + HBase achieve high lookup efficiency with high security.

    Implementation: System Architecture

    [Figure: system architecture. Client side: Files, MD5 Generator, Lookup for Existing Hash Key. Passing the non-matched values (unique files) to the server side: MapReduce, HDFS, HBase.]

    Algorithm

    1) Data is written onto HDFS.

      FileSystem fs = FileSystem.get(new Configuration());

    2) By default, Hadoop splits the input file into 64 MB blocks; in the map() function, the current split is read from the context.

      FileSplit fileSplit = (FileSplit) context.getInputSplit();

    3) Generating the hash value of the file.

      String hexVal = MD5.toHexString(MD5.computeMD5(inputVal.getBytes()));

    4) Instantiating the HBase table "Hash" that holds the hash values.

      HTable hTable = new HTable(config, "Hash");
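    The hashing step relies on a helper class named MD5 (computeMD5, toHexString) whose source is not shown in the slides. As a minimal sketch of the same step, assuming only the JDK's java.security.MessageDigest, the fingerprint could be computed as follows. The names Md5Fingerprint and md5Hex are hypothetical and chosen here for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not the MD5 class used in the slides.
final class Md5Fingerprint {

    // Returns the MD5 digest of the given bytes as 32 lowercase hex characters.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(Character.forDigit((b >> 4) & 0xF, 16)); // high nibble
                sb.append(Character.forDigit(b & 0xF, 16));        // low nibble
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // prints 900150983cd24fb0d6963f7d28e17f72
        System.out.println(md5Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

    The hex string plays the role of hexVal in step 3 and would serve as the row key for the lookup that follows.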

    Algorithm (continued)

    5) Hash values are read from HBase and compared with the current value.

      Get g = new Get(Bytes.toBytes(hexVal));

      Result result = hTable.get(g);

    5.1) If the hash is new, the current hash value is written to HBase and the input file is transferred to the backup server.

      Put p = new Put(Bytes.toBytes(hexVal));

      p.add(Bytes.toBytes("File…

      FileSystem fs = FileSystem.get(conf);

      Path filenamePath = new Path("/user/shilpa/final/");

      FSDataOutputStream out = fs.create(filenamePath);
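    The lookup-then-insert decision in steps 5 and 5.1 can be sketched without a cluster. The block below is an illustration only: a HashMap stands in for the HBase "Hash" table, and FingerprintIndex, registerIfNew, and ownerOf are hypothetical names that are not part of the HBase client API.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the HBase "Hash" table (assumption for illustration).
final class FingerprintIndex {

    // Row key (hex fingerprint) -> name of the first file seen with that content.
    private final Map<String, String> rows = new HashMap<>();

    // Mirrors steps 5 and 5.1: look the fingerprint up and record it only when
    // it is absent. Returns true when the file is new and has to be transferred.
    boolean registerIfNew(String hexFingerprint, String fileName) {
        if (rows.containsKey(hexFingerprint)) {
            return false; // duplicate content: nothing is sent to the backup server
        }
        rows.put(hexFingerprint, fileName);
        return true;      // unique content: the caller uploads the file
    }

    // Returns the file that first registered this fingerprint, or null.
    String ownerOf(String hexFingerprint) {
        return rows.get(hexFingerprint);
    }

    public static void main(String[] args) {
        FingerprintIndex index = new FingerprintIndex();
        String fp = "900150983cd24fb0d6963f7d28e17f72";
        System.out.println(index.registerIfNew(fp, "a.txt"));         // prints true
        System.out.println(index.registerIfNew(fp, "copy-of-a.txt")); // prints false
    }
}
```

    With several clients writing at once, the separate lookup and insert would need to become a single atomic check-and-put against the table; the sketch ignores concurrency.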

    Algorithm (continued)

    One improved hash algorithm, MD5plus, based on MD5 and SHA

    Steps:

    i. Information filling module

    ii. Initialization module

    iii. Hash value calculation module

      F(X, Y, Z) = (X AND Y) OR (NOT(X) AND Z)

      G(X, Y, Z) = (X AND Z) OR (Y AND NOT(Z))

      H(X, Y, Z) = X XOR Y XOR Z

      I(X, Y, Z) = Y XOR (X OR NOT(Z))

    Operational functions: each process has 4 rounds and each round has 16 steps.

      FF(a, b, c, d, Mj, s, ti) means: a = b + ((a + F(b, c, d) + Mj + t[i]) <<< s)

      GG(a, b, c, d, Mj, s, ti) means: a = b + ((a + G(b, c, d) + Mj + t[i]) <<< s)

      HH(a, b, c, d, Mj, s, ti) means: a = b + ((a + H(b, c, d) + Mj + t[i]) <<< s)

      II(a, b, c, d, Mj, s, ti) means: a = b + ((a + I(b, c, d) + Mj + t[i]) <<< s)
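    The four auxiliary functions and the FF step are the ones used by standard MD5 on 32-bit words, so they can be sketched directly with Java int arithmetic, which wraps modulo 2^32. This is only an illustration of those building blocks, not a full MD5 or MD5plus implementation; the class name Md5Round and the lowercase method names are hypothetical.

```java
// Building blocks of an MD5 round on 32-bit words (illustrative sketch).
final class Md5Round {

    // F(X,Y,Z) = (X AND Y) OR (NOT(X) AND Z): bitwise "if X then Y else Z".
    static int f(int x, int y, int z) { return (x & y) | (~x & z); }

    // G(X,Y,Z) = (X AND Z) OR (Y AND NOT(Z)): bitwise "if Z then X else Y".
    static int g(int x, int y, int z) { return (x & z) | (y & ~z); }

    // H(X,Y,Z) = X XOR Y XOR Z: bitwise parity.
    static int h(int x, int y, int z) { return x ^ y ^ z; }

    // I(X,Y,Z) = Y XOR (X OR NOT(Z)).
    static int i(int x, int y, int z) { return y ^ (x | ~z); }

    // FF step: a = b + ((a + F(b,c,d) + Mj + ti) <<< s), with <<< a left rotation.
    static int ff(int a, int b, int c, int d, int mj, int s, int ti) {
        return b + Integer.rotateLeft(a + f(b, c, d) + mj + ti, s);
    }

    public static void main(String[] args) {
        // One FF step using MD5's standard initial registers, the first padded
        // word of an empty message (0x80), shift 7 and round constant 0xd76aa478.
        int a = ff(0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476,
                   0x00000080, 7, 0xd76aa478);
        System.out.println(Integer.toHexString(a));
    }
}
```

    The GG, HH, and II steps differ from ff only in which auxiliary function is applied to (b, c, d).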

    Algorithm (continued)

    iv. Bit extending module

    Special extending function of (X, Y, Z):

      (X AND Y) OR (X AND Z) OR (Y AND Z), with 40-bit input and output.

    We only have to append the results with eight bits in front, then save them to the 40-bit registers AA, BB, CC, and DD.

    The corresponding operation on (a, b, c, d, Mj, s, ti) means:

      a = b + ((a + [extending function](b, c, d) + Mj + t[i]) << s)

    Output: AA, BB, CC, and DD

    The MD5plus algorithm is based on MD5 and absorbs some excellent functions from SHA1. In hash length, MD5plus has improved to 160-bit.
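    The extending function is a bitwise majority vote: each output bit is 1 when at least two of the three input bits are 1. The same function appears among the SHA-1 round functions, which fits the remark that MD5plus borrows from SHA1. A minimal sketch, with the hypothetical names ExtendFunction and maj, and with register-width handling deliberately left out:

```java
// Bitwise majority of three words (illustrative sketch; width masking omitted).
final class ExtendFunction {

    // (X AND Y) OR (X AND Z) OR (Y AND Z), evaluated on every bit position.
    static long maj(long x, long y, long z) {
        return (x & y) | (x & z) | (y & z);
    }

    public static void main(String[] args) {
        // Bit positions, left to right: (1,1,0) (1,0,1) (0,1,1) (0,0,0)
        System.out.println(Long.toBinaryString(maj(0b1100, 0b1010, 0b0110))); // prints 1110
    }
}
```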

    Data Tables and Discussions

    Algorithmic comparison

    HDFS lacks random read and write access. This is where HBase comes into the picture. It is a distributed, scalable big data store that stores data as key/value pairs.

    We leveraged the Hadoop framework to design and develop a duplication detection system that identifies multiple copies of the same data at the file level itself and eliminates duplicate or redundant files in their entirety before transmission, i.e. at the client (servers) end. It thus allows replicas to be eliminated as desired and keeps the number of unnecessary replicas under control; the replicas that remain are managed and controlled as per the requirements. Using hash-based techniques, duplication is detected faster.

    References and Bibliography

    [1] K. Bakshi, "Considerations for Big Data: Architecture and Approach," in Aerospace Conference, 2012 IEEE, Big Sky, MT, 3-10 March 2012, pp. 1-7.

    [2] P. Malik, "Governing Big Data: Principles and Practices," IBM Journal of Research and Development, vol. 57, pp. 1:1-1:13, 2013.

    [3] D. Geer, "Reducing the Storage Burden via Data Deduplication," in Computer, the flagship publication of the IEEE Computer Society, vol. 41, pp. 15-17, 2008.

    [4] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality," in 7th USENIX Conference on File and Storage Technologies, San Francisco, California.

    [5] B. Zhu, K. Li, and H. Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," in Proceedings of the 6th USENIX Conference on File and Storage Technologies, San Jose, California, 2008, pp. 269-282.

    References and Bibliography (continued)

    [6] S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Data Storage," in Proceedings of the 1st USENIX Conference on File and Storage Technologies, Monterey, CA: USENIX Association, 2002.

    [7] Z. Sun, J. Shen, and J. Yong, "DeDu: Building a Deduplication Storage System over Cloud Computing," in 2011 15th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Lausanne, 2011, pp. 348-355.

    [8] C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki, "HYDRAstor: a Scalable Secondary Storage," in Proceedings of the 7th Conference on File and Storage Technologies, San Francisco, California, 2009, pp. 197-210.

    [9] D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup," in 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS).

    Thank You!