NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP...

28
NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York, NY By Tom Haughey InfoModel, LLC [email protected] 201 755 3350

Transcript of NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP...

Page 1: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

NOSQLandDataLakeArchitecture

IBM590MadisonAvenue(at57thSt.)

NewYork,NY

ByTomHaugheyInfoModel,LLC

[email protected]

Page 2: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

Agenda•  DescribeNOSQL•  ProposerelaConshipofNOSQLtothedatalake•  AnalyzethecapabiliCesofNOSQLtosupportBI•  Proposeanewoverallreferencedataarchitecture

©InfoModel,LLC.2017 2

Page 3: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

NOSQL

Ontheoutside

Ontheinside–

hundredstothousandsofservers

Server

©InfoModel,LLC.2017 3

Page 4: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

Other

Author

•  Blog entry could also be absorbed into Blog as an array. •  All arrows are links. •  In many cases, the actual user name would be included rather than just the ID. •  This is an example of a NOSQL data model, not necessarily of notation •  [ ] indicates embedding.

User

User ID User Name Password

Blog

Blog ID Blog Name Blog Description Blog Timestamp User ID Tag [ ] Tag Keyword

Blog Entry

Blog Entry TimeUUID Blog Entry Text Blog ID Comment [ ]

Comment TimeUUID Comment Text User ID Comment Timestamp

Follows

User ID [ ] Followed User ID

Followed By

User ID [ ] Follower User ID

Subscribes To

User ID [ ] Subscribed to Blog ID Subscription Date

Subscribed By

Blog ID [ ] Subscriber User ID Subscription Date

Time Ordered Blog Entries

User ID Blog Entry TimeUUID Posted By User ID Blog Entry Text

BlogNOSQLSampleDesign

©InfoModel,LLC.2017 4

Page 5: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

UnderstandingNOSQL•  ThefollowingessenLaltoproperunderstandingofNOSQL:

–  DatadistribuLon:Dataisspreadhorizontallyovermanyservers–  Semi-structured:Not“schema-less”;schemaisembeddedinthedata

ortheapplicaCon–  Composite.Astructurecanconsistofotherstructures–  Hierarchical:1:Mrankingfromparenttochild(exceptgraphDBMSs)–  Key-valuestructure:MostNOSQL’sconsistofakeyandavalue

•  Valuecanbeablob,string,orcontainerofotherkey:valuepairs.•  TheexcepGontothisisgraphDBMS

–  Materializedqueries:Basicallyastructureforeachquery–  ApplicaLonorientaLon:Notenterprise-oriented,butqueryoriented.–  RelaLonships:UnidirecConallinks,notjoins.–  Datamodel:ANOSQLdatamodelisaphysical,notlogical,datamodel

•  Fourmaintypes:key:value,widecolumn,document,graph©InfoModel,LLC.2017 5

Page 6: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

1-KeyValue

Key Value

Whateveryouwantittocontain,suchasaCSV,ablob,orotherkey:values,e.g.,streamingstockmarketdata:

record={Ccker:…,rate_date:…,price:…,price_change:…,percent_change:…

UseCasesFilestorageLogrecordsSessionstorageContentmanagementStreamingdata(songs,albums,etc.)ProductdatamanagementHighvolumedatafeedsUserdatamanagementHadoop

@InfoModel,LLC.2017 6

Page 7: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

2-WideColumn(ColumnFamily)

Column Family: Personal Column Family: Business Row Key

Name Home Phone Office Phone Address

00001 Luke 212 555 5689 201 891 2536 21 McCoy Ave 00021 Bryce 212 232 6785 201 766 2091 26 McCoy Ave 00033 Matthew 201 234 5768 201 766 7381 26 McCoy Ave 00046 James 908 435 6242 908 657 5438 5 Hatfield St 00057 Jack 347 361 5429 973 376 8394 6 Wallace St

Thetableissortedbasedonrowkey

EachcellcanhavemulCpleversionsindividuallyCmestamped

347 361 9876

Timestamp1 Timestamp2

Cells

•  SomeCmescalledcolumn-orientedbutactuallyrow-oriented•  SeveralvariaConsbutthisistheHbaseversion

©InfoModel,LLC.20177

Page 8: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

2–WideColumn(Cassandra)

CharacterisLcs:•  Sparsetablestructure•  Foreachkey,therecanbe:

–  Variableahributes–  Newcolumnswithout

indicaCngtheyarenew–  Omissionofcolumns

Key1 A=6 C=3 D=53 E=10Key2 B=42 D=62 E=84Key3 A=1 B=5 F=20 H=9

UseCases•  ProductCatalog/Playlist•  RecommendaCon/

PersonalizaConEngine•  SensorData/InternetofThings•  Messaging•  FraudDetecCon•  UsedbyeBay,Twissandra…

©InfoModel,LLC.2017 8

Table’ssortedbasedonrowkey

Columnssortedbasedoncolumnkey

Page 9: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

3-Document•  A key : value store that understands the data Posts =

{_id: “A12345” author: “Mickey”, date: 22/6/2012, text: “The Adventures of Mickey Mouse”, keywords: [“drama”, “comic”, “adventure”] } comments : [ { author: “Nick Machiavelli”, date: 11/12/2012, rating: “Less filling, tastes great”, votes: 7000 } , … ] }

©InfoModel,LLC.2017 9

Page 10: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

•  Contentmanagement•  Highvolumedatafeeds

–  Stockmarketdata–  Streamingmusic

•  OperaConalintelligence–  Atdetailedlevel–  Ataggregatelevel

•  Productdatamanagement•  Userdatamanagement(users,profiles,etc.)•  Hadoop

UseCasesforDocumentDataStores

©InfoModel,LLC.2017 10

Page 11: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

4 - Graph Databases •  Every element contains a direct pointer to its adjacent elements

hhp://www.infoq.com/arCcles/graph-nosql-neo4j©InfoModel,LLC.2017 11

Page 12: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

GraphDatabases•  Based on graph theory •  Addresses data complexity •  Direct path operations are easy •  Can be transactional •  FOAF (Friend Of A Friend) •  Usually paired with an index for search

©InfoModel,LLC.2017 12

Page 13: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

Joins•  NOSQLstructuresarematerializedquerieswithallor

mostofthequerydataembeddedintotheonestructure•  Joinsarenotjust“unnecessary”inNOSQL,theyarenot

possible(throughtheDBMSs)–  Rememberdataishighlydistributedacrossmany,manyservers–  Joinswouldbefar-reachingandinefficient–  NOSQLembeddingimpliesredundancy

•  Joiningacrossserverswouldrequire(somethinglike):–  UseofaparCConingkeythatcanhash(notnecessarilythePK)–  AhashingalgorithmthatconsistentlyresolvesaparCConingkeytothesamenode

–  An~evendistribuConofdata–  Ideally,collocaConofdatatobejoined

©InfoModel,LLC.2017 13

Page 14: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

MetadataintheDataLake•  Somemetadata,suchasdatatype,length,domain,

granularity,business/technicaldefiniConandothers,musteventuallybeassignedtodatalakefor:–  Data–  RelaConshipsandmore

•  SayMonthlySalesRevenueisingestedintothedatalakefromdifferentorgs/countries(inwhichcasethesetotalsarerawdata)–  Arethegrainsthesame?Dotheyagree?–  TheyarebothSalesRevenuebutsupposeoneis“salesasoflastdayofthemonth”andtheotheris“salesasofthelastFridayofthemonth”

–  Metadataisrequiredtounderstandthisdata

©InfoModel,LLC.2017 14

Page 15: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

DataLakeonNOSQL?•  AdatalakecanresideonHadoop,NoSQL,AmazonSimpleStorageService,

arelaConaldatabase,ordifferentcombinaConsofthem•  Fedbydatastreams•  Datalakehasmanytypesofdataelements,datastructuresandmetadata

inHDFSwithoutregardtoimportance,IDs,orsummariesandaggregates•  Importanttounderstandthevariegatednatureofthedatalakedatain

relaContothemetamodeloftheperspecCveNOSQLdatabase–  Semi-structured–  Key:value(mostly)withitshierarchicalstructure–  ThekeyandcolumnnamebeingessenCalpartsofmostNOSQL

•  MoreotendatalakeiskeptonHadoopandfedtoorfromNOSQL–  SomeaddiConallyusedagraphdatabaseontoptokeeptrackofthe

relaConships–  OnceloadedontoNOSQL,thefullpowerofNOSQLcanbeused

•  Note:NOSQLisanoperaConal,notananalyCcal,datastore©InfoModel,LLC.2017 15

Page 16: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

BIonNOSQL•  NOSQLdoesnotyethavecommodityBItools•  Inall,thereareseveralapproaches*toBIonNOSQL,

whichdivideintotwomajorgroups:–  ThosethatuseNoSQLonlyandstrengthentheapplicaConwithbeherUIs,beheradhocreporCngandothercustomfeaturesontotheNoSQLproductstheyarecurrentlyusing

–  ThosethatuseNoSQLtoruntheirapplicaCons,butthentakethatdataoutoftheNoSQLsystemandputitintoaRDBMSortradiConaldatawarehouseformore“aterthefact”analysis.

•  EachapproachhasmanysuccessstoriesandthebestapproachforaparCcularcompanyisbasedontheirspecificneeds,budgetsandskills

*BasedonpresentaConbyNicholasGoodman,andarCclesbyCharlesRoe“BI/AnalyCcsonNOSQL”

©InfoModel,LLC.2017 16

Page 17: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

1-ApplicaLononTopofNOSQL•  ReportsonNOSQL

–  HaveadeveloperbuildanapplicaConforreporCngontopofNOSQL

–  HasthefullrichnessofNoSQLbutisexpensiveduetotheneedforadeveloper

NOSQL ApplicaCon

©InfoModel,LLC.2017 17

Page 18: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

2-EnhancedNOSQL-Only•  AddsadynamicquerybuilderintothereporCngapp•  Needsadevelopertobuilditisevenmoreexpensive

NOSQLQueryBuilderApplicaCon

IndexesAggregates

©InfoModel,LLC.2017 18

Page 19: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

3-NOSQLExtracttoRelaLonal•  SimilartotradiConalETLbutHadooporNOSQLarethesource–  ExtractfromNOSQL/HadoopandinsertintoRDBMS–  AllowstheuseofrichBItools

•  Addstofirstapproach,creaCngadynamicquerybuilderintothereporCngsystem–  Guidedadhoc–  Datafreshnessissueduetoday-olddata

RDBMSETLNOSQL/Hadoop

BI

©InfoModel,LLC.2017 19

Page 20: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

Othersources

4-NOSQLasETLSource•  NOSQLsarejustpartoftheDWsourcing(ETL)•  DataextractedfromNOSQL/HadoopandinsertedintoDWandintegratedwithotherDWdata

•  ProsandCons–  CanusestandardBItools,whicharecostly–  Richflexibility/scalabilityonNOSQLgone–  ETLdevelopmentcost–  Note:Hadoopisabatchenvironment,notreal-Cme

RDBMSETLNOSQL/Hadoop

BI

©InfoModel,LLC.2017 20

Page 21: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

5-NoSQLAddedtoBITools•  Developerintensive

–  AddsaservicetoastandardcommodityBItool–  FlahenstheNoSQLdataandoutputsitintoareport–  NoneedforSQLoradhocweb-basedaccesstools–  NoneedforloadsofexpensivereportswriheninNOSQL

•  TheprogramiswrihenoneCme–  ItisanapplicaConwithsubstanCaldevelopercosts–  CanuseM:Rforup-to-dateaccessto100%ofthedataset–  AggregaConsmightbeslower

NOSQL

NOSQLAddedto

BI

©InfoModel,LLC.2017 21

Page 22: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

6–InterfacingNOSQLandBI•  ThirdpartyEnterpriseInformaConIntegraCon(EII)between

thecommodityBIandtheNoSQLordatalake–  TheEIItoolcanspeaktobothNOSQLandBI–  IntegraConwithotherdata,givesliveup-to-dateaccess–  ETLissimplewithINSERT/MERGEsdonenightlyandhasadhocaccess

tolive,cacheddata.–  “Bestofboth”:NoSQLontheback,BIonthefront–  ThecostandcomplicaConsofintroducingathirdsystem–  ThereissCllsomelossoftherichnessofNoSQLlanguage

EII BIToolsNOSQL

©InfoModel,LLC.2017 22

Page 23: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

TransformingData•  Tousedata,itmustbeputintoauseablestate•  Youcannotputrawdataintoanyrepository(including

thedatalake)andpretendthatipsofactoiteliminatestheneedfortransformaCon(asmanyclaim!)

•  It’sacaseof“Paymenoworpaymelater”•  DataiseithertransformedbyETL/ELTbeforeitisstored

oritistransformedaKerbythequery–  Schemaonwrite-abatchprocessofETLorELTinarelaConalenvironment

–  Schemaonreadbythequerymanager–whetherinRDBMS,NOSQL,orMap:Reduce

•  OnemainreasonforstoringaggregatesintheDWistoprovideconsistentnumbers–  Thedatalaketooshouldensureconsistentnumbers

©InfoModel,LLC.2017 23

Page 24: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

SampleStrategicQuery•  Strategicqueriesusingmanydimensionsand

aggregaConscouldposeaproblemtoNOSQL–  “GivemeabreakdownoftotalcommissionspaidandtransacConcountbyaccount,summarizedbyproducttypeandproductclass,andorderedbytheorganizaConthatownstheproductandtheorganizaConthatsoldtheproduct.”

•  SuchaquerywouldbedifficulttodoinNOSQLwithoutmaterializingthequeryandusingMap/ReducefuncCons

•  ADWonarobustpla|ormcouldefficientlysupportthiswithouthavingtomaterializetheaggregateandsCllsupportquerieswithothermixesofdimensions

©InfoModel,LLC.2017 24

Page 25: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

MiningDataLakeData•  MiningistheuseofmathemaCcalalgorithmstofind

hiddenrelaConshipsinthedata•  Justasthepopularityofnewtoolsisexploding,soare

thecapabiliCesindatamining•  NOSQLiswellsuitedfordatamining•  Data-miningtechniquesfallintofourmajorcategories:

–  ClassificaCon–suchastargetedmarkeCng–  AssociaCon–suchasmarketbasketanalysis–  Sequencing–thosewhoboughtthisboughtthat–  Clustering–developingconclusionsusingspaceanddistance

•  NOTE:InHadoop,queryingandminingcanbedonethroughHive,MahoutandPig

©InfoModel,LLC.2017 25

Page 26: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

DataLakeReferenceArchitecture

Appliance HADOOP NOSQL EDW Mart

RYO data

Consumers

Big Data Data Warehouse

RDBMSs

Real-time Analytics

Files Cloud Web Logs OLAP Tables Docs Sensors Events XML/JSON Streams

Data Streams

Trusted Data

Data Lakes

Raw Data

Data Streams

©InfoModel,LLC.2017 26

Page 27: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

DW,DataMartandNOSQL•  ADWisageneralizedenvironmentinwhichdataisgatheredfrom

manysources,transformed,storedandthendeliveredwithbusinessmeaning–  Feedsanytypeofquery,includingadhocqueries–  ApplicaConneutral

•  DatamartsareaspecializedcollecConofrelateddataforaparCcularusercommunityandforalimitedCme–  Areuser-andapplicaCon-specific–  WillanswerverywellthosequesConswithindatamartsscope

•  FourtypesofNOSQL–  Key:value,widecolumn,document,graph–  MostNOSQLdatabasesarehierarchicalandoperaConal–  Thismeansmaterializingalmosteveryqueryrequiringanydatastores

•  DataLake–  AstoragerepositorythatholdsavastamountofrawdatainitsnaCve

formatunClitisneeded

©InfoModel,LLC.2017 27

Page 28: NOSQL and Data Lake Architecture - DAMA NY...Data Lake Reference Architecture Appliance NOSQL HADOOP EDW Mart RYO data Consumers Big Data Data Warehouse RDBMSs Real-time Analytics

QuesLons?

ThetermNOSQLwasfirstusedbyCarloStrozziin1998todescribeafile-based

databasehewasbuilding