Data Lake: A simple introduction

© 2016 IBM Corporation Learn more about Data Lakes on ibm.com: https://ibm.biz/Bdswi9 IBM’s Data Lake – A Basic Definition 1st June 2016 Mandy Chessell CBE FREng CEng FBCS Distinguished Engineer, Master Inventor Analytics Group CTO Office

Transcript of Data Lake: A simple introduction

Page 1: Data Lake: A simple introduction

© 2016 IBM Corporation

Learn more about Data Lakes on ibm.com: https://ibm.biz/Bdswi9

IBM’s Data Lake – A Basic Definition
1st June 2016

Mandy Chessell CBE FREng CEng FBCS
Distinguished Engineer, Master Inventor
Analytics Group CTO Office

Page 2: Data Lake: A simple introduction

Data blues & skills issues

§ A disproportionate portion of the time spent in an analytics project goes to data preparation: acquiring, preparing, formatting and normalizing the data

§ In addition to raw data, augmented data/analytical assets can significantly speed up the analytics process and partially bridge the talent gap

Page 3: Data Lake: A simple introduction

A growing demand …

Business Teams want
• Open access to more information
• More powerful analysis and visualization tools

IT Teams are
• Concerned about cost.
• Concerned about governance and regulatory requirements.

Page 4: Data Lake: A simple introduction

Big Data Lakes or Swamps?

§ As we collect data
• Can we preserve clarity?
• Do we know what we are collecting?
• Can we find the data we need?

§ Are we creating a data swamp?

§ How do we build trust in big data?
• Do we know what data is being used for?

Page 5: Data Lake: A simple introduction

"The need for increased agility and accessibility for data analysis is the primary driver for data lakes," said Andrew White, vice president and distinguished analyst at Gartner. "Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organization, the proposition of enterprise wide data management has yet to be realized."

http://www.gartner.com/newsroom/id/2809117

Page 6: Data Lake: A simple introduction

IBM’s Data Lake – designed for data access – with safeguards

IBM’s Data Lake = Efficient Management, Governance, Protection and Access.

Data Lake (System of Insight)

Information Management and Governance Fabric

Data Lake Services

Data Lake Repositories

Page 7: Data Lake: A simple introduction

Users supported by IBM’s Data Lake

[Diagram: Data Lake (System of Insight)]

Data lake components shown: Information Management and Governance Fabric, Data Lake Services, Data Lake Repositories

Users: Line of Business Teams, Analytics Teams, Information Curator, Governance, Risk and Compliance Team, Data Lake Operations

Enterprise IT: Other Data Lakes, Systems of Engagement, Systems of Automation, Systems of Record, New Sources

Page 8: Data Lake: A simple introduction

The subsystems inside IBM’s Data Lake

[Diagram: Data Lake (System of Insight)]

Subsystems shown: Catalogue, Self-Service Access, Enterprise IT Data Exchange, Data Lake Repositories, Analytics Engines, Information Management and Governance Fabric

Users: Analytics Teams, Governance, Risk and Compliance Team, Information Curator, Line of Business Teams, Data Lake Operations

Enterprise IT: Other Data Lakes, Systems of Engagement, Systems of Automation, Systems of Record, New Sources

Page 9: Data Lake: A simple introduction

View from the user community - fraud

Conform to regulations

Investigate Fraud Case

Develop new fraud models

Detect and prevent fraud

Page 10: Data Lake: A simple introduction

The role of the catalogue

[Diagram labels: Data Stores, Sand Box, Information Governance Catalogue]

Information Governance Catalogue: Curation of Metadata about Stores, Models, Definitions

• Search for, locate and download data and related artifacts.
• Provision Sand Boxes.
• Add additional insight into data sources through automated analysis.
• Develop data management models and implementations.
• Define governance policies, rules and classifications. Monitor compliance.
• View lineage (business and technical) and perform impact analysis.
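
As a rough illustration of the catalogue functions named on this slide (search, lineage and impact analysis), the following is a minimal Python sketch. It is not IBM's catalogue or its API: the `Catalogue` class, its methods and the asset names are all hypothetical, and lineage is assumed to be a simple graph of "derived from" links between data assets.

```python
from collections import defaultdict, deque


class Catalogue:
    """Hypothetical in-memory catalogue: asset metadata plus lineage links."""

    def __init__(self):
        self.assets = {}                    # asset name -> metadata
        self.upstream = defaultdict(set)    # asset -> assets it was derived from
        self.downstream = defaultdict(set)  # asset -> assets derived from it

    def register(self, name, classification, sources=()):
        """Record an asset, its classification and the assets it derives from."""
        self.assets[name] = {"classification": classification}
        for src in sources:
            self.upstream[name].add(src)
            self.downstream[src].add(name)

    def search(self, term):
        """Find assets whose name contains the search term."""
        return sorted(n for n in self.assets if term.lower() in n.lower())

    def _walk(self, start, edges):
        # Breadth-first traversal collecting everything reachable from start.
        seen, queue = set(), deque([start])
        while queue:
            for nxt in edges.get(queue.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return sorted(seen)

    def lineage(self, name):
        """Where did this asset come from? (all upstream assets)"""
        return self._walk(name, self.upstream)

    def impact(self, name):
        """What is affected if this asset changes? (all downstream assets)"""
        return self._walk(name, self.downstream)


# Hypothetical asset names, for illustration only.
catalogue = Catalogue()
catalogue.register("crm.customers", "sensitive")
catalogue.register("lake.customer_profile", "sensitive", sources=["crm.customers"])
catalogue.register("sandbox.fraud_features", "internal", sources=["lake.customer_profile"])

print(catalogue.impact("crm.customers"))
print(catalogue.lineage("sandbox.fraud_features"))
```

Storing the links in both directions keeps lineage (walking upstream) and impact analysis (walking downstream) as the same traversal over different edge sets.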

Page 11: Data Lake: A simple introduction

Governance ensures proper management and use of information

[Diagram: Information Governance]

Governance areas: Compliance, Standards, Protection, Lifecycle, Quality

Policy: Policy Administration, Policy Enforcement, Policy Monitoring, Policy Implementation

Information: Information Values Quality, Information Dependencies, Information Requirements, Information Supply Chain Integrity, Information Identification, Information Retention, Information Usage, Information Privacy, Information Architecture, Information Disposal

Questions:
• Are people/systems operating properly?
• Is data quality sufficient for use?
• Is data kept for an appropriate length of time?
• Is data properly protected from loss or inappropriate use?
• Are systems built to appropriate standards?

Page 12: Data Lake: A simple introduction

Data lake security

§ The data lake’s repositories are only accessed by authorized processes.

§ People access the data from the data lake through the services.
• Identified through a common authentication mechanism (e.g. LDAP)
• Data classified in the catalog
• Access granted by business owners
• Access controlled by data lake services
• All activity monitored by probes that store log information in the audit data zone.
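
The access path described in these bullets could be sketched roughly as follows. This is an assumption-laden illustration in Python, not IBM's implementation: the `DataLakeServices` class, its fields and the user and data set names are hypothetical, the `authenticated` flag stands in for a real check against a directory such as LDAP, and a plain list stands in for the audit data zone.

```python
from dataclasses import dataclass, field


@dataclass
class DataLakeServices:
    """Hypothetical service layer that sits in front of the repositories."""

    classifications: dict                          # data set -> classification (from the catalog)
    grants: set = field(default_factory=set)       # (user, classification) pairs approved by owners
    audit_log: list = field(default_factory=list)  # stand-in for the audit data zone

    def grant(self, owner, user, classification):
        """A business owner grants a user access to a class of data."""
        self.grants.add((user, classification))
        self.audit_log.append(("grant", owner, user, classification))

    def read(self, user, authenticated, dataset):
        """Decide whether a read is allowed and record the attempt either way."""
        classification = self.classifications.get(dataset)
        allowed = (
            authenticated                               # identified by the common mechanism
            and classification is not None              # data is classified in the catalog
            and (user, classification) in self.grants   # access granted by a business owner
        )
        self.audit_log.append(("read", user, dataset, "allowed" if allowed else "denied"))
        return allowed


# Hypothetical users and data sets, for illustration only.
services = DataLakeServices(classifications={"claims_2015": "sensitive"})
services.grant(owner="claims_owner", user="analyst1", classification="sensitive")

print(services.read("analyst1", True, "claims_2015"))  # has a grant
print(services.read("analyst2", True, "claims_2015"))  # no grant on record
```

Every request is logged whether or not it succeeds, which mirrors the point that all activity is monitored rather than only the permitted reads.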

IBM’s Data Lake = Efficient Management, Governance, Protection and Access.

Data Lake

Information Management and Governance Fabric

Data Lake Services

Data Lake Repositories

Page 13: Data Lake: A simple introduction

IBM’s Data Lake – example deployment options

[Diagram: Data Lake (System of Insight), annotated with example IBM products]

Subsystems shown: Catalogue, Self-Service Access, Enterprise IT Data Exchange, Analytics Engines, Information Management and Governance Fabric

Users: Analytics Teams, Governance, Risk and Compliance Team, Information Curator, Line of Business Teams, Data Lake Operations

Enterprise IT: Other Data Lakes, Systems of Engagement, Systems of Automation, Systems of Record, New Sources

Example products shown: InfoSphere Streams; InfoSphere Information Server; InfoSphere Information Server, Optim and Guardium; Cognos; Watson Explorer; Cloudant; PureData/BLU; InfoSphere BigInsights; InfoSphere Master Data Management; Watson Analytics; SPSS

Page 14: Data Lake: A simple introduction

IBM’s Data Lake

§ As organizations experiment with analytics they discover:
• Creating new analytics requires access to historical data from many systems.
• This data includes valuable and sensitive data that is core to the organization’s operation.
• Hadoop is a flexible platform for storing many types of data but is not necessarily fast enough for the production deployment of some analytics. Data needs to be reformatted and copied onto specialist analytics platforms such as Netezza.

§ A data lake provides:
• Single extraction of data from operational systems and distribution to multiple analytics platforms.
• Cataloguing and governance of the data in the analytics platforms
• Simple interfaces for the line of business to access the information they need.
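
The "single extraction, multiple distribution" idea can be illustrated with a small Python sketch. Everything here is hypothetical: the function names, the platform names and the per-platform reformatting rules are assumptions made for illustration, and a generator stands in for an operational system that should only be read once.

```python
def extract_once(source_rows):
    """Read the operational source a single time and keep a snapshot."""
    return list(source_rows)


def distribute(snapshot, platforms):
    """Deliver the same snapshot to every platform, reformatted for each one."""
    return {
        name: [reformat(row) for row in snapshot]
        for name, reformat in platforms.items()
    }


def operational_source():
    # Stand-in for an operational system; a generator can only be consumed once.
    yield {"id": 1, "amount": 250.0}
    yield {"id": 2, "amount": 75.5}


# Hypothetical target platforms, each with its own format.
platforms = {
    "hadoop": lambda row: dict(row),                     # keep the raw record
    "warehouse": lambda row: (row["id"], row["amount"]), # flatten to a tuple
}

snapshot = extract_once(operational_source())
delivered = distribute(snapshot, platforms)

print(delivered["hadoop"])
print(delivered["warehouse"])
```

The operational source is touched once in `extract_once`; adding another analytics platform only adds an entry to `platforms`, not another read of the source.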

IBM’s Data Lake = Efficient Management, Governance, Protection and Access.

Data Lake

Information Management and Governance Fabric

Data Lake Services

Data Lake Repositories

Page 15: Data Lake: A simple introduction

Governing and managing Big Data for Analytics and Decision Makers

§ An introduction to IBM’s Data Lake solution

http://www.redbooks.ibm.com/redpieces/abstracts/redp5120.html?Open

Page 16: Data Lake: A simple introduction

Designing and Operating a Data Reservoir

§ Description of the behaviour and processes that make up a data lake from IBM (aka data reservoir)

§ Blog
• 5 things to know about a data reservoir: https://www.ibm.com/developerworks/community/blogs/5things/entry/5_things_to_know_about_data_reservoir?lang=en

§ Redbook
• http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/sg248274.html?Open

Page 17: Data Lake: A simple introduction

Ethics for Big Data and Analytics

✓ Context – for what purpose was the data originally surrendered? For what purpose is the data now being used? How far removed from the original context is its new use?

✓ Consent & Choice – What are the choices given to an affected party? Do they know they are making a choice? Do they really understand what they are agreeing to? Do they really have an opportunity to decline? What alternatives are offered?

✓ Reasonable – is the depth and breadth of the data used and the relationships derived reasonable for the application it is used for?

✓ Substantiated – Are the sources of data used appropriate, authoritative, complete and timely for the application?

✓ Owned – Who owns the resulting insight? What are their responsibilities towards it in terms of its protection and the obligation to act?

✓ Fair – How equitable are the results of the application to all parties? Is everyone properly compensated?

✓ Considered – What are the consequences of the data collection and analysis?

✓ Access – What access to data is given to the data subject?

✓ Accountable – How are mistakes and unintended consequences detected and repaired? Can the interested parties check the results that affect them?

http://www.ibmbigdatahub.com/whitepaper/ethics-big-data-and-analytics

Page 18: Data Lake: A simple introduction

Common Information Models for an Open, Analytical and Agile World

§ To drive maximum value from complex IT projects, IT professionals need a deep understanding of the information their projects will use. Too often, however, IT treats information as an afterthought: the “poor stepchild” behind applications and infrastructure. That needs to change. This book will help you change it.

§ Using a complete case study, the authors explain what CIMs are, how to build them, and how to maintain them. You learn how to clarify the structure, meaning, and intent of any information you may exchange, and then use your CIM to improve integration, collaboration, and agility.

§ In today’s mobile, cloud, and analytics environments, your information is more valuable than ever. To build systems that make the most of it, start right here.

Page 19: Data Lake: A simple introduction

Data Lake: Taming the Data Dragon (White Paper)

Taming the data dragon leads to significant benefits across the enterprise, from improved productivity to increased effectiveness in sales and marketing. A data lake accepts data flows from any source and brings them into a common platform for use. Data is stored in its raw, unrefined state and located, processed, refined and extracted as required. However, governance needs to be applied to the data lake to ensure it becomes a trusted data source, rather than a formless landing area in which data is stored without consideration of its validity, value or shelf life.

Download Now: https://ibm.biz/Bdswiu