2. Third Nature Inc. Summary: Common uses and commodity technology
lead to novel practices, which lead to different data and different
technology needs, which lead to new architectures, which lead back to
common uses and commodity technology.
3. Third Nature Inc. Our ideas about information and how it's
used are outdated.
4. Third Nature Inc. How We Think of Users. Our design point is the
passive consumer of information. Proof: methodology. IT role is
requirements, design, build, deploy, administer. User role is run
reports. Self-serve BI is not like picking the right doughnut from a
box.
5. Third Nature Inc. How We Think of Users. Our design point is the
passive consumer of information. Proof: methodology. IT role is
requirements, design, build, deploy, administer. User role is run
reports. Self-serve BI is not like picking the right doughnut from a
box. How We Want Users to Think of Us.
6. Third Nature Inc. How We Think of Users. What Users Really Think.
7. Third Nature Inc. We think of BI as publishing, an old metaphor.
Publishing has value, but may not be actionable.
8. Third Nature Inc. Planning data strategy means understanding the
context of data use so we can build infrastructure. Diagram labels:
Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem,
No idea, Do nothing. We need to focus on what people do with
information as the primary task, not on the data or the technology.
9. Third Nature Inc. General model for organizational use of data.
Diagram labels: Monitor, Analyze Exceptions, Analyze Causes, Decide,
Act; No problem, No idea, Do nothing. Act within the process: usually
real-time to daily.
10. Third Nature Inc. Origin of BI and data warehouse concepts. The
general concept of a separate architecture for BI has been around
longer, but this paper by Devlin and Murphy is the first formal data
warehouse architecture and definition published: "An architecture for
a business and information system", B. A. Devlin, P. T. Murphy, IBM
Systems Journal, Vol. 27, No. 1 (1988). Copyright Third Nature, Inc.
11. Third Nature Inc. Origins: in 1988 there was only big hair. No
real commercial email, public internet barely started. Storage state
of the art: 100 MB, cost $10,000/GB. Oracle Applications v1 GL
released; SAP goes public, enters US market. Unix is mostly run by
long-haired freaks. Mobile was this. This is the context: scarcity of
data, of system resources, of automated systems outside core
financials, of money to pay for storage.
12. Third Nature Inc. General model for organizational use of data.
Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze
Causes, Decide, Act; No problem, No idea, Do nothing. Act on the
process: usually days/longer timeframe.
13. Third Nature Inc. You need to be able to support both paths.
Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze
Causes, Decide, Act; Act on the process, Act within the process;
Conventional BI, addition of EDM; Causal analysis, data science.
14. Third Nature Inc. The usage models for conventional BI. Diagram
labels: Collect new data, Monitor, Analyze Exceptions, Analyze Causes,
Decide, Act; No problem, No idea, Do nothing. Act on the process:
usually days/longer timeframe. Act within the process: usually
real-time to daily. This is what we've been doing with BI so far:
static reporting, dashboards, ad-hoc query, OLAP.
15. Third Nature Inc. The usage models for analytics and big data.
Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze
Causes, Decide, Act; No problem, No idea, Do nothing. Act on the
process: usually days/longer timeframe. Act within the process:
usually real-time to daily. Analytics and big data is focused on new
use cases: deeper analysis, causes, prediction, optimizing decisions.
This isn't ad-hoc, reporting, or OLAP.
16. Third Nature Inc. When you first give people access to information
that was unavailable: "OH GOD I can see into forever"
17. Third Nature Inc. After a while it becomes the new normal.
18. Third Nature Inc. As practices evolve based on new capabilities, a
new level of complexity develops over top of the older, now better
understood processes, leading to new data and analysis needs.
19. Third Nature Inc. "I never said the 'E' in EDW meant
'everything'." "What do you mean, 'Just doughnuts'?"
20. Third Nature Inc. The data warehouse vs. business agility. All the
data. Common, typed, tabular data. The bottleneck is you.
21. Third Nature Inc. It's going to get a lot worse. Diagram labels:
Not E, E. Conclusion: any methodology built on the premise that you
must know and model all the data first is untenable.
22. Third Nature Inc. Old market says: "There's nothing wrong with what
you have, just keep buying new products from us."
23. Third Nature Inc. The emerging big data market has an answer.
24. Third Nature Inc. The data lake.
25. Third Nature Inc. The data lake after a little while.
26. Third Nature Inc. TANSTAAFL. When replacing the old with the new
(or ignoring the new over the old) you always make tradeoffs, and
usually you won't see them for a long time. Technologies are not
perfect replacements for one another. Often not better, only
different.
27. Third Nature Inc. "Big data is unprecedented." - Anyone involved
with big data in even the most barely perceptible way
28. Third Nature Inc. We've been here before.
Source: Bill Schmarzo, EMC
29. Third Nature Inc. Big is well supported by databases now.
Source: Noumenal, Inc.
30. Third Nature Inc. Orders of magnitude: 20 years ago TB, today PB.
Shifts in data availability by orders of magnitude necessitate new
means of managing and using it.
31. Third Nature Inc. Analytics embiggens the data volume problem.
Many of the processing problems are O(n^2) or worse, so moderate data
can be a problem for DB-based platforms.
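To make the quadratic growth concrete, here is a small arithmetic sketch. It assumes an all-pairs comparison (for example, naive similarity matching), which is one common O(n^2) workload; the function name and record counts are illustrative and not from the slides.

```python
def pairwise_comparisons(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Growing the input by 100x grows the work by roughly 10,000x.
for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>12,} records -> {pairwise_comparisons(n):,} pairs")
```

Even a table that fits comfortably in a database can produce far more pair comparisons than it has rows.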
33. Third Nature Inc. Most people do not need special technology.
Chart axes: number of people vs. bigness of data. The distribution
of data size is about normal, yet these guys set the tone of the
market today.
34. Third Nature Inc. Analytics: This is really raw data under
storage. Chart axes: number of jobs vs. bigness of data. Microsoft
study of 174,000 analytic jobs in their cluster: median size ???
35. Third Nature Inc. Working data for analytics is most often not
big. Chart axes: number of jobs vs. smallness of data. Marked value:
14 GB.
36. Third Nature Inc. An (overly) Simple Division of the Problem
Space. Axes: computation (little, lots) vs. data volume (little, lots).
Big analytics, little data: specialized computing, modeling problems:
supercomputing, GPUs.
Big analytics, big data: complex math over large data volumes requires
shared-nothing architectures.
Little analytics, little data: the entry point; SAS, SMP databases,
even OLAP cubes can work.
Little analytics, big data: the BI/DW space, for the most part, with
work done in databases.
37. Third Nature Inc. What makes data "big"? Very large amounts.
Hierarchical structures. Nested structures. Linked structures. Encoded
values. Nonstandard (for a database) types. Deep structure.
Human-authored text. "Big" is better off being defined as "complex" or
"hard to manage".
38. Third Nature Inc. Categorizing the measurement data we collect.
The convenient data is the transactional data. It goes in the DW and
is used, even if it isn't the right measurement. The inconvenient data
is observational data. It's not neat, clean, or designed into most
systems of operation. The difficult and misleading data is declarative
data. What people say and what they do require ground truth. We need
an architecture that supports all three categories.
39. Third Nature Inc. Transactions vs. big data. The classic example
of structured data. Transaction data includes: quantification details
(date, value, count) and reference data for explanation (product,
customer, account). Lots of meaningful information. Reference data is
usually shared across the organization, hence its importance. There
are two parts: an identifier to uniquely identify the subject, and
descriptive attributes with common or standardized value domains.
Diagram labels: Transaction details, Reference data.
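The split between transaction details and reference data can be sketched as two record types joined on an identifier. This is a minimal illustration; the class names, field names, and sample values are hypothetical and not taken from any system described in the slides.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Product:            # reference data, shared across the organization
    product_id: str       # identifier: uniquely identifies the subject
    name: str             # descriptive attribute
    category: str         # descriptive attribute with a standardized domain


@dataclass(frozen=True)
class Sale:               # transaction details: date, value, count
    sale_date: date
    product_id: str       # link to the reference data
    count: int
    value_cents: int


# Hypothetical sample data.
PRODUCTS = {p.product_id: p for p in (
    Product("P1", "Greeting card", "CARDS"),
    Product("P2", "Roses", "FLOWERS"),
)}
SALES = [
    Sale(date(2010, 1, 10), "P1", 2, 600),
    Sale(date(2010, 1, 10), "P2", 1, 2500),
    Sale(date(2010, 1, 11), "P1", 1, 300),
]


def value_by_category(sales, products):
    """Aggregate transaction values, explained by reference data."""
    totals = defaultdict(int)
    for sale in sales:
        totals[products[sale.product_id].category] += sale.value_cents
    return dict(totals)


print(value_by_category(SALES, PRODUCTS))
```

The quantities aggregate trivially; the reference data is what gives the totals meaning.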
40. Third Nature Inc. Today it's different data: observations, not
transactions. Sensor data doesn't fit well with current methods of
collection and storage, or with the technology to process and analyze
it.
41. Third Nature Inc. Big data as a type of data: Transactions vs.
Events.
Transactions: Each one is valuable. Mutable. The elements of a
transaction can be aggregated easily. A set of transactions does not
usually have important ordering or dependency.
Events: A single event often has no value, e.g. what is the value of
one click in a series? Some events are extremely valuable, but this is
only detectable within the context of other events. Elements of events
are often not easily aggregated. A set of events usually has a natural
order and dependencies. Immutable.
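The ordering difference can be shown in a few lines. This is a sketch with made-up data: the event names and the "abandoned cart" rule are hypothetical, chosen only to show that a sum ignores order while an event pattern depends on it.

```python
# Transactions: aggregation does not depend on order.
transaction_values = [120, 35, 60]
assert sum(transaction_values) == sum(reversed(transaction_values))


def abandoned_cart(events):
    """True if the last add_to_cart in the sequence has no later checkout."""
    in_cart = False
    for event in events:
        if event == "add_to_cart":
            in_cart = True
        elif event == "checkout":
            in_cart = False
    return in_cart


# Events: the same clicks in a different order mean something different.
session = ["view", "add_to_cart", "view", "checkout"]
print(abandoned_cart(session))                  # order as logged
print(abandoned_cart(list(reversed(session))))  # same events, reversed
```

No single click in the session carries the answer; only the sequence does.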
43. Third Nature Inc. Web tracking data has a nested structure.
USER_ID: 301212631165031
SESSION_ID: 590387153892659
VISIT_DATE: 1/10/2010 0:00
SESSION_START_DATE: 1:41:44 AM
PAGE_VIEW_DATE: 1/10/2010 9:59
DESTINATION_URL: https://www.phisherking.com/gifts/store/LogonForm?mmc=linksrcemail_m100109_44IOJ1_shop&langId=1&storeId=1055&URL=BECGiftListItemDisplay
REFERRAL_NAME: Direct
REFERRAL_URL: (empty)
PAGE_ID: PROD_24259_CARD
REL_PRODUCTS: PROD_24654_CARD,PROD_3648_FLOWERS
SITE_LOCATION_NAME: VALENTINE'S DAY MICROSITE
SITE_LOCATION_ID: SHOPBYHOLIDAYVALENTINESDAY
IP_ADDRESS: 67.189.110.179
BROWSER_OS_NAME: MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
Unstructured data embedded in the logged message: complex strings.
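A short sketch of what "nested" means in practice: two of the flat fields in the record above contain further structure that has to be unpacked before it can be analyzed. The sketch uses only the Python standard library; the URL is assumed to be the line-wrapped value from the slide joined back together, and the `unpack` helper and output keys are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Three fields from the logged record; the URL is rejoined from wrapped lines.
record = {
    "PAGE_ID": "PROD_24259_CARD",
    "REL_PRODUCTS": "PROD_24654_CARD,PROD_3648_FLOWERS",
    "DESTINATION_URL": (
        "https://www.phisherking.com/gifts/store/LogonForm"
        "?mmc=linksrcemail_m100109_44IOJ1_shop&langId=1"
        "&storeId=1055&URL=BECGiftListItemDisplay"
    ),
}


def unpack(rec):
    """Expose the structure hidden inside two string-valued fields."""
    url = urlsplit(rec["DESTINATION_URL"])
    # parse_qs returns a list per key; keep the first value of each.
    params = {key: values[0] for key, values in parse_qs(url.query).items()}
    return {
        "host": url.netloc,
        "path": url.path,
        "params": params,
        "related_products": rec["REL_PRODUCTS"].split(","),
    }


print(unpack(record))
```

A single "column" turns into a small record of its own, which is why such data is awkward in a purely tabular model.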
44. Third Nature Inc. The missing ingredient from most big data.
45. Third Nature Inc. The creation, flow and use of data is different
for transactions and machine-generated events.
Transactions (diagram labels: Data entry, Extract, Cleanse, Load, Use,
Store, MDM): this runs at human speed.
Events (diagram labels: Generate, Store, Use, Use, Cleanse, Program,
Capture): this runs at machine speed, with higher-latency feedback
cycles.
47. Third Nature Inc. You can store this data in an RDBMS, but...
48. Example data: Twitter Message API Payload. Looks like: this is
really just a record format much like a DB row. Datetime, userID,
name, location, description, message, message metadata, etc. But
it's in JSON or XML.
49. Third Nature Inc. Example tweet: @markmadsen Check out: From
#MongoDB to #Cassandra: Why The Atlas Platform Is Migrating
http://owl.li/cvxFK
A tweet has lots of fields, but one important one. The payload is free
text but has other elements. From these things you likely want to
generate or link to reference data. Labeled elements: To username,
Hashtag, Hashtag, URL.
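Pulling those elements out of the free-text payload can be sketched with three regular expressions. The patterns below are deliberately simplified assumptions, not Twitter's actual tokenization rules, and the function name is hypothetical; they would, for instance, misread an email address as a mention.

```python
import re

TWEET = ("@markmadsen Check out: From #MongoDB to #Cassandra: "
         "Why The Atlas Platform Is Migrating http://owl.li/cvxFK")

# Simplified patterns (assumptions, not the platform's real rules).
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def extract_entities(text):
    """Return the elements that could link the message to reference data."""
    return {
        "mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
        "urls": URL.findall(text),
    }


print(extract_entities(TWEET))
```

Each extracted value is a candidate key into reference data: a user, a topic, or a linked document.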
50. Third Nature Inc. Internal payload elements form a new graph. The
@ elements point to other records and create a deeply linked
structure. You have to assemble the linked structure to see what's
really there, which means repeated scanning of some or all of the
data. The derived pattern is interesting data, sometimes more than the
individual messages.
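Assembling that linked structure can be sketched as one full pass over the messages that builds a weighted, directed mention graph. The messages, user names, and the simplified mention pattern are all hypothetical.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")  # simplified pattern, an assumption

# Hypothetical (author, text) messages.
messages = [
    ("alice", "@bob did you see this?"),
    ("bob", "@alice yes, cc @carol"),
    ("carol", "thanks @bob"),
]


def mention_graph(msgs):
    """Map author -> {mentioned user -> number of mentions}."""
    graph = defaultdict(lambda: defaultdict(int))
    for author, text in msgs:            # one scan over all the data
        for target in MENTION.findall(text):
            graph[author][target] += 1
    return {author: dict(targets) for author, targets in graph.items()}


print(mention_graph(messages))
```

The graph is derived data: it exists in no single message, and it has to be rebuilt or updated as new messages arrive.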
51. Third Nature Inc. There are many patterns in the data.
Follower/following networks are easy: they are explicit and
independent of the events. Community detection requires looking at
patterns of @ communication in addition to follow relationships. What
do you do with these after discovery? Diagram labels: Follower
network, Conversational communities.
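As a rough illustration of deriving conversational groups from @ communication, the sketch below keeps only reciprocal mentions and takes connected components over them. This is a deliberately crude stand-in, not a real community-detection algorithm and not the method behind the slide's diagram; the names and data are hypothetical.

```python
from collections import defaultdict


def conversational_communities(mentions):
    """Group users linked by reciprocal mentions.

    mentions: iterable of (author, mentioned_user) pairs.
    Returns a list of sets, one per connected component.
    """
    directed = set(mentions)
    neighbors = defaultdict(set)
    for a, b in directed:
        if a != b and (b, a) in directed:    # keep two-way conversations only
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, communities = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                          # depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        communities.append(component)
    return communities


pairs = [
    ("alice", "bob"), ("bob", "alice"),
    ("bob", "carol"), ("carol", "bob"),
    ("dave", "erin"), ("erin", "dave"),
    ("frank", "alice"),                       # one-way mention, ignored
]
print(conversational_communities(pairs))
```

Unlike the follower network, these groups only exist after the event data has been scanned and the pattern derived.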
52. Third Nature Inc. More data: patterns emerge from lots of event
data. Patterns emerge from the underlying structure of the entire
dataset. The patterns are more interesting than sums and counts of the
events. Web paths: clicks in a session as network node traversal.
Email: traffic analysis producing a network. The event stream is a
source for analysis, generating another set of data that is the source
for different analysis.
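The web-path case can be sketched by turning ordered clicks into page-to-page transitions, which gives a second dataset (a weighted graph) derived from the first (the event stream). The click tuples and page names are hypothetical.

```python
from collections import Counter
from itertools import groupby

# Hypothetical (session_id, sequence_number, page) click events.
clicks = [
    ("s1", 1, "home"), ("s1", 2, "gifts"), ("s1", 3, "cart"),
    ("s2", 1, "home"), ("s2", 2, "gifts"), ("s2", 3, "home"),
]


def page_transitions(events):
    """Count page -> page edges traversed within each session."""
    edges = Counter()
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    for _, session in groupby(ordered, key=lambda e: e[0]):
        pages = [page for _, _, page in session]
        edges.update(zip(pages, pages[1:]))   # consecutive page pairs
    return edges


print(page_transitions(clicks))
```

The resulting edge counts can themselves feed further analysis, such as finding common paths, which is the layering the slide describes.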
53. Third Nature Inc. Big changes for data warehousing workloads. The
results of analytic processing can, and often do, feed back into the
system from which they originate. Much of the data is being read,
written and processed in real time. Our design point was not changing
tables and ephemeral patterns.