Mike Carey · Structure •Format of a dataset’s records and fields •Highly regular (or...
Announcements
• Remember to track the course wiki page:
  • https://grape.ics.uci.edu/wiki/asterix/wiki/stats170ab-2018
• And don’t forget about the Piazza page:
  • http://piazza.com/uci/winter2018/stats170a/home
• The first HW assignment is due NOW:
  • https://grape.ics.uci.edu/wiki/asterix/attachment/wiki/stats170ab-2018/HW1.pdf
• Today: Principles of Data Wrangling (from the O’Reilly book by Rattenbury et al.)
Michael Carey/Padhraic Smyth, UC Irvine: Stats 170A/B, Winter 2018 1
Data Stages in Data Wrangling
• Raw Data
  • Ingest data
  • Data discovery and metadata creation
• Refined Data
  • Create canonical data for widespread consumption
  • Conduct analyses, modeling, and forecasting
• Production Data
  • Create production-quality data
  • Build regular reporting and automated data products/services
Data Product Workflow Framework
(Note: In reality, there will be loop-backs and iteration…)
Ingesting Known & Unknown Data
• Relational enterprise data warehouse world
  • “Schema on write” (eager)
  • Transform incoming data into warehouse schema form
    • ETL (extract/transform/load)
  • Can be append-only or may also involve updates
• Today’s more flexible world
  • NoSQL databases (Mongo, Cassandra, AsterixDB, …) offer schema flexibility
  • Distributed file systems like HDFS or S3 allow data deposits to be files for later processing
  • “Schema on read” (lazy)
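The eager/lazy contrast above can be sketched in a few lines of Python. This is only an illustration, not anything from the slides: the `SCHEMA` mapping, the function names, and the sample records are all hypothetical, and a plain list stands in for the warehouse or file store.

```python
import json

# Hypothetical warehouse schema: field name -> type coercion.
SCHEMA = {"customer": str, "amount": float}

def ingest_schema_on_write(raw_lines):
    """Eager: parse and coerce every record to the fixed schema at load time."""
    table = []
    for line in raw_lines:
        rec = json.loads(line)
        # A missing or malformed field raises here, at write time.
        table.append({name: cast(rec[name]) for name, cast in SCHEMA.items()})
    return table

def ingest_schema_on_read(raw_lines):
    """Lazy: deposit the raw text untouched; nothing is validated yet."""
    return list(raw_lines)

def total_amount_on_read(stored_lines):
    """Structure is imposed only when a query needs it."""
    total = 0.0
    for line in stored_lines:
        rec = json.loads(line)
        total += float(rec.get("amount", 0))  # tolerate records lacking the field
    return total

raw = [
    '{"customer": "Fred", "amount": "25.97"}',
    '{"customer": "Sally", "amount": 10, "channel": "web"}',
]
eager = ingest_schema_on_write(raw)   # extra "channel" field is dropped
lazy = ingest_schema_on_read(raw)     # extra field survives for later use
```

The trade-off shown is that the eager path pays the parsing cost and rejects bad data once, while the lazy path keeps everything but repeats the interpretation work in every query.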
Creating Metadata
• Datasets are composed of records with fields
  • “Records often represent or correspond to people, objects, relationships, or events”
  • “The fields within a record represent or correspond to measurable aspects of the person, object, relationship, or event”
  • Q: Sound at all familiar (from last lecture)?
• Key dimensions to understand (and document)
  • Structure
  • Granularity
  • Accuracy
  • Temporality
  • Scope
Structure
• Format of a dataset’s records and fields
  • Highly regular (or “rectangular”) end of the spectrum
    • Table with fixed rows (records) and columns (fields)
  • Records with variant (or “jagged”) structure
    • XML or JSON formats are popular examples here
  • Heterogeneous collections of records
    • Mixes of information about multiple entities
• Data encoding
  • Field details (e.g., measurement units, time zone, …)
  • Low-level field value encoding
    • Plain text, binary, zip compressed, …
Structure Questions
• Do all records contain the same fields?
• How are fields accessed? (By position? By name?)
• How are records delimited/separated? Is parsing needed?
• How are record fields delimited? Is parsing needed?
• How are record fields encoded? Strings? Binary? Enumerated codes? Compressed?
• How complex is the encoding? (Primitives vs. hash maps or arrays)
• What are the semantics, and are checks needed?
• What are the “relationship types” between records and fields (atomic vs. nested sets/arrays)?
Granularity
• Kinds of entities that each data record represents
  • Fine granularity: e.g., a record represents a single sales transaction by a single customer at a particular store
  • Coarse granularity: e.g., a record represents the total sales in a store for an entire day
  • Subtleties: e.g., contacts vs. actually paying customers
• Granularity Questions:
  • What kind of thing do the records represent?
  • Do the records represent the same kinds of things?
  • What alternative interpretations of the records are there?
Accuracy
• Quality of the dataset
  • A wide variety of type-specific issues are possible
  • Process issues (in data production) are also possible
  • Inaccuracies of various kinds can arise
    • Misspellings (e.g., names or categorical attributes)
    • Lack of appropriate categories (e.g., ethnicity labels)
    • Missing field components (e.g., AM/PM)
  • Frequency outliers can indicate data problems
• Remember the phrase “garbage in, garbage out”…!
Accuracy Questions
• Type-specific issues
  • Time format(s), time zones, possible ambiguities, …?
  • Are address components complete and consistent?
  • Are digits/components of phone #s and UPC codes missing?
  • Are there misspellings or missing name fields?
  • Are e-mail domains valid?
  • Are currency amounts in the same currency and sensible?
• Process-related issues
  • Sensor drift
  • People’s (mis)spellings and abbreviations
• Inaccuracy distribution(s)
  • What is the measurable distribution of inaccuracies?
  • Are many records affected?
  • Are there concentrations of inaccuracies?
Temporality
• Records represent an entity at a point in time, so…
• Temporality Questions:
  • When was the dataset collected?
  • Were all records and their fields collected/measured at the same time?
  • Are the timestamps of the data known or available, either in the data or as associated metadata?
  • Have any of the records/fields been modified after their creation time? Are the modification timestamps available?
  • Can the “staleness” of the data be determined, if applicable? And if so, how? (Ex: purchases and returns)
Scope
• Two dimensions of scope of a dataset
  • Number of distinct attributes (breadth/detail)
  • Population coverage (intentional/unintentional, sample, …)
• Scope Questions:
  • What entity characteristics are captured? Not captured?
  • Are the record fields consistent? (E.g., age vs. DOB, items vs. total?)
  • Can you infer what you need from the data available?
  • Are the same fields available for all the records?
  • Do the records represent the entire population of things?
  • Are there multiple records per thing? (→ de-duplication)
  • Is the dataset heterogeneous, and if so, how?
Designing Refined Data
• Address structural issues
  • Tabularize, convert (e.g., categories → indicators), …
• Address granularity issues
  • May want to store multiple versions/levels of a dataset
• Address accuracy issues
  • May choose to
    • Remove records with inaccurate values (if detectable)
    • Retain them but mark them as being inaccurate
    • Imputation: replace inaccurate values (defaults/estimates)
  • In some cases time can help (e.g., multiple addresses)
• Address scope issues
  • Critical to understand population coverage, possible biases
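As one concrete instance of the structural conversion mentioned above (categories → indicators), here is a minimal sketch. The function name `to_indicators`, the `shipping` field, and the sample orders are made up for illustration.

```python
def to_indicators(records, field):
    """Replace a categorical field with one 0/1 indicator field per category."""
    categories = sorted({rec[field] for rec in records})
    out = []
    for rec in records:
        # Copy every field except the categorical one being converted.
        new_rec = {k: v for k, v in rec.items() if k != field}
        for cat in categories:
            new_rec[f"{field}_{cat}"] = int(rec[field] == cat)
        out.append(new_rec)
    return out

# Hypothetical sample data.
orders = [{"id": 1, "shipping": "ground"}, {"id": 2, "shipping": "air"}]
indicators = to_indicators(orders, "shipping")
```

Each output record has the same set of fields, so the result is "rectangular" even if the original category values varied.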
Refined Stage Analytical Functions
• Reporting analyses
  • Historical data → answer questions about past/present
  • Ex: Use of business intelligence (BI) tools and dashboards
  • Simple questions: How many customers bought Amazon Echos last week, or what were the top three most popular in-home “listener” devices?
  • Complex questions: What were the key factors driving the popularity of in-home “listeners” (e.g., Amazon Echo, Google Home) the last two years?
• Modeling and forecasting analyses
  • Historical trends → future trends
  • Ex: Prediction of customer retention (e.g., license renewal)
  • May want a prediction, or may want the model itself
  • Causal analyses will require carefully designed experimentation
Production Data and Automation
• Creating optimized data
  • Putting data in ideal form for downstream consumption
  • Constraints: available processing power and storage
• Designing regular and automated reports
  • Monitor data to ensure ongoing constraint satisfaction
  • Handle (acceptable) variations with generalized logic
  • Evolution over time w.r.t. schema or data availability
Data Product Workflow Summary
• Data wrangling = process involved in transforming or preparing data for analysis
• Occurs between the stages (to move to the next one)
Dynamics of Data Wrangling
• Accessing the data
  • Permissions, infrastructure, crawling, replicating, …
• Transforming the data
  • Manipulating structure, granularity, accuracy, temporality, and scope of data to align with the analysis goals
  • Iteration between transforming and profiling the data
• Publishing the datasets (or transformation logic) and profiling metadata about the datasets
Wrangling Dynamics (cont.)
• Additional aspects for wrangling real data
  • Subsetting
  • Sampling
• Subsetting
  • Heterogeneous datasets will require creation of homogeneous subsets for efficient/effective wrangling
  • Can merge again at end, if needed
• Sampling
  • Needed to deal with very large datasets, either due to human limitations or time limitations
  • Not simple: samples need to include extreme values, distributional trends, value varieties (e.g., currencies), …
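One way to see why sampling is "not simple" is to compare a plain random sample with one that deliberately covers value varieties and extremes. The sketch below is an assumption-laden illustration: the record layout (`id`, `currency`, `amount`), the function name, and the choice to take extremes per currency (since amounts in different currencies are not comparable) are all hypothetical.

```python
import random
from collections import defaultdict

def wrangling_sample(records, per_group=2, seed=0):
    """Sample a few records per currency, always keeping each group's min and max."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for rec in records:
        groups[rec["currency"]].append(rec)

    chosen = {}  # keyed by id so a record picked twice appears once
    for group in groups.values():
        picks = rng.sample(group, min(per_group, len(group)))
        picks.append(min(group, key=lambda r: r["amount"]))  # extreme low
        picks.append(max(group, key=lambda r: r["amount"]))  # extreme high
        for rec in picks:
            chosen[rec["id"]] = rec
    return list(chosen.values())

# Hypothetical data: 100 USD sales plus one rare record in each of two other currencies.
sales = [{"id": i, "currency": "USD", "amount": float(i)} for i in range(1, 101)]
sales += [
    {"id": 101, "currency": "EUR", "amount": 5.0},
    {"id": 102, "currency": "JPY", "amount": 90000.0},
]
sample = wrangling_sample(sales)
```

A uniform random sample of the same size would most likely miss the EUR and JPY records entirely, which is exactly the kind of value variety a wrangling sample needs to expose.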
Transformation Actions
• Structuring transformations
  • Reordering fields
  • Breaking down or combining fields (e.g., addresses)
  • Aggregating subsets of records
  • Pivoting (records → fields or vice versa)
• Enriching transformations
  • Joins to combine datasets (e.g., to attach information)
  • (Outer) Unions to combine records across datasets
  • Metadata insertion into the data (e.g., editing info)
  • New value computation (e.g., geo-coding, sentiment, …)
• Cleaning transformations
  • Manipulating individual field values (e.g., missing values)
Profiling
• Individual value profiling
  • Syntactic constraints
    • Ex: MM-DD-YYYY, (XXX)XXX-XXXX
  • Semantic constraints
    • Ex: no sales transactions on holidays
• Set-based profiling
  • Checking the shape/extent of the distribution of values for a given field
  • Ex: Expected distribution of sales across months
Syntactic Value Profiling
• Based on constraints on allowable field values, e.g.:
  • Boolean: {0, 1} (or {true, false}, {T, F}, …)
  • Gender: {male, female}
  • ATM transaction count: [0..50000] where 50000 is based on (bank age * 365 * withdrawals/day limit)
• Checking is like evaluating a CHECK CONSTRAINT in a SQL DBMS
• While exploring data, one may need to look at positive and negative examples to determine what the final constraint should really be
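Syntactic checks of this kind can be written as small predicates, one per field. The sketch below uses the formats and the 50000 bound named on the slides; the regular expressions, function names, and the decision to accept an optional space after the area code are assumptions made for illustration.

```python
import re

# MM-DD-YYYY: checks the shape only, so an impossible date like 02-31-2018 still passes.
DATE_RE = re.compile(r"(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])-\d{4}")
# (XXX)XXX-XXXX, with an optional space after the area code (an assumption).
PHONE_RE = re.compile(r"\(\d{3}\) ?\d{3}-\d{4}")
BOOLEAN_VALUES = {"0", "1", "true", "false", "T", "F"}
MAX_ATM_COUNT = 50000  # upper bound taken from the slide's example

def check_date(value):
    return DATE_RE.fullmatch(value) is not None

def check_phone(value):
    return PHONE_RE.fullmatch(value) is not None

def check_boolean(value):
    return value in BOOLEAN_VALUES

def check_atm_count(value):
    return isinstance(value, int) and 0 <= value <= MAX_ATM_COUNT

def violations(values, check):
    """Return the negative examples, which are often what refines the constraint."""
    return [v for v in values if not check(v)]

bad_dates = violations(["01-17-2018", "17-01-2018", "1-7-18"], check_date)
```

Collecting the failing values rather than just counting them supports the exploratory loop described above: looking at negative examples is how one decides whether the constraint or the data is wrong.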
Semantic Value Profiling
• Based on values’ meanings/interpretations, e.g.:
  • Age field may have -1 when age isn’t reported, and may want to derive a new Boolean field (reported_age) to analyze willingness of customers to disclose their age
  • An address field may have resolvable differences such as “San Jose, CA”, “Sn Jose, CA”, and “San Jose, CA, USA”. Other cases, e.g., is “Moscow” in Russia or Idaho, may not be easily resolved and are therefore semantically invalid
  • Some cases may involve conversions, e.g., from an age in years to a life stage (e.g., teen, adult, senior)
• Profiling such cases often involves deriving a new field that encodes the semantic interpretation of a source field (that can then be syntactically checked)
• How to convert a source field to its interpreted value?
  • Common case: deterministic rules
  • More difficult case(s): probabilistic mappings
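The deterministic-rule case can be sketched with the age examples above. The field name `reported_age` comes from the slide; the function name, the `child` stage, and the age cutoffs (13, 20, 65) are arbitrary assumptions chosen only to make the example concrete.

```python
def derive_age_fields(age):
    """Derive semantic fields from a raw age, where -1 means 'not reported'."""
    reported = age is not None and age >= 0
    if not reported:
        stage = None
    elif age < 13:      # cutoffs below are illustrative assumptions
        stage = "child"
    elif age < 20:
        stage = "teen"
    elif age < 65:
        stage = "adult"
    else:
        stage = "senior"
    return {"reported_age": reported, "life_stage": stage}

LIFE_STAGES = {"child", "teen", "adult", "senior", None}

derived = [derive_age_fields(a) for a in (-1, 16, 42, 70)]
# The derived field can now be checked syntactically against a fixed value set.
all_valid = all(d["life_stage"] in LIFE_STAGES for d in derived)
```

Once the interpretation lives in its own field, the set-membership check at the end is an ordinary syntactic constraint, which is the point made in the bullet about deriving a new field.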
Set-Based Profiling
• Focus is shape/extent of the distribution of field values across records, or the range of relationships between multiple record fields
• Numeric fields
  • Build histogram and compare to a known distribution
  • Also examine min, max, mean, sum to understand the distribution and spot problems, outliers, etc.
• Categorical fields
  • Count unique values and/or value clusters (GROUP BY)
• Other specific field types
  • Geospatial (zip code, lat/long): examine a plot on a map
  • Temporal: examine data on/in different scales/buckets (e.g., day of week, month of year, …)
• Can also do scatterplots of values from several fields
• Election example in book: “Candidate Master File” (2015-16)
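A minimal sketch of the numeric and categorical profiles described above, using only the standard library. The function names, the bucket width, and the sample values (including the deliberately odd value 40) are hypothetical.

```python
import statistics
from collections import Counter

def profile_numeric(values, bucket_width):
    """Summary statistics plus a fixed-width histogram keyed by bucket start."""
    histogram = Counter((v // bucket_width) * bucket_width for v in values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sum": sum(values),
        "histogram": dict(sorted(histogram.items())),
    }

def profile_categorical(values):
    """Count of each distinct value, like GROUP BY value with COUNT(*)."""
    return Counter(values)

# Hypothetical data: years of study, with one suspicious entry.
years = [2, 4, 5, 4, 40]
numeric_profile = profile_numeric(years, bucket_width=10)
school_counts = profile_categorical(["MIT", "MIT", "Central City U"])
```

Here the lone record in the 40 bucket and the mean being pulled far above the typical value are the kinds of signals that prompt a closer look at individual records.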
Transformation: Structuring
• Intrarecord structuring
  • Reordering record fields (moving columns)
  • Creating new record fields via value extraction
  • Creating new record fields by combining fields
• Interrecord structuring
  • Filtering documents by removing sets of records
  • Shifting granularity through aggregations and pivots
Extracting Values (Intra)
• Positional extraction
  • Date example: 17012018
  • Name example: WAYNE,BRUCE
• Pattern-based extraction
  • Money example: BRIBE ($999.00 MONTHLY)
• Complex structure extraction
  • JSON example:
{"id": "123", "Customer": {"name": "Fred", "city": "LA"}, "total": 25.97, "gift": true, "shipping": "UPS Ground", "Items": [{"sku": 401, "qty": 2, "price": 9.99}, …]}
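The three extraction styles can be sketched as follows. The function names are hypothetical, the date is assumed to be in DDMMYYYY order (the slide does not say), and the JSON literal is a shortened version of the slide's example with the elided items left out.

```python
import json
import re

def extract_date_parts(value):
    """Positional: slice fixed character positions (assumes DDMMYYYY)."""
    return {"day": value[0:2], "month": value[2:4], "year": value[4:8]}

def extract_name_parts(value):
    """Delimiter-based split of 'LAST,FIRST'."""
    last, first = value.split(",", 1)
    return {"first": first.strip(), "last": last.strip()}

def extract_amount(value):
    """Pattern-based: find a dollar amount anywhere in the text."""
    match = re.search(r"\$(\d+(?:\.\d+)?)", value)
    return float(match.group(1)) if match else None

# Complex structure: parse, then navigate nested objects and arrays.
order = json.loads(
    '{"id": "123", "Customer": {"name": "Fred", "city": "LA"},'
    ' "Items": [{"sku": 401, "qty": 2, "price": 9.99}]}'
)
city = order["Customer"]["city"]
skus = [item["sku"] for item in order["Items"]]

date_parts = extract_date_parts("17012018")
name_parts = extract_name_parts("WAYNE,BRUCE")
amount = extract_amount("BRIBE ($999.00 MONTHLY)")
```

Positional slicing is the most brittle of the three, since any change in field width silently shifts the values; the pattern and structure-based forms fail more visibly.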
Combining Fields (Intra)
• Name example (in reverse):

FirstName   MiddleName   LastName
Bruce                    Wayne
Anthony     Edward       Stark

Name
Wayne, Bruce
Stark, Anthony Edward
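The name-combining example can be reproduced with a short function. This is only a sketch; the function name and the convention of passing an empty string for a missing middle name are assumptions.

```python
def combine_name(first, middle, last):
    """Build 'Last, First Middle', skipping the middle name when it is empty."""
    given = " ".join(part for part in (first, middle) if part)
    return f"{last}, {given}"

names = [
    combine_name("Bruce", "", "Wayne"),
    combine_name("Anthony", "Edward", "Stark"),
]
```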
Filtering Records and Fields (Inter)
• Removing records or fields from a dataset
• Example (both record-based and field-based):

Name                       SuperHero   AlmaMater
Wayne, Bruce               Batman      Gotham City College
Stark, Anthony Edward      Iron Man    MIT
Smoak, Felicity                        MIT
Allen, Bartholomew Henry   Flash       Central City U

Name                       AlmaMater
Stark, Anthony Edward      MIT
Smoak, Felicity            MIT
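A sketch of the combined record and field filter follows. The data is copied from the slide's tables; the function name is hypothetical, and the filter condition (AlmaMater equal to MIT) is inferred from which rows survive in the result, since the slide does not state it.

```python
heroes = [
    {"Name": "Wayne, Bruce", "SuperHero": "Batman", "AlmaMater": "Gotham City College"},
    {"Name": "Stark, Anthony Edward", "SuperHero": "Iron Man", "AlmaMater": "MIT"},
    {"Name": "Smoak, Felicity", "SuperHero": None, "AlmaMater": "MIT"},
    {"Name": "Allen, Bartholomew Henry", "SuperHero": "Flash", "AlmaMater": "Central City U"},
]

def filter_records_and_fields(records, keep_record, keep_fields):
    """Keep only records passing the predicate, and only the named fields."""
    return [
        {field: rec[field] for field in keep_fields}
        for rec in records
        if keep_record(rec)
    ]

mit_grads = filter_records_and_fields(
    heroes,
    keep_record=lambda rec: rec["AlmaMater"] == "MIT",  # inferred condition
    keep_fields=["Name", "AlmaMater"],
)
```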
Aggregations (Inter)
• Shifting the granularity of a dataset
• Simple aggregation example:

Name                       SuperHero   AlmaMater             NumYears
Wayne, Bruce               Batman      Gotham City College   2
Stark, Anthony Edward      Iron Man    MIT                   4
Smoak, Felicity                        MIT                   5
Allen, Bartholomew Henry   Flash       Central City U        4

AlmaMater             NumGrads   AvgYears
Central City U        1          4.0
Gotham City College   1          2.0
MIT                   2          4.5
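The aggregation in this example is a group-by on AlmaMater with a count and an average. A minimal sketch, using the slide's data and a hypothetical function name:

```python
from collections import defaultdict

grads = [
    {"Name": "Wayne, Bruce", "AlmaMater": "Gotham City College", "NumYears": 2},
    {"Name": "Stark, Anthony Edward", "AlmaMater": "MIT", "NumYears": 4},
    {"Name": "Smoak, Felicity", "AlmaMater": "MIT", "NumYears": 5},
    {"Name": "Allen, Bartholomew Henry", "AlmaMater": "Central City U", "NumYears": 4},
]

def aggregate_by_alma_mater(records):
    """One output record per school: number of grads and mean years."""
    years_by_school = defaultdict(list)
    for rec in records:
        years_by_school[rec["AlmaMater"]].append(rec["NumYears"])
    return [
        {"AlmaMater": school, "NumGrads": len(years), "AvgYears": sum(years) / len(years)}
        for school, years in sorted(years_by_school.items())
    ]

summary = aggregate_by_alma_mater(grads)
```

Note the shift in granularity: the input has one record per person, the output one record per school, and the person-level fields (Name, SuperHero) cannot be recovered from it.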
Column → Row Pivots (Inter)
• Shifting the granularity of a dataset
• Simple “unpivoting” example:

Customer      CellPhone       HomePhone       OfficePhone
John Smith    (212)123-4567   (212)111-2233
Sally Forth                   (949)124-8163   (949) 987-6543

Customer      PhoneLoc   PhoneNum
John Smith    Cell       (212)123-4567
John Smith    Home       (212)111-2233
Sally Forth   Home       (949)124-8163
Sally Forth   Office     (949) 987-6543
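Unpivoting turns each non-empty phone column into its own row. The sketch below uses the slide's data; the function name and the rule for deriving PhoneLoc by stripping the "Phone" suffix from the column name are assumptions.

```python
customers = [
    {"Customer": "John Smith", "CellPhone": "(212)123-4567",
     "HomePhone": "(212)111-2233", "OfficePhone": None},
    {"Customer": "Sally Forth", "CellPhone": None,
     "HomePhone": "(949)124-8163", "OfficePhone": "(949) 987-6543"},
]

def unpivot_phones(records, phone_fields=("CellPhone", "HomePhone", "OfficePhone")):
    """Emit one (Customer, PhoneLoc, PhoneNum) row per non-empty phone field."""
    rows = []
    for rec in records:
        for field in phone_fields:
            number = rec.get(field)
            if number:  # skip missing phones instead of emitting empty rows
                rows.append({
                    "Customer": rec["Customer"],
                    "PhoneLoc": field.replace("Phone", ""),
                    "PhoneNum": number,
                })
    return rows

phone_rows = unpivot_phones(customers)
```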
Row → Column Pivots (Inter)
• Again, shifting the granularity of a dataset
• A “pivoting” example:

Donor         Charity      Gift
John Smith    Red Cross    100.00
John Smith    KJAZ         500.00
John Smith    Red Cross    200.00
Sally Forth   United Way   7500.00

Donor         RedCrossSum   KJAZSum   UnitedWaySum
John Smith    300.00        500.00    0.00
Sally Forth   0.00          0.00      7500.00
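Pivoting goes the other way: each distinct charity becomes a column holding the sum of that donor's gifts. A minimal sketch with the slide's data; the function names and the column-naming rule (charity name without spaces plus "Sum") are assumptions inferred from the result table.

```python
gifts = [
    {"Donor": "John Smith", "Charity": "Red Cross", "Gift": 100.00},
    {"Donor": "John Smith", "Charity": "KJAZ", "Gift": 500.00},
    {"Donor": "John Smith", "Charity": "Red Cross", "Gift": 200.00},
    {"Donor": "Sally Forth", "Charity": "United Way", "Gift": 7500.00},
]

def column_name(charity):
    """'Red Cross' -> 'RedCrossSum' (assumed naming rule)."""
    return charity.replace(" ", "") + "Sum"

def pivot_gifts(records):
    """One row per donor, one summed column per charity (0.0 when no gifts)."""
    columns = sorted({column_name(rec["Charity"]) for rec in records})
    rows = {}
    for rec in records:
        donor = rec["Donor"]
        if donor not in rows:
            rows[donor] = {"Donor": donor, **{col: 0.0 for col in columns}}
        rows[donor][column_name(rec["Charity"])] += rec["Gift"]
    return list(rows.values())

donor_rows = pivot_gifts(gifts)
```

Unlike the unpivot, this direction needs an aggregate (here a sum) because several input rows can land in the same output cell.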
Enrichment: Transformations
• Union
  • Ex: YearSales = Q1Sales ∪ Q2Sales ∪ Q3Sales ∪ Q4Sales
  • Simple case when Qi’s are union-compatible (a la SQL)
  • May need “outer union” if slightly different field sets
    • Sales1(region, amount, listpricesale)
    • Sales2(region, amount, channel)
    ☛ AllSales(region, amount, listpricesale, channel)
• Join
  • Ex: CustomerPhone INNER JOIN Donation ON CustomerPhone.Customer = Donation.Donor (Then we can start making those annoying calls :-)
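Both enrichment operations can be sketched over lists of dictionaries. The field names follow the Sales1/Sales2 and CustomerPhone/Donation examples above; the function names and all sample values are made up, and missing fields in the outer union are filled with None as a stand-in for SQL NULL.

```python
def outer_union(*datasets):
    """Union datasets with differing field sets, padding absent fields with None."""
    fields = []
    for dataset in datasets:
        for rec in dataset:
            for name in rec:
                if name not in fields:
                    fields.append(name)  # preserve first-seen field order
    return [
        {name: rec.get(name) for name in fields}
        for dataset in datasets
        for rec in dataset
    ]

def inner_join(left, right, left_key, right_key):
    """Nested-loop inner join; fine for small data, not for large datasets."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[left_key] == r[right_key]
    ]

# Hypothetical sample rows.
sales1 = [{"region": "West", "amount": 10.0, "listpricesale": True}]
sales2 = [{"region": "East", "amount": 7.5, "channel": "web"}]
all_sales = outer_union(sales1, sales2)

customer_phone = [{"Customer": "John Smith", "PhoneNum": "(212)123-4567"}]
donation = [
    {"Donor": "John Smith", "Gift": 100.0},
    {"Donor": "Sally Forth", "Gift": 7500.0},
]
call_list = inner_join(customer_phone, donation, "Customer", "Donor")
```

Because the join is an inner join, Sally Forth's donation drops out: there is no matching phone record for her in this sample.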
Enrichment: Metadata
• Examples:
  • File names of source data (basic provenance)
  • Byte offsets and/or record numbers (location)
  • Current date and/or time
  • Creation/update/access timestamps
  • Record and/or record field lineage (provenance)
• Q: Why might one want to do this…?
  • Go back to the source(s) in the event of errors
  • Credibility/authority of data underlying a given data product or analysis
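A sketch of inserting this kind of metadata while reading records. The function name, the underscore-prefixed field names, and the sample file name are all hypothetical; lines are passed in as strings so the example does not depend on a real file.

```python
from datetime import datetime, timezone

def add_provenance(lines, source_name):
    """Wrap each raw line with its source, record number, byte offset, and load time."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    records = []
    offset = 0
    for number, line in enumerate(lines, start=1):
        records.append({
            "value": line.rstrip("\n"),
            "_source": source_name,       # basic provenance
            "_record_number": number,     # location within the source
            "_byte_offset": offset,       # where the record starts in the source
            "_loaded_at": loaded_at,      # current date/time at ingest
        })
        offset += len(line.encode("utf-8"))
    return records

tagged = add_provenance(["a,1\n", "b,2\n"], source_name="sales.csv")
```

With the source name and offset attached, a suspicious value found late in an analysis can be traced back to the exact spot in the original file.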
Enrichment: Value Derivation
• Generic derivations
  • Common examples
    • Derive day of week or season from date
    • Convert address into zip code, lat/long, or region
    • Analyze text for sentiment or for entity references (people, places, things)
    • Compute sale as list price times discount plus tax
  • May involve domain-specific aspects
    • Region definitions for government vs. businesses
    • Special terminology or entity types
  • May be driven by law (e.g., field redaction)
• Proprietary derivations
  • Individual organizations’ customized models
• Commonly used DB/Big Data mechanism: UDFs
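The first generic derivation (day of week or season from date) can be sketched as a small function of the sort that would be registered as a UDF. The function name is hypothetical, and the season mapping assumes Northern Hemisphere meteorological seasons, which is itself a domain-specific choice.

```python
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Assumption: Northern Hemisphere meteorological seasons, keyed by month number.
SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

def derive_date_fields(d):
    """Derive day-of-week and season fields from a date value."""
    return {"day_of_week": DAYS[d.weekday()], "season": SEASONS[d.month]}

derived_fields = derive_date_fields(date(2018, 1, 17))
```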
Data Cleaning Transformations
• Missing or NULL values
  • Can filter out records with such fields
  • Can replace such values (imputation)
    • Average or median value
    • Generate values from similar records
    • Use last valid value (or interpolate) in sequence data
• Invalid values
  • Some common symptoms
    • Inconsistent with other fields (e.g., age vs. DOB)
    • Ambiguous (e.g., two-digit years, abbreviations, …)
  • Some potential cures
    • Calculate the correct or consistent value
    • Mark the value as invalid and analyze the data with/without
    • Data standardization (based on fixed library of valid values), using edit distance or domain knowledge as a tool
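Three of the cures above can be sketched with the standard library. The function names, the 0.8 cutoff, and the sample values are assumptions. Note also that `difflib` scores a similarity ratio rather than a true edit distance, so it stands in here for the edit-distance idea rather than implementing it exactly.

```python
import difflib
import statistics

def impute_median(values):
    """Replace None with the median of the known values."""
    fill = statistics.median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def fill_forward(values):
    """Sequence data: carry the last valid value forward over gaps."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def standardize(value, valid_values, cutoff=0.8):
    """Map a value to its closest entry in a fixed library, or None if nothing is close."""
    matches = difflib.get_close_matches(value, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None

VALID_CITIES = ["San Jose, CA", "Moscow, ID"]  # hypothetical library of valid values

ages = impute_median([30, None, 50, 40])
readings = fill_forward([1, None, None, 4])
city = standardize("Sn Jose, CA", VALID_CITIES)
unknown = standardize("Paris", VALID_CITIES)
```

Returning None for values with no close match keeps the "mark as invalid" option open, so the analysis can still be run with and without those records.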
Data Engineer’s Responsibilities
Data Architect’s Responsibilities
Data Scientist’s Responsibilities
Action List Revisited
1. Ingesting data
2. Describing data
3. Assessing data utility
4. Designing and building refined data
5. Ad hoc reporting
6. Exploratory modeling and forecasting
7. Designing and building optimized data
8. Regular reporting
9. Building products and services
Organizational Best Practices
• Provide wide access to data
• Implement mechanisms to track data usage
• Use a common data manipulation language that spans business units and user roles (e.g., Excel, SQL, Python, …)
• Maintain a system that allows you to easily transition from development to production
• Consider a rotation program across roles to enable a cleaner hand-off and increase cross-functional trust
Data Wrangling Tools

Tool       Data Scale   Usual Platform   Data Structures   Transformation Paradigm
Excel      MB to GB     Desktop          Grid              UI: Script and wizards
                                                           Scope: Single values (formulas)
SQL        GB to TB     Server           Tables            UI: “Script” only (SQL)
                                                           Scope: Programmatic (scripts over multiple records)
Trifacta   Unlimited    Cluster          Various           UI: Script, “builder”, machine-guided
                                                           Scope: Programmatic (scripts over multiple records)

Note: Other potential options include NoSQL databases, Hadoop/Spark based platforms (e.g., Hive, SparkSQL), …