Modelling Data On The Web - University of...
COMP60411 Modelling Data On The Web
Uli Sattler & Bijan Parsia
Week 1: Introduction, Data Models, Tables, and SQL
Topic Overview
What is a fundamental data model?
Some key data models
  Flat: flat files
  Table based: relational
  Tree based: XML and a bit of JSON
  Graph based: RDF
Tradeoffs (esp. representational) between them
Looking for the pain points and sweet spots
Course Goals: Knowledge & Understanding
This course unit aims to give you
  a good understanding of core concepts of data modelling
  some familiarity with formalisms, APIs, and languages for modelling data on the web
  design/representation issues that arise
Course Goals: Skills
This course unit aims to give you the ability/skill to
  compare different data modelling formalisms,
  design or analyse a data management system,
    does it make good use of the formalism's features?
    does it fit its purpose?
Course Structure
Lectures
  Active learning
Lab
  Make sure you understand the coursework!
Readings
  All readings available online
  Core: the "Learning" eBook series
    Learning SQL
    Learning XML
    Learning SPARQL
Assessment
Coursework (50%, ≈200 marks)
  Each week, a mixture
  1. MCQ quizzes (≈10 marks)
  2. Short essays (≈5 marks)
  3. A modelling assignment (≈10 marks)
  4. A programming assignment (≈15 marks)
  Precise mark breakdown varies
Exam (50%)
  Taken online
  Very like 1 & 2
Materials & Blackboard
All course materials are available online on the materials page
We use Blackboard for
  Coursework
  Online forums
    Use these!
  Exam
Variant Circumstances
Disability (Equality Act):
  any condition which has a significant, adverse and long-term effect on a person’s ability to carry out normal day-to-day activities.
  Exam & Study support & more
  Great, helpful people and process
Disability Advisory and Support Service
Counselling service
SSO Mitigating Circumstances
... feel free to ask us: we're happy to advise!
Assistance & Help
Early intervention is more effective
  If you are having challenges of any sort, the sooner they are identified and communicated to us, the more likely we can find a good resolution
This is very true for mitigating circumstances
  If something is interfering, document it!
  Fill out the form when things are happening
  There is a "too late" here!
... when in doubt, ask us and SSO for MitCircs
Expected Conduct
We expect of you (and ourselves) to
  be fair minded
  treat each other well & with respect
  avoid academic malpractice
  take responsibility for course duties
  be engaged, curious, and active
If you have a problem or issue
  please raise it with us
  if that doesn't help, contact your programme director
Preliminaries
We all have to start somewhere
Data Management (1)
Almost every program must do some data management
  If only config files!
Many are information heavy
  And must deal with that information over time
Database Management Systems (DBMSs)
  Separate (or separable) component
  Specialised for various purposes
    Secondary storage, scaling, complexity, etc.
Data Management: Lifetime
Some data is (typically) transient or ephemeral
  Position of the cursor on the screen
Some data is (typically) persistent
  Bank records, addresses, health data, library entries
  Cursor position can be!
    (If you are recording the screen...)
We're focused on data that leans toward persistent
Data Management: Structure
Some data is (more or less) informationally opaque
  E.g., images, video, text, audio
  The information content isn't (immediately) available
    You typically must do some extraction
  Such data is called unstructured data
Some data is informationally transparent
  The information content is programmatically explicit
  Such data is called structured data
  We will later distinguish
    Structured
    Semi-structured
Out Of Scope
There is lots of DM that's outside our scope
1. Performance & Scaling: see COMP62421
2. Concurrency
   Thus transactions
   (You should read up on ACIDity)
3. Tuning, indeed most physical level stuff
4. Cleansing
5. Integration
   Except for a tiny bit, around merging
These considerations do affect modelling!
Data And The Web
The Web is a collaborative information structure
  Largely decentralised
  Immense
  Growing rapidly
  Changing rapidly
The Web produces new data challenges
  Scale of data
  Kind of data
  Shape of data
  Use of data
Data On, From, Behind The Web
On the Web
  data.gov, data.gov.uk, ...
From the Web
  Log files
Behind the Web
  Data(base) backed Web sites
    The file system is a kind of database
  Content Management Systems
    Wordpress
  Sites as Database Front Ends
    See Amazon
What Is A Data Model?
Three Key Aspects
1. Underlying Data Structure, "Core DM"
2. Data Integrity
3. Data Manipulation
4. (Plus a fourth!) Data Sharing
   More important on the Web*
"Data Model" Is Ambiguous
Data model is used to refer to...
1. a complete data representation and manipulation approach (we do this!)
2. just the core data model
3. a particular data representation for a domain or application, also called the domain model
   "Does your calendar data model include leap years?"
Generally, you can tell from context; (2) is rare.
Kinds Of Data
Data can lend itself to different shapes
  Array-like
  Tree-like
  Graph-like
  Document-like
Data can have different volumes
  Small to "big" data
Data can have different velocities
  Static/offline to streaming
Data can have different use patterns
  Many readers/few writers or the reverse or other!
Polyglot Persistence
... we are gearing up for a shift to polyglot persistence - where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.
-- Martin Fowler
Polyglot Persistence (2)
This polyglot [e]ffect will be apparent even within a single application. A complex enterprise application uses different kinds of data, and already usually integrates information from different sources. Increasingly we'll see such applications manage their own data using different technologies depending on how the data is used.
-- Martin Fowler
Poly-Glot/-System Persistence
Even with a single core data model
  Multiple systems with different characteristics
  Multiple, overlapping, domain models
  Multiple, overlapping owners, versions, variants
This is particularly true on the Web!
"Flat Files" -- A Simple Model
A Sample Domain
We start with a classic example: The Address Book
  People and information about them
  Names and contact information
We can do a first cut as a diagram
For Example
Bijan!
  Name: Bijan Parsia
  Company: University of Manchester
  Email: [email protected]
  ...
Uli!
  Name: Uli Sattler
  Company: University of Manchester
  Email: [email protected]
Storing!
Slides are not a good storage place for data
We have an array-like structure so...
  How about a spreadsheet!
  1 entity/record/person per row
  Each field/attribute is a column
We have software that works well with this!
Interacting With The Data
To the demo!
Pain Points
Around "name"
  Sorting
    Sorting is on columns
    Can't sort by last name
  Filtering
    Can filter by names beginning with Z
    Cannot filter by surnames beginning with Z
Around "address"
  Can't sort or filter by postcode
  Can't sort or filter by city
  Can't sort or filter by county
These are problems with our model
Fixing The Domain Model
Interacting!
Demo encore!
New Pain Points
Variable numbers of the "same" attribute
  Phone number
  Email address
  Web page
Inserting columns is painful
  Lots of partial columns
  Sheer number sucks
Companies have addresses!
  More than one!
  And phone numbers, etc.
More problems with our model
Bad Model Bad
Fixing The Model 2
We want adding a (similar) column to be easy!
  Easy as adding a row!
Make a new table just for phone numbers
Index numbers with person rows
Fixing The Model Again
Pain Points
Sorting destroys the relationship
  We used row numbers to connect
  Sorting changes the row number!
Hard to see the record
No longer a simple flat file
  CSV format makes assumptions
These are (mostly) implementation problems!
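To make the "sorting destroys the relationship" point concrete, here is a minimal Python sketch. The two lists are hypothetical stand-ins for the people sheet and the phone sheet, the phone numbers are made up, and list positions play the role of spreadsheet row numbers.

```python
# Hypothetical data: each phone refers to a person by row position (0-based here).
people = ["Uli Sattler", "Bijan Parsia"]
phones = [(0, "0000-000001"), (1, "0000-000002")]  # (row of person, made-up number)

def owner(people_rows, phone):
    """Look up the person a phone row points at, purely by position."""
    row, _number = phone
    return people_rows[row]

# Before sorting, row 0 is the person we expect.
before = owner(people, phones[0])

# Sorting the people sheet changes which row each person sits on ...
people_sorted = sorted(people)

# ... but the phone sheet still says "row 0", which is now someone else.
after = owner(people_sorted, phones[0])

print(before, "->", after)
```

Nothing in the phone list changed, yet the link now points at a different person, which is why a row position is a poor substitute for an identifier stored in the data.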
When A Domain Model Fails
Failure must be analysed!
  Did we
    get the domain wrong?
    fit it wrong into our core DM?
    pick the wrong core DM to model it in?
  Is it
    unworkable?
    workable but requires a lot of application code?
    reasonable with some workarounds?
How much technical debt are we piling up?
What's the cost of switching?
Broken Core Data Model
If you
  are always "fighting" the system
  use lots of application code to hack things
  live in an error rich environment
  have increasing amounts of workaround support in your data
Your data model might not be a good fit for your domain and application!
The Rest Of The DBMS
Even if your core data model isn't a good fit
You might
  be stuck with the system
    You paid good money for that Oracle database!
  need features of the implementation
    is there an XML database with transactions?
    what's the support contract?
  be stuck with the model
    critical legacy apps
Just because the model is broken doesn't mean that the system is
  Or is broken enough to justify a switch
Flat File Programming
Sharing Our Databases
Spreadsheets?
  Proprietary-ish (Excel, Google Doc, OpenOffice)
Lingua franca: CSV
  Comma (or Tab) Delimited Values
  Exactly the (pure) flat file model
  Format:
    Text file
    1 record per line
    First line can be special (column names)
    Each column separated by a ","
    We may need to quote cells (with commas)
CSV Example
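As a small illustration of the format rules above (header line, comma separators, quoted cells), here is a sketch using Python's standard csv module. The file content is hypothetical and held in a string rather than read from disk.

```python
import csv
import io

# Hypothetical two-column file. Both cells in the data line contain commas,
# so they are wrapped in double quotes.
text = 'name,address\n"Parsia, Bijan","1 Example Road, Manchester"\n'

# DictReader treats the first line as the column names.
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["name"])
print(rows[0]["address"])

# Writing goes the other way: cells containing commas are quoted automatically.
out = io.StringIO()
csv.writer(out).writerow(["Sattler, Uli", "Manchester"])
print(out.getvalue())
```

Without the quotes, the data line would be split into five cells instead of two.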
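Programmatic Manipulation
If we store our databases as CSV
  We can load and parse them into structures
  Manipulate our data from programs
    Our programs, instead of Excel
E.g., using the Apache Commons CSV

    import java.io.FileReader;
    import java.io.Reader;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVRecord;

    Reader in = new FileReader("path/to/file.csv");
    // Treat the first line as column names so that fields can be read by name
    Iterable<CSVRecord> records = CSVFormat.EXCEL.withFirstRecordAsHeader().parse(in);
    for (CSVRecord record : records) {
        String surname = record.get("surname");
        String firstName = record.get("first_name");
        ...
    }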
Solving Problems
This solves some problems!
  Inserting/removing columns is a "small matter of programming"
    Or we could use multiple arrays with pointers
  We can split/combine fields at will
    Well, with a bit of programming
  We can control sorting well enough
    Use pointers to connect
Lots of work!
Against Bespoke Programming
This is all at the wrong level
  Flat files and flat file++ are ubiquitous
  We shouldn't be coding complex functions
    Over and over again!
Even if we can program our way around problems
  Doesn't eliminate the problems
  Some solutions (pointers) effectively change the model!
Tables
Table (or relation) is the core data structure
  A table is a set of tuples
  A tuple is
    an n-ary sequence
    a set of key-value pairs
Flat file had one table
  We allow many!
  Named tables
  Aka relations
Relations!
(We use table and relation interchangeably)
Relations are like First Order Logic (FOL) predicates
  Relation name == Predicate name
  Number of columns == Arity of predicate
    Person(bijan, u_o_manchester, ...)
  Predicate is true (or false!) of its arguments
    Relation is "true" of tuples which occur in it
  Predicates can have
    definitions (intensional!)
    facts (extensional!)
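The relation-as-predicate reading can be sketched in a few lines of Python. This is only an illustration: the tuples follow the Person(bijan, u_o_manchester, ...) pattern, are cut down to two columns, and the helper name holds is hypothetical.

```python
# A relation as a set of tuples (extensional: just the listed facts).
person = {
    ("bijan", "u_o_manchester"),
    ("uli", "u_o_manchester"),
}

def holds(relation, *args):
    """The predicate is true of exactly the tuples that occur in the relation."""
    return tuple(args) in relation

# Number of columns == arity of the predicate.
arity = len(next(iter(person)))

print(holds(person, "bijan", "u_o_manchester"))
print(holds(person, "bijan", "somewhere_else"))
print(arity)
```

An intensional definition would instead be a rule that computes membership, which in SQL terms is closer to a query or view than to a stored table.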
Order And Identity
Records/Rows/Entities need identity
  In Excel, we had the row label
  The order or position of a record was significant
In our model, we need distinguishing attributes
  We push identity into the data: a key
  Either a naturally unique set of attributes
    i.e., a definite description
  or a made up one: an ID
Order is always a property of
  the data values
  implementation
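A minimal sketch of "identity pushed into the data", using Python's built-in sqlite3 module with an in-memory database. The cut-down People table and the second and third rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE People (person_id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT)"
)
conn.execute("INSERT INTO People VALUES (1, 'Aleshia', 'Tomkiewicz')")

# A second row with the same key is rejected: the key carries the identity.
try:
    conn.execute("INSERT INTO People VALUES (1, 'Someone', 'Else')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Order is not part of the table; we derive it from data values when needed.
conn.execute("INSERT INTO People VALUES (2, 'Aaron', 'Example')")  # hypothetical row
by_name = [r[0] for r in conn.execute("SELECT first_name FROM People ORDER BY first_name")]

print(duplicate_rejected, by_name)
```

Sorting by first name does not disturb person_id, so anything that refers to a person by key keeps pointing at the same person.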
Multiple Tables
Actions on multiple tables:
  Splitting
    at design time: try to normalize your DB
    at runtime: dropping bits
  Combining
    Take two tables and produce a new table
The key to relational domain modelling
  Decompose your problem into "base" tables
  Derive new tables for specific needs
A Relational Formalism
What Is A Formalism?
A formal system (or formalism):
  syntax: what can we write?
  semantics: what does our writing mean?
  with precise (mathematical) definitions
  designed to capture a coherent set of operations
  ("syntax" is loose, e.g., we might just have a collection of operators)
Key Goals Of A Formalism
1. to be clear about what we mean
   In our spreadsheet is "1" a number, a string, either, both, something else?
2. to allow the determination of key properties
   e.g., complexity of query answering
3. to abstract away from particular implementations
   e.g., allow us to determine when wildly different implementations are correct
   thus can interoperate
Formalism Vs. Language
Formalisms are often abstract
  This can be an advantage!
  Can be hard to use if only abstract
  Concrete instances typically involve compromise
We focus on concrete languages
  Formalisms are the theory
  Languages are the practice
"Well, it may be all right in practice, but it will never work in theory."
"In theory, there is no difference between theory and practice. But, in practice, there is."
Other Quotes On Theory vs Practice
SQL: A Language For Tables
Schema
    CREATE TABLE table_name
Update
    INSERT INTO table_name
    DELETE FROM table_name
    UPDATE table_name ...
Query
    SELECT ... FROM table_name
SQL operations (largely) are closed over tables
An Infelicity
There is a lot of lingo with slightly different meanings.
Concepts get divided up in slightly different ways.

Our talk                          Common                  Learning SQL p.10
Core Data Model, Data Integrity   Data Definition         SQL schema statements ("CREATE")
Data Manipulation                 Query/Update Language   SQL data statements
A Sample SQL Program

    CREATE TABLE People (
        name varchar(255),
        company varchar(255),
        address varchar(255),
        phone varchar(255),
        email varchar(255),
        home_page varchar(255)
    );

    INSERT INTO People VALUES (
        'Aleshia Tomkiewicz',
        'Alan D Rosenburg Cpa Pc',
        '14 Taylor St, St. Stephens Ward, Kent CT2 7PP',
        '01835-703597',
        '[email protected]',
        'http://www.alandrosenburgcpapc.co.uk'
    );

    SELECT name FROM People;

You must Define before Update before Query
I.e., CREATE before INSERT before SELECT
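The "define before update before query" ordering can be checked with a short Python sketch using the built-in sqlite3 module. SQLite and the two-column version of the People table are assumptions made to keep the example small; other engines report the failure with a different error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Querying before the table has been defined fails.
try:
    conn.execute("SELECT name FROM People")
    early_query_ok = True
except sqlite3.OperationalError:
    early_query_ok = False

# Define, then update, then query.
conn.execute("CREATE TABLE People (name varchar(255), company varchar(255))")
conn.execute("INSERT INTO People VALUES ('Aleshia Tomkiewicz', 'Alan D Rosenburg Cpa Pc')")
names = [row[0] for row in conn.execute("SELECT name FROM People")]

print(early_query_ok, names)
```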
Modelling With SQL
SQL lets us express models at the logical to (some of the) physical level
  Specifying indices is a bit physical
  Knowledge about implementation may inform modelling choices
SQL has no mechanisms for conceptual level
Domain Model 1 In SQL

    CREATE TABLE People (
        name varchar(255),
        company varchar(255),
        address varchar(255),
        phone varchar(255),
        email varchar(255),
        home_page varchar(255)
    );

    INSERT INTO People VALUES (
        'Aleshia Tomkiewicz',
        'Alan D Rosenburg Cpa Pc',
        '14 Taylor St, St. Stephens Ward, Kent CT2 7PP',
        '01835-703597',
        '[email protected]',
        'http://www.alandrosenburgcpapc.co.uk'
    );
    ...

Can we do all that we did in the spreadsheet?
SQL Manipulation Of DM1
Count records in your People table:
    SELECT COUNT(*) FROM People;
Search for items:
    SELECT * FROM People WHERE name LIKE 'Aleshia%';
    SELECT * FROM People WHERE name LIKE '%Tomkiewicz';
Sort the table!
    SELECT * FROM People ORDER BY name ASC;
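These queries can be tried against an in-memory SQLite database from Python. This is a sketch under assumptions: the table is cut down to two columns and the second row is hypothetical, added only so that counting and sorting have something to do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (name varchar(255), company varchar(255))")
conn.executemany(
    "INSERT INTO People VALUES (?, ?)",
    [
        ("Aleshia Tomkiewicz", "Alan D Rosenburg Cpa Pc"),  # from the slides
        ("Zed Example", "Example Ltd"),                     # hypothetical
    ],
)

count = conn.execute("SELECT COUNT(*) FROM People").fetchone()[0]

# With a single "name" column, searching by surname needs a pattern match.
by_first = conn.execute("SELECT name FROM People WHERE name LIKE 'Aleshia%'").fetchall()
by_last = conn.execute("SELECT name FROM People WHERE name LIKE '%Tomkiewicz'").fetchall()

# ORDER BY name sorts on the whole string, i.e. effectively by first name.
sorted_names = [r[0] for r in conn.execute("SELECT name FROM People ORDER BY name ASC")]

print(count, by_first, by_last, sorted_names)
```

The spreadsheet pain point carries over: the data model still has no surname to sort on, so the query language can only pattern-match its way around it.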
Domain Model 2 In SQL

    CREATE TABLE People (
        first_name varchar(255),
        surname varchar(255),
        company varchar(255),
        street_address varchar(255),
        city varchar(255),
        county varchar(255),
        post_code varchar(255),
        phone varchar(255),
        email varchar(255),
        home_page varchar(255)
    );

    INSERT INTO People VALUES (
        'Aleshia', 'Tomkiewicz',
        'Alan D Rosenburg Cpa Pc',
        '14 Taylor St', 'St. Stephens Ward', 'Kent', 'CT2 7PP',
        '01835-703597',
        '[email protected]',
        'http://www.alandrosenburgcpapc.co.uk'
    );
    ...
SQL Manipulation Of DM2
The old queries work, but we can improve them
Search for items:
    SELECT * FROM People WHERE first_name = 'Aleshia';
    SELECT * FROM People WHERE surname = 'Tomkiewicz';
We can recreate DM1!
    SELECT first_name || ' ' || surname AS name,
           street_address || ', ' || city || ', ' || county || ' ' || post_code AS address,
           phone,
           email,
           home_page
    FROM People;
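As a quick check that concatenation really does rebuild the DM1 columns, here is a sketch run against in-memory SQLite from Python. Only the name and address columns are included, which is a simplification of the People table; string literals use single quotes, since double quotes denote identifiers in standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE People (first_name TEXT, surname TEXT, street_address TEXT,"
    " city TEXT, county TEXT, post_code TEXT)"
)
conn.execute(
    "INSERT INTO People VALUES ('Aleshia', 'Tomkiewicz', '14 Taylor St',"
    " 'St. Stephens Ward', 'Kent', 'CT2 7PP')"
)

# Rebuild the single "name" and "address" columns of DM1 from the DM2 columns.
row = conn.execute(
    "SELECT first_name || ' ' || surname AS name,"
    " street_address || ', ' || city || ', ' || county || ' ' || post_code AS address"
    " FROM People"
).fetchone()

print(row)
```

Going the other way, from DM1 to DM2, would require parsing the combined strings, which is why the finer-grained model is the better base table.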
Domain Model 3 In SQL

    CREATE TABLE People (
        person_id SMALLINT UNSIGNED,
        first_name varchar(255),
        surname varchar(255),
        company varchar(255),
        street_address varchar(255),
        city varchar(255),
        county varchar(255),
        post_code varchar(255),
        email varchar(255),
        home_page varchar(255),
        CONSTRAINT pk_person PRIMARY KEY (person_id)
    );

    CREATE TABLE Phone (
        person_id SMALLINT UNSIGNED,
        number varchar(255),
        CONSTRAINT pk_phone_number PRIMARY KEY (number)
    );

    INSERT INTO People VALUES (
        1, 'Aleshia', 'Tomkiewicz',
        'Alan D Rosenburg Cpa Pc',
        '14 Taylor St', 'St. Stephens Ward', 'Kent', 'CT2 7PP',
        '[email protected]',
        'http://www.alandrosenburgcpapc.co.uk'
    );
    INSERT INTO Phone VALUES (1, '01835-703597');
    INSERT INTO Phone VALUES (1, '01944-369967');
SQL Manipulation Of DM3
Recreate DM1 and DM2: easy
Find everyone with the same phone number
Can we have unassigned numbers?
How'd DM Do?
Core DM/Data structure
  Tables seem to work
SQL and Relational Model
  We can do everything!
  All queries in all models
  Model 3 has 2 tables/requires joins
Domain Model 3
  Neater inserting and deleting
    Can have as many phones as you want!
  Every other domain model can be derived
    Just write the query: define as a view!
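"Define as a view" can be sketched as follows, using in-memory SQLite from Python. The view name PeopleFlat and the reduced column set are hypothetical choices for illustration; the base tables follow the two-table People/Phone design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (person_id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT);
CREATE TABLE Phone (person_id INTEGER, number TEXT PRIMARY KEY);
INSERT INTO People VALUES (1, 'Aleshia', 'Tomkiewicz');
INSERT INTO Phone VALUES (1, '01835-703597');
INSERT INTO Phone VALUES (1, '01944-369967');

-- A flat, DM1-style table derived from the base tables (view name is hypothetical).
CREATE VIEW PeopleFlat AS
  SELECT first_name || ' ' || surname AS name, number AS phone
  FROM People INNER JOIN Phone ON People.person_id = Phone.person_id;
""")

# The view is queried like any other table.
flat = conn.execute("SELECT name, phone FROM PeopleFlat ORDER BY phone").fetchall()
print(flat)
```

The view stores no data of its own: adding a third phone row to Phone makes a third row appear in PeopleFlat the next time it is queried.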
Expressive Power
SQL is expressive
  The core data model is rich
    Composing and filtering tables does a lot!
  Operators and functions helpful
    Without concat, there'd be trouble!
The language is powerful
  Reasonably composable
  Lots of features
  Extended and extensible in many implementations
    Interop problems!
Querying With SQL
Schemas Vs. Queries
CREATE statements
  "create" empty tables
  out of nothing at all
  with certain constraints
  with some expectation of permanence
SELECT statements
  "generate" new tables (possibly with data)
  out of existing tables
  according to some constraints
  with no expectation of permanence
Closed Over Tables
SQL is (mostly) closed over tables
  Most SQL constructs take tables and produce tables
  Clear exception: Functions!
Manipulation is manipulation of tables
  Not rows, columns, or cells directly
  Rows, columns, and cells are "degenerate tables"...
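Closure over tables means the result of one query can be fed straight into another. A minimal sketch, assuming in-memory SQLite from Python; the two extra rows and the alias kent are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (first_name TEXT, surname TEXT, county TEXT)")
conn.executemany("INSERT INTO People VALUES (?, ?, ?)", [
    ("Aleshia", "Tomkiewicz", "Kent"),   # from the slides
    ("Ann", "Example", "Kent"),          # hypothetical
    ("Bob", "Sample", "Essex"),          # hypothetical
])

# The inner SELECT produces a table, so the outer SELECT can query it directly.
nested = conn.execute(
    "SELECT surname FROM (SELECT * FROM People WHERE county = 'Kent') AS kent"
    " ORDER BY surname"
).fetchall()

# Even a single value comes back as a one-row, one-column table.
single = conn.execute("SELECT COUNT(*) FROM People").fetchall()

print(nested, single)
```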
Filtering
Key operation
  SELECT: ignoring some parts
  Basically "find"
Can filter rows or columns or both
Requires "testing" functions on values
Filtering Columns
"Projection"
Specified in the SELECT clause
Keep all columns:
    SELECT * FROM People;
Just a single column:
    SELECT county FROM People;
Multiple columns:
    SELECT name, county FROM People;
Rename columns:
    SELECT street_address AS address FROM People;
Filtering Rows
Just called "filter" or "selecting"
Specified in the WHERE clause of your query:
Equality:
    SELECT * FROM People WHERE surname = 'Smith';
Range:
    SELECT * FROM People WHERE heartrate > 95;
Compound criteria:
    SELECT * FROM People WHERE heartrate > 95 AND county = 'Kent';
Building Tables With Cross Join
The fundamental operation is Cartesian product
  People × Phone
  This makes a new row out of every pair of rows between the two tables
    What's the size of the result?
  Not really a user-oriented feature
"Incidentally" cross joins are dangerous!
Building Tables With Inner Join
An inner join is a join filtered on common columns
  Useful for our phone records!
  (Special case called a "natural" join.)

    SELECT * FROM People
    INNER JOIN Phone ON People.person_id = Phone.person_id;
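To compare the Cartesian product with an inner join on person_id, here is a small sketch using in-memory SQLite from Python. The tables are reduced to two columns each, and the people with ids 2 and 3 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (person_id INTEGER PRIMARY KEY, surname TEXT);
CREATE TABLE Phone (person_id INTEGER, number TEXT);
INSERT INTO People VALUES (1, 'Tomkiewicz');
INSERT INTO People VALUES (2, 'Example');
INSERT INTO People VALUES (3, 'Sample');
INSERT INTO Phone VALUES (1, '01835-703597');
INSERT INTO Phone VALUES (1, '01944-369967');
""")

# Cross join: one row for every (person, phone) pair, related or not.
n_cross = conn.execute("SELECT COUNT(*) FROM People CROSS JOIN Phone").fetchone()[0]

# Inner join: only the pairs that agree on person_id survive.
inner = conn.execute(
    "SELECT surname, number FROM People"
    " INNER JOIN Phone ON People.person_id = Phone.person_id"
    " ORDER BY number"
).fetchall()

print(n_cross, inner)
```

With 3 people and 2 phone rows the cross join has 3 × 2 = 6 rows; the product grows multiplicatively with table size, while the inner join keeps only the matching pairs.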
Building Tables With Outer Join
An outer join is like an inner join but it also returns rows that don't have a match in the other table
  left outer is different from right outer

    SELECT * FROM People
    LEFT OUTER JOIN Phone ON People.person_id = Phone.person_id;

  will also return people who have no phone!
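A sketch of the outer join keeping people without a phone, run on in-memory SQLite from Python. People is placed on the left of a LEFT OUTER JOIN so that it is the preserved side (a choice made here partly because older SQLite versions lack RIGHT OUTER JOIN); the second person is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (person_id INTEGER PRIMARY KEY, surname TEXT);
CREATE TABLE Phone (person_id INTEGER, number TEXT);
INSERT INTO People VALUES (1, 'Tomkiewicz');   -- from the slides
INSERT INTO People VALUES (2, 'Example');      -- hypothetical, has no phone
INSERT INTO Phone VALUES (1, '01835-703597');
""")

# Every People row is kept; missing phone values are filled with null.
outer = conn.execute(
    "SELECT surname, number FROM People"
    " LEFT OUTER JOIN Phone ON People.person_id = Phone.person_id"
    " ORDER BY People.person_id"
).fetchall()

# sqlite3 reports SQL null as Python None.
without_phone = [surname for surname, number in outer if number is None]

print(outer, without_phone)
```

An inner join over the same tables would return only the first row, silently dropping the person without a phone.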
Building And Filtering
Once we've built a table we can filter things we need:

    SELECT * FROM People
    LEFT OUTER JOIN Phone ON People.person_id = Phone.person_id
    WHERE People.surname = 'Smith';

... you knew that already!?
The Cost
A key issue with joins
  Worst case is a CROSS
  Even if you don't generate the CROSS
    You might have to consider all the pairs
    (If you aren't careful)
Good optimisers avoid both
  Considering lots of matches (think indexes)
  Generating large intermediate tables
Multiple Phone Columns
  Some people have none or one
  Or no email or webpage
No Surname
  Even if we normalised that away
  Some people don't have a surname!
Null
null is a distinguished value which can mean:
  "Value not yet known"
  "Not applicable to this entity"
  "Value undefined"
  check out LSQL
Key property: Unequal to everything
  null = null is never true
  Match on not null, rather than null
Strange value!
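The "unequal to everything" property can be observed directly. A minimal sketch with in-memory SQLite from Python; the second row is a hypothetical person with no surname.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (first_name TEXT, surname TEXT)")
conn.executemany("INSERT INTO People VALUES (?, ?)", [
    ("Aleshia", "Tomkiewicz"),   # from the slides
    ("Mononym", None),           # hypothetical person with no surname
])

# null = null does not evaluate to true; the result is itself null (None in Python).
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# So "= NULL" matches nothing, while IS NULL finds the row.
with_equals = conn.execute("SELECT first_name FROM People WHERE surname = NULL").fetchall()
with_is_null = conn.execute("SELECT first_name FROM People WHERE surname IS NULL").fetchall()

print(null_equals_null, with_equals, with_is_null)
```

The same reasoning explains why null never matches in a join condition: the comparison is never true, so the row pair is never produced.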
Outer Joins
If you have no nulls in your base tables
  you can't get them in tables derived by inner join
However, the 2 phone column table is derivable
  We use the outer join
Outer joins take a table T
  for each row in T
    extend it with the (projected) columns from another table
    If there's a match, add the matched values*
    else, add nulls
See Learning SQL Chapter 10 for some worked examples
Null Proliferation
null never matches
  So iterated outer joins proliferate nulls
  As you get wider, you get sparser
    If you are matching on a sparse attribute
nulls pose a challenge for relational theory
  And somewhat for practice
  Starts moving from the sweet spot
SQL And The Web
A brief tour
SQL Driven Websites
Many websites are backed by a database
  PHP makes it easy
  Consider WordPress and other CMSs
    Lots of unstructured content
    Stuff in blobs and text fields
Key properties
  Scaling
  ACID: Atomicity, Consistency, Isolation, Durability
    Transactions
    Concurrent access
There is a key historical text that is still good reading, esp chps 11-12
CSV & SQL Programs On The Web
Other government repositories:
  data.gov
  data.gov.uk
Scientific sites
  clinicaltrials.gov
  uniprot.org
  ...
UN Data repository
Google Query Viz Language
A SQL-like language
  Used in Google Docs
  Spreadsheet QUERY function takes queries as argument
WebSQL
The WhatWG and W3C tried to standardize WebSQL
  "This specification introduces a set of APIs to manipulate client-side databases using SQL."
Local database backed web apps
  For offline use
  Just increased capabilities

    function prepareDatabase(ready, error) {
      return openDatabase('documents', '1.0', 'Offline document storage', 5*1024*1024, function (db) {
        // Upgrade from "no version" to 1.0 by creating the table
        db.changeVersion('', '1.0', function (t) {
          t.executeSql('CREATE TABLE docids (id, name)');
        }, error);
      });
    }
Reading
There is a key historical text that is still good reading, esp chps 11-12
Any Questions So Far?
Labs & Coursework
Next, we go to the Labs
You look in BB at Week 1 coursework:
  Quiz Q1
  Short Essay SE1
  Small Modelling exercise M1
  Some querying CW1
Read, think, ask us!