Normalisation Africamuseum 5 June 2013. What is ‘Normalisation’? Theoretical: satisfying the...


Normalisation

Africamuseum

5 June 2013

What is ‘Normalisation’?

Theoretical: satisfying the requirements of the different ‘Normal Forms’, as spelled out by (mainly) E.F. Codd

Practical: make sure data is in your database once and only once
  Repeated data go to a separate table
  Relationships between the tables are part of the ‘model’ of the database

Earlier example

Species          # legs  # eyes  Place      Country  Date
Asterias rubens  5       0       Oostende   Belgium  12/3/2004
Asterias rubens  5       0       Zeebrugge  Belgium  13/3/2005
Asterias rubens  5       0       Zeebrugge  Belgium  14/3/2005
Cancer pagurus   10      2       De Panne   Belgium  12/3/2004
Cancer pagurus   10      2       Oostende   Belgium  12/3/2004
Cancer pagurus   10      2       Zeebrugge  Belgium  14/3/2004
Asterias rubens  5       0       Wimereux   France   13/3/2005
Asterias rubens  5       0       Wimereux   France   14/3/2005
Cancer pagurus   10      2       Wimereux   France   12/3/2004

Why normalise

Save space on disk by avoiding repetition
  But huge disk space makes this less important
  Zipping would replace repeated strings by a code
Avoid ‘modification anomalies’
Make model intuitive and informative
Make database unbiased with respect to patterns of querying

Modification anomalies

Update anomalies
  Potential source of conflicting data
Insertion anomalies
  Some relevant data can’t be stored
Deletion anomalies
  Some relevant data are lost while deleting other data

Update anomalies

If data is present more than once, it’s possible to create conflicting information by updating one version of the data and not the other

Species          # legs  # eyes  Place      Country  Date
Asterias rubens  6       0       Oostende   Belgium  12/3/2004
Asterias rubens  5       0       Zeebrugge  Belgium  13/3/2005
Asterias rubens  5       1       Zeebrugge  France   14/3/2005
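Conflicts like the ones in this table can be found mechanically. The following is a minimal sketch using Python's built-in sqlite3 module; the table name drs_flat, the column names and the helper conflicts() are assumptions made for this illustration, and the rows are the three shown above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE drs_flat (species TEXT, legs INTEGER, eyes INTEGER,"
    " place TEXT, country TEXT, date TEXT)"
)
con.executemany(
    "INSERT INTO drs_flat VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Asterias rubens", 6, 0, "Oostende", "Belgium", "12/3/2004"),
        ("Asterias rubens", 5, 0, "Zeebrugge", "Belgium", "13/3/2005"),
        ("Asterias rubens", 5, 1, "Zeebrugge", "France", "14/3/2005"),
    ],
)


def conflicts(key, attr):
    """Return (key value, number of distinct attr values) wherever that number exceeds 1."""
    sql = (
        f"SELECT {key}, COUNT(DISTINCT {attr}) FROM drs_flat "
        f"GROUP BY {key} HAVING COUNT(DISTINCT {attr}) > 1"
    )
    return con.execute(sql).fetchall()


print(conflicts("species", "legs"))    # one species, two different leg counts
print(conflicts("place", "country"))   # one place, two different countries
```

In a normalised design each of these facts is stored once, so there is nothing to get out of step.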

Insertion anomalies

If two concepts are mixed in one table, we can’t store information on new items of one type unless we have, at the same time, information on the other

Species           # legs  # eyes  Place      Country  Date
Asterias rubens   5       0       Oostende   Belgium  12/3/2004
Asterias rubens   5       0       Zeebrugge  Belgium  13/3/2005
Asterias arenata  5       0       <null>     <null>   <null>
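A minimal sketch of why the last row cannot be stored, assuming the flat table is keyed on species, place and date and that key fields may not be empty. The table and column names are assumptions for this illustration; sqlite3 is Python's built-in SQLite module.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE drs_flat (
        species TEXT NOT NULL,
        legs    INTEGER,
        eyes    INTEGER,
        place   TEXT NOT NULL,
        country TEXT,
        date    TEXT NOT NULL,
        PRIMARY KEY (species, place, date)
    )
    """
)

# We know the traits of Asterias arenata but have no distribution record yet.
try:
    con.execute(
        "INSERT INTO drs_flat VALUES (?, ?, ?, ?, ?, ?)",
        ("Asterias arenata", 5, 0, None, None, None),
    )
    stored = True
except sqlite3.IntegrityError:
    # place and date are part of the key, so they may not be NULL
    stored = False

print(stored)  # False
```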

Deletion anomalies

If two concepts are mixed in one table, we lose information on one concept when the last instance of the other concept is deleted

Species           # legs  # eyes  Place      Country  Date
Asterias rubens   5       0       Oostende   Belgium  12/3/2004
Asterias rubens   5       0       Zeebrugge  Belgium  13/3/2005
Asterias arenata  5       0       Zeebrugge  Belgium  13/3/2005
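The loss can be reproduced in a few lines. This is a sketch only, using Python's built-in sqlite3 module; the table name drs_flat and the helper known_species() are assumptions for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE drs_flat (species TEXT, legs INTEGER, eyes INTEGER,"
    " place TEXT, country TEXT, date TEXT)"
)
con.executemany(
    "INSERT INTO drs_flat VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Asterias rubens", 5, 0, "Oostende", "Belgium", "12/3/2004"),
        ("Asterias rubens", 5, 0, "Zeebrugge", "Belgium", "13/3/2005"),
        ("Asterias arenata", 5, 0, "Zeebrugge", "Belgium", "13/3/2005"),
    ],
)


def known_species():
    """Everything the flat table can tell us about species traits."""
    return con.execute(
        "SELECT DISTINCT species, legs, eyes FROM drs_flat ORDER BY species"
    ).fetchall()


before = known_species()
# Delete the only distribution record of Asterias arenata ...
con.execute("DELETE FROM drs_flat WHERE species = ?", ("Asterias arenata",))
# ... and its number of legs and eyes disappears with it.
after = known_species()
print(before)
print(after)
```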

Making model more intuitive

A good model should reflect the reality it tries to mirror, including the relationships between the entities. Entities that are separate in real life (they can be abstract) should be modelled separately

Species          # legs  # eyes  Place      Country  Date
Asterias rubens  5       0       Oostende   Belgium  12/3/2004
Asterias rubens  5       0       Zeebrugge  Belgium  13/3/2005

Column groups marked on the slide: shared, biological, biogeographical

… and robust

Entries in a database should be ‘atomic’
  Should not be a combination of several smaller entities such as ‘Oostende, Belgium’
  Contain no qualifiers (such as Asterias cfr rubens; Asterias ?rubens…)
  Not be dependent on the value of another field
  Not contain repeated values (e.g. several authors for a multi-author publication)
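As a rough sketch of what making entries atomic can look like in practice, the two hypothetical helpers below split a combined place entry and separate an identification qualifier from the name. The function names and the short list of recognised qualifiers are assumptions for this illustration, not part of the original slides.

```python
def split_place(entry):
    """Split 'Oostende, Belgium' into separate place and country fields.

    Raises ValueError if the entry does not consist of exactly two parts.
    """
    place, country = [part.strip() for part in entry.split(",")]
    return {"place": place, "country": country}


def split_identification(name):
    """Separate genus, species epithet and an optional qualifier."""
    genus, *rest = name.split()
    qualifier = None
    epithet_parts = []
    for token in rest:
        if token in ("cfr", "cf.", "aff."):   # assumed list of qualifiers
            qualifier = token
        elif token.startswith("?"):
            qualifier = "?"
            epithet_parts.append(token[1:])
        else:
            epithet_parts.append(token)
    return {"genus": genus, "epithet": " ".join(epithet_parts), "qualifier": qualifier}


print(split_place("Oostende, Belgium"))
print(split_identification("Asterias cfr rubens"))
print(split_identification("Asterias ?rubens"))
```

Once the qualifier sits in its own field, a query for Asterias rubens finds the uncertain identifications too, and they can still be filtered out when needed.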

Avoid bias

Asterias rubens
  Oostende, Belgium, 12/3
  Zeebrugge, Belgium, 13/3
  Wimereux, France, 13/3
Asterias arenata
  Den Osse, Netherlands, 17/3
Cancer pagurus
  Oostende, Belgium, 12/3
  De Panne, Belgium, 12/3
  Den Osse, Netherlands, 14/5
Abra alba
  Oostende, Belgium, 14/5

A ‘nested list’ is easy to query on the grouping factor of the list, but awkward to query on anything else. It is easy to find in which countries Asterias rubens occurs; to find out which species occur in, say, France, we must read our complete database
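The asymmetry can be sketched in Python with the nested list held as a dictionary keyed on species. The variable names are assumptions for this illustration; the data are those of the nested list above.

```python
nested = {
    "Asterias rubens": [
        ("Oostende", "Belgium", "12/3"),
        ("Zeebrugge", "Belgium", "13/3"),
        ("Wimereux", "France", "13/3"),
    ],
    "Asterias arenata": [("Den Osse", "Netherlands", "17/3")],
    "Cancer pagurus": [
        ("Oostende", "Belgium", "12/3"),
        ("De Panne", "Belgium", "12/3"),
        ("Den Osse", "Netherlands", "14/5"),
    ],
    "Abra alba": [("Oostende", "Belgium", "14/5")],
}

# Query on the grouping factor: one direct lookup.
countries_of_rubens = {country for _, country, _ in nested["Asterias rubens"]}

# Query on anything else: every entry of every species has to be read.
species_in_france = {
    species
    for species, records in nested.items()
    for _, country, _ in records
    if country == "France"
}

# Flattened to one row per record, both questions become the same kind of filter.
flat = [
    (species, place, country, date)
    for species, records in nested.items()
    for place, country, date in records
]

print(countries_of_rubens)
print(species_in_france)
print(len(flat))
```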

The formal process

The key,

The whole key,

And nothing but the key…

So help me (E.F.) Codd

N1NF (non-1 Normal Form)

Asterias rubens
  Oostende, Belgium, 12/3
  Zeebrugge, Belgium, 13/3
  Wimereux, France, 13/3
Asterias arenata
  Den Osse, Netherlands, 17/3
Cancer pagurus
  Oostende, Belgium, 12/3
  De Panne, Belgium, 12/3
  Den Osse, Netherlands, 14/5
Abra alba
  Oostende, Belgium, 14/5

N1NF

Structure of the ‘table’:
  drs (species, legs, eyes, place1, country1, date1, place2, country2, date2, place3, country3, date3)
Entries are not atomic, difficult to query
What if we have a fourth distribution record?
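Going from this wide structure to one row per distribution record is a mechanical step. Below is a minimal sketch; the function name to_1nf and the dictionary representation of a record are assumptions for this illustration.

```python
def to_1nf(wide):
    """Turn one wide record with numbered place/country/date fields into one row per record."""
    rows = []
    n = 1
    while f"place{n}" in wide:
        if wide[f"place{n}"] is not None:  # skip unused slots
            rows.append(
                {
                    "species": wide["species"],
                    "legs": wide["legs"],
                    "eyes": wide["eyes"],
                    "place": wide[f"place{n}"],
                    "country": wide[f"country{n}"],
                    "date": wide[f"date{n}"],
                }
            )
        n += 1
    return rows


wide = {
    "species": "Asterias rubens", "legs": 5, "eyes": 0,
    "place1": "Oostende", "country1": "Belgium", "date1": "12/3",
    "place2": "Zeebrugge", "country2": "Belgium", "date2": "13/3",
    "place3": "Wimereux", "country3": "France", "date3": "13/3",
}

for row in to_1nf(wide):
    print(row)
```

In the resulting structure a fourth distribution record is simply a fourth row; no new columns are needed.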

1NF

Species          # legs  # eyes  Place      Country  Date
Asterias rubens  5       0       Oostende   Belgium  12/3/2004
Asterias rubens  5       0       Zeebrugge  Belgium  13/3/2005
Asterias rubens  5       0       Zeebrugge  Belgium  14/3/2005
Cancer pagurus   10      2       De Panne   Belgium  12/3/2004
Cancer pagurus   10      2       Oostende   Belgium  12/3/2004
Cancer pagurus   10      2       Zeebrugge  Belgium  14/3/2004
Asterias rubens  5       0       Wimereux   France   13/3/2005
Asterias rubens  5       0       Wimereux   France   14/3/2005
Cancer pagurus   10      2       Wimereux   France   12/3/2004

1NF: the key

A distribution record (a line in our table) is uniquely identified by the combination of species, place and date
  drs (species, place, date, legs, eyes, country), with key (species, place, date)
Table names are usually plural, field (column) names singular. In this type of analysis the key fields are underlined
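A compound key of this kind can be declared directly in SQL. The sketch below uses Python's built-in sqlite3 module; the column types are assumptions for this illustration. It shows that a second record with the same species, place and date is refused, while a record that differs only in its date is accepted.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE drs (
        species TEXT NOT NULL,
        place   TEXT NOT NULL,
        date    TEXT NOT NULL,
        legs    INTEGER,
        eyes    INTEGER,
        country TEXT,
        PRIMARY KEY (species, place, date)
    )
    """
)
insert = "INSERT INTO drs VALUES (?, ?, ?, ?, ?, ?)"
record = ("Asterias rubens", "Zeebrugge", "13/3/2005", 5, 0, "Belgium")
con.execute(insert, record)

try:
    con.execute(insert, record)      # same species, place and date
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

# Same species and place, different date: a different distribution record.
con.execute(insert, ("Asterias rubens", "Zeebrugge", "14/3/2005", 5, 0, "Belgium"))
count = con.execute("SELECT COUNT(*) FROM drs").fetchone()[0]
print(duplicate_accepted, count)
```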

2NF: the whole key

Move repeating groups to separate entities and look for a key for each new entity: remove attributes that depend on only part of the compound key
  Distribution records (species, place, date), key (species, place, date)
  Species (species, legs, eyes), key species
  Places (place, country), key place
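The split into three tables can be sketched with plain Python collections. The variable names are assumptions for this illustration; the rows are those of the 1NF table shown earlier.

```python
# (species, legs, eyes, place, country, date)
flat = [
    ("Asterias rubens", 5, 0, "Oostende", "Belgium", "12/3/2004"),
    ("Asterias rubens", 5, 0, "Zeebrugge", "Belgium", "13/3/2005"),
    ("Asterias rubens", 5, 0, "Zeebrugge", "Belgium", "14/3/2005"),
    ("Cancer pagurus", 10, 2, "De Panne", "Belgium", "12/3/2004"),
    ("Cancer pagurus", 10, 2, "Oostende", "Belgium", "12/3/2004"),
    ("Cancer pagurus", 10, 2, "Zeebrugge", "Belgium", "14/3/2004"),
    ("Asterias rubens", 5, 0, "Wimereux", "France", "13/3/2005"),
    ("Asterias rubens", 5, 0, "Wimereux", "France", "14/3/2005"),
    ("Cancer pagurus", 10, 2, "Wimereux", "France", "12/3/2004"),
]

# legs and eyes depend on species only: one row per species.
species = sorted({(s, legs, eyes) for s, legs, eyes, _, _, _ in flat})

# country depends on place only: one row per place.
places = sorted({(place, country) for _, _, _, place, country, _ in flat})

# What remains is the key itself: one row per distribution record.
drs = [(s, place, date) for s, _, _, place, _, date in flat]

print(species)
print(places)
print(len(drs))
```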

2NF: foreign keys

The one original table was split in three
  Distribution records (drs), species, places
Tables drs and species share a field, species, that allows us to find related records
  Field species is a foreign key in table drs
  Same with drs and places
Species and places can be populated from reference tables (CoL; Gazetteer)
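A minimal sketch of the three tables with their foreign keys, again using Python's built-in sqlite3 module. The column types are assumptions for this illustration. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON has been issued on the connection.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
con.executescript(
    """
    CREATE TABLE species (species TEXT PRIMARY KEY, legs INTEGER, eyes INTEGER);
    CREATE TABLE places  (place TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE drs (
        species TEXT NOT NULL REFERENCES species (species),
        place   TEXT NOT NULL REFERENCES places (place),
        date    TEXT NOT NULL,
        PRIMARY KEY (species, place, date)
    );
    INSERT INTO species VALUES ('Asterias rubens', 5, 0);
    INSERT INTO places  VALUES ('Oostende', 'Belgium');
    INSERT INTO drs     VALUES ('Asterias rubens', 'Oostende', '12/3/2004');
    """
)

# A distribution record for a species that is not in the species table is refused.
try:
    con.execute("INSERT INTO drs VALUES (?, ?, ?)", ("Cancer pagurus", "Oostende", "12/3/2004"))
    orphan_accepted = True
except sqlite3.IntegrityError:
    orphan_accepted = False

# The shared fields let us reassemble the original flat row with a join.
joined = con.execute(
    """
    SELECT d.species, s.legs, s.eyes, d.place, p.country, d.date
    FROM drs d
    JOIN species s ON s.species = d.species
    JOIN places  p ON p.place = d.place
    """
).fetchall()
print(orphan_accepted)
print(joined)
```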

3NF: nothing but the key

Move attributes that are functionally dependent on a non-key attribute to a separate table
Possible structure (in this case the same as 2NF)
  Distribution records (species, place, date)
  Places (place, country)
  Species (species, legs, eyes)
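Functional dependency, the notion behind both 2NF and 3NF, can be tested against sample data. The helper holds() below is a hypothetical sketch, not a standard library function; the rows are a subset of the 1NF table. Such a test can only refute a dependency: real dependencies follow from the meaning of the data, not from a sample.

```python
def holds(rows, determinant, dependent):
    """True if, in these rows, each value of `determinant` goes with a single value of `dependent`."""
    seen = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        if seen.setdefault(key, value) != value:
            return False
    return True


columns = ("species", "legs", "eyes", "place", "country", "date")
data = [
    ("Asterias rubens", 5, 0, "Oostende", "Belgium", "12/3/2004"),
    ("Cancer pagurus", 10, 2, "De Panne", "Belgium", "12/3/2004"),
    ("Cancer pagurus", 10, 2, "Oostende", "Belgium", "12/3/2004"),
    ("Asterias rubens", 5, 0, "Wimereux", "France", "13/3/2005"),
]
rows = [dict(zip(columns, values)) for values in data]

print(holds(rows, "species", "legs"))    # legs is determined by species
print(holds(rows, "place", "country"))   # country is determined by place
print(holds(rows, "place", "species"))   # refuted: two species occur in Oostende
```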

Elaborating further: IDs

Key of drs is compound, composed of three fields – better to replace with a ‘synthetic’ key (id – ‘autonumber’ or ‘sequence’)

Keys of ‘places’ and ‘species’ are names with real meaning; anything with meaning in real life can change, so also better to replace with artificial key
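A sketch of the same model with synthetic keys, using Python's built-in sqlite3 module. In SQLite a column declared INTEGER PRIMARY KEY is filled in automatically, which plays the role of the ‘autonumber’. The column names species_id and place_id, and the renaming of Oostende to its English name Ostend, are assumptions for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript(
    """
    CREATE TABLE species (id INTEGER PRIMARY KEY, species TEXT UNIQUE, legs INTEGER, eyes INTEGER);
    CREATE TABLE places  (id INTEGER PRIMARY KEY, place TEXT, country TEXT);
    CREATE TABLE drs (
        id         INTEGER PRIMARY KEY,
        species_id INTEGER NOT NULL REFERENCES species (id),
        place_id   INTEGER NOT NULL REFERENCES places (id),
        date       TEXT NOT NULL,
        UNIQUE (species_id, place_id, date)   -- the old compound key stays unique
    );
    """
)
species_id = con.execute(
    "INSERT INTO species (species, legs, eyes) VALUES (?, ?, ?)", ("Asterias rubens", 5, 0)
).lastrowid
place_id = con.execute(
    "INSERT INTO places (place, country) VALUES (?, ?)", ("Oostende", "Belgium")
).lastrowid
con.execute(
    "INSERT INTO drs (species_id, place_id, date) VALUES (?, ?, ?)",
    (species_id, place_id, "12/3/2004"),
)

# A name changes: one update in one table, and no distribution record is touched.
con.execute("UPDATE places SET place = ? WHERE id = ?", ("Ostend", place_id))

result = con.execute(
    """
    SELECT s.species, p.place, d.date
    FROM drs d
    JOIN species s ON s.id = d.species_id
    JOIN places  p ON p.id = d.place_id
    """
).fetchall()
print(result)
```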

Elaborating further: traits

Our database now has information on number of legs and number of eyes. What if we want to start storing colour?
  Requires rewrite of the database
Alternative: split out data on biological traits into a table with ‘property/value’ pairs
  Species (id, species, author, parent_id…)
  Traits (species_id, trait, value)
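A minimal sketch of the property/value approach with Python's built-in sqlite3 module. The species table is reduced to two fields for brevity, and the colour value is a made-up illustrative entry; both are assumptions for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE species (id INTEGER PRIMARY KEY, species TEXT UNIQUE);
    CREATE TABLE traits (
        species_id INTEGER NOT NULL REFERENCES species (id),
        trait      TEXT NOT NULL,
        value      TEXT,
        PRIMARY KEY (species_id, trait)
    );
    INSERT INTO species (id, species) VALUES (1, 'Asterias rubens');
    INSERT INTO traits VALUES (1, 'legs', '5'), (1, 'eyes', '0');
    """
)

# A new kind of trait is just a new row; the table structure does not change.
con.execute("INSERT INTO traits VALUES (?, ?, ?)", (1, "colour", "orange"))  # illustrative value

traits = dict(con.execute("SELECT trait, value FROM traits WHERE species_id = 1"))
print(traits)
```

The price of this flexibility is that all values share one column, here stored as text, so numeric traits have to be converted when they are read.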

Model

Remarks

Sometimes it is better not to normalise completely
  Surname & first name as 1 attribute instead of 2
  Calculated fields to speed up queries
Sometimes it is better to denormalise completely
  Exchange formats such as Darwin Core

Final remarks

Normalisation is a means, not a goal
Intelligent denormalising is as much an art as normalising!