Post on 02-Jan-2016
Get your hands dirty cleaning data.
2008 European EMu Users Meeting, 3rd June.
- Elizabeth Bruton, Museum of the History of Science, Oxford
elizabeth.bruton@mhs.ox.ac.uk
Outline
►Data Migration
►Problem -> Solution approach
►Tools
►Manual Data Cleaning
►Examples
►Current and Future Practices (Documentation, Policing, Review)
Data Migration
►First step towards better, cleaner data
►Steps:
   Prepare and analyse legacy system
   Data mapping
   KE EMu system design
   Data migration
Legacy System Analysis
►Prepare and analyse previous (legacy) system
   Data: structure and relationships - tables and fields
      ►Primary
      ►Secondary
      ►Cross-reference
   Documentation and usage
   Redundant data
KE EMu system design
►Default and additional fields across different modules
►Field titles
►Screen Designer
   e.g. Summary tab for ecatalogue module
►Finally data migration
Data cleaning overview
►Problem -> solution approach
   Input data
   Operations
   Output data
►Manual or automated operations or both?
►Which tools to use for automated operations?
   KE EMu tools – many powerful built-in tools within EMu
   Non-KE EMu tools – scripts run on data exported from EMu, which is then re-imported into EMu
   Both
KE EMu Tools: Texql
►KE Texpress Texql queries
   Similar syntax to mySQL or SQL
►Uses:
   Analysing data and data structure
   Analysing search queries
   Advanced search queries
KE EMu Tools: Global Replace
►Very useful, powerful but also potentially ‘dangerous’ tool
►Can use in combination with search query or list options within EMu
►Can use regular expressions and/or wildcard searches
►Powerful tool for single field or Field A->Field B operations
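Because a global replace can be ‘dangerous’, one way to reduce the risk is to preview a regular-expression replacement on exported data before applying anything. The sketch below is in Python and works on plain dictionaries, not on EMu itself; the record layout, the field name `maker` and the function `preview_replace` are assumptions made for illustration.

```python
import re

def preview_replace(records, field, pattern, replacement):
    """Return (record_id, before, after) for every record the replace would change."""
    regex = re.compile(pattern)
    changes = []
    for rec in records:
        before = rec[field]
        after = regex.sub(replacement, before)
        if after != before:
            changes.append((rec["id"], before, after))
    return changes

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "maker": "Smith & Jones,  London"},
    {"id": 2, "maker": "Unknown"},
]

# Collapse runs of two or more whitespace characters into a single space.
changes = preview_replace(records, "maker", r"\s{2,}", " ")
for record_id, before, after in changes:
    print(f"{record_id}: {before!r} -> {after!r}")
```

Only records that would actually change are listed, so the list can be read through before the same pattern is used for a real replace.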
KE EMu Tools: Record Merge
►Does what it says on the tin
►Merge one or more duplicate record(s) into a single record
►Only ‘attachments’ to different modules are merged into the record, not data
►Ditto tool can be used for easily copying data from one record to another
►Attachments to original duplicate record(s) are removed so records can be deleted
KE EMu Tools: Reports
►Tool to present information in assorted ways
►Can be used to produce reports but can also be used as data export tool
►Microsoft Excel or CSV format appropriate for more advanced data operations
Non-KE EMu Tools: Scripting
►Personally use php and mySQL
►Perl is also useful scripting tool; used by KE
►Have written CSV to mySQL file checker and converter in php
►Then run more advanced operations on data using php scripts
►PhpMyAdmin can export data in many formats including CSV
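The idea of a CSV file checker can be sketched as follows. This is not the php checker mentioned above; it is a minimal Python illustration using the standard `csv` module. The checks performed (duplicate header names, wrong field count per row) and the column names in the sample are assumptions.

```python
import csv
import io

def check_csv(text, delimiter=","):
    """Return a list of (line_number, message) problems found in CSV text."""
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return [(1, "file is empty")]
    if len(set(header)) != len(header):
        problems.append((1, "duplicate column names in header"))
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            problems.append(
                (reader.line_num, f"expected {len(header)} fields, found {len(row)}")
            )
    return problems

# Hypothetical export with one short row.
sample = "irn,maker,place\n1,Smith & Jones,London\n2,Unknown\n"
problems = check_csv(sample)
for line_number, message in problems:
    print(f"line {line_number}: {message}")
```

Running a check like this before loading the file into a database table catches malformed rows while the original line numbers are still available.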
Non-KE EMu Tools: Scripting
►Systematic Approach
   Keep copy of original data
   Produce data mapping or data cleaning document
   Perform operations using php file on mySQL table
   Check data produced (manual or automatic) and output logs
   Validate data in EMu and then import
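The first, third and fourth steps above (keep a copy, perform the operation, output a log) can be sketched in a few lines. The sketch uses Python with the built-in `sqlite3` module in place of php and mySQL so that it runs without a database server; the table name `parties`, the column `name` and the function `clean_table` are hypothetical.

```python
import sqlite3

def clean_table(conn, table, column, operation):
    """Apply `operation` to one column, keeping the originals and returning a change log."""
    cur = conn.cursor()
    # Keep a copy of the original data before touching anything.
    # Table and column names are interpolated directly, so use trusted names only.
    cur.execute(f"CREATE TABLE {table}_original AS SELECT * FROM {table}")
    log = []
    rows = cur.execute(f"SELECT id, {column} FROM {table}").fetchall()
    for row_id, value in rows:
        cleaned = operation(value)
        if cleaned != value:
            cur.execute(
                f"UPDATE {table} SET {column} = ? WHERE id = ?", (cleaned, row_id)
            )
            log.append((row_id, value, cleaned))
    conn.commit()
    return log

# Hypothetical table standing in for data exported from EMu.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parties (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO parties (name) VALUES (?)", [("  Smith & Jones ",), ("Unknown",)]
)

# The cleaning operation here is simply trimming surrounding whitespace.
log = clean_table(conn, "parties", "name", str.strip)
for row_id, before, after in log:
    print(f"{row_id}: {before!r} -> {after!r}")
```

The log records every change as a before/after pair, which supports the manual or automatic check before the data is validated and imported.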
Manual Data Cleaning
►Some problems cannot be solved automatically, either partially or entirely
►Need to be ‘eyeballed’ by a person, preferably someone familiar with the museum’s collections
Example: Parties Records
►Legacy system used two systems of noting object ‘makers’
   Freetext ‘Maker’ field with no centralised system (1:1 ratio); used for applicable records
   Assigned makers with centralised system; only used for first 3,000 or so records
►Freetext data imported into EMu resulted in approximately 5,500 Parties records
Example: Parties Records
►Good example of mapping freetext field to more structured data field with 1:Many ratio
►KE ran script which ‘detected’ maker type and formatted accordingly, i.e. Maker Type etc
►But still much cleaning up to be done
►Two approaches: automatic then manual
Example: Parties Records
►Problem: Creation-related data within legacy system were all free-text fields
►The museum wanted to keep this data in some format as it contained valuable information, such as ambiguities or uncertainties
►e.g. Italy or France, Attributed to Smith & Jones, possibly last quarter of 19th century etc
Example: Parties Records
►This data did not fit neatly into defined, structured fields such as Parties, Places or Creation Date
►Also wanted to clean Parties records
►Solution: Automatic batch process then manual cleaning
Example: Parties Records – Automatic Approach
   Exported Creation data (Parties, Place, Creation Date) from EMu
   Ran script which checked for and removed duplicates in Parties and Place
   Note: The above operation deleted rather than manipulated data but was still an integral part of the data cleaning operation
   Copied cleaned Parties, Place, Creation Date into single free-text field: Creation Notes
   Re-imported data into EMu using Import Tool
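The duplicate removal and the copy into a single Creation Notes field might look like the Python sketch below. The slides do not say how duplicates were detected or how the values were joined, so the matching rule (ignore case and surrounding spaces) and the separators are assumptions, as are the function names.

```python
def dedupe(values):
    """Drop repeated values (ignoring case and surrounding spaces), keeping first-seen order."""
    seen = set()
    unique = []
    for value in values:
        key = value.strip().casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(value.strip())
    return unique

def creation_notes(parties, places, date):
    """Join de-duplicated Parties, Place and Creation Date into one free-text note."""
    parts = ["; ".join(dedupe(parties)), "; ".join(dedupe(places)), date.strip()]
    return ". ".join(part for part in parts if part)

# Sample values taken from the free-text examples quoted earlier.
notes = creation_notes(
    ["Attributed to Smith & Jones", "attributed to Smith & Jones "],
    ["Italy or France"],
    "possibly last quarter of 19th century",
)
print(notes)
```

Keeping the qualifiers ("Attributed to", "possibly") in the note preserves the ambiguities the museum wanted to retain, even after the structured fields are cleaned.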
Example: Parties Records – Automatic Approach
   Began data cleaning by running Global Replace operation within EMu eparties module, removing 'Signed by', 'Attributed to', or 'Made by' from the relevant parties records
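Outside EMu, the same removal can be expressed as one regular expression. This Python sketch is an illustration on exported names, not the Global Replace operation itself; the pattern assumes the qualifier appears at the start of the name, and `strip_qualifier` is a hypothetical helper.

```python
import re

# Matches a leading 'Signed by', 'Attributed to' or 'Made by', in any letter case.
PREFIX = re.compile(r"^\s*(?:signed by|attributed to|made by)\s+", re.IGNORECASE)

def strip_qualifier(name):
    """Remove a leading attribution qualifier from a party name."""
    return PREFIX.sub("", name).strip()

names = ["Attributed to Smith & Jones", "Made by J. Smith", "Smith, John"]
cleaned = [strip_qualifier(name) for name in names]
print(cleaned)
```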
Next: Manual Approach
Example: Parties Records – Manual Approach
   Cleaned records: checked Parties Type (Person or Organisation) and edited records (Surname, Forename, Organisation etc)
   Merged and deleted duplicate records
   Checked and deleted unattached parties records
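A script cannot make these decisions, but it can shorten the list a person has to eyeball. The Python sketch below groups likely duplicates for review and lists records that nothing is attached to. The matching rule (compare names after lowercasing and dropping punctuation and spaces) and both function names are assumptions, and the sample data is invented.

```python
from collections import defaultdict

def duplicate_candidates(parties):
    """Group party ids whose names match after lowercasing and removing punctuation and spaces."""
    groups = defaultdict(list)
    for party_id, name in parties.items():
        key = "".join(ch for ch in name.casefold() if ch.isalnum())
        groups[key].append(party_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

def unattached(parties, attached_ids):
    """Return ids of party records that no other record refers to."""
    return sorted(set(parties) - set(attached_ids))

# Invented sample: party id -> name, plus the ids referenced by object records.
parties = {1: "Smith & Jones", 2: "Smith and Jones", 3: "smith & jones.", 4: "Dollond"}
attached_ids = [1, 2, 3]

print(duplicate_candidates(parties))
print(unattached(parties, attached_ids))
```

Records 1 and 3 are grouped, while "Smith and Jones" is not, which is the kind of case that still needs a person familiar with the collections.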
Example: Parties Records – End Result
►Currently have 3,300 cleaner Parties records
Current and Future Practices
►Current
   Systematic approach to data cleaning; incorporated into monthly museum EMu Users' Meeting
   Review
►In Progress
   Documentation
►Future
   Policing
Conclusion
►Data cleaning and policing is an ongoing process for an institution of any size
►Data standards must be set and adhered to
►Needs to be approached and done in a systematic way
►Any questions?