Clare Somerville Trish O’Kane Data in Databases
future-perfect-2012 -
Data in databases
“It’s not what you think”
Clare Somerville, Trish O’Kane
Our point
Long term preservation of data requires understanding how data is created and managed
We have to work out:
◦ What data the business needs to keep
◦ What records the business needs to create and keep
And... how
◦ What data must be unchanged
◦ What we mean by usable and retrievable
We will cover
The problem, as we see it
What is a record, and what are its attributes?
What is a database, and how are databases built and maintained?
How can we use data sets to create records?
What is a data warehouse, and how are data warehouses built and maintained?
How can we ensure that useful data sets are available over time?
Agenda
The problem
Definitions
Delivering data & records from data
◦ Data warehousing
◦ Data “lifecycle” management
Conclusion
The problem
Databases have replaced many semi-structured records
◦ Register of Births, Deaths and Marriages (and Divorces!)
◦ EQC claims data
But we want some of that information available long term in a usable format
Records managers are unfamiliar with the world of structured data
◦ Disposal outcome in a draft disposal authority: “When database decommissioned, transfer to Archives NZ”
◦ Transfer what?
Who wants what?
What have we got?
◦ Data in databases
What do we want?
◦ Records
When do we want them?
◦ Now, and for the long term
But... what is a record in the context of data?
◦ The individual data item?
◦ A whole dataset?
[Diagram labels: Source solution; Slice and dice here; Maintain metadata here; Business users?; Broader audience]
What have we got
1. Customers
◦ Customers for data
◦ Customers for records
2. Information assets
◦ Records
◦ Transactional data in databases
◦ Datasets
◦ Data marts and data warehouses
3. What do we have to do?
◦ Principles from data warehousing
◦ Data life cycle management
Definitions
Records, metadata, data, source systems, database, data warehouse
Records
Recordkeeping definition (Public Records Act 2005): A record or class of records in any form in whole or part, created or received by a public office in the conduct of its affairs
In structured world: A record is a line of data in a table in a database
Attributes of a record
Recordkeeping perspective
◦ Documents the carrying out of the organisation’s business objectives, core business functions, services and deliverables, and/or
◦ Provides evidence of compliance with any current jurisdictional standards, and/or
◦ Documents the value of the resources of the organisation and how risks to the business are managed, and/or
◦ Supports the long-term viability of the organisation
Data management perspective
◦ Field types: Numeric, Character, Date/time
◦ Composite, derived
◦ Values
Data and metadata
Documents and metadata
“Essentially there is a different relationship between data and its metadata than documents and their metadata”
Is it data or is it metadata?
It depends, doesn’t it?
It’s about the level at which it is used/applied
E.g. Date created

Customer ID   Date created   Customer name    Customer Type
123           2008-10-20     Bloggs, Joe      Retailer
124           2008-10-23     Mouse, Minnie    Distributor
125           2008-10-26     Max, Metadata    Direct

[Diagram label: date created]
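One way to see the "it depends on the level" point is to load the example table into a database. The sketch below uses Python's built-in sqlite3 module; the table name `customer` and the column names are assumptions chosen to mirror the slide. At row level, "date created" is a value stored with each customer (data). At table level, the column names and declared types describe the data (metadata).

```python
import sqlite3

# Hypothetical table and column names, chosen to mirror the slide's example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " date_created TEXT,"
    " customer_name TEXT,"
    " customer_type TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (123, "2008-10-20", "Bloggs, Joe", "Retailer"),
        (124, "2008-10-23", "Mouse, Minnie", "Distributor"),
        (125, "2008-10-26", "Max, Metadata", "Direct"),
    ],
)

# Row level: date_created is a value stored with each customer (data).
row_dates = [r[0] for r in conn.execute(
    "SELECT date_created FROM customer ORDER BY customer_id")]

# Table level: column names and declared types describe the data (metadata).
columns = [(r[1], r[2]) for r in conn.execute("PRAGMA table_info(customer)")]

print(row_dates)
print(columns)
```

The same "date created" idea could also apply to the table or the database as a whole, which is where it reads as metadata rather than data.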
Metadata in the data warehouse
Business metadata
◦ Link between database and users – road map for access
◦ Business users
◦ Analysts
◦ Less technical
◦ What data, from where, how, when etc
Technical metadata
◦ Developers
◦ Technical users
◦ Maintenance and growth
◦ On-going development
Metadata in the data warehouse
Business metadata
◦ Structure of data
◦ Table names
◦ Attribute names
◦ Location
◦ Access
◦ Reliability
◦ Summarisations
◦ Business rules
Technical metadata
◦ Table names
◦ Keys
◦ Indexes
◦ Program names
◦ Job dependencies
◦ Transformation
◦ Execution time
◦ Audit, security controls
Metadata
Data: 10 bytes
Metadata: 1 byte
Metadata
Data Metadata
Heaps!
Data – comma delimited
0349,000,A," ","CHANGE ADD ON MED CERT "," "," "," ","","S","GASUP","",00000,71909,00000,0,71909,10393470,00000.00,00000.00,00000.00,00000.00,00000.00,000000,71937,72266,0,139,600,4,72266,471,360480713,000000000,1,00090.00,00037.00,000031543560",00000.00,00000.00,+000000.00,0000000,0000,000,00,000000000,00000,00000,000000.00,000000.00,000000000,009,72266,00000,72268,16414213,000000001,000000000,244,0114340511,04,01,+000000.00,+000000.00,00000,000000,+000000.00,610,0,00146.13,000000.00,000,000,610,0,290763901,290763901,000000000,000000000,0000699873780174,000,D,"C","N","N","Y","Y","Y","N","N","3349533755","Y","T REED","Y","DSWSINVE106 ","BELOQ","Y","NAWEK","TANIA","REED","C","N",02651,009,0000,72273,72268,16405202,0114340511,03,72245,0000,003,011434,0000002288550174,000,A,"C","N","N","Y","Y","Y","N","N","3349533755","Y","T REED","Y","DSWSINVE106 ","BELOQ","Y","NAWEK","TANIA","REED","C","N",02651,009,0000,72273,72268,16405202,0114340511,04,72245,0000,003,011434,0000002288550161,000,D,"A",126,72263,00000.00,600,5,360480713,0000072827280161,000,A,"A",126,72263,00000.00,600,5,360480713,0000072827280057,000,D," "," "," "," "," ","A","","","AHMEV","VOKOG",000000003,0814409,2500,001,25,00,00000.00,000000,00,132,00000,0,+00063.00,72266,14133031,00000,00000.00,2,+00063.00,01,00000.00,000000,2,0,0,00000.00,607,1,471,362400470,000000000,0004094132990057,000,A," ","MANOP "," "," "," ","A","","","AHMEV","VOKOG",000000003,0814409,2500,001,25,00,00000.00,000000,00,132,72269,0,+00063.00,72266,14133031,00000,00000.00,2,+00063.00,01,00000.00,000000,2,0,0,00000.00,607,1,471,362400470,000000000,0004094132990270,000,A," "," ","","N","N","G",128,72266,72268,16414261,01,00000.000,00000.000,0,139,00000.00,00000.00,00000.00,00000.00,00000.00,00000.00,00000.00,00000.00,00000.00,600,5,471,000,000,000,000,000,000,000,000,0001,360480713,00000.00,0006025374450062,000,A,"YYYYYYYYYY ","AUTH01532600063000000000131197N014101 0000000","VA","SATRA","DSWSAUCK119 "," 
","MANOP","",003,132,72268,16414266,0000000000,0,607,362400470,000000000,000084800530
Data – in a table
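The contrast between the raw feed and the table can be sketched in a few lines. This is a minimal illustration, assuming a made-up three-field layout (`FIELD_NAMES` is hypothetical; the real layout of the feed is not given in the slides). The same values are unreadable as a bare list and become usable once field names, which are metadata, are attached.

```python
import csv
import io

# Hypothetical three-field layout; the real feed's layout is not given in the slides.
FIELD_NAMES = ["record_type", "status", "description"]

# A short made-up line in the same comma-delimited style.
raw_line = '0349,A,"CHANGE ADD ON MED CERT"\n'

# Without field names, the parsed values are just an anonymous list.
values = next(csv.reader(io.StringIO(raw_line)))

# With field names (metadata), the same values become a readable row.
row = dict(zip(FIELD_NAMES, values))

print(values)
print(row)
```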
Database
3 layers
• User interface
• Rules and algorithms
• Data
Application layer
◦ Adds, overwrites, deletes data
◦ Runs rules and processes
◦ Provides views, creates reports
◦ Turns data into information
Data layer
◦ Data in tables
◦ Acted on by application layer
[Diagram label: Source solution database]
Can data fit the PRA definition?
• We are “format neutral” in the management of records, so...
• Data can be records!
  – Births Deaths and Marriages Register
  – EQC claims data
• Test questions
  – If we exclude data what have we lost?
  – What is the impact of losing data?
    • On the business
    • For the future
[Diagram labels: Application layer; Data layer]
Source solution is not a recordkeeping system
The Solution System is not a recordkeeping system because it...
• Holds transactional data, not evidence of transactions in context (records)
• Isn’t tamper proof
  – Difficult to know exactly what the application layer is doing
  – Different tables and rows may be managed differently
  – Hard to roll back to a point in time
• Must overwrite ‘redundant’ data to run efficiently
  – Compromise of history vs speed
  – Business use is the priority
• The data layer is not usable without the application layer
Inside a database
Here today - gone tomorrow
Transaction metadata
◦ Example: An activity about a customer is a record
Is there a Unique ID
◦ For the transaction?
◦ For the customer?
Where and when are/were components located?
◦ Multiple data tables in one database
◦ Multiple data tables across multiple databases
Table names and column names
◦ Standard names for elements across tables
Source / business databases
◦ Data stored in tables
◦ Normalised structure
◦ Lots of data
◦ Large number of users
◦ Lots of very quick transactions
◦ Varying history retained
◦ Mostly data is overwritten
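The difference between overwriting data and retaining history can be illustrated with a small sketch. It assumes two hypothetical tables: `customer_current`, which is overwritten in place the way a source system typically works, and `customer_history`, which appends every change with an effective date. Only the second can answer "what was the value at a point in time?".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: one overwritten in place, one that appends versions.
conn.execute(
    "CREATE TABLE customer_current (customer_id INTEGER PRIMARY KEY, customer_type TEXT)")
conn.execute(
    "CREATE TABLE customer_history (customer_id INTEGER, customer_type TEXT, valid_from TEXT)")

def change_type(customer_id, new_type, on_date):
    # Source-system style: the previous value is overwritten and lost.
    conn.execute("INSERT OR REPLACE INTO customer_current VALUES (?, ?)",
                 (customer_id, new_type))
    # History-keeping style: every change is appended with its effective date.
    conn.execute("INSERT INTO customer_history VALUES (?, ?, ?)",
                 (customer_id, new_type, on_date))

def type_as_at(customer_id, as_at):
    # Latest history row on or before the requested date (ISO dates sort as text).
    row = conn.execute(
        "SELECT customer_type FROM customer_history "
        "WHERE customer_id = ? AND valid_from <= ? "
        "ORDER BY valid_from DESC LIMIT 1",
        (customer_id, as_at),
    ).fetchone()
    return row[0] if row else None

change_type(123, "Retailer", "2008-10-20")
change_type(123, "Distributor", "2009-03-01")

current = conn.execute(
    "SELECT customer_type FROM customer_current WHERE customer_id = 123").fetchone()[0]
print(current)                         # only the latest value survives here
print(type_as_at(123, "2008-12-31"))   # the history table can still answer this
```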
Data warehouse
Data warehouse
Storing and accessing large amounts of data
Central repository for all or significant parts of the data that an enterprise’s various business systems collect
Data warehouse
Corporate needs
Centrally owned
Corporate effort
Transaction level data
Historical data
Lots of data!
Multiple source systems
Designed for reporting and analysis
Large queries
Multiple table joins
Unpredictable use
Pressure on resources
What is the simplest/most robust approach to deliver data and records from databases?
Elegant solutions needed
1 Create policy to document:
◦ What authoritative records must be retained and what metadata must be retained
◦ What formats are acceptable
◦ Which (if any) records and metadata are considered transient artefacts, and why (e.g. format shifting duplicates, quality checking etc)
Get approval for destruction of transient artefacts as part of the normal functioning of the systems that dispose of them
Approach: create and export records from solution system
1. Identify which data tables/records are needed and can be produced
2. Map identified records to disposal authorities
◦ Which records must be kept beyond system decommission
◦ Identify the business need for retention
3. Use the application layer to create and export those records in a suitable format
4. Store in recordkeeping system e.g. data warehouse or EDRMS
5. Retain records needed for the business post-decommission
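The export step in this approach can be sketched as follows. This is a minimal illustration, not a description of any particular system: the `claim` table, the `disposal_class` label and the export date are all hypothetical, and a direct query stands in for whatever export the application layer actually provides. The point is that the exported records travel with their metadata (source, columns, row count, checksum) in one package.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source table standing in for a solution-system table.
conn.execute("CREATE TABLE claim (claim_id INTEGER PRIMARY KEY, lodged TEXT, status TEXT)")
conn.executemany("INSERT INTO claim VALUES (?, ?, ?)",
                 [(1, "2011-02-23", "Open"), (2, "2011-03-01", "Closed")])

def export_table(conn, table, disposal_class, exported_on):
    """Package a table's rows together with descriptive metadata."""
    # The table name is interpolated, so it must come from trusted code.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY 1")
    columns = [d[0] for d in cursor.description]
    rows = [dict(zip(columns, r)) for r in cursor]
    payload = json.dumps(rows, sort_keys=True)
    return {
        "metadata": {
            "source_table": table,
            "columns": columns,
            "row_count": len(rows),
            "disposal_class": disposal_class,   # hypothetical label
            "exported_on": exported_on,
            "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        },
        "records": rows,
    }

package = export_table(conn, "claim", "retain-post-decommission", "2012-01-01")
print(json.dumps(package["metadata"], indent=2))
```

The checksum lets the receiving recordkeeping system confirm later that the records are unchanged since export.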
2 Persistently associate metadata
Appropriate metadata associated and retained with authoritative records
◦ Identify data linkages between systems
◦ Retain those linkages or
◦ Consolidate metadata and associated record objects into one system, and ensure they are persistently associated
Ensure migrated data/metadata/objects retain their context (e.g. date created, author etc)
Future state BAU transfers to recordkeeping systems
[Diagram labels: Case mgmt system; Customer mgmt system; EDRMS]
Create key records and send to EDRMS
Structured data to data warehouse
Data warehouses as an example of good practice
Managing data
Data feeds - principles
◦ Direct data feeds from source systems
◦ Not changed in any way
◦ No intervening processes
◦ All changes to the data
◦ Fully auditable
◦ Reconcile to source system
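The "reconcile to source system" principle can be sketched as a comparison of row counts and content fingerprints. This is one simple way to do it, under the assumption that both sides can be read as rows of values; the function names `fingerprint` and `reconcile` are made up for illustration.

```python
import hashlib

def fingerprint(rows):
    """Return (row_count, digest) for an iterable of row tuples, ignoring row order."""
    encoded = sorted(repr(tuple(r)).encode("utf-8") for r in rows)
    digest = hashlib.sha256(b"\n".join(encoded)).hexdigest()
    return len(encoded), digest

def reconcile(source_rows, warehouse_rows):
    """Compare a source extract with what was loaded; True means they match."""
    return fingerprint(source_rows) == fingerprint(warehouse_rows)

source = [(123, "Retailer"), (124, "Distributor")]
loaded_ok = [(124, "Distributor"), (123, "Retailer")]   # same rows, different order
loaded_bad = [(123, "Retailer"), (124, "Direct")]       # one value changed in transit

print(reconcile(source, loaded_ok))
print(reconcile(source, loaded_bad))
```

A failed reconciliation signals that the feed was changed somewhere between the source system and the warehouse, which is exactly what the principles above rule out.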
For example: one table...
Before:
◦ 29 months data
◦ 162 tapes
◦ 400 million records
◦ 88 GB
After:
◦ 29 months data
◦ 4 physical files
◦ 27 million records
◦ 6 GB
[Diagram labels: Month 1, Month 2, Month 3, ... Month n; Compare; Differences 1, Differences 2; Consolidated file]
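The slides do not spell out how the monthly compare works, so the following is only one plausible reading of the Compare, Differences and Consolidated file labels: keep the first month in full, store only what changed in each later month, and replay the changes to rebuild any month. All function names are hypothetical, and the sketch assumes record values are never `None` (which is used here to mark a removed record).

```python
def diff_snapshots(previous, current):
    """Return only the changes between two snapshots keyed by record ID."""
    changes = {}
    for key, value in current.items():
        if previous.get(key) != value:
            changes[key] = value          # new or changed record
    for key in previous:
        if key not in current:
            changes[key] = None           # record removed this month
    return changes

def consolidate(snapshots):
    """Keep the first month in full, then only differences for later months."""
    months = sorted(snapshots)
    consolidated = [(months[0], dict(snapshots[months[0]]))]
    for earlier, later in zip(months, months[1:]):
        consolidated.append((later, diff_snapshots(snapshots[earlier], snapshots[later])))
    return consolidated

def rebuild(consolidated, month):
    """Replay the stored differences to recover the snapshot for a given month."""
    state = {}
    for stored_month, changes in consolidated:
        if stored_month > month:
            break
        for key, value in changes.items():
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
    return state

# Made-up monthly snapshots: customer ID -> customer type.
snapshots = {
    "2010-01": {123: "Retailer", 124: "Distributor"},
    "2010-02": {123: "Retailer", 124: "Direct"},
    "2010-03": {123: "Retailer", 124: "Direct", 125: "Direct"},
}
consolidated = consolidate(snapshots)
stored = sum(len(changes) for _, changes in consolidated)
full = sum(len(s) for s in snapshots.values())
print(stored, "stored records instead of", full)
```

Because most records do not change from month to month, the stored differences are far smaller than the full snapshots, while every month can still be rebuilt.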
Subsets
◦ Frequently used data
◦ At a point in time
◦ Smaller, quicker
◦ Easier to use
◦ Daily, weekly, monthly
Summary layer
◦ Analysts access the summary layer
◦ Smaller, easier
◦ Data Marts
Summary data
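A summary layer can be sketched as an aggregate table built from transaction-level data. The `payment` table and its values below are made up for illustration; the idea is simply that analysts query the smaller monthly table rather than the detail rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical transaction-level table standing in for warehouse detail data.
conn.execute("CREATE TABLE payment (customer_id INTEGER, paid_on TEXT, amount REAL)")
conn.executemany("INSERT INTO payment VALUES (?, ?, ?)", [
    (123, "2012-01-05", 90.0),
    (123, "2012-01-19", 37.0),
    (124, "2012-01-07", 63.0),
    (123, "2012-02-02", 90.0),
])

# Summary layer: one row per customer per month, built from the detail rows.
conn.execute(
    "CREATE TABLE payment_monthly AS "
    "SELECT customer_id, substr(paid_on, 1, 7) AS month, "
    "       COUNT(*) AS payments, SUM(amount) AS total "
    "FROM payment GROUP BY customer_id, substr(paid_on, 1, 7)"
)

summary = conn.execute(
    "SELECT customer_id, month, payments, total FROM payment_monthly "
    "ORDER BY customer_id, month").fetchall()
print(summary)
```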
Benefits of data warehouse
One version of the truth
Tuned environment
Can do more – freedom to explore
Full history – track everything
Updated daily
Multiple sources of data
Quick and easy to access
Stored online
Accessible
Data management
Data does not manage itself!
Difficult, unruly
Standards, processes
Roles and responsibilities
Data warehouse team
Skills
◦ Data warehousing, Data management, Software, Hardware, Metadata, Architecture, Analysis, Performance, tuning
Coordination, communication, marketing
Best practice
Data warehousing around for years
Proven architectures, technologies, methodologies
Good infrastructure
... but will it last?
Challenges – big data
33% - data growth contributes to performance issues “most of the time”
Managing storage may cost 3-10 times the cost of procurement
Average company keeps 20-40 duplicates of its data
Helping IT and the business to collaborate in managing data
It’s not just about BI
Business and IT must work together
Data “lifecycle” management
[Diagram labels: Old EDRMS; New EDRMS; Old case mgmt system; New case mgmt system; Data warehouse]
Decommission = risk
Partial exports
Data lifecycle management
Data lifecycle management (DLM)
◦ Managing the flow of data, information and associated metadata through information systems and repositories, from creation and storage through to when it can be discarded.
◦ Recognises that the importance and business value of data does not rely on its age, or how often it is used.
Why DLM
Data and information have value for
◦ strategic and operational business needs
◦ managing risk
◦ meeting legislative obligations
Value of information decays over time
Some information can be archived, some discarded
Occasionally, sometimes unexpectedly, older data may need to be accessed again, quickly, completely and accurately
DLM Components
Maintain: Organise, Describe, Manage
Retain or Dispose: Archive, Transfer, Destroy
Use: Access, Share, Find
Create or Modify: Standards, Formats, Retrieval
[Diagram labels: Property; Customer; Tenancy]
Requires: Core process artefacts; Connected systems; Automated capture
Requires: Risk identification; Lifecycle policies; Metadata schema; Business classification linked to business process
Requires: Single source of truth; Disposal Authorities; Disposal Planning; Tiered Storage
Requires: Disposal Authorities; Business requirements; Disposal planning; Tiered Storage
Includes data validation
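The "Retain or Dispose" component depends on disposal authorities being applied systematically. A minimal sketch of that idea is a lookup from record class to a retention period and action. Everything in `DISPOSAL_SCHEDULE` is hypothetical; real values would come from an approved disposal authority, and the decision is driven by the class of record rather than by how often the data is used.

```python
from datetime import date

# Hypothetical disposal schedule: record class -> (minimum retention in years, action).
# Real values would come from an approved disposal authority.
DISPOSAL_SCHEDULE = {
    "claim": (10, "transfer"),
    "system_log": (1, "destroy"),
}

def disposal_action(record_class, closed_on, today):
    """Return the action due for a record, or 'retain' if none is due yet."""
    if record_class not in DISPOSAL_SCHEDULE:
        return "retain"                      # no authority: keep by default
    years, action = DISPOSAL_SCHEDULE[record_class]
    try:
        due = closed_on.replace(year=closed_on.year + years)
    except ValueError:                       # 29 February in a non-leap target year
        due = closed_on.replace(year=closed_on.year + years, day=28)
    return action if today >= due else "retain"

today = date(2012, 3, 1)
print(disposal_action("system_log", date(2010, 6, 30), today))
print(disposal_action("claim", date(2011, 2, 23), today))
print(disposal_action("unclassified", date(2000, 1, 1), today))
```

Defaulting to "retain" when no authority covers a record class reflects the cautious reading that nothing is destroyed without approval.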
Conclusion
Create and maintain
Principle 1: Recordkeeping Must be Planned and Implemented
1. Responsibility assigned CEO down
2. Policy
3. Procedures
4. Responsibilities defined, resourced
5. Recordkeeping programme & monitoring
Principle 2: Full & accurate records of business activity must be made
[Table columns: Requirement; Database; Data Warehouse]
1. Functions and business activities identified and documented
2. Records of business decisions and transactions must be created
3. All records of business activity captured routinely into an organisation-wide recordkeeping framework
4. Training provided
Principle 3: records must provide authoritative and reliable evidence of business activity
[Table columns: Requirement; Database; Data Warehouse]
10. Authentic: accurately documented creation, receipt, & transmission
11. Reliability & integrity, maintained unaltered
12. Useable, retrievable, accessible
13. Complete, with content & contextual information
14. Comprehensive, provide authoritative evidence of all business activities
Principle 4: records must be managed systematically
[Table columns: Requirement; Database; Data Warehouse]
15. Identified & captured in recordkeeping framework
16. Organised according to a business classification scheme
17. Reliably maintained over time in recordkeeping framework
18. Useable, accessible & retrievable for the entire period of their retention
19. Contextual and structural integrity maintained over time
20. Retention & disposal actions systematic
RK capability of system(s)
A system that holds authoritative records
◦ Must be capable of recordkeeping, or
◦ Made capable, or
◦ Must transfer records to a recordkeeping system
Who makes that decision?
◦ Should be business owner
◦ (with advice from IT)
Data warehouses show us
◦ what can be done
◦ how to do it
Developing an Enterprise Information Management Framework
STRUCTURED AND UNSTRUCTURED INFORMATION
◦ Social, Documents, IT/OT Transactional Data, Search, Emails, Images, Audio, Text, Mobile, Movies
GOVERNANCE
Authority, management, monitoring and performance of information management functions
◦ Establish principles
◦ Define: Policies, Standards, Business Rules
◦ Develop a strategy and roadmap
◦ Establish structures and arrangements
◦ Define roles and processes arrangements
◦ Assess current and desired maturity
◦ Determine metrics and measuring
◦ Establish monitoring processes
◦ Document legislative framework
◦ Understand compliance
◦ Determine and optimise business benefits
◦ Manage information risk
INFORMATION ASSET ARCHITECTURE
A blueprint for the semantic and physical integration of enterprise information assets, technology and the business
◦ Organise information for: Navigation and retrieval, Discovery, Content types and categorisation
◦ Model key information flows
◦ Establish IS design principles and standards
◦ Develop an inventory of information, systems and processes
◦ Develop a recordkeeping strategy and roadmap
◦ Enable compliant retention and disposal in systems
◦ Support access to legacy information
◦ Plan for any content migration
◦ Develop an information lifecycle strategy and roadmap
◦ Enable integration and interoperability
◦ Plan and manage: Repositories, Storage, Format
METADATA MANAGEMENT
The connecting foundation for EIM, used to describe, organise, integrate, share, and govern enterprise information assets
◦ Develop: Metadata Schema, Controlled Vocabulary, Thesauri, Business Function Classification
◦ Utilise system generated metadata
◦ Map across metadata schemas
◦ Establish monitoring and maintenance processes
◦ Implement metadata management tools
SECURITY AND CONTROL
Policies, rules and tools that ensure the proper control, protection and privacy of information
◦ Manage access control
◦ Manage classified information
◦ Ensure regulatory compliance
◦ Establish monitoring and metrics
◦ Identify: Authoritative information, High-value information, Critical information
◦ Plan for disaster recovery
◦ Establish security policies and rules
◦ Model information security and scenarios
◦ Build security into system metadata
BUSINESS INTELLIGENCE AND DATA WAREHOUSING
◦ Store and transform
◦ Integrate and deliver
◦ Perform analytics and reporting
◦ Support decision making
REFERENCE AND MASTER DATA MANAGEMENT
◦ Capture, store and re-use core business entities
◦ Consolidate and match data
◦ Manage and control data quality
◦ Distribute core data appropriately
INFORMATION CULTURE
The behaviours, values and norms of the enterprise within the context of information use
◦ Manage and sustain change
◦ Provide information leadership
◦ Embed EIM in performance management
◦ Deliver training and ongoing support
◦ Develop toolkits and reference material
INFORMATION STEWARDSHIP
Oversight of the content, description, quality, and accuracy of enterprise information throughout its lifecycle
◦ Define responsibility, roles and accountability
◦ Establish stewardship processes
◦ Establish monitoring and maintenance
Future state of data
Accurate, relevant, timely delivery of data and information
◦ Trustworthy information
◦ Where it is needed
◦ Formats most appropriate to business need and future
Information found quickly, whether it’s old or new
Clear guidelines for systems and processes
◦ Keep what’s needed for only as long as it’s needed
◦ In the right format
Data has recognisable value and appropriate levels of management
◦ Business need: we know what’s important, and when it’s important
◦ Risk: we’re clear about what to manage, and how
◦ Regulatory framework: we meet legislative obligations
Our point
Long term preservation of data requires understanding how data is created and managed
We have to work out:
◦ What data the business needs to keep
◦ What records the business needs to create and keep
And... how
◦ What data must be unchanged
◦ What we mean by usable and retrievable