Scientific Data Management
Presented by:
Craig A. Stewart
[email protected]
University Information Technology Services
Indiana University
Copyright 2002 Craig A. Stewart and the Trustees of Indiana University
License terms
• Please cite as: Stewart, C.A. 2002. Scientific Data Management. Tutorial Presentation. Presented at Laboratory Information Management Systems Conference, 2-3 May, Philadelphia, PA. http://hdl.handle.net/2022/14001
• Some figures shown here are taken from the web, under an interpretation of fair use that seemed reasonable at the time and within reasonable readings of copyright interpretations. Such diagrams are indicated here with a source URL. In several cases these web sites are no longer available, so the diagrams are included here for historical value.
• Except where otherwise noted, by inclusion of a source URL or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work – and to remix – to adapt the work – under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
13 June 2002
Why a tutorial on Scientific Data Management at the LIMS Institute Conference?
• Requested on last year’s conference surveys
• As scientific research becomes more oriented towards high-volume lab work, there will be an increasing presence of LIMS in scientific labs.
• As labs that already employ LIMS produce larger amounts of data, the techniques already used and understood in scientific research can be applied to the management of industrial data.
• It is becoming increasingly important to assure long-term preservation of data of all sorts; techniques developed and understood in the scientific data management area can help.
The key matter to be discussed today
Once the LIMS system has assured you that all of the measurements have been made and checked, you know where all of the samples are stored, and all of the output data has been written into an output file:
– on what storage medium/system,
– and in what logical structure,
should data be stored to assure its long-term readability and utility?
The approach
• This tutorial casts a very wide net in terms of its subject matter.
• A large part of the challenge in this topic is simply managing the vocabulary.
• Much of the day will be spent introducing concepts and terms.
• We will cover a large span of scale – ranging from single spreadsheets to systems holding hundreds of TBs of data.
Goals for today
• Explain the key problems of scientific data management
• Define and outline the concepts and nomenclature surrounding the problem
• Identify some of the key concepts, a few of the directions in which good answers might lie, and a few of the directions that definitely lead to wrong answers
• Provide enough information and references that you can independently investigate those matters of interest to you.
• At the end of the tutorial, you might not be in a position to start building a scientific data management system.
Sources & format
• There exists no text that covers this material in the manner discussed in this tutorial. CAS is an expert in some of the areas to be discussed today, but not all. Expect extensive footnoting and acknowledgement of other sources.
• The level of detail is intentionally uneven. Greater detail is generally associated with one of two factors:
– A topic is sufficiently straightforward that some details will let the participant go off and do something on her/his own.
– A topic is especially important and the participant may want to refer to it later. (In this case we may skim over some details during the actual presentation.)
Outline
Topic (range of application):
• The problem
• Physical storage of data: tapes, CDs, disk
• Data management strategies (single researcher to enterprise)
• Data warehouses, data federations (enterprise to national/international communities)
• Distributed file systems, external data sources, and data grids
• Visualization and collection-time data reduction as critical strategies (single researcher to enterprise)
• Archival and backup software systems (lab group to enterprise)
• Future of storage media
• Closing thoughts
• References
Bits, Bytes, and the proof that CDs have consciousness
• A bit is the basic unit of storage, and is always either a 1 or a 0.
• 8 bits make a byte, the smallest usual unit of storage in a computer.
• MegaByte (MB) – 1,048,576 bytes (a CD-ROM holds ~600 MB)
• GigaByte (GB) – ~1 billion bytes
• TeraByte (TB) – ~1 trillion bytes (a large library might have ~1 TB of data in printed material)
• PetaByte (PB) – 1 thousand TBs
• ExaByte (EB) – 1 thousand PBs
The problem of scientific data management
Explosion of data and need to retain it
• Science historically has struggled to acquire data; computing was largely used to simulate systems without much underlying data
• Lots of data:
– Lots of data available “out there”
– Dramatically accelerating ability to produce new data
• One of the key challenges, and one of the key uses of computing, is to make sense of the data that is now so easily produced
• Need to preserve availability of data for ???
[Figure: GenBank growth statistics, from http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html]
Accelerating ability to produce new data
• Diffractometer – 1 TB/year
• Synchrotron – 60 GB/day bursts
• Gene expression chip readers – 360 GB/day
• Human Genome – 3 GB/person
• High-energy physics – 1 PB per year*
*http://atlasinfo.cern.ch/Atlas/Welcome.html
Some things to think about
• 25 years ago data was stored on punched tape or punched cards
• How would you get data off an old AppleII+ diskette? How about one of those high-density 5 ¼” DOS diskettes?
• The backup tape in the sock drawer (especially if it’s a VMS backup tape of an SPSS-VMS data file)
• The no-longer-easily-handled data file on a CD (e.g. 1990 Census data)
• Data is essentially irreproducible more than a short period of time after the fact
Have you ever tried to read one of your old data files?
Exp_2_2_feb_14_1981
30 0 0.0 139.5 000.0 0.0060 0.02123 -20.48 098.4571 26.2 . .0053 .02123 -20.48 98.4557 . .0057 .02123 -20.47 98.4536 . .0060 .02123 -20.44 98.4533 . .0055 .02123 -20.46 98.4557 . .5760 .43607 0.00 98.4396 408.03 . .5707 .43247 0.00 98.4319 408.03 . .5696 .43161 0.00 98.4350 408.03 . .5718 .43325 0.00 98.4305 408.83 . .5755 .43450 0.00 98.4305 409.16 30 0 5.0 142. . .0045 .02169 1.38 98.8949 26.4 . .0047 .02169 1.39 98.8938 . .0045 .02167 1.38 98.8952 . .0045 .02167 1.41 98.8942 . .0045 .02164 1.41 98.8942 . .4821 .36409 5.45 98.9020 412.24 . .4821 .36512 5.46 98.9020 412.18 . .4847 .36733 5.46 98.8991 412.01 . .4857 .36851 5.46 98.8960 411.78 . .4879 .37028 5.46 98.8949 411.78
Even a small file can be undecipherable!

1 m 1 99 1 210
2 F 2 320 2 420
3 F 2 195 2 350
4 M 1 110 1 215
5 M 2 218 2 364
6 F 3 120 1 355
7 M 3 125 1 355
And something even older…
Hwæt! We Gardena in geardagum,þeodcyninga, þrym gefrunon, hu ða æþelingas ellen fremedon. Oft Scyld Scefing sceaþena þreatum…
This is from Beowulf, written 1,000 years ago. Think about the language problem relative to the half-life of radioactive waste!
Physical storage of data: tapes, CDs, disk
Durability of media
• Stone: 40,000 years
• Ceramics: 8,000 years
• Papyrus: 5,000 years
• Parchment: 3,000 years
• Paper: 2,000 years
• Magnetic tape: 10 years (under ideal conditions; 3-5 years is a more conservative estimate)
• CD-RW: 5-10 years (under ideal conditions; 1.5 years is a more conservative estimate)
• Magnetic disk: 5 years
• Even if the media survives, will the technology to read it survive too?
Data storage: media issues
• So what do you do with data on a paper tape?
• Long-term data storage inevitably forces you to confront two issues:
– the lifespan of the media
– the lifespan of the reading device
Data storage: removable magnetic media
• The right answer to any long-term (or even intermediate-term) data storage problem is almost never diskettes. It’s always a race between the lifespan of the media and the lifespan of the readers. One or the other always wins, and usually more quickly than you’d expect.
• Esoteric removable magnetic media are never a good idea. Even Zip drives are probably not a good bet in the long run. What do you do with a critical data set when your only copy is on a Bernoulli drive?
Magnetic Tapes
• Tapes store data in tracks on a magnetic medium. The actual material on the tape can become brittle and/or worn and fall off.
• Tapes are best used in machine room environments with controlled humidity.
• There are three situations in which tapes are the right choice:
– Within production machine rooms
– As backup media
– For transfer between machine rooms under some circumstances
Tape formats
• There are several formats with small user bases; these should probably be avoided. [This is admittedly a conservative stance, but…]
• DAT tapes don’t last well
• For system backups of office, lab, or departmental servers, Digital Linear Tape (DLT) is the best choice
Tape formats, II
• In machine rooms, Linear Tape Open (LTO) is the best choice.
• LTO is a multi-vendor standard
• Two variants:
– Accelis: faster, lower capacity (planned up to 25 GB/tape; 50 GB with compression)
– Ultrium: slower, higher capacity (planned up to 100 GB/tape; 200 GB with compression)
Non-magnetic removable media
• Acronym soup:
– CD – Compact Disk
– CD-ROM – CD-Read Only Memory
– CD-RW – CD-Read/Write
– DVD – Digital Versatile Disk
– DVD-RW – DVD-Read/Write
CDs and DVDs con’t
• For routine, reliable, reasonably dense storage of data around the lab, you can’t beat CDs or DVDs.
• CD writers are commonplace & reliable
• DVD writers are newer, more costly, and more prone to format issues.
• Always be sure to have extensive and complete information on the CD – including everything you need to know to remember what it really is later. There should be no data physically on the CD that is not contained in a file burned on the CD.
• Watch out for longevity issues!!
CD & DVD Jukeboxes
• Jukeboxes are good for what they do
• Because the basic media are standard, if you had to ditch your investment in the jukebox itself, you could still read the media elsewhere
• 240 CD jukebox at left, from http://www.kubikjukebox.com/index.htm
CD & DVD Jukeboxes, con’t
• System shown at left holds 16 jukeboxes; each holds 240 CDs
• http://www.kubikjukebox.com/index.htm
Spinning disk storage
• JBOD (Just a Bunch Of Disks) – all right so long as it’s acceptable to lose data now and again. High-speed access; takes advantage of the relatively low cost of disk drives. Good for temporary data parking while data awaits reduction.
• RAID (Redundant Array of Independent Disks) – what you need if you don’t want to lose data.
• Lifecycle replacement an issue in both cases
Disk: Current State of the Art
• Seagate Barracuda 180
• Largest-capacity disk at present: 181.6 GB
• Internal transfer rate: 282-508 Mbits/sec
• Average seek read/write: 7.4/8.2 msec
• Average latency: 4.17 msec
• Spindle speed: 7,200 RPM
• Power consumption: 10 watts (idle)
Disk Trends
• Capacity: doubles each year
• Transfer rate: 40% per year
• MB per $: doubles each year
RAID*
• Level 0: Provides data striping (spreading out blocks of each file across multiple disks) but no redundancy. This improves performance but does not deliver fault tolerance.
• Level 1: Provides disk mirroring.
• Level 3: Same as Level 0, but also reserves one dedicated disk for error correction data. It provides good performance and some level of fault tolerance.
• Level 5: Provides data striping at the byte level and also stripes error correction information. This results in excellent performance and good fault tolerance.
*webopedia.com
RAID 3
“This scheme consists of an array of HDDs for data and one unit for parity. … The scheme generates XOR (exclusive-or) parity derived from bit 0 through bit 7. If any of the HDDs fail, it restores the original data by an XOR between the redundant bits on other HDDs and the parity HDD. With RAID 3, all HDDs operate constantly.”
http://www.studio-stuff.com/ADTX/adtxwhatisraid.html
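The XOR-parity recovery the quote describes can be demonstrated in a few lines. This is an illustrative sketch of the idea only, not real RAID code:

```python
# Demonstrate RAID 3-style parity: the parity block is the XOR of the data
# blocks, so any single lost block can be rebuilt by XORing the survivors.

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data "disks"
parity = xor_blocks(data)            # the dedicated parity "disk"

# Simulate losing disk 1, then rebuild it from the others plus parity.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
print("disk 1 recovered:", recovered)
```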
RAID 5
“RAID5 implements striping and parity. In RAID5, the parity is dispersed and stored in all HDDs. …. RAID5 is most commonly used in the products on market these days.”
*http://www.studio-stuff.com/ADTX/adtxwhatisraid.html
Storage Area Network (SAN)
• Storage Area Network (SAN) is a high-speed subnetwork of shared storage devices. A storage device is a machine that contains nothing but a disk or disks for storing data. A SAN's architecture works in a way that makes all storage devices available to all servers on a LAN or WAN.
*Webopedia.com
Network Attached Storage (NAS)
• A network-attached storage (NAS) device is a server that is dedicated to file sharing through some protocol such as NFS. NAS does not provide any of the activities that a server in a server-centric system typically provides, such as e-mail, authentication or file management. …
*modified from Webopedia.com
Storage Bricks
• Group of hard disks inside a sealed box
• Includes spare disks
• Typically RAID 5
• When one disk fails, one of the spares is put to use
• When you’re out of spares…
• Sun seems to have originated this idea
Backups
• A properly administered backup system and schedule is a must.
• How often should you back up? More frequently than the time it takes you to acquire an amount of data that you can’t afford to lose.
• Backup schedules – full and incremental
• RAID disk enhances the reliability of storage, but it’s not a substitute for backups
• More about backup software and such later!
Disaster recovery
• If your data is too important to lose, then it’s too important to have in just one copy, or to have all of the copies in just one location.
• Natural disasters, human factors (e.g. fire), and theft (a significant portion of laptop thefts have data theft as their purpose) can all lead to the loss of one copy of your data. If it’s your only copy… or the only location where copies are kept…
• Offsite data storage is essential
– Vaulting services
– Remote locations of your business
Data management strategies
• Flat files
• Spreadsheets and statistical software
• Relational databases
• XML
• Specialized scientific data formats
Flat Files
Data Management Strategies: Flat files
• Nothing beats an ASCII flat file for simplicity
• ASCII files are not typically used for data storage by commercial software because proprietary formats can be accessed more quickly
• If you want a way to store data that you will be able to retrieve reliably later (media issues notwithstanding), an ASCII flat file is a good choice.
Data Management Strategies: Flat files, II
• If you use an ASCII flat file for simple long-term storage, be sure that:
– The file name is self-explanatory
– There is no information embedded in the file name that is not also embedded in the file
– Each individual data file includes a complete data dictionary, an explanation of the instrument model and experimental conditions, and an explanation of the fields
– The data are laid out in accordance with First, Second, and Third Normal Forms as much as is possible (more on these terms later)
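Putting the flat-file guidelines above into practice might look like the following sketch; the file name, instrument description, and field names are all invented for illustration:

```python
# Write a small ASCII flat file whose comment header doubles as a data
# dictionary. Every name here (file, instrument, fields) is invented
# for illustration.

records = [(14, 1, 35.0), (14, 2, 43.0), (16, 3, 38.0)]

header = (
    "# File: o2_consumption_trial.txt\n"
    "# Instrument: (hypothetical) Model-X respirometer\n"
    "# Fields: specimen_id (integer), measurement_num (integer),\n"
    "#         o2_consumption (ml/hr, float)\n"
)

with open("o2_consumption_trial.txt", "w") as f:
    f.write(header)
    for specimen, measurement, value in records:
        f.write(f"{specimen} {measurement} {value}\n")

# Reading it back needs nothing but a text editor, or two lines of code:
with open("o2_consumption_trial.txt") as f:
    rows = [line.split() for line in f if not line.startswith("#")]
print(rows)
```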
Data dictionary
• Definition from webopedia.com:
– In database management systems, a file that defines the basic organization of a database. A data dictionary contains a list of all files in the database, the number of records in each file, and the names and types of each field. …
• More generally:
– A data dictionary is what you (or someone else) will need to make sense of the data more than a few days after the experiment is run
Spreadsheets and statistical packages
Spreadsheet Software as a data management tool
• Microsoft’s Excel may suffice for many data management needs
• If any given data set can be described in a 2D spreadsheet with up to hundreds of rows and columns, and if there is relatively little need to work across data sets, then Excel might do the trick for you
• Do beware of version issues!
Spreadsheet software as a data management tool, con’t
• Designed originally to be electronic accountants’ ledgers
• Feature creep has in some ways helped those who have moderate amounts of data to manage
• There are several options, including Open Source products such as Gnumeric and nearly open source products such as StarOffice
• Since MS Excel is the most commonly used spreadsheet package, this discussion will focus on MS Excel
The MS Excel Data menu
• Sort: ascending or descending sorts on multiple columns
• Lists: allow you to specify a list (use only one list per spreadsheet) and then perform filters, selecting only those rows that meet certain criteria (probably more useful for mailing lists than scientific data management)
• Validation: lets you check for typos, data translation errors, etc. by searching for out-of-bounds data
• Consolidate
• Group and outline
• PivotTable
• Get external data
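The out-of-bounds checking that the Validation feature performs can be mimicked for any tabular data; in this sketch the fields and bounds are invented for illustration:

```python
# Mimic spreadsheet-style validation: flag values outside declared bounds.
# The field names and bounds here are invented for illustration.

bounds = {"weight": (50, 500), "glucose": (40, 400)}

rows = [
    {"id": 1, "weight": 99,  "glucose": 210},
    {"id": 2, "weight": 320, "glucose": 420},   # glucose out of range
    {"id": 3, "weight": 30,  "glucose": 350},   # weight out of range
]

def validate(rows, bounds):
    """Return (record id, field, value) for every out-of-bounds entry."""
    problems = []
    for row in rows:
        for field, (lo, hi) in bounds.items():
            if not lo <= row[field] <= hi:
                problems.append((row["id"], field, row[field]))
    return problems

for rec_id, field, value in validate(rows, bounds):
    print(f"record {rec_id}: {field}={value} is out of bounds")
```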
MS Excel Statistics
• Mean, standard deviation, confidence intervals, etc. up to t-test are available as standard functions within MS Excel
• One-way ANOVA and more complex statistical routines are available in the Statistics Add-in Pack
MS Excel Graphics
• Does certain things quite easily
• If it doesn’t do what you want it to do easily, it probably won’t do it at all
• Constraints on the way data are laid out in the spreadsheet are often an issue
Statistical Software as a data management tool
• SPSS and SAS are the two leading packages
• Both have ‘spreadsheet-like’ data entry and editing interfaces
• Both have been around a long time, and are likely to remain around for a good while
• Workstation and mainframe versions of both are available
What’s wrong with this program?

DATA LIST FILE=sample.dat
 /id 1 v1 3 (A) v2 5 v3 7-9 v4 11 v5 13-15
LIST VARIABLES v1 v2 v3
ONEWAY v3 BY v2 (1,3)
REGRESSION /DEPENDENT=v5 /METHOD=ENTER v3
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
4 m 1 110 1 215
5 m 2 218 2 364
6 f 3 120 1 355
7 m 3 125 1 335
Better…

DATA LIST FILE=sample.dat
 /id 1 gender 3 (A) weight 5 glucose 7-9 bp 11 reactime 13-15
LIST VARIABLES gender weight glucose
ONEWAY glucose BY weight (1,3)
REGRESSION /DEPENDENT=reactime /METHOD=ENTER glucose
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
4 m 1 110 1 215
5 m 2 218 2 364
6 f 3 120 1 355
7 m 3 125 1 335
Now you have a fighting chance

DATA LIST FILE=sample.dat
 /id 1 gender 3 (A) weight 5 glucose 7-9 bp 11 reactime 13-15
VARIABLE LABELS ID 'Subject ID #'
 GENDER 'Subject Gender'
 WEIGHT 'Subject Weight in pounds'
 GLUCOSE 'Blood glucose level'
 BP 'Blood Pressure'
 REACTIME 'Reaction Time in Minutes'
VALUE LABELS GENDER m 'Male' f 'Female'
LIST VARIABLES gender weight glucose
ONEWAY glucose BY weight (1,3)
REGRESSION /DEPENDENT=reactime /METHOD=ENTER glucose
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
.
An example SAS program

/* Computer Anxiety in Middle School Children */
/* The following procedure specifies value labels for variables */
PROC FORMAT;
  VALUE $sex 'M'='Male' 'F'='Female';
  VALUE exp 1='up to 1 year' 2='2-3 yrs' 3='3+ yrs';
  VALUE school 1='rural' 2='city' 3='suburban';
DATA anxiety;
  INFILE clas;
  INPUT ID 1-2 SEX $ 3 (EXP SCHOOL) (1.) (C1-C10) (1.) (M1-M10) (1.)
        MATHSCOR 26-27 COMPSCOR 28-29;
  FORMAT SEX $SEX.;
  FORMAT EXP EXP.;
  FORMAT SCHOOL SCHOOL.;
  /* Conditional transformation */
  IF MATHSCOR=99 THEN MATHSCOR=.;
  IF COMPSCOR=99 THEN COMPSCOR=.;
  /* Recoding variables. Several items are to be reversed while scoring. */
  /* The Likert-type questionnaire had a choice range of 1-5 */
  C3=6-C3; C5=6-C5; C6=6-C6; C10=6-C10;
  M3=6-M3; M7=6-M7; M8=6-M8; M9=6-M9;
  COMPOPI = SUM (OF C1-C10); /* Find sum of 10 items using the SUM function */
  MATHATTI = M1+M2+M3+M4+M5+M6+M7+M8+M9+M10; /* Adding item by item */
  /* Labeling variables */
  LABEL ID='STUDENT IDENTIFICATION' SEX='STUDENT GENDER'
        EXP='YRS OF COMP EXPERIENCE' SCHOOL='SCHOOL REPRESENTING'
        MATHSCOR='SCORE IN MATHEMATICS' COMPSCOR='SCORE IN COMPUTER SCIENCE'
        COMPOPI='TOTAL FOR COMP SURVEY' MATHATTI='TOTAL FOR MATH ATTI SCALE';
SAS example, Part 2

/* Printing the data set by choosing specific variables */
PROC PRINT;
  VAR ID EXP SCHOOL MATHSCOR COMPSCOR COMPOPI MATHATTI;
  TITLE 'LISTING OF THE VARIABLES';
/* Creating frequency tables */
PROC FREQ DATA=ANXIETY;
  TABLES SEX EXP SCHOOL;
  TABLES (EXP SCHOOL)*SEX;
  TITLE 'FREQUENCY COUNT';
/* Getting means */
PROC MEANS DATA=ANXIETY;
  VAR COMPOPI MATHATTI MATHSCOR COMPSCOR;
  TITLE 'DESCRIPTIVE STATISTICS FOR CONTINUOUS VARIABLES';
RUN;

/* Please refer to the following URL for further information */
/* http://www.indiana.edu/~statmath/stat/sas/unix/index.html */
An example SPSS program

TITLE 'COMPUTER ANXIETY IN MIDDLE SCHOOL CHILDREN'
DATA LIST FILE=clas.dat
 /ID 1-2 SEX 3 (A) EXP 4 SCHOOL 5 C1 TO C10 6-15 M1 TO M10 16-25
  MATHSCOR 26-27 COMPSCOR 28-29
MISSING VALUES MATHSCOR COMPSCOR (99)
RECODE C3 C5 C6 C10 M3 M7 M8 M9 (1=5) (2=4) (3=3) (4=2) (5=1)
RECODE SEX ('M'=1) ('F'=2) INTO NSEX /* Changing char var into numeric var
COMPUTE COMPOPI=SUM (C1 TO C10) /* Find sum of 10 items using the SUM function
COMPUTE MATHATTI=M1+M2+M3+M4+M5+M6+M7+M8+M9+M10 /* Adding each item
VARIABLE LABELS ID 'STUDENT IDENTIFICATION' SEX 'STUDENT GENDER'
 EXP 'YRS OF COMP EXPERIENCE' SCHOOL 'SCHOOL REPRESENTING'
 MATHSCOR 'SCORE IN MATHEMATICS' COMPSCOR 'SCORE IN COMPUTER SCIENCE'
 COMPOPI 'TOTAL FOR COMP SURVEY' MATHATTI 'TOTAL FOR MATH ATTI SCALE'
SPSS Example, Part 2

/* Adding labels
VALUE LABELS SEX 'M' 'MALE' 'F' 'FEMALE'/
 EXP 1 'UP TO 1 YR' 2 '2 YEARS' 3 '3 OR MORE'/
 SCHOOL 1 'RURAL' 2 'CITY' 3 'SUBURBAN'/
 C1 TO C10 1 'STRONGLY DISAGREE' 2 'DISAGREE' 3 'UNDECIDED' 4 'AGREE' 5 'STRONGLY AGREE'/
 M1 TO M10 1 'STRONGLY DISAGREE' 2 'DISAGREE' 3 'UNDECIDED' 4 'AGREE' 5 'STRONGLY AGREE'/
 NSEX 1 'MALE' 2 'FEMALE'/
PRINT FORMATS COMPOPI MATHATTI (F2.0) /* Specifying the print format
* Listing variables.
LIST VARIABLES=SEX EXP SCHOOL MATHSCOR COMPSCOR COMPOPI MATHATTI/
 FORMAT=NUMBERED /CASES=10 /* Only the first 10 cases
FREQUENCIES VARIABLES=SEX,EXP,SCHOOL/ /* Creating frequency tables
 STATISTICS=ALL
USE ALL.
ANOVA COMPSCOR BY EXP(1,3).
FINISH
comment Please refer to the following URL for further information:
 http://www.indiana.edu/~statmath/stat/spss/unix/index.html.
Keys to using Statistical Software as a data management tool
• Be sure to make your programs and files self-defining. Use variable labels and data labels exhaustively.
• Write out ASCII versions of your program files and data sets.
• Stat packages generally are able to produce platform-independent ‘transport’ files. Good for transport, but be wary of them as a long-term archival format.
Keys to using Statistical Software as a data management tool, 2
• Statistical software is excellent when your data can be described well without having to use relational database techniques. If you can describe the data items as a very long vector of numbers, you’re set!
• Statistical software is especially useful when many transformations or calculations are required
• But beware transforms, calculations, and creation of new variables interactively!
Perl and C
• Practical Extraction and Report Language
• Pathologically Eclectic Rubbish Lister
• It’s a bit of both
• Perl is a good way to manipulate small amounts of data in a prototype setting, but performance in a production setting will probably seem inadequate
• Use Perl to prototype, but rewrite the final application in C or C++
Relational Databases
Database Definitions*
• Database management system: a collection of programs that enables you to store, modify, and extract information from a database.
• Types of DBMSs: relational, network, flat, and hierarchical.
• If you need a DBMS, you need a relational DBMS
• Query: a request to extract data from a database, e.g.:
– SELECT ALL WHERE NAME = "SMITH" AND AGE > 35
• SQL (Structured Query Language) – the standard query language
*modified from webopedia.com
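The SELECT shown above is written in a generic pseudo-SQL; an equivalent query in standard SQL can be tried with SQLite (driven from Python here), with the table and its contents invented for illustration:

```python
import sqlite3

# Run a standard-SQL version of the slide's example query
# ("NAME = 'SMITH' AND AGE > 35"). Table and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("SMITH", 42), ("SMITH", 30), ("JONES", 50)])

rows = conn.execute(
    "SELECT name, age FROM people WHERE name = 'SMITH' AND age > 35"
).fetchall()
print(rows)   # only the 42-year-old SMITH matches
```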
Relational Databases*
• Relational database theory was developed at IBM by E.F. Codd (1969)
• Codd’s Twelve Rules – the key to relational databases, but also good guides to data management generally.
• Codd’s work is available in several venues, most extensively as a book. The number of rules has now expanded to over 300, but we will start with rules 1-12 and the 0th rule.
• 0th rule: A relational database management system (DBMS) must manage its stored data using only its relational capabilities.
*Based on Tore Bostrup. www.fifteenseconds.com
Codd’s 12 rules
1. Information Rule. All information in the database should be represented in one and only one way – as values in a table.
2. Guaranteed Access Rule. Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value, and column name.
3. Systematic Treatment of Null Values. Null values (distinct from the empty character string or a string of blank characters, and distinct from zero or any other number) are supported in the fully relational DBMS for representing missing information in a systematic way, independent of data type.
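Rule 3's distinction, that NULL differs from zero and from the empty string and never satisfies an ordinary comparison, is easy to verify in a real engine; a small SQLite sketch:

```python
import sqlite3

# Rule 3: NULL is distinct from 0 and from '' -- and a comparison with
# NULL is neither true nor false, so "= NULL" matches nothing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(0,), (None,)])

# An equality test never matches a NULL; use IS NULL instead.
eq_null = conn.execute("SELECT COUNT(*) FROM t WHERE v = NULL").fetchone()[0]
is_null = conn.execute("SELECT COUNT(*) FROM t WHERE v IS NULL").fetchone()[0]
print(eq_null, is_null)
```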
Codd’s 12 rules, con’t
4. Dynamic Online Catalog Based on the Relational Model. The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data.
Codd’s 12 rules, con’t
5. Comprehensive Data Sublanguage Rule. A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings, and that is comprehensive in supporting all of the following:
a. data definition
b. view definition
c. data manipulation (interactive and by program)
d. integrity constraints
e. authorization
f. transaction boundaries (begin, commit, and rollback).
Codd’s 12 rules, con’t
6. View Updating Rule. All views that are theoretically updateable are also updateable by the system.
7. High-Level Insert, Update, and Delete. The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data, but also to the insertion, update, and deletion of data.
8. Physical Data Independence. Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods.
Codd’s 12 rules, con’t
9. Logical Data Independence. Application programs and terminal activities remain logically unimpaired when information preserving changes of any kind that theoretically permit unimpairment are made to the base tables.
10. Integrity Independence. Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.
Codd’s 12 rules, con’t
11. Distribution Independence. The data manipulation sublanguage of a relational DBMS must enable application programs and terminal activities to remain logically unimpaired whether and whenever data are physically centralized or distributed.
12. Nonsubversion Rule. If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational language.
The problem with (some) DBMS computer science
• Database theory is wonderful stuff
• It is sometimes possible to get so caught up in the theory of how you would do something that the practical matters of actually doing it go by the wayside
• This is particularly true of the concept of “normal forms” – we will cover only the first three
Some terminology
Formal Name   Common Name   Also known as
Relation      Table         Entity
Tuple         Row           Record
Attribute     Column        Field
A key is a field that *could* serve as a unique identifier of records. The Primary key is the one field chosen to be the unique identifier of records.
First Normal Form
• Reduce entities to first normal form (1NF) by removing repeating or multivalued attributes to another, child entity.
Before (repeating attributes):
Specimen #   Measurement #1   Measurement #2   Measurement #3
14           35               43               38

After (child entity):
Specimen #   Measurement #   Value
14           1               35
14           2               43
14           3               38

Specimens
14
Second Normal Form
• Reduce first normal form entities to second normal form (2NF) by removing attributes that are not dependent on the whole primary key.
Before (Species depends on Specimen # alone, not the whole key):
Specimen #   Measurement #   Species         Value
14           1               M. musculus     35
14           2               M. musculus     43
16           3               R. norvegicus   38

After:
Specimen #   Measurement #   Value
14           1               35
14           2               43
16           3               38

Specimen #   Species
14           M. musculus
16           R. norvegicus
Third Normal form
• Reduce second normal form entities to third normal form (3NF) by removing attributes that depend on other, nonkey attributes (other than alternative keys).
• It may at times be beneficial to stop at 2NF for performance reasons!

Before (the per-gram value is derived from other nonkey attributes):
Specimen #   Measurement #   O2 consumption   Mass   O2 consumption per gram
14           1               35               14     2.50
14           2               43               15     2.87
16           3               85               28     3.04

After:
Specimen #   Measurement #   O2 consumption   Mass
14           1               35               14
14           2               43               15
16           3               85               28
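The 3NF layout above maps directly onto a relational schema. As a sketch (SQLite, with column names following the slide's table), store only O2 consumption and mass and derive the per-gram figure in a query:

```python
import sqlite3

# Third normal form, per the slide: store o2_consumption and mass, and
# compute consumption-per-gram in a query rather than storing it.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE measurements (
    specimen_id INTEGER, measurement_num INTEGER,
    o2_consumption REAL, mass REAL,
    PRIMARY KEY (specimen_id, measurement_num))""")
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?)",
                 [(14, 1, 35, 14), (14, 2, 43, 15), (16, 3, 85, 28)])

# The derived attribute is computed on demand, so it can never fall
# out of sync with the stored values.
rows = conn.execute("""SELECT specimen_id, measurement_num,
                              ROUND(o2_consumption / mass, 2)
                       FROM measurements""").fetchall()
print(rows)
```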
On to database products
• Microsoft Access – Common, relatively inexpensive, moderately scalable
• Oracle – Common, relatively more expensive, extremely robust and scalable
• DB2 – Relatively common, IBM’s commercial database application
• MySQL – Becoming more common, free, good for prototyping and small-scale applications
MySQL
• Open source database software
• Available for several operating systems
• Downloadable from www.mysql.com
• Excellent for prototyping database applications, and in many cases plenty for production
Components of MySQL (exemplary of database products generally)
• mysql – executes SQL commands
• mysqlaccess – manages users
• mysqladmin – database administration
• mysqld – the MySQL server process
• mysqldump – dumps the definition and contents of a database into a file
• mysqlhotcopy – hot backup of a database
• mysqlimport – imports data from other formats
• mysqlshow – shows information about the server and its objects
• mysqld_safe – starts and manages mysqld on Unix
Database applications and the web?
• An Open Source option:
– MySQL – database
– PHP – web scripting language
– Apache – web server
• Oracle and its web modules
• Stat packages and web modules
Specialized Data formats
• XML
• HDF
XML
• The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web.
• http://www.w3.org/XML/
A few of “XML in 10 points”*
1. XML is for structuring data. XML makes it easy for a computer to generate data, read data, and ensure that the data structure is unambiguous.
2. XML looks a bit like HTML. Like HTML, XML makes use of tags (words bracketed by '<' and '>') and attributes (of the form name="value").
3. XML is text, but isn't meant to be read.
4. XML is verbose by design. (And it's *really* verbose.)
5. XML is a family of technologies. (This leads to the opportunity to create discipline-specific XML templates.)
*http://www.w3.org/XML/1999/XML-in-10-points
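Points 1 and 2 above can be seen in a few lines with Python's standard library: tags plus `name="value"` attributes make the structure unambiguous and easy for a program to both generate and read back. The `<experiment>` vocabulary is invented for illustration and is not any standard:

```python
import xml.etree.ElementTree as ET

# Generate a small structured document.
root = ET.Element("experiment", date="2002-05-02")
s = ET.SubElement(root, "specimen", id="14")
ET.SubElement(s, "o2_consumption", units="ml/hr").text = "35"
xml_text = ET.tostring(root, encoding="unicode")

# Reading it back is just as mechanical: navigate by tag path.
parsed = ET.fromstring(xml_text)
value = parsed.find("specimen/o2_consumption").text
```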
XML
• XML really is one of the most important data presentation technologies to be developed in recent years
• XML is a meta-markup language
• The development and use of DTDs (document type definitions) is time-consuming, critical, and subject to the usual laws regarding standards
• XML is a way to present data, but not a good way to organize lots of data
Some XML examples
• Chemical Markup Language http://www.xml-cml.org/
• Extensible Data Format http://xml.gsfc.nasa.gov/XDF/XDF_home.html
• BioXML – no longer active
XML issues
• Great technology
• Good commercial authoring systems available or in development
• The problem with standards…
• Perhaps the biggest challenge with XML is that it is so easy to put together a web site and propose a DTD as a standard
XML vs PDF
• PDF files are essentially universally readable. PDF file formats give you a picture of what was once data in a fashion that makes retrieval of the data hard at best.
• XML requires a bit more in terms of software, but preserves the data as data that others can interact with.
• Utility of XML and PDF interacts with proprietary concerns, institutional concerns, and community concerns – which are not always in harmony!
Specialized data storage formats - HDF
• Hierarchical Data Format (HDF)
• HDF is an open-source effort
• http://hdf.ncsa.uiuc.edu/
• HDF5 is a general-purpose library and file format for storing scientific data
HDF, con’t
• HDF5 can store two primary objects: datasets and groups. A dataset is essentially a multidimensional array of data elements, and a group is a structure for organizing objects in an HDF5 file.
• Using these two basic objects, one can create and store almost any kind of scientific data structure.
• Designed to address the data management needs of scientists and engineers working in high-performance, data-intensive computing environments.
• HDF5 emphasizes storage and I/O efficiency.
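As a stdlib-only sketch of the two objects just described: groups organize, datasets hold array data, and items are reached by HDF5-style paths. This is a toy model, not the real API; actual code would use a library such as h5py or PyTables, which offer exactly this access style against real HDF5 files:

```python
class Group:
    """Toy stand-in for an HDF5 group: a named container of members."""
    def __init__(self):
        self.members = {}

    def create_group(self, name):
        g = self.members[name] = Group()
        return g

    def create_dataset(self, name, data):
        # In real HDF5 a dataset is a typed multidimensional array;
        # a plain list suffices to show the organizational idea.
        self.members[name] = data
        return data

    def __getitem__(self, path):
        # HDF5-style path lookup, e.g. f["run1/temperature"]
        node = self
        for part in path.split("/"):
            node = node.members[part]
        return node

f = Group()                      # stands in for an open HDF5 file
run = f.create_group("run1")
run.create_dataset("temperature", [293.1, 293.4, 295.0])
```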
HDF, con’t
• HDF is nontrivial to implement
• If you need the full capabilities of HDF, there's nothing like it
• There is a bit of history of questions about performance, but HDF5 is designed to resolve these questions
Free Software Foundation
• Many of the software products and technologies mentioned in this talk (MySQL, Perl, XML tools, etc.) are open source software
• The GNU general public license is the standard license for such software
• Some of the best software for specific scientific communities is open source (community software)
• There are certain expectations about such software and how it is used
Data exchange among heterogeneous formats
• I have data files in SAS, SPSS, Excel, and Access formats. What do I do?
• Each of the more widely used stat packages contains significant utilities for exchanging data. Stata makes a package called Stat/Transfer
• DBMS/Copy (Conceptual Software) is probably the best software for exchange among heterogeneous formats
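When commercial exchange tools are unavailable, simple rectangular data can often be moved through text formats that every package reads. A sketch in Python (the columns are invented for illustration) converting CSV to JSON and back without loss:

```python
import csv
import io
import json

# A small rectangular dataset in CSV, the lowest common denominator.
csv_text = "specimen,mass\n14,14\n16,28\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Re-express the same records as JSON for a different consumer.
json_text = json.dumps(rows)

# Round-trip back to CSV to confirm nothing was lost in translation.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["specimen", "mass"])
writer.writeheader()
writer.writerows(json.loads(json_text))
```

Note that pure text interchange flattens types (everything becomes a string); the dedicated tools named above also carry variable types, labels, and missing-value codes across packages.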
Distributed Data
• Data warehouses
• Data federations
• Distributed File Systems
• External data sources
• Data Grids
Data warehouses
• In a large organization one might want to ask research questions of transactional data. And what will the MIS folks say about this?
• Transactions have to happen now; the analysis does not necessarily have to.
• Data warehousing is the coordinated, architected, and periodic copying of data from various sources, both inside and outside the enterprise, into an environment optimized for analytic and informational processing (definition from "Data Warehousing for Dummies" by Alan R. Simon)
Getting something out of the data warehouse
• Querying and reporting: tell me what's what
• OLAP (On-Line Analytical Processing): do some analysis and tell me what's up, and maybe test some hypotheses
• Data mining: atheoretic. Give me some obscure information about the underlying structure of the data
• EIS (Executive Information Systems): boil it down real simple for me
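The difference between a row-level query and an OLAP-style rollup, in miniature; the sales records and dimension names are invented for illustration:

```python
from collections import defaultdict

# Fact records as (region, year, amount) tuples.
sales = [("east", 2001, 10), ("east", 2002, 12), ("west", 2002, 7)]

# A query pulls rows out; a rollup aggregates them along a dimension.
rollup = defaultdict(int)
for region, year, amount in sales:
    rollup[region] += amount          # aggregate along the region axis
```

Real OLAP engines precompute many such rollups (by region, by year, by both) so analysts can slice the cube interactively.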
More Buzzwords
• Data Mart: Like a data warehouse, but perhaps more focused. [Term often used by the team after the Data Warehouse fiasco]
• Operational Data Store: Like a data warehouse, but the data are always current (or almost). [Day traders]
Distributed File Systems
• DCE/DFS – DFS seems to have a questionable future
• AFS – Andrew File System – Widely used among physicists
AFS
• AFS is a distributed filesystem product, pioneered at Carnegie Mellon University and supported and developed as a product by Transarc Corporation (now IBM Pittsburgh Labs). It offers a client-server architecture for file sharing, providing location independence, scalability and transparent migration capabilities for data.
*http://www.openafs.org/main.html
AFS Structure
• AFS operates on the basis of "cells"
• Each cell depends upon a cell server that creates the root-level directory for that cell
• Other network-attached devices can attach themselves into the AFS cell directory structure
• Moving data from one place to another then becomes just like a file operation, except that it is mediated by the network
• Requires installation of client software (available for most Unix flavors and Windows)
Computing Grids
• What's a grid? Hottest current buzzword
• A way to link together heterogeneous, geographically dispersed computing resources to create a meta-computing facility
• The term 'computing grid' was coined in analogy to the electrical power grid
• Three types of grids:
  – Compute
  – Collaborative
  – Data
Compute Grids
• Compute grids tie together disparate computing facilities to create a metacomputer.
• Supercomputers: Globus is an experimental system that historically focuses on tying together supercomputers
• PCs:– Entropia is a commercial product that aims to tie
together multiple PCs– SETI@Home
Collaboration Grids
• http://www-fp.mcs.anl.gov/fl/accessgrid/
Data Grids
• Globus – beginning to integrate data grid functionality
• Avaki – commercial data grid product
• Data Grids “virtualize” data locality
Layered Grid Architecture (by analogy to the Internet Protocol architecture)*
• Application
• Collective – “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services
• Resource – “Sharing single resources”: negotiating access, controlling use
• Connectivity – “Talking to things”: communication (Internet protocols) & security
• Fabric – “Controlling things locally”: access to, & control of, resources
• Internet Protocol analogues: Application, Transport, Internet, Link

*From “Introduction to Grid Computing,” 5 June 2002: http://www.globus.org/about/events/US_tutorial/slides/index.html
Example: Data Grid Architecture*
• Application – discipline-specific data grid application
• Collective (App) – coherency control, replica selection, task management, virtual data catalog, virtual data code catalog, …
• Collective (Generic) – replica catalog, replica management, co-allocation, certificate authorities, metadata catalogs, …
• Resource – access to data, access to computers, access to network performance data, …
• Connectivity – communication, service discovery (DNS), authentication, authorization, delegation
• Fabric – storage systems, clusters, networks, network caches, …

*From “Introduction to Grid Computing,” 5 June 2002: http://www.globus.org/about/events/US_tutorial/slides/index.html
Example Data Grids
• GriPhyN (Grid Physics Network) – The key problem: too much data (PB per year)
• Biomedical data:
  – Stanford Genome Gateway Browser mirrors
  – Human Genome Database mirrors
  – Other examples…
Federated databases
• A federation of databases is a group of databases that are tied together in some reasonable way permitting data retrieval (generally) and sometimes (maybe in the future) data writing
• Benefits of the federated approach:
  – Local access control: lets the data owner control access
  – Acknowledges multiple sources of data
  – By focusing on the edges of contact, should be more flexible over the long run
• Shortcoming: right now, significant hand work in constructing such systems
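The flavor of a federated query can be sketched with SQLite's ATTACH statement, which lets one connection query across two independently owned database files. This is a single-machine stand-in for real federation middleware such as DiscoveryLink; the lab_a/lab_b files and the `specimens` schema are invented for illustration:

```python
import sqlite3

# Create two "independently owned" databases with the same schema.
for name, rows in [("lab_a.db", [(14, 14.0)]), ("lab_b.db", [(16, 28.0)])]:
    db = sqlite3.connect(name)
    db.execute("CREATE TABLE IF NOT EXISTS specimens "
               "(specimen INTEGER, mass REAL)")
    db.execute("DELETE FROM specimens")   # keep the example repeatable
    db.executemany("INSERT INTO specimens VALUES (?, ?)", rows)
    db.commit()
    db.close()

# Federate at the "edge of contact": one query spanning both sources.
conn = sqlite3.connect("lab_a.db")
conn.execute("ATTACH DATABASE 'lab_b.db' AS lab_b")
combined = conn.execute(
    "SELECT specimen, mass FROM specimens "
    "UNION ALL SELECT specimen, mass FROM lab_b.specimens "
    "ORDER BY specimen").fetchall()
```

Real federation systems add the hard parts this sketch skips: schema mapping between heterogeneous sources, per-source access control, and query planning across the network.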
DiscoveryLink
Web-accessible databases
• Especially prominent in the biomedical sciences, e.g. NCBI:
• Entrez http://www.ncbi.nlm.nih.gov/entrez/
• PubMed http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed – provides access to over 11 million MEDLINE citations
• Nucleotide http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide – a collection of sequences from several sources, including GenBank, RefSeq, and PDB
• Protein http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
• Genome http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome – the whole genomes of over 800 organisms
Real-time data reduction as a critical strategy
• Data: bits and bytes• Information: that which reduces uncertainty (Claude
Shannon). Literally that which forms within, but more adequately: the equivalent of or the capacity of something to perform organizational work, the difference between two forms of organization or between two states of uncertainty before and after a message has been received, but also the degree to which one variable of a system depends on or is constrained by (see constraint) another. *
• In other words, if there is no realistic circumstance in which you would take an action based on or influenced by a certain number, then that number is data, not information
• We collect a lot more data than we do information

*http://pespmc1.vub.ac.be/ASC/INFORMATION.html
Real-time data reduction
• Given that we collect much more data than information, what do we do?
• If we can identify something as reliably just data, and definitely not possibly information, why keep it?
• In some cases of instruments that produce data continually, a PC dedicated to on-the-fly data reduction can drastically reduce data storage requirements
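A sketch of what such a dedicated PC might do: fold each raw sample into a fixed-size summary (count, mean, min, max) so storage no longer grows with the sampling rate. The sample values are invented, and a real deployment would of course tailor the retained statistics to the decisions they must support:

```python
class RunningSummary:
    """Constant-space reduction of an unbounded sample stream."""
    def __init__(self):
        self.n, self.total = 0, 0.0
        self.lo, self.hi = float("inf"), float("-inf")

    def add(self, x):
        self.n += 1
        self.total += x
        self.lo, self.hi = min(self.lo, x), max(self.hi, x)

    @property
    def mean(self):
        return self.total / self.n

summary = RunningSummary()
for sample in [293.1, 293.4, 295.0, 293.9]:   # stand-in instrument feed
    summary.add(sample)
# However long the run, only four numbers are kept per channel.
```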
Knowledge management, searches, and controlled vocabularies
• A tremendous amount of effort has gone into natural language processing, AI, knowledge discovery, etc., with results ranging from mixed to disappointing.
• If you want to be able to search large volumes of data on an ad-hoc basis, then controlled vocabularies are essential. Results here are mixed as well, but at least the problems are sociological, not technological.
• Good example: Gene Ontology Consortium, http://www.geneontology.org/
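The core mechanism of a controlled vocabulary is small: map free-text synonyms onto one preferred term before indexing, so ad-hoc searches hit every record. A sketch in the spirit of (but not taken from) the Gene Ontology; all terms are illustrative:

```python
# Synonym-to-preferred-term map: the controlled vocabulary itself.
VOCABULARY = {
    "o2 uptake": "oxygen consumption",
    "oxygen use": "oxygen consumption",
    "vo2": "oxygen consumption",
}

def normalize(term):
    """Map a free-text term onto its controlled-vocabulary form."""
    term = term.strip().lower()
    return VOCABULARY.get(term, term)

# Three differently worded records now index under one search key.
records = ["O2 uptake", "VO2", "oxygen consumption"]
keys = {normalize(r) for r in records}
```

The hard (sociological) part is getting a community to agree on and maintain the map, which is exactly what consortia like the Gene Ontology do.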
Data Visualization
Visualization
• The days when you could take a stack of greenbar down to your favorite bar, page through the output, and understand your data are gone.
• Data visualization is becoming the only means by which we can have any hope of understanding the data we are producing
• A single gene expression chip can produce more pixels of data than the human eye and mind together are capable of processing
Gene expression chips*
*http://www.microarrays.org/
http://www.research.ibm.com/dx/imageGallery/
11613 June 2002
http://www.research.ibm.com/dx/imageGallery/image212.html
Visualization Options
• 2D – commercial software and open source
• 2D open source: IBM's Data Explorer http://www.research.ibm.com/dx/
• 3D – CAVE or Immersadesk
CAVE™
• Cave Automatic Virtual Environment
• Anything *but* automatic
• Best immersive 3D technology available
Image created by Eric Wernert of Indiana University
Immersadesk™
• Furniture-scale 3-D environment
• Easier to program than CAVE
• Immersive 3D feel not as good as CAVE, but one can install an Immersadesk™ or similar equipment within a lab!

Image created by Eric Wernert of Indiana University
Hierarchical Storage Management Systems
• Differential cost of media:
  – RAM: $60–$100/MB
  – RAID: $4–$10/MB
  – CD: ~$1 (readers included)
  – Tape: $0.05–$1
• Differential read rates and access times:
  – Disk: 1 GB/sec; 9–20 ms access time
  – Tape: 200 MB/sec; <1 min (autoloader)
HSM
• The objective of an HSM is to optimize the distribution of data between disk and tape so as to store extremely large amounts of data at reasonably economical costs while keeping track of everything
HSM basic concepts
• Most data is read rarely. Tape is cheap. Keep rarely read data on tape.
• Keep data that is often used on disk.
• Stage data to disk on command for faster access when you know you're going to need it later.
• Stage data to disk on output.
• Manage data on tape so as to handle security and reliability.
• A metadata system keeps track of what everything is and where it is!
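The placement rule above can be sketched as a policy function: recently read files stay on disk, files idle past a threshold migrate to tape, and the returned catalog plays the role of the metadata system. The 30-day threshold and the file names are invented for illustration:

```python
import time

DAY = 86400  # seconds

def place(files, now, threshold_days=30):
    """Map {name: last_read_timestamp} to {name: 'disk' or 'tape'}."""
    return {name: "disk" if now - last_read < threshold_days * DAY
            else "tape"
            for name, last_read in files.items()}

now = time.time()
catalog = place({"todays_run.dat": now - DAY,          # read yesterday
                 "2001_archive.dat": now - 400 * DAY},  # idle for a year+
                now)
```

A production HSM layers much more on this (staging queues, tape packing, redundancy), but last-access-driven tiering is the heart of it.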
HSM products
• EMASS Inc. - AMASS (Archival Management and Storage System). http://www.emass.com
• Veritas – www.veritas.com
• LSF – Sun Microsystems, Inc.
• HPSS (High Performance Storage System) – a consortium-led product designed originally for weapons labs and now marketed by IBM
HPSS – High Performance Storage System
• Controlled by a consortium, but produced and released as a service from IBM (as opposed to a product)
• Designed to meet the needs of some of the most demanding and security-conscious customers in the world
• Customers include:
  – Lawrence Berkeley National Laboratory
  – Los Alamos National Laboratory
  – Sandia National Laboratories
  – San Diego Supercomputer Center
  – Indiana University
Requirements for HPSS
• Absolute reliability of data in all forms (reliably read whenever authorized person wants, and reliably not available to anyone unauthorized)
• High capacity• Speed• Fault detection/Correction
HPSS Components
• Name Server (NS) – translates standard file names and paths into HPSS object identifiers
• Bitfile Server (BFS) – provides logical bitfiles to clients
• Storage Server (SS) – manages the relationship between logical files and physical files
• Physical Volume Library (PVL) – maps logical volumes to physical cartridges; issues commands to the PVR
• Physical Volume Repository (PVR) – mounts and dismounts cartridges
• Mover (MVR) – transfers data from a source to a sink
Backup
• Backup systems and HSMs are fundamentally different!
• Backup systems are designed for operational continuity of computing systems, not for archival storage, and vice versa
• Efforts to mix the two technologies tend not to work well (e.g. restoring onto bare metal from an HSM)
Some Backup Systems
• Omnibak (HP)• Legato (www.legato.com)• Brightstore Arcserve (Computer associates -
www.ca.com)• Tivoli (IBM)
Backup schedules
• Good backup schedules are essential!
• Example backup schedule:
  – Full backup every 6 months
  – Incremental since the full, every month
  – Incremental since the monthly, every week
  – Incremental since the weekly, every day
• Offsite copies of fulls are a good idea…
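The example schedule can be expressed as a function that picks the broadest backup level due on a given date. The epoch date, the Sunday-for-weeklies convention, and the level names are all invented for illustration:

```python
import datetime

def backup_level(day, epoch=datetime.date(2002, 1, 1)):
    """Broadest level due on `day`: full > monthly > weekly > daily."""
    months = (day.year - epoch.year) * 12 + day.month - epoch.month
    if day.day == 1 and months % 6 == 0:
        return "full"
    if day.day == 1:
        return "monthly incremental"
    if day.weekday() == 6:            # Sunday
        return "weekly incremental"
    return "daily incremental"

levels = [backup_level(datetime.date(2002, 7, 1)),   # 6 months in
          backup_level(datetime.date(2002, 8, 1)),   # first of a month
          backup_level(datetime.date(2002, 6, 2))]   # a Sunday
```

Encoding the schedule as code (rather than operator memory) also makes it easy to audit which restore chain a given date depends on.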
The future of storage
• “In-place” increases in density
• New technologies:
  – WORM optical storage & holographics
  – Millipedes
  – Non-corrosive metal
Holographic storage
• Based on 3-D rather than 2-D data storage
• Perpetually about to revolutionize storage “Real Soon Now”
• Significant problems with media stability
• WORM (Write Once Read Many) technologies may someday deliver
Image © IBM; may not be reused without permission
Millipede Storage
• Based on atomic force microscopy (AFM): tiny depressions melted by an AFM tip into a polymer medium represent stored data bits that can then be read by the same tip.
• Thermomechanical storage is capable of achieving data densities in the hundreds of Gb/in² range
• Current best – 20 to 100 Gb/in² • Expected limits for magnetic
recording (60–70 Gb/in²).
*http://www.zurich.ibm.com/st/storage/millipede.html
Image © IBM; may not be reused without permission
Millipede Storage, Part 2
• Read/Write rate of individual probe is limited
• The Read/Write head consists of ~1,000 individual probes that read in parallel
*http://www.zurich.ibm.com/st/storage/millipede.html
Image © IBM; may not be reused without permission
Storage of text on nonreactive metal disks
• All of the commonly used storage media depend upon arbitrary standards and are fragile
• If you have data that you really want to keep secure for a long time, why not write it as text on non-corrosive metal disks?
Future of computing
• The PC market will continue to be driven largely by home uses (especially games)
• In scientific data management, the utility of computing systems will be less determined by chip speeds and more by memory and disk configurations, and internal and external bandwidth
And the future is uncertain!
• If you can see what your storage requirements will be 25 years into the future, and they are large-scale and significant, then a tremendous investment based on what's available today may be reasonable.
• In any other case, it may be best to take shorter views – 5 to perhaps 10 years, and build into your thinking the constant need to refresh
The ongoing challenge
• One of the key problems in data storage is that you can’t just store it. Data stored and left alone is unlikely under most circumstances to be readable – and less likely to be comprehensible and useable – in 20 years. The problem, of course, is that there is an ever increasing need for tremendous longevity in the utility of data. Because of this it is essential that data receive ongoing curation, and migration from older media and devices to newer media and devices. Only in this way can data remain useful year after year.
References
• Simon, A.R. 1997. Data Warehousing for Dummies. IDG Books, Foster City, CA.