Oceanographic data formats


Page 1: Oceanographic data formats

1. Data formats

• Large heterogeneity in data formats
• Data format = the physical or electronic shape in which data is stored
• A piece of paper with handwritten text is also a data format
• However, the focus here is on:
  – Electronic data formats
  – Commonly used data formats

Page 2: Oceanographic data formats

1. Data formats

• Why use which format?
  – Historical reasons:
    • Old data are mostly in text-based list formats
    • Software and technology accompany certain formats
    • Example: XML could only be used after it was invented
  – Other reasons:
    • Depends on the data generator:
      – Machine-generated data (mostly ASCII format)
    • Worldwide agreed formats for certain types of data
      – Facilitate the exchange of data packages

Page 3: Oceanographic data formats

1. Data formats

• Exchange of data formats
  – Most formats can be converted into each other
  – Mostly top down:
    • Relational structure → spreadsheet → text-based

Page 4: Oceanographic data formats

Data formats: different classifications

• Physical types:
  – ASCII
  – Binary
• Format types: 15 commonly used data types

Page 5: Oceanographic data formats

Data format – ASCII format (1)

• ASCII: American Standard Code for Information Interchange
• ASCII data are encoded so that a human reader can see and understand the values, because they are displayed as ordinary integers and real numbers. This means that the file stores the character codes of the printed digits, not the numeric values themselves. The benefit of ASCII data is that the user can see, understand and edit the file contents directly; the downside is that the data files are much larger.

Page 6: Oceanographic data formats

Data format – ASCII format (2)

• Combination of letters and numbers
• Readable by any computer
• No complex software required

Page 7: Oceanographic data formats

Data format – Binary data

• Binary data are numeric data whose values are expressed in bits and bytes, instead of the human-readable ASCII code.
• Number values can be stored in much smaller files and can be read more rapidly (by machines).
• The preferred method for large data files, especially gridded data.
• To use binary data, interpreting steps are required, and these are not always easy.
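The size difference between ASCII and binary storage can be illustrated with a minimal sketch. The sample values, the three-decimal text formatting and the choice of 4-byte single-precision floats are assumptions made for this example only; real files will differ.

```python
import struct

# Hypothetical sample: 1000 temperature values with three decimals.
values = [10.0 + 0.001 * i for i in range(1000)]

# ASCII: one formatted number per line, readable in any text editor.
ascii_bytes = "".join(f"{v:.3f}\n" for v in values).encode("ascii")

# Binary: each value packed as a 4-byte little-endian single-precision float.
binary_bytes = struct.pack(f"<{len(values)}f", *values)

print(len(ascii_bytes), "bytes as ASCII,", len(binary_bytes), "bytes as binary")
```

The gap widens as more digits are written per value in the text form, while the binary size stays fixed by the chosen data type.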

Page 8: Oceanographic data formats

Data format – Binary data

• Contents and structure of binary files may vary:
  – Type of data stored:
    • Bit (0 to 1) – 1 bit
    • Byte (0 to 255) – 8 bits
    • Short integer (-32,768 to 32,767) – 16 bits
• An interpreter (translator) is required
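Why an interpreter is needed can be sketched as follows: the same bytes yield different numbers depending on the data type and byte order assumed by the reader. The four byte values are arbitrary and chosen only for illustration.

```python
import struct

# Four raw bytes; on their own they carry no indication of their meaning.
raw = bytes([0x10, 0x27, 0xF0, 0xD8])

# Interpreted as four unsigned bytes (each 0 to 255).
as_bytes = struct.unpack("<4B", raw)

# Interpreted as two signed 16-bit short integers, little-endian byte order.
as_shorts_le = struct.unpack("<2h", raw)

# The same bytes read as short integers in big-endian byte order.
as_shorts_be = struct.unpack(">2h", raw)

print(as_bytes, as_shorts_le, as_shorts_be)
```

Without documentation of the data type and byte order, a reader cannot tell which of these interpretations is the intended one.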

Page 9: Oceanographic data formats

Data formats – 15 commonly used types

• Text files (ASCII/binary)
• Spreadsheets
• Relational structures
• Others
  – Images
  – Maps

Page 10: Oceanographic data formats

1 & 2: Auxiliary formats

• Auxiliary formats: information about data files; these are not really "data" files, but are included here for completeness
  – 1 Header formats: information about the format, location or geo-referencing; usually very short
  – 2 Metadata formats: see also metadata

Page 11: Oceanographic data formats

3. Document

• Digital data in proprietary formats (or sometimes just simple ASCII) designed for visual inspection, but not for data processing
• Formats: ASCII, MS Word DOC, WordPerfect, HTML, PDF (Adobe Acrobat), PS/EPS (PostScript/Encapsulated PostScript), desktop publishing programs (all proprietary) ...

Page 12: Oceanographic data formats

3. Document

• Advantages: very polished appearance; powerful editors available; compatibility with other major document editing software.
• Disadvantages (hard to use in data mining):
  – ASCII text must be extracted for the sections of interest.
  – Embedded images must be converted to more easily used GIF, JPG or BMP formats.
  – PDF and PS/EPS are very tricky to convert to other formats.

Page 13: Oceanographic data formats

4. Gridded data

• File formats:
  – ASCII: for example SURFER (*.GRD) with "DSAA" header lines
  – Binary: plain binary grids of byte, short integer, long integer, single-precision or double-precision values, with or without ASCII header files (see earlier)

Page 15: Oceanographic data formats

4. Gridded data

• Creation of the grid:
  – The gridded data file is created from scattered data points in the real world, by a process called "gridding."
  – Mathematical methods are used to create the grid
  – Algorithms are available to examine the data points
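The slides do not name a particular gridding method. As one possible illustration, the sketch below uses inverse distance weighting (IDW), a simple and common choice; the function name, the power parameter and the sample points are all assumptions made for this example.

```python
import math

def idw_grid(points, x_nodes, y_nodes, power=2.0):
    """Interpolate scattered (x, y, z) points onto a regular grid with IDW."""
    grid = []
    for y in y_nodes:              # one grid row per Y node
        row = []
        for x in x_nodes:          # one grid cell per X node
            num = 0.0
            den = 0.0
            exact = None
            for px, py, pz in points:
                d = math.hypot(x - px, y - py)
                if d == 0.0:       # node coincides with a data point
                    exact = pz
                    break
                w = 1.0 / d ** power
                num += w * pz
                den += w
            row.append(exact if exact is not None else num / den)
        grid.append(row)
    return grid

# Hypothetical scattered observations (x, y, value).
points = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (0.0, 2.0, 30.0), (2.0, 2.0, 40.0)]
grid = idw_grid(points, x_nodes=[0.0, 1.0, 2.0], y_nodes=[0.0, 1.0, 2.0])
```

Other methods (kriging, splines, nearest neighbour) produce different grids from the same points, so the method used is worth recording in the metadata.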

Page 16: Oceanographic data formats

4. Gridded data

• Gridded data files commonly contain more than a single grid
  – Data are mostly available for several parameters
  – Files use sequences of XYZ dimensions and parameter dimensions
  – There is no "correct" way to construct files of multiple data grids
• It is extremely important to document the sequence in which the dimensions (XYZ location, time, parameters) are "read."
• Vector grids: To represent vectors (literally arrows showing the direction of flow) in ocean and meteorological datasets, two methods have been devised: provide the U and V components of the vector, or provide the direction and magnitude of the arrow. Both of these methods have been adapted to grids, for instance for vector results from gridded models. The grids can be contained in separate files, or sequentially listed in the same file.
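The two vector representations can be converted into each other. The sketch below assumes that U is the eastward component, V the northward component, and that direction is the heading the flow moves toward, in degrees clockwise from north; datasets may use other conventions (meteorological wind direction, for instance, usually states where the wind comes from), so the convention must be checked in the documentation.

```python
import math

def uv_to_speed_direction(u, v):
    """Convert U/V components to (magnitude, direction in degrees from north)."""
    speed = math.hypot(u, v)
    # atan2(u, v) measures the angle clockwise from north (the +V axis).
    direction = math.degrees(math.atan2(u, v)) % 360.0
    return speed, direction

def speed_direction_to_uv(speed, direction):
    """Convert magnitude and direction (degrees from north) back to U/V."""
    rad = math.radians(direction)
    return speed * math.sin(rad), speed * math.cos(rad)

# A purely eastward flow of 1 unit has direction 90 degrees.
speed, direction = uv_to_speed_direction(1.0, 0.0)
u, v = speed_direction_to_uv(speed, direction)
```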

Page 17: Oceanographic data formats

4. Gridded data

• Advantages:
  – Saves storage space compared with XYZ storage, which requires 3 values per grid point.
  – Binary takes much less space than ASCII.
  – Reading the data is usually very straightforward: create a DO LOOP routine (or nest of routines) that follows the order in which the data were stored.
• Disadvantages:
  – Binary data are not liked by those who want "to see" their data at all times.
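The nested-loop reading of a plain binary grid can be sketched as below. The grid size, the 16-bit little-endian integer type and the row-by-row storage order (X varying fastest) are assumptions for this example; in practice they come from the header file or the documentation. An in-memory byte string stands in for the file.

```python
import struct

NX, NY = 3, 2  # hypothetical grid: 3 columns (X) by 2 rows (Y)

# Write a plain binary grid of 16-bit integers, row by row (X varies fastest).
source = [[1, 2, 3], [4, 5, 6]]
payload = b"".join(struct.pack("<h", z) for row in source for z in row)

# Read it back with nested loops that follow the same documented order.
grid = []
offset = 0
for j in range(NY):            # outer loop: rows (Y)
    row = []
    for i in range(NX):        # inner loop: columns (X)
        (z,) = struct.unpack_from("<h", payload, offset)
        row.append(z)
        offset += 2            # each short integer occupies 2 bytes
    grid.append(row)
```

Swapping the loop order or the grid dimensions would still read without error but would place the values in the wrong cells, which is why the storage order has to be documented.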

Page 18: Oceanographic data formats

5. Hard copy

• Older, hard copy datasets
• A necessary evil
  – Much older (pre-60s) ocean data has never been digitized
• These datasets range from technical reports to handwritten log sheets and lab sheets.
  – Reports usually contain enough information to be successfully digitized
  – Manuscript holdings often require tedious collation and cross-referencing in order to assemble all the needed parts.
  – Datasets with missing critical parts (e.g. station data) exist, as well as analysis and synthesis reports containing statistics, graphs and tables, but no data.

Page 19: Oceanographic data formats

5. Hard copy

• Examples:
  – Lab sheets
  – Journal articles
  – Technical reports
  – 80-character punch cards (included here because many locations lack the facilities to read them)
  – Hand-annotated charts/graphs
  – Specimen identification cards
  – Diaries
  – Ship logs

Page 20: Oceanographic data formats

5. Hard copy

• Risk of data loss:
  – Rule in many data centres: no paper data should be mailed or shipped unless photocopied.
  – All ORIGINAL paper data should be gathered by the data manager immediately after the relevant cruise and grouped into named folios whose contents are indexed.
  – All paper data should be submitted to supervised digitization as soon as possible.
  – Example: heritage library
• Metadata of hard copy data should fully describe the folios:
  – Number of pages
  – Color of the front page
  – Other identifying characteristics
• Advantages: they still exist.
• Disadvantages:
  – Cannot be used in modern digital analysis.
  – Digital capture is very labor intensive.
  – Access is a tricky political issue in some institutions.
• Compatibilities: published papers in good condition can be scanned and converted to ASCII text with many commercial packages (OCR techniques).
  – The result must be checked afterwards ...

Page 21: Oceanographic data formats

5. Hard copy

• From hard copy to digital copy ...
  – The technique used depends on the aim and the type of data
  – Often just transformed into a ‘document’ format
  – Conversion to other formats is often manual
• In many cases, going back to the hard copy is the only way to work (due to lack of metadata, file versions, ...)

Page 22: Oceanographic data formats

6. Simple images

– Graphics file without earth mapping information
– Interpretation is done purely by a human
– Very variable
– Many file formats:
  • TIFF, GIF, JPG, BMP ...
  • RAW versus compressed
    – RAW: all image information is stored without compression
    – Compressed (JPG, GIF): information is compressed by approximation or by reducing the number of colors, giving smaller files but loss of information

Page 24: Oceanographic data formats

6. Simple images

• Some images have added artistic borders outside the geographic grid that obscure the pixel-to-coordinates relationship
• Advantages:
  – Quick visualization of data that may have originally been extremely complex.
  – Suited to subjective analyses that do not require positional accuracy.
• Disadvantages:
  – Quantification is difficult; synthesis is nearly impossible except with pictures derived in exactly the same fashion.
• Compatibilities: nearly all graphic picture formats are interchangeable with editor programs.

Page 25: Oceanographic data formats

7. Geo-referenced images

• Graphics file, with ancillary mapping information, showing one or more parameters of the earth's system in a rectilinear grid, usually derived by processing and decimation of very high-density information from aerial or space sensors.
  – The coordinates of a pixel correspond to an XY geo-coordinate.
  – The color of a pixel represents a parameter

Page 26: Oceanographic data formats

7. Geo-referenced images

• TIF files can be made into geo-referenced image files by the addition of internal geographic tags, which require exact knowledge of the image dimensions and its proper location on the earth's surface.
• JPG, TIF and BMP can be made into geo-referenced image formats by the addition of header "world files," which require exact knowledge of the image dimensions and its proper location on the earth's surface. A world file is a simple ASCII file with the following contents:
  – X-pixel size (delta X)
  – Rotation term for row (normally zero)
  – Rotation term for column (normally zero)
  – Y-pixel size (delta Y)
  – X-coordinate of center of upper left pixel
  – Y-coordinate of center of upper left pixel
• World files for TIF have the extension TFW
• World files for JPG have the extension JGW
• World files for BMP have the extension BPW
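How a world file maps a pixel to a map coordinate can be sketched as follows. The six values are hypothetical, and the assignment of the two rotation terms to the X and Y equations follows the commonly documented affine convention, which is an assumption here; with both rotation terms at zero, as is normal, their order does not affect the result. The Y-pixel size is usually negative because image rows count downward while map Y increases upward.

```python
def parse_world_file(text):
    """Return the six world-file coefficients in the order they appear."""
    a, d, b, e, c, f = (float(line) for line in text.split())
    return a, d, b, e, c, f

def pixel_to_map(col, row, coeffs):
    """Map a pixel (column, row) to the map coordinate of its center."""
    a, d, b, e, c, f = coeffs
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y

# Hypothetical world file: 0.5-unit pixels, no rotation,
# upper left pixel centered at X = 2.25, Y = 51.75.
world = """0.5
0.0
0.0
-0.5
2.25
51.75
"""

coeffs = parse_world_file(world)
print(pixel_to_map(0, 0, coeffs))
print(pixel_to_map(10, 4, coeffs))
```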

Page 28: Oceanographic data formats

8-9-10. Mapping data

• Mapping: data consisting of digital representations of individual objects (points, lines, polygons, etc.)
  – 8 XY: mapping line objects, in X (usually longitude) and Y (usually latitude) coordinates only
  – 9 List: mapping objects (points, lines, symbols, text, etc.) without topology or descriptive attributes
  – 10 Geographic Information System (GIS): mapping objects (points, lines, polygons, etc.) on the earth incorporated into robust data assemblages that contain additional detailed information about the properties and topologies of the objects. [NOTE: Most GIS systems can also accommodate gridded, geo-referenced image, relational and spreadsheet formats.]

Page 29: Oceanographic data formats

8. XY data

• Description:
  – Simplest kind of geographic information:
    • Lines specified by their ordered X and Y coordinates.
    • Country boundaries: separated by several different markers
• Example: ASCII export format from the GEBCO database/software (actually YX in column order)
• Advantages: simple to write, easy to read (when ASCII).
• Disadvantages: contain no topological relationships between objects, or attributes of the objects.
• Text is rendered as drawing instructions, and cannot be retrieved as recognizable data.

Page 32: Oceanographic data formats

9. Mapping data - List

• Ordered list of "map primitives" to be drawn:
  – such as points, lines, circles, labels, etc.
• These formats are extremely specific to certain software.
• They could almost be called "plotter formats" because they do little more than draw pictures of geographically referenced information.
• Small amounts of data can be included, however, coded into the appearance of such primitives as the circle (variable diameters), the vector arrow (variable lengths), and contour lines (colors).
• Advantages: usually easy to read/write.
• Disadvantages: exist in many variant subtypes; MS Word and WordPerfect differ markedly in the versions they accept.

Page 33: Oceanographic data formats

10. Geographic Information System (GIS)

• Charting and mapping: tools for natural resource management.
• Digital methods are becoming much more common in ocean data analysis.
• Geographic Information System (GIS) data formats contain complex, multi-theme collections of spatial information that can be used to create maps and charts, and to perform analyses.
• The data formats that support these systems are not just sufficient to draw maps, but also contain necessary ancillary data about the features included (in space and time).
• NOTE: GIS files can be vector-type or raster-type, and many GIS software systems can handle both. Conversion utilities exist that can convert these files in either direction, although the raster-to-vector conversion often requires intensive quality control by skilled operators.

Page 34: Oceanographic data formats

10. Geographic Information System (GIS)

• Software:
  – Esri/MapInfo/Surfer/...
• Recently: also many online GIS tools
  – OBIS
  – Open GIS standards: Open Geospatial Consortium
    • An international industry consortium of 334 companies, government agencies and universities participating in a consensus process to develop publicly available geoprocessing specifications.
    • Open Geospatial Consortium (OGC) protocols include Web Map Service (WMS) and Web Feature Service (WFS).

Page 35: Oceanographic data formats

10. Geographic Information System (GIS)

• Formats within this group: ESRI Shapefiles (SHP), VPF
• Advantages:
  – Rapid creation of new maps and charts using the same databases.
  – No laborious hand-drawing methods.
  – Synthesis of different kinds of information, on an as-needed basis, from a common pool of datasets.
  – Instant changes in projection, scale, coverage area, etc.
• Disadvantages:
  – GIS formats tend to be very complex, and populating them with the actual data of interest is laborious.
• Compatibilities: most of the major software systems now recognize each other's formats.
  – Most have ASCII export routines for simple versions of the internal data files (e.g. DXF).

Page 36: Oceanographic data formats

11. Message data

• Ocean and meteorological data compressed into official (usually WMO-sanctioned) formats for transmission over approved international channels, especially the WMO's Global Telecommunications System (GTS). These highly compacted formats usually require unpacking programs before they can be used for analysis purposes. [The self-describing formats BUFR and GRIB are also often used for data and analysis messages within the GTS.]
• Formats: DBCP-x, AAXX, BBXX, EEAA, EEBB, EECC, EEDD, IIAA, IIBB, IICC, IIDD, JJXX, JJYY, PPAA, PPBB, PPCC, PPDD, QQAA, QQBB, QQCC, QQDD, TTAA, TTBB, TTCC, TTDD, UUAA, UUBB, UUCC, UUDD, VVAA, VVCC, YYXX, ZZYY
• As an example, the JJYY format encodes real-time bathythermograph data; it replaces an older format, JJXX, used until 1995.

Page 37: Oceanographic data formats

11. Message data

• Advantages:
  – Cheap and quick to send over often crowded circuits; widely accepted among the non-technical marine community.
  – Even when of poor quality, they create a "placeholder" for the higher quality data which should follow
• Disadvantages:
  – Only very coarse resolution and/or low precision is possible due to the message format limitations.

Page 39: Oceanographic data formats

11. Message data

This element defines an observation report on temperature, salinity and currents at one particular location on the ocean surface, or in subsurface layers.

Page 40: Oceanographic data formats

12. Relational database

• A suite of spreadsheet-like tables with explicit links between them in special linkage arrangements (usually contained in additional tables).
• This collection of linked tables, known as a Relational Database (RD), divides up very large initial tables into much smaller tables and eliminates much duplication of information that would otherwise be required.
• Relational Databases require the use of special software (in which they are created, manipulated, and analyzed) called Relational Database Management Systems (RDMS).
• Formats: MS Access, Oracle, Sybase, dBase, SQL Server
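The idea of linked tables can be sketched with SQLite, which ships with Python and is used here only as a convenient stand-in for the systems listed above. The table names, column names and values are hypothetical, and the links are expressed as simple key columns rather than separate linkage tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cruise  (cruise_id  INTEGER PRIMARY KEY, ship TEXT);
CREATE TABLE station (station_id INTEGER PRIMARY KEY,
                      cruise_id  INTEGER REFERENCES cruise(cruise_id),
                      lat REAL, lon REAL);
CREATE TABLE sample  (sample_id  INTEGER PRIMARY KEY,
                      station_id INTEGER REFERENCES station(station_id),
                      depth_m REAL, temp_c REAL);
""")

# Cruise and station details are stored once, not repeated on every sample.
con.execute("INSERT INTO cruise VALUES (1, 'RV Example')")
con.executemany("INSERT INTO station VALUES (?, ?, ?, ?)",
                [(10, 1, 51.5, 2.5), (11, 1, 51.6, 2.7)])
con.executemany("INSERT INTO sample VALUES (?, ?, ?, ?)",
                [(100, 10, 0.0, 14.2), (101, 10, 10.0, 13.8), (102, 11, 0.0, 14.0)])

# A query follows the links to rebuild the full "wide" view on demand.
rows = con.execute("""
SELECT c.ship, s.station_id, m.depth_m, m.temp_c
FROM sample m
JOIN station s ON s.station_id = m.station_id
JOIN cruise  c ON c.cruise_id  = s.cruise_id
ORDER BY m.sample_id
""").fetchall()
```

The join reproduces the single large table that the smaller tables replaced, without the ship name or station position being stored more than once.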

Page 41: Oceanographic data formats

12. Relational database

• Advantages:
  – Enormously flexible systems, capable of most typical statistical and graphical analyses of data.
  – Some have immediate Web compatibility for publishing databases directly on the Internet; ability to exchange data (via I/O operations or direct linking).
• Disadvantages:
  – Ocean data are seldom published in commercial RDMS formats, due to the machine- and software-specific requirements they would carry with them.
  – Users cannot immediately "look at" their data, although this only requires simple queries that can be written in minutes.
• More about these formats later

Page 42: Oceanographic data formats

13. Spreadsheets

• Spreadsheet formats are simply row-and-column data tables.
• They can easily be imported into several proprietary spreadsheet software programs and many public domain programs.
• Each row is called a "record."
• The separate "fields" may be labeled by a single "label row" at the beginning of the spreadsheet
• Formats: EXCEL, WK*
• Advantages:
  – Extremely easy to create, read, quality-control and manipulate in commercial spreadsheet programs. Each record (data line) is unique and complete.
• Disadvantages:
  – Can be quite large, compared to binary files of the same data.
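The record, field and label-row terminology can be illustrated with a small comma-separated table, used here as a plain-text stand-in for a spreadsheet export. The column names and values are hypothetical.

```python
import csv
import io

# A label row followed by three records; every record is complete on its own.
text = "station,depth_m,temp_c\nA1,0,14.2\nA1,10,13.8\nA2,0,14.0\n"

# DictReader uses the label row to name the fields of each record.
records = list(csv.DictReader(io.StringIO(text)))

mean_temp = sum(float(r["temp_c"]) for r in records) / len(records)
print(len(records), "records, mean temperature", round(mean_temp, 2))
```

Note that the station name is repeated on every record, which is the duplication that relational and stratified formats try to reduce.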

Page 43: Oceanographic data formats

14. Self-describing data formats

• Data files that contain information about their own contents and structure.
• Collections of other format types:
  – Together with metadata about the main data components.
• The rules and syntax:
  – Provided by (international) oversight groups
• Examples:
  – HDF: widely used for satellite data archives
  – NetCDF: widely used for gridded data and satellite data
  – BUFR: meteorological format for observations
  – GRIB: meteorological format for gridded data
• Advantages:
  – Metadata and data are "married" within a single structure
  – Software programs can find and browse desired data by working with the data files themselves rather than external indexes.
  – Wide use has given rise to a long list of community software and "read" libraries.
• Disadvantages:
  – There is a steep learning curve for all these formats, due to their complexity and comprehensiveness.

Page 44: Oceanographic data formats

15. Stratified data formats

• A very common method to reduce the large size of spreadsheet format data is to take the slowly changing fields, which take up a lot of room in each record, and to place them in a totally separate "Cruise/Station" record that precedes all "Data" records to which it refers.
• Naturally, this new type of record will have a different format from the other records.
• This process can be taken further, so that "Cruise" records, "Station" records, and "Data" records all have different formats.
  – The order of the records is significant, because each "Data" record takes its full meaning from the closest preceding "Cruise" and "Station" records.
• Example: ICES Standard Profile
• Advantages:
  – Smaller in size than spreadsheet.
• Disadvantages:
  – Tricky to write software, due to multiple line formats.
  – Usually the lines are formatted, so it is difficult for the human eye to read the data values.
  – Use with spreadsheet software is very limited (editing, block sorting/cutting/pasting) due to the different line formats.
  – Import to relational databases with "off the shelf" routines is impossible.

Page 45: Oceanographic data formats

15. Stratified data formats

Cruiseid    A    B    C
Stationid   x    y    Z    W
Sampleid    l    p    K
Sampleid2   l2   p2   K2
Sampleid3   l3   p2   K3
Stationid   x2   y2   Z2   W2
...         ...  ...  ...  ...
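Reading such a layout can be sketched as follows: each data record is combined with the closest preceding cruise and station records to rebuild complete, spreadsheet-like rows. The one-letter record codes (C, S, D), the whitespace-separated fields and the sample values are assumptions for this example; real stratified formats such as the ICES Standard Profile define their own record codes and fixed columns.

```python
def flatten(lines):
    """Expand stratified records into complete rows (cruise + station + data)."""
    cruise = None
    station = None
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        kind, values = fields[0], fields[1:]
        if kind == "C":                   # new cruise: forget the old station
            cruise, station = values, None
        elif kind == "S":                 # new station within the current cruise
            station = values
        elif kind == "D":                 # data record: needs both parents
            if cruise is None or station is None:
                raise ValueError("data record before its cruise/station record")
            rows.append(cruise + station + values)
        else:
            raise ValueError(f"unknown record type: {kind!r}")
    return rows

# Hypothetical file contents: one cruise, two stations, three data records.
lines = [
    "C CR01 2001",
    "S ST01 51.50 2.50",
    "D 0 14.2",
    "D 10 13.8",
    "S ST02 51.60 2.70",
    "D 0 14.0",
]
rows = flatten(lines)
```

Because meaning depends on record order, sorting or cutting lines in such a file silently attaches data records to the wrong station.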

Page 46: Oceanographic data formats

16. Extra - XML

• Currently widely used
• Data exchange format
• Extensible Markup Language (XML)

Page 47: Oceanographic data formats

16. Extra - XML

• Text based (the markup is verbose, but files compress well)
• ASCII format
• Similar to stratified hierarchy
• Formats defined by international organisations (see also stratified)
• Metadata can be embedded in data
• Data exchange format, through the internet
• Both for data delivery & data request
• Used in GIS in recent versions of software
• Web technology (e.g. news items, search engines, ...)
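The similarity between XML and the stratified hierarchy can be sketched with a small document in which samples are nested inside stations, and stations inside a cruise. The element names, attribute names and values are hypothetical and do not follow any particular published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: metadata (ids, positions, depths) sits next to the data.
doc = """<cruise id="CR01">
  <station id="ST01" lat="51.50" lon="2.50">
    <sample depth_m="0">14.2</sample>
    <sample depth_m="10">13.8</sample>
  </station>
  <station id="ST02" lat="51.60" lon="2.70">
    <sample depth_m="0">14.0</sample>
  </station>
</cruise>"""

root = ET.fromstring(doc)

# Walk the hierarchy and flatten it into one complete row per sample.
rows = [
    (root.get("id"), st.get("id"), float(sm.get("depth_m")), float(sm.text))
    for st in root.findall("station")
    for sm in st.findall("sample")
]
```

Unlike a stratified text file, the nesting is explicit in the tags, so the parent of each sample does not depend on line order.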