
    CRF-RDTE-TR-20100202-08

    11/2/2009

Public Distribution | Michael Corsello

Corsello Research Foundation

INFORMATION CONSIDERATIONS: MANAGEMENT OF ENTERPRISE DATA


Abstract

Information management is a complex topic involving not just information technology, but business processes and staffing as well. Enterprise Information Management includes every aspect of the creation, storage, use, disposal and accountability for every piece of information an organization comes in contact with.


Table of Contents

Abstract
Introduction
Audiences
Business Domains
Data Categories
    Raw Data
        Accuracy and Precision
        Faults and Blunders
        Data Capture
        Human Capture
        Sensor Capture
    Derived Data
Data Uses
Data Structure Types
    Unstructured (Documents)
    Semi-Structured
    Structured
    Structure Element Types
        Textual
        Tabular
        Graph Data
        Spatial
        Temporal
Data Formats
    File Formats
    Stability
    Longevity
        Data
        Storage
        Format
        Media


Data Volumes
    Separation
    Sharding
    Versioning
    Archival
    Evaporation
Conclusions
Appendices
    Acronym List
    References


Introduction

Organizational management of information is a long-standing issue that has not been solved by technology. As technology advances, increasing capabilities to generate data and greater capacities for storing data have become available. This information glut presents problems to all organizations that create or use information. The storage of data is a multi-dimensional problem that must be considered from many aspects:

• Audiences, both intended and unintended
• Business domains, including legal and policy restrictions
• Data categories, such as sensor feeds, field sampling and analytic results
• Data uses, which include activities such as data capture, analysis and reporting
• Data structure types, such as document, tabular, raster, spatial or temporal
• Data formats, including proprietary formats that may be restrictive in use
• Data volume over time, which will drive the infrastructure to store data

Each of these aspects alone provides only a partial set of demands and restrictions on the handling of data for the enterprise. Given that an organization is itself an enterprise and also part of larger enterprises that include all other organizations in a similar business domain, it is imperative that the organization's information strategy meet both the short- and long-term demands of the enterprises with which the organization interacts.

Audiences

Information is useless without an audience. Even in automated processing and control systems, an audience receives the result of the information processing. The benefit to the audience is the purpose of the information. There are two primary categories of audience when dealing with information in any form: intended and unintended.

The intended audience is divided further into two categories: direct and indirect. The direct intended audience is the portion of the audience using an information store that the store was designed to directly support. This is the primary user base of the information. The indirect intended audience is the set of users (and domains) that were planned for during information store design. This audience will have varying levels of success in using the information and may or may not be an actual set of users (there may not be any people in this set that actually use the information). It is not uncommon that the indirect intended audience amounts to only a small portion of the user base and may be incorrectly targeted during the design phase.

The unintended audience is the set of information users that the information store was not planned or designed for. The unintended audience is likewise divided into two categories: direct and indirect.

Figure 1. Top-Level Categories of Users


The direct unintended audience is the portion of the audience for an information store that is within a business domain or organization that correlates well with an intended group. The direct unintended audience will generally meet with moderately good success in information usage, with the primary limitation being access to the information. Finally, the indirect unintended audience is the set of users that are not related to an expected domain. This form of user may include orthogonal business areas (such as a construction firm using business demographic data for site selection) and may gain significant benefits from information usage. In fact, it is entirely possible that this group of users may outnumber, and benefit more than, all other user groups. Since this group is not intended, they are the most difficult to support. These groups may have any form of software application and require data in specific formats. Unfortunately, data may be provided in incompatible formats, which greatly hinders reuse. While the use of open data standards (such as XML schemas) and web services helps cross-platform information exchange, they do not solve the problems of data formats for specific applications.

Business Domains

A business domain is a business functional area that has a related goal and uses similar information. A single organizational group may operate in multiple business domains and therefore have information stores that overlap those multiple business domains.

A business domain such as hydraulic engineering, geodetic survey, ecological monitoring or commercial fishing will produce and use information that overlaps with other domains. In fact, the specific domains mentioned in this paragraph all overlap in significant ways. For example, all of the above domains utilize information about fixed monuments for relative locations, which are the heart of surveying. Likewise, all of these domains will have use for information about locations in general. Further, these groups will all use information on projects. All of the domains other than commercial fishing would have direct efforts on the construction of a lock, dam or channel. The commercial fishing domain may then be an unintended audience utilizing data for planning trawler capacities and travel schedules for safe cargo passage. In this case, the commercial fishing domain is simply acting as a waterway cargo hauling domain.

Figure 2. Types of Intended Users

Figure 3. Relative Magnitude of User Types

Figure 4. Business Domains with Commonality


The ultimate consideration for business domains is information structure, use and access. These are the themes that run through the remainder of this paper to describe the what, where, when, why and how of interacting with the information produced by an organization. One of the most important considerations with business domains is the isolation of domains from one another in space and time. In general, each organization will operate at specific locations and produce and maintain information for a time period that is relevant to that organization. Policy or legal issues relating to regulated information (such as personally identifiable information (PII)) may drive organizational isolation. Also, each organization will isolate its own information technology infrastructure from the global internet for security purposes. Security and auditing considerations must be addressed when information must flow across these boundaries.

Data Categories

First, a separation between data and information must be established. Information is the collection of data, tools and a context that a human uses to derive knowledge. Data is the collection of stored values that become information when presented in context. Data may imply context based upon a storage model or management tool (such as a software application). However, data can be moved between models to alter or infer other contexts.

Each category of information that the organization produces or manages will affect the information modeling for that organization. Data categories include raw data sources, such as sensors and field collection, and processed sources, such as analysis results. A data category will influence the model used to store that information and may greatly affect the theoretical rate of production for that data.

Raw Data

Raw data includes all categories of data that are collected directly as a result of observation of a physical phenomenon in the real world. Raw data is the original form of data from a source and must be considered the basis for all data sets derived from that source. Raw data is generally collected in a higher volume than derived data and has the greatest potential for reuse of any form of data.

    Figure 5. Data Domains Shared Across Business Domains


Prior to detailing categories of raw data, it is important to appreciate sources of error and incompatibility between data sets. Errors in data fall into two primary areas: measurement error and introduced error. Measurement error takes the form of the accuracy and precision of the measurement made. Introduced error takes the form of faults and blunders in data collection and entry, generally associated with human interpretation and capture.

Accuracy and Precision

Accuracy and precision are commonly confused in spite of their relative simplicity. Accuracy is the absolute measure of how close a measurement is to the actual value; it is often described as how close a dart is to the bull's-eye. Precision is the repeatability of a measure: the closeness of sequential measurements to each other, irrespective of the actual value being measured. In shooting terms, precision is measured as the size of the group for sequential shots. It is therefore quite possible to have low accuracy (missed the target by 12 inches) and high precision (95% of all shots were within 0.0001 inches of each other).
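The distinction can be made concrete numerically. As an illustrative sketch (not from the original report), accuracy can be summarized as the mean offset of measurements from a known reference value, and precision as the spread of the measurements around their own mean:

```python
import statistics

def accuracy_and_precision(measurements, true_value):
    """Summarize repeated measurements of a known quantity.

    Accuracy is reported as the mean error against the true value;
    precision as the standard deviation of the measurements
    (their spread, irrespective of the true value).
    """
    mean = statistics.mean(measurements)
    accuracy_error = mean - true_value                  # systematic offset (bias)
    precision_spread = statistics.stdev(measurements)   # repeatability
    return accuracy_error, precision_spread

# A sensor that is precise but inaccurate: tightly grouped, far from truth.
readings = [17.21, 17.22, 17.20, 17.21, 17.22]
bias, spread = accuracy_and_precision(readings, true_value=20.0)
print(f"bias: {bias:+.2f}, spread: {spread:.3f}")  # bias: -2.79, spread: 0.008
```

The large, consistent bias with a tiny spread is exactly the systematic-error case discussed next.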

In general, sensors are calibrated for accuracy and have intrinsic precision properties that may not be adjustable over the lifespan of the sensor. In many cases, loss of precision in a sensor is the indication that the sensor is at the end of its effective life. Sensors with reduced accuracy may introduce a systematic error, which can potentially be corrected by adding an error offset.

Accuracy and precision are of concern in both human and sensor measurements. Human measurements tend to be more stochastic in terms of both accuracy and precision, but are not necessarily of less value.

Generic Accuracy

In general terms, accuracy should be recorded for all raw data collected, regardless of source or data type. These generic accuracy statements are metadata (data about data) attached to all data collected under a similar generic accuracy. The inclusion of an accuracy statement with all raw data sets ensures that data users can determine the maximum level to which derived data could be considered accurate. No derived data set can be of greater accuracy than the least accurate source data used to derive the result. This has profound implications in analysis, where partial data sets must be integrated to form a complete picture from which analytical results are derived.
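A minimal sketch of this rule (with hypothetical metadata field names, not prescribed by the report): when each source carries an accuracy statement, the derived product can claim no better than the worst of them:

```python
def derived_accuracy(source_metadata):
    """Return the best accuracy claimable for a derived data set.

    Each source carries an 'accuracy_m' metadata entry (an error bound
    in meters; larger is worse). The derived product inherits the worst.
    """
    return max(src["accuracy_m"] for src in source_metadata)

sources = [
    {"name": "survey_monuments", "accuracy_m": 0.02},
    {"name": "field_gps_points", "accuracy_m": 5.0},
    {"name": "digitized_streams", "accuracy_m": 12.0},
]
print(derived_accuracy(sources))  # 12.0 -- no better than the worst input
```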

Faults and Blunders

When a data value is not captured correctly, it may be due to a fault or a blunder. A fault is the capture of a correct measure or value in an incorrect manner, such as transposing latitude and longitude. A blunder is the capture of an incorrect value, either by incorrect measurement or by incorrect entry. Incorrect measurement is a gap between the physical environment and the data system (such as a reading from a GPS on a tripod that is not leveled properly). Incorrect entry is a gap between the source data and a destination data system (such as incorrectly entering a value into a database or notebook by transposing digits).

Faults and blunders are generally caused by humans and may exist anywhere in the data lifecycle. They are random and not easily detected or corrected.
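Because these errors are random, they can only be screened for, not prevented. A hedged sketch of one common screen (a bounding-box check that catches a transposed latitude/longitude; the ranges are hypothetical, for a continental-U.S. study area):

```python
def flag_suspect_coordinates(lat, lon,
                             lat_range=(24.0, 50.0),
                             lon_range=(-125.0, -66.0)):
    """Flag a likely fault: coordinates outside the expected study area.

    A transposed latitude/longitude pair (a fault) usually lands far
    outside the expected bounding box and can be caught automatically;
    a plausible-but-wrong value (a blunder) generally cannot.
    """
    ok = (lat_range[0] <= lat <= lat_range[1]
          and lon_range[0] <= lon <= lon_range[1])
    return "ok" if ok else "suspect: out of expected range"

print(flag_suspect_coordinates(38.9, -77.0))  # prints: ok
print(flag_suspect_coordinates(-77.0, 38.9))  # transposed pair -> suspect
```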


Data Capture

Raw data may be captured by humans or by automated means such as sensors. When sensors are used, they may perform direct electronic capture or require a human to record and transfer the sensed data. Any data capture that requires a human to record the data is classed as human capture, equal to all other human-captured data.

Regardless of the mechanism of data capture, raw data should be considered gospel data that must be retained for the longest period of time. If the raw data used in an analysis is lost, that analysis cannot be verified and its validity is ultimately vulnerable to scrutiny. Unfortunately, the volume of raw data may be prohibitive for traditional long-term storage. In this case, a mechanism for data persistence is required that allows for long-term persistence to support the lifetime of all analysis products.

Human Capture

Human data capture is the practice of a person measuring, recording and entering values into a data set. Human capture is still a common form of data generation and includes all business data entry systems, such as online forms or general spreadsheets. Since a human is collecting the data and entering it into a system, the rate of generation is limited by the maximum number of people capturing and entering data and the absolute rate at which a single person can generate this data. Human capture systems are generally low- to medium-volume capture sources.

Sensor Capture

Sensor systems can be direct capture or indirect capture. A direct capture system records all data sensed directly to a persistent store such as a database. Direct capture systems often include sensors such as:

• Fixed thermal sensors (weather stations)
• Closed-circuit cameras
• Supervisory Control and Data Acquisition (SCADA) systems
• Hardware self-monitoring
• Software logging / auditing

Indirect capture systems include all disconnected sensor systems that require periodic data transfers, such as field data loggers. Indirect capture systems include sensors such as:

• Hydrolab Datasondes
• Hardware diagnostic sensors (such as those in automobiles)
• GPS units
• Autonomous hardware devices (such as robots, transponder readers / loggers, etc.)

Sensor capture devices tend to provide reliable data if properly used and maintained. Due to the automated nature of these devices, they can have low to extremely high data capture rates. Indirect capture devices are limited in capture rate by the transfer requirements, but this can be overcome by exchanging devices.


Derived Data

All data products that are created from existing data are considered derived. Derived data will often be smaller in size and of greater value to a user than the source data used to produce it. Derived data may be produced dynamically on demand (such as complex queries over a relational data store) or statically produced and persisted (such as analysis results and formal reports). Whenever possible, metadata should be captured and maintained for all derived data, indicating its chain of production and all data sets used in the derivation of the product.
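As a sketch of the kind of chain-of-production metadata suggested here (the field names and dataset names are illustrative, not prescribed by the report), a derived product might carry a lineage record such as:

```python
from datetime import datetime, timezone

# Illustrative lineage record for a derived product: which sources were
# used, what process produced it, and when it was produced.
derived_product_metadata = {
    "product": "mean_daily_temperature_2009",
    "produced_at": datetime(2009, 11, 2, tzinfo=timezone.utc).isoformat(),
    "process": "daily_mean_aggregation_v1",
    "sources": [
        {"dataset": "site_17_datasonde_raw", "version": "2009-10"},
        {"dataset": "site_23_datasonde_raw", "version": "2009-10"},
    ],
}
```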

Data Uses

Each user will have a set of uses for a data set. The basic forms of data use include capture, view, analysis (derivation) and reporting (derivation). The initial use of a data set is data capture or generation, which involves the processes and users that produce and record the raw data.

Data view (reading or retrieval) is the practice of physically viewing, querying, browsing or enumerating a data set. Viewing is the most common use of data and is subject to availability, discovery and transfer constraints. The user context may dictate the presentation format of data and may therefore require a transformation of the data from its storage format.

Analysis is the act of deriving a data set from some other data set or sets. Analysis often requires data to be in a specific structure or format that is amenable to the specific type of analysis to be performed. Additionally, data volume may be a consideration that limits the ability to transform the data due to storage limitations. Finally, analysis processes may require data transfer at specific rates or access methodologies (such as random or sequential access) that drive the data storage environment (such as server sizing). This tends to be the most performance-critical use of data and may significantly affect costs and planning.

Reporting is the act of deriving an output data set from an existing data set. Reporting generally adds minimal new information (generally just aggregation functions such as averages or sums), but tends to add significant value due to improved information context and comprehension. Reporting may be automated (scheduled or triggered) or manual. Reporting may involve data transformation and additional storage (such as in an online analytic processing (OLAP) system).
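A small sketch of reporting as derivation (the data is illustrative, not from the report): aggregating raw rows into per-site sums adds little new information, yet improves comprehension of the whole:

```python
from collections import defaultdict

# Raw captured rows: (site, measured value).
rows = [("site_a", 12.0), ("site_b", 7.5), ("site_a", 3.0), ("site_b", 1.5)]

# A report derives an output data set using only an aggregation function.
totals = defaultdict(float)
for site, value in rows:
    totals[site] += value

print(dict(totals))  # {'site_a': 15.0, 'site_b': 9.0}
```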


    Figure 6. Relative size of user communities

All forms of data use will influence storage methodologies, environment scaling and access control. Integration of external application systems can be the most significant cost, in time and money, in architecting an effective information strategy. It is important to understand the scale of the user communities as well. In any information system, the largest proportion of users are viewers, while the smallest proportion are creators. Given this pyramid (as depicted above), it is clear that data creation is targeted at a small set of users who have specialized needs. Due to the small size of the community, it is reasonable to invest greater funds per user to create robust data collection tools. If the smaller communities are not emphasized, the data available to users will be of lesser quality. In general, it is a more effective utilization of resources to invest in collection and analysis tools, with general purpose viewing tools created as needed.

Data Structure Types

All data is structured in some manner that a computer can handle. These structures abstract the physical world to provide the information capability the organization requires. Data structure types include three primary levels: unstructured, semi-structured and structured.

Unstructured (Documents)

Unstructured data is most commonly documents. These data types store values that have embedded meaning with an absence of structure for conveying that meaning. A document can be understood by reading it and may have many meanings depending upon the reader. Since there is no pre-defined structure to the content, a computer treats the content as a single atomic data item. The content within the document may be mined to acquire limited information (generally metadata) beyond that which may be entered by the creator. Unstructured data is of limited use within a data set, but provides immense use to human users. Unstructured data is primarily managed in an information store as a document repository flagged with metadata relating the document to other data.


Semi-Structured

Semi-structured data is data containing a limited structural definition. XML is a standard encoding for data that may be externally structured via an XML Schema. Semi-structured data may be fully conformant with a single structural definition, or it may be conformant in part with multiple structural definitions. Semi-structured data is a consideration primarily for implementation use and for the consumption of externally provided data.

Structured

Structured data is all data stored under a pre-defined, well-known structure. Relational databases and object-oriented classes define structured data. Structured data is the most powerful aspect of data storage and modeling, as it conveys structure and implies meaning of and between data elements.

In most cases, an analysis will require specific data in a specific structure, but will not semantically comprehend the data used. Structured storage and transfer of data does not imply meaning or guarantee correctness. All data persisted within an organizational repository should use structured storage unless otherwise required. Semi-structured and unstructured data may be stored within a structured repository.

Structure Element Types

Within a structured or semi-structured store, each data element will be of some structured type. Primary data types include: Boolean (true/false), integral numeric, real numeric, text, binary and abstract types. The abstract types are constructed as a defined collection of the other primary data types. All primary data types are supported in some form in every language, database and general information system.

Textual

A textual data element is any free-form text entity that is interpreted as a whole. In general, a document can be considered primarily a textual entity. Note that specific document encoding formats (such as the Microsoft Word format) may include far more than text and be dependent upon a specific tool to read data in that format. General purpose textual information (including HTML and XML files) is encoded in a standard form that is easily understood on any platform.

Tabular

Tabular data elements are any data structured as rows and columns of values. Object-oriented classes are also considered a tabular structure, in that a class is analogous to a table structure. The key concept of tabular data is that of a pre-defined structure for all instances of data conforming to that structure. Any data structure that is a two-dimensional array of values is a table. All digital images may be considered tabular data in their raw form; it is only their encoding (such as JPEG or TIFF) that makes them distinct elements. Further, a tabular data structure may store any other form of data within a single column of that structure.


Graph Data

Graph data elements are any data structured upon relationships between elements. This may be in the form of hierarchies (trees) or networks (graphs). A graph G is a set of nodes N and a set of edges E, indicated as G(N, E), where each edge e in E is a pair of nodes (n1, n2), both in N. Edges may be directed (source to destination) or bi-directional (considered undirected). A social network like Facebook is an undirected graph connecting people (nodes) who indicate they know each other (edges). A road or river network would also be an example of a graph.

Graph data may also be tabular, in that the nodes and edges may each be represented by a table. It is also possible for a set of nodes to be shared across multiple graphs, each with a distinct set of edges. Graph data is conceptually simple, but is potentially complex and may be used to support any number of complex analytical routines. In general, anything representing connectivity can be considered a graph. Most commercially available data management tools, including relational databases, have poor support for graph data. Further, most commercial tools that do support graph data expect the entire graph to reside in memory, which is a significant limitation for large data sets. As an example, the US road network has over 30,000,000 edges (road segments) and an approximately equal number of nodes (intersections).
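A minimal sketch of the tabular representation described above (the schema is illustrative): one table of nodes and one of edges, with each edge a directed pair of node identifiers, and adjacency derived on demand rather than holding the whole graph in memory:

```python
# Nodes and edges of a small road graph, each held as a table (list of rows).
nodes = [
    {"id": 1, "name": "intersection_a"},
    {"id": 2, "name": "intersection_b"},
    {"id": 3, "name": "intersection_c"},
]
edges = [  # each edge is a directed pair (n1, n2) of node ids
    {"src": 1, "dst": 2},
    {"src": 2, "dst": 3},
    {"src": 3, "dst": 1},
]

def neighbors(node_id, edge_table):
    """Return the destination ids reachable in one hop from node_id."""
    return [e["dst"] for e in edge_table if e["src"] == node_id]

print(neighbors(2, edges))  # [3]
```

The same node table could back multiple graphs, each with its own edge table, as the text notes.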

Spatial

Data regarding a location or graphical depiction (diagram or map) may be considered spatial data. Spatial data generically consists of geometries (shapes) as an attribute (column) of a structured data set. If the geometries represent physical locations on the Earth or another celestial body, those geometries are considered geospatial. A geographic information system (GIS) is one form of repository for storing geospatial data.

Coordinate Systems

Simple geometry data is based upon a standard Cartesian plane, such as a drawing program (e.g. Microsoft Visio or the open source Dia) would support. If the geometries are geospatial, there may be a planetary coordinate system that allows for accurate representation on an ellipsoidal model of the planet. The coordinate systems of geospatial geometries are mathematically complex and distort aspects of the geometries in specific ways depending upon the coordinate system selected. As data is collected, it will exist in a single coordinate system. As data is transformed into another coordinate system, a distortion will occur. Any analysis performed using data from different coordinate systems will contain a measure of error (or bias) based upon artifacts introduced by the distortion.

Spatial Accuracy, Precision and Scale

The accuracy and precision of spatial data has the added aspect of vertex density. A geometry, such as a line, is composed of multiple points connected by linear segments. This chaining of points (vertices) allows for the generation of complex lines (polylines or arcs) and polygons. When created (digitized), the spacing between the vertices may be significant. Each vertex has accuracy and precision measures, as does the space between the vertices. For example, a line consisting of three points may be accurate and precise to within one inch at the vertices, with no measure of accuracy or precision for the linear segments connecting the vertices.

Spatial scale is the ratio of real-world distance to the depicted distance of a geometry. For geospatial data, the coordinates stored are often exact within the precision of the data element (such as a 64-bit floating point number). In terms of scale, the defined geospatial scale is the scale at which the data was collected for use. This scale is an indication of the nominal distances between vertices to depict variation in location (such as the curves in a stream). As the scale becomes more precise (tends toward 1/1), there will be more vertices per unit distance to indicate variation. As the scale becomes less precise (tends toward 1/∞), there will be fewer vertices per unit distance. As a result, coarse-scale data will depict less of the variation in a road or river as it meanders. Further, at a coarser scale the accuracy of the geometry will diverge from the physical world more between the vertices than at a finer scale. When the spatial data is transformed to another coordinate system, this may cause great concern, as the linear segments between vertices cannot distort (curve) with the coordinate system unless interpolated vertices are inserted. When vertices are added to introduce this curve, the original data is lost and the result may appear more fine-scale than the source data actually was. This is a serious issue that must be addressed for all spatial data.
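The interpolation issue can be illustrated with a sketch (simple planar linear interpolation; real coordinate transformations are far more involved): densifying a segment inserts vertices that were never measured, so the result looks finer-scale than the source:

```python
def densify(p1, p2, n_inserted):
    """Insert linearly interpolated vertices between two measured points.

    The inserted vertices are synthetic: they carry no measured accuracy,
    yet the densified line appears finer-scale than the captured data.
    """
    (x1, y1), (x2, y2) = p1, p2
    points = [p1]
    for i in range(1, n_inserted + 1):
        t = i / (n_inserted + 1)
        points.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    points.append(p2)
    return points

print(densify((0.0, 0.0), (10.0, 10.0), 3))
# [(0.0, 0.0), (2.5, 2.5), (5.0, 5.0), (7.5, 7.5), (10.0, 10.0)]
```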

Temporal

Data describing the real world is always valid in time. Time can be considered a one-dimensional space that is infinite in both directions. Like spatial data, time is an attribute of other data structures such as tabular data. Temporal data can be categorized into three primary classes: archival, historical and true temporal.

Archival

Archival data is not truly a temporal aspect of data, but rather an aspect of data handling. Archival is the practice of removing old data from the primary operational data store to conserve space and improve performance. Archival may be ad hoc or scheduled. Archived data is stored separately (even if only logically) from operational data and requires additional processing to query. Archival data may be coupled with temporal data for additional capability.

Historical

Historical data is a form of temporal data that persists data record changes over time. This is often accomplished via a start date and end date for old records of a specific entity. This is commonly coupled with auditing for transactional histories of data. The historical records may be queried to provide a level of detail on how an entity has changed over time. In historical data, the temporal aspect of the history is secondary, as all data is operated upon in the present sense and historic information is merely available for viewing. History data may be archived to free space.

Temporal

Truly temporal data also contains the notion of a start and end date, but adds the concept of time as a first-class data element. A temporal data entity may be edited or created in the present, past or future.


For example, a bridge may be planned for construction in the future and entered into the system for that future time. If bridges are queried in the present, the future bridge will not appear. Likewise, a bridge may be scheduled for demolition and therefore be indicated as demolished in the future. If the system is queried in the future, the bridge will not appear. Finally, after the bridge is destroyed, it may be rebuilt further in the future. Once rebuilt, the original bridge record may be resurrected as existing once again (and considered the same bridge). This allows for a fully temporally aware data system that can be acted upon in time. Temporal data may be archived to free space and may be coupled with spatial data for full four-dimensional space and time awareness.
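A minimal sketch of such a temporally aware query (the schema and dates are hypothetical, following the bridge example): each record carries a validity interval, and queries are evaluated as of a point in time:

```python
from datetime import date

# Each record is valid over [valid_from, valid_to); None means open-ended.
bridges = [
    {"name": "River Bridge",
     "valid_from": date(1960, 1, 1), "valid_to": date(2015, 6, 1)},  # demolished
    {"name": "River Bridge",
     "valid_from": date(2020, 3, 1), "valid_to": None},              # rebuilt
]

def as_of(records, when):
    """Return the records that exist at the queried point in time."""
    return [r for r in records
            if r["valid_from"] <= when
            and (r["valid_to"] is None or when < r["valid_to"])]

print(len(as_of(bridges, date(2017, 1, 1))))  # 0 -- between demolition and rebuild
print(len(as_of(bridges, date(2021, 1, 1))))  # 1 -- the resurrected bridge
```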

Data Formats

All data stored within a computer is represented as files. Even relational databases store their data in files. An important aspect of data management is the format of the data, both in terms of data models and file formats. A data model is the logical decomposition of an entity into the properties that are required to describe that entity. These data models are then persisted in some manner within files. For example, the Microsoft Word application saves documents in a .doc or .docx file, each of which has a specific encoding of the data contained within the file.

File Formats

Data file formats can be open or closed. An open format is a well-known and well-documented format such as a text file (regardless of content), an ESRI Shapefile or an Autodesk DXF file. These formats are well-defined and publicly published; anyone can develop an application that can extract the information from a file conforming to them. A closed (proprietary) format is any file format that is not well-known or well-documented. The Oracle .ora and MS Word .doc file formats are examples of closed formats.

Stability

File formats will change over time as new versions are released. In general, as we embrace change, changes to file formats must follow. Changes in file formats, whether open or closed, will affect all data currently persisted. The more frequently a format is revised by its definers, the less stable the format is. As data is stored in a specific version of a format, that data will become more difficult to maintain as the format is revised over time.

Longevity

Longevity is important to information strategies in four primary respects:

• Data longevity
• Storage longevity (management)
• Format longevity (stability)
• Media longevity


Data

Data longevity relates to the period of time that a given piece of information is relevant and valid. In general, data is perpetual. A data element captured in time is a stable representation of that element at that point in time, for all time. As time progresses, data captured in the past maintains its representation of that data element at the point in time it was captured. If any form of temporal change determination is to remain possible, or for primary historical purposes, data must be maintained in perpetuity. Only a small portion of raw information is truly transient in nature, whereas many derived datasets are highly volatile in time.

When data longevity is considered, the applicable maintainable lifetime must be identified per data element. For that time period, the data must be persisted and protected.

Storage

Storage longevity is the natural follow-on to data longevity, where the mechanisms for storage are maintained. Storage longevity deals with the planning and maintenance of the infrastructure and applications for maintaining the data to be stored, including hardware replacement and archival plans for moving data to secondary stores.

An important consideration in storage longevity is failure mitigation and recovery: if any devices fail, how will the data be restored to operational capability? Additionally, if third-party storage is used (such as cloud storage), storage longevity should include any considerations of inter-organization practices and policies to ensure storage stability.

Format

Format longevity directly relates to the stability of the selected storage formats. Further, format longevity includes migration of data from one format to another as formats are upgraded, or as format selections change (such as a migration from one RDBMS to another). Format longevity may be critical in terms of affecting the chain of custody of the data. As a format is changed, the change may alter the content (e.g. floating point representations may differ, resulting in precision changes), or simply result in a non-original, official copy of the content. This is a critical issue for the National Archives and Records Administration (NARA).

Media

Media longevity deals with the effective lifespan of physical storage media. For example, the lifespan of a compact disc (CD) is estimated at anywhere from 2-10 years or more (http://www.archives.gov/records-mgmt/initiatives/temp-opmedia-faq.html). The actual lifespan of physical media is most critical for offline or extended storage that is not used frequently. Active data media will be replaced during storage maintenance, but media longevity is often unaddressed except during long-term archival planning (again a critical NARA issue).

Data Volumes

The volume of data collected per unit time for a specific data entity will drive much of the planning for data storage. In high-volume data collection activities, long-term storage is only practical using low-cost offline media such as DVD, tape or Blu-Ray. The longevity of this media then becomes an issue, so the media may need to be tested and replicated periodically to ensure stability of the content. Even for mid- to low-density data collection, if the data is temporal with a high change rate (such as product inventory), accumulation of records may result in an overall higher-than-anticipated volume. Finally, aggregation of multiple data entities can also result in large data volumes. Scaling information solutions to handle large data volumes is a complex undertaking that is still in its infancy. Scaling may include any of several approaches:

• Separation: keep each data entity type in an isolated store
• Sharding: keep natural subsets of a data entity type in isolated stores
• Versioning: keep only changes to data entities over time
• Archival: move older data offline to secondary stores
• Evaporation: delete out-of-date data automatically

Higher storage densities are also possible using existing technologies such as compression, efficient encoding (e.g. JPEG vs BMP for images) and simple removal of redundancy.

Separation

Data separation encompasses any means of separating data in a natural manner by keeping unrelated information separate from other information. Service-oriented architectures (SOA) and cloud computing in general strongly advocate data separation.

Separation may make applications more complex, as each type of data is stored in isolation from other data. When it is necessary to query across data entity types, multiple data stores must be queried and all result sets related based upon commonality. Further, there are issues with data editing anomalies that must be addressed.

Sharding

The notion of sharding is related to that of separation, with the exception that a single data entity type is separated into isolated stores based upon some property of the entity instances. For example, a temperature store may be separated by location, where each temperature monitoring site has its own data store, but all use the same logical model. If temperature data is queried across all sites, then multiple parallel queries must be executed and the results unioned.

Sharding, like separation, may cause significant increases in complexity for applications consuming data from a sharded data store due to the increased query integration complexity.
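A sketch of the fan-out pattern just described (hypothetical per-site stores sharing one logical model): a cross-site query must hit every shard and union the results:

```python
# One store per monitoring site (the shards), all using the same logical model.
shards = {
    "site_a": [{"site": "site_a", "temp_c": 18.2},
               {"site": "site_a", "temp_c": 19.1}],
    "site_b": [{"site": "site_b", "temp_c": 21.4}],
}

def query_all_sites(predicate):
    """Run the same query against every shard and union the result sets."""
    results = []
    for rows in shards.values():  # in practice these queries run in parallel
        results.extend(r for r in rows if predicate(r))
    return results

hot = query_all_sites(lambda r: r["temp_c"] > 19.0)
print(hot)  # [{'site': 'site_a', 'temp_c': 19.1}, {'site': 'site_b', 'temp_c': 21.4}]
```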

Versioning

The versioning of data is a form of temporal storage where only a single complete record is stored for a data entity and all temporal changes are stored as change sets that contain only the information elements that changed. In most circumstances, data formats drive the storage mechanisms used. Since few data formats include the capability for versioning, there are few circumstances where versioning is cost-effective to implement. In most cases, for temporal storage, entire copies of the data element are stored for each temporal update made. This results in many copies of unchanged data being persisted over time.

Versioning carries a performance cost in the retrieval of a specific change set. This cost is a result of integrating the changes (deltas) with the base record to arrive at the derived temporal record. The base record may be optimized to be the most commonly retrieved temporal instance (usually the most current record will be the completely stored base record, resulting in a reverse temporal change set).
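A sketch of delta-based versioning as described (the record structures are illustrative): a complete base record plus change sets holding only the fields that changed, integrated at retrieval time:

```python
# Complete base record plus change sets containing only changed elements.
base = {"bridge": "River Bridge", "lanes": 2, "condition": "good"}
deltas = [
    {"condition": "fair"},               # inspection downgrades condition
    {"lanes": 4, "condition": "good"},   # widening project
]

def record_at_version(base_record, change_sets, version):
    """Rebuild a record by applying change sets 0..version-1 to the base.

    Retrieval pays the cost of integrating each delta into the base record.
    """
    record = dict(base_record)
    for delta in change_sets[:version]:
        record.update(delta)
    return record

print(record_at_version(base, deltas, 2))
# {'bridge': 'River Bridge', 'lanes': 4, 'condition': 'good'}
```

A reverse change set, as the text suggests, simply stores the current record whole and the deltas pointing backward in time, so the common case (the current record) needs no integration.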

Archival

Archival is the removal of historical records from the primary store to a secondary, higher density but lower cost, store. The archival store may improve performance for the retained data by reducing the size of the store to be actively searched. Unfortunately, archival generally increases the complexity, cost and time of querying into the archive.

Evaporation

Evaporation (or purging) is the removal (deletion) of old data from the system. Evaporation is generally an automated process that expires old records based upon some criteria and deletes those records from the system. This may or may not be in conjunction with archival, where old archives are destroyed. Evaporation cannot be undone, so it must be thoroughly planned to ensure no useful data is lost.
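A sketch of one automated expiry criterion (the retention period is hypothetical): records older than the cutoff are deleted and cannot be recovered, which is why the criteria must be planned carefully:

```python
from datetime import date, timedelta

records = [
    {"id": 1, "captured": date(2002, 5, 1)},
    {"id": 2, "captured": date(2009, 9, 30)},
]

def evaporate(rows, retention_years, today):
    """Delete (irreversibly) all rows older than the retention cutoff."""
    cutoff = today - timedelta(days=365 * retention_years)
    return [r for r in rows if r["captured"] >= cutoff]

kept = evaporate(records, retention_years=5, today=date(2009, 11, 2))
print([r["id"] for r in kept])  # [2] -- record 1 has expired
```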


Conclusions

Every organization must evaluate its information strategy and portfolio to ensure all information is properly maintained over an effective lifecycle for all users, both intended and unintended. There are many considerations for each form of data that comprises the organizational information corpus. Data modeling is a critical activity to ensure the proper entities are captured in a repeatable, standardized and maintainable manner.

Overall, data is the primary vehicle for transferring information, which is used to derive knowledge. Well-designed and well-defined data will provide better information and may create opportunities for new, unplanned uses of the data without a need for new data models. These emergent capabilities may be of much greater societal impact than the original uses for which the data was modeled.

Appendices

Acronym List

Acronym   Description
CD        Compact Disc
DID       Data Item Description
DoD       Department of Defense
DSU       Defense Systems Unit
GAO       Government Accountability Office
GIS       Geographic Information System
GPS       Global Positioning System
NARA      National Archives and Records Administration
OLAP      Online Analytic Processing
PII       Personally Identifiable Information
RDBMS     Relational Database Management System
SCADA     Supervisory Control and Data Acquisition
SOA       Service Oriented Architecture
XML       Extensible Markup Language

    References