
Research Collection

Master Thesis

Data quality analysis for food composition databases

Author(s): Mock, Reto

Publication Date: 2011

Permanent Link: https://doi.org/10.3929/ethz-a-006660133

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library


Data Quality Analysis for Food Composition Databases

Master Thesis

Reto Mock <[email protected]>

Prof. Dr. Moira C. Norrie
Karl Presser

Global Information Systems Group
Institute of Information Systems

Department of Computer Science

23rd September 2011


Copyright © 2011 Global Information Systems Group.


Abstract

Data quality has been an issue ever since databases have been in use. Despite the absence of a clear definition, it is widely agreed that data quality is of major importance, especially in scientific applications. Having noted this, it is rather surprising how little research has been done in the area of data quality.

We present a Data Quality Analysis Toolkit which is capable of measuring and visualising data quality. A variety of different charts and tables support the user in judging in what areas urgent action is needed. Furthermore, the toolkit provides tools to drill down on the data quality issues and identify individual problems in the data.

In this master thesis, we apply our concept of data quality analysis to the FoodCASE database (http://www.foodcase.ethz.ch), which is the Swiss food composition database managed by the Swiss Food Information Resource (SwissFIR, http://www.swissfir.ethz.ch) of the ETH Zurich and the Federal Office of Public Health (http://www.bag.admin.ch).


Contents

1 Introduction  1
  1.1 Food Composition and the FoodCASE Project  1
  1.2 Data Quality  1
  1.3 Our Contribution  2
  1.4 Thesis Outline  2

2 Background and Related Work  3
  2.1 A Bit of History  3
    2.1.1 The Beginning of Food Composition Tables  3
    2.1.2 The Swiss Food Composition Database  3
    2.1.3 COST Action 99  4
    2.1.4 European Food Information Resource (EuroFIR)  4
  2.2 The FoodCASE Project  6
    2.2.1 Introduction  6
    2.2.2 Business Concepts  6
    2.2.3 Architecture  7
  2.3 Outlook  8
  2.4 Data Quality  9
    2.4.1 Data Quality in Food Composition Databases  10
    2.4.2 USDA Quality Index and Confidence Code  10
    2.4.3 EuroFIR Quality Index  10

3 Data Quality Analysis Toolkit  13
  3.1 Analysing the Data Quality in the FoodCASE Database  13
    3.1.1 Scenario 1: Identifying Missing Data  13
    3.1.2 Scenario 2: Analysing Trends over Time  16
    3.1.3 Scenario 3: Grouping by User  16
  3.2 The Concepts Behind  18
    3.2.1 Data Quality Requirement  18
    3.2.2 Data Quality Analysis Tree Definition  19
    3.2.3 Data Quality Assessment  21
    3.2.4 Filtering and Grouping  21
    3.2.5 Long-term Analysis  21
    3.2.6 Shortcut from FoodCASE  21
    3.2.7 Percentage Scale vs. User-defined Scale  22
  3.3 Views  23
    3.3.1 Overview  23
    3.3.2 Data Quality Views  24
    3.3.3 Problem Views  25
    3.3.4 View Options  26

4 Administration Module  31
  4.1 FoodCASE Data Model  31
  4.2 Quality Entities  34
  4.3 Quality Entity Filters  34
  4.4 Quality Requirements  35
  4.5 Quality Assessments  36
  4.6 Maintenance  37

5 Implementation  39
  5.1 Project Structure  39
  5.2 Code Metrics  40
  5.3 Data Model  42
    5.3.1 Data Quality Tree  42
  5.4 Mapping to Physical Data Model  43
  5.5 Back-End  44
    5.5.1 Running a Quality Assessment  44
    5.5.2 File System Cache  45
  5.6 Front-End  47
    5.6.1 Runtime Data Structure  47
    5.6.2 Tree Calculation  47
    5.6.3 NanoGraph  49
    5.6.4 Tree Model  49
    5.6.5 Tree View and Tree Editor  51
    5.6.6 GUI Components  52
    5.6.7 Views  54
    5.6.8 Chart Highlighting and Breadcrumb Navigation  56
    5.6.9 User Settings  56
    5.6.10 Logging and Error Handling  57
  5.7 Admin Module  57

6 Testing  59
  6.1 Back-End Testing  59
  6.2 Front-End Testing  60
    6.2.1 Automated Testing  60
    6.2.2 Manual Testing  62

7 Extensions  63
  7.1 Data Quality Prevention  63
  7.2 Confidence Code for Aggregated Foods in FoodCASE  66
    7.2.1 Recipes  66

8 Conclusion  67
  8.1 Summary of Work  67
    8.1.1 Data Quality Analysis Toolkit  67
    8.1.2 Administration Module  68
    8.1.3 Extensions  68
  8.2 Future Work  69

A Terms and Abbreviations  71
B FoodCASE Quality Requirements  73
C Physical Database Schema  85

List of Figures  91
List of Tables  93
List of Listings  95
Acknowledgements  97
Bibliography  99


1 Introduction

1.1 Food Composition and the FoodCASE Project

A food composition database (FCDB) provides detailed information on the nutritional composition of foods. For a given food, it can list all components, such as proteins, vitamins and minerals contained in this food, together with the exact amount. FCDBs are usually country-specific and are maintained by so-called food compilers. In Switzerland, there is the FoodCASE database (http://www.foodcase.ethz.ch), which is managed by a collaboration of the Swiss Food Information Resource (SwissFIR, http://www.swissfir.ethz.ch) of the ETH Zurich together with the Federal Office of Public Health (http://www.bag.admin.ch). On the European level, there was a five-year European Food Information Resource Network of Excellence (EuroFIR, http://www.eurofir.net), which was funded by the European Commission's Research Directorate General under the "Food Quality and Safety Priority" of the Sixth Framework Programme for Research and Technological Development.

1.2 Data Quality

Data quality is not a precisely defined term. There are many different aspects which belong to data quality. These include, for example, accuracy, completeness, relevancy, accessibility and interpretability. Wang et al. [wang94] identified 118 "data quality attributes". In the end, it is always the data consumer who has to decide whether the data are fit for a particular use.



1.3 Our Contribution

In this master thesis, we designed and implemented a Data Quality Analysis Toolkit, which allows data quality to be measured and visualised. For reasons of practicability, we restricted ourselves to data quality aspects which can be expressed as an SQL (Structured Query Language) statement. For example, if a database table which contains information about books is given, it is possible to check whether the ISBN (International Standard Book Number) is in the correct format and whether the check digit is valid. However, it would be out of the scope of the system to query an online service to verify that the ISBN really belongs to a certain book.

A variety of different charts and tables support the user in judging in what areas urgent action is needed. Furthermore, the Data Quality Analysis Toolkit provides tools to drill down on the data quality issues and identify individual problems.
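To make the ISBN example concrete, the following sketch shows how such a check could be phrased as a single SQL statement. The book table and its columns are hypothetical and not part of FoodCASE; the sketch assumes PostgreSQL syntax and ISBN-13 numbers, and it maps every record to a data quality value between 0.0 (worst) and 1.0 (best), as required for quality requirements in section 3.2.1.

-- Hypothetical book table: check ISBN-13 format and check digit.
-- Every record is mapped to a data quality value between 0.0 (worst) and 1.0 (best).
SELECT b.id,
       CASE
         WHEN b.isbn IS NULL THEN 0.0                        -- ISBN missing
         WHEN b.isbn !~ '^[0-9]{13}$' THEN 0.0               -- wrong format
         WHEN (10 - (( CAST(substr(b.isbn,  1, 1) AS int)
                     + CAST(substr(b.isbn,  3, 1) AS int)
                     + CAST(substr(b.isbn,  5, 1) AS int)
                     + CAST(substr(b.isbn,  7, 1) AS int)
                     + CAST(substr(b.isbn,  9, 1) AS int)
                     + CAST(substr(b.isbn, 11, 1) AS int)
                     + 3 * ( CAST(substr(b.isbn,  2, 1) AS int)
                           + CAST(substr(b.isbn,  4, 1) AS int)
                           + CAST(substr(b.isbn,  6, 1) AS int)
                           + CAST(substr(b.isbn,  8, 1) AS int)
                           + CAST(substr(b.isbn, 10, 1) AS int)
                           + CAST(substr(b.isbn, 12, 1) AS int))
                     ) % 10)) % 10
              = CAST(substr(b.isbn, 13, 1) AS int) THEN 1.0  -- format and check digit valid
         ELSE 0.5                                            -- valid format, wrong check digit
       END AS data_quality
  FROM book b;

A record thus scores 1.0 only if both the format and the check digit are correct; the intermediate score of 0.5 for a syntactically valid ISBN with a wrong check digit is an arbitrary illustrative choice.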

1.4 Thesis Outline

The next chapter gives an overview of the theoretical background of data quality assurance related to food composition databases. There were already efforts made by various organisations, including EuroFIR and the US Department of Agriculture. Chapter 3 introduces the Data Quality Analysis Toolkit implemented as part of this master thesis. We explain how the toolkit can provide support to get an overview of the quality of the food composition data and how problems in the data can be identified. Chapter 4 aims at system administrators and shows how to extend the Data Quality Analysis Toolkit with additional criteria to check. Chapter 5 is devoted to the implementation of the Data Quality Analysis Toolkit and chapter 6 discusses its testing. Chapter 7 contains some extensions to the FoodCASE application which are not directly related to the Data Quality Analysis Toolkit. This includes an approach to data quality assurance at input time. Finally, chapter 8 concludes this master thesis with a summary of our work and an outlook.



2 Background and Related Work

2.1 A Bit of History

2.1.1 The Beginning of Food Composition Tables

Back in 1789, Carl August Hoffmann published a table about the composition of 40 mineral waters from Germany [hoffm1789]. This document is likely to be the first food composition table. Another early food composition table, by Jacob Moleschott, appeared in 1859 in the second edition of his book "Physiologie der Nahrungsmittel - Ein Handbuch der Diatetik" [moles1859]. The first American food composition table, which contained 2600 food items, was created in 1896 by Atwater and Woods of the US Department of Agriculture (USDA) [atw1896]. In Switzerland, the first food composition table was compiled during the Second World War by the Federal Office for Wartime Nutrition and published in 1944 [fown44]. This table contained data about the amount of energy, carbohydrates, protein and fats contained in about 250 foods available in Switzerland at this time. Twenty years later, the second Swiss food composition table, by Hogl and Lauber, was issued in the Swiss Food Book [hoegl64]. Unfortunately, this table was not updated anymore and could no longer be used after a while. This situation had not changed until the early nineties, when the Federal Commission for Nutrition (FCN) recommended the creation of a new Swiss Food Composition Database.

2.1.2 The Swiss Food Composition Database

In 1997, the Federal Office of Public Health (FOPH), in collaboration with the ETH Zurich, launched a project to create a new Swiss Food Composition Database. As a result, the first version of the database was released in 2003 as a brochure with the title "Swiss Nutrient Value Table" as well as on CD-ROM [foph03]. In a co-financed project of the FOPH and the ETH Zurich, the database was updated in the period from October 2006 until May 2009. The current version 3.0.1 of the database contains 935 food items grouped in 13 food groups. For every food item, the nutritive data include carbohydrates, protein, fat, water, alcohol and energy. Additionally, most food items contain information about many other components as well. On average, values for 30 different food components are provided for each food.

Until recently, the database was managed using a Microsoft Access GUI (Graphical User Interface). Theresa Hodapp created a Web interface to the database as her master thesis in 2006. Since then, an online version of the database has been openly accessible by the public at http://www.swissfir.ethz.ch/datenbank/online_EN (Figure 2.1). The website also provides the possibility to order an E-Book or an offline CSV (Comma Separated Values) export.

2.1.3 COST Action 99

COST (Cooperation in Science and Technology) was a research programme by the European Union between 1994 and 1999. In the field of Food Science and Technology, COST was mainly concerned with improving food safety and food quality. 27 countries participated in this COST Action, including Switzerland with Florian Schlotke from the Computer Science Department of the ETH Zurich. One result of this international cooperation was a proposal for a structure for food composition databases [schlotke00]. This recommendation was aimed at improving the quality of food composition data and at simplifying data interchange on the European level.

2.1.4 European Food Information Resource (EuroFIR)

As part of the Sixth Framework Programme for Research and Technological Development of the European Union from 2005 to 2009, the European Food Information Resource (EuroFIR, http://www.eurofir.org) was established. The EuroFIR continued the initial efforts by the COST Action 99 and published a refined recommendation for food composition database management and data interchange [becker07], based on the former recommendation. Additionally, the EuroFIR came up with a catalogue of 34 quality questions divided into 7 categories [salvini07]. From the answers to these questions, the EuroFIR Quality Index can be calculated. This index is a measure of the quality of a single food composition data item. Section 2.4.3 explains the EuroFIR Quality Index in more detail.



Figure 2.1: Swiss Food Composition Database, Public Online Interface


2.2 The FoodCASE Project

2.2.1 Introduction

In 2007, the project FoodCASE (Food Composition And System Environment) was started by Karl Presser at the ETH Zurich as a research project. The goal of this project was to build a new Swiss Food Composition Database in cooperation with the SwissFIR, following the EuroFIR standard. This database can now be used to analyse the definition and measurement of diverse quality aspects. The requirements specification of the new system [presser08] was completed in 2008, and in 2009 the implementation was started.

2.2.2 Business Concepts

The FoodCASE application contains six main business concepts:

• A Single Food is a food item which was analysed by a laboratory or in any other way. A single food could be, for example, a Gravensteiner apple.

• A Single Food Component is a component, such as fat, of a single food. These values are typically taken from a laboratory report and entered into the system.

• An Aggregated Food is a “generic” food such as the typical apple eaten in Switzerland.

• An Aggregated Food Component is a component of an aggregated food. The most important difference to a single food component is that the value of an aggregated food component is not measured but calculated (aggregated) as a weighted mean. The idea behind this is to publish just one fat value for apples although many cultivars are available. In order to get a representative value, the market shares of the total apple consumption in Switzerland have to be considered when specifying the weights of the measurements of the different cultivars (see the worked example after this list).

• A Recipe is a special kind of aggregated food for which the exact amount of the ingredients and the preparation method is known. With this information it is possible to calculate the values of the resulting food from the values of the ingredients. The preparation method has to be known because different yield factors have to be applied. For example, if apples are fermented, they will yield alcohol, but if they are put in the oven, they will dry out and lose most of the water in them.

• A Reference is a document from which information has been extracted. This could be an article in a journal or a website.
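The following worked example illustrates the weighted mean used for aggregated food components; the cultivar values and market shares are invented for illustration only. With fat values of 0.10, 0.20 and 0.40 g/100g for three apple cultivars and market shares of 50%, 30% and 20% used as weights, the aggregated fat value is

\[
\bar{x} = \frac{\sum_i s_i x_i}{\sum_i s_i}
        = \frac{0.5 \cdot 0.10 + 0.3 \cdot 0.20 + 0.2 \cdot 0.40}{0.5 + 0.3 + 0.2}
        = 0.19\ \mathrm{g}/100\mathrm{g}.
\]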

The upper table in figure 2.2 shows the list of Single Foods in the FoodCASE database. Currently, there are 848 food items in the database. The table on the bottom lists all the Single Food Components of the selected single food. The selected row shows that a slice of gingerbread (German Lebkuchen) contains 13.75g of fat per 100g edible portion. The largest component though is carbohydrate with 61.57g/100g.



Figure 2.2: FoodCASE Application

2.2.3 Architecture

The FoodCASE application uses a Java Swing GUI and Java Web Start. The back-end is implemented using EJB3 (Enterprise Java Bean) session beans running on a JBoss 4.2.3GA (http://www.jboss.org/jbossas) application server. A PostgreSQL 9.0 (http://www.postgresql.org) database serves as the persistent data storage. Figure 2.3 shows the high-level architecture of the FoodCASE system, which is composed of six components:

• A PostgreSQL Database which stores all the business data of the application.

• A JBoss Application Server which runs EJB3 session beans. They provide services which can be used by the client modules. The main responsibility of the EJB layer is to take care of the persistence and make it transparent to the clients, so that they can work directly with the business objects. Most of the persistence logic is implemented in JPQL (Java Persistence Query Language). Some batch operations use plain SQL for the sake of better performance.



• A Content Management System (CMS) allowing the food compilers to manage the food composition data in the system. (Figure 2.2)

• An Administration Module for the system administrators. Among other settings, this module contains the user and the thesauri administration. The latter includes the units of measure that can be used in the system, the food components and a lot of different food and component categorisations. (Figure 4.3 on page 36)

• A Web Page which allows the public to query information about the composition of the foods available in Switzerland. (Figure 2.1)

• A Web Service to export a single food item or the whole database as a EuroFIR Food Data Transport Package (FDTP) V1.3. This is a standardised XML (Extensible Markup Language) format for food composition data interchange in Europe and was defined by [moller08].

Figure 2.3: Architecture of the FoodCASE system. The six modules are highlighted in different colours.

2.3 Outlook

In 2006, the FOPH initiated the project NANUSS (NAtional NUtrition Survey Switzerland) with the aim of answering the question of what people in Switzerland eat. A first pilot study was launched in November 2008. 1500 men and women were interviewed by phone and asked what they had eaten and drunk in the previous 24 hours. From the answers, a Swiss Food List was drawn up. The final report of this pilot study was published in September 2010. At present, the preparations for a large-scale study, which will take place in 2013, are in progress.



2.4 Data Quality

Systems and humans are not perfect, which can result in poor data. Because of this, data quality has been an issue since the first day of databases. Many databases even contain a surprisingly large number of errors. The consequences of poor data are manifold. However, what they have in common is that they cause extra work, which is expensive. For this reason, having good data is also important from an economic point of view.

Having noted this, it is rather surprising how little research has been done in the area of data quality. One of the classics is a paper from 1994 by Richard Wang et al. with the title "Beyond Accuracy: What Data Quality Means to Data Consumers" [wang94]. Although there is no clear definition of data quality, Wang recognised that in the end it is always the data consumer who decides whether the data are fit for a particular purpose. Following this conclusion, Wang defined data quality as data that are fit for use by data consumers.

There are a lot of data quality dimensions, or data quality attributes as Wang called them. They include accuracy (closeness to the true value), precision (reproducibility), timeliness, reliability, currency, completeness, relevancy, accessibility, interpretability and many more.

Wang et al. conducted a two-stage survey. In the first survey, 25 data consumers working in industry and 112 MBA (Master of Business Administration) students from a U.S. university were asked to list all terms that come to mind when thinking about data quality. This resulted in a list of 118 possible data quality attributes. In the second survey, 1500 randomly selected MBA alumni were asked to assess the importance of those attributes. The result of this work was a conceptual framework of data quality.

Figure 2.4: A Conceptual Framework of Data Quality [wang94]. The framework groups the data quality attributes as follows:

• Intrinsic Data Quality: Believability, Accuracy, Objectivity, Reputation

• Contextual Data Quality: Value-added, Relevancy, Timeliness, Completeness, Appropriate amount of data

• Representational Data Quality: Interpretability, Ease of understanding, Representational consistency, Concise representation

• Accessibility Data Quality: Accessibility, Access security

In this framework, the 15 most important attributes were divided into 4 groups:

• The Intrinsic Data Quality denotes the quality the data have in their own right.

• The Contextual Data Quality refers to the quality the data have within the task at hand.

• The Representational Data Quality highlights the importance of a concise and consistent representation.

• Finally, the Accessibility Data Quality underlines that data lose their value if they are not easily accessible.



2.4.1 Data Quality in Food Composition Databases

In science in general, but in food science in particular, data quality is of major importance. It is widely agreed that food composition data are useless if their origin is unknown. Yet decision making by governments or individuals is only possible if the data are reliable and trusted.

Often data from different sources are available. In this case, a means is required to judge which data are of the best quality or how the individual values should be combined to get more reliable data.

2.4.2 USDA Quality Index and Confidence Code

The initial data quality evaluation procedures developed by the USDA were manual processes to assess the quality of analytical data for iron, selenium and carotenoids in foods. In the course of a redesign of the software system used at the USDA, these procedures were taken a step further and a generic system was developed [holden01].

The five original evaluation categories Sampling Plan, Number of Samples, Sample Handling, Analytical Method and Analytical Quality Control were maintained, but the quality assessment questions were made more objective. According to the answers to these questions, a single numeric Quality Index (QI) is assigned to a nutritional component.

At aggregation, a Confidence Code (CC) is assigned to the combined value, which is calculated as the weighted mean of the individual values from the different sources of data. The CC is derived from the QIs by summing up the adjusted ratings of the individual values.

2.4.3 EuroFIR Quality Index

The EuroFIR developed a Quality Index [salvini07] similar to the USDA QI. In fact, they adopted the five categories from the USDA QI and additionally added Food Description and Component Identification. For each of the seven categories, a set of questions is defined. There are 34 questions in total, all but one of which can be answered with Yes, No or Not applicable. The following excerpt from the questions catalogue exemplifies what kind of questions have to be answered:

• Food Description

– Is the food group known?

– Was the food source of the food or the main ingredient provided?

– 15 more questions

• Sampling Plan

– Was the number of primary samples > 9?

– If relevant, were samples taken during more than one season of the year?

– 4 more questions

• 5 more categories


To each category, a score in the range from 1 (worst) to 5 (best) is assigned. The sum of the individual categories is the EuroFIR Quality Index, which obviously must lie between 7 and 35 points.

The standard formula to calculate the score of a category is

\[
\text{score} = \frac{\text{number of criteria answered positively} \times 5}{\text{total number of criteria judged relevant}} \tag{2.1}
\]
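As a purely illustrative calculation (the numbers are invented and not taken from the EuroFIR catalogue): if 12 criteria of a category are judged relevant and 9 of them are answered positively, equation 2.1 yields

\[
\text{score} = \frac{9 \times 5}{12} = 3.75 .
\]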

Special cases are Component Identification and Sample Handling, for which the minimum of the answers is taken. Number of Analytical Samples gets a score equal to the number of analytical samples used, but at most 5, and for Analytical Quality Control a more complicated special rule is defined.

The result of equation 2.1 is a number in the range from 0 to 5. However, the minimum per category is defined as 1. In this sense, the specification is contradictory. In FoodCASE, this problem is mitigated by using a modified formula:

\[
\text{score} = \max\left(\frac{\text{number of criteria answered positively} \times 5}{\text{total number of criteria judged relevant}},\ 1\right) \tag{2.2}
\]

The Data Quality Analysis Toolkit, however, is intended to be a generic tool, which should be easily separable from FoodCASE, so that it can be integrated into another system. For this reason, we decided that we don't want to "pollute" the code with any special logic which is only usable in this specific context. But the toolkit provides a feature which we refer to as the user-defined scale. Details about this feature can be found in section 3.2.7. Basically, it allows the user to specify a linear transformation of the data quality scores, which the toolkit internally maintains as percentage values. In essence, the Data Quality Analysis Toolkit calculates the values of the categories of the EuroFIR QI as

\[
\text{score} = \frac{\text{number of criteria answered positively} \times 4}{\text{total number of criteria judged relevant}} + 1 \tag{2.3}
\]

which, from our point of view, would have been the natural way to define it in the first place: with no positive answers the score is 1, and with all relevant criteria answered positively it is 5, matching the defined range. Since it is out of the scope of this master thesis to initiate a proposal to change the definition of the formula in the EuroFIR specification, the EuroFIR QIs displayed in the toolkit slightly deviate from the values calculated by FoodCASE.

Furthermore, the specification states that the score of each category should be rounded to the nearest integer. A disadvantage of this rule is that it could happen for a data record that, by coincidence, the scores for all of the seven categories are rounded down, while for a data record which is only slightly better in every category, the scores are always rounded up. This way, the difference between the two data records might appear to be much bigger than it is in reality. For this reason, the rounding rule is implemented neither in FoodCASE nor in the Data Quality Analysis Toolkit.

Finally, there is a special rule for the category Sampling Plan. Here, the specification says that some criteria may have more weight than others, depending on the food in question. We suspect that food compilers would not always agree on this matter. Therefore, FoodCASE does not allow the weights to be adjusted, which we think is in the interest of a more objective quality attribution.


3 Data Quality Analysis Toolkit

3.1 Analysing the Data Quality in the FoodCASE Database

In this chapter, we show how the Data Quality Analysis Toolkit can be used to assess the quality of the food composition data, and how problems can be identified. First, we provide a look at the toolkit from a user's perspective, by means of three concrete scenarios. After that, we give a more conceptual description of the toolkit in section 3.2.

3.1.1 Scenario 1: Identifying Missing Data

Scenario: A food compiler is responsible for a national food composition database, which is accessible by the public via an online interface. On the website, it is declared that for every food the component energy is provided. Now the compiler gets an e-mail from a user claiming that for some foods the component energy is missing. He wants to know whether this is true and, if so, which foods are affected by this problem.

Task at hand: Find all foods for which the component energy is missing.

Procedure:

1. Log in to the FoodCASE application and start the Data Quality Analysis Toolkit (Tools → Data Quality Analysis or Ctrl-D).

2. On the start screen, select the data quality analysis tree definition of interest. In this case, select Aggregated Food Data Quality as depicted in figure 3.1. All the other settings can be skipped for the moment and the default values can be used.

3. Click on the Go button. All the relevant data will now be fetched from the database and the presentation of the data quality will be prepared. Upon completion, the data quality tree will appear as shown in figure 3.2. All nodes are rendered as progress bars, indicating the data quality of the criteria they represent.


Figure 3.1: Start Screen of the Data Quality Analysis Toolkit

4. (Optional: Rotate the tree to save space on the screen by enabling the Rotate view option. When the tree is rotated, its root is no longer on the top, but on the right-hand side.)

5. Select the node Mandatory Components and uncollapse it using the context menu or by pressing U on the keyboard. Now all the children of the node Mandatory Components will become visible.

6. Select the node For every food, energy must be provided and switch to the Problem Table view using the context menu or by clicking on the Table button in the Problem Views panel on the top of the window.

7. In figure 3.3, it can be seen that four aggregated foods for which the component energy is missing have been identified. By double-clicking on a row, the data record is opened and the problem can be fixed.


Figure 3.2: Data Quality Tree: Aggregated Foods

Figure 3.3: Problem Table: Aggregated Foods with missing Component Energy


3.1.2 Scenario 2: Analysing Trends over Time

Scenario: Last year, the EuroFIR Quality Index was introduced in a food composition database and the staff was instructed to always answer the quality questions whenever they enter data into the system. Now the manager wants to verify that people really follow his directions, and he wants to get an overview of how good the data in their system are in terms of the EuroFIR QI.

Task at hand: Check if the EuroFIR QI of new data is better than that of old data.

Procedure:

1. Log in to the FoodCASE application and start the Data Quality Analysis Toolkit.

2. On the start screen, select the tree definition Single Component Data Quality.

3. Specify the grouping criteria as Group by Year of Mutation.

4. Select the presentation style Data Quality Tree.

5. Click on the Go button.

6. Right-click on the tree node EuroFIR Quality Index and select Data Quality Views → Line Chart in the context menu or press 4.

7. The chart as shown in figure 3.4 appears. The green line for the year 2011 shows that only one data record has a EuroFIR QI of 7, which means that for all other records the quality questions have been answered. Remember that the EuroFIR QI ranges from 7 at worst to 35 at best. When hovering with the mouse over the line, a tool tip appears which provides statistical information about the data series. For example, it can be seen that the maximum EuroFIR QI achieved in 2011 is only 16.34, which is not that good. In 2010 (blue line), for more than 95% of the data records the quality questions have not been answered. Nevertheless, there are a few data records with a really good QI ranging up to 30 points.

3.1.3 Scenario 3: Grouping by User

Scenario: A food composition database is used by different data providers who directly enter their measurement results into the system. To increase the motivation to enter good data into the system, the best data producers should be rewarded.

Task at hand: Determine the user who enters the best data into the system.

Procedure:

1. Log in to the FoodCASE application and start the Data Quality Analysis Toolkit.

2. On the start screen, select the tree definition Single Component Data Quality.

3. Specify the grouping criteria as Group by Mutation User.

4. Select the presentation style Data Quality Bar Chart.

5. Click on the Go button.

6. A bar chart will appear as shown in figure 3.5. It is obvious that the user represented by the green bar enters the best data. The grey user follows in second place.


Figure 3.4: Data Quality Line Chart: EuroFIR Quality Index

Figure 3.5: Data Quality Bar Chart: Group by Mutation User


3.2 The Concepts Behind

The idea behind the Data Quality Analysis Toolkit is based on the concept of a Requirement-oriented Data Quality Model (RODQ) developed by Karl Presser [presser11]. A RODQ model is a conceptual model that describes the data quality requirements of an information system independently of its implementation. In this, it complements other established modelling languages like ERD (Entity Relationship Model) or UML (Unified Modelling Language). Furthermore, the RODQ model describes how to assess the data quality.

3.2.1 Data Quality Requirement

The central elements of a RODQ model are the Data Quality Requirements. Presser distinguishes between three types of data quality requirements:

• A Hard Constraint is a requirement which absolutely must be fulfilled. If a hard constraint is violated, the data are invalid and cannot be used. Therefore, the system should enforce hard constraints and not allow any data to be stored unless all hard constraints are satisfied.

• A Soft Constraint is a requirement which is highly recommended to be fulfilled. If a soft constraint is not satisfied, data quality decreases. However, it might not always be possible to adhere to all soft constraints. Hence, they cannot be enforced by the system.

• An Indicator is weaker than a soft constraint, but it is assumed that it can be used as an indication of data quality. A typical example is the age of the data. If the data are old, they are likely to be outdated, but it is still possible that they are correct.

In the Data Quality Analysis Toolkit, it is possible to specify the type of every requirement, and it will be rendered accordingly (compare with figure 3.6). The user should take the requirement type into account when specifying the weights (importance) of the requirements, as we explain in the next section. But apart from the way the requirement is displayed, the requirement type has no further influence.

Following Presser, a quality requirement is always associated with a data quality object, where a data quality object corresponds to a real world object on which the data quality check should be performed. In the Data Quality Analysis Toolkit, the data quality objects are the database entities corresponding to the business concepts mentioned in section 2.2.2: Single Food, Single Component, Aggregated Food, Aggregated Component, Recipe and Reference. In the following, we will refer to these six entities as the Data Quality Entities.

Each quality requirement contains an SQL statement to be executed when the data quality is assessed. This SQL statement has to map every data record in the quality entity to a data quality value between 0.0 (worst) and 1.0 (best). If a requirement is not applicable to a certain data record, NULL may be returned. For example, if a database table contains data about customers, a data quality requirement could be defined as "For every person, name and first name must be provided". Now if there is a customer for whom only the name but not the first name is known, the quality requirement is only partially fulfilled. So the data quality could be defined to be 0.7 (70%) because usually the name is more important than the first name. If a customer is not a person but a company, the quality requirement is not applicable, so NULL should be returned. An example of how such an assessment SQL could look can be found in figure 4.3 on page 36. More about this topic will follow in section 4.4.
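A minimal sketch of how the assessment SQL for this customer example could look is given below. The customer table, its columns and the chosen partial scores are hypothetical and only illustrate the contract described above (one quality value between 0.0 and 1.0 per record, or NULL if the requirement is not applicable); the assessment SQLs actually used in FoodCASE follow the format shown in figure 4.3.

-- Hypothetical customer table: assessment SQL for the requirement
-- "For every person, name and first name must be provided".
-- Each record is mapped to a quality value between 0.0 (worst) and 1.0 (best);
-- NULL means the requirement is not applicable to this record.
SELECT c.id,
       CASE
         WHEN c.customer_type = 'COMPANY' THEN NULL                     -- not applicable
         WHEN c.name IS NOT NULL AND c.first_name IS NOT NULL THEN 1.0  -- fully fulfilled
         WHEN c.name IS NOT NULL THEN 0.7                               -- only the (more important) name known
         WHEN c.first_name IS NOT NULL THEN 0.3                         -- only the first name known
         ELSE 0.0                                                       -- neither provided
       END AS data_quality
  FROM customer c;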

3.2.2 Data Quality Analysis Tree Definition

Once all data quality requirements are gathered, similar quality requirements can be grouped together. This process can be repeated recursively until ending up with a data quality tree. The root node will then be one single number representing the overall data quality. Since the grouping of the quality requirements may be a matter of individual taste, it is possible to define different data quality analysis trees using the same quality requirements. Similarly, the importance of the quality requirements may be controversial. Because of this, the weight of each quality requirement can be specified in each tree definition independently.

Side note: Although highly discouraged, the model supports using a quality requirement in more than one aggregation node. Strictly speaking, it is no longer a tree in this case, but just an acyclic graph. However, for the sake of simplicity and better understanding, we continue to refer to it as a tree.

Figure 3.6: A very simple Tree Definition showing the different requirement types: hard constraint, soft constraint and indicator. The aggregation node on level 2 uses aggregation type mean. The root node uses weighted mean with the weights 3.0 and 1.0.


The Aggregation Type of an aggregation node determines how the data quality value of the aggregation node is computed from its direct children. The Data Quality Analysis Toolkit provides six different aggregation types:

• Mean

• Median

• Minimum

• Maximum

• Geometric Mean

• Weighted Mean

The first five aggregation types need no further configuration. In case of Weighted Mean, a weight has to be assigned to every in-edge. By default, every in-edge is given a weight of 1.0.
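Spelled out, the Weighted Mean aggregation computes the value of an aggregation node with child values DQ_1, ..., DQ_n and in-edge weights w_1, ..., w_n as the usual weighted mean

\[
DQ_{\text{node}} = \frac{\sum_{i=1}^{n} w_i \, DQ_i}{\sum_{i=1}^{n} w_i},
\]

so the root node of figure 3.6, with weights 3.0 and 1.0, is computed as (3.0 · DQ_1 + 1.0 · DQ_2) / 4.0.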

Figure 3.7: The Tree Editor, which allows the user to create their own tree definitions. It provides three ways to modify a tree definition: by using the context menu, the keyboard shortcuts or simply by dragging a new node from the tree on the left-hand side into the graph area and dropping it. If a tree node or edge is selected, a table will appear on the right-hand side showing the properties of the selected object. Those properties printed in bold face can be modified by the user. Instead of always starting from scratch, it is also possible to copy a tree definition from someone else and modify it.


3.2.3 Data Quality Assessment

A Data Quality Assessment is the point in time when all the quality requirements are checked and the results are stored in the database for later analysis. This means that a snapshot of the current data quality in the system is taken. The Data Quality Analysis Toolkit can only be used if a quality assessment has been run previously. Hence, a quality assessment has to be triggered from time to time, in order that the data quality values calculated by the toolkit are up to date. This can be done either

• manually by the system administrator or

• automatically by a timer.

The timer interval can be configured by the system administrator as we explain in section 4.5.

3.2.4 Filtering and Grouping

As already seen in the scenarios, it is possible to filter or group by a variety of properties. For all six data quality entities, the following filters are provided:

• Filter by Id (only consider a single data record)

• Filter/group by creation user

• Filter/group by year of creation

• Filter/group by mutation user

• Filter/group by year of mutation

Additionally, every data quality entity has its specific filters. For example, the aggregated foods can be filtered by the version of the database.

3.2.5 Long-term Analysis

Often one is interested in analysing how the data quality in the system has evolved over time. To enable this, the Data Quality Analysis Toolkit provides different possible approaches. As mentioned in the previous section, any of the six data quality entities can be grouped by year of creation or year of mutation. Single food components can additionally be grouped by year of compilation, year of generation and year of evaluation.

When a new version of the database is published (only aggregated foods and recipes), a snapshot of the relevant database tables is taken and all the data in them are copied. This enables us to group aggregated foods, aggregated food components, recipes and references by version of the database, giving us another kind of long-term analysis.

3.2.6 Shortcut from FoodCASE

In order to allow a data record to be quickly opened in the Data Quality Analysis Toolkit, a shortcut button is provided in the FoodCASE application on every detail window (compare with arrow 3 in figure 7.1 on page 64). When clicking on it, the toolkit opens and Filter by Id is preselected on the start screen. In this case, it is possible to calculate the data quality values online, rather than taking the historical values from a previous quality assessment.


3.2.7 Percentage Scale vs. User-defined Scale

By default, all the data quality values in the Data Quality Analysis Toolkit are displayed as percentages, as seen in figure 3.2 and figure 3.5. In certain situations it is desirable though to use a user-defined scale. This might be the case if, for historical reasons, your institute uses a scale other than percentage to measure data quality. A practical example is the EuroFIR Quality Index where, by definition, each of the seven categories is assigned a value between 1 and 5. The total of the EuroFIR QI is in the range from 7 to 35. To allow this, it is possible to define a linear transformation of the data quality value into a user-defined scale. The only thing which has to be specified on the corresponding tree node in the tree definition is the minimum and maximum value of the preferred scale (see the table on the right-hand side in figure 3.7). The data quality will then be calculated as

\[
DQ_{\mathrm{user}} = DQ_{\%} \cdot (\mathrm{max}_{\mathrm{user}} - \mathrm{min}_{\mathrm{user}}) + \mathrm{min}_{\mathrm{user}} \tag{3.1}
\]

and is displayed in the format

\[
DQ_{\mathrm{user}}\ [\mathrm{min}_{\mathrm{user}}, \mathrm{max}_{\mathrm{user}}] \tag{3.2}
\]

(Compare with the range axis in figure 3.4.) It is possible to switch between the percentage scale and the user-defined scale using the radio buttons on the right-hand side of the view. If the user-scale minimum is not defined, its default is 0; if the maximum is not defined, the sum of the user-scale maxima of the children of the tree node in question is taken.
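As a worked example with the EuroFIR QI scale (min_user = 7, max_user = 35) and an illustrative internal value of DQ_% = 0.6, the displayed value is

\[
DQ_{\mathrm{user}} = 0.6 \cdot (35 - 7) + 7 = 23.8\ [7, 35].
\]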


3.3 Views

3.3.1 Overview

The Data Quality Analysis Toolkit provides a total of 11 different views, which can be divided into two categories:

• The Data Quality Views visualise the data quality of the selected tree node and its direct children. The only exception is the tree view, which does not only display the selected node but the entire data quality tree. The purpose of the data quality views is to give an overview of how good or bad the data are.

• The Problem Views on the other hand provide the possibility to drill down on the data quality issues and identify individual problems.

The following table lists all the available views and their options.

Table 3.1: Views and View Options (the number in parentheses is the number of view options available for each view)

Data quality views: Bar Chart (7), Box Plot (5), Histogram (6), Line Chart (7), Spider Chart (4), Tree (4), Tree Table (4)

Problem views: Bar Chart (8), Pie Chart (3), Spider Chart (6), Problem Table (2)

View options and the number of views they apply to: Auto Scale (7), Draw Swim Lanes (1), Follow Selection (1), Percentage Scale/User Scale (all 11), Plot Orientation Vertical/Horizontal (5), Render as Progress Bar (1), Rotate (1), Show Data Points (1), Show Parent/Show Children (8), Show Range/Domain Axis as Percentage (4), Show Standard Deviation (1), Threshold Slider (4), Use Grouping (9), Use Legend (2)

In the next three sections, every view and the view options are described.


3.3.2 Data Quality Views

Data Quality Bar Chart
The data quality bar chart view shows the mean data quality of the selected tree node and its contributing direct children. Optionally, the standard deviation can be displayed by means of error bars. Refer to figure 3.5 on page 17 for an example.

Data Quality Box Plot
The data quality box plot view graphically depicts the seven-number summary of the selected tree node and its children, as explained in the figure below.

Figure 3.8: A Box Plot visualises the seven-number summary: The red box represents the middle 50% (Q1-Q3) of the data. The black line is the median; the black circle is the mean value. The whisker includes 90% of the data (5%-95%). Minimum and maximum are displayed as red circles. Here, the plot orientation is horizontal. However, it is more usual to use the vertical orientation for box plots.

Data Quality Histogram
The data quality histogram view plots the data quality distribution of the selected tree node.

Data Quality Line Chart
The data quality line chart plots each data record ordered by increasing data quality. This shows the distribution of the data quality. If the Show Data Points option is enabled, it is possible to click on a data point and the detail window of this record will be opened. The line chart also supports zooming in by clicking the left mouse button and dragging. See figure 3.4 on page 17 for an example.

Data Quality Tree
The data quality tree view shows the entire data quality tree. Each node is labelled with the mean data quality. Nodes (and edges) can be selected by clicking on them. A table will appear which shows the statistical properties of the selected node. In order to switch to another view, the buttons on top or the context menu (right click) can be used. The lowest level can be collapsed/uncollapsed using the context menu or the keyboard shortcuts. An example can be found in figure 3.2 on page 15.

Data Quality Tree Table
The data quality tree table view shows the statistical properties of the selected tree node and its children in tabular form. Tree nodes can be further expanded until arriving at the leaves of the tree. Available columns are the node name, node type, node Id, level, index and the statistical properties: mean, standard deviation, maximum, 5% percentile, 25% percentile, median, 75% percentile, 95% percentile and the minimum. Columns can be added or removed by right-clicking on the header row. See figure 3.10 for an example.


Data Quality Spider Chart
The data quality spider chart shows the profile of the selected tree node. The further away the data points are from the centre, the better. The outermost point on the spokes (black lines) corresponds to a data quality of 100% (if the Auto Scale view option is disabled). When moving the mouse onto a data point, a tool tip appears showing the statistical details of this point.

Figure 3.9: Example of a Data Quality Spider Chart

3.3.3 Problem Views

All of the following four problem views feature a slider with which the data quality threshold can be specified. Additionally, it can be chosen whether data records with a data quality lower than the threshold or records with a quality equal to or higher than the threshold should be considered. This means that it is not only possible to identify data records with problems, but it is also possible to find data records which are of particularly good data quality. Furthermore, all charts support tool tips, i.e. when pointing with the mouse at a data series, its statistical properties will be listed.

Problem Bar Chart
The problem bar chart view illustrates the number of problems for the selected tree node and its children. The range axis can either display the absolute number of problems, or it can be expressed as a percentage (number of problems divided by the total number of data records).

Problem Pie Chart
The problem pie chart view shows the relative amount of problems for the children of the selected tree node.

Problem Spider Chart
The problem spider chart compares the relative amount of problems across the children of the selected tree node.

Problem Table
The problem table view lists all data records for the selected tree node which have a quality less than the specified threshold. An example can be found in figure 3.3 on page 15.


3.3.4 View Options

The following list explains all view options. Each option has either global or per-view scope. Global view options, such as the threshold slider, are adjusted on all views simultaneously, while other view options can be specified on every view independently. The view options are rendered on the right-hand side of the view and are stored per user. This means that the preferences will be restored the next time the same user opens the Data Quality Analysis Toolkit.

Use Grouping
Scope: per view
Default: true if a grouping criterion has been selected on the start screen; N/A otherwise
Available for: all views except Quality Tree and Problem Table
Description: The Use Grouping option is only available if a grouping criterion has been specified on the start screen. If so, Use Grouping controls whether

• the total and the value per group are displayed or

• if grouping is disabled, the value of the selected tree node and its direct children is shown.

Show Parent/ Show Children
Scope: per view
Default: default value depends on the view
Available for: all views except Quality Tree, Problem Table and Problem Pie Chart
Description: The semantics of Show Parent and Show Children depends on whether grouping is enabled or not.

• If grouping is disabled

– Show Parent toggles whether the selected tree node is displayed in the chart

– Show Children toggles whether the direct children of the selected tree node are displayed in the chart

• If grouping is enabled

– Show Parent toggles whether the total is displayed in the chart

– Show Children toggles whether the different groups are displayed in the chart

Percentage Scale/ User Scale
Scope: global
Default: percentage scale
Available for: all views
Description: Controls whether the percentage scale or the user-defined scale is used. Please refer to section 3.2.7 for a detailed description of these two options.


Threshold Slider
Scope: global
Default: 50%, lower than (show bad data)
Available for: all problem views
Description: The Threshold Slider allows the user to define what percentage of data quality is required for a data record to be of sufficient quality. In other words, every data record with a data quality lower than the value of the threshold slider is considered to be deficient. Below the slider, there are two radio buttons to decide whether the bad data or the good data should be shown. Compare with figure 3.3 on page 15.

Plot Orientation (vertical or horizontal)
Scope: global
Default: vertical
Available for: Quality Bar Chart, Quality Box Plot, Quality Histogram, Quality Line Chart and Problem Bar Chart
Description: Using this option, the preferred plot orientation can be changed from vertical to horizontal, meaning that the chart is rotated by 90 degrees.

Draw Swim Lanes
Scope: per view
Default: true
Available for: Quality Tree
Description: If enabled, swim lanes will be drawn in the background of the data quality tree. This makes it easier to recognise on which level a tree node resides. Levels are numbered from bottom to top (leaf to root) starting from 1.

Follow Selection
Scope: per view
Default: true
Available for: Quality Tree
Description: If enabled, the selected node will be automatically centred on the screen. As the animations are smooth, this makes it a simple way to navigate the tree if it is too large to fit on the screen. It is even possible to rotate the tree to save space, or it can be zoomed out.

Rotate
Scope: per view
Default: Default
Available for: Quality Tree
Description: In the Data Quality Analysis Toolkit, it is possible to rotate the data quality tree by 90 degrees clockwise, so that the root is on the right-hand side. This has proven to save a lot of space on the screen.


Auto Scale
Scope: per view
Default: false
Available for: Quality Bar Chart, Quality Box Plot, Quality Histogram, Quality Line Chart, Quality Spider Chart, Problem Bar Chart and Problem Spider Chart
Description: If enabled, the range axis will be scaled automatically. The lower bound is always 0, but the upper bound will be adjusted according to the maximum value of the data series.

Use Legend
Scope: per view
Default: true
Available for: Quality Bar Chart and Problem Bar Chart
Description: If enabled, the data series will be rendered in different colours and a legend will be used to identify the series; if disabled, all series will be coloured the same and the name of the series will be rendered on the domain axis.

Render as Progress Bar
Scope: per view
Default: true
Available for: Quality Tree Table
Description: If enabled, the statistical properties of the selected node (minimum, mean, etc.) are rendered as progress bars; if disabled, they are rendered textually as percentage values.

Figure 3.10: Tree Table with the Render as Progress Bar view option enabled


Show Range/Domain Axis as Percentage
Scope: per view
Default: true
Available for: Quality Histogram, Quality Line Chart, Problem Bar Chart and Problem Spider Chart
Description: If enabled, the range/domain axis will be labelled with percentages; if disabled, the absolute number of data records will be used.

Show Data Points
Scope: per view
Default: false
Available for: Quality Line Chart
Description: If enabled, every single data point is rendered on the chart, instead of only the line. When clicking on a data point, the detail window of the corresponding data record will open. This feature is well suited to identifying outliers.

Show Standard Deviation
Scope: per view
Default: false
Available for: Quality Bar Chart
Description: If enabled, an error bar indicating the standard deviation will be rendered on top of the bar.

Figure 3.11: Bar chart with error bars showing the mean value +/- standard deviation.


4 Administration Module

Chapter 4 describes the part of the FoodCASE administration module which is related to the Data Quality Analysis Toolkit. It is primarily of interest to system administrators who want to configure the toolkit. Most importantly, this includes defining the quality entities and their data quality requirements. Furthermore, it is possible to specify the filter and grouping criteria which will be available on the start screen of the Data Quality Analysis Toolkit. Quality assessments can be run manually or a timer can be set up. Lastly, a number of maintenance operations are provided to system administrators.
The admin module is a standalone Java Web Start application separated from the normal FoodCASE application. It can only be accessed by users who have the “admin” role.

4.1 FoodCASE Data Model

Figure 4.1 shows the data model of the FoodCASE application. The boxes correspond to database tables and the arrows represent foreign-key relationships. The data model has three “levels”: Single Food, Aggregated Food and Recipe. Around those three areas, there are a lot of database tables which contain rather static meta data.
In order to determine for which tables the data quality should be checked, we coloured each table according to where the data originate from. The colours are explained in the text following the close-up look at the Single Food table in figure 4.2.


Figure 4.1: Data Model of the FoodCASE Application


Figure 4.2: Data Model of the FoodCASE Application: Close-Up Look at the Single Food Table (marked in red in the previous figure).

• yellow: the six data quality entities on which the data quality checks have to be performed.

• purple: tables which directly depend on the quality entities.

• light green: the thesauri. A thesaurus is usually defined by some standards body and imported by the system administrator. In some cases it would be possible to define data quality requirements. However, since only the defining standards body may modify these data, we abstained from defining any rules.

• orange: imported data. Here it would be possible to define quality requirements as well. However, these data are only imported into FoodCASE but not maintained within the system. For this reason we did not define any rules at the present stage.

• grey: tables which are not used yet/anymore. Hence, no data quality requirements are needed at the moment.

• brown: tables which cannot be administrated from the GUI. Most of them contain legacy data.

• turquoise: tables for which no validation rules have been defined yet. At a later stage it might make sense to have a closer look at these tables.

• dark green: tables which contain technical data and need no validation.


4.2 Quality Entities

A Quality Entity is a first-class database entity (a database table) on which a set of data quality checks (quality requirements) should be performed. Typically, a quality entity corresponds to a physical object such as a Single Food. In the FoodCASE application, there are six quality entities, namely the business concepts described in section 2.2.2: Single Food, Single Component, Aggregated Food, Aggregated Component, Recipe and Reference.

A quality entity is defined by five properties:

Display Name: A string which is used for display purposes. Example: Single Food

Entity Name: The physical name of the database table. Example: tblsinglefood

Alias: A short alias for the database table. Example: f

Key column: The name of the primary key column. Example: f.idsinglefood

Condition: An optional condition which characterises the “active” data records. Example: f.singlefoodhidden = false

4.3 Quality Entity Filters

On each quality entity, filter/grouping criteria can be defined. Here, two cases have to be distinguished:

• The label of the grouping criterion is contained in the same database table. For example, the creation date of the data record.

• The quality entity contains a foreign key, and another database table has to be joined in order to retrieve the label of the grouping criterion. For example, this is the case for the language of a reference, since the languages are stored in a separate table.

A quality entity filter is defined by seven properties:

Entity Id: The Id of the quality entity to which the filter belongs. Example: 15

Filter Name: The display name of the filter. Example: Reference Language

Filter Column: The column to filter on. Example: referenceidlanguage

Join Table: The name of the database table which has to be joined to retrieve the filter label, or NULL if no table has to be joined. Example: tbllanguage

Join Column: The primary key column of the other table. Example: idlanguage

Name Column: The column which contains the filter label. Example: languagename

Apply Condition: true if the condition specified in the quality entity should be applied; false otherwise.


4.4 Quality Requirements

The Quality Requirements view is probably the most interesting one. Here it is possible to define the quality checks which have to be performed whenever a Quality Assessment is run. As already mentioned in section 3.2.1, a quality requirement belongs to a quality entity and has a type:

Entity Id: The Id of the entity on which the quality check has to be performed. Example: 16

Name: The display name of the quality requirement. Example: Mean >= Min

Description: An optional documentary description.

Type: Hard constraint, Soft constraint or Indicator

Assessment SQL: The SQL statement which is executed when a Quality Assessment is performed. This SELECT statement must return one row for every data record in the quality entity. Each row must have the format {requirementId, assessmentId, refId, value}. Value is the data quality value of the data record identified by refId. It must be a number between 0.0 (worst quality) and 1.0 (best quality). NULL may be returned to express that the quality requirement is not applicable to a certain data record.

In the assessment SQL, the place holders described in table 4.1 can be used.

Place Holder    Replacement Value                   Example
{ 0 }           <requirementId>,<assessmentId>      150, 81
{ 1 }           <entity> <alias>                    tblaggrfoodcomponent ac
{ 2 }           <keycolumn>                         ac.idaggrfoodcomp
{ 3 }           <condition>                         ac.aggrfoodcompidversion = 1
{ 4 }           <alias>                             ac

Table 4.1: Quality Requirement: Assessment SQL Place Holders
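
To make the substitution concrete, the following is a minimal Java sketch of how such a template could be expanded. The template string, the helper class and its method name are illustrative assumptions and not the toolkit's actual code; only the place holders and the example values are taken from table 4.1.

import java.util.HashMap;
import java.util.Map;

// Sketch of the place-holder substitution described in table 4.1 (names partly assumed).
public class AssessmentSqlBuilder {

    // Replaces the place holders { 0 }..{ 4 } in an assessment SQL template.
    public static String substitute(String template, long requirementId, long assessmentId,
                                    String entity, String alias, String keyColumn, String condition) {
        Map<String, String> replacements = new HashMap<>();
        replacements.put("{ 0 }", requirementId + ", " + assessmentId);
        replacements.put("{ 1 }", entity + " " + alias);
        replacements.put("{ 2 }", keyColumn);
        replacements.put("{ 3 }", condition);
        replacements.put("{ 4 }", alias);

        String sql = template;
        for (Map.Entry<String, String> e : replacements.entrySet()) {
            sql = sql.replace(e.getKey(), e.getValue());
        }
        return sql;
    }

    public static void main(String[] args) {
        // Hypothetical template for a "Mean >= Min" style requirement; the column names are assumptions.
        String template = "SELECT { 0 }, { 2 }, CASE WHEN { 4 }.mean >= { 4 }.minimum "
                        + "THEN 1.0 ELSE 0.0 END FROM { 1 } WHERE { 3 }";
        String sql = substitute(template, 150, 81,
                "tblaggrfoodcomponent", "ac", "ac.idaggrfoodcomp", "ac.aggrfoodcompidversion = 1");
        System.out.println(sql);
    }
}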

On the next page, figure 4.3 shows a screenshot of the detail window of a quality requirement, including an example of a quality assessment SQL statement. The full list of quality requirements for the FoodCASE database can be found in appendix B.


Figure 4.3: Administration Tool: Edit a Quality Requirement

4.5 Quality Assessments

The Quality Assessments tab lists all quality assessments in the database with start date, end date and a remark. Quality assessments can be deleted and the remark can be changed.
The Run Quality Assessment view allows the system administrator to manually trigger a quality assessment. It is recommended to do this outside of office hours because this operation noticeably slows down the FoodCASE system.
On the Quality Assessment Timer screen, a timer which regularly triggers a quality assessment can be set up. The timer has three properties, of which two are writeable:

Start: The start time when the timer should go off. Preferably, this should be in the middle of the night when nobody is working with the FoodCASE system. Example: 3:00 AM

Interval: The repetition interval in days. It is not recommended to run quality assessments too often because they require a lot of database space. With the current size of the FoodCASE database, a single run requires about 250MB of disk space for the data, 600MB for the database indices and 60MB for the file system cache! Example: 90 days

Next Execution: This is a read-only property which indicates when the next quality assessment will be run.


4.6 Maintenance

The maintenance panel provides three tools for system administrators:

• Run Vacuum Analyse: This will run a vacuum analyse on the Postgres database table which contains the quality requirement values. This is the database table where all the data quality data are stored. The vacuum analyse command optimises the internal data structures of the database. It should not be run while users are working with the FoodCASE application (a minimal JDBC sketch follows this list).

• Truncate Quality Requirement Values: This will delete ALL data quality measurements. To prevent a user from running this command by mistake, he explicitly has to type “yes” to confirm that he is sure he wants to do that.

• Clear File System Cache: This will clear the file system cache on the application server. The file system cache is logically located between the application server and the database. Please refer to section 5.5.2 on page 45 for more information.
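
For illustration, the vacuum analyse action essentially boils down to a single SQL command. The following is a minimal JDBC sketch under assumed connection settings; it is not the actual FoodCASE code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of the "Run Vacuum Analyse" maintenance action; URL and credentials are placeholders.
public class VacuumAnalyse {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/foodcase", "admin", "secret");
             Statement stmt = con.createStatement()) {
            // VACUUM cannot run inside a transaction block, so auto-commit must stay enabled.
            stmt.execute("VACUUM ANALYZE tblqualityrequirementvalue");
        }
    }
}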


5 Implementation

The high level architecture of the FoodCASE system was already explained in section 2.2.3. In this chapter we explain the implementation of the Data Quality Analysis Toolkit in more detail.
The Data Quality Analysis Toolkit is embedded and fully integrated into the FoodCASE application. Nevertheless, we tried to keep the Data Quality Analysis Toolkit as independent as possible from FoodCASE. The reason for this is that it should be a generic tool which can easily be integrated into any other system building on top of an RDBMS1. The only source code dependencies between the Data Quality Analysis Toolkit and FoodCASE are some GUI components we reused in the toolkit, so that it integrates visually nicely into FoodCASE.
In the following, we ignore the FoodCASE application and only focus on the code which is related to the Data Quality Analysis Toolkit.

5.1 Project Structure

The source code of the Data Quality Analysis Toolkit is spread over four of the six FoodCASE modules (Figure 2.3 on page 8). These four modules are realised as NetBeans2 projects:

• The FoodCASE Server project contains four stateless session beans related to the Data Quality Analysis Toolkit:

– The QualityAnalysisBean implements all the persistence logic needed by the toolkit.

– The QualityAnalysisTimerBean is an EJB3 timer. When the timer goes off, a quality assessment is triggered. In section 4.5, it was explained how the system administrator can configure the timer.

1 Relational Database Management System
2 http://www.netbeans.org


– The QualityAnalysisPrototypeBean contains the back-end of our prototype implementation. The prototype is actually obsolete and can be dropped after completion of this master thesis. We kept it only because any prototypes were supposed to be handed in. The prototype can be accessed from the GUI by first starting the Data Quality Analysis Toolkit and then choosing Developer → Start Prototype in the menu. All the files belonging to the prototype are clearly marked, so that they can be removed from the repository within 5 minutes.

– The QualityPreventionBean implements the back-end functionality needed by our enhancements of the “Data Quality Prevention” framework, which we discuss separately in section 7.1.

• The FoodCASE Lib project is used by the client as well as by the admin project. It contains all the entity beans, the session interfaces and a number of utilities.

• The FoodCASE Client project is where the implementation of the FoodCASE CMS can be found. We showed a screenshot of it in figure 2.2 on page 7.

• The FoodCASE Admin project contains the administration tool.

5.2 Code Metrics

Code metrics are a well-suited means to get a high-level overview of where the complexity of a software project resides. Table 5.1 only considers the code we wrote for this master thesis; the code of the existing FoodCASE application is not included in these numbers.
The first thing to notice is that almost two thirds of the code is located in the client project (12936 lines / 20275 lines = 63.8%). The cyclomatic complexity (explained in the next paragraph) is also the highest in the client project (2.61). The reason for this is that the GUI is rather heavyweight compared to the back-end. In addition, parts of the business logic are implemented in the client project in order to save network bandwidth and server resources, as we explain in section 5.6.1. The difference between the total number of lines and the lines of code is mainly because all public methods are annotated with Javadoc-style comments.
The cyclomatic complexity is a complexity measure for software, developed by Thomas J. McCabe [mccabe76]. It measures the number of linearly independent paths through a software programme. Practically speaking, it judges the complexity of source code according to the number of loops (for, while) and conditionals (if, switch). Additionally, return statements, try/catch/finally blocks and logical operators (&&, ||, .. ? .. : ..) increase the flow complexity of a programme. According to McCabe, a complexity of less than 5 is desirable. Modules (in our case methods) with a complexity over 10 should be refactored and split into smaller modules with a lower complexity. The code metrics show that the Data Quality Analysis Toolkit is generally well structured in small, easily understandable units with an average complexity of less than 3. However, there are a few exceptions. The worst one is the method TreeViewPropertyTable.getValueAt(rowIndex, columnIndex) which implements the model of the table seen in figure 3.2 on page 15 in the lower-right corner. This method has a complexity of 57 and seems to be much too complicated. However, when looking at it in more detail, it is not that bad. It contains a huge switch and each case block has a number of nested if-statements in it. The problem here is that JTables are generally row-oriented. In our case, we use it in


a column-oriented fashion though. Despite the large flow complexity, this method is still relatively easy to understand.
The lack of cohesion of methods (LCOM) refers to a set of techniques to analyse how strongly the methods of a class are connected to each other. Variant 4 was developed by Hitz & Montazeri [hitz96]. A LCOM4 value of 1 indicates that a class has only one responsibility. Values larger than 1 indicate that a class implements two or more independent concepts and, hence, should be split into smaller classes which are independent of each other. The admin and client project both have an average LCOM4 of 3, which means that theoretically many classes could be split into smaller ones. On the other hand, it can be seen that the average number of lines per class (including comments) is around 180, which shows that classes are already relatively small. The highest LCOM4 values are calculated for the entity beans because, obviously, the getter and the setter of an instance variable are independent of any other instance variable. Still, it does not make sense to split the entity classes. All this shows that LCOM4 can be used as an indicator, but the results have to be analysed with care before coming to a conclusion. For the server project, the LCOM4 measure worked as expected: the QualityAnalysisBean could easily be split into smaller classes which are each responsible for only one of the business concepts (quality entity, quality entity filter, etc.). In the FoodCASE project, however, all the existing EJBs are in the same Java package. Because of this, we did not want to create too many additional classes here for the sake of a better overview.

          Lines (1)   LOC (2)   Classes   Methods
Admin     3076        1900      17        106
Client    12936       8132      72        490
Lib       3159        1594      26        242
Server    1104        761       5         63
Total     20275       12387     120       901

          Lines/Class   Methods/Class   Avg CC (3)   Max CC (4)   LCOM4 (5)
Admin     180.9         6.2             1.21         5            3
Client    179.7         6.8             2.61         57           3
Lib       121.5         9.3             1.51         15           5
Server    220.8         12.6            1.78         15           5

(1) Total number of lines
(2) Lines of code: lines which contain at least one of the following: semicolon ";", left curly brace "{", right curly brace "}" but do not contain double slash "//"
(3) Average cyclomatic complexity
(4) Maximal cyclomatic complexity per method
(5) Lack of cohesion of methods (variant 4)

Table 5.1: Code Metrics of the Data Quality Analysis Toolkit


5.3 Data Model

Figure 5.1 shows the data model of the Data Quality Analysis Toolkit. All of the concepts were introduced in previous sections. Here is a short overview of back-references:

• Quality Entity: Section 4.2

• Quality Entity Filter: Sections 3.2.4 and 4.3

• Quality Requirement: Sections 3.2.1 and 4.4

• Requirement Type: Section 3.2.1

• Quality Assessment: Sections 3.2.3 and 4.5

• Quality Tree Definition: Section 3.2.2

• Aggregation Type: Section 3.2.2

Figure 5.1: Data Model of the Data Quality Analysis Toolkit, UML Class Diagram

5.3.1 Data Quality Tree

As already mentioned in section 3.2.2, the data quality analysis tree definition does not necessarily have to be a tree. We call it a tree, though, for ease of understanding. Nevertheless, it is legal for a tree node to have more than one parent. Because of this, it is actually an acyclic graph with a few constraints. Every tree node has a level and an index property. The level property describes the height of a node in the tree. Counting starts from 1 and goes from bottom to top (leaf to root). The index describes the logical position of a node from left to right. Counting starts again from 1. As we explain in section 5.6.4, the logical position is not necessarily the same as the display position of a tree node. The constraints are:


∀ node ∈ nodes | type_node = "Requirement Node" : level_node = 1    (5.1)

∀ node ∈ nodes | type_node = "Aggregation Node" : level_node > 1    (5.2)

∀ (parent, child) ∈ edges : level_parent > level_child    (5.3)

∀ node ∈ nodes | level_node > 1 : |{inedges_node}| > 0    (5.4)

∀ node ∈ nodes | level_node < max(level) : |{outedges_node}| > 0    (5.5)

|{node ∈ nodes | level_node = max(level)}| = 1    (5.6)

Constraints 5.1 and 5.2 say that requirement nodes can only exist on the lowest level and aggregation nodes must reside on a level higher than 1. Constraint 5.3 ensures that all edges are pointing “upwards”, i.e. from a node of lower level to another node on a higher level. Constraint 5.4 ensures that every aggregation node has at least one contributing requirement node. Similarly, constraint 5.5 ensures that there is a path from every node to the root node. Finally, constraint 5.6 makes sure that there is a unique root node.

5.4 Mapping to Physical Data Model

The mapping to the physical data model is straightforward for most of the Java Entity Beans. The physical database model can be found in appendix C. A TreeNode is either a RequirementNode or an AggregationNode. CollapsedNodes are virtual and only exist at runtime as a replacement for the children of a collapsed node. To model the inheritance, we used the JPA3 inheritance strategy “joined”. This means that there is a database table for the abstract super class TreeNode. Additionally, for every entry in the table tblqualitytreenode, there is a matching entry with the same nodeid either in tblqualitytreenodeaggregation or tblqualitytreenoderequirement. The Java enums RequirementType and AggregationType are not modelled as tables, but only their integer representation is stored in the corresponding tables. A check constraint ensures that only valid values are accepted by the database. In addition to those tables depicted in figure 5.1, there are three more tables:

• tblqualityrequirementvalue contains all the data which are collected when a quality assessment is performed. Performance is of major importance here. This is why this table is only accessed using plain SQL and is not mapped to JPA.

• tblqualitywarningtext and

• tblqualitywarninguser belong to our enhancement of the “Data Quality Prevention” framework we discuss in section 7.1.

3Java Persistence API


5.5 Back-End

5.5.1 Running a Quality Assessment

When a quality assessment is performed, a snapshot of the data quality of every data record in the database is taken. This means that every quality requirement is executed and the results are written back to the database. This includes the following steps:

1. Create a new Quality Assessment and set the start date plus optionally a remark.

2. For every Quality Requirement do ..

• Load the quality requirement from the database.

• In the assessment SQL, replace all the place holders as described in table 4.1.

• Prepend INSERT INTO tblqualityrequirementvalue (requirementid, assessmentid, refid, value) and execute the SQL statement.

3. Set the end date of the quality assessment.

4. For every quality requirement, initialise the file system cache as discussed in section 5.5.2.

Because a large amount of data is written here, every quality requirement is executed in a separate transaction to avoid overhead. Otherwise, the rollback segment would grow really large. This has the consequence that, before using a quality assessment, it has to be checked whether its end date is set. If this is not the case, the quality assessment is still running or it crashed during execution.
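
A minimal Java sketch of step 2 is given below. It assumes the place holders have already been substituted and only shows how every assessment SELECT is wrapped into an INSERT and committed in its own transaction; the class and method names are illustrative, not the toolkit's actual code.

import java.sql.Connection;
import java.sql.Statement;
import java.util.List;

// Sketch of running the quality requirements of one assessment.
public class QualityAssessmentRunner {

    // selects: the assessment SQL statements with the place holders already substituted.
    public void runAssessment(Connection con, List<String> selects) throws Exception {
        con.setAutoCommit(false);
        for (String select : selects) {
            // Prepend the INSERT so the result rows {requirementId, assessmentId, refId, value}
            // are written straight back to the database.
            String insert = "INSERT INTO tblqualityrequirementvalue "
                          + "(requirementid, assessmentid, refid, value) " + select;
            try (Statement stmt = con.createStatement()) {
                stmt.executeUpdate(insert);
            }
            // One transaction per requirement avoids a huge rollback segment.
            con.commit();
        }
    }
}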

EJB3 Timer

A quality assessment can either be triggered manually by the system administrator or by an EJB3 timer (compare with section 4.5). When the timer goes off, ejbTimeout is called on the QualityAnalysisTimerBean by the timer service, with the container's default identity. Now exactly the same as in the previous section has to be done: executing all quality requirements and initialising the cache. However, the QualityAnalysisBean only allows calls to any of its methods if the security context contains a valid caller principal. In order to fulfil this requirement, the QualityAnalysisTimerBean makes a remote call to itself using a technical user account. Furthermore, the timer bean also contains methods to query the timer status, cancel the timer or reschedule it.
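
For illustration, a bean along these lines could be set up with the standard EJB3 timer API roughly as follows. This is a sketch, not the actual QualityAnalysisTimerBean; in particular, the scheduling method and the business method called from ejbTimeout are assumptions.

import java.io.Serializable;
import java.util.Date;
import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.ejb.Timeout;
import javax.ejb.Timer;
import javax.ejb.TimerService;

// Sketch of an EJB3 timer bean that triggers quality assessments.
@Stateless
public class QualityAnalysisTimerSketch {

    @Resource
    private TimerService timerService;

    // Schedules the first run at 'start' and repeats every 'intervalDays' days.
    public void schedule(Date start, int intervalDays, Serializable info) {
        long intervalMillis = intervalDays * 24L * 60 * 60 * 1000;
        timerService.createTimer(start, intervalMillis, info);
    }

    // Called by the container when the timer goes off.
    @Timeout
    public void ejbTimeout(Timer timer) {
        // The real bean makes a remote call with a technical user account here,
        // because the QualityAnalysisBean requires a valid caller principal.
        runQualityAssessment();
    }

    private void runQualityAssessment() {
        // placeholder for the actual assessment logic
    }
}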


5.5.2 File System Cache

RDBMS are very powerful when it comes to answering complex queries. Nevertheless, there are faster approaches to store and retrieve large amounts of data when only very restricted selection criteria are needed.
The table tblqualityrequirementvalue contains four columns: requirementid, assessmentid, refid and value. The refid identifies the data record and value is the data quality value of this data record. When a data quality analysis is performed, this table is accessed using a combination of requirementid and assessmentid, and the data quality values of all data records are loaded. Experiments have shown that the retrieval performance decreases significantly when multiple quality assessments are stored in the database, even if optimised indices are used. In order to avoid this problem, we implemented a file system cache on the application server, which logically is located between the application server and the database. This way, all the data are still in the same storage, namely the database. However, the application server always loads the data from the file system cache once it is initialised. This is much faster and saves resources on the database server.
The file system cache writes its files to the temp directory of the application server and is initialised when a new quality assessment is performed. If the cache gets deleted, it is rebuilt the next time the data are requested. The cache files have a file name in the format QualityAssessment {assessmentid} {requirementid}.bin. Like this, the task of looking up the right data is simply delegated to the file system and no sophisticated search data structures are needed.
Table 5.2 shows the format of the file system cache and listing 5.1 contains the code of the method writeToCache which writes the cache files. The method readFromCache performs the inverse operations and restores the original data structure (a sketch of it follows listing 5.1). Note that the data quality values may be NULL. This is encoded as -1f.

Byte:   0  1  2  3   |   4  5  6  7
        refId        |   value

Table 5.2: File System Cache Structure

• The first four bytes contain the integer valued refId.

• The second four bytes contain the value represented in the IEEE4 754 floating-point “single format” bit layout.

4Institute of Electrical and Electronics Engineers


private void writeToCache(File cacheFile, List<Object[]> data) {
    try {
        FileOutputStream fos = new FileOutputStream(cacheFile);
        int refId;
        int value;
        for (Object[] row : data) {
            // refid
            refId = ((Number) row[0]).intValue();
            fos.write((refId >>> 24) & 0xff);
            fos.write((refId >>> 16) & 0xff);
            fos.write((refId >>> 8) & 0xff);
            fos.write(refId & 0xff);

            // value
            value = Float.floatToIntBits(row[1] != null ?
                    ((Number) row[1]).floatValue() : -1f);
            fos.write((value >>> 24) & 0xff);
            fos.write((value >>> 16) & 0xff);
            fos.write((value >>> 8) & 0xff);
            fos.write(value & 0xff);
        }
        fos.close();
    } catch (Exception e) {
        throw new DataQualityAnalysisException(
                "Failed to write to cache file " + cacheFile, e);
    }
}

Listing 5.1: Writing to File System Cache
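
For completeness, a minimal sketch of the inverse operation is shown below. It assumes the byte layout of table 5.2 (big-endian, as written by writeToCache) and is not the actual readFromCache implementation; the real method presumably uses the toolkit's own exception type.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

class FileSystemCacheReader {

    // Restores the {refId, value} tuples from a cache file; -1f encodes NULL.
    static List<Object[]> readFromCache(File cacheFile) {
        List<Object[]> data = new ArrayList<Object[]>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(cacheFile))) {
            // DataInputStream reads big-endian, matching the byte order used by writeToCache.
            while (in.available() >= 8) {
                int refId = in.readInt();
                float value = in.readFloat();
                data.add(new Object[]{refId, value == -1f ? null : Float.valueOf(value)});
            }
        } catch (Exception e) {
            throw new RuntimeException("Failed to read cache file " + cacheFile, e);
        }
        return data;
    }
}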


5.6 Front-End

The front-end project is where most of the complexity of the Data Quality Analysis Toolkit resides. In section 5.6.1, we explain the runtime data structure of a data quality tree and section 5.6.2 explains how this structure is created. After that, we describe how the tree is visualised and how the data quality charts are created.

5.6.1 Runtime Data Structure

As seen in section 5.3.1, the tree is modelled by tree nodes which are connected by edges. Behind each of the tree nodes, there are a lot of data quality data. In section 3.2.4, we explained how the data can be grouped by various criteria. In the example data structure in figure 5.2, the data are grouped by user. The first magnifier shows a table with N+1 columns, where column 0 contains the total over all groups, and columns 1..N contain the data of the individual groups (in this case the users). Each of the groups contains a list in which the raw data quality data are contained (Magnifier 2). This list is ordered by descending data quality values. Note that the data quality value can be NULL if a quality requirement is not applicable to a particular data record. The refIds together with the entityId of the tree definition uniquely identify the data records to which the data quality values belong. For every group, the percentiles, minimum value, mean, standard deviation and row count are cached in the tree node for faster access. For the maximum value, caching is not needed because this is simply the first row in the list.
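
A minimal sketch of this per-node structure could look as follows; the class and field names are assumptions and only illustrate the grouping and the cached statistics described above.

import java.util.List;
import java.util.Map;

// Sketch of the runtime data held behind one tree node.
class TreeNodeData {

    // One {refId, value} tuple; value may be null if the requirement is not applicable.
    static class QualityValue {
        final int refId;
        final Float value;
        QualityValue(int refId, Float value) { this.refId = refId; this.value = value; }
    }

    // Group 0 holds the total, groups 1..N the individual groups (e.g. users).
    // Each list is sorted by descending data quality value.
    Map<Integer, List<QualityValue>> groups;

    // Cached statistics per group for fast access (the maximum is simply the first list entry).
    Map<Integer, double[]> percentiles;
    Map<Integer, Double> mean;
    Map<Integer, Double> standardDeviation;
    Map<Integer, Integer> rowCount;
}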

5.6.2 Tree Calculation

The term tree calculation refers to the process of loading all needed data into memory and creating the data structure just described. It is implemented in a multi-threaded fashion to make the best use of the available computing resources. The logic is split into two classes:

• The TreeCalculation class does the real work, which can be divided into three kinds of tasks:

– Loading requirement node data

– Grouping requirement node data

– Calculating aggregation node data

• The TreeCalculationRunner is the driver of the whole process. It is responsible for the scheduling of all the tasks, it maintains the progress indicators, and it contains the error handling code. The calculation runner uses two separate thread pools for the data loading tasks and the calculation tasks, so that the calculation can already start as the data arrive (a scheduling sketch follows the enumeration below).

The following enumeration explains all the steps involved in the tree calculation in more detail:

1. The user specifies on the start screen of the Data Quality Analysis Toolkit which data he is interested in (see figure 3.1). This includes the desired data quality tree definition, optionally a filter or grouping criterion, a quality assessment and how not applicable indicators should be treated.


Figure 5.2: Runtime Data Structure of the Data Quality Analysis Toolkit

2. The TreeCalculationRunnerTask is started and ..

(a) loads the tree definition from the database.

(b) If grouping is enabled, the keys which identify the groups, and the grouping values which map every data record to a group, are loaded.

(c) For every requirement node in the tree definition, a LoadRequirementNodeDataTask is issued. The data loading task calls an EJB method and gets a list of {refId, value} tuples for the specified quality assessment and requirement.

(d) When a data loading task completes, a grouping task is issued on the second thread pool if grouping is enabled. This task expands the list previously received according to the grouping values. Group 0, which is the total, will contain the original list. Additionally, every tuple will be put in the list corresponding to the grouping value. This means that the amount of data is doubled, and this is one of the reasons why it is done on the client. The other reason is that we wanted to save computing resources on the application server, so that a user who analyses the data quality does not slow down the whole FoodCASE system.


(e) As soon as the requirement node data are fetched and grouped, the TreeCalculationTask starts walking the tree from bottom to top. For every aggregation node in the tree, it calculates the data quality values for every group and every data record according to the aggregation type of the node. Remember that the aggregation type describes how the data quality values behind an aggregation node are calculated from its direct contributing children. Since the calculation task does not involve any I/O5 operations and all the necessary data are ready in the RAM6, it is much faster than the data loading tasks (< 1 sec). Hence, it was not worth the effort to parallelise this task.

(f) Finally, all the lists with the {refId, value} tuples are sorted by descending data quality value (Magnifier 2 in figure 5.2) and the percentiles are initialised.

3. Now the user has 11 different views at his disposal, which visualise the data just calculated.
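
The scheduling with the two thread pools could be sketched as follows. This is a simplified, modernised illustration using CompletableFuture (which postdates the original implementation); the task objects and pool sizes are assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the scheduling done by a calculation runner: one pool for loading,
// one for grouping/aggregation, so that calculation starts as soon as data arrive.
class TreeCalculationRunnerSketch {

    private final ExecutorService loadingPool = Executors.newFixedThreadPool(4);
    private final ExecutorService calculationPool = Executors.newSingleThreadExecutor();

    void calculate(List<Runnable> loadTasks, List<Runnable> groupTasks, Runnable aggregate) {
        List<CompletableFuture<Void>> done = new ArrayList<>();
        for (int i = 0; i < loadTasks.size(); i++) {
            Runnable load = loadTasks.get(i);
            Runnable group = groupTasks.get(i);
            // Loading runs on the first pool; grouping starts on the second pool
            // as soon as the corresponding data have arrived.
            done.add(CompletableFuture.runAsync(load, loadingPool)
                                      .thenRunAsync(group, calculationPool));
        }
        // The bottom-up aggregation walk only starts once all nodes are loaded and grouped.
        CompletableFuture.allOf(done.toArray(new CompletableFuture[0]))
                         .thenRunAsync(aggregate, calculationPool);
    }
}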

5.6.3 NanoGraph

The tree view (Figure 3.2 on page 15) and the tree editor (Figure 3.7) use a highly modified version of the NanoGraph7 8 library. NanoGraph originally featured a panel on which nodes could be freely moved around, zooming and an outline panel. However, it did not provide any means to add or remove nodes or edges via the GUI. The library was kept very generic, so that different types of nodes, edges, node renderers and edge renderers could be used in parallel. It supported different backgrounds, docking strategies, layout algorithms, selection of multiple objects and even the export to an SVG9 or JPEG10 image.
After a short evaluation, we realised that it was not possible to change the selection mode so that only one single object could be selected, which nullified our initial intent to use the NanoGraph library as-is. Because we needed to touch the library anyway, we chose to remove all the unneeded features and the complexity they implied. Doing so, we ended up with a very slimmed-down version which only contained seven classes compared with 40 in the beginning. Now it was easy to customise those seven classes according to our needs. In the end, we estimate that about 60% of the lines in the remaining NanoGraph classes were added or modified by us. Still, we think it was worthwhile to use the NanoGraph library as a starting point. This way, we did not have to design everything from scratch.

5.6.4 Tree Model

The main responsibility of the TreeModel class is to tell the NanoGraphPanel where the tree nodes have to be rendered in terms of (x,y) coordinates. It supports persistence and model modifications such as adding and deleting requirement and aggregation nodes and adding and deleting edges. In order to save space on the screen, nodes can be collapsed, so that their children are not visible anymore. The logic for the positioning of the nodes can best be described by the pseudo code in listing 5.2.

5 Input/Output
6 Random Access Memory
7 http://www.sourceforge.net/projects/nanograph
8 http://www.nanoworks.nl
9 Scalable Vector Graphics

10Joint Photographic Experts Group


public Point2D getLocation(TreeNode node) {
    double x;
    double y = (getLevelCount() - node.getLevel()) * LEVEL_HEIGHT + LEVEL_HEIGHT / 2.0;
    if (node.isCollapsed()) {
        // case 1: node is collapsed
        x = getLocation(new CollapsedNode(node)).getX();
    } else if (node instanceof AggregationNode) {
        if (node.getInEdges().size() > 0) {
            // case 2: AggregationNode is connected:
            // position determined by its children
            x = (maxXOfChildren + minXOfChildren) / 2.0;
        } else if (node == movingNode) {
            // case 3: AggregationNode is not connected and moving (drag & drop)
            return movingLocation;
        } else {
            // case 4: AggregationNode is not connected and not moving:
            // position determined by the last requirement node
            x = xOfLastRequirementNode + NODE_SPACING + getNodeWidth(node) / 2.0;
        }
    } else {
        if (node == movingNode) {
            // case 5: node is moving
            x = xOfMovingNode;
        } else {
            // case 6: node is not moving:
            // align it next to the node on the left
            x = xOfNodeOnTheLeft + NODE_SPACING + getNodeWidth(node) / 2.0;
        }
    }
    return new Point2D.Double(x, y);
}

Listing 5.2: Positioning of Tree Nodes (Pseudo Code)

Generally, all requirement nodes are rendered on level 1 and are aligned next to each other in the order of their node.index property. Aggregation nodes are rendered on the level specified by their node.level property. The horizontal position is the middle between the first and the last of its children. If the aggregation node does not have any children yet, it is aligned on the right of the last requirement node. A CollapsedNode object is a replacement for all the children of a collapsed node. Therefore, it is rendered at the same x coordinate as its parent. Requirement nodes can only be moved horizontally. This is why only the x coordinate is overridden, whereas aggregation nodes can be moved on both axes.
The RotationTreeModel is an extension of the standard tree model which adds the feature of rotating the tree by 90 degrees clockwise, so that the root node is on the right-hand side. This coordinate transformation logically involves three steps (a minimal sketch follows the list):


1. Swap the x and y components of the coordinate.

2. Mirror along the y-Axis.

3. Apply a stretching factor to both components.
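
A minimal sketch of these three steps, with the stretching factors left as parameters (their actual values are not given in the text):

import java.awt.geom.Point2D;

// Sketch of the three-step coordinate transformation in the RotationTreeModel.
class RotationSketch {

    static Point2D rotate(Point2D p, double stretchX, double stretchY) {
        // 1. swap the x and y components
        double x = p.getY();
        double y = p.getX();
        // 2. mirror along the y-axis
        x = -x;
        // 3. apply a stretching factor to both components
        return new Point2D.Double(x * stretchX, y * stretchY);
    }
}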

When the isRotated property is changed, all tree nodes move along a straight line from their old position to their new position in a 500-millisecond-long animation.
The tree model implements the observer pattern, so that any party interested in state changes can register a TreeModelListener. The InteractionManager, which implements the tool tips, and the DragAndDropTreeModel, which contains the drag and drop functionality used by the tree editor (on the left in figure 3.7), make use of this mechanism.

5.6.5 Tree View and Tree Editor

Figure 5.3: Tree View and Tree Editor, UML Class Diagram

Figure 5.3 shows how all the components mentioned in the previous sections work together. The TreeViewPanel contains a NanoGraphPanel (red association) which has a NanoGraph. The NanoGraph knows the RotationTreeModel and has a node renderer, an edge renderer, a background and a docking strategy. The tree editor is an extension of the tree view. Similarly, the TreeEditorPropertyTable extends the TreeViewPropertyTable with the ability to edit a subset of the available properties. The AbstractTreeEventHandler implements the event handling which is common to the tree view and the editor. Its two sub-classes contain the code which is specific to one or the other. The InteractionManager is responsible for keeping track of the selected node or edge and for displaying the appropriate tool tips. Lastly, the DragAndDropTreeModel contains the logic which allows new nodes to be added to the tree using drag and drop.


5.6.6 GUI Components

The GUI of the FoodCASE application was built on top of the Swing Application Framework (JSR11 296). Since the Data Quality Analysis Toolkit was expected to be fully integrated into FoodCASE, it was a logical step to implement the toolkit using the App Framework as well. Figure 5.4 shows the GUI components of the Data Quality Analysis Toolkit. The main frame is highlighted in red colour and there are 14 top level panels which are coloured in green:

• The LoadTreeDataPanel is the start screen of the Data Quality Analysis Toolkit (Figure 3.1 on page 14).

• The TreeDefinitionTablePanel lists all available tree definitions.

• When a tree definition is opened, the TreeEditorPanel is shown (Figure 3.7 on page 20).

• Furthermore, there are seven data quality views and four data problem views.

All of those 14 top level panels implement the interface specified by DataQualityAnalysisPanel. The first three panels implement this interface directly (green dashed lines in the class diagram); the quality views and the problem views, on the other hand, have abstract super-classes which implement the functionality that is common to all of them. The transitions between the different panels are realised using a state pattern, where the main frame serves as the state manager. All of the 14 main panels have a reference to the manager (blue associations), so that the active panel can tell the manager which panel to display next. In detail, a panel switch works like the following (a sketch follows the list):

1. The active panel decides, e.g. as a response to a user action, that another panel should be shown. It signals this by calling dataQualityAnalysisPanel.loadDataAndShow(panelToBeShown).

2. According to panelToBeShown.isDisplayMaximized() the window is maximised or restored.

3. The old panel is hidden and a busy message is shown.

4. The stored user settings are loaded from cache and applied to the new panel.

5. panelToBeShown.updateData() is called.

• If true is returned, this means that the data are already ready. So continue with the next step.

• If false is returned, this means that the data are not ready yet and will be loaded asynchronously, so that the GUI stays responsive. In this case, the manager has to wait until the callback method loadingDataDone(..) is invoked.

6. When the data are ready, the new panel is turned visible.
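
The state-manager side of such a panel switch could be sketched as follows; the interface, the window helpers and the busy-message and settings handling are stubs standing in for the real FoodCASE code.

// Sketch of the panel switch driven by the state manager (the main frame).
class MainFrameSketch {

    void loadDataAndShow(DataQualityAnalysisPanel panelToBeShown) {
        if (panelToBeShown.isDisplayMaximized()) maximizeWindow(); else restoreWindow();
        hideActivePanel();
        showBusyMessage();
        applyStoredUserSettings(panelToBeShown);
        // updateData() returns true if the data are already available; otherwise the
        // panel loads asynchronously and invokes loadingDataDone(..) when finished.
        if (panelToBeShown.updateData()) {
            showPanel(panelToBeShown);
        }
    }

    // Callback for the asynchronous case.
    void loadingDataDone(DataQualityAnalysisPanel panel) {
        showPanel(panel);
    }

    interface DataQualityAnalysisPanel {
        boolean isDisplayMaximized();
        boolean updateData();
    }

    // Stubs standing in for the real window/busy/settings handling.
    private void maximizeWindow() {}
    private void restoreWindow() {}
    private void hideActivePanel() {}
    private void showBusyMessage() {}
    private void applyStoredUserSettings(DataQualityAnalysisPanel p) {}
    private void showPanel(DataQualityAnalysisPanel p) {}
}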

11Java Specification Request


Figure 5.4: GUI Components, UML Class Diagram


In contrast to the TreeEditorPanel, the TreeViewPanel does not implement the interface DataQualityAnalysisPanel. This is because the tree view panel is not used as a main panel, but it is wrapped into the QualityTreePanel (red association in figure 5.4). The ProblemTablePanel uses extensions of the table panels provided by FoodCASE. The difference is that in the Data Quality Analysis Toolkit only the subset of rows which matches the criteria (data quality lower/higher than the specified threshold) is displayed. The advantage of reusing the same table panels is that the user has to configure his table preferences (columns to be displayed, column widths, etc.) only once. What is more, the detail window of a data record can easily be opened by double-clicking on a row.

5.6.7 Views

The data quality and data problem views are structured according to the Model-View-Controller (MVC) pattern. Most of the views display a chart. These charts are created using the JFreeChart12 library. JFreeChart uses the same concept, but a different terminology: what is a view in MVC is called a plot in JFreeChart, and the models are called datasets. Table 5.3 gives an overview of which plot types and datasets are used by the seven quality and the four problem views (a small JFreeChart sketch follows the table).
The bar charts, box plot and spider charts use category datasets. Each category has a label and a value. In addition to the mean value, the box plot also visualises the percentiles. The bar chart can display the standard deviation if a DefaultStatisticalCategoryDataset is used instead of a DefaultCategoryDataset.
The histogram and the line chart use different kinds of XY datasets. In case of the histogram, there are 20 bins per data series. Each bin represents a 5% interval and the height of the corresponding bar is determined by the number of values in this interval. The line chart uses a simple collection of (x,y) coordinates which define the data points. In the data quality line chart, the points are ordered by increasing data quality and the x component is always equal to the index of the point in the list. This results in a monotonically increasing line.
For pie plots, JFreeChart provides a special dataset.
The data quality tree table is realised using a component from the table library developed by Scientific Applications13. This is a commercial library which was already used within FoodCASE. Therefore, we had already acquired a licence to use it. The way the data quality tree view and the problem table work was already explained in previous sections.
In order to integrate the JFreeChart library and the table library into the Data Quality Analysis Toolkit, we had to make a design decision for every view. We could either copy the necessary data into another data structure which is usable by these libraries, or we could write an adapter which makes the libraries work together with our own runtime data structure (Section 5.6.1). In the majority of cases, we chose the first option. However, for the box plot and the tree table, the amount of data to copy would have been too large. Therefore, we implemented an adapter class in these two cases, to avoid overhead. While this was straightforward for the box plot, the QualityTreeTableModel, which is a wrapper around a TreeNode, was rather complicated to come up with. One of the problems was that we had to add an artificial root node depending on the view option Show Parent. Then, the option Use Grouping added further complexity.

12 http://www.jfree.org/jfreechart
13 http://www.scientific.gr


The table distinguishes the seven data quality views (Bar Chart, Box Plot, Histogram, Line Chart, Spider Chart, Tree, Tree Table) and the four problem views (Bar Chart, Pie Chart, Spider Chart, Problem Table).

Plot Type/View
  CategoryPlot (1):        Quality Bar Chart, Quality Box Plot, Problem Bar Chart
  XYPlot (1):              Quality Histogram, Quality Line Chart
  SpiderWebPlot (1):       Quality Spider Chart, Problem Spider Chart
  PiePlot (1):             Problem Pie Chart
  TreeViewPanel (4):       Quality Tree
  TreeTable (2):           Quality Tree Table
  AbstractTablePanel (3):  Problem Table

Dataset/Model
  Category Datasets:  DefaultCategoryDataset (1) and DefaultStatisticalCategoryDataset (1) for the bar and spider charts; QualityBoxPlotDataset (4) for the box plot
  XY Datasets:        HistogramDataset (1) for the histogram; XYSeriesCollection (1) for the line chart
  Other:              DefaultPieDataset (1) for the problem pie chart; FoodTableModel (3) for the problem table; TreeModel (4) for the quality tree; QualityTreeTableModel (4) for the tree table

(1) Class provided by JFreeChart, http://www.jfree.org/jfreechart
(2) Class provided by Scientific Applications, http://www.scientific.gr
(3) Class provided by FoodCASE, http://www.foodcase.ethz.ch
(4) Class developed by Reto Mock for the Data Quality Analysis Toolkit

Table 5.3: Plot Types and Datasets used by the Data Quality Analysis Views
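
As a small illustration of the plot/dataset terminology, the following sketch builds a category dataset and a bar chart with JFreeChart; the node names and values are made up and the chart options of the real quality bar chart are omitted.

import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartPanel;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.category.DefaultCategoryDataset;

// Sketch of the dataset/plot split used by the quality bar chart view.
class BarChartSketch {

    static ChartPanel createQualityBarChart() {
        DefaultCategoryDataset dataset = new DefaultCategoryDataset();
        // rowKey = data series (e.g. group), columnKey = tree node, value = mean data quality
        dataset.addValue(0.82, "Total", "Single Food");
        dataset.addValue(0.67, "Total", "Reference");

        JFreeChart chart = ChartFactory.createBarChart(
                "Data Quality", null, "Mean Data Quality", dataset,
                PlotOrientation.VERTICAL, true, true, false);
        return new ChartPanel(chart);
    }
}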


5.6.8 Chart Highlighting and Breadcrumb Navigation

The topmost component on every data quality analysis view is the breadcrumb navigation. It shows the path from the root node of the tree to the current node (compare with figure 3.4 on page 17). By clicking on an ancestor, it is possible to navigate upwards in the tree. Navigation towards the leaves of the tree is possible by clicking on a chart entity, for example on a bar in case of a bar chart. This by itself was easy to implement.
So that the user realises that he can click on a chart entity, we wanted to highlight chart entities on mouse-over, i.e. the chart entity should change its appearance when the mouse hovers over it, as shown in figure 3.5. This proved to be more complicated than expected at first. Finally, we found a solution for every chart type we use, but unfortunately it works differently in every case. We wrote a ChartHighlighter class which serves as a facade and hides all this. Still, having five classes and an interface, with a total of 407 lines or 261 lines of code, only for the highlighting is suboptimal. This certainly is an area where the JFreeChart library could be improved.

5.6.9 User Settings

There are a lot of check boxes and radio buttons in the different views of the Data Quality Analysis Toolkit. It would therefore be painful and potentially error-prone if the code to store the user settings had to be written again for every view option. For this reason, we created a UserSettingsUtil, which is used by the Data Quality Analysis Toolkit as well as by the admin tool. This utility inspects a panel using reflection, searches for JCheckBoxes, JRadioButtons, JSliders and FoodCASE's AbstractTablePanels, and automatically stores their state (a minimal sketch of this idea is given after the list below). The utility method is called whenever a panel switch happens, and the settings are written to the database when the Data Quality Analysis Toolkit or the admin tool is closed.

Section 3.3.4 explained that some view options are global, whereas others are stored per panel. For example, the value of the threshold slider should always be consistent across all four problem views. On the other hand, it should be possible to specify the view option Show Parent on every panel independently. We found that in this particular setup it was possible to define the following simple rule: check boxes are stored per view, but radio buttons and sliders are stored globally. This makes sense because it can be assumed that the user wants the view options

• Percentage Scale/ User-defined Scale,

• Vertical Plot/ Horizontal Plot,

• and the Threshold Slider

to be the same on every view. All the other options should be stored per view.

With this user settings utility, the loading and storing of the settings is fully transparent to the panels, which is very convenient. Nevertheless, in rare cases it might be desirable to have more control. This is where the interface UserSettingsAware comes into play. The interface contains two callback methods: userSettingsLoading(userSettings) and userSettingsUpdating(userSettings). Panels can implement this interface and are then notified when the user settings are loaded or stored. With the reference which is passed to the callback methods, it is possible to access the user settings and read or store custom settings.
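A minimal sketch of the idea behind such a utility follows, here implemented as a walk over the Swing component tree rather than field reflection. The class PanelSettingsScanner and the use of component names as keys are assumptions for this illustration, not the actual UserSettingsUtil API.

import java.awt.Component;
import java.awt.Container;
import java.util.HashMap;
import java.util.Map;
import javax.swing.JCheckBox;
import javax.swing.JRadioButton;
import javax.swing.JSlider;

public final class PanelSettingsScanner {

    /** Walks the component tree of a panel and records the state of
     *  check boxes, radio buttons and sliders under their component names. */
    public static Map<String, String> collect(Container panel) {
        Map<String, String> settings = new HashMap<String, String>();
        collectRecursively(panel, settings);
        return settings;
    }

    private static void collectRecursively(Container parent, Map<String, String> settings) {
        for (Component c : parent.getComponents()) {
            String key = c.getName();
            if (key != null) {
                if (c instanceof JCheckBox) {
                    settings.put(key, Boolean.toString(((JCheckBox) c).isSelected()));
                } else if (c instanceof JRadioButton) {
                    settings.put(key, Boolean.toString(((JRadioButton) c).isSelected()));
                } else if (c instanceof JSlider) {
                    settings.put(key, Integer.toString(((JSlider) c).getValue()));
                }
            }
            if (c instanceof Container) {
                collectRecursively((Container) c, settings); // descend into nested panels
            }
        }
    }
}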



5.6.10 Logging and Error Handling

For logging and error handling we use the logging facility which has been included in Java since JDK14 version 1.4. Whenever an error occurs, we log it. In addition, we have registered a special handler on the root logger. If the level of a log record is SEVERE, a pop-up message is shown to the user to inform him about the error. Using this convention, the error handling is simple, as the following code snippet demonstrates:

LoggerUtil.getLogger().info(
        "I'm an info and will only be written to the log file");
LoggerUtil.getLogger().warning(
        "Warnings only go to the log as well");
LoggerUtil.getLogger().severe(
        "I'm important and will additionally be displayed as pop-up");
LoggerUtil.getLogger().log(Level.SEVERE, "I have an exception "
        + "attached which will be written to the log file", exception);

Listing 5.3: Logging and Error Handling
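The "special handler" mentioned above could look roughly like the following sketch, which registers a pop-up handler for SEVERE records on the root logger. The class name PopupHandler is invented for illustration and is not the actual FoodCASE class.

import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
import javax.swing.JOptionPane;
import javax.swing.SwingUtilities;

public class PopupHandler extends Handler {

    @Override
    public void publish(final LogRecord record) {
        // Only SEVERE records are shown to the user; everything else
        // is handled by the ordinary file and console handlers.
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
            SwingUtilities.invokeLater(new Runnable() {
                public void run() {
                    JOptionPane.showMessageDialog(null, record.getMessage(),
                            "Error", JOptionPane.ERROR_MESSAGE);
                }
            });
        }
    }

    @Override public void flush() { }
    @Override public void close() { }

    /** Registers the handler on the root logger. */
    public static void install() {
        Logger.getLogger("").addHandler(new PopupHandler());
    }
}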

5.7 Admin Module

The structure of the part of the admin tool which belongs to the Data Quality Analysis Toolkit is intentionally kept similar to the structure of the GUI components of the toolkit itself (see figure 5.5). The main difference is that the top-level component is not a frame but a tabbed pane, because the admin tool contains a lot of other settings alongside the configuration of the Data Quality Analysis Toolkit.

In the toolkit, the DataQualityAnalysisFrame served as the state manager. In the admin tool, the DataQualityAnalysisAdminTabbedPane assumes this role, so all the panels keep a reference to the tabbed pane. Most of the panels in the admin tool contain a table panel. When a certain record is opened, a detail window appears. Hence, all the table panels need to have an associated detail frame. These frames know the tabbed pane as well. When a record is changed using the detail window, it can signal the state manager that the table listing all the data records should be reloaded. This way it is guaranteed that all the data displayed are always in sync.

In the toolkit, the busy message was only needed at a central place. In the admin tool, however, there are different tabs, each of which is independent of the others. For this reason, every panel is wrapped into a DataQualityAnalysisAdminTab. Each tab maintains its own state, which is either busy or ready. Depending on this state, either the panel itself or a busy message is displayed when the tab is active (a minimal sketch of such a wrapper is shown below).

The AbstractDataQualityAnalysisAdminPanel and the AbstractDataQualityAnalysisAdminTablePanel both implement the same interface, but the first one inherits from JPanel while the second one extends the FoodCASE class AbstractTablePanel. In the case of a table panel, the label of the tab containing this panel is simply the title of the table. However, when the tab does not contain a table, the title of the tab has to be specified separately.
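A possible shape of such a tab wrapper, sketched here with a CardLayout that switches between the content panel and a busy label; the class and method names are illustrative and not the actual FoodCASE classes.

import java.awt.CardLayout;
import javax.swing.JComponent;
import javax.swing.JLabel;
import javax.swing.JPanel;
import javax.swing.SwingConstants;

public class BusyAwareTab extends JPanel {

    private static final String CONTENT = "content";
    private static final String BUSY = "busy";

    private final CardLayout cards = new CardLayout();

    public BusyAwareTab(JComponent content) {
        setLayout(cards);
        add(content, CONTENT);
        add(new JLabel("Please wait, loading data ...", SwingConstants.CENTER), BUSY);
        cards.show(this, CONTENT); // start in the ready state
    }

    /** Switches the tab between its busy message and its actual content. */
    public void setBusy(boolean busy) {
        cards.show(this, busy ? BUSY : CONTENT);
    }
}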

14 Java Development Kit



Figure 5.5: Admin Module, UML Class Diagram


6 Testing

In this chapter, we describe the testing of the Data Quality Analysis Toolkit. For the back-end and front-end testing, we used different methodologies.

6.1 Back-End Testing

The back-end testing was a purely technical test of the functionality provided by the EJB back-end. For each of the entities (quality entity, quality entity filter, quality requirement, etc.), the EJB provides the operations

• list

• get

• store

• delete

and some operations which are specific to a certain entity. The correct behaviour of these operations can be verified well using unit testing. To do this, we used JUnit1, which is the most well-known unit testing framework for Java applications. We created one test case per entity, which executes all possible operations and verifies that they worked properly. A typical test case looks like the following (a JUnit sketch of this pattern is shown after the list):

1. List all instances of the entity.

2. Create a new one and verify that the list now contains one entry more.

3. Retrieve the new instance from the database.

1 http://www.junit.org




4. Change some attributes of the entity and store it.

5. Retrieve a fresh reference to the entity from the DB and verify that the changed values are still there.

6. Delete the entity and verify that the list now contains the same number of rows as before the test.

7. Try to retrieve the deleted entity from the DB and verify that null is returned.
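A sketch of this pattern with JUnit 4 is shown below. The real tests run against the EJB facade; to keep the sketch self-contained, a tiny in-memory stand-in replaces the list/get/store/delete operations, and all names are invented.

import static org.junit.Assert.*;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.junit.Test;

public class QualityRequirementCrudTest {

    /** Minimal in-memory stand-in for the list/get/store/delete operations of the EJB. */
    static class InMemoryRequirementStore {
        private final Map<Long, String> rows = new HashMap<Long, String>();
        private long nextId = 1;

        List<Long> list()                 { return new ArrayList<Long>(rows.keySet()); }
        String get(long id)               { return rows.get(id); }
        long store(String name)           { long id = nextId++; rows.put(id, name); return id; }
        void update(long id, String name) { rows.put(id, name); }
        void delete(long id)              { rows.remove(id); }
    }

    @Test
    public void createUpdateDeleteRoundTrip() {
        InMemoryRequirementStore store = new InMemoryRequirementStore();
        int before = store.list().size();                    // 1. list all instances

        long id = store.store("MF: English food name");      // 2. create a new one
        assertEquals(before + 1, store.list().size());

        assertNotNull(store.get(id));                        // 3. retrieve it again

        store.update(id, "MF: English food name (edited)");  // 4. change an attribute and store

        assertEquals("MF: English food name (edited)",       // 5. re-read and verify the change
                store.get(id));

        store.delete(id);                                    // 6. delete the entity again
        assertEquals(before, store.list().size());

        assertNull(store.get(id));                           // 7. deleted entity yields null
    }
}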

Figure 6.1: JUnit Back-End Testing

6.2 Front-End Testing

6.2.1 Automated Testing

The code metrics in section 5.2 disclosed that the server project contains only about 6% of the code of the Data Quality Analysis Toolkit, so the unit testing of the back-end covered only a marginal part of the codebase. Therefore, we had to come up with a way of testing the GUI in order to achieve a reasonable code coverage, since most of the code resides in the client project.

For this purpose, we developed a GUITestUtil which tests the GUI of the Data Quality Analysis Toolkit in two phases. In the first phase, the tree definition table and the tree editor are tested. The second phase tests the start screen and all eleven data quality analysis views. The basic idea behind the GUI testing is to simulate a user who works with the toolkit.

The testing of the tree editor is similar to the unit testing: a series of user operations is executed and assertions are checked in between. The second phase, however, is a little different. What the data quality views do is, in principle, visualise data. The output is, for example, a chart, and whether this chart is correct can only reasonably be verified by a person. For this reason, the second phase does not contain any assert statements and, therefore, cannot check if the output is correct. What can be checked, however, is that the application does not crash, whatever the user clicks. In order to verify this, the test loops through all data quality analysis views and tries out every possible combination of view options. After a successful run, a setting on the start screen is changed. Remember that it is possible to specify the tree definition to be used, the treatment of not applicable requirements and the filter/grouping criteria on the start screen. This results in

#tree defs × #n.a.r. treatments × #filter modes × #views × 2^(avg. #view options)    (6.1)



test cases. As long as the user has not created any own tree definitions, there are six of them by default. The not applicable requirements treatment has three possible values, and although there are many filters, for reasons of practicability we only consider three different modes in the testing: no filtering/grouping, a filter criterion is specified, a grouping criterion is specified. Inserting the numbers gives

6 × 3 × 3 × 11 × 2^(≈5) = 19409    (6.2)

test cases. After each test case, there is a sleep of 1 second, so the total sleep time is 323 minutes for the whole test. Including the time the Data Quality Analysis Toolkit needs to serve the requests, the execution time of the entire test was 352 minutes when the database, the application server and the client were all running on my student lab computer.
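The enumeration of all combinations of boolean view options, which drives the 2^(#view options) factor above, can be done with a simple bitmask loop. The following sketch only illustrates the idea; the option names are invented and applyAndRender() is a placeholder rather than the actual GUITestUtil code.

public class ViewOptionSweep {

    public static void main(String[] args) {
        String[] options = {"Show Parent", "Use Grouping", "Vertical Plot",
                            "Percentage Scale", "Show Tool Tips"}; // invented names
        int combinations = 1 << options.length;                    // 2^5 = 32 cases

        for (int mask = 0; mask < combinations; mask++) {
            boolean[] enabled = new boolean[options.length];
            for (int i = 0; i < options.length; i++) {
                enabled[i] = (mask & (1 << i)) != 0; // bit i decides option i
            }
            applyAndRender(options, enabled);
        }
    }

    /** Placeholder for "set the check boxes and let the view repaint". */
    private static void applyAndRender(String[] options, boolean[] enabled) {
        StringBuilder sb = new StringBuilder("testing:");
        for (int i = 0; i < options.length; i++) {
            sb.append(' ').append(options[i]).append('=').append(enabled[i]);
        }
        System.out.println(sb);
    }
}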

Figure 6.2: Automated GUI Testing (Screenshot)

This test not only ensured that all combinations of view options are properly handled in the source code (at least so that they do not cause an exception), it would also have revealed any memory leaks, which only become noticeable after a while.



Figure 6.3 shows the memory and CPU2 usage during a test execution. Even though our code did not have any memory problems, the monitoring made us aware of a problem with our thread pool, which caused new threads to be acquired continuously until the system ran out of native threads.

Figure 6.3: Automated GUI Testing, Application Monitoring

The left diagram shows the memory usage during the GUI test, and on the right-hand side the CPU utilisation is plotted. It is clearly visible how the resource usage depends on the amount of data. In the introduction, we mentioned that every food has an average of 30 components. Therefore, it could be expected that the trees Single Food Component Data Quality and Aggregated Food Component Data Quality have the highest demands on memory and processing resources. This is indeed the case, as the light green area (Single Component) and the orange area (Aggregated Component) show.

6.2.2 Manual Testing

After the automated testing, we could be almost sure that the Data Quality Analysis Toolkit does not crash, no matter what the user clicks. What was left to be tested was the interaction with the mouse. This includes, for example, that the tool tips are displayed correctly, the breadcrumb navigation, the highlighting of the chart entities on mouse-over and the downwards navigation in the tree when a chart entity is clicked on. Furthermore, the context menus and the drag-and-drop functionality in the tree editor had to be tested. And finally, the most vital thing which was missing in the automated testing: correctness. So we had to go over all the charts and check whether what was displayed was reasonable and visually appealing. We tested each view option at least once and gave ourselves various data quality analysis tasks to verify the functionality offered by the toolkit and its usability in practice.

2 Central Processing Unit


7 Extensions

Apart from the Data Quality Analysis Toolkit, we implemented a few extensions to FoodCASE in this master thesis. These extensions are introduced in the following sections.

7.1 Data Quality Prevention

Data Quality Prevention is not about preventing data quality, as one might think at first. According to Presser [presser11], Data Quality Prevention refers to the part of the data quality framework which tries to avoid that data of poor quality is entered into the system in the first place. Figure 7.1 shows how this feature is implemented in the FoodCASE application: on every screen where data can be edited, there is a data quality evaluation panel at the bottom. This panel lists all problems the current data record has. Problems are divided into two categories: errors and warnings. In case of an error, the data record cannot be saved.

In order to familiarise ourselves with the data quality framework, we added some new validation rules to it. When discussing this mechanism, we noticed that warnings and errors are not always displayed early enough. Once the detail window is open, it is in certain situations already too late to tell the user that there are problems. A concrete example from the FoodCASE application is the aggregation of foods. To create a new aggregated food, some Single Food Components have to be selected and placed in the clipboard. After that, it is possible to click on the "Create Aggregated Food" button. In this case, a number of plausibility checks should be done before the detail window of the new Aggregated Food is opened. For example, if the user only selects 24 out of 25 components of a food, chances are that this was not the user's intention, so a warning should be displayed immediately when the "Create Aggregated Food" button is clicked. In order to allow such checks to be done beforehand, we created a simple framework (a minimal sketch of such a pre-check follows below).

Figure 7.2 shows what such a warning looks like. While some users may like this feature and find it helpful, others may be annoyed by it. For this reason, the individual warnings can be turned on and off on a per-user basis, as depicted in figure 7.3.
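As an illustration of such a pre-open check, the following sketch warns when only a subset of a food's components has been selected. The interface and class names as well as the warning text are invented for this illustration and do not correspond to the actual FoodCASE framework.

import java.util.List;
import javax.swing.JOptionPane;

public class AggregationPreCheck {

    /** A single validation rule evaluated before a detail window is opened. */
    interface PreOpenCheck {
        /** Returns a warning message, or null if the check passes. */
        String check(List<String> selectedComponents, int totalComponentsOfFood);
    }

    /** Warns when only a subset of a food's components was selected. */
    static final PreOpenCheck ALL_COMPONENTS_SELECTED = new PreOpenCheck() {
        public String check(List<String> selected, int total) {
            if (selected.size() < total) {
                return "Only " + selected.size() + " of " + total
                        + " components are selected. Continue anyway?";
            }
            return null;
        }
    };

    /** Runs a rule and asks the user for confirmation if it produced a warning. */
    static boolean confirmOrWarn(PreOpenCheck rule, List<String> selected, int total) {
        String warning = rule.check(selected, total);
        if (warning == null) {
            return true; // no problem, open the detail window directly
        }
        int answer = JOptionPane.showConfirmDialog(null, warning,
                "Data Quality Prevention", JOptionPane.YES_NO_OPTION);
        return answer == JOptionPane.YES_OPTION;
    }
}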




Figure 7.1: FoodCASE: Single Food Detail Window

1. The field which has an error is highlighted in red.

2. The data quality evaluation panel lists all errors in red colour. Warnings are coloured in orange.

3. A shortcut button is provided to quickly open the data record in the Data Quality Analysis Toolkit.



Figure 7.2: Data Quality Prevention: Warning

Figure 7.3: Data Quality Prevention: Configure Warnings

So far, five validation rules have been implemented using our enhancement of the "Data Quality Prevention" framework. All except the one in the middle are enabled. Unfortunately, the year of generation is not known for most of the data records in the FoodCASE system. Therefore, with the current state of the data, the warning message would grow huge if this rule were enabled as well.



7.2 Confidence Code for Aggregated Foods in FoodCASE

In section 2.4.2, we explained that the US Department of Agriculture (USDA) calculates a Confidence Code (CC) which is assigned to aggregated food component items. Furthermore, section 2.4.3 described how the EuroFIR Quality Index (QI) for single food components is defined. Using these QIs, it would be possible to derive a CC in the way the USDA does it. However, the EuroFIR did not define any rules for how such a CC should be calculated, and from [holden01] it is not clear how the USDA does it in detail. What we know is that they take the sum of the weighted averages of the ratings for each of the categories (5 in their case, 7 in our case) and assign a letter A, B, C or D to the aggregated food component.

Based on this idea, we implemented a similar method in FoodCASE as a proposal. As already mentioned earlier, rounding has the disadvantage of losing information. On the other hand, it would be nice if our CC were comparable with the USDA CC. As a compromise, we use the same letters, but additionally append a suffix ("++", "+", empty, "-" or "--") to achieve a finer granularity of ratings. The following table shows how we defined the translation in detail:

From (incl.)  7.0   8.4   9.8   11.2  12.6  14.0  15.4  16.8  18.2  19.6
To (excl.)    8.4   9.8   11.2  12.6  14.0  15.4  16.8  18.2  19.6  21.0
CC            D--   D-    D     D+    D++   C--   C-    C     C+    C++

From (incl.)  21.0  22.4  23.8  25.2  26.6  28.0  29.4  30.8  32.2  33.6
To (excl.)    22.4  23.8  25.2  26.6  28.0  29.4  30.8  32.2  33.6  35.0
CC            B--   B-    B     B+    B++   A--   A-    A     A+    A++

("From" values are inclusive, "To" values are exclusive.)

Table 7.1: Mapping from Averaged EuroFIR Quality Index to Confidence Code

Remark 1: From a mathematical point of view, it does not make any difference whether the weighted average for every category is calculated first and then the sum is taken, or whether the total EuroFIR QI for every contributing value is calculated first and then the weighted average is taken. In the implementation, we chose the second option so that we could reuse the existing code for the QI calculation (a short sketch of the mapping itself is given below).

Remark 2: If the averaged QI lies exactly on the border between two confidence codes, the better one is taken.
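The table-to-code translation is straightforward: the interval from 7 to 35 is split into 20 bins of width 1.4, and the bin index determines the letter and the suffix. The following minimal sketch shows this mapping; the class and method names are illustrative and not the actual FoodCASE code.

public class ConfidenceCodeSketch {

    /** Maps an averaged EuroFIR QI (7 to 35) to a confidence code with suffix. */
    static String confidenceCode(double averagedQi) {
        String[] letters  = {"D", "C", "B", "A"};
        String[] suffixes = {"--", "-", "", "+", "++"};

        // 20 bins of width 1.4; "From" inclusive, "To" exclusive (Table 7.1).
        // Note that floating-point arithmetic at the exact borders may need extra care.
        int bin = (int) Math.floor((averagedQi - 7.0) / 1.4);
        if (bin < 0)  bin = 0;   // below the defined range: clamp to D--
        if (bin > 19) bin = 19;  // 35.0 itself maps to A++ (Remark 2)

        return letters[bin / 5] + suffixes[bin % 5];
    }

    public static void main(String[] args) {
        System.out.println(confidenceCode(13.9)); // D++
        System.out.println(confidenceCode(14.0)); // C-- (on the border, the better code is taken)
        System.out.println(confidenceCode(35.0)); // A++
    }
}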

7.2.1 Recipes

In section 2.2.2, we mentioned that a recipe is a special kind of aggregated food for which the exact amount of its ingredients and the preparation method are known. The ingredients are aggregated foods as well. Therefore, creating recipes can be an iterative process where, for example, the recipe for a pizza dough can be entered first, and after that a number of pizzas with different toppings can be created. When a recipe is calculated, we propose that the confidence code for every component is calculated as the weighted average of the CCs of the ingredients according to their contribution to the physical weight of the resulting food.


8 Conclusion

8.1 Summary of Work

8.1.1 Data Quality Analysis Toolkit

In this master thesis, we designed and implemented a Data Quality Analysis Toolkit which allows data quality to be measured and visualised. First, we gave an overview of the history of food composition tables. After that, we focused on data quality in general and on data quality evaluation in the context of food composition databases specifically. We described how the US Department of Agriculture (USDA) assigns a Quality Index to single foods and that a Confidence Code for aggregated foods is derived. The EuroFIR came up with a quality index which is very similar to the one of the USDA.

Next, we introduced the Data Quality Analysis Toolkit, which we built as part of this thesis, by means of three concrete scenarios. The toolkit is not limited to the quality measures as they were defined by the USDA and the EuroFIR; it allows arbitrary quality requirements to be defined. We explained how such quality requirements can be expressed in SQL, and how the requirements can be grouped together recursively in order to create a data quality analysis tree. In this tree, it is possible to use different aggregation types such as the weighted mean or taking the maximum. Furthermore, every user can have his own tree definition if it is controversial how the data quality tree definition should look.

Using the data quality trees, it is possible to analyse the data quality in the FoodCASE system. By using diverse filter and grouping criteria, a single data record, the entire data in the system, or a subset of it can be considered. Furthermore, it is possible to define whether the current data quality should be analysed or whether a historical snapshot should be used. Additionally, the quality requirements which are not applicable to a certain data record can be treated in different ways.

The Data Quality Analysis Toolkit provides seven different Data Quality Views which allow the user to assess the quality of the data in the system. The chart types range from simple ones like the bar chart to more advanced ones like the box plot, which visualises the statistical properties of every data series. Additionally, the percentiles and standard deviation can




be accessed from every view by using the tool tips. A variety of view options makes it possible to customise the views according to the preferences of the user. All these options are remembered by the system for the next time the user logs in.

Once it is recognised that there are problems in a certain area of the data, the Data Problem Views allow the number of problems to be quantified exactly, and it is possible to drill down on them. Arriving at the leaf level of the data quality tree, single problems are listed which can then be fixed. By means of a Threshold Slider, the user can define how many percent of data quality are required for a data record to be of sufficient data quality. In other words, if the quality of a data record is below the threshold, it is regarded as deficient. Likewise, it is also possible to only consider data records of particularly good data quality.

The User-defined Scale allows the user to define a scale other than percentage, if this is desired. For example, this makes sense for the EuroFIR quality index, where the scale ranges from 7 to 35 by the definition of the index. Similarly, an institute could use its own scale for historical reasons. It is possible to switch between the percentage scale and the user-defined scale at any time.

Navigating towards the leaves of the data quality tree is supported by simply clicking on a chart entity. The breadcrumb navigation, which is located at the top of every screen, shows the path from the root node to the selected node and allows navigating upwards in the tree.

8.1.2 Administration Module

In chapter 4, we introduced the part of the admin module which allows system administrators to configure the Data Quality Analysis Toolkit. We presented the data model of the FoodCASE system, in which we coloured all the entities according to where the data in them originate from. This enabled us to identify six Data Quality Entities, for which we defined 156 Quality Requirements in total (compare with appendix B). After that, we explained how filter and grouping criteria for the data quality entities can be defined. Furthermore, we showed that a Quality Assessment can be triggered manually by the system administrator or automated using a timer. Lastly, the admin tool also allows a number of maintenance jobs to be run.

8.1.3 Extensions

Last but not least, we presented two extensions to FoodCASE which are only marginally related to the preceding chapters. The first one is an enhancement of the "Data Quality Prevention" framework. The existing framework provided a data quality panel in which all problems are listed. The new feature allows a number of validation rules to be checked even before the edit window opens. Because some users may like these checks to be done beforehand, while others may be annoyed by certain checks, it is possible to turn every validation rule on and off individually on a per-user basis.

The second extension is a proposal for a confidence code for aggregated foods in FoodCASE, similar to the CC developed by the USDA. As we considered the rounding to the discrete values A, B, C or D potentially harmful, we introduced additional ratings such as B++ in order to achieve a finer granularity. With this compromise, we do not lose too much precision while maintaining easy comparability with the USDA scoring system.



8.2 Future Work

Although the Data Quality Analysis Toolkit is already a powerful tool, there are still a few points which could be optimised. In this section, we list a number of possible optimisations and extensions.

Client-Side Caching In section 5.5.2, we introduced the file system cache, which is implemented on the application server. This cache already improved the performance a lot and took load away from the database. Everything works fine if a fast connection to the server is available, for example if the quality analysis is done from within the ETH network. However, if the user is working remotely and only has a slow internet connection at his disposal, it can take a while until all the data are fetched. Therefore, it would make sense to use the same caching logic, which already exists on the server side, on the client side as well. This means that there would be an L1 cache locally, an L2 cache on the application server, and the original data would remain in the database. This way, the quality assessment data would only be copied once to the local machine, resulting in an improved access time the next time the data quality of the same quality assessment (snapshot) is analysed (a small sketch of this lookup order is given below).
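A minimal sketch of the proposed lookup order, assuming a simple key/value interface; the class and interface names are invented and only illustrate the local cache, server cache and database fall-through.

import java.util.HashMap;
import java.util.Map;

public class TwoLevelCacheSketch {

    /** Common interface for the server-side cache and the database. */
    interface AssessmentSource {
        String load(String assessmentKey); // returns null if not present
    }

    private final Map<String, String> l1 = new HashMap<String, String>(); // local cache
    private final AssessmentSource l2;        // cache on the application server
    private final AssessmentSource database;  // original data

    TwoLevelCacheSketch(AssessmentSource l2, AssessmentSource database) {
        this.l2 = l2;
        this.database = database;
    }

    /** Looks up assessment data locally first, then on the server, then in the DB. */
    String load(String assessmentKey) {
        String data = l1.get(assessmentKey);
        if (data == null) {
            data = l2.load(assessmentKey);
            if (data == null) {
                data = database.load(assessmentKey);
            }
            if (data != null) {
                l1.put(assessmentKey, data); // keep a local copy for the next analysis
            }
        }
        return data;
    }
}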

Access Rights The FoodCASE application implements a Role-Based Access Control (RBAC) model, albeit only a primitive one. There are three statically defined roles: read-only, compiler and admin. Only those users who have the admin role can log in to the admin tool, but in FoodCASE itself, all users have the same permissions and everybody can access everything. Because of this, we considered it needless to implement a sophisticated access control model in the Data Quality Analysis Toolkit. Since the toolkit is kept very generic, it would easily be possible to separate it from FoodCASE and integrate it into any other system built on top of an RDBMS. In this context, it would be desirable to have, for example, the possibility to declare a data quality tree to be private. This could be realised by assigning an owner to every tree definition. Additionally, there would be different sharing modes: private, share for reading and share for writing. If more complex rules are required, another possibility would be to implement a dedicated RBAC model in the Data Quality Analysis Toolkit. A third option would be to make use of the access control model of the host system.

Packaging for Distribution The Data Quality Analysis Toolkit is fully integrated into the FoodCASE system. Nevertheless, as already mentioned multiple times, it would easily be possible to use it in any other system built on top of an RDBMS. In order to support this, it is imaginable to ship the toolkit as a JAR1 file. Additionally, we could provide an SQL script which creates all the tables needed by the toolkit (see appendix C). The implementer of the host system would only have to place a menu link or a button to start the Data Quality Analysis Toolkit somewhere in his application. And then, of course, he would have to configure the toolkit as described in chapter 4. Most importantly, he would have to define the quality requirements for his system.

Support for OODBMS2 Back-End In the Data Quality Analysis Toolkit, all the quality requirements have to be expressed as SQL statements. However, it is conceivable to use the toolkit in the same way in combination with an object-oriented database back-end.

1 Java Archive
2 Object-Oriented Database Management System



In order to do so, a requirement is that the OODBMS supports a plain-text query language such as OQL3. Otherwise, it is not possible to express a quality requirement as a string and store it as such in the database. The persistence logic of the Data Quality Analysis Toolkit itself is mainly implemented in JPQL, which makes it independent of the vendor of the RDBMS. However, if an object-oriented back-end is to be used, some effort is required to rewrite the persistence logic. Since the EJB back-end is relatively slim, this would be feasible. Nevertheless, because the back-end has to be touched anyway, it would be worthwhile to rethink how a quality requirement can best be expressed in the context of object-oriented databases. Optimally, an approach should be found which could be used for any OODBMS.

Query Web Services It was a pragmatic decision to express all quality requirements as SQL statements, but it is a restriction as well. One possible extension would be to allow a web service to be queried, for example to verify that an ISBN really belongs to a certain book, or to check if a zip code matches a given city. In addition to web services, other channels to access information could be integrated as well. This could be a local file, a remote file on a network location, an FTP4 server, an LDAP5 server or similar.

Data Quality Alerts On the Data Problem Views, there is a Threshold Slider to define how many percent of data quality are required for a data record to be of sufficient quality. Similarly, a monitoring tool could be implemented which raises an alarm if a certain data quality rule is violated, for example if the average data quality drops below a predefined threshold, or if new data records of poor quality are entered into the system. On top of this monitoring system, an escalation procedure could be defined. For example, if a problem remains unresolved for a week, the department head gets informed, after two weeks the section head, and if everything breaks, it escalates to the executive board.

Additional Quality Requirements In section 4.2, we mentioned that, as of yet, only quality requirements for the data which are actively maintained within FoodCASE have been defined. In a second step, this could be extended to check imported third-party data as well. Furthermore, it might be possible for a food scientist, who has in-depth knowledge of the business domain, to define additional quality requirements for the existing six quality entities.

Using the Toolkit Last but not least, and this is maybe the most important point of all, the Data Quality Analysis Toolkit has to be used to identify data problems in the FoodCASE database, and these problems have to be addressed in order to increase the quality of the data. One major problem we can already name at this point is that for large parts of the data the age is unknown.

Additional Quality Prevention Rules Regarding the enhancement we created for the "Data Quality Prevention" framework, additional validation rules could be implemented. For example, when clicking "Create Aggregated Food" it should be checked whether all public components (those which appear in the online version) are included.

3 Object Query Language
4 File Transfer Protocol
5 Lightweight Directory Access Protocol


A Terms and Abbreviations

API        Application Programming Interface
CC         Confidence Code
CD-ROM     Compact Disc Read-Only Memory
CMS        Content Management System
COST       Cooperation in Science and Technology
CPU        Central Processing Unit
CSV        Comma Separated Values
DB         Database
ETH        Swiss Federal Institute of Technology (German: Eidgenössische Technische Hochschule)
EJB        Enterprise Java Bean
ERM        Entity Relationship Model
EuroFIR    European Food Information Resource
FCDB       Food Composition Database
FCN        Federal Commission for Nutrition
FDTP       Food Data Transport Package
FoodCASE   Food Composition And System Environment
FOPH       Federal Office of Public Health
FTP        File Transfer Protocol
GUI        Graphical User Interface
IEEE       Institute of Electrical and Electronics Engineers
I/O        Input/Output
ISBN       International Standard Book Number
ISSN       International Standard Serial Number
JAR        Java Archive
JDK        Java Development Kit
JPA        Java Persistence API
JPEG       Joint Photographic Experts Group




JPQL       Java Persistence Query Language
JSR        Java Specification Request
LDAP       Lightweight Directory Access Protocol
LOC        Lines of Code
MBA        Master of Business Administration
MVC        Model-View-Controller
NANUSS     NAtional NUtrition Survey Switzerland
OODBMS     Object-Oriented Database Management System
OQL        Object Query Language
QI         Quality Index
RAM        Random Access Memory
ROM        Read-Only Memory
RBAC       Role Based Access Control
RDBMS      Relational Database Management System
RODQ       Requirement-Oriented Data Qualify
SQL        Structured Query Language
SVG        Scalable Vector Graphics
SwissFIR   Swiss Food Information Resource
UML        Unified Modelling Language
USDA       US Department of Agriculture
XML        Extensible Markup Language

Table A.1: Terms and Abbreviations


B FoodCASE Quality Requirements

On the following pages, all the quality requirements implemented in the FoodCASE system are listed. In order to keep similar requirements together in the list, the following prefixes are used:

EC:       EuroFIR Consistency (consistency between the data in FoodCASE and the answers to the EuroFIR quality questions)
EuroFIR:  EuroFIR Quality Question
MF:       Mandatory Field
SP:       Statistical Problem
SY:       Synonym
VN:       Valid Number

Table B.1: Quality Requirement Name Prefixes



# | Quality Entity | Requirement Name | Description | Type
1 | Single Food | Every food must have at least 4 component values | | Soft Constraint
2 | Single Food | Fill factor | Fill factor of single food | Indicator
3 | Single Food | For every food, carbohydrate (CHOT or CHO) must be provided | Carbohydrate is a mandatory component. CHOT or CHO has to be provided. | Soft Constraint
4 | Single Food | For every food, energy must be provided | Energy is a mandatory component | Soft Constraint
5 | Single Food | For every food, fat must be provided | Fat is a mandatory component | Soft Constraint
6 | Single Food | For every food, protein must be provided | Protein is a mandatory component | Soft Constraint
7 | Single Food | For home-made food, recipe description must be provided | | Soft Constraint
8 | Single Food | If FAT = 0 or logical zero, then all other fatty acids can exist but not > 0 | | Hard Constraint
9 | Single Food | Minimum length of single food name >= 2 | Food name must have at least two letters | Hard Constraint
10 | Single Food | MF: English food name | English food name is mandatory | Hard Constraint
11 | Single Food | MF: EuroFIR classification | EuroFIR classification is mandatory | Soft Constraint
12 | Single Food | MF: Restaurant or home-made | The restaurant or home-made flag is mandatory | Soft Constraint
13 | Single Food | MF: Retention factor classification | Retention factor classification is mandatory | Soft Constraint
14 | Single Food | Scientific name or brand name must be set | If no brand name is set, then scientific name must be provided and vice versa | Soft Constraint
15 | Single Food | SY: Combination of language=en and type=translation should not occur | | Hard Constraint
16 | Single Food | SY: Combination of synonym term, language and type must be unique | | Hard Constraint
17 | Single Food | SY: Mandatory field synonym term | If a synonym entry exists, the field synonym term is mandatory | Hard Constraint
18 | Single Food | SY: Mandatory field synonym type | If a synonym entry exists, the field synonym type is mandatory | Hard Constraint
19 | Single Component | Acquisition type known | If acquisition type = not known, then data have less quality | Soft Constraint
20 | Single Component | At least one method should exist | Is there at least one method for a single value? | Soft Constraint
21 | Single Component | At least one reference should exist | Is there at least one reference for a single value? | Soft Constraint
22 | Single Component | At least one sample should exist | Is there at least one sample for a single value? | Soft Constraint
23 | Single Component | Data older than 5 years (evaluation date) | If evaluation date > 5 years, then compiler should look for new data | Soft Constraint
24 | Single Component | EC: Brand name | If the brand name is set, answer to the corresponding EuroFIR question should be YES else NO | Soft Constraint
25 | Single Component | EC: Commercial name | If commercial name is set, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
26 | Single Component | EC: Generic name | If generic name is set, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
27 | Single Component | EC: Lab accredited | If the lab was accredited, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
28 | Single Component | EC: Number of analytical samples | Number of analytical samples should match answer to corresponding EuroFIR question | Soft Constraint
29 | Single Component | EC: Portion replicates | If portion replicates > 1, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
30 | Single Component | EC: Reference material | If a standard reference material was used, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
31 | Single Component | EC: Samples homogenized | If samples were homogenized, answer to the corresponding EuroFIR question should be YES else NO | Soft Constraint
32 | Single Component | EC: Samples stabilized | If samples were stabilized, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
33 | Single Component | EuroFIR: Analytical Method | | Indicator
34 | Single Component | EuroFIR: Is component unambiguous | Is the component described unambiguously? | Indicator
35 | Single Component | EuroFIR: Is food group known | Is the food group (e.g. beverage, dessert, savoury snack, pasta dish) known? | Indicator
36 | Single Component | EuroFIR: Is matrix unit unequivocal | Is the matrix unit unequivocal? | Indicator
37 | Single Component | EuroFIR: Is the heat treatment known | Is the extent of heat treatment known? | Indicator
38 | Single Component | EuroFIR: Is unit unequivocal | Is the unit unequivocal? | Indicator
39 | Single Component | EuroFIR: Number of analytical samples | Number of analytical samples | Indicator
40 | Single Component | EuroFIR: Was brand provided | If relevant, was the brand provided (e.g. Ferrero)? | Indicator
41 | Single Component | EuroFIR: Was commercial name provided | Was the commercial name provided (e.g. Nutella)? | Indicator
42 | Single Component | EuroFIR: Was consumers/dietary/label claim info provided | Was relevant information on consumer group/dietary use/label claim info provided? | Indicator
43 | Single Component | EuroFIR: Was generic name provided | Was the generic name provided (e.g. chocolate paste with hazelnuts)? | Indicator
44 | Single Component | EuroFIR: Was geographical information provided | If relevant, was information about the geographical origin of the food provided? | Indicator
45 | Single Component | EuroFIR: Was information on packing medium provided | If relevant, was information on packing medium provided? | Indicator
46 | Single Component | EuroFIR: Was information on preservation method provided | Was information on preservation method provided? | Indicator
47 | Single Component | EuroFIR: Was information on treatment provided | Was relevant information on treatment applied provided? | Indicator
48 | Single Component | EuroFIR: Was laboratory accredited | Was the laboratory accredited for this method or was the method validated by performance testing? | Indicator
49 | Single Component | EuroFIR: Was more than one brand/cultivar/subspecies sampled | If relevant, was more than one brand (for manufactured pre-packed product) or more than one cultivar (for plant foods) or subspecies (for animal foods) sampled? | Indicator
50 | Single Component | EuroFIR: Was number of primary samples > 9 | Was the number of primary samples > 9? | Indicator
51 | Single Component | EuroFIR: Was production month/season indicated | If relevant, was the month or season of production indicated? | Indicator
52 | Single Component | EuroFIR: Was recipe name and description provided | Was the complete name and description of the recipe provided? | Indicator
53 | Single Component | EuroFIR: Was reference material used | If relevant, was an appropriate reference material or a standard reference material used? | Indicator
54 | Single Component | EuroFIR: Was sample homogenized | Was the sample homogenized? | Indicator
55 | Single Component | EuroFIR: Was sample moisture content given | Was the moisture content of the sample measured and the result given? | Indicator
56 | Single Component | EuroFIR: Was sampling plan developed | Was the sampling plan developed to represent the consumption in the country where the study was conducted? | Indicator
57 | Single Component | EuroFIR: Was stabilization applied | If relevant, were appropriate stabilization treatments applied (e.g. protection from heat/air/light/microbial activity)? | Indicator
58 | Single Component | EuroFIR: Was the analysed portion described | If relevant, was the analysed portion described and is it clear if the food was analysed with or without the inedible part? | Indicator
59 | Single Component | EuroFIR: Was the food source provided | Was the food source of the food or of the main ingredient provided (best if scientific name included, cultivar/variety, genus/species, etc.)? | Indicator
60 | Single Component | EuroFIR: Was the part of plant/animal indicated | Was the part of plant or part of animal clearly indicated? | Indicator
61 | Single Component | EuroFIR: Were cooking method details provided | If the food was cooked, were satisfactory cooking method details provided? | Indicator
62 | Single Component | EuroFIR: Were portion replicates tested | Were analytical portion replicates tested? | Indicator
63 | Single Component | EuroFIR: Were samples taken from important outlets | If relevant, were samples taken from the most important sales outlets (supermarket, local grocery, street market, restaurant, household etc.)? | Indicator
64 | Single Component | EuroFIR: Were samples taken from more than one location | If relevant, were samples taken from more than one geographical location? | Indicator
65 | Single Component | EuroFIR: Were samples taken from more than one season | If relevant, were samples taken during more than one season of the year? | Indicator
66 | Single Component | Given year of generation causes generation by to be mandatory | | Soft Constraint
67 | Single Component | How well does food match | How well does food match the food in the database? | Indicator
68 | Single Component | Is food national representative | How representative is the food to national consumption? | Indicator
69 | Single Component | Matrix unit cannot be used for aggregation | Not for aggregation usable matrix units = {per 100g total fat, per 100g total fatty acids, per g total fat, per g nitrogen} | Soft Constraint
70 | Single Component | Method indicator known | If method indicator is not known, then data have less quality | Soft Constraint
71 | Single Component | Method parameter mandatory for protein calculated from nitrogen | If component = protein and method indicator in {MI0121, MI0122, MI0123} then NCF should be inserted in Method Parameter field, range [4.60 to 7.10], and NCF source in the Method Reference field | Hard Constraint
72 | Single Component | Method parameter mandatory for total fatty acids calculated from total fat | If component = total fatty acid and method indicator in MI0207 then method parameter must be inserted in the range [0.50, 0.99] | Hard Constraint
73 | Single Component | Method type known | If method type is not known, then data have less quality | Soft Constraint
74 | Single Component | MF: Date of compilation and compilation by | | Hard Constraint
75 | Single Component | MF: Evaluation date and evaluation by | | Soft Constraint
76 | Single Component | MF: Method name | For legacy reason not on database and input mask | Hard Constraint
77 | Single Component | MF: Method original name | For legacy reason not on database and input mask | Hard Constraint
78 | Single Component | MF: Selected value | Except when value type = trace, below detection limit, undecidable or unknown | Hard Constraint
79 | Single Component | MF: Year of generation | | Hard Constraint
80 | Single Component | Old single value (year of generation > 10 years) | | Soft Constraint
81 | Single Component | Selected value = 0 if value type = logical zero | | Hard Constraint
82 | Single Component | Selected value has reasonable precision | | Soft Constraint
83 | Single Component | SP: Mean <= Max | Mean must be <= maximum value | Hard Constraint
84 | Single Component | SP: Mean >= Min | Mean must be >= minimum value | Hard Constraint
85 | Single Component | SP: Median <= Max | Median must be <= maximum value | Hard Constraint
86 | Single Component | SP: Median >= Min | Median must be >= minimum value | Hard Constraint
87 | Single Component | SP: Std.dev. < Max-Min | Standard deviation must be < max-min | Hard Constraint
88 | Single Component | SP: Std.error <= Std.dev. | Standard error must be <= standard deviation | Hard Constraint
89 | Single Component | SP: SV <= Max | Selected value must be <= maximum value | Hard Constraint
90 | Single Component | SP: SV >= Min | Selected value must be >= minimum value | Hard Constraint
91 | Single Component | Unit Degree Brix does not have matrix unit | | Hard Constraint
92 | Single Component | Value type known | If value type = unknown, then the data have less quality | Soft Constraint
93 | Single Component | VN: Maximum | Maximum is valid number | Hard Constraint
94 | Single Component | VN: Mean | Mean is valid number | Hard Constraint
95 | Single Component | VN: Median | Median is valid number | Hard Constraint
96 | Single Component | VN: Minimum | Minimum is valid number | Hard Constraint
97 | Single Component | VN: Selected value | Selected value is valid number | Hard Constraint
98 | Single Component | VN: Standard deviation | Standard deviation is valid number | Hard Constraint
99 | Single Component | VN: Standard error | Standard error is valid number | Hard Constraint
100 | Single Component | Year of analysis must have 4 digits | | Hard Constraint
101 | Aggregated Food | Every food must have at least 4 component values | | Soft Constraint
102 | Aggregated Food | Fill factor | Fill factor of aggregated food | Indicator
103 | Aggregated Food | For every food, carbohydrate (CHOT or CHO) must be provided | Carbohydrate is a mandatory component. CHOT or CHO has to be provided. | Soft Constraint
104 | Aggregated Food | For every food, energy must be provided | Energy is a mandatory component | Soft Constraint
105 | Aggregated Food | For every food, fat must be provided | Fat is a mandatory component | Soft Constraint
106 | Aggregated Food | For every food, protein must be provided | Protein is a mandatory component | Soft Constraint
107 | Aggregated Food | For home-made food, recipe description must be provided | | Soft Constraint
108 | Aggregated Food | If FAT = 0 or logical zero, then all other fatty acids can exist but not > 0 | | Hard Constraint
109 | Aggregated Food | Minimum length of single food name >= 2 | Food name must have at least two letters | Hard Constraint
110 | Aggregated Food | MF: English food name | English food name is mandatory | Hard Constraint
111 | Aggregated Food | MF: EuroFIR classification | EuroFIR classification is mandatory | Soft Constraint
112 | Aggregated Food | MF: Restaurant or home-made | The restaurant or home-made flag is mandatory | Soft Constraint
113 | Aggregated Food | MF: Retention factor classification | Retention factor classification is mandatory | Soft Constraint
114 | Aggregated Food | Scientific name or brand name must be set | If no brand name is set, then scientific name must be provided and vice versa | Soft Constraint
115 | Aggregated Food | SY: Combination of language=en and type=translation should not occur | | Hard Constraint
116 | Aggregated Food | SY: Combination of synonym term, language and type must be unique | | Hard Constraint
117 | Aggregated Food | SY: Mandatory field synonym term | Synonym term is mandatory | Hard Constraint
118 | Aggregated Food | SY: Mandatory field synonym type | Synonym type is mandatory | Hard Constraint
119 | Aggregated Component | Acquisition type known | If acquisition type = not known, then data have less quality | Soft Constraint
120 | Aggregated Component | At least one contributing value should exist | | Hard Constraint
121 | Aggregated Component | At least one reference should exist | Is there at least one reference for a single value? | Soft Constraint
122 | Aggregated Component | Contributing values must not be older than 5 years (year of generation) | | Soft Constraint
123 | Aggregated Component | Method indicator known | If method indicator is not known, then data have less quality | Soft Constraint
124 | Aggregated Component | Method type known | If method type is not known, then data have less quality | Soft Constraint
125 | Aggregated Component | MF: Selected value | Except when value type = trace, below detection limit, undecidable or unknown | Hard Constraint
126 | Aggregated Component | Selected value = 0 if value type = logical zero | | Hard Constraint
127 | Aggregated Component | Selected value has reasonable precision | | Soft Constraint
128 | Aggregated Component | SP: Mean <= Max | Mean must be <= maximum value | Hard Constraint
129 | Aggregated Component | SP: Mean >= Min | Mean must be >= minimum value | Hard Constraint
130 | Aggregated Component | SP: Median <= Max | Median must be <= maximum value | Hard Constraint
131 | Aggregated Component | SP: Median >= Min | Median must be >= minimum value | Hard Constraint
132 | Aggregated Component | SP: Std.dev. < Max-Min | Standard deviation must be < max-min | Hard Constraint
133 | Aggregated Component | SP: Std.error <= Std.dev. | Standard error must be <= standard deviation | Hard Constraint
134 | Aggregated Component | SP: SV <= Max | Selected value must be <= maximum value | Hard Constraint
135 | Aggregated Component | SP: SV >= Min | Selected value must be >= minimum value | Hard Constraint
136 | Aggregated Component | Unit Degree Brix does not have matrix unit | | Hard Constraint
137 | Aggregated Component | Value type known | If value type = unknown, then the data have less quality | Soft Constraint
138 | Aggregated Component | VN: Maximum | Maximum is valid number | Hard Constraint
139 | Aggregated Component | VN: Mean | Mean is valid number | Hard Constraint
140 | Aggregated Component | VN: Median | Median is valid number | Hard Constraint
141 | Aggregated Component | VN: Minimum | Minimum is valid number | Hard Constraint
142 | Aggregated Component | VN: Selected value | Selected value is valid number | Hard Constraint
143 | Aggregated Component | VN: Standard deviation | Standard deviation is valid number | Hard Constraint
144 | Aggregated Component | VN: Standard error | Standard error is valid number | Hard Constraint
145 | Recipe | At least one ingredient should exist | | Hard Constraint
146 | Recipe | At least one reference should exist | | Soft Constraint
147 | Recipe | For every ingredient amount must be > 0 | | Hard Constraint
148 | Recipe | For every ingredient at least one preparation method should be specified | | Soft Constraint
149 | Recipe | MF: Recipe procedure | Recipe procedure is mandatory | Hard Constraint
150 | Recipe | MF: Yield factor alcohol | Yield factor alcohol is mandatory | Soft Constraint
151 | Recipe | MF: Yield factor fat | Yield factor fat is mandatory | Soft Constraint
152 | Recipe | MF: Yield fact

fact

orw

ater

Yie

ldfa

ctor

wat

eris

man

dato

rySo

ftC

onst

rain

t15

3R

efer

ence

Acq

uisi

tion

type

know

nIf

acqu

isiti

onty

pe=

notk

now

nth

enda

taha

vele

ssqu

ality

Soft

Con

stra

int

155

Ref

eren

ceM

F:R

efer

ence

lang

uage

Soft

Con

stra

int

155

Ref

eren

ceR

efer

ence

type

know

nIf

refe

renc

ety

peis

not

know

nth

enda

taha

vele

ssqu

ality

Soft

Con

stra

int

156

Ref

eren

ceR

efer

ence

UR

Lis

valid

Soft

Con

stra

int

Tabl

eB

.2:F

oodC

ASE

Qua

lity

Req

uire

men

ts
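As an illustration of how such a requirement can be checked, the following sketch finds violations of hard constraint no. 128 (SP: Mean <= Max). The table and column names tblvalue, idvalue, mean and maximum are hypothetical stand-ins for the actual FoodCASE schema; the toolkit itself parameterises such checks through the assessment SQL described in Table 4.1.

-- Hypothetical violation check for requirement no. 128 ("SP: Mean <= Max").
-- tblvalue, idvalue, mean and maximum are assumed names, not the real FoodCASE tables.
SELECT v.idvalue
FROM tblvalue v
WHERE v.mean IS NOT NULL
  AND v.maximum IS NOT NULL
  AND v.mean > v.maximum;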


C Physical Database Schema

In this appendix, the definitions of all the tables belonging to the Data Quality Analysis Toolkit are given. Please refer to section 5.3 for a description of the data model.

Note: All the columns of the data type “timestamp” are actually “timestamp without time zone”.

tblqualityassessment

Column Name | Data Type | Constraint(s)/Reference
assessmentid | serial | NOT NULL PRIMARY KEY
startdate | timestamp | NOT NULL
enddate | timestamp |
remarks | character varying(300) |
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL
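As a minimal sketch, the listing above translates into PostgreSQL DDL roughly as follows; anything beyond the listed columns and constraints (for example explicit constraint names or defaults) is not specified in this appendix.

-- Sketch of tblqualityassessment derived from the column listing above.
-- "timestamp" stands for "timestamp without time zone" (see the note above).
CREATE TABLE tblqualityassessment (
    assessmentid serial                  NOT NULL PRIMARY KEY,
    startdate    timestamp               NOT NULL,
    enddate      timestamp,
    remarks      character varying(300),
    creation     timestamp               NOT NULL,
    creationby   character varying(100)  NOT NULL,
    mutation     timestamp               NOT NULL,
    mutationby   character varying(100)  NOT NULL
);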


tblqualityentity

Column Name | Data Type | Constraint(s)/Reference
entityid | serial | NOT NULL PRIMARY KEY
entityname | character varying(50) | NOT NULL UNIQUE
condition | character varying(500) | NOT NULL
displayname | character varying(30) | NOT NULL
keycolumn | character varying(30) | NOT NULL
alias | character varying(10) | NOT NULL
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL

tblqualityentityfilter

Column Name | Data Type | Constraint(s)/Reference
filterid | serial | NOT NULL PRIMARY KEY
entityid | integer | NOT NULL tblqualityentity (entityid)
filtername | character varying | NOT NULL
filtercolumn | character varying | NOT NULL
jointable | character varying |
joincolumn | character varying |
namecolumn | character varying |
applycondition | boolean | NOT NULL
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL

tblqualityrequirement

Column Name | Data Type | Constraint(s)/Reference
requirementid | serial | NOT NULL PRIMARY KEY
requirementname | character varying(300) | NOT NULL UNIQUE
requirementdescription | character varying(300) |
entityid | integer | NOT NULL
assessmentsql | character varying(5000) | NOT NULL
requirementtypeid | integer | NOT NULL CHECK IN (0, 1, 2)
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL
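Requirements such as those in Table B.2 are stored as rows of this table. The following INSERT is purely illustrative; the entity id, the requirement type code and the assessment SQL with its placeholders (see Table 4.1) are assumed values, not taken from the running system.

-- Illustrative storage of requirement no. 128 ("SP: Mean <= Max") from Table B.2.
INSERT INTO tblqualityrequirement
    (requirementname, requirementdescription, entityid, assessmentsql,
     requirementtypeid, creation, creationby, mutation, mutationby)
VALUES
    ('SP: Mean <= Max',
     'Mean must be <= maximum value',
     2,             -- assumed id of the Aggregated Component entity
     'SELECT ...',  -- assessment SQL containing the toolkit placeholders of Table 4.1
     1,             -- assumed type code, constrained by CHECK IN (0, 1, 2)
     now(), 'admin', now(), 'admin');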


tblqualityrequirementvalue

Column Name | Data Type | Constraint(s)/Reference
valueid | serial | NOT NULL PRIMARY KEY
requirementid | integer | NOT NULL tblqualityrequirement (requirementid)
assessmentid | integer | NOT NULL tblqualityassessment (assessmentid)
refid | integer | NOT NULL
value | double precision |
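Since this table stores one row per requirement, assessment and assessed object (refid), per-requirement summaries can be computed directly in the database. The query below is a sketch; it interprets value as a degree of fulfilment, which is an assumption and not part of the schema definition.

-- Average requirement fulfilment within one quality assessment (id 1 is an example).
SELECT r.requirementname,
       avg(v.value) AS avg_value,
       count(*)     AS assessed_objects
FROM tblqualityrequirementvalue v
JOIN tblqualityrequirement r ON r.requirementid = v.requirementid
WHERE v.assessmentid = 1
GROUP BY r.requirementname
ORDER BY avg_value;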

tblqualitytreedefinition

Column Name | Data Type | Constraint(s)/Reference
treedefid | serial | NOT NULL PRIMARY KEY
treename | character varying(100) | NOT NULL UNIQUE
treedescription | character varying(1000) |
entityid | integer | NOT NULL tblqualityentity (entityid)
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL

tblqualitytreeedge

Column Name | Data Type | Constraint(s)/Reference
edgeid | serial | NOT NULL PRIMARY KEY
parent nodeid | integer | NOT NULL tblqualitytreenode (nodeid), UNIQUE (parent nodeid, child nodeid)
child nodeid | integer | NOT NULL tblqualitytreenode (nodeid), UNIQUE (parent nodeid, child nodeid)
weight | double precision |
treedefid | integer | NOT NULL tblqualitytreedefinition (treedefid)
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL
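Because a quality tree is stored as an edge list, it can be traversed with a recursive query. The sketch below assumes the edge columns are spelled parent_nodeid and child_nodeid in the database and that a tree definition with id 1 and a root node with id 10 exist; adjust these values to the real data.

-- Walk one quality tree top-down, starting from an assumed root node.
WITH RECURSIVE subtree(nodeid, depth) AS (
    SELECT 10, 0                          -- assumed root node id
    UNION ALL
    SELECT e.child_nodeid, s.depth + 1
    FROM tblqualitytreeedge e
    JOIN subtree s ON e.parent_nodeid = s.nodeid
    WHERE e.treedefid = 1                 -- assumed tree definition id
)
SELECT nodeid, depth
FROM subtree
ORDER BY depth, nodeid;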


tblqualitytreenode

Column Name | Data Type | Constraint(s)/Reference
nodeid | serial | NOT NULL PRIMARY KEY
height | integer | NOT NULL
index | integer | NOT NULL
collapsed | boolean | NOT NULL
userscalemin | integer |
userscalemax | integer |
treedefid | integer | NOT NULL tblqualitytreedefinition (treedefid)
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL

tblqualitytreenodeaggregation

Column Name | Data Type | Constraint(s)/Reference
name | character varying(50) | NOT NULL
aggregationtypeid | integer | NOT NULL CHECK IN (0, 1, 2, 3, 4, 5)
nodeid | integer | NOT NULL PRIMARY KEY, tblqualitytreenode (nodeid)

tblqualitytreenoderequirement

Column Name | Data Type | Constraint(s)/Reference
requirementid | integer | NOT NULL tblqualityrequirement (requirementid)
nodeid | integer | NOT NULL PRIMARY KEY, tblqualitytreenode (nodeid)

tblqualityusersetting

Column Name | Data Type | Constraint(s)/Reference
usersettingid | serial | NOT NULL PRIMARY KEY
userid | integer | NOT NULL tbluser (iduser)
simpleclassname | character varying(100) | NOT NULL
fieldname | character varying(100) | NOT NULL
value | character varying(100) | NOT NULL
creation | timestamp | NOT NULL
creationby | character varying | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying | NOT NULL


tblqualitywarningtext

Column Name | Data Type | Constraint(s)/Reference
warningid | integer | NOT NULL PRIMARY KEY, DEFAULT nextval(..)
warningtext | character varying(300) | NOT NULL
param | character varying(20) |
creation | timestamp | NOT NULL
creationby | character varying | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying | NOT NULL

tblqualitywarninguser

Column Name | Data Type | Constraint(s)/Reference
warninguserid | integer | NOT NULL PRIMARY KEY, DEFAULT nextval(..)
warningid | integer | NOT NULL tblqualitywarningtext (warningid), UNIQUE (warningid, userid)
userid | integer | NOT NULL tbluser (iduser), UNIQUE (warningid, userid)
enabled | boolean | NOT NULL
creation | timestamp | NOT NULL
creationby | character varying | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying | NOT NULL
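Together with tblqualitywarningtext, this table determines which data quality warnings are shown to a user. A sketch of the corresponding lookup, with userid 1 as an example value:

-- All warning texts that are currently enabled for one user.
SELECT t.warningid, t.warningtext, t.param
FROM tblqualitywarningtext t
JOIN tblqualitywarninguser u ON u.warningid = t.warningid
WHERE u.userid = 1
  AND u.enabled;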


List of Figures

2.1 Swiss Food Composition Database, Public Online Interface . . . . . 5
2.2 FoodCASE Application . . . . . 7
2.3 FoodCASE System Architecture . . . . . 8
2.4 A Conceptual Framework of Data Quality . . . . . 9

3.1 Start Screen of the Data Quality Analysis Toolkit . . . . . 14
3.2 Data Quality Tree: Aggregated Foods . . . . . 15
3.3 Data Problem Table: Aggregated Foods . . . . . 15
3.4 Data Quality Line Chart: EuroFIR Quality Index . . . . . 17
3.5 Data Quality Bar Chart: Group by Mutation User . . . . . 17
3.6 A Simple Tree Definition . . . . . 19
3.7 Tree Editor . . . . . 20
3.8 Example of a Box Plot . . . . . 24
3.9 Example of a Spider Chart . . . . . 25
3.10 Example of a Tree Table . . . . . 28
3.11 Example of a Bar Chart with Error Bars . . . . . 29

4.1 Data Model of the FoodCASE Application . . . . . 32
4.2 Data Model of the FoodCASE Application: Close-Up . . . . . 33
4.3 Administration Tool: Edit a Quality Requirement . . . . . 36

5.1 Data Model of the Data Quality Analysis Toolkit, UML Class Diagram . . . . . 42
5.2 Runtime Data Structure of the Data Quality Analysis Toolkit . . . . . 48
5.3 Tree View and Tree Editor, UML Class Diagram . . . . . 51
5.4 GUI Components, UML Class Diagram . . . . . 53
5.5 Admin Module, UML Class Diagram . . . . . 58

6.1 JUnit Back-End Testing . . . . . 60
6.2 Automated GUI Testing (Screenshot) . . . . . 61
6.3 Automated GUI Testing, Application Monitoring . . . . . 62

7.1 FoodCASE: Single Food Detail Window . . . . . 64
7.2 Data Quality Prevention: Warning . . . . . 65
7.3 Data Quality Prevention: Configure Warnings . . . . . 65


List of Tables

3.1 Views and View Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.1 Quality Requirement: Assessment SQL Place Holders . . . . . . . . . . . . 35

5.1 Code Metrics of the Data Quality Analysis Toolkit . . . . . 41
5.2 File System Cache Structure . . . . . 45
5.3 Plot Types and Datasets used by the Data Quality Analysis Views . . . . . 55

7.1 Mapping from Averaged EuroFIR Quality Index to Confidence Code . . . . 66

A.1 Terms and Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

B.1 Quality Requirement Name Prefixes . . . . . 73
B.2 FoodCASE Quality Requirements . . . . . 83


List of Listings

5.1 Writing to File System Cache . . . . . 46
5.2 Positioning of Tree Nodes (Pseudo Code) . . . . . 50
5.3 Logging and Error Handling . . . . . 57


Acknowledgements

First and foremost, I would like to thank Prof. Dr. Moira C. Norrie for accepting me and giving me the opportunity to write my master thesis in the GlobIS group. Furthermore, I would like to thank my supervisor, Karl Presser, for the uncomplicated and pleasant collaboration. He has given me much freedom in how I wanted to approach the different goals of this master thesis and he was always ready to discuss any questions which arose during the work. Special thanks go to my girlfriend, Marina Spani, for her careful review and her helpful suggestions. Last but not least, I would like to thank my parents who always supported me throughout my studies.

