Research Collection
Master Thesis
Data quality analysis for food composition databases
Author(s): Mock, Reto
Publication Date: 2011
Permanent Link: https://doi.org/10.3929/ethz-a-006660133
Rights / License: In Copyright - Non-Commercial Use Permitted
This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.
ETH Library
Data Quality Analysis for Food Composition Databases
Master Thesis
Reto Mock <[email protected]>
Prof. Dr. Moira C. Norrie
Karl Presser

Global Information Systems Group
Institute of Information Systems
Department of Computer Science
23rd September 2011
Copyright © 2011 Global Information Systems Group.
Abstract
Data quality has been an issue ever since databases came into use. Despite the absence of a clear definition, it is widely agreed that data quality is of major importance, especially in scientific applications. Given this, it is rather surprising how little research has been done in the area of data quality.

We present a Data Quality Analysis Toolkit which is capable of measuring and visualising data quality. A variety of charts and tables support the user in judging in which areas urgent action is needed. Furthermore, the toolkit provides tools to drill down on the data quality issues and identify individual problems in the data.

In this master thesis, we apply our concept of data quality analysis to the FoodCASE1 database, which is the Swiss food composition database managed by the Swiss Food Information Resource (SwissFIR)2 of the ETH Zurich and the Federal Office of Public Health3.
1 http://www.foodcase.ethz.ch
2 http://www.swissfir.ethz.ch
3 http://www.bag.admin.ch
Contents
1 Introduction
  1.1 Food Composition and the FoodCASE Project
  1.2 Data Quality
  1.3 Our Contribution
  1.4 Thesis Outline

2 Background and Related Work
  2.1 A Bit of History
    2.1.1 The Beginning of Food Composition Tables
    2.1.2 The Swiss Food Composition Database
    2.1.3 COST Action 99
    2.1.4 European Food Information Resource (EuroFIR)
  2.2 The FoodCASE Project
    2.2.1 Introduction
    2.2.2 Business Concepts
    2.2.3 Architecture
  2.3 Outlook
  2.4 Data Quality
    2.4.1 Data Quality in Food Composition Databases
    2.4.2 USDA Quality Index and Confidence Code
    2.4.3 EuroFIR Quality Index

3 Data Quality Analysis Toolkit
  3.1 Analysing the Data Quality in the FoodCASE Database
    3.1.1 Scenario 1: Identifying Missing Data
    3.1.2 Scenario 2: Analysing Trends over Time
    3.1.3 Scenario 3: Grouping by User
  3.2 The Concepts Behind
    3.2.1 Data Quality Requirement
    3.2.2 Data Quality Analysis Tree Definition
    3.2.3 Data Quality Assessment
    3.2.4 Filtering and Grouping
    3.2.5 Long-term Analysis
    3.2.6 Shortcut from FoodCASE
    3.2.7 Percentage Scale vs. User-defined Scale
  3.3 Views
    3.3.1 Overview
    3.3.2 Data Quality Views
    3.3.3 Problem Views
    3.3.4 View Options

4 Administration Module
  4.1 FoodCASE Data Model
  4.2 Quality Entities
  4.3 Quality Entity Filters
  4.4 Quality Requirements
  4.5 Quality Assessments
  4.6 Maintenance

5 Implementation
  5.1 Project Structure
  5.2 Code Metrics
  5.3 Data Model
    5.3.1 Data Quality Tree
  5.4 Mapping to Physical Data Model
  5.5 Back-End
    5.5.1 Running a Quality Assessment
    5.5.2 File System Cache
  5.6 Front-End
    5.6.1 Runtime Data Structure
    5.6.2 Tree Calculation
    5.6.3 NanoGraph
    5.6.4 Tree Model
    5.6.5 Tree View and Tree Editor
    5.6.6 GUI Components
    5.6.7 Views
    5.6.8 Chart Highlighting and Breadcrumb Navigation
    5.6.9 User Settings
    5.6.10 Logging and Error Handling
  5.7 Admin Module

6 Testing
  6.1 Back-End Testing
  6.2 Front-End Testing
    6.2.1 Automated Testing
    6.2.2 Manual Testing

7 Extensions
  7.1 Data Quality Prevention
  7.2 Confidence Code for Aggregated Foods in FoodCASE
    7.2.1 Recipes

8 Conclusion
  8.1 Summary of Work
    8.1.1 Data Quality Analysis Toolkit
    8.1.2 Administration Module
    8.1.3 Extensions
  8.2 Future Work

A Terms and Abbreviations
B FoodCASE Quality Requirements
C Physical Database Schema

List of Figures
List of Tables
List of Listings
Acknowledgements
Bibliography
1 Introduction
1.1 Food Composition and the FoodCASE Project
A food composition database (FCDB) provides detailed information on the nutritional composition of foods. For a given food, it can list all components, such as proteins, vitamins and minerals contained in this food, together with their exact amounts. FCDBs are usually country-specific and are maintained by so-called food compilers. In Switzerland, the FoodCASE1 database is managed by the Swiss Food Information Resource (SwissFIR)2 of the ETH Zurich in collaboration with the Federal Office of Public Health3. On the European level, there was a five-year European Food Information Resource Network of Excellence (EuroFIR)4, funded by the European Commission's Research Directorate General under the "Food Quality and Safety Priority" of the Sixth Framework Programme for Research and Technological Development.
1.2 Data Quality
Data quality is not a precisely defined term. Many different aspects belong to data quality, including, for example, accuracy, completeness, relevancy, accessibility and interpretability. Wang et al. [wang94] identified 118 "data quality attributes". In the end, it is always the data consumer who has to decide whether the data are fit for a particular use.
1 http://www.foodcase.ethz.ch
2 http://www.swissfir.ethz.ch
3 http://www.bag.admin.ch
4 http://www.eurofir.net
1.3 Our Contribution
In this master thesis, we designed and implemented a Data Quality Analysis Toolkit, which allows data quality to be measured and visualised. For practical reasons, we restricted ourselves to data quality aspects which can be expressed as SQL5 statements. For example, given a database table which contains information about books, it is possible to check whether the ISBN6 is in the correct format and whether the check digit is valid. However, it would be out of the scope of the system to query an online service to verify that the ISBN really belongs to a certain book.

A variety of charts and tables support the user in judging in which areas urgent action is needed. Furthermore, the Data Quality Analysis Toolkit provides tools to drill down on the data quality issues and identify individual problems.
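To make the ISBN example concrete, the following is a minimal sketch of the ISBN-10 check-digit rule: the weighted sum of the ten digits, with weights 10 down to 1 and 'X' counting as 10 in the last position, must be divisible by 11. The class and method names are our own, not taken from the toolkit; a pure format check of this kind could equally be phrased as an SQL statement over the book table.

```java
// Hypothetical sketch: ISBN-10 validation as described in the text.
public class IsbnCheck {

    // An ISBN-10 is valid iff sum_{i=0..9} (10 - i) * digit_i is divisible
    // by 11, where 'X' stands for 10 and may only appear in the last position.
    static boolean isValidIsbn10(String isbn) {
        if (isbn == null || isbn.length() != 10) return false;
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            char c = isbn.charAt(i);
            int value;
            if (c >= '0' && c <= '9') value = c - '0';
            else if (c == 'X' && i == 9) value = 10;  // 'X' only as check digit
            else return false;                        // illegal character
            sum += (10 - i) * value;
        }
        return sum % 11 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidIsbn10("0306406152")); // true (valid check digit)
        System.out.println(isValidIsbn10("0306406153")); // false
    }
}
```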
1.4 Thesis Outline
The next chapter gives an overview of the theoretical background of data quality assurance related to food composition databases, where efforts have already been made by various organisations, including EuroFIR and the US Department of Agriculture. Chapter 3 introduces the Data Quality Analysis Toolkit implemented as part of this master thesis. We explain how the toolkit supports getting an overview of the quality of the food composition data and how problems in the data can be identified. Chapter 4 is aimed at system administrators and shows how to extend the Data Quality Analysis Toolkit with additional criteria to check. Chapter 5 is devoted to the implementation of the Data Quality Analysis Toolkit, and chapter 6 discusses its testing. Chapter 7 contains some extensions to the FoodCASE application which are not directly related to the Data Quality Analysis Toolkit, including an approach to data quality assurance at input time. Finally, chapter 8 concludes this master thesis with a summary of our work and an outlook.
5 Structured Query Language
6 International Standard Book Number
2 Background and Related Work
2.1 A Bit of History
2.1.1 The Beginning of Food Composition Tables
Back in 1789, Carl August Hoffmann published a table on the composition of 40 mineral waters from Germany [hoffm1789]. This document is likely the first food composition table. Another early food composition table, by Jacob Moleschott, appeared in 1859 in the second edition of his book "Physiologie der Nahrungsmittel - Ein Handbuch der Diätetik" [moles1859]. The first American food composition table, which contained 2600 food items, was created in 1896 by Atwater and Woods of the US Department of Agriculture (USDA) [atw1896]. In Switzerland, the first food composition table was compiled during the Second World War by the Federal Office for Wartime Nutrition and published in 1944 [fown44]. This table contained data on the amount of energy, carbohydrates, protein and fats in about 250 foods available in Switzerland at the time. Twenty years later, the second Swiss food composition table, by Hogl and Lauber, was issued in the Swiss Food Book [hoegl64]. Unfortunately, this table was never updated and after a while could no longer be used. The situation did not change until the early nineties, when the Federal Commission for Nutrition (FCN) recommended the creation of a new Swiss Food Composition Database.
2.1.2 The Swiss Food Composition Database
In 1997, the Federal Office of Public Health (FOPH), in collaboration with the ETH Zurich, launched a project to create a new Swiss Food Composition Database. As a result, the first version of the database was released in 2003 as a brochure with the title "Swiss Nutrient Value Table" as well as on CD-ROM [foph03]. In a project co-financed by the FOPH and the ETH Zurich, the database was updated in the period from October 2006 until May 2009. The current version 3.0.1 of the database contains 935 food items grouped into 13 food groups. For every food item, the nutritive data include carbohydrates, protein, fat, water, alcohol and energy. Additionally, most food items contain information about many other components as
well. On average, values for 30 different food components are provided for each food. Until recently, the database was managed using a Microsoft Access GUI1. Theresa Hodapp created a Web interface to the database as her master thesis in 2006. Since then, an online version of the database has been openly accessible to the public at http://www.swissfir.ethz.ch/datenbank/online_EN (Figure 2.1). The website also provides the possibility to order an E-Book or an offline CSV2 export.
2.1.3 COST Action 99
COST (Cooperation in Science and Technology) was a research programme by the European Union between 1994 and 1999. In the field of Food Science and Technology, COST was mainly concerned with improving food safety and food quality. 27 countries participated in this COST Action, including Switzerland with Florian Schlotke from the Computer Science Department of the ETH Zurich. One result of this international cooperation was a proposal for a structure for food composition databases [schlotke00]. This recommendation aimed at improving the quality of food composition data and at simplifying data interchange on the European level.
2.1.4 European Food Information Resource (EuroFIR)
As part of the Sixth Framework Programme for Research and Technological Development of the European Union from 2005 to 2009, the European Food Information Resource (EuroFIR)3 was established. EuroFIR continued the initial efforts of COST Action 99 and published a refined recommendation for food composition database management and data interchange [becker07], based on the former recommendation. Additionally, EuroFIR came up with a catalogue of 34 quality questions divided into 7 categories [salvini07]. From the answers to these questions, the EuroFIR Quality Index can be calculated. This index is a measure of the quality of a single food composition data item. Section 2.4.3 explains the EuroFIR Quality Index in more detail.
1 Graphical User Interface
2 Comma Separated Values
3 http://www.eurofir.org
Figure 2.1: Swiss Food Composition Database, Public Online Interface
2.2 The FoodCASE Project
2.2.1 Introduction
In 2007, the project FoodCASE (Food Composition And System Environment)4 was started by Karl Presser at the ETH Zurich as a research project. The goal of this project was to build a new Swiss Food Composition Database in cooperation with the SwissFIR5, following the EuroFIR standard. This database can now be used to analyse the definition and measurement of diverse quality aspects. The requirements specification of the new system [presser08] was completed in 2008, and in 2009 the implementation was started.
2.2.2 Business Concepts
The FoodCASE application contains six main business concepts:
• A Single Food is a food item which was analysed by a laboratory or in any other way. A single food could be, for example, a Gravensteiner apple.

• A Single Food Component is a component, such as fat, of a single food. These values are typically taken from a laboratory report and entered into the system.

• An Aggregated Food is a "generic" food such as the typical apple eaten in Switzerland.

• An Aggregated Food Component is a component of an aggregated food. The most important difference to a single food component is that the value of an aggregated food component is not measured but calculated (aggregated) as a weighted mean. The idea behind this is to publish just one fat value for apples although many cultivars are available. In order to get a representative value, the market shares of the total apple consumption in Switzerland have to be considered when specifying the weights of the measurements of the different cultivars.

• A Recipe is a special kind of aggregated food for which the exact amounts of the ingredients and the preparation method are known. With this information it is possible to calculate the values of the resulting food from the values of the ingredients. The preparation method has to be known because different yield factors have to be applied. For example, if apples are fermented, they will yield alcohol, but if they are put in the oven, they will dry out and lose most of the water in them.

• A Reference is a document from which information has been extracted. This could be an article in a journal or a website.
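The weighted-mean aggregation described for aggregated food components can be sketched as follows. The cultivar values and market shares are invented for illustration and do not come from the FoodCASE database.

```java
// Sketch of the weighted-mean aggregation: one published fat value per
// 100 g of "apple", weighted by (made-up) market shares of the cultivars.
public class AggregationSketch {

    // Weighted mean: sum(value_i * weight_i) / sum(weight_i).
    // The weights (market shares) need not sum to 1.
    static double weightedMean(double[] values, double[] weights) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < values.length; i++) {
            num += values[i] * weights[i];
            den += weights[i];
        }
        return num / den;
    }

    public static void main(String[] args) {
        // Fat per 100 g for three apple cultivars (invented values) ...
        double[] fatPer100g  = {0.20, 0.30, 0.25};
        // ... weighted by their (invented) shares of total apple consumption.
        double[] marketShare = {0.5, 0.3, 0.2};
        // Prints the single representative value, roughly 0.24 g/100 g.
        System.out.printf("%.2f%n", weightedMean(fatPer100g, marketShare));
    }
}
```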
The upper table in figure 2.2 shows the list of Single Foods in the FoodCASE database. Currently, there are 848 food items in the database. The table on the bottom lists all the Single Food Components of the selected single food. The selected row shows that a slice of gingerbread (German Lebkuchen) contains 13.75 g of fat per 100 g edible portion. The largest component, though, is carbohydrate with 61.57 g/100 g.
4 http://www.foodcase.ethz.ch
5 http://www.swissfir.ethz.ch
Figure 2.2: FoodCASE Application
2.2.3 Architecture
The FoodCASE application uses a Java Swing GUI and Java Web Start. The back-end is implemented using EJB36 session beans running on a JBoss7 4.2.3.GA application server. A PostgreSQL8 9.0 database serves as the persistent data storage. Figure 2.3 shows the high-level architecture of the FoodCASE system, which is composed of six components:
• A PostgreSQL Database which stores all the business data of the application.
• A JBoss Application Server which runs EJB3 session beans. They provide services which can be used by the client modules. The main responsibility of the EJB layer is to take care of persistence and make it transparent to the clients, so that they can work directly with the business objects. Most of the persistence logic is implemented in JPQL9. Some batch operations use plain SQL10 for the sake of better performance.
6 Enterprise Java Bean
7 http://www.jboss.org/jbossas
8 http://www.postgresql.org
9 Java Persistence Query Language
10 Structured Query Language
• A Content Management System (CMS) allowing the food compilers to manage the food composition data in the system. (Figure 2.2)

• An Administration Module for the system administrators. Among other settings, this module contains the user and the thesauri administration. The latter includes the units of measure that can be used in the system, the food components and a lot of different food and component categorisations. (Figure 4.3 on page 36)

• A Web Page which allows the public to query information about the composition of the foods available in Switzerland. (Figure 2.1)

• A Web Service to export a single food item or the whole database as a EuroFIR Food Data Transport Package (FDTP) V1.3. This is a standardised XML11 format for food composition data interchange in Europe and was defined by [moller08].
Figure 2.3: Architecture of the FoodCASE system. The six modules are highlighted in different colours.
2.3 Outlook
In 2006, the FOPH initiated the project NANUSS (NAtional NUtrition Survey Switzerland) with the aim of answering the question of what people in Switzerland eat. A first pilot study was launched in November 2008: 1500 men and women were interviewed by phone and asked what they had eaten and drunk in the last 24 hours. From the answers, a Swiss Food List was drawn up. The final report of this pilot study was published in September 2010. At present, the preparations for a large-scale study, which will take place in 2013, are in progress.
11 Extensible Markup Language
2.4 Data Quality
Systems and humans are not perfect, which can result in poor data. Because of this, data quality has been an issue since the first day of databases. Many databases even contain a surprisingly large number of errors. The consequences of poor data are manifold, but what they have in common is that they cause extra work, which is expensive. For this reason, having good data is also important from an economic point of view.

Given this, it is rather surprising how little research has been done in the area of data quality. One of the classics is a paper from 1994 by Richard Wang et al. with the title "Beyond Accuracy: What Data Quality Means to Data Consumers" [wang94]. Although there is no clear definition of data quality, Wang recognised that in the end it is always the data consumer who decides whether the data are fit for a particular purpose. Following this conclusion, Wang defined data quality as data that are fit for use by data consumers.

There are many data quality dimensions, or data quality attributes as Wang called them. They include accuracy (closeness to the true value), precision (reproducibility), timeliness, reliability, currency, completeness, relevancy, accessibility, interpretability and many more.

Wang et al. conducted a two-stage survey. In the first stage, 25 data consumers working in industry and 112 MBA12 students from a U.S. university were asked to list all terms that come to mind when thinking about data quality. This resulted in a list of 118 possible data quality attributes. In the second stage, 1500 randomly selected MBA alumni were asked to assess the importance of those attributes. The result of this work was a conceptual framework of data quality.
[Figure 2.4 depicts the framework as a tree: Data Quality branches into Intrinsic Data Quality (believability, accuracy, objectivity, reputation), Contextual Data Quality (value-added, relevancy, timeliness, completeness, appropriate amount of data), Representational Data Quality (interpretability, ease of understanding, representational consistency, concise representation) and Accessibility Data Quality (accessibility, access security).]

Figure 2.4: A Conceptual Framework of Data Quality [wang94]
In this framework, the 15 most important attributes were divided into 4 groups:
• The Intrinsic Data Quality denotes the quality the data have in their own right.
• The Contextual Data Quality refers to the quality the data have within the task at hand.

• The Representational Data Quality highlights the importance of a concise and consistent representation.

• Finally, the Accessibility Data Quality underlines that data lose their value if they are not easily accessible.
12 Master of Business Administration
2.4.1 Data Quality in Food Composition Databases
In science in general, but in food science in particular, data quality is of major importance. It is widely agreed that food composition data are useless if their origin is unknown. Yet decision making by governments or individuals is only possible if the data are reliable and trusted.

Often data from different sources are available. In this case, a means is required to judge which data are of the best quality, or how the individual values should be combined to get more reliable data.
2.4.2 USDA Quality Index and Confidence Code
The initial data quality evaluation procedures developed by the USDA were manual processes to assess the quality of analytical data for iron, selenium and carotenoids in foods. In the course of a redesign of the software system used at the USDA, these procedures were taken a step further and a generic system was developed [holden01].

The five original evaluation categories Sampling Plan, Number of Samples, Sample Handling, Analytical Method and Analytical Quality Control were maintained, but the quality assessment questions were made more objective. According to the answers to these questions, a single numeric Quality Index (QI) is assigned to a nutritional component.

At aggregation, a Confidence Code (CC) is assigned to the combined value, which is calculated as the weighted mean of the individual values from the different sources of data. The CC is derived from the QIs by summing up the adjusted ratings of the individual values.
2.4.3 EuroFIR Quality Index
EuroFIR developed a Quality Index [salvini07] similar to the USDA QI. In fact, they adopted the five categories from the USDA QI and added Food Description and Component Identification. For each of the seven categories, a set of questions is defined. There are 34 questions in total, all but one of which can be answered with Yes, No or Not applicable. The following excerpt from the questions catalogue exemplifies what kind of questions have to be answered:
• Food Description
– Is the food group known?
– Was the food source of the food or the main ingredient provided?
– 15 more questions
• Sampling Plan
– Was the number of primary samples > 9?
– If relevant, were samples taken during more than one season of the year?
– 4 more questions
• 5 more categories
CHAPTER 2. BACKGROUND AND RELATED WORK 11
To each category, a score in the range from 1 (worst) to 5 (best) is assigned. The sum of the individual category scores is the EuroFIR Quality Index, which obviously must lie between 7 and 35 points. The standard formula to calculate the score of a category is

    (number of criteria answered positively × 5) / (total number of criteria judged relevant)    (2.1)
Special cases are Component Identification and Sample Handling, for which the minimum of the answers is taken. Number of Analytical Samples gets a score equal to the number of analytical samples used, but at most 5, and for Analytical Quality Control a more complicated special rule is defined.
The result of equation 2.1 is a number in the range from 0 to 5. However, the minimum per category is defined as 1. In this sense, the specification is contradictory. In FoodCASE, this problem is mitigated by using a modified formula:

    max( (number of criteria answered positively × 5) / (total number of criteria judged relevant), 1 )    (2.2)
The Data Quality Analysis Toolkit, however, is intended to be a generic tool that should be easily separable from FoodCASE, so that it can be integrated into another system. For this reason, we decided not to "pollute" the code with any special logic that is only usable in this specific context. Instead, the toolkit provides a feature which we refer to as a user-defined scale. Details about this feature can be found in section 3.2.7. Basically, it allows the user to specify a linear transformation of the data quality scores, which the toolkit internally maintains as percentage values. In essence, the Data Quality Analysis Toolkit calculates the values of the categories of the EuroFIR QI as

    (number of criteria answered positively × 4) / (total number of criteria judged relevant) + 1    (2.3)
which, from our point of view, would have been the natural way to define it in the first place. Since it is out of the scope of this master thesis to initiate a proposal to change the definition of the formula in the EuroFIR specification, the EuroFIR QIs displayed in the toolkit deviate slightly from the values calculated by FoodCASE.
Furthermore, the specification states that the score of each category should be rounded to the nearest integer. A disadvantage of this rule is that, by coincidence, the scores for all seven categories of one data record could be rounded down, while for another data record that is only slightly better in every category, the scores are all rounded up. This way, the difference between the two data records might appear much bigger than it is in reality. For this reason, the rounding rule is implemented neither in FoodCASE nor in the Data Quality Analysis Toolkit.
Finally, there is a special rule for the category Sampling Plan. Here, the specification says that some criteria may have more weight than others, depending on the food in question. We suspect that food compilers would not always agree on this matter. Therefore, FoodCASE does not allow the weights to be adjusted, which we think is in the interest of a more objective quality attribution.
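As an illustration, the category scoring described above can be sketched in a few lines of Python (our own sketch, not code from FoodCASE or EuroFIR; the function names are ours):

```python
def category_score(answers):
    """Score one EuroFIR category on the 1-5 scale, following equation 2.3.

    `answers` holds one entry per criterion: "yes", "no" or None
    (not applicable). The fraction of positively answered criteria
    among the relevant ones is mapped linearly onto [1, 5].
    """
    relevant = [a for a in answers if a is not None]
    if not relevant:
        return None  # the whole category is not applicable
    positive = sum(1 for a in relevant if a == "yes")
    return positive * 4 / len(relevant) + 1


def eurofir_qi(category_scores):
    """Sum of the seven category scores; by construction in [7, 35]."""
    return sum(category_scores)
```

For example, a category with two positive answers, one negative answer and one irrelevant criterion scores 2 × 4/3 + 1 ≈ 3.67, and seven categories answered entirely negatively give the minimum EuroFIR QI of 7.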
3 Data Quality Analysis Toolkit
3.1 Analysing the Data Quality in the FoodCASE Database
In this chapter, we show how the Data Quality Analysis Toolkit can be used to assess the quality of the food composition data, and how problems can be identified. First, we provide a look at the toolkit from a user's perspective, by means of three concrete scenarios. After that, we give a more conceptual description of the toolkit in section 3.2.
3.1.1 Scenario 1: Identifying Missing Data
Scenario: A food compiler is responsible for a national food composition database, which is accessible by the public via an online interface. On their website, they have declared that for every food the component energy is provided. Now the compiler gets an e-mail from a user, claiming that for some foods the component energy is missing. He wants to know whether this is true, and if so, which foods are affected by this problem.
Task at hand: Find all foods for which the component energy is missing.
Procedure:
1. Log in to the FoodCASE application and start the Data Quality Analysis Toolkit (Tools→ Data Quality Analysis or Ctrl-D).
2. On the start screen, select the data quality analysis tree definition of interest. In this case, select Aggregated Food Data Quality as depicted in figure 3.1. All the other settings can be skipped for the moment; simply proceed with the default values.
3. Click on the Go button. All the relevant data will now be fetched from the database and the presentation of the data quality will be prepared. Upon completion, the data quality tree will appear as shown in figure 3.2. All nodes are rendered as progress bars, indicating the data quality of the criteria they represent.
Figure 3.1: Start Screen of the Data Quality Analysis Toolkit
4. (Optional: Rotate the tree to save space on the screen by enabling the Rotate view option. When the tree is rotated, its root is no longer on the top, but on the right-hand side.)
5. Select the node Mandatory Components and uncollapse it using the context menu or by pressing U on the keyboard. Now all the children of the node Mandatory Components will become visible.
6. Select the node For every food, energy must be provided and switch to the Problem Table view using the context menu or by clicking on the Table button in the Problem Views panel on the top of the window.
7. In figure 3.3, it can be seen that four aggregated foods for which the component energy is missing have been identified. By double-clicking on a row, the data record is opened and the problem can be fixed.
Figure 3.2: Data Quality Tree: Aggregated Foods
Figure 3.3: Problem Table: Aggregated Foods with missing Component Energy
3.1.2 Scenario 2: Analysing Trends over Time
Scenario: Last year, the EuroFIR Quality Index was introduced in a food composition database, and the staff was instructed to always answer the quality questions whenever they enter data into the system. Now the manager wants to verify that people really follow his directions, and he wants to get an overview of how good the data in their system are in terms of the EuroFIR QI.
Task at hand: Check whether the EuroFIR QI of new data is better than that of old data.
Procedure:
1. Log in to the FoodCASE application and start the Data Quality Analysis Toolkit.
2. On the start screen, select the tree definition Single Component Data Quality.
3. Specify the grouping criteria as Group by Year of Mutation.
4. Select the presentation style Data Quality Tree.
5. Click on the Go button.
6. Right-click on the tree node EuroFIR Quality Index and select Data Quality Views→Line Chart in the context menu or press 4.
7. The chart shown in figure 3.4 appears. The green line for the year 2011 shows that only one data record has a EuroFIR QI of 7, which means that for all other records the quality questions have been answered. Remember that the EuroFIR QI ranges from 7 at worst to 35 at best. When hovering with the mouse over the line, a tool tip appears which provides statistical information about the data series. For example, it can be seen that the maximum EuroFIR QI achieved in 2011 is only 16.34, which is not that good. In 2010 (blue line), for more than 95% of the data records, the quality questions have not been answered. Nevertheless, there are a few data records with a really good QI ranging up to 30 points.
3.1.3 Scenario 3: Grouping by User
Scenario: A food composition database is used by different data providers who directly enter their measurement results into the system. To increase the motivation to enter good data into the system, the best data producers should be rewarded.
Task at hand: Determine the user who enters the best data into the system.
Procedure:
1. Log in to the FoodCASE application and start the Data Quality Analysis Toolkit.
2. On the start screen, select the tree definition Single Component Data Quality.
3. Specify the grouping criteria as Group by Mutation User.
4. Select the presentation style Data Quality Bar Chart.
5. Click on the Go button.
6. A bar chart will appear as shown in figure 3.5. It is obvious that the user represented by the green bar enters the best data. The grey user follows in second place.
Figure 3.4: Data Quality Line Chart: EuroFIR Quality Index
Figure 3.5: Data Quality Bar Chart: Group by Mutation User
3.2 The Concepts Behind
The idea behind the Data Quality Analysis Toolkit is based on the concept of a Requirement-oriented Data Quality Model (RODQ) developed by Karl Presser [presser11]. A RODQ model is a conceptual model that describes the data quality requirements of an information system independently of its implementation. In this, it complements other established modelling languages like ERD1 or UML2. Furthermore, the RODQ model describes how to assess the data quality.
3.2.1 Data Quality Requirement
The central elements of a RODQ model are the Data Quality Requirements. Presser distinguishes between three types of data quality requirements:
• A Hard Constraint is a requirement which absolutely must be fulfilled. If a hard constraint is violated, the data are invalid and cannot be used. Therefore, the system should enforce hard constraints and not allow any data to be stored unless all hard constraints are satisfied.
• A Soft Constraint is a requirement which is highly recommended to be fulfilled. If a soft constraint is not satisfied, data quality decreases. However, it might not always be possible to adhere to all soft constraints. Hence, they cannot be enforced by the system.
• An Indicator is weaker than a soft constraint, but it is assumed that it can be used as an indication of data quality. A typical example is the age of the data. If the data are old, they are likely to be outdated, but it is still possible that they are correct.
In the Data Quality Analysis Toolkit, it is possible to specify the type of every requirement, and it will be rendered accordingly (compare with figure 3.6). The user should take the requirement type into account when specifying the weights (importance) of the requirements, as we explain in the next section. But apart from the way the requirement is displayed, the requirement type has no further influence.
According to Presser, a quality requirement is always associated with a data quality object, where a data quality object corresponds to a real-world object on which the data quality check should be performed. In the Data Quality Analysis Toolkit, the data quality objects are the database entities corresponding to the business concepts mentioned in section 2.2.2: Single Food, Single Component, Aggregated Food, Aggregated Component, Recipe and Reference. In the following, we will refer to these six entities as the Data Quality Entities.
Each quality requirement contains an SQL statement to be executed when the data quality is assessed. This SQL statement has to map every data record in the quality entity to a data quality value between 0.0 (worst) and 1.0 (best). If a requirement is not applicable to a certain data record, NULL may be returned. For example, if a database table contains data about customers, a data quality requirement could be defined as "For every person, name and first name must be provided". Now if there is a customer for whom only the name but not the first name is known, the quality requirement is only partially fulfilled. So the data quality could be defined to be 0.7 (70%) because usually the name is more important than the first

1 Entity Relationship Model
2 Unified Modelling Language
name. If a customer is not a person but a company, the quality requirement is not applicable, so NULL should be returned. An example of what such an assessment SQL statement could look like can be found in figure 4.3 on page 36. More about this topic will follow in section 4.4.
3.2.2 Data Quality Analysis Tree Definition
Once all data quality requirements are gathered, similar quality requirements can be grouped together. This process can be repeated recursively until one ends up with a data quality tree. The root node then represents a single number for the overall data quality. Since the grouping of the quality requirements may be a matter of individual taste, it is possible to define different data quality analysis trees using the same quality requirements. Similarly, the importance of the quality requirements may be controversial. Because of this, the weight of each quality requirement can be specified in each tree definition independently.
Side note: Although highly discouraged, the model supports using a quality requirement in more than one aggregation node. Strictly speaking, it is no longer a tree in this case, but a directed acyclic graph. However, for the sake of simplicity and better understanding, we continue to refer to it as a tree.
Figure 3.6: A very simple Tree Definition showing the different requirement types: hard constraint, soft constraint and indicator. The aggregation node on level 2 uses aggregation type mean. The root node uses weighted mean with the weights 3.0 and 1.0.
The Aggregation Type of an aggregation node determines how the data quality value of the aggregation node is computed from its direct children. The Data Quality Analysis Toolkit provides six different aggregation types:
• Mean
• Median
• Minimum
• Maximum
• Geometric Mean
• Weighted Mean
The first five aggregation types need no further configuration. In the case of Weighted Mean, a weight has to be assigned to every in-edge. By default, every in-edge is given a weight of 1.0.
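The six aggregation types amount to standard statistics; a minimal sketch (our own illustration, not the toolkit's actual implementation) could look like this:

```python
import math
import statistics


def aggregate(kind, values, weights=None):
    """Combine the children's quality values (each in [0, 1]) into one.

    `weights` is only used for 'weighted_mean'; by default every
    in-edge carries weight 1.0, as in the toolkit.
    """
    if kind == "mean":
        return statistics.mean(values)
    if kind == "median":
        return statistics.median(values)
    if kind == "minimum":
        return min(values)
    if kind == "maximum":
        return max(values)
    if kind == "geometric_mean":
        return math.prod(values) ** (1 / len(values))
    if kind == "weighted_mean":
        weights = weights or [1.0] * len(values)
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)
    raise ValueError(f"unknown aggregation type: {kind}")
```

For instance, with the root weights 3.0 and 1.0 from figure 3.6, children of quality 0.9 and 0.5 aggregate to (3.0 × 0.9 + 1.0 × 0.5) / 4.0 = 0.8.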
Figure 3.7: The Tree Editor, which allows users to create their own tree definitions. It provides three ways to modify a tree definition: using the context menu, using the keyboard shortcuts, or simply by dragging a new node from the tree on the left-hand side into the graph area and dropping it. If a tree node or edge is selected, a table will appear on the right-hand side showing the properties of the selected object. Those properties printed in bold face can be modified by the user. Instead of always starting from scratch, it is also possible to copy a tree definition from someone else and modify it.
3.2.3 Data Quality Assessment
A Data Quality Assessment is the point in time when all the quality requirements are checked and the results are stored in the database for later analysis. This means that a snapshot of the current data quality in the system is taken. The Data Quality Analysis Toolkit can only be used if a quality assessment has been run previously. Hence, a quality assessment has to be triggered from time to time so that the data quality values calculated by the toolkit are up to date. This can be done either
• manually by the system administrator or
• automatically by a timer.
The timer interval can be configured by the system administrator as we explain in section 4.5.
3.2.4 Filtering and Grouping
As already seen in the scenarios, it is possible to filter or group by a variety of properties. For all six data quality entities, the following filters are provided:
• Filter by Id (only consider a single data record)
• Filter/group by creation user
• Filter/group by year of creation
• Filter/group by mutation user
• Filter/group by year of mutation
Additionally, every data quality entity has its specific filters. For example, the aggregated foods can be filtered by the version of the database.
3.2.5 Long-term Analysis
Often one is interested in analysing how the data quality in the system has evolved over time. To enable this, the Data Quality Analysis Toolkit provides different possible approaches. As mentioned in the previous section, any of the six data quality entities can be grouped by year of creation or year of mutation. Single food components can additionally be grouped by year of compilation, year of generation and year of evaluation.
When a new version of the database is published (only aggregated foods and recipes), a snapshot of the relevant database tables is taken and all the data in them are copied. This enables us to group aggregated foods, aggregated food components, recipes and references by version of the database, giving us another kind of long-term analysis.
3.2.6 Shortcut from FoodCASE
In order to allow a data record to be quickly opened in the Data Quality Analysis Toolkit, a shortcut button is provided on every detail window of the FoodCASE application (compare with arrow 3 in figure 7.1 on page 64). When clicking on it, the toolkit opens and Filter by Id is preselected on the start screen. In this case, it is possible to calculate the data quality values online, rather than taking the historical values from a previous quality assessment.
3.2.7 Percentage Scale vs. User-defined Scale
By default, all the data quality values in the Data Quality Analysis Toolkit are displayed as percentages, as seen in figure 3.2 and figure 3.5. In certain situations, though, it is desirable to use a user-defined scale. This might be the case if, for historical reasons, your institute uses a scale other than percentage to measure data quality. A practical example is the EuroFIR Quality Index where, by definition, each of the seven categories is assigned a value between 1 and 5. The total of the EuroFIR QI is in the range from 7 to 35. To allow this, it is possible to define a linear transformation of the data quality value into a user-defined scale. The only thing which has to be specified on the corresponding tree node in the tree definition is the minimum and maximum value of the preferred scale (see the table on the right-hand side in figure 3.7). The data quality will then be calculated as
    DQ_user = DQ_% × (max_user − min_user) + min_user    (3.1)

and is displayed in the format

    DQ_user [min_user, max_user]    (3.2)

(Compare with the range axis in figure 3.4.) It is possible to switch between the percentage scale and the user-defined scale using the radio buttons on the right-hand side of the view. If the user-scale minimum is not defined, its default is 0; if the maximum is not defined, the sum of the user-scale maxima of the children of the tree node in question is taken.
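Equation 3.1, together with the default rules for undefined bounds, can be sketched as follows (our own illustration; the parameter names are ours):

```python
def to_user_scale(dq_percent, min_user=0.0, max_user=None, children_max=()):
    """Map an internal quality value (0.0-1.0) onto a user-defined scale.

    Defaults mirror the rules above: an undefined minimum is 0, and an
    undefined maximum falls back to the sum of the children's maxima.
    """
    if max_user is None:
        max_user = sum(children_max)
    return dq_percent * (max_user - min_user) + min_user
```

For the EuroFIR QI root node with minimum 7 and maximum 35, a perfect internal value of 1.0 maps to 35 and the worst value of 0.0 maps to 7; with an undefined maximum and seven children of maximum 5 each, the fallback maximum is 35.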
3.3 Views
3.3.1 Overview
The Data Quality Analysis Toolkit provides a total of 11 different views which can be divided into two categories:

• The Data Quality Views visualise the data quality of the selected tree node and its direct children. The only exception is the tree view, which displays not only the selected node but the entire data quality tree. The purpose of the data quality views is to give an overview of how good or bad the data are.

• The Problem Views, on the other hand, provide the possibility to drill down into the data quality issues and identify individual problems.
The following table lists all the available views and their options.
                                        |       Data Quality        | Problems
Views and View Options                  | BC  BP  HI  LC  SC  TR  TT| BC  PC  SC  PT
----------------------------------------|---------------------------|---------------
Auto Scale                              |  x   x   x   x   x        |  x       x
Draw Swim Lanes                         |                      x    |
Follow Selection                        |                      x    |
Percentage Scale/ User Scale            |  x   x   x   x   x   x   x|  x   x   x   x
Plot Orientation (Vertical/ Horizontal) |  x   x   x   x            |  x
Render as Progress Bar                  |                          x|
Rotate                                  |                      x    |
Show Data Points                        |              x            |
Show Parent/ Show Children              |  x   x   x   x   x       x|  x       x
Show Range/Domain Axis as Percentage    |          x   x            |  x       x
Show Standard Deviation                 |  x                        |
Threshold Slider                        |                           |  x   x   x   x
Use Grouping                            |  x   x   x   x   x       x|  x   x   x
Use Legend                              |  x                        |  x
Number of Options                       |  7   5   6   7   4   4   4|  8   3   6   2

(BC = Bar Chart, BP = Box Plot, HI = Histogram, LC = Line Chart, SC = Spider Chart, TR = Tree, TT = Tree Table, PC = Pie Chart, PT = Problem Table)
Table 3.1: Views and View Options
In the next three sections, every view and the view options are described.
3.3.2 Data Quality Views
Data Quality Bar Chart
The data quality bar chart view shows the mean data quality of the selected tree node and its contributing direct children. Optionally, the standard deviation can be displayed by means of error bars. Refer to figure 3.5 on page 17 for an example.
Data Quality Box Plot
The data quality box plot view graphically depicts the seven-number summary of the selected tree node and its children, as explained in the figure below.
Figure 3.8: A Box Plot visualises the seven-number summary: The red box represents the middle 50% (Q1-Q3) of the data. The black line is the median; the black circle is the mean value. The whisker includes 90% of the data (5%-95%). Minimum and maximum are displayed as red circles. Here, the plot orientation is horizontal. However, it is more usual to use the vertical orientation for box plots.
Data Quality Histogram
The data quality histogram view plots the data quality distribution of the selected tree node.
Data Quality Line Chart
The data quality line chart plots each data record ordered by increasing data quality. This shows the distribution of the data quality. If the Show Data Points option is enabled, it is possible to click on a data point and the detail window of this record will be opened. The line chart also supports zooming in by clicking the left mouse button and dragging. See figure 3.4 on page 17 for an example.
Data Quality Tree
The data quality tree view shows the entire data quality tree. Each node is labelled with the mean data quality. Nodes (and edges) can be selected by clicking on them. A table will appear which shows the statistical properties of the selected node. In order to switch to another view, the buttons on top or the context menu (right click) can be used. The lowest level can be collapsed/uncollapsed using the context menu or the keyboard shortcuts. An example can be found in figure 3.2 on page 15.
Data Quality Tree Table
The data quality tree table view shows the statistical properties of the selected tree node and its children in tabular form. Tree nodes can be expanded further until arriving at the leaves of the tree. Available columns are the node name, node type, node Id, level, index and the statistical properties: mean, standard deviation, maximum, 5% percentile, 25% percentile, median, 75% percentile, 95% percentile and the minimum. Columns can be added or removed by right-clicking on the header row. See figure 3.10 for an example.
Data Quality Spider Chart
The data quality spider chart shows the profile of the selected tree node. The further away the data points are from the centre, the better. The outermost point on the spokes (black lines) corresponds to a data quality of 100% (if the Auto Scale view option is disabled). When moving the mouse onto a data point, a tool tip appears showing the statistical details of this point.
Figure 3.9: Example of a Data Quality Spider Chart
3.3.3 Problem Views
All of the following four problem views feature a slider with which the data quality threshold can be specified. Additionally, it can be chosen whether data records with a data quality lower than the threshold, or records with a quality equal to or higher than the threshold, should be considered. This means that it is not only possible to identify data records with problems, but also to find data records of particularly good data quality. Furthermore, all charts support tool tips; that is, when pointing with the mouse at a data series, its statistical properties will be listed.
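The threshold logic shared by the problem views can be sketched like this (our own illustration; the record structure is hypothetical):

```python
def select_records(records, threshold=0.5, show_bad=True):
    """Split records on the quality threshold, as the problem views do.

    With `show_bad` (the default), records whose quality lies strictly
    below the threshold are returned; otherwise, records with a quality
    equal to or above it, i.e. the particularly good data.
    """
    if show_bad:
        return [r for r in records if r["quality"] < threshold]
    return [r for r in records if r["quality"] >= threshold]
```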
Problem Bar Chart
The problem bar chart view illustrates the number of problems for the selected tree node and its children. The range axis can either display the absolute number of problems, or it can be expressed as a percentage (number of problems / total number of data records).
Problem Pie Chart
The problem pie chart view shows the relative amount of problems for the children of the selected tree node.
Problem Spider Chart
The problem spider chart compares the relative amount of problems among the children of the selected tree node.
Problem Table
The problem table view lists all data records for the selected tree node which have a quality less than the specified threshold. An example can be found in figure 3.3 on page 15.
3.3.4 View Options
The following list explains all view options. They have either global or per-view scope. Global view options, such as the threshold slider, are adjusted on all views simultaneously, while other view options can be specified on every view independently. The view options are rendered on the right-hand side of the view and are stored per user. This means that the preferences will be restored the next time the same user opens the Data Quality Analysis Toolkit.
Use Grouping
Scope: per view
Default: true if a grouping criterion has been selected on the start screen; N/A otherwise
Available for: all views except Quality Tree and Problem Table
Description: The Use Grouping option is only available if a grouping criterion has been specified on the start screen. If so, Use Grouping controls whether

• the total and the value per group are displayed or,

• if grouping is disabled, the value of the selected tree node and its direct children is shown.
Show Parent/ Show Children
Scope: per view
Default: depends on the view
Available for: all views except Quality Tree, Problem Table and Problem Pie Chart
Description: The semantics of Show Parent and Show Children depends on whether grouping is enabled or not.

• If grouping is disabled:

– Show Parent toggles whether the selected tree node is displayed in the chart.

– Show Children toggles whether the direct children of the selected tree node are displayed in the chart.

• If grouping is enabled:

– Show Parent toggles whether the total is displayed in the chart.

– Show Children toggles whether the different groups are displayed in the chart.
Percentage Scale/ User Scale
Scope: global
Default: percentage scale
Available for: all views
Description: Controls whether the percentage scale or the user-defined scale is used. Please refer to section 3.2.7 for a detailed description of these two options.
Threshold Slider
Scope: global
Default: 50%, lower than (show bad data)
Available for: all problem views
Description: The Threshold Slider allows the user to define how much data quality is required for a data record to be considered sufficient. In other words, every data record with a data quality lower than the slider value is considered deficient. Below the slider, there are two radio buttons to decide whether the bad data or the good data should be shown. Compare with figure 3.3 on page 15.
Plot Orientation (vertical or horizontal)
Scope: global
Default: vertical
Available for: Quality Bar Chart, Quality Box Plot, Quality Histogram, Quality Line Chart and Problem Bar Chart
Description: Using this option, the preferred plot orientation can be changed from vertical to horizontal, meaning that the chart is rotated by 90 degrees.
Draw Swim Lanes
Scope: per view
Default: true
Available for: Quality Tree
Description: If enabled, swim lanes are drawn in the background of the data quality tree. This makes it easier to recognise on which level a tree node resides. Levels are numbered from bottom to top (leaf to root), starting from 1.
Follow Selection
Scope: per view
Default: true
Available for: Quality Tree
Description: If enabled, the selected node is automatically centred on the screen. Since the animations are smooth, this is a simple way to navigate the tree if it is too large to fit on the screen. The tree can also be rotated to save space, or zoomed out.
Rotate
Scope: per view
Default: Default (not rotated)
Available for: Quality Tree
Description: In the Data Quality Analysis Toolkit, the data quality tree can be rotated by 90 degrees clockwise, so that the root is on the right-hand side. This has proven to save a lot of screen space.
Auto Scale
Scope: per view
Default: false
Available for: Quality Bar Chart, Quality Box Plot, Quality Histogram, Quality Line Chart, Quality Spider Chart, Problem Bar Chart and Problem Spider Chart
Description: If enabled, the range axis is scaled automatically. The lower bound is always 0, but the upper bound is adjusted to the maximum value of the data series.
Use Legend
Scope: per view
Default: true
Available for: Quality Bar Chart and Problem Bar Chart
Description: If enabled, the data series are rendered in different colours and a legend identifies the series; if disabled, all series are coloured the same and the series names are rendered on the domain axis.
Render as Progress Bar
Scope: per view
Default: true
Available for: Quality Tree Table
Description: If enabled, the statistical properties of the selected node (minimum, mean, etc.) are rendered as progress bars; if disabled, they are rendered textually as percentage values.
Figure 3.10: Tree Table with the Render as Progress Bar view option enabled
Show Range/Domain Axis as Percentage
Scope: per view
Default: true
Available for: Quality Histogram, Quality Line Chart, Problem Bar Chart and Problem Spider Chart
Description: If enabled, the range/domain axis is labelled with percentages; if disabled, the absolute number of data records is used.
Show Data Points
Scope: per view
Default: false
Available for: Quality Line Chart
Description: If enabled, every single data point is rendered on the chart instead of only the line. Clicking on a data point opens the detail window of the corresponding data record. This feature is well suited to identifying outliers.
Show Standard Deviation
Scope: per view
Default: false
Available for: Quality Bar Chart
Description: If enabled, an error bar indicating the standard deviation is rendered on top of the bar.
Figure 3.11: Bar chart with error bars showing the mean value +/- standard deviation.
4 Administration Module
Chapter 4 describes the part of the FoodCASE administration module which is related to the Data Quality Analysis Toolkit. It is primarily of interest to system administrators who want to configure the toolkit. Most importantly, this includes defining the quality entities and their data quality requirements. Furthermore, it is possible to specify the filter and grouping criteria which will be available on the start screen of the Data Quality Analysis Toolkit. Quality assessments can be run manually, or a timer can be set up. Lastly, a number of maintenance operations are provided to system administrators.

The admin module is a standalone Java Web Start application, separate from the normal FoodCASE application. It can only be accessed by users who have the “admin” role.
4.1 FoodCASE Data Model
Figure 4.1 shows the data model of the FoodCASE application. The boxes correspond to database tables and the arrows represent foreign-key relationships. The data model has three “levels”: Single Food, Aggregated Food and Recipe. Around those three areas, there are many database tables which contain rather static metadata.

In order to determine for which tables the data quality should be checked, we coloured each table according to where the data originate from. The colours are explained in the text following the close-up look at the Single Food table in figure 4.2.
Figure 4.1: Data Model of the FoodCASE Application
Figure 4.2: Data Model of the FoodCASE Application: Close-Up Look at the Single Food Table (marked in red in the previous figure).
• yellow: the six data quality entities on which the data quality checks have to be performed.

• purple: tables which directly depend on the quality entities.

• light green: the thesauri. A thesaurus is usually defined by some standards body and imported by the system administrator. In some cases it would be possible to define data quality requirements. However, since only the defining standards body may modify these data, we abstained from defining any rules.

• orange: imported data. Here it would be possible to define quality requirements as well. However, these data are only imported into FoodCASE and not maintained within the system. For this reason we did not define any rules at the present stage.

• grey: tables which are not used yet/anymore. Hence, no data quality requirements are needed at the moment.

• brown: tables which cannot be administrated from the GUI. Most of them contain legacy data.

• turquoise: tables for which no validation rules have been defined yet. At a later stage it might make sense to have a closer look at these tables.

• dark green: tables which contain technical data and need no validation.
4.2 Quality Entities
A Quality Entity is a first-class database entity (a database table) on which a set of data quality checks (quality requirements) should be performed. Typically, a quality entity corresponds to a physical object such as a Single Food. In the FoodCASE application, there are six quality entities, namely the business concepts described in section 2.2.2: Single Food, Single Component, Aggregated Food, Aggregated Component, Recipe and Reference.
A quality entity is defined by five properties:
Display Name: A string which is used for display purposes. Example: Single Food
Entity Name: The physical name of the database table. Example: tblsinglefood
Alias: A short alias for the database table. Example: f
Key column: The name of the primary key column. Example: f.idsinglefood
Condition: An optional condition which characterises the “active” data records. Example: f.singlefoodhidden = false
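These five properties can be captured in a small value class (an illustrative sketch; the toolkit's own classes are not shown in the text, so the names here are assumptions):

```java
// Sketch of a quality entity definition. Field names follow the five
// properties described above; the class itself is hypothetical.
public class QualityEntity {
    public final String displayName; // e.g. "Single Food"
    public final String entityName;  // physical table, e.g. "tblsinglefood"
    public final String alias;       // short table alias, e.g. "f"
    public final String keyColumn;   // primary key, e.g. "f.idsinglefood"
    public final String condition;   // optional "active records" condition, may be null

    public QualityEntity(String displayName, String entityName, String alias,
                         String keyColumn, String condition) {
        this.displayName = displayName;
        this.entityName = entityName;
        this.alias = alias;
        this.keyColumn = keyColumn;
        this.condition = condition;
    }
}
```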
4.3 Quality Entity Filters
On each quality entity, filter/grouping criteria can be defined. Two cases have to be distinguished:
• The label of the grouping criterion is contained in the same database table. For example, the creation date of the data record.

• The quality entity contains a foreign key, and another database table has to be joined in order to retrieve the label of the grouping criterion. For example, this is the case for the language of a reference, since the languages are stored in a separate table.
A quality entity filter is defined by seven properties:

Entity Id: The Id of the quality entity to which the filter belongs. Example: 15
Filter Name: The display name of the filter. Example: Reference Language
Filter Column: The column to filter on. Example: referenceidlanguage
Join Table: The name of the database table which has to be joined to retrieve the filter label, or NULL if no table has to be joined. Example: tbllanguage
Join Column: The primary key column of the other table. Example: idlanguage
Name Column: The column which contains the filter label. Example: languagename
Apply Condition: true if the condition specified in the quality entity should be applied; false otherwise.
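A hedged sketch of how these properties might translate into the SQL that retrieves the filter labels (the helper and the exact query shape are assumptions; the Apply Condition flag is omitted for brevity):

```java
// Illustrative sketch only: build the SELECT that retrieves distinct
// filter labels from a quality entity filter definition. The method and
// the query shape are hypothetical, not the toolkit's actual code.
public class FilterSql {
    public static String labelQuery(String entityTable, String alias,
                                    String filterColumn, String joinTable,
                                    String joinColumn, String nameColumn) {
        if (joinTable == null) {
            // Case 1: the label lives in the quality entity's own table.
            return "SELECT DISTINCT " + alias + "." + filterColumn
                 + " FROM " + entityTable + " " + alias;
        }
        // Case 2: join another table to resolve the label
        // (e.g. the language of a reference).
        return "SELECT DISTINCT j." + nameColumn
             + " FROM " + entityTable + " " + alias
             + " JOIN " + joinTable + " j ON " + alias + "." + filterColumn
             + " = j." + joinColumn;
    }
}
```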
4.4 Quality Requirements
The Quality Requirements view is probably the most interesting one. Here it is possible to define the quality checks which have to be performed whenever a Quality Assessment is run. As already mentioned in section 3.2.1, a quality requirement belongs to a quality entity and has a type:
Entity Id: The Id of the entity on which the quality check has to be performed. Example: 16

Name: The display name of the quality requirement. Example: Mean >= Min
Description: An optional documentary description.
Type: Hard constraint, Soft constraint or Indicator
Assessment SQL: The SQL statement which is executed when a Quality Assessment is performed. This SELECT statement must return one row for every data record in the quality entity. Each row must have the format {requirementId, assessmentId, refId, value}. Value is the data quality value of the data record identified by refId. It must be a number between 0.0 (worst quality) and 1.0 (best quality). NULL may be returned to express that the quality requirement is not applicable to a certain data record.
In the assessment SQL, the place holders described in table 4.1 can be used.
Place Holder  Replacement Value                Example
{0}           <requirementId>, <assessmentId>  150, 81
{1}           <entity> <alias>                 tblaggrfoodcomponent ac
{2}           <keycolumn>                      ac.idaggrfoodcomp
{3}           <condition>                      ac.aggrfoodcompidversion = 1
{4}           <alias>                          ac
Table 4.1: Quality Requirement: Assessment SQL Place Holders
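The substitution of table 4.1 can be sketched as plain string replacement (a minimal illustration; the toolkit's actual substitution mechanism is not shown in the text, and the SQL template in the test is a hypothetical example):

```java
// Sketch of the place-holder substitution from table 4.1 using simple
// string replacement. The helper name is an assumption.
public class PlaceHolders {
    public static String substitute(String sql, String requirementId,
                                    String assessmentId, String entity,
                                    String alias, String keyColumn,
                                    String condition) {
        return sql.replace("{0}", requirementId + ", " + assessmentId)
                  .replace("{1}", entity + " " + alias)
                  .replace("{2}", keyColumn)
                  .replace("{3}", condition)
                  .replace("{4}", alias);
    }
}
```

With the example values from table 4.1, a template such as `SELECT {0}, {2}, 1.0 FROM {1} WHERE {3}` expands to a statement returning rows in the required {requirementId, assessmentId, refId, value} format.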
Figure 4.3 on the next page shows a screenshot of the detail window of a quality requirement, including an example of a quality assessment SQL statement. The full list of quality requirements for the FoodCASE database can be found in appendix B.
Figure 4.3: Administration Tool: Edit a Quality Requirement
4.5 Quality Assessments
The Quality Assessments tab lists all quality assessments in the database with start date, end date and a remark. Quality assessments can be deleted and the remark can be changed.

The Run Quality Assessment view allows the system administrator to manually trigger a quality assessment. It is recommended to do this outside of office hours because the operation noticeably slows down the FoodCASE system.

On the Quality Assessment Timer screen, a timer which regularly triggers a quality assessment can be set up. The timer has three properties, of which two are writable:
Start: The start time when the timer should go off. Preferably, this should be in the middle of the night when nobody is working with the FoodCASE system. Example: 3:00 AM

Interval: The repetition interval in days. It is not recommended to run quality assessments too often because they require a lot of database space. With the current size of the FoodCASE database, a single run requires about 250MB of disk space for the data, 600MB for the database indices and 60MB for the file system cache! Example: 90 days

Next Execution: A read-only property which indicates when the next quality assessment will be run.
4.6 Maintenance
The maintenance panel provides three tools for system administrators:
• Run Vacuum Analyse: This will run a vacuum analyse on the Postgres database table which contains the quality requirement values. This is the database table where all the data quality data are stored. The vacuum analyse command optimises the internal data structures of the database. It should not be run while users are working with the FoodCASE application.

• Truncate Quality Requirement Values: This will delete ALL data quality measurements. In order to avoid that a user runs this command by mistake, they explicitly have to type “yes” to confirm the operation.

• Clear File System Cache: This will clear the file system cache on the application server. The file system cache is logically located between the application server and the database. Please refer to section 5.5.2 on page 45 for more information.
5 Implementation
The high-level architecture of the FoodCASE system was already explained in section 2.2.3. In this chapter we explain the implementation of the Data Quality Analysis Toolkit in more detail.

The Data Quality Analysis Toolkit is embedded and fully integrated into the FoodCASE application. Nevertheless, we tried to keep the Data Quality Analysis Toolkit as independent as possible from FoodCASE. The reason for this is that it should be a generic tool which can easily be integrated into any other system built on top of an RDBMS¹. The only source code dependencies between the Data Quality Analysis Toolkit and FoodCASE are some GUI components we reused in the toolkit so that it integrates visually into FoodCASE.

In the following, we ignore the FoodCASE application and focus only on the code which is related to the Data Quality Analysis Toolkit.
5.1 Project Structure
The source code of the Data Quality Analysis Toolkit is spread over four of the six FoodCASE modules (figure 2.3 on page 8). These four modules are realised as NetBeans² projects:
• The FoodCASE Server project contains four stateless session beans related to the Data Quality Analysis Toolkit:
– The QualityAnalysisBean implements all the persistence logic needed by the toolkit.

– The QualityAnalysisTimerBean is an EJB3 timer. When the timer goes off, a quality assessment is triggered. In section 4.5, it was explained how the system administrator can configure the timer.
¹ Relational Database Management System
² http://www.netbeans.org
– The QualityAnalysisPrototypeBean contains the back-end of our prototype implementation. The prototype is actually obsolete and can be dropped after completion of this master thesis. We only kept it because any prototypes were supposed to be handed in. The prototype can be accessed from the GUI by first starting the Data Quality Analysis Toolkit and then choosing Developer → Start Prototype in the menu. All the files belonging to the prototype are clearly marked, so that they can be removed from the repository within 5 minutes.
– The QualityPreventionBean implements the back-end functionality needed by our enhancements of the “Data Quality Prevention” framework, which we discuss separately in section 7.1.
• The FoodCASE Lib project is used by the client as well as by the admin project. It contains all the entity beans, the session interfaces and a number of utilities.
• The FoodCASE Client project is where the implementation of the FoodCASE CMS can be found. We showed a screenshot of it in figure 2.2 on page 7.
• The FoodCASE Admin project contains the administration tool.
5.2 Code Metrics
Code metrics are a well-suited means to get a high-level overview of where the complexity of a software project resides. Table 5.1 only considers the code we wrote for this master thesis; the code of the existing FoodCASE application is not included in these numbers.

The first thing to notice is that almost two thirds of the code is located in the client project (12936 lines / 20275 lines = 63.8%). The cyclomatic complexity (explained in the next paragraph) is also highest in the client project (2.61). The reason for this is that the GUI is rather heavyweight compared to the back-end. Parts of the business logic are also implemented in the client project, to save network bandwidth and server resources, as we explain in section 5.6.1. The difference between the total number of lines and the lines of code is mainly because all public methods are annotated with Javadoc-style comments.

The cyclomatic complexity is a complexity measure for software, developed by Thomas J. McCabe [mccabe76]. It measures the number of linearly independent paths through a software programme. Practically speaking, it judges the complexity of source code according to the number of loops (for, while) and conditionals (if, switch). Additionally, return statements, try/catch/finally blocks and logical operators (&&, ||, .. ? .. : ..) increase the flow complexity of a programme. According to McCabe, a complexity of less than 5 is desirable. Modules (in our case methods) with a complexity over 10 should be refactored and split into smaller modules with a lower complexity. The code metrics show that the Data Quality Analysis Toolkit is generally well structured in small, easily understandable units with an average complexity of less than 3. However, there are a few exceptions. The worst one is the method TreeViewPropertyTable.getValueAt(rowIndex, columnIndex), which implements the model of the table seen in figure 3.2 on page 15 in the lower-right corner. This method has a complexity of 57 and seems to be much too complicated. However, when looking at it in more detail, it is not that bad. It contains a huge switch, and each case block has a number of nested if-statements in it. The problem here is that JTables are generally row-oriented. In our case, we use it in
a column-oriented fashion, though. Despite the large flow complexity, this method is still relatively easy to understand.

The lack of cohesion of methods (LCOM) refers to a set of techniques to analyse how strongly the methods of a class are connected to each other. Variant 4 was developed by Hitz & Montazeri [hitz96]. A LCOM4 value of 1 indicates that a class has only one responsibility. Values larger than 1 indicate that a class implements two or more independent concepts and, hence, should be split into smaller classes which are independent of each other. The admin and client projects both have an average LCOM4 of 3, which means that theoretically many classes could be split into smaller ones. On the other hand, it can be seen that the average number of lines per class (including comments) is around 180, which shows that classes are already relatively small. The highest LCOM4 values are calculated for the entity beans because, obviously, the getter and the setter of an instance variable are independent of any other instance variable. Still, it does not make sense to split the entity classes. All this shows that LCOM4 can be used as an indicator, but the results have to be analysed with care before coming to a conclusion. For the server project, the LCOM4 measure worked as expected: there is the QualityAnalysisBean, which could easily be split into smaller classes that are each responsible for only one of the business concepts (quality entity, quality entity filter, etc.). In the FoodCASE project, however, all the existing EJBs are in the same Java package, so we did not want to create too many additional classes there for the sake of a better overview.
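As a toy illustration of the cyclomatic-complexity counting rules described above (a hypothetical example, not code from the toolkit):

```java
// Toy illustration of cyclomatic complexity. Counting the decision
// points by the rules quoted in the text (loops, conditionals, logical
// operators, extra return statements) gives roughly:
//   1 (base) + 1 (if) + 1 (early return) + 1 (for) + 1 (if) + 1 (&&) = 6.
public class Complexity {
    /** Counts values v with 0 < v <= max; null input yields 0. */
    public static int countValid(int[] values, int max) {
        if (values == null) {
            return 0;
        }
        int count = 0;
        for (int v : values) {
            if (v > 0 && v <= max) {
                count++;
            }
        }
        return count;
    }
}
```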
         Lines¹   LOC²   Classes  Methods
Admin     3076    1900     17       106
Client   12936    8132     72       490
Lib       3159    1594     26       242
Server    1104     761      5        63
Total    20275   12387    120       901

         Lines/Class  Methods/Class  Avg CC³  Max CC⁴  LCOM4⁵
Admin       180.9          6.2         1.21       5       3
Client      179.7          6.8         2.61      57       3
Lib         121.5          9.3         1.51      15       5
Server      220.8         12.6         1.78      15       5

¹ Total number of lines
² Lines of code: lines which contain at least one of the following: semicolon “;”, left curly brace “{”, right curly brace “}” but do not contain double slash “//”
³ Average cyclomatic complexity
⁴ Maximal cyclomatic complexity per method
⁵ Lack of cohesion of methods (variant 4)
Table 5.1: Code Metrics of the Data Quality Analysis Toolkit
5.3 Data Model
Figure 5.1 shows the data model of the Data Quality Analysis Toolkit. All of the concepts were introduced in previous sections. Here is a short overview of the back-references:
• Quality Entity: Section 4.2
• Quality Entity Filter: Sections 3.2.4 and 4.3
• Quality Requirement: Sections 3.2.1 and 4.4
• Requirement Type: Section 3.2.1
• Quality Assessment: Sections 3.2.3 and 4.5
• Quality Tree Definition: Section 3.2.2
• Aggregation Type: Section 3.2.2
Figure 5.1: Data Model of the Data Quality Analysis Toolkit, UML Class Diagram
5.3.1 Data Quality Tree
As already mentioned in section 3.2.2, the data quality analysis tree definition does not necessarily have to be a tree; we call it a tree for ease of understanding. It is legal for a tree node to have more than one parent, which makes the structure a directed acyclic graph with a few constraints. Every tree node has a level and an index property. The level property describes the height of a node in the tree; counting starts from 1 and goes from bottom to top (leaf to root). The index describes the logical position of a node from left to right, again counting from 1. As we explain in section 5.6.4, the logical position is not necessarily the same as the display position of a tree node. The constraints are:
∀ node ∈ nodes | type_node = “Requirement Node” : level_node = 1   (5.1)

∀ node ∈ nodes | type_node = “Aggregation Node” : level_node > 1   (5.2)

∀ (parent, child) ∈ edges : level_parent > level_child   (5.3)

∀ node ∈ nodes | level_node > 1 : |{inedges_node}| > 0   (5.4)

∀ node ∈ nodes | level_node < max(level) : |{outedges_node}| > 0   (5.5)

|{node ∈ nodes | level_node = max(level)}| = 1   (5.6)
Constraints 5.1 and 5.2 say that requirement nodes can only exist on the lowest level, while aggregation nodes must reside on a level higher than 1. Constraint 5.3 ensures that all edges point “upwards”, i.e. from a node on a lower level to a node on a higher level. Constraint 5.4 ensures that every aggregation node has at least one contributing requirement node. Similarly, constraint 5.5 ensures that there is a path from every node to the root node. Finally, constraint 5.6 makes sure that there is a unique root node.
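Constraints 5.1–5.6 can be checked mechanically; the following sketch encodes them over hypothetical node/edge structures (not the toolkit's classes — node ids, maps and int[] edges are assumptions for illustration):

```java
import java.util.*;

// Sketch of a validator for constraints 5.1-5.6. Edges are (parent, child)
// pairs; the logical direction of an edge is "upwards" from child to parent.
public class TreeConstraints {
    public static boolean isValid(Map<Integer, Integer> levels,
                                  Map<Integer, Boolean> isRequirement,
                                  List<int[]> edges) {
        int maxLevel = Collections.max(levels.values());
        // Constraints 5.1 / 5.2: requirement nodes only on level 1,
        // aggregation nodes strictly above level 1.
        for (Map.Entry<Integer, Integer> e : levels.entrySet()) {
            boolean requirement = isRequirement.get(e.getKey());
            if (requirement && e.getValue() != 1) return false;
            if (!requirement && e.getValue() <= 1) return false;
        }
        Set<Integer> hasInEdge = new HashSet<>();
        Set<Integer> hasOutEdge = new HashSet<>();
        for (int[] edge : edges) {
            int parent = edge[0], child = edge[1];
            // Constraint 5.3: every edge points upwards.
            if (levels.get(parent) <= levels.get(child)) return false;
            hasInEdge.add(parent);  // the parent receives a contribution
            hasOutEdge.add(child);  // the child contributes upwards
        }
        int roots = 0;
        for (Map.Entry<Integer, Integer> e : levels.entrySet()) {
            int level = e.getValue();
            // Constraint 5.4: every node above level 1 has an in-edge.
            if (level > 1 && !hasInEdge.contains(e.getKey())) return false;
            // Constraint 5.5: every node below the top level has an out-edge.
            if (level < maxLevel && !hasOutEdge.contains(e.getKey())) return false;
            if (level == maxLevel) roots++;
        }
        return roots == 1;  // constraint 5.6: a unique root node
    }
}
```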
5.4 Mapping to Physical Data Model
The mapping to the physical data model is straightforward for most of the Java entity beans. The physical database model can be found in appendix C. A TreeNode is either a RequirementNode or an AggregationNode. CollapsedNodes are virtual and only exist at runtime as a replacement for the children of a collapsed node. To model the inheritance, we used the JPA³ inheritance strategy “joined”. This means that there is a database table for the abstract super class TreeNode. Additionally, for every entry in the table tblqualitytreenode, there is a matching entry with the same nodeid either in tblqualitytreenodeaggregation or tblqualitytreenoderequirement. The Java enums RequirementType and AggregationType are not modelled as tables; only their integer representation is stored in the corresponding tables. A check constraint ensures that only valid values are accepted by the database. In addition to the tables depicted in figure 5.1, there are three more tables:
• tblqualityrequirementvalue contains all the data which are collected when a quality assessment is performed. Performance is of major importance here, which is why this table is only accessed using plain SQL and is not mapped to JPA.
• tblqualitywarningtext and tblqualitywarninguser belong to our enhancement of the “Data Quality Prevention” framework we discuss in section 7.1.
³ Java Persistence API
5.5 Back-End
5.5.1 Running a Quality Assessment
When a quality assessment is performed, a snapshot of the data quality of every data record in the database is taken. This means that every quality requirement is executed and the results are written back to the database. This includes the following steps:
1. Create a new Quality Assessment and set the start date plus optionally a remark.
2. For every Quality Requirement:
• Load the quality requirement from the database.
• In the assessment SQL, replace all the place holders as described in table 4.1.
• Prepend INSERT INTO tblqualityrequirementvalue (requirementid, assessmentid, refid, value) and execute the SQL statement.
3. Set the end date of the quality assessment.
4. For every quality requirement, initialise the file system cache as discussed in section 5.5.2.
Because a large amount of data is written here, every quality requirement is executed in a separate transaction to avoid overhead; otherwise, the rollback segment would grow very large. As a consequence, before using a quality assessment, it has to be checked whether its end date is set. If it is not, the quality assessment is either still running or it crashed during execution.
EJB3 Timer
A quality assessment can either be triggered manually by the system administrator or by an EJB3 timer (compare with section 4.5). When the timer goes off, ejbTimeout is called on the QualityAnalysisTimerBean by the timer service, with the container's default identity. Now exactly the same as in the previous section has to be done: executing all quality requirements and initialising the cache. However, the QualityAnalysisBean only allows calls to any of its methods if the security context contains a valid caller principal. In order to fulfil this requirement, the QualityAnalysisTimerBean makes a remote call to itself using a technical user account.

Furthermore, the timer bean also contains methods to query the timer status, cancel the timer or reschedule it.
5.5.2 File System Cache
RDBMS are very powerful when it comes to answering complex queries. Nevertheless, there are faster approaches to store and retrieve large amounts of data when only very restricted selection criteria are needed.

The table tblqualityrequirementvalue contains four columns: requirementid, assessmentid, refid and value. The refid identifies the data record and value is the data quality value of this data record. When a data quality analysis is performed, this table is accessed using a combination of requirementid and assessmentid, and the data quality values of all data records are loaded. Experiments have shown that the retrieval performance decreases significantly when multiple quality assessments are stored in the database, even if optimised indices are used. In order to avoid this problem, we implemented a file system cache on the application server, which is logically located between the application server and the database. This way, all the data are still in the same storage, namely the database. However, once the cache is initialised, the application server always loads the data from the file system cache. This is much faster and saves resources on the database server.

The file system cache writes its files to the temp directory of the application server and is initialised when a new quality assessment is performed. If the cache gets deleted, it is rebuilt the next time the data are requested. The cache files have a file name in the format QualityAssessment_{assessmentid}_{requirementid}.bin. Like this, the task of looking up the right data is simply delegated to the file system and no sophisticated search data structures are needed.

Table 5.2 shows the format of the file system cache and listing 5.1 contains the code of the method writeToCache which writes the cache files. The method readFromCache performs the inverse operations and restores the original data structure. Note that the data quality values may be NULL; this is encoded as -1f.
byte offset |  0  1  2  3  |  4  5  6  7
content     |    refId     |    value

Table 5.2: File System Cache Structure
• The first four bytes contain the integer-valued refId.
• The second four bytes contain the value represented in the IEEE 754 floating-point “single format” bit layout.

4 Institute of Electrical and Electronics Engineers
private void writeToCache(File cacheFile, List<Object[]> data) {
    try {
        FileOutputStream fos = new FileOutputStream(cacheFile);
        int refId;
        int value;
        for (Object[] row : data) {
            // refid
            refId = ((Number) row[0]).intValue();
            fos.write((refId >>> 24) & 0xff);
            fos.write((refId >>> 16) & 0xff);
            fos.write((refId >>> 8) & 0xff);
            fos.write(refId & 0xff);

            // value
            value = Float.floatToIntBits(row[1] != null ?
                ((Number) row[1]).floatValue() : -1f);
            fos.write((value >>> 24) & 0xff);
            fos.write((value >>> 16) & 0xff);
            fos.write((value >>> 8) & 0xff);
            fos.write(value & 0xff);
        }
        fos.close();
    } catch (Exception e) {
        throw new DataQualityAnalysisException(
            "Failed to write to cache file " + cacheFile, e);
    }
}
Listing 5.1: Writing to File System Cache
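The text above states that readFromCache performs the inverse operations of writeToCache but does not list it. The following is a hypothetical sketch (not the thesis code) of what such an inverse could look like, assuming the 8-byte {refId, value} record layout of table 5.2. DataInputStream reads big-endian integers, which matches the byte order produced by the manual shifts in listing 5.1.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

public class CacheReader {

    // Hypothetical inverse of writeToCache: reads 8-byte records of
    // {int refId, float value}; the -1f marker is decoded back to NULL.
    // A truncated (partial) record would raise an exception here.
    public static List<Object[]> readFromCache(File cacheFile) throws Exception {
        List<Object[]> data = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(cacheFile))) {
            while (true) {
                int refId;
                try {
                    refId = in.readInt();   // big-endian, like the manual shifts
                } catch (EOFException eof) {
                    break;                  // end of cache file reached
                }
                float value = in.readFloat();
                data.add(new Object[] { refId, value == -1f ? null : value });
            }
        }
        return data;
    }
}
```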
5.6 Front-End
The front-end project is where most of the complexity of the Data Quality Analysis Toolkit resides. In section 5.6.1, we explain the runtime data structure of a data quality tree and section 5.6.2 explains how this structure is created. After that, we describe how the tree is visualised and how the data quality charts are created.
5.6.1 Runtime Data Structure
As seen in section 5.3.1, the tree is modelled by tree nodes which are connected by edges. Behind each of the tree nodes, there is a lot of data quality data. In section 3.2.4, we explained how the data can be grouped by various criteria. In the example data structure in figure 5.2, the data are grouped by user. The first magnifier shows a table with N columns, where column 0 contains the total over all groups, and columns 1..N contain the data of the individual groups (in this case the users). Each of the groups contains a list which holds the raw data quality data (magnifier 2). This list is ordered by descending data quality values. Note that the data quality value can be NULL if a quality requirement is not applicable to a particular data record. The refIds together with the entityId of the tree definition uniquely identify the data records to which the data quality values belong. For every group, the percentiles, minimum value, mean, standard deviation and row count are cached in the tree node for faster access. For the maximum value, caching is not needed because it is simply the first row in the list.
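As an illustration of this structure, the following sketch (hypothetical, not the thesis code; class and field names are our own) shows a per-group list of {refId, value} tuples sorted by descending quality value with NULLs last, plus two of the cached statistics mentioned above:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GroupData {

    // Tuples of {refId (Integer), value (Float or null)}, sorted
    // descending so that the maximum is simply the first row.
    final List<Object[]> tuples = new ArrayList<>();
    double cachedMean;
    int cachedRowCount;

    // Sorts the list and caches summary statistics for fast access,
    // as described for the tree nodes in the text above.
    void finish() {
        tuples.sort(Comparator.<Object[], Float>comparing(
            row -> row[1] == null ? Float.NEGATIVE_INFINITY : (Float) row[1],
            Comparator.reverseOrder()));
        double sum = 0;
        int n = 0;
        for (Object[] row : tuples) {
            if (row[1] != null) {   // NULL = requirement not applicable
                sum += (Float) row[1];
                n++;
            }
        }
        cachedRowCount = tuples.size();
        cachedMean = n > 0 ? sum / n : 0;
    }
}
```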
5.6.2 Tree Calculation
The term tree calculation refers to the process of loading all needed data into memory and creating the data structure just described. It is implemented in a multi-threaded fashion to make the best use of the available computing resources. The logic is split into two classes:
• The TreeCalculation class does the real work, which can be divided into three kinds of tasks:
– Loading requirement node data
– Grouping requirement node data
– Calculating aggregation node data
• The TreeCalculationRunner is the driver of the whole process. It is responsible for the scheduling of all the tasks, maintains the progress indicators, and contains the error handling code. The calculation runner uses two separate thread pools for the data loading tasks and the calculation tasks, so that the calculation can already start as the data arrive.
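The two-pool scheduling just described can be sketched as follows. This is a minimal illustration with hypothetical class and method names (TwoPoolPipeline, loadNodeData, groupNodeData are ours, and the pool sizes are arbitrary), not the actual TreeCalculationRunner:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TwoPoolPipeline {

    // Separate pools so that grouping/calculation for one requirement
    // node can start while the data for other nodes is still loading.
    private final ExecutorService loadingPool = Executors.newFixedThreadPool(4);
    private final ExecutorService calculationPool = Executors.newFixedThreadPool(2);

    // For each requirement node: load its data on the loading pool,
    // then hand the result to the calculation pool for grouping.
    public CompletableFuture<List<Object[]>> process(int requirementId) {
        return CompletableFuture
            .supplyAsync(() -> loadNodeData(requirementId), loadingPool)
            .thenApplyAsync(this::groupNodeData, calculationPool);
    }

    // Placeholder for the EJB call returning {refId, value} tuples.
    private List<Object[]> loadNodeData(int requirementId) {
        return List.of(new Object[] { 1, 0.8f }, new Object[] { 2, 0.5f });
    }

    // Placeholder for the grouping step.
    private List<Object[]> groupNodeData(List<Object[]> rows) {
        return rows;
    }

    public void shutdown() {
        loadingPool.shutdown();
        calculationPool.shutdown();
    }
}
```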
The following enumeration explains all the steps involved in the tree calculation in more detail:
1. The user specifies on the start screen of the Data Quality Analysis Toolkit which data he is interested in (see figure 3.1). This includes the desired data quality tree definition, optionally a filter or grouping criterion, a quality assessment and how “not applicable” indicators should be treated.
Figure 5.2: Runtime Data Structure of the Data Quality Analysis Toolkit
2. The TreeCalculationRunnerTask is started and ...
(a) loads the tree definition from the database.
(b) If grouping is enabled, the keys which identify the groups, and the grouping values which map every data record to a group, are loaded.
(c) For every requirement node in the tree definition, a LoadRequirementNodeDataTask is issued. The data loading task calls an EJB method and gets a list of {refId, value} tuples for the specified quality assessment and requirement.
(d) When a data loading task completes, a grouping task is issued on the second thread pool if grouping is enabled. This task expands the list previously received according to the grouping values. Group 0, which is the total, will contain the original list. Additionally, every tuple will be put in the list corresponding to its grouping value. This means that the amount of data is doubled, and this is one of the reasons why it is done on the client. The other reason is that we wanted to save computing resources on the application server, so that a user who analyses the data quality does not slow down the whole FoodCASE system.
(e) As soon as the requirement node data are fetched and grouped, the TreeCalculationTask starts walking the tree from bottom to top. For every aggregation node in the tree, it calculates the data quality values for every group and every data record according to the aggregation type of the node. Remember that the aggregation type describes how the data quality values behind an aggregation node are calculated from its direct contributing children. Since the calculation task does not involve any I/O operations and all the necessary data are ready in RAM, it is much faster than the data loading tasks (< 1 sec). Hence, it was not worth the effort to parallelise this task.
(f) Finally, all the lists with the {refId, value} tuples are sorted by descending data quality value (magnifier 2 in figure 5.2) and the percentiles are initialised.
3. Now the user has 11 different views at his disposal, which visualise the data just calculated.
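The grouping step (d) above can be sketched as follows. This is a hypothetical illustration (class and parameter names are ours, not the thesis code): group 0 receives a copy of the original list, and every tuple is additionally placed into the list of its group, which is why the amount of data held in memory is doubled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class GroupingTask {

    // Expands the flat {refId, value} list into per-group lists.
    // Index 0 holds the total (the original list); indices 1..groupCount
    // hold the tuples of the individual groups.
    public static List<List<Object[]>> expand(List<Object[]> rows,
                                              Map<Integer, Integer> groupOfRefId,
                                              int groupCount) {
        List<List<Object[]>> groups = new ArrayList<>();
        groups.add(new ArrayList<>(rows));          // group 0: the total
        for (int g = 1; g <= groupCount; g++) {
            groups.add(new ArrayList<>());
        }
        for (Object[] row : rows) {
            Integer group = groupOfRefId.get((Integer) row[0]);
            if (group != null) {
                groups.get(group).add(row);         // group 1..N
            }
        }
        return groups;
    }
}
```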
5.6.3 NanoGraph
The tree view (Figure 3.2 on page 15) and the tree editor (Figure 3.7) use a highly modified version of the NanoGraph library. NanoGraph originally featured a panel on which nodes could be freely moved around, zooming and an outline panel. However, it did not provide any means to add or remove nodes or edges via the GUI. The library was kept very generic, so that different types of nodes, edges, node renderers and edge renderers could be used in parallel. It supported different backgrounds, docking strategies, layouting algorithms, selection of multiple objects and even the export to an SVG or JPEG image.

After a short evaluation, we realised that it was not possible to change the selection mode so that only one single object could be selected, which nullified our initial intent to use the NanoGraph library as-is. Because we needed to touch the library anyway, we chose to remove all the unneeded features and their implied complexity. Doing so, we ended up with a very slimmed-down version, which only contained seven classes compared with 40 in the beginning. Now it was easy to customise those seven classes according to our needs. In the end, we estimate that about 60% of the lines in the remaining NanoGraph classes were added or modified by us. Still, we think it was worth using the NanoGraph library as a starting point, since we did not have to design everything from scratch.
5.6.4 Tree Model
The main responsibility of the TreeModel class is to tell the NanoGraphPanel where the tree nodes have to be rendered in terms of (x, y) coordinates. It supports persistence and model modifications such as adding and deleting requirement and aggregation nodes and adding and deleting edges. In order to save space on the screen, a node can be collapsed, so that its children are not visible anymore. The logic for the positioning of the nodes is best described by the pseudo code in listing 5.2.
5 Input/Output
6 Random Access Memory
7 http://www.sourceforge.net/projects/nanograph
8 http://www.nanoworks.nl
9 Scalable Vector Graphics
10 Joint Photographic Experts Group
public Point2D getLocation(TreeNode node) {
    double x;
    double y = (getLevelCount() - node.getLevel()) *
        LEVEL_HEIGHT + LEVEL_HEIGHT / 2.0;
    if (node.isCollapsed()) {
        // case 1: node is collapsed
        x = getLocation(new CollapsedNode(node)).getX();
    } else if (node instanceof AggregationNode) {
        if (node.getInEdges().size() > 0) {
            // case 2: AggregationNode is connected:
            // position determined by its children
            x = (maxXOfChildren + minXOfChildren) / 2.0;
        } else if (node == movingNode) {
            // case 3: AggregationNode is not connected
            // and moving (drag & drop)
            return movingLocation;
        } else {
            // case 4: AggregationNode is not connected
            // and not moving:
            // position determined by last requirement node
            x = xOfLastRequirementNode + NODE_SPACING +
                getNodeWidth(node) / 2.0;
        }
    } else {
        if (node == movingNode) {
            // case 5: node is moving
            x = xOfMovingNode;
        } else {
            // case 6: node is not moving:
            // align it next to the node on the left
            x = xOfNodeOnTheLeft + NODE_SPACING +
                getNodeWidth(node) / 2.0;
        }
    }
    return new Point2D.Double(x, y);
}
Listing 5.2: Positioning of Tree Nodes (Pseudo Code)
Generally, all requirement nodes are rendered on level 1 and are aligned next to each other in the order of their node.index property. Aggregation nodes are rendered on the level specified by their node.level property. The horizontal position is the middle of the first and the last of its children. If an aggregation node does not have any children yet, it is aligned on the right of the last requirement node. A CollapsedNode object is a replacement for all the children of a collapsed node. Therefore, it is rendered at the same x coordinate as its parent. Requirement nodes can only be moved horizontally, which is why only the x coordinate is overridden, whereas aggregation nodes can be moved on both axes.

The RotationTreeModel is an extension of the standard tree model, which adds the feature of rotating the tree by 90 degrees clockwise, so that the root node is on the right-hand side. This coordinate transformation logically involves three steps:
1. Swap the x and y components of the coordinate.
2. Mirror along the y-axis.
3. Apply a stretching factor to both components.
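These three steps can be written out as a small sketch (hypothetical code, not the RotationTreeModel itself; the stretch factors are illustrative parameters, and any subsequent translation back into the visible area is omitted). Note that in screen coordinates, where y grows downwards, this composition yields the clockwise rotation described above:

```java
import java.awt.geom.Point2D;

public class RotationTransform {

    // Step 1: swap the x and y components.
    // Step 2: mirror along the y-axis (negate x).
    // Step 3: apply a stretching factor to both components.
    public static Point2D rotate(Point2D p, double stretchX, double stretchY) {
        double x = p.getY();    // step 1
        double y = p.getX();
        x = -x;                 // step 2
        return new Point2D.Double(x * stretchX, y * stretchY); // step 3
    }
}
```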
When the isRotated property is changed, all tree nodes move along a straight line from their old position to their new position in a 500-millisecond animation.

The tree model implements the observer pattern, so that any party interested in state changes can register a TreeModelListener. The InteractionManager, which implements the tool tips, and the DragAndDropTreeModel, which contains the drag and drop functionality used by the tree editor (on the left in figure 3.7), make use of this mechanism.
5.6.5 Tree View and Tree Editor
Figure 5.3: Tree View and Tree Editor, UML Class Diagram
Figure 5.3 shows how all the components mentioned in the previous sections work together. The TreeViewPanel contains a NanoGraphPanel (red association) which has a NanoGraph. The NanoGraph knows the RotationTreeModel and has a node renderer, an edge renderer, a background and a docking strategy. The tree editor is an extension of the tree view. Similarly, the TreeEditorPropertyTable extends the TreeViewPropertyTable with the ability to edit a subset of the available properties. The AbstractTreeEventHandler implements the event handling which is common to the tree view and the editor. Its two sub-classes contain the code which is specific to one or the other. The InteractionManager is responsible for keeping track of the selected node or edge and for displaying the appropriate tool tips. Lastly, the DragAndDropTreeModel contains the logic which allows new nodes to be added to the tree using drag and drop.
5.6.6 GUI Components
The GUI of the FoodCASE application was built on top of the Swing Application Framework (JSR 296). Since the Data Quality Analysis Toolkit was expected to be fully integrated into FoodCASE, it was a logical step to implement the toolkit using the App Framework as well. Figure 5.4 shows the GUI components of the Data Quality Analysis Toolkit. The main frame is highlighted in red and there are 14 top level panels which are coloured in green:
• The LoadTreeDataPanel is the start screen of the Data Quality Analysis Toolkit (Figure 3.1 on page 14).
• The TreeDefinitionTablePanel lists all available tree definitions.
• When a tree definition is opened, the TreeEditorPanel is shown (Figure 3.7 on page 20).
• Furthermore, there are seven data quality views and four data problem views.
All of those 14 top level panels implement the interface specified by DataQualityAnalysisPanel. The first three panels implement this interface directly (green dashed lines in the class diagram); the quality views and the problem views, on the other hand, have abstract super-classes which implement the functionality common to all of them. The transitions between the different panels are realised using a state pattern, where the main frame serves as the state manager. All of the 14 main panels have a reference to the manager (blue associations), so that the active panel can tell the manager which panel to display next. In detail, a panel switch works as follows:
1. The active panel decides, e.g. in response to a user action, that another panel should be shown. It signals this by calling dataQualityAnalysisPanel.loadDataAndShow(panelToBeShown).
2. According to panelToBeShown.isDisplayMaximized(), the window is maximised or restored.
3. The old panel is hidden and a busy message is shown.
4. The stored user settings are loaded from cache and applied to the new panel.
5. panelToBeShown.updateData() is called.
• If true is returned, the data are already ready, so the process continues with the next step.
• If false is returned, the data are not ready yet and will be loaded asynchronously, so that the GUI stays responsive. In this case, the manager has to wait until the callback method loadingDataDone(..) is invoked.
6. When the data are ready, the new panel is made visible.
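The state-pattern panel switch described in these steps can be sketched as follows. This is a deliberately GUI-free, hypothetical reduction (interface and class names are simplified stand-ins for the thesis classes, and busy message, settings and maximising are omitted):

```java
public class PanelSwitchDemo {

    // Stand-in for the DataQualityAnalysisPanel interface.
    interface AnalysisPanel {
        boolean isDisplayMaximized();
        // true: data already ready; false: loaded asynchronously,
        // followed by a loadingDataDone callback on the manager
        boolean updateData(Manager manager);
        void setVisible(boolean visible);
    }

    // Stand-in for the main frame acting as state manager.
    static class Manager {
        private AnalysisPanel activePanel;

        void loadDataAndShow(AnalysisPanel panelToBeShown) {
            if (activePanel != null) {
                activePanel.setVisible(false);   // hide old panel
            }
            activePanel = panelToBeShown;
            if (panelToBeShown.updateData(this)) {
                loadingDataDone(panelToBeShown); // data already ready
            }
            // otherwise, wait for the asynchronous callback below
        }

        // Callback invoked once asynchronous loading finishes.
        void loadingDataDone(AnalysisPanel panel) {
            panel.setVisible(true);
        }
    }

    // Trivial panel used to demonstrate the synchronous path.
    static class FakePanel implements AnalysisPanel {
        boolean visible;
        public boolean isDisplayMaximized() { return false; }
        public boolean updateData(Manager m) { return true; }
        public void setVisible(boolean v) { visible = v; }
    }
}
```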
11 Java Specification Request
Figure 5.4: GUI Components, UML Class Diagram
In contrast to the TreeEditorPanel, the TreeViewPanel does not implement the interface DataQualityAnalysisPanel. This is because the tree view panel is not used as a main panel, but is wrapped into the QualityTreePanel (red association in figure 5.4). The ProblemTablePanel uses extensions of the table panels provided by FoodCASE. The difference is that in the Data Quality Analysis Toolkit only the subset of rows which matches the criteria (data quality lower/higher than the specified threshold) is displayed. The advantage of reusing the same table panels is that the user has to configure his table preferences (columns to be displayed, column widths, etc.) only once. What is more, the detail window of a data record can easily be opened by double-clicking on a row.
5.6.7 Views
The data quality and data problem views are structured according to the Model-View-Controller (MVC) pattern. Most of the views display a chart. These charts are created using the JFreeChart library. JFreeChart uses the same concept, but a different terminology: what is a view in MVC is called a plot in JFreeChart, and the models are called datasets. Table 5.3 gives an overview of which plot types and datasets are used by the seven quality and the four problem views.

The bar charts, box plot and spider charts use category datasets. Each category has a label and a value. In addition to the mean value, the box plot also visualises the percentiles. The bar chart can display the standard deviation if a DefaultStatisticalCategoryDataset is used instead of a DefaultCategoryDataset.

The histogram and the line chart use different kinds of XY datasets. In the case of the histogram, there are 20 bins per data series. Each bin represents a 5% interval and the height of the corresponding bar is determined by the number of values in this interval. The line chart uses a simple collection of (x, y) coordinates which define the data points. In the data quality line chart, the points are ordered by increasing data quality and the x component is always equal to the index of the point in the list. This results in a monotonically increasing line. For pie plots, JFreeChart provides a special dataset.

The data quality tree table is realised using a component from the table library developed by Scientific Applications. This is a commercial library which was already used within FoodCASE; therefore, we had already acquired a licence to use it. The way the data quality tree view and the problem table work was already explained in previous sections.

In order to integrate the JFreeChart library and the table library into the Data Quality Analysis Toolkit, we had to make a design decision for every view. We could either copy the necessary data into another data structure which is usable by these libraries, or we could write an adapter which makes the libraries work together with our own runtime data structure (section 5.6.1). In the majority of cases, we chose the first option. However, for the box plot and the tree table, the amount of data to copy would have been too large. Therefore, we implemented an adapter class in these two cases to avoid overhead. While this was straightforward for the box plot, the QualityTreeTableModel, which is a wrapper around a TreeNode, was rather complicated to come up with. One of the problems was that we had to add an artificial root node depending on the view option Show Parent. Then, the option Use Grouping added further complexity.
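The histogram binning described above (20 bins, each covering a 5% interval) can be sketched as follows. This is a hypothetical illustration, not the thesis code or the JFreeChart HistogramDataset; it assumes the data quality values are normalised to [0, 1], with NULL encoded as -1f and skipped:

```java
public class QualityHistogram {

    // Counts the quality values into 20 bins of width 0.05 (5%).
    public static int[] binQualityValues(float[] values) {
        int[] bins = new int[20];
        for (float v : values) {
            if (v < 0f || v > 1f) {
                continue;                 // skip e.g. NULL encoded as -1f
            }
            // v == 1.0 would index bin 20, so clamp it to the last bin
            int bin = Math.min((int) (v * 20), 19);
            bins[bin]++;
        }
        return bins;
    }
}
```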
12 http://www.jfree.org/jfreechart
13 http://www.scientific.gr
The table covers the seven Data Quality views (Bar Chart, Box Plot, Histogram, Line Chart, Spider Chart, Tree, Tree Table) and the four Problems views (Bar Chart, Pie Chart, Spider Chart, Problem Table).

Plot Type/ View
AbstractTablePanel [3]: Problem Table
CategoryPlot [1]: Bar Chart, Box Plot (quality); Bar Chart (problems)
PiePlot [1]: Pie Chart (problems)
SpiderWebPlot [1]: Spider Chart (quality and problems)
TreeTable [2]: Tree Table (quality)
TreeViewPanel [4]: Tree (quality)
XYPlot [1]: Histogram, Line Chart (quality)

Dataset/ Model
Category Datasets
DefaultCategoryDataset [1]: Spider Chart (quality and problems)
DefaultStatisticalCategoryDataset [1]: Bar Chart (quality and problems)
QualityBoxPlotDataset [4]: Box Plot (quality)
XY Datasets
HistogramDataset [1]: Histogram (quality)
XYSeriesCollection [1]: Line Chart (quality)
Other
DefaultPieDataset [1]: Pie Chart (problems)
FoodTableModel [3]: Problem Table (problems)
TreeModel [4]: Tree (quality)
QualityTreeTableModel [4]: Tree Table (quality)

[1] Class provided by JFreeChart, http://www.jfree.org/jfreechart
[2] Class provided by Scientific Applications, http://www.scientific.gr
[3] Class provided by FoodCASE, http://www.foodcase.ethz.ch
[4] Class developed by Reto Mock for the Data Quality Analysis Toolkit

Table 5.3: Plot Types and Datasets used by the Data Quality Analysis Views
5.6.8 Chart Highlighting and Breadcrumb Navigation
The topmost component on every data quality analysis view is the breadcrumb navigation. It shows the path from the root node of the tree to the current node (compare with figure 3.4 on page 17). By clicking on an ancestor, it is possible to navigate upwards in the tree. Navigation towards the leaves of the tree is possible by clicking on a chart entity, for example on a bar in case of a bar chart. This by itself was easy to implement.

So that the user realises that he can click on a chart entity, we wanted to highlight it on mouse-over, i.e. the chart entity should change its appearance when the mouse hovers over it, as shown in figure 3.5. This proved to be more complicated than expected at first. Eventually, we found a solution for every chart type we use, but unfortunately it works differently in every case. We wrote a ChartHighlighter class which serves as a facade and hides all these differences. Still, having five classes and an interface, with a total of 407 lines or 261 lines of code, only for the highlighting is suboptimal. This certainly is an area where the JFreeChart library could be improved.
5.6.9 User Settings
There are a lot of check boxes and radio buttons in the different views of the Data Quality Analysis Toolkit. Therefore, it would be painful and potentially error-prone if the code to store the user settings had to be written again for every view option. For this reason, we created a UserSettingsUtil, which is used by the Data Quality Analysis Toolkit as well as the admin tool. This utility inspects a panel using reflection, searches for JCheckBoxes, JRadioButtons, JSliders and FoodCASE's AbstractTablePanels, and automatically stores their state. This utility method is called whenever a panel switch happens, and the settings are written to the database when the Data Quality Analysis Toolkit or the admin tool, respectively, is closed.

Section 3.3.4 explained that some view options are global, whereas others are stored per panel. For example, the value of the threshold slider should always be consistent on all of the four problem views. On the other hand, it should be possible to specify the view option Show Parent on every panel independently. We found that in this particular setup, it was possible to define the following simple rule: check boxes are stored per view, but radio buttons and sliders are stored globally. This makes sense because it can be assumed that the user wants the view options
• Percentage Scale/ User-defined Scale,
• Vertical Plot/ Horizontal Plot,
• and the Threshold Slider
to be the same on every view. All the other options should be stored per view.

With this user settings utility, the loading and storing of the settings is fully transparent to the panels, which is really convenient. Nevertheless, it might be desirable in rare cases to have more control. This is where the interface UserSettingsAware comes into play. The interface contains two callback methods: userSettingsLoading(userSettings) and userSettingsUpdating(userSettings). Panels can implement this interface and are notified when the user settings are loaded or stored. With the reference which is passed to the callback methods, it is possible to access the user settings and read or store custom settings.
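The reflection idea behind the UserSettingsUtil can be sketched as follows. This is a hypothetical illustration (not the thesis utility) restricted to JCheckBoxes on a panel's own fields; the real utility also handles JRadioButtons, JSliders and AbstractTablePanels and traverses the component hierarchy:

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;
import javax.swing.JCheckBox;

public class UserSettingsScan {

    // Inspects an object's declared fields via reflection, finds the
    // JCheckBoxes and records their selected state under the field name.
    public static Map<String, Boolean> collectCheckBoxStates(Object panel)
            throws IllegalAccessException {
        Map<String, Boolean> settings = new HashMap<>();
        for (Field field : panel.getClass().getDeclaredFields()) {
            if (JCheckBox.class.isAssignableFrom(field.getType())) {
                field.setAccessible(true);
                JCheckBox box = (JCheckBox) field.get(panel);
                if (box != null) {
                    settings.put(field.getName(), box.isSelected());
                }
            }
        }
        return settings;
    }
}
```

The collected map can then be persisted and re-applied on the next panel switch, which is what makes the mechanism transparent to the individual panels.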
5.6.10 Logging and Error Handling
For the logging and error handling, we are using the logging facility which has been included in Java since JDK version 1.4. Whenever an error occurs, we log it. In addition, we have registered a special handler on the root logger. In case the level of a log record is SEVERE, a pop-up message is shown to the user to inform him about the error. With this convention, the error handling is as simple as the following code snippet demonstrates:
LoggerUtil.getLogger().info(
    "I'm an info and will only be written to the log file");
LoggerUtil.getLogger().warning(
    "Warnings only go to the log as well");
LoggerUtil.getLogger().severe(
    "I'm important and will additionally be displayed as pop-up");
LoggerUtil.getLogger().log(Level.SEVERE, "I have an exception " +
    "attached which will be written to the log file", exception);
Listing 5.3: Logging and Error Handling
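The special handler on the root logger can be sketched as follows. This is a hypothetical illustration (not the thesis code): the pop-up dialog is abstracted as a Consumer so the idea can be shown without a GUI, whereas the real handler shows a Swing dialog:

```java
import java.util.function.Consumer;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class SevereNotifier {

    // Registers a Handler on the root logger that triggers a user
    // notification (in the toolkit, a pop-up) for every SEVERE record.
    // Records propagate to the root logger's handlers by default, so
    // loggers created anywhere in the application are covered.
    public static void install(Consumer<String> showPopup) {
        Logger root = Logger.getLogger("");
        root.addHandler(new Handler() {
            @Override
            public void publish(LogRecord record) {
                if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
                    showPopup.accept(record.getMessage());
                }
            }
            @Override public void flush() { }
            @Override public void close() { }
        });
    }
}
```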
5.7 Admin Module
The structure of the part of the admin tool which belongs to the Data Quality Analysis Toolkit is intentionally kept similar to the structure of the GUI components of the toolkit itself (see figure 5.5). The main difference here is that the top level component is not a frame but a tabbed pane. This is because the admin tool contains a lot of other settings alongside the configuration of the Data Quality Analysis Toolkit.

In the toolkit, the DataQualityAnalysisFrame served as the state manager. In the admin tool, the DataQualityAnalysisAdminTabbedPane assumes this role, so all the panels keep a reference to the tabbed pane. Most of the panels in the admin tool contain a table panel. When a certain record is opened, a detail window appears. Hence, all the table panels need to have an associated detail frame. These frames know the tabbed pane as well. When a record is changed using the detail window, it can signal the state manager that the table listing all the data records should be reloaded. This way, it is guaranteed that all the data displayed are always in sync.

In the toolkit, the busy message was only needed at a central place. In the admin tool, however, there are different tabs, each of which is independent of the others. For this reason, every panel is wrapped into a DataQualityAnalysisAdminTab. Each tab maintains its own state, which is either busy or ready. Depending on this state, either the panel itself or a busy message is displayed when the tab is active.

The AbstractDataQualityAnalysisAdminPanel and the AbstractDataQualityAnalysisAdminTablePanel both implement the same interface, but the first one inherits from JPanel while the second one extends the FoodCASE class AbstractTablePanel. In the case of a table panel, the label of the tab containing this panel is simply the title of the table. However, when the tab does not contain a table, the title of the tab has to be specified separately.
14 Java Development Kit
Figure 5.5: Admin Module, UML Class Diagram
6 Testing
In this chapter, we describe the testing of the Data Quality Analysis Toolkit. For the back-endand front-end testing, we used different methodologies.
6.1 Back-End Testing
The back-end testing was a purely technical testing of the functionality provided by the EJB back-end. For each of the entities (quality entity, quality entity filter, quality requirement, etc.) the EJB provides the operations
• list
• get
• store
• delete
and some operations which are specific to a certain entity. The correct behaviour of these operations can be well verified using unit testing. To do this, we used JUnit, the most well-known unit testing framework for Java applications. We created one test case per entity, which executes all possible operations and verifies that they worked properly. A typical test case looks like the following:
1. List all instances of the entity.
2. Create a new one and verify that the list now contains one entry more.
3. Retrieve the new instance from the database.
1 http://www.junit.org
4. Change some attributes of the entity and store it.
5. Retrieve a fresh reference to the entity from the DB and verify that the changed values are still there.

6. Delete the entity and verify that the list now contains the same number of rows as before the test.
7. Try to retrieve the deleted entity from the DB and verify that null is returned.
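The seven steps above can be sketched as follows. This is a hypothetical illustration: the real tests ran with JUnit against the EJB back-end and a database, whereas here an invented in-memory stand-in (InMemoryDao, with String values instead of real entities) replaces both, so only the test pattern itself is shown:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EntityCrudTest {

    // Minimal in-memory stand-in for the EJB operations
    // list, get, store and delete.
    static class InMemoryDao {
        private final Map<Integer, String> rows = new HashMap<>();
        private int nextId = 1;

        List<Integer> list() { return new ArrayList<>(rows.keySet()); }
        String get(int id) { return rows.get(id); }
        int store(String value) { rows.put(nextId, value); return nextId++; }
        void store(int id, String value) { rows.put(id, value); }
        void delete(int id) { rows.remove(id); }
    }

    // The typical test case from the enumeration above.
    public static void runTypicalTestCase(InMemoryDao dao) {
        int sizeBefore = dao.list().size();                  // 1. list all
        int id = dao.store("quality requirement");           // 2. create
        assert dao.list().size() == sizeBefore + 1;
        assert "quality requirement".equals(dao.get(id));    // 3. retrieve
        dao.store(id, "changed");                            // 4. change + store
        assert "changed".equals(dao.get(id));                // 5. verify change
        dao.delete(id);                                      // 6. delete
        assert dao.list().size() == sizeBefore;
        assert dao.get(id) == null;                          // 7. null returned
    }
}
```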
Figure 6.1: JUnit Back-End Testing
6.2 Front-End Testing
6.2.1 Automated Testing
The code metrics in section 5.2 disclosed that the server project contains only about 6% of the code of the Data Quality Analysis Toolkit. So the unit testing of the back-end covered only a marginal part of the codebase. Therefore, we had to come up with a way of testing the GUI in order to achieve a reasonable code coverage, since most of the code resides in the client project.

For this purpose, we developed a GUITestUtil which tests the GUI of the Data Quality Analysis Toolkit in two phases. In the first phase, the tree definition table and the tree editor are tested. The second phase tests the start screen and all eleven data quality analysis views. The basic idea behind the GUI testing is to simulate a user who works with the toolkit.

The testing of the tree editor is similar to the unit testing: a series of user operations is executed and assertions are checked in between. The second phase, however, is a little different. What the data quality views do is, in principle, visualise data. So the output is, for example, a chart, and whether this chart is correct can only reasonably be verified by a person. For this reason, the second phase does not contain any assert statements and, therefore, cannot check if the output is correct. Yet what can be checked is that the application does not crash, whatever the user clicks. In order to verify this, the test loops through all data quality analysis views and tries out every possible combination of view options. After a successful run, a setting on the start screen is changed. Remember that it is possible to specify the tree definition to be used, the treatment of not applicable requirements and the filter/grouping criteria on the start screen. This results in
#tree defs × #n.a.r. treatments × #filter modes × #views × 2^(avg. #view options) (6.1)
![Page 70: In Copyright - Non-Commercial Use Permitted Rights ... · 2 1.3. OURCONTRIBUTION 1.3 OurContribution In this master thesis, we designed and implemented a Data Quality Analysis Toolkit,](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4cf6f0a7130c672449efdf/html5/thumbnails/70.jpg)
CHAPTER 6. TESTING 61
test cases. As long as the user has not created any own tree definitions, there are six of them by default. The not applicable requirements treatment has three possible values, and while there are many filters, for practicality we only consider three different modes in the testing: no filtering/grouping, filter criterion specified, grouping criterion specified. Inserting the numbers gives
6 × 3 × 3 × 11 × 2^(≈5) = 19409 (6.2)
test cases. After each test case, there is a sleep of 1 second, so the total sleep time is 323 minutes for the whole test. Including the time the Data Quality Analysis Toolkit needs to serve the requests, the execution time of the entire test was 352 minutes when the database, the application server and the client were all running on my student lab computer.
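The exhaustive sweep over the start-screen settings can be sketched as nested loops; the class and parameter names here are illustrative (the real GUITestUtil drives Swing components rather than counting). With exactly five boolean options per view, the formula yields 19008; the 19409 of equation 6.2 reflects that the option count varies per view and is only approximately five on average.

```java
class ViewOptionSweep {
    // Counts the test cases produced by looping over every combination of
    // start-screen settings, trying all subsets of boolean view options each time.
    public static long countCases(int treeDefs, int narTreatments,
                                  int filterModes, int views, int avgViewOptions) {
        long cases = 0;
        for (int t = 0; t < treeDefs; t++)
            for (int n = 0; n < narTreatments; n++)
                for (int f = 0; f < filterModes; f++)
                    for (int v = 0; v < views; v++)
                        cases += 1L << avgViewOptions; // 2^avgViewOptions subsets
        return cases;
    }

    public static void main(String[] args) {
        // 6 tree definitions, 3 n.a.r. treatments, 3 filter modes, 11 views, ~5 options
        System.out.println(countCases(6, 3, 3, 11, 5)); // prints 19008
    }
}
```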
Figure 6.2: Automated GUI Testing (Screenshot)
This test not only ensured that all combinations of view options are properly handled in the source code (at least so that they do not cause an exception), but it would also have revealed any memory leaks, which only become noticeable after a while.
![Page 71: In Copyright - Non-Commercial Use Permitted Rights ... · 2 1.3. OURCONTRIBUTION 1.3 OurContribution In this master thesis, we designed and implemented a Data Quality Analysis Toolkit,](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4cf6f0a7130c672449efdf/html5/thumbnails/71.jpg)
Figure 6.3 shows the memory and CPU2 usage during a test execution. Even though our code did not have any memory problems, the test made us aware of a problem with our thread pool, which caused new threads to be acquired continually until the system ran out of native threads.
Figure 6.3: Automated GUI Testing, Application Monitoring
The left diagram shows the memory usage during the GUI test, and on the right-hand side the CPU utilisation is plotted. It is clearly visible how the resource usage depends on the amount of data. In the introduction, we mentioned that every food has an average of 30 components. Therefore, it could be expected that the trees Single Food Component Data Quality and Aggregated Food Component Data Quality have the highest demands on the memory and the processing resources. This is indeed the case, as the light green area (Single Component) and orange area (Aggregated Component) show.
6.2.2 Manual Testing
After the automated testing, we could be almost sure that the Data Quality Analysis Toolkit does not crash no matter what the user clicks. What was left to be tested was the interaction with the mouse. This includes, for example, that the tool tips are displayed correctly, the breadcrumb navigation, the highlighting of the chart entities on mouse-over and the downwards navigation in the tree when a chart entity is clicked on. Furthermore, the context menus and the drag and drop functionality in the tree editor had to be tested. And finally, the most vital thing which was missing in the automated testing: correctness. So we had to go over all the charts and check if what was displayed was reasonable and visually appealing. We tested each view option at least once and gave ourselves various data quality analysis tasks to verify the functionality offered by the toolkit and its usability in practice.
2 Central Processing Unit
7 Extensions
Apart from the Data Quality Analysis Toolkit, we implemented a few extensions to FoodCASE in this master thesis. These extensions are introduced in the following sections.
7.1 Data Quality Prevention
Data Quality Prevention is not about preventing data quality, as one might think at first. Following Presser [presser11], Data Quality Prevention refers to the part of the data quality framework which tries to avoid that data of poor quality is entered into the system in the first place. Figure 7.1 shows how this feature is implemented in the FoodCASE application: on every screen where data can be edited, there is a data quality evaluation panel at the bottom. This panel lists all problems the current data record has. Problems are divided into two categories: errors and warnings. In case of an error, the data record cannot be saved.

In order to familiarise ourselves with the data quality framework, we added some new validation rules to it. When talking about this mechanism, we noticed that warnings and errors are not always displayed early enough. When the detail window is open, it is in certain situations already too late to tell the user that there are problems. A concrete example from the FoodCASE application is the aggregation of foods. To create a new aggregated food, some Single Food Components have to be selected and placed in the clipboard. After that, it is possible to click on the “Create Aggregated Food” button. In this case, a number of plausibility checks should be done before the detail window of the new Aggregated Food is opened. For example, if the user only selects 24 out of 25 components of a food, chances are that this was not the user’s intention. So a warning should be displayed immediately when the “Create Aggregated Food” button is clicked. In order to allow such checks to be done beforehand, we created a simple framework.

Figure 7.2 shows what such a warning looks like. While some users may like this feature and find it helpful, others may be annoyed by it. For this reason, the individual warnings can be turned on and off on a per-user basis, as depicted in figure 7.3.
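A minimal version of such a pre-open check framework might look like the sketch below. The interface and class names (`PreOpenRule`, `PreOpenChecker`, `IncompleteSelectionRule`) are our illustration, not the actual FoodCASE API; the rule shown mirrors the 24-of-25-components example from the text.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A rule that is evaluated before the detail window opens (illustrative sketch).
interface PreOpenRule {
    String id();
    // Returns a warning message, or null if the check passes.
    String check(List<String> selectedComponents, int totalComponents);
}

// Warns when only a subset of a food's components is selected.
class IncompleteSelectionRule implements PreOpenRule {
    public String id() { return "incomplete-selection"; }
    public String check(List<String> selected, int total) {
        if (selected.size() < total) {
            return "Only " + selected.size() + " of " + total
                 + " components selected - is this intended?";
        }
        return null;
    }
}

class PreOpenChecker {
    private final List<PreOpenRule> rules = new ArrayList<>();
    private final Set<String> disabledForUser = new HashSet<>(); // per-user opt-out

    void register(PreOpenRule r) { rules.add(r); }
    void disable(String ruleId) { disabledForUser.add(ruleId); }

    // Collects warnings from all rules the current user has not disabled.
    List<String> warningsFor(List<String> selected, int total) {
        List<String> warnings = new ArrayList<>();
        for (PreOpenRule r : rules) {
            if (disabledForUser.contains(r.id())) continue;
            String w = r.check(selected, total);
            if (w != null) warnings.add(w);
        }
        return warnings;
    }
}
```

Selecting 24 of 25 components would yield one warning; a user who disables the rule gets none, which captures the per-user configurability of figure 7.3.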
Figure 7.1: FoodCASE: Single Food Detail Window
1. The field which has an error is highlighted in red.
2. The data quality evaluation panel lists all errors in red colour. Warnings are coloured in orange.
3. A shortcut button is provided to quickly open the data record in the Data Quality Analysis Toolkit.
Figure 7.2: Data Quality Prevention: Warning
Figure 7.3: Data Quality Prevention: Configure Warnings
So far, five validation rules have been implemented using our enhancement of the “Data Quality Prevention” framework. All except the one in the middle are enabled. Unfortunately, the year of generation is not known for most of the data records in the FoodCASE system. Therefore, with the current state of the data, the warning message would grow huge if this rule was enabled as well.
7.2 Confidence Code for Aggregated Foods in FoodCASE
In section 2.4.2, we explained that the US Department of Agriculture (USDA) calculates a Confidence Code (CC) which is assigned to aggregated food component items. Furthermore, section 2.4.3 described how the EuroFIR Quality Index (QI) for single food components is defined. Using these QIs, it would be possible to derive a CC in the way the USDA does it. However, EuroFIR did not define any rules for how such a CC should be calculated, and from [holden01] it is not clear how the USDA does it in detail. What we know is that they take the sum of the weighted averages of the ratings for each of the categories (5 in their case, 7 in ours) and assign a letter A, B, C or D to the aggregated food component.

Based on this idea, we implemented a similar method in FoodCASE as a proposal. As already mentioned earlier, rounding has the disadvantage of losing information. On the other hand, it would be nice if our CC were comparable with the USDA CC. As a compromise, we use the same letters, but additionally append a suffix (“++”, “+”, empty, “-” or “--”) to achieve a finer granularity of ratings. The following table shows how we defined the translation in detail:
From1 | 7.0  | 8.4  | 9.8  | 11.2 | 12.6 | 14.0 | 15.4 | 16.8 | 18.2 | 19.6
To2   | 8.4  | 9.8  | 11.2 | 12.6 | 14.0 | 15.4 | 16.8 | 18.2 | 19.6 | 21.0
CC    | D--  | D-   | D    | D+   | D++  | C--  | C-   | C    | C+   | C++

From1 | 21.0 | 22.4 | 23.8 | 25.2 | 26.6 | 28.0 | 29.4 | 30.8 | 32.2 | 33.6
To2   | 22.4 | 23.8 | 25.2 | 26.6 | 28.0 | 29.4 | 30.8 | 32.2 | 33.6 | 35.0
CC    | B--  | B-   | B    | B+   | B++  | A--  | A-   | A    | A+   | A++

1 “From” values are inclusive
2 “To” values are exclusive
Table 7.1: Mapping from Averaged EuroFIR Quality Index to Confidence Code
Remark 1: From a mathematical point of view, it does not make any difference whether the weighted average for every category is calculated first and then the sum is taken, or whether the total EuroFIR QI for every contributing value is calculated first and then the weighted average is taken. In the implementation, we chose the second option so that we could reuse the existing code for the QI calculation.

Remark 2: If the averaged QI lies exactly on the border between two confidence codes, the better one is taken.
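The mapping of table 7.1 can be expressed compactly: the QI range [7.0, 35.0] splits into 20 equal buckets of width 1.4, one per code. A sketch of the translation (class and method names are illustrative, not the FoodCASE implementation):

```java
class ConfidenceCode {
    // Order follows Table 7.1: index 0 = [7.0, 8.4), ..., index 19 = [33.6, 35.0]
    private static final String[] CODES = {
        "D--", "D-", "D", "D+", "D++", "C--", "C-", "C", "C+", "C++",
        "B--", "B-", "B", "B+", "B++", "A--", "A-", "A", "A+", "A++"
    };

    // Maps an averaged EuroFIR QI (7.0 .. 35.0) to a confidence code.
    public static String fromAveragedQi(double qi) {
        if (qi < 7.0 || qi > 35.0) {
            throw new IllegalArgumentException("QI out of range: " + qi);
        }
        int idx = (int) Math.floor((qi - 7.0) / 1.4); // buckets of width 1.4
        if (idx > 19) idx = 19;                       // qi == 35.0 maps to A++
        return CODES[idx];
    }

    public static void main(String[] args) {
        System.out.println(fromAveragedQi(7.0));  // prints D--
        System.out.println(fromAveragedQi(21.0)); // prints B-- (border: better code)
        System.out.println(fromAveragedQi(35.0)); // prints A++
    }
}
```

Note that flooring with inclusive "From" bounds implements Remark 2 directly: a QI exactly on a border falls into the upper, i.e. better, bucket.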
7.2.1 Recipes
In section 2.2.2, we mentioned that a recipe is a special kind of aggregated food for which the exact amount of its ingredients and the preparation method are known. The ingredients are aggregated foods as well. Therefore, creating recipes can be an iterative process where, for example, the recipe for a pizza dough can be entered first, and after that a number of pizzas with different toppings can be created. When a recipe is calculated, we propose that the confidence code for every component is calculated as the weighted average of the CCs of the ingredients according to their contribution to the physical weight of the resulting food.
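One way to read this proposal is to average over the 20-step code scale of table 7.1, weighted by ingredient mass. The sketch below is our illustrative interpretation (names and the rounding choice are assumptions, not the thesis implementation):

```java
class RecipeConfidence {
    private static final String[] CODES = {
        "D--", "D-", "D", "D+", "D++", "C--", "C-", "C", "C+", "C++",
        "B--", "B-", "B", "B+", "B++", "A--", "A-", "A", "A+", "A++"
    };

    private static int indexOf(String code) {
        for (int i = 0; i < CODES.length; i++)
            if (CODES[i].equals(code)) return i;
        throw new IllegalArgumentException("unknown code: " + code);
    }

    // Weighted average of ingredient codes; weights are each ingredient's
    // contribution to the physical weight of the resulting food.
    public static String weightedCode(String[] codes, double[] weightsInGrams) {
        double total = 0, acc = 0;
        for (int i = 0; i < codes.length; i++) {
            acc += indexOf(codes[i]) * weightsInGrams[i];
            total += weightsInGrams[i];
        }
        int idx = (int) Math.round(acc / total); // round to the nearest code step
        return CODES[idx];
    }

    public static void main(String[] args) {
        // 300 g of dough rated B, 120 g of topping rated C+
        System.out.println(weightedCode(new String[]{"B", "C+"},
                                        new double[]{300, 120})); // prints B-
    }
}
```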
8 Conclusion
8.1 Summary of Work
8.1.1 Data Quality Analysis Toolkit
In this master thesis, we designed and implemented a Data Quality Analysis Toolkit, which allows data quality to be measured and visualised. First, we gave an overview of the history of food composition tables. After that, we focused on data quality in general and on data quality evaluation in the context of food composition databases specifically. We described how the US Department of Agriculture (USDA) assigns a Quality Index to single foods and how a Confidence Code for aggregated foods is derived. EuroFIR came up with a quality index which is very similar to the one of the USDA.

Next, we introduced the Data Quality Analysis Toolkit, which we built as part of this thesis, by means of three concrete scenarios. The toolkit is not limited to the quality measures as they were defined by the USDA and EuroFIR, but allows any quality requirements to be defined. We explained how such quality requirements can be expressed in SQL, and how the requirements can be grouped together recursively in order to create a data quality analysis tree. In this tree, it is possible to use different aggregation types such as the weighted mean or taking the maximum. Furthermore, every user can have their own tree definition if it is controversial what the data quality tree definition should look like.

Using the data quality trees, it is possible to analyse the data quality in the FoodCASE system. By using diverse filter and grouping criteria, a single data record, the entire data in the system, or a subset of it can be considered. Furthermore, it is possible to define whether the current data quality should be analysed or whether a historical snapshot should be used. Additionally, the quality requirements which are not applicable to a certain data record can be treated in different ways. The Data Quality Analysis Toolkit provides seven different Data Quality Views which allow the user to assess the quality of the data in the system.
The chart types range from simple ones like the bar chart to more advanced ones like the box plot, which visualises the statistical properties of every data series. Additionally, the percentiles and standard deviation can
be accessed from every view by using the tool tips. A variety of view options make it possible to customise the views according to the preferences of the user. All these options are remembered by the system the next time the user logs in.

Once it is recognised that there are problems in a certain area of the data, the Data Problem Views allow the number of problems to be quantified exactly, and it is possible to drill down on them. Arriving at the leaf level of the data quality tree, the individual problems are listed, which can then be fixed. By means of a Threshold Slider, the user can define what percentage of data quality is required for a data record to be of sufficient data quality. In other words, if the quality of a data record is below the threshold, it is regarded as being deficient. Likewise, it is also possible to only consider data records of particularly good data quality.

The User-defined Scale allows the user to define a scale other than percentage, if this is desired. For example, this makes sense for the EuroFIR quality index, where the scale ranges from 7 to 35 by the definition of the index. Similarly, an institute could use its own scale for historical reasons. It is possible to switch between the percentage scale and the user-defined scale at any time.

Navigating towards the leaves of the data quality tree is supported by simply clicking on a chart entity. The breadcrumb navigation, which is located at the top of every screen, shows the path from the root node to the selected node and allows upwards navigation in the tree.
8.1.2 Administration Module
In chapter 4, we introduced the part of the admin module which allows system administrators to configure the Data Quality Analysis Toolkit. We presented the data model of the FoodCASE system, in which we coloured all the entities according to where the data in them originate from. This enabled us to identify six Data Quality Entities, for which we defined 156 Quality Requirements in total (compare with appendix B). After that, we explained how filter and grouping criteria for the data quality entities can be defined. Furthermore, we showed that a Quality Assessment can be triggered manually by the system administrator or automated using a timer. Lastly, the admin tool also allows a number of maintenance jobs to be run.
8.1.3 Extensions
Last but not least, we presented two extensions to FoodCASE which are only marginally related to the preceding chapters. The first one is an enhancement of the “Data Quality Prevention” framework. The existing framework provided a data quality panel in which all problems are listed. The new feature allows a number of validation rules to be checked even before the edit window opens. Because some users may like these checks to be done beforehand, while others may be annoyed by certain checks, it is possible to turn every validation rule on and off individually on a per-user basis.

The second extension is a proposal for a confidence code for aggregated foods in FoodCASE, similar to the CC developed by the USDA. As we considered the rounding to the discrete values A, B, C or D potentially harmful, we introduced additional ratings such as B++ in order to achieve a finer granularity. With this compromise, we do not lose too much precision while maintaining easy comparability to the USDA scoring system.
8.2 Future Work
Although the Data Quality Analysis Toolkit is already a powerful tool, there are still a few points which could be optimised. In this section, we list a number of possible optimisations and extensions.
Client-Side Caching In section 5.5.2, we introduced the file system cache, which is implemented on the application server. This cache already improved the performance a lot and took load away from the database. Everything works fine if a fast connection to the server is available, for example if the quality analysis is done from within the ETH network. However, if the user is working remotely and only has a slow internet connection at his disposal, it could take a while until all the data are fetched. Therefore, it would make sense to use the same caching logic, which already exists on the server side, on the client side as well. This means that there would be an L1 cache locally, an L2 cache on the application server, and the original data would remain in the database. This way, the quality assessment data would only be copied once to the local machine, resulting in an improved access time the next time the data quality of the same quality assessment (snapshot) is analysed.
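The proposed L1/L2 lookup order can be sketched as a look-aside cache. This is a minimal illustration under our own naming (`TwoLevelCache`, in-memory maps standing in for the client disk cache and the server file system cache), not the existing FoodCASE code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Two-level look-aside cache: L1 (client), L2 (application server),
// falling back to the database only on a miss in both levels.
class TwoLevelCache<K, V> {
    private final Map<K, V> l1 = new HashMap<>();
    private final Map<K, V> l2 = new HashMap<>();
    private final Function<K, V> database;
    int databaseHits = 0; // for observing how often the database is consulted

    TwoLevelCache(Function<K, V> database) { this.database = database; }

    V get(K key) {
        V v = l1.get(key);
        if (v != null) return v;             // fastest: local copy
        v = l2.get(key);
        if (v == null) {                     // miss in both: ask the database
            databaseHits++;
            v = database.apply(key);
            l2.put(key, v);                  // populate the server-side cache
        }
        l1.put(key, v);                      // populate the local cache
        return v;
    }
}
```

Re-analysing the same quality assessment then hits L1 and causes no server round trip at all, which is exactly the benefit argued for above.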
Access Rights The FoodCASE application implements a Role-Based Access Control (RBAC) model, albeit a primitive one. There are three statically defined roles: read-only, compiler and admin. Only those users who have the admin role can log in to the admin tool, but in FoodCASE itself, all users have the same permissions and everybody can access everything. Because of this, we considered it needless to implement a sophisticated access control model in the Data Quality Analysis Toolkit. Since the toolkit is kept very generic, it would easily be possible to separate it from FoodCASE and integrate it into any other system built on top of an RDBMS. In this context, it would be desirable to have, for example, the possibility to declare a data quality tree to be private. This could be realised by assigning an owner to every tree definition. Additionally, there would be different sharing modes: private, shared for reading and shared for writing. If more complex rules are required, another possibility would be to implement a dedicated RBAC model in the Data Quality Analysis Toolkit. A third option would be to make use of the access control model of the host system.
Packaging for Distribution The Data Quality Analysis Toolkit is fully integrated into the FoodCASE system. Nevertheless, it would easily be possible to use it in any other system built on top of an RDBMS, as already mentioned multiple times. In order to support this, it is imaginable to ship the toolkit as a JAR1 file. Additionally, we could provide an SQL script which creates all the tables needed by the toolkit (see appendix C). The implementer of the host system would only have to place a menu link or a button somewhere in his application to start the Data Quality Analysis Toolkit. And then, of course, he would have to configure the toolkit as described in chapter 4. Most importantly, he would have to define the quality requirements for his system.
Support for OODBMS2 Back-End In the Data Quality Analysis Toolkit, all quality requirements have to be expressed as SQL statements. However, it is conceivable to use the
1 Java Archive
2 Object-Oriented Database Management System
toolkit in the same way in combination with an object-oriented database back-end. In order to do so, a requirement is that the OODBMS supports a plain text query language such as OQL3. Otherwise, it is not possible to express a quality requirement as a string and store it as such in the database. The persistence logic of the Data Quality Analysis Toolkit itself is mainly implemented in JPQL, which makes it independent of the vendor of the RDBMS. However, if an object-oriented back-end is to be used, some effort is required to rewrite the persistence logic. Since the EJB back-end is relatively slim, this would be feasible. Nevertheless, because the back-end has to be touched anyway, it would be worthwhile to rethink how a quality requirement can best be expressed in the context of object-oriented databases. Optimally, an approach should be found which could be used for any OODBMS.
Query Web Services It was a pragmatic decision to express all quality requirements as SQL statements. But it is a restriction as well. One possible extension would be to allow a web service to be queried, for example to verify that an ISBN really belongs to a certain book, or to check if a zip code matches a given city. In addition to web services, other channels to access information could be integrated as well. This could be a local file, a remote file on a network location, an FTP4 server, an LDAP5 server or similar.
Data Quality Alerts On the Data Problem Views, there is a Threshold Slider to define what percentage of data quality is required for a data record to be of sufficient quality. Similarly, a monitoring tool could be implemented which raises an alarm if a certain data quality rule is violated, for example if the average data quality drops below a predefined threshold, or if new data records of poor quality are entered into the system. On top of this monitoring system, an escalation procedure could be defined. For example, if a problem remains unresolved for a week, the department head gets informed, after two weeks the section head, and if everything breaks, it escalates to the executive board.
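The core of such a monitor is simple; a minimal sketch of the average-quality rule (class name, threshold representation and alert format are assumptions, since the tool does not exist yet):

```java
import java.util.ArrayList;
import java.util.List;

// Raises an alert when the average data quality of an assessment
// drops below a predefined threshold (sketch of the proposed monitor).
class QualityMonitor {
    private final double threshold;               // e.g. 0.8 = 80 %
    private final List<String> alerts = new ArrayList<>();

    QualityMonitor(double threshold) { this.threshold = threshold; }

    void onAssessment(double[] recordQualities) {
        double sum = 0;
        for (double q : recordQualities) sum += q;
        double avg = sum / recordQualities.length;
        if (avg < threshold) {
            alerts.add(String.format("average quality %.2f below threshold %.2f",
                                     avg, threshold));
        }
    }

    List<String> pendingAlerts() { return alerts; }
}
```

The proposed escalation procedure would then sit on top of `pendingAlerts()`, notifying the department head, the section head and finally the executive board as alerts age.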
Additional Quality Requirements In section 4.2, we mentioned that, as of yet, only quality requirements for the data which are actively maintained within FoodCASE have been defined. In a second step, this could be extended to check imported third-party data as well. Furthermore, it might be possible for a food scientist, who has in-depth knowledge of the business domain, to define additional quality requirements for the existing six quality entities.
Using the Toolkit Last but not least, and this is maybe the most important point of all, the Data Quality Analysis Toolkit has to be used to identify data problems in the FoodCASE database, and these problems have to be addressed in order to increase the quality of the data. One major problem we can already name at this point is that for large parts of the data, their age is unknown.
Additional Quality Prevention Rules Regarding the enhancement we created for the “Data Quality Prevention” framework, additional validation rules could be implemented. For example, when clicking “Create Aggregated Food”, it should be checked whether all public components (those which appear in the online version) are included.
3 Object Query Language
4 File Transfer Protocol
5 Lightweight Directory Access Protocol
A Terms and Abbreviations
API Application Programming Interface
CC Confidence Code
CD-ROM Compact Disc Read-Only Memory
CMS Content Management System
COST Cooperation in Science and Technology
CPU Central Processing Unit
CSV Comma Separated Values
DB Database
ETH Swiss Federal Institute of Technology (German: Eidgenössische Technische Hochschule)
EJB Enterprise Java Bean
ERM Entity Relationship Model
EuroFIR European Food Information Resource
FCDB Food Composition Database
FCN Federal Commission for Nutrition
FDTP Food Data Transport Package
FoodCASE Food Composition And System Environment
FOPH Federal Office of Public Health
FTP File Transfer Protocol
GUI Graphical User Interface
IEEE Institute of Electrical and Electronics Engineers
I/O Input/Output
ISBN International Standard Book Number
ISSN International Standard Serial Number
JAR Java Archive
JDK Java Development Kit
JPA Java Persistence API
JPEG Joint Photographic Experts Group
JPQL Java Persistence Query Language
JSR Java Specification Request
LDAP Lightweight Directory Access Protocol
LOC Lines of Code
MBA Master of Business Administration
MVC Model-View-Controller
NANUSS NAtional NUtrition Survey Switzerland
OODBMS Object-Oriented Database Management System
OQL Object Query Language
QI Quality Index
RAM Random Access Memory
ROM Read-Only Memory
RBAC Role Based Access Control
RDBMS Relational Database Management System
RODQ Requirement-Oriented Data Quality
SQL Structured Query Language
SVG Scalable Vector Graphics
SwissFIR Swiss Food Information Resource
UML Unified Modelling Language
USDA US Department of Agriculture
XML Extensible Markup Language
Table A.1: Terms and Abbreviations
B FoodCASE Quality Requirements
On the following pages, all the quality requirements implemented in the FoodCASE system are listed. In order to keep similar requirements together in the list, the following prefixes are used:
EC: EuroFIR Consistency: consistency between the data in FoodCASE and the answers to the EuroFIR quality questions
EuroFIR: EuroFIR Quality Question
MF: Mandatory Field
SP: Statistical Problem
SY: Synonym
VN: Valid Number
Table B.1: Quality Requirement Name Prefixes
# | Quality Entity | Requirement Name | Description | Type
1 | Single Food | Every food must have at least 4 component values | | Soft Constraint
2 | Single Food | Fill factor | Fill factor of single food | Indicator
3 | Single Food | For every food, carbohydrate (CHOT or CHO) must be provided | Carbohydrate is a mandatory component. CHOT or CHO has to be provided. | Soft Constraint
4 | Single Food | For every food, energy must be provided | Energy is a mandatory component | Soft Constraint
5 | Single Food | For every food, fat must be provided | Fat is a mandatory component | Soft Constraint
6 | Single Food | For every food, protein must be provided | Protein is a mandatory component | Soft Constraint
7 | Single Food | For home-made food, recipe description must be provided | | Soft Constraint
8 | Single Food | If FAT = 0 or logical zero, then all other fatty acids can exist but not > 0 | | Hard Constraint
9 | Single Food | Minimum length of single food name >= 2 | Food name must have at least two letters | Hard Constraint
10 | Single Food | MF: English food name | English food name is mandatory | Hard Constraint
11 | Single Food | MF: EuroFIR classification | EuroFIR classification is mandatory | Soft Constraint
12 | Single Food | MF: Restaurant or home-made | The restaurant or home-made flag is mandatory | Soft Constraint
13 | Single Food | MF: Retention factor classification | Retention factor classification is mandatory | Soft Constraint
14 | Single Food | Scientific name or brand name must be set | If no brand name is set, then scientific name must be provided and vice versa | Soft Constraint
15 | Single Food | SY: Combination of language=en and type=translation should not occur | | Hard Constraint
16 | Single Food | SY: Combination of synonym term, language and type must be unique | | Hard Constraint
17 | Single Food | SY: Mandatory field synonym term | If a synonym entry exists, the field synonym term is mandatory | Hard Constraint
18 | Single Food | SY: Mandatory field synonym type | If a synonym entry exists, the field synonym type is mandatory | Hard Constraint
19 | Single Component | Acquisition type known | If acquisition type = not known, then data have less quality | Soft Constraint
20 | Single Component | At least one method should exist | Is there at least one method for a single value? | Soft Constraint
21 | Single Component | At least one reference should exist | Is there at least one reference for a single value? | Soft Constraint
22 | Single Component | At least one sample should exist | Is there at least one sample for a single value? | Soft Constraint
23 | Single Component | Data older than 5 years (evaluation date) | If evaluation date > 5 years, then compiler should look for new data | Soft Constraint
24 | Single Component | EC: Brand name | If the brand name is set, answer to the corresponding EuroFIR question should be YES else NO | Soft Constraint
25 | Single Component | EC: Commercial name | If commercial name is set, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
26 | Single Component | EC: Generic name | If generic name is set, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
27 | Single Component | EC: Lab accredited | If the lab was accredited, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
28 | Single Component | EC: Number of analytical samples | Number of analytical samples should match answer to corresponding EuroFIR question | Soft Constraint
29 | Single Component | EC: Portion replicates | If portion replicates > 1, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
30 | Single Component | EC: Reference material | If a standard reference material was used, answer to corresponding EuroFIR question should be YES else NO | Soft Constraint
31 | Single Component | EC: Samples homog
eniz
edIf
sam
ples
wer
eho
mog
eniz
ed,
answ
erto
the
corr
espo
ndin
gE
uroF
IRqu
estio
nsh
ould
beY
ES
else
NO
Soft
Con
stra
int
32Si
ngle
Com
pone
ntE
C:S
ampl
esst
abili
zed
Ifsa
mpl
esw
ere
stab
ilize
d,an
swer
toco
r-re
spon
ding
Eur
oFIR
ques
tion
shou
ldbe
YE
Sel
seN
O
Soft
Con
stra
int
33Si
ngle
Com
pone
ntE
uroF
IR:A
naly
tical
Met
hod
Indi
cato
r34
Sing
leC
ompo
nent
Eur
oFIR
:Is
com
pone
ntun
ambi
guou
sIs
the
com
pone
ntde
scri
bed
unam
bi-
guou
sly?
Indi
cato
r
35Si
ngle
Com
pone
ntE
uroF
IR:I
sfo
odgr
oup
know
nIs
the
food
grou
p(e
.g.
beve
rage
,des
sert
,sa
vour
ysn
ack,
past
adi
sh)k
now
n?In
dica
tor
36Si
ngle
Com
pone
ntE
uroF
IR:I
sm
atri
xun
itun
equi
voca
lIs
the
mat
rix
unit
uneq
uivo
cal?
Indi
cato
r37
Sing
leC
ompo
nent
Eur
oFIR
:Is
the
heat
trea
tmen
tkno
wn
Isth
eex
tent
ofhe
attr
eatm
entk
now
n?In
dica
tor
38Si
ngle
Com
pone
ntE
uroF
IR:I
sun
itun
equi
voca
lIs
the
unit
uneq
uivo
cal?
Indi
cato
r39
Sing
leC
ompo
nent
Eur
oFIR
:Num
bero
fana
lytic
alsa
mpl
esN
umbe
rofa
naly
tical
sam
ples
Indi
cato
r40
Sing
leC
ompo
nent
Eur
oFIR
:Was
bran
dpr
ovid
edIf
rele
vant
,w
asth
ebr
and
prov
ided
(e.g
.Fe
rrer
o)?
Indi
cato
r
41Si
ngle
Com
pone
ntE
uroF
IR:W
asco
mm
erci
alna
me
prov
ided
Was
the
com
mer
cial
nam
epr
ovid
ed(e
.g.
Nut
ella
)?In
dica
tor
42Si
ngle
Com
pone
ntE
uroF
IR:
Was
cons
umer
s/di
etar
y/la
bel
clai
min
fopr
ovid
edW
asre
leva
ntin
form
atio
non
cons
umer
grou
p/di
etar
yus
e/la
bel
clai
min
fopr
ovi-
ded?
Indi
cato
r
43Si
ngle
Com
pone
ntE
uroF
IR:W
asge
neri
cna
me
prov
ided
Was
the
gene
ric
nam
epr
ovid
ed(e
.g.c
ho-
cola
tepa
ste
with
haze
lnut
s)?
Indi
cato
r
44Si
ngle
Com
pone
ntE
uroF
IR:
Was
geog
raph
ical
info
rmat
ion
prov
ided
Ifre
leva
nt,
was
info
rmat
ion
abou
tth
ege
ogra
phic
alor
igin
ofth
efo
odpr
ovid
ed?
Indi
cato
r
![Page 86: In Copyright - Non-Commercial Use Permitted Rights ... · 2 1.3. OURCONTRIBUTION 1.3 OurContribution In this master thesis, we designed and implemented a Data Quality Analysis Toolkit,](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4cf6f0a7130c672449efdf/html5/thumbnails/86.jpg)
APPENDIX B. FOODCASE QUALITY REQUIREMENTS 77
45Si
ngle
Com
pone
ntE
uroF
IR:
Was
info
rmat
ion
onpa
ckin
gm
ediu
mpr
ovid
edIf
rele
vant
,w
asin
form
atio
non
pack
ing
med
ium
prov
ided
?In
dica
tor
46Si
ngle
Com
pone
ntE
uroF
IR:
Was
info
rmat
ion
onpr
eser
va-
tion
met
hod
prov
ided
Was
info
rmat
ion
onpr
eser
vatio
nm
etho
dpr
ovid
ed?
Indi
cato
r
47Si
ngle
Com
pone
ntE
uroF
IR:
Was
info
rmat
ion
ontr
eatm
ent
prov
ided
Was
rele
vant
info
rmat
ion
ontr
eatm
enta
p-pl
ied
prov
ided
?In
dica
tor
48Si
ngle
Com
pone
ntE
uroF
IR:W
asla
bora
tory
accr
edite
dW
asth
ela
bora
tory
accr
edite
dfo
rthi
sm
e-th
odor
was
the
met
hod
valid
ated
bype
r-fo
rman
cete
stin
g?
Indi
cato
r
49Si
ngle
Com
pone
ntE
uroF
IR:W
asm
ore
than
one
bran
d/cu
lti-
var/
subs
peci
essa
mpl
edIf
rele
vant
,w
asm
ore
than
one
bran
d(f
orm
anuf
actu
red
pre-
pack
edpr
oduc
t)or
mor
eth
anon
ecu
ltiva
r(fo
rpla
ntfo
ods)
orsu
bspe
cies
(for
anim
alfo
ods)
sam
pled
?
Indi
cato
r
50Si
ngle
Com
pone
ntE
uroF
IR:
Was
num
ber
ofpr
imar
ysa
mpl
es>
9W
asth
enu
mbe
rofp
rim
ary
sam
ples
>9?
Indi
cato
r
51Si
ngle
Com
pone
ntE
uroF
IR:
Was
prod
uctio
nm
onth
/sea
son
indi
cate
dIf
rele
vant
,w
asth
em
onth
orse
ason
ofpr
oduc
tion
indi
cate
d?In
dica
tor
52Si
ngle
Com
pone
ntE
uroF
IR:
Was
reci
pena
me
and
desc
rip-
tion
prov
ided
Was
the
com
plet
ena
me
and
desc
ript
ion
ofth
ere
cipe
prov
ided
?In
dica
tor
53Si
ngle
Com
pone
ntE
uroF
IR:W
asre
fere
nce
mat
eria
luse
dIf
rele
vant
,w
asan
appr
opri
ate
refe
renc
em
ater
ial
ora
stan
dard
refe
renc
em
ater
ial
used
?
Indi
cato
r
54Si
ngle
Com
pone
ntE
uroF
IR:W
assa
mpl
eho
mog
eniz
edW
asth
esa
mpl
eho
mog
eniz
ed?
Indi
cato
r55
Sing
leC
ompo
nent
Eur
oFIR
:W
assa
mpl
em
oist
ure
cont
ent
give
nW
asth
em
oist
ure
cont
ent
ofth
esa
mpl
em
easu
red
and
the
resu
ltgi
ven?
Indi
cato
r
56Si
ngle
Com
pone
ntE
uroF
IR:W
assa
mpl
ing
plan
deve
lope
dW
asth
esa
mpl
ing
plan
deve
lope
dto
re-
pres
ent
the
cons
umpt
ion
inth
eco
untr
yw
here
the
stud
yw
asco
nduc
ted?
Indi
cato
r
![Page 87: In Copyright - Non-Commercial Use Permitted Rights ... · 2 1.3. OURCONTRIBUTION 1.3 OurContribution In this master thesis, we designed and implemented a Data Quality Analysis Toolkit,](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4cf6f0a7130c672449efdf/html5/thumbnails/87.jpg)
78
57Si
ngle
Com
pone
ntE
uroF
IR:W
asst
abili
zatio
nap
plie
dIf
rele
vant
,wer
eap
prop
riat
est
abili
zatio
ntr
eatm
ents
appl
ied
(e.g
.pr
otec
tion
from
heat
/air
/ligh
t/mic
robi
alac
tivity
)?
Indi
cato
r
58Si
ngle
Com
pone
ntE
uroF
IR:
Was
the
anal
ysed
port
ion
des-
crib
edIf
rele
vant
,was
the
anal
ysed
port
ion
des-
crib
edan
dis
itcl
ear
ifth
efo
odw
asan
a-ly
sed
with
orw
ithou
tthe
ined
ible
part
?
Indi
cato
r
59Si
ngle
Com
pone
ntE
uroF
IR:W
asth
efo
odso
urce
prov
ided
Was
the
food
sour
ceof
the
food
orof
the
mai
nin
gred
ient
prov
ided
(bes
tif
scie
n-tifi
cna
me
incl
uded
,cu
ltiva
r/va
riet
y,ge
-nu
s/sp
ecie
s,et
c.)?
Indi
cato
r
60Si
ngle
Com
pone
ntE
uroF
IR:W
asth
epa
rtof
plan
t/ani
mal
in-
dica
ted
Was
the
part
ofpl
ant
orpa
rtof
anim
alcl
earl
yin
dica
ted?
Indi
cato
r
61Si
ngle
Com
pone
ntE
uroF
IR:
Wer
eco
okin
gm
etho
dde
tails
prov
ided
Ifth
efo
odw
asco
oked
,wer
esa
tisfa
ctor
yco
okin
gm
etho
dde
tails
prov
ided
?In
dica
tor
62Si
ngle
Com
pone
ntE
uroF
IR:W
ere
port
ion
repl
icat
este
sted
Wer
ean
alyt
ical
port
ion
repl
icat
este
sted
?In
dica
tor
63Si
ngle
Com
pone
ntE
uroF
IR:
Wer
esa
mpl
esta
ken
from
im-
port
anto
utle
tsIf
rele
vant
,w
ere
sam
ples
take
nfr
omth
em
ost
impo
rtan
tsa
les
outle
ts(s
uper
mar
-ke
t,lo
cal
groc
ery,
stre
etm
arke
t,re
stau
-ra
ntho
useh
old
etc.
)?
Indi
cato
r
64Si
ngle
Com
pone
ntE
uroF
IR:W
ere
sam
ples
take
nfr
omm
ore
than
one
loca
tion
Ifre
leva
nt,w
ere
sam
ples
take
nfr
omm
ore
than
one
geog
raph
ical
loca
tion?
Indi
cato
r
65Si
ngle
Com
pone
ntE
uroF
IR:W
ere
sam
ples
take
nfr
omm
ore
than
one
seas
onIf
rele
vant
,w
ere
sam
ples
take
ndu
ring
mor
eth
anon
ese
ason
ofth
eye
ar?
Indi
cato
r
66Si
ngle
Com
pone
ntG
iven
year
ofge
nera
tion
caus
esge
nera
-tio
nby
tobe
man
dato
rySo
ftC
onst
rain
t
67Si
ngle
Com
pone
ntH
oww
elld
oes
food
mat
chH
oww
elld
oes
food
mat
chth
efo
odin
the
data
base
?In
dica
tor
68Si
ngle
Com
pone
ntIs
food
natio
nalr
epre
sent
ativ
eH
owre
pres
enta
tive
isth
efo
odto
natio
nal
cons
umpt
ion?
Indi
cato
r
![Page 88: In Copyright - Non-Commercial Use Permitted Rights ... · 2 1.3. OURCONTRIBUTION 1.3 OurContribution In this master thesis, we designed and implemented a Data Quality Analysis Toolkit,](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4cf6f0a7130c672449efdf/html5/thumbnails/88.jpg)
APPENDIX B. FOODCASE QUALITY REQUIREMENTS 79
69Si
ngle
Com
pone
ntM
atri
xun
itca
nnot
beus
edfo
rag
greg
a-tio
nN
otfo
rag
greg
atio
nus
able
mat
rix
units
={p
er10
0gto
tal
fat,
per
100g
tota
lfa
ttyac
ids,
perg
tota
lfat
,per
gni
trog
en}
Soft
Con
stra
int
70Si
ngle
Com
pone
ntM
etho
din
dica
tork
now
nIf
met
hod
indi
cato
ris
not
know
n,th
enda
taha
vele
ssqu
ality
Soft
Con
stra
int
71Si
ngle
Com
pone
ntM
etho
dpa
ram
eter
man
dato
ryfo
rpr
otei
nca
lcul
ated
from
nitr
ogen
Ifco
mpo
nent
=pr
otei
nan
dm
etho
din
di-
cato
rin{M
I012
1,M
I012
2,M
I012
3}th
enN
CF
shou
ldbe
inse
rted
inM
etho
dPa
ra-
met
erfie
ld,r
ange
[4.6
0to
7.10
],an
dN
CF
sour
cein
the
Met
hod
Ref
eren
cefie
ld
Har
dC
onst
rain
t
72Si
ngle
Com
pone
ntM
etho
dpa
ram
eter
man
dato
ryfo
rto
tal
fatty
acid
sca
lcul
ated
from
tota
lfat
Ifco
mpo
nent
=to
tal
fatty
acid
and
me-
thod
indi
cato
rin
MI0
207
then
met
hod
para
met
erm
ust
bein
sert
edin
the
rang
e[0
.50,
0.99
]
Har
dC
onst
rain
t
73Si
ngle
Com
pone
ntM
etho
dty
pekn
own
Ifm
etho
dty
peis
not
know
n,th
enda
taha
vele
ssqu
ality
Soft
Con
stra
int
74Si
ngle
Com
pone
ntM
F:D
ate
ofco
mpi
latio
nan
dco
mpi
latio
nby
Har
dC
onst
rain
t
75Si
ngle
Com
pone
ntM
F:E
valu
atio
nda
tean
dev
alua
tion
bySo
ftC
onst
rain
t76
Sing
leC
ompo
nent
MF:
Met
hod
nam
eFo
rleg
acy
reas
onno
ton
data
base
and
in-
putm
ask
Har
dC
onst
rain
t
77Si
ngle
Com
pone
ntM
F:M
etho
dor
igin
alna
me
Forl
egac
yre
ason
noto
nda
taba
sean
din
-pu
tmas
kH
ard
Con
stra
int
78Si
ngle
Com
pone
ntM
F:Se
lect
edva
lue
exce
ptw
hen
valu
ety
pe=
trac
e,be
low
dete
ctio
nlim
it,un
de-
cida
ble
orun
know
n
Har
dC
onst
rain
t
79Si
ngle
Com
pone
ntM
F:Y
earo
fgen
erat
ion
Har
dC
onst
rain
t80
Sing
leC
ompo
nent
Old
sing
leva
lue
(yea
rofg
ener
atio
n>
10ye
ars)
Soft
Con
stra
int
![Page 89: In Copyright - Non-Commercial Use Permitted Rights ... · 2 1.3. OURCONTRIBUTION 1.3 OurContribution In this master thesis, we designed and implemented a Data Quality Analysis Toolkit,](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4cf6f0a7130c672449efdf/html5/thumbnails/89.jpg)
80
81Si
ngle
Com
pone
ntSe
lect
edva
lue
=0
ifva
lue
type
=lo
gica
lze
roH
ard
Con
stra
int
82Si
ngle
Com
pone
ntSe
lect
edva
lue
has
reas
onab
lepr
ecis
ion
Soft
Con
stra
int
83Si
ngle
Com
pone
ntSP
:Mea
n<
=M
axM
ean
mus
tbe<
=m
axim
umva
lue
Har
dC
onst
rain
t84
Sing
leC
ompo
nent
SP:M
ean>
=M
inM
ean
mus
tbe>
=m
inim
umva
lue
Har
dC
onst
rain
t85
Sing
leC
ompo
nent
SP:M
edia
n<
=M
axM
edia
nm
ustb
e<
=m
axim
umva
lue
Har
dC
onst
rain
t86
Sing
leC
ompo
nent
SP:M
edia
n>
=M
inM
edia
nm
ustb
e>
=m
inim
umva
lue
Har
dC
onst
rain
t87
Sing
leC
ompo
nent
SP:S
td.d
ev.<
Max
-Min
Stan
dard
devi
atio
nm
ustb
e<
max
-min
Har
dC
onst
rain
t88
Sing
leC
ompo
nent
SP:S
td.e
rror
<=
Std.
dev.
Stan
dard
erro
rm
ust
be<
=st
anda
rdde
-vi
atio
nH
ard
Con
stra
int
89Si
ngle
Com
pone
ntSP
:SV<
=M
axSe
lect
edva
lue
mus
tbe<
=m
axim
umva
-lu
eH
ard
Con
stra
int
90Si
ngle
Com
pone
ntSP
:SV>
=M
inSe
lect
edva
lue
mus
tbe>
=m
inim
umva
-lu
eH
ard
Con
stra
int
91Si
ngle
Com
pone
ntU
nit
Deg
ree
Bri
xdo
esno
tha
vem
atri
xun
itH
ard
Con
stra
int
92Si
ngle
Com
pone
ntV
alue
type
know
nIf
valu
ety
pe=
unkn
own,
then
the
data
have
less
qual
itySo
ftC
onst
rain
t
93Si
ngle
Com
pone
ntV
N:M
axim
umM
axim
umis
valid
num
ber
Har
dC
onst
rain
t94
Sing
leC
ompo
nent
VN
:Mea
nM
ean
isva
lidnu
mbe
rH
ard
Con
stra
int
95Si
ngle
Com
pone
ntV
N:M
edia
nM
edia
nis
valid
num
ber
Har
dC
onst
rain
t96
Sing
leC
ompo
nent
VN
:Min
imum
Min
imum
isva
lidnu
mbe
rH
ard
Con
stra
int
97Si
ngle
Com
pone
ntV
N:S
elec
ted
valu
eSe
lect
edva
lue
isva
lidnu
mbe
rH
ard
Con
stra
int
98Si
ngle
Com
pone
ntV
N:S
tand
ard
devi
atio
nSt
anda
rdde
viat
ion
isva
lidnu
mbe
rH
ard
Con
stra
int
99Si
ngle
Com
pone
ntV
N:S
tand
ard
erro
rSt
anda
rder
rori
sva
lidnu
mbe
rH
ard
Con
stra
int
100
Sing
leC
ompo
nent
Yea
rofa
naly
sis
mus
thav
e4
digi
tsH
ard
Con
stra
int
101
Agg
rega
ted
Food
Eve
ryfo
odm
ust
have
atle
ast
4co
m-
pone
ntva
lues
Soft
Con
stra
int
102
Agg
rega
ted
Food
Fill
fact
orFi
llfa
ctor
ofag
greg
ated
food
Indi
cato
r
![Page 90: In Copyright - Non-Commercial Use Permitted Rights ... · 2 1.3. OURCONTRIBUTION 1.3 OurContribution In this master thesis, we designed and implemented a Data Quality Analysis Toolkit,](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4cf6f0a7130c672449efdf/html5/thumbnails/90.jpg)
APPENDIX B. FOODCASE QUALITY REQUIREMENTS 81
103
Agg
rega
ted
Food
For
ever
yfo
od,
carb
ohyd
rate
(CH
OT
orC
HO
)mus
tbe
prov
ided
Car
bohy
drat
eis
am
anda
tory
com
pone
nt.
CH
OT
orC
HO
has
tobe
prov
ided
.So
ftC
onst
rain
t
104
Agg
rega
ted
Food
Fore
very
food
,ene
rgy
mus
tbe
prov
ided
Ene
rgy
isa
man
dato
ryco
mpo
nent
Soft
Con
stra
int
105
Agg
rega
ted
Food
Fore
very
food
,fat
mus
tbe
prov
ided
Fati
sa
man
dato
ryco
mpo
nent
Soft
Con
stra
int
106
Agg
rega
ted
Food
Fore
very
food
,pro
tein
mus
tbe
prov
ided
Prot
ein
isa
man
dato
ryco
mpo
nent
Soft
Con
stra
int
107
Agg
rega
ted
Food
For
hom
e-m
ade
food
,re
cipe
desc
ript
ion
mus
tbe
prov
ided
Soft
Con
stra
int
108
Agg
rega
ted
Food
IfFA
T=
0or
logi
calz
ero,
then
allo
ther
fatty
acid
sca
nex
ists
butn
ot>
0H
ard
Con
stra
int
109
Agg
rega
ted
Food
Min
imum
leng
thof
sing
lefo
odna
me>
=2
Food
nam
em
usth
ave
atle
astt
wo
lette
rsH
ard
Con
stra
int
110
Agg
rega
ted
Food
MF:
Eng
lish
food
nam
eE
nglis
hfo
odna
me
ism
anda
tory
Har
dC
onst
rain
t11
1A
ggre
gate
dFo
odM
F:E
uroF
IRcl
assi
ficat
ion
Eur
oFIR
clas
sific
atio
nis
man
dato
rySo
ftC
onst
rain
t11
2A
ggre
gate
dFo
odM
F:R
esta
uran
torh
ome-
mad
eT
here
stau
rant
orho
me-
mad
efla
gis
man
-da
tory
Soft
Con
stra
int
113
Agg
rega
ted
Food
MF:
Ret
entio
nfa
ctor
clas
sific
atio
nR
eten
tion
fact
orcl
assi
ficat
ion
ism
anda
-to
rySo
ftC
onst
rain
t
114
Agg
rega
ted
Food
Scie
ntifi
cna
me
orbr
and
nam
em
ustb
ese
tIf
nobr
and
nam
eis
set,
then
scie
ntifi
cna
me
mus
tbe
prov
ided
and
vice
vers
aSo
ftC
onst
rain
t
115
Agg
rega
ted
Food
SY:
Com
bina
tion
ofla
ngua
ge=e
nan
dty
pe=t
rans
latio
nsh
ould
noto
ccur
Har
dC
onst
rain
t
116
Agg
rega
ted
Food
SY:
Com
bina
tion
ofsy
nony
mte
rm,
lan-
guag
ean
dty
pem
ustb
eun
ique
Har
dC
onst
rain
t
117
Agg
rega
ted
Food
SY:M
anda
tory
field
syno
nym
term
Syno
nym
term
ism
anda
tory
Har
dC
onst
rain
t11
8A
ggre
gate
dFo
odSY
:Man
dato
ryfie
ldsy
nony
mty
peSy
nony
mty
peis
man
dato
ryH
ard
Con
stra
int
119
Agg
rega
ted
Com
pone
ntA
cqui
sitio
nty
pekn
own
Ifac
quis
ition
type
=no
tkno
wn
then
data
have
less
qual
itySo
ftC
onst
rain
t
120
Agg
rega
ted
Com
pone
ntA
tle
ast
one
cont
ribu
ting
valu
esh
ould
exis
tH
ard
Con
stra
int
![Page 91: In Copyright - Non-Commercial Use Permitted Rights ... · 2 1.3. OURCONTRIBUTION 1.3 OurContribution In this master thesis, we designed and implemented a Data Quality Analysis Toolkit,](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4cf6f0a7130c672449efdf/html5/thumbnails/91.jpg)
82
121
Agg
rega
ted
Com
pone
ntA
tlea
ston
ere
fere
nce
shou
ldex
ist
Isth
ere
atle
asto
nere
fere
nce
for
asi
ngle
valu
e?So
ftC
onst
rain
t
122
Agg
rega
ted
Com
pone
ntC
ontr
ibut
ing
valu
esm
ust
not
beol
der
than
5ye
ars
(yea
rofg
ener
atio
n)So
ftC
onst
rain
t
123
Agg
rega
ted
Com
pone
ntM
etho
din
dica
tork
now
nIf
met
hod
indi
cato
ris
not
know
n,th
enda
taha
vele
ssqu
ality
Soft
Con
stra
int
124
Agg
rega
ted
Com
pone
ntM
etho
dty
pekn
own
Ifm
etho
dty
peis
not
know
n,th
enda
taha
vele
ssqu
ality
Soft
Con
stra
int
125
Agg
rega
ted
Com
pone
ntM
F:Se
lect
edva
lue
exce
ptw
hen
valu
ety
pe=
trac
e,be
low
dete
ctio
nlim
it,un
de-
cida
ble
orun
know
n
Har
dC
onst
rain
t
126
Agg
rega
ted
Com
pone
ntSe
lect
edva
lue
=0
ifva
lue
type
=lo
gica
lze
roH
ard
Con
stra
int
127
Agg
rega
ted
Com
pone
ntSe
lect
edva
lue
has
reas
onab
lepr
ecis
ion
Soft
Con
stra
int
128
Agg
rega
ted
Com
pone
ntSP
:Mea
n<
=M
axM
ean
mus
tbe<
=m
axim
umva
lue
Har
dC
onst
rain
t12
9A
ggre
gate
dC
ompo
nent
SP:M
ean>
=M
inM
ean
mus
tbe>
=m
inim
umva
lue
Har
dC
onst
rain
t13
0A
ggre
gate
dC
ompo
nent
SP:M
edia
n<
=M
axM
edia
nm
ustb
e<
=m
axim
umva
lue
Har
dC
onst
rain
t13
1A
ggre
gate
dC
ompo
nent
SP:M
edia
n>
=M
inM
edia
nm
ustb
e>
=m
inim
umva
lue
Har
dC
onst
rain
t13
2A
ggre
gate
dC
ompo
nent
SP:S
td.d
ev.<
Max
-Min
Stan
dard
devi
atio
nm
ustb
e<
max
-min
Har
dC
onst
rain
t13
3A
ggre
gate
dC
ompo
nent
SP:S
td.e
rror
<=
Std.
dev.
Stan
dard
erro
rm
ust
be<
=st
anda
rdde
-vi
atio
nH
ard
Con
stra
int
134
Agg
rega
ted
Com
pone
ntSP
:SV<
=M
axSe
lect
edva
lue
mus
tbe<
=m
axim
umva
-lu
eH
ard
Con
stra
int
135
Agg
rega
ted
Com
pone
ntSP
:SV>
=M
inSe
lect
edva
lue
mus
tbe>
=m
inim
umva
-lu
eH
ard
Con
stra
int
136
Agg
rega
ted
Com
pone
ntU
nit
Deg
ree
Bri
xdo
esno
tha
vem
atri
xun
itH
ard
Con
stra
int
137
Agg
rega
ted
Com
pone
ntV
alue
type
know
nIf
valu
ety
pe=
unkn
own,
then
the
data
have
less
qual
itySo
ftC
onst
rain
t
138
Agg
rega
ted
Com
pone
ntV
N:M
axim
umM
axim
umis
valid
num
ber
Har
dC
onst
rain
t
![Page 92: In Copyright - Non-Commercial Use Permitted Rights ... · 2 1.3. OURCONTRIBUTION 1.3 OurContribution In this master thesis, we designed and implemented a Data Quality Analysis Toolkit,](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4cf6f0a7130c672449efdf/html5/thumbnails/92.jpg)
APPENDIX B. FOODCASE QUALITY REQUIREMENTS 83
139
Agg
rega
ted
Com
pone
ntV
N:M
ean
Mea
nis
valid
num
ber
Har
dC
onst
rain
t12
0A
ggre
gate
dC
ompo
nent
VN
:Med
ian
Med
ian
isva
lidnu
mbe
rH
ard
Con
stra
int
141
Agg
rega
ted
Com
pone
ntV
N:M
inim
umM
inim
umis
valid
num
ber
Har
dC
onst
rain
t14
2A
ggre
gate
dC
ompo
nent
VN
:Sel
ecte
dva
lue
Sele
cted
valu
eis
valid
num
ber
Har
dC
onst
rain
t14
3A
ggre
gate
dC
ompo
nent
VN
:Sta
ndar
dde
viat
ion
Stan
dard
devi
atio
nis
valid
num
ber
Har
dC
onst
rain
t14
4A
ggre
gate
dC
ompo
nent
VN
:Sta
ndar
der
ror
Stan
dard
erro
ris
valid
num
ber
Har
dC
onst
rain
t14
5R
ecip
eA
tlea
ston
ein
gred
ient
shou
ldex
ist
Har
dC
onst
rain
t14
6R
ecip
eA
tlea
ston
ere
fere
nce
shou
ldex
ist
Soft
Con
stra
int
147
Rec
ipe
Fore
very
ingr
edie
ntam
ount
mus
tbe>
0H
ard
Con
stra
int
148
Rec
ipe
For
ever
yin
gred
ient
atle
asto
nepr
epar
a-tio
nm
etho
dsh
ould
besp
ecifi
edSo
ftC
onst
rain
t
149
Rec
ipe
MF:
Rec
ipe
proc
edur
eR
ecip
epr
oced
ure
ism
anda
tory
Har
dC
onst
rain
t15
0R
ecip
eM
F:Y
ield
fact
oral
coho
lY
ield
fact
oral
coho
lis
man
dato
rySo
ftC
onst
rain
t15
1R
ecip
eM
F:Y
ield
fact
orfa
tY
ield
fact
orfa
tis
man
dato
rySo
ftC
onst
rain
t15
2R
ecip
eM
F:Y
ield
fact
orw
ater
Yie
ldfa
ctor
wat
eris
man
dato
rySo
ftC
onst
rain
t15
3R
efer
ence
Acq
uisi
tion
type
know
nIf
acqu
isiti
onty
pe=
notk
now
nth
enda
taha
vele
ssqu
ality
Soft
Con
stra
int
155
Ref
eren
ceM
F:R
efer
ence
lang
uage
Soft
Con
stra
int
155
Ref
eren
ceR
efer
ence
type
know
nIf
refe
renc
ety
peis
not
know
nth
enda
taha
vele
ssqu
ality
Soft
Con
stra
int
156
Ref
eren
ceR
efer
ence
UR
Lis
valid
Soft
Con
stra
int
Tabl
eB
.2:F
oodC
ASE
Qua
lity
Req
uire
men
ts
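A hard constraint such as "SP: Mean <= Max" can be expressed as a plain SQL query that returns the violating rows; in FoodCASE, such queries are stored in the assessmentsql column of tblqualityrequirement (see Appendix C). The following sketch only illustrates the idea: the value table and its column names (tblvalue, mean, maximum) are hypothetical, since the FoodCASE value tables are not part of this appendix.

```sql
-- Hypothetical assessment query for the hard constraint "SP: Mean <= Max".
-- Returns every component value whose mean exceeds its maximum,
-- i.e. every row violating the constraint. Table and column names
-- are assumptions, not the original FoodCASE schema.
SELECT v.valueid,
       v.mean,
       v.maximum
FROM   tblvalue v
WHERE  v.mean    IS NOT NULL
  AND  v.maximum IS NOT NULL
  AND  v.mean > v.maximum;   -- violation of Mean <= Max
```

An empty result set then means the constraint holds for all stored values.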
C Physical Database Schema
In this appendix, the definitions of all the tables belonging to the Data Quality Analysis Toolkit are given. Please refer to section 5.3 for a description of the data model.
Note: All the columns of the data type "timestamp" are actually "timestamp without time zone".
tblqualityassessment
Column Name | Data Type | Constraint(s)/Reference
assessmentid | serial | NOT NULL PRIMARY KEY
startdate | timestamp | NOT NULL
enddate | timestamp |
remarks | character varying(300) |
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL
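The column listing above can be read as PostgreSQL DDL. The following statement is a reconstruction from the table, not the original creation script; the exact serial/constraint syntax used by the toolkit is an assumption.

```sql
-- Reconstructed from the column listing above (not the original script).
-- "timestamp" stands for "timestamp without time zone", as noted earlier.
CREATE TABLE tblqualityassessment (
    assessmentid serial                 NOT NULL PRIMARY KEY,
    startdate    timestamp              NOT NULL,
    enddate      timestamp,
    remarks      character varying(300),
    creation     timestamp              NOT NULL,
    creationby   character varying(100) NOT NULL,
    mutation     timestamp              NOT NULL,
    mutationby   character varying(100) NOT NULL
);
```

The other tables in this appendix follow the same pattern, with the four audit columns (creation, creationby, mutation, mutationby) repeated throughout.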
tblqualityentity
Column Name | Data Type | Constraint(s)/Reference
entityid | serial | NOT NULL PRIMARY KEY
entityname | character varying(50) | NOT NULL UNIQUE
condition | character varying(500) | NOT NULL
displayname | character varying(30) | NOT NULL
keycolumn | character varying(30) | NOT NULL
alias | character varying(10) | NOT NULL
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL
tblqualityentityfilter
Column Name | Data Type | Constraint(s)/Reference
filterid | serial | NOT NULL PRIMARY KEY
entityid | integer | NOT NULL, tblqualityentity (entityid)
filtername | character varying | NOT NULL
filtercolumn | character varying | NOT NULL
jointable | character varying |
joincolumn | character varying |
namecolumn | character varying |
applycondition | boolean | NOT NULL
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL
tblqualityrequirement
Column Name | Data Type | Constraint(s)/Reference
requirementid | serial | NOT NULL PRIMARY KEY
requirementname | character varying(300) | NOT NULL UNIQUE
requirementdescription | character varying(300) |
entityid | integer | NOT NULL
assessmentsql | character varying(5000) | NOT NULL
requirementtypeid | integer | NOT NULL CHECK IN (0, 1, 2)
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL
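The "CHECK IN (0, 1, 2)" shorthand on requirementtypeid restricts the column to the three requirement types used in Table B.2 (hard constraint, soft constraint, indicator); which integer maps to which type is not specified in this appendix. In PostgreSQL the check could be written as follows (the constraint name is an assumption):

```sql
-- Sketch of the requirementtypeid check; the 0/1/2-to-type mapping
-- is not documented here, and the constraint name is hypothetical.
ALTER TABLE tblqualityrequirement
    ADD CONSTRAINT chk_requirementtypeid
    CHECK (requirementtypeid IN (0, 1, 2));
```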
tblqualityrequirementvalue
Column Name | Data Type | Constraint(s)/Reference
valueid | serial | NOT NULL PRIMARY KEY
requirementid | integer | NOT NULL, tblqualityrequirement (requirementid)
assessmentid | integer | NOT NULL, tblqualityassessment (assessmentid)
refid | integer | NOT NULL
value | double precision |
tblqualitytreedefinition
Column Name | Data Type | Constraint(s)/Reference
treedefid | serial | NOT NULL PRIMARY KEY
treename | character varying(100) | NOT NULL UNIQUE
treedescription | character varying(1000) |
entityid | integer | NOT NULL, tblqualityentity (entityid)
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL
tblqualitytreeedge
Column Name | Data Type | Constraint(s)/Reference
edgeid | serial | NOT NULL PRIMARY KEY
parent_nodeid | integer | NOT NULL, tblqualitytreenode (nodeid), UNIQUE (parent_nodeid, child_nodeid)
child_nodeid | integer | NOT NULL, tblqualitytreenode (nodeid), UNIQUE (parent_nodeid, child_nodeid)
weight | double precision |
treedefid | integer | NOT NULL, tblqualitytreedefinition (treedefid)
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL
tblqualitytreenode
Column Name | Data Type | Constraint(s)/Reference
nodeid | serial | NOT NULL PRIMARY KEY
height | integer | NOT NULL
index | integer | NOT NULL
collapsed | boolean | NOT NULL
userscalemin | integer |
userscalemax | integer |
treedefid | integer | NOT NULL, tblqualitytreedefinition (treedefid)
creation | timestamp | NOT NULL
creationby | character varying(100) | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying(100) | NOT NULL
tblqualitytreenodeaggregation
Column Name | Data Type | Constraint(s)/Reference
name | character varying(50) | NOT NULL
aggregationtypeid | integer | NOT NULL CHECK IN (0, 1, 2, 3, 4, 5)
nodeid | integer | NOT NULL PRIMARY KEY, tblqualitytreenode (nodeid)
tblqualitytreenoderequirement
Column Name | Data Type | Constraint(s)/Reference
requirementid | integer | NOT NULL, tblqualityrequirement (requirementid)
nodeid | integer | NOT NULL PRIMARY KEY, tblqualitytreenode (nodeid)
tblqualityusersetting
Column Name | Data Type | Constraint(s)/Reference
usersettingid | serial | NOT NULL PRIMARY KEY
userid | integer | NOT NULL, tbluser (iduser)
simpleclassname | character varying(100) | NOT NULL
fieldname | character varying(100) | NOT NULL
value | character varying(100) | NOT NULL
creation | timestamp | NOT NULL
creationby | character varying | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying | NOT NULL
tblqualitywarningtext
Column Name | Data Type | Constraint(s)/Reference
warningid | integer | NOT NULL PRIMARY KEY, DEFAULT nextval(..)
warningtext | character varying(300) | NOT NULL
param | character varying(20) |
creation | timestamp | NOT NULL
creationby | character varying | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying | NOT NULL
tblqualitywarninguser
Column Name | Data Type | Constraint(s)/Reference
warninguserid | integer | NOT NULL PRIMARY KEY, DEFAULT nextval(..)
warningid | integer | NOT NULL, tblqualitywarningtext (warningid), UNIQUE (warningid, userid)
userid | integer | NOT NULL, tbluser (iduser), UNIQUE (warningid, userid)
enabled | boolean | NOT NULL
creation | timestamp | NOT NULL
creationby | character varying | NOT NULL
mutation | timestamp | NOT NULL
mutationby | character varying | NOT NULL
List of Figures
2.1 Swiss Food Composition Database, Public Online Interface . . . 5
2.2 FoodCASE Application . . . 7
2.3 FoodCASE System Architecture . . . 8
2.4 A Conceptual Framework of Data Quality . . . 9
3.1 Start Screen of the Data Quality Analysis Toolkit . . . 14
3.2 Data Quality Tree: Aggregated Foods . . . 15
3.3 Data Problem Table: Aggregated Foods . . . 15
3.4 Data Quality Line Chart: EuroFIR Quality Index . . . 17
3.5 Data Quality Bar Chart: Group by Mutation User . . . 17
3.6 A Simple Tree Definition . . . 19
3.7 Tree Editor . . . 20
3.8 Example of a Box Plot . . . 24
3.9 Example of a Spider Chart . . . 25
3.10 Example of a Tree Table . . . 28
3.11 Example of a Bar Chart with Error Bars . . . 29
4.1 Data Model of the FoodCASE Application . . . 32
4.2 Data Model of the FoodCASE Application: Close-Up . . . 33
4.3 Administration Tool: Edit a Quality Requirement . . . 36
5.1 Data Model of the Data Quality Analysis Toolkit, UML Class Diagram . . . 42
5.2 Runtime Data Structure of the Data Quality Analysis Toolkit . . . 48
5.3 Tree View and Tree Editor, UML Class Diagram . . . 51
5.4 GUI Components, UML Class Diagram . . . 53
5.5 Admin Module, UML Class Diagram . . . 58
6.1 JUnit Back-End Testing . . . 60
6.2 Automated GUI Testing (Screenshot) . . . 61
6.3 Automated GUI Testing, Application Monitoring . . . 62
7.1 FoodCASE: Single Food Detail Window . . . 64
7.2 Data Quality Prevention: Warning . . . 65
7.3 Data Quality Prevention: Configure Warnings . . . 65
List of Tables
3.1 Views and View Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1 Quality Requirement: Assessment SQL Place Holders . . . . . . . . . . . . 35
5.1 Code Metrics of the Data Quality Analysis Toolkit . . . 41
5.2 File System Cache Structure . . . 45
5.3 Plot Types and Datasets used by the Data Quality Analysis Views . . . 55
7.1 Mapping from Averaged EuroFIR Quality Index to Confidence Code . . . . 66
A.1 Terms and Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
B.1 Quality Requirement Name Prefixes . . . 73
B.2 FoodCASE Quality Requirements . . . 83
List of Listings
5.1 Writing to File System Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Positioning of Tree Nodes (Pseudo Code) . . . . . . . . . . . . . . . . . . . 50
5.3 Logging and Error Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Acknowledgements
First and foremost, I would like to thank Prof. Dr. Moira C. Norrie for accepting me and giving me the opportunity to write my master thesis in the GlobIS group. Furthermore, I would like to thank my supervisor, Karl Presser, for the uncomplicated and pleasant collaboration. He gave me much freedom in how I approached the different goals of this master thesis and was always ready to discuss any questions that arose during the work. Special thanks go to my girlfriend, Marina Spani, for her careful review and her helpful suggestions. Last but not least, I would like to thank my parents, who have always supported me throughout my studies.
Bibliography
[atw1896] Atwater W. O. and Woods C. D. Washington 1896. The Chemical Composition of American Food Materials. U.S. Department of Agriculture, Office of Experiment Stations, Bulletin 28, Pages 1-47
[becker07] Becker Wulf, Unwin Ian, Ireland Jayne and Møller Anders. 2007. Proposal for Structure and Detail of a EuroFIR Standard on Food Composition Data.
[foph03] ETH Zurich, Federal Office of Public Health and Swiss Society of Nutrition. Zollikofen 2003. Swiss Nutrient Value Table
[fown44] Federal Office for Wartime Nutrition. 1944. Tabelle der Nährwerte der Lebensmittel. Eidgenössisches Gesundheitsamt, Bulletin 33, Pages 378-384.
[hitz96] Hitz M. and Montazeri B. 1996. Chidamber and Kemerer's Metrics Suite: A Measurement Theory Perspective. IEEE Transactions on Software Engineering 22, Pages 267-271
[hoffm1789] Hoffmann Carl August. Weimar 1789. Erweiterte Tabelle über etliche vierzig Mineral-Wasser und Gesundbrunnen Deutschlands.
[hoegl64] Högl O. and Lauber E. 1964. Nährwert der Lebensmittel. Schweizerisches Lebensmittelbuch, First Volume, Pages 713-753
[holden01] Holden Joanne M., Bhagwat Seema A. and Patterson Kristine Y. August 2002. Development of a Multi-nutrient Data Quality Evaluation System. Journal of Food Composition and Analysis, Volume 15, Issue 4, Pages 339-348, ISSN 0889-1575
[presser08] Presser Karl, Colombani Paolo and Bell Simone. Zurich October 2008. Software Requirement Specification for a Food Composition Database Management System
[presser11] Presser Karl. Zurich 2011. A Data Quality Model Approach and a Data Quality Framework for Food Composition Databases (unpublished)
[mccabe76] McCabe Thomas J. December 1976. A Complexity Measure. IEEE Transactions on Software Engineering, Volume SE-2, Number 4, Pages 308-320
[moles1859] Moleschott Jacob. Giessen 1859. Physiologie der Nahrungsmittel. Ein Handbuch der Diätetik. Zweite völlig umgearbeitete Auflage, Ferber'sche Universitätsbuchhandlung.
[moller08] Møller Anders and Christensen Tue. Denmark 2008. EuroFIR Web Services - EuroFIR Food Data Transport Package, Version 1.3, EuroFIR Deliverable, ISBN 978-87-92125-08-8
[salvini07] Salvini Simonetta, Oseredczuk Marine, Roe Mark and Møller Anders. 2007. Report Guidelines for Quality Index Attribution to Original Data from Scientific Literature or Reports for EuroFIR Data Interchange. EuroFIR Workpackage 1.3, Task Group 4
[schlotke00] Schlotke Florian, Becker Wulf and Ireland Jayne. Brussels 2000. Eurofoods Recommendations for Food Composition Database Management and Data Interchange, European Commission, EUR 19538, ISBN 92-828-9757-5
[wang94] Wang Richard Y. and Strong Diane M. March 1996. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, Volume 12, Number 4, Pages 5-33.