Mapping and Integration of Multiple Forms into Relational Databases

CVDI is a collaboration between the University of Louisiana at Lafayette & Drexel University

MAPPING & INTEGRATING MULTIPLE FORMS INTO A DATABASE

Yuan An, Ritu Khare, Il-Yeol Song, Xiaohua Hu

Background

[Fig. 1 content: the Patient Information form, with labels Date, Patient, Name, Gender (M/F), DOB, HPI, Vital Sign, Height, Weight, BP, and its back-end tables:]

PatientInformation(piId, Date, Patient, HPI, VitalSign)

Patient(pId, Name, Gender, DOB)

Gender(gId, options), with rows (001, Male) and (002, Female)

Vital Signs(vId, Height, Weight, BP)

[Fig. 2 content: the Healthy Living Program form, with labels Date, Patient, Name, DOB, Hours Exercise, Hours Watching TV, Social Activities, Smokes, Alcohol.]

The FormMapper System

Empirical Study in Healthcare

[Fig. 3 diagram: the Tree Extraction Component (Layered Hidden Markov Models (HMMs), Parent-Child Association Rules) and the Form Mapping and Integration Component (Initial Correspondence Generation and Validation, Database Birthing Algorithm producing a NEW DB, Merging Algorithm). Other labels: FORM, DATABASE, and an example form tree (root with x, y, and z nodes).]

Key Techniques

Hierarchical Representation of Forms as Form Trees

Hidden Markov Models for Form Information Extraction

Sophisticated Matching techniques for Deriving Mapping Correspondences between tree and database

Form Tree Patterns and DB design principles to translate a form tree into an equivalent database (See Fig. 4)

Quantitative metric (quality tuning factor) to facilitate the decision of merging (or not merging) two mapped tables
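The correspondence-derivation technique listed above can be illustrated with a deliberately simple sketch. It is not the FormMapper matcher: it assumes plain string similarity between form labels and column names, and the function name `candidate_correspondences` and the 0.6 threshold are hypothetical. The table and column names are taken from the Patient Information example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop non-alphanumeric characters so 'Vital Sign' ~ 'VitalSign'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def candidate_correspondences(form_labels, db_columns, threshold=0.6):
    """Pair each form label with its most similar database column.

    db_columns maps a table name to its list of column names.
    Returns {label: (table, column, score)}. Labels whose best score is
    below `threshold` are omitted; they would need new database elements.
    """
    matches = {}
    for label in form_labels:
        best = None
        for table, columns in db_columns.items():
            for column in columns:
                score = SequenceMatcher(
                    None, normalize(label), normalize(column)
                ).ratio()
                if best is None or score > best[2]:
                    best = (table, column, score)
        if best is not None and best[2] >= threshold:
            matches[label] = best
    return matches


# Existing database from the Patient Information example.
existing_db = {
    "PatientInformation": ["piId", "Date", "Patient", "HPI", "VitalSign"],
    "Patient": ["pId", "Name", "Gender", "DOB"],
    "Vital Signs": ["vId", "Height", "Weight", "BP"],
}
# A few labels from the Healthy Living Program form.
new_form = ["Date", "Name", "DOB", "Hours Exercise", "Smokes", "Alcohol"]

result = candidate_correspondences(new_form, existing_db)
for label, (table, column, score) in result.items():
    print(f"{label} -> {table}.{column} ({score:.2f})")
```

Candidates produced this way would still need the validation step the poster mentions, since surface similarity alone cannot resolve domain semantics.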

Desirable Characteristics of Database (w.r.t. the input form)

Completeness

Correctness

Compactness

Normalization (3NF)

Optimization (minimize potential NULL values & the number of database elements)

Fig. 3 The FormMapper System has two components: (1) Tree Extraction and (2) Form Integration. [Diagram label: Semantic Form Tree]

Fig. 4 Some Form Tree to Database Mapping Patterns: (a) Textbox Pattern, (b) Radiobutton Pattern, (c) Checkbox Pattern, (d) Category-Subcategory Pattern.
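A rough sketch of how the four patterns might translate into tables. The textbox, radiobutton, and category rules follow the Patient Information example (Name as a column, Gender as a lookup table, Vital Signs as a separate table referenced by a foreign key). The checkbox rule (a junction table) is an assumption, since the pattern details did not survive in the poster text. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a form tree: a category or an input element."""
    label: str
    kind: str  # "category", "textbox", "radiobutton", or "checkbox"
    children: list = field(default_factory=list)  # sub-elements of a category


def ident(label):
    """Turn a form label into a SQL-friendly identifier."""
    return "".join(ch if ch.isalnum() else "_" for ch in label.strip()).lower()


def birth_tables(node, tables=None):
    """Translate a category node into {table name: [column definitions]}."""
    if tables is None:
        tables = {}
    table = ident(node.label)
    columns = tables.setdefault(table, ["id INTEGER PRIMARY KEY"])
    for child in node.children:
        name = ident(child.label)
        if child.kind == "textbox":
            # (a) Textbox: a plain column on the enclosing table.
            columns.append(f"{name} TEXT")
        elif child.kind == "radiobutton":
            # (b) Radiobutton: lookup table of options plus a foreign key.
            tables[name] = ["id INTEGER PRIMARY KEY", "option_value TEXT"]
            columns.append(f"{name}_id INTEGER REFERENCES {name}(id)")
        elif child.kind == "checkbox":
            # (c) Checkbox (assumed): options table plus a junction table,
            # because several options can be ticked at once.
            tables[name] = ["id INTEGER PRIMARY KEY", "option_value TEXT"]
            tables[f"{table}_{name}"] = [
                f"{table}_id INTEGER REFERENCES {table}(id)",
                f"{name}_id INTEGER REFERENCES {name}(id)",
            ]
        elif child.kind == "category":
            # (d) Category/subcategory: separate table reached by a foreign key.
            columns.append(f"{name}_id INTEGER REFERENCES {name}(id)")
            birth_tables(child, tables)
    return tables


def to_ddl(tables):
    """Render the table dictionary as CREATE TABLE statements."""
    return [f"CREATE TABLE {t} ({', '.join(cols)});" for t, cols in tables.items()]


form = Node("Patient Information", "category", children=[
    Node("Date", "textbox"),
    Node("Gender", "radiobutton"),
    Node("Vital Sign", "category", children=[
        Node("Height", "textbox"),
        Node("Weight", "textbox"),
        Node("BP", "textbox"),
    ]),
])
tables = birth_tables(form)
for statement in to_ddl(tables):
    print(statement)
```

The option rows themselves (for example Male and Female) are data rather than schema, so the sketch leaves them out.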

Datasets

16 highly complex data-entry forms from 3 healthcare institutions.

Average of 57 form elements per form.

Benchmarks

16 Gold Standard Trees prepared using a DIY form design tool.

Two sets of 3 Gold Standard Databases prepared by 2 database experts, each with at least 10 years of experience.

Tree Extraction Component

Expectation Maximization Algorithm on 52 clinical forms

Viterbi Algorithm for decoding

5 parent child association rules

Accuracy: 96.93%

Duration: 0.07 sec per form
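For readers unfamiliar with Viterbi decoding, here is a minimal sketch on a toy model. The three states, the three observation types, and every probability are invented for illustration; in the poster the parameters come from Expectation Maximization training on clinical forms, and the actual layered HMM structure is not described here.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for `observations`."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            column[s] = (prob, prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Toy model (all values hypothetical): hidden roles of form elements.
states = ["category", "label", "input"]
start_p = {"category": 0.6, "label": 0.3, "input": 0.1}
trans_p = {
    "category": {"category": 0.1, "label": 0.8, "input": 0.1},
    "label": {"category": 0.05, "label": 0.15, "input": 0.8},
    "input": {"category": 0.2, "label": 0.7, "input": 0.1},
}
emit_p = {
    "category": {"bold_text": 0.7, "text": 0.25, "textbox": 0.05},
    "label": {"bold_text": 0.2, "text": 0.75, "textbox": 0.05},
    "input": {"bold_text": 0.02, "text": 0.08, "textbox": 0.9},
}

# Visible element types read off a form, top to bottom.
observations = ["bold_text", "text", "textbox", "text", "textbox"]
decoded = viterbi(observations, states, start_p, trans_p, emit_p)
print(decoded)  # ['category', 'label', 'input', 'label', 'input']
```

The decoded roles are the kind of sequence from which parent-child association rules could then assemble a form tree.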

Form Integration Component

Indexing using Lucene

Quality tuning factor = 0.5

Duration: 3 sec per form
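The role of the quality tuning factor can be sketched as a threshold on how much two tables overlap. The overlap measure used here (Jaccard similarity of column names) is an assumption made for illustration; the poster gives the factor's value (0.5) but not the metric's definition. The function names are hypothetical.

```python
def should_merge(existing_columns, new_columns, tuning_factor=0.5):
    """Merge when the share of columns the two tables have in common
    reaches the tuning factor.

    The overlap measure (Jaccard similarity) is assumed for this sketch;
    it is not the metric defined by the FormMapper authors.
    """
    existing, new = set(existing_columns), set(new_columns)
    union = existing | new
    if not union:
        return False
    overlap = len(existing & new) / len(union)
    return overlap >= tuning_factor


def merge_tables(existing_columns, new_columns):
    """Union of columns: existing order is kept, new columns are appended."""
    merged = list(existing_columns)
    merged += [c for c in new_columns if c not in merged]
    return merged


patient = ["Name", "Gender", "DOB"]

# High overlap (2 of 3 distinct columns shared): merging adds few NULLs.
print(should_merge(patient, ["Name", "DOB"]))

# Low overlap (2 of 7 distinct columns shared): keeping separate tables
# avoids many NULL values, at the cost of more database elements.
lifestyle = ["Name", "DOB", "HoursExercise", "HoursWatchingTV", "Smokes", "Alcohol"]
print(should_merge(patient, lifestyle))
```

Under this reading, a lower factor favors fewer database elements and a higher factor favors fewer potential NULL values, which are the two quantities named under Optimization.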

[Fig. 5 bar charts: counts (axis 0 to 200) of Tables, Columns, Values, and Foreign Keys for FormMapper, Gold 1, and Gold 2.]

[Fig. 6 pie charts, with categories Perfect Match, Positive Mismatch, and Negative Mismatch: one chart shows 52%, 28%, and 20%; the other shows 54%, 40%, and 6%.]

FormMapper Vs Gold DB

On average, 87% of the database tables are either identical or superior (positive mismatch) to the gold database tables based on the defined database characteristics.

Inferior cases (negative mismatch) are mostly due to missing correspondences (caused by extraction inaccuracies) and imprecisely derived cardinalities among categories and subcategories in forms.
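The 87% average is consistent with the two pie charts if their smallest slices (20% and 6%) are the negative-mismatch shares. That reading is an assumption, since the chart legends did not survive extraction, and the comparison names below are placeholders.

```python
# Negative-mismatch share per comparison (assumed reading of the pie charts).
negative_mismatch = {"comparison A": 20, "comparison B": 6}

# Everything that is not a negative mismatch is identical or superior.
identical_or_superior = {
    name: 100 - pct for name, pct in negative_mismatch.items()
}
average = sum(identical_or_superior.values()) / len(identical_or_superior)
print(identical_or_superior, average)
```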

Implications

High potential to replace the human experts

As more forms are mapped, the database grows automatically in a principled manner.

It is challenging to automate the aspects of mapping that rely on human understanding of domain semantics.

Work in Progress

Leverage Ontology and Controlled Vocabularies to handle semantic heterogeneity.

More sophisticated Correspondence Generation and Validation Techniques

Consider more complicated merging situations (e.g. a table corresponds to a column)

In the quest for database usability, several DIY and WYSIWYG approaches enable non-technical users to design forms. Such approaches (e.g. FormAssembly) automatically translate forms into databases while shielding the users from technical details. Such approaches, however, neither support database evolution due to changing user requirements nor support multiple users managing a common database.

Fig. 1 Using forms as a front-end interface mapped to a back-end database is a standard way to collect data. The figure shows a scenario in the healthcare domain.

Fig. 2 A New Form representing a new (or evolved) user requirement

Challenges in Mapping Forms to Databases

How to automatically understand a user-created form and extract semantic relationships among form elements?

How to automatically map the semantic model extracted from a form to the existing database?

How to automatically evolve the existing database with desired properties and what are these properties?

Motivation and Focus

While there exist many techniques to forward engineer a single form to an individual back-end database, mapping multiple forms to an existing structured database remains unexplored. This work addresses the problem of automatically mapping multiple (possibly overlapping) forms to an existing structured database.

Fig. 5. Scale of the evolved Databases

Fig. 6. Comparison of Tables.

[Figure panel labels: Input Form; Database 1, Database 2, Database 3; FormMapper vs. Gold 1, FormMapper vs. Gold 2.]