ABSTRACTING AND AUTOMATING HIERARCHICAL DATA MODELS: LEVERAGING THE SAS® FORMAT PROCEDURE CNTLIN OPTION TO BUILD DYNAMIC FORMATS THAT CLEAN, CONVERT, AND CATEGORIZE DATA

TROY MARTIN HUGHES

OCTOBER 2019

Copyright © 2019 Troy Martin Hughes

BIOGRAPHY

Troy has been a SAS practitioner for more than 20 years, has managed SAS projects in support of federal, state, and local government initiatives, and is a SAS Certified Base, Advanced, and Clinical Trials Programmer. He has been a frequent presenter at SGF, SAS Analytics Experience, WUSS, SCSUG, MWSUG, SESUG, and PharmaSUG. He has an MBA in Information Systems Management and certifications including: PMP, PMI-RMP, PMI-PBA, PMI-ACP, CISSP, CSSLP, ITIL, CSM, CSD, CSPO, CSP-SM, and CSP-PO. Troy is a consultant for the Department of Defense (DoD) and is a US Navy veteran with two Afghanistan deployments.

SAS FORMATS

Used to transform data

• Cleans data by identifying/removing extraneous values

• Converts/standardizes data when alternate forms exist (i.e., entity resolution)

• Categorizes (bins) data into groups or hierarchies

Weaknesses

• Often maintained within SAS code

• Some data models (e.g., hierarchies, taxonomies) require multiple formats

SAS FORMATS: DSM-5 AND ICD-10

Formats always begin with a data model:

DSM-5 Code   ICD-10 Code   Disorder Name
291.0        F10.221       Alcohol dependence with intoxication delirium
291.1        F10.26        Alcohol dependence with alcohol-induced persisting amnestic disorder
291.2        F10.27        Alcohol dependence with alcohol-induced persisting dementia

proc format;
   value $dsm
      '291.0'='Alcohol dependence with intoxication delirium'
      '291.1'='Alcohol dependence with alcohol-induced persisting amnestic disorder'
      '291.2'='Alcohol dependence with alcohol-induced persisting dementia';
run;

RAW DATA (NO FORMAT)

data codes;
   length dsmcode $8;
   label dsmcode='DSM-5 Code';
   dsmcode='291.0'; output;
   dsmcode='291.0'; output;
   dsmcode='291.1'; output;
   dsmcode='17'; output;
run;

proc print data=codes noobs label;
run;

Output:

DSM-5 Code
291.0
291.0
291.1
17

FORMATTED DATA

data codes;
   length dsmcode $8;
   label dsmcode='DSM-5 Code';
   dsmcode='291.0'; output;
   dsmcode='291.0'; output;
   dsmcode='291.1'; output;
   dsmcode='17'; output;
run;

proc print data=codes noobs label;
   format dsmcode $dsm.;
run;

Output:

DSM-5 Code
Alcohol dependence with intoxication delirium
Alcohol dependence with intoxication delirium
Alcohol dependence with alcohol-induced persisting amnestic disorder
17

“17” does not appear in the data model (SAS format), so its value is not transformed.
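Because unmatched values pass through unchanged, they can be hard to spot in output. The sketch below shows one way to make them visible with the OTHER keyword of the VALUE statement. The format name $dsm_chk, the flag label UNMODELED, and the output data set CODES_CHECKED are hypothetical and chosen only for illustration; the CODES data set is the one created above.

```sas
/* Sketch only: $dsm_chk and the 'UNMODELED' label are hypothetical names. */
proc format;
   value $dsm_chk
      '291.0' = 'Alcohol dependence with intoxication delirium'
      '291.1' = 'Alcohol dependence with alcohol-induced persisting amnestic disorder'
      '291.2' = 'Alcohol dependence with alcohol-induced persisting dementia'
      other   = 'UNMODELED';   /* every value not listed above gets this label */
run;

data codes_checked;
   set codes;
   length diagnosis $80;
   diagnosis = put(dsmcode, $dsm_chk.);
   unmodeled = (diagnosis = 'UNMODELED');   /* 1 when the code is absent from the model */
run;
```

Whether to flag, blank out, or pass through unmodeled values is a design choice that depends on how the cleaned data will be used downstream.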

PROBLEM 1: FORMATS DEFINED IN CODE

Issues

• Modularity is decreased because changes to the data model require (unnecessary) changes to the underlying software

• Interoperability is decreased because SAS formats cannot be used for other non-SAS purposes (without parsing the code)

• Master data management (MDM) is compromised because SAS and non-SAS versions of formats are maintained

Solution

• Maintain dynamic formats external to software (e.g., in XML, Excel, text files, or other interoperable file formats)
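As a minimal sketch of this idea, the example below reads a format definition from a text file and hands it to PROC FORMAT through the CNTLIN option. The file path, the file layout (a header row followed by code,name pairs), and the names DSMCSV, DSM_CNTLIN, and DSM_EXT are all assumptions made for illustration.

```sas
/* Sketch only: the path, file layout, and names below are assumptions.  */
/* dsm.csv is assumed to hold a header row and two columns: code,name    */
filename dsmcsv '/folders/myfolders/dsm.csv';   /* hypothetical location */

data dsm_cntlin;
   infile dsmcsv dsd firstobs=2 truncover;
   length fmtname $32 type $1 start $8 label $100;
   retain fmtname 'dsm_ext' type 'C';   /* TYPE='C' marks a character format */
   input start :$8. label :$100.;       /* START is the raw value, LABEL its replacement */
run;

/* CNTLIN builds the format from the data set instead of a VALUE statement */
proc format cntlin=dsm_cntlin;
run;
```

With this arrangement, a change to the data model means editing the CSV file only; the SAS code is untouched.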

PROBLEM 2: CAN’T MODEL ONE-TO-ONE-TO-ONE

Issues

• A format can map one value to another value, or can bin many values into one value, but cannot model one-to-one-to-one relationships (e.g., DSM-5 code to ICD-10 code to diagnosis name)

Solution

• Maintain a single data model from which formats can be dynamically built (using CNTLIN)

DSM-5 Code   ICD-10 Code   Diagnosis Name
291.0        F10.221       Alcohol dependence with intoxication delirium
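The following sketch illustrates how one three-column model could feed two formats at once. All data set, variable, and format names (DSM_ICD_MODEL, CNTLIN_BOTH, DSM_TO_ICD, DSM_TO_NAME, and so on) are hypothetical; the rows come from the DSM-5/ICD-10 table shown earlier.

```sas
/* Sketch only: data set, variable, and format names are hypothetical. */
data dsm_icd_model;
   length dsm5code $8 icd10code $8 dxname $100;
   dsm5code='291.0'; icd10code='F10.221';
      dxname='Alcohol dependence with intoxication delirium'; output;
   dsm5code='291.1'; icd10code='F10.26';
      dxname='Alcohol dependence with alcohol-induced persisting amnestic disorder'; output;
   dsm5code='291.2'; icd10code='F10.27';
      dxname='Alcohol dependence with alcohol-induced persisting dementia'; output;
run;

/* One control data set holding two formats, both derived from the same model */
data cntlin_both;
   set dsm_icd_model;
   length fmtname $32 type $1 start $8 label $100;
   type  = 'C';
   start = dsm5code;
   fmtname = 'dsm_to_icd';  label = icd10code; output;
   fmtname = 'dsm_to_name'; label = dxname;    output;
   keep fmtname type start label;
run;

/* CNTLIN expects the rows of each format to be grouped together */
proc sort data=cntlin_both;
   by fmtname start;
run;

proc format cntlin=cntlin_both;
run;
```

After this runs, put(dsm5code, $dsm_to_icd.) and put(dsm5code, $dsm_to_name.) both draw on the same source rows, so the two mappings cannot drift apart.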

PROBLEM 3: CAN’T MODEL HIERARCHICAL DATA

Issues

• Formats can bin data into categories, but can only bridge between two hierarchical levels at one time

Solution

• Maintain a single data model from which formats can be dynamically built (using CNTLIN)

Classification 1                        Classification 2              Diagnosis Name                                           DSM-5 Code
Substance use and addictive disorders   Alcohol-related disorders     Alcohol dependence with intoxication delirium            291.0
Substance use and addictive disorders   Alcohol-related disorders     Alcohol dependence with alcohol-induced sleep disorder   291.82
Substance use and addictive disorders   Alcohol-related disorders     Unspecified alcohol-related disorder                     291.89
Substance use and addictive disorders   Substance-related disorders   Substance dependence with intoxication delirium          292.81
Substance use and addictive disorders   Substance-related disorders   Stimulant use disorder                                   304.40
Substance use and addictive disorders   Substance-related disorders   Other or unknown substance-related disorder              304.90
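To make the limitation concrete, the sketch below hardcodes two formats, each bridging one pair of levels from the table above, and chains them to walk from a code up to its first-level classification. The format names $code_to_cls and $cls_to_top are hypothetical, and hardcoding is used here only to illustrate the problem, not as a recommended practice.

```sas
/* Sketch only: format names are hypothetical; values come from the table above. */
proc format;
   /* bridges DSM-5 code to second-level classification */
   value $code_to_cls
      '291.0', '291.82', '291.89'  = 'Alcohol-related disorders'
      '292.81', '304.40', '304.90' = 'Substance-related disorders';
   /* bridges second-level to first-level classification */
   value $cls_to_top
      'Alcohol-related disorders', 'Substance-related disorders'
         = 'Substance use and addictive disorders';
run;

data _null_;
   length class2 class1 $50;
   class2 = put('291.82', $code_to_cls.);   /* first hop: code to level 2 */
   class1 = put(class2, $cls_to_top.);      /* second hop: level 2 to level 1 */
   put class2= class1=;                     /* writes both values to the log */
run;
```

Every additional pair of levels needs another separately maintained format, which is the duplication a single external data model is meant to remove.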

7 POSSIBLE FORMATS WITH ONLY 3 DATA LEVELS

[Diagram: the model variables CLASS1 (DSM-5 first-level classification), CLASS2 (DSM-5 second-level classification), DSM5name (DSM-5 diagnosis name), and DSM5code (DSM-5 code), with the possible formats between them numbered 1 through 7.]

• Conversion of the DSM-5 code to the DSM-5 diagnosis name (e.g., converting “291.0” into “Alcohol dependence with intoxication delirium”).

• Categorization of the DSM-5 code into the DSM-5 level 2 classification (CLASS2) (e.g., categorizing “291.0” as “Alcohol-related disorders”).

• Categorization of the DSM-5 code into the DSM-5 level 1 classification (CLASS1) (e.g., categorizing “291.0” as “Substance use and addictive disorders”).

• Conversion of the DSM-5 diagnosis name to the DSM-5 code (e.g., converting “Alcohol dependence with intoxication delirium” into “291.0”).

• Categorization of the DSM-5 diagnosis name into the DSM-5 level 2 classification (CLASS2) (e.g., categorizing “Alcohol dependence with intoxication delirium” as “Alcohol-related disorders”).

• Categorization of the DSM-5 diagnosis name into the DSM-5 level 1 classification (CLASS1) (e.g., categorizing “Alcohol dependence with intoxication delirium” as “Substance use and addictive disorders”).

• Categorization of the DSM-5 level 2 classification (CLASS2) into the DSM-5 level 1 classification (CLASS1) (e.g., categorizing “Alcohol-related disorders” into “Substance use and addictive disorders”).

BUILD_FORMAT MACRO TO THE RESCUE

Macro Definition

%macro build_format(fmtname= /* name of SAS format generated */,
   dsnmodel= /* data set in LIB.DSN or DSN format containing data model */,
   var1= /* variable (within the model) being transformed or categorized */,
   var2= /* variable (within the model) to which VAR1 is transformed */);

Sample Invocations

%build_format(fmtname=DSM5code_to_name,
   dsnmodel=DSMmodel, var1=DSM5code,
   var2=DSM5name);

/* the trailing underscore keeps the name valid: a SAS format name cannot end in a digit */
%build_format(fmtname=DSM5code_to_class2_,
   dsnmodel=DSMmodel, var1=DSM5code,
   var2=class2);
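The slides show only the macro's signature, not its body. The sketch below is one plausible way a macro with the same four parameters could be written around CNTLIN. It is not the author's implementation: the name BUILD_FORMAT_SKETCH and the temporary data sets _BF_PAIRS and _BF_CNTLIN are hypothetical, and it assumes VAR1 and VAR2 are character variables no longer than 200 characters.

```sas
/* Sketch only: one plausible body for a macro with this signature.   */
/* Assumes VAR1 and VAR2 are character variables; all names that      */
/* begin with _BF_ and the macro name itself are hypothetical.        */
%macro build_format_sketch(fmtname=, dsnmodel=, var1=, var2=);
   /* keep one row per VAR1 value; PROC FORMAT rejects duplicate ranges */
   proc sort data=&dsnmodel(keep=&var1 &var2) out=_bf_pairs nodupkey;
      by &var1;
   run;

   /* reshape the pairs into the variables that CNTLIN expects */
   data _bf_cntlin;
      set _bf_pairs;
      length fmtname $32 type $1 start label $200;
      fmtname = "&fmtname";
      type    = 'C';        /* character format */
      start   = &var1;      /* value to be transformed */
      label   = &var2;      /* value it is transformed into */
      keep fmtname type start label;
   run;

   proc format cntlin=_bf_cntlin;
   run;

   /* remove the temporary data sets */
   proc datasets library=work nolist;
      delete _bf_pairs _bf_cntlin;
   quit;
%mend build_format_sketch;
```

It would be invoked with the same arguments as the samples above, for example %build_format_sketch(fmtname=DSM5code_to_name, dsnmodel=DSMmodel, var1=DSM5code, var2=DSM5name);. Because NODUPKEY keeps only the first row per VAR1 value, a model in which one VAR1 value maps to several VAR2 values would be silently reduced, so a production version would likely want to detect and report that case.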

EXTERNAL XML DATA MAP (DSM5MODEL.MAP)

<?xml version="1.0" ?>
<SXLEMAP version="2.1">
  <TABLE name="DSM5model">
    <TABLE-PATH syntax="XPath">/TABLE/CLASS1/CLASS2/DIAG</TABLE-PATH>
    <COLUMN name="CLASS1" retain="YES">
      <PATH>/TABLE/CLASS1</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>50</LENGTH>
    </COLUMN>
    <COLUMN name="CLASS2" retain="YES">
      <PATH>/TABLE/CLASS1/CLASS2</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>50</LENGTH>
    </COLUMN>
    <COLUMN name="DSM5name">
      <PATH>/TABLE/CLASS1/CLASS2/DIAG/@DSM5name</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>100</LENGTH>
    </COLUMN>
    <COLUMN name="DSM5code">
      <PATH>/TABLE/CLASS1/CLASS2/DIAG/@DSM5code</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>8</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>

EXTERNAL XML DATA MODEL (DSM5MODEL.XML)

<?xml version="1.0" encoding="utf-8" ?>
<TABLE>
  <CLASS1> Substance use and addictive disorders
    <CLASS2> Alcohol-related disorders
      <DIAG DSM5name="Alcohol dependence with intoxication delirium" DSM5code="291.0"/>
      <DIAG DSM5name="Alcohol dependence with alcohol-induced sleep disorder" DSM5code="291.82"/>
      <DIAG DSM5name="Unspecified alcohol-related disorder" DSM5code="291.89"/>
    </CLASS2>
    <CLASS2> Substance-related disorders
      <DIAG DSM5name="Substance dependence with intoxication delirium" DSM5code="292.81"/>
      <DIAG DSM5name="Stimulant use disorder" DSM5code="304.40"/>
      <DIAG DSM5name="Other or unknown substance-related disorder" DSM5code="304.90"/>
    </CLASS2>
  </CLASS1>
</TABLE>

BUILD_FORMAT IN ACTION

Ingest the XML data map and data model:

* change this location to the location of the XML map file and XML model file;
filename DSM5in '/folders/myfolders/DSM5model.xml';
filename DSMmap '/folders/myfolders/DSM5model.map';
libname DSM5in xmlv2 xmlmap=DSMmap;

data DSMmodel;
   set DSM5in.DSM5model;
run;

Sample Invocation

%build_format(fmtname=DSM5code_to_name,
   dsnmodel=DSMmodel, var1=DSM5code, var2=DSM5name);
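Once the format exists, it can be applied like any hardcoded format. The sketch below assumes that the BUILD_FORMAT macro has been compiled, that the invocation above has created the character format $DSM5code_to_name, and that the CODES data set from the earlier raw-data example is available. The output data set name CODES_NAMED is hypothetical.

```sas
/* Sketch only: assumes %build_format has run and created $DSM5code_to_name. */
/* CODES_NAMED is a hypothetical output data set name.                       */
data codes_named;
   set codes;
   length dsm5name $100;   /* matches the DSM5name length in the XML map */
   dsm5name = put(dsmcode, $DSM5code_to_name.);
run;

proc print data=codes_named noobs label;
run;
```

The raw code is kept alongside the diagnosis name so that values absent from the model remain traceable.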

BONUS – GAME OF THRONES EDITION

The Game of Thrones world contains characters who hail from various regions, and this data model (spreadsheet) can be exported to CSV and imported into SAS.

SAS® Data-Driven Development, page 339

A data set (Favorites) might list the favorite characters and seasons (of fans) as text-mined from a user forum:

Jaime Lannister,2
Cersei Lannister,7
Daenerys,4
Daneris,7
Daenerys Targaryen,2
Jaime,1
Kit Harington,2
Aria Stark,7
Tyrion Lannister,3
Sansa Stark,5

SAS® Data-Driven Development, page 281

The BUILD_FORMAT macro can be used to perform entity resolution:

%build_format(fmtname=variation_to_character,
   dsnmodel=GOT_model_tabular,
   var1=variation,
   var2=character);

data validate;
   length newcharacter $50;
   set favorites;
   newcharacter=put(favcharacter, $variation_to_character.);
run;

BUILD_FORMAT can be used a second time to categorize characters into regions:

%build_format(fmtname=character_to_region,
   dsnmodel=GOT_model_tabular,
   var1=character,
   var2=region);

data validate;
   length region $50 newcharacter $50;
   set favorites;
   newcharacter=put(favcharacter, $variation_to_character.);
   region=put(newcharacter, $character_to_region.);
run;

With two calls to BUILD_FORMAT, the unruly data are cleaned and categorized.

SAS® Data-Driven Development, page 294

CONCLUSION

• Complex, hierarchical, and dynamic data models should be maintained external to code to facilitate data independence.

• Through data-driven software design, external data models support software modularity, interoperability (beyond SAS), and master data management.

• This text introduces the BUILD_FORMAT macro, which dynamically builds complex and hierarchical SAS formats from XML files, Excel spreadsheets, and other canonical formats.
