Prediction of costs associated with the hospital …...associated diagnosis: B20*, B21*, B22*, B23*,...



Figure 3: THE 4 STEPS OF THE DATA MINING APPROACH

A DATA SCIENTIST CONFIGURES THE ALGORITHM
- Information gain criterion
- Learning sample ratio
- Maximal depth
- Observation weights
- Statistical pruning
- Stopping criterion

SELECTION OF ALL THE PATIENTS' FEATURES (see Figure 2)
- Gender: 34% / 66%
- Mean age: 48 years
- Comorbidities: 10 groups
- Opportunistic infections: 5 groups
- Explained variable: average cost (mean annual cost per patient: € 5,981)

THE ALGORITHM SPLITS PATIENTS INTO 2 SUBGROUPS using the most discriminative variable for the cost
- First split: did the patients have an infection unrelated to HIV?

REPETITION OF STEP 2 ON EACH SUBGROUP UNTIL
- the remaining variables are not discriminative enough, OR
- there are too few patients in each subgroup to split again, OR
- the number of subgroups reaches a given threshold

RESULTS: 165 PROFILES (each leaf of the tree is a profile of patients)

Tree shown in the figure (patients, share of the cohort, mean annual cost per patient):

All patients: 30,294 patients, € 5,981
  Infection unrelated to HIV? NO: 26,706 patients (88.1%), € 4,473
    Presence of a bacterial infection? NO: 25,369 patients (83.7%), € 3,850
      Presence of a cancer? NO: 21,980 patients (72.6%), € 1,280
      Presence of a cancer? YES: 3,389 patients (11.2%), € 9,190
    Presence of a bacterial infection? YES: 1,337 patients (4.4%), € 16,293
      Presence of an endocrino-metabolic complication? NO: 804 patients (2.6%), € 12,016
      Presence of an endocrino-metabolic complication? YES: 533 patients (1.8%), € 22,746
  Infection unrelated to HIV? YES: 3,463 patients (11.4%), € 17,602
    Presence of a cancer? NO: 2,447 patients (8.1%), € 13,084
    Presence of a cancer? YES: 1,016 patients (3.5%), € 28,484

EXAMPLE OF A PROFILE (804 patients, € 12,016)
1 - No infection unrelated to HIV
2 - Presence of a bacterial infection
3 - Absence of an endocrino-metabolic comorbidity
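The splitting procedure of Figure 3 can be illustrated with a minimal sketch. It is not the authors' Enhanced Decision Tree: it assumes yes/no features only, uses the reduction in the sum of squared errors of the cost as the split criterion (the usual CART choice for a numeric target), and stands in for the "number of subgroups" threshold with a maximal depth. The function names, the parameters `min_size`, `min_gain` and `max_depth`, and the toy patients are all hypothetical.

```python
from statistics import mean

def sse(costs):
    """Sum of squared deviations of costs from their mean."""
    m = mean(costs)
    return sum((c - m) ** 2 for c in costs)

def best_split(patients, features):
    """Pick the yes/no feature whose split most reduces the cost SSE."""
    parent = sse([p["cost"] for p in patients])
    best_feature, best_gain = None, 0.0
    for f in features:
        no = [p["cost"] for p in patients if not p[f]]
        yes = [p["cost"] for p in patients if p[f]]
        if not no or not yes:
            continue  # feature is constant in this subgroup
        gain = parent - sse(no) - sse(yes)
        if gain > best_gain:
            best_feature, best_gain = f, gain
    return best_feature, best_gain

def grow(patients, features, min_size=2, min_gain=1e-9, max_depth=3, path=()):
    """Split recursively and return the leaves, i.e. the patient profiles."""
    leaf = [{"profile": path, "n": len(patients),
             "mean_cost": mean(p["cost"] for p in patients)}]
    if max_depth == 0:
        return leaf  # depth limit reached
    feature, gain = best_split(patients, features)
    if feature is None or gain < min_gain:
        return leaf  # remaining variables are not discriminative enough
    no = [p for p in patients if not p[feature]]
    yes = [p for p in patients if p[feature]]
    if min(len(no), len(yes)) < min_size:
        return leaf  # too few patients in a subgroup to split again
    rest = [f for f in features if f != feature]
    return (grow(no, rest, min_size, min_gain, max_depth - 1, path + ((feature, False),))
            + grow(yes, rest, min_size, min_gain, max_depth - 1, path + ((feature, True),)))

# Hypothetical toy cohort: two patients per combination of the two features.
PATIENTS = [
    {"bacterial": b, "cancer": c, "cost": cost}
    for (b, c), costs in {
        (False, False): (1000, 1200), (False, True): (9000, 9400),
        (True, False): (16000, 16600), (True, True): (28000, 29000),
    }.items()
    for cost in costs
]
leaves = grow(PATIENTS, ["bacterial", "cancer"])
for leaf in leaves:
    print(leaf)
```

On this toy cohort the first split is on the bacterial infection flag, because it reduces the cost dispersion more than the cancer flag; each leaf then carries the conjunction of answers that defines a profile, in the same spirit as the example profile of Figure 3.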

Prediction of costs associated with the hospital management of HIV patients in France

ADVANCED DATA MINING APPROACH

Martin PRODEL (1,3), Camille AMAZ (1), Alexandre VAINCHTOCK (1), Laurent FINKIELSZTEJN (2), Céline AUBIN (2)

(1) HEVA, Lyon, France; (2) ViiV Healthcare, Marly-le-Roi, France; (3) Mines, Saint-Étienne, France

INTRODUCTION

Data mining is not new and has been efficiently employed in other domains (banking, cybernetics, marketing, energy, etc.) [1]. The amount of data collected in medical information systems is tremendous, but much of it remains unused because of its complexity [2]. This still-increasing volume of data requires new analytical approaches that are efficient, sensitive and better suited than classical statistics to handling Big Data [1].

CRITERIA: DATA MINING versus STATISTICS

PARADIGM
- Data mining: Explanatory approach, discover the unknown with no preconception.
- Statistics: (In)validate a predefined hypothesis, risk of experimenter bias.

MAIN FEATURE
- Data mining: Unrivalled quality of predictions and explanations, using cross-validation. Uses extrapolation for missing data.
- Statistics: Precise measurement of uncertainty, comparison of 2 populations, confidence assessment of measures.

SECOND FEATURE
- Data mining: High adaptability and reusability in other domains.
- Statistics: The mathematical definition of statistical tests ensures significant results.

EXAMPLES OF TECHNIQUES
- Data mining: Decision Tree, Neural Network, Bayesian, Genetic Algorithm, Support Vector Machine.
- Statistics: Principal Component Analysis (PCA), p-value testing, multivariate and univariate regression.

VOLUME
- Data mining: Capable of dealing with millions of observations ("Big Data").
- Statistics: Suitable for medium-sized datasets (< 1 million observations).

TYPICAL RESULTS
- Data mining: Selects and reports the non-linear combinations of many variables that best explain the value of the target variable.
- Statistics: Logistic regression tells you that X% of the cost is explained by having a given comorbidity; PCA tells you whether variables are redundant.

OBJECTIVE

The main objective of this study is to assess whether a state-of-the-art data mining technique can be applied to a healthcare database in order to better understand the drivers of healthcare expenditure and the management of diseases. We used the approach to identify the patient features that are the main drivers of the cost of HIV patients' hospital management. A secondary objective is to assess how reproducible such an approach is on other medical databases with potentially different patient features.

RESULTS

KEY FIGURES

Study period: January 2013 - December 2014
Number of HIV patients: 30,294
Number of stays: 70,180
Total annual cost: 180 million €
Mean annual cost per patient: 5,981 € (+/- 10,661)
Number of patient profiles found using data mining: 165

METHODS

DATA EXTRACTION

The PMSI-MSO database (French Medical Information System, Medicine, Surgery and Obstetrics units) was used to extract all hospital stays in 2013 with at least one of the following HIV ICD-10 (International Classification of Diseases, 10th revision) codes as principal diagnosis, related diagnosis or significantly associated diagnosis: B20*, B21*, B22*, B23*, B24*.
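As an illustration of this inclusion rule, the sketch below keeps a stay when any of its three diagnosis positions starts with one of the listed prefixes. The record layout (the `principal`, `related` and `associated` fields) and the sample stays are hypothetical and do not reflect the actual PMSI file format.

```python
# ICD-10 prefixes listed in the extraction rule (B20* to B24*).
HIV_PREFIXES = ("B20", "B21", "B22", "B23", "B24")

def is_hiv_stay(stay):
    """True if the principal, related or any associated diagnosis is an HIV code."""
    codes = [stay.get("principal"), stay.get("related")]
    codes += list(stay.get("associated", []))
    # str.startswith accepts a tuple of prefixes; missing codes are skipped.
    return any(code and code.upper().startswith(HIV_PREFIXES) for code in codes)

# Hypothetical stays, for illustration only.
stays = [
    {"id": 1, "principal": "B20.6", "related": None, "associated": []},
    {"id": 2, "principal": "J18.9", "related": None, "associated": ["E11.9"]},
    {"id": 3, "principal": "J18.9", "related": None, "associated": ["B24"]},
]
hiv_stay_ids = [s["id"] for s in stays if is_hiv_stay(s)]
print(hiv_stay_ids)
```

Stay 3 is retained through an associated diagnosis even though its principal diagnosis is not an HIV code, which is the point of checking all three positions.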

HIV HOSPITALIZATION COST

The costs of stays were evaluated from an NHS perspective. The PMSI database provides many details about the cost of stays, as it was specifically designed for the economic evaluation of hospital activities.

OUR DATA MINING ALGORITHM

Data were analyzed with a supervised-learning data mining technique: an Enhanced Decision Tree algorithm based on Breiman's CART model [3] (Figure 3).

JESS 2017, 20 April 2017, Télécom ParisTech

CONCLUSION

The present study is a pilot that successfully demonstrates how a data mining technique from the field of Artificial Intelligence can help us better understand hospital costs for HIV patients: our data mining algorithm identified specific patient profiles that explain the differentiating cost drivers in HIV inpatient care. Classical statistical approaches would struggle to provide such detailed profiles with numerous combinations of variables.

1 - This approach could work with other data sources, especially with more clinical and laboratory data (e.g. viral load). It can also explain drivers for any available target variable (cost, death, patients virologically suppressed or patients dropping out of care).

2 - Such collaborative efforts between health care professionals and engineers can lead to enhanced uses of health data that provide new answers and thus improve disease management.
Example: Preventing a bacterial infection in newly diagnosed HIV patients with endocrino-metabolic comorbidities reduces costs much more than preventing a fungal infection.

3 - Based on the results of data mining analyses, clinicians will know which medical interventions to focus on and prioritize.
Example: Results show that newly diagnosed HIV patients are more likely to get bacterial infections if they already have neurological and renal comorbidities (= predictive capabilities).

REFERENCES

1. Adem Karahoca, Dilek Karahoca and Mert Şanver (2012). Survey of Data Mining and Applications (Review from 1996 to Now). In: Data Mining Applications in Engineering and Medicine, Associate Prof. Adem Karahoca (Ed.), InTech. DOI: 10.5772/48803.
2. Herland, Matthew, Taghi M. Khoshgoftaar, and Randall Wald. "A review of data mining using big data in health informatics." Journal of Big Data, 2014: 1-35.
3. Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). "Classification and Regression Trees", Wadsworth, Belmont, CA. Republished by CRC Press.

Figure 1: METHODOLOGICAL DIVERGENCES BETWEEN DATA MINING AND STATISTICS

1-YEAR FOLLOW-UP OF PATIENTS & CLASSIFICATION OF STAYS

For each patient hospitalized at least once in 2013 with an HIV code, a 12-month follow-up was performed to capture any further hospital stays. The same algorithm and medical review as for the inclusion stay were applied to these additional stays, in order to exclude stays not related to HIV. All stays were classified by expert HIV clinicians into 10 groups of comorbidities and 5 groups of opportunistic infections (OI) using ICD-10 codes (Figure 2).
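The follow-up window can be sketched as follows. This is only an illustration under stated assumptions: the index date is taken to be the patient's first 2013 admission, the window is half-open (a stay admitted exactly 12 months later is excluded), and the field names `patient` and `admission` are hypothetical. The medical review that excludes stays unrelated to HIV is not modelled.

```python
from datetime import date

def follow_up_end(index_date):
    """Date 12 months after the index stay (29 February falls back to 28 February)."""
    try:
        return index_date.replace(year=index_date.year + 1)
    except ValueError:
        return index_date.replace(year=index_date.year + 1, day=28)

def stays_in_follow_up(stays):
    """Keep, per patient, the stays from the first 2013 admission up to 12 months later."""
    index = {}
    for s in sorted(stays, key=lambda s: s["admission"]):
        if s["admission"].year == 2013:
            index.setdefault(s["patient"], s["admission"])  # first 2013 stay only
    kept = []
    for s in stays:
        start = index.get(s["patient"])
        if start is not None and start <= s["admission"] < follow_up_end(start):
            kept.append(s)
    return kept

# Hypothetical stays, for illustration only.
stays = [
    {"id": 1, "patient": "A", "admission": date(2013, 3, 10)},   # index stay
    {"id": 2, "patient": "A", "admission": date(2013, 11, 2)},   # within 12 months
    {"id": 3, "patient": "A", "admission": date(2014, 3, 10)},   # window already closed
    {"id": 4, "patient": "B", "admission": date(2014, 1, 15)},   # no 2013 stay
    {"id": 5, "patient": "C", "admission": date(2013, 12, 20)},  # index stay
    {"id": 6, "patient": "C", "admission": date(2014, 6, 30)},   # within 12 months
]
kept_ids = [s["id"] for s in stays_in_follow_up(stays)]
print(kept_ids)
```

With inclusion stays in 2013 and a 12-month window, the latest possible follow-up stay falls in December 2014, which is consistent with the study period given in the key figures.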

Figure 2: INPUT VARIABLES FOR THE ANALYSIS OF PATIENT PROFILES WITH DATA MINING

Comorbidity groups (patients, share of the cohort):
- Endocrino-metabolic: 8,373 p. (28%)
- Cardiovascular: 7,616 p. (25%)
- Psychiatric: 7,439 p. (25%)
- Hepatic: 6,906 p. (23%)
- Cancer: 4,724 p. (16%)
- Neurological: 4,576 p. (15%)
- Respiratory diseases: 4,084 p. (13%)
- Infections unrelated to HIV: 3,486 p. (12%)
- Renal: 3,011 p. (10%)
- Bone: 2,885 p. (10%)

Opportunistic infection groups (patients, share of the cohort):
- Bacterial: 2,058 p. (7%)
- Fungal: 1,419 p. (5%)
- Viral: 1,369 p. (5%)
- Protozoan: 660 p. (2%)
- Other: 614 p. (2%)

Figure 4: DISCOVERY OF PATIENT PROFILES

Axis: annual cost / patient (ticks: € 1,500, € 5,981, € 20,000, € 30,000, € 43,000)

Questions defining the profiles (each answered NO or YES):
1 - Infections unrelated to HIV?
2 - Bacterial opportunistic infection?
3 - Presence of a cancer?
4 - Fungal opportunistic infection?
5 - Endocrino-metabolic comorbidity?

ENTIRE COHORT: 30,294 patients, mean cost € 5,981 (70,180 stays, total annual cost € 180 million)

PROFILES 1 to 4, values as shown in the figure (the answers defining each profile are not recoverable from the transcript):
- 25,369 p. (84%), mean cost € 3,850
- 19,078 p. (63%), mean cost € 2,388
- 533 p. (2%), mean cost € 22,746
- 632 p. (2%), mean cost € 33,830
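The percentages in Figure 2 can be recomputed from the patient counts and the cohort size of 30,294. The short check below assumes that labels and counts pair up in the order in which they appear in the figure (largest group first); the variable and function names are illustrative.

```python
COHORT = 30_294  # number of HIV patients (key figures)

# Counts from Figure 2, paired with labels in order of appearance (an assumption).
COMORBIDITY_COUNTS = {
    "Endocrino-metabolic": 8_373, "Cardiovascular": 7_616, "Psychiatric": 7_439,
    "Hepatic": 6_906, "Cancer": 4_724, "Neurological": 4_576,
    "Respiratory diseases": 4_084, "Infections unrelated to HIV": 3_486,
    "Renal": 3_011, "Bone": 2_885,
}
OI_COUNTS = {
    "Bacterial": 2_058, "Fungal": 1_419, "Viral": 1_369,
    "Protozoan": 660, "Other": 614,
}

def shares(counts, total=COHORT):
    """Share of the cohort in each group, as a whole percentage."""
    return {name: round(100 * n / total) for name, n in counts.items()}

comorbidity_shares = shares(COMORBIDITY_COUNTS)
oi_shares = shares(OI_COUNTS)
for name, pct in {**comorbidity_shares, **oi_shares}.items():
    print(f"{name}: {pct}%")
```

The comorbidity shares add up to well over 100%, which suggests that the groups are not mutually exclusive and that a patient can belong to several of them.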