Cancer Diagnostic Prediction with Amazon ML – A Tutorial

By Kato Mivule, Researcher, June 2015


Agenda

•  The Dataset

•  Amazon ML Account setup

•  S3 Services – data storage

•  The ML Model

•  Results

•  Conclusion

•  References


•  Characteristics of the Wisconsin Breast Cancer dataset are given in the accompanying figure.

•  The dataset contains 11 attributes: 10 for the observations and 1 for the class label.

•  The goal is to use the data collected from the observations to predict whether a future diagnosis from data with similar traits will be 2 (Benign) or 4 (Malignant).
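The class coding described above can be illustrated with a minimal sketch. The column names ID and Feature1 to Feature9 and the two sample rows are hypothetical and made up for illustration; only the 2 = Benign, 4 = Malignant coding and the 11-column layout come from the slides. Treating the first observation column as a row identifier is an assumption.

```python
import csv
import io

# Hypothetical sample: 10 observation columns (the first treated as an ID)
# plus the class label. Values are illustrative, not taken from the dataset.
SAMPLE = (
    "ID,Feature1,Feature2,Feature3,Feature4,Feature5,"
    "Feature6,Feature7,Feature8,Feature9,Label1\n"
    "1001,5,1,1,1,2,1,3,1,1,2\n"
    "1002,8,10,10,8,7,10,9,7,1,4\n"
)

# Class coding stated in the slides.
LABEL_NAMES = {2: "Benign", 4: "Malignant"}


def read_diagnoses(text):
    """Return (row_id, diagnosis_name) pairs from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["ID"], LABEL_NAMES[int(row["Label1"])]) for row in reader]


diagnoses = read_diagnoses(SAMPLE)
print(diagnoses)
```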

Cancer Diagnostic Prediction with Amazon ML – The Dataset


The Wisconsin Breast Cancer Dataset Characteristics: UCI Machine Learning Repository

•  Download the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning repository.

•  Online at: [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)]



Preprocessing:

•  In the first row of the data, name each column (attribute).

•  At this stage, ensure that missing values are replaced with the average or the most frequent value of the column. Amazon ML currently does not handle missing values well.
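The replacement of missing values described above could be sketched as follows. The set of missing-value markers and the rounding of the mean to a whole number are assumptions made for this illustration; adjust both to match the actual file.

```python
from collections import Counter
from statistics import mean

# Assumed markers for a missing cell; check which one your file uses.
MISSING = {"", "?", "NA"}


def impute_column(values, strategy="mean"):
    """Replace missing cells with the column mean or the most frequent value."""
    present = [v for v in values if v not in MISSING]
    if strategy == "mean":
        # Rounded to a whole number on the assumption of integer-valued features.
        fill = str(round(mean(float(v) for v in present)))
    else:  # "most_frequent"
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v in MISSING else v for v in values]


column = ["1", "10", "?", "1", "3"]
by_mean = impute_column(column, "mean")
by_mode = impute_column(column, "most_frequent")
print(by_mean)
print(by_mode)
```

Mean replacement suits roughly symmetric numeric columns, while the most frequent value is the safer choice for skewed or categorical columns.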


•  Make sure to save your file as a CSV file using the Windows CSV format if you are using MS Excel.
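For readers preparing the file outside MS Excel, a comparable result can be sketched with Python's csv module, assuming that "Windows CSV format" means comma-separated values with CRLF line endings. The header names and rows below are hypothetical.

```python
import csv
import io


def to_windows_csv(header, rows):
    """Serialize a header and rows as comma-separated text with CRLF line endings."""
    buffer = io.StringIO(newline="")  # keep line endings exactly as written
    writer = csv.writer(buffer, lineterminator="\r\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical column names and values, for illustration only.
text = to_windows_csv(["ID", "Feature1", "Label1"], [[1001, 5, 2], [1002, 8, 4]])
print(repr(text))
```

When writing to disk, open the file with `open(path, "w", newline="")` so that the line endings are not translated a second time.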


•  Log into your Amazon Web Services (AWS) account.

•  You can use the same credentials if you already buy and sell on Amazon.

•  Online: http://www.aws.amazon.com/

Cancer Diagnostic Prediction with Amazon ML – The Account


•  Once logged in, you will notice many services offered by Amazon.

•  Our interest for now is in two of them: S3, under Storage and Content Delivery, and Machine Learning, under Analytics.

•  To start, click on the S3 web service link under Storage and Content Delivery.

•  The S3 web service allows us to upload and store data on the Amazon cloud.


•  Clicking on the S3 link should bring us inside the S3 console.

•  To create a new bucket to store our dataset, click on the Create Bucket tab.

Some entries in the lists were blacked out for security reasons.

Cancer Diagnostic Prediction with Amazon ML – S3 Services


•  At this point you can give your data bucket a name.

•  AWS requires you to select a “Region” where your data will be stored.

•  For now we shall go with the default region, “US Standard”.




•  On the right side of the S3 panel are the None, Properties, and Transfers tabs.

•  Click on your bucket-name link on the left to open the data bucket.




•  Once inside the data bucket, the S3 console shows no datasets because the bucket is empty.

•  Next, click on the Upload tab in the top-left corner to upload data.




•  You can either drag and drop datasets directly into the bucket or use the “Add Files” button to upload the old-fashioned way.

•  Keep in mind that Amazon ML currently supports only CSV files.



•  After a successful upload of data, click on the radio button on the left to highlight the new dataset and then click on the Properties Tab on the right to learn more about the dataset.

•  Copy the URL provided for the dataset on the right of the bucket panel. This S3 link will be needed to tell Amazon ML where to access the data.
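The console steps above (create a bucket, upload the CSV, note its location) might also be scripted. The following is a minimal sketch, assuming the boto3 package is installed, AWS credentials are configured, and the default US Standard (us-east-1) region is used. The bucket and file names are hypothetical, and the `s3://bucket/key` form of the location is an assumption about what Amazon ML expects rather than something stated in the slides.

```python
def s3_uri(bucket, key):
    """Build the s3://bucket/key location for an uploaded object."""
    return f"s3://{bucket}/{key}"


def upload_dataset(local_path, bucket, key):
    """Create the bucket, upload the CSV file, and return its S3 location.

    Needs boto3 and configured AWS credentials; not executed in this sketch.
    """
    import boto3  # imported lazily so s3_uri works without boto3 installed

    s3 = boto3.client("s3")
    s3.create_bucket(Bucket=bucket)          # default region, no LocationConstraint
    s3.upload_file(local_path, bucket, key)  # arguments: Filename, Bucket, Key
    return s3_uri(bucket, key)


# Hypothetical names, for illustration only.
location = s3_uri("my-cancer-data", "breast-cancer-wisconsin.csv")
print(location)
```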



•  For now, we are done with the S3 web service under the Storage and Content Delivery section.

•  We return to the main Amazon Web Services panel and choose the Machine Learning service under the Analytics section.

Cancer Diagnostic Prediction with Amazon ML – ML Model


•  At this point, we are presented with the Amazon Machine Learning panel.

•  Click on the Create new tab on the left of the panel.

•  Select Datasource and ML model. This allows us to use the cancer data we uploaded to the S3 service.



•  Selecting the Datasource option brings us to the Create datasource panel.

•  The first step is to select where Amazon ML will get the data.

•  Select the S3 radio button and enter the link saved from the S3 service after uploading the cancer diagnosis dataset.



•  Amazon ML requests permission to access your dataset in the S3 service.

•  Select “Yes” to proceed.



•  Amazon ML indicates that it successfully accessed and validated your data in the S3 storage service.

•  Click “Continue” to proceed.



•  Select the “Yes” radio button since the cancer dataset contains column names.

•  In this preprocessing step, Amazon ML allows the data schema to be edited to choose the data type of each attribute.



•  The next step is to select the “Target” – the attribute that will work as the class label for classification of the data.

•  Label1 is chosen for this particular cancer dataset. It contains two classes representing cancer cases diagnosed as 2 for Benign and 4 for Malignant.

•  Amazon ML uses the data type of the “Target” attribute to automatically select the ML algorithm. In this case, Regression is chosen because the label values 2 and 4 are read as numeric. Later, Binary Classification will be used.



•  Amazon ML allows the selection of a row identifier attribute to help track which prediction (in this case, class label 2 or 4) corresponds to which observation.
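The schema, target, and row identifier chosen in the console steps above can also be written down as a JSON data schema. The sketch below builds one in Python. The field names follow the schema format described in the Amazon ML Developer Guide as best understood here and should be checked against it; the ID column name is hypothetical, while Label1 and the Feature naming come from the slides. Label1 is typed NUMERIC to mirror the Regression choice described above.

```python
import json


def build_schema(feature_names, target, row_id):
    """Assemble a data schema dict with a row identifier and a numeric target."""
    attributes = [{"attributeName": row_id, "attributeType": "CATEGORICAL"}]
    attributes += [
        {"attributeName": name, "attributeType": "NUMERIC"} for name in feature_names
    ]
    attributes.append({"attributeName": target, "attributeType": "NUMERIC"})
    return {
        "version": "1.0",
        "rowId": row_id,
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,  # the CSV has column names in row one
        "attributes": attributes,
    }


# Hypothetical ID column; Feature1..Feature9 and Label1 follow the slides' naming.
features = [f"Feature{i}" for i in range(1, 10)]
schema = build_schema(features, target="Label1", row_id="ID")
print(json.dumps(schema, indent=2))
```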



•  Updates and corrections can still be made at this point.

•  Click on the Edit button to review the Input data, Schema, and Target.

•  Regression is chosen as the model type for the Label1 target attribute, but the Target section will be edited later to choose Binary Classification.



•  Data splitting: by default, Amazon ML divides the dataset into two parts, with 70 percent of the data for Training and the remaining 30 percent for Testing. This is a single hold-out split rather than k-fold cross-validation.

•  In this case, the breast cancer diagnosis data was divided into 488 records for Training and 211 records for Testing.
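The split described above can be sketched as a simple hold-out partition. Splitting the rows in file order is an assumption for this illustration, and the rounding used here will not necessarily reproduce the exact 488/211 counts reported by the console; the second half of the sketch only checks what share of the data those reported counts represent.

```python
def holdout_split(rows, train_fraction=0.7):
    """Split rows in order into a training part and a testing part."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Toy example with 10 placeholder rows.
train, test = holdout_split(list(range(10)))
print(len(train), len(test))

# Shares implied by the record counts reported in the slides.
total = 488 + 211
train_share = round(100 * 488 / total, 1)
test_share = round(100 * 211 / total, 1)
print(total, train_share, test_share)
```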



•  In the Review Panel, a summary of the ML model is given and adjustments can still be made at this point to the model settings.

•  Click Finish to proceed once you are satisfied with the settings.
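The console workflow reviewed above (datasource, 70/30 split, model) might also be driven from code. This is a sketch only, assuming boto3's `machinelearning` client is available to your account; all identifiers and S3 locations are hypothetical, and the schema is assumed to be stored as a separate file in S3. The request builder is plain Python, and the function that calls AWS is defined but not executed here.

```python
import json


def datasource_request(ds_id, name, data_location, schema_location, begin, end):
    """Build keyword arguments for create_data_source_from_s3."""
    rearrangement = {"splitting": {"percentBegin": begin, "percentEnd": end}}
    return {
        "DataSourceId": ds_id,
        "DataSourceName": name,
        "DataSpec": {
            "DataLocationS3": data_location,
            "DataSchemaLocationS3": schema_location,
            "DataRearrangement": json.dumps(rearrangement),
        },
        "ComputeStatistics": True,  # needed for datasources used in training
    }


def submit(train_request, model_id, model_type="REGRESSION"):
    """Create the training datasource and an ML model (needs AWS credentials)."""
    import boto3  # imported lazily; not required to build the request

    ml = boto3.client("machinelearning")
    ml.create_data_source_from_s3(**train_request)
    ml.create_ml_model(
        MLModelId=model_id,
        MLModelName=model_id,
        MLModelType=model_type,  # "REGRESSION", "BINARY", or "MULTICLASS"
        TrainingDataSourceId=train_request["DataSourceId"],
    )


# Hypothetical identifiers and locations, for illustration only.
train_request = datasource_request(
    "ds-cancer-train",
    "cancer training data",
    "s3://my-cancer-data/breast-cancer-wisconsin.csv",
    "s3://my-cancer-data/breast-cancer-wisconsin.csv.schema",
    0,
    70,
)
print(train_request["DataSpec"]["DataRearrangement"])
```

A matching evaluation datasource would use the same builder with `begin=70` and `end=100`.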



•  After execution, the ML model report is returned. The Evaluation status, on the right side under the Evaluation Summary, should read Completed, in green.

•  Under the ML model report is the Evaluations link, with a dropdown menu offering Summary, Alerts, and Explore performance links to evaluate performance.

•  Amazon ML returns a performance metric value, and the Explore model performance button provides further visualizations of the results.

Cancer Diagnostic Prediction with Amazon ML – Results


The RMSE: Amazon Machine Learning Developers Guide – Evaluating ML Models

•  Amazon ML returns the root-mean-square error (RMSE) value for Regression models.



•  The root-mean-square error (RMSE) value is returned for the Regression model.

•  The smaller the RMSE, the better the performance of the ML model.

•  Amazon ML reports that for this experiment, the Regression model achieved an RMSE of 0.35, better than the baseline of 0.90.



•  Amazon ML provides a visual distribution of residuals for the ML Regression model in the form of a bar chart, with the option to change the bin width.

•  Where Residual = Observed value – Predicted value.
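The residual definition above, together with RMSE, can be illustrated with a short sketch. The observed and predicted values are made up for illustration and are not the experiment's data. The baseline shown here is a model that always predicts the mean of the observed targets, which is an assumption about how the console's baseline is defined.

```python
import math


def residuals(observed, predicted):
    """Residual = observed value - predicted value, per record."""
    return [o - p for o, p in zip(observed, predicted)]


def rmse(observed, predicted):
    """Root-mean-square error: square root of the mean squared residual."""
    res = residuals(observed, predicted)
    return math.sqrt(sum(r * r for r in res) / len(res))


# Made-up toy values using the 2/4 label coding.
observed = [2, 4, 2, 4]
predicted = [2.5, 3.5, 2.0, 4.0]

# Assumed baseline: always predict the mean of the observed targets.
baseline = [sum(observed) / len(observed)] * len(observed)

model_rmse = rmse(observed, predicted)
baseline_rmse = rmse(observed, baseline)
print(residuals(observed, predicted), model_rmse, baseline_rmse)
```

In this toy case the model's RMSE is lower than the baseline's, which is the same kind of comparison the console reports.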



•  Under the ML model report, click on the Evaluations dropdown menu and then click the Alerts link.

•  A summary of the criteria used to evaluate the ML model is given, showing the training/testing split, the number of records used for both training and testing, and the schema attributes used.

•  488 records were used for Training, while 211 were used for Testing (evaluation data).



•  Amazon ML provides the option to learn about the characteristics of the dataset being used for the ML model.

•  Go back to the Amazon ML dashboard and, in the list of Entities, click on the Entity Name whose Type is Datasource.



•  A frequency distribution is given for the class Label1 attribute in the Training sample data, showing 295 cases listed as 2 = Benign while 193 as 4 = Malignant.

•  A total of 488 records were used for Training, while 211 records were reserved for Testing.
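The class frequencies reported above can be checked with a few lines of Python. The label list is rebuilt from the two counts given in the slides, not read from the dataset.

```python
from collections import Counter

# Rebuilt from the counts reported for the Training sample.
labels = [2] * 295 + [4] * 193

counts = Counter(labels)
total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total, dict(counts), shares)
```

The two counts add up to the 488 training records, with roughly 60 percent Benign and 40 percent Malignant, so the training sample is moderately imbalanced.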



•  Amazon ML gives basic descriptive statistics for each attribute in the dataset.

•  Click on Preview, on the right of the table, to see a visualization for each attribute.



•  The Preview for Feature6 in the dataset gives a visualization of the frequency distribution and a summary of the basic descriptive statistics for that attribute.



The F1 score: Amazon Machine Learning Developers Guide – Evaluating ML Models

•  In this next section, the Target parameters are edited to select Binary Classification.

•  Run the Binary Classification ML model and compare its results with those from the Regression ML model.

•  A summary of the Binary Classification ML model performance returns an F1 score of 0.94.

•  The F1 score ranges between 0 and 1; a higher F1 score, in this case 0.94, indicates better performance for the Binary Classification model.



•  Amazon ML provides both the F1 score metric value and visualization for the Binary Classification ML model.

•  Hovering the cursor over each rectangular box displays the percentage of records correctly classified and the percentage misclassified.



•  To explore the visualization aspect of the model, click on the Explore model performance button. A confusion matrix is presented.

•  The Predicted values are on the horizontal axis, while the True values are on the vertical axis. The F1 score values are presented for each row, including the totals.



•  211 records were used as Testing data for the Binary Classification ML model.

•  163 records belonged to Label 2 (Benign); the other 48 belonged to Label 4 (Malignant).

•  99% of the cancer cases labeled Benign in the Testing data were correctly predicted as belonging to group 2 (Benign), while only 0.61% of the same records were misclassified as belonging to group 4 (Malignant).

•  The F1 score for group 2 was 0.98, close to a perfect score of 1.



•  85% of the cancer cases labeled Malignant in the Testing data were correctly predicted as belonging to group 4 (Malignant), while 14.58% of the same records were mistakenly predicted as group 2 (Benign).

•  The F1 score for group 4 was 0.91. The overall F1 score averaged 0.94.
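The per-class F1 scores above can be cross-checked from the reported figures. In this sketch the record counts in the confusion matrix are derived by applying the reported misclassification percentages to the 163 Benign and 48 Malignant testing records; the derived counts are an inference, not numbers read off the console.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


benign_total, malignant_total = 163, 48

# Counts derived from the reported misclassification rates (an inference).
benign_wrong = round(benign_total * 0.0061)        # Benign predicted as Malignant
malignant_wrong = round(malignant_total * 0.1458)  # Malignant predicted as Benign
benign_right = benign_total - benign_wrong
malignant_right = malignant_total - malignant_wrong

# A Benign false positive is a Malignant record predicted as Benign, and vice versa.
f1_benign = f1(tp=benign_right, fp=malignant_wrong, fn=benign_wrong)
f1_malignant = f1(tp=malignant_right, fp=benign_wrong, fn=malignant_wrong)
f1_average = (f1_benign + f1_malignant) / 2

print(benign_right, benign_wrong, malignant_right, malignant_wrong)
print(round(f1_benign, 2), round(f1_malignant, 2), round(f1_average, 2))
```

Under these derived counts, the rounded scores agree with the values reported in the slides.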



Conclusion

•  Amazon ML is intuitive and could help the data scientist focus on knowledge discovery while leaving hardware and other computational resource issues to the engineers at Amazon cloud services.

•  The potential for Amazon ML applications in Health Data Science is enormous.

•  ML algorithms are still constrained to the choices provided by Amazon ML, namely binary classification, multiclass classification, and regression models. Including other ML algorithms in the future would provide more choice for comparative studies.

•  Data preprocessing is still a pain: one has to strictly follow Amazon ML guidelines, and currently Amazon ML only accepts CSV files. Automation of this process would be ideal.

Cancer Diagnostic Prediction with Amazon ML – Conclusion


References

•  Amazon ML, Online: [www.aws.amazon.com]

•  Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning repository. Online at: [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)]

•  K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

•  Evaluating ML Models - Amazon Machine Learning Developer Guide, Available Online: [http://docs.aws.amazon.com/machine-learning/latest/dg/evaluating_models.html]

•  Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Cancer Diagnostic Prediction with Amazon ML – References


Thanks

Questions?

Contact Kato Mivule @ [kmivule/gmail/com]

Cancer Diagnostic Prediction with Amazon ML – Questions
