Cancer Diagnostic Prediction with Amazon ML – A Tutorial

By Kato Mivule, Researcher, June 2015

Agenda
•  The Dataset
•  Amazon ML Account setup
•  S3 Services – data storage
•  The ML Model
•  Results
•  Conclusion
•  References

•  The characteristics of the Wisconsin Breast Cancer dataset are given in the accompanying figure.
•  The dataset contains 11 attributes: 10 for the observations and 1 for the class label.
•  The goal is to use the data collected from the observations to predict whether a future diagnosis from data with similar traits will be 2 (Benign) or 4 (Malignant).

Cancer Diagnostic Prediction with Amazon ML – The Dataset

The Wisconsin Breast Cancer Dataset Characteristics: UCI Machine Learning Repository

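•  Download the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning repository.

•  Online at: [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)]

•  The 2/4 class coding and the 699 records used in this tutorial match the repository's Breast Cancer Wisconsin (Original) data file, so confirm that the file you download has that layout.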

•  Preprocessing:
•  Name each column (attribute) in the first row of the data.
•  At this stage, make sure missing values are replaced with the average or most frequent value. Amazon ML currently does not handle missing values in data well.
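
The missing-value step above could be sketched in Python using only the standard library. This is a minimal sketch, assuming missing entries are marked with `?` and that filling with each column's most frequent value is acceptable. The function name `impute_most_frequent` and the sample rows are hypothetical.

```python
from collections import Counter


def impute_most_frequent(rows, missing="?"):
    """Replace missing cells with the most frequent non-missing value of their column.

    `rows` is a list of equal-length lists of strings (data rows only, no header).
    The `missing` marker is an assumption; adjust it to match your file.
    """
    if not rows:
        return []
    n_cols = len(rows[0])
    fills = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] != missing]
        # Fall back to the marker itself if a column has no observed values.
        fills.append(Counter(observed).most_common(1)[0][0] if observed else missing)
    return [
        [fills[col] if cell == missing else cell for col, cell in enumerate(row)]
        for row in rows
    ]


# Hypothetical sample: three records, the second column has one missing value.
sample = [["5", "1", "2"], ["3", "?", "2"], ["6", "1", "4"]]
cleaned = impute_most_frequent(sample)
```

The function returns new rows and leaves the input untouched, so the original file contents can still be inspected after cleaning.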

•  If you are using MS Excel, save your file as a CSV file using the Windows CSV format.

•  Log into your Amazon Web Services (AWS) account.
•  If you already buy and sell on Amazon, you can use the same credentials.
•  Online: http://www.aws.amazon.com/

Cancer Diagnostic Prediction with Amazon ML – The Account

•  Once logged in, you will notice many services offered by Amazon.
•  Our interest for now is in two of them: S3, under Storage and Content Delivery, and Machine Learning, under Analytics.
•  To start, click on the S3 web service link under Storage and Content Delivery.
•  The S3 web service allows us to upload and store data on the Amazon Cloud.

•  Clicking on the S3 link should bring us inside the S3 console.
•  To create a new bucket to store our dataset, click on the Create Bucket tab.

Some entries in the screenshot's lists were blacked out for security reasons.

Cancer Diagnostic Prediction with Amazon ML – S3 Services

•  At this point you can give your data bucket a name.
•  AWS requires you to select a “Region” where your data will be stored.
•  For now we shall go with the default region, “US Standard”.

•  On the right side of the S3 panel are the None, Properties, and Transfers tabs.
•  Click on your bucket-name link on the left to open the data bucket.

•  Once inside the data bucket, the S3 console shows no datasets because the bucket is empty.
•  Next, click on the Upload tab in the top-left corner to upload data.

•  You can either drag and drop datasets directly into the bucket or use the “Add Files” button to upload the old-fashioned way.

•  Keep in mind that Amazon ML currently supports only CSV files.

•  After a successful upload, click on the radio button on the left to highlight the new dataset, and then click on the Properties tab on the right to learn more about the dataset.

•  Copy the URL link provided for the dataset on the right of the bucket panel. The S3 link will be needed to tell Amazon ML where to access the data.
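
If the datasource form expects an `s3://bucket/key` style location rather than the HTTPS link, the copied link can be converted. The sketch below assumes both that expectation and the two URL layouts handled in the code. The function name `to_s3_uri`, the bucket name, and the file name are hypothetical.

```python
from urllib.parse import urlparse, unquote


def to_s3_uri(https_url):
    """Convert an HTTPS S3 object URL into an s3://bucket/key location string.

    Handles two common URL layouts (both are assumptions about the copied link):
      path style:            https://s3.amazonaws.com/<bucket>/<key>
      virtual-hosted style:  https://<bucket>.s3.amazonaws.com/<key>
    """
    parsed = urlparse(https_url)
    host = parsed.netloc
    path = unquote(parsed.path).lstrip("/")
    if host.startswith("s3.") or host.startswith("s3-"):
        # Path style: the bucket is the first path segment.
        bucket, _, key = path.partition("/")
    else:
        # Virtual-hosted style: the bucket is the part of the host before ".s3".
        bucket = host.split(".s3", 1)[0]
        key = path
    return f"s3://{bucket}/{key}"


# Hypothetical bucket and file names, for illustration only.
path_style = to_s3_uri("https://s3.amazonaws.com/my-example-bucket/breast-cancer.csv")
hosted_style = to_s3_uri("https://my-example-bucket.s3.amazonaws.com/breast-cancer.csv")
```

Both calls yield the same location string, which can then be pasted wherever an S3 location is requested.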

•  For now, we are done with the S3 web service under the Storage and Content Delivery section.

•  We return to the main Amazon Web Services panel and choose the Machine Learning service under the Analytics section.

Cancer Diagnostic Prediction with Amazon ML – ML Model

•  At this point, we are presented with the Amazon Machine Learning panel.
•  Click on the Create new tab on the left of the panel.
•  Select Datasource and ML model; this allows us to use the cancer data we uploaded to the S3 service.

•  Selecting the Datasource option brings us to the Create datasource panel.
•  The first step is to select where Amazon ML will get the data.
•  Select the S3 radio button and input the link saved from the S3 service after uploading the cancer diagnosis dataset.

•  Amazon ML requests permission to access your dataset in the S3 services section.
•  Select “Yes” to proceed.

•  Amazon ML indicates that it successfully accessed and validated your data in the S3 storage service.

•  Click “Continue” to proceed.

•  Select the “Yes” radio button since the cancer dataset contains column names.
•  In this preprocessing step, Amazon ML allows the data schema to be edited to choose the data type of each attribute.

•  The next step is to select the “Target” – the attribute that will work as the class label for classification of the data.

•  Label1 is chosen for this particular cancer dataset. It contains two classes, representing cancer cases diagnosed as 2 for Benign and 4 for Malignant.

•  Amazon ML uses the “Target” attribute to automatically select the ML algorithm. Because the values 2 and 4 are read as numeric, Regression is chosen in this case. Later, Binary Classification will be used.
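
One way to prepare the target for Binary Classification is to recode the labels before uploading the file. The following is a minimal sketch, assuming the binary target should be coded 0/1 and that the label column is named `Label1` as in this tutorial. The function name `recode_labels`, the `Feature1` column, and the two sample rows are hypothetical.

```python
import csv
import io


def recode_labels(csv_text, label_column="Label1", mapping=None):
    """Return CSV text with the label column recoded (default: 2 -> 0, 4 -> 1)."""
    mapping = mapping or {"2": "0", "4": "1"}
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Raises KeyError on an unexpected label so bad rows are not silently kept.
        row[label_column] = mapping[row[label_column]]
        writer.writerow(row)
    return out.getvalue()


# Hypothetical two-record file with a header row.
example = "Feature1,Label1\n5,2\n8,4\n"
recoded = recode_labels(example)
```

Here Benign (2) becomes 0 and Malignant (4) becomes 1; pass a different `mapping` if the opposite coding is preferred.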

•  Amazon ML allows for the selection of a row identifier attribute to help track which prediction, in this case class label 2 or 4, corresponds to which observation.

•  Updates and corrections can still be made at this point.
•  Click on the Edit button to review the Input data, Schema, and Target.
•  Regression is chosen as the model type for the Label1 target, but the Target section will be edited later to choose Binary Classification.

•  Cross Validation (here, a single holdout split): By default, Amazon ML divides the dataset into two parts, with 70 percent of the data for Training and the remaining 30 percent for Testing.

•  In this case, the breast cancer diagnosis data was divided into 488 records for Training and 211 records for Testing.
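
A 70/30 holdout split can be sketched as below. This is a minimal sketch, assuming a sequential cut in file order. The function name `holdout_split` is hypothetical, and the list of integers stands in for the 699 data rows.

```python
def holdout_split(records, train_fraction=0.7):
    """Split records sequentially: the first part for training, the rest for testing."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


records = list(range(699))  # stand-in for the 699 data rows
train, test = holdout_split(records)
```

This naive cut gives 489 training and 210 testing records, one record off the 488/211 reported above, so Amazon ML evidently rounds the boundary slightly differently.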

•  In the Review Panel, a summary of the ML model is given and adjustments can still be made at this point to the model settings.

•  Click finish to proceed once you are satisfied with the settings.

•  After execution, the ML model report is returned. The Evaluation status on the right side, under the Evaluation Summary, should read Completed, in green.

•  Under the ML model report is the Evaluations link, with a dropdown menu to Summary, Alerts, and Explore performance links to evaluate performance.

•  Amazon ML returns a performance metric value and the Explore model performance button gives more visualization of results.

Cancer Diagnostic Prediction with Amazon ML – Results

The RMSE: Amazon Machine Learning Developers Guide – Evaluating ML Models

•  Amazon ML returns the root-mean-square error (RMSE) value for Regression models.

•  The root-mean-square error (RMSE) value is returned for the Regression model.
•  The smaller the RMSE, the better the performance of the ML model.
•  Amazon ML reports that for this experiment, the Regression model achieved an RMSE of 0.35, better than the baseline of 0.90.
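
RMSE is the square root of the mean squared difference between observed and predicted values. The sketch below computes it and estimates a baseline from the class counts reported in this tutorial. It assumes the baseline is a model that always predicts the mean label of the training data; that is an assumption about how the baseline is defined, not something the slides state. The function name `rmse` is hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Class counts reported in this tutorial (2 = Benign, 4 = Malignant).
train_labels = [2] * 295 + [4] * 193
test_labels = [2] * 163 + [4] * 48

# Assumed baseline: always predict the mean label of the training data.
train_mean = sum(train_labels) / len(train_labels)
baseline_rmse = rmse(test_labels, [train_mean] * len(test_labels))
```

Under that assumption the baseline comes out at roughly 0.90, in line with the reported baseline value.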

•  Amazon ML provides a visual distribution of residuals for the ML Regression model in the form of a bar chart, with the option to change the bin width.

•  Where Residual = Observed value – Predicted value.
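
The residual bar chart can be approximated by counting residuals per bin. This is a minimal sketch, assuming bins are aligned to multiples of the bin width. The function name `residual_histogram` and the sample values are hypothetical and are not results from the tutorial.

```python
import math
from collections import Counter


def residual_histogram(observed, predicted, bin_width=0.5):
    """Count residuals (observed - predicted) per bin of the given width.

    Returns a dict mapping each bin's lower edge to the number of residuals in it.
    """
    counts = Counter()
    for o, p in zip(observed, predicted):
        residual = o - p
        lower_edge = math.floor(residual / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Hypothetical labels and predictions, for illustration only.
observed = [2, 2, 4, 4]
predicted = [2.2, 1.9, 3.4, 4.1]
histogram = residual_histogram(observed, predicted)
```

A negative residual means the model over-predicted the label, and a positive one means it under-predicted. Changing `bin_width` corresponds to the bin width option in the console.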

•  Under the ML model report, click on the Evaluations dropdown menu and then click the Alerts link.

•  A summary of the criteria used to evaluate the ML model is given, showing the cross validation, number of records for both training and testing, and schema attributes used.

•  488 records were used for Training, while 211 were used for Testing (evaluation data).

•  Amazon ML provides the option to learn about the characteristics of the dataset being used for the ML model.

•  Go back to the Amazon ML dashboard and, in the list of Entities, click on the Entity Name whose Type is Datasource.

•  A frequency distribution is given for the class Label1 attribute in the Training sample data, showing 295 cases listed as 2 = Benign while 193 as 4 = Malignant.

•  A total of 488 records were used for Training, while 211 records were reserved for Testing.

•  Amazon ML gives basic descriptive statistics for each attribute in the dataset.
•  Click on Preview, on the right of the table, to see a visualization for each attribute.

•  The Preview for Feature6 in the dataset gives a visualization of the frequency distribution and a summary of the basic descriptive statistics for that attribute.

The F1 score: Amazon Machine Learning Developers Guide – Evaluating ML Models

•  In this next section, the Target parameters are edited to select Binary Classification.
•  Run the Binary Classification ML model and compare its results with those from the Regression ML model.

•  A summary of the Binary classification ML model performance returns an F1 score at 0.94.

•  The F1 score is normalized between 0 and 1; a higher F1 score in this case, 0.94, would indicate better performance for the Binary classification model.

•  Amazon ML provides both the F1 score metric value and visualization for the Binary Classification ML model.

•  Hovering the cursor over each rectangular box displays the percentage of records correctly classified and the percentage misclassified.

•  To explore the visualization aspect of the model, click on the Explore model performance button. A confusion matrix is presented.

•  The Predicted values are on the horizontal axis and the True values are on the vertical axis. The F1 score values are presented for each row, including the totals.

•  211 records were used as Testing data for the Binary classification ML model.
•  163 records belonged to Label 2 (Benign); the other 48 belonged to Label 4 (Malignant).

•  99% of the Testing records labeled Benign were correctly predicted as belonging to group 2 (Benign), while only 0.61% of them were misclassified as belonging to group 4 (Malignant).

•  The F1 score for group 2 was 0.98, close to the perfect score of 1.

•  85% of the Testing records labeled Malignant were correctly predicted as belonging to group 4 (Malignant), while 14.58% of them were mistakenly predicted as group 2 (Benign).

•  The F1 score for group 4 was 0.91. The average F1 score was 0.94.
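
The per-class F1 scores can be checked from the confusion matrix, using F1 = 2 × precision × recall / (precision + recall). The sketch below takes the record counts implied by the reported percentages: 1 of 163 Benign records (0.61%) and 7 of 48 Malignant records (14.58%) misclassified. These counts are derived here rather than stated in the slides, and the function name `f1_score` is hypothetical.

```python
def f1_score(true_positive, false_positive, false_negative):
    """F1 = 2 * precision * recall / (precision + recall), written in terms of counts."""
    denominator = 2 * true_positive + false_positive + false_negative
    return 2 * true_positive / denominator if denominator else 0.0


# Counts derived from the reported percentages (assumed to be exact record counts).
benign_correct, benign_wrong = 162, 1
malignant_correct, malignant_wrong = 41, 7

# For each class, false positives are the other class's misclassified records.
f1_benign = f1_score(benign_correct, malignant_wrong, benign_wrong)
f1_malignant = f1_score(malignant_correct, benign_wrong, malignant_wrong)
f1_average = (f1_benign + f1_malignant) / 2
```

With these counts the scores round to 0.98 for Benign, 0.91 for Malignant, and 0.94 on average, which agrees with the values reported above.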

Conclusion
•  Amazon ML is intuitive and could help the data scientist focus on knowledge discovery, leaving hardware and other computational resource issues to the engineers at Amazon cloud services.

•  The potential for Amazon ML applications in Health Data Science is enormous.

•  ML algorithms are still constrained to the choices provided by Amazon ML, namely Binary classification, Multi-class classification, and Regression models. Including other ML algorithms in the future would provide more choice for comparative studies.

•  Data preprocessing is still a pain: one has to strictly follow Amazon ML guidelines, and Amazon ML currently accepts only CSV files. Automating this process would be ideal.

References
•  Amazon ML, Online: [www.aws.amazon.com]

•  Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning repository. Online at: [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)]

•  K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

•  Evaluating ML Models - Amazon Machine Learning Developer Guide, Available Online: [http://docs.aws.amazon.com/machine-learning/latest/dg/evaluating_models.html]

•  Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Thanks

Questions?

Contact Kato Mivule @ [kmivule/gmail/com]
