
COSC 4360 ML for DS - Instructor’s notes Ch.5 - Model Evaluation

Notation: □ means pencil-and-paper QUIZ; ► means coding QUIZ

Review questions for feature engineering (Ch.4)

1] Define feature engineering.

2] Name a FE technique we have encountered before Ch.4. Give three examples of ML algorithms where it is useful, and one where it is not.

3] What are categorical features (variables)? Can they be represented numerically?

4] The Python library function os.path.join(a, b) does the following (select one):

a) Concatenates all files in the directory a into one big file, and names it b.
b) Joins the lists a and b into a pairwise list of tuples and returns this list.
c) Concatenates a and b into a path whose syntax conforms to the underlying OS, and returns the path as a string.
d) Adds the string b to all file names in the directory a. Returns nothing.

5] Name a useful pandas method that finds the unique values in a column and counts how many times each occurs.

6] The pandas method that performs one-hot encoding is called:

a) one_hot( ) b) map_one_hot( ) c) create_dummies( ) d) get_dummies( ) e) OneHotEncoder( )

7] What does a value of -1 mean as a parameter for the numpy method reshape( )? For example, what is the result of a.reshape(-1, 1) when a is an array?

8] Define bins and binning. Why is binning useful in ML?

9] The numpy function that performs binning automatically is named:

a) digitize( ) b) to_bins( ) c) binning( ) d) histogram( )

10] ► (a) Create an array arr of 500 random numbers (floats), triangularly distributed between 10 and 20.

(b) Bin the array using 7 bins and display the bin memberships. (c) Display how many datapoints are in each bin.

11] Why do decision tree models not benefit from binning?


12] The sklearn preprocessing class that calculates powers of the original datapoints is named:

a) Powers( ) b) AllPowers( ) c) Polynomial( ) d) Polynomials( ) e) PolynomialFeatures( ) f) PolyShape( )

13] What does univariate mean in univariate transformation? Are there other types of transformations?

14] Give an example of a univariate transformation used in our text.


Solutions:

1] Define feature engineering.
A: FE = the process of transforming the data to an optimal representation for a given application.

2] Name a FE technique we have encountered before Ch.4. Give three examples of ML algorithms where it is useful, and one where it is not.
A: Scaling (see Chs. 2, 3) is one of the many FE techniques. It is used in PCA, KNN, SVM. It is not very useful in DT algorithms.

3] What are categorical features (or data or variables)? Can they be represented numerically?
A: They are features usually represented as characters/strings/text. Although they are sometimes represented with numbers (e.g. music_genre: 1 = opera, 2 = rock-and-roll, 3 = blues), these numbers lack an order relation that is characteristic of true numerical features.

4] The Python library function os.path.join( ) does the following (select one):
A: (c)

8] Define bins and binning. Why is binning useful in ML?
A: Bins are intervals (usually adjacent and equal in length) used to group datapoints.

10] ► (a) Create a numpy array arr of 500 random numbers (floats), triangularly distributed between 10 and 20.
(b) Bin the array using 7 bins and display the bin memberships.
(c) Display how many datapoints are in each bin.
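A possible solution sketch (the mode of the triangular distribution is not specified in the exercise, so the midpoint 15 is assumed here; np.digitize is used for the binning, as in question 9):

```python
import numpy as np

# (a) 500 floats, triangularly distributed between 10 and 20
# (the exercise does not specify the mode; the midpoint 15 is assumed)
rng = np.random.default_rng(0)
arr = rng.triangular(left=10, mode=15, right=20, size=500)

# (b) 7 equal-width bins between 10 and 20 require 8 bin edges
bin_edges = np.linspace(10, 20, 8)
membership = np.digitize(arr, bins=bin_edges)        # bin index (1..7) for every datapoint
print("Bin memberships:", membership)

# (c) how many datapoints fall in each bin
counts = np.bincount(membership, minlength=8)[1:]    # drop index 0 (values below the first edge)
print("Datapoints per bin:", counts)
```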


Cross-Validation (pp. 253-60)

As we know, we need to measure how well a classifier or regressor generalizes, i.e. how well it performs on new data [1]. This type of model evaluation is necessary because, through overfitting, a model can “remember” all the training data, but this is no guarantee that it will perform well on new data.

Cross-validation is an improvement of the train-test-split method we have been using.

k-fold cross-validation

The dataset is split into k subsets, called folds (usually k = 5, sometimes 10 or 20 if the dataset is small). The train-test method is applied iteratively, using each of the k folds in turn as the test set, while the algorithm is trained on the remaining k-1 folds:

□ The diagram above illustrates the simplest way to select the folds, which is consecutive. Can you predict what problem we can run into if we use consecutive folds and the dataset is sorted by target label?

□ What would be an easy fix to the problem above, if we want to still use consecutive folds?

At the end, we take the average (mean) of the k test scores. In Scikit-learn, k-fold cross-validation is implemented in the model_selection module, just like train_test_split:
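A minimal sketch of this call, assuming the iris dataset and a LogisticRegression classifier as in the text's running example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# one call performs the complete k-fold cross-validation loop
scores = cross_val_score(logreg, iris.data, iris.target)
print("Cross-validation scores:", scores)
print("Mean score: {:.3f}".format(scores.mean()))
```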

Note: The function cross_val_score does not use sequential folds for classification - it uses stratified folds instead - keep reading!

[1] Generalization is also important for unsupervised learning, but harder to measure.


The default k (3 in older versions of Scikit-learn, 5 since version 0.22) can be changed by passing the parameter cv:
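For example (same assumed dataset and classifier as above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# cv controls the number of folds
scores = cross_val_score(logreg, iris.data, iris.target, cv=10)
print("10-fold cross-validation scores:", scores)
```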

The mean is not the only measure we should pay attention to. Especially when the number of points in the dataset is small, the distribution of the scores contains important clues about how sensitive the model is to the training data. If the individual scores are close together, we say that the model is insensitive (or robust).

► Calculate and display the variance and standard deviation of the 5 cross-validation scores above. What do you conclude?

The std. dev. is only 3.9%, which is probably OK for a dataset this small (150 points). Perhaps more testing is necessary ...

► Repeat the exercise above with 10-fold cross-validation. What do you conclude?

The std. dev. has increased from 3.9% to 6%, which is worse. This indicates that either we have too few data points (which is true in this case!), or that the dataset is not linearly separable (which is also true in this case!).
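A short sketch of how these numbers can be obtained (iris and LogisticRegression assumed as before; the exact values depend on the classifier and the scikit-learn version):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

for k in (5, 10):
    scores = cross_val_score(logreg, iris.data, iris.target, cv=k)
    print("k = {:2d}: mean = {:.3f}, variance = {:.4f}, std. dev. = {:.3f}".format(
        k, scores.mean(), scores.var(), scores.std()))
```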


k-fold cross-validation vs. train-test-split

Pros:

• Eliminating possible randomness “artifacts”, e.g. making it unlikely to place all the “hard” points or all the outliers in the testing set by chance, or the opposite, placing all the “easy” data points in the testing set.

• We have multiple scores instead of just one, so we get an idea about the sensitivity/robustness of the model on future data, e.g. by calculating the variance/std. dev.

• More effective use of the data, especially for small datasets.

Con:

• It is k times more expensive!

Commonly-used values for k: 5 and 10.

Stratified K-Fold cross-validation

To avoid the pitfalls of consecutive folds, cross_val_score uses stratified cross-validation by default for classification, but only plain (consecutive) cross-validation for regression. In a stratified split, each fold contains approximately the same proportion of each class as the whole dataset.

More control over cross-validation with KFold

Scikit-learn’s KFold splits a dataset into k consecutive folds (without shuffling by default). The KFold splitter can be passed directly to the cv parameter of cross_val_score; this forces the cross-validation function to use consecutive folds. Let us experiment with 3 strict, consecutive folds in the iris dataset:
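A sketch of this experiment (iris and LogisticRegression assumed as before). Because the iris datapoints are sorted by class, each of the 3 consecutive folds contains a single class, so every fold is tested on a class the model never saw during training and the scores collapse:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=3)        # 3 strictly consecutive folds, no shuffling
scores = cross_val_score(logreg, iris.data, iris.target, cv=kfold)
print("Cross-validation scores:", scores)
```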


Even with more folds, the variance of the scores is high (Why?):

We can shuffle the data to avoid the anomaly:
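For example (same assumed setup):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# shuffling the datapoints before splitting removes the sorted-by-label anomaly
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(logreg, iris.data, iris.target, cv=kfold)
print("Cross-validation scores:", scores)
```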

It is also possible to use KFold “manually”, by calling the classifier with the subsets generated by it:
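A sketch of the manual loop (same assumed setup); kf.split() generates one (train_indices, test_indices) pair per fold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=3, shuffle=True, random_state=0)

# fit and score the classifier once per fold, using the generated index arrays
for train_idx, test_idx in kf.split(iris.data):
    logreg.fit(iris.data[train_idx], iris.target[train_idx])
    print("Fold score: {:.3f}".format(
        logreg.score(iris.data[test_idx], iris.target[test_idx])))
```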

Food for thought: What data structure is kf exactly?

Extreme cross-validation: Leave-one-out (LOOCV)

The name says it all! It is prohibitively expensive and not really necessary for large datasets, but affordable and better than k-fold for small and very small datasets (hundreds of samples or fewer).
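A sketch of LOOCV with cross_val_score (same assumed setup):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

loo = LeaveOneOut()              # one fold per datapoint -> 150 fits for iris
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
print("Number of CV iterations:", len(scores))
print("Mean accuracy: {:.3f}".format(scores.mean()))
```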

► Compare LOOCV to k-fold CV with 150 folds on the iris dataset (mean and std. dev.). Explain the results.


Solutions:

□ The diagram above illustrates the simplest way to select the folds, which is consecutive. Can you predict what problem we can run into if we use consecutive folds and the dataset is sorted by target label?
A: If the dataset is sorted by target label, a fold may end up containing only one (or a few) of the labels; in the extreme case, the model is tested on a class it never saw during training, and the score collapses.

□ What would be an easy fix to the problem above, if we want to still use consecutive folds?
A: Shuffle the datapoints before splitting them into folds.

► Compare LOOCV to k-fold CV with 150 folds on the iris dataset (mean and std. dev.). Explain the results.

The results are identical, because LOOCV is the same as k-fold CV when k equals the number of points in the dataset!


Shuffle-split cross-validation

All the CV methods presented so far make use of the entire dataset; this can be time-consuming when we have a large dataset and do not wish to use all of it (maybe we are in the exploratory phase, just deciding what algorithm to use). ShuffleSplit allows us to choose the fractions of datapoints used for training and testing, as well as the total number of splits. As the name implies, the points are randomly chosen.
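A sketch of ShuffleSplit in action (iris and LogisticRegression assumed as before; the fractions and number of splits are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# 10 random splits; note that the fractions deliberately do not add up to 1
shuffle_split = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.3, random_state=0)
scores = cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)
print("Cross-validation scores:", scores)
```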

Note that the test_size and train_size above do not add up to 1. (This was done only for illustration - for the small iris dataset it is a mistake not to use all datapoints available!) As with train_test_split, if test_size is missing, its value will be 1 - train_size.

► Calculate the mean and std. dev. for the CV scores above. How do they compare with those from LOOCV?

SKIP Cross-validation with groups (pp. 261-2)

Review questions for CV:

1] List all the CV-related functions we covered in this chapter.

2] We have a dataset with 10 million samples. Is CV going to be useful? Explain.

3] We have a dataset with 100 samples. Is CV going to be useful? Explain.

4] True or false? When cross_val_score is used with a regressor, the folds are stratified.

5] Fill in the blanks: KFold always creates _____________ folds.

6] Fill in the blanks: ___________ cross-validation is an extreme form of ___________ cross-validation.

7] What is the main benefit of shuffle-split CV? When is it likely to be used?

8] True or false? Leave-one-out CV uses stratified folds.


Grid Search and the Validation Set (pp.262-9)

When a model has two or more hyper-parameters (tunable parameters, a.k.a. knobs), trying out combinations of parameters becomes a time-consuming task. Remember these tables from Ch.2:

We can systematically (read: brute force) search for the best combination using nested loops:
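A sketch of such a naive grid search (an SVC on iris with a plain train-test split is assumed here, in the spirit of the text; the candidate values for gamma and C are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

best_score = 0
best_parameters = {}
# naive grid search: every combination of gamma and C is tried
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        score = svm.score(X_test, y_test)   # evaluated on the *test* set -- see the problem below!
        if score > best_score:
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}

print("Best score: {:.2f}".format(best_score))
print("Best parameters:", best_parameters)
```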


Problem: We used the test set in order to adjust the hyper-parameters, i.e. as part of the training process! The information in the test set has “seeped into” the model, so we are again in danger of overfitting! This phenomenon is known as data leakage.


Solution: Introduce a third set, the validation set:
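A sketch of the three-way split (same assumed SVC-on-iris setup; note that the test set is touched only once, at the very end):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
# split off the test set, then split the rest into training and validation sets
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=1)

best_score = 0
best_parameters = {}
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        score = svm.score(X_valid, y_valid)      # tuning uses the validation set only
        if score > best_score:
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}

# rebuild on train+validation with the best combination, then touch the test set once
svm = SVC(**best_parameters).fit(X_trainval, y_trainval)
print("Best parameters:", best_parameters)
print("Test set score: {:.2f}".format(svm.score(X_test, y_test)))
```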

Note: The syntax ** used in the code above simply “unpacks” the best_parameters dictionary into the arguments of the SVC constructor. It is the same as:
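For example (the parameter values below are only illustrative):

```python
from sklearn.svm import SVC

best_parameters = {'gamma': 0.001, 'C': 100}        # example values

svm = SVC(**best_parameters)                        # dictionary unpacked into keyword arguments
svm = SVC(gamma=best_parameters['gamma'],           # ...which is the same as spelling them out
          C=best_parameters['C'])
```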


Grid-search with Cross-Validation

Visualization of results above:


Grid-Search in Scikit-learn

In order to use Scikit-learn’s implementation of grid-search, we first need to specify the grid values as a dictionary:
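For example (the candidate values below are assumptions, mirroring the grids used earlier):

```python
# candidate values for the two SVC hyper-parameters
param_grid = {'C':     [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
```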

Then we split the data into train and test sets. Important: we do not need to manually set aside the validation set, because the Scikit-learn tool will handle it internally!

Now we are ready to use GridSearchCV:

The method fit has searched the grid and then also used the best combination to train a model on the entire training set! Although it is in reality a meta-estimator, it provides the same methods score and predict as a regular estimator:
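A sketch of the complete workflow (SVC on iris assumed, as before):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

param_grid = {'C':     [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)      # grid search + refit on the entire training set

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Test set score: {:.2f}".format(grid_search.score(X_test, y_test)))
```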

SKIP the remainder of the Grid-Search section!


Evaluation Metrics and Scoring (pp.277-294)

The measures we have used so far to evaluate our models:

For classification: accuracy = (# of samples classified correctly) / (total # of samples)

For regression: R-squared, a.k.a. the coefficient of determination [2]

Why do we sometimes need other measures? It depends on the application, i.e. we have to consider the business metric. Example: false positives and false negatives in medical diagnostics ... especially with imbalanced datasets - remember Bayes!

For individual work: Imbalanced dataset example on pp.280-281

Confusion matrix

(Code provided by instructor)
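A minimal sketch of what such code might look like (the digit-9-vs-rest setup matches the text's example below; the choice of LogisticRegression and the split parameters are assumptions, so the instructor-provided file may differ):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# "nine vs. not nine": the digit 9 is the positive class, as in the text's example
digits = load_digits()
y = (digits.target == 9)
X_train, X_test, y_train, y_test = train_test_split(digits.data, y, random_state=0)

logreg = LogisticRegression(max_iter=5000).fit(X_train, y_train)
pred = logreg.predict(X_test)

# rows = actual classes, columns = predicted classes
print(confusion_matrix(y_test, pred))
```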

□ How do we compute the accuracy based on the confusion matrix?

[2] See the instructor’s notes at the beginning of Ch.1.


Accuracy = (TP + TN)/(TP + TN + FP + FN)

Other popular measures based on the confusion matrix:

Precision = TP/(TP + FP) → Measures accuracy only on the predicted positives. It is important when we want to limit false positives [3]. A.k.a. Positive Predictive Value (PPV). We want it close to 1.

(Positive) Recall, or Sensitivity = TP/(TP + FN) → Measures accuracy only on the actual positives. This is important when we want to limit false negatives [4]. A.k.a. True Positive Rate (TPR). We want it close to 1.

Negative Recall, or Specificity = TN/(TN + FP) → Measures accuracy only on the actual negatives, symmetrical to sensitivity. A.k.a. True Negative Rate (TNR). Like precision, it is important when we want to limit false positives; unlike precision, FP is compared against TN, so it is especially useful when the dataset is imbalanced, containing relatively few Negative labels.

• For historical reasons, its complement is sometimes used: False Positive Rate (FPR) = 1 – specificity = FP/(TN + FP) – see the ROC curve below.

□ Calculate all four measures above for the confusion matrix in the text, where the positive class is the digit nine:

Solutions: Precision = TP/(TP + FP) = 39/(39+2) = 0.9512 = 95.12%

Recall = TP/(TP + FN) = 39/(39+8) = 0.8298 = 82.98%

Specificity = TN/(TN + FP) = 401/(401+2) = 0.9950 = 99.50%

FPR = 1 – specificity = 0.0050 = 0.50%

Notes:

• Because the dataset is imbalanced with many more Negative labels, the specificity is much better than the precision.

• The less-than stellar recall has nothing to do with the lack of balance – the algorithm simply could not train very well on only 47 samples of the digit 9!

[3] For example, when classifying malware, where a FP can cause disruptive/expensive downtime for the entire business.
[4] For example, when classifying life-threatening situations, like tumors.


Precision = TP/(TP + FP) Recall = TP/(TP + FN)

□ There is always a trade-off between precision and recall (FP vs. FN). Assume that the dataset has k Positive and k Negative samples. Consider these extreme cases:

• Algorithm decides that all datapoints are Positive! What are the

o Recall?

o Precision?

• Algorithm decides that all datapoints are Negative, except the one Positive that has the highest probability of being Positive! What are the

o Recall?

o Precision?

To capture this trade-off, we can combine precision and recall with their harmonic mean:

F-score = 2∙precision∙recall/(precision + recall) → Captures both precision and recall

• Problem with the f-score: precision and recall are weighted equally, but in practical applications usually one has a greater impact than the other. Because of this, a more general f-score was created, in which precision can be given more or less weight than recall via an extra parameter, beta. To be more precise, the f-score defined above is called the f1-score, corresponding to beta = 1 [5].
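For reference, the general weighted form is: F_beta = (1 + beta²)∙precision∙recall/(beta²∙precision + recall). Setting beta = 1 recovers the f1-score above; beta > 1 weighs recall more heavily, and beta < 1 weighs precision more heavily.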

□ Calculate the f1-score for the two extreme cases mentioned above.

Some of the measures above can be generated individually or all at once with built-in Scikit-learn functions:
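For example (reusing the assumed digit-9-vs-rest setup from the confusion-matrix sketch above):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

digits = load_digits()
y = (digits.target == 9)
X_train, X_test, y_train, y_test = train_test_split(digits.data, y, random_state=0)
pred = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict(X_test)

# individual measures...
print("Precision: {:.3f}".format(precision_score(y_test, pred)))
print("Recall:    {:.3f}".format(recall_score(y_test, pred)))
print("f1-score:  {:.3f}".format(f1_score(y_test, pred)))

# ...or all of them at once, for both classes
print(classification_report(y_test, pred, target_names=["not nine", "nine"]))
```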

EC

[5] https://en.wikipedia.org/wiki/F1_score


Taking uncertainty into account

Note: The SVC classifier generates a “soft” value for each datapoint with the method decision_function – remember the discussion of multi-class classification in Ch.2! decision_function returns positive and negative values, and the prediction is made based on the threshold (decision value) zero: if the datapoint has a negative value, it belongs to class 0; if a positive value, to class 1 (exactly zero means on the boundary!). Making the threshold negative, as in the text example shown above, means that some points initially in class 0 are now moved to class 1. This improves the recall TP/(TP + FN), but damages the precision TP/(TP + FP).

This can be seen numerically if we experiment with the threshold and display the measures:
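A sketch of such an experiment (the imbalanced make_blobs data and the SVC parameters are assumptions for illustration; the thresholds tried are arbitrary):

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# imbalanced two-class toy data (sizes and spread assumed for illustration)
X, y = make_blobs(n_samples=(200, 25), cluster_std=[7.0, 2.0], random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(gamma=0.05).fit(X_train, y_train)
decision = svc.decision_function(X_test)      # "soft" value for every test point

# move the threshold away from the default of 0 and watch precision/recall change
for threshold in [-0.8, 0.0, 0.8]:
    pred = decision > threshold
    print("threshold = {:+.1f}  precision = {:.3f}  recall = {:.3f}".format(
        threshold, precision_score(y_test, pred), recall_score(y_test, pred)))
```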

► Experiment with the SVC code above to find out at what threshold we have no more false positives (FP). Are the recall and precision values what you were expecting?

Some classifiers have a similar method, predict_proba (e.g. KNN, DT), and some have both (LogReg).

Read carefully this entire section! (pp.288-90)


Solutions:

► Experiment with the SVC code above to find out at what threshold we have no more false positives (FP). Are the recall and precision values what you were expecting?

Precision = TP/(TP + FP) Recall = TP/(TP + FN)

According to the output above, the threshold is between 1.08 and 1.09 in this case. As expected, when precision is 1, recall is very low, 0.11, because raising the threshold has simultaneously decreased TP and increased FN!


Precision-recall curves (p-r curves)

Setting the requirement on a classifier to have a certain value for a certain measure, e.g. 90% precision, is called setting the operating point. Before doing this, it is useful to have an overall view of all possible operating points. This is made possible by the precision-recall curve.
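A sketch of how such a curve can be produced (the imbalanced make_blobs data and the SVC parameters are assumptions for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# imbalanced two-class toy data (sizes and spread assumed for illustration)
X, y = make_blobs(n_samples=(200, 25), cluster_std=[7.0, 2.0], random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(gamma=0.05).fit(X_train, y_train)
# one (precision, recall) pair for every possible threshold
precision, recall, thresholds = precision_recall_curve(
    y_test, svc.decision_function(X_test))

plt.plot(precision, recall, label="svc")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend()
plt.show()
```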

Precision = TP/(TP + FP) Recall = TP/(TP + FN)

[This output was obtained with n_samples=(100,20)]

[This output was obtained with n_samples=(200,25)]

□ Can you tell at which end of the curve the threshold is small (negative) and at which end large (positive)?

Precision = TP/(TP + FP) Recall = TP/(TP + FN)

Comparing the p-r curves for two classifiers [6]:

[6] The file precision_recall_curve_SVC_vs_RF.py is provided by the instructor, similar to that on p.293.


Area Under Curve (AUC) is a compact measure that summarizes the entire p-r curve of an algorithm. It is an overall assessment of the classification algorithm, averaged over all possible thresholds. A better curve/algorithm in this sense has an AUC as close to 1 as possible:

□ Which of these p-r curves/algorithms is best? Overall best?

How do the green (en) and blue (ir) curves/algorithms compare? Overall?

Scikit-learn implements it with the function average_precision_score:
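For example (same assumed data and classifier as in the p-r curve sketch above):

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=(200, 25), cluster_std=[7.0, 2.0], random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
svc = SVC(gamma=0.05).fit(X_train, y_train)

# summarizes the whole p-r curve in one number, averaged over all thresholds
ap = average_precision_score(y_test, svc.decision_function(X_test))
print("Average precision (area under the p-r curve): {:.3f}".format(ap))
```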

Note: GridSearchCV can be instructed to use the area under the p-r curve as scoring metric, by setting scoring='average_precision'.

SKIP the remainder of Ch.5, starting at Receiver operating characteristics (p.294)


Solutions:

□ Can you tell at which end of the curve the threshold is small (negative) and at which end large (positive)?

Precision = TP/(TP + FP) Recall = TP/(TP + FN)

A: In the UL corner, the recall is 1, so FN = 0, so likely nothing is predicted negative, which means the threshold is small (negative). Similarly, in the LR corner the threshold is large (positive).