
Department of Information Technology Subject: Instructor Manual Computer Laboratory VII

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS


Department: Information Technology

Lab Instructor’s

Manual

Name of Faculty: Nirmal Khyati R.

Subject: Computer Laboratory VII Code: 414458



Savitribai Phule Pune University

Fourth Year of Information Technology Engineering (2015 Course)

414458: Computer Laboratory VII

Teaching Scheme:

Practical: 04 Hours/Week

Credits: 02

Examination Scheme:

TW: 50 Marks, PR: 50 Marks

Prerequisites:

Knowledge of Programming Languages 1. Java 2. R 3. Python 4. C++

Course Objectives:

1. To understand the security issues in networks and application software.

2. To understand the machine learning principles and analytics of learning algorithms.

Course Outcomes:

By the end of the course, students should be able to:

1. Implement and port controlled and secured access to software systems and networks.

2. Build learning software in various domains.

Suggested List of Laboratory Assignments

PART –A (ICS)

Assignment 1

Write a program in C++ or Java to implement RSA algorithm for key generation and cipher

verification.

Assignment 2

Develop a program in C++ or Java based on number theory, such as the Chinese remainder theorem.

Assignment 3

Write a program in C++ or Java to implement the SHA-1 algorithm using libraries (API).

Assignment 4

Configure and demonstrate use of vulnerability assessment tool such as Snort tool for intrusion or

SSL Web security.

PART –B (MLA)



Assignment 1

Study of platform for Implementation of Assignments

Download the open source software of your interest. Document the distinct features and functionality of the software platform. You may choose WEKA, R, or Python.

Assignment 2

Supervised Learning - Regression (Using R)

Generate a proper 2-D data set of N points. Split the data set into Training Data set and Test Data set.

i) Perform linear regression analysis with Least Squares Method.
ii) Plot the graphs for Training MSE and Test MSE and comment on Curve Fitting and Generalization Error.
iii) Verify the Effect of Data Set Size and Bias-Variance Tradeoff.
iv) Apply Cross Validation and plot the graphs for errors.
v) Apply Subset Selection Method and plot the graphs for errors.
vi) Describe your findings in each case.

Assignment 3

Create Association Rules for the Market Basket Analysis for the given Threshold. (Using R)

Assignment 4

Implement K-Means algorithm for clustering to create a Cluster on the given data. (Using Python)

Assignment 5

Implement SVM for performing classification and find its accuracy on the given data. (Using Python)

Assignment 6

Creating & Visualizing Neural Network for the given data. (Using Python)

Assignment 7

On the given data, perform performance measurements such as Accuracy, Error rate, Precision, Recall, TPR, FPR, TNR, etc. (Using Weka)

Assignment 8

Principal Component Analysis - Finding Principal Components, Variance and Standard Deviation calculations of principal components. (Using R)

Reference Books

1. Open source software: WEKA, R and Python.
2. Java 6.1 or later (for the rJava package).
3. Dr. Mark Gardener, Beginning R: The Statistical Programming Language, ISBN: 978-81-2654120-1, Wiley India Pvt. Ltd.
4. Jason Bell, “Machine Learning for Big Data: Hands-On for Developers and Technical Professionals”, ISBN: 978-81-265-5337-2-1, Wiley India Pvt. Ltd.



_________________________________________________________

Assignment No: 1

__________________________________________________________

Title of the Assignment: Study of platform (WEKA and R) for Implementation of Assignments

Objective of the Assignment: Download the open source software of your interest. Document

the distinct features and functionality of the software platform

Prerequisite: Basic Concepts of Machine Learning

Theory:

1. Write down the steps to install R on an open source operating system.

1.1 Step 1: Install R

sudo apt-get install r-base

1.2 Step 2: Install RStudio

sudo apt-get install gdebi

cd ~/Downloads

wget https://download1.rstudio.org/rstudio-xenial-1.1.419-amd64.deb

sudo gdebi rstudio-xenial-1.1.419-amd64.deb

2. Explain with Example Data Types in R.

Source: https://data-flair.training/blogs/r-data-types/

Vector – A basic data structure of R whose elements are all of the same type.

c(2, 3, 5)

[1] 2 3 5

length(c("aa", "bb", "cc", "dd", "ee"))

[1] 5

Matrices – A matrix is a rectangular array of numbers or other mathematical objects. We can perform operations such as addition and multiplication on matrices in R.

mymat = matrix(1:12,4,3)

mymat

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

Lists – A list is a generic vector that can hold objects of different types and lengths, such as the numeric, character and logical vectors below.

n = c(2, 3, 5)

s = c("aa", "bb", "cc", "dd", "ee")

b = c(TRUE, FALSE, TRUE, FALSE, FALSE)

x = list(n, s, b, 3) # x contains copies of n, s, b

Data Frames – Generated by combining together multiple vectors such that each vector becomes a

separate column

a <- c(1,2,3,4)

b <- c(2,4,6,8)

levels <- factor(c("A","B","A","B"))

df <- data.frame(first=a, second=b,f=levels)

3. Explain Data Types in WEKA

Numeric:

This type of attribute represents a floating-point number.

Nominal:

This type of attribute represents a fixed set of nominal values.

String:

This type of attribute represents a dynamically expanding set of nominal values. Usually used

in text classification.

Date:

This type of attribute represents a date, internally represented as a floating-point number storing the milliseconds since January 1, 1970, 00:00:00 GMT. The string representation of the date must be ISO-8601 compliant (see http://www.iso.org/iso/en/prods-services/popstds/datesandtime.html); the default is yyyy-MM-dd'T'HH:mm:ss.

Relational:

This type of attribute can contain other attributes and is used, e.g., for representing Multi-Instance data. (Multi-Instance data consists of a nominal attribute containing the bag-id, then a relational attribute with all the attributes of the bag, and finally the class attribute.)

4. List the features of WEKA and R.



WEKA: Waikato Environment for Knowledge Analysis

It is a data mining/machine learning tool developed by the Department of Computer Science, University of Waikato, New Zealand.

Main Features Of WEKA:

• 49 data preprocessing tools

• 76 classification/regression algorithms

• 8 clustering algorithms

• 3 algorithms for finding association rules

• 15 attribute/subset evaluators + 10 search algorithms for feature selection

5. Explain the .csv and .arff file formats in brief.

.CSV Format:

• CSV is probably the simplest possible structured format for data

• CSV strikes a delicate balance, remaining readable by both machines & humans

• CSV is a two-dimensional structure consisting of rows of data, each row containing multiple cells. Rows are (usually) separated by line terminators, so each row corresponds to one line. Cells within a row are separated by commas (hence the name, comma-separated values).

• CSV is a "text-based" format, i.e. a CSV file is a text file. This makes it amenable for

processing with all kinds of text-oriented tools

• CSV file looks like

.arff Format

• The data set is organized in the following order: relation, attribute(s), data

• Nominal attributes are followed by curly braces that denote the set of values the

attribute can take on.

• If values include spaces, they must be placed in quotation marks.



• Numeric attributes are declared with the keyword numeric.

• When the class value is not distinguished as the factor that is supposed to be predicted, it can be listed as an attribute.

• If a value is missing, it is represented by a question mark.

• .arff file looks like:
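The two layouts described above can be illustrated with a small, made-up data set. The sketch below is written in Python; the weather-style records, the attribute names and the helper names `read_csv_rows` and `arff_attributes` are hypothetical and chosen only for illustration. The ARFF handling covers just the header keywords mentioned above, not the complete format.

```python
import csv
import io

# Hypothetical CSV file: a header row, then one comma-separated record per line.
# The empty cell in the last record stands for a missing value.
CSV_TEXT = """outlook,temperature,play
sunny,85,no
overcast,83,yes
rainy,,yes
"""

# The same hypothetical data in ARFF order: relation, attribute(s), data.
# Nominal attributes list their values in curly braces; "?" marks a missing value.
ARFF_TEXT = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,?,yes
"""


def read_csv_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def arff_attributes(text):
    """Return (name, type) pairs taken from the @attribute lines of an ARFF header."""
    attributes = []
    for line in text.splitlines():
        if line.lower().startswith("@attribute"):
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
    return attributes


rows = read_csv_rows(CSV_TEXT)
attrs = arff_attributes(ARFF_TEXT)
print(rows[0])
print(attrs)
```

The CSV reader returns every cell as text, while the ARFF header states the type of each attribute explicitly, which is the main practical difference between the two formats.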

Assignment:

Students are instructed to complete 8 chapters based on R Studio using the following link:

http://tryr.codeschool.com/



----------------------------------------------------------------------------------------------------------------

Assignment No: 2

______________________________________________________________________________

Title of the Assignment: Supervised Learning - Regression (Using R)

Objective of the Assignment: Generate a proper 2-D data set of N points. Split the data set into

Training Data set and Test Data set. i) Perform linear regression analysis with Least Squares

Method. ii) Plot the graphs for Training MSE and Test MSE and comment on Curve Fitting and

Generalization Error. iii) Verify the Effect of Data Set Size and Bias-Variance Trade-off.

iv) Apply Cross Validation and plot the graphs for errors. v) Apply Subset Selection Method and

plot the graphs for errors. vi) Describe your findings in each case

Prerequisite: Basic Concepts of Machine Learning

----------------------------------------------------------------------------------------------------------------------------

Theory:

1. Explain Least Square Method for Linear Regression in brief.

• It is a method that can be used to learn linear models for classification and regression.

• The regression problem is to learn a function estimator $\hat{f} : X \to \mathbb{R}$ from examples $(x_i, f(x_i))$, where $X = \mathbb{R}^d$.

• The differences between the actual and estimated function values on the training examples are called residuals: $\epsilon_i = f(x_i) - \hat{f}(x_i)$.

• The least-squares method, introduced by Carl Friedrich Gauss in the late eighteenth century, consists in finding $\hat{f}$ such that $\sum_{i=1}^{N} \epsilon_i^2$ is minimized.

• Simple regression: $y = b_0 + b_1 x$, where

y = predicted value of the dependent variable

b0 = estimate of the regression intercept

b1 = estimate of the regression slope
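For simple regression, minimizing the sum of squared residuals gives the closed-form estimates $b_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $b_0 = \bar{y} - b_1 \bar{x}$. The following is a minimal sketch of that calculation together with a training/test split and the two MSE values. It is written in Python purely for illustration (the assignment itself asks for R), and the synthetic data (true line y = 2x + 1 with Gaussian noise), the seed and the 70/30 split are assumptions, not part of the manual.

```python
import random


def fit_line(xs, ys):
    """Closed-form least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def mse(xs, ys, b0, b1):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Assumed synthetic 2-D data set of N points: y = 2x + 1 plus unit-variance noise.
random.seed(0)
N = 100
xs = [random.uniform(0.0, 10.0) for _ in range(N)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 1.0) for x in xs]

# Assumed 70/30 split into training and test sets.
split = int(0.7 * N)
train_x, train_y = xs[:split], ys[:split]
test_x, test_y = xs[split:], ys[split:]

b0, b1 = fit_line(train_x, train_y)
train_mse = mse(train_x, train_y, b0, b1)
test_mse = mse(test_x, test_y, b0, b1)
print("b0 =", b0, "b1 =", b1)
print("training MSE =", train_mse, "test MSE =", test_mse)
```

Repeating the fit for different values of N and plotting the two MSE values against N is one way to approach parts ii) and iii) of the objective.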



2. Write the algorithm for K-fold Cross validation.

Algorithm for K Fold Cross Validation:

Step 1 Split the dataset into K equal partitions (or “folds”).

Step 2 Use fold 1 as the testing set and the union of the other folds as the training set.

Step 3 Calculate testing accuracy.

Step 4 Repeat steps 2 and 3 K times, using a different fold as the testing set each time.

Step 5 Use the average testing accuracy as the estimate of out-of-sample accuracy.

Note: A value of K = 10 is very common in the field of applied machine learning.
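The steps above can be sketched as follows for a regression model. Because the model here is a least-squares line, the sketch reports the mean squared error on each held-out fold instead of an accuracy; that substitution, the contiguous fold layout, the helper names and the synthetic data are all assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Closed-form least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def k_fold_indices(n, k):
    """Step 1: split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mse(xs, ys, k):
    """Steps 2 to 5: hold out each fold once, fit on the rest, average the test MSE."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        b0, b1 = fit_line(train_x, train_y)
        fold_mse = sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(fold_mse)
    return sum(fold_errors) / k, fold_errors


# Assumed synthetic data, shuffled so that contiguous folds are random samples.
random.seed(1)
data = [(x, 3.0 * x + 2.0 + random.gauss(0.0, 1.0)) for x in range(50)]
random.shuffle(data)
data_x, data_y = zip(*data)

average_mse, per_fold_mse = cross_val_mse(data_x, data_y, k=10)
print("per-fold test MSE:", per_fold_mse)
print("average test MSE:", average_mse)
```

Plotting `per_fold_mse` against the fold number gives the error graph asked for in part iv) of the objective.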

Assignment: The following observations were collected by a company to find the relationship between the age of a person and his/her typing speed. Consider the following sample data:

Age of typist (x)    Words per minute (y)
25                   78
30                   70
35                   65
40                   58
45                   48
50                   42

a. Find the regression line that best predicts the value of the dependent variable.

b. Find SST, SSE, SSR and the coefficient of determination ($R^2$).

c. Explain the pictorial relationship between SST, SSE and SSR.

d. Decide whether the regression equation is a good fit for the given sample data.
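As a cross-check for hand calculations, the quantities in parts a and b can be computed directly from the table. The sketch below uses Python rather than R and the standard definitions SST = $\sum (y_i - \bar{y})^2$, SSE = $\sum (y_i - \hat{y}_i)^2$ and SSR = $\sum (\hat{y}_i - \bar{y})^2$; the variable names are illustrative.

```python
# Sample data from the assignment table.
ages = [25, 30, 35, 40, 45, 50]
wpm = [78, 70, 65, 58, 48, 42]

n = len(ages)
x_mean = sum(ages) / n
y_mean = sum(wpm) / n

# Least-squares slope and intercept.
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(ages, wpm))
sxx = sum((x - x_mean) ** 2 for x in ages)
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean

predicted = [b0 + b1 * x for x in ages]

# Sums of squares and coefficient of determination.
sst = sum((y - y_mean) ** 2 for y in wpm)                 # total variation
sse = sum((y - p) ** 2 for y, p in zip(wpm, predicted))   # unexplained variation
ssr = sum((p - y_mean) ** 2 for p in predicted)           # explained variation
r_squared = ssr / sst

print("regression line: y = %.3f + (%.4f) x" % (b0, b1))
print("SST = %.3f, SSE = %.3f, SSR = %.3f, R^2 = %.4f" % (sst, sse, ssr, r_squared))
```

For this data the fitted line is approximately y = 114.38 - 1.446x, with SST ≈ 920.83, SSR ≈ 914.41, SSE ≈ 6.42 and $R^2$ ≈ 0.993. The identity SST = SSR + SSE holds, and an $R^2$ this close to 1 indicates a good linear fit.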



______________________________________________________________________________

Assignment No: 3

---------------------------------------------------------------------------------------------------------------------

Title of the Assignment: Market Basket Analysis (Using R)

Objective of the Assignment: Create Association Rules for the Market Basket Analysis for the

given Threshold

---------------------------------------------------------------------------------------------------------------------

Theory:

1. Explain Association Rule Mining in brief.

• It is a rule-based unsupervised machine learning task for discovering interesting relations between variables in large databases or transactions.

• Given a set of transactions, it finds rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

• Association rule mining is a two-step approach:

1. Frequent Itemset Generation

Generate all itemsets whose support ≥ minsup.

2. Rule Generation

Generate high-confidence rules from each frequent itemset. Each rule is a binary partitioning of a frequent itemset.

• The first step, frequent itemset generation, is computationally expensive: if there are d items, then $2^d$ candidate itemsets are possible. Calculating the support and confidence for each itemset and comparing them with minsup and minconf is quite time consuming.

2. Explain Support, Confidence with help of formulae and example.

Support Count (Freq): The frequency of occurrence of an itemset across all transactions is known as its support count.

Support: An indication of how frequently the items appear in the transactions. It is the fraction of transactions that contain both X and Y.

$\text{Supp}(X \cup Y) = \dfrac{\text{Freq}(X \cup Y)}{\text{Number of transactions}}$

Confidence: Measures how often items in Y appear in transactions that contain X, i.e., how often the if/then rule X → Y is found to be true.

$\text{Conf}(X \to Y) = \dfrac{\text{Freq}(X \cup Y)}{\text{Freq}(X)}$

E.g,



TID Items

1. Bread, Peanuts, Milk, Fruit, Jam

2. Bread, Jam, Soda, Chips, Milk, Fruit

3. Steak, Jam, Soda, Chips, Bread

4. Jam, Soda, Peanuts, Milk, Fruit

5. Jam, Soda, Chips, Milk, Bread

6. Fruit, Soda, Chips, Milk

7. Fruit, Soda, Peanuts, Milk

8. Fruit, Soda, Peanuts, Milk

Supp({Milk, Bread}) = 3/8

Supp({Soda, Chips}) = 4/8

$\text{Conf}(\{\text{Bread}\} \to \{\text{Milk}\}) = \dfrac{\text{Freq}(\{\text{Bread}, \text{Milk}\})}{\text{Freq}(\{\text{Bread}\})} = \dfrac{3}{4} = 0.75$
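The worked values above can be reproduced with a few lines of Python. This is only a sketch of the two formulae applied to the eight transactions in the table; the function names are illustrative and not taken from any library.

```python
# The eight transactions from the example table (TID 1 to 8).
transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Freq(X and Y together) divided by Freq(X) for the rule X -> Y."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)


print(support({"Milk", "Bread"}, transactions))       # 0.375, i.e. 3/8
print(support({"Soda", "Chips"}, transactions))       # 0.5, i.e. 4/8
print(confidence({"Bread"}, {"Milk"}, transactions))  # 0.75, i.e. 3/4
```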

Assignment:

1. Find all frequent 3-item itemsets in the given transactions with minimum support count = 2.

T1 {K, A, D, B}

T2 {D, A, C, E, B}

T3 {C, A, B, E}

T4 {B, A, D}
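A hand-derived answer can be checked by brute force: enumerate every 3-item combination and count the transactions that contain it. The sketch below does exactly that. It is plain enumeration, not the Apriori algorithm, so it is only practical for a handful of items, and the function name `frequent_itemsets` is illustrative.

```python
from itertools import combinations

# Transactions T1 to T4 from the assignment.
transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]


def frequent_itemsets(transactions, size, min_count):
    """Map each itemset of the given size to its support count, keeping counts >= min_count."""
    items = sorted(set().union(*transactions))
    result = {}
    for candidate in combinations(items, size):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count >= min_count:
            result[candidate] = count
    return result


frequent_3 = frequent_itemsets(transactions, size=3, min_count=2)
for itemset, count in frequent_3.items():
    print(itemset, count)
```

With a minimum support count of 2 this lists five itemsets: {A, B, C}, {A, B, D}, {A, B, E}, {A, C, E} and {B, C, E}, of which {A, B, D} occurs three times and the others twice.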



--------------------------------------------------------------------------------------------------------------------

Assignment No: 4

---------------------------------------------------------------------------------------------------------------------

Title of the Assignment: K-Means algorithm for clustering (Using Python)

Objective of the Assignment: Implement K-Means algorithm for clustering to create a Cluster on

the given data.

---------------------------------------------------------------------------------------------------------------------

Theory:

1. Explain the properties of Cluster.

• A cluster is a collection of instances which are “similar” to each other and “dissimilar” to the instances belonging to other clusters.

• Each cluster must have one representative, which is known as the center or centroid of the cluster.

• The clustering should be done in such a way that the intra-cluster distance is minimized and the inter-cluster distance is maximized.

2. Write down the Algorithm for K Means Clustering.

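Pseudocode for K Means Clustering Algorithm

Input: data $D \subseteq \mathbb{R}^d$; number of clusters $K \in \mathbb{N}$.

Output: K cluster means $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^d$.

randomly initialize K vectors $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^d$;
repeat
    assign each $x \in D$ to $\arg\min_j \text{Dist}(x, \mu_j)$;
    for j = 1 to K do
        $D_j \leftarrow \{ x \in D \mid x \text{ assigned to cluster } j \}$;
        $\mu_j = \frac{1}{|D_j|} \sum_{x \in D_j} x$;
    end
until no change in $\mu_1, \mu_2, \ldots, \mu_K$
return $\mu_1, \mu_2, \ldots, \mu_K$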

Conclusion:

Assignment:

1. Consider the following data set consisting of the scores of two variables on each of eight individuals. Apply the k-means clustering algorithm to allocate the instances into proper clusters.

Given Instance    A     B
P1                2     2
P2                1     14
P3                10    7
P4                1     11
P5                3     4
P6                11    8
P7                4     3
P8                12    9
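One possible way to run the algorithm on this table is sketched below in plain Python. The assignment does not fix the number of clusters or the starting centroids, so K = 3 and the choice of P1, P2 and P3 as initial centroids (instead of a random initialization) are assumptions made here to keep the run reproducible; squared Euclidean distance is assumed for Dist.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and mean-update steps until the centroids stop changing."""
    centroids = [tuple(float(v) for v in c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid unchanged
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Instances P1 to P8 from the table, as (A, B) pairs.
points = [(2, 2), (1, 14), (10, 7), (1, 11), (3, 4), (11, 8), (4, 3), (12, 9)]

# Assumption: K = 3, with P1, P2 and P3 as the initial centroids.
centroids, labels = k_means(points, initial_centroids=points[:3])
print("centroids:", centroids)
print("labels:", labels)
```

Under these assumptions the run converges after two passes to the centroids (3, 3), (1, 12.5) and (11, 8), with clusters {P1, P5, P7}, {P2, P4} and {P3, P6, P8}. A different K or a different initialization can give a different result.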

--------------------------------------------------------------------------------------------------------------------



______________________________________________________________________________

Assignment No: 5

______________________________________________________________________________

Title of the Assignment: SVM for classification (Using Python )

Objective of the Assignment: Implement SVM for performing classification and find its

accuracy on the given data.

______________________________________________________________________________

Theory:

1. Maximum-margin linear classifier: Support Vector Machine with mathematical formulation.

Support vector machines are supervised machine learning algorithms used in classification or regression problems. An SVM looks for the optimal boundary (hyperplane) between the classes, possibly after transforming the input data. Support vectors are the training instances nearest to the separating hyperplane; they determine the margin between the two classes.

Consider three hyperplanes (A, B and C), all of which segregate the classes. A better hyperplane can be chosen by maximizing the distance between the hyperplane and the nearest data point.

The distance between the nearest data point and the hyperplane is called the margin. In the figure, the margin for hyperplane C is higher than that of A and B; therefore, the right hyperplane is C.

The margin is the distance of the closest examples from the decision boundary, i.e.

$\text{Margin} = m / \|w\|$

where m is the distance between the nearest training instances and the decision boundary as measured along w (the value of $|w^T x + b|$ at those instances), and w is the weight vector. The maximum-margin problem can be solved in either primal or dual form.

To maximize the margin, note that the value a on the margin boundaries is arbitrary, because w and b can be rescaled. With C now denoting the decision boundary and A and B the two margin boundaries on either side of it:

A: $w^T x + b = -a$

C: $w^T x + b = 0$

B: $w^T x + b = +a$

Normalizing a = 1, the primal form of the SVM can be solved with quadratic programming, a well-studied class of optimization algorithms:

$\max_{w, b} \ \gamma = 1 / \|w\|$ subject to $(w^T x_j + b)\, y_j \ge 1 \ \forall j$

Primal Form:

$\min_{w, b} \ w^T w$ subject to $(w^T x_j + b)\, y_j \ge 1 \ \forall j$

This gives the margin boundaries as

A: $w^T x + b = -1$

C: $w^T x + b = 0$

B: $w^T x + b = +1$

The higher the margin, the more robust the classifier. If we select a hyperplane with a low margin, there is a high chance of misclassification.
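The primal constraints and the margin can be checked numerically for any candidate hyperplane. The sketch below is not a quadratic-programming solver: it only tests the constraint $(w^T x_j + b)\, y_j \ge 1$ for a few hand-picked candidates and keeps the feasible one with the smallest $w^T w$, which is the one with the largest margin $1 / \|w\|$. The toy points, the candidate list and the function names are all hypothetical.

```python
import math


def functional_margins(w, b, xs, ys):
    """Values of (w^T x_j + b) * y_j for every training instance."""
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in zip(xs, ys)]


def is_feasible(w, b, xs, ys, tol=1e-9):
    """True if every instance satisfies the primal constraint (w^T x_j + b) y_j >= 1."""
    return all(m >= 1.0 - tol for m in functional_margins(w, b, xs, ys))


def geometric_margin(w):
    """Margin 1 / ||w|| obtained after normalizing the margin boundaries to +1 and -1."""
    return 1.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical linearly separable toy data with labels +1 and -1.
xs = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
ys = [1, 1, -1, -1]

# Hypothetical candidate hyperplanes (w, b).
candidates = [
    ((1.0, 0.0), -1.0),   # feasible, margin 1.0
    ((2.0, 0.0), -2.0),   # feasible, but margin only 0.5
    ((0.5, 0.0), -0.5),   # violates the constraints
]

feasible = [(w, b) for w, b in candidates if is_feasible(w, b, xs, ys)]
best_w, best_b = min(feasible, key=lambda wb: sum(wi * wi for wi in wb[0]))
print("best candidate:", best_w, best_b, "margin:", geometric_margin(best_w))
```

In practice the minimization is carried out by an SVM library rather than by comparing candidates by hand; the sketch only makes the role of the constraints and of $\|w\|$ concrete.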



--------------------------------------------------------------------------------------------------------------------

Assignment No: 6

---------------------------------------------------------------------------------------------------------------------

Title of the Assignment: Neural Network (Using Python)

Objective of the Assignment: Creating & Visualizing Neural Network for the given data.

---------------------------------------------------------------------------------------------------------------------

Theory:

1. What is a Neural Network?

In essence, a neural network is a collection of neurons connected by synapses. This collection is organized into three main layers: the input layer, the hidden layer, and the output layer. You can have many hidden layers, which is where the term deep learning comes into play. In an artificial neural network, there are several inputs, which are called features, and they produce a single output, which is called a label.

In a diagram of such a network, the circles represent neurons while the lines represent synapses. The role of a synapse is to multiply the inputs by the weights. You can think of weights as the “strength” of the connection between neurons. Weights primarily define the output of a neural network, and they are adjusted during training. Afterwards, an activation function is applied to return an output.

2. Explain how the simple feedforward neural network works.

Here’s a brief overview of how a simple feedforward neural network works:

1. Takes inputs as a matrix (2D array of numbers).

2. Multiplies the input by a set of weights (performs a dot product, i.e., matrix multiplication).

3. Applies an activation function.

4. Returns an output.

5. The error is calculated as the difference between the desired output from the data and the predicted output. The gradient of this error with respect to the weights is what gradient descent uses to alter the weights.

6. The weights are then altered slightly according to the error.

7. To train, this process is repeated 1,000+ times. The more the network is trained on the data, the more accurate its outputs on that data will be.

At their core, neural networks are simple. They just perform a dot product with the input and weights and apply an activation function. When the weights are adjusted via the gradient of the loss function, the network adapts to produce more accurate outputs.
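The seven steps above can be sketched in plain Python as follows. Several details are assumptions rather than part of the manual: a 2-2-1 architecture without bias terms (so that exactly six weights are needed), the sigmoid activation, scaling each input column by its maximum and the scores by 100, a learning rate of 0.5, and the arrangement of the six weights into the two layers. The data rows and weight values are the ones listed in the assignment for this section.

```python
import math


def sigmoid(z):
    """Activation function applied after each dot product."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Steps 1 to 4: dot product with the weights, then activation, layer by layer."""
    hidden = [sigmoid(sum(xi * w_hidden[i][j] for i, xi in enumerate(x)))
              for j in range(len(w_out))]
    output = sigmoid(sum(h * w for h, w in zip(hidden, w_out)))
    return hidden, output


def mean_squared_error(inputs, targets, w_hidden, w_out):
    """Step 5: average squared difference between desired and predicted outputs."""
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2
               for x, t in zip(inputs, targets)) / len(inputs)


def train_step(inputs, targets, w_hidden, w_out, lr):
    """Step 6: one gradient-descent update of all weights (backpropagation)."""
    grad_hidden = [[0.0] * len(w_out) for _ in w_hidden]
    grad_out = [0.0] * len(w_out)
    for x, t in zip(inputs, targets):
        hidden, out = forward(x, w_hidden, w_out)
        delta_out = 2.0 * (out - t) * out * (1.0 - out)
        for j, h in enumerate(hidden):
            grad_out[j] += delta_out * h
            delta_hidden = delta_out * w_out[j] * h * (1.0 - h)
            for i, xi in enumerate(x):
                grad_hidden[i][j] += delta_hidden * xi
    n = len(inputs)
    new_hidden = [[w - lr * g / n for w, g in zip(w_row, g_row)]
                  for w_row, g_row in zip(w_hidden, grad_hidden)]
    new_out = [w - lr * g / n for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out


# (hours studied, hours slept) rows; the last row has no known test score.
raw_inputs = [(2, 9), (1, 5), (3, 6), (4, 8)]
raw_scores = [92, 86, 89]

# Assumed scaling: divide each column by its maximum, and scores by 100.
col_max = [max(col) for col in zip(*raw_inputs)]
scaled = [[v / m for v, m in zip(row, col_max)] for row in raw_inputs]
train_x, query_x = scaled[:3], scaled[3]
train_y = [s / 100 for s in raw_scores]

# Assumed arrangement of the six weights: input-to-hidden (2 x 2), then hidden-to-output (2).
w_hidden = [[0.2, 0.6], [0.1, 0.8]]
w_out = [0.3, 0.7]

loss_before = mean_squared_error(train_x, train_y, w_hidden, w_out)
for _ in range(1000):  # step 7: repeat the update many times
    w_hidden, w_out = train_step(train_x, train_y, w_hidden, w_out, lr=0.5)
loss_after = mean_squared_error(train_x, train_y, w_hidden, w_out)

predicted_score = 100 * forward(query_x, w_hidden, w_out)[1]
print("loss before:", loss_before, "loss after:", loss_after)
print("predicted test score for (4, 8):", predicted_score)
```

The predicted value depends on every one of these assumptions, so it should be compared with a hand calculation made under the same architecture and scaling.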

Assignment:

Consider the input below and calculate the value of the last instance. Compare the answer with the output of the Python code. Consider the random weights (0.2, 0.6, 0.1, 0.8, 0.3, 0.7).

Hours Studied, Hours Slept (input)    Test Score (output)
2, 9                                  92
1, 5                                  86
3, 6                                  89
4, 8                                  ?



--------------------------------------------------------------------------------------------------------------------

Assignment No: 7

---------------------------------------------------------------------------------------------------------------------

Title of the Assignment: Performance Measurements. (Using Weka)

Objective of the Assignment: On the given data, perform performance measurements such as Accuracy, Error rate, Precision, Recall, TPR, FPR, TNR, etc.

---------------------------------------------------------------------------------------------------------------------

Theory:

State the definitions of Accuracy, Error rate, Precision, Recall, TPR, TNR and FPR.

In the formulae below, TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives.

1. Accuracy: Accuracy (ACC) is calculated as the number of all correct predictions divided by the total number of instances in the dataset: ACC = (TP + TN) / (TP + TN + FP + FN). The best accuracy is 1.0, whereas the worst is 0.0.

2. Error Rate: Error rate (ERR) is calculated as the number of all incorrect predictions divided by the total number of instances in the dataset: ERR = (FP + FN) / (TP + TN + FP + FN). The best error rate is 0.0, whereas the worst is 1.0.

3. Precision: Precision (PREC) is calculated as the number of correct positive predictions divided by the total number of positive predictions: PREC = TP / (TP + FP). It is also called positive predictive value (PPV). The best precision is 1.0, whereas the worst is 0.0.

4. Recall or TPR: Recall is calculated as the number of correct positive predictions divided by the total number of positives: TPR = TP / (TP + FN). It is also called sensitivity or true positive rate (TPR). The best sensitivity is 1.0, whereas the worst is 0.0.

5. TNR: True negative rate (TNR) is calculated as the number of correct negative predictions divided by the total number of negatives: TNR = TN / (TN + FP). It is also called specificity. The best specificity is 1.0, whereas the worst is 0.0.

6. FPR: False positive rate (FPR) is calculated as the number of incorrect positive predictions divided by the total number of negatives: FPR = FP / (TN + FP). The best false positive rate is 0.0, whereas the worst is 1.0. It can also be calculated as 1 – specificity.
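The six definitions translate directly into code once the four confusion-matrix counts are known. The sketch below is a minimal Python illustration; the counts used in the example (TP = 40, FN = 10, FP = 5, TN = 45) are made up and do not come from the assignment data.

```python
def performance_measures(tp, fn, fp, tn):
    """Compute the measures defined above from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall_tpr": tp / (tp + fn),
        "tnr": tn / (tn + fp),
        "fpr": fp / (tn + fp),
    }


# Hypothetical counts for a two-class problem with 100 instances.
measures = performance_measures(tp=40, fn=10, fp=5, tn=45)
for name, value in measures.items():
    print(name, round(value, 4))
```

Two quick consistency checks follow from the definitions: accuracy and error rate sum to 1, and so do TNR and FPR.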

Assignment:

• Calculate accuracy, precision and recall for the following, and compare the results with the WEKA tool.



--------------------------------------------------------------------------------------------------------------------

Assignment No: 8

---------------------------------------------------------------------------------------------------------------------

Title of the Assignment: Principal Component Analysis (Using R)

---------------------------------------------------------------------------------------------------------------------

Objective of the Assignment: Principal Component Analysis-Finding Principal Components,

Variance and Standard Deviation calculations of principal components

---------------------------------------------------------------------------------------------------------------------

Theory:

What are the objectives of Principal Components Analysis (PCA)?

Write down the algorithm for Principal Component Analysis.

Conclusion:

Assignment:

Perform the Principal Component analysis and display the stepwise result using R Studio.
