



SNJB’s Late Sau. K. B. Jain College of Engineering, Chandwad
Department of Computer Engineering

Machine Learning Mock End Solution 2018-19 Sem-II
Class- BE Computer   Pattern- 2015   Marks- 70

Q1 A) Explain the concept of Deep Learning and bio-inspired adaptive systems with reference to machine learning.
Ans- Many researchers started training bigger and bigger models, built with several different layers (which is why this approach is called deep learning), to solve new challenging problems. The availability of cheap and fast computers allowed them to get results in acceptable timeframes and to use very large datasets (made up of images, texts, and animations).

• The idea behind these techniques is to create algorithms that work like a brain, drawing on neurosciences and cognitive psychology.
• In particular, there is a growing interest in pattern recognition and associative memories whose structure and functioning are similar to what happens in the neocortex. Such an approach also allows simpler algorithms, called model-free algorithms.
• These are based on generic learning techniques and repeated experiences; testing different architectures and optimization algorithms is quite simple.

Q1 B) What do you mean by Feature selection and Filtering with reference to machine learning?
Ans- An unnormalized dataset with many features contains information proportional to the independence of all features and their variance. Let's consider a small dataset with three features, generated with random Gaussian distributions:

• Even without further analysis, it's obvious that the central line (with the lowest variance) is almost constant and doesn't provide any useful information: its entropy H(X) is quite small, while the other two variables carry more information.
• A variance threshold is, therefore, a useful approach to remove all those elements whose contribution (in terms of variability and, so, information) is under a predefined level.
• scikit-learn provides the class VarianceThreshold that can easily solve this problem. It is a feature selector that removes all low-variance features.


• This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
• threshold : float, optional. Features with a training-set variance lower than this threshold will be removed. The default is to remove the features that have the same value in all samples.

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
>>> selector = VarianceThreshold()
>>> selector.fit_transform(X)
array([[2, 0],
       [1, 4],
       [1, 1]])

scikit-learn also provides classes to select the best features according to specific criteria based on F-tests and p-values, such as chi-square or ANOVA.
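For illustration only (not part of the original solution), a minimal sketch of such score-based selection is given below; it assumes scikit-learn's SelectKBest with the chi-square and ANOVA F-test scoring functions, and the Iris dataset and k=2 are illustrative choices.

# Minimal sketch: univariate feature selection with chi-square and ANOVA F-test scores.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square scores (features must be non-negative)
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# The same API works with an ANOVA F-test
X_anova = SelectKBest(f_classif, k=2).fit_transform(X, y)

print(X_chi2.shape, X_anova.shape)   # (150, 2) (150, 2)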

Q2 A) Explain with examples how Machine Learning has brought revolution in our lives?
Ans- Machine learning is one element (perhaps the driving force) of AI, whereby a computer is programmed with the ability to self-teach and improve its performance at a specific task. In essence, machine learning is all about analyzing big data -- the automatic extraction of information and using it to make predictions, decipher whether the prediction was correct, and if incorrect, learning from that to make a more correct prediction in the future.

Google, Amazon, Netflix and other monolithic online platforms use it to deliver semantic results based on algorithms that analyze a user's search, purchase and viewing history to predict what it is they're looking for or are more likely to want.

The data they have at their disposal is massive. A recent global digital report published by We Are Social and Hootsuite states that the number of people using the internet to search hit 4 billion in 2018. Every second, there are approximately 40,000 searches processed, which equates to 3.5 billion a day, or an incredible 1.2 trillion searches per year. Each year, humanity spends the equivalent of 1 billion years online.

Applications- Washing machine, geyser, microwave oven, refrigerator, etc.
Health- Machine learning is taking a bigger part in our health and well-being on a daily basis, and it is already being used for faster patient diagnosis. Even the prevention of illness in the first place has been aided by predicting the potential health problems one may be susceptible to, based on age, socio-economic status, genetic history, etc.

In the very near future, we can expect the automation of practically every aspect of the home. You could be stuck in traffic on your way back from work and make your home cozy from the car, turning the heat on, dimming the lights and having your favorite song playing as you step through the door.

Q2 B) What is the effect of an unnormalized dataset in feature engineering?
Ans- We normalize data when we are looking for relationships. Doing so in experimental designs is, unfortunately, not correct unless the variable is a transformed one and all the data needs the same normalization method, such as pH in some agricultural studies. Normalization in experimental designs is meaningless because we can't compare the mean of, for instance, one treatment with the mean of another treatment that has been logarithmically normalized. In regression and multivariate analysis, however, where the relationships are of interest, we can normalize to reach a linear, more robust relationship. Commonly, when the relationship between two datasets is non-linear, we transform the data to reach a linear relationship. Here, normalization doesn't mean normalizing the data itself; it means normalizing the residuals by transforming the data. So normalization of data implies normalizing the residuals using transformation methods.

In data mining, we can consider four main steps:
1. Smoothing data
2. Generalization of data
3. Normalization of data
4. Construction of new characteristics
It is easier to construct the new characteristics after the normalization process.

In ANNs and other data mining approaches we need to normalize the inputs, otherwise the network will be ill-conditioned. In essence, normalization is done to have the same range of values for each of the inputs to the ANN model. This can guarantee stable convergence of weights and biases.
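As a small illustration (not part of the original answer), the sketch below assumes scikit-learn's MinMaxScaler and StandardScaler; the toy array is an illustrative assumption.

# Minimal sketch: rescaling inputs to a common range before training a scale-sensitive model.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-max normalization: every column ends up in [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: zero mean, unit variance per column
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)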


Q3 A) Explain the random sample consensus algorithm with an example.
Ans- A common problem with linear regression is caused by the presence of outliers. An ordinary least squares approach will take them into account and the result (in terms of coefficients) will therefore be biased. In the following figure, there's an example of such behavior: the less sloped line represents an acceptable regression which discards the outliers, while the other one is influenced by them. An interesting approach to avoid this problem is offered by random sample consensus (RANSAC), which works with every regressor by subsequent iterations, after splitting the dataset into inliers and outliers.

• The model is trained only with valid samples (evaluated internally or through the callable is_data_valid()) and all samples are re-evaluated to verify if they're still inliers or they have become outliers.

• The process ends after a fixed number of iterations or when the desired score is achieved.

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X.reshape((-1, 1)), Y.reshape((-1, 1)))
>>> lr.intercept_
array([ 5.500572])
>>> lr.coef_
array([[ 2.53688672]])

• As imagined, the slope is high due to the presence of outliers. The resulting regressor is y = 5.5 + 2.5x (slightly less sloped than what was shown in the figure).

• Now we're going to use RANSAC with the same linear regressor.
• RANSAC is an iterative algorithm for the robust estimation of parameters from a subset of inliers of the complete data set.

>>> from sklearn.linear_model import RANSACRegressor
>>> rs = RANSACRegressor(lr)
>>> rs.fit(X.reshape((-1, 1)), Y.reshape((-1, 1)))
>>> rs.estimator_.intercept_
array([ 2.03602026])
>>> rs.estimator_.coef_
array([[ 0.99545348]])

In this case, the regressor is about y = 2 + x (which corresponds to the original clean dataset without outliers).
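The snippet above relies on an X and Y defined earlier in the source; a self-contained sketch is given below, where the synthetic dataset and injected outliers are assumptions and not the data used in the original figure.

# Minimal sketch: ordinary least squares vs. RANSAC on data with a few outliers.
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

np.random.seed(0)
X = np.arange(0, 20, 0.5)
Y = 2.0 + 1.0 * X + np.random.normal(0, 0.5, size=X.shape)
Y[-5:] += 40.0                                   # inject a few strong outliers

lr = LinearRegression()
lr.fit(X.reshape(-1, 1), Y)
print("OLS:   ", lr.intercept_, lr.coef_)        # biased by the outliers

rs = RANSACRegressor(lr)
rs.fit(X.reshape(-1, 1), Y)
print("RANSAC:", rs.estimator_.intercept_, rs.estimator_.coef_)  # close to y = 2 + x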

Q3 B) What are the strengths and weaknesses of linear regression?
Ans- Linear regression is a statistical model.
Advantages:
• Space complexity is very low; the model only needs to save the weights at the end of training, and prediction is fast, so it is a low-latency algorithm.
• It is very simple to understand and has good interpretability.
• Feature importance is generated at the time of model building, and with the help of the regularization hyperparameter lambda, feature selection can be handled, so we can achieve dimensionality reduction.
Disadvantages:
• The algorithm assumes the data is normally distributed, which in reality is often not the case.
• Multicollinearity should be removed before building the model.

There are two general limitations to linear regression for data analysis:
1. Does the model adequately describe the processes that generated the data?
   a. Is the output really linear in all the inputs?
   b. Are the inputs really independent of each other?
   c. Are all the inputs included in the model?


2. Is the data sufficient to determine the coefficients in the model?
   a. Is there enough data?
   b. Does the data have a high enough signal-to-noise ratio? (a and b are related; you may be able to compensate for a lower signal-to-noise ratio by collecting more data.)

Q4 A) What is Isotonic regression? Explain with a suitable example.
Ans- Isotonic regression finds a non-decreasing approximation of a function while minimizing the mean squared error on the training data. The benefit of such a model is that it does not assume any form for the target function, such as linearity. It produces a piecewise interpolating function minimizing the functional:

L = Σi wi (yi − ŷi)², subject to ŷ being non-decreasing (ŷi ≤ ŷi+1 for every i)

An example (with a toy dataset) is provided next:

>>> import numpy as np
>>> X = np.arange(-5, 5, 0.1)
>>> Y = X + np.random.uniform(-0.5, 1, size=X.shape)

Following is a plot of the dataset. As everyone can see, it can be easily modeled by a linear regressor, but without a highly non-linear function it is very difficult to capture the slight (and local) modifications in the slope:

• The class IsotonicRegression needs to know ymin and ymax (which correspond to the variables y0 and yn in the loss function). In this case, we impose -6 and 10:

>>> from sklearn.isotonic import IsotonicRegression
>>> ir = IsotonicRegression(-6, 10)
>>> Yi = ir.fit_transform(X, Y)

• The result is provided through three instance variables:

>>> ir.X_min_
-5.0
>>> ir.X_max_
4.8999999999999648
>>> ir.f_
<scipy.interpolate.interpolate.interp1d at 0x126edef8>

• The last one, (ir.f_), is an interpolating function which can be evaluated in the domain [xmin, xmax]. For example:

>>> ir.f_(2)
array(1.7294334618146134)


Q4 B) Write a short note on Polynomial regression.
Ans- A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial equation:

y = a + b*x^2

While there might be a temptation to fit a higher degree polynomial to get lower error, this can result in over-fitting. Always plot the relationships to see the fit and focus on making sure that the curve fits the nature of the problem. Here is an example of how plotting can help:

• Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.

• For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

• Number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.
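To make the expansion in the bullets above concrete, a minimal sketch follows; it assumes scikit-learn's PolynomialFeatures and uses [a, b] = [2, 3] purely as an illustrative sample.

# Minimal sketch: degree-2 polynomial feature expansion of a sample [a, b].
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])                    # one sample of the form [a, b]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))              # [[1, 2, 3, 4, 6, 9]] -> [1, a, b, a^2, ab, b^2]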

Q5 A) Define the terms: a. Bernoulli's Distribution b. Binomial Distribution c. MAP Decision Rule d. Maximum Likelihood Function
Ans- Bernoulli's Distribution- If X is a random variable and is Bernoulli-distributed, it can assume only two values (for simplicity, let's call them 0 and 1) and their probability is:

P(X = 1) = p and P(X = 0) = 1 − p, i.e. P(X = x) = p^x (1 − p)^(1 − x) for x in {0, 1}

Binomial Distribution- The binomial distribution gives the discrete probability distribution of obtaining exactly n successes out of N Bernoulli trials (where the result of each Bernoulli trial is true with probability p and false with probability q = 1 − p). The binomial distribution is therefore given by:

P(n | N) = C(N, n) p^n q^(N − n)   ... (1)


MAP Decision Rule- A maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution (that quantifies the additional information available through prior knowledge of a related event) over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.

Maximum Likelihood Function- In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model, given observations. The method obtains the parameter estimates by finding the parameter values that maximize the likelihood function. The estimates are called maximum likelihood estimates, which is also abbreviated as MLE.

Q5 B) Explain one-dimensional and N-dimensional Gaussian mixture models with a suitable example.
Ans- A Gaussian mixture model is parameterized by two types of values: the mixture component weights and the component means and variances/covariances. For a Gaussian mixture model with K components, the k-th component has a mean μk and variance σk² in the univariate case, and a mean vector μk and covariance matrix Σk in the multivariate case. The mixture component weights are defined as φk for component k, with the constraint that Σk φk = 1, so that the total probability distribution normalizes to 1. If the component weights aren't learned, they can be viewed as an a-priori distribution over components, such that the probability of a point being generated by component k is φk. If they are instead learned, they are the a-posteriori estimates of the component probabilities given the data.

One-Dimensional Model

N-Dimensional Gaussian Mixture Model
Example- Density Estimation: Since the GMM is completely determined by the parameters of its individual components, a fitted GMM can give an estimate of the probabilities of both in-sample and out-of-sample data points; this is known as density estimation. Furthermore, since numerically sampling from an individual Gaussian distribution is possible, one can easily sample from a GMM to create synthetic datasets.
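A small sketch of this density-estimation use is given below; it assumes scikit-learn's GaussianMixture, and the two-component synthetic data and evaluation point are illustrative assumptions.

# Minimal sketch: fitting a 2-component GMM, evaluating density, and sampling from it.
import numpy as np
from sklearn.mixture import GaussianMixture

np.random.seed(0)
X = np.concatenate([np.random.normal(-3, 1, size=(200, 1)),
                    np.random.normal(4, 1.5, size=(300, 1))])

gmm = GaussianMixture(n_components=2).fit(X)
print(gmm.weights_)                 # learned mixture weights (sum to 1)
print(gmm.means_.ravel())           # learned component means
print(gmm.score_samples([[0.0]]))   # log-density estimate at x = 0
X_new, labels = gmm.sample(5)       # draw synthetic samples from the fitted model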


Q6 A) Define the terms: a. Multinomial distribution b. Gaussian distribution c. Support vector machine d. Kernel matrix
Ans- Multinomial distribution- Feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification.

• A multinomial distribution is useful to model feature vectors where each value represents, for example, the number of occurrences of a term or its relative frequency.

• If the feature vectors have n elements and each of them can assume one of k different values, the k-th value occurring with probability pk, then:

P(x1, ..., xk) = (n! / (x1! x2! ... xk!)) · p1^x1 · p2^x2 · ... · pk^xk

• The conditional probabilities P(xi|y) are computed with a frequency count (which corresponds to applying a maximum likelihood approach), but in this case, it's important to consider the alpha parameter (called Laplace smoothing factor). Its default value is 1.0 and it prevents the model from setting null probabilities when the frequency is zero.

• It's possible to assign all non-negative values; however, larger values will assign higher probabilities to the missing features and this choice could alter the stability of the model. In our example, we're going to consider the default value of 1.0.
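As a brief illustration of the bullets above (not part of the original answer), the sketch below assumes scikit-learn's MultinomialNB with the default Laplace smoothing factor alpha=1.0; the tiny term-count matrix is an invented example.

# Minimal sketch: multinomial naive Bayes over term-count feature vectors.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents, columns are term counts (the multinomial feature vectors)
X = np.array([[2, 1, 0, 0],
              [1, 3, 0, 1],
              [0, 0, 4, 2],
              [0, 1, 3, 3]])
y = np.array([0, 0, 1, 1])

mnb = MultinomialNB(alpha=1.0)      # alpha is the Laplace smoothing factor
mnb.fit(X, y)
print(mnb.predict([[1, 2, 0, 0]]))  # -> class 0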

b. Gaussian distribution- In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called Normal distribution. When plotted, it gives a bell shaped curve which is symmetric about the mean of the feature values as shown below:

Gaussian naive Bayes is useful when working with continuous values whose probabilities can be modeled using a Gaussian distribution:

P(xi | y) = (1 / sqrt(2π σy²)) · exp(−(xi − μy)² / (2σy²))

c. Support vector machine- In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.

• A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane.
• In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.

d. Kernel matrix- A vector v is in the kernel of a matrix A if and only if Av = 0; the kernel is therefore the span of all such vectors. Similarly, a vector v is in the kernel of a linear transformation T if and only if T(v) = 0. For example, the kernel of the matrix A = [[1, 0, 0], [0, 2, 1]] is the set of vectors of the form (0, −s, 2s)ᵀ, where s can be any number.
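A quick numerical check of this example is sketched below; it assumes the matrix A as reconstructed above and scipy's null_space helper, which returns an orthonormal basis of the kernel.

# Minimal sketch: verifying the kernel (null space) of A numerically.
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 1.0]])
v = null_space(A)                 # one basis vector, shape (3, 1)
print(v.ravel())                  # proportional to (0, -1, 2)
print(A @ v)                      # ~ [0, 0]: A v = 0, so v is in the kernel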

Q6 B) What are support vectors and margins? Also explain soft margin SVM.
Ans- The SVM in particular defines the criterion to be looking for a decision surface that is maximally far away from any data point. This distance from the decision surface to the closest data point determines the margin of the classifier. This method of construction necessarily means that the decision function for an SVM is fully specified by a (usually small) subset of the data which defines the position of the separator. These points are referred to as the support vectors (in a vector space, a point can be thought of as a vector between the origin and that point). Figure 15.1 shows the margin and support vectors for a sample problem. Other data points play no part in determining the decision surface that is chosen.

In both the soft margin and hard margin cases we are maximizing the margin between support vectors, i.e. minimizing ½||w||². In the soft margin case, we let our model give some relaxation to a few points: if we insisted on classifying them correctly, our margin might reduce significantly and our decision boundary would be poorer, so instead of considering them as support vectors we consider them as error points and give a certain penalty for them which is proportional to the amount by which each data point violates the hard constraint.

In the hard margin case we are minimizing:

minimize ½||w||²   subject to   yi (w · xi + b) ≥ 1 for all i

In the case of error points we need to use a more general version of the Support Vector Machine, that is, one with soft margins. So instead of minimizing ½||w||² alone, the minimization problem is changed to:

minimize ½||w||² + C Σi ξi   subject to   yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Here ξi allows some slackness in the constraint, giving a linear penalty to mistakes in classification, and the hyperparameter C controls the trade-off between a large margin and few margin violations.
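For illustration (not part of the original solution), the sketch below assumes scikit-learn's SVC, whose parameter C plays the role of the penalty term above; the toy blobs dataset and C values are illustrative assumptions.

# Minimal sketch: large C approximates a hard margin, small C allows more slack.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

hard_ish = SVC(kernel='linear', C=1000.0).fit(X, y)   # few violations tolerated
soft = SVC(kernel='linear', C=0.1).fit(X, y)          # wider margin, more slack allowed

print(len(hard_ish.support_), len(soft.support_))     # the softer margin keeps more support vectors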


Q7 A) Define a cluster tree. Write and explain the Agglomerative Clustering algorithm.
Ans- A cluster tree is a tree T such that:

• Every leaf of T is a distinct symbol.
• Every internal node of T has at least two children.
• Each internal node of T is labelled with a non-negative value. Two or more nodes may be given the same value.

(For the purposes of sections 5-7, labels may be taken to be non-negative integers; in section 8, it will be useful to allow rational labels.)

• Every leaf of the tree is labelled 0.
• The label of every internal node in the tree is less than the label of its parent.

For any node N of T, the field ``N.symbols'' gives the set of symbols in the leaves of the subtree of T rooted at N, and the field ``N.label'' gives the integer label on node N. Thus, for example, in Figure 1, n3.label = 3 and n3.symbols = {a,d}; n1.label = 5 and n1.symbols = {a,b,c,d,e,f,g}. As we shall see, the nodes of the tree T represent the clusters of a set of points, and the labels represent the relative sizes of the diameters of the clusters.

• Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
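A minimal sketch of the agglomerative (bottom-up) approach is given below; it assumes scikit-learn's AgglomerativeClustering, and the synthetic blobs and n_clusters=3 are illustrative assumptions.

# Minimal sketch: bottom-up hierarchical clustering with Ward linkage.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)        # each point starts alone; clusters are merged greedily
print(labels[:10])                 # cluster assignment of the first samples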

Q7 B) Define Ensemble learning with a suitable example. Also, what do you mean by bagging and boosting as ensemble methods?
Ans- Ensemble modeling is a powerful way to improve the performance of your model. It usually pays off to apply ensemble learning over and above the various models you might be building. Time and again, people have used ensemble models in competitions like Kaggle and benefited from it. Ensemble learning is a broad topic and is only confined by your own imagination.

Example: I want to invest in a company XYZ. I am not sure about its performance though. So, I look for advice on whether the stock price will increase by more than 6% per annum or not, and I decide to approach various experts having diverse domain experience:
1. Employee of Company XYZ: This person knows the internal functionality of the company and has insider information about the functioning of the firm. But he lacks a broader perspective on how competitors are innovating, how the technology is evolving and what impact this evolution will have on Company XYZ's product. In the past, he has been right 70% of the time.


2. Financial Advisor of Company XYZ: This person has a broader perspective on how the company's strategy will fare in this competitive environment. However, he lacks a view on how the company's internal policies are faring. In the past, he has been right 75% of the time.
3. Stock Market Trader: This person has observed the company's stock price over the past 3 years. He knows the seasonality trends and how the overall market is performing. He has also developed a strong intuition on how stocks might vary over time. In the past, he has been right 70% of the time.
4. Employee of a competitor: This person knows the internal functionality of the competitor firms and is aware of certain changes which are yet to be brought in. He lacks insight into the company in focus and the external factors which can relate the growth of the competitor with the company in question. In the past, he has been right 60% of the time.
5. Market Research team in the same segment: This team analyzes the customer preference for company XYZ's product over others and how this is changing with time. Because it deals with the customer side, it is unaware of the changes company XYZ will bring because of alignment to its own goals. In the past, they have been right 75% of the time.
6. Social Media Expert: This person can help us understand how company XYZ has positioned its products in the market, and how the sentiment of customers is changing over time towards the company. He is unaware of any kind of details beyond digital marketing. In the past, he has been right 65% of the time.

Given the broad spectrum of access we have, we can probably combine all the information and make an informed decision.

BAGGING
Bootstrap Aggregation (or Bagging for short) is a simple and very powerful ensemble method. Bagging is the application of the bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

1. Suppose there are N observations and M features. A sample of observations is selected randomly with replacement (bootstrapping).
2. A subset of features is selected to create a model with the sample of observations and subset of features.
3. The feature from the subset which gives the best split on the training data is selected.
4. This is repeated to create many models, and every model is trained in parallel.
5. The prediction is given based on the aggregation of predictions from all the models.

BOOSTING
Boosting refers to a group of algorithms that utilize weighted averages to turn weak learners into stronger learners. Unlike bagging, which has each model run independently and then aggregates the outputs at the end without preference to any model, boosting is all about "teamwork": each model that runs dictates which features the next model will focus on.
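The two methods can be contrasted with a short sketch (not part of the original answer); it assumes scikit-learn's BaggingClassifier and AdaBoostClassifier over decision trees, and the synthetic dataset and hyperparameters are illustrative choices.

# Minimal sketch: bagging (independent bootstrapped trees) vs. boosting (sequential weighted weak learners).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)   # default base learner: decision stumps

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())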

Q8 A) Explain with the help of diagrams and equations Minkowski, Euclidean, Manhattan and Hamming distances.
Ans- Three famous metrics used to calculate the distance between two points are:
1. Manhattan distance
2. Euclidean distance
3. Minkowski distance

There are three special cases of the Minkowski distance:

• p = 1: this distance measure is often called city block distance, or Manhattan distance.
• p = 1, binary data: Hamming distance. The Hamming distance counts the number of positions at which the bits of two binary values differ.
• p = 2: with p equalling 2, the Minkowski distance reduces to the well-known Euclidean distance.

Equations:
N-dimensional Manhattan distance:
d(x, y) = Σ (i = 1 to N) |xi − yi|

Two-dimensional Euclidean distance:
d(x, y) = sqrt((x1 − y1)² + (x2 − y2)²)


N-dimensional Euclidean distance:
d(x, y) = sqrt(Σ (i = 1 to N) (xi − yi)²)

N-dimensional Minkowski distance:
d(x, y) = (Σ (i = 1 to N) |xi − yi|^c)^(1/c)

For c = 1 and c = 2, the Minkowski metric becomes equal to the Manhattan and Euclidean metrics respectively.
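A small sketch of these formulas follows (not part of the original answer); the helper function and sample points are illustrative assumptions, and orders c = 1 and c = 2 reproduce the Manhattan and Euclidean distances.

# Minimal sketch: Minkowski distance with Manhattan and Euclidean as special cases.
import numpy as np

def minkowski(x, y, c):
    """N-dimensional Minkowski distance of order c."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** c) ** (1.0 / c)

x, y = [1, 2, 3], [4, 0, 3]
print(minkowski(x, y, 1))   # 5.0    -> Manhattan: |1-4| + |2-0| + |3-3|
print(minkowski(x, y, 2))   # ~3.606 -> Euclidean: sqrt(3^2 + 2^2)
print(minkowski(x, y, 3))   # general Minkowski distance of order 3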

Q8 B) What is the concept of Kernels as High Dimensional Feature Mapping? Also, what do you mean by the majority class decision rule?
Ans- The algorithm above converges only for linearly separable data. If the data set is not linearly separable, we can map the samples into a feature space of higher dimensions:

x → φ(x)

in which the classes can be linearly separated. The decision function in the new space becomes:

f(x) = sign(w · φ(x) + b)

where w and b are the parameters of the decision plane in the new space. As the vectors φ(x) appear only in inner products in both the decision function and the learning law, the mapping function φ does not need to be explicitly specified. Instead, all we need is the inner product of the vectors in the new space. The function φ is a kernel-induced implicit mapping.

Definition: A kernel is a function K that takes two vectors x and z as arguments and returns the value of the inner product of their images φ(x) and φ(z):

K(x, z) = φ(x) · φ(z)

As only the inner product of the two vectors in the new space is returned, the dimensionality of the new space is not important.


The learning algorithm in the kernel space can be obtained by replacing all inner products in the learning algorithm in the original space with the kernels, so the decision function becomes:

f(x) = sign( Σi αi yi K(xi, x) + b )

The parameter b can be found from any support vector xs:

b = ys − Σi αi yi K(xi, xs)

Example 0: linear kernel. Assume φ(x) = x, the identity mapping; then K(x, z) = x · z, and the kernelized algorithm reduces to the original linear one.
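As a tiny illustration of kernels as implicit inner products (not part of the original answer), the sketch below implements the linear kernel from the example and, as an additional assumed example, the RBF kernel with an arbitrary gamma.

# Minimal sketch: kernels return inner products of mapped vectors without computing the mapping.
import numpy as np

def linear_kernel(x, z):
    # K(x, z) = x . z  (identity mapping phi(x) = x)
    return np.dot(x, z)

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2): inner product in an infinite-dimensional feature space
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(linear_kernel(x, z))   # 2.0
print(rbf_kernel(x, z))      # exp(-0.5 * 5) ~ 0.082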

A rule is extracted from the tree by associating a path from the root to a leaf (the rule condition) with the majority class at the leaf (the rule conclusion). The majority class is simply that having the greatest frequency in the class distribution of training examples reaching the leaf. The set of such rules, one for each path, is the induced classifier and can be used to classify unseen examples.

The key point of these methods is how to learn the best rule given a set of examples. One possibility is to use the idea from decision trees and search in the space of conjunctions. We start with the empty rule and at each step we select the best new conjunction for the rule using a heuristic (e.g. entropy) and a greedy strategy (e.g. keep the best). To avoid local optima, a more exhaustive search can be performed, for example using beam search and storing the k best rules. The majority class of the examples selected is assigned as the rule prediction.

With N being the number of examples in a node, E the number of examples that are not in the majority class, and Bcf the binomial probability distribution function, the estimated error for a node can be computed as: Error = N · Bcf(E, N)


Q9 A) Explain deep learning. What are the challenges in Deep learning?
Ans- Deep learning is a set of machine learning algorithms that model high-level abstractions in data using architectures consisting of multiple nonlinear transformations. What does it mean? A deep learning technology is based on artificial neural networks (ANNs). These ANNs constantly receive learning algorithms and continuously growing amounts of data to increase the efficiency of the training process. The larger the data volumes are, the more efficient this process is. The training process is called «deep» because, as time passes, a neural network covers a growing number of levels. The «deeper» this network penetrates, the higher its productivity is.

How Deep Learning Works
A deep machine learning process consists of two main phases: training and inferring. You should think about the training phase as a process of labeling large amounts of data and determining their matching characteristics. The system compares these characteristics and memorizes them to make correct conclusions when it faces similar data the next time. A deep learning training process includes the following stages:

1. ANNs ask a set of binary false/true questions.
2. Extracting numerical values from data blocks.
3. Classifying data according to the answers received.
4. Labeling data.

During the inferring phase, the deep learning model draws conclusions and labels new, previously unseen data using the knowledge acquired during training.

Challenges in Deep learning:
• The need for lots of data
• Neural networks at the core of deep learning are black boxes
• Overfitting the model
• Lack of flexibility

Q9 B) Write a short note on knowledge-based recommendation systems.
Ans-

• The basic concepts are users, items, and ratings (or an implicit feedback about the products, like the fact of having bought them). Every model must work with known data (like in a supervised scenario), to be able to suggest the most suitable items or to predict the ratings for all the items not evaluated yet.

• We're going to discuss two different kinds of strategies:
  • User- or content-based
  • Collaborative filtering

Knowledge-based recommender systems (knowledge-based recommenders) are a specific type of recommender system that is based on explicit knowledge about the item assortment, user preferences, and recommendation criteria (i.e., which item should be recommended in which context). These systems are applied in scenarios where alternative approaches such as collaborative filtering and content-based filtering cannot be applied. A major strength of knowledge-based recommender systems is the non-existence of cold-start (ramp-up) problems. A corresponding drawback is a potential knowledge acquisition bottleneck triggered by the need to define recommendation knowledge in an explicit fashion.
Such a system suggests products based on inferences about a user's needs and preferences:
• Functional knowledge: about how a particular item meets a particular user need.
• The user model can be any knowledge structure that supports this inference:
  • a query, i.e., the set of preferred features for a product;
  • a case (in a case-based reasoning system);
  • an adapted similarity metric (for matching);
  • a part of an ontology.
• There is a large use of domain knowledge encoded in a knowledge representation language/approach.

Q9 C) Write a short note on Artificial Neural Networks.
Ans- Artificial neural networks (ANN) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains. The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs. Such systems "learn" to perform tasks by considering examples, generally without being programmed with any task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images. They do this without any prior knowledge about cats, for example, that they have fur, tails, whiskers and cat-like faces. Instead, they automatically generate identifying characteristics from the learning material that they process.

An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.

In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called 'edges'. Artificial neurons and edges typically have a weight that adjusts as learning proceeds; the weight increases or decreases the strength of the signal at a connection.

Q10 A) Explain silhouettes with a suitable example.
Ans- The silhouette score is based on the principle of "maximum internal cohesion and maximum cluster separation". In other words, we would like to find the number of clusters that produces a subdivision of the dataset into dense blocks that are well separated from each other. After defining a distance metric (Euclidean is normally a good choice), we can compute the average intra-cluster distance for each element:

a(xi) = average distance between xi and all the other elements of its own cluster

• We can also define the average nearest-cluster distance (which corresponds to the lowest inter-cluster distance):

b(xi) = average distance between xi and the elements of the nearest cluster

• The silhouette score for an element xi is defined as:

s(xi) = (b(xi) − a(xi)) / max(a(xi), b(xi))

• This value is bounded between -1 and 1, with the following interpretation:
  • A value close to 1 is good (1 is the best condition) because it means that a(xi) << b(xi).
  • A value close to 0 means that the difference between intra- and inter-cluster measures is almost null and therefore there is cluster overlap.
  • A value close to -1 means that the sample has been assigned to the wrong cluster because a(xi) >> b(xi).
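As a brief illustration (not part of the original answer), the sketch below assumes scikit-learn's silhouette_score together with KMeans to compare candidate numbers of clusters; the synthetic blobs dataset is an illustrative assumption.

# Minimal sketch: using the average silhouette score to choose the number of clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))   # the highest score suggests the best k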

Q10 B) Write a short note on Linear layers.
Ans-
• These layers are an array of neurons. Each takes multiple inputs and produces multiple outputs.
• The number of inputs and outputs is arbitrary, and the two are of uncorrelated length.


• These building blocks are basically an array of linear combinations of inputs scaled by weights. Weights multiply inputs with an array of weight values, and they are usually learned with a learning algorithm.

• Linear layers do not come with non-linearities; you will have to add one after each layer. If you stack multiple layers, you will need a non-linearity between them, or they will all collapse to a single linear layer.
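A tiny numpy sketch of such a layer follows (not part of the original answer); the shapes and random weights are illustrative assumptions.

# Minimal sketch: a linear (fully connected) layer is a weighted sum of inputs plus a bias.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # 3 inputs -> 4 outputs (weights learned by a training algorithm)
b = np.zeros(4)                    # bias vector

def linear_layer(x):
    return W @ x + b               # no non-linearity: stacking such layers stays linear

x = np.array([0.5, -1.0, 2.0])
print(linear_layer(x))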

Q10 C) Write a short note on Convolutional layers.
Ans-
• They are exactly like a linear layer, but each output neuron is connected to a locally constrained group of input neurons.
• This group is often called a receptive field, borrowing the name from neuroscience. Convolutions can be performed in 1D, 2D, 3D, etc.
• If the inputs are correlated, then it makes more sense to look at a group of inputs rather than a single value as in a linear layer. A linear layer can be thought of as a convolution with one value per filter.
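A small numpy sketch of a 1D convolution follows (not part of the original answer); the signal and the size-3 averaging filter are illustrative assumptions.

# Minimal sketch: each output is computed from a local receptive field of the input.
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([0.25, 0.5, 0.25])            # receptive field of size 3 (filter weights)
print(np.convolve(signal, kernel, mode='valid'))  # [2.0, 3.0, 4.0]: local weighted sums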

********************************** THE END*********************************************