23 Cards in this Set

  • Front
  • Back

A weak learner with less than 50% accuracy does not present any problem to the AdaBoost algorithm.

False. If the error is greater than 0.5, then the weight assigned to the misclassified points will be smaller than the weight assigned to the correctly classified points. Subsequent iterations will therefore not try to classify these points correctly, and the algorithm is likely to exhibit poor performance.
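
As a minimal sketch (assuming NumPy), the standard AdaBoost weight update below shows the failure mode: when the weighted error eps exceeds 0.5, the learner weight alpha = 0.5 * ln((1 - eps) / eps) goes negative, so misclassified points end up with smaller weights than correctly classified ones.

    import numpy as np

    def adaboost_reweight(weights, y_true, y_pred):
        """One AdaBoost round: return the learner weight alpha and the updated,
        renormalized example weights. Labels are assumed to be in {-1, +1}."""
        eps = np.sum(weights * (y_true != y_pred)) / np.sum(weights)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)                         # negative when eps > 0.5
        new_w = weights * np.exp(-alpha * y_true * y_pred)            # shrinks, not grows, the wrong ones
        return alpha, new_w / new_w.sum()

    # Illustration: a learner that gets 3 of 4 points wrong (eps = 0.75).
    w = np.ones(4) / 4
    y = np.array([1, 1, -1, -1])
    pred = np.array([1, -1, 1, 1])
    alpha, w_new = adaboost_reweight(w, y, pred)
    print(alpha, w_new)   # alpha < 0; the misclassified points are down-weighted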

AdaBoost is not susceptible to outliers. If your answer is true, explain why. If your answer is false, describe a simple heuristic to fix AdaBoost so that it is not susceptible to outliers.

False. AdaBoost is susceptible to outliers. A possible heuristic is to put a threshold on the weights and remove all points that have very large weights (outliers typically end up with large weights because they are consistently misclassified).
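
A minimal sketch of such a heuristic, assuming NumPy (the cutoff factor and the helper name are illustrative choices, not a standard API):

    import numpy as np

    def drop_suspected_outliers(X, y, weights, factor=10.0):
        """Heuristic: treat points whose boosting weight exceeds `factor` times the
        mean weight as likely outliers and drop them before the next round."""
        keep = weights <= factor * weights.mean()
        return X[keep], y[keep], weights[keep] / weights[keep].sum()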

As we increase k, the k nearest neighbor algorithm begins to overfit the training dataset. True or False.

False. In fact, as we increase k, the algorithm underfits because the decision surface becomes simpler (for example, when k equals the number of training points, the induced function is just the majority-class function).
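
A quick check of the extreme case, assuming scikit-learn is available and using synthetic data: with k equal to the number of training points, every query is assigned the training-set majority class.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = (rng.random(100) < 0.7).astype(int)          # class 1 is the majority

    knn_all = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)
    print(knn_all.predict(rng.normal(size=(5, 2))))  # every prediction is the majority class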

What is the time complexity of the K-means algorithm? Use O-notation and explain your notation clearly.

The time complexity is O(nkdi), where n is the number of data points, d is the number of features, k is the number of clusters and i is the number of iterations needed until convergence.
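
A bare-bones sketch of Lloyd's algorithm, assuming NumPy (not an optimized implementation), makes the count visible: each of the i iterations computes n*k distances, each costing O(d).

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        """Plain K-means: i iterations, each computing n*k distances in d dimensions."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):                                         # i iterations
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # n*k squared distances, O(d) each
            labels = d2.argmin(axis=1)                                 # assign each point to its nearest center
            centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])                    # recompute the k means, O(n*d)
        return centers, labels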

Compare decision trees with kNN: what do they have in common, and what are the main differences between the two approaches?

Commonality: decision trees and kNN are both supervised learning methods that assign a class to an object based on its features. Main differences: decision trees build a hierarchy of rules, in the form of a tree, from the training data, giving priority to the more informative features; the tree can become very complex as it grows and requires pruning to improve its performance. kNN, by contrast, is an instance-based (lazy) method: it builds no explicit model, stores the training data, and classifies a new point by the majority class among its nearest neighbors, treating all features equally in the distance computation.

Many decision-making systems use Bayes' theorem together with conditional independence assumptions. What are those assumptions exactly? Why are they made? What is the problem with making them?

Every feature Fi is conditionally independent of every other feature Fj (for i ≠ j) given the class; that is, the presence of a particular feature is assumed to be unrelated to the presence of any other feature once the class is known. The assumption is made to simplify decision making: it simplifies the computations and dramatically reduces the knowledge-acquisition cost. The problem is that features are often correlated, and making this assumption in the presence of correlation leads to errors in the probability computations and, ultimately, to making the wrong decision.
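
In symbols, the assumption is P(F1, ..., Fn | C) = P(F1 | C) * ... * P(Fn | C), so a classifier only needs per-feature likelihoods. A toy sketch with made-up probabilities:

    # Toy categorical Naive Bayes; the priors and likelihoods below are invented for illustration.
    prior = {"spam": 0.4, "ham": 0.6}
    likelihood = {                       # P(feature | class), features assumed conditionally independent
        "spam": {"has_link": 0.8, "all_caps": 0.6},
        "ham":  {"has_link": 0.2, "all_caps": 0.1},
    }

    def posterior_scores(features_present):
        scores = {}
        for c in prior:
            p = prior[c]
            for f in features_present:   # product over features: the independence assumption at work
                p *= likelihood[c][f]
            scores[c] = p                # proportional to P(C | features)
        return scores

    print(posterior_scores(["has_link", "all_caps"]))   # {'spam': 0.192, 'ham': 0.012}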

What does bias measure, and what does variance measure? Assume we have a model with high bias and low variance: what does this mean?

Bias measures the error between the estimator's expected value and the real parameter. Variance measures how much the estimator fluctuates around its expected value. A model with high bias and low variance is a simple model that underfits the dataset.
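
A small simulation sketch of these two definitions, assuming NumPy and using the maximum-likelihood variance estimator as the estimator under study:

    import numpy as np

    rng = np.random.default_rng(0)
    true_var = 4.0                          # the real parameter
    estimates = np.array([
        np.var(rng.normal(0, 2, size=10))   # ML variance estimate from a sample of 10
        for _ in range(10_000)
    ])
    bias = estimates.mean() - true_var      # E[estimator] - true parameter (about -0.4 here)
    variance = estimates.var()              # fluctuation of the estimator around its own mean
    print(f"bias ~ {bias:.2f}, variance ~ {variance:.2f}")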

Maximum likelihood, MAP, and the Bayesian approach all estimate the parameters of models. What are the main differences between the three approaches?

Maximum likelihood estimates the parameter by choosing the value under which the observed data are most likely. MAP and the Bayesian approach both take into account the prior density of the parameter. MAP replaces the whole posterior density with a single point (its mode) to avoid evaluating the integral, whereas the Bayesian approach evaluates the full integral, typically with an approximation method.
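
A coin-flip sketch of the three answers, assuming a conjugate Beta(2, 2) prior chosen purely for illustration:

    # Estimate a coin's heads probability from k heads in n flips, with a Beta(a, b) prior.
    k, n = 7, 10
    a, b = 2.0, 2.0

    mle = k / n                                # maximizes P(Data | theta)
    map_est = (k + a - 1) / (n + a + b - 2)    # mode of the posterior P(theta | Data)
    posterior_mean = (k + a) / (n + a + b)     # summary of the full posterior (its mean)

    print(mle, map_est, posterior_mean)        # 0.70, 0.666..., 0.642...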

What is the biggest advantage of decision trees when compared to logistic regression classifiers?

Decision trees do not assume independence of the input features and can thus encode complicated relationships between these variables, whereas logistic regression treats each feature independently.

What is the biggest weakness of decision trees compared to logistic regression classifiers?

Decision trees are more likely to overfit the data, since they can split on many different combinations of features, whereas logistic regression associates only one parameter with each feature.

Briefly describe the difference between a maximum likelihood hypothesis and a maximum a posteriori hypothesis.

ML: maximize the data likelihood given the model, i.e., argmax_W P(Data | W)

MAP: maximize the posterior of the model given the data, i.e., argmax_W P(W | Data)

The error of a hypothesis measured over its training set provides a pessimistically biased estimate of the true error of the hypothesis.

False. The training error is optimistically biased: it is usually smaller than the true error, because the hypothesis was chosen to fit the training set.

If you are given m data points and use half for training and half for testing, the difference between training error and test error decreases as m increases.

True. As we get more and more data, the training error increases and the test error decreases, and both converge to the true error.
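
A rough learning-curve sketch, assuming scikit-learn and synthetic data, illustrating the shrinking gap:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
    for m in (40, 400, 4000):
        half = m // 2                                        # half for training, half for testing
        clf = LogisticRegression(max_iter=1000).fit(X[:half], y[:half])
        train_err = 1 - clf.score(X[:half], y[:half])
        test_err = 1 - clf.score(X[half:m], y[half:m])
        print(m, round(train_err, 3), round(test_err, 3))    # the gap narrows as m grows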

Overfitting is more likely when the set of training data is small.

True. With a small training dataset, it is easier to find a hypothesis that fits the training data exactly, i.e., to overfit.

Overfitting is more likely when the hypothesis space is small.

False. We can see this from the bias-variance trade-off: when the hypothesis space is small, the learner has higher bias and lower variance, so it is less likely to find a hypothesis that fits the data very closely, i.e., to overfit.

Are outliers always bad, and should we always ignore them? Why? (Give one short reason for ignoring outliers, and one short reason against.)

Outliers are often "bad" data, caused by faulty sensors or by errors made when entering values; in such cases, the outliers are not part of the function we want to learn and should be ignored. On the other hand, an outlier could be just an unlikely sample from the true distribution of the function of interest; in that case, the data point is just another sample and should not be ignored.

Declare or compute the VC dimension of the following classifier: a K-nearest-neighbor classifier with K = 1.

When K = 1, a 1-NN classifier can correctly classify all training points (each point is its own nearest neighbor), hence the VC dimension is infinite.

Declare or compute the VC dimension of the following classifier: a single-layer perceptron classifier.

A perceptron is a linear classifier, and hence its VC dimension is D + 1, where D is the input dimension.

Declare or compute the VC dimension of the following classifier, assuming input dimension D = 2: a square that assigns points inside it to one class and points outside it to another class. Draw a scenario where this classifier shatters all points for the VC dimension you have proposed.

The VC dimension is 3. Draw three points in 2D in a standard tripod structure; the square can shatter all labeling configurations of these points. Note that a square cannot shatter 4 points regardless of how they are placed. A rectangle can shatter 4 points if they are arranged in a diamond-like shape.

When the data is not completely linearly separable, the linear SVM without slack variables returns w = 0.

False. There is no solution: the optimization problem is infeasible because the margin constraints cannot all be satisfied.

Assume we are using the primal, non-linearly-separable (soft-margin) version of the SVM objective function. What do we need to do to guarantee that the resulting model linearly separates the data?

Set C = ∞, so that any use of slack is infinitely penalized and no margin violations are allowed.
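
In a library such as scikit-learn, where C must be finite, the same effect can be approximated with a very large C; a rough sketch:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0, 0], [1, 1], [3, 3], [4, 4]], dtype=float)
    y = np.array([0, 0, 1, 1])

    # A huge C makes slack so expensive that, on separable data, the solver
    # behaves like the hard-margin (no-slack) SVM.
    hard_margin_like = SVC(kernel="linear", C=1e10).fit(X, y)
    print(hard_margin_like.support_vectors_)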

After training an SVM, we can discard all examples that are not support vectors and still classify new examples.

True. The decision function depends only on the support vectors (they are the only training points with nonzero dual coefficients).
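
A quick check, assuming scikit-learn with a linear kernel: the decision function can be rebuilt from the stored support vectors, dual coefficients, and intercept alone, with no reference to the rest of the training set.

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=200, centers=2, random_state=0)
    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # Recompute the decision function for a new point using only the support vectors.
    x_new = np.array([[0.0, 2.0]])
    manual = clf.dual_coef_ @ (clf.support_vectors_ @ x_new.T) + clf.intercept_
    print(np.allclose(manual.ravel(), clf.decision_function(x_new)))   # True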

Increasing the number of layers always decreases the classification error on test data.

False. Deeper networks have higher capacity and can overfit, so the test error may increase.