Description
Perceptrons
1. Which of the following sets of x ∈ R^3 can be shattered by the 3D perceptron hypothesis set? The set contains all hyperplanes of the form, with our usual notation of x_0 = 1:

$$h_{w}(x) = \mathrm{sign}\Bigl(\sum_{i=0}^{3} w_i x_i\Bigr).$$
Choose the correct answer; explain your choice.
[a] {(2,3,4),(4,3,2),(3,3,3)}
[b] {(1,1,1),(2,3,4),(4,3,2),(4,2,3)}
[c] {(1,1,1),(2,3,4),(4,3,2),(2,2,2)}
[d] {(1,1,1),(2,3,4),(4,3,2),(4,2,3),(3,2,4)}
[e] none of the other choices (none of them can be shattered)
2. What is the growth function of origin-passing perceptrons in 2D? Those perceptrons are all perceptrons with w_0 = 0. Choose the correct answer; explain your choice.
Hint: Put your input vectors on the unit circle, and perhaps think about a polar coordinate system.
[a] 2N + 2
[b] 2N + 1
[c] 2N
[d] 2N − 1
[e] 2N − 2
Donut Hypothesis Set
3. The “donut” hypothesis set in R^d contains hypotheses parameterized by two positive numbers a and b, where

$$h(x) = \begin{cases} +1 & \text{if } a \le \sum_{i=1}^{d} x_i^2 \le b, \\ -1 & \text{otherwise.} \end{cases}$$

What is the growth function of the hypothesis set? Choose the correct answer; explain your choice.

[a] $\binom{N+1}{2} + 1$
[b] $\binom{N+1}{3} + 1$
[c] $\binom{N+1}{6} + 1$
[d] $\binom{N+1}{d} + 1$
[e] none of the other choices
4. Following the previous problem, what is the VC dimension of the donut hypothesis set? Choose the correct answer; explain your choice.

[a] d
[b] 6
[c] 3
[d] 2
[e] none of the other choices
More on VC Dimension
5. Which of the following hypothesis sets has a VC dimension different from the others? Choose the correct answer; explain your choice.
[a] Unions of two positive intervals over x ∈ R, which return +1 if x is within at least one of the intervals.
[b] Axis-aligned rectangle classifiers for x ∈ R^2, which return +1 if x is inside a rectangle whose edges are parallel to the axes of R^2.
[c] Positively-biased perceptrons over x ∈ R^4, which contains perceptrons with w_0 > 0.
[d] Polynomial hypotheses of degree 3 for x ∈ R, which are of the form
$$h(x) = \mathrm{sign}\Bigl(\sum_{i=0}^{3} w_i x^i\Bigr).$$
[e] none of the other choices
6. For a finite hypothesis set H with 1126 binary classifiers, what is the largest possible value of dvc(H)? Choose the correct answer; explain your choice.
[a] 1126 [b] 112 [c] 11 [d] 10 [e] 1
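A finite hypothesis set with M classifiers can realize at most M distinct dichotomies on any point set, while shattering d points requires all 2^d dichotomies, so 2^d ≤ M. As a quick sanity check, this upper bound can be computed directly (a small sketch assuming Python's standard library; the helper name `max_possible_dvc` is ours):

```python
import math

def max_possible_dvc(M: int) -> int:
    """Largest d with 2**d <= M: shattering d points needs all 2**d
    dichotomies, and M classifiers yield at most M dichotomies."""
    return math.floor(math.log2(M))
```

Note that this is only an upper bound on dvc(H); whether a particular H with M classifiers attains it is a separate question.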
Deviation from Optimal Hypothesis
7. Recall that the multiple-bin Hoeffding bound quantifies the BAD probability from any hypothesis h
in the hypothesis set. That is,
$$\mathbb{P}\bigl[\exists h \in \mathcal{H} \text{ s.t. } |E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon\bigr] \le 2M \exp(-2\epsilon^2 N).$$
Define the best-E_in hypothesis

$$g = \mathop{\mathrm{argmin}}_{h \in \mathcal{H}} E_{\mathrm{in}}(h)$$

and the best-E_out hypothesis (which is optimal but can only be obtained by a “cheating” algorithm)

$$g^* = \mathop{\mathrm{argmin}}_{h \in \mathcal{H}} E_{\mathrm{out}}(h).$$
Using the multiple-bin Hoeffding bound above, with probability more than 1 − δ, which of the
following is an upper bound of E_out(g) − E_out(g^*)? Choose the correct answer; explain your choice.

[a] $\sqrt{\frac{1}{2N} \ln \frac{M}{\delta}}$
[b] $\sqrt{\frac{1}{N} \ln \frac{2M}{\delta}}$
[c] $\sqrt{\frac{1}{2N} \ln \frac{2M}{\delta}}$
[d] $2\sqrt{\frac{1}{2N} \ln \frac{2M}{\delta}}$
[e] $\sqrt{\frac{1}{N} \ln \frac{M}{\delta}}$
VC Bound
8. When using the positive ray model taught in class, given ε = 0.1 and δ = 0.1, among the five choices, what is the smallest N such that the BAD probability of the VC bound

$$\mathbb{P}\bigl[\exists h \in \mathcal{H} \text{ s.t. } |E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon\bigr] \le 4\, m_{\mathcal{H}}(2N) \exp\Bigl(-\tfrac{1}{8}\epsilon^2 N\Bigr)$$

is ≤ δ? Choose the correct answer; explain your choice.
[a] 10000 [b] 11000 [c] 12000 [d] 13000 [e] 14000
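To search for the smallest N numerically, one can evaluate the right-hand side of the bound as a function of N. A minimal sketch, assuming Python's `math` module and the growth function m_H(N) = N + 1 of positive rays from class (the function name `vc_bad_bound` is ours):

```python
import math

def vc_bad_bound(N: int, eps: float = 0.1) -> float:
    """Evaluate 4 * m_H(2N) * exp(-eps^2 * N / 8) for positive rays,
    whose growth function is m_H(N) = N + 1."""
    return 4 * (2 * N + 1) * math.exp(-(eps ** 2) * N / 8)
```

Scanning the five candidate values of N with this function and comparing against δ = 0.1 answers the question.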
Hessian and Newton Method
9. Let E(w): R^d → R be a function. Denote the gradient b_E(w) and the Hessian A_E(w) by

$$b_E(w) = \nabla E(w) = \begin{bmatrix} \frac{\partial E}{\partial w_1}(w) \\ \vdots \\ \frac{\partial E}{\partial w_d}(w) \end{bmatrix}_{d \times 1} \quad\text{and}\quad A_E(w) = \begin{bmatrix} \frac{\partial^2 E}{\partial w_1^2}(w) & \frac{\partial^2 E}{\partial w_1 \partial w_2}(w) & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_d}(w) \\ \frac{\partial^2 E}{\partial w_2 \partial w_1}(w) & \frac{\partial^2 E}{\partial w_2^2}(w) & \cdots & \frac{\partial^2 E}{\partial w_2 \partial w_d}(w) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 E}{\partial w_d \partial w_1}(w) & \frac{\partial^2 E}{\partial w_d \partial w_2}(w) & \cdots & \frac{\partial^2 E}{\partial w_d^2}(w) \end{bmatrix}_{d \times d}.$$

Then, for any given u, the second-order Taylor expansion of E(w) around u is:

$$E(w) \approx E(u) + b_E(u)^T (w - u) + \tfrac{1}{2} (w - u)^T A_E(u) (w - u).$$
Suppose A_E(u) is positive definite. What is the optimal direction v such that w ← u + v minimizes the right-hand side of the Taylor expansion above? Choose the correct answer; explain your choice.
Note that iterative optimization with v is generally called Newton’s method.
[a] $+(A_E(u))^{-1} b_E(u)$
[b] $-(A_E(u))^{-1} b_E(u)$
[c] $+(A_E(u))^{+1} b_E(u)$
[d] $-(A_E(u))^{+1} b_E(u)$
[e] none of the other choices
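To see why the Newton direction behaves as it does, it helps to try one step on a toy quadratic, where the second-order Taylor expansion is exact. A small sketch assuming NumPy (the matrices Q and c are arbitrary illustrative choices, not from the problem):

```python
import numpy as np

# One Newton step on E(w) = 0.5 * w^T Q w - c^T w, whose Hessian Q is
# positive definite; a single update v = -H^{-1} g lands on the minimizer.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite Hessian
c = np.array([1.0, 1.0])

u = np.array([5.0, -4.0])                # arbitrary starting point
gradient = Q @ u - c                     # b_E(u)
v = -np.linalg.solve(Q, gradient)        # Newton direction -(A_E(u))^{-1} b_E(u)
w = u + v

print(np.allclose(w, np.linalg.solve(Q, c)))  # reaches the true minimizer
```

For a quadratic the Taylor expansion is the function itself, which is why one step suffices; for a general E(w), the same update is iterated.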
10. Following the previous problem, consider minimizing E_in(w) in the logistic regression problem with Newton's method. Consider a data set {(x_n, y_n)}_{n=1}^{N} with the cross-entropy error function for E_in:

$$E_{\mathrm{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\bigl(1 + \exp(-y_n w^T x_n)\bigr), \qquad h_t(x) = \frac{1}{1 + \exp(w_t^T x)}.$$
What is the Hessian A_E(w_t) with E = E_in? Choose the correct answer; explain your choice.
[a] $\frac{1}{N} \sum_{n=1}^{N} x_n x_n^T$
[b] $\frac{1}{N} \sum_{n=1}^{N} \|w_t\|^2\, x_n x_n^T$
[c] $\frac{1}{N} \sum_{n=1}^{N} h_t^2(y_n x_n)\, x_n x_n^T$
[d] $\frac{1}{N} \sum_{n=1}^{N} h_t(y_n x_n)\, h_t(-y_n x_n)\, x_n x_n^T$
[e] none of the other choices
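One way to validate a hand-derived Hessian candidate is to compare it against a finite-difference approximation of E_in on small random data. A generic checker sketch, assuming NumPy (the helper `numerical_hessian` is ours, not from the course):

```python
import numpy as np

def numerical_hessian(E, w, h=1e-5):
    """Central finite-difference Hessian of a scalar function E at w.
    Useful for checking a hand-derived A_E(w) on random (x_n, y_n)."""
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            def shifted(si, sj):
                v = w.astype(float).copy()
                v[i] += si * h
                v[j] += sj * h
                return E(v)
            H[i, j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H
```

Evaluating both the candidate formula and `numerical_hessian` at the same w_t on a few random examples makes agreement (or disagreement) easy to spot.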
Linear Regression
11. In the lecture, we learned that the linear regression weights w must satisfy the normal equation X^T X w = X^T y. If X^T X is not invertible, we cannot compute w_lin = (X^T X)^{-1} X^T y. Then, one possible solution is

$$w_{\mathrm{lin}} = X^\dagger y,$$

where X^† is called the Moore-Penrose pseudo-inverse. Let X = UΣV^T be the singular value decomposition of the N by d + 1 matrix X, with U and V being unitary matrices. The Moore-Penrose pseudo-inverse X^† = VΣ^†U^T is a d + 1 by N matrix, where Σ^†[i][n] = 1/Σ[n][i] when Σ[n][i] is nonzero, and 0 otherwise. Which of the following statements related to X^† is incorrect? Choose the incorrect statement; explain your choice.
[a] (X^T X)^{-1} X^T = X^† when X^T X is invertible.
[b] For any k ∈ Z^+, (XX^†)^k = XX^†.
[c] XX^† = I_N, the N by N identity matrix.
[d] trace(XX^†) = rank(X).
[e] none of the other choices
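The SVD construction of X^† can be checked numerically against NumPy's built-in pseudo-inverse, and the same experiment lets one probe properties such as trace(XX^†) = rank(X). A small sketch assuming NumPy (the toy matrix and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))   # toy "N by d+1" matrix with N = 6, d+1 = 3

# X^dagger = V Sigma^dagger U^T: invert only the nonzero singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
s_dagger = np.array([1.0 / v if v > 1e-12 else 0.0 for v in s])
X_dagger = Vt.T @ np.diag(s_dagger) @ U.T

print(np.allclose(X_dagger, np.linalg.pinv(X)))
print(np.isclose(np.trace(X @ X_dagger), np.linalg.matrix_rank(X)))
```

Trying matrices with N > d + 1 versus N = d + 1, and rank-deficient ones, is a useful way to test each of the four statements.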
12. In the lecture, we learned that logistic regression can be derived by maximizing the likelihood function where each label y_n is generated from P(+1|x_n) “pretended by” 1/(1 + exp(−w^T x_n)). Now, consider a case where each real-valued label y_n is generated from p(y|x_n), where p is used instead of P to denote a probability density function (instead of a probability mass function; you can ignore the subtle mathematical difference for now). We will consider p(y|x_n) to be pretended by a normal distribution with mean w^T x_n and variance a^2. Assume that all inversions in the equations below exist. What is the optimal w^* that maximizes the likelihood function for this case? Choose the correct answer; explain your choice.

[a] $(X^T X)^{-1} X^T y$
[b] $a\,(X^T X)^{-1} X^T y$
[c] $a^2\,(X^T X)^{-1} X^T y$
[d] $(X^T X + a^2 I)^\dagger X^T y$
[e] none of the other choices
Experiments with Linear Models
Next, we will play with linear regression, logistic regression, and their use for binary classification. We will also observe their different robustness to outliers. We will generate artificial 2D data for training and testing in the next problems. Each example (x, y) is assumed to be generated from the following process:
- Flip a fair coin to get either y = +1 or y = −1.
- If y = +1, generate x = (1, x1, x2) where (x1, x2) comes from a normal distribution of mean [2, 3] and covariance [[0.6, 0], [0, 0.6]].
- If y = −1, generate x = (1, x1, x2) where (x1, x2) comes from a normal distribution of mean [0, 4] and covariance [[0.4, 0], [0, 0.4]].

Please generate N = 200 examples from the process as your training data set D. Then, generate 5000 more examples from the process as your test data set (for evaluating E_out).

Hint: Be sure to check whether your normal distribution function needs you to provide the variance, which would be like 0.6 for the y_n = +1 cases, or the standard deviation, which would be like √0.6.
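The generation process above can be sketched as follows, assuming NumPy (function and variable names are ours; note that `multivariate_normal` takes the covariance matrix, not the standard deviation, which is exactly the hint's concern):

```python
import numpy as np

def generate_data(N, rng):
    """Sample N examples: fair-coin labels, then the class-conditional
    2D Gaussians from the problem statement, with x = (1, x1, x2)."""
    y = rng.choice([-1.0, 1.0], size=N)
    X = np.ones((N, 3))
    pos = (y == 1.0)
    X[pos, 1:] = rng.multivariate_normal([2, 3], [[0.6, 0], [0, 0.6]], size=pos.sum())
    X[~pos, 1:] = rng.multivariate_normal([0, 4], [[0.4, 0], [0, 0.4]], size=(~pos).sum())
    return X, y

rng = np.random.default_rng(1)           # each experiment run should use a fresh seed
X_train, y_train = generate_data(200, rng)
X_test, y_test = generate_data(5000, rng)
```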
13. (*) Implement the linear regression algorithm taught in the lecture. Run the algorithm for 100 times, each with a different random seed for generating the two data sets above. What is the average E_in^{sqr}(w_lin), where E_in^{sqr} denotes the averaged squared error over the N examples? Choose the closest answer; provide your code.
[a] 0.04 [b] 0.16 [c] 0.24 [d] 0.28 [e] 0.40
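A minimal sketch of the one-step linear regression solution from the lecture, assuming NumPy (using the pseudo-inverse, so it also covers the non-invertible case from Problem 11):

```python
import numpy as np

def linear_regression(X, y):
    """w_lin = X^dagger y, the analytic least-squares solution."""
    return np.linalg.pinv(X) @ y

def squared_error(w, X, y):
    """Averaged squared error (1/N) * sum_n (w^T x_n - y_n)^2."""
    return float(np.mean((X @ w - y) ** 2))
```

Averaging `squared_error(w_lin, X_train, y_train)` over the 100 seeded runs of the data generator gives the quantity asked for.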
14. (*) Following the previous problem, what is the average E_in^{0/1}(w_lin) − E_out^{0/1}(w_lin), where 0/1 denotes the 0/1 error (i.e. using w_lin for binary classification), E_in^{0/1} denotes the averaged 0/1 error over the N examples, and E_out^{0/1} is estimated using the averaged 0/1 error on the test data set above? Choose the closest answer; provide your code.
[a] 0.091 [b] 0.065 [c] 0.039 [d] 0.013 [e] 0.001
15. (*) Consider two algorithms. The first one, A, is the linear regression algorithm above. The second one, B, is logistic regression, trained with gradient descent with η = 0.1 for T = 500 iterations, starting from w_0 = 0. Run the algorithms on the same D, and record [E_out^{0/1}(A(D)), E_out^{0/1}(B(D))]. Repeat the process for 100 times, each with a different random seed for generating the two data sets above. What is the average [E_out^{0/1}(A(D)), E_out^{0/1}(B(D))]? Choose the closest answer; provide your code.
[a] (0.018, 0.018) [b] (0.058, 0.058) [c] (0.058, 0.093) [d] (0.138, 0.108) [e] (0.268, 0.208)
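A sketch of algorithm B under the stated settings (η = 0.1, T = 500, starting from the zero vector), assuming NumPy; the clipping is our numerical safeguard against overflow, not part of the specification:

```python
import numpy as np

def logistic_regression_gd(X, y, eta=0.1, T=500):
    """Gradient descent on the cross-entropy E_in, starting from w = 0.
    Gradient: (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        s = -y * (X @ w)                                  # -y_n w^T x_n
        theta = 1.0 / (1.0 + np.exp(np.clip(-s, -500, 500)))
        w -= eta * (-(theta * y) @ X) / len(y)
    return w
```

Classification then uses sign(X @ w), and the 0/1 errors of A and B on the test set can be averaged over the 100 runs.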
16. (*) Following the previous problem, in addition to the 200 examples in D, add 20 outlier examples generated from the following process to your training data (but not to your test data). All outlier examples will be labeled y = +1 and x = [1, x1, x2] where (x1, x2) comes from a normal distribution of mean [6, 0] and covariance [[0.3, 0], [0, 0.1]]. Name the new training data set D′. Run the algorithms on the same D′, and record [E_out^{0/1}(A(D′)), E_out^{0/1}(B(D′))]. Repeat the process for 100 times, each with a different random seed for generating the two data sets above. What is the average [E_out^{0/1}(A(D′)), E_out^{0/1}(B(D′))]? Choose the closest answer; provide your code.

[a] (0.070, 0.018) [b] (0.240, 0.048) [c] (0.090, 0.058) [d] (0.090, 0.078) [e] (0.270, 0.108)


