Description
Perceptrons
1. Which of the following sets of x ∈ R^3 can be shattered by the 3D perceptron hypothesis set? The set contains all hyperplanes of the form, with our usual notation of x_0 = 1:

$$h_{w}(x) = \mathrm{sign}\Bigl(\sum_{i=0}^{3} w_i x_i\Bigr).$$
Choose the correct answer; explain your choice.
[a] {(2,3,4),(4,3,2),(3,3,3)}
[b] {(1,1,1),(2,3,4),(4,3,2),(4,2,3)}
[c] {(1,1,1),(2,3,4),(4,3,2),(2,2,2)}
[d] {(1,1,1),(2,3,4),(4,3,2),(4,2,3),(3,2,4)}
[e] none of the other choices (none of them can be shattered)
2. What is the growth function of origin-passing perceptrons in 2D? Those perceptrons are all perceptrons with w_0 = 0. Choose the correct answer; explain your choice.
Hint: Put your input vectors on the unit circle, and perhaps think about a polar coordinate system.
[a] 2N + 2
[b] 2N + 1
[c] 2N
[d] 2N − 1
[e] 2N − 2
Donut Hypothesis Set
3. The “donut” hypothesis set in R^d contains hypotheses parameterized by two positive numbers a and b, where

$$h(x) = \begin{cases} +1 & \text{if } a \le \sum_{i=1}^{d} x_i^2 \le b, \\ -1 & \text{otherwise.} \end{cases}$$

What is the growth function of the hypothesis set? Choose the correct answer; explain your choice.

[a] $\binom{N+1}{2} + 1$
[b] $\binom{N+1}{3} + 1$
[c] $\binom{N+1}{6} + 1$
[d] $\binom{N+1}{d} + 1$
[e] none of the other choices
4. Following the previous problem, what is the VC dimension of the donut hypothesis set? Choose the correct answer; explain your choice.

[a] d
[b] 6
[c] 3
[d] 2
[e] none of the other choices
More on VC Dimension
5. Which of the following hypothesis sets has a VC dimension different from the others? Choose the correct answer; explain your choice.
[a] Unions of two positive intervals over x ∈ R, which return +1 if x is within at least one of the intervals.
[b] Axis-aligned rectangle classifiers for x ∈ R^2, which return +1 if x is inside a rectangle whose edges are parallel to the axes of R^2.
[c] Positively-biased perceptrons over x ∈ R^4, which contains perceptrons with w_0 > 0.
[d] Polynomial hypotheses of degree 3 for x ∈ R, which are of the form
$$h(x) = \mathrm{sign}\Bigl(\sum_{i=0}^{3} w_i x^i\Bigr).$$
[e] none of the other choices
6. For a finite hypothesis set H with 1126 binary classifiers, what is the largest possible value of dvc(H)? Choose the correct answer; explain your choice.
[a] 1126 [b] 112 [c] 11 [d] 10 [e] 1
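A finite hypothesis set with M classifiers can realize at most M distinct dichotomies on any point set, while shattering d points requires all 2^d dichotomies, so 2^d ≤ M. As a quick sanity check, this upper bound can be computed directly (a small sketch assuming Python's standard library; the helper name `max_possible_dvc` is ours):

```python
import math

def max_possible_dvc(M: int) -> int:
    """Largest d with 2**d <= M: shattering d points needs all 2**d
    dichotomies, and M classifiers yield at most M dichotomies."""
    return math.floor(math.log2(M))
```

Note that this is only an upper bound on dvc(H); whether a particular H with M classifiers attains it is a separate question.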
Deviation from Optimal Hypothesis
7. Recall that the multiple-bin Hoeffding bound quantifies the BAD probability from any hypothesis h
in the hypothesis set. That is,
$$\mathbb{P}\bigl[\exists h \in \mathcal{H} \text{ s.t. } |E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon\bigr] \le 2M \exp(-2\epsilon^2 N).$$
Define the best-E_in hypothesis

$$g = \mathop{\mathrm{argmin}}_{h \in \mathcal{H}} E_{\mathrm{in}}(h)$$

and the best-E_out hypothesis (which is optimal but can only be obtained by a “cheating” algorithm)

$$g^* = \mathop{\mathrm{argmin}}_{h \in \mathcal{H}} E_{\mathrm{out}}(h).$$
Using the multiple-bin Hoeffding bound above, with probability more than 1 − δ, which of the
following is an upper bound of E_out(g) − E_out(g^*)? Choose the correct answer; explain your choice.

[a] $\sqrt{\frac{1}{2N} \ln \frac{M}{\delta}}$
[b] $\sqrt{\frac{1}{N} \ln \frac{2M}{\delta}}$
[c] $\sqrt{\frac{1}{2N} \ln \frac{2M}{\delta}}$
[d] $2\sqrt{\frac{1}{2N} \ln \frac{2M}{\delta}}$
[e] $\sqrt{\frac{1}{N} \ln \frac{M}{\delta}}$
VC Bound
8. When using the positive ray model taught in class, given ε = 0.1 and δ = 0.1, among the five choices, what is the smallest N such that the BAD probability of the VC bound

$$\mathbb{P}\bigl[\exists h \in \mathcal{H} \text{ s.t. } |E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon\bigr] \le 4\, m_{\mathcal{H}}(2N) \exp\Bigl(-\tfrac{1}{8}\epsilon^2 N\Bigr)$$

is ≤ δ? Choose the correct answer; explain your choice.
[a] 10000 [b] 11000 [c] 12000 [d] 13000 [e] 14000
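To search for the smallest N numerically, one can evaluate the right-hand side of the bound as a function of N. A minimal sketch, assuming Python's `math` module and the growth function m_H(N) = N + 1 of positive rays from class (the function name `vc_bad_bound` is ours):

```python
import math

def vc_bad_bound(N: int, eps: float = 0.1) -> float:
    """Evaluate 4 * m_H(2N) * exp(-eps^2 * N / 8) for positive rays,
    whose growth function is m_H(N) = N + 1."""
    return 4 * (2 * N + 1) * math.exp(-(eps ** 2) * N / 8)
```

Scanning the five candidate values of N with this function and comparing against δ = 0.1 answers the question.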
Hessian and Newton Method
9. Let E(w): R^d → R be a function. Denote the gradient b_E(w) and the Hessian A_E(w) by

$$b_E(w) = \nabla E(w) = \begin{bmatrix} \frac{\partial E}{\partial w_1}(w) \\ \vdots \\ \frac{\partial E}{\partial w_d}(w) \end{bmatrix}_{d \times 1} \quad\text{and}\quad A_E(w) = \begin{bmatrix} \frac{\partial^2 E}{\partial w_1^2}(w) & \frac{\partial^2 E}{\partial w_1 \partial w_2}(w) & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_d}(w) \\ \frac{\partial^2 E}{\partial w_2 \partial w_1}(w) & \frac{\partial^2 E}{\partial w_2^2}(w) & \cdots & \frac{\partial^2 E}{\partial w_2 \partial w_d}(w) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 E}{\partial w_d \partial w_1}(w) & \frac{\partial^2 E}{\partial w_d \partial w_2}(w) & \cdots & \frac{\partial^2 E}{\partial w_d^2}(w) \end{bmatrix}_{d \times d}.$$

Then, for any given u, the second-order Taylor expansion of E(w) around u is:

$$E(w) \approx E(u) + b_E(u)^T (w - u) + \tfrac{1}{2} (w - u)^T A_E(u) (w - u).$$
Suppose A_E(u) is positive definite. What is the optimal direction v such that w ← u + v minimizes the right-hand side of the Taylor expansion above? Choose the correct answer; explain your choice.
Note that iterative optimization with v is generally called Newton’s method.
[a] $+(A_E(u))^{-1} b_E(u)$
[b] $-(A_E(u))^{-1} b_E(u)$
[c] $+(A_E(u))^{+1} b_E(u)$
[d] $-(A_E(u))^{+1} b_E(u)$
[e] none of the other choices
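To see why the Newton direction behaves as it does, it helps to try one step on a toy quadratic, where the second-order Taylor expansion is exact. A small sketch assuming NumPy (the matrices Q and c are arbitrary illustrative choices, not from the problem):

```python
import numpy as np

# One Newton step on E(w) = 0.5 * w^T Q w - c^T w, whose Hessian Q is
# positive definite; a single update v = -H^{-1} g lands on the minimizer.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite Hessian
c = np.array([1.0, 1.0])

u = np.array([5.0, -4.0])                # arbitrary starting point
gradient = Q @ u - c                     # b_E(u)
v = -np.linalg.solve(Q, gradient)        # Newton direction -(A_E(u))^{-1} b_E(u)
w = u + v

print(np.allclose(w, np.linalg.solve(Q, c)))  # reaches the true minimizer
```

For a quadratic the Taylor expansion is the function itself, which is why one step suffices; for a general E(w), the same update is iterated.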
10. Following the previous problem, consider minimizing E_in(w) in the logistic regression problem with Newton's method. Consider a data set {(x_n, y_n)}_{n=1}^{N} with the cross-entropy error function for E_in:

$$E_{\mathrm{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\bigl(1 + \exp(-y_n w^T x_n)\bigr), \qquad h_t(x) = \frac{1}{1 + \exp(w_t^T x)}.$$
What is the Hessian A_E(w_t) with E = E_in? Choose the correct answer; explain your choice.
[a] $\frac{1}{N} \sum_{n=1}^{N} x_n x_n^T$
[b] $\frac{1}{N} \sum_{n=1}^{N} \|w_t\|^2\, x_n x_n^T$
[c] $\frac{1}{N} \sum_{n=1}^{N} h_t^2(y_n x_n)\, x_n x_n^T$
[d] $\frac{1}{N} \sum_{n=1}^{N} h_t(y_n x_n)\, h_t(-y_n x_n)\, x_n x_n^T$
[e] none of the other choices
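One way to validate a hand-derived Hessian candidate is to compare it against a finite-difference approximation of E_in on small random data. A generic checker sketch, assuming NumPy (the helper `numerical_hessian` is ours, not from the course):

```python
import numpy as np

def numerical_hessian(E, w, h=1e-5):
    """Central finite-difference Hessian of a scalar function E at w.
    Useful for checking a hand-derived A_E(w) on random (x_n, y_n)."""
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            def shifted(si, sj):
                v = w.astype(float).copy()
                v[i] += si * h
                v[j] += sj * h
                return E(v)
            H[i, j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H
```

Evaluating both the candidate formula and `numerical_hessian` at the same w_t on a few random examples makes agreement (or disagreement) easy to spot.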
Linear Regression
11. In the lecture, we learned that the linear regression weights w must satisfy the normal equation X^T X w = X^T y. If X^T X is not invertible, we cannot compute w_lin = (X^T X)^{-1} X^T y. Then, one possible solution is

$$w_{\mathrm{lin}} = X^\dagger y,$$

where X^† is called the Moore-Penrose pseudo-inverse. Let X = UΣV^T be the singular value decomposition of the N by d + 1 matrix X, with U and V being unitary matrices. The Moore-Penrose pseudo-inverse X^† = VΣ^†U^T is a d + 1 by N matrix, where Σ^†[i][n] = 1/Σ[n][i] when Σ[n][i] is nonzero, and 0 otherwise. Which of the following statements related to X^† is incorrect? Choose the incorrect statement; explain your choice.
[a] (X^T X)^{-1} X^T = X^† when X^T X is invertible.
[b] For any k ∈ Z^+, (XX^†)^k = XX^†.
[c] XX^† = I_N, the N by N identity matrix.
[d] trace(XX^†) = rank(X).
[e] none of the other choices
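The SVD construction of X^† can be checked numerically against NumPy's built-in pseudo-inverse, and the same experiment lets one probe properties such as trace(XX^†) = rank(X). A small sketch assuming NumPy (the toy matrix and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))   # toy "N by d+1" matrix with N = 6, d+1 = 3

# X^dagger = V Sigma^dagger U^T: invert only the nonzero singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
s_dagger = np.array([1.0 / v if v > 1e-12 else 0.0 for v in s])
X_dagger = Vt.T @ np.diag(s_dagger) @ U.T

print(np.allclose(X_dagger, np.linalg.pinv(X)))
print(np.isclose(np.trace(X @ X_dagger), np.linalg.matrix_rank(X)))
```

Trying matrices with N > d + 1 versus N = d + 1, and rank-deficient ones, is a useful way to test each of the four statements.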
12. In the lecture, we learned that logistic regression can be derived by maximizing the likelihood function where each label y_n is generated from P(+1|x_n) “pretended by” 1/(1 + exp(−w^T x_n)). Now, consider a case where each real-valued label y_n is generated from p(y|x_n), where p is used instead of P to denote a probability density function (instead of a probability mass function; you can ignore the subtle mathematical difference for now). We will consider p(y|x_n) to be pretended by a normal distribution with mean w^T x_n and variance a^2. Assume that all inversions in the equations below exist. What is the optimal w^* that maximizes the likelihood function for this case? Choose the correct answer; explain your choice.

[a] $(X^T X)^{-1} X^T y$
[b] $a\,(X^T X)^{-1} X^T y$
[c] $a^2\,(X^T X)^{-1} X^T y$
[d] $(X^T X + a^2 I)^\dagger X^T y$
[e] none of the other choices
Experiments with Linear Models
Next, we will play with linear regression, logistic regression, and their use for binary classification. We will also observe their different robustness to outliers. We will generate artificial 2D data for training and testing in the next problems. Each example (x, y) is assumed to be generated from the following process:
- Flip a fair coin to get either y = +1 or y = −1.
- If y = +1, generate x = (1, x1, x2) where (x1, x2) comes from a normal distribution of mean [2, 3] and covariance [[0.6, 0], [0, 0.6]].
- If y = −1, generate x = (1, x1, x2) where (x1, x2) comes from a normal distribution of mean [0, 4] and covariance [[0.4, 0], [0, 0.4]].

Please generate N = 200 examples from the process as your training data set D. Then, generate 5000 more examples from the process as your test data set (for evaluating E_out).

Hint: Be sure to check whether your normal distribution function needs you to provide the variance, which would be like 0.6 for the y_n = +1 cases, or the standard deviation, which would be like √0.6.
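The generation process above can be sketched as follows, assuming NumPy (function and variable names are ours; note that `multivariate_normal` takes the covariance matrix, not the standard deviation, which is exactly the hint's concern):

```python
import numpy as np

def generate_data(N, rng):
    """Sample N examples: fair-coin labels, then the class-conditional
    2D Gaussians from the problem statement, with x = (1, x1, x2)."""
    y = rng.choice([-1.0, 1.0], size=N)
    X = np.ones((N, 3))
    pos = (y == 1.0)
    X[pos, 1:] = rng.multivariate_normal([2, 3], [[0.6, 0], [0, 0.6]], size=pos.sum())
    X[~pos, 1:] = rng.multivariate_normal([0, 4], [[0.4, 0], [0, 0.4]], size=(~pos).sum())
    return X, y

rng = np.random.default_rng(1)           # each experiment run should use a fresh seed
X_train, y_train = generate_data(200, rng)
X_test, y_test = generate_data(5000, rng)
```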
13. (*) Implement the linear regression algorithm taught in the lecture. Run the algorithm for 100 times, each with a different random seed for generating the two data sets above. What is the average E_in^{sqr}(w_lin), where E_in^{sqr} denotes the averaged squared error over the N examples? Choose the closest answer; provide your code.
[a] 0.04 [b] 0.16 [c] 0.24 [d] 0.28 [e] 0.40
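A minimal sketch of the one-step linear regression solution from the lecture, assuming NumPy (using the pseudo-inverse, so it also covers the non-invertible case from Problem 11):

```python
import numpy as np

def linear_regression(X, y):
    """w_lin = X^dagger y, the analytic least-squares solution."""
    return np.linalg.pinv(X) @ y

def squared_error(w, X, y):
    """Averaged squared error (1/N) * sum_n (w^T x_n - y_n)^2."""
    return float(np.mean((X @ w - y) ** 2))
```

Averaging `squared_error(w_lin, X_train, y_train)` over the 100 seeded runs of the data generator gives the quantity asked for.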
14. (*) Following the previous problem, what is the average E_in^{0/1}(w_lin) − E_out^{0/1}(w_lin), where 0/1 denotes the 0/1 error (i.e. using w_lin for binary classification), E_in^{0/1} denotes the averaged 0/1 error over the N examples, and E_out^{0/1} is estimated using the averaged 0/1 error on the test data set above? Choose the closest answer; provide your code.
[a] 0.091 [b] 0.065 [c] 0.039 [d] 0.013 [e] 0.001
15. (*) Consider two algorithms. The first one, A, is the linear regression algorithm above. The second one, B, is logistic regression, trained with gradient descent with η = 0.1 for T = 500 iterations, starting from w_0 = 0. Run the algorithms on the same D, and record [E_out^{0/1}(A(D)), E_out^{0/1}(B(D))]. Repeat the process for 100 times, each with a different random seed for generating the two data sets above. What is the average [E_out^{0/1}(A(D)), E_out^{0/1}(B(D))]? Choose the closest answer; provide your code.
[a] (0.018, 0.018) [b] (0.058, 0.058) [c] (0.058, 0.093) [d] (0.138, 0.108) [e] (0.268, 0.208)
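A sketch of algorithm B under the stated settings (η = 0.1, T = 500, starting from the zero vector), assuming NumPy; the clipping is our numerical safeguard against overflow, not part of the specification:

```python
import numpy as np

def logistic_regression_gd(X, y, eta=0.1, T=500):
    """Gradient descent on the cross-entropy E_in, starting from w = 0.
    Gradient: (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        s = -y * (X @ w)                                  # -y_n w^T x_n
        theta = 1.0 / (1.0 + np.exp(np.clip(-s, -500, 500)))
        w -= eta * (-(theta * y) @ X) / len(y)
    return w
```

Classification then uses sign(X @ w), and the 0/1 errors of A and B on the test set can be averaged over the 100 runs.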
16. (*) Following the previous problem, in addition to the 200 examples in D, add 20 outlier examples generated from the following process to your training data (but not to your test data). All outlier examples will be labeled y = +1 and x = [1, x1, x2] where (x1, x2) comes from a normal distribution of mean [6, 0] and covariance [[0.3, 0], [0, 0.1]]. Name the new training data set D′. Run the algorithms on the same D′, and record [E_out^{0/1}(A(D′)), E_out^{0/1}(B(D′))]. Repeat the process for 100 times, each with a different random seed for generating the two data sets above. What is the average [E_out^{0/1}(A(D′)), E_out^{0/1}(B(D′))]? Choose the closest answer; provide your code.

[a] (0.070, 0.018) [b] (0.240, 0.048) [c] (0.090, 0.058) [d] (0.090, 0.078) [e] (0.270, 0.108)


