COMP6211E Optimization for Machine Learning (Homework #3)


Theoretical Problems (13 points)

1. (3 points) Consider x ∈ R_+, and let

h(x) = − ln x.

• Find the Bregman divergence D_h(x, y).
• Consider the function

f(x) = 0.5 x − ln(x + 1).

What are the smoothness and strong convexity parameters of f with respect to h?
• Consider the h-mirror descent method for minimizing f(x). What is the formula to obtain x_t from x_{t−1} with a learning rate η_t?
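As a numerical sanity check (a sketch only, not a substitute for the requested derivation), the mirror descent update for h(x) = − ln x can be simulated in Python; the step size η = 0.5 and the starting point below are arbitrary illustrative choices:

```python
# Sketch: with h(x) = -ln x the mirror map is grad h(x) = -1/x, so the
# mirror descent update grad h(x_t) = grad h(x_{t-1}) - eta * f'(x_{t-1})
# reads 1/x_t = 1/x_{t-1} + eta * f'(x_{t-1}).

def fprime(x):
    # derivative of f(x) = 0.5*x - ln(x + 1)
    return 0.5 - 1.0 / (x + 1.0)

def mirror_step(x, eta):
    return 1.0 / (1.0 / x + eta * fprime(x))

x = 5.0
for _ in range(200):
    x = mirror_step(x, eta=0.5)
print(x)  # iterates stay positive and approach the minimizer x* = 1
```

Note that, unlike plain gradient descent, the update keeps the iterates inside the domain x > 0 automatically.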

2. (3 points) Consider a composite optimization problem

f(x) + g(x),

with a non-convex regularizer g(x) given by

g(x) = λ Σ_{j=1}^{d} ln(1 + |x_j|),

for some λ > 0. Since the logarithm grows slowly as |x_j| increases, this regularizer leads to less bias than L1 regularization. Assume we want to apply the proximal gradient method

prox_{ηg}[x − η ∇f(x)]

to solve this problem, where the proximal optimization problem becomes

prox_{ηg}(z) = argmin_x [ (1/2) ||x − z||_2^2 + η g(x) ].    (1)

• Show that when η > 0 is sufficiently small, the proximal optimization problem (1) is strongly convex for all z. What is the range for such η?
• Find the closed-form solution for prox_{ηg}(z) when η is sufficiently small so that (1) is convex.
• If f(x) is L-smooth and 2λ-strongly convex, does the proximal gradient algorithm converge? If so, how many iterations are needed to obtain an ε primal suboptimal solution?
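A brute-force evaluation of the scalar prox can help check a hand-derived closed form; the helper name `prox_log_1d` and the grid resolution below are illustrative choices, not part of the assignment:

```python
import numpy as np

# Brute-force check (not the closed form the problem asks for): the
# scalar prox objective 0.5*(x - z)**2 + eta*lam*ln(1 + |x|) has second
# derivative 1 - eta*lam/(1 + |x|)**2 away from 0, so it is strongly
# convex whenever eta*lam < 1.
def prox_log_1d(z, eta, lam):
    grid = np.linspace(-abs(z) - 1.0, abs(z) + 1.0, 200001)
    obj = 0.5 * (grid - z) ** 2 + eta * lam * np.log1p(np.abs(grid))
    return grid[np.argmin(obj)]

print(prox_log_1d(z=2.0, eta=0.5, lam=1.0))  # shrunk toward 0, still positive
```

Because the log penalty flattens out for large |x|, the shrinkage is milder than the constant shift that soft-thresholding applies under L1.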


3. (3 points) Consider the optimization problem

Σ_{i=1}^{n} f_i(A_i x) + ||A_0 x||^2,

where the A_i are matrices. We rewrite it as

φ(x, z) = Σ_{i=1}^{n} f_i(z_i) + ||z_0||^2

subject to A_1 x − z_1 = 0, . . . , A_n x − z_n = 0, A_0 x − z_0 = 0.

• Write down the Lagrangian function L(x, z, α) with multipliers α_1, . . . , α_n and α_0 corresponding to the n + 1 linear constraints.
• Find the dual formulation φ_D(α) = min_{x,z} L(x, z, α) in terms of the conjugates f_i^*.
• If n = 1 and A_1 = I, write down the dual objective function in α_0 by eliminating α_1. Assuming f_1^*(·) is smooth, how do we recover the primal variable x from the dual variable α_0?
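Hand-derived conjugates f_i^* appearing in the dual can be sanity-checked numerically; the sketch below approximates a Fenchel conjugate by maximizing over a grid, using f(z) = 0.5 z^2 (whose conjugate is known to be 0.5 a^2) as a test case:

```python
import numpy as np

# Illustrative sketch: approximate the Fenchel conjugate
# f*(a) = sup_z (a*z - f(z)) by maximizing over a finite grid.
def conjugate_approx(f, a, grid):
    return np.max(a * grid - f(grid))

grid = np.linspace(-10.0, 10.0, 100001)
f = lambda z: 0.5 * z ** 2             # known conjugate: f*(a) = 0.5 * a**2
print(conjugate_approx(f, 3.0, grid))  # approximately 0.5 * 3**2 = 4.5
```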

4. (4 points) Consider a symmetric positive definite matrix A, and let

f(x) = (1/2) x^T A x − b^T x,    g(x) = (λ/2) ||x||_2^2 + μ ||x||_1.

• Find the Fenchel dual of f(x) + g(x).
• Find ∇f^*(α) and ∇g^*(α).
• Write down a closed-form formula for the dual ascent method.
• Find the smoothness of f^* and g^*, and explain how to set the learning rate for dual ascent.
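One way to check a hand-derived ∇f^* is a finite-difference test; the sketch below does this for the quadratic f above, with A, b, and the test point α chosen arbitrarily for illustration:

```python
import numpy as np

# Finite-difference sanity check: for f(x) = 0.5 x^T A x - b^T x, the
# maximizer of a^T x - f(x) solves A x = a + b, and by the envelope
# theorem that maximizer equals grad f*(a).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])       # symmetric positive definite
b = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.7])

def fstar(a):
    x = np.linalg.solve(A, a + b)            # argmax_x a.x - f(x)
    return a @ x - (0.5 * x @ A @ x - b @ x)

grad_closed = np.linalg.solve(A, alpha + b)  # candidate grad f*(alpha)
eps = 1e-6
grad_num = np.array([(fstar(alpha + eps * e) - fstar(alpha - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
print(np.allclose(grad_num, grad_closed, atol=1e-4))  # True
```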

Programming Problem (7 points)

• Download the data in the mnist sub-directory (which contains class 1 (positive) versus class 7 (negative) examples).
• Use the Python template “prog-template.py”, and implement the functions marked with ’# implement’.
• (4 points) Complete the code and produce the plots.
• (1 point) Discuss the behaviors of the proximal and dual averaging algorithms in the experiments.
• (1 point) Discuss the impact of different values of µ in the experiments.
• (1 point) Moreover, can you reduce the degree of oscillation of GD by selecting other learning rates when the objective function is highly non-smooth?
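For the l1 part of the regularizer, proximal-gradient implementations typically rely on coordinate-wise soft-thresholding; since the template’s actual function names are not shown here, the sketch below is illustrative only:

```python
import numpy as np

# Illustrative building block: the prox of t*||w||_1 is coordinate-wise
# soft-thresholding, which shrinks each entry toward 0 by t and zeroes
# out anything smaller than t in magnitude.
def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

print(soft_threshold(np.array([2.0, -0.3, 0.05]), 0.5))
```

The exact zeros it produces are also why proximal methods tend to oscillate less than plain subgradient descent on the non-smooth l1 term.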
