Description
Exercise Sheet 9
We consider a class of optimization problems of the type:

    min_θ J(θ)   s.t.   ∀ i = 1, …, m : g_i(θ) = 0   and   ∀ i = 1, …, l : h_i(θ) ≤ 0
For this class of problems, we can build the Lagrangian:

    L(θ, β, λ) = J(θ) + Σ_{i=1}^{m} β_i g_i(θ) + Σ_{i=1}^{l} λ_i h_i(θ)
where (βi)i and (λi)i are the dual variables. According to the Karush-Kuhn-Tucker (KKT) conditions, it is necessary for a solution of this optimization problem that the following constraints are satisfied (in addition to the original constraints of the optimization problem):
    ∂L/∂θ = 0                           (stationarity)
    ∀ i = 1, …, l : λ_i ≥ 0             (dual feasibility)
    ∀ i = 1, …, l : λ_i h_i(θ) = 0      (complementary slackness)
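As a quick illustration (not part of the exercise sheet), consider the one-dimensional toy problem min_θ θ² subject to θ ≥ 1, i.e. with the single inequality constraint h(θ) = 1 − θ ≤ 0:

```latex
\min_\theta \theta^2 \quad \text{s.t.} \quad 1 - \theta \le 0,
\qquad L(\theta, \lambda) = \theta^2 + \lambda (1 - \theta)
\\[4pt]
\text{stationarity: } 2\theta - \lambda = 0, \qquad
\text{dual feasibility: } \lambda \ge 0, \qquad
\text{compl.\ slackness: } \lambda (1 - \theta) = 0
\\[4pt]
\lambda = 0 \;\Rightarrow\; \theta = 0 \text{ (infeasible)}, \quad
\text{so } 1 - \theta = 0, \quad \theta^\star = 1, \quad \lambda^\star = 2.
```

Here the constraint is active at the solution, so the multiplier is strictly positive; an inactive constraint would instead force λ = 0 by complementary slackness.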
We will make use of these conditions to derive the dual form of the kernel ridge regression problem.
Exercise 1: Kernel Ridge Regression with Lagrange Multipliers (10 + 20 + 10 + 10 P)
Let x_1, …, x_N ∈ R^d be a dataset with labels y_1, …, y_N ∈ R. Consider the regression model f(x) = w⊤φ(x) where φ : R^d → R^h is a feature map and w is obtained by solving the constrained optimization problem
    min_{ξ,w} Σ_{i=1}^{N} (1/2) ξ_i²   s.t.   ∀ i = 1, …, N : ξ_i = w⊤φ(x_i) − y_i   and   (1/2)∥w∥² ≤ C
where equality constraints define the errors of the model, where the objective function penalizes these errors, and where the inequality constraint imposes a regularization on the parameters of the model.
- (a) Construct the Lagrangian and state the KKT conditions for this problem. (Hint: rewrite the equality constraint as ξ_i − w⊤φ(x_i) + y_i = 0.)
- (b) Show that the solution of the kernel ridge regression problem above, expressed in terms of the dual variables (β_i)_i and λ, is given by:

    β = (K + λI)⁻¹ λ y

where K is the kernel Gram matrix with entries K_ij = φ(x_i)⊤φ(x_j).
- (c) Express the prediction f(x) = w⊤φ(x) in terms of the dual parameters.
- (d) Explain how the new parameter λ can be related to the parameter C of the original formulation.
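The closed-form solution from (b) can be sanity-checked numerically. The following sketch (an illustration under stated assumptions, not part of the exercise) uses the linear feature map φ(x) = x, so that K = XX⊤, and verifies that the weight vector recovered from the dual, w = (1/λ) Σ_i β_i φ(x_i), coincides with the familiar primal ridge regression solution w = (X⊤X + λI)⁻¹X⊤y:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 20, 3, 0.7
X = rng.normal(size=(N, d))   # data matrix, one row per x_i
y = rng.normal(size=N)        # labels

K = X @ X.T                   # Gram matrix for the linear feature map phi(x) = x
beta = np.linalg.solve(K + lam * np.eye(N), lam * y)   # beta = (K + lam I)^-1 lam y

w_dual = X.T @ beta / lam     # w = (1/lam) sum_i beta_i phi(x_i)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # primal ridge solution
```

The agreement of the two vectors follows from the matrix identity X⊤(XX⊤ + λI)⁻¹ = (X⊤X + λI)⁻¹X⊤.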
Exercise 2: Programming (50 P)
Download the programming files on ISIS and follow the instructions.
Exercise sheet 9 (programming) [WiSe 2021/22] Machine Learning 1
Gaussian Processes
In this exercise, you will implement Gaussian process regression and apply it to a toy and a real dataset. We use the notation of the paper “Rasmussen (2005). Gaussian Processes in Machine Learning” linked on ISIS.
Let us first draw a training set X = (x_1, …, x_n) and a test set X⋆ = (x⋆_1, …, x⋆_m) from a d-dimensional input distribution. The Gaussian process is a model under which the real-valued outputs f = (f_1, …, f_n) and f⋆ = (f⋆_1, …, f⋆_m) associated to X and X⋆ follow the Gaussian distribution:
    [ f  ]        ( [ 0 ]   [ Σ     Σ⋆  ] )
    [ f⋆ ]   ∼   N( [ 0 ] , [ Σ⋆⊤   Σ⋆⋆ ] )

where

    Σ = k(X, X) + σ²I,    Σ⋆ = k(X, X⋆),    Σ⋆⋆ = k(X⋆, X⋆) + σ²I
and where k(⋅, ⋅) is the Gaussian kernel function. (The kernel function is implemented in utils.py.) Predicting the output for new data points X⋆ is achieved by conditioning the joint probability distribution on the training set. This conditional distribution, called the posterior distribution, can be written as:
    f⋆ | f ∼ N(μ⋆, C⋆)   with   μ⋆ = Σ⋆⊤ Σ⁻¹ f   and   C⋆ = Σ⋆⋆ − Σ⋆⊤ Σ⁻¹ Σ⋆
Having inferred the posterior distribution, the log-likelihood of observing for the inputs X ⋆ the outputs y ⋆ is given by evaluating the distribution f ⋆ | f at y ⋆ :
    log p(y⋆ | f) = −(1/2) (y⋆ − μ⋆)⊤ C⋆⁻¹ (y⋆ − μ⋆) − (1/2) log|C⋆| − (m/2) log 2π
where | ⋅ | is the determinant. Note that the likelihood of the data given this posterior distribution can be measured both for the training data and the test data.
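The conditioning formulas above translate almost line-by-line into code. Below is a minimal sketch; the Gaussian kernel is inlined here rather than imported from utils.py, and the toy data is made up purely for illustration:

```python
import numpy as np

def gaussian_kernel(A, B, width):
    # k(a, b) = exp(-||a - b||^2 / (2 width^2)); stand-in for the utils.py version
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(8, 1))     # training inputs
f = np.sin(X[:, 0])                     # training outputs
Xs = np.array([[0.0], [1.5]])           # test inputs X*
sigma, width = 0.1, 0.5

S   = gaussian_kernel(X,  X,  width) + sigma**2 * np.eye(len(X))    # Sigma
Ss  = gaussian_kernel(X,  Xs, width)                                # Sigma*
Sss = gaussian_kernel(Xs, Xs, width) + sigma**2 * np.eye(len(Xs))   # Sigma**

mu = Ss.T @ np.linalg.solve(S, f)         # posterior mean mu* = Sigma*^T Sigma^-1 f
C  = Sss - Ss.T @ np.linalg.solve(S, Ss)  # posterior cov  C*  = Sigma** - Sigma*^T Sigma^-1 Sigma*

# log-likelihood of observing outputs ys at Xs under the posterior
ys = np.sin(Xs[:, 0])
m, diff = len(ys), np.sin(Xs[:, 0]) - mu
ll = (-0.5 * diff @ np.linalg.solve(C, diff)
      - 0.5 * np.linalg.slogdet(C)[1]
      - 0.5 * m * np.log(2 * np.pi))
```

Note that solving the linear system with Σ (rather than explicitly inverting it) is the numerically preferred way to apply Σ⁻¹.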
Part 1: Implementing a Gaussian Process (30 P)

Tasks:
Create a class GP_Regressor that implements a Gaussian process regressor and has the following three methods:

def __init__(self,Xtrain,Ytrain,width,noise): Initialize a Gaussian process with noise parameter σ and width parameter w. The variable Xtrain is a two-dimensional array where each row is one data point from the training set. The variable Ytrain is a vector containing the associated targets. The function must also precompute the matrix Σ⁻¹ for subsequent use by the methods predict() and loglikelihood() .
def predict(self,Xtest): For the test set X ⋆ of m points received as parameter, return the mean vector of size m and covariance matrix of size m × m of the corresponding output, that is, return the parameters (μ ⋆ , C ⋆ ) of the Gaussian distribution f ⋆ | f.
def loglikelihood(self,Xtest,Ytest): For a data set X ⋆ of m test points received as first parameter, return the loglikelihood of observing the outputs y ⋆ received as second parameter.
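Before filling in the cell below, it may help to see the overall shape of such a class. The following is only a sketch under stated assumptions (the Gaussian kernel is inlined instead of taken from utils.py, and the noise parameter enters as σ²I as in the equations above); your own implementation may differ:

```python
import numpy as np

def gaussian_kernel(A, B, width):
    # Inlined Gaussian kernel; stand-in for the version provided in utils.py
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

class GP_Regressor:
    def __init__(self, Xtrain, Ytrain, width, noise):
        self.Xtrain, self.Ytrain = Xtrain, Ytrain
        self.width, self.noise = width, noise
        # Precompute Sigma^-1 = (k(X, X) + sigma^2 I)^-1 once at construction
        Sigma = gaussian_kernel(Xtrain, Xtrain, width) + noise**2 * np.eye(len(Xtrain))
        self.Sigma_inv = np.linalg.inv(Sigma)

    def predict(self, Xtest):
        # Posterior parameters (mu*, C*) of f* | f for the m test points
        Ss  = gaussian_kernel(self.Xtrain, Xtest, self.width)
        Sss = (gaussian_kernel(Xtest, Xtest, self.width)
               + self.noise**2 * np.eye(len(Xtest)))
        mean = Ss.T @ self.Sigma_inv @ self.Ytrain
        cov  = Sss - Ss.T @ self.Sigma_inv @ Ss
        return mean, cov

    def loglikelihood(self, Xtest, Ytest):
        # Evaluate the Gaussian posterior f* | f at the observed outputs Ytest
        mean, cov = self.predict(Xtest)
        m, diff = len(Ytest), Ytest - mean
        return (-0.5 * diff @ np.linalg.solve(cov, diff)
                - 0.5 * np.linalg.slogdet(cov)[1]
                - 0.5 * m * np.log(2 * np.pi))

# Small usage example on made-up data
X = np.linspace(-2, 2, 10)[:, None]
gp = GP_Regressor(X, np.sin(X[:, 0]), width=0.5, noise=0.1)
mean, cov = gp.predict(np.array([[0.3]]))
```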
In [1]:
# --------------------------
# TODO: Replace by your code
# --------------------------
import solutions
class GP_Regressor(solutions.GP_Regressor): pass
# --------------------------
Test your implementation by running the code below (it visualizes the mean and variance of the prediction at every location of the input space) and compare the behavior of the Gaussian process for various noise parameters σ and width parameters w.
In [2]:
import utils,datasets,numpy
import matplotlib.pyplot as plt
%matplotlib inline

# Open the toy data
Xtrain,Ytrain,Xtest,Ytest = utils.split(*datasets.toy())

# Create an analysis distribution
Xrange = numpy.arange(-3.5,3.51,0.025)[:,numpy.newaxis]

f = plt.figure(figsize=(18,15))

# Loop over several parameters:
for i,noise in enumerate([2.5,0.5,0.1]):
    for j,width in enumerate([0.1,0.5,2.5]):

        # Create Gaussian process regressor object
        gp = GP_Regressor(Xtrain,Ytrain,width,noise)

        # Compute the predicted mean and variance for test data
        mean,cov = gp.predict(Xrange)
        var = cov.diagonal()

        # Compute the log-likelihood of training and test data
        lltrain = gp.loglikelihood(Xtrain,Ytrain)
        lltest  = gp.loglikelihood(Xtest ,Ytest )

        # Plot the data
        p = f.add_subplot(3,3,3*i+j+1)
        p.set_title('noise=%.1f width=%.1f lltrain=%.1f, lltest=%.1f'%(noise,width,lltrain,lltest))
        p.set_xlabel('x')
        p.set_ylabel('y')
        p.scatter(Xtrain,Ytrain,color='green',marker='x')  # training data
        p.scatter(Xtest,Ytest,color='green',marker='o')    # test data
        p.plot(Xrange,mean,color='blue')                   # GP mean
        p.plot(Xrange,mean+var**.5,color='red')            # GP mean + std
        p.plot(Xrange,mean-var**.5,color='red')            # GP mean - std
        p.set_xlim(-3.5,3.5)
        p.set_ylim(-4,4)
Part 2: Application to the Yacht Hydrodynamics Data Set (20 P)
In the second part, we would like to apply the Gaussian process regressor that you have implemented to a real dataset: the Yacht Hydrodynamics Data Set available on the UCI repository at the webpage http://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics (http://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics). As stated on the web page, the input variables for this regression problem are:
1. Longitudinal position of the center of buoyancy
2. Prismatic coefficient
3. Length-displacement ratio
4. Beam-draught ratio
5. Length-beam ratio
6. Froude number
and we would like to predict from these variables the residuary resistance per unit weight of displacement (last column in the file yacht_hydrodynamics.data ).
Tasks:
Load the data using datasets.yacht() and partition the data between training and test set using the function utils.split() . Standardize the data (center and rescale) so that each dimension of the training data and the labels have mean 0 and standard deviation 1 over the training set.
Train several Gaussian processes on the regression task using various combinations of width and noise parameters.
Draw two contour plots where the training and test log-likelihood are plotted as a function of the noise and width parameters. Choose suitable ranges of parameters so that the best parameter combination for the test set is in the plot. Use the same ranges and contour levels for the training and test plots. The contour levels can be chosen linearly spaced between e.g. 50 and the maximum log-likelihood value.
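For the standardization step, one possible sketch is shown below. The random arrays are only placeholders standing in for the actual output of utils.split(*datasets.yacht()), which is not reproduced here; the point is that the mean and standard deviation must be computed on the training set only, and the same shift and scale then applied to the test set:

```python
import numpy as np

# Placeholder data standing in for utils.split(*datasets.yacht())
rng = np.random.default_rng(0)
Xtrain, Xtest = rng.normal(2, 5, (200, 6)), rng.normal(2, 5, (100, 6))
Ytrain, Ytest = rng.normal(10, 3, 200), rng.normal(10, 3, 100)

# Compute mean/std on the training set only, then apply to both splits
mx, sx = Xtrain.mean(axis=0), Xtrain.std(axis=0)
my, sy = Ytrain.mean(), Ytrain.std()
Xtrain, Xtest = (Xtrain - mx) / sx, (Xtest - mx) / sx
Ytrain, Ytest = (Ytrain - my) / sy, (Ytest - my) / sy
```

Using training statistics for both splits avoids leaking information from the test set into the model.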
In [3]:
# --------------------------
# TODO: Replace by your code
# --------------------------
import solutions
%matplotlib inline
solutions.yacht()
# --------------------------
Noise params: 0.005 0.007 0.008 0.010 0.011 0.013 0.014 0.016 0.017 0.019 0.020 0.022 0.023 0.025 0.026 0.028 0.029 0.031 0.032 0.034 0.035 0.037 0.038 0.040
Width params: 0.050 0.135 0.220 0.304 0.389 0.474 0.559 0.643 0.728 0.813 0.898 0.983 1.067 1.152 1.237 1.322 1.407 1.491 1.576 1.661 1.746 1.830 1.915 2.000