Task 1: Optimization
1a) Numerical Optimization
Derivative of the Rosenbrock function
The function being minimized is \(f(x) = \sum_{i=1}^{N-1}\left[100(x_i - x_{i-1}^2)^2 + (1 - x_{i-1})^2\right]\), so
\(
\frac{\partial f}{\partial x_j} = \sum_{i=1}^{N-1}\left[200(x_i - x_{i-1}^2)(\delta_{i,j} - 2x_{i-1}\delta_{i-1,j}) - 2(1 - x_{i-1})\delta_{i-1,j}\right] \\
= 200(x_j - x_{j-1}^2) - 400x_j(x_{j+1} - x_j^2) - 2(1 - x_j),
\)
but the first and the last components are exceptions:
\(
\frac{\partial f}{\partial x_0} = -400x_0(x_1 - x_0^2) - 2(1 - x_0) \\
\frac{\partial f}{\partial x_{N-1}} = 200(x_{N-1} - x_{N-2}^2)
\)
The complete code:
import numpy as np
import matplotlib.pyplot as plt


# Derivative of the Rosenbrock function
def derivative_rosenbrock(x):
    xm = x[1:-1]
    xm_m1 = x[:-2]
    xm_p1 = x[2:]
    der = np.zeros_like(x)
    der[1:-1] = 200*(xm - xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1 - xm)
    der[0] = -400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0])
    der[-1] = 200*(x[-1] - x[-2]**2)
    return der


def main():
    # set the learning rate first
    # this learning rate is ad hoc and matched to the objective function
    lrate = 0.002
    # initialize a point
    x = np.array([-0.5, 0.2, -0.3, 0.1, -0.4, 0.5, -0.1, 0.1, -0.3, 0.4,
                  -0.1, 0.2, -0.5, 0.4, -0.2, 0.4, -0.3, 0.2, -0.3, 0.4])
    # set the number of steps (can be changed as you want)
    steps = 10000
    # record all tuples of the points and their objective values
    ai = []
    n = 20
    for i in range(steps):
        # evaluate the Rosenbrock function at the current point
        f_val = 0
        for j in range(n - 1):
            f_val += 100 * (x[j+1] - x[j] ** 2) ** 2 + (x[j] - 1) ** 2
        # append the point and its objective value to ai
        ai.append([x, f_val])
        # compute the derivative at the current point
        fi = derivative_rosenbrock(x)
        # take a gradient step to get the new point
        x = x - lrate * fi
    # convert ai into a numpy array (object dtype, since each row holds an array and a scalar)
    ai = np.array(ai, dtype=object)
    # print the last 10 entries of ai
    print(ai[-10:])
    # the minimum value of the function is just the last element of ai
    print(f'the minimum is: {ai[-1, 1]} at point: {ai[-1, 0]}')
    # plot the objective value over the iterations
    plt.plot(np.arange(steps), ai[:, 1])
    plt.xlabel("Iteration")
    plt.ylabel("f(x)")
    plt.title("Argmin")
    plt.legend(["Objective value"])
    plt.show()


if __name__ == '__main__':
    main()
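As a quick sanity check (not part of the original solution), the analytic gradient can be compared against a central finite-difference approximation. This is a minimal sketch; rosenbrock and check_gradient are illustrative helper names, and it reuses derivative_rosenbrock from the code above.
import numpy as np

def rosenbrock(x):
    # f(x) = sum_j 100*(x[j+1] - x[j]^2)^2 + (x[j] - 1)^2
    return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (x[:-1] - 1)**2)

def check_gradient(grad_fn, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    approx = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        approx[j] = (rosenbrock(x + e) - rosenbrock(x - e)) / (2 * eps)
    return np.max(np.abs(approx - grad_fn(x)))

# the maximum discrepancy should be close to zero
x0 = np.random.uniform(-1, 1, size=20)
print(check_gradient(derivative_rosenbrock, x0))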
1b) Gradient Descent Variants
(1)
vanilla:
pros:
For convex optimization problems the algorithm converges to the global optimum; for non-convex problems it converges to a local optimum.
cons:
Every update uses the full data set, and it can get stuck in local minima and at saddle points.
stochastic:
pros:
SGD greatly reduces the cost of a single update, because only one randomly selected sample is used per step.
cons:
The random selection of a single sample adds noise to the iterations, so the function value fluctuates visibly over the iterations and individual updates can veer off in the wrong direction.
mini-batch:
pros:
It keeps the low per-update cost of SGD while averaging the gradient over a small batch, which reduces the noise and gives better convergence behaviour (a sketch of the three update rules follows this list).
cons:
The mini-batch size is an additional hyperparameter that has to be configured.
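A minimal sketch of the three update rules, assuming an illustrative grad(w, X, y) function that returns the gradient of the loss on the given samples (the names here are not from the original solution):
import numpy as np

def vanilla_step(w, X, y, grad, lr=0.01):
    # vanilla gradient descent: gradient over the full data set
    return w - lr * grad(w, X, y)

def sgd_step(w, X, y, grad, lr=0.01):
    # stochastic gradient descent: gradient on a single random sample
    i = np.random.randint(X.shape[0])
    return w - lr * grad(w, X[i:i+1], y[i:i+1])

def minibatch_step(w, X, y, grad, lr=0.01, batch_size=32):
    # mini-batch gradient descent: gradient averaged over a small random batch
    idx = np.random.choice(X.shape[0], size=batch_size, replace=False)
    return w - lr * grad(w, X[idx], y[idx])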
(2)
Momentum is an extension of stochastic gradient descent (SGD) that is usually more efficient than plain SGD. The momentum method, also known as SGD with momentum, accumulates the gradient vectors in the relevant direction and thereby accelerates convergence. It is a very popular optimization technique and is used in many models today.
With stochastic gradient descent we do not compute the exact gradient of the loss function; we estimate it from a small subset of the data, so the individual updates are noisy and do not always point in the best direction. An exponentially weighted moving average of the past gradients gives an estimate that is closer to the true gradient than any single noisy computation, as sketched below. This is one reason why the momentum method can perform better than plain SGD.
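A minimal sketch of the momentum update, assuming the same illustrative grad function as above; beta is the momentum coefficient (commonly around 0.9):
def momentum_step(w, v, X, y, grad, lr=0.01, beta=0.9):
    # v is the exponentially weighted moving average of the past gradients
    v = beta * v + (1 - beta) * grad(w, X, y)
    return w - lr * v, v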
Task 2: Density Estimation
2a) Maximum Likelihood Estimate of the Exponential Distribution
The probability density function of the exponential distribution is defined as:
\(
f(x) =
\begin{cases}
\frac{1}{s}\exp(-\frac{x}{s}) & \text{if } x \geq 0 \\
0 & \text{if } x < 0
\end{cases}
\)
Its likelihood function, writing \(\lambda = \frac{1}{s}\), is:
\(
L(\lambda; x_1, x_2,\cdots, x_n) = \prod_{i=1}^n p(x_i\mid s) = \prod_{i=1}^n \frac{1}{s}\exp\left(-\frac{x_i}{s}\right) = \frac{1}{s^n}\exp\left(-\frac{1}{s}\sum_{i=1}^n x_i\right) = \lambda^n \exp\left(-\lambda\sum_{i=1}^n x_i\right)
\)
To calculate the MLE we need to solve the equation
\(
\frac{d\ln L(\lambda; x_1, x_2, \cdots, x_n)}{d\lambda} = 0 \\
\frac{d\ln L(\lambda; x_1, x_2, \cdots, x_n)}{d\lambda} = \frac{d\ln\left(\lambda^n e^{-\lambda \sum_{i=1}^n x_i}\right)}{d\lambda} \\
= \frac{d\left(n\ln(\lambda) - \lambda\sum_{i=1}^n x_i\right)}{d\lambda} \\
= \frac{n}{\lambda} - \sum_{i=1}^n x_i
\)
Because \(\lambda = \frac{1}{s}\), setting the derivative to zero gives
\(
\frac{n}{\lambda} - \sum_{i=1}^n x_i = sn - \sum_{i=1}^n x_i = 0 \;\Rightarrow\; \hat{s} = \frac{1}{n}\sum_{i=1}^n x_i,
\)
i.e. the maximum likelihood estimate of \(s\) is the sample mean (equivalently \(\hat{\lambda} = \frac{n}{\sum_{i=1}^n x_i}\)).
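A minimal numerical check of this result (not part of the original solution): draw samples from an exponential distribution with a known scale and compare the sample mean against it.
import numpy as np

rng = np.random.default_rng(0)
s_true = 2.5                              # true scale parameter s = 1/lambda
x = rng.exponential(scale=s_true, size=100_000)
s_hat = x.mean()                          # ML estimate: the sample mean
print(s_true, s_hat)                      # s_hat should be close to s_true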
Task 3: Density Estimation
3a) Prior Probabilities
\(p(A) = \frac{\mathrm{count}(A)}{\mathrm{count}(A) + \mathrm{count}(B)}\)
3b) Biased ML Estimate
Define the bias of an estimator:
\(\mathrm{Bias}_\theta[\widehat{\theta}] = E_\theta[\widehat{\theta}] - \theta = E_\theta[\widehat{\theta} - \theta]\)
Biased estimate of the variance of the conditional distribution:
\(s^2 = \frac{\sum_{i=1}^N (x_i-\bar{x})^2}{N}\)
Unbiased estimate of the variance of the conditional distribution:
\(s^2 = \frac{\sum_{i=1}^N (x_i-\bar{x})^2}{N-1}\)
We also need to calculate the sample mean \(\mu\).
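A minimal sketch of these estimates, assuming the labelled training samples of the two classes are available as NumPy arrays (the values below are illustrative, not the actual data):
import numpy as np

x_a = np.array([1.2, 0.8, 1.1, 0.9])   # samples of class A (illustrative)
x_b = np.array([2.1, 2.4, 1.9])        # samples of class B (illustrative)

# prior probabilities from the class counts
p_a = len(x_a) / (len(x_a) + len(x_b))
p_b = 1 - p_a

# class-conditional Gaussian parameters for class A
mu_a = x_a.mean()
var_biased = np.var(x_a)            # ML estimate, divides by N (biased)
var_unbiased = np.var(x_a, ddof=1)  # divides by N - 1 (unbiased)
print(p_a, p_b, mu_a, var_biased, var_unbiased)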
3c) Class Density
3d) Posterior
Task 4: Non-parametric Density Estimation
4a) Histogram
The histogram with bin size 0.02 overfits the data, while the histogram with bin size 2 has bins that are too large and does not represent the data distribution well. The histogram with bin size 0.5 is the best.
4b) Kernel Density Estimate
A Gaussian kernel with \(\sigma\) = 0.03 has a log-likelihood of 2425.84 on the training data;
a Gaussian kernel with \(\sigma\) = 0.2 has a log-likelihood of 2383.64;
a Gaussian kernel with \(\sigma\) = 0.8 has a log-likelihood of 853.49.
So the kernel with \(\sigma\) = 0.03 achieves the highest training log-likelihood.
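For reference, the density estimate behind these numbers is the Gaussian kernel density estimate over the \(N\) training points, as computed by the gaussian_kernel() function in the code below:
\(
p(x) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x - x_i)^2}{2\sigma^2}\right)
\)
The constant factor \(\frac{1}{N}\) shifts every log-likelihood by the same constant, so it does not affect the comparison between bandwidths.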
4c) K-Nearest Neighbors
The K-nearest neighbors method with K = 2 has a log-likelihood of -1256.04;
the K-nearest neighbors method with K = 8 has a log-likelihood of -1127.73;
the K-nearest neighbors method with K = 35 has a log-likelihood of -949.19.
So the K-nearest neighbors method with K = 35 performs the best.
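For reference, the K-nearest-neighbor density estimate behind these numbers, as computed by the knn() function in the code below, is
\(
p(x) = \frac{K}{N \cdot V(x)}, \qquad V(x) = 2\,d_K(x),
\)
where \(d_K(x)\) is the distance from \(x\) to its \(K\)-th nearest training point and \(N\) is the number of training samples.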
4d) Comparison of the Non-Parametric Methods
We need to test the estimates on a different data set because the training log-likelihood always favours the most flexible model (for example a very small \(\sigma\)), which may simply be overfitting the training samples; only a held-out test set shows how well each density estimate generalizes.
\(
\begin{array}{c|cccccc}
 & \sigma = 0.03 & \sigma = 0.2 & \sigma = 0.8 & K = 2 & K = 8 & K = 35 \\
\hline
\text{log-likelihood} & 2425.84 & 2383.64 & 853.49 & -1256.04 & -1127.73 & -949.19 \\
\end{array}
\)
I will use KNN, because it seems more stable than KDE.
import math

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy.spatial.distance import cdist

training_data = pd.read_csv("nonParamTrain.txt", sep=" ")
test_data = pd.read_csv("nonParamTest.txt", sep=" ")
training_data.columns = test_data.columns = ["value"]

x_min = -4
x_max = 8


# 4a) Histogram
def plot_histo():
    histo_size = [0.02, 0.5, 2]
    for i, s in enumerate(histo_size):
        plt.figure(i)
        # number of bins = data range / bin size
        n_bins = math.ceil((training_data.max().value - training_data.min().value) / s)
        training_data.plot.hist(bins=n_bins)
        plt.xlabel("x")
        plt.title("Histogram with bin size {}".format(s))
        plt.xlim(x_min, x_max)


def gaussian_kernel(x, data, sigma):
    # Gaussian KDE at point x: average of the kernels centred on the data points
    numerator = np.sum(np.exp(-(x - data) ** 2 / (2 * sigma ** 2)))
    denominator = np.sqrt(2 * math.pi) * sigma * data.shape[0]
    return numerator / denominator


# 4b) Kernel Density Estimate
def gaussian_KDE():
    sigmas = [0.03, 0.2, 0.8]
    steps = (x_max - x_min) / 500
    x = np.arange(x_min, x_max, steps)
    # x = np.sort(test_data.values, axis=0)
    plt.figure()
    for sigma in sigmas:
        # get log-likelihood (lecture 05, slide 48)
        y = np.empty(training_data.values.shape[0])
        for i, val in enumerate(training_data.values):
            y[i] = gaussian_kernel(val, training_data.values, sigma)
        print("The train log-likelihood for sigma = {} is {}".format(sigma, np.sum(np.log(y))))
        # get plots
        y = np.empty(x.shape)
        for i, val in enumerate(x):
            y[i] = gaussian_kernel(val, training_data.values, sigma)
        print("The test log-likelihood for sigma = {} is {}".format(sigma, np.sum(np.log(y))))
        plt.plot(x, y, label=r"$\sigma=$" + str(sigma))
    plt.ylabel('Density')
    plt.xlabel('x')
    plt.legend()
    plt.show()


# 4c) K-Nearest Neighbors
def knn():
    ks = [2, 8, 35]
    steps = (x_max - x_min) / 300
    x = np.arange(x_min, x_max, steps)
    # calculate pairwise distances between the grid points and the training data
    x_dist = cdist(x.reshape(x.shape[0], 1),
                   training_data.values.reshape(training_data.values.shape[0], 1),
                   metric="euclidean")
    for k in ks:
        y = np.empty(x.shape)
        for i, val in enumerate(x_dist):
            # find the nearest k points and take the largest of these distances as the volume size;
            # this assumes the distance matrix was computed with two different vectors
            # (use k+1 for the training data itself)
            # np.argpartition(val, range(k))[:k] gives the indices of the k smallest distances
            V = val[np.argpartition(val, range(k))[:k]][-1]
            # calculate density
            y[i] = k / (training_data.values.shape[0] * V * 2)
        print("The log-likelihood for k={} is {}".format(k, np.sum(np.log(y))))
        plt.plot(x, y, label="k={}".format(k))
    plt.ylabel('Density')
    plt.xlabel('x')
    plt.legend()
    plt.show()


# plot_histo()
gaussian_KDE()
knn()
plt.show()
Task 5: Expectation Maximization
5a) Gaussian Mixture Update Rules
Input: samples \(x = (x^{(1)}, x^{(2)},\cdots,x^{(m)})\), joint distribution \(p(x, z\mid\theta)\), conditional distribution \(p(z\mid x, \theta)\), number of iterations \(J\).
- Randomly initialize the model parameter \(\theta\) with some value \(\theta^0\).
- For j from 1 to J:
Expectation: calculate the conditional probability expectation of the joint distribution:
\(Q_i(z^{(i)}) := P(z^{(i)}\mid x^{(i)}, \theta)\)
Maximization: maximize \(L(\theta)\) to obtain the new \(\theta\):
\(\theta := \arg\max_\theta \sum_{i=1}^m\sum_{z^{(i)}}Q_i(z^{(i)})\log P(x^{(i)}, z^{(i)} \mid \theta)\)
Repeat the expectation and maximization steps until \(\theta\) converges.
Output: model parameter \(\theta\)
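For the Gaussian mixture model specifically, these two steps take the following form (this is what the e() and m() functions in the code below implement), where \(\alpha_{ik}\) is the responsibility of component \(i\) for data point \(x^{(k)}\):
\(
\alpha_{ik} = \frac{\pi_i\,\mathcal{N}(x^{(k)} \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x^{(k)} \mid \mu_j, \Sigma_j)}
\)
and the M-step updates are
\(
N_i = \sum_{k=1}^{m}\alpha_{ik}, \qquad
\mu_i = \frac{1}{N_i}\sum_{k=1}^{m}\alpha_{ik}\,x^{(k)}, \qquad
\Sigma_i = \frac{1}{N_i}\sum_{k=1}^{m}\alpha_{ik}\,(x^{(k)} - \mu_i)(x^{(k)} - \mu_i)^T, \qquad
\pi_i = \frac{N_i}{m}.
\)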
5b) EM
Mixture plots from iteration 1 to iteration 30:
Log-likelihood change over the iterations:
The full code is below:
import math

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

training_data = pd.read_csv("gmm.txt", sep=" ")
k = 4
n_iter = 30


def main():
    # init: identity covariances, random means and mixture weights
    covar = np.array(
        [np.identity(training_data.shape[1]) for _ in range(k)])
    mu = np.random.uniform(training_data.min().min(), training_data.max().max(),
                           size=(k, training_data.shape[1]))
    pi = np.random.uniform(size=(k,))
    pi = pi / np.sum(pi)  # mixture weights have to sum to 1
    log_likelihood = np.empty((n_iter,))
    for i in range(n_iter):
        alpha = e(x=training_data.values, mu=mu, covar=covar, pi=pi)
        mu, covar, pi = m(x=training_data.values, alpha=alpha)
        # plot at the given steps
        if i + 1 in [1, 3, 5, 10, 30]:
            plt.figure(i)
            visualize(mu, covar, training_data.values, i)
        log_likelihood[i] = calculate_log_likelihood(training_data.values, mu, covar, pi)
        print("Finished iteration {}".format(i))
    plt.figure(n_iter + 1)
    plt.plot(log_likelihood)
    plt.xlabel("Iterations")
    plt.ylabel("Log-Likelihood")
    plt.title("Log-Likelihood Progress")
    plt.show()


def e(x, mu, covar, pi):
    # E-step: responsibilities of each mixture component for each data point
    alpha = np.empty((k, x.shape[0]))
    for i in range(k):
        alpha[i] = pi[i] * multivariate_gaussian(x, mu[i], covar[i])
    # sum over all models per data point
    denominator = np.sum(alpha, axis=0)
    return alpha / denominator


def m(x, alpha):
    # M-step: re-estimate means, covariances and mixture weights
    N = np.sum(alpha, axis=1)  # sum over all data points per model
    mu = np.zeros((k, x.shape[1]))
    covar = np.zeros((k, x.shape[1], x.shape[1]))
    for i in range(k):
        # update mu
        for j, val in enumerate(x):
            mu[i] += alpha[i, j] * val
        mu[i] /= N[i]
        # update covariance
        for j, val in enumerate(x):
            diff = val - mu[i]
            covar[i] += alpha[i, j] * np.outer(diff, diff.T)
        covar[i] /= N[i]
    # update pi
    pi = N / x.shape[0]
    return mu, covar, pi


def multivariate_gaussian(data, mu, covar):
    out = np.empty(data.shape[0])
    # normalizer of a d-dimensional Gaussian: sqrt((2*pi)^d * det(covar))
    denominator = np.sqrt((2 * math.pi) ** data.shape[1] * np.linalg.det(covar))
    covar_inv = np.linalg.inv(covar)
    # compute for each data point
    for i, x in enumerate(data):
        diff = x - mu
        out[i] = np.exp(-0.5 * diff.T.dot(covar_inv).dot(diff)) / denominator
    return out


def visualize(mu, covar, data, iteration):
    steps = 100
    x_data = data[:, 0]
    y_data = data[:, 1]
    x_min = x_data.min()
    x_max = x_data.max()
    y_min = y_data.min()
    y_max = y_data.max()
    x = np.arange(x_min - 1, x_max + 1, (x_max - x_min + 2) / steps)
    y = np.arange(y_min - 1, y_max + 1, (y_max - y_min + 2) / steps)
    Y, X = np.meshgrid(y, x)
    Z = np.empty((steps, steps))
    for i in range(k):
        for j in range(steps):
            # construct vectors with the same x and all possible y to cover the plot space
            points = np.append(X[j], Y[j]).reshape(2, x.shape[0]).T
            Z[j] = multivariate_gaussian(points, mu[i], covar[i])
        plt.contour(X, Y, Z, 1)
    # plot the samples
    plt.plot(x_data, y_data, 'co', zorder=1)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title("Mixtures after {} steps".format(iteration + 1))


def calculate_log_likelihood(x, mu, covar, pi):
    likelihood = np.empty((k, x.shape[0]))
    for i in range(k):
        likelihood[i] = pi[i] * multivariate_gaussian(x, mu[i], covar[i])
    return np.sum(np.log(np.sum(likelihood, axis=0)))


if __name__ == '__main__':
    main()