Task 1: Optimization
1a) Numerical Optimization
Derivative of the Rosenbrock function
The function being minimized is \(f(x) = \sum_{i=1}^{N-1}\left[100(x_i - x_{i-1}^2)^2 + (1 - x_{i-1})^2\right]\), so
\(
\frac{\partial f}{\partial x_j} = \sum_{i=1}^{N-1}\left[200(x_i - x_{i-1}^2)(\delta_{i,j} - 2x_{i-1}\delta_{i-1,j}) - 2(1 - x_{i-1})\delta_{i-1,j}\right] \\
= 200(x_j - x_{j-1}^2) - 400x_j(x_{j+1} - x_j^2) - 2(1 - x_j),
\)
but the first and the last components are exceptions:
\(
\frac{\partial f}{\partial x_0} = -400x_0(x_1 - x_0^2) - 2(1 - x_0) \\
\frac{\partial f}{\partial x_{N-1}} = 200(x_{N-1} - x_{N-2}^2)
\)
The complete code:
import numpy as np
import matplotlib.pyplot as plt


# Derivative of the Rosenbrock function
def derivative_rosenbrock(x):
    xm = x[1:-1]
    xm_m1 = x[:-2]
    xm_p1 = x[2:]
    der = np.zeros_like(x)
    der[1:-1] = 200*(xm - xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1 - xm)
    der[0] = -400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0])
    der[-1] = 200*(x[-1] - x[-2]**2)
    return der


def main():
    # set the learning rate first
    # this learning rate is ad hoc and matched to the objective function
    lrate = 0.002
    # initialize a point
    x = np.array([-0.5, 0.2, -0.3, 0.1, -0.4, 0.5, -0.1, 0.1, -0.3, 0.4,
                  -0.1, 0.2, -0.5, 0.4, -0.2, 0.4, -0.3, 0.2, -0.3, 0.4])
    # set the number of steps (can be changed as you want)
    steps = 10000
    # record all tuples of the points and their objective values
    ai = []
    n = 20
    for i in range(steps):
        # evaluate the Rosenbrock function at the current point
        f_val = 0
        for j in range(n - 1):
            f_val += 100 * (x[j+1] - x[j] ** 2) ** 2 + (x[j] - 1) ** 2
        # append the point and its objective value to ai
        ai.append([x, f_val])
        # compute the derivative at the current point
        fi = derivative_rosenbrock(x)
        # take a gradient step to get the new point
        x = x - lrate * fi
    # convert ai into a numpy array (object dtype, since each row holds an array and a scalar)
    ai = np.array(ai, dtype=object)
    # print the last 10 entries of ai
    print(ai[-10:])
    # the minimum value of the function is just the last element of ai
    print(f'the minimum is: {ai[-1, 1]} at point: {ai[-1, 0]}')
    # plot the objective value over the iterations
    plt.plot(np.arange(steps), ai[:, 1])
    plt.xlabel("Iteration")
    plt.ylabel("f(x)")
    plt.title("Argmin")
    plt.legend(["Objective value"])
    plt.show()


if __name__ == '__main__':
    main()
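As a quick sanity check (not part of the original solution), the analytic gradient can be compared against a central finite-difference approximation. This is a minimal sketch; rosenbrock and check_gradient are illustrative helper names, and it reuses derivative_rosenbrock from the code above.
import numpy as np

def rosenbrock(x):
    # f(x) = sum_j 100*(x[j+1] - x[j]^2)^2 + (x[j] - 1)^2
    return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (x[:-1] - 1)**2)

def check_gradient(grad_fn, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    approx = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        approx[j] = (rosenbrock(x + e) - rosenbrock(x - e)) / (2 * eps)
    return np.max(np.abs(approx - grad_fn(x)))

# the maximum discrepancy should be close to zero
x0 = np.random.uniform(-1, 1, size=20)
print(check_gradient(derivative_rosenbrock, x0))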
1b) Gradient Descent Variants
(1)
vanilla:
pros:
For convex optimization problems the algorithm converges to the global optimum; for non-convex problems it converges to a local optimum.
cons:
Every update uses the full data set, and it can get stuck in local minima and at saddle points.
stochastic:
pros:
SGD greatly reduces the cost of a single update, because only one randomly selected sample is used per step.
cons:
The random selection of a single sample adds noise to the iterations, so the function value fluctuates visibly over the iterations and individual updates can veer off in the wrong direction.
mini-batch:
pros:
It keeps the low per-update cost of SGD while averaging the gradient over a small batch, which reduces the noise and gives better convergence behaviour (a sketch of the three update rules follows this list).
cons:
The mini-batch size is an additional hyperparameter that has to be configured.
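A minimal sketch of the three update rules, assuming an illustrative grad(w, X, y) function that returns the gradient of the loss on the given samples (the names here are not from the original solution):
import numpy as np

def vanilla_step(w, X, y, grad, lr=0.01):
    # vanilla gradient descent: gradient over the full data set
    return w - lr * grad(w, X, y)

def sgd_step(w, X, y, grad, lr=0.01):
    # stochastic gradient descent: gradient on a single random sample
    i = np.random.randint(X.shape[0])
    return w - lr * grad(w, X[i:i+1], y[i:i+1])

def minibatch_step(w, X, y, grad, lr=0.01, batch_size=32):
    # mini-batch gradient descent: gradient averaged over a small random batch
    idx = np.random.choice(X.shape[0], size=batch_size, replace=False)
    return w - lr * grad(w, X[idx], y[idx])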
(2)
Momentum is an extension of stochastic gradient descent (SGD) that is usually more efficient than plain SGD. The momentum method, also known as SGD with momentum, accumulates the gradient vectors in the relevant direction and thereby accelerates convergence. It is a very popular optimization technique and is used in many models today.
With stochastic gradient descent we do not compute the exact gradient of the loss function; we estimate it from a small subset of the data, so the individual updates are noisy and do not always point in the best direction. An exponentially weighted moving average of the past gradients gives an estimate that is closer to the true gradient than any single noisy computation, as sketched below. This is one reason why the momentum method can perform better than plain SGD.
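A minimal sketch of the momentum update, assuming the same illustrative grad function as above; beta is the momentum coefficient (commonly around 0.9):
def momentum_step(w, v, X, y, grad, lr=0.01, beta=0.9):
    # v is the exponentially weighted moving average of the past gradients
    v = beta * v + (1 - beta) * grad(w, X, y)
    return w - lr * v, v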
Task 2: Density Estimation
2a) Maximum Likelihood Estimate of the Exponential Distribution
The probability density function of the exponential distribution is defined as:
\(
f(x) =
\begin{cases}
\frac{1}{s}\exp(-\frac{x}{s}) & \text{if } x \geq 0 \\
0 & \text{if } x < 0
\end{cases}
\)
Its likelihood function, writing \(\lambda = \frac{1}{s}\), is:
\(
L(\lambda; x_1, x_2,\cdots, x_n) = \prod_{i=1}^n p(x_i\mid s) = \prod_{i=1}^n \frac{1}{s}\exp\left(-\frac{x_i}{s}\right) = \frac{1}{s^n}\exp\left(-\frac{1}{s}\sum_{i=1}^n x_i\right) = \lambda^n \exp\left(-\lambda\sum_{i=1}^n x_i\right)
\)
To calculate the MLE we need to solve the equation
\(
\frac{d\ln L(\lambda; x_1, x_2, \cdots, x_n)}{d\lambda} = 0 \\
\frac{d\ln L(\lambda; x_1, x_2, \cdots, x_n)}{d\lambda} = \frac{d\ln\left(\lambda^n e^{-\lambda \sum_{i=1}^n x_i}\right)}{d\lambda} \\
= \frac{d\left(n\ln(\lambda) - \lambda\sum_{i=1}^n x_i\right)}{d\lambda} \\
= \frac{n}{\lambda} - \sum_{i=1}^n x_i
\)
Because \(\lambda = \frac{1}{s}\), setting the derivative to zero gives
\(
\frac{n}{\lambda} - \sum_{i=1}^n x_i = sn - \sum_{i=1}^n x_i = 0 \;\Rightarrow\; \hat{s} = \frac{1}{n}\sum_{i=1}^n x_i,
\)
i.e. the maximum likelihood estimate of \(s\) is the sample mean (equivalently \(\hat{\lambda} = \frac{n}{\sum_{i=1}^n x_i}\)).
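A minimal numerical check of this result (not part of the original solution): draw samples from an exponential distribution with a known scale and compare the sample mean against it.
import numpy as np

rng = np.random.default_rng(0)
s_true = 2.5                              # true scale parameter s = 1/lambda
x = rng.exponential(scale=s_true, size=100_000)
s_hat = x.mean()                          # ML estimate: the sample mean
print(s_true, s_hat)                      # s_hat should be close to s_true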
Task 3: Density Estimation
3a) Prior Probabilities
\(p(A) = \frac{\mathrm{count}(A)}{\mathrm{count}(A) + \mathrm{count}(B)}\)
3b) Biased ML Estimate
Define the bias of an estimator:
\(\mathrm{Bias}_\theta[\widehat{\theta}] = E_\theta[\widehat{\theta}] - \theta = E_\theta[\widehat{\theta} - \theta]\)
Biased estimate of the variance of the conditional distribution:
\(s^2 = \frac{\sum_{i=1}^N (x_i-\bar{x})^2}{N}\)
Unbiased estimate of the variance of the conditional distribution:
\(s^2 = \frac{\sum_{i=1}^N (x_i-\bar{x})^2}{N-1}\)
We also need to calculate the sample mean \(\mu\).
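A minimal sketch of these estimates, assuming the labelled training samples of the two classes are available as NumPy arrays (the values below are illustrative, not the actual data):
import numpy as np

x_a = np.array([1.2, 0.8, 1.1, 0.9])   # samples of class A (illustrative)
x_b = np.array([2.1, 2.4, 1.9])        # samples of class B (illustrative)

# prior probabilities from the class counts
p_a = len(x_a) / (len(x_a) + len(x_b))
p_b = 1 - p_a

# class-conditional Gaussian parameters for class A
mu_a = x_a.mean()
var_biased = np.var(x_a)            # ML estimate, divides by N (biased)
var_unbiased = np.var(x_a, ddof=1)  # divides by N - 1 (unbiased)
print(p_a, p_b, mu_a, var_biased, var_unbiased)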
3c) Class Density
3d) Posterior
Task 4: Non-parametric Density Estimation
4a) Histogram
The histogram with bin size 0.02 overfits the data, while the histogram with bin size 2 has bins that are too large and does not represent the data distribution well. The histogram with bin size 0.5 is the best.
4b) Kernel Density Estimate
A Gaussian kernel with \(\sigma\) = 0.03 has a log-likelihood of 2425.84 on the training data;
a Gaussian kernel with \(\sigma\) = 0.2 has a log-likelihood of 2383.64;
a Gaussian kernel with \(\sigma\) = 0.8 has a log-likelihood of 853.49.
So the kernel with \(\sigma\) = 0.03 achieves the highest training log-likelihood.
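For reference, the density estimate behind these numbers is the Gaussian kernel density estimate over the \(N\) training points, as computed by the gaussian_kernel() function in the code below:
\(
p(x) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x - x_i)^2}{2\sigma^2}\right)
\)
The constant factor \(\frac{1}{N}\) shifts every log-likelihood by the same constant, so it does not affect the comparison between bandwidths.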
4c) K-Nearest Neighbors
The K-nearest neighbors method with K = 2 has a log-likelihood of -1256.04;
the K-nearest neighbors method with K = 8 has a log-likelihood of -1127.73;
the K-nearest neighbors method with K = 35 has a log-likelihood of -949.19.
So the K-nearest neighbors method with K = 35 performs the best.
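For reference, the K-nearest-neighbor density estimate behind these numbers, as computed by the knn() function in the code below, is
\(
p(x) = \frac{K}{N \cdot V(x)}, \qquad V(x) = 2\,d_K(x),
\)
where \(d_K(x)\) is the distance from \(x\) to its \(K\)-th nearest training point and \(N\) is the number of training samples.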
4d) Comparison of the Non-Parametric Methods
We need to test the estimates on a different data set because the training log-likelihood always favours the most flexible model (for example a very small \(\sigma\)), which may simply be overfitting the training samples; only a held-out test set shows how well each density estimate generalizes.
\(
\begin{array}{c|cccccc}
 & \sigma = 0.03 & \sigma = 0.2 & \sigma = 0.8 & K = 2 & K = 8 & K = 35 \\
\hline
\text{log-likelihood} & 2425.84 & 2383.64 & 853.49 & -1256.04 & -1127.73 & -949.19 \\
\end{array}
\)
I will use KNN, because it seems more stable than KDE.
import math

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy.spatial.distance import cdist

training_data = pd.read_csv("nonParamTrain.txt", sep=" ")
test_data = pd.read_csv("nonParamTest.txt", sep=" ")
training_data.columns = test_data.columns = ["value"]

x_min = -4
x_max = 8


# 4a) Histogram
def plot_histo():
    histo_size = [0.02, 0.5, 2]
    for i, s in enumerate(histo_size):
        plt.figure(i)
        # number of bins = data range / bin size
        n_bins = math.ceil((training_data.max().value - training_data.min().value) / s)
        training_data.plot.hist(bins=n_bins)
        plt.xlabel("x")
        plt.title("Histogram with bin size {}".format(s))
        plt.xlim(x_min, x_max)


def gaussian_kernel(x, data, sigma):
    # Gaussian KDE at point x: average of the kernels centred on the data points
    numerator = np.sum(np.exp(-(x - data) ** 2 / (2 * sigma ** 2)))
    denominator = np.sqrt(2 * math.pi) * sigma * data.shape[0]
    return numerator / denominator


# 4b) Kernel Density Estimate
def gaussian_KDE():
    sigmas = [0.03, 0.2, 0.8]
    steps = (x_max - x_min) / 500
    x = np.arange(x_min, x_max, steps)
    # x = np.sort(test_data.values, axis=0)
    plt.figure()
    for sigma in sigmas:
        # get log-likelihood (lecture 05, slide 48)
        y = np.empty(training_data.values.shape[0])
        for i, val in enumerate(training_data.values):
            y[i] = gaussian_kernel(val, training_data.values, sigma)
        print("The train log-likelihood for sigma = {} is {}".format(sigma, np.sum(np.log(y))))
        # get plots
        y = np.empty(x.shape)
        for i, val in enumerate(x):
            y[i] = gaussian_kernel(val, training_data.values, sigma)
        print("The test log-likelihood for sigma = {} is {}".format(sigma, np.sum(np.log(y))))
        plt.plot(x, y, label=r"$\sigma=$" + str(sigma))
    plt.ylabel('Density')
    plt.xlabel('x')
    plt.legend()
    plt.show()


# 4c) K-Nearest Neighbors
def knn():
    ks = [2, 8, 35]
    steps = (x_max - x_min) / 300
    x = np.arange(x_min, x_max, steps)
    # calculate pairwise distances between the grid points and the training data
    x_dist = cdist(x.reshape(x.shape[0], 1),
                   training_data.values.reshape(training_data.values.shape[0], 1),
                   metric="euclidean")
    for k in ks:
        y = np.empty(x.shape)
        for i, val in enumerate(x_dist):
            # find the nearest k points and take the largest of these distances as the volume size;
            # this assumes the distance matrix was computed with two different vectors
            # (use k+1 for the training data itself)
            # np.argpartition(val, range(k))[:k] gives the indices of the k smallest distances
            V = val[np.argpartition(val, range(k))[:k]][-1]
            # calculate density
            y[i] = k / (training_data.values.shape[0] * V * 2)
        print("The log-likelihood for k={} is {}".format(k, np.sum(np.log(y))))
        plt.plot(x, y, label="k={}".format(k))
    plt.ylabel('Density')
    plt.xlabel('x')
    plt.legend()
    plt.show()


# plot_histo()
gaussian_KDE()
knn()
plt.show()
Task 5: Expectation Maximization
5a) Gaussian Mixture Update Rules
Input: samples \(x = (x^{(1)}, x^{(2)},\cdots,x^{(m)})\), joint distribution \(p(x, z\mid\theta)\), conditional distribution \(p(z\mid x, \theta)\), number of iterations \(J\).
- Randomly initialize the model parameter \(\theta\) with some value \(\theta^0\).
- For j from 1 to J:
Expectation: calculate the conditional probability expectation of the joint distribution:
\(Q_i(z^{(i)}) := P(z^{(i)}\mid x^{(i)}, \theta)\)
Maximization: maximize \(L(\theta)\) to obtain the new \(\theta\):
\(\theta := \arg\max_\theta \sum_{i=1}^m\sum_{z^{(i)}}Q_i(z^{(i)})\log P(x^{(i)}, z^{(i)} \mid \theta)\)
Repeat the expectation and maximization steps until \(\theta\) converges.
Output: model parameter \(\theta\)
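For the Gaussian mixture model specifically, these two steps take the following form (this is what the e() and m() functions in the code below implement), where \(\alpha_{ik}\) is the responsibility of component \(i\) for data point \(x^{(k)}\):
\(
\alpha_{ik} = \frac{\pi_i\,\mathcal{N}(x^{(k)} \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x^{(k)} \mid \mu_j, \Sigma_j)}
\)
and the M-step updates are
\(
N_i = \sum_{k=1}^{m}\alpha_{ik}, \qquad
\mu_i = \frac{1}{N_i}\sum_{k=1}^{m}\alpha_{ik}\,x^{(k)}, \qquad
\Sigma_i = \frac{1}{N_i}\sum_{k=1}^{m}\alpha_{ik}\,(x^{(k)} - \mu_i)(x^{(k)} - \mu_i)^T, \qquad
\pi_i = \frac{N_i}{m}.
\)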
5b) EM
Mixture plots from iteration 1 to iteration 30:
Log-likelihood change over the iterations:
The full code is below:
import math

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

training_data = pd.read_csv("gmm.txt", sep=" ")
k = 4
n_iter = 30


def main():
    # init: identity covariances, random means and mixture weights
    covar = np.array(
        [np.identity(training_data.shape[1]) for _ in range(k)])
    mu = np.random.uniform(training_data.min().min(), training_data.max().max(),
                           size=(k, training_data.shape[1]))
    pi = np.random.uniform(size=(k,))
    pi = pi / np.sum(pi)  # mixture weights have to sum to 1
    log_likelihood = np.empty((n_iter,))
    for i in range(n_iter):
        alpha = e(x=training_data.values, mu=mu, covar=covar, pi=pi)
        mu, covar, pi = m(x=training_data.values, alpha=alpha)
        # plot at the given steps
        if i + 1 in [1, 3, 5, 10, 30]:
            plt.figure(i)
            visualize(mu, covar, training_data.values, i)
        log_likelihood[i] = calculate_log_likelihood(training_data.values, mu, covar, pi)
        print("Finished iteration {}".format(i))
    plt.figure(n_iter + 1)
    plt.plot(log_likelihood)
    plt.xlabel("Iterations")
    plt.ylabel("Log-Likelihood")
    plt.title("Log-Likelihood Progress")
    plt.show()


def e(x, mu, covar, pi):
    # E-step: responsibilities of each mixture component for each data point
    alpha = np.empty((k, x.shape[0]))
    for i in range(k):
        alpha[i] = pi[i] * multivariate_gaussian(x, mu[i], covar[i])
    # sum over all models per data point
    denominator = np.sum(alpha, axis=0)
    return alpha / denominator


def m(x, alpha):
    # M-step: re-estimate means, covariances and mixture weights
    N = np.sum(alpha, axis=1)  # sum over all data points per model
    mu = np.zeros((k, x.shape[1]))
    covar = np.zeros((k, x.shape[1], x.shape[1]))
    for i in range(k):
        # update mu
        for j, val in enumerate(x):
            mu[i] += alpha[i, j] * val
        mu[i] /= N[i]
        # update covariance
        for j, val in enumerate(x):
            diff = val - mu[i]
            covar[i] += alpha[i, j] * np.outer(diff, diff.T)
        covar[i] /= N[i]
    # update pi
    pi = N / x.shape[0]
    return mu, covar, pi


def multivariate_gaussian(data, mu, covar):
    out = np.empty(data.shape[0])
    # normalizer of a d-dimensional Gaussian: sqrt((2*pi)^d * det(covar))
    denominator = np.sqrt((2 * math.pi) ** data.shape[1] * np.linalg.det(covar))
    covar_inv = np.linalg.inv(covar)
    # compute for each data point
    for i, x in enumerate(data):
        diff = x - mu
        out[i] = np.exp(-0.5 * diff.T.dot(covar_inv).dot(diff)) / denominator
    return out


def visualize(mu, covar, data, iteration):
    steps = 100
    x_data = data[:, 0]
    y_data = data[:, 1]
    x_min = x_data.min()
    x_max = x_data.max()
    y_min = y_data.min()
    y_max = y_data.max()
    x = np.arange(x_min - 1, x_max + 1, (x_max - x_min + 2) / steps)
    y = np.arange(y_min - 1, y_max + 1, (y_max - y_min + 2) / steps)
    Y, X = np.meshgrid(y, x)
    Z = np.empty((steps, steps))
    for i in range(k):
        for j in range(steps):
            # construct vectors with the same x and all possible y to cover the plot space
            points = np.append(X[j], Y[j]).reshape(2, x.shape[0]).T
            Z[j] = multivariate_gaussian(points, mu[i], covar[i])
        plt.contour(X, Y, Z, 1)
    # plot the samples
    plt.plot(x_data, y_data, 'co', zorder=1)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title("Mixtures after {} steps".format(iteration + 1))


def calculate_log_likelihood(x, mu, covar, pi):
    likelihood = np.empty((k, x.shape[0]))
    for i in range(k):
        likelihood[i] = pi[i] * multivariate_gaussian(x, mu[i], covar[i])
    return np.sum(np.log(np.sum(likelihood, axis=0)))


if __name__ == '__main__':
    main()