How Machine Learning Works


Introduction

Machine learning has long been marketed to the world as a form of artificial intelligence, but for a long time I simply could not grasp what the concept actually was. I read a few textbooks, but all I could take away was that it seemed to be regression analysis with complicated functional forms, and I had always assumed that the AI-related services offered in the world were mostly the product of enormous amounts of manual labor, such as collecting huge volumes of data and attaching attribute data to it (labels indicating the characteristics of images, text, audio, and so on). Well, there are some who would say, “Hey, that’s pretty much right”… In any case, although it may seem a little late, I have only recently come to understand how machine learning differs from other data analysis methods such as regression analysis and how it actually works, so the purpose of this article is to give a brief explanation.

Machine Learning and Regression Analysis

When a certain relationship can be found between two data sets \(\small x,y\), even in the presence of uncertainty (noise), a problem frequently arises in which you want to estimate that relationship as a function \(\small \hat{y}=E[y\:|\:x]=f(x)\) from samples of observed data. Consider the problem of predicting the value of the observed data (the explained variable) \(\small y\) when one piece of data (called the explanatory variable) \(\small x\) is given exogenously. From the obtained observations \(\small (x_i,y_i),\:i=1,\cdots, N\), we can determine the function \(\small \hat{y}=f(x)\) by estimating the parameters that determine the shape of \(\small f(x)\) so that the prediction error is minimized. The squared error:

\[ \small \Xi=\sum_{i=1}^N \left(y_i-f(x_i) \right)^2 \]

is often used as a measure of the estimation error, and this type of function estimation method is called the least squares method. Estimating the relationship between data as a function from such observed data is generally called regression analysis. For example, if we assume that the function \(\small f(x)\) is a linear function:

\[ \small f(x) = \beta_0+\beta_1x, \]

it is called linear regression analysis. The estimated function \(\small \hat{y}=f(x)\) is often called a model that explains the observed data. This can be extended to cases where there are multiple explanatory variables, and performing regression analysis by defining a function like

\[ \small \hat{y}=E[y\:|\:x_1,\cdots,x_n]=f(x_1,\cdots,x_n) = \beta_0+\beta_1x_1+\cdots+\beta_nx_n \]

is called multiple regression analysis. The least squares method for multiple regression analysis has favorable properties as an optimization problem (called a convex optimization problem), and parameters can be estimated by simple matrix calculations.
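
 As a minimal sketch of this point (the data and coefficient values below are made up for illustration), the least-squares parameters of a multiple regression model can be obtained with a single matrix computation:

```python
import numpy as np

# Synthetic observed data: y = 1.0 + 2.0*x1 - 0.5*x2 + noise (illustrative values)
rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))                     # explanatory variables x1, x2
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=N)

# Design matrix with a column of ones for the intercept beta_0
A = np.column_stack([np.ones(N), X])

# Least-squares estimate: beta = (A'A)^{-1} A'y, computed here via lstsq
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # roughly [1.0, 2.0, -0.5]
```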

 The neural networks used in machine learning extend this idea. They use multiple (multiplexed) functions, called neurons, each represented as:

\[ \small h_k=f_k(x) = \sigma(\beta_{k,0}+\beta_{k,1}x_1+\cdots+\beta_{k,n}x_n) \]

that is, by applying a nonlinear function called an activation function to a linear function of the inputs. These neurons are then stacked into further layers, as in

\[ \small z_l = g_l(h) = \sigma(\gamma_{l,0}+\gamma_{l,1}h_1+\cdots+\gamma_{l,n}h_n), \]

and the final output is produced as

\[ \small y = \xi(z)=\varphi(\delta_0+\delta_1 z_1+\cdots+\delta_n z_n). \]

The function \(\small \varphi\) is called a regularization function and converts the output into the format of the observed data. The input is simply \(\small x_1,\cdots,x_n\) and the output is \(\small y\), so even though the intermediate calculation is somewhat involved, it is easy to see that the whole construction is just a representation of a function \(\small y=f(x_1,\cdots,x_n)\). In machine learning, the parameters \(\small \beta_{k,0},\cdots,\beta_{k,n}\), \(\small \gamma_{l,0},\cdots,\gamma_{l,n}\), \(\small \delta_0,\cdots,\delta_n\) of this function are estimated so that the squared error is minimized. The number of neurons per layer can be increased arbitrarily, and so can the number of layers. If the number of neurons per layer is \(\small m\) and the number of layers is \(\small p\), a neural network is a model that represents the function \(\small y=f(x_1,\cdots,x_n)\) using roughly \(\small n\times m\times p\) parameters.
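
 To make this composition concrete, here is a rough sketch of the forward calculation in Python (the layer sizes, parameter names, and the identity output function standing in for \(\small \varphi\) are my own illustrative assumptions, not a particular library's API):

```python
import numpy as np

def sigmoid(u):
    """Activation function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, B, C, D):
    """Two layers of neurons plus an output layer: y = f(x_1, ..., x_n)."""
    h = sigmoid(B[0] @ x + B[1])   # h_k = sigma(beta_k0 + sum_i beta_ki x_i)
    z = sigmoid(C[0] @ h + C[1])   # z_l = sigma(gamma_l0 + sum_k gamma_lk h_k)
    return D[0] @ z + D[1]         # y = phi(delta_0 + sum_l delta_l z_l), phi = identity here

# Illustrative sizes: n = 3 inputs, m = 4 neurons per layer
rng = np.random.default_rng(0)
n, m = 3, 4
B = (rng.normal(size=(m, n)), rng.normal(size=m))   # (weights, biases) of layer 1
C = (rng.normal(size=(m, m)), rng.normal(size=m))   # (weights, biases) of layer 2
D = (rng.normal(size=(1, m)), rng.normal(size=1))   # (weights, bias) of output layer
print(forward(rng.normal(size=n), B, C, D))
```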

 The difference between machine learning and linear regression analysis is whether the shape of the function \(\small y=f(x_1,\cdots,x_n)\) is linear or nonlinear with respect to the explanatory variables. The activation function \(\small \sigma(x)\) is generally a nonlinear function, and the sigmoid function:

\[ \small \sigma(x) = \frac{1}{1+e^{-x}} \]

is often used. In addition, a function like

\[ \small \sigma(x) = \max\{x, 0 \} \]

called the ReLU (Rectified Linear Unit) also seems to be commonly used. I described machine learning above as an extension of regression analysis, but if the activation function were the linear function \(\small \sigma(x)=x\), it is easy to see that multi-layering and multiplexing would be pointless, and the whole procedure would amount to nothing more than a roundabout way of performing linear regression analysis. Thought of in this way, machine learning can be interpreted as a special case of nonlinear regression analysis.
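
 A quick numerical check of this last point (the weights below are arbitrary made-up numbers): with the identity activation, stacking layers only multiplies matrices together, so the whole network collapses into a single linear function of the inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
W1, b1 = rng.normal(size=(m, n)), rng.normal(size=m)   # first "layer"
W2, b2 = rng.normal(size=(1, m)), rng.normal(size=1)   # second "layer"
x = rng.normal(size=n)

# "Deep" network with identity activation ...
y_deep = W2 @ (W1 @ x + b1) + b2

# ... equals one linear function with collapsed coefficients
W, b = W2 @ W1, W2 @ b1 + b2
y_linear = W @ x + b

print(np.allclose(y_deep, y_linear))  # True
```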

Linearization of Nonlinear Regression Analysis and the Curse of Dimensionality

Since machine learning is equivalent to regression analysis when \(\small y=f(x_1,\cdots,x_n)\) is a nonlinear function, you might think it would be better to estimate the model as a nonlinear regression from the beginning. However, general nonlinear regression analysis has some undesirable properties, and in practice many problems are instead reduced to linear regression analysis. Let me explain briefly.

 It should be noted that in general, the optimization problem for minimizing the squared error:

\[ \small \Xi=\sum_{i=1}^N \left(y_i-f(x_i) \right)^2 \]

is not necessarily a convex optimization problem. As a specific example, when performing nonlinear regression analysis with

\[ \small f(x) = \frac{1}{1+e^{-(\beta_0+\beta_1x)}}, \]

which is called logistic regression analysis, it is known that the squared-error function is not convex with respect to \(\small \beta_0,\beta_1\). For this reason, instead of minimizing the squared error, logistic regression analysis finds the regression coefficients as the parameters that maximize the probability of generating the observed data, known as the likelihood function.
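
 As a minimal sketch of this maximum-likelihood approach (the data-generating coefficients below are made up), the regression coefficients of a logistic model can be found by gradient ascent on the log-likelihood rather than by minimizing the squared error:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Synthetic binary observations generated from beta = (-1.0, 2.0)
rng = np.random.default_rng(0)
N = 1000
x = rng.normal(size=N)
y = rng.binomial(1, sigmoid(-1.0 + 2.0 * x))

# Maximize the log-likelihood sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
# by plain gradient ascent on (beta_0, beta_1).
A = np.column_stack([np.ones(N), x])
beta = np.zeros(2)
for _ in range(5000):
    p = sigmoid(A @ beta)
    beta += 0.001 * (A.T @ (y - p))   # gradient of the log-likelihood
print(beta)  # roughly [-1.0, 2.0]
```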

 The problem with a non-convex objective function is that general non-convex optimization problems are NP-hard, meaning that no efficient algorithm is known for solving them. Essentially, the only way to guarantee a global solution is to exhaust all candidate solutions, so the computational cost grows exponentially with the number of variables. Consequently, regression analysis with nonlinear functions involves a wide variety of methodologies for estimating parameters efficiently in particular settings; unlike linear regression analysis, there is no general algorithm that finds a solution efficiently in every case.

 Thinking along these lines, one might conclude that parameter estimation in nonlinear regression analysis is inherently difficult and that in most cases it cannot be applied in practice. In reality, however, nonlinear regression analysis can often be reduced to linear regression analysis. In analysis there is a result known as the Stone–Weierstrass theorem, which states that any continuous function can be approximated arbitrarily closely by a polynomial (uniformly on a closed interval). This means that any continuous function \(\small y=f(x)\) can be approximated by

\[ \small y=f(x) = \beta_0+\beta_1x+\beta_2x^2+\cdots. \]

Since the right-hand side is a linear combination of the data \(\small 1,x,x^2,\cdots\), the parameters \(\small \beta_0,\beta_1,\beta_2,\cdots\) of this function can be estimated by linear regression analysis given the observed data \(\small y_i,x_i\). In other words, virtually any functional form can be modeled with linear regression analysis, without resorting to regression on nonlinear functions. In particular, when there are few explanatory variables, it is often easier to increase the number of explanatory variables and run a linear regression than to carry out a nonlinear regression.
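
 A small sketch of this linearization (the target function and the polynomial degree are arbitrary choices of mine): a nonlinear relationship is fitted by ordinary linear least squares once the powers of \(\small x\) are treated as additional explanatory variables.

```python
import numpy as np

# Approximate y = sin(x) + noise with a degree-5 polynomial estimated by
# *linear* regression on the transformed data 1, x, x^2, ..., x^5.
rng = np.random.default_rng(0)
N, degree = 300, 5
x = rng.uniform(-2.0, 2.0, size=N)
y = np.sin(x) + 0.05 * rng.normal(size=N)

A = np.column_stack([x ** k for k in range(degree + 1)])   # each power is one regressor
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

x_test = np.linspace(-2.0, 2.0, 5)
print(np.sin(x_test))                      # true values
print(np.polyval(beta[::-1], x_test))      # fitted polynomial (highest power first)
```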

 However, this approach also has limitations, because the number of terms that must be prepared for the linear regression explodes as the number of explanatory variables and the polynomial order grow. Extending the Stone–Weierstrass theorem to multiple variables gives us

\[ \small y=f(x_1,\cdots,x_n) = a_0+\sum_{i=1}^na_1^ix_i+\sum_{i=1}^n \sum_{j=1}^n a_2^{ij}x_ix_j+\cdots, \]

but it can be seen that the number of terms in the linear regression grows exponentially with the order taken into account. For example, with \(\small n=100\) explanatory variables and terms up to the third order, the count becomes \(\small 100+10000+1000000\), so the linear regression would have more than one million terms. Estimating the parameters of such a function would be difficult, and it would require preparing far more data than this number of terms. In this way, converting a nonlinear regression into a linear regression is an approach that suffers from the curse of dimensionality. The following section explains how machine learning using neural networks avoids this curse of dimensionality.
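
 The arithmetic behind this comparison can be checked directly (\(\small n=100\) and \(\small p=3\) follow the example above; the \(\small m=100\) neurons per layer for the neural-network side is an assumption of mine):

```python
# Rough term counts for the two approaches.
n, p, m = 100, 3, 100
poly_terms = sum(n ** k for k in range(1, p + 1))   # n + n^2 + ... + n^p
nn_params = n * m * p                               # roughly n * m * p
print(poly_terms)  # 1010100 -> explodes with the order p
print(nn_params)   # 30000   -> grows only linearly in n, m and p
```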

Multilayering and Nonlinear Functions

Consider approximating the sigmoid function:

\[ \small f(x_1,\cdots,x_n)=\sigma\left(\beta_0+\sum_{i=1}^n\beta_ix_i\right) = \frac{1}{1+\exp\left(-\left(\beta_0+\sum_{i=1}^n\beta_ix_i\right) \right)} \]

with a polynomial by performing a Taylor expansion. Calculating the first and second derivatives with respect to \(\small x\) gives

\[ \small \begin{align*} &\frac{\partial f(x_1,\cdots,x_n)}{\partial x_i} =\beta_i\frac{\exp\left(\beta_0+\sum_{k=1}^n\beta_kx_k \right)}{\left(1+\exp\left(\beta_0+\sum_{k=1}^n\beta_kx_k \right)\right)^2} \\ &\frac{\partial^2 f(x_1,\cdots,x_n)}{\partial x_i\partial x_j} =\beta_i\beta_j\left(\frac{\exp\left(\beta_0+\sum_{k=1}^n\beta_kx_k \right)}{\left(1+\exp\left(\beta_0+\sum_{k=1}^n\beta_kx_k \right)\right)^2}-2\frac{\exp\left(2\left(\beta_0+\sum_{k=1}^n\beta_kx_k \right)\right)}{\left(1+\exp\left(\beta_0+\sum_{k=1}^n\beta_kx_k \right)\right)^3} \right). \end{align*} \]

You might be thinking: hey, what we need to calculate in machine learning is the derivative with respect to \(\small \beta\), not the derivative with respect to \(\small x\). Note, however, that the issue at stake here is what kind of functional form a neuron can represent, not how to estimate its parameters. Assume that we can approximate

\[ \small \begin{align*} &\frac{\partial f(x_1,\cdots,x_n)}{\partial x_i} \approx\hat{\beta}_i \\ &\frac{\partial^2 f(x_1,\cdots,x_n)}{\partial x_i\partial x_j} \approx\hat{\beta}_i\hat{\beta}_j \end{align*} \]

by treating the common \(\small \exp\) factors as roughly constant (absorbing them into the coefficients \(\small \hat{\beta}_i\)) and ignoring the latter \(\small \exp\) term.

 From the above calculation we can see that, although the sigmoid neuron has only about \(\small n\) parameters, if the explanatory variables are centered so that the expansion is taken around zero, it can be thought of as an approximation of a quadratic function:

\[ \small h_k=f_k(x_1,\cdots,x_n) = \hat{\beta}_0+\sum_{i=1}^n\hat{\beta}_ix_i+\sum_{i=1}^n \sum_{j=1}^n \hat{\beta}_i\hat{\beta}_jx_ix_j+O(x^3). \]

However, the number of parameters can be adjusted by combining multiple neurons, and a one-layer neural network can be viewed as approximating, as closely as possible with an adjustable number of parameters \(\small n\times m\), information that would otherwise require \(\small n^2+n\) parameters.

 In addition, if a single-layer neural network can approximately represent a functional form equivalent to a quadratic form, then by making it multi-layered, it is possible to increase the degree of nonlinearity of the function that can be represented. For example, if we consider a two-layered neural network:

\[ \small \begin{align*} &h_k=f_k(x_1,\cdots,x_n) \approx \hat{\beta}_0+\sum_{i=1}^n\hat{\beta}_ix_i+\sum_{i=1}^n \sum_{j=1}^n \hat{\beta}_i\hat{\beta}_jx_ix_j+O(x^3),\; k=1,\cdots,n \\ &z_l=g_l(h_1,\cdots,h_n) \approx \hat{\gamma}_0+\sum_{i=1}^n\hat{\gamma}_ih_i+\sum_{i=1}^n \sum_{j=1}^n \hat{\gamma}_i\hat{\gamma}_jh_ih_j+O(h^3),\; l=1,\cdots,n, \end{align*} \]

the first layer can only represent terms up to second order in \(\small x\), but the second layer can represent terms up to fourth order. When dealing with highly nonlinear data, deeper layering therefore makes it possible to express functions with a higher degree of nonlinearity. A similar calculation can be carried out for the ReLU function (using approximations by Chebyshev polynomials instead of a Taylor expansion).
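
 A quick check of this order-doubling effect (the quadratic coefficients below are arbitrary): composing two quadratic functions, which is roughly what the two layers do, yields a fourth-order polynomial in \(\small x\).

```python
import numpy as np

# First layer, approximately:  h(x) = 0.1 + 0.5*x + 0.3*x^2  (coefficients, lowest order first)
h = np.array([0.1, 0.5, 0.3])
# Second layer, approximately: z(h) = 0.2 - 1.0*h + 0.7*h^2
# Substitute h(x) into z; multiplying polynomials = convolving coefficient arrays.
h2 = np.convolve(h, h)          # coefficients of h(x)^2, a degree-4 polynomial
composed = 0.7 * h2
composed[:3] += -1.0 * h
composed[0] += 0.2
print(composed)                 # five coefficients -> degree 4 in x
```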

 In summary, machine learning sidesteps the curse of dimensionality (the \(\small n+n^2+\cdots+n^p\) terms required when nonlinear regression is reduced to linear regression) by approximating nonlinear functions with \(\small n\times m\times p\) parameters through the multiplexing and multi-layering of functions. It is important to note that while regression analysis generally requires more data than the number of parameters to be estimated, this is not necessarily the case with machine learning. This is because the parameters added by multi-layering are there to accommodate nonlinear relationships between the data; they do not necessarily act as terms that must be fitted to correspondingly large amounts of data. When you hear news about artificial intelligence with a trillion parameters, you might wonder whether it was trained on more than a trillion samples, but the number of training samples is not necessarily greater than the number of parameters (although the actual situation is difficult for outsiders to gauge…).

Machine Learning and Non-convex Optimization Problems

Most of the convenient things in the world come with some drawbacks. Machine learning using neural networks cleverly avoids the curse of dimensionality in linear regression by using activation functions and multi-layering, but there are some sacrifices that machine learning makes compared to linear regression in order to gain these benefits.

 The optimization problem to be solved when estimating the parameters of a linear regression model is a convex optimization problem, and there are established methods for finding its solution efficiently. Unconstrained convex optimization problems can be solved with an iterative procedure known as Newton’s method, which converges to the solution at a very fast rate known as quadratic convergence. More specifically, the linear regression model has a closed-form solution in matrix form, so the parameter estimates can be obtained by matrix calculations alone. The parameters can therefore be estimated very easily.
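
 As a minimal illustration of the kind of iteration involved (the one-variable convex function below is my own toy example, not from the article), Newton’s method homes in on the minimizer extremely quickly, which is what quadratic convergence looks like in practice:

```python
import numpy as np

# Minimize the convex function f(b) = exp(b) + exp(-b), whose minimum is at b = 0.
def f_prime(b):
    return np.exp(b) - np.exp(-b)

def f_second(b):
    return np.exp(b) + np.exp(-b)

b = 2.0
for _ in range(6):
    b -= f_prime(b) / f_second(b)   # Newton update: b <- b - f'(b) / f''(b)
    print(b)                        # the error roughly squares at each step
```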

 On the other hand, the optimization problem that machine learning solves to estimate its parameters is a non-convex optimization problem, a class of problems for which no efficient numerical method has been discovered (and perhaps none exists). The only way to find the exact optimal solution is a brute-force search over candidate solutions. A general non-convex optimization problem is NP-hard, which means that a solution cannot be found in time polynomial in the number of parameters (the computational cost grows exponentially with the number of parameters). In other words, machine learning eliminates the exponential growth in the number of terms that afflicts linearized regression, but at the cost of an exponential growth in the worst-case computation required to solve the optimization problem used to estimate the parameters.

 Of course, there are many methods for efficiently estimating parameters in machine learning using neural networks. A typical algorithm is Stochastic Gradient Descent (SGD), which is similar to a method for solving convex optimization problems. To find a solution that minimizes the objective function \(\small F(\beta_0,\cdots,\beta_n)\), the method of using the gradient:

\[ \small \nabla F=\left[\begin{array}{c} \frac{\partial F}{\partial \beta_0} \\ \frac{\partial F}{\partial \beta_1} \\ \vdots \\ \frac{\partial F}{\partial \beta_n}
\end{array} \right] \]

to update the parameters iteratively as

\[ \small \beta_{p+1} = \beta_{p}-\eta \nabla F \]

until convergence is called gradient descent. \(\small \eta\) is a parameter called the learning rate. For convex optimization problems this iteration is known to converge to the optimal solution, but for non-convex problems it usually gets stuck in a local minimum, so it cannot in general be used to solve non-convex optimization problems. Stochastic gradient descent is an algorithm that divides the training data into several parts, shuffles them randomly, and trains on them in turn in order to search for a global optimum. Although the mechanism by which such algorithms manage to find global optima is understood to a certain extent, it is easy to see that they do not work for arbitrary non-convex optimization problems. Nevertheless, in machine learning with neural networks this algorithm can often find the optimal solution (or a solution close to it). The technique of training the parameters of multi-layered neural networks in this way is called deep learning, and it can be thought of as a method for solving non-convex optimization problems (albeit within a very limited scope).
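
 Below is a minimal sketch of the mini-batch stochastic gradient descent loop described above, written for a linear model with made-up data so that the gradient is easy to state; for a neural network only the gradient computation (backpropagation) would differ, not the update rule.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

beta = np.zeros(3)
eta, batch_size = 0.05, 32              # learning rate and mini-batch size
for epoch in range(20):
    order = rng.permutation(N)          # shuffle the training data each epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        resid = X[idx] @ beta - y[idx]
        grad = 2 * X[idx].T @ resid / len(idx)   # gradient of the batch squared error
        beta -= eta * grad                       # beta_{p+1} = beta_p - eta * grad F
print(beta)  # roughly [1.5, -2.0, 0.5]
```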

 This idea can be applied to other optimization problems. Deep learning solves optimization problems that minimize the estimation error between observed data and a model, but it can also be applied to solving decision-making problems that maximize a specific reward function \(\small R_{t}(s_t, a_t)\). The machine learning technique used to solve decision-making problems is called reinforcement learning. In particular, reinforcement learning is often formulated as a dynamic programming problem, which can be expressed as:

\[ \small V(s) = \max_{a_t(s_t)} E\left[\sum_{t=1}^\infty \gamma_tR_{t}(s_t, a_t) \right] \]

with a discount rate \(\small \gamma_t\). Reinforcement learning is the problem of determining the action function \(\small a_t(s_t)\) that maximizes the present value of this reward. \(\small s_t\) is a state that is generated stochastically; in finance, for example, it would be information such as market data. The action function \(\small a_t(s_t)\) is a function of these states that returns the optimal action given the state \(\small s_t\). As with regression analysis, the optimal parameters of this function can be estimated using machine learning techniques. Of course, if the problem could be formulated as a convex optimization problem, a solution could be found with standard dynamic programming techniques; it is when it becomes a non-convex optimization problem that the algorithmic challenges of reinforcement learning arise. In this sense, reinforcement learning can be regarded as a field that studies methods for solving non-convex dynamic programming problems.
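
 To make the dynamic-programming formulation concrete, here is a toy value-iteration sketch on a tiny made-up Markov decision process (the transition probabilities, rewards, and constant discount rate are all illustrative assumptions; an actual reinforcement-learning method would approximate the action function with a neural network rather than a table):

```python
import numpy as np

# P[a, s, s'] = transition probability, R[a, s] = reward for action a in state s.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # action 0
              [[0.5, 0.5], [0.6, 0.4]]])    # action 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9                                  # constant discount rate

V = np.zeros(2)
for _ in range(200):
    # Bellman update: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    Q = R + gamma * np.einsum('ast,t->as', P, V)
    V = Q.max(axis=0)
print(V)                  # state values
print(Q.argmax(axis=0))   # greedy (optimal) action in each state
```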

Conclusion

When you hear the terms machine learning and artificial intelligence, it can be difficult to grasp what exactly is being researched in these fields, but based on the discussion in this article, you can see that the methods studied there are, in essence, methods for solving non-convex optimization problems. Moreover, since there appears to be no exact way to solve this mathematical problem other than brute force, there is endless room for ingenuity in finding approximate solutions efficiently. It follows that the techniques studied as machine learning and artificial intelligence will always be incomplete in some sense, and that the field will never run out of research topics. In this respect it is a field with no end, whether viewed academically or as a business, and it can be considered extremely attractive for students and business people alike, in the sense that it is never too late to enter. Well, what often happens is that everyone thinks the same way, which leads to over-entry and a red ocean…
