ANALYSIS OF MULTILAYER NEURAL NETWORK MODELING AND LONG SHORT-TERM MEMORY.

- This paper reviews fundamental ideas and concepts of neural networks in order to give the reader a theoretical explanation of the operation of Long Short-Term Memory (LSTM) networks, classified as Deep Learning systems, and to explicitly present the mathematical development of the Backward Pass equations of the LSTM network model presented in the book Supervised Sequence Labeling with Recurrent Neural Networks [1]. This mathematical modeling, together with its software implementation, provides the tools needed to develop an intelligent system capable of predicting the behavior of licensed users in cognitive radio wireless networks.

Fig. 1: Generic model of an artificial neuron [4].
Whenever a value from the training set is fed into the network, an output is generated; a function is therefore needed to quantify the error made and, based on it, to decide whether the weight values must be modified to minimize the error produced by the network (i.e., until the network outputs are as close as possible to the reality being simulated). The process by which these weights are adjusted is known as training, and the procedure applied is a learning algorithm. The ability to produce correct outputs for inputs not seen during training is known as generalization [3].
An ANN is a network made up of units called neurons that are related to one another. To give a more formal definition (from the mathematical point of view) we use the concept of a graph. Fig. 2 is a directed graph with the following properties:
1. Every node i has an associated state x_i.
2. Every connection between two nodes i and j is assigned a weight w_ij ∈ ℝ.
3. Each node i has a threshold θ_i.
4. For each node i a function f_i is defined, which depends on the weights of its connections, the threshold, and the states of the nodes j connected to it. This function, f_i(w_ij, θ_i, x_j), provides the new state of the node [4].
A wide variety of ANNs have been developed, but, by the type of their connections, they can be classified into two groups: with cycles and without cycles. This paper briefly discusses the multilayer perceptron (MLP), a type of feedforward (cycle-free) neural network, as well as some basic ideas about recurrent neural networks (RNNs), to finally focus the study on LSTM networks, which are the purpose of this work.
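To make the graph definition concrete, the following is a minimal sketch of how the new state of a node is computed from the states of the nodes connected to it. A sigmoid is assumed for f_i, and all weights, thresholds, and states are illustrative values, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def new_state(weights, threshold, states, f=sigmoid):
    """New state of node i: f_i applied to the weighted sum of the states
    of the nodes j connected to i, shifted by the threshold of node i."""
    return f(np.dot(weights, states) - threshold)

# Illustrative values: node i receives connections from three nodes j.
w_i = np.array([0.5, -0.3, 0.8])   # weights w_ij
theta_i = 0.1                      # threshold of node i
x_j = np.array([1.0, 0.0, 0.5])    # current states of the connected nodes
print(new_state(w_i, theta_i, x_j))
```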

A. Multilayer Perceptron (MLP).
MLP is an ANN whose topology (the structure in which the neurons are organized) is characterized by grouping neurons of the same type into substructures called layers (input, hidden, output), as shown in Fig. 3; the connections between neurons only allow information to flow in one direction (forward, so neurons in the same layer are not connected to each other), and the layers can be totally or partially connected. MLPs are suitable for pattern recognition and prediction tasks, and are considered universal function approximators [1], since their outputs depend only on the current inputs (those of the moment).
Fig. 3: Structure of a multilayer perceptron [6].
From here on, the total number of hidden layers in the MLP is denoted by L, and the weight assigned to the connection between neuron i of layer k-1 and neuron j of layer k by w_{ij}^{(k)}; the remaining notation is found in Table I. The most common choices for the activation function θ_j, because of their non-linearity (without non-linearity, a network with multiple hidden layers would reduce to an equivalent network with a single hidden layer) and their differentiability (which allows the network to be trained with gradient descent), are the functions:

\theta(x) = \frac{1}{1 + e^{-x}} \quad (1)

\theta(x) = \tanh(x) \quad (2)

In summary, we have the following notation for the h-th neuron of layer l:

a_h^l = \sum_i w_{ih}^{(l)}\, b_i^{l-1} \quad (3)

b_h^l = \theta_h\!\left(a_h^l\right) \quad (4)

where a_h^l is the propagation (net input) of the neuron and b_h^l its output.
An ANN is said to have learned if the error found at its outputs is minimal. Thus, the main purpose of applying a neural network to a problem (which here could be characterizing PUs in cognitive networks) is to minimize an error function E that depends "only" on the weights w_{ij}^{(k)} (if there are activation thresholds, the function also depends on their values). Generally, E is defined as the mean square error between the current output and the desired output:

E = \frac{1}{2} \sum_i \left(y_i - s_i\right)^2

where the s_i are the target values (the "real data" of the situation) and the y_i are the outputs of the neurons of the output layer [1].
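As a minimal illustration of equations 3-4 and of the error function E, the following sketch propagates an input through a small MLP with sigmoid activations and evaluates the mean square error against a target. Layer sizes, weights, and data are hypothetical, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Illustrative 2-3-2 MLP: 2 inputs, one hidden layer of 3 neurons, 2 outputs.
W1 = rng.normal(size=(2, 3))   # weights w_ij^(1)
W2 = rng.normal(size=(3, 2))   # weights w_ij^(2)

x = np.array([0.5, -1.0])      # input vector
s = np.array([0.0, 1.0])       # desired (target) outputs s_i

b1 = sigmoid(x @ W1)             # equations 3-4 for the hidden layer
y  = sigmoid(b1 @ W2)            # outputs y_i of the output layer
E  = 0.5 * np.sum((y - s) ** 2)  # mean square error at the outputs
print(y, E)
```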
To train the MLP, the gradient descent algorithm is almost always used. Its principle is that, when the weights have to be modified, they are modified in the direction opposite to the derivative (which describes the behavior of the function at a given point), since moving against the derivative guarantees that the value of the function being studied will decrease.
The idea behind gradient descent in an MLP is to find the derivative of the error function with respect to each of the weights w_{ij}^{(k)} and then modify those weights in the direction opposite to the derivative; more precisely, the quantity α ∂E/∂w_{ij}^{(k)} is subtracted from each weight, where α < 1 is the learning rate.
The partial derivative is used because it represents the variation of the error when a single variable is modified; to calculate the gradient efficiently, the technique known as Backpropagation is normally used.
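A minimal sketch of the update rule described above, applied to a toy one-weight error function with a numerically estimated derivative (all names and values are illustrative):

```python
def toy_error(w):
    # Illustrative error function of a single weight, minimized at w = 2.
    return (w - 2.0) ** 2

alpha = 0.1          # learning rate, alpha < 1
w, eps = 5.0, 1e-6
for _ in range(100):
    # Numerical estimate of dE/dw (central difference).
    dE_dw = (toy_error(w + eps) - toy_error(w - eps)) / (2 * eps)
    w -= alpha * dE_dw   # move against the derivative
print(w)   # approaches 2.0, the minimizer of the toy error
```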

B. Backpropagation for MLPs.
This algorithm can be regarded as the mathematical heart of ANNs; it is not only used to train feedforward networks such as the MLP, but can also be adapted to RNNs. The LSTM model uses an adaptation of it during learning, which is why it is necessary to understand how the method works.
Backpropagation consists of repeatedly applying the chain rule for partial derivatives, and the first step is to obtain the derivatives of the loss (error) function E with respect to the output neurons. For the calculations presented in this document, the sigmoid function given in equation 1 is adopted as the activation function θ_j; therefore:

\theta_j'(x) = \theta_j(x)\left(1 - \theta_j(x)\right) \quad (8)

Thus, taking into account equations 1 and 2, and applying the multivariable chain rule together with equation 8, we have for the output neurons:

\frac{\partial E}{\partial y_j} = y_j - s_j \quad (9)

\frac{\partial E}{\partial a_j} = \frac{\partial E}{\partial y_j}\,\theta_j'(a_j) = \left(y_j - s_j\right) y_j \left(1 - y_j\right) \quad (10)

In addition to the above, the following notation will be used for any unit j in the neural network:

\delta_j \equiv \frac{\partial E}{\partial a_j} \quad (11)
Thus, for units in the last hidden layer, we have:

\delta_h = \theta_h'(a_h) \sum_{k=1}^{K} \delta_k\, w_{hk} \quad (12)

where the sum runs over the K units of the output layer.
For the other hidden layers, we have:

\delta_h = \theta_h'(a_h) \sum_{h'} \delta_{h'}\, w_{hh'} \quad (13)

where h' runs over the units of the following layer.
Once the deltas of all the hidden neurons have been calculated, computing the derivatives with respect to each of the weights we arrive at:

\frac{\partial E}{\partial w_{ij}} = \delta_j\, b_i \quad (14)

where b_i is the activation entering the connection w_{ij}.
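The following sketch collects the whole procedure for one training example of a small MLP: forward pass, output and hidden deltas (equations 10 and 12), weight derivatives (equation 14), and the gradient descent update. Layer sizes, data, and the learning rate are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(3, 2))
x, s = np.array([0.5, -1.0]), np.array([0.0, 1.0])

# Forward pass (sigmoid activations throughout).
b1 = sigmoid(x @ W1)
y  = sigmoid(b1 @ W2)

# Output-layer deltas (equation 10): (y_k - s_k) * y_k * (1 - y_k).
delta_out = (y - s) * y * (1 - y)
# Hidden-layer deltas (equation 12): theta'(a_h) * sum_k w_hk * delta_k.
delta_hid = b1 * (1 - b1) * (W2 @ delta_out)

# Weight derivatives (equation 14): dE/dw_ij = delta_j * b_i.
grad_W2 = np.outer(b1, delta_out)
grad_W1 = np.outer(x, delta_hid)

alpha = 0.5
W2 -= alpha * grad_W2   # gradient descent update
W1 -= alpha * grad_W1
```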

II. RECURRENT NEURAL NETWORK (RNN)
Unlike the MLP, recurrent neural networks (RNNs) allow one or more of their neurons to feed back on themselves (graphically, cycles appear); this suggests that an RNN can, in principle, use the whole "history" of previous inputs for each output. This analysis considers an RNN with a single self-connected hidden layer (see Fig. 4). The key idea is that the recurrent connections allow a "memory" of previous inputs to persist in the internal state of the network and thereby influence the unit output. For RNN learning it is possible to apply a method similar to the one used for the MLP. The activation functions are kept, but the system is modified in that the activations arriving at the hidden layer come from two places: the input layer and the hidden layer itself (Fig. 5). To analyze the RNN mathematically, the notation shown in Table II is used. On this basis, we use an analogue of Backpropagation for RNNs, namely Backpropagation Through Time (BPTT). As in Backpropagation, BPTT consists of repeatedly applying the chain rule; the important point is that, for RNNs, the loss function depends on the activation of the hidden layer both through its influence on the output layer and through its influence on the hidden layer at the next timestep. Therefore, for the h-th hidden neuron we have:

\delta_h^t = \theta'(a_h^t)\left(\sum_{k=1}^{K} \delta_k^t\, w_{hk} + \sum_{h'=1}^{H} \delta_{h'}^{t+1}\, w_{hh'}\right) \quad (18)
Bearing in mind that the same weights are used at every timestep, we must sum over the whole time interval considered to obtain the derivatives with respect to the network weights. Therefore:

\frac{\partial E}{\partial w_{ih}} = \sum_{t=1}^{T} \frac{\partial E}{\partial a_h^t}\,\frac{\partial a_h^t}{\partial w_{ih}} = \sum_{t=1}^{T} \delta_h^t\, b_i^t \quad (19)
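A compact sketch of BPTT for a single self-connected hidden layer follows: it runs the forward pass over T timesteps, applies the delta recursion of equation 18 backwards in time, and accumulates the weight derivatives as in equation 19. A linear output layer with mean square error is assumed for brevity; all sizes, weights, and data are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
I, H, K, T = 2, 4, 1, 5              # inputs, hidden units, outputs, timesteps
W_in, W_rec, W_out = (rng.normal(scale=0.5, size=shape)
                      for shape in [(I, H), (H, H), (H, K)])
x = rng.normal(size=(T, I))          # input sequence
s = rng.normal(size=(T, K))          # target sequence

# Forward pass through time: a_h^t uses the inputs at t and b_h^{t-1}.
b = np.zeros((T + 1, H))             # b[-1] serves as the zero initial state
y = np.zeros((T, K))
for t in range(T):
    b[t] = sigmoid(x[t] @ W_in + b[t - 1] @ W_rec)
    y[t] = b[t] @ W_out

# BPTT: delta_h^t combines contributions from the output layer at time t
# and from the hidden layer at time t+1 (equation 18).
delta_out = y - s                    # linear output layer with MSE loss
delta_h = np.zeros((T + 1, H))
for t in reversed(range(T)):
    delta_h[t] = b[t] * (1 - b[t]) * (delta_out[t] @ W_out.T
                                      + delta_h[t + 1] @ W_rec.T)

# Equation 19: sum the per-timestep contributions for each shared weight.
grad_W_in  = sum(np.outer(x[t], delta_h[t]) for t in range(T))
grad_W_rec = sum(np.outer(b[t - 1], delta_h[t]) for t in range(T))
```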
In some cases it is advisable to "unroll" the feedback neuron in order to better understand what is happening (Fig. 6); in doing so we can see, frame by frame, the "states" of the neuron as time progresses.
Fig. 6: Operation of a unit in an RNN [7].

III. LONG SHORT-TERM MEMORY
LSTM neural networks are a type of ANN whose structure consists of a set of memory blocks; these are, basically, recurrently connected subnets (Fig. 7). Each block has one or more self-connected cells and three "gates" that perform, for the cells, the functions of writing (input), reading (output), and reset (forget). This type of ANN was designed to solve the vanishing gradient problem (the loss of the learning achieved because the first inputs are "forgotten").
Fig. 8: Structure of a memory block [7].
In addition, it should be noted that the connection weight between neurons i and j is denoted by w_{ij}; H is the number of blocks in the hidden layer; h indexes the outputs of the other blocks in the hidden layer; and the symbol s_c^t represents the state of cell c at time t.

Gates
The calculations are presented for a single block, which is assumed to contain a single cell. It is also assumed that the output layer has K units. The procedure for obtaining the Forward Pass and Backward Pass (BPTT) equations is shown below.

Forward Pass Equations.
For the three gates of the cell (input, forget, and output, Fig. 8), the propagation functions a_ι^t, a_φ^t, and a_ω^t consider not only the weighted sum of the current inputs but also the outputs, at the immediately preceding time, of the blocks in the hidden layer, and the states of the cells of the same block at the immediately preceding time (except in the output gate, which needs the current state of the cells). From a careful analysis of Fig. 8 it can be seen that the mathematical representation of this description results in equations 20 to 25 of [1]. In the cells, two elements must be considered. The first is the propagation function a_c^t, which depends not only on the current inputs but also on the outputs, at the immediately preceding time, of the other blocks in the hidden layer:

a_c^t = \sum_{i=1}^{I} w_{ic}\, x_i^t + \sum_{h=1}^{H} w_{hc}\, b_h^{t-1} \quad (26)

The second is the state of the neuron s_c^t, which indicates whether the neuron retains information or forgets it, and depends on the outputs of the forget gate and the input gate:

s_c^t = b_\phi^t\, s_c^{t-1} + b_\iota^t\, g(a_c^t) \quad (27)

The cell output indicates whether the stored information is emitted or retained:

b_c^t = b_\omega^t\, h(s_c^t) \quad (28)

where g and h are the activation functions of the cell input and output, and b_ι^t, b_φ^t, and b_ω^t are the activations of the input, forget, and output gates.
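To illustrate the cell equations 26 to 28 and the gate propagations described above, the following sketch runs the Forward Pass of one block with a single cell, using sigmoid gates, tanh for g and h, and peephole connections from the cell state to the gates. All weights, sizes, and variable names are illustrative assumptions, not taken from [1].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
I, T = 3, 6                          # number of inputs and timesteps
x = rng.normal(size=(T, I))          # input sequence
# Illustrative weights for one block with a single cell: each gate and the
# cell see the inputs and the previous block output; the gates also have
# peephole weights to the cell state.
w_i, w_f, w_o, w_c = (rng.normal(scale=0.5, size=I) for _ in range(4))
u_i = u_f = u_o = u_c = 0.5          # recurrent weights (previous output b)
p_i = p_f = p_o = 0.3                # peephole weights to the cell state
g = h = np.tanh                      # cell input / output activations

b_prev, s_prev = 0.0, 0.0            # previous block output and cell state
for t in range(T):
    b_iota  = sigmoid(w_i @ x[t] + u_i * b_prev + p_i * s_prev)  # input gate
    b_phi   = sigmoid(w_f @ x[t] + u_f * b_prev + p_f * s_prev)  # forget gate
    a_cell  = w_c @ x[t] + u_c * b_prev                          # eq. 26
    s_t     = b_phi * s_prev + b_iota * g(a_cell)                # eq. 27
    b_omega = sigmoid(w_o @ x[t] + u_o * b_prev + p_o * s_t)     # output gate
    b_t     = b_omega * h(s_t)                                   # eq. 28
    b_prev, s_prev = b_t, s_t
print(b_t, s_t)
```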

Backward Pass Equations.
Since a variation of Backpropagation will be used (as mentioned above), the chain rule must be applied to calculate the partial derivatives. Initially assume the following definitions:

\epsilon_c^t \equiv \frac{\partial E}{\partial b_c^t}, \qquad \epsilon_s^t \equiv \frac{\partial E}{\partial s_c^t}
Defining E as the loss (error) function, and since we wish to establish how the error varies when the weights are changed, the chain rule gives the following:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j^t}\,\frac{\partial a_j^t}{\partial w_{ij}} \quad (29)
Thus, the objective is to calculate the first partial derivative that results, ∂E/∂a_j^t; in the case of the LSTM, however, there are four types of propagation "a" for which it must be calculated: those of the input gate (a_ι^t), the forget gate (a_φ^t), the output gate (a_ω^t), and the cell (a_c^t). Taking into account that the summations are done over c (the cells), because the model is developed for a single block (which contains the cells), when the respective derivatives are calculated we obtain the mathematical descriptions shown below. Cells:

\delta_c^t \equiv \frac{\partial E}{\partial a_c^t} = b_\iota^t\, g'(a_c^t)\, \epsilon_s^t \quad (44)
Note that the equations depend on the terms ε_c^t and ε_s^t; therefore we must also study how the error is affected by changes both in the cell outputs and in their states. Keep in mind that the error is, in principle, a function whose variables are the outputs generated by the H blocks of the hidden layer. Moreover, for a fixed block, the output produced by a cell at time t will affect the units of the output layer at time t and the input of each of the blocks of the hidden layer at time t+1. Thus:

\epsilon_c^t = \sum_{k=1}^{K} w_{ck}\, \delta_k^t + \sum_{h=1}^{H} w_{ch}\, \delta_h^{t+1} \quad (47)
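A minimal numerical sketch of equation 47, together with the cell delta of equation 44, is shown below. Every value and array here is illustrative, g = tanh is assumed, and ε_s^t is used as a placeholder value whose full recursion is derived in the following paragraph.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H = 2, 3                           # output units and hidden blocks

w_ck       = rng.normal(size=K)       # weights from the cell output to the output units
w_ch       = rng.normal(size=H)       # weights from the cell output to the hidden blocks
delta_k_t  = rng.normal(size=K)       # output-unit deltas at time t
delta_h_t1 = rng.normal(size=H)       # hidden-block deltas at time t+1

# Equation 47: the cell output at time t affects the output layer at time t
# and every block of the hidden layer at time t+1.
eps_c_t = w_ck @ delta_k_t + w_ch @ delta_h_t1

# Equation 44: cell delta, with g = tanh assumed and eps_s_t taken as a
# placeholder value (its recursion is given in the following paragraph).
b_iota_t, a_c_t, eps_s_t = 0.9, 0.4, 1.2
delta_c_t = b_iota_t * (1.0 - np.tanh(a_c_t) ** 2) * eps_s_t
print(eps_c_t, delta_c_t)
```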
Now we study what happens to the error if changes are made in the state of the cell. The state of the cell at time t, s_c^t, reports whether or not its information was modified at that time; therefore s_c^t is a value that affects the inputs of all the gates, the next state of the cell, and, clearly, the output of the cell itself. Thus we get:

\epsilon_s^t = b_\omega^t\, h'(s_c^t)\, \epsilon_c^t + b_\phi^{t+1}\, \epsilon_s^{t+1} + w_{c\iota}\, \delta_\iota^{t+1} + w_{c\phi}\, \delta_\phi^{t+1} + w_{c\omega}\, \delta_\omega^t

Comparison between MLP and LSTM.
To conclude this paper, Table IV compares the most important characteristics of the MLP and the LSTM, the techniques that will be used to characterize PUs in cognitive radio networks in order to validate the convenience of using LSTM as a predictor of the future states of spectral channel use by primary users.

IV. CONCLUSIONS
- The article presents a mathematical analysis of the operation of MLP-type ANNs, RNNs, and, finally, LSTM neural networks.
- From the point of view of the LSTM, we show the mathematical modeling required to obtain the Backward Pass equations of LSTM networks; this analysis is very important for developing algorithms that make it possible to assess its application in telecommunications fields, in lines of research such as Cognitive Radio.
- The step-by-step derivation of the Backward Pass equations of the LSTM carried out here was not found in the literature studied, nor are there indications that it exists on the Internet; it is therefore a resource made available to the academic and scientific community.