Advanced Stochastic Optimization Algorithm for Deep Learning Artificial Neural Networks in Banking and Finance Industries

One objective of this paper is to incorporate fat-tail effects into stochastic activation functions such as the Sigmoid in order to introduce transparency and stability into the existing set. Secondly, according to the literature reviewed, the existing activation functions were introduced into Deep Learning Artificial Neural Networks through the "Window" rather than through the "Legitimate Door", since they rest on "Trial and Error" and "Arbitrary Assumptions"; the author therefore proposes "Scientific Facts", "Definite Rules: Jameel's Stochastic ANNAF Criterion", and a "Lemma" to substitute for, though not necessarily replace, the existing stochastic activation functions such as the Sigmoid. This research is expected to open the "Black Box" of Deep Learning Artificial Neural Networks. The author proposes a new set of advanced optimized fat-tailed stochastic activation functions emanating from the AI-ML-purified stocks data, namely the Log-Logistic (3P) Probability Distribution (1st), Cauchy Probability Distribution (2nd), Pearson 5 (3P) Probability Distribution (3rd), Burr (4P) Probability Distribution (4th), Fatigue Life (3P) Probability Distribution (5th), Inv. Gaussian (3P) Probability Distribution (6th), Dagum (4P) Probability Distribution (7th), and Lognormal (3P) Probability Distribution (8th), for the successful conduct of both forward and backward propagation in Deep Learning Artificial Neural Networks. However, this paper did not check the monotone differentiability of the proposed distributions. Appendices A, B, and C present and test the performance of the stressed Sigmoid and the optimized activation functions using stocks data (1991-2014) of Microsoft Corporation (MSFT), Exxon Mobil (XOM), Chevron Corporation (CVX), Honda Motor Corporation (HMC), General Electric (GE), and U.S. fundamental macroeconomic parameters; the results were found fascinating. The first three distributions are thus excellent activation functions for conducting any stock Deep Learning Artificial Neural Network, and distributions 4 to 8 are also good advanced optimized activation functions. Generally, this research revealed that whether an advanced optimized activation function satisfies Jameel's ANNAF Stochastic Criterion depends on the referenced purified AI data set, time change, and area of application, in contrast to the existing "Trial and Error" and "Arbitrary Assumptions" behind the Sigmoid, Tanh, Softmax, ReLU, and Leaky ReLU.

The paper begins with the Introduction, followed by the Literature Review, Materials and Methods, and Results, and closes with concluding remarks.

Literature Review
Activation functions add non-linearity to a neural network and thus give it the ability to learn complex functional mappings from data. A neural network without an activation function (1) can only represent a linear relationship between variables and (2) does not satisfy the universal approximation theorem. This can be seen from the fact that a layer of the form $z_1 = w_{11}x_1 + w_{12}x_2 + w_{13}x_3$, composed with further such layers, is still just a linear function, too weak to model many kinds of complex data behaviour, so a non-linear activation is needed (a numerical illustration follows below). Moreover, the choice of a neural network's activation functions does not follow definite rules; this makes the Deep Learning Artificial Neural Network a "Black Box" system and exposes its predictions to model risk.
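The following numerical illustration (not from the paper; the matrix sizes and random values are arbitrary) shows the point: without a non-linear activation between layers, two stacked weight matrices are equivalent to a single linear map.

```python
# Two "layers" without an activation collapse to one linear map.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input vector
W1 = rng.normal(size=(4, 3))    # first layer weights
W2 = rng.normal(size=(2, 4))    # second layer weights

two_layers = W2 @ (W1 @ x)      # forward pass with no activation in between
one_layer = (W2 @ W1) @ x       # the equivalent single linear layer
print(np.allclose(two_layers, one_layer))   # True: still a linear model
```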
Sebastian Urban (2017) stated that the choice of activation functions was not seriously challenged by researchers (except for special-purpose applications) until recently, when Nair et al. (2010) introduced the rectified linear unit (ReLU), a neuron whose activation function is linear for positive inputs and zero for negative inputs.
This achievement led to a wave of follow-up research on activation functions specifically tailored to deep networks. Work in 2015 introduced stochasticity into the activation function by sampling the value of the slope from a fixed uniform distribution at each training iteration. Clevert et al. (2015) and Klambauer et al. (2017) replaced the negative part of the ReLU with a scaled exponential function and showed that, under certain conditions, this leads to automatic renormalization of the inputs to the following layer and thereby simplifies the training of neural networks, leading to accuracy improvements of deep feed-forward networks on tasks from the UCI Machine Learning repository (Lichman, 2013), among others. Nearly fully adaptable activation functions have been proposed by Eisenach et al. (2017). Similarly, Scardapane et al. (2017) also use a basis expansion, but with a set of Gaussian kernels that are equally distributed over a preset input range.
According to Asman Dureja and Payal Pahwa (2019), the choice of activation function affects the neural network in terms of optimization and of retrieving better results. Several activation functions have been introduced in machine learning for many practical applications, but which activation function should be used at the hidden layers of deep neural networks has not been identified. According to Chigozie Enyinna Nwankpa et al. (2018), LReLU, PReLU and RReLU perform better than the ReLU, but some of these functions lack theoretical justification for their state-of-the-art results. Soufiane Hayou et al. (2019) stated that an inappropriate selection of activation function can lead to loss of information from the input during forward propagation and to exponentially vanishing or exploding gradients during back-propagation. Schoenholz et al. (2017) applied a Gaussian process approximation to deep neural networks.
Joonho Lee et al. (2019) proposed a probabilistic activation function called ProbAct. Unfortunately, the output value of ProbAct is sampled from a normal distribution, with the mean equal to the output of the ReLU and a fixed or trainable variance for each element.
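The mechanism just described can be sketched as follows. This is an illustrative reading of ProbAct, not Lee et al.'s (2019) code; the function name, the fixed sigma value and the random generator are assumptions for the example.

```python
# Illustrative sketch of the ProbAct idea: the activation output is sampled from a
# normal distribution whose mean is the ReLU output; sigma is fixed here but could
# be a trainable per-element parameter.
import numpy as np

def probact(x, sigma=0.1, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    mean = np.maximum(x, 0.0)                 # ReLU output used as the mean
    return rng.normal(loc=mean, scale=sigma)  # stochastic activation value

print(probact(np.array([-1.0, 0.5, 2.0])))
```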
Professionally, the opinions of the members of StackExchange (2018), sponsored by IBM, state: "I choose the activation function for the output layer depending on the output that I need and the properties of the activation function that I know. In hidden layers, I use a leaky ReLU to avoid dead neurons instead of the ReLU, and the tanh instead of the sigmoid. Of course, I don't use a linear function in hidden units. However, the choice for them in the hidden layer is mostly due to trial and error. However, the more activation functions I discover, the more I'm confused about the choice of the function to use in hidden layers. And I don't think that flipping a coin is a good way of choosing an activation function. It would be incredibly difficult to recommend an activation function that works for all use cases (although I believe that SELU was designed so that it would do the right thing with pretty much any input). At the end of the day, you are probably going to get as many opinions as there are people about the right choice of activation function. The bottom line is that there is no universal rule for choosing an activation function for hidden layers."

Artificial intelligence (AI)
Artificial intelligence (AI) was created as a subfield of computer science targeted at tasks that humans are good at; in other words, the goal of AI is to mimic human intelligence. Artificial intelligence and machine learning technology in the banking industry can be used for credit decisions (allowing a fast, accurate assessment of a potential borrower at far less cost than traditional methods, and eliminating bias, as machines are more objective than human employees); risk assessment and management (banks mitigate risk because they receive accurate reporting that is not prone to human error, and by reviewing the history of risk cases AI can help banks forecast issues and take early steps to avoid problems); fraud prevention (by analyzing spending patterns, location, and client behavior, machine learning can detect anomalies in spending and alert the cardholder, dramatically reducing credit card fraud); and a personalized approach (for example, notifying members if their card was charged twice for an expense or if they tipped an exorbitant amount at a restaurant). Machine learning uses algorithms to learn from data without relying on rules-based programming. Machine learning originally developed as a subfield of AI, and one of its goals was to replace the need to develop computer programs "manually." Considering that programs are developed to automate processes, we can think of machine learning as "automating automation": machine learning lets computers "create" programs (often intended to make predictions) themselves; in other words, machine learning is the process of turning data into programs. Machine learning (and deep learning) definitely helps to develop AI; however, AI does not necessarily have to be developed using machine learning, although machine learning makes AI much more convenient.
Machine learning models can also predict which banking tools individual members might use and recommend them so that customers can make better financial decisions. Another application is process automation: using robotic process automation (RPA), banks can remove human error and restructure the workforce to focus on more pressing tasks. JPMorgan Chase & Co launched COIN, or Contract Intelligence, which automated the processing of legal documents, the extraction of data, and the review of certain types of legal contracts; machine learning algorithms could use image recognition to identify patterns in the agreements, and what would normally take roughly 360,000 labor hours per year took the model a few hours. Another example of automating tasks is the increased use of chatbots that provide quick and reliable answers to consumers: using AI-powered mobile and web chatbots, banks can reduce the time it takes for consumers to receive answers and decrease the need for human assistants. Other applications are improved customer services, investment predictions, marketing, network security, and algorithmic trading.

Deep learning is a subfield of machine learning, referring to a particular subset of models that are particularly good at certain tasks. Deep learning algorithms run data through several "layers" of neural network algorithms, each of which passes a simplified representation of the data to the next layer. The ability to process large numbers of features makes deep learning algorithms very powerful when dealing with unstructured data (such as images, audio and video). However, deep learning algorithms can be overkill for less complex problems, because they require access to a vast amount of data to be effective and may be outperformed by tree-based algorithms in such cases. Instead of organizing data to run through predefined equations, deep learning algorithms characterize basic parameters of the data and train the computer to learn on its own by recognizing patterns using many layers of processing. Deep learning maps inputs to outputs using multiple layers of non-linear processing units and is capable of creating models for previously manual procedures. The first such application is creating risk-management models for lending and credit risk management; the second is fraud prevention, where AI systems identify, track and flag potential threats.

Neuron
A neuron is just a mathematical function that takes inputs (the outputs of the neurons pointing to it) and returns outputs. These outputs serve as inputs for the next layer, and so on until we reach the final, output layer, whose values are the ones the network actually returns. For each set of inputs, the neural network's goal is to make each of its outputs as close as possible to the actual expected values. A neuron will take an input vector and basically do three things to it (a minimal sketch follows): multiply it by a weights vector; add a bias, a constant that helps the model fit the given data as well as possible; and apply an activation function to the resulting value. We get a new vector as each layer's output, feed it to the next layer as inputs, and so on. Each layer in a neural network has an activation function.
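A minimal sketch of those three operations (illustrative only; the weights, bias and the choice of the sigmoid here are assumptions, not the paper's configuration):

```python
# Single neuron forward pass: multiply by the weights, add the bias,
# then apply a non-linear activation (sigmoid used as the example).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs from the previous layer
w = np.array([0.4, 0.1, -0.2])   # weights vector
b = 0.05                         # bias
print(neuron(x, w, b))
```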

Figure 6. Neuron
Source: Google Images (2019)

Neural Network
A Neural Network is a machine learning model that, given certain input and output vectors, will try to "fit" the outputs to the inputs. What this means is: given a set of observed instances with certain values we wish to predict, and some data we have on each instance, it will try to generalize those data so that it can predict the values correctly for new instances of the problem. And here we finally get to the core of our business: that is what activation functions do. We typically use non-linear functions as activation functions, because the linear part is already handled by the previously applied product and addition.

Methods
Model complexity has increased, and models are now used in many areas of banking and finance, so they have come under far greater scrutiny. This scrutiny was kick-started by the aftermath of the global financial crisis, in which it became apparent that at least some incumbent models at the time were unfit for purpose, a problem exacerbated by a lack of appropriate governance around models and their use.
The supervisory reaction has drastically increased regulatory scrutiny of how models are built, approved and maintained. In 2011, the Federal Reserve and the Office of the Comptroller of the Currency published supervisory guidance on Model Risk Management, which mandated banks to submit documentation on model risk management policies and practices. This became a regulatory standard for model risk management, and regulators around the globe picked up the challenge. In 2017, the European Central Bank (ECB) published guidelines which expect financial institutions to have a well-developed model risk management framework in place and to apply it to all models used for business decision-making, especially those pertaining to capital requirements. In the United Kingdom (UK), the Prudential Regulation Authority (PRA) is also heavily focused on model risk, with stress-testing models getting a lot of attention. In 2018, the PRA published a supervisory statement on model risk management for stress testing (SS3/18), covering firms' development, implementation and use of stress-testing models. Algorithmic trading has also seen increased regulatory focus in the UK: the PRA published a supervisory statement on Algorithmic Trading (SS5/18) outlining expectations for the risk management and governance of algorithmic trading, where machine learning models are increasingly being used to facilitate automated trading decisions. These developments are mirrored in many other jurisdictions around the world.
Most banks today are quite familiar with machine learning algorithms that learn by example from historical data and are used to predict outcomes and uncover patterns that are not easily spotted by humans. However, not all classes of machine learning algorithms are recent developments; many have been used in the banking and financial services industry for decades. These include basic statistical methods and techniques such as linear and logistic regression and rule-based modeling. These traditional methods are relatively simple, well understood by practitioners, and easily explainable and interpretable. Recently, however, the U.S. Congress has intensively criticized AI and ML as technological or scientific advances that lack transparency, are subject to data manipulation, and produce unregulated outcomes that do not allow companies to make well-informed decisions; AI and ML algorithms run in secrecy, as a black box beyond the control of both the user and the builder, without independent human supervision, oversight or regulation, so that the technology can no longer be trusted, gives bad recommendations, and care should be taken about the final ranking of content, among other concerns. To increase transparency and stability in the activation functions' output (the final ranking of the content), and thereby help policymakers, companies and decision-makers choose the right activation functions and make well-informed decisions, this paper proposes the following advanced stochastic algorithms. The proposed algorithm also logically attempts to answer the questions "why does one activation function work better than another?", "how do we know which one to use?", "is it hardcore maths?", and so on.

The Basic Idea
The basic idea arose from a U.S. Congress hearing entitled "Optimizing …". Jamilu (2015) proposed Jameel's Contractional-Expansional Stress Method, which depends on Jameel's Criterion, to incorporate fat-tail effects into banking and economic stochastic models, particularly the Logistic Regression (LOGIT) and PROBIT models, as shown in the figure below. For instance, the ultimate goal is to drastically minimize the noise in the right photo of Figure 10.
After the Sigmoid was stressed, the author continued to brainstorm with the available literature and concluded that the existing set of activation functions was introduced into the Deep Learning Artificial Neural Network system through the "Window" rather than through the "Legitimate Door". Biologically, a neuron performs three basic functions: it receives signals (or information); it integrates incoming signals (to determine whether or not the information should be passed along); and it communicates signals to target cells (other neurons, muscles or glands). However, this cannot be done successfully without the action of a non-linear function (the "brain" of the neuron) residing in a neuron of the human brain and emanating from the incoming signals (information). This non-linear function is what we call the activation function of a Deep Learning Artificial Neural Network.
The author proposes that the competent and eligible activation functions for the successful conduct of Artificial Neural Networks are those that emanate from the AI-ML-purified data set under consideration and satisfy AI-ML-Jameel's Stochastic or Deterministic Criterion, on the basis of the following scientific facts: (1) they emanate from the referenced AI-ML-purified data set and satisfy AI-ML-Jameel's Stochastic and/or Deterministic Criterion; (2) they have a very strong (if not perfect) correlation with the referenced AI-ML-purified data set, and a link between the data set and the activation functions must be strongly established, since an Artificial Neural Network uses past historical data to predict the future of a given task with the aid of machines; (3) they relate better to the referenced AI-ML-purified data set than the existing assumed, trial-and-error activation functions; (4) they indeed describe the distribution of the referenced AI-ML-purified data set, that is, a listing or function showing all the possible values (or intervals) of the data and how often they occur. Under this criterion, we run goodness-of-fit tests on the referenced purified AI-ML data set such that: (i) we accept a distribution if the average of its ranks under the Kolmogorov-Smirnov, Anderson-Darling and Chi-squared tests is less than or equal to three (3); (ii) we must choose the fat-tailed probability distribution followed by the referenced purified AI-ML data set itself, regardless of its rankings; (iii) if there is a tie, we include both fat-tailed probability distributions in the selection; (iv) at least two (2) fat-tailed probability distributions must be included in the selection; (v) we select the most frequently occurring probability distribution as the qualifying candidate in each goodness-of-fit test on the referenced purified AI-ML data set. A minimal sketch of the ranking step in rule (i) is given below.
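The following is a hedged sketch (not the author's original implementation) of rule (i): candidate fat-tailed families are fitted to the purified series, their Kolmogorov-Smirnov, Anderson-Darling and Chi-squared statistics are ranked, and a candidate is accepted when the average of its three ranks is at most three. The mapping of the paper's distribution names to scipy.stats families, and the simulated stand-in data, are assumptions for illustration only.

```python
# Hedged sketch of Jameel's ANNAF Stochastic Criterion, rule (i).
# The mapping of the paper's distributions to scipy.stats families is assumed.
import numpy as np
from scipy import stats

CANDIDATES = {
    "Log-Logistic (3P)": stats.fisk,
    "Cauchy": stats.cauchy,
    "Pearson 5 (3P)": stats.invgamma,
    "Burr (4P)": stats.burr12,
    "Fatigue Life (3P)": stats.fatiguelife,
    "Inv. Gaussian (3P)": stats.invgauss,
    "Dagum (4P)": stats.burr,        # Burr Type III, often identified with Dagum
    "Lognormal (3P)": stats.lognorm,
}

def goodness_stats(data, dist):
    """Fit one distribution and return (KS, Anderson-Darling, Chi-squared) statistics."""
    frozen = dist(*dist.fit(data))
    ks = stats.kstest(data, frozen.cdf).statistic
    # Anderson-Darling statistic computed from the fitted CDF
    u = np.clip(frozen.cdf(np.sort(data)), 1e-12, 1 - 1e-12)
    n = len(u)
    i = np.arange(1, n + 1)
    ad = -n - np.mean((2 * i - 1) * (np.log(u) + np.log(1 - u[::-1])))
    # Chi-squared statistic on 10 equiprobable bins via the probability integral transform
    observed, _ = np.histogram(frozen.cdf(data), bins=10, range=(0.0, 1.0))
    expected = np.full(10, n / 10)
    chi2 = np.sum((observed - expected) ** 2 / expected)
    return ks, ad, chi2

def jameel_rule_i(data):
    """Average the per-test ranks; accept a candidate when the average rank <= 3."""
    names = list(CANDIDATES)
    scores = np.array([goodness_stats(data, CANDIDATES[n]) for n in names])
    ranks = scores.argsort(axis=0).argsort(axis=0) + 1   # rank 1 = best fit per test
    avg = ranks.mean(axis=1)
    return sorted(zip(names, avg, avg <= 3.0), key=lambda t: t[1])

# Illustrative stand-in for a purified daily log-return series (not the paper's data)
rng = np.random.default_rng(0)
demo_returns = rng.standard_t(df=3, size=2000) * 0.01
for name, avg_rank, accepted in jameel_rule_i(demo_returns):
    print(f"{name:20s} average rank = {avg_rank:.2f}  accepted = {accepted}")
```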

Stocks Advanced Optimized Activation Functions
The ranking of the Advanced Optimized Activation Functions was obtained using Jameel's Criterion (second version). The author considered eleven (11) out of fifty (50) candidate probability distributions, among them the Fatigue Life (3P) Probability Distribution (5th), the Inv. Gaussian (3P) Probability Distribution (6th), the Dagum (4P) Probability Distribution (7th), and the Lognormal (3P) Probability Distribution (8th). Note that the paper did not check the monotone differentiability of the functions presented above. Market trends are very volatile, and much has happened from 2015 to 2019; the distributions were fitted to time-series data from 1990 to 2014, as shown in Jamilu (2015), so different results may be obtained if the exercise is repeated in 2019. Nevertheless, the paper adopts the fat-tailed probability distributions obtained as of 2015. A hedged sketch of how such a fitted distribution could serve as an activation function follows.
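As an illustration only (this is not the author's code; the simulated returns below merely stand in for the purified stock data, and scipy's fisk family is used as a proxy for the Log-Logistic (3P) distribution), the sketch shows how a fitted fat-tailed CDF could replace the Sigmoid as the squashing activation in the forward pass.

```python
# Hedged sketch: using a fitted fat-tailed CDF as an activation function.
# `demo_returns` is simulated and only stands in for the purified stock data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
demo_returns = rng.standard_t(df=3, size=2500) * 0.01   # stand-in for purified log-returns

c, loc, scale = stats.fisk.fit(demo_returns)            # Log-Logistic (3P) fit

def loglogistic_activation(z):
    """Squashes pre-activations into (0, 1) using the fitted fat-tailed CDF."""
    return stats.fisk.cdf(z, c, loc=loc, scale=scale)

print(loglogistic_activation(np.array([-0.05, 0.0, 0.05])))
```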

Conclusion
The U.S. Congress, professionals, and the general public strongly believe that AI and ML have shortcomings despite their overwhelming advances in the banking and finance industries, electronic trading platforms, medical diagnosis, robot control, entertainment, education, health, and commerce.
To address the fears raised by the U.S. Congress, the paper attempted to increase transparency and stability in terms of the activation functions' output, so as to help policymakers, companies, and decision-makers select the right activation functions using definite rules and thus make well-informed decisions.
To open the "Black Box" of Deep Learning Artificial Neural Network Systems, we need MONOTONE DIFFERENTIABLE FAT-TAILED PROBABILITY DISTRIBUTIONS EMANATED from our REFERENCED AI-ML-Purified Data Set satisfied Jameel's Stochastic ANNAF Criterion.
This will strongly establish a very high correlation between the chosen activation functions and the referenced data, in contrast to the traditional trial-and-error set of assumed activation functions, which are independent of the data.
Research into better activation functions is still ongoing; however, the future of AI and ML lies in the emphasis given to definite rules for choosing the right activation functions, considering low-probability, high-impact effects, that is, attention to fat-tailed effects.

Research Companies and Data Sources
This research paper is an extension of the author's earlier work (Jamilu Auwalu Adamu, 2015), as shown below. Type A:

$$F(x;\mu,\sigma,\xi) = \left[1 + \left(1 + \frac{\xi(x-\mu)}{\sigma}\right)^{-1/\xi}\right]^{-1}, \qquad 1 + \frac{\xi(x-\mu)}{\sigma} > 0,$$

is called the Generalized Log-Logistic or Log-Logistic (3P) Probability Distribution, where $\mu \in \mathbb{R}$ is the location parameter, $\sigma > 0$ the scale parameter and $\xi \in \mathbb{R}$ the shape parameter. The shape parameter $\xi$ is often restricted to lie in $[-1, 1]$, when the probability density function is bounded.
However, in this research work we restrict $\xi$ to three cases only and focus here on $\xi = 1$; the cases $\xi = 0$ and $\xi = -1$ can be tested in subsequent research. A minimal implementation of the Type A CDF is sketched below.
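The following is a minimal implementation of the Type A CDF, assuming the standard shifted (generalized) log-logistic form reconstructed above; the function name and default parameter values are illustrative only.

```python
# Hedged sketch of the Type A (Generalized Log-Logistic) CDF, covering the
# xi != 0 case used in this paper (xi = 1) and the logistic limit xi = 0.
import numpy as np

def gen_loglogistic_cdf(x, mu=0.0, sigma=1.0, xi=1.0):
    z = (np.asarray(x, dtype=float) - mu) / sigma
    if xi == 0.0:
        return 1.0 / (1.0 + np.exp(-z))      # logistic limit as xi -> 0
    t = 1.0 + xi * z
    t = np.where(t > 0.0, t, np.nan)         # points outside the support 1 + xi*z > 0 -> NaN
    return 1.0 / (1.0 + t ** (-1.0 / xi))

print(gen_loglogistic_cdf([-2.0, 0.0, 2.0], mu=0.0, sigma=1.0, xi=1.0))
```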