A nonparametric optimal neural regression tree model for water quality improvement in a paper manufacturing industry



Introduction
This work is motivated by a particular problem in a modern paper mill which produces paper for multiple uses. Paper machines produce paper using pulp, fibers, fillers, chemical lubricants, and a huge amount of water. A boiler produces steam for power generation and also helps to make pulp for paper production. The steam produced in the boiler is used to cook wood chips (along with the cooking chemicals). Steam is also sent to dryer cans to remove from the paper sheet the water that the drainage, vacuum, and mechanical pressing sections of the paper machine cannot remove. The steam from the boilers is also used throughout the mill in heat exchanging, steam-traced piping, stock chests, etc. The boiler stipulates the desired quality of water to be received from the water treatment plant (WTP). In the WTP, the process of demineralization (DM process) is used to remove dissolved solids by the ion exchange process (IEP). The IEP involves two stages of demineralization (Batchelder 1965): the first stage removes cations from the water by the cation exchange process, and the second stage removes anions by the anion exchange process (Lhassani et al. 2001). The DM process outlet pH is the key performance indicator (KPI) of the WTP. It was found that the WTP could not produce water of the quality specified by the boiler. By controlling the water quality, we can address the variation in DM outlet water pH, improve boiler water-tube health, and ultimately improve the paper manufacturing process. An extensive preliminary analysis was conducted to determine a set of possible causal variables that are key to the variation in the water pH level. The aim of this paper is to determine the optimal levels of the important process parameters so that the water pH remains within the standard specified by the boiler.
For this, we develop a regression model that will also be useful for future prediction of water pH based on the important causal parameters. Controlling the water pH will improve the boiler water quality, which is also economical for the company. The formulation of the above-stated problem, along with the data collection plan, is described in Section 2.
Statistical and machine learning tools such as factor analysis, SPC techniques, decision trees, and artificial neural networks have been applied to solve several problems in analyzing river water quality (Singh et al. 2009; Ouyang et al. 2006; Mahuli, Rhinehart, and Riggs 1993) and surface water planning (Gmar et al. 2017; Bhattacharya and Solomatine 2005). Multivariate statistical tools and pattern recognition methods are effective in capturing the temporal and spatial variations in river water quality (Singh et al. 2004; Alberto et al. 2001) and in manufacturing process efficiency. Both decision trees and neural networks (NN) have the ability to model arbitrary decision boundaries. A regression tree (RT), unlike an NN, is more robust when limited data are available. But decision trees are high-variance estimators and greedy in nature (Breiman 2017), whereas neural networks are popular for tackling prediction problems. Advanced neural networks have many free tuning parameters and hence risk over-fitting on small and medium-sized data sets. To utilize the positive aspects of these two powerful models, theoretical frameworks combining both are often used jointly to make decisions. Mapping tree-based models to neural networks allows exploiting the former to initialize the latter. The idea of mapping tree-based models into NNs has been presented in several papers for supervised learning problems (Sethi 1990; Sirat and Nadal 1990). Similar ideas at the interface of decision trees and neural networks can be found in recent literature as well (Chen et al. 2005; Balestriero 2017). The major disadvantages of these algorithms are their many free tuning parameters and thus poor robustness. In spite of the use of neural tree models in practical problems of classification, regression, and forecasting, very little is known about the statistical properties of these models.
In this paper, we obtain upper bounds for the neural regression tree (NRT) model parameters and design an optimal NRT model to solve the regression task of the water quality problem in the paper manufacturing system. Motivated by the above discussion, we propose an optimal NRT model with optimal values of the model parameters. Harnessing the ensemble formulation, we exploit the strengths of the RT and NN models and overcome their drawbacks. The proposed model offers significant accuracy, far fewer tuning parameters, and easier interpretability than more "black-box-like" advanced neural networks. Our model is very useful for small or medium-sized complex data structures, such as the current water quality problem in paper manufacturing industries. Theoretical bounds for the model parameters are derived. In practice, we recommend using the highest value of the model parameters as the optimal values when working with small and medium-sized data sets to achieve the 'best' performance of the model; this justifies the name "optimal NRT model". Finally, the robustness of the proposed algorithm is shown through experimental evaluation.
The paper is structured as follows. Section 2 describes the motivating problem and its formulation. In Section 3, we introduce the optimal neural regression tree model along with the upper bounds of its parameters. Section 4 analyzes the data by the proposed method and compares it with other state-of-the-art models. Finally, concluding remarks are given in Section 5.

Motivating example
In this section, we describe the motivating real-life problem and the data collection plan for our study. The data on pH are taken on a regular basis from the boiler lab. However, when the study was taken up, we considered only the most recent 12 months of data. Measurement of the water pH level at the boiler lab is done with a digital meter which is calibrated once a month using a standard solution. Hence, a formal gage R&R analysis was not necessary to ensure the correctness of the laboratory values. The DM process outlet pH is the KPI of the water treatment plant, which supplies water of the quality specified by the boiler. The water treatment plant comprises two units: fresh water treatment and condensate water treatment. The fresh water treatment plant involves filtration of fresh water obtained from the tube well, followed by treatment of the water with chemicals, making it suitable for the boiler and turbine. Condensate treatment involves filtration and treatment of the condensate water obtained from the paper machines. The quality of the condensate coming from the paper machines is the main constraint of the current problem. Once the filtration of water/condensate is done, it is sent to the demineralization plant, where the water is made suitable for use by the boiler. Details of the types of equipment used in the DM plant processes are given in (Vedelago and Millar 2018), along with the process flow diagram in Figure 1.
• Strong acid cation (SAC) exchanger: extracts the cations from the water by the cation resin and converts them into mineral acids. The vessel is provided with manually operated valves for controlling the flow rates of the inlet and outlet water.
• Strong base anion (SBA) exchanger: an anion exchanger in which a mild-steel vessel is internally lined with rubber to prevent corrosion. An external vessel is also provided with manually operated valves for controlling the flow rates of the inlet and outlet water.
• Mixed bed (MB) unit: comprises a mild-steel rubber-lined pressure vessel. Externally, the unit is provided with piping and valves to control the flow of water during service and the various regeneration stages.
• Activated carbon filters (ACF): used for the removal of chlorine and organic matter from water. Normally, the activated carbon filters are placed downstream of multi-grade filters and need to be regularly back-washed by reversing the flow so as to keep the surface of the carbon particles clean.
Once the water is treated, it is stored in a tank called the DM water storage tank (DMST). The water is pumped from the DMST to the boiler via the DM transfer pump. Chemicals like HCl and caustic soda are used for regeneration of the resins in the SAC, SBA, and MB exchangers. Morpholine is used for adjusting the pH of the DM water. Condensate water is also sent through the mixed bed exchanger for treatment. Transferring DM water with pH in the range 8.5-9.2 is essential to maintain drum water quality and the health of the water tubes running through the boiler. It was found that the variation in the pH of DM water is large, and maintaining a proper pH level is essential for boiler water quality. The original problem can be considered a two-stage problem: 1) developing a prediction model for the outlet water pH of the DM process; 2) using the model and further actions to reduce variation in the pH of DM outlet water and maintain it close to the specified target (i.e., the mid-value of the 8.5-9.2 range). While developing a prediction model for the outlet water pH of the DM process, one needs to identify the important process parameters which together explain a significant amount of the variation present in the system. This work will also have an impact on the amount of chemical usage, the health of the boiler tubes, and the quality of DM outlet water.

Model
In this section, we describe how a pre-trained RT can be reformulated as a two-hidden-layer NN with a similar type of predictive behavior. Neural networks and tree-based models perform well in many real-life engineering problems. The idea of combining decision trees with neural networks was discussed in previous literature, but mainly for classification tasks (Sethi 1990; Sirat and Nadal 1990; Foresti and Micheloni 2002; Chen, Yang, and Abraham 2007). We extend the approach to regression problems and try to bridge the gap between theory and practice by finding upper bounds for the tuning parameters of the model. We present an example of a mapping from a tree-based model to a two-hidden-layer neural network in Figure 2: a regression tree (left) and its corresponding NN structure (right). The circle nodes in the tree are split nodes and the square nodes are leaf nodes. The path to the green shaded leaf (4) consists of all red nodes (0, 1, 3). Numbers in neurons correspond to node numbers in the tree model. The highlighted connections in the network are those relevant for the activity of the green neuron and its output value.
We are given a training sample D_n = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} with n observations on p independent variables. The neural regression tree (NRT) is a nonparametric regression model in which p input features X ∈ C^p = [0, 1]^p are observed and we try to predict a square-integrable output Y ∈ R. A regression tree is a regression function estimate that uses a hierarchical, axes-parallel split of the input space, where each tree node corresponds to one of the segmentation subsets in C^p. For simplicity and easy interpretability, consider an ordinary binary RT in which a node has exactly two child nodes or zero child nodes (leaf nodes). A regression tree consists of split nodes (for example, x^(i) ≥ α for some i ∈ {1, 2, ..., p} and some α ∈ C) and leaf nodes. The feature space C^p is thus partitioned into axes-parallel hyper-rectangles. The standard splitting criterion MSE (mean squared error) has been used here to grow the RT. During prediction, the input is first passed into the tree root node; it is then iteratively transmitted to the child node whose subspace contains the input, and this is repeated until a leaf node is reached. If a leaf represents region R, then the natural regression function estimate takes the simple form

t_n(x) = (1 / N_n(R)) Σ_{i : X_i ∈ R} Y_i,  for x ∈ R,

where N_n(R) is the number of observations in cell R (with the convention 0/0 = 0). In other words, the prediction for a query point x in leaf node R is the average of the Y_i of all training instances that fall into region R. Assume for now that we have at hand a regression tree t_n (whose construction eventually depends on the data D_n), which takes constant values on each of k ≥ 2 terminal nodes. It turns out that this estimate may be reinterpreted as a two-hidden-layer neural network, as summarized below. Let HL1 = {H_1, ..., H_{k−1}} be the collection of all hyperplanes participating in the construction of t_n.
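As a quick sanity check of this leaf-averaging rule, the sketch below (with toy data standing in for the plant measurements, not the paper's data set) verifies that a fitted scikit-learn regression tree predicts exactly the mean of the training responses in the query point's leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data in [0, 1]^p standing in for the actual process measurements
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)

# Ordinary binary regression tree grown with the MSE splitting criterion
tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

# t_n(x) is the average of the Y_i whose X_i fall in the same leaf as x
x = X[:1]
same_leaf = tree.apply(X) == tree.apply(x)[0]
assert np.isclose(tree.predict(x)[0], y[same_leaf].mean())
```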
We note that each H_{k′} ∈ HL1 is of the form H_{k′} = {x ∈ C^p : x^(i_{k′}) = α_{k′}} for some coordinate index i_{k′} ∈ {1, ..., p} and some threshold α_{k′}. To reach the leaf of the query point x, we find, for each hyperplane H_{k′}, the side on which x falls (+1 for right and −1 for left). Using the above notation, the tree estimate t_n is mapped to the neural network as discussed below. An example is also presented in Figure 2, with a description, for a clear understanding of the neural regression tree model.
Designing the first hidden layer (HL1). The input layer (IL) supplies the features to the first hidden layer, which corresponds to k − 1 perceptrons with the threshold activation function τ(u) = +1 if u ≥ 0 and τ(u) = −1 otherwise, applied to h_{k′}(x) = x^(i_{k′}) − α_{i_{k′}}. So, for each split in the tree, there is a neuron in HL1 whose activity encodes the relative position of an input x with respect to the concerned split. The output of the first layer is the ±1-vector (τ(h_1(x)), ..., τ(h_{k−1}(x))), which describes all decisions of the inner tree nodes (including the nodes that are not on the path of x). It is important to remember that each neuron k′ of this layer is connected to one and only one input x^(i_{k′}), and that the connection has weight 1 and offset −α_{i_{k′}}.
Designing the second hidden layer (HL2). HL1 outputs a (k − 1)-dimensional vector of ±1-bits that encodes the precise position of x in the leaves of the tree. The leaf node identity of x can be extracted from this vector using a weighted combination of the bits together with an appropriate threshold function. The second hidden layer has k neurons, one for each leaf, and assigns a terminal cell to x. Let HL2 = {L_1, ..., L_k} be the collection of all tree leaves, and let L(x) be the leaf containing x. We connect a unit k′ from HL1 to a unit k″ from HL2 iff the hyperplane H_{k′} is involved in the sequence of splits forming the path from the root to the leaf L_{k″}. The connection has weight +1 if the split by H_{k′} goes from a node to a right child on that path, and −1 otherwise. If (u_1(x), ..., u_{k−1}(x)) is the vector of ±1-bits seen at the output of HL1, then the output of unit k″ is

v_{k″}(x) = τ( Σ_{k′→k″} b_{k′,k″} u_{k′}(x) − l(k″) + 1/2 ),  (1)

where the sum extends over the units k′ of HL1 connected to k″, b_{k′,k″} = ±1 are the connection weights just described, and l(k″) is the length of the path from the root to L_{k″}. To understand the intuition behind the choice (1), note that there are exactly l(k″) connections starting from the first layer and pointing to k″, and each contributes +1 precisely when the corresponding split decision for x agrees with the path to L_{k″}. Using (1), the argument of the threshold function is 1/2 if x ∈ L_{k″} and is smaller than −1/2 otherwise. Hence v_{k″}(x) = 1 iff the terminal cell of x is L_{k″}. To summarize, HL2 outputs a vector of ±1-bits (v_1(x), ..., v_k(x)) whose components equal −1 except the one corresponding to the leaf L(x), which is +1.
Output layer (OL). Let (v_1(x), ..., v_k(x)) be the output of HL2. If v_{k″}(x) = 1, then the output layer computes the average Ȳ_{k″} of the Y_i for which X_i falls in L_{k″}. This is equivalent to taking

t_n(x) = Σ_{k″=1}^{k} ((1 + v_{k″}(x)) / 2) Ȳ_{k″}.

Optimal Neural Regression Tree
We have already described how a given RT can be mapped to a two-hidden-layer (2HL) NN model. We now summarize the functionality of the neural regression tree (NRT) model (see Figure 2) as follows:
• An RT is built with (k_n − 1) split nodes and k_n leaf nodes, and is mapped into a 2HL NN model with (k_n − 1) and k_n hidden neurons in HL1 and HL2, respectively. Since the value of k depends on n, we use k_n instead of k in the rest of the paper.
• In HL1, the neurons compute all the tree split decisions and indicate the split directions for the inputs; HL1 then passes this information to HL2.
• A probabilistic interpretation of the network output can be obtained by interpreting the activation functions in the hidden layers.
The tree estimate t n , depending on D n , can be seen as a neural network estimate. The architecture of this network is fixed, and so are the weights and offsets of the three layers. A natural idea is then to keep the structure of the network intact and let the parameters vary in a subsequent network training procedure with backpropagation training. In other words, once the connections between the neurons have been designed by the tree-to-network mapping, we can learn network parameters in a better way by minimizing the empirical MSE for this network over the sample D n .
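As a concrete illustration of the tree-to-network mapping described above, the sketch below extracts the split structure from a fitted scikit-learn tree and assembles the two hidden layers; the helper name `tree_to_network` and the toy data are ours, not the paper's implementation, and the subsequent TensorFlow fine-tuning step is omitted so the snippet stays self-contained:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_to_network(fitted_tree, p):
    """Map a fitted binary regression tree to the two-hidden-layer network:
    HL1 encodes split decisions, HL2 encodes leaf membership, and the
    output weights are the leaf averages."""
    t = fitted_tree.tree_
    splits = [i for i in range(t.node_count) if t.children_left[i] != -1]
    leaves = [i for i in range(t.node_count) if t.children_left[i] == -1]
    col = {s: j for j, s in enumerate(splits)}

    # HL1: one neuron per split, weight 1 on its feature, offset -alpha
    W1 = np.zeros((p, len(splits)))
    b1 = np.zeros(len(splits))
    for j, s in enumerate(splits):
        W1[t.feature[s], j] = 1.0
        b1[j] = -t.threshold[s]

    # Record each node's parent and whether it is a left (-1) or right (+1) child
    parent = {}
    for s in splits:
        parent[t.children_left[s]] = (s, -1.0)
        parent[t.children_right[s]] = (s, +1.0)

    # HL2: one neuron per leaf, +/-1 weights along the root-to-leaf path,
    # offset -l(k'') + 1/2 where l(k'') is the path length (Eq. (1))
    W2 = np.zeros((len(splits), len(leaves)))
    b2 = np.zeros(len(leaves))
    for k, leaf in enumerate(leaves):
        node, length = leaf, 0
        while node in parent:
            s, side = parent[node]
            W2[col[s], k] = side
            node, length = s, length + 1
        b2[k] = -length + 0.5

    w_out = np.array([t.value[leaf].item() for leaf in leaves])  # leaf averages
    return W1, b1, W2, b2, w_out

tau = lambda u: np.where(u > 0, 1.0, -1.0)  # threshold activation

# Toy data; the mapped network must reproduce the tree's predictions exactly
rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = X[:, 0] - X[:, 2] ** 2 + 0.1 * rng.standard_normal(300)
reg_tree = DecisionTreeRegressor(max_leaf_nodes=10, random_state=0).fit(X, y)

W1, b1, W2, b2, w_out = tree_to_network(reg_tree, p=3)
u = tau(X @ W1 + b1)                  # HL1: split decisions as +/-1 bits
v = tau(u @ W2 + b2)                  # HL2: +1 exactly at the leaf of x
net_pred = ((v + 1.0) / 2.0) @ w_out  # OL: average of the Y_i in that leaf
assert np.allclose(net_pred, reg_tree.predict(X))
```

Starting backpropagation from these weights (with the smooth activations of the next subsection) is the initialization step the mapping enables.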
We propose an optimal NRT model in which we theoretically find the upper bounds of these free parameters, which are the optimal values of the parameters for small and medium sample-sized data sets. In order to increase the generalization capability of the NRT model, we replace the original relay-type activation function τ(u) with a hyperbolic tangent activation function σ(u) := tanh(u), whose range is −1 to 1. More precisely, in the optimal NRT model we use σ_1(u) = σ(β_1 u) at every neuron of the first hidden layer and σ_2(u) = σ(β_2 u) at every neuron of the second hidden layer. Here, β_1 and β_2 are positive hyper-parameters that determine the contrast of the hyperbolic tangent activation: the larger the parameters β_1 and β_2, the sharper the transition from −1 to 1. As β_1 and β_2 approach infinity, the continuous functions σ_1 and σ_2 converge to the threshold function. We call our model optimal NRT in the sense that the optimal values (upper bounds in this case) of the tuning parameters k_n, β_1, and β_2 are derived (see Section 3.2). Besides eventually providing better generalization, the hyperbolic tangent activation functions favor smoother decision boundaries and permit a relaxation of crisp tree node membership. Lastly, they allow operation with a smooth approximation of the discontinuous step activation function. This makes the loss function of the network differentiable with respect to the parameters almost everywhere, so gradients can be backpropagated to train the model.

Upper bound of model parameters
Let us consider the tree in the ensemble and denote by G_1 ≡ G_1(D_n) the bipartite graph which models the connections between the input vector x = (x^(1), ..., x^(p)) and the k_n − 1 hidden neurons of HL1. Similarly, let G_2 ≡ G_2(D_n) be the bipartite graph representing the connections between the first layer and the k_n hidden neurons of HL2. Let M(G_1) be the set of p × (k_n − 1) matrices A = (a_ij) such that a_ij = 0 if (i, j) ∉ G_1. Also let M(G_2) be the set of (k_n − 1) × k_n matrices B = (b_ij) such that b_ij = 0 if (i, j) ∉ G_2. The parameters that specify the first hidden units are contained in a matrix A of M(G_1), identified by the weights over the edges of G_1, and a column vector b_1 of biases of size k_n − 1. Similarly, the parameters of the second hidden units are represented by a matrix B of M(G_2) of weights over G_2 and a column vector b_2 of offsets of size k_n. Let the output weights and offset be W_out = (w_1, ..., w_{k_n})^T ∈ R^{k_n} and b_out ∈ R, respectively. The parameters that specify the NRT model are then collected in the "vector"

λ = (A, b_1, B, b_2, W_out, b_out).

We further assume that there exists a positive constant c_1 such that

max( ||B||_∞, ||b_2||_∞, ||W_out||_1, |b_out| ) ≤ c_1 k_n,  (4)

where ||·||_∞ is the supremum norm of a matrix (or vector) and ||·||_1 is the L_1-norm of a vector. The rationale behind assumption (4) is that it bounds the weights and offsets taken by the computation units of the second layer and the output layer. We note that this condition is satisfied by the original tree estimates as soon as Y is assumed to be bounded. Finally, we assume that ||Y||_∞ ≤ L < ∞ almost surely, for some L.
Therefore, letting Λ_n = {λ = (A, b_1, B, b_2, W_out, b_out) : (4) is satisfied}, we see that the neural network implements functions of the form

f_λ(x) = W_out^T σ_2( B^T σ_1( A^T x + b_1 ) + b_2 ) + b_out,  λ ∈ Λ_n,

where σ_1 and σ_2 are applied componentwise. Our aim is to tune the parameters λ using the data D_n such that the function realized by the resulting network becomes a 'good' estimate that minimizes the empirical error. Let F_n = {f_λ : λ ∈ Λ_n} be the class of such neural networks and let m_n be the network that minimizes the empirical L_2 error

J_n(f) = (1/n) Σ_{i=1}^{n} (f(X_i) − Y_i)^2

over all functions f ∈ F_n, i.e., J_n(m_n) ≤ J_n(f) for all f ∈ F_n. Here F_n is a rich class of functions, including additive functions, polynomial functions having coefficients of the same sign, products of continuous functions, etc. Thus, we seek an estimate m_n : C^p → R of the regression function m(x) = E[Y | X = x]. We say m_n is consistent if E[m_n(X) − m(X)]^2 tends to 0 as n → ∞ (the expectation is evaluated over X and the training sample D_n). Using (Györfi et al. 2006, Lemma 10.1), we can write

E ∫ |m_n − m|^2 dμ ≤ 2 E[ sup_{f ∈ F_n} | J_n(f) − E(f(X) − Y)^2 | ] + inf_{f ∈ F_n} ∫ |f − m|^2 dμ,  (5)

where μ denotes the distribution of X. To find the upper bounds of the model parameters, we need to show that the estimation error (the first term on the R.H.S. of Eqn. 5) and the approximation error (the second term on the R.H.S. of Eqn. 5) tend to 0. The former is proved using non-asymptotic uniform deviation inequalities and covering numbers corresponding to F_n, as shown in Theorem 3.1. The approximation error is handled using a pseudo-estimate similar to the RT-generated t_n and the Lipschitz property of the activation function of the NRT model, as shown in Theorem 3.2. Throughout the paper, we assume X is uniformly distributed in C^p and ||Y||_∞ ≤ L < ∞ almost surely, for some L.
The next two theorems state that, with certain restrictions imposed on the number k_n of terminal nodes and with the parameters β_1, β_2 properly regulated as functions of n, empirical L_2 risk minimization provides upper bounds for the NRT model parameters. The larger the values of k_n, β_1, and β_2, the better the model; that is why we call the upper bounds of the model parameters the optimal values of the NRT model. It is important to note that the depth of the tree and β_1, β_2 are controlled for risk minimization.
Proof. See Appendix 1.
Proof. See Appendix 2.
Hence, the optimal choices of parameters obtained by empirical risk minimization for the optimal NRT model are (a) k_n = o(n^(1/6)), (b) β_1 = o(n^(2/3)), and (c) β_2 = o(log(n)), for practical use in regression problems with small and medium-sized datasets.
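In practice, these rates can be turned into concrete starting values for a given sample size; the sketch below does so with a proportionality constant c that is a hypothetical choice of ours, not prescribed by the theory (the bounds only fix the growth rates):

```python
import math

def nrt_parameter_choices(n, c=1.0):
    """Practical parameter values suggested by the rates above:
    k_n = o(n^(1/6)), beta_1 = o(n^(2/3)), beta_2 = o(log n).
    The constant c is a hypothetical scaling, not from the paper."""
    k_n = max(2, int(c * n ** (1.0 / 6.0)))  # at least two terminal nodes
    beta_1 = c * n ** (2.0 / 3.0)
    beta_2 = c * math.log(n)
    return k_n, beta_1, beta_2

k_n, beta_1, beta_2 = nrt_parameter_choices(1000)
assert beta_2 < beta_1  # consistent with the ordering discussed in Section 4
```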

Data
To identify the causal parameters behind the variation in the DM water outlet pH, a brainstorming session was first held with the process experts. Since not all parameters are controllable by the users of the processes, a list of controllable parameters was prepared and data were collected accordingly for the two processes (as shown in Figure 1). The data set, collected from the process over a year, consists of observations on the following causal variables: inlet flow, water pressure (water inlet pressure to the exchanger), air pressure, MB stroke, and amount of morpholine/chemical dosing (liters per hour). The values of the response variable (DM water pH) vary between 7 and 10 (refer to Figure 3 for a graphical summary). Sample datasets for DMST-1 and DMST-2 are given in Tables 1 and 2. These data sets will be used to find a prediction model that can help the company forecast future water pH levels and take further steps to reduce water pH variation.

Performance metrics
The metrics used in this study to evaluate the performance of the different regression models (including the proposed model) are the root mean square error (RMSE), mean absolute percentage error (MAPE), coefficient of multiple determination (R^2), and adjusted R^2 (Adj R^2):

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 ),
MAPE = (100/n) Σ_{i=1}^{n} | (y_i − ŷ_i) / y_i |,
R^2 = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)^2 / Σ_{i=1}^{n} (y_i − ȳ)^2,
Adj R^2 = 1 − (1 − R^2)(n − 1)/(n − k − 1),

where y_i, ȳ, and ŷ_i denote the actual value, the average value, and the predicted value of the dependent variable, respectively, for the i-th instance. Here n and k denote the number of data points and the number of independent variables used for performance evaluation, respectively. The lower the values of RMSE and MAPE, and the higher the values of R^2 and Adj R^2, the better the model.
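These four metrics can be computed directly from the definitions above; a minimal sketch (the function name and dictionary keys are ours):

```python
import numpy as np

def regression_metrics(y_true, y_pred, k):
    """RMSE, MAPE, R^2 and adjusted R^2 as defined above
    (n = number of points, k = number of independent variables)."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    n = len(y_true)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    mape = float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = float(1.0 - ss_res / ss_tot)
    adj_r2 = float(1.0 - (1.0 - r2) * (n - 1) / (n - k - 1))
    return {"RMSE": rmse, "MAPE": mape, "R2": r2, "AdjR2": adj_r2}

# Illustrative pH-like values (not the plant data)
m = regression_metrics([8.6, 8.8, 9.0, 9.2, 8.9], [8.5, 8.9, 9.0, 9.1, 9.0], k=3)
```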

Preliminary data analysis
We started the analysis of the data with the Anderson-Darling normality test on the DM water outlet pH, which confirmed that the dependent variable does not follow a normal distribution (see Figure 3). The absence of normality in the DM water outlet pH data rules out conventional parametric regression methods, which led us to nonparametric regression approaches.
The stability of the process was checked using an X̄-R control chart (see Figure 4). From the X̄-R chart, it can easily be concluded that the DM water outlet pH has high variation, which also indicates the presence of assignable causes in the process.
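For reference, the X̄-R limits behind such a chart can be computed as in the sketch below; the subgrouping, synthetic readings, and helper name are ours (the paper's Figure 4 was presumably produced with a standard SPC tool), while A2, D3, D4 are the standard published control-chart constants for subgroups of size 5:

```python
import numpy as np

# Standard control-chart constants for subgroups of size 5
A2, D3, D4 = 0.577, 0.0, 2.114

def xbar_r_limits(values, subgroup_size=5):
    """Xbar-R control limits (LCL, center, UCL) for readings grouped
    into rational subgroups. Illustrative sketch only."""
    m = len(values) // subgroup_size
    groups = np.asarray(values[: m * subgroup_size], float).reshape(m, subgroup_size)
    xbar = groups.mean(axis=1)                      # subgroup means
    rngs = groups.max(axis=1) - groups.min(axis=1)  # subgroup ranges
    xbarbar, rbar = xbar.mean(), rngs.mean()
    return {
        "xbar": (xbarbar - A2 * rbar, xbarbar, xbarbar + A2 * rbar),
        "r": (D3 * rbar, rbar, D4 * rbar),
    }

# Synthetic pH readings around the 8.5-9.2 target band (not the plant data)
rng = np.random.default_rng(7)
ph = 8.85 + 0.15 * rng.standard_normal(100)
limits = xbar_r_limits(ph)
```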
The dataset contains only numerical features and no missing entries, so no data cleaning was performed. We only transformed the feature values into standardized form to achieve consistent results with the NN and NRT models. The standardization is done using min-max scaling,

Z_i = (Y_i − Y_min) / (Y_max − Y_min),

so that the values lie between 0 and 1. As the output of the networks will also lie between 0 and 1, "logsig" is used as the output transfer function, and the prediction is brought back to the original scale at the end of the modeling by inverting this transformation,

Y_i^pred = Z_i (Y_max − Y_min) + Y_min,

where Z_i is the standardized output within 0 and 1 and Y_i^pred is the predicted output of the model.
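The scaling and its inverse can be sketched as follows (a minimal min-max implementation consistent with the "within 0 and 1" description above; function names are ours):

```python
import numpy as np

def minmax_scale(y):
    """Map values into [0, 1]; also return (min, max) so that model
    predictions can be transformed back to the original scale."""
    y = np.asarray(y, float)
    lo, hi = y.min(), y.max()
    return (y - lo) / (hi - lo), (lo, hi)

def inverse_scale(z, lo, hi):
    # Bring the network output in [0, 1] back to the original pH scale
    return np.asarray(z, float) * (hi - lo) + lo

ph = np.array([7.2, 8.6, 9.1, 9.9])       # illustrative pH values
z, (lo, hi) = minmax_scale(ph)
assert np.allclose(inverse_scale(z, lo, hi), ph)  # round-trip recovers the data
```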

Results
Since the data sets don't satisfy the basic assumptions of parametric regression models, we decided to use nonparametric regression models. We shuffled the observations of the water pH dataset randomly and split it into training and testing sets in a ratio of 70:30. Each experiment was repeated 5 times with different randomly assigned training and test sets, and we report the averages of the performance metrics over the 5 validation runs in Table 3. A few popular nonparametric regression models, namely k-nearest neighbor (kNN), support vector regression (SVR), regression splines (B-splines), multivariate adaptive regression splines (MARS), RT, and NN with 2 hidden layers (NN with 2HL), were applied to the data sets and the results were recorded. Then we started experimenting with our proposed model. The training procedure for the optimal NRT model is as follows. The RT is first built using the scikit-learn implementation (Pedregosa et al. 2011). From the tree, we extracted the set of all split directions and split positions and used them to build the neural network initialization parameters. The NRT models are then trained using the TensorFlow framework (Abadi et al. 2016). The optimization of the network model is done by minimizing the MSE on the training set, using an iterative gradient-descent optimization technique; we used the default functions available in TensorFlow for this. Each neural network (including NRT) model was trained for 100 epochs. From the theoretical results on the upper bounds of the model parameters, we can easily find the values of the initial contrast parameters of the activation functions of the first and second hidden layers of the model. The value of β_2 will be lower than β_1, which is confirmed by the theoretical bounds presented in Section 3.2.
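The evaluation protocol above can be sketched as follows, with synthetic stand-in data and a plain RT in place of the full tree-to-TensorFlow NRT pipeline, so the snippet stays self-contained:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Illustrative stand-in for the water-pH data (5 causal variables, pH target)
rng = np.random.default_rng(1)
X = rng.random((300, 5))
y = 7.0 + 3.0 * X[:, 0] * X[:, 3] + 0.1 * rng.standard_normal(300)

# 5 repetitions of a random 70:30 train/test split, averaging the metric
rmses = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = DecisionTreeRegressor(max_leaf_nodes=8, random_state=seed).fit(X_tr, y_tr)
    rmses.append(mean_squared_error(y_te, model.predict(X_te)) ** 0.5)
print(f"average RMSE over 5 splits: {np.mean(rmses):.3f}")
```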
This is of practical significance, since for a relatively small β_2 the transition of the activation function from −1 to +1 is smoother, and a stronger gradient signal reaches the first hidden layer during backpropagation training. A converse explanation holds for β_1. In all experiments, we used β_1 = o(n^(2/3)) and β_2 = o(log n), where n is the number of training samples. Another property of the data set is that it is small, and here tree-based models perform better than NNs, since the latter typically need plentiful training data to perform well. For the two data sets given in Section 4.1, a standard NN (with 2HL) does not achieve performance comparable to the proposed optimal NRT model. Similarly, neural decision trees (NDT) were also implemented to assess their performance on the water pH data sets. Since an NDT has many free tuning parameters, its training time and memory requirement are quite high compared to the proposed model. Our proposed optimal NRT model is faster, especially when trained on a GPU. Table 3 presents the results of the different regression models on the water pH data sets, with the best results displayed in bold. From the experimental evaluation of the different regression models, it can be concluded that our proposed optimal NRT model outperforms the state-of-the-art models.

Recommendation
The proposed optimal NRT model will be useful for forecasting future values of water pH given the values of the process variables. This will help the management and engineers take preventive actions in the future. However, our objective is not only to develop a competitive prediction model for the water pH level in the DM process but also to find the optimal levels of the process parameters (in other words, the causal variables) using statistical modeling. To keep the DM water pH in the range 8.5-9.2 as specified by the boiler manual, the recommendation of our model is to maintain MB strokes, water pressure, and chemical consumption within the specified ranges shown in Table 4. Inlet flow can't be controlled by the user of the process, and air pressure need not be controlled as far as the recommendation of the proposed model goes. Table 4 gives the optimum ranges of the controllable parameters for DMST 1 and DMST 2 based on the regression analysis performed by the optimal NRT model; its entries are:

DMST 1: 5.0-6.0 | 45-60 | 6.5-7.5 | 8.5-9.0
DMST 2: 5.0-6.0 | 40-55 | 7.5-9.5 | 8.5-9.1

Table 4 depicts a range for each of the important process variables for which the expected DM water outlet pH will be within the required specification, i.e., 8.5-9.2. However, in order to find the exact values of the process parameters within the ranges prescribed in Table 4, a design of experiments would be necessary, which was subsequently carried out to solve the problem. Though we have discussed only the proposed model and its accuracy, our model also helped the manufacturing process industry improve its water quality and gain monetary benefits through reduced chemical consumption.

Conclusions
This paper proposes a novel machine learning paradigm to solve a process efficiency problem in a paper manufacturing industry. The purpose of this article is to develop a model for reducing the variation of the water pH level in the DM process of a manufacturing industry. Our study proposed an optimal NRT model that maps an RT into a 2HL NN model, with theoretically obtained upper bounds on the parameters of the model. Consequently, the proposed ensemble model successfully demonstrated the best performance and offered a practical solution to the problem of finding optimal levels of the tuning parameters to improve water quality in the DM process. The major advantages of the proposed optimal NRT model are that it has very few tuning parameters, is easily interpretable compared to "black-box-like" advanced NN models, and is theoretically sound compared to many similar algorithms. It has been shown theoretically that there is some gain in considering 2HL in an NN, but it is not really necessary to go beyond 2HL (Devroye, Györfi, and Lugosi 2013). The proposed ensemble model has an edge when working with limited data sets, where the benefits of neural optimization can usually not be exploited but, with the specific inductive bias of the optimal NRT, become accessible. This also answers the question one may ask about the need for a two-step pipeline (like the optimal NRT model) over advanced NN models.
Appendix 1. Proof of Theorem 3.1
The idea of the proof is based on Theorem 16.1 of (Györfi et al. 2006). The set F_n contains all neural networks constrained by Eqn. (4) of Section 3.2, with inputs in R^p, two hidden layers of respective sizes k_n − 1 and k_n, and one output unit. We note that F_n is deterministic, in the sense that it does not depend on D_n but only on the size n of the original data set. We derive a useful exponential inequality, to prove Theorem 3.1, using uniformly bounded classes of functions. We have assumed that each f ∈ F_n satisfies ||f||_∞ ≤ c_1 k_n and that Y is bounded (||Y||_∞ ≤ L < ∞).
Let z_1^n = (z_1, ..., z_n) be a vector of n fixed points in R^p and let H be a set of functions from R^p to R. For every ε > 0, let N_1(ε, H, z_1^n) be the L_1 ε-covering number of H with respect to z_1, ..., z_n. N_1(ε, H, z_1^n) is defined as the smallest integer N such that there exist functions h_1, ..., h_N : R^p → R with the property that for every h ∈ H there is a j ∈ {1, ..., N} such that

(1/n) Σ_{i=1}^{n} |h(z_i) − h_j(z_i)| < ε.

Note that if Z_1^n = (Z_1, ..., Z_n) is a sequence of i.i.d. random variables, then N_1(ε, H, Z_1^n) is a random variable as well.
As ε → 0, the estimation error converges to 0, and the proof is complete.
To find the upper bounds of I_1 and I_2, we will use the following property: recall that σ_i is the hyperbolic tangent activation function and τ is the threshold activation function; then for all u ∈ R, |σ_i(u) − τ(u)| ≤ 2e^{−2β_i|u|}, for i = 1, 2.
The R.H.S. of (7) tends to 0 if the conditions of Theorem 3.2 hold, which completes the proof.