Oil Family Typing Using a Hybrid Model of Self-Organizing Map and Artificial Neural Network

Identifying the number of oil families in a petroleum basin provides practical and valuable information for petroleum geochemistry studies, from exploration to development. Oil family grouping helps to track migration pathways, identify the number of active source rocks, and examine reservoir continuity. To date, almost all oil family typing studies have relied on common statistical methods such as principal component analysis (PCA) and hierarchical clustering analysis (HCA), and no publication has used artificial neural networks (ANNs) to examine oil families in petroleum basins. Oil family typing would therefore benefit from techniques beyond these common methods. This paper is the first report of oil family typing using ANNs as robust computational methods. To this end, a self-organizing map (SOM) neural network, together with three clustering validity indices, was applied to oil samples from oilfields in the Iranian part of the Persian Gulf. For the SOM network, ten default clusters were selected at first. Afterwards, three clustering validity coefficients, namely the Calinski-Harabasz (CH), Silhouette (SI), and Davies-Bouldin (DB) indices, were used to find the optimum number of clusters. Among the ten default clusters, the maximum CH (62) and SI (0.58) were obtained for four clusters. Likewise, the lowest DB (0.8) was obtained for four clusters. Thus, all three validation coefficients indicated four clusters as the optimum number of clusters, or oil families. The number of oil families identified in the present report is consistent with those previously reported by other researchers in the same study area. However, the techniques used in the present paper, which have not been implemented so far for this purpose, can be regarded as more straightforward for clustering in oil family typing than the common PCA and HCA methods.


Introduction
Identifying the relationships between oil samples and grouping them, known as oil family classification, is a part of petroleum system studies and plays a paramount role in various aspects of the oil industry, including exploration and development. The primary outcomes of oil family typing are detecting migration pathways and evaluating the continuity between different oil reservoirs 1.
Geochemists have long used the statistical techniques PCA and HCA to group oil families in petroleum basins 2,3. However, artificial intelligence (AI) and machine learning (ML) systems are developing steadily and provide various applications for scientists [4][5][6][7][8], and petroleum geochemists are no exception. AI and ML techniques have been widely used in petroleum-related studies in recent years. Amar et al 9 used ML approaches to model oil-brine interfacial tension at high pressure and high salinity conditions. Mazloom et al 10 used AI algorithms to estimate asphaltene adsorption in nanocomposites. Rostami et al 11 utilized ANNs for predicting natural gas viscosity. Mokarizadeh et al 5 implemented ANNs and ML algorithms to determine the solubility of SO2 in ionic liquids. Hemmati-Sarapardeh et al 12 modeled natural gas compressibility using a kind of ANN. Amooie et al 13 took advantage of ML methods for geological carbon storage studies. Menad et al 4 estimated the solubility of CO2 in brine via advanced ML techniques. Razghandi et al 14 predicted under-saturated crude oil viscosity with ML algorithms. Bolandi et al 15 evaluated source rock characteristics with ML methods. Bolandi et al 16 studied the organic facies of source rocks by combining ML and ANNs.
Tabatabaei et al 17 utilized an ML algorithm to estimate total organic carbon (TOC) from well log data. Naghizadeh et al 18 estimated the viscosity of CO2-N2 gaseous mixtures with smart ML models. Kadkhodaie-Ilkhchi et al 19 integrated individual smart ML models with a committee machine intelligent system to approximate TOC from petrophysical well logs. Ghiasi-Freez et al 20 used committee machines to predict permeability from petrographic image analysis. Tohidi-Hosseini et al 6 predicted the solution gas-oil ratio via a robust ML system. Esfahani et al 21 implemented ML paradigms to determine natural gas density. Hajirezaie et al 22 employed a powerful ML algorithm to estimate under-saturated reservoir oil viscosity. Karkevandi-Talkhooncheh et al 23 used the adaptive neuro-fuzzy inference system optimized with evolutionary algorithms to model the CO2-crude oil minimum miscibility pressure. Barati-Harooni et al 24 employed different ML and AI frameworks to predict the minimum miscibility pressure (MMP) in the enhanced oil recovery (EOR) process by N2 flooding. Amiri-Ramsheh et al 25 modeled the wax disappearance temperature (WDT) using different AI and ML methods. Mohammadi et al 26 employed a powerful ML technique to model hydrogen solubility in hydrocarbons. Moosanezhad-Kermani et al 27 employed a kind of ANN to model carbon dioxide solubility in ionic liquids. Rezaei et al 28 implemented a radial basis function neural network with evolutionary algorithms to model gas viscosity at high pressure and high temperature conditions. Khamehchi et al 29 utilized diverse ML and AI systems to model the viscosity of light and intermediate dead oil systems. In addition to the studies mentioned, researchers have recently used AI and ML for organic geochemistry purposes. For example, Safaei-Farouji and Kadkhodaie 30 used AI and ML methods to estimate kerogen type from petrophysical well logs. Collectively, even though AI and ML methods have been used in various petroleum-related fields, oil family typing using an artificial neural network is missing. ANNs have various applications, one of which is clustering [31][32][33]. Therefore, oil family grouping, as a kind of clustering problem, can be solved via ANNs.
The SOM is an artificial neural network 34 that maps multidimensional data to a two-dimensional space. This space is created with the help of a competitive and unsupervised learning process. The SOM neural network preserves the topological properties of the input space by utilizing a neighborhood function. In effect, the resulting map illustrates the relationships between input patterns 35,36.
The primary use of the SOM is clustering and other types of unsupervised classification 35,36. So far, only a limited set of common statistical methods, such as PCA and HCA, has been used for oil family grouping, and the use of artificial neural networks is entirely missing. Rabbani et al 2 geochemically analyzed thirty-three oil samples from several oil fields in the Iranian sector of the Persian Gulf. They defined four main oil families through the statistical methods of PCA and HCA. Mashhadi and Rabbani 37 also geochemically investigated twenty oil samples from oil fields in the Iranian part of the Persian Gulf. They identified two distinct genetic oil families using PCA. In another study, Hosseini et al 3 identified two different oil families by applying HCA to fourteen oil samples from the eastern Iranian sector of the Persian Gulf.
Petroleum geochemistry studies of the examined area have been conducted by previous researchers 2,3,37; correspondingly, in the present paper, we focus on using a SOM neural network as a novel paradigm to determine oil families in the region. Indeed, the present study enables us to relate our outcomes to previously published works in the study area while using a larger dataset and introducing a new method for oil family typing.
This introduction has outlined the method used and recent related works. The second part of the paper is devoted to data preparation and methodology. The obtained results are then discussed in the third section. Finally, the last part of the study summarizes the findings.

Materials and Methods
In total, 60 oil samples were collected from the literature 2,3,37. These samples belong to different oilfields in the Iranian part of the Persian Gulf. The Gulf and its coastal regions are home to about two-thirds of the world's proven oil reserves (715 billion barrels) 38. The examined oilfields include Dorood, Kharg, Aboozar, Foroozan, Salman, Resalat, Reshadat, Balal, Bahregansar, Souroush, Nowrouz, Sirri A, Sirri C, Sirri D, and Sirri E. The location map of the studied oilfields is given in Figure 1. Also, the detailed geochemical and biomarker analyses of the studied crude oil samples can be found in Hosseiny et al 3

2.1. Principal component analysis
The first stage in this study was to use PCA to reduce the dimensionality of the data. Since sixteen different geochemical and biomarker parameters were used as inputs, the number of dimensions had to be reduced so that the data and results could be displayed graphically 39,40. Accordingly, the data dimensions (components) were reduced from sixteen to three using PCA.
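A minimal sketch of this reduction step is given below, assuming NumPy is available. The function name `pca_reduce`, the standardization of each parameter, and the synthetic 60 × 16 matrix are assumptions made for illustration only; the text does not state how the inputs were scaled or which software was used.

```python
import numpy as np


def pca_reduce(X, n_components=3):
    """Project the rows of X onto the first n_components principal components."""
    X = np.asarray(X, dtype=float)
    # Assumption: every parameter is standardized before PCA.
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


# Synthetic stand-in for 60 samples x 16 parameters (not the study data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 16))
scores, explained = pca_reduce(X_demo, n_components=3)
print(scores.shape)  # (60, 3)
```

The three score columns can then be used both for plotting and as inputs to a clustering step.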

2.2. Creating the self-organizing map (SOM) network
Artificial neural networks mimic the learning process in the human brain. A key component of a neural network is the neuron, which receives inputs and generates outputs using nonlinear operations. The SOM artificial neural network can learn complex, high-dimensional data and extract a visible set of clusters 34. The SOM training process consists of two repeated phases. The first phase selects the best mapping unit (a neuron of the network) to match the input data. The second phase updates the map to provide the best representation and display of the input data 41.
The selection of the unit that best matches the input data (the best matching unit, or BMU) is based on the minimum distance (usually the Euclidean distance). Then, in the update phase, the BMU and its neighboring units (within a given radius) move closer to the input data so as to match it better. The neighborhood radius decreases with each selection-and-update cycle, eventually leading to a final (two-dimensional) map 42.
The SOM network is composed of an input layer of nodes and an output layer of neurons, in which the grouping of the inputs is formed 43. The output layer is called the competitive layer because the competition within the network during training takes place at this layer. The competitive layer is a two-dimensional plane of m neurons that receives input from n input-layer neurons. Each input-layer neuron is connected to the competitive-layer neurons with different weight values, and a series of minor connections is also made between the competitive-layer neurons 44. The number of neurons may vary from a few tens to a few thousand. Each neuron is assigned a d-dimensional weight vector m, where d is the dimension of the input vectors. Neurons are connected to their neighboring neurons by a neighborhood relationship that determines the topology, or structure, of the map. Common topologies are square, hexagonal, triangular, or irregular grids 45.
As depicted in Figure 2, the SOM neural network consists of a set of $M = m \times m$ processing neurons. If these M neurons are organized on a grid in a plane, the resulting network is two-dimensional, because it projects multi-dimensional input vectors onto a two-dimensional surface. For a given network, the input vector x has a fixed dimension n. The n components of the input vector x (i.e., $x_1, x_2, \ldots, x_n$) are connected to each neuron in the array. For the connection from the ith component of the input vector to the jth neuron, a synaptic weight $w_{ij}$ is assigned. Thus, an n-dimensional vector $w_j$ of synaptic weights is associated with each neuron j 46. In brief, the SOM procedure is as follows 46:
1. Calculate the distance between the input pattern and all neurons 46:
$$d_{ij} = \lVert x_k - w_{ij} \rVert \tag{1}$$
2. Select the nearest neuron as the winning neuron 46:
$$w_{ij}: \; d_{ij} = \min(d_{mn}) \tag{2}$$
3. Update each neuron according to the neighbourhood function 46.
This process is repeated until a specific stopping criterion is reached. Often, the criterion is a fixed number of iterations. To ensure the convergence and stability of the map, the learning rate and neighbourhood radius are reduced in each iteration, so the size of the weight updates tends to zero. The distance between vectors is measured as the Euclidean distance 46.
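The three steps and the stopping rule above can be sketched in plain Python as follows. This is an illustrative sketch rather than the configuration used in the study: the update rule w ← w + α·h·(x − w), the Gaussian neighbourhood h, the linear decay of the learning rate and radius, and all names (`train_som`, `assign`, `lr0`, `radius0`) are assumptions, since the text does not give the update equation.

```python
import math
import random


def train_som(data, rows=2, cols=2, n_iter=500, lr0=0.5, radius0=1.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    # Initialize every neuron with a copy of a randomly chosen sample.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    for t in range(n_iter):  # stopping criterion: fixed number of iterations
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                     # learning rate decays
        radius = max(radius0 * (1.0 - frac), 1e-3)  # radius decays as well
        x = rng.choice(data)
        # Steps 1-2: Euclidean distance to every neuron; the nearest one wins.
        bmu = min(weights, key=lambda pos: math.dist(x, weights[pos]))
        # Step 3: move the BMU and its grid neighbours toward the sample.
        for pos, w in weights.items():
            grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # assumed Gaussian form
            for i in range(len(w)):
                w[i] += lr * h * (x[i] - w[i])
    return weights


def assign(data, weights):
    """Return the grid position of the best matching unit of each sample."""
    return [min(weights, key=lambda pos: math.dist(x, weights[pos]))
            for x in data]


# Synthetic two-group example (not the study data).
demo_rng = random.Random(1)
blob_a = [[demo_rng.gauss(0, 0.3), demo_rng.gauss(0, 0.3)] for _ in range(20)]
blob_b = [[demo_rng.gauss(10, 0.3), demo_rng.gauss(10, 0.3)] for _ in range(20)]
demo = blob_a + blob_b
som = train_som(demo, rows=1, cols=2)
labels = assign(demo, som)
```

Each sample is labelled with the grid position of its winning neuron, so samples sharing a neuron form one cluster.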

2.3. Clustering Validity Indices
Clustering validity indices are commonly used together with a clustering algorithm. Depending on the selected index, either its minimum or its maximum value indicates the optimum number of clusters (k) 47.
Generally, validity indices can be grouped into internal and external ones. Internal indices use only information contained in the data itself, whilst external indices rely on external information, such as labels. Internal measures can be used to improve clustering algorithms; by contrast, external measures can be used merely for validation. Internal indices are generally employed to determine the value of k [48][49][50][51].
In this paper, three efficient internal coefficients, namely DB, CH, and SI, were applied to the SOM neural network to determine the optimum number of clusters for the oil samples. Initially, 10 classes were selected for the SOM network. The model was developed based on these clusters; then, the optimum number of classes, taken as the optimum number of oil families, was identified using the coefficients.
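The selection rule described here (largest CH, largest silhouette, smallest DB) can be sketched as below. The function name `pick_k` and the toy index values are hypothetical; the numbers are placeholders for illustration and are not results from this study.

```python
def pick_k(ch, si, db):
    """Return the k preferred by each index and whether all three agree.

    Each argument is a dict mapping a candidate number of clusters k
    to the index value computed for that k.
    """
    best = {
        "CH": max(ch, key=ch.get),  # Calinski-Harabasz: larger is better
        "SI": max(si, key=si.get),  # Silhouette: larger is better
        "DB": min(db, key=db.get),  # Davies-Bouldin: smaller is better
    }
    return best, len(set(best.values())) == 1


# Placeholder values for illustration only (not data from the paper).
toy_ch = {2: 10.0, 3: 25.0, 4: 18.0}
toy_si = {2: 0.30, 3: 0.55, 4: 0.40}
toy_db = {2: 1.40, 3: 0.70, 4: 1.10}
best, agreed = pick_k(toy_ch, toy_si, toy_db)
print(best, agreed)
```

When the three indices disagree, `agreed` is false and the choice of k needs further judgment.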

Davies-Bouldin (DB) Index
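This index measures the average similarity between each cluster and the cluster most similar to it, where similarity is the ratio of within-cluster to between-cluster distances. The minimum value of the DB index indicates the optimum number of clusters or oil families 52.
The index is defined as 52:
$$\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} D_{i,j} \tag{4}$$
in which k is the number of clusters and $D_{i,j}$ is the within-to-between cluster distance ratio for the ith and jth clusters. $D_{i,j}$ can be defined as 52:
$$D_{i,j} = \frac{d_i + d_j}{d_{i,j}} \tag{5}$$
where $d_i$ represents the mean distance between each point in the ith cluster and the cluster's centroid, $d_j$ is the corresponding mean distance for the jth cluster, and $d_{i,j}$ denotes the Euclidean distance between the centroids of the ith and jth clusters. The optimum clustering solution possesses the lowest DB index value 52.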
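The DB index as described above can be sketched in plain Python. The function names (`centroid`, `davies_bouldin`) and the four-point example are illustrative assumptions, and the sketch requires at least two clusters.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of points (lists of floats)."""
    cents = [centroid(c) for c in clusters]
    # Mean distance of the points of each cluster to that cluster's centroid.
    spread = [sum(math.dist(p, m) for p in c) / len(c)
              for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst (largest) within-to-between ratio for cluster i.
        total += max((spread[i] + spread[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k


# Two tight, well-separated toy clusters (not the study data).
demo_clusters = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
db_value = davies_bouldin(demo_clusters)
print(db_value)  # 0.2
```

In the toy example each cluster has a mean spread of 1 and the centroids are 10 apart, so the ratio is (1 + 1) / 10 for both clusters.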

Calinski-Harabasz (CH) Index
The CH index 53 expresses the quality of a clustering solution through the ratio of the average between-cluster to the average within-cluster sum of squares. It can be computed as 47:
$$\mathrm{CH} = \frac{\mathrm{SSB}}{\mathrm{SSW}} \times \frac{n - k}{k - 1} \tag{6}$$
in which SSB is the between-cluster sum of squares and SSW is the within-cluster sum of squares, each averaged through division by its degrees of freedom (k - 1 and n - k, respectively), k represents the number of clusters, and n denotes the number of observations. SSB is calculated as follows 47:
$$\mathrm{SSB} = \sum_{i=1}^{k} n_i \lVert m_i - m \rVert^2 \tag{7}$$
where $n_i$ is the number of observations in cluster i, $m_i$ is the centroid of cluster i, $m$ is the mean of all data points, and $\lVert m_i - m \rVert$ is the Euclidean distance between the centroid of the cluster and the mean of all data points. SSW is computed as follows 47:
$$\mathrm{SSW} = \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - m_i \rVert^2 \tag{8}$$
in which k indicates the number of clusters, $x$ is a sample, $c_i$ is the ith cluster, $m_i$ is the centroid of cluster $c_i$, and $\lVert x - m_i \rVert$ is the Euclidean distance between the sample and the centroid of the cluster 47.
A higher CH value indicates a better clustering solution and therefore points to the optimum number of clusters. A high SSB together with a low SSW corresponds to well-separated clusters 47.
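A short plain-Python sketch of the CH computation described above follows. The function name `calinski_harabasz` and the toy clusters are assumptions for illustration; the sketch needs at least two clusters and more observations than clusters.

```python
import math


def calinski_harabasz(clusters):
    """clusters: list of clusters, each a list of points (lists of floats)."""
    points = [p for c in clusters for p in c]
    n, k = len(points), len(clusters)
    dim = len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    ssb = 0.0  # between-cluster sum of squares
    ssw = 0.0  # within-cluster sum of squares
    for c in clusters:
        m = [sum(p[d] for p in c) / len(c) for d in range(dim)]
        ssb += len(c) * math.dist(m, overall) ** 2
        ssw += sum(math.dist(p, m) ** 2 for p in c)
    # Ratio of the two sums of squares, each divided by its degrees of freedom.
    return (ssb / (k - 1)) / (ssw / (n - k))


# Two tight, well-separated toy clusters (not the study data).
demo_clusters = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
ch_value = calinski_harabasz(demo_clusters)
print(ch_value)  # 50.0
```

Here the between-cluster sum of squares is 100 and the within-cluster sum is 4, giving (100 / 1) / (4 / 2) = 50.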

Silhouette Index (SI)
The SI 54 shows how close every data point is to the other data points within its cluster and how well the clusters are separated from each other. Simply put, it is based on the distances between each point and the points within its own and other clusters. The highest silhouette value indicates the optimum number of clusters (k) 55.

$$sp(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{9}$$
in which sp(i) is the silhouette width of the ith point (i = 1, 2, ..., n), a(i) is the mean distance between the ith point and all the other points in its own cluster, and b(i) is the smallest of the mean distances between the ith point and the points of each of the other clusters. Hence, the silhouette value lies between -1 and 1. For every clustering, the average of all sp(i) is used 47.
The detailed features of the SOM network used for clustering in the present study are given in Table 2.
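The silhouette width can be sketched in plain Python as follows. The function name `silhouette_widths`, the value 0 for single-point clusters (a common convention), and the toy clusters are assumptions for illustration.

```python
import math


def silhouette_widths(clusters):
    """Return the silhouette width of every point, cluster by cluster."""
    widths = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            if len(cluster) == 1:
                widths.append(0.0)  # convention for single-point clusters
                continue
            # a: mean distance to the other points of the same cluster.
            a = sum(math.dist(p, q) for qi, q in enumerate(cluster)
                    if qi != pi) / (len(cluster) - 1)
            # b: smallest mean distance to the points of any other cluster.
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            widths.append((b - a) / max(a, b))
    return widths


# Two tight, well-separated toy clusters (not the study data).
demo_clusters = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
widths = silhouette_widths(demo_clusters)
mean_si = sum(widths) / len(widths)
print(round(mean_si, 3))  # 0.802
```

The average of the individual widths is the value that would be compared across candidate numbers of clusters.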

Table 2. The features selected for the SOM network.

Results
Ten clusters were defined as the default number for the SOM network, since the actual number of clusters, or oil families, is unknown. The samples were distributed among these clusters. Nevertheless, the principal objective of this study is to find the optimum number of clusters, and hence oil families, among these defined clusters. Therefore, validity indices were employed.
Regarding the clustering validity coefficients, the maximum values of the CH (62) and SI (0.58) parameters were obtained for four clusters (Figures 3a and 3b). Additionally, the minimum DB coefficient (0.8) was achieved for four clusters (Figure 3c). This means that all three clustering validity indices indicated four clusters as the optimum number of clusters. Figure 4    were analysed to identify oil families in the present paper to reach more reliable results.

Conclusions
The lack of novel approaches in previous studies was the main reason we sought a new method for identifying oil families, a vital task in petroleum basin studies. Thus, a SOM neural network was selected for this purpose. In creating the SOM network, ten clusters were initially defined. Then, three clustering validity coefficients were applied to identify the optimum number of clusters based on the geochemical and biomarker characteristics of the oil samples used as inputs for the network. The maximum CH and SI coefficients were obtained for four clusters. Similarly, the lowest DB coefficient was obtained for four clusters among the ten defined clusters. Accordingly, all three validation indices indicated four clusters as the optimum number of clusters, and hence the number of oil families. Finally, it should be noted that, while statistical methods such as PCA or HCA can be employed for oil family typing, these approaches have become over-used, and petroleum geochemistry studies, and oil family grouping specifically, demand novel paradigms. Accordingly, this paper introduced the SOM artificial neural network as a quick and easy-to-use method, which could be a great asset for geochemists in petroleum geochemistry studies for classification purposes.

Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix
, Mashhadi and Rabbani 37, and Rabbani et al 2. Table 1 summarizes the 16 geochemical and biomarker parameters used as inputs for the SOM network.

Figure 1. The geographical map of the studied oil fields.

Figure 2. The main structure of a SOM neural network.
in a 3-D shape typifies four clusters identified by the SOM neural network. Therefore, it can be concluded that four oil families exist in the Iranian part of the Persian Gulf. In other words, at least four different source rocks have generated the reservoir oils.

Figure 3. The outcomes obtained by the CH (a), SI (b), and DB (c) coefficients, demonstrating the optimum number of clusters.

Figure 4. The schematic of the SOM clustering results illustrating four oil families.

Table 1: Geochemical and biomarker parameters used as inputs for the SOM network.