A Novel Convolutional Neural Network for Classifying Indian Coins by Denomination

Coin recognition systems have widespread applications, from vending and slot machines to banking and cash-management firms, which has driven a high volume of research on methods for such classification. In recent years, academic research has shifted towards computer vision approaches for sorting coins due to advances in the field of deep learning. However, most of the documented work utilizes what is known as ‘Transfer Learning’, in which a pre-trained model of a fixed architecture is reused as a starting point for training. While such an approach saves considerable time and effort, the generic nature of the pre-trained model can become a performance bottleneck on a specialized problem such as coin classification. This study develops a convolutional neural network (CNN) model from scratch and tests it against a widely used general-purpose architecture known as Googlenet. By comparing the performance of our model with that of Googlenet (documented in various previous studies), we show that a simpler, specialized architecture is better suited to the coin classification problem than a more complex general-purpose one. The model developed in this study is trained and tested on 720 and 180 images of Indian coins of different denominations, respectively. The final model achieves an accuracy of 91.62% on the training data and 90.55% on the validation data.


Introduction
Coin classification and recognition systems have seen tremendous application in the past few years due to the surge of machine automation in our society [1]. Such systems are deployed to reduce the odds of human error and to fast-track the repetitive process of counting and sorting large numbers of coins. While traditionally such systems made use of the mechanical and electromagnetic properties of coins [2], the enormous development in the fields of computer vision and artificial intelligence has brought forth new systems that utilize robust neural networks to detect, segment, and classify coins from images processed in real time. Although the majority of applications stem from the sectors mentioned above, the increasing reliability of coin classification systems and the growing availability of computational resources for the average person have also motivated various studies aiming to develop systems that can assist visually impaired people in identifying and classifying real currency [3], [4]. While a complete currency classification or identification system would contain both a banknote and a coin classifier, this paper confines its scope to the coin classification part of the system.
Coins are smaller and usually show less color difference between classes (denominations) than banknotes. Coins are also more durable than notes and therefore survive in circulation for a longer period; hence the market has more variations of coins currently in use than of banknotes. A coin classification system must therefore be robust enough to account for this greater variation and relative scarcity of distinguishing features compared with a banknote classification model. In this paper, the authors propose a deep learning model that can reliably classify a coin based on its image. For training the model, a dataset was created with 900 images of Indian coins of different denominations (one rupee, two rupees, five rupees, ten rupees, and twenty rupees): 210 images of one-rupee coins, 120 of two-rupee coins, 270 of five-rupee coins, 180 of ten-rupee coins, and 120 of twenty-rupee coins. Each class contains different design variations of the denomination to ensure that the model has enough variety for training.

Almisreb, Ali & Saleh, Mohd.A. (2019). [5]
In this study, the authors trained and tuned several well-known deep learning architectures, namely Alexnet, Googlenet, and Vgg16, using a transfer learning framework on an image dataset of 110 images of Bosnian banknotes spanning 11 different classes. The main characteristics identified in this study were currency texture, color, and form. Vgg16 was the most accurate, while Googlenet was the least accurate at 88%.

Rahman, Mohammad & Poon, Bruce & Amin, Md Ashraful & Yan, Hong. (2014). [6]
This study describes a method of selecting thresholds for the classification algorithms. The experiment included test images with a white background as well as images with a complex, 'closer to real' background to judge the practicality of the model. The proposed system was found to be 89.4% accurate on the white paper background and 78.4% accurate on the complex background.

A. Chalechale (2007). [7]
The author of this paper proposed a new approach for recognizing coins in an image, centered on measuring the similarity between full-color multi-component coin images. Its significant advantage is that image segmentation before detection is not required; since segmentation is a very computationally expensive process, this approach saves a lot of computational time.

Ranjendra, P. & Anithaashri, T.P. (2020). [8]
This paper is oriented around Indian currency. The authors followed the same structure as Rahman et al. [2], creating a dataset that contains both Indian banknotes and coins of different denominations; three state-of-the-art deep learning frameworks, namely Alexnet, Googlenet, and Vgg16, were trained, tuned, and compared. The results showed that Googlenet (with 22 convolution layers) was the least accurate, with an accuracy of 88%, while Vgg16 (with 16 convolution layers) was the most accurate.

Abburu, Vedasamhitha & Gupta, Saumya & Rimitha, S. & Mulimani, Manjunath & Koolagudi, Shashidhar. (2017). [9]
This research proposed a deep learning model that can classify both the country of origin and the denomination of a given banknote. The model works only with paper notes and avoids dealing with coins altogether. The dataset in this study contains paper notes from 20 different countries and their denomination variants, and the model utilizes features like size, texture, color, and text in the image. The models were tuned to classify the notes with about 90% accuracy.

Methodology
Most recent studies that demonstrate how deep learning models can classify images of currency (banknotes or coins) have used a transfer learning framework, reusing one or more of the popular CNN architectures like Googlenet, Alexnet, or Vgg16 [[10]; [8]]. While there is no apparent downside to using these frameworks, such networks are not tailor-made for a specific problem; they are general in nature and can be applied to almost any situation. The problem is that, even with 22 convolution layers, the Googlenet CNN only reaches about 88% accuracy [8]. Since the architecture of Googlenet cannot be tweaked without losing its pre-trained weights, one must either settle for more epochs, which in turn increases computational time, or resort to collecting more data for more effective training. Both solutions are time-consuming and do not guarantee better accuracy.
Hence, to analyze the effectiveness of CNNs in classifying currency, a novel architecture must be created and trained from scratch, so that an accurate solution can be obtained with a simpler and less computationally expensive neural network. Developing a new architecture requires rigorous experimentation, and for benchmarking candidate architectures a baseline accuracy of 85% was chosen, as this is the average accuracy reported by various studies that incorporated the Googlenet CNN architecture.

Accumulation
Most datasets available online do not include images of twenty-rupee coins, since this denomination was only recently approved by the Government of India and was first minted in 2019. Hence, for this study, we created a new dataset [11] that includes images of Indian coins of all currently circulating denominations: one rupee (₹1), two rupees (₹2), five rupees (₹5), ten rupees (₹10), and twenty rupees (₹20). Since there are multiple designs for coins of the same denomination, the majority of designs must be included in the training data so that the model can identify a coin's denomination regardless of changes in design; accordingly, multiple design variations are included in our dataset. An open-access copy of the dataset is available on Kaggle under the same name. Some design variations included in the data are shown in Figures 2(a) and 2(b). The dataset covers 30 different coins: seven types of one-rupee coins, four types of two-rupee coins, nine types of five-rupee coins, six types of ten-rupee coins, and four types of twenty-rupee coins. Each coin was imaged from five different viewing angles, as shown in Figure 2(c), and in three different lighting conditions: i) artificial light, ii) low light, and iii) natural light. This gives 15 possible combinations of viewing angle and lighting condition, and each combination was sampled twice, so each coin was imaged 30 times with varying light and viewing angles. The whole dataset contains 900 such images: 210 of one-rupee coins, 120 of two-rupee coins, 270 of five-rupee coins, 180 of ten-rupee coins, and 120 of twenty-rupee coins.

Resize
Resizing is the first and arguably the most critical step in image pre-processing for training a neural network. A larger image carries more information and can produce better correlations inside the network, but it also significantly increases training time. Therefore, a size should be selected that retains as much information as possible while remaining small enough for a practical training time.
In their original form, all images in the dataset have a resolution of 4000 x 3000 pixels (12 megapixels), far too large for practical neural network training. So, after experimenting with different sizes, an image size of 320 x 240 was selected, and all images were resized while keeping the aspect ratio intact.
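The resize step can be sketched with a small helper that computes the largest size fitting inside a target box while preserving the aspect ratio (a hypothetical helper for illustration; the actual resizing library used in the study is not specified):

```python
def fit_within(orig_w, orig_h, max_w, max_h):
    """Return the largest (width, height) that fits inside (max_w, max_h)
    while preserving the original aspect ratio."""
    scale = min(max_w / orig_w, max_h / orig_h)
    return round(orig_w * scale), round(orig_h * scale)

# 4000 x 3000 is 4:3, so it scales exactly to 320 x 240 (also 4:3)
print(fit_within(4000, 3000, 320, 240))  # -> (320, 240)
```

Because both dimensions share the 4:3 ratio, no cropping or padding is needed for this dataset.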

Normalization
Normalization is the process of rescaling the pixel intensity of an image to a predefined scale from the original scale. All images are in RGB format in the dataset, with pixel intensity ranging from 0 to 255 for each color channel. Image normalization was used to transform the pixel intensities from a scale of 0-255 to a scale of 0-1.
I_n = I_o / 255
where I_n and I_o are the pixel intensities after and before normalization, respectively.
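The normalization above amounts to a single element-wise division; a minimal NumPy sketch, using a hypothetical one-pixel RGB image:

```python
import numpy as np

# Hypothetical 8-bit RGB image: intensities in [0, 255]
img = np.array([[[0, 128, 255]]], dtype=np.uint8)

# I_n = I_o / 255 rescales intensities to [0, 1]
img_norm = img.astype(np.float32) / 255.0
print(img_norm.min(), img_norm.max())  # 0.0 1.0
```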

Augmentation
Training a Convolutional neural network requires a lot of data to achieve optimum accuracy. While our dataset contains a decent number of images, increasing the dataset further can effectively enhance the performance of the trained model. Data augmentation techniques are usually implemented to develop robust image classifiers using limited data [12] that can artificially generate images by employing various transformations such as random rotation, shift, flips, etc. Some transformations used in the study are shown in Figure 3.
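As an illustration, a minimal augmentation routine can be sketched in NumPy with random flips, 90-degree rotations, and shifts (the study's actual augmentation pipeline and parameters are not specified; `augment` here is a hypothetical stand-in for a library pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, rng):
    """Apply a random horizontal flip, 90-degree rotation, and small
    horizontal shift to a square image array."""
    if rng.random() < 0.5:
        img = np.fliplr(img)                # random horizontal flip
    img = np.rot90(img, k=rng.integers(4))  # random 90-degree rotation
    shift = rng.integers(-2, 3)
    img = np.roll(img, shift, axis=1)       # small horizontal shift
    return img

base = np.arange(16).reshape(4, 4)          # toy 4x4 "image"
batch = [augment(base, rng) for _ in range(8)]  # 8 augmented variants
```

Each transformation merely rearranges the pixels, so the augmented images keep the original content while presenting it in new orientations.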

Proposed Model
The architecture of the CNN model proposed in this paper consists of five convolution layers and five pooling layers, progressively decreasing the spatial size of the image while increasing the number of feature maps.

Convolution Layer
Each convolution layer is a 2D convolution layer followed by a Rectified Linear Unit (ReLU) activation function, which has been shown to have good convergence properties, such as scale invariance and 1-Lipschitz continuity [13]. A ReLU passes the incoming signal from a neuron onward only if it is positive: a negative signal does not proceed further in the network, while a positive signal moves on to the next neuron unchanged. The ReLU activation function is
f(x) = max(0, x)
where x is the input to the neuron. The kernel size, or dimension of the feature maps, was fixed at 3x3x3 (a 3x3 map with three color channels) with a 1x1x1 stride. Furthermore, the number of feature maps increases from 16 (2^4) to 256 (2^8) across the layers.

Pooling Layer
Max pooling layers with a stride of 2 pixels are used between some of the convolution layers to reduce the dimensionality of the feature maps. Finally, a global average pooling layer was used instead of a flattening layer at the junction of the convolutional part and the fully connected layer, reducing the shape of the final output tensor from the last convolution layer and shortening the time required to train the fully connected layer.
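The downsampling described above can be traced with simple shape arithmetic, assuming 'same'-padded 3x3 stride-1 convolutions (which leave spatial size unchanged) and stride-2 pooling that halves each dimension with floor division; the padding scheme is an assumption, as it is not stated explicitly in the paper:

```python
def trace_shapes(w, h, filters=(16, 32, 64, 128, 256)):
    """Trace (width, height, channels) through five conv+pool stages.
    Assumes 'same'-padded 3x3 stride-1 convs (spatial size unchanged)
    and stride-2 pooling (each dimension halved, floor division)."""
    shapes = []
    for f in filters:
        w, h = w // 2, h // 2  # stride-2 pooling halves each dimension
        shapes.append((w, h, f))
    return shapes

for s in trace_shapes(320, 240):
    print(s)
# The final 10 x 7 x 256 feature maps are then reduced by global
# average pooling to a 256-dimensional vector.
```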

Fully Connected Layer
The low dimensional feature maps are fed into the fully connected layer through a global average pooling layer that reduces the parameter's size and works as a flattening layer, converting the feature map to separate neurons that make up the fully connected input layer. The ReLU activation function also follows this layer.
The last dense layer computes the final probability given by the model for each class. This layer utilizes the SoftMax function, also known as the normalized exponential function, which normalizes the neural network's output into a probability distribution over the prediction classes. The SoftMax function is
σ(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j},  for i = 1, 2, ..., K and z = (z_1, z_2, ..., z_K) ∈ ℝ^K
Using the SoftMax function in the last layer, the output of the fully connected layer is converted into an equivalent set of probabilities, one for each prediction class.
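A minimal NumPy implementation of the SoftMax function, applied to hypothetical logits for the five denomination classes (the values below are made up for illustration):

```python
import numpy as np

def softmax(z):
    """Normalized exponential: maps logits z to a probability distribution."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical logits for the five denomination classes
# (order: one, two, five, ten, twenty rupees)
logits = np.array([2.0, 1.0, 0.5, 3.0, 0.1])
probs = softmax(logits)
print(probs.sum())          # ~1.0: a valid probability distribution
print(int(probs.argmax()))  # 3 -> the ten-rupee class in this example
```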

Training
The model weights were updated using the Adam (Adaptive Moment Estimation) stochastic gradient descent optimization algorithm [14] so that the output corresponds with the given labels. An adaptive learning rate was adopted that started at 0.01 (1e-02) and was reduced by a factor of 0.1 every time training hit a learning plateau (with a patience of 12 epochs). Categorical cross-entropy was selected as the model's loss function.
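The reduce-on-plateau schedule described above can be sketched in plain Python; this is a minimal stand-in for a framework callback such as Keras's ReduceLROnPlateau, not the authors' actual training code:

```python
class PlateauScheduler:
    """Reduce the learning rate by `factor` when the monitored loss has
    not improved for `patience` consecutive epochs (a minimal sketch of
    the schedule described in the text)."""
    def __init__(self, lr=1e-2, factor=0.1, patience=12):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")  # best loss seen so far
        self.wait = 0             # epochs since last improvement

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor  # 0.01 -> 0.001 -> ...
                self.wait = 0
        return self.lr

sched = PlateauScheduler()
# A stagnating loss triggers a reduction after 12 flat epochs
for epoch in range(13):
    lr = sched.step(0.5)
print(lr)  # 0.001
```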

Model Evaluation
The model was developed and trained in the Google Colaboratory environment with GPU backend support (NVIDIA Tesla K80). The model was trained for a total of 150 epochs and achieved an accuracy of 91.62% on the training data, while on the validation (test) data the accuracy was 90.55%. The close agreement between training and test accuracy indicates that the model is not overfitting. Figures 5(a) and (b) show the accuracy and loss curves over training epochs, respectively, while Figure 5(c) shows the adaptive learning rate during training.
The accuracy metric alone is not sufficient to evaluate a multiclass classifier; rather, the performance on each class should be judged separately. Figure 6 shows the confusion matrix for the classifier. It is evident from the confusion matrix that the model performs best at classifying five-rupee (₹5) and ten-rupee (₹10) coins, while it struggles most with twenty-rupee (₹20) coins; this is confirmed by the per-class accuracy scores shown in Table 1.
Given that evaluation metrics like positive predictive value (PPV, or precision) and negative predictive value (NPV) apply only to binary classification, the 'one vs. rest' approach is used to convert the multiclass confusion matrix into five binary confusion matrices (one per class). For example, considering class 1 (one rupee): all images correctly classified as one rupee are termed true positives (TP), all images correctly classified as not one rupee are true negatives (TN), all images incorrectly classified as one rupee are false positives (FP), and all images incorrectly classified as not one rupee are false negatives (FN). Using this approach, TP, TN, FP, and FN can be calculated for each class and used to compute the PPV and NPV. The calculated values of TP, TN, FP, FN, PPV, and NPV for each class are shown in Table 2.
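The one-vs-rest computation can be sketched as follows, using a hypothetical 3-class confusion matrix rather than the paper's actual counts:

```python
import numpy as np

def one_vs_rest_metrics(cm):
    """Compute TP, FP, FN, TN, PPV, and NPV per class from a K x K
    confusion matrix (rows = true class, columns = predicted class)."""
    cm = np.asarray(cm)
    total = cm.sum()
    out = []
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fp = cm[:, k].sum() - tp  # predicted class k, actually another class
        fn = cm[k, :].sum() - tp  # actually class k, predicted another class
        tn = total - tp - fp - fn
        out.append({"TP": tp, "FP": fp, "FN": fn, "TN": tn,
                    "PPV": tp / (tp + fp), "NPV": tn / (tn + fn)})
    return out

# Hypothetical 3-class confusion matrix (not the paper's actual counts)
cm = [[8, 1, 1],
      [0, 9, 1],
      [2, 0, 8]]
metrics = one_vs_rest_metrics(cm)
print(metrics[0]["PPV"])  # 8 / (8 + 2) = 0.8
```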

Conclusion and Future Prospect
In this study, we have created a new dataset of 900 images that can be used to train a machine learning model to classify Indian coins by denomination. Of the 900 images, 720 were used for training the model, and 180 were set aside for validation. A CNN architecture containing five convolution layers was developed from scratch. The accuracies obtained on the training and validation sets were 91.62% and 90.55%, respectively. The validation accuracy (90.55%) exceeds the average accuracy of the Googlenet CNN (88%) reported by various previous studies. We thus confirmed that a relatively simple (five convolution layers) CNN architecture designed specifically for a particular problem (coin denomination classification) can be more efficient than a more complex (22 convolution layers) general-purpose architecture (Googlenet).
The model developed in this study was later evaluated with multiclass classification metrics, namely overall accuracy and per-class accuracy, as well as binary classification metrics, namely positive predictive value (PPV) and negative predictive value (NPV). It was observed that the model performed best at classifying five-rupee (₹5) and ten-rupee (₹10) coins, while it was worst at classifying twenty-rupee (₹20) coins. This can be explained by the fact that the dataset contains only 120 images of twenty-rupee coins, compared with 270 and 180 images of five- and ten-rupee coins, respectively.
In this study, the class imbalance in the dataset was neglected; therefore, for future improvement of the model, applying thresholding to compensate for prior class probabilities is advised to increase overall accuracy. Moreover, since the model's architecture is relatively shallow, with only five convolution layers, adding more convolution blocks might yield more accurate results.