Circuit convergence study using machine learning compact models

Machine learning (ML) compact device models (CM) have emerged as an alternative to physics-based CMs. ML CMs can find a mathematical model close to the device characteristics without much prior knowledge, which saves the time of model formation. Additionally, versatile capabilities such as process-awareness, model merging, and fitting new technologies, promote the usage of ML CMs. While ML CMs draw great attention in CAD, their convergence in SPICE has not been carefully studied. Here different activation functions are used to create ML CMs, and then the circuit convergence is tested. We found that inverse square root unit (ISRU) activation has the best convergence. Besides, gate-to-source and gate-to-drain capacitance is founded to benefit the convergence in transient analysis. The circuit convergence rate is 100% for ISRU, sigmoid, and tanh when the capacitor is present. On the other hand, ISRU significantly outperforms other activation functions in DC sweep, achieving 81% convergence. If quasi-static transient analysis is employed to replace DC sweep, 100% convergence is achieved by ISRU. Due to its superior convergence, ISRU is the most promising for future ML CMs in SPICE.


I. INTRODUCTION
With the evolution of the times, many people pay attention to machine learning. In the early days, humans were not interested in machine learning because machine learning models consume a lot of computing power reference to physical models. Nowadays, as more data or more complex phenomena need to be processed, it is more difficult for physical models to process it in a reasonable manner or in a short time. Metal oxide semiconductor field-effect transistor (MOSFET) is the most basic semiconductor device in circuits. Semiconductor device becomes smaller with Moore's Law, and many characteristics are not well described in emerging areas, which leads to a complex physical model. Besides, the physical model is increasing complexity, so the formation of the physical model takes a lot of time and labor costs [1]. This aspect can cause circuit design companies to be unable to use the most advanced devices to design the circuit in time. As the size of MOSFETs shrinks, the BSIM model has become more complex. When the oxide layer is thicker on the MOSFETs, we do not need to consider the gate leakage current and gate-induced drain leakage (GIDL) effect. At this time, the BSIM3 model is relatively simple. When the oxide layer becomes thinner on the MOSFETs, the device has a gate leakage current and GIDL effect. The BSIM model is expanded with the gate leakage current model and GIDL model [2]. The complexity of the BSIM model also increases with the advanced process. To prevent the short channel effect, halo implantation is available in the process, which makes the electric field and parasitic resistance more complicated [3,4]. At the same time, the halo implantation is also added to the BSIM model [5]. Afterward, since the shallow trench isolation (STI) process affects threshold voltage and mobility, so the stress effect model is introduced after the BSIM4 model [6]. It can be seen that the device model will become complicated in nano-devices, which leads to a lot of time to compose a physical model. Therefore, researchers are looking after using machine learning to construct a device CM. The machine learning model is a neural network trained by forwarding propagation and backward propagation [7]. The nonlinearity is formed by activation functions. Specifically, multilayer perceptron (MLP) is a mathematical model which uses a matrix to create an equation. We can adjust the number of neurons and hidden layers, input/output parameters, and activation function. The number of neurons, the number of hidden layers, and the activation function can determine the accuracy and complexity of the model. While physical models require repeated modifications, which cost engineers a lot of time to build physical models. In contrast, engineers only need a database, and then machine learning models can be composed. Among the architecture parameters in ML models, the activation functions play an important role in convergence. Thus, in this work, we study the effect of different activations on circuit convergence.
In literature, there have been some works using ML CMs in semiconductor technologies [1]. Chea-Wei Teo et al. uses the parameters of TCAD and the random forest method to predict the location of the defect [8]. Himmel et al. uses machine learning to complete the etching model [9]. Sidi Bel Abbes extracts the critical parameters of MOSFET and then uses MLP to predict the current value [10]. Creech et al. construct a ML CM for microave devices [11] Chen et. al. use ANN to form a compact model of TFTs, and then Vgs and Vds, gate width (W), and gate length(L) are used to predict the current value [12]. Gao et. al. also use NN to form a device model [13]. Huang et. al. proposed dimension-reduced ANN to reduce the problem of overfitting [14]. Kim uses physics-based loss function and then uses backpropagation to train the device model, and this practice can make the model prediction more accurate [15]. Habal uses voltage and gate length to predict the current value and gm accurately [16]. While ML has been used extensively in semiconductor process and used in several cases in semiconductor device CMs construction, the application of ML device CMs in SPICE has not prevailed in literature, and few people discuss the application of machine learning models to circuits simulation [17][18][19][20][21][22]. Samsung and Sandia's paper discusses the accuracy of the ML CMs [18,19]. Vgs and Vds are 0V to 1V with adaptive steps in Sandia's paper. It can be known from the result that when the high-density step value is extracted in the subthreshold, this can make the model prediction more accurate. Samsung's paper proposed that the simulation speed of the ML model is slower than the simulation speed of the traditional model. Hammouda also used ANN and Taylor series to compose a device model, and the device model was put into the circuit simulation. Nevertheless, Hammouda did not mention the convergence issue of ML models in circuit simulation [20].
We choose HSPICE to study the circuit convergence using ML CMs because HSPICE is the most popular SPICE used by industry or academia. We use Verilog-A to build device CMs [23]. Because Verilog-A has defined the fundamental laws, it can save a lot of time for model developers and allow model developers to focus on the device models [24]. While many papers use machine learning to form device models , there are only a few papers that really use machine learning models in circuit simulations [17][18][19][20][21][22]. Here the focus is on the application of ML CMs to circuits and proposing an activation function for high convergence.

A. Data collection
BSIM-CMG 110.0.0 is used as the baseline to compare the convergence with ML-based CMs [25]. The ML CMs are fitted to BSIM-CMG IV. Vgs, Vds, and the number of fins are the parameters in BSIM-CMG. When collecting the parameters of BSIM-CMG 110.0.0, the gate length is 30 nm, and then we can adjust the number of fins. There are 20 kinds of number of fins from 10 to 200. Subsequently, the fitting is carried in python and Tensorflow. To test the convergence of the circuit, we choose different kinds of activation functions. There are three different types of activation functions [26]. There are ReLu and ISRLU for the discontinuous type and unrestricted output value activation functions. Softplus and Swish possess continuous unrestricted output values. Softsign, ISRU, Sigmoid and hyperbolic tangent possess continuous and restricted output values. Finally, the model is incorportaed into HSPICE using Verilog-A syntax.
Vgs and Vds of BSIM-CMG 110.0.0 are set from 0V to 1V with 50mV step for n-MOSFET. On the contrary, Vgs and Vds of BSIM-CMG 110.0.0 are set from 0V to -1V with 50mV step for p-MOSFET. We only consider the electrical characteristics caused by the gate and drain and source, and the base should be connected to the source in circuits. Each device has 441 data, so there are 8,820 data for n-MOSFET and p-MOSFET, respectively.

B. Machine learning model formation
After collecting the parameters, the machine learning model is formed in python 3.7.6 [27]. The training model uses MLP to train the device model. Function of MLP is referenced from the Tensorflow 2.0.0 [28], Numpy 1.21.2 [29], Scikit-Learn 0.24.2 [30], Matplotlib 3.3.1 [31], Pandas 1.1.1. First, the data can be shuffled. 90% of data is used as the training set, and10% is used as the test set. When forming the BSIM-CMG model, the input parameters are Vgs, Vds, and the number of fins. The number of hidden layers is three, and the number of neurons is fifteen in each hidden layer. The activation functions use ReLu, softsign, hyperbolic tangent, sigmoid, softplus, ISRU, ISRLU, Swish, as expressed by (1a)-(1h) [32].
When the device model is trained, the data should be normalized first. The machine repeats gradient descent and weight update until the gradient of the loss function does not decrease, early stopping can be set. After the model is trained, the training data and testing data are fed to the models, and root mean square error (RMSE) between the estimated and actual values. This can determine the accuracy of the model. (3)

C. Verilog-A expression
First, the device name, and the device nodes, and the device parameters are defined [23,33,34]. Then the weights of the ML model are introduced into the verilog file. MOSFET has three nodes. Device parameters are Vds, Vgs, and the number of fins for CMG-MOSFETs, and L is uniformly 30nm in this work Loop expression can form a layer of neuron calculation, and the number cycles of the loop are determined by the number of neurons in a layer. Since the ML model of output value is standardized, it needs to be returned to the original value after the model calculation is completed. Finally, we use the current expression to complete the device model. The output value of the ML model is current in unit of ampere. While we only fit the characteristics of the first quadrant of Vds-Vgs, which can cause the circuit simulation to be distorted to the third quadrant. Therefore, we add the third quadrant current in Verilog-A by flipping the current values for negative Vds since the device is symmetric in structures.

D. HSPICE simulation
We use CPU Intel core i7-4790 3.6GHz to test the simulation time of the circuit. The circuit simulator uses the Newton-Raphson method to find the convergence value. If roots of the system of non-linear equations can be found quickly, the function needs to be continuous and smooth. Machine learning has many activation functions with different properties, and each activation function has different smoothness and continuity. As a result, the selection of activation functions has critical effect on the circuit convergence. After constructing Verilog-A models, the device CMs are introduced into HSPICE. We chose a logic gate with a more complicated configuration to test the convergence. As shown in Fig. 1 XOR, instead of common NAND, NOR, or NOT, gate is used. The speed of the ML models are also compared in addition to the convergence.
When we simulate the circuit, the high potential operates at 1V, and the low potential operates at 0V. First, Matlab is used to generate four hundred circuit files with different proportions of device sizes. Then we use batch files to automatically run circuit tests with 400 different ratios of device sizes. Afterward simulation, Matlab R2021a is used to read the waveforms from Cscope. Hspice Toolbox for Matlab and Octave is used so that Matlab can read HSPICE output files [35]. We use transient analysis or DC sweep to test the convergence and simulation speed. In transient simulation, the total transient time is 10ns using sinusoidal stimuli. The amplitude is 0.5V with 0.5V offset. The frequency is set to 2GHz/1GHz for two inputs. We use 400 different n-MOSFET and p-MOSFET ratios to test the convergence of circuits in transient analysis. Then the circuit convergence ratio of each activation function can be displayed together with the simulation time. Afterward, we can choose three activation functions with better convergence, and then using these activation functions to test the subsequent analysis in HSPICE.
The gate to source and gate to drain capacitance, Cgs and Cgd, is added to the device model, and then the circuit convergence can be tested again.
After transient simulation, the DC sweep is then tested with the scanning range of the DC sweep from 0V to 1V at two input terminals with 0.1V step. Because the convergence in DC sweep is not 100% for any of the activation functions, we use quasistatic transient analysis to replace the DC sweep as illustrated in Fig. 2. The maximum step size of transient analysis is set to 0.016sec. To improve the convergence, the input waveform is carefully ramped. Input A is a triangular wave, and input B will slowly rise from 0V to 1V. When the voltage of input A is from 0V to 1V, the voltage of input B is fixed to a value. When the voltage of input A changes from 1V to 0V, the voltage of Input B is slowly raised by 0.05V. When the voltage of Input B reaches. The calculation is repeated until 400 different W/L ratios are simulated.

E. Capacitance model
In our ML CMs, increased capacitance of Cgs and Cgd in general improves the convergence of the circuit. This is a positive, desired phenomenon, and the problem is that what is the range of Cgs and Cgd values in realistic devices. Cgs and Cgd values can be reference from literature [36][37][38][39][40][41][42][43]. The reasonable capacitance values, based on our literature review, fall into the orders of magnitude in 1×10 -8 F/m -1×10 -12 F/m. In fact, the Cgs and Cgd consists of overlap capacitance (Cov) and channel capacitance (Cch). Cov is a fixed value while Cch is bias dependent. To estimate the worst cases and to see the effect of capacitance orders of magnitude, we ignore Cch, and only retain a fixed-value Cov. Different orders of magnitude of Cgs = Cgd =Cov are tested, and convergence performance is recorded.

A. Fitting ability of ML model
It can be known from Fig. 3 and Fig. 4 that the ML model can accurately predict the current values of the MOSFETs. Different activation functions can also lead to different accuracy in ML model predictions. It can be known from Table I. that the ML model using Swish, the fitting capability is the best for n-MOSFET. On the other hand, ISRLU fits p-MOSFET the best. It can be found from MAPE that the activation functions with larger output ranges has a better fitting ability. On the contrary, the activation functions with limited output ranges can be less effective in fitting. When the input of Swish and ISRLU is >>0, the output value has a linear rise. This can make the output values relatively large, and the fitting ability is better. In addition, it is observed that the prediction accuracy of n-MOSFETs is better than that of p-MOSFETs. To test the predictive ability of the ML model for p-MOSFET, we have tried different loss functions in (4). When W is 0, (11) is equivalently log(I) fitting. W can also be adjusted, from 1 to 10 10 , to form a balanced fitting between linear and log scales. Based on our trial the results, it can be found that different loss functions are still difficult to improve the MAPE of the ML model for p-MOSFET. Using W from 1 to 10 10 improves the train set MAPE but does not have much effect on the test set data MAPE. It is seen that the predictive ability of the ML model is relatively poor in the vicinity of 0V. We have also tried to use all the data into training, and then the accuracy of the ML model prediction can be improved. Nevertheless, when all of the data is put into training, overfitting can exist. Thus, we finally use the original loss function (W=0), and 90% of the data is used as training sets.

B. Convergence of ML model for transient analysis
Before we test the convergence of the circuit, we check whether the current is minimal to Vgs less than 0 and whether the current has an explosive current to Vgs greater than 1V. The inspection result shows that the current is smaller than 1×10 -10 A for Vgs less than 0V, and the current is not explosive for Vgs greater than 1V. Therefore, in case the node voltage during New-Raphson iterations is accidentally entering a voltage range outside training, there should not be exploding value problems. As a result, the divergent circuit simulations are more likely due to the ML CM itself.
Before we consider the third quadrant current, we have tested XOR, NAND, NOR, and Inverter. The test result is that the circuit with XOR has the worst convergence, because the XOR circuit is more complicated. The circuit convergence of NAND, NOR, and Inverter is much improved, but it still follows different activation functions, leading to different circuit convergence. In general, activation functions with smooth co-domain, smooth derivatives, and bounded co-domain can lead to better convergence. As far as XOR is concerned, circuit simulation is unable to converge if the third quadrant current in transistor is not specified in ML CMs. Therefore, we added the current in the third quadrant to the ML CMs assuming symmetric device structures, and then the circuit convergence was tested again. The test results show that ISRU, Sigmoid, Softsign, and hyperbolic tangent SPICE convergence are satisfactory in transient analysis, as shown in Table II. The correlation between the average speed in SPICE and convergence, for different activation functions, is actually not very pronounced. The results in Table II show that the average time of circuit simulation of ReLu and ISRLU is relatively fast, but their circuit convergence is rather poor.
When we look for a suitable activation function for HSPICE, we expect to find an activation function that possesses a continuous function values and derivatives. If an activation function has a discontinuity, which can cause the circuit to diverge at the moment of potential transition, such as ReLu. The ReLu function has a discontinuous first derivative at zero, making it impossible to be differentiable. ISRLU function is continuous until the second derivatives, but carefully processed continuity does not improve the SPICE convergence [44]. Another problem of activation functions in SPICE is an unbounded function values, which can cause convergence problems. This is seen in RELU, ISRLU, Swish, and Softplus. Because the value changes dramatically, and errors can accumulate and propagate, which can cause divergence. ISRU, Sigmoid, and hyperbolic tangent are continuous and differentiable functions without piecewise definitions. Besides, the function values are bounded, so the convergence of ISRU, Sigmoid, Softsign, and hyperbolic tangent is better than ReLu, ISRLU, Swish, and Softplus. The derivatives of ISRU, Sigmoid, Softsign, and hyperbolic tangent are in (5a)-(5d), respectively.
From Table II, it can be found that the ML models of all activation functions are slower than the BSIM model in the circuit simulation [18,19]. Nevertheless, the BSIM model has many capacitors. The capacitors have differential expressions so that the variation in current and voltage is slowed down, which is beneficial for circuit convergence. Fig. 5 shows the circuit simulation speed of different ratios between n-MOSFET and p-MOSFET. Circuit simulation speed does not have to specific correlation with respect to the n-MOSFET to p-MOSFET W/L dimension ratio, i.e., n-MOSFET to p-MOSFET dimension ratio has no significant relevance to the circuit convergence. Even if the n-MOSFET to p-MOSFET dimension ratio does not match, the ML models convergence is not deteriorated. On the other hand, the BSIM model does not have any problem in convergence in all n-MOSFET to p-MOSFET dimension ratios. The scatter plots for divergent cases in terms of the n-MOSFET to p-MOSFET W/L dimension ratio is shown in Fig. 6.

C. Convergence for capacitors of ML model
Capacitors are added to the ML CMs in this section to promote the transient analysis convergence. Only practical capacitance values in Cgs and Cgd can be added to prevent overly large, unrealistic scenarios. From Table III, we can find that the convergence of the circuit has been significantly improved. Since the capacitor can help the voltage to change more slowly, improved circuit convergence is achieved. The capacitance value also deserves some discussion. Certainly, we prefer large capacitors, since small capacitance is regard as an open circuit, which has no effect on the device model and SPICE. The capacitance values, however, needs to be realistic to make the calculation here meaningful. From literature, we found that the reasonable range of Cgs and Cgd is 1×10 -8 F/m -1×10 -12 F/m, while most of the reported MOSFET values falls around 1×10 -10 F/m. It can be seen from Table III that when the capacitance is 1×10 -12 F/m, ML models does not converged in all cases. In addition, the results show that the larger the capacitance values are, the more pronounced the improved circuit convergence is. In the capacitance calculations, I=C×dV/dt. dV/dt does not directly affect the continuity of the current, while C can affect current continuity if the capacitance values have discontinuities. Fortunately, although discontinuous capacitance values can cause current value discontinuities, the effect is not very pronounced for Cgs and Cgd in the range of 1×10 -8 F/m -1×10 -12 F/m. Therefore the continuity of the capacitance values are less critical than the capacitance value in the device model.
In addition to a single XOR gate, we also connect the XOR circuit in series, i.e. two XOR at the first stage connected to one XOR at the second stage, and then we use ISRU ML models to test the convergence of the circuit. The test result is that the 100% convergence is seen when the capacitance value is 1×10 -10 F/m. Besides, the deterioration of circuit convergence occurs when the device model does not include Cgs and Cgd. This results in only 87% circuit convergence.  Table IV is not 100% even for ISRU, and thus we manage to use transient simulation to replace the D.C. sweep to promote the usage of ML CMs. In general, the dc sweep ramping can be discontinuous if we have nested sweep at two nodes. To circumvent this problem and utilize the previous SPICE solution for the next step, we design the input waveform for quasi-static transient. The results of using quasi-static transient analysis to replace dc sweep are shown in Table V. It can be seen from the results that using transient analysis can significantly improve the convergence. Because the input voltage gradually ramped and the discontinuity in nested sweep is resolved, the convergence of the circuit is significantly improved. Similar to Table III in the previous section, here we have tested different practical capacitor values in Cgs and Cgd to show the feasibility of the proposed methodology. Since quasi-static transient analysis is operated at low frequencies, capacitance has less effect compared to the results in Table III. Nonetheless, we can still see ISRU has 100% convergence for all capacitance values while sigmoid and tanh fails at a small proportion at some capacitance values.  The problems in convergence and simulation speed need to be overcome when machine learning is applied to the model of semiconductor devices. Specifically, the convergence of circuit simulations using ML CMs still cannot be compared with the convergence in conventional physical models. Even with the best activation function, i.e. ISRU, the circuit still cannot fully converge. Here we found that to improve the SPICE convergence, the activation functions should be continuous and possess smooth derivatives with bounded function values. Sigmoid, ISRU, and hyperbolic tangent have continuity, smooth derivatives and a bounded co-domain, and thus the circuit converges easier. The circuit convergence ratio of hyperbolic tangent is 99.50% for transient analysis without Cgs and Cgd. The circuit convergence of sigmoid and ISRU is 95%. Adding Cgs and Cgd. in transient simulation further improves the convergence. When a realistic capacitance values for Cgs and Cgd is added to the MOSFET device model, the circuit convergence of tanh, ISRU, and sigmoid is 100% if Cgs and Cgd is >10 -11 F/m. Additionally, it is found that the circuit convergence of dc sweep is worse than the circuit convergence of transient analysis. In DC sweep, ISRU significantly outperforms other activation functions. If a quasi-static transient analysis is used to replace DC sweep, and realistic capacitance values are specified in ML MOSFET CMs, ISRU can provide convergent results for all purposes in circuit simulation, including DC and transient analysis. Therefore, in terms of our numerical study, ISRU has the best overall convergence and can potentially be more suitable for future machine learning based CMs in circuit simulation.