Reinforcement Learning in Urban Network Traffic Signal Control: A Systematic Literature Review

Improving traffic signal control (TSC) efficiency has been shown to improve urban transportation and enhance quality of life. Recently, the use of reinforcement learning (RL) in various areas of TSC has gained significant traction; we therefore conducted a systematic, comprehensive, and reproducible literature review to dissect all the existing research applying RL in the network-level TSC (NTSC) domain. The review targeted only network-level articles that tested the proposed methods in networks with two or more intersections. We used natural language processing to define the search strings and searched the Google Scholar, Web of Science, IEEE Xplore, ACM Digital Library, Springer Link, and Science Direct databases. This review covers 160 peer-reviewed articles from 30 countries published from 1994 to March 2020. The goal of this study is to provide the research community with statistical and conceptual knowledge, summarize existing evidence, characterize RL applications in NTSC domains, explore all applied methods and major first events in the defined scope, and identify areas for further research based on the research problems explored in current work.
Keywords: Reinforcement Learning; traffic light control; urban network; multi-agent system; intelligent transportation system; artificial intelligence.


Introduction
With explosive growth in urban and rural populations, city transportation systems have become less efficient at handling the ever-growing number of commuters. A lack of space and resources with which to improve infrastructure poses problems in accommodating the increasing urban population. The resulting congestion leads to increased pollution from vehicles idling in traffic jams, traffic delays and bottlenecks, and a rise in accidents. The secondary issues that arise are just as severe, including economic loss and an overall decrease in quality of life. This presents the problem of improving traffic flow and traffic signal control (TSC) within the already existing infrastructure.
Traffic signals are most often regulated through fixed-time, actuated, or adaptive control methods, whether state-of-the-art methods or those deployed in the real world, such as SCATS (Sims et al., 1981), SCOOT (Hunt et al., 1981), and TUC (Diakaki et al., 2002). Fixed-time signal control involves a repeating pattern that does not change with the live traffic situation and continues through its cycles regardless of dynamic traffic changes in the area. The actuated control method operates traffic signals based on real-time data from loop detectors. Despite being traffic responsive, the actuated control method is not designed to fully address fluctuating traffic demands, rendering it less than optimal, particularly under highly saturated volumes. Conversely, an adaptive signal is a more efficient solution, as it has the built-in capacity to adapt to traffic changes without the restrictions that plague the actuated method. Reinforcement Learning (RL) (Sutton, Barto, et al., 1998) in TSC is generally employed to advance the adaptive category of methods.
Derived from the natural learning processes observed in animals, RL allows a TSC system to learn and adapt to its environment. In a traffic environment, several components, such as pedestrians, drivers, vehicles, and traffic signals, may interact with each other. In TSC, traffic signals are the most common agents. Some TSC systems have a single agent in the RL environment; however, it is common to have multiple agents work either cooperatively or competitively, in what is called Multi-Agent Reinforcement Learning (MARL). The benefit is that agents work across a large environment while retaining precision close to that of a single agent. The agents interact in a simulated traffic environment in different situations to learn the optimal way of interacting with the environment in a real-world setup. RL works based on a reward system that promotes long-term goals in an environment. The learning process is a feedback cycle of state, action, and reward, where RL learns how an agent should map states to actions to maximize a reward (this is discussed in detail in Section 1.1); see Figure 1. The action often involves setting the phase duration, though there are other action types, including setting the phase order, cycle time, offset, etc. We define these TSC-related terms as follows. A phase is a period of time during which a set of non-conflicting traffic movements receive a green signal. A cycle is composed of several phases, and the cycle time is the time required to complete a full sequence of the phases. The proportion of the cycle that is green is called the split. Moreover, in a coordinated system, the offset is defined as the time between the beginning of green time at the reference signal and the beginning of the green phase at a given intersection. The main goal of TSC is to improve the environment or network performance (e.g. delay time, travel time, queue length, and speed) by controlling the actions of the agents.
The main focus of this paper is on controlling the timing of traffic signal agents, although this control can be integrated with the control of actions of other types of agents, like vehicles in a connected vehicle environment.
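The feedback cycle of state, action, and reward described above can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions: `ToySignalEnv` is a hypothetical stand-in for a traffic simulator, and all names and constants are illustrative, not drawn from any specific library or paper.

```python
import random

class ToySignalEnv:
    """Hypothetical stand-in for a traffic simulator: the state is the
    queue length at a single signal, and the reward is its negative."""
    def reset(self):
        self.queue = 5
        return self.queue

    def step(self, action):
        arrivals = random.randint(0, 2)
        departures = 2 if action == "extend_green" else 1
        self.queue = max(0, self.queue + arrivals - departures)
        return self.queue, -self.queue  # (next state, reward)

def run_episode(env, policy, steps=50):
    """The state -> action -> reward feedback cycle."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(steps):
        action = policy(state)            # agent maps state to action
        state, reward = env.step(action)  # environment responds
        total_reward += reward            # cumulative (undiscounted) reward
    return total_reward

total = run_episode(ToySignalEnv(), policy=lambda s: "extend_green")
```

An RL agent would replace the fixed `policy` here with one learned from this loop's history of states, actions, and rewards.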
Due to the recent rise in popularity of RL in TSC, specifically in NTSC, we aim to thoroughly characterize the existing research in the area of urban traffic networks where RL is applied and to provide a complete account of what has already been explored. To this end, we exclude research that has only proposed or tested methods for single isolated intersection control. Thus, we concentrate only on the application of RL in network-level TSC, referred to as RL in NTSC, or RL-NTSC for brevity, i.e. studies that tested the proposed methods in networks with two or more intersections.
It is worth noting, however, that there are several surveys and review papers that do cover this area. For instance, (Wei et al., 2019c; Yau et al., 2017; Mannion et al., 2016; Bazzan and Klügl, 2014; Bazzan, 2009; Abdulhai and Kattan, 2003; Wei et al., 2021) all present general reviews or surveys on TSC methods, compiling a list of the most recent methods and algorithms related to RL in TSC. Additionally, two very recent papers (Gregurić et al., 2020; Haydari and Yilmaz, 2020) discuss the applications and opportunities regarding Deep Reinforcement Learning (Deep RL) in TSC, while a number of relevant studies that do not exclusively focus on RL in TSC also exist, e.g. (Yuan Wang et al., 2019; Nguyen et al., 2018; Tahilyani et al., 2013; Z. Liu, 2007; Jácome et al., 2018; Yizhe Wang et al., 2018a; D. Zhao et al., 2011; Eom and B.-I. Kim, 2020). Nonetheless, to the best of our knowledge, there is no systematic literature review aimed at examining the existing research on RL-NTSC. This research aims to: (i) collect all the existing relevant papers in the defined area and present a systematic, explicit, comprehensive, and reproducible review for identifying, evaluating, and synthesizing the existing body of literature (based on the definition of a systematic literature review given by Fink (Fink, 2019) and Okoli et al. (Okoli and Schabram, 2010)), (ii) provide statistical and conceptual knowledge based on qualitative and descriptive analysis of the data extracted from the included articles, to investigate what has been done in the area, which methods and techniques were used (alone, or as a core or combined method integrated with other methods), and which patterns, trends, and information can be extracted from the reviewed data using data analysis techniques, (iii) show which methods and which NTSC application domains still have room to be elaborated in further research, (iv) identify the major first events in RL-NTSC to show how research novelties and contributions are temporally located over the course of research in the area, which is particularly useful for finding very recent research problems, (v) explore recent research problems and domains, with their frequencies, to help identify potential future research, (vi) provide common future directions based on what the included papers recommended, (vii) summarize existing evidence, and (viii) identify and summarize areas for further research. For convenience, Table 1 summarizes the main acronyms in the RL and NTSC domains.

Reinforcement Learning in Traffic Signal Control
A Markov Decision Process (MDP) is a mathematical framework well suited to optimizing decision-making processes under uncertainty. An MDP is a four-tuple ⟨S, A, R, T⟩, comprising, respectively, the state space S, action space A, reward function R, and transition function T. An MDP satisfies the Markov property if the transition function, whether known or unknown, depends only upon the current state and the action taken, not on the sequence of events that preceded it.
If the reward and transition functions are known, the optimal policy can be found using dynamic programming (DP) methods via the recursive definitions of the value function. However, when the environmental dynamics are not known, i.e. the reward and transition functions are unknown, the agent has to estimate the value of taking an action in a state without using knowledge about the reward function and transition probabilities. In this situation, RL is suitable. RL can be model-based, where the agent samples from the environment to estimate the reward and transition functions and find an optimal policy. Unlike model-based RL, in model-free RL algorithms the agent directly estimates the Q-function (discussed later in this section) from experience, while the reward and transition functions are unknown beforehand. In model-free RL, an agent tries to learn the optimal way of interacting with an environment. RL learns how an agent should map states to actions to maximize a numerical reward. At each time step (or decision point) k, based on a policy π that is intended to be optimized during the learning process towards reaching an optimal policy π*, the agent takes action a_k from a set of possible actions A in response to the current state s_k from a set of possible states S; i.e. a_k = π(s_k). Simply put, a policy is a rule that the agent follows in selecting actions based on its current state. At the end of step k, the agent receives a reward r_k from the environment based on a reward function R, where the elements of the reward can be collected through sensors. A sequence of states, actions, and rewards forms the history of an agent, saved in memory. At each time step, the RL agent tries to learn from its history of interactions with the environment an optimal policy that maximizes the discounted cumulative reward: R_k = Σ_{φ=k}^{∞} γ^{φ−k} r_φ, (1) where γ ∈ [0,1] is a discount factor. The discount factor is associated with time horizons and is used to balance immediate and future rewards.
It determines how much the RL agent cares about rewards in the distant future compared to those in the immediate future. If γ is 0, the agent only cares about the most immediate reward (R_k = r_k). If γ is 1, the reward is not discounted and the full distant-future reward is considered (R_k = r_k + r_{k+1} + r_{k+2} + ...). As we set γ closer to 1, future rewards are given greater emphasis relative to the immediate reward. For more details, see (Sutton and Barto, 2018).
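As a toy numeric illustration of the discount factor (not tied to any paper in the review), the discounted return of a short reward sequence can be computed directly:

```python
def discounted_return(rewards, gamma):
    """R_k = r_k + gamma*r_{k+1} + gamma^2*r_{k+2} + ..."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]
discounted_return(rewards, 0.0)  # gamma = 0: only the immediate reward, 1.0
discounted_return(rewards, 1.0)  # gamma = 1: undiscounted sum, 3.0
discounted_return(rewards, 0.9)  # 1 + 0.9 + 0.81 = 2.71
```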
One of the most frequently used and successful RL methods in traffic signal control is Q-learning (Sutton and Barto, 2018), which was first investigated in 1989. Q-learning is a model-free RL method. It is also an off-policy RL algorithm that uses a different policy for estimating Q-values than for action selection: it updates the Q-value of the current state-action pair using the greedy policy to estimate the Q-value of the optimal policy at the next state. In other words, the optimal policy π* is learned by estimating a second function, called the Q-function, that specifies the value of an action (following a given policy π) given the current state. The Q-function calculates the quality of a state-action combination. Assuming the agent continues to follow the optimal policy, the Q-value is defined as the expected discounted future reward of taking action a_k in state s_k: Q^π(s_k, a_k) = E_π[R_k | s_k, a_k] = E_π[Σ_{φ=k}^{∞} γ^{φ−k} r_φ | s_k, a_k]. (2) The Q-value is estimated by iterative Bellman updates: Q^π(s_k, a_k) ← Q^π(s_k, a_k) + α(Ψ_k − Q^π(s_k, a_k)), (3) where α ∈ [0,1] is the learning rate, set through experimentation, and Ψ_k = r_k + γ max_{a_{k+1}} Q^π(s_{k+1}, a_{k+1}) is the target. Hence: Q^π(s_k, a_k) ← Q^π(s_k, a_k) + α[r_k + γ max_{a_{k+1}} Q^π(s_{k+1}, a_{k+1}) − Q^π(s_k, a_k)]. (4) Q-learning uses a Q-table to decide which action to take. In summary, the Q-learning algorithm (Sutton and Barto, 1998) follows these steps: 1. s_k is the current state in which the agent resides. 2. The agent chooses action a_k from the available or acceptable actions. 3. In response, the agent receives a reward r_k for action a_k and observes the next state s_{k+1}. 4. Q(s_k, a_k) is then updated using Equation (4), in which the Q-value of the next state-action pair is merely an estimate rather than a known value. 5. The entire process is repeated.
Here, we should note that the rewards in Equation (1) are obtained only at subsequent time steps (after taking the corresponding actions) and are not necessarily known upfront. That is why Q-learning is a model-free RL method: it does not know the model of the environment, i.e. the reward and transition functions.
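The Bellman update described above amounts to a few lines of code. The following is a minimal tabular sketch under assumed toy conditions: states are discretized queue lengths and the actions are "keep" or "switch" the current green phase; all names and constants are illustrative, not from any particular paper.

```python
from collections import defaultdict

ACTIONS = ["keep", "switch"]
ALPHA, GAMMA = 0.1, 0.9

Q = defaultdict(float)  # Q-table keyed by (state, action), initialized to 0

def q_update(s, a, r, s_next):
    """One Bellman update: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# One simulated step: in queue-state 3 the agent switches the phase, the
# queue drops to 1, and the reward is the negative queue length.
q_update(s=3, a="switch", r=-1.0, s_next=1)
```

Repeating this update over many simulated steps is what fills in the Q-table that the agent consults when selecting actions.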
The learning process involves the exploration-exploitation dilemma. The agent tries to exploit what it has already learned in order to achieve reward, while at the same time it must explore the possible actions in each state to find the one that yields the highest reward to exploit.
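A common way to handle this trade-off is ε-greedy action selection: with probability ε the agent explores a random action, and otherwise it exploits the best-known action. A generic sketch follows (an illustration of the general technique, not a method from any particular included paper):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the action
    with the highest current Q-value estimate."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

Q = {(0, "keep"): 1.0, (0, "switch"): 0.5}
epsilon_greedy(Q, 0, ["keep", "switch"], epsilon=0.0)  # always exploits: "keep"
```

In practice, ε is often decayed over time so that the agent explores early in training and exploits more as its estimates improve.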
The rest of this paper is organized as follows: Section two describes the methodology used to conduct this review, including the search strategy, selection criteria, and data extraction; Section three presents the results and findings; Section four discusses common future work and key findings; Section five covers threats to the validity of our work; and Section six concludes with a look at the future implications of the current paper.

Review Method
In this study, search strategy, inclusion and exclusion criteria, and data extraction are intertwined, and the selection criteria and data extraction are performed within the search and search evaluation steps; hence, we explore this as a whole rather than separately. It should be noted that to identify the relevant literature we conducted both manual and automated searches, and included all articles published up to the end of March 2020. We designed and implemented our review based on the guidelines provided by (Tricco et al., 2018).

Search strategy, selection criteria, and data extraction
To identify the most appropriate search terms and strings for the automated search, we used literature review, manual content analysis, and Natural Language Processing (NLP) in the steps that follow.
1: Based on knowledge in the area of the study, we identified seven venues, including five journals and two conferences as listed in Table 2, and manually searched for and reviewed the relevant published studies from 2017 to 2019 based on their title, abstract, and keywords. In very limited cases, we also examined the conclusion and searched for specific, relevant key terms to help with the inclusion decision. Fifteen pertinent articles were found during this first stage search. These papers were also used in the fourth step to form the Quasi-Gold Standard (QGS) (Abad et al., 2016) to evaluate the quality of the search strings.
2: We performed NLP, including language modelling and lexical association analysis (Abad et al., 2019), to extract the most frequently used terms in the 15 retrieved articles for the purpose of identifying search strings. Figure 2 shows the directed graph of common bi-grams formed from the QGS set, based on the frequency analysis of the pre-processed texts collected from the title, abstract, introduction, section/subsection headings, and conclusion of the 15 articles. Based on the results from the NLP, several inspections and investigations, and various combinations of the most related search terms, the following single search string was chosen: "reward AND action AND traffic light OR traffic signal AND reinforcement learning".
3: Using the identified search string, we queried Google Scholar, which served as our main search base, yielding 2,887 articles. To increase the reliability of our findings, the search was complemented by searching within the Web of Science, IEEE Xplore, ACM Digital Library, Springer Link, and Science Direct databases after the data extraction. This accelerated our search process by providing a view of those papers that were included and excluded based on the defined criteria.
4: We formed the Quasi-Gold Standard. Our automated search retrieved all 15 articles, indicating a quasi-sensitivity of 100%.
5: We began the process of inclusion and exclusion by carefully defining the selection criteria. This iterative process continued even during data extraction (i.e. the last step) to ensure the correctness of the included and excluded papers. Figure 3 depicts the flowchart of the article selection process in this paper. Publications that met the following criteria were excluded: (1) articles not related to RL-NTSC, (2) duplicate papers, (3) review and survey papers, (4) presentations, abstracts, extended abstracts, viewpoints, letters, reports, technical reports, projects, tables of contents, any papers that have not been peer reviewed, theses and dissertations, books, and book chapters, (5) publications in any language other than English, (6) papers for which a better or updated version was found in another journal or publication and used, and (7) papers whose full text was unavailable on the internet.
Since the scope of our research is network-scale TSC, henceforth referred to as NTSC (i.e. any scale involving two or more intersections), another important exclusion criterion was papers that propose or test a method in a single isolated intersection only.
We included papers containing an RL method, whether it was the core or a combined method applied in the context of NTSC. These were included whether they provided evaluation and simulation or only a framework. This investigation is designed to cover papers that apply RL in controlling traffic signals; as such, if RL is used to control only vehicles, the paper is out of scope. In essence, as long as RL is applied in NTSC, whether vehicle control is involved or not, the paper was included in the current study. Hence, the connected and autonomous vehicle (CAV) environment is covered in our paper as long as NTSC is involved. The CAV setting in the current study is inclusive of all CAV environments, including vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), and infrastructure-to-vehicle (I2V).
Although the research areas mentioned below have traffic components (including public transit, bikes, and pedestrians) in common with our research, they are out of scope since TSC is not their main focus as it is in the current study. These research areas include: ramp metering, freeway traffic control (not TSC), public transit (where the focus is on bus control and TSC is excluded), route choice, routing systems, pedestrian routing, reactions of cyclists to speed advice, ride-sharing, best path selection, lane changing, autonomous intersections, traffic congestion detection, driver behaviour, traffic signal control simulation, simulators, simulation environments, online calibration, traffic assignment problems, courier management in express systems, fleet management, toll plazas, traffic analytics, traffic control architectures based on the fog computing paradigm where the focus is on the fog paradigm (not TSC), image-based learning, image processing, and optimizing sensor installation locations in a traffic network. Among these topics, only in ramp metering, public transit, emergency vehicles, and fog computing did we find studies in which NTSC is included; we thus included those studies.
6: After applying the selection criteria and identifying the included papers, we used both, backward snowballing by screening the reference list of the included papers, and forward snowballing (Greenhalgh and Peacock, 2005) by scanning the citation of the included papers to reduce the probability of missing some papers in our inclusion process; see Figure 3.
7: In addition to Google Scholar, five other electronic databases were searched, yielding no new articles. All articles found here had already been covered in the search using Google Scholar and the snowballing method.
8: From the included papers, data were extracted and analyzed.
In the following section, we provide an analysis of the data we extracted. In Table 3, we present an overall comparison along with details of some of the features extracted for all 160 papers included in our study. The papers that were reviewed are listed in the section entitled Included Articles. Other resources cited in this paper are listed in the reference section. It should also be noted that the RL methods in the second column of Table 3 are classified into three main categories: model-free, model-based, and RL. In this table, for model-free RL methods we present the specific model-free RL method used, such as Q-learning, SARSA, etc., whereas model-based RL methods are shown as "Model-based RL". Also, in this table, by "RL" we mean RL in general, where no specific RL method is mentioned in the paper. Moreover, there are articles that do not provide any specific RL method but rather mention the temporal difference (TD) update strategy; in this case, we report the method as TD, as presented in those papers. As shown in Figure 4, the number of studies regarding RL-NTSC is increasing. This topic continues to gain momentum, and hence importance, as the world's, and specifically urban, populations increase, as demonstrated by the large number of papers published recently. During the first 15 years, from 1994 to 2008, a total of 22 papers on the subject were published, compared to 27 papers in the year 2019 alone. These statistics further attest to the fact that an in-depth review like ours is important at this time, as it provides future works a comprehensive view of the past 25 years.

The 160 included articles have been published in 104 different publication venues, including conferences (57%), journals (39%), and workshops and symposia (4%); see Figure 5. The first (Mikami and Kakazu, 1994) and second (Cao et al., 1999) papers in RL-NTSC were published in 1994 and 1999, respectively. These papers proposed model-free RL methods, while the first model-based RL method in RL-NTSC (M. A. Wiering, 2000) was proposed in 2000. Among the 160 included papers, 6 papers (M. Wiering et al., 2004; Kuyer et al., 2008; Houli et al., 2010; Gomaa, 2012, 2014) used or extended this model-based method, called the Wiering method. In addition to these Wiering-method-based research works, proposed a model-based method that is based on a mechanism for creating, updating, and selecting one among several partial models of the environment.

Countries, Departments and Affiliations
Figure 6 exhibits the distribution of research papers by country. A total of 160 papers from 30 different countries were included in this review, with most of the research papers coming from 7 countries: China, USA, India, Iran, Ireland, Canada, and Brazil. These papers represent 67% of our pool of papers, strongly suggesting that, as of 2020, these countries are the global leaders in RL-NTSC research. An interesting fact to note is that all seven of the countries mentioned have a traffic index above 140.45 according to (Index, 2014), with Iran having the highest traffic index of the group, at 216.09. The motor data company INRIX (Inrix, 2020) finds Ireland's capital city of Dublin to be the 7th worst city in the world in terms of hours lost due to traffic congestion based on 2019 data (154 hours). And, according to the Central Intelligence Agency (CIA) World Factbook (Central Intelligence Agency, 2020), the USA, China, India, and Brazil rank as the top four countries with the longest road networks, respectively, with Canada eighth.
This study also took into consideration the department and affiliations of authors, and identified four groups of departments, as follows: (1) computer related departments, such as computer engineering/science, information technology, and electrical engineering, (2) transportation related departments, including civil and transportation engineering departments, (3) other engineering departments, such as industrial, mechanical, material, and geomatics, and (4) other departments, like science, business, management, astronautics, and English. It was found that 62% of the authors were from computer related departments (group 1), and 26% were from transportation related departments (group 2), while researchers from other departments (groups 3 and 4) accounted for 12% of the total.
Notably, authors from these four groups of departments had very little research collaboration with each other, collaborating in only 20 out of the 160 articles overall. The computer-related and transportation-related departments collaborated in only 11 articles, while they contributed independently in 98 and 36 papers, respectively.
This investigation also uncovered that academia and research institutes have little research collaboration with industry and government, with 11 instances of collaboration between academia and industry and only one appearance of government, in a paper co-authored with academia in 2018. Suffice it to say that research exploring the potential benefits of collaboration among these three sectors, i.e. academia, industry, and government, is needed, as findings in the literature (Anderson and Odei, 2018) suggest that increased collaboration in these domains may boost the efficiency of the proposed methods in real-time, real-world applications.

Method identification and analysis
In this section, we elaborate on the proposed methods from different angles. We start with the categorization of the proposed methods in terms of the method and environment attributes, respectively, in Figures 7 and 8.

Methods' contribution and combination
The proposed methods in the included articles consist of (M1) the applied RL methods in NTSC (154 papers) or (M2) theoretical developments in RL with feasibility or applicability assessment of the proposed methods in different contexts, one of which is the NTSC environment (6 papers).
To get familiar with the variations in the application of RL to different research problems in NTSC and the ways RL is applied, we identified two main groups. (M1.1): Most of the articles (134) are classified in this category, where RL is the only applied method or is used in combination with other ML, GT, and DP methods. Different innovative designs were proposed to tackle the problems of NTSC. The NTSC problem has a continuous state space and an infinite horizon, and is only partially observable and difficult to model. In this context, most of the papers try to improve the performance (e.g. (Moghadam and Mozayani, 2013)), dimensionality (e.g. by means of function approximation (Waskow and Bazzan, 2010; Prashanth and Bhatnagar, 2010)), complexity (e.g. by organizing agents in groups of limited size, and using holonic RL methods (Abdoos et al., 2013)), scalability (e.g. (Nuli and Mathew, 2013)), stability (e.g. (Aslani et al., 2018b)), speed of optimization (e.g. in transfer learning models (N. Xu et al., 2019)), state and/or action space manageability or generalizability (e.g. (Araghi et al., 2015; Gaikwad et al., 2016)), and the centrality problem in applying RL in NTSC. Some studies address the convergence and oscillation problems that commonly appear in the multi-agent context. For instance, (Reda et al., 2019) proposed a model based on Double Deep Q-learning (DDQ) with experience replay and cooperation between agents; they used neural networks (NN) to reduce the correlation between agents and improve performance. Improving the accuracy, optimum functioning, and efficiency of the results is also of high interest to researchers, for example through hierarchical methods (Yizhe Wang et al., 2018b; T. Tan et al., 2019). In several papers the main focus of research is on studying the coordination between agents (e.g. (Higuera et al., 2019)) and the integrated network, specifically signalized intersections and ramp metering (e.g. (El-Tantawy and Abdulhai, 2010; El-Tantawy et al., 2013)). Famous methods proposed in the traffic theory context, such as CTM (Chanloha et al., 2014; Ajorlou et al., 2015; Qu et al., 2020), Max-plus (Kuyer et al., 2008; Medina and Benekohal, 2012), and Max Pressure (MP) or back pressure (brought to NTSC from communication network theory), are also applied in some research, and in other studies multiple traffic optimization goals are optimized simultaneously (multi-policy RL), e.g. (Dusparic et al., 2016).
Considering both vehicular and pedestrian traffic at the network scale is a recent application of RL, where a distributed multi-agent RL method was proposed by (Y. for the first time in 2017. With the increasing use of deep RL methods, the number of papers focused on improving deep RL models in NTSC has increased since 2018 (e.g. (Wei et al., 2018; C. Li et al., 2018; X.-Y. Liu et al., 2018)). The co-learning problem of both classes of learning agents, traffic signals and drivers, with different goals (minimizing individual travel times vs minimizing the queues locally), different natures (driver agents learn in episodes that are asynchronous, while traffic light agents learn continuously, i.e. non-episodically), and the nontrivial task of microscopic modelling and simulation (in which the actions are highly coupled), is another area of research that has been addressed (Lemos et al., 2018). In addition to these lines of research in this category, analyzing what specifically RL does differently from other TSC methods (i.e. analysis of the learned policies) is another research effort, motivated and conducted by (Genders and Razavi, 2020).
M1.2: In category M1.2, the articles applied methods from other fields (other than ML, GT, and DP) in RL or an RL framework (M1.2.1), or used RL within methods from other fields or in a specific design/model/framework for the NTSC context (M1.2.2). Out of the 20 articles in category M1.2, 8 either applied RL to optimization problems (e.g. swarm optimization) or used an optimization algorithm in RL methods/frameworks in the context of NTSC.
M1.2.1 : To cope with non-stationary environments,  proposed, formalized and showed the efficiency of a method called the RL-CD, or Reinforcement Learning with Context Detection, which performs well in non-stationary environments, and better than classic RL algorithms (Q-learning and Prioritized Sweeping). In a similar work, (Oliveira et al., 2006) assessed the feasibility of applying RL-CD approach in a more realistic scenario, implemented by means of a microscopic traffic simulator. (Zhang et al., 2007) showed how to use Conditional Random Fields (CRFs) to model control processes, where CRFs model joint actions in a decentralized Markov decision process and define how agents can communicate with each other to choose the best joint action. The CRF model clearly outperformed the independent agents approach.  enhanced the single-objective controller by developing a multi-agent NTSC system based on a multi-objective sequential decision-making framework using Bayesian interpretation and some innovative reward design. (Zhu et al., 2015) proposed a Junction Tree Algorithm (JTA) based approach to obtain an exact inference of the best joint actions in traffic signal coordination that outperformed independent learning (Q-learning), real-time adaptive learning, and fixed timing plans. To predict future system states and avoid unwanted states, (Yongheng Wang et al., 2016) applied Proactive Complex Event Processing (Pro-CEP) method (in processing proactive traffic congestion control) that uses RL to find the optimal joint policy. This method works well when used to control congestion. To develop learning and adaptation mechanisms to deal with disturbances, (Darmoul et al., 2017) proposed a distributed TSC system based on hybridization between Case-Based Reasoning (CBR) and an adaptation of the reinforcement principle within artificial immune networks (a mechanism inspired by biological immunity). The method controls interrupted flow at signalized intersections.  
studied how the attention mechanism helps cooperation (to minimize the average queue length) by using graph attentional networks to facilitate communication, incorporating the temporal and spatial influences of neighbouring intersections on the target intersection, and building index-free models of neighbouring intersections.
M1.2.2 : (Su and Tham, 2007) integrated sensor networks and grid computing, using web services to implement this integration. They ran Q-learning algorithms on distributed Stargates (small computers with sensor-signal processing capabilities) for NTSC. (Abdoos et al., 2015) modelled a holonic multi-agent system and proposed a holonic RL multi-agent method that improves on the performance of individual Q-learning in NTSC. (Mashayekhi and List, 2015) showed the applicability and efficacy of combining auction theory with RL in a multi-agent system; under low traffic volumes the proposed method outperforms actuated and pre-timed control strategies, but under heavy volumes the pre-timed control strategy outperforms the proposed method. (Iyer et al., 2016) designed a distributed multi-agent method with coordination between agents through the communication of decision data; the paper studies the effectiveness of integrating a fuzzy logic controller (to deal with continuous states and actions) with Q-learning (for learning during the process).
With regard to the optimization-related articles in category M1.2, all of the proposed methods in this category are designed to handle the growing complexity and the curse of dimensionality, and to increase the speed of TSC, by using RL in optimization problems or applying optimization algorithms within an RL method or framework. To achieve cooperation in the long term, as stated earlier, (Mikami and Kakazu, 1994) combined the RL of a local agent with global optimization through Genetic Algorithms (GA), by which the RL parameters are modified. (Cao et al., 1999) used a classifier system with a fuzzy rule representation, with both evolutionary and RL methods, to provide a faster method than hierarchical control. (Cao et al., 2000) incorporated a Learning Classifier System (LCS) and a TCP/IP (Transmission Control Protocol/Internet Protocol) based communication server into a distributed learning control strategy to increase the speed of control. (Prothmann et al., 2009) managed the complexity by taking an organic approach to NTSC and proved the feasibility of the proposed approach. (W. Lu et al., 2011) proposed a multi-agent NTSC using Swarm Intelligence and Neuro-Fuzzy RL to combine the better attributes of both, thereby improving learning speed and performance. (W. Liu et al., 2014) optimized TSC for V2I networks by proposing a cooperative distributed Q-learning algorithm with fast gradient-descent function approximation. (Ozan et al., 2015) applied an RL method to the optimization problem to reach good solutions for NTSC; the results were better than the genetic algorithm and hill-climbing methods under low demand but could not outperform them under medium and high demands. (Tahifa et al., 2015) showed that Swarm Q-learning performs better than standard Q-learning in increasing the speed of TSC.
To alleviate traffic congestion and limit the effects of incidents on traffic flow, (El Hatri and Boumhidi, 2017) proposed a Q-learning based traffic management model, which simultaneously optimizes vehicle re-routing and TSC based on the Multi-Objective Particle Swarm Optimization (MOPSO) method.
M2 : To provide a simple and efficient method to implement, (Natarajan et al., 2011) put into operation a functional gradient boosting approach to imitation learning in relational domains. The proposed approach outperforms both learning a single relational regression tree and performing propositional functional gradient boosting to represent the policy, in all domains. To provide a solution to multi-objective problems with correlated objectives, rather than typical multi-objective problems, (Brys et al., 2014) proposed an RL-based method, called adaptive objective selection, that combines multiple correlated rewards and shaping signals by measuring confidence (i.e. combining the feedback from all objectives instead of looking at only a single one). They formally defined a new class of multi-objective problems, called correlated multi-objective problems (CMOPs), in which the set of solutions that are optimal for at least one objective is so restricted that the decision-maker is less concerned about which of these is found than about how fast one is found or how well one is approximated. (Sadigh et al., 2014) proposed a method for synthesizing a control policy for an MDP such that traces of the MDP satisfy a control objective expressed as a linear temporal logic (LTL) formula, using an RL algorithm that finds the policy optimizing the expected utility of every state in the Rabin-weighted product MDP. They proved that the method is guaranteed to find a controller satisfying the LTL property with probability one if such a policy exists, and suggested empirically, with a case study in traffic control, that their method produces reasonable control strategies even when the LTL property cannot be satisfied with probability one. (Prashanth and Ghavamzadeh, 2016) optimized variance-related risk measures in rewards and demonstrated the usefulness of this approach in an NTSC application.
The risk-sensitive algorithms result in lower variance but higher long-term cost compared to their risk-neutral counterparts.  illustrated the usefulness of modelling human decisions with the Cumulative Prospect Theory (CPT) paradigm in RL and suggested that CPT-based criteria are useful in an NTSC application. (Gan et al., 2019) proposed a dynamic-correlation-matrix based MARL approach in which the meta-parameters are evolved using an evolutionary algorithm in a distributed manner. This was done to provide meaningful theoretical verification through both agent-level implementation and system-level convergence verification. Agents using the proposed learning algorithm reach optimal behaviours faster than with other canonical learning techniques.
In the following, we introduce the types of RL methods used in the articles we included in this paper. Furthermore, we extracted data about other methods that an RL method has been integrated with to provide a solution, whether as a core or combined method.

RL methods
Q-learning (Watkins and Dayan, 1992) ranks first among the RL methods used, appearing in 96 (60%) studies. 13 papers do not specify any particular RL method, typically because the paper proposes a framework or uses only an RL concept; in these cases, we used "RL" in Table 3. The remaining papers use the methods listed (along with their frequencies) in the first column of Table 4.
Furthermore, RL is the core method in 149 (93%) of the studies. In 11 studies, other methods were employed as the core method, with RL used as a combined method (see the second column of Table 4). There are also other non-RL methods that were used in combination with the RL methods or frameworks (see the third column of Table 4). Below is a synopsis of the RL methods.
• Q-learning and SARSA: Both Q-learning and SARSA (State, Action, Reward, State, Action) are critic-only methods that use Q-tables to decide which action to take. The biggest difference between the two is that Q-learning is off-policy (it updates towards the value of the greedy next action) while SARSA is on-policy (it updates towards the value of the next action actually taken).
• Critic-only, Actor-only, and Actor-Critic: The critic uses the calculated Q-values (or function values) to choose its action, while the actor decides using the policy. Actor-only methods work directly on improving the policy. Actor-critic, being a combination of both, maintains both the Q-values and the policy to choose an appropriate action. The papers that used an actor-critic method are identified in Table 5.
• W-learning: W-learning is a multi-policy, self-organising action-selection technique, proposed in (Humphrys, 1996), that builds on Q-learning. In W-learning, there is a competition among selfish Q-learners: agents learn Q-values for the state-action pairs of each policy, and W-values for each state of each of their policies, to explore what happens if the nominated action is not followed.
• Approximate Dynamic Programming (ADP): The computational complexity of DP algorithms, caused by excessively large system states and their need for an exact algorithm and the true value function, makes them impractical for solving large-scale TSC problems. ADP avoids this predicament of computational complexity by replacing the true value function of DP with an approximation function. In other words, it is similar to model-based RL with function approximation. The research papers that used the ADP method are identified in Table 5.
• Policy-Gradient: The policy gradient method does not need to estimate the state or action value functions. It learns parameterized policy functions directly, searching the policy space to maximize a measure based on the accumulated reward; in this way, it averts the convergence problems of estimating value functions.
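As an illustration of searching the policy space directly, the following is a minimal REINFORCE-style sketch for a softmax policy over discrete actions (the logit table, episode format, and hyper-parameters are illustrative assumptions, not a method taken from any reviewed paper):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE step. theta: dict state -> list of action logits;
    episode: list of (state, action, reward) tuples."""
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                    # return-to-go from this step
        probs = softmax(theta[s])
        for i in range(len(theta[s])):
            # gradient of log pi(a|s) w.r.t. logit i: 1{i==a} - pi(i|s)
            grad_i = (1.0 if i == a else 0.0) - probs[i]
            theta[s][i] += alpha * G * grad_i   # ascend the policy gradient
    return theta
```

After a positively rewarded episode, the logit of the taken action rises relative to the others, i.e. the policy itself is improved without any value-function estimate.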
• Inverse RL: In inverse RL (Ng, Russell, et al., 2000), the reward function of an agent (the function the agent tries to optimize) is learned by observing the agent's behaviour over time, the environment model, and the environment measurements. This approach is akin to learning from an expert and is helpful in domains where the reward may not be easily accessible, like TSC (Natarajan et al., 2010). The method has its origins in imitation learning (also called apprenticeship learning, learning by observation, or learning from demonstrations). It is comparable to supervised learning, with the key difference that the examples are not i.i.d. but instead follow a meaningful trajectory (Bagnell et al., 2006).
• Learning Classifier Systems: A learning classifier system (LCS) (Butz et al., 2005) is a rule-based RL system in which each rule (or classifier) is composed of a condition, an action, and a reward (or evaluation). LCS combines an evolutionary process (e.g. a genetic algorithm), with a learning process (e.g. RL), wherein a rule is constructed as {IF 'condition' THEN 'action'}. A genetic algorithm tries to improve condition-action rule space by generating new classifiers from current strong classifiers and removing the weak ones. RL is responsible for selecting the action with the best-rewarded response or evaluation to be executed.
• Learning Automata: In learning automata, action selection is performed based on the last selected action and the received reward. A learning automaton maintains a vector of probabilities over the set of actions, which is updated (i.e. each probability is increased, decreased, or left unchanged) based on the reward.
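A minimal sketch of one classical scheme, linear reward-inaction, may make this update concrete (the scheme choice and learning-rate value are ours; the reviewed papers use a variety of automata schemes):

```python
import random

def la_select(probs):
    """Sample an action index from the automaton's probability vector."""
    return random.choices(range(len(probs)), weights=probs)[0]

def la_update(probs, action, reward, lr=0.1):
    """Linear reward-inaction (L_R-I): on a favourable response, shift
    probability mass toward the chosen action; on an unfavourable one,
    leave the vector unchanged."""
    if reward > 0:
        for i in range(len(probs)):
            if i == action:
                probs[i] += lr * (1.0 - probs[i])
            else:
                probs[i] -= lr * probs[i]
    return probs
```

The update keeps the vector a valid probability distribution, so the automaton converges toward repeatedly rewarded actions.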
• Model-based and Model-Free: Model-based methods provide the agent with part or all of a model of the environment in which it must work; that is, the transition and reward functions are assumed to be available to construct a model. In model-free methods, the agent does not need access to information about how the environment works and learns from experience instead, with fewer restrictions. The papers that used model-based methods are identified in Table 5.

Control method scheme
By analyzing the proposed methods in the included articles, we identified three levels of control: (i) regular network control, which covers the general idea of controlling all traffic signals in a regular network; (ii) perimeter control, which improves the performance of the entire network by controlling the subset of traffic signals located on the boundary of a region in the network; and (iii) streetcar bunching control, which mitigates the effect of streetcar bunching along transit routes by controlling successive signalized intersections.
Regular network control has been the focus of 158 out of the 160 included articles. On the other hand, although there are numerous research works in the domain of perimeter or cordon control, the application of RL to perimeter control was studied for the first time in 2019: (Ni and Cassidy, 2019) explored how RL can be used to re-time traffic signals on the perimeter by developing an RL-based controller with NN architectures that controls the perimeter with spatially-varying metering rates. With regard to the third level of control, (Ling and Shalaby, 2005) conducted research to mitigate the effects of streetcar bunching along transit routes by automating streetcar bunching control by means of multiple RL agents acting on a series of successive signalized intersections. The proposed method was able to effectively split up a streetcar bunch and prevent it from forming again.

NTSC method: Centralization (centralized, hierarchical, and decentralized/distributed methods)
When tackling multi-agent problems, there is a spectrum from centralized to decentralized decision making. In large-scale implementations, multiple agents tackle the given task while communicating with each other. This usually results in quicker optimization, as each agent learns from its neighbours as well as from itself (Tahifa et al., 2015). With multiple agents, the collected data and actions can be stored centrally in a location that all agents can access, so that the system functions as one agent. In this setup, the central agent often makes all decisions for the system, which can slow down the learning process while coordinating the unit (OroojlooyJadid and Hajinezhad, 2019). Although DNNs help enhance the scalability of RL, training a centralized RL agent is still infeasible for large-scale NTSC.
Conversely, a distributed approach can be used to store local information across multiple agents, allowing each one to make its own decision while still communicating with its neighbours (Hüttenrauch et al., 2019). In this approach, all agents are considered "equal" (Baldazo et al., 2019). If the local agents do not communicate with neighbouring agents, the system is called decentralized. Another setup, combining centralized and decentralized/distributed methods, is the hierarchical system, which can be categorized as a centralized method since it involves centralization. It forms a hierarchy of sorts, in which the lower agents may have limited or no ability to act upon the environment without permission from the "leader". Hierarchical control allows agents to perform micro-actions between tasks to improve their finesse.
Most (65%) of the proposed methods in NTSC are designed in a decentralized way. There are 7 papers proposing holonic or hierarchical methods, while the rest are centralized methods that may rarely be applicable to real-world, real-time processes.
RL methods' components and types
State, action, and reward. There are normally three main components in RL: state, action, and reward. Several elements can be used to define the state and the reward across the papers, and the elements of the state can be the same as or different from those of the reward. Based on our collected data from the 160 studies, we identified 35 distinct elements of state and 30 of reward. The top 5 most frequently used elements in state are queue size with 73 occurrences (38%), phase state (11%), number of vehicles (10%), position of the vehicles (6%), and speed (6%). In reward, the top 5 are queue size with 71 occurrences (30%), delay (13%), waiting time (9%), number of vehicles (6%), and number of vehicles that passed the intersection (or, generally, throughput) (4%). The elements of the state in 16 (5%) research papers and those of the reward in 18 (7%) papers are not available. The lists of all the elements of the state (38 unique elements) and reward (39 unique elements) found in the papers are available in Appendix A and Appendix B; this might help new researchers get some idea of how to define these components. These two appendices also depict the most frequently used elements in state and reward, respectively.
[Displaced table fragment — other action types and their frequencies: setting the value of a threshold metric for each traffic signal (2); setting a link-specific metering rate in perimeter control (1); multiple actions (3).]
In addition to state and reward, the action needs to be defined. In the majority of the research papers, the action is defined as a traffic signal or phase switch. However, some use a different definition. These actions include (Act 1) setting the value of a threshold metric for each traffic signal, (Act 2) setting a link-specific metering rate in perimeter (or cordon) control, (Act 3) selecting a route as a driver's action, (Act 4) setting the acceleration of the vehicles, and (Act 5) setting the maximum speed of the vehicles. The last three options are used in mixed environments in which vehicles are considered as well as traffic signals. When two or three types of actions were used, it was reported as "Multiple" in Table 3.
The actions directly related to controlling traffic signals are categorized into two main groups (phase-based and cycle-based) and seven classes, where each class is defined based on cycle length, phase duration, and phase order; each of these three elements can be fixed or variable (see Table 6). The decision point in cycle-based methods is the end of the cycle, where the cycle length, phase duration, or phase order is determined. In phase-based methods, the decision is made at the end of a phase and involves phase duration determination and phase selection; in this case, the phase duration can be fixed for the entire phase or allowed to be extended at the end of the phase. Note that in phase-based methods, the cycle and phase order are not applicable. In Table 6, we removed class 1, in which all three elements (cycle length, phase duration, and phase order) are fixed, as this is not applicable in RL design and is counter-intuitive to optimize.
62% of the papers proposed phase-based methods (i.e. class 7), while 32% constructed their methods on cycle-based methods. Only 2 papers focused on using RL, assuming that the phase duration is fixed. Moreover, only 20 (12%) papers worked on applying RL in a fixed cycle-length setup. There are also 3 papers with multiple actions or sets of actions, and 6 papers in which the action is not clearly or completely defined. The details related to each paper are given in Table 3.
Action selection methods and parameters. Various action selection methods can be used to select the actions. Based on the data we retrieved, 71 (44%) papers did not state which action selection method they used, which is a significant number. ε-greedy has the highest usage and was observed in 51 (32%) papers. SoftMax (or Boltzmann) and greedy methods are used in 17 (11%) and 7 (5%) papers, respectively. Other action selection methods, each with a frequency of 4 or less, are Random, Distributed W-learning (DWL), credit assignment algorithms, ε-softmax, and Upper Confidence Bound (UCB). Two papers employed multiple or combined action selection methods, while two others compared various action selection methods.
• ε-greedy strategy and SoftMax: ε-greedy uses the epsilon term to balance exploration and exploitation of the environment, encouraging the former early on and switching to the latter as the algorithm learns: with probability ε (the exploration rate) a random action is selected, and otherwise the greedy action is chosen. SoftMax behaves similarly, but with weight parameters either assigned to or learned for each action, so that higher-valued actions are favoured. These methods can be quite sensitive to changes in their parameters; the sensitive nature of the weights makes them tricky to learn/find and can significantly affect the performance of the algorithm.
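The two strategies can be sketched as follows (a minimal illustration over a list of Q-values; the parameter values and function names are ours):

```python
import math
import random

def epsilon_greedy(q_values, eps=0.1):
    """With probability eps explore uniformly; otherwise exploit the
    currently best-valued action."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature=1.0):
    """SoftMax (Boltzmann) selection: higher-valued actions are chosen
    more often; the temperature controls how sharply they are preferred."""
    m = max(q / temperature for q in q_values)
    weights = [math.exp(q / temperature - m) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]
```

Annealing eps (or the temperature) toward zero over training realizes the "explore early, exploit later" behaviour described above.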
We collected data about the RL parameters, namely the exploration rate, discount factor, and learning rate, and found that this information is frequently missing: 121, 60, and 86 out of 160 papers, respectively, did not report these parameters. Where the data are presented, the most common option was for the authors to acknowledge only that these parameters lie within the range (0, 1); while this is a piece of information, it is not especially useful, as it is well known among practitioners of reinforcement/machine learning.
State space and action space discretization. Reducing the state space (for states such as queue size, flow rate, and density), the action space, and even the reward space is one way to reduce computational cost and make the methods more applicable to real-time processes in the real world. To this end, continuous states are grouped, for example, into 3 levels: low, medium, and high. Another way is to compare the current state with that of the previous step, which divides the space into two groups: better or worse. Reducing the space by grouping the data may come at the expense of lower accuracy, thus lowering efficiency or optimality. Nonetheless, 59 (37%) of the papers used discretization, while 91 (57%) did not discretize the state or action spaces. 3 (2%) papers used discretization only for the discrete methods that they evaluated, not for the continuous ones.
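Both discretization schemes just described can be sketched in a few lines (the threshold values are arbitrary illustrative choices, not taken from any reviewed paper):

```python
def discretize_queue(queue_length, thresholds=(5, 15)):
    """Map a continuous/unbounded queue length onto the 3-level scheme
    (low / medium / high). Thresholds are illustrative assumptions."""
    if queue_length <= thresholds[0]:
        return "low"
    if queue_length <= thresholds[1]:
        return "medium"
    return "high"

def relative_state(current, previous):
    """Two-level scheme: compare the state with the previous step."""
    return "better" if current < previous else "worse"
```

Either mapping collapses an unbounded measurement into a handful of tabular states, which is what makes look-up-table methods tractable in real time.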
Tabular vs approximation-based methods. Another point that impacts the real-world efficiency of the methods is the use of tabular RL methods, where a look-up table is used to map the spaces. Approximation methods, in contrast, provide a good approximation for states that were not experienced during training. Interestingly, the numbers of papers that used these two approaches are strikingly close: 74 (47%) papers used a tabular method while 72 (45%) used approximation methods.
The approximation methods used in our pool of research studies are shown in Table 7, together with the neural networks applied as approximators. 38 papers used different neural network approximation methods; the second-most used group is tile coding, observed in 7 papers, one of which applied its own proposed method (Prashanth and Bhatnagar, 2010).
• Neural Networks: NNs are loosely modelled after the brain and are constructed for tasks such as pattern recognition, labelling, and processing of data. They are commonly used for clustering, classification, and prediction. In the case of traffic control, prediction is the prominent use, though other uses do occur. NNs consist of 3 main sections: input, hidden, and output. Usually, there is only one input layer and one output layer, but the hidden section may contain more. Each layer contains nodes. In the output layer, the number of nodes often corresponds to the number of possible outputs; in the input layer, it often corresponds to the different types/sources of input data. The number of nodes within the hidden layers depends on what each layer is designed to do and changes with the purpose of the network. There are many types of NNs in deep RL.
-Artificial Neural Networks (ANN): Also known as feed-forward networks; all incoming information is processed and pushed only in the forward direction.
-Recurrent Neural Networks (RNN): Unlike an ANN, this neural network feeds processed data back to previous layers and nodes. It shares parameters across different time steps, which results in fewer overall parameters and thus allows for a smaller network.
-Convolutional Neural Networks (CNN): These networks use an extra step called convolution (for which it was named), which involves applying different "filters" to reduce the complexity of the input data. As the information filters through the network, these filters can be applied to highlight specific features of the data. This is one of the most common neural networks and is used in many disciplines.
With neural networks, deep RL becomes highly suited for complex environments presented by intersections and the dynamic changes that occur within a day of traffic.
• Tile Coding: Tile coding is another well-known function approximator. Unlike continuous methods such as radial basis functions (RBFs), tile coding is a discretization method used in RL. It is a piece-wise constant approximation method that approximates the action-value functions by partitioning the state space into small regions, each with a constant value. Tiling design considers three main components: the width of the tiles, the resolution, and the number of tilings required given the hyper-volume of the whole state space. For more information about the tile coding method, the reader can refer to (Abdoos et al., 2014).
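The idea of overlapping, offset tilings can be sketched as follows (a minimal 2-D illustration with an assumed tile width and number of tilings; real designs tune all three components mentioned above):

```python
def tile_indices(x, y, n_tilings=4, tile_width=1.0):
    """Return one active tile index per tiling for a 2-D point.
    Each tiling is offset by a fraction of the tile width, so nearby
    points share many (but not all) active tiles."""
    active = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings
        ix = int((x + offset) // tile_width)
        iy = int((y + offset) // tile_width)
        active.append((t, ix, iy))
    return active

def approx_value(weights, features):
    """Piece-wise constant approximation: the value of a point is the
    sum of the weights of its active tiles."""
    return sum(weights.get(f, 0.0) for f in features)
```

Because nearby points activate overlapping tile sets, updating the weights for one point generalizes to its neighbours, which is the approximation property the paragraph above describes.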
Deep reinforcement learning based methods. Deep RL takes a different approach to dealing with the complex influx of data associated with traffic. It incorporates neural networks into RL algorithms, combining advances in training layered neural networks that form abstract, high-level representations of the raw input data, which yields non-linear methods (C. Li et al., 2018). These "layers" allow the agent to look at smaller, more reduced versions of the data to extract information without an overload. The neural networks are what allow an algorithm to go "deep" and work with large or complex data input more efficiently.
Here, we compare Q-learning and deep Q-learning. Q-learning is the process of iteratively updating the Q-values of each state-action pair using the Bellman equation until the Q-function eventually converges to Q*. Instead of estimating the Q-value of each state-action pair separately as in Q-learning, deep reinforcement learning algorithms (Mnih et al., 2013) use deep neural networks as function approximators to map from states to Q-values. This makes it possible to use a larger and/or continuous state space through parameterization (Lillicrap et al., 2015). The integration of artificial neural networks (NNs) into the Q-learning process is referred to as deep Q-learning, and a network that uses NNs to approximate Q-functions is called a Deep Q-Network (DQN). In other words, a DQN is Q-learning parameterized with a deep NN with parameters θ, i.e. Q(s, a; θ). The network's input is the state, the number of output neurons equals the number of possible actions, and the targets are the Q-values of each of the actions.
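The target computation at the heart of DQN training can be sketched as follows (for brevity a linear map stands in for the deep network; the names, shapes, and constants are illustrative assumptions, not any paper's implementation):

```python
def q_values(theta, state):
    """Stand-in for a deep network: a linear map from state features to
    one Q-value per action (theta[a] is the weight vector of action a)."""
    return [sum(w * f for w, f in zip(theta[a], state))
            for a in range(len(theta))]

def dqn_target(reward, next_state, theta_target, gamma=0.99, done=False):
    """y = r                                   if the episode ended,
       y = r + gamma * max_a Q(s', a; theta-)  otherwise,
    where theta- is the periodically frozen target network's parameters."""
    if done:
        return reward
    return reward + gamma * max(q_values(theta_target, next_state))

def dqn_loss(theta, state, action, target):
    """Squared TD error for one transition: the quantity SGD minimizes."""
    return (target - q_values(theta, state)[action]) ** 2
```

Training then amounts to sampling transitions, computing `dqn_target` with the frozen parameters, and taking gradient steps on `dqn_loss` with respect to the online parameters θ.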
Unlike Q-learning, whose convergence in the limit is guaranteed, we have no such guarantees for DQN. This is because (i) the data set is not i.i.d. and (ii) the targets move as the agent learns (Van der Pol and Oliehoek, 2016). To ameliorate these issues, different methods can be used, such as the dueling architecture (Ziyu Wang et al., 2016) to improve stability and the target network (Mnih et al., 2015) to address the overoptimism problem. Dropout (Srivastava et al., 2014) can also be used to make the controller more robust and prevent the neural network from overfitting.
The learning rate needs to be optimized when training deep neural networks. For this purpose, different optimization methods can be used, such as stochastic gradient descent, Adagrad, and RMSprop. Along these lines, the Adam optimizer (Kingma and Ba, 2014) is an adaptive learning-rate optimization algorithm that is computationally efficient, has little memory requirement, and is generally fairly robust to the choice of hyperparameters (Goodfellow et al., 2016). In addition, experience replay is used to help with the stability and convergence behaviour of the algorithm when using a non-linear function approximator.
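Experience replay, mentioned above, is commonly realized as a fixed-size buffer sampled uniformly; a minimal sketch (the class name and capacity are illustrative, not tied to any reviewed implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions; uniform sampling breaks the
    temporal correlation between consecutive training examples, which
    is one source of DQN's instability."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Each training step draws a random mini-batch from the buffer rather than the most recent transitions, so the examples fed to the network are closer to i.i.d.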
Deep learning was used in 27 (17%) papers. This is certainly a reasonable portion of the studies, and it is worth noting that the use of deep learning in RL at the network scale began in 2016 with 2 papers. After a year with no cases, deep learning was used in 5 studies in 2018, followed by a sharp increase in 2019 with 18 papers. This trend is expected to continue, as in the first quarter of 2020, 3 out of 6 research papers used deep learning. The papers that used deep RL methods are identified in Table 5. SUMO was used in 16 (out of 28) of the simulations conducted for the deep-learning-related methods. The traffic simulation software tools used for deep learning are depicted in Figure 9.

Environment attributes and traffic simulation
In this section, we discuss the environment attributes, including network type, vehicle class, data source, and data communication processing. This classification is presented in Figure 8.

Simulation and simulated networks
16 traffic simulator software tools or platforms are identified in these studies. These include SUMO, VISSIM, PARAMICS, GLD, AIMSUN, ITSUMO (Silva et al., 2004), CityFlow, TRANSYT/TRANSYT-7F, MATLAB, AIM (Dresner and Stone, 2008), TSIS, MATISSE (Torabi et al., 2018b), USTCMTS (used in (Shi and F. Chen, 2018)), SMPL (used in (Su and Tham, 2007)), and SeSAm (used in (Bazzan et al., 2007)). See the frequency distribution of the traffic simulators in the included papers in Figure 9. Since the sources of the last three simulation software tools could not be found, we refer the reader to the papers in which they are used for further information.
Of interest is that SUMO was used for the first time in 2015 and has since become the most frequently used software in this area, with 17 out of 27 occurrences in 2019 alone. The second most frequently used software (i.e. VISSIM) started being used in 2010, and PARAMICS, the first well-known traffic simulator, has been in use since 2003. Prior to this, the tools were either custom-built or not clearly outlined in the research. In 18 studies, the authors designed or used a custom-built environment for simulations. What is surprising is that 27 studies (16.7%) did not state the simulation tools they used for their tests, not counting those that did not use any simulations.
With regard to the type of maps used by the included studies to test the proposed RL methods, of those that used maps, 100 (62%) papers used a synthetic map, 45 (28%) used a real-world map, and 9 (6%) used both types (see Figure 10). Moreover, 49 real maps from 14 countries are used for simulation purposes, with China topping the list at 10 maps, followed by the USA, Ireland, and Canada (see Figure 10). These maps are mostly large-scale networks of intersections. Figure 11 shows the distribution of the number of intersections studied. Though many studies are still conducted with fewer than a dozen intersections, there is also work on 25+ intersections. And while 44% of the papers run simulations in small-scale networks with 8 intersections or fewer, we observed that the recent trend is to test the proposed methods in medium and large-scale networks. The size of the network is important, as it drives the exponential growth of the state and action spaces and, more generally, the complexity and computational effort required to reach a solution through an RL method, in both the training and testing stages.
We categorized the types of testbed networks into three main groups: (1) networks of intersections, (2) arterial networks, and (3) signalized roundabout networks. As already mentioned, the isolated intersection case is excluded from our study. An arterial network is an open network, as opposed to a closed network. Arterial network control is applied to a sequence of intersections to give preference to progressive traffic flow along the arterial. Unlike isolated intersections, the intersections in an arterial network operate as a system, and the system coordinates the timing of adjacent intersections. The network of intersections, on the other hand, is considered a closed loop, and as such demands at least four intersections. Networks of two and three intersections may exhibit the characteristics and behaviour of both groups (network of intersections and arterial network); therefore, we considered them separately as two sub-groups under the first group, i.e. the network of intersections. Finally, signalized roundabout networks deal with both approaching and circulatory lanes, which makes them fundamentally different from the first two groups: they involve more than one set of traffic signals in a node. (Rizzo et al., 2019b,a), both published in 2019, are the only two papers in the literature that studied signalized roundabouts, which indicates a potentially open research problem in the RL area. The dynamics of signalized roundabouts are complex: the conflicts between approaching and circulatory flows cannot be solved by metering only the approaching lanes and reacting accordingly, because the circulatory lanes may be occupied. (Rizzo et al., 2019b) proposed a deep RL method for signalized roundabouts in congested networks to maximize traffic flow while avoiding traffic jams in connected junctions.
In another study in this domain, (Rizzo et al., 2019a) investigated the possibility of deriving explanations from a neural network agent (trained using Policy Gradient) for TSC in a signalized roundabout. They explored how the agent learns to react differently based on each specific lane's traffic by implicitly predicting the route of the traffic and thus its future circulatory occupancy. This is done by analyzing the relation between the agent's phase preferences and the actual traffic, assessing the agent's capability of reacting to the current detector states, and estimating the effect of the road detector states on the agent's selected phases through the SHAP model-agnostic technique. The results reveal that it is possible to extract meaningful explanations of the decisions taken by the policy. Further research involves studying the trade-off in accuracy compared with a complex deep learning controller.
Only 17 papers (out of 189 network configurations in 160 papers) worked on arterial networks, all with 4-8 intersections except one with 16 intersections, which is the maximum number in an arterial network. The highest number of intersections in a network is 225 (Yongheng Wang et al., 2016; Ni and Cassidy, 2019). In 5 of the papers that we came across, the number of intersections in the network or arterial network is not reported, and in 6 papers no simulation is used (see Figure 11). We identified three vehicle classes: private vehicles (cars), public transit (buses), and emergency vehicles (ambulances). Among the 160 articles, only 3 addressed public transit (2 papers) or emergency systems (1 paper) in RL-NTSC.
In the public transit domain, there are two types of vehicles: buses and streetcars. We address streetcars in one of the three control schemes, namely streetcar bunching control. Excluding streetcars, there are 2 research works regarding buses; one considers transit priority, while the other does not. In 2014, (Chanloha et al., 2014) developed a distributed CTM-based MARL for network-scale signal control with transit priority that outperforms pre-emptive and differential priority control methods because of its improved awareness of the signal switching cost. To eliminate the need for feature extraction in the state space and to directly use the information received from high-detail traffic sensors, (Shabestray and Abdulhai, 2019) proposed a multimodal deep RL based traffic signal controller that handles both regular traffic and public transit and minimizes the overall travellers' delay through the intersection.
With regard to emergency vehicles, in 2017, to detect and give priority to emergency vehicles, (Kristensen and Ezeora, 2017) proposed a reinforced traffic control policy that reduces the waiting time of emergency vehicles at intersections as well as the travel time of other vehicles, using a multi-agent system development framework (JADE). Also, (Y. ) is among the very few articles that addressed the pedestrian element along with the private vehicle class in the network. It is important that the map considered as a testbed be consistent with a real-world set-up. For example, one-lane or one-way crossing links cannot replicate the common cases in real-world scenarios. This may dramatically impact the usability and performance evaluation of the methods, specifically in terms of computational efficiency. Based on the collected data, most of the papers used a good level of complexity in the number of lanes and turns: 24% of the papers used real-world maps with multiple lanes, 28% used two or more lanes (including 13% using three or more lanes) with a good level of complexity, and 20% used synthetic maps with multiple lanes but without significant information on the number and types of turns (see the number of publications in each category in Figure 10). By a good level of complexity, we mean that in a regular driving style the through and left lanes are involved regardless of the right turn. The through and left lanes in opposing approaches have conflicting rights of way and add to the state and action spaces, which increases the complexity of the problem. In the regular driving style, right turns can usually be accommodated simultaneously with either through or left lanes and do not impact the state and action dimensions. Likewise, in a right-side driving style, the through and right lanes are considered regardless of a left turn (the same logic as in regular driving applies here).
A low level of lane/turn complexity, including one-lane links, one-way crossing links, or multiple-lane links with only through lanes, is present in 25 (15%) of the papers. 9% of the papers did not reveal enough information about the lanes and turns.

Traffic data collection and traffic demand
We identified four categories of data sources for the proposed methods: general detection devices (66%), loop detectors (17%), vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), and infrastructure-to-infrastructure (I2I) communication (11%), and camera/image (6%). By general detection devices, we mean that the authors did not specify any particular data source. The papers that use loop detectors provide either no specific design or specific designs such as (1) one detector at the stop-line, (2) one detector at a distance from the stop-line, (3) one detector upstream (at the end of the lane), (4) two detectors, at the stop-line and at a distance from the stop-line, or (5) two detectors, at the stop-line and at the end of the lane.
The third category covers methods in V2V, V2I, and I2I environments. Connected vehicles are a specific class in this category, where vehicles can exchange data with other vehicles and with the infrastructure. The fourth category comprises the methods that either use traffic cameras as a data source without any processing of the images or are inherently image-based. In image-based methods, there are two directions: (i) an image-like representation, i.e. an image representation of vehicles' positions, where the image is a matrix of pixel values of a view of the intersections, and (ii) images retrieved from traffic simulation software used as the data source.
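The first direction, an image-like representation of vehicles' positions, can be sketched as a simple occupancy matrix. The function name, dimensions, and input format below are illustrative assumptions, not taken from any specific paper in the review:

```python
import numpy as np

def position_matrix(vehicle_positions, lane_length=150.0, n_cells=30, n_lanes=4):
    """Build a binary image-like state: rows are lanes, columns are
    equal-length cells; a cell is 1.0 if a vehicle's position falls in it.

    vehicle_positions: list of (lane_index, distance_from_stop_line) pairs.
    """
    cell_len = lane_length / n_cells
    state = np.zeros((n_lanes, n_cells), dtype=np.float32)
    for lane, dist in vehicle_positions:
        cell = int(dist // cell_len)
        if 0 <= lane < n_lanes and 0 <= cell < n_cells:
            state[lane, cell] = 1.0
    return state

# Example: two cars on lane 0 at 3 m and 12 m, one car on lane 2 at 40 m.
s = position_matrix([(0, 3.0), (0, 12.0), (2, 40.0)])
```

A matrix of this shape can be fed directly to a convolutional network, which is one reason this representation became popular in deep RL-NTSC work.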
In line with the data and the data sources, the use of computing paradigms in the data communication process in RL-NTSC is a new research area; research on the application of computing paradigms, specifically edge and fog computing, in RL-NTSC is topical and began in 2019 with three papers. (R. Gao et al., 2019) designed an edge computing framework for the NTSC setting and proposed a cooperative NTSC algorithm based on MARL to avoid the curse of dimensionality, provide minimal response time, and reduce network load. (Zhou et al., 2019) proposed a large-scale Edge-based RL (ERL) solution to better alleviate congestion in complex traffic scenarios based on Edge Computing nodes for traffic data collection, concluding that ERL in distributed edge servers has much better scalability and faster deep NN training than a cloud service. (Q. Wu et al., 2019) proposed a traffic control architecture based on the fog computing paradigm and a distributed RL algorithm (connecting traffic signals, vehicles, fog nodes, and the traffic cloud) to overcome communication bandwidth limitations and reduce communication delay, make real-time traffic condition information available to vehicles, and lower the probability of traffic congestion in the city by generating a traffic signal control flow and a communication flow for each intersection. This is not only suitable for current vehicles but even more useful for the driverless vehicles anticipated in the future, as they will be able to plan their routes much more intelligently with information from the fog nodes.
Although numerous methods for traffic signal control are listed in the literature, they are mostly effective under low traffic loads. The main challenge in traffic control is managing high traffic loads, representative of the morning or afternoon rush hour, which may lead to spillback conditions on different links. Our research found that 74% of the studies simulated traffic conditions at high demand, close to saturation, saturated, or oversaturated, to test the efficiency of the proposed methods under congestion (labelled as Sat in Table 3). Nevertheless, among these, only 5% explicitly applied and addressed a spillback prevention strategy, and 6% analyzed or mentioned that the proposed method is able to prevent spillback (labelled as Spillback in Table 3). 6% of the papers explained spillback but never considered or applied it, whereas in 56% of the studies with high demand, spillback is not even mentioned. 22% neither addressed high demand and spillback nor explicitly mentioned them (labelled as Neither in Table 3), and 4% provided no evaluation (labelled as No Evaluation in Table 3). Figure 12 depicts the NTSC application domains (or scenarios) and the frequency of publications in each domain based on the control method scheme and environment attributes defined in Figures 7 and 8. The references to the publications, except for the first row to save space, are also provided. This figure helps to locate which areas have already been researched, which publications belong to each area, and, based on the frequency of publications, in which areas there is still room for improvement and development. Figure 13 is an infographic timeline that aids in identifying past and current trends, specifically highlighting the research areas and challenges that have come to the fore most recently. It essentially indicates the major first events in RL in NTSC.
This helps to identify when each research line began. In addition, the list below includes key statements made by the authors of the studies included in this paper.

Major first events in RL-NTSC and authors' key statements

Su and Tham, 2007: This application of Q-learning and SensorGrid can be seen as the first step towards expanding the usage of SensorGrid.

Kuyer et al., 2008: The first application of max-plus to a large-scale problem (not only small applications), thus verifying its efficacy in realistic settings.

El-Tantawy and Abdulhai, 2010: The first study to tackle the integrated traffic control problem (Ramp Metering (RM), Variable Message Signs (VMS), and Signalized Intersections (SI)) to find a closed-loop optimal control solution using a coordination mechanism that minimizes the communication requirements.

Prashanth and Bhatnagar, 2010: The first application of RL with function approximation for NTSC.

Waskow and Bazzan, 2010: The first attempt to tackle the dimensionality problem in MARL by means of function approximation.

Natarajan et al., 2011: The first adaptation of a Statistical Relational technique for the problem of learning relational policies from an expert (imitation learning).

Prashanth and Bhatnagar, 2011: The first to design RL-based NTSC algorithms that minimize a long-run average cost criterion.

Nuli and Mathew, 2013: No traffic adaptive control model exists to account for traffic heterogeneity and limited lane discipline.

El-Tantawy et al., 2014: The first study to investigate the effect of TD(λ) methods for NTSC as a continuing task (i.e., not a finite episode) with a discounted reward, in which looking ahead to future steps is less important than in a finite episodic task with undiscounted reward.

Zhu et al., 2015: The Junction Tree Algorithm has not been applied to address the coordinated signal control problem.

Yongheng Wang et al., 2016: Other works with the same functionality (predicting future system states by Proactive Complex Event Processing) have not been found.

The first work to combine cumulative prospect theory (CPT) with RL, and to investigate (and define) human-centered RL.

Darmoul et al., 2017: The first to integrate case-based reasoning and RL for NTSC and to integrate immune features within MARL to achieve disturbance management.

Wei et al., 2018: None of the existing studies have used real traffic data to test their methods.

C. Li et al., 2018: The applicability of deep RL to the road network has not yet been studied.

Torabi et al., 2018a: Validated on the largest realistic simulated traffic network published to date for collaborative multi-agent based NTSC.

Vinitsky et al., 2018: The first to propose a standard set of benchmarks for traffic control in a micro-simulator and a framework for simultaneously learning control for a mixture of AVs interacting with human drivers and infrastructure, in which deep RL can be applied to the control task.

Zhou et al., 2019: Edge-based RL is the first RL proposal to optimize traffic signals at neighborhood scale.

T. Tan et al., 2019: The first attempt to use hierarchical deep RL models in large-scale NTSC.

Chu et al., 2019: The first paper to present a fully scalable and decentralized MARL algorithm for the state-of-the-art deep RL agent, independent advantage actor critic (IA2C), within the context of NTSC, by extending the idea of independent Q-learning to A2C.

Wei et al., 2019a: The first time that an individual RL model automatically achieves coordination along an arterial without any prior knowledge.

N. Xu et al., 2019: The first work to consider the impact of slow learning in RL on real-world applications through the effective transfer of RL algorithms trained on simulated traffic to real-world traffic, reducing the mistakes made in the real world.

Rizzo et al., 2019b: The first to address signalized roundabouts in a congested network, as a complex TSC scenario, using a deep RL method.

Zheng et al., 2019a: The first work to reduce the problem space and explore different scenarios more efficiently, so that the RL algorithm can find the optimal solution within a minimal number of trials, instead of blindly exploring repeated situations.

Wei et al., 2019b: The first work to use a GAN in RL for NTSC and to conduct experiments on a large-scale road network with hundreds of traffic signals.

Ni and Cassidy, 2019: The first to extend RL to the cordon-control problem.

Rizzo et al., 2019a: The first to consider model explanation methods such as LIME and SHAP for the explanation and interpretation of RL agents' decisions that can be verified by domain experts (RL with explainability).

P. Chen et al., 2019: This study is among the earliest to apply deep RL for arterial adaptive signal control.

Code availability
Another feature that we investigated is the availability of the code, which we consider a good resource for newcomers to the field and useful for reproducing the research. We found that in 10 NTSC papers the authors made their code available; these can be found in Table 5. (Brys et al., 2014) is the first paper in the area of RL in NTSC that made its code available, in 2014. 7 of these 10 papers are deep methods published in 2018 and 2019. 2 papers provide actor-critic methods, while the rest are based on Q-learning. The code is written for SUMO (5 papers), CityFlow (3 papers), GLD (1 paper), and AIM (1 paper). The approximation methods include NN, SPSA, phase gate, and tile coding.

Evaluation
150 (94%) of the studies provided an evaluation, while 7 (4%) did not. Three (2%) papers provided only self-comparison, meaning that they compared variations of the proposed method with each other but not with other NTSC methods. The papers that did not provide an evaluation are identified in Table 5.
The authors used different TSC methods and performance measures to compare and validate their proposed methods. Fixed-time methods alone are insufficient for evaluating other methods because they are unable to adapt to changes in traffic flow; however, they are adequate for exploring the feasibility of a proposed method or as a proof of concept. We found that 27 (17%) of the studies used only fixed-time or random methods/policies for comparison. In 39 (24%) cases, non-RL TSC methods such as actuated and adaptive methods are included, which provide a better evaluation. In 76 (47%) studies, RL methods have been used, either alone or together with other types of methods. Involving RL methods in the evaluation, however, cannot always guarantee a perfect evaluation. Generally, if an RL method is used along with actuated and adaptive methods, it can provide a strong evaluation, specifically when the comparison is made with state-of-the-art RL methods proposed by other authors in the field. We collected these RL methods as a reference for the readers in Table 9. The table also provides the citations of referenced papers that used evolutionary and meta-heuristic algorithms, real-world, fixed-time, and adaptive methods. It might be of interest to the reader to know the number of methods used in these papers for evaluation/comparison purposes: 1 (33%), 2 (31%), 3 (14%), 4 (5%), and 5 to 8 (5%), demonstrating that comparing a method with only one other method is the most common practice.
Among the performance measures, delay is the most frequent, with 71 occurrences (20%), followed by travel time, waiting time, and queue size (each 12%); number of stops, speed, and throughput (e.g. the number of vehicles that passed the intersection) (each 6%); and environmental measures (5%); together these account for 80% of the papers. We found 33 unique performance measures, listed in Appendix C, where each unique measure represents several similar measures. The list shows a variety of measures that authors may consider or use in their subsequent research. Appendix C also depicts the top ten performance measures.
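The most frequent measure, delay, is commonly computed from simulation output as the extra travel time relative to free-flow conditions. The sketch below is a hedged illustration; the function name and input format are our assumptions, not any particular simulator's API:

```python
def average_delay(trips):
    """Average delay over completed trips.

    trips: iterable of (actual_travel_time_s, free_flow_time_s) pairs,
    one per vehicle; delay is the difference between the two.
    """
    delays = [actual - free for actual, free in trips]
    return sum(delays) / len(delays) if delays else 0.0

# Three vehicles needing 95 s, 80 s and 120 s on a 60 s free-flow route:
# average delay is (35 + 20 + 60) / 3 ≈ 38.3 s.
```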

Common future works and research opportunities
During the course of our investigation, we noticed a few recurring steps that authors took in order to advance their research. One of the most recommended areas for future investigation is testing the traffic signal controllers in the real world. This comes as no surprise, given that the final hurdle from theory to implementation is to see whether a concept can successfully direct traffic at busy and unpredictable intersections, and not just in simulations. Authors also look to expand their work to larger networks, to increase the number of phases that controllers can select, to make traffic signal control a multi-agent system, and to adapt to bigger intersections. In the same vein, diversifying the proposed traffic signal formulations so that they could potentially be of greater use to more people is also important. Adapting plans to different modes of transport, including motorized traffic such as public and mass transit, taxis, and freight vehicles, and non-motorized traffic like pedestrians and bikes, is challenging at best. Incorporating better communication methods between controllers, accounting for delays in communication, and addressing noise (unwanted data) that sensors might pick up are other points of focus for the future. Still another popular theme arising from our study is improving the performance of controllers. Specifically, a common direction is to change the definition of the reward function and obtain an improved state space, thereby allowing the controller to render better decisions. The final future implication from this study is online learning, a popular choice since its strength lies in the controller's ability to continuously adapt to traffic conditions. Although some efforts are already underway to achieve this, such as reducing the time required for learning during this continuous process, the area is still in its infancy.

Key findings
This paper allows us to take a comprehensive view of the past 25 years of research on applying RL to NTSC. This view shows that the community has employed classical approaches (e.g., Q-learning) in the vast majority of the investigations. Thus, we see a large avenue for extensions, especially given that Q-learning is a tabular method and, as such, is not fully equipped to deal with continuous spaces and/or with centralized approaches, in which the state space tends to be vast. Related to this issue, deep learning is advancing to fill the gap left by methods that cannot fully handle huge state (and possibly action) spaces. The number of papers employing deep learning is increasing, as demonstrated in our literature review.
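The tabular limitation noted above can be made concrete with a minimal sketch of a single signal agent. All names, the state encoding (discretized queue lengths per approach), and the two-phase action set are illustrative assumptions; with k queue bins and m approaches, the table grows as k**m per intersection, which is why tabular methods struggle at network scale:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
ACTIONS = [0, 1]  # e.g. 0 = north-south green, 1 = east-west green
Q = defaultdict(float)  # (state, action) -> estimated value, one table row each

def choose_action(state):
    """Epsilon-greedy selection over the tabular Q-values."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Standard one-step Q-learning update."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# One illustrative transition: queues (3, 1) -> serve phase 0 -> queues (1, 2),
# with reward defined as the negative total queue after the action.
update((3, 1), 0, -3.0, (1, 2))
```

Every distinct queue combination needs its own table entry, so continuous measures (densities, speeds) must be discretized first; deep function approximation removes that requirement.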
The use of non-commercial microscopic traffic simulators is on the rise, with SUMO being used more and more, especially within the computer science community. Associated with this trend is an increase in the exchange of code and experiences (e.g., SUMO has an active mailing list), which is certainly positive.
Furthermore, we have noted that there is a lack of interaction between traffic and transportation engineering practitioners and researchers investigating the use of RL on NTSC. One of the consequences is that a high number of papers do not include or deal with real data, thus challenging the proper validation of the experimental results. Also, no testbeds are being proposed. In fact, real-world scenarios are lacking for which one could find, at the very least, a detailed map (including geometry), actual demand, fixed-time signal timings, and target measurements for comparison purposes. Moreover, the creation of testbeds would likely bring different communities together around common goals.
Based on the data revealed by the authors of the included papers and our analysis, the literature on RL in NTSC motivates the following open research problems and research opportunities for future work: • Using different RL methods for different research topics in NTSC: A few studies compared the efficiency of different methods. For instance, in (Aslani et al., 2018a), several methods are evaluated, including discrete-state Q-learning(λ), discrete-state SARSA(λ), discrete-state actor-critic(λ), continuous-state Q-learning(λ), continuous-state SARSA(λ), continuous-state actor-critic(λ), and continuous-state residual actor-critic(λ), which combines the residual algorithm with actor-critic(λ). In this study, continuous-state actor-critic(λ) showed the best robustness and performance. In another study, (Chu et al., 2019) used independent advantage actor critic (IA2C), where deep neural networks are employed for both policy and value approximation. Using different RL methods may provide insight into how to improve the performance, robustness, speed, and efficiency of the proposed methods.
• Using various state, action, and reward elements in RL methods: Defining and designing effective states, actions, and reward functions is crucial for the RL process to reach efficient results. Appendix A, Appendix B, and Table 6 present the state, action, and reward elements that have already been used and can guide new definitions of these components.
• Using and extending the idea of independent RL: In independent RL, the local agents learn their own policies independently by modeling the other agents in the environment. This approach is scalable; however, its convergence issues need to be addressed, for example by using experience replay.
• Using deep RL and hierarchical deep RL methods in large-scale networks: The efficiency of deep RL methods and hierarchical methods has already been discussed. The first hierarchical deep RL in a large-scale network was published in 2019.
• Using deep RL for arterial networks: This topic was only researched in 2019.
• Automatically achieving coordination along an arterial without any prior knowledge using RL: Coordination along an arterial can be achieved by using (i) conventional coordination systems, where a fixed offset among all intersections is used (Urbanik et al., 2015), (ii) optimization-based methods (Little et al., 1981), or (iii) centralized RL-based optimization methods, which jointly model the actions of the learning agents (Van der Pol and Oliehoek, 2016). The methods of the last option are computationally expensive, as they need to negotiate between the agents across the entire network. An alternative is to use decentralized RL agents to achieve coordination. Research into this interesting area was conducted only once, in 2019.
• Using different function approximations (e.g. GAN) in RL methods: It is worth noting that the application of RL with function approximation started in 2010. Each function approximation has its own advantages and strengths in different contexts, environments, and scales.
• Establishing an appropriate tradeoff between optimality and scalability: This is a long-run research topic that is still very important, specifically in real-world large-scale networks.
• Defining manageable state-space, action-space, and reward function: Considering the high volume of computation efforts in RL and the need for faster solutions, specifically in online learning, the necessity of defining efficient state-space, action-space, and reward function is of importance, while keeping the accuracy of the results.
• Reducing the problem space: For instance, by not blindly exploring repeated situations, one can reduce the problem space. This was researched for the first time in 2019.
• Defining multiple reward functions for different traffic situations: Generally, a single reward function is defined for all traffic conditions; however, rewards can be defined dynamically in response to the traffic states, and a multiple-reward structure can be used (Ngai and Yung, 2011). This can be done either with pre-defined rewards for different analysis periods where the congestion level is known in advance (Houli et al., 2010), or by dynamically adjusting the reward function with the varying congestion states at the intersection.
• Using real traffic data, realistic networks, and large-scale networks, and considering traffic heterogeneity: Real traffic is highly dynamic over time, and a realistic traffic network environment poses more challenges and concerns when applying RL than simple hypothetical setups using traffic simulations. Although a few research papers have recently focused on this challenge (Wei et al., 2018; Torabi et al., 2018a), several challenges remain in applying RL in real-world setups. Considering traffic heterogeneity in traffic control models can also be of interest for research (Nuli and Mathew, 2013).
• Using RL and SensorGrid: The application of SensorGrid, i.e. the integration of sensor networks and grid computing, in RL started in 2007, but its usage in NTSC has never expanded since then.
• Integrating concepts, methods, and frameworks from other fields with RL, such as JTA (for coordinated TSC problems), Pro-CEP (for predicting future system states), CPT (for human-centered RL), and immune networks (for disturbance management): Only 4 papers applied RL within methods/models/frameworks from other fields (i.e. group V3), and this trend has ceased since 2016. Nevertheless, the application of other methods within RL (i.e. group V2) is still in progress. Except for general NTSC (AD1), we do not observe any applications of V2 to V5 in the other NTSC application domains. This essentially motivates proposing methods that incorporate various RL methods with methods from other fields and optimization methods, and developing the theoretical aspects of RL methods.
• Using RL in optimization problems and optimization methods in RL: Optimization algorithms, such as swarm optimization, can be integrated with RL for improvement. For instance, swarm optimization can be applied to find the optimal parameters of the reward function as a sub-problem in RL (W. Lu et al., 2011). It can also be applied to rapidly find the global optimal solution for functions with a wide solution space (Tahifa et al., 2015). Conversely, an RL method can be applied within an optimization problem to reach good solutions for signal timing optimization (Ozan et al., 2015). The integration of optimization algorithms and RL is neither very recent nor very frequent (appearing in only 8 articles), but it has shown efficiency gains compared to using RL or the optimization algorithm alone.
• Developing more theoretical RL methods, for example: (i) learning relational policies from an expert (i.e. imitating an expert) (Natarajan et al., 2011), (ii) defining a new class of multi-objective problems called CMOP (Brys et al., 2014), (iii) synthesizing a control policy for an MDP such that traces of the MDP satisfy an LTL control objective (Sadigh et al., 2014), (iv) optimizing variance-related risk measures in rewards (Prashanth and Ghavamzadeh, 2016), (v) modeling human decisions in RL, and (vi) proposing a dynamic correlation matrix based MARL approach to reach optimal behaviours faster than other canonical learning techniques. More theoretical methods can still be proposed and examined to address the requirements of the NTSC problem.
• Using RL in combination with traffic theories (e.g. CTM, MP, shock wave theory, etc.): Although the usefulness of this type of combination has been shown in a few research works (Chanloha et al., 2014; Ajorlou et al., 2015; Qu et al., 2020), this direction can be developed further, for instance by combining shock wave theory and RL.
• Using RL as a core or combined method in integration with other methods: Part of the research works employed the integration of methods and frameworks from other fields with RL; these are listed in Table 4.
• Using RL in the integrated traffic systems/networks, including ramp metering, variable message signs, network of signalized intersections, arterials, and freeways: The integration of different traffic systems is a topic that started in 2010 and contains a few research works, yet there is room for more exploration and improvement.
• Using statistical relational techniques, such as imitation learning: This topic started in 2011, but the number of publications in this area has been very limited since then.
• Using transfer learning in RL: That is, transferring RL algorithms trained on simulated traffic to real-world traffic to reduce the mistakes made in the real world. This topic is currently in demand in the area of NTSC to make RL methods well suited to, and deployable in, real-world setups.
• Using RL in Signalized roundabouts, specifically in congested networks: The idea of applying RL for signalized roundabouts was first proposed in 2019 in 2 articles.
• Using RL in perimeter or cordon control: Although there are several publications in the area of perimeter control, RL was applied to this area only once in 2019, which shows a great opportunity to investigate new solutions in this area of research.
• Focusing on explainability when using RL: This is a very recent topic, first explored in 2019, in which RL agents' decisions can be explained, interpreted, and verified by domain experts. Only 1 article has been published on this topic.
• Cloud/edge/fog-based RL: High-latency communication between vehicles (or driverless vehicles) in a connected vehicles network and limited communication bandwidth in real traffic infrastructure are among the problems from which NTSC still suffers. Integrating computing paradigms such as edge and fog with RL algorithms has been shown to be a good solution to these research problems (Q. Wu et al., 2019). This topic was researched in 2019 in 3 articles, which shows great potential and a tendency towards improving communication efficiency when using RL.
• Using RL for public transit, whether with or without public transit priority: Modeling and analyzing bi-modal traffic environments, including cars and buses, is very important in urban traffic management and is a long-standing research area. However, there are only two publications in the area of RL-NTSC, published in 2014 and 2019.
• Using RL for emergency vehicles: We found only one study that uses RL in the domain of emergency vehicles. The study models and simulates a TSC for autonomous vehicles at intersections that also gives priority to emergency vehicles. With advances in communication between vehicles and infrastructure, controlling traffic signals based on emergency vehicles in both traditional and connected-vehicle environments using RL should become a hot topic in the area of RL-NTSC.
• Using RL in image-based NTSC methods: Based on our systematic review of the area of RL-NTSC, we did not find any articles that use real images from video detection devices as a data source for the proposed RL methods. Instead, image representations and images from simulation have been used for this purpose. Image-based RL methods in NTSC will be a great avenue for research to adapt to real-world situations.
• Using RL in automating streetcar bunching control on transit routes: This interesting idea is not new (dating back to 2005), but it was never extended in further research works.
• Considering the use of RL in mixed environments, including regular, connected, and autonomous vehicles: With the development of wireless communication, connected-vehicle environments (known as vehicular ad hoc networks, or VANETs) provide the capability to collect real-time traffic information for adaptive TSC. Connected-vehicle technology facilitates two communication modes, vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I), in which vehicles send their vehicle identification number (ID), position, speed data, and a timestamp to the intersection agents. The intersection agents process this information and can share it with neighboring intersection agents. Using RL in this type of network in NTSC is common, and research in this area continues to address several challenges, such as the difficulty of transportation modelling and optimization.
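The V2I exchange described above can be sketched in a few lines of code. The following is a minimal, hypothetical illustration (the class and field names are our own, not from any surveyed system): vehicles send an ID, position, speed, and timestamp to an intersection agent, which aggregates the messages and shares a summary with its neighbors.

```python
from dataclasses import dataclass, field

@dataclass
class V2IMessage:
    """Hypothetical V2I payload carrying the fields named in the literature."""
    vehicle_id: str
    position: tuple   # (x, y) in meters, relative to the intersection
    speed: float      # m/s
    timestamp: float  # seconds

@dataclass
class IntersectionAgent:
    """Collects V2I messages and shares an aggregate with neighboring agents."""
    name: str
    messages: list = field(default_factory=list)
    neighbors: list = field(default_factory=list)

    def receive(self, msg: V2IMessage) -> None:
        self.messages.append(msg)

    def mean_speed(self) -> float:
        if not self.messages:
            return 0.0
        return sum(m.speed for m in self.messages) / len(self.messages)

    def share_with_neighbors(self) -> dict:
        # Neighbors receive a compact summary rather than raw messages,
        # reducing the communication load between intersection agents.
        summary = {"from": self.name,
                   "mean_speed": self.mean_speed(),
                   "count": len(self.messages)}
        return {n.name: summary for n in self.neighbors}
```

In a real deployment the summary (or the raw per-vehicle data) would feed the RL agent's state representation, e.g. as queue lengths or approach speeds per incoming lane.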
• Establishing standard benchmarks for traffic control in traffic simulators: Benchmarks help researchers concentrate on algorithmic improvements and control techniques rather than system and environment design. Using benchmarks, researchers can evaluate their results against other methods effectively. Several benchmarks have been prepared in different contexts to evaluate and compare RL methods, such as the Arcade Learning Environment (ALE) (Bellemare et al., 2013) for evaluating algorithms designed for tasks with high-dimensional state inputs and discrete actions, rllab (Duan et al., 2016) for tasks with partial observations and hierarchically structured tasks, and the NGSIM dataset (Transportation., 2008) for microscopic data on human driving behavior. (Vinitsky et al., 2018) is the first and only article that provides new benchmarks for the use of deep RL in a micro-simulator to create controllers for mixed-autonomy traffic, where CAVs interact with human drivers and infrastructure. They characterize a set of RL algorithms by their effectiveness in training deep NN policies.
As indicated in that study, the unexplored open questions include fairness, decentralization, generalization, and sample efficiency. Future works can provide similar standard benchmarks in other traffic simulators and address the above-mentioned open questions, in addition to considering other factors and cases (such as buses or the integration of different methods with RL) and other contexts in NTSC.
• Online learning: This line of research in RL-NTSC started with (Choy et al., 2003) in 2003 and continued in five other articles (Cai et al., 2009; Dai et al., 2011; Dusparic and Cahill, 2012; Yin et al., 2015), in which the TSC agents need to continuously learn in the traffic network and update the weights and connections of the NN in real time. Standard RL methods such as Q-learning and W-learning require exploration periods while learning optimal actions and weights, which makes online learning almost impossible. Moreover, in the online learning process, when dealing with an infinite-horizon control problem, the lack of stochastic exploration and the possibility of getting stuck in local minima need to be addressed (Kohonen, 2012; Yen et al., 2002). Multistage online learning processes that involve reinforcement learning, weight adjustment, and adjustment of fuzzy relations are proposed in (Choy et al., 2003; Srinivasan et al., 2006). (Cai et al., 2009) investigated two online learning techniques, reinforcement learning and monotonicity approximation. A reinforcement-training-based online learning NTSC is designed in , which employs a feed-forward neural network. (Dusparic and Cahill, 2012) used DWL and (Yin et al., 2015) used ADP for online learning. The online learning process allows learning to continue in real time, where the adaptation of the approximation is processed online.
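The tension described above — that Q-learning needs exploration periods, which are costly on live traffic — can be made concrete with a minimal sketch. This is an illustrative toy, not any surveyed method: a tabular Q-learning update and an epsilon-greedy policy, where every exploratory (random) action corresponds to a deliberately suboptimal signal decision imposed on real traffic. The state and action names are hypothetical.

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One online Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon, take a random action (exploration).
    Online, each such action is an arbitrary phase choice on live traffic,
    which is exactly what makes deployment of vanilla Q-learning hard."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a2: Q.get((s, a2), 0.0))
```

For example, with hypothetical states like "low"/"high" congestion and actions "extend"/"switch" (extend the current green or switch phase), an agent would alternate `epsilon_greedy` and `q_update` at every control interval; approaches such as DWL or ADP cited above aim to keep this loop safe and stable when it must run online.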
It is also worth noting that the classification provided in Figure 12 can help identify open research problems. For instance, based on this figure, there is only one article that presents the application of RL in perimeter control, and that article focuses on general detection and a network of intersections.
Thus, different open research questions can be explored, such as extending perimeter control to a bi-modal environment, considering bus priority, image-based traffic data detection, or perimeter control in a connected-vehicle environment. This logic can also be applied to the other application domains.
Furthermore, more open research questions can be found by considering NTSC within the context of the range of excluded papers. For instance, the application of RL-NTSC considering, or combined with, the following components: bikes, pedestrians, route choice, routing systems, pedestrian routing, reactions of cyclists to speed advice, ride-sharing, best path selection, lane changing, autonomous intersections, traffic congestion detection, driver behaviour, NTSC simulation, simulators, simulation environments, online calibration, traffic assignment problems, courier management in express systems, fleet management, toll plazas, traffic analytics, image processing, and sensor installation locations in a traffic network.

Threats to Validity
There are threats to the validity of the results and findings of our review, which we now discuss. Although we attempted to design our search systematically so that it captures the existing articles in the area under investigation, part of the included articles were retrieved during the forward and backward snowballing process, which suggests that some existing evidence may have been missed. Moreover, research papers that might be relevant under our criteria may have been excluded. Although the authors of this paper strove to collect as much relevant data as possible and to cross-reference the information for accuracy, inaccuracy remains a possibility due to the large number of research papers and features we were dealing with. Differences in our understanding, as well as the intersection of concepts with RL, such as GT, ADP, DP, and LA, may also have led to omitting some relevant research papers that the key term "reinforcement learning" in our search string could not capture.

Conclusion
This paper presented a comprehensive, systematic literature review on the application of RL in NTSC. The main goal of this research was to identify all eligible articles in the defined area, analyze the data of the included articles, and, based on qualitative and descriptive data analysis, provide statistical and conceptual knowledge: the highlights, the variety of applied methods, patterns, trends, and the frequency of existing research works in various application domains, the major first events in RL-NTSC, common future directions recommended by the included papers, and other information useful for further research in the area of RL-NTSC. In addition to the detailed material throughout the paper, the key findings were summarized and discussed. Among all the published review papers we uncovered in this area during our literature review, this paper covers the highest number of articles.
A review of the literature on the application of RL in single isolated intersections can be considered an implication for future practice, complementary to this review paper. Integrating the results and findings of both scales can provide useful insights.