(NOT) THROWING THE GAME - AN APPLICATION OF MARKOV DECISION PROCESSES AND REINFORCEMENT LEARNING TO OPTIMISING DARTS STRATEGY

This article determines an aimpoint selection strategy for players in order to improve their chances of winning at the classic darts game of 501. Although many studies have considered the problem of aimpoint selection in order to maximise the expected score a player can achieve, few have considered the more general strategical question of minimising the expected number of turns required for a player to finish. By casting the problem as a Markov decision process and utilising the reinforcement learning method of value iteration, a framework is derived for the identification of the optimal aimpoint for a player in an arbitrary game scenario. This study represents the first analytical investigation of the full game under the normal game rules, and is, to our knowledge, the first application of reinforcement learning methods to the optimisation of darts strategy. The article concludes with an empirical study investigating the optimal aimpoints for a number of player skill levels under a range of game scenarios.


Introduction
In this paper, our aim is to determine the strategies that players of varying skill levels should adopt to improve their chances of winning at darts. We shall exclusively be considering the standard darts game 501, where two opposing players take alternating turns throwing darts at the board, with each turn consisting of 3 throws. The objective for the players is to reduce their personal tally of 501 to 0, by subtracting the amount they score with their throws. However, in order to finish, a player must reduce their tally to exactly 0, and do so by landing their last dart within the double ring or inner bull (see Figure 1), a process known as doubling out. If a player exceeds their remaining tally with any throw, reduces their tally to zero without doubling out or leaves themselves with a tally of 1, then they are said to have 'gone bust'. Any player who goes bust immediately forfeits the remaining throws of their turn and has their tally returned to the level at which they began their turn. The first player to reduce their tally to 0, by means of doubling out, is the winner of the game.
This end of game procedure results in the game having two distinct stages. In the initial stage, the player's tally is still some way from 0, and it is generally accepted that the objective here should be for the player to simply reduce their tally as quickly as possible. This is achieved by aiming for the point on the board that maximises the player's expected score per dart. The problem of score maximisation has been extensively studied, for example by [7,28,17,4], and it is well understood where players of various skill levels should aim in order to maximise their expected score. In the second stage, the player's tally is approaching zero, and consideration must be given to the requirement of doubling out, along with the risk and associated costs of going bust. As a result of these considerations, the determination of an optimal strategy and aimpoint during this phase is considerably more complicated. To resolve this question, we examine the more general problem of minimising the number of turns it takes a player to finish the game from an arbitrary game state. In doing so, we hope to be able to make strategy recommendations for players in the second stage of the game. Furthermore, this will allow us to test the earlier assumption that a player is best served by maximising their scoring during the initial stage of the game and also determine the optimal point for the player to switch their attention from scoring to finishing.
The question of turn minimisation was considered in [18], where a branch and bound method was applied to the problem. However, in [18], the author considered a simplified variation of the game, where the consequences of going bust differed from the standard game. In the variation of [18], a player going bust still forfeits the remaining throws of their turn. However, their tally is not reset to the level at which they began their turn and instead advances to the next turn intact. The inclusion of the tally reset introduces an additional dimension to the problem, increasing its complexity, and leads to some interesting strategical choices given certain game scenarios.
To solve the problem, we frame it as a Markov decision process (MDP) and apply the methods of reinforcement learning and dynamic programming. There exists a large body of work concerning the application of these methods to the study of optimal strategy selection within sports, see [21], with cricket ([11,12]) and tennis ([20,13]) receiving the most attention. More generally, there is a strong history of the application of the analytical methods of operational research in the study of sports, and the interested reader is directed to [29] and the references therein for a more detailed review. Most recently, there has been a renewed interest in the application of reinforcement learning methods to strategical decisions within games. These methods, when coupled with the power of neural networks, have produced advances in game playing algorithms for complex strategy games, for example Go [25].
1.1. Dartboard Design and Dimensions. Although there are many different dartboard designs, for example the London 5's board [1] and the Yorkshire board [3], the most commonly used board is the standard dartboard as shown below (see Figure 1), and it is play on this board that we shall be exclusively considering in this paper. The standard board consists of 20 equally sized sectors, each giving the player a score, which ranges from 1 to 20. The board has two concentric bands located at radial distances 99-107mm and 162-170mm from the centre, known as the 'treble ring' and 'double ring' respectively. A player landing their dart in the treble ring has their sectorial score trebled, i.e. hitting the 'treble 20' produces a score of 60, whilst hitting the double ring doubles the sectorial score. The centre of the board consists of two concentric circles of radii 6.35mm and 15.9mm. The inner circle is known as the 'inner bull' and provides the player with a score of 50, whilst the band surrounding it is known as the 'outer bull' and gives a score of 25. As described previously, the point of the game is for the player to reduce their tally from 501 to exactly 0, and in doing so hit either the inner bull or double ring with their last dart. Each section of the board is separated by a wire of thickness 1.27-1.85mm. The dartboard should be mounted so that the centre is at a height of 1.73m, whilst players throw from the 'oche', which must be a horizontal distance of 2.37m from the wall on which the board is mounted. The measurements given above are those of an official championship dartboard, as specified in the rules of the [10], one of the main governing bodies in darts.
The ordering of the sectors on the dartboard is attributed to Brian Gamlin [2,6]. Gamlin, a carpenter from Bury in Lancashire, is thought to have invented the sequence in 1896. The numbering on the board is intended to encourage accuracy, with the higher scoring sectors tending to be placed next to the lower ones. Therefore, a player going for a high score must hit their target or else be penalised by receiving a lower than average score. For example, a player who shoots for the 20 sector and misses is likely to hit either 1 or 5, resulting in a greatly reduced score. A number of studies have considered the possibility of rearranging the ordering so as to optimise the difficulty of the board. The article [14] considers the problem from a combinatorial point of view, seeking to maximise the p-norm of the difference between successive sectors. The author provides an alternative arrangement, which surpasses the standard arrangement with respect to this p-norm measure. The study by [27] takes a more probabilistic approach to the problem, modelling the effect the arrangement has on the maximum expected score a player can achieve (by optimising their aimpoint), under certain assumptions on the distribution of thrown darts. Given a skill level, as modelled by the deviation of the distribution, the maximum expected score is then minimised across possible board arrangements. Interestingly, the two studies agree on the optimally difficult dartboard. However, the actual impact of the rearrangement in terms of scoring was found to be rather minimal in [27], suggesting that Gamlin was not far off with his traditional arrangement.

Modelling Preliminaries
In this section, we outline a number of simplifications and assumptions made during the modelling process. Primarily, these are in regard to the distribution used to model the errors of darts thrown by real players. We also introduce the conventions that shall be adopted in referencing specific points on the dartboard.

2.1. Modelling Simplifications and Assumptions. Firstly, our model does not take account of the wires on a real dartboard. These wires range in thickness from 1.27 to 1.85mm; however, their effective thickness is less, as darts often glance off them into the target bed.
We shall assume that the outcome of darts thrown by our players follows a two dimensional normal distribution, centred at their aimpoint with independent horizontal and vertical components with standard deviations of $\sigma_x$ and $\sigma_y$ (measured in mm) respectively. In addition, we take this distribution to be constant across the dartboard, i.e. the player's accuracy remains constant for all aimpoints. In the upcoming analysis, when referring to a player having a skill level of $(\sigma_x, \sigma_y)$, we are assuming that the errors in their throws follow a distribution as described. A normal distribution would seem reasonable if the error regarding the dart landing point is the result of the cumulative effect of small discrepancies in many factors: stance, posture, grip, motion of the arm and hand, timing, rhythm, release of the dart, etc. In that case, the Central Limit Theorem would lead us to expect that the outcome would follow an approximately normal distribution. Furthermore, we note that in previous studies considering such matters [18,7,17,28,4], the two dimensional normal distribution has been widely used to model the outcome of darts throws.
Intuitively, we might anticipate that $\sigma_x$ (horizontal standard deviation) would be less than $\sigma_y$ (vertical standard deviation). Throughout the motion of throwing a dart, the player's arm largely remains within one vertical plane and so the force they exert on the dart is also directed within this plane. As no horizontal forces act on the dart during its flight, it therefore follows a trajectory within this plane. As a result, a player's horizontal accuracy is primarily determined by their ability to restrict their motion to a plane directed at their target. As a player throws a dart, their hand follows an arc; the point along this arc at which the dart is released determines its initial launch angle. If the dart is released early, then it will have a high launch angle, whereas darts released later will typically be projected at lower angles. The player must also judge the correct amount of force to apply to the dart; however, this is dependent upon the launch angle. The vertical accuracy of a player is thus determined by their ability to control the dart's trajectory; to do so they must balance the launch angle with force, whilst taking into account the effects of gravity on the dart. These factors involve more judgement and 'feel' than those involved in the horizontal motion. As a result, we might expect players to struggle more with vertical consistency, especially when they are under pressure.
In addition to having a discrepancy between the horizontal and vertical accuracy of throws, we might consider the possibility of a correlation between the horizontal and vertical components of a miss. Due to the biomechanics of the throwing action, it might be that a right-handed player is more likely to miss high and left or to pull their throw low and right, with the opposite being true for a left-handed player. This possibility was considered in [23], where such correlations, in conjunction with the arrangement of the dartboard, were posited as a possible explanation for a lower number of left-handed players on the PDC professional darts tour than would be expected from the proportion of left-handed individuals in the population as a whole. The option of a correlation between the horizontal and vertical error components appears in the modelling of [28], where the authors utilised the general case of a bivariate normal distribution to model players' misses. However, in this paper we will not consider such a possibility; the purpose of this work is not the validation of a given distribution as a model for real darts throws, and therefore we do not analyse or test the validity of our distributional choice against real data. Instead, we are interested in the strategical decisions faced by a player in light of the scoring rules and uncertainty of outcome. The methods developed and applied in this paper are readily adapted to other choices of distribution if seen fit, simply requiring the use of a different random number source or the substitution of the requisite probability density function into the calculations.
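To illustrate how a different random number source could be swapped in, the following is a minimal sketch of a throw sampler. The function names `sample_throw` and `sample_correlation` and the correlation parameter `rho` are assumptions introduced purely for illustration; setting `rho = 0` corresponds to the independent model adopted in this paper.

```python
import math
import random


def sample_throw(aim, sigma_x, sigma_y, rho=0.0, rng=random):
    """Draw one landing point (x, y) in mm for a dart aimed at `aim`.

    rho is a hypothetical correlation between horizontal and vertical
    errors; rho = 0 gives the independent model used in this paper.
    """
    mu_x, mu_y = aim
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x = mu_x + sigma_x * z1
    # Mixing z1 into the vertical error induces correlation rho.
    y = mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return x, y


def sample_correlation(points):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    syy = sum((p[1] - mean_y) ** 2 for p in points)
    return sxy / math.sqrt(sxx * syy)
```

Any routine that consumes landing points can call such a sampler without knowing which error distribution sits behind it, which is the sense in which the methods are distribution-agnostic.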
Finally, we assume that the results of successive darts are independent. However, in reality there are several reasons why this may not always be true. It is common for previous darts to obscure the target area, reducing the chances of success on following throws. Conversely, if a player has already hit their target, then it may be easier for them to replicate this result on the throws that follow, as they have developed a feel for the shot, leading to the occurrence of the 'hot hand phenomenon' [22]. Alternatively, they might be able to make adjustments to correct slight misses.
2.2. Specification of Points on the Dartboard. Throughout this study, it will at times be necessary to specify particular points on the dartboard. We do so in a variety of ways, choosing the method according to the situation. Generally, when referring to points, we shall use a variation on plane polar coordinates as shown below. We measure angles in degrees, clockwise from the 12 o'clock position, and distance in mm, radially from the centre of the board (see Figure 2). We have chosen this system as we believe it allows the user to more easily assimilate the position of points on the board and is more naturally understood by the layperson. However, this non-standard system does not lend itself so readily to the mathematical analysis and computations involved in the project. For this reason, we also make use of the standard Cartesian coordinate system. When using the Cartesian system, points are classified by their horizontal and vertical coordinates $(x, y)$, taking the centre of the board (centre of the inner bull) as our origin, with our horizontal axis running through the 6 and 11 sectors, whilst the vertical axis runs through the 3 and 20 sectors. Again, all distances will be given in mm unless otherwise stated.
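The two coordinate conventions and the board dimensions of Section 1.1 can be combined into a scoring function. The sketch below is illustrative only: the names `SECTORS`, `to_polar` and `score` are assumptions, the wires are ignored as in Section 2.1, and the treatment of points lying exactly on a ring boundary is an arbitrary choice.

```python
import math

# Clockwise sector order starting from the top (12 o'clock) of a standard board.
SECTORS = [20, 1, 18, 4, 13, 6, 10, 15, 2, 17, 3, 19, 7, 16, 8, 11, 14, 9, 12, 5]


def to_polar(x, y):
    """Cartesian (x, y) in mm -> (degrees clockwise from 12 o'clock, radius in mm)."""
    angle = math.degrees(math.atan2(x, y)) % 360.0
    return angle, math.hypot(x, y)


def score(x, y):
    """Score at the Cartesian point (x, y), ignoring the wires."""
    angle, r = to_polar(x, y)
    if r < 6.35:
        return 50  # inner bull
    if r < 15.9:
        return 25  # outer bull
    if r > 170.0:
        return 0  # off the scoring region
    # Each sector spans 18 degrees and is centred on a multiple of 18 degrees.
    sector = SECTORS[int(((angle + 9.0) % 360.0) // 18.0)]
    if 99.0 <= r < 107.0:
        return 3 * sector  # treble ring
    if r >= 162.0:
        return 2 * sector  # double ring
    return sector
```

For instance, `score(0, 103)` lies in the treble 20 and `score(166, 0)` lies in the double 6, in agreement with the axes described above.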

Initial Stage - Maximising Scoring
As described previously, during the initial stages of the game, the player is best served by reducing their tally as quickly as possible. This is achieved by aiming at the point on the board that maximises their expected score from a single throw. Although this problem has been previously addressed in a number of articles [18,7,28,17,4], we include it here as it provides an instructive introduction to the modelling of the game of darts. In addition, the results provide a valuable set of observations against which to check the methods of the following section.
3.1. Problem Definition. Under the distributional assumptions set out in the previous section, considering a player of skill level $(\sigma_x, \sigma_y)$, aiming at the point with Cartesian coordinates $(\mu_x, \mu_y)$, the outcome of their throw $(x, y)$ follows a distribution with the following bivariate normal probability density function:
$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left(-\frac{(x-\mu_x)^2}{2\sigma_x^2} - \frac{(y-\mu_y)^2}{2\sigma_y^2}\right). \tag{3.1}$$
Letting the random variable $S$ denote the score achieved by our player, and $d(x, y)$ the function taking values as per the score on the dartboard at the point with Cartesian coordinates $(x, y)$, then the expected score achieved by our player by aiming at $(\mu_x, \mu_y)$ is given as the expected value of the function $d$ applied to the outcome of the throw. We compute this value via the following double integral:
$$E(S) = \iint_D d(x, y)\, f(x, y)\, \mathrm{d}x\, \mathrm{d}y, \tag{3.2}$$
where the domain of integration $D$ signifies the scoring region of the board. The question now is, given a skill level $(\sigma_x, \sigma_y)$, which aimpoint $(\mu_x, \mu_y)$ maximises the value of (3.2).
3.2. Solution Method and Implementation. In order to determine the optimal point, we cover the dartboard with a square grid of points $(\mu_x^n, \mu_y^m)$, where $\mu_x^n = n\Delta$, $\mu_y^m = m\Delta$, with $\Delta = 170/N$ mm, for some (large) positive integer $N$ and where $n, m \in \{-N, \dots, N\}$. Although ideally we would like the choice of $\Delta$ to be as small as possible in order to increase the precision of our results, the number of evaluations of (3.2) required scales as $O((1/\Delta)^2)$, and therefore taking $\Delta$ too small would result in a computationally intractable problem. Additionally, there is a practical limit as to how close we need to set the aimpoints, as a player cannot distinguish between points that are extremely close. Moreover, when we consider the length scales and level of precision involved (i.e. the thickness of a dart and the tolerances in the board etc.), beyond a certain point, working with increasingly smaller mesh sizes overstates the achievable level of accuracy.
Having defined a grid, we then compute the value of (3.2) with $(\mu_x, \mu_y) = (\mu_x^n, \mu_y^m)$ for each point in our grid. The point (or points) which provide(s) the maximum value is then taken as the optimum aimpoint for skill level $(\sigma_x, \sigma_y)$, and the value achieved there as the optimised expected score. When it comes to computing the values of (3.2), there are a number of approaches that can be taken when evaluating the requisite integrals, including various quadrature routines and Monte Carlo integration. Utilising a mesh with step-size $\Delta = 1$mm, we carried out the described procedure for a range of skill levels $(\sigma_x, \sigma_y)$, where, for the moment, we have made the assumption of equal horizontal and vertical accuracy, i.e. $\sigma_x = \sigma_y = \sigma$. The integrations were performed by utilising both Matlab's built-in quadrature method integral2 and also a two stage Monte Carlo approach. This Monte Carlo approach involved an initial examination of all points of the grid using a sample size of $k$ darts per point. After this initial search, the top $K$ scoring points are re-examined utilising $k'$ darts per point, where $k' \gg k$. We applied this approach with an initial sample size of $k = 10000$, $K = 500$ and a larger sample size of $k' = 100000$.
The following heatmaps in Figure 3, derived via the quadrature method, give a graphical representation of how the expected scores vary across the board, for players with a range of skill levels. The higher scoring sectors, such as the high trebles and inner bull, show up very clearly in the more accurate cases ($\sigma = 5, 15$), with this evening out as the accuracy decreases. For the least accurate case ($\sigma = 40$), we see much less variation across the board, with the greatest intensity towards the centre of the board with a slight preference for the lower left quadrant. This is perhaps not unexpected, as this quadrant features none of the lowest scoring sectors, making it more forgiving on the lesser skilled player. Indeed, the left-hand side of the board is colloquially known as the 'married man's side', because married men allegedly play it safe [5].
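The two stage Monte Carlo search described above can be sketched as follows. This is a simplified illustration rather than the implementation used for the reported results: the function names are assumptions, `k_prime` stands for the larger second-stage sample size, and the grid in the usage note is far coarser than the 1mm mesh of the study so that the example runs quickly.

```python
import math
import random

SECTORS = [20, 1, 18, 4, 13, 6, 10, 15, 2, 17, 3, 19, 7, 16, 8, 11, 14, 9, 12, 5]


def score(x, y):
    """Board score at (x, y) in mm, wires ignored."""
    r = math.hypot(x, y)
    if r < 6.35:
        return 50
    if r < 15.9:
        return 25
    if r > 170.0:
        return 0
    angle = math.degrees(math.atan2(x, y)) % 360.0  # clockwise from 12 o'clock
    sector = SECTORS[int(((angle + 9.0) % 360.0) // 18.0)]
    if 99.0 <= r < 107.0:
        return 3 * sector
    return 2 * sector if r >= 162.0 else sector


def mc_expected_score(aim, sigma, n_darts, rng):
    """Monte Carlo estimate of the expected score when aiming at `aim`."""
    total = 0
    for _ in range(n_darts):
        total += score(rng.gauss(aim[0], sigma), rng.gauss(aim[1], sigma))
    return total / n_darts


def two_stage_search(sigma, step, k, top_k, k_prime, rng):
    """Coarse pass over a square grid, then re-examine the best top_k points."""
    n_max = int(170.0 // step)
    grid = [(n * step, m * step)
            for n in range(-n_max, n_max + 1)
            for m in range(-n_max, n_max + 1)]
    # Stage one: rank every grid point using k darts each.
    first = sorted(grid, key=lambda p: mc_expected_score(p, sigma, k, rng), reverse=True)
    # Stage two: re-estimate the leading candidates using k_prime darts each.
    refined = {p: mc_expected_score(p, sigma, k_prime, rng) for p in first[:top_k]}
    best = max(refined, key=refined.get)
    return best, refined[best]
```

A call such as `two_stage_search(sigma=1.0, step=34.0, k=20, top_k=5, k_prime=200, rng=random.Random(1))` searches an 11 by 11 grid; for such an accurate player the only grid point inside the treble 20 bed, (0, 102), should be selected.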
Table 1 outlines the optimal points found using both of the methods, and the expected scores achieved there. In general, there is reasonable agreement between the two methods. Having said this, the quadrature method appears somewhat more consistent between accuracy levels, suggesting that the results derived via this method may be more reliable. However, the Monte Carlo method does enjoy a considerable speed advantage over the quadrature method, and we could increase the accuracy by increasing the sample size per aimpoint. Figure 4 provides a graphical representation of the aimpoints identified in Table 1 (using the quadrature method integral2). We can see that for the most accurate players, the centre of the treble 20 offers the best scoring on average, with this point moving up and to the left as skill level drops. This occurs as a consequence of the widening out of the sectors as you move away from the centre, providing for a higher proportion of darts to land in the 20 sector, as opposed to its lower scoring neighbours. The preference for the left-hand side arises from the relative advantage in missing to this side and hitting the 5 sector, as opposed to the 1 sector on the right. At some point, as the standard deviation increases from 15 to 20, the optimal scoring point switches to the treble 19, from where it progresses upwards and inwards towards the centre of the board as the skill level drops. The lowest skilled players should aim for the centre in order to maximise the probability of hitting the board at all.
Table 1. Optimal aimpoints and expected scores.
a Computed using Matlab's built-in quadrature method integral2.
b Computed using the two stage Monte Carlo method described.
Finally, the chart below (see Figure 5) plots the expected score for three different aimpoints against the skill level of the player. As we can see, for players of the highest skill level (lowest $\sigma$), the optimal score and the score derived from the centre of the treble 20 coincide, as we would expect from the previous table and figure. We can see that for such players, there is a clear scoring advantage to be had here over, for instance, the inner bull. Unsurprisingly, as skill level drops, we observe a general decrease in the expected score from all points. However, we can also observe a difference emerging between the expected score achieved at the optimal point and the mean score from the treble 20. This is also in line with the findings of Table 1, where the optimal point switches from the treble 20 to the treble 19 and inwards to the centre. As we continue to decrease the skill level, the expected score provided by aiming at the centre of the board comes to dominate the treble 20, and as the optimal aimpoint approaches the centre of the board in Figure 4, we observe a convergence between the scores achieved at these points.
Having covered the strategy for a player looking to maximise their scoring during the initial stages of the game, in the next section we will look to address more generally the question of how a player should go about maximising their chances of winning from an arbitrary game state. Strictly speaking, a player achieves this by adopting the strategy which maximises the probability of them finishing before their opponent. However, analytically this represents a rather challenging proposition, requiring consideration of the opponent's strategy, which, if they are playing in a somewhat optimal fashion, should give consideration back to the original player's strategy and so on. These interactions in strategy could occur back and forth and result in a problem of high complexity. Instead, we look to minimise the number of further turns (beyond the current one) that our player will take to reduce their tally to zero, including doubling out. We expect that in most cases this strategy will coincide with the one for maximising a player's chances of winning.

General Strategy - Minimising Further Turns
The question of minimising further turns has received little attention in the literature in comparison to the problem of maximising score, with the only significant contribution provided by [18]. In his article, Kohler utilises a branch and bound approach to determine the optimal aimpoints in terms of minimising the expected number of further turns for a player to finish. However, the variation of the game he considers is simplified in one important respect from the standard version of the game considered here. In the version of the game considered in [18], the consequences for a player going bust still involve forfeiting any remaining throws of their turn; however, their tally remains the same and they do not face the additional penalty of having their tally returned to the level it was at prior to the start of their turn. This significantly reduces the complexity of the problem, removing a variable from the analysis. In the upcoming section, this additional factor (and penalty) will be included in our situational modelling and it will be interesting to observe the impact this factor has on the optimal strategies identified.
4.1. Problem Definition. We shall begin by introducing a number of notations, which shall be employed in the following framing and analysis of our problem. Let us denote by $S_t$ the tally the player is currently on, with the player about to take throw $t$ $(= 1, 2, 3)$ of their turn. As such, $S_1$ denotes the tally on which the player began the current turn, and the tally to which they will return if they go bust. Therefore, for our purposes, the triple $(S_t, t, S_1)$ fully defines the situation a player finds themselves in. We denote by $T$ the number of further turns (beyond the current one) for the player to reduce their tally to 0, including doubling out. We use $E(T \mid S_t, t, S_1)$ to denote the expected value of this quantity for a player in the state $(S_t, t, S_1)$, assuming they adopt the optimal strategy to minimise the quantity $T$. Now consider a player facing the situation $(S_t, t, S_1)$ who is evaluating where on the board it is best to aim their next throw. We denote by $p$ the general aimpoint and use $D$ to refer to the set of all such points on the board. We then use $\Pr(r \mid p)$ to signify the probability of scoring $r$, whilst aiming at $p$. The set of outcomes $r$ which result in a player going bust, given a current tally of $S_t$, we denote by $B$. Then, assuming that on all subsequent throws the player selects the optimal (minimising expectation of $T$) strategy, the expected values $E(T \mid \cdot, \cdot, \cdot)$ satisfy the following equations:
$$E(T \mid S_t, t, S_1) = \min_{p \in D} \Big\{ \sum_{r \notin B} \Pr(r \mid p)\,\big[ E(T \mid S_t - r, \tau, S_\tau) + \delta_{t,3} \big] + \sum_{r \in B} \Pr(r \mid p)\,\big[ E(T \mid S_1, 1, S_1) + 1 \big] \Big\}, \tag{4.1}$$
where the quantities $\tau$, $S_\tau$ and $\delta_{t,3}$ take the following values:
$$\tau = \begin{cases} t + 1, & t = 1, 2, \\ 1, & t = 3, \end{cases} \qquad S_\tau = \begin{cases} S_1, & t = 1, 2, \\ S_t - r, & t = 3, \end{cases} \qquad \delta_{t,3} = \begin{cases} 0, & t = 1, 2, \\ 1, & t = 3. \end{cases}$$
An outcome that doubles out ends the game, and so contributes no further turns to the first sum.
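The way a single throw moves the player between situations $(S_t, t, S_1)$ can be made concrete with a small helper. The sketch below follows the bust and doubling out rules from the Introduction; the function name `next_state`, its argument names and the use of `None` for the finished game are assumptions made for illustration.

```python
def next_state(tally, t, start, score, is_double):
    """Apply one throw to the state (tally, t, start).

    tally: current tally S_t; t: throw number within the turn (1, 2 or 3);
    start: tally at the start of the turn, S_1; score: points scored;
    is_double: True if the dart landed in the double ring or inner bull.
    Returns (new_state, further_turns), where further_turns is the number
    of further turns this throw adds (0 or 1) and new_state is None once
    the player has doubled out.
    """
    remaining = tally - score
    if remaining == 0 and is_double:
        return None, 0  # doubled out: game over, no further turns
    if remaining <= 1:
        # Bust: overshot, left on 1, or reached 0 without a double.
        return (start, 1, start), 1
    if t == 3:
        return (remaining, 1, remaining), 1  # turn used up, next turn begins
    return (remaining, t + 1, start), 0  # same turn continues
```

For example, a player on 32 with their second dart, having started the turn on 52, who scores 31 is left on 1 and so returns to `(52, 1, 52)` at the cost of one further turn.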
The optimum aimpoint for the player in the situation $(S_t, t, S_1)$ is then given by the point $p$ which satisfies (4.1) by minimising the collection of terms on the right-hand side. The difficulty now lies in proceeding with the computation of the right-hand side of (4.1), given that it requires existing knowledge of the values $E(T \mid \cdot, \cdot, \cdot)$. However, the problem of computing these expected values, and the optimum selection of aimpoints, naturally fits into the framework of a stochastic shortest path (SSP) problem, a particular class of Markovian decision process (MDP). Therefore, we now spend some time reframing the problem in such a way.
Writing $S$ for a state $(S_t, t, S_1)$, $V(S)$ for the value of that state and $C(S, p, S')$ for the cost incurred in moving from $S$ to $S'$ when aiming at $p$, the values satisfy
$$V(S) = \min_{p \in D} \Big\{ \sum_{S'} \Pr(S' \mid S, p)\,\big[ C(S, p, S') + \gamma V(S') \big] \Big\}, \tag{4.2}$$
where $\gamma$ is known as the discount factor, which in our case is equal to 1. The equation (4.2) is known as the Bellman Equation, first appearing in [8]. A common approach to the solution of such problems is via the reinforcement learning method of value iteration [26, page 100]. This method, which originated in the seminal paper of [24], proceeds in the following manner: For each state $S$, we start with an initial estimate $V_0(S)$ of the true value $V(S)$. We then sequentially compute further approximations via the iterative relationship
$$V_{n+1}(S) = \min_{p \in D} \Big\{ \sum_{S'} \Pr(S' \mid S, p)\,\big[ C(S, p, S') + \gamma V_n(S') \big] \Big\}. \tag{4.3}$$
This is done for all states $S$ simultaneously, and the process continues until the values obtained for all states are deemed sufficiently close to their predecessors. At that point, if $V^*(S)$ denotes the final value of $V_{n+1}(S)$ obtained in the iteration stage, the optimum aimpoint for a player in state $S$ is the point $p$ which minimises the following quantity:
$$\sum_{S'} \Pr(S' \mid S, p)\,\big[ C(S, p, S') + \gamma V^*(S') \big]. \tag{4.4}$$
The evaluation of (4.4) for the minimising point $p$ then gives $E(T \mid S)$, i.e. the expected number of further turns required to finish.
In reinforcement learning terminology, any rule which pairs states $S$ to a choice of $p$ is known as a 'policy' and is often denoted by $\pi$. A policy is said to be 'stationary' if the determination of $p$ depends only on the current state. Finally, a policy which maps each $S$ to the corresponding $p$ which minimises (4.4) is known as the 'optimal policy' and is often denoted by $\pi^*$.
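The iteration and the extraction of a policy can be illustrated on a generic cost-minimising problem with discount factor 1. The following is a minimal sketch, not the implementation used in this study: the interface (`states`, `actions`, `transitions`) and the one-state toy problem are assumptions chosen only to show the mechanics.

```python
def value_iteration(states, actions, transitions, tol=1e-10, max_iter=10000):
    """Value iteration for a cost-minimising problem with discount factor 1.

    transitions(s, a) returns a list of (probability, cost, next_state);
    next_state None denotes the cost-free absorbing state.
    Returns (values, policy) as dictionaries keyed by state.
    """
    values = {s: 0.0 for s in states}  # initial estimate V_0

    def q_value(s, a, v):
        # Expected cost of taking action a in state s, then following v.
        return sum(p * (c + (0.0 if s2 is None else v[s2]))
                   for p, c, s2 in transitions(s, a))

    for _ in range(max_iter):
        new_values = {s: min(q_value(s, a, values) for a in actions(s))
                      for s in states}
        change = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if change < tol:
            break
    # The policy picks, in each state, an action attaining the minimum.
    policy = {s: min(actions(s), key=lambda a: q_value(s, a, values))
              for s in states}
    return values, policy


# Toy example (hypothetical): from state "A" each step costs 1; "safe" reaches
# the absorbing state with probability 0.5, "risky" with probability 0.25.
def toy_transitions(state, action):
    p_finish = 0.5 if action == "safe" else 0.25
    return [(p_finish, 1.0, None), (1.0 - p_finish, 1.0, state)]


values, policy = value_iteration(["A"], lambda s: ["safe", "risky"], toy_transitions)
```

In the toy problem the value of "A" solves $V = 1 + 0.5V$, so the iteration converges to 2 and the stationary policy selects "safe". In the darts setting, the states would be the triples $(S_t, t, S_1)$ and the actions the candidate aimpoints.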
4.2.1. Method Convergence. It is not immediately apparent that this method will indeed provide a solution to our problem. Therefore, we now outline an argument which guarantees that (i) the method will generate a solution, and (ii) the solution generated is correct.
Suppose for the moment that there are $N$ possible states, $\{S_i\}_{i=1}^{N}$; then we can consider the set of associated values $V(S_i)$ as forming a vector in $\mathbb{R}^N$. In cases when the discount factor is less than unity, $\gamma < 1$, the application of the right-hand side of (4.2) can be shown to form a contraction on $\mathbb{R}^N$ with respect to the maximum norm [15, page 209]. The Banach Fixed Point Theorem [19, Theorem 5.1-2] then guarantees the existence of a unique set of values $\{V(S)\}$ which satisfy (4.2). Furthermore, these values can be obtained as the limit of the iterative procedure (4.3), starting from any initial value. However, for our particular case, since the discount factor $\gamma$ is equal to 1, we must deepen our analysis. We now introduce some further terminology relating to SSPs and associated policies before providing a result which establishes convergence for our case.
First, we introduce the notion of an absorbing state. A state $S_a$ is said to be 'absorbing' if $\Pr(S_a \mid S_a, p) = 1$ for all choices of $p$. Consequently, once we enter the state $S_a$ we remain there for all subsequent steps. The cost associated with such a state is zero, i.e. $V(S_a) = 0$. Having described an absorbing state, and assuming the existence of such a state, we introduce what is referred to as a 'proper' policy. A stationary policy is said to be 'proper' if, when following the policy, there is a non-zero probability that the absorbing state will be reached within a finite number of steps, regardless of the current state. A stationary policy which is not proper is said to be 'improper'. With these notions in place, we now outline a set of conditions which are sufficient to guarantee the existence of a unique solution to (4.2), in the case that $\gamma = 1$.
Let us assume that the following conditions are satisfied by our SSP: (1) There exists a cost free absorbing state; (2) There exists at least one proper stationary policy; (3) For every improper policy, the cost associated with at least one state is unbounded.
If each of these conditions is satisfied, then there exists a unique set of values $\{V(S)\}$ which satisfy (4.2), and which can be obtained via the iterative procedure (4.3) [9, Proposition 3.2.2].
If we consider how these conditions apply to our problem of darts strategy, we can see that the assumptions (1)-(3) do indeed hold. Clearly, the end game state, when a player has successfully doubled out, constitutes an absorbing state as described above, and therefore assumption (1) holds.
In the game, it is always possible to reach the final end state in a finite number of steps. Furthermore, given a player and any game state, it is possible to prescribe a (fixed) choice of aimpoint for all possible intervening states between the current state and the end state, such that selecting these aimpoints results in a non-zero probability of reaching the end state. For the player following this policy, the subsequent selection of aimpoints would depend exclusively on the game state they find themselves in; such a policy therefore constitutes a stationary policy, and the non-zero probability of termination makes this a proper policy. Therefore, condition (2) is satisfied for our player.
Finally, if an improper policy exists for a given player, then by definition, for the player following this policy, there must be at least one game state such that the probability of reaching the end state in a finite number of steps is zero. For the player in question finding themselves in this particular state, the number of further turns taken must be unbounded, and as such, they must incur an infinite cost through the further turn cost $C$. Therefore, assumption (3) is satisfied.

Implementation.
Having outlined the general framework of our solution method, in this section we detail a number of considerations which are particular to implementing the method for the darts problem.

4.3.1. Transition Probabilities and Aimpoints. The value iteration method, outlined above, requires us to have values for the transition probabilities $\Pr(S' \mid S, p)$ for all valid $S'$, $S$ state combinations and all aimpoints $p$. In our implementation, these were approximated utilising a Monte Carlo approach in the following way. Given an aimpoint $p$, we simulated a number of dart throws aimed at $p$ and computed the resulting scores achieved on each throw. In addition, we also recorded the radial distances of the landing points from the centre of the board. The relative proportions of each score outcome $r$ were then used to estimate the probabilities $\Pr(r \mid p)$, that is

$$\Pr(r \mid p) \approx \frac{\text{No. of darts aimed at } p \text{ producing a score of } r}{\text{No. of darts aimed at } p}. \tag{4.5}$$
Additionally, having obtained the radial distance of the outcomes allows us to attribute whether a score outcome equal to the player's tally, that is $r = S_t$, was a case of doubling out or going bust. From this information it is relatively simple to construct the probabilities $\Pr(S' \mid S, p)$, by considering which outcomes lead from state $S$ to state $S'$.
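The estimate (4.5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the study: throws are assumed to be independent normal deviates about the aimpoint with standard deviations `sigma_x` and `sigma_y`, the ring radii (6.35, 15.9, 99, 107, 162 and 170 mm) are the commonly quoted standard board dimensions, and the names `score_throws` and `estimate_probs` are hypothetical. The flag `finishing` plays the role of the radial-distance record described above, marking throws that land in the double ring or inner bull.

```python
import numpy as np

# Clockwise sector order starting from the top of the board.
SECTORS = np.array([20, 1, 18, 4, 13, 6, 10, 15, 2, 17,
                    3, 19, 7, 16, 8, 11, 14, 9, 12, 5])

def score_throws(x, y):
    """Score landing points (mm from the board centre); also flag double/inner-bull hits."""
    r = np.hypot(x, y)
    angle = np.degrees(np.arctan2(x, y)) % 360.0        # clockwise from vertical
    sector = SECTORS[((angle + 9.0) // 18.0).astype(int) % 20]
    in_treble = (r >= 99.0) & (r < 107.0)
    in_double = (r >= 162.0) & (r < 170.0)
    scores = sector * np.where(in_treble, 3, np.where(in_double, 2, 1))
    scores = np.where(r < 15.9, 25, scores)             # outer bull
    scores = np.where(r < 6.35, 50, scores)             # inner bull
    scores = np.where(r >= 170.0, 0, scores)            # off the scoring area
    finishing = in_double | (r < 6.35)
    return scores, finishing

def estimate_probs(aim_x, aim_y, sigma_x, sigma_y, n_darts=200_000, seed=0):
    """Monte Carlo estimate of Pr(r | p) and of Pr(r via a double or inner bull | p)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(aim_x, sigma_x, n_darts)
    y = rng.normal(aim_y, sigma_y, n_darts)
    scores, finishing = score_throws(x, y)
    values, counts = np.unique(scores, return_counts=True)
    p_score = dict(zip(values.tolist(), (counts / n_darts).tolist()))
    fin_values, fin_counts = np.unique(scores[finishing], return_counts=True)
    p_finish = dict(zip(fin_values.tolist(), (fin_counts / n_darts).tolist()))
    return p_score, p_finish

# Example: aim at the assumed centre of the treble 20 (0 mm across, 103 mm up).
p_score, p_finish = estimate_probs(0.0, 103.0, 12.5, 12.5)
```

The sample size here is deliberately small so that the sketch runs quickly; the study itself used ten million darts per aimpoint.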
The reliability of any inferences derived from the value iteration method is dependent upon the accuracy of the values used for the transition probabilities $\Pr(S' \mid S, p)$, and hence the probability estimates of $\Pr(r \mid p)$ from (4.5). However, for many choices of aimpoint $p$ and outcome $r$, the true value of $\Pr(r \mid p)$ will be small, but non-zero. For such cases, the occurrence of a score $r$ whilst aiming at $p$ constitutes a 'rare event'. When using the Monte Carlo approach (4.5) to estimate the probabilities of such events, the sample size (and hence computational expense) required in order to obtain a given precision (in terms of confidence interval width) scales in proportion to $1/\Pr(r \mid p)$ [16, page 64]. Therefore, if each potential aimpoint requires a sample of such a size, practical computation concerns restrict the number of possible aimpoints that we are able to consider. Furthermore, for each state $S$, the iteration procedure requires the computation of the bracketed expression on the right-hand side of (4.3) for all aimpoints $p$. This again restricts the number of aimpoints that can be practically considered.

For the purposes of this study, we utilise potential aimpoints arranged along rays radiating from the centre of the board. For each sector, we have three such rays, the first being the centre line of the sector, with an additional ray either side, placed so as to be equidistant between the centre line and the sector edges. The aimpoints are then distributed at set distances along these rays to ensure good coverage of the sector. A full list of these distances is set out in […]. The arrangement of the aimpoints within a typical sector can be seen in Figure 6. Aimpoints outside of the scoring region of the board are also considered, in particular at 175, 185 and 195 mm, since, as we will see, it is not always optimal to aim for some doubles straight away when trying to double out. We also include a set of distant aimpoints, at a radial distance of 5000 mm (not pictured). These points are included to allow for the possibility of deliberately missing, which we suspect could be optimal in cases where the cost of going bust outweighs the cost of an additional turn.

Figure 6. Considered aimpoints for probability calculations.

In our experimentation, with the aimpoints as outlined, we computed the probabilities (4.5) and the resulting transition probabilities using a sample size of ten million simulated darts per point.
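The $1/\Pr(r \mid p)$ scaling can be illustrated with a short Python sketch. It assumes the usual normal-approximation confidence interval for a binomial proportion, which is our own choice for illustration and not necessarily the derivation given in [16]; the function names and the example probability of $10^{-4}$ are hypothetical.

```python
import math

def relative_half_width(prob, n_darts, z=1.96):
    """Half-width of the normal-approximation interval, as a fraction of prob."""
    return z * math.sqrt(prob * (1.0 - prob) / n_darts) / prob

def required_samples(prob, rel_half_width, z=1.96):
    """Darts needed so that the interval half-width is rel_half_width * prob."""
    return math.ceil(z ** 2 * (1.0 - prob) / (prob * rel_half_width ** 2))

# A rare outcome with an assumed true probability of 1e-4, at ten million darts per point.
rel_error = relative_half_width(1e-4, 10_000_000)

# Sample sizes for +/-10% relative precision at two probability levels.
n_rare = required_samples(1e-4, 0.10)
n_common = required_samples(1e-3, 0.10)
```

Reducing the probability by a factor of ten increases the required sample size by roughly the same factor, which is why the number of aimpoints has to be limited.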

Value Iteration Algorithm.
In practice, the iteration procedure outlined in (4.3) can be prohibitively slow, especially when the number of states we have to consider is large. In order to speed up the convergence of the algorithm, we employed a number of strategies, which we now outline.

4.3.3. Initial Approximation. The iteration algorithm (4.3) requires us to provide initial approximations $V_0(S)$ for each state $S$. The closer these approximations are to the true values, the fewer iterations of (4.3) we will be required to perform before termination. Therefore, if we are able to compute accurate approximations in an efficient manner, then using these approximations as a starting point for our iterative procedure can greatly speed up our computation. To accomplish this, we decompose the cost in terms of number of further turns into two parts. These two parts roughly correspond to the two stages of the game described earlier, i.e. a cost to finish by doubling out and a cost to reach a finishing position.
We consider a player to be in a finishing position once they have reduced their tally to 40 or below. Although a player can finish when $S_t = 50$ by hitting the inner bull, this is rather challenging and usually games are completed by hitting one of the double sections. Furthermore, for all even tallies of 40 or under, there is a corresponding double which allows the player to double out. To approximate the cost of reaching such a position, we utilise the maximum expected scores computed earlier and listed in Table 1 (quadrature method). Given a current tally $S_t > 40$, we subtract $(4 - t)$ times the maximum expected score from $S_t$ to obtain the expected starting tally for the next turn. The difference between the resulting figure and 40 (assuming it exceeds 40) is then divided by 3 times the maximum expected score, in order to approximate the cost of reaching a finishing position. However, if after subtracting $(4 - t)$ times the maximum expected score from $S_t$ the result is 40 or less, or indeed if $S_t \leq 40$, then this step is skipped, as it is assumed the player is in a finishing position or will be on their next turn.
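The reaching-cost part of the approximation can be written as a few lines of Python. This is a sketch of the description above only; the function name `cost_to_reach_finish` and its arguments are hypothetical, and the example uses the optimised expected score of 11.26 quoted later for a player with $\sigma_x = \sigma_y = 80$.

```python
def cost_to_reach_finish(tally, throw, max_expected_score, finish_threshold=40):
    """Approximate number of turns needed to bring the tally down to the threshold."""
    if tally <= finish_threshold:
        return 0.0                                   # already in a finishing position
    darts_left = 4 - throw                           # darts remaining in this turn
    next_turn_tally = tally - darts_left * max_expected_score
    if next_turn_tally <= finish_threshold:
        return 0.0                                   # expected to be in position next turn
    return (next_turn_tally - finish_threshold) / (3 * max_expected_score)

# Example: start of the game, first throw of the turn, expected score 11.26 per dart.
turns_from_501 = cost_to_reach_finish(501, 1, 11.26)
turns_from_40 = cost_to_reach_finish(40, 2, 11.26)
```

As described, the figure returned counts turns from the start of the next turn onwards; whether the current turn contributes a further unit of cost is not specified here, so the sketch leaves it out.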
In approximating the cost of finishing for a player, we make the simplifying assumption that the value $V(S)$ is roughly equal for all values of $S_t$ under 40. We also make the further assumption that the cost is independent of $t$ and $S_1$. Although this is not the case, it greatly reduces the complexity and cost of the calculations. Having made these assumptions, we compute the cost $V^*(S)$ for $S = (S_1, 1, S_1)$ with $S_1 = 2, 3, \ldots, 10$ using the iterative procedure (4.3). Whilst the iterative procedure can be slow when the number of states is large, the calculations are easily accomplished for the small number of states we are now considering. The values of $V^*(S)$ obtained are then averaged across $S_1 = 2, 3, \ldots, 10$ in order to provide an approximate finishing cost. This cost is then added to the cost of reaching a finishing position (if any), in order to produce our initial approximation $V_0(S)$. This process is repeated for all states $S$ under consideration. The values obtained are then used as the starting point for our algorithm, which proceeds to a preliminary iteration stage as described below.

In the iteration procedure (4.3) described above, the updated values $V_{n+1}(S)$ are computed for all states $S$, for every iteration. However, after a number of such iterations it is likely that for many states $S$, the values are close to their true value and remain fairly constant between iterations. For such states, the repeated calculation of the values $V_{n+1}(S)$ represents a computational inefficiency. To counter this, we introduce a preliminary iteration stage. This stage proceeds as per the iteration procedure (4.3); however, once the difference between successive values $V_n(S)$ and $V_{n+1}(S)$ falls below a specified preset tolerance, the iterative procedure is halted for state $S$, whilst continuing for the others. Only once the iterations for all states have halted is the process complete. The inclusion of this preliminary iteration stage allows for an efficient refinement of the values obtained from our initial approximation, prior to commencing with the full value iteration procedure (4.3) outlined earlier.
In our experimentation, this initial value iteration procedure was carried out with state-specific iteration terminating when successive iterations differed by less than 0.01. Subsequently, we carried out the full iteration procedure (4.3) until the difference in successive approximations was below 0.01 for all states simultaneously. Finally, the optimal aimpoint was identified by finding the point $p$ that minimised the value of (4.4), using the values $V^*(S)$ obtained from our three-stage procedure.
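The preliminary and full iteration stages might be organised as in the following Python sketch. It is written for a generic shortest-path problem with a per-step cost `g`, which is an assumed simplification of (4.3), and it is run on a hypothetical four-state toy problem rather than the darts state space; the function name `staged_value_iteration` and the toy transition probabilities are illustrative only. The initial values `V0` are assumed to be zero at the absorbing state.

```python
import numpy as np

def staged_value_iteration(P, g, V0, tol=0.01, max_iter=100_000):
    """P: (actions, states, states) transition array; g: per-step cost of each state."""
    V = np.asarray(V0, dtype=float).copy()
    active = np.ones(V.size, dtype=bool)
    # Preliminary stage: a state is frozen once its successive values differ by < tol.
    for _ in range(max_iter):
        idx = np.flatnonzero(active)
        if idx.size == 0:
            break
        updated = (g[idx] + P[:, idx, :] @ V).min(axis=0)
        active[idx] = np.abs(updated - V[idx]) >= tol
        V[idx] = updated
    # Full stage: sweep every state until all changes fall below tol simultaneously.
    for _ in range(max_iter):
        V_new = (g + P @ V).min(axis=0)
        change = np.max(np.abs(V_new - V))
        V = V_new
        if change < tol:
            break
    policy = (g + P @ V).argmin(axis=0)      # minimising action for each state
    return V, policy

# Toy problem: state 3 is absorbing. Action 0 advances one state with probability 0.6;
# action 1 finishes outright with probability 0.25, otherwise it resets to state 0.
P = np.zeros((2, 4, 4))
for s in range(3):
    P[0, s, s + 1] = 0.6
    P[0, s, s] = 0.4
    P[1, s, 3] = 0.25
    P[1, s, 0] += 0.75
P[:, 3, 3] = 1.0
g = np.array([1.0, 1.0, 1.0, 0.0])

V, policy = staged_value_iteration(P, g, np.zeros(4), tol=1e-8)
```

Freezing states in the preliminary stage can leave some values slightly stale, which is why the full sweep is still needed afterwards.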

Experimentation and Results
Given the number of parameters involved, i.e. $(S_t, t, S_1)$ for the game situation and $(\sigma_x, \sigma_y)$ for the skill level, providing a full rundown of the optimal play in all possible game scenarios for a range of skill levels would be impractical due to the number of possible combinations. Instead, we provide the details of the optimal aimpoints for a range of current tallies $S_t$, ranging from 2 to 250, for each of the possible throws $t = 1, 2, 3$. With regard to the initial tally $S_1$, we mainly restrict our attention to two cases. First, we take the initial tally to be equal to the current tally, $S_1 = S_t$, corresponding to a scenario where a player has achieved a cumulative score of 0 so far in their turn. At the other extreme, we consider the case where the player has scored the maximum possible so far in their current turn and the initial tally $S_1$ is as large as possible given $S_t$ and $t$. Therefore, $S_1 = S_1$ for the first throw, whilst $S_1 = S_2 + 60$ for the second throw and $S_1 = S_3 + 120$ for the third. By considering these two extreme cases we hope to observe the influence of the initial tally on strategy selection. Finally, for the skill level, we once again assume the horizontal and vertical accuracies to be equal, that is $\sigma_x = \sigma_y = \sigma$, and consider the values $\sigma = 12.5$, 25, 40, 50 and 80. These values were selected to correspond with the player classes A to E, identified in [18], so as to enable comparison between results. A full rundown of the aimpoints for these players can be found in Tables 4 to 8 within the Appendix. We now outline a number of general findings from the results and attempt to interpret the underlying strategical reasoning behind them.

Comparison of Early Game Strategy: Does Maximising Scoring Equal Minimising Turns?
In the first half of this paper, we looked at the problem of aimpoint identification in order to maximise a player's expected score from a single throw. This was done under the assumption that the player is best served by reducing their tally as quickly as possible in the initial stages of the game, and that the best way to do so is to maximise their score per dart. However, the aim of the game is to finish before the opponent, and therefore minimising the number of further turns is a more appropriate strategy to follow. In reality, we would expect the two strategies to coincide during the initial stages of the game. Examining Tables 4 to 8 in the Appendix, we can see that for each of the skill levels, and for the highest tally values, the aimpoint identified via the turn minimisation procedure of this section largely coincides with the maximum scoring aimpoints detailed in Table 1.
Below in Table 3 and Figure 8 we can see the aimpoints identified via the turn minimisation procedure, for a wider range of skill levels, under the assumptions $t = 1$ and $S_1 = 250$, that is for $S = (250, 1, 250)$. Comparing these to the aimpoints for score maximisation, given in Table 1 and Figure 4 (also plotted in blue in Figure 8), we can see strong agreement between the turn minimisation and score maximisation optimised aimpoints. We should also note that some of the discrepancies that appear are merely a consequence of the different meshes used between the two problems. This agreement not only confirms our intuitive belief regarding the optimal strategy in the early stages, but also provides reassurance regarding the validity of the methods and results derived in this section.

Players of the highest skill level should start considering their finish before reaching this point in order to avoid these values. For example, a professional on 189, taking their final throw of a turn, generally will switch from their usual treble 20 to the treble 19. The reasoning behind this is that if they miss the intended treble, then the single 19 will also leave a three-dart outshot of 170, whereas a single 20 leaves the player stranded on 169. However, for most players the chances of making one of these high outshots is very low; there is therefore less need to consider their finish this early, and they are better served by continuing to maximise their scoring for longer. Applying our method to the above example of the game state $S = (189, 3, S_1)$, we found that for players of high skill level ($\sigma = 5, 10, 15$), whose regular score maximisation aimpoint is the treble 20, the recommended aimpoint switches from the treble 20 to the centre of the treble 19. For players with skill levels just below this, that is for $\sigma$ just above 15, the score maximisation aimpoint is itself the treble 19 and hence no switch is observed. However, if we consider decreasing the skill level further, for $\sigma \gg 15$, the recommended turn minimising aimpoints for game state $S = (189, 3, S_1)$ correspond to the score maximising points as expected.
To examine this phenomenon of strategy switching, we observed the highest tally $S_t$ at which the recommended aimpoint significantly moves away from the score maximising location. This was carried out for a range of skill levels and for throws $t = 1, 2, 3$. In all cases we took $S_1 = S_t + (t - 1)s$, where $s$ represents the optimised expected score for the player (rounded to the nearest integer), as outlined in Table 1. This choice of $S_1$ was made in order to be as representative as possible of a player who has, up until the point in question, been following a score maximisation strategy.
In all cases, starting with a high value for the tally $S_t$, we reduce this one at a time and observe the location of the optimal aimpoint. As one would expect from the previous section, initially the aimpoint identified corresponds to the score maximising point. However, as $S_t$ continues to be reduced, we reach a point at which movement is observed. Typically, the first movements consist of small oscillations around the score maximising aimpoint. The significance of such movements is not always apparent, most likely being a slight favouring of location due to the odd or even value of the tally, but ultimately the aimpoint still signifies a focus on maximal scoring. Therefore, we are interested in measuring the first significant move in aimpoint away from the score maximising aimpoint, which we define to be a movement of 20 mm or more. Having observed such a move, we record the tally $S_t$ as the tally at which the player's optimal strategy switches from scoring to finishing. The results are presented graphically in Figure 9, where we have fitted the least squares polynomial of degree 4, which was found to offer a good fit. The chart provides an indication of the tally $S_t$ at which a player of a given skill level should start considering their endgame strategy, and as expected, the higher the skill level of the player, the earlier consideration must be given to doubling out.

Looking at Tables 4 and 8, and particularly comparing the two cases for $S_1$ within each skill level, we see that the initial tally $S_1$ can have a significant impact on the optimal aimpoint. This is in contrast to the results of [18], where $S_1$ did not feature as a factor.
Comparing high and low skill levels, we can observe that the significance of $S_1$ is greater for lower skilled players. Examining the two cases for $S_1$, for the highest skilled players with the lowest values of $\sigma$ (e.g. Table 4), we see very little difference between the two strategy recommendations.
In general, the highest skilled players are recommended to go for the aimpoints that will see them double out in the lowest possible number of turns, even at the risk of going bust. Considered from a cost-benefit perspective, this strategy would seem reasonable. Given their higher accuracy, the likelihood of success for these players is greater, whereas the cost of going bust is reduced, as their higher average scoring ability enables them to return to a finishing position sooner than a lower skilled player.
At the other extreme, for the lower skilled players with larger values of $\sigma$ (e.g. Table 8), we can observe a number of occurrences of aimpoints with radial distance 5000 mm (shown in red), signifying the recommendation of an intentional miss. This represents the optimal choice for a player when the expected cost of going bust exceeds the expected cost of foregoing an immediate opportunity to finish. For example, consider the scenario where a player of skill level ($\sigma_x = 80$, $\sigma_y = 80$) starts their turn on a tally of 128 and manages to score treble 20s with both their first and second throws, leaving them on 8 ahead of their third throw, corresponding to a game state $S = (8, 3, 128)$. Now, perhaps the obvious choice for this player is to aim for the double 4 with their final dart, in an attempt to double out as soon as possible. However, in doing so the player faces a high chance of going bust due to the relatively low level of their tally and their inaccuracy. In this case, the cost of going bust for the player involves having their tally reset to 128, undoing the progression from their first two throws, which were unusually high scoring. With an optimised expected score per throw for such a player being 11.26, were they to go bust, we would expect it to take approximately two to three further turns for them to get their tally back to a level where they have options to double out. Therefore, the recommended strategy in this case is for the player to intentionally miss, in order to protect the progression made during the turn. In comparison, if we consider the same player, again facing a tally of 8 on the final throw of their turn, but this time having commenced their turn also on 8 and scored 0 with their first two throws, i.e. a game state of $S = (8, 3, 8)$, then the optimal strategy is to aim towards the double 4 in an attempt to finish as soon as possible.
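As a rough check on the figure of two to three further turns, suppose (as an assumption made here for illustration) that "options to double out" corresponds to the finishing-position threshold of 40 used in the initial approximation, and that the player scores exactly their optimised expected score of 11.26 with each of the three darts in a turn. Then the number of turns needed to recover from a tally of 128 is approximately

```latex
\[
  \frac{128 - 40}{3 \times 11.26} \;=\; \frac{88}{33.78} \;\approx\; 2.6 .
\]
```

This is consistent with the estimate quoted above, although it ignores the variability in scoring from throw to throw.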
The strategy of intentionally missing comes as a consequence of the rules regarding going bust.
As the simplified rules considered in [18] did not feature a tally reset, this phenomenon was not observed by the author. Its occurrence here is illustrative of how, under the normal game rules, the initial tally $S_1$ can impact strategy choice, and it is perhaps the most significant example of how the results in this paper differ from those of [18].

Conclusion and Further Considerations
In this paper, we considered the problem of a player looking to minimise the expected number of further turns to complete the standard darts game 501. This represents the first study of the full version of the game, as compared to simplified variations, as in [18], or simply score maximisation, for example in [7,28,17,4]. By framing the problem as a Markov decision process, and utilising the methods of reinforcement learning coupled with Monte Carlo simulation, we were able to determine the best aimpoint for a player of a given skill level in an arbitrary game situation. In contrast to the simplified game variation previously considered, the optimal strategy in our game was found to be dependent upon the starting tally of the player's turn. As a result, we observed some deviation in aimpoint recommendations from those given in [18], with some interesting consequences arising given the correct game scenario conditions, such as the recommendation to intentionally miss.
As discussed previously, a player ultimately optimises their strategy when they maximise their probability of finishing before their opponent. This presents an apparently complex problem which, in addition to the factors considered in this study, requires consideration of the adversary's game position, skill level and strategy. For example, if the opponent is highly skilled and is getting close to finishing, then our player should likely favour a more aggressive strategy which increases their probability of finishing in a low number of turns, even if on average such a strategy results in a higher cost in terms of turns to finish. Furthermore, just as our player should give consideration to their opponent, so an opponent playing in an optimal fashion must consider the skill, game position and strategy of the original player. These two-way interactions in strategy could occur back and forth indefinitely, and it is not immediately clear that the method developed in this paper would generalise to allow for the inclusion of the opponent in the analysis. In any case, this would greatly increase the number of factors in the game state $S$. Perhaps as a first step we might consider the expected number of turns it will take the opponent to finish, using the methods developed in this paper, then seek the aimpoints which maximise the probability that our original player doubles out in fewer turns.
Appendix A. Tables of Aimpoints for Section 5

Table 4. Optimal aimpoints for player with skill $\sigma_x = \sigma_y = 12.5$.

Figure 1. Layout of a standard dartboard.

Figure 2. Specification of points on dartboard.

Figure 3. Heatmaps of expected score for various skill levels.

Figure 4. Optimal points for maximising expected score.

Figure 5. Expected score at optimal point vs. treble 20 vs. board centre.

Figure 8. Turn minimisation aimpoints for large tally $S_t$.

Figure 9. Switch from score maximisation to finishing.
4.2. Solution Method. Let us denote the state $(S_t, t, S_1)$ by $S$ and use $S'$ to represent some alternative state $(S'_{t'}, t', S'_1)$, which we imagine the player advancing to after their next throw. We use $V(S)$ to denote the value of $E(T \mid S) = E(T \mid S_t, t, S_1)$, and the notation $\Pr(S' \mid S, p)$ to signify the probability of transitioning from state $S$ to state $S'$ whilst throwing at the point $p$. Finally, $C(S')$ is used to denote the additional cost associated with taking the final throw of a turn, either by taking the third throw or else by going bust. Therefore, $C(S')$ takes the value 1 when $t' = 1$, and takes the value 0 when $t' = 2$ or 3. With such a framework in place, the equation (4.1) now takes the form

Table 3. Turn minimisation aimpoints for large tally $S_t$.