Application of Machine Learning Techniques in Short-term Travel Time Prediction Using Multiple Data Sources

Having access to accurate travel time is of great importance for both highway network users and traffic engineers. The travel time currently reported on several highways is estimated by employing naïve methods and using limited sources of data. This results in unreliable and inaccurate travel time prediction and could impose delay on travelers. Therefore, the main objective of this study is short-term prediction of travel time for highways using multiple data sources including loop detectors, probe vehicles, weather condition, network, accidents, road works, and special events in order to consider the effect of different factors on travel time. To this end, two machine learning methods, K-Nearest Neighbors and Random Forest, are employed. After applying a data cleaning process to the datasets and combining them, the models are trained to predict and compare short-term harmonic average speed as a representative of travel time for 5-minute prediction horizons up to one hour ahead. The travel time is calculated as the ratio of the length of each link to the harmonic average speed of all reporting vehicles. Hence, a model is trained for each technique to predict travel time 5 minutes ahead, 10 minutes ahead, and so on up to 60 minutes ahead. The results confirm satisfactory performance of both models in short-term travel time prediction, with the Random Forest model performing slightly better. A feature importance and sensitivity analysis is also applied to the Random Forest model, and traffic variables are found to be the most effective variables in predicting the travel time.


Introduction
Nowadays, having sufficient information about travel time plays a pivotal role for both road users and traffic engineers. Road users can hardly make decisions before and during their trips unless they have access to reliable information about the travel time of the route they are passing through. Having access to accurate travel time also enables traffic engineers to increase the efficiency and safety of traffic in road networks (Van Lint, 2008). Travel time can be defined as the total time required for a vehicle to travel a route from one point (origin) to another point (destination), including all the delays which might be imposed on that vehicle (Zhu et al., 2009). Currently, the travel time which is reported on Variable Message Signs (VMSs) of some highways is estimated by employing naïve methods that use limited sources of data (Huisken and van Berkum, 2003). This might cause inaccurate travel time prediction and impose delay on travelers, which consequently makes the travel time reports on the VMSs unreliable. Therefore, the objective of this study is to consider the effect on highway travel time of different factors such as weather condition, accidents, road works, and special events, which are rarely considered in other studies, by combining these real-time sources of data and employing machine learning methods.
Studies aimed at finding travel time can be divided into two main groups: travel time estimation and travel time prediction (Mori et al., 2015). Travel time estimation studies focus on calculating travel times of trips that have already finished, using the data captured during those trips (Celikoglu, 2013; Li et al., 2013; Soriguera and Robuste, 2010). Travel time prediction studies, on the other hand, concentrate on using historical and real-time data to forecast the travel time for future time intervals (Van Lint, 2004). Several methods are used for travel time prediction; they differ in complexity and accuracy, which makes each of them suitable for specific conditions.
Basically, travel time prediction models can be categorized into three main groups. The first group is naïve models, which generally consist of very simple methods (Huisken and van Berkum, 2003). These models rest on several assumptions which make them simple but less accurate (Van Lint, 2004). Thus, these models are generally used as the baseline for other methods or whenever just a simple prediction is required. One well-known naïve model is the instantaneous predictor, which assumes that the traffic condition remains constant over time (Van Lint and Van Hinsbergen, 2012). Based on this assumption, the most recent estimate of travel time is taken as the travel time in the future. Extending the prediction horizon therefore reduces the accuracy of this type of model.
Another kind of naïve predictor uses historical travel time data to predict the future travel time based on the similarity of traffic conditions in different periods of time (Schmitt and Jula, 2007). This model works well when recurrent traffic conditions occur on a specific path (Van Lint and Van Hinsbergen, 2012).
The second group is traffic-theory-based models. These models are based on the relation between traffic variables such as speed, density, and occupancy. Since the prediction refers to travel time in the future, the first step in these models is to simulate the traffic condition of a network for a given time in the future; travel time can then be predicted based on the traffic condition of that network (Mori et al., 2015). In this regard, three kinds of simulation, namely macroscopic (Papageorgiou et al., 2010), microscopic (Edara et al., 2017), and mesoscopic (Taylor, 2003), are applied in various studies.
The last group, data-based models, requires a large amount of data and uses statistical methods to predict travel time. In general, the more data are available, the more accurate the result tends to be. These models can be divided into two main groups of parametric and non-parametric models (Mori et al., 2015). Parametric models have a pre-defined function, and only the parameters of the model need to be estimated. There are several parametric models, each with its own structure, such as Linear Regression (Du et al., 2012), Bayesian Nets (Castillo et al., 2011), and Time Series models (Billings and Yang, 2006; Karimpour et al., 2017; Yang, 2005).
Non-parametric models are the other part of data-based models, in which the structure of the function, as well as the number and typology of parameters, is determined from the data. The main goal of these methods is to utilize a large database and learn the relationships between variables, instead of using complex approaches, in order to predict travel time. One of the best-known groups is Artificial Neural Network models, which have been widely used in transportation studies (Dharia and Adeli, 2003; Fan and Gurmu, 2015; Parsa et al., 2020a, 2020b, 2019a, 2019b). In addition, K-Nearest Neighbors (Qiao et al., 2013; Yu et al., 2016; Zhao et al., 2018), Regression Trees (Nikovski et al., 2005), Local Regression (Simroth and Zahle, 2010), Support Vector Regression (Wu et al., 2004; Yildirimoglu and Ozbay, 2012), and Random Forest (Leshem and Ritov, 2007) are non-parametric models which perform well for travel time prediction.
One of the most important factors in choosing a method for estimating or predicting travel time is the available data. Although there are several kinds of methods, each with its own advantages and disadvantages, none of them can be used unless adequate and relevant sources of data are available. Based on the classification by Wu et al. (2004), several kinds of sensors are used for obtaining traffic data, and these sensors can be grouped with respect to their ability to obtain travel time directly. Accordingly, the two main groups of sensors are link-based and point-based sensors. Link-based sensors measure the travel time directly for a route or link by passive probe vehicles, active test vehicles, or license-plate matching.
With point-based sensors, travel time is measured indirectly by devices such as loop detectors, laser detectors, and video cameras. The data collected from link-based sensors are more accurate; however, point-based sensors are more widely available and can collect cost-effective real-time data.
Given the characteristics of the different methods, and since the main data source for this study is collected from loop detectors, data-based methods are employed to predict the travel time. In general, to the best of the authors' knowledge, most studies in this area have considered only traffic-related variables in developing a model for travel time prediction.
However, other variables such as weather condition can have a significant impact on the travel time (Qiao et al., 2012; Rakha et al., 2012), which is infrequently investigated in travel time prediction studies. Nookala (2006) showed that inclement weather can increase traffic congestion and travel time, since the capacity of the highway decreases in inclement weather while the traffic demand does not change. Therefore, the main goal of this study is to predict travel time for links of the Eisenhower highway in the city of Chicago by utilizing a combination of different sources of data including loop detectors, probe vehicles, weather condition, network, accident occurrence, road works, and special events. To this end, the average speed of each link, and consequently its travel time, is predicted using KNN and Random Forest models.
In what follows, the datasets used in this study and the process of cleaning and combining them are introduced first. The methods employed to predict the travel time are then described. Next, the results of the models are shown and explained. Finally, the results, conclusions, and limitations of the study are discussed.

Data
In this study, different sources of data are used in order to develop models to predict travel time.

Loop Detectors Data
The main part of the dataset is the information captured by inductive loop detectors in the Chicago highway network, which is collected by the Gateway Traveler Information System and provided by the Illinois Department of Transportation (IDOT). For this study, a 5-mile segment of the westbound Eisenhower highway, containing 14 loop detectors, is selected. For each loop detector in each lane of the highway, the number of vehicles, occupancy, and average speed are reported every 20 seconds. These raw data contain many missing and erroneous records, which might be caused by detector malfunction, deterioration of pavement, or other reasons. Therefore, two types of thresholds, single and combined, are applied to the dataset as part of the data cleaning procedure.

Single Threshold
Different studies use different thresholds for the count, occupancy, and speed values. In this study, based on the characteristics of the selected highway, a value of 3000 vehicles per hour per lane is set as the threshold for the count, which is approximately 17 vehicles per 20 seconds per lane. So, if the count in a lane does not lie within the range (0, 17), it is labeled as an erroneous data point. Moreover, extreme occupancy values (i.e., more than 90% in 5 minutes) are assumed incorrect, so any data point with an occupancy value outside the range (0, 90) is labeled as an erroneous data point. In addition, we set 80 mph as the upper threshold for speed, and any point with a speed of more than 80 mph or less than 0 mph is considered an outlier.
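The single thresholds above can be sketched as a small range check. This is a minimal illustration, not the authors' implementation: the function and constant names are hypothetical, and treating the bounds as inclusive is an assumption, since the text does not say whether the end points themselves are valid.

```python
# Bounds taken from the text; inclusive end points are an assumption.
MAX_COUNT_PER_20S = 17      # about 3000 veh/h/lane * 20 s / 3600 s
MAX_OCCUPANCY_PCT = 90.0
MAX_SPEED_MPH = 80.0


def violates_single_threshold(count, occupancy, speed):
    """Return True if any of the three values falls outside its range."""
    return not (
        0 <= count <= MAX_COUNT_PER_20S
        and 0.0 <= occupancy <= MAX_OCCUPANCY_PCT
        and 0.0 <= speed <= MAX_SPEED_MPH
    )


# 3000 vehicles per hour corresponds to roughly 16.7 vehicles per 20 seconds.
print(round(3000 * 20 / 3600, 2))
print(violates_single_threshold(12, 35.0, 58.0))   # plausible record
print(violates_single_threshold(25, 35.0, 58.0))   # count above threshold
```

Records flagged this way would then be removed or imputed rather than used directly.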

Combined Threshold
As mentioned earlier, three variables (count, occupancy, and speed) are reported by loop detectors in this study. The first threshold in this group is described as "only one zero variable out of three": if only one of the three variables is zero while the other two are non-zero, the record is considered erroneous. The second combined threshold is "only one non-zero variable out of three": if two of count, occupancy, and speed are zero while the other one is non-zero, the record is again labeled as an error. Finally, the last combined threshold is "all zero variables", in which a record is considered erroneous if count, occupancy, and speed are all zero. This threshold filters out the data points that could distort the average speed of intervals.
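Taken together, the three combined thresholds depend only on how many of the three variables are zero. The sketch below makes that explicit; the function name is hypothetical and the rule labels are copied from the text.

```python
def combined_threshold_label(count, occupancy, speed):
    """Return the name of the combined rule a record violates, or None."""
    zeros = sum(1 for value in (count, occupancy, speed) if value == 0)
    if zeros == 1:
        return "only one zero variable out of three"
    if zeros == 2:
        return "only one non-zero variable out of three"
    if zeros == 3:
        return "all zero variables"
    return None  # all three values are non-zero, so the record passes


print(combined_threshold_label(8, 12.0, 55.0))
print(combined_threshold_label(8, 12.0, 0.0))
print(combined_threshold_label(0, 0.0, 55.0))
print(combined_threshold_label(0, 0.0, 0.0))
```

Under these three rules, a record passes the combined check only when count, occupancy, and speed are all non-zero.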

Data Imputation
The missing and erroneous data points could negatively affect the results of prediction models (Karimpour et al., 2019). Therefore, these points should be either eliminated from the dataset or imputed in one of the following three ways. First, "temporal estimation" is used when a detector has a missing data point or reports a value which exceeds a threshold for an interval. In this case, an appropriate way of imputing the data point is to take the average of the values for the previous and next time intervals, especially when the time intervals are short (in this study, 20 seconds). In the second approach, "spatial estimation", the missing or erroneous data points are imputed with the average of the values from the previous and next loop detectors at the same time. This method is appropriate whenever loop detectors are close to each other or temporal estimation is not applicable. Finally, "historical estimation" can be used whenever the traffic condition is recurrent at the location where missing or erroneous data is reported. In this case, the erroneous data points are imputed using the data reported at the same location for the same time of day and the same day of week in the past.
In this study, the speed values in the loop detectors dataset are time mean speeds. The time mean speed is calculated by averaging the point speeds of all vehicles passing a loop detector at that detector's location. This speed cannot accurately reflect the travel time of a path segment, since the speed of vehicles can vary across that segment. Therefore, for the speed values, another source of data, collected by probe vehicles, is utilized in this study.
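The temporal and spatial estimation rules described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions: missing or erroneous values are represented as `None`, temporal estimation is tried before spatial estimation (the text does not fix an order), and all function names are hypothetical. Historical estimation is omitted.

```python
def temporal_estimate(series, i):
    """Average of the previous and next interval, or None if unavailable."""
    if 0 < i < len(series) - 1:
        prev_value, next_value = series[i - 1], series[i + 1]
        if prev_value is not None and next_value is not None:
            return (prev_value + next_value) / 2
    return None


def spatial_estimate(upstream, downstream, i):
    """Average of the neighbouring detectors at the same interval."""
    if upstream[i] is not None and downstream[i] is not None:
        return (upstream[i] + downstream[i]) / 2
    return None


def impute(series, upstream, downstream):
    """Fill gaps temporally when possible, otherwise spatially."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            estimate = temporal_estimate(series, i)
            if estimate is None:
                estimate = spatial_estimate(upstream, downstream, i)
            filled[i] = estimate
    return filled


# Made-up 20-second speed readings (mph) for three consecutive detectors.
speeds = [60.0, None, 56.0, None, None, 50.0]
upstream = [61.0, 59.0, 57.0, 55.0, 53.0, 51.0]
downstream = [59.0, 57.0, 55.0, 53.0, 51.0, 49.0]
print(impute(speeds, upstream, downstream))
```

In the example, the isolated gap is filled from its temporal neighbours, while the two consecutive gaps fall back to the neighbouring detectors.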

Probe Vehicles Data
This dataset is provided by the National Performance Management Research Data Set (NPMRDS) and includes the harmonic average speed, free flow speed, historical speed, travel time, and annual average daily traffic (AADT) for each highway link. INRIX creates the NPMRDS dataset of average speeds and travel times for specific road segments across the NPMRDS road network every 5 minutes. INRIX uses its large data source, which includes millions of connected vehicles and trucks that supply location and movement data anonymously. It also implemented a path-processing algorithm to meet the NPMRDS requirements using its existing source data.
For the present study, the travel time from this dataset is the quantity to be predicted. It is defined as the ratio between the segment length and the harmonic average speed of vehicles passing the segment. Since the travel time of each link depends on the length of the link, the harmonic average speed is selected as the target variable, so that once the average speed is predicted, the travel time can be calculated using the length of each link. The missing values of this dataset are imputed by temporal, spatial, and historical estimation, as explained in the previous section.
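The relation between harmonic average speed and travel time can be checked with a short numeric sketch. The speeds and link length below are made-up values, and the function name is hypothetical.

```python
from statistics import harmonic_mean, mean


def link_travel_time_minutes(length_miles, speeds_mph):
    """Travel time as link length divided by the harmonic average speed."""
    return 60.0 * length_miles / harmonic_mean(speeds_mph)


speeds = [30.0, 60.0]   # two vehicles crossing a 1-mile link
length = 1.0

# Individual travel times are 2 minutes and 1 minute, so the mean is 1.5.
individual = [60.0 * length / v for v in speeds]
print(mean(individual))
print(link_travel_time_minutes(length, speeds))   # about 1.5 as well
print(60.0 * length / mean(speeds))               # arithmetic mean gives less
```

The harmonic average reproduces the mean of the individual travel times, whereas the arithmetic (time mean) speed underestimates it, which is consistent with the choice of harmonic average speed as the target.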

Weather Condition Data
The hourly weather condition data provided by the National Weather Service (NWS) is another source of data used in the present study. This data is available for airport stations and, based on the location of the selected segment, data from the Midway International Airport is utilized. In the dataset, weather condition is reported as one of 94 different states. In this study, all the states are grouped into four categories from fair to severe weather conditions. The first category groups weather conditions such as "Fair" and "Partly Cloudy". The second category comprises weather conditions like "Fair with Light Haze" and "Light Haze Dust and Windy". Unfavorable weather conditions such as "Light Drizzle" and "Light Rain Fog" constitute the third category. Finally, the fourth and most severe category includes weather conditions like "Snow Fog" and "Thunderstorm Heavy Rain Fog". In addition to the weather condition, surface temperature, which is also available in the dataset, is used in this study. As in the other data sources, there are many missing values in this dataset, which are imputed using the information from the previous and next time intervals.
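The grouping of weather states into four categories amounts to a lookup table. The sketch below contains only the eight states named in the text; the remaining states of the 94 are not listed in the paper and are deliberately left out. The numeric codes 1 to 4 and the names are assumptions made for illustration.

```python
# Only the states quoted in the text; the full 94-state mapping is not given.
WEATHER_CATEGORY = {
    "Fair": 1,
    "Partly Cloudy": 1,
    "Fair with Light Haze": 2,
    "Light Haze Dust and Windy": 2,
    "Light Drizzle": 3,
    "Light Rain Fog": 3,
    "Snow Fog": 4,
    "Thunderstorm Heavy Rain Fog": 4,
}


def weather_category(state):
    """Return the category (1 = fair ... 4 = severe), or None if unmapped."""
    return WEATHER_CATEGORY.get(state)


print(weather_category("Partly Cloudy"))
print(weather_category("Snow Fog"))
print(weather_category("Unlisted State"))
```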

Accident Data
This dataset is also collected by the Gateway Traveler Information System. It reports the details of accidents that occurred on highways. Time of accident, geographical location, and severity of accident are among the most important variables in this dataset.
Based on the analysis of the accident data and its observed effect on highway traffic in the studied area, when an accident occurs on a highway link, that link, the previous link, and the next link are affected on average for one hour after the accident occurrence. This effect is represented by binary variables that mark the affected links and time intervals in the model. Two binary variables, for the previous link and the next link, are defined, since the effect of an accident differs upstream and downstream. Another variable in the model represents the accident severity at three levels: minor, medium, and major. Whenever the accident severity is not determined at accident detection time, the medium level, which is the most common one, is selected.
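The two accident indicators can be sketched as below, following the definitions of IsAcc_p and IsAcc_n given in the variable list of the paper. This is a hedged illustration: links are assumed to be numbered consecutively along the road so that `link - 1` is the previous link and `link + 1` the next one, and the function name and input layout are hypothetical.

```python
from datetime import datetime, timedelta

EFFECT = timedelta(hours=1)     # accidents are assumed to matter for one hour
BIN = timedelta(minutes=5)      # length of one aggregation interval


def accident_flags(link, bin_start, accidents):
    """Return (IsAcc_p, IsAcc_n) for one link and one 5-minute interval.

    accidents is an iterable of (link_index, accident_time) pairs.
    """
    is_acc_p = is_acc_n = 0
    for acc_link, acc_time in accidents:
        # The interval overlaps the hour that follows the accident.
        active = bin_start + BIN > acc_time and bin_start < acc_time + EFFECT
        if active and acc_link in (link, link - 1):
            is_acc_p = 1
        if active and acc_link in (link, link + 1):
            is_acc_n = 1
    return is_acc_p, is_acc_n


accidents = [(4, datetime(2017, 9, 11, 8, 2))]
print(accident_flags(5, datetime(2017, 9, 11, 8, 0), accidents))
print(accident_flags(3, datetime(2017, 9, 11, 8, 30), accidents))
print(accident_flags(5, datetime(2017, 9, 11, 9, 5), accidents))
```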

Road Work Data
This dataset, which is collected by the Gateway Traveler Information System, comprises details of construction or maintenance operations on the highways. Variables such as start time, end time, geographical location, and severity are reported in this dataset.
In order to consider the effect of road works in the model, a binary variable is defined.
When a roadwork project is under way on a highway link, that link, the previous link, and the next link are affected from the start time to the end time of the project. Also, if one of the lanes of the highway is blocked, the "number of lanes" variable accounts for its effect in the model.

Special Event Data
The special event dataset collected by the Gateway Traveler Information System provides detailed information about stadium events, parades, and road races. Several variables such as start time, end time, location, and type of event are reported in this dataset.
When a special event occurs close to highway links, those links are considered to be affected by that event for one hour before or one hour after the event. Whether the preceding or the following hour is used depends on the direction of the extra traffic and the relative locations of the event and each link. For example, if a highway link is located on the path toward an event location, the extra traffic passes that link before the event in order to reach that location, so in this case we consider that link to be affected for one hour before the start of the event. For links located on the path from the event location toward other parts of the city, the affected time is one hour after the end of the event. Since these variables can considerably impact the travel time of a route, we combine and use them in the model.
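The direction-dependent event window described above can be written as a simple time test. This sketch covers only the hour before the start (for links leading toward the event) and the hour after the end (for links leading away); whether the duration of the event itself is also flagged is not settled by the text and is left out here. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

ONE_HOUR = timedelta(hours=1)


def is_affected_by_event(bin_start, event_start, event_end, leads_to_event):
    """True if the 5-minute interval falls in the link's affected hour."""
    if leads_to_event:
        # Extra traffic arrives during the hour before the event starts.
        return event_start - ONE_HOUR <= bin_start < event_start
    # Extra traffic leaves during the hour after the event ends.
    return event_end <= bin_start < event_end + ONE_HOUR


start = datetime(2017, 9, 17, 12, 0)
end = datetime(2017, 9, 17, 15, 0)
print(is_affected_by_event(datetime(2017, 9, 17, 11, 30), start, end, True))
print(is_affected_by_event(datetime(2017, 9, 17, 11, 30), start, end, False))
print(is_affected_by_event(datetime(2017, 9, 17, 15, 30), start, end, False))
```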

Network Data
This dataset includes several variables from the studied highway segment, such as number of lanes, number of entrances, number of exits, and length for each of the highway links.
After applying the data imputation process to each of the mentioned datasets, all of them are combined based on their location and timestamps, so that a link-based dataset aggregated in 5-minute intervals from April 2017 to December 2017 is created for 15 links of the westbound Eisenhower highway. Table 1 presents the variables of the final combined dataset, which is used for developing travel time prediction models.
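As an illustration of the 5-minute aggregation step, the sketch below bins 20-second detector records by flooring their timestamps to the start of the enclosing 5-minute interval. The aggregation functions are assumptions, since the paper does not state them: counts are summed and occupancy is averaged. The record layout and function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def floor_to_5min(ts):
    """Start of the 5-minute interval that contains ts."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def aggregate_5min(records):
    """records: iterable of (timestamp, count, occupancy).

    Returns {interval_start: (total_count, mean_occupancy)}.
    """
    bins = defaultdict(list)
    for ts, count, occupancy in records:
        bins[floor_to_5min(ts)].append((count, occupancy))
    return {
        start: (sum(c for c, _ in rows), sum(o for _, o in rows) / len(rows))
        for start, rows in bins.items()
    }


records = [
    (datetime(2017, 9, 11, 8, 0, 0), 5, 10.0),
    (datetime(2017, 9, 11, 8, 0, 20), 7, 20.0),
    (datetime(2017, 9, 11, 8, 4, 40), 6, 30.0),
    (datetime(2017, 9, 11, 8, 5, 0), 4, 40.0),
]
for start, values in sorted(aggregate_5min(records).items()):
    print(start, values)
```

The interval start can then serve as the key for joining the detector data with the 5-minute probe vehicle records and the other sources.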

Methods
Machine learning methods are compatible with the type of data available for this study and are therefore used to develop the travel time prediction model. Among different machine learning methods, K-Nearest Neighbors and Random Forest are selected based on their specific properties, which are described as follows.

K-Nearest Neighbors
K-Nearest Neighbors (KNN) needs only a large volume of data points and can predict the target variable without developing a mathematical model. Also, there is no need to define the parameters of the model in advance, and the authenticity of the data is maintained since no smoothing procedure is applied to the data. Therefore, this method is an appropriate choice for the non-parametric problem of travel time prediction (Yu et al., 2016).
This method is among the best-known supervised machine learning techniques and is widely used for regression and classification. As explained by Han et al. (2011), the KNN method classifies a record by comparing it with similar records in a training dataset. In this study, the KNN regression algorithm is used, since the target is speed, which is a continuous value rather than a predefined class. In general, the training dataset has n attributes, and each of its records can be represented as a point in an n-dimensional variable space by the values of its n attributes. To predict the target value for a new record from the test dataset, the record is located in the variable space by its attribute values, and the algorithm looks for the k training data points that are closest to it. Therefore, in this study the speed value for a new record is predicted by taking the average of the speed values of its k nearest neighbors. In other words, the method searches for the nearest neighbors among all historical data points and uses them for the prediction.
In this study, the Euclidean metric is used to calculate the distance between two points in the n-dimensional variable space. Between two points $X_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$, the Euclidean distance is defined as Equation 1:

$$d(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2} \qquad (1)$$

Before calculating the distance between points, the values of attributes are normalized in order to prevent attributes with larger values, like occupancy, from outweighing attributes with smaller values, like binary attributes (Han et al., 2011).
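The KNN regression procedure described in this section (normalize the attributes, compute Euclidean distances, average the targets of the k nearest training points) can be sketched in a few lines. This is a minimal illustration rather than the study's implementation: min-max scaling is assumed as the normalization, the feature values are made up, and the function names are hypothetical. The value k = 4 mirrors the number of neighbors reported in the Results.

```python
import math


def min_max_scaler(rows):
    """Return a function that scales each attribute to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return lambda row: [(v - lo) / s for v, lo, s in zip(row, lows, spans)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=4):
    """Average target of the k training points closest to the query."""
    scale = min_max_scaler(train_x)
    scaled_query = scale(query)
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: euclidean(scale(pair[0]), scaled_query),
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Made-up records: (occupancy in percent, accident flag) -> speed in mph.
train_x = [[5, 0], [10, 0], [15, 0], [20, 0], [60, 1], [70, 1]]
train_y = [62, 60, 58, 55, 25, 20]
print(knn_predict(train_x, train_y, [12, 0], k=4))
```

Without scaling, the occupancy attribute (range 5 to 70) would dominate the binary flag (range 0 to 1) in the distance computation, which is the issue the normalization step addresses.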

Random Forest
Random Forest is an ensemble supervised machine learning method which can be used as a classifier or regressor for categorical or numerical datasets (Han et al., 2011). In this method, a combination of several random Decision Trees (DTs) is utilized, so that each of the DTs votes to predict the value of the target variable. To combine the DTs, Bootstrap Aggregation (bagging), a model averaging approach for combining machine learning models to increase accuracy, is used. To develop a Random Forest model, multiple random samples are selected from the training dataset with replacement over several iterations, and a DT is trained for each of them. Then, for a new record from the test dataset, each of the trained DTs returns a value of the target variable. The final result is calculated by taking the average of all the values predicted by the DTs. Random Forest is robust to noisy data and overfitting, and it is expected to have higher accuracy than an individual DT, since it decreases the variance of the DTs (Han et al., 2011). Random Forest typically works accurately and fast when a large dataset is available. It can also manage a large number of variables as model inputs. These characteristics make the Random Forest model an appropriate choice for predicting the travel time in this study (Fan et al., 2018).
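The bagging procedure described in this section can be illustrated with a compact, dependency-free sketch: bootstrap samples are drawn with replacement, one regression tree is grown per sample, and the forest prediction is the average of the tree predictions. This is a simplified teaching sketch, not the model used in the study. The split criterion (sum of squared errors), the rule of trying a random third of the features at each split, the synthetic data, and all names are assumptions; the defaults of 20 trees and maximum depth 30 mirror the settings reported in the Results.

```python
import random


def mean(values):
    return sum(values) / len(values)


def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets, features):
    """Lowest-SSE (score, feature, threshold) over the given features."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows})[1:]:
            left = [t for r, t in zip(rows, targets) if r[f] < threshold]
            right = [t for r, t in zip(rows, targets) if r[f] >= threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


def build_tree(rows, targets, rng, depth=0, max_depth=30):
    """Grow one regression tree; a leaf stores the mean target."""
    if depth >= max_depth or len(set(targets)) == 1:
        return mean(targets)
    n = len(rows[0])
    # Assumption: try a random third of the features, then fall back to all.
    subset = rng.sample(range(n), max(1, n // 3))
    split = best_split(rows, targets, subset) or best_split(rows, targets, range(n))
    if split is None:
        return mean(targets)
    _, f, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] < threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] >= threshold]
    return (
        f,
        threshold,
        build_tree([r for r, _ in left], [t for _, t in left], rng, depth + 1, max_depth),
        build_tree([r for r, _ in right], [t for _, t in right], rng, depth + 1, max_depth),
    )


def predict_tree(node, row):
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] < threshold else right
    return node


def fit_forest(rows, targets, n_trees=20, max_depth=30, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], rng, 0, max_depth))
    return forest


def predict_forest(forest, row):
    """Average of the individual tree predictions."""
    return mean([predict_tree(tree, row) for tree in forest])


# Synthetic data: speed falls with occupancy; the second feature is noise.
rows = [[float(o), float(o % 3)] for o in range(0, 80, 2)]
targets = [65.0 - 0.6 * r[0] for r in rows]
forest = fit_forest(rows, targets)
print(predict_forest(forest, [4.0, 1.0]), predict_forest(forest, [70.0, 1.0]))
```

Averaging over trees grown on different bootstrap samples is what reduces the variance relative to a single tree.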

Results
After applying the data cleaning process to the different data sources and combining them to create the final dataset, 80% of the data points are randomly selected for model training and the remaining 20% are used as test data to evaluate the models. In addition, the number of neighbors for the KNN model is set experimentally to 4, and the Random Forest model uses 20 DTs with a maximum depth of 30. Finally, KNN and Random Forest models are trained to predict and compare short-term average speed as a representative of travel time, as mentioned in previous sections, for 5-minute prediction horizons up to one hour ahead. That is, a model is trained for each technique to predict travel time at the current moment, 5 minutes ahead, 10 minutes ahead, and so on up to 60 minutes ahead. Figure 3 displays and compares the prediction accuracy (i.e., R-squared score) of the KNN and Random Forest models. The Random Forest model outperforms the KNN model at all 13 prediction horizons; however, the difference between their prediction accuracy is smaller for very short-term prediction horizons. Both models perform extremely well at 0 minutes ahead (i.e., at the prediction moment, which amounts to travel time estimation); however, Random Forest, with a prediction accuracy of 95.3%, slightly outperforms KNN, with a prediction accuracy of 92.9%. Prediction accuracy then decreases as the prediction horizon increases: at 5 minutes and 10 minutes ahead it is 93.9% and 91.6%, respectively, for the Random Forest model, and 91.1% and 88.1%, respectively, for the KNN model. Another point based on Figure 3 is that the prediction accuracy of both models decreases at a faster rate over the first 30 minutes than over the horizon from 30 to 60 minutes ahead. Although both models plateau for prediction horizons above 30 minutes, the Random Forest model is more robust, as it achieves a prediction accuracy of 84% at 60 minutes ahead while that of the KNN model is only 76.8%.

Figure 3 Comparison of KNN and Random Forest models' accuracy in different prediction horizons

Regarding Figure 3, it can be concluded that both models, especially Random Forest, are well-suited to predict travel time in a short-term horizon (i.e., 15 minutes ahead). Therefore, to better observe the performance of these models in short-term prediction, true speed values versus predicted speed values of the KNN and Random Forest models are plotted in Figure 4. This figure confirms that the shorter the prediction horizon, the higher the prediction accuracy for both models. It is worth noting that, regarding training time, the Random Forest model outperforms the KNN model as well.

Figure 4 True speed values versus predicted speed values - (a) KNN, (b) Random Forest

Accordingly, feature importance analysis is applied to the Random Forest model, and the result is shown in Table 2. As expected, the traffic variables are the most important variables affecting the travel time. After traffic variables, temperature is found effective in the model, which shows the importance of the weather condition variable in traffic-related predictions. Contrary to our expectation, accident, special event, and road work incidents have relatively lower impact in the model. However, the low importance of these variables might stem from the small number of these incidents in the data. It is worth noting that since these incidents are correlated with traffic variables, which are important inputs of the model, their effects are considered indirectly through the traffic variables in the model.
The feature importance analysis can only rank the relative importance of variables in the model; therefore, in order to assess the potential impact of the most important variables on the average speed, a sensitivity analysis is also conducted. To this end, the top three important variables in the list, which are occupancy (Occupancy, Occupancy_p, and Occupancy_n), count (Count, Count_p, and Count_n), and AADT, are selected. Then, the final Random Forest model is run several times with each of these variables changed by ±10%, ±20%, ±30%, ±40%, and ±50% while the other variables remain the same. Figure 5 demonstrates the sensitivity of the average speed, over all links and all time intervals, to the important variables. The figure indicates that increasing the occupancy reduces the average speed, and it shows a linear relationship, as expected. For the count variable, a nonlinear relationship is observed. As mentioned, the average speed value here is for all the links and times in different conditions. Therefore, on average, the speed increases as the count is increased from -25% to +50%, and also as the count is decreased from -25% to -50%. The effect of AADT on the average speed is also nonlinear: the average speed increases with AADT, but there is no significant change in average speed when the AADT changes from +10% to +50% or from -50% to -10%.

Figure 5 Sensitivity analysis of important variables in travel time prediction model

[...] detectors, occupancy can capture the traffic condition better than the others. However, fusion of several traffic data from different sources could increase the prediction accuracy of the models.
Weather condition is an important variable in studies related to traffic condition. In this study, we utilized weather condition in our models and found it effective in predicting travel time.
It is suggested that, to better capture the impact of weather condition in the model, data collected from more accurate sensors are required for similar studies.
Finally, accidents, special events, and road works are important variables which can have a considerable impact on travel time prediction. Although in this study these variables have relatively less impact on travel time prediction, for future studies we suggest that a rich amount of incident data along with dynamic models could accurately capture the impact of these variables on short-term travel time prediction.

Figure 1 displays 7 daily traffic patterns, from Monday, September 11, 2017 to Sunday, September 17, 2017. It is observed that weekdays have almost similar traffic patterns, although the traffic pattern is different for Friday, Saturday, and Sunday.

Figure 1 Weekly pattern of average flow from cleaned loop detectors dataset

Figure 2 shows the histogram of the average speed from the probe vehicles dataset.

Figure 2 Average speed histogram from probe vehicles dataset

Table 1 Variables of the final combined dataset
Variable | Description
[...]
Occupancy | Aggregated occupancy at location of loop detector which is closest to midpoint of the link at the 5-minute time interval
Occupancy_p | Aggregated occupancy at location of previous loop detector on the road at the 5-minute time interval
Occupancy_n | Aggregated occupancy at location of next loop detector on the road at the 5-minute time interval
Count | Aggregated count of vehicles on loop detector which is closest to midpoint of the link at the 5-minute time interval
Count_p | Aggregated count of vehicles on previous loop detector on the road at the 5-minute time interval
Count_n | Aggregated count of vehicles on next loop detector on the road at the 5-minute time interval
AADT | Annual average daily traffic of the link
EntRamp | Number of entrance ramps on the link
ExtRamp | Number of exit ramps on the link
Miles | Length of the link in miles
Lanes | Number of lanes on the link
IsAcc_p | 1: if an accident occurred on the link or previous link at the 5-minute time interval, and next one-hour time intervals after accident occurrence, and 0: otherwise
IsAcc_n | 1: if an accident occurred on the link or next link at the 5-minute time interval, and next one-hour time intervals after accident occurrence, and 0: otherwise
[...] | 1: if a roadwork project is executing on the link, previous link, or next link at the 5-minute time interval, and 0: otherwise
IsSE | 1: if a special event is occurring close to the link at the 5-minute time interval, and either next one-hour, or previous one-hour time intervals**, and 0: otherwise
Speed (target) | Average speed of vehicles on a link at the 5-minute time interval
* Each 3-hour time interval of a day is considered as a binary variable. For instance, Time1 means from midnight to 3 am, and Time8 represents the interval of 9 pm to 12 am.
** Considering the next or previous one-hour depends on the direction of extra traffic and the relative locations of event and link. For example, if a link is located in the path toward an event, the extra traffic passes this link before the event, so in this case we consider the previous one-hour as the effective time.
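The evaluation protocol described in the Results (one prediction task per 5-minute horizon from 0 to 60 minutes ahead, scored with R-squared) can be illustrated with a small sketch. To keep it self-contained, the naïve instantaneous predictor from the Introduction stands in for the KNN and Random Forest models, so the printed scores say nothing about the study's results. The speed series is made up, it is assumed to belong to a single link, and the function names are hypothetical.

```python
HORIZON_STEPS = range(13)   # 0, 5, ..., 60 minutes ahead in 5-minute steps


def make_horizon_pairs(speeds, steps_ahead):
    """Pair the speed at interval t with the speed steps_ahead later."""
    return [(speeds[t], speeds[t + steps_ahead])
            for t in range(len(speeds) - steps_ahead)]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up 5-minute speeds (mph) for one link, for illustration only.
speeds = [60, 59, 57, 52, 45, 38, 33, 30, 31, 35, 41, 48, 54, 58, 60, 61,
          60, 58, 55, 50, 44, 39, 36, 37, 42, 49, 55, 59, 61, 60]

for steps in HORIZON_STEPS:
    pairs = make_horizon_pairs(speeds, steps)
    current = [now for now, _ in pairs]       # instantaneous prediction
    future = [later for _, later in pairs]    # value to be matched
    print(f"{5 * steps:2d} min ahead: R2 = {r_squared(future, current):.3f}")
```

Replacing the instantaneous prediction with the output of a trained model for each horizon yields the kind of per-horizon comparison reported in Figure 3.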