Taxi Travel-time Prediction


Taxi dispatch system is a complicated and challenging task given the traffic dynamics of a modern urban locality. A properly managed dispatch system reduces traffic congestion as well as improving local economy. Thus the real-time estimation of travel time for taxis is of great importance for existing electronic dispatch systems.

Current electronic dispatch system captures the geo-locations of different taxis at different point of time, thus generating the availability of taxis at the location where the booking is processed. But this collective information of the geo-locations and the time-stamp can cook much more in fact. It is a valuable source of information in predicting the future trip times, the traffic pattern and the demand of transportation generated at different locations.

Scenario 1:

A user books a Cab near Science City, Kolkata and the system shows him the available cabs nearby, the ones that are free roaming. Out of them, the best option, which is 15 minutes away from the booking location is allotted. But the system ignores the fact that a booked cab at that very moment, passing through Science City is about to terminate its trip within 5 minutes. As a result, this cab is going to be a better option than any other free cab in the vicinity. It is possible only if the system knows the trip time of every ride. It also aids in increasing the fuel efficiency by calculating the optimum path of travel.

Scenario 2:

 After finishing a trip, the cabs roam freely around and wait for the next booking. But the demand is pretty high in Salt Lake during the evening. Because the offices breaks off in the evening, there is a gamut of demand generated at that particular point of time in Salt Lake. A new driver in town is quite unlikely to know all these dynamics of a city, thus might miss some business opportunities due to this. A real time traffic condition and demand should be updated to every cab to optimize the positions of free cabs so as to minimize congestion and facilitate traffic movement.

Keeping these shortcomings of the existing system in mind, our project involves developing a model for predicting the travel time for a particular cab along different routes in a rectangular map of Porto. The output is based on several factors derived from floating car data, like –

CALL TYPE:(online/on road)
ORIGIN CALL: The identity of the caller
ORIGIN STAND: It contains a unique identifier for the taxi stand.
TAXI ID: (Integer): The unique ID of the taxis that are enrolled in the system.
TIMESTAMP: Unix Timestamp (in seconds). It identifies the trip’s start time.
DAYTYPE: It identifies the type of the day in which the trip starts. (Holiday, Day before Holiday, Weekday).
MISSING DATA: It is FALSE when the GPS data stream is complete and TRUE whenever one (or more) locations are missing.
POLYLINE: The array of geo-locations of a particular trip

Data can be collected in two ways –

  1. Sensors placed across the street
  2. Floating Car Data

FCD are the position fixes of the vehicles traversing in the city throughout the day. FCD is pretty effective despite the infrequent data stream because of its accuracy and inexpensive infrastructure cost.


The architecture involves detecting the path of a vehicle during the trip, then matching the positions to corresponding grid indexes, then throw it to form several ensemble methods of classification or regression (according to need). While predicting we only need to set the exploratory variables and the model will give us the shortest path and the time required to cover that.

1Motion Detection
Since the incoming streams of traffic data come from taxi cabs, delivery vehicles, or other commercial fleet vehicles, there will always be a certain amount of ambiguity between a slowdown in traffic and a commercial stop by the vehicle. (i.e. for a taxi customer, or a delivery vehicle dropping off packages). Therefore, any further processing must clean out all unwanted stops that are in the GPS logs.

The vehicle while hired starts transmitting its position at 15 seconds interval, thus due to the slowdown in traffic a clutter of points around the same stop will recur. Thus a cleaned up data is required to describe the trip in terms of evenly spaced geo-points. Moreover, as we require a set of explanatory variables in predictive analytics, we need the trip to be represented in terms of predefined variables. Here, the variables are the edges between the predefined points, known as the Standard Points spread across the city, whose value represents the average velocity at that road segment in one approach and approximate time of travel along that edge.

 # Getting all the location points
all_location = []
for i in range(len(main_data)):
all_location += main_data.iloc[i].POLYLINE
all_location = [list(x) for x in set(tuple(x) for x in all_location)]

Grid Matching
In order to create a set of Standard Points along the city, we use our training data, a set of 1.7 million observations (trip) of different cars running throughout the day during the period of a month and the polyline variable of each. The polyline variable is an array of points with latitude, longitude and timestamp {x1, y1, t1}.The maximum and minimum longitude and latitudes are used to draw the rectangle of interest. Now this rectangle is divided using several grids placed 100 meters apart. The conversion of longitude and latitude to distance notion is done using the Haversine Distance.

2where, 3

d : the distance between the two points (along a great circle of the sphere),
r : the radius of the sphere,
∅_1, ∅_2: latitude of point 1 and latitude of point 2
𝜆_1,𝜆_2  : longitude of point 1 and longitude of point 2
After that, the locations that the car traversed through are imposed on this grid. (Green – Start, Yellow – End)


Further, the points that are inside the same grid are represented by the first point of traversal and this point becomes this grid’s standard point. The other points in the similar grid are omitted and the time in between edges are added to the total count. Thus, the new map becomes –


And at the output we have a set of points with grid index as their location. {x1, y1, t1} => {i1, t1}

# Producing the standard points using square tilling and choosing one of the locations from each tile
lat = [i[0] for i in all_location]
lon = [i[1] for i in all_location]

# we are taking a square of size 100 meter * 100 meter 
int1 = int(distance1([max(lat),0],[min(lat),0])/.1)+1 
int2 = int(distance1([0,max(lon)],[0,min(lon)])/.1)+1 
array1 = np.linspace(min(lat),max(lat),int1)  
array2 = np.linspace(min(lon),max(lon),int2) 

# defining the position of a tile in the map 

def position(i): num1,num2 = 0,0 
for j in range(len(array1)-1): 
if array1[j] <= i[0] and array1[j+1] >= i[0]: 
num1 = j 
for j in range(len(array2)-1): 
if array2[j] <= i[1] and array2[j+1] >= i[1]: 
num2 = j 
return len(array1)*num2 + num1 
standard_points_dict = {i:[0,0] for i in range(0,len(array1)*len(array2))} for i in range(len(main_data)):  
x = list(main_data['POLYLINE'].iloc[i])  
for j in range(len(x)):  
if standard_points_dict[position(x[j])] == [0,0]:  standard_points_dict[position(x[j])] = x[j]


Directed Graph

Directed Graph
Directed Graph (Sample Trip)

From the points imposed on the grid we get the locations that are traversed through. These becomes our edges for the directed graph. The edges can be calculated in two ways-

  1. If we are interested in dealing with continuous variables, Regression is the only way. For that, we take average velocity between two nodes as our edges.
    e1,2 = (Haversine distance) / (t2-t1)
  2. In another approach, we can use Clustering techniques, for the sake of which the time between two consecutive nodes, the discrete variable is taken as the edge.
    e1,2 = t2-t1

Thus at the end of this step, we have a sparse matrix.

Supervised Learning

Now we are left with a set of directed graphs, each for a new trip. The edge weight depends on the time of travel. This data is fed into the predictive model as per our strategy. We used Decision Tree and Random Forest for both categorical and continuous data. Lastly we applied Gradient Boosted Trees and observed the improvement in performance in terms of several measures.

Decision Tree

The response variable Y is categorical, so we can use information theory to measure how much we learn about it from knowing the value of another discrete variable A:


where, I[Y | A = a] = H[Y] − H[Y | A = a]. The definitions of entropy H[Y] and conditional entropy H[Y |A = a].


In case of continuous data, we apply the regression technique, where we look to minimize the least square. Thus the entropy function is the least square distance of all the points from the mean.


# For training of decision tree
weights2 = [] 
# we will store each decision tree an element of a large array
for i in groups_roads: 
# groups road is an array consists of all the historical data along each road segment
 x = [[j[2]] for j in i]
 y = [j[3] for j in i]
 clf = tree.DecisionTreeClassifier()
 clf =,y)
 weights2 += [[i[0][0],clf]]
 weights2 += [[(i[0][0][1],i[0][0][0]),clf]]

Random Forest

Here, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. The final output is the maximum voted result or the average of each tree. While this process reduces the variance due to this averaging but the bias increases because instead of picking the best split, we are taking a random sample of the whole feature-set and taking the best split among them.

clf = ensemble.RandomForestRegressor()

Gradient Tree Boosting

In the boosted method of regression, we try to minimise the residuals, by assuming it to be a function of the explanatory variables.


Thus the goal is to minimise the residuals h(x). This residual is expressed in terms of step-size  and loss function L.


Here we’ve used a least square loss function to create the optimum graph for that particular time.

The decision algorithm is run for the prediction of each observation. As a result of this, we get a real-time update of the traffic condition. The decisive algorithm gives us the graph of the city for a particular time. All that is left is to find the best route given the initial and destination node.

clf = ensemble.GradientBoostingRegressor()

Predictive Modelling

Dijkstra’s Algorithm

In this section we used Dijkstra’s Algorithm of shortest path calculation.


  1. Demand Distribution
    11The Demand distribution of taxi along the daytime is shown here. The highest peak is at the range of 9 to 10 A.M. (Office Hours). A steady distribution thereafter till 6 P.M. This can be utilized to create a fare/Km model during the course of the day and can also be used to optimize the resource utilization of the system.
  2. Trip Time & Trip Distance
    Individual trip time during daytime is shown here. The trip time is shown in seconds, thus need to be divided by 60. We can infer that almost all of the trips in the morning and at night are short trips. During the day time only a few trips are made for long hours.Most of the trip lengths are within 10 km. This is in conformity with the previous  plot of travelling time where most of the travelling time are within half an hour.
  3. Most Traversed Area
    The heat map of the most traversed nodes are shown above. It shows that the red marked zones are the most central zone in the city. The centrality test of the map conforms with the result shown.
  4. Centrality
    Centrality gives us the nodes and edges that are traversed most. According to the sampled data, the Rua da Constituição is the busiest street in Porto and the centrality measures of both nodes and edges comes to the similar conclusion. 16 15



Four kinds of measures are used –


  1. Mean Absolute Percentage Error: The MAPE (Mean Absolute Percent Error) measures the size of the error in percentage terms. It is calculated as the average of the unsigned percentage error.
  2. Mean Absolute Error: The MAE measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables.
  3. Mean Percentage Error: MPE is the computed average of percentage errors by which forecasts of a model differ from actual values of the quantity being forecast.
  4. Hypothesis Testing: The PE>0 is tested against the null that PE = 0 and a the normalised version of it is assumed to follow the t-distribution.

The results of the different models are shown as follows –


A summary of the different model is depicted in the following picture, where we draw a visual comparison of a typical trip.



  1. The Dijkstra’s Algorithm is a appropriate for static map. While the traffic in a city is altering constantly, we should try out a layered approach and the shortest path should incorporate the real time of arrival at a particular node.
  2. Apply an incremental ARIMA model to each segment for a specific time of the day (Data needed)
  3. Use Map Matching Technique to position the standard points on the road accurately.
  4. Real Time Traffic Map (travelling route management, traffic signal optimization etc.)
  5. Optimizing of our approach in order to reduce model error.
  6. Implement it locally.
  7. Consolidate all these Features in an Android Application