1 Introduction

With the rapid development of the ride-hailing business, various online ride-hailing platforms have emerged, such as DiDi and Uber. In fact, the annual volume of passengers transported by DiDi has exceeded 10 billion. The market value of the global online ride-hailing business is expected to grow to $285 billion by 2030. In such a business, the ride-hailing platform needs to match riding orders with vehicles and dispatch idle vehicles efficiently in order to improve its profit.

Specifically, the ride-hailing platform needs to match vehicles with orders periodically (i.e., over multiple rounds), while dispatching idle vehicles to potentially high-demand zones so that they do not have to explore for potential riding orders at random. The ride-hailing platform should maximize the long-term social welfare (i.e., the sum of the social welfare of all rounds), instead of the social welfare of a single round. However, the current matching and dispatching decisions affect the future vehicle distribution, and thus the future matching and dispatching, which in turn affects the overall social welfare. Therefore, when designing the order matching and idle vehicle dispatching algorithm to maximize the long-term social welfare, we need to consider the impacts of the current round's decisions on future rounds. In more detail, historical information about matchings and the corresponding social welfare implicitly indicates how much value a vehicle can provide in each spatio-temporal state, which offers insights for designing the matching and dispatching algorithm. Therefore, we design a vehicle value function to characterize the future value a vehicle can provide in a given spatio-temporal state, and then use this value function to design the order matching and idle vehicle dispatching algorithm. In so doing, our algorithm takes into account the impacts of current decisions on future rounds, and thus can improve the long-term social welfare.

In more detail, this paper advances the state of the art in the following ways. Firstly, we design a vehicle value function to characterize the future value a vehicle can provide. We then treat the dispatching of an idle vehicle to a zone as a virtual order, which combines the order matching problem and the idle vehicle dispatching problem into a single order matching problem. We then convert this order matching problem into a bipartite graph maximum weight matching problem with the vehicle value differences as the edge weights. In so doing, we can complete the matching and dispatching quickly, and can avoid vehicles concentrating in a few zones. We run experiments to evaluate the proposed algorithm. The experimental results show that the proposed algorithm outperforms benchmark approaches in terms of the long-term social welfare. It also improves the service ratio and achieves an effective utilization of idle vehicles.

The rest of this paper is organized as follows. We introduce the related work in Sect. 2. We then describe basic settings in Sect. 3, and introduce the proposed algorithms in Sect. 4. We provide experimental analysis in Sect. 5 and conclude the paper in Sect. 6.

2 Related Work

There exist many works on ride-hailing, especially on the order matching and idle vehicle dispatching problems [9, 14]. For the order matching problem, the ride-hailing platform usually matches vehicles with orders to maximize the profit, maximize the order service ratio or minimize the travel distance. For maximizing the profit, Cheng et al. [2] proposed a queueing theory-based order matching framework that combines demand forecasts with predicted idle time slots to maximize the expected profit of the platform in each round. For maximizing the order service ratio, Garaix et al. [3] proposed an iterative algorithm to solve the order matching problem. For minimizing the vehicle travel distance, Cao et al. [1] proposed a large-scale many-to-many matching algorithm based on spatial pruning techniques in the shared mobility environment to minimize the detour distance of vehicles.

There also exist some works on dispatching idle vehicles. Holler et al. [5] proposed a deep reinforcement learning based approach for solving the order matching and idle vehicle dispatching problems to maximize the profits of all vehicles. Haliem et al. [4] proposed a route planning framework based on demand forecasting and reinforcement learning to dynamically generate optimal routes. Liang et al. [6] integrated both real-time order matching and idle vehicle dispatching within a Markov decision process framework to increase drivers' profits while reducing the waiting time of passengers. Shou et al. [10] used a Markov decision process to model an idle vehicle searching for passengers and used inverse reinforcement learning to learn the reward function of the model.

However, to the best of our knowledge, existing works usually do not consider the impacts of the current order matching and idle vehicle dispatching decisions on future rounds, and do not maximize the long-term social welfare. Furthermore, existing works usually treat the order matching and dispatching problems separately. In this paper, we address the above issues by taking the spatio-temporal value of vehicles into account, and by considering order matching and idle vehicle dispatching as a whole to maximize the long-term social welfare.

3 Basic Settings

In this paper, we assume that all vehicles are managed by the online ride-hailing platform. In the ride-hailing system, passengers first submit riding orders to the platform. The platform then matches orders with available vehicles and provides dispatching suggestions for idle vehicles.

Furthermore, we divide the entire time horizon into T time steps (i.e., rounds) \(\mathcal {T}=\{1,2,\cdots ,T\}\). The geographical zone where passengers and vehicles are located is modeled as a road network, which is defined as follows:

Definition 1

Road Network. The road network is defined as a weighted graph \(G=\left( L,E \right) \), where L is the set of nodes and E is the set of edges. We use \(dis\left( {{l}_{i}},{{l}_{j}} \right) \) to denote the length of the shortest path from node \({{l}_{i}}\) to node \({{l}_{j}}\); for an edge \(\left\langle {{l}_{i}},{{l}_{j}} \right\rangle \), \(dis\left( {{l}_{i}},{{l}_{j}} \right) \) is also its weight.
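To make the notation concrete, the following minimal sketch computes \(dis(l_i,l_j)\) with Dijkstra's algorithm; the adjacency-list representation and the node names are our own illustration, not part of the paper.

```python
import heapq

def dis(graph, src, dst):
    """Shortest-path distance dis(l_i, l_j) on the weighted road network G=(L, E).

    `graph` maps each node to a list of (neighbor, edge_weight) pairs; this
    toy representation is an assumption for illustration.
    """
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # dst unreachable from src

# Toy example: three nodes on a line.
road = {"l1": [("l2", 1.0)], "l2": [("l1", 1.0), ("l3", 2.0)], "l3": [("l2", 2.0)]}
assert dis(road, "l1", "l3") == 3.0
```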

The riding order is defined as follows:

Definition 2

Order. An order \(o\in \mathcal {O}\) is defined as a tuple \((l_{o}^{p},l_{o}^{d},t_{o}^{r},t_{o}^{w},va{{l}_{o}})\), where \(l_{o}^{p}\) and \(l_{o}^{d}\) are the pick-up and drop-off locations of order o respectively, \(t_{o}^{r}\) is the time when order o is raised, \(t_o^w\) is the maximum time that the passenger of order o is willing to wait for the riding service, and \(va{{l}_{o}}\) is the highest price the passenger is willing to pay for the service, which can be regarded as the true value of this order for the passenger.

Note that in realistic scenarios, passengers do not need to express the above value information when submitting orders. However, such information is needed to maximize the social welfare. Therefore, similar to existing work [15, 16], we assume that the passenger is required to submit this value for the riding service. Note that passengers may not reveal this information truthfully in order to obtain more profits; how to prevent passengers from misreporting their values is beyond the scope of this paper. In addition, we assume that when order o is not matched within \(t_o^w\) time, the passenger is not willing to wait any longer and the order is cancelled.

Definition 3

Vehicle. A vehicle \(v\in \mathcal {V}\) is defined as a tuple \(({l}_{v},{c}_{v})\), where \({{l}_{v}}\) is the current position and \({{c}_{v}}\) is the unit travel cost.
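For illustration, the order and vehicle tuples can be represented as follows (a minimal sketch; the field names are our own, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Order:
    pickup: str      # l_o^p: pick-up location (node id)
    dropoff: str     # l_o^d: drop-off location
    t_raise: int     # t_o^r: time step at which the order is raised
    t_wait: float    # t_o^w: maximum waiting time
    val: float       # val_o: highest price the passenger is willing to pay

@dataclass
class Vehicle:
    loc: str         # l_v: current position
    cost: float      # c_v: unit travel cost
```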

Note that different types of vehicles may have different unit travel costs. When the platform matches orders with vehicles, it can only match an order with a feasible vehicle, which is defined as follows.

Definition 4

Feasible Vehicle. For an order o, a feasible vehicle v must be able to reach the passenger within the maximum waiting time \(t_{o}^{w}\), that is, \(dis({{l}_{v}},l_{o}^{p})\le {{V}_{avg}}\cdot t_{o}^{w}\), where \(dis\left( {{l}_{v}},l_{o}^{p} \right) \) is the distance from the current position \({{l}_{v}}\) of the vehicle to the pick-up position \(l_{o}^{p}\) of order o, and \({{V}_{avg}}\) is the average speed of the vehicle.
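Reusing the Order and Vehicle sketches above, the feasibility test of Definition 4 is a one-line check (v_avg and the dis function are passed in; their concrete forms are assumptions about the surrounding code):

```python
def is_feasible(vehicle, order, v_avg, dis):
    """Definition 4: the vehicle can reach the pick-up point within t_o^w."""
    return dis(vehicle.loc, order.pickup) <= v_avg * order.t_wait
```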

We now introduce the social welfare of the ride-hailing system, which consists of the profits of the passengers and the platform. The passenger's profit \({{u}_{o}}\) is the passenger's true value for order o minus the payment for the riding service:

$$\begin{aligned} u_o = {\left\{ \begin{array}{ll} va{l_o} - {p_o}, &{} o \in \mathcal {O}^{w}\\ 0, &{} o \notin \mathcal {O}^{w} \end{array}\right. } \end{aligned}$$
(1)

where \({{p}_{o}}\) is the price paid to the platform and \({{\mathcal {O}}^{w}}\) is the set of matched orders. When order o is not matched, the passenger's profit is \({{u}_{o}}=0\). The platform's profit is the sum of the passengers' payments for the matched orders minus the costs of the corresponding vehicles to complete these orders over all time steps: \({U_p} = \mathop \sum \nolimits _{t = 1}^T \mathop \sum \nolimits _{o \in {\mathcal O}_t^w} ({p_o} - C_{\varTheta _{t}(o)}^o)\), where \(\mathcal {O}_{t}^{w}\) is the set of matched orders at time step t, \(\varTheta _t\) denotes the matching result at time step t, and \(\varTheta _t(o)=v\) means that order o is matched with vehicle v. \(C_{\varTheta _t(o)}^{o}\) is the cost for vehicle \(\varTheta _t(o)\) to complete order o, which is: \(C_{\varTheta _t(o)}^{o}=\left( dis\left( l_{\varTheta _t(o)}, l_{o}^{p}\right) +dis\left( l_{o}^{p}, l_{o}^{d}\right) \right) \cdot c_{\varTheta _t(o)}\).

We now give the definition of the social welfare. Note that in this paper we consider the long-term social welfare, which is the sum of the profits of all participants (i.e., the platform and the passengers) over all time steps:

$$\begin{aligned} \begin{aligned} SW=&\mathop \sum \nolimits _{t=1}^{T}\left( \mathop \sum \nolimits _{o \in \mathcal {O}_{t}^{w}}\left( val_{o}-p_{o}\right) +\mathop \sum \nolimits _{o \in \mathcal {O}_{t}^{w}}\left( p_{o}-C_{\varTheta _{t}(o)}^{o}\right) \right) \\ =&\mathop \sum \nolimits _{t=1}^{T} \mathop \sum \nolimits _{o \in \mathcal {O}_{t}^{w}}\left( val_{o}-C_{\varTheta _{t}(o)}^{o}\right) \end{aligned} \end{aligned}$$
(2)
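A direct transcription of the cost \(C_{\varTheta _t(o)}^{o}\) and of Eq. (2) could look as follows (a sketch under assumed data layouts; since the payments cancel, the prices \(p_o\) do not appear):

```python
def order_cost(vehicle, order, dis):
    # C^o = (dis(l_v, l_o^p) + dis(l_o^p, l_o^d)) * c_v
    return (dis(vehicle.loc, order.pickup)
            + dis(order.pickup, order.dropoff)) * vehicle.cost

def social_welfare(matchings_per_step, dis):
    # Eq. (2): sum over time steps and matched orders of (val_o - C^o).
    # `matchings_per_step` is a list over time steps, each a list of
    # (order, vehicle) pairs matched at that step (an assumed layout).
    return sum(order.val - order_cost(vehicle, order, dis)
               for step in matchings_per_step
               for order, vehicle in step)
```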

4 The Algorithm

In this paper, we intend to maximize the long-term social welfare over all time steps. Therefore, we need to consider the impacts of current decisions on future matchings. We design a vehicle state value function, which captures the ability of a vehicle to generate social welfare in different spatio-temporal states. Based on this value function, we then design the order matching and idle vehicle dispatching algorithm.

4.1 Vehicle Value Function

The vehicle value function captures the potential social welfare a vehicle can generate in the future given its current spatio-temporal state:

Definition 5

Vehicle Value Function. The vehicle value function is \({V}\left( t,g,c \right) \), where \(t\in \mathcal {T}\) is the time step, \(g\in G\) is the index of the zone in which the vehicle is located, and c is the vehicle's unit travel cost.

At each time step, the platform collects order information and makes decisions based on the current vehicle states: whether a vehicle is matched with an order, stays stationary, or is idle and dispatched to some zone. The platform then computes the social welfare of the current time step and enters the next step. From such a multi-round matching process, we can capture how the current vehicle state affects the future social welfare, i.e., the vehicle value function. This process is a sequential decision process, and thus we can model it as a Markov Decision Process (MDP) [11] and compute the state value function by value iteration.

In the following, we describe the MDP \(M=\langle S,A,P,r,\gamma \rangle \).

State: The state of each vehicle is defined as a tuple \(s=\left( t,g,c \right) \in S\), which corresponds to the arguments of the vehicle value function.

Action: The action is \(a\in A=\{{{a}_{1}},{{a}_{2}},{{a}_{3}}\}\), where \({{a}_{1}}\) means that the platform matches an order with a vehicle, \({{a}_{2}}\) means that the vehicle is stationary, and \({{a}_{3}}\) means that the platform dispatches an idle vehicle to an adjacent zone.

Reward: The reward r is the profit of the passengers and the platform when the action is taken. Note that the reward is 0 when the vehicle stays stationary, and negative (due to the vehicle's travel cost) when the vehicle is dispatched. When an order is served, the reward is calculated as \(r = va{{l}_{o}} - C_{\varTheta (o)}^{o}\), where \(va{{l}_{o}}\) is the passenger's value for order o and \(C_{\varTheta (o)}^{o}\) is the cost required for vehicle \(\varTheta (o)\) to complete order o. For an order that lasts T time steps, the cumulative reward \({{R}_{\gamma }}\) is \({R_\gamma } = \mathop \sum \limits _{t = 0}^{T - 1} {\gamma ^t}\frac{r}{T}\), where \(\gamma \) is a discount factor that decreases the impact of future rewards, and is set to 0.9.
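As a sanity check, the cumulative reward \(R_\gamma \) can be computed as follows (a minimal sketch of the formula above):

```python
def cumulative_reward(r, T, gamma=0.9):
    """R_gamma = sum_{t=0}^{T-1} gamma^t * (r / T) for a trip lasting T steps."""
    return sum(gamma ** t * (r / T) for t in range(T))

# Example: a reward r = 12.0 spread over a 3-step trip:
# 4.0 * (1 + 0.9 + 0.81) = 10.84
assert abs(cumulative_reward(12.0, 3) - 10.84) < 1e-9
```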

In this paper, we solve the MDP by value iteration. The platform collects historical matching data to construct a set of historical state transition tuples \(D=\{({{s}_{i}},{{a}_{i}},{{r}_{i}},{{s}_{i}}^{\prime })\}\), where each tuple means that the agent takes action \({{a}_{i}}\) in state \({{s}_{i}}\), obtains an instant reward \({{r}_{i}}\) and transfers to the next state \({{s}_{i}}^{\prime }\). Since different types of vehicles may have different unit travel costs, we use the cost c to represent the type of the vehicle, and therefore the state transition data of each vehicle type constitutes its own value function data set. Following existing work [13], we assume that the online policy generating the state transition data remains constant during the phase of learning the value function, and we thus omit the policy parameter \(\pi \) in the following. A state transition involves one of three actions: matching an order, staying stationary, or dispatching an idle vehicle.

When the action is to match an order, the vehicle receives an immediate reward \({{R}_{\gamma }}\) and makes a state transition. The temporal difference (TD) update rule is:

$$\begin{aligned} V\left( s \right) = V\left( s \right) + \alpha \left[ {{R_\gamma } + \gamma V\left( {s'} \right) - V\left( s \right) } \right] \end{aligned}$$
(3)

where \(s=\left( t_0,g,c \right) \) is the state of the vehicle at the current time step and \({s}'=\left( t_3,g_{ld},c \right) \) is the state of the vehicle after completing the matched order, in which \(t_3\) is the time step when the passenger reaches the destination and \(g_{ld}\) is the index of the zone of the order's destination.

When the action is staying stationary, the immediate reward of the agent is 0. The TD update rule is:

$$\begin{aligned} V\left( s \right) = V\left( s \right) + \alpha \left[ {0 + \gamma V\left( {s''} \right) - V\left( s \right) } \right] \end{aligned}$$
(4)

Since the vehicle stays stationary, its position does not change, i.e., \(s'' = \left( {t_1,g,c} \right) \), where \(t_1\) is the next time step.

When the action is to dispatch an idle vehicle, we construct a virtual order whose value is 0, whose origin is g, and whose destination is one of the neighboring zones of g. The TD update rule is:

$$\begin{aligned} V\left( s \right) = V\left( s \right) + \alpha \left[ {{R_\gamma }^\prime + \gamma V\left( {s'''} \right) - V\left( s \right) } \right] \end{aligned}$$
(5)

where \({s}'''=\left( t_2,{g}''',c \right) \) is the state of the vehicle after the dispatching is completed, in which \(t_2\) is the time step when the idle vehicle arrives at the dispatched destination and \({g}'''\in {{g}_{near}}\) is a neighboring zone of g.

Algorithm 1. Value iteration for computing the vehicle value function.

Next, we describe how to compute the vehicle value function V. The platform first collects historical state transition data, and then uses a dynamic programming based value iteration algorithm to recursively compute the value \(V\left( {{s}_{i}} \right) \) of each state backward in time, obtaining the vehicle value function \(V\left( s \right) \). The details are shown in Algorithm 1.
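Since the pseudocode figure is not reproduced here, the following Python sketch shows the structure described above: tabular TD updates (Eqs. (3)-(5)) swept backward in time over the historical transition set D. The function and variable names are our own assumptions.

```python
from collections import defaultdict

def learn_value_function(transitions, gamma=0.9, alpha=0.1, sweeps=10):
    """A minimal sketch of Algorithm 1 (assumed structure): tabular TD updates
    over historical transitions D = {(s_i, a_i, r_i, s_i')}, processed backward
    in time so that later states are refined before the earlier states that
    bootstrap from them.

    States are (t, g, c) tuples; r_i is the cumulative reward R_gamma of the
    action (0 for staying stationary, negative for dispatching).
    """
    V = defaultdict(float)
    # Sort by the time component of the state, latest first (backward pass).
    ordered = sorted(transitions, key=lambda tr: tr[0][0], reverse=True)
    for _ in range(sweeps):
        for s, _action, r, s_next in ordered:
            td_target = r + gamma * V[s_next]          # Eqs. (3)-(5)
            V[s] += alpha * (td_target - V[s])
    return V

# Toy usage: one matched order from zone 3 at t=5, arriving in zone 7 at t=8.
D = [((5, 3, 1.0), "match", 10.84, (8, 7, 1.0))]
V = learn_value_function(D)
```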

4.2 Order Matching and Idle Vehicle Dispatching Algorithm Based on Value Function

After obtaining the vehicle value function, we now describe how to use it to design the order matching and idle vehicle dispatching algorithm. The order matching and idle vehicle dispatching problem of maximizing the social welfare is in fact a bipartite graph maximum weight matching problem. At each time step t, the set of orders \({{\mathcal {O}}_{t}}\) and the set of vehicles \({{\mathcal {V}}_{t}}\) are the two disjoint vertex sets of the bipartite graph. The weight of edge \(\left\langle o,v \right\rangle \) is the difference \(\varDelta V\) in the value of vehicle v after completing order o, which is \(\varDelta V=\gamma ^{\varDelta t_{o,v}} V\left( s^{\prime }\right) -V(s)+R_{\gamma }\), where s is the state in which vehicle v is matched with order o, \({s}'\) is the state in which vehicle v delivers the passenger of order o to the destination, \(\varDelta {{t}_{o,v}}\) is the time required for vehicle v to complete this trip, and \({{R}_{\gamma }}\) is the cumulative reward. The details of the value function-based order matching and idle vehicle dispatching algorithm, named VFOMIVD for short, are shown in Algorithm 2.
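For instance, the edge weight can be computed as below (a small sketch; states missing from the learned table default to a value of 0):

```python
def edge_weight(V, s, s_next, dt, r_gamma, gamma=0.9):
    # Delta V = gamma^{dt} * V(s') - V(s) + R_gamma
    return gamma ** dt * V.get(s_next, 0.0) - V.get(s, 0.0) + r_gamma
```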

In Algorithm 2, line 1 initializes the set of matched orders \(\mathcal {O}^w\), the matching result \(\varTheta \) and the social welfare SW. At each time step of order matching and idle vehicle dispatching, lines 3 and 4 initialize the order set \({{\mathcal {O}}_{t}}\) and the vehicle set \({{\mathcal {V}}_{t}}\). The order set \({{\mathcal {O}}_{t}}\) contains the orders submitted by passengers at the current time step and the orders that have not been matched in previous steps but are still within their maximum waiting times. The vehicle set \({{\mathcal {V}}_{t}}\) contains the vehicles that are not serving orders at the current step, i.e., idle vehicles. Lines 5 to 17 construct the bipartite graph. For each matching pair \(\left\langle o,v \right\rangle \), the platform calculates the difference \(\varDelta V\) and uses it as the weight of the corresponding edge of the bipartite graph.

Note that only matching pairs \(\left\langle o,v \right\rangle \) with \(\varDelta V>0\) are inserted into the bipartite graph, and if vehicle v cannot serve the passenger of order o within the maximum waiting time, the corresponding \(\varDelta V=0\). For each idle vehicle v, the platform creates several virtual orders \({o}'\) from the current location of vehicle v to its neighboring zones \(g\in {{g}_{near}}\). In so doing, the algorithm combines the order matching and idle vehicle dispatching into a single matching problem. For each virtual order, the platform calculates the corresponding state value difference \(\varDelta {V}'\) and inserts it into the bipartite graph. In line 18, the platform solves the bipartite graph matching using the Kuhn-Munkres algorithm [8]. Line 19 calculates the social welfare of the current time step, and lines 20 to 22 record the results of the current time step. Finally, the platform updates the trips of the vehicles that have been matched with orders and removes the set of matched orders \(\mathcal {O}_{t}^{w}\) from the order set \(\mathcal {O}\).

Algorithm 2. VFOMIVD: the value function-based order matching and idle vehicle dispatching algorithm.
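A minimal sketch of one matching round is given below. We assume the virtual orders have already been appended to the order list, and we use SciPy's linear_sum_assignment, which solves the same assignment problem that the Kuhn-Munkres algorithm targets; the function names and the weight callback are our own assumptions, not the paper's notation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_round(orders, vehicles, weight):
    """One round of bipartite maximum weight matching (Algorithm 2 sketch).

    `weight(o, v)` should return Delta V for a feasible pair and 0 otherwise.
    Virtual dispatching orders are assumed to already be in `orders`.
    """
    W = np.zeros((len(orders), len(vehicles)))
    for i, o in enumerate(orders):
        for j, v in enumerate(vehicles):
            W[i, j] = max(weight(o, v), 0.0)  # only Delta V > 0 edges count
    rows, cols = linear_sum_assignment(W, maximize=True)
    # Drop zero-weight assignments: those pairs are not actually matched.
    return [(orders[i], vehicles[j]) for i, j in zip(rows, cols) if W[i, j] > 0]
```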

5 Experimental Analysis

In this section, we run experiments to evaluate the proposed algorithm based on real taxi order data from New York City, which has been used by a large number of related works [9, 13, 15].

(1) Order Data. We collect taxi order data on Manhattan Island in June 2019 from the New York City Taxi and Limousine Commission (TLC). Each taxi order contains the departure and destination locations, the order starting time, the trip fare and the trip mileage.

(2) Map Data. We use the Manhattan taxi zone map provided by TLC as the map data, and we number each zone.

(3) Fuel Consumption Data. Since we cannot find travel cost data for New York taxis, we collect the urban vehicle fuel consumption data of types M1 and M2 from the China Automobile Fuel Consumption Query System of the Ministry of Industry and Information Technology of China to compute the vehicle costs below.

(4) The Shortest Path Cache. To ensure that the distance between any two nodes can be quickly queried during experiments, we pre-build caches of the shortest path matrix and the shortest path distance matrix.

(5) Order Data Processing. We remove noisy orders (i.e., orders with invalid fares, zero trip mileage and so on) and eliminate the order data in isolated zones. We count the number of orders per hour of each day and find that the numbers of orders on weekdays and weekends vary greatly over time. For consistency, we use the order data of weekdays (20 days in total) in the evaluation.

5.1 Experimental Settings

The number of orders in each hour of a weekday also differs significantly, and the performance of the algorithm in the peak period is more important. Furthermore, the large number of orders in the peak period is also helpful for learning the vehicle value function. Therefore, we choose the peak period (19:00 to 21:00) of weekdays for the evaluation. We compute the average number of orders between 19:00 and 21:00 over these 20 days, and randomly draw this average number of orders as the order input of one day. For the other parameters, we set the length of each time step to 60 s. The maximum waiting time of each passenger is chosen randomly from {3 min, 4 min, 5 min, 6 min, 7 min, 8 min}. The average vehicle travel speed \({{V}_{avg}}\) is set to 7.2 mph. For each vehicle, the unit travel cost is randomly selected from \(\{6,8,10\}\times 2.5/6.8/1.6\) \$/km. The initial location of each vehicle is randomly selected on the Manhattan taxi zone map. In the experiments, we vary the number of vehicles from 1500 to 3000. We repeat each experiment 10 times and report the average result.
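For concreteness, the parameter sampling described above can be sketched as follows (the function names and the fixed seed are our own assumptions):

```python
import random

rng = random.Random(0)          # fixed seed, for reproducibility only

STEP_LEN_S = 60                 # length of one time step (round), in seconds
V_AVG_MPH = 7.2                 # average vehicle travel speed

def sample_waiting_time_min():
    """Per-passenger maximum waiting time, drawn from {3, ..., 8} minutes."""
    return rng.choice([3, 4, 5, 6, 7, 8])

def sample_vehicle(zones):
    """Random start zone and a unit cost from {6, 8, 10} x 2.5/6.8/1.6 $/km."""
    unit_cost = rng.choice([6, 8, 10]) * 2.5 / 6.8 / 1.6
    return rng.choice(zones), unit_cost
```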

5.2 Evaluation of Order Matching and Idle Vehicle Dispatching Algorithm

In this section, we evaluate the proposed VFOMIVD algorithm against some benchmark algorithms.

Benchmark Algorithms and Metrics

(1) mdp [13]. The mdp algorithm also utilizes a vehicle value function to guide the order matching, but its vehicle state does not consider the variability of vehicles in terms of cost, and it does not consider the dispatching of idle vehicles.

(2) mT-Share [7]. The mT-Share algorithm intends to minimize the vehicle travel cost. Its order matching uses a greedy algorithm to match orders with the vehicles incurring the least extra cost.

(3) Nearest-Matching. The Nearest-Matching algorithm is widely used in industry (e.g., Uber), where the platform matches orders with the nearest vehicles.

(4) Greedy&GPri [16]. The Greedy&GPri algorithm iteratively and greedily matches orders with the vehicles yielding the highest social welfare, and adopts a critical value-based pricing algorithm. We choose this greedy method because existing research has shown that greedy methods can perform well in crowdsourcing tasks [12].

In terms of evaluation metrics, in addition to the social welfare, we also investigate the service ratio, which is the ratio of the number of matched orders to the total number of orders submitted by passengers.

Analysis of Experimental Results

The experimental results are shown in Fig. 1(a). We find that as the number of vehicles increases, the social welfare increases since more orders are served. The VFOMIVD algorithm achieves the highest social welfare, while the greedy-based method Greedy&GPri also performs well. We then look into the service ratio in Fig. 1(b) and find that the VFOMIVD algorithm achieves the highest service ratio. Although the service ratios of algorithms such as mdp approach that of the VFOMIVD algorithm as the number of vehicles increases, the social welfare of the VFOMIVD algorithm remains the largest. This may imply that, in our algorithm, vehicles are more likely to converge to zones where more social welfare is generated.

Fig. 1. Experiments of the order matching and idle vehicle dispatching algorithm.

5.3 Evaluation of Idle Vehicle Dispatching

In this section, we further analyze the effectiveness of the idle vehicle dispatching algorithm. To evaluate its performance, we combine different benchmark dispatching algorithms with the order matching module of VFOMIVD to generate the benchmark algorithms.

Benchmark Algorithms and Metrics

(1) VFOM. We remove the idle vehicle dispatching module of the VFOMIVD algorithm (lines 12 to 19 in Algorithm 2) and keep the order matching module.

(2) VFOM-RD. This algorithm adds random dispatching to the VFOM algorithm, which randomly dispatches idle vehicles to their neighboring zones. Random dispatching has been used in related works [16].

(3) VFOM-ND. This algorithm adds nearest dispatching to the VFOM algorithm, which is a common strategy used by companies (e.g., Uber) to dispatch idle vehicles to the nearest neighboring zone [3].

In addition to the social welfare and the service ratio, we consider one more metric for evaluating the idle vehicle dispatching algorithm: the platform operating cost, which consists of the costs of serving all matched orders and of idle vehicles travelling to their dispatched zones over all time steps.

Analysis of Experimental Results

The experimental results are shown in Fig. 2(a). We find that the VFOMIVD algorithm achieves the largest social welfare, and that the social welfare obtained by all algorithms increases as the number of vehicles increases. We also find that the social welfare of the VFOM algorithm (which uses no dispatching) and that of VFOM-RD are similar. This may imply that random dispatching is not beneficial for the utilization of idle vehicles. From Fig. 2(b), we find that the VFOMIVD algorithm achieves the highest service ratio. This means that with the proposed dispatching algorithm, the platform can serve more orders, and thus achieves the largest social welfare. From Fig. 2(c), we find that the platform operating cost of the VFOMIVD algorithm is higher than that of the VFOM algorithm. However, from Figs. 2(a) and 2(b) we see that the social welfare and service ratio of the VFOMIVD algorithm are also higher. This may imply that the increased operating cost of our algorithm comes from dispatching idle vehicles to zones with more riding demand, which allows the platform to serve more orders and generate more social welfare.

Fig. 2. Experiments of idle vehicle dispatching.

In summary, because the VFOMIVD algorithm takes the spatio-temporal value of vehicles into account and dispatches vehicles to zones where more vehicles are needed, it can utilize idle vehicles to serve more riding orders, and thus increases the social welfare.

6 Conclusion

In this paper, we proposed an order matching and idle vehicle dispatching algorithm to maximize the long-term social welfare in the ride-hailing system. Considering the impacts of current order matching and idle vehicle dispatching decisions on future rounds, we designed a vehicle value function that characterizes the ability of a vehicle to generate social welfare in future spatio-temporal states. Based on this value function, we designed the order matching and idle vehicle dispatching algorithm. Finally, we ran extensive experiments to evaluate the proposed algorithm. The experimental results show that our algorithm can help online ride-hailing platforms dispatch idle vehicles efficiently to improve their utilization, and thus increase the service ratio and the social welfare.