Introduction

In recent years, smartphone-operated, non-station-based bike fleets (i.e., free-floating bike sharing, hereafter referred to as FFBS) have witnessed exponential growth worldwide (Cheng et al. 2022b; Hirsch et al. 2019). For instance, in less than two years since its launch in North America, FFBS has rapidly expanded to over 200 systems operating in more than 150 cities (Hirsch et al. 2019). Meanwhile, the FFBS system has ushered in a golden age of expansion in China, with its implementation in over 200 cities and a total of 23 million bikes in just a few years (Gu et al. 2019). By equipping shared bikes with a Global Positioning System (GPS) device, the FFBS system allows users to rent a nearby bike and return it in any suitable place (e.g., on-street corrals and sidewalk racks) via mobile applications (Zhao and Ong 2021). It greatly improves the flexibility and accessibility of journeys by offering “door-to-door” services for local residents (Cheng et al. 2022b).

As FFBS schemes continue to grow in popularity around the world, large FFBS journey data with individual mobile locations and trajectories are readily available (Chen et al. 2022b). In many studies based on journey data, researchers have found that point clusters with high FFBS usage were more concentrated near metro stations, residential neighborhoods, and office buildings (Chen and Ye 2021; Guo and He 2020; Du et al. 2019b). The findings provide valuable insights for understanding spatio-temporal travel characteristics, predicting regional demand, and designing scheduling strategies. However, most of them only considered the origin (O) or destination (D) points of FFBS trips, while the studies from the origin–destination (OD) flow perspective are limited (Chen et al. 2022b; Zhang et al. 2021). Although the discovery of OD flow clusters consisting of many similar trips is essential for unveiling daily human mobility and home-work commuting patterns (Guo et al. 2020; Liu et al. 2022a; Wood et al. 2010), the focus of existing FFBS studies has not yet been extended from O or D points to OD flows.

In large flow data, a substantial number of trips overlap and intersect one another, and the resulting occlusion and display clutter make it challenging to discover flow clusters (Zhu and Guo 2014). Some studies aggregate trips using predefined spatial areal units (e.g., regular grids and traffic analysis zones) (Chen et al. 2022b; Wood et al. 2010; Zheng et al. 2021). This aggregation approach is effective in reducing flow clutter, but it ignores the flow patterns at local scales (Zhu and Guo 2014; Zhu et al. 2019). In recent years, several flow clustering methods have been developed in an attempt to extract flow clusters from large flow data (Guo et al. 2020; Gao et al. 2018; Tao and Thill 2016). These clustering methods mitigate the cluttering and overlapping issues by extracting clusters of similar trips, while maximizing the spatial resolution of the data (Song et al. 2019; Zhu and Guo 2014). Nevertheless, detecting flow clusters with irregular shapes and uneven densities from large flow data is still a major challenge (Liu et al. 2022a), as reviewed in the “Flow clustering methods” section.

In reality, due to the nature of the short-distance trips, shared bikes are likely to stay near their initial assigned location (Zhang and Meng 2019). Inspired by this, we propose a two-stage flow clustering method by integrating the Leiden community detection and the shared nearest-neighbor-based flow (SNN_flow for short) clustering methods. More concretely, in Stage I, the Leiden algorithm is leveraged to partition the entire study area into multiple FFBS activity zones with strong intra-connections, thus decomposing a large flow clustering problem into multiple small sub-problems. In Stage II, the FFBS flow clusters with varying shapes and densities in each activity zone are identified separately using the SNN_flow method, and then the extraction results of all activity zones are merged.

Taking the FFBS system in Nanjing, China as a case study, an empirical investigation is performed on the applicability and performance of the two-stage flow clustering method in identifying flow clusters. This study tackles the following two research questions: (i) What are the typical characteristics of the spatio-temporal patterns of FFBS flow clusters? (ii) What are the similarities and diversities in the shape and density distribution of FFBS flow clusters? This study contributes to the existing literature in two ways. First, it proposes a two-stage flow clustering method that can be leveraged to efficiently detect FFBS flow clusters with arbitrary shapes and inhomogeneous densities from large-scale journey data. Second, it unveils the spatio-temporal patterns and endpoint distribution characteristics of FFBS flow clusters, which could help transportation planners and decision-makers to better understand the heterogeneity of flow clusters and thus take rational measures to make the resource allocation of the FFBS system as balanced as possible.

The remainder of the paper is structured as follows. The “Literature review” section provides an overview of FFBS OD flows and flow clustering methods. The “Two-stage flow clustering method” section introduces the two-stage flow clustering method in detail. The “Study area and data description” section describes the study area and the data used. The “Results and discussions” section presents the research findings. Finally, our main conclusions are summarized and policy implications are drawn in the “Conclusions and policy implications” section.

Literature review

FFBS OD flows

Numerous existing studies on FFBS data analysis have focused on revealing the mechanisms influencing O or D point usage patterns. These studies have investigated different aspects of this issue, including socio-demographics (Link et al. 2020; Orvin and Fatmi 2021), weather conditions (Peters and MacKenzie 2019), land use (Chen and Ye 2021; Cheng et al. 2020a), built environment (Guo and He 2020; Shen et al. 2018), and access to the metro system (Cheng et al. 2022a, 2023; Ma et al. 2019). On balance, FFBS usage is higher in areas with denser populations, comfortable weather conditions, higher land use mix, friendlier cycling environments, and better interchange facilities. These findings are of great importance for cycling facility planning (Zhao and Ong 2021), bike scheduling strategy design (Chang et al. 2021), and ridership activity prediction (Xu et al. 2018). However, most of these studies model trip origins and destinations separately, and few investigate FFBS trips from the perspective of OD flows.

Abstracting flow clusters from large-scale, chaotic journey data is crucial to reveal the spatio-temporal dynamics of human mobility and commuting patterns (Liu et al. 2022a). Currently, only a few initial works have looked at OD flows using FFBS journey data. Based on the Shanghai Mobike dataset, Du et al. (2019b) visualized the spatio-temporal distribution of FFBS OD flows by exploiting the ODPFM (O-D Proportion Flow Map) tool. They found that the spatial distribution of FFBS OD flows varied considerably by land use type and period of time. Zheng et al. (2021) constructed an OD spatial network using the Beijing Mobike dataset and investigated the imbalance characteristics of the FFBS system. The results suggested that supply and demand are relatively balanced in most of the study area, while a few areas have large imbalances between resource supply and demand. Drawing on a four-month FFBS OD flow dataset in Singapore, Zhang et al. (2021) identified some activity zones from cycling behaviors by applying a modularity optimization community detection method. They found that the activity zones yielded from the FFBS networks are locally clustered. Furthermore, taking Nanjing, China as an example, Chen et al. (2022b) constructed a spatial interaction network using FFBS OD flows. Based on this, the urban activity zone borders were delineated using the Leiden algorithm. They pointed out that the FFBS activity zone borders overlap more with natural borders (e.g., water bodies and mountains) than with administrative borders.

The aforementioned studies addressing FFBS mobility patterns often focus on the OD flows from one individual area to another, providing valuable findings for the spatial interactions of the FFBS system. However, since there is no fixed station constraint for FFBS bikes, these studies typically use regular grids to aggregate FFBS usage (Du et al. 2019b; Zhang et al. 2019; Zheng et al. 2021), and few have investigated FFBS flow clusters at a finer spatial resolution. Questions such as what the typical spatio-temporal patterns of FFBS flow clusters are, and whether these clusters have varying shape and density distributions, remain unanswered. Therefore, it is necessary to employ an efficient flow clustering method to detect inhomogeneous flow clusters from large-scale FFBS journey data.

Flow clustering methods

According to the basic principles of flow clustering, the related methods fall into the following categories: hierarchical clustering, statistical-based clustering, and density-based clustering.

In the hierarchical clustering methods, researchers first calculate the similarity between OD trips based on specific metrics (e.g., OD point locations and flow properties), and then use a specified strategy (e.g., agglomerative and divisive) to construct OD flow data into a hierarchical structure to identify flow clusters (Guo et al. 2020). For instance, Zhu and Guo (2014) considered both start and end positions in defining the similarity of two trips and proposed an agglomerative clustering approach to handle large-scale flow data. Yao et al. (2018) developed a new spatial similarity metric based on the angle and length differences between any pair of trips, and a similar agglomerative clustering approach was applied to extract flow clusters. Xiang and Wu (2019) proposed a new hierarchical clustering method (called TOCOFC) to obtain flow clusters from the original, chaotic trips. The method introduces a similarity metric to measure the spatio-temporal similarity between different trips, and then employs a recursive optimum cut-based approach to partition trips. In summary, the hierarchical clustering methods have been widely utilized in small-scale OD flow datasets, but may not be applicable to large OD flow datasets like FFBS journey data because of their high computational complexity (Liu et al. 2022a).

In the statistical-based clustering methods, researchers have extended the traditional spatial statistics in hopes of detecting flow clusters from large OD flow datasets. For instance, Liu et al. (2015) improved the global and local Moran’s I statistics to extract flow clusters containing highly spatially correlated trips, and conducted an empirical study using taxi data from Shanghai, China as a case study. Tao and Thill (2016) proposed a K-function extension method for OD flow data to upgrade its detection target from point clusters to flow clusters. In addition, Gao et al. (2018) introduced a multidimensional spatial scan statistics approach to identify flow clusters. These statistical-based clustering methods can effectively detect statistically significant flow clusters, but they have obstacles in detecting flow clusters with arbitrary shapes (Song et al. 2019).

Given that some density-based clustering algorithms (e.g., DBSCAN (Ester et al. 1996) and OPTICS (Ankerst et al. 1999)) are well able to identify irregularly shaped point clusters, researchers have successfully upgraded point clustering to flow clustering by improving these traditional algorithms (Gallego et al. 2018; Tao and Thill 2016). Although such density-based clustering approaches have competitive advantages in detecting arbitrarily shaped clusters, they exhibit poor clustering performance when the density of OD flows is unevenly distributed (Reddy and Bindu 2017). Furthermore, these methods are mainly developed based on Euclidean spatial distances (Liu et al. 2022a). However, related studies have shown that Euclidean distance-based clustering methods may introduce a significant systematic bias in the presence of network constraints (Besse et al. 2016; Yamada and Thill 2010). For FFBS journey data, the above methods are clearly not applicable as their OD points are typically strongly constrained by road networks (Hua et al. 2020; Zhang et al. 2019). To handle these issues, Liu et al. (2022a) recently presented a shared nearest-neighbor-based flow (SNN_flow) clustering method, which possesses superior performance in identifying clusters of network-constrained OD flows with irregular shapes and inhomogeneous distributions. However, for the city-level OD flow data, the efficiency of the SNN_flow method does not seem to be ideal as it has a relatively high time complexity.

By and large, existing studies have tried many clustering algorithms in the flow cluster extraction problem to provide valuable insights for unveiling human mobility patterns. However, they still have gaps in effectively detecting flow clusters of varying shapes and densities from large-scale OD flow data. To this end, we combine the respective advantages of community detection and SNN_flow methods, and propose a two-stage flow clustering method to extract FFBS flow clusters, trying to provide profound insights for uncovering human mobility patterns and allocating infrastructure resources.

Two-stage flow clustering method

In Stage I, the study area is divided into multiple FFBS activity zones using the Leiden algorithm. Based on this, the FFBS flow clusters within each activity zone are identified separately in Stage II employing the SNN_flow method.

Stage I: activity zone delineation

A three-step identification framework is developed to delineate the FFBS activity zone borders, as depicted in Fig. 1. First, we construct an undirected weighted network (G = (V, E)) upon the FFBS trips (Fig. 1a, b), where V is the vertex set of the network G, consisting of the centroids of all spatial units; E is the edge set of the network G, consisting of all links between each pair of centroids; each edge e corresponds to a weight W(e), which represents the size of OD flow (i.e., FFBS trip count).
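The network construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the input format (one pair of spatial-unit identifiers per trip) and the function name `build_undirected_network` are assumptions, and trips that start and end in the same unit are kept here as self-loops, which the source does not specify.

```python
from collections import Counter

def build_undirected_network(trips):
    """Aggregate (origin_unit, destination_unit) trips into undirected edge weights W(e)."""
    weights = Counter()
    for origin_unit, dest_unit in trips:
        # Sort the pair so that A->B and B->A contribute to the same undirected edge.
        # Assumption: a trip within one unit is stored as a self-loop (unit, unit).
        edge = tuple(sorted((origin_unit, dest_unit)))
        weights[edge] += 1
    vertices = {unit for edge in weights for unit in edge}
    return vertices, dict(weights)

# Toy example with three hypothetical spatial units.
trips = [("A", "B"), ("B", "A"), ("A", "C"), ("B", "B")]
vertices, weights = build_undirected_network(trips)
```

In this toy input, the two trips between A and B in opposite directions are merged into a single edge of weight 2, which is what makes the network undirected.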

Fig. 1 Identification framework of FFBS activity zones: (a) FFBS trip preparation; (b) undirected weighted network construction; (c) community structure division; and (d) activity zone delineation

Second, based on the community detection algorithm, all vertices are divided into multiple vertex groups (Fig. 1c). The basic principle of this algorithm is to form a community structure based on the degree of connection between vertices. That is, vertices located in the same community are relatively closely connected, while vertices located in different communities are very sparsely connected to each other (Girvan and Newman 2002). The Louvain algorithm is a classical community detection algorithm, covering two elementary phases (i.e., node local movement and network aggregation), which provides an efficient solution for vertex grouping (Jin et al. 2021). However, it has been found that the Louvain algorithm may derive some internally poorly connected communities during the community structure partitioning process (Traag et al. 2019). To overcome this defect, Traag et al. (2019) recently extended the Louvain method by adding a partition refinement phase and proposed the so-called Leiden algorithm, which is more computationally efficient and uncovers better partition structures. In this work, the Leiden community detection algorithm is exploited to discover the partition structure of the FFBS system.

In addition, we measure the quality of the communities partitioned by the Leiden algorithm using the modularity Q. The value of Q typically lies between zero and one, and the larger the value, the more stable the corresponding community structure (Jin et al. 2021). The modularity Q of the weighted network (Arenas et al. 2008) can be written as:

$$Q = \frac{1}{2W}\sum\limits_{ij} {\left( {W_{ij} - \frac{{s_{i} s_{j} }}{2W}} \right)} \, \delta \left( {c_{i} ,c_{j} } \right)$$
(1)

where W is the sum of the weights of all edges, Wij is the element of the weighted adjacency matrix (i.e., the weight of the edge between vertices i and j), si (sj) refers to the strength of vertex i (j), ci (cj) refers to the community to which vertex i (j) is assigned, and δ(·) is an indicator function: δ = 1 if ci = cj, and δ = 0 otherwise.
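To make Eq. (1) concrete, the following sketch evaluates Q for a toy network. It assumes that each undirected edge is listed once and connects two distinct vertices; the function name `modularity` and the example graph (two triangles joined by a single edge) are hypothetical and serve only as an illustration.

```python
def modularity(edge_weights, community):
    """Weighted modularity Q of Eq. (1).

    edge_weights: {(i, j): weight}, each undirected edge listed once.
    community:    {vertex: community label}.
    """
    total_weight = sum(edge_weights.values())          # W
    strength = {vertex: 0.0 for vertex in community}   # s_i
    for (i, j), weight in edge_weights.items():
        strength[i] += weight
        strength[j] += weight
    # Observed part: an intra-community edge appears twice in the double sum over (i, j).
    observed = sum(2.0 * weight for (i, j), weight in edge_weights.items()
                   if community[i] == community[j])
    # Expected part: summing s_i * s_j over same-community pairs gives the
    # squared total strength of each community.
    community_strength = {}
    for vertex, label in community.items():
        community_strength[label] = community_strength.get(label, 0.0) + strength[vertex]
    expected = sum(s * s for s in community_strength.values()) / (2.0 * total_weight)
    return (observed - expected) / (2.0 * total_weight)

# Two triangles (a, b, c) and (d, e, f) joined by the edge c-d, all weights 1.
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
         ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1, ("c", "d"): 1}
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {vertex: 0 for vertex in two_groups}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

For this graph, W = 7 and each triangle has a total strength of 7, so splitting along the bridge gives Q = (12 − 7)/14 = 5/14, whereas placing all vertices in one community gives Q = 0.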

Third, the borders of the community structures are automatically identified and highlighted with the help of the ArcGIS Dissolve Boundaries tool (Fig. 1d). These communities have dense FFBS trips within them and can serve as effective proxies for user activity spaces (Chen et al. 2022b). The study area is eventually separated by the borders of the FFBS activity zones.

Stage II: flow cluster identification

For each FFBS activity zone delineated in the previous phase, the SNN_flow method is exploited to identify its respective flow clusters. Specifically, the SNN_flow method consists of three essential steps. First, the FFBS trips and road network datasets are collected and preprocessed as inputs for the subsequent steps (“Data preparation” section). Second, a suitable k value is estimated to determine the clustering scale (“Appropriate k-value estimation” section). Finally, the flow clusters of FFBS are detected based on the SNN density (“Flow cluster detection” section). The pseudo-code of SNN_flow is given in Algorithm 1.

Algorithm 1 Pseudo-code of the SNN_flow method

Data preparation

With no fixed dock limitation, FFBS bikes can be parked freely near buildings along urban streets (as long as parking is permitted), such that the location of some OD points is somewhat offset from the road segment (e.g., point Om in Fig. 2a). To address this issue, a map matching approach is used to match each pair of OD points onto their nearest road segment (White et al. 2000). A road network is then constructed based on the existing road dataset to search for the network distance between OD point pairs as a proxy for trip trajectories. Network distance is usually a more accurate reflection of users’ actual travel behavior than a straight-line path (i.e., Euclidean distance) (Apparicio et al. 2008; Páez et al. 2020). Taking the fm trip composed of origin point (Om) and destination point (Dm) as an example, the spatial distribution of its network distance is shown in Fig. 2b. More specifically, the fm trip can be expressed as (Om, Dm, LS(Om), LE(Om), LS(Dm), LE(Dm)), where LS(Om) and LE(Om) denote the length of the shortest path between the origin point (Om) and the start/end node of the road segment where it is located, respectively; similarly, LS(Dm) and LE(Dm) are the shortest path lengths from Dm to the start node and end node of its road segment, respectively.

Fig. 2 Identification process of the three-nearest neighbors of a certain trip (fm): (a) FFBS trip preparation; (b) matching FFBS trips onto the road network; (c) road node distance matrix construction; and (d–f) k-nearest neighbor identification

According to the basic principle of the SNN_flow algorithm, it is necessary to further search for the k-nearest neighbors corresponding to each trip (Liu et al. 2022a). Hence, we need to calculate the distance (for this study, the network distance is used) between any two trips in advance. Taking fm and fn as an example, the network distance between them is calculated by the following formula (Shu et al. 2021):

$$ND(f_{m} ,f_{n} ) = ND(O_{m} ,O_{n} ) + ND(D_{m} ,D_{n} )$$
(2)

where ND(Om, On) refers to the network distance between the origin points of fm and fn, and ND(Dm, Dn) refers to the network distance between the destination points of fm and fn. For the two origin points Om and On located on the road segments Sij and Spq in Fig. 2b, the network distance ND(Om, On) is taken as the minimum of the following two cases: (i) LS(Om) + ND(i, p) + LS(On); (ii) LE(Om) + ND(j, q) + LE(On). From this, we can see that the network distance between two trips contains two parts: one is a variable distance between road nodes and OD points (e.g., LS(Om)), and the other is a fixed distance between road nodes (e.g., ND(i, p)).
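Equation (2) and the two cases above can be sketched as follows. The data layout is an assumption made for illustration: each map-matched point is stored with the start and end nodes of its road segment and its offsets LS and LE, and `node_dist` is a precomputed lookup of node-to-node network distances. The names `MatchedPoint`, `point_network_distance` and `trip_network_distance` are hypothetical, and the sketch only evaluates the two cases stated in the text (it does not treat two points on the same segment specially).

```python
from typing import NamedTuple

class MatchedPoint(NamedTuple):
    start_node: str   # start node of the road segment the point lies on
    end_node: str     # end node of that segment
    ls: float         # LS: path length from the point to the start node
    le: float         # LE: path length from the point to the end node

def point_network_distance(a, b, node_dist):
    """ND between two matched points: minimum of cases (i) and (ii)."""
    via_start = a.ls + node_dist[a.start_node][b.start_node] + b.ls   # case (i)
    via_end = a.le + node_dist[a.end_node][b.end_node] + b.le         # case (ii)
    return min(via_start, via_end)

def trip_network_distance(fm, fn, node_dist):
    """Eq. (2): ND(fm, fn) = ND(Om, On) + ND(Dm, Dn); a trip is an (origin, destination) pair."""
    (o_m, d_m), (o_n, d_n) = fm, fn
    return (point_network_distance(o_m, o_n, node_dist)
            + point_network_distance(d_m, d_n, node_dist))

# Toy node-to-node distances (metres) for four nodes i, j, p, q.
node_dist = {
    "i": {"i": 0, "j": 100, "p": 300, "q": 350},
    "j": {"i": 100, "j": 0, "p": 370, "q": 250},
    "p": {"i": 300, "j": 370, "p": 0, "q": 120},
    "q": {"i": 350, "j": 250, "p": 120, "q": 0},
}
f_m = (MatchedPoint("i", "j", 40, 60), MatchedPoint("p", "q", 20, 100))
f_n = (MatchedPoint("p", "q", 50, 70), MatchedPoint("i", "j", 30, 70))
distance = trip_network_distance(f_m, f_n, node_dist)
```

In this example the origins are 380 m apart (case (ii) is shorter than case (i), 380 versus 390) and the destinations are 350 m apart, so the trip-to-trip distance is 730 m.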

In reality, the network distance between road nodes is fixed and does not change with the location of the OD points. Therefore, we can calculate this part of the distance in advance to reduce the workload of calculating the network distance between OD points and improve computational efficiency. Figure 2c displays an example of constructing a distance matrix based on local network nodes.
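One way to precompute the fixed node-to-node part is to run a shortest-path search from every road node once and store the results. The sketch below uses Dijkstra's algorithm on a small adjacency list; this is an illustrative choice rather than the tool used in the study, and the graph and function names are hypothetical.

```python
import heapq

def shortest_paths_from(source, adjacency):
    """Dijkstra's algorithm: shortest network distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, length in adjacency.get(node, []):
            candidate = d + length
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

def node_distance_matrix(adjacency):
    """Precompute the road node distance matrix as a nested dictionary."""
    return {node: shortest_paths_from(node, adjacency) for node in adjacency}

# Toy road network: undirected segments with lengths in metres.
adjacency = {
    "i": [("j", 100), ("p", 300)],
    "j": [("i", 100), ("q", 250)],
    "p": [("i", 300), ("q", 120)],
    "q": [("j", 250), ("p", 120)],
}
matrix = node_distance_matrix(adjacency)
```

Once the matrix is stored, each point-to-point distance only requires two lookups and a few additions, which is where the efficiency gain comes from.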

Appropriate k-value estimation

As mentioned above, in order to obtain the SNN density of a trip, we need to first identify its network-constrained k-nearest neighbors (Fig. 2f). In practice, most trips are within an acceptable distance from their kth nearest neighbors when the value of k is not large (Pei et al. 2012). This means that we only need to compute the network distance between fm and those trips within a certain range of it. In this study, linear buffers of length L0 are drawn for the origin and destination points of each trip, respectively. Figure 2d depicts the linear buffers at both ends of fm. The network distances from the Om (or Dm) point to the other origins (or destinations) in the buffer are all less than L0. The value of L0 depends on the range that covers most trip distances for a given transportation mode. For FFBS, 5 km is typically considered the longest trip distance (Chen 2021), and hence it is the length threshold (L0) adopted in this study.

After a linear buffer is constructed for each trip, its neighboring trips located within the buffer can be further extracted. Figure 2e illustrates the neighboring trips of fm within its buffer. Then, combined with the local road node distance matrix, we efficiently calculate the network distance between the target trip fm and its neighboring trips, and on this basis, identify the k-nearest trips of fm. Figure 2f shows the three-nearest trips identified from the neighboring trips of fm (assuming k = 3).
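The buffer-then-rank procedure can be sketched as below. To keep the example self-contained, endpoints are placed on a single line and the endpoint distance is an absolute difference; this one-dimensional setup is only a stand-in for the network distance, and all names and values are hypothetical.

```python
import heapq

def k_nearest_trips(target_id, trips, k, endpoint_dist, l0):
    """Return the IDs of the k nearest trips to the target among those inside its buffers.

    trips: {trip_id: (origin, destination)}; endpoint_dist: distance between two endpoints.
    """
    o_m, d_m = trips[target_id]
    candidates = []
    for trip_id, (o_n, d_n) in trips.items():
        if trip_id == target_id:
            continue
        d_origin = endpoint_dist(o_m, o_n)
        d_dest = endpoint_dist(d_m, d_n)
        # Keep only trips whose origin AND destination fall inside the buffers of length L0.
        if d_origin < l0 and d_dest < l0:
            candidates.append((d_origin + d_dest, trip_id))   # trip distance as in Eq. (2)
    return [trip_id for _, trip_id in heapq.nsmallest(k, candidates)]

# Toy trips on a line (positions in metres); L0 = 5 km as in the text, k = 3.
trips = {
    "fm": (0, 1000),
    "f1": (100, 1100),
    "f2": (50, 900),
    "f3": (400, 1500),
    "f4": (6000, 1000),   # origin lies outside the buffer of fm
    "f5": (300, 1000),
}
neighbors = k_nearest_trips("fm", trips, k=3,
                            endpoint_dist=lambda a, b: abs(a - b), l0=5000)
```

Here f4 is discarded by the buffer test before any ranking takes place, and the three nearest trips of fm are f2, f1 and f5 (distances 150, 200 and 300 m).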

Estimating an appropriate value of k is critical because its magnitude determines how reasonable the SNN density distribution is. A k value that is either too low or too high will distort the estimation of SNN density. Many researchers have applied the ratio between the variance of the (k + 1)th nearest distance and that of the kth nearest distance (RKD for short, formed from the initials of ratio, (k + 1)k, and distance) to estimate the appropriate k value with good performance (Liu et al. 2022a; Pei 2011), and hence this index is used in this study.

$$RKD = \frac{{{\text{Var}}_{k + 1}^{*} (x)}}{{{\text{Var}}_{k}^{*} (x)}}/R_{k} \quad (k \ge 1)$$
(3)

where \({\text{Var}}_{k}^{*} (x)\) denotes the variance of the kth nearest distance of the trips (i.e., the distance between each trip and its kth nearest trip is first calculated, and then the variance of all distances is calculated); \({\text{Var}}_{k + 1}^{*} (x)\) has a similar meaning; and Rk is a constant term whose value is equal to the ratio of the expectation value of the above two distances. As the k value increases, the RKD value will gradually level off. We can easily identify the magnitude of the k value when RKD is at the leveling-off change point. For more details, please refer to the study of Pei (2011).
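As a rough illustration of Eq. (3), the sketch below computes the variance ratio from a pairwise trip distance matrix. Two points are assumptions rather than details taken from the source: the variance is computed as a population variance, and the constant Rk is passed in as a parameter `r_k` because its derivation is given in Pei (2011) and not reproduced here. The function names are hypothetical.

```python
from statistics import pvariance

def kth_nearest_distances(distance_matrix, k):
    """For each trip, the distance to its k-th nearest other trip."""
    result = []
    for i, row in enumerate(distance_matrix):
        others = sorted(d for j, d in enumerate(row) if j != i)
        result.append(others[k - 1])
    return result

def rkd(distance_matrix, k, r_k):
    """Eq. (3): ratio of the variances of the (k+1)-th and k-th nearest distances, divided by R_k."""
    var_k = pvariance(kth_nearest_distances(distance_matrix, k))
    var_k1 = pvariance(kth_nearest_distances(distance_matrix, k + 1))
    return (var_k1 / var_k) / r_k

# Toy example: four trips reduced to positions on a line, pairwise absolute distances.
positions = [0.0, 1.0, 3.0, 7.0]
matrix = [[abs(a - b) for b in positions] for a in positions]
ratio = rkd(matrix, k=1, r_k=1.0)   # r_k = 1.0 is a placeholder, not the value from Pei (2011)
```

In practice, the function would be evaluated for increasing k and the value of k at which the RKD curve levels off would be selected.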

Flow cluster detection

In the previous subsection, we identified k neighboring trips for each trip. In this subsection, the SNN density of each trip is estimated in order to detect flow clusters. Following Ester et al. (1996) and Liu et al. (2022a), we introduce some important concepts for the SNN algorithm:

  • Definition 1 (SNN similarity). The number of nearest neighbors shared by the k-nearest trips of any two trips. For trips fm and fn, their k-nearest trips can be expressed as KNN(fm) and KNN(fn), and the SNN similarity of them can be expressed as:

    $$SNN(f_{m} ,f_{n} ) = \left| {KNN(f_{m} ) \cap KNN(f_{n} )} \right|$$
    (4)
  • Definition 2 (Directly reachable). If SNN(fm, fn) ≥ k/2, the two trips, fm and fn, are directly reachable.

  • Definition 3 (SNN density). It refers to the number of trips that are directly reachable from a particular trip (e.g., fm).

  • Definition 4 (Core flow). For a particular trip, fm, if p-value(fm) ≤ α (α is the significance level), then it is regarded as a core flow. The p-value of fm can be written as:

    $$p{\text{ - value}}(f_{m} ) = \frac{{\mathop \sum \nolimits_{i = 1}^{R} I_{i} (SNND_{r} (f_{m} ) \ge SNND_{o} (f_{m} ))}}{1 + R}$$
    (5)

where Ii(·) is an indicator function, if SNNDr(fm) ≥ SNNDo(fm), then Ii = 1, else, Ii = 0. SNNDr(fm) and SNNDo(fm) refer to the SNN density of fm calculated from the random trips and observed trips. R is the number of Monte Carlo simulations. Some researchers confirmed that Monte Carlo simulation can minimize the sampling effort without affecting the overall performance of the model when α = 0.05, R = 99 (Silva et al. 2009; Liu et al. 2022b), which is used in this study.

  • Definition 5 (Border flow). A trip that is directly reachable from a core flow but is not itself a core flow.

  • Definition 6 (Noise flow). A trip that is neither a core flow nor directly reachable from one.
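Definitions 1 to 4 translate almost directly into set operations. The sketch below assumes that the k-nearest trips of every trip are already available as sets; the function names and the toy neighbor lists are hypothetical, and the p-value follows Eq. (5) exactly as written above.

```python
def snn_similarity(knn, fm, fn):
    """Definition 1, Eq. (4): number of k-nearest trips shared by fm and fn."""
    return len(knn[fm] & knn[fn])

def directly_reachable(knn, fm, fn, k):
    """Definition 2: fm and fn are directly reachable if they share at least k/2 neighbors."""
    return snn_similarity(knn, fm, fn) >= k / 2

def snn_density(knn, fm, k):
    """Definition 3: number of trips that are directly reachable from fm."""
    return sum(1 for fn in knn if fn != fm and directly_reachable(knn, fm, fn, k))

def p_value(observed_density, simulated_densities):
    """Eq. (5): share of Monte Carlo runs whose SNN density reaches the observed one."""
    r = len(simulated_densities)
    exceed = sum(1 for density in simulated_densities if density >= observed_density)
    return exceed / (1 + r)

# Toy k-nearest-trip sets with k = 3.
k = 3
knn = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"}, "d": {"a", "b", "c"},
    "e": {"d", "f", "g"}, "f": {"e", "g", "d"}, "g": {"e", "f", "d"},
}
density_a = snn_density(knn, "a", k)
density_e = snn_density(knn, "e", k)
# Hypothetical densities of trip "a" in R = 99 randomized datasets.
simulated = [0] * 97 + [3, 4]
p_a = p_value(density_a, simulated)
is_core = p_a <= 0.05   # Definition 4 with alpha = 0.05
```

Trip a shares two neighbors with each of b, c and d, so its SNN density is 3; only two of the 99 simulated densities reach that value, which gives a p-value of 0.02 and makes a a core flow at α = 0.05.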

The following steps describe how the SNN algorithm detects flow clusters based on the density-connectivity mechanism. (i) A trip (fm) is randomly selected from the dataset. If its p-value is smaller than or equal to α, fm is a core flow: it is added to a new cluster and assigned a cluster ID (e.g., Ck). Otherwise, the algorithm moves on to another trip. (ii) Once a core flow fm has been found, the algorithm visits each trip that is directly reachable from fm and evaluates its SNN density in the same way. Every such trip that is also a core flow is added to the Ck cluster. (iii) A trip that is directly reachable from fm but has a p-value greater than α is considered a border flow. A border flow is still added to the Ck cluster as long as it is directly reachable from any core flow in that cluster. The search continues recursively until all trips reachable from fm have been visited. (iv) The algorithm then selects a trip that it has not yet visited and repeats steps (i)–(iii). Trips that are neither core flows nor directly reachable from one are labeled as noise flows. In the end, each flow cluster consists of core and border flows and aggregates a certain number of spatially similar trips.
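Steps (i) to (iv) can be sketched as a breadth-first expansion. The sketch assumes that the core-flow test and the directly reachable relation have already been computed for every trip; it visits trips in dictionary order instead of randomly, uses label 0 for noise, and assigns a border flow to the first cluster that reaches it. These choices and all names are illustrative assumptions.

```python
from collections import deque

def detect_flow_clusters(reachable, is_core):
    """Label each trip with a cluster ID (1, 2, ...) or 0 for noise.

    reachable: {trip: set of trips directly reachable from it}
    is_core:   {trip: True if the trip is a core flow}
    """
    labels = {}
    cluster_id = 0
    for seed in reachable:
        if seed in labels or not is_core[seed]:
            continue                      # step (i): only an unvisited core flow starts a cluster
        cluster_id += 1
        labels[seed] = cluster_id
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            for neighbor in reachable[current]:
                if neighbor in labels:
                    continue
                labels[neighbor] = cluster_id   # steps (ii)-(iii): core or border flow joins
                if is_core[neighbor]:
                    queue.append(neighbor)      # only core flows keep expanding the cluster
    for trip in reachable:
        labels.setdefault(trip, 0)              # step (iv): everything else is noise
    return labels

reachable = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"}, "f": set(),
}
is_core = {"a": True, "b": True, "c": False, "d": False, "e": False, "f": False}
labels = detect_flow_clusters(reachable, is_core)
```

In this toy input, a and b are core flows and c is a border flow of the same cluster, while d, e and f end up as noise because none of them is reachable from a core flow.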

As stated earlier, the SNN_flow method consisting of three essential steps is utilized to identify the flow clusters for each activity zone. It is thus necessary to finally merge the flow clusters of all activity zones for subsequent analysis.

Study area and data description

Study area

Nanjing is the capital of Jiangsu province of China, a megacity and the second largest city in the East China region (Fig. 3a). Nanjing had a total area of 6587 km² and a population of 8.33 million as of 2018. There are 11 administrative districts, six of which are urban districts (i.e., Gulou, Jianye, Xuanwu, Qinhuai, Yuhuatai, and Qixia) and the remaining five are suburban districts (i.e., Liuhe, Pukou, Jiangning, Lishui, and Gaochun) (Cheng et al. 2020b), as shown in Fig. 3b.

Fig. 3 Spatial distribution of (a) Jiangsu province; (b) Nanjing city; and (c) study area

FFBS was first launched in Nanjing at the beginning of 2017 and quickly attracted numerous users due to advantages such as flexible mobility and a smart rental process (Hua et al. 2020). FFBS is usually backed by venture capital funding. For profit-making purposes, most bikes are assigned to densely populated areas with high demand (Cheng et al. 2020a; Gu et al. 2019). Nanjing is no exception, and citizens in its peripheral districts (i.e., Lishui and Gaochun) have no FFBS bikes to use. Therefore, the remaining nine administrative districts of Nanjing are selected as the study area (see Fig. 3c). Note that the traffic analysis zone (TAZ) within the study area was adopted as the spatial unit for delineating the FFBS activity zones (see “Stage I: activity zone delineation” section).

Data description

The FFBS journey data were provided by Mobike, which at the time had the largest share of the FFBS fleet in Nanjing (Cheng et al. 2022b). The dataset records journey information of users, including fields such as user ID, bike ID, unlock time, lock time, coordinates of origins and destinations. We focus on the mobility pattern of the FFBS system on weekdays. In this study, data for only three consecutive weekdays (from 12 (Tuesday) to 14 (Thursday) September 2017) are used due to data availability. Nevertheless, they could still serve as a valid sample to validate the applicability of the method and unravel the daily patterns of FFBS trips (Guo and He 2020). During this period, the average temperature in Nanjing was between 20 °C and 28 °C with no rainfall, which was suitable for outdoor activities such as cycling. To mitigate the interference of abnormal data, we removed FFBS journeys with travel times less than 2 min or longer than 120 min (Chen et al. 2022b; Zhao et al. 2015). Nearly 1.9 million trips made by a total of 190,008 bikes were eventually recorded.
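The duration filter described above amounts to a simple range check. The sketch below assumes each journey is a record with `unlock_time` and `lock_time` fields; these field names and the sample records are hypothetical, since the actual schema of the Mobike dataset is not given beyond the list of fields above.

```python
from datetime import datetime

def filter_valid_trips(trips, min_minutes=2, max_minutes=120):
    """Drop journeys shorter than min_minutes or longer than max_minutes."""
    valid = []
    for trip in trips:
        minutes = (trip["lock_time"] - trip["unlock_time"]).total_seconds() / 60
        if min_minutes <= minutes <= max_minutes:
            valid.append(trip)
    return valid

trips = [
    {"bike_id": "b1", "unlock_time": datetime(2017, 9, 12, 8, 0),
     "lock_time": datetime(2017, 9, 12, 8, 1)},     # 1 min: removed
    {"bike_id": "b2", "unlock_time": datetime(2017, 9, 12, 8, 0),
     "lock_time": datetime(2017, 9, 12, 8, 15)},    # 15 min: kept
    {"bike_id": "b3", "unlock_time": datetime(2017, 9, 12, 8, 0),
     "lock_time": datetime(2017, 9, 12, 10, 30)},   # 150 min: removed
]
kept = filter_valid_trips(trips)
```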

The road dataset was obtained from Amap (https://ditu.amap.com/), one of the most popular mapping service providers in China. In order to calculate the network distance between adjacent FFBS trips (see “Data preparation” section), a road network needs to be constructed on the basis of the original road dataset with the help of ArcGIS Network Analyst Extension. In addition, we applied a solution recently developed by Xu (2022) to further refine the connectivity of the road network by checking and modifying its topology (https://github.com/xuxinkun0591/gaode2/).

Another dataset we adopted is the land use map provided by the Nanjing Planning Bureau. The land use map consists of many polygons with different shapes, and each polygon has a corresponding land use type attribute. In line with a related study (Pan et al. 2012), a variety of land use types are divided, including campus, hospital, scenic spot, metro station, employment district, residential district, commercial district, green space, water body, and other land use. The spatial distribution of different land use types in the study area is shown in Fig. 4. Based on the land use information, we can initially infer the travel purpose of FFBS trips in the subsequent analysis (Lei et al. 2020).

Fig. 4 Spatial distribution of different land use types in the study area

Results and discussions

Flow clusters identification

Application of the two-stage flow clustering method

In a recently published work, Chen et al. (2022b) examined the partitioning of FFBS activity zones with the Leiden algorithm using the same data source. The community structure of the study area is obtained in just a few seconds due to the low time complexity of the Leiden algorithm, which indicates that this algorithm is very efficient. The study pointed out that the most robust community structure was yielded when the entire study area was divided into 22 FFBS activity zones (Fig. 5b). It can be seen from Fig. 5a, b that the FFBS activity zone borders coincide with the established administrative borders only to a small extent.

Fig. 5 Spatial distribution of (a) administrative districts and (b) activity zones in the study area; and proportional distribution of FFBS trips from September 12 to 14, 2017 for (c) administrative districts and (d) activity zones in the study area. Note: the spatial distribution of activity zones in (b) is adapted from Chen et al. (2022b). Connections with fewer than 100 FFBS trips are not displayed in (c) and (d) to avoid display clutter

To validate the rationality of the FFBS activity zone delineation, the proportional distribution of FFBS trips within and between regions is investigated for three weekdays (September 12 to 14, 2017), as shown in Fig. 5c, d. First, while most FFBS trips are distributed within the same administrative district (87.89%), a notable share of FFBS trips still connects different administrative districts (12.11%). This distribution characteristic is more prominent in urban districts (e.g., Gulou, Qinhuai, and Xuanwu districts). By contrast, the activity zones delineated by the Leiden algorithm have stronger intra-zone connections (92.34%). Although the study area is divided into more activity zones than administrative districts, the share of FFBS trips between them is lower (i.e., inter-zone trips, 7.66%). This means that activity zone borders portray FFBS user travel behavior and urban spatial structure more reasonably. Therefore, by dividing the study area into multiple activity zones, a complex network can be decomposed into multiple sub-networks. This process is expected to significantly improve the computational efficiency of the SNN_flow method while minimizing the effect of inter-zone connections.
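The intra-region share reported above can be reproduced with a few lines of code. The following is a minimal sketch rather than the procedure used in this study: the trip schema (pairs of location identifiers) and the `region_of` lookup are hypothetical names introduced for illustration.

```python
def intra_region_share(trips, region_of):
    """Return the percentage of trips that start and end in the same region.

    trips: iterable of (origin_id, destination_id) pairs (hypothetical schema).
    region_of: dict mapping a location id to its region label, e.g. an
        administrative district or a Leiden activity zone.
    """
    total = 0
    intra = 0
    for origin, destination in trips:
        total += 1
        if region_of[origin] == region_of[destination]:
            intra += 1
    if total == 0:
        raise ValueError("no trips supplied")
    return 100.0 * intra / total


# Toy example: three of the four trips stay inside one region.
toy_trips = [("a", "b"), ("a", "c"), ("c", "d"), ("b", "a")]
toy_regions = {"a": 1, "b": 1, "c": 2, "d": 2}
toy_share = intra_region_share(toy_trips, toy_regions)
```

Running the same function once with administrative districts and once with activity zones as `region_of` gives the two shares being compared; the inter-region share is simply 100 minus the result.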

Then, the flow clusters within each activity zone are detected separately using the SNN_flow method, and the detection results are merged across all activity zones. It is noteworthy that the morning peak (7:00–9:00, referred to as AM) and the evening peak (17:00–19:00, referred to as PM) are the focus of the flow cluster analysis, as FFBS usage is higher and more time-concentrated during these periods. In addition, we extract only flow clusters with more than 30 similar trips from the daily AM and PM peaks to ensure that the number of flow clusters is within a reasonable range (Liu et al. 2022a). Taking activity zone 14 as a case study, the details of flow cluster identification are illustrated in Appendix 1.
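The peak-hour selection and the size threshold can be sketched as follows. This is an illustrative sketch only: the function names and the cluster data structure are hypothetical, and treating the end of each peak window as exclusive is an assumption not stated in the text.

```python
from datetime import datetime

# Peak windows in hours; start inclusive, end exclusive (an assumption).
PEAKS = {"AM": (7, 9), "PM": (17, 19)}
MIN_SIMILAR_TRIPS = 30  # a cluster must contain more than this many trips


def peak_label(start_time):
    """Return "AM", "PM", or None for a trip start time."""
    for label, (begin, end) in PEAKS.items():
        if begin <= start_time.hour < end:
            return label
    return None


def keep_large_clusters(clusters, threshold=MIN_SIMILAR_TRIPS):
    """Keep clusters with more than `threshold` similar trips.

    clusters: dict mapping a cluster id to its list of trips (hypothetical).
    """
    return {cid: trips for cid, trips in clusters.items() if len(trips) > threshold}
```

A cluster with exactly 30 trips is dropped under this reading, because the text requires the number of similar trips to be greater than 30.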

Table 1 summarizes the flow clusters identified for all activity zones during the AM and PM peaks. On the whole, the number of similar trips and flow clusters remained stable across the three weekdays. For instance, during the AM peak, the number of flow clusters remained between 375 and 404, corresponding to a percentage of similar trips ranging between 16.92% and 17.57%. Nevertheless, we found salient differences in the number and size of flow clusters identified between the AM and PM peaks. Taking September 12, 2017 (Tuesday) as an example, while the number of raw trips during the PM peak (117,386) was larger than that during the AM peak (108,387), the number of flow clusters extracted during these two time periods showed an opposite trend (328 for PM vs. 375 for AM), and the corresponding number and proportion of similar trips also followed this trend. This implies that more flow clusters are identified and the size of the flow clusters is usually larger during the AM peak compared to the PM peak (see the mean values in Table 1). A plausible explanation is that commuters tend to have less stringent time constraints for returning home during the PM peak, during which they may complete some discretionary activities (e.g., shopping, eating, and entertainment) (Chen et al. 2022c; Ji et al. 2017). This leads to a reduction in the share of commuting demand that concentrates a large number of similar trips.

Table 1 Summary results of flow cluster detection from September 12 to 14, 2017

Efficiency comparison of flow clustering methods

In this subsection, we compare the efficiency of the SNN_flow method and the two-stage flow clustering method (Leiden & SNN_flow) in extracting flow clusters. The largest difference between the two methods lies in their input. More specifically, the former takes the dataset of the entire study area as input, while the latter first partitions the study area into 22 activity zones, and then takes the dataset of each activity zone as input separately. Both methods take little time in the data preparation step (see “Data preparation” section), but the former method has difficulty obtaining results within a limited time in the flow cluster detection step (see “Flow cluster detection” section). For this reason, the running time of the appropriate k-value estimation step (see “Appropriate k-value estimation” section) was selected as a proxy to compare the efficiency of the two methods in this study. Both methods were implemented in Python 3.8.11. All computational experiments were conducted on a desktop with a 2.90 GHz central processing unit and 64 GB memory.

We randomly sampled from the dataset to generate five datasets with different numbers of raw trips. The running times of the two methods in estimating k values for these five datasets are displayed in Table 2. Once the number of raw trips exceeds a certain threshold, the time spent by the SNN_flow method becomes prohibitively long. For example, when the number of raw trips increases to 50,000, its running time is nearly 7000 s. In contrast, the running time of the two-stage flow clustering method remains in an acceptable range. Moreover, FFBS trips have the distinct characteristics of short distance and local aggregation (Chen et al. 2022b; Zhang et al. 2021), and thus it seems more reasonable to extract the corresponding k value for each activity zone than to extract a unique k value from the entire study area. In summary, for the FFBS system, the two-stage flow clustering method that divides the study area into multiple activity zones and then treats them separately is more efficient and reasonable than the SNN_flow method that directly treats the entire study area.
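One way to see why the decomposition helps is to count candidate trip pairs. The sketch below assumes that the cost of neighbor search grows with the number of trip pairs that may need to be compared; this is an illustrative assumption, not a statement of the actual complexity of SNN_flow, and the zone sizes are hypothetical.

```python
def pair_count(n):
    """Number of unordered pairs among n trips."""
    return n * (n - 1) // 2


def pairs_whole_area(zone_sizes):
    """Candidate pairs when all trips are clustered together."""
    return pair_count(sum(zone_sizes))


def pairs_by_zone(zone_sizes):
    """Candidate pairs when each zone is clustered separately."""
    return sum(pair_count(n) for n in zone_sizes)


# Hypothetical split of 50,000 trips into 20 equally sized zones.
sizes = [2500] * 20
reduction = pairs_whole_area(sizes) / pairs_by_zone(sizes)
```

Under this simplified model, splitting the trips into equally sized zones cuts the number of candidate pairs by roughly the number of zones. Real activity zones differ in size, and inter-zone trips are not clustered in the two-stage method, so the actual saving will differ.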

Table 2 Running time of SNN_flow and the two-stage flow clustering methods for estimating appropriate k values

Spatio-temporal patterns of flow clusters

Inference of potential travel purpose

In this subsection, we focus on the spatio-temporal patterns of the flow clusters identified in “Application of the two-stage flow clustering method” section. First of all, the travel purpose of the flow clusters is inferred by combining the land use information (see Fig. 6). More concretely, if the proportion of the origin (destination) points of a flow cluster that falls into a certain land parcel exceeds 50%, we assign the land use type of this parcel to the head (end) of this flow cluster. Note that for an origin (destination) point that does not fall into any of the parcels, we group it into the parcel nearest to it. As shown in Fig. 6a, in the case of this identified flow cluster, most of its origins and destinations fall into parcels of the metro station type and employment district type, respectively. Therefore, it is reasonable to assume that this is a flow cluster for addressing the “last-mile” demand between a metro station and a workplace.
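The majority rule described above can be sketched as follows. The sketch assumes that every endpoint has already been matched to a parcel identifier (including the nearest-parcel fallback), and it returns `None` when no single parcel holds more than half of the points; that fallback is an assumption, since the text does not say how such clusters are handled. All names are hypothetical.

```python
from collections import Counter


def endpoint_land_use(parcel_ids, land_use_of, threshold=0.5):
    """Return the land use type of the parcel holding more than
    `threshold` of the endpoints, or None if no parcel qualifies."""
    if not parcel_ids:
        return None
    parcel, count = Counter(parcel_ids).most_common(1)[0]
    if count / len(parcel_ids) > threshold:
        return land_use_of[parcel]
    return None


def cluster_type(origin_parcels, destination_parcels, land_use_of):
    """Return the (origin type, destination type) pair of a flow cluster."""
    return (
        endpoint_land_use(origin_parcels, land_use_of),
        endpoint_land_use(destination_parcels, land_use_of),
    )
```

For a cluster whose origins fall mostly in a metro station parcel and whose destinations fall mostly in an employment district parcel, `cluster_type` returns the pair that corresponds to the “metro station → employment district” case in Fig. 6a.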

Fig. 6

Matching results of origin–destination land use types of flow clusters. (a) a matching case of “metro station → employment district” type flow cluster; matrix of origin–destination land use types for all flow clusters during (b) the AM peak and (c) the PM peak from September 12 to 14, 2017

Figure 6b and c illustrate the matching results of origin–destination land use types for the AM peak and PM peak flow clusters from September 12 to 14, 2017. For the AM peak (as shown in Fig. 6b), OD flow clusters of the “residential district → metro station” type have the highest share (47.73%). This is followed by the flow clusters of the “metro station → employment district” type (24.12%). Flow clusters that span directly from residential districts to employment districts rank third (8.98%). The remaining types of flow clusters (e.g., “metro station → commercial district”, “campus → campus”) are fewer in number during the AM peak, together accounting for less than 20% of the total. Similar to the AM peak, the OD points of the flow clusters during the PM peak are primarily concentrated in three land use types: metro station, residential district, and employment district, but their trip chain order is the opposite of that of the AM peak (see Fig. 6c). To put it another way, the flow clusters during the PM peak are dominated by return-home trips, including “metro station → residential district” (40.14%), “employment district → metro station” (25.65%), and “employment district → residential district” (8.15%).

Overall, the percentage of flow clusters used to meet “first-/last-mile” demand between metro stations and adjacent residences/workplaces is considerable, both during the AM (71.85%) and PM (65.79%) peaks. This implies that FFBS commuting trips with similar spatio-temporal characteristics mostly occur near metro stations. Another interesting finding is that the proportion of flow clusters addressing the “first-/last-mile” between metro stations and adjacent residences (47.73% for AM, 40.14% for PM) was considerably higher than those addressing the “first-/last-mile” between metro stations and adjacent workplaces (24.12% for AM, 25.65% for PM). One reason is that many companies provide commuter shuttles for their employees as an optional way to address the “first-/last-mile” needs of the metro system (Johnson et al. 2015; Kou et al. 2022). The other reason may be that many workplaces (e.g., industrial parks, government agencies) rarely allow FFBS bikes parking inside for management purposes, and the parking spaces available for commuters near the gates are usually limited (Chen and Ye 2021). This somewhat reduces the possibility of choosing FFBS as the connection mode of the metro system.

Analysis of spatio-temporal distribution characteristics

The spatial distribution of FFBS flow clusters during the peak hours from September 12 to 14, 2017 was depicted with the help of the Line Density tool in ArcGIS (see Fig. 7). The length and direction of the flow clusters are characterized by the centerline extracted from the similar trips, and the size of the flow clusters is weighted by the number of similar trips. As shown in Fig. 7, the redder the color of the grid, the higher the number of similar trips occurring at that location. As expected, the density of flow clusters during the AM peak is generally larger than that of flow clusters during the PM peak.

Fig. 7

Spatial distribution of FFBS flow cluster density during (a) the AM peak and (b) the PM peak from September 12 to 14, 2017. Note: the numerical intervals in the legend are divided by the Jenks Natural Breaks Classification tool

As shown in Fig. 7, metro stations perform a considerable role in the formation of FFBS flow clusters. In order to provide nuanced and appropriate guidance to relevant policies, it is necessary to investigate from which metro stations these flow clusters converge and diverge. First, four types of flow clusters related to metro stations are labeled according to peak hours and trip chain order, namely AM “first-mile” clusters (i.e., “residential district → metro station”), AM “last-mile” clusters (i.e., “metro station → employment district”), PM “first-mile” clusters (i.e., “employment district → metro station”), and PM “last-mile” clusters (i.e., “metro station → residential district”). Then, the number of similar trips from September 12 to 14, 2017 corresponding to these four types of flow clusters is aggregated to each metro station, as shown in Fig. 8.
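The labeling and aggregation step can be sketched as below. The cluster record layout (keys such as `peak`, `origin_type`, and `n_trips`) is hypothetical and introduced only for illustration; the four labels follow the definitions given above.

```python
from collections import defaultdict

# (peak, origin land use, destination land use) -> label, as defined in the text.
CLUSTER_LABELS = {
    ("AM", "residential district", "metro station"): "AM first-mile",
    ("AM", "metro station", "employment district"): "AM last-mile",
    ("PM", "employment district", "metro station"): "PM first-mile",
    ("PM", "metro station", "residential district"): "PM last-mile",
}


def aggregate_to_stations(clusters):
    """Sum similar trips per (label, metro station).

    clusters: iterable of dicts with the hypothetical keys peak, origin_type,
        destination_type, origin_id, destination_id, and n_trips.
    """
    totals = defaultdict(int)
    for c in clusters:
        key = (c["peak"], c["origin_type"], c["destination_type"])
        label = CLUSTER_LABELS.get(key)
        if label is None:
            continue  # not one of the four metro-related types
        # First-mile clusters arrive at a station; last-mile clusters depart from one.
        if label.endswith("first-mile"):
            station = c["destination_id"]
        else:
            station = c["origin_id"]
        totals[(label, station)] += c["n_trips"]
    return dict(totals)
```

Each of the four labels corresponds to one panel of Fig. 8, with the per-station totals determining how much each station contributes to that panel.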

Fig. 8

Aggregation results of the number of similar trips from September 12 to 14, 2017 corresponding to the four types of flow clusters at the metro stations. (a) “first-mile” clusters (residential district → metro station) arriving at metro stations during the AM peak; (b) “first-mile” clusters (employment district → metro station) arriving at metro stations during the PM peak; (c) “last-mile” clusters (metro station → employment district) departing from metro stations during the AM peak; and (d) “last-mile” clusters (metro station → residential district) departing from metro stations during the PM peak

Some interesting findings can be drawn from Fig. 8. For instance, for the AM “first-mile” clusters, those metro stations that converge a large number of similar trips from residences are principally located outside the city center (Fig. 8a). As for the AM “last-mile” clusters, those metro stations that diverge plenty of similar trips to workplaces are mostly concentrated in the core city (see Fig. 8c). This coincides with the work of Gan et al. (2020) that the residential-oriented metro stations are located in more remote areas than the employment-oriented metro stations concentrated in urban cores. They argued that a major reason is that these relatively remote areas often grew out of under-functioning urban villages, lacking companies and enterprises that can provide a substantial number of job opportunities.

During the PM peak, the spatial distribution of similar trips arriving at the metro stations (Fig. 8b) is similar to that of similar trips departing from the metro stations during the AM peak (Fig. 8c). Figure 8a and d also follow the same trend. Nevertheless, we find a notable difference between the AM “first-mile” clusters (Fig. 8a) and the PM “last-mile” clusters (Fig. 8d). Specifically, the residential-based metro stations are located in more remote peripheral areas during the PM peak compared to the AM peak. This may be due to the fact that many commuters will have discretionary activities (e.g., shopping, eating, and entertainment) in their return-home journeys during the PM peak, and inner areas with more commercial land uses appear to be better able to meet these flexible needs (Chen et al. 2022c).

Endpoint distribution characteristics of flow clusters

In this section, two classical tools in spatial analysis, namely standard deviational ellipse (SDE) and calculate distance band from neighbor count (CDBFNC), are adopted to portray the shape and density distribution of flow clusters (Zhu et al. 2016).

We focus on three types of work-related flow clusters during the AM peak (i.e., “residential district → metro station”, “metro station → employment district”, and “residential district → employment district”) and three types of return-home-related flow clusters during the PM peak (i.e., “employment district → metro station”, “metro station → residential district”, and “employment district → residential district”), all of which have a high share (see “Inference of potential travel purpose” section for details). It is worth noting that we need to extract the endpoints of these flow clusters as the input of the two tools, as both of them are limited to processing point data (Zhu et al. 2016). For the AM peak from September 12 to 14, 2017, the total number of flow clusters in terms of three work-related types is 945, corresponding to 1890 point clusters (945 × 2, i.e., a flow cluster contains one origin point cluster and one destination point cluster). According to the point cluster land use type, we further divide the 1890 observations into three categories (840 for metro station, 663 for residential district, and 387 for employment district). By analogy, there are 1470 observations in the PM peak from September 12 to 14, 2017 (654 for metro station, 480 for residential district, and 336 for employment district).

We set two standard deviations as the input parameter for SDE, so that the ellipse covers as many points in the point cluster as possible (95%) with less influence from outliers. The tool finally outputs the long and short semi-axes of each ellipse. To be more intuitive, we use the flattening value to depict the shape of the ellipse (point cluster). The flattening value is equal to the ratio of the difference between the long and short semi-axes to the long semi-axis. Its value spans from zero to one, and the closer the value is to one, the flatter the shape of the point cluster. Figure 9 shows the distribution of flattening values for the 1890 (1470) point clusters during the AM (PM) peak of the three weekdays in the form of kernel density and box plots. On the whole, the highest flattening values are found for the metro station point clusters during the AM peak (mean = 0.562), followed by the employment district point clusters (mean = 0.503), and the lowest flattening values for the residential district point clusters (mean = 0.415) (see Fig. 9a). This implies that the shape distributions of employment district and metro station point clusters are inclined to be flatter than that of residential district point clusters. This is perhaps due to the fact that during the peak-hour periods, parking spaces near the entrances of office buildings, especially metro stations, are often in short supply (Zhao and Ong 2021), resulting in many travelers having to park their FFBS bikes along the surrounding sidewalks. In contrast, the shape distribution of residential district point clusters tends to be more circular. The major reason for this may be the existence of many non-gated residential communities in Nanjing (Xinhua Daily 2022), which allows travelers scattered there to park FFBS bikes closer to their exact destination. The distribution of flattening values during the PM peak is basically the same as that during the AM peak, except that it is more uniform (see Fig. 9b). The potential rationale is that, as we discussed in “Application of the two-stage flow clustering method” section, the transaction time and location of return-home trips during the PM peak tend to be less concentrated compared to those of work-related trips during the AM peak (Chen et al. 2022c; Ji et al. 2017).
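The flattening value defined above is f = (a − b) / a, where a and b are the long and short semi-axes of the ellipse. A minimal sketch of the calculation is given below; the function name and the input checks are our own additions for illustration.

```python
def flattening(semi_major, semi_minor):
    """Return (a - b) / a for an ellipse with semi-axes a >= b >= 0."""
    if semi_major <= 0:
        raise ValueError("the long semi-axis must be positive")
    if not 0 <= semi_minor <= semi_major:
        raise ValueError("the short semi-axis must lie between 0 and the long semi-axis")
    return (semi_major - semi_minor) / semi_major
```

A circle (a = b) yields a flattening of 0, while an ellipse whose short semi-axis is half the long one yields 0.5; values approaching 1 indicate a nearly linear point cluster.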

Fig. 9

Distribution of flattening values at the origins and destinations of flow clusters during (a) the AM peak and (b) the PM peak from September 12 to 14, 2017

Since the number of spatial points falling into each point cluster is more than 30 (the recognition threshold for flow clusters is 30, see “Flow clusters identification” section for details), we set the input parameter (n) of the CDBFNC tool to 30. The tool finally returns the average distance from each spatial point of the point cluster to its 30th nearest neighbor. Figure 10 illustrates the distribution of average nearest neighbor distance for the 1890 (1470) point clusters during the AM (PM) peak of the three weekdays. As shown in Fig. 10a, the average nearest neighbor distance of the metro station point clusters is markedly shorter than that of the other two types of point clusters (mean = 71.8 m). This result further indicates that metro stations with relatively limited parking resources often need to carry the operational management pressure incurred by the rapid convergence and divergence of FFBS bikes during peak hours. For the residential district point clusters, we find that their average nearest neighbor distances are longer in general (mean = 260.7 m). In other words, the spatial distribution of points within the residential district point clusters is more dispersed than the other two types of point clusters. One possible reason is that, compared with the compact office buildings and metro stations, many large residential neighborhoods in Nanjing are located in less-developed peripheral areas (Cheng et al. 2022b; Gan et al. 2020), where there is relatively sufficient space for bike parking. The average nearest neighbor distance exhibits essentially the same overall distribution during the AM and PM peaks, except that its distribution is somewhat more uniform during the PM peak (see Fig. 10b). This finding is similar to the distribution of flattening values during the peak hours (Fig. 9), further demonstrating that FFBS trips in the morning are more intensively concentrated.
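The density measure can be approximated outside ArcGIS with a brute-force search. The sketch below is not the ArcGIS implementation: it assumes planar (projected) coordinates in meters and Euclidean distance, and the function name is hypothetical.

```python
import math


def average_kth_neighbor_distance(points, k):
    """Average, over all points, of the distance to the k-th nearest neighbor.

    points: list of (x, y) tuples in a projected coordinate system (assumed).
    k: neighbor rank; the point set must contain more than k points.
    """
    if k < 1 or len(points) <= k:
        raise ValueError("need more than k points and k >= 1")
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        # Distances from point i to every other point, in ascending order.
        distances = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        total += distances[k - 1]
    return total / len(points)
```

With k = 30, the function requires at least 31 points, which matches the requirement that each point cluster contains more than 30 points. The brute-force search is quadratic in the number of points, which is acceptable for individual point clusters but not for the whole dataset.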

Fig. 10

Distribution of average nearest neighbor distances at the origins and destinations of flow clusters during (a) the AM peak and (b) the PM peak from September 12 to 14, 2017

Conclusions and policy implications

Discovering FFBS similar trips is of great importance for understanding spatio-temporal interactions and human mobility patterns. However, extracting flow clusters consisting of similar trips from large-scale, chaotic journey data remains under-researched. To deal with this issue, this study presents a two-stage flow clustering method, which integrates the Leiden community detection algorithm and the SNN_flow clustering method to efficiently identify flow clusters with arbitrary shapes and inhomogeneous densities. Taking the Nanjing FFBS system as a case study, we demonstrate that the methodological framework helps to significantly improve the efficiency of flow cluster identification.

The results of flow cluster detection (see Table 1) show that although the number of raw trips is higher during the PM peak, the number of flow clusters and corresponding similar trips identified during this period are notably lower than those during the AM peak. From the perspective of spatio-temporal patterns, some interesting findings can also be drawn. First, the share of flow clusters used to meet the “first-/last-mile” demand between metro stations and adjacent residences/workplaces is quite high during both the AM (71.85%) and PM (65.79%) peaks. Second, the share of the “first-/last-mile” flow clusters between metro stations and adjacent residences (47.73% for AM, 40.14% for PM) is markedly higher than that of the “first-/last-mile” flow clusters between metro stations and adjacent workplaces (24.12% for AM, 25.65% for PM). Third, the residential-based metro stations in the “first-/last-mile” flow clusters are principally located outside the city center, while the employment-based metro stations in the “first-/last-mile” flow clusters are mostly concentrated in the core city, which is more pronounced during the PM peak. We also investigate the shape and density distribution of the flow clusters. The endpoint distribution results show that metro station point clusters typically have a flatter, linear-like shape distribution than residential point clusters. In addition, we find that spatial points in metro station point clusters are more concentrated, and their density distribution is generally higher than that of other sorts of point clusters.

The spatio-temporal patterns of flow clusters could assist transportation planners and decision-makers in establishing effective policies and regulations to facilitate the rational use of FFBS infrastructure resources. First, extracting flow clusters that concentrate a large number of similar trips could provide nuanced guidance for FFBS operators to allocate resources more efficiently. For instance, during the epidemic prevention and control period, knowing the spatio-temporal dynamics of similar trips could help enhance the efficiency of staff in cleaning and disinfecting bikes (Teixeira and Lopes 2020). Second, metro stations, as the primary departure/arrival places of FFBS similar trips, play a crucial role in addressing the “first-/last-mile” commuting demand of local residents. However, around metro stations, there is often an obstacle in addressing the operational management pressure incurred by the rapid convergence and divergence of FFBS bikes, and the tidal phenomenon of “no bikes to rent or no parking spaces to return” often occurs during the peak hours (Chen et al. 2022a). An effective solution is to predict the similar trips in certain areas in advance according to the past spatio-temporal distribution of flow clusters, thereby reserving a certain amount of bikes and parking spaces for users. Third, compared with the “first-/last-mile” between metro stations and adjacent workplaces, the solution of the “first-/last-mile” between metro stations and adjacent residences is more dependent on the FFBS system. Although FFBS has attracted many users due to its convenience of payment and parking, it is clearly vulnerable to extreme weather such as heavy rain and low temperatures (Shen et al. 2018). In contrast, the microcirculation bus, a recently emerging public transportation mode, can provide short-distance travelers with a safer and more comfortable travel service (Du et al. 2019a). Therefore, microcirculation bus service is expected to be the preferred mode of connection to meet the “first-/last-mile” demand between metro stations and adjacent residential neighborhoods under severe meteorological conditions. Fourth, jobs-housing imbalance leads to different FFBS-metro usage patterns during the AM and PM peaks. The differences are critical for designing FFBS fleet rebalancing strategies. For instance, during the AM peak, many metro stations outside the city center may be piled up with a great deal of returned shared bikes, and staff will need to clean them in a timely manner. During the PM peak, the provision of shared bikes near these metro stations becomes insufficient, and it is necessary to allocate more bikes there in advance to address the return-home demand.

The endpoint distribution of flow clusters also provides scholars and decision-makers with some valuable insights and policy implications. Specifically, we inferred from Fig. 9 that the narrow space near metro station entrances results in many users having to park their FFBS bikes along sidewalk racks. This implies that the catchment area with a certain radius size (i.e., acceptable walking distance, e.g., 300 m) generated in the center of a metro station may not be able to accurately capture the FFBS-metro integrated use (Cheng et al. 2022b; Xu et al. 2019). Therefore, it seems more reasonable to construct the catchment area in terms of network walking distance rather than in a straight line. In addition, geo-fenced parking spaces have been put into use in many cities around the world to tackle the disorderly parking of shared bikes (Zhang et al. 2019; Cheng et al. 2022b). It is found that differences in land use types (e.g., metro station, residential district, and employment district) and time of day (e.g., morning peak and evening peak) can bring varying distributions of shape and density in FFBS parking areas. To improve the efficiency of parking utilization, transportation planners may consider flexibility in the size and shape of geo-fenced areas to meet parking needs.

In addition, the journey data of travel modes such as taxis and buses are usually larger by orders of magnitude than those of FFBS. Many studies have pointed out that traditional flow clustering methods may struggle to extract flow clusters efficiently from such travel modes (Liu et al. 2022a; Song et al. 2019). The two-stage flow clustering method proposed in this study may be an effective solution. To be specific, the community detection algorithm is utilized to first divide the entire study area into multiple activity zones with strong intra-connections, thus decomposing a large flow clustering problem into multiple small sub-problems.

Admittedly, there are several limitations to this study. First, we conducted an empirical analysis based on cross-sectional data (three-weekday FFBS journey data), which makes it difficult to trace the evolutionary mechanism of FFBS flow clusters over time. This study will be extended by performing a longitudinal analysis if a longer period of journey data becomes available in the future. Second, this study did not focus on individual-level mobility patterns, which are also important for understanding home-work commuting. Therefore, exploring the similarities and diversities of FFBS flow clusters among different user groups is also a worthwhile research topic. Furthermore, only the flow clusters within each activity zone were extracted in this study. Although the proportion of FFBS trips between different activity zones is small (7.66%), the identification and analysis of their flow clusters is worthy of further study. Nevertheless, as a first attempt to extract FFBS flow clusters and investigate their spatio-temporal patterns, our findings could provide further insights into human movement patterns and home-work commuting behavior.