Introduction

Since the advent of the Smart Card Automated Fare Collection System, or Smart Card in short, a major issue for transit planners has been how to fully realize the potential of the massive transaction data (Pelletier et al. 2011). One possibility is that a Smart Card data analysis culminates in providing an exact estimation of the physical and schedule-based connections of each and every card holder’s trip. Archived information on the complete train choices of passengers has numerous applications in planning and operation of a public transit network (Seaborn et al. 2009; Trépanier et al. 2007). Typical examples include:

  1.

    Empirical evaluation of a transit assignment model (Lam and Lo 2004; Kato et al. 2010; Raveau et al. 2011). Most transit assignment models in the literature have verified their validity based on the train choice data from the surveyed passengers. It normally requires a costly procedure to secure a large enough sample to guarantee a level of accuracy.

  2.

    Clarification of revenue collected by different train operating companies (Rinks 1986; Tsamboulas and Antoniou 2006). A public transit network may evolve into an integration of sub-networks of different operating companies. The lines are then intertwined so that a card holder may carry out a trip with minimal card transactions with the system. This makes it difficult to allocate the fare revenue collected from passengers among the operating companies. There appears to be agreement that real ridership data (e.g. in person \(\times\) kilometers) for each sub-network is essential to any clarification method.

  3.

    Estimation of connection cost (Nour et al. 2010; Guo and Wilson 2009). In principle, a transit assignment contemplates an equilibrium attained by user-optimal behavior pursuing a minimum-cost connection for a trip. Extensive real path-choice data from Smart Cards are expected to provide a firm empirical ground for studies of user-optimal behavior.

Among the data fields of a Smart Card, the quadruple (Departure station \(O\), Entry time at gate, Arrival station \(D\), Exit time at gate) is available in a majority of train networks. It is evident that this is also a minimal set of data required for a precise estimation of passengers’ connections. Conversely, the quadruple appears to be the maximal data set that may reasonably be expected from Smart Card data. A modern-day metro is expected to accommodate a passenger’s trip with minimal transactions with the fare collection system. Privacy concerns also make it difficult to implement a system that monitors more traces of individual trips. The quadruple, therefore, seems to be the most reasonable data available from a Smart Card system.

To estimate a schedule-based connection of a passenger, we simply need to identify the sequence of trains chosen for his/her OD trip. Yet, it appears to be a nontrivial task to trace the sequence even if the quadruples are known. A natural way to use the quadruples might be to compute the inter-gate time of a passenger, the interval between entry and exit times at the origin \(O\) and destination \(D\), respectively, to assign him/her to a connection of the closest mean inter-gate time.

The inter-gate time of a passenger, however, has a large variance because its components include off-train movement times, namely gate-to-platform, transfer, and platform-to-gate times. In the metro network in Seoul, for instance, it is not rare for an \(O\)\(D\) pair to have alternative connections with similar inter-gate times. The method may thus err seriously. For instance, the Shillim–Garak Market Station pair has two alternative physical connections, one via the station SeoulNat’lUnivEducation and the other via Jamsil. The mean and the standard deviation of the inter-gate time on the former connection are, respectively, 2936.7 and 189.1 s, while on the latter they are 2923.5 and 239.2 s. If we assume, for simplicity, that the inter-gate times have a Gaussian distribution, 43.7 % of the passengers on the first physical connection have an inter-gate time closer to the mean inter-gate time of the second physical connection. However, as will be shown in this paper, the quadruples, as a whole, provide crucial information on the connection choice of passengers.

This paper is organized as follows. The section continues with a summary of the previous estimation methods relying on the Smart Card data. In “Principles of proposed method” section, we identify a set of transit behaviors and a special class of metro passengers that enable us to develop a consistency condition that a true connection necessarily satisfies. In “The algorithm” section, we develop an algorithm to detect a connection of a passenger based on the consistency condition and illustrate some possible cases. “Connection estimation results” section reports on an empirical evaluation of the method applied to 32 million weekly \(O\)\(D\) trips in the Seoul metropolitan area, from Sunday, November 20 to Saturday, November 26, 2011. Finally, some concluding remarks are provided in “Conclusion” section.

Literature review

The potential of a substantial and detailed collection of Smart Card data in the public transit system has attracted the attention of researchers since the turn of the millennium (see, for example, Lehtonen et al. 2002; Bagchi and White 2005). Consequently, various behavioral analyses of Smart Card data have emerged, mostly in countries that adopted the system early (Asakura et al. 2008; Guo and Wilson 2009; Jang 2010; Kusakabe et al. 2010; Morency et al. 2007; Park et al. 2008; Seaborn 2008; Seaborn et al. 2009; Trépanier et al. 2007; Utsunomiya et al. 2006). For a more comprehensive literature review on Smart Card data used in public transit systems, readers are referred to Pelletier et al. (2011).

Most of these studies, with a few exceptions, are based on statistical analyses of Smart Card transaction data. However, there has been much emphasis on the importance of deterministic and detailed information on travel behavior, such as the length of individual trips (Bagchi and White 2005).

In reality, exact and complete estimates of the connections of individual passengers would have led to a more beneficial analysis in many studies (Utsunomiya et al. 2006; Morency et al. 2007; Park et al. 2008; Seaborn 2008; Asakura et al. 2008; Guo and Wilson 2009; Jang 2010; Raveau et al. 2011). However, such estimates could not be achieved solely from a statistical behavioral study. In particular, recent studies (Trépanier et al. 2007; Seaborn et al. 2009; Kusakabe et al. 2010) for the purpose of marketing or transit planning have aimed at more specific estimates of passengers’ connections.

Trépanier et al. (2007) proposed a method of estimating, from Smart Card data and bus routes, the alighting stops of individual passengers in Gatineau, Quebec, Canada. For buses, the locations were recorded in Smart Card data at boarding, but not necessarily on alighting. Suppose a passenger travelled on two buses \(A\) and \(B\) consecutively in a day. They reasoned that the alighting stop from bus \(A\) was the closest to the boarding location of bus \(B\). When a passenger rides only a single bus \(A\) in a day, they identify the bus, say, \(A'\), that the passenger will ride in the near future and apply the same reasoning to \(A\) and \(A'\) for estimating the passenger’s alighting stop from \(A\).

In a multi-modal public transit network, it is nontrivial to decide if two consecutive legs of a single passenger journey are actually the connections of a single trip. In Seaborn et al. (2009), they considered the transfers between buses and between bus and train (but not between trains, because a record on connecting trains was not available from Smart Card data). They proposed to consider them as the connections of a single trip if the time interval between the alighting from the first vehicle and the boarding of the second vehicle was less than a specific threshold value. To determine the threshold value, they applied the statistical method proposed in Seaborn (2008) to the time intervals whose alighting and boarding locations were adequately close. The same method was also used in Jang (2010) to determine a threshold value for the transit network in Seoul, Korea.

Table 1 Literature summary

The work of Kusakabe et al. (2010) is closely related to ours in that it aimed at estimating passengers’ choices among express, rapid and local trains for the same \(O\)\(D\) trip. The proposed method uses Smart Card data, railway topology, and the train timetable on the premise that trains operate exactly to the timetable. The time-expanded network is then derived from the timetable. A key assumption is that passengers always choose the shortest connection in transit time and number of transfers. Given the entry and exit times at gates for a passenger from Smart Card data, their algorithm determines the most probable boarding and alighting trains. It then returns a shortest connection consistent with the two trains as the passenger’s connection. A tie is broken, if possible, by choosing the connection with the minimum number of transfers.

Zhou and Xu (2012) performed a case study to estimate the schedule-based connections of passengers based on their entry and exit times and the train log, namely the real data on arrival and departure times of trains. It is based on the assumption that the passengers minimize a surplus time, the time interval between the earliest arrival of a passenger at the platform and the departure of the first train since then. Table 1 summarizes these studies and ours.

Smart Card data

Since its introduction in 2000, the Smart Card has quickly become the predominant method of payment for the metro network in the Seoul metropolitan area. By 2005, 72 % of metro passengers used a Smart Card, with approximately 20 million transactions per day (Park et al. 2008). In 2011, Smart Card became the only payment method.

Our connection estimating method in “The algorithm” section assumes the quadruple (Departure station \(O\), Entry time at gate, Arrival station \(D\), Exit time at gate) for each \(O\)\(D\) trip of a metro passenger. For trains, this appears to be the case in most Smart Card systems as shown in Table 2, although in some systems, for instance in Chicago, the pair (Arrival station \(D\), Exit time at gate) was not retained in the data.

Table 2 Comparison of data fields for the Smart Card in cities.

Principles of proposed method

The schedule-based connection estimating problem can be posed formally as follows.

Definition 2.1

Section, physical and schedule-based connections. By a section, we mean the physical metro line between two adjacent stations inclusively. By a physical connection, in turn, we mean the concatenation of sections that a passenger passes during her/his \(O\)\(D\) trip. Then a schedule-based connection is defined to be a sequence of trains that a passenger can take on a physical connection.

For example, the arc between Isu and Dongjak stations of the metro network illustrated in Fig. 1 is a section. We can see that the physical connection from Bongcheon to Dongjak consists of five sections. A train from Bongcheon to Sadang and its connecting train at Sadang to Dongjak constitute a schedule-based connection for the Bongcheon-Dongjak pair. Note that once the schedule-based connection has been estimated, the physical connection is immediate. Also, by a train log, we mean the set of records indicating the arrival and departure times of each train at each station.

Problem 2.2

Schedule-Based Connection Estimation Problem.

Input: The topology of the physical metro network, the set of passengers over a predetermined time horizon along with their quadruples, \(q =\) (Departure station \(O\), Entry time at gate, Arrival station \(D\), Exit time at gate), and the real arrival and departure times of trains from a train log.

Output: The schedule-based connection of each passenger for his/her \(O\)\(D\) trip, namely, the complete sequence of boarding, transfer, and alighting trains, that was chosen for his/her \(O\)\(D\) trip in the metro network.

Generating tentative physical connections for O–D trips

The proposed method constructs a prior set of physical connections for each \(O\)\(D\) trip by excluding irrational connections. A passenger travelling from Station \(O\) to Station \(D\) chooses a connection of the least cost. If other conditions are equivalent, the cost is an increasing function of each of the number of transfers, the travel time and the level of congestion [e.g. Bureau of Public Roads (1964), De Cea and Fernandez (1993), Nielsen (2000)]. According to Shin et al. (2007), trips involving three or more transfers between lines account for 1.5 % of total trips in the Seoul metropolitan area. We first exclude such trips from consideration.

For each \(O\)\(D\) pair, we enumerate all possible physical connections and group them by the number of transfers, \(n =0,\,1\), and \(2\). From each group, we discard the connections that are longer than a shortest one by \(k\) sections or more. Thus we assume that people do not take a route that is longer than the shortest by this threshold. We set \(k =10\), which is equivalent to 31 min of in-transit time. This is motivated by the fact that the mean in-transit time of a passenger is 30 min in the Seoul metropolitan area. Indeed, a case analysis shows that the number of trips with an inter-gate time longer than the minimum by more than 30 min is insignificantly small.

Then we perform an inter-group comparison; every physical connection is removed if it has 10 more sections than an alternative physical connection with fewer transfers. The connections left are defined to be the tentative physical connections of the \(O\)\(D\) pair. In the case of the Seoul metropolitan area, the process generated 3.9 tentative connections on average for an \(O\)\(D\) pair.
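The two filtering steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the `Connection` record and the function name `tentative_connections` are hypothetical, and "10 more sections" is read here as "longer by \(k\) sections or more", in line with the within-group rule.

```python
from collections import defaultdict
from typing import NamedTuple


class Connection(NamedTuple):
    name: str       # hypothetical label of the physical connection
    transfers: int  # number of transfers on the connection
    sections: int   # number of sections traversed


def tentative_connections(candidates, k=10):
    """Filter the enumerated physical connections of one O-D pair."""
    # Exclude connections with three or more transfers, then group the rest.
    groups = defaultdict(list)
    for c in candidates:
        if c.transfers <= 2:
            groups[c.transfers].append(c)

    # Within each group, discard connections that are longer than the
    # shortest one in the group by k sections or more.
    kept = []
    for group in groups.values():
        shortest = min(c.sections for c in group)
        kept.extend(c for c in group if c.sections - shortest < k)

    # Inter-group comparison: remove a connection if an alternative with
    # fewer transfers is shorter by k sections or more (assumed reading).
    result = [
        c for c in kept
        if not any(o.transfers < c.transfers and c.sections - o.sections >= k
                   for o in kept)
    ]
    return sorted(result, key=lambda c: (c.transfers, c.sections))


# Bongcheon-Isu example: two transfers and 17 sections versus
# one transfer and 3 sections.
bongcheon_isu = [
    Connection("via Shindorim and Seoul Station", 2, 17),
    Connection("via Sadang", 1, 3),
]
unique = tentative_connections(bongcheon_isu)
```

With the section counts quoted for the Bongcheon-Isu trip, only the connection via Sadang survives, so the pair has a unique tentative physical connection.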

Reference passengers

In the above process, we found that a large share of passengers have a unique tentative physical connection. Consider, for instance, a Bongcheon-Isu trip on the metro network in Fig. 1. It requires at least one transfer since the stations are on different lines. Two physical connections are possible: one via Shindorim and Seoul Station, counter-clockwise, the other via Sadang, clockwise. The former requires two transfers and 17 sections while the latter requires one transfer and 3 sections. The process therefore returns the latter as the unique tentative physical connection for a Bongcheon-Isu trip. A similar argument is possible for the pairs Bongcheon-Dongjak, SeoulNat’lUniv-Isu, SeoulNat’lUniv-Dongjak and so on.

Fig. 1 \(O\)\(D\) pairs with a unique tentative physical connection

About 47 % of the daily passengers on the metro-network in the Seoul metropolitan area turned out to have a unique tentative physical connection.

Definition 2.3

Reference passengers By a reference passenger, we mean a passenger whose \(O\)\(D\) trip has a unique tentative physical connection.

Reference passengers, as their connections are guaranteed, play a crucial role in the proposed connection estimation method.

Alighting and boarding time intervals

The idea is best illustrated by an example. There were 203 trips of passengers from Shillim to Gangnam station, initiated between 7 and 9 A.M. on November 21, 2011. The first plot in Fig. 2 shows the entry times of the passengers at the origin, Shillim. They appear uniformly distributed, as expected. However, the exit times of the same passengers at the destination, Gangnam, show a spiky pattern, each spike concentrated in a brief period of time.

Fig. 2 Entry and exit times for the same set of Shillim–Gangnam passengers

It is typical behavior for alighting passengers to rush to a gate and exit as soon as possible. The platform-to-gate time of each passenger thus typically reflects the maximal speed of the passenger and hence has the characteristic of an extreme value (Einmahl and Smeets 2011). In fact, according to Ko et al., the platform-to-gate time of an alighting passenger is best fitted by the Fréchet distribution, which is mostly used for fitting extreme values. Figure 3 shows the relative frequency of the platform-to-gate times of the alighting passengers at Gangnam station from 5:30 to 11:00 A.M., November 21, 2011, fitted with Gamma, Inverse Gaussian and Fréchet distributions. The Fréchet distribution is the best fit.

Fig. 3 Platform-to-gate-time distribution at Gangnam station

Definition 2.4

Alighting groups and time intervals. By an alighting group, \(AG(X, N)\), we mean the set of passengers that alight from the train \(X\) at their common destination \(N\) (regardless of their origins). An alighting time interval is the time interval between the first and last exit times in an alighting group.

The extreme value characteristic of the platform-to-gate times renders an alighting time interval substantially smaller than the interarrival time of trains at a station, and hence the alighting time intervals are disjoint. In the Seoul metropolitan area, the smallest headway in peak hours was 3.5 min, while the mean and standard deviation of the platform-to-gate times were 1.9 and 1.0, respectively.

Definition 2.5

Boarding groups. By a boarding group, \(BG(X, N)\), we mean the set of passengers that board the same train \(X\) at their common origin \(N\) (regardless of their destinations).

Unlike the alighting case, the boarding behaviors of metro passengers do not present disjoint time intervals. However, the first-come-first-served queue discipline is well observed in boarding and the order of entry is maintained. It turns out in the metro network of the Seoul metropolitan area that at most two consecutive time intervals of boarding groups may overlap. To see this, consider Fig. 4, a 2-dimensional plot of the entry and exit times at a gate of Shillim–Gangnam passenger groups, called an entry-exit map and originating from Kusakabe et al. (2010), where the \(x\)-axis represents the entry time and the \(y\)-axis the exit time of a passenger. From the figure, the passenger group of each Shillim–Gangnam train is identified by the rectangle of boarding and alighting time intervals in the entry-exit map. Furthermore, the disjoint alighting time intervals make the rectangles also disjoint, a source of the precision of the proposed method.

Fig. 4 Entry-exit map of the Shillim–Gangnam passengers

Suppose the alighting and boarding time intervals of \(AG(X, N)\) and \(BG(X, N)\), respectively, are known to us for each train \(X\) and station \(N\). Then, we can determine if \(X\) can be the alighting or boarding train of a passenger at station \(N\) by checking if the exit or entry time at \(N\) falls in the time intervals of \(AG(X, N)\) or \(BG(X, N)\). To develop the consistency check into a connection estimation method, we need first to estimate the time intervals of trains at each station as in Fig. 4.

It is not, however, a trivial task to derive the time intervals solely by plotting the quadruples of passengers. The passengers from in- and out-bound trains, for instance, may happen to exit at the same gate in close proximity in time. Or, in a transfer station, the alighting passengers from different lines may merge at a gate. The second key idea is to use the reference passengers to derive the time intervals.

Estimation of alighting and boarding time intervals

Choose only the reference passengers whose physical connections involve no transfer. Suppose the quadruple of such a passenger from the Smart Card is \(q=(O,\ \text{Entry time}=t_1,\ D,\ \text{Exit time}=t_2)\). We then consider the set \(P\) of trains that departed from \(O\) after \(t_1\) and the set \(Q\) of trains that arrived at \(D\) prior to \(t_2\). If the two sets have only one common train, say \(X\), it should be the alighting train of the reference passenger at \(D\). In other words, the passenger belongs to the alighting group \(AG(X,D)\).

Also notice that if either the gate-to-platform or platform-to-gate time is less than the inter-arrival time of the trains, as in most of the real cases, \(P \cap Q\) should be a singleton (whose element is, of course, the train choice of \(q\)). Thus, we can identify the alighting train of the reference passengers (if their trips are not delayed abnormally from gate to platform or from platform to gate).
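The identification of the alighting train via \(P \cap Q\) can be sketched in Python as follows. The layout of the train log (a dictionary mapping a train to its arrival and departure times per station, in seconds), the function name `alighting_train`, and the sample times are assumptions made for illustration. The sketch also assumes that the log passed in contains only trains running on the passenger's no-transfer physical connection in the direction of travel.

```python
def alighting_train(train_log, origin, dest, t1, t2):
    """Return the unique train in P ∩ Q, or None if it is not a singleton.

    train_log maps train id -> {station: (arrival, departure)}; times are
    seconds since the start of service (hypothetical layout).
    """
    # P: trains that departed from the origin after the entry time t1.
    P = {x for x, stops in train_log.items()
         if origin in stops and stops[origin][1] > t1}
    # Q: trains that arrived at the destination prior to the exit time t2.
    Q = {x for x, stops in train_log.items()
         if dest in stops and stops[dest][0] < t2}
    common = P & Q
    return next(iter(common)) if len(common) == 1 else None


# Three consecutive trains on the same line (made-up times).
log = {
    "X1": {"O": (90, 100), "D": (990, 1000)},
    "X2": {"O": (390, 400), "D": (1290, 1300)},
    "X3": {"O": (690, 700), "D": (1590, 1600)},
}
train = alighting_train(log, "O", "D", t1=300, t2=1400)
delayed = alighting_train(log, "O", "D", t1=50, t2=1400)
```

In the first call only one train both leaves the origin after the entry time and reaches the destination before the exit time, so the passenger is assigned to its alighting group. In the second call two trains qualify, which corresponds to an abnormally delayed trip, and no assignment is made.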

Thus, by repeating the procedure for each of the chosen reference passengers, we can capture a large subset \(\widetilde{AG}(X, N)\) of \(AG(X, N)\) for each train \(X\) and station \(N\). Hence the time interval of \(\widetilde{AG}(X, N)\) offers a good estimate of the alighting time interval of \(AG(X, N)\).

Once \(\widetilde{AG}(X, N)\) has been constructed for each \(X\) and \(N\), we derive an estimate \(\widetilde{BG}(X,N)\) of the boarding group \(BG(X, N)\) in the following manner. Check every reference passenger who departed from \(N\) via \(X\). Put the passenger into \(\widetilde{BG}(X,N)\) if his/her exit time at the destination \(D\) falls in the alighting time interval of \(\widetilde{AG}(X, D)\).

Similarly, the boarding time interval of \({BG}(X, N)\) is then estimated using the time interval of \(\widetilde{BG}(X, N)\). As discussed earlier, unlike the alighting case, the time intervals may overlap.

Another simple but important observation is that a reference passenger who made a transfer in his/her trip is a verifier that a transfer has actually been made between the two connecting trains he/she rode. For instance, in Fig. 1 a reference passenger from Bongcheon to Isu station certifies that there has been a transfer between the connecting trains he/she used at the transfer station Sadang. This is very useful when estimating the connection of a passenger whose trip involves a transfer. From a possible list of connecting trains at a transfer station, we can remove the ones that have no verifier of an actual transfer.

Definition 2.6

Transfer reference passengers. By the transfer reference passengers \(RP(X,Y,A)\), we mean the set of reference passengers who transferred from Train \(X\) to \(Y\) at Station \(A\).

We now discuss how to find the transfer reference passengers \(RP(X, Y, A)\). Suppose \(X\) and \(Y\) are, respectively, from Lines 1 and 2. We look up the list of the reference passengers who transferred from Line 1 to Line 2 at \(A\). Suppose the quadruple of an \(O\)\(D\) passenger is consistent with \(X\) and \(Y\). Namely, his/her entry time at \(O\) on Line 1 falls into the time interval of \(\widetilde{BG}(X,O)\) and exit time at \(D\) on Line 2 falls into the time interval of \(\widetilde{AG}(Y,D)\). He/she is a proof that transfer has been made from \(X\) to \(Y\) at \(A\) and, thus, is added to \(\widetilde{RP}(X,Y, A)\).
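The construction of \(\widetilde{RP}(X,Y,A)\) described above can be sketched in Python as follows. The function name, the dictionary layout of the estimated time intervals, and the sample numbers are hypothetical. When overlapping boarding intervals make a passenger consistent with more than one train pair, this sketch records the passenger under every consistent pair, which is an assumption rather than a rule stated in the text.

```python
def transfer_reference_passengers(passengers, station, trains_in, trains_out,
                                  boarding, alighting):
    """Group one-transfer reference passengers by verified train pair.

    passengers: quadruples (O, t1, D, t2) of reference passengers who
                transferred at `station`.
    boarding / alighting: dicts mapping (train, station) -> (start, end),
                the estimated boarding and alighting time intervals.
    """
    rp = {}
    for origin, t1, dest, t2 in passengers:
        for x in trains_in:
            b = boarding.get((x, origin))
            if b is None or not b[0] <= t1 <= b[1]:
                continue  # entry time is not consistent with boarding x
            for y in trains_out:
                a = alighting.get((y, dest))
                if a is not None and a[0] <= t2 <= a[1]:
                    # The passenger verifies a transfer from x to y.
                    rp.setdefault((x, y, station), []).append(
                        (origin, t1, dest, t2))
    return rp


# A single Bongcheon-Isu reference passenger (made-up times in seconds).
boarding = {("X", "Bongcheon"): (50, 150)}
alighting = {("Y", "Isu"): (880, 950), ("Y2", "Isu"): (1100, 1170)}
rp = transfer_reference_passengers(
    [("Bongcheon", 100, "Isu", 900)], "Sadang", ["X"], ["Y", "Y2"],
    boarding, alighting)
```

Here the passenger's entry time falls in the boarding interval of `X` and the exit time in the alighting interval of `Y` only, so the transfer from `X` to `Y` at Sadang is the one that gets verified.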

Estimation of time intervals from insufficient passengers

Obviously, the accuracy of the time intervals depends on the size of \(\widetilde{AG}\) or \(\widetilde{BG}\). At stations in suburban areas in non-peak hours, the reference passengers may not be sufficient to provide reliable time intervals. The issue can be resolved by aggregating the alighting passengers of all trains at each station. Under the assumption that the alighting behavior of passengers is independent of the time of day, this provides a sufficient collection of platform-to-gate times for a reliable alighting time interval at each station.

The Garak Market station, a transfer station located at the intersection of Lines 3 and 8, is scant in passenger traffic. The number of reference passengers per train at the station varies from 1 to 30. We aggregate the reference passengers of the 142 inbound trains at the station and fit their platform-to-gate times to a Fréchet distribution. We then discard the lowest 2.5 % and the highest 2.5 % as outliers with an excessive length of boarding or alighting time. In our case this accounts for on average 1.59 passengers per train.

The range \([\tau , \tau +L]\) of platform-to-gate times of the remaining passengers is then defined as the standard alighting time interval at each station. The alighting time interval of each train can be obtained simply by translating \([\tau , \tau +L]\) to begin at the arrival time of the train.

Fig. 5 Standard alighting time interval at the Garak Market station on Line 8 inbound

Figure 5 shows the resulting standard alighting time interval at Garak Market station. The 142 inbound trains initially yielded 672 reference passengers, of which 19 passengers (0.23 trips per train) were excluded. The range is \([\tau , \tau +L] = [28\,\text{s}, 90\,\text{s}]\) with the length \(L = 62\) s. The figure also shows the translation of the standard alighting time interval to the arrival time, 08:19:04, of Train \(X\). The resulting standard alighting time interval \([\text{08:19:32}, \text{08:20:34}]\) of \(X\) is significantly larger than the time interval \([\text{08:19:35}, \text{08:20:01}]\) estimated from the reference passengers of \(X\) alone.

The standard boarding time intervals can be constructed analogously.
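The trimming and translation steps for the standard alighting time interval can be sketched in Python as follows. The function names are hypothetical, and the sketch trims the pooled empirical sample directly rather than fitting a Fréchet distribution first, which is a simplification of the procedure described above.

```python
from datetime import datetime, timedelta


def standard_interval(platform_to_gate_times, trim=0.025):
    """Return (tau, L) in seconds from pooled platform-to-gate times.

    The lowest and highest `trim` fractions of the sample are dropped as
    outliers (empirical trimming, assumed here in place of a Frechet fit).
    """
    xs = sorted(platform_to_gate_times)
    cut = int(len(xs) * trim)
    kept = xs[cut:len(xs) - cut]
    return kept[0], kept[-1] - kept[0]


def translate(arrival, tau, length):
    """Shift the standard interval [tau, tau + L] to a train's arrival time."""
    start = arrival + timedelta(seconds=tau)
    return start, start + timedelta(seconds=length)


# Garak Market example: tau = 28 s, L = 62 s, Train X arrives at 08:19:04.
start, end = translate(datetime(2011, 11, 21, 8, 19, 4), tau=28, length=62)
print(start.time(), end.time())  # 08:19:32 08:20:34
```

The printed interval reproduces the standard alighting time interval of Train \(X\) quoted above for the Garak Market station.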

The algorithm

Given the quadruple \(q = (O,\ \text{Entry time}=t_1,\ D,\ \text{Exit time}=t_2)\) of a passenger, we carry out the following steps for every tentative physical connection for the \(O\)\(D\) trip.

Suppose the physical connection, say \(P\), requires no transfer. Then, we look up a train \(X\) on \(P\) whose boarding and alighting time intervals contain \(t_1\) and \(t_2\), respectively. If none, we reject \(P\). Otherwise, we put \(P\) in the list of consistent physical connections of \(q\) along with the train \(X\), a single-train schedule-based connection on \(P\).

Suppose \(P\) entails two transfers at stations, say, \(M\) and \(N\). (We discuss this case only since, then, the single-transfer case becomes obvious). We first construct the list of tentative schedule-based connections for \(q\) on \(P\), the list of sequences of trains \(S = X_1-X_2-X_3\) on \(P\) such that

  1.

    The boarding interval of \(X_1\) at \(O\) contains \(t_1\), the alighting interval of \(X_3\) at \(D\) contains \(t_2\), and

  2.

    The arrival times of \(X_1\) and \(X_2\) are no later than the departure times of the following trains, \(X_2\) and \(X_3\), respectively, at the transfer stations \(M\) and \(N\).

Note that this is a necessary condition for \(X_1-X_2-X_3\) to be a schedule-based connection of the trip \(q\) on \(P\). Then we look up the transfer reference passengers \(RP(X_1, X_2, M)\) and \(RP(X_2, X_3, N)\). If both sets are nonempty, we return \(S\) as a consistent schedule-based connection on \(P\). Otherwise, we reject \(S\).

The algorithm returns \(P\) as the physical connection of \(q\) only if \(P\) is the only physical connection that admits a consistent schedule-based connection. Otherwise, namely, if there is no such physical connection or more than one, the algorithm declares a failure for the input quadruple \(q\).
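The consistency check described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the dictionary layouts, the function names, and all times (in seconds) are made up. The toy data loosely mirror a trip with two alternative one-transfer physical connections.

```python
from itertools import product


def contains(interval, t):
    """True if time t lies in the closed interval (start, end)."""
    return interval is not None and interval[0] <= t <= interval[1]


def consistent_schedules(conn, q, boarding, alighting, log, verified):
    """List the train sequences on one physical connection consistent with q."""
    origin, t1, dest, t2 = q
    found = []
    for seq in product(*conn["legs"]):  # one candidate train per leg
        if not contains(boarding.get((seq[0], origin)), t1):
            continue
        if not contains(alighting.get((seq[-1], dest)), t2):
            continue
        feasible = True
        for x, y, s in zip(seq, seq[1:], conn["transfers"]):
            arrival_x, departure_y = log[(x, s)][0], log[(y, s)][1]
            # x must arrive no later than y departs, and the transfer must
            # be verified by at least one transfer reference passenger.
            if arrival_x > departure_y or (x, y, s) not in verified:
                feasible = False
                break
        if feasible:
            found.append(seq)
    return found


def estimate(tentative, q, boarding, alighting, log, verified):
    """Return (connection name, schedules) if exactly one physical
    connection admits a consistent schedule; otherwise None (failure)."""
    hits = {}
    for name, conn in tentative.items():
        schedules = consistent_schedules(conn, q, boarding, alighting,
                                         log, verified)
        if schedules:
            hits[name] = schedules
    return next(iter(hits.items())) if len(hits) == 1 else None


# Hypothetical data for two tentative physical connections.
tentative = {
    "via Edu": {"legs": [["X1", "X2"], ["Y1", "Y2"]], "transfers": ["Edu"]},
    "via Jamsil": {"legs": [["X1", "X2"], ["Z1", "Z2"]],
                   "transfers": ["Jamsil"]},
}
boarding = {("X1", "Shillim"): (0, 150), ("X2", "Shillim"): (140, 300)}
alighting = {("Y1", "Garak"): (2650, 2720), ("Y2", "Garak"): (2900, 2970),
             ("Z1", "Garak"): (2500, 2570), ("Z2", "Garak"): (2800, 2870)}
log = {("X1", "Edu"): (1200, 1230), ("X2", "Edu"): (1400, 1430),
       ("Y1", "Edu"): (1300, 1330), ("Y2", "Edu"): (1550, 1580),
       ("X1", "Jamsil"): (1800, 1830), ("X2", "Jamsil"): (2000, 2030),
       ("Z1", "Jamsil"): (1900, 1930), ("Z2", "Jamsil"): (2200, 2230)}
verified = {("X1", "Y1", "Edu"), ("X1", "Y2", "Edu"), ("X2", "Y2", "Edu"),
            ("X1", "Z1", "Jamsil"), ("X1", "Z2", "Jamsil"),
            ("X2", "Z2", "Jamsil")}

result_a = estimate(tentative, ("Shillim", 100, "Garak", 2700),
                    boarding, alighting, log, verified)
result_b = estimate(tentative, ("Shillim", 200, "Garak", 2850),
                    boarding, alighting, log, verified)
```

Because a no-transfer connection has a single leg and an empty transfer list, the same function also covers the single-train case. A `None` result corresponds to the declared failure, whether no physical connection or more than one is consistent.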

Initially, we apply the algorithm based on the standard time intervals in “Estimation of time intervals from insufficient passengers” section. The passengers successfully returned with a unique physical connection are added to the reference passenger set. Once we have acquired sufficient reference passengers, we replace the standard time intervals with the time intervals derived from the reference passengers of individual trains and repeat. The algorithm can be summarized as in Fig. 6.

Fig. 6 The flow of the algorithm

Note that there may be multiple consistent schedule-based connections even when a unique physical connection is returned.

In our case, around 9 % of the trips were returned with more than one schedule-based connection due to, e.g., overlapping boarding time intervals and/or multiple connecting trains. However, we can estimate the probability that each of the schedule-based connections is the choice of the passenger. The details are given in the Appendix.

Illustration of actual estimation

The performance of the method is probably best understood by some actual cases of estimation.

Unique physical and schedule-based connections

Figure 7 shows the trips of two passengers, say \(a\) and \(b\), who departed from Shillim station at 07:33:47 and 07:34:55, and arrived at Garak Market station at 08:16:53 and 08:19:51, respectively, on November 21, 2011: a = (Shillim, 07:33:47, Garak Market, 08:16:53) and b = (Shillim, 07:34:55, Garak Market, 08:19:51).

Fig. 7 Schedule-based connection estimation of 2 Shillim–Garak Market trips

There are two alternative physical connections: beginning at the origin, Shillim station, both follow the Line 2 outer circle. However, one transfers at SeoulNat’lUnivEducation station to Line 3, the other at Jamsil station to Line 8. The algorithm checks, for each passenger, which physical connection has a logical connection, that is, a sequence of trains all consistent with his/her quadruple.

Consider \(a\). On the physical connection Shillim-SeoulNat’lUnivEducation-GarakMarket, there is a unique train \(X_1\) whose boarding time interval contains the entry time of \(a\). Of the two trains, \(Y_1\) and \(Y_2\), that have been verified by transfer reference passengers to connect \(X_1\) to Line 3 at SeoulNat’lUnivEducation, \(Y_1\) has an alighting time interval containing \(a\)’s exit time at Garak Market station. Thus, Shillim-SeoulNat’lUnivEducation-GarakMarket is added to the list of consistent physical connections of \(a\) along with the consistent schedule-based connection \(X_1-Y_1\).

On the alternative physical connection, Shillim-Jamsil-GarakMarket, \(a\) should be assigned to the same tentative boarding train \(X_1\). However, neither of the two trains \(Z_1\) and \(Z_2\) that connect \(X_1\) at Jamsil to Line 8 has an alighting time interval consistent with \(a\)’s exit time at Garak Market station. Thus, Shillim-Jamsil-GarakMarket is rejected. Therefore, the algorithm returns Shillim-SeoulNat’lUnivEducation-GarakMarket as the unique physical connection of \(a\) along with the unique schedule-based connection \(X_1-Y_1\).

Consider \(b\). On the physical connection Shillim-SeoulNat’lUnivEducation-GarakMarket, \(X_2\) is the only train whose boarding time interval is consistent with his/her entry time. However, the only verified connecting train \(Y_2\) of \(X_2\) to Line 3 has an alighting time interval inconsistent with \(b\)’s exit time at the destination Garak Market. Thus, the physical connection is rejected for \(b\).

On the physical connection Shillim-Jamsil-GarakMarket, on the other hand, of the two connecting trains \(Z_2\) and \(Z_3\) at Jamsil station, \(Z_2\) has an alighting time interval consistent with \(b\)’s exit time, as indicated in the figure. Thus, Shillim-Jamsil-GarakMarket is returned as the physical connection for \(b\), and \(X_2-Z_2\) is confirmed as the schedule-based connection.

Analysis of failed cases

The algorithm fails when there is no physical connection, or more than one, consistent with the quadruple of an input trip. Figure 8 illustrates the latter case. Consider a trip a = (Janghanpyeong, 08:24:23, Sangsu, 09:01:28). There are two alternative physical connections, I and II, that comprise the same line combination, Lines 5 and 6, but different transfer stations, Cheonggu and Gongdeok, respectively. The entry and exit times match a unique schedule-based connection \(X_1-Y\).

Fig. 8 A case of failure: Indeterminate physical connection

However, the transfer from \(X_1\) to \(Y\) is verified by a transfer reference passenger at both transfer stations, Cheonggu and Gongdeok. Both physical connections I and II are thus consistent with the quadruple \(a\), and the method fails. A failure due to multiple consistent physical connections occurred more often when two or more physical connections differed only in the transfer station.

Connection estimation results

The metro network

The Seoul metropolitan area had 15 metro lines, 412 stations and 33,548 trains operating as of November 20, 2011. Over the week of November 20 to 26, 2011, there were 47,618,710 metro \(O\)\(D\) trips. Of the possible \(O\)\(D\) pairs, 904,897 pairs had nonzero traffic, and each carried about 50 trips on average. In our study, we first excluded the trips involving 3 private lines, Metro 9, AREX (airport line), and DXLine, and one public line, the Incheon City Line, because their train logs were entirely unavailable.

When the time interval between the entry and the exit at the gates deviated from the mean by twice the standard deviation or more, the trip was most likely voluntarily delayed. The number of trips with such excessive inter-gate times was 1,571,417, which is 3.3 % of the total trips. In addition, we found that the actual record can be delayed after card tagging at a gate because of a disruption in the communication network. Those abnormal trips, delayed voluntarily or in tagging, were excluded from our data set. Finally, for the sake of accuracy, we ruled out the senior and handicapped citizens, whose inter-gate times are \(6.7\,\%\) longer than those of others. Overall, the estimation algorithm was applied to 32,419,106 \(O\)\(D\) trips, as summarized in Table 3.
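The two-standard-deviation screening can be sketched as follows. This is an illustration under stated assumptions: we assume the mean and standard deviation are taken over trips of the same \(O\)\(D\) pair and that only excessively long inter-gate times are flagged. The sample times are hypothetical. Only the two trip counts in the last lines come from the text.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def split_excessive(times: List[float], k: float = 2.0) -> Tuple[List[float], List[float]]:
    """Split inter-gate times into (kept, excluded) at mean + k * std."""
    cutoff = mean(times) + k * pstdev(times)
    kept = [t for t in times if t <= cutoff]
    excluded = [t for t in times if t > cutoff]
    return kept, excluded


# Hypothetical inter-gate times (seconds) for one O-D pair.
sample = [1800, 1820, 1790, 1810, 1805, 1795, 1815, 1800, 1810, 3600]
kept, excluded = split_excessive(sample)
print(len(kept), excluded)

# Share of excluded trips reported in the text.
share = 100 * 1_571_417 / 47_618_710
print(round(share, 1))
```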

Table 3 Summary of trips selected for our experiment

The success rates

Table 4 summarizes the rate at which the method returns a unique physical connection, broken down by the numbers of transfers required by the tentative physical connections of an \(O\)\(D\) trip. As indicated in the first column of Table 4, \(51.3\,\%\) of the trips have only tentative physical connections with no transfer, \(26.7\,\%\) have tentative physical connections requiring two transfers, and so on.

Table 4 Success rate for each combination of the numbers of transfers in physical connections

According to the table, the success rate is lower when there is an alternative physical connection requiring two transfers. Overall, the success rates were 92.6 % and 83.4 % for the physical and schedule-based connections, respectively.

Consistency of train choice of metro passengers

We first probe the central assumption of transit behavior studies: do the metro passengers make a rational train choice? To do so, we rely on the analysis of Cronbach (1951) to check the consistency of the metro choice of passengers of an \(O\)\(D\).

We performed connection estimation for an additional day, Monday, March 19, 2012, to be compared with Monday, November 21, 2011. We selected the 1513 \(O\)\(D\)’s whose daily traffic is no less than 100 trips on both days and which have more than one alternative physical connection. The horizontal axis in Fig. 9 indicates the 3897 physical connections, while the vertical axis indicates the proportion of the \(O\)\(D\) passengers who chose each of them on November 21, 2011. The same plot is drawn for March 19, 2012, maintaining the order of the physical connections but exchanging the axes about the diagonal.

Fig. 9 Consistency of physical connection estimation

If the train choice of passengers is consistent, the plot should exhibit a concentration of dots around the \(45^{\circ }\) diagonal, which is the case in the figure. In fact, Pearson’s correlation coefficient was very high, namely, 0.94. A paired-comparison t-test did not reject the null hypothesis that the train choices of passengers for their \(O\)\(D\) trips are not different on the 2 days. Cronbach’s \(\alpha\) was 0.974. By all three measures, passengers made essentially the same choices on the two Mondays. We extended the test over the 5 weekdays of the week of November 20 to 26, 2011, and obtained a similar result.
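For readers who wish to reproduce this kind of check, the two statistics can be computed from the per-connection choice shares of the two days as sketched below. The share vectors are hypothetical, and treating each day as one "item" in Cronbach's \(\alpha\) is our reading of the procedure rather than a detail stated in the text.

```python
from statistics import mean, variance
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equal-length samples."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(items: List[Sequence[float]]) -> float:
    """Cronbach's alpha; each item holds one score per subject."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


# Hypothetical shares of O-D passengers choosing each physical connection.
day1 = [0.70, 0.30, 0.55, 0.45, 0.90, 0.10]
day2 = [0.68, 0.32, 0.60, 0.40, 0.88, 0.12]
print(pearson(day1, day2), cronbach_alpha([day1, day2]))
```

Both statistics approach 1 as the two days' shares coincide, which is the pattern the diagonal concentration in Fig. 9 reflects.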

Passenger flow on the time-expanded network

As the algorithm returns the schedule-based connection for each and every trip, we can derive the complete passenger flow on the time-expanded network. Figure 10 shows, as an example, the passenger flow on the logical network time-expanded around Daerim station, a transfer station of Lines 2 and 7, from 07:45 to 07:55 A.M. on November 21, 2011. In the time interval, there were 4 trains arriving from Line 2 inner-circle, denoted by \(X_1, X_2, X_3\), and \(X_4\), and 3 trains from Line 7 inbound, \(Y_1, Y_2\), and \(Y_3\). The train logs and the passenger traffic are summarized in Table 5.

Fig. 10 Passenger flow on the time-expanded network at the Daerim intersection of Lines 2 and 7 from 07:45 to 07:55 A.M. on November 21, 2011

Table 5 The trains and their associated passenger flows at the Daerim intersection from 07:45 to 07:55 A.M. on November 21, 2011

The figure indicates the passenger flows associated with each train. Of the 421 passengers of Train \(X_4\) arriving at Daerim station on Line 2 from Shindorim station, 3 exited and 2 transferred to Line 7 outbound. The remaining 416 passengers were joined by 75 entering passengers. In addition, 37 transfer passengers from Line 7 outbound and 3, 38, and 6 transfer passengers from Trains \(Y_1\), \(Y_2\) and \(Y_3\) on Line 7 inbound, in that order, boarded. The resulting 575 passengers departed for the GuroDigitalComplex station along the Line 2 inner circle.
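The bookkeeping for Train \(X_4\) is a flow-conservation check, which the following short sketch reproduces using only the counts quoted above. The variable names are our own.

```python
# Passengers on Train X4 arriving at Daerim from Shindorim.
arriving = 421
exited = 3
to_line7_outbound = 2
remaining = arriving - exited - to_line7_outbound

# Passengers joining X4 at Daerim.
entering = 75
from_line7_outbound = 37
from_line7_inbound = {"Y1": 3, "Y2": 38, "Y3": 6}

departing = (remaining + entering + from_line7_outbound
             + sum(from_line7_inbound.values()))
print(remaining, departing)  # 416 575
```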

We can also derive the transfer times between connecting trains. For instance, the transfer time of the 3 passengers from \(Y_1\) to \(X_4\) was 446 seconds, the difference between the departure time of \(X_4\) and the arrival time of \(Y_1\). Crowdedness in public transport is an important factor in the level of service (Weidmann et al. 2012; Cox et al. 2006). The passenger flows on the time-expanded network provide us with the exact load on each train, which is, we believe, the most important data for a study on how crowdedness affects the train choice of passengers.
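The transfer-time calculation mentioned above is a plain difference of two logged times. In the sketch below, the two clock times are hypothetical placeholders (the actual values appear in Table 5, which is not reproduced here), chosen so that the difference equals the 446 seconds reported for the transfer from \(Y_1\) to \(X_4\).

```python
from datetime import datetime


def transfer_seconds(arrival: str, departure: str, fmt: str = "%H:%M:%S") -> int:
    """Seconds between a feeder train's arrival and the connecting departure."""
    delta = datetime.strptime(departure, fmt) - datetime.strptime(arrival, fmt)
    return int(delta.total_seconds())


y1_arrival = "07:46:10"    # hypothetical arrival time of Y1 at Daerim
x4_departure = "07:53:36"  # hypothetical departure time of X4 from Daerim
print(transfer_seconds(y1_arrival, x4_departure))  # 446
```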

Conclusion

First, we studied a set of behaviors of metro passengers by examining the gate times from the Smart Card data, which produce time intervals precise enough to identify the trains that a passenger boarded, transferred between, and alighted from, based on the passenger’s entry and exit times and stations.

  1. 1.

    The platform-to-gate time of an alighting passenger has the spiky characteristic of an extremal value; the exit times at a gate of the passengers from the same train are distributed over a very brief period of time. The time intervals of different trains are therefore disjoint.

  2. 2.

    The boarding behavior of metro passengers, in contrast, is devoid of such disjoint time intervals. Nevertheless, the first-come-first-served queue discipline is observed well enough to allow us to derive useful time intervals of boarding groups.

Second, we recognized and separated the class of passengers who have a unique predominant connection for a trip. Such passengers, more prevalent than expected, not only provide us with reliable estimates of the time intervals but also bear witness to an actual transfer between trains of lines intersecting at a transfer station.

Third, we proposed a connection estimation algorithm that checks the consistency of the time intervals of the trains in a tentative connection with the gate times and stations of a passenger, a condition that necessarily holds when the connection is the actual choice for a trip.

The proposed algorithm was applied to 32 million trips from Smart Card data collected in the Seoul metropolitan area during the week from Sunday, November 20 to Saturday, November 26, 2011. As a result, our method determined a unique physical connection in \(92\) % of the trips. The result shows a consistent physical connection choice over the 5 weekdays.