Introduction

Smart card systems have been widely implemented in Japanese urban railways as automated fare payment systems, and both prepaid and postpaid systems are in use. The Ministry of Land, Infrastructure, Transport and Tourism (MLIT) (2008) reported that more than 30 railway operators accept smart cards in various regions in Japan. Suica, PASMO, ICOCA, and PiTaPa are the major smart cards; more than 30 million cards in total were issued by the end of 2007. This number covers approximately 2.5–42% of the population in the operation area of each railway operator. Both public and private railway operators are in charge of managing the smart card data. These railway operators independently manage the transaction data for their own area, but they sometimes participate in cooperative systems that cover the management areas of the member companies.

Although the main purpose of smart cards is to collect fares for public transport, the transaction data recorded by these cards can be used to analyze both passenger demand and system performance. Transaction data obtained from each smart card user are highly accurate. The system records the exact time at which the card holder passes a ticket gate, the stations where he/she boarded and alighted from the train, and his/her smart card ID number. Most of the urban railways in Japan record these data at both boarding and alighting stations, although the systems in many cities in other countries record the data only at the boarding gate. This is partly because the train fares in Japan are proportional to the distance between the boarding and alighting stations, so the information from both stations is essential for automated fare collection. The transaction data are recorded in a database, and the data collected over a long period can be used to construct accurate and large-scale datasets for analyzing trends in passenger movement. This information can be very useful for detecting changes in railway service demand. In addition to demand analysis, the transaction data can also be used effectively for evaluating train operation performance.

Smart card systems for public transport have been adopted in various countries. Several previous works have discussed the use of smart card data for demand analysis, management, and planning of public transportation. Lehtonen et al. (2002) discussed the possibility of using smart card payment system data for public transport planning and statistics in Finland, which was one of the first countries to use the smart card system. Bagchi and White (2005) also considered the potential role of smart card data for travel behavior analysis. Case studies in Southport, UK, highlighted the advantages and limitations of smart card data for public transport analysis. Utsunomiya et al. (2006) suggested potential uses of smart card data for transit planning; they studied the frequency and consistency of daily travel patterns and the variability of smart card customer behavior by using data obtained in Chicago. Morency et al. (2007) used 277 consecutive days of data from a Canadian transit network and applied data mining techniques to measure the spatial and temporal variability of transit use by card type. Guo and Wilson (2009) focused on transfer planning in the large public transport network in London. They employed origin–destination (OD) data from the smart cards to assess system-wide passenger flow in the London Underground system. Seaborn et al. (2009) analyzed multimodal journeys for information on transit planning using the smart card system in London. They considered multimodal transfer combinations of bus-to-Underground, Underground-to-bus, and bus-to-bus.

Although public transport smart card transaction data provide continuous and detailed travel information, certain aspects of that information are incomplete. Chu and Chapleau (2008) presented methods for archiving smart card data by estimating the arrival times of bus routes and identifying linked trips using spatial–temporal concepts. Trepanier et al. (2007) presented a model to estimate the destination location for each individual boarding a bus with a smart card by using the automated fare collection system data. Sahin and Altun (2007) proposed a method to extract passenger and operation information for improving transit management by using data collected in Istanbul, Turkey.

In spite of the widespread use of smart card systems, there are few studies dealing with smart card data on public transport in Japan. Asakura et al. (2008) analyzed urban railway passenger behavior by using smart card transaction data obtained from an urban railway system in Japan. They focused on passenger train choice behavior before and after improvements in the train timetables. The train timetables included several types of trains that stop at different stations and have different travel times. Passengers could choose between several trains that reached their desired destination. Asakura et al.’s study pointed out that passengers appeared to choose trains depending on travel time and congestion. The study also mentioned that a passenger’s choice of train could be estimated from the train timetable and smart card transaction data that record the time at both the boarding and alighting stations. Although the study discussed the concept of using smart card transaction data in this way, it did not propose a methodology for doing so.

The train timetables have a significant influence on attracting passengers and managing congestion. As mentioned by Asakura et al. (2008), some Japanese urban railways in metropolitan areas have several types of trains that are generally named express, rapid, and local trains. The local trains stop at all stations. The express and rapid trains skip some stations and have shorter travel times than the local trains. The express trains stop at roughly every fifth station. Extra fare is not always required for faster services. During weekdays, each type of train runs at intervals of 5–15 min. Various types of trains run punctually according to the predetermined timetable. Train passengers can choose the most convenient trains. Unless the travel distance is very short and train frequency is very high, passengers seem to time their arrival at stations according to the train timetable, especially in peak hours, because they prefer boarding a particular type of train. This means that many passengers may decide which train to board before arriving at the station. The analysis in this study partly relies on this observation.

Several studies have been carried out on passengers’ train choice behavior for urban railways in Japan. For example, Ieda et al. (1988) estimated passenger train choices using OD data obtained by a census. They treated a train timetable as a network. Passengers were given a utility function and assigned according to the static user equilibrium concept. Ieda et al. also introduced indicators for planning train timetables.

Smart card data have the potential for analyzing day-to-day dynamics of train choices by considering behavioral changes of each individual. If each passenger’s choice of train is estimated over a long period, the information can be used to improve train timetables and railway company customer relationship management (CRM). Long-term individual trends in train choices would be useful for improving train timetables and fare discount services. For example, railway companies can easily check how passengers select trains after timetable improvements. Before-and-after comparisons of the transaction data can be used effectively for this purpose.

The aim of our study is to develop a methodology that can be used to estimate trains boarded by smart card holders from smart card transaction data. This paper presents a methodology and an algorithm for this estimation using long-term transaction data. The methodology focuses on urban railways in Japanese metropolitan areas during peak hours. However, it can be applied to other situations, such as off-peak hours or medium-sized cities, when the railways satisfy several conditions. That is, the methodology is applicable under the following conditions, which are discussed in detail later:

  • Data are observed in both entry and exit gates in the stations.

  • Trains are punctual according to the scheduled timetable.

  • Stations do not have any attractive facilities where passengers stay longer between swiping the card and boarding a train, such as book stores, cafes, or kiosks.

Smart card transaction data and train timetable data

The proposed methodology requires smart card transaction and train timetable data. ID numbers of smart card holders are recorded each time they pass entry or exit gates, and each transaction record indicates an unlinked trip. Smart card transaction records provide information on ID numbers, the date, departure station, passage time at an entry gate, arrival station, and passage time at an exit gate. The times at the entry and exit gates are recorded to the minute.

Data from the train timetable contain a record of the time for each train arrival and departure at every station. Each record in the train data contains the following: departure station, departure time, arrival station, arrival time, and train identification information. The identification information is used to distinguish individual trains. Both the departure and arrival times are recorded by the exact number of minutes. Example smart card data and train timetable data are shown in Fig. 1.

Fig. 1 Examples of smart card data and train timetable data
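
The two record types described above can be sketched as simple data structures. This is only an illustrative sketch: the field names, the `to_minutes` helper, and the sample values are assumptions introduced here, and times are assumed to be stored as minutes after midnight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One unlinked trip from the smart card data (hypothetical field names)."""
    card_id: str
    date: str
    dep_station: str
    entry_time: int   # passage time at the entry gate, minutes after midnight
    arr_station: str
    exit_time: int    # passage time at the exit gate, minutes after midnight


@dataclass(frozen=True)
class TrainRun:
    """One timetable record for a train between two stations (hypothetical field names)."""
    train_id: str
    dep_station: str
    dep_time: int
    arr_station: str
    arr_time: int


def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


# Made-up sample record, for illustration only.
trip = Transaction("card-001", "2007-10-01", "S", to_minutes("08:10"),
                   "T", to_minutes("08:40"))
```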

Methodology

Relational map

Figure 2 shows a relational map of smart card data and the train timetable. This map depicts the relationship between the train timetable and gate passage times observed in the transaction data. The horizontal axis indicates the time of day at the arrival station, and the vertical axis indicates the time of day at the departure station. The dot labeled “Passenger a” indicates the corresponding time when the passenger passes through the gates at the departure and arrival stations. Each record from the transaction data can be plotted as a dot in this map. The horizontal lines of the map represent the departure times of trains, and the vertical lines are the arrival times. Passengers can board trains that are plotted above the dot, and they can get off trains plotted on the left side of the dot. A passenger can take any train that departs after their entry time at the departure station, but only if that train arrives before the passenger’s exit time at the arrival station.

Fig. 2 Relational map of smart card transaction data and train timetable
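
The boarding condition read from the relational map can be written as a small filter. The sketch below handles only direct trains (no transfers), and the train names and times are made-up values rather than data from the study.

```python
def feasible_direct_trains(trains, entry_time, exit_time):
    """Return ids of trains that depart at or after the entry time and
    arrive at or before the exit time.

    trains: list of (train_id, dep_time, arr_time) tuples, times in minutes.
    """
    return [train_id for train_id, dep, arr in trains
            if dep >= entry_time and arr <= exit_time]


# Hypothetical timetable, times in minutes relative to some reference time.
trains = [("X", 73, 96), ("Y", 76, 100), ("Z", 65, 90), ("W", 78, 104)]
candidates = feasible_direct_trains(trains, entry_time=70, exit_time=100)
```

Here train Z departs before the passenger enters and train W arrives after the passenger exits, so only X and Y remain as candidates.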

Figure 3 shows actual transaction data plotted on a relational map. This map was used in Asakura et al. (2008) (but was referred to as an overlapping analysis). They mentioned that a passenger can be observed to be passing through the entry gate immediately before the train departure and passing through an exit gate immediately after train arrival. Passengers tend to minimize the waiting time at the departure station and any lost time at the arrival station.

Fig. 3 Actual transaction data plotted on relational map

Assumptions for passenger behavior

One of the simplest methods for estimating passenger choice of trains is to describe the choice behavior according to passenger utility functions. When passengers are assumed to have all of the information on the train schedules, this method requires enumerating all possible combinations of the trains that are available to each passenger traveling between a pair of stations. The most preferred alternatives for the passenger are then determined. However, enumerating all possible combinations is not realistic because the number of combinations becomes too large when several types of trains are scheduled. This means that a large amount of computational time is required for the estimation process. When actual smart card data are processed, both passenger behavior and the computation time need to be considered because data on millions of trips are being processed. To minimize the computation time, the number of alternative train combinations should be reduced. Thus, it is necessary to develop a method that generates a small subset of train combinations based on reasonable assumptions.

Based on the findings shown in the relational maps, we employed the following assumptions of passenger behavior when estimating train choice:

  (i) A passenger minimizes the total waiting time at the departure station and lost time at the arrival station.

  (ii) Passengers choose trains that minimize the transfer frequency while satisfying assumption (i).

  (iii) If two or more combinations of train choices satisfying assumption (ii) exist, the passenger chooses one with equal probability.

Assumption (i) was simply derived from the relational maps. Passengers are assumed to have sufficient knowledge of the timetable or experience in using trains, so they can recognize the departure time of the trains that they want to board. This means that passengers have decided their preferred combinations of trains before they reach the departure station. To satisfy this assumption, train operators should maintain a certain level of punctuality. Assumption (i) also assumes that the stations do not have any attractive facilities where passengers can stay longer between swiping the card and boarding a train. If a station has such facilities, for example book stores or kiosks, some of the passengers will not try to minimize the waiting time or the lost time.

Assumptions (ii) and (iii) are practical when there are several train types with different stopping stations, as shown in Fig. 4. In the example shown in the figure, there are three types of trains. The express train stops only at station F, and the rapid train skips several stations. The local train stops at all of the stations. When a passenger wants to board a train going from station A to station F, he/she has four possible paths of boarding trains, as indicated by paths a–d in Fig. 4. Some of the choices may satisfy assumption (i) at the same time. For example, if a passenger can transfer to the same rapid train from the local train at station B or D, he/she has two path choices, paths a and c.

Fig. 4 Possible paths from station A to station F

Although multiple paths can satisfy assumption (i), some of these paths are unrealistic. For example, paths b and d both satisfy assumption (i). However, the passenger would clearly choose path d because path b requires too many transfers. On path b, the passenger initially boards the local train and then transfers to the rapid train that arrives at station D earlier than the local train. However, he/she transfers to the same local train again at station D. Paths including these improbable transfers are eliminated by assumption (ii), which restricts transfer frequency.

Assumption (iii) identifies the path when multiple paths satisfying assumption (ii) appear. For example, when both paths a and c have the same travel time and satisfy assumption (i), both also satisfy assumption (ii). However, these choices cannot be identified by only using smart card transactions because transfer stations are not observed, and the trains on path a are the same as the trains on path c. Therefore, assumption (iii) presumes that a passenger chooses one of the paths with equal probability.
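
One way to apply assumption (iii) when counting passengers per train is to give each of the n paths in a trip's path set a weight of 1/n. The sketch below is an illustration under that reading; the function name and the sample path sets are assumptions introduced here.

```python
from collections import defaultdict
from fractions import Fraction


def train_loads(path_sets):
    """Expected number of card holders on each train under assumption (iii).

    path_sets: one entry per trip; each entry is a list of paths, and each
    path is a tuple of train ids. Every path in a set gets equal weight.
    """
    loads = defaultdict(Fraction)
    for paths in path_sets:
        weight = Fraction(1, len(paths))
        for path in paths:
            for train in set(path):
                loads[train] += weight
    return dict(loads)


# Trip 1: two paths that use the same trains but differ in transfer station.
# Trip 2: two paths that use different trains.
path_sets = [
    [("local", "rapid"), ("local", "rapid")],
    [("local",), ("express",)],
]
loads = train_loads(path_sets)
```

When the competing paths use the same trains, as with paths a and c, the weights add back to one passenger per train, so the unobserved transfer station does not affect the train counts.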

To derive the combination of trains boarded by a passenger that satisfies assumption (iii), all possible paths satisfying assumption (ii) have to be enumerated. In addition, all trains on the paths must be possible to board. In Fig. 4, the express train cannot be contained in the possible paths even if the time difference between the passenger’s exit time and the arrival time of the express train at station F is very small. To enumerate the possible paths in a simple way, the transfer network is defined in the next subsection.

Transfer network

The transfer network represents possible space–time paths of passengers; the paths are linked by train connections. The network also represents the links that connect the time of a passenger’s entry into a departure station with the possible trains that he/she may board, as well as the links that connect the time of a passenger’s exit from an arrival station with the possible trains he/she may exit from. The combination of trains boarded by each passenger is represented by a path in the transfer network, an example of which is shown in Fig. 5.

Fig. 5 Example of transfer network

The network consists of four types of nodes and four types of links. A set of nodes $V$ is composed of sets of entry nodes $V_{s}$, exit nodes $V_{t}$, departure nodes $V_{d}$, and arrival nodes $V_{a}$. A set of links $A$ is composed of sets of entry links $A_{s}$, exit links $A_{t}$, train links $A_{r}$, and transfer links $A_{c}$.

The entry nodes represent the times at the entry ticket gates at the departure stations. These nodes are defined for each minute at each station. Each node has two attributes: the time of day and station identity. Time(v) and Station(v) indicate the time and station attributes of node v, respectively. The exit nodes represent the times at the exit ticket gates at the arrival stations. These nodes are also defined for each minute at each station, and they have the same two attributes as the entry nodes. The departure nodes indicate the departure of trains at every station. The arrival nodes indicate the arrival of trains at every station. Both the departure and arrival nodes are generated from the train timetable data. Each departure or arrival node has three attributes: the time of day, station identity, and train identity. The time and station are indicated by Time(v) and Station(v) similar to other nodes, while the identity of the train is expressed by Train(v).

The network has four types of links; each link type is specified by the types and attributes of its head and tail nodes. The tail of each link $a$ is indicated by $V_{st}(a)$, and the head is indicated by $V_{ed}(a)$. All of the link types must satisfy the following equation:

$$ {\text{Time}}(V_{ed} (a)) - {\text{Time}}(V_{st} (a)) \ge 0\quad a \in A $$
(1)

This equation prevents the transfer network from displaying loops. If the train schedules are for a single or double-track railroad, the network is defined as a simple graph.

Entry links indicate passenger wait times at departure stations. These links are defined between the entry and departure nodes, and they must follow a path from an entry node to a departure node. This type of link satisfies the following equation:

$$ {\text{Station}}(V_{st} (a)) = {\text{Station}}(V_{ed} (a))\quad a \in A_{s} $$
(2)

Exit links express the lost time between the train arrival time and passage time of an exit gate at arrival stations. These links are defined between the arrival and exit nodes, and they follow a path from the arrival node to the exit node. Exit links satisfy the following equation:

$$ {\text{Station}}(V_{st} (a)) = {\text{Station}}(V_{ed} (a))\quad a \in A_{t} $$
(3)

The train links indicate trains that are running between stations and stopping at stations. A train link representing a train en route to its next station follows a path from the departure node to the arrival node and satisfies the following equation:

$$ {\text{Station}}(V_{st} (a)) \ne {\text{Station}}(V_{ed} (a))\;{\text{and}}\;{\text{Train}}(V_{st} (a)) = {\text{Train}}(V_{ed} (a)) $$
(4)

Conversely, train links that represent trains stopping at a station follow a path from the arrival node to the departure node and satisfy the following equation:

$$ {\text{Station}}(V_{st} (a)) = {\text{Station}}(V_{ed} (a))\;{\text{and}}\;{\text{Train}}(V_{st} (a)) = {\text{Train}}(V_{ed} (a)) $$
(5)

Transfer links represent passengers who are waiting for their next train at a station. The transfer link follows a path from the arrival node to the departure node and satisfies the following equation:

$$ {\text{Station}}(V_{st} (a)) = {\text{Station}}(V_{ed} (a))\;{\text{and}}\;{\text{Train}}(V_{st} (a)) \ne {\text{Train}}(V_{ed} (a))\quad a \in A_{c} $$
(6)
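
The link definitions in Eqs. 1–6 can be read as a classification rule on a pair of nodes. The following is a minimal sketch of that rule; the `Node` type and the string labels are assumptions made for illustration, and the sketch does not check that a running train link joins consecutive stops of the train.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    kind: str                    # "entry", "exit", "departure", or "arrival"
    station: str                 # Station(v)
    time: int                    # Time(v), in minutes
    train: Optional[str] = None  # Train(v); None for entry and exit nodes


def link_type(tail: Node, head: Node) -> Optional[str]:
    """Classify a candidate link, or return None if no link type applies."""
    if head.time - tail.time < 0:                 # violates Eq. 1
        return None
    same_station = tail.station == head.station
    same_train = tail.train == head.train
    if tail.kind == "entry" and head.kind == "departure" and same_station:
        return "entry"                            # Eq. 2
    if tail.kind == "arrival" and head.kind == "exit" and same_station:
        return "exit"                             # Eq. 3
    if (tail.kind == "departure" and head.kind == "arrival"
            and not same_station and same_train):
        return "train"                            # Eq. 4, train running
    if tail.kind == "arrival" and head.kind == "departure" and same_station:
        return "train" if same_train else "transfer"   # Eq. 5 or Eq. 6
    return None


arrive_b = Node("arrival", "B", 10, "local")
stay_on = Node("departure", "B", 11, "local")
change = Node("departure", "B", 12, "rapid")
```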

To estimate the paths that satisfy the assumptions in the first subsection of this section, the costs of links are defined as vectors with two attributes:

$$ C(a) = (\tau_{a} ,c_{a} ) $$
(7)

where $\tau_{a}$ is the wait or lost time at the departure or arrival station, and $c_{a}$ is a discrete penalty resulting from transfer, defined by:

$$ c_{a} = \left\{ {\begin{array}{*{20}c} 1 & {a \in A_{c} } \\ 0 & {\text{otherwise}} \\ \end{array} } \right. $$
(8)

The costs of the links are summarized as follows:

$$ C(a) = \left\{ {\begin{array}{*{20}l} {({\text{Time}}(V_{ed} (a)) - {\text{Time}}(V_{st} (a)),0)} & {{\text{if}}\;a \in A_{s} \cup A_{t} } \\ {(0,0)} & {{\text{if}}\;a \in A_{r} } \\ {(0,1)} & {{\text{if}}\;a \in A_{c} } \\ \end{array} } \right. $$
(9)
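
Eq. 9 translates directly into a function that returns a two-element cost, and path costs are then compared component by component. This is a sketch only: the link representation `(kind, tail_time, head_time)` and the sample path are assumptions introduced here.

```python
def link_cost(kind, tail_time, head_time):
    """Cost vector (tau, c) of a link, following Eq. 9."""
    if kind in ("entry", "exit"):
        return (head_time - tail_time, 0)   # wait or lost time, no penalty
    if kind == "train":
        return (0, 0)
    if kind == "transfer":
        return (0, 1)                       # one transfer penalty
    raise ValueError(f"unknown link kind: {kind}")


def path_cost(links):
    """Sum the cost vectors of all links on a path."""
    costs = [link_cost(kind, t0, t1) for kind, t0, t1 in links]
    return (sum(tau for tau, _ in costs), sum(c for _, c in costs))


# Hypothetical path: wait 3 min, ride, transfer once, ride, lose 4 min at exit.
path = [("entry", 70, 73), ("train", 73, 85), ("transfer", 85, 88),
        ("train", 88, 96), ("exit", 96, 100)]
total = path_cost(path)
```

Python compares tuples lexicographically, so `(7, 1) < (8, 0)` holds: a path with less waiting and lost time is preferred even if it needs more transfers, which matches the priority of assumption (i) over assumption (ii).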

Possible paths satisfying the assumptions

When departure station $s_{o}$, arrival station $s_{d}$, passage time of entry gate $t_{o}$, and passage time of exit gate $t_{d}$ are derived from the smart card transaction data, the origin node $v_{o}$ is defined as:

$$ (s_{o} ,t_{o} ) = ({\text{Station}}(v_{o} ),{\text{Time}}(v_{o} ))\quad v_{o} \in V_{s} $$
(10)

Similarly, the destination node $v_{d}$ is defined as:

$$ (s_{d} ,t_{d} ) = ({\text{Station}}(v_{d} ),{\text{Time}}(v_{d} ))\quad v_{d} \in V_{t} $$
(11)

The set of paths that satisfy assumption (i) can be expressed as follows:

$$ R_{0} = \left\{ r_{0} | r_{0} \in R_{od} ,\sum\limits_{{a \in A(r_{0} )}} {\tau_{a} } \le \sum\limits_{{a \in A(r)}} {\tau_{a} } \quad {\text{for}}\;\forall r \in R_{od} \right\} $$
(12)

where $R_{od}$ is the set of all paths whose origin is $v_{o}$ and destination is $v_{d}$, and $A(r)$ is the set of links included in path $r$.

Finally, the set of paths that satisfy both assumptions (i) and (ii) can be expressed as follows:

$$ R_{c} = \left\{ r_{c} | r_{c} \in R_{0} ,\sum\limits_{{a \in A(r_{c} )}} {c_{a} } \le \sum\limits_{{a \in A(r_{0} )}} c_{a} \quad {\text{for}}\;\forall r_{0} \in R_{0} \right\} $$
(13)
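
Eqs. 12 and 13 describe a two-stage filter, which the following sketch applies to a small set of candidate paths. The path names and cost values are hypothetical, and each cost is assumed to be already summed along the path as (total wait and lost time, number of transfers).

```python
def select_paths(candidates):
    """Return (R0, Rc) as sorted lists of path names.

    candidates: dict mapping a path name to its (sum of tau, sum of c).
    """
    min_tau = min(tau for tau, _ in candidates.values())
    r0 = {name: cost for name, cost in candidates.items()
          if cost[0] == min_tau}                       # Eq. 12
    min_transfers = min(c for _, c in r0.values())
    rc = sorted(name for name, (_, c) in r0.items()
                if c == min_transfers)                 # Eq. 13
    return sorted(r0), rc


# Hypothetical candidates: p and q tie on time, r has no transfer but more lost time.
candidates = {"p": (5, 1), "q": (5, 3), "r": (8, 0)}
r0, rc = select_paths(candidates)
```

Path r is dropped at the first stage even though it needs no transfer, because the transfer count is compared only among the paths that already minimize the wait and lost time.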

Algorithm

This section presents the proposed algorithm for estimating the set of possible paths from the smart card transaction data and the transfer network. Figures 6 and 7 show the flowchart of the proposed algorithm. At the beginning of the algorithm, denoted by “a” in Fig. 6, the smart card transaction data are sorted by the departure station, time of gate passage at the entry, arrival station, and time of gate passage at the exit so that some computational results can be reused. After the data are sorted, one of the records of the transaction data is extracted for estimation. To estimate costs that satisfy Eq. 13, a shortest path search using Dijkstra’s algorithm is applied at “b” in Fig. 6. Note that Dijkstra’s algorithm can be replaced with other shortest path search algorithms. At “c,” all paths that satisfy Eq. 13 are listed in the path list with the subroutine shown in Fig. 7. The subroutine searches a sub-graph whose nodes have time attributes between the times of the origin and destination nodes. If the cost of a path in the sub-graph is equal to the cost derived from “b” in Fig. 6, the path satisfies Eq. 13 and is added to the path list. All of the paths in the path list are train choices that satisfy the assumptions.

Fig. 6 Flowchart of estimation algorithm

Fig. 7 Flowchart of estimation algorithm (subroutine of Fig. 6)
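
Steps “b” and “c” can be sketched as a shortest path search on vector costs followed by an enumeration of all paths that attain the optimum. This is a simplified illustration, not the flowchart itself: it assumes the network is acyclic, it prunes the enumeration by cost rather than by a time-bounded sub-graph, and the toy network (node names and costs) is invented.

```python
import heapq


def shortest_costs(graph, origin):
    """Dijkstra's algorithm with (tau, c) costs compared lexicographically.

    graph: dict mapping a node to a list of (next_node, (tau, c)) links.
    """
    best = {origin: (0, 0)}
    heap = [((0, 0), origin)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best[node]:
            continue                      # stale heap entry
        for nxt, (tau, c) in graph.get(node, []):
            new = (cost[0] + tau, cost[1] + c)
            if nxt not in best or new < best[nxt]:
                best[nxt] = new
                heapq.heappush(heap, (new, nxt))
    return best


def enumerate_paths(graph, origin, dest, target):
    """List every path from origin to dest whose cost equals target."""
    found = []

    def walk(node, cost, path):
        if cost > target:                 # costs never decrease, so prune
            return
        if node == dest:
            if cost == target:
                found.append(path)
            return
        for nxt, (tau, c) in graph.get(node, []):
            walk(nxt, (cost[0] + tau, cost[1] + c), path + [nxt])

    walk(origin, (0, 0), [origin])
    return found


# Toy network: a local train from A, with a transfer to the same rapid
# train possible at station B or at station D.
graph = {
    "o": [("L_dep_A", (2, 0))],
    "L_dep_A": [("L_arr_B", (0, 0))],
    "L_arr_B": [("L_dep_B", (0, 0)), ("R_dep_B", (0, 1))],
    "L_dep_B": [("L_arr_D", (0, 0))],
    "L_arr_D": [("R_dep_D", (0, 1))],
    "R_dep_B": [("R_arr_D", (0, 0))],
    "R_arr_D": [("R_dep_D", (0, 0))],
    "R_dep_D": [("R_arr_F", (0, 0))],
    "R_arr_F": [("d", (3, 0))],
}
best = shortest_costs(graph, "o")
paths = enumerate_paths(graph, "o", "d", best["d"])
```

Both enumerated paths use the same two trains and differ only in the transfer station, which is the situation that assumption (iii) is meant to handle.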

As shown at “d” in Fig. 6, when the next record has the same origin and destination nodes as the current record, the same path list is reused for the next record to save calculation time. If only the origin node is the same as the current record, the costs calculated at “b” are reused, and the path list is restructured at “c.” If neither node is the same, the processes for “b” through “d” are repeated. These processes are repeated until all of the records are calculated.
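
The reuse of results across sorted records can be sketched with nested grouping. In this sketch the two solver functions are passed in as parameters and replaced by counting stubs, so all names and sample records are assumptions for illustration.

```python
from itertools import groupby


def estimate_all(records, shortest_costs, list_paths):
    """Process records sorted by (dep_station, entry_time, arr_station, exit_time).

    shortest_costs(origin) stands for step "b" and list_paths(origin, dest, costs)
    for step "c"; each is called once per distinct key, and the path list is
    reused for repeated records as in step "d".
    """
    results = []
    ordered = sorted(records)                                   # step "a"
    for origin, same_origin in groupby(ordered, key=lambda r: r[:2]):
        costs = shortest_costs(origin)                          # once per origin node
        for dest, same_od in groupby(same_origin, key=lambda r: r[2:]):
            paths = list_paths(origin, dest, costs)             # once per node pair
            results.extend((rec, paths) for rec in same_od)
    return results


calls = {"b": 0, "c": 0}


def fake_costs(origin):
    calls["b"] += 1
    return {}


def fake_paths(origin, dest, costs):
    calls["c"] += 1
    return [origin + dest]


records = [("S", 70, "T", 100), ("S", 75, "T", 104),
           ("S", 70, "T", 104), ("S", 70, "T", 100)]
out = estimate_all(records, fake_costs, fake_paths)
```

For the four sample records, the cost search runs twice (two distinct origin nodes) and the path listing runs three times (three distinct node pairs).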

Empirical analysis

This section presents the empirical analysis conducted using the proposed method. One of the aims of this analysis was to confirm whether the required computation time was practical. The other was to illuminate any problems from a practical standpoint. The first subsection presents the data used in the analysis. The second shows the estimation results. The third discusses the accuracy of the results from two viewpoints: a summary of the results and a trace of the trips made by a specific individual smart card holder.

Data

The empirical analysis employed smart card transaction data obtained from an urban railway company in Osaka, Japan. The share of smart card holders was about 10% of the total passengers for the railway company. The railway company provided smart card data for research purposes. Identification information of the individual passenger was anonymously processed prior to the analysis. The privacy of each smart card holder was strictly protected throughout this study.

The targeted sections of the railway have 42 stations. The sections have several branch lines, although the main line includes most of the stations. This analysis used the OD data for one direction of the railway. There are local trains and several types of rapid and express trains with different stopping stations. The local trains stop at all the stations. The express trains stop at approximately every fifth station. The rapid trains stop at more stations than the express trains. All of the trains are operated under the same fare system, in which the fare corresponds to the boarding distance.

The data for analysis covered 662,419 unlinked trips observed over 39 weekdays in 2007. The contents of the transaction data records and train timetable data are described in the second section. The transfer networks were constructed based on the train timetable data.

To improve computing speed, the entry and exit links were defined only between nodes whose time attributes differed by less than 20 min. The possible negative influences of this decision should be limited because trains are generally scheduled within 20 min of one another. However, this time window affects the results in the early morning and around midnight, when the trains are scheduled more than 20 min apart. Note that these networks do not consider incidents such as delays or temporary changes to the train schedule.
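
The 20-min window on entry links can be sketched as follows; exit links would be generated in the same way. The function name, the tuple layout, and the sample times are assumptions made for illustration.

```python
WINDOW_MIN = 20  # entry links are created only for time differences below this


def entry_links(entry_times, departures, window=WINDOW_MIN):
    """Generate entry links at one station.

    entry_times: entry node times in minutes.
    departures: list of (train_id, dep_time) for trains leaving the station.
    Returns (entry_time, train_id, dep_time) for every allowed pair.
    """
    return [(t, train, dep)
            for t in entry_times
            for train, dep in departures
            if 0 <= dep - t < window]


# Hypothetical daytime case: the train 25 min away gets no entry link.
daytime = entry_links([70], [("K", 73), ("L", 76), ("N", 95)])

# Hypothetical early-morning case: the only train is 25 min away, so the
# entry node has no outgoing link and no path can be found for the trip.
early = entry_links([300], [("first", 325)])
```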

Results

Table 1 shows the estimation results. We used a computer with two Intel Xeon 5110 (1.60 GHz) processors and 4 GB memory. According to the results, it took 92 min to estimate the 39 days of transaction data. It took 2.4 min to estimate the data for 1 day on average. In terms of calculation time, the proposed methodology can be used for regular processing of transaction data and seems to be acceptable for a daily routine of data processing (see appendix for more details on computation time).
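
As a rough check on the reported figures, and assuming that the 92 min covers all 662,419 trips in the dataset, the average daily time and the implied throughput work out to approximately:

$$ \frac{92\;{\text{min}}}{39\;{\text{days}}} \approx 2.4\;{\text{min/day}},\quad \frac{662{,}419\;{\text{trips}}}{92 \times 60\;{\text{s}}} \approx 120\;{\text{trips/s}} $$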

Table 1 Summary of estimation result

Discussion

Discussion focused on summary of estimation

According to Table 1, no path was estimated for 8,741 trips. These incomplete trips account for 1.3% of the total trips taken. Figure 8 shows the percentage of incomplete trips during each hour of the day. The data suggest that a remarkable number of these trips occurred around 4:00 p.m. Figure 9 shows the day-to-day changes in the levels of incomplete trips at 4:00 p.m. These figures show that most of the errors happened during certain daytime hours on certain days. Incomplete trips occur only when the network does not have a path connecting the corresponding entry and exit nodes. However, the trains are scheduled within 20-min intervals in the daytime. This means that passengers can board a train within 20 min after passing the entry gate, so the time window alone does not explain these trips. There are two possible situations. One is when a passenger departs on the fastest train, which has been delayed at the departure station. The other is when a passenger arrives at the destination earlier than the scheduled time of the fastest train by using extra trains that are not recorded in the scheduled train timetable data. These two cases rarely occur if trains are operated punctually according to the predetermined timetable. To avoid this inconsistency, it would be better to use the actual train logs obtained from the train control system. This may be more significant when the trains are not punctual according to the schedule.

Fig. 8 Percentage of incomplete trips for each hour of day

Fig. 9 Day-to-day changes in percentage of incomplete trips at 4 p.m.

Figure 10 shows the distribution of the number of possible paths for each trip. The average number of paths is 1.29. According to the figure, 85% of the trips had only one path, and 99% of the trips had fewer than five paths. This result suggests that the path for 85% of the trips is identified by using the first and second assumptions presented in the third section. Figure 11 shows the distribution of the number of transfers in each trip. The average number of transfers was 0.45, and the maximum number of transfers was three. Figure 12 shows the histogram of the total waiting and exit loss times. The average time was 330 s. About 99% of the trips had loss times of less than 780 s. These results imply that unrealistic choices of trains were not estimated for most of the trips.

Fig. 10 Distribution of number of possible paths

Fig. 11 Distribution of frequency of transfer

Fig. 12 Distribution of sum of waiting and exit loss time

Discussion focused on paths of an individual passenger

This section focuses on a specific individual passenger’s trips and compares the data with estimation using the overlapping analysis concept introduced by Asakura et al. (2008).

Figure 13 represents unlinked trips between stations S and T that were made by a specific individual passenger during the 39-day observation period. The vertical axis of this figure represents the entry time at station S, and the horizontal axis is the exit time at station T. The times in the figure are expressed relative to 7:00 a.m. The trips made by the passenger are represented by square dots at the corresponding entry and exit times. The color depth of each dot represents the number of trips observed with the same entry and exit times. The circular dots indicate the estimation results of the proposed methodology. These dots correspond to the path sets shown in Table 2. The horizontal lines inside the diagrams represent the departure times of trains at station S, and the vertical lines are the arrival times of trains at station T. Four types of trains stopped at these stations. Each train type stopped at different stations and had different travel times. Trains K and L stopped at station S. All trains stopped at station T throughout the observation period.

Fig. 13 Departure and arrival times of individual passenger and estimation results

Table 2 Estimated trains of trips made by a passenger

Figure 13, in minutes relative to 7 a.m., shows that the passenger frequently entered station S around 70 min and exited at station T around 100 min. This suggests that the passenger boarded train K departing at 73 min or train L departing at 76 min. Also, the passenger arrived at station T by train I at 96 min or train J at 100 min. This is because many of the data values in Fig. 13 are plotted under train K or L on the right side of train I or J. In this manner, the overlapping analysis can visually detect the trains boarded by the passenger.

Table 2 shows the estimation results using the proposed method in this study. Five sets of possible paths were estimated. These sets are indicated by the letters a–e, which correspond to the path sets shown in the table. All of the path sets include only one possible path. This means that assumption (iii) was not used for estimating this passenger’s trips. In 27 trips, the passenger chose path d, which departed at 73 min and arrived at 96 min. The results of the estimation were confirmed to coincide with the results of the overlapping analysis.

According to Table 2, the waiting time between the passenger's entry and the train departure in path set b is very small. The natural choice for the passenger would be c, because it seems difficult for him/her to reach the platform from the entry gate in such a short period. In this case, the wrong path may have been estimated. However, there is no way to confirm this by using only the smart card transaction data. A probe person survey can be used as a method of validation. If GPS logs from a probe person survey can be made available for smart card holders, the estimation results can be easily validated by actually tracking the smart card holders. In addition, unsuccessful results of this type, in which the waiting and loss times are too short, could be prevented by adding the minimum time required to move between the gate and the platform of each station to the attributes of the transfer network.
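The suggested safeguard can be illustrated with a minimal sketch, assuming a single minimum gate-to-platform time per station. The function name, the entry time of 72.5 min, and the 1-minute threshold are hypothetical values chosen for illustration; they are not taken from Table 2.

```python
def feasible_departures(entry_min, departures, min_access_min):
    """Keep only departures that leave at least min_access_min after gate entry.

    departures maps a train name to its departure time in minutes;
    min_access_min is an assumed gate-to-platform time in minutes.
    """
    return sorted(
        (t, name)
        for name, t in departures.items()
        if t - entry_min >= min_access_min
    )


# Hypothetical entry 0.5 min before train K departs.
departures_at_s = {"K": 73, "L": 76}
without_limit = feasible_departures(72.5, departures_at_s, 0.0)
with_limit = feasible_departures(72.5, departures_at_s, 1.0)
```

With no minimum access time, both trains remain candidates. With the assumed 1-minute minimum, train K is excluded and only train L remains.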

Conclusion

This paper proposes a methodology and an algorithm for estimating trains boarded by passengers by using recorded smart card transaction data obtained at both boarding and alighting stations. The passenger behavior model used in the estimation is an extension of our proposed data analysis method.

In the empirical analysis, estimates of trains boarded by passengers were made using actual smart card transaction data. The proposed methodology can be used to process recorded transaction data in a practical amount of computation time. The results were verified using the number of transfers, number of possible paths, and trips made by an individual smart card holder.

There are two approaches that can be used to verify the results. One approach confirms whether each individual trip was correctly estimated by using the GPS logs from a probe person survey. The other approach checks the number of passengers on each train in each section, which can be estimated from the load weight data of the trains. When the results of our methodology are compared with the load weight data, the number of passengers estimated by our methodology has to be scaled to the load weight data. This is because the smart card transaction data cover only smart card holders, who make up a portion of the passengers, while the load weight data include all passengers.
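One simple way to carry out such a scaling is sketched below. It assumes a single expansion factor for the whole network, computed as the ratio of total passengers estimated from load weight to total smart card passengers. The function names and all counts are hypothetical, and the load weight data are assumed to have already been converted to passenger counts.

```python
def expansion_factor(card_counts, load_counts):
    """Ratio of load-weight passengers to smart card passengers over all keys."""
    total_card = sum(card_counts.values())
    if total_card == 0:
        raise ValueError("no smart card passengers to scale")
    return sum(load_counts[key] for key in card_counts) / total_card


def scale_counts(card_counts, factor):
    """Scale estimated smart card passengers per train by a common factor."""
    return {key: n * factor for key, n in card_counts.items()}


# Hypothetical counts for two trains on one section.
card = {"K": 120, "L": 80}    # estimated from smart card data
load = {"K": 310, "L": 190}   # derived from load weight data
factor = expansion_factor(card, load)
scaled = scale_counts(card, factor)
residual = {key: load[key] - scaled[key] for key in card}
```

The residuals per train then indicate where the scaled estimates and the load weight data disagree. A single factor is the crudest choice; factors that vary by time of day or by section would be a natural refinement if the share of smart card holders is not uniform.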

As mentioned in the third section, the assumptions of the methodology implicitly rely on the punctuality of the trains. The empirical analysis shows that the developed method may generate a few incomplete trips with no estimated paths. However, the share of incomplete trips is small (less than 1.5%). These incomplete trips are caused by train delays and by extra trains that are not included in the usual timetable. If the actual train logs from the train control systems are available for the analysis, they should be used instead of the scheduled timetables to reduce the number of incomplete trips. If the trains are not punctual, the assumptions mentioned in the third section have to be modified. In this study, the acceptable level of punctuality for the proposed method has not been determined. In future work, sensitivity tests will be carried out to determine how delays influence the results. One possible approach is to apply the proposed method to transfer networks in which trains arrive stochastically a few minutes late at stations.
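The idea of such a sensitivity test can be illustrated with a toy simulation. The sketch below is not the proposed method: it assumes a single line without transfers, a fixed headway and run time, independent whole-minute delays, and a simple rule that a trip is incomplete when no scheduled train departs after the gate entry and arrives before the gate exit. All names and parameter values are hypothetical.

```python
import random


def incomplete_trip_rate(max_delay_min, n_trips=2000, headway=10,
                         run_time=20, seed=0):
    """Share of simulated trips that no scheduled train can explain."""
    rng = random.Random(seed)
    scheduled = [headway * k for k in range(1, 31)]  # scheduled departures
    incomplete = 0
    for _ in range(n_trips):
        # Each train is late by 0..max_delay_min minutes on this day.
        actual = [s + rng.randint(0, max_delay_min) for s in scheduled]
        entry = rng.uniform(0, scheduled[-2])
        # The passenger boards the first train that actually departs.
        boarded_dep = min(a for a in actual if a >= entry)
        exit_time = boarded_dep + run_time
        # Estimation step using the scheduled timetable only.
        fits = any(s >= entry and s + run_time <= exit_time
                   for s in scheduled)
        if not fits:
            incomplete += 1
    return incomplete / n_trips


rate_punctual = incomplete_trip_rate(0)
rate_delayed = incomplete_trip_rate(3)
```

In this toy setting, punctual trains produce no incomplete trips, while delays produce them whenever a passenger enters after the scheduled departure of a late train and still boards it. Repeating the run for several values of max_delay_min gives a rough picture of how the share of incomplete trips grows with delay.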

The behavioral assumptions mentioned in the third section were adopted as a compromise between sound assumptions about plausible paths and computation time. For railways that are less punctual, or that have facilities where passengers stay longer between swiping the card and boarding a train, weaker criteria would be preferable so that the paths most likely to be chosen by passengers are not eliminated. The criteria in assumptions (i) and (ii) would then have to be relaxed, which means that passengers could choose a path from a larger number of physically possible paths. However, such an improvement may require enumerating more combinations of trains than the proposed method, which would increase the computation time. The behavioral assumptions of an improved method therefore have to be balanced against the computation time needed to process the large amount of smart card data.

This study focused on creating path sets of boarded trains that satisfy assumptions (i) and (ii). For most of the trips, only one or a few paths were estimated. This study assumed that a passenger chooses a path with equal probability from the estimated path set if multiple paths satisfying assumption (ii) are estimated. However, each passenger appears to have his/her own preference. Hence, whether equal probability is appropriate is an issue, which is left for future consideration.
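The equal-probability assumption can be made concrete with a short sketch that converts path sets into expected passenger counts per train. The data layout (each trip as a list of paths, each path as a tuple of train names) and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def expected_train_loads(path_sets):
    """Expected passengers per train when each path in a set is equally likely.

    path_sets holds one entry per trip; each entry is a list of paths,
    and each path is a tuple of train names. Trips with an empty path
    set (incomplete trips) are skipped.
    """
    loads = defaultdict(float)
    for paths in path_sets:
        if not paths:
            continue
        weight = 1.0 / len(paths)
        for path in paths:
            for train in path:
                loads[train] += weight
    return dict(loads)


# Hypothetical trips: one with a single path, one with two candidate
# paths, and one incomplete trip.
trips = [
    [("K", "I")],
    [("K", "I"), ("L", "J")],
    [],
]
loads = expected_train_loads(trips)
```

If passenger-specific preferences were estimated later, the uniform weight could be replaced by a per-passenger probability for each path without changing the rest of the computation.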

In future work, train timetables will be evaluated by using the estimation results of the methodology. The estimation results consist of the daily train choices of each passenger, which are also useful for identifying passengers' characteristics. For example, if the estimated paths are combined with information on the congestion of every train, which can be obtained by other means such as the load weight data of trains, railway operators can identify which passengers avoid congestion and which passengers give priority to travel time. When creating train timetables, this information will be useful for determining which types of trains passengers need. The information can also be applied to before-and-after analyses of timetable changes to check how passengers select trains after the timetable has been revised.