1 Introduction

Research on the connections between the urban environment, its organization, and sustainability targets has increased substantially in recent years, linking the broad interests of policymakers, research initiatives, and neighborhood communities in ensuring more sustainable, high-quality, and livable cities [1,2,3]. Mobility plays an important role here: infrastructure planning and renewal, the planning of amenities and services, guiding people’s habits and routines, managing public transport, and planning for higher walkability are all important factors in this initiative. However, timely reactions require a different approach to integrated planning practices [4], one that entails continuous and adaptive monitoring as well as the assessment of existing situations based on frequent and solid data infrastructures.

Data-driven urbanistic practices are at the core of these endeavors. The rapid evolution of information and communication technologies has raised the potential of urban analytics and assessments to a new level and consequently improved the prospects for well-informed decisions, also on the basis of existing data, its repurposing, and reuse. The evolution towards a data-based society, where data from different domains and activities can be publicly accessed, reused, and integrated, is one of the strong European goals (Footnote 1). The increasing availability of new forms of data stemming from smart sensors and from our interactions with large socio-technical systems, such as social media platforms or mapping applications (e.g., Twitter, Instagram, Strava) and ubiquitous devices such as smartphones or activity trackers (e.g., smartwatches), allows us to use data science and artificial intelligence (AI) to better understand cities [5, 6], people’s needs and expectations, and people’s behavior and movements, and to find relations between these and the urban form and organization [7]. Urban planners can not only tackle urbanistic challenges to create healthier and more sustainable environments [8], but can nowadays also examine socio-environmental systems at a finer level, supported by extensive human-gathered or geo-location data. Moreover, analytical methods developed and applied in other domains have proven to offer possible synergies between the examined problems and the methods used. For example, today more than ever, different types of rhythmic datasets are being captured; rhythmicity analysis (Footnote 2) has thus become an important aspect in other fields of research as well [9], including the interpretation of sustainability and efficiency trends, such as traffic flows and their oscillation, walkability patterns, energy consumption patterns, and variations in buildings’ performance.

However, when combining heterogeneous data to solve cross-cutting urban issues, it is common to encounter syntactic and semantic discrepancies [10], mainly due to spatial, temporal, or thematic diversity, different capture techniques, and the institutional dispersion of the studied datasets. In particular, national and municipal institutions create and operate datasets for specific purposes, designed for the problems at hand and covering specific areas of interest (from mobility, accessibility, and commuting, through noise and air pollution, to energy consumption, waste management, sociodemographic data, and indicators of people’s commitment, habits, and patterns of behavior). This diversity in data subjects also results in sparse and incompatible datasets, discontinued or disconnected time series, and mutually incompatible data queries characterized by diverse data models and storage structures [11,12,13].

Paradoxically, despite the huge amounts of data generated, obtaining micro-urban, fine-grained records that correspond to higher spatial resolutions and informativeness still often runs into significant data scarcity [14] and thus requires advanced techniques of data integration and repurposing, or new ways of making data interchangeable. For this reason, feasible and serviceable data sources are increasingly being extended towards citizen- and crowd-sourced data. Here, miscellaneous location-stamped data is collected by engaged citizens or end-users. Different citizen science activities also create new learning opportunities, increase scientific literacy among the public, and allow civic participation in important decisions or help foster important sustainability goals [15]. With a continuously growing number of smart and wearable devices and other sensors, the feasibility and richness of crowdsourced and citizen-collected data is increasing [16]; however, it brings additional difficulties in aggregation with other data repositories and makes it necessary to introduce solid indicators and validation techniques for further processing of data queries.

In this paper, we present selected results of two pilot studies within a national research project (Footnote 3) that processed and interchangeably used three different data sources (i.e., Google Directions API [17], governmental road vehicle monitoring, and the WeCount Ljubljana Telraam database [18]), two basic indicators (i.e., traffic counts and travel times), and several different analyses (i.e., cosinor analyses, travel time reliability analyses, and regression modeling) to assess roadway traffic flows on certain strategic routes/trips in the city of Ljubljana, Slovenia. Even though services such as Google Directions API [17] can be used to assess short-term traffic conditions, they cannot be directly applied to the assessment of traffic trends and travel time reliability, which were the focus of our analyses. Our research focused specifically on the network performance of individual motorized traffic on the six routes connecting three neighborhoods with important points in the city, using vehicle count analyses and travel time metrics for the assessment. For the purpose of this paper, we extract and present the data sources, the analyses applied, and the interpretation techniques used in the two studies to showcase the research, the benefits attained, and the difficulties encountered. Selected detailed results have been published in Janež et al. [16] and Verovšek et al. [19].

In this paper, we first briefly explain the three different sources of roadway flow performance data used in the study. We then outline the methodological context of the study, namely the geographical and time frames, and describe the data variables and the necessary data pre-processing. We continue with a description of the types of analyses and measures applied in the study and a demonstration of the results presented through different interpretation techniques, followed by a short discussion and conclusion.

2 Three Sources of Roadway Flow Performance

Commonly, network travel performance and the efficiency of road systems have been estimated from roadway flow rates derived directly from vehicle counts. State-of-practice counting procedures rely on stationary on-road or over-road counting devices, among which inductive loop counters (ILC) embedded in the pavement are by far the most widely used in conventional traffic control systems [20, 21]. We obtained historical traffic count data from the Slovenian Ministry of Infrastructure and the Municipality of Ljubljana (MOL).

Technical advances over the last decade have brought strong development in sensor-based solutions. Coupled with new citizen science initiatives, cost-efficient sensors have been developed and promoted for traffic counting. As part of the H2020 citizen science project WeCount Ljubljana (Footnote 4), which has also been extended to WeCount the Littoral and WeCount Novo mesto in Slovenia, engaged citizens have placed WeCount Telraam sensors on the windows of their homes and offices to count traffic flows on the city's streets. The project features an open-source WeCount Ljubljana Telraam platform, developed with support from the Belgian federal government's Smart Mobility Belgium fund and the European Union's Horizon 2020 research and innovation programme as part of the WeCount project [22]. The platform collects data captured by a low-resolution sensor with a Raspberry Pi module that processes the sensor inputs and sends the count data to the central database [18]. As counting is based on visual input, it can only be done during daylight hours.

Cellular networks have also evolved significantly in the last decade, which has enabled continuous tracking of vehicles and the collection of floating car data [23]. With that, roadway performance can be measured by travel times on the selected routes and the associated travel time reliability. These two measures and their derivatives are far more intuitive and bring a different understanding of traffic situations and patterns. In our study, we used Google Directions API data [17] to obtain the trip durations on the selected routes in the city. The Directions API provides real-time traffic data and modeled estimates of travel times and directions between selected locations to enable real-time vehicle routing. One of our endeavors was to repurpose these measures and effectively couple them with the traffic counts to estimate flow performance in the long run, and eventually to enable interchangeable use of both sources if needed. The possibility of predicting trip durations (and reliability) from traffic counts represents an added value for the assessment. Several studies have sought to improve assessment strategies in this regard, e.g. [24,25,26]; however, comparing the research settings is problematic due to different input data, different information outputs, different travel modes examined (e.g., bus public transport), different geographical contexts, etc.

3 Methodological Context

3.1 Location and Periods

We demonstrated the proposed analysis in different residential districts of Ljubljana, the capital of Slovenia, which has approximately 280,000 residents. We selected routes connecting different strategic destination points of the city. The routes were based on car trips (Fig. 1). The city center and a widely popular shopping center located on the city boundary were selected as the main strategic points. A visual representation of the established routes is available on the interactive map (link). For each route, travel time rates and traffic counts were evaluated in two different periods. In the first study, the observation period covered 4 weeks in October 2020; in our second study, the period and locations were extended for the examination of traffic counts. We used the available 2021 count data of the WeCount Ljubljana Telraam counters and coupled it with the ILC counters for equivalent periods and matching locations.

Fig. 1. An example of a route setting visualized by the Google Directions API interface [17]. The route connects the selected neighborhoods on the outskirts of the city with one of the two strategic destination points, i.e., the city center.

3.2 Data Variables and Pre-processing

Three types of the aforementioned traffic data were used: (i) travel time data obtained on selected routes using Google Directions API; (ii) vehicle count data captured by on-road traffic sensors (magnetic loops) and provided by governmental and municipal traffic services; and (iii) WeCount Ljubljana Telraam vehicle count data, publicly available under the CC-BY-NC license (Footnote 5). Travel times were normalized to obtain the variables of average pace (in seconds per meter) and speed (in meters per second) for each route at a given time. We filtered the data to remove outliers, i.e., measurements deviating by more than three standard deviations from the mean. Traffic data were augmented with (iv) weather data, which we obtained using the Visual Crossing Weather API [27]. Weather data were classified into two categories, namely normal and adverse weather conditions, as described in [19]. All four datasets were aligned by timepoint to further assess their correlations and mutual impacts. Another factor included in the analyses was the type of day. Days were classified into two categories, namely workdays and weekends, as these reflect different traffic patterns, as already described in previous studies [28, 29].
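The normalization and outlier filtering described above can be illustrated with a minimal sketch. The route length, the trip durations, and the function names are hypothetical and chosen only for illustration; the sketch assumes the three-standard-deviation rule is applied to the pace values of one route, which the text does not specify in detail.

```python
from statistics import mean, stdev


def pace_and_speed(duration_s, length_m):
    """Return (pace in s/m, speed in m/s) for one trip on a route."""
    return duration_s / length_m, length_m / duration_s


def remove_outliers(values, n_sigma=3.0):
    """Keep values within n_sigma sample standard deviations of the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) <= n_sigma * sigma]


# Hypothetical trip durations (seconds) on a hypothetical 6,000 m route;
# the last value is an implausible spike that the filter should drop.
route_length_m = 6000.0
durations_s = [600.0 + 5 * (i % 7) for i in range(40)] + [3000.0]

paces = [pace_and_speed(d, route_length_m)[0] for d in durations_s]
clean_paces = remove_outliers(paces)
```

Pace and speed are reciprocal by construction, so either can serve as the length-normalized variable; the filter here is applied once, not iteratively.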

4 Interpretation of Data

4.1 Types of Analyses and Measures

In the two pilot studies of the national research project, we used several different analytical approaches, described below, to interpret the data series.

Cosinor analyses. Travel time data were examined with rhythmicity analysis based on a set of cosinor regression models [9, 30]. We presumed a 24-h main period and determined the number of components in the model for each dataset using the extra sum-of-squares F-test, as described in [9]. We used the selected model to identify the locations, heights, and number of peaks and troughs repeating with a 24-h rhythm. Moreover, we used the constructed models to compare different scenarios on the same route, i.e., differences between workdays and weekends and between normal and adverse weather conditions.
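As a rough illustration of the cosinor approach, the sketch below fits multi-component cosinor models with a 24-h period by ordinary least squares, computes the extra sum-of-squares F statistic for two nested models, and locates the daily peaks of the fitted curve. It is not the implementation used in the study: the pace series is synthetic, all function names are hypothetical, and the linear solver is a plain Gaussian elimination written out only to keep the example dependency-free.

```python
import math
import random

PERIOD_H = 24.0  # main period presumed in the text


def design_row(t, n_components):
    """Regressors at time t (hours): intercept plus one cos/sin pair per harmonic."""
    row = [1.0]
    for k in range(1, n_components + 1):
        angle = 2.0 * math.pi * k * t / PERIOD_H
        row += [math.cos(angle), math.sin(angle)]
    return row


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_cosinor(times, values, n_components):
    """Least-squares fit via the normal equations; returns (coefficients, RSS)."""
    rows = [design_row(t, n_components) for t in times]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(p)]
    coef = solve(xtx, xty)
    rss = sum((y - sum(c * x for c, x in zip(coef, r))) ** 2 for r, y in zip(rows, values))
    return coef, rss


def extra_ss_f(rss_small, p_small, rss_big, p_big, n):
    """Extra sum-of-squares F statistic for nested models (small nested in big).
    A p-value would additionally need the F distribution, e.g. scipy.stats.f.sf."""
    return ((rss_small - rss_big) / (p_big - p_small)) / (rss_big / (n - p_big))


def peaks(coef, n_components, step=0.1):
    """Local maxima of the fitted curve over one period, as (hour, height) pairs."""
    n = int(round(PERIOD_H / step))
    ys = [sum(c * x for c, x in zip(coef, design_row(i * step, n_components))) for i in range(n)]
    return [(i * step, ys[i]) for i in range(n) if ys[i] > ys[i - 1] and ys[i] > ys[(i + 1) % n]]


# Synthetic hourly pace series (s/m) over 14 days: hypothetical values, not study data.
rng = random.Random(0)
times = [float(h) for h in range(14 * 24)]
values = [
    0.10
    + 0.01 * math.cos(2 * math.pi * (t - 12.0) / PERIOD_H)
    + 0.02 * math.cos(4 * math.pi * (t - 7.5) / PERIOD_H)
    + rng.gauss(0.0, 0.002)
    for t in times
]

coef1, rss1 = fit_cosinor(times, values, 1)
coef2, rss2 = fit_cosinor(times, values, 2)
f_stat = extra_ss_f(rss1, len(coef1), rss2, len(coef2), len(values))
daily_peaks = peaks(coef2, 2)
```

A large F statistic favors the model with more components; in this synthetic series the second harmonic is what produces two peaks per day, so the two-component model is the one expected to be selected.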

Travel time index (TTI) and planning time index (PTI). In the subsequent step of the travel time analysis, we introduced two existing measures of travel time reliability, i.e., the travel time index and the planning time index, both calculated on an hourly basis. TTI is defined as the observed average travel time divided by the constant free-flow travel time on the observed route [31], whereas PTI uses the 95th-percentile travel time in place of the average to represent the near-worst-case travel time [32, 33] and is thus, in general, more sensitive to rare events, particularly accidents. The advantage of TTI and PTI is that they can be compared directly on equivalent numeric scales. Since both measures are based on the free-flow (Footnote 6) factor, they enable comparable assessment across different road types.
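The two indices can be sketched as follows, assuming the free-flow travel time of the route is already known as a constant. The observations, the free-flow value, and the function names are hypothetical, and the percentile uses linear interpolation between order statistics, which is one of several common conventions and not necessarily the one used in the study.

```python
from collections import defaultdict
from statistics import mean


def percentile(values, q):
    """Percentile (q in [0, 100]) with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def hourly_tti_pti(observations, free_flow_s):
    """observations: iterable of (hour_of_day, travel_time_s).
    Returns {hour: (TTI, PTI)}, both relative to the free-flow travel time."""
    by_hour = defaultdict(list)
    for hour, travel_time in observations:
        by_hour[hour].append(travel_time)
    return {
        hour: (mean(times) / free_flow_s, percentile(times, 95) / free_flow_s)
        for hour, times in sorted(by_hour.items())
    }


# Hypothetical travel times (seconds) for one route at two hours of the day.
free_flow_s = 600.0  # assumed free-flow travel time of the route
observations = [
    (3, 600.0), (3, 610.0), (3, 620.0),
    (8, 700.0), (8, 720.0), (8, 760.0), (8, 1200.0),
]
indices = hourly_tti_pti(observations, free_flow_s)
tti_8, pti_8 = indices[8]
```

In this example the single long trip at 8:00 raises the PTI much more than the TTI, which mirrors the sensitivity of the PTI to rare events noted above.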

Regression of ILC data. We observed eight road segments with currently matching ILC and WeCount Ljubljana Telraam counters (overlapping location, equivalent direction of counting, and matching timestamps) and estimated their interchangeability based on regression analyses. We tested different regression models using different sets of features to find the best-performing model for each scenario; the quality of predictions was assessed using the coefficient of determination (R²). The regression models were trained on 70% of the data and tested on the remaining 30%. Grid search cross-validation was applied to identify the optimal hyperparameter values for each model. The whole framework was implemented in Python, relying on the scikit-learn library [35].
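A minimal sketch of such a workflow in scikit-learn is shown below. The data are synthetic stand-ins, and the choice of a random forest, the feature set (Telraam count and hour of day), and the hyperparameter grid are assumptions made for illustration; the models and features actually compared in the study are reported in [16].

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: hourly Telraam-style counts and matching ILC counts.
hours = rng.integers(6, 20, size=400)  # daylight hours only
telraam = rng.uniform(50, 500, size=400)
ilc = 1.3 * telraam + 10 * np.sin(hours / 24 * 2 * np.pi) + rng.normal(0, 20, size=400)

# Assumed feature set: Telraam count and hour of day.
X = np.column_stack([telraam, hours])

# 70/30 split between training and test data, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(X, ilc, test_size=0.3, random_state=0)

# Grid search with cross-validation on the training part only.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 6, None]},
    scoring="r2",
    cv=5,
)
search.fit(X_train, y_train)

# R2 of the best model on the held-out 30%.
r2_test = r2_score(y_test, search.predict(X_test))
```

Keeping the grid search inside the training split means the test-set R² is not used for model selection, so it remains an honest estimate of predictive quality.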

Predicting travel times using count data. We analyzed the possibility of predicting the travel times on a route from the count data. First, we identified the counters located on a selected route. Second, we assessed the Pearson and Spearman correlation coefficients between the travel times and the count data. Third, we employed simple linear and exponential models to predict the travel time data from vehicle counts for a route. The prediction quality was again assessed using the R² metric, obtained with 10-fold cross-validation.
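The steps above can be sketched in plain Python as follows. The count and pace values are synthetic, all function names are hypothetical, and two details are assumptions rather than the study's documented choices: the exponential model is taken to be pace = a·exp(b·count) fitted by linear regression on log(pace), and the ten folds are formed by interleaving the observations.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def fit_line(x, y):
    """Least-squares slope and intercept."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def r2(y_true, y_pred):
    """Coefficient of determination."""
    my = mean(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - my) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def cv_r2(x, y, exponential=False, folds=10):
    """Mean R2 over interleaved folds; the exponential model is fitted on log(y)."""
    scores = []
    for f in range(folds):
        train = [i for i in range(len(x)) if i % folds != f]
        test = [i for i in range(len(x)) if i % folds == f]
        xs = [x[i] for i in train]
        ys = [math.log(y[i]) if exponential else y[i] for i in train]
        slope, intercept = fit_line(xs, ys)
        pred = [slope * x[i] + intercept for i in test]
        if exponential:
            pred = [math.exp(p) for p in pred]
        scores.append(r2([y[i] for i in test], pred))
    return mean(scores)


# Synthetic hourly counts (vehicles/hour) and paces (s/m): hypothetical, not study data.
counts = [float((137 * i) % 1500 + 100) for i in range(200)]
paces = [0.06 * math.exp(0.0008 * c) * (1 + 0.02 * math.sin(i)) for i, c in enumerate(counts)]

r_p = pearson(counts, paces)
r_s = spearman(counts, paces)
r2_lin = cv_r2(counts, paces)
r2_exp = cv_r2(counts, paces, exponential=True)
```

Because the synthetic pace grows exponentially with the count, the exponential model scores higher here; on real data, comparing the two cross-validated R² values is what indicates which form fits a given route better.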

4.2 Representation of the Results

The results of the two pilot studies of the national research project overviewed here were presented using different interpretation techniques. The detailed results are beyond the scope of this paper; however, we present several selected aspects of the results.

Distributions of travel times. Analyzing the distributions of travel time values is vital, since different distributions might require different steps in further quantitative analyses. We visualized the frequency distributions with different boxplot-type and violin-plot-type graphs. Violin plots enable the assessment of multimodality in the distributions and allow the basic differences between the routes to be estimated with regard to the median, quartiles and outliers, distribution width, and skewness. To normalize the travel time values with regard to the length of a route, we used the space-mean speed (in meters per second) or the average travel time rate, i.e., pace (in seconds per meter), which are the exact inverses of each other. In the next step, we rescaled the obtained pace measurements to 24 h and plotted their distributions by type of day (i.e., workday or weekend) and by weather conditions (i.e., normal or adverse); see Fig. 2.

The assessed distributions of travel times not only guide further analyses but also provide a baseline for defining travel time reliability metrics. From a measurement perspective, reliability is quantified for a given trip over a significant timespan. It may be viewed from different perspectives, focusing on the travel time distributions within the course of a day, from day to day, or even within a month or a season of the year [31]. A balanced summary of travel time measures and reliability performance is given in [36], which recommends a specific (e.g., 95th) percentile travel time, which also corresponds to the buffer index. Reliability metrics also include on-time measures, such as the percentage of trips completed within a travel time threshold, and failure measures, such as the percentage of trips that exceed a travel time threshold [37].
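For illustration, the sketch below computes a buffer index and an on-time share for a small set of hypothetical travel times. It assumes the commonly used definition of the buffer index as the difference between the 95th-percentile and the mean travel time, divided by the mean; the threshold value and the function names are likewise assumptions.

```python
from statistics import mean, quantiles


def p95(times):
    """95th-percentile travel time (inclusive method, linear interpolation)."""
    return quantiles(times, n=20, method="inclusive")[-1]


def buffer_index(times):
    """Extra time, relative to the mean, needed to arrive on time in 95% of trips."""
    avg = mean(times)
    return (p95(times) - avg) / avg


def on_time_share(times, threshold_s):
    """Share of trips completed within the travel time threshold."""
    return sum(1 for t in times if t <= threshold_s) / len(times)


# Hypothetical travel times (seconds) for one trip observed over several days.
times_s = [700.0, 720.0, 760.0, 1200.0]

bi = buffer_index(times_s)
share_ok = on_time_share(times_s, threshold_s=800.0)
share_failed = 1.0 - share_ok  # failure measure: trips exceeding the threshold
```

The on-time and failure shares always sum to one, so in practice only one of them needs to be reported for a given threshold.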

Many metrics are expressed relative to the free-flow travel time, i.e., travel time in low traffic-flow conditions, which is becoming the benchmark for travel time and reliability analysis. Moreover, while the distributions show actual travel times, their values are commonly normalized to obtain travel rates that enable comparative analyses across routes of different lengths (Fig. 3).

Fig. 2. An example of the preliminary analyses: pace distributions on all six routes divided by type of day, workday/weekend (right plot), and by weather, normal/adverse (left plot). Figure adapted from [19]. All the segments share similar distributions except segment 5, which represents a highway road segment. It is evident that the type of day (workday/weekend, right plot) has a significantly stronger effect on traffic than the weather conditions (left plot).

Rhythmicity trends. We analyzed the rhythmicity trends of travel times, i.e., trends repeating with a predefined period (set to 24 h in our case), in more detail by applying cosinor models of different complexity to the data obtained on workdays and weekends and under normal and adverse weather conditions. Figure 4 summarizes the results of the cosinor models obtained on Route 4 for different types of day (i.e., workdays or weekends). Measured travel times are plotted against the fitted curves to determine the level of consistency between a model and the underlying data. We assessed the overall significance of each fit using the F-test. Quantile-quantile (Q-Q) plots were also analyzed to assess the goodness of fit of each model [34].

Fig. 3. An example of the travel time index and planning time index over the 24-h course for Route 4; a comparison of normal and adverse weather situations. The shaded areas represent the corresponding 95% confidence intervals. Two peaks are expressed during the day in both adverse and normal weather situations; however, the travel time variability is evidently higher in the case of adverse weather.

TTI and PTI trends. TTI and PTI values can be employed to systematically assess travel time trends and to compare the pervasiveness of “normal” delays with that of “exceptional” delays (Fig. 3). Typical delays, together with the normal trip durations, can be assessed using the TTI. The PTI, on the other hand, extends the TTI with the surplus delays. In this context, the surplus delays represent the unexpected delays caused by non-recurrent events.

Fig. 4. An example of the travel time rate (pace) distributions on Route 4; a comparison between workdays (above) and weekends (below) with the residual Q-Q visual check plots. While the cosinor model can satisfactorily describe the observed data during the weekends, its goodness-of-fit is lower on the workdays. The obtained results indicate that the observed data can be described with the selected cosinor models.

Pair-plotting: travel times and traffic counts. We compared the trip durations with the vehicle counts per hour for the equivalent period utilizing correlation analyses. Figure 5 illustrates the correlations between the average pace [s/m] and the network flow [vehicles/hour] captured by stationary ILC counters (0174-1, 0855-1, 1010-1) on Route 5.

Fig. 5. An example of the correlations between the average pace [s/m] and the network flow [vehicles/hour] on Route 5. rP and rS denote Pearson’s and Spearman's rank correlation coefficients, respectively. Figure adapted from [19]. The obtained results indicate that the average pace is highly correlated with all of the observed counters (Spearman’s rank correlation values equal to 0.88).

Pair-plotting: ILC counts and WeCount Ljubljana Telraam counts. We matched and compared the ILCs with WeCount Ljubljana Telraam counters on the observed road. Since WeCount Ljubljana Telraam counters might face several issues (e.g., due to low light conditions or obstruction), we employed different machine learning regression models, which are more robust to data anomalies and do not presume specific distributions of input data.

Regressing ILC data. We further estimated the interchangeability of the ILC counters with WeCount Ljubljana Telraam counters in different scenarios using regression analyses. Figure 6 presents the results of the regression analysis using the top five performing models on a selected road in two different directions. The majority of the presented models showed a relatively strong predictive capability, which indicates that pre-trained regression models could be used to infer the count data of one counting source from an alternative counting infrastructure.

Fig. 6. The predictive power of the top five regression models on a road segment (Ižanska cesta) in different directions. R² (test) denotes the coefficient of determination evaluated on the test set. Figure adapted from [16]. Our results indicate that the ILC data can be accurately predicted from the WeCount Ljubljana Telraam counters with the majority of the selected models.

5 Discussion and Conclusion

Our results indicate that the observed datasets can be used interchangeably in certain cases. Count data can be obtained using different types of counting infrastructure and can then be employed to assess the travel times on a selected route. Even though travel times can be expressed with traffic counts using relatively simple regression models, substituting one type of count data with another requires precise tuning of the regression models, which limits generalization. For example, our analyses were performed in selected parts of a specific city in a specific season. To transfer the obtained models to different cities, neighborhoods, or even the same road segments under different conditions, the models would require additional training on additional input data. This is the main limitation of the presented approach, which we aim to address in our future work. We could partially address this problem by applying unsupervised machine learning approaches, such as clustering of sensors and their locations. The main benefit of such an (unsupervised) approach is that it reduces the need for prior training of the models. However, its feasibility needs to be verified on the proposed testbed or beyond.

One of the crucial questions that arises here is the actual feasibility of using diverse citizen science databases (present in different geographical environments, capturing data of different scopes, and with diverse degrees of dispersion) in scientific research. A dense geographical resolution and solid distribution of citizens’ sensors can enable more accurate results with better prediction capabilities. Namely, the broader use of a large number of less reliable sensors to improve the accuracy of the obtained data has been employed in different engineering disciplines (see [38]). An example vividly illustrating the concept of increasing accuracy through redundancy was reported by Weis et al. [39], who employed several imprecise watches to obtain a highly accurate clock. Using an adequate number of less reliable Telraam sensors might provide more accurate results in relation to the corresponding ILC counter, as possible outages could be compensated for by redundant sensors in the direct vicinity. In our pilot case, we were able to obtain relatively accurate predictions of count data on the Ižanska and Dunajska roads by employing one or two Telraam counters.

More than 1700 Telraam sensors are in operation today, mostly in Europe. The highest density of sensors is in Belgium and the Netherlands. Smaller concentrations are also found around Dublin and Cardiff and in Slovenia. Telraam is wholly owned and maintained by Rear Window BV (BE0762.549.266), a spin-off initiative of TML, Mobiel 21 and Waanz.in [18]. During the WeCount project, project partners from Leuven, Madrid/Barcelona, Cardiff, Dublin, and Ljubljana upgraded the already established Telraam sensor and platform, with a focus on usability and the aim of making the technology friendlier in order to reach a wider range of Telraam users. A variety of interactions between the project teams and local citizens supported this process, with a range of engagement methods, guidelines, and recommendations to identify and promote local communities. The different cultural contexts and approaches to recruiting and engaging citizens, as well as the different urban structures in the cities, all of which influence the effectiveness of the sensor distributions, remain the most important and demanding aspect of such a citizen science approach.