1 Introduction

The illegal parking of bicycles around railway stations is becoming an urban problem in Japan and other countries. Growing awareness of health problems [1] and energy conservation [2] led to a 2.6-fold increase in bicycle ownership in Japan from 1970 to 2013. Together with a shortage of bicycle parking spaces, a lack of public knowledge of bicycle parking laws has made illegally parked bicycles increasingly prevalent. Illegally parked bicycles block vehicle and foot traffic, cause road accidents, encourage theft, and disfigure streets. Furthermore, the broken windows theory [3] suggests that, by increasing urban disorder, they may lead to an increase in minor offenses.

Thus, in order to raise public awareness of this urban problem, we considered it necessary to publish data about the daily situation of illegally parked bicycles as Open Data. Open Data is data that can be freely used, re-used and redistributed by anyone [4]. It is recommended that Open Data be structured according to the Resource Description Framework (RDF), which is the W3C-recommended data model, and that relevant links be created between the data elements. This is called Linked Open Data (LOD) [5]. In recent years, the community that publishes LOD on the Web has become more active. Publishing urban problem data on the Web as LOD will allow users to develop information services that can contribute to solving urban problems. For example, LOD about illegally parked bicycles would make it possible to visualize illegally parked bicycles, suggest optimal locations for bicycle parking spaces, and support the removal of illegally parked bicycles. However, the Open Data sets currently available for illegally parked bicycles are coarse, and it is difficult for services to utilize the data. In addition, other data concerning issues such as bicycle parking and government statistics have been published in a variety of formats. Hence, the unification of data formats and the definition of schemata for data storage are important issues that need to be addressed.

In this study, we collect data about illegally parked bicycles from Twitter, together with data on the attributes that affect the number of illegally parked bicycles. To facilitate the reuse of these data sets, which have different formats, we define schemata, unify the data formats, and publish the data on the Web as LOD. Moreover, we estimate the missing data (the number of illegally parked bicycles) using Bayesian networks. Our predictions take into consideration attributes such as time, weather, nearby bicycle parking information, and nearby points of interest (POI). However, because some observations lack these attribute values, the missing attribute values are also complemented based on the semantics of the LOD. We then use Bayesian networks to estimate the number of illegally parked bicycles for the data sets whose attributes have been complemented. These results are also incorporated into the LOD with a particular property. In addition, we develop a service that visualizes illegally parked bicycles using the constructed LOD. This visualization service raises awareness of the issue among local residents and prompts users to provide more information about illegally parked bicycles. This study is therefore divided into the following six phases. Phases (2) to (6) are executed repeatedly as more input data become available.

  1. Designing LOD schema.
  2. Collecting observation data and attribute data.
  3. Building of LOD based on schema.
  4. Complementing missing attribute values using LOD.
  5. Using Bayesian networks to estimate the missing number of illegally parked bicycles at each location.
  6. Visualization of illegally parked bicycles using LOD.

Thus, we build LOD while collecting data and complementing the missing data. The service that visualizes illegally parked bicycles will give local residents an incentive to report infractions, which are in turn published as Open Data. In this manner, we aim to solve the problem of illegally parked bicycles by building an ecosystem for Open Urban Data.

The remainder of this paper is organized as follows. In Sect. 2, an overview of sensor LOD and crowdsourcing is given. In Sect. 3, our techniques for data collection and building LOD are described. In Sect. 4, two approaches are described: complementing the missing attribute values, and estimating the number of illegally parked bicycles using Bayesian networks. We also evaluate our results and summarize our findings there. In Sect. 5, the visualization of the LOD is described. Finally, in Sect. 6, we discuss some possible directions for future research that have arisen from our work.

2 Related Work

In most cases, LOD sets have been built from existing databases. However, little LOD providing sensor data for urban problems is available so far. Thus, methods for collecting new data are required to build Linked Open Urban Data. Data collection methods for building Open Data include crowdsourcing and gamification, and a number of projects have employed these techniques. OpenStreetMap is a project that creates an open map using crowdsourced data. Anyone can edit the map, and the data are published as Open Data. FixMyStreet is a platform for reporting regional problems such as road conditions and illegal dumping. By crowdsourcing the collection of information, FixMyStreet has allowed regional problems to be solved more quickly than before. Zook et al. [6] reported a case in which crowdsourcing was used to link published satellite images with OpenStreetMap after the Haitian earthquake. A map for the relief effort was created, and the data were published as Open Data. Celino et al. [7] proposed an approach for editing and adding Linked Data using a Game with a Purpose (GWAP) and Human Computation. However, since the data concerning illegally parked bicycles are time-series data, it is difficult to collect them using these approaches. Therefore, new techniques are required, and we propose a method to build Open Urban Data while complementing the missing data.

There are also studies on building Linked Data for cities. Lopez et al. [8] proposed a platform that publishes sensor data as Linked Data. The platform collects stream data from sensors and publishes RDF in real time using IBM InfoSphere Streams and C-SPARQL [9]. The system is used in Dublinked2, a data portal for Dublin, Ireland, and publishes information on bus routes, delays, and congestion, updated every 20 s. However, since embedding sensors is costly, this approach is not suitable for our study.

Furthermore, Bischof et al. [10] proposed a method for the collection, complementation, and republishing of data as Linked Data, as in our study. This method collects data from DBpedia [11], Urban Audit, the United Nations Statistics Division (UNSD), and the U.S. Census, and then exploits the similarity among such large Open Data sets on the Web. However, we could not find corresponding data sets for our problem and thus could not apply the same approach to our study.

Fig. 1. Overview of this study

3 Collection of Observation Data and Building of LOD

Figure 1 provides an overview of this study. The LOD building system collects data about illegally parked bicycles, builds LOD with a fixed schema, complements the missing attribute values, and estimates the missing data (the number of illegally parked bicycles). The system thus generates Open Urban Data continuously while integrating government data and existing LOD. The web application posts information about illegally parked bicycles to Twitter and visualizes their distribution on a map.

3.1 Collection of Observation Data

We began by collecting tweets containing location information, pictures, hashtags, and the number of illegally parked bicycles. However, obtaining correct locations from Twitter is difficult, since mobile phones often attach incorrect location information. Mobile phones are equipped with inexpensive GPS chips, whose accuracy is known to degrade with weather conditions and in areas of GPS interference [14]. To address this problem, we developed a web application that enables users to post to Twitter after correcting their location information, and we made an announcement asking the public to post tweets about illegally parked bicycles using this application. Figure 2 shows a screenshot of this application. After OAuth authentication, a form and buttons are shown. When the location button is pressed, a marker is displayed at the user’s current location on a map. The marker is draggable, allowing users to correct their location information. When users add their location information, enter the number of illegally parked bicycles, take pictures, and submit them, a tweet including this information and a hashtag is posted.

The data were collected from January 2015 until September 2015. The LOD was built from the observation data. In order to estimate the number of illegally parked bicycles using Bayesian networks, data for the attributes considered to be causes of illegal bicycle parking were also required. For this purpose, meteorological data were acquired from the website of the Japan Meteorological Agency (JMA), and bicycle parking data were acquired from the websites of municipalities.

Fig. 2. Screenshots of the web application

3.2 Schema Design and Building of LOD

Illegally Parked Bicycles LOD Schema Design. When LOD is built on a well-known ontology, it becomes possible to make inferences based on that ontology. In addition, it reduces the labor needed to understand the different data structures of each LOD set. The observation data for illegally parked bicycles resemble sensor data, since they are time-series data that include location, date, and time information. Our schema for the illegally parked bicycles LOD was therefore designed with reference to the Semantic Sensor Network Ontology. Figure 3 shows part of the illegally parked bicycles LOD. Sensors and monitoring cameras are not used in this study, so the people observing illegally parked bicycles are considered to be virtual sensors and are included as instances of the Sensor class. Since this LOD links to DBpedia Japanese and GeoNames.jp, people and programs can acquire additional information by following the links. DBpedia Japanese is the LOD of Japanese Wikipedia and a hub of the LOD cloud. GeoNames.jp is the URI base of Japanese place names. With this schema definition, longitude and latitude can be acquired as numerical values that are easy for programs to use. Moreover, it is possible to search by specified time and area ranges using the SPARQL Protocol and RDF Query Language (SPARQL). In Fig. 3, the observation of 15 illegally parked bicycles in front of Fuchu Station at 20:24:15 on June 18, 2015 is represented as an RDF graph.
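As an illustration of a search over a time range and an area range, the following sketch builds a SPARQL query string. The actual class and property names of the schema are not given in this section, so every `ex:` term and the `http://example.org/ipb#` namespace are hypothetical placeholders; only `geo:lat` and `geo:long` (W3C WGS84 vocabulary) and `xsd:dateTime` are real terms. The query also assumes that latitude and longitude are stored as numeric literals.

```python
def build_range_query(start_iso, end_iso, south, west, north, east):
    """Build a SPARQL query for observations inside a time window and bounding box.

    All ex: terms are hypothetical stand-ins for the real schema.
    """
    return f"""
PREFIX ex:  <http://example.org/ipb#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?obs ?time ?count ?lat ?long
WHERE {{
  ?obs ex:observedAt ?point ;
       ex:observedTime ?time ;
       ex:bicycleCount ?count .
  ?point geo:lat ?lat ;
         geo:long ?long .
  FILTER (?time >= "{start_iso}"^^xsd:dateTime && ?time <= "{end_iso}"^^xsd:dateTime)
  FILTER (?lat >= {south} && ?lat <= {north} && ?long >= {west} && ?long <= {east})
}}
ORDER BY ?time
"""


# Example: one evening hour inside an illustrative bounding box.
query = build_range_query(
    "2015-06-18T20:00:00", "2015-06-18T21:00:00",
    south=35.66, west=139.47, north=35.68, east=139.49,
)
```

The resulting string could be sent to any SPARQL endpoint once the placeholder terms are replaced by those of the published schema.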

Fig. 3. Part of the illegally parked bicycles LOD

Building of Illegally Parked Bicycles LOD. The collected data about illegally parked bicycles are converted to LOD based on the designed schema. First, the server program collects, in real time, tweets containing particular hashtags, location information, and the number of illegally parked bicycles. The number of illegally parked bicycles is extracted from the tweet text using regular expressions.
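The extraction step can be sketched as follows. The tweet template and the regular expression actually used are not given in the text, so the pattern below (a number followed by a unit word such as "台" or "bicycles") and the function name `extract_bicycle_count` are assumptions made for illustration.

```python
import re

# Hypothetical tweet template: a count followed by a unit word,
# e.g. "15 bicycles" or "15台". The real template is not specified.
COUNT_PATTERN = re.compile(r"(\d+)\s*(?:台|bicycles?|bikes?)", re.IGNORECASE)


def extract_bicycle_count(tweet_text):
    """Return the first bicycle count found in the tweet text, or None."""
    match = COUNT_PATTERN.search(tweet_text)
    return int(match.group(1)) if match else None
```

Returning `None` rather than 0 when nothing matches keeps "no count reported" distinct from an observation of zero bicycles.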

Next, the program uses the latitude and longitude of the tweet to check whether an existing observation point lies within a radius of 30 m. If there is no such observation point in the illegally parked bicycles LOD, the location is added as a new observation point. To add a new observation point, information on the nearest POI is obtained using the Google Places API and the Foursquare API, and the new observation point is generated based on the name of the nearest POI.
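The 30 m check might look like the sketch below. The text does not say how the distance is computed, so the haversine great-circle formula is an assumption, as are the function names and the representation of observation points as a dictionary from name to `(latitude, longitude)`.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres
MATCH_RADIUS_M = 30         # radius stated in the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_observation_point(lat, lon, points):
    """Return the name of the nearest point closer than 30 m, or None.

    `points` maps a point name to a (latitude, longitude) pair.
    """
    best_name, best_dist = None, MATCH_RADIUS_M
    for name, (plat, plon) in points.items():
        d = haversine_m(lat, lon, plat, plon)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name
```

A `None` result corresponds to the case in which a new observation point has to be created from the nearest POI.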

Then the address, prefecture name, and city name are obtained using the Yahoo! reverse geocoder API, and links to GeoNames.jp are generated from the obtained information. GeoNames.jp is a Japanese geographical database. This process is necessary for integration with other data.

Fig. 4. Part of the integrated LOD

After tweets are collected, information about the observation points is obtained using Web APIs, and an RDF graph is then added to the illegally parked bicycles LOD in real time.

Building of LOD for Attributes. Since the data sets acquired in Sect. 3.1 are in a number of different formats, they have poor reusability. Therefore, it was necessary to design schemata for these data and build the LOD. We designed the weather LOD schema with reference to the Weather Ontology. Based on this schema, we converted the data acquired from the JMA website into LOD. We also designed a bicycle parking LOD schema and converted the data acquired from the websites of municipalities into LOD. A bicycle parking resource is an instance of the “Bicycle_Parking” class, and has properties for location information, shape, and the maximum number of bicycles that can be accommodated. Furthermore, the weather LOD and the bicycle parking LOD were linked to the illegally parked bicycles LOD. Figure 4 shows part of the integrated LOD. The LOD have also been published via a SPARQL endpoint. Thus, it is possible to link the number of illegally parked bicycles with the weather data and the data about nearby bicycle parking areas.

4 Complementing and Estimating Missing Data

Since we rely on members of the public to observe illegally parked bicycles, we do not have round-the-clock data for any place, and so there are missing data in the illegally parked bicycles LOD. There are also missing geographical data, since we do not have exhaustive knowledge of all the places where bicycles might be illegally parked. Because the number of illegally parked bicycles is influenced by several attributes, we estimate the missing data using Bayesian networks. We considered geographical features and weather to be the major attributes affecting the illegal parking of bicycles, and so we used data about these attributes in our estimation. However, some attribute values are also missing. We therefore first complement these attribute values from similar observation data, which are found using SPARQL searches on the illegally parked bicycles LOD, DBpedia Japanese, and the Japanese WordNet RDF [12]. Figure 5 illustrates the complementation process for the missing attribute values. After the complementation, the number of illegally parked bicycles is estimated using Bayesian networks.

Fig. 5. Complementation of missing attribute values

4.1 Complementing of Missing Attribute Values

In this paper, we consider seven attributes: day of week, time, precipitation (true = 1 or false = 0), the nearest POI, distance to the nearest station, distance to the nearest bicycle parking area, and the maximum number of bicycles that can be accommodated in the nearest bicycle parking area. As an example, we explain our approach for the case where the maximum number of bicycles that can be accommodated is missing. Let the sets of values of the attributes be day of week \(A=\{sun, mon,...,sat\}\), time \(B=\{0,1,...,23\}\), precipitation \(C=\{0,1\}\), distance to the nearest station \(D=\{0,1,...\}\), distance to the nearest bicycle parking area \(E=\{0,1,...\}\), category of POI \(F=\{0,1,...\}\), the maximum accommodation number \(G=\{0,1,...\}\), and the number of illegally parked bicycles \(H=\{1,2,...,6\}\). The observation data are then stored as a set O of vectors \(o\in A\times {}B\times {}...\times {}H\). The number of illegally parked bicycles is divided into six classes: 0–10, 11–20, 21–30, 31–40, 41–50, and 51–60. A missing attribute value is complemented with the corresponding attribute value of the most similar observation found by searching the observation data. When \(o_1\) is an observation with a missing attribute value and \(o_2\) is a candidate source for complementing it, the similarity of \(o_1\) and \(o_2\) is calculated using the distance formula in Eq. 1.

$$\begin{aligned} Dist(o_{1},o_{2})=\sum _{x\in X}\frac{|sub(o^x_1,o^x_2)|}{max(x)}+\sum _{y\in Y}\frac{propCost(o^y_1,o^y_2)}{max(propCost(o^y_1,o^y_2))}, \end{aligned} \tag{1}$$

where X is the set of attributes with numerical values. If the value of the maximum accommodation number is missing, \(X=\{B,C,D,E,H\}\). In addition, Y is the set of attributes whose values are not numeric, so \(Y=\{A,F\}\). Moreover, \(o^x_1\) denotes the value of the attribute x in \(o_1\), \(sub(o^x_1, o^x_2)\) is the difference between \(o^x_1\) and \(o^x_2\), and max(x) is the maximum value of the difference in attribute x. Note that the maximum value of the difference in time is 12. The differences between the distances to the nearest station and the differences between the distances to the nearest bicycle parking area are given as numerical values in the range 0–11, where each unit corresponds to 20 m of distance. The term \(propCost(o^y_1,o^y_2)\) denotes the distance between \(o^y_1\) and \(o^y_2\) on DBpedia Japanese and Japanese WordNet RDF. The value of \(propCost(o^y_1,o^y_2)\) may therefore be interpreted as the total cost required to travel from \(o^y_1\) to \(o^y_2\) on these LOD sets. The right side of Fig. 5 shows an example of this process.
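A minimal sketch of Eq. 1 follows, under several stated assumptions. The attribute names, the dictionary representation of an observation, and the function names are hypothetical. The time difference is assumed to be measured cyclically over 24 hours (which is one way to obtain the stated maximum of 12), the maximum difference of the bicycle-count class is assumed to be 5 (classes 1 to 6), and `prop_cost` is passed in as a callable whose maximum is taken to be the value of 12 given in the text.

```python
# Maximum differences max(x) for the numeric attributes (X).
# 12 and 11 follow the text; 1 and 5 are inferred from the value ranges.
MAX_DIFF = {
    "time": 12,
    "precipitation": 1,
    "station_dist": 11,
    "parking_dist": 11,
    "count_class": 5,
}
MAX_PROP_COST = 12  # maximum of propCost stated in the text


def sub(attr, v1, v2):
    """Absolute difference of two numeric attribute values."""
    diff = abs(v1 - v2)
    if attr == "time":
        # Assumption: hours wrap around, so 23:00 and 01:00 differ by 2.
        diff = min(diff, 24 - diff)
    return diff


def dist(o1, o2, numeric_attrs, categorical_attrs, prop_cost):
    """Eq. 1: sum of normalized numeric differences and normalized property costs."""
    total = 0.0
    for x in numeric_attrs:
        total += sub(x, o1[x], o2[x]) / MAX_DIFF[x]
    for y in categorical_attrs:
        total += min(prop_cost(o1[y], o2[y]), MAX_PROP_COST) / MAX_PROP_COST
    return total


# Toy usage with a stand-in cost: 0 for identical values, the maximum otherwise.
NUMERIC = ["time", "precipitation", "station_dist", "parking_dist", "count_class"]
CATEGORICAL = ["day", "poi"]
toy_cost = lambda a, b: 0 if a == b else MAX_PROP_COST

o1 = {"time": 23, "precipitation": 0, "station_dist": 3, "parking_dist": 2,
      "count_class": 1, "day": "mon", "poi": "Bakery"}
o2 = dict(o1, time=1)
d = dist(o1, o2, NUMERIC, CATEGORICAL, toy_cost)  # only the time term is non-zero
```

Because every term is divided by its maximum, each attribute contributes at most 1 to the total, so no single attribute dominates the comparison.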

Table 1. Semantics and costs of properties (owl: http://www.w3.org/2002/07/owl#, skos: http://www.w3.org/2004/02/skos/core#, dbpedia-owl: http://dbpedia.org/ontology/, wn20schema: http://www.w3.org/2006/03/wn/wn20/schema/, rdfs: http://www.w3.org/2000/01/rdf-schema#)

Furthermore, properties are classified into four semantic categories, and each category is assigned a cost. Table 1 shows these categories and the corresponding property costs. The value of \(propCost(o^y_1,o^y_2)\) is the sum of the costs of the properties traversed on the path from \(o^y_1\) to \(o^y_2\). The maximum value of \(propCost(o^y_1,o^y_2)\) is 12, obtained by multiplying the cost of the other properties by 6, following the hypothesis of Six Degrees of Separation [13]. After finding the pair \((o_1, o_2)\) of observation data for which \(Dist(o_1,o_2)\) is minimized, the missing attribute value of \(o_1\) is filled in with the corresponding value of \(o_2\).
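One way to compute such a path cost is a cheapest-path search over the triples, sketched below with Dijkstra's algorithm. This is an assumption about the procedure, not a description of the authors' implementation. The per-property costs in `PROPERTY_COST` are placeholders and do not reproduce Table 1; only the default cost of 2 follows from the text (12 = 6 × 2). Traversing each triple in both directions is also an assumption, and the example triples are made up.

```python
import heapq

MAX_PROP_COST = 12  # cap stated in the text
# Placeholder costs only: these are NOT the values of Table 1.
PROPERTY_COST = {"owl:sameAs": 0, "rdfs:subClassOf": 1}
DEFAULT_COST = 2    # cost of "other" properties, implied by 12 = 6 x 2


def prop_cost(triples, source, target):
    """Cheapest total property cost from source to target, capped at MAX_PROP_COST."""
    adjacency = {}
    for s, p, o in triples:
        cost = PROPERTY_COST.get(p, DEFAULT_COST)
        adjacency.setdefault(s, []).append((o, cost))
        adjacency.setdefault(o, []).append((s, cost))  # assumption: links work both ways

    best = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, MAX_PROP_COST):
            continue  # stale queue entry
        for neighbour, step in adjacency.get(node, []):
            new_cost = cost + step
            if new_cost < MAX_PROP_COST and new_cost < best.get(neighbour, MAX_PROP_COST):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return MAX_PROP_COST  # unreachable or too far: use the maximum


# Made-up triples for illustration.
triples = [
    ("Bakery", "rdfs:subClassOf", "Shop"),
    ("Cafe", "rdfs:subClassOf", "Shop"),
    ("Shop", "ex:relatedTo", "Station"),
]
```

Capping the result at 12 means that unrelated categories all sit at the same maximum distance, which keeps the second sum of Eq. 1 within the range 0 to 1 per attribute.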

4.2 Estimating the Number of Illegally Parked Bicycles Using Bayesian Networks

We estimate the number of illegally parked bicycles when the count is missing. The input is the dataset complemented using the method described in Sect. 4.1. Bayesian networks are probabilistic graphical models that represent causal relationships among the variables of interest. Since we consider the number of illegally parked bicycles to be causally related to the day of the week, weather, and surroundings, we use Bayesian networks for our estimations. We use the Weka tool to estimate the unknown numbers of illegally parked bicycles. The input data is the set O, which consists of vectors with eight elements. There are 747 observations. We used HillClimb as the search algorithm together with a Markov blanket classifier. The estimated data are added to the illegally parked bicycles LOD with a particular property. More details are given in the evaluation in Sect. 4.3.
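To make the estimation step concrete, the sketch below uses a naive Bayes classifier, which is the simplest Bayesian network: the class node is the only parent of every attribute. It is not the network learned by Weka with HillClimb, and it is written in plain Python rather than with Weka. The class and function names, the Laplace smoothing parameter `alpha`, and the toy observations are all assumptions made for illustration; the six count classes follow the ranges given in Sect. 4.1.

```python
import math
from collections import Counter, defaultdict


def count_to_class(count):
    """Map a bicycle count to classes 1..6 (0-10, 11-20, ..., 51-60)."""
    if not 0 <= count <= 60:
        raise ValueError("counts outside 0-60 are not covered by the six classes")
    return max(1, (count + 9) // 10)


class NaiveBayes:
    """Discrete naive Bayes with Laplace smoothing (a stand-in, not Weka's BayesNet)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # value_counts[(attribute index, class)][value] = frequency
        self.value_counts = defaultdict(Counter)
        self.values = [set() for _ in rows[0]]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for i, value in enumerate(row):
                # Values unseen in training receive only the pseudo-count alpha.
                num = self.value_counts[(i, label)][value] + self.alpha
                den = count + self.alpha * len(self.values[i])
                score += math.log(num / den)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up observations: ((day of week, precipitation), observed count).
observations = [(("mon", 1), 3), (("mon", 1), 8), (("sat", 0), 14), (("sat", 0), 18)]
rows = [attrs for attrs, _ in observations]
labels = [count_to_class(count) for _, count in observations]
model = NaiveBayes().fit(rows, labels)
```

A learned structure such as the one HillClimb searches for can capture dependencies between attributes that this independence assumption ignores.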

Table 2. Statistics for observation data

4.3 Evaluation and Discussion

In total, 747 observations were collected from January 1 to September 20, 2015. The number of triples (records in the database) included in the illegally parked bicycles LOD was 98,315. Table 2 shows statistics about the observation data. There are 237 observations with missing attribute values, and these missing values were complemented using the method discussed in Sect. 4.1. Furthermore, the number of illegally parked bicycles for the observations whose attributes had been complemented was estimated using the Bayesian networks. The attributes are day of week, time, precipitation, distance to the nearest station, distance to the nearest bicycle parking area, category of POI, and the maximum number of bicycles that can be accommodated in the nearest bicycle parking area. In a 10-fold cross-validation, the accuracy of the estimation of the unknown number of illegally parked bicycles was 64.3 %. Table 3 shows the detailed accuracy. Most of the observation data were in the range 0–10, and the accuracy of estimation for this range was high. The accuracy for the other ranges was, however, relatively low, since there were not enough observation data for them. This lowered the overall accuracy.
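The evaluation procedure can be sketched as follows. This is a plain k-fold cross-validation that also reports per-class accuracy, which is useful when classes are imbalanced as described above. The function names, the shuffling seed, the majority-class baseline, and the toy labels are assumptions for illustration and are unrelated to the reported results.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, train_and_predict, k=10, seed=0):
    """Return (overall accuracy, per-class accuracy) over k held-out folds."""
    correct = 0
    class_total, class_correct = Counter(), Counter()
    for fold in k_fold_indices(len(rows), k, seed):
        held_out = set(fold)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        predictions = train_and_predict(train_rows, train_labels, [rows[i] for i in fold])
        for i, predicted in zip(fold, predictions):
            class_total[labels[i]] += 1
            if predicted == labels[i]:
                correct += 1
                class_correct[labels[i]] += 1
    per_class = {c: class_correct[c] / class_total[c] for c in class_total}
    return correct / len(rows), per_class


def majority_baseline(train_rows, train_labels, test_rows):
    """Always predict the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * len(test_rows)


# Made-up, deliberately imbalanced labels; the attributes are left empty.
toy_labels = [1] * 80 + [2] * 20
toy_rows = [()] * len(toy_labels)
accuracy, per_class = cross_validate(toy_rows, toy_labels, majority_baseline)
```

On this toy data the baseline scores well overall while never predicting the minority class, which is why a per-class breakdown is worth reading alongside the overall figure.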

Table 3. Detailed accuracy

We also selected ten observation points at random and estimated the number of illegally parked bicycles at unobserved times. As a result, the data for six observation points were correctly estimated. Since Open Data is still in the early stages of diffusion, we believe that data collection and expansion are of great importance, as is the accuracy of the data.

The accuracy of the estimated data in this study was lowered for the following reasons. The amount of observation data was smaller than required, and the data were imbalanced. The observations used in this experiment were not equally distributed over the observation points, and the quantity of data obtained from each observation point differed. As a result, the quantity of data obtained was not sufficient to accurately estimate each conditional probability for the number of illegally parked bicycles.

Furthermore, the accuracy may have been lowered by restricting the nearby POI to a single location. In many cases, several stores and establishments are close to an observation point, and they contribute to an increase in the number of illegally parked bicycles. Therefore, we may be able to improve the accuracy of our results by allowing multiple POIs and incorporating weights for each type of POI.

5 Visualization of LOD

Data visualization enables people to understand data contents intuitively. Specifically, it is possible to raise awareness of an issue among local residents by providing a visualization of data pertaining to the urban problem. Furthermore, we expect that this will help us collect more urban data. In this section, our method for visualizing the illegally parked bicycles LOD is described.

The illegally parked bicycles LOD are published on the Web, and a SPARQL endpoint is provided. Consequently, anyone can download the data or use them as an API via the SPARQL endpoint. As an example of the use of this data, we developed a web application that visualizes illegally parked bicycles. The application can display time-series changes in the distribution of illegally parked bicycles on a map. The application also has a responsive design, so it can be used on various devices such as PCs, smartphones, and tablets. When the start and end times are selected and the play button is pressed, the time-series changes in the distribution of illegally parked bicycles are displayed. The right side of Fig. 2 shows a screenshot from an Android smartphone, on which the web application is displaying such an animation near Chofu Station in Tokyo using a heatmap and a marker UI. In this study, we designated the locations of illegally parked bicycles according to users’ tweets. However, the ranges and scales of the areas vary, and thus it is difficult to display the exact extent of the illegal parking areas. Therefore, in the current visualization, the marker and the center of the heatmap are located at the center of each observation point, and the range of the heatmap is fixed at a radius of 30 m. The intensity of the heatmap is proportional to the logarithm of the number of illegally parked bicycles. This visualization application and the tweet application on the left side of Fig. 2 are hosted on the same website, so it is possible to see the visualized information just after tweeting. Thus, users are given instant feedback when posting new data.
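The logarithmic scaling of the heatmap can be sketched as below. The text states only that the intensity is proportional to the logarithm of the count, so the use of log(1 + n) (which maps a count of zero to zero), the normalization by a maximum count of 60 (the upper bound of the highest class in Sect. 4.1), and the function name are assumptions.

```python
import math

HEATMAP_RADIUS_M = 30  # fixed radius stated in the text


def heatmap_weight(count, max_count=60):
    """Log-scaled heatmap weight in [0, 1] for a bicycle count.

    Uses log(1 + n) so that zero bicycles give zero weight (an assumption).
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    return math.log1p(count) / math.log1p(max_count)
```

Compared with a linear scale, the logarithm keeps a single very large observation from washing out the many points with only a few bicycles.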

6 Conclusion

In this paper, we described the building and visualization of Open Urban Data as a step toward solving the problem of illegally parked bicycles. The proposed techniques were data collection from Twitter, the construction of an illegally parked bicycles LOD based on a schema design, the complementing and estimation of missing data, and the visualization of the LOD. We expect that this will increase the awareness of local residents of the problem and encourage them to post more data.

In the future, we will increase the amount of observation data and the number of attributes in order to improve the accuracy of the estimation. Moreover, we will visualize statistics of the illegally parked bicycles LOD and clarify the problems caused by illegally parked bicycles in cooperation with local residents. We will also evaluate the growth rate of the illegally parked bicycles LOD.