1 Introduction

Location-based services have recently reshaped business models that leverage ad-targeting and the location data of existing and potential customers, revealing business-intelligence insights from users’ whereabouts [1]. Since its launch in 2009, the mobile app Foursquare has offered a game-like online platform that provides a mashup of local search [2], discovery [3] and recommendation services [4] to users interacting through their GPS-enabled smartphones. In its original version, Foursquare built a social networking layer that lets users share their locations with friends via the ‘check-in’ feature. Users proactively inform the app when they are at a specific location, via their cell phones or computers, by choosing from a list of nearby venues that the app locates automatically. Check-ins are essentially records of where users have been, and check-in points [5] are the “stay” points earned by spending a period of time at a location, usually doing something significant there. These records are essential ingredients for association rule mining [6], advanced recommendation applications [7] and the Internet of Things [8].

In August 2014, a new version, Foursquare 8.0 (http://foursquare.com), was launched, focusing only on search and recommendation services. Users of Foursquare 8.0 can “follow” their friends to obtain recommendations from their circles of friends. The check-in features were dropped from Foursquare 8.0 and transferred to a companion app called Swarm, which fully implements the synergy of location sharing and social networking in a separate app. Users can share their locations with their Swarm friends, whose current locations are mutually visible through the app. In Swarm, the check-in locations are pulled from every user in real time and are defined more precisely than the menu of local landmarks used in earlier versions. With the two apps working hand-in-hand, more accurate recommendation services and richer contexts about the check-in locations can be obtained. For instance, check-in information collected by Swarm and passed over to Foursquare helps the two apps understand what kinds of places the users like to visit, not only at the individual level but as a clique of friends or even a culture. The composite of big data [9] collected by Foursquare and Swarm, covering the locations, likes, times, and shopping and dining habits of individual users and swarms of users, has an unprecedented impact on business intelligence.

The data collection is ubiquitous: every user, as a person, becomes the sum of everywhere he or she has visited and everything he or she has liked or done. Setting privacy issues aside, the information generated by users becomes the sum of all the interactions and recommendations happening online, and the locations of check-ins are the ingredients for detecting patterns. With these patterns and trends extracted from big data and matched against your profile, the app predicts (and hence recommends) what canteens you might head to when it is lunch time, or which bistros are worth spending a Friday evening in with your friends. Big data analytics refine recommendations from a massive volume of events: what canteens you have patronized before, what food you ordered, what time of day, day of week and date of year it was, with whom you shared a meal or coffee chat, on what occasions, and based on what and whose recommendations. For an example at the individual level: your anniversary with your wife is just a few days away, and other Foursquare users have lately posted about dinner deals and gift ideas; the app will likely recommend a local dining place that is popular and well reviewed. On top of that, cuisines and menu choices can be guessed from your recent postings and likes, as well as from those of friends who share your interests. At the macro level, trends and the most talked-about places and foods are known from the big data, so users may be recommended places the app guesses they would like but have not visited before. Like an economist, the app has business insights: it can quantify approximately how many people will visit a place on certain occasions, how trends are positively or negatively correlated with one another, how people move between places, what their favorite hang-outs are, and which places they will frequent or shun at a particular time or day.

A fundamental premise for supporting all the app features mentioned above, in addition to big data technology such as databases and ICT infrastructure, is the sequential information among events and visited locations. Sequential mining has been a popular research topic studied by many researchers. In this paper, a relatively simple technique is proposed and studied that guesses the most likely subsequent check-in location by computing the probabilities of check-in occurrences. Being able to predict the next check-ins is important for supporting Foursquare recommendations, which are based on aggregated sums. Instead of predicting and tracking each user’s trail individually, general patterns are usually more useful for inferring trends and popular location sequences. While most research published in the past focuses on finding the path sequences of an individual person, this work differs by considering only the aggregated sums. Privacy issues are hence prevented, or at least reduced to a certain extent, when only sums of check-ins are taken into account. Given any checked-in location, the proposed analytics computes a ranked list of the most likely next check-in locations from the mass of historical check-in records, which is usually stored as big data.

2 Difference between Foursquare check-in sequences and continuous trajectories

There are apps and online software tools that visualize the “trails” of Foursquare users overlaid on a map. These trails are aggregated from daily collected check-in records stored in a database. The objective of such visualization is usually to identify moving patterns, either as an animation over time or as a still image at a particular time of day, showing the density of foot traffic rather than sequences of paths. Different colors show the trails of visits to different types of venues: food, education, professional activities, entertainment, and so on. In this case, the trails and paths are not directional; the visits or stops are unordered and mainly indicate popularity in terms of traffic intensity. An example called ‘infographics/pulse’ of New York City is shown in Fig. 1.

Traffic intensity is an effective indicator of the most-visited places at different times, but it lacks information about temporal order. In our proposed analytics, a new dimension of temporal order is extracted from check-in records in the form of sequences. A sequence, in the context of our analytics, is a series of geo-location points recorded as check-ins by Foursquare users. It captures the temporal order of check-ins contributed by all users over time, starting from some specific date and time; by default, perhaps since the launch of an app version. Given any pair of check-in points, a starting point and a current point connected by two consecutive check-in timestamps, a list of likely next check-in points is produced. The likely next check-in points are computed with probabilities and ranked in descending order, showing the most probable subsequent check-in locations at the top of the list.
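As a minimal illustrative sketch only (not the actual Foursquare data schema), the following Python code shows one way such ordered sequences could be assembled: each user’s check-in records are sorted by timestamp and consecutive pairs of venues are collected. The field names user_id, venue and ts are assumptions made for illustration.

from collections import defaultdict
from operator import itemgetter

def build_checkin_pairs(records):
    """records: iterable of dicts with hypothetical keys 'user_id', 'venue'
    and 'ts' (a sortable timestamp). Returns (previous_venue, current_venue)
    pairs, i.e., consecutive check-ins per user, aggregated over all users."""
    by_user = defaultdict(list)
    for r in records:
        by_user[r['user_id']].append((r['ts'], r['venue']))
    pairs = []
    for checkins in by_user.values():
        checkins.sort(key=itemgetter(0))       # temporal order per user
        venues = [v for _, v in checkins]
        pairs.extend(zip(venues, venues[1:]))  # consecutive check-in pairs
    return pairs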

Fig. 1

City pulses at 12 p.m. by the infographic function of Foursquare.com

Although it is possible to infer a ranked list of subsequent check-in locations from a minimum of two check-in points forming a short sequence, in theory a sequence can be as long as the available consecutive series of check-in points. By choosing different pairs of sequential check-in points, different predictions become possible. For example, for the pair of check-in points shown on the map of New York City in Fig. 2, a list of possible future check-in locations can be produced.

Fig. 2

Map of recommended check-ins from Foursquare.com

In the above example, the Museum of Chinese in America is the starting point and the current position is the New Museum. Depending on the time, if enough time remains, the next location could be another museum, which is one of the most probable entries on the predicted list. The prediction is even stronger if the day falls on a holiday or special day when museum admission is free of charge. Counter-intuitively, if the starting point is something else, for example the Sohotel, which lies between the Museum of Chinese in America and the New Museum, the most probable future check-in location would be anything but the Museum of Chinese in America or another museum: users who start off from the hotel and visit a museum are probably on their way to other attractions, or to eat if it is close to lunch time. In short, different segments connecting different pairs of check-in points result in different predicted next locations.

There is no shortage of research on trajectory calculation. Nevertheless, trajectory predictions in other works, such as [10,11,12], are based on calculations of individual paths. This work is unique in the sense that the sequences are taken from the mass records of online check-in information loaded into the database, in real time, from millions of Foursquare users.

The case of Foursquare is somewhat different from the prediction of a continuous trajectory [13]. The next Foursquare coordinate may come from many different points, whereas in continuous trajectory prediction the next coordinate lies near the decision point and the number of candidates is usually only two or three.

Figure 3a shows a typical trail appearing on Foursquare: a person checked in at S1 and then moved to S2. S2 is the decision point, in the sense that the question is where the next check-in point after visiting S2 will be. D1, D2, D3, D4 and D5 are the probable next check-in coordinates extending out from S2. In the graph, the probable check-in coordinates are distributed irregularly: some are near the decision point, some are far from it, and their directions are not regular either. In contrast, Fig. 3b shows the continuous trajectory prediction case: from S1 to S2 is a continuously connected path along which the person has walked, and S2 is the decision point. The probable next coordinates, D1, D2 and D3, are more predictable because they usually follow the road structure, such as junctions, intersections, road joints and corners. The subsequent coordinates are relatively better distributed around the decision point because they are constrained by the physical road network [14].

Fig. 3

a Foursquare trails, b continuous trajectory

3 Proposed algorithm

We model the entire map as a graph \(m=\langle V, E \rangle \). The first known check-in point is represented by Ps and the last known check-in point by Pe. The check-in points from Ps to Pe are the input data of the algorithm. Any probable next check-in point is denoted as Dn.

The probability that a person who has checked in from Ps to Pe will check in next at Dn can be written as follows:

$$\begin{aligned} {Prob}\left( {Ps,\,Pe,\,Dn} \right) ={Prob}\left( {Dn|Ps,\,Pe} \right) \end{aligned}$$
(1)

In the historical check-in data set, the check-in history from Ps to Pe is represented as His(Ps, Pe), and the history of checking in from Ps to Pe and then at Dn as His(Ps, Pe, Dn). The number of paths of all people who have checked in from Ps to Pe in the historical data set is denoted as \({\vert } His( Ps,\,Pe){\vert }\); likewise, \({\vert }His({Ps},\, {Pe},\, {Dn}){\vert }\) represents the number of paths of all people who have checked in from Ps to Pe and then at Dn. Equation (1) can therefore be rewritten as follows.

$$\begin{aligned} Prob\left( {Ps,\,Pe,\,Dn} \right)= & {} Prob\left( {Dn|Ps,\,Pe} \right) \nonumber \\= & {} \frac{\left| {His\left( {Ps,\,Pe,\,Dn} \right) } \right| }{\left| {His\left( {Ps,\,Pe} \right) } \right| } \end{aligned}$$
(2)
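A minimal Python sketch of Eq. (2) is given below. It assumes the historical data set has already been reduced to a list of per-user ordered venue sequences; this storage format is an assumption made for illustration rather than a description of the actual system.

from collections import Counter

def next_checkin_probabilities(histories, ps, pe):
    """histories: list of ordered venue sequences, one per user trail.
    Returns {dn: Prob(dn | ps, pe)} in descending order of probability,
    following Eq. (2): |His(ps, pe, dn)| / |His(ps, pe)|."""
    counts = Counter()
    for seq in histories:
        for i in range(len(seq) - 2):
            if seq[i] == ps and seq[i + 1] == pe:
                counts[seq[i + 2]] += 1      # one path Ps -> Pe -> Dn
    total = sum(counts.values())             # |His(ps, pe)| summed over all Dn
    if total == 0:
        return {}
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {dn: c / total for dn, c in ranked}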

So far these probabilities have been considered without taking time into account. When time is considered, the historical data are segmented into groups by time period, e.g., morning, afternoon and evening. Each group of data, formed according to the records’ timestamps, is used to infer the check-in probabilities separately for its period of time. Such division of data by time is called a ‘time filter’, implying that only the portion of check-ins that happened within a specified period is used for computing the next likely check-in location. The objective of applying the time filter is to enhance prediction accuracy.
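One possible, simplified realisation of the time filter is sketched below; it assumes each check-in carries a datetime timestamp and simply keeps the check-ins whose time of day falls inside the chosen window before the counting of Eq. (2) is applied.

def apply_time_filter(histories_with_ts, start, end):
    """histories_with_ts: list of per-user sequences of (venue, datetime) tuples.
    start, end: datetime.time objects bounding the period of interest.
    Keeps only check-ins whose time of day lies within [start, end], so that
    the counts in Eq. (2) are computed from that period only."""
    filtered = []
    for seq in histories_with_ts:
        kept = [venue for venue, ts in seq if start <= ts.time() <= end]
        if len(kept) >= 2:          # a sequence needs at least one transition
            filtered.append(kept)
    return filtered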

4 Simulation experiments

In Fig. 4, a person has checked in at S1 and then S2; the next probable check-in point may vary. In the graph, the candidates are D1, D2, D3, D4 and D5, which can also be denoted Dn1, Dn2, Dn3, Dn4 and Dn5. The experiment infers the number of all the people who have checked in at S1, S2 and then Dni (where Dni represents any next possible point; in Fig. 4 these are Dn1 to Dn5). In the example given in Fig. 4, a tourist is exploring Singapore on foot. After recording his check-ins, which started in China Town and ended at Clarke Quay, there are five possible destinations; any one of the destinations D1 to D5 could be the next check-in, assuming the tourist is heading north and is exploring points of interest sequentially, hopping from one to another instead of taking a long point-to-point detour.

Fig. 4

Simulation graph

In our proposed system, to guess the next check-in, the known coordinates S1 and S2 are first input into the program. The program mines the number of paths in which a person checked in from S1 to S2 and then ended at Dni. When the simulation program ends, \({\vert }{His}({Ps},\,{Pe},\,{Dn}1){\vert }\), \({\vert }{His}({Ps},\,{Pe},\,{Dn}2){\vert }\), \({\vert }{His}({Ps},\,{Pe},\,{Dn}3){\vert }\), \({\vert }{His}({Ps},\,{Pe},\,{Dn}4){\vert }\) and \({\vert }{His}({Ps},\,{Pe},\,{Dn}5){\vert }\) are all obtained, and \({\vert }{His}({Ps},\,{Pe}){\vert }\) is their sum. According to Eq. (2), the probability of each possible next check-in point that the person may visit can then be calculated, and the point with the maximum value is the most probable next check-in point.
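Reusing the hypothetical next_checkin_probabilities sketch from Sect. 3, this step can be illustrated with toy data; the trails below are invented for illustration and are not taken from the experimental data set.

# Illustrative toy data: each inner list is one user's ordered check-in trail.
histories = [
    ['S1', 'S2', 'D1', 'D4'],
    ['S1', 'S2', 'D1'],
    ['S1', 'S2', 'D3'],
    ['S0', 'S2', 'D5'],          # does not match (S1, S2), hence ignored
]

probs = next_checkin_probabilities(histories, 'S1', 'S2')
# probs == {'D1': 2/3, 'D3': 1/3}
best = max(probs, key=probs.get)  # most probable next check-in point: 'D1'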

4.1 General experiment settings

In this experiment, we used four cases to test our proposed algorithm for Foursquare movement prediction. The data of the cases are shown in Table 1; they are input to the simulation system with different starting and ending points. Graphically, the paths are shown in Figs. 5, 6, 7 and 8, superimposed on a map. The four figures represent four typical cases of guessing the next check-in locations, over four typical proximities. Test 1 is the case of a user who travelled across islets separated by a river. Test 2 is a travel route within the same islet. Test 3 is a short distance across two blocks of buildings. Test 4 is a relatively longer distance across several streets and blocks, but within the same suburb or city.

Table 1 Input data format and values
Fig. 5

Test 1

Fig. 6

Test 2

Fig. 7

Test 3

Fig. 8

Test 4

For each of the four cases in the experiment, we mine from the historical check-in data set the number of paths that passed Ps, then Pe, and then each next check-in point after Pe. Finally, we calculate the probability of each next check-in point. In the following tables of results, latitude (lat) and longitude (lng) are the coordinates of the next check-in for each test, Num is the number of historical check-in paths that passed Ps, Pe and the corresponding next check-in coordinate, and Probability is calculated by Eq. (2). Details of the results of the tests are given in Tables 2, 3, 4 and 5, where the top five most probable records are listed.

Table 2 Results of Test 1
Table 3 Results of Test 2
Table 4 Results of Test 3
Table 5 Results of Test 4

In Test 1, there are 115 next check-in points, but only the Num values of the first 16 are greater than 1. The most probable check-in points are the first two coordinates, (40.7713, \(-73.9821\)) and (40.7678, \(-73.9823\)), with probabilities of 8.955 and 6.965%, respectively.

In Test 2, there are 129 next check-in points, and the Num values of the first 53 are greater than 1. The most probable check-in points are the first two coordinates, (40.7608, \(-73.9879\)) and (40.76437, \(-73.9869\)), with probabilities of 7.306 and 4.566%, respectively.

In Test 3, there are 6 next check-in points, and the Num values of the first two are greater than 1. The most probable check-in points are the first two coordinates, (41.90273, \(-87.6319\)) and (41.91062, \(-87.6532\)), with probabilities of 40 and 20%, respectively.

In Test 4, there are 16 next check-in points, and the Num values of the first six are greater than 1. The most probable check-in points are the first two coordinates, (40.7451, \(-74.0383\)) and (40.751, \(-74.0316\)), with probabilities of 32.895 and 28.947%, respectively.

4.2 Experiment on various lengths and accuracies

Suppose a person checked in at location A and then at location B. If the distance between A and B is greater, the prediction accuracy may drop; the longer the distance, the greater the expected decline in accuracy. This is the hypothesis, and this experiment tries to verify the decline in accuracy in relation to the path distance. Seven test paths, listed below, are used in the experiment.

Table 6 lists seven sample paths and the distances between their two points, from Ps to Pe. For each path, we report the top three highest accuracies as results, denoted acc1, acc2 and acc3. The relationship between path length and accuracy is shown in Fig. 9.
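The path length from Ps to Pe can be approximated, under the assumption that a great-circle distance between the two coordinates is an adequate proxy, with the standard haversine formula; the following is only an illustrative sketch.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points,
    used here only to approximate the Ps-to-Pe path length."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    dlat, dlng = lat2 - lat1, lng2 - lng1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # Earth radius approx. 6371 km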

Table 6 Results of various lengths and accuracies

AccSum is the sum of acc1, acc2 and acc3, i.e., the total of the three individual accuracies. From the graph, we observe that the prediction accuracy is related to the path length from Ps to Pe: as the path length increases, the accuracy decreases, which verifies our hypothesis about the efficacy of the prediction.

Fig. 9

Results of various lengths and accuracies

4.3 Time filter: accuracy on each test path

At different times, people may do different kinds of things. We therefore modify the prediction of future paths by considering the element of time, which may constrain certain locations [15]. It is hoped that, by including the time element, the prediction accuracy can be enhanced. In the following four tests, all check-ins were created in the evening, from 1700 to 2230 hours, which in big cities like New York and Chicago may be the most bustling period. Compared with the setting of the general experiment, the results are shown in Figs. 10, 11, 12 and 13, from which we can see whether the time filter enhances the prediction accuracy. In all cases, the tests with the time filter are better than those without it in the Acc Top 1 scenario. In Test 1 and Test 2, where an islet is involved in the environmental setting, the points of interest are usually obvious because they are mostly tourist attractions near the islet area. However, in Test 3 and Test 4 there are urban building blocks in the central business district where shops and points of interest are densely packed, and the results are less consistent, for example Acc Top 4 in Test 3 and Acc Top 2 and Acc Top 3 in Test 4. This is due to the dense proximity of the urban setting: those shops may be visited and checked in by users at all times, regardless of the period of the day. For example, a busy pharmacy or a newsstand near a subway exit in the CBD may be patronized by almost equal numbers of customers throughout the whole day.
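As an illustration only, the evening window used in these tests could be expressed with the hypothetical apply_time_filter and next_checkin_probabilities sketches from Sect. 3; histories_with_ts, ps and pe stand in for the experimental inputs.

from datetime import time

# Keep only check-ins between 17:00 and 22:30 before counting (Sect. 3 sketch).
evening_histories = apply_time_filter(histories_with_ts, time(17, 0), time(22, 30))
probs_evening = next_checkin_probabilities(evening_histories, ps, pe)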

Fig. 10

Test 1 comparison

Fig. 11

Test 2 comparison

Fig. 12

Test 3 comparison

Fig. 13

Test 4 comparison

5 Conclusion

With the rapid development of the mobile Internet, location-based services are becoming increasingly capable and prevalent. New recommendation services, for example, rely heavily on sensing and location-based technology. In this paper, we investigated some basic concepts related to geographical (geo) location and looked into the intricacies of predicting the next locations given a current position in a virtual social media platform [16]. For this purpose, the geo-tagging service of Foursquare was used as the basis for our examples. Foursquare can be described as a location-centric social networking game: it allows you to receive recommendations, given where you currently are, from your peers or from the general statistics of a crowd for lifestyle experiences. By knowing where you are and where you will be next in the coming moments, the recommendation experience can be enhanced. Location prediction is therefore an important research topic with potential commercial value in marketing and other areas. For this reason, a simulation experiment based on a Foursquare dataset was set up in an attempt to test next-location prediction and verify this position. We mainly used the Foursquare database and a new algorithm to predict the user’s next location.

In the prediction model, the required input is a minimum of two pairs of coordinates, and the output is a list of predicted locations with their likelihood probabilities. In the experiment, we tested several variables, including the number of check-ins, segment lengths and time factors. The results showed that different prediction accuracies are produced depending on the values of these variables. We reported a set of experiments with different factors on the new method and found that when the time factor is considered in the prediction, the results are more accurate.

This paper contributes to big data analytics, where simple but effective computing methods are needed for inferring users’ next locations on a virtual geo-tagging platform. It is believed that the prediction model could serve as a building block for more sophisticated online recommendation applications when more services and available sources of data are mashed up.