1 Introduction

Over half of the world’s population now lives in cities (Martine et al. 2007), and understanding the cities we live in has never been more important. Urban planners need to plan future developments, transit authorities need to optimize routes, and people need to effectively integrate into their communities.

Currently, a number of methods are used to collect data about people, but these methods tend to be slow, labor-intensive, and expensive, and they yield relatively sparse data. For example, the US census cost $13 billion in 2010 (Costing the Count 2011), and is only collected once every 10 years. The American Community Survey is collected annually, and cost about $170 million in 2012, but only samples around 1 % of households in any given year (Griffin and Hughes 2013). Data like this can help planners, policy makers, researchers, and businesses understand changes over time and decide how to allocate resources, but the methods used to gather it do not scale well.

Researchers have looked at using proprietary call detail records (CDRs) from telecoms to model mobility patterns (Becker et al. 2011; González et al. 2008; Isaacman et al. 2012) and other social patterns, such as the size of one’s social network and one’s relationship with others (Palchykov et al. 2012). These studies leverage millions of data points; however, these approaches also have coarse location granularity (up to 1 sq mile), are somewhat sparse (CDRs are recorded only when a call or SMS is made), have minimal context (location, date, caller, callee), and use data not generally available to others. Similarly, researchers have also looked at having participants install custom apps. However, this approach has challenges in scaling up to cities, given the large number of app users needed to get useful data. Corporations have also surreptitiously installed software on people’s smartphones (such as Carrier IQ (McCullagh 2011) and Verizon’s Precision Market Insights (McCullagh 2012)), though this has led to widespread outcry due to privacy concerns.

We argue that there is an exciting opportunity for creating new ways to conceptualize and visualize the dynamics, structure, and character of a city by analyzing the social media its residents already generate. Millions of people already use Twitter, Instagram, Foursquare, and other social media services to update their friends about where they are, communicate with friends and strangers, and record their actions. The sheer quantity of data is also tantalizing: Twitter claims that its users send over 500 million tweets daily, and Instagram claims its users share about 60 million photos per day (Instagram 2014). Some of this media is geotagged with GPS data, making it possible to start inferring people’s behaviors over time. In contrast to CDRs from telecoms, we can get fine-grained location data, and at times other than when people make phone calls. In contrast to having people install custom apps (which is hard to persuade people to do), we can leverage social media data that millions of people are already creating every day.

We believe that this kind of geotagged social media data, combined with new kinds of analytics tools, will let urban planners, policy analysts, social scientists, and computer scientists explore how people actually use a city, in a manner that is cheap, highly scalable, and insightful. These tools can shed light onto the factors that come together to shape the urban landscape and the social texture of city life, including municipal borders, demographics, economic development, resources, geography, and planning.

As such, our main question here is, how can we use this kind of publicly visible, geotagged social media data to help us understand cities better? In this position paper, we sketch out several opportunities for new kinds of analytics tools based on geotagged social media data. We also discuss some longer-term challenges in using this kind of data, including biases in this kind of data, issues of privacy, and fostering a sustainable ecosystem where the value of this kind of data is shared with more people.

2 Opportunities

In this section, we sketch out some design and research opportunities, looking at three specific application areas. Many of the ideas we discuss below are speculative. We use these ideas as a way of describing the potential of geotagged social media data, as well as offering possible new directions for the research community.

2.1 For City Planners

First, and perhaps most promisingly, we believe that geotagged social media data can offer city planners and developers better information that can be used to improve planning and quality of life in cities. This might include new kinds of metrics for understanding people’s interactions in different parts of a city, new methods of pinpointing problems that people are facing, and new ways of identifying potential opportunities for improving things.

2.1.1 Mapping Socioeconomic Status

It is important for governments to know the socioeconomic status of different sections of their jurisdiction in order to properly allocate resources. In England, for example, the government uses the Index of Multiple Deprivation (IMD) to measure where the problems of poverty are the most severe, and to therefore mitigate those effects. The IMD is based on surveys and other statistics collected by different areas of government. However, even in developed countries, surveys and statistics can be difficult and expensive to collect.

Past work with cell phone call logs suggests that it is possible to find valuable demographic information using communication records. For example, Eagle et al. (2010) found that network diversity in phone calls correlated with the IMD. A recent project by Smith-Clarke et al. (2014) explored the use of call logs to map poverty. Especially in developing countries, maps of poverty are often very coarse and out of date. This is not a simple problem of improving an already-good metric; data at different granularities can tell very different stories. For example, Fig. 1 shows two different maps of socioeconomic data in the UK, one very coarse-grained and one fine-grained.

Fig. 1 Left: accurate data about socio-economic deprivation in England; darker indicates more deprivation. Right: much coarser, out-of-date information about the same index. Figure from Smith-Clarke et al. (2014).

However, call log data, while more complete than surveys, still presents the limitations mentioned earlier: it is proprietary, coarse-grained, and lacking context and transparency. Much social media data, on the other hand, are publicly visible and accessible to researchers. Different forms of social media data also offer their own advantages. For example, Twitter users follow other users, and Foursquare check-ins often have “likes” or comments attached.

2.1.2 Mapping Quality of Life

Socioeconomic status, however, is not the only metric that matters. A community can be poor but flourishing, or rich but suffering. Other metrics like violence, pollution, location efficiency, and even community coherence are important for cities to track. Some of these are even more difficult to track than socioeconomic status.

We believe some aspects of quality of life can be modeled using geotagged social media data. For example, approximating violence may be possible by analyzing the content of posts. Choudhury et al. (2014) showed that psychological features associated with desensitization appeared over time in tweets by people affected by the Mexican Drug War. Other work has found that sentiments in tweets are correlated with general socio-economic wellbeing (Quercia et al. 2011). Measuring and mapping posts that contain these emotional words may help us find high crime areas and measure the change over time.

As another example, location efficiency, or the total cost of transportation for someone living in a certain location, can be approximated by sites like Walkscore.com. However, Walkscore currently only relies on the spaces, that is, where services are on the map. It does not take into account the places, the ways that people use these services. A small market may be classified as a “grocery store”, but if nobody goes to it for groceries, maybe it actually fails to meet people’s grocery needs. We believe geotagged social media data can be used as a new way of understanding how people actually use places, and thereby offer a better measure of location efficiency.

2.1.3 Mapping Mobility

One more analysis that could be useful for city planners is understanding the mobility patterns of people in different parts of different cities. This kind of information can help, for example, in planning transportation networks (Kitamura et al. 2000). Mobility can help planners with social information as well, such as how public or private people feel a place is (Toch et al. 2010). Previously, mobility information has been gathered from many sources, but they all lack the granularity and ease of collection of social media data. Cell tower data has been used to estimate the daily ranges of cell phone users (Becker et al. 2013). At a larger scale, data from moving dollar bills has been used to understand the range of human travel (Brockmann et al. 2006). Among other applications, this data could be used for economic purposes, such as understanding the value of centralized business districts like the Garment District in New York (Williams and Currid-Halkett 2014). It seems plausible that pre-existing social media data could help us find similar information without needing people to enter dollar bill serial numbers or phone companies to grant access to expensive and sensitive call logs. Geotagged social media data is also more fine-grained, allowing us to pinpoint specific venues that people are going to.

2.1.4 “Design Patterns” for Cities

Originating in the field of architecture, design patterns are good and reusable solutions to common design problems. Geotagged social media data offers new ways of analyzing physical spaces and understanding how the design of those spaces influences people’s behaviors.

For example, in the book A Pattern Language (1977), Alexander and colleagues present several kinds of patterns characterizing communities and neighborhoods. These patterns include Activity Nodes (community facilities should not be scattered individually through a city, but rather clustered together), Promenades (a center for its public life, a place to see people and to be seen), Shopping Streets (shopping centers should be located near major traffic arteries, but should be quiet and comfortable for pedestrians), and Night Life (places that are open late at night should be clustered together).

By analyzing geotagged social media data, we believe it is possible to extract known design patterns. One possible scenario is letting people search for design patterns in a given city, e.g. “where is Night Life in this city?” or “show the major Promenades”. Another possibility is to compare the relationship of different patterns in different cities, as a way of analyzing why certain designs work well and others do not. For example, one might find that areas that serve both as Shopping Streets and as Night Life are well correlated with vibrant communities and general well-being.

2.2 For Small Businesses

Understanding one’s customers is crucial for owners of small businesses, like restaurants, bars, and coffee shops. We envision two possible scenarios for how geotagged social media data can help small business owners.

2.2.1 Knowing Demographics of Existing Customers

Small businesses cannot easily compete with big-box stores in terms of data and analytics about existing customers. This makes it difficult for small businesses to tailor their services and advertisements effectively.

Businesses can already check their reviews on Yelp or Foursquare. We believe that geotagged social media data can offer different kinds of insights about the behaviors and demographics of customers. One example would be knowing what people do before and after visiting a given venue. For example, if a coffee shop owner finds that many people go to a sandwich shop after the coffee shop, they may want to partner with those kinds of stores or offer sandwiches themselves. This same analysis could be done with classes of venues, for example, cafés or donut shops.
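As a minimal sketch of the before-and-after analysis described above, the following Python function counts which venues the same user checks in at immediately before and after a venue of interest. The input format (tuples of user, timestamp, and venue name), the function name, and the three-hour cutoff between consecutive check-ins are all assumptions made for illustration, not part of any existing service.

```python
from collections import Counter, defaultdict


def venue_transitions(checkins, venue, max_gap_seconds=3 * 3600):
    """Count venues visited right before and right after `venue`.

    checkins: iterable of (user_id, unix_timestamp, venue_name) tuples
              (a hypothetical input format).
    Two consecutive check-ins by the same user count as a transition
    only if they are at most `max_gap_seconds` apart (assumed cutoff).
    """
    by_user = defaultdict(list)
    for user, ts, name in checkins:
        by_user[user].append((ts, name))

    before, after = Counter(), Counter()
    for visits in by_user.values():
        visits.sort()  # order each user's check-ins by time
        for (t0, a), (t1, b) in zip(visits, visits[1:]):
            if t1 - t0 > max_gap_seconds or a == b:
                continue  # too far apart, or a repeat at the same venue
            if b == venue:
                before[a] += 1
            if a == venue:
                after[b] += 1
    return before, after
```

Calling `after.most_common(5)` on the second return value would list the five most frequent next stops. Replacing venue names with venue categories in the input would give the class-level version of the same analysis.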

As another example, an owner may want to do retail trade analysis (Huff 1963), which is a kind of marketing research for understanding where a store’s customers are coming from, how many potential customers are in a given area, and where one can look for more potential customers. Some examples include quantifying and visualizing the flow and movement of customers in the area around a given store. Using this kind of analysis, a business can select potential store locations, identify likely competitors, and pinpoint ideal places for advertisements.
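One widely used formulation in retail trade analysis is the Huff model, in which the probability that a customer chooses a store grows with the store's attractiveness and shrinks with distance. The sketch below uses the common textbook form of that model. The choice of attractiveness measure (for instance floor area, or hypothetically a check-in count) and the decay exponent of 2.0 are assumptions that would need calibration against real data.

```python
def huff_probabilities(attractiveness, distances, decay=2.0):
    """Probability that a customer at one origin picks each store.

    Textbook Huff form: P_j = (S_j / d_j**decay) / sum_k (S_k / d_k**decay)

    attractiveness: dict store -> positive attractiveness score S_j
    distances:      dict store -> positive distance or travel time d_j
    decay:          distance-decay exponent (assumed value, not calibrated)
    """
    utilities = {
        store: attractiveness[store] / (distances[store] ** decay)
        for store in attractiveness
    }
    total = sum(utilities.values())
    return {store: u / total for store, u in utilities.items()}
```

Evaluating this function once per neighborhood would give a rough map of where a store's customers are expected to come from, which could then be compared against where its geotagged visitors actually come from.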

Currently, retail trade analysis is labor intensive, consisting of numerous observations by field workers (e.g. watching where customers come from and where they go, or shadowing customers) or surveys given to customers. Publicly visible social media data offers a way of scaling up this kind of process, and extending the kind of analysis beyond just the immediate area. One could also analyze more general kinds of patterns. For example, what are the most popular stores in this area, and how does the store in question stack up? How does the store in question compare against competitors in the same city?

Knowing more about the people themselves would be useful as well. For example, in general, what kinds of venues are most popular for people who come to this store? If a business finds that all of its patrons come from neighborhoods where live music is popular, they may want to consider hosting musicians themselves. All of the information offered by a service like this would have to be rather coarse, but it could provide new kinds of insights for small businesses.

2.2.2 Knowing Where to Locate a New Business

New businesses often have many different potential locations, and evaluating them can be difficult. Public social media data could give these business owners more insight into advantages and disadvantages of their potential sites. For example, if they find that a certain neighborhood has many people who visit Thai restaurants in other parts of the city, they could locate a new Thai restaurant there.

2.3 For Individuals

There are also many opportunities for using geotagged social media to benefit individuals. Below, we sketch out a few themes.

2.3.1 Feeling at Home in New Cities

Moving to a new city can make it hard for people to be part of a community. The formally defined boundaries of neighborhoods may help people understand the spaces where they live, but not so much the socially constructed places (Harrison and Dourish 1996). Currently it is difficult for non-locals to know the social constructs of a city as well as locals do. This is particularly important when someone changes places, either as a tourist or a new resident.

Imagine a new person arriving in a diverse city like San Francisco, with multiple neighborhoods and sub-neighborhoods. It would be useful for that person to know the types of people who live in the city, and where each group goes: students go to this neighborhood in the evening, members of the Italian community like to spend time in this area, families often live in this neighborhood but spend time in that neighborhood on weekends.

Some work has been done in this area, but we believe it could be extended. Komninos et al. (2013) collected data from Foursquare to examine patterns over times of day and days of the week, showing the daily variations of people’s activity in a city in Greece. They showed when people are checking in, and at what kind of venue, but not who was checking in. Cheng et al. (2011), too, showed general patterns of mobility and times of check-ins, but these statistics remain difficult for individuals to interpret.

Relatedly, the Livehoods project (Cranshaw et al. 2012) and Hoodsquare (Zhang et al. 2013) both look at helping people understand their cities by clustering nearby places into neighborhoods. Livehoods used Foursquare check-ins, clustering nearby places where the same people often checked in. Hoodsquare considered not only check-ins but also other factors including time, location category, and whether tourists or locals attended the place. Both of these projects would be helpful for people to find their way in a new city, but even their output could be more informative. Instead of simply knowing that a neighborhood has certain boundaries, knowing why those boundaries are drawn or what people do inside those boundaries would be helpful. Andrienko et al. (2011) also describe a visual analytic approach to finding important places based on mobility data that could help newcomers understand which places are more popular or important.

Another approach to helping people get to know the city is to help them get to know the people in the city, rather than the places. We look to the work of Joseph et al. (2012) who used topic models to assign people to clusters such as “sports enthusiast” or “art enthusiast”. We could imagine this information being useful for individuals to find other like-minded people.

2.3.2 Discovering New Places

In the previous section, we described applications to help people become accustomed to a new city. However, sometimes the opposite problem may arise: people become too comfortable in their routines and they want to discover new places. Some recent projects have helped people to discover new places based on actions that can be performed there (Dearman et al. 2011) or aspects that people love about the place (Cranshaw et al. 2014). However, we believe that this idea can be pushed further. Perhaps combining a person’s current mobility patterns with visualizations of other people’s mobility patterns would help a person to put their own actions in context. They may realize that there are entire social flows in the city that they did not even know existed.

2.3.3 Understanding People, Not Just Places

Tools like Yelp and Urbanspoon already exist to help people find places they would like to go or discover new places that they didn’t know about. Previous work like Livehoods (Cranshaw et al. 2012) also worked to enable a richer understanding of the places in a city. The benefit of incorporating social media data, though, is that users can start to understand the people who go to places, not just the places themselves.

3 Potential Research in This Design Space

In this section we sketch out some potentially interesting research projects in using geotagged social media data. Our goal here is to map out different points in the overall design space, which can be useful in understanding the range of applications as well as the pros and cons of various techniques.

3.1 Who Goes Here?

The approach is simple: select a geographic region (which may be as small as an individual store) and retrieve all the tweets of the people who have ever tweeted there. Then, compute a heat map or other geographic visualization of the other places that they tweet. This kind of visualization could help elucidate all the places associated with a given place, and could be useful to small business owners or managers of a larger organization like a university.
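The approach above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: posts are assumed to be already collected as tuples of user, latitude, and longitude, the region is a simple bounding box, and the heat map is a count per grid cell of roughly 0.01 degrees. The function names and the cell size are hypothetical.

```python
import math
from collections import Counter


def in_box(lat, lon, box):
    """box is (south, west, north, east) in decimal degrees."""
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def who_goes_here(posts, box, cell_deg=0.01):
    """Grid-cell counts of where visitors to `box` post elsewhere.

    posts: list of (user_id, lat, lon) tuples (hypothetical input format).
    Returns a Counter mapping (row, col) grid cells to post counts.
    """
    # Step 1: everyone who has ever posted inside the region.
    visitors = {user for user, lat, lon in posts if in_box(lat, lon, box)}

    # Step 2: bin those users' posts made outside the region into cells.
    heat = Counter()
    for user, lat, lon in posts:
        if user in visitors and not in_box(lat, lon, box):
            cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
            heat[cell] += 1
    return heat
```

The resulting counts could be passed to any mapping library for display. Counting distinct users per cell rather than posts would keep a few very active accounts from dominating the map.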

3.2 Groceryshed

Watershed maps show where water drains off to lakes and oceans. Researchers have extended this metaphor to map “laborsheds” and “paradesheds” (Becker et al. 2013) to describe where people who work in a certain area come from, or people who attend a certain parade. We could extend this metaphor even further to describe “sheds” of smaller categories of business, such as Thai restaurants.

More interestingly, we could map change over time in various “sheds”. This could be particularly important for grocery stores. Places that are outside any “groceryshed” could be candidate areas for a new store, and showing the change in people’s behavior after a new store goes in could help measure the impact of that store. The content of tweets or other social data could also show how people’s behavior changed after a new grocery store was put in.
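One simple way to operationalize a “groceryshed” is as the set of home areas of the people observed at a store. The sketch below follows that definition. It assumes that each user has already been assigned a home grid cell by some separate method, and the function names and data shapes are hypothetical.

```python
from collections import defaultdict


def sheds(home_cell, visits):
    """Map each store to the set of home cells of its observed visitors.

    home_cell: dict user_id -> grid cell, e.g. (row, col); assumed to be
               estimated elsewhere.
    visits:    iterable of (user_id, store_name) pairs.
    """
    result = defaultdict(set)
    for user, store in visits:
        if user in home_cell:
            result[store].add(home_cell[user])
    return dict(result)


def uncovered_cells(home_cell, store_sheds):
    """Home cells that fall outside every store's shed."""
    covered = set().union(*store_sheds.values())
    return set(home_cell.values()) - covered
```

Running the same computation on data from before and after a store opens would give two sets of uncovered cells whose difference is a rough measure of the change. Because of the demographic biases discussed later, an uncovered cell may reflect missing data rather than a missing store.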

3.3 How Is This Place Relevant To Me?

We envision a system that can convey not only what a place is, but also what it means. Imagine a user looking up a particular coffee shop. Currently, they can look up the coffee shop’s web site, find basic information like store hours, and find reviews on sites like Yelp. Using geotagged social media data, however, we could surface information like:

  • Your friends (or people you follow on Twitter) go here five times per week.

  • Friends of your friends go here much more than nearby coffee shops.

  • People who are music enthusiasts like you (using topic modeling as in Joseph et al. (2012)) often go to this coffee shop.

  • You’ve been to three other coffee shops that are very similar to this one.

  • People who tweet here show the same profiles of emotions as your tweets.

These could help people form a deeper relationship with a place than one based on locality or business type alone. In addition, we could pre-compute measures of relevance for a particular user, giving them a map of places that they might enjoy.

3.4 Human Network Visualizations

We can go beyond assigning people to groups by topics, also showing where they tweet over time. This could help people understand the dynamics of neighborhoods where, for example, one group of more affluent people is pricing out a group of previous residents. One interesting work in this area is the Yelp word maps (2013), which show where people write certain words, like “hipster”, in reviews of businesses. However, this still describes the places; using social media data, we could show maps that describe the people. Instead of a map of locations tagged as “hipster”, we could identify groups of people based on their check-in patterns and tag where they go during the day. Perhaps the hipsters frequent certain coffee shops in the morning and certain bars at night, but during the day hang out in parks where they do not check in.

3.5 Cheaper, Easier, and Richer Demographics

For all of our groups and stakeholders, it is important to understand demographic information of city regions. We could improve the process in two main ways. First, we could make it cheaper and easier to infer existing demographic information. We plan to investigate whether Twitter volume, or volume of certain topics of discussion, correlates with deprivation or other measures of socioeconomic status or quality of life. If so, then we can use the social media measures as proxies for real measures, and thereby collect that information cheaply and in real time.
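A first check of whether a social media measure tracks an official index could be a rank correlation across regions. The sketch below computes Spearman's rank correlation using only the standard library. Which measure to use (for example tweets per capita) and which index to compare against are assumptions left to the analyst, and the function names are illustrative.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

Here `xs` would hold one social media measure per region and `ys` the corresponding official index values. A strong correlation would support using the measure as a proxy, though it would not by itself show that the relationship holds in other cities or over time.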

Second, we plan to create more descriptive demographics. In each neighborhood, we can calculate each person’s average distance traveled, radius of gyration, and other measures of mobility. We could either simply output statistics or we could create interactive visualizations that show the daily movements of people in each neighborhood.
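The two mobility measures named above can be computed directly from a person's geotagged posts. The following is a sketch rather than a definitive implementation: it takes radius of gyration to be the root-mean-square distance of a person's points from their centroid, approximates the centroid as the plain average of latitudes and longitudes (reasonable at city scale, an assumption here), and uses the haversine formula with a mean Earth radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_step_km(points):
    """Average distance between consecutive points (assumed time-ordered)."""
    steps = [haversine_km(a, b) for a, b in zip(points, points[1:])]
    return sum(steps) / len(steps) if steps else 0.0


def radius_of_gyration_km(points):
    """Root-mean-square distance of the points from their centroid."""
    lat_c = sum(p[0] for p in points) / len(points)
    lon_c = sum(p[1] for p in points) / len(points)
    squared = [haversine_km(p, (lat_c, lon_c)) ** 2 for p in points]
    return math.sqrt(sum(squared) / len(squared))
```

Computing these values per person and then summarizing them per neighborhood (for example with a median) would give the kind of descriptive statistic discussed here.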

4 Data Sources and Limitations

For our research, we are currently using data from Twitter, due to its richness and volume. Twitter claims over 500 million tweets are posted per day (Krikorian 2013), and this data is publicly available. While only a small fraction of these tweets are geotagged, even a small fraction of tweets from any given day forms a large and rich data set. In addition, past work suggests that the sampling bias from only selecting these tweets is limited (Priedhorsky et al. 2014).

4.1 Biases in the Data

Of course, neither Twitter nor Foursquare provides an exactly accurate view of people’s mobility. Both are communicative media, not purely representative. For example, Rost et al. (2013) report that the Museum of Modern Art in New York has more check-ins than Atlanta’s airport, even though the airport had almost three times as many visitors in the time period that was studied. In some cases this will not matter; if we are clustering people, for example, grouping people who communicate that they go to the same places will be nearly as successful as grouping people who actually go to the same places. In other cases, we hope to minimize this bias by primarily comparing similar businesses, but we must remain aware of it.

Second, these media can be performative as well. People check in not because the check-in represents their location the most accurately, but because they want to show off that they have performed the check-in (Cramer et al. 2011). Sometimes people may avoid checking in for the same reason; they do not want it to be known that they checked in at a certain venue, like a fast food restaurant (Lindqvist et al. 2011).

Third, there are currently several demographic biases in these data sets. For example, Twitter, Flickr, and Foursquare are all more active per capita in cities than outside them (Hecht and Stephens 2014). Furthermore, these social media sites are all predominantly used by young, male, technology-savvy people.

One effect of this bias is shown in Fig. 2. This screenshot shows the results of some of our clusters in Livehoods (Cranshaw et al. 2012), with each dot representing a venue, and different colors representing different clusters. Note the lack of data in the center of the figure. This area is Pittsburgh’s Hill District, a historic area which was the center of Pittsburgh’s jazz scene in the early twentieth century. Currently, the median income for residents in the Hill District is far lower than in other parts of Pittsburgh. This neighborhood has also seen some revitalization with new senior housing, a library, a YMCA, several small office buildings, and a grocery store. However, there is still a notable lack of geotagged social media data in this area.

Fig. 2 Several livehoods in Pittsburgh. Each dot represents one Foursquare location. Venues with the same color have been clustered into the same livehood (the colors are arbitrary). The center of the map is a residential area where people with relatively low socioeconomic status live. There is also a notable lack of Foursquare data in this area.

In short, while geotagged social media data has great potential, we also need to be careful because this data may not necessarily be representative of all people that live in a city. It is possible that this demographic bias may solve itself over time. Currently, about 58 % of Americans have a smartphone (Pew Internet 2014), and the number is rapidly growing. However, it may still be many years before demographics are more representative, and there is still no guarantee that the demographics of geotagged social media data will follow. For now, one approach is to look for ways of accounting for these kinds of biases in models. Another approach is to make clearer what the models do and do not represent.

4.2 Privacy Implications

Privacy is also a clear concern in using geotagged social media to understand cities. From an Institutional Review Board (IRB) perspective, much of social media data is considered exempt, because the researchers do not directly interact with participants, the data already exists, and the data is often publicly visible. However, as researchers, we need to go beyond IRB and offer stronger privacy protections, especially if we make our analyses available as interactive tools.

Here, there are at least two major privacy concerns. The first is making it easy to access detailed information about specific individuals. Even if a person’s social media data is public data, interactive tools could make a person’s history and inferences on that history more conspicuously available. Some trivial examples include algorithms for determining a user’s home and work locations based on their tweets (Komninos et al. 2013). More involved examples might include other aspects of their behaviors, such as their activities, preferences, and mobility patterns. In the Livehoods project, we mitigated this aspect of user privacy by only presenting information about locations, not people. We also removed all venues labeled as private homes.

Second, we need to be more careful and more thoughtful about the kinds of inferences that algorithms can make about people, as these inferences can have far-reaching effects, regardless of whether they are accurate or not. There are numerous examples of inferences outside of geotagged social media that might be viewed as intrusive, embarrassing, or even harmful. For example, Jernigan and Mistree (2009) found that, given a social network with men who did not self-report their sexuality, they could identify gay men simply by analyzing the self-reported sexuality of an individual’s friends. As another example, the New York Times reported on how Target had developed algorithms that could infer if a customer was pregnant (Duhigg 2012). A separate New York Times article reported on how people were assessed for credit risks based on what they purchased as well as where they shopped (Duhigg 2009).

It is important to note that these risks are not just hypothetical. At least one person had his credit card limit lowered, with the explanation that “other customers who have used their card at establishments where you recently shopped have a poor repayment history with American Express” (Cuomo et al. 2009).

It is not yet clear what the full range and extent of inferences is with geotagged social media. A significant concern is that inferences like the ones above and ones using social media data can become a proxy for socioeconomic status, gender, or race, inadvertently or even intentionally skirting around charged issues under the guise of an “objective” algorithm. It is also not clear if there is anything that can be done about these kinds of inferences, given that these inferences would be done on private servers. It is unlikely that there is a technical solution to this problem. It may very well be the case that society will require new kinds of laws governing how these inferences are used, rather than trying to control the inferences themselves.

4.3 Creating a Sustainable Ecosystem

In our work, we hope to find a way to co-create value for both social media users and the people who analyze their data. As it exists now, value flows only from users to marketers and analysts. To create a more sustainable tool, and to avoid impinging on users’ freedoms, it is important that the users gain some benefit from any system we create as well. Some of our projects point in this direction, especially the ones aimed at individual users. People may be more amenable to a tool that offers businesses insights based on their public tweets if they can have access to those insights as well.

A successful example of co-creation of value is Tiramisu (Zimmerman et al. 2011). This app aimed to provide real-time bus timing information by asking people to send messages to a server when they were on a bus. Users were allowed to get more information if they shared more information. In contrast, OneBusAway (Ferris et al. 2010) provides real-time bus information using sensors that are installed on buses. Using a collaborative approach, it may not be necessary to implement a costly instrumentation project. In addition, people may feel more ownership of a system if they contribute to it.

5 Conclusion

Despite the challenges, social media remains a potentially transformative, yet underused, source of geographic data. The works we have cited here represent useful early attempts, but we hope to inspire more. Analytics tools based on geotagged social media data can help city planners plan future developments, businesses understand the pulse of their customers, and individuals fit into new cities much more seamlessly. As our cities grow quickly and more of the world moves into heavily urbanized areas, instead of using costly methods to understand our cities, researchers of all kinds will be able to mine existing data to understand social patterns that are already there.