1 Introduction

A notable example of a widely popular system with volunteered geographic information (VGI) capabilities is Wikipedia, an online collaborative encyclopedia. Wiki technology provides simple methods for Web-based collective authorship where anyone can contribute. Using this technology, Wikipedia provides a large-scale social computing system in which participants collectively author encyclopedic information.

Since its launch in 2001, Wikipedia has grown to 17.5 million articles in 263 languages. Since March 2007, Alexa has ranked Wikipedia in the top 10 Internet sites. As of 23 February 2010, Wikipedia had 15 million articles in 272 languages with 860 million edits from 22 million contributors (Wikimedia 2010). During 2009 alone, Wikipedia had 365 million unique visitors that generated 133.6 billion page views (Zachte 2010a). Its impact on the Web’s content is significant. Fifty-one percent of its site visits come from link-based search engine referrals (Alexa Internet, Inc. 2009). Of those page views that were referred to Wikipedia by external sites, 42% were referred by Google search, maps, and other services, and 8% were made by Google’s “web-crawling” software GoogleBot (Zachte 2009). Over 1.2 million articles are place-based articles (i.e., “geotagged”) (as of April 2011). These geotagged articles span dozens of languages and are accessible through geobrowsers and online mapping services.

As the Internet itself grows, many describe it as placeless—cyberspace without place. Yet sociological researchers find cultural differences in virtual communities that mimic real-world environments, and a shared understanding of a virtual place is a central determinant in such research. But today, any Internet user can get some sense of place through rich interactive geovisualization technologies. “Slippy maps” depict roads and buildings and other geographical features using simple point-and-drag navigational and informational tools and even 3D imagery. Within these online mapping interfaces, users may access a diverse set of VGI, including geotagged Wikipedia articles and photographs.

Yet, despite the advantages of the Internet for collaborative work, authors are fundamentally engaged in knowledge production processes that are grounded in social structures and norms, and in turn, physical place. Geographic distance, in particular, should be a significant factor in online knowledge production. But the nature of the Internet in a globalized world has led to debate on whether geographic distance matters (cf. Cairncross 1997; Friedman 2005; Goodchild 2004; Marston et al. 2005). That is, the Internet may redefine the role of physical place in our lives due to reduced communication costs and increased ubiquity. Zook (2005, p. 54) summarizes this debate as a new “geography of electronic spaces,” as the Internet becomes “a recombinant space for political, cultural, and economic interaction.”

This chapter focuses on information production methods and processes behind geographic Wikipedia articles and discusses the nature of these production processes. For example, are contribution patterns similar between VGI and non-VGI content? How do authors geotag articles? What is the geography of Wikipedia’s authorship? What is the spatial distribution of articles and contributors, and how does physical proximity influence contributions, either by article topic or language?

2 Collective Authorship Processes

Collective authorship is one type of information production process—a mass collective effort by individuals to produce information artifacts within a digital commons. The term “information production” itself has different semantics across disciplines. In the humanities, the term may represent the authoring of a written work or book; in economics, market resources, or commodities, or perception, or even a constitutive force in society (Browne 1997, p. 266); in library science, how we communicate collaborative work to public scientific knowledge (Cronin 2001); and, in social computing, collaborative filtering or recommendation systems (Beenen et al. 2004), blogging as community forums (Nardi et al. 2004), and user-generated tag clouds (Golder and Huberman 2006). For Wikipedia, the terms wikinomics (Tapscott and Williams 2006), collective intelligence (O’Reilly 2005), and crowdsourcing (Brabham 2008) all reflect the user-centric processes that drive information production.

And user-centric it is. Each month, over ten million authors contribute to Wikipedia articles, roughly divided into two classes of contributors: a small, highly productive set, and everyone else. The Web itself has a scale-free, power law distribution in its link structure (Broder et al. 2000) and surfing behavior (Huberman et al. 1998), and Wikipedia shows the same kind of distribution in both readership (Priedhorsky et al. 2007) and editing (Almeida et al. 2007; Kittur et al. 2007; Voss 2005). For example, the intensity of authorship shows that a small number of Wikipedia articles receive the majority of edits, and the vast majority of articles receive a small number of edits (i.e., the long tail).Footnote 1

Wikipedia’s production processes are nontrivial, despite its perception in the popular media as a loose or chaotic system. Wikipedia has many policies and mechanisms to govern contributions, including rule-making, monitoring, conflict resolution, and norms (Forte and Bruckman 2008; Lih 2009; Viégas et al. 2007a, b). Its most well known policy is that contributors must write articles using a neutral point of view, and this is a key discussion point between authors (e.g., Bryant et al. 2005; Viégas et al. 2004). As described by Wikipedia, neutral point of view (NPOV) is “a fundamental Wikimedia principle and a cornerstone of Wikipedia,” requiring that “all content [be] written from a neutral point of view, representing fairly, proportionately, and as far as possible without bias, all significant views that have been published by reliable sources” (http://en.wikipedia.org/wiki/NPOV).

The term Wikipedian does not have a strict definition, other than being a contributor to a Wikipedia article generally.Footnote 2 The four basic types of contributor are registered Wikipedians, anonymous Wikipedians, administrative Wikipedians, and bots. Registered Wikipedians create an account on Wikipedia, and their contributions are explicitly tagged in the article history using their account. Anonymous Wikipedians do not provide any registration information, and their computer’s IP address is used in lieu of an account. “Bots” and other administrative Wikipedians are both special cases of registered accounts, but they have additional access or permissions to edit articles. The overwhelming majority of Wikipedians do not collaborate with each other in a traditional sense. They do not often discuss their contributions with others (Viégas et al. 2007a) and as such form a loosely collaborative, online collective authorship. The most active segments of the Wikipedian population are 91,817 Wikipedians with at least five contributions per month and 1,076,908 Wikipedians with at least ten contributions total (Zachte 2010b). The “long tail” has 21.1 million Wikipedians, each of whom has fewer than ten contributions total.

Although authorship processes are largely invisible to readers, the authors themselves struggle to control article content around information types, responsibility, perspectives, organization, or provenance and creation (Miller 2005; Sundin and Haider 2007). Wikipedia provides complete article histories for those readers wanting detailed authorship information. WikiScanner, for example, is a data-mining tool that extrapolates from article edit histories the location or affiliation of anonymous authors (Griffith 2007). But the utility of explicit authorship information is debatable. As summarized by Viégas (2005, p. 61), on the one hand explicit authorship information may be “an important part of social collaboration in the sense that it adds context to interactions,” and on the other hand it may be “irrelevant and sometimes even detrimental to the creation of truly communal repositories of knowledge.”

In fact, the success of Wikipedia and other “user-generated content” Web services (O’Reilly 2005) has challenged academic theories of production. Benkler (2002) argues that in terms of economic models of production, when the efficiency gains of “peering” exceed the costs of organizing human capital into a firm or market, a commons-based peer production system will emerge. Its advantage is based not only on reduced costs of human capital and communications but also on the nonrival aspects of Web-based information artifacts; that is, many people can read (consume) a webpage simultaneously without degrading its value. This effectively eliminates allocation costs to consumers and increases the pool of potential contributors, which mitigates effects from free riders.

When applied to geographic information production, these factors will likely challenge the “knowledge politics” of spatial data infrastructures (Elwood 2010). For example, they may weaken traditional notions of authoritative sources as the collective social production of spatial information increases (Budhathoki et al. 2008; Coleman et al. 2009; Sieber and Rahemtulla 2010). As Sui (2008, p. 4) argues, the “wikification of GIS is perhaps one of the most exciting, and indeed revolutionary developments since the invention of [GIS] technology in the early 1960s.” Moreover, Wikipedia’s editorial patterns in the production of VGI content are similar to those for nongeographic content. That is, each of the four types of contributors exhibits editorial patterns that are systematic when contributing to geographic articles, but idiosyncratic across languages (Hardy 2008).

3 Volunteered Geographic Information in Wikipedia

Now, we turn to the specific types of geographic information produced through collective authorship in Wikipedia. Geographic information, in general, informs us about the where of things. It is spatial information about a phenomenon’s distribution in our geographic world (Goodchild 2000). Georeferencing is the set of methods for defining a geographic location on the globe (Hill 2006), and geotagging assigns geographic locations to content (Amitay et al. 2004), referring to “tagging” georeferenced metadata to a document or other content. A geotag may contain geographic coordinates, extent, shape, or feature type information. A useful geometry for cataloguing georeferenced content is the minimum-bounding rectangle, which is the smallest rectangle aligned with the coordinate axes that spans all coordinates for a given location.
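The minimum-bounding rectangle described above can be illustrated with a minimal sketch. The function name and the sample points are hypothetical, and the sketch assumes the points do not straddle the antimeridian.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def minimum_bounding_rectangle(points: Iterable[Point]) -> Tuple[float, float, float, float]:
    """Return (south, west, north, east) for the smallest axis-aligned
    rectangle containing every point.

    Assumption: the points do not straddle the +/-180 degree meridian.
    """
    pts = list(points)
    if not pts:
        raise ValueError("at least one point is required")
    lats = [lat for lat, _ in pts]
    lons = [lon for _, lon in pts]
    return (min(lats), min(lons), max(lats), max(lons))


# Hypothetical sample points near the UCSB campus (illustrative values only).
campus_points = [(34.405, -119.860), (34.420, -119.845), (34.412, -119.852)]
mbr = minimum_bounding_rectangle(campus_points)
print(mbr)
```

A single point yields a degenerate rectangle whose corners coincide, which matches the point-style geotags discussed in the rest of this section.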

Wikipedia primarily uses single points and bounding rectangles rather than fine-resolution polygons in its geotagging processes. In this case, a geotag contains simple geographic coordinates for latitude and longitude, and this georeferenced information is embedded into articles using one of many microformats and extensions to Wikitext, Wikipedia’s content markup language. For example, the Template:Coord and Infobox Wikitext templates accept point coordinates (Wikipedia.org 2008). In fact, there are dozens of ways to include geographic coordinates in an article. There is not a single “geotag” standard or format for Wikipedia, or the Web for that matter (Table 11.1).

Table 11.1 Example geotag formats for University of California, Santa Barbara (UCSB; approx. 34.41°N, 119.85°W)

The geotagging process itself in Wikipedia is haphazard. Wikipedia started explicitly using structured geotagging in February 2005, when Egil Kvaleberg’s gis extension to MediaWiki introduced geotags. Some authors create geotags manually using a reference digital or paper map to estimate coordinates, while others resolve toponyms based on existing online gazetteers. Alternatively, bots that run periodically perform the bulk of the automated geotagging based on GEOnet Names Server, an online gazetteer. This process also adds geographic feature type (e.g., city, river, or mountain) when it is available from the gazetteer.

The vast majority of geotagging is reportedly done by a variety of bots (Kühn and Alder 2008), and their ad hoc nature ultimately makes it more difficult to extract geotags from articles. For example, the semiautomated bot Anomebot2 runs periodically to geotag articles or mark those that may need a geotag. It cross-references named entities in over 100,000 article titles with online gazetteer services.Footnote 3 These bots provide a structural mechanism to integrate existing geographic data sources into articles. But they are not semantic in nature, nor do they generate standardized markup (Table 11.1). In fact, they increase the complexity of extracting structured geographic information from articles because of their chaotic, ad hoc nature and that of the Wikitext markup and templates themselves (Sauer et al. 2007). The end result is that geotag extraction requires heuristic or data-mining approaches to deal with the nondeterministic, semistructured nature of article templates and the inconsistent inclusion of geotags. But, anecdotally, some claim the majority of geotags were created manually and not via automated processes (T. Alder, 22 April 2008, personal communication). This further obscures the lineage of these geographic coordinate data.
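To give a sense of why extraction is heuristic, the following minimal sketch parses three common forms of the Coord template (signed decimal degrees, decimal degrees with hemisphere letters, and degrees-minutes-seconds). It is not the Wikipedia-World extraction software. The function names are hypothetical, and the sketch ignores infobox parameters, nested templates, and the many other geotag formats in use.

```python
import re
from typing import List, Optional, Tuple

COORD_RE = re.compile(r"\{\{\s*coord\s*\|([^{}]*)\}\}", re.IGNORECASE)
HEMISPHERES = ("N", "S", "E", "W")


def _to_degrees(parts: List[str], hemisphere: str) -> float:
    # parts holds degrees, then optional minutes, then optional seconds.
    value = sum(float(p) / (60 ** i) for i, p in enumerate(parts))
    return -value if hemisphere in ("S", "W") else value


def parse_coord(wikitext: str) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) from the first {{coord|...}} found, or None."""
    match = COORD_RE.search(wikitext)
    if match is None:
        return None
    lead: List[str] = []
    for field in (f.strip() for f in match.group(1).split("|")):
        if field.upper() in HEMISPHERES:
            lead.append(field.upper())
            continue
        try:
            float(field)
        except ValueError:
            break  # first named or free-text parameter, e.g. "display=title"
        lead.append(field)
    if len(lead) == 2 and all(tok not in HEMISPHERES for tok in lead):
        return float(lead[0]), float(lead[1])  # signed decimal degrees
    if len(lead) in (4, 6, 8):
        half = len(lead) // 2
        lat_parts, lat_hemi = lead[:half - 1], lead[half - 1]
        lon_parts, lon_hemi = lead[half:-1], lead[-1]
        numeric = all(tok not in HEMISPHERES for tok in lat_parts + lon_parts)
        if numeric and lat_hemi in ("N", "S") and lon_hemi in ("E", "W"):
            return _to_degrees(lat_parts, lat_hemi), _to_degrees(lon_parts, lon_hemi)
    return None


print(parse_coord("{{coord|34.41|N|119.85|W|display=title}}"))
print(parse_coord("{{Coord|34|24|36|N|119|51|0|W}}"))
```

Even this small parser needs separate branches for each notation, which suggests why a catalog built across hundreds of language editions relies on data mining rather than a single grammar.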

To index place-based articles, the Wikipedia-World project creates a catalog of geotagged articles (Kühn and Alder 2008). Since geotagging in Wikipedia is chaotic, this process relies on data-mining methods and is largely heuristic (Fig. 11.1). In May 2008, this process found 1,163,797 geotagged articles across 230 languages and 234,474 unique locations (at 1 km resolution). Wikipedia-World uses these data to provide various online mapping services and exports the underlying geographic data as database tables. And the index of place-based articles is growing rapidly. In May 2011, the same process found 1.7 million geotagged articles across 273 languages and 1.1 million unique locations (at 1 km resolution, Fig. 11.2).
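Counting "unique locations" at a given resolution implies snapping coordinates to a grid before deduplicating. The sketch below shows one simple way to do so. The grid scheme, the constant, and the function names are assumptions of this illustration and are not taken from Wikipedia-World.

```python
import math
from typing import Iterable, Tuple

KM_PER_DEGREE_LAT = 111.32  # approximate; an assumption of this sketch


def grid_cell(lat: float, lon: float, resolution_km: float = 1.0) -> Tuple[int, int]:
    """Snap a coordinate to an integer (row, column) cell of about resolution_km."""
    lat_step = resolution_km / KM_PER_DEGREE_LAT
    row = math.floor(lat / lat_step)
    # A degree of longitude shrinks toward the poles, so size the columns
    # using the latitude at the center of the row.
    row_center_lat = (row + 0.5) * lat_step
    km_per_degree_lon = KM_PER_DEGREE_LAT * max(
        math.cos(math.radians(row_center_lat)), 1e-6
    )
    lon_step = resolution_km / km_per_degree_lon
    return (row, math.floor(lon / lon_step))


def count_unique_locations(
    coords: Iterable[Tuple[float, float]], resolution_km: float = 1.0
) -> int:
    """Count distinct grid cells occupied by the given (lat, lon) pairs."""
    return len({grid_cell(lat, lon, resolution_km) for lat, lon in coords})


# Two articles sharing one geotag plus one distant article (hypothetical data).
sample = [(34.41, -119.85), (34.41, -119.85), (48.8566, 2.3522)]
print(count_unique_locations(sample))
```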

Fig. 11.1 Detailed workflow for geotag data-mining software (Reprinted from Kühn 2008)

Fig. 11.2 Spatial distribution of geotagged Wikipedia articles, visualized using log-scale density for number of article contributions at 10 km resolution

4 Geography of Authorship

In systems like Flickr and Wikipedia, VGI content itself is spatially clustered (Hecht and Gergle 2010), and Wikipedia articles are also more likely to link to articles about places nearby (Hecht and Moxley 2009). But the literature does not directly address whether VGI production processes themselves exhibit regular spatial patterns. This section will discuss a spatial model for contributions, and results that show anonymous contributors exhibit geographic effects that fit an exponential distance decay function.

4.1 Data Collection

Wikipedia manages hundreds of individual language-specific databases across three data centers in the United States, Netherlands, and South Korea. Their services use open-source MediaWiki software and data models (MediaWiki 2006). Wikipedia provides article and metadata via periodic dumps of their database and as static HTML files (http://meta.wikimedia.org/wiki/Data_dumps), but historically, these data do not always include complete article contribution records due to their volume and limited operational resources (e.g., the August 2008 dump of the English Wikipedia had 2.5 million articles and 250 million contributions—http://en.wikipedia.org/wiki/Special:Statistics).

The openness of their data lends itself to empirical study by researchers (e.g., Almeida et al. 2007; Priedhorsky et al. 2007; Voss 2005). This study collects data directly via SQL from near real-time replicas of Wikipedia databases, provided by Wikimedia Deutschland’s Toolserver (http://toolserver.org). These databases use MySQL and the MediaWiki database schema, which organizes articles by revision. Briefly, the revision table provides metadata for author contributions and links to the page and text table for details on the article’s contents. For every article, the page table contains a unique identifier and the language-specific title for the article, and the text table stores the article’s contents. Wikipedians write articles using Wikitext, a loosely structured markup language (http://en.wikipedia.org/wiki/Wikipedia:MARKUP), and they embed semistructured metadata within the article (Fig. 11.3). The nondeterministic nature of Wikitext’s grammar and conventions causes problems for structured data extraction (cf. Sauer et al. 2007). The WP:GEO project in Wikipedia governs an infrastructure for adding geographic information to articles (http://en.wikipedia.org/wiki/Wikipedia:GEO). They provide an array of “wiki templates” that have a semistructured syntax for embedding geographic coordinates.
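The relationship between the page and revision tables can be sketched with an in-memory SQLite database. This is a simplified stand-in for the MySQL replicas used in the study: only a few columns are reproduced, the rows are hypothetical, and the query is one plausible way to count contributions per author and article rather than the study's actual SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_page INTEGER REFERENCES page(page_id),
        rev_user INTEGER,      -- 0 for anonymous contributions
        rev_user_text TEXT,    -- account name, or IP address if anonymous
        rev_timestamp TEXT
    );
    """
)
conn.execute(
    "INSERT INTO page VALUES (1, 'University_of_California,_Santa_Barbara')"
)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?, ?, ?)",
    [
        (10, 1, 42, "ExampleUser", "20080101120000"),
        (11, 1, 0, "192.0.2.17", "20080102120000"),
        (12, 1, 0, "192.0.2.17", "20080103120000"),
        (13, 1, 0, "198.51.100.5", "20080104120000"),
    ],
)

# Contributions per author for each article, most active authors first.
rows = conn.execute(
    """
    SELECT p.page_title, r.rev_user_text, COUNT(*) AS contributions
    FROM revision AS r
    JOIN page AS p ON p.page_id = r.rev_page
    GROUP BY p.page_title, r.rev_user_text
    ORDER BY contributions DESC, r.rev_user_text
    """
).fetchall()
for row in rows:
    print(row)
```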

Fig. 11.3 VGI production process in Wikipedia. Authors contribute to place-based articles using Wikitext and embedded geotags that are stored in database tables, including a full history of revisions. For anonymous authors, each revision includes their IP address. In the example, two authors contribute to an article about UC Santa Barbara whose signature distance \(d_\alpha\) is 1,246 km, defined as the average distance weighted by contributions, for example, \( (2 \cdot 4050 + 5 \cdot 125)/(2+5) \approx 1246 \)

Wikipedia-World’s database (Kühn and Alder 2008) from 10 May 2008 uses an extensive data-mining process to extract geotags embedded in Wikitext articles (Fig. 11.1).Footnote 4 For each geotagged article, we extract all the authoring history and the most recent version from the replica databases.

To simplify computation across language-specific databases, we migrate the authoring histories into a single shared database, where we modify MediaWiki tables to associate a source language for each record (e.g., page_id and a new page_lang column comprise the primary key instead of only page_id) and to remove data incidental to analysis. This data model provides a multilingual abstraction layer to Wikipedia articles, authors, and their contributions. It has tables for article, author, and geotag data, and author_article and geotag_article association tuples. It also provides fast access to summary statistics per article and per author. The data extraction from the MediaWiki tables results in page and text with 990,315 articles, revision with 32,141,334 author contributions between 2001 and 2008, and user with 578,448 registered author accounts. Since the user table contains records only for registered authors, the analysis extracts and parses data from the revision.rev_user_text column to identify IP addresses for anonymous users and to integrate them into the data model.
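Two steps in this data model lend themselves to a short illustration: keying records by both page_id and page_lang, and recognizing anonymous authors by testing whether rev_user_text parses as an IP address. The sketch below uses Python's standard ipaddress module. The helper name and the sample rows are hypothetical, and the study's own parsing rules may differ.

```python
import ipaddress
from collections import Counter


def is_anonymous(rev_user_text: str) -> bool:
    """Treat a contributor as anonymous if the name is a valid IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(rev_user_text.strip())
    except ValueError:
        return False
    return True


# Hypothetical rows: (page_id, page_lang, rev_user_text).
revisions = [
    (1, "en", "ExampleUser"),
    (1, "en", "192.0.2.17"),
    (1, "da", "192.0.2.17"),
    (1, "da", "2001:db8::1"),
]

# The composite key (page_id, page_lang) keeps articles that share a numeric
# identifier in different language databases distinct.
anonymous_counts = Counter(
    ((page_id, page_lang), user)
    for page_id, page_lang, user in revisions
    if is_anonymous(user)
)
print(anonymous_counts)
```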

4.2 Spatial Model of Authorship

Each author in Wikipedia has a “spatial footprint” comprised of all of the articles to which they have contributed. For anonymous authors, we can estimate their location using IP geolocation (Fig. 11.4). For registered authors and bots (Figs. 11.5 and 11.6), we have no direct estimate of their location, although an indirect estimate based on their spatial footprint is possible (Lieberman and Lin 2009). But are there spatial patterns in these interactions between the authors and the places about which they write?

Fig. 11.4 Spatial footprint of an anonymous author with 172 contributions to 143 articles in the Danish Wikipedia. The yellow icon represents an estimate of the author’s location, based on IP geolocation

Fig. 11.5 Spatial footprint of a registered author with 1,099 contributions to 296 articles in the Danish Wikipedia. Markers represent the geotagged location of each article edited by the author, the vast majority of which are clustered inside Denmark

Fig. 11.6 Spatial footprint of a bot with 3,006 contributions to 1,601 articles in the Danish Wikipedia

4.2.1 Gravity Models

In regional geography and related disciplines, spatial interaction models form the basis of social theories (Haynes and Fotheringham 1984). These models pertain to flows (interactions) between two or more geographic regions. They have a decades-long history in geography dating back to “social physics” in the early twentieth century (Fotheringham 1981; Wilson 1969, 1971). Distance decay or “gravity” models are one type of spatial interaction model. They use “mass” functions to deal with scale and distance effects. The general gravity model (Sen and Smith 1995, p. 3) is

$$ T_{ij} = A_i \cdot B_j \cdot F(d_{ij}), $$
(11.1)

where \(T_{ij}\) is the interaction between population centers \(i\) and \(j\); \(A_i\) and \(B_j\) are unspecified origin and destination weight (mass) functions; \(d_{ij}\) is the spatial factor, or distance between regions \(i\) and \(j\); and \(F(d_{ij})\) is an unspecified distance decay function, which is commonly a power, exponential, or gamma (combined) function (Sen and Smith 1995, pp. 93–99).
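As a minimal sketch of Eq. 11.1, the general gravity model can be written as a function that accepts any distance decay function. The function names and parameter values are illustrative assumptions. Only the power and exponential forms are shown, with constant masses standing in for the unspecified weight functions.

```python
import math
from typing import Callable

Decay = Callable[[float], float]


def exponential_decay(beta: float) -> Decay:
    """F(d) = exp(-beta * d)."""
    return lambda d: math.exp(-beta * d)


def power_decay(gamma: float) -> Decay:
    """F(d) = d ** (-gamma), defined for d > 0."""
    return lambda d: d ** (-gamma)


def gravity_interaction(a_i: float, b_j: float, d_ij: float, decay: Decay) -> float:
    """T_ij = A_i * B_j * F(d_ij), following the general form of Eq. 11.1."""
    return a_i * b_j * decay(d_ij)


# Hypothetical masses and distance: the two decay forms give different flows.
print(gravity_interaction(2.0, 3.0, 2.0, exponential_decay(0.5)))
print(gravity_interaction(2.0, 3.0, 2.0, power_decay(2.0)))
```

Passing the decay function as a parameter mirrors the way Eq. 11.1 leaves \(F\) unspecified until a particular model is chosen.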

In spatial information theory, an individual’s information field is the spatial distribution of the “knowledge an individual has of the world” (Morrill and Pitts 1967, p. 406) and is a factor when modeling sociospatial behaviors, like diffusion of innovation or migration (Hägerstrand 1967). An individual’s information field decays as the distance from the individual increases. In quantitative geography, gravity models formalize spatial interaction analysis by using this type of distance decay function (Fotheringham and O’Kelly 1989; Sen and Smith 1995). When Wikipedians choose to write about a place, their mean information fields should exhibit distance decay effects found in other sociospatial phenomena, like innovation diffusion. When Wikipedians as a group write more articles, for example, they expand the overall spatial coverage of Wikipedia articles. But when an individual Wikipedian writes an article about a place, that place is likely to be nearby. Thus, our hypotheses for this study are (a) Wikipedians write articles about nearby places more often than distant ones and (b) this likelihood follows an exponential distance decay function.

4.2.2 Gravity Model for VGI Production

To model VGI production as a spatial process, we define a probabilistic model where the dependent variable is a likelihood for interaction, based on a spatial factor. Specifically, we use a probabilistic invariant exponential gravity model (Sen and Smith 1995, p. 102). In terms of Eq. 11.1, \(T_{ij}\) is converted to the probability of an interaction based on a spatial factor. The mass terms \(A_i\) and \(B_j\) are combined into a single invariant constant \(K\) to allow for uneven distributions of authors and articles over the Earth’s surface. Finally, \(F(d_{ij}) = \exp(-\beta d_{ij})\), an exponential distance decay function.

$$ \Pr(d = d_\alpha) = K \cdot \exp(-\beta d_\alpha), \quad \text{where } d = d' \pm \varepsilon. $$
(11.2)

Equation 11.2 shows the model using the probability \(\Pr(d = d_\alpha)\) as the likelihood that a given article has a signature distance \(d_\alpha\) equal to a distance \(d\) within a range of \(d' \pm \varepsilon\) (\(K\) and \(\beta\) are empirically derived constants). For this spatial model, we use a “signature distance” \(d_\alpha\) metric to measure the proximity effect for a given article (Hardy et al. 2012). The metric is the average distance between an article and its \(n\) authors, weighted by relative number of contributions from each author (Figs. 11.3 and 11.7). That is, each anonymous author has a spatial footprint that is the set of contributions made to any geotagged article by that author. Every author has a single footprint, and every article belongs to its authors’ footprints. This model requires a known location for both articles and authors, so we use MaxMind’s GeoLite City database, which uses proprietary methods to convert IP addresses into geographic coordinates, to estimate the locations of anonymous Wikipedians whose IP addresses are embedded into their contributions.Footnote 5 Location-based services have driven the development of methods to convert IP addresses into geographic coordinates (Muir and Oorschot 2009; Stanger 2008) and to evaluate positional accuracy (Gueye et al. 2006, 2007).

Fig. 11.7 The UCSB article in English has a signature distance of 533 km based on 135 anonymous authors with 719 revisions. Each contribution is shown as a white line, with thicker lines denoting more contributions
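The signature distance can be sketched as a contribution-weighted mean of great-circle distances. The code below is an illustration under stated assumptions and is not the study's implementation: it uses the haversine formula with a mean Earth radius of 6,371 km, and the function names are hypothetical. The worked example reuses the two-author figures from the caption of Fig. 11.3.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine great-circle distance in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def signature_distance(weighted: Iterable[Tuple[int, float]]) -> float:
    """Mean of author-to-article distances, weighted by contribution counts.

    Each item is (contributions, distance_km) for one author.
    """
    pairs = list(weighted)
    total = sum(count for count, _ in pairs)
    if total <= 0:
        raise ValueError("need at least one contribution")
    return sum(count * dist for count, dist in pairs) / total


# Two authors: 2 contributions from 4,050 km away and 5 from 125 km away.
d_alpha = signature_distance([(2, 4050.0), (5, 125.0)])
print(round(d_alpha))  # 1246
```

In practice each distance would come from great_circle_km applied to the article's geotag and the author's estimated location.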

4.2.3 Model Results by Article

To fit the model in Eq. 11.2 to the study data, we use an ordinary least squares regression method with a logarithmic transformation to a linear model:

$$ \ln\left[\Pr(d = d_\alpha)\right] = \ln K - \beta d_\alpha. $$
(11.3)

All geographic calculations use ∼10-km resolution and great circle distances (where 1′ of arc = 1.852 km). We selected the sample from available data to satisfy the methodological requirements that articles have at least one anonymous contribution (for author location estimates) and that articles have one and only one geotag (for signature distance metric). We convert the units of \(d_\alpha\) from km to \(10^3\) km, and use observed relative frequency for \(\Pr(d = d_\alpha)\). The model fits at \(K = 0.0022\) and \(\beta = 0.2842\) (n = 438,077; \(R^2\) = 0.9005; p < 0.01; f = 17,480; DF = 1,930). When signature distances are relatively low (\(d_\alpha < 2\)), there is no correlation across language databases, suggesting spatial behavior is idiosyncratic across languages.
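The log-linear fit in Eq. 11.3 can be sketched with a closed-form ordinary least squares estimate. The data below are synthetic and noise-free, generated from the constants reported above only to show that the transformation recovers them. They are not the study data, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_exponential_decay(
    distances: Sequence[float], probabilities: Sequence[float]
) -> Tuple[float, float]:
    """Fit Pr = K * exp(-beta * d) by OLS on ln(Pr) = ln(K) - beta * d.

    Returns (K, beta).
    """
    xs = list(distances)
    ys = [math.log(p) for p in probabilities]
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more matching observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic observations; distances are in units of 10^3 km.
K_TRUE, BETA_TRUE = 0.0022, 0.2842
ds = [0.5 * k for k in range(1, 21)]
ps = [K_TRUE * math.exp(-BETA_TRUE * d) for d in ds]
k_hat, beta_hat = fit_exponential_decay(ds, ps)
print(k_hat, beta_hat)
```

With observed relative frequencies in place of the synthetic values, the same transformation applies, although bins with zero frequency would need to be excluded before taking logarithms.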

4.2.4 Model Results by Article Category

To test whether signature distances vary by category, we collected categorical data for English articles. Contributors may assign Wikipedia articles to one or more categories. These categories are not strictly tags but rather registered categories, although anyone may create a new category. These categories are often descriptive of a topic such as “14th-century architecture” or “Art museums and galleries in Paris.” They may be editorial, however, and denote workflow items such as “Tokyo railway station stubs,” or “All articles needing style editing,” or “Articles lacking sources from December 2009.” The category space is flat with no consistent nomenclature. Each article’s categories are displayed at the bottom of the article, and each category has an “article” that lists all articles belonging to that category. From our study, we collected 8,474 unique categories with at least ten English articles, comprising 372,793 articles.Footnote 6 We then extracted 4,512 unique keywords (minus common words) from the category titles to create an inverted index of category keywords. Each index entry has a unique category keyword, the number of articles that belong to the category, and a mean signature distance for those articles.
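An inverted index of this kind can be sketched as follows. The stop-word list, the tokenization rule, and the sample categories and distances are all hypothetical, and the study's keyword extraction may have differed in detail.

```python
import re
from collections import defaultdict
from typing import Dict, Set, Tuple

# Illustrative stop words only; not the list used in the study.
STOPWORDS = {"and", "in", "of", "the", "from", "all", "by", "for", "to"}


def keywords(category_title: str) -> Set[str]:
    """Lowercase tokens (hyphenated words kept whole) minus common words."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", category_title.lower())
    return {t for t in tokens if t not in STOPWORDS}


def build_keyword_index(
    category_articles: Dict[str, Set[str]], signature_km: Dict[str, float]
) -> Dict[str, Tuple[int, float]]:
    """Map each keyword to (article count, mean signature distance in km)."""
    members: Dict[str, Set[str]] = defaultdict(set)
    for title, articles in category_articles.items():
        for kw in keywords(title):
            members[kw].update(a for a in articles if a in signature_km)
    return {
        kw: (len(arts), sum(signature_km[a] for a in arts) / len(arts))
        for kw, arts in members.items()
        if arts
    }


# Hypothetical categories and signature distances, for illustration only.
categories = {
    "Art museums and galleries in Paris": {"Louvre", "Musée d'Orsay"},
    "Museums in California": {"Getty Center"},
}
distances = {"Louvre": 4000.0, "Musée d'Orsay": 5000.0, "Getty Center": 600.0}
index = build_keyword_index(categories, distances)
print(index["museums"])  # (3, 3200.0)
```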

For topic keywords with at least 50 articles, Table 11.2 shows the popular topic keywords in English articles by the mean signature distance \(d_\alpha\) (n = 372,793; mean = 3,049 km). While not conclusive, there is some evidence that signature distances do vary by topic. Topic keywords with lower mean distances are “local” in scope such as cities (“[New] York”), state names, administrative boundary terms (“County” or “Metropolitan”), and buildings (“Museums”). Those with higher mean distances are “regional” in scope such as non-English speaking cities (“Paris”), country names (“Australia”), and regional boundaries (“Islands” or “Province” or “Region”).

Table 11.2 Popular topic keywords in English articles, sorted by distance

5 Discussion

This section presents some further research issues on architectural, social, and methodological factors, beginning with how both geotagging and geolocation could better support VGI production processes.

5.1 Architectural Factors

The lack of well-structured geotags is problematic. In particular, further research is needed on methods for specifying geotags as first-class metadata, rather than as the most basic common denominator of latitude and longitude coordinates. If collaborative online gazetteers with large-scale global coverage were to emerge, they might serve as a basis for toponym-indexed geotags and thus relieve users from low-level georeferencing tasks. In the meantime, collaborative methods are a possible approach to improving geotag metadata, especially within scientific communities. Currently, geotagging schemes are opaque and inconsistent and are done by automated bots or by users who specify geographic coordinates interactively from a general-purpose mapping service. Neither of these schemes preserves semantic or context information about place; each instead leaves only precise numerical coordinates of ambiguous intent.

For decades, metadata has been the ever-present, cure-all solution to heterogeneous data integration and use. Yet high-quality, ubiquitous metadata is extremely rare in practice, despite geospatial data infrastructures that are designed to be interoperable and metadata-centric (de By et al. 2009; van Loenen et al. 2009). Current VGI systems may provide insights on how users could produce and manage better metadata for geotags. Metadata is “data about data,” intended to facilitate data discovery, integration, and use (or reuse). Practitioners often standardize metadata syntax and semantics, but adherence to metadata standards is extremely rare in distributed systems, especially large or global ones; this is hereafter referred to as the “metadata problem.” GIS usually assumes strongly typed spatial data representations, and GIScientists have developed disambiguation methods (e.g., toponym resolution or fuzzy boundaries) for spatial data that do not comply with these structures. These complexities make metadata important for geospatial integration and use. VGI systems, however, successfully integrate heterogeneous data sources on a global scale without solving the metadata problem directly. VGI systems use “best effort” geotagging methods and representations to avoid the complexity of richer GIScience approaches to georeferencing. Moreover, the VGI notion of metadata, and its production and management, is different than in geospatial data infrastructures.

Scientific communities have collaborated on metadata standards and conventions, such as CF (Hankin et al. 2009) and its predecessor (COARDS 1995), but a study of earth science datasets published via the OPeNDAP protocol (Hardy et al. 2006) found that the datasets do not accurately follow these conventions. In fact, only a minority of them declare their convention (as required), and even of those, only a fraction accurately adhere to the convention they state. In practice, scientific data sharing varies by discipline. Ecologists, for example, take idiosyncratic approaches to data sharing and reuse, which depend on disciplinary knowledge and social factors (Zimmerman 2007). This metadata problem forces scientists to use specialized knowledge and manual effort for data reuse.

Wikipedia may provide some lessons for metadata production and management in geospatial data infrastructures (Table 11.3). GIScientists may consider the wiki approach to metadata production and use to address how they might integrate the increasingly voluminous VGI data into metadata-based geospatial data infrastructures. In particular, the novelty and practicalities of VGI production may benefit the scientific community as they confront increasingly large-scale, heterogeneous data integration problems in metadata-poor environments—a recurrent research area (Hardy 2010; Hardy et al. 2006; Lanter 1991; Rodriguez et al. 2009).

Table 11.3 Applying Wikipedia approaches to geotagging

Ideally for analysis, all contributions would have explicit geographic information for the author’s location. But these data are not available in most VGI applications, including Wikipedia. Thus, geolocation methods are problematic for VGI contributions due to constraints in data availability and also privacy concerns. This study exploits IP addresses to apply geolocation methods for anonymous contributors. IP geolocation methods, however, are inherently both spatially and temporally dynamic in nature, inaccurate at large scales (i.e., street-level), and relatively easily evaded by savvy users or anonymizing software (Duckham and Kulik 2005; Muir and Oorschot 2009).

Alternatives are similarly constrained. Survey-based methodologies are limited by the level of anonymity in Wikipedia (e.g., Nov (2007) used email solicitations, which yielded about 150 authors). Spatial analysis methods based on behavioral patterns, such as the locations of the articles to which an author has contributed, are relatively new in this research area (Lieberman and Lin 2009). Combined approaches (i.e., where quantitative spatial analysis models are calibrated with surveyed locations) may prove useful. Furthermore, VGI is increasingly moving into the mobile domain, where users leave (often implicitly) digital traces that are more conducive to geolocation methods, for instance through GPS-enabled smartphones, cell phone tower records, or even georeferenced photos (Girardin et al. 2008; González et al. 2008). These trace data can enable spatial data-mining methods for tracking trajectories of individuals or groups (Kisilevich et al. 2010).

Interdisciplinary approaches may also prove useful since geolocation methods are used in other domains. Geographic profiling, for example, is a criminal “investigative methodology that uses the locations of a connected series of crimes to determine the most probable area of offender residence” (Rossmo 2000, p. 1). Geographic profiling systems use spatial distribution and probability distance strategies, such as center of the circle, centroid, median, geometric mean, harmonic mean, and center of minimum distance algorithms (Snook et al. 2005).
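To make two of these strategies concrete, the following minimal sketch computes a centroid and a center of minimum distance for a set of locations. It is an illustration under stated assumptions rather than the method of any cited profiling system: the coordinates are assumed to be planar (e.g., projected and measured in kilometers), the function names are hypothetical, and the center of minimum distance is approximated with Weiszfeld's iteration, which is one common choice among several.

```python
import math


def centroid(points):
    """Arithmetic mean of the x and y coordinates (the spatial mean)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def total_distance(center, points):
    """Sum of Euclidean distances from a candidate center to all points."""
    cx, cy = center
    return sum(math.hypot(x - cx, y - cy) for x, y in points)


def center_of_minimum_distance(points, tol=1e-9, max_iter=1000):
    """Approximate the point minimizing total Euclidean distance.

    Uses Weiszfeld's iteration, starting from the centroid. Assumes
    planar coordinates; this is a sketch, not a production routine.
    """
    cx, cy = centroid(points)
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for x, y in points:
            d = math.hypot(x - cx, y - cy)
            if d < tol:
                # The estimate sits on a data point; skipping that point
                # avoids division by zero (a simplification).
                continue
            w = 1.0 / d
            num_x += w * x
            num_y += w * y
            denom += w
        if denom == 0.0:
            break  # every point coincides with the current estimate
        nx, ny = num_x / denom, num_y / denom
        moved = math.hypot(nx - cx, ny - cy)
        cx, cy = nx, ny
        if moved < tol:
            break
    return (cx, cy)


if __name__ == "__main__":
    # Hypothetical locations: a tight cluster plus one distant outlier.
    sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
    print("centroid:", centroid(sites))
    print("center of minimum distance:", center_of_minimum_distance(sites))
```

With a distant outlier in the input, the centroid is pulled toward the outlier, whereas the center of minimum distance stays closer to the main cluster. This difference is one reason the choice of strategy matters when estimating a likely home location from a series of observed locations.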

5.2 Social Factors

How do social factors (such as communication, culture, language, settlement patterns (diaspora), and socioeconomic status) influence VGI contributions? The production and use of VGI will likely shift spatial data infrastructures architecturally to provide for social factors (Budhathoki et al. 2008; Coleman et al. 2009; Elwood 2010; Elwood et al. 2012; Sieber and Rahemtulla 2010). Further modeling of social characteristics in the collaborative authorship process might include spatiotemporal constraints on social networks of Wikipedians or future VGI systems based on increasingly rich social network technologies.

For example, the VGI production model defines work in the signature distance metric in simple terms as an edit count. But the literature has many different definitions for “work,” including edit counts (Kittur et al. 2007), edit deltas (Zeng et al. 2006), edit similarity (i.e., information distance) (Voss 2005), edit longevity (i.e., age or survival or persistence) (Adler and de Alfaro 2007; Wöhner and Peters 2009), and edit visibility (Priedhorsky et al. 2007). These metrics may better model social processes and clarify sociospatial factors in collaborative authorship. In particular, edit longevity and edit visibility more directly reflect social phenomena like “edit wars” (see footnote 7) and herding behaviors, respectively. Similarly, our study had limited comparison of geographic effects across article categories, but further analysis on content-centric dimensions may help study these social processes.
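As a rough illustration of how the choice of metric changes the measured “work,” the sketch below compares edit counts with a simple size-based delta for one article's revision history. The data layout (a chronological list of author and article-size pairs) and the function names are hypothetical, and the absolute change in byte length is only a crude stand-in for an edit delta; it is not necessarily the definition used by the cited authors.

```python
from collections import Counter


def edit_counts(revisions):
    """Work as an edit count: number of revisions saved by each author."""
    return Counter(author for author, _ in revisions)


def size_deltas(revisions):
    """Work as a crude edit delta: total absolute change in article size.

    `revisions` is a chronological list of (author, size_in_bytes) pairs;
    the article is assumed to start empty. This is an illustrative proxy,
    not the edit-delta definition from the literature.
    """
    totals = Counter()
    previous_size = 0
    for author, size in revisions:
        totals[author] += abs(size - previous_size)
        previous_size = size
    return totals


if __name__ == "__main__":
    # Hypothetical revision history for a single article.
    history = [("a", 100), ("b", 120), ("a", 90), ("c", 90)]
    print(edit_counts(history))
    print(size_deltas(history))
```

In this hypothetical history, an author who saves a revision without changing the article's size still earns one edit under the count metric but no work under the size-based proxy, which shows how the two definitions can rank contributors differently.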

Another question is whether language and population demographics affect spatial patterns in VGI. Ideally, spatial models for collective authorship would include, for any given location, the probability that potential Internet users who speak a given language are available to make contributions. This study did not normalize authorship by population or potential speakers due to a lack of available data at the needed resolution. Balk and Yetman (2004) provide relatively large-scale data for population estimations but do not include speaker estimates. Moreover, Internet use is spatially variant (Billón et al. 2008; Zook 2005), and large-scale Internet population estimates are not readily available.

Furthermore, at the global scale of the Internet, the concept of “near” is different from that in social science research conducted at smaller scales (Graham 1998). For example, in our study, less than 2,000 km is relatively “near” compared to the full scope of available Wikipedia contributors. Notwithstanding global or even virtual travel (Urry 2002), typical scales for nearness are much smaller than 2,000 km, such as walking in urban centers (Turner and Penn 2002) or commuting distance via transportation networks (Weber 2003).
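A threshold of this kind can be sketched as a great-circle distance test. The example below is a minimal illustration, assuming a spherical Earth with a mean radius of 6,371 km and the haversine formula; this section does not state how distances were computed in the study, so the formula, the function names, and the sample coordinates are assumptions made for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def is_near(author, article, threshold_km=2000.0):
    """True if the author is within the threshold of the article location.

    Both arguments are (latitude, longitude) tuples in degrees. The default
    threshold mirrors the 2,000 km figure discussed in the text.
    """
    return haversine_km(*author, *article) < threshold_km


if __name__ == "__main__":
    # Approximate, illustrative coordinates.
    london = (51.5074, -0.1278)
    paris = (48.8566, 2.3522)
    new_york = (40.7128, -74.0060)
    print(is_near(london, paris))      # a few hundred kilometers apart
    print(is_near(new_york, paris))    # several thousand kilometers apart
```

Lowering `threshold_km` to a commuting or walking distance would express the much smaller scales of nearness used in the urban studies cited above.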

Finally, the notion of collective action through new media is at the core of VGI. VGI and the related phenomena of neogeography expand the notion of the “public” from prior work in public participation GIS to include much larger, distributed civic participation (Elwood 2008; Hall et al. 2010; Sieber 2006; Sui 2008).

5.3 Methodological Factors

What methodological advancements are required for future research? The high-volume, highly distributed, real-time, and social nature of VGI is inherently difficult to analyze with simple computational methods. Rather, as shown in our research, significant computational resources and data-mining methods are better suited for empirical studies of VGI. Data-mining methods with a resolution at subarticle levels, such as sections or paragraphs, would increase the sample size. Also, geographic and network visualization methods may enable a visual analytics approach to studying VGI.

In the coming years, the wiki-based model of VGI, in which the provenance of information is transparent, may no longer apply as VGI becomes more ephemeral and social. Specifically, one of the key methodological challenges will be to cope effectively with the data deluge in an environment where data are filtered through social networks (Watts et al. 2002). If information primitives become based on distance or connectivity through fluctuating social networks, then traditional information science methodologies will not be applicable at large scales. Social network methodologies, which are based on graph theory, are now being used to study online collaborative environments, such as massively multiplayer gaming (Szell and Thurner 2010) and blogging (Liben-Nowell et al. 2005).

6 Conclusion

Although the underlying technologies of online geographic services have been in development for many years, the behavioral impacts of VGI production are largely unknown. These services require large-scale data interoperability and collaboration, neither of which has a purely technical solution. VGI production will likely create new knowledge politics, and many of the problematic emerging issues are institutional and sociobehavioral in nature, not technological (Elwood 2008, 2010; Goodchild 2008). For example, the capacity of a ubiquitous Internet to reduce communication costs has raised questions of whether geographic distance matters in information and economic production (Cairncross 1997; Castells 2010).

This chapter addresses two basic questions in VGI production, namely, (1) how individuals contribute place-based information to a digital commons and (2) what the authorship dynamics of such collective effort are. Our approach takes a user-centric perspective of spatial behavior in VGI production. Research on VGI production is a nascent area with many unexplored avenues across architectural, social, and methodological factors. These factors form the basis of a research agenda that asks (a) how to improve the structure and quality of essential geographic metadata, (b) how language and demographics affect VGI, and (c) how social networks change the nature of VGI.