1 Introduction

The movement of many human interactions to the internet has led to massive volumes of text that contain high-value information for social scientists. For example, online illicit sex markets have yielded tens of millions of sex provider advertisements and over one million customer reviews of those providers. These online texts describe prices, locations, personal characteristics, preferences about the commercial encounter, and other information that is useful for social science but otherwise difficult to obtain at a large scale.Footnote 1 Unfortunately, these texts are written for individual human readers, not for analysis: they are casually written and contain nonstandard usage and slang. To date, preparing such data for statistical analysis has required substantial human annotation, with the concomitant expense and necessary reduction in data size.

Recent advances in computer science enable macroscopic analysis of such data at finer resolution than previously possible by extracting high-quality structured analysis-ready information from text and images with minimal human annotation. In this paper we employ the DeepDive system to create a structured database of facts recovered from human-written source texts in the online illicit sex market. DeepDive uses large-scale probabilistic inference in a user-enabled feedback loop, thereby avoiding problems common to most standard annotation techniques, such as reliance on brittle rules or fixed grammars. In a number of applications, DeepDive obtains accuracy that is similar to that of a human annotator (Callaway 2015; Peters et al. 2014). We are thus able to obtain a very large and high-quality database about subtle concepts, derived from an extremely messy collection of documents (Appendix Table A1 provides precision/recall figures).

Data on this market are relatively easy to access in small quantities because much of the activity happens in public web fora (e.g. www.backpage.com). The postings on these sites are text advertisements written in informal English, akin to classified ads, often accompanied by images. As with classified ads, there is informal broad agreement about the kind of information to provide (prices, locations, etc) but diversity in both mode of expression (slang, colloquial usage, nonstandard usage) and in exactly which data values are provided. Figure 1 displays a full example of an online advertisement as well as two examples of ad text to illustrate the linguistic challenges in this space. Our text collection contains information from almost 30 million text ads for sex services, scraped from 19 distinct websites between early-2011 and January 2016 by IST Research.

Fig. 1

Sample ad from backpage.com

Little is known about the market for sex services, but there are many reasons to want to know more. The sex services market is a two-sided market where buyers aim to connect with providers of sex services and providers wish to offer their services. Given the illegal nature of the services, both the service provider and the customer (the "John") are concerned with the possibility that they could be matched with a law enforcement agent. Additionally, service providers face a risk of encountering a violent John.Footnote 2 In fact, small-scale surveys have found that as many as two in three sex service providers have been assaulted by customers or pimps (Weitzer 1999).

Sex services were traditionally solicited in outdoor spaces, which resulted in the creation of red-light districts (Hubbard and Sanders 2003). In the presence of mounting social pressure and threat of arrest, sex service workers were largely relegated to specific locations in urban areas where illegal activities were more tolerated. However, the introduction of the internet fundamentally changed the nature of the initial interaction between clients (the demand side of the market) and service providers (the supply side). Instead of face-to-face interactions, the initial interactions between potential clients and service providers began when a client responded to an advertisement, which offered sex services. The movement to arranging services online versus through face-to-face interactions is thought to result in more safety for service providers (Bass 2015), as service providers can screen potentially violent clients.Footnote 3 Additionally, the movement to online advertisements enabled service providers to coordinate appointments and have more control over the location where services are to be performed, rather than waiting at specific outdoor locations or propositioning potential clients in public locations (e.g. bars, casinos). So, service providers can now determine whether they are willing to travel to a potential client’s location (outcall), whether the client must come to the location of the service provider’s choosing (incall) or if the service provider can accommodate either situation. Despite these differences in the arrangement of services, payment for services has largely remained unchanged. Cash is still the currency of such transactions, which is most often paid upon completion of services. Violence against the service provider is thought to be most likely to occur at the point of payment, creating a market for additional security.Footnote 4

While product differentiation existed in this market prior to online platforms, it was more difficult for such differentiation to occur. Service providers could place advertisements in alternative weeklies (e.g. The Village Voice) or they could work with an escort agency, which would coordinate service providers and potential clients for a fee. The movement to online advertisements enabled sex service providers to become more entrepreneurial and independent, enabling them to keep a greater share of the proceeds from the services that they offered. The movement online also enabled further horizontal and vertical differentiation of services. Vertical product differentiation was enabled, as idiosyncratic preferences for services (e.g. massage, erotic massage, escort, BDSM, “girlfriend experience”, etc.) could be catered to and advertised across providers who are willing to perform such acts. Horizontal product differentiation was also further enabled, as the costs of advertising and searching for one’s ideal variety were both reduced through online platforms. Moreover, advertisements could link to review web sites that enabled a client to see testimonials of the service provider’s quality. Thus, a service provider could generate a reputation, which enabled the service provider to command higher rates for services performed (Cunningham and Kendall 2016).

The movement of sex service advertisement online also resulted in significantly new knowledge about the market for sex services, as measuring the supply of these services was nearly impossible prior to its movement online. Although the measure is crude, we can now count the number of sex service advertisements by city on a given day. Table 10.4 of Cunningham and Kendall (2011c), for example, reports that the average daily number of sex service advertisements on one platform per MSA population ranges from 0.36 (Cleveland) to 18.34 (San Francisco) across 31 of the largest municipalities in the US. Unfortunately, comparably robust measures of the demand for sex services do not exist. In one of the only studies that attempts to estimate this demand, Roe-Sepowitz et al. (2016) examine 15 large municipalities in the US. On average, the study finds that 1 out of every 20 males over the age of 18 in these jurisdictions was soliciting online sex ads.

From an academic perspective, the online market for sex is representative of the broader class of markets in which regulation and contract enforcement are decentralized because the underlying activity is illegal.Footnote 5 From a policy perspective, an increasingly large share of commercial sex transactions are coordinated through online markets (Cunningham and Kendall 2011b). The emergence of robust online markets for sex has been associated with a range of social ills including child prostitution (Hughes 2002; Mitchell et al. 2010), human trafficking (Latonero et al. 2011), and a drop in the average age of prostitutes (Cunningham and Kendall 2011b). At the same time, these markets may reduce transaction costs in the market for sex and enable better use of reputational mechanisms, both of which can be welfare enhancing for buyers and sellers (Cunningham and Kendall 2011b). Cunningham et al. (2018) also note that the introduction of online sex service clearinghouses (namely, the erotic services section of www.craigslist.com) significantly reduced female homicides. Using the data from these markets to better understand the commercial sex trade therefore has great potential.

Our analysis makes a concrete methodological contribution as well. Because only some online fora are well-structured, and text ads have nonstandard content that is difficult for traditional natural language processing (NLP) methods, past sex market researchers have used relatively small amounts of data.Footnote 6

As a result of these relatively small data sizes, relevant statistics must be aggregated into coarse geographic or temporal regions in order to be statistically useful. For example, a traditional small nationwide sample might yield only a few advertisements from a given city, forcing the analyst to aggregate advertisements at a state level in order to retain a minimal number of counts in each aggregated group. In contrast, our extracted database is significantly larger than even the largest previous effort. We extract price/location tuples for 4.5M ads, of which 2.1M occurred in locations for which we have the full set of covariates.Footnote 7 Elements in this large and high-accuracy dataset do not have to be aggregated into very coarse groupings in order to retain statistical validity: the data is higher-resolution than past efforts. For example, there may no longer be any need to aggregate advertisements to the state level; many cities will retain sufficient counts. This high resolution data enables us to control for local-level variation in contextual factors (such as local wages or commute times) that would have been impossible with data aggregated at coarser levels.

Using these unique data we find that pricing in the market is broadly rational from an economic perspective. This is not surprising; previous survey-based research has shown that prostitutes charge a premium for risky behaviors and that the size of the premium is greater for more attractive sex workers (Gertler et al. 2005). Exploiting within-period/within-city variation in the pricing structure across different service venues shows that services performed at a location of the buyer’s choosing (so-called ‘outcall’) earn an estimated 18% price premium, approximately $23 more for an hour-long session, controlling for a wide range of factors.

There could be several reasons for this premium. On the supply side, allowing the buyer to choose the location entails both additional physical risk and additional travel time. On the demand side, customers may be willing to pay a premium to reduce their risk and travel time. To assess the magnitude of these sources of variation in pricing we compare how prices vary as the physical size of the MSA for which services are offered expands and as the rate of violent crime in an area changes. Critically, the difference in those correlations across incall services, i.e. service at the provider’s chosen location, and outcall services, i.e. services at the customer’s chosen location, provides a way to sort out supply from demand elasticities. We find that prices for incall services are uncorrelated with MSA size and violent crime rates once some basic controls are added. Prices for outcall services, however, are strongly positively correlated with MSA size, though they are not correlated with violent crime rates once MSA fixed effects are accounted for.

These results are consistent with the incall and outcall markets being relatively segmented. If incall and outcall were one market, then we should see prices moving in opposite directions regardless of whether supply is elastic, demand is elastic, or both. That is, women living in larger areas who do not want to travel should compensate men for coming to them by offering lower prices, and they should charge more for travel. That we primarily observe movement in the outcall market across city size suggests that both supply and demand are fairly inelastic with respect to distance in the incall market but not in the outcall market.

The magnitude of the increase for outcall as city size increases indicates a price for providers’ travel time of $36 per extra hour of average commute time in a city, much smaller than the $151/hour mean price for incall time with a client, but much larger than the average female wage of $14/hour in our sample.Footnote 8 That difference is consistent with workers in this market demanding substantial compensation to make up for the distastefulness of time spent with clients, a compensating differential very large compared to the differentials that are easily measured.Footnote 9

These results have a number of policy implications. Most importantly, improved labor market opportunities for women appear to change the composition of suppliers in the market but do not necessarily reduce the volume of activity, at least as proxied by ad postings. Secondly, the large difference between the compensation required for travel and that for time spent with clients implies that many workers in the market would happily shift to other activities given the opportunity. Finally, with regard to some risks associated with prostitution (i.e. violence), it appears that sex workers advertising online may have sufficient market power to demand compensation for those risks in the outcall market, implying that the supply of workers in that market is inelastic.

The remainder of this paper proceeds as follows. Section 2 provides background on the online market for sex services. Section 3 outlines the technological innovations that enable this research. Section 4 briefly introduces the data. Section 5 analyzes the relationship between pricing and social conditions, including economic opportunities for potential providers. Section 6 concludes.

2 Background

This section provides basic background information on the online market for real-world sex.

2.1 Online ads for sex

Online advertisements for sex contain a wide variety of information, presumably whatever the seller deems necessary to drive demand for their services. The style of ads varies by website: some contain explicit language (e.g., “fetish friendly”, specific sex acts, or clearly stated prices per hour), while others use more veiled language that almost implies dating (e.g., “Gf services always offered :)”). Ad postings include varying levels of information including age, ethnicity, height, weight, build, hair and eye color, and measurements. Some advertisements include links to the provider’s reviews by their past clients. Almost all advertisements include images, most of which are sexually suggestive or explicit, and phone numbers that allow clients to contact the provider. Sometimes, ads include guidelines for conduct on the phone; examples include ‘no texting’ and ‘no foul language.’ The market is clearly segregated by provider gender and we focus on ads for services by women.

2.2 Ad sources

Online ad content in this market comes from three distinct sources. First, there are individual providers who post ads representing themselves and pay on a per-ad basis. Second, there are content aggregators which repost content from backpage.com or other sites (see, e.g., Escortphonelist.com) and make money by selling ad space on their websites. Third, there are spammers who post ads in many locations seeking to drive traffic to other websites on which they sell space to advertisers or goods to those who click through (e.g. ‘click here to see my pics’ or redirects to other websites requiring payment for services).Footnote 10 In our analysis we focus on the first type by filtering out duplicate ads and by restricting the sample to ads whose contact information is not reused too frequently.
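The filtering step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for the paper: the field names (`text`, `phone`), the whitespace and case normalization used to detect duplicates, and the `max_ads_per_phone` cutoff are all assumptions made for the example.

```python
from collections import Counter

def filter_ads(ads, max_ads_per_phone=2):
    """Drop duplicate ads, then drop ads whose phone number is reused too often.

    `ads` is a list of dicts with hypothetical keys "id", "text", and "phone".
    `max_ads_per_phone` is an illustrative cutoff, not a value from the paper.
    """
    seen_texts = set()
    unique_ads = []
    for ad in ads:
        # Treat ads as duplicates if they match after lowercasing and
        # collapsing whitespace (an assumed, simple notion of "duplicate").
        key = " ".join(ad["text"].lower().split())
        if key not in seen_texts:
            seen_texts.add(key)
            unique_ads.append(ad)

    # Count how many distinct ads share each contact number.
    phone_counts = Counter(ad["phone"] for ad in unique_ads)
    return [ad for ad in unique_ads if phone_counts[ad["phone"]] <= max_ads_per_phone]

ads = [
    {"id": 1, "text": "New in town, call me", "phone": "555-0101"},
    {"id": 2, "text": "new in town,  CALL me", "phone": "555-0101"},  # repost of ad 1
    {"id": 3, "text": "Click here to see my pics", "phone": "555-0199"},
    {"id": 4, "text": "Click here for more pics", "phone": "555-0199"},
    {"id": 5, "text": "Click now to see pics", "phone": "555-0199"},
    {"id": 6, "text": "Available tonight", "phone": "555-0142"},
]
kept_ids = [ad["id"] for ad in filter_ads(ads)]
print(kept_ids)
```

In this toy input, the repost is removed as a duplicate, and the three ads sharing one phone number are dropped as likely spam, leaving the two ads from individual providers.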

2.3 Differences between sites and users

Content on the 19 sites scraped for this analysis is as varied as in any other online market along two dimensions. First, there is site-level variation. Some sites are very formal and employ a standardized format (e.g. theeroticreview.com), while others allow ad hoc posts similar to craigslist.com. The largest website in our sample, backpage.com, had almost 16.8M ads posted between 2013 and early 2016, while the smallest, myredbook.com, had only 35,400 during the same period. The share of ads with prices ranged from 2.1% on several sites to 80.0% of the 35,400 ads scraped from myredbook.com. Most websites covered a wide geographic area, more than 200 unique MSAs per site, but the share of ads with location information varied widely. On average 68% of the ads had extractable location information, but the rate by website varied from a high of 87% for cityvibe.com to a low of only 20% for myproviderguide.com.

Second, there is variation in ad content due to the decisions of providers about how much to disclose. Prior research shows that providers who offer more information command higher prices (Logan and Shah 2013). Including more information also entails greater risks from law enforcement. The more potentially identifying information a provider offers the easier it is for law enforcement to track them down. While this may be a minor concern for independent voluntary providers, those who are underage or operating in jurisdictions where police pursue prostitution face a clear tradeoff.

3 Technology

High quality extraction from free-form text is challenging because of the massive amount of linguistic variation possible. Standard text processing approaches such as regular expressions are quite brittle and dependent on small changes in the source text. NLP methods are often effective at discovering linguistic information about the text (e.g., parse trees), but do not alone solve the extraction challenge.

DeepDive (Zhang 2015) is a system for extracting relational databases from unstructured text. It is distinctive when compared to previous information extraction systems in its ability to obtain very high quality databases for a reasonable engineering cost.Footnote 11 The system ingests raw documents and emits a structured database.

The most important component of DeepDive is a novel developer framework that allows an engineer to reliably and rapidly improve extracted data quality, until the output database is as good as the downstream application requires. Internally, DeepDive includes a high-performance engine for statistical inference and learning, allowing it to handle data that is noisy and imprecise. It is enabled by a number of recent innovations in scalable machine learning and data management (Shin et al. 2015; Zhang and Re 2014; Recht et al. 2011). Figure 2 outlines the overall DeepDive processing pipeline.

Fig. 2

Processing pipeline for sex ads

As described in detail in Zhang (2015), the system entails a three-phase process. For each phase, the engineer writes a short piece of program code, usually in Python.

  • The candidate generator is an engineer-written function that is applied to each input document and yields candidate extractions. The goal of the candidate generator is to eliminate “obviously” wrong outputs (e.g., non-numeric prices). Its output should be high-recall, low-precision.

  • The extraction features are user-defined functions that are applied to each candidate emitted in the previous step. An extraction feature is intended to encode a user-understandable piece of evidence about each candidate, useful when deciding whether the candidate is a correct extraction or not. For example, does a candidate for price have a $ symbol to its left? Obtaining high accuracy often means using many high-quality extraction features. Unlike some statistical frameworks in which features are largely or entirely synthetic, all DeepDive features are designed to be comprehensible by humans to permit manual debugging.

  • The distant supervision rules provide a positive or negative label to some of the feature-enriched candidates. For example, there might be extremely unambiguous prices that can be safely labeled as correct extractions. Alternatively, 9-digit zip codes that begin with a non-zero are unambiguously not prices for most applications, and can be safely labeled as incorrect price extractions. Despite the inevitable labeling errors that such rules introduce, we have found this approach to be preferable to the time-consuming process of manually providing labels.
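The three engineer-written pieces described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not DeepDive's actual API: the function names, the candidate dictionary layout, the feature names, and the sample ad text are all hypothetical, and a real application would use many more features and rules.

```python
import re

# Hypothetical pattern: any run of digits, optionally preceded by "$".
NUMBER_RE = re.compile(r"\$?\d+")

def generate_candidates(text):
    """Candidate generator: keep every numeric token (high recall, low precision)."""
    return [
        {"raw": m.group(), "value": int(m.group().lstrip("$")),
         "start": m.start(), "end": m.end(), "text": text}
        for m in NUMBER_RE.finditer(text)
    ]

def extract_features(cand):
    """Extraction features: human-readable evidence about one candidate."""
    right = cand["text"][cand["end"]:cand["end"] + 15].lower()
    feats = set()
    if cand["raw"].startswith("$"):
        feats.add("dollar_sign_left")
    if re.search(r"\b(hr|hour)\b", right):
        feats.add("time_unit_right")
    return feats

def distant_label(cand, feats):
    """Distant supervision: True/False for unambiguous cases, None if unlabeled."""
    digits = cand["raw"].lstrip("$")
    if len(digits) == 9 and digits[0] != "0":
        return False  # looks like a 9-digit zip code, not a price
    if {"dollar_sign_left", "time_unit_right"} <= feats:
        return True   # e.g. "$150 hr" is treated as an unambiguous price
    return None       # left unlabeled; inference decides later

ad_text = "Sweet and discreet $150 hr incall, 80 special, zip 941031234"
candidates = generate_candidates(ad_text)
features = [extract_features(c) for c in candidates]
labels = [distant_label(c, f) for c, f in zip(candidates, features)]
print(list(zip([c["raw"] for c in candidates], labels)))
```

Here the candidate "80" receives no label from the rules: it is neither clearly a price nor clearly a non-price, so its status is left to the statistical inference step.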

After applying the above three steps, DeepDive constructs a large factor graph model that creates a random boolean variable for each candidate. The system infers a probability for each candidate, then applies a threshold to each inferred probability to determine whether the extraction will be placed in the output database (e.g., a given candidate is determined to be a one-hour price if the inferred probability exceeds 0.75). In this paper, DeepDive ingests a corpus of unstructured sex ads, then produces a high-quality output database which is used for all of our social science analysis.
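The final thresholding step can be illustrated with a short sketch. The probabilities below are made-up stand-ins for the marginals that the factor graph inference would produce, and `select_extractions` is a hypothetical helper; only the 0.75 cutoff comes from the example in the text.

```python
def select_extractions(marginals, threshold=0.75):
    """Keep candidates whose inferred probability strictly exceeds the threshold."""
    return [cand for cand, prob in marginals if prob > threshold]

# Hypothetical (candidate, inferred probability) pairs for one-hour prices.
marginals = [
    ("$150 hr", 0.97),
    ("80 special", 0.52),
    ("941031234", 0.01),
    ("200 roses", 0.75),  # exactly at the cutoff, so it does not exceed it
]
accepted = select_extractions(marginals)
print(accepted)
```

Raising the threshold trades recall for precision in the output database, which is one of the knobs an engineer can adjust to meet the quality the downstream analysis requires.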

4 Data

We analyze two unique data sets in our analysis. The first data set is derived from content in nearly 30 million online ads for sex across 19 different websites. The scrape was conducted as part of the Defense Advanced Research Projects Agency’s (DARPA) Memex project by IST Research between early 2011 and January 2016. 51.2% of the data come from www.backpage.com, another 10.0% were scraped off sites which repost ads from www.backpage.com and other sources (including www.escortphonelist.com, www.escortsincollege.com, www.escortads.com, www.massagetroll.com, and www.escortsintheus.com), and the remainder come from various sections of www.craigslist.com which still hosted escort ads at the time of data collection (4.1%) as well as smaller focused sites such as www.cityxguide.com (2.9%), www.myproviderguide.com (2.3%), and a number of smaller sites.

The second data set contains information from approximately 1.1 million online reviews of sex services recorded at the web site www.theeroticreview.com, which was, until recently, the largest website hosting reviews of sex services. These data were scraped in March 2016. Since the website archives old content the reviews are from as early as 1998 through March 2016.

We describe each of these data sets in detail below.

4.1 Advertisement data

Each ad consists of a free-form text field, and — depending on the site — additional structured fields such as post date or location. When available, these fields are scraped using the HTML structure of the site and merged with other extractions. Most content, however, is only available in the free-form text. Scraped ads which had a specific service location were mapped to a Census Bureau Metropolitan Statistical Area (MSA), the smallest geographic unit for which reliable labor force data are available in time-series across the United States.

While the initial scrape contains 29.9 million online ads, our final data set is the subset of the initial data for which we could extract information on the full range of relevant covariates. Whether an ad is for incall, outcall or both services is extracted from all ads. Posting date information could not be accurately extracted from approximately 5 million ads. Location was unclear in another 13 million ads (i.e. the scraped text did not specify the locations for which services were offered). Prices charged by providers were missing or unclearly stated in another 8 million ads. Finally, 1.8 million ads with location information were for small towns that could not fit clearly within one of the Census MSA locations. Our final estimation data contain all 2.1M ads for which we have information on the full set of covariates identified below, do not represent ads for providers working in massage parlors, and are not posted at unrealistically high rates.

Table 1 provides summary statistics of the ad-level data for all ads that did not appear to be spam (Panel A), for the 2.1M ads that could be linked to an MSA for which we have the full set of covariates (Panel B), as well as for the MSA/month (Panel C) and MSA-level (Panel D) covariates.

Table 1 Summary statistics - Ad data and MSAs

4.2 Review data

Review data from www.theeroticreview.com (TER) are broken into two components. First, each service provider has a top page where specific structured text about the provider can be obtained. For example, the provider’s age range, hair color, email, phone number, preferences for performing services at a location of their choice or the John’s choice, average performance rating, and appearance statistics can be found on this page. Second, on separate subpages the specific reviews from each John that reviewed the provider can be accessed. Each review contains a series of additional information about the provider, including appearance and performance ratings that are specific to that review, whether the service was an incall or outcall, type of intercourse, and precautions taken by the provider such as the use of a screening agency.

Table 2 provides summary statistics for the 1.1 million reviews from TER. Figure 3 shows the average incall price per hour for each of the 82 most populous MSAs in the United States. Circles are sized by ads per capita. Darker shading indicates higher median ad prices within the MSA.

Table 2 Summary statistics - reviews
Fig. 3

Map of online ads for sex services in the 82 largest MSAs throughout the United States. Circles sized by ads per capita. Darker shading indicates higher median prices

4.3 Control variables

To analyze the relationship between prices in the ads and other factors we include a broad range of MSA-level covariates from the following sources:

  • Opportunity costs. We obtain measures of unemployment from the American Community Survey (ACS) data files for each MSA at an annual level. To assess travel time providers of outcall services can expect, we generate average commute (which is asked as a categorical variable) by weighting the midpoint of response times for each bucket of respondents by the proportion of respondents for that bucket.

  • Wages. To control for economic opportunity we generate a Bartik-style wage instrument that isolates variation in the MSA-level wage time-series due to national level industry trends (Bartik 1991). To generate our Bartik wages we compute average wages by gender, industry, and MSA using the 2000 census data. We then computed the share of employment by industry and MSA for each month of the sample period using the Bureau of Labor Statistics Quarterly Census of Employment and Wages (QCEW). We calculate instrumented wages by assuming that each job pays the estimated wage for the given industry and MSA from 2000. Because QCEW is a census of wages based on unemployment insurance reporting, we get a wage estimate that varies at the MSA month level. As a robustness check we also include data on monthly median rental prices.Footnote 12

  • Law enforcement risk. We utilize data from the Law Enforcement Management and Administrative Statistics (LEMAS) databases to determine the number of full time law enforcement employees in each MSA.

  • Abuse risk. We use data from the Uniform Crime Reports (UCR) to determine the total number of violent crimes per year in a given MSA as a proxy for abuse risk. As a placebo test that the coefficient is not just proxying for overall crime, we also use the rate of property crime per capita. These data are both cross-validated with information in the National Incident Based Reporting System (NIBRS).
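Two of the constructed covariates above, the average commute time and the Bartik-style wage, are simple weighted averages, and can be sketched as below. The bucket boundaries, industry names, wages, and employment counts are invented for illustration, and the function names are hypothetical. The sketch also ignores details the text does not specify, such as how an open-ended top commute bucket is handled or how wages are split by gender.

```python
def average_commute(bucket_shares):
    """Weight each commute bucket's midpoint by its share of respondents.

    `bucket_shares` is a list of ((low_minutes, high_minutes), share) pairs.
    Shares are renormalized in case they do not sum exactly to one.
    """
    total_share = sum(share for _, share in bucket_shares)
    weighted = sum(((low + high) / 2) * share for (low, high), share in bucket_shares)
    return weighted / total_share

def bartik_wage(base_year_wages, employment):
    """Employment-share-weighted average of fixed base-year industry wages.

    `base_year_wages` maps industry -> average wage in the MSA in the base year.
    `employment` maps industry -> number of jobs in the MSA in a given month.
    Only the employment shares move over time; the wages stay at base-year levels.
    """
    total_jobs = sum(employment.values())
    return sum(base_year_wages[ind] * jobs / total_jobs for ind, jobs in employment.items())

# Illustrative commute buckets (minutes) and respondent shares for one MSA.
commute = average_commute([((0, 10), 0.2), ((10, 20), 0.3), ((20, 30), 0.3), ((30, 60), 0.2)])

# Illustrative base-year wages and two months of employment for one MSA.
wages_2000 = {"retail": 10.0, "health": 20.0}
wage_month_1 = bartik_wage(wages_2000, {"retail": 600, "health": 400})
wage_month_2 = bartik_wage(wages_2000, {"retail": 500, "health": 500})
print(commute, wage_month_1, wage_month_2)
```

In the toy numbers, the predicted wage rises between the two months only because employment shifts toward the higher-wage industry, which is the source of variation the instrument is meant to isolate.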

5 Results

Our core results focus on understanding how risk and labor market opportunities interact. We use straightforward multivariate regression to identify the conditional correlations between various risk factors and pricing.

5.1 Crime, travel time, and pricing

Different service locations expose sex workers to different risks (Spice 2007; Maticka-Tyndale et al. 1999; Taylor 2003). Workers in massage parlors are generally considered the safest from abusive customers. Workers who set the service location, i.e. “incall” are the next safest as they can and often do make sure the service takes place in a location with security. Workers offering services in a location of the customer’s choosing, i.e. “outcall,” are the most exposed to risk of abuse by customers or law enforcement stings.

Consistent with workers requiring compensation for risk, there is clear variation in prices by service location. Figure 4 clearly highlights this fact, plotting the distribution of prices for all 2.1M ads with prices which specify a service venue, do not appear to be spam, and are linked to an MSA for which we have the full set of covariates.Footnote 13 Mean and median prices are drastically lower for massage parlor ads than for other service locations. Within other venues both mean and median prices are statistically significantly lower among incall providers than outcall providers, but there is substantial overlap in the price distributions.

Fig. 4

One-hour price distribution by service venue. Prices imputed for ads that do not advertise a one-hour rate

Of course, different service locations also require providers to engage in different levels of travel. In particular, providers offering outcall may have to spend more time traveling than those who provide service at a venue of their choosing (incall), and buyers face the opposite costs. We therefore decompose the price premium into two components, physical risk and travel time, by estimating how the relationship between pricing and service venue varies across a proxy for the physical size of the service area, average commute time in the MSA as measured by the American Community Survey.

Specifically, we estimate the following regression at the ad level, dropping all ads by providers who average more than 6 ads per day to remove spam from the sample:

$$ \begin{aligned} P_{i,j,m} ={}& \alpha + \beta_{1} \text{outcall}_{i,j} + \beta_{2} \text{both}_{i,j} + \beta_{3} \text{unclear}_{i,j} + \beta_{4} C_{j} + \beta_{5} (\text{outcall}_{i,j} \times C_{j})\\ &+ \beta_{6} (\text{both}_{i,j} \times C_{j}) + \beta_{7} (\text{unclear}_{i,j} \times C_{j}) + \tau_{m} + \mathbf{\Delta} X_{j,m} + w_{j} + \epsilon_{i}. \end{aligned} \tag{1} $$

Here ads are indexed by $i$, and each occurs in an MSA $j$ and month $m$. $P_{i,j,m}$ is the price for an ad, and the first three variables in the regression are indicators for whether the ad lists outcall as a location, offers both incall and outcall, or is unclear as to the service venue. $C_{j}$ is the average commute time in the MSA. $\beta_{1}$ captures the price premium for outcall over incall in an MSA with zero commuting time, $\beta_{2}$ does so for ads offering both outcall and incall, and $\beta_{3}$ does so for ads which are unclear as to the service venue. $\beta_{4}$ reports how much the price for incall varies as the average commute time in the MSA increases. $\beta_{5}$ through $\beta_{7}$ report how that relationship changes for the other service venues. We include two fixed effects in all regressions: $\tau_{m}$ is a month fixed effect to capture any nationwide secular trends in pricing for sex services in online ads, and $w_{j}$ is a website fixed effect to account for consistent differences in pricing across sites. $X_{j,m}$ is a vector of MSA/month-level traits such as unemployment rate, number of ads posted, and number of unique providers posting ads, as well as MSA-level variables such as population and racial composition, all of which might be correlated with pricing. We cluster standard errors at the MSA level in all regressions to allow for within-locality correlations in the errors when assessing statistical significance.
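To make the interpretation of the interaction terms concrete, the following is read directly off Eq. (1), assuming incall-only ads are the omitted category and holding the fixed effects and covariates constant. The expected outcall premium over incall in an MSA with average commute time $C_{j}$ is

$$ \mathbb{E}[P \mid \text{outcall}, C_{j}] - \mathbb{E}[P \mid \text{incall}, C_{j}] = \beta_{1} + \beta_{5} C_{j}, $$

and the price gradients with respect to commute time are

$$ \frac{\partial\, \mathbb{E}[P \mid \text{incall}]}{\partial C_{j}} = \beta_{4}, \qquad \frac{\partial\, \mathbb{E}[P \mid \text{outcall}]}{\partial C_{j}} = \beta_{4} + \beta_{5}. $$

Under this reading, $\beta_{1}$ is the part of the outcall premium that does not scale with commute time, while $\beta_{5} C_{j}$ is the part that does.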

By looking at how much of the price premium for outcall comes from travel costs at different levels of the commute time variable, we can assess the relative importance of risk vs. average commute time in pricing. The variation in pricing reflects potential differences in risks and differences in travel costs for both buyers (men) and sellers (women). To be clear, commute time reported in surveys is only a proxy for travel time for the average outcall or incall event; we use it because there is no reliable way to measure the average travel time for an outcall service event. The relationship between the estimated coefficient in Table 3 and the value of time to providers (the actual quantity of interest) depends on the ratio of commute time to outcall travel time. If that ratio is substantially greater than 1, then we understate the value of time and therefore overstate the risk premium providers demand for outcall appointments.Footnote 14

Table 3 Commute time and prices in the online market for sex

As Table 3 shows quite clearly, the majority of the outcall premium comes from travel time. Column 1 shows the simple differences in price across service venues: outcall services command a $24 per act premium, roughly 17% of the median incall price of $140. Column 2 adds a control for travel time and various MSA/month- and MSA-level covariates, showing that commute time is positively correlated with pricing and that the MSA-level covariates do not affect the price premiums. Column 3 adds the full set of interaction terms to see how incall prices and the outcall premium vary with MSA size. Incall prices are positively but statistically insignificantly correlated with average commute time.Footnote 15 However, each additional minute of average commute time predicts a statistically significant increase of $0.53 in the price of outcall services. At the 50th percentile of commute time, roughly 28 minutes, the outcall premium is $22.50. The estimates on the interaction terms are quite robust. They change little when we add an MSA fixed effect (column 4) to account for all time-invariant characteristics of the MSAs, or when we include Bartik (1991) instruments for male and female wages in the licit market as controls along with their interaction with commute time (column 5) to account for any correlations driven by secular trends in local labor markets that are not captured by controlling for unemployment rates.Footnote 16

So what do these prices reveal about the components of pricing? In column 3, our preferred specification because it allows us to directly estimate the role of commute time, we examine the nonlinear effects of commute time on prices. At the 1st percentile of commute time, roughly 20 minutes, the outcall price premium is $18.30, of which roughly $11 comes from travel time. At the 50th percentile of commute time, roughly 28 minutes, the outcall premium is $22.50. And at the 99th percentile of commute time, roughly 38 minutes, the outcall premium is roughly $28. If we treat the distance-invariant outcall premium in column 3 as the risk component, then the additional risk from performing these services at the buyer’s chosen location accounts for roughly 34% of the outcall premium in the median-sized MSA.Footnote 17 Thus, travel time appears to be the main driver of outcall pricing, but there is another component.
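
The decomposition at these percentiles can be reproduced approximately with a few lines of arithmetic. The sketch below is illustrative only: the $0.53 per-minute slope is the figure reported in the text, but the commute-invariant component is not quoted directly, so it is backed out here from the reported 1st-percentile premium ($18.3 at 20 minutes). The function and variable names are hypothetical.

```python
SLOPE_PER_MINUTE = 0.53  # reported increase in outcall price per minute of commute time

# Assumption: back out the commute-invariant premium from the 1st-percentile figures.
INVARIANT_PREMIUM = 18.3 - SLOPE_PER_MINUTE * 20  # about 7.7 dollars


def outcall_premium(commute_minutes):
    """Return (total premium, travel component, invariant share of the total)."""
    travel = SLOPE_PER_MINUTE * commute_minutes
    total = INVARIANT_PREMIUM + travel
    return total, travel, INVARIANT_PREMIUM / total


for label, minutes in [("1st", 20), ("50th", 28), ("99th", 38)]:
    total, travel, share = outcall_premium(minutes)
    print(
        f"{label} percentile ({minutes} min): premium ${total:.1f}, "
        f"travel ${travel:.1f}, invariant share {share:.0%}"
    )
```

Under these assumptions the premium at a 28-minute commute comes out near $22.5 with an invariant share of about 34%, in line with the figures quoted above. The exact values depend on the unrounded coefficients in Table 3.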

To further assess whether risk is indeed driving price we estimate a series of regressions adding in interactions of service venue with measures of violent crime rates, which we believe proxy for risk to providers and buyers, as well as property crime rates, which are arguably less correlated with risk to providers. As before we show the regression with controls and with MSA fixed-effects and cluster standard errors at the MSA level. If risk is driving a large share of the outcall premium and if incall providers can do more to shield themselves from abuse risk, then we should see that outcall prices are positively correlated with violent crime rates, but incall prices are not. We present the regressions with the interactions of the different types of crime rates with service venues separately (columns 1 and 2 without MSA fixed-effects and columns 4 and 5 with MSA fixed-effects) and jointly (columns 3 and 6) so that we are estimating the correlations with violent crime rates conditional on property crime rates and vice versa.

Table 4 shows that prices for outcall services and for services offered in either venue are positively correlated with violent crime rates when controlling for property crime rates (column 3), but the correlation is not robust to controlling for MSA fixed effects (columns 4 and 6). Conditional on violent crime rates, property crime rates appear to be negatively correlated with pricing for outcall services (columns 3 and 6). The magnitude of the relationship is modest. A one standard deviation increase in the number of violent crimes (roughly 310 per 100,000 people per year) predicts a $4.23 (95% CI of $0.67 to $7.80) increase in the outcall premium (using the estimates from column 3). This represents a 0.064 standardized treatment effect and a 2.8% increase relative to the mean incall price of $151/hour. There is no similar positive relationship with property crime rates.
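
The magnitudes quoted above can be checked with simple arithmetic. The sketch below uses only figures reported in the text. The implied standard deviation of prices is a back-of-envelope inference that assumes the standardized effect is the dollar effect divided by the price standard deviation; it is not a number reported in the paper, and the variable names are illustrative.

```python
EFFECT_PER_SD = 4.23          # dollars per one-SD increase in violent crime (column 3)
CI_LOW, CI_HIGH = 0.67, 7.80  # reported 95% confidence interval, in dollars
MEAN_INCALL_PRICE = 151.0     # dollars per hour
SD_VIOLENT_CRIME = 310.0      # crimes per 100,000 people per year
STANDARDIZED_EFFECT = 0.064   # reported standardized effect

# Effect as a share of the mean incall price.
pct_of_mean = EFFECT_PER_SD / MEAN_INCALL_PRICE
pct_ci = (CI_LOW / MEAN_INCALL_PRICE, CI_HIGH / MEAN_INCALL_PRICE)

# Effect per 100 additional violent crimes per 100,000 people.
effect_per_100_crimes = EFFECT_PER_SD / SD_VIOLENT_CRIME * 100

# Inference only: price SD implied by the reported standardized effect.
implied_price_sd = EFFECT_PER_SD / STANDARDIZED_EFFECT

print(f"share of mean incall price: {pct_of_mean:.1%}")
print(f"CI as share of mean: {pct_ci[0]:.1%} to {pct_ci[1]:.1%}")
print(f"dollars per 100 extra crimes per 100,000: {effect_per_100_crimes:.2f}")
print(f"implied price SD: {implied_price_sd:.0f}")
```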

Table 4 Commute time and prices in the online market for sex

These results are consistent with the incall and outcall markets being segmented. If incall and outcall were one market, then we should see prices moving in opposite directions across the markets; that is, as conditions favor incall services, outcall pricing should drop, and vice versa. With respect to commute time, for example, women living in larger areas who do not want to travel (or who live in high-crime areas and fear crime) should pay men to come to them by offering lower prices, and they should charge more for travel. That we primarily observe movement in the outcall market across crime rates and city size suggests supply and demand are fairly inelastic with respect to distance in the incall market but not in the outcall market.

Under the assumption that the markets are segmented, these regressions enable one more decomposition of interest. Sex work is generally considered a distasteful job for which workers require compensation well above the market wages they could earn in other occupations (Rao et al. 2003). The premiums that workers in so-called ‘dirty’ jobs receive above what similarly skilled people earn in less unpleasant occupations are known as compensating differentials and can take the form of job amenities (e.g. more time off, flexible hours, etc.) or higher wages (Viscusi 1993; Lavetti 2015).Footnote 18 Using our data we can break the cost of a session into the travel time component, for which the worker should require no special compensation as it is equivalent to the individual’s ‘normal’ work options, and the service time component, for which they would expect special compensation. Assuming that travel time for an average outcall is similar to the average commute time, a provider in a median-sized city will charge roughly $18 for 30 minutes spent in the car, implying an hourly wage of $36/hour, which is small compared to the $151 mean price for an hour spent with the client in incall services. This is a massive compensating differential, implying either that sex workers are in great demand or that the occupation is quite distasteful.Footnote 19
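
The arithmetic behind this comparison is short enough to spell out. The sketch below uses only the figures quoted in the paragraph above and inherits its assumption that outcall travel time is close to average commute time; the variable names are illustrative.

```python
TRAVEL_CHARGE = 18.0        # dollars charged for travel in a median-sized city
TRAVEL_MINUTES = 30.0       # travel time, proxied by average commute time
MEAN_INCALL_HOURLY = 151.0  # mean price for an hour of incall service time

# Implied hourly wage for time spent traveling.
travel_hourly_wage = TRAVEL_CHARGE / (TRAVEL_MINUTES / 60.0)

# Compensating differential: service-time price minus travel-time wage.
compensating_differential = MEAN_INCALL_HOURLY - travel_hourly_wage
share_of_incall_price = compensating_differential / MEAN_INCALL_HOURLY

print(f"travel-time wage: ${travel_hourly_wage:.0f}/hour")
print(f"compensating differential: ${compensating_differential:.0f}/hour")
print(f"share of mean incall price: {share_of_incall_price:.0%}")
```

If actual outcall travel takes less time than the average commute, the implied travel-time wage rises and the differential shrinks accordingly.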

As an additional robustness check, we include a number of other time-varying controls measured at the MSA-month level in Appendix Table A2.Footnote 20 Column 1 presents the core estimating equation from Table 3 Column 4 to enable easy comparison. Column 2 adds controls for the number of ads per capita in that MSA-month to control for the possibility that changes in competition in the ad space differentially affect prices in the outcall vs. incall market. Column 3 adds linear, quadratic, and cubic terms in the number of providers advertising in the MSA-month for the same reason. Column 4 accounts for the concentration of advertising in the MSA by including a term for the Hirschman-Herfindahl Index (HHI) of the number of providers in the MSA-month. Column 5 includes fixed effects at the MSA-year level and clusters standard errors at the same level. Column 6 adds controls for the median rental price in the MSA-month to better control for local economic conditions for the subset of 170 MSAs where that variable is available. Column 7 includes all controls for the same subset. Column 8 restricts our core estimating approach to (a) ads that are priced between the 5th and 95th percentiles of pricing in all ads and (b) MSA-months that are between the 5th and 95th percentiles in terms of ads per capita. In all cases the core estimate of the interaction term between outcall prices and commute time remains statistically strong and substantively large. In addition, we estimate models controlling for rates of prostitution arrests and sex offenses measured at the MSA-month level using UCR data. Although none of the results change with these controls, we do not include them in the table because of the significant reporting biases that may be present in using reported crime to measure sexual assault risk and arrest risk.Footnote 21
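
For readers unfamiliar with the concentration measure used in column 4, a Hirschman-Herfindahl Index is the sum of squared market shares. The sketch below computes it from per-provider ad counts within one MSA-month. Treating ad counts as the basis for shares is an assumption, since the text does not spell out how shares are defined, and the function name is hypothetical.

```python
def hhi(ad_counts):
    """Hirschman-Herfindahl Index on a 0-1 scale: the sum of squared shares.

    `ad_counts` holds one non-negative count per provider in an MSA-month
    (assumed basis for market shares).
    """
    total = sum(ad_counts)
    if total == 0:
        raise ValueError("HHI is undefined when there are no ads")
    return sum((n / total) ** 2 for n in ad_counts)


even = hhi([10, 10, 10, 10])  # four providers with equal shares
skewed = hhi([37, 1, 1, 1])   # one dominant provider
print(f"even market: {even:.4f}, concentrated market: {skewed:.4f}")
```

The index equals 1/N when N providers have equal shares and approaches 1 as a single provider dominates, so higher values indicate a more concentrated advertising market.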

To account for potentially differential trends across MSAs, Appendix Table A3 compares our baseline specification (Column 1) to a model that includes MSA-specific linear time trends (Column 2) and one with quadratic time trends (Column 3). In both cases the core results remain substantively unchanged. Overall the results are quite stable once standard two-way fixed effects are included.

5.2 Controlling for provider characteristics

Our core data lack detailed information on provider traits such as appearance or specific sex acts offered. It is possible, though unlikely, that such traits correlate with MSA size and service location in a way that generates spurious results. As an additional robustness check we re-run the core analysis focusing on prices for providers listed in The Erotic Review, which provides detailed data on individual providers as described in Section 4.2. These data represent the subset of providers advertising online who have chosen to pay for a page on the review site, presumably because doing so enables them to establish a stronger reputation with customers. Their average pricing is significantly higher than in the overall sample. If we see a similar outcall premium in this subset, where we can control for traits that are unobservable in the main data, that should provide greater confidence in the core estimates reported in Tables 3 and 4.

Specifically, we estimate the premium in the average price charged by a specific service provider that is associated with offering either outcall or both outcall and incall:

$$ \begin{array}{@{}rcl@{}} P_{i,j,m} &=& \alpha + \beta_{1} \text{outcall}_{i} + \beta_{2} \text{both}_{i,j} + \beta_{3} \text{App}_{i} + \mathbf{R}\text{Race}_{i} + \mathbf{D}\text{Desc}_{i} + \mathbf{A}\text{Act}_{i} + \beta_{4} \text{Perf}_{i} \\ && + \tau_{m} + \gamma_{j} + \epsilon_{i,j,m}. \end{array} $$
(2)

Here providers are indexed by i, and each occurs in an MSA j and month m. Pi is the price for a review, and the first two variables in the regression are indicators for whether the provider offers outcall or both incall and outcall services. β1 captures the price premium for outcall over incall services, while β2 does so for providers offering both outcall and incall services. A series of additional controls account for specific features of the provider. β3 captures the effect of a provider’s appearance, which is measured on a 1-10 scale. The vector R identifies the impact of the provider’s race, which is measured with a vector of indicator variables for black, Asian, Hispanic, or other non-white categories. D captures the price premia associated with a vector of variables describing the provider’s appearance and build, which includes indicators for the provider’s build (e.g. “average” or “thin”), tattoos (e.g. “a few” or “many”), breast appearance (e.g. “average” or “perky”), and breast implants (“yes”, “no”, or “don’t know”). A identifies the price premium associated with specific acts performed by the provider, which include indicators for oral sex, oral sex without a condom, anal sex, or multiple orgasms per session. β4 captures the price premium associated with the provider’s performance rating, which is measured on a 1-10 scale. We also include city (γj) and month (τm) fixed effects, in addition to a provider-specific idiosyncratic error term. As before, we cluster standard errors at the MSA level.

As Table 5 indicates, there is a considerable outcall premium. This premium is identifiable regardless of whether the provider offers outcall services exclusively or is flexible about providing services at the client’s location or at the provider’s location. Interpreting column (1), we note that relative to providers that exclusively provide incall services, those that offer outcall or both incall and outcall services receive a price premium of approximately $57 and $50, respectively, above the average incall price of $242.50. In columns (2) - (7) we gradually increase the number of controls included in the estimation. In our most saturated model, column (7), the price premium is considerably smaller than in our most naive specification (approximately half the size), but a sizable and statistically significant effect remains. In effect, after controlling for both observable and a wide range of normally unobservable features of providers, providers offering higher-risk services by traveling to the client’s location receive approximately $20-$25 more per session, which translates into an approximately 10% price premium over providers that exclusively offer incall services.
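
To put these dollar figures on a common scale, the sketch below expresses each premium as a share of the $242.50 average incall price. It uses only numbers quoted in the paragraph above; the function and variable names are illustrative.

```python
MEAN_INCALL_PRICE = 242.50  # average incall price in the review data


def premium_share(premium_dollars, base=MEAN_INCALL_PRICE):
    """Return a dollar premium as a fraction of the base incall price."""
    return premium_dollars / base


naive_outcall = premium_share(57)   # column (1), outcall only
naive_both = premium_share(50)      # column (1), both incall and outcall
saturated_low = premium_share(20)   # column (7), lower end of the quoted range
saturated_high = premium_share(25)  # column (7), upper end of the quoted range

print(f"naive outcall premium: {naive_outcall:.1%}")
print(f"naive both-venue premium: {naive_both:.1%}")
print(f"saturated premium: {saturated_low:.1%} to {saturated_high:.1%}")
```

The saturated-model range works out to roughly 8% to 10% of the average incall price, consistent with the approximate 10% figure in the text.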

Table 5 Provider traits and prices in the online market for sex

6 Conclusion

Women selling sex services online appear to engage in rational pricing behavior. The risk associated with traveling to the client is rewarded with a significant price premium that goes beyond the costs associated with commuting. The costs associated with abuse risk borne by direct involvement with the client at a location of the client’s choosing appear to be compensated. Most importantly, the magnitude of the compensating differential workers in this market demand for service time compared to travel time, roughly $115/hour, is a significant share of the $151/hour mean price for incall time with a client.

These results imply that many workers in the market would happily shift to other activities given the opportunity and that sex workers advertising online for outcall services have sufficient market power to demand compensation for risks, suggesting the labor supply for this market is fairly inelastic.

More broadly our analysis demonstrates there is great potential for learning about behavior in online markets by combining cutting-edge machine reading technology with established econometric approaches. Importantly, though, if pricing is broadly rational then pricing anomalies should be detectable. Organizations involved in trafficking women for sex work are unlikely to internalize labor market conditions in the same way that voluntary providers do, and so are likely to shift prices in different ways. Such price-based anomaly detection is an important avenue for future research.