1 Introduction

Video streaming is a massive and growing industry, and account sharing is a major problem for streaming service providers. According to a 2015 poll by Consumer Reports [2], \(46\%\) of respondents who use a streaming service share their account with someone outside their household. An earlier poll by Thomson Reuters in 2014 found that \(15\%\) to \(20\%\) of millennials shared their accounts [1]. More recently, a study by CNBC in 2018 found that an estimated \(35\%\) of millennials share passwords for streaming services [7]. Consequently, the streaming industry loses substantial potential revenue to account sharing. The loss could be hundreds of millions of dollars annually for Netflix alone [7].

Although streaming service providers have strong financial interests in addressing the problem, they face multiple challenges when trying to identify shared accounts. 1. Huge volume of data: leading providers have millions of users and billions of sessions each month. 2. Unstructured and noisy data: session logs are plain text with considerable noise, such as missing information and numerical errors. 3. Legitimate sharing: perhaps the trickiest problem is that sharing within a household is perfectly legitimate. Family members can share one account from anywhere (e.g., home, office, school, or while traveling) and use any device. Only accounts that are shared across households violate policy and should be pursued. Currently, service providers choose not to identify or label account sharing manually because doing so is too costly and error-prone. Note that account sharing in this paper refers to against-policy sharing unless we explicitly state otherwise.

1.1 Problem Definition

In this paper, we focus on the video streaming service in the TV Everywhere ecosystem, although our solution can be easily adapted to other services. TV Everywhere is also known as authenticated streaming, where subscribers are authenticated and authorized to stream video from Multichannel Video Programming Distributors (MVPDs). The major MVPDs in the USA all have millions of subscribers. For example, AT&T and Comcast each have more than 20 million subscribers [5]. The TV Everywhere ecosystem has access to the session logs of all subscribers. Each session log contains a range of information such as account ID, location and some content information, as shown in Fig. 1.

Fig. 1. An example session log. Each session log contains information such as the subscriber’s ID (obfuscated), the device used for connection, the GPS location, etc.

We propose to utilize the session information for creating a service which automatically identifies shared accounts. MVPDs can benefit from the service in two ways: (1) the opportunity of growing revenue remarkably: conversion of shared accounts to regular paid accounts, even just a small fraction of them, will bring in millions of dollars because of their large customer bases; (2) limiting the number of shared accounts also translates to server/network load reduction, which leads to significant cost cutting.

One big challenge for this service is that there are no ground truth labels, since the labeling cost is prohibitive, as discussed in Sect. 1. This constrains our choice of algorithms. More importantly, a big question arises: without any ground truth to compare against, how can we justify the results of the service? This question is crucial for demonstrating the value of the service because of the consequences of regulating account sharing. Treating normal accounts as shared accounts (false alarms) will annoy subscribers and lead to potential loss of business. Classifying shared accounts as regular accounts, on the other hand, means the solution brings no value to MVPDs. We believe that only an explainable and presentable solution can address this challenge. Whether or not an account is identified as shared, it is paramount that MVPDs can easily understand the reason so that they can trust the results.

2 Existing Work

The streaming industry has tried to restrain account sharing by adding constraints to user accounts: (1) limiting the number of concurrent streams; (2) asking users to register a limited number of devices to their accounts. However, the first approach can adversely affect concurrent streaming by family members, who are entitled to do so. In addition, a limit on concurrent streams can be circumvented by sharing an account at different times. Requiring users to register a limited number of devices hurts the customer experience and is not desirable either: first, registration is a hassle; second, customers own more and more devices and can easily hit the limit. In addition, an account owner can sell or give a “registered” device to other people, who can then use the device for streaming and easily defeat the policy.

In the academic community, there has been considerable research [4, 8, 9, 10, 11] on modeling user behavior from session logs, mainly for improving recommendations. This work mostly focuses on identifying multiple users by the content that they watched. The techniques used include collaborative filtering [8], subspace clustering [10], graph partitioning [9] and topic modeling [11]. [6, 10] are among the very few papers that attempt to determine whether an account is shared by multiple users. However, multiple users sharing one account is not necessarily against policy. In fact, more often than not, such accounts are shared within a household. Therefore, these methods cannot solve the account sharing problem.

3 Our Solution

The account sharing detection service must accommodate all variations that a normal account could have, so that regular users are not impacted. After all, a good user experience is the key for MVPDs to maintain and grow their customer bases. This means that the service needs to handle the following scenarios and label them as normal accounts: 1. a big family with a large number of concurrent sessions, since everyone likes the freedom of choosing his/her own content; 2. an ever-growing number of devices in a household, as new devices are added all the time; 3. family members commuting to places such as school, the office or the mall, or traveling to other states, and streaming video wherever they want.

Although the problem is complicated, the following assumptions usually hold. (1) Even though they share the account, users outside of the household (against-policy sharing) are unlikely to share devices with account holders, since they live in different places. They might use devices that used to be owned by the account holder, obtained through sale or gift, but such devices are transferred and not shared, i.e., the account owner is unlikely to use them again. (2) Non-family users are likely to stream videos from separate locations, not the home of the account holder. Otherwise, they are effectively part of the household and virtually impossible to identify. Based on this analysis, we propose a novel approach to estimate a sharing score for each account. It utilizes both geolocation and device information to address the problem. Algorithm 2 describes how the sharing score is estimated. It depends on Algorithm 1 to process the data and construct efficiently retrievable user profiles. Note that the GPS coordinates in the session logs are usually noisy; we need to mitigate this problem when processing raw log files, as described in Algorithm 1. Without this step, multiple geolocations might be associated with an account even if the owner only streams at home, because the GPS coordinates of different sessions can differ due to noise.

Fig. 2. The data structures used in the algorithms.

We first introduce the notation and data structures shown in Fig. 2.

  1. Device Usage Map (DUM) represents a distribution of device usages. It is a hashmap \(\mathcal {M} = \lbrace (\mathcal {D}^0: \mathcal {C}^0), (\mathcal {D}^1: \mathcal {C}^1), \ldots , (\mathcal {D}^K: \mathcal {C}^K) \rbrace \), where K is the number of devices used in the location, \(\mathcal {D}^k\) is the \(k^{th}\) device in the list, and \(\mathcal {C}^k\) is the count (histogram) of usages for the \(k^{th}\) device.

  2. Location Usage Map (LUM) is a hashmap in the form \(\big \{ (\mathcal {L}_0: \mathcal {M}_0), (\mathcal {L}_1: \mathcal {M}_1), \ldots , (\mathcal {L}_N: \mathcal {M}_N) \big \} \), where N is the number of locations associated with the account, \(\mathcal {L}_i\) is the \(i^{th}\) pair of GPS coordinates, and \(\mathcal {M}_i\) is the Device Usage Map associated with \(\mathcal {L}_i\). For instance, consider a LUM with only one element: \( \big \{ (40.2814, -111.698): \{(4277780:16), (11090085:2)\} \big \}\). It means that 2 devices have been used at location \((40.2814, -111.698)\): device \(\#\)4277780 was used 16 times (it appears in 16 sessions) and device \(\#\)11090085 was used twice.

  3. userMap stores the profiles of all users; each entry contains a userID and the Location Usage Map associated with the account. It stores all locations and devices that are associated with the account, as well as the relationship between locations and devices (which devices are used in which locations). The relationship could be represented as a 2D (location-device) matrix, but the matrix would be very sparse: an account can have many locations due to traveling or account sharing, and a location can have many devices since a household can have an arbitrary number of them. Therefore, we use hashmaps to represent all 3 levels of maps (userMap, Location Usage Map and Device Usage Map), which makes our approach very efficient because key lookups take constant time.

  4. deviceMap is a hashmap: \( \lbrace \) (original deviceID : device index), ... \( \rbrace \). It represents a mapping from the original deviceID (a string) to an integer value, e.g., 4277780 in the above example. With a hashmap, checking whether a device exists in the system takes constant time. In addition, a device can appear in many locations, even across users, so using an integer instead of a string reduces the space requirement significantly.
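The four structures above can be sketched as nested hashmaps. This is a minimal illustration, not the authors' implementation: the names `device_map`, `user_map`, `device_index` and `record_session` are hypothetical, the device strings are made up, and no GPS de-noising is applied here.

```python
from collections import defaultdict

# deviceMap: original deviceID (string) -> compact integer index.
device_map = {}


def device_index(device_id):
    """Return the integer index of a raw deviceID, assigning one on first sight."""
    if device_id not in device_map:
        device_map[device_id] = len(device_map)
    return device_map[device_id]


# userMap: userID -> LUM; LUM: (lat, lon) -> DUM; DUM: device index -> usage count.
user_map = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))


def record_session(user_id, lat, lon, device_id):
    """Count one session in the nested maps (raw coordinates, no noise handling)."""
    user_map[user_id][(lat, lon)][device_index(device_id)] += 1


# Rebuild the shape of the one-element LUM example: 16 sessions on one device
# and 2 sessions on another, all at the same coordinates.
for _ in range(16):
    record_session("u1", 40.2814, -111.698, "tablet-A")
for _ in range(2):
    record_session("u1", 40.2814, -111.698, "phone-B")

lum = user_map["u1"]
dum = lum[(40.2814, -111.698)]
```

Every lookup in this layout is a dictionary access, which is what gives the constant-time behavior described above.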

figure a

Assuming the GPS noise follows a zero-mean distribution, the observed coordinates will be centered around the true coordinates. Step 3 in Algorithm 1 essentially performs non-maximum suppression: it combines neighboring Location Usage Maps into a single usage map at the center location. \(\sigma \) is set to 50 m in our experiments. The average GPS accuracy is about 7.8 m [3], well within the \(\sigma \) range, so typical GPS noise is absorbed by the merge. This \(\sigma \) setting is also fine enough to identify account sharing across a street. If the GPS noise does not follow a zero-mean distribution, the coordinates will be shifted by the non-zero mean. However, this will not affect the scoring algorithm, since Algorithm 2 is based only on the usage pattern, not the exact location.
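The merging step can be illustrated with a greedy sketch. It is written under stated assumptions rather than taken from Algorithm 1: the busiest location within \(\sigma \) is treated as the surviving center, distances use the haversine formula, and the function names and sample coordinates are hypothetical.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def merge_locations(lum, sigma_m=50.0):
    """Fold each location into the busiest already-kept location within sigma_m."""
    merged = {}
    # Visit the busiest locations first so that they become the centers.
    for loc, dum in sorted(lum.items(), key=lambda kv: -sum(kv[1].values())):
        center = next((c for c in merged if haversine_m(c, loc) <= sigma_m), None)
        if center is None:
            merged[loc] = dict(dum)  # new center
        else:
            for dev, cnt in dum.items():  # absorb counts into the center
                merged[center][dev] = merged[center].get(dev, 0) + cnt
    return merged


noisy_lum = {
    (40.2814, -111.698): {1: 16},
    (40.28142, -111.69803): {1: 3, 2: 2},  # a few meters from the first point
    (40.30, -111.70): {3: 1},              # about 2 km away
}
clean_lum = merge_locations(noisy_lum)
```

In this example the two nearby points collapse into one location with combined device counts, while the distant point remains a separate location.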

figure b

The userMap generated by Algorithm 1 captures the usage pattern of all accounts, with one entry per account. Each entry contains a list of Location Usage Maps, in descending order of their device usages. We use the Google Maps API to visualize the usage pattern, so people can easily see why an account is labeled as normal or as abusive sharing. Screenshots of the interactive map are shown in the results section, e.g., in Table 1, Table 4 and Table 5. Each location associated with the account is tagged with a red balloon. A red circle is centered at the root of each balloon, representing the usage in that location. The bigger the circle, the larger the number of usages (the sum of all device usages in that location). Note that the circle size is not linearly proportional to the number of usages, because the range of usage numbers is very wide, from 1 to several thousand, and a linearly scaled large circle would make all other circles too small to see. Instead, the size is based on the natural logarithm of the usage count, so that differences in usage across locations remain visible. The location with the most usages is called the base location of the account; presumably it is the owner’s home. The base location is important for the scoring in Algorithm 2. A regular account is more likely to have a dominant base location, because that is where most household members use the streaming service. For a heavily shared account, the usage pattern is more distributed.
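As a small illustration of the two ideas in this paragraph, the sketch below picks the base location and computes a logarithmic circle size. The `scale` factor and the use of \(\ln (1 + n)\) instead of \(\ln n\) (so that a single usage still yields a visible circle) are assumptions of this sketch, and the function names and sample data are hypothetical.

```python
import math


def location_usage(lum):
    """Total usages per location: the sum of all device counts in its DUM."""
    return {loc: sum(dum.values()) for loc, dum in lum.items()}


def base_location(lum):
    """The location with the most usages."""
    usage = location_usage(lum)
    return max(usage, key=usage.get)


def circle_radius(usage, scale=10.0):
    """Radius grows with ln(1 + usage); scale and the +1 offset are assumptions."""
    return scale * math.log1p(usage)


lum = {
    (40.7046, -73.9216): {747409: 15, 868586: 6},
    (41.0000, -74.1000): {747409: 2},
}
base = base_location(lum)
ratio = circle_radius(3000) / circle_radius(1)
```

With this scaling, a location with 3000 usages is drawn roughly an order of magnitude larger than a location with one usage, rather than 3000 times larger.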

Algorithm 2 estimates the score (risk) of an account being shared by checking the device usage of all locations other than the base. If a device is not in the list of “registered” devices, i.e., it is a new device never used in the base location, it is more likely to be used by outsiders (users not belonging to the household). If it appears in the list but is used much more often in other locations than in the base, it is also possibly an outsider’s device (e.g., that of a friend who visits occasionally), although the probability is much lower than for an “unregistered” device. This is captured in (5) in Algorithm 2. We believe that the risk of sharing is fairly low when the non-base usage is not significantly higher than the usage at the base location. \(\beta \) is set to 20 in our experiments. That is, when the usage in other locations is 20 times as high, we increase \(R_j\) by 0.5. A higher \(R_j\) leads to a higher R, which ultimately leads to a higher sharing score. \(r-\beta \) is divided by 3 so that the logistic curve does not saturate too quickly.
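The logistic term described here can be sketched as follows. The standard logistic form \(1/(1+e^{-(r-\beta )/3})\) is an assumption that matches the stated behavior (an increment of 0.5 at \(r = \beta \)); how the term is combined into \(R_j\) is not shown in the text, and the function name is hypothetical.

```python
import math

BETA = 20.0  # usage ratio at which the increment reaches 0.5 (from the text)


def registered_device_risk(r, beta=BETA, slope=3.0):
    """Logistic increment for a registered device with non-base/base usage ratio r."""
    return 1.0 / (1.0 + math.exp(-(r - beta) / slope))


at_beta = registered_device_risk(20.0)   # ratio equal to beta
low_ratio = registered_device_risk(1.0)  # non-base usage equal to base usage
high_ratio = registered_device_risk(40.0)
```

Dividing by 3 widens the transition region: the increment stays close to 0 for small ratios, passes 0.5 at a ratio of 20, and approaches 1 only for ratios well above that.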

We also take the distribution of locations into account when estimating sharing scores. The idea is that usages far apart are more likely to be due to account sharing. Household members may go to the office or school and stream video every day, but are less likely to go to the other side of the country. The distance weight \(W_d\) is introduced for this purpose. The minimum distance \(\alpha \) is set to 50 miles, so the distance weight has no effect in Algorithm 2(3c) for usages within 50 miles, while usages in locations hundreds of miles away are penalized and lead to higher scores.

Even if all streaming sessions appear to happen in the base location, it is still possible that the account is shared, since people can fake their location by geo-spoofing. For example, geo-spoofing has been used by Pokemon Go players to “go” to places without physically being there. It is not a widespread practice yet, but we should be ready to tackle it. The device weight \(W_t\) in Algorithm 2 is designed for this purpose. The higher the number of devices, the higher the weight. So even if all streaming sessions share the same location, the score will still be higher if there is an extremely large number of devices. This is our first attempt to tackle the geo-spoofing problem, so we are relatively generous in the parameter setting, with \(\gamma = 20 \). For example, based on the current settings, if 400 devices are used in an account, the weight will equal 2. If the account’s Location Usage Map has only one location (all sessions happen in one location), the final score will be 0.46. If the number of devices is less than \(\gamma \), the weight is always 1. For accounts with just one location and fewer than 20 devices, the R value in Algorithm 2(5) is always 1. Consequently, they get a score of 0, which signifies an unquestionably safe account. Even with this generous setting, we identified some potentially shared accounts where all sessions appeared to be in one location; an example is shown in Table 2. The worst case has 6,829 devices under one account, clearly a shared account using geo-spoofing. It gets a score of 0.75, not extreme but high enough to be identified.
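The single-location numbers quoted in this paper can be reproduced with a simple closed form, sketched below as a sanity check. Algorithm 2 is not reproduced in this text, so both formulas are assumptions inferred from the reported values: a device weight \(W_t = \max (1, \log _\gamma T)\) for T devices, and a final score \(\tanh ((R \cdot W_t - 1)/2)\) with \(R = 1\) for a single-location account. The function names are hypothetical.

```python
import math

GAMMA = 20  # device-count parameter from the text


def device_weight(num_devices, gamma=GAMMA):
    """Assumed form: log base gamma of the device count, floored at 1."""
    if num_devices <= gamma:
        return 1.0
    return math.log(num_devices) / math.log(gamma)


def single_location_score(num_devices):
    """Assumed squashing tanh((R * W_t - 1) / 2), with R = 1 for one location."""
    return math.tanh((device_weight(num_devices) - 1.0) / 2.0)


# Device counts mentioned in the text and in the captions of Tables 1 and 2.
scores = {n: round(single_location_score(n), 2) for n in (10, 27, 72, 400, 6829)}
```

Under these assumptions the sketch gives 0 for 10 devices, 0.05 for 27, 0.21 for 72, 0.46 for 400 and 0.75 for 6,829, which agrees with the values reported for single-location accounts. The agreement supports the assumed form but does not confirm that it is the one used in Algorithm 2.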

4 Experimental Results

We use a three-month session log of a TV Everywhere system to test the proposed solution. The data contains 30,620,878 users and 1,032,254,858 session records, and its size is 1.01 terabytes. Using 0.5 as the threshold for the sharing score, we identified about \(6.45\%\) of accounts as shared, with very high confidence. These accounts are usually quite obviously shared, as shown in Table 4 and Table 5. Using a threshold of 0.05, we identified about \(15.66\%\) of accounts as shared. Some manual verification might be needed for these results. Nevertheless, based on random checks, the identified accounts are indeed likely to have been shared. About \(70\%\) of users stream videos from just one location using fewer than 10 devices. They are all labeled as regular/non-sharing accounts, as their sharing scores are 0.

Table 1. This account has been used by 27 devices in total. However, they are all used in the base location and the number of devices is not very high, so it is still likely a legitimate account, with a score of 0.05.
Table 2. All sessions happen in just one location for this account. However, the account has been used by many more devices, 72 in total. The number of devices suggests that geo-spoofing might be used here. The score is 0.21.
Table 3. This account has been used in many locations. The location with the most usage is shown in bold. However, the account is identified as a regular account because few devices are used. It is most likely family members traveling around.

4.1 Multiple Devices in One Location

Two cases are shown in Tables 1 and 2. Although both have only one location for all sessions, the case in Table 2 is more likely to be a shared account because of the unusually large number of devices being used. The accounts with zero sharing scores all have the same visualization (one location with few devices).

4.2 Multiple Devices Multiple Locations

The case in Table 3 represents an interesting pattern: many locations and few devices. We call it a traveling account. The account has two devices (\(\#747409\) and \(\#868586\)) associated with it. The base location is \((40.7046, -73.9216)\), where device \(\#747409\) was used 15 times and device \(\#868586\) was used 6 times. Both devices were also used in other locations, suggesting that they were carried along while traveling. The account sharing score is relatively low for this case (only 0.01) and the account is labeled as safe. The score is not 0, though, because it is possible that a friend visited the account holder’s base location (probably home) multiple times with device \(\#868586\), and thus got the device “registered” to the account and lowered the account sharing score. Nevertheless, the probability is very low in comparison with other shared accounts. Note that the sharing score is updated monthly with incoming data, so next month device \(\#868586\) will not be “registered” with the account if the friend no longer brings it to the account holder’s base location. As a result, the sharing score would be much higher according to Algorithm 2, as the device is not used in the base location. This would cause the account to be labeled as shared, which is the desired result. Therefore, even if users know how we identify shared accounts, it is not easy for them to game the system: they have to pay regular visits to the account holder’s base location in order to “register” their devices to the account. Otherwise, they will be identified.

Table 4. A typical shared account: used in many locations, without a clear base location (many locations have similar numbers of usages); 39 devices have been used for streaming with this account.
Table 5. A wildly shared account: used in numerous locations; 33,909 devices have been used for streaming with this account in the 3-month period.

4.3 Identified Shared Accounts

See Table 4 and Table 5 for some typical cases.

4.4 Discussion

As we have argued, the solution has to be explainable and presentable so that people can understand and trust it. This has been a design principle for our solution. As we have demonstrated, our identification results can be illustrated intuitively and digested easily, which makes both the solution and its results easier to trust.

Both Algorithm 1 and Algorithm 2 are naturally parallelizable: we can easily split the computation by grouping userIDs. In our implementation, we divide the work by the first character of the userID, i.e., [0–9, a–f], so the work is split into 16 batches. Thus we do not need a huge hashmap of all users; instead, we work with about 1/16 of them in each batch. This greatly reduces the memory requirement of our implementation. We use only one workstation for this experiment. It can finish all jobs in a week, which is sufficient for the currently designed monthly sharing score update. In the future, we can use a machine cluster to scale to more users if necessary, e.g., 16 machines to process the 16 batches.
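The batching scheme can be sketched in a few lines. This is an illustration only: it assumes that userIDs are strings beginning with a hexadecimal digit, as the [0–9, a–f] split suggests, and the function names and sample IDs are hypothetical.

```python
from collections import defaultdict

HEX_DIGITS = "0123456789abcdef"


def batch_key(user_id):
    """First character of the userID, assumed to be a hexadecimal digit."""
    key = user_id[0].lower()
    if key not in HEX_DIGITS:
        raise ValueError(f"unexpected userID prefix: {user_id!r}")
    return key


def split_into_batches(user_ids):
    """Group userIDs into at most 16 batches keyed by their first character."""
    batches = defaultdict(list)
    for uid in user_ids:
        batches[batch_key(uid)].append(uid)
    return batches


batches = split_into_batches(["0a1", "f9c", "0ff", "a00"])
```

Because every session of an account lands in the same batch, each batch can build its own userMap and be scored independently, either sequentially on one machine or on separate machines.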

Since our solution is based on GPS coordinates, it is possible that in areas of high population density, e.g., high-rise apartment buildings, people can share their accounts without being caught, since they are indistinguishable by position alone. The fact that we also consider the number of devices mitigates this to some extent. Nevertheless, more information, such as the number of concurrent sessions and user behavior analysis, will be needed to better address the issue. In any case, the fact that we can reliably identify over \(6\%\) of accounts as shared can already have a significant impact, potentially saving streaming service providers hundreds of millions of dollars.

5 Conclusion

In this paper, we have proposed a novel solution for identifying shared accounts for video streaming services. It has several major advantages. First, it is efficient: we can process 3 months of data with 30 million users in a week using a single machine, with about 2 million shared accounts detected. Second, the results are explainable. Each processed subscriber, whether labeled as shared or regular, is given an intuitive and interactive web-based illustration, so that service providers can understand and trust the results. Note that our solution preserves privacy: we obfuscate the deviceID using an integer when showing the result, so service providers do not need to worry about violating privacy when using our solution. In addition, the proposed solution handles noise in geolocation information. Last but not least, it guards against geo-spoofing, making it hard for subscribers to game the system.

Although our solution is designed in the context of the TV Everywhere ecosystem, it can be directly applied to other video streaming services such as Netflix and Hulu. In addition, it can be generalized to other applications, e.g., music streaming and, more broadly, subscription-based software such as Photoshop.