1 Introduction

The increasing use of video services for business, education, and entertainment calls for better coding technologies for efficient storage and transmission of video data. This need is further emphasized by the growing size of display devices capable of playing videos in High Definition or Ultra High Definition resolution. Most video streaming is delivered over the Internet, and the proportion of video data on the Internet is estimated to grow further [5]. This trend in the usage of multimedia services is expected to raise users' awareness of perceptual quality, and both service providers and consumers are interested in the delivered level of perceptual quality. Besides compression, the perceptual quality of videos can be degraded by distortions in the transmission medium. In traditional video streaming, the effects of varying network conditions on users' Quality of Experience (QoE) have not been addressed completely. Hypertext Transfer Protocol (HTTP) based Adaptive Streaming (HAS) offers a video streaming solution that is more robust against network-induced distortions, so that there are only limited or no losses in the transmission. One of the salient features of HAS, which is standardized by the Moving Picture Experts Group (MPEG) as Dynamic Adaptive Streaming over HTTP [15], is that the client is in control and can adapt the video streaming to the varying network conditions. The adaptation is made possible by providing multiple bitrate copies of the video segments transmitted from the server. Adaptive streaming has notably been observed to be useful in reducing video stalling that might occur due to bandwidth constraints [45]. A client might prefer to switch to a lower quality video instead of experiencing a halt (buffering) in the video playback. Moreover, such a provision of multiple levels of video quality has other advantages as well, such as the possibility of adapting to the consumer's display terminal and the preferred price plan of the service.

As of now, HAS is being developed to find optimal solutions for its various stages. For example, under what conditions would it be perceptually preferable to switch to a lower quality in order to avoid stalling in the playback? Also, slowly or rapidly switching to the lowest or highest quality might have different impacts on the user QoE. To this end, subjective quality assessment experiments are performed using test stimuli that are representative of different adaptation scenarios. This research area has received considerable attention during recent years in the effort to understand the perceived quality of the comparably slowly varying quality changes that HAS gives rise to, combined with the abrupt halts that may still occasionally occur due to rebuffering. A problem in this research area is that international standards for conducting subjective tests for HAS are still largely missing. The currently available standards, e.g. ITU-R Rec. BT.500-13 [17], ITU-T Rec. P.910 [16], and ITU-T Rec. P.913 [18], only cover methods for short sequences, with the exception of Single Stimulus Continuous Quality Evaluation (SSCQE). SSCQE is a method for continuously giving quality scores and could be used for studying HAS. However, it is hard to set up and carry out, since it requires precisely calibrated scoring devices. The viewer may drift in quality without noticing it, and there is usually a delay between the occurrence of an event and the viewer's response. Furthermore, the aggregation to an overall score is not straightforward. An alternative approach that avoids such issues is Paired Comparison (PC) based subjective assessment, which we have used in this study and which is presented in Section 4.1.

Subjective experiments are considered the most valid methodology for assessing QoE and are generally conducted in a controlled laboratory environment. Objective, i.e. computer software assisted, methods [34] have largely been seen as an alternative approach to get around the complications involved in laboratory based subjective experiments. However, even objective methods with state-of-the-art performance are not considered an adequate replacement for subjective assessment. Crowdsourcing based subjective experiments have gained attention as a way to reduce the need for laboratory based tests, and they offer promising correlation with the latter [19]. Crowdsourcing mainly involves collecting subjective quality assessments through ubiquitous streaming via the Internet. This enables the investigator to receive opinions from a wide variety of viewers in a time-flexible, test-data size scalable, and swift manner. In this study we investigate how crowdsourcing could potentially be used for studying HAS and how it relates to more traditional laboratory testing.

2 Goals and contributions

This paper presents a subjective study on HAS through crowdsourcing, as briefly reported in [35]. We employ test stimuli representative of various adaptive video streaming scenarios that are adopted in practice by most service providers in order to find a good Adaptive BitRate (ABR) strategy (e.g. how to arrange the bitrate budget). Additionally, we report on the content-dependency of the subjective QoE. We also report a laboratory based subjective experiment conducted in order to further investigate the results of the crowdsourcing based experiment and its correlation with the laboratory based results.

In this paper we only consider bitrate adaptation by adjusting the bitrate of the video encoding, and not other forms of adaptation such as spatial and temporal scaling. For studies on such adaptations and their influence on the QoE we refer to [20, 32].

In comparison to the related work given in the following section, it becomes evident that more subjective studies assessing various scenarios of adaptive video streaming are required. In particular, it is desirable to conduct studies that are closer to the real-life usage of video services. Therefore, the application of Paired Comparison based subjective assessment to adaptive video streaming in a crowdsourcing environment is our major novel contribution. Additionally, we analyze the results of the crowdsourcing experiment in the light of a follow-up laboratory based experiment. The experimental design deliberately avoids introducing too many new parameters in order to keep the results interpretable. Therefore, we start from a data set that has already been annotated and introduce the changes from there. Based on [27] it is clear that experimental results need to be repeated in several independent studies to become trustworthy. In that regard, our contribution also includes the repetition of some previous studies, e.g. how a viewer reacts to one versus multiple stalling events.

The remaining part of this article is structured as follows. An account of related work is given in Section 3. Section 4 presents an overview of the test data and methodology used in the subjective experiments. Section 5 summarizes the model used to process the user feedback obtained through the crowdsourcing experiment. An analysis of the results is presented in Section 6. A discussion and concluding remarks are presented in Sections 7 and 8, respectively.

3 Related work

Robinson et al. [30] conducted a subjective study to evaluate user experience of HAS based video streaming under constraints of varying bandwidth, latency, and video-data losses. Various observations were made, including the preference for a constant bitrate over a frequently changing bitrate and for a slow drop to lower quality over oscillating bitrates. Staelens et al. [37] performed a subjective study with long duration videos on tablet devices to investigate the impact of quality switches due to adaptive streaming. They observed that users mainly perceive changes of quality from the highest to the lowest levels, while high to medium quality changes remain largely unnoticed. Moreover, stalling in the playback of videos was observed to be the least preferable. In [31] it was studied how spatial and temporal quality switching had different impacts on the QoE and which features of the switching were most relevant in relation to the QoE. The results reported in [42] investigated the optimum number of coding quality levels that could be used in an adaptive video system by studying the just noticeable difference levels that exist in the quality range of video content. The incorporation of the effects of frame rate and resolution adaptations on user perception, to obtain the encoding configuration that maximizes the QoE for a certain type of content, has been investigated in [7]. The study presented in [23] specifically compared the impact of increasing versus decreasing quality in response to a variation in the network conditions. It is reported that downgrading the quality has a stronger impact on the QoE. Similar results were obtained in the study reported in [10]. Moreover, in order to study the impact of slow or rapid variations of quality in comparison with constant low or high quality video streaming, [38, 40] presented the results of subjective assessments of QoE on such test stimuli. In [28] the authors used Youtube and crowdsourcing to conduct a subjective experiment on ABR streaming. The results indicated that the delivered representation bitrate and the number of stalls were the main influence factors on the QoE. Finally, [8, 14, 33] presented surveys of studies on various influence factors of QoE in HTTP adaptive streaming.

Of the related works mentioned above, the subjective experiments in [14, 28] were conducted through crowdsourcing. In [14], the authors analyse the effect of switch amplitude (i.e., quality level difference), switching frequency, and recency effects on HAS QoE, while in [28], the authors analyse the effect of average representation bitrate (i.e., media throughput at the client), average startup time (or startup delay), and average number of stalls on existing DASH-based Web clients. However, the subjective tests performed in these experiments do not use a PC based methodology, and moreover, the subjective tests we performed were more extensive, with a higher number of test videos. An introduction to crowdsourcing, a discussion of the differences between crowdsourcing and laboratory experiments, and best practices for crowdsourcing, such as including a screen test and control questions, are presented in [12, 13]. In [4, 29] web-based platforms for subjective studies of QoE for videos are presented. In [9] various improvements for implementing subjective crowdsourcing experiments are proposed, so-called momento methods, that increase the reliability and reduce the execution time of the crowdsourcing campaign. This work builds on our previous work [35], where the initial results from the crowdsourcing experiment are presented.

4 Subjective experiments of video quality

Most of the test videos used in this study were previously used in the laboratory based subjective experiment reported in [38, 40]. These test videos closely resemble the video quality levels and content types used by service providers. Also, different adaptation scenarios are considered in the experiment to address service providers' concerns. From now on, we refer to this experiment as Laboratory Experiment 1. The original purpose of Laboratory Experiment 1 was to compare the outcomes of a subjective experiment using a traditional and standardized Absolute Category Rating (ACR) test methodology versus a semi-continuous methodology developed to evaluate long sequences in a more realistic setting, as explained in [38]. Also, the impact of some technical factors of the adaptation scenarios, such as the amplitude of the quality switching and the video chunk size, was investigated. The impact of including or excluding audio was investigated in [39] and no statistically significant effect was found. Therefore, we chose to exclude audio in the subjective experiments reported in this work.

The original videos were all in 1280x720 resolution with a frame rate of 24 fps and encoded using the H.264/AVC high profile at 4 different bitrates: 600 kbps, 1 Mbps, 3 Mbps, and 5 Mbps. The videos were encoded with closed GOPs, a maximum of 2 B-frames, and 3 reference frames. Seven different sources were used; three sources were taken from entertainment action/romance movies (denoted Pirates, Darkhour, Streetdance) and the rest was content from a soccer match (denoted Football), a sports documentary (denoted ClosetoEdge), a newscast (denoted News), and a concert (denoted Rollingstone). The selection of the source content is motivated in [38]. The subjective Laboratory Experiment 1 was carried out at the lab of Acreo Swedish ICT AB (Acreo Lab) in a test room compliant with ITU-R Rec. BT.500 [17].

Spatial perceptual Information (SI) and Temporal perceptual Information (TI), as defined in [16], can be used for categorizing the video content. Content with low SI values contains scenes with minimal spatial detail, while content with high SI values contains scenes with a lot of spatial detail. Content with low TI values consists of still scenes with very limited motion, while content with high TI values contains scenes with a lot of motion.
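As a concrete illustration of these measures, the following is a minimal sketch of how SI and TI can be computed for a sequence of grayscale frames, following the definitions in ITU-T Rec. P.910 [16]. The use of NumPy/SciPy and the function name are our own illustrative choices and not part of the original study; the filter normalization may differ slightly from the standard's Sobel definition.

```python
import numpy as np
from scipy.ndimage import sobel

def si_ti(frames):
    """Spatial (SI) and Temporal (TI) perceptual Information for grayscale frames.
    SI: maximum over time of the spatial std of the Sobel-filtered frame.
    TI: maximum over time of the spatial std of the frame difference."""
    si_values, ti_values = [], []
    prev = None
    for frame in frames:
        f = frame.astype(np.float64)
        # Sobel gradient magnitude as the spatial-activity measure
        grad = np.hypot(sobel(f, axis=0), sobel(f, axis=1))
        si_values.append(grad.std())
        if prev is not None:
            ti_values.append((f - prev).std())
        prev = f
    return max(si_values), max(ti_values)

# Example with random frames standing in for a decoded 720p clip
frames = np.random.randint(0, 256, size=(48, 720, 1280), dtype=np.uint8)
si, ti = si_ti(frames)
print(f"SI = {si:.1f}, TI = {ti:.1f}")
```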

The content of the original videos can be described as follows, where the spatial and temporal information are noted as (SI, TI) for each content. The Pirates video (48, 29) is from an action movie and features some scenes with smooth motion, some with groups of walking people, and some with camera panning. The Darkhour video (51, 28) is from a thriller movie and features scenes with rapid scene changes and a cloudy atmosphere. The Streetdance video (46, 34) is from a drama movie and consists mostly of scenes with smooth motion and a static background, and some scenes with groups of dancing people in a bright ambience. The Football video (56, 29) is from a TV broadcast of a football match and has moderate motion and wide angle camera sequences with panning. The ClosetoEdge video (43, 24) is from a documentary, is mostly shot with a handheld camera, and features varying characteristics. The Rollingstone video (45, 42) is from a concert recording, where the lead singer moves around a lot, and the video has some sudden scene changes. The News video (49, 23) is from a TV news broadcast and has static scenes with one or two standing or sitting people, some scenes with a moving background, and some scenes without reporters that contain more motion and panning.

Several adaptation scenarios for the videos were produced for Laboratory Experiment 1, such as going from a high to a low bitrate in a stepwise manner. Out of all those scenarios, the following are used in this work: Gradual Decreasing (GD), Rapid Decreasing (RD), constant 600 kbps (N600), constant 1 Mbps (N1), constant 3 Mbps (N3), and constant 5 Mbps (N5). Additional details of these scenarios, such as the timing of the bitrate steps, can be found in [38]. Additionally, we introduced new buffering scenarios to test the quality perception in relation to the aforementioned scenarios. The buffering scenarios include: 1 freezing event of 2 seconds in the constant 3 Mbps video (1F3M), 2 freezing events of 1 second each in the constant 3 Mbps video (2F3M), and 1 freezing event of 2 seconds in the constant 1 Mbps video (1F1M). The freezing events were in most cases placed in an evenly spaced manner, except when this coincided with or was very close to a scene change; in that case the freezing event was moved a few seconds away from the scene change, so that the interaction between those two effects was minimized. We did not consider initial delay in our test stimuli, as some studies, e.g. [11], have noted that it does not seem to have a significant impact on the QoE for the user. Due to the semi-continuous methodology used in the previous work, some of the degradations were applied to the content at different time intervals. Thus, each Processed Video Sequence (PVS) originating from a specific original content as described above might be taken from a different time interval of that original content.

In total 9 different scenarios were used, resulting in a total of 63 stimuli. Table 1 presents a summary of the test stimuli. Using these test stimuli, we conducted a crowdsourcing experiment that is referred to as the Crowdsourcing Experiment in the rest of the paper. Additionally, a laboratory based experiment was performed that we refer to as Laboratory Experiment 2 in this article. Table 2 presents a summary of the usage of the test stimuli in the different experiment setups. Note that Laboratory Experiment 2 was mainly conducted to better understand any difference between the results of Laboratory Experiment 1 and the Crowdsourcing Experiment, and to confirm whether such a difference is due to the difference in test material or due to the experimental methodology. Based on the findings in Laboratory Experiment 2 we were able to conclude that the difference in the results is most likely due to the PC methodology, as discussed in Section 6.

Table 1 Test stimuli
Table 2 Use of test stimuli in experiments

4.1 Crowdsourcing experiment

Crowdsourcing is a powerful and cost effective tool for performing short and simple tasks online, as it provides access to a large number of fairly low priced workers in a short period of time. However, performing multimedia subjective quality assessment with crowdsourcing brings many challenges. If the resources at the viewer's end are limited, for instance slow Internet connections or low resolution screens, it is very difficult to transmit and display high quality multimedia content. Moreover, having very little control over the viewer's environment, such as the viewing conditions and the viewer's mental state, makes crowdsourcing tests less trustworthy compared to laboratory tests. In addition, it is very difficult to check the reliability of the viewers. Therefore, in order to deal with these challenges, we have embedded different screen tests in our crowdsourcing tool. Viewers are obliged to perform the screen tests and answer survey questions, which provides information about the eyesight and personal background of the viewers along with some of the display properties of their screens and their current environment. Different dummy questions related to the multimedia content are asked during the test in order to identify unreliable viewers.

Crowdsourcing experiments should be as simple as possible for the viewer; therefore we chose to follow the Paired Comparison (PC) evaluation methodology [16], where the test sequences are grouped into pairs that are presented to the viewer one after the other. To keep it simple for the viewer, after each pair of videos the viewer is asked via the online interface which of the two stimuli he or she preferred. Since we chose the PC methodology, which is very simple for the viewer [25], we did not train the viewers before the test, which would have been necessary if e.g. a rating scale had been used [12]. This also has the advantage that we did not influence the viewers during a training session, which could have biased the obtained data.

We used the optimized square design [25] for the pairings, based on our assumptions about the quality levels. This was done to reduce the number of pairings needed to get reliable measurements. Using this method, our complete test set consisted of a total of 126 pairings. These pairings were divided into 14 tasks with 3 videos from 3 different contents, i.e., 9 videos for each task. For a random video from each content the viewers also needed to answer a control question about the content. We also used the screen test from [13] prior to the subjective test to filter out potential malicious viewers. In total, 215 paid viewers from 30 countries participated in the crowdsourcing experiment. Figure 1 shows the distribution of viewers across the different countries.

Fig. 1 Distribution of viewers across different countries

To conduct the crowdsourcing subjective experiment we chose to create our own web-based platform capable of presenting videos to viewers for performing paired comparisons. In this article, we present a brief documentation of this software. For further documentation and the technical implementation the reader is invited to access the web page of the open source project that served as the basis for our platform [36]. The test videos are required to be in a format playable by Internet browsers, but otherwise our platform can be used for any PC subjective video experiment, not only for ABR videos as was done in this work. Alternatives to our platform for PC subjective experiments in crowdsourcing include [4, 29]. Advantages of our platform include: access to the source code for easy modification, the overall experiment is easily broken into smaller tasks, viewers are dynamically assigned to the current task with the fewest views, a unique solution to ensure smooth playback, and the design and setup of the paired comparisons can be entirely defined by the experiment designer.
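As an illustration of the dynamic task assignment mentioned above, the sketch below shows the underlying idea in Python; in the actual platform this logic is implemented in PHP against the database, and the names and the per-task counts are only illustrative.

```python
import random

# Illustrative state: 14 tasks of 9 pairings each; in the real platform
# the per-task view counts are stored in the database.
views_per_task = {task_id: 0 for task_id in range(14)}

def assign_task():
    """Return a task id with the fewest views so far, breaking ties at random,
    so that all pairings are evaluated a similar number of times."""
    fewest = min(views_per_task.values())
    candidates = [t for t, v in views_per_task.items() if v == fewest]
    task_id = random.choice(candidates)
    views_per_task[task_id] += 1  # count the view assigned to the new viewer
    return task_id

# Each arriving viewer (identified by a unique id) gets one task of 9 comparisons.
print(assign_task())
```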

In the current version of the platform, it is assumed that each viewer watches 9 comparisons and that each viewer is directed to the front page with a unique id. In this work, the Microworkers platform was used to hire the viewers. The platform was built using the Hypertext Preprocessor (PHP) language and JavaScript. The flow of the interface is illustrated in Fig. 2. Dashed lines indicate dependencies, meaning that boxes connected only with dashed lines are used as part of the pages they are connected to. Most pages of the interface include a link to an instruction page, which is a simple web page with detailed instructions for the viewer.

Fig. 2 Flowchart of the interface; dashed lines indicate dependencies

The front page of the interface shows a condensed version of the instructions and the screen test from [13]. Below that is a small questionnaire that can be tailored with e.g. demographic questions. At the very bottom of the page is a progress bar showing the status of the loading process for the first pair of videos. When the loading is done a start button appears, which leads to the test loop. In this way the videos are ensured to be playable without unintended interruptions. The compatibility of the viewer's platform is also tested. If any error is detected, e.g. JavaScript being disabled or the device resolution being too low, the viewer is automatically redirected to the relevant error page. The error pages contain information about the specific errors and what the viewer might be able to do in order to remedy them.

The evaluation loop consists of 3 pages. Two of them handle the video playback, while the third presents the question regarding preference. An example of the playback page can be seen in Fig. 3. In order to ensure smooth playback without unintended interruptions, the videos are played back at high speed, invisibly to the user, on the preceding pages. In this way, we ensure that the videos are fully downloaded, buffered, and ready to be displayed locally.

Fig. 3 Screenshot of the first video playback page. The video frame is from [1]

For 1/3 of the video pairs, a control question related to the content of the video is also asked on the preference page. The answers to the control questions can be used when filtering out unreliable viewers. In this way, screening of the viewers is done twice: first based on the information collected on the front page before the evaluation of the videos, as described above, and finally after the experiment has been concluded, when the information from the control questions is also available. If there are more pairs for the viewer to evaluate, the next pair of videos is loaded on the preference page, with a progress bar showing the status at the bottom of the page, and the viewer is redirected back to the first video playback page. Otherwise, the viewer is redirected to the end page, where they receive a unique code as documentation of their participation.

The status page contains information meant for the test manager. It provides a quick overview of the current status of the tests, with information retrieved directly from the database. The interface is connected to a database in order to store and exchange information between the interface and the participants. The database stores information about the viewers' answers to the video pair preferences, the content control questions, the screen test results (as described below), the initial buffer time, the time spent on playback of each video, and the current progress of each viewer.

Screen tests are used to determine the end user's watching conditions. If the watching conditions for the viewers are not favorable, then the collected test scores are not reliable. In our web based application, we therefore applied two screen test mechanisms as described in [9, 13]. In the screen test the visibility of the symbols depends on different conditions, such as screen orientation, screen resolution, screen brightness, screen color combination, and the viewer's eyesight. Viewers are not allowed to proceed in the test without performing the screen test. In addition, an unreliability score was calculated using the screen test implementation from [13]. Based on the unreliability scores obtained from the screen test, 215 viewers out of 266 potential viewers were allowed to complete a subset of the experiment.

4.2 Laboratory experiment 2

As the test videos used in the Crowdsourcing Experiment (Section 4.1) included additional test scenarios compared to Laboratory Experiment 1, it might be argued that the results of the two setups cannot be compared directly due to the difference in test data. Therefore, in order to better understand any deviation between the results of Laboratory Experiment 1 and the crowdsourcing experiment, the videos chosen for Laboratory Experiment 2 were the same as the ones used in the Crowdsourcing Experiment (Section 4.1), while the evaluation methodology was the same as in Laboratory Experiment 1 [40], namely the Absolute Category Rating (ACR) test methodology [16] with a discrete rating scale from 1 to 5.

As in [40], the subjective experiment was carried out at the Acreo Lab. The lab was equipped with a 46″ Hyundai S465D display with a native resolution of 1920x1080 and a 60 Hz refresh rate. The viewing distance was 4 times the display height. The peak white luminance of the TV was 177 cd/m² and the ambient illuminance level in the room was about 20 lux. A modified version of the VQEGPlayer [3] was used to present the randomized PVSs and the voting interface after each PVS.

The viewers were initially screened for visual acuity (Snellen chart) and color vision (Ishihara), and were asked to fill in the Simulator Sickness Questionnaire (SSQ) [6] as well as to answer some questions about their background and video habits. The viewers were then instructed in the testing procedure and a training session was performed, so that the viewers could familiarize themselves with the procedure and the range of qualities. During the training session some examples of PVSs including quality variations (adaptation scenarios) and videos with buffering events were shown. The actual test with the randomized PVSs was then carried out in one session lasting around 20 to 30 minutes. After the test, the viewers were again asked to fill in the SSQ. The viewers were of different ages (mean 32.5, median 28.5, max 60, and min 18) and backgrounds. There were 7 female and 15 male viewers.

5 Processing of the pair-comparison data

In order to analyze the results obtained from the pair-comparison tests in the crowdsourcing experiment and to be able to compare them with the results from the laboratory tests, the obtained preference values need to be converted into quality values for each PVS. As noted in some related studies [24, 25], it is possible to compare the results obtained through pair comparison with the results obtained via the ACR method. To this end, we use the Bradley-Terry-Luce (BTL) model [2], given its popularity in similar studies of video quality assessment, e.g. [21, 22]. If $p_{ij}$ represents the probability that a video stimulus i is preferred over a stimulus j, the BTL model takes the following form:

$$ p_{ij} = \frac{\pi_{i}}{\pi_{i}+\pi_{j}} $$
(1)

where $\pi_{i}$ is the quality score for stimulus i, which can take any value between zero and one, inclusive. This expression can be reformulated using the empirical preference probabilities, i.e.,

$$ p_{ij} = \frac{m_{ij}}{m_{ij}+m_{ji}} $$
(2)

where $m_{ij}$ is the frequency of stimulus i being preferred over stimulus j. The corresponding $\pi_{i}$ can be computed by maximizing the log-likelihood function given by the following expression:

$$ L(\pi_{1}, \pi_{2}, \pi_{3},...,\pi_{n}) = \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}p_{ij}\ln(\frac{\pi_{i}}{\pi_{i}+\pi_{j}}) $$
(3)

where n is the number of stimuli. This expression can be solved by computer assisted iterative methods, and we adopted the optimization routines of the software package mentioned in [44]. This package relies on the Matlab function fminsearch to find the solution of the model. Specifically, we used the preference matrices of the nine stimuli. Necessary precautions were taken to avoid local extrema. To this end, we carefully inspected the covariance matrix so that it did not contain negative values on its main diagonal. Moreover, an initial seed of likelihood values was provided to the software package and the obtained likelihood values were used as the seed for an iterative call to the underlying function. The iterations were stopped when the difference between the likelihood values of successive calls was below a tolerance of $10^{-4}$.
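For illustration, the sketch below maximizes the log-likelihood in (3) for a single preference matrix using SciPy instead of the Matlab routine from [44]. The exponential reparametrization that keeps the scores positive, the Nelder-Mead choice (mirroring fminsearch), and the placeholder counts are our assumptions, not the exact setup of the package used in the study.

```python
import numpy as np
from scipy.optimize import minimize

def fit_btl(m, tol=1e-4):
    """Fit the BTL model to a preference-count matrix m, where m[i, j] is the
    number of times stimulus i was preferred over stimulus j.
    Returns quality scores pi normalized to sum to one."""
    n = m.shape[0]
    # Empirical preference probabilities p_ij = m_ij / (m_ij + m_ji), eq. (2)
    with np.errstate(invalid="ignore"):
        p = np.nan_to_num(m / (m + m.T))

    def neg_log_likelihood(theta):
        pi = np.exp(theta)                    # positivity via exp reparametrization
        ratio = pi[:, None] / (pi[:, None] + pi[None, :])
        mask = ~np.eye(n, dtype=bool)         # skip the i == j terms
        return -np.sum(p[mask] * np.log(ratio[mask]))

    res = minimize(neg_log_likelihood, x0=np.zeros(n), method="Nelder-Mead",
                   options={"xatol": tol, "fatol": tol})
    pi = np.exp(res.x)
    return pi / pi.sum()

# Placeholder counts for 9 stimuli of one content (not our measured data)
rng = np.random.default_rng(0)
m = rng.integers(0, 20, size=(9, 9))
np.fill_diagonal(m, 0)
print(fit_btl(m))
```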

Among other parameters of the model, this software provides the Hessian matrix of the log-likelihood function, which can be used to compute the related covariance matrix. This in turn can be used to estimate standard errors from the main diagonal of the covariance matrix, denoted $Diag\left[\widehat{cov()}\right]$. Finally, the 95 % confidence intervals are obtained by the following:

$$ \pm1.96\sqrt{Diag\left[ \widehat{cov()}\right]} $$
(4)

We transform the obtained score estimates using the natural logarithm [2] and normalize them to the interval of the ratings in the laboratory experiments in order to obtain mean opinion score (MOS) values. The same transformation is applied to the bounds of the confidence intervals from (4). Thus, the transformed confidence intervals will be asymmetric due to the natural logarithm transform.
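A sketch of these last steps, assuming the fitted scores and a covariance matrix (e.g. obtained by inverting the Hessian of the negative log-likelihood at the optimum) are available, is shown below. The target interval [1, 5] and the simple min-max mapping of the log scores are our assumptions for illustration; the exact normalization used in the study may differ in detail.

```python
import numpy as np

def btl_to_mos(pi, cov, lo=1.0, hi=5.0):
    """Map BTL scores pi and their covariance matrix to MOS-like values.
    Standard errors come from the diagonal of cov (eq. 4); the scores and
    the confidence bounds are log-transformed and then linearly rescaled."""
    se = np.sqrt(np.diag(cov))
    lower = np.clip(pi - 1.96 * se, 1e-9, None)   # keep the bound positive for log
    upper = pi + 1.96 * se
    log_pi, log_lo, log_hi = np.log(pi), np.log(lower), np.log(upper)
    a, b = log_pi.min(), log_pi.max()
    rescale = lambda x: lo + (hi - lo) * (x - a) / (b - a)
    # The rescaled intervals are asymmetric around the score due to the log transform
    return rescale(log_pi), rescale(log_lo), rescale(log_hi)

# Hypothetical inputs standing in for one content's nine stimuli
pi = np.array([0.02, 0.05, 0.08, 0.10, 0.12, 0.13, 0.15, 0.17, 0.18])
cov = np.diag(np.full(9, 1e-4))
mos, ci_lo, ci_hi = btl_to_mos(pi, cov)
```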

6 Results and discussion

6.1 Crowdsourcing experiment

To analyze the results we applied the Bradley-Terry-Luce (BTL) model [2] as detailed in Section 5. The viewers were filtered by excluding those with too many unlikely preferences. We define an unlikely preference as a preference whose corresponding probability in the BTL model is lower than a threshold θ. In our test, we allow at most 2 out of 9 unlikely preferences and we set θ = 0.25. With this approach 6 viewers were excluded from the final results.
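The screening rule can be expressed compactly as in the sketch below, assuming fitted BTL scores are available for the stimuli a viewer compared; the function name and the example data are hypothetical.

```python
import numpy as np

THETA = 0.25      # probability threshold below which a preference is "unlikely"
MAX_UNLIKELY = 2  # at most 2 of a viewer's 9 preferences may be unlikely

def is_reliable(preferences, pi):
    """preferences: list of (i, j) pairs meaning the viewer preferred stimulus i
    over stimulus j; pi: BTL quality scores of the stimuli."""
    unlikely = sum(1 for i, j in preferences
                   if pi[i] / (pi[i] + pi[j]) < THETA)
    return unlikely <= MAX_UNLIKELY

# Placeholder example: one viewer's 9 choices over stimuli indexed 0..8
pi = np.linspace(0.02, 0.2, 9)
choices = [(8, 0), (7, 1), (0, 8), (6, 2), (5, 3), (8, 1), (4, 3), (2, 6), (7, 0)]
print(is_reliable(choices, pi))   # True: only one unlikely preference
```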

In order to validate the results obtained from the crowdsourcing experiment, we compared the opinion scores obtained from Laboratory Experiment 1 [40] with those from the crowdsourcing experiment. This is shown in Fig. 4. The results show that the opinion scores from the two experiments are strongly correlated, however not as strongly as could be expected of a repetition of a laboratory test. This can be due to differences in the test setup, such as the evaluation method, the viewing environment, and the introduction of new distortions. To investigate this, we performed a new laboratory test, Laboratory Experiment 2, as detailed in Section 4.2. The results and the comparison with this test can be seen in Section 6.2.

Fig. 4 Comparison between Laboratory Experiment 1 and the crowdsourcing subjective experiment. Linear correlation: 0.77

Our experiment verifies the results from earlier studies, e.g., [26], that buffering events have a high impact on the QoE. Due to this, users generally prefer viewing videos at lower bitrates than having buffering events in videos at higher bitrates.

The quality of the videos can also be compared against the average bitrate of the videos. This is illustrated in Fig. 5, where the mean of the subjective scores has been calculated over the video contents. Generally, users prefer videos at higher bitrates, i.e., 3 or 5 Mbps, and the difference between them is probably due more to the difference in content than to the difference in compression levels. Users dislike buffering events, and it seems that the frequency of these events is more important than their total duration (both videos at 3 Mbps with buffering have a total buffering time of 2 s), which is in line with earlier studies, e.g. [43]. However, if the bitrate is high enough and the frequency of the buffering events is low enough, as for the 1F3M video, this seems to be a viable alternative to decreasing the bitrate of the video or having a constant low bitrate, e.g., 600 kbps or 1 Mbps.

Fig. 5 Opinion scores versus the average bitrate for the crowdsourcing experiment

We also investigated the impact of the video content on the perceptual preference for different adaptation scenarios. Instead of relying only on Spatial and Temporal perceptual Information (SI and TI) indices [41], we also analyze our results using semantic indicators such as the genre of the video (e.g. action movie, concert recording, and newscast, as outlined for the test sequences in Section 4). The opinion scores for each PVS can be seen in Fig. 6, and it is evident that the content plays a major role in the perception of quality. The content with the lowest standard deviation of the BTL scores across the different degradation types is the Football content (with a standard deviation of 0.47 compared to values from 0.63 to 0.82 for the other contents), which might be explained by the high spatial complexity of that content, which makes the different adaptation strategies more attractive compared to lowering the bitrate. The content with the highest standard deviation of the BTL scores across degradation types is the Pirates content (with a standard deviation of 0.82 compared to the second highest of 0.76), due to very high scores for high bitrates and low scores for medium to low average bitrates. This could be because the source for this content is of high quality, visually pleasing, and from a very well known blockbuster, making it easier to distinguish between quality levels. For ClosetoEdge and News the PVS with gradually decreasing bitrate has higher BTL scores than for other content (statistically significant with an overall confidence level of 0.90 compared to Darkhour, Pirates, and Rollingstone), which could be due to the generally low temporal complexity of these sequences. For Darkhour the 2F3M scores lower than all other PVSs, which might be due to the suspense being interrupted by two pauses. In the Football sequence the constant 600 kbps ranks low compared to the other Football PVSs, which might be due to the high temporal complexity of the sequence, causing a lot of flickering artifacts at low bitrates. This is also the case for the Rollingstone content, where the uniform black background contains a lot of very noticeable artifacts at 600 kbps. The BTL score of the 600 kbps PVS subtracted from the mean of the other PVSs is 0.67 and 1.38 for the Football and Rollingstone contents respectively, compared to other contents where this value is in the range of −0.40 to 0.39. We also note that for ClosetoEdge the 5 Mbps video is rated lower than expected, which is probably due to the specific content in this PVS.

Fig. 6 Opinion scores with 0.95 confidence intervals for every PVS for the crowdsourcing experiment. The PVSs are indexed by the first letter in their names (see Section 4)

We also performed statistical tests on all overlapping PVSs between the crowdsourcing test and Laboratory Experiment 1 [40] to check for significant differences. With an overall confidence level of 90 % there was no significant difference in the means of any PVS.

6.2 Laboratory experiment 2

Before any analysis of the results, screening of the observers according to ITU-R Rec. BT.500-13 [17] was applied. No observers were eliminated in the screening. The MOS results of this laboratory experiment correlate very well with those of the original laboratory test, as can be seen in Fig. 7. The linear correlation coefficient with the crowdsourcing results is, on the other hand, only 0.69 if the same subset of videos is used as in Fig. 8, while the linear correlation is 0.70 if the full set of videos is used to calculate it. Therefore, it seems that the PC methodology is not suited as a substitute for the ACR methodology in this case, where the video pairs can be from different periods of time in the original source video. Even so, the trend of the overall ranking of the degradations, which can be seen in Fig. 9, seems to be well aligned with the results from the crowdsourcing experiment.

Fig. 7 Scatter plot for overlapping sequences in Laboratory Experiment 1 and Laboratory Experiment 2. Linear correlation: 0.96

Fig. 8 Scatter plot for the Crowdsourcing Experiment and Laboratory Experiment 2

Fig. 9 MOS versus the average bitrate for Laboratory Experiment 2

The MOS for each PVS can be seen in Fig. 10, and again it is evident that the content plays a major role in the perception of quality; however, in some cases the conclusions seem to differ somewhat from the crowdsourcing experiment. Generally, the content originating from blockbusters (ClosetoEdge, Darkhour, Pirates, and Streetdance) has higher MOS for high constant bitrates than the rest of the content. This difference between the experiments could be caused by the assumed difference in screen sizes and screen quality between the crowdsourcing experiment and the lab experiment. The content with the lowest standard deviation of the MOS across the different degradation types is still the Football content (now with a standard deviation of 0.50 compared to values from 0.61 to 0.99 for the other contents). However, the content with the highest standard deviation of the MOS across degradation types is the Streetdance content (with a standard deviation of 0.99, with Pirates being second highest at 0.93). In this experiment the PVS with gradually decreasing bitrate still has a higher MOS for ClosetoEdge, but now also for Darkhour instead of News compared to other content (statistically significant with an overall confidence level of 0.90 compared to all other contents except News). For Darkhour the 2F3M score is still the lowest for that specific content, but not the lowest score when compared to other content. The 600 kbps PVS for the Rollingstone content still has a very low score compared to the other degradations of that content, but this trend is not nearly as drastic for the Football content. The MOS of the 600 kbps PVS subtracted from the mean of the other PVSs for the Rollingstone content is 0.82, compared to other contents where this value is in the range of −0.29 to 0.45.

Fig. 10 MOS with 0.95 confidence intervals for every PVS for Laboratory Experiment 2. The PVSs are indexed by the first letter in their names (see Section 4)

For Laboratory Experiment 2, we also performed statistical tests to see whether the means of the PVSs differed significantly from those of the Crowdsourcing Experiment. In this case, we found that at an overall confidence level of 90 % four PVSs had significantly different means. This corresponds to 6.3 % of the total number of PVSs. The four cases were: the movie clip from ClosetoEdge with constant 5 Mbps bitrate, the movie clips from Darkhour with gradually decreasing bitrate and with constant 5 Mbps bitrate, and the movie clip from Streetdance with constant 5 Mbps bitrate. In all four cases the scores from the crowdsourcing experiment were significantly lower than the scores from Laboratory Experiment 2.

7 Discussion

Laboratory based experiments are more reliable than crowdsourcing experiments; however, they have limitations such as (1) high cost in terms of time and labor, and (2) limited participant diversity. In addition, users need to be physically present in the laboratory to perform the test [46]. On the other hand, crowdsourcing experiments allow an investigator to get opinions from a wide variety of subjects in a time-flexible, test-data size scalable, and swift manner [35].

A summary of the results of both experiments is presented in Table 3. The obtained results show an acceptable correlation between the laboratory test and the crowdsourcing test. One of the trade-offs of crowdsourcing compared to laboratory studies is the increased uncertainty that comes from the lack of control over the viewer, the environment of the viewer during the test, and the equipment used by the viewer. This could manifest itself as increased variation or standard deviation in the experiment. On the other hand, it is relatively cheap to increase the sample size in a crowdsourcing test compared to a corresponding lab test. Let us assume a fairly common set-up for a video quality test with about 100 video clips in a within-subject design, and let us assume a Normal distribution. Suppose we would like to detect a difference in MOS of about one, i.e. a change of one level on the 5-point ACR scale, while compensating for multiple comparisons using Bonferroni correction to obtain an overall 95 % confidence of significance for the whole test; this gives an alpha of 0.05/(100 · (100 − 1)/2) ≈ 0.00001 per comparison. In Fig. 11 we plot the sample size required to reach a test power of 0.8 as a function of the standard deviation, for MOS differences of 0.5 and 1.0. We can then see that if, for instance, the standard deviation increases from 0.8 to 1.0 for a MOS difference of 1.0 (blue curve), we would need to go from fewer than 30 viewers to about 40 viewers to keep the power of the test at the 0.8 level. Half a MOS is also an interesting case: if we would like to resolve a 0.5 MOS difference, which is also shown in Fig. 11 (red curve), the same increase requires that the number of viewers be increased from about 75 to about 125. The point of this discussion is that adding 50 or 100 viewers in crowdsourcing is fairly easy and could very well compensate for the increase in uncertainty caused by the lack of control. Due to this advantage and the promising correlation, crowdsourcing experiments have gained popularity and can be an alternative to lab-based experiments.
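The sample-size reasoning above can be reproduced approximately with the normal-approximation sketch below, which assumes a paired (within-subject) comparison of MOS values and Bonferroni correction over all clip pairs. Since the exact power calculation behind Fig. 11 (t-based versus normal approximation, paired versus independent samples) is not fully specified here, the absolute numbers from this sketch should be read as indicative only.

```python
import numpy as np
from scipy.stats import norm

def required_viewers(delta, sigma, n_clips=100, overall_alpha=0.05, power=0.8):
    """Approximate number of viewers needed to detect a MOS difference `delta`
    when the standard deviation of the MOS differences is `sigma`, using a
    Bonferroni-corrected two-sided test over all pairwise clip comparisons."""
    n_comparisons = n_clips * (n_clips - 1) / 2        # 4950 comparisons for 100 clips
    alpha = overall_alpha / n_comparisons              # roughly 1e-5 per comparison
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(np.ceil((z * sigma / delta) ** 2))

for sigma in (0.8, 1.0):
    print(sigma, required_viewers(delta=1.0, sigma=sigma),
          required_viewers(delta=0.5, sigma=sigma))
```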

Table 3 Summary of the results
Fig. 11 The required sample size to show a significant difference between MOS values that differ by 0.5 and 1.0, as a function of the standard deviation, for a test of 100 video clips and a power of 0.8

8 Conclusion

We presented a study investigating crowdsourcing based subjective assessment of video quality as a potential alternative to laboratory based experiments. Our novel approach includes the application of Paired Comparison based subjective assessment in a crowdsourcing environment. In our experiments, we employed test stimuli representative of various adaptive video streaming scenarios that are adopted in practice by most service providers. The subjective experiment conducted in a crowdsourcing environment verifies the results of earlier studies of adaptation scenarios, including the effect of buffering events. Our study suggests that in a network environment with fluctuating bandwidth, a medium or low video bitrate that can be kept constant is the best approach. Moreover, if there are only a few drops in bandwidth, one can choose a medium or high bitrate with a single or a few buffering events. In this case the duration of the buffering events should be long enough to minimize the risk of another buffering event in the near future. Additionally, we reported on the content-dependency of the subjective QoE.

Lastly, we conducted a laboratory based subjective experiment to further investigate the results of the crowdsourcing based experiment. The obtained results suggest that the correlation of the crowdsourcing based results with the laboratory based results might have been affected by the use of the paired comparison (PC) technique for presenting the test stimuli to the viewers, combined with the intermix of content and degradations. More experiments can be performed to verify this indication and to weigh this disadvantage of PC against its advantage of simplifying the test procedure. This is especially important in a crowdsourcing environment, where the investigator has less control over the test setup used by a remote viewer.

An interesting extension of this work would be to analyze the demographics of the viewers from the experiments, especially the crowdsourcing experiment, and investigate any correlations between the demographics and the perceptual preference of video quality.