1 Introduction

The digitization of the media industry forces the vast majority of enterprises to rethink and reorganize their revenue models. That is why those companies have moved into the digital space to stay profitable and embrace contemporary trends, focusing mainly on the online part of the business. Nevertheless, low barriers to entry into the industry, widespread access to content on similar topics, and reduced attention span among readers make the competition even more fierce. Two primary ways of gaining a competitive advantage emerge for digital media businesses: focusing on customers or on content. In the first case, the organization personalizes the content and strategy based on user behavior, using tools such as cookie tracking, data analysis, dynamic paywalls, and dynamic pricing [2, 11]. In the second case, attention is paid to analyzing the characteristics of articles and recognizing whether there are any repeating patterns among successful ones, in order to provide adequate feedback to content producers: journalists, editors, and the editorial board. The importance of such content analysis methods is increasing in the highly competitive digital media business [1, 7]. That is especially relevant considering that companies are striving for more data about users in order to personalize the offer and increase the chances of subscription. However, only a small percentage of readers register on the website and leave contact data. Moreover, a growing share of page views comes from new or anonymous users. That is why organizations turn to other ways of increasing the pool of subscribers. Firstly, getting to know the specifics of the created articles makes it possible to optimize the paywall strategy of the organization, including the decision of which articles should be available to the user before the paywall is displayed. Secondly, analysis of articles can provide feedback for both the editorial board and the authors themselves about the preferences and tastes of users as well as about successful publishing strategies.

The next big step for content analysis is the implementation of artificial intelligence to enhance the business processes of publishers [1, 3, 7]. As discussed in this paper, one of the outcomes of such implementation is data-based scoring of the content to better represent its chance of success. Success may be defined differently by every enterprise, data team, and editorial board. In this paper, we focus on success expressed as the situation where an article increases the chances that the user subscribes. Therefore, adequate machine learning models for Propensity to Subscribe (P2S) are applied to provide a score for the article, i.e., how likely it is that the user will subscribe having read that article. The models include variables such as when users read the article, the daily/weekly patterns of interest, behavioral features of the article, the number of referred visits (using a link), the time of reading and attention, the age of the article, and the interaction with a paywall. On top of that, custom variables that help assess the article are calculated and added.

Inspired by the research on modeling user engagement profiles for detecting a reader’s propensity to subscribe presented in [11], we introduce a content scoring solution based on articles’ engagement profiles and intended to enhance the dynamic paywall policy. However, unlike the analysis of reader behavior [11], P2S modeling for articles is a highly dynamic problem, so the method and approach cannot simply be copied. That is why, in this paper, we propose a new architecture for building a scalable and efficient content scoring solution.

The contribution of the paper is as follows. We propose a novel content profiling framework, which is part of the Deep Glue System [11] responsible for managing and optimizing access for digital media users. In particular, we describe article profiles based on comprehensive engagement statistics of the users reading a given article. Furthermore, we demonstrate how such profiles can be enriched and dynamically updated in real time and then applied to propensity-to-subscribe modeling and paywall control. Finally, we experimentally evaluate the performance of machine learning algorithms which utilize the proposed digital content profiles for the application scenario of predicting the propensity to subscribe based on article features.

2 Related Work

Digital content profiling for detecting users’ propensity to subscribe is an underexplored research problem [1, 2, 7]. Many studies concerning the general content scoring problem have already been published, e.g., [10, 12], but none of them is focused on digital article scoring aimed at optimizing subscription sales. The most relevant research results on modeling and measuring user engagement with digital articles are presented in [1, 7]. Unlike our approach, Carlton et al. [1] study the problem of engagement prediction. Furthermore, they use short video content as their application scenario. On the other hand, the author of [7] analyses user engagement patterns with page views of news articles. Specifically, he investigates the relationship between engagement levels and information gained in the articles’ text. In contrast to his research, we are not limited to news articles. Moreover, we are focused on user features closely related to the subscription process, e.g., describing user interactions with a paywall. In [3], Davoudi et al. propose a subscription prediction model using user engagement measures. Additionally, in more recent research presented in [2], the authors propose engagement-based paywall control policies. However, their research is not focused on modeling article profiles and does not investigate the influence of engagement-based profiling on the efficiency of machine learning models.

In the case of research on user profile enrichment techniques, many solutions focus on social media applications [5, 9]. Unlike our method, they are based on processing textual data [9] extracted from social services such as Twitter [5]. Our approach is closer to the research of Tang et al. [13] and Li et al. [8], who propose building time-agnostic temporal features based on aggregations in a specific time window as a time-forgetting mechanism. However, their studies apply to real-time recommendation systems [13] and streaming service churn prediction [8]. Our research is strictly connected to the studies presented in [11], in which a framework for building digital media user profiles using their engagement features has been presented. However, in this paper we introduce a propensity-to-subscribe scoring solution based on articles’ engagement profiles, which is aimed at optimizing the dynamic paywall mechanism for the case of new or anonymous users.

3 Content Profiling Based on Behavioral Features

In this paper, we introduce the content profiling framework to enrich the page view events with additional engagement-based features of articles. These new features may be seen as the current article profile based on statistics concerning article readers. Specifically, for a given article a and a given timestamp t, the article profile p(a, t) is formally modeled as a sequence of features:

$$p(a,t) = (f_1, \dots , f_m),$$

where m is the total number of profile features. The new features are generated using events collected in various periods before the time t, which usually corresponds to the timestamp of the enriched event. The details about profile feature types and the description of specific features applied in tests presented in this paper are presented in Tables 1 and 2, respectively.

Table 1. Article profile features.

The proposed article profiles contain the information intended to be helpful when predicting if reading the given article may increase the user’s propensity to subscribe for content. In particular, it includes the most recent historical data on article page views, readers’ attention time, types of users reading the article, user engagement segments, traffic sources, statistics of paywall displays and clicks, and the number of subscription purchases. Most of the features are aggregation features, including counters of specific events (e.g., the number of subscriptions sold just after reading the article) or total sums of a given original numeric feature (e.g., the number of seconds spent in the system) in the given time window (e.g., today, yesterday or during last 7 days). Additionally, profiles include features based on simple statistics such as the average or percentage of occurrence of some feature values, including segment-based features corresponding to readers from different user groups. Finally, we defined custom features based on predefined formulas involving the current values of original or enriched profile features. Some of them are just simple ratios, and others describe dynamics of given feature change in time, e.g., modeling differences between today and yesterday or between today and last week.
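The feature families described above can be illustrated with a minimal Python sketch that computes a small profile p(a, t) from raw page view events. It is not the Deep Glue implementation: the `Event` fields, the feature names, and the choice of UTC calendar days for the "today" and "yesterday" windows are assumptions made only for illustration.

```python
from dataclasses import dataclass

DAY = 86_400  # seconds in a day


@dataclass
class Event:
    """Hypothetical page view event; field names are illustrative."""
    article_id: str
    ts: int              # event timestamp, seconds since epoch
    attention_s: float   # attention time of this page view, in seconds
    paywall_shown: bool  # a paywall was displayed during the view
    subscribed: bool     # a subscription was bought just after the view


def window(events, article_id, start, end):
    """Events of one article with start <= ts < end."""
    return [e for e in events
            if e.article_id == article_id and start <= e.ts < end]


def ratio(numerator, denominator):
    """Safe division used by average, percentage and ratio features."""
    return numerator / denominator if denominator else 0.0


def article_profile(events, article_id, t):
    """Sketch of p(a, t), built only from events strictly before t."""
    day_start = t - t % DAY  # assumption: days are UTC calendar days
    today = window(events, article_id, day_start, t)
    yesterday = window(events, article_id, day_start - DAY, day_start)
    last7 = window(events, article_id, t - 7 * DAY, t)

    attention_7d = sum(e.attention_s for e in last7)
    subscriptions_7d = sum(e.subscribed for e in last7)
    return {
        # aggregation features: counters and sums per time window
        "views_today": len(today),
        "views_yesterday": len(yesterday),
        "views_7d": len(last7),
        "attention_sum_7d": attention_7d,
        "subscriptions_7d": subscriptions_7d,
        # simple statistics: averages and percentages
        "attention_avg_7d": ratio(attention_7d, len(last7)),
        "paywall_pct_7d": ratio(sum(e.paywall_shown for e in last7), len(last7)),
        # custom features: a ratio and a day-over-day dynamic
        "conversion_ratio_7d": ratio(subscriptions_7d, len(last7)),
        "views_delta_today_vs_yesterday": len(today) - len(yesterday),
    }
```

Scanning the full event list for every profile is only suitable for illustration; a production system would maintain such aggregates incrementally.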

Fig. 1. Deep Glue content scoring architecture diagram.

The overview of the content scoring system architecture is depicted in Fig. 1. A stream of events (describing all user-article interactions) is collected in real time and stored on a distributed messaging system (Apache Kafka), which is one of the Stream processor components. Each event is enriched on Apache Flink with current engagement features. First, the corresponding user profile, updated on every interaction (based on the solution presented in [11]), is added. Subsequently, the events are enhanced with article profile features (as described in Table 1). Events enriched with engagement features are used to generate predictions controlling the workflow execution for every article on the website. The performance can be monitored online using metrics emitted to a Data warehouse solution. The machine learning model used for serving is trained and evaluated offline, periodically, in a batch manner. Models that pass the evaluation are serialized and pushed to the Stream processor environment.
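The enrich-then-score flow can be sketched in plain Python as follows. This is a single-process stand-in and not the Kafka/Flink pipeline itself: the `ProfileStore` class, the event keys, and the `score` callable (standing in for the serialized model) are hypothetical names. The sketch also assumes that an event is enriched with the profile built from earlier events only, and that the store is updated afterwards.

```python
from collections import defaultdict


class ProfileStore:
    """In-memory stand-in for the article profile store (hypothetical)."""

    def __init__(self):
        self._counters = defaultdict(
            lambda: {"views": 0, "attention_sum": 0.0, "paywall_displays": 0}
        )

    def current(self, article_id):
        # Return a copy so that later updates do not alter enriched events.
        return dict(self._counters[article_id])

    def update(self, event):
        counters = self._counters[event["article_id"]]
        counters["views"] += 1
        counters["attention_sum"] += event["attention_s"]
        counters["paywall_displays"] += int(event["paywall_shown"])


def enrich_stream(events, store, score):
    """Enrich each event with the current article profile, then score it."""
    for event in events:
        profile = store.current(event["article_id"])
        enriched = {**event, **{f"article_{k}": v for k, v in profile.items()}}
        enriched["p2s_score"] = score(enriched)  # model prediction
        store.update(event)  # the event becomes visible to later events only
        yield enriched
```

Reading the profile before updating it keeps the features of an event free of information from that same event, which matters because the label (a purchase following the view) would otherwise leak into the features.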

4 Experimentation Dataset

We use a unique dataset containing real data collected from the traffic on a digital media website. It consists of events describing the article views of users exploring the content of a digital site of a large media publisher. Articles published on the website are news, reports or reviews on politics, technology, environment, business, and economics. They have various characteristics, ranging from short news items with timely content, which are popular for a limited time and then become irrelevant, to reports or reviews whose content remains relevant long after the publication date.

In this paper we use the data collected during the second half of 2021. The raw data contains around 100M events corresponding to page views of 200K unique articles viewed by 50M unique users. Each event has been dynamically enhanced with features from the most recent article profiles, built using the available historical information on the engagement of users who read a given article. Then, the enriched samples were used to build the ML models predicting a user’s propensity to subscribe after reading the article. In order to make our results reproducible, we made our anonymized dataset publicly available. The details of the dataset’s preprocessing, including data cleaning, filtering, and engagement-based enhancement, as well as the final dataset’s statistics, are presented in Sect. 5.

5 Experimentation Setup

In this section, we present the details of the experimentation scenario. The description contains the information on dataset preparation, building machine learning models, and the way of their evaluation.

Dataset Preparation. The original dataset, introduced in Sect. 4, consists of 100M events corresponding to article views. Just after its collection, each event was enriched with the most recent article profile available in the profile store. The profiles contain engagement features from the Deep Glue Content Profiling System described in Sect. 3. The outline of the article profile features used in the experiments is presented in Table 2. The article views are labeled based on the information on the subscription purchase just after reading a given article. Specifically, we selected 2098 new subscription purchases (i.e., non-renewal purchases from newly acquired users) with registered information about the last seen article. Moreover, we excluded all the article views of users with an active subscription from the dataset. Then, because the labeled dataset was highly imbalanced, we decided to randomly downsample the events with a negative label, with the downsampling ratio set experimentally to 0.02. The final data samples were generated synthetically based on the characteristics of the data collected. The basic statistics of the dataset are summarized in Table 3.
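The filtering and downsampling steps can be sketched as below. The event keys `active_subscriber` and `label` are hypothetical, and the sketch assumes that a downsampling ratio of 0.02 means that each negative event is kept independently with probability 0.02, while all positive events are kept.

```python
import random


def prepare_dataset(events, negative_ratio=0.02, seed=0):
    """Drop views of active subscribers and randomly downsample negatives."""
    rng = random.Random(seed)  # fixed seed for a repeatable sample
    kept = []
    for event in events:
        if event["active_subscriber"]:
            continue  # views of users with an active subscription are excluded
        if event["label"] == 1 or rng.random() < negative_ratio:
            kept.append(event)  # keep every positive, about 2% of negatives
    return kept
```

Sampling each negative independently yields approximately, not exactly, the requested fraction; an exact-size sample could be drawn with `random.sample` instead if that is preferred.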

Table 2. Features used in experiments.
Table 3. Basic statistics of a preprocessed dataset used in experiments.

Experimental Scenarios. We tested the effectiveness of our approach using two experimentation scenarios: (i) a basic off-line scenario assuming 10 repetitions of the experiment based on different random splits into training and test data, and (ii) an additional real-world scenario assuming efficiency evaluation of models built using historical data. Both scenarios are based on the preprocessed dataset described above. The purpose of the off-line tests was to obtain reliable results provided as averages of 10 individual results. Each repetition is based on a different random split into training and test data, with a training ratio equal to 0.75. For the real-world scenario, the model was built using the data collected during 20 weeks and then evaluated on the following 5-week period. By choosing time as the factor for partitioning the data, we could mimic the real-time nature of the target infrastructure. The goal of the real-world experiment was to demonstrate the impact of real-time profile enrichment with time-agnostic behavioral features on the efficiency of propensity-to-subscribe modeling.
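The two partitioning schemes can be sketched as follows. The sample key `ts`, the seeding scheme, and the half-open week boundaries are assumptions of this illustration rather than details given in the experiments.

```python
import random

WEEK = 7 * 86_400  # seconds in a week


def random_splits(samples, train_ratio=0.75, repetitions=10, seed=0):
    """Off-line scenario: yield one random train/test split per repetition."""
    for repetition in range(repetitions):
        rng = random.Random(seed + repetition)  # different split each time
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        yield shuffled[:cut], shuffled[cut:]


def temporal_split(samples, start_ts, train_weeks=20, test_weeks=5):
    """Real-world scenario: train on the first weeks, test on the next ones."""
    train_end = start_ts + train_weeks * WEEK
    test_end = train_end + test_weeks * WEEK
    train = [s for s in samples if start_ts <= s["ts"] < train_end]
    test = [s for s in samples if train_end <= s["ts"] < test_end]
    return train, test
```

In the temporal split every test sample is strictly later than every training sample, so the model is never evaluated on a period it has seen, which is what makes this scenario closer to online serving.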

5.1 Approaches Under Comparison

To demonstrate the efficiency improvement caused by engagement-based enrichment of article profiles, we compared the following approaches:

  • the baseline prediction algorithm based on the ML model utilizing the raw features describing the article (see Table 2), i.e., article author, topic, number of days from publication, as well as weekday, day, hour of a page view,

  • the basic profile prediction algorithm utilizing the content profile enriched by basic engagement-based features based on general counters, i.e., total number of distinct users who read the article, total sum of readers’ attention time, total numbers of page views, paywall displays, paywall clicks, and conversions,

  • the full profile prediction algorithm utilizing each feature of digital content profile introduced in Table 2.

Furthermore, in order to provide a more detailed and insightful discussion, we delivered an additional efficiency comparison for models built using different thematically grouped parts of the engagement-based content profiles. We followed the group definition introduced in Table 2. Specifically, we compared the impact of features related to (i) attention time and page view statistics, (ii) paywall and conversion statistics, (iii) types of users who read the article, and (iv) the traffic source.

We used the CatBoost classifier [4] (implemented using its official library [14]) to build the machine learning models compared in this paper. We applied CatBoost with default parameters and a predefined random_state in our experiments. We marked all the raw features presented in Table 2 as categorical features. The classifier choice was driven by technological constraints and business needs. Firstly, we were limited to algorithms that did not cause high execution latency, which was crucial to ensure the high-quality real-time performance of the infrastructure. Secondly, since the most crucial basic article features, such as the author’s name and the article topic, are categorical, we chose a solution known to handle this kind of data effectively. The efficiency of the algorithms was evaluated on the test set by means of standard ML measures [6], i.e., the Area Under the ROC curve (AUC), the average precision (AP), accuracy, balanced accuracy (included because the data are highly imbalanced), precision, recall, and F1. The evaluation results are presented in Sect. 6.
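To make the role of these measures concrete, the sketch below implements several of them in plain Python using their textbook definitions. It is an illustration only: the paper does not state which implementation was used for evaluation, the function names are hypothetical, and average precision is omitted for brevity.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    recall = tp / (tp + fn) if tp + fn else 0.0       # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

For example, a degenerate model that predicts "no subscription" for a set with 5 positives and 95 negatives reaches an accuracy of 0.95 but a balanced accuracy of only 0.5, which is why plain accuracy alone is uninformative on imbalanced data.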

6 Results

In this section, the results of experiments introduced in Sect. 5 are presented.

6.1 Results of Off-Line Experiments

The results of off-line experiments are presented in Figs. 2, 3, 4 and 5, and then summarized in Tables 4 and 5.

Fig. 2. ROC Curves presenting the impact of article profile enhancement with user engagement features.

Fig. 3. Precision Recall Curves presenting the impact of article profile enhancement with user engagement features.

Fig. 4. ROC Curves presenting the impact of various groups of profile features.

Fig. 5. Precision Recall Curves presenting the impact of various groups of profile features.

Table 4. Results of experiments presented as means and standard deviations for series of 10 experiment iterations (baseline vs user engagement profiles).
Table 5. Results of experiments presented as means and standard deviations for series of 10 experiment iterations (impact of different groups of features).

Comparing the ROC curves (see Fig. 2) shows that the models exploiting enriched data have achieved better efficiency than the models based on raw features describing the articles. We can also observe the quality progress implied by applying full profile features when looking at prediction efficiency through precision-recall curves (see Fig. 3). This observation confirms the importance of adding more specific features, such as counters and averages within the shorter time window concerning events from today or yesterday, and custom features modeling ratios, percentages, or change dynamics. The values of the most popular machine learning measures [6] collected in Table 4 confirm the crucial role of engagement-based article profiles in the performance of digital subscription propensity models. The efficiency progress is evident when looking at metrics such as average precision, balanced accuracy, and the F1 score. Figures 4 and 5 and Table 5 deliver the additional efficiency comparison for models built using different thematically grouped parts of the engagement-based content profiles. We have observed that the biggest quality improvement is caused by using features corresponding to the number of article views and their attention time. The results confirm the still high but slightly smaller importance of features describing paywall displays and clicks and subscriptions bought after reading the article. The impact of user types and traffic sources turned out to be less significant.

6.2 Results for the Real-World Scenario

This section presents the results of tests conducted according to the real-world scenario described in Sect. 5. This additional experiment aims to check the efficiency of real-time article profile enrichment with time-agnostic behavioral features in the target business scenario involving online propensity-to-subscribe scoring. This scenario is much harder than the offline scenario based on a random split into training and test sets, since it requires testing the performance on new data, which usually differs from the data used for training. The results are collected in Tables 6 and 7. The biggest performance decrease is observed for the baseline models when comparing the metric values with the corresponding results of the offline scenario. In the case of models using article profiles, this deterioration is much smaller, especially when we limit the profile to the basic, most general features. The results confirm that the proposed profile features are more general, time-agnostic, and, therefore, more useful when applying the models in the target online environment. In addition, results similar to those presented were obtained in online tests using a real system, which further reinforces this hypothesis.

Table 6. Results of the experiment for the real-world scenario (baseline vs user engagement profiles).
Table 7. Results of the experiment for the real-world scenario (impact of different groups of features).

7 Conclusions

In this paper, we show that real-time article profile enrichment with time-agnostic features based on users’ engagement leads to significantly improved machine learning models for detecting the user’s propensity to buy. We demonstrate that accurate AI-based targeting of the users interested enough in the offer to pay for access is ready to become a standard in the digital media industry.

Our findings indicate that, to attract more subscribers, media companies should invest in modeling data-driven features describing article performance. While metrics such as the number of unique users or the total number of views are useful in many ways, by themselves they are not enough to infer a user’s propensity to buy. A fit-for-purpose ML model can properly take into account multiple factors to predict the attractiveness of any particular article in real time, outperforming any attribution model heuristic that is currently widely used in the industry.

As article attractiveness is usually a highly dynamic concept – especially in the case of news – modeling additional data-driven features describing the most recent readers’ engagement has turned out to be an efficient tool for propensity-to-subscribe model improvement. Our profiling method delivers user engagement features that are more context-independent and time-agnostic. Consequently, they are more generalizable and applicable to many scenarios within the industry, especially when the ML models are usually served in an environment that differs from the one where the models are built. Finally, our research opens future work directions, including extending the framework with additional features modeling the content type, such as video, text, or rich-media articles.