1 Introduction

One use of social media and other web analytics data is customer segmentation (Jansen 2009), which is an approach for separating an overall customer population based on segment differences defined by a specific set of attributes. Customer segmentation is a common practice across many industries with the set of attributes utilized being relevant to the particular domain. Examples of such domains include marketing, advertising, education, and system design. E-commerce companies and other organizations rely on customer segmentation to target specific customer groups with content and products that the consumers within a segment would likely find relevant. Additionally, customer segmentation might also lead to a deeper understanding of customer preferences, needs, and wants by isolating what each segment finds most valuable. Based on these insights, organizations can more effectively engage with their customers, audience, or users. In software design, marketing planning, and advertising development, there are continuing efforts for identifying and assessing segments of people (i.e., customers, audience, or markets) to optimize some performance metric (e.g., the speed of task, buying preferences, or ease of use).

Major online social media platforms used for distributing content and other products present unique challenges for customer segmentation efforts attempting to rely on online customer data. The customer segmentation approach relies on identifying key attributes from which one can separate customers into segments (Cooil et al. 2008). Targeting customers via behavioral segmentation involves dividing the customer base based on their collective behavior. A behavior can be a single attribute (e.g., viewing online content) or a set of behaviors (e.g., viewing online content, length of video watched, etc.), but it is typically focused on the way the segment responds to, uses, or engages with a product. Targeting customers via demographic segmentation involves segmenting the customers based on one or more differentiating characteristics, which often include, but are not limited to, gender, age, race, location, education, income, or career. Most prior work in customer segmentation has focused on using individual website data, such as that available from Google Analytics; however, there is an increasing effort to employ customer segmentation using social media data from the major online platforms. This data presents unique challenges, as it is typically aggregated to preserve the privacy of individuals, so methods must be employed to deal with the issues this aggregation causes in inferring customer attributes.

Our aim is to automatically create customer segments using real customer data aggregated by online platforms, meaning it is grouped according to customer attributes (e.g., Male, 25–34, South Korea) (Jung et al. 2017). This grouping is increasingly common because privacy concerns prompt online platforms to provide only aggregated customer data, rather than session- or customer-level data. Aggregation complicates the customer segmentation generation process. In the research reported here, we investigate using aggregated social media data for isolating customer segments based on both the behaviors and the demographics of those customers and then linking the two customer segment groupings for a complete representation of the customer base. Therefore, we propose, develop, implement, and evaluate an approach for mining privacy-preserving aggregated statistics of the customer base, which we can use to generate data-driven customer segments that are easily interpretable by web information analysts. To demonstrate the impact and applicability of this customer segmentation research, we leverage that information to develop a system to automatically generate personas for the identified customer segments.

One advantage of this approach is that it can easily be adapted for a diverse range of organizations that create content for major online platforms since such aggregated customer statistics are the de facto standard provided by analytics interfaces of online platforms, such as Facebook Insights and YouTube Analytics. Our research differs from prior work in that earlier work in identifying customer segments (Jansen et al. 2011) or for the internal application of a private company (Zhang et al. 2016) had individual-level data. However, due to privacy and other concerns, online data from the major platforms is typically not individualized. Instead, the data has been already aggregated, typically along coarse attributes, such as gender, complicating the customer segmentation process.

2 Literature review

2.1 Customer segmentation

The concept of customer segmentation is attributed to Smith (1956), who advocated employing market segmentation along with the product segmentation that was common at the time. Since then, customer segmentation has been an ongoing research area (Bonoma and Shapiro 1984), particularly as the availability of online data has greatly increased. Overall, customer segments arise from attributes that unify customers to form groups or separate some customers from others (Jenkinson 1994). There have been a variety of data and methods employed to create customer segments (Firat and Shultz 1997; Marcus 1998; Shapiro and Bonoma 1984). Given the availability of ample literature surveys in the area (Beane and Ennis 1987; Chéron and Kleinschmidt 1985; Foedermayr and Diamantopoulos 2008), we do not present a comprehensive review here but provide insights into the activity and variety of research in the customer segmentation area.

Website data has been used to segment customers into various revenue groupings (Ortiz-Cordova and Jansen 2012); this is an example of behavioral segmentation. Search query data has been used to classify the gender of searchers and then relate this demographic attribute to revenue generation (Jansen et al. 2013). Increasingly, customer segmentation processes are leveraging social media data for both behavioral and demographic grouping (Jansen et al. 2011) of customers. Tuna et al. (2016) examine the identification of segments from social media, specifically from customer attributes such as gender and age, among others. Dursun and Caber (2016) employ RFM (recency, frequency, monetary) analysis on data from a major hotel chain’s customer relationship management system with results showing eight customer segments with the majority of the customers as ‘Lost Customers,’ staying for shorter periods and spending less relative to other segments. RFM analysis is a marketing technique used to quantitatively determine which customers are the best ones by examining how recently a customer has purchased (recency), how often they purchase (frequency), and how much the customer spends (monetary). Antoniou (2017) uses segmentation from online platforms in cultural heritage applications by extracting user personality and cognitive style profiles. Concerning the specific use of social media data, Kamboj, Kumar, and Rahman (2017) find that social use, hedonic use, and cognitive use positively influence the financial and market performance of firms.

2.2 Personas

As the motivation for our customer segmentation approach, we use it to develop a working system for automatic persona generation. Introduced to the design domain by Cooper (2004) with follow-up refinement by Pruitt and Adlin (2006), a persona is a representation of an actual segment of customers presented as an imaginary person. The end product is a persona profile pertaining to the customer segment that the fictionalized person represents. Personas have expressed benefits beyond what numbers by themselves can provide for identifying customer segments (Pruitt and Grudin 2003). For several years, personas have been used in system development (Cooper 2004; Pruitt and Adlin 2005), product design (Goodwin and Cooper 2009; Smith 1956), and marketing (Revella 2015; Stern 1994), among many other fields and industry verticals. Personas are a continuance of efforts from a variety of domains for identifying, constructing, and assessing segments of people (i.e., customers, audience, etc.) to optimize some performance metrics (e.g., advertising engagement, the speed of a task, ease of use, the effectiveness of effort, sales, etc.). Personas are reportedly a part of design processes and industry workflows (Dharwada et al. 2007; Eriksson et al. 2013; Friess 2012; Judge et al. 2012; Nielsen and Hansen 2014) for both long- and short-term projects (Judge et al. 2012) with a reported positive return on investment (Drego and Dorsey 2010).

It is suggested that one develops personas from real data that is derived from actual people (Pruitt and Adlin 2006). Using actual customer data is crucial to making personas believable (Judge et al. 2012) and for designers to appropriately leverage personas. However, a recognized problem is that creating personas is not always a cheap or quick procedure, as the creation has historically involved ethnographic methods. Because they result from one-time data collection actions, the personas created can quickly become outdated without new rounds of data collection. Without real-time data, designers have no validation of whether the personas are representative of current customers (Chapman and Milham 2006). These restrictions are especially acute in the situation of creating digital content for distribution via major online platforms (e.g., Facebook, Twitter, YouTube, etc.). Research on converting actual online customer data into personas is limited, and work that actually creates personas from this data is sparse. In the marketing and advertising area, there is work on using large pools of online consumer data to segment markets (Clarke 2015); the work focused on overall design approaches but did not take the research to creating personas. For example, Jansen et al. (2011) used the data from nearly 35,000 customers on a social media platform to cluster customers based on how they share commercial information. However, the researchers did not use these results to generate personas, stopping at the segmentation level, although they did assign descriptive names to each segment. In another work, Zhang et al. (2016) analyzed customer-level clickstreams to identify ten common workflows using hierarchical clustering. They then presented five customer facets based on the probability of platform use and gave each a name. While this work is close to our work, customer-level clickstreams are often not available to online content creators, especially those relying on aggregated data from online platforms.

2.3 Synthesis of prior work

From a review of customer segmentation literature, research using customer-level segmentation data, especially to generate personas, is needed. With the potential customers in the millions or billions, traditional ethnography and related methods may not scale well and can be cost-prohibitive. While there have been some online data-driven approaches (Chiang et al. 2015; Zhang et al. 2016), they have used fine-grain customer-level data that is not often available and that potentially has privacy issues. Such approaches using individual-level data are not suited for most content creators who only see aggregated statistics via a platform’s analytic tools.

Therefore, there are many unanswered questions concerning using social media analytics for customer segmentation and whether the findings from this segmenting process can be put to practical use. Can one isolate customer segments based on behavioral interaction on social media platforms? Can online data deliver demographic insights for customer segmentation? Can the customer segments be identified in real time? Can the customer segments be frequently updated? These are the questions that motivate our research.

Thus, this investigation continues a stream of research in generating customer segments and personas (An et al. 2016a, b, 2017; Jansen et al. 2017a; Jung et al. 2017; Kwak et al. 2017) from publicly available social media data, such as from Facebook (Jansen et al. 2016; Zhang et al. 2016) or YouTube (Jansen et al. 2017b), in which an approach based on clustering was unsuccessful. Specifically, the research reported in this manuscript is an expansion of a four-page conference article (An et al. 2017). In this manuscript, we focus on the customer segmentation aspects of the research, expand the data sets employed in the analysis, increase the methods of evaluation, and showcase the application of our customer segmentation approach in the development of a system to automatically generate personas from large-scale, aggregated social media data. Personas generated from these methods can be used as-is or can be used in conjunction with data collected from more traditional persona creation methods, and they can be enriched further using qualitative methods (Salminen et al. 2017).

3 Research objectives

Our premise is that aggregated, privacy-preserving behavioral and demographic customer data concerning consumers of a product, service, system, or content can be collected from major online platforms and rapidly analyzed to identify customer segments that are usable for a variety of commercial purposes. From this premise, our goal is to develop a methodology to (a) mine large-scale, privacy-preserving aggregated online customer data from major online platforms; (b) use this online data to identify distinct and impactful customer segments; and (c) automatically generate personas with realistic descriptions and attributes that represent these key customer segments. We see several advantages to this approach, which can serve either as a standalone method for persona generation or in conjunction with conventional offline methods of persona creation. Therefore, our research objectives are:

  1. Recognize discrete customer segments based on behavioral interactions with online content posted on major online social media platforms.

  2. Identify the discrete demographic customer segments associated with each of these behavioral customer segments.

  3. Integrate the associated behavioral customer segments and the demographic customer segments.

  4. Demonstrate the practicality of this approach via a system to automatically generate personas representative of these customer segments.

For investigating these research objectives, we rely on non-negative matrix factorization (NMF), in which the data matrix V is approximated by the product of a basis matrix W and an encoding matrix H (detailed in Section 4.2). The basis and the encoding depend on which decomposition technique is employed. There are three matrix decomposition methods that are commonly used for the purpose presented here, namely principal component analysis (PCA), vector quantization (VQ), and non-negative matrix factorization (NMF) (Lee and Seung 1999). VQ, PCA, and NMF produce different decomposition outcomes because they place different constraints on W and H. With VQ, each column in H has to be a unary vector. Therefore, only one entry in a given column in H has a non-zero value, and all others have to be zero (Gray 1984). This one-entry constraint makes H too simplified to explain meaningful behavioral patterns by a combination of content interactions. Consequently, VQ is not appropriate for our purpose. With PCA, the rows in H have to be orthogonal, and columns in W have to be orthonormal (Jolliffe 2002). PCA models an entry in V as a linear combination of the corresponding row and column in W and H, respectively. However, with PCA, entries in W and H can be either positive or negative. These positive and negative coefficients result in complex cancellations, making the results difficult to interpret, so PCA is inappropriate for our purpose. In contrast, NMF does not allow negative entries in W and H. As no subtraction caused by negative coefficients is permissible, a linear combination is purely an additive combination of bases. This non-negativity restriction makes interpretation of the matrix decomposition straightforward; therefore, we choose NMF to extract shared content consumption patterns from the aggregated customer interaction statistics.
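As a compact restatement of the three sets of constraints discussed in this paragraph, one can write them side by side. The index conventions follow Eq. 1 (i for customer segments, k for latent patterns, j for content items). The exact notation is an assumption made here for illustration and is not taken from the cited works.

$$\begin{aligned}
\text{VQ:}\quad & H_{kj} \in \{0,1\},\quad \sum_{k=1}^{p} H_{kj} = 1 \ \text{ for every column } j,\\
\text{PCA:}\quad & \mathbf{W}^{\top}\mathbf{W} = \mathbf{I},\quad \mathbf{H}\mathbf{H}^{\top}\ \text{diagonal},\\
\text{NMF:}\quad & W_{ik} \ge 0,\quad H_{kj} \ge 0 .
\end{aligned}$$

Only the NMF constraints guarantee that every term in the sum over k is non-negative, which is why no cancellation between patterns can occur.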

That being said, this research is novel in several respects. It is one of the first research efforts using online aggregated social media data at scale for customer segmentation. The data from such platforms is aggregated, unlike the data from prior work in identifying segments using individual-level data. Due to privacy and other concerns, customer data from the major online social media platforms is not individualized, as it is aggregated along typically coarse attributes such as gender, which complicates generating customer segments. Therefore, one must develop techniques to decompose this aggregated data for customer segment generation while still respecting data privacy. Our method is flexible in terms of the number of possible customer segments generated. Typically, customer segmentation and persona creation focus on a small number of segments or personas. While appropriate in certain environments, it may not be appropriate for organizations distributing products via major online platforms with millions or billions of worldwide customers. As our data sets are typically large, in the tens of millions if not more, we can validate our customer segments using quantitative methods. Also, our use of automatically generating personas from these customer segments is novel. Beyond our limited previous work (An et al. 2016a, b; Jansen et al. 2017b), we could locate no prior research pertinent to generating fully developed personas using aggregated data for those who distribute their products via major online platforms. Finally, the methodology presented here can be generalized to any organization that distributes content via major platforms. Therefore, the impact of this research is broad, and it is applicable to many domains.

4 Framework to identify customer segments

To develop customer segments from aggregated customer statistics, we first formulate the problem and clarify the setting, particularly the characteristics of the required dataset. Next, we apply non-negative matrix factorization (NMF) (Lee and Seung 1999) to identify separate behavioral segments and then to align these with customers segmented by demographics. The combination of the behavioral and demographic segments becomes the basis for the final integrated customer segments. NMF has been used in the prior work for customer segmentation (Shi et al. 2015b) but not for persona generation beyond our earlier work.

4.1 Problem formulation: general settings

Our approach begins with a single matrix that encodes customers’ interactions with the online products. We represent by V the g × c matrix of g customer segments (G1, G2, ..., Gg) and c pieces of content (C1, C2, ..., Cc). An individual element of the matrix V, Vij, is any value that represents the behavioral interaction of customer segment Gi with content Cj. In the case of YouTube Analytics, for example, Vij is the view count of a particular video, Cj, from customer segment Gi. In the case of Facebook Insights, for example, Vij is the total minutes that a particular video, Cj, is watched by a customer segment, Gi, defined by gender, age, and country, such as [Female, 25–34, Australia]. Besides these two examples, there are other options to show the interaction between customers and content pieces, such as likes, ratings, and subscriptions. Such options can be used if they provide a breakdown of statistics across demographic groups. However, we note that such detailed demographic breakdowns are not provided for these other interaction types by either YouTube Analytics or Facebook Insights.
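To make the shape of V concrete, the following minimal sketch builds the matrix from a handful of aggregated records. The records, the video identifiers, and the helper name `build_interaction_matrix` are hypothetical and chosen only for illustration; they are not taken from the AJ+ data or from any platform API.

```python
import numpy as np

# Hypothetical aggregated records: (gender, age group, country, video id, view count).
records = [
    ("Female", "25-34", "AU", "video_a", 120),
    ("Female", "25-34", "AU", "video_b", 30),
    ("Male", "18-24", "IN", "video_a", 45),
    ("Male", "18-24", "IN", "video_c", 300),
    ("Male", "25-34", "KR", "video_b", 80),
]


def build_interaction_matrix(records):
    """Return (V, segments, contents); V[i, j] is the interaction of segment i with content j."""
    segments = sorted({(g, a, c) for g, a, c, _, _ in records})
    contents = sorted({video for _, _, _, video, _ in records})
    row = {seg: i for i, seg in enumerate(segments)}
    col = {video: j for j, video in enumerate(contents)}
    V = np.zeros((len(segments), len(contents)))
    for g, a, c, video, views in records:
        # Accumulate in case a (segment, content) pair appears more than once.
        V[row[(g, a, c)], col[video]] += views
    return V, segments, contents


V, segments, contents = build_interaction_matrix(records)
print(V.shape)  # (number of segments, number of content items)
```

Pairs that never occur in the records stay at zero, which keeps V non-negative as NMF requires.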

A customer segment (Gi) interacts with the set of digital products (C1, C2, ..., Cc), so a customer segment is defined as a set of touch points with the digital content collection. With this matrix (V) as the basis, we can identify the distinct customer behavior patterns, each of which can be a vector of any set of customer touch points. Once we have the matrix V, we discover the significant latent patterns by decomposing it; these patterns become the basis of the personas, explaining each persona’s preference toward content in the next step.

Regarding the generalizability of our method, we do not have hard constraints. This method is generalizable and applicable across (1) data of diverse granularity and (2) any content category. For example, a customer segment, Gi, can be an individual customer if the data is available at that granularity and if there is no privacy concern. This implies that the research method is generalizable to customer-level data, as well as to the aggregate-level data that we use here. We use only one matrix that represents attention to each content item. This type of matrix can be easily accessed through many current social media analytic tools. YouTube Analytics and Facebook Insights provide statistics of attention (e.g., view counts) from a certain customer segment, defined by age, gender, or country, for each video and post. Also, beyond social media analytic tools, our approach can be applied to any domain where the matrix V can be defined. As an example, if an online store provides statistics concerning which customer segments purchase which products, Vij can be defined as the number of purchases from a particular customer segment i for a particular product j. Our method can find personas of that retail store without any modification of the core algorithm that we present here; therefore, this algorithmic method is generalizable.

4.2 Non-negative matrix factorization to identify behavioral customer segments

Moving to our first research objective (recognize discrete customer segments based on behavioral interactions with online content posted on major online social media platforms), we use the social media data to identify discrete customer segments based on different behaviors. This process is quite challenging, as the customer statistics from most platforms are aggregated to preserve the privacy of the customers. Therefore, to isolate customer behavior patterns, the data must be disaggregated. For segmentation, we first experimented with k-means clustering (An et al. 2016a), but it was found ineffective because clustering, by definition, cannot break a given demographic segment into hidden behavioral segments. Thus, we turned to matrix decomposition techniques, specifically NMF, as outlined in (Jung et al. 2017), an approach used in other domains (Xu 2018). We conceptually present this matrix decomposition approach here.

Once we have the matrix V, as outlined above, the next step is to discover the underlying latent factors or the product behavioral patterns that become the basis of the customer segments. The matrix decomposition is presented graphically in Fig. 1.

Fig. 1 Outline of matrix decomposition for identifying distinct behavioral segments and then impactful demographic segments

As shown in Fig. 1, V is our g × c matrix of g customer segments (G1, G2, ..., Gg) and c contents (C1, C2, ..., Cc). When V is decomposed, W is a g × p matrix; H is a p × c matrix, and ε is an error term. In this case, p is the number of latent factors (behavioral patterns) that we can choose, which can control the resolution of the customer behavior patterns discovered. When we choose more latent factors, we get more fine-grained customer behavior patterns. The column in W is a basis for the segment, and the row in H is an encoding that consists of coefficients that combine with each basis and represent a linear combination of the bases. The resulting matrix decomposition equation is

$$\mathbf{V} = \mathbf{W}\mathbf{H} + \boldsymbol{\varepsilon} \quad \text{or} \quad V_{ij} = \sum_{k=1}^{p} W_{ik} H_{kj} + \varepsilon_{ij}. \tag{1}$$

In NMF, each row in H represents one of the common content consumption patterns. The coefficient, Hij, shows the importance of content, Cj, in explaining the content consumption pattern, Pi (i.e., a distinct customer interaction pattern). As mentioned, H shows a set of distinct content consumption patterns, each represented by a linear combination of customer interactions with content.
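The decomposition in Eq. 1 can be sketched with a few lines of NumPy. The source does not state which NMF solver was used, so the multiplicative-update rule for the squared-error objective shown here is an assumption, as are the function name `nmf`, the iteration count, and the toy matrix.

```python
import numpy as np


def nmf(V, p, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative g x c matrix V into W (g x p) and H (p x c).

    Uses multiplicative updates for the squared-error objective
    (an assumed solver; the source does not specify one).
    """
    rng = np.random.default_rng(seed)
    g, c = V.shape
    W = rng.random((g, p)) + eps
    H = rng.random((p, c)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy matrix: 4 hypothetical segments x 4 hypothetical videos.
V = np.array([
    [100.0, 90.0, 0.0, 0.0],
    [50.0, 45.0, 0.0, 0.0],
    [0.0, 0.0, 80.0, 40.0],
    [100.0, 90.0, 80.0, 40.0],
])
W, H = nmf(V, p=2)
residual = np.linalg.norm(V - W @ H)  # Frobenius norm of the error term
print(np.round(W @ H, 1))
```

Increasing p yields more fine-grained patterns at the cost of more factors to interpret. In practice one would typically compare the residual across several values of p before settling on one.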

4.3 Identification of representative demographics of behavioral customer segments

Moving to our second research objective (identify the discrete demographic customer segments associated with each of these behavioral customer segments), we find the most impactful customer demographic segments associated with the previously defined behavioral customer segments. For identifying the demographic customer segments, we take a two-step approach: (a) finding a set of representative demographic segments for each behavioral segment and (b) identifying the representative or most impactful demographics from this set. After decomposing the matrix V, we have the matrix H (containing the customer behaviors) and another matrix, W (containing the demographic groups).

First, we focus on W in Fig. 1. Each row in W corresponds to a customer segment and shows how that segment can be characterized by different consumption patterns. The coefficient, Wij, is the relative proportion of a consumption pattern, Pj, in a customer segment, Gi. A column in W shows how a distinct consumption pattern is associated with different customer segments. A single behavioral segment is likely to have multiple associated customer demographic segments and vice versa. Thus, for each column, the customer segment with the largest coefficient or weight can be interpreted as the most impactful customer demographic segment for that corresponding pattern.
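Selecting the segment with the largest coefficient in each column of W amounts to a column-wise argmax. The sketch below assumes a small hand-written W and hypothetical segment labels; the helper name `most_impactful_segments` is likewise illustrative.

```python
import numpy as np

# Hypothetical segment labels, one per row of W.
segments = [
    ("Female", "25-34", "AU"),
    ("Male", "18-24", "IN"),
    ("Male", "25-34", "KR"),
]

# Hypothetical W: rows are customer segments, columns are consumption patterns.
W = np.array([
    [0.9, 0.1],
    [0.2, 0.7],
    [0.4, 0.3],
])


def most_impactful_segments(W, segments):
    """For each pattern (column of W), return the segment with the largest coefficient."""
    top_rows = np.argmax(W, axis=0)  # one row index per column
    return [segments[i] for i in top_rows]


representatives = most_impactful_segments(W, segments)
for pattern_index, segment in enumerate(representatives):
    print(pattern_index, segment)
```

If several segments have nearly equal weights in a column, taking only the single largest discards information, so one might instead keep the top few rows per column.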

4.4 Integration of behavioral and demographics segments

For research objective three (integrate the associated behavioral customer segments and the demographic customer segments), we take a two-step approach of (a) finding a representative customer behavioral segment, as outlined above, and then (b) identifying the representative demographics of this group. Determining the demographics of the representative customer segments depends on how the customer segments are defined in V; the most efficient way is to use the data broken down by demographics when building V. For instance, if V has a row mapping into a group defined as [age group, gender, country], then it is trivial to find a segment’s representative demographics. Social media analytics tools often provide customer statistics in a format that we can leverage both for demographic customer segments and for persona profile descriptive snippets (i.e., short textual phrases describing a persona). Once we have identified, for a given column in W, the row with the largest coefficient, we select the demographic grouping of that row from our data set. For example, YouTube provides a demographic classification of 2 genders × 7 age groupings × 249 countries (3486 possible demographic groupings) per video, which is the ceiling of possible demographic segments that can be addressed.

5 Data collection

To develop and implement our approach, we leverage customer data from AJ+, an online news channel from Al Jazeera Media Network. In the highly competitive online news industry, understanding customers is notably important to both increase the consumption of digital content and to get relevant and noteworthy information to the readers that may be impacted by the news events. Two common goals for many organizations are to increase digital content consumption and to enhance the facilitation of digital content interaction with customers. Specifically, with the news industry, prior studies point out considerable differences between production and consumption patterns (Abbar et al. 2015), as the online news industry is competitive and fluid (Abbar et al. 2015; Kwak and An 2014). Therefore, in the news area, as with many other verticals, a proper understanding of customers is critically important, and this is an issue that online customer data can address (Mao and Zhang 2015; Shuradze and Wagner 2016).

We focus on the AJ+ YouTube channel (see Footnote 1) as the aggregated customer statistics data source, which we use as a proof of concept for our customer segmentation research, reserving analysis of Twitter, Facebook, and other social media platforms for future work. However, the technique is generalizable to any online platform that provides aggregate customer statistics. The primary reason to focus on YouTube is that the analytics interface gives detailed statistics for every video. We do not lose generalization in showing proof of concept using a single media account because we use the data offered by YouTube, which has a universal format for all individual YouTube channels. In other words, our approach does not use any AJ+ dependent features; instead, we use the representative data that all YouTube accounts have. Also, the approach is transferable to other platforms, such as Facebook, that have identical or similar data variables (i.e., age group, gender, and location).

AJ+ is a natively digital content platform, meaning that it was designed from the ground up to serve news in the viewer’s medium with no redirect to a website. AJ+ is based mainly on social media platforms. Therefore, digital content is specifically designed to be viewed on the Facebook Newsfeed, YouTube Channel, or Twitter Timeline depending on the readers who are most active on those platforms. As an example of an AJ+ YouTube video, see Fig. 2, noting specifically the number of views.

Fig. 2 Example of YouTube video from the AJ+ YouTube channel, with number of views

For the owner of the channel, the YouTube API provides analytics data for each video product and various customer profile data (e.g., gender, age, country location, and which site the customer comes from), although at an aggregate level. Therefore, individual customer data is not provided. Via the YouTube API, we collect the detailed record of product views by country, gender, and age group for each AJ+ video. In the research presented here, we focus on view counts due to their high volumes. A customer group is defined by gender, age, and country, such as [Male, 18–24, India]. We note that this detailed data breakdown is accessible with the YouTube channel owner’s (i.e., AJ+) permission. In summary, we collect data from 4320 video products produced from June 13, 2014 to July 27, 2016. Collectively, these videos have more than 30 million views from customers in nearly 200 countries at the time of the study. We use these customer and video attributes to explore whether meaningful customer segments can be identified based on video content interaction and on the related demographics provided by YouTube. One can access the data in the YouTube analytics interface via the YouTube APIs (see Footnote 2). The parameters we use for this research are listed below. There are various video KPI metrics; however, we focus only on viewCount (the number of views) in this research.

  • Customer attributes

      • ageGroup: YouTube viewers are classified into seven age categories (13–17 years, 18–24 years, 25–34 years, 35–44 years, 45–54 years, 55–64 years, and 65 years and older).

      • gender: YouTube viewers are classified as either male or female, so there are two possible categories for a customer.

      • country: YouTube uses the two-letter ISO 3166-1 country code index to classify where viewers are from, with 249 officially assigned country codes at the time of this study.

  • Video attributes

      • viewCount: YouTube provides the number of views per video.

6 Results

Given our dataset, we define a customer segment as a unique combination of (country, gender, ageGroup). With two gender groups, seven age groups, and 249 countries, we have an upper limit of 3486 customer segments (i.e., 2 × 7 × 249). In actuality, our data set has 2214 customer segments, as the data has customers from 190 unique countries; we exclude as non-impactful those countries in which the total view count across the 4320 videos is less than 1000 and those countries for which not all age groupings are represented.
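The upper limit quoted above can be checked by enumerating every (country, gender, ageGroup) combination. The following is a minimal sketch, not the authors' code: the country list is a placeholder of 249 dummy codes standing in for the ISO 3166-1 codes, and all variable and function names are hypothetical.

```python
from itertools import product

# Categories as described in the text; the labels are illustrative.
GENDERS = ["male", "female"]
AGE_GROUPS = ["13-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
# Assumption: dummy codes standing in for the 249 ISO 3166-1 country codes.
COUNTRIES = [f"C{i:03d}" for i in range(249)]


def enumerate_segments(countries, genders, age_groups):
    """Return every (country, gender, ageGroup) combination as a tuple."""
    return list(product(countries, genders, age_groups))


segments = enumerate_segments(COUNTRIES, GENDERS, AGE_GROUPS)
upper_limit = len(segments)  # 249 * 2 * 7 possible customer segments
```

In practice, the observed number of segments is smaller than this bound because countries are filtered out as described above.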

6.1 Exploratory analysis of AJ+ YouTube data

We begin by presenting some of the overall statistics from the AJ+ YouTube channel data. Due to business concerns, we do not provide the exact absolute numbers, instead providing percentages only. The AJ+ customer population is worldwide with the top three countries, in terms of viewership, being Canada, Great Britain, and the United States (US), with each representing 2.44% of total viewership in terms of the number of unique videos watched. Regarding the total number of views, the US is the largest customer market segment with about 49.4% of video views. Although AJ+ was designed to target the US market, it is interesting to note that most viewers come from outside the US, making it challenging to have a comprehensive understanding of the customer base.

Concerning customers’ gender and age distribution, 20.9% of viewers were female and 79.1% were male. YouTube viewers are classified into seven age categories (13–17 years, 18–24 years, 25–34 years, 35–44 years, 45–54 years, 55–64 years, and 65 years and older). As AJ+ is designed to target the young generation through the platforms it adopts, and as our data comes from the YouTube platform, it is logical that young adult males are the biggest segment.

For behaviors, some videos show a worldwide appeal, with 100 videos being viewed in 100 or more countries. Conversely, in the dataset, there were about 100 videos that were viewed by customers from five or fewer countries. In terms of the actual number of views, the viewership counts per individual videos follow a power law distribution with a small number of videos being viewed a lot and a large number of videos being viewed a small number of times. This finding is not surprising, as such skewed popularity of videos is one of the well-known characteristics of viewing behavior on YouTube (Cha et al. 2007).

6.2 Research objective one results—identification of customer behavior segments

To begin the decomposition, we first develop a matrix representing customers’ interaction with the online content products. The matrix’s columns are the online products, in this case, the AJ+ videos [e.g., c contents (C1, C2, ..., Cc)]. The matrix’s rows are the customer segments or customer demographic segments [e.g., g customer segments (G1, G2, ..., Gg)]. Therefore, the association between customer segments and contents is described by V, the g × c matrix of g customer segments (or customer demographic segments) and c contents. The element Vij of the matrix V is any statistic that represents the interaction, or set of interactions, of the customer group Gi with content Cj. In the research presented here, the customer interaction element is viewCount. Using this matrix as the basis, we can decompose (i.e., separate into simpler components) the overall matrix V into two matrices, W and H, such that V ≈ WH. The matrix W encodes an association between customer segments and behavioral customer segments (i.e., latent content interaction patterns), and the matrix H encodes an association between behavioral customer segments and pieces of content. The resolution in finding customer segments can be adjusted through the number of columns in W (equivalently, the number of rows in H). To sum up, once we have the matrix H, we discover the underlying latent patterns, which describe the customer interaction with content and which become the basis of the customer demographic segments in the next step.
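To make the decomposition concrete, the sketch below factors a small toy view-count matrix V into non-negative W and H. It is an illustration under stated assumptions, not the authors' implementation: it uses plain-Python multiplicative updates (a standard NMF solver) rather than a library routine, and the toy data, the function names, and the defaults (`n_iter`, `seed`, `eps`) are all hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of V - WH, the reconstruction error."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5


def nmf(V, p, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative g x c matrix V into W (g x p) and H (p x c)."""
    rng = random.Random(seed)
    g, c = len(V), len(V[0])
    # Strictly positive random start so the multiplicative updates can move.
    W = [[rng.random() + 0.1 for _ in range(p)] for _ in range(g)]
    H = [[rng.random() + 0.1 for _ in range(c)] for _ in range(p)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


# Toy view counts: 4 customer segments (rows) x 5 videos (columns).
V = [
    [120, 100, 0, 5, 0],
    [110, 90, 5, 0, 0],
    [0, 5, 80, 70, 60],
    [5, 0, 75, 65, 55],
]
W, H = nmf(V, p=2)  # p = 2 behavioral segments
```

Each row of H can then be read as one latent viewing pattern over the videos, and each row of W as the mix of patterns for one customer segment.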

Although one can present as many behavioral segments as the data contains, the cognitive limits of the end users of the results pose a restriction; it is not useful to show them hundreds of customer segments. Because the number of segments in our work is strongly tied to user experience and use, computing the optimal number of segments for the matrix is not the best approach. Even if the optimal number were large, it would not suit persona creation because that many segments cannot be employed effectively in daily practice. For purposes of demonstrating the results in this manuscript, we present six customer behavioral segments in Table 1, although using NMF we can generate as many segments as desired. In fact, the number of segments is the only parameter that NMF requires.

Table 1 Results of NMF for matrix H showing six customer behavioral segments and associated weights for twenty of the videos

6.3 Research objective two results—identification of customer demographic segments

Moving to our second research objective, we identify the most impactful customer demographic segments associated with the previously defined behavioral customer segments. After decomposing the matrix V, we have the matrix H (containing the customer behaviors) and another matrix, W (containing the demographic groups). Each row in W represents how a customer demographic segment can be characterized by different consumption patterns. The columns in W show how a common consumption pattern is associated with different customer segments. A single behavioral segment can have multiple associated customer demographic segments. Thus, for each column, the customer group with the largest coefficient or weight can be interpreted as the most impactful customer demographic group for the corresponding pattern. Although one can present as many behavioral segments as the data contains, the cognitive limits of the end users of the system pose a limit; it is not useful to show them hundreds of customer segments. In terms of populating W, the YouTube Analytics interface provides demographic percentages (based on gender and age) of viewing for each video by country (as shown in Table 2). Using this data, we can populate the demographic attributes of W, which we then associate with H, which contains the customer behavioral segments.
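Reading the most impactful demographic groups off W amounts to ranking the entries of each column. The sketch below illustrates this with made-up weights; the group labels, the weights, and the function name are hypothetical and serve only to show the column-wise selection.

```python
def top_groups_per_segment(W, group_labels, top_n=1):
    """For each column of W (one behavioral segment), return the labels of
    the top_n demographic groups with the largest weights."""
    n_segments = len(W[0])
    result = []
    for j in range(n_segments):
        # Row indices sorted by their weight in column j, largest first.
        ranked = sorted(range(len(W)), key=lambda i: W[i][j], reverse=True)
        result.append([group_labels[i] for i in ranked[:top_n]])
    return result


# Hypothetical weights: 4 demographic groups (rows) x 2 behavioral segments.
labels = ["US-Male-25-34", "US-Male-18-24", "IN-Male-18-24", "GB-Female-25-34"]
W = [
    [0.9, 0.1],
    [0.7, 0.2],
    [0.1, 0.8],
    [0.0, 0.3],
]
representatives = top_groups_per_segment(W, labels, top_n=2)
```

With `top_n=1` this returns only the single most impactful group per behavioral segment; a larger `top_n` gives a ranked list per segment.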

Table 2 Results videos and viewing by demographics used for the demographic data in matrix W showing two 14 age and gender customer demographic segments per video by country and associated view counts for 20 of the videos

6.4 Research objective three results—integrating customer behavioral and demographic segments

Moving to our third research objective, we integrate the most impactful customer demographic segments associated with the previously defined behavioral customer segments. A single behavioral segment is likely to have multiple associated demographic customer segments. Thus, for each column, the customer group with the largest coefficient or weight can be interpreted as the most impactful customer demographic group for the corresponding pattern. Although one can present as many behavioral segments as the data contains, the cognitive limits of the system’s end users pose a restriction; it is not useful to show end users of analytics information hundreds of customer segments. Therefore, we constrain the number of segments shown to end users. Table 3 shows six demographic customer segments for each of the six behavioral customer segments.

Table 3 Six customer behavioral segments presented with the top six associated customer demographic segments

In Table 3, we present the customer behavioral segments identified via our matrix decomposition approach, which discovers distinct segments based on online content interaction (see Column 1). Then, for each behavioral pattern, Table 3 displays the top customer demographic segments associated with that behavioral segment (see the remaining columns). Our decomposition approach calculates a weight for each of these demographic segments, assigning a higher weight to the most impactful (i.e., largest) demographic customer segments.

The results in Table 3 also demonstrate how our approach using NMF is more effective than clustering methods such as K-means++ in finding meaningful customer behavioral segments with representative demographic groups. In our previous work (An et al. 2016a), we applied a clustering method (K-means++) to YouTube data to find a set of groups that share common video consumption patterns. When using K-means++, all 14 (7 age groups × 2 gender groups) US demographic groups were clustered as one group. The results by K-means++ were technically correct. However, such clustering results are not practical in actual use because they miss hidden behavioral segments within each demographic group. The US is the biggest customer segment in our dataset, and thus even within one demographic group (e.g., US-25-Male), there exist a few different video consumption patterns. Since our data is aggregated, clustering methods such as K-means++ cannot capture such behavioral differences. Unlike clustering methods, decomposition methods such as NMF can “decompose” the aggregated data, identifying subtle behavioral differences among the demographic groups. As a result, NMF presents three US demographic groups as representative behavioral segments, while K-means++ yields one US group. Considering that US customers are the majority in our data, having three groups is more reasonable than having one group.

In Table 3, there are at least two findings that are apparent upon analysis. First, there is a predominance of male segments. Second, there is a clustering effect by gender, and then age, and sometimes location. For example, in the first behavioral segment, we see that the top three demographic segments are all young males from the United States. This opens up an interesting research question of how granular one needs to get for each of these demographic segments, as there is apparently little behavioral difference when segregating this portion of the population into three different age brackets.

6.5 Research objective four results—integrating customer behavioral and demographic segments

We believe that there are many possible use cases for employing behavioral and demographic segments, both separately and in conjunction by linking the behavioral segments to demographic segments. Here, we present one possible use case, using the research results to automatically generate personas using social analytics data to first isolate customer segments, both behavioral and demographic. We have developed a system that automatically collects aggregated data and that decomposes it using the method outlined above. We then turn the customer segments into rich personas by adding personality attributes to each, as outlined in other work (An et al. 2017; Jung et al. 2017). The result of this is that we automatically generate personas based on actual customer data from online platforms, a significant evolution of persona creation research, which can be used standalone or in conjunction with other persona creation methods.

As shown in Fig. 3, with an example of six customer segments as the bases of the personas (fictive people based on real data), the system presents a demographically appropriate image, name, country, age, and gender for each persona. This demographic data is first derived from the social media data, i.e., gender, country, and age. Using this information, the system then accesses backend databases selecting gender, age, and country appropriate images and names.

Fig. 3 Screenshot of the Automated Persona Generation System generating six personas based on YouTube data. Note the images for each persona and the demographic information that appears on the cursor rollover of one of the images

The demographic information is displayed when the cursor hovers above a persona image. When one of the persona images is clicked, the corresponding persona description is displayed, as shown in Fig. 4.

Fig. 4 Screenshot of a persona description that is automatically generated from social media analytics and contains both behavioral and demographic customer segmentation data

In choosing the number of personas to generate (or display via the system), we give end users flexibility. The number should be much smaller than the total number of groups (|G|) and of contents (|C|) because the condition for NMF is |p| << min(|G|, |C|). In the case of AJ+, end users can choose any number of personas from 5 up to the order of 15. The cognitive load of hundreds of personas may make them unusable, as it is difficult to make sense of so many sub-groups of the audience; therefore, in practice, a smaller number is more reasonable, although in theory the method and resulting system can generate as many personas as desired and as indicated by the data.

6.6 Scalability of discovering behavioral segments in empirical settings

Prior to moving on to the quantitative evaluation of the customer segmentation methodology, we show the scalability of discovering behavioral segments in various empirical settings. Our approach to building personas can be divided into two parts: NMF and the refinement of behavioral segments by adding personality, such as a name, a photo, etc. The latter is a simple task of searching the database and thus can be done in O(k), where k is the number of segments. The former can have various time complexities depending on the implementation used. Here we use the Python scikit-learn library, which has O(gck) time complexity, where g is the number of groups, c is the number of contents, and k is the number of segments. Thus, we measure the elapsed time of only the NMF part while varying the size of the input matrix. Considering the distribution of the real traces, we sample the original AJ+ matrix at 10% intervals and measure the elapsed time of NMF with a commodity setup (a laptop with an Intel i7-3770 CPU and 12.0 GB of memory).

Figure 5 shows the elapsed time of running NMF with different sizes of the data and different numbers of the behavioral segments. We run the approach 500 times for each configuration and compute the average elapsed time for each. As shown in Fig. 5, the elapsed time for running NMF linearly grows with the size of the data and remains under minutes with a commodity laptop. Optimization and distributed computation might shorten the elapsed time more. Considering that our matrix is built from one entire news media outlet, our approach can be applied to other comparable social media accounts.

Fig. 5 Elapsed time (seconds) of running NMF with different sizes of the data and different numbers of the behavioral segments

6.7 Consistency in the resulting matrices of NMF

As NMF is an approximation algorithm, the resulting matrices may vary with parameters such as the initial values or the numerical solver. Also, the algorithm might not converge in specific situations (Lin 2007). To show the consistency of the resulting matrices in an empirical setting, we tested our algorithms with different settings and compared the results.

We ran our algorithms with four initializations provided by the Python scikit-learn library: (1) random, (2) Nonnegative Double Singular Value Decomposition (NNDSVD), (3) NNDSVD with zeros filled with the average of the original matrix, and (4) NNDSVD with zeros filled with small random values. For each initialization, we tested two numerical solvers: the Coordinate Descent solver and the Multiplicative Update solver. As a result, we have 4 × 2 = 8 different configurations of initializations and numerical solvers.

For each configuration, we ran NMF 500 times and extracted representative groups for k (the number of personas) equal to 5 and 10. We report the results from two perspectives. The first is consistency within a given configuration: across the 500 runs, we obtain the same representative groups every time; not a single run yields different representative groups. The second is consistency across configurations. When k = 5, three representative groups appear in all 4000 runs (500 runs × 8 configurations). The remaining two representative groups differ across configurations; however, their variation is quite limited: four different groups appear 3500, 1500, 2000, and 1000 times. When k = 10, five representative groups appear 4000 times, and for the remaining spots, seven different groups appear 3500, 3500, 3500, 3000, 2000, 500, and 500 times. There are, of course, methodologies for picking the best configuration by minimizing the difference, measured by the Frobenius norm, between the product of W and H and the original matrix V. Nevertheless, most of the results are quite stable and converge with the empirical data.
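The consistency counts reported above come down to tallying how often each representative group appears across runs. The sketch below shows one way such a tally could be computed; the run outcomes and group labels are hypothetical, and the function names are introduced here purely for illustration.

```python
from collections import Counter


def count_representatives(runs):
    """Count in how many runs each representative group label appears.

    `runs` is an iterable with one collection of group labels per NMF run.
    """
    counts = Counter()
    for groups in runs:
        counts.update(set(groups))  # count each label at most once per run
    return counts


def stable_groups(runs):
    """Return the sorted labels that appear in every single run."""
    runs = list(runs)
    counts = count_representatives(runs)
    return sorted(label for label, n in counts.items() if n == len(runs))


# Hypothetical outcome of 4 runs with k = 3 representative groups each.
runs = [
    ["US-M-25", "US-M-18", "IN-M-18"],
    ["US-M-25", "US-M-18", "GB-F-25"],
    ["US-M-25", "US-M-18", "IN-M-18"],
    ["US-M-25", "US-M-18", "IN-M-18"],
]
counts = count_representatives(runs)
always_present = stable_groups(runs)
```

Groups whose count equals the number of runs are the fully consistent ones; the others occupy the spots that vary across configurations.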

7 Quantitative evaluation of customer segmentation methodology

To quantitatively evaluate our approach to customer segmentation, we conduct two analyses using a ten-fold methodology on the data set. The two analyses are (a) predicting the most impactful customer demographic for a given customer behavioral segment and (b) predicting new video views by demographic.

7.1 Predicting customer segment interest in new content and number of video views

One of the benefits of using NMF for generating customer segments and personas is a clear association, represented in H (p × c), between the customer segments’ interest and non-interest in specific digital content. Beginning with this association, we can identify content, Hn, that a given customer segment might be interested in even before content publication.

For the problem of predicting interest in new content, the most intuitive solution is to find similar content that has already been published relative to the new content and assume that the level of interest in similar content will remain the same by a given customer segment. To compute the similarity of content in a robust way, we define content features. The features can be anything: topics, length, mood, color, price, and so on. Formally, we define a matrix, F (c contents × f features), capturing the features of the content. We then can derive another matrix, K (p personas × f features), that represents an association between a customer segment and content features:

$$K=k({\mathbf{H}},{\mathbf{F}}),$$
(2)

where k is a kernel function. Thus, we can rewrite Eq. (2) with some appropriate mapping function φ:

$$K=\varphi ({\mathbf{H}})\varphi ({\mathbf{F}}).$$
(3)

For computational simplicity, we assume φ = I. In other words, the interest in content is the sum of the interest in its features. Then, we can get a direct multiplication of two matrices:

$$K={\mathbf{HF}}.$$
(4)

By multiplying both sides on the right by \(\mathbf{F}_{\text{right}}^{-1}\), we get:

$${\mathbf{H}}={\mathbf{KF}}_{{{\text{right}}}}^{{ - 1}},$$
(5)

where \({\mathbf{FF}}_{{{\text{right}}}}^{{ - 1}}\) = I.

The representation of H in Eq. (5) guides us on how to predict Hn. For new content, we can define Fn that represents new content and their features. By substituting Fn into Eq. (5), we can get Hn:

$$\mathbf{H}_n=\mathbf{K}(\mathbf{F}_n)_{\text{right}}^{-1}.$$
(6)

\((\mathbf{F}_n)_{\text{right}}^{-1}\) can be computed by the following:

$$(\mathbf{F}_n)_{\text{right}}^{-1}=\mathbf{F}_n^{T}(\mathbf{F}_n\mathbf{F}_n^{T})^{-1}.$$
(7)

Equation (7) is valid when Fn has linearly independent rows (i.e., \(\mathbf{F}_n\mathbf{F}_n^{T}\) is invertible). If not, we split the set of new content products into several sets so that the Fn of each set has linearly independent rows. This procedure preserves the method’s generality.

By combining Eqs. (6) and (7), we write Eq. (8), representing the association between customer segments and new content:

$$\mathbf{H}_n=\mathbf{K}\mathbf{F}_n^{T}(\mathbf{F}_n\mathbf{F}_n^{T})^{-1}.$$
(8)

The key of Eq. (8) is that K, the matrix representing an association between customer segments and features, does not need to be changed for newer content because K depends on content features, not the content itself.

This is an application and an advantage of our customer segmenting methodology relative to other limited approaches that have been attempted for online data-driven customer profiling methods. In addition to providing segments of their customers, our approach also identifies the target customer segment for new content once its features are selected and measured. The content creators then have an opportunity to refine their content prior to its publication to more directly appeal to the customers they want to target.

By combining Eqs. (1) and (8), for new content, we get Vn:

$$\mathbf{V}_n\cong\mathbf{W}\mathbf{H}_n=\mathbf{W}\mathbf{K}\mathbf{F}_n^{T}\left(\mathbf{F}_n\mathbf{F}_n^{T}\right)^{-1}.$$
(9)

Similar to Eq. (8), Eq. (9) makes it possible to predict the views of new content by customer segment based on the content features of the new product.
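The chain from Eq. (4) to Eq. (9) can be traced on a toy example. The sketch below is a plain-Python illustration under stated assumptions, not the authors' implementation: the matrices are made up, the helper names are hypothetical, and the inverse is computed by a simple Gauss-Jordan routine that is adequate only for small, well-conditioned matrices.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [[float(x) for x in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def right_inverse(F):
    """Eq. (7): F^T (F F^T)^{-1}; requires linearly independent rows in F."""
    Ft = transpose(F)
    return matmul(Ft, inverse(matmul(F, Ft)))


def predict_views(W, H, F, F_new):
    """Return (H_n, V_n) for new content following Eqs. (4), (8), and (9)."""
    K = matmul(H, F)                          # Eq. (4): p x f
    H_new = matmul(K, right_inverse(F_new))   # Eq. (8): p x n
    V_new = matmul(W, H_new)                  # Eq. (9): g x n
    return H_new, V_new


# Toy data (assumptions): g = 3 groups, p = 2 segments, c = 2 videos, f = 3 features.
W = [[1, 0], [0, 1], [1, 1]]
H = [[1, 0], [0, 2]]
F = [[1, 0, 1], [0, 1, 1]]
F_new = [[1, 0, 1]]  # one new video described by the same three features
H_new, V_new = predict_views(W, H, F, F_new)
```

One sanity check on the construction: when the "new" content is the training content itself (F_new = F, with linearly independent rows), Eq. (8) returns the original H.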

7.2 Experimental setup for evaluation

Using this approach, we first define a training and a testing data set. We divide all 4320 videos, ordered by publishing date, into 10 slices. We use the 10th slice, which holds the latest 432 videos, as our testing set. For training, we use some of the remaining slices, chosen with regard to their recency and expressive power, given that the most recent videos represent the most current audience preferences. Although there is a general belief that more training data leads to better prediction performance in machine learning (Brownlee 2016), in our case, using more videos to train the model is not necessarily helpful because the customer base might change and evolve over time. In such cases, older data might not reflect the behavioral patterns of the current customers.

To better understand how the size or the recency of the training set affects the prediction performance, we iteratively run the experiment with the number of slices in the training data varying from one (the most recent slice only) to nine (all slices back to the oldest). For clarity’s sake, we use the percentage of the training data relative to the whole instead of the number of slices; N = 10% means the ninth slice only, and N = 30% means the 7th, 8th, and 9th slices. We get nine different training set sizes by changing N from 10 to 90% in steps of 10%. For each training set, we construct a matrix V and, by applying NMF, obtain the matrices W and H. Then, we build a Latent Dirichlet Allocation (LDA) (Blei et al. 2003) topic model for each training set to construct the matrices F and Fn. Once we construct these five matrices, we estimate the view counts of new videos for demographic groups, Vn (g groups × n videos), according to Eq. (9).
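The slicing scheme can be sketched as follows. This is a minimal illustration, assuming the videos are already sorted by publishing date and represented by integer IDs; the function names and the use of 4320 placeholder IDs are assumptions made for the example.

```python
def make_slices(videos, n_slices=10):
    """Split a chronologically ordered list into n_slices contiguous slices.

    Any remainder from uneven division goes into the last slice.
    """
    size = len(videos) // n_slices
    slices = [videos[i * size:(i + 1) * size] for i in range(n_slices - 1)]
    slices.append(videos[(n_slices - 1) * size:])
    return slices


def split_by_recency(videos, n_percent, n_slices=10):
    """Test set = last slice; training set = the most recent slices before it.

    n_percent = 10 selects the ninth slice only, 30 selects slices 7 to 9.
    """
    slices = make_slices(videos, n_slices)
    k = n_percent * n_slices // 100  # number of training slices
    if not 1 <= k <= n_slices - 1:
        raise ValueError("n_percent must select between 1 and n_slices - 1 slices")
    test = slices[-1]
    train = [video for s in slices[-1 - k:-1] for video in s]
    return train, test


videos = list(range(4320))  # placeholder IDs ordered by publishing date
train, test = split_by_recency(videos, n_percent=30)
```

Calling `split_by_recency` for N = 10, 20, ..., 90 yields the nine training sets, all evaluated against the same test slice.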

7.3 Measure of evaluation—Kendall’s coefficients

For each of the new videos in the 10th set, we rank the demographic groups based on the weight values in Vn. We compare this ranking with the true ranking of the groups, computed from the real view counts of that video, using the Kendall rank correlation coefficient. Since we have 432 test videos, we have 432 Kendall’s coefficients (τ) for each experiment run. For evaluation, we first use the mean of those 432 Kendall’s coefficients; a higher coefficient means a method performs better in ranking demographic groups for a video. Second, we present how many of the 432 test cases have statistically significant results. The higher the number of significant cases, the better the method performs in identifying the demographic groups with higher view counts for a given video.
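As an illustration of the ranking comparison, the sketch below computes Kendall's rank correlation between predicted weights and true view counts for one video. It is a simplified stand-in: it implements the tau-a variant (tied pairs count as neither concordant nor discordant) and does not compute the significance test; the numbers and the function name are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1  # both lists order the pair the same way
        elif sign < 0:
            discordant += 1  # the lists disagree on the pair's order
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values for one test video and four demographic groups.
predicted = [0.9, 0.5, 0.3, 0.1]   # one column of the predicted matrix V_n
actual = [1200, 300, 450, 20]      # true view counts of the same groups
tau = kendall_tau(predicted, actual)
```

For the significance test described above, a statistics library routine such as SciPy's `scipy.stats.kendalltau`, which also returns a p-value, would typically be used instead of this hand-rolled version.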

For comparison, we employ two baseline models: (1) a random model and (2) a collaborative filtering (CF) model. The random model ranks groups randomly for a new video. The CF model computes the average view counts of each demographic group and uses them for any new video, since a CF-based recommender system assigns the average behavior of customers to new content; this represents what one would consider a standard web analytics approach (Bowden 2009).

7.4 Results of evaluation

Figure 6 shows the result of the experiment: (a) the average Kendall’s τ of cases where the result is statistically significant (p < 0.05) and (b) the number of those significant cases in each experiment. The inferior performance of the random model indicates that the view counts from each segment for the videos are far from random. In fact, there are not many significant cases for the random model (see Fig. 6), and the average τ for the random model is near 0.0 for any size of the training set.

Fig. 6 The result of predicting the ranking of demographic groups by Random, CF, and our model. Y axis value is the average of Kendall’s rank correlation coefficient of our 432 test cases. X axis value is the percentage of the dataset used for training. Ten percent means we use the last 10% of the dataset before the test dataset

Figure 7 shows that our model outperforms the CF-based model in ranking the groups for new videos when N is 20–90%, and it shows comparable performance when N is 10%. These results demonstrate that our customer segment prediction performs very well at forecasting interest in new content by customer segment and that our approach outperforms, from 20 to 90%, the widely used CF approach, even with a set of limited features.

Fig. 7 The number of significant cases when testing Kendall’s rank correlation coefficients among 432 test cases by Random, CF, and our model. Y axis value is the number of cases where Kendall’s rank correlation coefficients are significant among 432 test cases

Also, for the CF-based model, the number of significant cases drops strikingly, from 412 to 18, when N is greater than 40% (Fig. 6). Our stability analysis shows that the channel encountered a sudden change in its customer base in that period. The CF-based model is not robust to such sudden changes, resulting in almost no significant cases when N > 40%. The average Kendall’s τ of the CF model would show a significant drop when N > 40% if we plotted the average τ of all cases. In contrast, our customer segmenting approach is robust enough to adapt to changes in the audience, as shown by the number of significant cases in Fig. 6, even when N varies from 10 to 90%.

8 Evaluation of distinctiveness of personas by varying the number of personas

In this section, we offer insight into the relationship between the number of personas and their distinctiveness. Two contradictory patterns can possibly emerge. On one hand, the personas can become more distinctive when the number of personas increases. As a higher number of personas are discovered, subtle differences are likely to be captured by different personas that otherwise would be subsumed into the same persona. On the other hand, the personas can overlap when the number of personas increases. This might happen when the number of actual customer segments is smaller than the number of personas. As an analogy, consider a set of red balls and blue balls: if we look for more than two segments, a segment mixing red and blue balls cannot be avoided.

We define distinctiveness of personas as the average distance between rows in H. More specifically, we consider each row in H as a c-dimensional vector and compute the average cosine distance between all the pairs of two vectors. The number of possible pairs here is pC2 = p(p−1)/2.
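The distinctiveness measure can be written down directly from this definition. The sketch below is a minimal plain-Python version; the toy matrix and the function names are hypothetical, and it assumes that no row of H is all zeros (the cosine distance is undefined for a zero vector).

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distinctiveness(H):
    """Average cosine distance over all p(p-1)/2 pairs of rows in H."""
    pairs = list(combinations(H, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy H with p = 3 personas over c = 3 contents.
H = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
]
score = distinctiveness(H)
```

A score of 1.0 means every pair of personas has completely non-overlapping content weights, while a score near 0.0 means the personas are nearly identical in direction.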

Figure 8 shows how the distinctiveness of personas changes as the number of personas varies. Interestingly, the emerging pattern is neither of the two possible patterns we mentioned above. Rather, it is a complicated combination of them. The distinctiveness first sharply decreases until four or five personas are found, and then it steadily increases and becomes almost stable.

Fig. 8 Distinctiveness of personas with varying the number of personas for YouTube dataset

This tendency highlights the difficulty in choosing the optimal number of personas in a real scenario. As online content is consumed by millions of customers and their grouped behavior is not as simple as a theoretical example, the relationship between the number of latent patterns and their distinctiveness is complex rather than a simple positive or negative correlation. Moreover, while more than four or five personas can capture fine-grained differences in customer behavior, a higher number also increases the cognitive load of handling this information at the same time. Thus, in a real scenario, the number of personas should be carefully chosen by those who actually use the personas or customer segments in their work routine.

9 Discussion and implications

The results of our research demonstrate that social media data from the major online platforms is quite robust in identifying both behavioral and demographic customer segments. We demonstrate several important results and implications concerning customer segmentation. Notably, the approach shows that one can use actual, real-time, aggregated online customer data at scale to identify meaningful customer segments and then automatically generate personas from these segments. As such, this research addresses a previously open question and advances both segmentation and persona research by presenting an approach to use data at scale and to continually leverage this data to keep personas updated. Using our method, we do not need to vigilantly survey customers, who may or may not be actual consumers of our content. We can leverage online social media data from actual consumers to extract customer segments based on genuine customer data. Additionally, our method can be used to supplement qualitative methods of persona creation.

The major strength of our approach is that it benefits from actual customer data, reducing the time and cost of generating both behavioral and demographic customer segments and providing a mechanism for linking these two types of customer groupings into coherently integrated segments. Also, in contrast to prior persona research, our research focuses on using data from major online platforms that is, in most cases, aggregated to some level. Prior work using online data as the basis of personas relied on individual customer data, to which many content creators do not have access. Our research using NMF demonstrates that one can use this aggregate data both to identify customer segments and to automatically generate rich personas. In this research, we also generate a relatively large number of personas. While prior work has recommended a small number of personas, this is not feasible or realistic for content, systems, or platform channels with millions of followers. This is a first step in focusing persona research on the needs of the designers and producers of online content. Our research focuses on the increasingly common situation of digital content creators that distribute their content to an extremely large, heterogeneous customer base via major online platforms, which are, or are becoming, the de facto technologies of distribution. In this situation, the role of the personas is to identify customer tastes and interests rather than to support more traditional system interactions. However, we believe the approach could be applied in those situations also.

Although the research presented here leverages YouTube data, the method is transferable to most platforms, as the data collected is in a similar aggregated format. Like YouTube Analytics, Facebook Insights provides content consumption statistics for each customer segment, defined by age, gender, and country, for each video and post. Unlike YouTube, Facebook Insights provides the total view time (total minutes of the video watched) of a video instead of the total number of views. We use two different API calls, “view_time_by_age_bucket_and_gender” and “view_time_by_region_id”, to calculate the view time of each video for each customer segment. Once we have this data, we create matrix V, and the subsequent steps are the same as for the YouTube data. We note that there is no extra cost, other than data collection, for our system to process datasets from two different online platforms. In fact, we have implemented the approach using Facebook data for the same organizations, with the resulting personas displayed in Fig. 9.
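To illustrate how the two aggregated breakdowns might be combined into a single group-by-video matrix V, the following is a minimal sketch and not the implementation used in this research. It assumes that each API response has already been parsed into a plain dictionary keyed by video, that keys such as "F.25-34" and "region_1" are placeholders for the actual bucket labels, and, most importantly, that view time within an age-gender bucket is distributed across regions in proportion to each region's overall share (an independence assumption that the text does not state). The function name build_view_time_matrix is hypothetical.

```python
def build_view_time_matrix(by_age_gender, by_region):
    """Return (groups, video_ids, V), where V[i][j] is the estimated
    view time of customer group i on video j.

    by_age_gender: {video_id: {age_gender_bucket: view_time}}  (assumed shape)
    by_region:     {video_id: {region_id: view_time}}          (assumed shape)
    """
    # Only videos present in both breakdowns can be combined.
    video_ids = sorted(set(by_age_gender) & set(by_region))

    # A customer group is an (age-gender bucket, region) pair.
    groups = sorted({
        (bucket, region)
        for vid in video_ids
        for bucket in by_age_gender[vid]
        for region in by_region[vid]
    })

    V = [[0.0] * len(video_ids) for _ in groups]
    for j, vid in enumerate(video_ids):
        bucket_times = by_age_gender[vid]
        region_times = by_region[vid]
        region_total = sum(region_times.values())
        if region_total == 0:
            continue  # no regional signal for this video; leave column at zero
        for i, (bucket, region) in enumerate(groups):
            # ASSUMPTION: each bucket's view time is split across regions
            # in proportion to the region's share of total view time.
            share = region_times.get(region, 0.0) / region_total
            V[i][j] = bucket_times.get(bucket, 0.0) * share
    return groups, video_ids, V


# Illustrative (made-up) view times in minutes for two videos.
by_age_gender = {
    "video_a": {"F.25-34": 600.0, "M.25-34": 400.0},
    "video_b": {"F.25-34": 100.0, "M.18-24": 300.0},
}
by_region = {
    "video_a": {"region_1": 750.0, "region_2": 250.0},
    "video_b": {"region_1": 200.0, "region_2": 200.0},
}

groups, video_ids, V = build_view_time_matrix(by_age_gender, by_region)
```

Under these assumptions, each column of V sums to the video's total view time across the age-gender buckets, and the resulting non-negative matrix can then be factorized in the same way as the matrix built from YouTube data.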

Fig. 9
Screenshot of the Automated Persona Generation System generating six personas based on Facebook data. Note the images for each persona and the demographic information that appears on the cursor rollover of one of the images

Yet, there are numerous research and development fronts that we are pursuing to enhance the impact of this research. Given the reliance on streams of social media data, we could certainly implement direct access to the foundational customer data for the content creators, as suggested by Faily and Flechais (2011), and also persona campaigns, as suggested by Judge et al. (2012), in which updates concerning the personas are continually sent to the content creators. This feature may be important, as it appears that designers like continued access to the actual customer data, aside from the persona itself, to aid them in their design decisions (Judge et al. 2012). In this work, we use NMF as the basis for our persona generation. To identify customer segments, we are also investigating other prominent and advanced approaches besides NMF, such as convergent NMF (Mirzal 2014), PCSNMF decomposition (Shi et al. 2015a), and rank-adaptive NMF (Shan et al. 2018). Also, we used LDA topics for characterizing videos based on video headlines. However, there are many other candidate features (Zarrinkalam et al. 2018). Carefully selecting an increased number of features could improve the performance of our model. Most importantly, as mentioned above, we will conduct an in-depth field evaluation of the system, such as that reported in Dittmar and Hensch (2015), with actual journalists, both producers and editors.

We consider this research a starting point for leveraging behavioral and demographic customer segments from social media analytics data for a vast number of other applications and services with minimal manual effort. If we can leverage additional rich information concerning a consumer, such as ethnicity, socio-economic status, and precise location, our approach and results would become even more useful. For future research, we are exploring these avenues. For example, it might be possible to extract demographic information using shared links on Facebook (An et al. 2016a), via Twitter, or via Google+ profiles. Links shared on Facebook could reveal information about the customers, such as socio-economic status, as the links reveal particular interests. Prior research has shown that affluent customers visit more high-end luxury product websites, while budget-conscious customers visit price aggregation or discount websites. Thus, the socio-economic status of a consumer can be distinguished by the websites they visit (Odlyzko 2003). Other features, such as psychographics, political orientation, and brand affiliations, could also be associated with the personas based on interest mapping.

10 Conclusion

In this research, we show that personas can be rapidly and automatically created from large-scale, aggregated customer data from major online platforms, resulting in personas that are based on behavioral data that reflect real people and that are created from data quantities sizeable enough to permit quantitative analysis. We evaluated our persona generation methodology, showing that our method generates actual and stable personas that are predictable. Although specifically focusing on digital content creators, our approach is flexible and resilient for application in a wide range of contexts where customer-centric data needs to be transformed into easy-to-understand representations for decision-making and customer insights.