This chapter will take the second Tencent Advertising Algorithm Competition in 2018 (as shown in Fig. 13.1) as an example to analyze a practical case of computational advertising, explaining the complete process and its precautions in detail. The chapter is divided into six parts: understanding the competition question, data exploration, feature engineering, model training, model integration, and a summary of the competition question. This is not only the organizational structure of all the case-study chapters in this book, but also the essential workflow of a competition. With the guidance of this book, you can quickly become familiar with the competition process and apply what you have learned in practice.

Fig. 13.1 2018 Tencent Advertising Algorithm Competition (ACM Multimedia 2018 Grand Challenge)

1 Understanding the Competition Question

As the saying goes, sharpening your axe will not delay your job of chopping wood. Before the competition, you should fully understand the information related to the competition question and the needs behind it, so as to examine the question correctly. This competition is based on Audience Lookalike Expansion, a computational advertising problem. The contestants are fortunate: the organizer of the contest runs the largest social platform in China, and both the quality of the data provided and the professionalism of the competition are impeccable. The content of this section is mostly taken from the official description of the competition question given by Tencent.

1.1 Competition Background

Advertising based on social relationships (that is, social advertising) has become one of the fastest-growing types of advertising in the Internet advertising industry. The Tencent social ad platform is a commercial advertising platform that relies on Tencent's rich social products, is rooted in Tencent's massive social data, and uses powerful data analytics, machine learning, and cloud computing capabilities to serve tens of millions of businesses and hundreds of millions of users. The platform has always been committed to providing accurate and efficient advertising solutions, and complex social scenarios, diverse advertising forms, and huge amounts of user data make this goal considerably challenging. To overcome these challenges, the Tencent social advertising platform is constantly searching for better data mining and machine learning algorithms.

The topic of this algorithm competition originates from a real advertising product in Tencent's social advertising business: Audience Lookalike Expansion (hereafter referred to as Lookalike). The purpose of this product is to start from the target group provided by advertisers and find other, similar groups among a large population, thereby achieving audience expansion, as shown in Fig. 13.2.

Fig. 13.2 Audience lookalike expansion (a seed population is mapped to a larger lookalike audience)

In an actual advertising business scenario, Lookalike starts from an advertiser's existing consumers and finds similar potential consumers among the candidate audience, effectively helping advertisers reach new customers and extend their business. At present, Lookalike is based on the first-party data provided by advertisers and the effect data of advertising (that is, the seed population mentioned later), combined with Tencent's rich data tags. Through deep neural network mining, it expands high-quality potential customers with similar characteristics for multiple advertisers simultaneously, online and in real time. The working mechanism of Lookalike is shown in Fig. 13.3.

Fig. 13.3 How Lookalike works (user features are extracted from seed users' profiles and used to predict an extended audience for ad delivery)

1.2 Competition Data

The (desensitized) data extracted for this competition covers 30 consecutive days. The data files can be divided into four parts: the training set file, the testing set file, the user feature file, and the advertisement feature file corresponding to the seed package. These four parts are introduced separately below.

  • Training set file train.csv: Each row in this file represents a training sample, and the fields are separated by commas. The format is "aid, uid, label", where aid uniquely identifies an advertisement, uid uniquely identifies a user, and label is the sample label with a value of +1 or −1; +1 marks a seed user and −1 a non-seed user. To simplify the question, a seed package corresponds to exactly one aid, so the two are in one-to-one correspondence.

  • Testing set file test.csv: Each line in this file represents a test sample, and the fields are separated by commas in the format "aid, uid". The two fields have the same meaning as in the training set file.

  • User feature file userFeature.data: Each line in this file represents one user's feature data, with fields separated by a vertical line "|"; the format is "uid | features". Here, features is composed of many feature groups, which are also separated by a vertical line "|", in the format "feature_group1 | feature_group2 | feature_group3 | ...". If a feature group consists of multiple values, they are separated by spaces, as in "feature_group_name fea_name1 fea_name2 ...", where each fea_name is given as an encrypted number.

  • The advertisement feature file adFeature.csv corresponding to the seed package: the format of each line in this file is "aid, advertiserId, campaignId, creativeId, creativeSize, adCategoryId, productId, productType". The first field aid uniquely identifies an advertisement, and the remaining fields are advertisement features, separated by commas. For data security reasons, uid, aid, user features, and advertisement features are encrypted as follows.

    • uid: Randomly renumber the uids from 1 to n, where n is the total number of users, to generate non-duplicated encrypted uids (assuming one million users, all users are randomly shuffled, and each user's position after shuffling becomes its uid; the value range of this sequence number is [1, one million]).

    • aid: Refer to uid’s encryption method to generate encrypted aid.

    • User features: refer to the encryption method of uid to generate encrypted fea_name.

    • Ad features: refer to the encryption method of uid to generate encrypted fields.

      Next, the values of user features and advertising features will be explained.

  • Description of value selection of user features

    • Age: segmented representation, where each number represents an age group;

    • Gender: male, female;

    • Marital status: single, married, etc. (multiple states can coexist);

    • Education background: doctor, master, undergraduate, high school, junior high school, primary school;

    • Consumption ability: high and low;

    • Geographical location (LBS): Each number represents a geographical location;

    • Interest category: Mining different data sources to obtain 5 interest feature groups, which are represented by interest 1, interest 2, interest 3, interest 4, and interest 5. Each interest feature group contains several interest IDs;

    • Keywords: Mine different data sources to obtain three keyword feature groups, represented by kw1, kw2, and kw3 respectively. Each keyword feature group contains several keywords the user is interested in; keywords characterize user preferences at a finer grain than the interest categories;

    • Topics: Use LDA algorithm to mine user preference topics. Specifically, mine different data sources to obtain 3 topic feature groups, which are represented by topic1, topic2, and topic3;

    • Recent app installation behavior (appIdInstall): including apps installed in the last 63 days, where each app is represented by a unique ID;

    • appIdAction: The ID of the app with a high user engagement rate;

    • Internet connection type (ct): Wi-Fi, 2G, 3G, 4G;

    • Operating system (os): Android, iOS (version number is not distinguished);

    • Mobile telecom operator (carrier): China Mobile, China Unicom, China Telecom, others;

    • Real estate (house): having real estate, not having real estate.

  • Description of value selection of advertisement features

    • Advertisement ID (aid): the advertisement ID corresponds to a specific advertisement. An advertisement refers to the advertising creatives (or advertising materials) created by advertisers together with the settings related to ad display, including the basic information of the advertisement (name, delivery time, etc.), promotion objectives, delivery platforms, advertising specifications, advertising creatives, advertising audience (that is, the ad's targeting settings), advertising bids, and other information;

    • Advertiser ID (advertiserId): the account structure is divided into four levels: account, promotion plan, advertisement, and material. Advertisers and accounts are in one-to-one correspondence;

    • Campaign ID (campaignId): a promotion plan is a collection of advertisements (similar to a folder on a computer). Advertisers can place advertisements that share conditions such as the promotion platform, budget limit, and whether to deliver at a constant speed into the same promotion plan for easier management;

    • Material ID (creativeId): advertising content that is directly displayed to users. There can be multiple groups of materials under one advertisement;

    • Material size (creativeSize): The material size ID is used to identify the size of the advertising material;

    • Ad category (adCategoryId): Ad classification ID, using the ad classification system;

    • Product ID (productId): the ID of the product promoted;

    • Product type (productType): the product type corresponding to the advertising target (for example, products for JD.com ads, downloads for App ads).

1.3 Competition Tasks

Based on the seed population (also known as the seed package), Lookalike automatically finds a similar audience, called the expanded audience, among the candidate population provided by the advertiser. This competition provides contestants with hundreds of seed populations, the user features of a large number of candidate users, and the advertisement features corresponding to each seed population. For business data security, all data have been desensitized. The entire data set is divided into a training set and a testing set. In the training set, users in the candidate group who belong to the seed population and those who do not are labeled respectively (i.e., positive and negative samples). The competition tests whether the contestant's algorithm can accurately determine whether the users in the testing set belong to the corresponding seed package. The seed populations corresponding to the training set and the testing set are exactly the same. Figure 13.4 shows the distribution of the population groups.

Fig. 13.4 Distribution of population groups (three concentric circles: base audience, expanded audience, and the innermost seed population of matched users)

In order to test whether the contestant's algorithm learns users and seed groups well, this competition requires participants to submit, for each seed group, scores indicating how likely each candidate user is to belong to that seed group (the higher the score, the more likely the candidate is a potential lookalike expansion user of the seed group). The seed groups provided in the preliminary and the semi-final are the same except for the order of magnitude.

1.4 Evaluation Indicators

If a relevant effect action (such as a click or a conversion) occurs after an ad is shown to an expanded similar user, it is considered a positive example; if no effect behavior occurs, it is considered negative. Each seed population to be evaluated provides the following information: the advertisement ID (aid) corresponding to the seed population, the ad features, and the corresponding candidate set (including each candidate user's uid and user features). Contestants need to compute the scores of the users in the testing set for each seed group; the competition then calculates the AUC for each seed population and uses the average AUC over all m evaluated seed groups as the final evaluation index; the formula is given in (13.1):

$$ \frac{1}{m}\sum \limits_{i=1}^m{\mathrm{AUC}}_i $$
(13.1)

where AUC_i represents the AUC value of the i-th seed population group.
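For clarity, the metric in (13.1) can be computed in a few lines. The following is a minimal sketch assuming a hypothetical DataFrame df with one row per (aid, uid) pair and columns label (truth) and score (prediction):

from sklearn.metrics import roc_auc_score

def mean_auc(df):
    # AUC is computed within each seed group (one aid per group), then averaged;
    # each group must contain both positive and negative labels
    aucs = df.groupby('aid').apply(
        lambda g: roc_auc_score(g['label'], g['score']))
    return aucs.mean()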

1.5 Competition FAQ

What is the essential task of this competition?

The basic task of the competition is to use previous ad impressions and user click records to match future ad pushes to the right users, so as to improve users' click-through rate on the pushed ads, thereby improving the conversion rate, bringing commercial value to advertisers, and earning advertising fees.

There are several evaluation indicators in Internet advertising. What is the correlation between these indicators?

The first indicator is exposure, the number of times an advertisement is shown to users, that is, how many users have been pushed the ad; the second is the number of clicks after users see the advertisement, that is, how many users click through to the advertisement page; the third is conversion: if users see the advertisement and purchase the corresponding products, the number of such users is the conversion. Exposure, clicks, and conversion thus form an inverted pyramid, decreasing step by step. Of course, ad exposure does not only bring direct user conversion to advertisers; it is also an indirect form of marketing for advertisers' brands and popularity.

2 Data Exploration

This section will analyze and interpret the information and data provided by the competition to explore possible modeling ideas. Generally speaking, if memory allows, contestants can use Jupyter Notebook together with common third-party Python open-source packages such as pandas and NumPy to explore data. Different functions can be used according to the analysis needs; functions and attributes commonly used in pandas include read_csv(), head(), describe(), value_counts(), plot(), and shape.
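As a quick illustration, a first look at the training set might proceed as follows (a sketch; the file path is the one used later in this chapter):

import pandas as pd

train = pd.read_csv('data/preliminary_contest_data/train.csv')
print(train.shape)                     # number of samples and columns
print(train.head())                    # first few rows
print(train['label'].value_counts())   # label distribution
print(train.describe())                # basic statistics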

2.1 Public Data Sets for the Competition

Taking the data of the preliminary round as an example, the data set files provided are train.csv (training set), test1.csv (testing set), test1_truth.csv (testing set labels), adFeature.csv (basic attributes of advertisements), and userFeature.data (basic information of users).

2.2 Training Sets and Testing Sets

The training set and the testing set contain only the ID columns and the label column. For this part, the data set publicly released by Tencent also provides the real labels of the testing set. Participants need to recognize that this is a problem with two primary keys, matching users with advertisements. Therefore, you can examine the overlap of aid and uid between the training set and the testing set to gauge the difference between their distributions.

2.2.1 Distribution Differences

First of all, it can be confirmed that there are no missing values in the training set or the testing set, and that the proportion of positive samples in the training set is 4.8%, which is probably the result of some sampling, since it is hard for the click rate in actual business to reach this level. Then merge and count the unique values of aid and uid in the training set and testing set respectively, as shown in Table 13.1.

Table 13.1 Distribution of uid and aid

It can be seen that less than 18% of the uids in the testing set appear in the training set, while every aid in the testing set also appears in the training set. This is in line with business logic: while the same set of advertisements is maintained over a short period, click probabilities are predicted for users who have not yet been pushed the advertisement, based on the click effect of the existing delivery, thereby increasing the number of clicks and bringing commercial benefits.
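The overlap statistics behind Table 13.1 can be reproduced along the following lines (a sketch assuming train and test1 have already been loaded):

# share of unique testing set uids/aids that also appear in the training set
uid_overlap = test1['uid'].drop_duplicates().isin(train['uid']).mean()
aid_overlap = test1['aid'].drop_duplicates().isin(train['aid']).mean()
print('uid overlap: {:.2%}, aid overlap: {:.2%}'.format(uid_overlap, aid_overlap))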

After checking the value differences of each single primary key, it is necessary to confirm that the combination of the two primary keys is unique, that is, that each (aid, uid) combination appears at most once and therefore has exactly one definite label value. The code verification is as follows:

train_nunique = train[['uid', 'aid']].drop_duplicates().shape[0]
test1_nunique = test1[['uid', 'aid']].drop_duplicates().shape[0]
all_nunique = test1[['uid', 'aid']].append(
    train[['uid', 'aid']]).drop_duplicates().shape[0]
assert train_nunique == train.shape[0]
assert test1_nunique == test1.shape[0]
assert train_nunique + test1_nunique == all_nunique

Finally, one logical link is still missing from the above analysis: whether the distribution of advertisement IDs placed in the training set and the testing set is the same. The verification result is shown in Fig. 13.5.

Fig. 13.5 Distribution of advertisement IDs released in the training set and testing set (ratio versus aid; the train and test1 curves largely overlap)

As can be seen from Fig. 13.5, the distribution of advertisements in the training set and the testing set is basically the same. Therefore, the focus is on the degree of interest of different users in the same advertisement; in other words, participants need to characterize the user group of each advertisement and then, using the existing click data, discover more users who may be interested in that advertisement. This is the theme of this competition: Audience Lookalike Expansion.
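The comparison in Fig. 13.5 can be reproduced by normalizing the aid frequencies of the two sets, for example (a sketch assuming matplotlib is available):

import matplotlib.pyplot as plt

train_ratio = train['aid'].value_counts(normalize=True).sort_index()
test_ratio = test1['aid'].value_counts(normalize=True).sort_index()
plt.plot(train_ratio.index, train_ratio.values, label='train')
plt.plot(test_ratio.index, test_ratio.values, label='test1')
plt.xlabel('aid')
plt.ylabel('ratio')
plt.legend()
plt.show()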

2.3 Advertising Attributes

Use the pandas.DataFrame.head() method to display the basic data, as shown in Fig. 13.6. Although all the data have been desensitized, this does not prevent participants from understanding the meaning of each field. Section 13.1.2 listed detailed descriptions of the advertisement features; participants can inspect the data with the help of those descriptions.

Fig. 13.6 Advertising attributes data display (columns: aid, advertiserId, campaignId, creativeId, creativeSize, adCategoryId, productId, productType)

2.4 User Information

Since the user feature file is in .data format, which is not convenient for direct analysis and statistics, it is first converted into a .csv file. The specific code is as follows:

# Check whether the converted file already exists
if os.path.exists('data/preliminary_contest_data/userFeature.csv'):
    user_feature = pd.read_csv('data/preliminary_contest_data/userFeature.csv')
else:
    userFeature_data = []
    with open('data/preliminary_contest_data/userFeature.data', 'r') as f:
        for i, line in enumerate(f):
            # each line: "uid | feature_group1 | feature_group2 | ..."
            line = line.strip().split('|')
            userFeature_dict = {}
            for each in line:
                each_list = each.split(' ')
                userFeature_dict[each_list[0]] = ' '.join(each_list[1:])
            userFeature_data.append(userFeature_dict)
            if i % 1000000 == 0:
                print(i)
    user_feature = pd.DataFrame(userFeature_data)
    user_feature.to_csv('data/preliminary_contest_data/userFeature.csv',
                        index=False)

After converting the raw data source into pandas.DataFrame format, analysis becomes very convenient. Because there are too many fields, only some are shown in the screenshot in Fig. 13.7. Apart from the user ID uid, the other fields are all user attributes, which are divided into univariate attributes and multivariate attributes. age, gender, education, consumptionAbility, and LBS are univariate attributes with only one value per user, while marriageStatus, interest2, interest5, and kw2 are multivariate attributes where each user may have multiple values. The processing of multivariate attributes uses algorithms related to natural language processing, which will be explained in Sect. 13.3.

Fig. 13.7 Display of basic user features (columns: uid, age, gender, marriageStatus, education, consumptionAbility, LBS, interest2, interest5, kw2)

2.5 Feature Splicing of Data Sets

After becoming familiar with the training set, testing set, advertising attributes, and user information, participants can comprehend the relationship between these table files: the ID columns of the training set and testing set are used as the basis to join the advertising attributes and user information, forming a wide feature table with ID columns and labels in the conventional sense. The remaining features can be used directly for modeling, though multivariate attribute features may require additional processing.

Because the original data source is relatively large, some participants who are just getting started may not have enough computing resources readily available. Therefore, to help participants quickly understand and run a working demo, this book draws a 1% random sample of the training set and testing set, converting the big-data problem into a small-data problem so that data exploration, feature engineering, and model building can be carried out quickly. Once the scheme is settled, you can model on the full data if you have enough resources. Here is the code for random sampling and data splicing:

train = train.sample(frac=0.01, random_state=2020).reset_index(drop=True)
test1 = test1.sample(frac=0.01, random_state=2020).reset_index(drop=True)
test1['label'] = -2
# Extract user information for the sampled training and testing sets
user_feature = pd.merge(train.append(test1), user_feature,
                        how='left', on='uid').reset_index(drop=True)
# Splice advertising information
data = pd.merge(user_feature, ad, how='left', on='aid')
# Convert labels to distinguish the training set from the testing set
data['label'].replace(-1, 0, inplace=True)
data['label'].replace(-2, -1, inplace=True)

At the same time, to facilitate modeling, the −1 representing negative samples in the sample labels is replaced with 0, the testing set rows are marked with −1, and the real labels of the testing set are recorded separately, so that subsequent modeling can be verified and compared. The label distribution is shown in Fig. 13.8.

Fig. 13.8 Display of label distribution (approximate counts per label value: 20,000 for −1, 80,000 for 0, and 5,000 for 1)

Then distinguish the feature categories according to whether the features are univariate or multivariate. The distinguishing method is as follows:

cols = train.columns.tolist()
cols.sort()
se = train[cols].dtypes
# multivariate attributes
text_features = se[se == 'object'].index.tolist()
# univariate attributes
discrete_features = se[se != 'object'].index.tolist()
discrete_features.remove('aid')
discrete_features.remove('uid')
discrete_features.remove('label')

Finally, it can be concluded that the data set contains 16 multivariate features (text_features), namely appIdAction, appIdInstall, ct, interest1, interest2, interest3, interest4, interest5, kw1, kw2, kw3, marriageStatus, os, topic1, topic2, and topic3, and 14 univariate features (discrete_features), namely LBS, adCategoryId, advertiserId, age, campaignId, carrier, consumptionAbility, creativeId, creativeSize, education, gender, house, productId, and productType. Since different types of features are handled very differently, making this simple distinction makes later work more efficient.

2.6 Basic Modeling Ideas

Through simple data exploration and splicing of the table files, participants can see that the data structure is very clear. There are essentially two types of features: multivariate text features (text_features) and univariate discrete features (discrete_features). Therefore, this chapter will consider a novel modeling idea: introducing the CatBoost model, which directly supports text_features, for modeling.

3 Feature Engineering

This section performs feature extraction on the basis of the data exploration. The data of this competition is very representative: except for the ID columns and the label column, all other columns are feature columns, and all of them are discrete, including multivariate and univariate features. This data organization and the organization in Chap. 8 are two typical scenarios. The raw data in Chap. 8 consists of user behavior records, which need to be designed and extracted before modeling. This does not mean that the case in this chapter needs no feature design and extraction; rather, the feature engineering here differs somewhat from that in Chap. 8. This section takes the data of the 2018 Tencent Advertising Algorithm Competition as an example to illustrate another commonly used set of feature design and extraction schemes, in which classic features and business features extract information from univariate fields, while text features target multivariate fields.

3.1 Classic Features

Intuitively speaking, ordinary models (such as LR, RF, GBDT, etc.) cannot distinguish and process univariate discrete features directly during training. Therefore, such features need to be transformed into continuous columns whose magnitudes carry meaning, which the models can then learn from quantitatively. This section introduces the meanings and extraction methods of three common statistical features.

3.1.1 Count Feature

This is a simple counting feature, which measures the frequency of occurrence of each value of a univariate discrete field and indicates whether a certain attribute value of a sample belongs to the majority or a minority. The pandas.Series.value_counts() method is usually used for frequency statistics. The count coding feature corresponds to the CountVectorizer used for multivariate fields introduced in Sect. 13.3.3, and it is most effective for long-tailed value distributions. In this competition, the count coding feature is called the exposure feature, which can be computed on a single field or on a combination of fields. The following takes some univariate feature fields of this data set as examples and shows the raw input data and the output exposure features.

Figure 13.9 shows a portion of the input data:

Fig. 13.9 Training set raw input data (columns: uid, LBS, adCategoryId, advertiserId, age, campaignId)

Figure 13.10 shows a portion of the output data:

Fig. 13.10 Exposure features output (columns: exposure_LBS, exposure_adCategoryId, exposure_advertiserId, exposure_age, exposure_campaignId)

Take the exposure_age field as an example. The sample values of age are 1, 2, and 5, and the corresponding counts are 26,029, 25,245, and 26,179, respectively. The other fields have similar meanings. exposure_age_and_gender is the combined count of age and gender: a single field is needed to calculate a first-order exposure feature, while two fields are needed for a second-order exposure feature. In this way, values with no inherent ordering are mapped to counts, which intuitively show whether a user belongs to the majority or a minority, for example, in the age dimension. Participants will see that this reflects users' age differences to a certain extent. On this basis, you can even calculate third- and higher-order features, though this easily causes a dimension explosion. This is also the essence of the N-Gram algorithm for text feature extraction in natural language processing. A sketch of computing exposure features is given below.
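This is a minimal sketch on the merged DataFrame data from Sect. 13.2.5 (the column choices are illustrative):

# first-order exposure: frequency of each value of a single field
for col in ['LBS', 'adCategoryId', 'advertiserId', 'age', 'campaignId']:
    data['exposure_' + col] = data[col].map(data[col].value_counts())

# second-order exposure: frequency of a value combination
data['exposure_age_and_gender'] = (
    data.groupby(['age', 'gender'])['uid'].transform('count'))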

3.1.2 Nunique Feature

The second type is the nunique feature, the number of distinct attribute values obtained by intersecting two univariate fields. The two fields can have an inclusion relationship or be independent of each other. Usually, when the two fields have an inclusion relationship and the number of distinct values varies greatly between branches, the modeling effect is most noticeable. For example, if the user's geographic location attribute (LBS) contains desensitized city IDs and subway line information, the geographic attribute can add extra mined information about the user, since cities with more subway lines are generally believed to have more prosperous economies and possibly larger populations.

The output nunique features are shown in Fig. 13.11. Observing the nunique_adCategoryId_in_LBS column, it can be seen that, to some extent, this column reflects the range of adCategoryId values within each LBS; it represents LBS at the adCategoryId level and, conversely, can characterize adCategoryId at the LBS level. However, this is only a first-order expression; both the grouping field and the counted field can be extended to higher orders. A code sketch follows the figure.

Fig. 13.11 The nunique feature output (columns: nunique_adCategoryId_in_LBS, nunique_advertiserId_in_LBS, nunique_age_in_LBS, nunique_campaignId_in_LBS, nunique_carrier_in_LBS)
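A minimal groupby/transform sketch of these features (the column choices are illustrative):

# number of distinct values of each field within every LBS group
for col in ['adCategoryId', 'advertiserId', 'age', 'campaignId', 'carrier']:
    data['nunique_{}_in_LBS'.format(col)] = (
        data.groupby('LBS')[col].transform('nunique'))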

3.1.3 Ratio Feature

The ratio feature can be constructed from the interaction between two fields, building on the second-order count features mentioned above. The count coding feature and the nunique feature both produce integers; unlike these two, the ratio feature yields a decimal between 0 and 1. If the nunique feature reflects the extent of a feature's distribution range, the ratio feature reflects the proportion, or level of preference, as sketched below.
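For example, the share of a single advertiser within each LBS region can be computed as follows (a sketch; the column pair is illustrative):

# count of each (LBS, advertiserId) combination
data['cnt_LBS_advertiserId'] = (
    data.groupby(['LBS', 'advertiserId'])['uid'].transform('count'))
# divide by the LBS count to obtain a ratio in [0, 1]
data['ratio_advertiserId_in_LBS'] = (
    data['cnt_LBS_advertiserId'] /
    data.groupby('LBS')['uid'].transform('count'))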

3.2 Business Features

Section 13.3.1 introduced the three classic features: count coding, nunique, and ratio. This section introduces another statistical feature, one that requires labels, based on this competition question: the business feature. This type of feature is common in classification models. It is the label distribution ratio over the different values of each discrete field, which in this competition corresponds to the click-through rate.

3.2.1 Click-Through Rate Features

Before introducing click-through rate features, we must first clarify two concepts: over-fitting and leakage. Over-fitting refers to the model over-learning the training set during training, resulting in poor generalization, especially when the distributions of the training set and testing set (in particular, the joint distribution of features and labels) differ substantially. Leakage means that label information is mixed into the features during training, so the labels effectively become part of the features. The model then appears to learn extremely well, but since the labels of the testing set are unknown and their distribution may differ, generalization can be poor or even disastrous; this is also a form of over-fitting. Therefore, extreme care should be taken when using labels to extract and process features: the features' expression of the labels should be strengthened without over-expression, which would cause over-fitting and leakage of label information.

To avoid label leakage to a certain degree, the idea of five-fold cross validation (n folds in general) can be used for cross click-rate statistics, so that the click-through rate feature obtained for each sample does not use that sample's own label. The specific steps are as follows:

  1. Randomly divide the training set into n equal parts;

  2. For each part obtained in step (1), statistically map its click-through rate feature using the remaining n−1 parts, and at the same time obtain one click-through rate feature mapping of the testing set;

  3. After step (2), the click-through rate feature of the entire training set is available; the n mapping results for the testing set (each computed from a different set of n−1 parts) are then averaged to obtain the click-through rate feature of the testing set.

Next, the specific implementation code is given. Note that only first-order click-through rate features are shown here, i.e., features built directly on the original category features; click-through rate features on crossed combinations of category features are not given.

# Step 1
n_parts = 5
train['part'] = (pd.Series(train.index) % n_parts).values
for co in cat_features:
    col_name = 'ctr_of_' + co
    ctr_train = pd.Series(dtype=float)
    ctr_test = pd.Series(0, index=test.index.tolist())
    # Step 2
    for i in range(n_parts):
        se = train[train['part'] != i].groupby(co)['label'].mean()
        ctr_train = ctr_train.append(train[train['part'] == i][co].map(se))
        ctr_test += test[co].map(se)
    # Step 3
    train_df[col_name] = ctr_train.sort_index().fillna(-1).values
    test_df[col_name] = (ctr_test / n_parts).fillna(-1).values

3.3 Text Features

The previous two sections introduced feature extraction methods for univariate discrete fields. However, this competition also contains a category of multivariate discrete fields, such as interests, keywords, and topics. How to process such fields and perform feature engineering on them is also worth discussing. This section introduces relevant natural language processing algorithms and treats such fields as text features. Figure 13.12 shows the interest1 field.

Fig. 13.12 Multi-valued feature interest1 (columns: uid, interest1)

The following are the basic preparations before extracting text features, mainly importing the libraries and initializing the data sets:

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.decomposition import TruncatedSVD

# sparse feature matrices, filled in feature by feature below;
# starting from None avoids hstacking onto an empty frame
train_sp = None
test_sp = None

Let's first look at the sparse matrix structure of the scipy library, a storage method different from pandas.DataFrame. A sparse matrix is characterized by a very high total dimension, of which each user has values in only a small part, so it maintains ultra-high dimensionality without taking up too much memory. Next, sparse matrix features will be generated in three ways.

3.3.1 OneHotEncoder

OneHotEncoder, also known as one-hot coding, encodes a univariate discrete field into a sparse matrix structure. Simply put, it turns a univariate discrete field with n unique values into an n-dimensional vector of 0s and 1s, stored as a sparse matrix. The DataFrame-format data is used here for the implementation:

ohe = OneHotEncoder()
for feature in cat_features:
    ohe.fit(train[feature].append(test[feature]).values.reshape(-1, 1))
    arr = ohe.transform(train[feature].values.reshape(-1, 1))
    train_sp = arr if train_sp is None else sparse.hstack((train_sp, arr))
    arr = ohe.transform(test[feature].values.reshape(-1, 1))
    test_sp = arr if test_sp is None else sparse.hstack((test_sp, arr))

After one-hot encoding, the original univariate discrete fields, which have no quantitative ordering, are converted into multiple continuous 0/1 columns that can be fed directly to logistic regression (LR) and other models that do not directly support discrete fields.

3.3.2 CountVectorizer

Likewise, since univariate discrete fields can be converted into continuous 0/1 features, multivariate discrete fields also have a corresponding conversion method, namely CountVectorizer. It counts each value of a multivariate field separately to represent the number of occurrences of that value in a sample. In this competition's data, a single user's values on features such as interest are never repeated, so the converted value is still only 0 or 1. The specific implementation code is given below:

cntv = CountVectorizer()
for feature in text_features:
    cntv.fit(train[feature].append(test[feature]))
    train_sp = sparse.hstack((train_sp, cntv.transform(train[feature])))
    test_sp = sparse.hstack((test_sp, cntv.transform(test[feature])))

3.3.3 TfidfVectorizer

TfidfVectorizer is a statistical vector related to word frequency. Its similarity with CountVectorizer is that their feature dimensions are the same. The difference is that CountVectorizer counts the occurrences of a value along each dimension, while TfidfVectorizer computes a frequency weight: the importance of a value increases in proportion to the number of times it appears in a sample, but decreases inversely with how often it appears in the whole data set. The specific code is as follows. Note that TfidfVectorizer() takes parameters, but the defaults are used here, i.e., no settings are made.

tfd = TfidfVectorizer()
for feature in text_features:
    tfd.fit(train[feature].append(test[feature]))
    train_sp = sparse.hstack((train_sp, tfd.transform(train[feature])))
    test_sp = sparse.hstack((test_sp, tfd.transform(test[feature])))

So far, participants may naturally have a question: the approach of this section undoubtedly produces ultra-high-dimensional features, which may cause performance problems. Against this risk, in addition to using a sparse matrix as the storage structure, an auxiliary method is to reduce the dimensionality: remove redundant, extremely sparse dimensions, or map the features to a low-dimensional space through feature transformation, thereby optimizing computation speed and memory usage.

3.4 Feature Dimension Reduction

3.4.1 TruncatedSVD

sklearn (scikit-learn) is a powerful machine learning Python open-source package consisting of various commonly used modules. Its decomposition module contains multiple feature dimension reduction algorithms for different types and forms of features. In this book, to help participants quickly become familiar with the algorithm process and skills, the competition data was sampled in advance (about 100,000 rows). However, participants who have gone through the text feature processing will find that the feature dimension explodes to more than 250,000, which poses a great performance challenge for modeling. Therefore, a certain degree of dimension reduction can be applied first. The decomposition module of sklearn provides a TruncatedSVD operator for dimension reduction of sparse matrix structures; the number of principal components to output can be specified. Its usage is similar to that of the text feature processing operators. Here is the code for TruncatedSVD:

svd = TruncatedSVD(n_components=100, n_iter=50, random_state=2020)
svd.fit(sparse.vstack((train_sp, test_sp)))
cols = ['svd_' + str(k) for k in range(100)]
train_svd = pd.DataFrame(svd.transform(train_sp), columns=cols)
test_svd = pd.DataFrame(svd.transform(test_sp), columns=cols)

In addition to SVD, many other dimension reduction methods can be used, such as PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), and NMF (Non-negative Matrix Factorization). These methods differ greatly in how they reduce dimensions, so different methods can also be used in combination.

3.5 Feature Storage

It should be noted that, in order to achieve better results in a competition, the field information of the testing set is usually included when computing features, but in actual business applications this is impossible, and some competitions explicitly forbid using testing set field information for feature engineering. After the feature processing of the previous sections, three feature files are generated in addition to the original data features. Figure 13.13 describes all the feature files.

Fig. 13.13 Description of the feature files (columns: short name, training set file, testing set file, feature description)

4 Model Training

4.1 LightGBM

The LightGBM model can support category features during training, provided LabelEncoder encoding is applied first. The feature set here consists of the LabelEncoder-encoded univariate discrete fields together with the SVD features, i.e., the dimension-reduced sparse matrix features of the multivariate discrete fields. Combining the two with the LightGBM model and training with five-fold cross validation, the model's validation evaluation score is 0.67922 (AUC) and its testing set score is 0.61864.

Obviously the training set shows over-fitting: the testing set score is much lower than the validation score, which may be caused by over-fitting features in the feature set and by information loss after SVD dimension reduction.
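The LabelEncoder step mentioned above can be sketched as follows (assuming the discrete_features list from Sect. 13.2.5):

from sklearn.preprocessing import LabelEncoder

for col in discrete_features:
    le = LabelEncoder()
    # encode each discrete value as an integer so LightGBM can treat it as a category
    data[col] = le.fit_transform(data[col].astype(str))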

4.2 CatBoost

CatBoost is also one of the most commonly used models because it directly supports the processing and modeling of text features, i.e., multivariate field features, and can be trained on the raw data source alone. Using five-fold cross validation as well, the model's validation score is 0.64900 and its testing set score is 0.66501.
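A minimal sketch of passing both kinds of features to CatBoost directly is given below; the split variables train_x, train_y, val_x, and val_y are assumptions, and the parameter values are illustrative:

from catboost import CatBoostClassifier

model = CatBoostClassifier(iterations=1000, eval_metric='AUC', verbose=100)
model.fit(train_x, train_y,
          cat_features=discrete_features,   # univariate discrete fields
          text_features=text_features,      # multivariate (text) fields
          eval_set=(val_x, val_y))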

4.3 XGBoost

CatBoost can directly support text features (text_features) and category features (cat_features) because it performs sparse processing of these fields inside the model. For XGBoost, the corresponding processing can be done in the outer layer first with the relevant sklearn operators, and then XGBoost can be used for modeling. The model's validation score is 0.67905 and its testing set score is 0.67671.

5 Model Integration

5.1 Weighted Integration

A simple weighted integration is carried out according to the testing set scores. The specific calculation is: CatBoost result × 0.2 + LightGBM result × 0.3 + XGBoost result × 0.5. The model's validation score is 0.68147 and its testing set score is 0.68208. It can be seen that the effect of weighted integration is quite noticeable, with no need for complicated operations. A sketch is given below.
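The calculation itself is a one-liner over the three prediction arrays (a sketch; the array names are assumptions):

# weighted average of the three models' testing set predictions
final_pred = 0.2 * pred_1 + 0.3 * pred_2 + 0.5 * pred_3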

5.2 Stacking Integration

There are many alternative Stacking structures. In this competition, we choose to use the validation results and prediction results of the LightGBM and XGBoost models as feature values, with the CatBoost model as the final model responsible for training and prediction, because CatBoost has relatively strong predictive ability and can obtain a good prediction effect even using only the original features. The following shows a commonly used, general implementation of Stacking integration (illustrated here with BayesianRidge as the second-layer learner):

import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_squared_error, log_loss
from sklearn.model_selection import RepeatedKFold

def stack_model(oof_1, oof_2, oof_3, pred_1, pred_2, pred_3, y,
                eval_type='regression'):
    # oof_1, oof_2, oof_3 are the validation (out-of-fold) results of the three models
    # pred_1, pred_2, pred_3 are the testing set results of the three models
    # y is the true label of the training set; eval_type is the task type
    train_stack = np.vstack([oof_1, oof_2, oof_3]).transpose()
    test_stack = np.vstack([pred_1, pred_2, pred_3]).transpose()
    folds = RepeatedKFold(n_splits=5, n_repeats=2, random_state=2020)
    oof = np.zeros(train_stack.shape[0])
    predictions = np.zeros(test_stack.shape[0])
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, y)):
        print("fold n° {}".format(fold_ + 1))
        trn_data, trn_y = train_stack[trn_idx], y[trn_idx]
        val_data, val_y = train_stack[val_idx], y[val_idx]
        print("-" * 10 + "Stacking " + str(fold_) + "-" * 10)
        clf = BayesianRidge()
        clf.fit(trn_data, trn_y)
        oof[val_idx] = clf.predict(val_data)
        predictions += clf.predict(test_stack) / (5 * 2)
    if eval_type == 'regression':
        print('mean: ', np.sqrt(mean_squared_error(y, oof)))
    if eval_type == 'binary':
        print('mean: ', log_loss(y, oof))
    return oof, predictions

Here, oof_1, oof_2, and oof_3 are the validation set results of the three models, and pred_1, pred_2, and pred_3 are their testing set results. As a general stacking framework, it places no constraints on the three specific models. The two groups are stacked column-wise into a training set and a testing set with only three feature columns each; the training set is then fed into the BayesianRidge model for training, and the final results are returned for later use.

The final model's validation score is 0.70788 and its testing set score is 0.67445. It can be seen that the offline score of stacking integration is usually higher, but due to over-fitting and other reasons, a correspondingly improved result is not always obtained on the testing set.

6 A Summary of the Competition Question

6.1 More Schemes

6.1.1 GroupByMean

As mentioned above, click-through rate features are extracted by combining univariate discrete fields with the labels. Along the same line, after sparse-matrixing the multivariate discrete fields, we obtain 0/1 columns similar to the labels. Therefore, one can compute the mean of such a column within groups of a univariate discrete field, namely groupby(cat_features)[text_features].mean(). For example, among samples whose age value is 5, compute the proportion of users whose interest1 contains the interest ID 109. A sketch follows.
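A sketch of this GroupByMean idea for a single interest ID (the ID 109 and the grouping field age come from the example above):

# 0/1 indicator: does interest1 contain the interest ID 109?
data['has_109'] = data['interest1'].fillna('').str.split().apply(
    lambda ids: int('109' in ids))
# proportion of users holding interest 109 within each age group
data['mean_has_109_by_age'] = data.groupby('age')['has_109'].transform('mean')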

6.1.2 N-Gram

When extracting CountVectorizer features, the book uses the default parameter, i.e., ngram_range=(1, 1), and does not try higher-order N-Gram statistics. A higher-order N-Gram essentially adds a layer of feature combination, tying together information that belongs to the same multivariate discrete field, for example identifying users who like both running and cycling. A sketch is given below.
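A sketch of second-order statistics with CountVectorizer (the field choice is illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# unigrams plus bigrams: pairs of adjacent values inside one multivariate field
cntv2 = CountVectorizer(ngram_range=(1, 2))
interest_mat = cntv2.fit_transform(data['interest1'].fillna(''))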

6.1.3 Graph Embedding

This kind of method is mainly used to extract vector representations of ID categories such as uid or aid, and can mine user and advertisement information from the graph structure. uids or aids that exhibit homophily or structural similarity in the graph end up with similar embedding vectors. Figure 13.14 shows the two stages of DeepWalk embedding extraction: random walk sampling and Skip-gram training. A code sketch follows the figure.

Fig. 13.14 Process of extracting embedding vectors (a graph is sampled into node sequences via random walks, which are then fed to Skip-gram to produce node embeddings)
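To make the process in Fig. 13.14 concrete, here is a compact sketch of the DeepWalk idea using gensim's Word2Vec in skip-gram mode; the adjacency dict adj (built, for example, from (uid, aid) click pairs) and all parameter values are assumptions:

import random
from gensim.models import Word2Vec

def random_walks(adj, walk_len=10, n_walks=5):
    # sample fixed-length random walks starting from every node
    walks = []
    for node in adj:
        for _ in range(n_walks):
            walk = [node]
            for _ in range(walk_len - 1):
                walk.append(random.choice(adj[walk[-1]]))
            walks.append([str(n) for n in walk])
    return walks

walks = random_walks(adj)
# skip-gram (sg=1) learns an embedding vector for every uid/aid node
model = Word2Vec(walks, vector_size=16, window=5, sg=1, min_count=1)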

6.2 Sorting Out Knowledge Points

6.2.1 Feature Engineering

For feature engineering, this chapter introduced common feature extraction methods from three angles: classic features, business features, and text features. The classic features are mainly interaction statistics between univariate discrete fields, including count coding, nunique, and ratio features; the business features part introduced the click-through rate features, which combine industry scenarios and domain knowledge and require labels; the text features part introduced several ways to generate sparse matrices, which are especially useful when dealing with large-scale univariate and multivariate discrete fields.

6.2.2 Modeling Ideas

The competition question in this chapter represents a typical data organization and table structure, for which a fairly general feature engineering method can be abstracted. This is one of the reasons this book uses this competition to explain the computational advertising case. The principle of Lookalike is to use the marketing results of previous ad deliveries to find potential users similar to those who have clicked on the advertisement, thereby achieving continuous exposure and clicks. Therefore, the focus should be on finding similarities between users, especially the joint similarity across all dimensions. Machine learning models are limited here by feature engineering and cannot achieve the best results, while deep neural networks can perform nested combinations and nonlinear function fitting on the text fields, so neural network models performed better in this competition.

6.3 Extended Learning

This competition requires participants to submit scores indicating how likely the candidate users in the testing set are to belong to each seed group (the higher the score, the more likely the candidate is a potential lookalike user of that seed group). Can the probability of a user clicking on an advertisement then be treated as a click-through rate prediction problem? This setting is very similar to the 2017 Tencent Advertising Algorithm Contest: the basic feature construction methods and model choices are the same. The difference is that the user behavior sequences of the 2018 competition contain no time information, so many time-related features are missing. Of course, this is determined by the Lookalike business itself.

6.3.1 2017 Tencent Advertising Algorithm Contest: Conversion Rate Prediction of Mobile App Advertising

Computational advertising is one of the most important business models of the Internet. The effect of ad delivery is usually measured from three aspects: exposure, clicks, and conversion. Most advertising systems are limited in their ability to collect conversion feedback and can only optimize delivery using exposure or clicks. Tencent Social Ads makes the most of its unique capabilities in user identification and conversion tracking to help advertisers track post-delivery conversions, trains a predicted conversion rate model (pCVR) on ad conversion data, and introduces the pCVR factor into ad ranking to optimize delivery and improve ROI. This question takes mobile App ads as the research target and predicts the probability that an App ad is activated after being clicked: pCVR = P(conversion = 1 | ad, user, context); that is, given the advertisement, user, and context, predict the probability that the App ad will be activated after a click. The industry has long emphasized research on click-through rate (CTR) prediction, which is now relatively mature; Tencent's conversion rate (CVR) prediction in this competition was distinctive, and the question has high research value in both academia and industry.

Basic ideas: the 2017 Tencent Algorithm Contest is an early CTR-style contest, and many of its methods are worth learning from, including a lot of classic operations. In terms of models, most players chose tree models and the FFM model, then combined various stacking schemes to get the final result. The models used for predicting ad clicks were relatively simple at the time, since DeepFM, xDeepFM, AFM, and others used today had not yet appeared.

In terms of feature construction, the schemes are also similar: basic features, user category features, advertising category features, context features, interaction features, and other features. The focus here is on the other features, which can be called trick features. They include whether a user's repeated clicks on the same day converted, the time difference between the first and last occurrences of duplicate samples within a day (samples with identical feature values), and the time-ordered rank of duplicate samples within a day.

As a concrete example of the logit-inversion integration described below, averaging the predictions 0.1, 0.15, and 0.08 in logit space (where f is the sigmoid function) gives

$$ p=f\left(\frac{f^{-1}(0.1)+{f}^{-1}(0.15)+{f}^{-1}(0.08)}{3}\right)=0.1067 $$
(13.2)

The champion's plan has great innovations in the model. In addition to the tree model, Wide & Deep, and PNN, it also uses an improved NFFM model whose single-model score is higher than that of the third place on the leaderboard. The final integration method is a weighted average, but applied after logit inversion: first substitute each model's results into the inverse sigmoid function, then take the mean, and finally apply the sigmoid function to the mean, as computed in (13.2). Compared with a plain weighted average, this method is better suited to situations where the results differ only slightly.

import numpy as np

# sigmoid function
def f(x):
    res = 1 / (1 + np.e ** (-x))
    return res

# sigmoid inverse function (logit)
def f_ver(x):
    res = np.log(x / (1 - x))
    return res
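A usage sketch with the helpers above (the model result arrays pred_1, pred_2, pred_3 are assumptions):

# average in logit space, then map back to a probability
preds = [pred_1, pred_2, pred_3]
final = f(np.mean([f_ver(p) for p in preds], axis=0))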