1 Introduction

Natural Language Processing (NLP) is one of the main computational fields in modern artificial intelligence. It draws on Data Science, Computer Science and Linguistics, and seeks primarily to close the gap between computing systems and human language. NLP has made its way into most modern digital goods and services: NLP techniques are used in search engines [4], Machine Translation [11], Speech Recognition [6, 13] and Sentiment Analysis [10, 14], to name a few. The mutually reinforcing growth of electronic information consumption and of the data produced by end-users of these systems means that NLP remains a thriving area of research and development, with new resources being created daily.

Textual data is constantly being shared between users or stored in the cloud, depending on the application. Every day, 5.5 billion cellular text messages are exchanged [3], 500 million tweets are created (as of 2013) [21], and the average instant messaging user sends 72 messages [20].

An important and challenging area of NLP is textual summarization. The goal of summarization techniques is, given an original sequence of words or characters (typically a sentence or paragraph), to produce a condensed version of the original sequence that satisfies a set of linguistic constraints. A similarity score between the original and shortened sequences can be used to quantify performance. Textual summarization splits into two main categories, Extractive and Abstractive summarization; both were central to this research and are explored thoroughly in a later section. Textual summarization is mainly applied to customer review mining [7], headline generation [2, 18] and extraction of the main points of large documents such as blogs or research papers [1].

Simultaneously, the rise of social media services that are heavily dependent on text has driven increased use of a set of Unicode characters classified as Emojis. Emojis are pictographic representations of a given phrase or word, typically displayed as a single character, and can be interpreted differently depending on the context of usage. Emojis represent a wide array of items, from traditional emoticons to locations. It has been found that Emojis add deeper meaning to conversations [9] in addition to conveying emotion.

Notification bars on mobile phones typically convey only part of a received message; the same is true of smart watches. If a lengthy message is received, it is normally truncated and suffixed with “...” depending on the screen width of the device. This is tedious for the end user, as it requires them to expand or fully open the message to view its contents, and the annoyance compounds as it is repeated throughout the day. This paper explores a solution to this problem: using textual summarization techniques to condense incoming text messages into shorter versions through the use of emojis.

The remainder of this paper is organized as follows. Section 2 gives a high-level definition of the problem. Section 3 reviews existing research, first on the Unicode Emoji characterset in modern communication and then on existing summarization techniques. Section 4 presents the proposed method with an example. Section 5 describes the results obtained with subjective human testing, and Section 6 discusses those results.

2 Problem definition

Consider a scenario where an end user of a mobile instant messaging service, such as WhatsApp or iMessage, is in a high focus situation such as driving an automobile or attending a university lecture. In such scenarios, the time spent on the task at hand by the end user is critical. This end user may also have a smart wearable such as a bracelet or smartwatch that has a digital display to which message notifications are sent.

Now consider that a message is sent to the end-user from one of their contacts. Upon receiving the notification, the end-user must read it on their smart wearable device and make a personal judgement call on whether to open the full message on their mobile phone. This judgement call is assumed to depend on two main factors: the contact sending the message and the amount of the message visible on the screen of the wearable. Typically the message is too long to fit on the wearable's display, so the end user defaults to opening it fully to read and interpret it. If the message is unimportant, this process is inefficient with respect to the high-focus task at hand; had the end-user known the context of the message, they would have opened it at a later time. This calls for a method that minimizes the read and interpretation time for a given message.

To express this mathematically, let an incoming message be denoted by a vector \(\bar{X}\) in which each word is an element token separated by a whitespace character. We assume that the vector \(\bar{X}\) is of size N, so that \(\bar{X} = [X_{1}, X_{2}, \ldots, X_{N}]\). It is assumed that read time and interpretation time are proportional to the length and the context of the message. The problem then becomes creating a summarized representation of the original sentence, \(\bar{Y}\), subject to length and similarity constraints. The vector \(\bar{Y}\) of size M is given by \(\bar{Y} = [Y_{1}, Y_{2}, \ldots, Y_{M}]\). The individual tokens of \(\bar{X}\) and \(\bar{Y}\) can be represented using either a one-hot encoding over the entire vocabulary or a dense context vector for that word. Regardless of the representation, we would like \(M \ll N\), and we would like to maximize the similarity between the two sentences, \(\textsf{Sim}(\bar{X},\bar{Y})\), so that the original context is not lost in the summarized version. In addition to a reduction in the required screen space, the decreased message length should lead to decreased reading and interpretation times.
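One compact way to state this objective (a sketch; ρ here is the replacement proportion that Section 4 introduces as a tunable length budget):

$$ \max_{\bar{Y}}\ \textsf{Sim}(\bar{X}, \bar{Y}) \quad \text{subject to} \quad M \leq \rho N, \quad 0 < \rho < 1 $$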

3 Previous research

3.1 Textual summarization

The research area of textual summarization has been thriving in recent years due to the increased use of social media networks and other internet-based services. Summarization can be split into two main areas, Extractive and Abstractive summarization. Extractive summarization (ES) condenses an original text sequence so that the summary consists only of content extracted from the original text [22]; it is any means of extracting certain words, phrases or paragraphs from the original. Early ES techniques assign scores to each input sentence or document, and ES methods can further be divided into unsupervised and supervised methods. Some common word-level features [12] used in summarization are listed below, followed by a sketch of a simple frequency-based scorer:

Keyword features: These are used to identify important areas of a sentence. Traditionally, portions of text containing keywords are the most likely to be included in the final summary.

Title word features: These utilize the notion that phrases or sentences in a document containing the title words have a higher chance of contributing to the summary.

Cue phrase features: Cue phrases (e.g., “because”, “summary”) are used as extraction signals, as these phrases are assumed to indicate overall document flow.
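To make the word-level features above concrete, the following minimal sketch scores sentences by the corpus frequency of their words (a classic keyword feature); it illustrates the general idea only and is not the method proposed in this paper.

```python
import re
from collections import Counter

def extractive_summary(text, k=2):
    """Score each sentence by the total corpus frequency of its words
    (a simple keyword feature) and keep the top-k sentences in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:k])
    return " ".join(s for s in sentences if s in top)
```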

While research into summarization with Emojis is in its infancy, there is continuous work on text-based summarization. [16] proposed an ES method that uses frequency and position features of words, along with a triangular membership function, to generate scores; the sentences with the highest scores are included in the summary. An ES method proposed by [8] uses word vector embeddings of the sentences of the abstractively summarized and original texts to train a neural network. The network assigns a score to each sentence of the original text, and a selection of the highest-scoring sentences then forms the Extractive summary.

3.2 Abstractive summarization

Abstractive summarization (AS) is a technique in which a summary is generated from novel words, phrases or sentences rather than by extracting parts of sentences as in Extractive summarization. This generation is done by creating an internal representation of the source document: analysing features of the document through deep analysis and reasoning about its semantics and context.

Current advances in Abstractive summarization are due to deep neural networks. An LSTM-CNN (Long Short-Term Memory Convolutional Neural Network) approach was proposed by [17]; this network uses a CNN to encode the words in a sentence and an LSTM network to decode the sequence of encodings into a summary. Promising results were demonstrated through the use of generative adversarial networks (GANs) by [15], which uses an LSTM encoder-decoder model with attention.

Generation of long and short Abstractive summaries was also evaluated by [23]. This study used various model architectures as well as human judgement in evaluating Abstractive summaries and found that traditional summarization scores do not correlate well with human judgement: while the models achieve higher ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, they do not surpass the baseline when evaluated by humans.

4 Proposed method of summarization

This section details the proposed method of summarizing instant messages, based on dense embedding representations of N-gram words and their relation to Emojis and their keywords.

4.1 N-gram token replacement method

The overall flow of this method is shown in Fig. 1 and follows several key steps. Given an input sentence, the sentence is split into tokens, one per word. These tokens are then parsed, and those critical to the context of the sentence are extracted. The set of extracted tokens is used to create a set of N-gram representations. Each N-gram token is then compared against another collection of tokens describing a set of Emojis, and the Emoji with the minimal Mean Squared Error (MSE) is selected. This process is discussed in more detail below. For the remainder of this paper, this method is referred to as Method 1.

Fig. 1 Flow diagram of the N-gram token replacement method

Let \(\bar {X}\) be an input sentence to be summarized, with an example shown below. This example will be used as a basis for explaining the overall summarization process.

$$ \bar{X} = \text{``can we go to the beach take my car''} $$
(1)

This input sentence is then split into tokens on the whitespace character. Let this be represented by \(\bar {X}_{\textsf{Token}}\) as follows:

$$ \bar{X}_{\textsf{Token}} = [\text{``can''}, \text{``we''}, \text{``go''}, \text{``to''}, \text{``the''}, \text{``beach''}, \text{``take''}, \text{``my''}, \text{``car''}] $$
(2)

Tokenization exposes the first core problem: extracting the important words from a given sentence. To do this, a Part-of-Speech (POS) tagger was used.

A pre-trained greedy averaged perceptron tagger, from the NLTK package, was used as the part-of-speech tagger at this stage of the model. This tagger was trained on a large corpus of data from the Wall Street Journal dataset. While it is more computationally expensive than other publicly available pre-trained perceptron taggers, it is also more accurate. Applying the averaged perceptron tagger to the sequence of tokens \(\bar {X}_{\textsf {Token}}\) yields, for each token, a corresponding part of speech. Let this be represented by \(\bar {X}_{\textsf {POS}}\): each element of \(\bar {X}_{\textsf {POS}}\) holds the original word at index 0 and the tagged part of speech at index 1. The complete tagset can be found in the NLTK documentation.

$$ \begin{aligned} \bar{X}_{\textsf{POS}} ={} & [(\text{`can'}, \text{MD}), (\text{`we'}, \text{PRP}), (\text{`go'}, \text{VB}), (\text{`to'}, \text{TO}),\\ & \ (\text{`the'}, \text{DT}), (\text{`beach'}, \text{NN}), (\text{`take'}, \text{VB}), \\ & \ (\text{`my'}, \text{PRP\$}), (\text{`car'}, \text{NN})] \end{aligned} $$
(3)
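This tagging step can be sketched with NLTK as follows (assuming the pre-trained models have been downloaded; the resource name may differ slightly across NLTK versions):

```python
import nltk

# One-time download of the pre-trained averaged perceptron tagger
# (resource name in most NLTK versions; may vary by release).
nltk.download("averaged_perceptron_tagger")

x_token = ["can", "we", "go", "to", "the", "beach", "take", "my", "car"]
x_pos = nltk.pos_tag(x_token)
# [('can', 'MD'), ('we', 'PRP'), ('go', 'VB'), ('to', 'TO'), ('the', 'DT'),
#  ('beach', 'NN'), ('take', 'VB'), ('my', 'PRP$'), ('car', 'NN')]
```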

We then assume that words with specific parts of speech convey the core meaning of the original sentence. The parts of speech selected are all forms of nouns, verbs and adjectives, and all adverbs. After examining the tags of \(\bar {X}_{\textsf {POS}}\), words that do not meet these criteria are dropped from the sentence. Let the new representation be \(\bar {X}_{\textsf {Ext}}\):

$$ \bar{X}_{\textsf{Ext}} = [\text{`go'}, \text{`beach'}, \text{`take'}, \text{`car'}] $$
(4)
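Continuing the sketch above, the extraction step reduces to filtering on Penn Treebank tag prefixes (NN* nouns, VB* verbs, JJ* adjectives, RB* adverbs):

```python
# Keep all forms of nouns, verbs, adjectives and adverbs; drop the rest.
KEEP = ("NN", "VB", "JJ", "RB")

x_ext = [word for word, tag in x_pos if tag.startswith(KEEP)]
# ['go', 'beach', 'take', 'car']
```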

The next step is creating equivalent unigram, bigram and trigram representations, referred to collectively as the N-gram representation from here on. This is done so that adjacent words with related meanings have a chance to be represented together; for example, the words “fast” and “car” can be considered as a single representation of “fast car”. Doing this creates the collection of N-gram tokens defined by Gram.

$$ \begin{aligned} \textsf{Gram} ={} [ & [(\text{`go'}), (\text{`beach'}), (\text{`take'}), (\text{`car'})], && \text{(unigrams)} \\ & [(\text{`go'}, \text{`beach'}), (\text{`beach'}, \text{`take'}), (\text{`take'}, \text{`car'})], && \text{(bigrams)} \\ & [(\text{`go'}, \text{`beach'}, \text{`take'}), (\text{`beach'}, \text{`take'}, \text{`car'})]] && \text{(trigrams)} \end{aligned} $$
(5)
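A sliding-window sketch of this N-gram construction, in plain Python and matching the example above:

```python
def ngrams(tokens, n):
    """All n-gram tuples over consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

x_ext = ["go", "beach", "take", "car"]
gram = [ngrams(x_ext, n) for n in (1, 2, 3)]
# [[('go',), ('beach',), ('take',), ('car',)],
#  [('go', 'beach'), ('beach', 'take'), ('take', 'car')],
#  [('go', 'beach', 'take'), ('beach', 'take', 'car')]]
```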

We now have a collection of N-gram tokens, as defined by Gram. The problem then becomes creating a numerical representation of these tokens. To create this representation, a pre-trained Word2Vec model [5] was used. Let the Word2Vec representation of a given word w be:

$$ \text{Vector Representation} = word2Vec[w] $$
(6)

A useful property of word2Vec is that the vector representations of different words can be combined algebraically to produce a meaningful result. We exploit this to create a single vector representation for each token within each N-gram representation: the function tokenVec is applied to each index of each N-gram representation,

$$ \textsf{tokenVec}(\textsf{token}) = \sum\limits_{j} word2Vec[w_{j}] $$
(7)

where \(w_{j}\) is the j th word in the token. Applying this operation to the first element of the unigram representation is equivalent to \(word2Vec[\text{`go'}]\), while the first element of the bigram representation becomes:

$$ word2Vec[\text{`go'}] + word2Vec[\text{`beach'}] $$
(8)

and the first element of the trigram representation becomes:

$$ word2Vec[\text{`go'}] + word2Vec[\text{`beach'}] + word2Vec[\text{`take'}] $$
(9)
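The token vectors of (7)–(9) can be sketched with gensim's pre-trained vectors. The paper does not name the exact model, so the 300-dimensional Google News vectors are assumed here to match the dimensionality used in Section 4.2:

```python
import numpy as np
import gensim.downloader

# Assumption: 300-d Google News vectors; the paper only specifies a
# pre-trained word2vec model with 300 dimensions.
w2v = gensim.downloader.load("word2vec-google-news-300")

def token_vec(token):
    """Sum of the word2Vec vectors of the words in an n-gram token (Eq. 7)."""
    return np.sum([w2v[w] for w in token if w in w2v], axis=0)

token_vec(("go", "beach"))  # word2Vec['go'] + word2Vec['beach']
```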

We now have complete vector representations for all N-gram tokens. The next step is creating a vector representation of the Emoji characterset, so that the same algebraic methods can be applied to both the token vectors and the Emoji vectors.

4.2 Vector representation of Emojis

A Kaggle dataset [19] named EmojiNet provides keywords and labels for each Emoji beyond those given by the official Unicode website. The dataset scraped from the Unicode website and the EmojiNet dataset were joined on each Emoji's Unicode code point, and the keywords from EmojiNet were merged with the descriptions from the Unicode website to create a definitive set of words for each Emoji. Duplicates were removed so that each Emoji has a concise set of describing keywords. Table 1 shows an example of this merged dataset, with each Emoji having a set of keywords.

Table 1 Example Emoji and Keywords before word2Vec Representation

Each set of keywords was then converted to its word2Vec representation, so that each Emoji has an array with 300 rows (the embedding dimensions) and one column per keyword. Table 2 shows the same Emojis with this vectorized keyword representation, which is used for the remainder of this study.

Table 2 Example Emoji and Keywords after word2Vec Representation
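Continuing the same sketch, the Table 2 representation can be built as a 300 × K matrix per Emoji, one keyword vector per column; `emoji_keywords` is a hypothetical mapping standing in for the merged Unicode/EmojiNet dataset:

```python
# Hypothetical merged dataset: emoji -> list of describing keywords.
emoji_keywords = {"🚗": ["car", "vehicle", "automobile"],
                  "🏖": ["beach", "umbrella", "sand"]}

emoji_matrix = {
    emoji: np.stack([w2v[kw] for kw in kws if kw in w2v], axis=1)
    for emoji, kws in emoji_keywords.items()
}
# Each value is a 300 x K matrix: one 300-d column per keyword.
```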

We now have a collection of vectorized N-gram representations and a collection of vectors describing Emojis, as detailed in Table 2. For each vectorized token in the N-gram representations, the mean squared error between that token and each of an Emoji's vectorized keywords is calculated. The resulting errors are sorted in ascending order, and only the smallest N/2 of them are averaged to create an error score for that Emoji. This process is repeated for all Emojis and is illustrated in Algorithm 1.

Algorithm 1
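Since Algorithm 1 appears only as a figure, the following is a reconstruction from the prose description; the exact tie-breaking and averaging details are assumptions:

```python
def emoji_error(token_vector, keyword_matrix):
    """Error score between a token vector and one Emoji: the MSE against
    each keyword column, keeping and averaging only the smallest half."""
    errors = np.mean((keyword_matrix - token_vector[:, None]) ** 2, axis=0)
    errors.sort()
    half = max(1, len(errors) // 2)
    return float(np.mean(errors[:half]))

def best_emoji(token_vector, emoji_matrix):
    """Emoji with the minimal error score for this token."""
    return min(emoji_matrix,
               key=lambda e: emoji_error(token_vector, emoji_matrix[e]))
```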

The Emoji with the minimum error for a given token is chosen as that token's prospective replacement. We then have a collection of error values, \(\hat {E}\); a collection of prospective Emoji representations (Fig. 2); and the collection of N-gram representations, Gram. \(\hat {E}\) is then updated by dividing each error value by the number of words in the token that produced it, giving (10).

$$ \begin{aligned} \hat{E} ={} [ & [(0.0188), (0.182), (0.0353), (0.0)], && \text{(unigram errors)} \\ & [(0.01765), (0.183), (0.0082)], && \text{(bigram errors)} \\ & [(0.0232), (0.175)]] && \text{(trigram errors)} \end{aligned} $$
(10)
Fig. 2 Prospective Emojis for the N-gram model

The problem then becomes replacing tokens with their prospective Emojis based on the error values. Here we use a greedy approach: \(\hat {E}\) is scanned, and the token with the minimal error value is replaced with its prospective Emoji. All remaining tokens are then updated to exclude any that share words with the replaced token. Gram and \(\hat {E}\) then become:

$$ \begin{aligned} \textsf{Gram} ={} [ & [(\text{`go'}), (\text{`beach'}), (\text{`take'})], && \text{(unigram tokens)} \\ & [(\text{`go'}, \text{`beach'}), (\text{`beach'}, \text{`take'})], && \text{(bigram tokens)} \\ & [(\text{`go'}, \text{`beach'}, \text{`take'})]] && \text{(trigram tokens)} \end{aligned} $$
(11)
$$ \begin{aligned} \hat{E} ={} [ & [(0.0188), (0.182), (0.0353)], && \text{(unigram errors)} \\ & [(0.01765), (0.183)], && \text{(bigram errors)} \\ & [(0.0232)]] && \text{(trigram errors)} \end{aligned} $$
(12)

This process is repeated until a threshold is met, so that a certain proportion ρ of the sentence is converted to an Emoji representation. The replaced and non-replaced tokens are then recombined to form the final sentence. Figure 3 shows this final reconstruction with ρ = 0.5, meaning that half of the tokens are converted into an Emoji representation. A sketch of this loop follows.
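The greedy replacement loop, reconstructed from the prose (the handling of repeated words and tie-breaking are assumptions):

```python
def greedy_replace(gram, errors, emojis, rho=0.5):
    """Repeatedly replace the minimum-error token with its prospective
    Emoji until a proportion rho of the words has been replaced.

    gram, errors and emojis are parallel lists of lists: for each n-gram
    order, the tokens, their error scores and their best-match Emojis."""
    total_words = len(gram[0])            # unigram count = words available
    replaced, covered = [], set()
    while sum(len(tok) for tok, _ in replaced) < rho * total_words:
        candidates = [(errors[n][i], n, i)
                      for n in range(len(gram))
                      for i in range(len(gram[n]))
                      if not covered & set(gram[n][i])]
        if not candidates:
            break
        _, n, i = min(candidates)
        replaced.append((gram[n][i], emojis[n][i]))
        covered |= set(gram[n][i])        # drop tokens sharing these words
    return replaced
```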

Fig. 3 Final summarized version of the example

5 Testing

The proposed model has no set of labelled outputs to compare against, so a subjective test with human participants was used. The key values measured are the time taken to read a summarized message and the time taken to interpret the context of the message.

A sample of 10 people was selected to look at a collection of 10 sentences: 5 unsummarized messages and 5 messages summarized using the proposed method. The time taken to read and to interpret each sentence was measured. Let read time be denoted by Tr and interpret time by Ti. The detailed methodology was as follows:


1. The collection of 10 sentences is shuffled.
2. The participant simultaneously starts the timer and begins reading the sentence.
3. The participant notes the elapsed time upon completion of reading the sentence.
4. The participant notes the elapsed time upon successfully interpreting the context of the sentence.
5. The above process is repeated for each participant.

6 Discussion of results and conclusions

The reading and interpretation times were averaged across all participants for each sentence, as shown in Table 3. The average read time for the unsummarized sentences was 143.8 ms. For the full unsummarized sentences, all participants reported that the read time and interpretation time were the same.

Table 3 Summary of Read Time and Interpretation Time

The read times of the summarized sentences averaged 64.2 ms, and the interpretation times averaged 108.4 ms. A Wilcoxon signed-rank test was used to compare the read times for unsummarized and summarized sentences, and likewise the interpretation times. Based on the outcomes of this test, the difference in read times is statistically significant, whereas the difference in interpretation times between unsummarized and summarized sentences is not.
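This comparison can be sketched with SciPy's paired Wilcoxon signed-rank test; the arrays below are hypothetical placeholders, since the paper reports timings only in aggregate (Table 3):

```python
from scipy.stats import wilcoxon

# Hypothetical paired read times in ms (illustrative values only; the
# study's per-sentence measurements are not published in the text).
read_unsummarized = [150, 141, 139, 148, 142]
read_summarized = [66, 61, 65, 63, 66]

stat, p = wilcoxon(read_unsummarized, read_summarized)
print(f"W={stat}, p={p:.3f}")  # p < 0.05 would indicate significance
```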

This research explored textual summarization using the Emoji Unicode characterset and its application to instant messaging services. The problem was defined as shortening an input vector to a summarized representation such that the summarized representation is shorter than the input and the similarity between input and output is maximized.

A core concept of the proposed method was the vector embedding representation of Emojis. Representing each Emoji through the word2Vec vectors of its keywords proved effective for this task; the keywords were extracted by scraping the official Unicode website and merging with the EmojiNet database. A pre-trained greedy averaged perceptron tagger from the NLTK package served as the part-of-speech tagger, and Emoji replacements for N-gram tokens were selected using a greedy approach.

Human testing of the proposed model was also conducted, with 10 participants evaluating 10 sentences, 5 summarized and 5 unsummarized. Their read and interpretation times for each sentence were recorded and aggregated. The read time for summarized sentences was found to be less than that for unsummarized sentences, which was confirmed through the Wilcoxon signed-rank test.

The interpretation times for the unsummarized and summarized sentences were also examined. While the summarized versions were lower on average, the difference was not statistically significant: the hypothesis that the true location shift is zero could not be rejected at the α = 0.05 level, i.e. the median interpretation times for unsummarized and summarized sentences are similar.

Future work in this area should explore alternative optimization methods for selecting Emoji replacements for N-gram tokens. A global approach may yield better results in terms of minimizing the error, but may increase computational time. The difference in interpretation times between unsummarized and summarized sentences should also be investigated further.