
1 Introduction

Metadata and tagging are key factors in digital libraries. They are used to describe and organize resources [21], allowing library users to effectively locate and retrieve digital items. On the other hand, manually describing digital items, such as images, is time-consuming and subject to human interpretation. In many digital library collections the text that describes images is poor and limits retrieval effectiveness, because the terms users employ to locate images differ from the limited, or even irrelevant, text used to describe those images [16]. Thus, alternative techniques that can reduce human subjectivity and enrich image descriptions in digital libraries are highly desirable, and the need for them has led to a dedicated research field known as Automatic Image Annotation.

Social media, and especially Instagram, contain huge amounts of images that are commented through hashtags by their creators/owners [6]. In a previous work [9] we found that about 55% of Instagram hashtags are directly related to the visual content of the photos they accompany. Since then, in a series of studies we have proposed several Instagram hashtag filtering techniques to effectively identify those relevant hashtag subsets [8, 10]. An innovative topic modelling scheme is one of our newest developments towards that aim [2, 20].

Probabilistic topic model (PTM) algorithms can discover the main themes in a vast and unstructured collection of documents, so we can use PTM to organize documents based on the identified themes. PTM is suitable for many kinds of data, and researchers have used these algorithms to locate patterns in genetic data, images, and social networks. Topic modelling is an effective way to organize and summarize electronic archives at a scale impossible to achieve with human annotation [4].
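To illustrate the idea, the following minimal sketch (added purely for illustration and not part of the present study) trains a classic LDA topic model with the gensim Python library on a few toy token lists; the documents, number of topics and parameter values are assumptions.

```python
# Illustrative-only sketch of probabilistic topic modelling with gensim's LDA
# on toy "documents" (token lists); all data and parameters are placeholders.
from gensim import corpora, models

docs = [["dog", "puppy", "walk", "park"],
        ["guitar", "music", "band", "stage"],
        ["dog", "pet", "cute", "puppy"],
        ["piano", "music", "concert", "band"]]

dictionary = corpora.Dictionary(docs)               # token <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=1)
for topic_id in range(2):
    # show_topic returns the top words of a topic with their probabilities
    print(lda.show_topic(topic_id, topn=4))
```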

Let us now assume that we want to create a set of Instagram photos in order to collect their hashtags and locate the relevant photos. Instagram offers a search box where one can search by account name and by hashtag. So, we queried with a specific hashtag (e.g. #dog), which in the current work we name the subject. The hashtags accompanying the images were collected automatically using the Beautiful Soup library of Python. The hashtags of an Instagram image can be seen as a textual representation of it, and in this way Instagram hashtag collections of images can be treated as textual documents. We can therefore analyse them via topic modelling techniques once textual preprocessing, such as word splitting, is applied. Since topic modelling can measure the most relevant terms of a topic, we can assume that by applying topic modelling to the hashtag sets [20] we can derive a set of terms that best describes the collection of images for a specific subject.
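As a rough illustration of this collection step (the exact crawler used in our study is not reproduced here), the following sketch extracts hashtags from the saved HTML of a post with Beautiful Soup and a regular expression; the sample snippet and the helper name extract_hashtags are hypothetical.

```python
# A minimal, illustrative sketch of hashtag extraction (not the study's crawler):
# parse a saved post page with Beautiful Soup and pull hashtags out of its text.
import re
from bs4 import BeautifulSoup

def extract_hashtags(html: str) -> list[str]:
    """Return all hashtags found in the visible text of an HTML snippet."""
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    return ["#" + tag for tag in re.findall(r"#(\w+)", text)]

sample_html = "<div><p>Morning walk #dog #puppy #park</p></div>"  # hypothetical snippet
print(extract_hashtags(sample_html))  # ['#dog', '#puppy', '#park']
```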

Word clouds are used to depict word frequencies derived from a text or a set of text documents. The size of each depicted word in the cloud depends on its frequency: words that occur often are shown larger than rarely occurring words, while stopwords are removed. Thus, a word cloud can be seen as a synopsis of the main themes contained in textual information [3, 13]. Word clouds have become popular in practical situations and are commonly used for summarizing sets of reviews presented as free texts (i.e., “open questions”).

In order to construct a classic word cloud it is necessary to calculate the word frequencies in a text or set of texts. However, word frequencies can be replaced by any other measure that reflects the importance of a word in a text document. In that respect, word clouds can be used for the visualisation of topics derived from a collection of texts. Topic models infer probability distributions from frequency statistics, which can reflect co-occurrence relationships of words [7]. Through topic modeling we can reveal the subject of a document or a set of documents and present in a summarized fashion what the document(s) is/are about. This is why topic modeling is, nowadays, a state-of-the-art technique to organize, understand and summarize large collections of textual information [1].
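The following minimal sketch (illustrative only; it assumes the freely available wordcloud Python package) shows that the weights driving a word cloud can be plain frequencies or any other word-to-importance mapping, such as topic-word probabilities.

```python
# Illustrative sketch: a word cloud driven by a {word: weight} dictionary.
# Plain frequencies are used here, but topic-word probabilities or any other
# importance score can be supplied in exactly the same way.
from collections import Counter
from wordcloud import WordCloud, STOPWORDS

text = "dog puppy walk park dog puppy cute pet dog the the a"
frequencies = Counter(w for w in text.split() if w not in STOPWORDS)

cloud = WordCloud(width=600, height=300, background_color="white")
cloud.generate_from_frequencies(frequencies)  # accepts any {word: weight} dict
cloud.to_file("example_wordcloud.png")
```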

In this paper we investigate how the crowd and students understand the topics derived from the hashtag sets of Instagram photos that were grouped together by a common query hashtag, which we call the subject. The topics are illustrated as word clouds with the queried hashtag (subject) hidden, and the participants are asked to guess the hidden hashtag by providing their best (up to four) guesses. The aim of the current work is, first, to compare the interpretation accuracy of the crowd and the students for the topic models we created from Instagram hashtags and, second, to investigate whether there is a significant correlation in the way the crowd and the students interpret the word clouds of Instagram hashtags. If the crowd’s and the students’ choices coincide with the subject of the word cloud, we have a good indication that the word cloud words are, indeed, related to the subject. We believe that through this meta-analysis we gain useful insights on whether we can use words mined from Instagram hashtags [9] as description metadata for digital libraries. To the best of our knowledge this is the first study that examines how to locate the relevant Instagram hashtags for image metadata description in digital libraries.

2 Related Work

Ibba and Pani, in their research on formalizing knowledge through the creation of a metadata taxonomy, developed a method to integrate and combine Instagram metadata and hashtags [12]. They also mention our previous finding [9] that 55% of Instagram hashtags are related to the visual content of the image, but they do not analyse how to locate only the relevant hashtags. Sfakakis et al. [18] propose document subject indexing with the help of topic modeling and automated labeling processes. The authors applied LDA topic modelling to a corpus of papers in order to produce the topic models. To evaluate the resulting topic models, they asked an expert to label the same corpus of papers and concluded that the human labeling is similar to the topic modelling output.

Suadaa and Purwarianti [19], aiming to solve the problem of document classification, examined a combination of LDA and Term Frequency-Inverse Cluster Frequency. To conduct their experiment they used Indonesian digital library documents: 113 documents from the digital library of STIS and 60 documents from the digital library of ITB. The researchers propose that LDA in combination with Term Frequency-Inverse Cluster Frequency is the best option for labeling. Hall et al. [11] focus on automatic clustering techniques and whether they can be used to support the exploration of digital libraries. The researchers investigated LDA, K-Means and OPTICS clustering using a collection of 28,133 historical images with metadata provided by the University of St Andrews Library. They created models for the three aforementioned algorithms based on photo titles and descriptions, and those models were evaluated by the crowd. The authors concluded that LDA-based models can be applied in large digital libraries.

Rohani et al. [17] used topic modeling to extract topic facets from a dataset consisting of 90,527 records related to the domain of aviation and airport management. They developed an LDA topic modeling method, while the data were pre-processed by removing punctuation and stop words. They identified five main topics and then examined which of the topics was dominant on each date. The performance of topic modelling was qualitatively evaluated by domain experts, who were asked to investigate the detected topics along with the discovered keywords and compare the results with their own interpretation of the top topics of the studied dataset. The topics assigned by the domain experts were similar to those of the LDA topic modeling.

The previous discussion shows that topic modelling is a suitable technique to locate/derive appropriate summary descriptions and/or tags for documents and images. Word clouds have been mainly used for visualisation purposes, but the appropriateness of this visualisation format was never assessed. Thus, in addition to the application perspective of our work, which emphasizes mining terms from Instagram hashtags for image tagging, the crowd-based and student-based meta-analysis of word clouds also provides useful insights about their appropriateness for topic visualisation. Some of the reported works applied topic modelling to summarize textual information using the classic LDA approach. Our topic modeling algorithm [20] is quite different and tailored to the specific case of Instagram posts: photos and associated hashtags are modelled as a bipartite network, and the importance of each hashtag is computed via its authority score obtained by applying the HITS algorithm [10].

3 Word Cloud Creation

As already mentioned, the main purpose of the current work is to investigate and discuss the crowd-based and student-based interpretation of word clouds created from Instagram hashtags. A dataset of 520 Instagram posts (photos along with their associated hashtags) was created by querying with 26 different hashtags (see Table 1), which in the context of the current work are called subjects. For each subject we collected 10 image posts visually relevant to the subject and 10 visually irrelevant ones (images and associated hashtags), leading to a total of 520 images (260 relevant and 260 non-relevant) and 8199 hashtags (2883 for relevant images and 5316 for non-relevant images).

All collected hashtags underwent preprocessing so as to derive meaningful tokens (words in English). Instagram hashtags are unstructured and ungrammatical, and it is important to use linguistic processing to (a) remove stophashtags [8], that is, hashtags that are used to fool the search results of the Instagram platform, (b) split a composite hashtag into its constituent words (e.g. the hashtag ‘#spoilyourselfthisseason’ should be split into four words: ‘spoil’, ‘yourself’, ‘this’, ‘season’), (c) remove stopwords that are produced in the previous stage (e.g. the word ‘this’ in the previous example), (d) perform spelling checks to account for (usually intentionally) misspelled hashtags (e.g. ‘#headaband’, ‘#headabandss’ should be changed to ‘#headband’), and (e) perform lemmatization to merge words that share the same or similar meaning. Preprocessing was conducted with the help of the Natural Language ToolKit (NLTK - https://www.nltk.org/), Wordnet and custom Python code.
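The sketch below gives a rough, illustrative picture of steps (a)-(c) and (e) using NLTK; the greedy dictionary-based splitter and the stophashtag list are simplifying assumptions rather than our exact implementation, and the spelling-check step (d) is omitted.

```python
# Illustrative preprocessing sketch (not the exact pipeline of the study):
# drop stophashtags, greedily split composite hashtags against the NLTK word
# list, remove stopwords and lemmatize with WordNet.
import nltk
from nltk.corpus import stopwords, words
from nltk.stem import WordNetLemmatizer

for pkg in ("stopwords", "words", "wordnet"):
    nltk.download(pkg, quiet=True)

VOCAB = {w.lower() for w in words.words()}
STOPWORDS = set(stopwords.words("english"))
STOPHASHTAGS = {"followme", "like4like"}          # illustrative stophashtag list
lemmatizer = WordNetLemmatizer()

def split_hashtag(tag):
    """Greedy longest-match split of a composite hashtag (simplified)."""
    tokens, i = [], 0
    while i < len(tag):
        for j in range(len(tag), i, -1):          # try the longest candidate first
            if tag[i:j] in VOCAB:
                tokens.append(tag[i:j])
                i = j
                break
        else:                                     # no dictionary word found
            tokens.append(tag[i:])
            break
    return tokens

def preprocess(hashtags):
    tokens = []
    for tag in hashtags:
        tag = tag.lstrip("#").lower()
        if tag in STOPHASHTAGS:                   # (a) drop stophashtags
            continue
        for tok in split_hashtag(tag):            # (b) split composite hashtags
            if tok in STOPWORDS or len(tok) < 3:  # (c) drop stopwords/short tokens
                continue
            tokens.append(lemmatizer.lemmatize(tok))  # (e) lemmatize
    return tokens

print(preprocess(["#spoilyourselfthisseason", "#dog", "#followme"]))
```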

After finishing all pre-processing steps we ended up with a token set for each one of the 520 Instagram photos. Instagram photos and the associated hashtag sets belonging to a common subject were grouped together and modeled as a bipartite network. Then, topic models were created for each one of the subjects following the approach described in [20]. A total of 52 different topic models (26 relevant and 26 irrelevant) were developed. The importance of each token within a topic model was assessed by applying the HITS algorithm as described in [10]. For each one of the topics a word cloud was created. The token corresponding to the associated subject (query hashtag) was excluded in order to examine whether the crowd and students would guess it correctly (see Sect. 4 for the details). Word cloud visualization was done with the help of the WordCloud Python library.
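A condensed sketch of this step is given below: posts and tokens form a bipartite graph in networkx, token importance is taken as the HITS authority score, and those scores drive the word cloud. The toy posts and the hiding of the subject token are for illustration only; the sketch is not our exact implementation of [10, 20].

```python
# Illustrative sketch of the word-cloud creation step (not the exact code of
# the study): bipartite post-token network, HITS authority scores as token
# importance, and a word cloud sized by those scores.
import networkx as nx
from wordcloud import WordCloud

subject = "dog"                                   # the query hashtag to hide
posts = {                                         # post id -> preprocessed tokens (toy data)
    "post1": ["dog", "puppy", "walk", "park"],
    "post2": ["dog", "puppy", "cute", "pet"],
    "post3": ["dog", "pet", "park"],
}

# Bipartite network: post nodes on one side, token nodes on the other.
G = nx.Graph()
for post_id, tokens in posts.items():
    for token in tokens:
        G.add_edge(("post", post_id), ("token", token))

# HITS: the authority score of a token node measures its importance.
_, authorities = nx.hits(G, max_iter=1000)
token_scores = {node[1]: score for node, score in authorities.items()
                if node[0] == "token" and node[1] != subject}  # hide the subject

# The word cloud is sized by authority score instead of raw frequency.
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(token_scores)
cloud.to_file(f"{subject}_wordcloud.png")
```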

Fig. 1. Question examples for Appen and Moodle

4 Interpretation of Word Clouds

Crowd-based interpretation of word clouds was conducted with the aid of the Appen crowdsourcing platform (see Fig. 1a) and student-based interpretation was performed with the aid of the learning platform Moodle (see Fig. 1b). We chose crowd-based interpretation because we wanted to take advantage of collective intelligence; the viability of crowdsourced image annotation has been examined and verified by several researchers [5, 14, 15]. The student-based interpretation was conducted by undergraduate students of the Department of Communication & Internet Studies of the Cyprus University of Technology. The students received training before annotating the word clouds: during a course they were introduced to word cloud creation and topic modelling.

The word clouds were presented to the crowd participants, who were asked to select one to four of the subjects that best match the shown word cloud according to their interpretation. The participants were clearly informed that the token corresponding to the correct subject was not shown in the cloud. The same questions were presented to the students, who also had to choose between one and four subjects that best match the word cloud they saw; likewise, we informed the students that the correct subject was not included in the word cloud.

For the crowd, every word cloud was judged by at least 30 annotators (contributors in Appen’s terminology), while eight word clouds were also used as ‘gold questions’ for quality assurance, i.e., identification of dishonest annotators and task difficulty assessment. The correct answer(s) for the gold clouds were provided to the crowdsourcing platform and all participants had to judge those clouds. However, the gold clouds were presented to the contributors in random order, so they could not know which of the clouds were the gold ones. A total of 165 contributors from more than 25 different countries participated in the experiment. The cost per judgement was set to $0.01 and the task was completed in less than six hours. A total of 25 student annotations were collected.

Not all word clouds present the same difficulty in interpretation. Thus, in order to quantitatively estimate that difficulty per subject we used the typical accuracy metric, that is, the percentage of correct subject identifications by the crowd and the students. By correct identification we mean that a contributor or a student had selected the right subject within her/his one to four choices. We see, for instance, in Table 1 that the crowd accuracy for the guitar word cloud is 93%. This means that 93% of the contributors included the word ‘guitar’ in their interpretation of that word cloud, regardless of the number (1 to 4) of choices they made. The same accuracy metric was also employed for the irrelevant word clouds. For instance, 44% of the students chose ‘lion’ in their interpretation of the irrelevant word cloud that was derived from posts retrieved with the hashtag #lion whose images were visually irrelevant to that subject.
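A minimal sketch of this accuracy computation is given below (illustrative; the function name and the example figures are hypothetical): a judgement counts as correct when the hidden subject appears among an annotator’s one to four choices.

```python
# Illustrative sketch of the accuracy metric: the share of annotators whose
# one-to-four guesses include the hidden subject of the word cloud.
def interpretation_accuracy(subject, annotations):
    """annotations: one list of up to four guessed subjects per annotator."""
    correct = sum(1 for guesses in annotations if subject in guesses)
    return 100.0 * correct / len(annotations)

# Hypothetical example: 28 of 30 contributors list 'guitar' among their guesses.
judgements = [["guitar", "piano"]] * 28 + [["piano"], ["microphone", "laptop"]]
print(round(interpretation_accuracy("guitar", judgements)))  # 93
```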

Table 1. Topic identification accuracy for word clouds created using visually relevant (Relev.) and irrelevant (Irre.) Instagram photos
Table 2. Summary statistics for the accuracy of identification
Table 3. Independent samples t-test, \(N=26\) subjects in all cases

5 Results and Discussion

The accuracy of interpretation for all word clouds is presented in Table 1, while summary statistics are presented in Table 2. In order to better facilitate the discussion that follows, the subjects (query hashtags) were divided into six categories: (a) Music: Guitar, Piano, Microphone; (b) Wild animals: Bear, Elephant, Giraffe, Lion, Monkey, Zebra; (c) Fashion: Dress, Hat, Headband, Shirt, Sunglasses; (d) Office: Chair, Laptop, Table; (e) Pets: Cat, Dog, Fish, Hamster, Parrot, Rabbit, Turtle; (f) Miscellaneous: Hedgehog, Horse.

In order to answer the main research questions of our study we formulate three null hypotheses as follows:

\(H_{01}\): There is no significant difference in the interpretation, by the trained students, of word clouds of hashtag sets mined from relevant and irrelevant images.

\(H_{02}\): There is no significant difference in the interpretation, by the generic crowd, of word clouds of hashtag sets mined from relevant and irrelevant images.

\(H_{03}\): There is no significant correlation in the way the generic crowd and the trained students interpret the word clouds mined from Instagram hashtags.

Table 3 presents the results of the t-test, conducted with SPSS, comparing the interpretation accuracy for the relevant and irrelevant word cloud conditions for both the crowd and the students. There was a significant difference in the accuracy scores between relevant (Mean Crowd = 68%, Mean Student = 58%) and irrelevant (Mean Crowd = 33%, Mean Student = 45%) word clouds. Thus the null hypotheses \(H_{01}\) and \(H_{02}\) are rejected at a significance level \(a=.003\) for the students and \(a=.001\) for the crowd.

Regarding the third null hypothesis, for a significance level \(a = 0.01\) the critical value for the correlation coefficient (two-tailed test, \(df=50\)) is \(r_c = 0.354\). By computing the Pearson correlation coefficient between the mean accuracy values per subject of the crowd and the students we find \(r = 0.861\). Thus \(r>r_c\) and the null hypothesis \(H_{03}\) is rejected at a significance level \(a=0.01\), denoting that the way word clouds are interpreted by the trained students and the crowd is highly correlated.
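The correlation check can be reproduced along the following lines (an illustrative sketch; the short accuracy arrays are placeholders standing in for the 52 per-subject values of Table 1, not the actual data).

```python
# Illustrative sketch of the correlation test between crowd and student
# accuracy; the short arrays below are placeholders, not the Table 1 values.
from scipy import stats

crowd_accuracy = [93, 57, 33, 7, 68, 44]      # placeholder values (percent)
student_accuracy = [92, 92, 36, 40, 58, 44]   # placeholder values (percent)

r, p_value = stats.pearsonr(crowd_accuracy, student_accuracy)
print(f"r = {r:.3f}, p = {p_value:.4f}")      # reject H03 when p < 0.01
```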

We see in Table 1 that the interpretation accuracy varies within and across categories. As we explain later through specific examples, there are three main parameters which affect the difficulty of interpretation. The first one is the conceptual context of a specific term. It is very easy, for instance, to define a clear conceptual context for the term fish, but very difficult to define clear conceptual contexts for terms such as hat and hedgehog. This difficulty is, obviously, reflected in the use of hashtags that accompany photos presenting those terms. As a result, the corresponding word clouds do not provide the textual context and hints that allow their correct interpretation. Thus, the textual context and key tokens in the word cloud constitute the second parameter affecting the difficulty of interpretation. The third parameter is familiarity with the concept: concepts such as dog, cat and horse are far more familiar to everyday people and students than concepts such as hedgehog and hamster.

In the following we present and discuss some representative/interesting examples for each one of the six categories mentioned above.

The word clouds in the Music category have very high interpretation accuracy scores. Music-related terms share a strong conceptual context, which results in clear textual contexts in the Instagram hashtags. In Fig. 2a we see the word cloud for the subject ‘microphone’. Tokens like band, singer, music and stage create a strong and clear textual context. Thus, the annotators (57% of the crowd and 92% of the students) correctly chose microphone to interpret the word cloud. Moreover, the tokens of the ‘microphone’ word cloud also led the crowd and the students to choose guitar and piano.

Fig. 2. Word clouds for the ‘microphone’ and ‘monkey’ subjects

The monkey word cloud (see Fig. 2b) was in fact a confusing one. The most prominent tokens were art, animal and nature, while some other terms such as artist, artwork, and work could also confuse the crowd and the students. As a result, the accuracy for that word cloud is 33% for the crowd and 36% for the students.

In the case of the subject ‘hat’ (see the word cloud in Fig. 3) we have a situation where there are many different conceptual contexts. As a result, the hashtags appearing in different Instagram photos differ significantly and the resulting word cloud is confusing. We see that the most prominent tokens in the cloud are blogger, style, sun, and beach (obviously these are concepts shown in some of the Instagram photos grouped under the subject ‘hat’). There is no doubt that the subject ‘hat’ fits well with those terms. However, the same terms fit equally well or even better with other subjects such as ‘sunglasses’ and ‘dress’, and as a result the accuracy was not high for either group (7% for the crowd and 40% for the students).

Fig. 3. Word clouds for the subjects ‘hat’ and ‘chair’

Fig. 4. Word clouds for the subjects ‘hedgehog’ and ‘hamster’

The case of hedgehog is a classic example showing that familiarity with a concept affects the difficulty of interpreting the word cloud derived from Instagram hashtags. While in the word cloud (see Fig. 4a) the words pygmy, pet and animal are by far the most important ones, none of the participants selected the right subject. It appears that the contributors and students were not familiar with the word pygmy; the African pygmy hedgehog is the species often kept as a pet.

6 Conclusion

In the current work we have presented a crowd-based and student-based interpretation of word clouds created from Instagram hashtags. The main purpose was to examine whether we can locate, for image metadata description, appropriate tags from Instagram photos that share (and are grouped together by) a common hashtag (called the subject in the current work). A statistically significant difference between the interpretation accuracy of relevant and irrelevant word clouds was found. This means that Instagram images of similar visual content share hashtags that are related to the subject. In addition, we found a strong correlation between the interpretations of the trained students and the generic crowd, denoting that no specific training is necessary to mine relevant tags from Instagram to describe photos. Moreover, since the interpretation accuracy of the generic crowd and the trained students is comparable, we have an indication that these hashtags can indeed describe an image.

In the results analysis we concluded that there is significant variation in the difficulty of interpretation of word clouds corresponding to different terms, and we named three parameters affecting this interpretation: conceptual context, textual context and familiarity with the concept. Terms that have a clear conceptual context (‘fish’, ‘guitar’, ‘laptop’) can be easily identified. On the contrary, terms without a clear conceptual context, like ‘hat’, confused the students and the crowd. In addition, terms like ‘hedgehog’, with which the students and the crowd were not familiar, were difficult to interpret. The main conclusion is that we can use topic modelling to mine information from Instagram hashtags for image description metadata.