
This chapter will first provide an introduction to information retrieval (IR) in general, before briefly explaining the research field of music information retrieval (MIR). Hereafter, we will discuss why and how social media mining (SMM) techniques can be beneficially employed in the context of MIR. More precisely, motivations for the common MIR tasks of music similarity computation, music popularity estimation, and auto-tagging music will be provided, and the current state of the art in applying SMM techniques to these three tasks will be elaborated.

Developing music similarity measures is an important task in MIR, as such measures are a key ingredient for music recommendation systems, automated playlist generators, and intelligent browsing interfaces, among other applications. In this chapter, it will be shown how to infer music similarity information from microblogs, collaborative tags, web pages, playlists, and peer-to-peer networks. Estimating the popularity of a music item is obviously important for the music industry, but also for creating serendipitous music retrieval and recommendation systems. Therefore, approaches that derive such information from web page counts, geo-located microblogs, a peer-to-peer network, and a social music platform will be reviewed. Finally, different music auto-tagging methods that assign semantic labels to music pieces will be presented. In particular, computational approaches that rely on machine learning techniques, as well as human-centred strategies that infer tags directly from some kind of user input (e.g. “games with a purpose”), will be addressed.

1 Introduction to Information Retrieval

The discipline of information retrieval (IR) is a mature field of research, with early work dating back to the 1950s, for instance [59]. Since I can only give a very brief introduction to this exciting field here, the interested reader is referred to one of the many excellent books that offer comprehensive coverage of IR. I personally recommend [22] for an introduction and [3] and [8] for more comprehensive coverage.

Broadly speaking, IR is concerned with elaborating and testing methods to uncover information from potentially large corpora of text (traditional IR) or (more recently) multimedia, in response to the user’s expression of an information need. This information need is usually given as a text query, the classical example being a user who types a query string into his or her preferred search engine. Texts are most frequently organised in the form of documents, although other representations exist. Hence, it is usually also documents that are returned in response to a query to a search engine.

In order to promptly provide search results for the millions of queries issued every hour to major search engines, enormous amounts of computational power are required. But of no lesser importance are highly efficient representations of the documents. For this purpose, an inverted index is commonly created from the documents. Such an inverted index stores, for each term t, either a list of documents in which t occurs or a list of documents together with the precise positions of t within each document. The former is referred to as document-level inverted index, record-level inverted index, inverted file index, or just inverted file; the latter is typically named full inverted document index, word-level inverted index, full inverted index, or inverted list. The major advantage of a full inverted index is that it allows for phrase search, that is, finding an exact phrase within a document, not only a single term. The two variants of the mapping implemented by the two flavours of indexes can be written as follows:

$$ t \rightarrow \langle d_1, d_2, \ldots, d_n \rangle \quad \text{(document-level index)} $$
$$ t \rightarrow \langle (d_1; p_{1,1}, p_{1,2}, \ldots),\; (d_2; p_{2,1}, \ldots),\; \ldots \rangle \quad \text{(word-level index)} $$
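To make the two flavours concrete, the following minimal Python sketch (with toy whitespace tokenisation; all names are illustrative, not taken from any particular IR system) builds both index variants and uses the word-level index for phrase search:

```python
from collections import defaultdict

def build_indexes(docs):
    """Build a document-level and a word-level inverted index.

    docs: dict mapping a document id to its text.
    Returns (doc_level, word_level): doc_level maps each term to the set
    of documents containing it; word_level maps each term to a dict of
    document -> list of term positions.
    """
    doc_level = defaultdict(set)
    word_level = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            doc_level[term].add(doc_id)
            word_level[term][doc_id].append(pos)
    return doc_level, word_level

def phrase_search(word_level, phrase):
    """Find documents containing the exact phrase (needs the word-level index)."""
    terms = phrase.lower().split()
    # Candidate documents must contain every term of the phrase.
    candidates = set.intersection(*(set(word_level[t]) for t in terms))
    hits = []
    for doc_id in candidates:
        # The phrase occurs if the i-th term appears i positions after the first.
        for start in word_level[terms[0]][doc_id]:
            if all(start + i in word_level[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits
```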

If the user now wants to search for a particular topic, expressed as a query q, the retrieval system computes a matching score between q and the indexed documents D. A common approach is to compute term weights w(q, d) between q and each document d, which estimate the importance of the document for the query. The documents are then ranked with respect to w(q, d) and displayed to the user in descending order of term weight. This classical retrieval approach is often called the term vector model or vector space model. Since its proposal in 1975 by Salton et al. [71], many extensions as well as alternative retrieval approaches have been suggested. More recent methods include probabilistic retrieval [45] and graph-based models [7].
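As a rough illustration of vector space retrieval, the sketch below ranks documents against a query using TF·IDF term weights and cosine similarity as the matching score w(q, d); this is one common instantiation, not the only possible weighting:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF·IDF term-weight vectors for a corpus of tokenised documents."""
    N = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    return {d: {t: tf * math.log(N / df[t]) for t, tf in Counter(tokens).items()}
            for d, tokens in docs.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query_tokens, docs):
    """Return documents in descending order of similarity to the query."""
    N, vectors = len(docs), tfidf_vectors(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    q = {t: tf * math.log(N / df[t])
         for t, tf in Counter(query_tokens).items() if t in df}
    return sorted(docs, key=lambda d: cosine(q, vectors[d]), reverse=True)
```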

2 Music Information Retrieval at a Glance

Unlike traditional IR, music information retrieval (MIR) is a relatively young field of research, dating back only about a decade. An early and quite general definition of MIR, which highlights the multidisciplinarity of the field, is given by Downie in [17]:

MIR is a multidisciplinary research endeavour that strives to develop innovative content-based searching schemes, novel interfaces, and evolving networked delivery mechanisms in an effort to make the world’s vast store of music accessible to all.

A later definition given by Schedl [72] focuses on extracting and processing musical information on different levels and modalities:

MIR is concerned with the extraction, analysis, and usage of information about any kind of music entity (for example, a song or a music artist) on any representation level (for example, audio signal, symbolic MIDI representation of a piece of music, or name of a music artist).

Due to recent developments, such as audio and music streaming services (e.g. Spotify [41]), personalised web radio (e.g. last.fm [27]), and increasing use of multimedia data in social media, MIR has gained considerably in importance as a research field.

Although MIR is a highly multidisciplinary research field, including areas as diverse as music theory, library science, psychology, law, and artificial intelligence, one of its key goals is to better understand how humans perceive, create, process, and interact with music. Given its strong connection to computer science, MIR approaches to this broad goal typically involve elaborating computational models of music perception. These approaches commonly take as input the audio signal or other modalities of a music item and compute features that strive to describe particular aspects of the music item, for example, rhythm, harmony, or timbre.

Figure 1 depicts a schematic and simplified illustration of how a signal-based (content-based) audio feature extractor works. First, the audio signal is sampled and digitised, yielding a representation as pulse code modulation (PCM). For instance, when producing a compact disc, the sampling frequency is typically 44,100 Hz, and each sample is described via 16 bits. For a stereo recording, the data volume hence amounts to 176,400 bytes per second. The PCM representation is then split into (often overlapping) frames with a typical length of between \(2^8\) and \(2^{12}\) samples. Low-level features in the time domain can then be computed directly on these frames. To capture frequency information, alternatively, it is very common to apply a windowing function to each frame and subsequently compute the fast Fourier transform (FFT) [13], which converts the data from a time–amplitude representation into a frequency–magnitude representation. Hereafter, several post-processing steps are commonly performed, for instance, employing some psychoacoustic model of human auditory perception. Eventually, one regularly has to decide how to combine the features computed for each frame of a piece of music into a global representation. Methods range from computing simple statistical moments to complex time-series modelling via hidden Markov models (HMM) [5].

The computational features extracted via algorithms similar to the one just described can be used for a wide range of MIR tasks, for instance, to estimate similarity between music items, which in turn enables the creation of music recommendation systems, automatically generated playlists, and clustering-based user interfaces to music collections. If semantic labels describing the music items are available, another popular task is to automatically learn relations between audio features and semantic descriptors. This task is commonly referred to as auto-tagging.

The content-based feature extraction framework described above represents the traditional MIR strategy to computationally grasp aspects of a music item that should relate to human music perception. In the past few years, however, MIR has seen a paradigm shift to incorporate additional factors into computational models of music perception and description. In particular, contextual aspects of the music items and of the listener are increasingly taken into account. Integrating these with traditional content-based methods, Fig. 2 shows the three broad pillars from which perceptual music information can be extracted, according to [76].
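The frame-wise processing chain just described can be sketched in a few lines of Python (using numpy; the frame and hop sizes are illustrative choices within the ranges mentioned above, not prescribed values):

```python
import numpy as np

def frame_spectra(signal, frame_len=2048, hop=1024, sr=44100):
    """Split a mono PCM signal into overlapping frames, apply a Hann
    window, and compute the magnitude spectrum of each frame via the FFT.

    frame_len is chosen within the typical 2**8 .. 2**12 sample range.
    Returns (spectra, freqs): one magnitude spectrum per frame and the
    corresponding frequency bins in Hz.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = np.array([
        np.abs(np.fft.rfft(signal[i * hop:i * hop + frame_len] * window))
        for i in range(n_frames)
    ])
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    return spectra, freqs

def summarise(spectra):
    """A simple global representation via statistical moments:
    mean and standard deviation of each frequency bin over all frames."""
    return np.concatenate([spectra.mean(axis=0), spectra.std(axis=0)])
```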

Fig. 1 Basic scheme of an acoustic feature extractor

Fig. 2 Categorisation of computational aspects that influence music perception

Music content feature extractors derive information directly from the audio representation of a piece of music by applying signal processing techniques. Typical examples are features inferred from time-invariant Mel frequency cepstral coefficient (MFCC) representations of the audio signal, which serve to some extent to describe the coarse timbre of an audio signal. Overviews of common content-based extraction techniques are provided, for instance, in [9, 20, 57].

Music context refers to aspects that are not encoded in the audio signal (or cannot be extracted with current methods) but are nevertheless related to a music item. For instance, collaborative tags about a performer, the semantic meaning of song lyrics, or the political background of an artist fall into this category. More details on feature extraction and similarity estimation from the music context can be found, for example, in [75].

User context relates to personal properties, preferences, and feelings of the music listener. The user context hence includes the user’s mood, activities, friends, or level of musical training. Although these highly individual factors obviously influence music perception, MIR literature centred around the user is relatively sparse. Among the existing work, I would like to highlight the following: Cunningham et al. present an interesting study on why people dislike particular music [15]; Lee conducted a thorough analysis of natural language music queries [55]; and personalised and user-aware music retrieval and recommendation are treated, for example, in [4, 10, 81].

Finally, it is noteworthy that some aspects fall into more than one category. For example, song lyrics might be seen as belonging to the music content as they are obviously encoded in the audio signal. However, with current MIR techniques, it is impossible to extract and convert them to a semantically meaningful textual representation. On the other hand, many web pages list huge amounts of song lyrics, which makes it easy to extract them from a contextual data source. I therefore predominantly see them in the music context category. A similar overlap might occur for collaborative tags. One can argue that such tags are the outcome of many users and hence count them towards the user context. However, according to my categorisation, the user context refers to individual, personal factors of the user, not to user groups.

3 Social Media Mining in Music Information Retrieval

Usage of social media has seen a tremendous increase during the past couple of years. People create, modify, and most importantly share massive amounts of multimedia data (text, images, music, videos) on platforms such as Twitter [42], Facebook [34], last.fm [27], and YouTube [43].

As music plays a vital role in the lives of many people and everyone has an opinion about music, user-generated content related to music items such as artists, performers, songs, albums, or music videos is available in abundance. Given the remarkable commercial interest in music distribution and delivery, innovative music retrieval systems are becoming increasingly important. Such systems include personalised, user-aware music recommenders [4], automated playlist generators [64], and intelligent browsing interfaces [48] that transcend the traditional filtering-based browsing scheme based on an artist–album–track hierarchy.

Given the huge amount of user-generated data and the broad interest in music, elaborating sophisticated methods to mine social media content in order to derive semantic information about music and other media items is an active research endeavour. In the following, we will hence discuss the state of the art in three key areas of MIR where social media mining (SMM) can help improve upon traditional solutions. More precisely, the topics covered are how to compute similarities between music items such as songs or artists, how to estimate the popularity of a music item, and how to tag music items, that is, assign semantic labels to a piece of music, album, or artist.

3.1 Music Similarity Estimation

Computing similarity estimates between two music items (e.g. songs or artists) is an important task in MIR, as it enables, among other things, the automated creation of playlists, the recommendation of items similar to a user’s favourites, and the application of clustering techniques, which in turn allows for user interfaces that foster browsing music collections in an intuitive way.

An example of automated music playlist generation is [68], where content-based data and contextual data (extracted from music-related web pages) are combined to create seamless playlists. Pohle et al. aim at creating playlists in which consecutive tracks sound as similar as possible. Figure 3 shows a music browser entitled Traveller’s Sound Player, which allows users to interact with the generated playlists.

A user interface to music collections, named nepTune, is presented in [48], where Knees et al. extract audio features from digital audio files to train a self-organising map (SOM) [51]. The SOM uses similarities between feature representations of songs to cluster the music collection under consideration. The clusters are then visualised by first estimating the distribution of the data items over the map and subsequently using the estimated densities as height values to create a virtual landscape of the music collection. The landscape generated in this way can then be navigated in the manner of a computer game. Figure 4 shows a screenshot of the nepTune interface.

Fig. 3 Screenshot of the Traveller’s Sound Player interface for automated playlist generation

Various kinds of social media have been used to derive similarity scores between music items. In the following, we will particularly focus on methods that construct a similarity measure from user-generated shared playlists (e.g. available from Art of the Mix [30]) [2], shared folders in P2P networks [58], microblogs [80], and collaborative tags [19, 56]. Social media sources for collaborative tags include dedicated platforms such as last.fm or the recently quite popular “games with a purpose” [54, 61, 90].

Fig. 4 Screenshot of the nepTune browsing interface for music collections

According to the exploited data source and similarity computation strategy, the methods under discussion can be categorised into text-based (microblogs and collaborative tags) and co-occurrence-based (web pages, playlists, and shared folders in P2P networks) approaches, each of which requires different algorithms to construct a similarity measure. Evaluation, on the other hand, can be performed using the same techniques; most common are genre classification and comparison against human similarity judgements.

3.1.1 Text: Microblogs and Collaborative Tags

Text-based approaches to music similarity estimation typically approximate the similarity by employing the vector space model, which was introduced in Sect. 1. In the following, we will discuss how to derive similarity information from microblogs and from collaborative tags extracted from last.fm or gathered from “games with a purpose”.

Fig. 5 Artist similarity estimation from microblogs

3.1.1.1 Microblogs

A comprehensive study of different aspects of estimating music artist similarity from microblogs is presented in [74]. For the experiments reported in this chapter, the vector space model was applied, that is, each artist is modelled as a term weight vector in a high-dimensional feature space, and similarities between these term vector representations are calculated. An overview of the basic approach is depicted in Fig. 5. First, the Twitter API is used to retrieve microblogs for each artist in a given list of 3,000 music artists. The returned tweets for each artist are then concatenated, resulting in a virtual artist document, and a term vector representation for each artist is computed. The actual similarity estimate between two artists \(A_i\) and \(A_j\) is eventually obtained by calculating a similarity function \(S(A_i, A_j)\).

In the study presented in [74], several thousand combinations of the following individual aspects have been assessed:

  • Query scheme

  • Index term set

  • Term frequency (TF)

  • Inverse document frequency (IDF)

  • Normalisation with respect to document length

  • Similarity function

Evaluating different query schemes is motivated by the fact that earlier work in web-based MIR has shown an improvement in the accuracy of similarity estimates when adding music-related keywords to the search query (e.g. “music” or “music review”) [47, 78, 92]. Different index term sets, that is, lists of terms used to filter the microblogs and create the term weight vectors, have been assessed as well. The number of terms in the index term set corresponds to the dimensionality of the respective feature vectors (TF·IDF vectors). The term frequency \(r_{d,t}\) of a term \(t\) in a virtual artist document \(d\) estimates the importance \(t\) has for document \(d\), hence for the artist under consideration. The inverse document frequency \(w_t\) estimates the overall importance of term \(t\) in the whole corpus and is commonly used to weight the \(r_{d,t}\) factor, that is, to downweight terms that are important for many documents and hence less discriminative for \(d\). Performing this calculation for all terms in the index term set and for each virtual artist document results in one TF·IDF vector per artist. It is common to subsequently normalise the TF·IDF vectors with respect to document length. Finally, different similarity functions \(S(d_i, d_j)\) to estimate the proximity between the term vectors of two virtual artist documents \(d_i\) and \(d_j\) are examined.

As for evaluation, mean average precision (MAP) scores are computed on genre labels predicted by various classifiers. More precisely, given a query or seed artist, the retrieval task is to find artists of the same genre via similarity. MAP is simply computed as the arithmetic mean of the precision@k scores, that is, the average precision of k-nearest neighbour (kNN) classifiers for varying values of k.
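A minimal sketch of this evaluation protocol follows; the concrete values of k are illustrative assumptions, as the chapter text does not specify them:

```python
def precision_at_k(ranked_genres, seed_genre, k):
    """Fraction of the k nearest neighbours sharing the seed artist's genre."""
    return sum(1 for g in ranked_genres[:k] if g == seed_genre) / k

def mean_average_precision(queries, ks=(1, 5, 10, 20)):
    """queries: list of (ranked_genres, seed_genre) pairs, one per seed artist.
    Averages precision@k over the given k values, then over all queries."""
    aps = [sum(precision_at_k(r, s, k) for k in ks) / len(ks) for r, s in queries]
    return sum(aps) / len(aps)
```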

Although reporting all results of [74] is far beyond the scope of this chapter, I would like to summarise the most important findings in the following.

When dealing with microblogs, it is preferable to use only the artist name (no additional keywords) to query the Twitter API or, more generally, to select the tweets relevant to the artist under consideration.

Even though using all terms in the corpus yields the highest MAP values, the results are by far the most unstable ones. This means that slightly modifying a single other aspect can cause a significant decline in accuracy when using all terms in the corpus. Given the high computational complexity due to feature spaces of dimensionality greater than one million, employing no particular index term set is not favourable. On average, the best and most robust results were achieved using a dictionary of musical genres, musical instruments, and emotions, which was gathered from Freebase [35].

A simple binary match TF formulation should not be used. The most favourable algorithmic variants are logarithmic formulations and an adapted Okapi BM25 formulation [69, 70].

Among the IDF formulations, binary match yields the worst results. Formulations based on signal estimates and signal-to-noise ratios do not perform much better. Again, logarithmic formulations and the modified Okapi BM25 formulation yield top results.

Performing no normalisation for document length performs best, both in terms of accuracy and robustness. This is presumably due to the special characteristics of tweets, which are limited to 140 characters, a limit commonly exhausted by Twitter users. Normalisation hence does not improve results but increases computational costs.

Among the similarity functions investigated, the Jeffrey divergence-based function performs very well, while at the same time maintaining a reasonable level of stability. The Jaccard coefficient also performs remarkably well. Euclidean similarity was inferior in all combinations.

Overall, the best performing variants found in the experiments are given by the three term-weighting functions in Eqs. 1–3, in combination with the Jaccard coefficient similarity function (Eq. 4). In these equations, \(N\) represents the total number of documents in the corpus, \(f_{d,t}\) is the number of occurrences of term \(t\) in document \(d\), \(f_t\) denotes the total number of documents containing term \(t\), \(W_d\) is the document length of \(d\), and \({\mathcal{T}}_{d_1,d_2}\) denotes the set of distinct terms in documents \(d_1\) and \(d_2\).

$$ w_{d,t} = \log_e(1 + f_{d,t}) \cdot \log_e\frac{N - f_t}{f_t} $$
(1)
$$ w_{d,t} = \log_e(1 + f_{d,t}) \cdot \log_e\frac{N - f_t + 0.5}{f_t + 0.5} $$
(2)
$$ w_{d,t} = \left(1 + \log_e f_{d,t}\right) \cdot \log_e\frac{N - f_t}{f_t} $$
(3)
$$ \mathrm{sim}(d_1, d_2) = \frac{\sum_{t \in \mathcal{T}_{d_1,d_2}} w_{d_1,t} \cdot w_{d_2,t}}{W_{d_1}^2 + W_{d_2}^2 - \sum_{t \in \mathcal{T}_{d_1,d_2}} w_{d_1,t} \cdot w_{d_2,t}} $$
(4)
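A direct transcription of Eq. 1 and Eq. 4 into Python might look as follows (sparse dict vectors; the document lengths \(W_d\) are passed in, following the definition above):

```python
import math

def w_dt(f_dt, f_t, N):
    """Term weight of Eq. 1: logarithmic TF times a log-based IDF."""
    return math.log(1 + f_dt) * math.log((N - f_t) / f_t)

def jaccard_sim(w1, W1, w2, W2):
    """Jaccard coefficient similarity of Eq. 4.

    w1, w2: sparse term-weight vectors (dicts mapping term -> weight);
    W1, W2: document lengths W_d of the two virtual artist documents.
    Terms missing from either vector contribute zero to both sums.
    """
    shared = set(w1) & set(w2)
    dot = sum(w1[t] * w2[t] for t in shared)
    return dot / (W1 ** 2 + W2 ** 2 - dot)
```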
3.1.1.2 Collaborative Tags

User-generated tags assigned to music items are a valuable, albeit noisy, source for different MIR tasks, not least for similarity estimation and music retrieval purposes.

Geleijnse et al. gather tags from last.fm to generate a “tag ground truth” at the artist level [19]. The authors first filter redundant and noisy tags using the set of tags associated with tracks by the artist under consideration. The similarity between two artists is then estimated as the number of overlapping tags. Evaluation on a set of 1,995 artists, using last.fm’s similar artist function as ground truth, shows that the number of overlapping tags between similar artists is much larger than the overlap between arbitrary artists (about 10 vs. 4 tags after filtering). Another interesting observation is that the tags assigned to the largest number of artists fall into only three semantic categories: genres, personal references, and time periods (Table 1). The least frequent tags are shown in Table 2. They are often more prosaic, represent specific personal notes, or simply contain typos.

Table 1 Most popular tags and their artist frequencies, among a set of 1,995 artists
Table 2 Tags assigned by last.fm users only once, among a set of 1,995 artists

Another work on collaborative tags is [56], where Levy and Sandler construct a semantic space for music pieces based on tags retrieved from last.fm and MusicStrands [28], a web service (no longer in operation) that allowed users to share playlists. To this end, all tags found for a specific music piece are tokenised, and a document-term matrix based on TF·IDF weighting is created. Each track is hence represented by a term vector. For the TF part of the weighting, three different approaches are considered: using the number of users that applied the tag, ignoring the number of users (performing no TF weighting at all), and restricting the terms to adjectives by employing a part-of-speech (POS) tagger. Levy and Sandler further analyse the influence of applying latent semantic analysis (LSA) [16] to reduce the dimensionality of the feature space. The authors then compute the similarity between the resulting feature vectors using the cosine measure. For evaluation, the authors employ a retrieval scenario and report average precision values, judging a retrieved item as relevant if it has the same genre or artist label as the seed. Levy and Sandler find that using all terms (not only adjectives) is preferable, as is incorporating the number of users that applied the tag into the TF score.

It was in 2007 that the MIR community recognised the value of “games with a purpose” for MIR tasks. In that year, three papers proposing different music-tagging games appeared in the proceedings of the annual “International Society for Music Information Retrieval” (ISMIR) conference, the main scientific venue for MIR research. The principal motivation for such games is to let users solve tasks that are hard or even infeasible for a computer to perform, while at the same time being entertaining enough to attract many users and keep them playing. In the music domain, Law et al. present TagATune, a game for the semantic annotation of music and audio [54]. Two players are paired and then listen to the same piece of audio. They describe the audio by entering words, but are effectively guessing what their partner is thinking, because both players only score points if their tags match. If this is the case for one tag, the game proceeds to the next track. Even though TagATune was originally designed to harvest semantic descriptions of music and audio, it also implements a “comparison round”, where users are presented with three songs: one seed track and two alternatives to choose from. They then have to decide which of the alternatives sounds more similar to the seed song. From this kind of information, relative similarity judgements and in turn a similarity measure can be derived, as done by Law and von Ahn [53], Stober [88], and Wolff and Weyde [93], for instance.

A similar game, called Listen Game, is presented by Turnbull et al. in [90]. It aims at uncovering semantic relationships between words and music. Again, players are grouped and listen to the same songs. They subsequently have to choose from a list of words the one that best and the one that worst describes the song. Users get immediate feedback about which tags other players have chosen. From the data collected, Turnbull et al. employ the mixture hierarchies expectation maximisation (MH-EM) [91] algorithm to learn semantic associations between words and songs. These associations are weighted and can therefore be used to construct tag weight vectors for songs and in turn to define a similarity measure for retrieval [89].

Mandel and Ellis present another game for music annotation in [61]. It differs from the games presented so far in that it uses a more fine-grained scoring scheme. Players receive more points for new tags, to stimulate the creation of a larger semantic corpus. More precisely, a player who first uses a tag t to describe a particular song scores two points if t is later confirmed (used again) by another player. The third and subsequent players who use the same tag t do not receive any points. Thus, players who are the first to use a word t for tagging a particular song do not receive an immediate reward but will score two points as soon as another player has used t. The authors report the most popular tags confirmed by at least one user, summarised in Table 3. Compared to the top tags extracted from last.fm (Table 1), the tags originating from the tagging game more often describe instruments and the gender of the main performer.

Table 3 Most popular tags assigned by players of a “game with a purpose” on music annotation

3.1.2 Co-occurrences: Web Pages, Playlists, and P2P Networks

The family of co-occurrence approaches to music similarity estimation is based upon the assumption that two music items are more likely to be similar if they co-occur in the same document, for instance, a playlist, a web page, or a tweet.

3.1.2.1 Web Pages

In this vein, [78] defines the similarity of two artists as the conditional probability that one artist is found on a web page that is known to mention the other artist. This conditional probability can either be calculated on crawled web pages that relate to the artists under consideration or heuristically approximated using page count information from major search engines.

The former strategy, performing web crawls to infer similarities, is followed in [12] and [72, Chap. 3]. To this end, a certain number of top-ranked web pages returned by a search engine is retrieved for each artist \(A_i\). Subsequently, all pages fetched for \(A_i\) are searched for occurrences of all other artist names in the collection. The number of page hits represents a co-occurrence count that equals the document frequency of the artist term “\(A_j\)” in the corpus of web pages for artist \(A_i\). This count is expressed by the asymmetric function \(\mathrm{cooc}(A_i, A_j)\). A similarity score is then computed by relating this count to the total number of pages successfully fetched for artist \(A_i\). Symmetrising these scores for all pairs of artists eventually leads to the similarity function shown in Eq. 5. Please note that \(\mathrm{cooc}(A_i, A_i)\) and \(\mathrm{cooc}(A_j, A_j)\) refer to the total number of web pages successfully crawled for artists \(A_i\) and \(A_j\), respectively.

$$ \mathrm{sim}(A_i, A_j) = \frac{1}{2} \cdot \left[\frac{\mathrm{cooc}(A_i, A_j)}{\mathrm{cooc}(A_i, A_i)} + \frac{\mathrm{cooc}(A_j, A_i)}{\mathrm{cooc}(A_j, A_j)}\right] $$
(5)
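Assuming the asymmetric counts cooc(·,·) have already been collected from the crawled pages, Eq. 5 reduces to a one-liner; the nested-dict layout is an illustrative assumption:

```python
def crawl_similarity(cooc, i, j):
    """Symmetrised similarity of Eq. 5.

    cooc[i][j]: number of pages crawled for artist i that mention artist j;
    cooc[i][i]: total number of pages successfully crawled for artist i.
    """
    return 0.5 * (cooc[i][j] / cooc[i][i] + cooc[j][i] / cooc[j][j])
```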

The heuristic solution referred to at the beginning of this section is proposed in [78]. It relies solely on the page count estimates provided by a search engine. In short, the page count estimates for queries like "artist name i" or "artist name i"+"artist name j" are used to infer the relative frequency of both artists’ co-occurrence and in turn the conditional probability as indicated above. Equation 6 gives a formal representation of the symmetrised similarity function.

$$ \mathrm{sim}(A_i, A_j) = \frac{1}{2} \cdot \left[\frac{pc(A_i, A_j)}{pc(A_i)} + \frac{pc(A_i, A_j)}{pc(A_j)}\right] $$
(6)
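Analogously, Eq. 6 can be computed from cached page counts; the dictionary layout is again an illustrative assumption:

```python
def pagecount_similarity(pc, pc_pair, i, j):
    """Symmetrised page-count similarity of Eq. 6.

    pc[i]: page count for the query "artist i";
    pc_pair[(i, j)]: page count for the joint query "artist i"+"artist j".
    """
    joint = pc_pair[(i, j)]
    return 0.5 * (joint / pc[i] + joint / pc[j])
```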

Comparing the two strategies (web crawls and page count estimates) in terms of computational complexity, the former requires fewer requests to the search engine: the number of queries grows only linearly with the number of music items in the collection. In contrast, the second approach, which relies entirely on page count estimates from search engines, requires a number of queries that grows quadratically with the collection size. It is hence less suited for mid- and large-size music collections.

3.1.2.2 Playlists

Exploiting co-occurrence information from playlists to derive a similarity estimate between music items was probably first suggested in [66]. Pachet et al. consider radio station playlists from a French radio channel and compilation CDs from CDDB to extract co-occurrences between tracks and between artists. The authors count the number of co-occurrences of two artists (or pieces of music) \(A_i\) and \(A_j\) in the radio station playlists and compilation CDs. They define the co-occurrence of an entity \(A_i\) with itself as the number of \(A_i\)’s occurrences in the considered data source. To account for the different frequencies, that is, popularities, of songs or artists, the co-occurrence counts are normalised. Assuming that co-occurrence is a symmetric function, the similarity measure used by the authors is the same as given by Eq. 5.

Focusing on social media data, Baccigalupo et al. present an approach to derive artist similarity information from playlists shared by members of a web community [2]. They look at more than one million playlists made publicly available by MusicStrands [28] and extract the 4,000 most popular artists from the playlist set, measuring popularity as the number of playlists in which each artist occurs. They further take into account that two artists occurring consecutively in a playlist are probably more similar than two artists occurring farther apart. To this end, the authors define a distance function \(d_h(A_i, A_j)\) that counts how often a song by artist \(A_i\) co-occurs with a song by \(A_j\) at a distance of \(h\). Thus, \(h\) is a parameter that reflects the number of songs between the occurrence of a song by \(A_i\) and the occurrence of a song by \(A_j\) in the same playlist. The distance between two artists \(A_i\) and \(A_j\) is defined by Eq. 7, where the playlist counts at distances 0 (two consecutive songs by artists \(A_i\) and \(A_j\)), 1, and 2 are weighted with factors \(\beta_0\), \(\beta_1\), and \(\beta_2\), respectively. The authors empirically set the weights to \(\beta_0 = 1\), \(\beta_1 = 0.8\), and \(\beta_2 = 0.64\).

$$ \mathrm{dist}(A_i, A_j) = \sum_{h=0}^{2} \beta_h \cdot \left[d_h(A_i, A_j) + d_h(A_j, A_i)\right] $$
(7)
$$ \left\vert\mathrm{dist}\right\vert(A_i, A_j) = \frac{\mathrm{dist}(A_i, A_j) - \widehat{\mathrm{dist}}(A_i)}{\left\vert\max\left(\mathrm{dist}(A_i, A_j) - \widehat{\mathrm{dist}}(A_i)\right)\right\vert} $$
(8)
$$ \widehat{\mathrm{dist}}(A_i) = \frac{1}{n - 1} \cdot \sum_{j \in X} \mathrm{dist}(A_i, A_j) $$
(9)

To account for the popularity bias, that is, the fact that very popular artists co-occur with many other artists in many playlists simply because they are well known and often listened to by the average music listener, the authors perform normalisation according to Eq. 8, where \(\widehat{\mathrm{dist}}(A_i)\) denotes the average distance between \(A_i\) and all other artists (Eq. 9) and \(X\) is the set of the \(n - 1\) artists other than \(A_i\).
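Assuming the raw counts \(d_h\) have been gathered from the playlists, Eqs. 7–9 can be sketched as follows; reading the maximum in Eq. 8 as ranging over all artists \(A_j\) is an interpretation on my part:

```python
def dist(d_h, i, j, betas=(1.0, 0.8, 0.64)):
    """Raw playlist distance of Eq. 7.

    d_h(i, j, h): how often a song by artist i co-occurs with a song by
    artist j at distance h in the same playlist (an asymmetric count).
    """
    return sum(b * (d_h(i, j, h) + d_h(j, i, h)) for h, b in enumerate(betas))

def normalised_dist(dist_row):
    """Popularity normalisation of Eqs. 8 and 9 for one artist A_i.

    dist_row: dict mapping every other artist j to dist(A_i, A_j).
    """
    mean = sum(dist_row.values()) / len(dist_row)        # Eq. 9
    centred = {j: d - mean for j, d in dist_row.items()}
    scale = abs(max(centred.values()))                   # denominator of Eq. 8
    return {j: v / scale for j, v in centred.items()}
```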

3.1.2.3 Peer-to-Peer Networks

Peer-to-peer (P2P) networks represent another source of music-related data, since users of this kind of network are commonly willing to reveal meta-data about their shared content. For music files, the meta-data typically shared consists of filenames and ID3 tags. By analysing which items co-occur in a user’s shared folder, researchers have created music similarity measures.

Early work that makes use of data extracted from P2P networks includes [18, 58, 92], and [6]. These papers all extract data from the P2P network OpenNap to derive music similarity information.

Logan et al. [58] and Berenzweig et al. [6] report on having determined the 400 most popular artists on OpenNap in mid-2002. The authors harvested meta-data on shared content, which yielded about 175,000 user-to-artist relations from about 3,200 shared music collections. Logan et al. [58] especially highlight the sparsity of the OpenNap data in comparison with music content data. Logan et al. compare similarities defined by artist co-occurrences in OpenNap collections, by expert opinions from allmusic.com [29], by playlist co-occurrences from Art of the Mix, by data gathered from a web survey, and by audio feature extraction (MFCCs) [1]. They calculate a “ranking agreement score” by comparing the top N most similar artists according to each data source and calculating the pairwise overlap between the sources. Their main findings are that the co-occurrence data from OpenNap and from Art of the Mix show a high degree of overlap, that the experts from allmusic.com and the participants of the web survey agree moderately, and that the signal-based measure has a rather low agreement with all other sources (except in comparison to the allmusic.com data).

Whitman and Lawrence use a software agent to retrieve from OpenNap a total of 1.6 million user–song relations over a period of 3 weeks in August 2001 [92]. To alleviate the popularity bias, the authors use a similarity measure as shown in Eq. 10, where \(C(A_i)\) denotes the number of users that share songs by artist \(A_i\), \(C(A_i, A_j)\) is the number of users that have both artists \(A_i\) and \(A_j\) in their shared collection, and \(A_k\) is the most popular artist of the whole data set. The second factor (the right-hand part of the equation) downweights the similarity between two artists if one of them is very popular and the other is not.

$$ \mathrm{sim}(A_i, A_j) = \frac{C(A_i, A_j)}{C(A_j)} \cdot \left(1 - \frac{\left\vert C(A_i) - C(A_j)\right\vert}{C(A_k)}\right) $$
(10)
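In code, Eq. 10 is straightforward once the user counts are available (the data layout is illustrative):

```python
def p2p_similarity(C, C_pair, i, j, k):
    """Popularity-corrected similarity of Eq. 10.

    C[i]: number of users sharing songs by artist i;
    C_pair[(i, j)]: number of users sharing both artists i and j;
    k: index of the most popular artist A_k in the data set.
    """
    balance = 1.0 - abs(C[i] - C[j]) / C[k]
    return C_pair[(i, j)] / C[j] * balance
```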

In [18], Ellis et al. aim to build a ground truth for artist similarity estimation. The authors report on having extracted from OpenNap about 400,000 user-to-song relations, covering about 3,000 unique artists. Again, the co-occurrence data is compared with artist similarity data gathered by a web survey and with allmusic.com data. In contrast to Whitman and Lawrence, Ellis et al. take indirect links in allmusic.com’s similarity judgements into account. To this end, they propose a transitive similarity function on similar artists from the allmusic.com data, called the “Erdös distance”. More precisely, the distance \(d(A_i, A_j)\) between two artists \(A_i\) and \(A_j\) is measured as the minimum number of intermediate artists needed to form a path from \(A_i\) to \(A_j\). As this definition also allows deriving information on dissimilar artists (those with a high minimum path length), it can be employed to obtain a complete distance matrix.

A recent approach by Shavitt and Weinsberg derives similarity information at the artist and song levels from the Gnutella file-sharing network [84]. The authors collected meta-data of files shared by more than 1.2 million Gnutella users in November 2007, restricting their search to music files (MP3 and WAV). The crawl yielded a data set of 530,000 songs. Users and songs are jointly represented in a 2-mode graph, where a link between a song and a user is created if the user shares the song. Analysing the resulting network showed that most users of the P2P network share similar files.

Shavitt and Weinsberg further propose an approach to artist recommendation. To this end, they construct a user-to-artist matrix \(V\), where \(V(i, j)\) gives the number of songs by artist \(A_j\) that user \(U_i\) shares. They subsequently perform direct clustering on \(V\) using the k-means algorithm [60] with the Euclidean distance metric. Artist recommendation is then performed using either data from the centroid of the cluster to which the seed user \(U_i\) belongs or information about the nearest neighbours of \(U_i\) within that cluster.

In addition, Shavitt and Weinsberg address the problem of song clustering. To account for the popularity bias, the authors define a distance function that is normalised according to song popularity, as shown in Eq. 11, where \(uc(S_i, S_j)\) denotes the total number of users that share both songs \(S_i\) and \(S_j\). \(C_i\) and \(C_j\) denote, respectively, the popularity of songs \(S_i\) and \(S_j\), measured as their total occurrence in the data set:

$$ \mathrm{dist}(S_i, S_j) = -\log_2\left(\frac{uc(S_i, S_j)}{\sqrt{C_i \cdot C_j}}\right) $$
(11)
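Eq. 11 translates directly into code (the counts are assumed to have been extracted beforehand from the shared-folder crawl):

```python
import math

def song_distance(uc_ij, C_i, C_j):
    """Popularity-normalised song distance of Eq. 11.

    uc_ij: number of users sharing both songs S_i and S_j;
    C_i, C_j: total occurrences (popularities) of the two songs.
    """
    return -math.log2(uc_ij / math.sqrt(C_i * C_j))
```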

3.2 Music Popularity Estimation

Estimating the popularity of a music artist or song in a certain region of the world is an important task, and not only for the music industry. The cosmopolitan and culturally aware music aficionado is likely to be interested in which music is currently “hot” in different parts of the world, and artists themselves want to know where in the world their music is particularly (un)popular. Furthermore, popularity information can serve as an important component of serendipitous music retrieval systems [10, 81].

An artist’s or song’s popularity can be estimated via a wide variety of predictors, such as traditional charts (e.g. the “Billboard Hot 100” released weekly for the United States of America by Billboard Magazine [26]), microblogging activity, playcounts (e.g. from last.fm or YouTube), occurrences on web pages, and shared folder analysis in P2P networks.

Scientific work on this topic includes [73], where Schedl et al. compare different data sources for artist popularity estimation on a per-country basis. In [50], Koenigstein et al. analyse search queries issued within a P2P network to infer music popularity. Grace et al. compute popularity rankings from user comments in a social network [21].

Given the large interest record companies, producers, and artists have in this kind of information, it is not surprising that there also exist businesses specialising in music popularity measurement. Examples are Band Metrics [31] and BigChampagne Media Measurement [32]. Even though they obviously do not reveal details of their algorithms, it can reasonably be assumed that these companies harvest multiple data sources to create their predictors. The music information platform Echonest [25] even offers a public API function to retrieve a ranking based on the so-called “hottness” of an artist [24]. This ranking is based on editorial, social, and mainstream aspects [23]. However, this web service does not provide country-specific information, and Echonest is known to have a strong focus on the USA.

In the following, approaches that make use of social media to predict the popularity of an artist or a song will be presented and discussed. Properties of the data sources, such as particular biases, availability, noisiness, and time dependence, will also be addressed.

3.2.1 Data Sources for Popularity Estimation

The popularity of an artist or track can be defined at different levels of granularity (e.g. individual user, peer group, country, or cultural region). Incorporating previous approaches presented in [49, 50], Schedl et al. compare different ways to derive popularity information from various social media sources at the country level [79]. To this end, a framework is established that uses the following proxies for popularity:

  • Page counts of web pages

  • Artist occurrences in geo-located microblogs

  • Meta-data from folders shared in the Gnutella P2P network

  • Playcount data from the social music platform last.fm

The approaches proposed to compute popularity rankings from each data source are detailed below.

Another work that infers popularity information from social media is [21], where Grace et al. compute popularity rankings from user comments in the social network MySpace [40]. To this end, the authors apply various annotators to crawled MySpace artist pages in order to spot, for example, names of artists, albums, and tracks, as well as sentiments and spam. Subsequently, a data hypercube (OLAP cube) is used to represent structured and unstructured data and to project the data onto a popularity dimension. A user study showed that the list generated by this procedure was on average preferred to Billboard charts.

3.2.1.1 Web Page Counts

Page counts are gathered by querying the web search engines Google [37] and Exalead [33] for \(\left\langle \mathrm{artist,country}\right\rangle\) tuples. To guide the search towards musically relevant web pages and avoid distortions caused by artist names that equal common speech words (e.g. “Bush”, “Kiss”, “Hole”), the query scheme "artist name" "country name" music is employed. Furthermore, a factor resembling inverse document frequency (IDF) is used to downweight the popularity of artists who are popular everywhere in the world, since the aim is to uncover popular artists specific to each country. The final ranking score is calculated according to Eq. 12, where \(pc_{c,a}\) is the page count value returned for the country-specific query for artist \(a\) and country \(c\), \(\left\vert C\right\vert\) is the total number of countries for which data is available, and \(df_a\) is the number of countries in which artist \(a\) is known according to the data source (i.e. the number of countries with \(pc_{c,a} > 0\)).

$$ \mathrm{popularity}_{c,a} = pc_{c,a} \cdot \log_2\left(1 + \frac{\left\vert C\right\vert}{df_a}\right) $$
(12)
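A sketch of Eq. 12 (argument names are illustrative):

```python
import math

def popularity(pc_ca, n_countries, df_a):
    """Country-specific popularity score of Eq. 12.

    pc_ca: page count for the country-specific query (artist a, country c);
    n_countries: |C|, the number of countries for which data is available;
    df_a: number of countries in which artist a is known (page count > 0).
    """
    return pc_ca * math.log2(1 + n_countries / df_a)
```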
3.2.1.2 Geo-Located Microblogs

Microblogs are retrieved from Twitter using the search API and are then narrowed in two ways. First, only posts containing the hashtag #nowplaying are considered; this filtering is directly supported by the Twitter API. Second, the search is narrowed to a specific country. To this end, posts are categorised according to their location within a certain radius around the major cities of the world, and tweets are then aggregated to the country level. Scanning the retrieved microblogs for occurrences of the artists of interest and counting the number of their appearances for a given country \(c\) eventually yields a count equal to the term frequency \(tf_{c,a}\) of artist \(a\) in an aggregated document comprising all tweets gathered for cities in country \(c\). Equation 12 again gives the ranking score when \(pc_{c,a}\) is replaced with \(tf_{c,a}\).

3.2.1.3 P2P Network

Shared folder data from the P2P network Gnutella is extracted employing a two-stage process, similar to [49]: a crawler component discovers the highly dynamic network topology, and a browser queries the active nodes, corresponding to users, for meta-data of the files in their shared folders. The crawler treats the network as a graph and performs breadth-first exploration. Discovered active nodes are enqueued in a list that is processed by the browser. Shared digital content is associated with artists by matching the artist names of interest against the ID3 tags of shared music files. Since ID3 tags are occasionally missing or misspelled, artist names are also matched against the filenames. Creating popularity charts for specific countries requires determining the geographical location of the users. The necessary geo-identification process is based on IP addresses. First, a list of all unique IP addresses in the data set, typically over a million, is created. The IP addresses are then geo-located using the commercial IP2Location [39] database. Each IP address is hence annotated with a country code, a city name, and latitude–longitude coordinates. The geographical information obtained in this way pinpoints fans and enables tracking the spatial diffusion of artist popularity [50]. Aggregating the amount of digital content associated with each artist for the country under consideration yields the final ranking score.

3.2.1.4 Social Music Platform

As the last data source, artist popularity based on the user community of the social music platform last.fm is considered. Despite the issues of hacking and vandalism and a certain community bias [75], which are inherent to collaborative music information systems, the playcounts of last.fm users can be expected to reflect which music is currently popular in this community. First, the top 400 listeners of each country are gathered via the last.fm API. The most frequently played artists for each of these listeners are extracted subsequently. Aggregating these playcounts for each \(\left\langle \mathrm{artist,country}\right\rangle\) pair finally yields a popularity ranking.

3.2.2 A Multifaceted Comparison of Different Data Sources

It was shown in [79] that the popularity charts obtained from the different, inhomogeneous data sources do not correlate highly. Each data source hence covers different aspects of popularity, which indicates that estimating artist popularity is a multifaceted and challenging task, especially in the era of multichannel music distribution.

Table 4 Comparing different social media sources to infer popularity information

In an attempt to uncover the different dimensions of the five data sources (the four web and social media sources and traditional music charts), Table 4 compares them according to several criteria relevant to the task of popularity estimation. One issue is that certain approaches are prone to a specific bias. The average last.fm user, for instance, does not represent the average music listener of a country; that is, last.fm data is distorted by a community bias. The same holds for Twitter, which is biased towards artists with very active fans, whereas some very popular artists may have fans who use Twitter to a much lesser degree. Traditional charts are frequently biased towards record sales figures, which the music industry commonly uses as a proxy for popularity.

Another aspect is data availability. While page count estimates are available for all countries of the world, the approaches based on P2P and Twitter data suffer from a very unbalanced coverage across countries. Traditional music charts also vary strongly in terms of availability between countries. Table 5 shows the number of countries for which data could be extracted for each approach, as presented in [79]. Please note that these results are based on a list of 240 countries retrieved from last.fm.

A big advantage of traditional charts is their robustness against noise. In contrast, page count estimates are easily distorted by ambiguous artist or country names. Last.fm data suffers from hacking and vandalism [11], as well as from the unintentional input of wrong information and from misspellings.

Regarding time dependence, the data sources can be categorised into “current” and “accumulating”, depending on whether they reflect an instantaneous popularity or a general, time-independent popularity. This dimension also explains the largest overlap in popularity rankings between the investigated data sources, namely between the output of the page count predictor and the P2P rankings: the data sources behind both share an accumulating strategy of data storage. Twitter and last.fm, on the other hand, are more time dependent in that they better reflect the current “hotness” of an artist than his or her overall popularity.

Table 5 Availability of data for popularity estimation

3.3 Auto-tagging Music

Semantic labels attached to multimedia items, such as images, music pieces, or videos, have become an important means for users of social media to categorise and describe such items and to communicate particular opinions or feelings about them. The process of automatically attaching semantic labels, or tags, to music pieces is referred to as auto-tagging and is a rather recent research endeavour in MIR. Typically, a machine learning approach, a supervised learner to be more precise, is first employed on a training data set that associates feature representations (commonly music content or music context features) with semantic tags. After training is finished, the classifier is used to predict labels for previously unseen music items.

In order to increase computational efficiency, a feature selection or dimensionality reduction technique can optionally be applied to the input feature vectors before training the classifier. This is of particular importance when dealing with high-dimensional representations of music items, which are typically present when modelling music items via a multimodal approach, for instance, via a feature vector describing aspects of both the music content and the music context [76]. It was shown by Sordo in [86] that a dimensionality reduction of 95%, obtained by applying principal components analysis (PCA) [44] to the CAL500 data set [89] and 600-dimensional audio feature vectors, does not significantly decrease accuracy but decreases computational costs considerably. Figure 6 depicts the general framework of an auto-tagger [86].

One can broadly categorise music-tagging efforts into approaches that learn relations between feature representations of music files and semantic tags, henceforth referred to as “computational approaches”, and strategies that infer tags directly from some kind of user input, referred to as “human-centred approaches” in the following. Both strategies will be addressed below. As for the former, methods that learn tags from co-occurrence data (collaborative filtering), audio features, and web pages will be introduced. For the category of human-centred approaches, another game with a purpose will be presented, and the use of music folksonomies to infer tags and associate them with semantic categories will be discussed.

Fig. 6 Basic scheme of a music auto-tagger

Turnbull et al. in [52] compare various data sources and corresponding algorithms (computational and human-centred) for the task of tagging music: human surveys, social tags, games with a purpose, web pages, and auto-tagging. They elaborate on the advantages and disadvantages of each, which are summarised from [52] and extended by the author in Table 6. Weak labelling refers to the fact that one cannot infer from the absence of a tag t that t does not apply to the item; users might simply not have thought of the tag. In contrast, a strongly labelled data set is complete in the sense that the absence of tag t does mean that t is not suited to describe the item under consideration. For an explanation of popularity bias and community bias, please refer to Sects. 3.1.2 and 3.2.1, respectively.

Table 6 Comparing different approaches to tag music

3.3.1 Computational Approaches

As described above, the most common approach to auto-tagging music is to train a classifier on music content features and learn relations between them and a set of tags that are known to relate to the corresponding music items. Typically, rhythm and/or timbre descriptors are used as features [63]; sometimes high-level features are included in addition [86].

Sordo proposes a simple and efficient algorithm based on a weighted vote k-nearest neighbour (kNN) classifier to propagate tags from training data to unseen music items [86]. Given a training set of PCA-compressed feature vectors of a song collection together with a set of labels for each piece, the proposed weighted vote kNN algorithm first determines the k closest neighbours of the seed song s, which is to be tagged. The frequencies of all tags assigned to s’s neighbours are then summed per tag, and a threshold relative to k ensures that only frequently used tags are predicted for s.
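The following sketch captures the spirit of this weighted-vote kNN tag propagation; parameter names, the distance callable, and the exact thresholding rule are illustrative assumptions rather than Sordo’s precise formulation:

```python
from collections import Counter

def knn_tags(seed_vec, train, distance, k=10, threshold=0.5):
    """Propagate tags from the k nearest training songs to a seed song.

    train: list of (feature_vector, tag_set) pairs;
    distance: callable comparing two feature vectors.
    Tags used by at least threshold * k neighbours are predicted.
    """
    neighbours = sorted(train, key=lambda item: distance(seed_vec, item[0]))[:k]
    votes = Counter(tag for _, tags in neighbours for tag in tags)
    return [tag for tag, count in votes.items() if count >= threshold * k]
```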

An approach similar to Sordo’s is suggested in [46]. Kim et al. also employ a kNN classifier to address the problem of auto-tagging artists, but they analyse different artist similarity functions. They compare similarities derived from artist co-occurrences in last.fm playlists (“scrobbles”), from last.fm tags, from web pages about the artists, and from music content features. Using tags manually assigned by music experts as ground truth, Kim et al. found that the similarity measure based on last.fm co-occurrences performed best, both in terms of precision and recall. When using a kNN classifier, it is crucial to carefully select the similarity measure used to determine the nearest neighbours. Depending on the origin of the features, a common choice is cosine similarity (for term-weighting features) or one of the Mahalanobis, Manhattan, or Kullback–Leibler distances/divergences (for music content features).

In their proposed algorithm to extract tags from artist-related web pages, Schedl et al. use a dictionary of musically relevant terms to filter the textual content of the pages under consideration [77]. The authors propose three different term-weighting functions to score the extracted tags per artist and predict the resulting top-ranked tags. A user survey was conducted to evaluate the quality of the suggested tags in terms of descriptiveness. Quite surprisingly, participants in the study found tags suggested by a simple document frequency function superior to those proposed by TF·IDF-based term scoring.

Mandel et al. [63] use conditional restricted Boltzmann machines [85] to learn tag language models over three sets of vocabularies: annotations by users of Amazon’s Mechanical Turk, of the tagging game MajorMiner [62], and of last.fm. The models are learned at the level of song segments. Optionally, different “contexts” are included; that is, track-level and user-level annotations are factored in.

Seyerlehner et al. in [82] use a combination of different audio features described within their block-level framework [83]: spectral pattern (SP), delta spectral pattern (DSP), variance delta spectral pattern (VDSP), logarithmic fluctuation pattern (LFP), correlation pattern (CP), and spectral contrast pattern (SCP). Associations between songs and tags are then learned using a random forest classifier.

Recently, two-stage algorithms have become popular. In the first stage, they infer higher-level information, such as term weight vector representations, from music content features. These new representations are then fed into a machine learning algorithm to learn semantic labels [14, 65]. This second stage is sometimes said to incorporate contextual aspects, since correlations between tags are frequently considered. Alternatively, the term weight vectors inferred in the first stage can also be used as input to a music similarity measure [82]. As an example of a two-stage auto-tagger, Miotto et al. in [65] first model semantic multinomials over tags based on music content features. In order to account for co-occurrence relations between tags, they subsequently learn a Dirichlet mixture model of the semantic space, which eventually yields a contextual multinomial.

3.3.2 Human-Centred Approaches

Games with a purpose have already been introduced in Sect. 3.1.1, where it was shown how to use their results for similarity estimation. In [53], Law and von Ahn present another interesting game with a purpose, which focuses on input-agreement. Users have to agree on whether they are listening to the same piece of music or not, that is, they have to agree on the input. To this end, they can exchange any free-form text that helps to reach this goal. Usually, players enter descriptive tags in an effort to quickly decide between the two classes “same” and “different”. If they agree on the correct class, both players are awarded points, and the next round starts. Each game lasts for a total of 3 minutes.

According to Law and von Ahn, this input-agreement mechanism offers the advantage of being more popular and producing a higher number of tags than other, similar games. Analysing the most frequently used tags (Table 7) shows that most of them describe genre, instrumentation, and properties of the music. Due to the very nature of the game design, the top list also includes communication tags that are unsuited to describing the music itself (“yes”, “same”, “no”, “diff”). Furthermore, negation tags are frequently used to indicate the absence of a particular musical aspect (“no vocal”, for example).

Table 7 Most frequently used tags in the TagATune game, in descending order of frequency

Music folksonomies present another valuable source of musical information. They are created by large numbers of users tagging particular music items with their own, specific vocabulary. Although this vocabulary is probably not as precise as the one employed by music experts, the wisdom of the crowds can potentially cover more diverse aspects of human music perception than experts can think of.

Sordo et al. [87] present a method to automatically categorise tags extracted from Wikipedia into semantically meaningful groups, which they call “facets”. To this end, starting at the most generic page about “music”, the authors extract links from DBpedia, a machine-readable knowledge base created from Wikipedia pages. Applying some heuristics, pages not related to music are filtered out. To the remaining nodes, the PageRank algorithm [67] is applied to determine a relevance score for all nodes/pages in the network. The top facets that Sordo et al.’s method found on a data set of about 600,000 artists and 400,000 tags are given in Table 8. The facets and tags extracted in this way are particularly interesting for music retrieval systems, where the user might want to restrict the results of a search query to tracks that are similar according to a specific facet.

Table 8 Top facets of music extracted from Wikipedia

4 Conclusions and Research Directions

In this chapter, we have discussed how various kinds of social media can be used for common music information retrieval tasks. More precisely, approaches to infer music similarity from text and co-occurrence information were presented, strategies to estimate the popularity of a music item from social media were elaborated, and recent methods to automatically assign semantically meaningful tags to music items, a process also known as auto-tagging, were discussed.

I am sure that future research in music information retrieval has to strive for a holistic perspective, in the sense that information is derived not only from the audio signal and from meta-data but from many different types of multimedia material. Given the soaring success of social media, analysis and data mining of the respective sources will open unprecedented opportunities to elaborate truly personalised and context-aware music retrieval and recommendation systems. Some concrete challenges in the context of social media mining for music information retrieval are analysing music video clips (official music videos as well as user-generated versions) to infer descriptive information; processing images of album covers, band photographs, and concert snapshots taken by enthusiastic music aficionados; and making sense of textual data about music items (for instance, microblogs). In addition, data fusion techniques are required to build multifaceted models that describe both music items and listeners in order to eventually enable the next generation of intelligent music retrieval systems.