Abstract
We expand information segmentation to include additional properties of music geometry. We establish a distinct metric for invariant chord structure (harmonic consistency) and models for conjunct melodic motion and acoustic consonance. We combine these with centricity to form a unified measure of music geometry. Using geometric predictors and the LSQOP method, we classify music/non-music with results comparable to AI/ML, with f-scores between \(76\%\) and \(92\%\).
S. Steinmetz—Work of the first author is supported in part by the Northrop Grumman Technical Fellowship.
E. Gethner—Work of the second author is supported in part by a Simons Foundation Collaboration Grant for Mathematicians.
Keywords
- Music geometry
- Information geometry
- Harmonic consistency
- Harmonic leading
- Centricity
- Dissonance
- Consonance
- Time-geometry
- Qcurve
- Quant-curve
- Music retrieval
- Music detection
- LSQOP
1 Introduction
Music information retrieval (MIR) dominates audio classification, spanning rhythm, melody, genre and music emotion recognition (MER) [19]. MIR began with self-similarity [8] but now focuses on neural networks (NN, CNN, RNN) [10] and support vector machines (SVM) [20]. Artificial intelligence (AI) and machine learning (ML) have proven successful, with f-scores as high as \(85\%\) [12, 14]; however, AI/ML is burdened by data availability, supervision and labeling. This also means pre-processing (e.g. CUSUM, MLR, GLR, KCD [5, 11]) is a major factor in finding valid segments to process.
We use Tymoczko’s properties of music [23] as the basis for geometric audio segmentation. To that end, we further develop sufficient models for these properties as segmentation estimators [22]. If we combine geometrically segmented (time/geometry) audio curves of the same class, such as music, we arrive at principal curves, which are the foundation for testing geometric variance (i.e. classifying audio). Using a technique called LSQOP, we demonstrate classification of music and non-music comparable to AI/ML [9, 15].
2 QCurve Transform
Steinmetz and Gethner developed centricity segmentation via the three-parameter Gamma distribution and geodesic likelihood [1, 2, 22]. The following sections expand this by modeling additional properties, where the result is what we call time/geometry, or qcurve.
Definition 1
(QCurve) A qcurve is a time series consisting of positive unitary measures of musical geometry.
3 Harmonic Consistency
Harmonic consistency states “harmonies in a passage of music, whatever they may be, tend to be structurally similar to one another [23].”
Definition 2
(Musical Structure). Let \(K_{12} = (V,E),\,V =\{0,1,\cdots ,11\}\) be a complete graph with vertices labeled \(\{0 = C, 1 = C\#,\cdots ,11=B\}\). Every distinct edge, and every cycle \(C_r\), \(3 \le r \le 12\), having non-crossing edges in the embedded \(K_{12}\), is a musical structure.
Definition 3
Two non-empty sets of musical frequencies are similar if information gain due to geometric consistency is zero, or very small, such that distance \(D(\boldsymbol{\theta }^{(i)},\boldsymbol{\theta }^{(i+1)}) \le \epsilon \) where \(\epsilon \) is a fixed positive value. \(D(\cdot )\) is dependent on probabilities \(p(x_1\,|\,\boldsymbol{\theta }^{(i)})\), \(p(x_2\,|\,\boldsymbol{\theta }^{(i+1)})\) from [22].
Definition 3 expands on Cont’s idea of similarity [3], except here we depend on geometric divergence. There is a trivial but useful mapping between the vertices of a 12-gon and \(\mathbb {Z}_{12}\) using the complex unit circle, \(z_s = f(s) = e^{i\pi (15 - s)/6}\), \(s \in \mathbb {Z}_{12}\). Assume musical frequency \(k \in \mathbb {R}\), where \(T: \mathbb {R}\rightarrow \mathbb {Z}_{12}\) such that \(T(k) \in \{0,1,2,\ldots ,11\}\) (i.e. pitch class). We define \(\boldsymbol{S_t}(k_i,k_j)\) as the shortest distance between \(\mathbb {Z}_{12}\) elements, or integer separation between frequencies. Due to chroma, there exists a distinct, linearly independent, invariant Euclidean distance for every 12-tone musical frequency pair; therefore frequencies of identical integer separation are similar, which we will prove.
Because \(f\) admits the inverse \(s = f^{-1}(z) = \left( 15 - \frac{6\,\text {arg}(z)}{\pi }\right) \bmod 12\) on \(\mathbb {Z}_{12}\), we have \(f(a) = f(b) \Rightarrow f^{-1}(f(a)) = f^{-1}(f(b)) \Rightarrow a = b\), so f is injective. Let \(\gamma \) be the inner-angle difference between complex arguments. Due to injectivity, integer separation is modeled by \(\gamma = \gamma _i - \gamma _j = \frac{\boldsymbol{S_t}(k_i,k_j)\pi }{6}\).
Lemma 1
Every pair of musical frequencies \(k_i, k_j\) has an invariant Euclidean distance.
Proof
Let \(z_i, z_{j} \in \mathbb {C},\, z_i \ne z_j\). We define magnitude \(|\cdot | \equiv ||\cdot ||_2\). By the law of cosines, \(|z_i - z_j|^2 = |z_i|^2 + |z_j|^2 - 2\,|z_i||z_j|\cos (\theta )\). Since \(|z| = 1\), \(|z_i - z_j|^2 \,=\, 2(1 - \cos (\theta ))\). If \(\gamma _i = \arg (z_i),\, \gamma _j = \arg (z_j)\), then by the dot product
$$\cos (\theta ) = \cos (\gamma _i - \gamma _j) = \cos (\gamma _i)\cos (\gamma _j) + \sin (\gamma _i)\sin (\gamma _j).$$
Substituting the angle and applying this identity gives \(|z_i - z_j|^2 = 2\left( 1 - \cos (\gamma _i)\cos (\gamma _j) - \sin (\gamma _i)\sin (\gamma _j)\right) = 2(1 - \cos (\gamma _i - \gamma _j))\). Multiplying by \(\frac{1}{4}\) and substituting the haversine \(\text {hav}(\theta ) = \frac{1 - \cos (\theta )}{2} = \sin ^2(\frac{\theta }{2})\) gives
$$\frac{|z_i - z_j|^2}{4} = \text {hav}(\gamma _i - \gamma _j) = \sin ^2\left( \frac{\gamma _i - \gamma _j}{2}\right) .$$
Constraining to the positive domain and substituting \(\gamma \) leaves
$$|z_i - z_j| = 2\sin \left( \frac{\gamma }{2}\right) = 2\sin \left( \frac{\boldsymbol{S_t}(k_i,k_j)\pi }{12}\right) . \qquad (1)$$
Since rotation is unitary, all distinct pairs of musical frequencies have an invariant Euclidean distance of this form. \(\square \)
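The circle mapping and Lemma 1 can be checked numerically. Below is a minimal sketch in Python, where `pitch_class` and `int_separation` are hypothetical helper names implementing \(f(s)\) and \(\boldsymbol{S_t}\):

```python
import cmath
import math

def pitch_class(s):
    # f(s) = e^{i*pi*(15 - s)/6}: vertex s of the 12-gon on the complex unit circle
    return cmath.exp(1j * math.pi * (15 - s) / 6)

def int_separation(a, b):
    # S_t: shortest distance between Z_12 elements (integer separation)
    d = abs(a - b) % 12
    return min(d, 12 - d)

# Lemma 1: the Euclidean distance between any two mapped pitch classes
# depends only on their integer separation: |z_i - z_j| = 2 sin(S_t * pi / 12)
for a in range(12):
    for b in range(12):
        lhs = abs(pitch_class(a) - pitch_class(b))
        rhs = 2 * math.sin(int_separation(a, b) * math.pi / 12)
        assert abs(lhs - rhs) < 1e-9
```

The exhaustive loop over all \(144\) vertex pairs confirms the invariance claim of the lemma for the 12-tone case.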
Lemma 1 gives a chroma-agnostic, linearly dependent measure of dyad similarity, but removing homogeneity makes all dyad pairs linearly independent under this mapping; therefore
$$\delta _{t}(k_i,k_j) = 2\sin \left( \frac{\boldsymbol{S_t}(k_i,k_j)\pi }{12}\right) + \boldsymbol{S_t}(k_i,k_j). \qquad (2)$$
Tone distance is symmetric, \(\delta _{t}(k_i,k_j) = \delta _{t}(k_j,k_i)\), invariant under rotation and satisfies the triangle inequality. Because \(k_i = k_j\) implies \(\delta _{t}(k_i,k_j) = 2\sin (0) + 0 = 0\), (2) is a metric space. It follows that frequencies are similar if and only if their tone distances are equal. \({{Path\; distance}}= P_{\delta }(F_{(i)})\) is defined as the sum of tone distances, assuming frequencies \(F_{(i)} = \{k_j\}_{j=1}^{n}\). Using logical reduction and brute-force search, we find no duplicate path distances among all 12-tone chords. Temporally, inverse harmonic consistency \(=\text {HC}^{-1} = \sigma (\boldsymbol{H}) \le M_{h}= 18.24\), where \(\sigma \) is standard deviation, \(\boldsymbol{H} = \{P_{\delta }(F_{(i)})\}_{i=1}^{\infty }\) and the \(F_{(i)}\) are sequential. Given no discernible notes, noise, or silence, \(\text {HC}^{-1} = M_{h}\).
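A sketch of path distance and inverse harmonic consistency follows. It assumes the tone distance takes the form \(2\sin (\boldsymbol{S_t}\pi /12) + \boldsymbol{S_t}\) (consistent with the zero case \(2\sin (0) + 0\) above) and that path distance sums tone distances between adjacent sorted pitch classes; both the summation order and the helper names (`tone_distance`, `path_distance`, `hc_inverse`) are illustrative assumptions:

```python
import math
from statistics import pstdev

def tone_distance(ki, kj):
    # assumed form delta_t = 2*sin(S_t*pi/12) + S_t; ki, kj are pitch classes
    d = abs(ki - kj) % 12
    st = min(d, 12 - d)
    return 2 * math.sin(st * math.pi / 12) + st

def path_distance(freqs):
    # sum of tone distances, here taken between adjacent sorted pitch classes
    f = sorted(freqs)
    return sum(tone_distance(f[i], f[i + 1]) for i in range(len(f) - 1))

def hc_inverse(frames, M_h=18.24):
    # inverse harmonic consistency: std-dev of path distances over
    # sequential frames, capped at M_h; empty input means no discernible notes
    if not frames:
        return M_h
    return min(pstdev([path_distance(f) for f in frames]), M_h)
```

A passage whose sequential frames hold structurally similar chords yields a small standard deviation, hence high harmonic consistency.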
4 Harmonic Leading
Harmonic leading is the combined measure of harmonic consistency, voice leading and conjunct melodic motion (CMM). CMM is the tendency for “melodies to move by short distances from note to note [23].” Due to [4], the musical metric distance \(\varDelta (k_1,k_2)\) can be leveraged, assuming \(\boldsymbol{P} \in \mathbb {R}^{m\times n}\) whose columns are frequencies sorted in ascending order, with \(\boldsymbol{P}_{ij} = 0\) when column lengths differ. From this, we approximate conjunct melodic motion, which also dampens overfit (Footnote 1), assuming \(\sigma (\cdot )\) is the standard deviation, \(f\) the frame count and \(\varDelta (\cdot )\le M_{v}= 6\) [4].
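A heavily hedged sketch of the frame matrix \(\boldsymbol{P}\) and a CMM-style average: here `delta` is a stand-in for the voice-leading distance \(\varDelta \) of [4], approximated by the shortest pitch-class distance (which shares the bound \(M_v = 6\)), and the averaging scheme is our assumption rather than the paper's formula:

```python
import numpy as np

def delta(k1, k2):
    # stand-in for the voice-leading distance of [4]: shortest pitch-class
    # distance, bounded by M_v = 6 (an assumption for illustration)
    d = abs(k1 - k2) % 12
    return min(d, 12 - d)

def frame_matrix(frames):
    # columns: pitch classes of each frame sorted ascending, zero-padded
    # so that P_ij = 0 where frame sizes m differ
    m = max(len(f) for f in frames)
    P = np.zeros((m, len(frames)))
    for j, f in enumerate(frames):
        P[: len(f), j] = sorted(f)
    return P

def cmm(frames):
    # average voice movement between corresponding rows of consecutive columns
    P = frame_matrix(frames)
    total = sum(delta(P[i, j], P[i, j + 1])
                for j in range(P.shape[1] - 1) for i in range(P.shape[0]))
    return total / (P.shape[0] * (P.shape[1] - 1))
```

Under this reading, stepwise melodies produce a small average, in line with the CMM tendency toward short note-to-note distances.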
As a side note, we tested harmonic leading on Harte’s 16 songs [13] with moderate success (\(58\%\) f-score), but we were unable to match HCDF retrieval. We recommend an HCDF quantifier experiment using our information framework, which we leave as an open problem.
5 Consonance
Definition 4
Acoustic consonance is the inverse of dissonant contributions between 20 and 250 Hz of center frequency, within critical band \(\beta \) [7, 17, 18].
Given \(F(k)=|\text {FFT}|\), we select \(\tau = |k - \text {argmax}_{x}\left[ F(k)\cdot F(k + x)\right] |,\, x \le 250\), and correlate \(f_{\tau }(x) = \frac{\text {max}\{F\}}{2}\left[ \cos (2\pi x/\tau ) + 1\right] \). We center on \(k_c : |k_c - k| \le \beta /2\), selecting \(k_j : F(k_j) \ge \frac{\text {max}(F)}{4}\), and observe \(\boldsymbol{C} = \left\{ |C_{\tau _j}(k_j)|\right\} _{j=1}^n\) converge to a block wave as dissonant contribution increases. Dissonance is therefore bounded by max dissonance \(= M_d \approx 10\) and modeled with \(D = \text {sgn}\left[ \cos (\frac{2\pi j}{s})\right] + 1\) over bandwidth \(s \le \beta \), where \(\beta \) is the displacement between the two loudest frequencies in the critical band.
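The lag selection \(\tau \) can be sketched directly from its definition, reading \(\text {argmax}_x\) as returning the offset \(x\) (an interpretive assumption, as is the one-bin-per-Hz spacing); `select_tau` is a hypothetical helper over an FFT magnitude array:

```python
import numpy as np

def select_tau(F, k, max_x=250):
    # tau = |k - argmax_x[F(k) * F(k + x)]|, x <= 250 (Sect. 5);
    # F is the FFT magnitude spectrum, k the bin of interest
    limit = min(max_x, len(F) - k - 1)
    xs = np.arange(1, limit + 1)
    x_star = xs[np.argmax(F[k] * F[k + xs])]
    return abs(k - x_star)
```

In practice the offset maximizing the product picks out the strongest nearby spectral partner, whose displacement drives the dissonance estimate.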
6 LSQOP
Qcurves are synthesized using [22] and combinations of AC = acoustic consonance, C = centricity, HL = harmonic leading and MG = music geometry.
Music geometry (MG) is the overlap between the ideal vector \(U = (1,1,1)\) and the measured vector \(V=\) (AC, C, HL). We observe ordinal separation between qcurves of differing classes, exposing an opportunity to train principal qcurves as a model for variation. Figure 1 illustrates the least-squares ortho-projector (LSQOP) method, assuming an input audio qcurve q(x), principal qcurve trend-line segments \(\{p_0,p_1\}\text {(music)},\{p_2,p_3\}\text {(non-music)}\) and an arbitrary point \(d = (x,q(x))\). There exist \(x_1 = p_0 + \text {Res}_{p_1-p_0}(d-p_0)\) and \(x_2 = p_2 + \text {Res}_{p_3-p_2}(d-p_2)\), where \(\text {Res}(\cdot )\) denotes the vector resolute. With \(\boldsymbol{P}= \frac{(0,1)(0,1)^{T}}{(0,1)^T(0,1)}\), \(a_1 = ||\boldsymbol{P}d||\), \(a_2 = ||\boldsymbol{P}x_1||\) and \(a_3 = ||\boldsymbol{P}x_2||\), this yields the ratio \(m(x)\).
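The vector resolute and the vertical projector \(\boldsymbol{P}\) can be sketched as follows; since the ratio \(m(x)\) built from \(a_1, a_2, a_3\) is not reproduced here, `lsqop_heights` (a hypothetical name) returns only the three magnitudes:

```python
import numpy as np

def resolute(a, b):
    # vector resolute of a onto b: (a . b / b . b) * b
    b = np.asarray(b, float)
    return (np.dot(a, b) / np.dot(b, b)) * b

# projector onto the vertical axis: P = (0,1)(0,1)^T / (0,1)^T(0,1)
P = np.outer([0, 1], [0, 1]) / np.dot([0, 1], [0, 1])

def lsqop_heights(d, p0, p1, p2, p3):
    # project point d onto the music {p0,p1} and non-music {p2,p3}
    # trend-line segments, then take vertical magnitudes a1, a2, a3
    d, p0, p1, p2, p3 = (np.asarray(v, float) for v in (d, p0, p1, p2, p3))
    x1 = p0 + resolute(d - p0, p1 - p0)   # music trend line
    x2 = p2 + resolute(d - p2, p3 - p2)   # non-music trend line
    a1, a2, a3 = (np.linalg.norm(P @ v) for v in (d, x1, x2))
    return a1, a2, a3
```

The relative placement of \(a_1\) between \(a_2\) and \(a_3\) then indicates whether \(d\) sits closer to the music or non-music principal qcurve.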
Musical probability \(\mathfrak {M} = E[m(x)]\) has a binary pass/fail criterion \(\mathfrak {M} \ge \epsilon \), with expectation \(E[\cdot ]\) and non-negative musical threshold \(\epsilon \le 1\). The threshold varies with curve training.
7 Evaluation and Analysis
Verification of LSQOP and time/geometry involved classification experiments with GTZAN [24], TUT-17 (parts 1 & 2) [16], SWS1 (used by [6, 21]), SWS2 (music), SWS3 (non-music) and SWS4 (non-music) data. The database contains 1556 music and 1441 non-music files, totaling 2997. Custom data were created to challenge LSQOP, given abnormally high accuracy on GTZAN (\(100\%\)) and TUT (\(92\%\)). Custom data contain clips of random quality, sample rate, content and size, with non-thematic clips from samplefocus.com, partnersinrhyme.com, bensound.com, freemp3cloud.com and BBC Sound Effects. Custom non-music sets contain several categories (e.g. people, urban, construction, natural, office, animals, household, video games, military/war, etc.). Custom music sets are spread (mostly) evenly across several genres, with famous and lesser-known artists, synthesized and “poor” quality recordings.
We performed three tests using LSQOP for music/non-music classification. The first and second tests measure accuracy on all 2997 files: in the first we process the starting 10 s of each file, and in the second we process [0.2, 3] s random samples from each file. Assuming the notation (score/segment) (see [22]), (MG/HL) was effective with non-random samples, at \(96.4\%\) accuracy for music and \(70.78\%\) for non-music. (MG/AC) was effective on random samples, with \(80\%\) accuracy for music and \(78\%\) for non-music.
Table 1 is the result of an information retrieval (IR) exercise involving requests for music/non-music from a database (not all tests shown). Small randomly populated datasets are drawn from the entire clip database, and each set is guaranteed to carry (roughly) equally distributed music and non-music. The resulting f-scores fall between \(76\%\) and \(92\%\), averaging around \(81.6\%\).
8 Conclusions
We showed how 12-tone chroma-agnostic frequency pairs are mapped to a distinct, linearly independent, invariant measure of harmonic distance as a metric space. We provided a unique measure of musical chords (path distance) and discussed how to quantify harmonic consistency over successive frames. By combining this idea with work in voice leading we modeled conjunct melodic motion as harmonic leading. We then developed a straightforward approximation of acoustic consonance using psychoacoustic theory. From these measures we devised a unified score for musical geometry and showed how principal qcurves combined with LSQOP is effective in retrieval with consistent f-scores between \(76\%\) and \(92\%\), averaging \(81.6\%\).
Individually, the proposed musical property models have independent value for measuring structural content and for acoustic quality analysis. The success of LSQOP is clear, but optimal time/geometry combinations require further experiments. Because one set of audio data successfully trained accurate classification of other audio, this opens the question of ideal data for widespread classification. We must also consider the disparity between music and non-music success, which itself presents a challenge to blind classification using the same geometric properties for both.
Notes
1. Harmonic leading overfit is defined as acoustically perceivable chord transposition or inversion.
References
Abdel-All, N.H., Abdel-Galil, E.: Numerical treatment of geodesic differential. In: International Mathematical Forum, vol. 8, pp. 15–29 (2013)
Chen, W.W., Kotz, S.: The Riemannian structure of the three-parameter Gamma distribution (2013)
Cont, A., Dubnov, S., Assayag, G.: On the information geometry of audio streams with applications to similarity computing. IEEE Trans. Audio Speech Lang. Process. 19, 837–846 (2010)
del Pozo, I., Gómez, F.: Formalization of voice-leadings and the Nabla algorithm. In: Montiel, M., Gomez-Martin, F., Agustín-Aquino, O.A. (eds.) MCM 2019. LNCS (LNAI), vol. 11502, pp. 352–358. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21392-3_30
Desobry, F., Davy, M., Doncarli, C.: An online kernel change detection algorithm. IEEE Trans. Signal Process. 53, 2961–2974 (2005)
Steinmetz, S., Gethner, E., Verbeke, J.: A view of music. In: Delp, K., Kaplan, C.S., McKenna, D., Sarhangi, R. (eds.) Proceedings of Bridges 2015: Mathematics, Music, Art, Architecture, Culture, pp. 289–294. Tessellations Publishing, Phoenix, Arizona (2015). http://archive.bridgesmathart.org/2015/bridges2015-289.html
Fishman, Y.I., et al.: Consonance and dissonance of musical chords: neural correlates in auditory cortex of monkeys and humans. J. Neurophysiol. 86, 2761–2788 (2001)
Foote, J.T., Cooper, M.L.: Media segmentation using self-similarity decomposition. In: Storage and Retrieval for Media Databases, vol. 5021, pp. 167–175. International Society for Optics and Photonics (2003)
Gimeno, P., Mingote, V., Giménez, A.O., Miguel, A., Lleida, E.: Partial AUC optimisation using recurrent neural networks for music detection with limited training data. In: Interspeech, pp. 3067–3071 (2020)
Gururani, S., Summers, C., Lerch, A.: Instrument activity detection in polyphonic music using deep neural networks. In: ISMIR, pp. 569–576 (2018)
Gustafsson, F.: The marginalized likelihood ratio test for detecting abrupt changes. IEEE Trans. Autom. Control 41, 66–78 (1996)
Hamel, P., Eck, D.: Learning features from music audio with deep belief networks. In: ISMIR, vol. 10, pp. 339–344. Citeseer (2010)
Harte, C., Sandler, M., Gasser, M.: Detecting harmonic change in musical audio. In: Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, pp. 21–26 (2006)
Kataoka, M., Kinouchi, M., Hagiwara, M.: Music information retrieval system using complex-valued recurrent neural networks. In: SMC’98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No. 98CH36218), vol. 5, pp. 4290–4295. IEEE (1998)
Meléndez-Catalán, B., Molina, E., Gomez, E.: Music and/or speech detection: MIREX 2018 submission. Music Information Retrieval Evaluation eXchange (2018)
Mesaros, A., Heittola, T., Virtanen, T.: TUT Acoustic Scenes 2017, evaluation dataset (2017)
Moore, B.C., Glasberg, B.R.: Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. J. Acoust. Soc. Am. 74, 750–753 (1983)
Nordmark, J., Fahlén, L.E.: Beat theories of musical consonance. Q. Prog. Status Rep. 29, 111–122 (1988)
Panda, R., Malheiro, R.M., Paiva, R.P.: Audio features for music emotion recognition: a survey. IEEE Trans. Affect. Comput. (2020)
Schedl, M., Gutiérrez, E.G., Urbano, J.: Music information retrieval: recent developments and applications. Found. Trends Inf. Retrieval. 8(2–3), 127–261 (2014)
Steinmetz, S.: Sonic imagery: a view of music via mathematical computer science and signal processing. University of Colorado at Denver (2016)
Steinmetz, S., Gethner, E.: On musical information geometry with applications to sonified image analysis. In: To Appear in International Conference on Pattern Recognition and Image Processing, Miami (2021)
Tymoczko, D.: A Geometry of Music, 1st edn. Oxford University Press, Oxford (2011)
Zhang, X.: GTZAN, November 2019
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Steinmetz, S., Gethner, E. (2022). Information Synthesis of Time-Geometry QCurve for Music Retrieval. In: Montiel, M., Agustín-Aquino, O.A., Gómez, F., Kastine, J., Lluis-Puebla, E., Milam, B. (eds) Mathematics and Computation in Music. MCM 2022. Lecture Notes in Computer Science(), vol 13267. Springer, Cham. https://doi.org/10.1007/978-3-031-07015-0_36
Print ISBN: 978-3-031-07014-3
Online ISBN: 978-3-031-07015-0