1 Introduction

Music synchronization or alignment in its various forms is of great importance for analyzing and processing music. Depending on the data formats involved (audio recordings or symbolic representations), music synchronization tasks can be divided into three main categories: audio-to-audio, symbolic-to-symbolic, and audio-to-symbolic [28]. Audio-to-audio alignment [42] matches a given position in a music performance to the corresponding position in a separate audio recording. Symbolic-to-symbolic alignment [9] seeks to indicate how each note in a piece of music is matched to its corresponding note in the score according to the correlation between their musical structures. Audio-to-score alignment, as its name implies, seeks to find a mapping between an audio performance and its corresponding symbolic musical score [27, 31]; in other words, the alignment system analyzes the content of the input audio at a given time point (i.e., an onset) and maps it to a corresponding time point in the score with a similar musical structure (i.e., note pitch and note duration).

This paper focuses on audio-to-score alignment in polyphonic music, including audio with multiple instruments and audio with unconstrained instruments (in contrast to the piano-only music of some previous studies). Generally, audio-to-score alignment can be implemented online or offline. An online system, also known as a score follower, can be used to align a music score to a live performance and can participate in musical interactions with human musicians in real time for tasks such as automatic page turning [1] and computer-aided accompaniment [34]. In addition, because score followers can link note events in a score with the corresponding time points in an audio performance, they play an important role in score-informed source separation tasks [7, 18]. By contrast, offline systems have no real-time constraints, and thus the complete information of a given audio signal can be used during the alignment process. Therefore, offline alignment usually outperforms its online counterpart. An offline alignment system can be used to retrieve the closest matching MIDI document from a database for an audio query [22] and enables the indexing of timestamps of an audio recording according to the desired measure or passage in a music score [14]. Moreover, such an alignment system provides information on performance errors and temporal deviations of note events and thus can be used for music performance evaluation or music education.

Music alignment suffers from some well-known inherent difficulties [13, 27, 28] related to possible mismatches between human performances and their corresponding scores. Live music performance invariably entails certain deviations, intentional or not, from the score; for example, musicians may seek to enrich their performance by adding ornaments or chord variations that are not specified in the score. Another notable aspect is temporal deviation; for example, musicians may express themselves through temporal inconsistencies such as variation in note onset, note duration, or tempo. Alignment is further complicated when dealing with polyphonic music, where multiple note events occurring simultaneously can lead to interference between the harmonic series of notes, thereby increasing the difficulty of identifying the relation between an audio segment and its corresponding score.

The standard method of audio-to-score alignment consists of three steps: feature extraction, similarity calculation, and alignment. In step 1, informative features, such as onset, pitch, or chord, are extracted from the audio signal to characterize its musical content. Step 2 defines a similarity function to measure the difference between the features of an audio recording and the note events in the score. Step 3 uses an alignment algorithm to determine the closest match between the feature sequence and the note events. Most polyphonic audio-to-score alignment algorithms adopt a similar procedure: a symbolic score is converted into an audio file, features are extracted from both the input recording and the converted audio, and the two feature sequences are then aligned. In [30], dynamic time warping (DTW) was used to align a polyphonic audio file to another audio file synthesized from a MIDI file. The peak structure distance, derived from the spectra of the audio files, was used to measure the difference between an audio performance and a synthesized audio specimen. A similar scheme was proposed in [14, 22], in which a simple 12-dimensional pitch chroma was computed as the input feature and the performance audio was aligned with the synthesized audio by DTW. In [32], an exhaustive search over DTW score normalization techniques was performed to identify the optimal method for reporting a reliable alignment confidence score. Such dynamic-programming-based algorithms can cope with tempo fluctuation and performance errors in musical signals. Therefore, the DTW algorithm and variants thereof are extensively used in audio-to-score alignment systems.
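To make the three-step procedure concrete, the following is a minimal sketch of DTW-based alignment between two feature sequences (e.g., chroma vectors extracted from the performance audio and from audio synthesized from the score). The cosine-distance cost and the unconstrained step pattern are illustrative choices rather than the settings of any specific cited system.

```python
import numpy as np

def dtw_align(X, Y):
    """Align two feature sequences X (m x d) and Y (n x d) with plain DTW.

    Returns the accumulated-cost matrix and the warping path as (i, j) pairs.
    Real systems typically add step weights and path constraints.
    """
    m, n = len(X), len(Y)
    # Cosine distance between every audio frame and every score frame
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-9)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-9)
    C = 1.0 - Xn @ Yn.T

    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])

    # Backtrack from the end to recover the alignment path
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[1:, 1:], path[::-1]
```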

Other audio-to-score alignment systems use probabilistic models, such as hidden Markov models (HMMs) and conditional random field (CRF) models, which account for matching uncertainty to enhance performance. In such systems, the hidden variable represents the current position in the score, and flexible transition probabilities permit structural changes. The Viterbi algorithm is usually used to identify the optimal alignment path in these models. For example, in [11], a hierarchical HMM was used in a polyphonic-score-following system, with previously learned pitch templates employed for multiple fundamental frequency matching. The CRF model has also been applied to the audio-to-score alignment problem. In [23], the authors incorporated frame-level and segment-level features, including chroma, onset, and tempo information, and proposed three CRF models offering various trade-offs between accuracy and complexity. These probabilistic models yield efficient and accurate alignment techniques that compete favorably with DTW-based methods. More recently, a multimodal convolutional neural network for audio-to-sheet matching [17] was proposed for offline applications; the network matches short music audio snippets to their corresponding pixel locations in images of piano scores.

Some drawbacks of the methods proposed in the literature are summarized as follows.

  a.

    Most audio-to-score studies [8, 11, 14, 22, 23, 30, 32, 33] have used synthesized audio (from MIDI files or music scores) for their experiments. The major disadvantage of using synthesized audio for audio-to-score experiments is that the musical properties of synthesized audio are likely to be different from those of real-world musical audio signals. For example, input audio clips for alignment do not always have the same timbre and harmonic properties as synthesized ones.

  b.

    Some studies [11, 23, 24] have adopted probabilistic models to implement the alignment process, and these models usually require time-consuming training to fine-tune their parameters.

This paper proposes an audio-to-score alignment framework that is effective for audio music with unconstrained instruments or multiple instruments; such audio poses a greater challenge for audio-to-score alignment. Recent studies on music with unconstrained or multiple instruments are represented by [8, 10, 33]. Carabias-Orti et al. [8] used nonnegative matrix factorization (NMF) to decompose the spectral patterns of an input performance and produce a divergence matrix; DTW was then used to find the minimum-cost path within the matrix. The online version [33] of this method was subsequently submitted to the Music Information Retrieval Evaluation eXchange (MIREX) task, where it outperformed all the algorithms submitted in previous years. Chen et al. [10] used NMF to decompose the harmonics around onsets of the constant Q spectrum and computed the degree of similarity with notes in the corresponding score, thereby generating robust results for aligning audio music with various instruments. This study compares the proposed method with these methods in the experiments.

This paper presents an audio-to-score alignment framework based on onset detection and note matching. As shown in Fig. 1, the proposed alignment method involves three steps: onset segmentation, note matching, and dynamic programming. A piece of audio music is input to the onset detection system. In the note-matching block, a feature vector around each estimated onset is compared with the notes in the corresponding music score. Finally, dynamic programming is used to extract the best alignment path in the similarity matrix. Performing onset detection first simplifies the alignment process considerably. Moreover, considering the spectral variation around an onset reduces the complexity of similarity computation and the interference from vibrato. In addition, this paper proposes a method for converting a constant Q spectrogram into a format similar to a piano roll. This conversion does not require synthesis from MIDI to audio and is less likely to be influenced by instrumental timbre. The effects of various onset detectors and types of spectrum difference vectors are discussed in the experiments, where the proposed algorithm outperforms the aforementioned methods. Furthermore, we propose an online audio-to-score alignment method based on onset detection, the feasibility and performance of which are experimentally validated. Finally, we propose a method for onset detector fusion: features are extracted from various onset detection functions (ODFs), and more reliable onsets are then selected by a prelearned classifier. The experimental results verify that the fusion is effective and outperforms its constituent ODFs.

Fig. 1

Flowchart of the proposed audio-to-score alignment algorithm. A piece of audio music is input to the onset detection system. In the note-matching block, a feature vector around each estimated onset is compared with the notes in the music score. Finally, dynamic programming is used to extract the best alignment path from the similarity matrix

The remainder of this paper is organized as follows. Section 2 introduces the onset detection methods used in our fusion method, which is described in Section 3. Section 4 presents a note-matching algorithm used to measure the similarity between a spectrum difference vector derived from the audio onset and a score pitch vector derived from the concurrent note-on event in the score. Section 5 explains our dynamic programming algorithm. In Section 6, we introduce an online alignment method based on onset detection. Section 7 describes our experimental setup and offers a comparison with other methods, alongside detailed analysis and discussion. Finally, Section 8 summarizes our findings and provides directions for future research.

2 Choice of onset detection methods

Onset detection [2, 16] is a crucial preprocessing step for high-level music processing tasks such as beat tracking, song segmentation, and music transcription. An onset is usually defined as the starting point of a sound event. Onset detection typically applies an onset detection function (ODF) to the input audio to generate an ODF curve that indicates onset strength. The local maxima can then be selected from the curve as onsets because the ODF should have larger values in the vicinity of an onset event. Different low-level features of audio music can be considered for computing the ODF, including the magnitude spectrum, the phase spectrum, and pitch. For example, spectral flux computes the temporal differences of magnitude spectra, and phase deviation estimates the change in the instantaneous frequency in a short-time Fourier transform frequency bin by computing the second-order temporal differences of phase angles, whereas the complex domain simultaneously considers differences in the magnitude and phase spectra.
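As a concrete illustration, the sketch below computes a plain spectral flux ODF (the half-wave rectified frame-to-frame increase of the magnitude spectrum); published detectors differ in windowing, filtering, and normalization details.

```python
import numpy as np

def spectral_flux(S):
    """Spectral flux ODF from a magnitude spectrogram S (freq_bins x frames).

    Sums the positive (half-wave rectified) frame-to-frame magnitude increase,
    which tends to peak near note onsets.
    """
    diff = np.diff(S, axis=1)            # temporal difference per frequency bin
    diff = np.maximum(diff, 0.0)         # keep only energy increases
    odf = diff.sum(axis=0)               # one onset-strength value per frame
    return np.concatenate(([0.0], odf))  # pad so the length matches the frame count
```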

To construct general detection functions that can detect onsets in a wider range of audio music, machine-learning-based methods have gained popularity in recent years. In [25], neural networks were first used for onset detection by classifying the spectrogram of each frame as onset or non-onset. All winning algorithms of the MIREX audio onset detection task in the preceding few years have been based on probabilistic models built on recurrent neural networks (RNNs) [20] and convolutional neural networks (CNNs) [36, 37]. The resultant onset detectors exhibit good performance for almost all types of music.

Another promising direction for onset detection is the fusion of multiple detection methods [15]. Duxbury et al. [19] proposed an approach combining an amplitude-based detection method with a phase-based one. Holzapfel et al. [21] considered pitch, energy, and phase information in parallel for the detection of pitched onsets. Tian et al. [40] investigated multiple fusion strategies to combine existing onset detectors and provided a systematic evaluation of all tested combinations.

This paper proposes a new fusion strategy for ODF curves. For a given onset, the corresponding peak positions of different ODF curves may exhibit some temporal deviation. In addition, some ODF curves may have no peak at all, and the onset strengths of different ODF curves may not be comparable. Consequently, the fusion strategies in [19, 21, 40] are not desirable because the weightings of the ODF curves are difficult to estimate. Our approach divides the peaks of the ODF curves into groups that are likely to coincide with an onset. We then extract features from each group and train a classifier to determine whether a peak group corresponds to an onset.

In our implementation, we combine the ODFs of three onset detectors. The properties of these detection methods are described as follows.

2.1 SuperFlux

SuperFlux is an onset detection algorithm proposed in [3] that improves onset detection accuracy in the presence of vibrato. SuperFlux is an enhanced version of the common spectral flux: it calculates the difference between two nearby short-time spectra and suppresses false positive detections in music with vibrato by tracking spectral trajectories with a maximum filter.

2.2 ComplexFlux

SuperFlux considers only the magnitude spectrogram without incorporating any phase information. ComplexFlux [4] is derived from SuperFlux and considers the local group delay to deal with vibrato and tremolo. By incorporating phase information, ComplexFlux can identify steady tones and accordingly suppress spurious loudness variation. SuperFlux and ComplexFlux are the current state-of-the-art nonprobabilistic onset detection methods according to the results of the MIREX onset detection task from the preceding few years.

2.3 RNN

The recurrent neural network (RNN)-based method [20] incorporates an RNN alongside the spectral magnitude and its first time derivative as input features. Because of its bidirectional architecture, this method can model the context of an onset to detect barely discernible onsets in complex mixes and suppress events that are erroneously considered onsets by other algorithms.

3 Fusion method for onset detection

The fusion of multiple ODFs is based on the concept that different ODFs have different characteristics that, if used properly, can be synergized to capture onsets of audio music in various styles and genres. In particular, spectral flux is known to be sensitive to percussive signals, whereas SuperFlux and ComplexFlux can robustly suppress vibrato while detecting onsets. Moreover, ODFs of a shorter frame size are better at detecting onsets in weak and transient signals, whereas those of a longer frame size better capture soft onsets. The aforementioned observations indicate that a variety of ODF curves should be employed to capture different types of onsets.

In practice, different ODFs have different characteristics, and thus determining a universal criterion for selecting peaks as onset candidates is difficult. To automate this tuning process, we form peak groups and use a support vector machine (SVM) to determine whether a given peak group is a true onset. The use of multiple ODF curves alongside simple training of the SVM is the core concept of the proposed method.

Figure 2 shows a block diagram of a simple example (used in our experiment) of the proposed method. In this specific implementation, three onset detectors are used: the RNN method, SuperFlux, and ComplexFlux. Peak values of ODFs are cascaded with music harmonic/percussive (H/P) features from harmonic/percussive source separation of the input audio. These cascaded feature vectors are then classified by the SVM to predict onsets.

Fig. 2

Block diagram of a specific implementation of the proposed method: peak values of the ODFs of the RNN method, SuperFlux, and ComplexFlux are cascaded with music features from harmonic/percussive source separation of the input audio. These feature vectors are then classified by the SVM to predict onsets

3.1 Onset detection functions and peak selection

To select the ODF curve peaks, the peak-picking method described in [3] is employed with nearly identical parameter settings. Specifically, the value of an ODF curve at frame index n, denoted ODF[n], is selected as a peak if it fulfills the following three requirements:

  1. $$ ODF\left[n\right]=\max \left( ODF\left[n-{w}_1:n+{w}_2\right]\right) $$
  2. $$ ODF\left[n\right]\ge mean\left( ODF\left[n-{w}_3:n+{w}_4\right]\right)+\theta $$
  3. $$ n-{n}_{previous\ onset}>{w}_5 $$

The first requirement identifies a local maximum, where w1 and w2 determine the size of the window within which the local maximum resides; the position of the local maximum within the window represents a possible onset. The second requirement filters out lower-intensity peaks, which are considered noise; it selects only local maxima that exceed the local mean by a threshold θ, where w3 and w4 determine the window size for computing the local mean. The third requirement sets the minimum interval between adjacent onsets, which is determined by w5. The parameter values in these three conditions are equal to the default parameter values of the program provided in [3, 4], except for the threshold parameter θ, which can be used to control the number of identified peaks. A lower value of θ results in more peaks being identified and thus more false positives in the final onset detection; a higher value of θ may lead to more false negatives because some true onsets are not identified. In our experiment, we selected a moderately low value of θ to select more peaks and avoid rejecting true onsets. An SVM classifier then makes the final decision regarding whether the selected peaks are onsets.
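A minimal sketch of these three conditions is given below; the window sizes and threshold are illustrative placeholders rather than the default values of [3, 4].

```python
import numpy as np

def pick_peaks(odf, w1=3, w2=3, w3=10, w4=10, theta=0.1, w5=5):
    """Select peak frames from an ODF curve using the three requirements above.

    Window sizes (in frames) and theta are illustrative placeholders.
    """
    peaks = []
    prev = -np.inf
    for n in range(len(odf)):
        lo1, hi1 = max(0, n - w1), min(len(odf), n + w2 + 1)
        lo2, hi2 = max(0, n - w3), min(len(odf), n + w4 + 1)
        if (odf[n] == odf[lo1:hi1].max()                   # 1: local maximum
                and odf[n] >= odf[lo2:hi2].mean() + theta  # 2: exceeds local mean by theta
                and n - prev > w5):                        # 3: minimum inter-onset distance
            peaks.append(n)
            prev = n
    return peaks
```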

3.2 Peak grouping

After local peaks are identified from each ODF curve, peaks from all ODF curves must be grouped based on their temporal vicinity such that a final classifier can be used to determine whether a peak group is an onset. Let MS and NS be the time vectors of the peak locations given by two ODF curves, and let δ be the tolerance window (50 ms in our experiments). The peak grouping procedure described in Algorithm 1 is used to generate peak groups from the two ODF curves.

Algorithm 1 Peak grouping procedure

Notably, in our implementation, we used the average time of the grouped peaks as a reference vector of onset time.
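Because Algorithm 1 is given as a figure, the sketch below shows one plausible reading of the grouping step: peaks from all ODF curves are pooled and sorted, a peak within δ of the last peak in the open group joins that group, and the average time of each group serves as the reference onset time mentioned above.

```python
def group_peaks(peak_times_per_odf, delta=0.05):
    """Group peak times (in seconds) from several ODF curves by temporal vicinity.

    peak_times_per_odf: one list of peak times per ODF curve.
    Returns (reference_time, member_odf_indices) for each group.
    """
    # Pool all peaks as (time, odf_index) pairs and sort them by time
    pooled = sorted((t, k) for k, times in enumerate(peak_times_per_odf) for t in times)
    groups = []
    for t, k in pooled:
        if groups and t - groups[-1]["times"][-1] <= delta:
            groups[-1]["times"].append(t)       # close enough: join the open group
            groups[-1]["members"].add(k)
        else:
            groups.append({"times": [t], "members": {k}})  # start a new group
    # Reference onset time of a group = average of its member peak times
    return [(sum(g["times"]) / len(g["times"]), g["members"]) for g in groups]
```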

3.3 Feature extraction

After the peak groups have been obtained, feature extraction must be performed on the groups; the features are then sent for SVM classification to determine whether a given peak group represents an onset.

3.3.1 ODF features

Peaks in a group are somewhat related and usually describe the same onset event in audio music. For each peak in a SuperFlux or ComplexFlux ODF, two features are extracted based on the ODF’s value and local mean. These features are specified as follows (ODF[n] is a peak at frame n of its corresponding ODF):

  • $$ ODF\left[n\right] $$
  • $$ ODF\left[n\right]- mean\left( ODF\left[n-{w}_3:n+{w}_4\right]\right) $$

Because a peak in an ODF indicates a high probability of an onset at the corresponding instant, the peak value ODF[n] is adopted as feature 1. Feature 2 represents the contrast between the peak and its local average; a higher contrast indicates a higher probability of the peak being an onset. In addition, two ODFs with different frame sizes are generated. The ODF with the shorter frame size is expected to capture more peaks as onset candidates, including some false positives, whereas the ODF with the longer frame size is a smoother curve that better captures soft onsets. Using both frame sizes leverages the strengths of each and contributes to the final onset detection in a complementary manner. Two frame sizes, 46 and 92 ms, were adopted in our implementation.

For each peak in the RNN ODF, only the peak value is used as the feature because the RNN model in [20] considers multiple frame size resolutions. Therefore, each peak group has nine features. If some ODFs in a peak group have no peaks, the default values of the corresponding features are set to zero. The estimated onset time is the average of all peaks in the group if the peak group is judged to be an onset.

3.3.2 Music H/P features

Musical signals typically consist of harmonic and percussive spectral components. Harmonic sound pertains to pitched instruments, whose spectral structure is horizontally smooth in time, whereas percussive sound pertains to drum-type instruments, whose spectral structure is vertically smooth in frequency. The goal of harmonic/percussive source separation [29, 39] is to decompose a given audio signal into these two components: harmonic sounds and percussive sounds. Separating these two components is useful for some music information retrieval tasks, such as melody extraction [35] and chord recognition [41].

Onset detection is particularly vulnerable to vibrato interference [3], which tends to produce false positives. Separating musical signals into harmonic and percussive components can reduce interference from vibrato. The spectrogram of an input audio specimen is separated into a harmonic matrix H and a percussive matrix P, and two curves, Hv and Pv, are obtained by summing H and P, respectively, along the frequency axis. These two curves are normalized to have maxima of 1. Figure 3 presents examples of harmonic and percussive curves. Figure 3a shows the spectrogram of an audio excerpt of a violin, and Fig. 3b shows the harmonic curve (green line) and percussive curve (blue line) of said excerpt. The black circles on the x-axis represent the ground truth of the onsets, and the estimated onsets identified using the SuperFlux method are marked with red asterisks. The figure shows that Pv has a relatively strong rising trend at some onsets, such as 1.6, 2.2, 2.4, and 2.8 s. At the onsets at 3.3 and 4.8 s, Pv rises less strongly, but the gap between Pv and Hv is considerably smaller than in the cases of the insertions. By contrast, the Pv increase at the insertions at 0.5, 0.6, 4.0, and 4.1 s is very minor, and the gaps between Pv and Hv are relatively large. The relative amplitude between these two curves therefore helps to determine whether an onset is present; for example, Pv is usually close to or greater than Hv around an onset, whereas Hv is larger than Pv in a vibrato segment.

Fig. 3

a Spectrogram of an audio excerpt of a violin. b Harmonic curve (green line) and percussive curve (blue line) of the audio excerpt. The black circles on the x-axis represent the ground truth of the onsets. The estimated onsets of the SuperFlux method are marked with red asterisks
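The Hv and Pv curves can be obtained, for example, with median-filtering harmonic/percussive separation as implemented in librosa; the separation method of [29, 39] may differ in detail, and the frame parameters below are illustrative.

```python
import numpy as np
import librosa

def hp_curves(y, sr, n_fft=2048, hop_length=441):
    """Compute normalized harmonic (Hv) and percussive (Pv) activity curves."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    H, P = librosa.decompose.hpss(S)    # harmonic / percussive magnitude spectrograms
    Hv = H.sum(axis=0)                  # sum along the frequency axis
    Pv = P.sum(axis=0)
    Hv /= Hv.max() + 1e-9               # normalize both curves to a maximum of 1
    Pv /= Pv.max() + 1e-9
    return Hv, Pv
```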

Assume that na is the frame index of the average time of a peak group. We then define nine features extracted from Hv and Pv for each peak group as follows:

  1. Hv[na]

  2. Pv[na]

  3. \( \frac{Hv\left[{n}_h\right]- Hv\left[{n}_a\right]}{n_h-{n}_a},\mathrm{where}\ {n}_h=\underset{n_a\le n\le {n}_a+\Delta {n}_1}{\arg\ \max } Hv\left[n\right] \)

  4. \( \frac{Pv\left[{n}_p\right]- Pv\left[{n}_a\right]}{n_p-{n}_a},\mathrm{where}\ {n}_p=\underset{n_a\le n\le {n}_a+\Delta {n}_1}{\arg\ \max } Pv\left[n\right] \)

  5. Hv[na] − mean(Hv[na − ∆n0 : na])

  6. Pv[na] − mean(Pv[na − ∆n0 : na])

  7. Hv[nh] − mean(Hv[na − ∆n0 : na])

  8. Pv[np] − mean(Pv[na − ∆n0 : na])

  9. Hv[nh] − Pv[na]

The aforementioned features represent changes in Hv and Pv around na. The first two features are the values of Hv and Pv at the onset candidate na. Features 3 and 4 are the slopes of Hv and Pv over the intervals [na, nh] and [na, np], respectively, where nh and np are the indices of local maxima of Hv and Pv in the range [na, na + ∆n1]. Features 5 to 8 are the differences between Hv and Pv at various locations and their averages in the neighborhood. The final feature is the amplitude difference between Hv and Pv. These harmonic and percussive features are cascaded with the ODF features to form 18 features for each peak group; the peak groups are then classified into the two categories “onset” and “not onset” by the SVM classifier described in Section 3.4.
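For concreteness, a sketch of the nine H/P features for a peak group centered at frame na is given below; the window parameters ∆n0 and ∆n1 (in frames) are illustrative, as their values are not stated here.

```python
import numpy as np

def hp_features(Hv, Pv, na, dn0=10, dn1=10):
    """Nine H/P features for a peak group with average-time frame index na.

    dn0 bounds the backward mean window and dn1 the forward search for local
    maxima; both are illustrative placeholder values.
    """
    hi = min(len(Hv) - 1, na + dn1)
    nh = na + int(np.argmax(Hv[na:hi + 1]))     # forward local maximum of Hv
    np_ = na + int(np.argmax(Pv[na:hi + 1]))    # forward local maximum of Pv
    lo = max(0, na - dn0)
    h_mean = Hv[lo:na + 1].mean()               # backward local mean of Hv
    p_mean = Pv[lo:na + 1].mean()               # backward local mean of Pv
    slope_h = (Hv[nh] - Hv[na]) / (nh - na) if nh != na else 0.0
    slope_p = (Pv[np_] - Pv[na]) / (np_ - na) if np_ != na else 0.0
    return np.array([
        Hv[na], Pv[na],                         # features 1-2: curve values at na
        slope_h, slope_p,                       # features 3-4: forward slopes
        Hv[na] - h_mean, Pv[na] - p_mean,       # features 5-6
        Hv[nh] - h_mean, Pv[np_] - p_mean,      # features 7-8
        Hv[nh] - Pv[na],                        # feature 9: amplitude gap
    ])
```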

3.4 Final classifier

An SVM is used as the final classifier to determine whether a given peak group corresponds to an onset. SVMs are effective classifiers because they maximize the margin, that is, the distance between the decision hyperplane and the closest examples of the opposite classes [38]. Regarding the dimensionality of the SVM input, we define nine features extracted from the ODFs and nine features extracted from Hv and Pv for each peak group; the overall dimension of the input to the SVM is therefore 18. In our experiments, we used a Gaussian radial basis function kernel, and the parameters were optimized through a cross-validated grid search within the training set.
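A minimal training sketch using scikit-learn is shown below; the grid of C and γ values is illustrative, as the text only specifies a Gaussian RBF kernel and a cross-validated grid search.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_onset_classifier(X, y):
    """Train the final RBF-kernel SVM on 18-dimensional peak-group features.

    X: (n_groups, 18) feature matrix; y: binary onset labels.
    The parameter grid is an illustrative placeholder.
    """
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1, 1]}
    search = GridSearchCV(model, grid, cv=5, scoring="f1")
    search.fit(X, y)
    return search.best_estimator_
```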

4 Similarity measure

After the onsets have been obtained, the note events in the score that trigger each onset must be determined according to the constant Q spectrum. We have developed three methods for evaluating the likelihood that notes in the score induce the spectral variation near an onset.

4.1 Modified constant Q spectrum

We define a modified constant Q spectrum (MCQS) for note matching in audio-to-score alignment. The constant Q transform [6] is used extensively in music analysis; “Q” refers to the “quality factor,” namely the ratio of the center frequency to its bandwidth. Thus, a constant Q indicates that the ratio is constant for all frequency bins in the spectrum. The transform reflects the human auditory system, where the spectral resolution is higher at lower frequencies and the temporal resolution is higher at higher frequencies. The frequency scale of the constant Q transform is logarithmic, and thus this method is particularly useful in music processing because musical transposition corresponds only to the translation of frequency bins. The Hamming window is used in our implementation. We select the Q factor in such a manner that 12 frequency bins exist over an octave, with each bin directly corresponding to a musical note.

The constant Q spectrum is further modified by retaining only the local maximum in the pitch axis of the spectrogram. In addition, we limit the pitch range to that of the score. The advantage of this modification is that it is relatively unaffected by instrumental timbre and can reduce overtone interference.
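The following is a minimal sketch of the MCQS computation; the hop length, fmin, and number of bins are placeholders that should be chosen so the 12 bins per octave cover the pitch range of the score.

```python
import numpy as np
import librosa

def mcqs(y, sr, fmin, n_bins, hop_length=441):
    """Modified constant Q spectrum: keep only local maxima along the pitch axis."""
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                           fmin=fmin, n_bins=n_bins, bins_per_octave=12))
    out = np.zeros_like(C)
    for t in range(C.shape[1]):
        col = C[:, t]
        for k in range(1, len(col) - 1):
            if col[k] >= col[k - 1] and col[k] >= col[k + 1]:  # local maximum in pitch
                out[k, t] = col[k]
    return out
```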

Figure 4 shows an example of the MCQS of an audio excerpt and its corresponding score. This excerpt is from the Bach10 dataset and contains four parts played by different instruments. Figure 4a shows the MCQS of the excerpt. Each dashed vertical line represents an onset time; these lines are labeled alphabetically at the top. Each horizontal bar corresponds to a fundamental frequency or an overtone of a tone in the audio excerpt. Figure 4b is the piano-roll plot of the corresponding music score. Each horizontal red bar represents a note, with its height denoting the note pitch and its length denoting the note duration. Each dashed vertical line is the time point of a note-on event; these lines are labeled alphabetically at the top. The same letters in Fig. 4a and b represent a time mapping in audio-to-score alignment; that is, timings A–U in Fig. 4a correspond to the same letters in Fig. 4b. In this example, the note durations and note onsets of the performance deviate considerably from the score. The durations are 11 and 10 s in Fig. 4a and b, respectively, indicating that the performance is slower than the score. In addition, because of temporal deviation in the human performance, the time intervals between labels are not as regular as those in the score. Despite these minor inconsistencies, however, the musical structures are very similar at all corresponding points labeled alphabetically in the figure. Our proposed method for establishing their note correspondences is described in the following section.

Fig. 4

a Modified constant Q spectrum of an audio excerpt. b Piano-roll plot of its corresponding score. Each dashed vertical line indicates an onset time and is labeled alphabetically at the top

4.2 Note-matching method

When computing similarity, the influence of overtones must be considered, as must differences in the energy distribution of the MCQS at different pitches. In our method, for each onset, the frames immediately before and after the onset in the MCQS are taken; the number of frames used cannot exceed the onset interval or a given threshold (e.g., 200 ms). Two vectors are obtained by averaging the magnitude spectra immediately before and after an onset i. Assuming that dAi is the difference between these two vectors, three types of spectrum difference vectors (SDVs) ψi can be defined, as explained subsequently.

  a. For the first type, we consider the difference only if it is higher than a noise threshold:

    $$ {\psi}_i^1(k)=\left\{\begin{array}{cl}{dA}_i(k), & if\ {dA}_i(k)>{\epsilon}_i\\ 0, & otherwise\end{array}\right. $$
    (1)

    where ϵi is a noise margin, which is set to 1/20 of the maximum of dAi.

  b. The second type converts the vector dAi into a binary vector, in which the values greater than the noise threshold are set to 1:

    $$ {\psi}_i^2(k)=\left\{\begin{array}{cl}1, & if\ {dA}_i(k)>{\epsilon}_i\\ 0, & otherwise\end{array}\right. $$
    (2)

  c. The third type takes the logarithm to narrow the numerical range:

    $$ {\psi}_i^3(k)=\left\{\begin{array}{cl}\log \left({dA}_i(k)\right), & if\ {dA}_i(k)>{\epsilon}_i\\ 0, & otherwise\end{array}\right. $$
    (3)
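The sketch below derives the three SDV variants from the MCQS around a detected onset; taking dAi as the after-minus-before difference is an assumption consistent with capturing the spectral energy that appears at the onset.

```python
import numpy as np

def spectrum_difference_vector(mcqs_mat, onset_frame, n_frames, variant=3):
    """Compute an SDV around a detected onset from the MCQS (bins x frames).

    n_frames bounds the averaging windows before and after the onset (it should
    not exceed the onset interval or roughly 200 ms); variant selects Eq. (1),
    (2), or (3).
    """
    before = mcqs_mat[:, max(0, onset_frame - n_frames):onset_frame].mean(axis=1)
    after = mcqs_mat[:, onset_frame:onset_frame + n_frames].mean(axis=1)
    dA = after - before                     # assumed difference direction
    eps = dA.max() / 20.0                   # noise margin: 1/20 of the maximum of dA
    mask = dA > eps
    if variant == 1:
        return np.where(mask, dA, 0.0)                              # Eq. (1)
    if variant == 2:
        return mask.astype(float)                                   # Eq. (2)
    return np.where(mask, np.log(np.maximum(dA, 1e-12)), 0.0)       # Eq. (3)
```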

For comparison with the SDVs from the MCQS, we define score pitch vectors (SPVs) with overtones for each note concurrence (i.e., notes with the same onset time) in the score. Assume that \( {\mathcal{g}}_j^0 \) is a set of note pitches in a concurrence in the score, where j is the index of the concurrence. Here, the harmonic series of a note is accounted for by the introduction of an overtone vector Ω equal to [0, 12, 19, 24], which are the index differences between the note and its overtones on the frequency axis. The elements of vector Ω correspond to the first to fourth harmonic partials of the note because an octave is divided into 12 frequency bins. For example, if a note has its first harmonic partial (i.e., fundamental frequency) at the first bin of the SPV, the 13th bin is its second harmonic partial, the 20th bin is its third harmonic partial, and so on.

We define the overtone set \( {\mathcal{g}}_j^{\ell } \) as follows:

$$ {\mathcal{g}}_j^{\ell }={\mathcal{g}}_j^{\ell -1}\cup \left({\mathcal{g}}_j^0+\Omega \left(\ell \right)\right) $$
(4)

where ∪ is the union operator. \( {\mathcal{g}}_j^0+\Omega \left(\ell \right) \) means that each element in the set \( {\mathcal{g}}_j^0 \) is increased by Ω(ℓ). In other words, \( {\mathcal{g}}_j^{\ell } \) is an overtone variation of \( {\mathcal{g}}_j^0 \), and ℓ stands for the number of overtones considered. Because we consider four harmonic partials, there are four sets for each concurrence. We define a score pitch vector with overtones (SPV) \( {\phi}_j^{\ell } \) as a binary vector derived from the set \( {\mathcal{g}}_j^{\ell } \). The definition of an SPV \( {\phi}_j^{\ell } \) is as follows:

$$ {\phi}_j^{\ell }(k)=\left\{\begin{array}{c}1,\kern0.5em if\ k\in {\mathcal{g}}_j^{\ell}\\ {}\kern0.5em 0,\kern.5em otherwise\end{array}\right. $$
(5)

The frequency range of an SPV covers the lowest and highest note pitches in the score. The vector length of the SPV is the same as that of the SDV in the MCQS, and the frequency bins have a one-to-one correspondence. If the set \( {\mathcal{g}}_j^{\ell } \) contains a pitch, the corresponding index of the SPV is set to 1, and to 0 otherwise. Here, four types of overtone distributions are considered to capture the characteristics of different musical timbres from various instruments. Figure 5 shows a simple example of SPVs. Assume that there are two notes with corresponding pitches at indices 1 and 3 at the j-th concurrence, namely \( {\mathcal{g}}_j^0=\left\{1,3\right\} \); \( {\phi}_j^0 \) is then a vector with a value of 1 at indices 1 and 3. The following operations consider the first one, two, and three overtones to obtain three sets:

$$ {\displaystyle \begin{array}{l}{\mathcal{g}}_j^1={\mathcal{g}}_j^0\cup \left({\mathcal{g}}_j^0+\Omega (1)\right)=\left\{1,3,13,15\right\}\\ {}{\mathcal{g}}_j^2={\mathcal{g}}_j^1\cup \left({\mathcal{g}}_j^0+\Omega (2)\right)=\left\{1,3,13,15,20,22\right\}\\ {}{\mathcal{g}}_j^3={\mathcal{g}}_j^2\cup \left({\mathcal{g}}_j^0+\Omega (3)\right)=\left\{1,3,13,15,20,22,25,27\right\}\end{array}} $$
Fig. 5

An example of score pitch vectors (SPVs). \( {\phi}_j^0 \) is a vector containing the pitch information at the concurrence j in the score. \( {\phi}_j^1 \), \( {\phi}_j^2 \), and \( {\phi}_j^3 \) are variants that consider the first one, two, and three overtones, respectively

\( {\phi}_j^1 \), \( {\phi}_j^2 \), and \( {\phi}_j^3 \) can then be built from \( {\mathcal{g}}_j^1 \), \( {\mathcal{g}}_j^2 \), and \( {\mathcal{g}}_j^3 \), respectively, as shown in the figure. Each vector represents a possible variation of overtones for a concurrence. Our implementation considers only four variants for each concurrence.
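A minimal sketch of the SPV construction in Eqs. (4) and (5) is given below, following the 1-based bin indices of the worked example.

```python
import numpy as np

def score_pitch_vectors(g0, length, overtones=(0, 12, 19, 24)):
    """Build the four SPV variants for one concurrence.

    g0: set of 1-based pitch bin indices of the concurrence.
    length: SPV length (equal to the SDV length, one bin per semitone).
    """
    spvs = []
    g = set(g0)
    for ell, shift in enumerate(overtones):
        if ell > 0:
            g = g | {p + shift for p in g0}   # add the ell-th overtone of every note
        phi = np.zeros(length)
        for p in g:
            if 1 <= p <= length:              # discard overtones above the pitch range
                phi[p - 1] = 1.0
        spvs.append(phi)
    return spvs  # [phi^0, phi^1, phi^2, phi^3]

# Worked example: g0 = {1, 3} gives phi^3 with ones at bins
# 1, 3, 13, 15, 20, 22, 25, and 27 (1-based), as in Fig. 5.
```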

We use a correlation coefficient to measure the similarity of two vectors. For each SDV ψi and the SPVs \( \left\{{\phi}_j^{\ell },\kern0.5em \ell =0\sim 3\right\} \) of a concurrence, the similarity is as follows:

$$ S\left(i,j\right)=\underset{\ell =0\sim 3}{\mathit{\max}} corr\left({\psi}_i,{\phi}_j^{\ell}\right) $$
(6)

where the function corr(ψ, ϕ) returns the Pearson linear correlation coefficient of the vectors ψ and ϕ.

Assume that M is the number of detected onsets in an audio performance and N is the number of concurrences in a score. The complexity of constructing the full similarity matrix is O(M × N). For a long audio file, M and N are very large and thus considerably increase the computation time. However, because the alignment path of audio-to-score alignment usually lies close to the diagonal, the similarity matrix does not need to be calculated in its entirety. To reduce the amount of computation, the number of SPVs compared with each SDV is limited. The pruning procedure is described in Algorithm 2.

Algorithm 2 Pruning procedure for similarity computation

The function round(x) rounds a floating-point number x to the nearest integer. The variable w0 is a constant and is set to 5 in our implementation. After such treatment, the complexity is reduced from O(M × N) to O(M × |M − N|).
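Because Algorithm 2 is given as a figure, the sketch below shows one plausible reading of the pruning: each SDV i is compared only with SPVs near the proportional score position round(i·N/M), within a band whose half-width grows with |M − N|.

```python
import numpy as np

def banded_similarity(sdvs, spv_sets, w0=5):
    """Similarity matrix of Eq. (6), restricted to a band around the diagonal.

    sdvs: list of M SDVs; spv_sets: list of N lists, each holding the four SPV
    variants of a concurrence. Unvisited cells keep a sentinel value of -1.
    """
    M, N = len(sdvs), len(spv_sets)
    half = w0 + abs(M - N)                      # band half-width (one plausible choice)
    S = np.full((M, N), -1.0)
    for i in range(M):
        center = int(round(i * N / M))
        for j in range(max(0, center - half), min(N, center + half + 1)):
            # Eq. (6): best Pearson correlation over the four SPV variants
            S[i, j] = max(np.corrcoef(sdvs[i], phi)[0, 1] for phi in spv_sets[j])
    return S
```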

5 Dynamic programming

According to the aforementioned note-matching method, a similarity matrix S(i, j) can be constructed, where i is an onset index of the input audio and j is an index of a concurrence in the music score. Each cell in the matrix represents the similarity between an SDV and an SPV. Dynamic programming (DP) is employed to determine the path with maximum overall similarity:

$$ D\left(i,j\right)=\mathit{\max}\left\{\begin{array}{c}D\left(i-1,j\right)\\ {}D\left(i,j-1\right)\\ {}D\left(i-1,j-1\right)+S\left(i,j\right)+\eta \left(i,j\right)\end{array}\right\} $$
(7)

where η(i, j) is a local tempo coefficient. Figure 6 shows an example of audio-to-score alignment; the matching curve was obtained for the audio used in Fig. 4. The upward-pointing triangles on the x-axis are the detected onsets of the audio. The right-pointing triangles on the y-axis are the positions of the concurrences in the score. The circles indicate the ground truth. The red asterisk is a matched pair in the DP process, and the black asterisks are matched pairs back-traced from the red asterisk. The two dashed red lines are the first two back-traced segments, and m1 and m2 are their slopes, which are used to evaluate the influence of temporal deviation. The local tempo coefficient is defined as follows:

Fig. 6

An example of audio-to-score alignment. This matching curve is the result of the audio in Fig. 4. The circles are the ground truth. The red asterisk is a matched pair in the dynamic programming process. The two dashed red lines are the first two segments back-traced from the red asterisk, and m1 and m2 are their slopes, which are used to evaluate the influence of temporal deviation

$$ \eta \left(i,j\right)=\alpha \left(1-\left|{\log}_2\frac{m_1}{m_2}\right|\right) $$
(8)

The ratio between m1 and m2 denotes the degree of tempo variation in a performance. If the tempo of the audio is the same as that of the score, the ratio is 1 and η(i, j) is equal to α. If the tempo varies considerably, the ratio deviates considerably from 1 and η(i, j) decreases, such that the SDV i is less likely to match the note vector j. The parameter α is a scalar that controls the influence of temporal deviation in the alignment process. In our experiment, we recognized that pitch matching is more crucial than temporal matching and thus set α to 0.1.

The alignment path is a sequence of adjacent cells, where each cell indicates a correspondence between an onset in the audio performance and a concurrence in the music score. After the accumulated similarity has been computed, the best alignment path can be derived by back-tracking from the cell with the highest accumulated value in matrix D.
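A minimal sketch of the recursion in Eq. (7) and the back-tracking step is given below; for self-containedness it omits the local tempo coefficient η(i, j), which in the full method is computed from the slopes of the back-traced segments.

```python
import numpy as np

def dp_align(S):
    """Maximize accumulated similarity as in Eq. (7) (with eta omitted) and backtrack."""
    M, N = S.shape
    D = np.full((M + 1, N + 1), -np.inf)
    D[0, :] = 0.0
    D[:, 0] = 0.0
    step = np.zeros((M + 1, N + 1), dtype=int)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            cands = (D[i - 1, j], D[i, j - 1], D[i - 1, j - 1] + S[i - 1, j - 1])
            step[i, j] = int(np.argmax(cands))
            D[i, j] = cands[step[i, j]]
    # Backtrack from the cell with the highest accumulated value in the last row
    i, j = M, int(np.argmax(D[M, :]))
    path = []
    while i > 0 and j > 0:
        if step[i, j] == 2:
            path.append((i - 1, j - 1))   # onset i-1 matched to concurrence j-1
            i, j = i - 1, j - 1
        elif step[i, j] == 0:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```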

6 Real-time approach

Online and offline algorithms differ in terms of the time that is required to make an alignment decision. In real-time applications, a system must listen to music and immediately match the music to its corresponding notes in the score. The difference between the time at which an alignment decision is made and the estimated note onset time should be minimized. In this paper, we propose an online algorithm with a latency of only one frame (i.e., 10 ms) in most cases.

We created an online version of our offline method. The online method is based on onset detection and our note-matching method to enable construction of a similarity matrix. First, the onset detector is replaced with the online version. A state-of-the-art online onset detector [5] based on the RNN method is used. In addition, to obtain a positive result with low latency, the alignment process is designed to adhere to the following three conditions:

  (1) Performance stability: A musician usually plays a piece of music according to the tempo markings on the corresponding score. The performance tempo does not change suddenly, and thus the next onset can be predicted from previously matched onsets.

  (2) Performance continuity: We assume that the music is played according to the order of the notes in the score without any significant deviation. In other words, most matched concurrences are continuous with the previously matched concurrence. The alignment system should not arbitrarily skip sequential concurrences or jump to other positions in the score.

  (3) System response time: The real-time system must consider the response time, and therefore the computation time of the alignment process must be reduced. Our algorithm balances precision and computation time; for each onset, only concurrences around the previously matched concurrence are considered for predicting the score position.

To meet the above conditions, Eq. (8) should be slightly modified as follows:

$$ \eta \left(i,j\right)=\left\{\begin{array}{c}\alpha \left(1-\left|{\log}_2\frac{m_1}{m_2}\right|-\Delta j\right),\kern1.25em if\ \frac{m_1}{m_2}<4\\ {}\kern7.25em -1,\kern1.2em otherwise\end{array}\right. $$
(9)

We add an extra term, ∆j = j − j0 − 1, where j0 is the index of the previously matched concurrence. This term is a penalty that discourages erroneous score jumps. In addition, we limit the ratio between two adjoining tempo slopes to less than 4. With these simple modifications, the accumulation matrix is less affected by score jumps and onset insertions.

Figure 7 displays a flowchart of our proposed method, which is an onset-driven system. When an onset i is detected, the block “local tempo estimation” predicts the corresponding concurrence based on the assumption that the performance is stable. The first step is to identify the most likely concurrence j0 matched with the previous onset:

$$ {j}_0=\underset{j_a-\varDelta j\le j\le {j}_a+\varDelta j}{\arg\ \max }D\left(i-1,j\right) $$
(10)

where ja is the index of the previously matched concurrence and ∆j is a tolerance window. Under normal circumstances, the current onset i should be matched to the concurrence ja + 1 because of performance continuity. However, because some errors always exist in onset detection and the note-matching process, the real matched concurrence may deviate from ja + 1, and thus the search range must be extended by ±∆j.

Fig. 7

Flowchart of the proposed online audio-to-score alignment method. It is an onset-driven system. When an onset i is detected, the block “local tempo estimation” will predict the corresponding note concurrence. The block “looking-ahead note matching” will determine the optimal concurrence based on the note-matching method

In the second step, a short-term back-tracing is performed from the position (i − 1, j0) in the accumulation matrix, and the corresponding concurrence for the current onset i is estimated as follows:

$$ {j}_1=\underset{j_0\le j\le {j}_0+\varDelta j}{\arg\ \min}\left|{f}_{tempo}(j)-{\tau}_i\right| $$
(11)

where the function ftempo(j) predicts the corresponding onset time in the performance for the note concurrence j. This function involves computing the following three quantities:

  (1) Local alignment path: First, we trace back N steps from (i − 1, j0) in the accumulation matrix and obtain a local alignment path. The choice of N depends on the local tempo variation of the audio; we set N to 5 in our experiment.

  (2) Average tempo slope: We calculate the average intervals of the aligned positions of the audio and the score according to the local alignment path. The average tempo slope is the ratio between these two average intervals.

  (3) Probable onset time: Assuming that the tempo ratio at the current onset is the same as the average tempo ratio of the local alignment path, we estimate the corresponding onset time of a note concurrence through linear extrapolation from the local alignment path.

We consider the concurrences ranging from j0 to j0 + ∆j and select the one with the smallest gap to the current onset time. If the currently detected onset is an insertion, the system discards it and waits for the next onset. A simple strategy is employed to filter out insertions; specifically, the ratio of the tempo slope at the current onset to the average tempo slope of the local estimation must be less than 4.
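The sketch below illustrates the local tempo estimation of Eq. (11) under the three conditions above; the variable names and the tolerance ∆j are illustrative.

```python
import numpy as np

def predict_concurrence(local_path, onset_times, conc_times, tau_i, j0, dj=3):
    """Predict the concurrence for the current onset by local tempo extrapolation.

    local_path: the last N matched (onset_index, concurrence_index) pairs.
    onset_times: detected onset times in the audio; conc_times: nominal times of
    the concurrences in the score; tau_i: time of the current onset.
    """
    audio_t = [onset_times[i] for i, _ in local_path]
    score_t = [conc_times[j] for _, j in local_path]
    # Average tempo slope = ratio of the mean intervals along the local path
    slope = np.mean(np.diff(audio_t)) / (np.mean(np.diff(score_t)) + 1e-9)
    t_ref, s_ref = audio_t[-1], score_t[-1]

    def f_tempo(j):
        # Linearly extrapolated onset time expected for concurrence j
        return t_ref + slope * (conc_times[j] - s_ref)

    candidates = range(j0, min(j0 + dj, len(conc_times) - 1) + 1)
    return min(candidates, key=lambda j: abs(f_tempo(j) - tau_i))
```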

The second comparison checks performance continuity. If the predicted concurrence j1 is equal to ja + 1, the current onset i is matched to the concurrence j1; otherwise, there may be a score jump, which would require further checking of the note pitch by using our note-matching method. The formula is as follows:

$$ {j}_2=\underset{j_0+1\le j\le {j}_1}{\arg\ \max }S\left(i,j\right) $$
(12)

where S(i, j) is the similarity matrix in Eq. (6). Improving the note-matching accuracy requires reading a few more frames; in our experiments, we read an additional 40 ms of audio data after the current onset time. The range of computation is limited to between j0 + 1 and j1, and the optimal concurrence j2 is matched to the current onset i.

To clarify our method, we provide two examples here. Figure 8a contains no score jump; the piece of music is the same as that in Fig. 6 and is labeled similarly. The dashed vertical magenta line represents the currently detected onset time. We estimated the score position using Eq. (9) and traced back five steps to obtain a local alignment path, which is denoted by black asterisks. We were then able to compute the average slope between these points and predict the score position corresponding to the current onset of the audio. The most likely score position is denoted by the red diamond in Fig. 8a. The estimated concurrence index is next to the previous one, indicating that no score jump occurred in this example. By contrast, Fig. 8b shows an example of a score jump; the estimated concurrence index (red diamond) is far from the alignment path (black asterisks), denoting a large jump of notes between them. If this estimated concurrence index were directly matched with the current onset, the subsequent alignment could go astray and produce incorrect output. Thus, this case requires note matching of the nearby notes and selection of the most likely concurrence.

Fig. 8

Two examples of real-time audio-to-score alignment. The vertical magenta dashed line represents current detected onset timing. The black asterisks are the local alignment path. The red diamond is the matched score position according to the local tempo estimation. a The estimated score position is continuous with the local alignment path. b There is a score jump between the estimated score position and the local alignment path

The decision time of our method is approximately equal to the onset time plus the computation time of the local tempo estimation, except in the case of a score jump, which requires additional audio data for note matching. Therefore, our method has a short response time. The reliability of this method is verified in the experiments described in the following section, which also compare our method with other methods.

7 Experiments

To demonstrate the performance of the proposed method and the influence of certain parameters, we conducted a series of experiments. The main task of this study was to verify the feasibility of the proposed alignment framework. Because our alignment method is based on onset detection, we intended to observe how onset detection influences alignment performance; in addition, we have proposed a fusion method for onset detection functions (ODFs). Therefore, experiment 1 was designed to investigate the performance of several onset detectors. Experiment 2 evaluated audio-to-score alignment based on various onset detectors; the effects of the different onset detectors and spectrum difference vectors on the final performance are discussed. In experiment 3, we verified the feasibility of the proposed online approach and compared its performance with that of other online methods.

7.1 Datasets

The evaluation of audio-to-score alignment was performed using two datasets. Each recording had a corresponding MIDI representation of the score, and the audio files were recordings of real music performances in 44.1-kHz 16-bit WAV format. The first dataset, Bach10 [18], consists of 10 human-played J.S. Bach four-part chorales, with 30-s audio files sampled from real music performances by an instrumental quartet consisting of violin, clarinet, tenor saxophone, and bassoon. Each instrument was recorded separately, and individual recordings of the same chorale were then mixed to create 60 duets, 40 trios, and 10 performances with four-part polyphony. The total number of notes is approximately 13,700, and the total duration is approximately 60 min. The second dataset, collected by Cont [12], has 46 recordings extracted from four distinct pieces of classical music performed in a monophonic or slightly polyphonic manner. The total number of notes is slightly more than 4600, and the total duration is approximately 32 min. Notably, the original human annotations in the second dataset deviate more from the real onset positions, and thus the ground truth alignment between the audio and MIDI has been relabeled; the revised labels have been released as a public resource. These datasets contain a variety of musical types and tempo variations and are used in the MIREX score-following contest. Recent studies [11, 23, 24] on multi-instrumental audio-to-score alignment have used these datasets, and they were therefore adopted to evaluate the proposed method in the experiments of the present study.

7.2 Experiment 1: onset detection

In experiment 1, we compared the performance of the proposed method with that of four other methods from the literature. The first was a convolutional neural network (CNN)-based method, the state-of-the-art detector proposed in [37]; the other three were SuperFlux, ComplexFlux, and the RNN method, which are the constituent ODFs in our implementation of the fusion method. We have included their results to illustrate the improvement offered by the proposed fusion algorithm.

We trained our onset detector on several previously released datasets [2, 20, 21, 26] consisting of music of a range of genres, styles, and instrumentations. We used 203 audio excerpts with an overall duration of approximately 36 min and 10,053 onsets. For SVM training, the full dataset was split into five disjoint folds to optimize the parameters of the proposed method through five-fold cross validation within the training set.

For evaluation, the standard measures of precision, recall, and the F-measure were used. An onset was considered correctly detected if a ground truth annotation was observed within a tolerance window around the predicted position. If two or more decisions were made for a single ground truth annotation within the tolerance window, only one decision was counted as a true positive, and all others were counted as false positives. In our experiments, we selected a tolerance window of ±50 ms, which is identical to that used in the MIREX audio onset evaluation task.

Tables 1 and 2 respectively list the experimental results for the Bach10 and Cont datasets; bold type indicates the best result for each measure. As shown in Table 1, the CNN method had the highest precision, recall, and F-measure among all the methods. As shown in Table 2, the fusion method had the highest precision and F-measure, whereas the CNN method had the highest recall. In both tables, the fusion method had higher precision than its constituent ODFs, namely SuperFlux, ComplexFlux, and the RNN method. Although its recall was slightly lower, the F-measure was considerably improved overall. Notably, the improvement in F-measure was more evident for the second dataset because that dataset contains more monophonic music, for which features extracted from the H/P curves are more helpful. Taken together, Tables 1 and 2 indicate the effectiveness of the proposed method. Moreover, our method is more transparent than the CNN-based method because we combine features from three ODFs and the final classifier is an SVM. The proposed framework is also relatively flexible, and thus the introduction of new features and new fusion methods is fairly straightforward; for example, we added H/P features to reduce interference from vibrato. By contrast, the CNN employs fully connected layers for classification, an approach that is generally viewed as a hard-to-interpret black box.

Table 1 Onset detection results in BACH10 dataset
Table 2 Onset detection results in Cont dataset

The onsets detected by the different methods were used in the audio-to-score alignment experiment described as follows, where we observed the impact of the onset detector on our alignment method.

7.3 Experiment 2: offline audio-to-score alignment

The second experiment compared the proposed algorithm with other audio-to-score alignment methods. The evaluation metric is based on the precision of the matched notes in the score, which has been frequently adopted in related studies such as [8, 10, 18, 23, 24, 33]. A note is said to be correctly matched if its estimated onset does not deviate from the real one by more than the specified tolerance window (e.g., 50 ms). The figure of merit used in the experiment was piecewise precision, which is the percentage of correctly matched notes averaged over the recordings.

The performance evaluation results for the Bach10 dataset are listed in Table 3, which details 18 systems for evaluation. The first three are based on published papers, whereas the remaining 15 are combinations of five onset detectors and three SDVs. RB is derived from Dannenberg’s algorithm [22], which uses chroma features and DTW alignment; this algorithm was chosen as a baseline in our experiment. CC is an approach that we proposed in an earlier study [10]; in CC, the onset detector employs an RNN-based method and the similarity measure is based on NMF decomposition of note overtones. JC is one of the most recently proposed methods [8]; it decomposes spectral patterns using learned basis functions and uses DTW to identify the best alignment path. Notably, the results of JC were estimated from Figure 3 in [8], and thus only two data points are included. The remaining 15 systems are based on the proposed methods, combining five onset detectors (SuperFlux, ComplexFlux, the RNN, the CNN, and the fusion method) with three SDVs, namely SDV1, SDV2, and SDV3, which are the three representations of the SDV ψi given in Eqs. (1), (2), and (3), respectively, in Section 4.

Table 3 Audio-to-score alignment results in BACH10 dataset

As shown in Table 3, four tolerance windows were used to evaluate performance. For each onset detector with a specified tolerance window, the percentage in bold shows the most favorable result among the three SDVs. For the SuperFlux, RNN, and fusion methods, SDV3 consistently outperformed SDV1 and SDV2; for ComplexFlux and the CNN, SDV2 consistently outperformed SDV1 and SDV3. In addition, our 15 alignment systems clearly outperformed RB and CC. For comparison with JC, Fig. 9 shows bar charts of the precision for the tolerance windows of 50 and 100 ms. As shown in Fig. 9a, where the tolerance window is 50 ms, our systems outperformed JC for all onset detectors, except for the SDV1 variants and RNN + SDV2. As shown in Fig. 9b, when the tolerance window was 100 ms, our systems consistently outperformed JC. This indicates that the proposed method is fairly robust with respect to different onset detectors.

Fig. 9

Audio-to-score alignment results in Bach10 dataset at the two tolerance windows of (a) 50 ms and (b) 100 ms.

Notably, the F-measure of the fusion method was higher than those of its constituent onset detectors (SuperFlux, ComplexFlux, and the RNN), as shown in Table 1. However, Table 3 shows that some precision values of the alignment systems with the fusion method were lower than those with SuperFlux and ComplexFlux, possibly because SuperFlux and ComplexFlux have higher recall, leading to more candidate SDVs when DP matching is performed with the SPVs and thus better alignment. Taken as a whole, the proposed note-matching method achieves favorable results when combined with various onset detectors.

Table 4 provides the results of our experiment on the Cont dataset. The proposed method is compared with RB and CC only because we did not obtain results for JC with this dataset. To improve readability, Fig. 10 shows bar charts for the tolerance windows of 50 and 100 ms. The results of all our combinations were significantly better than those of RB, thereby demonstrating the robustness of the proposed note-matching method. In addition, our proposed note-matching method outperformed the CC method, except for the alignment systems with SuperFlux and ComplexFlux; this is because SuperFlux and ComplexFlux did not capture onsets sufficiently well in this dataset, as shown in Table 2. As for the CNN, although its precision and F-measure are low in Table 2, it still performed well for audio-to-score alignment. This is similar to the results of SuperFlux and ComplexFlux in the Bach10 dataset, where higher recall provided more SDVs for better alignment. Thus, to some extent, recall is more important than precision in our alignment system; however, precision must still be maintained. For instance, the fusion method had lower recall and considerably higher precision than the CNN, as shown in Table 2, which led to generally better alignment for the fusion method than for the CNN, as shown in Table 4. Furthermore, the overall performance of SDV3 was consistently better than that of SDV1 and SDV2 in the Cont dataset. In summary, fusion+SDV3 had the highest precision in all tolerance windows for this dataset.

Table 4 Audio-to-score alignment results in Cont dataset
Fig. 10

Audio-to-score alignment results in Cont dataset at the two tolerance windows of (a) 50 ms and (b) 100 ms.

Bach10 contains polyphonic music played on multiple instruments, whereas the Cont dataset contains more monophonic music. Although these two datasets have different properties, we observed a similar trend in both, namely that the overall performance of SDV3 was the best for all onset detectors. In addition, the alignment results were positively correlated with the onset detection results. To provide a more intuitive view of all methods, we merged the results obtained on the two datasets and re-evaluated the precision of the detected notes for tolerance window sizes ranging from 25 to 100 ms in 25-ms increments. To improve readability, the results of SDV1 and SDV2 are not presented. The obtained results are plotted in Fig. 11. All our systems achieved better results than the other methods included for comparison, namely RB and CC, in all tolerance windows. CNN + SDV3 outperformed RB by 20.88% and CC by 10.62% when the tolerance window was 25 ms, and fusion+SDV3 outperformed RB by 11.04% and CC by 6.03% when the tolerance window was 100 ms.

Fig. 11

Experimental results based on both datasets. The first 5 are based on the proposed algorithm, which compares favorably with the algorithms CC and RB proposed in the literature

7.4 Experiment 3: online audio-to-score alignment

In experiment 3, Bach10 was used to evaluate the performance of the proposed score follower for online alignment. We adopted SDV3 for computation of the similarity matrix. Figure 12 presents our experimental results. We compared our score follower with four reference methods. Soundprism [18] is a system that uses a particle filter to infer the tempo and score position of each audio frame in online mode. The algorithms ISMIRon [8], BACKspeed, and FWDspeed [33] are derived from the offline algorithm JC, which was introduced in the previous section. ISMIRon matches each performance frame with a score position associated with the minimum value of the accumulated cost, whereas BACKspeed and FWDspeed are two variants that estimate the DTW path in online mode by introducing reliable onset positions.

Fig. 12

Results of online audio-to-score alignment. The proposed method produces the best precision when the tolerance window size is less than or equal to 100 ms.

We determined the precision of detected notes for tolerance window sizes ranging from 50 to 200 ms in 50-ms increments, as shown in Fig. 12; the proposed system obtained the best results when the tolerance window was shorter than or equal to 100 ms. As the tolerance window was increased, the precision improvement of our method over the other methods became smaller because some onsets were falsely ignored by the employed onset detector, and thus the corresponding notes were not recognized. Although the precision of our method was slightly lower than that of Soundprism at tolerance window sizes of 150 and 200 ms, our method had the highest precision at 50 and 100 ms and had the lowest average offset of all the compared methods. In addition, our method was able to detect 77% of all notes with a latency of 10 ms and 23% with a latency of 50 ms, leading to an average latency of only 19.2 ms. Therefore, our method is well suited to online audio-to-score alignment. The success of this approach can be attributed to the high accuracy of our note-matching method and the reliability of the local tempo estimation.

8 Conclusions and future work

We have proposed an algorithm that aligns an audio recording of polyphonic music to its corresponding score. The algorithm involves three steps: onset detection, similarity computation between an SDV derived from an audio onset and an SPV derived from the concurrent note-on event in the score, and dynamic programming to determine the optimal alignment. We have also proposed a fusion method for onset detectors and a note-matching method based on our MCQS. Considering the spectral variation around an onset reduces the complexity of similarity computation and the interference from vibrato. Our note-matching method is a direct comparison of SDVs and SPVs that is more reliable when the timbres in the audio differ from those specified in the music score. We implemented 15 alignment systems combining five onset detectors and three SDVs and compared them with other audio-to-score alignment algorithms proposed in the literature; our systems obtained higher alignment precision than all compared methods. Finally, we proposed an online approach based on onset detection in which musical notes are detected with an average latency of 19.2 ms. This online version outperformed all the methods included for comparison when the tolerance windows were 50 and 100 ms.

Although the proposed method performed favorably in the experiments, there is still room for improvement. In future research, we will focus on improving various aspects of the alignment algorithm. Machine learning will be used to analyze the spectral patterns of the constant Q transform to produce a more advanced similarity matrix. Moreover, we intend to develop a musical accompaniment system for polyphonic music performances to demonstrate the feasibility of the proposed algorithm.