
1 Introduction

Musical Human-Computer Interaction (HCI) techniques have empowered computer music systems to perform with humans via a broad spectrum of applications [14]. It is vital to have a consistent and focused approach when moving from traditional face-to-face (F2F) courses to Remote Learning (RL) courses using an online music teaching environment. Music students and teachers use different websites, online apps, and computer programs to learn, remix, and compose music. Existing music technologies allow digital or MIDI-enabled acoustic pianos to connect synchronously over the Internet, producing reliable instrumental audio, separate from the video-conferencing platform. Research is required to help teachers transition to the online format.

As RL music courses become more common, graduate teaching assistants, or tutors, will become essential instructor support. Online teachers should be trained in the necessary online music teaching, communication, and assessment skills.

Based on [5], our analysis revealed four essential elements for online music courses:

  1. Online music pedagogy (e.g., teaching philosophies, authentic music learning, openness to online music learning, institutional support, and learning approaches)

  2. Course design (e.g., planning, organization, multimedia use, and curriculum)

  3. Assessment (e.g., meaningful opportunities to demonstrate music learning)

  4. Communication (e.g., methods for exploring subject content and technology tools)

This research identifies critical elements for developing a program for online music tutors. The goal is to train music teachers to master online skills and provide an online platform for people to practice music individually or as a group. Differences between F2F courses and online platforms include multimedia technology, social constructivist learning activities, and the development of practical online communication skills.

This research has two main questions:

  1. What are the key components of teaching online music skills to music tutors?

  2. How can these components be adapted to implement an online automated tool that can train itself based on different students’ skills and learning styles?

We are developing a teaching framework that helps music faculty members transition from traditional F2F classroom teaching to the online environment. This teaching framework is divided into three phases: 1) hybrid online courses; 2) fully online study focused on social constructivist learning; and 3) fully online classes with limited student interaction.

In this research, we also create a music technology curriculum and share it with expert music teachers, who assess the suitability of its chapters and modules. After analyzing their feedback, we revise the music modules accordingly. The second phase is to record F2F classes in which music teachers teach our curriculum to students. In phase three, we implement an automatic online teaching model that trains itself on the rehearsals and recorded F2F courses provided by music instructors. The system incorporates techniques from several areas, including real-time music tracking (score following), beat estimation, chord detection, and body movement generation. In our system, the virtual music teachers’ and students’ behavior is derived from the music audio alone, an approach that results in a low-cost, efficient, and scalable way to produce human and virtual musicians’ co-performance [6].

This paper presents various techniques, especially ML algorithms, to create Artificial Intelligence (AI) tutors and musicians that perform with humans. We focus on four aspects of expression in human-computer collaborative performance: 1) chord and pitch detection, 2) timing and dynamics, 3) basic improvisation techniques, and 4) facial and body gestures.

Two of the most fundamental aspects of online music teaching are timing and dynamics. We model different teachers’ performances as co-evolving time series. “Based on this representation, we develop a set of algorithms, to discover regularities of expressive musical interaction from rehearsals” [14]. Given the learned model, an artificial performer generates its own musical expression by interacting with a human performer, following a predefined curriculum. The results show that, with a small number of rehearsals, ML can create more expressive and human-like collaborative performance than a baseline automatic accompaniment algorithm.

Body and facial movements are also essential aspects of online music teaching. We study body and facial expressions using feature extraction models that derive features from teacher recordings. We contribute the first algorithm that enables our virtual teaching model to perform an accompaniment for a musician and react to the human performance with gestural and facial expression. The current system uses rule-based performance-motion mapping and separates virtual tutor motions into three groups: finger motions, body movements, and eyebrow movements. Our results show that virtual tutor embodiment and expression enable more musical, interactive, and engaging human-computer collaborative performance [14].

The next section presents the literature review. Then, we propose a music learning architecture for “How to play [song] on [instrument]” tutorial lessons that takes a favorite pre-recorded music piece as input. In Sect. 4, we propose a method for automating the assessment of chord structure and beat detection via ML. In Sect. 5, we conclude the paper with a short discussion of the overall process of online music teaching and improvements to the prototype.

2 Literature Review

2.1 Machine Learning in Music

There are many useful music teaching applications using ML techniques. The main tasks in music that can be solved by ML are: music score following, chord recognition, musical instrument identification, beat tracking, rhythm tracking, source separation, genre classification, and emotion detection [8].

Most tasks in music require initial feature extraction followed by classification. Features that can be used for Music Information Retrieval (MIR) tasks include mel-frequency cepstral coefficients (MFCCs), chroma-based features, spectral flux, spectral centroid, spectral dissonance, and percussiveness [13]. Modeling the patterns of the extracted features plays an important role in training the online automated model; common models include Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), and support vector machines (SVMs). With current ML techniques, we can automatically retrieve information about a piece of music [8], including the instrument type(s), key, tempo, musical notation, pitch(es), segments, and chords present in a song. By automatically obtaining this information from a piece of music, whether a novice student’s recording or a professionally recorded song, we can efficiently use it for other automation tasks.
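As a concrete illustration (not part of the cited systems), the following sketch shows how such features could be extracted with the open-source librosa library; the file path and feature choices are placeholder assumptions.

```python
# A minimal feature-extraction sketch using librosa (illustrative only).
import librosa

def extract_features(audio_path):
    """Load a recording and compute a few of the features listed above."""
    y, sr = librosa.load(audio_path)                      # audio samples and sampling rate
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # 12 x frames chroma-based features
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # 13 x frames timbral features
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)    # global tempo estimate and beat frames
    return chroma, mfcc, tempo, beats
```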

A number of metrics have been proposed to evaluate a real-time music tracking system. These metrics are mostly based on measuring the latency/error of every note event [12], or calculating the number of missing/misaligned notes during the process of score following [9]. There are, however, two major issues in such evaluation methods. First, the performance of score following cannot fully represent the performance of an automatic accompaniment system operating in real-world environments, as it ignores the latency introduced by sound synthesis, data communication, and even reverberation of the environment. Second, note-level evaluation is suitable only for hard-onset instruments such as piano, while it is limited for soft-onset instruments such as violin, as the uncertainty of violin onset detection could propagate errors into the final evaluation results [6]. To solve these issues, we first propose an experimental setup that allows evaluation of the system in a real-world environment. Further, we provide a frame-level evaluation approach for general types of instruments, with intuitive visual diagrams that demonstrate how the system interacts with humans during the performance.
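As a simple illustration of the frame-level idea (an assumed formulation, not the exact metric used in [6]), one can compare estimated and reference score positions frame by frame:

```python
# Illustrative frame-level evaluation: mean absolute alignment error over all frames.
import numpy as np

def frame_level_error(estimated_positions, reference_positions):
    """Both inputs give a score position (e.g., in beats) for every analysis frame."""
    est = np.asarray(estimated_positions, dtype=float)
    ref = np.asarray(reference_positions, dtype=float)
    return float(np.mean(np.abs(est - ref)))
```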

3 Proposed Framework of MOOCs for Music Learning and Performance

3.1 Module 1: Self-learning Tutorials

Online learning environments are divided into three categories: online video tutorials (e.g., YouTube), face-to-face video calls with the teacher, and Massive Open Online Courses (MOOCs). To learn music online, one needs to be self-motivated and self-disciplined. Online video tutorials may lack real-time interaction with the learner, which is essential for music learning. One-on-one video calls may provide an interactive learning environment; however, they are not practical for larger audiences. Therefore, machine learning models can be used to address interactivity, scalability, and the automated development of music tutorials. Figure 1 presents a framework for supporting MOOCs to increase their scalability to large audiences [1, 11].

Fig. 1. Music education framework

The basics of the framework are: (1) the trained AI Tutor provides a music exercise using the Music Learning System (MLS); (2) the learner uses the interfaces for practice and learning, then uploads his or her performance of the exercise to the MLS; (3) the MLS sends the audio recording to Music Feature Extraction and Critic, where it is analyzed and further presented to the trained AI tutor for assessment.

To create automated lesson plans based on class recordings, the necessary information, including pitch, chord, beat, duration, rhythm, and dynamics, should be retrieved from the music file. These data serve as valuable features for training a model to assess a student’s performance while learning to play an instrument. One component of the model is the student practice and recording interface, which can be easily tailored to specific exercises by the education content designer (music instructor). Our initial model provides better results with a monophonic instrument such as the flute, on which the player produces one note at a time.
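A hypothetical sketch of this assessment flow follows; all class and function names are illustrative placeholders, not the actual MLS implementation.

```python
# Illustrative data flow for Fig. 1: exercise -> student submission -> feature-based critique.
from dataclasses import dataclass
from typing import Dict

@dataclass
class Exercise:
    title: str
    reference_audio: str   # path to the teacher's reference recording

@dataclass
class Submission:
    exercise: Exercise
    student_audio: str     # path to the learner's uploaded recording

def extract_music_features(audio_path: str) -> Dict[str, float]:
    """Placeholder for the Music Feature Extraction step (pitch, beat, dynamics, ...)."""
    return {"tempo": 0.0, "pitch_accuracy": 0.0}

def assess(submission: Submission) -> Dict[str, float]:
    """Critic step: compare student features against the reference and report deviations."""
    student = extract_music_features(submission.student_audio)
    reference = extract_music_features(submission.exercise.reference_audio)
    return {key: student[key] - reference[key] for key in reference}
```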

Our initial tests show that the face-to-face delivery of teacher performance followed by several student repetitions can be successfully imitated with such interfaces. The first session of this course will be offered during Summer 2021. The demonstrations of interfaces, results, and observations on user experience will be shared with the audience during the conference.

In this paper, we focus on the chord recognition task, which is one of the most important tasks in MIR.

3.2 Module 2: Chord Recognition

In music, a combination of different notes that are played simultaneously is called harmony. The main components of harmony are chords, which are musical constructs that consist of multiple notes (three or more).

The result of a chord recognition task consists of dividing an audio file into smaller segments and assigning a chord label to each segment. “The segmentation represents the timeline of a chord, and the chord label classifies which chord is played during a specific period of time. A typical chord recognition system consists of two essential steps” [7].

  • In the first step, the given audio recording is cut into frames, and each frame is transformed into an appropriate feature vector. Most recognition systems are based on chroma-based audio features, “which correlate to the underlying tonal information contained in the audio signal”.

  • In the second step, pattern recognition techniques are used to map each feature vector to a set of predefined chord labels.

Figure 2 represents a diagram of the chord recognition process.

Fig. 2. Chord recognition diagram

Template-Based Pattern Matching. One technique for detecting a chord is to match the chromagram of each segment against a predefined template matrix. Assume we are given a sequence X = \(\{x_1, x_2,...,x_N\}\) of chromagram vectors and a set \(\varLambda \) of all chord labels. Template-based chord recognition aims to map each chromagram vector \(x_n \in R^{12}\) to a chord label \(\lambda _n\in \varLambda \), \(n\in [1:N]\).

Consider the following set:

$$\begin{aligned} \varLambda = \{C, C^\#, D, ..., B, Cm, C^\#m, ..., Bm \} \end{aligned}$$
(1)

To simplify the problem, we reduce all chord types to the twelve major and twelve minor triads. Therefore, each frame \(n\in [1:N]\) is mapped to a major or minor chord label \(\lambda _n\).

We first pre-compute a set

$$\begin{aligned} \tau \subset F = R^{12} \end{aligned}$$
(2)

of templates denoted by \(t_\lambda \in \tau \), \(\lambda \in \varLambda \). Each template can be considered as a prototypical chromagram vector that represents a musical chord. Moreover, we fix a similarity measure by

$$\begin{aligned} s : F \times F \rightarrow R \end{aligned}$$
(3)

that allows comparing different chromagram vectors. Then, the Template-based procedure consists of selecting the chord label that maximizes the similarity between the corresponding template and the given feature vector \(x_n\):

$$\begin{aligned} \lambda _n := \underset{\lambda \in \varLambda }{\mathrm {argmax}}\; s(t_\lambda , x_n) \end{aligned}$$
(4)

In this procedure, there are three main concerns.

  1. Which chords should be considered in \(\tau \)?

  2. How are the chord templates defined?

  3. What is the best evaluation method to compare the feature vectors with the chord templates?

Based on [7], for the chord label set \(\varLambda \), we select the twelve major and twelve minor triads. Considering chords up to enharmonic equivalence and octave shifts, each triad can be encoded as a three-element subset of [0:11]. For example, the C major chord corresponds to the subset \(\{0, 4, 7\}\). Each subset, in turn, can be represented by a binary twelve-dimensional chroma vector \(x=(x(0),x(1),\ldots ,x(11))\), where x(i) = 1 if and only if the chroma value \(i \in [0:11]\) is contained in the chord.

For example, for the C-major chord, the resulting chroma vector is

$$\begin{aligned} t_C:= x =(1,0,0,0,1,0,0,1,0,0,0,0)^T \end{aligned}$$
(5)

The Template-based patterns for the twelve major and twelve minor chords are shown in Fig. 3.

Fig. 3. Pattern matching [7]
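As an illustration of how such binary templates can be constructed (an assumed sketch, not the system’s actual code), the C-major and C-minor patterns of Eq. (5) can be rotated through all twelve roots:

```python
# Build the 24 binary chord templates by rotating the C-major {0, 4, 7}
# and C-minor {0, 3, 7} patterns through all 12 roots.
import numpy as np

def binary_chord_templates():
    """Return a 24 x 12 matrix of templates (12 major triads, then 12 minor triads)."""
    major = np.zeros(12); major[[0, 4, 7]] = 1   # C major: C, E, G
    minor = np.zeros(12); minor[[0, 3, 7]] = 1   # C minor: C, Eb, G
    templates = [np.roll(major, shift) for shift in range(12)]
    templates += [np.roll(minor, shift) for shift in range(12)]
    return np.array(templates)

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
LABELS = NOTES + [n + "m" for n in NOTES]        # 24 chord labels matching the template rows
```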

3.3 Implementation

The following steps are performed and visualized for Template-based chord recognition (an illustrative code sketch follows the list):

  1. First, the audio recording is converted into a chromagram representation. We use the STFT variant.

  2. Second, each chromagram vector is compared with each of the 24 binary chord templates, which yields 24 similarity values per frame. These similarity values are visualized in the form of a time–chord representation.

  3. Third, for each frame, the chord label \(\lambda _n\) of the template that maximizes the similarity value over all 24 chord templates is selected. This yields our final chord recognition result, which is shown in the form of a binary time–chord representation.

  4. Fourth, the manually generated chord annotations are visualized for comparison.
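The sketch below walks through these steps end to end; it is not the authors’ implementation and assumes librosa for the STFT chromagram, cosine similarity for the measure s, and the binary_chord_templates helper and LABELS list from the earlier sketch.

```python
# End-to-end template-based chord recognition (illustrative).
import numpy as np
import librosa

def template_chord_recognition(audio_path):
    y, sr = librosa.load(audio_path)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # shape: 12 x N frames
    templates = binary_chord_templates()                      # 24 x 12, from the sketch above
    # Normalize columns/rows so the inner product becomes a cosine similarity.
    chroma = chroma / (np.linalg.norm(chroma, axis=0, keepdims=True) + 1e-9)
    templates = templates / np.linalg.norm(templates, axis=1, keepdims=True)
    similarity = templates @ chroma                           # 24 x N time-chord representation
    best = similarity.argmax(axis=0)                          # most similar template per frame
    return [LABELS[i] for i in best]                          # one chord label per frame
```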

Figure 4 shows the Template-based chord recognition results.

Fig. 4. Template-based chord recognition results

3.4 Hidden Markov Model (HMM)

“A Markov chain (MC) is useful when we need to compute a probability for a sequence of observable events. In many cases, however, the events we are interested in are hidden: we don’t observe them directly. For example we don’t normally observe the chord labels in a music audio signal” [2]. Rather, we observe the audio signal and must infer the chords from it. The sequence is called hidden because its elements are not directly observed.

The HMM keeps the same framework as the MC while allowing us to attach richer observations to the hidden states. In this paper, we cover the main intuitions of HMMs for chord detection.

The main question that an HMM can answer for us is:

What is the most probable sequence of chords for a given sequence of observations?

To answer this question, we need a few elements from the MC and some new ones (their shapes are sketched after the list):

  • Chord Transition Probability Matrix: analogous to the note transition probabilities of the MC, but with chord-to-chord transition probabilities instead.

  • Emission probabilities: the probability of an observation belonging to each one of the chords, \(P(observation \mid chord) \).

  • Initial State Probability Matrix: indicates the probability that a sequence begins with a specific chord.
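For concreteness, a minimal sketch of the shapes of these ingredients for the 24-chord case (the uniform and identity initializations are illustrative placeholders only):

```python
# Shapes of the HMM ingredients for 24 chord states (12 major + 12 minor).
import numpy as np

n_states = 24
transmat = np.full((n_states, n_states), 1.0 / n_states)   # chord-to-chord transition matrix
startprob = np.full(n_states, 1.0 / n_states)               # initial state probabilities
means = np.zeros((n_states, 12))                            # mean chroma energy per chord (emission, part 1)
covars = np.tile(np.eye(12), (n_states, 1, 1))              # chroma covariance per chord (emission, part 2)
```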

3.5 Annotation of Music Data

In order to be able to generate the probabilities above, we need:

  1. The music files, in order to extract the chromagrams

  2. An annotated dataset, so we can join the chord labels with the corresponding windowed chromagrams (Fig. 5).

Fig. 5. HMM annotated dataset

3.6 Calculate Framed Chromagram

Music data annotations provide the time period during which each chord was played in a particular piece of music. The idea is to create a definition of “what is a C Major chord in a chromagram” so we can create the emission probabilities matrix.

Before merging our chromagram with the annotation files, we need to know how much time each chromagram window spans. To calculate the framed chromagram, we pass the windowed chromagram, the signal, and its sampling frequency to the model, so that the duration of each window in seconds can be determined.
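A minimal sketch of this alignment follows, assuming a fixed STFT hop length; the hop_length value and the annotation format are assumptions rather than the authors’ exact parameters.

```python
# Map chromagram windows to times and look up the annotated chord for each window.
def window_times(n_windows, sr, hop_length=512):
    """Return the start time (in seconds) of each chromagram window."""
    return [i * hop_length / sr for i in range(n_windows)]

def label_for_window(start_time, annotations):
    """annotations: list of (onset, offset, chord) tuples from the annotated dataset."""
    for onset, offset, chord in annotations:
        if onset <= start_time < offset:
            return chord
    return "N"   # no chord / silence
```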

3.7 Calculate State Transition Probability Matrix

To calculate the state transition probability matrix, the model runs through all chords in the dataframe and counts all of the observed chord-to-chord transitions. Finally, to turn the counts into probabilities, we normalize each row so that the probabilities of going from one chord to all others sum to 1. A representation of the state transition probability matrix is shown in Fig. 6.
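A possible implementation of this counting and normalization step (a sketch, assuming chord_sequence holds one label per chromagram window):

```python
# Count chord-to-chord transitions and normalize each row to sum to 1.
import numpy as np

def transition_matrix(chord_sequence, labels):
    index = {c: i for i, c in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for prev, curr in zip(chord_sequence[:-1], chord_sequence[1:]):
        counts[index[prev], index[curr]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)   # rows sum to 1 (or stay 0)
```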

Fig. 6. A representation of the state transition probability matrix

3.8 Calculate Emission Probability Matrix

We estimate the emission distribution of each chord from the mean energy in the labeled chromagram frames. For an HMM with M states, the mean array has shape [M, 12], where 12 corresponds to the 12 chroma notes; it stores the average energy of each note for each chord. For each chord, we also compute a covariance matrix of shape [12, 12], which tells the HMM how the notes vary together within that chord. For example, in a C major chord, when the C note is high, we expect the E and G notes to be high as well. The state covariance array therefore has shape [M, 12, 12]. The model iterates over every chord and its associated chromagram frames to calculate the mean energy and the covariance.
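A sketch of how these means and covariances could be computed (the small diagonal term added for numerical stability is our assumption, not part of the original description):

```python
# Per-chord mean chroma vector ([M, 12]) and covariance ([M, 12, 12]).
import numpy as np

def emission_parameters(chroma_frames, chord_sequence, labels):
    """chroma_frames: array of shape [n_frames, 12]; chord_sequence: one label per frame."""
    chroma_frames = np.asarray(chroma_frames)
    means = np.zeros((len(labels), 12))
    covars = np.zeros((len(labels), 12, 12))
    for i, chord in enumerate(labels):
        frames = chroma_frames[[c == chord for c in chord_sequence]]
        if len(frames) > 1:
            means[i] = frames.mean(axis=0)
            covars[i] = np.cov(frames, rowvar=False) + 1e-6 * np.eye(12)  # keep positive definite
    return means, covars
```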

3.9 Calculate Initial State Probability Matrix (ISPM)

Initial state probabilities are needed because, at the beginning of decoding, the model does not know the history of chords played before. Therefore, an initial state probability matrix starts the estimation process with an estimate of the opening chord. For every subsequent step of the algorithm, we calculate \(P(chord_i \mid chord_{i-1})\), i.e., the probability of observing a chord in a window given the chord at the previous window.

In this case, to calculate our ISPM, we ran through all of our annotated audio files, counted all of the initial chords, i.e., the first chord of each piece excluding silence, and then normalized the counts so that the ISPM sums to 1.
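A corresponding sketch (assuming the per-file first chords have already been collected from the annotations):

```python
# Count the first non-silence chord of every annotated file and normalize to sum to 1.
import numpy as np

def initial_state_probabilities(first_chords, labels):
    """first_chords: the first non-silence chord of every annotated file."""
    counts = np.zeros(len(labels))
    index = {c: i for i, c in enumerate(labels)}
    for chord in first_chords:
        counts[index[chord]] += 1
    return counts / counts.sum()
```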

3.10 Implementation

To implement the HMM, we used the hmmlearn Python package. This package abstracts away much of the underlying mathematics, leaving an interface similar to scikit-learn, where we build a model and then predict over new observations (a sketch follows the list below).

  • Because we are working with continuous emission probabilities, we first build an hmm.GaussianHMM and pass the number of states. Setting covariance_type = “full” tells the HMM that the notes can be related, i.e., the amount of energy in one note is not independent of the others. Other HMM variants support different types of emission probabilities.

  • We set the Initial State Probability Matrix.

  • Then we set the Transition Probability Matrix.

  • The means and covariances provide the two parts of our emission probability, as explained in Sect. 3.8.
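A sketch of this setup follows; it assumes the matrices produced by the sketches in Sects. 3.7–3.9 and the LABELS list from Sect. 3.2, and exact attribute handling may vary across hmmlearn versions.

```python
# Build a GaussianHMM from precomputed parameters and decode a chromagram.
import numpy as np
from hmmlearn import hmm

model = hmm.GaussianHMM(n_components=24, covariance_type="full")
model.startprob_ = startprob          # Initial State Probability Matrix (Sect. 3.9)
model.transmat_ = transmat            # Transition Probability Matrix (Sect. 3.7)
model.means_ = means                  # mean chroma energy per chord (Sect. 3.8)
model.covars_ = covars                # full covariance per chord (Sect. 3.8)

# chroma_frames has shape [n_frames, 12]; predict() returns the most probable
# chord index per frame via the Viterbi algorithm.
chord_indices = model.predict(chroma_frames)
predicted_chords = [LABELS[i] for i in chord_indices]
```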

Figure 7 illustrates the result of the HMM model tested on the first 10 s of the song “Let It Be”.

Fig. 7. The results of the HMM model

4 Proposed Framework of MOOCs for Music Assessment

It is essential to have automated performance assessment for large class sizes, providing quantitative metrics and real-time feedback to each music learner. The real-time assessment report is similar to the feedback provided by teachers in face-to-face classes. Providing this feedback involves extracting features from the student’s audio and video performance to compute standard performance measures and quantitative metrics. Such methods are particularly practical in music lessons, since younger generations learn faster and more efficiently when the report is instant and intuitive.

Since spectral modulation features via the modulation spectrum provide a simple visualization of rhythmic structures in music, these features can provide instant feedback for learning the rhythm of various simultaneous parts in a musical piece. In the next subsection, we focus on these features as an example to provide instant feedback for learning piano.

4.1 Modulation Spectral Features for Rhythmic Structures

Spectral modulation features from the modulation spectrum can be practical in audio data mining tasks. In music technology, spectral modulation can support the classification and visualization of long-term and short-term rhythmic structures in music, such as tempo and repeating patterns [10].
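As an illustrative (assumed) formulation, a simple modulation spectrum can be obtained by taking an FFT over time of each mel-band envelope of the spectrogram, which exposes periodicities such as beat rate and repeating rhythmic patterns:

```python
# Illustrative modulation spectrum: FFT along the time axis of mel-band envelopes.
import numpy as np
import librosa

def modulation_spectrum(audio_path, n_mels=40):
    y, sr = librosa.load(audio_path)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # n_mels x frames
    envelope = np.log1p(mel)                                         # compress dynamics
    mod = np.abs(np.fft.rfft(envelope, axis=1))                      # FFT along time axis
    return mod   # n_mels x modulation-frequency bins
```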

To demonstrate the potential of automatic assessment tools for this task, we plan to implement a benchmark system that uses a well-known approach: assigning performance grades by mapping note-level deviations computed from aligned transcriptions of the performance and the reference. Our benchmark system will be tested through a case study in Summer 2021 in a real-life scenario. The results of the case study will be presented at the conference.

5 Discussions and Future Work

Massive Open Online Courses (MOOCs) highlight a set of practical and thoughtful challenges that do not generally arise in smaller class sizes. This paper addressed two primary aspects of online music learning where machine learning techniques can be applied to enhance online music lessons for diverse learning styles and a variety of audiences.

We demonstrated our proposal for the automated development of lesson plans that can train music lovers to play their favorite instruments while automatically accommodating unique learning styles, varied musical backgrounds, and skill levels.

Furthermore, this paper proposed a quantitative method for assessing a student’s progress in learning how to play a particular musical instrument. The solutions proposed in this paper can be useful both for individual learners and for instructors who need to assess the performance quality of a massive number of students. “Previous research shows that interactive learning environments can also significantly contribute towards a student’s interest, motivation, and discipline, and thereby enhance the commitment to learning” [1, 3, 4].

Future work includes the implementation of the proposed methods and improving the chord detection methods presented in this paper. In addition, accessibility methods for the online music teaching platform will be researched and implemented to support persons with disabilities, such as visually impaired students.