1 Introduction

The aim of this paper is to build a music mixing system capable of automatically mixing separate raw recordings with good quality regardless of the music genre. The realm of modern music and sound production is complex and diverse. To achieve a single end product, i.e., music that reaches the world, takes effort, immense commitment, and the combined creative talents of all kinds of experts. The music world comprises artists, engineers, producers, managers, executives, manufacturers, and marketing strategists, all of whom are experts in their fields, such as music, recording, acoustics, production, electronics, law, media, marketing, graphics, and sales. Collaboration within the music production process makes it possible to transform creativity into a product that the end-user can enjoy [1,2,3]. The underlying drive of the teams involved in recorded sound practice concerns cultural tastes, the art of music, and the ever-present changes and challenges in production technology and industry [4].

The production of a musical piece can be divided into the following steps: composition, recording, editing (sometimes done just after recording or during the mixing stage), mixing, and mastering. The composition step can take many forms: creating a song in MIDI (Musical Instrument Digital Interface) in a DAW (Digital Audio Workstation), writing the composition down on a five-line staff, or simply keeping the piece in the songwriter’s head. The recording step can also vary. Nowadays, it is rare to rent a big studio with an engineer and a producer; more commonly, artists record their songs track by track in a home studio. Regardless of how a song is produced, the result is a recorded song in which each instrument occupies a separate mono track or, in some cases, a multichannel one. After an artist decides to record a musical piece or song, the sound engineer uses appropriate microphones, records the multitrack material, and edits it. The mixer’s role is to set proper proportions between the signal elements and adapt their time- and frequency-based properties [5]. A well-executed mix can emphasize the artistic character of a song or even define its music genre [6,7,8]. Mixing originally consisted of physically adjusting the instrument and microphone setup. When multitrack recording became possible, the mixing process was performed using analog hardware and, later, digital tools.

Regardless of the approach chosen, mixing is used to shape the character, tone, and intention of the production in relation to the following aspects [4, 9] (a minimal code sketch of the first two operations follows the list):

  • Relative level between tracks (how loud tracks are relative to one another [2, 10]).

  • Spatial processing or panning (placement of the sound within the stereo or surround field).

  • Equalization (altering the relative frequency balance of a track).

  • Dynamics processing (adjusting the dynamic range, i.e., the level difference between the softest and loudest passages, expressed in decibels [11], of a single track, a group, or an output bus to optimize levels or to keep it from standing out within the mix for the duration of the entire song).

  • Effects processing (adding delay-, pitch-, or reverb-related effects to a mix to alter or augment the piece in an attractive, natural, or unnatural way [12,13,14]). It should be noted that audio effects, sound effects, and sound transformation, terms that are used interchangeably [12], are understood as signal processing functions that change, modify, or augment an audio signal [10, 11].
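To make the first two operations above concrete, the following minimal NumPy sketch applies a decibel gain and constant-power panning to a mono track. It is an illustrative example only; the signal and parameter values are hypothetical and not part of the system described later in the paper.

```python
import numpy as np

def gain_db(x: np.ndarray, db: float) -> np.ndarray:
    """Scale a signal by a gain given in decibels."""
    return x * 10.0 ** (db / 20.0)

def pan_constant_power(x: np.ndarray, pan: float) -> np.ndarray:
    """Place a mono signal in the stereo field.

    pan ranges from -1.0 (hard left) to +1.0 (hard right);
    constant-power panning keeps perceived loudness stable.
    """
    theta = (pan + 1.0) * np.pi / 4.0  # map [-1, 1] to [0, pi/2]
    left, right = np.cos(theta) * x, np.sin(theta) * x
    return np.stack([left, right], axis=-1)

# Hypothetical 1-s mono track at 44.1 kHz: 3 dB quieter, slightly left of center.
track = np.random.randn(44100).astype(np.float32)
stereo = pan_constant_power(gain_db(track, -3.0), pan=-0.3)
```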

Nowadays, fully analog studios are very rare. Analog equipment is expensive and requires special care and effort to maintain. Recalling a session for mixing is complex and requires the work of multiple people. However, the so-called “analog sound” is what every mixing engineer is looking for, regardless of their approach to mixing [15]. The second mixing approach is called hybrid mixing, where songs from the DAW are routed to an analog mixing console or single tracks are channeled to outboard hardware, e.g., a compressor, equalizer, or reverb. This approach sits precisely in-between in terms of cost versus effect [16]. The least costly and easiest method of mixing a song is the fully digital approach, called in-the-box [17]. Many renowned engineers have switched their approach to mixing from analog entirely to digital [18, 19]. The in-the-box way of mixing has various advantages: a project can be reopened with one click of the computer mouse, and free software programs emulate analog equipment ever more faithfully.

Amateur and professional mixing may differ in talent, skill, experience, music background, artistry, and knowledge. The motivation behind our study is therefore to see where an automatic mix may be positioned relative to them, i.e., whether it is closer to the amateur mix or lies between the two approaches. Moreover, we should stress that our intention is not to build an automatic mixing system that replaces music mixing per se; this should stay with a professional sound engineer. Rather, such an approach may help in gaming or branding areas, where the focus is not on music quality but on effective ways of mixing audio [20]. We therefore decided to test “automatic mixing” against human-made mixes. When referring to “automatic mixing,” we differentiate between the use of the Wave-U-Net network and mixes prepared with one of the popular plugins. On the “human” side, we decided to test “amateur” and “professional” mixes.

Therefore, two hypotheses are posed in the paper. The first asks whether it is possible to mix music consisting of separate raw recordings using a one-dimensional adaptation of the Wave-U-Net autoencoder such that the result can be evaluated objectively as similar to a human-made mix. The second is related to subjective evaluation and asks whether the prepared mixes are assessed as better than recordings created by an amateur engineer or mixes produced using state-of-the-art technology, and whether they are comparable to mixes produced by a professional mixer.

The paper is structured as follows. First, the literature background is briefly reviewed. This is followed by the methodology, focusing on data preparation, deep model training and validation, and the preparation of audio mixes. The subsequent section is devoted to the quality evaluation of audio mixes, employing objective and subjective approaches, qualitative analysis, and self-similarity matrix-based analysis. This section also contains statistical analyses and a discussion. Finally, a summary is given, along with proposed directions for further research and development of automated mixing.

2 Literature background

When searching for the term “automatic mixing,” Google delivers 249,000,000 documents/links in 0.35 s, so the relevance of this area is easily seen. It should be noted that automatic mixing is a part of intelligent music production (IMP) [21], as the latter encompasses the application of artificial intelligence to mixing and mastering. De Man and his colleagues regard automating music production as introducing intelligence into audio effects, sound engineering, and music production interfaces [10]. Classifications of IMP research may differ [17], as they may refer to the audio effect that was automated [22, 23]. Other aspects researched concern live stereo downmixing [24], selective minimization of masking [25], automatic mixing methods for live music incorporating multiple sources, loudspeakers, and room effects [26], and multitrack mixing [27, 28].

Overall, IMP deals with data collection, perceptual evaluation, systems, processes, and interfaces [10]. According to Reiss [28], de Man et al. [10], and a review paper published by Moffat and Sandler [21], IMP is still emerging and under development, even though it is not a new field, as it dates back to Dugan’s 1975 paper on a fully deterministic adaptive gain mixing system [29, 30]. It is also important to note that several trends in IMP may be observed throughout the years, as they depend on machine-learning methods and music resources (e.g., [31, 32]), starting with baseline algorithms and knowledge-based approaches [21, 30, 33, 34] and ending with deep models [35,36,37,38,39].

Undoubtedly, the references included do not exhaust the literature related to automatic music mixing; thus, the reader is referred to the list of pioneers in automatic mixing provided by de Man and collaborators [17]. Also, Moffat and Sandler [21] and de Man et al. [10] give insight into intelligent music production and its history in general.

To date, automatic mixing has mainly addressed more manageable tasks, such as setting the maximum level of a microphone in a live situation in a way that prevents system feedback or loudspeaker distortion. Other manageable tasks include the automatic mixing of audio elements in cases where artistic quality is not the most crucial aspect, e.g., in video games [40] or audio/music branding (for instance, in stores), where songs are automatically mixed one after the other [20, 41,42,43,44,45,46]. In the latter case, the mixing happens not in the context of multiple tracks within one piece but across an entire music program, where the previous song is smoothly mixed into the following one. Examples of such work are described in several papers [47,48,49,50].

At the same time, the productization of technology and user-friendly interfaces drive its growth and allow for more advanced automatic sound manipulation. Martinez-Ramirez and his co-authors [36] provided a very useful definition of audio effect units: analog or digital signal processing systems that transform specific sound source characteristics. These transformations can be linear or nonlinear, time-invariant or time-varying, and with short-term or long-term memory. The most typical audio effect transformations are based on dynamics, such as compression; tone, such as distortion; frequency, such as equalization; and time, such as artificial reverberation or modulation-based audio effects [36].

Plugins available on the market are digital audio processors that can not only serve as digital equivalents (simulations) of analog devices but can also exceed traditional boundaries. One plugin can substitute for several or even a dozen analog devices. Moreover, modern plugins actively help the user execute tasks that would otherwise be unachievable, e.g., treating one signal with 28 different filters. Some plugins offer genre or instrument detection and either an entire automatic mixing routine or a part of it (i.e., balance, equalization, or compression only) [51].

In contrast, knowledge-based audio mixing can be described as a departure from the standard automation methods [21, 30, 33, 34, 52]. Still, many of these methods, except for specific ones, e.g., those involving certain data augmentation procedures [37, 38], use large databases to train machine-learning algorithms and models. In the process, multiple parameters, e.g., level, panning, and equalization, are changed at the same time. The most commonly used databases are the Open Multitrack Testbed [53] and MUSDB18 [54]. The methods found in the literature use expert-based knowledge during training or when creating a specific model or application. Examples of such work are described in several projects [55,56,57,58,59].

As de Man stated in his work [17], mixing is a multidimensional process. Engineers must decide whether a source is too loud or too quiet, whether the frequency range is set correctly, whether the panning of an instrument complements the whole mix, whether the reverb is set correctly, etc. That said, the various types of processing cannot be done separately; instead, this challenge should be treated as an all-in-one task, since isolating one problem will lead to another unresolved issue. As shown by state-of-the-art research in music production, many tasks in mixing, mastering, and beyond are approached using machine learning [49, 60,61,62,63]. Deep learning, too, has gained much acceptance in recent years [63, 64].

3 Methodology

3.1 Data preparation

To properly train a neural network, an adequate database is needed. The data should be structured, sufficiently diverse, and large enough. Databases for tasks in the speech domain, such as speech denoising or speech arrival direction detection, are commonly available. There are, however, very few databases that can be used for mixing purposes. Based on MUSDB18-HQ [54], the most suitable database available, a new dataset was built by the authors, supplemented with individual tracks from the Cambridge database [65] and expanded with additional songs recorded by one of the authors. The dataset had to be prepared in a particular way to be useful for model training and validation. The stems contained in MUSDB18-HQ are wet, and the mixture is the summation of the stems. However, since the song-mixing process is more about altering individual tracks than stems, it was decided that this database would be sufficient for this study’s purpose. Moreover, the Cambridge database provides both individual tracks and finished mixes. Both the instrument-to-stem models and the stem-to-mix model were trained on a combination of the Cambridge and MUSDB18-HQ data.

The MUSDB18-HQ database [54] and five songs recorded by the authors were used to train the deep model. The database consists of 150 songs (approximately 10 h in total) belonging to various genres; one hundred songs were used for training and 50 as a validation set. It contains drum, bass, vocals, and other-instrument stems along with finished mixes (summations of the stems). Part of the database originates from the Cambridge database [65], which means that the corresponding individual tracks had to be taken from the Cambridge database and appropriately matched.

As already mentioned, five songs recorded and mixed by the authors were added to the training database. All five songs were recorded in the Auditorium of the Faculty of Electronics, Telecommunications and Informatics at the Gdansk University of Technology and in a home studio. The songs consist of drums, bass, guitars, and vocals, and their music genre can be classified as rock.

Due to the nature of the system’s architecture, based on Wave-U-Net autoencoders, it was decided to use a fixed number of inputs and outputs for each model; these numbers are presented in Table 1. In cases where the number of signals was larger than the assumed number of inputs, a premix was conducted. The premixing process consisted only of adding signals together; no changes were applied to their levels, absolute or relative, and no effects (such as equalization, compression, or reverb) were added. In cases where there were too few original signals (for example, only two signals for bass), empty tracks were created to meet the required number of inputs (see the sketch after Table 1). No other processing was applied to the individual signals (tracks).

Table 1 Models and number of inputs and outputs
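A minimal NumPy sketch of this input-conditioning step is given below. How surplus tracks are grouped before summation is not specified in the text, so summing the extras into the last input slot is an assumption made here for illustration.

```python
import numpy as np

def fit_to_inputs(tracks: list[np.ndarray], n_inputs: int) -> np.ndarray:
    """Force a variable number of equally long mono tracks into a fixed
    number of model inputs, as in the data preparation described above.

    - Too many tracks: surplus tracks are premixed by plain summation
      (no gain changes, no effects); summing them into the last input
      slot is an assumption made for this sketch.
    - Too few tracks: silent (all-zero) tracks are appended.
    """
    length = tracks[0].shape[0]
    assert all(t.shape[0] == length for t in tracks), "tracks must be aligned"

    if len(tracks) > n_inputs:
        premix = np.sum(tracks[n_inputs - 1:], axis=0)  # plain summation
        tracks = tracks[: n_inputs - 1] + [premix]
    elif len(tracks) < n_inputs:
        tracks += [np.zeros(length, dtype=tracks[0].dtype)
                   for _ in range(n_inputs - len(tracks))]
    return np.stack(tracks)  # shape: (n_inputs, length)
```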

3.2 Deep model training and validation

The system consists of five deep models. The models were trained separately and then connected into one system. The models differ in the number of inputs and outputs (mono/stereo). The system was created from variants of the Wave-U-Net network, as suggested by other authors [64, 66]; the depth of the network models was the same as that used by Martinez-Ramirez et al. [64]. The change introduced to the original Wave-U-Net enables the network to work on stereo signals. A single model was trained on a network with the following parameters:

  • U-net layers: 10

  • Filter size of convolutions in the downsampling blocks: 15

  • Upsampling: linear

  • Type of output layer: linear without activation

  • Learning rate: 1e−4

  • Augmentation: false

  • Batch size: 16

  • Number of update steps per epoch: 200

  • Optimizer: Adam

Each model takes raw (unprocessed) audio as input and output, passing it through a series of downsampling and upsampling blocks that contain 1D convolution layers; each model is used separately. The models also include resampling operations, which allow the calculation of features used in the prediction process. A block diagram of the system is presented in Fig. 1, and a compact code sketch of the model follows the figure.

Fig. 1
figure 1

Block diagram of an automatic audio mixing system
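The following is a compact PyTorch sketch of a Wave-U-Net-style model consistent with the parameters listed above (1D convolutions with filter size 15, linear upsampling, and a linear output layer without activation). It is a simplified mono-in/mono-out rendition with a reduced depth for brevity, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveUNet1D(nn.Module):
    """Simplified Wave-U-Net: 1D conv encoder/decoder with skip connections."""

    def __init__(self, in_ch=1, out_ch=1, depth=4, width=24, kernel=15):
        super().__init__()
        self.down = nn.ModuleList()
        ch = in_ch
        for i in range(depth):
            self.down.append(nn.Conv1d(ch, width * (i + 1), kernel, padding=kernel // 2))
            ch = width * (i + 1)
        self.bottleneck = nn.Conv1d(ch, ch, kernel, padding=kernel // 2)
        self.up = nn.ModuleList()
        for i in reversed(range(depth)):
            skip_ch = width * (i + 1)
            self.up.append(nn.Conv1d(ch + skip_ch, skip_ch, 5, padding=2))
            ch = skip_ch
        self.out = nn.Conv1d(ch + in_ch, out_ch, 1)  # linear output, no activation

    def forward(self, x):
        inp, skips = x, []
        for conv in self.down:
            x = F.leaky_relu(conv(x))
            skips.append(x)
            x = x[:, :, ::2]                         # decimate by 2 (downsampling)
        x = F.leaky_relu(self.bottleneck(x))
        for conv in self.up:
            x = F.interpolate(x, size=skips[-1].shape[-1], mode="linear")  # linear upsampling
            x = F.leaky_relu(conv(torch.cat([x, skips.pop()], dim=1)))     # skip connection
        return self.out(torch.cat([x, inp], dim=1))

mix = WaveUNet1D()(torch.randn(1, 1, 16384))  # (batch, channels, samples)
```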

Each model was trained separately and then connected to create the system. The training used the L2 distance as the training loss, as previous observations of neural models have shown that this distance helps achieve better results [36, 67]. The optimizer was Adam, with a learning rate of 0.0001 and decay rates β1 = 0.9 and β2 = 0.999. Early stopping with a patience of 20 epochs was used, followed by a fine-tuning step. The batch size was 16. The model with the lowest loss on the validation subset was selected. The validation loss of the stem-to-mix model training is presented in Fig. 2; a sketch of this training loop follows the figure.

Fig. 2
figure 2

The validation loss function of stem-to-mix model training
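A hedged sketch of this training regime is given below, reusing the WaveUNet1D class from the previous listing. The batch size of 16, 200 update steps per epoch, Adam settings, L2 (MSE) loss, and early-stopping patience of 20 follow the parameters stated above; the data loader is a random stand-in for the real multitrack loader, and everything else is an illustrative assumption.

```python
import copy
import torch

def random_batches(n_steps, batch_size=16, length=16384):
    """Random stand-in for a real multitrack data loader (hypothetical)."""
    for _ in range(n_steps):
        x = torch.randn(batch_size, 1, length)  # raw-audio input excerpts
        yield x, x                              # dummy targets; real targets are stems/mixes

model = WaveUNet1D()                            # from the previous listing
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
loss_fn = torch.nn.MSELoss()                    # L2 distance used as training loss

best_loss, best_state, patience, bad_epochs = float("inf"), None, 20, 0
for epoch in range(1000):
    model.train()
    for x, y in random_batches(200):            # 200 update steps per epoch
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

    model.eval()                                # validation pass
    with torch.no_grad():
        val_loss = sum(loss_fn(model(x), y).item() for x, y in random_batches(10))

    if val_loss < best_loss:                    # keep the best-validation model
        best_loss, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs >= patience:              # early stopping, patience = 20
            break

model.load_state_dict(best_state)               # model with the lowest validation loss
```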

The models were trained on a computer equipped with an NVIDIA GeForce 1080 graphics card. Training an individual model took approximately 2 days.

3.3 Preparation of audio mixes

For testing purposes, it was decided to create four different mixes of the same song:

  • A professional mix (called “Pro”)

  • An amateur mix (called “Amateur”)

  • A mix created using state-of-the-art software (called “Izotope” [68])

  • A mix created by the trained models of the Wave-U-Net network (called “Unet”)

Clean tracks for eight songs in four music genres were chosen and acquired from the Cambridge database [65]. The list of the selected songs, including their genres and the number of tracks to be mixed, is presented in Table 2.

Table 2 List of selected songs

Because the songs belong to different music genres and the models were trained on data from various genres, the evaluation and testing may show interesting results. As an example, all 11 tracks from one selected song (“Secretariat—Over the top”) are shown as mel spectrograms in Fig. 3. The tracks in each song differ from each other in spectral content. Moreover, the selected songs differ in the number of individual tracks, and even within a particular genre, they are dissimilar both sonically and emotionally. The songs were also specifically chosen to have different tempos and instrumentation.

Fig. 3
figure 3

All 11 tracks from “Secretariat—Over the top” song in the form of mel spectrogram representation

The professional mixes (“Pro”) were made by well-known, experienced audio engineers. The mixes of the songs “Angels in Amplifiers—I’m alright,” “Georgia Wonder—Siren,” “Side Effects Project—Sing with me,” “Speak Softly—Broken man,” “The Doppler Shift—Atrophy,” and “Tom McKenzie—Directions” were created by Mike Senior, who earned a Music Degree at Cambridge University and worked as an assistant engineer in many noted recording studios, such as RG Jones, West Side, Angell Sound, and By Design. He is also the creator of the open Cambridge database. He has collaborated with many famous artists and is the author of the books “Recording Secrets for the Small Studio” and “Mixing Secrets for the Small Studio.” The mix for the song “Secretariat—Over the top” was created by Brian Garten, a known recording and mixing engineer. He has collaborated with artists such as Mariah Carey, Justin Bieber, Britney Spears, and Whitney Houston. He is a four-time Grammy nominee and won a Grammy award for Best Contemporary R&B Album for The Emancipation of Mimi in 2005. The song “Ben Carrigan—We’ll talk about it tonight” was mixed by Ben Carrigan, a songwriter, composer, and music producer from Dublin, Ireland, who graduated from a music school specializing in the jazz, classical, and pop traditions.

The “Amateur” mixes were prepared by a person with experience in music theory through education and practice as a musician. The person, however, had no previous experience in audio mixing, either professional or recreational. The mixes were created in a home studio using the Cubase 10.5 PRO software. The room in which the mixes were made was acoustically treated. The monitors used during the process were APS Klasik 2020, and the digital-to-analog converter was an Apollo Twin. The length of the mixing process varied for each song, depending on the number of tracks in a given piece and its music genre; the quickest mix took approximately 2 h and the longest approximately 6 h. In general, the more familiar the genre was to the amateur mixer, the quicker the mixing process. The lack of experience in mixing led to a relatively intuitive use of the available tools and a reliance on subjective assumptions about what a mix should sound like. The amateur was, however, free of the habits and mannerisms that a more experienced mixer would have and performed the process with no external guidance. In the “Amateur” mixes, the mixer did not exclude any raw tracks from the final mix.

To create the state-of-the-art mixes, a set of Izotope plugins from the music production suite was used. The plugins included Neutron Pro and Nectar Pro. Their automatic balance and automatic mix features make it possible to mix a song semi-automatically. First, all recordings were imported into the Cubase 10.5 PRO software, each track into a separate channel. The semi-automatic processing method with the use of the Izotope plugins can be divided into two stages:

  • Setting overall balance

  • Creating custom presets for every channel

Finally, the “Unet” mixes were created using the system presented above. Although in its final version the system can mix a song without any user intervention, these mixes were created manually, with each submodel used separately. This means that, in the first step, the drum tracks were mixed into a drum stem, the bass tracks into the bass stem, the vocal tracks into the vocal stem, and the remaining tracks into the other stem, using the appropriate models. Then, the stems were mixed together using the stem-to-mix model, according to the assumed system architecture.

After obtaining all 32 mixes, postprocessing of the acquired songs was performed. First, a 15-s clip was selected from each song (excerpt duration according to [69]) that best represents the chorus or other loudest part of the song. In other words, a fragment of the song with the most instruments was chosen for the last step of mix preparation.
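The paper does not state how the 15-s excerpt was located, so a plausible automated stand-in is sketched below: slide a 15-s window over the mix and keep the most energetic one by RMS. The hop size and the test signal are arbitrary assumptions.

```python
import numpy as np

def loudest_excerpt(x: np.ndarray, sr: int, seconds: float = 15.0, hop: float = 0.5):
    """Return the start sample and samples of the loudest window (by RMS energy).

    A heuristic stand-in for picking the chorus/loudest part of a song;
    the 0.5-s hop is an arbitrary choice.
    """
    win, step = int(seconds * sr), int(hop * sr)
    # cumulative sum of squared samples -> O(1) energy per window
    csum = np.concatenate([[0.0], np.cumsum(x.astype(np.float64) ** 2)])
    starts = np.arange(0, len(x) - win + 1, step)
    energies = csum[starts + win] - csum[starts]
    best = starts[np.argmax(energies)]
    return best, x[best : best + win]

sr = 44100
song = np.random.randn(sr * 60).astype(np.float32)  # hypothetical 60-s mono mix
start, clip = loudest_excerpt(song, sr)
```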

4 Quality evaluation of audio mixes

It should be noted that quality of experience (QoE) is related to both subjective evaluation and objective metrics [70,71,72,73,74]. The users’ experience, based on several factors such as fulfilled expectations, emotions, and preferences while interacting with technology, can be evaluated in subjective tests. In contrast, an objective investigation is both content- and context-related; so, in the absence of such a metric in the music mix quality area, several level-oriented parameters were proposed to be tested on the resulting mixes. Still, there is a need for a dedicated measure that correlates with subjective evaluation results. That is why an approach based on self-similarity matrix (SSM) analysis, which may achieve such a goal, was proposed. This is further examined in Section 4.3.

Sections 4.1 and 4.2 present the evaluation process, carried out in two ways, i.e., objectively and subjectively. First, several descriptor values related to the perceptual characteristics of each mix are calculated [75]. The selected parameters are level-oriented, as they are easy to calculate and understand. However, we do not compare these parameters between songs but rather between different mixing approaches. From an objective point of view, these parameters can be useful for determining the dynamic content of a song, even if it is distorted; this is very important when sending a prepared song to the mastering engineer. Samples subjected to objective analysis, i.e., waveform statistics based on the RMS level, integrated loudness, loudness range, and true peak level, as well as low-level MPEG-7 descriptors (odd-to-even harmonic ratio, RMS-energy envelope, and harmonic energy), were not normalized.
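For reproducibility, the level-oriented statistics can be computed with open tools, as in the sketch below: NumPy for the RMS level and the pyloudnorm package for BS.1770 integrated loudness, with the true peak approximated by 4x oversampling. This is suggested tooling rather than the toolchain used in the study; the loudness range (EBU R128 LRA) requires a separate implementation and is omitted, and the file name is hypothetical.

```python
import numpy as np
import soundfile as sf
import pyloudnorm as pyln
from scipy.signal import resample_poly

audio, rate = sf.read("mix_excerpt.wav")   # hypothetical 15-s excerpt

# RMS level in dBFS
rms_db = 20 * np.log10(np.sqrt(np.mean(audio ** 2)))

# Integrated loudness (LUFS) per ITU-R BS.1770
lufs = pyln.Meter(rate).integrated_loudness(audio)

# True peak (dBTP), approximated by 4x oversampling before taking the peak
upsampled = resample_poly(audio, up=4, down=1, axis=0)
true_peak_db = 20 * np.log10(np.max(np.abs(upsampled)))

print(f"RMS: {rms_db:.2f} dBFS | loudness: {lufs:.2f} LUFS | true peak: {true_peak_db:.2f} dBTP")
```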

In addition, a qualitative analysis took place. The test participants filled in a questionnaire form, answering several questions about their listening habits and experience. An example of the answers obtained is presented further on.

Moreover, the evaluation methodology and the results of the subjective test are shown, as such evaluation takes priority over the objective assessment results [76,77,78]. It should be noted that the listening tests were conducted on normalized samples, with the listeners rating each sample in multiple evaluation categories (balance, clarity, panning, space, and dynamics).

Statistical analysis is then performed, and the statistical significance of the achieved results is commented on. This is followed by self-similarity matrix-based analyses [79,80,81] and a discussion.

4.1 Objective quality evaluation

Unprocessed samples were used for the objective evaluation, because normalization could prevent the correct identification of accurate values for the acquired music signal samples. First, waveform-based parameters were calculated for all music excerpts: the RMS (root mean square) level (Fig. 4), integrated loudness (Fig. 5), loudness range (Fig. 6), and true peak level (Fig. 7) [82]. These parameters were judged to be important in the evaluation process.

Fig. 4
figure 4

RMS level calculated for all music pieces evaluated

Fig. 5
figure 5

Integrated loudness calculated for all music samples in LUFS (loudness unit full scale)

Fig. 6
figure 6

Loudness range calculated for all music samples in LU (loudness units)

Fig. 7
figure 7

True peak level calculated for all music samples

Further on, selected MPEG-7 low-level descriptors were calculated [83]. For this purpose, the Timbre Toolbox [84] in the MATLAB environment was used. The odd-to-even harmonic ratio, RMS-energy envelope, harmonic energy, and noisiness were calculated for each music sample. These descriptors were chosen because of their perceptual interpretation. In Fig. 8, the variation of the harmonic energy of the “Secretariat—Over the top” song, depending on the mix type, is shown.

Fig. 8
figure 8

Variation of the harmonic energy a2 depending on the mix type in the “Secretariat—Over the top” song

For each of the descriptors mentioned, an analysis was performed to determine the statistical significance of the differences between the mixes. For this purpose, a series of one-way ANOVAs [85] and the post hoc Tukey-Kramer test [86] were executed. The level of significance was assumed to be α = .05. For most calculated descriptors, i.e., the odd-to-even harmonic ratio, RMS-energy envelope, and harmonic energy, the differences between mixes are statistically significant (values highlighted in bold font in Table 3), except for the Unet-Pro pairs. Table 3 presents the results of the statistical significance analysis of the harmonic energy descriptor for the “Secretariat—Over the top” song; a reproducible sketch of this analysis follows the table.

Table 3 Statistical significance analysis results of the harmonic energy descriptor for the “Secretariat—Over the top” song
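The sketch below reproduces this kind of analysis in Python on hypothetical per-frame descriptor values; the study’s exact tooling for this step is not stated, so SciPy and statsmodels stand in here (pairwise_tukeyhsd implements Tukey’s HSD, which reduces to Tukey-Kramer for unequal group sizes).

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Hypothetical harmonic-energy descriptor values per analysis frame
groups = {
    "Amateur": rng.normal(0.40, 0.05, 100),
    "Izotope": rng.normal(0.45, 0.05, 100),
    "Unet":    rng.normal(0.55, 0.05, 100),
    "Pro":     rng.normal(0.56, 0.05, 100),
}

f_stat, p = f_oneway(*groups.values())               # one-way ANOVA, alpha = .05
print(f"ANOVA: F = {f_stat:.2f}, p = {p:.4g}")

values = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
print(pairwise_tukeyhsd(values, labels, alpha=0.05))  # post hoc pairwise comparisons
```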

Considering all the results obtained, it can be concluded that the “Unet” mixes are the closest to the “Pro” mixes, and the developed system is capable of creating a mix that can be objectively rated as professional or close to professional. Moreover, it can be concluded that the system produces mixes better than amateur mixes and better than mixes created by well-known state-of-the-art software.

4.2 Subjective quality evaluation

Before the listening test, the participants were asked to fill in a questionnaire form. There were questions concerning what they listen to, whether they are familiar with a particular music genre, and their music and mixing experience. Music genres that the participants listened to varied, but the most frequent responses were rock, alternative, hip-hop, and jazz. Listeners answered that they were familiar with genres such as rock, pop, alternative, and electronica. Eighty-five percent of the listeners were musicians, and 60% were mixing engineers. In Fig. 9, the listeners’ years of experience in music mixing are presented.

Fig. 9
figure 9

Results of the survey on how many years of experience listeners have in music mixing

After adequate postprocessing of the samples, the listeners were asked to fill in a questionnaire and give their subjective ratings for each of the 32 acquired samples (the test samples are available under the link provided in the “Availability of data and materials” section). The rating of samples was conducted in line with the rank-order methodology proposed by Zacharov and Huopaniemi in the round-robin subjective test devoted to evaluating virtual home theater systems [69], however, using a five-point scale (1 = lowest, 5 = highest). Such a subjective test can be considered MOS-like (mean opinion score). It was suitable for this particular listening test since it was easy to conduct and easy for the listeners to follow [69].

The aim of the tests was presented to potential participants before the tests took place. All persons taking part in subjective listening tests gave informed consent to participate. All participants voluntarily decided whether or not to participate in the subjective tests.

The listeners performed the listening test in the R1 laboratory (mixing room) at the Hamburg University of Applied Sciences. The participants of the subjective tests were experts in the audio mixing area; moreover, they were provided with an explanation of the term “good quality” of a mix, understood as “free of any distortions/artifacts, with properly controlled dynamics and good frequency balance” [87, 88].

The room at the university is adapted to professional listening and is equipped with multiple pairs of audio monitors; in this case, it was decided to use the “main speakers” pair, i.e., Klein+Hummel 0410. Nuendo 10 software and an Audient ASP 8024 mixing console were used for the listening session. All effects on the console were turned off, and all faders were set to the unity position. The routing of individual channels to subgroups in the middle of the console was performed on the same console. All samples were played simultaneously from the DAW, and the listeners could freely switch between the different mixes. This approach was user-friendly, since all participants were familiar with the console; moreover, when switching between mixes, the listener was not exposed to any silence in between and could easily detect all differences between samples.

The system calibration level was set to 85 dB SPL and was performed with the use of a Bruel and Kjaer precision 732A meter. For the calibration, pink noise level-matched to the listening files (i.e., normalized to the −14 LUFS level) was used. The chosen level may seem relatively high for a regular listener, but given the expert character of the testing process and the need to identify the most minute details possible, the selected level was appropriate. This loudness level is also recommended by the Audio Engineering Society [89].
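Normalization of the test samples to −14 LUFS can be reproduced with the pyloudnorm package, as sketched below (the file names are hypothetical):

```python
import soundfile as sf
import pyloudnorm as pyln

audio, rate = sf.read("mix_excerpt.wav")                # hypothetical input file
loudness = pyln.Meter(rate).integrated_loudness(audio)  # BS.1770 meter

# Apply the gain that brings the integrated loudness to -14 LUFS
normalized = pyln.normalize.loudness(audio, loudness, -14.0)
sf.write("mix_excerpt_norm.wav", normalized, rate)
```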

During the listening sessions, the expert listeners were able to switch between the different mixes in any order and marked their ratings in the questionnaire. The listeners took part in the sessions individually. The test was constructed in such a way that each person received the samples in a different order; the trial was fully randomized, so the order of the test samples could not bias a listener toward a specific answer. Every listener was familiar with operating the console and was asked whether they understood all questions included in the questionnaire. Because the audio jargon used by professional audio engineers may differ across various areas of the world, the authors included definitions next to each expression (e.g., balance).

After the subjective tests were completed, a statistical analysis of the results was performed. There were 20 participants in the tests; all of them were students of the Music Production Class and Digital Sound Masters Program at the Hamburg University of Applied Sciences. All participants confirmed that they listened to music; their listening habits, genre familiarity, and experience were as summarized in the questionnaire results above.

Statistical analyses of the data resulting from the subjective tests were performed using the IBM SPSS Statistics 25 software [90]. The software was used to calculate basic descriptive statistics, the Shapiro-Wilk test of normality, a series of one-way analyses of variance (one-way ANOVA) for dependent samples, and a linear correlation analysis using the Pearson correlation coefficient (r) [86]. The level of significance was assumed to be α = .05. Results whose significance was at the level of .05 < p < .1 were treated as statistically significant at the level of a statistical trend.

As part of the research questions, it was decided to check whether the types of mixes (“Amateur,” “Izotope,” “Unet,” and “Pro”) differ in how the respondents rated them. For this purpose, a series of one-way analyses of variance for dependent samples was conducted, and the individual mixes were compared in the following categories: overall rating, balance, clarity, panning, space, and dynamics. The outcome of the analysis is a probability called the p value. To identify homogeneous subsets of means that do not differ significantly from each other, pairwise comparisons with the Šidák correction were performed; the significance level was set at p < .05. The homogeneous subsets are denoted by different letter indexes (i.e., a, b, c). A sketch of such pairwise comparisons in Python follows.
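A Python equivalent of these pairwise comparisons (the study itself used SPSS) could run paired t-tests between mix types and apply the Šidák correction, as sketched below with hypothetical ratings:

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
# Hypothetical overall ratings (1-5) from 20 listeners per mix type
ratings = {
    "Amateur": rng.integers(1, 4, 20),
    "Izotope": rng.integers(1, 4, 20),
    "Unet":    rng.integers(3, 6, 20),
    "Pro":     rng.integers(4, 6, 20),
}

pairs = list(combinations(ratings, 2))
p_raw = [ttest_rel(ratings[a], ratings[b]).pvalue for a, b in pairs]  # paired t-tests
reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method="sidak")

for (a, b), p, sig in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, significant = {sig}")
```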

First, an analysis of the overall rating of mixes was executed (see Table 4). The result is statistically significant (highlighted in bold font), and the effect size coefficient indicates large differences. The pairwise comparisons with the Šidák correction demonstrated that the “Pro” mixes were rated the highest by the respondents, followed by “Unet.” The “Amateur” and “Izotope” mixes were rated the lowest without a significant difference in ratings.

Table 4 The overall rating of the mix as a function of the mix type (M, mean; SD, standard deviation; p, p value; F, F ratio; η2, a measure of the effect size), indicating groups forming separate homogeneous subsets (denoted by a, b, and c)

Next, the mixes were compared within the balance category. The result was statistically significant, and the effect size (η2) value signified large differences. The pairwise comparisons with the Šidák correction demonstrated that the highest-rated mixes in the balance category were the “Pro” mixes, followed by the “Unet” mixes. The lowest-rated mixes were “Amateur” and “Izotope,” without any significant differences in results between them. An analogous analysis was conducted for the clarity variable. The results show very large and statistically significant differences, and the pairwise comparisons with the Šidák correction show that the highest-rated mixes in the clarity category were the “Pro” mixes, followed by the “Unet” mixes. The lowest-rated mixes were “Amateur” and “Izotope,” without any significant differences in their results.

The next comparison of mixes was conducted within the panning category. The analysis results show very strong and statistically significant differences, and the pairwise comparisons with the Šidák correction show that the highest-rated mixes in the panning category were the “Pro” mixes, followed by the “Unet” mixes. The lowest-rated mixes were “Amateur” and “Izotope,” without significant differences in their results. Next, the mixes were compared using the space variable. The results, as in the previous analyses, proved very strong and statistically significant differences between the types of mixes. The pairwise comparisons with the Šidák correction proved the “Pro” mixes to be the highest-rated mixes in the space category, followed by “Unet”. The “Amateur” and “Izotope” mixes were rated the lowest, with no significant difference between them.

The last variable used for the comparison of mix types was dynamics. Analogously to the previous analyses, the results showed very strong and statistically significant differences. The pairwise comparisons with the Šidák correction proved the “Pro” mixes to be the highest-rated mixes in terms of dynamics, followed by “Unet”. The “Amateur” and “Izotope” mixes were rated the lowest by respondents, with no significant difference between them. All results are presented in Table 4.

The last step of the analysis examined the correlation between the respondents’ experience in mixing and their overall ratings of each mix type. For this purpose, a correlation analysis using the Pearson correlation coefficient (r) was conducted (Table 5). The analysis showed a statistically significant correlation between the number of years of experience in mixing and the ratings of the “Amateur” and “Pro” mixes, and a correlation at the level of a statistical trend for the “Unet” mixes. The negative value of the r coefficient for the correlation of experience with the ratings of the “Izotope” and “Amateur” mixes means that the more years of experience the listeners have, the lower they rate those mixes. In the case of the “Unet” and “Pro” mixes, the correlation is positive and either moderately strong or strong, which means that as the number of years of experience in mixing grows, the overall rating of those mixes increases (a minimal computation sketch follows Table 5).

Table 5 Correlation between the experience in mixing and the overall ratings of mixes
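For completeness, the correlation itself is a few lines with SciPy (the values below are hypothetical):

```python
from scipy.stats import pearsonr

years = [0, 1, 2, 3, 5, 8, 10, 12]        # hypothetical mixing experience (years)
unet_rating = [3, 3, 4, 3, 4, 5, 4, 5]    # hypothetical overall ratings of "Unet" mixes
r, p = pearsonr(years, unet_rating)       # Pearson correlation coefficient and p value
print(f"r = {r:.2f}, p = {p:.4f}")
```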

4.3 Self-similarity matrix-based analysis

After testing and analyzing the objective and subjective samples from each mix, self-similarity matrices (SSMs) based on chromagrams were constructed. In the chromagram calculation process, the entire spectrum is projected onto 12 bins [91]. The method takes into account the fact that pitch consists of two components: tone height and chroma [92, 93]. The features represent the distribution of signal energy over chroma and time. The relationship between the components can be defined by the following formula:

$$f={2}^{ch+h}$$
(1)

where ch is chroma (ch ∈ [0, 1]), f is frequency, and h denotes the pitch height that indicates the octave the pitch is in.

The chroma vector sums the spectral energy into 12 bins corresponding to the 12 semitones within an octave.

The following three-step algorithm realizes the process of SSM construction:

  • STEP 1. The feature normalization

  • STEP 2. Self-similarity calculation

  • STEP 3. Visualization of the similarity scores

The feature normalization was performed by normalization of each column of the feature matrix. The normalized values are calculated using the following formula:

$${\hat{x}}_n=\frac{x_n-{\overline{x}}_n}{SD}$$
(2)

where \({\overline{x}}_n\) and SD are the mean and standard deviation of the non-normalized features, respectively, and xn = (x1n, …, xNn) is the nth matrix column (n = 1, …, N). Each column of the normalized feature matrix \(\hat{X}\) is then compared with every other column.

For the purpose of self-similarity calculation, the dot product between the feature matrix and its transpose is calculated as follows:

$$S={\hat{X}}^T\hat{X}$$
(3)

The entries of the matrix are the similarity scores. Each pixel in the matrix obtains a grayscale value corresponding to the given similarity score; the darkest color refers to the smallest similarity. An example of a comparison between the objective and subjective analyses for “Secretariat—Over the top” is depicted in Fig. 10; a code sketch of the SSM pipeline follows the figure.

Fig. 10
figure 10

Graphical representation of the SSM of “Secretariat—Over the top” objective and subjective samples
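The pipeline of Eqs. (1)-(3) can be reproduced with librosa and NumPy, as sketched below; the audio file name is hypothetical, and librosa’s chroma_stft stands in for whichever chroma extractor was actually used.

```python
import librosa
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load("mix_excerpt.wav", sr=None, mono=True)  # hypothetical file

# 12-bin chromagram, shape (12, N) = (chroma bins, time frames)
X = librosa.feature.chroma_stft(y=y, sr=sr)

# STEP 1, Eq. (2): z-score normalize each column of the feature matrix
X_hat = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

# STEP 2, Eq. (3): self-similarity S = X_hat^T X_hat, shape (N, N)
S = X_hat.T @ X_hat

# STEP 3: visualization; darkest pixels correspond to the smallest similarity
plt.imshow(S, cmap="gray", origin="lower")
plt.xlabel("frame")
plt.ylabel("frame")
plt.title("Chromagram-based self-similarity matrix")
plt.show()
```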

Next, all matrices were compared with one another using the root mean square error (RMSE); the Structural Similarity Index (SSIM), used for measuring the similarity between images [94]; and visual information fidelity (VIF), treated as a full-reference image quality measure related to the image information extracted by the human visual system [95, 96]. The results obtained are presented in Table 6, and a sketch of the comparison follows it.

Table 6 Comparison of means for all samples
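The RMSE and SSIM between two SSM images can be computed as below; scikit-image provides SSIM, while VIF would require an additional package (e.g., sewar) and is omitted here. The two matrices are random stand-ins for the SSMs of two mixes.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(2)
ssm_a = rng.random((128, 128))                      # stand-in SSM of mix A
ssm_b = ssm_a + rng.normal(0, 0.05, ssm_a.shape)    # perturbed stand-in for mix B

rmse = np.sqrt(np.mean((ssm_a - ssm_b) ** 2))       # root mean square error
score = ssim(ssm_a, ssm_b, data_range=ssm_b.max() - ssm_b.min())
print(f"RMSE = {rmse:.4f}, SSIM = {score:.4f}")
```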

As seen in Table 6, the “Unet” mixes are the closest to the “Pro” mixes (values highlighted in bold). Both the objective and subjective samples achieve similar results.

5 Summary

The main goal of this study was to develop and test an audio mixing system that automatically creates mixes from raw audio signals in a given music genre, without user intervention, that would match professionally made mixes in quality. As part of the system concept, an architecture based on a one-dimensional Wave-U-Net autoencoder was designed. The implemented system consists of five trained models. A specially prepared MUSDB18-HQ database, enriched with individual tracks from the Cambridge database and five original compositions by the authors, was used for training.

To check the validity of the hypotheses posed, multiple experiments were conducted. The first concerned the comparison of the objective features of the obtained mixes. The developed system was expected to automatically mix the input tracks so that the output mix would be objectively better than the state-of-the-art method and comparable to (or indistinguishable from) a mix created by a professional mixing engineer. It was shown that it is possible to automatically mix input tracks provided by the user, using previously trained models, so that the final result is objectively very close to mixes prepared by a professional mixing engineer. However, this holds only for the audio descriptors examined.

All mixes created using Wave-U-Net were free of distortions and artifacts throughout the songs. The overall quality can be evaluated as good or even very good (especially when compared with the “Amateur” mixes). The trained models behave similarly across different genres; the authors did not find any major deviations in the final mixes when testing different songs.

Overall, the proposed methodology shows that audio signals can be mixed automatically with good quality. This is especially important in applications for the game development industry, where the primary effort goes into visual effects, or in custom music branding, where the focus is on combining songs so that the end of one track matches the beginning of the next. These areas are open to findings such as the automation of the audio mixing process.

With regard to objective test scores, this study proposes to use a method based on self-similarity matrices, commonly used in the analysis of music signals, to assess the quality of audio mixes. The experimental results showed that the proposed method correlates closely with the subjective and objective evaluation results and can be employed as an objective measure for assessing sound quality.

As an extension of the proposed method, it is anticipated that an additional module will be included in the system, i.e., an automatic instrument classification module at the system’s input. This way, the user would not need to manually route the appropriate tracks to the respective inputs of the system. In its current form, for the system to work correctly, the user needs to assign bass tracks to the bass model, drum tracks to the drum model, etc. Automatic instrument classification is feasible [97,98,99,100,101] and would shorten the mixing process, improving the system’s performance. It would also enhance the user experience and ease of use for beginners who are not trained sound engineers.

Another proposed direction of further research and development is an additional module that could edit individual tracks. Such a module would automatically synchronize tracks with one another (for example, in multitrack drum recordings) and automatically delete (or scale down the volume of) unwanted sounds, such as the vocalist’s breathing or accidental microphone hits between the desired signals. The module should be implemented at the system’s input so that all tracks can be edited before mixing. Currently, the user needs to synchronize all tracks and edit out unwanted or accidental sounds manually.