1 Introduction

Robotic dance is an interesting research area that attracts many researchers to work on interaction, imitation, coordination and autonomy by means of artificial intelligence and human–robot interaction technology [1,2,3]. As the fundamental element of robotic dance, a dance pose is a static body shape that expresses emotion, character, feeling, meaning and theme [4]. In the existing research, dance pose appears in several different forms, such as stopping posture [5], key-pose [4, 6], gesture [4, 7] and posture [8]. Despite these diverse forms, the underlying concept is the same, and it plays an important role in robotic dance.

Robotic dance has been classified into four categories in [2]: cooperative human–robot dance, imitation of human dance motions, synchronization with music, and creation of robotic choreography. Among these, only robotic choreography creation carries aesthetic requirements in accordance with human aesthetics. Driven by different research goals, some researchers have explored the aesthetic problem in robotic choreography creation. So far, the explored aesthetic problem involves three kinds of aesthetic objects: robotic dance pose [8,9,10,11], robotic dance motion [8,9,10, 12, 13], and robotic dance [11, 14,15,16,17,18].

As a robotic dance is a sequence of dance poses, its aesthetic requirement (accordance with human aesthetics) naturally decomposes into requirements on these dance poses. Therefore, it is meaningful to explore the aesthetics of robotic dance poses. If a robot could perceive the aesthetics of its own dance poses, it could behave more autonomously in robotic choreography creation and promote human–robot interaction. Existing research involves only two aesthetic methods for robotic dance poses: human subjective aesthetics [8,9,10] and the machine learning based method [11]. For the former, although more accurate aesthetic evaluation results are obtained, the human–robot interaction imposes a heavy burden on people. For the latter, although people do not need to participate heavily in human–robot interaction, it is difficult to build a suitable machine aesthetic model that achieves accurate aesthetic evaluation.

However, so far, the aesthetic method of the robotic dance pose, which draws lessons from the mature aesthetic experience of human beings, has been rarely studied.

A human dancer always realizes the aesthetics of his/her dance pose by integrating multimodal information. For instance, after striking a dance pose before a mirror, a human dancer can clearly observe both the mirror image of the pose and his/her own body kinestate. Combining these two kinds of information, the dancer can make a comprehensive aesthetic judgment on the pose. Inspired by this, a humanoid robot should use a similar mechanism to evaluate the aesthetics of its own dance poses automatically. However, the following main questions remain: (1) How can a robot integrate multimodal information from the visual and non-visual channels to make a comprehensive aesthetic judgment of its own dance poses? (2) How can a robot fuse multiple features to understand its own dance poses more completely? (3) Which method achieves more accurate results in the above aesthetic judgment of robotic dance poses?

Inspired by the corresponding human aesthetic mechanism, and to develop the autonomous and humanoid behavior of robots, we propose a new mechanism of automatic machine aesthetics for robotic dance poses based on multimodal information fusion. More concretely, to analyze robotic dance poses, an automatic image processing method is designed, which extracts useful shape features (eccentricity, density, rectangularity, aspect ratio, Hu-moment Invariants, and complex coordinate based Fourier descriptors). To portray a robotic dance pose more completely, the shape features are combined with joint features to form mixed features. Then, ten machine learning methods are applied to achieve automatic aesthetic judgment of robotic dance poses.

The main contributions of this paper are as follows:

  • From the perspective of self-aesthetic understanding of dance pose, this paper explores a feasible way to develop the robot’s autonomous intelligence by imitating human dance behavior.

  • By fusing multimodal information (visual and non-visual), this paper proposes a novel automatic approach to estimate the aesthetics of robotic dance poses. The approach improves the autonomy and cognitive ability of the robot to a certain extent.

  • The mixed features (joint + shape), proposed by this paper, can characterize a robotic dance pose well. Moreover, based on the mixed features, ADTree has been verified as an effective machine learning method to achieve more accurate aesthetics evaluations of robotic dance poses.

The rest of the paper is organised as follows. Section 2 outlines the current works that are related to this research. A detailed explanation of the whole mechanism is presented in Sect. 3, including five parts: the whole framework, pre-processing, feature extraction, feature fusion, and machine learning. Section 4 describes the complete experimental process, and shows the experimental results via simulation. Our mechanism is further explained from four aspects in Sect. 5, based on the simulation experimental results. Finally, a brief conclusion and future work are presented in Sect. 6.

2 Related Work

As mentioned in the previous section, aesthetic requirements arise mainly in robotic choreography creation, in which a humanoid robot serves as the carrier. The existing research in this area can be divided into three aspects: robotic dance pose, robotic dance motion and robotic dance, listed in the "Aesthetics Object" column of Table 1. Vircikova and Sincak [8,9,10] constructed a multi-robot system and designed robotic choreography by using interactive evolutionary computation (IEC). They employed human subjective aesthetic evaluation of robotic dance poses and robotic dance motions.

Table 1 Aesthetics in robotic choreography creation

To find good robotic dance poses that accord with human aesthetics, we previously presented semi-interactive evolutionary computation (SIEC), a population-based search algorithm [11]. Machine learning, an important supervised learning stage embedded in SIEC, trained a robot to accomplish the aesthetic evaluation of dance poses autonomously, thereby giving the robot a degree of human-like aesthetic ability [11]. Moreover, according to the quality evaluation index and the three features of good robotic choreography, several robotic dances built from those good dance poses were evaluated aesthetically, and the results were acceptable.

Eaton proposed a synthesis approach to create robotic dance choreography based on traditional evolutionary computation (TEC) and built an aesthetic fitness function on robotic dance motions [12]. The fitness function is the sum of all movement values over all of the joints multiplied by the time that the robot remained standing [12]. It assesses the quality of dance movements and is referred to as a "performance competence evaluation measure" [19]. Furthermore, Shinozaki et al. [13] designed a robot dance system for investigating the role of robots in entertainment. The system used each Hip-Hop robotic motion as a dance unit, and several dance units were concatenated randomly to form a choreography. Then, human subjective aesthetic evaluations were conducted on the robotic dance motions; the evaluation items included dynamic, exciting, wonder and smooth, etc.

Oliveira et al. [14] constructed a choreography framework in which a Lego NXT robot could perform its dance motions in response to multimodal event inputs. Moreover, an empirical evaluation was made of the robotic dance, and each evaluator was required to complete a Likert-scale questionnaire for the aesthetic evaluation. The evaluation indexes included the robot's musical synchrony, its variety of movements, its human characterization, and the flexibility of the user control over the system, etc. [14].

Manfrè et al. [15] proposed an automatic system for robotic dance creation based on a hidden Markov model (HMM). They chose suitable robotic dance motions to form a robotic dance according to the perceived musical rhythm. By calculating the loudness per beat of the input music signal, a sequence of music classes was generated and regarded as the HMM's observation sequence. Moreover, each robotic dance motion was regarded as a hidden state of the HMM, and the Viterbi algorithm was introduced to find the optimal sequence of robotic dance motions for the sequence of music classes. Finally, the created robotic dances were evaluated by three professional dancers, and the aesthetic impact of the whole sequence of robotic dance motions had a mean value of 6.33 on a 1–10 scale (10 best).

By integrating this automatic system into the cognitive architecture of a humanoid robot dancer, Augello et al. [16] explored live performances based on human–robot interaction, in which creative dance motions were generated to form an improvisational robotic dance. After each performance, the spectators were asked to fill out a questionnaire evaluating the performance. The aesthetic evaluation indexes covered four aspects: originality of the choreography, naturalness of the robot–dancers interaction, timing and movements of the robot, and evaluation of the overall performance [16].

In the same way, the automatic system in [15] was integrated into a computational creativity framework to drive robotic dance creation [17]. More specifically, Manfrè et al. [17] presented a learning-from-demonstration method in which a Nao robot learned dance motions from human demonstration, and this set of dance motions became the basis of robotic dance creation. Furthermore, the aesthetic evaluation of the robotic dance, given by the audience, involved three aspects: timing and movements, dance naturalness, and overall artistic value. The results demonstrated that the robotic dance performance depended on the set of dance motions learned from human demonstration.

Qin et al. [18] proposed a humanoid robot dance system driven by musical structures and emotions. In their system, phrases were regarded as the basic structural unit of music and dance, and a piece of music was converted into an emotion sequence by an emotion recognition algorithm of their own design. Based on this emotion sequence, a hidden Markov model (HMM) was used to search for a matching dance-phrase sequence in a predesigned action library. Additionally, a chance method was adopted as a choreography guide. Ten dance students and ten non-dance students (aesthetic evaluators) were invited to evaluate the creations of the robot dance system by questionnaire. All twenty concluded that the robot did a good job of dancing to the music [18].

Furthermore, based on feature perception of visual images, Tutsoy et al. constructed rule-based classifiers to recognize facial characteristics [20] and facial emotion [21]. The perceived visual features (facial distance measurements/facial muscle movements) were evaluated with physiognomy, and the results showed that the rule-based classifiers performed well; thus, a machine or humanoid robot is given the aesthetic cognitive ability to understand human faces. In addition, to imitate human daily behaviors, Gongor et al. [22, 23] presented a sit-to-stand (STS) motion algorithm for humanoid robots. Based on calculations over kinematic parameters (joint angle states), the algorithm drove a Nao humanoid robot to achieve autonomous human-like motions.

3 Automatic Machine Aesthetics of Robotic Dance Pose

This section describes the mechanism of automatic machine aesthetics of robotic dance poses based on multimodal information fusion, which contains five parts: the whole framework, pre-processing, feature extraction, feature fusion, and machine learning.

3.1 The Whole Framework

Aesthetic ability is an advanced cognitive function of human beings. In the human aesthetic mechanism for dance poses, a mirror acts as an important tool that helps dancers observe the visual effect of their poses. Moreover, dancers simultaneously perceive the movement status of their body parts. By combining these two kinds of information, they can make a comprehensive aesthetic judgment on their own dance poses.

Similarly, a humanoid robot could use such a mechanism to achieve automatic estimation of aesthetics on its own dance poses. More specifically, a humanoid robot uses its “eyes” (visual cameras) to observe its own dance poses in a mirror, and feels its internal kinestate (motor parameters) using its embedded sensors. By combining these two kinds of information, the humanoid robot could make a comprehensive aesthetic judgment on its own dance poses. Therefore, this paper proposes a mechanism of automatic machine aesthetics of robotic dance poses based on multimodal information fusion. Figure 1 shows the whole framework of the mechanism.

Fig. 1

The proposed framework

When its dance pose is presented before a mirror, the humanoid robot reads its joint motor status from its embedded sensors (encoders, accelerometers and gyroscopes) and extracts the corresponding joint features. Meanwhile, it captures the mirror images of its dance pose with its cameras and pre-processes these images in three stages (automatic target location, target segmentation, and shape extraction). Then six kinds of shape features (eccentricity, density, rectangularity, aspect ratio, Hu-moment Invariants, and complex coordinate based Fourier descriptors) are extracted.

Thus, each robotic dance pose is described collectively by the joint features and shape features (mixed features). A human dance expert gives an aesthetic evaluation of each observed robotic dance pose. The aesthetic evaluation (label) and the fused features (instance) together form an example of the robotic dance pose. When enough examples have been acquired, the machine learning phase starts and a machine aesthetics model of robotic dance poses is trained. Finally, the trained model performs automatic aesthetic judgment on a new robotic dance pose whenever the humanoid robot presents one before a mirror.

3.2 Pre-processing

As shown in Fig. 1, the pre-processing stage includes two processes, kinematic parameter pre-processing and image pre-processing, which come from two different information channels. Notably, the proposed mechanism uses the dance formalization of the humanoid robot (HRDF) [11] as its basis. The humanoid robot has unique colour blocks on the important parts of its body (such as head, shoulder, hand, foot, leg, etc.), and these colour blocks differ from the robot's embodied environment. Compared with image pre-processing, kinematic parameter pre-processing is simpler and more convenient. In this process, the humanoid robot reads the joint motor status (Vi) of each joint (Ji) of its whole body, and the presented robotic dance pose is expressed as a vector (V1, V2, …, VS), which is the original kinematic parameter data of the robotic dance pose.

Image pre-processing is divided into three phases: (1) automatic target location; (2) target segmentation; (3) shape extraction. The automatic target location phase locates the robot in the captured original image and determines a suitable rectangle to enclose it. In the target segmentation phase, the GrabCut algorithm [24] is used to extract the sub-image of the robotic dance pose from the original image. As GrabCut is an interactive foreground extraction method, it normally requires a user to indicate the foreground by drawing a rectangle on the original image. The shape extraction phase extracts the region and contour from the sub-image of the robotic dance pose output by target segmentation. These three phases are detailed in the following subsections.

3.2.1 Automatic Target Location

To automatically locate the robot in the original image, we build a target location method based on the colour block information of the humanoid robot. The method requires that the humanoid robot has unique colour blocks on the important parts of its body and that these colour blocks differ from the robot's embodied environment. A Nao humanoid robot is used to describe the method. Notably, the humanoid robot always captures the mirror image of its dance poses with its onboard cameras, so the captured original image is an RGB image.

After the original image is acquired, our method first finds the pixels with the specific colour by setting all other pixels to black (background colour); the specific colour is thus regarded as the foreground colour. Subsequently, the processed image is eroded to eliminate noise and then dilated to remove very small or narrow gaps in the foreground. Finally, the processed image contains the robot position information, described by the position of the foreground colour in the image. To provide the foreground object (the sub-image of the robotic dance pose) to the target segmentation stage (the GrabCut algorithm), the robot position must be marked by a rectangle.

According to the position of the foreground colour in the processed image, an approximate minimum enclosing rectangle (AMER) is identified and used as the input of the GrabCut algorithm in target segmentation. The AMER adds a positive bias to the width and height of the minimum enclosing rectangle (MER) so that the robot falls within the rectangle more reliably. The bias is added twice in the width direction and once in the height direction, which helps to eliminate the influence of the robot's shadow, as shown in formulas (1) and (2). The bias is computed from the ratio of the MER height to the original image height, as shown in formula (3).

$$ \text{Width}(\text{AMER}) = \text{Width}(\text{MER}) + 2\,\text{Bias} $$
(1)
$$ \text{Height}(\text{AMER}) = \text{Height}(\text{MER}) + \text{Bias} $$
(2)
$$ \text{Bias} = \left[ \omega \cdot \frac{\text{Height}(\text{MER})}{\text{Height}(\text{original image})} \right] $$
(3)

where ω is a constant adjustment parameter. Notably, ω is a positive whole-number multiple of ten pixels. Moreover, ω must be adjusted according to how extended the presented dance poses are, so that the robot falls entirely within the AMER. With the other parameters unchanged, the larger ω is, the larger the AMER, and vice versa. Figure 2 shows the process of automatic target location and Fig. 3 shows the corresponding algorithm, a sketch of which is also given below. Essentially, the algorithm is a colour-threshold method that effectively utilizes the unique colour information of foreground objects in an image.
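As a concrete illustration of this colour-threshold location step, the following minimal sketch (Python with OpenCV; our own pipeline was implemented in Matlab) keeps the foreground colour, cleans it morphologically, and enlarges the MER to the AMER according to formulas (1)–(3). The HSV bounds, the function names, and the value of ω are illustrative assumptions and must be tuned to the robot's actual colour blocks.

```python
# Minimal sketch of the automatic target location step (cf. Fig. 3).
# lower_hsv / upper_hsv are hypothetical colour-block bounds, e.g. np.array([h, s, v]).
import cv2
import numpy as np

def locate_robot(image_bgr, lower_hsv, upper_hsv, omega=20):
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lower_hsv, upper_hsv)     # keep only the specific (foreground) colour
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.erode(mask, kernel, iterations=1)      # erosion removes small noise
    mask = cv2.dilate(mask, kernel, iterations=2)     # dilation restores and merges foreground blobs
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None                                    # no colour block found
    # minimum enclosing rectangle (MER) of the foreground-colour pixels
    x, y = xs.min(), ys.min()
    w, h = xs.max() - xs.min() + 1, ys.max() - ys.min() + 1
    # bias from formula (3); widen by 2*Bias (formula (1)), heighten by Bias (formula (2))
    bias = int(omega * h / image_bgr.shape[0])
    x = max(0, x - bias)
    w = min(image_bgr.shape[1] - x, w + 2 * bias)
    h = min(image_bgr.shape[0] - y, h + bias)
    return (int(x), int(y), int(w), int(h))            # AMER as (x, y, width, height)
```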

Fig. 2

The procedure of automatic target location. a Original image, b filtered image, c corroded image, d dilated image, e final image with target location rectangle (green rectangle). (Color figure online)

Fig. 3

The algorithm of automatic target location

3.2.2 Target Segmentation

Target segmentation in our approach aims to separate the robot ontology from the RGB original image of the robotic dance pose, for which the GrabCut algorithm is adopted. GrabCut is an interactive foreground extraction algorithm using iterated graph cuts [24] and requires the user to mark a rectangle around the object in the original image. The outer part of the rectangle is defined as background, and the inner part is a combination of the object (foreground) and some background. Probability distribution models of the foreground and background are then built and optimized for segmentation by minimizing an energy function over several iterations until the target is finally separated from the background.

In the automatic target location phase, the humanoid robot is located by an automatically marked rectangle, and this rectangle is the input of the target segmentation phase. In other words, the GrabCut algorithm takes the rectangle as its input, and automatic image segmentation then separates the robot ontology from the RGB original image of the robotic dance pose.

Notably, the GrabCut algorithm defines an energy function E whose minimum corresponds to a good segmentation. The iterative energy minimization used guarantees convergence to at least a local minimum of E. When E converges, a set of opacity parameter values αn is determined, which is then used for the final foreground segmentation.

Specifically, the GrabCut algorithm uses two Gaussian mixture models (GMMs), one for the foreground and one for the background, together with the Gibbs energy function defined as follows [24]:

$$ \mathbf{E}\left( \underline{\alpha}, \mathbf{k}, \underline{\theta}, \mathbf{z} \right) = U\left( \underline{\alpha}, \mathbf{k}, \underline{\theta}, \mathbf{z} \right) + V\left( \underline{\alpha}, \mathbf{z} \right) $$
(4)

where α refers to the unknown opacity variables; k refers to the GMM component variables; z refers to the given image data. The data term, U, evaluates the fit of the opacity distribution, α, to the data, z, given the Gaussian mixture models, θ, and is defined as follows [24]:

$$ U\left( \underline{\alpha}, \mathbf{k}, \underline{\theta}, \mathbf{z} \right) = \sum_{n} D\left( \alpha_{n}, k_{n}, \underline{\theta}, z_{n} \right) $$
(5)

The expansion of the term, D, (up to a constant) is defined as follows [24]:

$$ \begin{aligned} D\left( \alpha_{n}, k_{n}, \underline{\theta}, z_{n} \right) & = - \log \pi \left( \alpha_{n}, k_{n} \right) + \frac{1}{2}\log \det \Sigma \left( \alpha_{n}, k_{n} \right) \\ & \quad + \frac{1}{2}\left[ z_{n} - \mu \left( \alpha_{n}, k_{n} \right) \right]^{\mathrm{T}} \Sigma \left( \alpha_{n}, k_{n} \right)^{-1} \left[ z_{n} - \mu \left( \alpha_{n}, k_{n} \right) \right] \end{aligned} $$
(6)

Moreover, the parameters of the model are defined as follows:

$$ \underline{\theta} = \left\{ \pi \left( \alpha, k \right), \mu \left( \alpha, k \right), \Sigma \left( \alpha, k \right),\; \alpha = 0, 1,\; k = 1 \ldots K \right\} $$
(7)

where π refers to the weights; μ refers to the means; Σ refers to the covariances of the 2K Gaussian components for the background and foreground distributions [24]. Additionally, the smoothness term, V, is defined as follows:

$$ V\left( \underline{\alpha}, \mathbf{z} \right) = \gamma \sum_{\left( m, n \right) \in \mathbf{C}} \left[ \alpha_{n} \ne \alpha_{m} \right] \exp \left( - \beta \left\| z_{m} - z_{n} \right\|^{2} \right) $$
(8)

where [ψ] denotes the indicator function taking values 0,1 for a predicate ψ; C is the set of pairs of neighboring pixels; β is a constant that ensures the exponential term switches appropriately between high and low contrast; γ is another constant that takes a value of 50 [24].

Based on colour data modeling, the GrabCut algorithm achieves foreground segmentation in still images through iterative energy minimization. Notably, GrabCut is applied directly to target segmentation in our approach and has not been optimized for this specific task; a minimal usage sketch is given below.
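To illustrate how the AMER rectangle drives the segmentation, the following sketch (Python with OpenCV; our own pipeline was implemented in Matlab) calls OpenCV's GrabCut implementation in rectangle-initialisation mode. The function name and parameter values are illustrative assumptions, not part of the original system.

```python
# Minimal sketch of the target segmentation step, assuming the AMER rectangle
# (x, y, w, h) produced by automatic target location.
import cv2
import numpy as np

def segment_robot(image_bgr, amer_rect, iterations=5):
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)   # internal GMM state (background)
    fgd_model = np.zeros((1, 65), np.float64)   # internal GMM state (foreground)
    cv2.grabCut(image_bgr, mask, amer_rect, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)
    # pixels labelled definite or probable foreground form the robot ontology
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
    return image_bgr * fg[:, :, np.newaxis], fg   # segmented image and binary mask
```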

3.2.3 Shape Extraction

Shape is an important visual component of an image and is one of the key pieces of information the human visual system needs to recognize objects. Moreover, shape is stable information about an object and does not change as the surrounding environment varies. Therefore, shape provides a feasible way for a machine to understand a robotic dance pose.

In the image pre-processing stage, the shape extraction phase follows the target segmentation phase, and the segmentation result is the input of shape extraction. Shape extraction proceeds along two aspects, region and contour, which form the basis of the subsequent shape feature extraction (detailed in Sect. 3.3); a minimal sketch is given below. Figure 4d, e show the results of shape extraction, and Fig. 5 shows the corresponding algorithm. Essentially, the algorithm is designed based on morphological digital image processing.
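As an illustration under the same assumptions as the previous sketches, the following Python/OpenCV code derives a region shape (filled silhouette) and a contour shape (boundary) from the binary foreground mask returned by target segmentation. The morphological closing and the largest-component selection are our simplifying assumptions, not a transcription of the algorithm in Fig. 5.

```python
# Minimal sketch of the shape extraction step.
import cv2
import numpy as np

def extract_shape(fg_mask):
    # close small holes so the silhouette becomes a solid region
    kernel = np.ones((5, 5), np.uint8)
    closed = cv2.morphologyEx((fg_mask > 0).astype(np.uint8), cv2.MORPH_CLOSE, kernel)
    # [-2] works for both OpenCV 3 and 4 return conventions of findContours
    contours = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]
    main = max(contours, key=cv2.contourArea)          # largest component = robot ontology
    region = np.zeros_like(fg_mask)
    cv2.drawContours(region, [main], -1, 255, thickness=cv2.FILLED)   # region shape image
    contour_img = np.zeros_like(fg_mask)
    cv2.drawContours(contour_img, [main], -1, 255, thickness=1)       # contour shape image
    return region, contour_img, main
```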

Fig. 4

The procedure of image preprocessing: a original image, b the result of automatic target location, c the result of target segmentation, d the result of shape extraction (region), e the result of shape extraction (contour). (Color figure online)

Fig. 5

The algorithm of shape extraction

3.3 Feature Extraction

In general, feature extraction converts primitive features into a group of physical or statistical features. In our mechanism, feature extraction is built on the results of pre-processing and aims to acquire suitable features to describe a robotic dance pose. For a robotic dance pose, the result of kinematic parameter pre-processing is a kinematic parameter data vector (V1, V2, …, VS), and the results of image pre-processing are the region shape image and the contour shape image. Therefore, our feature extraction focuses on three aspects: joint features, region shape features, and contour shape features.

Notably, each aspect mentioned above selects its own representative features (detailed in the following sub-sections). All the visual features (region shape and contour shape) are considered as a whole. These feature extraction methods are applied directly in our approach and have not been modified for the specific task of this paper.

3.3.1 Joint Feature

Joint features, which describe kinematic properties well, are commonly used to portray a dance pose [8,9,10,11]. In general, a humanoid robot has many joint motors throughout its body. Each joint motor can move in a particular direction, and the humanoid robot presents a dance pose by simultaneously actuating all joint motors.

Thus, in our mechanism, each joint motor of the robot is translated into a joint feature, which describes a specific motor capability, and the joint motor status is the value of the corresponding joint feature. When a humanoid robot has S joint motors, there are S joint features ({JF1, JF2, …, JFS}). The kinematic parameter data vector (V1, V2, …, VS) produced by kinematic parameter pre-processing is regarded as an original instance of the joint feature sequence (JF1, JF2, …, JFS) and is further normalized, as sketched below.
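A minimal sketch of the joint feature normalisation is given below (Python). The min-max scheme over joint limits and the limit values shown are assumptions for illustration only; the paper does not prescribe the exact normalisation, and the real limits come from the robot's joint specifications.

```python
# Minimal sketch: scale each joint motor status V_i into [0, 1] by joint limits.
import numpy as np

JOINT_LIMITS = {                        # hypothetical (min, max) in radians
    "HeadYaw": (-2.0857, 2.0857),
    "HeadPitch": (-0.6720, 0.5149),
    # ... remaining joints of the humanoid robot
}

def normalize_joint_features(joint_values):
    """joint_values: dict mapping joint name J_i to its read motor status V_i."""
    features = []
    for name, (lo, hi) in JOINT_LIMITS.items():
        v = joint_values[name]
        features.append((v - lo) / (hi - lo))     # normalized joint feature JF_i
    return np.array(features)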

3.3.2 Region Shape Feature

The region shape is viewed as a whole in the region shape image, and all the pixels within the region are used to describe the shape information. In this way, the region shape is only slightly affected by noise and shape changes. To describe the region shape of a robotic dance pose effectively, five types of region shape features are extracted: eccentricity (EC), density (DE), rectangularity (RE), aspect ratio (AR), and Hu-moment Invariants (HuIM). The first four (EC, DE, RE, and AR) are simple geometric features, and the last (HuIM) is a statistical feature described by nonlinear combinations of geometric moments. They are defined as follows:

  (1) Eccentricity: the eccentricity of the ellipse that has the same second-order central moments as the region of the robotic dance pose;

  (2) Density: the ratio of the square of the regional perimeter to the regional area;

  (3) Rectangularity: the ratio of the area of the robot ontology region to the area of its minimum enclosing rectangle (MER);

  (4) Aspect ratio: the ratio of the MER's width to the MER's height;

  (5) Hu-moment Invariants: there are seven invariant moment combinations, defined as follows:

    $$ {\upvarphi }_{1} = {\uptau }_{20} + {\uptau }_{02} $$
    (9)
    $$ {\upvarphi }_{2} = \left( {{\uptau }_{20} - {\uptau }_{02} } \right)^{2} + 4{\uptau }_{11}^{2} $$
    (10)
    $$ {\upvarphi }_{3} = \left( {{\uptau }_{30} - 3{\uptau }_{12} } \right)^{2} + \left( {{\uptau }_{03} - 3{\uptau }_{21} } \right)^{2} $$
    (11)
    $$ {\upvarphi }_{4} = \left( {{\uptau }_{30} + {\uptau }_{12} } \right)^{2} + \left( {{\uptau }_{03} + {\uptau }_{21} } \right)^{2} $$
    (12)
    $$ \begin{aligned} {\upvarphi }_{5} & = \left( {{\uptau }_{30} - 3{\uptau }_{12} } \right)\left( {{\uptau }_{30} + {\uptau }_{12} } \right) *\left[ {\left( {{\uptau }_{30} + {\uptau }_{12} } \right)^{2} - 3\left( {{\uptau }_{21} + {\uptau }_{03} } \right)^{2} } \right] \\ & \quad + \left( {{\uptau }_{03} - 3{\uptau }_{21} } \right)\left( {{\uptau }_{03} + {\uptau }_{21} } \right)\left[ {\left( {{\uptau }_{03} + {\uptau }_{21} } \right)^{2} - 3\left( {{\uptau }_{30} + {\uptau }_{12} } \right)^{2} } \right] \\ \end{aligned} $$
    (13)
    $$ {\upvarphi }_{6} = \left( {{\uptau }_{20} - {\uptau }_{02} } \right) *\left[ {\left( {{\uptau }_{30} + {\uptau }_{12} } \right)^{2} - \left( {{\uptau }_{03} + {\uptau }_{21} } \right)^{2} } \right] + 4{\uptau }_{11} \left( {{\uptau }_{30} + {\uptau }_{12} } \right)\left( {{\uptau }_{03} + {\uptau }_{21} } \right) $$
    (14)
    $$ \begin{aligned} {\upvarphi }_{7} & = \left( {3{\uptau }_{21} - {\uptau }_{03} } \right)\left( {{\uptau }_{30} + {\uptau }_{12} } \right)\left[ {\left( {{\uptau }_{30} + {\uptau }_{12} } \right)^{2} - 3\left( {{\uptau }_{03} + {\uptau }_{21} } \right)^{2} } \right] \\ & \quad + \left( {{\uptau }_{30} - 3{\uptau }_{12} } \right)\left( {{\uptau }_{03} + {\uptau }_{21} } \right)\left[ {\left( {{\uptau }_{03} + {\uptau }_{21} } \right)^{2} - 3\left( {{\uptau }_{30} + {\uptau }_{12} } \right)^{2} } \right] \\ \end{aligned} $$
    (15)

    where \( \tau_{jk} \) is the normalized (j + k)-order central moment:

    $$ {\uptau }_{\text{jk}} = \frac{{{\text{M}}_{\text{jk}} }}{{\left( {{\text{M}}_{00} } \right)^{\text{r}} }},\quad {\text{r}} = \left[ {\frac{{{\text{j}} + {\text{k}}}}{2} + 1} \right] $$
    (16)

    and \( M_{jk} \) is the (j + k)-order central moment based on the region shape f(x, y).

Among the above five types of region shape features, eccentricity and aspect ratio reflect the broadness of the region; density reflects its compactness; rectangularity reflects how fully the object fills its minimum enclosing rectangle; and the Hu-moment Invariants reflect the distribution characteristics of the image grayscale.

Finally, a group of region shape features (EC, DE, RE, AR, HuIM1, HuIM2, …, HuIMp) (p ≤ 7) can be extracted from the region shape image of a robotic dance pose; a minimal computation sketch is given below. The region shape feature data acquired from the region shape image are further normalized.
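The following Python/OpenCV sketch computes these region shape features under two simplifying assumptions: OpenCV's least-squares ellipse fit stands in for the moment-based ellipse used to define eccentricity, and the axis-aligned bounding rectangle stands in for the MER. The function name and structure are ours, not the paper's implementation.

```python
# Minimal sketch of the region shape feature computation (EC, DE, RE, AR, HuIM1..7),
# given the binary region image and its boundary contour from shape extraction.
import cv2
import numpy as np

def region_shape_features(region, contour):
    area = cv2.contourArea(contour)
    perimeter = cv2.arcLength(contour, closed=True)
    # eccentricity, approximated via a fitted ellipse (axes returned as full lengths)
    (cx, cy), (ax1, ax2), angle = cv2.fitEllipse(contour)
    a, b = max(ax1, ax2) / 2.0, min(ax1, ax2) / 2.0
    eccentricity = np.sqrt(1.0 - (b / a) ** 2)
    density = perimeter ** 2 / area                       # compactness of the region
    x, y, w, h = cv2.boundingRect(contour)                # MER approximated axis-aligned
    rectangularity = area / float(w * h)
    aspect_ratio = w / float(h)
    hu = cv2.HuMoments(cv2.moments(region, binaryImage=True)).flatten()   # HuIM1..HuIM7
    return np.concatenate(([eccentricity, density, rectangularity, aspect_ratio], hu))
```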

3.3.3 Contour Shape Feature

The contour shape refers to the set of pixels that constitute the boundary of a region. By characterizing the geometric distribution of the regional boundary, the contour shape can be described with some kind of descriptor. Fourier descriptors are a classical transform-domain shape description method; they are the Fourier transform coefficients of the object's shape boundary curve. It has been verified in the literature [25] that Fourier descriptors based on the coordinate sequence of object contours perform best among the typical methods for 2-D shape recognition.

Therefore, complex coordinate based Fourier descriptors extracted from the contour shape image are regarded as the contour shape features of robotic dance pose. The abscissa for the contour shape image is taken as the real axis, and its ordinate is taken as the imaginary axis. Thus, a point on the X–Y plane corresponds to a complex coordinate. Starting from any point of the closed boundary on the X–Y plane, a one-dimensional complex sequence of points is obtained by traversing the boundary in a counter-clockwise direction. The one-dimensional complex sequence of points is shown as follows:

$$ g(t) = x(t) + i\,y(t), \quad t = 0, 1, 2, \ldots, N - 1, \quad i = \sqrt{-1} $$
(17)

where N is the total number of sampled boundary pixel points.

The discrete Fourier coefficients of one dimensional sequence are defined as follows:

$$ f(u) = \frac{1}{N}\sum_{t=0}^{N-1} g(t) \exp \left( \frac{-i\,2\pi u t}{N} \right), \quad u = 0, 1, 2, \ldots, N - 1. $$
(18)

These discrete Fourier coefficients are the Fourier descriptors and need to be further normalized. As f(0) describes the position of the geometric centre of the region surrounded by the contour boundary, f(0) is excluded from normalization and the remaining N − 1 Fourier coefficients are normalized. The normalized Fourier descriptors are defined as follows:

$$ \mathrm{CCFD}(v) = \frac{\left\| f(v) \right\|}{\left\| f(1) \right\|}, \quad v = 1, 2, \ldots, N - 1. $$
(19)

The normalized Fourier descriptors are invariant to rotation, translation, scale, and the choice of starting position. Moreover, the low-frequency components of the normalized Fourier descriptors describe the overall contour better than the high-frequency components do. Therefore, a number of low-frequency components of the normalized Fourier descriptors (CCFD1, CCFD2, …, CCFDq) (q ≤ [N/4]) are selected as the contour shape features of the robotic dance pose; a minimal computation sketch is given below.
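The following Python sketch implements formulas (17)–(19), assuming the boundary has already been traversed in a fixed direction and resampled to N points (N = 800 in our experiments). The function name is illustrative.

```python
# Minimal sketch of the complex-coordinate Fourier descriptors, formulas (17)-(19).
import numpy as np

def fourier_descriptors(contour, q=30):
    pts = contour.reshape(-1, 2).astype(np.float64)   # N boundary points (x, y)
    g = pts[:, 0] + 1j * pts[:, 1]                     # g(t) = x(t) + i*y(t), formula (17)
    f = np.fft.fft(g) / len(g)                         # discrete Fourier coefficients, formula (18)
    ccfd = np.abs(f[1:]) / np.abs(f[1])                # drop f(0), normalise by |f(1)|, formula (19)
    return ccfd[:q]                                    # keep q low-frequency components
```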

3.4 Feature Fusion

The purpose of feature fusion is to integrate several features so as to describe or portray an object completely. In this paper, to portray a robotic dance pose more completely, the joint and shape features, which are extracted from two information channels (visual and non-visual), are fused. The joint features portray the kinematic properties of a dance pose; the shape features portray its overall silhouette and peripheral form; fusing the two describes a robotic dance pose from the perspectives of movement and appearance simultaneously. More concretely, in our mechanism, the joint and shape features are concatenated into a mixed feature (joint + shape), expressed by (JF1, JF2, …, JFS, EC, DE, RE, AR, HuIM1, HuIM2, …, HuIMp, CCFD1, CCFD2, …, CCFDq) (p ≤ 7, q ≤ [N/4]), as sketched below.
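A minimal sketch of the fusion step follows. The per-column min-max normalisation over the data set is our assumption for illustration; the paper only states that the features are normalized before fusion.

```python
# Minimal sketch of feature fusion: concatenate normalised joint and shape features.
import numpy as np

def minmax_normalize(columns):
    """Min-max normalise each feature column over the whole data set."""
    lo, hi = columns.min(axis=0), columns.max(axis=0)
    return (columns - lo) / np.where(hi > lo, hi - lo, 1.0)

def fuse_features(joint_features, shape_features):
    """Row-wise concatenation of the two modalities into the mixed feature matrix."""
    return np.hstack([minmax_normalize(joint_features), minmax_normalize(shape_features)])
```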

3.5 Machine Learning

After feature fusion, the machine learning stage starts. Its task is to train a machine aesthetics model, aiming to give the machine human-like aesthetic ability and to implement autonomous aesthetic judgment on robotic dance poses. Through feature extraction and fusion, each robotic dance pose is expressed as an instance of the mixed feature. When a sufficient number of robotic dance poses have been processed, the corresponding data set is produced.

To give the machine human-like aesthetic ability for robotic dance poses, supervised learning is necessary. Thus, human dance experts are invited to give their aesthetic evaluations (good/bad) of all the robotic dance poses after observing them. From the machine learning viewpoint, the example of each robotic dance pose consists of two parts: an instance of the mixed feature and the corresponding aesthetic label (good/bad). Therefore, an example data set of robotic dance poses can be built as the basis for training a machine aesthetics model.

Although there are many machine learning methods from which to choose for training the machine aesthetics model, it is unclear which kind is more suitable and effective for artistic aesthetic cognition [11]. Therefore, it is necessary to run mainstream machine learning methods, compare their machine aesthetic performance, and find a more suitable and effective method among them; a minimal training and evaluation sketch is given below. Once a machine aesthetics model is built, the humanoid robot automatically evaluates the aesthetics of its own new dance poses by perceiving and observing them, so that further autonomous creation of robotic choreography becomes possible.
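The sketch below illustrates the training and cross-validation flow only. Our experiments used Weka 3.6 (including ADTree); since ADTree is not available in scikit-learn, a boosted decision-stump ensemble serves here purely as a stand-in classifier, and the data set X, y is assumed to hold the mixed feature instances and expert labels.

```python
# Minimal sketch of the supervised learning stage (stand-in classifier, not ADTree).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def train_aesthetics_model(X, y):
    model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
    scores = cross_val_score(model, X, y, cv=10)    # ten-fold cross-validation
    model.fit(X, y)                                 # final model trained on all examples
    return model, scores.mean()
```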

4 Experiments

As one of the most stylistic folk dances in China, Chinese Tibetan Tap has abundant variation in upper-body movements, relatively little variation in lower-body movements, and standing body shapes as its most common form. Therefore, we chose Chinese Tibetan Tap as the robotic dance form in our experiment. As one of the most popular humanoid robots, a Nao robot was selected as the dance carrier in our experiment.

The simulated experimental environment includes four kinds of software: the Webots 7.4.1 simulator, Matlab R2014a, Dev-C++ 5.11, and Weka 3.6. After perceiving its joint motor data (internal kinestate), a simulated Nao robot displays a dance pose in the "Simulation View" area of the Webots simulator. The joint motor data are treated as the robot's perceived (non-visual) information source. The pictures shown in the "Simulation View" are treated as the visual information source, in which the robot observes its own dance pose in the "mirror". Notably, our simulation experiments make the following assumption: the robot always observes its own dance poses in a mirror, ignoring some limitations of real scenes (e.g. when the mirror is not placed in front of the robot, or the mirror is outside the robot's field of view, the robot cannot observe its own dance poses in the mirror). All the examples of original images (e.g. Fig. 2a) and the experimental images were acquired under this assumption.

Moreover, the shape features are extracted by image processing programs in Matlab. In Dev-C++, robotic dance poses are generated randomly, and data file formats are transformed. Furthermore, Weka is used for machine learning based on the extracted feature data (single feature or mixed features).

Five hundred robotic dance poses of Chinese Tibetan Tap were generated randomly based on the dance formalization of the humanoid robot (HRDF) and three dance element sets [11]. For supervised learning, a Chinese folk dance expert was invited to label the aesthetic category (good/bad) of each of the 500 robotic dance poses. Considering that the hands always remain naturally relaxed when human dancers perform Chinese Tibetan Tap, the two hand joints ({LHand, RHand}) of our Nao robot were kept in a fixed configuration. Thus, the remaining 24 joints of our Nao robot were used as the joint features to describe a robotic dance pose (S = 24). The joint features of each dance pose were extracted after its joint data were acquired and normalized.

In the visual image pre-processing, the whole procedure was automatic. Figure 6 shows the automatic image processing GUI for a single captured image of a robotic dance pose. The parameters were set as follows: K = 6 and Beta = 0.3 for the GrabCut algorithm; all 7 Hu-moment Invariants (p = 7) were taken as region shape features; the total number of sampled boundary pixel points in each contour shape image was 800 (N = 800); and 30 low-frequency components of the complex coordinate based Fourier descriptors (q = 30) were taken as contour shape features. Notably, the shape features of a robotic dance pose consist of eccentricity, density, rectangularity, aspect ratio, Hu-moment Invariants, and complex coordinate based Fourier descriptors, expressed by (EC, DE, RE, AR, HuIM1, HuIM2, …, HuIM7, CCFD1, CCFD2, …, CCFD30).

Fig. 6

Automatic image processing GUI of robotic dance pose

In general, the mixed features of a robotic dance pose are expressed by (JF1, JF2, …, JF24, EC, DE, RE, AR, HuIM1, HuIM2, …, HuIM7, CCFD1, CCFD2, …, CCFD30). After normalizing the joint and shape features, ten machine learning methods were used to perform automatic machine aesthetics of robotic dance poses, with ten-fold cross-validation used for evaluation. Notably, all the machine learning methods used in our experiments came from the Weka 3.6 platform and were not tuned for our experimental tasks. For comparison, the machine learning methods were applied to three different feature combinations: joint features, shape features, and mixed features (joint + shape). The detailed machine learning results are shown in Table 2. In terms of the final aesthetic evaluation results, the highest correct evaluation ratio is 81.6%, obtained by ADTree on the mixed features.

Table 2 The effect comparison on different machine learning methods based on different feature combination

Additionally, Table 2 exhibits the following fact: the aesthetic evaluation results for the mixed features are close to those for the joint features. To determine whether the difference between the two is statistically significant, a one-way analysis of variance was used, as sketched below. The null hypothesis is that there is no difference between them, and the significance level is 0.05 (δ = 0.05). The analysis gives a significance probability of 4.84573E−07 (ζ = 4.84573E−07). Since the significance probability is less than the significance level (ζ < δ), the null hypothesis is rejected. Consequently, a significant difference exists between the aesthetic evaluation results for the mixed features and those for the joint features.
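A minimal sketch of this significance test is given below (Python with SciPy). The input lists acc_mixed and acc_joint are assumed to hold the per-method correct ratios from the corresponding columns of Table 2; the function name is illustrative.

```python
# Minimal sketch of the one-way ANOVA comparison between two feature combinations.
from scipy.stats import f_oneway

def compare_feature_sets(acc_mixed, acc_joint, alpha=0.05):
    stat, p_value = f_oneway(acc_mixed, acc_joint)   # one-way analysis of variance
    return p_value, p_value < alpha                  # reject H0 when p < alpha
```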

5 Discussion

5.1 Feature Selection

In the above experiments, there are three feature combinations: joint features, shape features, and mixed features (joint + shape). Different feature combinations have different effects on machine aesthetics. As can be seen from the results, the correct ratio of the mixed features is the highest, that of the joint features is in the middle, and that of the shape features is the lowest. The joint features come from the robot's self-perception of its internal kinestate and reflect the essential characteristics of a robotic dance pose well. Therefore, even if only the joint features are used, the machine aesthetic effect for robotic dance poses is acceptable: in the above experiment, the average correct ratio of the joint features is 77.7%, and the highest correct ratio is 80.8%.

Shape features come from the appearance impression of the robotic dance pose obtained through the robot's visual channel, in the same way as for human beings. Although they have the lowest correct ratio among the three feature combinations, the shape features still achieve an average correct ratio of 72.7% and a highest correct ratio of 76%. They are useful for the automatic machine aesthetics of robotic dance poses, enabling the robot to understand the beauty of its own dance pose. The relatively poor machine aesthetic effect based on shape features is caused by the following reasons:

  (1) The humanoid robot presents its dance pose in 3-dimensional space, but the image captured by the robot's cameras is 2-dimensional; one spatial dimension (the depth data) is lost, which may result in a loss of shape information.

  (2) The adopted combination of shape features (eccentricity, density, rectangularity, aspect ratio, Hu-moment Invariants, and complex coordinate based Fourier descriptors) may be insufficient to describe a robotic dance pose; more powerful shape feature descriptors are required.

  (3) There are shadows in the captured image of the robotic dance pose, and the GrabCut algorithm may not be able to separate the robot ontology from its shadow; therefore, the extracted shape of the robotic dance pose may have a certain distortion.

To improve the machine aesthetic effect for robotic dance poses based on shape features, the following three measures can be considered:

  • By using a depth image sensor to capture the image of a real robot in a mirror, the RGB image and the depth image of the robotic dance pose can be acquired simultaneously; therefore, the missing depth data could be obtained.

  • By combining other shape features (such as wavelet descriptors, scale space, Zernike moments, autoregressive models, etc.), more powerful or suitable shape feature descriptors could be found.

  • By improving the GrabCut algorithm, the robot ontology and the shadow in the image of the robotic dance pose could be segmented correctly.

Mixed features come from the humanoid robot's two information channels (visual and non-visual) via multimodal information fusion. They describe the robotic dance pose better than either single-source feature (joint features or shape features). The machine aesthetic effect based on the mixed features is the best among the three feature combinations, with an average correct ratio of 78% and a highest correct ratio of 81.6%. Compared with the joint features, the mixed features increase the average correct ratio by 0.3% and the highest correct ratio by 0.8%. It is foreseeable that the machine aesthetic effect based on the mixed features (joint + shape) will improve further if more powerful shape descriptors can be extracted.

5.2 ADTree

Alternating Decision Tree (ADTree) is a boosting-based decision tree algorithm for classification with a wide range of applications. An ADTree consists of alternating decision nodes, which specify a predicate condition, and prediction nodes, which contain a single number. An instance is classified by following all paths for which all decision nodes are true and summing the values of the prediction nodes that are traversed [26]; a minimal scoring sketch is given below.
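The following Python sketch only illustrates this scoring rule; it is an illustrative pseudostructure for a learned ADTree, not Weka's internal representation.

```python
# Minimal sketch of ADTree scoring: sum the prediction values along all satisfied paths.
def adtree_score(x, root_value, rules):
    """Each rule is (precondition, condition, w_true, w_false), where the two
    predicates are functions of the instance x; the sign of the score gives the class."""
    score = root_value
    for precondition, condition, w_true, w_false in rules:
        if precondition(x):                          # the rule's path is reached
            score += w_true if condition(x) else w_false
    return score                                      # e.g. score > 0 -> "good" pose
```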

As can be seen from the experiments above, ADTree demonstrates the best aesthetic performance among all the machine learning methods listed in Table 2. It attains the highest correct ratio on the joint features (80.8%) and on the mixed features (81.6%). Although it does not attain the highest correct ratio on the shape features, its correct ratio there (73.6%) exceeds the average correct ratio on the shape features (72.7%). Predictably, if the bottleneck of aesthetic performance based on shape features is overcome, ADTree may also achieve the highest correct ratio on the shape features.

Furthermore, as the concrete method used in the machine learning stage of semi-interactive evolutionary computation (SIEC), ADTree was used for the machine aesthetics of robotic dance poses and attained the highest correct ratio among the three machine learning methods compared (SVM, RBF network, and ADTree) [11]. Therefore, ADTree is an effective machine learning method for estimating the aesthetics of robotic dance poses. To further improve the correct ratio obtained by ADTree, several aspects (such as information gain, the Gini index, pruning, etc.) should be considered.

5.3 Multimodal Information Fusion

From the perspective of human ethology, by fusing multimodal information, human beings exhibit a variety of daily behaviors (e.g. speech, walking, eating, sports, etc.) in their embodied environments. For humans, as a result of natural evolution, these actions happen consciously or unconsciously [27]. For example, when a person wipes a desk, he watches the desk with his eyes while his hands execute the wiping motion; visual and kinesthetic information work hand in hand for the task. As another kind of human daily behavior, humans' aesthetic evaluation of their own dance poses is likewise a procedure of multimodal information fusion. It should be noted that the imitation of human behavior is an effective way to develop artificial intelligence. Therefore, with this as inspiration, we propose a corresponding approach to make a robot imitate human behavior.

From the perspective of cognitive neuroscience, humans perform perception tasks more precisely and effectively when information from multiple senses (e.g. vision, audition, etc.) is available. Although the information provided by each sense is distinct, the resulting representation of the surrounding world is not one of disjointed sensations but a unified multisensory experience [28]. Moreover, at the cellular level, some cells in specific regions of the human brain respond to stimuli from multiple sensory sources. For example, many cells in the superior colliculus fuse information emanating from different sensory channels; in a phenomenon called multisensory integration [29], these cells integrate the information and make an appropriate response [28]. Also inspired by this, we propose a corresponding approach to make a robot imitate this cognitive style of the human brain.

As seen from our experimental results (Table 2), multimodal information fusion brings about the improvement of the correct ratio of the aesthetics evaluation of robotic dance poses. Compared with the correct ratio of aesthetics evaluation brought about by a single information channel (visual or non-visual), multimodal information fusion results in the highest average correct ratio (78%) and the highest correct ratio (81.6%).

Moreover, compared with the correct ratio of the aesthetics evaluation brought about by the non-visual information channel (joint features), multimodal information fusion (mixed features) increases the average correct ratio by only 0.3% and the highest correct ratio by 0.8%. This shows that, in terms of kinematic properties, the joint features already portray a robotic dance pose well. Although the shape features bring only a limited improvement to the correct ratio of the aesthetics evaluation on the mixed features, they remain effective for multimodal information fusion. As to why the improvement from multimodal information fusion is not more pronounced, we believe feature conflicts exist in the mixed features. At present, feature conflict is still an open problem in the aesthetics evaluation of robotic choreography, which we will explore in the future. Meanwhile, we believe that more suitable mixed features with fewer feature conflicts exist and would portray robotic dance poses better; we will search for them in future work.

5.4 Comparison with the State-of-the-Art Approaches

There is a paucity of literature regarding the aesthetics of robotic dance poses, and what does exist focuses on the following two methods: human subjective aesthetics [8,9,10] and the machine learning based method [11]. Although more accurate aesthetic evaluation results are obtained for the former, extensive human–robot interaction creates a heavy burden for people. For the latter, although people need not extensively participate in human–robot interaction, it is difficult to build a suitable machine aesthetic model that will yield accurate aesthetic evaluation results.

To reduce the human burden and develop artificial intelligence, the machine learning based method for aesthetics evaluation is advocated. In general, our proposed approach belongs to this category. Meanwhile, how to make a machine aesthetic model possess human aesthetic ability remains an open problem. We believe that good feature combinations and good machine learning methods, together, will help solve it.

A comparison between the state-of-the-art approaches and our approach is shown in Table 3. Differently from the existing literature, we explore the automatic aesthetic evaluation of robotic dance poses from the perspective of multimodal information fusion, which involves two channels, non-visual and visual. By fusing joint and shape features, we use mixed features to portray a robotic dance pose more completely, and automatic aesthetic evaluation achieves a good result (81.6%) on the mixed features. Moreover, ADTree has been verified as an effective machine learning method for estimating robotic dance pose aesthetics. Thus, the three main open questions raised in Sect. 1 have been answered.

Table 3 The comparison between the state-of-the-art approaches and our approach

6 Conclusion

Using image processing and machine learning technologies, this paper presented an automatic machine aesthetics mechanism based on the mixed features of robotic dance poses. The simulation results show that a humanoid robot can integrate sensing data from two channels, implement multimodal information fusion, and evaluate the aesthetics of its own dance pose; thus, the robot could conduct autonomous dance activity as a human does. Moreover, the results show that the shape features are useful for evaluating the aesthetics of a robotic dance pose, and that the mixed features (joint + shape) yield higher accuracy than either single-source feature (joint features or shape features) in the automatic machine aesthetics of robotic dance poses. Meanwhile, ADTree is verified as a suitable and effective machine learning method for robotic dance pose aesthetics.

In the future, our work will focus on three aspects: (1) implementing the proposed mechanism on a real Nao robot placed before a mirror, so that it can complete the aesthetic evaluation of its own dance poses autonomously; (2) finding more useful mixed features to describe a robotic dance pose; (3) building automatic aesthetic evaluation of robotic dance motion based on the proposed mechanism.