1 Introduction

Over the last decade, there has been growing interest in applying tensor methods in machine learning. In various scientific fields such as image analysis, signal processing, and space-time analysis, tensors have demonstrated their capability to represent multi-modal data while preserving its original structure. Unlike traditional machine learning approaches, which often convert natural data into vector form and thereby lose information about its natural structure, tensor-based machine learning methods preserve and exploit the inherent structure of the data [1].

Alongside the development of tensor learning methods, deep neural networks have demonstrated advanced performance in various large-scale machine learning tasks, including computer vision, speech recognition, and text processing. For instance, the convolutional neural network (CNN) [2, 3] has shown significant advantages in image classification tasks. These network models consist of thousands of nodes and millions of learnable parameters and are trained on millions of images using powerful graphics processing units (GPUs) [4]. However, the expensive hardware requirements and long training times limit the widespread use of these models on traditional desktop computers and portable devices. Consequently, researchers have extensively explored methods for reducing hardware requirements, memory consumption, and training time.

The fully connected layer is one of the most commonly used layers in convolutional networks and performs a linear transformation from high-dimensional input data to high-dimensional output data. The traditional approach uses a matrix to define this transformation. In a typical CNN, the input and output dimensions of the fully connected layer are both in the thousands, resulting in millions of parameters and a cumbersome model structure. Since the input and output of the convolutional layers are tensor data, it is natural to introduce tensor methods into convolutional networks to optimize the overall model structure, which has become an important research issue.

The main idea of this paper is to apply tensor decomposition to deep learning and reparameterize the existing layers of deep convolutional networks to accelerate computation or reduce the number of parameters. Several studies have explored tensor decomposition in deep convolutional networks. Lebedev [5] proposed using the tensor CANDECOMP/PARAFAC (CP) [6, 7] decomposition to accelerate feature extraction in convolutional layers. Similarly, Tai [8] introduced a new algorithm for computing low-rank tensor decompositions to eliminate redundancy in convolutional kernels. Kim [9] took a pre-trained network, applied the Tucker decomposition to the convolution kernels and finally fine-tuned the resulting network. Yang [10] proposed weight sharing in a multi-task representation learning framework to learn a cross-task sharing structure at every layer of a deep network. Chen [11] proposed sharing residual units through a new architecture, the Collective Residual Unit (CRU), to improve the parameter efficiency of deep neural networks via collective tensor factorization. Novikov [12] used the Tensor-Train (TT) decomposition to impose a low-rank tensor structure on the weights of the fully connected layer, reducing the number of parameters while preserving the expressive power of the layer. However, these studies still retain the fully connected layers of the network, so the resulting models have a large number of parameters to be trained and optimized.

Moreover, the flattening operation applied to higher-order data cannot preserve its multi-linear structure. In contrast, Kossaifi [13] proposed the tensor regression layer to replace the vectorization operation and fully connected layer in CNNs with a higher-order multiple regression. This layer was embedded into popular models such as the visual geometry group network (VGG, [3]) and the residual network (ResNet, [14]). The advantage of this replacement is the ability to compress the model while preserving the multi-modal information of the dataset. Vectorizing high-dimensional data loses this multi-modal information; for example, flattening a color image (a third-order tensor) discards the relationship between channels. The tensor regression layer addresses this problem by performing a multi-linear regression between the output of the final convolutional layer and the classification layer, enabling the capture of multi-modal information.

To further enhance the performance of tensor networks, Gao [15] proposed a quantized tensor neural network (QTNN) that combines the powerful learning ability of neural networks with the simplicity of tensor networks. QTNN is a generalized multi-layer nonlinear tensor network that effectively extracts low-dimensional features from data while preserving the original structural information. While tensor methods have been widely employed in supervised learning, researchers have also turned their attention to unsupervised learning. For example, Oldfield [16] used tensor regression to find interpretable directions in the latent space of pre-trained Generative Adversarial Networks (GANs), promoting controllable image synthesis. [17] proposed auto-weighted multiple kernel tensor clustering (AMKTC), which captures the essential high-order correlations between multiple base kernels by applying a tensor-singular value decomposition (t-SVD)-based tensor nuclear norm constraint to a third-order graph tensor.

Other researchers have uncovered valuable insights by exploring 3D CNN models and 3D filtering CNNs [18,19,20], as well as deep CNNs for image classification. Additional relevant studies include the multi-objective deep CNN model proposed by Lu [21], among others [22,23,24,25,26].

Fig. 1 CP decomposition of a three-way array

The densely connected convolutional network (DenseNet), introduced by Huang [27], is a prevalent deep neural network architecture adopted across diverse fields. DenseNet ensures maximum information flow between network layers by establishing connections among different layers, and it explicitly distinguishes between the information added to the network and the information retained. The primary focus of this paper is on embedding the tensor regression layer into the densely connected convolutional network. This unique connection pattern effectively alleviates the problem of vanishing gradients during model training. The integration of this convolutional network with the tensor regression layer offers several advantages:

(1) The tensor regression layer replaces the fully connected layer to optimize the model structure, which significantly reduces the number of parameters that need to be trained, minimizes memory consumption and lessens the hardware requirements for model training while maintaining performance.

(2) The fully connected layer is replaced by a tensor regression layer, a special type of regression layer designed to handle tensor-formatted data. When processing image data, it can extract multiple features simultaneously and perform regression prediction. This layer takes the output tensors from previous layers as input and preserves the spatial structural features of the data.

(3) The special connection mode of DenseNet strengthens feature propagation, which helps alleviate vanishing gradients when training the network model embedded with the tensor regression layer.

(4) Our proposed tensor network achieves high classification accuracy while significantly reducing memory usage and computation time.

The remainder of this paper is organized as follows. Section 2 introduces tensor algebra, including the basics and CP decomposition of tensors, as well as the tensor regression layer. Section 3 describes the DenseNet architecture and the tensor network. In Section 4, experiments are conducted to verify the superior performance of our proposed method compared to existing models. Finally, Section 5 provides some conclusions and discussion.

2 Preliminaries

This Section first reviews some basic concepts and properties of tensors, and then provides some preliminaries on the tensor regression layer.

2.1 Tensor algebra

Vectors, also known as first-order tensors, are represented in boldface lowercase letters, e.g., \(\textbf{a}\). Matrices, also known as second-order tensors, are represented in boldface capital letters, e.g., \(\textbf{A}\). Higher-order tensors are represented in boldface Euler script letters, e.g., \(\varvec{\mathcal {X}}\). Scalars are represented in lowercase letters, e.g., a. The ith element of the vector \(\textbf{a}\) is represented as \({a}_{i}\). The (i, j)th element of the matrix \(\textbf{A}\) is represented as \(a_{ij}\). The (i, j, k)th element of the 3rd-order tensor \(\varvec{\mathcal {X}}\) is represented as \({x_{ijk}}\).

The inner product of two tensors \(\varvec{\mathcal {A}}, \varvec{\mathcal {B}} \in \mathbb {R}^{I_{1} \times I_{2} \times \cdots \times I_{M}}\) with the same dimension is the sum of the products of their corresponding elements:

$$\begin{aligned} \langle \varvec{\mathcal {A}}, \varvec{\mathcal {B}}\rangle =\sum _{i_{1}=1}^{I_{1}} \sum _{i_{2}=1}^{I_{2}} \cdots \sum _{i_{M}=1}^{I_{M}} a_{i_{1} i_{2} \cdots i_{M}} b_{i_{1} i_{2} \cdots i_{M}}. \end{aligned}$$
(1)

If \(\varvec{\mathcal {A}}=\varvec{\mathcal {B}}\), then \(\langle \varvec{\mathcal {A}}, \varvec{\mathcal {A}}\rangle =\Vert \varvec{\mathcal {A}}\Vert _{F}^{2}\), where \( \Vert \varvec{\mathcal {A}}\Vert _{F}=\sqrt{\sum _{i_{1}=1}^{I_{1}} \sum _{i_{2}=1}^{I_{2}} \cdots \sum _{i_{M}=1}^{I_{M}} a_{i_{1} i_{2} \cdots i_{M}}^{2}} \) denotes the Frobenius norm of a tensor. A tensor \(\varvec{\mathcal {T}} \in \mathbb {R}^{I_{1} \times I_{2} \times \cdots \times I_{M}}\) is called rank-one if it can be written as the outer product of M vectors: \( \varvec{\mathcal {T}}=\textbf{a}^{(1)} \circ \textbf{a}^{(2)} \circ \cdots \circ \textbf{a}^{(M)}. \) The n-mode product of a tensor \(\varvec{\mathcal {M}} \in \mathbb {R}^{I_{1} \times I_{2} \times \cdots \times I_{M}}\) and a matrix \(\textbf{U} \in \mathbb {R}^{J \times I_{n}}\) is denoted by \(\varvec{\mathcal {M}} \times _{n} \textbf{U}\). The result is a tensor of size \(I_{1} \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_{M}\) with entries \(\left( \varvec{\mathcal {M}} \times _{n} \textbf{U}\right) _{i_{1} \cdots i_{n-1} j i_{n+1} \cdots i_{M}}=\sum _{i_{n}=1}^{I_{n}} m_{i_{1} i_{2} \cdots i_{M}} u_{j i_{n}}\). Equivalently, \(\varvec{\mathcal {Y}}=\varvec{\mathcal {M}} \times _{n} \textbf{U} \Leftrightarrow \textbf{Y}_{(n)}=\textbf{U M}_{(n)}\), where \(\textbf{M}_{(n)}\) denotes the mode-n unfolding of \(\varvec{\mathcal {M}}\).
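These definitions can be checked numerically with a few lines of NumPy. The sketch below is our own illustration (not the authors' code) of the inner product, the Frobenius norm, and a 2-mode product computed with einsum.

```python
import numpy as np

# Minimal NumPy sketch of the inner product, Frobenius norm and n-mode product
# defined above (illustrative only; not part of the original implementation).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((3, 4, 5))

inner = np.sum(A * B)                        # <A, B>, Eq. (1)
fro = np.sqrt(np.sum(A * A))                 # ||A||_F = sqrt(<A, A>)
assert np.isclose(fro, np.linalg.norm(A))

# 2-mode product A x_2 U with U in R^{J x I_2}: contracts the second index of A
U = rng.standard_normal((7, 4))
Y = np.einsum('jm,imk->ijk', U, A)           # result has shape (3, 7, 5)
print(Y.shape)
```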

2.2 Tensor decomposition

The CANDECOMP/PARAFAC (CP) decomposition expresses a tensor as a sum of a finite number of rank-one tensors. Figure 1 illustrates the CP decomposition of a third-order tensor.

Given a third-order tensor \(\varvec{\mathcal {X}} \in \mathbb {R}^{I \times J \times K}\), it can be approximated by a sum of tensors as follows:

$$\begin{aligned} \varvec{\mathcal {X}} \approx \sum _{\textrm{r}=1}^{{R}} \textbf{a}_{r} \circ \textbf{b}_{r} \circ \textbf{c}_{r}, \end{aligned}$$
(2)

with R being a positive integer, and \(\textbf{a}_{r} \in \mathbb {R}^{I}\), \(\textbf{b}_{r} \in \mathbb {R}^{J}\), \(\textbf{c}_{r} \in \mathbb {R}^{K}\), for \({r}=1, \ldots , {R}\). The factor matrix is a combination of these vectors, i.e., \(\textbf{A}=\left[ \textbf{a}_{1},\textbf{a}_{2}, \cdots , \textbf{a}_{{R}}\right] \), and likewise for \(\textbf{B}\) and \(\textbf{C}\). According to this definition, the following equations hold:

$$\begin{aligned} \textbf{X}_{(1)} \approx \textbf{A}(\textbf{C} \odot \textbf{B})^{\top }, \ \ \textbf{X}_{(2)} \approx \textbf{B}(\textbf{C} \odot \textbf{A})^{\top }, \ \ \textbf{X}_{(3)} \approx \textbf{C}(\textbf{B} \odot \textbf{A})^{\top }. \end{aligned}$$
(3)

Thus, the CP model can be concisely expressed as

$$\begin{aligned} \varvec{\mathcal {X}} \approx \llbracket \textbf{A}, \textbf{B}, \textbf{C} \rrbracket \equiv \sum _{{r}=1}^{{R}} \textbf{a}_{r} \circ \textbf{b}_{r} \circ \textbf{c}_{r}. \end{aligned}$$
(4)

Due to its intuitive nature, the CP decomposition is easily extended to higher-order tensors. More detailed tensor theory can be found in [28].
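As a quick numerical illustration (our own NumPy sketch, not tied to any particular library), the following code builds a rank-R tensor from factor matrices \(\textbf{A}\), \(\textbf{B}\), \(\textbf{C}\) as in (2)/(4) and verifies the mode-1 unfolding identity in (3).

```python
import numpy as np

# Build a rank-R third-order tensor from CP factors and verify Eq. (3) for mode 1.
I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# Eqs. (2)/(4): X = sum_r a_r o b_r o c_r
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Mode-1 unfolding (columns ordered with j varying fastest, then k) and the
# Khatri-Rao product C (.) B, whose r-th column is kron(c_r, b_r).
X1 = X.reshape(I, -1, order='F')
CB = np.einsum('kr,jr->kjr', C, B).reshape(K * J, R)
assert np.allclose(X1, A @ CB.T)             # X_(1) = A (C (.) B)^T
```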

2.3 Tensor regression layer

The tensor regression layer is a differentiable neural network layer. In a CNN, the fully connected layers account for the majority of the model's parameters. Besides this large consumption of computing resources, flattening the data also discards the rich spatial structure information in the final convolutional layer. The tensor regression layer replaces the flattening operation and the fully connected layer with a multi-linear mapping.

Given \(\varvec{\mathcal {X}} \in \mathbb {R}^{I_1 \times I_2 \times \cdots \times I_N}\) and \(\varvec{\mathcal {W}} \in \mathbb {R}^{I_1 \times I_2 \times \cdots \times I_N \times I_{N+1}}\) with \(I_{N+1}\) being the number of categories in the dataset, the function f can be defined as:

$$\begin{aligned} f(\varvec{\mathcal {X}})=\varvec{\mathcal {W}}_{(N+1)} {\text {vec}}(\varvec{\mathcal {X}})+\textbf{b}, \end{aligned}$$
(5)

where \(\textbf{b} \in \mathbb {R}^{I_{N+1}}\) is the bias vector.

In the past, tensor regression was trained as an independent model. It generalizes the least squares regression problem to tensor spaces and is often combined with feature extraction methods. However, a stand-alone tensor regression model is ill-suited to large-scale data analysis. Instead, this tensor structure can be embedded into a convolutional network as a trainable neural network layer. Figure 2 visualizes the structure of the tensor regression layer. The main idea behind the tensor regression layer is to impose a low-rank tensor structure on \(\varvec{\mathcal {W}}\) to reduce memory usage while exploiting the multi-linear structure of the input \(\varvec{\mathcal {X}}\).

Fig. 2 Visualization of the tensor regression layer in tensor networks

By applying the CP decomposition to the weight tensor \(\varvec{\mathcal {W}}\) and using the unfolding property in (3), equation (5) can be rewritten as:

$$\begin{aligned} f(\varvec{\mathcal {X}})&=\llbracket \textbf{A}^{(1)}, \textbf{A}^{(2)}, \cdots , \textbf{A}^{(N+1)} \rrbracket _{(N+1)} {\text {vec}}(\varvec{\mathcal {X}})+\textbf{b}\nonumber \\&=\textbf{A}^{(N+1)}\left( \textbf{A}^{(N)} \odot \cdots \odot \textbf{A}^{(1)}\right) ^{\top } {\text {vec}}(\varvec{\mathcal {X}})+ \textbf{b}. \end{aligned}$$
(6)

To achieve end-to-end training, model (6) is not optimized as a separate tensor regression; instead, it is optimized jointly with the rest of the network by back-propagation. The partial derivatives required by gradient-based optimization methods follow from equation (6):

$$\begin{aligned} \frac{\partial f(\varvec{\mathcal {X}})_i}{\partial \left( \textbf{A}^{(n)}\right) _{j k}}=\frac{\partial \left( \textbf{A}^{(N+1)}\left( \textbf{A}^{(N)} \odot \cdots \odot \textbf{A}^{(1)}\right) ^T {\text {vec}}(\varvec{\mathcal {X}})\right) _i}{\partial \left( \textbf{A}^{(n)}\right) _{j k}}, \end{aligned}$$
(7)

where \(n=1,2, \cdots , N+1\). In addition, for a given mode n, these partial derivatives can be naturally arranged into a third-order tensor \(\partial f(\varvec{\mathcal {X}}) / \partial \textbf{A}^{(n)}\), whose unfoldings have the following expressions:

$$\begin{aligned} \left( \frac{\partial f}{\partial \textbf{A}^{(n)}}\right) _{(2)}=(\varvec{\mathcal {X}})_{(n)}\left( \textbf{A}^{(N)} \odot \cdots \odot \textbf{A}^{(n+1)} \odot \textbf{A}^{(n-1)} \odot \cdots \odot \textbf{A}^{(1)}\right) \left( \textbf{A}^{(N+1)} \odot \textbf{I}_R\right) ^T, \end{aligned}$$
(8)

for \(n=1,2, \cdots , N\), and when \(n=N+1\), it becomes

$$\begin{aligned} \left( \frac{\partial f}{\partial \textbf{A}^{(N+1)}}\right) _{(1)}=\textbf{I}_{I_{N+1}} \otimes \left( {\text {vec}}(\varvec{\mathcal {X}})^T\left( \textbf{A}^{(N)} \odot \cdots \odot \textbf{A}^{(1)}\right) \right) . \end{aligned}$$
(9)
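To show how such a layer can be implemented in practice, the sketch below (our own PyTorch illustration, written for a third-order activation tensor of shape C x H x W per sample) parameterizes the regression weight by rank-R CP factors and evaluates (6) without ever forming the full weight tensor; automatic differentiation then supplies the gradients in (7)-(9).

```python
import torch
import torch.nn as nn

class CPTensorRegressionLayer(nn.Module):
    """Sketch of a CP tensor regression layer for inputs of shape (C, H, W).

    The weight tensor W in R^{C x H x W x n_classes} is represented by rank-R
    factors A^(1), ..., A^(4) as in Eq. (6); it is never materialized.
    """

    def __init__(self, in_shape, n_classes, rank):
        super().__init__()
        c, h, w = in_shape
        self.A1 = nn.Parameter(0.02 * torch.randn(c, rank))
        self.A2 = nn.Parameter(0.02 * torch.randn(h, rank))
        self.A3 = nn.Parameter(0.02 * torch.randn(w, rank))
        self.A4 = nn.Parameter(0.02 * torch.randn(n_classes, rank))  # output factor A^(N+1)
        self.bias = nn.Parameter(torch.zeros(n_classes))

    def forward(self, x):                     # x: (batch, C, H, W)
        # For each rank-one component r: <X, a_r^(1) o a_r^(2) o a_r^(3)>
        z = torch.einsum('bchw,cr,hr,wr->br', x, self.A1, self.A2, self.A3)
        return z @ self.A4.t() + self.bias    # (batch, n_classes), Eq. (6)
```

Training this layer jointly with the convolutional layers by back-propagation is then no different from training any other layer.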

3 Tensor network

Having introduced the basic theory of tensors and the tensor regression layer, we now turn to a mainstream deep convolutional neural network model: the densely connected convolutional network (DenseNet). To optimize the model structure, the tensor regression layer is embedded into DenseNet to construct a new network model, whose structure is then studied and analyzed.

3.1 DenseNet

In the field of computer vision, CNNs have become the mainstream method, with architectures such as VGG-16/19 and GoogLeNet [29]. As CNNs grow deeper, a new research problem arises: information about the input or the gradient may vanish as it passes through many layers before reaching the end (or beginning) of the network. A milestone in the history of CNNs is the emergence of ResNet, which allows deeper models to be trained and achieves higher accuracy. The core of ResNet is to establish a “shortcut connection" between earlier and later layers, which helps back-propagate the gradient during training and prevents it from vanishing. Such a connection structure makes it possible to train much deeper CNNs. Building upon ResNet, [27] put forward densely connected convolutional networks. The basic idea is the same as ResNet, but dense connections are established between every layer and all subsequent layers. DenseNet excels at feature reuse through channel-wise feature concatenation, achieving better performance than ResNet with fewer parameters and lower computational cost.

Fig. 3 Shortcut connection mechanism of ResNet (“+" represents the feature addition operation)

Fig. 4 Dense connection mechanism of DenseNet (“c" represents the channel concatenation operation)

3.1.1 Structure of DenseNet

Compared with ResNet, DenseNet adopts a more radical dense connection mechanism: it connects all layers, so that each layer receives the feature maps of all preceding layers as additional input. Figure 3 depicts the shortcut connection mechanism of ResNet, while Fig. 4 shows the dense connection mechanism of DenseNet. ResNet establishes a shortcut connection between each layer and the preceding layer, and the connection is realized by feature addition. In DenseNet, each layer is concatenated with all previous layers along the channel dimension and used as the input of the next layer. For an L-layer network, DenseNet therefore contains \(\frac{L(L+1)}{2}\) connections. Moreover, DenseNet directly connects feature maps from different layers, which enables feature reuse and improves efficiency; this is the main difference between DenseNet and ResNet. Denote \(\textbf{X}_{L-1}\) as the output of layer \(L-1\); the output of a traditional deep neural network at layer L can then be expressed as:

$$\begin{aligned} \textbf{X}_L=H\left( \textbf{X}_{L-1}\right) . \end{aligned}$$
(10)

For ResNet, the features of the input from the upper layer are added. The output from layer L is:

$$\begin{aligned} \textbf{X}_L=H\left( \textbf{X}_{L-1}\right) +\textbf{X}_{L-1}. \end{aligned}$$
(11)

In DenseNet, the features of all previous layers will be connected. The output of layer L is:

$$\begin{aligned} \textbf{X}_L=H\left( \left[ \textbf{X}_0, \cdots , \textbf{X}_{L-1}\right] \right) , \end{aligned}$$
(12)

where H denotes the nonlinear composite transformation, which comprises a series of normalization, activation, pooling, and convolution operations.
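The difference between (11) and (12) can be made explicit with a one-line PyTorch sketch of our own, where H stands for the layer's composite transformation:

```python
import torch

def resnet_step(H, x_prev):
    # Eq. (11): element-wise addition of the previous feature map
    return H(x_prev) + x_prev

def densenet_step(H, features):
    # Eq. (12): concatenate [X_0, ..., X_{L-1}] along the channel dimension
    return H(torch.cat(features, dim=1))
```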

Fig. 5 A dense block with 4 layers and a growth rate of 4

Fig. 6 DenseNet with 3 dense blocks

3.1.2 Composition of DenseNet

Dense blocks Deep convolutional networks generally reduce the size of feature maps through pooling or convolution layers, but the dense connection mode requires feature maps of identical size. To address this, the dense block (DenseBlock) is defined: within a dense block, the feature maps of all layers have the same size and are concatenated along the channel dimension. Assuming the input feature map has \(K_0\) channels, the number of channels in layer L is \(K_0\!+\!(L\!-\!1) K\), where K is the growth rate, a hyperparameter. Figure 5 shows a dense block with 4 layers and a growth rate of 4.

Transition layer The transition layer connects two adjacent dense blocks and reduces the size of the feature maps. It consists of a \(1\times 1\) convolution and a \(2\times 2\) average pooling layer, which compresses the model.

Bottleneck layer Although each layer only generates K output feature maps, its input accumulates many channels. Before each convolution operation, a \(1\times 1\) convolution is introduced as the bottleneck layer to reduce the number of input feature maps and improve computational efficiency. This bottleneck design is particularly effective for DenseNet; the resulting network is called DenseNet-B.

It can be seen from the DenseNet structure that the dense blocks, built on the idea of residual networks, strengthen feature propagation and encourage feature reuse. At the same time, the transition and bottleneck layers simplify the model, greatly reduce the number of parameters, and speed up training. Figure 6 shows a DenseNet with 3 dense blocks.
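To make the composition concrete, the following PyTorch sketch (ours; the layer sizes are illustrative, not the authors' exact configuration) implements a bottleneck dense layer (BN-ReLU-1x1 convolution, BN-ReLU-3x3 convolution, channel concatenation) and a transition layer (1x1 convolution followed by 2x2 average pooling).

```python
import torch
import torch.nn as nn

class DenseLayerB(nn.Module):
    """Bottleneck dense layer: each call adds K (growth rate) channels."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        inter = 4 * growth_rate                       # bottleneck width (illustrative)
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter, kernel_size=1, bias=False),
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.block(x)], dim=1)   # dense connection

class Transition(nn.Module):
    """Transition layer between dense blocks: 1x1 conv + 2x2 average pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.block(x)
```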

3.2 Tensor network based on DenseNet

The original DenseNet structure employs global average pooling and fully connected layers for classifying image data after feature extraction, which may compromise the spatial structure information of features. Additionally, the resulting fully connected layer entails a large number of trainable parameters, posing challenges for model training. To address these concerns, we propose optimizing the DenseNet structure by embedding a tensor regression layer as a trainable layer into the network, facilitating joint learning of features for data classification.

To achieve this, we directly substitute the global average pooling and fully connected layers of the original network with the tensor regression layer, while imposing low-rank constraints on the regression weights. Intuitively, the tensor regression layer offers the advantage of effectively utilizing spatial structure information from the data and significantly reducing the model’s training parameters.
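A minimal sketch of this substitution is shown below. It is our own illustration, using torchvision's densenet121 backbone and the CP tensor regression layer sketched in Section 2.3; the (1024, 7, 7) feature shape assumes 224 x 224 inputs and is not necessarily the configuration used in the experiments.

```python
import torch
import torch.nn as nn
from torchvision import models

class DenseNetTRL(nn.Module):
    """Sketch: DenseNet backbone with the classifier (global average pooling +
    fully connected layer) replaced by a low-rank tensor regression layer."""

    def __init__(self, n_classes, rank=16):
        super().__init__()
        self.features = models.densenet121(weights=None).features   # conv backbone only
        # CPTensorRegressionLayer is the sketch from Section 2.3; (1024, 7, 7)
        # is the feature shape for 224 x 224 inputs.
        self.trl = CPTensorRegressionLayer((1024, 7, 7), n_classes, rank)

    def forward(self, x):
        z = torch.relu(self.features(x))     # (batch, 1024, 7, 7): no flattening, no pooling
        return self.trl(z)
```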

Fig. 7 Tensor network based on DenseNet

Fig. 8 Fruits 360

Figure 7 illustrates the tensor network structure based on DenseNet, showcasing the integration of the tensor regression layer into the network architecture. This optimized structure aims to enhance feature learning and classification accuracy while mitigating the issues associated with traditional classification layers.

3.3 Analysis of the model

The network model uses a simple tensor regression structure to perform the classification task, and its parameters can still be optimized with the gradient back-propagation algorithm. Owing to the structure of the tensor regression layer, only a small number of parameters need to be trained. Consider the output tensor \(\varvec{\mathcal {X}} \in \mathbb {R}^{I_0 \times I_1 \times \cdots \times I_N}\) of the convolutional layers, and assume the weight of the tensor regression layer has rank R with \(R \le I_k\). The output is an n-dimensional vector. For a single fully connected layer, the number of trainable parameters is:

$$\begin{aligned} n_{\textrm{FC}}=n \times \prod _{k=0}^N I_k. \end{aligned}$$
(13)

In comparison, the number of parameters to be trained in the tensor regression layer is only:

$$\begin{aligned} n_{\textrm{TRL}}=\sum _{k=0}^N R \times I_k+R \times n. \end{aligned}$$
(14)

As a result, both the number of parameters and the memory consumption of the computer are greatly reduced. The promising performance of the proposed tensor network model is examined by experiments in the next Section.
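For intuition, a quick evaluation of (13) and (14) with illustrative sizes (a 1024 x 7 x 7 feature tensor, n = 100 classes and rank R = 16; these numbers are ours, not taken from the experiments) shows the scale of the reduction.

```python
# Parameter counts of Eqs. (13)-(14) for an illustrative configuration.
I = [1024, 7, 7]      # feature tensor dimensions I_0, ..., I_N
n, R = 100, 16        # number of classes and CP rank

n_fc = n * 1024 * 7 * 7          # Eq. (13): 5,017,600 parameters
n_trl = R * sum(I) + R * n       # Eq. (14): 18,208 parameters
print(n_fc, n_trl, n_trl / n_fc) # the TRL needs well under 1% of the FC parameters
```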

4 Experiments and discussion

In this Section, extensive comparative experiments are conducted to verify the feasibility of the model. To ensure fairness, DenseNet-121, DenseNet-169, DenseNet-201, and DenseNet-264 are used as the experimental baselines, and the tensor regression layer is then embedded into each model. The computer memory used for training each model is recorded, and the training results are analyzed. All experiments are carried out under the Linux system on a GTX-1650 GPU, and the programs are written in Python.

4.1 Data description

① Fruits 360 is a comprehensive dataset containing 131 types of fruits and vegetables, comprising approximately 90,000 images. To build the dataset, each fruit or vegetable was placed on the shaft of a motor, a 20-second clip was recorded, and frames were captured with a camera. Figure 8 shows a fraction of the dataset. The background of all images is white and the fruits occupy most of each image, which aids the extraction of the characteristics of each type of fruit or vegetable and simplifies model training. This dataset can be accessed at “https://www.kaggle.com/datasets/moltean/fruits".

② 100 Sports Image is a dataset containing 100 different sports types, consisting of approximately 13,000 images collected from the internet and then curated into a high-quality, clean dataset free from duplicate or poor-quality images. Figure 9 shows a portion of the dataset. Although the background of each image is complex, each sport type has distinctive, symbolic features, which allows the model to extract the main features of each image and classify them accurately. This dataset is available at “https://www.kaggle.com/datasets/gpiosenka/sports-classification".

Fig. 9 100 Sports Image

Fig. 10 ASL Alphabet

③ ASL Alphabet is a dataset of the letters of the American Sign Language alphabet. It comprises 29 classes: 26 for the letters A-Z and 3 for SPACE, DELETE, and NOTHING. The training set contains 87,000 images with dimensions of \(200\times 200\) pixels. As depicted in Fig. 10, the background of this dataset is relatively simple, which greatly aids model training. This dataset can be accessed at “https://www.kaggle.com/datasets/grassknoted/asl-alphabet".

④ In this study, the Mini-ImageNet dataset is utilized to assess the performance of the proposed model. It consists of thousands of images spanning 100 different object classes. With dimensions of \(84\times 84\) pixels, each image is associated with a specific label indicating its class category. The dataset is partitioned into three subsets: train set, validation set and test set. The training set is used to train the model, while the validation set aids in hyperparameter tuning and model selection. Lastly, the test set is employed to evaluate the overall accuracy and generalization capability of the model. This dataset is available at “https://www.kaggle.com/datasets/arjunashok33/miniimagenet".

4.2 Experimental results

This Section presents the performance of eight models on two datasets, as depicted in Tables 1 and 2, which report the test set accuracy, the number of parameters required for model training, and the computer memory (in MB) necessary for training. Tables 3, 4, 5 and 6 compare the proposed method with popular network models on the four datasets, and Table 7 reports the ablation study. The percentages highlighted in red indicate the decrease obtained by the tensor regression layer-based method relative to the existing vectorization-based DenseNet method.

In this experiment, all networks were optimized using the adaptive moment estimation (Adam) method. Adam is an adaptive learning rate optimization algorithm that combines the momentum method with adaptive gradient methods. It adjusts the learning rate by estimating the first-order moment (the mean) and the second-order moment (the uncentered variance) of the gradient and adaptively updates the parameters during training. This adaptive adjustment of learning rates helps improve convergence and training efficiency, making Adam a popular choice for optimizing deep neural networks.

For the Fruits 360 dataset, a batch size of 30 was employed and a total of 30 training cycles were conducted. Given the relative simplicity of this dataset, with clear foreground and background images, image enhancement techniques were not applied. The input image size was set to \(50\times 50\), determined through observations of the dataset’s optimal input size. The initial learning rate was set to 0.0001, which allowed the model to converge within a reasonable timeframe and achieve solid performance. Learning rate decay was not applied due to the dataset’s simplicity, yet the model still yielded satisfactory results. In networks featuring fully connected layers, the number of fully connected layers was set to 2, based on experimental observations indicating that this configuration achieved the best performance. This setup ensures efficient feature learning and classification for the Fruits 360 dataset.
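The corresponding training loop is standard; the sketch below (ours) only records the hyperparameters stated above, with `model` and `train_set` as placeholders (e.g., a DenseNet with a tensor regression layer and a dataset yielding 50 x 50 RGB images with labels).

```python
import torch
from torch.utils.data import DataLoader

def train_fruits360(model, train_set, epochs=30, batch_size=30, lr=1e-4):
    """Adam with lr = 1e-4, batch size 30, 30 epochs, no learning-rate decay."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```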

Table 1 Experimental results of Fruits 360
Table 2 Experimental results of 100 Sports Image

For the 100 Sports Image dataset, a batch size of 50 was utilized, and a total of 150 training epochs were conducted to train the model comprehensively. Given the complexity of this dataset’s background and its relatively small size, data augmentation techniques were employed. Data augmentation generates new training samples by transforming the original images (e.g., through rotation, scaling, and cropping), effectively enlarging the training set and improving the model’s generalization ability. The input image size was set to \(264\times 264\) based on observations of the dataset’s optimal input size. The initial learning rate was set to 0.001, enabling the model to converge effectively within a reasonable timeframe while achieving good performance. Similar to the Fruits 360 dataset, learning rate decay was not applied for the 100 Sports Image dataset due to its manageable complexity. Despite the dataset’s challenges, such as complex backgrounds and a relatively small size, the model achieved satisfactory results through a combination of data augmentation techniques and an appropriate network structure. In networks featuring fully connected layers, the number of fully connected layers was set to 3, as experimental observations indicated that this configuration yielded optimal performance. This setup ensures that the model can effectively learn and classify features relevant to the 100 Sports Image dataset.

Table 3 Fruits 360 dataset classification results comparison
Fig. 11 Mini-ImageNet

As shown in Tables 1 and 2, the proposed method demonstrates obvious advantages compared to other existing methods. Specifically, the results in the first row of Table 3 show that our proposed method reduced the number of parameters by 88.20% and memory usage by 78.15%, with only a 2.92% decrease in accuracy. Furthermore, the proposed method has the advantage of yielding greater gains with more complex image backgrounds. This indicates that the method has stronger handling capabilities for images with complex backgrounds, which is critical for many real-world applications.

The robustness and reliability of the proposed method are further demonstrated by its ability to maintain consistently high performance across various initialization schemes and hyperparameter settings. Extensive tuning of hyperparameters, including learning rate, batch size, number of training cycles, and the architecture of fully connected layers, was conducted to ensure optimal performance. The model’s stability and reliability were confirmed through multiple runs with different random initializations, all of which consistently yielded high performance. This consistency underscores the effectiveness and generalizability of the proposed approach, enhancing confidence in its applicability to real-world scenarios.

Overall, the proposed method provides an efficient, effective, and robust solution for image classification tasks, especially in scenes with complex image backgrounds. Its superior performance, coupled with its efficiency in parameter usage and memory consumption, suggests that the model is meaningful.

To further evaluate the tensor network, DenseNet with the tensor regression layer is compared with several currently popular network models. VGG16 and VGG19 are deep networks built from small convolutional kernels, illustrating that small, deep networks offer more advantages than large, shallow ones. ResNet50 and ResNet101 establish connections between layers to realize residual learning, which helps prevent the gradient from vanishing during propagation and enables the training of deeper CNNs. EfficientNet [30] balances the three crucial dimensions of network depth, network width, and image resolution to optimize network performance. ResNet50-TRL [13] is ResNet embedded with a tensor regression layer; unlike this study, that model factorizes its weights with the Tucker decomposition. Table 3 presents the accuracy, macro-averaged precision, macro-averaged recall and macro-averaged F1-score of these models, where the metrics are computed as follows:

$$\begin{aligned} \text {Macro\_P (macro-averaged precision)} = \frac{1}{N} \sum _{i=1}^{N} \frac{TP_i}{TP_i + FP_i}, \end{aligned}$$
(15)
$$\begin{aligned} \text {Macro\_R (macro-averaged recall)} = \frac{1}{N} \sum _{i=1}^{N} \frac{TP_i}{TP_i + FN_i}, \end{aligned}$$
(16)
$$\begin{aligned} \text {Macro\_F1 (macro-averaged F1-score)} = \frac{2 \times \text {Macro\_P} \times \text {Macro\_R}}{\text {Macro\_P} + \text {Macro\_R}}, \end{aligned}$$
(17)

where N is the number of categories, \({TP_i}\) is the number of samples belonging to the i-th category and correctly predicted as such, \({FP_i}\) is the number of samples erroneously predicted as the i-th category, and \({FN_i}\) is the number of samples that belong to the i-th category but were erroneously predicted as other categories (Fig. 11).
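A direct implementation of (15)-(17) from a confusion matrix is straightforward; the sketch below (our own, in NumPy) assumes C[i, j] counts samples of true class i predicted as class j.

```python
import numpy as np

def macro_metrics(C):
    """Macro-averaged precision, recall and F1-score from a confusion matrix."""
    C = np.asarray(C, dtype=float)
    tp = np.diag(C)
    fp = C.sum(axis=0) - tp          # predicted as class i but actually another class
    fn = C.sum(axis=1) - tp          # class i samples predicted as another class
    macro_p = np.mean(tp / (tp + fp))                       # Eq. (15)
    macro_r = np.mean(tp / (tp + fn))                       # Eq. (16)
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)  # Eq. (17)
    return macro_p, macro_r, macro_f1
```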

Fig. 12 The changes in accuracy (a) and loss (b) of DenseNet-264-TRL during training on the Fruits 360 dataset

Table 4 100 Sports Image dataset classification results comparison
Table 5 ASL Alphabet dataset classification results comparison
Table 6 Mini-ImageNet dataset classification results comparison
Table 7 Ablation Study Results

Additionally, the changes in accuracy and loss of DenseNet-264-TRL during training on the Fruits 360 dataset are provided in Fig. 12.

From the above figures and tables, it can be clearly seen that: (1) Overall, the tensor network model based on DenseNet performs well on all four datasets. It performs better on the relatively simple Fruits 360 dataset, where the best network achieves \(98.99\%\) accuracy within 30 training cycles; however, for the tensor network based on DenseNet-264, increasing the depth of the model does not significantly improve classification accuracy. On the 100 Sports Image dataset, the tensor network model has a clear advantage in the number of parameters, while its accuracy is slightly lower than that of the network model with the fully connected layer, reaching at most \(79.52\%\), which is still an acceptable result. Similarly, the accuracy reaches 89.48% on the ASL Alphabet dataset and \(77.90\%\) on the Mini-ImageNet dataset. (2) Compared with the popular network models, the performance of our proposed model is slightly better than, or at least comparable to, existing methods. (3) According to the experimental results, the accuracy of the tensor network model is slightly lower than that of the network model using the fully connected layer, but the tensor network model has far fewer parameters and occupies less computer memory. As expected, the tensor network model reduces the number of model parameters at a small cost in precision.

According to the experimental results, the reduction in parameters causes only a small loss in classification accuracy, and such a small impact is acceptable. The tensor network model based on DenseNet can greatly reduce the number of model parameters with little loss of accuracy. Embedding the tensor regression layer into the mainstream DenseNet model thus optimizes the structure of the model while preserving its performance.

4.3 Ablation study

To further validate the effectiveness of the proposed model, an ablation study is conducted on the Mini-ImageNet dataset. This experiment aims to evaluate the contribution of different components in the proposed model.

The ablation study compared the performance of different variations of the DenseNet-264 model. The full model, DenseNet-264-TRL, achieved an accuracy of 77.90%, with precision, recall, and F1-score of 78.23%, 76.84%, and 0.7753, respectively, indicating that the proposed model performs reasonably well on this task. When the tensor regression layer was replaced with fully connected layers (DenseNet-264-FC), the accuracy improved slightly to 79.5%, with precision and recall of 79.0% and 79.2% and an F1-score of 0.791, suggesting that replacing the TRL with fully connected layers may yield marginal, but not significant, improvements. Conversely, when the dense connection structure was removed while the tensor regression layer was retained, the accuracy dropped to 73.2%, with precision, recall, and F1-score decreasing to 73.5%, 73.0%, and 0.733, respectively, indicating that DenseNet has a positive impact on performance and that its removal causes a noticeable degradation. These findings imply that additional fully connected layers may slightly improve performance, while the removal of residual-style dense connections is detrimental. These insights are essential for further optimizing the model, and additional experiments with different architectural variations could further validate these observations and provide a deeper understanding of the model’s behavior.

4.4 Discussion

It would be interesting to embed the tensor regression layer into more robust network models. The tensor regression layer optimizes the model structure, using less computer memory and training fewer parameters while delivering performance comparable to the original model. However, several problems remain to be addressed. First, to alleviate over-fitting, the Dropout method can be applied to the tensor layer [31]. Second, for different types of datasets, suitable optimizers and hyperparameter tuning strategies need to be selected. Third, divide-and-conquer and block-based modular networks (such as hybridized block modular, block-combined, and evolutionary deep CNNs) could also be applied to image classification [32]. Fourth, in future versions of the model, sensitivity analysis for feature selection and sensitivity analysis of neural networks should be investigated.

5 Conclusion

By organically integrating the tensor regression layer and the DenseNet network, this paper proposes a new DenseNet-based tensor network model. The flattening operation and the fully connected layer are replaced by a tensor regression layer with a low-rank structure, enabling the network to receive tensor data directly. Instead of discarding the spatial structure of the features extracted by the convolutional layers, the model retains and utilizes the multi-linear structure of the data while reducing a significant number of parameters. Experiments show that the tensor network model can effectively accomplish image classification, with acceptable performance on both simple datasets and those with relatively complex backgrounds. Compared with the fully connected layer network model, our proposed model achieves image classification with fewer parameters and less computer memory. The numerical results demonstrate that the higher computational efficiency of the DenseNet tensor network makes it preferable to the fully connected layer network model.