1 Introduction

Low investment and falling consumption of goods characterize the current state of the Brazilian economy. This scenario draws attention to the productive sector's logistics expenses, because outsourcing logistics activities leads to cost savings, a goal that stands out in times of recession. Logistics outsourcing continues to be crucial for companies to reduce logistics costs, and it is an opportunity to create business value (Mello et al. 2008; Zacharia et al. 2011). A survey conducted by the Brazilian Research Foundation with 142 companies from 22 industry segments (whose total sales amount to 15% of Brazilian GDP) supports this assertion: it found that cost reduction is achieved by outsourcing logistics activities (Resende et al. 2016). Similar results have been reported in other developing countries, such as Turkey (Aktas and Ulengin 2005; Aktas et al. 2011), China (Yang et al. 2016) and India (Mitra 2006). Outsourcing of logistics services is adopted by companies that do not have logistics as a core business; it transforms fixed costs into variable costs, i.e., logistics costs are incurred only when demand materializes. According to the Research Institute, logistics costs corresponded to 11.7% of companies' total net revenue in 2014, up from 10.6% in 2010 (Ilos 2014). Of this logistics spending by Brazilian companies, two-thirds goes to logistics service providers, which highlights the importance of this market in the economy (CEL 2009). In general, logistics service providers offer service packages that include not only transport itself but also other supply chain services.

Logistics service providers offer many service options to their customers. The connectivity and communication requirements involved have pushed third-party logistics providers (3PLs) to a more advanced level (Zacharia et al. 2011). The widespread dissemination of information and communication technology has opened opportunities to create new roles in the supply chain (Evangelista and Sweeney 2006). However, creating and offering these new services might lead to considerable costs for the 3PLs. Identifying a proper service portfolio to guide the logistics operator is a crucial task for the agents involved (Lai 2004), with a notable impact on performance (Grönroos 2000). The issue is even more relevant for logistics service providers because of the growing complexity of dealing with specific outsourced services offered by other service providers (4PL and 5PL; Hosie et al. 2012). The literature has plenty of studies on the effect of outsourcing logistics services and on the selection of logistics providers (Aguezzoul 2014; Senthil et al. 2014; Guarnieri et al. 2015; Govindan et al. 2016). On the other hand, few studies were found on the performance of logistics service providers, particularly studies focusing on the identification of their service portfolios.

The aim of this study was to identify the service packages offered by logistics operators that lead to greater efficiency among logistics service providers (LSPs). For this analysis, the Data Envelopment Analysis (DEA) methodology was applied in two stages. The first stage uses DEA models to obtain efficiency scores, while the second stage uses a regression model to assess the relationship between the technical efficiency obtained from DEA and the services offered by the respective companies. The study was based on secondary data available in a specialized journal of the logistics sector for the period 2007–2015. The results show that the relationship between the offer of certain logistics service packages and the technical efficiency of logistics operators could be verified and quantified. The definition of proper service portfolios allows LSPs to better direct investments towards performance improvement in all relevant business areas.

2 Theoretical foundation

2.1 Productivity and efficiency

Productivity is a fundamental indicator of the economic evaluation of transformation processes. It is defined by the ratio between the volume of goods and/or services and the volume of inputs used in a transformation process. This relationship is trivial in the case of a process of transforming a single input into a single output. However, in the case of a general process, when multiple inputs are transformed into multiple outputs, the numerator and denominator of this relation must represent the characteristics of the production technology used. This general measure of productivity is referred to as Total Factor Productivity (TFP), which can be simplified as the weighted proportion of total output in relation to the weighted proportion of total inputs used in the production process (Coelli et al. 2005).
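As a minimal illustration of the ratio described above, the sketch below computes a weighted output-to-input ratio for a single unit. The function name, the weights and the data are hypothetical and serve only to make the definition concrete; they are not taken from the study.

```python
def total_factor_productivity(outputs, inputs, output_weights, input_weights):
    """Ratio of weighted total output to weighted total input."""
    weighted_output = sum(w * y for w, y in zip(output_weights, outputs))
    weighted_input = sum(w * x for w, x in zip(input_weights, inputs))
    if weighted_input <= 0:
        raise ValueError("weighted input must be positive")
    return weighted_output / weighted_input


# Hypothetical unit with two outputs and two inputs (illustrative weights).
tfp = total_factor_productivity(
    outputs=[100.0, 40.0],
    inputs=[50.0, 10.0],
    output_weights=[1.0, 2.0],
    input_weights=[2.0, 5.0],
)
```

In the single-input, single-output case the weights cancel out of any comparison between units, which is why the relationship is described as trivial there.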

While productivity relates outputs and inputs, efficiency compares the productive performance with the best performance observed in the sector. According to Norman and Stoker (1991), how well an organizational unit transforms its inputs into outputs is assessed through the efficiency analysis of the transformation process used.

2.2 Data envelopment analysis

Data Envelopment Analysis (DEA) is a non-parametric mathematical programming approach to estimate efficiency frontiers of processes that transform multiple inputs into multiple outputs. DEA is used to empirically assess the productive efficiency of a set of Decision Making Units (DMUs). The DMUs represent agents operating in an economic environment by producing goods and/or services. This tool, based on linear programming, was originally proposed by Charnes et al. (1978) and was able to deal with constant returns to scale only. It was then extended to accommodate variable returns to scale by Banker et al. (1984). Since then, DEA has become a flexible method for the efficiency evaluation of processes and has often been updated and extended to many different situations and scenarios.

DEA’s basic model assesses the efficiency of the DMUs as the weighted sum of all outputs divided by the weighted sum of all inputs, with no prior determination of the weights (Eq. 1). Each DMU is responsible for its production plan, which is the set of input and output weights that it considers appropriate in order to maximize productivity (Cooper et al. 2007). For a given DMU o from a set of n production units using similar technologies (r inputs X and s outputs Y), the goal is to determine the set of weights \( v_{i} ,(i = 1, \ldots ,r),\,{\text{and}}\,u_{j} ,(j = 1, \ldots ,s) \), which maximizes the ratio of weighted outputs to weighted inputs, \( \sum\nolimits_{j = 1}^{s} {u_{j} Y_{jo} } /\sum\nolimits_{i = 1}^{r} {v_{i} X_{io} } \), subject to the restriction that, for each DMU in the set to be analyzed, the weighted sum of its outputs is limited by the weighted sum of its inputs. The denominator of the objective function of this optimization problem can be normalized to 1, so that it is transformed into a linear programming problem:

$$ \begin{array}{*{20}l} {} \hfill & {\mathop {\hbox{max} }\limits_{u,v} \,\,e_{o} = \sum\limits_{j = 1}^{s} {u_{j} Y_{jo} } } \hfill & {} \hfill \\ {{\text{subject}}\,{\text{to}}} \hfill & {\sum\limits_{i = 1}^{r} {v_{i} X_{io} } = 1} \hfill & {} \hfill \\ {} \hfill & {\sum\limits_{j = 1}^{s} {u_{j} Y_{jm} } \le \sum\limits_{i = 1}^{r} {v_{i} X_{im} } } \hfill & {m = 1, \ldots ,n} \hfill \\ {} \hfill & {v_{i} \ge 0} \hfill & {i = 1, \ldots ,r} \hfill \\ {} \hfill & {u_{j} \ge 0} \hfill & {j = 1, \ldots ,s} \hfill \\ \end{array} $$
(1)

This linear programming problem is referred to in the literature as the constant returns to scale (CRS) model (Cooper et al. 2007). It can be solved for each unit m from the set of DMUs, where \( e_{m} \le 1 \). The DMUs where \( e_{m} = 1 \) operate with production plans on the efficiency frontier, while those where \( e_{m} < 1 \) are outside the efficiency frontier and are therefore evaluated as inefficient. This model is called input-oriented because it calculates the distance between the production plan of the DMU o and the efficiency frontier by considering the lowest level of input that can be used to achieve the same level of output.
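To make the CRS score concrete, the sketch below treats the special case of one input and one output (r = s = 1), where Eq. 1 has a closed-form solution: the normalization fixes v = 1/X_o, the constraints bound u, and the score becomes the DMU's output/input ratio divided by the best ratio in the sample. The function name and the data are hypothetical; the general multi-input, multi-output case requires a linear programming solver and is not shown here.

```python
def crs_efficiency_single(inputs, outputs):
    """CRS efficiency scores for DMUs with one input and one output.

    With r = s = 1, the constraint v * X_o = 1 gives v = 1 / X_o, and the
    constraints u * Y_m <= v * X_m give u <= min_m(X_m / Y_m) / X_o, so
    e_o = (Y_o / X_o) / max_m(Y_m / X_m).
    Assumes strictly positive inputs and outputs.
    """
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)
    return [ratio / best for ratio in ratios]


# Hypothetical data: one input and one output per DMU (arbitrary units).
input_levels = [10.0, 20.0, 40.0]
output_levels = [50.0, 80.0, 100.0]
scores = crs_efficiency_single(input_levels, output_levels)
```

Here the first DMU defines the frontier, and the other scores state how far each unit's productivity falls below that benchmark.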

Adding convexity restrictions to the mathematical programming problem above restricts the area of feasible solutions of the CRS model to the convex combinations generated by the production plans of the observed DMUs. This model was proposed by Banker et al. (1984), and it introduced the concept of variable returns to scale (VRS) to DEA.
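For reference, a common textbook statement of the input-oriented VRS (BCC) model, written in the same multiplier notation as Eq. 1, is sketched below. It is given as an illustration and not as the exact formulation implemented in this study; the symbol \( u_{0} \) is introduced only for this sketch. In the multiplier form, the convexity restriction appears as a variable \( u_{0} \) that is unrestricted in sign:

$$ \begin{array}{*{20}l} {} \hfill & {\mathop {\hbox{max} }\limits_{u,v,u_{0}} \,\,e_{o} = \sum\limits_{j = 1}^{s} {u_{j} Y_{jo} } + u_{0} } \hfill & {} \hfill \\ {{\text{subject}}\,{\text{to}}} \hfill & {\sum\limits_{i = 1}^{r} {v_{i} X_{io} } = 1} \hfill & {} \hfill \\ {} \hfill & {\sum\limits_{j = 1}^{s} {u_{j} Y_{jm} } + u_{0} \le \sum\limits_{i = 1}^{r} {v_{i} X_{im} } } \hfill & {m = 1, \ldots ,n} \hfill \\ {} \hfill & {v_{i} \ge 0} \hfill & {i = 1, \ldots ,r} \hfill \\ {} \hfill & {u_{j} \ge 0} \hfill & {j = 1, \ldots ,s} \hfill \\ {} \hfill & {u_{0} \,\,{\text{free}}\,{\text{in}}\,{\text{sign}}} \hfill & {} \hfill \\ \end{array} $$

Fixing \( u_{0} = 0 \) recovers the CRS model of Eq. 1, so the VRS score of a DMU is never lower than its CRS score.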

The use of DEA models is particularly advantageous in the assessment of units whose production factors are not directly subject to market values or other cardinal measures of relative importance, because the analysis is done without assuming a functional form for the relationship between inputs and outputs (Fries 2013).

3 Proposed methodology

The data modeling for logistics operators follows structured stages, the most important of which are the DEA model and the regression model. These stages must focus on precision, in the sense of providing a better solution to the problem and not just the simple identification of the operators’ activities. The proposed scheme defines four stages for the data modeling of logistics operators, as shown in Fig. 1.

Fig. 1 Data analysis procedure

The initial step of the first stage consists of gathering and processing the data used in this work. The Research Institute annually publishes an overview of logistics operators in Brazil with information about their characteristics and operating results. In the period under consideration, from 2007 to 2015, 1207 records of Logistics Service Provider (LSP) companies were counted. This group of companies makes up the database for this study (Table 1).

Table 1 Description of the sample

3.1 Market classification

The first data analysis stage seeks to identify homogeneous groups of LSPs regarding company size and the types of services provided. This classification organizes LSPs that have functional similarities into relatively homogeneous groups, which share both production factors and similar results. For this purpose, cluster analysis was used, which aims to identify groups of elements that are internally homogeneous and heterogeneous with respect to one another (Hair et al. 2010; Tabachnick and Fidell 2013). This step addresses the condition of homogeneity of the DMU sample, which is recommended for the application of DEA models. Among the available classification techniques, the k-means cluster analysis algorithm with generalized expectation maximization was chosen (Hair et al. 2010). To perform the cluster analysis, 15 dummy variables (presence/absence) were selected: (1) own transport fleet, (2) own fleet routing systems, (3) outsourced fleet routing systems, (4) tracking satellite technology for own fleet, (5) tracking satellite technology for outsourced fleet, (6) transport coordination services, (7) distribution services, (8) door to door transport, (9) transfer service, (10) milk run transportation, (11) packaging services, (12) assembly kits and sets, (13) WMS, (14) use of customer warehouses, and (15) use of own warehouses.

To assess the appropriate number of clusters, the Silhouette coefficient (Si; Rousseeuw 1987) was used, which indicates that the best arrangement is achieved with three clusters. Thus, the sample of 1207 companies analyzed was divided into three clusters according to their composition of assets and offered services. The data were obtained from the survey carried out between 2007 and 2015 (Castro 2015). Because of missing data on some variables, 219 cases had to be disregarded, leaving 988 companies in the three clusters. This procedure is considered the first data disposal of the analysis. The clusters were named as follows: (1) Node, formed by 120 companies that mainly carry out logistics services on the premises, that is, in general warehouses and manufacturing facilities; (2) Network With Assets, formed by 626 companies that predominantly have their own fleet and provide logistics services throughout the network, i.e., the facilities and services associated with transport that enable connection between the logistics facilities; and (3) Network Without Assets, formed by 242 companies that predominantly do not have their own fleet and offer logistics services throughout the network. Companies in this category usually outsource transportation activities to other logistics companies.
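The Silhouette coefficient used to compare cluster arrangements can be sketched as follows. This is a minimal illustration of Rousseeuw's definition on binary (presence/absence) vectors; the Hamming distance, the function names and the toy data are assumptions made for the example and are not necessarily the settings used in the study.

```python
def hamming(a, b):
    """Number of positions in which two dummy vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def silhouette_mean(points, labels, dist=hamming):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        # Mean distance from point i to the other members of each cluster.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points)
                       if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            continue  # singleton cluster: s(i) = 0 by convention
        a = mean_dist[own]                                       # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)     # separation
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


# Hypothetical companies described by four service dummies.
companies = [(1, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (0, 0, 1, 1)]
good = silhouette_mean(companies, [0, 0, 1, 1])  # matches the structure
bad = silhouette_mean(companies, [0, 1, 0, 1])   # splits identical companies
```

Values close to 1 indicate compact, well-separated clusters, so the number of clusters with the highest mean coefficient is preferred.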

3.2 Selection of variables for the DEA

This section describes the variable selection procedure applied to the original set of potential input and output variables. The objective is to improve DEA’s discriminatory power. For each of the three clusters detailed above, a Principal Component Analysis (PCA) was performed to help select the variables that represent the function transforming inputs into outputs. The need for variable selection arises when there are many input and output variables, because many different units can then be identified as efficient for different parameters, and their scores will therefore not have the desired discriminatory power (Adler and Yazhemsky 2010). This technique enables a reduction in the number of variables with the least possible loss of information content.

It must be highlighted that the PCA was applied here only to help select the subset of the original variables that accounts for the largest part of the variance, and not to replace the original DEA inputs with the PCA factor scores, which is known as PCA–DEA (Adler and Golany 2007). In the current case, where a regression analysis is performed on top of the DEA efficiency scores, it was more suitable to use raw rather than PCA-transformed variables. Dyson et al. (2001) point out that the interpretation of the efficiency scores and other benchmarks between peers can change under weight restrictions such as the PCA factor score transformation.

From the factor loadings of each variable in each component, the possibility of discarding some variables is verified; after all, variables with similar loadings across the components convey basically the same information and can thus be removed from the DEA problem for the sake of discriminatory power. The selected components have eigenvalues higher than 1.0 and represent more than 60% of the data variance, as suggested by the literature (Tabachnick and Fidell 2013). In this sense, based on the factor loadings of each variable in each component, the variables to be used as inputs and outputs in the DEA were selected. At the end of this procedure, the following variables were selected for each of the three clusters analyzed (Table 2).
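The retention rule stated above (eigenvalues higher than 1.0 and more than 60% of the variance) can be sketched as a small check. The function name and the eigenvalues are hypothetical; they are not the values obtained in the study.

```python
def retained_components(eigenvalues, min_eigenvalue=1.0):
    """Return the indices of retained components and their variance share."""
    total = sum(eigenvalues)
    kept = [i for i, ev in enumerate(eigenvalues) if ev > min_eigenvalue]
    share = sum(eigenvalues[i] for i in kept) / total
    return kept, share


# Hypothetical eigenvalues of a correlation matrix of four variables
# (they sum to the number of variables).
eigenvalues = [2.0, 1.2, 0.5, 0.3]
kept, share = retained_components(eigenvalues)
meets_variance_rule = share > 0.60
```

In this toy case the first two components are retained and together exceed the 60% threshold, so the loadings on those two components would be the ones inspected for variable selection.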

Table 2 Input and output variables selected

Once the variables used to determine the efficiency of the LSPs with DEA models were defined, the sample of the three clusters underwent a new discard phase because of values of the selected variables that were not reported by the respondents. This second stage of data disposal refers to cases that have no records for the input and output variables of the cluster in which they operate. Figure 2 shows the reduction of the sample in the clustering step (1st disposal) and after the selection of input and output variables (2nd disposal).

Fig. 2 Number of LSPs in each cluster and disposal after clustering and selection of input and output variables

3.3 Determination of DEA efficiency scores

The third step was the determination of the actual efficiency scores. The mathematical programming model was applied to the data for each year and each cluster to determine the relative efficiency scores of the LSPs. In this step, the input-oriented BCC model (VRS model) was applied, which seeks to minimize the inputs used to generate the same amount of each output. The GAMS software (General Algebraic Modeling System) was used for this application of the DEA model. In the results for each year, companies with scores equal to one (1.000) are on the efficient frontier of the respective model and are thus qualified as efficient. Companies with scores below one (1.000) are qualified as inefficient, and their scores express their level of efficiency relative to the companies considered efficient in the reporting year. Figure 3 shows the histograms of the DMU scores for the Node, Network with assets and Network without assets clusters, respectively. The histograms present a high concentration of values close to 1, which is a very frequent situation in Data Envelopment Analysis. Because of this distribution, regression models based on the normality assumption are not recommended.

Fig. 3 DMU values of each cluster

3.4 Regression

The fourth and final stage (2nd stage) consisted of applying regression models to investigate the relationship between the efficiency scores obtained in the previous step (dependent variable) and the services offered by the logistics operators in the sample (independent variables). Each independent variable is a dummy (presence/absence) for a logistics service. In total, 20 different services offered by these logistics operators were selected: (1) Storage (STO), (2) Inventory Control (INC), (3) Packaging (PAC), (4) Assembly Kits and Sets (AKS), (5) Third-Party Management (TPM), (6) Palletizing (PAL), (7) Cross-Docking (CRD), (8) Just-In-Time (JIT), (9) Import/Export and Customs Clearance (IEC), (10) Reverse Logistics (RLG), (11) Fiscal Support (FIS), (12) Projects Development (PRD), (13) Performance Monitoring (PEM), (14) Supply (SUP), (15) Coordination (COO), (16) Distribution (DIS), (17) Door-To-Door (DTD), (18) Transfer (TRA), (19) Milk Run (MKR) and (20) Intermodal Management (IMA).

The regression models seek to identify the effect of the provision of logistics service packages (independent variables) on the technical efficiency measure (VRS model; dependent variable) in each identified cluster (Network with assets, Network without assets and Node). The technical efficiency scores obtained by the DEA have values between 0 and 1. The Anderson–Darling test rejected the hypothesis of adherence to a normal distribution for this data (dependent variable) in the clusters Network without assets (An = 57.50; p value < 0.001), Network with assets (An = 107.93; p value < 0.001) and Node (An = 34.04; p value < 0.001). Thus, it is not recommended to use regression techniques based on the normal distribution, such as OLS regression (Hair et al. 2010; Tabachnick and Fidell 2013). Non-adherence of DEA efficiency scores to a normal distribution was also reported by Hoff (2007) and Wu and Zhou (2015).

Alternatively, the Beta regression model was selected, since this model is useful for situations where the variable of interest is continuous, asymptotic and restricted to the interval (0, 1) and is related to other variables through a regression structure (Ferrari and Cribari-Neto 2004). This type of model uses the estimation of maximum likelihood and is based on the assumption that the response is beta distributed. The beta distribution is very flexible, and its density can have quite different shapes depending on the values of the parameters of the distribution. Often, proportions data include a non-negligible number of zeros and/or ones. When this is the case, the beta distribution does not provide a satisfactory description of the data, since it does not allow a positive probability for any particular point in the interval [0, 1] (Ospina and Ferrari 2010).

The Beta distribution with two parameters θ1, θ2 > 0 offers a robust approximation for a large variety of variables with several shapes and for modeling data restricted to the interval [0, 1] (Johnson et al. 1995). The probability density function (pdf) of the Beta distribution is given by

$$ f\left( {y;\theta_{1} ,\theta_{2} } \right) = \frac{{\varGamma \left( {\theta_{1} + \theta_{2} } \right)}}{{\varGamma \left( {\theta_{1} } \right)\varGamma \left( {\theta_{2} } \right)}}y^{{\theta_{1} - 1}} \left( {1 - y} \right)^{{\theta_{2} - 1}} ,\quad 0 < y < 1 $$
(2)

where \( \varGamma \left( \theta \right) \) is the Gamma function, \( \varGamma \left( \theta \right) = \int_{0}^{\infty } {y^{\theta - 1} e^{ - y} \,dy} \), θ > 0. In the remainder of this section, we briefly describe the fundamentals of the Beta regression model proposed by Ferrari and Cribari-Neto (2004), with the reparameterization in terms of mean and precision defined as \( \mu = \frac{{\theta_{1} }}{{\theta_{1} + \theta_{2} }} \) and ϕ = θ1 + θ2, i.e., θ1 = μϕ and θ2 = (1 − μ)ϕ; thus,

$$ E\left( Y \right) = \mu \quad {\text{and}}\quad Var\left( Y \right) = \frac{{\mu \left( {1 - \mu } \right)}}{{1 + \phi }} $$
(3)
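The reparameterization can be checked numerically. The sketch below converts a hypothetical mean and precision into the shape parameters and confirms that the standard Beta variance, θ1θ2 / [(θ1 + θ2)²(θ1 + θ2 + 1)], equals μ(1 − μ)/(1 + ϕ). The function names and the values μ = 0.8 and ϕ = 10 are illustrative assumptions.

```python
def beta_shapes_from_mean_precision(mu, phi):
    """Map (mu, phi) to the shape parameters (theta1, theta2)."""
    if not 0.0 < mu < 1.0 or phi <= 0.0:
        raise ValueError("need 0 < mu < 1 and phi > 0")
    return mu * phi, (1.0 - mu) * phi


def beta_mean_variance(theta1, theta2):
    """Mean and variance of a Beta(theta1, theta2) distribution."""
    total = theta1 + theta2
    mean = theta1 / total
    variance = theta1 * theta2 / (total ** 2 * (total + 1.0))
    return mean, variance


mu, phi = 0.8, 10.0  # hypothetical values
theta1, theta2 = beta_shapes_from_mean_precision(mu, phi)
mean, variance = beta_mean_variance(theta1, theta2)
reparameterized_variance = mu * (1.0 - mu) / (1.0 + phi)
```

For a fixed mean, a larger ϕ gives a smaller variance, which is why ϕ is read as a precision parameter.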

Finally, the Beta regression model can be represented in a traditional regression format by,

$$ {\text{Y}}\sim\,Beta\left[ {\beta_{0} + \beta_{t} X_{t} } \right] = \frac{{e^{{\beta_{0} + \beta_{1} x_{1} + \cdots + \beta_{k} x_{k} }} }}{{1 + e^{{\beta_{0} + \beta_{1} x_{1} + \cdots + \beta_{k} x_{k} }} }},\quad t = 1, \ldots ,k $$
(4)

where Y is the output variable and the β’s are the unknown coefficients of the input variables Xt (x1, …, xk).
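The right-hand side of Eq. 4 is the inverse logit of the linear predictor, which keeps the fitted mean inside (0, 1). A minimal sketch with dummy regressors follows; the function name, the intercept and the coefficients are hypothetical and are not estimates from this study.

```python
import math


def predicted_mean(intercept, coefficients, dummies):
    """Inverse logit of beta_0 + beta_1 x_1 + ... + beta_k x_k."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, dummies))
    return 1.0 / (1.0 + math.exp(-eta))  # equals e^eta / (1 + e^eta)


# Hypothetical model with two service dummies.
coefficients = [1.0, -1.0]
both_services = predicted_mean(0.0, coefficients, [1, 1])  # effects cancel
first_only = predicted_mean(0.0, coefficients, [1, 0])
second_only = predicted_mean(0.0, coefficients, [0, 1])
```

A positive coefficient raises the expected score when the service is offered and a negative one lowers it, but the size of the change depends on the baseline because the link is nonlinear.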

The technical efficiency scores (VRS model) obtained for the clusters analyzed in this study, as in other DEA applications, have a high density of values equal to 1. This distribution is usually referred to as a one-inflated beta distribution and indicates the use of one-inflated beta regression models.
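The idea of one-inflation can be sketched as a mixture: a point mass at 1 with probability α, and a Beta density on (0, 1) with probability 1 − α, in line with the inflated beta distributions of Ospina and Ferrari (2010). The log-likelihood below is a minimal illustration with constant parameters; the function name and the sample scores are hypothetical, and a full regression model would additionally link the parameters to the service dummies.

```python
import math


def one_inflated_beta_loglik(scores, alpha, mu, phi):
    """Log-likelihood of scores in (0, 1] under a one-inflated beta model.

    alpha: probability that a score is exactly 1;
    mu, phi: mean and precision of the beta part on (0, 1).
    """
    theta1, theta2 = mu * phi, (1.0 - mu) * phi
    log_norm = (math.lgamma(theta1 + theta2)
                - math.lgamma(theta1) - math.lgamma(theta2))
    total = 0.0
    for y in scores:
        if y == 1.0:
            total += math.log(alpha)           # point mass at one
        else:
            total += (math.log(1.0 - alpha)    # continuous beta part
                      + log_norm
                      + (theta1 - 1.0) * math.log(y)
                      + (theta2 - 1.0) * math.log(1.0 - y))
    return total


# Hypothetical scores: one efficient DMU and one inefficient DMU.
# With mu = 0.5 and phi = 2 the beta part is uniform on (0, 1).
loglik = one_inflated_beta_loglik([1.0, 0.3], alpha=0.5, mu=0.5, phi=2.0)
```

Separating the mass at 1 from the continuous part avoids forcing the Beta density to explain the efficient DMUs, to which it assigns probability zero.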

4 Results and discussion

The beta regression models estimate the effect of providing each of the 20 logistics services (independent variables) on the VRS efficiency score from DEA (dependent variable). The aim is to identify which service packages provided by logistics operators have a significant effect on the efficiency score. A regression model was fitted for each cluster of logistics operators analyzed, namely Node, Network with assets and Network without assets. For each cluster, the one-inflated beta regression model was used.

The results of the regression model for the logistics operators of cluster (1) Node are presented in Table 3. Four service packages provided by logistics operators were identified as having a significant relationship (p value < 0.10) with the efficiency score. The services with a positive and significant effect on the efficiency score of logistics operators in the Node cluster are Palletizing (PAL) (β = 1.442, p value < 0.0437), Distribution (DIS) (β = 1.543, p value < 0.020) and Intermodal Management (IMA) (β = 1.004, p value < 0.053). Assessing the coefficients of the variables, we found that the provision of the distribution service (DIS) has the largest coefficient and is therefore the most important determinant of the efficiency score in the Node cluster. It was also found that the Milk Run service (MKR) has a significant negative effect (β = −1.076; p value < 0.073) on the efficiency score of logistics operators in this cluster. This result suggests that offering this service in Node-type logistics operators hampers the transformation of inputs into outputs, resulting in reduced efficiency of the logistics operators active in this segment.
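One way to read the magnitude of such coefficients, assuming they enter the mean through the logit link of Eq. 4, is to convert them to multiplicative changes in the odds μ/(1 − μ). The sketch below does this for the three positive Node coefficients quoted above; the function names and the baseline mean of 0.5 are hypothetical, and the interpretation holds only under the stated link assumption.

```python
import math

# Positive coefficients reported in the text for the Node cluster.
node_coefficients = {"PAL": 1.442, "DIS": 1.543, "IMA": 1.004}


def odds_multiplier(beta):
    """Factor applied to mu / (1 - mu) when a dummy changes from 0 to 1."""
    return math.exp(beta)


def shifted_mean(baseline_mean, beta):
    """Mean obtained after adding beta on the logit scale to a baseline."""
    odds = baseline_mean / (1.0 - baseline_mean) * math.exp(beta)
    return odds / (1.0 + odds)


multipliers = {name: odds_multiplier(b) for name, b in node_coefficients.items()}
# Hypothetical baseline: expected score of 0.5 without the service.
with_distribution = shifted_mean(0.5, node_coefficients["DIS"])
```

The effect on the expected score itself depends on the baseline, so such conversions are best reported for a stated reference profile of services.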

Table 3 One inflated Beta regression for cluster of Node

The results of the regression model for the logistics operators of cluster (2) Network with assets are shown in Table 4. For operators in the Network with assets cluster, seven service packages were identified as having a significant relationship (p value < 0.10) with the efficiency score.

Table 4 One inflated Beta regression for cluster of Network with assets

The services with a significant effect on the efficiency score of logistics operators in the Network with assets cluster are Storage (STO) (β = 1.604, p value < 0.027), Palletizing (PAL) (β = 1.053, p value < 0.006), Cross-docking (CRD) (β = −0.719; p value < 0.016), Import/export and customs clearance (IEC) (β = −0.288; p value < 0.014), Supply (SUP) (β = −0.552; p value < 0.022), Distribution, Door-to-door (DTD) (β = −0.306; p value < 0.043) and Transfer service (TRA) (β = −0.720; p value < 0.039). Among the services provided by logistics operators in the Network with assets cluster, Storage (STO) and Palletizing (PAL) are those with a positive effect on the efficiency score, that is, they contribute significantly to better efficiency in the transformation of inputs into outputs in the operation. The others make a negative contribution to the efficiency score and impair the level of efficiency of logistics operators in the Network with assets segment.

The results of the regression model for the logistics operators of cluster (3) Network without assets are shown in Table 5. Five service packages provided by logistics operators were identified as having a significant relationship (p value < 0.10) with the efficiency score. Only the Assembly kits and sets service (AKS) (β = 0.148, p value < 0.030) showed a significant and positive contribution to efficiency. The other services, namely Packaging (PAC) (β = −0.223; p value < 0.025), Third-party management (TPM) (β = −0.110; p value < 0.040), Performance monitoring (PEM) (β = −0.106; p value < 0.028) and Intermodal management (IMA) (β = −0.734; p value < 0.007), showed a significant negative contribution to the performance of logistics operators in the Network without assets cluster.

Table 5 One inflated Beta regression for cluster of Network without assets

4.1 Contributions to theory

In this paper, we use the one-inflated Beta model in the second stage of DEA as an alternative to traditional regression models. The non-adherence of DEA efficiency scores to the Normal distribution harms parameter estimation by the ordinary least squares (OLS) method. Data transformation, although it reduces this problem, affects the interpretation of the coefficients in the equation (Simar and Wilson 2007). Among the models reported in the literature, OLS regression is still frequently used despite the reported problems (Liu et al. 2013; Dokas et al. 2014; Hou et al. 2017). Tobit regression with the maximum likelihood method is also used in the literature (Hoff 2007; McDonald 2009). However, Tobit only partially solves the problem, since it is limited to the nonnegative interval (McDonald and Moffitt 1980) and is not restricted to the interval of the DEA scores [0, 1]. In any case, the use of OLS or Tobit regression is not recommended for the second stage of DEA analysis (Simar and Wilson 2007; Liu et al. 2013; Banker et al. 2015). This study seeks to solve the problem of the second-stage regression through a model based on the Beta distribution. First, the general family of Beta distributions is more suitable because it is also restricted to the [0, 1] interval and can be parametrized to assume virtually all distribution shapes within the [0, 1] range (Johnson et al. 1995). Second, also by construction, DEA scores are more concentrated in the proximity of 1 than of 0. Probabilistically, this effect gets stronger (i.e., more concentration right below or at 1) as the number of variables in a DEA problem grows. This specific characteristic is better captured by the one-inflated Beta distribution, which precisely models the truncation effect right before 1 (Ospina and Ferrari 2012).

4.2 Contributions to practice

This study permits the identification of the service packages that determine improvements in the operational efficiency of logistics operators. It was found that several available service packages significantly impair the companies’ level of efficiency. This result provides important information for the decision on which portfolio is suitable for the activity of logistics operators in Brazil. The results can also be used in other countries, especially those of continental dimensions, where logistics costs tend to be an important part of final consumer prices.

LSPs play a crucial role in making the operations of many industries more effective and efficient. The services provided by LSPs should be offered as efficiently as possible, allowing their managers to evaluate and monitor their operations continually. The search for efficiency must incorporate the selection of service packages that direct the operation to the efficiency frontier, to ensure the profitability of LSPs and therefore provide lower costs for shippers and end customers. Since logistics has become one of the determining factors in the competitiveness of the economy, the results of this study can contribute positively in this sense.

5 Conclusions

This study aimed to identify the service packages offered by logistics operators that lead to greater efficiency in the LSP sector. The classification of the LSP market identified three well-defined clusters: (1) “Node”, comprising the LSPs that mainly provide warehousing services; (2) “Network with assets”, comprising the LSPs with their own fleet that operate in both storage and transport services; and (3) “Network without assets”, which consists of LSPs with the same field of operation as the previous group, but without their own fleet.

By determining efficiencies with DEA models and applying regression models, it was possible to identify the service packages that are significant to the efficiency of the three logistics service clusters. The significant service packages differ between groups and vary in the magnitude of their effect on the efficiency measure. Some packages tend to negatively affect the efficiency measures of the logistics service providers analyzed, while others affect them positively. This observation suggests that certain packages have a negative effect on the service portfolio.

We also provided empirical evidence that the DEA model can be combined with Principal Component Analysis, used to select the relevant variables, and with the Beta regression model, used to model the DEA scores, both of which are robust mathematical techniques.