Introduction

Energy is considered a key factor for development (Sepúlveda 2016) and an important element contributing to the achievement of a world economy (Zhenxing and Jing 2007). According to the International Energy Agency, effective action is needed to solve the climate change problems related to the energy sector. CO2 emissions from this sector correspond to two-thirds of total emissions, and these levels have increased in recent years (IEA 2015). Therefore, increasing energy efficiency can be considered one of the primary strategies to reduce energy consumption and CO2 emissions (Nagy et al. 2015). The United Nations Environment Programme (UNEP 2016) shows that commercial and residential buildings consume approximately 60% of global energy. According to the International Energy Agency, these buildings represent 10% of total CO2 emissions (Soares et al. 2017). To reduce this share, it is important to modernize buildings by applying energy efficiency measures. This approach is a logical way to extend the life cycle of buildings, resulting in improved living conditions, lower energy bills for the occupants (Dascalaki et al. 2016) and reduced environmental impacts from building construction, which are the primary objectives of world energy policy (Soares et al. 2017).

Since the fuel crisis of the 1970s, energy efficiency has been a significant factor in reducing the cost of energy and ensuring sustainability in the world (Kazanasmas et al. 2014). Attention has been given to the development of technologies capable of improving the energy performance of buildings; however, only a few of these technologies have been applied (Mardookhy et al. 2014).

According to Zhang et al. (2015), improving energy efficiency in buildings is still one of the easiest, most immediate and most economical ways to reduce a country’s energy consumption. These advantages are why energy efficiency has received considerable attention (Zorita et al. 2016).

Several countries have created legislation that requires buildings to be constructed with a positive energy balance. For example, in Europe buildings consume more than 40% of total electricity production. For this reason, the EPBD (Energy Performance of Buildings Directive) was created. It is a legislative instrument that affects energy use and efficiency in this sector of the EU, and it is expected that in 2020 new buildings should consume “nearly zero” energy from the power grid, with a large part of the demand generated locally from renewable sources. These new buildings are known as “very low energy buildings” or “nearly zero energy buildings” because of their very high energy performance (ECEEE 2010).

Many articles have been written about the factors that affect energy use and energy efficiency in buildings, and a wide variety of data analysis tools have been employed in them (Engvall et al. 2014).

This research aims to identify the most relevant publications and authors and the importance of the subject over the years, as well as the most frequent data analysis procedures employed in this area.

This research is justified to the extent that it contributes to the formulation and adequacy of curricula of undergraduate and postgraduate courses in the area of energy and smart buildings regarding the data analysis disciplines to be offered.

A bibliometric analysis of articles extracted from the Scopus database was performed, which involved evaluating several types of information, such as the evolution of publications, research areas, number of publications by region and country, and most cited key terms. On this basis, it was possible to identify the most relevant publications in the area as well as the most important authors and research centers, and thus to evaluate the relevance of the subject in recent years. It was also possible to see the timing and sequence in which data analysis procedures were introduced in these research areas. The main publishing journal and the most relevant papers were identified as well.

The article is organized in three sections: the methods, the results and findings, and the conclusions. The methods section presents in detail the methodological approach adopted. The results and findings section presents the data extracted from the database, as well as the discussion of the main findings. Finally, the conclusions present the most relevant comments on the most important points of the research.

Method for bibliometrics analysis

The method employed for this research is illustrated in Fig. 1.

Fig. 1 Research method (Source: Li et al. 2015a, b)

Research subject

This research started by defining the key terms to be used. It is worth recalling that the main interest of this research lay in identifying as many papers as possible dealing with data analysis techniques among the papers concerned with energy efficiency in buildings, which, in turn, were extracted from a pool of all papers related to energy efficiency. Figure 2 represents the research strategy.

Fig. 2

Defining the research key terms

Scopus was the chosen database. Only journal papers published from 1980 to 2016 were considered. To obtain the nested selection suggested by Fig. 2, the search initially retrieved papers containing terms related to “Energy Efficiency”. Querying these terms in the title returned 96,400 documents; the query is available in “Appendix 1”.

The next search also required the term “Building” in the title and returned 3678 publications, according to the query in “Appendix 2”.

The final query combined all the terms of the previous searches with terms related to “Data Analysis Techniques” in the title, abstract, and keywords. It returned 513 articles and is available in “Appendix 3”. The entire bibliometric analysis was based on these final results.
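The three successive searches form a funnel in which each stage filters the output of the previous one. The minimal sketch below illustrates that nesting on toy records. The term lists, the record fields and the `staged_search` helper are hypothetical stand-ins, since the actual Scopus queries are given only in Appendices 1 to 3.

```python
def contains_any(text, terms):
    """True if any search term occurs in the text (case-insensitive)."""
    text = text.lower()
    return any(term in text for term in terms)


def staged_search(records, efficiency_terms, building_terms, analysis_terms):
    """Apply the three nested filters and return the result of each stage."""
    # Stage 1: energy-efficiency terms in the title.
    stage1 = [r for r in records if contains_any(r["title"], efficiency_terms)]
    # Stage 2: additionally, a building term in the title.
    stage2 = [r for r in stage1 if contains_any(r["title"], building_terms)]
    # Stage 3: additionally, a data-analysis term in title, abstract or keywords.
    stage3 = [
        r for r in stage2
        if contains_any(" ".join([r["title"], r["abstract"], *r["keywords"]]),
                        analysis_terms)
    ]
    return stage1, stage2, stage3


# Toy records, invented for illustration only.
records = [
    {"title": "Energy efficiency of office buildings",
     "abstract": "A regression analysis of metered consumption.",
     "keywords": ["regression analysis"]},
    {"title": "Energy efficiency in building retrofits",
     "abstract": "A policy review.",
     "keywords": ["policy"]},
    {"title": "Energy efficiency of electric motors",
     "abstract": "Test bench results.",
     "keywords": ["motors"]},
    {"title": "Thermal comfort in schools",
     "abstract": "A survey.",
     "keywords": ["comfort"]},
]

s1, s2, s3 = staged_search(
    records,
    efficiency_terms=["energy efficiency"],       # hypothetical term list
    building_terms=["building"],                  # hypothetical term list
    analysis_terms=["regression", "neural network"],  # hypothetical term list
)
```

In the study itself the same narrowing took the result set from 96,400 documents to 3678 and then to 513.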

Data retrieved

This step retrieved information from the 513 articles, such as title, keywords, year of publication, authors, affiliations and number of citations.

Data analysis

The previous step retrieved a large amount of information that was summarized, for further analysis, into tables and several types of graphs by means of software packages such as Microsoft Excel, VOSviewer and Minitab.

Results

The results obtained included the evolution of the publications; the most common research areas; the number of publications per area; the most cited key terms; the distribution of institutions; the distribution of authors; and the most cited articles.

Findings

The findings covered the trend of publications and the relevance of the theme; the types of articles published in the several knowledge areas; the relevant areas for the theme; the journal with the most publications in the knowledge area; the evolution of publications per region; the number of publications by country; the number of citations per publication; the percentage of non-cited papers; the number of citations of the most cited paper; a score plot of the countries; an analysis of the most used key terms; and the main institutions and authors.

Many of the 513 articles initially retrieved mentioned a data analysis technique only superficially. The number of techniques actually applied across these articles totals 296. In some cases, more than one technique was applied in the same article.

Results and findings

Analysis of the evolution of publications

The number of publications related to the subject over the period under analysis has been taken as a measure of scientific productivity and interest in the subject. Figure 3 shows a clear upward trend from 1980 (2 articles) to 2016 (80 articles). About 53% of the articles in this period were published in the last four years (2013–2016).

Fig. 3 Evolution of the publications over the period 1980–2016 according to the Scopus database

According to Price’s law (1963), the growth of scientific production in a field follows an exponential function, and Fig. 4 shows that the growth pattern of the subject under research in this paper fits an exponential function. This statement is supported by the coefficient of determination R², which is approximately 0.96, meaning that such a model can explain 96% of the total variability of the data.
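An exponential trend of this kind can be fitted by ordinary least squares on the logarithm of the yearly counts. The sketch below is one way to do that and is not necessarily the procedure used for Fig. 4: the `fit_exponential` name is hypothetical, R² is computed on the log scale (an assumption), and the series is synthetic rather than the Scopus counts.

```python
import numpy as np


def fit_exponential(years, counts):
    """Fit counts ~ a * exp(b * (year - first_year)) via least squares on log(counts).

    Returns (a, b, r2), where r2 is the coefficient of determination
    computed on the log-transformed counts.
    """
    years = np.asarray(years, dtype=float)
    counts = np.asarray(counts, dtype=float)
    t = years - years[0]
    log_y = np.log(counts)  # requires strictly positive counts in every year
    b, log_a = np.polyfit(t, log_y, 1)  # slope first, then intercept
    residuals = log_y - (log_a + b * t)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((log_y - log_y.mean()) ** 2)
    return np.exp(log_a), b, 1.0 - ss_res / ss_tot


# Synthetic series (NOT the paper's data): exact exponential growth.
years = np.arange(1980, 2017)
counts = 2.0 * np.exp(0.10 * (years - 1980))
a, b, r2 = fit_exponential(years, counts)
```

Years with zero publications would have to be dropped or handled separately, because the logarithm is undefined for them.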

Fig. 4 Growth pattern of the publications over the period 1980–2016 according to the Scopus database

According to Dabi et al. (2016), “the main hypothesis of Price’s law is that the development of science follows an exponential growth. The growth of a scientific domain goes through four phases”. A closer look at Fig. 4 makes it possible to identify the first three of these phases, as can be seen in Fig. 5.

Fig. 5 Phases of growth of a scientific domain according to Price’s law (Dabi et al. 2016)

The first phase, which extends from 1980 to roughly 1992, is the precursors’ phase; according to Dabi et al. (2016), “during this phase only a small number of researches begins publishing”. In this study, this phase accounts for about 4.4% of the whole publication body. The second phase (1992–2013) is that of exponential growth proper. “During this phase the expansion of the field attracts many researchers as many aspects of the subject still have to be explored” (Dabi et al. 2016). The data fit an exponential function well, since R² is very close to 1.00. It is worth mentioning that the initial and final years of this phase were set so as to maximize R² for this and the subsequent phase. During this phase, the number of publications doubled approximately every 7 years. In the third phase (2013–2016), the body of knowledge is consolidated and the growth of scientific production becomes linear (Dabi et al. 2016). Figure 5 shows that the retrieved data fit a linear function very well. It is important to note that the next phase, according to Dabi et al. (2016), “corresponds to the collapse of the domain and is marked by a decrease in the number of the publications. The aspect of the curve transforms from exponential to logistic”, reaching a ceiling value after passing through an inflection point.
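The doubling time quoted for the second phase is tied to the exponential growth rate by a short derivation. The symbols N_0, b and T_d are introduced here for illustration and do not appear in the original text; the numerical rate is only what a doubling time of about 7 years would imply, not a fitted value reported by the authors.

```latex
\[
\begin{aligned}
N(t) &= N_0\, e^{b t}, \\
N(t + T_d) = 2\,N(t) \;&\Longrightarrow\; T_d = \frac{\ln 2}{b}, \\
T_d \approx 7\ \text{years} \;&\Longrightarrow\; b = \frac{\ln 2}{7} \approx 0.099\ \text{year}^{-1},
\qquad e^{b} - 1 \approx 0.104 .
\end{aligned}
\]
```

In other words, doubling every 7 years corresponds to an annual growth of roughly 10% in the number of publications.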

Based on the analysis above, it can be concluded that the number of publications in this field from 1980 to 2016 follows Price’s law and that an inflection point has not yet been reached.
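The text states only that the boundary years of the exponential phase were chosen to maximize R² for that phase and the following one; the exact procedure is not given. One plausible reading, sketched below on synthetic data, scans every candidate boundary and keeps the one with the highest sum of the two R² values. The function names, the sum-of-R² criterion and the log-scale fit of the exponential phase are all assumptions.

```python
import numpy as np


def r2_linear(x, y):
    """R-squared of an ordinary least-squares line y ~ m * x + c."""
    m, c = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (m * x + c)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def best_breakpoint(t, y, min_points=3):
    """Index at which an exponential phase hands over to a linear phase.

    The exponential phase is fitted as a line on log(y); the boundary
    point is shared by both phases. Each phase keeps at least min_points.
    """
    best_index, best_score = None, -np.inf
    for i in range(min_points - 1, len(t) - min_points + 1):
        score = r2_linear(t[: i + 1], np.log(y[: i + 1])) + r2_linear(t[i:], y[i:])
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Synthetic series (NOT the paper's data): exponential up to t = 20, linear afterwards.
t = np.arange(25.0)
y = np.where(t <= 20, 2.0 * np.exp(0.1 * t), 2.0 * np.exp(2.0) + 3.0 * (t - 20))
index, score = best_breakpoint(t, y)
```

With noisy real counts the maximum is usually much flatter, so neighbouring boundary years may be almost equally good.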

The evolution of the number of articles related to energy efficiency in buildings over the analyzed period can be explained by the fact that residential, commercial and public buildings account for approximately 60% of global energy (UNEP 2016) and for 10% of global greenhouse gas emissions, according to the International Energy Agency (Soares et al. 2017). The development of energy-efficient buildings has therefore been the subject of a great deal of research on sustainable development, which has attracted researchers’ attention worldwide. In Europe, for example, where energy consumption in this sector is greater than 40%, several incentive programs have been created to promote the rational use of energy in buildings after 2010. One instance is the Energy Performance of Buildings Directive, which aims at increasing the energy performance of European enterprises and estimates that in 2020 most new buildings will meet their electricity demand without being connected to the power grid, with the electric power produced locally from renewable sources. These new buildings will be known as very low energy buildings or nearly zero energy buildings (ECEEE 2010). This set of actions has motivated the increasing number of publications in this area, which follows an exponential function, as can be seen in Fig. 6.

Fig. 6 Growth pattern of the publications over the period 1990–2016 in Europe, according to the Scopus database

Compared to Europe, publication in this field in North America began 10 years earlier, owing to the great attention paid by the United States to the theme. The expectation of an energy crisis, which actually arose in California (2000–2001), spurred many programs aimed at promoting energy efficiency in buildings. These were followed by several publications whose growth followed the same pattern as the world growth, as can be seen in Fig. 7.

Fig. 7 Growth pattern of the publications over the period 1990–2016 in North America, according to the Scopus database

The number of publications in Asia has followed the same pattern as in Europe, as can be seen in Fig. 8.

Fig. 8 Growth pattern of the publications over the period 1990–2016 in Asia, according to the Scopus database

Research areas

This section discusses the relevance of the theme for several research areas. Figure 9 presents the number of publications per area for the eighteen most relevant research areas.

Fig. 9 Research areas where articles were published according to the Scopus database

The four most relevant areas for the theme are Engineering, Energy, Environmental Science and Computer Science. The subject has also attracted the attention of some peripheral areas, such as Social Sciences; Business, Management and Accounting; Economics, Econometrics and Finance; Arts and Humanities; and Medicine. One reason for such an intersection is that many publications were classified into more than one area, but scientific interests can also explain the overlap between the areas.

The Social Sciences area contains publications dealing with thermal comfort and occupants’ behavior. According to Nghana and Tariky (2016), 62% of building energy use goes towards maintaining thermal comfort, and a frequent concern is reducing energy consumption while preserving the occupants’ comfort (Cetin et al. 2016; Koumoutsos et al. 2015). To achieve this goal, it is important to understand the occupants’ behavior, which explains the interest of some social science studies in this area.

The Business, Management and Accounting area groups articles that address the subject from the standpoint of planning and management tools for energy-efficient buildings. These tools simulate energy consumption in buildings, facilitating comparative analyses of the energy consumed in relation to the efficient operation of the buildings and the behavioral changes of the occupants (Kim and Yu 2016; Azizi et al. 2014; Kontokosta 2015).

The Economics, Econometrics and Finance area presents papers concerned with the market values of buildings. Stricter energy efficiency requirements for new construction make less efficient buildings subject to greater economic depreciation, with declining market values (Surmann et al. 2015; Cajias and Piazolo 2013). This area is interested in the relationship between energy aspects and the demand for energy-efficient buildings.

The Arts and Humanities area contains papers that approach the theme from the standpoint of historical heritage. The assessment of historical buildings may help to identify actions that reduce energy needs, which can be reflected in a set of general interventions in the management of historical architecture. This approach is new in the field of conservation, restoration, and reuse of Cultural Heritage (Magrini and Franco 2016).

The Medicine area presents publications dealing with health problems resulting from CO2 emissions. There is growing concern about energy consumption in buildings in many countries as part of “Low Carbon Cities” programmes that strive to reduce greenhouse-gas emissions; these programmes can reduce energy consumption and also alleviate the health problems of occupants (Norbäck et al. 2014).

Scientific journals

The 513 papers were published in 155 journals, as shown in Appendix 4. It is worth analyzing how these articles are distributed among the different journals. Bradford’s law of scattering (1948 apud Palomo et al. 2017) states that: “if scientific journals are arranged in order of decreasing productivity of articles, they may be divided into a nucleus of periodicals, more particularly devoted to the subject, and several groups, or zones, (normally 3 as defined by Bradford), containing the same number of articles as the nucleus. Where the number of periodicals in the nucleus and succeeding zones will be [1:n:n²].” This means that a small set of journals (the nucleus) accounts for about one-third of the articles, whereas the other sets of journals have to be progressively larger in order to account for the same number of published articles. This suggests that there is an inverse relationship between the number of articles and the number of journals in which they were published (Palomo et al. 2017).

The journals within each zone are identified in “Appendix 4”. The theoretical number of articles per zone, according to Bradford’s law, should be 513/3 = 171, or 33% of the total. Table 1 shows that the three zones contain roughly the same number of articles. The numbers of journals in the three zones are in the proportion [1:32:122], which is roughly equal to [1:32:128], that is, [2⁰:2⁵:2⁷]. Therefore, this distribution does not fit the original Bradford distribution [1:n:n²], since no single Bradford multiplier n satisfies both ratios.
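The check on the Bradford multiplier can be made explicit with a few lines of arithmetic. The sketch below uses only the zone sizes and the article total reported above; the `bradford_multipliers` helper is a hypothetical name introduced for illustration.

```python
def bradford_multipliers(zone_sizes):
    """Ratio between the journal counts of each pair of consecutive zones."""
    return [later / earlier for earlier, later in zip(zone_sizes, zone_sizes[1:])]


total_articles = 513
zone_journals = [1, 32, 122]  # journals per zone, as reported in the text

# Theoretical number of articles per zone under Bradford's law.
target_per_zone = total_articles / len(zone_journals)

# Under Bradford's law both ratios would equal the same multiplier n.
multipliers = bradford_multipliers(zone_journals)

# Third-zone size implied by taking n from the first ratio.
expected_third_zone = multipliers[0] ** 2
```

The two ratios come out as 32 and about 3.8, and a multiplier of 32 would imply 1024 journals in the third zone instead of the 122 observed.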

Table 1 Distribution of the journals

Table 1 presents the distribution of the journals in three zones, as originally defined by Bradford’s law.

Even if the distribution presented in Table 1 is changed by moving one or more journals from one zone to another while keeping roughly the same proportion of articles among the zones, it still will not fit the original Bradford distribution.

Although the observed data do not follow Bradford’s law, it remains true that “few journals publish a relatively high percent of the articles in the field, and there are many journals that publish only few articles each” (Diodato 2012), as can be seen in Fig. 10, which presents one of the several versions of the Bradford curve. The curve shows that 10% of the journals published 60% of the papers and that about 60% of the journals published only one paper.

Fig. 10 Bradford curve

The most productive journal in this area is Energy and Buildings, which published 153 articles and constitutes the nucleus. The second zone groups 32 journals, each of which published at least four articles. Within this zone, the most productive is Applied Energy, with 36 papers, and the least productive, with 4 papers each, are Architectural Science Review; Building Services Engineering Research and Technology; Building Simulation; HVAC and R Research; Journal of Building Physics; Renewable and Sustainable Energy Reviews; Tumu Jianzhu Yu Huanjing Gongcheng Journal of Civil Architectural and Environmental Engineering; and Applied Thermal Engineering. The third zone clusters 122 journals: 12 with 3 articles each, 24 with 2 articles each and 86 with only one paper.

Table 2 shows the journals within the first and second zones.

Table 2 Journals within the first and second zones

The journal with the greatest number of publications is Energy and Buildings (153), followed by Applied Energy (36), Energy (22), Building and Environment (20), and Energy Conversion and Management (16).

Judging by the number of publications, Energy and Buildings is the most important journal. Most researchers have chosen it to publish their articles on data analysis applied to energy efficiency in buildings because it is concerned with energy use in buildings and covers a broad range of twenty-six topics (Appendix 5), from energy conservation to smart buildings, including thermal comfort, energy management, modeling and energy sustainability. Most of the papers published are linked to laboratory or field measurements, comparisons of results, and replication studies (Elsevier 2017a, b, c, d, e).

The second most important journal is Applied Energy, which deals with problems of modeling and forecasting, energy conservation strategies, and the environmental, social and economic impacts of energy policies and usage. It covers about 8 topics (Appendix 5) (Elsevier 2017a, b, c, d, e).

The third most important journal is Energy. It covers twelve topics (Appendix 5) on research in mechanical engineering and thermal sciences, including energy analysis, energy modelling and prediction, integrated energy systems, energy planning and energy management (Elsevier 2017a, b, c, d, e).

The fourth most important journal is Building and Environment, which covers 4 topics (Appendix 5). It publishes papers related to building science and human interaction with the built environment (Elsevier 2017a, b, c, d, e).

The fifth most important journal is Energy Conversion and Management, which publishes papers dealing with modeling, experimental, analysis and optimization issues, covering 8 topics (Appendix 5) concerned with interdisciplinary energy subjects related to advanced technologies (Elsevier 2017a, b, c, d, e).

The main metrics for these journals are presented in Table 3.

Table 3 Journal metrics

Energy and Buildings and Applied Energy are related to the research area Engineering; Energy and Energy Conversion and Management are linked to the research area Energy; and Building and Environment is concerned with subjects related to the research area Social Sciences.

Table 4 presents the journals of the first and second zones distributed among five research areas.

Table 4 Distribution of journals within the first and second zones according to the research area

It can be seen that the Engineering (16) and Energy (10) areas stand out with the greatest number of journals, followed by Environmental Science (4), Social Sciences (2) and Materials Science (1).

Number of publications by country

The descriptive bibliometric results for the countries that have published at least five articles are shown in Table 5.

Table 5 Countries with number of publications ≥ 5 (1980–2016)

Figure 11 shows the first five countries in number of publications.

Fig. 11 Number of publications per country

It can be seen that the United States (134), China (82) and the United Kingdom (41) stand out. Germany (20) and Italy (20) share fifth place among the most productive countries.

The profile of citations is shown in Fig. 12.

Fig. 12 Number of citations per country

In this figure, as in the previous chart, the first three positions are occupied by the United States (3103), China (1102) and the United Kingdom (974). The only difference between this chart and that shown in Fig. 11 is the fifth position. This suggests a correlation between the number of publications and the number of citations, which can be explained by the fact that the authors’ collaboration network is mainly composed of authors from the same country.

The number of citations per publication (CPP) is presented in Fig. 13.

Fig. 13 Number of citations per publication

The chart presents several countries that were not presented in the former charts. The first two positions are filled by Greece (49.84) and Switzerland (49.34), which are the fourteenth and sixteenth, respectively, in number of publications. Because the number of citations per paper is greater than the number of publications, it is possible to conclude that these publications have been cited by authors abroad. In this case we can conclude that these countries have produced good publications in the area.

The percentage of non-cited publications (PNC) is illustrated in Fig. 14.

Fig. 14 Percentage of publications non-cited

The fourth position is occupied by Germany (22.40%), Brazil (20.00%), Belgium (20.00%) and Egypt (20.00%). Approximately 20% of the publications of these countries have not been cited. Another point worth mentioning is that, although China has a great number of publications, more than 25% of them have not been cited. Malaysia (33.34%) and India (28.58%) also have a high percentage of non-cited publications.
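The country-level indicators discussed in this section (publications, citations, CPP, PNC and citations of the most-cited paper) can all be derived from the citation count of each paper. The sketch below shows one way to compute them; the `country_indicators` helper and the toy citation counts are hypothetical and are not taken from Table 5.

```python
def country_indicators(citations_per_paper):
    """Summarize one country's papers from their individual citation counts."""
    n_papers = len(citations_per_paper)
    total_citations = sum(citations_per_paper)
    n_uncited = sum(1 for c in citations_per_paper if c == 0)
    return {
        "publications": n_papers,
        "citations": total_citations,
        "cpp": total_citations / n_papers,     # citations per publication
        "pnc": 100.0 * n_uncited / n_papers,   # percentage of non-cited papers
        "max_cites": max(citations_per_paper), # citations of the most-cited paper
    }


# Toy country with five papers (invented values).
indicators = country_indicators([10, 0, 5, 0, 25])
```

As a check against the figures quoted above, the United States would have a CPP of about 3103/134 ≈ 23.2.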

Figure 15 shows the number of citations of the most-cited paper for several countries.

Fig. 15 Number of citations of the most-cited article

The graph highlights the position of the United States (642). The United Kingdom (205) is ranked second and Brazil (185) third. The fourth position is occupied by Singapore (161), Japan (160) and Canada (160). The most-cited article from the United States is far ahead of the most-cited article from any other country.

Table 6 shows the most-cited articles used to produce Fig. 15.

Table 6 The most-cited articles

It is remarkable that the most-cited articles were published in the same journal, Energy and Buildings. All of them were published from 2001 to 2016. It is worth mentioning that five of them are concerned with simulation. The one produced by Japan made use of a clustering algorithm to carry out the research.

The data from Table 5 have been submitted to a Principal Components Analysis (PCA), and Fig. 16 presents the first and the second components plotted against each other. Each data point represents a single country.

Fig. 16 Principal component analysis

The position of each country on the graph depicted in Fig. 16 depends on a linear combination of all the variables presented in Table 5. The loading plot for this analysis, shown in the upper right of the graph, helps to explain the position of each country. The greater the number of publications, citations and maximum citations, the further to the right the country will be. The greater the CPP, the higher the country will be on the graph. The greater the PNC, the lower the country will be on the graph.

The United States stands out because of the number of publications, citations and the number of citations of the most-cited article. Switzerland stands out because of the number of citations per publication. Although China has a great number of publications and citations, it also has a high percentage of non-cited publications. India and Malaysia are at the bottom left of the graph because their percentage of non-cited papers is significant.
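A score plot and a loading plot of this kind can be obtained from a singular value decomposition of the standardized indicator table. The sketch below is illustrative only: the study used Minitab, the `pca_scores` helper is a hypothetical name, standardizing each variable is an assumption, and the six rows are invented countries rather than the values of Table 5.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """PCA on standardized columns via SVD.

    Returns (scores, loadings, explained), where explained holds the share
    of total variance carried by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # zero mean, unit variance
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # country coordinates
    loadings = Vt[:n_components].T                    # variable directions
    explained = (S ** 2 / np.sum(S ** 2))[:n_components]
    return scores, loadings, explained


# Hypothetical rows: publications, citations, CPP, PNC (%), max cites.
X = np.array([
    [120, 3000, 25.0, 10.0, 600],
    [80, 1100, 13.8, 26.0, 150],
    [40, 950, 23.8, 12.0, 200],
    [20, 400, 20.0, 20.0, 90],
    [10, 500, 50.0, 0.0, 160],
    [7, 60, 8.6, 33.0, 30],
])
scores, loadings, explained = pca_scores(X)
```

Plotting the first column of `scores` against the second gives the score plot, and the rows of `loadings` give the arrows of the loading plot. The sign of each component is arbitrary, so an axis may appear mirrored relative to another package.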

Analysis of the key terms

The pattern of key terms that appear in publications of a given field of knowledge defines the underlying themes of that field. Therefore, when a set of publications is analyzed, “the ideas, concepts, and methods that constitute this field of knowledge are defined by clusters of key terms that reflect commonality within a field of scientific research” (Phillips et al. 2015).

According to Phillips et al. (2015), there are several strategies for visualizing textual data. This paper uses bibliometric maps, generated by the VOSviewer software, to visualize keyword associations. The interpretation of such maps is based on the following:

  • The size of the visual representation of terms is related to the frequency they appear in the retrieved articles.

  • The relative position of terms in the map reflects their relative association: pairs of terms that are commonly associated are positioned close to one another, and terms with a low degree of association are positioned far from one another.

  • The clustering of terms establishes related conceptual domains.

In order to visualize changes in the thematic focus of this research field, three bibliometric maps covering the period of analysis (1980–1990, 1980–2000, 1980–2016) will be employed. Thus, it will be possible to study the formation and evolution of the clusters as a way to understand the development of this field.

Figure 17 shows the bibliometric map ranging from 1980 to 1990.

Fig. 17 Bibliometric map ranging from 1980 to 1990

During this period, 22 articles were produced, totaling 81 keywords. In the interest of clarity, Fig. 17 presents only the 12 keywords with a frequency greater than or equal to 2, forming 4 clusters.

The clustering process is intended to form heterogeneous groups of homogeneous individuals. In this case, the clustering algorithm uses the frequency and connection strength of the keywords to create the clusters. Since many keywords were cited only once, many clusters containing only one element would arise, obscuring the analysis. For this reason, only the keywords that appeared 2 or more times were used.
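The two inputs mentioned above, keyword frequency and connection strength, can be computed from the keyword lists of the articles before any clustering takes place. The sketch below counts occurrences, applies the minimum-frequency threshold and tallies co-occurrence links. It does not reproduce the clustering performed by VOSviewer, the `cooccurrence` helper is a hypothetical name, and the four toy articles are invented (only the keyword strings are borrowed from the text).

```python
from collections import Counter
from itertools import combinations


def cooccurrence(keyword_lists, min_occurrences=2):
    """Keyword frequencies above a threshold and pairwise co-occurrence counts."""
    # Count each keyword at most once per article.
    frequency = Counter(k for keywords in keyword_lists for k in set(keywords))
    kept = {k for k, n in frequency.items() if n >= min_occurrences}

    links = Counter()
    for keywords in keyword_lists:
        present = sorted(set(keywords) & kept)
        # Every pair of retained keywords in the same article adds one link.
        for first, second in combinations(present, 2):
            links[(first, second)] += 1
    return {k: frequency[k] for k in kept}, links


# Toy keyword lists, one per article (invented composition).
articles = [
    ["energy conservation", "heating", "air conditioning"],
    ["energy conservation", "heating"],
    ["electric lighting", "computer simulation"],
    ["electric lighting", "computer simulation", "solar radiation"],
]
kept_frequency, links = cooccurrence(articles, min_occurrences=2)
```

Keywords that occur only once ("air conditioning" and "solar radiation" in this toy set) are dropped before links are counted, which mirrors the threshold described above.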

The oil crisis in the middle of the 1970s turned the world’s attention to the rational use of energy, which was reflected in terms such as “energy conservation”, “energy savings”, and “energy utilization” in publications at the beginning of the 1980s. These terms were grouped in Cluster 1.

Thermal comfort systems and lighting were, and still are, important energy loads, so it was to be expected that these themes would be explored. Cluster 3 groups terms related to thermal comfort, such as “air conditioning”, “heating”, and “meteorology-climatology”. Cluster 4 brings together terms concerned with lighting, such as “electric lighting” and “solar radiation”; the latter term reflects a trend at that time of making better use of natural lighting in order to save energy.

With the advent of personal computers in the 1980s, it was also to be expected that data analysis software would be introduced in this research area. In cluster 4, one can see the keyword “computer simulation”, which fits well with the other elements of this cluster, as it could refer to the use of dedicated software for lighting design.

Since this was the starting point of this research area, the number of clusters and the number of elements per cluster increased as time passed, as can be seen in Fig. 18, which presents the bibliometric map for 1980 to 2000.

Fig. 18 Bibliometric map ranging from 1980 to 2000

The bibliometric map presented in Fig. 18 shows 43 keywords that appeared two or more times in 58 articles, grouped into 7 clusters with at least 2 keywords each. This map shows the introduction of new terms such as “data structures”, “mathematical models”, “statistical methods”, “regression analysis” and “correlation methods” (univariate data analysis techniques), and “principal component analysis” (multivariate data analysis technique). It can therefore be concluded that this decade was marked by the introduction of univariate data analysis techniques in the publications in this area.

The map also shows the first country cited in the articles, the “United States” of America (USA), which is in accordance with the fact that it was the precursor in this research field, as can be seen in Fig. 3. Also remarkable is the very first appearance, in this kind of study, of the term “environmental impact”.

This map shows that cluster 4 in Fig. 17 gained 6 more words and that the correlation of the term “solar radiation” with the others weakened, forcing it to group with 6 new terms to form a new cluster, cluster 7 in Fig. 18. Clusters 1 and 3 from Fig. 17 merged to form cluster 6 in Fig. 18. Cluster 2 in Fig. 17 grew with the inclusion of three new terms. Clusters 1, 3, and 5 in Fig. 18 are formed by keywords that did not appear in the preceding map.

From Fig. 18 it is possible to infer the emergence of a pattern in the cluster formation. Some clusters seem to specialize in one topic of the research area. For example, cluster 1 mainly groups terms related to data analysis techniques, such as “data structures”, “mathematical models”, “statistical methods”, and “computer software”. Cluster 4 groups the electric loads in a building, such as “energy use”, “electric lighting”, “ventilation”, and “air conditioning”. In some of the other clusters, the word association pattern is not yet clear, but the more words are included in the clusters, the clearer the pattern becomes.

The final map covers the period ranging from 1980 to 2016 (Fig. 19). This map shows 300 keywords that appeared four or more times from 513 articles, grouped into 7 clusters.

Fig. 19 Bibliometric map ranging from 1980 to 2016
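The keyword selection behind a map of this kind (keep the terms that occur at least a given number of times, then count how often the retained terms appear together in the same article) can be sketched as follows. This is a minimal illustration only: the function name `keyword_cooccurrence` and the toy keyword lists are hypothetical, and the clustering step performed by the mapping software is not reproduced.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(articles, min_occurrences):
    """Count keyword occurrences and co-occurrences across articles.

    `articles` is a list of keyword lists, one per article. Only keywords
    that appear in at least `min_occurrences` articles are kept.
    """
    # Count each keyword once per article.
    occurrences = Counter(kw for kws in articles for kw in set(kws))
    kept = {kw for kw, n in occurrences.items() if n >= min_occurrences}

    # Count how many articles mention each pair of retained keywords.
    links = Counter()
    for kws in articles:
        present = sorted(set(kws) & kept)
        links.update(combinations(present, 2))
    return {kw: occurrences[kw] for kw in kept}, links


# Hypothetical toy data, not taken from the 513 articles.
articles = [
    ["energy efficiency", "neural networks", "forecasting"],
    ["energy efficiency", "forecasting"],
    ["energy efficiency", "thermal comfort"],
    ["neural networks", "forecasting"],
]
counts, links = keyword_cooccurrence(articles, min_occurrences=2)
```

With a threshold of four occurrences and the full keyword lists, the same filtering would yield the set of terms that a map such as Fig. 19 displays; the link counts are the raw material from which clusters are then formed.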

This map includes 257 new key terms, so that it contains roughly seven times as many words as the preceding map. As a result, the cluster structure is different: many of the clusters in the previous map merged with each other, some disappeared, and others arose.

Environmental questions gained importance after the year 2000, so many authors tried to draw readers’ attention by including among the keywords terms like “gas emissions”, “global warming”, “climatic index”, “climate models”, and “climate changes”. For the same reason, terms related to renewable energy appeared many times: “renewable energy resources”, “wind power”, “wind effect”, “solar power”, and “alternative energy”.

Terms related to well-being can also be seen, such as “comfort level”, “thermal comfort”, “indoor air quality”, “indoor air pollution”, and “indoor environment”, which means that many articles focused on the wellness of building occupants.

Over this period, more attention was paid to building materials and building design as means to improve building energy efficiency; thus terms like “building materials”, “recycling”, “phase change materials”, “concrete aggregates”, “structural design”, “architectural design”, and “computer aided design” arose among the keywords.

Although univariate statistics were widely applied in the articles, this period was marked by the introduction of data analysis techniques based on computational intelligence; thus terms like “neural networks”, “artificial neural networks”, “genetic algorithms”, “adaptive algorithms”, “artificial intelligence”, and “back propagation” appeared in the articles.

Many articles mentioned the geographical region where the research took place: “Far East”, “Asia”, “Middle East”, “Eurasia”, “Europe”, “Southern Europe”, “Northern Europe”, and “Scandinavia”; “Singapore”, “China”, “Hong Kong”, and “Italy”. This shows that over this period the subject spread all over the world.

Terms like “intelligent buildings”, “building management systems”, and “intelligent structures” suggest the rise of the smart building concept.

It is remarkable that terms concerned with cost and technological innovation appeared so few times in the literature: “energy cost”, “cost effectiveness”, and “innovation”.

Terms concerned with multivariate analysis like “cluster analysis”, “multiple regression analysis”, “factor analysis”, and “multivariant analysis” are also present among the keywords.

To analyze each cluster, it is worth classifying its elements into categories so that clusters can be compared. Before proceeding with this classification, it is necessary to create such categories, which is not an easy task.

The keywords were extracted from technical papers, which have a quite similar structure; that is, they are built around the same elements: main objective, additional objectives, research focus, method, data analysis techniques, and delimitation. These elements were therefore chosen to identify the categories.

The classification process is not straightforward. While some words can be clearly classified into a category, others can only roughly be grouped into one. The connection between a term and a category is sometimes made by an underlying sense. For example, the term “meteorological data” is connected to simulation, since such data are one of the inputs of simulation software.

For the sake of clarity, the clusters identified in Fig. 19 are presented and analyzed individually. Figure 20 shows cluster 1.

Fig. 20 Cluster 1 from Fig. 19

The elements of Cluster 1 can be roughly classified according to Table 7.

Table 7 Classification of individuals of cluster 1

It is possible to infer that this cluster groups terms from articles whose main objective is to “reduce energy consumption”, to improve the energy-related “building performance”, and to make “energy utilization”, “building energy use”, and “building operations” more rational. This objective is pursued while taking into account the occupants’ wellness, by dealing with “indoor air”, “indoor air quality”, “indoor pollution”, “air quality”, and “carbon dioxide”.

Terms like “heat flux”, “waste heat”, “ventilation”, “air conditioning”, “cooling energy”, and “cooling” allow one to conclude that the articles within this cluster focused on Heating, Ventilation, and Air Conditioning (HVAC) systems. The words “atmosphere temperature” and “outdoor temperature” are indirectly related to HVAC systems.

The terms “office buildings” and “school buildings” suggest that the studies described were delimited to commercial buildings.

The methods employed by the authors were “computer simulation” or “building simulation”. “Mathematics models” are necessary to carry out simulations, and “computational fluid dynamics” is also related to simulation. “EnergyPlus” is building energy simulation software used to evaluate the energy consumption for heating, cooling, ventilation, lighting, and plug loads. The most cited data analysis techniques were “linear regression” and “correlation methods”.

Figure 21 presents the second key terms cluster.

Fig. 21 Cluster 2 from Fig. 19

As done before, the keywords from Fig. 21 are classified according to Table 8.

Table 8 Classification of individuals of cluster 2

The main objective of the articles within this cluster is “energy efficiency”, “energy conservation”, “energy performance”, or “energy saving potential” applied to buildings, as can be concluded from the terms “energy efficiency in buildings”, “energy-efficient buildings”, and “energy performance of buildings”. These papers consider environmental questions like “global warming”, “gas emission”, “greenhouse gases”, and “sustainable development”, and signal the idea of “sustainable building”. It seems that the studies were delimited to “commercial buildings”, with some emphasis on research done in Singapore.

The research focus lies in three areas: building energy management, as suggested by “energy management”, “building management system”, and “information management”; automation, as can be deduced from terms like “climate control”, “control systems”, and “intelligent buildings”; and architectural design.

The preferred methodological approaches in these studies are simulation (“building energy simulation tools”, “computer simulation software”, and “software”) and the “design of experiments”.

The third cluster of keywords is presented in Fig. 22.

Fig. 22 Cluster 3 from Fig. 19

The classification of the keywords presented in Fig. 22 is shown in Table 9.

Table 9 Classification of individuals of cluster 3

The main objective of the papers in this cluster is to evaluate building energy use (“energy use” and “building energy simulation”), taking into account “climate change” and “climate effect”. A large part of the studies grouped in this cluster were carried out in China and were not delimited to a specific type of building.

These studies focused on thermal comfort systems (“cooling load”, “heating load”, and “heat gains”). The approach used was mainly simulation, as inferred from the terms “climate models”, “meteorological data”, “weather conditions”, “typical meteorological year”, and “solar radiation”.

Some of the articles within this cluster innovate by using principal component analysis as a data analysis technique.

Figure 23 presents the fourth cluster.

Fig. 23 Cluster 4 from Fig. 19

Table 10 presents a classification of the keywords from Fig. 23.

Table 10 Classification of individuals of cluster 4

The main objective of these articles is the analysis of “building energy consumption” (“building energy analysis”, “energy use intensities”). Therefore, such papers deal with “forecasting”, “prediction”, and “electric load forecasting”. Their focus lies on “forecasting algorithms”, “model validation”, “parameter estimation”, and “optimization”. These studies seem to have been conducted mainly in the USA.

In these papers, the authors employed a wide variety of data analysis techniques, ranging from basic univariate statistics (“statistical tests”, “mean square error”, “regression analysis”) to computational intelligence (“neural networks”, “fuzzy systems”, “genetic algorithms”, “data mining”) and possibly multivariate analysis (“clustering algorithms”).

The fifth cluster is presented in Fig. 24.

Fig. 24 Cluster 5 from Fig. 19

The classification of the terms of Fig. 24 is shown in Table 11.

Table 11 Classification of individuals of cluster 5

The keywords of this cluster suggest articles interested in energy efficiency in buildings that consider environmental and sustainability questions, as well as historical aspects and urban planning. These articles focus on four fronts: energy performance; thermal performance; construction techniques; and economic analysis. The method preferred by the authors of these papers was benchmarking. Such publications did not employ a diversity of data analysis techniques. The studies were delimited to buildings in general, and a large part of them took place in Europe, Asia, Eurasia, and Australia.

Figure 25 shows the sixth cluster.

Fig. 25 Cluster 6 from Fig. 19

The classification of the keywords from Fig. 25 is shown in Table 12.

Table 12 Classification of individuals of cluster 6

The articles within this cluster aim to improve “building energy efficiency” and “building energy saving”, and to decrease “electric consumption” and “electric power utilization”. They reach their goals by focusing on HVAC systems (“heating”, “heating energy”).

The data used by such papers were gathered by means of surveys and were analyzed by univariate statistics (“statistics”, “statistical analysis”, “correlation analysis”, and “decision trees”) and by multivariate statistics (“cluster analysis”, “multiple regression analysis”, “factor analysis”, and “multivariate analysis”).

Figure 26 presents the last cluster.

Fig. 26 Cluster 7 from Fig. 19

The keywords from Fig. 26 can be classified according to Table 13.

Table 13 Classification of individuals of cluster 7

Reducing the “energy demand”, the “energy usage”, and the “annual energy consumption” are the main objectives of the articles grouped into this cluster. The research focus is building automation (“building controls”, “identification control system”) and “behavior research”. “Energy benchmarking” and “energy simulation” are the methods used in these studies.

Several data analysis techniques were employed in these studies: descriptive statistics (“population statistics”, “probability distributions”); stochastic processes (“stochastic models”, “stochastic systems”, “Markov processes”); computational intelligence (“artificial intelligence”, “learning systems”); and operations research techniques (“data envelopment analysis”). The studies seem to have been carried out in buildings in general.

The seven clusters can be characterized, respectively, by the following terms: Buildings and Energy Uses; Building Energy Conservation; Energy Consumption; Energy Consumption Forecasting and Computational Intelligence; Energy Efficiency and Climate Effects; Building Energy Efficiency and Multivariate Statistics; and Building Energy Analysis and Stochastic Processes.

Observing the seven clusters, it can be seen that the most frequent method applied in the studies is simulation and the most used data analysis technique is regression analysis. Clusters 2, 3, and 5 group articles concerned with environmental questions.

Clusters 1, 3, and 6 focus on the occupants’ thermal comfort. Only cluster 5 deals with building materials.

The articles within clusters 4, 6, and 7 employ a large variety of data analysis techniques ranging from basic statistics to computational intelligence.

In cluster 4, most of the data analysis techniques mentioned are related to computational intelligence and most of the articles deal with forecasting and prediction.

In cluster 6, multivariate statistics were largely cited along with univariate descriptive statistics, which is in accordance with the method used by the articles.

Cluster 7 was dominated by stochastic processes as data analysis techniques, which seems to be in tune with the research focus.

Institutions analysis

Figure 27 presents the percent of institutions that published the articles, subdivided into regions.

Fig. 27 Percentage of institutions that published articles dealing with the theme, distributed by region

Table 14 presents the number of institutions that published the articles, by country.

Table 14 Number of institutions that published articles dealing with the theme, distributed by country

Asia stands out with sixty-five institutions, followed by Europe (49) and North America (40), which shows that these regions contributed the most to publications related to the theme. The same institutions are distributed by country in Table 14.

Most of the countries with the largest number of institutions are among the ten countries with the largest number of publications, as seen previously in the analysis of the number of publications by country.

Table 15 presents the number of publications for the twelve most relevant institutions.

Table 15 Analysis of institutions with the highest number of publications

The first rank is taken not by a university but by Lawrence Berkeley National Laboratory, a Department of Energy (DOE) laboratory managed by the University of California and located in the United States. The second, third, fourth, fifth, and seventh positions go to the Hong Kong Polytechnic University, City University of Hong Kong, Tongji University, Xi’an University of Architecture and Technology, and Tianjin University, all located in China. The sixth most productive institution is the National University of Singapore, in Singapore. The eighth position is occupied by Texas A&M University, United States, followed by Universiti Teknologi Malaysia in Malaysia, Tohoku University in Japan, UCL in the United Kingdom, and UC Berkeley in the United States.

The expectation of an energy crisis, which actually arose in California (2000–2001), encouraged many programs aimed at promoting energy efficiency in buildings. These programs were followed by several publications whose growth followed the same pattern as the world energy crisis, which may explain the prominence of the University of California, as can be seen from Table 15.

The five most prolific institutions are in the United States and China, which occupy the first two positions among the most productive countries (Table 5). Curiously, among the twelve most productive institutions presented in Table 15, there is only one European institution, University College London (UCL).

By reviewing the departments involved with the articles published by the top twelve institutions, it was noticed that most of them are purely related to energy, engineering and buildings, such as “Department of Building Services Engineering”, “Department of Mechanical Engineering”, “Department of Building” or “Department of Energy”. This is aligned with the authors’ expectation.

Authors’ distribution

According to Palomo et al. (2017), Lotka’s law can be applied to analyze the pattern of productivity of the researchers in a subject area.

To apply Lotka’s law, it is necessary to know the frequency distribution of authorship. Table 16 shows the authorship distribution considering multiple authors per article, i.e., including all co-authors.

Table 16 Frequency distribution, including all co-authors

From Table 16 it can be seen that 1334 authors and co-authors are responsible for the total of publications. Lotka’s law “describes the productivity distribution among researchers and shows that a small number of authors are responsible for most of literature whereas the contribution of the large majority of researchers is very low in terms of number of publications” (Barrios et al. 2008).

Lotka’s inverse power law (Pao 1985) states that the number of authors (y_x) that produce x articles is inversely proportional to a power of x, the output of each individual author. Such a relation can be modeled as Eq. (1).

$$x^{n} y_{x} = C$$
(1)

where n and C are constants to be estimated from the observed data set.
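Taking logarithms of Eq. (1) makes the estimation of n easier to follow. The step below is ordinary algebra applied to Eq. (1), not a formula quoted from the source:

$$\log y_{x} = \log C - n\,\log x$$

On logarithmic axes, the points (log x, log y_x) therefore fall on a straight line whose slope is −n and whose intercept is log C. Under the sign convention of Eq. (1), a negative fitted slope corresponds to a positive exponent n.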

Figure 28 presents the procedure described by Pao (1985) to determine the n value from the slope of the straight line fitted to the data on logarithmic axes. The slope was calculated using the least-squares method (≈ −3.4, i.e., an exponent of magnitude 3.4).

Fig. 28 Determining the n value (the most prolific author was discarded, x = 6, and x = 10)
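A least-squares fit on logarithmic axes, of the kind used to obtain the slope, can be sketched as follows. This is a minimal sketch rather than the authors' procedure: the function name `fit_lotka_exponent` is hypothetical, and the data are synthetic values generated from an exact power law, not the counts of Table 16.

```python
import math


def fit_lotka_exponent(x_values, y_values):
    """Least-squares slope and intercept of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in x_values]
    ly = [math.log10(y) for y in y_values]
    n_pts = len(lx)
    mean_x = sum(lx) / n_pts
    mean_y = sum(ly) / n_pts
    # Ordinary least squares for a straight line.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic counts generated from y = 1000 * x**-3.4 (not Table 16).
xs = [1, 2, 3, 4, 5]
ys = [1000 * x ** -3.4 for x in xs]
slope, intercept = fit_lotka_exponent(xs, ys)
```

Because the synthetic data follow the power law exactly, the fitted slope recovers the exponent used to generate them; with real author counts the points scatter around the line, which is why outlying values of x may be discarded before fitting.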

According to Pao (1985) the C value can be calculated by Eq. (2)

$$C = \begin{cases} \dfrac{1}{\sum\limits_{x = 1}^{\infty} x^{n}}, & n < 0 \\[2ex] \dfrac{1}{\sum\limits_{x = 1}^{\infty} \dfrac{1}{x^{n}}}, & n > 0 \end{cases}$$
(2)

Unfortunately, there is no easy formula for computing the sum of the infinite series in Eq. (2). Although Pao (1985) derived a function approximating this summation, in this paper the C value was calculated using Wolfram Alpha with the following command string:

$$\left( {{\text{sum}}\,{\text{x}}^{ - 3.4} \,{\text{from}}\,1\,{\text{to}}\,{\text{infinity}}} \right)^{ - 1}$$

leading to a C value of approximately 0.88.
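The same value can be checked numerically without an external service. The sketch below sums the series directly and approximates the remaining tail by an integral; the function name `lotka_c`, the number of terms, and the tail correction are illustrative choices, not part of the procedure described by Pao (1985).

```python
def lotka_c(n, terms=100_000):
    """Approximate C = 1 / sum_{x>=1} x**(-n) for an exponent n > 1."""
    partial = sum(x ** -n for x in range(1, terms + 1))
    # Tail of the series beyond `terms`, approximated by the integral of
    # x**(-n) from terms + 0.5 to infinity (midpoint-style correction).
    tail = (terms + 0.5) ** (1 - n) / (n - 1)
    return 1.0 / (partial + tail)


c_value = lotka_c(3.4)
```

For an exponent of 3.4 the result rounds to 0.88, in agreement with the value quoted above.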

Finally, to verify that the observed data fit the estimated distribution, the non-parametric Kolmogorov–Smirnov goodness-of-fit test was applied according to the procedure described by Pao (1985) (Table 17).

Table 17 Kolmogorov–Smirnov goodness-of-fit test (Pao 1985)

From Table 17, it can be seen that the observed data fit Lotka’s law, since the maximum deviation between the cumulative proportions of observed data and estimated values (0.0157) is lower than the critical value at a significance level of 0.05 (0.0446). Because C is the estimated proportion of authors with a single publication (x = 1 in Eq. (1)), the fitted model implies that about 88% of the authors made a single contribution.
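The comparison behind this test can be sketched as follows. The author counts are hypothetical, and the function names are illustrative. The critical value uses the usual large-sample approximation c/√N, with c = 1.36 for the 0.05 level and c = 1.63 for the 0.01 level; whether Table 17 was built from one of these coefficients or from another expression is not stated in the text, so both should be read as assumptions.

```python
import math


def lotka_ks_statistic(observed, n, c):
    """Maximum gap between observed and Lotka cumulative proportions.

    `observed` maps x (articles per author) to the number of authors;
    the theoretical proportion for each x is c * x**(-n).
    """
    total = sum(observed.values())
    cum_obs = 0.0
    cum_exp = 0.0
    d_max = 0.0
    for x in range(1, max(observed) + 1):
        cum_obs += observed.get(x, 0) / total
        cum_exp += c * x ** -n
        d_max = max(d_max, abs(cum_obs - cum_exp))
    return d_max


def ks_critical_value(sample_size, coefficient=1.36):
    """Large-sample approximation: coefficient / sqrt(sample size)."""
    return coefficient / math.sqrt(sample_size)


# Hypothetical author counts (not Table 16).
observed = {1: 880, 2: 80, 3: 25, 4: 10, 5: 5}
d_max = lotka_ks_statistic(observed, n=3.4, c=0.88)
fits = d_max < ks_critical_value(sum(observed.values()))
```

With N = 1334 authors, the two coefficients give critical values of about 0.037 and 0.045, and the reported maximum deviation of 0.0157 lies below both.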

According to Pao (1986), the number of authors considered per publication is among the sources of error that lead to questionable parameter estimates for Lotka’s equation. Coile (1977, cited in Pao 1986) showed that counts based on senior authors (or first authors) and counts based on all authors could lead to significantly different conclusions, since they would produce different values for the exponent in Eq. (1). This is not the case for this study, since the n value remains around 3.4 when only one author is considered, as illustrated in Fig. 29.

Fig. 29 Determining the n value (the most prolific author was discarded, x = 7)

In his original article, Lotka described experiments with two different data sets, from which he derived two different values for n, both of which he considered approximately 2. “That enabled him to draw his often quoted conclusion that the number of persons making n contribution is about 1/n² of those making one and the proportion of contributors that make a single contribution is about 60%” (Lotka 1926, cited in Pao 1985). This statement does not hold in this case since, according to the Kolmogorov–Smirnov goodness-of-fit test, the theoretical distribution with n equal to 2 does not fit the observed data (considering single or multiple authors).

Another interesting consideration is Price’s square root law, which states that “half of the literature on a subject will be contributed by the square root of the total number of authors publishing in that area”. In other words, “Price’s contention was that half the published output in a subject field will be contributed by a highly productive subset of authors” (Nicholls 1988).

Considering a single author per paper, according to Price’s contention, the 21 most prolific authors should be responsible for 256 articles. This result is not supported by the empirical evidence, since the 21 most prolific authors are responsible for only 63 papers. If multiple authors per article are considered, the 36 most prolific authors should be responsible for 256 articles. Empirical evidence does not support this result either, since the 36 most prolific authors and their respective co-authors are responsible for 158 articles. Although it is true for this area that a small number of authors are responsible for most of the articles and a large number of authors produce only one paper each, the results do not support Price’s square root law, because Price’s original claim was based on Lotka’s inverse square law rather than on the generalized Lotka’s inverse power law (Nicholls 1988).
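The arithmetic of this check for the multiple-author count can be sketched as follows. The function name `price_law_check` is hypothetical, and truncating the square root and halving the article count with integer division are assumptions about rounding that reproduce the figures of 36 authors and 256 articles quoted above.

```python
import math


def price_law_check(total_authors, total_articles, articles_by_top_authors):
    """Compare Price's square root law with an observed article count.

    Returns the size of the most productive group (floor of the square
    root of the number of authors), the number of articles the law
    expects from that group (half of the total), and whether the
    observed count reaches that expectation.
    """
    elite_size = math.isqrt(total_authors)
    expected_articles = total_articles // 2
    holds = articles_by_top_authors >= expected_articles
    return elite_size, expected_articles, holds


# Figures quoted in the text for the multiple-author count.
elite, expected, holds = price_law_check(1334, 513, 158)
```

Here the 36 most prolific authors would need to account for 256 articles, whereas the observed count is 158, so the law is not satisfied.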

The eleven most productive authors in data analysis techniques applied to energy efficiency in buildings, and their respective citations, are shown in Fig. 30.

Fig. 30 Authors’ production and citations (considering multiple authors per article)

The most productive authors are Lam, J.C. (City University of Hong Kong, Department of Civil and Architectural Engineering, Hong Kong, China) and Yoshino, H. (Tohoku University, Department of Architecture and Building Science, Sendai, Japan). Lam published ten articles on the subject and was also the most cited author (586 citations). Three of his publications exceeded 100 citations: “Sensitivity analysis of energy performance of office buildings”, published in 1996 and cited 126 times; “Future trends of building heating and cooling loads and energy consumption in different climates”, published in 2011 and cited 107 times; and “Energy analysis of commercial buildings in subtropical climates”, published in 2000 and cited 101 times. These articles were published in the Building and Environment journal. Yoshino also published ten articles and was cited 340 times. Two of his publications exceeded 100 citations: “A systematic procedure to study the influence of occupant behavior on building energy consumption” (168), published in 2011, and “A decision tree method for building energy demand modeling” (114), published in 2010. These articles were published in the Energy and Buildings journal.

Identifying the data analysis techniques employed

The 513 papers retrieved at the beginning of this research were read, and only 296 of them actually applied a data analysis technique. Some of the data analysis techniques pointed out by the key terms analysis were only mentioned, not actually employed, so they were not taken into consideration. The data analysis techniques used in these 296 articles were roughly grouped into seven categories. Table 18 presents these categories and the number of times they appear. It is worth mentioning that sometimes more than one technique was used in an article.

Table 18 Data analysis techniques

The most cited category was Regression Analysis. This category groups several techniques, such as linear regression (77 occurrences of the technique), multivariate linear regression (39), non-linear regression (4), and ordinary least squares and other techniques (19). These techniques were applied in articles focusing on almost all types of investigation within this field, as can be seen in the key terms analysis. Examples include estimating the energy performance of buildings (Melo et al. 2016), understanding the influential characteristics of energy consumption in buildings (Ma and Cheng 2016), and predicting the energy consumption of buildings (Zhou et al. 2016).

The second most cited category was Descriptive Statistics, which brings together techniques such as central tendency and dispersion measures (48) and correlation measures (26). This category is employed in papers describing exploratory research, and it was also employed together with other techniques. Examples include analysis of the correlation between energy consumption in buildings and greenhouse gas emissions (Wang et al. 2016), and between building design and energy consumption (Liu et al. 2015).

The third most cited group was the Multivariate Data Analysis Techniques presenting methods such as Principal Component Analysis (16), Clustering (16), Structural Equation Modeling (4), Factor Analysis (2), Discriminant Analysis (1) and Multivariate Analysis of Variance (1). This category is present in papers which deal with surveys, for example, to describe the perceptions of individuals in relation to the challenges of building energy efficiency (Addy et al. 2014); and analyze the impacts of household and building characteristics on energy consumption (Estiri 2014).

The fourth most cited group was the Computational Intelligence, which brings together techniques such as Artificial Neural Networks (25), Genetic Algorithms (8) and Machine Learning (1). These techniques were applied in papers focusing on forecasting, for example, for energy savings predictions for a retrofit project in buildings (Yalcintas 2008); and in the prediction of building energy consumption (Li et al. 2015a, b; Buratti et al. 2014).

The fifth most cited category was Stochastic Processes, which consists of a variety of techniques such as Monte Carlo (8) and Sensitivity Analysis techniques (10). These techniques are used in articles interested in simulation and modeling, for example, to find a model for lower cost and energy consumption (Orosa 2012; Wang et al. 2012) and to examine the sensitivity of energy performance of buildings (Lam and Hui 1996; Hemsath and Bandhossani 2015).

The sixth most cited category was Inferential Statistics, which brings together parametric tests (25) and nonparametric tests (5). These techniques are employed jointly with most of the other techniques. Examples include comparative analysis of building energy simulation of the different variables (Pernigotto and Gasparella 2013); identification of variables to estimate the energy performance of buildings (Tsanas and Xifara 2012); and evaluation of the frequency distribution of annual energy consumption in buildings (Capozzoli et al. 2016).

The last category was Design of Experiments, which groups several types of experimental designs (17) and Analysis of Variance (6). Design of Experiments is used to optimize processes and to assess the influence of input factors on the response variable of a process. For example, it was used to optimize the energy consumption based on environmental factors (Assadi et al. 2016) and to evaluate the effectiveness of climate factors on energy saving in green buildings (Zahraee et al. 2014).

It is worth mentioning that the use of some of these techniques presented an overlap with others. Figure 31 shows the dispersion of the data analysis techniques over the years.

Fig. 31 Evolution of the application of the data analysis techniques

It can be seen from Fig. 31 that all of the data analysis techniques presented in Table 18 have been applied markedly more often after 2010. It is also possible to see that Regression Analysis and Descriptive Statistics have been used since the beginning and the middle of the 1990s, respectively.

Although the utilization of all of the techniques has been increasing over the observed period, none of them showed more consistent growth than Regression Analysis. Its utilization was exceeded by the other techniques in only 6 years: 1984, 1993, 1997, 1998, 2000 and 2007.

It is worth mentioning that the period from 2012 to 2016 was when most of the techniques were applied.

Conclusions

This paper describes a bibliometric analysis concerned with the data analysis procedures applied to studies of energy performance in buildings. The article investigated the evolution of publications on the theme from 1980 to 2016.

Interest in this subject increased over the period under analysis, as shown by the growth in publications related to the theme. Three of the four phases of publication growth according to Price’s law were identified. The first phase (1980–1998) was the precursors’ phase, in which a small number of researchers began to publish; this phase accounts for about 4.4% of the whole publication body. The second phase (1992–2013) presented exponential growth, with the number of publications doubling approximately every 7 years. In the third phase (2013–2016), the body of knowledge was consolidated and the growth of scientific production became linear. Based on this analysis, it could be concluded that the number of publications in this field, from 1980 to 2016, followed Price’s law and that an inflection point has not been reached yet.

The growth of publications was primarily driven by Europe, North America and Asia. In Europe, governmental incentive programs promoting the rational use of energy and legal requirements for the creation of zero energy buildings were responsible for the growth in the number of publications. In comparison, North America had begun 20 years earlier, owing to the great attention paid by the United States to the theme: the expectation of an energy crisis, which actually arose in California, encouraged many programs aimed at promoting energy efficiency in buildings, which were followed by several publications. The number of publications in Asia followed the same pattern as in Europe, because the reasons were the same.

The United States is the country with the greatest number of publications and citations. The most cited paper was also published in the United States. Greece and Switzerland have the greatest number of citations per publication (CPP), although they are outside the top ten publishers. Since the CPP is greater than the number of publications, one may conclude that these publications have been cited by authors abroad, so the publications are relevant to the area.

Malaysia, India and China are the countries with the greatest percentage of non-cited publications. Although China is one of the greatest publishers and its papers are much cited, more than 28% of its production has not been cited yet.

A remarkable fact is that the most-cited Brazilian paper occupies the third position in the ranking of the most cited publications. Brazil occupies the thirteenth position out of twenty-six in number of publications. In addition, 20% of its whole production is not cited.

According to the Principal Components Analysis (PCA), the United States stood out because of the number of publications, citations and the number of citations of the most cited article. Switzerland stood out because of the number of citations per publication. Although China has a great number of publications and citations, it also has a high percentage of non-cited papers. India and Malaysia appear highlighted because their percentage of non-cited papers is significant.

The most relevant areas related to the theme were Engineering, Energy, Environmental Science and Computer Science. It could also be seen that the subject has attracted the attention of some peripheral areas such as Social Sciences; Business, Management and Accounting; Economics, Econometrics and Finance; Arts and Humanities; and Medicine. One of the reasons for such an intersection is the fact that many publications are classified into more than one area, but scientific interests can also explain the overlap between the areas.

The Social Sciences area contained publications dealing with thermal comfort and occupant behavior. The Business, Management and Accounting area grouped articles that address the subject from the standpoint of relevant planning and management tools for energy-efficient buildings. The Economics, Econometrics and Finance area presented papers concerned with the market value of buildings. The Arts and Humanities area contained papers that approach the theme in relation to historical heritage buildings. The Medicine area presented publications dealing with health problems resulting from CO2 emissions.

The 513 articles were distributed among 155 journals. This distribution does not follow Bradford’s law; nevertheless, 10% of the journals published 60% of the papers, and about 60% of the journals published only one paper.

Energy and Buildings was by far the scientific journal that published the most papers related to the theme. It was responsible for about 32% of the publications in the period covered by this research, forming the nucleus of the journal distribution. The next zone, containing 32 journals, is responsible for about 35% of the publications in the same period. The third and last zone of the journal distribution (122 journals) accounts for about 36% of the publications.
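The departure from Bradford’s law can be illustrated with the zone sizes reported above. In its classical formulation, Bradford’s law expects zones with roughly equal numbers of articles to contain journals in the ratio 1 : k : k². The sketch below only checks the reported zone sizes against that ratio; it is not necessarily the test applied in this study, and the variable names are illustrative.

```python
# Number of journals in each zone, as reported in the text.
zone_journals = [1, 32, 122]
total_journals = sum(zone_journals)  # 155 journals in total

# Share of all journals that falls in each zone.
journal_share = [n / total_journals for n in zone_journals]

# Observed multipliers between successive zones.
multipliers = [
    zone_journals[i + 1] / zone_journals[i]
    for i in range(len(zone_journals) - 1)
]

# Under an ideal 1 : k : k^2 pattern, the third zone would hold
# k^2 journals for every journal in the nucleus.
k = multipliers[0]
ideal_third_zone = zone_journals[0] * k ** 2

print("journal share per zone:", [round(s, 3) for s in journal_share])
print("observed multipliers:", multipliers)
print("ideal third zone size:", ideal_third_zone, "observed:", zone_journals[2])
```

The two observed multipliers differ by almost an order of magnitude, whereas the classical formulation expects a roughly constant multiplier, which is consistent with the statement that the distribution does not follow Bradford’s law.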

The first and second zones account for 22% of the journals with 64% of the articles. Considering only these two zones, Engineering is the area with the greatest number of journals (15), followed by the area of Energy (10), Environmental Science (4), Social Sciences (12), and Material Science (1).

The most productive research centers in this area are located in Asia, with sixty-five institutions, followed by Europe and North America.

China, the United Kingdom and the United States stood out with the greatest number of institutions in their respective regions. Taking into account all the figures from the bibliometric analysis, it was possible to conclude that these countries are at the forefront of the application of data analysis techniques in building energy efficiency studies.

The three most important research centers found in these countries were Lawrence Berkeley National Laboratory, University of California, United States; Hong Kong Polytechnic University, China; and University College London, United Kingdom. A review of the departments involved in the published articles showed that most of them were directly related to energy, engineering and buildings, which is aligned with the authors’ expectations.

The total number of authors (first author only) involved in the production of these 513 articles is 441. Considering both authors and co-authors, this number increases to 1334. In both cases a small number of them is responsible for most of the literature, and in both cases the frequency distribution of articles per author follows Lotka’s inverse power law with n = 3.4 and C = 0.88. Since n = 3.4 and not 2, as predicted by Lotka’s inverse square law, the observed data do not support Price’s square root law: 88% of the authors are responsible for only one article, and the 21 most prolific authors are responsible for 63 articles instead of 256.
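The arithmetic behind this comparison can be reproduced in a few lines. The sketch below assumes the usual forms of the two laws: Lotka’s law as f(x) = C / x^n, the share of authors with x articles, and Price’s law as the statement that about the square root of the number of authors produces about half of the articles. The function and variable names are illustrative and do not come from the study.

```python
import math

# Figures reported in the text (first authors only).
n_authors = 441
n_articles = 513
top_authors_articles = 63  # articles by the 21 most prolific authors

# Price's square root law: about sqrt(N) authors should account
# for about half of all articles.
elite_size = round(math.sqrt(n_authors))
expected_elite_output = n_articles / 2
observed_share = top_authors_articles / n_articles


def lotka_share(x, n=3.4, c=0.88):
    """Share of authors with exactly x articles under f(x) = c / x**n."""
    return c / x ** n


# Fitted law reported in the text (n = 3.4, C = 0.88).
share_one = lotka_share(1)
share_two = lotka_share(2)

# Classical inverse square law for comparison (n = 2, C = 6 / pi^2).
classic_share_one = lotka_share(1, n=2, c=6 / math.pi ** 2)

print("elite size under Price's law:", elite_size)
print("expected elite output:", expected_elite_output)
print("observed elite share of articles:", round(observed_share, 3))
print("fitted share of one-article authors:", share_one)
print("classical share of one-article authors:", round(classic_share_one, 3))
```

With n = 3.4 the share of authors drops much faster as x grows than under the inverse square law, so output is spread across many one-article authors rather than concentrated in a small elite, which is why the 21 most prolific authors fall far short of half of the articles.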

The most productive and cited authors were Lam, J.C. and Yoshini, H. They published ten articles each, and some of their papers reached more than 100 citations.

In order to visualize changes in the thematic focus of this research field, three bibliometric maps based on the key terms were employed, covering the periods 1980–1990, 1980–2000 and 1980–2016. Thus, it was possible to study the formation and evolution of the clusters and to understand the development of this field of knowledge.

During the period from 1980 to 1990, 22 articles were produced, totaling 81 keywords, but only 12 keywords occurred two or more times, forming 4 clusters. Because of the oil crisis in the middle of the 1970s, terms like “energy conservation”, “energy savings”, and “energy utilization” appeared in the publications at the beginning of the 1980s. The advent of personal computers in the 1980s also brought terms related to computer simulation.

From 1980 to 2000, 58 articles were produced, totaling 43 keywords, grouped into 7 clusters. Over this period, new terms concerned with descriptive statistics were introduced in the publications, suggesting that this period was marked by the adoption of univariate data analysis techniques. Also notable was the first appearance, in this kind of study, of the term “environmental impact”.

From 1980 to 2016, 257 new terms were included, modifying the former 7 clusters. These clusters are characterized by the following terms: Buildings and Energy Uses; Building Energy Conservation; Energy Consumption; Energy Consumption Forecasting and Computational Intelligence; Energy Efficiency and Climate Effects; Building Energy Efficiency and Multivariate Statistics; and Building Energy Analysis and Stochastic Processes. It was also notable that terms concerned with cost and technological innovation appeared very few times in the literature.

After the year 2000, environmental questions and terms related to renewable energy gained more importance. Many articles in this period focused on building occupants’ wellness, building materials and building design to improve energy efficiency. Although univariate descriptive analysis continued to be applied in the articles, terms concerned with multivariate statistics and computational intelligence emerged.

The data analysis techniques were roughly grouped into seven categories: Regression Analysis, Descriptive Statistics, Multivariate Analysis, Computational Intelligence, Stochastic Processes, Inferential Statistics and Design of Experiments. Some of the data analysis techniques pointed out by the key terms analysis were only mentioned but not actually employed; thus, they were not taken into consideration in the formation of the categories.

The most cited category was Regression Analysis, which grouped techniques that were applied in articles focusing on almost all types of investigation within this field, in accordance with the key terms analysis. The second category was Descriptive Statistics, which was employed in papers describing exploratory research and was also employed concomitantly with other techniques. Multivariate Data Analysis was the third most cited group and appeared in articles that dealt with surveys. The fourth most cited group was Computational Intelligence, which brought together techniques that were applied in papers focusing on forecasting. The fifth category was Stochastic Processes, which consisted of a variety of techniques used in articles concerned with simulation and modeling. The sixth most cited category was Inferential Statistics, which was employed jointly with most of the other techniques. The last category was Design of Experiments, which grouped techniques that were used to optimize processes and to assess the influence of input factors on the response variable of a process or phenomenon.

These data analysis techniques have been applied markedly more often in building energy efficiency research after 2010, possibly because of the popularization of statistical packages, peaking between 2012 and 2016. Although the use of all of the techniques has been increasing over the observed period, none of them showed more consistent growth than Regression Analysis. Its use was exceeded by that of the other techniques only in 1984, 1993, 1997, 1998, 2000 and 2007.

The data analysis techniques identified in this article may inform the reformulation and adequacy of the curricula of undergraduate and graduate courses in the area of energy and smart buildings. The results of this research provide a general perspective on the data analysis tools employed in energy efficiency research in buildings, which can be useful in identifying relevant themes for further research.