1 Introduction

A tremendous amount of data on medicinal plant uses has been documented worldwide over the years. Such data include the plants used to treat particular diseases, the plant organs used, how the plant parts or organs are collected and how the medicines are prepared (York et al. 2011; Elansary et al. 2015; Leso et al. 2017). In light of this wealth of ethnobotanical data, some authors have argued that enough information is now available to formulate and test theories and hypotheses that can advance ethnobotany as a scientific discipline (e.g. Albuquerque et al. 2006; Ford and Gaoue 2017; Gaoue et al. 2017; Hart et al. 2017). Such a paradigm shift towards a more hypothesis- or theory-driven ethnobotany is necessary to make ethnobotany a stronger scientific discipline, with theories and hypotheses that can be used to predict new medicinal plant uses as well as explain plant–human interactions (Gaoue et al. 2017). Interesting questions that can be investigated for a better understanding of plant–human interactions include the following: is traditional medicine a placebo? Why are some plants in a particular family predominantly used (over-utilised) in some pharmacopoeias while other plants are less used (under-utilised)?

To answer such questions, a hypothesis termed the “non-random hypothesis” has been proposed (Moerman 1979, 1991, 1996), which predicts that large families are more likely to be richer in medicinal plants than small families. This hypothesis implies that medicinal plants are not randomly selected by local communities, such that a positive linear relationship is expected between the number of medicinal plants in families and the size of those families (Moerman 1979). Initially, Moerman (1979) formulated and tested this hypothesis to demonstrate that the traditional medicine of Native Americans was not a placebo. Because of this non-random selection, some plant families tend to be over- or under-represented in a given pharmacopoeia (Moerman 1979, 1991; Moerman and Estabrook 2003; Ford and Gaoue 2017). This implies that plant family can be a strong determinant of plant use value (Phillips and Gentry 1993). In one of his early studies, Moerman (1991) explained this by the fact that species in the same family, owing to their evolutionary relatedness, share some characteristics of plant defence inherited from common ancestors, which influence their physiology and effectiveness as medicines. Using a phylogenetic approach, recent studies confirmed that closely related plant families are more likely to have similar medicinal uses than phylogenetically distant ones (Saslis-Lagoudakis et al. 2013; Yessoufou et al. 2015).

Several studies have tested the non-random hypothesis in different geographic contexts, e.g. in Amazonian Ecuador (Bennett and Husby 2008), in Belize (Amiguet et al. 2006), in Kashmir (Kapur et al. 1992), and recently in Hawai’i, USA (Ford and Gaoue 2017) and the Ecuadorian Amazon (Robles Arias et al. 2020). These studies reported strong support for the hypothesis. In particular, Robles Arias et al. (2020) demonstrated that the prediction of the hypothesis could be gender-specific. Nonetheless, such hypothesis-driven ethnobotanical studies are scant, particularly in plant-rich countries with a wealth of medicinal knowledge. South Africa is one of these species-rich countries, with a remarkable plant diversity estimated at approximately 24,000 vascular plants, but its ethnobotanical studies are rarely theory-driven. The different uses of medicinal plants are very well documented, e.g. ~ 3000 medicinal plants are recorded in the country, including 350 species known to be commonly used and traded (e.g. van Wyk and Gericke 2000; Fennell et al. 2004; van Wyk 2008; York et al. 2011; Elansary et al. 2015; Leso et al. 2017).

Furthermore, the methodological approaches used in studies that tested the non-random hypothesis could be improved. For example, by fitting a simple linear model to untransformed data, Moerman (1979) did not account for the assumptions of normally distributed residuals and homogeneity of variance. More recently, Ford and Gaoue (2017) fitted the same model to log-transformed data to address this problem. Even so, log-transformation performs poorly on “count data” (here, the number of medicinal plants) in comparison with a generalised linear model with a negative binomial error distribution (see O’Hara and Kotze 2010). Applying these various statistical methods whilst ignoring their limitations is a potential source of bias, not necessarily in the overall outcome of hypothesis testing, but more critically in the identification of over-utilised versus under-utilised families.

In the present study, the non-random hypothesis of medicinal plant selection was tested in the Mpumalanga Province of South Africa. Specifically, different statistical approaches were applied to explore the relationship between the number of known medicinal plants in each family and the size of that family in the province.

2 Materials and methods

2.1 Study area

Mpumalanga, one of the nine South African provinces, lies within the Greater Maputaland-Pondoland Albany Biodiversity Hotspot and harbours the southern half of the Kruger National Park and other centres of endemism. The province is divided into three districts, namely Gert Sibande, Nkangala and Ehlanzeni. Local communities are culturally and linguistically diverse, which underpins a rich base of traditional knowledge. The main languages spoken are Siswati (30%), isiZulu (26%), Sepedi (21.2%), Xitsonga (11.6%) and isiNdebele (10.3%) (Tshikalang et al. 2016).

Four major vegetation types are dominant in the study area, namely the highveld grasslands, escarpment grassland-forest mosaic, eastern Lowveld savannah and the north-western bushveld savannah (Schmidt et al. 2007). These vegetation types are represented in three distinct biomes: forest, savannah and grassland (Schmidt et al. 2007). The rainfall varies from a minimum of 440 mm in the north to a maximum of 740 mm in the south of the Kruger National Park (KNP) (Venter 1990). Mean annual temperature is around 21–23 °C, but in summer temperatures often exceed 38 °C, and frost can occur sporadically during winter.

2.2 Data collection

Data on the floristic composition of the Mpumalanga Province were collected through intensive fieldwork conducted over four years, from 2008 to 2012, by the last author of this paper (Yessoufou 2012). These data were supplemented with an existing source, the book Trees and shrubs of Mpumalanga and Kruger National Park by Schmidt et al. (2007), which contains both floristic and ethnobotanical knowledge of the region collected over more than 10 years of fieldwork. This book provided unique botanical knowledge (including some medicinal uses) for a comprehensive checklist of 811 plant species representing 97 botanical families, of which 321 species were reported to have some medicinal uses (Schmidt et al. 2007). In addition, data were collected from PRECIS (SANBI, 2005), a comprehensive inventory of the ethnomedicinal flora of Southern Africa containing 800,000 records of taxa grouped by order and region (Magill et al. 1983; Germishuizen and Meyer 2003). Ethnobotanical data were further collected from the Prelude Medicinal Plants Database (https://www.africamuseum.be/en/research/collections_libraries/biology/prelude), an electronic database of articles and other publications on the medicinal plants of Africa, hosted on the https://www.africamuseum.be website. From these data, two variables were derived, namely (1) the total number of plant species per family in the province and (2) the total number of medicinal plant species recorded for each family in the province.
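Deriving the two family-level variables amounts to counting species per family. The following is a minimal sketch in Python, assuming a hypothetical checklist of (species, family, medicinal flag) records; the records and the function name family_counts are illustrative placeholders and are not taken from the study's dataset.

```python
from collections import Counter

# Hypothetical checklist records: (species, family, recorded as medicinal?).
# Illustrative placeholders only, not the study's data.
records = [
    ("species_a", "Fabaceae", True),
    ("species_b", "Fabaceae", False),
    ("species_c", "Fabaceae", True),
    ("species_d", "Phyllanthaceae", True),
    ("species_e", "Celastraceae", False),
]

def family_counts(records):
    """Return {family: (total_species, medicinal_species)}."""
    total = Counter()
    medicinal = Counter()
    for _species, family, is_medicinal in records:
        total[family] += 1
        if is_medicinal:
            medicinal[family] += 1
    # Families without medicinal records keep an explicit zero count,
    # which matters later when the counts are log-transformed.
    return {fam: (total[fam], medicinal[fam]) for fam in sorted(total)}

counts = family_counts(records)
```

Keeping families with zero medicinal species in the table, rather than dropping them, is the design choice that makes the later treatment of zeros (log(x + 1) or a count model) necessary.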

2.3 Data analysis

All analyses were done in R (R Development Core Team 2017), using the number of medicinal species recorded per family as the response variable (count data) and the total number of species documented in the province for each family as the predictor variable. Firstly, we fitted a simple linear model (model 1) to the untransformed data, as commonly done in previous studies (Moerman 1979, 1996; Amiguet et al. 2006). Then, we tested for normality of the residuals. As this analysis indicated non-normality (Figure S1), we log(x + 1)-transformed the response and predictor variables to address the normality issue (Figure S2) and fitted the same linear model (model 2) to the transformed variables, as done in a few recent studies (e.g. Ford and Gaoue 2017). Finally, because simple linear regression on log-transformed “count data” performs poorly compared with a generalised linear model with a negative binomial error distribution (see O’Hara and Kotze 2010), we also fitted a negative binomial model (model 3) to our dataset (Zeileis et al. 2008; O’Hara and Kotze 2010). For each of these models, we identified over-utilised families as those with positive residuals, i.e. families that contain more recorded medicinal species than expected from the fitted model, and under-utilised families as those with negative residuals.
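The residual-based classification for models 1 and 2 can be illustrated with a short sketch. It is written in Python rather than R so that it needs no external packages, and it is not the code used in the study: the family names, the counts and the helper names (fit_line, residuals, over_utilised) are hypothetical.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residuals(x, y):
    """Observed minus fitted values of the least-squares line."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def over_utilised(names, res):
    """Families lying above the fit line (positive residual)."""
    return [name for name, r in zip(names, res) if r > 0]

# Hypothetical family-level counts (not the study's data).
families = ["Fam_A", "Fam_B", "Fam_C", "Fam_D", "Fam_E"]
total = [120, 40, 25, 10, 3]      # species per family
medicinal = [60, 12, 15, 2, 0]    # medicinal species per family

# Model 1: simple linear model on untransformed counts.
res_raw = residuals(total, medicinal)

# Model 2: same model after log(x + 1) on both variables;
# log1p(v) computes log(1 + v), so the zero count is handled.
res_log = residuals([math.log1p(v) for v in total],
                    [math.log1p(v) for v in medicinal])

over_raw = over_utilised(families, res_raw)
over_log = over_utilised(families, res_log)
```

Comparing over_raw with over_log shows how the set of families labelled over-utilised can change with the transformation alone, even though the underlying counts are identical.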

3 Results

Approximately 40% of the woody flora of the Mpumalanga Province was recorded as medicinal, and these medicinal species are found in ~ 76% of the woody plant families of the study area (Table S1). Our analysis revealed that some plant families are over-utilised, while others are under-utilised (Fig. 1; Table 1). The proportion of over-utilised families is 50% under Moerman’s linear regression approach, 55% under linear regression after log–log transformation and 34% under the negative binomial model. Following Moerman’s approach, the top over-utilised families are Fabaceae (residual = + 34.44), Apocynaceae (+ 5.82) and Phyllanthaceae (+ 5.53). The log-transformed model confirms these three families as the top over-utilised families but in a slightly different sequence: Fabaceae (+ 1.55), Phyllanthaceae (+ 0.83) and Apocynaceae (+ 0.79). However, under the negative binomial model, Fabaceae is no longer even among the top 10 over-utilised families, which are now Phyllanthaceae (+ 2.09), Apocynaceae (+ 1.51), Loganiaceae (+ 1.48), Rhamnaceae (+ 1.48), Sapotaceae (+ 1.48), Oleaceae (+ 1.39), Salicaceae (+ 1.39), Clusiaceae (+ 1.30), Boraginaceae (+ 1.28) and Lamiaceae (+ 1.18) (Table 1). The top 10 under-utilised families comprise Celastraceae (− 0.05), Monimiaceae (− 0.06), Aquifoliaceae, Arecaceae, Canellaceae, Cornaceae, Gentianaceae, Hernandiaceae, Picrodendraceae and Piperaceae (− 0.06, each).

Fig. 1 Relationships between the number of medicinally used woody plants and the total number of woody plants per family in the Mpumalanga Province, South Africa. The names of some families cannot be read clearly because they overlap; Table 1 presents the full list of plant families with their residual values, which indicate their position relative to the fit lines. Fit lines of the different models tested are colour-coded. Families above the fit line of a model are considered over-utilised (positive residual), and families below the fit line are considered under-utilised (negative residual)

Table 1 Residual values from the various models fitted to medicinal plant data from the Mpumalanga Province, South Africa

4 Discussion

Almost 40% of the woody species have local medicinal applications as remedies for certain illnesses, a proportion approximately three times higher than the proportion (12.5%) of known medicinal plants in South Africa (van Wyk and Gericke 2000; Arnold et al. 2002; Williams et al. 2013). In addition, the medicinal plants of the Mpumalanga Province are well represented at family level, since they occur in nearly 76% of the woody plant families in the province. This is perhaps indicative of the richness of the province in medicinal flora. Because we focussed only on the woody flora, the proportion of medicinal plants would likely be greater than reported here if non-woody plants were included in the analysis. Indeed, as suggested by the optimal defence theory, non-apparent species, that is, species with short lifespans (herbaceous, early successional plants), are subjected to lower herbivore pressure than apparent species (e.g. perennial, dominant, woody plants). As a result, non-apparent plants produce “cheap” qualitative defences (e.g. alkaloids), while apparent species invest more in “expensive” quantitative defences, e.g. lignins (Feeny 1976). Consequently, herbs are more likely to be medicinal than woody plants (Albuquerque and Lucena 2005; da Silva et al. 2018). There is therefore a need for future studies to incorporate herbaceous plants into their analyses to further test the non-random hypothesis (or other theories).

Are medicinal plants a random selection of the total flora in our study area? We investigated this question and found evidence that some families are over-utilised, i.e. they contain more medicinal plants than expected, whereas others are under-utilised, i.e. they contain fewer medicinal plants than expected. This finding supports the theory of non-random plant selection, which predicts a positive relationship between the number of medicinal plants and the total number of species in a given family (Moerman 1979, 1991, 1996; Gaoue et al. 2017). Such a relationship has been reported in several studies, in Amazonian Ecuador (Bennett and Husby 2008), in Belize (Amiguet et al. 2006), in Kashmir (Kapur et al. 1992) and in Hawai’i, USA (Ford and Gaoue 2017), pointing potentially to the generality of the non-random hypothesis.

Despite this apparently general trend of non-random selection of medicinal plants in different contexts and regions of the globe, various methodological approaches have been used to test the theory. The most widely used approach is the linear model proposed by Moerman (1979). Because the Moerman approach did not account for the assumptions of normally distributed residuals and homogeneity of variance, it has recently been modified using a log–log transformation of the variables (Ford and Gaoue 2017), and some earlier studies employed least squares regression analysis (e.g. Douwes et al. 2008). However, in the presence of just one zero observation, that is, when a family contains no recorded medicinal species, log-transformation of the data becomes problematic, and a constant (for example, 1) has to be added to the observations to allow log-transformation, which artificially introduces bias into the data. In any case, O’Hara and Kotze (2010), while comparing different models (including those with variously transformed data), demonstrated that Poisson or negative binomial models outperform other models fitted to count data. In our case, as our response variable is count data (i.e. number of medicinal species), we fitted a negative binomial model to the medicinal plant data collected, while also fitting the simple linear model to both untransformed and log-transformed data for comparison.
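To make the count-model alternative concrete, the sketch below fits a Poisson regression with a log link by iteratively reweighted least squares. Several assumptions apply: the Poisson model is used here only as a simpler stand-in for the negative binomial model of the study (the negative binomial additionally estimates a dispersion parameter, which this sketch does not); the predictor is taken as log(x + 1) of family size, which the text does not specify; Pearson residuals are one possible residual type among several; and the data and function name fit_poisson are hypothetical. It is written in Python with the standard library only, not in R as in the study.

```python
import math

def fit_poisson(x, y, n_iter=50):
    """Poisson regression log(mu) = b0 + b1*x fitted by IRLS; returns (b0, b1)."""
    # Start from the observed counts, shifted by 0.5 to avoid log(0).
    eta = [math.log(yi + 0.5) for yi in y]
    b0 = b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Closed-form weighted least squares of z on x.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        b1 = (sw * swxz - swx * swz) / (sw * swxx - swx ** 2)
        b0 = (swz - b1 * swx) / sw
        eta = [b0 + b1 * xi for xi in x]
    return b0, b1

# Hypothetical family-level counts (not the study's data).
total = [120, 40, 25, 10, 3]
medicinal = [60, 12, 15, 2, 0]   # the zero needs no transformation here

x = [math.log1p(v) for v in total]   # assumed predictor scale
b0, b1 = fit_poisson(x, medicinal)
mu = [math.exp(b0 + b1 * xi) for xi in x]

# Pearson residuals: positive values flag candidate over-utilised families.
pearson = [(yi - mi) / math.sqrt(mi) for yi, mi in zip(medicinal, mu)]
```

The response enters the model as raw counts, so a family with zero medicinal species is handled without adding a constant to it. For real analyses, an established negative binomial implementation would be preferable to this hand-rolled fit.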

All models fitted not only support the non-random plant selection hypothesis but also indicate that some families are over-utilised, whereas others are under-utilised. However, while the Moerman and log–log models yield similar proportions of over-utilised plant families (50 and 55%), the negative binomial model is far more stringent, as only 34% of plant families are categorised as over-utilised by this model. This is an indication that previous studies that employed the Moerman approach may have included some (statistically) under-utilised families in their lists of over-utilised families. For example, we identified Fabaceae as the most over-utilised family when we applied the Moerman and log-transformed models. Indeed, Fabaceae has been identified as one of the most over-utilised families in several studies that employed the Moerman or similar approaches (Moerman 1999; Douwes et al. 2008; Gaoue et al. 2017; Kew 2017). In its recent report on the State of the World’s Plants, Kew (2017) indicated that Fabaceae, with its 11.2% of medicinal plants, is the 12th richest family in medicinal plants, and that the family contains important secondary compounds such as alkaloids.

However, under the negative binomial model, Fabaceae becomes under-utilised, indicating a potential over-estimation of the medicinal value of some taxa in previous studies. This does not imply that Fabaceae is not an important medicinal plant family; rather, it implies that other families may outcompete Fabaceae in terms of people’s preferences for medicinal uses. Indeed, in the negative binomial model, Phyllanthaceae is identified as the most over-utilised family, followed by Apocynaceae, Loganiaceae, Rhamnaceae, Sapotaceae, Oleaceae, Salicaceae, Clusiaceae, Boraginaceae and Lamiaceae. Working in a similar floristic environment, Douwes et al. (2008) had already identified five of these families (Phyllanthaceae, Salicaceae, Apocynaceae, Loganiaceae and Boraginaceae) as over-utilised, although the two studies employed different methodological approaches. It is possible that plants in these families are over-utilised medicinally in the study area for cultural reasons; that is, people may have developed cultural preferences for some plants. It could also be that the over-utilisation of these plant families is simply dictated by the environment (geography), i.e. people may be over-utilising what the environment makes available to them in abundance (Saslis-Lagoudakis et al. 2014). Finally, the over-utilisation of these families may simply be indicative of their effectiveness for medicinal purposes. Douwes et al. (2008) reported that most of these families are rich in terpenoids and their derivatives, flavonoids and alkaloids. It is therefore not surprising that these families are also identified in the present study among the most over-utilised.

Specifically, the family Phyllanthaceae belongs to the order Malpighiales, which contains a high level of secondary compounds such as aliphatics, alkaloids, amino acids and peptides, benzo-pyranoides, flavonoids, oxygen heterocycles, polycyclic aromatics, simple aromatics, and terpenoids and their derivatives (Douwes et al. 2008). In our dataset for the Mpumalanga Province (South Africa), species in the Phyllanthaceae family are reported to treat a variety of ailments. The reported ailments and uses include high blood pressure, oedema, bronchitis, intestinal disorders, diabetes, poison, skin infection, infertility, impotency, toothache, gingivitis, insecticide, heartburn, laxative, rheumatism, viral infections, HIV/AIDS, paralysis, bone diseases, and kidney and bladder complaints (Bessong et al. 2005; Schmidt et al. 2007).

The recent work of Robles Arias et al. (2020) employed the same negative binomial model that we used in our study. Our study is similar to theirs in that both show that the relationship between medicinal plants and the total flora is not linear, contrary to what was suggested in Moerman (1979) (Fig. 1 in our study and Fig. 2 in their study). Our study differs from theirs in two ways. First, we did not identify the same families as the top over-utilised. This is because, as demonstrated in Saslis-Lagoudakis et al. (2014), the environment shapes the composition of medicinal floras. Second, Robles Arias et al. (2020) showed that the outcome of the model prediction is gender-specific, a new insight that they brought to the non-random hypothesis. Although we did not test the influence of gender in this study, we suggest that future studies assess this influence in different geographic contexts to establish its generality.

In addition, Robles Arias et al. (2020) suggested that the presence of protected areas in an environment might hamper the development of medicinal knowledge in that environment. We tested and confirmed this negative effect of protected areas on medicinal plants (unpublished work). In the context of the present study, this means that more medicinal plants might have been reported in our study area if not for the presence of protected areas (e.g. Kruger National Park). Given this potential effect of protected areas on medicinal plants, we suggest that specimens of plant species found exclusively in protected areas could be grown ex situ, in contact with local communities, to increase the probability that medicinal knowledge develops (availability hypothesis). From a similar perspective, a recent study linked land-use change with change in human selection of medicinal plants (Kunwar et al. 2016); this supports our above claim that establishing protected areas may affect the use and development of medicinal knowledge. The non-random pattern that we tested and supported in the present study may actually be driven by multiple other factors such as plant apparency (see da Silva et al. 2018).

Overall, the present study tested the non-random hypothesis of medicinal plant selection using the woody flora of the Mpumalanga Province of South Africa. The test was done using the most commonly used statistical approaches (linear regression on untransformed data and on log-transformed data). Because of the limitations of these two models, a more suitable model for count data, the negative binomial model, was also tested. The latter model seems to perform better than the former two. In any case, our analysis showed that large families tend to have more plants considered for local medicinal applications, a salient confirmation of non-random plant selection for medicinal purposes. However, the negative binomial model identified Phyllanthaceae as the most over-utilised family in the province, while the other two models identified Fabaceae as the most over-utilised family. This illustrates the need to apply the most appropriate model when testing ethnobotanical hypotheses. This is paramount because the identification of over-utilised families is the first step towards the prioritisation of research efforts for drug discovery and wild plant conservation (Saslis-Lagoudakis et al. 2013).