Introduction

Despite the great progress in molecular methods over the last decades, morphology remains a highly relevant criterion in plant systematics. Moreover, for other botanical and applied disciplines that need to distinguish plants in the field, differences other than morphological ones (cytological, molecular, etc.) are of little practical use. In the study of morphology, multivariate morphometric analysis, i.e. the statistical analysis of several to many morphological characters together, is one of the principal tools. When characters are properly defined, this approach allows for the rigorous and repeatable testing of “traditionally” used characters, as well as the discovery of new ones. In recent years, multivariate morphometrics (usually in combination with other data, such as ploidy levels and various molecular markers) has been used to disentangle complexes of closely related taxa (Foggi et al. 2012; Kaplan and Marhold 2012; Kuzmanović et al. 2013); re-examine various poorly described or otherwise dubious taxa (Rooks et al. 2011) or putative endemics (Lepší et al. 2013; Kučera et al. 2013); detect hybrids (Košnar et al. 2010; van Hengstum et al. 2012; Koutecký 2012); and describe various aspects of intraspecific variability, such as cytotypic differentiation (Cires et al. 2010; Mráz et al. 2011; Španiel et al. 2011; Koutecký et al. 2012), phenotypic plasticity (Slovák et al. 2012) and differentiation between populations from different habitats (Cires et al. 2010; Španiel et al. 2011) or between native and invasive populations (Mráz et al. 2011). A recent brief overview of the use of morphometric analysis in plant systematics, including the underlying statistical concepts, is given by Marhold (2011). For details on statistical methods, many textbooks can be consulted, for example Quinn and Keough (2009) in addition to those cited in Marhold (2011).

Many software packages allow the computation of multivariate morphometric analyses. The most popular include general statistical packages such as SAS, Statistica and SPSS and more specialised software such as Canoco (ter Braak and Šmilauer 2012) and SYN-TAX (Podani 2001). In recent years, R (R Core Team 2013) has become increasingly popular because it is distributed as free software and can handle a wide and continually growing array of analyses through additional packages. However, no comprehensive set of functions exists for the convenient handling of morphometric analysis in the R environment.

The aim of this paper is to provide a set of functions for performing distance-based morphometric analysis in R. No previous knowledge of R is necessary. The use of the functions is demonstrated on a sample data set, and the results are compared with those of several other statistical software packages. Experienced R users are welcome to modify the functions according to their specific needs. I would also be grateful for any comments, improvements of the functions or suggestions for additions in future versions.

Data structure

Data can be imported from a single tab-delimited text file. For each individual, two grouping levels are expected: population and taxon (or cytotype, genotype, etc.). There can be any number of morphological characters, either quantitative (including counts and ratios) or binary. Missing data can be present.

Sample data

The sample data (Online Resource 1) include portions of data sets from previously published studies by Koutecký (2007) and Koutecký et al. (2012): twenty-five morphological characters (see the cited studies for details) of the vegetative (stems and leaves) and reproductive structures (capitula and achenes) of three diploid species of the Centaurea phrygia complex: C. phrygia L. s.str. (abbreviated “ph”), C. pseudophrygia C.A.Mey. (“ps”) and C. stenolepis A.Kern. (“st”). A fourth group includes the putative hybrid C. pseudophrygia × C. stenolepis (“hybr”). The data represent 8, 12, 7 and 6 populations of these four groups, respectively, with 20 individuals per population, except for one population in which only 12 individuals were available. All morphological characters are either quantitative (direct measurements, counts or ratios) or binary (two character states or presence/absence). Four achene characters contain missing data because fruits were not available for all individuals. In two populations of C. stenolepis, fruits were missing completely. In total, the data set includes 652 individuals (453 complete) from 33 populations (31 complete).

Typical workflow and functions

Data can be imported from tab-delimited text files using the read.morfodata or read.morfodata2 functions (which differ only in the decimal separator used). Import from Excel spreadsheets through the clipboard in Windows also usually works well. After importing the data, the functions rename the first three columns and convert them into factors if necessary. The imported data structure should be checked using built-in R functions such as summary and str.
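The import step can be approximated with base R alone. The following is a minimal sketch, not the definition of read.morfodata: the file contents, the original column names and the character names SL and LW are invented for illustration, and only the behaviour described above (tab-delimited input, renaming and factor conversion of the first three columns) is mimicked.

```r
# Hypothetical tab-delimited input; all column names and values are invented.
txt <- paste(
  "ind\tpop\ttax\tSL\tLW",
  "i1\tP1\tph\t45.2\t12.1",
  "i2\tP1\tph\t50.1\tNA",
  "i3\tP2\tps\t61.0\t15.3",
  sep = "\n"
)
path <- tempfile(fileext = ".txt")
writeLines(txt, path)

# dec = "." here; use dec = "," for files written with a decimal comma.
dat <- read.delim(path, dec = ".", stringsAsFactors = FALSE)

# Rename the first three columns and convert them to factors.
names(dat)[1:3] <- c("ID", "Population", "Taxon")
dat[1:3] <- lapply(dat[1:3], factor)

# Check the imported structure with built-in functions.
str(dat)
summary(dat)
```

The str output should show three factors followed by numeric characters; a character column among the measurements usually indicates a wrong decimal separator.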

Descriptive statistics for each character can be calculated using the functions descr.all, descr.tax and descr.pop, which calculate values for the whole data set, each taxon and each population, respectively. The following statistics are included: number of observations, mean, standard deviation, and the percentiles 0 % (minimum), 5 %, 25 % (lower quartile), 50 % (median), 75 % (upper quartile), 95 % and 100 % (maximum). The 5 % and 95 % percentiles are included because the trimmed range (without the most extreme 10 % of values) is sometimes used in taxon descriptions, determination keys, etc. The results are formatted as data frames to allow easy export from R to text files or directly to spreadsheet editors such as Excel.
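The set of statistics listed above can be reproduced with base R. This sketch uses the built-in iris data as a stand-in for a morphometric data set and a hypothetical helper named descr; it illustrates the statistics only and is not the code of the descr.* functions.

```r
# Hypothetical helper: N, mean, SD and the seven percentiles for one character.
descr <- function(x) {
  x <- x[!is.na(x)]                       # ignore missing values
  c(N = length(x), Mean = mean(x), SD = sd(x),
    quantile(x, probs = c(0, 0.05, 0.25, 0.5, 0.75, 0.95, 1)))
}

# One character, summarised separately for each group (taxon).
res <- t(sapply(split(iris$Sepal.Length, iris$Species), descr))
res

# A data frame version is convenient for export.
res_df <- as.data.frame(res)
```

The interval between the 5 % and 95 % columns gives the trimmed range mentioned above.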

To export the results of most MorphoTools functions, the export.res function is provided. This function exports spreadsheet-like results (of the data frame or matrix classes) as a tab-delimited text file or copies the data to the clipboard.
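For orientation, a tab-delimited export of a data frame can be written with the base function write.table. This is only a sketch of the idea under the assumption of a small invented result table; the arguments of export.res itself are described in the manual and are not reproduced here.

```r
# Invented result table standing in for the output of an analysis.
res <- data.frame(Taxon = c("ph", "ps"), Mean = c(45.2, 61.0))

# Write a tab-delimited text file that spreadsheet editors can open.
out <- tempfile(fileext = ".txt")
write.table(res, file = out, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm that the round trip preserves the values.
back <- read.delim(out)
back
```

On Windows, write.table also accepts file = "clipboard", which allows pasting directly into a spreadsheet.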

The data may contain missing values. Generally, most of the multivariate methods require a full data matrix. The preferred approach is to reduce the data set to complete rows only (i.e. perform the casewise deletion of missing data, as is incorporated within most of the functions) or to remove characters for which there are missing values (if they are concentrated in a few characters). In exceptional cases, missing data can be substituted with mean values for the respective population using the na.meansubst function. The use of mean substitution, which introduces values that are not present in the original data, is justified only if (1) there are relatively few missing values, (2) these missing values are scattered throughout many characters (each character includes only a few missing values) and (3) removing all individuals or all characters with missing data would unacceptably reduce the data set.
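The two ways of handling missing values can be contrasted in a few lines of base R. The data frame below is invented (two populations, one achene character called AL), and the substitution is written with ave rather than with na.meansubst, whose implementation is not shown in this text.

```r
# Invented data: one character with a missing value in each population.
dat <- data.frame(
  Population = rep(c("P1", "P2"), each = 4),
  AL = c(3.1, NA, 2.9, 3.0, 4.2, 4.0, NA, 4.4)
)

# Option 1: casewise deletion, keeping complete rows only.
complete <- dat[complete.cases(dat), ]

# Option 2: replace each NA with the mean of its own population.
subst <- dat
subst$AL <- ave(dat$AL, dat$Population, FUN = function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
subst
```

Mean substitution shrinks the within-population variance, which is one reason to restrict it to the exceptional cases listed above.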

Most of the multivariate analyses can be computed at two levels, using either individuals or populations as the operational taxonomic units (OTUs). While the first level describes the real variation in nature, the latter represents average trends from which much of the random variation has been removed. Population data may be obtained using the functions popul.means and popul.otu. The former computes population means for each character and returns a data frame that can be exported as a formatted table, and the latter converts this data frame to the form required by the following functions (which are generally prepared for individual data). Note that when populations are used as OTUs, they are given the same weight in all analyses (disregarding population size, within-population variation, etc.).
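Population means of this kind can be computed with the base function aggregate. In this sketch the iris data stand in for individual measurements and a population factor is invented (15 populations of 10 individuals each); the output format of popul.means may differ.

```r
# Stand-in data: iris measurements with an invented population factor.
dat <- data.frame(
  Population = rep(paste0("P", 1:15), each = 10),
  Taxon = iris$Species,
  iris[1:4]
)

# Mean of every character within each population (NA values ignored).
pop_means <- aggregate(
  dat[, 3:6],
  by = list(Population = dat$Population, Taxon = dat$Taxon),
  FUN = mean, na.rm = TRUE
)
head(pop_means)
```

Each row of pop_means is one OTU with equal weight, regardless of how many individuals contributed to it.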

Some multivariate methods, such as principal component analysis and (especially) discriminant analyses, assume a multivariate normal distribution of the characters. However, this condition is rarely met by biological data. Although these methods are, to some extent, robust to the violation of the normality assumption (Lepš and Šmilauer 2003), transformations (e.g. logarithmic, square-root or arcsine) of the original characters can be helpful. To examine deviation from the normal distribution, the functions charhist and charhist.t plot histograms and the expected normal distribution of quantitative characters, the former for the whole data set and the latter for a specified taxon. If significance tests are necessary, the Shapiro–Wilk test is available in the R base installation (the shapiro.test function); several other normality tests are included in the nortest package (Gross and Ligges 2012).
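A histogram with the fitted normal density, followed by a Shapiro–Wilk test, can be drawn with base R as follows. The iris data are used as a stand-in and the choice of character and taxon is arbitrary; this sketch is not the code of charhist or charhist.t.

```r
# One quantitative character of one group (stand-in data).
v <- iris$Petal.Length[iris$Species == "versicolor"]
m <- mean(v)
s <- sd(v)

# Histogram on the density scale with the expected normal curve on top.
hist(v, freq = FALSE, main = "Petal.Length, versicolor", xlab = "Value")
curve(dnorm(x, mean = m, sd = s), add = TRUE)

# Shapiro-Wilk test on the raw and on the log-transformed values.
sw <- shapiro.test(v)
sw_log <- shapiro.test(log(v))
c(raw = sw$p.value, log = sw_log$p.value)
```

The mean and standard deviation are stored in m and s before calling curve, because curve reuses the name x for its own plotting grid.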

Some methods (again, mainly principal component and discriminant analyses) are susceptible to errors resulting from highly correlated characters (|r| > 0.95). Therefore, correlations should be examined. The functions cormat.p and cormat.s calculate matrices of the correlation coefficients of characters (Pearson’s and Spearman’s, respectively). The results are formatted as data frames to allow export with the export.res function. Significance tests are not performed, as they are usually unnecessary in morphometric analysis. However, these tests can be computed using the R base function cor.test (for a pair of characters) or the corr.test function from the psych package (Revelle 2013) (matrix of significance tests).
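Screening for highly correlated characters can be done with the base function cor. The sketch below uses the iris data as a stand-in; the pairwise handling of missing values is an assumption for illustration and is not necessarily what cormat.p and cormat.s do.

```r
chars <- iris[1:4]   # stand-in for the morphological characters

# Pearson and Spearman correlation matrices.
r_p <- cor(chars, use = "pairwise.complete.obs", method = "pearson")
r_s <- cor(chars, use = "pairwise.complete.obs", method = "spearman")

# List each pair of characters with |r| > 0.95 once (upper triangle only).
high <- which(abs(r_p) > 0.95 & upper.tri(r_p), arr.ind = TRUE)
data.frame(
  char1 = rownames(r_p)[high[, "row"]],
  char2 = colnames(r_p)[high[, "col"]],
  r = r_p[high]
)

# Significance test for a single pair, if required.
cor.test(iris$Petal.Length, iris$Petal.Width)
```

When a pair is flagged, one of the two characters is usually dropped before PCA or discriminant analysis.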

Cluster analysis can be employed to obtain an overall view of the structure of the data. Typically, populations are used as OTUs. The functions clust.upgma and clust.ward provide the two most often used clustering algorithms, UPGMA and Ward’s method. Both functions include (1) standardisation of the characters to a zero mean and a unit standard deviation, (2) calculation of the distance matrix and (3) clustering using the R base function hclust. The results are plotted as a dendrogram using the generic function plot, and rectangles highlighting certain clusters can be added using the R base function rect.hclust (Fig. 1a). Unlike the Statistica and SAS software packages, the R functions can handle missing data, and thus all OTUs are retained in the analysis, which leads to slightly different dendrograms. However, when the missing data are excluded prior to the analysis, the results (both distance matrices and dendrograms) are identical.
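The three steps (standardisation, distance matrix, clustering) can be sketched with base R. The populations are invented by splitting the iris data into 15 groups of 10 individuals, and the hclust method names "average" (UPGMA) and "ward.D2" are assumptions about suitable settings, not a statement of what clust.upgma and clust.ward use internally.

```r
# Invented populations: 15 groups of 10 consecutive iris individuals.
pop <- factor(rep(paste0("P", 1:15), each = 10))
means <- aggregate(iris[1:4], by = list(Population = pop), FUN = mean)
m <- as.matrix(means[-1])
rownames(m) <- as.character(means$Population)

# (1) Standardise characters to zero mean and unit standard deviation.
z <- scale(m)

# (2) Euclidean distances; dist() skips NA values pairwise and rescales,
#     so OTUs with some missing characters are retained.
d <- dist(z, method = "euclidean")

# (3) Clustering.
upgma <- hclust(d, method = "average")
ward <- hclust(d, method = "ward.D2")

# Dendrogram with three clusters highlighted.
plot(upgma, main = "UPGMA", xlab = "", sub = "")
rect.hclust(upgma, k = 3, border = "grey40")
```

The function cutree(upgma, k = 3) returns the cluster membership of each population if the grouping is needed as data rather than as a picture.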

Fig. 1

Examples of graphical outputs based on the sample data set. a Cluster analysis of populations (UPGMA based on Euclidean distances); the R base function plot is used and selected clusters are highlighted by dashed lines using the R base function rect.hclust. b PCA of individuals (the plot function). c Positions of characters in the PCA of individuals (the R base functions plot, arrows, and text). d Canonical discriminant analysis of individuals displayed using the s.class function from the ade4 package; only three species are included. e Canonical discriminant analysis of individuals displayed as a 3D diagram using the scatterplot3d function from the scatterplot3d package; to obtain three canonical axes, hybrids were included as the fourth group. f Canonical discriminant analysis of individuals of two species only (C. phrygia: black, and C. pseudophrygia: white); canonical scores are presented as a histogram using the R base function hist, and the overlap between the groups is displayed using semitransparent colours

As the next step, principal component analysis (PCA) is conducted. Several PCA functions are available in R; the MorphoTools functions are based on the R base function prcomp. The function pca.calc performs PCA after omitting rows with missing data and standardising columns (characters) to have a zero mean and a unit standard deviation. The results are accessed using the pca.scores function for the ordination scores of objects (OTUs), the pca.cor function for correlations (loadings) of characters with the ordination axes, and the pca.eigen function for the eigenvalues of the ordination axes. The pca.scores function uses the generic function predict and can also compute scores for additional (passive) samples that were not present in the pca.calc computation. The ordination scores of the objects can be plotted using the R base function plot (two-dimensional graphs; Fig. 1b) or the scatterplot3d function from the scatterplot3d package (Ligges and Mächler 2003) (three-dimensional graphs; the style is similar to that of Fig. 1e). A graph of the characters represented as arrows can be drawn using the generic functions plot and arrows (Fig. 1c). More plots, including a “spider” diagram of the ordination scores connected to the group centroids (the style is similar to that of Fig. 1d), are available in the ade4 package (Dray and Dufour 2007), which also has its own PCA function, dudi.pca. The results of the R functions and those of other software packages (SAS, Statistica, Canoco 5) are identical.
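Because the PCA functions are built on prcomp, the quantities they return can be illustrated directly with that function. The sketch below uses the iris data as a stand-in and treats the first three individuals as hypothetical passive samples; the object names are chosen for illustration only.

```r
chars <- iris[1:4]
keep <- complete.cases(chars)          # omit rows with missing data

# PCA on characters standardised to zero mean and unit standard deviation.
pca <- prcomp(chars[keep, ], center = TRUE, scale. = TRUE)

eig <- pca$sdev^2                      # eigenvalues of the ordination axes
scores <- pca$x                        # ordination scores of the objects
load_cor <- cor(chars[keep, ], pca$x)  # correlations of characters with axes

# Scores of passive samples projected into the existing ordination.
passive <- predict(pca, newdata = chars[1:3, ])

# Objects on the first two axes, coloured by group.
plot(scores[, 1:2], col = as.integer(iris$Species[keep]), pch = 16)

# Characters as arrows.
plot(load_cor[, 1:2], type = "n", xlim = c(-1, 1), ylim = c(-1, 1))
arrows(0, 0, load_cor[, 1], load_cor[, 2], length = 0.1)
text(load_cor[, 1:2], labels = rownames(load_cor), pos = 3)
```

With standardised characters the eigenvalues sum to the number of characters, so eig / sum(eig) gives the proportion of variation explained by each axis.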

To test the differences between groups, canonical and classificatory discriminant analyses are used. Discriminant analysis is a powerful tool, especially if independent information (other than morphology) is available. For example, groupings based on genetic markers (Kučera et al. 2013), genome size/ploidy levels (Koutecký et al. 2012; Kúr et al. 2012), habitat characteristics (Španiel et al. 2011) or geographic origin (Mráz et al. 2011) can be tested. In addition, the contributions of individual characters can be examined and rigorously tested. The separation of the groups is often expressed as the number or percentage of correctly classified individuals in classificatory discriminant analysis.

Canonical discriminant analysis is computed by the discr.calc function that is based on the cca function from the vegan package (Oksanen et al. 2013). Standardisation of data is unnecessary in discriminant analyses and thus the original values of characters are used. Similarly to Canoco, vegan uses permutation significance tests that overcome the need for the normal distribution of characters. The function discr.calc creates an object, discr.data, that contains two data frames (values of the characters and the classification of objects to a taxon), then passes it to the cca (the vegan package) function, which performs the discriminant analysis. The function discr.sum summarises the results and returns the eigenvalues of the canonical axes, percentage of variation explained, canonical correlation coefficients for the individual axes, and permutation tests of the whole model and of the individual axes. Note that the eigenvalues in vegan (as in Canoco) are equal to the squares of the canonical correlation coefficients and are different from the eigenvalues printed by other software (such as Statistica and SAS); the recalculation is returned as the first item by the discr.sum function (see ter Braak and Šmilauer 2012 for details). The coefficients of the discriminant function (regression coefficients) for individual characters are returned by the discr.coef function; the mean of each character is also computed. These coefficients are for centred but unstandardised characters [i.e. for characters A, B,.., the score = (A – mean (A)) * coef A + (B – mean (B)) * coef B + …]. The function discr.taxa returns the canonical scores of taxa, while the scores of individuals (OTUs) are returned by the discr.scores function. This function uses the generic function predict and can thus also calculate scores for passive samples that were not present in the original ordination. 
This approach is advantageous for testing the positions of “atypical” populations or for assessing the taxonomic positions of selected individuals; for an example, see Kúr et al. (2012), in which the cytotypes of type herbarium specimens were estimated from morphology. The function discr.bip returns the contribution of characters to the individual axes (“biplot scores”; in contrast with usual biplot scores, these scores are standardised by within-group variance instead of total variance in discriminant analysis to better reflect the relative importance of characters; see Lepš and Šmilauer 2003 for details). The function discr.test performs two significance tests of individual characters based on a permutation procedure. First, the function tests the marginal effects (i.e. when a character is alone in the model). Second, the function tests the unique contributions of the characters (i.e. the addition of each character into the model with all other characters; note that these contributions are called marginal effects in vegan terminology). The latter is analogous to the standardised coefficients of the discriminant function computed by other software packages. Finally, the function discr.step performs the stepwise (forward) selection of characters. The results of discr.coef, discr.taxa, discr.scores, discr.bip and discr.test can be exported using the export.res function and plotted using the generic function plot in a way similar to PCA as 2D (a style similar to that of Fig. 1b) or 2D “spider” (Fig. 1d) or 3D figures (Fig. 1e). If only two groups are present, the scores along the only canonical axis can be drawn as a histogram common for both groups (Fig. 1f). The comparable results (such as the eigenvalues and the group/sample scores) of the R functions and those of other software packages (SAS, Statistica, Canoco 5) are identical.
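The relation between the two kinds of eigenvalues, and the score formula for centred but unstandardised characters, can be checked numerically. The sketch below deliberately does not use the vegan-based route of discr.calc; as a stand-in it obtains the canonical correlations r with the base function cancor on dummy-coded groups and compares them with lda from MASS. It relies on the standard relation that Canoco-style eigenvalues equal r^2 whereas the eigenvalues reported by classical discriminant analysis equal r^2 / (1 - r^2). The iris data replace the sample data.

```r
library(MASS)

X <- as.matrix(iris[1:4])                        # characters (stand-in data)
G <- model.matrix(~ Species, data = iris)[, -1]  # dummy-coded groups (k - 1 columns)

# Canonical correlations between the characters and the group membership.
cc <- cancor(X, G)
r <- cc$cor

eig_cca <- r^2                    # eigenvalues in the vegan/Canoco sense
eig_classic <- r^2 / (1 - r^2)    # eigenvalues in the classical sense

# Cross-check against lda: squared singular values are canonical F statistics.
fit <- lda(X, grouping = iris$Species)
n <- nrow(X)
k <- nlevels(iris$Species)
eig_lda <- fit$svd^2 * (k - 1) / (n - k)
rbind(eig_classic, eig_lda)

# Canonical scores: (A - mean(A)) * coef A + (B - mean(B)) * coef B + ...
coefs <- cc$xcoef[, 1:2]
scores <- sweep(X, 2, colMeans(X)) %*% coefs
head(scores)
```

The scaling of the coefficients returned by cancor is arbitrary and differs from that of discr.coef, so only the relative positions of the scores are comparable.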

Classificatory discriminant analysis with cross-validation is computed based on the function lda from the package MASS (Venables and Ripley 2002). Two functions, classif.da and classif.da.1, are available. Both functions return the original three columns from the data (ID, Population, Taxon), a column with the classification from discriminant analysis, and the posterior probabilities of the classification for each group (taxon). These two functions differ in the mode of cross-validation. The function classif.da.1 uses the standard leave-one-out method. However, as some hierarchical structure is usually present in the data (individuals from a population are not completely independent observations, as they are closer to each other than to individuals from other populations), the function classif.da uses whole populations as leave-out units. This method does not allow classification if there is only one population for a taxon and is more sensitive to “atypical” populations, which usually leads to a somewhat lower classification success rate. Finally, the function classif.samp conducts the classificatory discriminant analysis of a sample set based on an independent training set and may be used to classify hybrid populations, type herbarium specimens, etc. The results of all three functions are accessed by the classif.matrix and classif.pmatrix functions. The former creates a classification matrix of the taxa (i.e. for each taxon in the original data, it shows the number of classifications into all taxa present and the percentage of correct classifications), while the latter shows the classification of populations in a similar way. 
This detailed classification of populations is very useful, as it can reveal some atypical or incorrectly assigned populations; for example, most of the populations from the sample data are successfully classified using the classif.da function (generally over 70 % correct classifications and often 100 %), but two populations have only 15 and 31 % correct classifications. The results of all of the functions can be exported using the export.res function.
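The two cross-validation modes can be sketched with lda from MASS. The iris data again serve as a stand-in, with an invented population factor (five populations of 10 individuals per species); the loop is one possible way to leave out whole populations and is not the code of classif.da.

```r
library(MASS)

dat <- data.frame(
  Population = factor(rep(paste0("P", 1:15), each = 10)),  # invented
  Taxon = iris$Species,
  iris[1:4]
)
f <- Taxon ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

# Leave-one-out cross-validation (individuals as leave-out units).
loo <- lda(f, data = dat, CV = TRUE)
head(loo$posterior)

# Leave-one-population-out cross-validation.
pred <- factor(rep(NA, nrow(dat)), levels = levels(dat$Taxon))
for (p in levels(dat$Population)) {
  test <- dat$Population == p
  fit <- lda(f, data = dat[!test, ])
  pred[test] <- predict(fit, newdata = dat[test, ])$class
}

# Classification matrix of taxa with the percentage of correct classifications.
tab <- table(Original = dat$Taxon, Classified = pred)
pct <- 100 * diag(tab) / rowSums(tab)
cbind(tab, Correct = round(pct, 1))

# Classification of individual populations.
table(Population = dat$Population, Classified = pred)
```

The population-level table is where atypical or incorrectly assigned populations become visible as rows with a low share of correct classifications.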

k-nearest neighbour classification is used as a non-parametric alternative to classificatory discriminant analysis. This method finds the k nearest neighbours of an individual (those with the lowest Euclidean distance) and classifies the individual according to the a priori classification of the neighbours using a majority vote. The functions are based on the knn and knn.cv functions from the package class (Venables and Ripley 2002). The functions knn.select and knn.select.1 search for the optimal k for the given data set; they differ only in the cross-validation method (similarly to the classificatory discriminant analysis functions). The functions compute the number of correctly classified individuals for k values from 1 to 30 and highlight the k with the highest success rate. Ties (i.e. two or more groups receive the same number of votes) are broken at random, and different iterations may thus give different results. Therefore, the functions compute 10 iterations and use the average success rates for each k; they also graphically display the minimum and maximum success rates for each k. The functions knn.classif and knn.classif.1 then perform k-nearest neighbour classification using the specified k; they differ in the cross-validation method in the same manner as the above functions. The function knn.samp computes the k-nearest neighbour classification of a sample set using an independent training set. The results of the knn functions have a format similar to that of the classificatory discriminant analyses, and classification matrices can be produced using the classif.matrix and classif.pmatrix functions.
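The search for the optimal k can be sketched with knn.cv from the package class, which performs leave-one-out cross-validation. The iris data are a stand-in, standardising the characters beforehand is an assumption made here so that all characters contribute equally to the Euclidean distance, and the seed is fixed only to make the random tie-breaking repeatable.

```r
library(class)

X <- scale(iris[1:4])     # assumption: characters standardised before knn
cl <- iris$Species
ks <- 1:30

set.seed(1)               # ties are broken at random
# 10 iterations for each k; rows = iterations, columns = values of k.
acc <- sapply(ks, function(k) {
  replicate(10, mean(knn.cv(X, cl = cl, k = k) == cl))
})

mean_acc <- colMeans(acc)
best_k <- ks[which.max(mean_acc)]
best_k

# Average success rate with the range over iterations for each k.
plot(ks, mean_acc, type = "b", ylim = range(acc),
     xlab = "k", ylab = "Proportion correctly classified")
segments(ks, apply(acc, 2, min), ks, apply(acc, 2, max))
points(best_k, max(mean_acc), pch = 16)

# Classification with the selected k.
pred <- knn.cv(X, cl = cl, k = best_k)
table(Original = cl, Classified = pred)
```

Leaving out whole populations instead of single individuals would require a loop over populations with knn, in the same way as for the discriminant analysis.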

Three additional files can be downloaded from the journal’s webpage or from the MorphoTools webpage: the full manual to the functions (Online Resource 2), the function definitions (Online Resource 3), and the working protocol containing the analysis of the sample data (Online Resource 4).