MorphoTools: a set of R functions for morphometric analysis

Koutecký, Petr

doi:10.1007/s00606-014-1153-2


Original Article
Published: 16 October 2014

Volume 301, pages 1115–1121, (2015)

Plant Systematics and Evolution


Abstract

A set of R functions for the convenient handling of morphometric analysis is provided. No previous knowledge of R is required. The functions include data import from Excel or tab-delimited text files, descriptive statistics for populations and taxa, histograms of characters, correlation matrices of characters, cluster analysis, principal component analysis, linear discriminant analysis with permutation tests, classificatory discriminant analysis and k-nearest neighbour classification. The use of the functions is demonstrated on a sample data set. Detailed descriptions of the functions and examples of the scripts for producing graphics are included as an electronic appendix. Documentation and function definitions can be downloaded from http://www.prf.jcu.cz/systematics/morphotools.html.


Introduction

Despite the great progress in molecular methods over the last decades, morphology remains a highly relevant criterion in plant systematics. Moreover, for other botanical and applied disciplines that need to distinguish plants in the field, non-morphological differences (cytological, molecular, etc.) are of little practical use. Within morphology, multivariate morphometric analysis (i.e. the statistical analysis of several to many morphological characters together) is one of the principal tools. When characters are properly defined, this approach allows for the rigorous and repeatable testing of “traditionally” used characters, as well as the discovery of new ones. In recent years, multivariate morphometrics (usually in combination with other data, such as ploidy levels and various molecular markers) have been used to disentangle complexes of closely related taxa (Foggi et al. 2012; Kaplan and Marhold 2012; Kuzmanović et al. 2013); re-examine various poorly described or otherwise dubious taxa (Rooks et al. 2011) or putative endemics (Lepší et al. 2013; Kučera et al. 2013); detect hybrids (Košnar et al. 2010; van Hengstum et al. 2012; Koutecký 2012); and describe various aspects of intraspecific variability, such as cytotypic differentiation (Cires et al. 2010; Mráz et al. 2011; Španiel et al. 2011; Koutecký et al. 2012), phenotypic plasticity (Slovák et al. 2012) and differentiation between populations from different habitats (Cires et al. 2010; Španiel et al. 2011) or between native and invasive populations (Mráz et al. 2011). A recent brief overview of the use of morphometric analysis in plant systematics, including the underlying statistical concepts, is given by Marhold (2011). For details on statistical methods, many textbooks can be consulted; besides those cited in Marhold (2011), see for example Quinn and Keough (2009).

Many software packages allow the computation of multivariate morphometric analyses. The most popular include general statistical packages such as SAS, Statistica and SPSS and more specialised software such as Canoco (ter Braak and Šmilauer 2012) and SYN-TAX (Podani 2001). In recent years, R (R Core Team 2013) has experienced increasing popularity due to its distribution as freeware and its ability to handle a wide and continually growing array of analyses through supplementation with additional packages. However, no comprehensive set of functions exists for the convenient handling of morphometric analysis in the R environment.

The aim of this paper is to provide a set of functions for performing distance-based morphometric analysis in R. No previous knowledge of R is necessary. The use of the functions is demonstrated on a sample data set, and the results are compared with those of several other statistical software packages. Experienced R users are welcome to modify the functions according to their specific needs. I would also be grateful for any comments, improvements of the functions or suggestions for additions in future versions.

Data structure

Data can be imported from a single tab-delimited text file. For each individual, two grouping levels are expected: population and taxon (or cytotype, genotype, etc.). There can be any number of morphological characters, either quantitative (including counts and ratios) or binary. Missing data can be present.

Sample data

The sample data (Online Resource 1) include portions of data sets from previously published studies by Koutecký (2007) and Koutecký et al. (2012): twenty-five morphological characters (see the cited studies for details) of the vegetative (stems and leaves) and reproductive structures (capitula and achenes) of three diploid species of the Centaurea phrygia complex: C. phrygia L. s.str. (abbreviated “ph”), C. pseudophrygia C.A.Mey. (“ps”) and C. stenolepis A.Kern. (“st”). Moreover, a fourth group includes the putative hybrid C. pseudophrygia × C. stenolepis (“hybr”). The four groups are represented by 8, 12, 7 and 6 populations, respectively, with 20 individuals per population, except for one population in which only 12 individuals were available. All morphological characters are either quantitative (direct measurements, counts or ratios) or binary (two character states or presence/absence). In four characters of achenes, there are missing data because fruits were not available in all individuals. In two populations of C. stenolepis, fruits were completely missing. In total, the data set includes 652 individuals (453 complete) from 33 populations (31 complete).

Typical workflow and functions

Data can be imported from tab-delimited text files using the read.morfodata or read.morfodata2 functions (which differ only in the decimal separator used). Import from Excel spreadsheets through the clipboard in Windows also usually works well. After importing the data, the functions rename the first three columns and convert them into factors if necessary. The imported data structure should be checked using built-in R functions such as summary and str.
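As a rough illustration of the expected layout, the following sketch builds a tiny tab-delimited table and reads it with base R only. It does not call read.morfodata; the character names (SL, LW), the values and the use of read.delim are assumptions made for the example.

```r
# Hypothetical miniature data set: ID, population, taxon, then characters.
# All names and values are invented for illustration.
txt <- paste(
  "ID\tPopulation\tTaxon\tSL\tLW",
  "1\tpopA\tph\t45.2\t1.8",
  "2\tpopA\tph\t47.9\tNA",
  "3\tpopB\tps\t61.0\t2.4",
  sep = "\n"
)

# read.delim() expects tab-separated fields and "." as the decimal mark;
# read.delim2() is the base R counterpart for a decimal comma.
morfo <- read.delim(text = txt, stringsAsFactors = FALSE)

# Convert the three leading columns to factors.
morfo[1:3] <- lapply(morfo[1:3], factor)

# Check the imported structure, as recommended in the text.
str(morfo)
summary(morfo)
```

In practice the `text =` argument would be replaced by the path to the data file.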

Descriptive statistics for each character can be calculated using the functions descr.all, descr.tax, and descr.pop, which calculate values for the whole dataset, each taxon and each population, respectively. The following statistics are included: number of observations, mean, standard deviation, and the percentiles 0 % (minimum), 5 %, 25 % (lower quartile), 50 % (median), 75 % (upper quartile), 95 % and 100 % (maximum). The 5 % and 95 % percentiles are included because the trimmed range (without the most extreme 10 % of values) is sometimes used in taxa descriptions, determination keys, etc. The data are formatted as data frames to allow easy export from R to text files or directly to table editors such as Excel.
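The statistics listed above can be reproduced for a single character with a few lines of base R. This is only a sketch of the computation, not the code of descr.tax; the helper name descr and the use of the built-in iris data (with Species standing in for the taxon column) are assumptions.

```r
# Hypothetical helper: N, mean, SD and the seven percentiles for one character.
descr <- function(x) {
  x <- x[!is.na(x)]                       # ignore missing values
  q <- quantile(x, probs = c(0, 0.05, 0.25, 0.5, 0.75, 0.95, 1))
  c(N = length(x), Mean = mean(x), SD = sd(x), q)
}

# One row per taxon for a single character (here Sepal.Length).
by_taxon <- t(sapply(split(iris$Sepal.Length, iris$Species), descr))
by_taxon <- as.data.frame(by_taxon)       # data frame, ready for export
by_taxon
```

The 5 % and 95 % columns give the trimmed range mentioned in the text.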

To export the results of most MorphoTools functions, the export.res function is provided. This function exports spreadsheet-like results (of the data frame or matrix classes) as a tab-delimited text file or copies the data to the clipboard.
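For readers who want to see what such an export amounts to, a minimal base R sketch follows. It uses write.table rather than export.res, and the result table is invented for the example.

```r
# Invented result table standing in for the output of an analysis.
res <- data.frame(Taxon = c("ph", "ps"), Mean = c(45.2, 61.0))

# Write a tab-delimited text file (a temporary file is used here).
out <- tempfile(fileext = ".txt")
write.table(res, file = out, sep = "\t", quote = FALSE, row.names = FALSE)

# Inspect the file contents.
readLines(out)

# On Windows, file = "clipboard" sends the same output to the clipboard.
```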

The data may contain missing values. Generally, most of the multivariate methods require a full data matrix. The preferred approach is to reduce the data set to complete rows only (i.e. perform the casewise deletion of missing data, as is incorporated within most of the functions) or to remove characters for which there are missing values (if they are concentrated in a few characters). In exceptional cases, missing data can be substituted with mean values for the respective population using the na.meansubst function. The use of mean substitution, which introduces values that are not present in the original data, is justified only if (1) there are relatively few missing values, (2) these missing values are scattered throughout many characters (each character includes only a few missing values) and (3) removing all individuals or all characters with missing data would unacceptably reduce the data set.
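The two ways of handling missing values can be sketched in base R as follows. This is an illustration under stated assumptions, not the code of na.meansubst: the data frame, the character name AL and the helper subst are invented.

```r
# Invented data: one character (AL) with missing values in two populations.
d <- data.frame(
  Population = factor(c("p1", "p1", "p1", "p2", "p2", "p2")),
  AL = c(3.0, NA, 3.4, 4.1, 4.3, NA)
)

# Preferred approach: casewise deletion (keep complete rows only).
complete <- d[complete.cases(d), ]

# Exceptional approach: replace each NA by the mean of its own population.
subst <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
d$AL_filled <- ave(d$AL, d$Population, FUN = subst)
d
```

Note that the substituted values (3.2 and 4.2 here) do not occur in the original data, which is why the text restricts this approach to exceptional cases.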

Most of the multivariate analyses can be computed at two levels, using either individuals or populations as the operational taxonomic units (OTUs). While the first level describes the real variation in nature, the latter represents average trends that are less affected by random variation. Population data may be obtained using the functions popul.means and popul.otu. The former computes population means for each character and returns a data frame that can be exported as a formatted table, and the latter converts this data frame to the form required by the following functions (which are generally prepared for individual data). Note that when populations are used as OTUs, all populations are given the same weight in all analyses (disregarding population size, within-population variation, etc.).
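Computing population means is a plain aggregation step. The sketch below uses the base R function aggregate on simulated data; it is not the code of popul.means, and all names and values are assumptions.

```r
# Simulated individuals: 3 populations of 4 plants, two characters.
set.seed(1)
ind <- data.frame(
  Population = rep(c("p1", "p2", "p3"), each = 4),
  Taxon      = rep(c("ph", "ph", "ps"), each = 4),
  SL         = rnorm(12, mean = 50, sd = 5),
  LW         = rnorm(12, mean = 2, sd = 0.3)
)

# One row per population; every population counts once, whatever its size.
# With the formula interface, rows containing NA are dropped by default.
pop_means <- aggregate(cbind(SL, LW) ~ Population + Taxon,
                       data = ind, FUN = mean)
pop_means
```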

Some multivariate methods, such as principal component analysis and (especially) discriminant analyses, require the multivariate normal distribution of the characters. However, this condition is rarely achieved by biological data. Although these methods are, to some extent, robust to the violation of the normality assumption (Lepš and Šmilauer 2003), transformations (e.g. logarithmic, square-root or arcsin) of the original characters can be helpful. To examine deviation from the normal distribution, the functions charhist and charhist.t plot histograms and the expected normal distribution of quantitative characters, the former for the whole dataset and the latter for a specified taxon. If significance tests are necessary, Shapiro–Wilk’s test is available in the R base installation (the shapiro.test function); several other normality tests are included in the nortest package (Gross and Ligges 2012).
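A histogram with the fitted normal curve, followed by the Shapiro–Wilk test, can be drawn with base R alone. The sketch below is not the code of charhist.t; the choice of the iris data and of the character is an assumption.

```r
# One character of one taxon (Sepal.Width of Iris setosa as a stand-in).
v <- iris$Sepal.Width[iris$Species == "setosa"]

# Histogram on the density scale with the expected normal curve on top.
hist(v, freq = FALSE, main = "Sepal.Width, setosa", xlab = "Sepal.Width")
curve(dnorm(x, mean = mean(v), sd = sd(v)), add = TRUE)

# Shapiro-Wilk test from the R base installation.
sw <- shapiro.test(v)
sw$p.value

# A transformed copy can be examined in the same way.
log_v <- log(v)
shapiro.test(log_v)$p.value
```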

Some methods (again, mainly principal component and discriminant analyses) are susceptible to errors resulting from highly correlated characters (|r| > 0.95). Therefore, correlations should be examined. The functions cormat.p and cormat.s calculate matrices of the correlation coefficients of characters (Pearson’s and Spearman’s, respectively). The results are formatted as data frames to allow export with the export.res function. Significance tests are not performed, as they are usually unnecessary in morphometric analysis. However, these tests can be computed using the R base function cor.test (for a pair of characters) or the corr.test function from the psych package (Revelle 2013) (matrix of significance tests).
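Screening for highly correlated characters can be sketched with the base R function cor. This is an illustration, not the code of cormat.p or cormat.s; the iris data and the pairwise handling of missing values are assumptions.

```r
chars <- iris[, 1:4]

# Pearson and Spearman correlation matrices of the characters.
r_p <- cor(chars, use = "pairwise.complete.obs", method = "pearson")
r_s <- cor(chars, use = "pairwise.complete.obs", method = "spearman")

# List each pair with |r| > 0.95 once (upper triangle only).
high <- which(abs(r_p) > 0.95 & upper.tri(r_p), arr.ind = TRUE)
flagged <- data.frame(
  char1 = rownames(r_p)[high[, "row"]],
  char2 = colnames(r_p)[high[, "col"]],
  r     = r_p[high]
)
flagged
```

In the iris data, only the pair Petal.Length and Petal.Width exceeds the threshold; one character of such a pair would typically be dropped before PCA or discriminant analysis.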

Cluster analysis can be employed to obtain an overall view of the structure of the data. Typically, populations are used as OTUs. The functions clust.upgma and clust.ward provide the two most often used clustering algorithms, UPGMA and Ward’s method. Both functions include (1) standardisation of the characters to a zero mean and a unit standard deviation, (2) calculation of the distance matrix and (3) clustering using the R base function hclust. The results are plotted as a dendrogram using the generic function plot, and rectangles highlighting certain clusters can be added using the R base function rect.hclust (Fig. 1a). Compared to Statistica and SAS software packages, the R functions can handle missing data, and thus all OTUs are retained in the analysis, which leads to slightly different dendrograms. However, when the missing data are excluded prior to the analysis, the results (both distance matrices and dendrograms) are identical.
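The three steps (standardisation, distance matrix, clustering) can be sketched in base R. The code is not that of clust.upgma or clust.ward: the Euclidean distance, the "ward.D2" variant of Ward’s method and the artificial populations formed from the iris data are all assumptions.

```r
# Hypothetical populations: 15 groups of 10 consecutive iris plants.
pop <- factor(rep(sprintf("pop%02d", 1:15), each = 10))
pop_means <- aggregate(iris[, 1:4], by = list(Population = pop), FUN = mean)
rownames(pop_means) <- pop_means$Population

# (1) standardise characters to zero mean and unit standard deviation
z <- scale(pop_means[, -1])

# (2) distance matrix; dist() tolerates NA values by rescaling the sum
d <- dist(z, method = "euclidean")

# (3) clustering: "average" linkage is UPGMA
upgma <- hclust(d, method = "average")
ward  <- hclust(d, method = "ward.D2")

# Dendrogram with rectangles around three clusters.
plot(upgma, main = "UPGMA", xlab = "", sub = "")
rect.hclust(upgma, k = 3)
groups <- cutree(upgma, k = 3)
```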

As the next step, principal component analysis (PCA) is conducted. Several PCA functions are available in R; the MorphoTools functions are based on the R base function prcomp. The function pca.calc performs PCA after omitting rows with missing data and standardising columns (characters) to have a zero mean and a unit standard deviation. The results are accessed using the pca.scores function for the ordination scores of objects (OTUs), the pca.cor function for correlations (loadings) of characters with the ordination axes, and the pca.eigen function for the eigenvalues of the ordination axes. The pca.scores function uses the generic function predict and can also compute scores for additional (passive) samples that were not present in the pca.calc computation. The ordination scores of the objects can be plotted using the R base function plot (two-dimensional graphs; Fig. 1b) or the scatterplot3d function from the scatterplot3d package (Ligges and Mächler 2003) (three-dimensional graphs; the style is similar to that of Fig. 1e). A graph of the characters represented as arrows can be drawn using the generic functions plot and arrows (Fig. 1c). More plots, including a “spider” diagram of the ordination scores connected to the group centroids (the style is similar to that of Fig. 1d), are available in the ade4 package (Dray and Dufour 2007), which also has its own PCA function, dudi.pca. The results of the R functions and those of other software packages (SAS, Statistica, Canoco 5) are identical.
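The PCA steps described above can be sketched directly with prcomp. The code is an illustration, not the MorphoTools implementation; the iris data and the split into active and passive samples are assumptions.

```r
# Keep complete rows only, then set five plants aside as passive samples.
dat     <- iris[complete.cases(iris[, 1:4]), ]
passive <- dat[1:5, ]
active  <- dat[-(1:5), ]

# PCA on characters standardised to zero mean and unit standard deviation.
pca <- prcomp(active[, 1:4], center = TRUE, scale. = TRUE)

scores       <- pca$x                        # ordination scores of OTUs
eigen_vals   <- pca$sdev^2                   # eigenvalues of the axes
loadings_cor <- cor(active[, 1:4], pca$x)    # character-axis correlations

# Scores of passive samples, projected with the generic predict().
passive_scores <- predict(pca, newdata = passive[, 1:4])

# Two-dimensional ordination plot, coloured by group.
plot(scores[, 1:2], col = as.integer(active$Species), pch = 16)
points(passive_scores[, 1:2], pch = 4)
```

Because the characters are standardised, the eigenvalues sum to the number of characters (four here).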

To test the differences between groups, canonical and classificatory discriminant analysis are used. Discriminant analysis is a powerful tool, especially if independent information (other than morphology) is available. For example, groupings based on genetic markers (Kučera et al. 2013), genome size/ploidy levels (Koutecký et al. 2012; Kúr et al. 2012), habitat characteristics (Španiel et al. 2011) or geographic origin (Mráz et al. 2011) can be tested. In addition, the contributions of individual characters can be examined and rigorously tested. The separation of the groups is often expressed as a number/percentage of correctly classified individuals in classificatory discriminant analysis.

Canonical discriminant analysis is computed by the discr.calc function, which is based on the cca function from the vegan package (Oksanen et al. 2013). Standardisation of data is unnecessary in discriminant analyses and thus the original values of characters are used. Similarly to Canoco, vegan uses permutation significance tests that overcome the need for the normal distribution of characters. The function discr.calc creates an object, discr.data, that contains two data frames (values of the characters and the classification of objects to a taxon), then passes it to the cca function of the vegan package, which performs the discriminant analysis. The function discr.sum summarises the results and returns the eigenvalues of the canonical axes, percentage of variation explained, canonical correlation coefficients for the individual axes, and permutation tests of the whole model and of the individual axes. Note that the eigenvalues in vegan (as in Canoco) are equal to the squares of the canonical correlation coefficients and are different from the eigenvalues printed by other software (such as Statistica and SAS); the recalculation is returned as the first item by the discr.sum function (see ter Braak and Šmilauer 2012 for details). The coefficients of the discriminant function (regression coefficients) for individual characters are returned by the discr.coef function; the mean of each character is also computed. These coefficients are for centred but unstandardised characters [i.e. for characters A, B, ..., the score = (A – mean (A)) * coef A + (B – mean (B)) * coef B + …]. The function discr.taxa returns the canonical scores of taxa, while the scores of individuals (OTUs) are returned by the discr.scores function. This function uses the generic function predict and can thus also calculate scores for passive samples that were not present in the original ordination.
This approach is advantageous for testing the positions of “atypical” populations or for assessing the taxonomic positions of selected individuals; for an example, see Kúr et al. (2012), in which the cytotypes of type herbarium specimens were estimated from morphology. The function discr.bip returns the contribution of characters to the individual axes (“biplot scores”; in contrast with usual biplot scores, these scores are standardised by within-group variance instead of total variance in discriminant analysis to better reflect the relative importance of characters; see Lepš and Šmilauer 2003 for details). The function discr.test performs two significance tests of individual characters based on a permutation procedure. First, the function tests the marginal effects (i.e. when a character is alone in the model). Second, the function tests the unique contributions of the characters (i.e. the addition of each character into the model with all other characters; note that these contributions are called marginal effects in vegan terminology). The latter is analogous to the standardised coefficients of the discriminant function computed by other software packages. Finally, the function discr.step performs the stepwise (forward) selection of characters. The results of discr.coef, discr.taxa, discr.scores, discr.bip and discr.test can be exported using the export.res function and plotted using the generic function plot in a way similar to PCA as 2D (a style similar to that of Fig. 1b) or 2D “spider” (Fig. 1d) or 3D figures (Fig. 1e). If only two groups are present, the scores along the only canonical axis can be drawn as a histogram common for both groups (Fig. 1f). The comparable results (such as the eigenvalues and the group/sample scores) of the R functions and those of other software packages (SAS, Statistica, Canoco 5) are identical.
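Two points of the preceding description, the two eigenvalue conventions and the form of the discriminant scores, can be checked numerically. The sketch below deliberately avoids vegan and uses cancor (base R) and lda (MASS) instead, so the scaling of its scores need not match that of discr.scores; the conversion r²/(1 − r²) is the usual definition of the eigenvalue in classical canonical discriminant analysis and is assumed to be the recalculation meant in the text.

```r
library(MASS)

X <- as.matrix(iris[, 1:4])                       # characters
G <- model.matrix(~ Species, data = iris)[, -1]   # group dummy variables

# Canonical correlations between characters and group membership.
r <- cancor(X, G)$cor

eig_r2      <- r^2               # squared canonical correlations
eig_classic <- r^2 / (1 - r^2)   # assumed "classical" eigenvalues
rbind(eig_r2, eig_classic)

# Scores from centred but unstandardised characters:
# score = (A - mean(A)) * coef_A + (B - mean(B)) * coef_B + ...
fit     <- lda(X, grouping = iris$Species)
centred <- sweep(X, 2, colMeans(X))
manual  <- centred %*% fit$scaling

# The manual scores agree with those returned by predict().
all.equal(unname(manual), unname(predict(fit, newdata = X)$x))
```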

Classificatory discriminant analysis with cross-validation is computed based on the function lda from the package MASS (Venables and Ripley 2002). Two functions, classif.da and classif.da.1, are available. Both functions return the original three columns from the data (ID, Population, Taxon), a column with the classification from discriminant analysis, and the posterior probabilities of the classification for each group (taxon). These two functions differ in the mode of cross-validation. The function classif.da.1 uses the standard leave-one-out method. However, as some hierarchical structure is usually present in the data (individuals from a population are not completely independent observations, as they are closer to each other than to individuals from other populations), the function classif.da uses whole populations as leave-out units. This method does not allow classification if there is only one population for a taxon and is more sensitive to “atypical” populations, which usually leads to a somewhat lower classification success rate. Finally, the function classif.samp conducts the classificatory discriminant analysis of a sample set based on an independent training set and may be used to classify hybrid populations, type herbarium specimens, etc. The results of all three functions are accessed by the classif.matrix and classif.pmatrix functions. The former creates a classification matrix of the taxa (i.e. for each taxon in the original data, it shows the number of classifications into all taxa present and the percentage of correct classifications), while the latter shows the classification of populations in a similar way.
This detailed classification of populations is very useful, as it can reveal some atypical or incorrectly assigned populations; for example, most of the populations from the sample data are successfully classified using the classif.da function (generally over 70 % correct classifications and often 100 %), but two populations have only 15 and 31 % correct classifications. The results of all of the functions can be exported using the export.res function.
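The difference between the two cross-validation modes can be sketched with lda from MASS. This is not the code of classif.da or classif.da.1; the population labels attached to the iris data are artificial and serve only to show the mechanics.

```r
library(MASS)

dat <- iris
# Hypothetical populations: 5 groups of 10 plants within each species.
dat$Population <- factor(paste(dat$Species,
                               rep(rep(1:5, each = 10), 3), sep = "_"))
chars <- names(dat)[1:4]

# (a) Leave-one-out cross-validation, built into lda().
loo <- lda(dat[, chars], grouping = dat$Species, CV = TRUE)
table(Original = dat$Species, Classified = loo$class)

# (b) Leave-one-population-out: each population is classified by a
#     model fitted to all remaining populations.
pred <- factor(rep(NA, nrow(dat)), levels = levels(dat$Species))
for (p in levels(dat$Population)) {
  held_out <- dat$Population == p
  fit <- lda(dat[!held_out, chars], grouping = dat$Species[!held_out])
  pred[held_out] <- predict(fit, dat[held_out, chars])$class
}

# Classification matrices for taxa and for populations.
table(Original = dat$Species, Classified = pred)
table(Population = dat$Population, Classified = pred)
```

The population-level table is the one that exposes atypical populations, since a poorly classified population stands out as a row with many off-target counts.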

k-nearest neighbour classification is used as a non-parametric alternative to classificatory discriminant analysis. This method finds k neighbours of an individual (those with the lowest Euclidean distance) and classifies the individual according to the a priori classification of the neighbours using a majority vote. The functions are based on the knn and knn.cv functions from the package class (Venables and Ripley 2002). The functions knn.select and knn.select.1 search for the optimal k for the given data set; they differ only in the cross-validation method (similarly to the classificatory discriminant analysis functions). The functions compute the number of correctly classified individuals for k values from 1 to 30 and highlight the k with the highest success rate. Ties (i.e. two or more groups receive the same number of votes) are broken at random, and different iterations may thus give different results. Therefore, the functions compute 10 iterations and use the average success rates for each k; they also graphically display the minimum and maximum success rates for each k. The functions knn.classif and knn.classif.1 then perform knn classification using the specified k; they differ by cross-validation method in the same manner as the above functions. The function knn.samp computes the knn classification from a sample set using an independent training set. The results of the knn functions have a similar format to those of the classificatory discriminant analyses, and classification matrices can be produced using the classif.matrix and classif.pmatrix functions.
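The search for the optimal k can be sketched with knn.cv from the package class. The code is an illustration of the leave-one-out variant only, not the code of knn.select.1; the iris data and the standardisation of characters before computing distances are assumptions.

```r
library(class)

set.seed(42)                   # ties are broken at random
X  <- scale(iris[, 1:4])       # standardisation is an assumption here
cl <- iris$Species

ks     <- 1:30
n_iter <- 10

# Rows: iterations; columns: values of k.
# Entries: number of correctly classified individuals (leave-one-out).
success <- sapply(ks, function(k) {
  replicate(n_iter, sum(knn.cv(X, cl, k = k) == cl))
})

avg    <- colMeans(success)
best_k <- ks[which.max(avg)]

# Average success rate per k, with the min-max range over iterations.
plot(ks, avg, type = "b", ylim = range(success),
     xlab = "k", ylab = "Correctly classified individuals")
segments(ks, apply(success, 2, min), ks, apply(success, 2, max))
abline(v = best_k, lty = 2)
```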

Three additional files can be downloaded from the journal’s webpage or from the MorphoTools webpage: the full manual to the functions (Online Resource 2), the function definitions (Online Resource 3), and the working protocol containing the analysis of the sample data (Online Resource 4).

References

Cires E, Cuesta C, Revilla MA, Fernández Prieto JA (2010) Intraspecific genome size variation and morphological differentiation of Ranunculus parnassifolius (Ranunculaceae), an Alpine-Pyrenean-Cantabrian polyploid group. Biol J Linn Soc 101:251–271
Dray S, Dufour AB (2007) The ade4 package: implementing the duality diagram for ecologists. J Stat Softw 22:1–20
Foggi B, Parolo G, Šmarda P, Coppi A, Lastrucci L, Lakusić D, Eastwood D, Rossi G (2012) Revision of the Festuca alpina group (Festuca section Festuca, Poaceae) in Europe. Bot J Linn Soc 170:618–639
Gross J, Ligges U (2012) nortest: Tests for Normality. R package, version 1.0-2. http://CRAN.R-project.org/package=nortest
Kaplan Z, Marhold K (2012) Multivariate morphometric analysis of the Potamogeton compressus group (Potamogetonaceae). Bot J Linn Soc 170:112–130
Košnar J, Košnar J, Herbstová M, Macek P, Rejmánková E, Štech M (2010) Natural hybridization in tropical spikerushes of Eleocharis subgenus Limnochloa (Cyperaceae): evidence from morphology and DNA markers. Amer J Bot 97:1229–1240
Koutecký P (2007) Morphological and ploidy level variation of Centaurea phrygia agg. (Asteraceae) in the Czech Republic, Slovakia and Ukraine. Folia Geobot 42:77–102
Koutecký P (2012) A diploid drop in the tetraploid ocean: hybridization and long-term survival of a singular population of Centaurea weldeniana Rchb. (Asteraceae), a taxon new to Austria. Pl Syst Evol 298:1349–1360
Koutecký P, Štěpánek J, Baďurová T (2012) Differentiation between diploid and tetraploid Centaurea phrygia: mating barriers, morphology and geographic distribution. Preslia 84:1–32
Kučera J, Turis P, Zozomová-Lihová J, Slovák M (2013) Cyclamen fatrense, myth or true Western Carpathian endemic? Genetic and morphological evidence. Preslia 85:133–158
Kúr P, Štech M, Koutecký P, Trávníček P (2012) Morphological and cytological variation in Spergularia echinosperma and S. rubra, and notes on potential hybridization of these two species. Preslia 84:905–924
Kuzmanović N, Comanescu P, Frajman B, Lazarević M, Paun O, Schönswetter P, Lakusić D (2013) Genetic, cytological and morphological differentiation within the Balkan-Carpathian Sesleria rigida sensu Fl. Eur. (Poaceae): a taxonomically intricate tetraploid-octoploid complex. Taxon 62:458–472
Lepš J, Šmilauer P (2003) Multivariate analysis of ecological data using CANOCO. Cambridge Univ. Press, Cambridge
Lepší M, Lepší P, Vít P (2013) Sorbus quernea: taxonomic confusion caused by the naturalization of an alien species, Sorbus mougeotii. Preslia 85:159–178
Ligges U, Mächler M (2003) scatterplot3d—an R package for visualizing multivariate data. J Stat Softw 8:1–20
Marhold K (2011) Multivariate morphometrics and its application to monography at specific and infraspecific levels. In: Stuessy TF, Lack HW (eds) Monographic plant systematics: fundamental assessment of plant biodiversity. A. R. Gantner Verlag, Ruggell, pp 73–99
Mráz P, Bourchier RS, Treier UA, Schaffner U, Müller-Schärer H (2011) Polyploidy in phenotypic space and invasion context: a morphometric study of Centaurea stoebe s.l. Int J Pl Sci 172:386–402
Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Wagner H (2013) vegan: community ecology package, version 2.0-10. http://CRAN.R-project.org/package=vegan
Podani J (2001) Syn-tax 2000. Computer programs for data analysis in ecology and systematics. User’s manual. Scientia Publishing, Budapest
Quinn GP, Keough MJ (2009) Experimental design and data analysis for biologists. Cambridge University Press, Cambridge
R Core Team (2013) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Revelle W (2013) psych: procedures for personality and psychological research, version 1.4.1. Northwestern University, Evanston, Illinois, USA. http://CRAN.R-project.org/package=psych
Rooks F, Jarolímová V, Záveská Drábková E, Kirschner J (2011) The elusive Juncus minutulus: a failure to separate tetra- and hexaploid individuals of the Juncus bufonius complex in a morphometric comparison of cytometrically defined groups. Preslia 83:565–589
Slovák M, Kučera J, Marhold K, Zozomová-Lihová J (2012) The morphological and genetic variation in the polymorphic species Picris hieracioides (Compositae, Lactuceae) in Europe strongly contrasts with traditional taxonomical concepts. Syst Bot 37:258–278
Španiel S, Marhold K, Fialová B, Zozomová-Lihová J (2011) Genetic and morphological variation in the diploid–polyploid Alyssum montanum in Central Europe: taxonomic and evolutionary considerations. Pl Syst Evol 294:1–15
ter Braak CJF, Šmilauer P (2012) Canoco reference manual and user’s guide: software for ordination (version 5.0). Microcomputer Power, Ithaca
van Hengstum T, Lachmuth S, Oostermeijer JGB, den Nijs HCM, Meirmans PG, van Tienderen PH (2012) Human-induced hybridization among congeneric endemic plants on Tenerife, Canary Islands. Pl Syst Evol 298:1119–1131
Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York


Acknowledgments

I am obliged to Petr Šmilauer, Jan Š. Lepš, Filip Kolář, Pavel Kúr and Tomáš Urfus who gave me valuable advice on the statistical methods used or performed some analyses in other software packages for comparison. Filip Kolář also came up with the name MorphoTools. Three reviewers provided valuable comments to the first version of the manuscript.

Author information

Authors and Affiliations

Faculty of Science, Department of Botany, University of South Bohemia, Branišovská 1760, 37005, České Budějovice, Czech Republic
Petr Koutecký


Corresponding author

Correspondence to Petr Koutecký.


Cite this article

Koutecký, P. MorphoTools: a set of R functions for morphometric analysis. Plant Syst Evol 301, 1115–1121 (2015). https://doi.org/10.1007/s00606-014-1153-2


Received: 09 May 2014
Accepted: 28 August 2014
Published: 16 October 2014
Issue Date: April 2015
DOI: https://doi.org/10.1007/s00606-014-1153-2
