11. Groups II
Written by Norm MacLeod - The Natural History Museum, London, UK (email: n.macleod@nhm.ac.uk). This article first appeared in the Nº 65 edition of Palaeontology Newsletter.
Introduction
Last time out we began to confront the problems presented by datasets that include group-level structure. We also developed some statistical tools we could use to determine whether that structure was reflected in statistically significant differences in group means and to assign unknown specimens to the closest group mean. So far, so good. But what we really want is some way of defining a space—like a PCA space or a PCoord space—in which the groups are maximally separated.
You’ll recall this plot of the Iris data from the previous essay (Fig. 1; see Palaeontological Association Newsletter 64:35–45; a listing of these data appears in that essay and in the PalaeoMath101 Excel spreadsheet). This captures the problem nicely. Given just four variables there are effectively six different ways of looking at the problem, if we ignore plots of the variables against themselves and plots in which the same pair of variables is simply swapped between the axes. Each plot yields some information about both within-group variation and between-group separation. Some plots seem more informative than others, but no single plot tells the whole story.
Ideally we’d like to see our data transformed into a low-dimensional space, such that the majority of the between-group separation is summarized in just a few axes. Also, if the equations of the axes could give us some indication of which single or combination of the original variables was most important for achieving group discrimination (which is another way of saying ‘most important for group characterization’), that would be nice too. It’s another tall order, but our colleagues over in the maths department have some ideas along these lines. Let’s have a look.
Before we begin our discussion we need to make sure we have some basic concepts straight, specifically the difference between discrimination and classification. Both involve groups, but there is a world of difference between them—mathematically speaking—that we need to understand before the mathematical operations will make much sense. Understanding these concepts will also help us appreciate the difference between the material I presented in the Groups I essay and what I’ll be presenting below.
Discrimination is the act of determining a mathematical expression that distinguishes between groups of measurements or observations. In order to perform a discrimination or ‘discriminant’ analysis the groups need to be specified at the outset of an investigation. Classification is the act of determining how many groups are present in a collection of measurements or observations. This procedure does not require knowledge of the number of groups beforehand; rather, that information is the purpose or result of the classification analysis. One group of techniques tells you how best to separate groups (discriminant analysis); the other tells you how many groups a sample contains (classification analysis). Of course, in the real world palaeontologists want to know both. The problem is that you can’t get at both questions in any single analysis. The mathematics that optimize the discrimination of groups of data require the number of groups to be discriminated to be specified, and the mathematics of classification analysis require that the characteristic differences between groups be known. What to do?
Inevitably, we fall back on using combinations of analyses. Principal components analysis, factor analysis, principal coordinates analysis, correspondence analysis and the rest of the ‘single group’ methods can be used to obtain a sense of how many groups there might be in a dataset. They can do other things too, but practically speaking this is one of their primary uses. Once some hypotheses about possible classification schemes have been developed based on the results of a single-group analysis, those can be checked for statistical significance using the methods of mean-difference analysis (e.g., likelihood-ratio test, Hotelling’s T2-test). These results will allow decisions to be made regarding a viable classification scheme, after which consideration of the discriminant problem can begin. Mahalanobis distances can be used to effect identification by assigning individuals to groups based on their proximity to the group centroid (after scaling the variables by the inverse of the pooled covariance matrix). However, the space in which the Mahalanobis distance operates has not been optimized for maximal group separation. Nonetheless, it is possible to create a space that optimizes the difference between classification groups—at least the distances between their centroids. It is to this missing piece of the puzzle we now turn.
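To make the identification step concrete, here is a minimal sketch of Mahalanobis-distance assignment in Python (my own illustration, not part of the original spreadsheet calculations; the function names and the NumPy-based approach are assumptions). It scales distances by the inverse of the pooled within-groups covariance matrix and assigns a specimen to the group with the nearest centroid.

```python
# Minimal sketch (illustration only, not the column's spreadsheet method):
# assign a specimen to the group whose centroid is closest in Mahalanobis
# distance, using the pooled within-groups covariance matrix.
import numpy as np

def pooled_covariance(groups):
    """Pooled within-groups covariance from a list of (n_i x p) arrays."""
    n_total = sum(len(g) for g in groups)
    ssqcp = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)
    return ssqcp / (n_total - len(groups))

def assign_by_mahalanobis(x, groups):
    """Index of the group whose centroid lies closest to specimen x."""
    s_inv = np.linalg.inv(pooled_covariance(groups))
    d2 = [(x - g.mean(axis=0)) @ s_inv @ (x - g.mean(axis=0)) for g in groups]
    return int(np.argmin(d2))
```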
Most discussions of discriminant analysis begin with the two-group case, in which the point is to find a linear discriminant function that separates two groups. This is obviously the simplest case of discrimination and, because of this, the mathematics involved can be simplified. Nevertheless, the two-group case hardly ever comes up in real situations. For the most part we need to distinguish between three or more groups, and so need an approach to determining discriminant functions that is powerful enough to handle any number of groups. Since the simplified mathematics of two-group linear discriminant analysis cannot be extended to the multiple-groups case, we’ll proceed directly to the multiple-groups problem, the most popular solution to which is called canonical variates analysis (CVA). Should we ever need to discriminate between just two groups, CVA works fine for those data too.
In our example analysis we’ll stick with the Fisher Iris data from the previous essay, but bump up the number of individuals in each group in order to get a better estimate of group variation and to illustrate some features of the technique. The following table lists these example data.
Canonical variates analysis was invented by R. A. Fisher (1936) with important contributions by Bartlett (1951, regarding how to calculate the inverse of a matrix), Mahalanobis (1936, regarding use of Mahalanobis distances in discriminant analysis), and Rao (1952, in synthesizing Fisher’s and Mahalanobis’ concepts into the modern procedure). The basic idea behind modern approaches to CVA is reasonably simple. It is in essence a two-stage rotation of a data matrix that has been subdivided into groups, hence the name canonical variates.
Campbell and Atchley (1981) provide an excellent discussion of the geometric transformations implicit in CVA. Their presentation has served as a model for the geometric explanation presented below. In the actual algorithm (which we’ll discuss after the geometric presentation) several of these steps are performed simultaneously. Most textbook descriptions of CVA focus only on presenting a recipe of equations and plots, such that comparatively few practitioners gain much understanding of the geometry inherent in the method. In my presentation we’ll review a few basic equations (which readers of this column have seen before) and then let the pictures do most of the talking.
First, recall that in our previous discussion of the likelihood-ratio test we developed the idea that total similarity relations (T) within grouped data matrices could be subdivided into ‘within-groups’ (W) and between-groups (B) partitions.
(11.1)    $\mathbf{T} = \mathbf{W} + \mathbf{B}$
There are different ways to operationalize this concept, but in the case of CVA the T matrix usually represents the total sums of squares and cross products (SSQCP) for all variables and has the following form.
(11.2)    $t_{rc} = \sum_{i=1}^{n} \left(x_{ir} - \bar{x}_{r}\right)\left(x_{ic} - \bar{x}_{c}\right)$
In this expression r and c refer to the rows and columns of the T matrix (any cell of which is occupied by a value $t_{rc}$), n is the total number of specimens, and $\bar{x}_{r}$ and $\bar{x}_{c}$ represent the grand means of variables r and c for the entire, combined dataset. The grand mean is the centre of the pooled sample of all measurements. Matrix T, then, summarizes the dispersion of the total dataset about this group-independent, fixed reference.
The W matrix represents the within-groups SSQCP matrix and has a corresponding form.
(11.3)    $w_{rc} = \sum_{g=1}^{k} \sum_{i=1}^{n_g} \left(x_{gir} - \bar{x}_{gr}\right)\left(x_{gic} - \bar{x}_{gc}\right)$
Once again, r and c refer to the rows and columns of the W matrix (any cell of which is occupied by a value $w_{rc}$), k is the number of groups, and $n_g$ is the number of specimens in group g. Now the variables $\bar{x}_{gr}$ and $\bar{x}_{gc}$ refer to the analogous group-specific means. Here, the group mean is the centre of the cloud of points representing each group in Figure 1. Matrix W, then, summarizes the dispersion of each group’s data relative to its own group-specific reference.
Once T and W have been found, the most intuitive way of determining the B matrix is simply to subtract each element of the W matrix from the corresponding element of the T matrix (B = T - W). Conceptually, though, the between-groups matrix summarizes the dispersion of the group means about the grand mean.
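As a concrete check on these definitions, the following short Python sketch (my own illustration; it assumes NumPy and a list of per-group data arrays, neither of which appears in the original spreadsheet) builds T and W directly from equations 11.2 and 11.3 and obtains B by subtraction.

```python
# Minimal sketch of equations 11.1-11.3 (illustration only; 'groups' is an
# assumed list of (n_i x p) NumPy arrays, one per a priori group).
import numpy as np

def ssqcp(X, centre):
    """Sums of squares and cross-products of X about a given centre."""
    d = X - centre
    return d.T @ d

def t_w_b(groups):
    pooled = np.vstack(groups)
    grand_mean = pooled.mean(axis=0)
    T = ssqcp(pooled, grand_mean)                        # eq. 11.2
    W = sum(ssqcp(g, g.mean(axis=0)) for g in groups)    # eq. 11.3
    B = T - W                                            # eq. 11.1 rearranged
    return T, W, B
```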
In our geometric example analysis we’ll reduce the Table 1 data to just two variables: petal width and petal length. Canonical variates analysis begins (conceptually) with a standard PCA analysis of the within-groups dispersion matrix (Fig. 2).
Next, CVA performs a somewhat counter-intuitive scaling operation. As you can see from Figure 2B, the scatters of the original groups are elongated, with much more variance along PC-1 than along PC-2. This reflects the greater variation of the petal length data relative to the petal width data, which in turn reflects the fact that Iris petals are much longer than they are wide. In order to achieve maximum separation between the group centroids the principal component axes are rescaled using the square roots of their associated eigenvalues: each individual’s score on a PC axis is divided by the square root of that axis’s eigenvalue (i.e., multiplied by its reciprocal). The result of this intermediate scaling operation is illustrated in Figure 3.
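In code the scaling step is a one-line operation. The sketch below (an illustration of mine, assuming NumPy, a matrix of within-groups PC scores and the corresponding eigenvalues; none of these names comes from the original spreadsheet) divides each column of scores by the square root of its axis’s eigenvalue.

```python
# Minimal sketch (illustration only): rescale PC scores so that every
# within-groups axis has comparable spread.  'scores' is an (n x p) array of
# scores from the PCA of W; 'eigvals' holds that PCA's p eigenvalues.
import numpy as np

def whiten_scores(scores, eigvals):
    return scores / np.sqrt(eigvals)   # divides column j by sqrt(eigvals[j])
```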
What we’re doing by scaling the PC space in this way is reminding ourselves of what we mean by ‘distance’ in a multivariate space. As we discussed last time, correlations between variables matter when it comes to assessing the separation between any two points in a space defined by multiple variables. We apply a similar scaling operation to the Mahalanobis distance calculation specifically to correct for distortions caused by inter-variable correlations. The scaling operation we’ve just performed in the intermediate stage of our CVA analysis distorts the PC space such that the geometric reality of the distribution of points in that space matches our ‘common sense’ notion of distance (recall we performed the original PC rotation on W, not T). This scaling operation shows us that the notion of distance between points in the standard PC multivariate space can be just as distorted as it is in ordinary scatterplots. By using the eigenvalues to scale the eigenvectors we can construct a ‘true’ picture of the separations between points in this group-defined space, one that conforms to the world of spatial relations in which we live. Thus, our three Iris species are actually more distinct than figures 2A, 2B, or 3A would have us believe. That’s the profound bit. The trivial bit is that all this complexity is taken into consideration by the Mahalanobis distance. Thus we’ve had a way of taking the distortions inherent in the spaces represented by 2A, 2B, and 3A into account all along.
The second and final stage of a CVA focuses on the group centroids. While the first rotation summarized within-groups dispersion patterns, a second rotation is required to summarize between-groups dispersion patterns. This is accomplished by conducting a second PCA, this time using only data from the positions of the group means in the orthogonal and variance standardized—or orthonormal—space (Fig. 3B). Figure 4 illustrates the result of this operation.
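Putting the two stages together, the sketch below (my own illustration, assuming NumPy and a non-singular W; it is not the spreadsheet recipe) rotates the data onto the within-groups principal components, rescales those axes by the square roots of their eigenvalues, and then performs a second PCA on the group centroids in that rescaled space. The combined product `rescale @ b_vecs` gives the canonical variate coefficients expressed on the original variables, anticipating the back-transformation discussed below.

```python
# Minimal sketch of the two-stage geometric CVA (illustration only).
# 'groups' is an assumed list of (n_i x p) NumPy arrays, one per group.
import numpy as np

def geometric_cva(groups):
    grand_mean = np.vstack(groups).mean(axis=0)
    # Stage 1a: within-groups SSQCP matrix (eq. 11.3) and its eigenanalysis
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    w_vals, w_vecs = np.linalg.eigh(W)
    # Stage 1b: rescale each within-groups PC axis by 1/sqrt(eigenvalue)
    rescale = w_vecs / np.sqrt(w_vals)
    # Stage 2: PCA of the group centroids in the rescaled space
    centroids = np.array([(g.mean(axis=0) - grand_mean) @ rescale for g in groups])
    b_vals, b_vecs = np.linalg.eigh(centroids.T @ centroids)
    order = np.argsort(b_vals)[::-1]                  # largest axes first
    cva_axes = rescale @ b_vecs[:, order]             # coefficients on original variables
    scores = [(g - grand_mean) @ cva_axes for g in groups]
    return cva_axes, scores
```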
The scatterplot shown in Figure 4B is typically presented as the CVA ordination. Generally speaking there are one fewer CVA axes with positive between-groups eigenvalues than the number of groups present in the analysis. Once these results have been obtained most routines will also report statistical tests for group distinctiveness (e.g., Hotelling’s T2) and a Mahalanobis distance-based cross-tabulation analysis of the data used to define the CVA space. The former are used to confirm group distinctiveness (see previous column for examples and details of these calculations) while the purpose of the latter is to determine the degree to which these particular CVA results can provide a reliable basis for achieving discrimination between the groups.
Note these are very different questions. It is quite possible for group means to be distinct relative to their within-groups dispersion yet contain so much overlap between their respective point clouds that effective discrimination is more-or-less impossible. Results of this cross-tabulation analysis are usually presented in the form of a ‘confusion matrix’ that summarizes the extent to which specimens assigned a priori to a given group are placed in the appropriate group by a Mahalanobis distance analysis (see previous column for details of this calculation). The confusion matrix for the two-variable Iris dataset is provided below.
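A confusion matrix of this kind is just a cross-tabulation of a priori group membership against the group each specimen is assigned to by the discriminant rule. A minimal sketch (my own illustration; the integer label encoding is an assumption):

```python
# Minimal sketch (illustration only): cross-tabulate a priori group labels
# (rows) against the groups assigned by, e.g., a Mahalanobis distance rule
# (columns).  Labels are assumed to be integers 0 .. n_groups - 1.
import numpy as np

def confusion_matrix(a_priori, assigned, n_groups):
    table = np.zeros((n_groups, n_groups), dtype=int)
    for t, a in zip(a_priori, assigned):
        table[t, a] += 1
    return table
```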
As can be seen from both this matrix and Figure 4B, I. setosa is perfectly discriminated from I. versicolor and I. virginica by the first CVA axis. However, approximately one-third of the specimens belonging a priori to the latter two species are assigned to the wrong group. Is this a good result? The answer depends on the question you’re asking, along with your ability to collect other information and/or access additional specimens of each group. If it is of the utmost importance to identify all specimens correctly using only these variables, the fact that this analysis produced something like 35 percent incorrect identifications for two of the three groups, even for the sample used to define the discriminant space, is a matter of concern. Still, for many applications—including most replication-based studies of systematic identifications—a consistent identification accuracy of 65 percent is competitive with most human experts (see MacLeod 1998; Culverhouse in press). Of course, this question is moot for the Iris dataset as we have ready access to measurements from additional specimens (which would improve our estimates of W and B) and additional variables (see below).
There is one additional issue we need to discuss before we leave this simple example. As with all the single-group data analysis methods we’ve discussed to date, we would like to use the equations of the CVA axes to tell us something about the geometric meaning of the space portrayed in Figure 4B, especially the identities of the variables most useful for group characterization/discrimination. For CVA this is more complex than for the previous ordination methods we’ve discussed.
The first interpretational complication arises because of the nature of the mathematical operations implicit in CVA. In Figure 4B the CVA axes are portrayed (correctly) as being orthogonal to one another. But recall the PCA that produced those axes was undertaken on a series of group centroid locations that had already been transformed from their original positions through rotation (Fig. 2) and differential scaling (Fig. 3). In order to determine how the CVA axes relate to the original variables it’s not enough to simply inspect the CVA axis loadings because those refer to the rotated and scaled variables. Rather, we must undo these prior transformations in order to understand how the CVA loadings relate to the original data. Figure 5 illustrates the results of these back-transformation operations.
There is yet another problem with the assignment of importance to the canonical variables, though. Campbell and Atchley (1981) note that many authors assess importance of the variables to between-groups discrimination by scaling the canonical variate loadings by the standard deviations of the pooled within-groups variables. This operation produces a crude and ad hoc measure of the correspondence between high levels of variation in aspects of the sample and alignment of the between-groups discriminant axes. The fly in the ointment here is covariation. If two variables covary to a substantial degree both could be identified as having either a large or small importance with respect to group discrimination, whereas one may be the real driver of this relation and the other a more passive passenger. Campbell and Reyment (1978, see also Campbell 1980) advocate analysis of the stability of the CV loadings as a method of approaching this problem and have developed the method of ‘shrunken [CVA] estimators’ to be used in this context.
Now that we understand exactly what CVA is doing to our data we can briefly review the mathematics used in contemporary approaches to implementing this method (in which several of the steps outlined above are combined) and undertake an expanded example analysis using the full three-group, 25-specimen, four-variable Iris dataset.
The modern algorithm is based on the parallel between CVA and the statistical procedure known as analysis of variance (ANOVA). We begin with the T, W, and B matrices calculated in precisely the manner given by equations 11.1, 11.2, and 11.3 (above). Rather than undertaking the separate rotation and scaling operations outlined in our previous geometric dissection of the method, these steps are combined by noticing that the quantity we are after is a set of axes aligned with the maximum differences between the B and W matrices. In effect we need to maximize the B/W ratio. Without going into the precise matrix derivation, suffice it to say that this ratio is maximized by calculating the first eigenvector (principal component) of the W-1B matrix. Subsequent eigenvectors of this matrix represent subdominant modes of variation that contribute most (in a major-axis sense) to maximizing the distinction between B and W. Together, this set of eigenvectors represents the best single set of discriminant axes that can be calculated for the sample. Of course, since discrimination between groups is the focus of this analysis, the number of eigenvectors assigned positive eigenvalues will be one fewer than the number of groups in the dataset.
A minor complication arises owing to the fact that the W-1B matrix will not be symmetric. This is a direct reflection of the implicit differential scaling of B by the within-groups eigenvalues. Fortunately, this situation is easily handled by employing singular value decomposition (SVD) as the basis for decomposition of the W-1B matrix. Recall that the eigenanalysis of a non-symmetric matrix produces non-orthogonal eigenvectors in the context of the original variables, which we have also seen is the case for CVA axes (see Fig. 5).
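The whole procedure therefore reduces to a few lines of matrix algebra. The sketch below (my own illustration, assuming NumPy) uses a general eigen-solver on W-1B rather than the SVD-based route described above; for a well-conditioned W the retained axes are the same, but do note the substitution.

```python
# Minimal sketch of the 'modern' CVA algorithm (illustration only):
# eigenanalysis of the non-symmetric product inv(W) @ B.  Only the first
# n_groups - 1 eigenvalues are meaningfully non-zero; a general eigen-solver
# stands in for the SVD-based decomposition mentioned in the text.
import numpy as np

def cva_from_w_and_b(W, B, n_groups):
    e_vals, e_vecs = np.linalg.eig(np.linalg.inv(W) @ B)
    order = np.argsort(e_vals.real)[::-1]          # largest eigenvalues first
    e_vals = e_vals.real[order][: n_groups - 1]
    e_vecs = e_vecs.real[:, order][:, : n_groups - 1]
    return e_vals, e_vecs                          # canonical variate axes
```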
Applying these relations to the full Iris dataset (Table 1), the total, within-groups, and between-groups matrices are given below.
The basis matrix for the CVA analysis, then, is as follows.
Note the non-symmetrical form of this matrix. Decomposition using SVD yields the following eigenvectors and eigenvalues.
Observe there are only two eigenvectors with non-zero eigenvalues. These are the canonical variate (= discriminant) axes. Plotting the original data in the space of these two axes produces the following ordination.
Once again, I. setosa is well separated from I. versicolor and I. virginica. Unlike the previous two-variable result, however (see Fig. 4B), the presence of the additional sepal variables allows much better discrimination between these latter two species, albeit along the same CV axis. The fact that CV-2 plays such a small role in between-groups discrimination is reflected in its small eigenvalue. This far superior discriminant result is also reflected in the confusion matrix for the analysis, which measures the ability of the variable set to characterize and discriminate between the different groups.
There are many variants to this generalized procedure, as there are with all the methods we’ve covered. The important thing, as always, is to understand the basic concepts so you can make appropriate interpretations of the results reported by any programme.
As I hope you can appreciate now, CVA is very different from PCA, principal coordinates (PCoord), factor analysis (FA), correspondence analysis (CA), and the other data analysis procedures we’ve discussed to date. Whereas it wouldn’t make much sense to (say) perform a PCA analysis and then submit the result to a correspondence analysis, there is an inherent logic to submitting the results of a PCA to a CVA. For example, PCA could be used to gather the principal sources of variation in the raw data into a small number of composite variables. These composite variables could then be used as the basis for determining optimal discriminant functions, as sketched below.
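By way of illustration only (this is not part of the original spreadsheet workflow, and scikit-learn’s LinearDiscriminantAnalysis differs from the CVA described here in some of its scaling conventions), such a pipeline might look like this:

```python
# Sketch of a PCA-then-discriminant pipeline (illustration only; scikit-learn's
# LinearDiscriminantAnalysis is closely related to CVA but uses different
# scaling conventions from the calculations described in this column).
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pca_then_discriminate(X, labels, n_components=3):
    pc_scores = PCA(n_components=n_components).fit_transform(X)  # composite variables
    lda = LinearDiscriminantAnalysis().fit(pc_scores, labels)    # discriminant axes
    return lda.transform(pc_scores), lda
```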
A final word on the ‘supernatural’ aspects of CVA (and other multivariate methods). As should seem obvious to you by now, multivariate procedures are absolutely dependent on using sets of specimens to estimate the within-groups and between-groups geometry of their variables or measurements. Single-group methods (e.g., PCA, PCoord) focus only on within-groups variation while multiple-group methods (e.g., PLS, CVA) focus on the within-groups and between-groups distinctions. In the Iris dataset we saw dramatic improvement in the between-groups discrimination resulting from the addition of two variables: sepal length and sepal width. Generally speaking, the more sources of information you have about a system of measurements the better. But this improvement comes at a cost.
Consider a square 10 units on a side containing 100 evenly spaced points. Those points occur at a density of one per unit of area, which provides a reasonably good characterization of the space. However, if we turn the square into a cube by adding another variable, the same number of points achieves a density of only 0.1 points per unit of volume. That’s an order-of-magnitude reduction in our information about the space in which our data reside. In order to achieve the same density in the cube space I’d need to increase the sample size to 1,000. If we added additional variables we’d need to increase the sample size to … you get the picture.
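The arithmetic is easy to verify; a tiny sketch (my own illustration):

```python
# Worked version of the density argument above (illustration only): with a
# fixed number of points, density drops by a factor of ten for each dimension
# added to a space 10 units on a side.
def point_density(n_points, side_length, n_dims):
    return n_points / side_length ** n_dims

print(point_density(100, 10, 2))    # 1.0 point per unit area (the square)
print(point_density(100, 10, 3))    # 0.1 points per unit volume (the cube)
print(point_density(1000, 10, 3))   # 1.0 again -- ten times the sample size
```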
Adding variables to a multivariate problem almost always results in a substantial drop in resolution. This is called the ‘curse of dimensionality’ (Bellman 1957). The effects of the curse are especially noticeable in discriminant analyses because we’re trying to estimate the character of both within-groups and between-groups variation. For the Iris dataset, because the number of variables was small and the number of specimens relatively large, our CVA analysis was able to pick up major differences in W and B despite the fact that addition of the sepal variables resulted in an overall loss of resolution. In other words, we beat the curse of dimensionality, this time. If you undertake multivariate procedures be mindful of the curse and don’t expect to beat it all the time.
REFERENCES
BARTLETT, M.S., 1951. An inverse matrix adjustment arising in discriminant analysis. Annals of Mathematical Statistics, 22, 107–111.
BELLMAN, R.E., 1957. Dynamic programming. Princeton University Press, Princeton, 340p.
CAMPBELL, N.A., 1980. Shrunken estimators in discriminant and canonical variate analysis. Applied Statistics, 29, 5–14.
CAMPBELL, N.A. and ATCHLEY, W.R., 1981. The geometry of canonical variate analysis. Systematic Zoology, 30, 268–280.
CAMPBELL, N.A. and REYMENT, R.A., 1978. Discriminant analysis of a Cretaceous foraminifer using shrunken estimators. Mathematical Geology, 10, 347–359.
CULVERHOUSE, P., in press. Natural object categorization: man vs. machine. In: N. MacLeod (Editor), Automated taxon recognition in systematics: theory, approaches and applications. The Systematics Association, Boca Raton, Florida.
FISHER, R.A., 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
MacLEOD, N., 1998. Impacts and marine invertebrate extinctions. In: M.M. Grady, R. Hutchison, G.J.H. McCall and D.A. Rothery (Editors), Meteorites: flux with time and impact effects. Geological Society of London, London, 217–246.
MAHALANOBIS, P.C., 1936. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India, 2, 49–55.
RAO, C.R., 1952. Advanced statistical methods in biometric research. John Wiley and Sons, New York.
Footnotes
1Confusingly (in my view) a number of programmes currently available for implementing CVA operate on matrices that violate the basic T = W + B relation. In such cases the authors of those algorithms are usually trying to take account of differences between the number of specimens representing each group. Unfortunately, they rarely specify exactly how their programmes undertake this correction, often resulting in slight non-comparabilities between the results reported by different programmes.
2A listing of all calculations is provided in the PalaeoMath101 Excel spreadsheet.
3Recall the -1 superscript refers to the inverse of a matrix. The W-1B matrix, then, is the matrix formed by pre-multiplying the between-groups SSQCP matrix by the inverse of the within-groups SSQCP matrix (see example calculations in the PalaeoMath101 Excel spreadsheet).