## Principal component factors

The e term represents the error involved in ignoring column C, and is equal to unity minus the communality of the columns which were considered.

Cluster analysis is used to classify results into groups or clusters. Two general procedures will be described, as follows:

(a) Dendrograms

(b) Hierarchic methods.

Dendrograms are used to investigate collections of factors to determine which are related and which are independent. A typical example is shown in Table 5.13, which lists ten questions about a collection of shampoos, put to a test panel. The panel was asked to use the shampoos, and give them numerical gradings. The resulting correlation matrix is shown in Table 5.14.

A dendrogram of the correlation matrix is plotted in Figure 5.7, in which the ordinate represents correlation coefficients, and the question numbers are equally spaced along the abscissa, not necessarily in numerical order.

Table 5.13 Shampoo questionnaire.

Assess the following qualities on the given 0 to 7 scale

1. Overall impression
2. Suitability for your hair type
3. Lathering ability
4. Rinsability
5. Cleansing power
6. The condition of your hair
7. The manageability of your hair
8. The feel of your hair
9. The texture of your hair
10. How tangle free your hair was left

Table 5.14 Correlation matrix for shampoo evaluation factors.

Question number 2 3 4 5 6 7 8 9 10

The highest correlation coefficient in Table 5.14 is 0.90, between questions 1 and 2, and these numbers are placed side by side at a convenient position on the horizontal axis, to form the base of a rectangle of height 0.90. The next highest correlation coefficient (0.86) occurs between questions 9 and 10, but neither 9 nor 10 correlates well with the other questions, so this result is set aside while other, higher correlation coefficients are considered.

The next correlation coefficient is 0.81, for questions 1 and 6, and since it has question 1 in common with the first rectangle, this rectangle is plotted alongside with question 1 shared by the two rectangles, which are then joined by a horizontal and two vertical lines, corresponding to the next highest correlation coefficient (0.72 between 2 and 6), as shown in the figure. Before considering the next highest correlation coefficient (0.63 between 3 and 5), it is necessary to note that the next two correlation coefficients are equal (0.60), and are between question 5 and either 1 or 6, both of which belong to the cluster of two rectangles described above. The rectangle between 3 and 5 is therefore drawn alongside the "6,1,2" cluster, to which it is linked by a bar, 0.60 units high.

The next correlation coefficient (0.59) is between question 5, which is in the "3,5,6,1,2" cluster, and question 4. A vertical of 0.59 units is therefore drawn, and linked by a horizontal bar to the cluster. For similar reasons, a vertical of 0.54 from question 8, which has the next highest correlation coefficient, is placed for convenience on the other side of the "4,3,5,6,1,2" cluster, and joined to it with a horizontal bar. The next three correlation coefficients involve questions which already form part of the dendrogram, and are therefore ignored, and question 7 is joined to the cluster in the same way as question 8. Finally, the "9,10" rectangle, which correlates poorly with all the other questions (r=0.43), is joined up to the cluster which forms the remainder of the questions.

Figure 5.7 Shampoo dendrogram. Note that the correlation coefficient scale is negative, so that a short rectangle represents a high coefficient and a tall rectangle represents a low coefficient.

The dendrogram suggests that there are three clusters:

(a) Questions 3, 4 and 5, indicating that the subjects associate lather with cleansing, which is not surprising.

(b) Questions 1, 2 and 6, suggesting that subjects discriminate between shampoos principally by the condition in which they leave their hair.

(c) Questions 9 and 10, which correlated very little with the other attributes. The inference is that the descriptions "texture" and "tangle-free" were associated with each other, and were not related to terms like "condition", "manageability" and "feel".
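The manual construction of a dendrogram can be approximated in code by a greedy agglomeration: at each step, the two clusters that share the highest correlation coefficient between any pair of their members are joined, and the joining level is recorded. The sketch below is a single-linkage procedure, which is close to, but not necessarily identical with, the hand-drawn method described in the text. The function name `single_linkage`, the labels and the 4 x 4 correlation matrix are hypothetical and chosen only for illustration; they are not the values of Table 5.14.

```python
from itertools import combinations


def single_linkage(labels, corr):
    """Greedily merge clusters at the highest between-member correlation.

    labels: list of item names, in the same order as the rows of corr.
    corr:   symmetric correlation matrix as a list of lists.
    Returns the merges in the order they occur, as
    (cluster_a, cluster_b, joining_level) tuples.
    """
    index = {label: k for k, label in enumerate(labels)}
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Single linkage: two clusters are as similar as their
            # most strongly correlated pair of members.
            level = max(corr[index[i]][index[j]] for i in a for j in b)
            if best is None or level > best[0]:
                best = (level, a, b)
        level, a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, level))
    return merges


# Hypothetical correlation matrix for four items (illustrative only).
labels = ["A", "B", "C", "D"]
corr = [
    [1.00, 0.90, 0.30, 0.20],
    [0.90, 1.00, 0.40, 0.25],
    [0.30, 0.40, 1.00, 0.80],
    [0.20, 0.25, 0.80, 1.00],
]

merges = single_linkage(labels, corr)
for a, b, level in merges:
    print(a, "joined to", b, "at r =", level)
```

Each recorded merge corresponds to one horizontal bar of the dendrogram, drawn at the height given by the joining level.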

### 5.8.1 Pattern recognition methods

Some pharmacological and toxicological data are binary (e.g. a compound may be assessed as being carcinogenic or non-carcinogenic) rather than continuous. Such data are not amenable to multiple regression analysis, and other methods have to be used in their analysis. A number of such methods are available, the most widely used of which is discriminant analysis. A number of physico-chemical and/or structural parameters are generated for each compound, and an appropriate computer program selects the best combination of these that will discriminate between the different classes of biological activity. If n parameters are selected, then the discrimination is by means of a hyperplane in n-dimensional hyperspace. This is difficult to comprehend, and so principal components analysis is often used to reduce the data, and frequently a plot of the first principal component (PCA1) against PCA2 gives good discrimination. An example of this is shown in Figure 5.8.
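The idea of a discriminating hyperplane can be illustrated with two parameters, where the hyperplane reduces to a straight line. The sketch below uses Fisher's linear discriminant, which is one common form of discriminant analysis; the text does not say which variant or program was used, so this choice is an assumption. The function names and the descriptor values for the "active" and "inactive" compounds are hypothetical and serve only to show the calculation.

```python
def mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, centre):
    """Elements (sxx, sxy, syy) of the scatter matrix about centre."""
    sxx = sum((p[0] - centre[0]) ** 2 for p in points)
    sxy = sum((p[0] - centre[0]) * (p[1] - centre[1]) for p in points)
    syy = sum((p[1] - centre[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_discriminant(active, inactive):
    """Return (w, threshold) defining the line w . x = threshold."""
    m_a, m_i = mean(active), mean(inactive)
    # Pooled within-class scatter matrix.
    sxx, sxy, syy = (
        a + b for a, b in zip(scatter(active, m_a), scatter(inactive, m_i))
    )
    det = sxx * syy - sxy * sxy
    dx, dy = m_a[0] - m_i[0], m_a[1] - m_i[1]
    # w = inverse(scatter matrix) applied to the difference of class means.
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    # Place the boundary midway between the two class means.
    mid = ((m_a[0] + m_i[0]) / 2, (m_a[1] + m_i[1]) / 2)
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def is_active(x, w, threshold):
    """Classify a compound from its two descriptor values."""
    return w[0] * x[0] + w[1] * x[1] > threshold


# Hypothetical descriptor pairs for each compound (illustrative only).
active = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5)]
inactive = [(4.0, 5.0), (4.5, 4.8), (5.0, 5.5)]

w, threshold = fit_discriminant(active, inactive)
print([is_active(p, w, threshold) for p in active + inactive])
```

With n parameters the same construction applies, except that the scatter matrix is n x n and the boundary is a hyperplane rather than a line.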

A slightly different approach is exemplified in Figure 5.9, in which active compounds are seen generally to have lower π values than inactive compounds. The situation is rarely as simple as this. Experimental results usually take the form of a continuous spectrum of biological activities, and an arbitrary dividing line, above which the compound is deemed to be active, has to be drawn across the graph of "active" or "inactive" against a dependent variable, such as Hansch's π value. Overlapping can be rationalised by determining the mean squared distance (MSD); for example, for compounds 1, 2 and 3 in Figure 5.9, if π1 represents the π value of compound 1 and so on.
