Cluster analysis is a multivariate method which aims to classify a sample of subjects (or objects) on the basis of a set of measured variables into a number of different groups such that similar subjects are placed in the same group. An example where this might be used is in the field of psychiatry, where the characterization of patients on the basis of clusters of symptoms can be useful in the identification of an appropriate form of therapy. In marketing, it may be useful to identify distinct groups of potential customers so that, for example, advertising can be appropriately targeted.
WARNING ABOUT CLUSTER ANALYSIS
Cluster analysis has no mechanism for differentiating between relevant and irrelevant variables. Therefore the choice of variables included in a cluster analysis must be underpinned by conceptual considerations. This is very important because the clusters formed can be very dependent on the variables included.
Approaches to cluster analysis
There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:
- Hierarchical methods
  - Agglomerative methods, in which each subject starts in its own separate cluster. The two 'closest' (most similar) clusters are then combined, and this is repeated until all subjects are in one cluster. At the end, the optimum number of clusters is chosen from among all the cluster solutions.
  - Divisive methods, in which all subjects start in the same cluster and the above strategy is applied in reverse until every subject is in a separate cluster.
- Non-hierarchical methods (often known as k-means clustering methods)

Agglomerative methods are used more often than divisive methods, so this handout will concentrate on the former rather than the latter.
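
The agglomerative strategy can be illustrated with a minimal sketch in Python. It rests on assumptions that the text does not specify: the distance between two clusters is taken to be the smallest distance between any pair of their members (usually called single linkage), distances between subjects are Euclidean, and the function names `single_linkage` and `agglomerate` and the five example points are invented for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_linkage(a, b, points):
    """Distance between clusters a and b: the smallest distance between
    any member of a and any member of b (an assumed linkage rule)."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points):
    """Return every cluster solution, from n singleton clusters down to
    a single cluster containing all subjects."""
    # Each subject starts in its own cluster (a tuple of subject indices).
    clusters = [(i,) for i in range(len(points))]
    solutions = [list(clusters)]
    while len(clusters) > 1:
        # Find the two closest clusters ...
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        # ... and combine them into one.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        solutions.append(list(clusters))
    return solutions


# Hypothetical data: five subjects measured on two interval variables.
points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 5.5), (9.0, 1.0)]
solutions = agglomerate(points)
for solution in solutions:
    print(len(solution), "clusters:", solution)
```

The sketch stores every intermediate solution because, as described above, the number of clusters is only chosen at the end, from among all the solutions produced along the way.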
Types of data and measures of distance
The data used in cluster analysis can be interval, ordinal or categorical. However, a mixture of different types of variables makes the analysis more complicated. This is because cluster analysis needs some way of measuring the distance between observations, and the appropriate measure depends on the type of data you have.
Several measures of 'distance' have been proposed for binary and categorical data. For details, see the book by Everitt, Landau and Leese, which also explains what to do if you have a mixture of different data types. For interval data, the most common distance measure is the Euclidean distance.
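
As a brief illustration, the Euclidean distance between two subjects is the square root of the sum of squared differences between their values on each variable. The sketch below assumes each subject is represented as a sequence of interval-scaled values; the function name `euclidean_distance` and the example values are hypothetical.

```python
from math import sqrt


def euclidean_distance(x, y):
    """Square root of the sum of squared differences across variables."""
    if len(x) != len(y):
        raise ValueError("both subjects must be measured on the same variables")
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two hypothetical subjects measured on two interval variables.
subject_a = (3.0, 4.0)
subject_b = (0.0, 0.0)
d = euclidean_distance(subject_a, subject_b)
print(d)  # sqrt(3**2 + 4**2) = 5.0
```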