The objective of cluster analysis is to assign observations to groups (“clusters”) so that observations within each group are similar to one another with respect to variables or attributes of interest, and the groups themselves stand apart from one another. In other words, the objective is to divide the observations into homogeneous and distinct groups.
There are several approaches to cluster analysis: hierarchical methods, partitioning methods (more precisely, k-means), and two-step clustering, which is largely a combination of the first two.
Steps in cluster analysis:
Hierarchical clustering procedures are characterized by the tree-like structure established in the course of the analysis. Most hierarchical techniques fall into a category called agglomerative clustering, in which clusters are formed successively from objects. This type of procedure starts with each object representing an individual cluster. These clusters are then sequentially merged according to their similarity. First, the two most similar clusters (i.e., those with the smallest distance between them) are merged to form a new cluster at the bottom of the hierarchy. In the next step, another pair of clusters is merged and linked to a higher level of the hierarchy, and so on. This allows a hierarchy of clusters to be established from the bottom up. In the figure (left-hand side), agglomerative clustering assigns additional objects to clusters as the cluster size increases.
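The bottom-up merging described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance and single linkage (the shortest distance between any two members), and the function names and sample points are hypothetical.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2, points):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    """Merge clusters bottom-up; return (cluster_a, cluster_b, distance) per step."""
    # Every object starts as its own cluster (a tuple of point indices).
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest distance between them.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        d = single_linkage(a, b, points)
        # Replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        history.append((a, b, d))
    return history

# Hypothetical sample data: two tight pairs and one outlying object.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
history = agglomerate(points)
for a, b, d in history:
    print(a, b, round(d, 3))
```

With n objects the loop always performs n − 1 merges, and once two objects share a cluster they are never separated again, which mirrors the hierarchy property discussed in the text.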
A cluster hierarchy can also be generated top-down. In this divisive clustering, all objects are initially merged into a single cluster, which is then gradually split up. The figure illustrates this concept (right-hand side). As we can see, in both agglomerative and divisive clustering, a cluster on a higher level of the hierarchy always encompasses all clusters from a lower level. This means that once an object is assigned to a certain cluster, it cannot be reassigned to another cluster.
Select a Measure of Similarity or Dissimilarity
There are various measures to express (dis)similarity between pairs of objects. A straightforward way to assess two objects’ proximity is by drawing a straight line between them. This type of distance is also referred to as Euclidean distance (or straight-line distance) and is the most commonly used type when it comes to analysing ratio or interval-scaled data.
There are also alternative distance measures: The city-block distance uses the sum of the variables’ absolute differences. This is often called the Manhattan metric as it is akin to the walking distance between two points in a city like New York’s Manhattan district, where the distance equals the number of blocks in the directions North-South and East-West.
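As a small illustration of the two distance measures described so far, the sketch below computes both for a pair of objects measured on two variables. The function names and the sample values are hypothetical.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared differences (straight-line distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block_distance(p, q):
    """Sum of the absolute differences (Manhattan metric)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Two hypothetical objects measured on two variables.
a = (1.0, 2.0)
b = (4.0, 6.0)

print(euclidean_distance(a, b))   # 5.0
print(city_block_distance(a, b))  # 7.0
```

The city-block distance is never smaller than the Euclidean distance for the same pair of objects, because walking along the grid cannot be shorter than the straight line.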
There are other distance measures such as the Angular, Canberra or Mahalanobis distance. In many situations, the Mahalanobis distance is desirable because it compensates for collinearity between the clustering variables.
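For reference, the Mahalanobis distance is commonly written as below. The notation is an assumption introduced here, not taken from the text: x and y are the vectors of clustering variables for two objects, and S is the sample covariance matrix of those variables.

```latex
d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{\top} \, S^{-1} \, (\mathbf{x} - \mathbf{y})}
```

Weighting by the inverse covariance matrix is what discounts correlated variables. If S is the identity matrix, the expression reduces to the Euclidean distance.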
The distance measures presented can be used for metrically and, in general, ordinally scaled data; applying them to nominal or binary data is meaningless. For such data, one should instead select a similarity measure expressing the degree to which variables’ values share the same category. These so-called matching coefficients can take different forms but rely on the same allocation scheme shown in Table.
Allocation scheme for matching coefficients.
Two types of matching coefficients that do not equate the joint absence of a characteristic with similarity, and are therefore of more value in segmentation studies, are the Jaccard (JC) and the Russell and Rao (RR) coefficients. They are defined as follows:
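In their standard form, the two coefficients can be written as shown here. The letters are assumed to follow the usual allocation scheme for two objects: a is the number of characteristics present in both objects, b and c are the numbers present in only one of the two, and d is the number absent in both.

```latex
JC = \frac{a}{a + b + c}
\qquad
RR = \frac{a}{a + b + c + d}
```

In both cases only joint presence (a) appears in the numerator. JC drops d entirely, whereas RR keeps d in the denominator, so joint absences lower RR but never raise it.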
These matching coefficients are – just like the distance measures – used to determine a cluster solution. There are many other matching coefficients such as Yule’s Q, Kulczynski or Ochiai.
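A minimal sketch of how the Jaccard and Russell and Rao coefficients could be computed from two binary profiles follows. The function names and the sample profiles are hypothetical, and returning 0.0 when the Jaccard denominator is zero is a convention assumed here, not something the text prescribes.

```python
def matching_counts(x, y):
    """Count the four cells of the 2x2 allocation scheme for binary vectors.

    a: present in both, b: present only in x,
    c: present only in y, d: absent in both.
    """
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def jaccard(x, y):
    """Joint presences divided by all cases except joint absences."""
    a, b, c, _ = matching_counts(x, y)
    total = a + b + c
    # Assumed convention: treat two all-absent profiles as similarity 0.0.
    return a / total if total else 0.0

def russell_rao(x, y):
    """Joint presences divided by the total number of characteristics."""
    a, b, c, d = matching_counts(x, y)
    return a / (a + b + c + d)

# Hypothetical binary profiles of two respondents on six characteristics.
x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]

print(matching_counts(x, y))  # (2, 1, 1, 2)
print(jaccard(x, y))          # 0.5
print(russell_rao(x, y))
```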
For nominal variables with more than two categories, one should always convert the categorical variable into a set of binary variables in order to use matching coefficients. With ordinal data, one should always use distance measures such as Euclidean distance. Even though using matching coefficients would be feasible and, from a strictly statistical standpoint, even more appropriate, one would disregard the information contained in the sequence of the categories. After all, a respondent who indicates that he or she is very loyal to a brand is closer to someone who is somewhat loyal than to a respondent who is not loyal at all. Furthermore, distance measures best represent the concept of proximity, which is fundamental to cluster analysis.
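The conversion of a multi-category nominal variable into binary variables can be sketched as follows. The helper name and the example variable (a preferred colour) are hypothetical and serve only to illustrate the idea.

```python
def dummy_code(values):
    """Turn one nominal variable into one binary variable per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Hypothetical nominal variable: each respondent's preferred colour.
preferred_colour = ["red", "green", "blue", "green"]
categories, rows = dummy_code(preferred_colour)

print(categories)  # ['blue', 'green', 'red']
for row in rows:
    print(row)
```

Each respondent ends up with exactly one 1 across the new binary variables, so matching coefficients can then be applied to the resulting profiles.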
Select a Clustering Algorithm
After having chosen the distance or similarity measure, we need to decide which clustering algorithm to apply. There are several agglomerative procedures and they can be distinguished by the way they define the distance from a newly formed cluster to a certain object, or to other clusters in the solution. The most popular agglomerative clustering procedures include the following:
1. Single linkage (nearest neighbour): The distance between two clusters corresponds to the shortest distance between any two members in the two clusters.
2. Complete linkage (furthest neighbour): The opposite approach to single linkage assumes that the distance between two clusters is based on the longest distance between any two members in the two clusters.
3. Average linkage: The distance between two clusters is defined as the average distance between all pairs of the two clusters’ members.
4. Centroid: In this approach, the geometric centre (centroid) of each cluster is computed first. The distance between the two clusters equals the distance between the two centroids.
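The four definitions in the list above differ only in how they turn member-to-member distances into one cluster-to-cluster distance. The sketch below illustrates this for two small clusters. It assumes Euclidean distance between members; the function names and the sample clusters are hypothetical.

```python
import math
from itertools import product

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(A, B):
    """Shortest distance between any two members of the two clusters."""
    return min(euclidean(p, q) for p, q in product(A, B))

def complete_linkage(A, B):
    """Longest distance between any two members of the two clusters."""
    return max(euclidean(p, q) for p, q in product(A, B))

def average_linkage(A, B):
    """Average distance over all pairs of members from the two clusters."""
    return sum(euclidean(p, q) for p, q in product(A, B)) / (len(A) * len(B))

def centroid(cluster):
    """Geometric centre: the mean of each coordinate."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_linkage(A, B):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(A), centroid(B))

# Two hypothetical clusters with two members each.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (5.0, 0.0)]

print("single:  ", single_linkage(A, B))   # 3.0
print("complete:", complete_linkage(A, B))
print("average: ", average_linkage(A, B))
print("centroid:", centroid_linkage(A, B))
```

For any pair of clusters, the average-linkage distance lies between the single-linkage and complete-linkage distances, since it is the mean of the same set of pairwise distances.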
A common way to visualize the progress of the cluster analysis is by drawing a dendrogram, which displays the distance level at which objects and clusters were combined.
We read the dendrogram from left to right to see at which distance objects have been combined.
Decide on the Number of Clusters
An important question is how to decide on the number of clusters to retain from the data. Unfortunately, hierarchical methods provide only very limited guidance for making this decision. The only meaningful indicator relates to the distances at which the objects are combined. We can seek a solution in which an additional combination of clusters or objects would occur at a greatly increased distance. This raises the issue of what a great distance is, of course.
One potential way to solve this problem is to plot the number of clusters on the x-axis (starting with the one-cluster solution at the very left) against the distance at which objects or clusters are combined on the y-axis. Using this plot, we then search for the distinctive break (elbow).
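The search for a break can be approximated programmatically. The sketch below uses a deliberately crude stand-in for visual inspection: it stops just before the merge that causes the largest increase in distance. The function name and the merge distances are hypothetical, and the largest-jump rule is an assumption of this sketch rather than a rule stated in the text.

```python
def suggest_cluster_count(merge_distances):
    """Suggest a number of clusters from agglomeration distances.

    merge_distances[i] is the distance at which the (i + 1)-th merge took
    place, so n objects produce n - 1 entries.
    """
    n_objects = len(merge_distances) + 1
    # Increase in distance from each merge to the next one.
    jumps = [later - earlier
             for earlier, later in zip(merge_distances, merge_distances[1:])]
    if not jumps:
        return 1
    # Index of the largest jump; on ties the earliest one is used.
    i = max(range(len(jumps)), key=jumps.__getitem__)
    # Stop after merge i + 1, i.e. just before the expensive merge.
    return n_objects - (i + 1)

# Hypothetical agglomeration schedule for six objects (five merges).
merge_distances = [1.0, 1.2, 1.5, 6.0, 7.0]
print(suggest_cluster_count(merge_distances))  # 3
```

In this example the fourth merge would raise the distance from 1.5 to 6.0, so the sketch stops after three merges, leaving three clusters. When several jumps are of similar size the rule gives no clear answer, which is exactly the ambiguity that makes the break hard to read off a plot.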
Alternatively, we can make use of the dendrogram which essentially carries the same information. SPSS provides a dendrogram; however, this differs slightly from the one presented in Fig. Specifically, SPSS rescales the distances to a range of 0–25; that is, the last merging step to a one-cluster solution takes place at a (rescaled) distance of 25. The rescaling often lengthens the merging steps, thus making breaks occurring at a greatly increased distance level more obvious.
Even so, the distance-based decision rule does not work well in all cases, because it is often difficult to identify where the break actually occurs.
Overall, the data can often only provide rough guidance regarding the number of clusters one should select; consequently, one should rather revert to practical considerations. Occasionally, we might have a priori knowledge, or a theory on which we can base our choice. However, first and foremost, one should ensure that results are interpretable and meaningful. Not only must the number of clusters be small enough to ensure manageability, but each segment should also be large enough to warrant strategic attention.
Section B Group 6
- Vaneet Bhatia (13FPM008)
- Apurva Ramteke (13PGP068)
- Chandan Parsad (13FPM002)
- Komal Suchak (13PGP086)
- Rohan Kr. Jha (13FPM004)
- Silpa Bahera (13PGP107)
- Sushil Kumar (13FPM010)
- Vivek Roy (12FPM005)