Clustering means grouping elements or characteristics that belong to the same category. Cluster analysis in research methodology follows the same idea: we divide our data or observations into groups that are internally similar. In other words, the objective is to partition observations into homogeneous, distinct groups. The greater the similarity within a group and the greater the difference between groups, the better or more distinct the clustering.
In applications, however, we find that the notion of a cluster is not well defined. To understand this difficulty, consider an example that depicts the general problem of cluster analysis.
The figure above shows the same set of points divided into anywhere from two clusters to six clusters. The last panel shows how the two large clusters are split into six parts. This creates a problem for the researcher: it is hard to conclude whether the six clusters reflect genuinely different behavior, because each of them is a subset of one of the two larger clusters.
Measure of distance for variables:
Clustering requires a precise definition of the similarity of observations and clusters. When the grouping is based on variables, it employs the familiar concept of distance. Consider two points i and j with coordinates (X1i, X2i) and (X1j, X2j), respectively.
The Euclidean distance between the two points is the hypotenuse of the triangle ABC:
D(i, j) = sqrt(A^2 + B^2) = sqrt[(X1i − X1j)^2 + (X2i − X2j)^2]
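As a quick illustration, the distance formula above can be computed directly; the coordinates used below are hypothetical:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical points i = (1, 2) and j = (4, 6): the legs of the
# triangle are 3 and 4, so the hypotenuse (the distance) is 5.
d = euclidean_distance((1, 2), (4, 6))
```

The same function works for any number of variables, since the sum simply runs over all coordinate pairs.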
Types of clustering:
Cluster analysis can be conducted in three major ways, depending on our requirements:
1) Hierarchical clustering
2) Partitioning methods
3) Two-step clustering
Hierarchical clustering:
Hierarchical clustering is characterized by a tree-like structure established in the course of the analysis. Most hierarchical techniques fall into the category of agglomerative clustering. In this category, clusters are formed consecutively from individual objects: at the beginning, the procedure starts with each object representing its own cluster. These clusters are then sequentially merged to form new clusters at the bottom of the hierarchy. In the next step, the resulting clusters are merged with other clusters at a higher level of the hierarchy, and so on. This allows a hierarchy of clusters to be established from the bottom up.
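The bottom-up merging described above can be sketched in a few lines of pure Python. This is a minimal illustration using single linkage (the distance between two clusters is the distance between their closest members), not the SPSS procedure itself; the point coordinates are hypothetical:

```python
def single_linkage_agglomerative(points, n_clusters):
    """Start with each point as its own cluster, then repeatedly
    merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # each object begins as its own cluster

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(c1, c2):  # single linkage: distance of the closest pair
        return min(dist(p, q) for p in c1 for q in c2)

    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest distance and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

# Two tight groups of hypothetical points: the merges recover them.
clusters = single_linkage_agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], 2)
```

Running the merging down to a single cluster and recording the merge distances at each step is exactly the information a dendrogram displays.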
This structure becomes visible when we analyze the data with SPSS, whose output includes a dendrogram that illustrates it well. From the dendrogram we can judge how many clusters are present in the data. The only meaningful indicator is the distance at which objects or clusters are combined.
One potential way to solve this problem is to plot the number of clusters on the x-axis (starting with the one-cluster solution at the very left) against the distance at which objects or clusters are combined on the y-axis. Using this plot, we then search for a distinctive break (the elbow). SPSS does not produce this plot automatically; you have to use the distances provided by SPSS to draw a line chart in a common spreadsheet program such as Microsoft Excel.
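The elbow search can also be done programmatically instead of in a spreadsheet. The sketch below assumes a hypothetical list of merge distances copied from an SPSS agglomeration schedule (one distance per merge step) and picks the step with the largest jump:

```python
# Hypothetical agglomeration-schedule distances, one per merge step.
merge_distances = [0.5, 0.7, 0.9, 1.1, 4.8, 5.2]

# With n objects, the k-th merge leaves n - k clusters.
n_objects = len(merge_distances) + 1

# The "elbow" is the step with the largest jump in merge distance;
# stopping just before that jump suggests the number of clusters.
jumps = [merge_distances[i] - merge_distances[i - 1]
         for i in range(1, len(merge_distances))]
elbow_step = jumps.index(max(jumps)) + 1        # merge step where the jump occurs
suggested_clusters = n_objects - elbow_step     # clusters remaining before the jump
```

Here the distances rise gently until the fifth merge, where they jump from 1.1 to 4.8, so the solution just before that merge (three clusters) is suggested.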
Partitioning methods: K-means
Hierarchical clustering is used when the number of objects is small (less than about 50). When there are more than 50 objects to consider, we use k-means cluster analysis. K-means cluster analysis is used for segmentation, targeting, and positioning (STP) in marketing. The method can be used to identify potential markets for an organization, indicating where to launch products so as to achieve maximum profit.
The k-means algorithm follows an entirely different concept than the hierarchical methods discussed before. The algorithm is not based on distance measures such as Euclidean distance or city-block distance as such, but uses the within-cluster variation as the measure for forming homogeneous clusters. The procedure aims at segmenting the data in such a way that the within-cluster variation is minimized.
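The within-cluster variation mentioned above is simply the sum of squared distances of each object to its cluster centroid. A small sketch, with hypothetical coordinates:

```python
def within_cluster_variation(cluster):
    """Sum of squared Euclidean distances from each object to the
    cluster's centroid (the mean of its members)."""
    centroid = tuple(sum(v) / len(cluster) for v in zip(*cluster))
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroid))
               for p in cluster)

# Hypothetical cluster with centroid (1, 1): squared distances 2 + 2 + 4.
wcv = within_cluster_variation([(0, 0), (2, 0), (1, 3)])
```

K-means seeks the assignment of objects to k clusters that makes the total of this quantity over all clusters as small as possible.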
With hierarchical methods, an object remains in a cluster once it is assigned, but with k-means, cluster affiliations can change in the course of the clustering process. K-means does not build a hierarchy of merges, which is why the approach is labeled non-hierarchical.
Prior to the analysis, we need to decide on the number of clusters. Based on this, the algorithm randomly selects a center for each cluster. Euclidean distances are then computed from the cluster centers to every single object, and each object is assigned to the cluster center with the shortest distance to it.
Based on this initial partition, each cluster's centroid is computed by taking the mean of the objects contained in the cluster.
In the fourth step, the distances from each object to the newly located cluster centers are computed, and objects are again assigned to clusters on the basis of their minimum distance to the cluster centers. Since the cluster centers' positions have changed with respect to the initial situation in the first step, this can lead to a different cluster solution.
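The steps above (choose k, pick random centers, assign by nearest center, recompute centroids, repeat) can be sketched in pure Python. This is a minimal illustration, not the SPSS implementation; the points and the fixed iteration count are assumptions for the example:

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Minimal k-means sketch: random initial centers, then alternating
    assignment and centroid-update steps for a fixed number of iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: random initial centers
    for _ in range(iterations):
        # step 2: assign each object to its nearest center (squared Euclidean)
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            groups[nearest].append(p)
        # step 3: recompute each centroid as the mean of its members
        # (an empty group keeps its previous center)
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# Two well-separated hypothetical groups of three points each.
centers, groups = kmeans([(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)], 2)
```

In practice the loop is usually run until the assignments stop changing rather than for a fixed number of iterations, but for well-separated data the result stabilizes quickly.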
Samkit Jain_Group2_SectionB (12PGP112)