# Session 9 Cluster analysis Section B Group 6 Vaneet Bhatia (13FPM008)

The objective of cluster analysis is to assign observations to groups (“clusters”) so that observations within each group are similar to one another with respect to variables or attributes of interest, and the groups themselves stand apart from one another. In other words, the objective is to divide the observations into homogeneous and distinct groups.

There are several approaches to cluster analysis: hierarchical methods, partitioning methods (most prominently k-means), and two-step clustering, which largely combines the first two.

Steps in cluster analysis:

Hierarchical Method:

Hierarchical clustering procedures are characterized by the tree-like structure established in the course of the analysis. Most hierarchical techniques fall into a category called agglomerative clustering, in which clusters are formed successively from objects. The procedure starts with each object representing an individual cluster. These clusters are then sequentially merged according to their similarity. First, the two most similar clusters (i.e., those with the smallest distance between them) are merged to form a new cluster at the bottom of the hierarchy. In the next step, another pair of clusters is merged and linked to a higher level of the hierarchy, and so on. This allows a hierarchy of clusters to be established from the bottom up. In the figure (left-hand side), agglomerative clustering assigns additional objects to clusters as the cluster size increases.

A cluster hierarchy can also be generated top-down. In this divisive clustering, all objects initially belong to a single cluster, which is then gradually split up. The figure illustrates this concept (right-hand side). As we can see, in both agglomerative and divisive clustering, a cluster on a higher level of the hierarchy always encompasses all clusters from a lower level. This means that once an object is assigned to a certain cluster, there is no possibility of reassigning it to another cluster.
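The bottom-up procedure can be illustrated with a minimal sketch. It assumes Euclidean distance and the single-linkage rule for the distance between clusters; the function name `agglomerate` and the sample points are hypothetical and chosen only for illustration.

```python
import math

def agglomerate(points):
    """Bottom-up clustering sketch: start with one cluster per object and
    repeatedly merge the two closest clusters (single linkage assumed)."""
    clusters = [[p] for p in points]
    history = []  # one (merge distance, merged cluster) entry per step
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: shortest distance between any two members.
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        history.append((d, merged))
        # Replace the two merged clusters by their union; nothing is ever split again.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Hypothetical objects described by two variables.
history = agglomerate([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (7.0, 0.0)])
for distance, members in history:
    print(distance, members)
```

Each entry in `history` records one level of the hierarchy. Because merged clusters are never split again, an object cannot be reassigned later, which is the property described above.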

Select a Measure of Similarity or Dissimilarity

There are various measures to express (dis)similarity between pairs of objects. A straightforward way to assess two objects’ proximity is by drawing a straight line between them. This type of distance is also referred to as Euclidean distance (or straight-line distance) and is the most commonly used type when it comes to analysing ratio or interval-scaled data.

There are also alternative distance measures: The city-block distance uses the sum of the variables’ absolute differences. This is often called the Manhattan metric as it is akin to the walking distance between two points in a city like New York’s Manhattan district, where the distance equals the number of blocks in the directions North-South and East-West.
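As a small illustration of the two measures described so far, the following sketch computes both for a pair of objects. The two respondents and their values are made up for the example.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two objects with the same variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    """City-block (Manhattan) distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical respondents measured on two interval-scaled variables.
respondent_a = (1.0, 2.0)
respondent_b = (4.0, 6.0)

print(euclidean(respondent_a, respondent_b))   # 5.0
print(city_block(respondent_a, respondent_b))  # 7.0
```

The city-block distance is never smaller than the Euclidean distance for the same pair, since walking along the blocks cannot be shorter than the straight line.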

There are other distance measures such as the Angular, Canberra or Mahalanobis distance. In many situations, the latter is desirable as it compensates for collinearity between the clustering variables.

The distance measures presented can be used for metrically and, in general, ordinally scaled data, but applying them to nominal or binary data is meaningless. In this type of analysis, one should rather select a similarity measure expressing the degree to which variables’ values share the same category. These so-called matching coefficients can take different forms but rely on the same allocation scheme shown in the table.

Allocating scheme for matching coefficients.

Two types of matching coefficients that do not equate the joint absence of a characteristic with similarity, and that may therefore be of more value in segmentation studies, are the Jaccard (JC) and the Russel and Rao (RR) coefficients. They are defined as follows:
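The allocation table and the formulas did not survive in this post, so the sketch below relies on the usual textbook convention, which is an assumption here: a counts characteristics present in both objects, b and c count characteristics present in only one of them, and d counts characteristics absent in both. Under that convention JC = a / (a + b + c) and RR = a / (a + b + c + d). The function names and the two binary profiles are hypothetical.

```python
def match_counts(x, y):
    """Count the four cells of the allocation scheme for two binary profiles."""
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # present in both
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # present in x only
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # present in y only
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)  # absent in both
    return a, b, c, d

def jaccard(x, y):
    """Jaccard coefficient: joint absences (d) are ignored entirely."""
    a, b, c, _ = match_counts(x, y)
    return a / (a + b + c) if (a + b + c) else 0.0

def russel_rao(x, y):
    """Russel and Rao coefficient: d enters the denominator, not the numerator."""
    a, b, c, d = match_counts(x, y)
    total = a + b + c + d
    return a / total if total else 0.0

# Hypothetical binary profiles (1 = characteristic present).
x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]
print(match_counts(x, y))  # (2, 1, 1, 2)
print(jaccard(x, y))       # 0.5
print(russel_rao(x, y))
```

In neither coefficient does a joint absence increase the similarity score, which matches the property stated above.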

These matching coefficients are, just like the distance measures, used to determine a cluster solution. There are many other matching coefficients, such as Yule’s Q, Kulczynski or Ochiai.

For nominal variables with more than two categories, one should always convert the categorical variable into a set of binary variables in order to use matching coefficients. With ordinal data, one should always use distance measures such as Euclidean distance. Using matching coefficients would be feasible and, from a strictly statistical standpoint, even more appropriate, but doing so would disregard the information contained in the order of the categories. After all, a respondent who indicates that he or she is very loyal to a brand is closer to someone who is somewhat loyal than to a respondent who is not loyal at all. Furthermore, distance measures best represent the concept of proximity, which is fundamental to cluster analysis.

Select a Clustering Algorithm

After having chosen the distance or similarity measure, we need to decide which clustering algorithm to apply. There are several agglomerative procedures and they can be distinguished by the way they define the distance from a newly formed cluster to a certain object, or to other clusters in the solution. The most popular agglomerative clustering procedures include the following:

1. Single linkage (nearest neighbour): The distance between two clusters corresponds to the shortest distance between any two members in the two clusters.

2. Complete linkage (furthest neighbour): The opposite approach to single linkage assumes that the distance between two clusters is based on the longest distance between any two members in the two clusters.

3. Average linkage: The distance between two clusters is defined as the average distance between all pairs of the two clusters’ members.

4. Centroid: In this approach, the geometric centre (centroid) of each cluster is computed first. The distance between the two clusters equals the distance between the two centroids.
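The four definitions above can be compared side by side in a short sketch. It assumes Euclidean distance between objects; the function names and the two small clusters are hypothetical.

```python
import math
from itertools import product

def single_linkage(c1, c2):
    """Shortest distance between any two members of the two clusters."""
    return min(math.dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """Longest distance between any two members of the two clusters."""
    return max(math.dist(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2):
    """Average distance over all pairs of members from the two clusters."""
    return sum(math.dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

def centroid(cluster):
    """Geometric centre: the mean of each variable over the cluster."""
    n = len(cluster)
    return tuple(sum(values) / n for values in zip(*cluster))

def centroid_linkage(c1, c2):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(c1), centroid(c2))

# Two hypothetical clusters with two members each.
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(3.0, 0.0), (5.0, 0.0)]

for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, round(rule(left, right), 3))
```

For the same pair of clusters the single-linkage distance is the smallest and the complete-linkage distance the largest, with the average-linkage value in between, so the choice of rule can change which clusters are merged next.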

A common way to visualize the cluster analysis’s progress is by drawing a dendrogram, which displays the distance level at which there was a combination of objects and clusters.

We read the dendrogram from left to right to see at which distance objects have been combined.

Dendrogram

Decide on the Number of Clusters

An important question is how to decide on the number of clusters to retain from the data. Unfortunately, hierarchical methods provide only very limited guidance for making this decision. The only meaningful indicator relates to the distances at which the objects are combined. We can seek a solution in which an additional combination of clusters or objects would occur at a greatly increased distance. This raises the issue of what a great distance is, of course.

One potential way to solve this problem is to plot the number of clusters on the x-axis (starting with the one-cluster solution at the very left) against the distance at which objects or clusters are combined on the y-axis. Using this plot, we then search for the distinctive break (elbow).
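One rough way to mimic this visual search is to look for the largest jump between successive merge distances. The sketch below is only a heuristic under that assumption, not a formal decision rule; the function name and the distances are hypothetical.

```python
def suggest_cluster_count(merge_distances):
    """Return the number of clusters present just before the largest jump
    in merge distance. `merge_distances` lists, in merge order, the distance
    at which each successive merge of an agglomerative run took place."""
    if len(merge_distances) < 2:
        raise ValueError("need at least two merges to look for a jump")
    n_objects = len(merge_distances) + 1
    jumps = [later - earlier
             for earlier, later in zip(merge_distances, merge_distances[1:])]
    biggest = max(range(len(jumps)), key=jumps.__getitem__)
    # After merge number (biggest + 1) the next merge would be the costly one,
    # so we stop there and count the clusters that remain.
    return n_objects - (biggest + 1)

# Hypothetical merge distances for seven objects (six merges).
distances = [1.0, 1.2, 1.5, 1.9, 6.0, 7.0]
print(suggest_cluster_count(distances))  # 3
```

In this made-up example the jump from 1.9 to 6.0 is by far the largest, so the sketch stops before that merge and suggests three clusters.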

Alternatively, we can make use of the dendrogram, which essentially carries the same information. SPSS provides a dendrogram; however, it differs slightly from the one presented in the figure. Specifically, SPSS rescales the distances to a range of 0–25; that is, the last merging step to a one-cluster solution takes place at a (rescaled) distance of 25. The rescaling often lengthens the merging steps, thus making breaks occurring at a greatly increased distance level more obvious.

Even so, this distance-based decision rule does not work well in all cases. It is often difficult to identify where the break actually occurs.

Overall, the data can often only provide rough guidance regarding the number of clusters one should select; consequently, one should rather revert to practical considerations. Occasionally, we might have a priori knowledge, or a theory on which we can base our choice. However, first and foremost, one should ensure that results are interpretable and meaningful. Not only must the number of clusters be small enough to ensure manageability, but each segment should also be large enough to warrant strategic attention.

References:

3. Business Research Methods by Cooper, Schindler, Sharma. 11th Edition.

Section B Group 6_Vaneet Bhatia (13FPM008)

Other Member:

• Apurva Ramteke(13PGP068)
• Komal Suchak (13PGP086)
• Rohan Kr. Jha (13FPM004)
• Silpa Bahera (13PGP107)
• Sushil Kumar (13FPM010)
• Vivek Roy (12FPM005)

# Section A_Group 6_Devasheesh Nautiyal_13PGP014 Session 6

The Question Stock & 2 Smoking Barrels (Devasheesh Nautiyal Group_A2)

Of all the things discussed in class regarding the creation of a questionnaire, one point stands out: a questionnaire cannot be seen in isolation from its purpose. Many factors affect the reliability and validity of the inference. Of the many references I came across, the following two caught my eye. Though seemingly obvious, the outcome can be startling.

Barrel #1: Time Lapse

Some surveys are conducted over a long period of time, mostly in multiple iterations, with considerable time between the first and the last survey. In keeping with the ongoing craze for incremental improvement in every facet of industry, surveys too are brought up to date over time. The lessons that come out of this are remarkable.

Above is a visual representation of religious tolerance in countries all over the world, with red marking the most intolerant countries and vice versa. So what could be wrong with this survey, in case you have not spotted it yet? Though I am still trying to find out what went wrong for India, a Bangladeshi national apparently was able to decipher the cause of the ‘Red Bangladesh’.

To gauge religious tolerance, the survey used 0 for tolerance and 1 for intolerance in the 1990s. In the process of ‘improving’ the survey, the value codes were swapped for the Bangladesh survey (1 for tolerance and 0 for intolerance), leading to the erroneous interpretation.

Barrel #2: Mr. Who!!

Who conducts the survey has a direct bearing on how valid the responses are and on how fierce the backlash to an erroneous survey will be, for both the conductor and the organizer.

Case in point: a survey the New York Times wanted to conduct on the prestigious Yale University campus. The responsibility was given to the Yale College Council (YCC), whose members decided on their own to conduct the survey among themselves (28 in all). Both parties were widely criticized: the YCC for not drawing a random sample, and the NYT for publishing findings based on the data the YCC provided. Here the issue is less about sampling than about the reluctance of the conductors and the organisers to do the job properly.

The next, and most frequently experienced, surveyor-based issue arises when the surveyor tries to influence our ratings, most probably to improve her own overall ratings. For an average performance, demands are made for ‘Rockstar Ratings’, which speaks volumes. As one witness recounts:

Some years ago, when I would take my car to the dealer for service, the same conversation ensued each time I returned to pick up the car. The clerk whom I paid for the service would shove a paper survey at me and say “Please complete our service satisfaction survey” suggesting with her body language that my keys would be held hostage until I finished the thing. “But,” I would protest, “How do I know whether I am satisfied or not until I have driven the car for a while?” Impervious to my logic, she would shrug and tell me that she needed it done now because that was their process, and if I didn’t know how satisfied I was, I should guess.

As we saw above, a good questionnaire, apart from being well designed, should also be consistent over time and conducted by responsible and accountable people. This is very important, yet seemingly obvious enough to go unnoticed until the trigger has been pulled!


# Section A_Group 6_Suresh Neela_13PGP035 Session 6

Application of Chi Square Tests in Marketing

We learnt in class what the Chi-square test is and the step-by-step process for carrying it out. For a marketer, however, knowing how the test can be applied in a real business scenario is much more helpful, so that is what I discuss in this article. I hope it will be interesting for all visitors of this blog.

Companies are especially interested in understanding consumer behavior towards their products. For example: Are all colors of refrigerators equally preferred among consumers? Is there any association between family income and brand preference when buying refrigerators? The day-to-day work of marketing involves many such questions, and answering them supports effective decisions.

As we know, the Chi-square test is used to find out whether there are differences among categorical variables (for example, color category: Red, Blue, Green, Orange; income category: Lower, Middle, Upper). In marketing it is broadly used in two different scenarios:

1. Goodness of Fit Test

This test is mainly used to find out how closely the observed frequencies match the expected frequencies. Only a single variable is considered here, such as the “color of refrigerator” in the example above.

Suppose a marketer wants to check the preferences of 200 consumers among four colors of refrigerators (Red, White, Blue and Black), stating the following hypotheses:

Null Hypothesis: All colors of refrigerator are equally preferred.

Alternative Hypothesis: Not all colors of refrigerator are equally preferred.

We then compare the observed frequencies, which might come from a survey or questionnaire, with the expected frequencies, which under the null hypothesis are equal (50 for each color, given 200 respondents).

The test statistic is computed with the Chi-square formula. If the computed value is greater than the critical value at the 5% level of significance and 3 degrees of freedom (n - 1, where n = 4 colors), the null hypothesis is rejected. In that case we can infer that not all colors of refrigerators are equally preferred.
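To make the calculation concrete, here is a minimal sketch using the standard Chi-square statistic, the sum of (observed - expected)² / expected over all categories. The observed counts are invented for illustration; the critical value 7.815 is the standard table value for 3 degrees of freedom at the 5% level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical survey counts for Red, White, Blue and Black (200 respondents).
observed = [70, 50, 45, 35]
expected = [sum(observed) / len(observed)] * len(observed)  # 50 each under the null

statistic = chi_square_statistic(observed, expected)
CRITICAL_VALUE_5PCT_DF3 = 7.815  # standard Chi-square table value, df = 3

reject_null = statistic > CRITICAL_VALUE_5PCT_DF3
print(statistic)    # 13.0
print(reject_null)  # True
```

With these made-up counts the statistic exceeds the critical value, so the null hypothesis of equal preference would be rejected.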

2. Independence Test

This test is mainly used to find out whether there is a relation or association between two categorical variables, that is, whether they are independent or dependent. The two variables in the example above are “Income group of the family” (Lower, Middle, Upper) and “Brand preference” (Samsung, LG, Whirlpool, Godrej).

Null Hypothesis: The two variables are independent.

Alternative Hypothesis: The two variables are dependent.

Again from a sample of 200 consumers, a contingency table with 12 cells can be prepared (3 income groups × 4 brands).

The test statistic is computed with the Chi-square formula. If the computed value is greater than the critical value at the 5% level of significance and 6 degrees of freedom, that is, (n1 - 1) × (n2 - 1) where n1 = 3 and n2 = 4, the null hypothesis is rejected. In that case we can infer that the income group of the family and brand preference are dependent.
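The same calculation for a contingency table can be sketched as follows. Expected counts are taken as (row total × column total) / grand total, the usual approach under independence. The counts in the table are invented for illustration; 12.592 is the standard table value for 6 degrees of freedom at the 5% level.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical counts: rows = Lower, Middle, Upper income;
# columns = Samsung, LG, Whirlpool, Godrej (200 consumers in total).
table = [
    [10, 15, 20, 25],
    [20, 20, 15, 15],
    [30, 15, 10, 5],
]

statistic, df = chi_square_independence(table)
CRITICAL_VALUE_5PCT_DF6 = 12.592  # standard Chi-square table value, df = 6
print(round(statistic, 2), df)
print(statistic > CRITICAL_VALUE_5PCT_DF6)
```

If the statistic exceeds the critical value, as it does for these made-up counts, the null hypothesis of independence is rejected.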

In this way, depending on the research objective, a marketing manager can use the Chi-square test in decision making related to products.

References:

http://www.slideshare.net/parth241989/chi-square-test-16093013

http://davidmlane.com/hyperstat/viswanathan/chi_square_marketing.html

http://www.polarismr.com/Portals/58820/research-lifeline/chi-square-test.htm

Suresh Neela

13PGP035



# The Philosophy of Research

In business, we define ‘research’ as a systematic inquiry undertaken to obtain information for solving problems and making decisions. This includes reporting, descriptive, explanatory, and predictive studies. But if we step outside the business perspective and view research more broadly, it has been relevant to the existence and evolution of knowledge since time immemorial.

Research philosophy can be defined as an overarching term relating to the development of knowledge and the nature of that knowledge in relation to research. It refers to the systematic search for existence, knowledge, values, reason, mind, and language. This kind of research requires an open mind in order to establish facts about both new and existing mysteries. Before the modern era of research, research was more or less regarded as logical reasoning, so one should not be surprised that a number of the fundamental distinctions in logic have been carried over into contemporary research.

Research can seem very abstract and sophisticated, but if we try to understand its various elements or phases and the way they fit together, it is not nearly as difficult as it seems. Assumptions about how we perceive and understand the world form the basis of all research. Academics and philosophers have been arguing for millennia about how best to understand the world, so let us see how most contemporary social scientists answer this question.

The most appropriate way to approach the question is to refer to the philosophical schools of thought. Positivism and post-positivism are considered the two major schools. For the time being, we can ignore arguable alternatives such as subjectivism, relativism, constructivism and deconstructivism. Positivism, in its broader sense, can be defined as the rejection of metaphysics. In simple words, it holds that the objective of knowledge is to explain what we observe and measure, i.e. our experiences. Post-positivism, on the other hand, is the rejection of the central tenets of positivism. It considers scientific reasoning and common-sense reasoning to be the same process.

Research philosophy classifications such as ontology, epistemology, and axiology, and their conflicting applications to the ‘quantitative-qualitative’ debate, are a major source of confusion for research students trying to establish their relevance to subject areas and disciplines. A number of studies have used entirely different descriptions, categorizations and classifications of research paradigms and philosophies in reference to research strategies, with overlapping emphasis and meanings. This has not only resulted in tautological confusion about what is rooted where, and according to whom, but also raises a vital question: are these opposing views enriching knowledge, or are they subtly becoming harmful to the field?

References:

1. Business research methods by Cooper, Schindler and Sharma.
2. http://wps.pearsoned.co.uk/ema_uk_he_saunders_resmethbus_5/111/28552/7309457.cw/content/index.html.

By:Section A _Group 4_Rajjan Singh_Roll No. 13PGP046


# Confidence Interval

What is a Confidence Interval?

To estimate an unknown population parameter, such as a population mean or a population proportion, two kinds of estimate are available: point estimates and interval estimates. A point estimate is the value of a single sample statistic. A confidence interval estimate is a range of numbers, called an interval, constructed around the point estimate. The confidence interval is constructed such that the probability that the population parameter is located somewhere within the interval is known.

Need for a Confidence Interval:

To find out how accurate a sample statistic is, you need to construct a confidence interval estimate. The population parameter is unknown, so you calculate a sample statistic by taking an appropriate sample from that population. The confidence interval is then used to check the reliability of that point estimate, that is, how close it is likely to be to the population parameter.

Suppose you want to estimate the mean CGPA of the current-year PGP students of IIM Raipur. The unknown population mean is denoted by µ, in this case the mean CGPA of IIM Raipur’s students. Now you select a sample of students and calculate the sample mean, say 6.2. The sample mean, X̄ = 6.2, is a point estimate of the population mean, µ. But how accurate is 6.2? A confidence interval estimate can answer such questions.

Understanding confidence interval:

The sample mean varies from sample to sample because it depends on the items selected in the sample. When developing the interval estimate of the population mean, you need to take this variability from sample to sample into consideration. The estimate of the population parameter should come with a specified level of confidence. In other words, there is a specified confidence that µ is somewhere in the range of numbers defined by the interval.

Suppose your interval estimate of the mean CGPA of the current-year students of IIM Raipur is 6.0 ≤ µ ≤ 6.4 at the 95% confidence level. You can then say: I am 95% confident that the mean CGPA of IIM Raipur’s PGP students is between 6.0 and 6.4. More precisely, 95% of the intervals constructed this way from repeated samples would contain µ, and only 5% would not.
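The interval in this example can be reproduced with a short sketch. It assumes a large-sample normal approximation (mean ± z × s / √n); for small samples a t-based interval would be more appropriate. The sample standard deviation (0.8) and the sample size (64) are hypothetical values chosen so that the result matches the interval above.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Sample mean from the example; standard deviation and n are assumed.
low, high = mean_confidence_interval(sample_mean=6.2, sample_sd=0.8, n=64)
print(round(low, 2), round(high, 2))  # 6.0 6.4
```

A larger sample or a smaller standard deviation narrows the interval, while a higher confidence level widens it.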

Suman Gupta_ Section A_13FPM005_Group9


# Survey Methods


A survey is a non-experimental, descriptive research method. Survey methodology is a field of applied statistics that studies the sampling of individual units from a population. Surveys can be useful when a researcher wants to collect data on phenomena that cannot be directly observed. What makes a survey a survey? Scientific methodology, data collection from individuals, sampling from a large population, and the fact that it is conducted for the purpose of description, exploration and explanation.

Types of Surveys

Data are usually collected through questionnaires, although sometimes researchers interview subjects directly. Surveys can use qualitative or quantitative measures. There are two basic types of surveys:

• Cross-sectional surveys
• Longitudinal surveys
  1. Trend
  2. Time cohort
  3. Panel

# Cross-sectional surveys

Cross-sectional surveys are studies aimed at determining the frequency of a particular attribute in a defined population at a particular point in time. A cross-sectional study is observational, which means that the researchers do not manipulate the study environment. Its defining feature is that it can compare different population groups at a single point in time. The benefit of a cross-sectional study design is that it allows researchers to compare many different variables at the same time.

For example: A questionnaire that collects data on how parents feel about Internet filtering.

# Longitudinal surveys

A longitudinal study is also observational: here, too, researchers do not interfere with the subjects. In a longitudinal study, several observations are made of the same subjects over a period of time, sometimes lasting many years. The benefit of a longitudinal study is that researchers are able to detect developments or changes in the characteristics of the target population at both the group and the individual level. The key is that longitudinal studies extend beyond a single moment in time. As a result, they can establish sequences of events.

## Trend surveys

Trend studies focus on a particular population, which is sampled and scrutinized repeatedly. While the samples are drawn from the same population, they are typically not composed of the same people. Since trend studies may be conducted over a long period of time, they do not have to be conducted by just one researcher or research project. A researcher may combine data from several studies of the same population in order to show a trend.

For example: A yearly survey of librarians asking about the percentage of reference questions answered using the Internet.

## Time cohort surveys

Cohort studies also focus on a particular population, sampled and studied more than once, but they have a different focus. Cohort studies are largely about the life histories of segments of populations and the individual people who constitute these segments. A cohort study would sample the same class every time.

For example: In 2013, a sample of 2013-2014 post-graduates of IIM Raipur could be questioned regarding their attitudes toward professionals in libraries. One year later, the researcher could question another sample drawn from the same 2013-2014 class and study any changes in attitude.

## Panel surveys

Panel studies allow the researcher to find out why changes in the population are occurring, since they use the same sample of people every time. That sample is called a panel.  Panel studies, while they can yield extremely specific and useful explanations, can be difficult to conduct. They tend to be expensive, they take a lot of time, and they suffer from high attrition rates. Attrition is what occurs when people drop out of the study.

For example: A researcher could select a sample of post-graduate students and ask them questions about their library usage. Every year thereafter, the researcher would contact the same people, ask them similar questions, and ask them the reasons for any changes in their habits.

Priti Arvind Zod_ SectionA_pgp13044_Group9
