Cluster analysis

A cluster is a group of elements or characteristics that belong to the same category. Cluster analysis in research methodology means the same thing: we divide our data or behaviours into groups whose members are similar to one another. In other words, our objective is to divide the observations into homogeneous and distinct groups. The greater the similarity within a group and the greater the difference between groups, the better or more distinct the clustering.

In application we find that the notion of a cluster is not well defined. To understand this difficulty, let’s look at an example that depicts the general problem of cluster analysis.


The figure above shows a set of points divided into anywhere from two to six clusters. The last panel shows how two big clusters are divided into six parts. This makes it hard for the researcher to draw a conclusion: the six clusters may show the same kind of nature, because each of them is a subset of one of the two big clusters.

Measures of distance for variables:

Cluster analysis requires a more precise definition of the similarity of observations and clusters. When the grouping is based on variables, it employs the familiar concept of distance. Consider two points i and j with coordinates (X1i, X2i) and (X1j, X2j), respectively.

The Euclidean distance between the two points is the hypotenuse of the triangle ABC:

$D(i, j) = \sqrt{A^2 + B^2} = \sqrt{(X_{1i} - X_{1j})^2 + (X_{2i} - X_{2j})^2}$
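
As a small illustration of the formula, here is a minimal Python sketch. The function name and the sample coordinates are made up for this example; A and B are the two legs of the right triangle.

```python
import math

def euclidean_distance(p, q):
    """Distance between two points given as (X1, X2) coordinate pairs."""
    a = p[0] - q[0]  # leg A: difference on the first variable
    b = p[1] - q[1]  # leg B: difference on the second variable
    return math.sqrt(a ** 2 + b ** 2)

# Legs of length 3 and 4 give a hypotenuse of 5
print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))  # prints 5.0
```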

Types of clustering:

Cluster analysis can be conducted in three main ways, depending on our requirements.

1)      Hierarchical clustering

2)      Partitioning methods

3)      Two-step clustering


Hierarchical cluster:

Hierarchical clustering is characterized by a tree-like structure that is established in the course of the analysis. Most hierarchical techniques fall into the category of agglomerative clustering. In this category, clusters are formed consecutively from objects. At the beginning, the procedure starts with each object representing an individual cluster. These clusters are then merged sequentially: first, the two most similar clusters are merged to form a new cluster at the bottom of the hierarchy. In the next step, another pair of clusters is merged and linked to a higher level of the hierarchy, and so on. This allows a hierarchy of clusters to be established from the bottom up.
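
The bottom-up merging can be sketched in a few lines of Python. This is only an illustration, not what SPSS does internally: it assumes one-dimensional data and single linkage (the distance between two clusters is the distance between their closest members), and the function name and data are invented for the example.

```python
def agglomerate(points):
    """Merge 1-D points bottom-up; return a list of (distance, merged_cluster)."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        history.append((d, merged))
        # replace the two merged clusters with the new, higher-level cluster
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

for distance, cluster in agglomerate([1.0, 2.0, 6.0, 7.0, 20.0]):
    print(distance, cluster)
```

The distances at which clusters are merged grow as the procedure moves up the hierarchy, which is the information a dendrogram displays.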

This structure is visible in the SPSS output, where the dendrogram shows it more clearly. From the dendrogram we can judge how many clusters are present in the data. The only meaningful indicator is the distance at which the objects are combined.

One potential way to decide on the number of clusters is to plot the number of clusters on the x-axis (starting with the one-cluster solution at the very left) against the distance at which objects or clusters are combined on the y-axis. Using this plot, we then search for the distinctive break (elbow). SPSS does not produce this plot automatically; you have to take the distances provided by SPSS and draw a line chart in a common spreadsheet program such as Microsoft Excel.
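
The search for the break can also be approximated numerically. The sketch below is a crude heuristic of my own, not an SPSS feature: it assumes you have copied the merge distances out of the output in the order the merges happened, and it simply looks for the largest jump between consecutive distances. The function name and the numbers are hypothetical.

```python
def suggest_cluster_count(merge_distances):
    """merge_distances[k] is the distance at which the (k+1)-th merge happened."""
    n_objects = len(merge_distances) + 1
    # increase in distance from one merge to the next
    jumps = [merge_distances[k + 1] - merge_distances[k]
             for k in range(len(merge_distances) - 1)]
    k = max(range(len(jumps)), key=jumps.__getitem__)  # index of the largest jump
    # stop just before the expensive merge: k + 1 merges have been done so far
    return n_objects - (k + 1)

# Five objects, four merges; the last merge is far more "expensive" than the others
print(suggest_cluster_count([1.0, 1.0, 4.0, 13.0]))  # prints 2
```

The elbow in a plot is still a judgment call, so a result like this is best treated as a starting point rather than an answer.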

Partitioning methods: K means

Hierarchical clustering is used when the number of variables is less than 50. When more than 50 factors have to be considered, we use k-means cluster analysis. K-means cluster analysis is used for the segmentation, targeting and positioning (STP) of a market. This method can be used to identify the potential market for an organization, where it can launch its products and achieve maximum profit.

The k-means algorithm follows an entirely different concept from the hierarchical methods discussed before. This algorithm is not based on distance measures such as Euclidean distance or city-block distance, but uses the within-cluster variation as a measure to form homogeneous clusters. The procedure aims to segment the data in such a way that the within-cluster variation is minimized.


With hierarchical methods, an object remains in a cluster once it is assigned, but with k-means, cluster affiliations can change in the course of the clustering process. K-means does not build a hierarchy, which is why this approach is labeled as non-hierarchical.

Prior to the analysis we need to decide on the number of clusters. Based on this, the algorithm randomly selects a center for each cluster. After this, Euclidean distances are computed from the cluster centers to every single object. Each object is assigned to the cluster center with the shortest distance to it.

Based on this initial partition, each cluster centroid is computed. This is done by computing the mean value of the objects contained in the cluster.

In the fourth step, the distances from each object to the newly located cluster centers are computed, and objects are again assigned to a cluster on the basis of their minimum distance to the cluster centers. Since the cluster centers’ positions have changed with respect to the initial situation in the first step, this could lead to a different cluster solution.
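
The steps described above can be sketched in plain Python. This is a minimal illustration under my own assumptions, not the SPSS implementation: the data are two-dimensional, the loop stops when no object changes cluster, and the optional `centers` argument (not part of the description above) lets the starting centers be fixed so the example is repeatable.

```python
import random

def k_means(points, k, centers=None, max_iter=100, seed=0):
    """Cluster 2-D points; returns (centers, labels)."""
    rng = random.Random(seed)
    # step 1: pick starting centers at random unless the caller supplies them
    centers = list(centers) if centers is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # steps 2 and 4: assign every object to its nearest center
        new_labels = [
            min(range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        if new_labels == labels:  # no object changed cluster, so stop
            break
        labels = new_labels
        # step 3: move each center to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(data, k=2, centers=[(0.0, 0.0), (1.0, 0.0)])
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

With these deliberately poor starting centers, the point (1.0, 0.0) is first grouped with the far-away points and then moves to the other cluster once the centers are recomputed, which shows how cluster affiliations can change during the process.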

Reference:

E. Mooi and M. Sarstedt, A Concise Guide to Market Research (2011)

Submitted by

Samkit Jain_Group2_SectionB (12PGP112)





Section B_Group 3_Nishita Khemka_13PGP093


When I first downloaded the SPSS software and did the first few operations as instructed by the professor, the first thing that struck me was, “This thing is too technical for me.” Most of my friends here would agree with me that SPSS is a bit complicated. It is more suitable for statisticians than for students and business professionals who do not have any expertise in the field of statistics. Moreover, SPSS contains many tools which have low applicability and relevance for businesses that want to do a basic analysis of data.

Identifying these pain points, two young analysts, Greg Laughlin and John Le, developed software called Statwing to make life easier for non-stats majors like us. Statwing, popular for its simplicity and ease of use, makes data analysis intuitive and beautiful. Statistical best practices are encoded into the software so that non-experts can get the same insight into their data as a statistician. Here are a few reasons why Statwing is more user-friendly than statistical analysis tools like SPSS and R:

Simplified process: Statwing relies on a rules engine that automatically considers the type of data uploaded and the types of variables (a maximum of two right now) a user wants to relate to each other. It is designed to make it easier to ask questions about data. The user does not need to know the type of variables being used and the kind of analysis that will be apt for the said variables.

Faster: Statwing automates statistical analysis so you can understand your data deeply in just a few clicks, regardless of whether it’s kilobytes or gigabytes. Unlike SPSS, where data loading takes time, data in Statwing can be copied or uploaded in seconds. Analysts and market researchers say they analyze data more deeply and five times more quickly in Statwing than in Excel or SPSS.

Instant Visualization: Statwing automatically visualizes every analysis. It understands the structure of the user’s data, so it automatically creates histograms, scatterplots, heatmaps, and bar charts that the user can easily export to Excel or PowerPoint.

Accounting for outliers: Unlike traditional software, Statwing accounts for data issues like outliers, so the user can always be confident in their analyses. When Statwing notices outliers or other statistical issues, it runs statistical tests that take them into account (for example, running Spearman’s Rank Correlation instead of Pearson’s Correlation wherever applicable).

Easy interpretation of results: In SPSS the results are shown in the form of tables and the user has to interpret the results according to his/her understanding. On the other hand, Statwing interprets the result of the analysis in plain English words, thus making it easy to understand the result.

Example: Suppose the user wants to analyze the relationship between a customer’s gender and their satisfaction with the product.


In SPSS:-

  1. Load the data… [wait 2 seconds]
  2. Analyze the variables and think, “Gender is binary (male versus female) and satisfaction is continuous, so the correct statistical test is an independent samples t-test”
  3. Run the test and get the result in the form of a table
  4. Interpret the result: The p-value is below 0.05, which indicates a statistically significant difference between men’s and women’s satisfaction

In Statwing:-

  1. Paste or upload the data to Statwing
  2. Select Gender, then Satisfaction, and then choose to relate the two
  3. Statwing understands the structure of these variables, so it runs a t-test automatically
  4. Read the headline for a quick summary of the statistical test’s results
  5. Look at the visualization to see how women’s satisfaction scores tend to be lower than those of men
  6. Go to the Advanced Tab for the full test results, as well as standard deviations, confidence intervals, and more

Despite these advantages, Statwing is suitable only for basic analysis. Many tools used for more advanced analysis, such as two-way ANOVA, regression analysis, and time series analysis, are not available in Statwing. However, if Statwing adds these tools, it could give SPSS tough competition and may even lead to its phasing out.


Reducing Mail Survey Errors

Mail surveys are a preferred survey method due to their low cost and ease of implementation. However, like other survey methods, they are prone to various errors such as sampling error, non-coverage error, non-response error, and measurement error. This blog puts forward ways to minimize the impact of these errors on the survey results.

Sampling Error

The most common error is the sampling error, which exists because the sample selected for the survey might not be representative of the population.

Using statistics, it can be shown that for a given confidence level, increasing the sample size reduces the sampling error. This is true for random or probabilistic sampling methods. How large the sample should be depends on the trade-off between precision of estimation and cost.
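
To make the trade-off concrete, here is a small Python sketch using the standard margin-of-error formula for an estimated proportion. It assumes simple random sampling, a 95% confidence level (z = 1.96), and the worst-case proportion p = 0.5; these values and the function name are assumptions chosen for the illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)

# The sample size must be quadrupled to cut the margin of error in half
for n in (100, 400, 1600):
    print(n, margin_of_error(n))
```

Because precision improves only with the square root of the sample size, each further gain in precision costs more than the previous one.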

Non-probabilistic sampling involves subjective selection of the sample. Therefore, a precise estimate of the sampling error is not statistically feasible in such cases.

Coverage Error

Both under-coverage and over-coverage can affect the results of a mail survey. Coverage error affects the survey estimates if the characteristics of those wrongly included in or left out of the sample frame differ from the characteristics of those correctly covered. Having a complete, up-to-date sample frame reduces the chance of coverage error.

Non-Response Error

Non-response error arises when some of the sample members do not respond to the survey questions. A lot of research has been conducted on improving response rates. Some of the variables found to have a positive effect on the response rate are the number of contacts (the more the better), relevance or salience of the questionnaire topic, government sponsorship (compared to private sponsorship), specificity of the target sample (compared to the general population), incentives, pre-notification, stamped return postage, etc.

Research on incentives has shown that response rates increase only when the incentive is sent with the initial mailing, not when it is made contingent on returning the response. Further, no statistically significant difference was found between monetary and non-monetary incentives.

Research has also shown that mixed-mode surveys, which combine mailed questionnaires, electronic mail, telephone, and face-to-face interviews in some proportion, can increase the response rate compared to a typical mail survey.

Measurement Error

Measurement errors arise from the respondents’ side when they either do not respond to certain questions, or leave open-ended questions unanswered/incomplete, or fail to follow instructions. Measurement errors signify the difference between the recorded answers and the true answers.

Mail surveys have some advantage when it comes to measurement errors because no interviewer is present. This not only lessens the likelihood of respondents being driven to give socially desirable responses but also removes interviewer bias.

Some ways to reduce measurement errors are:

  • Pre-testing the questionnaire
  • Making the questionnaire more respondent-friendly
  • Making questionnaire instructions clear and simple
  • Streamlining the questionnaire design in case of mixed-mode surveys to reduce deviance among the different modes


Section A _Group 3_Sameer Pandey_13PGP047


Section B _Group 7_Amrit Jain_13PGP062

SPSS- A Statistical Tool


While thinking of a topic for our blog, I was unsure which one to choose. I was also solving a problem in SPSS at the time, and it suddenly struck me: why not write about the IBM SPSS software itself? I thought it could be an interesting topic. SPSS originally stood for Statistical Package for the Social Sciences. It is a Windows-based program that can be used to perform data entry and analysis and to create tables, graphs and pictograms. SPSS is capable of handling huge amounts of data and performing statistical analysis on it impressively. SPSS is updated often with new versions; the one I am using is the SPSS 15.0 Evaluation version. It was developed by Norman H. Nie, C. Hadlai Hull and Dale H. Bent and first released in 1968; IBM acquired it in 2009. It is compatible with Windows, Linux, UNIX and Mac operating systems. SPSS is among the most widely used programs for statistical analysis in the social sciences.

Before learning about SPSS, I was unsure whether spreadsheet applications like Microsoft Excel or OpenOffice Calc are better than SPSS, because spreadsheets are also widely used for statistical analysis. But after doing secondary research on the question, I was impressed by the marvels of this tool. This learning was a value addition for me. SPSS looks a lot like a typical spreadsheet application. When we open it, we see the familiar tabular grid and we enter values in cells. Spreadsheets, in turn, are capable of a lot of the things that SPSS is good at, like generating graphs and statistics on a data set. The differences can be summed up in the following points:

Flexibility: Spreadsheets are designed to be very flexible and broadly applicable to many different tasks, while SPSS is specifically designed for statistical processing of large amounts of data at an enterprise level. For example, unlike a spreadsheet, SPSS has the concepts of “case” and “variable” built in. The rows in SPSS always represent cases, for example survey responses (typically to a questionnaire) or experimental subjects, and the columns always represent variables observed from those cases, like the specific values given by the survey respondent or measurements from the experimental subject. Owing to this case/variable arrangement, when a calculation is performed over a set of data, the result is not inserted into another cell of the table, as it would be in a typical spreadsheet, but appears in a separate window. This is particularly advantageous when dealing with large sets of data, since it keeps calculated statistics and graphs separate from the raw data but still easily accessible. A spreadsheet like MS Excel, on the other hand, has many more functions than SPSS and gives more flexibility in how you use them.

Ease of use: It is also much more convenient to perform statistical tests in SPSS, even though many are possible using typical spreadsheets. For example, to perform a one-sample t-test with Excel, we have to calculate the t value for the sample ourselves and use the “T.DIST” function to return the significance, while also selecting a cell for the results and labelling it in another cell. To perform the same test in SPSS, we select a variable and supply the value to compare with our sample and, when we click “Ok,” SPSS generates a table with t, the degrees of freedom, the significance, and a confidence interval neatly calculated. SPSS makes it easy to understand statistical results. It includes a lot of extra help files and tutorials that explain how we can or should interpret the statistical jargon that the software spits out. Spreadsheets do not provide this.
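
For readers who want to see what the “t” and “degrees of freedom” in that table are, here is a minimal Python sketch of the one-sample t statistic. It uses only the standard library and stops short of the significance value, which needs the t distribution; the function name and the sample scores are invented for the illustration.

```python
import math
import statistics

def one_sample_t(sample, test_value):
    """Return (t, degrees_of_freedom) for a one-sample t-test."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    t = (mean - test_value) / (sd / math.sqrt(n))
    return t, n - 1

# Hypothetical satisfaction scores compared against a test value of 4
t, df = one_sample_t([4.0, 5.0, 6.0, 5.0, 5.0], test_value=4.0)
print(t, df)
```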

Modernity: Probably the most significant advantage of using SPSS is that it was designed with modern data collection methods in mind. A lot of data that’s collected, especially survey data, is numerically coded before it’s electronically stored. So, for example, a response of “strongly agree” might become a 6; a level of education such as “completed high school” or “some college” might become a 10 or 11. SPSS makes it possible to define the variable so that the coded values are automatically connected to their original meanings. For this reason, various surveys and polls (including many that U of I students and faculty can access through Roper iPoll, ICPSR, and other sets provided through the U of I library) make their raw data available in SPSS’s native “.sav” format.

The differences mentioned above are the major ones. There are some disappointments with SPSS too. For example, SPSS doesn’t update the values of cells automatically when changes are made elsewhere in our data, despite our having set up a compute command. Also, if we delete a variable we cannot restore it. IBM SPSS is expensive, sometimes ridiculously so; even when we do buy it we are really only leasing it, and its license is definitely not user-friendly. There are often compatibility issues with prior versions.

But despite these minor issues, I really like working in SPSS compared with a spreadsheet application. To sum up in one line, ease of use and in-depth data analysis are the features which really impressed me.

Other Members: Amrit Jain, Ankit Saxena, Gugan N, Jyoti Kanwatia, Nitin Sonkar, Sonam Supriya, Sumit Ranjan, Yogesh Sham Gupta


Section B_Group 3_Rishabh Raj_13PGP104


None of us would ever have dreamt of being the Truman Burbank (The Truman Show, 1998) of someone else’s show. The truth of the matter is that there are millions of people who are constantly being monitored for their behaviour patterns and preferences. The observations can be direct or indirect, disguised or undisguised, structured or unstructured, human or mechanical, etc. The role of observation in modern-day marketing cannot be underestimated; in fact, you will be surprised to know the kind of effort and innovation companies are putting in to gain competitive advantage. Following are some of the innovative observation methods:


The Mystery Shopper/Diner – This observation can be categorised as Indirect, Unstructured, Human, Archived. The term “Mystery Shopping” was coined in the 1940s by Wilmark, the first research firm to apply the concept beyond integrity applications. A mystery diner sits near the actual customers and observes their dining experience as well as the service provided to them. Mystery shoppers/diners not only give valuable feedback regarding the service provided but also play a key role in inspecting and evaluating a variety of activities including company operations, employee integrity, store merchandising and product quality. Today, over 100 companies belong to the Mystery Shopping Providers Association, and the industry is estimated to be worth over $1.5 billion annually.

Trend Spotting – This observation can be categorised as Indirect, Unstructured, Human, Archived. Extending the practise of observation beyond what is clearly or scientifically seen, some researchers have tried to catalog behaviours that might signal the beginning of important trends. This method is called trendspotting, and it has always been under the scanner because of the subjectivity and randomness of the observations involved. Denmark’s giant ad agency DDB Worldwide invites observers all over the world, plus other targeted groups such as members of youth organizations, to submit their observations to managers appointed as Signbankers. The Signbankers update the corporate database on a regular basis, and it is properly segregated. To get an idea of the kind of observations that might be included, think of anything as random as the time people spend at the metro station watching an ad close to the ticket counter, or the importance of billboard colours and their impact on sports channel viewers.

Mobile Trackers for vehicles – This observation can be categorised as Direct, Structured, Mechanical and As is. Electronic devices are designed so that they can pinpoint the location of any equipped vehicle through the Global Positioning System. These devices help companies keep track of important questions such as: are drivers speeding, are device readings accurate, are drivers using vehicles during off hours, etc. This data can also help fleet managers with their duties.

Neuromarketing – This observation can be categorised as Indirect, Structured, Mechanical and Archived. This is a high-tech research methodology that uses a technology called quantified electroencephalography (QEEG). Subjects wear light, portable EEG equipment that records brain activity, and software displays activity levels in different parts of the brain. Hewlett Packard used this technique while developing advertisements for its digital photography products.

Scanner based Consumer Panel – This observation can be categorised as Direct, Structured, Mechanical and As is. Each household is assigned a bar-coded card, like a frequent shopper card, which members present to the clerk at the register. The household’s code number is coupled with the purchase information recorded by the scanner. In addition, background information about the household is obtained through answers to demographic and psychographic questions.

A number of ethical issues have been raised about observation methods that intrude on the privacy of consumers. Disguised observations do not seek the approval of the consumers tracked and collect data whatever field it might belong to. A set of questions might help marketers resolve the dilemma they face while collecting information: is the behaviour being observed commonly performed in public, is the behaviour being observed performed in anonymity, and has the person agreed to be observed?


Section B_Group 3_Ravi Kumar Singh_13PGP103

Survey Design Studies: A Basic Introduction


As a project matures, one realizes how important it is to have a research design. A research design guides the team and the company’s decision makers. It lays out the methods and procedures to employ as the information gets collected.

To develop a research design, you will rely on three types of studies: exploratory studies, descriptive studies, and causal studies.

Each depends on different information that will help you. No matter how large or small your project, conducting surveys and establishing a research design is vital to your success. If you don’t know where your project is going, you won’t know if it’s succeeding.


First, you need to do an exploratory study. This is the problem finding phase. An exploratory study forces you to focus the scope of your project. It helps you anticipate the problems and variables that might arise in your project.

Perhaps the most common problem is size. The project must be kept focused. If the scope of a project is too big, it will not get off the ground. Too much information is overwhelming. An important objective of an exploratory study is keeping your project manageable. The larger your project’s scope, the more difficult it is to control. This process will help you weed out problems.

In the case of developing an app, for example, an exploratory study would help your research team take an abstract idea and develop it into a focused plan. The specific app would be market-driven. This process takes legwork, but the results are worth the effort.

Exploratory studies generally encompass three distinct methods:

1. Literature search

A literature search means you go to secondary sources of information: the internet, the public library, company or government records. These sources are usually easy and inexpensive to access.

2. Expert interviews

After a literature search, your team will have a useful background for the project. They will know what questions to ask and how to set up their project. The next step is to interview experts. These experts might include company executives or consumers. The team would also talk to people who have used similar products and seek out professionals whose careers relate to the research project.

3. Case studies

Every research project will have pitfalls. Therefore, case studies become an important tool because they allow us to examine another business’s managerial problems and solutions. If another study deals with similar issues, we can avoid these pitfalls by learning from its mistakes.


Who are you selling to? An exploratory study helped you establish what you are selling, but the descriptive study will help you find your market and understand your customer. Since you will not be able to sell to everyone, a descriptive study is necessary to focus your project and resources.

There are different kinds of studies you can implement to better understand your market. Consider the following descriptive studies:

  • Market potential: description of the number of potential customers of a product.
  • Market-share: identification of the share of the market received by your product, company and your competitors.
  • Sales analysis: description of sales by territory, type of account, size or model of product.
  • Product research: identification and comparison of functional features and specifications of competitive products.
  • Promotion research: description of the demographic characteristics of the audience being reached by the current advertising program.
  • Distribution research: determining the number and location of retailers handling the company’s products. These are supplied by wholesalers and distributed by the company.
  • Pricing research: identifying competitors’ prices by geographic area.


Even though descriptive studies describe and predict relationships, results, and events, you may want to know the reasons behind them. If you can discover the reasons behind your solutions, then you can assemble your own predictive models.

Cause and effect have to be related. Before a cause and effect can be established, a logical implication (or theoretical justification) has to be found.

There are three types of evidence that can be used to establish causal relationships:

Associative variation
Associative variation involves taking two variables and seeing how often they are associated. The more they show up in studies, the more likely they are related. Associative variation can be broken down into two distinctions: association by presence and association by change.

Association by presence measures how closely presence of one variable is associated with presence of another.

Sequence of events

In order to establish a cause/effect relationship, you must first establish that the causal factor occurred first. For example, in order for salesperson training to result in increased sales, the training must have taken place prior to the sales increase. If the cause does not precede the effect, then there is no causal relationship.

Absence of other possible causal factors

You must also demonstrate that other factors did not cause the effect. Once you have proved this, you can logically conclude that the remaining factor is the cause. For example, if we can control all other factors affecting the sales item, then we have to conclude that the increase in sales comes from training.


Section A 13PGP020 Isaac Solomon Session 7

Guidelines in defining Variables in SPSS


Variable Name

Each variable name must be unique, without duplication. In older versions of SPSS the name had to be 8 characters long or fewer; current versions allow names up to 64 bytes long. The first character must be a letter or one of the characters @, #, or $. Subsequent characters can be any combination of letters, numbers, non-punctuation characters, and a period (.). Though we can use “Chelsea_Football”, we cannot use “Chelsea-Football”, nor can we use “Chelsea Football”, as a variable name. SPSS interprets “-” as a subtraction sign, and the space confuses the software as to how many variables are being named. Since variable names tend to be short and cryptic, Label allows you to specify a longer description to give more clarity about the variable. This longer label will appear on any charts or graphs produced.
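
As a rough illustration of these naming rules, here is a small Python checker. It is a simplified sketch and not SPSS’s own validation: it assumes the 64-byte limit and an ASCII-only character set, and it ignores further restrictions such as reserved keywords. The function name is hypothetical.

```python
import re

# first character: a letter or @, #, $; then letters, digits, _ . @ # $
_NAME_PATTERN = re.compile(r"[A-Za-z@#$][A-Za-z0-9_.@#$]*")

def looks_like_valid_name(name, max_bytes=64):
    """Approximate check of a variable name against the rules described above."""
    return (len(name.encode("utf-8")) <= max_bytes
            and _NAME_PATTERN.fullmatch(name) is not None)

for candidate in ("Chelsea_Football", "Chelsea-Football", "Chelsea Football"):
    print(candidate, looks_like_valid_name(candidate))
```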

Variable Type

The two main types of variable are numeric and string. Numeric variables may only have numbers assigned, while string variables may contain both numbers and letters. The catch is that a string variable, though it can hold a number, cannot be used in numeric operations such as the mean, variance, or standard deviation. By default all variables are assumed to be numeric. Other variable types that can be used are comma, dot, scientific notation, date, dollar, custom currency and restricted numeric. In all cases, one will need to specify the variable width. This can be done in the dialogue box, or in the subsequent Width and Decimals columns.


Values allows you to connect the values (numbered codes) of the coding scheme to the original categories. For example, it is here that, for a variable Sex, males can be coded with a 0 and females with a 1. Codes are added one after the other in sequence.
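
The idea of value labels can be mimicked outside SPSS with a simple lookup table. The Python sketch below is only an analogy; the variable names and data are made up.

```python
# Coding scheme for the variable Sex, as in the example above
SEX_LABELS = {0: "Male", 1: "Female"}

coded_responses = [0, 1, 1, 0]  # what is actually stored in the data
labelled = [SEX_LABELS[code] for code in coded_responses]
print(labelled)  # prints ['Male', 'Female', 'Female', 'Male']
```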


Missing refers to a missing data code. This is a “special” number that SPSS will treat as a unique code to identify places where there is no data. SPSS will avoid including it as a “real” number when statistics are computed.
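
The effect of a missing data code on a statistic can be shown with a short Python sketch. The code 99 and the ages are hypothetical; the point is only that the coded entries are left out before the mean is computed.

```python
import statistics

MISSING = 99  # hypothetical missing-data code chosen for this variable
ages = [23, 31, 99, 27, 99]

valid = [a for a in ages if a != MISSING]  # drop the coded "no data" entries
print(statistics.mean(valid))  # prints 27
print(statistics.mean(ages))   # treating 99 as a real age would inflate the mean
```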



Columns refers to how many columns wide you would like the variable to be presented in the “data view.” Normally this would be at least 8 so that the variable name can appear easily.

Measurement of the Variable

The level of measurement of ratio and intervals cannot be differentiated by the SPSS software. These two measurements are grouped together as scale. Nominal and ordinal however are differentiated.

Nominal variables are also known as categorical variables. The different codes here refer to different categories (e.g., SEX is a categorical variable because the two groups are simply different categories that bear no mathematical relation to each other).

Ordinal variables are those where the numerical codes reflect an ordering of some sort, but where the distance between the categories can vary. For example, “Job Grade” is an ordinal variable: the grades are ordered (1) Grade 6; (2) Grade 7; and (3) Grade 8, but the distances between the codes are not necessarily equal.

Scale variables include interval and ratio levels of measurement, where any numeric codes have meaning in terms of number relations that go beyond category and order. If it makes sense to compute a mean for the variable, then it probably is a scale variable.