Clustering means grouping elements or characteristics that belong to the same category. Cluster analysis in research methodology follows the same idea: we divide our data into groups that are internally similar. In other words, our objective is to divide observations into homogeneous, distinct groups. The greater the similarity within a group and the greater the difference between groups, the better or more distinct the clustering.

In applications, we find that the notion of a cluster is not well defined. To understand this difficulty, let's look at an example that depicts the general problem of cluster analysis.

The figure above shows the same set of points divided into anywhere from two to six clusters. The last panel shows how the two large clusters are split into six parts. This creates a problem for the researcher: it is hard to conclude whether the six clusters reflect genuinely different groups, since they are subsets of the two larger clusters.

This calls for a more precise definition of the similarity of observations and clusters. Because the grouping is based on variables, it employs the familiar concept of distance. Consider two points i and j with coordinates (X_{1i}, X_{2i}) and (X_{1j}, X_{2j}), respectively.

The Euclidean distance between the two points is the hypotenuse of the triangle ABC:

D(i, j) = Sqrt(A^{2} + B^{2}) = Sqrt[(X_{1i} - X_{1j})^{2} + (X_{2i} - X_{2j})^{2}]
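As a quick illustration, the formula above can be computed in a few lines of plain Python (the helper name and the sample coordinates are made up for this sketch):

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Two points i and j with coordinates (X1, X2)
point_i = (1.0, 2.0)
point_j = (4.0, 6.0)
print(euclidean_distance(point_i, point_j))  # 3-4-5 triangle: prints 5.0
```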

Broadly, cluster analysis can be conducted in three ways, depending on our requirements:

1) Hierarchical clustering

2) Partitioning methods

3) Two-step clustering

Hierarchical clustering:

It is characterized by a tree-like structure that is established in the course of the analysis. Most hierarchical techniques fall into the category of agglomerative clustering, in which clusters are consecutively formed from objects. The procedure starts with each object representing an individual cluster. These clusters are then sequentially merged: the two most similar clusters are combined to form a new cluster at the bottom of the hierarchy, then in the next step another pair of clusters is merged and linked to a higher level of the hierarchy, and so on. This allows a hierarchy of clusters to be established from the bottom up.

This structure becomes visible in the SPSS output: the dendrogram shows it clearly, and from the dendrogram we can decide how many clusters are present in the data. The only meaningful indicator is the set of distances at which the objects are combined.

One potential way to solve this problem is to plot the number of clusters on the x-axis (starting with the one-cluster solution at the very left) against the distance at which objects or clusters are combined on the y-axis. Using this plot, we then search for the distinctive break (elbow). SPSS does not produce this plot automatically; you have to use the distances provided by SPSS to draw a line chart in a common spreadsheet program such as Microsoft Excel.
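The merge distances behind such a plot can be sketched in plain Python. This is a minimal single-linkage agglomerative procedure on invented 2-D data, not SPSS's exact algorithm; the large jump in the final merge distance is exactly the kind of "elbow" the plot described above reveals:

```python
import math

def agglomerative_merge_distances(points):
    """Single-linkage agglomerative clustering on 2-D points.
    Returns the distance at which each successive merge occurs."""
    clusters = [[p] for p in points]
    merge_distances = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest inter-point distance.
        best = (float("inf"), 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] += clusters.pop(j)
        merge_distances.append(d)
    return merge_distances

# Two well-separated groups: the final merge distance jumps sharply,
# suggesting a two-cluster solution (the "elbow").
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(agglomerative_merge_distances(pts))
```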

Partitioning methods: K-means

Hierarchical clustering is typically used when the sample size is less than about 50. When there are more than 50 cases to consider, we use k-means cluster analysis. K-means cluster analysis is used for segmentation, targeting and positioning (STP) in marketing. The method can be used to identify potential markets for an organization: where to launch products so as to achieve maximum profit.

The k-means algorithm follows an entirely different concept than the hierarchical methods discussed before. This algorithm is not based on distance measures such as Euclidean distance or city-block distance, but uses the within-cluster variation as a measure to form homogeneous clusters. The procedure aims at segmenting the data in such a way that the within-cluster variation is minimized.

With hierarchical methods, an object remains in a cluster once it is assigned, but with k-means, cluster affiliations can change in the course of the clustering process. K-means does not build a hierarchy, which is why the approach is labeled non-hierarchical.

Prior to the analysis, we need to decide on the number of clusters. The algorithm then randomly selects a center for each cluster. After this, Euclidean distances are computed from the cluster centers to every single object, and each object is assigned to the cluster center with the shortest distance to it.

Based on this initial partition, each cluster's centroid is recomputed by taking the mean value of the objects contained in the cluster.

In the fourth step, the distances from each object to the newly located cluster centers are computed, and objects are again assigned to a certain cluster on the basis of their minimum distance to the cluster centers. Since the cluster centers' positions have changed with respect to the initial situation in the first step, this can lead to a different cluster solution.
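The steps above can be sketched as a minimal k-means implementation (plain Python, invented data; real packages such as SPSS use more refined initializations and stopping rules):

```python
import math
import random

def k_means(points, k, iters=20, seed=0):
    """Minimal k-means sketch following the steps described above:
    pick random initial centers, assign each object to the nearest
    center, recompute centroids as means, and repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[idx].append(p)
        # Update step: centroid = mean of the objects in the cluster.
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = k_means(pts, k=2)
print(sorted(centers))
```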

Reference:

http://www.yorku.ca/ptryfos/f1500.pdf (Last accessed on March 20, 2014)

http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf (Last accessed on March 20, 2014)

http://en.wikipedia.org/wiki/Cluster_analysis (Last accessed on March 20, 2014)

E. Mooi and M. Sarstedt, A Concise Guide to Market Research (2011)

Submitted by

Samkit Jain_Group2_SectionB (12PGP112)


When I first downloaded the SPSS software and did the first few operations as instructed by the professor, the first thing that struck me was: "This thing is too technical for me." Most of my friends here would agree that SPSS is a bit complicated. It is more suitable for statisticians than for students and business professionals who have no expertise in statistics. Moreover, SPSS contains many tools that have low applicability and relevance for businesses that want to do a basic analysis of their data.

Identifying these pain points, two young analysts, Greg Laughlin and John Le, developed a software product called Statwing to make life easier for non-stats majors like us. Statwing, popular for its simplicity and ease of use, makes data analysis intuitive and beautiful. Statistical best practices are encoded into the software so that non-experts can get the same insight into their data as a statistician. Here are a few reasons why Statwing is more user-friendly than statistical analysis tools like SPSS and R:

**Simplified process:** Statwing relies on a rules engine that automatically considers the type of data uploaded and the types of variables (a maximum of two right now) a user wants to relate to each other. It is designed to make it easier to ask questions about data. The user does not need to know the type of variables being used and the kind of analysis that will be apt for the said variables.

**Faster:** Statwing automates statistical analysis so you can understand your data deeply in just a few clicks—regardless of whether it’s kilobytes or gigabytes. Unlike SPSS where data loading takes time, data in Statwing can be copied or uploaded in seconds. Analysts and market researchers say they analyze data more deeply and five times more quickly in Statwing than in Excel or SPSS.

**Instant Visualization:** Statwing automatically visualizes every analysis. It understands the user's data structure, so it automatically creates histograms, scatterplots, heatmaps, and bar charts that the user can easily export to Excel or PowerPoint.

**Accounting for outliers:** Unlike traditional software, Statwing accounts for data issues like outliers, so the user can always be confident in their analyses. When Statwing notices outliers or other statistical issues, it runs statistical tests that take them into account (for example, running Spearman's Rank Correlation instead of Pearson's Correlation wherever applicable).

**Easy interpretation of results:** In SPSS the results are shown in the form of tables and the user has to interpret the results according to his/her understanding. On the other hand, Statwing interprets the result of the analysis in plain English words, thus making it easy to understand the result.

**Example:** Suppose the user wants to analyze the relationship between a customer's gender and their satisfaction with the product.

In SPSS:-

- Load the data… [wait 2 seconds]
- Analyze the variables and think, "Gender is binary (male versus female) and satisfaction is continuous, so the correct statistical test is an independent samples t-test"
- Run the test and get the result in the form of a table
- Interpret the result: the p-value is below 0.05, which indicates a statistically significant difference between men's and women's satisfaction

In Statwing:-

- Paste or upload the data to Statwing
- Select Gender, then Satisfaction, and then choose to relate the two
- Statwing understands the structure of these variables, so it runs a t-test automatically
- Read the headline for a quick summary of the statistical test's results
- Look at the visualization to see how women’s satisfaction scores tend to be lower than those of men
- Go to the Advanced Tab for the full test results, as well as standard deviations, confidence intervals, and more

Despite these advantages, Statwing is suitable only for basic analysis. Many tools used for higher-level analysis, such as two-way ANOVA, regression analysis, and time-series analysis, do not work in Statwing. However, if these tools are developed, Statwing could give tough competition to SPSS and may even lead to its phasing out.
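As an aside, the independent-samples t statistic that both tools compute in the example above can be sketched in a few lines of Python (the satisfaction scores below are invented, and only the t statistic is computed, not the p-value):

```python
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Independent-samples t statistic with pooled variance,
    the test described in the walkthrough above."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = (pooled * (1 / na + 1 / nb)) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical satisfaction scores (1-7 scale) for men and women.
men = [6, 5, 7, 6, 5, 6]
women = [4, 5, 3, 4, 4, 5]
print(round(pooled_t_statistic(men, women), 2))
```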

**Sampling Error**

The most common error is sampling error, which exists because the sample selected for the survey might not be representative of the population.

Using statistics, it can be shown that, for a given confidence level, increasing the sample size reduces the sampling error. This is true for random or probabilistic sampling methods. How large the sample should be depends on the trade-off between precision of estimation and cost.
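For a sense of that trade-off, here is a sketch of the textbook margin-of-error formula for a proportion under simple random sampling (assuming a 95% confidence level, z of about 1.96, and the conservative p = 0.5; the function names are made up):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Sampling error (margin of error) for a proportion under
    simple random sampling at roughly 95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(e, p=0.5, z=1.96):
    """Smallest n whose margin of error does not exceed e."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

print(round(margin_of_error(400), 3))  # larger n gives a smaller error
print(required_sample_size(0.03))      # n needed for a 3-point margin
```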

Non-probabilistic sampling involves subjective selection of the sample. Therefore, a precise estimate of the sampling error in such cases is not statistically feasible.

**Coverage Error**

Both under-coverage and over-coverage can affect the results of a mail survey. Coverage error affects the survey estimates if the characteristics of those not covered in the survey differ from the characteristics of those covered. Having a complete, up-to-date sampling frame reduces the chance of introducing coverage error.

**Non-Response Error**

Non-response error arises when some of the sample members do not respond to the survey questions. A lot of research has been conducted on improving response rates. Some of the variables found to have a positive effect on response rate are the number of contacts (the more, the better), the relevance/salience of the questionnaire topic, government sponsorship (compared to private sponsorship), specificity of the target sample (compared to the general population), incentives, pre-notification, stamped return postage, etc.

Research on incentives has shown that response rates increase only when incentives are provided with the initial mailing, not when the incentives are made contingent on returning a response. Further, no statistically significant difference was found between monetary and non-monetary incentives.

Research has also shown that mixed-mode surveys, which combine mail questionnaires, electronic mail, telephone, and face-to-face interviews in some proportion, can increase the response rate compared to a typical mail survey.

**Measurement Error**

Measurement errors arise from the respondents’ side when they either do not respond to certain questions, or leave open-ended questions unanswered/incomplete, or fail to follow instructions. Measurement errors signify the difference between the recorded answers and the true answers.

Mail surveys have some advantage when it comes to measurement errors due to the absence of an interviewer, which not only lessens the likelihood of respondents providing socially desirable responses but also removes interviewer bias.

Some ways to reduce measurement errors are:

- Pre-testing the questionnaire
- Making the questionnaire more respondent-friendly
- Making questionnaire instructions clear and simple
- Streamlining the questionnaire design in case of mixed-mode surveys to reduce deviance among the different modes

Section A _Group 3_Sameer Pandey_13PGP047

While thinking of a topic for our blog, I was confused about which one to choose. I also happened to be solving a problem in SPSS, and suddenly it struck me: why not write about the IBM SPSS software itself! I thought it could be an interesting topic. SPSS was originally called the Statistical Package for the Social Sciences. It is a Windows-based program that can be used to perform data entry and analysis and to create tables, graphs and pictograms. SPSS is capable of handling huge amounts of data and impressively performing statistical analysis on it. SPSS is updated often with new versions; the one I am using is the SPSS 15.0 evaluation version. It was developed by Norman H. Nie and C. Hadlai Hull in 1968 (the product was later acquired by IBM). It is compatible with Windows, Linux, UNIX and Mac operating systems, and it is among the most widely used programs for statistical analysis in the social sciences.

Before learning about SPSS, I wondered whether spreadsheet applications like Microsoft Excel or OpenOffice Calc are better than SPSS, because spreadsheets are also widely used for statistical analysis. But after doing some secondary research, I was impressed by the marvels of this tool; the learning came as a value addition for me. SPSS looks a lot like a typical spreadsheet application: when we open it, we see the familiar tabular grid and we enter values in cells. Spreadsheets, for their part, are capable of a lot of the things SPSS is good at, like generating graphs and statistics from a data set. The differences can be summed up in the following points:

**Flexibility**: Spreadsheets are designed to be very flexible and broadly applicable to many different tasks, while SPSS is specifically designed for statistical processing of large amounts of data at an enterprise level. For example, unlike a spreadsheet, SPSS has the concepts of "case" and "variable" built in. The rows in SPSS always represent cases, for example survey responses (typically to a questionnaire) or experimental subjects, and the columns always represent variables observed from those cases, like the specific values given by the survey respondent or measurements from the experimental subject. Owing to this case/variable arrangement, when a calculation is performed over a set of data, the result does not get inserted into another cell on the table, like it would in a typical spreadsheet, but appears in a separate window. This is particularly advantageous when dealing with large sets of data, since it keeps calculated statistics and graphs separate from the raw data but still easily accessible. A spreadsheet like MS Excel, on the other hand, has far more functions than SPSS and gives more flexibility in how you use them.

**Ease of use**: It is also much more convenient to perform statistical tests in SPSS, even though many are possible using typical spreadsheets. For example, to perform a one-sample t-test with Excel, we have to calculate the t value ourselves for the sample and use the "T.DIST" function to return the significance, while also selecting a cell for the results and labelling it in another cell. To perform the same test in SPSS, we select a variable and supply the value to compare with our sample and, when we click "OK," SPSS generates a table with t, the degrees of freedom, the significance, and a confidence interval neatly calculated. **SPSS also makes it easy to understand statistical results.** It has added a lot of extra help files and tutorials that explain how we can or should interpret the statistical jargon that the software spits out. Spreadsheets don't provide this.

**Modernity**: Probably the most significant advantage of using SPSS is that it was designed with modern data collection methods in mind. A lot of data that's collected, especially survey data, is numerically coded before it's electronically stored. So, for example, a response of "strongly agree" might become a 6; a level of education such as "completed high school" or "some college" might become a 10 or 11. SPSS makes it possible to define the variable so that the coded values are connected to their original meanings. Partly for this reason, various surveys and polls (including many that U of I students and faculty can access through Roper iPoll, ICPSR, and other sets provided through the U of I library) make their raw data available in SPSS's native ".sav" format.

The differences mentioned above are the major ones. There are some disappointments with SPSS too. For example, SPSS doesn't update the values of cells automatically when changes are made elsewhere in our data, even when a compute command has been set up. Also, if we delete a variable we cannot restore it. IBM SPSS is expensive, sometimes ridiculously so, and even when we do buy it we are really only leasing it; its license is definitely not user-friendly. There are also often compatibility issues with prior versions.

But despite these minor issues, I really like working in SPSS compared to a spreadsheet application. Summing up in one line: ease of use and in-depth data analysis are the features that really impressed me.

Other Members: Amrit Jain, Ankit Saxena, Gugan N, Jyoti Kanwatia, Nitin Sonkar, Sonam Supriya, Sumit Ranjan, Yogesh Sham Gupta

None of us would ever have dreamt that he or she could be the Truman Burbank (The Truman Show, 1998) of someone else's show. The truth of the matter is that millions of people are constantly monitored for their behaviour patterns and preferences. The observations can be direct or indirect, disguised or undisguised, structured or unstructured, human or mechanical, and so on. The role of observation in modern-day marketing cannot be overstated; in fact, you would be surprised by the kinds of efforts and innovations companies are putting in to gain competitive advantage. Following are some of the innovative methods of observation:

**The Mystery Shopper/Diner** – This observation can be categorised as indirect, unstructured, human, archived. The term "mystery shopping" was coined in the 1940s by Wilmark, the first research firm to apply the concept beyond integrity applications. A mystery diner sits near the actual customers and observes their dining experience as well as the service provided to them. Mystery shoppers/diners not only give valuable feedback regarding the service provided but also play a key role in inspecting and evaluating a variety of activities, including company operations, employee integrity, store merchandising and product quality. Today, over 100 companies belong to the Mystery Shopping Providers Association, and the industry is estimated at over $1.5 billion annually.

**Trend Spotting** – This observation can be categorised as indirect, unstructured, human, archived. Extending the practice of observation beyond what is clearly or scientifically seen, some researchers have tried to catalog behaviours that might signal the beginning of important trends. This method is called trendspotting, and it has always been under the scanner because of the subjectivity and randomness of the observations involved. Ad agency giant DDB Worldwide invites observers all over the world, plus other targeted groups such as members of youth organizations, to submit their observations to managers appointed as Signbankers. The Signbankers update the corporate database on a regular basis, and the observations are properly segregated. To get an idea of the kind of observations that might be included, think of anything as random as the time people spend at the metro station watching an ad close to the ticket counter, or the importance of billboard colours and their impact on sports-channel viewers.

**Mobile Trackers for Vehicles** – This observation can be categorised as direct, structured, mechanical, as is. Electronic devices are designed so that they can pinpoint the location of any equipped vehicle through its Global Positioning System. These devices help companies answer important questions: Are drivers speeding? Are device readings accurate? Are drivers using vehicles during off hours? The data can also help fleet managers with their duties.

**Neuromarketing** – This observation can be categorised as indirect, structured, mechanical, archived. It is a high-tech research methodology that uses a technology called quantified electroencephalography (QEEG). Subjects wear light, portable EEG equipment that records brain activity, and software displays activity levels in different parts of the brain. Hewlett-Packard used this technique while developing advertisements for its digital photography products.

**Scanner-based Consumer Panel** – This observation can be categorised as direct, structured, mechanical, as is. Each household is assigned a bar-coded card, like a frequent-shopper card, which members present to the clerk at the register. The household's code number is coupled with the purchase information recorded by the scanner. In addition, background information about the household is obtained through answers to demographic and psychographic questions.

A number of ethical issues have been raised against observation methods that target the privacy of consumers. Disguised observations do not seek the approval of the consumers being tracked and go about collecting data in whatever field it might belong to. A set of questions might help resolve the dilemma marketers face while collecting information: Is the behaviour being observed commonly performed in public? Is it performed in anonymity? Has the person agreed to be observed?


As a project matures, one realizes how important it is to have a research design. A research design guides the team and the company's decision makers. It lays out the methods and procedures to employ as the information is collected.

To develop a research design, you will rely on three types of studies: exploratory studies, descriptive studies, and causal studies.

Each depends on different information that will help you. No matter how large or small your project, conducting surveys and establishing a research design is vital to your success. If you don’t know where your project is going, you won’t know if it’s succeeding.

**EXPLORATORY STUDIES**

First, you need to do an exploratory study. This is the problem finding phase. An exploratory study forces you to focus the scope of your project. It helps you anticipate the problems and variables that might arise in your project.

Perhaps the most common problem is size. The project must be kept focused. If the scope of a project is too big, it will not get off the ground. Too much information is overwhelming. An important objective of an exploratory study is keeping your project manageable. The larger your project’s scope, the more difficult it is to control. This process will help you weed out problems.

In the case of developing an app, for example, an exploratory study would help your research team take an abstract idea and develop it into a focused plan. The specific app would be market-driven. This process takes legwork, but the results are worth the effort.

Exploratory studies generally encompass three distinct methods:

1. Literature search

A literature search means going to secondary sources of information: the internet, the public library, company or government records. These sources are usually easy and inexpensive to access.

2. Expert interviews

After a literature search, your team will have a useful background for the project: they will know what questions to ask and how to set up the project. The next step is to interview experts. These experts might include company executives or consumers; the team would also talk to people who have used similar products and seek out professionals whose careers relate to the research project.

3. Case studies

Every research project will have pitfalls. Therefore, case studies become an important tool because they allow us to examine another business’s managerial problems and solutions. If another study deals with similar issues, we can avoid these pitfalls by learning from its mistakes.

**DESCRIPTIVE STUDIES**

Who are you selling to? An exploratory study helped you establish what you are selling, but the descriptive study will help you find your market and understand your customer. Since you will not be able to sell to everyone, a descriptive study is necessary to focus your project and resources.

There are different kinds of studies you can implement to better understand your market. Consider the following descriptive studies:

- Market potential: description of the number of potential customers of a product.
- Market-share: identification of the share of the market received by your product, company and your competitors.
- Sales analysis: description of sales by territory, type of account, size or model of product.
- Product research: identification and comparison of functional features and specifications of competitive products.
- Promotion research: description of the demographic characteristics of the audience being reached by the current advertising program.
- Distribution research: determining the number and location of retailers handling the company’s products. These are supplied by wholesalers and distributed by the company.
- Pricing research: identifying competitors’ prices by geographic area.

**CAUSAL STUDIES**

Even though descriptive studies describe and predict relationships, results, and events, you may want to know the reasons behind them. If you can discover the reasons behind your results, then you can assemble your own predictive models.

Cause and effect have to be related. Before a cause and effect can be established, a logical implication (or theoretical justification) has to be found.

There are three types of evidence that can be used to establish causal relationships:

**Associative variation**

Associative variation involves taking two variables and seeing how often they are associated. The more they show up in studies, the more likely they are related. Associative variation can be broken down into two distinctions: association by presence and association by change.

Association by presence measures how closely the presence of one variable is associated with the presence of another; association by change measures how closely changes in one variable are associated with changes in another.

**Sequence of events**

In order to establish a cause/effect relationship, you must first establish that the causal factor occurred first. For example, in order for salesperson training to result in increased sales, the training must have taken place prior to the sales increase. If the cause does not precede the effect, then there is no causal relationship.

**Absence of other possible causal factors**

You must also demonstrate that other factors did not cause the effect. Once you have shown this, you can logically conclude that the remaining factor is the cause. For example, if we can control all other factors affecting sales of the item, then we can conclude that the increase in sales comes from the training.


**Variable Name**

Each variable name must be unique, without duplication. In classic versions of SPSS the name must be 8 characters or fewer (newer versions allow up to 64 bytes), and the first character must be a letter or one of the characters @, #, or $. Subsequent characters can be any combination of letters, numbers, non-punctuation characters, and a period (.). So while we can use "Chelsea_Football", we can use neither "Chelsea-Football" nor "Chelsea Football" as a variable name: SPSS misinterprets "-" as a subtraction sign, and a space confuses the software as to how many variables are being named. Since variable names tend to be cryptic, the Label field allows us to specify a longer name that gives more clarity about the variable; this longer label will appear on any charts or graphs produced.
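The naming rules above can be illustrated with a small validator. This is a sketch of the classic 8-character rules as described here, not the exact checks SPSS performs, and the function name and examples are made up:

```python
def is_valid_spss_name(name):
    """Check a variable name against the classic rules described above:
    at most 8 characters, first character a letter or @, #, $,
    later characters letters, digits, @, #, $, _ or a period."""
    if not 1 <= len(name) <= 8:
        return False
    first, rest = name[0], name[1:]
    if not (first.isalpha() or first in "@#$"):
        return False
    return all(c.isalnum() or c in "@#$_." for c in rest)

print(is_valid_spss_name("Chelsea1"))   # True
print(is_valid_spss_name("Chelsea-1"))  # False: "-" reads as subtraction
print(is_valid_spss_name("Chel sea"))   # False: space splits the name
```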

**Variable Type**

Two types of variables can be used: numbers and strings. Numeric variables may only have numbers assigned, while string variables may contain both numbers and letters. The catch is that although a string variable can hold a number, numeric operations such as mean, variance and standard deviation cannot be performed on it. By default, all variables are assumed to be numeric. Other variable types that can be used are comma, dot, scientific notation, date, dollar, custom currency and restricted numeric. In all cases, one will need to specify the variable width; this can be done in the dialogue box, or in the subsequent width and decimal columns.

**Values**

Values allow connecting the values (numbered codes) of the coding scheme to the original categories. For example, it is here that, for a variable Sex, males can be coded with a 0 and females with a 1. Codes are added one after the other in a sequence.

**Missing**

Missing refers to a missing-data code: a "special" number that SPSS will treat as a unique code to identify places where there is no data. SPSS will avoid including it as a "real" number when statistics are computed.


**Columns**

Columns refers to how many columns wide you would like the variable to be presented in the "data view." Normally this would be at least 8 so that the variable name can appear easily.

**Measurement of the Variable **

The SPSS software cannot differentiate between the ratio and interval levels of measurement; these two measurements are grouped together as scale. Nominal and ordinal, however, are differentiated.

**Nominal variables** are also known as categorical variables. The different codes here refer to different categories that bear no mathematical relation to each other (e.g., Sex is a categorical variable because the two groups are simply different categories).

**Ordinal variables** refer to those variables where the numerical codes reflect an ordering of some sort, but where the distance between the categories can vary. For example, "Job Grade" is an ordinal variable: the grades are ordered (1) Grade 6, (2) Grade 7, and (3) Grade 8, but the distances between the codes are not necessarily equal.

**Scale variables** include interval and ratio levels of measurement, where any numeric codes have meaning in terms of number relations that go beyond category and order. If it makes sense to compute a mean for the variable, then it probably is a scale variable.

There are different approaches to cluster analysis: **hierarchical methods, partitioning methods (more precisely, k-means), and two-step clustering**, which is largely a combination of the first two methods.

Steps in cluster analysis:

**Hierarchical Method:**

Hierarchical clustering procedures are characterized by the tree-like structure established in the course of the analysis. Most hierarchical techniques fall into a category called agglomerative clustering. In this category, clusters are consecutively formed from objects. Initially, this type of procedure starts with each object representing an individual cluster. These clusters are then sequentially merged according to their similarity. First, the two most similar clusters (i.e., those with the smallest distance between them) are merged to form a new cluster at the bottom of the hierarchy. In the next step, another pair of clusters is merged and linked to a higher level of the hierarchy, and so on. This allows a hierarchy of clusters to be established from the bottom up. In figure (left-hand side), agglomerative clustering assigns additional objects to clusters as the cluster size increases.

A cluster hierarchy can also be generated top-down. In this divisive clustering, all objects are initially merged into a single cluster, which is then gradually split up. The figure (right-hand side) illustrates this concept. As we can see, in both agglomerative and divisive clustering, a cluster on a higher level of the hierarchy always encompasses all clusters from a lower level. This means that if an object is assigned to a certain cluster, there is no possibility of reassigning this object to another cluster.

**Select a Measure of Similarity or Dissimilarity**

There are various measures to express (dis)similarity between pairs of objects. A straightforward way to assess two objects’ proximity is by drawing a straight line between them. This type of distance is also referred to as **Euclidean distance** (or straight-line distance) and is the most commonly used type when it comes to analysing ratio or interval-scaled data.

There are also alternative distance measures: The **city-block distance** uses the sum of the variables’ absolute differences. This is often called the Manhattan metric as it is akin to the walking distance between two points in a city like New York’s Manhattan district, where the distance equals the number of blocks in the directions North-South and East-West.
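The two distance measures described above can be sketched in a few lines of Python. The two points used here are made up for illustration:

```python
import math

def euclidean(p, q):
    # straight-line distance: square root of summed squared differences
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def city_block(p, q):
    # Manhattan metric: sum of the variables' absolute differences
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

i, j = (1.0, 2.0), (4.0, 6.0)
print(euclidean(i, j))   # 5.0 (a 3-4-5 right triangle)
print(city_block(i, j))  # 7.0 (3 blocks one way, 4 the other)
```

Note that the city-block distance is never smaller than the Euclidean distance for the same pair of points, since walking along the blocks can only be longer than the straight line.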

There are other distance measures such as the Angular, Canberra or Mahalanobis distance. In many situations, the latter is desirable as it compensates for collinearity between the clustering variables.

The distance measures presented can be used for metric and – in general – ordinally scaled data; applying them to nominal or binary data is meaningless. For such data, one should instead select a similarity measure expressing the degree to which the variables’ values share the same category. These so-called matching coefficients can take different forms, but they all rely on the same allocation scheme shown in the table.

Allocation scheme for matching coefficients.

Two types of matching coefficients, which do not equate the joint absence of a characteristic with similarity and are of more value in segmentation studies, are the Jaccard (JC) and the Russel and Rao (RR) coefficients. They are defined as follows:

These matching coefficients are – just like the distance measures – used to determine a cluster solution. There are many other matching coefficients such as Yule’s Q, Kulczynski or Ochiai.
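A small sketch of these two coefficients, assuming the standard allocation scheme for binary vectors (a = both 1, b = 1/0, c = 0/1, d = both 0); the two example vectors are invented:

```python
def match_counts(x, y):
    # 2x2 allocation scheme for two binary vectors
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both present
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # both absent
    return a, b, c, d

def jaccard(x, y):
    # joint absences (d) are ignored entirely
    a, b, c, _ = match_counts(x, y)
    return a / (a + b + c)

def russel_rao(x, y):
    # joint absences count in the denominator but never as similarity
    a, b, c, d = match_counts(x, y)
    return a / (a + b + c + d)

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(jaccard(x, y))     # 2/4 = 0.5
print(russel_rao(x, y))  # 2/5 = 0.4
```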

For nominal variables with more than two categories, one should always convert the categorical variable into a set of binary variables in order to use matching coefficients. With ordinal data, one should always use distance measures such as Euclidean distance. Even though using matching coefficients would be feasible – and, from a strictly statistical standpoint, even more appropriate – one would disregard the information contained in the ordering of the categories. In the end, a respondent who indicates being very loyal to a brand is closer to someone who is somewhat loyal than to a respondent who is not loyal at all. Furthermore, distance measures best represent the concept of proximity, which is fundamental to cluster analysis.
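The conversion of a multi-category nominal variable into binary indicator variables can be sketched as follows (the "color" variable and its values are illustrative only):

```python
def to_dummies(values, categories=None):
    # turn one nominal variable into one 0/1 indicator per category
    if categories is None:
        categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

color = ["red", "blue", "red", "green"]
print(to_dummies(color))
# {'blue': [0, 1, 0, 0], 'green': [0, 0, 0, 1], 'red': [1, 0, 1, 0]}
```

The resulting binary columns can then be compared with matching coefficients such as JC or RR.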

**Select a Clustering Algorithm**

After having chosen the distance or similarity measure, we need to decide which clustering algorithm to apply. There are several agglomerative procedures and they can be distinguished by the way they define the distance from a newly formed cluster to a certain object, or to other clusters in the solution. The most popular agglomerative clustering procedures include the following:

1. *Single linkage (nearest neighbour):* The distance between two clusters corresponds to the shortest distance between any two members in the two clusters.

2. *Complete linkage (furthest neighbour):* The oppositional approach to single linkage assumes that the distance between two clusters is based on the longest distance between any two members in the two clusters.

3. *Average linkage:* The distance between two clusters is defined as the average distance between all pairs of the two clusters’ members.

4. *Centroid:* In this approach, the geometric centre (centroid) of each cluster is computed first. The distance between the two clusters equals the distance between the two centroids.
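The agglomerative idea behind these procedures can be sketched in plain Python; this minimal single-linkage (nearest neighbour) version uses made-up two-dimensional points, not any data from the text:

```python
import math

def single_linkage(points, n_clusters):
    # agglomerative clustering with single linkage: repeatedly merge the
    # two clusters whose closest members are nearest to each other
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: shortest distance between any two members
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(single_linkage(pts, 2))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

Replacing the `min` with `max` would give complete linkage, and averaging over all pairs would give average linkage.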

A common way to visualize the cluster analysis’s progress is by drawing a dendrogram, which displays the distance level at which there was a combination of objects and clusters.

We read the dendrogram from left to right to see at which distance objects have been combined.

Dendrogram

**Decide on the Number of Clusters**

An important question is how to decide on the number of clusters to retain from the data. Unfortunately, hierarchical methods provide only very limited guidance for making this decision. The only meaningful indicator relates to the distances at which the objects are combined. We can seek a solution in which an additional combination of clusters or objects would occur at a greatly increased distance. This raises the issue of what a great distance is, of course.

One potential way to solve this problem is to plot the number of clusters on the x-axis (starting with the one-cluster solution at the very left) against the distance at which objects or clusters are combined on the y-axis. Using this plot, we then search for the distinctive break (elbow).
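The elbow search can be sketched numerically: given the sequence of distances at which successive merges occurred, we look for the step with the sharpest jump and cut the hierarchy just before it. The merge distances below are hypothetical:

```python
# hypothetical distances at which successive merges occurred;
# the last merge produces the one-cluster solution
merge_distances = [0.5, 0.7, 0.9, 1.1, 4.8, 5.2]

n = len(merge_distances) + 1  # number of objects (6 merges -> 7 objects)

# increase in merging distance from one step to the next
jumps = [merge_distances[k + 1] - merge_distances[k]
         for k in range(len(merge_distances) - 1)]

elbow = jumps.index(max(jumps))   # last merge before the big jump
retained = n - (elbow + 1)        # clusters left after that merge
print(retained)  # 3
```

Here the distance jumps from 1.1 to 4.8, so the three-cluster solution is the natural cut.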

Alternatively, we can make use of the dendrogram, which essentially carries the same information. SPSS provides a dendrogram; however, this differs slightly from the one presented in the figure. Specifically, SPSS rescales the distances to a range of 0–25; that is, the last merging step to a one-cluster solution takes place at a (rescaled) distance of 25. The rescaling often lengthens the merging steps, thus making breaks occurring at a greatly increased distance level more obvious.

However, this distance-based decision rule does not work well in all cases; it is often difficult to identify where the break actually occurs.

Overall, the data can often only provide rough guidance regarding the number of clusters one should select; consequently, one should rather revert to practical considerations. Occasionally, we might have a priori knowledge, or a theory on which we can base our choice. However, first and foremost, one should ensure that results are interpretable and meaningful. Not only must the number of clusters be small enough to ensure manageability, but each segment should also be large enough to warrant strategic attention.

References:

1. http://www.yorku.ca/ptryfos/f1500.pdf

2. Business Research Methods by Cooper, Schindler, Sharma, 11th Edition.

Section B Group 6_Vaneet Bhatia (13FPM008)

Other Member:

- Apurva Ramteke(13PGP068)
- Chandan Parsad(13FPM002)
- Komal Suchak (13PGP086)
- Rohan Kr. Jha (13FPM004)
- Silpa Bahera (13PGP107)
- Sushil Kumar (13FPM010)
- Vivek Roy (12FPM005)

Of all the things that were discussed in class regarding the creation of a questionnaire, one stands out: the questionnaire cannot be seen in isolation from its purpose. Many factors affect the reliability and validity of the inference; of the many references I came across, the following two caught my eye. Though seemingly obvious, the outcomes can be startling.


**Barrel#1: Time Lapse**:

Some surveys are conducted over a long period of time, mostly in multiple iterations with considerable time between the first and the last survey. In keeping with the ongoing drive for incremental improvement across industry, such surveys are also updated over time. The lesson that emerged from one such update is remarkable.

Courtesy: http://sites.tufts.edu/inclusivecommerceblog/files/2013/05/wpost_tolerance_map_20130516.jpg

Above is a visual representation of religious tolerance in countries all over the world, with red marking the most intolerant countries and vice versa. So what could be wrong with this survey, if you have not spotted it yet? Though I am still trying to find out what went wrong for India, a Bangladeshi national was apparently able to decipher the cause of the ‘Red Bangladesh’.

What had happened was this: to gauge religious tolerance, the survey in the 1990s used 0 for tolerance and 1 for intolerance. In the process of ‘improving’ the survey, the value references were swapped for the Bangladesh survey (1 for tolerance and 0 for intolerance), leading to the erroneous interpretation.

**Barrel #2****: Mr. Who!!**

Who conducts a survey has a direct bearing on how valid the responses are, and on how fierce the backlash to an erroneous survey will be for both the conductor and the organizer.

Case in point: a survey the New York Times wanted to conduct on the prestigious Yale University campus. The responsibility was given to the Yale College Council (YCC), who in their own right decided to conduct the survey among themselves (28 in all). The conduct was widely criticized on the part of both parties: the YCC for not taking a random sample, and the NYT for publishing findings drawn from the data the YCC provided. The question here is not about sampling, but about the negligence on the part of the conductors and the organisers.

The next, and most frequently experienced, surveyor-based issue arises when the surveyor tries to influence our ratings, most probably to improve her own overall ratings. For an average performance, demands are made for ‘rockstar ratings’, which speaks volumes in itself. As it happened to a witness:

“*Some years ago when I would take my car to the dealer for service. The same conversation ensued each time I returned to pick up the car. The clerk whom I paid for the service would shove a paper survey at me and say “Please complete our service satisfaction survey” suggesting with her body language that my keys would be held hostage until I finished the thing. “But,” I would protest, “How do I know whether I am satisfied or not until I have driven the car for a while?” Impervious to my logic, she would shrug and tell me that she needed it done now because that was their process, and if I didn’t know how satisfied I was, I should guess.*”

As we saw above, a good questionnaire, apart from being well designed, should also be consistent and conducted by responsible and accountable people. Something very important, but seemingly obvious enough not to be noticed before the trigger has been pulled!



We learnt in the classroom discussion what the Chi-square test is and the step-by-step process of implementing it. But as a marketer, knowing how exactly this test can be applied in a real business scenario is much more helpful. Hence, I would like to discuss that in this article; I hope it will be interesting for all the visitors of this blog.

Companies are especially interested in knowing consumer behavior towards their products. For example: are all colors of refrigerators equally preferred among consumers? Is there any association between the income of the family and the brand preference when buying refrigerators? The day-to-day life of a marketing job involves many such scenarios that call for effective decisions.

As we all know, the Chi-square test is used to find out whether there are differences between categorical variables (color category: Red, Blue, Green, Orange; income category: Lower, Middle, Upper; etc.). It is broadly used in two different scenarios in the marketing discipline:

**1. Goodness of Fit Test**

This is mainly used to find out how closely expected and observed frequencies match. Only a single variable can be considered here, as in the above example of “color of refrigerator”.

Let us consider a marketer who wants to check the preference of 200 consumers among four colors (Red, White, Blue and Black) of refrigerators, stating the following hypotheses:

**Null Hypothesis**: All colors of refrigerator are equally preferred.

**Alternative Hypothesis**: All colors of refrigerator are not equally preferred.

The **observed frequencies**, which might come from a survey or questionnaire, are compared against the **expected frequencies**, which under the null hypothesis are equal (50 for each color, given 200 respondents).

After all calculations using the Chi-square formula, if the computed value is **greater than** the critical value at the 5% level of significance and 3 degrees of freedom, i.e. (n − 1) where n = 4 colors, then the null hypothesis is rejected. In that case we can infer that the colors of refrigerators are not all equally preferred.
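The goodness-of-fit calculation can be sketched as follows; the observed counts are hypothetical, but the expected counts (50 each) and the 5% critical value for 3 degrees of freedom (7.815) follow the scenario in the text:

```python
# hypothetical observed color preferences of 200 consumers
observed = {"Red": 70, "White": 50, "Blue": 45, "Black": 35}
expected = 200 / 4   # under H0 all colors are equally preferred: 50 each

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - expected) ** 2 / expected for o in observed.values())
print(round(chi_sq, 2))  # 13.0

critical = 7.815  # Chi-square critical value, alpha = 0.05, df = 3
print(chi_sq > critical)  # True -> reject H0: preferences are not equal
```

With these (made-up) counts, 13.0 > 7.815, so the marketer would conclude the colors are not equally preferred.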

**2. Independence Test**

This is mainly used to find out whether there is a relation or association between two categorical variables, that is, whether they are independent or dependent. The two variables mentioned in the above example are “Income group of the family” (Lower, Middle, Upper) and “Brand preference” (Samsung, LG, Whirlpool, Godrej).

**Null Hypothesis**: The two variables are independent.

**Alternative Hypothesis**: The two variables are dependent.

Again from a sample of 200 consumers, a contingency table can be prepared with 12 cells in total (3 income groups × 4 brands).

After all calculations using the Chi-square formula, if the computed value is **greater than** the critical value at the 5% level of significance and 6 degrees of freedom, i.e. (n1 − 1) × (n2 − 1) where n1 = 3 and n2 = 4, then the null hypothesis is rejected. In that case we can infer that the family's income group and brand preference are dependent.
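The independence test can be sketched the same way; the cell counts below are invented to fill the 3 × 4 table, while the expected count for each cell uses the standard rule (row total × column total / grand total) and the 5% critical value for 6 degrees of freedom is 12.592:

```python
# hypothetical 3x4 contingency table: income group x brand preference
observed = {
    "Lower":  {"Samsung": 15, "LG": 20, "Whirlpool": 10, "Godrej": 25},
    "Middle": {"Samsung": 25, "LG": 20, "Whirlpool": 15, "Godrej": 10},
    "Upper":  {"Samsung": 20, "LG": 10, "Whirlpool": 20, "Godrej": 10},
}

row_totals = {r: sum(cells.values()) for r, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for brand, n in cells.items():
        col_totals[brand] = col_totals.get(brand, 0) + n
grand = sum(row_totals.values())  # 200 respondents

# expected cell count = (row total * column total) / grand total
chi_sq = sum(
    (observed[r][b] - row_totals[r] * col_totals[b] / grand) ** 2
    / (row_totals[r] * col_totals[b] / grand)
    for r in observed for b in col_totals
)

critical = 12.592  # alpha = 0.05, df = (3 - 1) * (4 - 1) = 6
print(chi_sq > critical)  # True -> reject H0: the variables are dependent
```

With these made-up counts the statistic is about 18.8, well above 12.592, so income group and brand preference would be judged dependent.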

In this way, depending on the research objective, a marketing manager can apply the Chi-square test in product-related decision making.

**References:**

http://www.slideshare.net/parth241989/chi-square-test-16093013

http://davidmlane.com/hyperstat/viswanathan/chi_square_marketing.html

http://www.polarismr.com/Portals/58820/research-lifeline/chi-square-test.htm

Suresh Neela

13PGP035
