Wednesday, August 24, 2011

Correspondence Analysis

A correspondence table is any two-way table whose cells contain some measurement of correspondence between the rows and the columns. The measure of correspondence can be any indication of the similarity, affinity, confusion, association, or interaction between the row and column variables. A very common type of correspondence table is a crosstabulation, where the cells contain frequency counts.

Such tables can be obtained easily with the Crosstabs procedure. However, a crosstabulation does not always provide a clear picture of the nature of the relationship between the two variables. This is particularly true if the variables of interest are nominal (with no inherent order or rank) and contain numerous categories. Crosstabulation may tell you that the observed cell frequencies differ significantly from the expected values in a 10x9 crosstabulation of occupation and breakfast cereal, but it may be difficult to discern which occupational groups have similar tastes or what those tastes are.
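
As a rough illustration of what "observed cell frequencies differ from the expected values" means, the sketch below computes expected counts under independence and the usual Pearson chi-square statistic. The 3x3 table of counts is hypothetical (far smaller than the 10x9 example), and the function name is made up for this illustration.

```python
def chi_square_statistic(table):
    """Return (expected counts, Pearson chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, dof


# Hypothetical counts: 3 occupations (rows) by 3 breakfast cereals (columns).
counts = [
    [30, 10, 10],
    [10, 30, 10],
    [10, 10, 30],
]
expected, statistic, dof = chi_square_statistic(counts)
print(f"chi-square = {statistic:.2f} with {dof} degrees of freedom")
```

A large statistic says that some association exists, but, as noted above, it does not say which row and column categories go together.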

Correspondence Analysis allows you to examine the relationship between two nominal variables graphically in a multidimensional space. It computes row and column scores and produces plots based on the scores. Categories that are similar to each other appear close to each other in the plots. In this way, it is easy to see which categories of a variable are similar to each other or which categories of the two variables are related. The Correspondence Analysis procedure also allows you to fit supplementary points into the space defined by the active points.

If the ordering of the categories according to their scores is undesirable or counterintuitive, order restrictions can be imposed by constraining the scores for some categories to be equal. For example, suppose that you expect the variable smoking behavior, with categories none, light, medium, and heavy, to have scores that correspond to this ordering. However, if the analysis orders the categories none, light, heavy, and medium, constraining the scores for heavy and medium to be equal preserves the ordering of the categories in their scores.

The interpretation of correspondence analysis in terms of distances depends on the normalization method used. The Correspondence Analysis procedure can be used to analyze either the differences between categories of a variable or the differences between variables. With the default normalization, it analyzes the differences between the row and column variables. 

The correspondence analysis algorithm is capable of many kinds of analyses. Centering the rows and columns and using chi-square distances corresponds to standard correspondence analysis. However, using alternative centering options combined with Euclidean distances allows for an alternative representation of a matrix in a low-dimensional space.
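
The chi-square distance mentioned above can be sketched in a few lines. This is not the full procedure (it stops short of computing scores or plots); it only shows how two rows are compared through their profiles, with each column weighted by the inverse of its mass. The table and the function name are hypothetical.

```python
import math


def chi_square_distance(table, i, k):
    """Chi-square distance between the profiles of rows i and k."""
    grand_total = sum(sum(row) for row in table)
    # Column masses: each column's share of the grand total.
    col_masses = [sum(col) / grand_total for col in zip(*table)]
    # Row profiles: each row divided by its own total.
    profile_i = [x / sum(table[i]) for x in table[i]]
    profile_k = [x / sum(table[k]) for x in table[k]]
    return math.sqrt(
        sum((a - b) ** 2 / m for a, b, m in zip(profile_i, profile_k, col_masses))
    )


# Hypothetical counts; row 1 has the same profile as row 0 with twice the respondents.
counts = [
    [20, 10, 10],
    [40, 20, 20],
    [10, 10, 20],
]
d_same = chi_square_distance(counts, 0, 1)
d_diff = chi_square_distance(counts, 0, 2)
print(f"rows 0 and 1: {d_same:.3f}")
print(f"rows 0 and 2: {d_diff:.3f}")
```

Rows with identical profiles are at distance zero regardless of their totals, which is why categories with similar patterns end up close together in the plots.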

Wednesday, August 17, 2011

Linear Mixed Models

The Linear Mixed Models procedure expands the general linear model so that the error terms and random effects are permitted to exhibit correlated and non-constant variability. The linear mixed model, therefore, provides the flexibility to model not only the mean of a response variable, but its covariance structure as well. In other words, it is an extension of the general linear model, in which factors and covariates are assumed to have a linear relationship to the dependent variable.
Factors: Categorical predictors should be selected as factors in the model. Each level of a factor can have a different linear effect on the value of the dependent variable.
Fixed-effects factors are generally thought of as variables whose values of interest are all represented in the data file.
Random-effects factors are variables whose values in the data file can be considered a random sample from a larger population of values. They are useful for explaining excess variability in the dependent variable.
For example, a grocery store chain is interested in the effects of five different types of coupons on customer spending. At several store locations, these coupons are handed out to customers who frequent that location; one coupon selected at random is distributed to each customer.
The type of coupon is a fixed effect because the company is interested in those particular coupons. The store location is a random effect because the locations used are a sample from the larger population of interest, and while there is likely to be store-to-store variation in customer spending, the company is not directly interested in that variation in the context of this problem.
Covariates: Scale predictors should be selected as covariates in the model. Within combinations of factor levels (or cells), values of covariates are assumed to be linearly correlated with values of the dependent variable.
Interactions: The Linear Mixed Models procedure allows you to specify factorial interactions, which means that each combination of factor levels can have a different linear effect on the dependent variable. Additionally, you may specify factor-covariate interactions, if you believe that the linear relationship between a covariate and the dependent variable changes for different levels of a factor.
Random effects covariance structure: The Linear Mixed Models procedure allows you to specify the relationship between the levels of random effects. By default, levels of random effects are uncorrelated and have the same variance.
Repeated effects: Factors and covariates are features of the general linear model. In the Linear Mixed Models procedure, repeated effects variables are added, allowing you to relax the assumption of independence of the error terms. In order to model the covariance structure of the error terms, you need to specify the following:
  • Repeated effects variables are variables whose values in the data file can be considered as markers of multiple observations of a single subject.
  • Subject variables define the individual subjects of the repeated measurements. The error terms for each individual are independent of those of other individuals.
  • The covariance structure specifies the relationship between the levels of the repeated effects. The types of covariance structures available allow for residual terms with a wide variety of variances and covariances.
For example, if the grocery store recorded the purchasing habits of their customers for four consecutive weeks, then the variable week would be a repeated effects variable. Specifying a subject variable denoting the Customer ID differentiates the repeated observations of separate customers. Specifying a first-order autoregressive covariance structure reflects your belief that a higher-than-average volume of purchases in one week will correspond to a higher (or lower)-than-average volume in the following week.
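
To make the first-order autoregressive structure concrete, the sketch below builds the covariance matrix for the four weekly measurements, assuming the common form in which the covariance between weeks i and j is variance * rho ** |i - j|. The variance and rho values are hypothetical.

```python
def ar1_covariance(n_times, variance, rho):
    """Covariance matrix with entries variance * rho ** |i - j|."""
    return [
        [variance * rho ** abs(i - j) for j in range(n_times)]
        for i in range(n_times)
    ]


weeks = 4
cov = ar1_covariance(weeks, variance=1.0, rho=0.5)
for row in cov:
    print(["{:.3f}".format(v) for v in row])
```

Adjacent weeks are the most strongly related and the relationship fades as weeks get further apart. A negative rho would make adjacent weeks move in opposite directions, which is the "or lower" case in the example.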

Wednesday, August 10, 2011

Variables

In statistics, a variable has two defining characteristics:
  • A variable is an attribute that describes a person, place, thing, or idea
  • The value of the variable can "vary" from one entity to another
For example, a person's height is a potential variable, which could have the value of "tall" for one person and "short" for another.

Qualitative vs. Quantitative Variables
Variables can be classified as Qualitative (categorical) or Quantitative (numerical):
  • Categorical: Categorical variables take on values that are names or labels. The colour of eyes (e.g. black, blue) would be an example of a categorical variable.
  • Numerical: Quantitative variables are numerical and represent a measurable quantity. For example, when we speak of the height of students, we are talking about the height (in inches or feet) of the students - a measurable attribute. Therefore, height would be a quantitative variable.

Discrete vs. Continuous Variables
Quantitative variables can be further classified as discrete or continuous. If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable. The following examples will clarify the difference between discrete and continuous variables:
  • Suppose the fire department mandates that all fire fighters must weigh between 150 and 250 pounds. The weight of a fire fighter would be an example of a continuous variable, since a fire fighter's weight could take on any value between 150 and 250 pounds.
  • Suppose we flip a coin and count the number of heads. The number of heads could be any integer value between 0 and plus infinity. However, it could not be any number between 0 and plus infinity. We could not, for example, get 2.3 heads. Therefore, the number of heads must be a discrete variable.

Univariate vs. Bivariate Data
Statistical data is often classified according to the number of variables being studied.
  • Univariate data: When we conduct a study that looks at only one variable, we say that we are working with univariate data. Suppose, for example, that we conducted a survey to estimate the average weight of high school students. Since we are only working with one variable (weight), we would be working with univariate data.
  • Bivariate data: When we conduct a study that examines the relationship between two variables, we are working with bivariate data. Suppose we conducted a study to see if there were a relationship between the height and weight of high school students. Since we are working with two variables (height and weight), we would be working with bivariate data.

Monday, August 8, 2011

Inferential statistics

Inferential statistics deal with inferences about populations based on the behaviour of samples. They are concerned with determining how likely it is that results based on a sample or samples are the same results that would have been obtained for the entire population.

Concept of Standard Error
If we randomly select a number of samples from the same population and compute the mean for each, it is likely that each mean will be somewhat different from every other mean, and that none of the means will be identical to the population mean. The variation among the means is referred to as sampling error. If a difference is found between sample means, the question of interest is whether the difference is a result of sampling error or a reflection of a true difference.
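
A small simulation can show this variation among sample means. The population below is hypothetical (10,000 simulated scores), and the seed is fixed only so the sketch is repeatable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 test scores.
population = [random.gauss(80, 10) for _ in range(10_000)]
population_mean = statistics.fmean(population)

# Draw five random samples of 25 scores and compute each sample mean.
sample_means = [
    statistics.fmean(random.sample(population, 25)) for _ in range(5)
]

print(f"population mean: {population_mean:.2f}")
for m in sample_means:
    print(f"sample mean: {m:.2f}  difference from population mean: {m - population_mean:+.2f}")
```

Each run of five samples gives five different means, none of which matches the population mean exactly.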

Sample Size and Standard Error
Sampling errors follow a regular pattern. They are approximately normally distributed, and most of the sample means will be very close to the population mean; the number of means that differ considerably from the population mean decreases as the size of the difference increases.
The standard deviation of the sample means (the standard deviation of sampling errors) is usually referred to as the standard error of the mean. The standard error of the mean tells us by how much we would expect our sample means to differ if we used other samples from the same population. According to normal curve percentages, we can say that approximately 68% of the sample means will fall between plus and minus one standard error of the mean, 95% will fall between plus and minus two standard errors, and 99+% will fall between plus and minus three standard errors.
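
The normal curve percentages can be checked empirically. The sketch below draws many samples from a hypothetical population and counts how many sample means fall within one, two, and three standard errors of the population mean. Sampling is done with replacement so that the standard deviation divided by the square root of the sample size applies directly.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 20,000 values.
population = [random.gauss(50, 12) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 36
standard_error = sigma / math.sqrt(n)

# Means of 2,000 random samples of size n, drawn with replacement.
means = [statistics.fmean(random.choices(population, k=n)) for _ in range(2_000)]


def share_within(k):
    """Fraction of sample means within k standard errors of the population mean."""
    return sum(abs(m - mu) <= k * standard_error for m in means) / len(means)


within_1, within_2, within_3 = share_within(1), share_within(2), share_within(3)
print(f"within 1 SE: {within_1:.1%}")
print(f"within 2 SE: {within_2:.1%}")
print(f"within 3 SE: {within_3:.1%}")
```

The shares should land near 68%, 95%, and 99+%, with small differences from run to run because the check is itself based on random samples.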
If we know the standard deviation, then the standard error of the mean is equal to the standard deviation divided by the square root of the sample size: SE(mean) = SD / sqrt(N). (When the SD is computed from the sample with N in the denominator, some texts divide by sqrt(N - 1) instead; the two versions differ little unless N is small.) Suppose a sample mean is 80 and the standard error of the mean is 1.00. If we say that the population mean falls between 79 and 81, we have approximately a 68% chance of being correct; if we say that it falls between 78 and 82, approximately a 95% chance; and if we say that it falls between 77 and 83, a 99+% chance. In other words, the probability of the population mean being less than 77 or larger than 83 is less than 1%.
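
The three ranges in this example are simply the sample mean plus and minus one, two, and three standard errors. A minimal sketch (the function name is made up for this illustration):

```python
def interval(sample_mean, standard_error, k):
    """Range from k standard errors below to k standard errors above the sample mean."""
    return (sample_mean - k * standard_error, sample_mean + k * standard_error)


sample_mean, se = 80.0, 1.0
for k, chance in [(1, "about 68%"), (2, "about 95%"), (3, "99+%")]:
    low, high = interval(sample_mean, se, k)
    print(f"{low:.0f} to {high:.0f}: {chance}")
```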
It is now clear that a smaller standard error indicates less sampling error. The major factor affecting the standard error of the mean is sample size: as the size of the sample increases, the standard error of the mean decreases. Another factor affecting the standard error of the mean is the size of the population standard deviation. If the population standard deviation is large, members of the population are very spread out on the variable of interest, and the sample means will also be very spread out.
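
Both effects can be seen by computing the standard error as the standard deviation divided by the square root of the sample size, as described above. The SD and sample-size values below are hypothetical.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: standard deviation over the square root of n."""
    return sd / math.sqrt(n)


# Larger samples give a smaller standard error.
for n in (25, 100, 400):
    print(f"SD = 10, n = {n:>3}: SE = {standard_error(10, n):.2f}")

# A more spread-out population gives a larger standard error.
for sd in (5, 10, 20):
    print(f"SD = {sd:>2}, n = 100: SE = {standard_error(sd, 100):.2f}")
```

Note that quadrupling the sample size only halves the standard error, because the sample size enters through its square root.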
In order to determine whether or not the difference between two sample means probably represents a true population difference, we need an estimate of the standard error of the difference between two means.
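
The text does not give a formula for this estimate. One common choice for two independent samples, assumed here, is the square root of the sum of each sample variance divided by its sample size. The scores below are hypothetical.

```python
import math
import statistics


def se_difference(sample_a, sample_b):
    """Estimated standard error of the difference between two independent sample means."""
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    return math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))


# Hypothetical scores for two groups.
group_a = [82, 79, 88, 91, 75, 84, 80, 86]
group_b = [78, 74, 81, 85, 70, 77, 76, 79]

difference = statistics.fmean(group_a) - statistics.fmean(group_b)
se_diff = se_difference(group_a, group_b)
print(f"difference between means: {difference:.3f}")
print(f"standard error of the difference: {se_diff:.3f}")
```

Comparing the observed difference with its standard error indicates how many standard errors the two means are apart.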

Test Null Hypothesis
When we talk about the difference between two sample means being a true difference, we mean that the difference was caused by the treatment and not by chance. The chance explanation for the difference is called the null hypothesis. The null hypothesis says in essence that there is no difference or relationship between parameters in the populations and that any difference or relationship found for the samples is the result of sampling error. The research hypothesis usually states that one method is expected to be more effective than another. Rejecting a null hypothesis gives more conclusive support for a positive research hypothesis than trying to confirm that hypothesis directly. Suppose one hypothesizes that all research textbooks contain a chapter on sampling. If he or she examines a book and finds that it does contain the chapter, that does not prove the hypothesis, because it is only one book. On the other hand, if he or she finds a single book that does not contain the chapter, that is enough to disprove the hypothesis. The result of a study can either reject the null hypothesis or fail to reject it. If it is rejected, the null hypothesis is probably false; if it is not rejected, the study has not provided enough evidence to conclude that it is false.

Test of Significance
In order to test a null hypothesis, we need a test of significance, and we need to select a probability level that indicates how much risk we are willing to take that the decision we make is wrong. At the end of an experimental research study, if there is a difference between the group means, the researcher needs to decide whether the difference is large enough to conclude that it represents a true difference. The test of significance is made at a pre-selected probability level and allows the researcher to state whether the null hypothesis is rejected. The level is usually set at 0.05 or 0.01. That means the researcher accepts a 5% or 1% risk of concluding that there is a true difference when the difference arose by chance. There are a number of different tests of significance that can be applied in research studies, including the t-test, analysis of variance, and chi-square.
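
The decision rule can be sketched as follows. This is not one of the tests named above: it is a large-sample normal (z) approximation, chosen only because it needs nothing beyond the Python standard library. With small samples a t-test would be the usual choice. The scores and the function name are hypothetical.

```python
import math
from statistics import NormalDist, fmean, variance


def two_sample_z_test(sample_a, sample_b, alpha=0.05):
    """Large-sample test of the null hypothesis that two population means are equal."""
    se_diff = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    z = (fmean(sample_a) - fmean(sample_b)) / se_diff
    # Two-tailed p-value under the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical scores: 40 per group, with group A four points higher on average.
group_a = [70 + (i % 10) for i in range(40)]
group_b = [66 + (i % 10) for i in range(40)]

z_ab, p_ab, reject_ab = two_sample_z_test(group_a, group_b, alpha=0.05)
print(f"A vs B: z = {z_ab:.2f}, reject null hypothesis: {reject_ab}")

# Comparing a group with itself gives no difference at all.
z_bb, p_bb, reject_bb = two_sample_z_test(group_b, group_b, alpha=0.05)
print(f"B vs B: z = {z_bb:.2f}, reject null hypothesis: {reject_bb}")
```

The probability level (alpha) is fixed before looking at the data, and the null hypothesis is rejected only when the p-value falls below it.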