
Why It Works

Consider a binomial random variable $ Y\sim\textrm{Bin}(n,p)$ with mean $ \mu_Y = np$ and variance $ \sigma^2_Y = np(1-p)$. From the Central Limit Theorem, we know that $ Z = (Y - \mu_Y)/ \sigma_Y$ has approximately a standard Normal(0,1) distribution for large values of $ n$. Since the square of a standard normal random variable has a chi-square distribution with one degree of freedom, $ Z^2$ is approximately $ \chi ^2_1$.

Now consider the random variable $ Y_1$ which has a binomial$ (n, p_1)$ distribution and let $ Y_2 = n - Y_1$ and $ p_2 = 1 - p_1$. Then

$\displaystyle Z^2$ $\displaystyle = \frac{(Y_1 - np_1)^2}{np_1(1-p_1)}$    
  $\displaystyle = \frac{(Y_1 - np_1)^2(1-p_1) + (Y_1 - np_1)^2(p_1)}{np_1(1-p_1)}$    
  $\displaystyle = \frac{(Y_1 - np_1)^2}{np_1} + \frac{(Y_1 - np_1)^2}{n(1-p_1)},$    

and since

$\displaystyle (Y_1 - np_1)^2 = (n - Y_2 - n + np_2)^2 = (Y_2 - np_2)^2,$    

we have

$\displaystyle Z^2 = \frac{(Y_1 - np_1)^2}{np_1} + \frac{(Y_2 - np_2)^2}{np_2},$    

where $ Z^2$ has, approximately, a chi-square distribution with 1 degree of freedom.
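
The identity above can be checked numerically. The following is a minimal Python sketch; the function names and the sample values $ n = 50$, $ p_1 = 0.3$ and $ Y_1 = 18$ are illustrative assumptions, not taken from the text.

```python
import math

def z_squared(y1, n, p1):
    """Squared standardized binomial count: (Y1 - n*p1)^2 / (n*p1*(1 - p1))."""
    return (y1 - n * p1) ** 2 / (n * p1 * (1 - p1))

def two_cell_sum(y1, n, p1):
    """Sum over both cells of (observed - expected)^2 / expected."""
    y2 = n - y1      # count in the second cell
    p2 = 1 - p1      # probability of the second cell
    return (y1 - n * p1) ** 2 / (n * p1) + (y2 - n * p2) ** 2 / (n * p2)

# Illustrative values only (assumed for this sketch)
n, p1, y1 = 50, 0.3, 18
print(z_squared(y1, n, p1), two_cell_sum(y1, n, p1))
print(math.isclose(z_squared(y1, n, p1), two_cell_sum(y1, n, p1)))
```

Both expressions should agree up to floating-point rounding for any valid choice of $ n$, $ p_1$ and $ Y_1$.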

In general, for $ k$ random variables $ Y_i$, where $ i = 1, 2, \ldots, k$, with corresponding expected values $ np_i$, a statistic measuring the ``closeness'' of the observations to their expectations is the sum:

$\displaystyle \frac{(Y_1 - np_1)^2}{np_1} + \frac{(Y_2 - np_2)^2}{np_2} + \cdots + \frac{(Y_k - np_k)^2}{np_k},$    

which has approximately a chi-square distribution with $ k-1$ degrees of freedom. One degree of freedom is lost because the probabilities $ p_1, \ldots, p_k$ must sum to 1, and thus $ p_k$ is determined by subtracting the first $ k-1$ probabilities from 1.
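
As a rough sketch of the general statistic, the sum can be written as a short Python function. The name `pearson_statistic` and the die-roll counts are hypothetical, chosen only to illustrate the formula.

```python
def pearson_statistic(observed, probs):
    """Sum of (Y_i - n*p_i)^2 / (n*p_i) over all k categories."""
    if len(observed) != len(probs):
        raise ValueError("observed and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    n = sum(observed)  # total number of observations
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))

# Hypothetical data: 60 rolls of a die assumed to be fair (k = 6 categories)
counts = [8, 12, 9, 11, 10, 10]
stat = pearson_statistic(counts, [1 / 6] * 6)
print(stat, "with", len(counts) - 1, "degrees of freedom")
```

Here every expected count is 10, so the statistic is the sum of the squared deviations divided by 10.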


Example 3.7.1.2 (Allele Frequencies)  

A population is said to be in Hardy-Weinberg equilibrium for a given gene if:

  1. It is stable with respect to the allele and genotype frequencies of interest. That is, allele frequencies do not change from generation to generation.
  2. The genotype frequencies in the progeny produced by random mating among parents are determined solely by the allele frequencies of the parents.
In other words, for a particular gene A with alleles A$ _1$ and A$ _2$, if the allele frequencies in the parents are $ f($A$ _1) = p$ and $ f($A$ _2) = q$ (and thus $ p + q = 1$, or $ q = 1 - p$), then the expected proportions of offspring genotypes are $ p^2$ for A$ _1$A$ _1$, $ 2pq$ for A$ _1$A$ _2$, and $ q^2$ for A$ _2$A$ _2$.


Table 3.7.2: Observed genotypes at the MN blood group gene locus for individuals in a human population. Source: Plagiarized from Michael D. Purugganan, class notes.
Genotype Observed
A$ _1$A$ _1$ 22
A$ _1$A$ _2$ 216
A$ _2$A$ _2$ 492


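Given the data in Table 3.7.2, we can calculate the observed allele frequencies. Each A$ _1$A$ _1$ individual carries two copies of A$ _1$ and each A$ _1$A$ _2$ individual carries one, so among the $ 22 + 216 + 492 = 730$ individuals,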

$\displaystyle p = \frac{(22 + 216/2)}{730} = 0.178,$    

and

$\displaystyle q = 1 - p = 0.822.$    
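
The allele-frequency calculation can be sketched in Python as follows. The function name `allele_frequencies` is an assumption made for this example; it counts allele copies directly, which is algebraically the same as the formula above.

```python
def allele_frequencies(n11, n12, n22):
    """Return (p, q) from the genotype counts A1A1, A1A2 and A2A2."""
    n = n11 + n12 + n22              # number of individuals
    p = (2 * n11 + n12) / (2 * n)    # copies of A1 out of 2n allele copies
    return p, 1 - p

# Observed counts from the MN blood group table
p, q = allele_frequencies(22, 216, 492)
print(round(p, 3), round(q, 3))
```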

With values for $ p$ and $ q$, we can now calculate how many individuals of each genotype we would expect if the population were in Hardy-Weinberg Equilibrium, namely $ 730p^2$, $ 730(2pq)$ and $ 730q^2$. The results of this calculation are in Table 3.7.3.

Table 3.7.3: Both observed and expected genotypes at the MN blood group gene locus for individuals in a human population.
Genotype Observed Expected
A$ _1$A$ _1$ 22 23.14
A$ _1$A$ _2$ 216 213.60
A$ _2$A$ _2$ 492 493.26
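
A minimal Python sketch of the expected-count calculation is given below; the function name `hardy_weinberg_expected` is hypothetical. Because the sketch keeps $ p$ and $ q$ unrounded, its output can differ slightly from Table 3.7.3, which presumably was computed from rounded frequencies.

```python
def hardy_weinberg_expected(n11, n12, n22):
    """Expected counts of A1A1, A1A2 and A2A2 under Hardy-Weinberg equilibrium."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)    # estimated frequency of A1
    q = 1 - p                        # estimated frequency of A2
    return [n * p * p, n * 2 * p * q, n * q * q]

expected = hardy_weinberg_expected(22, 216, 492)
for genotype, count in zip(["A1A1", "A1A2", "A2A2"], expected):
    print(f"{genotype}: {count:.2f}")
```

The three expected counts always sum to the number of individuals, since $ p^2 + 2pq + q^2 = (p + q)^2 = 1$.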


Now that we have both observed and expected values for each class of genotype, we can calculate a chi-square test statistic. That is,

$\displaystyle \chi^2$ $\displaystyle = \frac{(22 - 23.14)^2}{23.14} + \frac{(216 - 213.60)^2}{213.60}$    
  $\displaystyle \quad + \frac{(492 - 493.26)^2}{493.26}$    
  $\displaystyle = 0.086$    

Now all we need to do is compare this value to a chi-square distribution. The trick, however, is determining how many degrees of freedom there are. Here we have three different categories, or genotypes, and each one has an associated probability of membership. However, all three probabilities are functions of the single parameter $ p$. That is, since $ q = 1 - p$, the probability of having the genotype A$ _1$A$ _1$ is $ p^2$, the probability of having the genotype A$ _1$A$ _2$ is $ 2pq = 2p(1 - p)$, and the probability of having the genotype A$ _2$A$ _2$ is $ q^2 = (1 - p)^2$. Equivalently, we start with $ 3 - 1 = 2$ degrees of freedom and lose one more because $ p$ was estimated from the data. Thus, there is 1 degree of freedom.

We can now use Octave to compute the p-value, the probability of seeing a test statistic at least as large as 0.086 if the population really is in Hardy-Weinberg Equilibrium:

octave:2> 1 - chisquare_cdf(0.086, 1)
ans = 0.76933
So, since we usually fail to reject the hypothesis that the data come from our model if the p-value is more than 5 percent (and in this case it is 77 percent, see Figure 3.7.2), we will not reject the hypothesis that the alleles for the MN blood type gene are in Hardy-Weinberg Equilibrium.
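
The same test can be sketched in Python using only the standard library. This relies on the fact that, for one degree of freedom, the upper tail of the chi-square distribution is $ P(\chi^2_1 > x) = \textrm{erfc}(\sqrt{x/2})$; the helper name `chi2_sf_1df` is an assumption for this example, and it is valid only for 1 degree of freedom.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2))

# Observed and expected counts as listed in Table 3.7.3
observed = [22, 216, 492]
expected = [23.14, 213.60, 493.26]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = chi2_sf_1df(chi2)

print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")
print("reject" if p_value < 0.05 else "fail to reject")
```

For other degrees of freedom this shortcut does not apply, and a general chi-square CDF routine would be needed instead.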
Figure 3.7.2: The area under the $ \chi ^2_1$ graph that represents the p-value, the probability of obtaining a test statistic at least as large as the one observed if the locus for the MN blood group is in Hardy-Weinberg Equilibrium. Since the p-value/area is so large (77 percent), we fail to reject our hypothesis.
\includegraphics[width=3in]{chi_square2}



Frank Starmer 2004-05-19