Why It Works


Consider a binomial random variable $Y\sim\textrm{Bin}(n,p)$ with mean $\mu_Y = np$ and variance $\sigma^2_Y = np(1-p)$. From the Central Limit Theorem, we know that $Z = (Y - \mu_Y)/ \sigma_Y$ has approximately a standard Normal(0,1) distribution for large values of $n$. Since the square of a standard normal random variable has a chi-square distribution with one degree of freedom,

$\displaystyle Z^2 = \frac{(Y - np)^2}{np(1-p)}$

is approximately $\chi ^2_1$.
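The approximation can be checked numerically. The sketch below is only an illustration, not part of the original HowTo: it computes the exact probability that $Z^2$ exceeds 3.841 (the 95th percentile of $\chi^2_1$) for a binomial variable. The function names and the choices of $n$ and $p = 0.3$ are assumptions made for the example.

```python
import math

def binom_pmf(y, n, p):
    """Exact Bin(n, p) probability of y, computed on the log scale to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
               + y * math.log(p) + (n - y) * math.log(1 - p))
    return math.exp(log_pmf)

def prob_z2_exceeds(c, n, p):
    """Exact P(Z^2 > c), where Z = (Y - np) / sqrt(np(1-p)) and Y ~ Bin(n, p)."""
    mu = n * p
    var = n * p * (1 - p)
    return sum(binom_pmf(y, n, p)
               for y in range(n + 1)
               if (y - mu) ** 2 / var > c)

CRITICAL_95 = 3.841  # 95th percentile of the chi-square distribution with 1 df

# Illustrative values of n; the tail probability should approach 0.05 as n grows.
for n in (20, 200, 2000):
    print(n, round(prob_z2_exceeds(CRITICAL_95, n, 0.3), 4))
```

Because $Y$ is discrete, the exact tail probability moves around 0.05 rather than hitting it exactly, but it should settle close to 0.05 as $n$ increases.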

Now consider the random variable $Y_1$, which has a binomial distribution $\textrm{Bin}(n, p_1)$, and let $Y_2 = n - Y_1$ and $p_2 = 1 - p_1$. Then

$\displaystyle Z^2$	$\displaystyle = \frac{(Y_1 - np_1)^2}{np_1(1-p_1)}$
	$\displaystyle = \frac{(Y_1 - np_1)^2(1-p_1) + (Y_1 - np_1)^2(p_1)}{np_1(1-p_1)}$
	$\displaystyle = \frac{(Y_1 - np_1)^2}{np_1} + \frac{(Y_1 - np_1)^2}{n(1-p_1)},$

and since

$\displaystyle (Y_1 - np_1)^2 = (n - Y_2 - n + np_2)^2 = (Y_2 - np_2)^2,$

we have

$\displaystyle Z^2 = \frac{(Y_1 - np_1)^2}{np_1} + \frac{(Y_2 - np_2)^2}{np_2},$

where $Z^2$ has approximately a chi-square distribution with 1 degree of freedom.
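As a quick sanity check of the algebra, the following sketch compares the one-term form of $Z^2$ with the two-term sum for a few counts. The values of $n$, $p_1$ and $Y_1$ are arbitrary illustrative choices, and the function names are made up for this example.

```python
import math

def z_squared(y1, n, p1):
    """Single-term form: (Y1 - n p1)^2 / (n p1 (1 - p1))."""
    return (y1 - n * p1) ** 2 / (n * p1 * (1 - p1))

def two_term_sum(y1, n, p1):
    """Two-term form, using Y2 = n - Y1 and p2 = 1 - p1."""
    y2 = n - y1
    p2 = 1 - p1
    return (y1 - n * p1) ** 2 / (n * p1) + (y2 - n * p2) ** 2 / (n * p2)

n, p1 = 100, 0.3  # illustrative values only
for y1 in (20, 30, 37):
    a = z_squared(y1, n, p1)
    b = two_term_sum(y1, n, p1)
    # The two forms should agree up to floating-point rounding.
    print(y1, a, b, math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9))
```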

In general, for random variables $Y_i$, where $i = 1, 2, \ldots, k$, with corresponding expected values $np_i$, a statistic measuring the "closeness" of the observations to their expectations is the sum:

$\displaystyle \frac{(Y_1 - np_1)^2}{np_1} + \frac{(Y_2 - np_2)^2}{np_2} + \cdots + \frac{(Y_k - np_k)^2}{np_k},$

which has approximately a chi-square distribution with $k-1$ degrees of freedom. This is because we know that the sum of all of the probabilities, $p_1, \ldots, p_k$, must equal 1, and thus we can derive $p_k$ by subtracting the first $k-1$ probabilities from 1.
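The general statistic is short to compute. Here is a minimal sketch, assuming the category probabilities are fully specified in advance; the function name `pearson_chi_square` and the die-roll counts are hypothetical and were made up for this illustration.

```python
def pearson_chi_square(observed, probs):
    """Return (statistic, degrees of freedom) for counts Y_i and probabilities p_i.

    Assumes the p_i are known beforehand (not estimated from the data),
    so the degrees of freedom are k - 1.
    """
    if len(observed) != len(probs):
        raise ValueError("observed and probs must have the same length")
    n = sum(observed)
    statistic = sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))
    return statistic, len(observed) - 1

# Hypothetical data: 60 rolls of a die assumed to be fair.
rolls = [8, 12, 9, 11, 10, 10]
stat, df = pearson_chi_square(rolls, [1 / 6] * 6)
print(stat, df)
```

With an expected count of 10 per face, the squared deviations are 4, 4, 1, 1, 0 and 0, so the statistic comes to 10/10 = 1 on 5 degrees of freedom.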


Example 3.7.1.2 (Allele Frequencies)

A population is said to be in Hardy-Weinberg equilibrium for a given gene if:

It is stable with respect to the allele and genotype frequencies of interest. That is, allele frequencies do not change from generation to generation.
The genotype frequencies in the progeny produced by random mating among parents are determined solely by the allele frequencies of the parents.

In other words, for a particular gene A with alleles $A_1$ and $A_2$, if the allele frequencies in the parents are $p$ and $q$ (and thus $p + q = 1$), then the proportions of offspring with the genotypes $A_1A_1$, $A_1A_2$ and $A_2A_2$ are $p^2$, $2pq$ and $q^2$, respectively.

Table 3.7.2: Observed genotypes at the MN blood group gene locus for individuals in a human population. Source: Plagiarized from Michael D. Purugganan, class notes.

Genotype	Observed
$A_1A_1$	22
$A_1A_2$	216
$A_2A_2$	492

Given the data in Table 3.7.2, we can calculate the observed allele frequencies. That is,

$\displaystyle p = \frac{(22 + 216/2)}{730} = 0.178,$

and

$\displaystyle q = 1 - p = 0.822.$
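The same arithmetic can be sketched in a few lines of Python. The dictionary keys are labels chosen for this illustration; the counts are those from the table of observed genotypes.

```python
# Observed genotype counts from the table (the keys are illustrative labels).
counts = {"A1A1": 22, "A1A2": 216, "A2A2": 492}

n = sum(counts.values())  # number of individuals sampled

# Each homozygote counts fully toward its allele, each heterozygote counts half.
p = (counts["A1A1"] + counts["A1A2"] / 2) / n
q = 1 - p

print(n, round(p, 3), round(q, 3))  # 730 0.178 0.822
```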

With values for $p$ and $q$, we can now calculate how many individuals with each class of genotype we would expect if the population were in Hardy-Weinberg equilibrium. The results of this calculation are in Table 3.7.3.

Table 3.7.3: Both observed and expected genotypes at the MN blood group gene locus for individuals in a human population.

Genotype	Observed	Expected
$A_1A_1$	22	23.14
$A_1A_2$	216	213.60
$A_2A_2$	492	493.26
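The expected counts follow from multiplying the sample size by the standard Hardy-Weinberg genotype proportions $p^2$, $2pq$ and $q^2$. The sketch below assumes the rounded allele frequencies 0.178 and 0.822 were used; the variable names are illustrative, and the results match the tabulated values only up to small rounding differences.

```python
n = 730    # individuals sampled
p = 0.178  # rounded allele frequencies, as quoted in the text
q = 0.822

# Hardy-Weinberg proportions for the two homozygotes and the heterozygote.
proportions = {"A1A1": p ** 2, "A1A2": 2 * p * q, "A2A2": q ** 2}
expected = {genotype: n * prop for genotype, prop in proportions.items()}

for genotype, count in expected.items():
    print(genotype, round(count, 2))
```

Since the three proportions sum to 1, the expected counts add up to the sample size.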

Now that we have both observed and expected values for each class of genotype, we can calculate a chi-square test statistic. That is,

$\displaystyle \chi^2$	$\displaystyle = \frac{(22 - 23.14)^2}{23.14} + \frac{(216 - 213.60)^2}{213.60}$
	$\displaystyle \quad + \frac{(492 - 493.26)^2}{493.26}$
	$\displaystyle = 0.086$
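For readers who want to reproduce the arithmetic, here is a short sketch that takes the observed and expected counts directly from the table; the variable names are illustrative.

```python
observed = [22, 216, 492]
expected = [23.14, 213.60, 493.26]  # tabulated expected counts

# Sum of (observed - expected)^2 / expected over the three genotype classes.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(terms)

print(round(chi_square, 3))  # 0.086
```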

Now all we need to do is compare this value to that from a chi-square distribution. The trick, however, is determining how many degrees of freedom there are. Here we have three different categories, or genotypes, and each one has an associated probability of membership. However, two of these probabilities are dependent on one of them. That is, since $q = 1 - p$, once the probability $p^2$ of having the genotype $A_1A_1$ is fixed, so are the probability of having the genotype $A_1A_2$, which is $2p(1-p)$, and the probability of having the genotype $A_2A_2$, which is $(1-p)^2$. Thus, since there is only one independent probability, there is 1 degree of freedom.
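The same count can be reached with the usual bookkeeping rule for goodness-of-fit tests, stated here as a supplementary check rather than as part of the original argument: start from the $k - 1$ degrees of freedom for $k$ categories and subtract one for each parameter estimated from the data. Here $k = 3$ genotypes and one estimated parameter, $p$ (the symbol $m$ for the number of estimated parameters is introduced only for this note):

$\displaystyle \textrm{df} = k - 1 - m = 3 - 1 - 1 = 1.$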

We can now use Octave to determine the p-value, that is, the probability of seeing a chi-square statistic at least this large if our hypothesis is correct:

octave:2> 1 - chisquare_cdf(0.086, 1)
ans = 0.76933
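If Octave is not at hand, the same tail area can be sketched with the Python standard library. This relies on the fact that a $\chi^2_1$ variable is the square of a standard normal, so $P(\chi^2_1 > x) = \textrm{erfc}(\sqrt{x/2})$. The shortcut holds only for one degree of freedom, and the function name is made up for this example.

```python
import math

def chi_square_1df_upper_tail(x):
    """P(X > x) for X ~ chi-square with 1 degree of freedom (valid for 1 df only)."""
    return math.erfc(math.sqrt(x / 2))

p_value = chi_square_1df_upper_tail(0.086)
print(round(p_value, 5))  # 0.76933
```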

Since we usually fail to reject the hypothesis that the data come from our model if the p-value is more than 5 percent (and in this case it is 77 percent, see Figure 3.7.2), we will not reject the hypothesis that the alleles for the MN blood type gene are in Hardy-Weinberg equilibrium.

**Figure 3.7.2:** The area under the $\chi ^2_1$ graph that represents the p-value: the probability, if the locus for the MN blood group is in Hardy-Weinberg equilibrium, of a test statistic at least as large as the one observed. Since the p-value/area is so large (77 percent), we fail to reject our hypothesis.

Frank Starmer 2004-05-19