DNA Fingerprinting and the Calculation of Profile Frequency
Linh T. Do
MBA 604, Data Analysis, Professor Steve Huxley
University of San Francisco
Fall 1998, Term Paper




Executive Summary

Two prominent cases from this decade employed "DNA fingerprinting" in their body of evidence. The more recent is this year's investigation of President William Jefferson Clinton by the Office of the Independent Counsel, led by Ken Starr. The earlier is the 1994-1995 trial of The People of the State of California vs. Orenthal James Simpson. Although the two cases differ vastly in context, DNA typing surfaces in both, and the technology has come under the scrutiny of everyday news. Probabilities such as a "one in 7.87 trillion" chance that the DNA profile from an incriminating sample does or does not match the accused are reported without explanation of how these odds are calculated. This discussion attempts to explain these probability values. A hypothetical example is used to illustrate the process of frequency calculation.
In 1984 British researcher Alec Jeffreys at the University of Leicester coined the term "DNA fingerprinting." His research had led him to the discovery of a new molecular biology method, the "multilocus probe," for human DNA analysis. In simplified terms, this method enabled Dr. Jeffreys to generate a pattern of a person's DNA profile that looks much like a bar code for products in a supermarket (1). And like a bar code, which is created to uniquely identify a product, a person's "DNA fingerprint" can be a truly unique identification mark.
The pattern of ridged skin at the end of each person's fingers, the traditional fingerprint, was for many years the most distinctive means of human identification and is still the most widely used identification mark. A person's fingerprint, however, can be altered, whereas a DNA fingerprint, generated from deoxyribonucleic acid (DNA), which is the basis of the code of life in every cell of a person (2), cannot be disguised or changed.
Dr. Jeffreys began introducing DNA testing for cases of kinship determination and murder investigations. Following in his footsteps, commercial testing laboratories adapted various types of DNA-based identification typing methods which employed scientific protocols and principles similar to those used by Dr. Jeffreys (the details of these scientific principles and protocols are beyond the scope of this discussion and will not be included). Laboratories such as Cellmark Diagnostics (now owned by Lifecodes Corporation, Connecticut) provided DNA data for evidence in court. In 1988, Cellmark was involved in the first criminal case in which the strength of DNA fingerprint evidence helped convict two men in the rape and murder of a woman (3). Cellmark's laboratory protocol differed slightly from Dr. Jeffreys' since Cellmark used a cocktail of "single-locus probes" for assessing the suspects' DNA profiles.
In 1988, the Federal Bureau of Investigation (FBI) of the United States (U.S.) Department of Justice implemented its DNA Analysis Unit. The FBI became the first public crime laboratory in the U.S. to perform forensic DNA analysis.
The use of DNA fingerprinting blossomed as commercial, university, and government laboratories applied DNA testing for various objectives. From the diagnosis of inherited diseases or determination of paternity to the identification of armed forces war casualties or the linking of suspects to biological evidence found at the scene of a crime, the news about the various applications of DNA profiling has proliferated. "DNA fingerprinting" has slowly crept into the household vocabulary as a common term. Nowadays, in forensic investigation of rape cases, DNA profiling is a high priority.
Two famous cases employing DNA fingerprinting dominated the public news for lengthy periods in this decade: this year's investigation of President Clinton by the Office of the Independent Counsel and the 1994-1995 O.J. Simpson trial. The physical evidence reported in The Starr Report (4) "conclusively establishes that the President and Ms. Lewinsky had a sexual relationship." This conclusion was based on a match between the DNA profile of semen stains collected from a navy blue dress turned over by Ms. Lewinsky and the DNA profile of President Clinton's voluntary blood sample. The FBI conducted the DNA testing, and according to its laboratory reports, the chance that the semen is not the President's is one in 7.87 trillion. These data, along with other association evidence, led the FBI to conclude that "the President was the source of the DNA obtained from the dress" (5). In The People of the State of California vs. Orenthal James Simpson, the prosecutors presented DNA evidence stating odds of billions to one that the blood found at the scene was not Simpson's. One in a trillion or one in a billion is a very minute probability for an event. How is it that DNA typing can claim such exceedingly low probabilities to rule out that an event happened by chance?
First, before an example is used to illustrate how these probabilities are obtained, it is important to note that DNA fingerprinting as a method is less than two decades old. In the time between the O.J. Simpson trial and the release of The Starr Report, two developments have made DNA typing "much more precise" (6): technological advances in the use of "restriction fragment length polymorphisms," or RFLPs, to generate DNA profiles, and larger databases with more complete data on the frequency distributions of DNA patterns in ethnic populations. In fact, "the new policy [of the FBI] states that if the likelihood of a random match is less than one in 260 billion, the examiner can testify that the samples are an exact match" (7). Experience and training in this field are accumulating quickly, favoring improved methods and standardization of laboratory and statistical techniques.
Second, the power of resolution to differentiate among individuals can be used as a tool not only by the prosecutor but also by the defense. In up to a third of forensic cases involving DNA profiling of suspects, or "individuals for whom there is enough other evidence to go to trial," the accused are exonerated and at times not brought to trial because of DNA evidence (8).
And third, the complexity of DNA typing should not be underestimated. From the proper collection and storage of samples to the execution of scientific laboratory methods, of which there are numerous protocols, problems of degraded or contaminated samples may arise. Examiner bias (9) can introduce errors into the raw data collection step when an analyst is determining the visibility or the proper sizes of DNA bands before matches are assessed. In validating DNA band sizes, confidence intervals must be established. As stated in one reference (10), "bands from the same source generally fall within ± 1.8% of their combined average [size] value." In other cases in which DNA test results are not clear cut, for example when some bands are clear and distinct and some are faint, examiners have at times had to use their technical judgment and experience to ascertain the data set.
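The ± 1.8% window quoted above can be illustrated with a short Python sketch. This is only one plausible reading of the rule, not a laboratory's actual protocol: the function name bands_match, the default window, and the band sizes are all hypothetical and chosen for illustration.

```python
def bands_match(size_a, size_b, window=0.018):
    """Return True if both band sizes lie within +/- window of their average.

    Sizes are in basepairs. The 1.8% default follows the quoted reference;
    real laboratories may define their match window differently.
    """
    average = (size_a + size_b) / 2.0
    allowed = window * average
    return abs(size_a - average) <= allowed and abs(size_b - average) <= allowed


# Hypothetical measurements of an evidence band and a suspect band.
print(bands_match(2231, 2260))  # small difference relative to the average
print(bands_match(2231, 2400))  # difference well outside the window
```

Under this reading, the first pair would be declared a match and the second would not, since the second pair deviates from its combined average by more than 1.8%.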
At this point, to answer the question of how exceedingly low probabilities that rule out the occurrence of an event due to chance are established, it will be assumed that all of the steps just discussed are sound and valid as they lead up to the determination of a match (or no match) and the calculation of a probability for the strength of association of a match. The probability of a specimen match can be calculated only if estimates of the population frequency distributions of alleles (see below for definition) are available. These two areas, frequency distributions of a selected allele and probability, require sound statistical analyses and are by no means immune from discussion and critical assessment.
To get a feel for the probability calculation, a brief review of scientific terminology is necessary. The human genome, which includes all 46 chromosomes, has coding and non-coding sequences. Expression of coding sequences shows up as phenotypic features such as eye and hair color. The role of non-coding sequences is not clearly understood. However, there is much variability in these non-coding areas of DNA. At some loci (a locus is a location in the DNA), tandem repeats of particular sequences are present, much like the word "cat" repeating over and over in a sentence: This cat cat cat cat cat cat is my pet. A locus that is polymorphic, meaning it can have different forms, is useful for forensic DNA typing. Each possible form is an allele. For example, the "cat" sentence above has six "cat" repeats. Another allele of this could be "This cat cat is my pet," which has only two "cat" repeats. Note that the physical sizes or lengths of these sentences differ, just as polymorphic DNA alleles differ in size. The ABO blood groups, for example, are an expression of genetic polymorphism. Blood typing, however, does not target DNA but rather identifies the gene products of DNA expression. These polymorphic sites in non-coding regions differ very frequently among individuals. For DNA typing, many variable alleles have been found, and their complementary probes have been developed. Empirical frequency data from random sampling of populations have been collected over the last ten to fifteen years by different laboratories for each commonly used probe. This sampling allows one to infer information about population parameters such as the frequencies of specific alleles at a given locus even though a complete population profile will not be assembled (11).
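The "cat" analogy can be made concrete with a small Python sketch that counts the repeats in each sentence and compares their lengths. It illustrates the analogy only and is not a DNA analysis method; the function name longest_repeat_run is hypothetical.

```python
def longest_repeat_run(sentence, unit="cat"):
    """Count the longest uninterrupted run of `unit` among the words of a sentence."""
    best = current = 0
    for word in sentence.rstrip(".").split():
        current = current + 1 if word == unit else 0
        best = max(best, current)
    return best


allele_one = "This cat cat cat cat cat cat is my pet."
allele_two = "This cat cat is my pet."

for allele in (allele_one, allele_two):
    print(f"{longest_repeat_run(allele)} repeats, {len(allele)} characters long")
```

The two "alleles" differ both in repeat count and in total length, which is the property that lets tandem-repeat alleles be told apart by size.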
Allele population frequency data are interpreted in light of several population genetic theories. Two principles lay the foundation for analyses: the Hardy-Weinberg (H-W) equilibrium and the linkage equilibrium (L-E). The H-W model "states that there is a predictable relationship between allele frequencies and genotype frequencies at a single locus. This is a mathematical relationship that allows for the estimation of genotype frequencies in a population even if the genotype has not been seen in an actual population survey" (12). The L-E model is "the steady state condition of a population where the frequency of any multi-locus genotypic frequency is the product of each separate locus. This allows for the estimation of a DNA profile over several loci [plural of "locus"], even if the profile has not been seen in an actual population survey" (13). Population stratification must also be taken into consideration (14, 15). There are several primary allele population profiles such as those for Caucasians, Mexican-Americans, Asians, and Blacks (16). Geographic sub-population profiles are also available depending upon the laboratory. In addition, population databases must be large enough to provide confidence that the frequencies obtained are representative of those in actual populations. An example of how a population frequency profile, or histogram, is constructed follows. A "bin table" is first compiled for two hypothetical alleles (17); then a frequency profile is charted.
RFLP Bin Table

                        Locus A                   Locus B
Bin #  Size Range (bp)  Allele Counts  Frequency  Allele Counts  Frequency
    1  0-500                        0      0.000              3      0.004
    2  501-1000                     0      0.000             10      0.014
    3  1001-1500                    5      0.005             15      0.022
    4  1501-2000                    8      0.008              9      0.013
    5  2001-2500                   13      0.013             12      0.017
    6  2501-3000                   29      0.030             35      0.051
    7  3001-3500                   38      0.039             44      0.064
    8  3501-4000                   58      0.059             53      0.077
    9  4001-4500                  126      0.130             36      0.052
   10  4501-5000                   77      0.079             75      0.108
   11  5001-5500                   86      0.088             56      0.081
   12  5501-6000                   92      0.094             74      0.107
   13  6001-6500                  110      0.110            110      0.159
   14  6501-7000                  100      0.100            103      0.149
   15  7001-7500                   95      0.097             45      0.065
   16  7501-8000                   66      0.067             12      0.017
   17  8001-8500                   42      0.043              0      0.000
   18  8501-9000                   28      0.029              0      0.000
   19  9001-9500                    5      0.005              0      0.000
   20  9501-10000                   0      0.000              0      0.000
Total:                            978       1.00            692       1.00
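How the frequency column of a bin table follows from the allele counts can be sketched in Python. The counts below are the Locus B column of the table; the bin 1 count of 3 is an inference from the column total of 692, and the names LOCUS_B_COUNTS and bin_frequencies are hypothetical.

```python
# Allele counts per bin for Locus B, copied from the bin table (bins 1-20).
# Assumption: the bin 1 count is 3, inferred from the column total of 692.
LOCUS_B_COUNTS = [3, 10, 15, 9, 12, 35, 44, 53, 36, 75,
                  56, 74, 110, 103, 45, 12, 0, 0, 0, 0]


def bin_frequencies(counts):
    """Convert per-bin allele counts into relative frequencies."""
    total = sum(counts)
    return [count / total for count in counts]


frequencies = bin_frequencies(LOCUS_B_COUNTS)
for number, (count, freq) in enumerate(zip(LOCUS_B_COUNTS, frequencies), start=1):
    print(f"bin {number:2d}: count = {count:3d}, frequency = {freq:.3f}")
```

Each frequency is simply the bin's count divided by the total number of alleles sampled, so the frequencies sum to 1. Plotting these values against bin number would give the histogram, or frequency profile, described in the text.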
If all aspects of the H-W and L-E population guidelines have been properly considered and, in certain cases, the profiles corrected for variation from the models, then the probability for a match can be determined. The question to answer here is whether a match is a proof of guilt or whether it is due to chance alone. The proper null hypothesis (H0) is to assume that a match is due to random chance, and a probability value is calculated to quantify the uncertainty of this assumption (18, 19). It has been noted that one reference text (20) did set the null hypothesis to determine whether a person "is the source of an item of biological evidence," which would be the equivalent of looking for proof of guilt. This null hypothesis could lead to a Type I error in which a person is presumed to be guilty instead of presumed to be innocent until proven guilty.

To continue with the example, suppose that an individual X was sampled and DNA testing conducted to determine what type of allele is present at particular loci in his DNA. Individual X turned out to be heterozygous, meaning that on one chromosome the allele is "A," but on the other paired chromosome the allele is "a." The genotype for this individual is "Aa." The size of "A" was measured to be 2231 basepairs of DNA and would fall into "bin #5" of Locus A in the above table, with an expected frequency of 0.013. The size of "a" was measured to be 3510 basepairs and would be binned in "bin #8," corresponding to an expected frequency of 0.059. Taking into account that there are two copies of a gene in the genome, the chance that a person may have this specific "Aa" pattern is calculated as "2pq," where p (for "A") and q (for "a") are the frequencies of these alleles in the reference population. If p is 0.013 and q is 0.059, then 2pq is equal to 2 x 0.013 x 0.059, or 0.001534, which is 0.15% of the hypothetical relevant population.
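The binning and the 2pq step can be reproduced with a brief Python sketch. The two frequencies are the Locus A values for bins 5 and 8 from the bin table; the function names are hypothetical, and the homozygous case (p squared) is included only as the standard Hardy-Weinberg counterpart, since it is not part of the example.

```python
# Expected Locus A frequencies for bins 5 and 8, copied from the bin table.
LOCUS_A_FREQUENCY = {5: 0.013, 8: 0.059}


def bin_number(size_bp, bin_width=500):
    """Map a fragment size in basepairs to its 1-based bin (bin 1 covers 0-500)."""
    return max(1, (size_bp - 1) // bin_width + 1)


def heterozygote_frequency(p, q):
    """Hardy-Weinberg expected frequency of a heterozygous genotype: 2pq."""
    return 2 * p * q


def homozygote_frequency(p):
    """Hardy-Weinberg expected frequency of a homozygous genotype: p squared."""
    return p * p


p = LOCUS_A_FREQUENCY[bin_number(2231)]  # allele "A", 2231 basepairs
q = LOCUS_A_FREQUENCY[bin_number(3510)]  # allele "a", 3510 basepairs
genotype = heterozygote_frequency(p, q)
print(f"2pq = {genotype:.6f} ({genotype:.2%} of the reference population)")
```

The factor of 2 reflects that "A" may sit on either chromosome of the pair, with "a" on the other.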
Suppose also that three additional polymorphic loci were added to the panel to establish his profile. The genotypes at these three loci have the following calculated hypothetical frequencies: 0.06, 0.01, and 0.009. How common or how rare would this four-locus profile be in the reference population? Assume that the sequences are inherited in an independent fashion, in which the presence of any one of the alleles does not influence the occurrence of any of the other alleles. The product rule for probability calculation can then be applied as follows: 0.001534 x 0.06 x 0.01 x 0.009. The product is 8.28 x 10^-9. To express this number in another way, take the reciprocal, or 1/(8.28 x 10^-9), to obtain 120.7 million. The chance of finding this exact profile in the relevant population is one in 120.7 million people. And thus, the probability of finding another person (excluding identical twins) who would have the same profile by chance is extremely low.
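The product rule calculation can be checked with a few lines of Python. The four frequencies are the hypothetical values from the example; the names used are illustrative, and the calculation holds only under the stated assumption that the loci are inherited independently.

```python
import math

# Per-locus frequencies from the example (hypothetical values).
LOCUS_FREQUENCIES = [0.001534, 0.06, 0.01, 0.009]


def profile_frequency(frequencies):
    """Product rule: multiply per-locus frequencies, assuming independence."""
    return math.prod(frequencies)


# Show how the profile becomes rarer as each locus is added to the panel.
for n_loci in range(1, len(LOCUS_FREQUENCIES) + 1):
    freq = profile_frequency(LOCUS_FREQUENCIES[:n_loci])
    print(f"{n_loci} loci: frequency {freq:.3e}, about 1 in {1 / freq:,.0f}")

full_profile = profile_frequency(LOCUS_FREQUENCIES)
print(f"1 in {1 / full_profile / 1e6:.1f} million")
```

With all four loci the frequency comes to about 8.28 x 10^-9, or roughly one in 120.7 million, and each added locus makes the combined profile rarer.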
The higher the number of loci used for the analysis, the smaller the probability of finding a specific allele combination for an individual in a population. It is no surprise then that the FBI concluded in its report that the semen stains on Ms. Lewinsky's blue dress came from President Clinton. The probability of the semen profile matching the President's profile by chance was calculated to be extremely small. This evidence, in addition to the testimony given by Ms. Lewinsky about how her dress got stained, was sufficient to incriminate the President.
Footnotes and References:
1) David Micklos, Greg Freyer, DNA Science, A First Course in Recombinant DNA Technology. Cold Spring Harbor Laboratory Press and Carolina Biological Supply Company, 1990, p. 165.
2) "There are a few exceptions. For example, our red blood cells lack DNA. Blood itself can be typed because of the DNA contained in our white blood cells."
Donald Riley, "DNA Testing: Introduction for non-Scientists," Scientific Testimony: An Online Journal. University of Washington, 1998.
3) David Micklos, Greg Freyer, DNA Science, A First Course in Recombinant DNA Technology. Cold Spring Harbor Laboratory Press and Carolina Biological Supply Company, 1990, p. 167.
4) Office of the Independent Counsel, sub-Section "Evidence Establishing Nature of Relationship," Section "Narrative," The Starr Report, 1998.
5) FBI Lab Report, Lab no 980730002SBO and 980803100SBO, 8/17/98, as footnoted by The Starr Report (see reference 4).
6) Editorial, "DNA Fingerprinting Comes of Age," Science, Vol. 278, 21 Nov 1997, p. 1407.
7) Id.
8) Daniel Koshland, Jr., "DNA Fingerprinting and Eyewitness Testimony," Science, Vol. 256, 01 May 1992, p. 593.
9) William Thompson, "Examiner Bias in Forensic RFLP Analysis," published at http://www.scientific.org/case_in_point/Biased%20Interp.html
(discussion was first described in W. Thompson, "A Sociological Perspective on the Science of Forensic DNA Testing," U.C. Davis Law Review, 1997, 30(4), p. 1113-1136).
10) Keith Inman, Norah Rudin, An Introduction to Forensic DNA Analysis. CRC Press, 1997, p. 183.
11) Lorne T. Kirby, DNA Fingerprinting, An Introduction. Stockton Press, 1990, p. 150.
12) Keith Inman, Norah Rudin, An Introduction to Forensic DNA Analysis. CRC Press, 1997, p. 91.
13) Keith Inman, Norah Rudin, An Introduction to Forensic DNA Analysis. CRC Press, 1997, p. 91-92.
14) M. Krawczak, J. Schmidtke, DNA Fingerprinting. BIOS Scientific Publishers Ltd., 1994, p. 64-66.
15) Keith Inman, Norah Rudin, An Introduction to Forensic DNA Analysis. CRC Press, 1997, p. 91-92.
16) C. T. Caskey, et al., "Triplet Repeat Mutations in Human Disease," Science, Vol. 256, 8 May 1992, p. 785.
17) Keith Inman, Norah Rudin, An Introduction to Forensic DNA Analysis. CRC Press, 1997, p. 184.
18) Denise Casey, "Creating and Comparing DNA Profiles," Human Genome News, Vol. 8:1, Jul-Sept. 1996.
19) Lorne T. Kirby, DNA Fingerprinting, An Introduction. Stockton Press, 1990, p. 160-161.
20) Keith Inman, Norah Rudin, An Introduction to Forensic DNA Analysis. CRC Press, 1997, p. 87.