DNA Fingerprinting and the Calculation of Profile Frequency
Linh T. Do
MBA 604, Data Analysis, Professor Steve Huxley
University of San Francisco
Fall 1998, Term Paper
Executive Summary
Two prominent cases from this decade employed "DNA fingerprinting"
in their body of evidence. The more recent case is this year's
investigation of President William Jefferson Clinton by the Office
of the Independent Counsel led by Ken Starr. The earlier
case is the 1994-1995 trial of The People of the State of California
vs. Orenthal James Simpson. Although the two cases vastly differ
in context, the use of DNA typing surfaces in both, and this technology
has been brought under the scrutiny of everyday news. Probabilities
and conclusions, such as a "one in 7.87 trillion" chance
that the DNA profile found in an incriminating sample does or
does not match the accused, are reported without explanation of
how these odds are calculated. In this discussion, an attempt
is made to explain these probability values. A hypothetical example
is used to illustrate the process of frequency calculations.
In 1984 British researcher Alec Jeffreys at the University of
Leicester coined the term "DNA fingerprinting." His
research had led him to the discovery of a new molecular biology
method, the "multilocus probe" for human DNA analysis.
In simplified terms, this method enabled Dr. Jeffreys to generate
a pattern of a person's DNA profile that looks much like a bar
code for products in a supermarket (1). And like a bar code, which
is created to uniquely identify a product, a person's "DNA
fingerprint" can be a truly unique identification mark.
The pattern of ridged skin at the end of each person's fingers,
the traditional fingerprint, was the most unique means of human
identification for many years and is still the most widely used
identification mark. A person's fingerprint, however, can be altered,
whereas a DNA fingerprint, generated from deoxyribonucleic acid
(DNA), which is the basis of the code of life in every cell of
a person (2), cannot be disguised or changed.
Dr. Jeffreys began introducing DNA testing for cases of kinship
determination and murder investigations. Following in his footsteps,
commercial testing laboratories adapted various types of DNA-based
identification typing methods which employed scientific protocols
and principles similar to those used by Dr. Jeffreys (the details
of these scientific principles and protocols are beyond the scope
of this discussion and will not be included). Laboratories such
as Cellmark Diagnostics (now owned by Lifecodes Corporation, Connecticut)
provided DNA data for evidence in court. In 1988, Cellmark
was involved in the first criminal case in which the strength
of DNA fingerprint evidence helped convict two men in the rape
and murder of a woman (3). Cellmark's laboratory protocol differed
slightly from Dr. Jeffreys's since Cellmark used a cocktail of
"single-locus probes" for assessing the suspects' DNA
profiles.
In 1988, the Federal Bureau of Investigation (FBI) of the United
States (U.S.) Department of Justice implemented its DNA Analysis
Unit. The FBI became the first public crime laboratory in the
U.S. to perform forensic DNA analysis.
The use of DNA fingerprinting blossomed as commercial, university,
and government laboratories applied DNA testing for various objectives.
From the diagnosis of inherited diseases or determination of paternity
to the identification of armed forces war casualties or the linking
of suspects to biological evidence found at the scene of a crime,
the news about the various applications of DNA profiling has
proliferated. "DNA fingerprinting" has slowly crept
into the household vocabulary as a common term. Nowadays,
in forensic investigation of rape cases, DNA profiling is a prominent
priority.
Two famous cases employing DNA fingerprinting dominated the public
news for lengthy periods in this decade: this year's investigation
of President Clinton by the Office of the Independent Counsel
and the 1994-1995 O.J. Simpson trial. The physical evidence reported
in The Starr Report (4) "conclusively establishes that the
President and Ms. Lewinsky had a sexual relationship." This
conclusion was based on a match between the DNA profile of semen
stains collected from a navy blue dress turned over by Ms. Lewinsky
and the DNA profile of President Clinton's voluntary blood sample.
The FBI conducted the DNA testing, and according to its laboratory
reports, the chance that the semen is not the President's is one
in 7.87 trillion. This data along with other association evidence
led the FBI to conclude that "the President was the source
of the DNA obtained from the dress" (5). In The People of
the State of California vs. Orenthal James Simpson, the prosecutors
presented DNA evidence stating odds of billions to one that the
blood found at the scene was not Simpson's. One in a trillion
or one in a billion is a very minute probability for an event.
How is it that DNA typing can claim such exceedingly low probabilities
to rule out that an event happened by chance?
First, before an example is used to illustrate how these probabilities
are obtained, it is important to note that DNA fingerprinting
as a method is less than two decades old. In the time between
the O.J. Simpson trial and the release of The Starr Report,
technological advances in the use of "restriction fragment
length polymorphisms," or RFLPs, to generate DNA profiles
and larger databases with more complete data on the frequency
distributions of DNA patterns in ethnic populations have both
made DNA typing "much more precise" now (6). In fact,
"the new policy [of the FBI] states that if the likelihood
of a random match is less than one in 260 billion, the examiner
can testify that the samples are an exact match" (7). Experience
and training in this field are accumulating quickly, favoring
improved methods and standardization of laboratory and statistical
techniques.
Second, the power of resolution to differentiate among individuals
can be used as a tool not only by the prosecutor but also by the
defense. In up to a third of forensic cases involving DNA profiling
of suspects, or "individuals for whom there is enough other
evidence to go to trial," the accused are exonerated and
at times not brought to trial because of DNA evidence (8).
And third, the complexity of DNA typing should not be underestimated.
From the proper collection and storage of samples to the execution
of scientific laboratory methods, of which there are numerous
protocols, problems of degraded or contaminated samples may arise.
Examiner bias (9) can introduce errors into the raw data collection
step when an analyst is determining the visibility or the proper
sizes of DNA bands before matches are assessed. In validating
DNA band sizes, confidence intervals must be established. As stated
in one reference (10), "bands from the same source generally
fall within ± 1.8% of their combined average [size] value."
In other cases in which DNA test results are not clear cut, for
example, when some bands are clear and distinct and some are faint,
examiners have at times had to use their technical judgment and
experience to ascertain the data set.
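
The ± 1.8% window quoted above can be sketched in a few lines of Python. This is only one possible reading of the rule, assuming that two measured band sizes are declared a match when each lies within 1.8% of the average of the two; the function name and the sample sizes are hypothetical, and actual laboratory match criteria may be defined differently.

```python
def bands_match(size_a: float, size_b: float, tolerance: float = 0.018) -> bool:
    """Return True if both band sizes lie within +/- tolerance of their average.

    Assumption: the match window is taken around the mean of the two
    measurements, following the "+/- 1.8% of their combined average" wording.
    """
    mean = (size_a + size_b) / 2.0
    window = tolerance * mean
    return abs(size_a - mean) <= window and abs(size_b - mean) <= window


if __name__ == "__main__":
    # Hypothetical band sizes in base pairs.
    print(bands_match(2231, 2260))  # small difference
    print(bands_match(2231, 2400))  # large difference
```

A fixed percentage window makes the tolerance scale with fragment size, so larger fragments are allowed a larger absolute measurement difference.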
At this point, to answer the question of how exceedingly low
probabilities, which rule out the occurrence of an event due to
chance, are established, assume that all of
the steps just discussed are sound and valid as they lead up to
the determination of a match (or no match) and the calculation
of a probability for the strength of association of a match. The
probability of a specimen match can be calculated only if estimates
of the population frequency distribution of each allele (see below
for definition) are available. These two areas, frequency distributions
of a selected allele and probability, require sound statistical analyses
and are by no means immune from discussion and critical assessment.
To have a feel for the probability calculation, a brief review
of science terminology is necessary. The human genome, which includes
all 46 chromosomes, has coding and non-coding sequences. Expression
of coding sequences will show up as phenotypic features such as
eye and hair color. The role of non-coding sequences is not clearly
understood. However, there is much variability in these non-coding
areas of DNA. At some loci, or locations in the DNA, tandem
repeats of particular sequences are present, which is much like
the word "cat" repeating over and over in a sentence:
This cat cat cat cat cat cat is my pet. A locus that is polymorphic,
meaning it can have different forms, is useful for forensic DNA
typing. Each possible form is an allele. For example, the "cat"
sentence above has six "cat" repeats. Another allele
of this could be "This cat cat is my pet," which has
only two "cat" repeats. Note that the physical size
or length of these sentences differs, just as polymorphic DNA alleles
would differ in size. The ABO blood groups, for example, are an
expression of genetic polymorphism. Blood typing, however, does
not target DNA but rather identifies the gene products from DNA
expression. These polymorphic sites in non-coding regions differ
very frequently among individuals. For DNA typing, many variable
alleles have been found, and their complementary probes have been
developed. Empirical frequency data from random sampling of populations
have been collected over the last ten to fifteen years by different
laboratories for each commonly used probe. This sampling allows
one to infer information about population parameters such as the
frequencies of specific alleles at a given locus even though a complete
population profile will not be assembled (11).
The analysis of allele population frequency data is guided by several
population genetic theories. Two principles lay the foundation for
analyses: the Hardy-Weinberg (H-W) equilibrium and the linkage equilibrium
(L-E). The H-W model "states that there is a predictable
relationship between allele frequencies and genotype frequencies
at a single locus. This is a mathematical relationship that allows
for the estimation of genotype frequencies in a population even
if the genotype has not been seen in an actual population survey"
(12). The L-E model is "the steady state condition of a population
where the frequency of any multi-locus genotypic frequency is
the product of each separate locus. This allows for the estimation
of a DNA profile over several loci [plural of "locus"],
even if the profile has not been seen in an actual population
survey" (13). Population stratification must also be taken
into consideration (14, 15). There are several primary allele
population profiles such as those for Caucasians, Mexican-Americans,
Asians, and Blacks (16). Geographic sub-population profiles are
also available depending upon the laboratory. In addition, population
databases must be large enough to provide confidence that the
frequencies obtained are representative of those in actual populations.
An example of how a population frequency profile, or histogram,
is constructed follows. A "bin table" is first compiled
for two hypothetical loci (17); then a frequency profile is
charted.
RFLP Bin Table

                         Locus A                    Locus B
Bin #  Size Range (bp)   Allele Counts  Frequency   Allele Counts  Frequency
1      0-500             0              0.000       3              0.004
2      501-1000          0              0.000       10             0.014
3      1001-1500         5              0.005       15             0.022
4      1501-2000         8              0.008       9              0.013
5      2001-2500         13             0.013       12             0.017
6      2501-3000         29             0.030       35             0.051
7      3001-3500         38             0.039       44             0.064
8      3501-4000         58             0.059       53             0.077
9      4001-4500         126            0.130       36             0.052
10     4501-5000         77             0.079       75             0.108
11     5001-5500         86             0.088       56             0.081
12     5501-6000         92             0.094       74             0.107
13     6001-6500         110            0.110       110            0.159
14     6501-7000         100            0.100       103            0.149
15     7001-7500         95             0.097       45             0.065
16     7501-8000         66             0.067       12             0.017
17     8001-8500         42             0.043       0              0.000
18     8501-9000         28             0.029       0              0.000
19     9001-9500         5              0.005       0              0.000
20     9501-10000        0              0.000       0              0.000
Total:                   978            1.00        692            1.00
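
As a rough illustration of how such a bin table might be used, the following Python sketch assigns a measured fragment size to a bin and computes a bin frequency as the bin's allele count divided by the total count. The Locus A counts are copied from the table; the function names are hypothetical, and the assumption that each frequency is simply count divided by total is the author's reading of the table (a few tabulated values differ from this ratio in the last digit).

```python
# Allele counts for Locus A, bins 1 through 20, copied from the bin table.
LOCUS_A_COUNTS = [0, 0, 5, 8, 13, 29, 38, 58, 126, 77,
                  86, 92, 110, 100, 95, 66, 42, 28, 5, 0]

BIN_WIDTH = 500  # each bin spans 500 base pairs


def bin_number(size_bp: int) -> int:
    """Return the bin (1-20) for a fragment size in base pairs.

    Bin 1 covers 0-500, bin 2 covers 501-1000, ..., bin 20 covers 9501-10000.
    """
    if not 0 <= size_bp <= 10000:
        raise ValueError("size is outside the range covered by the table")
    return max(1, (size_bp - 1) // BIN_WIDTH + 1)


def bin_frequency(counts: list[int], bin_no: int) -> float:
    """Return the fraction of all sampled alleles that fall in the given bin."""
    return counts[bin_no - 1] / sum(counts)


if __name__ == "__main__":
    for size in (2231, 3510):  # hypothetical measured sizes in base pairs
        b = bin_number(size)
        freq = bin_frequency(LOCUS_A_COUNTS, b)
        print(f"{size} bp falls in bin {b} with frequency {freq:.3f}")
```

Binning trades resolution for robustness: fragments whose sizes cannot be distinguished reliably are pooled, and the pooled frequency is used in later calculations.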
If all aspects of the H-W and L-E population guidelines have
been properly considered, and in certain cases, the profiles corrected
for variation from the models, then the probability for a match
can be determined. The question to answer here is whether a match
is a proof of guilt or whether it is due to chance alone. The
proper null hypothesis (H0) is to assume that a match is due to
random chance and a probability value is calculated to quantify
the uncertainty of this assumption (18, 19). It has been noted
that one reference text did set the null hypothesis to determine
whether a person "is the source of an item of biological
evidence," which would be the equivalent of looking for proof
of guilt. This null hypothesis could lead to a Type I error in
which a person is presumed to be guilty instead of being presumed
innocent until proven guilty.
To continue with the example, suppose that an individual X was
sampled and DNA testing conducted to determine what type of allele
is present at particular loci in his DNA. Individual X turned
out to be heterozygous, meaning that on one chromosome, the allele
is "A," but on the other paired chromosome, the allele
is "a." The genotype for this individual is "Aa."
The size for "A" measured to be 2231 basepairs of DNA
and would fall into "bin#5" in the above
table with an expected frequency of 0.013. The size for "a"
measured to be 3510 basepairs and would be binned in "bin#8,"
corresponding to an expected frequency of 0.059. Taking into account
that there are two copies of a gene in the genome, the chance
that a person may have this specific "Aa" pattern is
calculated as "2pq," where p (for "A") and
q (for "a") are the frequencies of these alleles in
the reference population. If p is 0.013 and q is 0.059, then 2pq
is equal to 2 x 0.013 x 0.059, or 0.001534, which is about 0.15% of
the hypothetical relevant population.
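
The single-locus calculation can be checked with a short Python sketch. It assumes Hardy-Weinberg equilibrium, under which a heterozygous genotype has expected frequency 2pq and a homozygous genotype has expected frequency p squared; the function names are hypothetical and chosen only for illustration.

```python
def heterozygote_frequency(p: float, q: float) -> float:
    """Expected frequency of a heterozygous genotype under Hardy-Weinberg.

    p and q are the population frequencies of the two different alleles.
    """
    return 2.0 * p * q


def homozygote_frequency(p: float) -> float:
    """Expected frequency of a genotype carrying two copies of one allele."""
    return p * p


if __name__ == "__main__":
    # Bin frequencies for alleles "A" (bin 5) and "a" (bin 8) at Locus A.
    freq = heterozygote_frequency(0.013, 0.059)
    print(f"{freq:.6f}")  # prints 0.001534
```

The factor of 2 reflects the two ways of inheriting the pair: "A" from one parent and "a" from the other, or the reverse.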
Suppose also that three additional polymorphic loci were added
to the panel to establish his profile. The genotypes at these 3 loci
have the following calculated hypothetical frequencies: 0.06, 0.01, 0.009.
How common or how rare would this 4-locus profile be in the reference
population? Assume that the sequences are inherited in an independent
fashion in which the presence of any one of the alleles does not
influence the occurrence of any of the other alleles. The product
rule for probability calculation can then be applied as follows:
0.001534 x 0.06 x 0.01 x 0.009. The product is 8.28 x 10^-9. To
express this number in another way, take the reciprocal of the product
to obtain about 120.7 million. The chance of finding this exact profile
in the relevant population is one in 120.7 million people. And
thus, the probability of finding another person (excluding twins)
who would have the same profile by chance is extremely low.
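
The product rule for the four hypothetical loci can be sketched as follows. The sketch assumes, as stated above, that the loci are inherited independently (linkage equilibrium); the function name is hypothetical.

```python
import math


def profile_frequency(genotype_frequencies: list[float]) -> float:
    """Multiply per-locus genotype frequencies (the product rule).

    Valid only under the assumption that the loci are independent.
    """
    return math.prod(genotype_frequencies)


if __name__ == "__main__":
    # Locus A genotype frequency (2pq) followed by the three added loci.
    freqs = [0.001534, 0.06, 0.01, 0.009]
    combined = profile_frequency(freqs)
    print(f"profile frequency: {combined:.3e}")
    print(f"one in {1 / combined / 1e6:.1f} million")
```

Each added locus multiplies the profile frequency by a number smaller than one, which is why a panel with more loci yields a rarer expected profile.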
The higher the number of loci used for the analysis, the smaller
the probability of finding a specific allele combination for an
individual in a population. It is no surprise then that the FBI
concluded in its report that the semen stains on Ms. Lewinsky's
blue dress came from President Clinton. The probability of the
semen profile matching the President's profile by chance was calculated
to be exceedingly small. This evidence in addition to the testimony
given by Ms. Lewinsky about how her dress got stained was sufficient
to incriminate the President.
Footnotes and References:
1) David Micklos, Greg Freyer, DNA Science, A First Course in
Recombinant DNA Technology. Cold Spring Harbor Laboratory Press
and Carolina Biological Supply Company, 1990, p.165.
2) "There are a few exceptions. For example, our red blood
cells lack DNA. Blood itself can be typed because of the DNA contained
in our white blood cells."
Donald Riley, "DNA Testing: Introduction for non-Scientists,"
Scientific Testimony: An Online Journal. University of Washington,
1998.
3) David Micklos, Greg Freyer, DNA Science, A First Course in
Recombinant DNA Technology. Cold Spring Harbor Laboratory Press
and Carolina Biological Supply Company, 1990, p.167.
4) Office of the Independent Counsel, sub-Section "Evidence
Establishing Nature of Relationship," Section "Narrative,"
The Starr Report, 1998.
5) FBI Lab Report, Lab no 980730002SBO and 980803100SBO, 8/17/98,
as footnoted by The Starr Report (see reference 4).
6) Editorial, "DNA Fingerprinting Comes of Age," Science,
Vol. 278, 21 Nov 1997, p. 1407.
7) Id.
8) Daniel Koshland, Jr., "DNA Fingerprinting and Eyewitness
Testimony," Science, Vol. 256, 01 May 1992, p. 593.
9) William Thompson, "Examiner Bias in Forensic RFLP Analysis,"
published at http://www.scientific.org/case_in_point/Biased%20Interp.html
(discussion was first described in W. Thompson, "A Sociological
Perspective on the Science of Forensic DNA Testing," U.C. Davis
Law Review, 1997, 30(4), p. 1113-1136).
10) Keith Inman, Norah Rudin, An Introduction to Forensic DNA
Analysis. CRC Press, 1997, p. 183.
11) Lorne T. Kirby, DNA Fingerprinting, An Introduction. Stockton
Press, 1990, p. 150.
12) Keith Inman, Norah Rudin, An Introduction to Forensic DNA
Analysis. CRC Press, 1997, p. 91.
13) Keith Inman, Norah Rudin, An Introduction to Forensic DNA
Analysis. CRC Press, 1997, p. 91-92.
14) M. Krawczak, J. Schmidtke, DNA Fingerprinting. BIOS Scientific
Publishers Ltd., 1994, p. 64-66.
15) Keith Inman, Norah Rudin, An Introduction to Forensic DNA
Analysis. CRC Press, 1997, p. 91-92.
16) C. T. Caskey, et al., "Triplet Repeat Mutations in Human Disease,"
Science, Vol. 256, 8 May 1992, p. 785.
17) Keith Inman, Norah Rudin, An Introduction to Forensic DNA
Analysis. CRC Press, 1997, p. 184.
18) Denise Casey, "Creating and Comparing DNA Profiles," Human
Genome News, Vol. 8:1, Jul-Sept. 1996.
19) Lorne T. Kirby, DNA Fingerprinting, An Introduction. Stockton
Press, 1990, p. 160-161.
20) Keith Inman, Norah Rudin, An Introduction to Forensic DNA
Analysis. CRC Press, 1997, p. 87.