Jody Hey, Evolutionary Genetics
Professor - Department of Genetics - Rutgers University
**SITES**
-- DNA POLYMORPHISM ANALYSIS PROGRAM -- DOCUMENTATION

Jody Hey
Department of Genetics
Rutgers University
Nelson Biological Labs
604 Allison Rd.
Piscataway, NJ 08854-8082
732-445-5272
fax 732-445-5870
http://lifesci.rutgers.edu/~heylab

* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it.

_______________________
Overview
_______________________

SITES is a computer program for the analysis of comparative DNA sequence data. Basic analyses include: data summaries by polymorphism class; polymorphism estimates within and between groups (species); estimates of migration, neutral model, and recombination parameters; and linkage disequilibrium analyses. SITES is primarily intended for data sets with multiple closely related sequences. It is especially useful when multiple sequences have been obtained from each of one or several closely related populations or species.
SITES can handle large data sets, and is flexible with regard to the input data. With a few commands on the command line, or in response to menus, one can tailor a particular data set in different ways to suit different questions, including working on only a subset of the data and regrouping the data.
If you find the need to mention the program in a publication, you may cite the following reference, which mentions the program:

Hey, J. and J. Wakeley. 1997. A coalescent estimator of the population recombination rate. Genetics 145: 833-846.
_______________________
New Features
_______________________

The major recent addition, as of 2001, is a suite of linkage disequilibrium analyses. SITES will also generate lines suitable for input to the HKA and WH computer programs, and it now handles alternative genetic codes.

Note to previous users: some command line flags have changed. Check the menus described below.
_______________________
Input File Format
_______________________

There are three kinds of input format: SITES format, PHYLIP sequential format, and PHYLIP interleaved format. PHYLIP is Joe Felsenstein's package of phylogenetic programs. The first line of a PHYLIP format file begins with two integers. SITES checks whether the first character of a file is a digit (i.e., 0..9) and, if so, assumes that the file is in PHYLIP format. If the first character is not a digit, it assumes SITES format. SITES format is very similar to PHYLIP sequential format. If multiple analyses are to be run, it is most efficient to set up a SITES format file.

Data Restrictions. The only characters that are allowed in the sequences are 'A','a','G','g','C','c','T','t','N','n','-','.', and '*'. 'N' and 'n' represent base positions where the sequence is not known. '-', '*', and '.' represent base positions where one sequence has a gap relative to another sequence. Any other character causes the program to insert an 'N' in its place and to display a warning at runtime. Each DNA sequence must have exactly the same number of characters. The only exception is that the data can contain spaces (i.e. ' '), which are ignored.
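The format check and the data restrictions described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the code SITES itself uses. The function names (`detect_format`, `clean_sequence`, `same_length`) are hypothetical, and reporting replaced characters with 1-based positions is an assumption.

```python
# Characters SITES accepts in sequences, as listed under Data Restrictions.
ALLOWED = set("AaGgCcTtNn-.*")


def detect_format(file_text):
    """Guess the input format from the first character of the file."""
    if file_text and file_text[0] in "0123456789":
        return "PHYLIP"
    return "SITES"


def clean_sequence(raw):
    """Drop spaces and replace disallowed characters with 'N'.

    `raw` is one sequence without line breaks. Returns the cleaned
    sequence and a list of (position, character) pairs, one per
    replacement. Positions are 1-based (an assumption of this sketch).
    """
    cleaned = []
    replaced = []
    for ch in raw:
        if ch == " ":
            continue  # spaces are ignored
        if ch in ALLOWED:
            cleaned.append(ch)
        else:
            replaced.append((len(cleaned) + 1, ch))
            cleaned.append("N")
    return "".join(cleaned), replaced


def same_length(sequences):
    """True when every cleaned sequence has the same number of characters."""
    return len({len(s) for s in sequences}) <= 1
```

For example, `clean_sequence("ACG T?R-")` drops the space and replaces the two unrecognized characters, which corresponds to the cases where SITES would display a warning.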
Sample SITES Format

Below is a sample SITES input file for 10 sequences, each of 50 characters. There is one noncoding region extending from base 20 to base 44, inclusive, and the first base of the coding region (i.e. base 1) is in frame 3 of the codon. There are four groups of sequences.

****
SITES sample input
10 50
3 1
20 44
4
simulans 3
mauritiana 2
sechellia 2
melanogaster 3
SI-CA1    CAGGGTGTCCGACTCGGCCTACTCGAGCA......GCAACAGCCAGTCAC
SI-CA2    CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAACAGCCAGTCAC
SI-K1     CAGGGTGTCCGACTCGGCCTACTCGAACA......GTAACAGCCAGTCAC
MA-1      CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAATAGCCAGTCGC
MA-2      CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAATCGCCAGTCGC
SE-C1     CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGTCGC
SE-P1     CAAGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAACAGCCAGTCGC
ME-NJ1    CAAGGTGTCCGACTCGGCCTACCCGAACGGCTGCAGCAACAGCCAGCCGC
ME-K1     CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGCCGC
ME-LI1    CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGCCGC
****

For many questions, intron and exon boundaries are not relevant. To run SITES without providing information on introns and exons, simply indicate that the entire sequence is one large noncoding sequence. In this case, the value of the first coding base is irrelevant. For example, to do this with the sample data set above, lines 3 and 4 of the data file would be:

1 1
1 50

PHYLIP sequential and interleaved formats:
_______________________
Running the Program
_______________________

The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. To start the program, go to the folder that contains the data file and the program, and type the name of the program (e.g. 'sites'). The program asks several questions about the data file and the desired analysis. Nearly all commands and options can also be entered using command line parameters.

On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered. The user can also just hit return at this point and the program will request runtime parameters.

_______________________
Site Options Menu
-----------------------

If the 'S' flag is not used when the program is started, then the following menu will appear.

SITE OPTIONS ARE :
  a  for All site types
  i  for all noncoding base changes
  e  for all coding base changes
  s  for synonymous coding sites
  r  for replacement coding sites
  o  for ambiguous coding sites
  d  for insertion/deletion (indel) sites
  f  for only informative sites
  n  for only transitions
  v  for only transversions
  z  to skip all positions with more than 2 base values
  x  to skip all positions within indels
TYPE THE LETTERS OF THE DESIRED SITE TYPES (no spaces):

These options determine the information that is included in the polymorphism table, and they may affect which kinds of polymorphic sites are subject to other analyses. The user should enter a string of characters corresponding to the types of polymorphic sites that are to be included in the analyses. For example, an analysis of only informative synonymous sites is requested by entering 'fs' or 'sf'. The most common analysis is done on all types of polymorphic sites (i.e. 'a'). 'Ambiguous coding sites' are those for which the program could not determine whether a coding base change was a replacement or a synonymous change. 'z' causes all positions with more than two types of base values to be excluded from the analysis. 'x' causes all positions within regions where some of the lines differ by indels to be excluded from the analyses. By invoking both 'z' and 'x', the data set can be reduced to just those sites that fit a simple infinite sites model of mutation.

Analysis Menu
-----------------------------

If the 'A' command is not used when the program is started, then the following menu will appear.

ANALYSIS TYPES ARE:
  a  for all basic types (except r,m,g,l)
  s  for basic site table
  e  for coding change details
  c  for codon usage tables
  p  for polymorphism analysis
  i  for indel site table & indels in recombination analysis
  r  for recombination analysis
  m  for population model fitting
  l  (L) for linkage disequilibrium analyses (invokes r)
  g  for GC analyses
TYPE THE LETTERS OF THE DESIRED ANALYSES (no spaces):
· 'a' calls for the most basic analyses, including s, e, c, p & i.
· 's' calls for a table of polymorphic base positions to appear in the output file.
· 'e' calls for a table that contains the codons that are the sites of synonymous, replacement, and ambiguous polymorphisms. This is especially useful for determining the likely sequence of mutations when multiple polymorphisms occur within a codon.
· 'c' calls for a table of codon usage. Only the 'universal' code is used. Two tables are printed, one for the first sequence in the data file, and one for all of the sequences. Also included are counts of the numbers of synonymous and replacement bases (i.e. relative proportions of random mutations expected to cause synonymous or replacement changes). The counts of synonymous bases can be used to calculate the number of synonymous polymorphisms per synonymous base. Similarly, the counts of replacement bases can be used to calculate the number of replacement polymorphisms per replacement base.
· 'p' calls for a set of analyses on polymorphism within and among groups.
· 'i' calls for a polymorphism table for indel variation. This option also causes indels to be included in some recombination analyses, if the 'r' option is also used.
· 'r' calls for a set of recombination analyses.
· 'm' calls for fitting the ISOLATION SPECIATION MODEL to pairs of groups of sequences, and the POPULATION SIZE CHANGE MODEL to each group of sequences. These analyses are not applicable for many data sets, and they can be slow for data sets with large groups.
· 'l' ('L') calls for the Linkage Disequilibrium Options Menu to appear.
· 'g' calls for a table of GC content by codon position and for noncoding regions, for each of the sequences.

Data And Output Options Menu
-----------------------------------

If the 'O' command is not used when the program is started, then the following menu will appear. Several of these commands require additional information that must be entered in response to queries, or provided with other command line flags.

DATA OPTIONS:
  g  for comparisons within groups
  n  for new group designations
  d  for dropping some sites
  m  for data limited to some sites
  x  for dropping some sequences
  c  for non-canonical genetic code
OUTPUT OPTIONS:
  s  for no screen output during analysis
  p  suppress large pairwise difference table
  h  text line of input values for HKA program
  w  text line of input values for WH program
  f  for first sequence reference in site table
  t  for table style: (. and -) replace (- and *)
TYPE THE LETTERS OF THE DESIRED OPTIONS (no spaces):
DROPPING SOME SITES, enter option :
  r  for range
  f  for file
  i  for individual entry

The user should enter 'r', 'f', or 'i', and follow the directions. In the case of 'f', the user should previously have created a file that contains the base positions that are to be dropped. The format is simply a list of numbers, one number per line. In the case of 'r' or 'i', the program asks the user for the range limits or the individual base positions, respectively.
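A file for the 'f' option can be prepared by hand, or generated with a short script. The Python sketch below produces the one-number-per-line layout described above. The function name and the file name `dropsites.txt` are hypothetical. Sorting the positions, removing duplicates, and treating positions as 1-based are assumptions of the sketch, not documented requirements of SITES.

```python
def format_drop_list(positions):
    """Return the text of a drop-sites file: one base position per line."""
    unique = sorted(set(positions))  # sorting and de-duplication are assumptions
    if any(p < 1 for p in unique):
        raise ValueError("base positions are assumed to start at 1")
    return "".join(f"{p}\n" for p in unique)


if __name__ == "__main__":
    # Hypothetical file name; drop base positions 20, 21 and 44.
    with open("dropsites.txt", "w") as handle:
        handle.write(format_drop_list([44, 20, 21, 20]))
```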
Linkage Disequilibrium Options Menu
-------------------------------------------

If the 'L' command is not used on the command line, and the 'L' option is listed in response to the Analysis Choice Menu, then the following menu will appear. Some of these commands require additional information that must be entered in response to queries, or provided with other command line flags.

Linkage Disequilibrium (LD) Analyses:
  m  print matrix of pairwise values and significance tests
  y  measure and compare average LD by regions
  t  randomization tests of average LD by regions
  s  analyze LD among polymorphism shared between groups
  x  exclude singletons from all LD analyses
Apply Analyses to the Following LD Measures
  d  D - standard linkage disequilibrium (default)
  p  D' - D prime = D/Dmax
  b  |D| - absolute value of D
  a  |D'| - absolute value of D prime
  r  correlation coefficient
  q  r^2 - squared correlation coefficient
TYPE THE LETTERS OF DESIRED ANALYSES AND MEASURES (no spaces):
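For readers unfamiliar with the six measures in the menu, the following Python sketch computes them for one pair of biallelic sites using the usual textbook definitions. It is not taken from the SITES source, and SITES may differ in details. In particular, the function name `ld_measures`, the choice of the first sequence's base as the reference allele at each site (which fixes the sign of D, D' and r), and returning 0 when a site is not polymorphic are all assumptions.

```python
from math import sqrt


def ld_measures(site1, site2):
    """LD measures for two biallelic sites.

    Each argument is a string with one base per sequence, in the same
    sequence order. No N's or gaps are expected (an assumption).
    """
    n = len(site1)
    if n == 0 or n != len(site2):
        raise ValueError("sites must be non-empty and of equal length")
    ref1, ref2 = site1[0], site2[0]  # arbitrary reference alleles
    p1 = sum(x == ref1 for x in site1) / n
    p2 = sum(y == ref2 for y in site2) / n
    p12 = sum(x == ref1 and y == ref2 for x, y in zip(site1, site2)) / n

    d = p12 - p1 * p2
    # Dmax is the largest magnitude D can take given the allele frequencies.
    if d >= 0:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    else:
        d_max = min(p1 * p2, (1 - p1) * (1 - p2))
    d_prime = d / d_max if d_max > 0 else 0.0

    denom = sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    r = d / denom if denom > 0 else 0.0
    return {"D": d, "D'": d_prime, "|D|": abs(d),
            "|D'|": abs(d_prime), "r": r, "r^2": r * r}
```

Two sites in complete association, such as `ld_measures("AAGG", "CCTT")`, give D' and r^2 of 1, while `ld_measures("AAGG", "CTCT")` gives 0 for every measure.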
_______________________
Command line usage
_______________________

Nearly all commands and options can be given at the command line. Using the command line permits many analyses to be automated, so that they can be easily repeated and, if desired, run from batch files.

Each command line parameter flag may be upper or lower case and may be preceded by a '-', '\' or '/'. Following a command line parameter flag is a string of characters appropriate to that command.

Flag  parameter string                         example
----  ---------------------------------------  ----------------
I     the name of the data file                -Imydata.seq
R     the name of the output file              -Rmyresult
M     message up to 78 characters (no blanks)  -Mmy_data
S     kinds of polymorphic sites to analyze    -Sa   (see Site Options Menu above)
A     kinds of analyses to perform             -Aspr (see Analysis Menu above)
O     data and output options                  -Ogc  (see Data and Output Menu above)
L     linkage disequilibrium options           -Lyrp (see LD Analysis Menu)
C     alternate genetic code                   -Cf   (see Alternative Genetic Code Menu above)

Any of these parameters can be given at the command line when the program is started. Either uppercase or lowercase letters may be used for flags and for their parameter strings. For example, the following command line, entered in a command prompt window, will generate a file called myresult.sit.

sites -isitestestdata -rmyresult -mmy_data -sa -aspr -ogc -cf

For each parameter flag that is not included at the command line when the program is started, the program will ask for the information. None of the command line parameter flags is required. The program will also ask for more information if PHYLIP style data formats are used, and if some of the analysis options are used.

The data file can have any name, though it should not include the folder information. This usually means that the program and the data need to reside in the same folder. The name of the output file should not have an extension. The characters '.SIT' will be added to the output file name that is given to the program. Thus, for the example above, the program would produce a file called 'myresult.sit'.

Three of the flags ('S', 'A', and 'O') correspond to menus (see above). If the flag is not used at the command line, a menu will appear, requesting user input.

Additional command line options can be used to limit the size of the data set if the 'O' flag or the 'L' flag is used. These secondary flags should be placed after the respective primary flag. Note that command line flags for dropping sequences can be used either in conjunction with flags for excluding base positions, or with flags for keeping certain base positions.
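When many runs are to be repeated, the command line can be assembled by a script rather than typed. The Python sketch below builds the argument list for the eight primary flags in the table above. The function name `build_command` is hypothetical. Rejecting blanks in every parameter string is an assumption (the documentation states it only for the message), and the secondary flags are not covered.

```python
KNOWN_FLAGS = ("I", "R", "M", "S", "A", "O", "L", "C")


def build_command(program="sites", **params):
    """Return an argument list such as ['sites', '-imydata.seq', '-sa']."""
    args = [program]
    for flag, value in params.items():
        upper = flag.upper()
        if upper not in KNOWN_FLAGS:
            raise ValueError(f"unknown flag: {flag}")
        if " " in value:
            # Assumption: a blank would split the parameter string in two.
            raise ValueError(f"parameter string for {flag} contains a blank")
        if upper == "M" and len(value) > 78:
            raise ValueError("message is longer than 78 characters")
        if upper == "R" and "." in value:
            raise ValueError("output file name should not have an extension")
        args.append(f"-{flag}{value}")
    return args


if __name__ == "__main__":
    # Reproduces the example command line from the text.
    command = build_command(i="sitestestdata", r="myresult", m="my_data",
                            s="a", a="spr", o="gc", c="f")
    print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` from the folder that holds the program and the data file, provided the program is installed under the name used here.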
_______________________

When running, the program writes some messages to the screen (unless the 's' option in the OUTPUT OPTIONS menu is used). While reading in the data, the program writes the base position of each polymorphic site as it is found. Following this, the program writes a brief message for each category of analysis being conducted.

The output of the program is all contained in one file that has a '.SIT' extension. This file can be quite long for large data sets and complete analyses. If there are many sequences, the 'p' option in the OUTPUT OPTIONS menu can cut down on the size of this file.

For the most part, the formatting of the output is not hard to follow. At the top of the file, the run parameters are listed, as are the sequence and group names. For a full analysis of a data set with multiple groups, the following headings are generated:

-------------------------------
This table provides the base position and type of polymorphism for all variable base positions in the data set (excluding those sequences and regions of sequences that were not included in the analysis).

-----------------------
An approximate list of counts of the number of base positions associated with each kind of polymorphism in the table. These counts are rough because some base positions may have multiple kinds of polymorphisms.

--------------
A set of two tables. The first is very similar to that for polymorphic sites, except that each distinct indel (regardless of length) gets just one position. The second table lists the sequence and position of each distinct indel.

Table of Synonymous, Replacement and Uncertain Exon Changes
--------------------------------------------------------------------------
A list of all (or most) coding region base changes. With each polymorphism the different codon states are shown. This can be used to resolve cases where multiple changes have occurred within the same codon, and where it was not clear to the program whether a change was synonymous or replacement. This analysis may not be complete if there are three or four bases segregating at a position, or in cases where a single aligned codon has many amino acids segregating.

---------------------
A table of codon usage for the first sequence. Also included are counts of the numbers of synonymous and replacement sites (i.e. relative proportions of random mutations expected to cause synonymous or replacement changes). The calculation of these values is best explained with an example. Consider a codon (e.g. AAA for lysine) and consider each base in the codon. For each base, the fraction of all mutations that change the codon without changing the amino acid is counted. For AAA, 0 of the three possible mutations at the first base lead to a synonymous change, similarly 0 for the second base, and 1/3 of the mutations at the third base (because an A -> G change leads to AAG, which is also lysine). So the number of synonymous sites in an AAA codon is 1/3. The number of replacement sites is 3 - 1/3 = 2 2/3. Every codon gets a score this way, and the final tally is just the sum of scores for all codons.

Codon Usage Table For All Lines
-------------------------------------
A table of codon usage for all sequences and counts of synonymous and replacement sites summed across all sequences.

--------------------------
Most analyses are applied only to those types of sites that are specified in the SITE OPTIONS menu, but there are exceptions - see below.
· BASE PAIR COMPARISONS - not including N's or indels - counts of the number of bases compared and the number of differences for all pairs of sequences. Essentially a distance matrix. The counts of the numbers of bases compared are based on the entire sequence, or a shorter region, depending on the 'x' SITE OPTION, and the 'l' and 'm' ANALYSIS OPTIONS. The counts of the number of bases compared are not reduced by other SITE OPTIONS choices. However, the counts of site differences are affected by SITE OPTIONS choices. This table is not generated if the 'p' option is used in the OUTPUT OPTIONS.
· GROUP DIFFERENCES - not including N's or indels - a matrix with group by group comparisons. Above and on the diagonal are the average pairwise differences for those sites specified under SITE OPTIONS. Below the diagonal is the net average pairwise divergence (e.g. Nei, 1987, p.276).
· GROUP DIFFERENCES PER BASE PAIR - same as above but numbers are per base pair. Note that the divisor is the average number of base pairs compared, calculated from the same numbers above the diagonal in the BASE PAIR COMPARISONS table. Thus these per base pair measures may be misleading for some SITE OPTIONS choices. For example, if only synonymous sites are analyzed (option 's' under SITE OPTIONS), a per base pair measure of divergence can be obtained by dividing the numbers in the GROUP DIFFERENCES matrix by the estimated number of synonymous sites, which is given beneath the CODON USAGE TABLE.
· FIXED DIFFERENCES - A fixed difference is a polymorphic site at which all of the sequences of one group are different from all of the sequences of a second group.
· SHARED POLYMORPHISMS - A shared polymorphism is a polymorphic site at which each of two groups of sequences is found to have at least two of the same bases.
· Fst AND POPULATION MIGRATION RATES - Fst values between pairs of populations, and estimates of Nm, assuming diploidy (i.e. N is the effective number of diploid individuals), calculated according to equation 4 (except using a factor of 1/4) of Hudson et al., 1992. This estimate should be multiplied by 4/3 to get the corresponding number for an X linked locus. It should be multiplied by 2 to get the corresponding number for a haploid locus. Also, if the locus is haploid and sex-limited (e.g. mitochondria), the estimate should be multiplied by 2, and then it applies only to the number of individuals of the sex that carries the locus.
· POLYMORPHIC SITE FREQUENCIES PER GROUP - FOLDED & ROOTED - These tables give counts of polymorphic sites, classified by the number of lines that carry a polymorphism. For example, a site at which two sequences are different from the remaining n-2 sequences is counted in either category 2 or category n-2 (depending on whether the distribution is folded or, if not, on the value of the root sequence). There are two tables: one for the folded distribution, which shows only the frequencies of the rarest bases, and one for the rooted distribution. A rooted distribution is not shown if the analysis includes only one group. The table for the rooted distribution is based on an outgroup sequence chosen from one of the other sequence groups. The method for picking outgroups is very simple: the outgroup sequence for one group is the first sequence listed in the most divergent other group. This may not be an ideal outgroup for various reasons. Some properties of these distributions are known for some models, and the distributions are useful for considering questions about changing population size and natural selection. See papers by Tajima and also by Fu.
· SITES WITH MORE THAN TWO BASES SEGREGATING - A number of analyses assume an infinite sites model, under which a polymorphic site is caused by exactly one mutation and there can be no more than two bases segregating. This table lists those sites that are clearly not consistent with this assumption. If desired, the program can be rerun with these sites excluded, using the 'z' SITE OPTION (or the 'd' DATA OPTION with these positions listed).
· D STATISTICS - These indices are measures of departure from a neutral Fisher-Wright model. See Tajima (1989) and Fu and Li (1993). These statistics rely on a count of the number of polymorphic sites. The counts that are used come directly from the site frequency distribution. However, these counts will underestimate the actual number of mutations, particularly if some sites are segregating more than two bases within a sequence sample group. For Fu and Li's D, an outgroup sequence is picked from among the other groups in the data set (if any occur - see above for POLYMORPHIC SITE FREQUENCIES). Note that the outgroup sequence that is picked may not be ideal, depending on the divergence among groups. Also given are counts of the different classes of mutations, as defined by Fu and Li.
· THETA (4Nu) ESTIMATES - Two different estimates of the neutral mutation parameter 4Nu: Watterson's estimate and nucleotide diversity, or pi. See, for example, Hudson (1990) or Tajima (1993). Also listed are the number of sequences for each group and the number of bases, which is calculated by taking the average of the number of bases compared in the pairwise comparison matrix.
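The two THETA estimates named in the last item can be illustrated with the standard textbook formulas. The Python sketch below is not the SITES implementation. It assumes a single group of aligned sequences with no N's or indels, counts every variable position as one segregating site, and reports both values for the whole region rather than per base pair. The function names are hypothetical.

```python
from itertools import combinations


def watterson_theta(seqs):
    """Watterson's estimate: S divided by the sum of 1/i for i = 1..n-1."""
    n = len(seqs)
    if n < 2:
        raise ValueError("at least two sequences are needed")
    # S is the number of alignment columns with more than one base.
    segregating = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / a_n


def nucleotide_diversity(seqs):
    """pi: the average number of differences over all pairs of sequences."""
    if len(seqs) < 2:
        raise ValueError("at least two sequences are needed")
    pairs = list(combinations(seqs, 2))
    total = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return total / len(pairs)
```

With the four sequences `["ACGT", "ACGA", "ACTA", "ACTA"]` there are two segregating sites, so Watterson's estimate is 2 / (1 + 1/2 + 1/3) = 12/11, and the six pairwise comparisons differ at 7 positions in total, so pi is 7/6.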
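The counting of synonymous and replacement sites, explained with the AAA example in the description of the codon usage table, can be sketched as follows. This Python code uses the standard ('universal') genetic code and follows the rule as stated in the text. Treating a mutation to or from a stop codon as a replacement change, and ignoring any incomplete trailing codon, are assumptions, and SITES may handle those cases differently. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon):
    """Number of synonymous sites in one codon, as an exact fraction."""
    codon = codon.upper()
    amino_acid = CODON_TABLE[codon]
    total = Fraction(0)
    for pos in range(3):
        unchanged = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == amino_acid:
                unchanged += 1
        # Fraction of the three possible mutations that are synonymous.
        total += Fraction(unchanged, 3)
    return total


def count_sites(coding_sequence):
    """Return (synonymous, replacement) site counts summed over codons."""
    n_codons = len(coding_sequence) // 3  # trailing partial codon ignored
    synonymous = sum(
        (synonymous_sites(coding_sequence[3 * i:3 * i + 3])
         for i in range(n_codons)),
        Fraction(0),
    )
    return synonymous, 3 * n_codons - synonymous
```

For the codon in the text, `synonymous_sites("AAA")` returns `Fraction(1, 3)`, and `count_sites("AAA")` gives 1/3 synonymous and 8/3 (that is, 2 2/3) replacement sites.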