Jody Hey                  Evolutionary Genetics

  Professor    -     Department of Genetics     -   Rutgers University



 

**SITES** -- DNA POLYMORPHISM ANALYSIS PROGRAM --

 

DOCUMENTATION  

Jody Hey 

Department of Genetics

Rutgers University

Nelson Biological Labs

604 Allison Rd.

Piscataway, NJ  08854-8082

732-445-5272

fax 732-445-5870

hey@biology.rutgers.edu

http://lifesci.rutgers.edu/~heylab

 

* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it. 

_______________________

 Contents

_______________________

______________________

 Overview   

_______________________

 

SITES is a computer program for the analysis of comparative DNA sequence data.  Basic analyses include: data summaries by polymorphism class;  polymorphism estimates within and between groups (species); estimates of migration, neutral model, and recombination parameters; and linkage disequilibrium analyses.  SITES is primarily intended for data sets with multiple closely related sequences. It is especially useful when multiple sequences have been obtained from each of one or several closely related populations or species.  

 

SITES can handle large data sets, and is flexible with regard to the input data.  With a few  commands on the command line, or in response to menus, one can tailor a particular data set in different ways to suit different questions, including working on only a subset of the data and regrouping the data. 

 

If you need to mention the program in a publication, you may cite the following reference, which mentions the program:

Hey, J and J. Wakeley. 1997. A coalescent estimator of the population  recombination rate.  GENETICS 145: 833-846.

______________________

 

Downloadable Files

______________________

 

_______________________

 

 New Features

_______________________

 

The major recent addition, as of 2001, is a suite of  linkage disequilibrium analyses.

SITES will also generate lines suitable for input to the HKA  and WH computer programs. 

SITES will also now handle alternative genetic codes.

Note to previous users: some command line flags have changed. Check MENUS.

_______________________

 

 Analyses 

_______________________

 

  • Polymorphism Table SITES identifies polymorphic sites and generates a table summarizing polymorphism information. Polymorphisms are characterized with regard to several categories: Noncoding or Coding; Synonymous, Replacement or Ambiguous; Transversion or Transition; Insertion/Deletion (Indels)

 

  • Indel Table SITES tries to identify the boundaries of insertions/deletions (indels)  and generates a polymorphism table with its best guess of indel variation.

 

  • Codon Usage Tables (assuming the standard genetic code)

 

  • Numbers of synonymous and replacement base positions. These can be interpreted as the relative proportions of random mutations that would be expected to cause synonymous or replacement changes.

 

  • Numbers of pairwise differences among sequences.

 

  • GC content for complete sequences and each codon position.

 

  • Polymorphism Analyses  SITES conducts several kinds of polymorphism analysis that can be  applied to the entire data set or to multiple subsets of sequences.
    • The program generates the two most commonly used estimates of the neutral mutation parameter Theta (or 4Nu, where N is the effective population size and u is the neutral mutation rate):
      • pairwise nucleotide diversity or pi (Nei, 1987 p256).
      • Watterson's estimator (Watterson, 1975).
    • The program determines several measures of non-neutrality in the site frequency distribution, including Tajima's D (Tajima, 1989), Fu and Li's D and Fu and Li's D* (Fu and Li, 1993).
    • The program determines the site frequency distribution for each group, both rooted and unrooted.
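As a rough illustration of the two Theta estimators listed above, the following Python sketch computes the mean number of pairwise differences (pi) and Watterson's estimator for a small alignment. It is not the SITES source code: the function names are hypothetical, the sequences are assumed to be aligned and free of gaps and 'N' characters, and the values are reported per sequence rather than per base.

```python
from itertools import combinations


def pairwise_diversity(seqs):
    """Mean number of differences over all pairs of sequences (pi)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(1 for a, b in zip(s, t) if a != b) for s, t in pairs)
    return total / len(pairs)


def segregating_sites(seqs):
    """Number of alignment columns with more than one base value."""
    return sum(1 for col in zip(*seqs) if len(set(col)) > 1)


def watterson_theta(seqs):
    """Watterson's estimator: S divided by the harmonic sum 1 + 1/2 + ... + 1/(n-1)."""
    n = len(seqs)
    a1 = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a1


if __name__ == "__main__":
    sample = ["AAAA", "AAAT", "AATT"]
    print(pairwise_diversity(sample), watterson_theta(sample))
```

Under the neutral model both quantities estimate the same parameter, which is why statistics such as Tajima's D are built on the difference between them.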

 

  • Group Comparisons.  SITES works with groups of sequences, so that groups can be compared  with one another. A number of analyses are intended explicitly for group comparisons.

 

    • The program determines the average number of pairwise differences among all pairs of groups, as well as the net divergence among all pairs of groups.
    • The program determines the numbers of shared and fixed differences among groups.
    • The program estimates the population migration parameter (Nm) and Fst for all pairs of groups, following the method outlined in Hudson et al. (1992).
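The group comparisons above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's own implementation: net divergence is taken to be the usual between-group average minus the mean of the two within-group averages, a "fixed" difference is taken to be a site where the two groups have no base in common, and a "shared" polymorphism is a site where both groups carry the same two (or more) bases. All function names are hypothetical, and each group is assumed to hold at least two gap-free sequences.

```python
from itertools import combinations


def _diffs(s, t):
    """Number of positions at which two aligned sequences differ."""
    return sum(1 for a, b in zip(s, t) if a != b)


def mean_within(group):
    """Average pairwise differences among sequences of one group."""
    pairs = list(combinations(group, 2))
    return sum(_diffs(s, t) for s, t in pairs) / len(pairs)


def mean_between(g1, g2):
    """Average pairwise differences between sequences of two groups."""
    return sum(_diffs(s, t) for s in g1 for t in g2) / (len(g1) * len(g2))


def net_divergence(g1, g2):
    """Between-group average minus the mean of the within-group averages."""
    return mean_between(g1, g2) - (mean_within(g1) + mean_within(g2)) / 2


def shared_and_fixed(g1, g2):
    """Count shared polymorphisms and fixed differences (assumed definitions)."""
    shared = fixed = 0
    for col1, col2 in zip(zip(*g1), zip(*g2)):
        common = set(col1) & set(col2)
        if not common:
            fixed += 1
        elif len(common) >= 2:
            shared += 1
    return shared, fixed
```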

 

  • Historical Population Model Fitting SITES carries out the analyses described in Wakeley & Hey (1997, Estimating ancestral population parameters. Genetics 145: 847-855). The data are fit to two different models.
    • an isolation speciation model, with four parameters
    • a model of recent population size change, with three parameters

 

  • Recombination analyses. SITES conducts several kinds of recombination analysis on each group, or the entire data set.
    • a table of site by site congruency
    • a table of the minimum set of recombination intervals (Hudson & Kaplan, 1985)
    • Hey & Wakeley's (1997) gamma, an estimate of 4Nc, the population recombination rate.
    • Hudson's (1987) estimate of 4Nc
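A common building block for recombination analyses of this kind is the four-gamete test: two biallelic sites are incompatible with a history free of recombination and recurrent mutation when all four combinations of their bases are observed. The Python sketch below shows that test only. Whether SITES builds its congruency table in exactly this way is an assumption, the function names are hypothetical, and the sites passed in are assumed to have exactly two base values each.

```python
from itertools import combinations


def four_gametes(seqs, i, j):
    """True if all four two-site haplotypes occur at columns i and j (0-based)."""
    return len({(s[i], s[j]) for s in seqs}) == 4


def incompatible_pairs(seqs, sites):
    """All pairs of the given columns that fail the four-gamete test."""
    return [(i, j) for i, j in combinations(sites, 2) if four_gametes(seqs, i, j)]
```

Each incompatible pair implies at least one recombination event somewhere between the two sites, which is the starting point for finding a minimum set of recombination intervals.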

 

  • Linkage Disequilibrium analyses. SITES will conduct several analyses of Linkage Disequilibrium (LD)
    • matrices of LD values, for several measures of LD
    • matrices of statistical tests of LD values
    • calculates mean LD within and between regions of the sequence
    • does simulations to test mean LD within and between regions
    • determines various measures of LD among polymorphisms shared by groups

 

  • Data Subsets and ReGrouping The data set can be partitioned in numerous ways, without changing the data file.
    • the groupings of sequences can be changed at runtime
    • sequences can be dropped from the analysis at runtime
    • analyses can be limited to specific base pairs or specific intervals
    • specific base positions can be dropped from the analysis.
    • analyses can be limited to certain categories of polymorphic sites

 

_______________________

 

 Input File Format

_______________________

 

There are three kinds of input format: SITES format, PHYLIP sequential format and PHYLIP interleaved format. PHYLIP is Joe Felsenstein's package of phylogenetic programs. The first line of a PHYLIP format file begins with two integers. SITES will check to see if the first character of a file is a number (i.e., 0..9) and if so it will assume that the file is in PHYLIP format. If the first character is not a number it will assume SITES format. SITES format is very similar to PHYLIP sequential format. If multiple analyses are to be run, it is most efficient to set up a SITES format file.
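The format check described above amounts to looking at the first character of the file. A minimal Python sketch of that rule follows; the function name is hypothetical. The check cannot tell sequential from interleaved PHYLIP, so that still has to be settled separately.

```python
def guess_format(file_text):
    """Return 'phylip' if the first character is a digit 0..9, else 'sites'."""
    first = file_text[:1]
    # An empty file has no first character and is treated as SITES format here.
    if first and first in "0123456789":
        return "phylip"
    return "sites"
```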

 

Data Restrictions.

 

The only characters that are allowed in the sequences are  'A','a','G','g','C','c','T','t','N','n','-','.', and '*'.   'N' and 'n'    represent base positions where the sequence is not known. '-','*', and  '.' represent base positions where one sequence has a gap relative to another sequence. Other characters cause the program to insert an 'N' in their place, and to display a warning at runtime. Each DNA sequence must have exactly the same number of characters. The only exception to this is that the data can have spaces (i.e. ' ') which are ignored.
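These restrictions can be mimicked when preparing a data file. The Python sketch below drops spaces, keeps the allowed characters, and substitutes 'N' for anything else while collecting the offending characters so that a warning can be shown. It is an illustration of the rule as documented, not the program's code, and the names are hypothetical.

```python
ALLOWED = set("AaGgCcTtNn-.*")


def clean_sequence(raw):
    """Return (cleaned sequence, list of characters that were replaced by 'N')."""
    cleaned = []
    replaced = []
    for ch in raw:
        if ch == " ":
            continue  # spaces are ignored
        if ch in ALLOWED:
            cleaned.append(ch)
        else:
            cleaned.append("N")  # unknown character becomes 'N'
            replaced.append(ch)
    return "".join(cleaned), replaced


def same_length(seqs):
    """True if every cleaned sequence has the same number of characters."""
    return len({len(s) for s in seqs}) <= 1
```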

 

SITES format:

  • Line 1. A line of text that generally provides a brief description of the data set. This line must not begin with a number. If desired, up to 10 extra lines of commentary can be added following line 1.  Each additional line of commentary must have a '#' at the very beginning of the line. All comment lines will be reproduced in the output file.
  • Line 2. Contains 2 integers. The first is the number of sequences. The second is the number of characters in each sequence. 
  • Line 3. Contains 2 integers. The first integer can be 1, 2 or 3, and is the position in the coding frame of the first coding base. If there are no coding bases, then this position can be any number. The second integer is the number of noncoding stretches of sequence. For instance, if the DNA sequence contains a 5' noncoding sequence followed by three amino acid coding regions (with intervening introns) followed by a 3' noncoding region, then this integer would be 4 (5' noncoding plus 2 introns plus 3' noncoding).  If the sequence does not include a protein coding region or if the coding frame of the first coding base is not known, then the entire sequence should be specified as noncoding, and the second integer should be 1.
  • After line 3 there is one line for each noncoding stretch indicated on line 3. Each line contains two integers: the base position of the first base in the noncoding region, and the base position of the last base in the region. The lines go in order from 5' to 3', meaning that the first line after line 3 contains the bases for the most 5' noncoding region.  If there are no coding regions in the sequence, then the second number on line 3 should be 1 and the  two integers on line 4 should simply be a 1, followed by the length of the sequences (i.e. indicating a noncoding region as long as the  sequence).
  • The line after those containing information on noncoding regions contains just a single integer, the number of groups of sequences. If sequences are not in groups then this number should be 1. If this number is greater than 1, then for each group there is a line containing the group name and the number of sequences in that group. A group name can be up to 12 characters. The group names should be ordered so that they are in the same order as the sequences.
  • After the lines containing the number of groups and the group names and sizes, come the sequences. The sequences must be in sequential format, meaning that all of the text for one sequence occurs before the text for the next sequence. There can be one sequence per line; however, this is not necessary, and sequences can be spread out over multiple lines.  Only the following characters can be included in the sequence: A,a,C,c,G,g,T,t,N,n,'*','-','.'.  The latter three all indicate the absence of sequence relative to other sequences in the file. 
  • For each sequence, the first 10 characters contain the name of the sequence for that line. There may be spaces between the name and the beginning of the sequence, but the actual sequence must begin in column 11. The sequence name should begin with a letter or a number (no asterisks, or other symbols can be the first character in a sequence name).

 

Sample SITES Format

Below is a sample SITES input file for 10 sequences, each of 50 characters.  There is one noncoding region extending from base 20 to base 44, inclusive, and the first base of the coding region (i.e. base 1) is in frame 3 of the codon. There are four groups of sequences.

 

****

SITES sample input

10 50

3 1

20 44

4

simulans 3

mauritiana 2

sechellia 2

melanogaster 3

SI-CA1    CAGGGTGTCCGACTCGGCCTACTCGAGCA......GCAACAGCCAGTCAC

SI-CA2    CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAACAGCCAGTCAC

SI-K1     CAGGGTGTCCGACTCGGCCTACTCGAACA......GTAACAGCCAGTCAC

MA-1      CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAATAGCCAGTCGC

MA-2      CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAATCGCCAGTCGC

SE-C1     CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGTCGC

SE-P1     CAAGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAACAGCCAGTCGC

ME-NJ1    CAAGGTGTCCGACTCGGCCTACCCGAACGGCTGCAGCAACAGCCAGCCGC

ME-K1     CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGCCGC

ME-LI1    CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGCCGC

****

For many questions, intron and exon boundaries are not relevant. To run  SITES without providing information on introns and exons, simply indicate that the entire sequence is one large noncoding sequence.  In this case, the  value of the first coding base is irrelevant.  For example to do this with  the sample data set above, lines 3 and 4 of the data file would be:

1   1

1   50
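To make the layout of a SITES format file concrete, here is a small Python reader that follows the line-by-line description above. It is a sketch under several assumptions that go beyond the documentation: blank lines are skipped, each sequence sits on a single line (multi-line sequences are not handled), and the function and key names are hypothetical.

```python
def parse_sites(text):
    """Parse SITES-format text with one sequence per line into a dictionary."""
    lines = iter(ln for ln in text.splitlines() if ln.strip())
    title = next(lines)
    if title[:1].isdigit():
        raise ValueError("line 1 must not begin with a number")
    line = next(lines)
    comments = []
    while line.startswith("#"):  # optional commentary lines
        comments.append(line)
        line = next(lines)
    n_seq, length = (int(x) for x in line.split())
    frame, n_noncoding = (int(x) for x in next(lines).split())
    # One "first last" line per noncoding stretch, ordered 5' to 3'.
    noncoding = [tuple(int(x) for x in next(lines).split()) for _ in range(n_noncoding)]
    n_groups = int(next(lines).split()[0])
    groups = []
    if n_groups > 1:  # group lines are present only when there are several groups
        for _ in range(n_groups):
            name, size = next(lines).split()
            groups.append((name, int(size)))
    sequences = []
    for _ in range(n_seq):
        line = next(lines)
        name = line[:10].strip()            # columns 1-10 hold the name
        seq = line[10:].replace(" ", "")    # sequence starts in column 11
        if len(seq) != length:
            raise ValueError(f"{name}: expected {length} characters, got {len(seq)}")
        sequences.append((name, seq))
    return {"title": title, "comments": comments, "n_sequences": n_seq,
            "length": length, "frame": frame, "noncoding": noncoding,
            "groups": groups, "sequences": sequences}
```

Applied to the sample file above, a reader of this kind would report frame 3, one noncoding interval (20, 44), and four groups of sizes 3, 2, 2 and 3.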

 

PHYLIP sequential and interleaved formats:

  • line 1: contains 2 integers. The first is the number of sequences. The second is the number of bases per sequence.
  • After line 1, are lines for the data. The data can be in sequential format with 1 line per sequence. If so then the first 10 positions are devoted to the name of the sequence. The data can also be in interleaved format. If so then for N sequences, there are N lines each beginning with a name in the first 10 positions followed by some sequence. After N lines there is an empty line, followed by N more lines containing aligned data (but no names). This pattern is repeated until the final block of N lines.
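The interleaved layout described above can be read back into whole sequences with a few lines of Python. This sketch assumes that blocks are separated by empty lines, that only the first block carries names, and that spaces inside the data are to be ignored; the function name is hypothetical.

```python
def parse_phylip_interleaved(text):
    """Return a list of (name, sequence) pairs from interleaved PHYLIP text."""
    lines = text.splitlines()
    n_seq, length = (int(x) for x in lines[0].split()[:2])
    body = [ln for ln in lines[1:] if ln.strip()]  # drop the empty separator lines
    names, seqs = [], []
    for i, ln in enumerate(body):
        if i < n_seq:
            # First block: 10-character name followed by sequence data.
            names.append(ln[:10].strip())
            seqs.append(ln[10:].replace(" ", ""))
        else:
            # Later blocks: data only, in the same sequence order.
            seqs[i % n_seq] += ln.replace(" ", "")
    if any(len(s) != length for s in seqs):
        raise ValueError("sequence length does not match the header")
    return list(zip(names, seqs))
```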

_______________________

 

 Running the Program

_______________________

 

The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. The user starts the program simply by going to the folder where the data file and the program exist and typing the name of the program (e.g. 'sites'). The program asks several questions about the data file and the desired analysis. Nearly all commands and options can also be entered using command line parameters. 

 

On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered.  The user can also just hit return at this point and the program will request runtime parameters.

 

_______________________

 

 Menus

_______________________

 

     Site choice menu.

     -----------------------

 

If the 'S' flag is not used when the program is started, then the following menu will appear.

 

SITE OPTIONS ARE : a for All site types

                   i for all noncoding base changes

                   e for all coding base changes

                   s for synonymous coding sites

                   r for replacement coding sites

                   o for  ambiguous coding sites

                   d for insertion/deletion (indel) sites

                   f for only informative sites

                   n for only transitions

                   v for only transversions

 

                   z to skip all positions with more than 2 base values

                   x to skip all positions within indels

TYPE THE LETTERS OF THE DESIRED SITE TYPES (no spaces):

 

These options determine the information that is included in the polymorphism table, and they may affect which kinds of polymorphic sites are subject to other analyses. The user should enter a string of characters corresponding to those types of polymorphic sites that are to be included in the analyses. For example, an analysis on only informative synonymous sites would be called for by entering 'fs' or 'sf'. The most common analysis is done on all types of polymorphic sites (i.e. 'a'); 'ambiguous coding sites' are those for which the program could not determine whether a coding base change was a replacement or a synonymous change. 

'z' will cause all those positions with more than two types of base values to be excluded from the analysis.

'x' will cause all those positions that are within regions for which some of the lines differ by indels to be excluded from the analyses.

By invoking both 'z' and 'x', the data set can be reduced to just those sites that fit a simple infinite sites model of mutation.
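The combined effect of 'z' and 'x' can be pictured with the following Python sketch, which returns the polymorphic base positions that have exactly two base values and no gap character in any sequence. It illustrates the idea rather than the program's exact rules: treating a column as "within an indel" whenever any sequence has a gap character there, and ignoring 'N' when counting base values, are both assumptions, and the function name is hypothetical.

```python
GAP_CHARS = set("-.*")


def infinite_sites_positions(seqs):
    """1-based positions that are polymorphic, biallelic, and free of gaps."""
    keep = []
    for pos, col in enumerate(zip(*seqs), start=1):
        values = {c.upper() for c in col}
        if values & GAP_CHARS:
            continue              # 'x': skip positions within indels
        values.discard("N")       # assumption: unknown bases are not counted
        if len(values) > 2:
            continue              # 'z': skip positions with more than 2 base values
        if len(values) == 2:
            keep.append(pos)
    return keep
```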

 

     Analysis choice menu.

     -----------------------------

 

If the 'A' command is not used when the program is started, then the following menu will appear.

 

ANALYSIS TYPES ARE: 

        a for all basic types (except r,m,g,l)

        s for basic site table

        e for coding change details

        c for codon usage tables

        p for polymorphism analysis

        i for indel site table & indels in recombination analysis

 

        r for recombination analysis

        m for population model fitting

        l (L) for linkage disequilibrium analyses (invokes r)

        g for GC analyses

TYPE THE LETTERS OF THE DESIRED ANALYSES (no spaces):

 

·        'a' calls for the most basic analyses, including s,e,c,p & i

·        's' calls for a table of polymorphic base positions to appear in the output file.

·        'e' calls for a table that contains the codons that are the sites of synonymous, replacement, and ambiguous polymorphisms. This is especially useful for determining the likely sequence of mutations when multiple polymorphisms occur within a codon.

·        'c' calls for a table of codon usage. Only the 'universal' code is used. Two tables are printed, one for the first sequence in the data file, and one for all of the sequences. Also included are counts of the numbers of synonymous and replacement bases (i.e. relative proportions of random mutations expected to cause synonymous or replacement changes). These counts of synonymous and replacement bases can be used to calculate the number of synonymous polymorphisms per synonymous base. Similarly, the counts of replacement bases can be used to calculate the number of replacement polymorphisms per replacement base.

·        'p' calls for a set of analyses on polymorphism within and among groups.

·        'i' calls for a polymorphism table for indel variation. This option also causes indels to be included in some recombination analyses, if the 'r' option is also used.

·        'r' calls for a set of recombination analyses.

·        'm' calls for fitting the ISOLATION SPECIATION MODEL to pairs of groups of sequences, and the POPULATION SIZE CHANGE MODEL to each group of sequences. These analyses are not applicable for many data sets, and they can be slow for data sets with large groups.

·        'l' (‘L’) calls for the Linkage Disequilibrium Options Menu to appear.

·        'g' calls for a table of GC content by codon position and for noncoding regions, for each of the sequences.

 

     Data And Output Options Menu

     -----------------------------------

 

If the 'O' command is not used when the program is started, then the following menu will appear. Several of these commands require additional information that must be entered in response to queries, or provided with other command line flags.

 

DATA OPTIONS:   g for comparisons within groups

                n for new group designations

                d for dropping some sites

                m for data limited to some sites

                x for dropping some sequences

                c for non-canonical genetic code

OUTPUT OPTIONS: s for no screen output during analysis

                p suppress large pairwise difference table

                h text line of input values for HKA program

                w text line of input values for WH program

                f for first sequence reference in site table

                t for table style: (. and -) replace (- and *)

TYPE THE LETTERS OF THE DESIRED OPTIONS (no spaces):

 

  • 'g' means that polymorphism and recombination analyses will be performed on each group of sequences individually. If 'g' is not entered then the entire data set is treated as a single group. This option is ignored if there are no groups in the data set.
  • 'n' provides the opportunity to redesignate group assignments. These new groupings must still be consistent with the order of the DNA sequences in the data file. A series of questions will prompt the user to provide the necessary information.
  • 'd' provides the opportunity to drop some of the base positions from inclusion in the analysis. A series of questions will prompt the user to provide the necessary information. There are three ways to drop base positions, chosen in response to the following query that will appear on the screen.

 

DROPPING SOME SITES, enter option : r for range

                                    f for file

                                    i for individual entry

The user should enter 'r','f', or 'i', and follow the directions. In the case of 'f', the user should previously have created a file that contains base positions that are to be dropped. The format is simply a list of numbers, one number per line.  In the case of 'r' or 'i', the program asks the user for information on the range limits, or individual base positions, respectively.

 

  • 'm' provides the opportunity to limit the analyses to just a subset of the base positions. This option works almost exactly like 'd', but can be more convenient under some circumstances. By using the 'm' option it is possible to examine just short sequences. For example, if a sliding window analysis is desired, the program can be run repeatedly using a different interval in each case. If just coding, or non-coding sequence is desired, then the 'd' or 'm' options are not needed. Instead simply specify 'e' or 'i' in the Site Choice menu.
  • 'x' provides the opportunity to exclude some of the sequences from the analyses. This option works exactly as the 'd' option: either a range or a file or individual entry can be used to indicate which sequences to exclude. This option allows for the use of very large data sets. One large data set can be maintained for a set of aligned sequences, and then specific analyses that use just a subset of the sequences can still be run. The program will maintain the appropriate group designations, even if one or more groups are entirely excluded. 
  • 'c' calls for an alternative genetic code to be used.  If this option is used then the following  menu appears that lists the most frequently used alternative codes. Selection of one of the alternatives causes that code to be used in all codon related analyses. The codes are given at the NCBI site http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
    • Alternate (Non-Canonical) Genetic Code:
      • V    Vertebrate Mitochondrial Code
      • Y    Yeast Mitochondrial Code
      • M    Mold, Protozoan, and Coelenterate Mitochondrial Code
      • I      The Invertebrate Mitochondrial Code
      • C     The Ciliate, Dasycladacean and Hexamita Nuclear Code
      • E     The Echinoderm Mitochondrial Code
      • P     The Euplotid Nuclear Code
      • A     The Alternative Yeast Nuclear Code
      • D     Ascidian Mitochondrial Code
      • F     Flatworm Mitochondrial Code
    • TYPE THE LETTER OF THE DESIRED GENETIC CODE:
  • 's' no information is output to the screen during the run.
  • 'p' suppresses the output of the matrix of pairwise differences between individual sequences. This matrix can be useful for a close look at things, but with many sequences it is quite large.
  • 'h' causes the generation of text that is suitable for input into a program called HKA.  This program does HKA tests (Hudson, Kreitman and Aguadé, 1987) and related analyses on multiple loci.
  • 'w' causes the generation of text that is suitable for input into a program called WH.  This program does analyses described in Wakeley and Hey(1997) and Wang, Wakeley & Hey (1997).
  • 'f' calls for the polymorphism table to use the first DNA sequence in the data file as the reference. The default is to use a consensus sequence (i.e. the most common base among all of the sequences) as the reference sequence in the polymorphism table output.
  • 't' changes the characters that are used in the polymorphism table. The default is for '-' to refer to a base value that is identical to the reference sequence for a given polymorphism and to use '*' to refer to a gap in the DNA sequence relative to other DNA sequences, or relative to the reference. By using the 't' option, the table will use '.' to refer to a base that is identical to the reference and '-' to refer to gaps.
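The effect of the 'f' and 't' options on the polymorphism table can be illustrated with a short Python sketch. It builds one row per variable position, using either a consensus base or the first sequence as the reference, and either the default characters ('-' for identity, '*' for a gap) or the alternative style ('.' for identity, '-' for a gap). The function name is hypothetical, and the handling of ties in the consensus and the exclusion of gaps from the consensus are assumptions; the real table layout may differ.

```python
from collections import Counter

GAP_CHARS = set("-.*")


def site_table(seqs, first_as_reference=False, alt_style=False):
    """Return (position, reference, cells) for each variable column."""
    same, gap = (".", "-") if alt_style else ("-", "*")
    rows = []
    for pos, col in enumerate(zip(*seqs), start=1):
        # Normalize every input gap character to a single internal marker.
        col = ["-" if c in GAP_CHARS else c.upper() for c in col]
        if len(set(col)) == 1:
            continue  # not variable
        if first_as_reference:
            ref = col[0]
        else:
            bases = [c for c in col if c != "-"]
            ref = Counter(bases).most_common(1)[0][0]  # most common base
        cells = "".join(gap if c == "-" else same if c == ref else c for c in col)
        rows.append((pos, gap if ref == "-" else ref, cells))
    return rows
```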

 

     Linkage Disequilibrium Options Menu

     -------------------------------------------

 

If the 'L' command is not used on the command line, and the 'L' option is listed in response to the Analysis Choice Menu then the following menu  will appear. Some of these commands require additional information that must be entered in response to queries, or provided with other command line flags.

 

Linkage Disequilibrium (LD) Analyses:

         m print matrix of pairwise values and significance tests

         y measure and compare average LD by regions

         t randomization tests of average LD by regions

         s analyze LD among polymorphism shared between groups

         x exclude singletons from all LD analyses

 

Apply Analyses to the Following LD Measures

         d  D - standard linkage disequilibrium (default)

         p  D' -  D prime = D/Dmax

         b  |D| - absolute value of D

         a  |D'| - absolute value of D prime

         r  correlation coefficient

         q  r^2 - squared correlation coefficient

 

TYPE THE LETTERS OF DESIRED ANALYSES AND MEASURES (no spaces):

 

  • 'm' causes matrices of LD values to be printed in an additional output file.  That filename is either given at the command line ('-E' flag) or is requested following the appearance of the Linkage Disequilibrium menu.  For every group there is a matrix of statistical tests, including Fisher's Exact Test and a chi-square test. Also for every group, and every LD measure, there will be a matrix showing LD values for all pairs of sites.
  • 'y' causes an analysis of mean LD values within and between regions of the sequence.  These regions must be given in a file, the name of which is either given on the command line ('-G' flag) or is requested after this menu.  For an analysis with R regions the file should have R numbers, in order from low to high, with each number being the last base of a region.  The very last number should be identical to the full length of the sequence.
  • 't' causes statistical tests, via simulation, of mean LD values within and between regions. The simulations can be slow (see below).
  • 's' causes analysis of LD among polymorphisms shared between species.
  • 'x' causes all LD analyses to be limited to only those pairs of sites for which neither polymorphism is of the singleton type (i.e. the rarer base appears only once). LD measures vary in their sensitivity to the presence of singletons.
  • 'd','p','b','a','r','q'  These measures are conventional (see e.g. Hedrick 2000).  For each one that is used, all of the LD analyses are done.  Quite a lot of output can be generated by calling for multiple LD measures.
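For reference, the six measures in the menu can be computed for one pair of biallelic sites as in the Python sketch below. It uses the conventional textbook definitions and is not taken from the SITES source. The choice of which base counts as the reference allele at each site (here, the base carried by the first sequence) is an assumption that affects the sign of D, D' and r; the function name and dictionary keys are hypothetical. Both sites must be polymorphic, otherwise the correlation is undefined.

```python
from math import sqrt


def ld_measures(site1, site2):
    """LD measures for two biallelic sites given as equal-length base strings."""
    n = len(site1)
    a, b = site1[0], site2[0]                  # reference alleles (assumption)
    p_a = sum(x == a for x in site1) / n
    p_b = sum(y == b for y in site2) / n
    p_ab = sum(x == a and y == b for x, y in zip(site1, site2)) / n
    d = p_ab - p_a * p_b                       # menu option 'd'
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0  # menu option 'p'
    r = d / sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))  # menu option 'r'
    return {"D": d, "D_prime": d_prime, "abs_D": abs(d),
            "abs_D_prime": abs(d_prime), "r": r, "r2": r * r}
```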

 

_______________________

 

Command line usage

_______________________

 

Nearly all commands and options can be given at the command line.  Usage of the command line permits many analyses to be automated, so that they can be very easily repeated and, if desired, given in batch files. 

 

Primary Command Line Flags:

 

Each command line parameter flag may be upper or lower case and may be preceded by a '-', '\' or '/'. Following a command line parameter is a string  of characters appropriate to that command.

 

Flag  parameter string                            example

----------------------------------            ----------------

I    the name of the data file                -Imydata.seq

R    the name of the output file              -Rmyresult

M    message up to 78 characters (no blanks)  -Mmy_data

S    kinds of polymorphic sites to analyze    -Sa   (see Site Options Menu above)

A    kinds of analyses to perform             -Aspr (see Analysis Menu above)

O    data and output options                  -Ogc  (see Data and Output Menu above)

L    linkage disequilibrium options           -Lyrp (see LD Analysis Menu)

C    alternate genetic code                   -Cf   (see Alternative Genetic Code Menu above)

 

Any of these parameters can be given at the command line when the program is started. Either uppercase or lowercase letters may be used for flags and for their parameter strings.  For example, the following command line, entered in a command prompt window will generate a file called myresult.sit.

 

sites -isitestestdata -rmyresult -mmy_data -sa -aspr -ogc -cf

 

For each of the parameter flags that are not included at the command line when the program is started, the program will ask for the information. It is not required that any of the command line parameter flags be used. The program will also ask for more information if PHYLIP-style data formats are used, and if some of the analysis options are used. The data file can have any name, though it should not include the folder information. This usually means that the program and the data need to reside in the same folder. The name of the output file should not have an extension. The characters '.SIT' will be added to the output file name that is given to the program. Thus, for the example above, the program would produce a file called 'myresult.sit'.
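The flag conventions described above (an optional '-', '\' or '/' prefix, a one-letter flag in either case, and the parameter string attached directly to the flag) can be expressed as a small Python sketch. This is an illustration of the convention, not the program's parser, and the function name is hypothetical.

```python
def parse_flags(argv):
    """Map each one-letter flag (uppercased) to its attached parameter string."""
    parsed = {}
    for arg in argv:
        if arg[:1] in ("-", "\\", "/"):
            arg = arg[1:]          # strip the optional prefix character
        if not arg:
            continue
        flag, value = arg[0].upper(), arg[1:]
        parsed[flag] = value       # the value is kept exactly as typed
    return parsed


if __name__ == "__main__":
    example = "-isitestestdata -rmyresult -mmy_data -sa -aspr -ogc -cf".split()
    flags = parse_flags(example)
    print(flags)
    print("output file:", flags["R"] + ".sit")
```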

 

Three of the flags ('S', 'A', and 'O') correspond to menus (see Menus above).  If the flag is not used at the command line, a menu will appear, requesting user input.

 

Secondary Command Line Flags:

 

Additional command line options can be used to limit the size of the data set if the 'O' flag or the ‘L’ flag is used.  These secondary flags should be placed after the respective primary flag. 

 

Note that command line flags for dropping sequences can be used either in conjunction with flags for excluding base positions, or with flags for keeping  certain base positions.

 

  • If the 'O' flag (see Primary Command Line Flags) is on the command line with the 'd' option, in order to exclude some base positions from the analyses (see option 'd' in the Data And Output Options Menu above):
    • 'D' flag can be included on the command line to specify the filename with base positions to be dropped. Immediately following the '-D' should be the name of the existing file. For example: '-Dbasedrop.txt'.
    • or an 'X' flag can be included on the command line to specify the range of base positions to be excluded. Immediately following the '-X' should be a range with a '-' separating the two extremes, and without any spaces.  For example: '-X100-200'.
  • If the 'O' flag is on the command line with the 'm' option, in order to limit the analyses to just a subset of base positions (see option 'm' in the Data And Output Options Menu above):
    •  a 'K' flag can be included on the command line to specify the filename with the base positions to be used in the analysis. Immediately following the '-K' should be the name of the  existing file.  For example: '-Kbasekeep.txt'   
    • or a 'Y' flag can be included on the command line to specify the range of base positions to be kept. Immediately following the '-Y' should be a range with a '-' separating the two extremes, and without any spaces.  For example: '-Y201-300'.
  • If the 'O' flag is on the command line with the 'x' option, in order to drop some sequences from the analyses (see option 'x' in the Data And Output Options Menu above):
    •  a 'Q' flag can be included on the command line to specify the filename with the sequence numbers to be dropped. Immediately following the '-Q' should be the name of the  existing file.  For example: '-Qdropseq.txt'.
    • or a 'Z' flag can be included on the command line to specify the range of sequences to be dropped. Immediately following the '-Z' should be a range with a '-' separating the two extremes, and without any spaces.  For example: '-Z4-12'.

 

  • If the 'L' flag is in the command line with the 'm' option, the program will print out LD matrices to another file.  The name of that file can be identified at runtime, or else the program will ask for the filename.
    • an 'E' flag can be included on the command line to specify the filename to which the LD matrices will be printed. Any existing file of that name will be overwritten. Immediately following the '-E' should be the name of the file. For example: '-Emyldmatrices'.
  • If the 'L' flag is in the command line with the 'y' option, the program will generate an analysis of LD by regions of the sequence.  To do this it needs input on the boundaries of regions.  These are contained in another file, which can be identified at runtime, or else the program will ask for the filename.
    • a 'G' flag can be included on the command line to specify the filename that contains the boundaries of regions for measuring and comparing mean LD. Immediately following the '-G' should be the name of the existing file. For an analysis with r regions the file should have r numbers, in order from low to high, with each number being the last base of a region.  The very last number should be identical to the full length of the sequence.
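
The rule for the region boundaries file can be illustrated with a short Python sketch. This is not part of SITES; the function name region_spans is hypothetical, and the sketch assumes that regions are contiguous and that the first region starts at base 1. How the numbers are delimited within the file is not covered here, so the sketch works on a list of numbers that has already been read.

```python
def region_spans(last_bases, sequence_length):
    """Turn a list of region end positions into (first, last) base ranges.

    Assumes 1-based positions, contiguous regions, and that the first
    region starts at base 1 (an assumption of this sketch).
    """
    if not last_bases:
        raise ValueError("at least one region boundary is required")
    spans = []
    start = 1
    for last in last_bases:
        # Each boundary must lie beyond the previous one (low to high).
        if last < start:
            raise ValueError("boundaries must be in increasing order")
        spans.append((start, last))
        start = last + 1
    # The final boundary must equal the full length of the sequence.
    if last_bases[-1] != sequence_length:
        raise ValueError("last boundary must equal the sequence length")
    return spans


# Three regions in a 400 base alignment: 1-100, 101-250, 251-400.
print(region_spans([100, 250, 400], 400))
```

A list such as 100, 250 for a 400 base alignment would be rejected by this check, because the last number does not match the sequence length.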

 

_______________________

 

Output                          Return to Contents

_______________________

 

When running, the program writes some messages to the screen (unless the 's' ANALYSIS OPTION is used). While reading in the data, the program writes  the base position of each polymorphic site as it is found. Following this, the program writes a brief message for each category of analysis being conducted.

 

The output of the program is all contained in one file that has a '.SIT' extension. This file can be quite long for large data sets and complete analyses. If there are many sequences, the 'p' option in the OUTPUT OPTIONS menu can cut down on the size of this file.

 

For the most part, the formatting of the output is not hard to follow. At the top of the file, the run parameters are listed, as are the sequence and group names. For a full analysis of a data set with multiple groups,  the following headings are generated:

 

     Table of Polymorphic Sites

     -------------------------------

This table provides the base position and type of polymorphism for all variable base positions in the data set (excluding those sequences and regions of sequences that were not included in the analysis).

 

     Counts of Site Types

     -----------------------

An approximate list of counts of the number of base positions associated with each kind of polymorphism in the table. These counts are rough because some base positions may have multiple kinds of polymorphisms.

 

     Indel Tables

     --------------

A set of two tables:  The first is very similar to that for polymorphic sites, except that each distinct indel (regardless of length) gets just one position. The second table lists the sequence and position of each distinct indel.

 

     Table of Synonymous, Replacement and Uncertain Exon Changes

     --------------------------------------------------------------------------

A list of all (or most) coding region base changes. With each polymorphism the different codon states are shown. This can be used to resolve cases where multiple changes have occurred within the same codon, and where it was not clear to the program whether a change was synonymous or replacement.  This analysis may not be complete if there are three or four bases segregating at a position, or in cases where a single aligned codon has many amino acids segregating.

 

     Codon Usage Table

     ---------------------

A table of codon usage for the first sequence. Also included are counts of the numbers of synonymous and replacement sites (i.e. relative proportions of random mutations expected to cause synonymous or replacement changes).  The calculation of these values is best explained with an example. Consider a codon (e.g. AAA for lysine) and consider each base in the codon.  For each base, the fraction of all mutations that will change the codon in such a way that the amino acid does not change is counted. For AAA, 0 of the three possible mutations at the first base will lead to a synonymous change, similarly 0 for the second base, and 1/3 of the mutations for the third base (because an A->G change leads to AAG, which is also lysine).  So the number of synonymous sites in an AAA codon is 1/3. The number of replacement sites is 3 - 1/3 = 2 2/3. Every codon gets a score this way, and the final tally is just the sum of scores for all codons.
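
The counting rule described above can be sketched in Python as follows. This is an illustration only, not the SITES source code. It assumes the standard genetic code and assumes that a mutation to a stop codon counts as a replacement; how SITES itself treats stop codons is not stated here. The function names are hypothetical.

```python
from fractions import Fraction

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def synonymous_sites(codon):
    """Number of synonymous sites in one codon, as an exact fraction."""
    amino_acid = CODON_TABLE[codon]
    total = Fraction(0)
    for pos in range(3):
        same = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            # Assumption: a mutation to a stop codon counts as a replacement.
            if CODON_TABLE[mutant] == amino_acid:
                same += 1
        # Fraction of the three possible mutations at this base that are synonymous.
        total += Fraction(same, 3)
    return total


def replacement_sites(codon):
    """Number of replacement sites: three bases minus the synonymous sites."""
    return 3 - synonymous_sites(codon)


def site_totals(sequence):
    """Sum synonymous and replacement sites over an in-frame coding sequence."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    syn = sum((synonymous_sites(c) for c in codons), Fraction(0))
    return syn, 3 * len(codons) - syn


print(synonymous_sites("AAA"), replacement_sites("AAA"))
```

For AAA this gives 1/3 synonymous sites and 8/3 (that is, 2 2/3) replacement sites, matching the worked example.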

 

   Codon Usage Table For All Lines

   -------------------------------------

A table of codon usage for all sequences and counts of synonymous and replacement sites summed across all sequences.

 

     Polymorphism Analyses

     --------------------------

Most analyses are applied only to those types of sites that  are specified in the SITES OPTIONS menu, but there are exceptions - see below.     

·        BASE PAIR COMPARISONS - not including N's or indels - counts of the number of bases compared and the number of  differences for all pairs of sequences. Essentially a  distance matrix. The counts of the numbers of bases compared are based on the entire sequence, or a shorter region, depending on the 'x' SITE OPTION, and the 'l' and  'm' ANALYSIS OPTIONS.  The counts of the number of bases  compared are not reduced by other SITE OPTIONS choices.  However, the counts of site differences are affected by SITE OPTION choices. This table is not generated if the 'p' option is used in the OUTPUT OPTIONS.

·        GROUP DIFFERENCES - not including N's or indels - a matrix with group by group comparisons. Above and on the diagonal are the average pairwise differences for those sites specified under SITE OPTIONS. Below the diagonal is the net average pairwise divergence (e.g. Nei, 1987, p.276).

·        GROUP DIFFERENCES PER BASE PAIR - same as above but numbers are per base pair. Note that the divisor is the average # of base pairs compared, calculated from the same numbers above the diagonal in the BASE PAIR COMPARISONS table. Thus these per base pair measures may be misleading for some SITE OPTIONS choices. For example, if only synonymous sites are analyzed (option 's' under SITE OPTIONS), a per base pair measure of divergence can instead be obtained by dividing the numbers in the GROUP DIFFERENCES matrix by the estimated number of synonymous sites, which is given beneath the CODON USAGE TABLE.

·         FIXED DIFFERENCES  - A fixed difference is a polymorphic site at which all of  the sequences of one group are different from all of the sequences of a second group.

·        SHARED POLYMORPHISMS  - A shared polymorphism is a polymorphic site at which each of two groups of sequences is found to have at least two of the same bases.

·         Fst AND POPULATION MIGRATION RATES - Fst values, between pairs of populations, and estimates of  Nm, assuming diploidy (i.e. N is the effective number of diploid individuals) calculated according to  equation 4 (except using a factor of 1/4) of Hudson et  al., 1992. This estimate should be multiplied by 4/3  to get the corresponding number for an X linked locus. It  should be multiplied by 2 to get the corresponding number for a haploid locus. Also if the locus is haploid and  sex-limited (e.g. mitochondria), the estimate should be  multiplied by 2, and then it applies only to the number  of individuals of the sex that carry the locus.

·        POLYMORPHIC SITE FREQUENCIES PER GROUP – FOLDED & ROOTED - These tables give the counts of the number of lines that carry a polymorphism of a certain frequency. For example, a site in which two sequences are different from the remaining n-2 sequences is counted in either category 2 or category n-2 (depending on whether the distribution is folded or, if not, on the value of the root sequence).  There are two tables, one for the folded distribution, which shows only the frequencies for the rarest bases, and one for the rooted distribution.  A rooted distribution is not shown if the analysis includes only one group. The table for the rooted distribution is based on an outgroup sequence chosen from one of the other sequence groups.  The method for picking outgroups is very simple. The outgroup sequence for one group is the first sequence listed in the most divergent other group.  This may not be an ideal outgroup for various reasons. Some properties of these distributions are known for some models, and the distributions are useful for considering questions about changing population size and natural selection. See papers by Tajima and also by Fu.

·        SITES WITH MORE THAN TWO BASES SEGREGATING - A number of analyses assume an infinite sites model, under which a polymorphic site is caused by exactly one mutation and there can be no more than two bases segregating. This table lists those sites that are clearly not consistent with this assumption. If desired, the program can be rerun using analysis option 'd' to exclude these sites.

·         D STATISTICS  - These indices are measures of departure from a neutral Fisher-Wright model. See Tajima (1989) and Fu and Li (1993).  These statistics rely on a count of the number  of polymorphic sites.  The counts that are used come directly from the site frequency distribution. However  these counts will underestimate the actual number of  mutations, particularly if some sites are segregating more than two bases within a sequence sample group. For  Fu and Li D, an outgroup sequence is picked from among  the other groups in the data set (if any occur - see above for POLYMORPHIC SITE FREQUENCIES).  Note, the outgroup sequence that is picked may not be ideal, depending on the divergence among  groups. Also given are counts of the different classes of mutations, as defined by Fu and Li.

·        THETA (4Nu) ESTIMATES - Two different estimates of the neutral mutation parameter 4Nu: Watterson’s and nucleotide diversity or pi.  See, for example, Hudson (1990) or Tajima (1993). Also listed are the number of sequences for each group and the number of bases, which is calculated by taking the average of the number of bases compared in the  pairwise comparison matrix.
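
Several of the quantities listed above (average pairwise differences within and between groups, net divergence, the two theta estimates, and Tajima's D) can be illustrated with a short Python sketch. This is not the SITES source code and will not reproduce its output in general: the sketch assumes aligned sequences of equal length with no N's or indels, reports values per sequence rather than per base pair, and uses hypothetical function names. The formulas are the standard ones (net divergence as between-group minus the mean of the within-group averages, Watterson's estimator, and Tajima 1989).

```python
from itertools import combinations, product
from math import sqrt


def n_diffs(a, b):
    """Count positions at which two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def mean_within(group):
    """Average pairwise differences within a group (pi, per sequence)."""
    pairs = list(combinations(group, 2))
    return sum(n_diffs(a, b) for a, b in pairs) / len(pairs)


def mean_between(g1, g2):
    """Average pairwise differences between two groups."""
    pairs = list(product(g1, g2))
    return sum(n_diffs(a, b) for a, b in pairs) / len(pairs)


def net_divergence(g1, g2):
    """Between-group average minus the mean of the two within-group averages."""
    return mean_between(g1, g2) - (mean_within(g1) + mean_within(g2)) / 2


def segregating_sites(group):
    """Number of alignment columns with more than one base."""
    return sum(1 for column in zip(*group) if len(set(column)) > 1)


def watterson_theta(group):
    """Watterson's estimate of theta: S divided by the sum of 1/i, i = 1..n-1."""
    a1 = sum(1 / i for i in range(1, len(group)))
    return segregating_sites(group) / a1


def tajima_d(group):
    """Tajima's D; returns None when it is undefined for the sample."""
    n = len(group)
    s = segregating_sites(group)
    if n < 4 or s == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (mean_within(group) - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Made-up example data: two small groups of aligned sequences.
group1 = ["AAAA", "AAAT", "AACT", "AACT"]
group2 = ["GAAA", "GAAA"]
print(mean_within(group1), watterson_theta(group1), tajima_d(group1))
print(net_divergence(group1, group2))
```

Note that the count of segregating sites used here treats a site with three or four bases as a single polymorphic site, which, as described under D STATISTICS, underestimates the number of mutations at such sites.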