We developed a pipeline that generates pairwise alignments between D. melanogaster and closely related species such as D. simulans, D. yakuba, and D. pseudoobscura.
First, the coding sequences are cut into the sequences of the component exons.
Then, each exon sequence is searched against a given genome using BLAST.
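The exon-cutting step can be illustrated with a minimal sketch. It assumes the exon lengths of a coding sequence are already known from the annotation; the function name `split_cds_into_exons` and its arguments are hypothetical and are not taken from the pipeline itself. The BLAST search is not shown.

```python
from typing import List


def split_cds_into_exons(cds: str, exon_lengths: List[int]) -> List[str]:
    """Cut a coding sequence into its exon pieces (hypothetical helper).

    exon_lengths lists the coding length of each exon in transcript order;
    this input format is an assumption made for illustration.
    """
    if sum(exon_lengths) != len(cds):
        raise ValueError("exon lengths do not add up to the CDS length")
    exons = []
    position = 0
    for length in exon_lengths:
        exons.append(cds[position:position + length])
        position += length
    return exons


pieces = split_cds_into_exons("ATGGCCAAATAA", [5, 4, 3])
```

Each piece would then serve as a separate BLAST query against the target genome.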
If a coding sequence of one gene has been cut into more than one piece, which represents the common case, the positions of the resulting BLAST alignments are subjected to quality control.
That is, they are required to be close to each other in the genome.
This condition is realized by a 'decision alignment'.
The alignment of the best (longest) aligned exon is chosen as the reference alignment, and the other aligned sequence parts must lie within 10 kb of this reference.
If this is not the case, the distant alignment fails quality control and is treated as if there were no hit.
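The proximity check described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the `ExonHit` record, the requirement that hits lie on the same scaffold as the reference, and the way distance is measured (the gap between the two alignment intervals) are all assumptions, since the text only specifies the 10 kb limit and the choice of the longest alignment as reference.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_DISTANCE = 10_000  # 10 kb limit stated in the text


@dataclass
class ExonHit:
    """Best BLAST hit of one exon (hypothetical record for illustration)."""
    exon_index: int
    scaffold: str
    start: int           # genomic start of the alignment
    end: int             # genomic end of the alignment
    aligned_length: int


def interval_gap(a: ExonHit, b: ExonHit) -> int:
    """Distance between two genomic intervals; 0 if they overlap (assumed measure)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def apply_proximity_check(hits: List[ExonHit]) -> List[Optional[ExonHit]]:
    """Keep hits near the longest alignment; replace the others with None."""
    if not hits:
        return []
    reference = max(hits, key=lambda h: h.aligned_length)
    checked = []
    for hit in hits:
        near = (hit.scaffold == reference.scaffold
                and interval_gap(hit, reference) <= MAX_DISTANCE)
        # A failing hit is treated as if there were no hit at all.
        checked.append(hit if near else None)
    return checked


hits = [
    ExonHit(0, "2L", 1000, 1200, 200),
    ExonHit(1, "2L", 1500, 2000, 500),    # longest, becomes the reference
    ExonHit(2, "2L", 50000, 50100, 100),  # 48 kb away from the reference
    ExonHit(3, "X", 1600, 1700, 100),     # different scaffold
]
checked = apply_proximity_check(hits)
```

In this example the first two exons pass, while the last two are set to `None` and would be handled like exons without a hit.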
A further problem is that BLAST produces only local alignments.
Although the parameters are set so that the resulting alignments are of maximum length, in most cases these alignments are not global.
Often the local alignments lack only two or three nucleotides at their ends, because the BLAST algorithm does not extend an alignment if there are mismatches at the ends.
Another main reason for incomplete alignments is that, for some genomes, not all scaffold sequences are completely resolved, but instead have "N" at the positions of uncertain nucleotides.
If one end of an alignment contains a long stretch of "N", BLAST will ignore that end.
In the above cases, where a few nucleotides are missing at the beginning or end of an exon, the reported positions of the maximum local alignment are used to extract the missing nucleotides from the genome sequence.
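A minimal sketch of this end-extension step is given below. It assumes a forward-strand hit, 0-based half-open coordinates, and a hypothetical cutoff `max_missing` for what counts as "a few" missing nucleotides; none of these details are specified in the text, and the function name is illustrative only.

```python
from typing import Optional


def extend_to_full_exon(scaffold: str, exon_length: int,
                        query_start: int, query_end: int,
                        subject_start: int, subject_end: int,
                        max_missing: int = 5) -> Optional[str]:
    """Extend a local hit to cover the whole exon (illustrative sketch).

    query_* are positions of the local alignment on the exon, subject_* the
    matching positions on the scaffold. Returns the genomic sequence spanning
    the full exon, or None if too much is missing or the scaffold ends first.
    """
    missing_left = query_start
    missing_right = exon_length - query_end
    if missing_left > max_missing or missing_right > max_missing:
        return None
    new_start = subject_start - missing_left
    new_end = subject_end + missing_right
    if new_start < 0 or new_end > len(scaffold):
        return None
    return scaffold[new_start:new_end]


# The exon "ATGGCCAAA" sits at scaffold positions 4..13, but the local
# alignment starts two nucleotides late (exon positions 2..9).
scaffold = "TTTT" + "ATGGCCAAA" + "GGGG"
full = extend_to_full_exon(scaffold, 9, 2, 9, 6, 13)
```

For a hit on the reverse strand, the coordinates would have to be mirrored and the extracted sequence reverse-complemented; this is omitted here for brevity.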
The global alignments of each exon are then assembled in the correct order to build the coding region of their respective gene.
The alignments are then reformatted for the codeml program of the PAML software package (Yang 1997) for calculation of dN and dS.
Regions with "N" are changed to gaps ("-"), which are ignored in the PAML calculations (the number of ignored codons for each gene alignment is given in the Sebida database).
In addition, the syntax of the codon triplets is checked to ensure that the proper reading frame is maintained.
Yang Z., PAML: a program package for phylogenetic analysis by maximum likelihood (CABIOS 13: 555-556, 1997).
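The replacement of "N" by gaps and the reading-frame check that precede the codeml run can be sketched as follows. This is a simplified illustration: the function name is hypothetical, the frame check is reduced to requiring a length divisible by three, and counting every codon that contains at least one "N" as ignored is an assumption about how the reported number is obtained.

```python
from typing import Tuple


def mask_n_and_check_frame(coding_seq: str) -> Tuple[str, int]:
    """Replace N with '-' and count the affected codons (illustrative sketch).

    Raises ValueError if the sequence length is not a multiple of three,
    which is used here as a simple stand-in for the reading-frame check.
    """
    if len(coding_seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    masked_codons = []
    ignored_codons = 0
    for i in range(0, len(coding_seq), 3):
        codon = coding_seq[i:i + 3]
        if "N" in codon.upper():
            ignored_codons += 1
            codon = "".join("-" if base in "Nn" else base for base in codon)
        masked_codons.append(codon)
    return "".join(masked_codons), ignored_codons


masked, ignored = mask_n_and_check_frame("ATGNNNAANTAA")
```

Here `masked` is `"ATG---AA-TAA"` and `ignored` is 2, since two of the four codons contain at least one "N".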
Values of dN/dS for the melanogaster lineage and the melanogaster subgroup were taken from Larracuente et al. (2008).
Larracuente A.M., Sackton T.B., Greenberg A.J., Wong A., Singh N.D., Sturgill D., Zhang Y., Oliver B., Clark A.G., Evolution of protein-coding genes in Drosophila (Trends Genet. 24: 114-123, 2008).