Marking DNA files for MIDI Conversion
1. Simple file conversion
Marking a coding sequence (cds) DNA file
Coding sequence (cds) DNA files contain the sequences corresponding to a translatable messenger RNA (mRNA). Cds files are usually identified as such in the DNA database records.
To ensure an accurate translation, you must mark the beginning and end of the protein-coding sequence, which do not necessarily coincide with the beginning and end of your DNA file. A cds DNA file usually contains the information that identifies the beginning and end of the coding sequence. For example, in the Huntington's Disease DNA file included with Bio2MIDI (hunt_dna.txt), the file's annotation gives the coding sequence as running from base #316 to base #9750.
Accordingly, the tilde (~) is placed before base #316 and after base #9750. Note that each line of the DNA sequence contains reference numbers that allow you to identify these specific bases. A protein-coding sequence will always begin with the bases ATG and end with one of the following triplets: TGA, TAG or TAA.
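The marking step can be sketched in Python as a sanity check before editing a file by hand. This is only an illustration under stated assumptions: the function name mark_cds is hypothetical, the sequence is assumed to be a plain string of bases with the reference numbers and line breaks already removed, and nothing here reflects how Bio2MIDI itself reads the file.

```python
START_CODON = "ATG"
STOP_CODONS = {"TGA", "TAG", "TAA"}


def mark_cds(bases, first, last):
    """Return `bases` with '~' before base number `first` and after base
    number `last` (1-based, inclusive). Hypothetical helper for illustration."""
    if not 1 <= first <= last <= len(bases):
        raise ValueError("coordinates fall outside the sequence")
    cds = bases[first - 1:last].upper()
    # A coding sequence should open with ATG and close with a stop triplet.
    if not cds.startswith(START_CODON):
        raise ValueError("marked region does not begin with ATG")
    if cds[-3:] not in STOP_CODONS:
        raise ValueError("marked region does not end with TGA, TAG or TAA")
    return bases[:first - 1] + "~" + bases[first - 1:last] + "~" + bases[last:]


# Toy sequence: the coding region occupies bases 3 through 11.
toy = "ccATGGCATAAgg"
marked = mark_cds(toy, 3, 11)
print(marked)
```

For hunt_dna.txt the same call would use 316 and 9750. Those coordinates span 9750 - 316 + 1 = 9435 bases, which is a whole number of triplets (3145), as expected for a coding sequence.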
Decide whether you want to play the DNA sequence itself or a translation of the sequence.
If you want to play only the DNA, then be sure that the DNA option is selected. This will result in a MIDI file composition that uses only 4 different notes for the A, C, T, G bases of the DNA sequence.
If you want to play the protein encoded by a DNA sequence, then select the Protein option. This will result in a MIDI file composition that uses 20 different notes, for the 20 amino acids that make up the protein "alphabet" of the translated (bases into amino acids) sequence.
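The difference between the two options can be illustrated with a short Python sketch. The pitch assignments below (MIDI note numbers in DNA_NOTES and PROTEIN_NOTES) are assumptions made up for this example, since the text does not give Bio2MIDI's own mapping; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical note assignments: 4 notes for DNA, 20 for protein.
DNA_NOTES = {"A": 60, "C": 62, "G": 64, "T": 67}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PROTEIN_NOTES = {aa: 48 + i for i, aa in enumerate(AMINO_ACIDS)}


def translate(cds):
    """Translate a coding sequence into one-letter amino acids, stopping at
    the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def dna_notes(seq):
    """DNA option: one note per base."""
    return [DNA_NOTES[b] for b in seq.upper()]


def protein_notes(cds):
    """Protein option: one note per amino acid of the translation."""
    return [PROTEIN_NOTES[aa] for aa in translate(cds)]


print(dna_notes("ATGGCATAA"))      # nine notes drawn from four pitches
print(protein_notes("ATGGCATAA"))  # two notes, for the amino acids M and A
```

Note that the protein rendering is three times shorter than the DNA rendering of the same region, because each amino acid corresponds to a triplet of bases.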
2. Advanced file conversion
Marking a DNA file that includes introns and exons
1. Exons and Introns.
In all organisms except bacteria and some viruses, genes do not consist of continuous coding sequences. Instead, a gene is a very long sequence composed of exons, which contain the coding information, and introns, which are noncoding sequences interspersed among the exons. For example, the structure of the beta-globin gene is:
Exon 1----Intron 1----Exon 2----Intron 2----Exon 3
In a cell, the genetic information is processed so that protein is synthesized using a message that contains only the information of the exons. Bio2MIDI includes a feature that plays back a continuous translation of DNA files in which the boundaries of the exons have been marked.
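The idea of keeping the exons and discarding the introns can be sketched in a few lines of Python. The function name splice and the toy gene are invented for this illustration, and the coordinates are 1-based and inclusive to match the base numbers used in DNA files; this is not a description of Bio2MIDI's internals.

```python
def splice(gene, exons):
    """Concatenate the exons of `gene`, given as (first, last) base numbers
    (1-based, inclusive), dropping everything in between."""
    return "".join(gene[first - 1:last] for first, last in exons)


# Toy gene laid out as Exon 1 - Intron 1 - Exon 2 - Intron 2 - Exon 3.
# Exons are upper case and introns lower case purely for readability.
gene = "ATGGCC" + "gtaagt" + "AAATTT" + "gtcag" + "GGGTGA"
exons = [(1, 6), (13, 18), (24, 29)]

message = splice(gene, exons)
print(message)
```

The result is one continuous coding sequence that begins with ATG and ends with a stop triplet, even though the original gene was interrupted twice.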
2. Beta-globin: a sample gene with exons and introns.
For demonstration purposes, the file beta_dna.txt has been included. This file contains only some general information about the full gene sequence, together with the last 13308 bases of the 73308-base sequence of the human beta-globin region. The information necessary to create a translatable sequence is in the line:
3. Marking the exon boundaries.
You would mark the sequence for translation as follows:
: before base # 62187
This marks the beginning of the protein-coding sequence (beginning of the first exon).
Then mark the end of the first exon (beginning of the first intron):
; after base # 62278
Then mark the beginning of the second exon (end of the first intron):
: before base # 62409
...and the end of the second exon (beginning of the second intron):
; after base # 62631
Then mark the third exon:
: before base # 63482
; after base # 63610
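The three marked exons can be checked with a little arithmetic, sketched here in Python. The helper names are hypothetical, and mark_exons assumes a plain string of bases without reference numbers. The first_base parameter accounts for a file that starts partway through a longer sequence: since beta_dna.txt holds the last 13308 of 73308 bases, its first base would be number 60001.

```python
BETA_GLOBIN_EXONS = [(62187, 62278), (62409, 62631), (63482, 63610)]
FIRST_BASE_IN_FILE = 73308 - 13308 + 1  # 60001


def exon_lengths(exons):
    """Length of each exon from its (first, last) base numbers, inclusive."""
    return [last - first + 1 for first, last in exons]


def mark_exons(bases, exons, first_base=1):
    """Insert ':' before each exon's first base and ';' after its last base.
    `first_base` is the base number of bases[0]."""
    out = []
    cursor = 0
    for first, last in exons:
        lo = first - first_base      # 0-based index of the exon's first base
        hi = last - first_base + 1   # 0-based index just past its last base
        out.append(bases[cursor:lo])
        out.append(":" + bases[lo:hi] + ";")
        cursor = hi
    out.append(bases[cursor:])
    return "".join(out)


lengths = exon_lengths(BETA_GLOBIN_EXONS)
total = sum(lengths)
print(lengths, total, total % 3)

# Toy demonstration of the marker placement.
print(mark_exons("ATGGCCgtaagtAAATTT", [(1, 6), (13, 18)]))
```

The exons are 92, 223 and 129 bases long, 444 bases in total. That is exactly 148 triplets, so the spliced sequence stays in the reading frame from the first marker to the last.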
If you have selected the DNA option, Bio2MIDI will produce a MIDI file corresponding to the DNA coding sequence, that is, the marked exons joined together.
If you have selected the Protein option, it will produce a MIDI file corresponding to the translated protein.
Full DNA sequences should be played only in DNA mode, i.e. with the DNA option selected. Bio2MIDI will dutifully translate the bases, but the translation of such a sequence is not biologically meaningful.
Bio2MIDI is copyright © 1998-2004 John Dunn & Algorithmic Arts. All Rights Reserved.