This file describes the model genomes listed on our website for use as input
to ModuleFinder. All genomes and alignments listed were obtained from the UCSC
Genome Browser and edited as stated below. For each organism, the genome being
searched is referred to as the "base genome", and the alignment files are
referred to as "alignment genomes". For each alignment genome, it is stated
whether the alignments used were obtained from a "pairwise alignment" or a
"multiple alignment".
_________________________________________________________________________________________

Genomes used, and lexicon for abbreviations

1) S. cerevisiae
   Base genome:       cer1 (S. cerevisiae, UCSC listing = sacCer1, October 2003)
   Alignment genomes: bay  (S. bayanus, multiple alignment)
                      cas  (S. castellii, multiple alignment)
                      klu  (S. kluyveri, multiple alignment)
                      kud  (S. kudriavzevii, multiple alignment)
                      mik  (S. mikatae, multiple alignment)
                      par  (S. paradoxus, multiple alignment)

2) C. elegans
   Base genome:       ce   (C. elegans, UCSC listing = ce2, March 2004)
   Alignment genome:  cb   (C. briggsae, cb1, pairwise alignment)

3) D. melanogaster
   Base genome:       dm   (D. melanogaster, UCSC listing = dm1, January 2003)
   Alignment genome:  po   (D. pseudoobscura, dp1, pairwise alignment)

4) Mouse
   Base genome:       mm4  (M. musculus, UCSC listing = mm4, October 2003)
   Alignment genomes: hg16 (human, hg16, pairwise alignment)
                      rn3  (rat, rn3, pairwise alignment)

5) Human
   Base genome:       hg16 (H. sapiens, UCSC listing = hg16, July 2003)
   Alignment genomes: pt1  (chimp, panTro1, multiple alignment)
                      mm3  (mouse, mm3, multiple alignment)
                      rn3  (rat, rn3, multiple alignment)
                      gg2  (chicken, galGal2, multiple alignment)

Each base genome is represented by two files, one containing intergenic
sequences and one containing intragenic (i.e., intronic) sequences. These are
described in more detail below.
On our website, these base genome files are listed as &&_inter_bg.txt and
&&_intra_bg.txt, where && is the appropriate base genome abbreviation given
above. Because the Human intergenic sequence file and its alignment genome
files are too large to open, they are split into two parts.

Similarly, each alignment genome is represented by two matching files, one for
intergenic regions and one for intragenic regions (again, described in more
detail below). These alignment files are named $$_inter_ag.txt and
$$_intra_ag.txt, where $$ is the alignment genome abbreviation given above.
_________________________________________________________________________________________

Data preprocessing of base genomes:

1) For each genome listed, we obtained the corresponding refGene and refLink
   files from UCSC. These give the chromosomal coordinates of each gene
   relative to the UCSC assembly, as well as the UCSC identifier, the common
   gene name, and the common gene abbreviation.

2) From the refGene database, we deleted any gene not located within a fully
   assembled region, as well as mitochondrial genes (i.e., we deleted genes
   with chrM/chrU/random listed in the appropriate refGene field). Also, for
   UCSC identifiers (i.e., NM_xxxxx numbers) listed multiple times, only the
   first occurrence was used.

3) The refGene files contain many redundancies, where essentially the same
   gene is listed multiple times with only slight variations in the
   transcriptional/translational starts and stops and in the exons.
   Additionally, there are many cases where one gene is located within another
   gene, or where two distinct genes overlap one another. To compensate for
   these situations, we first put the refGene listings through a fusing
   process, in which all overlapping genes and genes within genes are fused
   together.
   For example, suppose refGene lists the following translational starts and
   stops for two genes (note that here we are using the UCSC convention of
   ignoring strandedness, so the "start" of a gene is always numerically less
   than the "stop"):

      NM_0001   1334532   1338521
      NM_3012   1226682   1339997

   These are then fused into the following listing, which spans from the
   smallest start to the largest stop:

      NM_0001___NM_3012   1226682   1339997

4) After this fusing process, all regions lying between the translational
   start and stop of a fused gene are referred to as "intragenic regions", and
   all regions between the translational stop of one fused gene and the
   translational start of the next fused gene are referred to as "intergenic
   regions". These intergenic and intragenic regions are extracted from the
   appropriate genome and listed as separate FASTA files with the names
   inter_&&&.txt and intra_&&&.txt (for example, inter_dm.txt corresponds to
   the intergenic regions of D. melanogaster). Following the usual conventions
   for FASTA files, the intergenic/intragenic regions are separated by lines
   beginning with ">" and containing annotation information (the format of
   this descriptor is explained in the next step).

5) Information on the fused genes is given in the following format:

      gene_accession_number protein_name gene_name chromosome_number chromosome_strand translation_start translation_stop

   We shall henceforth refer to this tab-separated identifying scheme as the
   "fused_gene_id". For example, the following is a fused gene in fly:

      >NM_164353___NM_164354 CG31973___CG31973 CG31973-PA___CG31973-PB chr2L -___- 26520 58150

   In the intergenic txt files, each intergenic region is flanked by two fused
   genes. The descriptive line in the FASTA intergenic txt file is of the form

      >(fused_gene_id1)(tab)(tab)(fused_gene_id2)

   where fused_gene_id1 corresponds to the fused gene flanking the intergenic
   region on the left (i.e., at the lower chromosomal coordinate), and
   fused_gene_id2 corresponds to the fused gene flanking the intergenic region
   on the right.
   For intragenic txt files, the regions are located within fused genes, so
   only one fused_gene_id is necessary. For consistency in formatting,
   however, we list it twice:

      >(fused_gene_id1)(tab)(tab)(fused_gene_id1)

6) In both the intergenic and intragenic sequence files, repetitive regions
   have been set to lower case. In the intragenic files, all coding sequences
   have also been set to lower case. These regions will therefore be skipped
   during searches, provided one replies "no" when ModuleFinder asks whether
   to "change lower case to upper case" (see the ModuleFinder README file).

7) In Human and Mouse, because several intergenic and intragenic regions are
   very large, we have split every region longer than 50,000 bp into FASTA
   entries of 50,000 bp. To avoid cutting through potential ModuleFinder hits,
   there is a 2000 bp overlap between the end of one entry and the start of
   the next. The different parts of a region are identified by adding _R# to
   the accession number.
_________________________________________________________________________________________

Data preprocessing of alignment genomes:

1) Let G be a single sequence from either an intragenic or intergenic region
   of the base genome. We generated a matching alignment sequence that is
   exactly the same length as G but composed entirely of the letter "r". Call
   this sequence A.

2) Using the alignments generated by UCSC, if a subregion of G is orthologous
   to some region in the alignment genome, then we extracted that region and
   pasted it into the corresponding position of A. For example, if

      G = ACAGTTTACAGATTACAGTACATTGACATTAG

   and the subregion GATTACAGTACATTGAC was orthologous to the region
   GATTACTGTGCAATGAA from the alignment genome, then

      A = rrrrrrrrrrGATTACTGTGCAATGAArrrrr

3) This situation is somewhat complicated by deletions.
   If the deletion occurs in the alignment sequence, then it is denoted by "-"
   and inserted as usual. Using the preceding example, if the aligned region
   were GATTACT---CAATGAA, then

      G = ACAGTTTACAGATTACAGTACATTGACATTAG
      A = rrrrrrrrrrGATTACT---CAATGAArrrrr

   Deletions in the base genome (i.e., insertions in the alignment genome)
   would alter the coordinate system given by UCSC. To avoid this, the
   inserted bases are removed as follows. Assume that the following is the
   alignment between G and A:

      G = ACAGTTTACAGATTACAGTACAT-----TGACATTAG
      A = rrrrrrrrrrGATTACTGTGCAAATACCTGAArrrrr

   After removing the dashed region from G, together with the bases of A
   aligned to it, this becomes

      G = ACAGTTTACAGATTACAGTACATTGACATTAG
      A = rrrrrrrrrrGATTACTGTGCAATGAArrrrr

   as expected. This, however, introduces a problem: if the motif
   {CATTGAC,CAATGAA,...} were input to ModuleFinder, then it would observe the
   motif at the appropriate position in G, and then observe that this match
   was conserved in A (the match is highlighted below with "{" and "}"):

      G = ACAGTTTACAGATTACAGTA{CATTGAC}ATTAG
      A = rrrrrrrrrrGATTACTGTG{CAATGAA}rrrrr

   This is clearly undesirable, as the motif cannot truly be conserved once
   the inserted region is added back in. To compensate for this effect, we
   change the positions immediately to the left and to the right of the
   insertion in the alignment genome to lower case:

      G = ACAGTTTACAGATTACAGTACATTGACATTAG
      A = rrrrrrrrrrGATTACTGTGCAatGAArrrrr

   ModuleFinder is then set to not consider motif matches as conserved if
   they contain lower case letters (other than at the first and last
   positions) within the alignment genome region.

4) All alignment files for intergenic and intragenic regions are given in
   FASTA format, with the regions in the same order as in the corresponding
   base genome intergenic and intragenic files. The descriptor lines in the
   alignment genome files are exactly the same as in the base genome
   intergenic and intragenic files.
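
The construction of an alignment sequence A described in steps 1) to 3) above
can be sketched in a few lines of Python. This is only an illustration under
stated assumptions, not the code used to generate the files: the function
names build_alignment_track and has_interior_lowercase, their parameters, and
the representation of an aligned block as two gapped strings of equal length
are all hypothetical.

```python
def build_alignment_track(base_len, offset, base_aln, other_aln, pad="r"):
    """Return an alignment sequence A of length base_len (hypothetical helper).

    base_aln and other_aln are the two rows of one aligned block, of equal
    length, with "-" marking gaps. offset is the 0-based position in the base
    sequence G at which the block starts.
    """
    if len(base_aln) != len(other_aln):
        raise ValueError("aligned rows must have the same length")
    block = []
    mask_next = False
    for b, o in zip(base_aln, other_aln):
        if b == "-":
            # Gap in the base genome: drop this column so coordinates in G
            # are preserved, and lower-case the base to its left.
            if block:
                block[-1] = block[-1].lower()
            mask_next = True
            continue
        # Lower-case the first kept base to the right of a dropped insertion.
        block.append(o.lower() if mask_next else o)
        mask_next = False
    block = "".join(block)
    track = pad * offset + block + pad * (base_len - offset - len(block))
    if len(track) != base_len:
        raise ValueError("aligned block does not fit inside the base sequence")
    return track


def has_interior_lowercase(aligned_match):
    """True if a candidate match in A has lower case letters other than at
    its first and last positions, i.e. it should not count as conserved."""
    return any(c.islower() for c in aligned_match[1:-1])


if __name__ == "__main__":
    G = "ACAGTTTACAGATTACAGTACATTGACATTAG"
    A = build_alignment_track(len(G), 10,
                              "GATTACAGTACAT-----TGAC",
                              "GATTACTGTGCAAATACCTGAA")
    print(A)
    print(has_interior_lowercase(A[20:27]))
```

Run on the example from step 3), the sketch reproduces the sequence
rrrrrrrrrrGATTACTGTGCAatGAArrrrr, and the candidate match at the position of
CATTGAC in G is rejected because it contains interior lower case letters.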