This file describes the model genomes listed on our website for use as input to
ModuleFinder.  All genomes and alignments listed were obtained from the UCSC Genome
Browser and edited as stated below.  For each organism, the genome being searched is
referred to as the "base genome", and the alignment files are referred to as
"alignment genomes".  For each alignment genome, it is stated whether the alignments used
were obtained from a "pairwise alignment" or a "multiple alignment".


_________________________________________________________________________________________
Genomes used, and lexicon for abbreviations

1) S. cerevisiae:  
	Base genome: cer1 (S. cerevisiae; UCSC listing = sacCer1, October 2003).
   	Alignment genomes: bay (S. bayanus, multiple alignment)
			   cas (S. castellii, multiple alignment)
			   klu (S. kluyveri, multiple alignment)
			   kud (S. kudriavzevii, multiple alignment)
			   mik (S. mikatae, multiple alignment)
			   par (S. paradoxus, multiple alignment)

2) C. elegans   
	Base genome: ce (C. elegans, UCSC listing = ce2, March 2004)
	Alignment genome: cb (C. briggsae, cb1, pairwise alignment)

3) D. melanogaster
	Base genome: dm (D. melanogaster, UCSC listing = dm1, January 2003)
	Alignment genome: po (D. pseudoobscura, dp1, pairwise alignment)

4) Mouse
	Base genome: mm4 (M. musculus, UCSC listing = mm4, October 2003)
	Alignment genomes:  hg16 (human, hg16, pairwise alignment)
			    rn3 (rat, rn3, pairwise alignment)

5) Human
	Base genome: hg16 (H. sapiens, UCSC listing = hg16, July 2003)
	Alignment genomes: pt1 (chimp, panTro1, multiple alignment)
			   mm3 (mouse, mm3, multiple alignment)
			   rn3 (rat, rn3, multiple alignment)
			   gg2 (chicken, galGal2, multiple alignment)


Each base genome is represented by two files, one containing intergenic sequences and one
containing intragenic sequences (i.e., intronic). These are described in more detail
below.  On our website, these base genomes are listed as &&_inter_bg.txt and
&&_intra_bg.txt, where && is the appropriate base genome abbreviation given above. Because
the Human intergenic sequence file and its alignment genome files are too large to
open, they are split into two parts.

Similarly, each alignment genome is represented by two matching files, one for intergenic
regions and one for intragenic regions (again, described in more detail below).  These 
alignment files are all named as $$_inter_ag.txt and $$_intra_ag.txt where $$ is the alignment
genome abbreviation given above.


_________________________________________________________________________________________
Data preprocessing of base genomes:

1) For each genome listed, we obtained the corresponding refGene and refLink files from UCSC.
These give the chromosomal coordinates of each gene relative to the UCSC assembly, as well
as the UCSC identifier, the common gene name, and the common gene abbreviation.

2) From the refGene database, we deleted any gene not located within a fully
assembled region, as well as mitochondrial genes (i.e., we deleted genes with
chrM/chrU/random listed in the appropriate refGene field).  Also, for UCSC identifiers
(i.e., NM_xxxxx numbers) listed multiple times, only the first occurrence was used.
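
As an illustration only, the filtering in step 2 could be sketched in Python as follows.
The GeneRecord type, its field names, and the substring test on the chromosome name are
assumptions made for this sketch; they are not the actual refGene column layout or the
script we used.  The sketch also assumes that the chromosome filter is applied before
duplicates are dropped.

```python
from typing import Iterable, List, NamedTuple


class GeneRecord(NamedTuple):
    """Hypothetical, simplified stand-in for one refGene row."""
    accession: str  # UCSC identifier, e.g. an NM_xxxxx number
    chrom: str      # chromosome name, e.g. "chr2L"
    start: int
    stop: int


# Assumption: a gene is dropped when its chromosome name contains one of these tags.
EXCLUDED_TAGS = ("chrM", "chrU", "random")


def filter_records(records: Iterable[GeneRecord]) -> List[GeneRecord]:
    """Drop chrM/chrU/random genes, then keep the first occurrence of each accession."""
    seen = set()
    kept = []
    for rec in records:
        if any(tag in rec.chrom for tag in EXCLUDED_TAGS):
            continue  # unassembled or mitochondrial region
        if rec.accession in seen:
            continue  # identifier already used; keep only the first listing
        seen.add(rec.accession)
        kept.append(rec)
    return kept
```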

3)  The refGene files contain many redundancies, where the same gene is essentially
listed multiple times, with only slight variations in the transcriptional/translational
starts/stops and exons.  Additionally, there are many cases where one gene is located
within another gene, or where two distinct genes overlap one another.  To compensate for
these situations, we first put the refGene listings through a fusing process, in which all
overlapping genes and genes within genes are fused together.  For example, suppose refGene
lists the following translational starts and stops for two genes (note that here we use
the UCSC convention of ignoring strandedness, so the "start" of a gene is always
numerically less than its "stop"):

	NM_0001	1334532	1338521
	NM_3012	1226682	1339997

Then these are fused into the following listing (the smallest start and the largest stop are kept):
	NM_0001___NM_3012	1226682	1339997
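
The fusing step can be pictured as a standard interval merge.  The sketch below is a
minimal Python version under stated assumptions: the function name fuse_genes, the tuple
layout, and the accessions in the usage note are hypothetical; genes that merely touch
are treated as overlapping; and accessions are joined in coordinate order, which may
differ from the order used in our files.

```python
def fuse_genes(genes):
    """Fuse overlapping or nested genes on one chromosome.

    genes: list of (accession, start, stop) tuples with start < stop.
    Returns a list of fused (accession, start, stop) tuples sorted by start.
    """
    fused = []
    for acc, start, stop in sorted(genes, key=lambda g: (g[1], g[2])):
        if fused and start <= fused[-1][2]:
            # Overlaps (or lies within) the previous fused gene: extend it.
            prev_acc, prev_start, prev_stop = fused[-1]
            fused[-1] = (prev_acc + "___" + acc, prev_start, max(prev_stop, stop))
        else:
            fused.append((acc, start, stop))
    return fused
```

For instance, with the made-up listings ("NM_A", 100, 500), ("NM_B", 300, 800) and
("NM_C", 1200, 1500), the first two are fused into one entry spanning 100 to 800 and the
third is left unchanged.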

4) After this fusing process, all regions lying between the translational start and stop
of a fused gene are referred to as "intragenic regions", and all regions between the
translational stop of one fused gene and the translational start of the next fused gene
are referred to as "intergenic regions".  These intergenic and intragenic regions are
extracted from the appropriate genome and listed as separate FASTA files with the names
inter_&&&.txt and intra_&&&.txt (for example, inter_dm.txt corresponds to the intergenic
regions of D. melanogaster).  Following the usual conventions for FASTA files, the
inter/intragenic regions are separated by lines beginning with ">" and containing
annotation information (the format of this descriptor is explained in the next section).
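
A minimal Python sketch of this extraction is given below.  It is illustrative only: the
function name split_regions is hypothetical, coordinates are assumed to be 0-based and
half-open, the fused genes are assumed to be sorted and non-overlapping, and sequence
before the first gene or after the last gene on a chromosome is not reported.

```python
def split_regions(chrom_seq, fused):
    """Cut one chromosome into intergenic and intragenic regions.

    chrom_seq: chromosome sequence as a string.
    fused: sorted, non-overlapping list of (fused_id, start, stop) tuples,
           assumed here to be 0-based, half-open coordinates.
    Returns (intergenic, intragenic) where
      intergenic = [(left_id, right_id, sequence), ...]
      intragenic = [(fused_id, sequence), ...]
    """
    intragenic = [(fid, chrom_seq[start:stop]) for fid, start, stop in fused]
    intergenic = []
    for (left_id, _, left_stop), (right_id, right_start, _) in zip(fused, fused[1:]):
        # Region between the stop of one fused gene and the start of the next.
        intergenic.append((left_id, right_id, chrom_seq[left_stop:right_start]))
    return intergenic, intragenic
```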

5) Information on the fused genes is given in the following format:

	gene_accession_number	protein_name	gene_name	chromosome_number	chromosome_strand	translation_start	translation_stop

We shall henceforth refer to this tab-separated identifying scheme as the "fused_gene_id".
For example, the following is a fused gene in fly:
	>NM_164353___NM_164354	CG31973___CG31973	CG31973-PA___CG31973-PB	chr2L	-___-	26520	58150


In the intergenic txt files, each intergenic region is flanked by two fused genes.  The
descriptive line in the FASTA intergenic txt file is of the form

	>(fused_gene_id1)(tab)(tab)(fused_gene_id2)
where fused_gene_id1 corresponds to the fused gene flanking the intergenic region on the
left (i.e., at the lower chromosomal coordinate), and fused_gene_id2 corresponds to the
fused gene flanking the intergenic region on the right.

For intragenic txt files, the regions are located within fused genes.  Thus, only one 
fused_gene_id is necessary.  For consistency in formatting, however, we list it twice:

	>(fused_gene_id1)(tab)(tab)(fused_gene_id1)
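
For readers who want to parse these descriptor lines, the following Python sketch shows
one way to do it.  The function names are hypothetical, and the sketch assumes that no
field inside a fused_gene_id is empty, so that the double tab occurs only between the two
identifiers.

```python
FIELDS = (
    "gene_accession_number",
    "protein_name",
    "gene_name",
    "chromosome_number",
    "chromosome_strand",
    "translation_start",
    "translation_stop",
)


def parse_fused_gene_id(text):
    """Split one tab-separated fused_gene_id into a dict keyed by FIELDS."""
    values = text.lstrip(">").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    record["translation_start"] = int(record["translation_start"])
    record["translation_stop"] = int(record["translation_stop"])
    return record


def parse_descriptor(line):
    """Parse '>(fused_gene_id1)(tab)(tab)(fused_gene_id2)' into two dicts."""
    parts = line.rstrip("\r\n").split("\t\t")
    if len(parts) != 2:
        raise ValueError("descriptor must contain exactly one double tab")
    return parse_fused_gene_id(parts[0]), parse_fused_gene_id(parts[1])
```

For an intragenic file, the two returned dictionaries are simply identical.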

6) For both intergenic and intragenic sequence files, note that repetitive regions have been
changed to lower case.  In the intragenic files, all coding sequences have also been set to
lower case.  Thus, these regions will be avoided in the searches, provided that one replies
"no" when ModuleFinder asks whether to "change lower case to upper case" (see the
ModuleFinder README file).

7) In Human and Mouse, because several intergenic and intragenic regions are very large, we have split
these regions into FASTA entries of size 50,000 bp (provided the region is longer than 50,000 bp). In
doing this splitting, in order to avoid cutting up potential ModuleFinder hits, there is a 2000 bp overlap
between the end of the previous entry and the start of the next entry. The different parts of a region are
identified by adding _R# to the accession number.
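
The splitting can be sketched in Python as below.  This is only one reading of the scheme
and makes several assumptions: each entry is at most 50,000 bp including the overlap (so
consecutive entries start 48,000 bp apart), parts are numbered from 1, and the _R# suffix
is appended to the end of the accession.  The function name split_region is hypothetical.

```python
def split_region(accession, seq, chunk_size=50_000, overlap=2_000):
    """Split a long region into overlapping, labelled parts.

    Returns a list of (label, subsequence).  Regions no longer than
    chunk_size are returned unchanged with their original accession.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if len(seq) <= chunk_size:
        return [(accession, seq)]
    step = chunk_size - overlap  # distance between consecutive part starts
    parts = []
    start = 0
    while True:
        # Assumption: parts are labelled _R1, _R2, ... in order.
        label = "%s_R%d" % (accession, len(parts) + 1)
        parts.append((label, seq[start:start + chunk_size]))
        if start + chunk_size >= len(seq):
            break  # this part reaches the end of the region
        start += step
    return parts
```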

_________________________________________________________________________________________
Data preprocessing of alignment genomes:
1) Let G be a single sequence from either an intragenic or intergenic region.  We then generated
a matching alignment region that is exactly the same length as G, but composed entirely of the
letter "r".  Call this sequence A.

2) Using the alignments generated by UCSC, if a subregion of G is orthologous to some region in
the alignment genome, then we extracted that region and pasted it into the corresponding position
of A.  For example, if

	G = ACAGTTTACAGATTACAGTACATTGACATTAG

and, the sub region GATTACAGTACATTGAC was orthologous to the region GATTACTGTGCAATGAA from the 
alignment genome, then
	A = rrrrrrrrrrGATTACTGTGCAATGAArrrrr
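
The construction of A can be sketched in a few lines of Python.  The function name
build_alignment_sequence and the (offset, text) block representation are assumptions made
for this illustration; offsets are 0-based positions in G.

```python
def build_alignment_sequence(base_seq, aligned_blocks):
    """Build the alignment sequence A for a base sequence G.

    base_seq: the base genome sequence G.
    aligned_blocks: iterable of (offset, aligned_text), where aligned_text is
        the orthologous sequence to paste at 0-based position offset of G.
    Positions with no alignment keep the filler letter "r".
    """
    a = ["r"] * len(base_seq)
    for offset, text in aligned_blocks:
        if offset < 0 or offset + len(text) > len(base_seq):
            raise ValueError("aligned block falls outside the base sequence")
        a[offset:offset + len(text)] = list(text)
    return "".join(a)
```

Calling this with the G shown above and the single block (10, "GATTACTGTGCAATGAA")
reproduces the A shown above.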


3) This situation is somewhat complicated by deletions.  If the deletion occurs in the alignment 
sequence, then this is notated by a "-" and inserted as usual.  Using the preceding example, if the
aligned region were GATTACT---CAATGAA, then
	G = ACAGTTTACAGATTACAGTACATTGACATTAG
	A = rrrrrrrrrrGATTACT---CAATGAArrrrr

Deletions in the base genome (i.e., insertions in the alignment genome) have the effect of altering the
coordinate system given by UCSC.  To avoid this, the gapped columns are removed as follows.  Assume that
the following is the alignment between G and A:
	G = ACAGTTTACAGATTACAGTACAT-----TGACATTAG
	A = rrrrrrrrrrGATTACTGTGCAAATACCTGAArrrrr

After removing the dashed region from G, this becomes
	G = ACAGTTTACAGATTACAGTACATTGACATTAG
	A = rrrrrrrrrrGATTACTGTGCAATGAArrrrr
as expected.  This, however, introduces a problem: if the motif {CATTGAC,CAATGAA,...} were input
to ModuleFinder, then it would observe the sequence at the appropriate position in G, and then
observe that this match was conserved in A (this match is highlighted below with "{" and "}"):

	G = ACAGTTTACAGATTACAGTA{CATTGAC}ATTAG
	A = rrrrrrrrrrGATTACTGTG{CAATGAA}rrrrr

This is clearly undesirable, since the motif cannot truly be conserved once the inserted region is added
back in.  To compensate for this effect, we mark each insertion by changing the positions immediately to
the left and right of it in the alignment genome to lower case:
	G = ACAGTTTACAGATTACAGTACATTGACATTAG
	A = rrrrrrrrrrGATTACTGTGCAatGAArrrrr

ModuleFinder is then set to not consider motif matches as conserved if they contain lower case letters (other
than at the first and last positions) within the alignment genome region.
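
The gap removal, the lower-case marking, and the conservation rule can be sketched
together in Python.  This is an illustration under assumptions, not ModuleFinder's actual
code: the function names are hypothetical, and the conservation test assumes that a
window of A is compared to the motif words without regard to case before the lower-case
rule is applied.

```python
def collapse_base_gaps(g_aln, a_aln):
    """Remove columns where G has "-" and mark each removal in A.

    The A positions immediately left and right of a removed stretch are
    changed to lower case.  Returns (G without gaps, marked A).
    """
    if len(g_aln) != len(a_aln):
        raise ValueError("aligned sequences must have equal length")
    g_out, a_out = [], []
    after_gap = False  # True while the previous column(s) were removed
    for g_ch, a_ch in zip(g_aln, a_aln):
        if g_ch == "-":
            if not after_gap and a_out:
                a_out[-1] = a_out[-1].lower()  # position left of the insertion
            after_gap = True
            continue
        g_out.append(g_ch)
        a_out.append(a_ch.lower() if after_gap else a_ch)  # position right of it
        after_gap = False
    return "".join(g_out), "".join(a_out)


def is_conserved(a_window, motif_words):
    """True if the A window matches the motif and has no inner lower case."""
    if a_window.upper() not in motif_words:
        return False
    # Lower case is tolerated only at the first and last positions.
    return not any(ch.islower() for ch in a_window[1:-1])
```

Applied to the gapped G and A from the example above, collapse_base_gaps returns the
marked pair shown, and is_conserved then rejects the window "CAatGAA".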

4) All alignment files for intergenic and intragenic regions are given in FASTA format, with the
ordering of the regions being the same between the intergenic and intragenic files.  The
descriptor lines are exactly the same in the alignment genome files as in the base genome
intergenic and intragenic files.