This file contains a detailed description of how to run ModuleFinder and decipher its output.  
Note that ModuleFinder can be run on any sequences, so long as they adhere to the format 
described in the Genome_README.txt file.  For simplicity, however, the examples given in this 
section will be taken from the pre-edited human genome given on our web page.


_________________________________________________________________________________________
HOW TO RUN MODULEFINDER


To run ModuleFinder, download the executable from our website.  After you click on the 
"ModuleFinder Executable" button and give your email address, the executable will be sent 
directly to you.  It has been compiled with linux gcc, version 8.1 (note that the source code 
is available by request, in case you would like to compile the program for mac or pc use, but 
these platforms are not currently supported).

At the command prompt, in the directory where you have saved ModuleFinder, type "ModuleFinder".
The program will then ask you several questions, which are described below.

1) Enter the name of the output directory, including the whole path:
	--> here, type the directory where you would like ModuleFinder to write its output.  For
	example, "/home/me/ModuleFinder_test/"


2) Enter the name of the output file, path should NOT be included:
	-->for example, suppose you typed "output_file.txt".  ModuleFinder will then
	actually output 3 files: output_file.txt, b_output_file.txt and l_output_file.txt.
	These files are described at the end of this README.  
	Note that whatever you type here, the path should not be included (i.e., don't type
	"/home/me/ModuleFinder_test/output_file.txt").


3) Enter the name of the sequence file:
	-->This is a FASTA file to be searched for ModuleFinder hits.  If you are using the
	pre-edited genomes given on our website, then any "base genome" file will suffice.
	For example, if you have downloaded the human genome for input to ModuleFinder, then
	you can type hg16_inter_bg.txt to scan all intergenic sequences.


4) Enter the name of the first alignment file:
	-->A FASTA file that is matched to the sequence file of 3), so that all sequences
	are of the same length.  If you are using the pre-edited genomes, then any "alignment
	genome" file will suffice.  For example, if you are scanning human intergenic regions
	and you want to use the mouse genome to check conservation, then you can type 
	mm3_inter_ag.txt

5) Enter the name of the second alignment file:
	--> This is the second genome file to use to check alignment.  The formatting is 
	exactly the same as in 4).  Note that if you want to use only one alignment genome
	for the scan, then it is sufficient to type the same file as in 4), and the score
	will automatically collapse to the 2-genome case.  

6) Do you want to change lower case to upper case, y or n?
	--> As described in the Genome_README file, our pre-edited genomes follow the UCSC
	 convention of changing repetitive regions to lower case.  We also use the convention
	of changing coding regions to lowercase in the intragenic regions.  ModuleFinder
	is set to ignore any lower case sequence in the base genome.  Thus, if you want to
	include repetitive sequence (or coding sequence) in the scan, you should type "y"
	here.  Otherwise, just reply "n".


7) Do you want to automatically get word frequencies: y or n
	--> As described in the accompanying paper, ModuleFinder takes motif frequencies 
	into account for the score.  To do this, it needs estimates of word frequencies
	for the input files, and can get them automatically.  If you type "y", then
	files of the form "countsk.txt" (where k is a number between 1 and 8) will
	be generated in the directory designated in 1).  If you are running 
	ModuleFinder many times on the SAME INPUT SEQUENCE and you already have the 
	appropriate counts files, then you can simply type "n" at this prompt.


8) Enter the maximum window size:
	--> As stated in the accompanying paper, ModuleFinder scans the base genome with 
	a series of nested windows, where the maximum window size to inspect, the minimum
	window size, and the increment size between windows are input by the user.  Note 
	that the more windows you use to scan with, the slower the program will run.  


9) Enter the minimum window size:
	--> see 8)


10) Enter the increment size:
	--> see 8).  Make sure that both the minimum and maximum window sizes are multiples of 
	the increment.


11) Enter the shift size:
	--> As stated in the accompanying paper, we have added a feature to ModuleFinder that
	compensates for local imperfections in the alignments.  The shift size corresponds
	to the amount of "wiggle room" to allow for when scanning the alignment genomes.  A value
	of "0" corresponds to no wiggle room.  Shift size should be a non-negative whole number.  
	It is recommended to keep the shift size less than 5 or so, in order to avoid spurious 
	conservation.


12) Enter the number of dimers:
	--> ModuleFinder can take as input TFs that occur as homo- or hetero-dimers.  If you want to
	input dimers, enter how many here.  For each input dimer, the program will ask for a pair of
	motif files, as well as a minimum and maximum spacing between the pairs of motifs to search for.
	Note that ModuleFinder DOES NOT automatically search for reverse complements of dimers, in order
	to leave the user free to inspect for orientation effects.  Therefore, if motif1.txt and motif2.txt
	are input files, and you have reason to believe that they dimerize and bind as 
	(motif1)(space)(motif2), then you might also want to input a second dimer of the form 
	(motif2_rc)(space)(motif1_rc), where motif1_rc and motif2_rc are the reverse complements
	of motif1.txt and motif2.txt.  By doing this, you will find both the dimer and the reverse complement
	of its binding site.


13) Enter the number of single motifs:
	--> here, type the number of single-motif text files that you want to input.  ModuleFinder will then
	ask for the name of each file in turn (see 14).  Orientation is handled by question 15), so the motif
	files themselves should list each word in one orientation only.


14) Enter the name of single motif X:
	--> enter the name (along with directory) of the motif file.  NOTE THAT IN THE MOTIF FILE, IT IS 
	IMPORTANT THAT A WORD AND ITS REVERSE COMPLEMENT NOT BOTH BE PRESENT--JUST CHOOSE ONE 
	ORIENTATION, AND STICK WITH IT FOR ALL WORDS IN THE MOTIF.  

15) Do you want to include the reverse complement words in the calculation, Y or N?
	--> With ModuleFinder, you are free to search on either one strand of the DNA (i.e. to observe
	potential orientation bias of the motif) or both strands.  If you would like to search both strands, 
	enter "Y" here, otherwise enter "N".


16) Enter the threshold score:
	--> For each input FASTA sequence, ModuleFinder will return all windows of sequence that pass the
	cutoff value input here, after fusing overlapping windows (this is described in greater detail
	below).  The number input here should be negative, and more negative numbers correspond
	to a more stringent cutoff (i.e., fewer output windows).  A reasonable input value is
	-1 * (number of input motifs), so that -5 is reasonable for 5 input motifs.
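
The reverse-complement bookkeeping described in questions 12) to 15) can be sketched in a few
lines of Python.  This is only an illustration under stated assumptions, not part of ModuleFinder:
the function names are hypothetical, motifs are represented as plain lists of DNA words, and no
particular motif file format is assumed.

```python
# Hypothetical helpers for preparing motif words; not part of ModuleFinder.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(word):
    """Return the reverse complement of a DNA word, preserving case."""
    return word.translate(_COMPLEMENT)[::-1]


def reverse_complement_dimer(motif1_words, motif2_words):
    """Given the words of a dimer (motif1)(space)(motif2), return the word
    lists for the second dimer (motif2_rc)(space)(motif1_rc)."""
    motif2_rc = [reverse_complement(w) for w in motif2_words]
    motif1_rc = [reverse_complement(w) for w in motif1_words]
    return motif2_rc, motif1_rc


def has_word_and_reverse_complement(words):
    """Return True if some word and its reverse complement are both listed,
    which is what question 14) warns against.  Palindromic words (equal to
    their own reverse complement) are not counted as a clash."""
    seen = {w.upper() for w in words}
    for w in seen:
        rc = reverse_complement(w)
        if rc != w and rc in seen:
            return True
    return False
```

The first helper is all that is needed to build motif1_rc and motif2_rc by hand; the last one
can be run over the words of a single-motif file before answering question 14).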


After you have answered all of these questions, ModuleFinder will begin running.  Note that it
outputs the number of sequences scanned so far (in multiples of 100) to the standard output, to give
you an idea of how far along the scan is.  

Note that, if you would like to run ModuleFinder directly from the command line, so that you don't 
have to go through these questions every time, then you can simply type in what you would have normally
typed in at each of these questions IN EXACTLY THE SAME ORDER at the command line.  Doing so will
cause the program to immediately start running, and it will not ask you the above questions.
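
As a rough sketch of the non-interactive form, the Python below assembles the answers in question
order and checks the window-size rule from question 10) first.  Several points are assumptions and
not confirmed by this README: that each answer is passed as a separate command-line argument, that
"0" is the answer for no dimers (so no dimer prompts follow), and that question 15) is asked once
after all motif names.  The function names and the "./ModuleFinder" path are hypothetical.

```python
def window_sizes(max_window, min_window, increment):
    """Return the window sizes implied by questions 8) to 10), or raise
    ValueError if the sizes are not multiples of the increment."""
    if increment <= 0 or min_window <= 0 or min_window > max_window:
        raise ValueError("need 0 < minimum <= maximum and increment > 0")
    if min_window % increment or max_window % increment:
        raise ValueError("minimum and maximum must be multiples of the increment")
    return list(range(min_window, max_window + 1, increment))


def build_command(executable, out_dir, out_file, seq_file, align1, align2,
                  to_upper, get_counts, max_window, min_window, increment,
                  shift, motif_files, both_strands, threshold):
    """Build an argument list with the answers in the order of questions
    1) to 16).  ASSUMPTIONS: no dimers, and question 15) is asked once."""
    window_sizes(max_window, min_window, increment)  # raises on bad sizes
    if shift < 0:
        raise ValueError("shift size must not be negative")
    answers = [
        out_dir, out_file, seq_file, align1, align2,    # questions 1) to 5)
        to_upper, get_counts,                           # questions 6) and 7)
        max_window, min_window, increment, shift,       # questions 8) to 11)
        0,                                              # question 12): no dimers (assumed)
        len(motif_files), *motif_files,                 # questions 13) and 14)
        both_strands,                                   # question 15)
        threshold,                                      # question 16)
    ]
    return [executable] + [str(a) for a in answers]


command = build_command(
    "./ModuleFinder", "/home/me/ModuleFinder_test/", "output_file.txt",
    "hg16_inter_bg.txt", "mm3_inter_ag.txt", "mm3_inter_ag.txt",
    "n", "y", 700, 300, 50, 2, ["motif1.txt", "motif2.txt"], "Y", -2,
)
print(command)
# To launch the scan: import subprocess; subprocess.run(command, check=True)
```

Building the list in one place makes it harder to get the order of the answers wrong, which
matters because the program reads them purely by position.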


_________________________________________________________________________________________
MODULEFINDER OUTPUT

ModuleFinder outputs three distinct files: output_file.txt, b_output_file.txt, and l_output_file.txt.
We describe each of these below.

1) output_file.txt
	-->Here is an example of a typical output section from this file.  It is taken from
	a run on intergenic regions in fly, using 5 input TFs (no dimers) and window sizes ranging
	from 300 to 700 bp, with an increment size of 50.


	>NM_140744	CG6064	CG6064-PA [Drosophila melanogaster]	chr3L	-	17611104	17613498		NM_168744___NM_140745___NM_140747___NM_140748	CG32183___CG6052___CG7460___CG6034	CG32183-PB [Drosophila melanogaster]___CG6052-PA [Drosophila melanogaster]___CG7460-PB [Drosophila melanogaster]___CG6034-PA [Drosophila melanogaster]	chr3L	+	17664848	17696903
	47425	300	-8.437716
	motif 1 (0, 0, 0)
	motif 2 (1, 0, 0)
	motif 3 (4, 1, 1)
	motif 4 (0, 0, 0)
	motif 5 (3, 0, 0)


	The first line is exactly the same as the descriptive line of the FASTA sequence (see the
	Genome_README file for an explanation of the descriptive lines that we use for our genome
	editing).  The line directly below this:
	
	47425	300	-8.437716

	corresponds to the highest-scoring window within the FASTA sequence entry, where 47425
	indicates that this window begins at position 47425 of the FASTA sequence entry,
	300 indicates that the window was of size 300 bp, and -8.437716 is the score of the 
	region (this is described in the accompanying paper).

	The next lines give the number of times that each motif occurred in the window.  For 
	example, 

	motif 3 (4, 1, 1) indicates that motif 3 occurred 4 times in this window, that 1 of
	these 4 occurrences was conserved in the first alignment sequence, and that 1 of these 4
	occurrences was conserved in the second alignment sequence (note that in fly, there is only one
	alignment genome available, which was input twice; thus, the second and third entries
	will always be identical).


2) b_output_file.txt
	--> This is a "brief" output file that contains only the information from the FASTA descriptive
	line, followed by the position of the maximal scoring window within the sequence, the size
	of the maximal scoring window, and its score.  Using the same example FASTA sequence as in 1), we
	have:

	>NM_140744	CG6064	CG6064-PA [Drosophila melanogaster]	chr3L	-	17611104	17613498		NM_168744___NM_140745___NM_140747___NM_140748	CG32183___CG6052___CG7460___CG6034	CG32183-PB [Drosophila melanogaster]___CG6052-PA [Drosophila melanogaster]___CG7460-PB [Drosophila melanogaster]___CG6034-PA [Drosophila melanogaster]	chr3L	+	17664848	17696903	47425	300	-8.437716


3) l_output_file.txt	
	--> This is the "long" output file, and contains all regions passing the threshold value
	that was input during question 16) above.  Before outputting these regions, the program fuses
	overlapping windows in order to avoid excessive output.  For example, if the following two
	windows have exactly the same score (using the example above)

		47425	300	-8.437716
		47426	300	-8.437716

	it is clearly not desirable to output both of them, as they are essentially the same window.
	Thus, we fuse them into the following region:

		47425	301	-8.437716


	Henceforth, we shall refer to fused windows as "REGIONS".  The output of l_output_file.txt looks like the following:


	>NM_004924	ACTN4	actinin, alpha 4	19	+	43814434	43896121	60595	81	96343_at	19		115517 (0,10825) (11825,64517) (65517,115517)	1.08426823	1.428259286	1.057356154
		region location: (2769,3048)
		max window in region: position = 2769, size = 200, score = -3.053519
		motif 1: (2,0,0)	(2850, 2912, )	()	()
	
		motif 2: (0,0,0)	()	()	()
	
		motif 3: (0,0,0)	()	()	()
	
		motif 4: (1,0,0)	(2969, )	()	()

		region location: (16077,16585)
		max window in region: position = 16122, size = 410, score = -3.197767
		motif 1: (5,1,0)	(16132, 16224, 16351, 16393, 16532, )	(16224, )	()
	
		motif 2: (0,0,0)	()	()	()
	
		motif 3: (0,0,0)	()	()	()
	
		motif 4: (0,0,0)	()	()	()


	This is very similar to output_file.txt, but multiple REGIONS are reported.  Note
	that the score reported is that of the highest-scoring sub-window in the REGION, and
	that the motif counts are for the REGION, not just the window.
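
The fusion of overlapping windows into REGIONS described for l_output_file.txt can be sketched as
a standard interval merge.  This is an illustration only and is not taken from the ModuleFinder
source: the function name is hypothetical, windows are assumed to be (position, size, score)
tuples, and the "highest scoring" window is assumed to be the one with the most negative score,
in line with the threshold convention of question 16).

```python
def fuse_windows(windows):
    """Fuse overlapping (position, size, score) windows into regions.

    Returns a list of (position, size, best_score) tuples, one per region.
    ASSUMPTION: the best score in a region is the most negative one.
    """
    regions = []  # each entry is (start, end, best_score), end exclusive
    for position, size, score in sorted(windows):
        end = position + size
        if regions and position < regions[-1][1]:
            # Window overlaps the current region: extend it, keep best score.
            start, current_end, best = regions[-1]
            regions[-1] = (start, max(current_end, end), min(best, score))
        else:
            # No overlap: start a new region.
            regions.append((position, end, score))
    return [(start, end - start, best) for start, end, best in regions]
```

Applied to the two windows in the example above, (47425, 300) and (47426, 300), this produces a
single region starting at 47425 with size 301, while a window that starts beyond the end of the
region is kept as a separate REGION.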