This file describes in detail how to run ModuleFinder and how to interpret its output. Note that ModuleFinder can be run on any sequences, so long as they adhere to the format described in the Genome_README.txt file. For simplicity, however, the examples given in this section are taken from the pre-edited human genome given on our web page.

_________________________________________________________________________________________

HOW TO RUN MODULEFINDER

To run ModuleFinder, download the executable from our website. After you click on the "ModuleFinder Executable" button and give your email address, the executable will be sent directly to you. It has been compiled with Linux gcc, version 8.1. (The source code is available by request, in case you would like to compile the program for Mac or PC use, but these platforms are not currently supported.)

At the command prompt, in the directory where you have saved ModuleFinder, type "ModuleFinder". The program will then ask you several questions, which are described below.

1) Enter the name of the output directory, including the whole path:
--> Type the directory to which you would like ModuleFinder to write its output. For example, "/home/me/ModuleFinder_test/"

2) Enter the name of the output file, path should NOT be included:
--> Suppose, for the moment, that you typed "output_file.txt". ModuleFinder will then actually output 3 files: b_output_file.txt, output_file.txt and l_output_file.txt. These are described at the end of this README. Whatever you type here, the path should not be included (i.e., don't type "/home/me/ModuleFinder_test/output_file.txt").

3) Enter the name of the sequence file:
--> This is a FASTA file to be searched for ModuleFinder hits. If you are using the pre-edited genomes given on our website, then any "base genome" file will suffice.
For example, if you have downloaded the human genome for input to ModuleFinder, then you can type hg16_inter_bg.txt to scan all intergenic sequences.

4) Enter the name of the first alignment file:
--> This is a FASTA file that is matched to the sequence file of 3), so that all sequences are of the same length. If you are using the pre-edited genomes, then any "alignment genome" file will suffice. For example, if you are scanning human intergenic regions and you want to use the mouse genome to check conservation, then you can type mm3_inter_ag.txt

5) Enter the name of the second alignment file:
--> This is the second genome file used to check alignment. The formatting is exactly the same as in 4). Note that if you want to use only one alignment genome for the scan, then it is sufficient to type the same file as in 4), and the score will automatically collapse to the 2-genome case.

6) Do you want to change lower case to upper case, y or n?
--> As described in the Genome_README file, our pre-edited genomes follow the UCSC convention of changing repetitive regions to lower case. We also use the convention of changing coding regions to lower case in the intragenic regions. ModuleFinder is set to ignore any lower case sequence in the base genome. Thus, if you want to include repetitive sequence (or coding sequence) in the scan, you should type "y" here. Otherwise, just reply "n".

7) Do you want to automatically get word frequencies: y or n
--> As described in the accompanying paper, ModuleFinder takes motif frequencies into account for the score. To do this, it needs estimates of word frequencies for the input files, and it can get them automatically. If you type "y", then files of the form "countsk.txt" (where k is a number between 1 and 8) will be generated in the directory designated in 1). If you are running ModuleFinder many times on the SAME INPUT SEQUENCE and you already have appropriate counts files, then you can simply type "n" at this prompt.
8) Enter the maximum window size:
--> As stated in the accompanying paper, ModuleFinder scans the base genome with a series of nested windows. The maximum window size to inspect, the minimum window size, and the increment size between windows are input by the user. Note that the more windows you scan with, the slower the program will run.

9) Enter the minimum window size:
--> See 8).

10) Enter the increment size:
--> See 8). Make sure that both the minimum and maximum window sizes are multiples of the increment.

11) Enter the shift size:
--> As stated in the accompanying paper, we have added a feature to ModuleFinder that compensates for local imperfections in the alignments. The shift size corresponds to the amount of "wiggle room" to allow for when scanning the alignment genomes. A value of "0" corresponds to no wiggle room, and the shift size should not be negative. It is recommended to keep the shift size less than 5 or so, in order to avoid spurious conservation.

12) Enter the number of dimers:
--> ModuleFinder can take as input TFs that occur as homo- or hetero-dimers. If you want to input dimers, then input the number of dimers here. For each input dimer, the program will ask for a pair of motif files, as well as a minimum and maximum spacing between the pair of motifs to search for. Note that ModuleFinder DOES NOT automatically search for reverse complements, in order to leave the user free to inspect for orientation effects. Therefore, if motif1.txt and motif2.txt are input files, and you have reason to believe that they dimerize and bind as (motif1)(space)(motif2), then you might also want to input a second dimer of the form (motif2_rc)(space)(motif1_rc), where motif1_rc and motif2_rc are the reverse complements of motif1.txt and motif2.txt. By doing this, you will find both the dimer and the reverse complement of its binding site.
13) Enter the number of single motifs:
--> Here, type in the number of text files containing single motifs. Again, if you do not want to check for orientation effects, then each motif file should contain both the binding sites of interest, as well as their reverse complements.

14) Enter the name of single motif X:
--> Enter the name (along with the directory) of the motif file. NOTE THAT IN THE MOTIF FILE, IT IS IMPORTANT THAT A WORD AND ITS REVERSE COMPLEMENT NOT BOTH BE PRESENT. JUST CHOOSE ONE ORIENTATION, AND STICK WITH IT FOR ALL WORDS IN THE MOTIF.

15) Do you want to include the reverse complement words in the calculation, Y or N?
--> With ModuleFinder, you are free to search on either one strand of the DNA (i.e., to observe potential orientation bias of the motif) or both strands. If you would like to search both strands, enter "Y" here; otherwise enter "N".

16) Enter the threshold score:
--> For each input FASTA sequence, ModuleFinder will return all windows of sequence above the cutoff value input here, after fusing overlapping windows (this is described in greater detail below). The number input here should be a negative number, and more negative numbers correspond to a more stringent cutoff (i.e., fewer output windows). A reasonable input value here is -1 * (number of input motifs), so that -5 is reasonable for 5 input motifs.

After you have answered all of these questions, ModuleFinder will begin running. Note that it prints the number of sequences scanned so far (in multiples of 100) to the standard output, to give you an idea of how far along the scan is.

If you would like to run ModuleFinder directly from the command line, so that you don't have to go through these questions every time, then you can simply type what you would normally have typed at each of these questions, IN EXACTLY THE SAME ORDER, at the command line. Doing so will cause the program to start running immediately, and it will not ask you the above questions.
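Some of the rules above (window sizes as multiples of the increment in 10), the rule-of-thumb threshold in 16), and the reverse complements needed for dimers in 12)) are easy to check before starting a long scan. The following is a minimal sketch in Python. It is not part of ModuleFinder; the function names are hypothetical, and it assumes that both the minimum and the maximum window size are themselves included in the series of windows.

```python
def window_sizes(min_size, max_size, increment):
    """List the window sizes implied by the answers to 8), 9) and 10).

    Assumption: the series runs from min_size to max_size inclusive.
    """
    if increment <= 0:
        raise ValueError("increment must be positive")
    if min_size > max_size:
        raise ValueError("minimum window size exceeds maximum window size")
    if min_size % increment or max_size % increment:
        raise ValueError("window sizes must be multiples of the increment")
    return list(range(min_size, max_size + 1, increment))


def suggested_threshold(n_motifs):
    """Rule of thumb from 16): -1 * (number of input motifs)."""
    return -1 * n_motifs


# Translation table for complementing DNA words (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(word):
    """Reverse complement of a single DNA word, e.g. for a motif1_rc file."""
    return word.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Window settings from the fly example: 300 to 700 bp, increment 50.
    print(window_sizes(300, 700, 50))
    print(suggested_threshold(5))
    print(reverse_complement("ACGTT"))
```

Each additional window size adds to the running time, so the length of the list returned by window_sizes gives a rough idea of the cost of a given choice of settings.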
_________________________________________________________________________________________

ModuleFinder OUTPUT

ModuleFinder outputs three distinct files: output_file.txt, b_output_file.txt, and l_output_file.txt. We describe each of these below.

1) output_file.txt
--> Here is an example of a typical output section from this file. It is taken from a run on intergenic regions in fly, using 5 input TFs (no dimers) and window sizes ranging from 300 to 700 bp, with an increment size of 50.

>NM_140744 CG6064 CG6064-PA [Drosophila melanogaster] chr3L - 17611104 17613498 NM_168744___NM_140745___NM_140747___NM_140748 CG32183___CG6052___CG7460___CG6034 CG32183-PB [Drosophila melanogaster]___CG6052-PA [Drosophila melanogaster]___CG7460-PB [Drosophila melanogaster]___CG6034-PA [Drosophila melanogaster] chr3L + 17664848 17696903
47425 300 -8.437716
motif 1 (0, 0, 0)
motif 2 (1, 0, 0)
motif 3 (4, 1, 1)
motif 4 (0, 0, 0)
motif 5 (3, 0, 0)

The first line is exactly the same as the descriptive line of the FASTA sequence (see the genome README file for an explanation of the descriptive lines that we use for our genome editing). The line directly below this:

47425 300 -8.437716

corresponds to the highest-scoring window within the FASTA sequence entry, where 47425 indicates that this window begins at position 47425 of the FASTA sequence entry, 300 indicates that the window was of size 300 bp, and -8.437716 is the score of the region (this is described in the accompanying paper).

The next lines give the number of times that each motif occurred in the window. For example,

motif 3 (4, 1, 1)

indicates that motif 3 occurred 4 times in this window, that 1 of these 4 occurrences was conserved in the first alignment sequence, and that 1 of these 4 occurrences was conserved in the second alignment sequence (note that in fly, there is only one alignment genome available, which was input twice; thus, the second and third entries will always be identical).
2) b_output_file.txt
--> This is a "brief" output file that contains only the information from the FASTA descriptive line, followed by the position of the maximal scoring window within the sequence, the size of the maximal scoring window, and its score. Using the same example FASTA sequence as in 1), we have:

>NM_140744 CG6064 CG6064-PA [Drosophila melanogaster] chr3L - 17611104 17613498 NM_168744___NM_140745___NM_140747___NM_140748 CG32183___CG6052___CG7460___CG6034 CG32183-PB [Drosophila melanogaster]___CG6052-PA [Drosophila melanogaster]___CG7460-PB [Drosophila melanogaster]___CG6034-PA [Drosophila melanogaster] chr3L + 17664848 17696903
47425 300 -8.437716

3) l_output_file.txt
--> This is the "long" output file, and it contains all regions scoring above the threshold value that was input at question 16) above. Before outputting these regions, the program fuses overlapping windows in order to avoid excessive output. For example, if the two following windows have exactly the same score (using the example above)

47425 300 -8.437716
47426 300 -8.437716

it is clearly not desirable to output both of them, as they are essentially the same window. Thus, we fuse them into the following region

47425 301 -8.437716

Henceforth, we shall refer to fused windows as "REGIONS".
The output of l_output_file.txt looks like the following:

>NM_004924 ACTN4 actinin, alpha 4 19 + 43814434 43896121 60595 81 96343_at 19 115517 (0,10825) (11825,64517) (65517,115517) 1.08426823 1.428259286 1.057356154
region location: (2769,3048)
max window in region: position = 2769, size = 200, score = -3.053519
motif 1: (2,0,0) (2850, 2912, ) () ()
motif 2: (0,0,0) () () ()
motif 3: (0,0,0) () () ()
motif 4: (1,0,0) (2969, ) () ()
region location: (16077,16585)
max window in region: position = 16122, size = 410, score = -3.197767
motif 1: (5,1,0) (16132, 16224, 16351, 16393, 16532, ) (16224, ) () 
motif 2: (0,0,0) () () ()
motif 3: (0,0,0) () () ()
motif 4: (0,0,0) () () ()

It is very similar to output_file.txt, but multiple REGIONS are reported. Note also that the score reported is that of the highest-scoring sub-window in the REGION, and that the motif counts are for the REGION, not just the window.
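The window fusing described for l_output_file.txt can be illustrated with a short interval-merging sketch in Python. This is not ModuleFinder's own code, and the function name is hypothetical. It assumes that two windows overlap when the second starts before the first ends, and that a REGION reports the most negative score among its windows (on the reading that more negative scores are more stringent); neither detail is spelled out in this README.

```python
def fuse_windows(windows):
    """Fuse overlapping windows into regions.

    windows: iterable of (start, size, score) tuples for one FASTA entry.
    Returns a list of (start, size, score) tuples, one per fused region.

    Assumptions (hypothetical, see the text above): a window covers
    positions start .. start + size - 1, and the region keeps the most
    negative score of the windows it absorbs.
    """
    regions = []  # each entry is (start, end_exclusive, score)
    for start, size, score in sorted(windows):
        end = start + size
        if regions and start < regions[-1][1]:
            # Overlaps the current region: extend it and keep the best score.
            prev_start, prev_end, prev_score = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end), min(prev_score, score))
        else:
            # No overlap: start a new region.
            regions.append((start, end, score))
    return [(start, end - start, score) for start, end, score in regions]


if __name__ == "__main__":
    # The two windows from the fusing example in this README.
    example = [(47425, 300, -8.437716), (47426, 300, -8.437716)]
    print(fuse_windows(example))  # [(47425, 301, -8.437716)]
```

The fused region starts at the earliest window start (47425) and ends where the last window ends (47426 + 300), which gives the size of 301 shown in the fusing example.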