glossary and GENRE software download/tutorial

Download glossaryGENRE_download.zip from http://thebrain.bwh.harvard.edu/glossary-GENRE/download.html

Unzip the scripts in a directory you can use as the GENRE-glossary home directory. Everything in the subsequent downloads will be placed in a subdirectory of this home directory.
$ unzip glossaryGENRE_download.zip

Verify that all dependencies are installed.
$ sh testEnv.sh
    Dependencies are:
        Python 2.X (tested in 2.6.6 and 2.7.6)
            argparse, os, sqlite3, re, time, random, csv, math, hashlib, subprocess, decimal
        R (tested in 3.0.2 and 3.2.3)
            zoo, Biostrings, methods, BiocGenerics, parallel, IRanges, XVector
        BEDTools (tested in 2.23.0 and 2.25.0)
        sqlite3 (tested in 3.6.20 and 3.11.0)
    If any dependencies are missing, install them and run testEnv.sh again.

View all available downloads.
$ python getGENRE.py -avail

Obtain a foreground set. It must be in BED file format; only the first three columns are used.

Download an appropriate database for your foreground. At the moment, only the database from Mariani et al., 2017 is available. If you are looking for more genomes/lengths, please contact the Bulyk lab.
$ python getGENRE.py -download "db_ID"
    db_ID: The ID is the second column of the -avail output; choose an ID under the DATABASES heading.
        First, narrow options by your foreground set's genome.
        Second, narrow options by your foreground set's length distribution.
            dflt: "default" length of 200 bp
    For example, if you've chosen a foreground of GATA peaks from the hg19 genome with lengths of 200 bp, you'd download the hg19_dflt database.
    $ python getGENRE.py -download hg19_dflt
    The program will automatically download the GENRE scripts and the hg19 genome as dependencies.

Run GENRE using the pre-defined database you've just downloaded and your chosen foreground set.
$ bin/GENRE/GENRE "db_ID" "FG set"
    db_ID: see above
    FG set: your foreground set in BED format
    Optional arguments:
        -seed seed: seed for randomization, preferably a number (anything given is converted to a string); default 123456789
        -BG BG: Background output directory name. If not given, prefix will be the same as the FG file; suffix will be the db_ID
        -mult mult: multiplicity factor (positive integer); default 1 (mult 1 is needed to run with the glossary)
    For the above example, to get 1 background sequence for 1 foreground sequence in an output BED file named hg19_GATA_dfltBG/hg19_GATA_dfltBG.bed:
    $ bin/GENRE/GENRE hg19_dflt hg19_GATA.bed -seed 49472047 -BG hg19_GATA_dfltBG
    Output:
        Message with GENRE version number, seed, and BG filename.
        Output directory containing:
            Copy of Foreground BED file (to account for version control)
            Foreground FASTA file
            Background BED and FASTA file

Download an appropriate motif set.
$ python getGENRE.py -download "motifs_ID"
    motifs_ID: The ID is the second column of the -avail output; choose an ID under the MOTIFS heading.
    For example, to download the glossary in kmer format,
    $ python getGENRE.py -download glossary-kmer
    The program will automatically download the glossary scripts as a dependency.

Run glossary script with foreground and background FASTA files and the motif set you've just downloaded.
$ bin/glossary/glossary "FG FASTA file" "BG FASTA file" "motifs_ID"
    FG FASTA file: your foreground set in FASTA file format
    BG FASTA file: the background set in FASTA file format
    motifs_ID: see above
    Both the FASTA files are outputted by GENRE if you don't have them from other sources.
    For the above example,
    $ bin/glossary/glossary hg19_GATA_dfltBG/hg19_GATA.fa hg19_GATA_dfltBG/hg19_GATA_dfltBG.fa glossary-kmer
    Output:
        Best matches per motif file
            PWM - PWM score
            kmer - E score
            location in sequence
            sequence matched
        AUC results
            Motif: Motif file
            AUROC: enrichment of motif in foreground over background
            p-val: p-value of AUROC - significance of enrichment
            numFG: number of genes with motif match in foreground
            numBG: number of genes with motif match in background
            medFG: median distance of best motif match in foreground
            medBG: median distance of best motif match in background
            posTest: position test p-value

All-in-One option: Combine GENRE and the glossary.
$ bin/glossary/glossary_GENRE "db_ID" "FG set" "motifs_ID"
    Optional arguments:
        -seed seed: seed for randomization, preferably a number (anything given is converted to a string); default 123456789
        -BG BG: Background output directory name. If not given, prefix will be the same as the FG file; suffix will be the db_ID
    Same output is given as for GENRE and glossary separately.

Add your own motif sets
$ bin/glossary/addMotifSet "motifs" "type"
    motifs: file of motifs separated by >motif_name or directory of single motif files with the motif_name being the filename (minus extension)
    type: type of motif; either pwm or kmer
        pwm specifics
            rectangular matrix
            all positions add to 1
            bases are assumed to be ACGT
            bases must be labeled if length is 4 or 5bp, though it's always a good idea to label them
            positions may or may not be labeled
        kmer specifics
            two columns separated by non-newline whitespace: kmer and E-score
    Optional arguments
        -description description: description to use in the -avail output
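
Before running addMotifSet, it can help to check a motif against the pwm and kmer specifics listed above. The following is a minimal sketch only and is not part of the glossary/GENRE distribution: check_pwm and parse_kmer_line are hypothetical helper names, the PWM is assumed to be already parsed into one unlabeled row per position with values in ACGT order, and the example values are made up for illustration.

```python
BASES = "ACGT"  # base order the tutorial assumes when bases are unlabeled


def check_pwm(rows, tol=1e-6):
    """Check a PWM given as a list of positions (assumed layout: one row per
    position, four probabilities in ACGT order).

    Returns a list of problem descriptions; an empty list means it passed.
    """
    problems = []
    for i, row in enumerate(rows, start=1):
        if len(row) != len(BASES):
            # A short or long row means the matrix is not rectangular.
            problems.append("position %d has %d values, expected %d"
                            % (i, len(row), len(BASES)))
        elif abs(sum(row) - 1.0) > tol:
            # Each position must add to 1 (within a small tolerance).
            problems.append("position %d sums to %g, expected 1"
                            % (i, sum(row)))
    return problems


def parse_kmer_line(line):
    """Split one kmer line into (kmer, E-score) on any whitespace."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError("expected 2 columns, got %d" % len(fields))
    return fields[0], float(fields[1])


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(check_pwm([[0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]]))
    print(check_pwm([[0.5, 0.5, 0.5, 0.5], [1.0, 0.0, 0.0]]))
    print(parse_kmer_line("GATAAGAT\t0.49"))
```

The checks return or raise on the first kind of problem per row rather than fixing anything, so the motif file itself is left untouched; how addMotifSet reports malformed input is not described in this tutorial.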