Admin Pages

This page explains how to prepare your data files before uploading them, and describes some helpful scripts you can use to aid you in the process.

Instructions

Prepare these files in some directory on your machine with the folder name you supplied when you added the publication to the database.
Your directory should contain one subdirectory for each protein in your publication for which there is PBM data, and one for each complex you added in the "Add Complexes" step. Each subdirectory name must exactly match the name of its protein or complex.
There is a hierarchy to the directory structure within each protein directory that should be followed, which is: protein - clone - PBM version - replicate. Any of these levels (except protein) can be omitted. The same is true for protein complexes, except it is assumed that a complex will only have one "clone." Each directory name should start with the name of its parent directory followed by an underscore. Here are some more details for each level of the hierarchy:
- Proteins: Straightforwardly, each protein should have its own directory within the overall publication folder. If you have more than one protein with exactly the same name (case-sensitive) from different species, every directory from each such protein should have the protein name, followed by an underscore, followed by the species name with all spaces replaced by underscores: e.g., Msn4_Candida_albicans and Msn4_Saccharomyces_cerevisiae.
- Clones: If a protein has distinct PBM data for different clones, each clone should have its own subdirectory within the protein directory; even a clone corresponding to the full-length protein must have its own subdirectory if there are multiple clones. Clone names should correspond to the names you specified in the "add clone" step, which presumably reflect the nomenclature used in the publication.
- PBM versions: You may have PBM data using different PBM array designs. In this case, have one subdirectory for each version, named as: [parent directory]_[AMADID #]. If you have data resulting from an averaging of the data from the different designs, have these in the parent directory; do not create a subdirectory for these data.
- Replicates: If a protein, clone, complex, or set of data from a specific PBM version has separate PBM data for different replicate experiments, each set of data files should have its own separate subdirectory named in some consistent way. Unless you have some other naming scheme in mind, we suggest using [parent directory]_rep1, _rep2, etc. If you have data resulting from the averaging of data from multiple replicates, include these in the parent directory - do not create a subdirectory for these data.
In every folder which itself contains data files (i.e., not in a deeper subdirectory), the data files must start with the name of the directory which contains them, followed by an underscore, followed by the appropriate filename and extension (see below).
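
The naming rules above can be checked mechanically before zipping. The following is a minimal sketch, not part of the UniPROBE script collection: the function name naming_problems is hypothetical, and the exemption for files named [directory].pwm and [directory].bml.pwm is an assumption inferred from the example paths in the script descriptions further down (e.g., Ehf/Ehf.bml.pwm).

```python
from pathlib import PurePosixPath


def naming_problems(relative_paths):
    """Return naming-rule violations for file paths inside a publication folder.

    Each path is relative to the publication folder, for example
    "Msn4/Msn4_rep1/Msn4_rep1_8mers.txt". The first component is the
    protein (or complex) directory, which needs no prefix.
    """
    problems = []
    for raw in relative_paths:
        parts = PurePosixPath(raw).parts
        dirs, filename = parts[:-1], parts[-1]
        # Every directory below the protein level must start with "<parent>_".
        for parent, child in zip(dirs, dirs[1:]):
            prefix = parent + "_"
            if not child.startswith(prefix):
                problems.append(f"{raw}: directory {child} should start with {prefix}")
        if not dirs:
            continue
        # Data files must start with "<containing directory>_".
        # Assumption: plain PWM files are named "<directory>.pwm" or "<directory>.bml.pwm".
        container = dirs[-1]
        allowed_exact = {container + ".pwm", container + ".bml.pwm"}
        if not filename.startswith(container + "_") and filename not in allowed_exact:
            problems.append(f"{raw}: file {filename} should start with {container}_")
    return problems
```

An empty list means no violations were found; the sketch checks only name prefixes, not whether the protein names match the database.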

Format the files according to the instructions below, running any necessary scripts. When you are done, cd into your directory and execute the following command to create the zip you'll upload to the server:

zip -r ../[folder_name].zip *
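
If the zip command is not available on your machine, a similar archive can be produced with Python's standard library. This is a sketch under assumptions: the function name zip_publication_folder is hypothetical, and unlike the shell glob *, it also includes hidden (dot) files at the top level of the folder.

```python
import shutil
from pathlib import Path


def zip_publication_folder(folder):
    """Create <folder>.zip next to `folder`, with the folder's contents at the archive root."""
    folder = Path(folder).resolve()
    # make_archive takes the archive path WITHOUT the ".zip" suffix and appends it itself.
    base_name = str(folder.parent / folder.name)
    archive = shutil.make_archive(base_name, "zip", root_dir=str(folder))
    return Path(archive)
```

Calling zip_publication_folder on your publication folder writes [folder_name].zip into the parent directory, which mirrors the ../[folder_name].zip target of the command above.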

Data Files

Each file should start with the name of its parent directory followed by an underscore. Not all of these files are required, and precautions have been taken to account for the absence of certain file types, but the more of these you have, the better!

PWM files:
- Format:
  - These files should end in .pwm, and are expected to be in either "seed-and-wobble" or "probability" format at this time.
  - Seed-and-wobble format represents the output from the Universal PBM Analysis Suite, which contains multiple matrices for each motif, while "probability" format represents only the probability/frequency matrix, i.e., the PWM proper. We encourage you to use seed-and-wobble format if you have the output file, as it makes for easier and more accurate sequence logo generation.
  - If you are using a secondary motif, you should split primary and secondary motifs into different files, which should end in _primary.pwm and _secondary.pwm. The same goes for tertiary motifs. If you are not using any non-primary motifs, just have one .pwm file (do not include _primary in the filename).
  - Each file can have an arbitrary number of top-scoring motifs, but generally only the top one will be used.
  - No matter how many motifs are in the file, each motif is expected to have a commented header line that looks something like this:
    # Motif: C.C.GCACGA Enrichment Score: 0.478827302868675
    (Note that some files from older publications may not have these for every motif, but we now require it for new publications. See the add_headers scripts below for help.)
  - Below that, for seed-and-wobble format files, include a similar but uncommented line that begins with the motif number, like so for the top-scoring motif:
    1 C.C.GCACGA 0.478827302868675
    For probability format files, simply give the motif sequence and enrichment score on their own line (e.g., C.C.GCACGA 0.478827302868675), no matter how many top-scoring motifs are in the file.
  - Then come the one or more matrices for that motif. For seed-and-wobble format, each matrix should have its own type supplied above it (e.g., "Enrichment score matrix", "Probability matrix", etc.). For each motif, every matrix should have the same number of rows (always 4) and columns (variable).
  - For each PWM, the rows are specified in the order A, C, G, T, and each column corresponds to the score for that row's base at that position in the motif. Each row can optionally be prefaced by its base followed by a colon and a tab, e.g., "A: "; it is recommended you do this for clarity if possible. Matrix values in different columns are then separated by tabs.
- UniPROBE can also accept PWM data derived using BEEML-PBM. These files should end in .bml.pwm, and should start with the following commented header line:
```
# Algorithm: BEEML      Format: PROBABILITY
```
  You can use the add_algorithm_and_format_line.sh or .pl scripts (see below) to easily add this line to your files. Currently we only accept probability format files for BEEML-PBM data, which means the header line will always be exactly as shown. We are working on providing a tool that will allow easy conversion of the raw BEEML-PBM output into this format, and should introduce it soon.
- Examples:
  - You can use this file as a reference for seed-and-wobble format matrix files.
  - You can use this file as a reference for probability format matrix files.
  - You can use this file as a reference for a BEEML-PBM data file in probability format.
- See the database administrator or email uniprobe@genetics.med.harvard.edu with any further questions about PWM file input, as input processing for PWM files is still a work in progress.
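
To make the probability format described above concrete, here is a minimal parsing sketch. It is not the parser UniPROBE itself uses; the function name parse_probability_pwm is hypothetical, it reads only the first motif in the text, and the sample values are made up for illustration.

```python
def parse_probability_pwm(text):
    """Parse the first motif of probability-format text into (motif, score, matrix)."""
    # Drop blank lines and commented (#) header lines.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    motif, score = lines[0].split()
    matrix = {}
    # Rows are expected in the order A, C, G, T, with tab-separated values.
    for base, line in zip("ACGT", lines[1:5]):
        fields = line.split("\t")
        # A row may optionally begin with its base and a colon, e.g. "A:".
        if fields[0].strip() == base + ":":
            fields = fields[1:]
        matrix[base] = [float(value) for value in fields]
    if len(matrix) != 4:
        raise ValueError("expected 4 rows in the order A, C, G, T")
    if len({len(row) for row in matrix.values()}) != 1:
        raise ValueError("all rows must have the same number of columns")
    return motif, float(score), matrix


# Made-up two-position example, for illustration only.
sample = (
    "# Motif: AC Enrichment Score: 0.45\n"
    "AC\t0.45\n"
    "A:\t0.7\t0.1\n"
    "C:\t0.1\t0.7\n"
    "G:\t0.1\t0.1\n"
    "T:\t0.1\t0.1\n"
)
motif, score, matrix = parse_probability_pwm(sample)
```

A file that fails in this sketch (for example, because rows have unequal lengths) is likely worth inspecting by hand before upload.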
8-mer files:
- Format:
  - These contain score data for the protein's binding to specific motifs. The motifs can be gapped or ungapped.
  - If you only have ungapped 8-mer data, you'll have one 8-mer file called [folder_name]_8mers.txt, where you fill in [folder_name] with the immediate directory, which may be a protein, clone, complex, or replicate name (see top instructions).
  - If you have gapped 8-mer data, you'll name each file according to the position of the gaps allowed for the 8-mers in that file. A 1 is a base and a . is a gap. For example, a file for 8-mers with single gaps after the 2nd and 3rd positions would end in _8mers_11.1.11111.txt. For ungapped k-mer data, use _8mers_11111111.txt. (You can also use this filename even if you only have ungapped data, if you want.) Note that there should be 8 1s in total regardless of how many gaps there are.
  - This file will always have the two corresponding k-mers (each is the reverse complement of the other) in the first 2 columns, followed by the enrichment score in the 3rd column.
  - More recent files have the median signal intensity and Z-score in the 4th and 5th columns, respectively. If this information is available, please include it.
  - A header line can be included in the file denoting what value is in each column, but please stick to the order just described.
  - You can use the scripts rearrange_8mer_columns.sh and .pl to change your files appropriately with ease (see below).
- Examples:
  - You can use this file as a reference for ungapped 8-mer files.
  - You can use this file as a reference for gapped 8-mer files.
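
The filename pattern and the reverse-complement pairing described above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and gapped k-mers are assumed to mark gap positions with a "." in the data columns, as in the motif header example (C.C.GCACGA).

```python
# Complement table; gap positions ('.') stay gaps.
COMPLEMENT = str.maketrans("ACGT.", "TGCA.")


def reverse_complement(kmer):
    """Reverse-complement a k-mer, keeping '.' gap positions as gaps."""
    return kmer.translate(COMPLEMENT)[::-1]


def gap_pattern(kmer):
    """Turn a k-mer such as 'AC.G.TTACG' into its filename pattern '11.1.11111'."""
    return "".join("." if ch == "." else "1" for ch in kmer)


def eight_mer_filename(folder_name, kmer):
    """Build the expected 8-mer filename for the gap pattern of `kmer`."""
    pattern = gap_pattern(kmer)
    if pattern.count("1") != 8:
        raise ValueError("an 8-mer must contain exactly 8 bases, however many gaps it has")
    return f"{folder_name}_8mers_{pattern}.txt"
```

For instance, eight_mer_filename("Cad1", "AC.G.TTACG") gives Cad1_8mers_11.1.11111.txt, and reverse_complement can be used to confirm that the first two columns of a row really are reverse complements of each other.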
Top enrichment file:
- This should be called [folder]_8mers_top_enrichment.txt, and contains the top-scoring k-mers with any number of gaps, along with their respective enrichment scores.
- The format is otherwise identical to the other 8-mer files.
- You can use this file as a reference.
De Bruijn sequence file:
- This should be called [folder]_combinatorial.txt.
- De Bruijn files output from an older version of the Universal PBM Analysis Suite may be called _deBruijn.txt; the adjust_filenames script will adjust this (see below).
- It contains the 60-mer probe sequences along with raw fluorescence signal intensities.
- You can use this file as a reference.
All-data file:
- This should be called [folder]_alldata.txt, another file that is output from the Universal PBM Analysis Suite.
- No further processing of this file is necessary.
- You can use this file as a reference.
Regression file:
- This should be called [folder]_regression.txt, another file that is output from the Universal PBM Analysis Suite.
- No further processing of this file is necessary.
- You can use this file as a reference.
Raw data file:
- This should be called [folder]_rawdata.txt, another file that is output from the Universal PBM Analysis Suite.
- No further processing of this file is necessary.
- You can use this file as a reference.

Helpful scripts

Download these scripts here.
NOTE: It is advisable to back up your entire publication folder in bulk (e.g., using cp -r, or by making a zip) before running any of these scripts. Please be sure you understand what each script will do before running it!
See toward the bottom (Calling order) for a template sequence in which to run them.

add_algorithm_and_format_line.sh: We recommend that each pwm file have a commented (#) line at the top, formatted as follows:
```
# Algorithm: [SEEDANDWOBBLE or BEEML]	Format: [SEEDANDWOBBLE or PROBABILITY]
```
This is only necessary for BEEML-PBM data files (see instructions above), and is not required for SEEDANDWOBBLE files. It will probably be necessary for all files derived using algorithms other than Seed-And-Wobble, as we incorporate more in the future.
You should include the following command line arguments for this script (in this order):
```
<FOLDER> <EXTENSION> <ALGORITHM> <FORMAT>
```
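Here, FOLDER is the full path to your publication folder, and EXTENSION is the file extension of all the files you want to alter (e.g., .bml.pwm). ALGORITHM and FORMAT are the values to be written into the header line shown above.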
(NOTE: This may cause problems if you want to alter only your SEEDANDWOBBLE data files, because using .pwm as your extension will alter the BEEML files as well. It is recommended in this case to use the add_algorithm_and_format_line.pl script; see below. However, as explained above, there is no requirement to alter your SEEDANDWOBBLE files.)
Example usage: add_algorithm_and_format_line.sh ~/pub_dirs/EMBO10 .bml.pwm BEEML PROBABILITY
add_algorithm_and_format_line.pl: add_algorithm_and_format_line.sh is a wrapper script for this script. It may be useful to call this script directly if you only want to modify one or more files individually. It takes the filepath of one file, the algorithm, and the format as arguments.
Example usage: add_algorithm_and_format_line.pl ~/pub_dirs/EMBO10/Ehf/Ehf.bml.pwm BEEML PROBABILITY
add_headers.sh: If you need to quickly add header lines to your .pwm files, you can try calling add_headers.sh. It takes one argument, which is the full path of your folder. The script will modify all the pwm files in this directory. The script should not add headers if they already exist.
add_headers.pl: add_headers.sh is just a wrapper script for add_headers.pl. If you only have certain files you want to modify or they are not all in the same publication directory, you can call add_headers.pl directly and feed it the paths of all the files you want to modify as command-line arguments (separated by spaces).
adjust_filenames.sh: This script makes older filenames compatible with the newer naming scheme. It is bound to remain a work in progress, but for now it will copy or move any commonly found files with incompatible names to a file with the right name. See the script itself for more details on exactly what it will change. To run it, supply the full path to your directory as a command-line argument, like so:
adjust_filenames.sh ~/pub_dirs/GB11
concatenate_data_files.sh: You may have certain large files split up for convenience. In this case we expect that you will have named them [protein_name]-[number]_[filetype.txt], e.g., MAL8P1.153-1_rawdata.txt, and presumably similarly with -2 and beyond. This script will concatenate all such files into one file, which is necessary for the proper functioning of the pipeline. Takes 2 arguments: publication_folder_name file_extension. Here file_extension is not the actual extension, but rather we assume that the file will be named _extension.txt.
Example usage: concatenate_data_files.sh ~/pub_dirs/Path10 rawdata
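
The grouping that this step relies on can be sketched as below. This is not the concatenate_data_files.sh script itself, only an illustration of the naming convention: the function name ordered_parts is hypothetical, and the output name [protein_name]_[file_extension].txt is an assumption based on the data file names described earlier. The sketch only plans which parts belong together; it does not write any files.

```python
import re


def ordered_parts(filenames, file_extension):
    """Group split files by protein name and list each group's parts in numeric order.

    `file_extension` is the suffix without the underscore or ".txt", e.g. "rawdata".
    Returns a dict mapping the assumed combined filename to its ordered parts.
    """
    pattern = re.compile(
        r"^(?P<name>.+)-(?P<num>\d+)_" + re.escape(file_extension) + r"\.txt$"
    )
    groups = {}
    for filename in filenames:
        match = pattern.match(filename)
        if match:
            groups.setdefault(match["name"], []).append((int(match["num"]), filename))
    # Sort numerically so that part 10 comes after part 2, not before it.
    return {
        f"{name}_{file_extension}.txt": [part for _, part in sorted(parts)]
        for name, parts in groups.items()
    }
```

Sorting by the integer part number matters once a file is split into ten or more pieces, since a plain alphabetical sort would place -10 before -2.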
generate_motif_line.sh: Some files may lack not only header lines but also the uncommented line above each matrix that contains the motif sequence, as described above for PWM files. Usually in these cases the corresponding information will be found in a separate file ending in _top_kmer.txt. (If not, you should make such a file.) See, for example, this file. To add the line to the .pwm file, run generate_motif_line.sh (before running the add_headers script) with two command-line parameters: publication_folder file_format. Here the file_format parameter is a number representing the output format: 0 for seed-and-wobble, 1 for probability.
Example usage: /disk/var/www/html/pbms/UniPROBE_staging/admin/scripts/generate_motif_line.sh ~/pub_dirs/GB11 0 (for seedandwobble)
NOTE: This script is currently only intended for files with only one motif. Also, this script will replace the first line of any pwm file, whatever it may be, so it will work if you run it more than once, but will not have the intended results if you've already added a header line to your files or you want to keep the first line for some reason. If either of these is an issue and you need to do this for a file with multiple motifs or want to keep the first line, please see the database administrator and we'll try to upgrade the script.
generate_motif_line.pl: generate_motif_line.sh is a wrapper script for generate_motif_line.pl. This Perl script is useful to call directly if you need to modify individual pwm files; it will modify only 1 file at a time. To call the script, supply 3 parameters: pwm_file top_kmer_file file_format, where these parameters are as described above for generate_motif_line.sh.
Example usage: generate_motif_line.pl ~/pub_dirs/Path10/MAL8P1.153/MAL8P1.153.pwm ~/pub_dirs/Path10/MAL8P1.153/MAL8P1.153_top_kmer.txt 0 (for seedandwobble)
move_misnamed_files.sh: If you have a protein directory with a name that does not match the name of the protein stored in the database, you should change it and all the files in it so that they do match exactly. To do this, use the script move_misnamed_files.sh. You call it with 3 command-line arguments: folder_name new_string old_string. As an example, if you wanted to change the Yap2 directory in the GB11 folder so that it was called Cad1, and change all the files in it beginning in "Yap2" to also begin with "Cad1", you would call:
move_misnamed_files.sh ~/pub_dirs/GB11 Cad1 Yap2
rearrange_8mer_columns.sh: See above for the description of how your 8mer data files should be formatted. If your columns are out of order for some reason, call this script to fix that. As command-line arguments, supply your files folder, then the current column number (1-based) of the enrichment scores in your file, followed optionally by the current column number for the median signal intensity and then the current column number for the Z-score. See the script for details. Be careful not to run this script more than once in a row without restoring your original files. If you only need to modify the files in some subdirectory, supply that folder instead of your whole data directory.
Example usage: rearrange_8mer_columns.sh ~/pub_dirs/CB11/LUX 4 3
This tells the script that the enrichment scores are in the 4th column for all 8mer files in this directory and its subdirectories, while the median signal intensities are in the 3rd. The script would then switch them accordingly to produce the proper format.
See script for more details.
rearrange_8mer_columns.pl: The above shell script is a wrapper for this script. This will actually do the rearranging, and acts on only one file at a time. It takes command-line arguments in the form -kmer_file <8mer_file> -escore_col <enrichment_score_column_number> -median_col <median_signal_intensity_column_number> -zscore_col <z-score_column_number>. Only -kmer_file and -escore_col are required arguments (but if you only input these, there isn't much point... in fact it will most likely just delete whatever column you failed to put in). The script will back up your 8mer file first to a .bak file.
Example usage: rearrange_8mer_columns.pl -kmer_file ~/pub_dirs/CB11/LUX/LUX_015681/LUX_015681_rep1/LUX_015681_rep1_8mers.txt -escore_col 4 -median_col 3
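
The effect of the column arguments on a single row can be sketched as follows. This is an illustration of the behavior described above, not the code of rearrange_8mer_columns.pl: the function name rearrange_row is hypothetical, the two k-mers are assumed to already be in the first two columns, and the sketch does not handle a Z-score column supplied without a median column.

```python
def rearrange_row(fields, escore_col, median_col=None, zscore_col=None):
    """Reorder one row as: k-mer, reverse complement, E-score, median, Z-score.

    Column numbers are 1-based. Any column that is not named is dropped,
    which is why omitting an optional argument loses that column's data.
    """
    order = [1, 2, escore_col]
    order += [col for col in (median_col, zscore_col) if col is not None]
    return [fields[col - 1] for col in order]


# Made-up row with the median signal intensity in column 3 and the E-score in column 4.
row = ["AACCGGTA", "TACCGGTT", "1234.5", "0.48"]
fixed = rearrange_row(row, escore_col=4, median_col=3)
```

Applying the same arguments to an already rearranged row swaps the columns back, which illustrates why the script should not be run twice in a row without restoring the original files.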
remove_backups.sh: Note that the add_headers script will back up your original .pwm files to .pwm.bak files before modifying them. You should check that your pwm files were modified correctly, then remove the backups. (We don't want them to be in the publicly available download zips.) To do this you can use remove_backups.sh. You call it with 2 arguments: the full filepath from which you want to (RECURSIVELY) remove backup files, and the file extension whose backups you want to remove. So you might call:
remove_backups.sh ~/pub_dirs/GB11 pwm
(See the script itself for more details.)*
restore_from_backup.sh: If the files were NOT modified correctly by add_headers or rearrange_8mer_columns (which we hope doesn't happen), then you can use restore_from_backup.sh instead to restore the original files from the backups. You call it with 2 arguments: the full filepath in which you want to restore backup files, and the file extension whose backups you want to restore. This script is not recursive but it does also go two layers deeper, making it useful to supply a filepath corresponding to the root folder you're using for your publication data files. So you might call:
restore_from_backup.sh ~/pub_dirs/GB11 pwm
(See the script itself for more details.)*

*Please open these scripts and read the comments to make sure you understand how to run them and what they will do before running them. Be careful as they will delete files!

Calling order

Let's say you need to call every single one of the above shell scripts for your publication directory. Here is the order in which you would do it. This isn't set in stone but is a good template to follow. You can follow this order for any dataset and simply omit whichever scripts you don't need to run. Here DIR_PATH is the full path to the directory for your publication in which you're keeping the data files.
NOTE: It is advisable to back up your entire publication folder in bulk (using cp -r) before running any of these scripts. You should also check that each script worked correctly after EVERY step. If step 4, 5, or 6 did not work properly, call restore_from_backup.sh DIR_PATH with the appropriate file extension (pwm or txt, most likely). In the event of any script not performing correctly or the way you need it to, contact the DBA (uniprobe@genetics.med.harvard.edu) so the script can be upgraded.

1. move_misnamed_files.sh DIR_PATH NEW_STR OLD_STR (filling in new and old names; call as many times as you need)
2. concatenate_data_files.sh DIR_PATH FILE_EXTENSION (where FILE_EXTENSION is as described above for this script; call as many times as you need)
3. adjust_filenames.sh DIR_PATH
4. generate_motif_line.sh DIR_PATH FORMAT (where FORMAT is 0 or 1 as described above for this script)
5. add_headers.sh DIR_PATH
6. add_algorithm_and_format_line.sh DIR_PATH EXTENSION ALGORITHM PROBABILITY (as needed; see script description)
7. rearrange_8mer_columns.sh DIR_PATH ESCORE_COL MEDIAN_COL ZSCORE_COL (see script description)
8. remove_backups.sh DIR_PATH pwm
9. remove_backups.sh DIR_PATH txt

Then if all is well you can remove your backup folder that you made before you started.