This page explains how to prepare your data files before uploading them. It also describes helper scripts that can assist you in the process.
Instructions
- Prepare the files in a directory on your machine whose name is the folder name you supplied when you added the publication to the database.
- Your directory should contain one subdirectory for each protein in your publication that has PBM data, and one for each complex you added in the "Add Complexes" step. Each subdirectory name must exactly match the name of its protein or complex.
- There is a hierarchy to the directory structure within each protein directory that should be followed, which is: protein - clone - PBM version - replicate. Any of these levels (except protein) can be omitted. The same is true for protein complexes, except it is assumed that a complex will only have one "clone." Each directory name should start with the name of its parent directory followed by an underscore. Here are some more details for each level of the hierarchy:
- Proteins: Straightforwardly, each protein should have its own directory within the overall publication folder. If you have more than one protein with exactly the same name (case-sensitive) from different species, every directory from each such protein should have the protein name, followed by an underscore, followed by the species name with all spaces replaced by underscores: e.g., Msn4_Candida_albicans and Msn4_Saccharomyces_cerevisiae.
- Clones: If a protein has distinct PBM data for different clones, each clone should have its own subdirectory within the protein directory; even a clone corresponding to the full-length protein must have its own subdirectory if there are multiple clones. Clone names should correspond to the names you specified in the "add clone" step, which presumably reflect the nomenclature used in the publication.
- PBM versions: You may have PBM data using different PBM array designs. In this case, have one subdirectory for each version, named as: [parent directory]_[AMADID #]. If you have data resulting from an averaging of the data from the different designs, have these in the parent directory; do not create a subdirectory for these data.
- Replicates: If a protein, clone, complex, or set of data from a specific PBM version has separate PBM data for different replicate experiments, each set of data files should have its own separate subdirectory named in some consistent way. Unless you have some other naming scheme in mind, we suggest using [parent directory]_rep1, _rep2, etc. If you have data resulting from the averaging of data from multiple replicates, include these in the parent directory; do not create a subdirectory for these data.
- In every folder which itself contains data files (i.e., not in a deeper subdirectory), the data files must start with the name of the directory which contains them, followed by an underscore, followed by the appropriate filename and extension (see below).
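The naming rules above can be checked mechanically before you build the zip. The following is a minimal Python sketch and is not one of the provided scripts. The function name find_naming_problems is hypothetical, and the sketch assumes the rules mean that, inside each protein or complex directory, every subdirectory and every data file starts with the name of the directory that contains it, followed by an underscore. Hidden files (names starting with a dot) are ignored.

```python
from pathlib import Path


def find_naming_problems(publication_dir):
    """Return messages for names that break the prefix rules (assumed reading)."""
    problems = []
    root = Path(publication_dir)
    # Top-level entries are the protein/complex directories; they are not
    # required to start with the publication folder name.
    for protein_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Check the protein directory and every directory nested below it.
        nested = [protein_dir] + sorted(
            d for d in protein_dir.rglob("*") if d.is_dir()
        )
        for directory in nested:
            prefix = directory.name + "_"
            for child in sorted(directory.iterdir()):
                if child.name.startswith("."):
                    continue  # skip hidden files such as .DS_Store
                if not child.name.startswith(prefix):
                    kind = "directory" if child.is_dir() else "file"
                    problems.append(
                        f"{kind} {child.relative_to(root)} should start with {prefix}"
                    )
    return problems
```

An empty list means no violations were found under this reading of the rules; it does not check that the names match what you entered in the database.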
Format the files according to the instructions below, running any necessary scripts.
When you are done, cd into your directory and execute the following command to create the zip you'll upload to the server:
zip -r ../[folder_name].zip *
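If the zip command is not available on your machine, an archive with the same layout can be produced another way. The sketch below is an assumption-laden Python equivalent, not an official script: zip_publication is a hypothetical name, and it mimics the command above by storing paths relative to the publication folder and skipping top-level names that start with a dot (which the shell glob * does not match).

```python
import zipfile
from pathlib import Path


def zip_publication(folder, zip_path=None):
    """Zip the contents of `folder`, with entry paths relative to `folder`."""
    folder = Path(folder).resolve()
    if zip_path is None:
        # Same location as ../[folder_name].zip in the command above.
        zip_path = folder.parent / (folder.name + ".zip")
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            relative = path.relative_to(folder)
            # The shell glob * skips top-level names that start with a dot.
            if relative.parts[0].startswith("."):
                continue
            archive.write(path, relative.as_posix())
    return zip_path
```

Calling zip_publication("path/to/[folder_name]") writes [folder_name].zip next to the folder, so the protein directories sit at the top level of the archive.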
Data Files
Each file should start with the name of its parent directory followed by an underscore. Not all of these files are required, and precautions have been taken to account for the absence of certain file types, but the more of these you have, the better!
- PWM files:
- 8-mer files:
- Format:
- These contain score data for the protein's binding to specific motifs. The motifs can be gapped or ungapped.
- If you only have ungapped 8-mer data, you'll have one 8-mer file called [folder_name]_8mers.txt, where you fill in [folder_name] with the immediate directory, which may be a protein, clone, complex, or replicate name (see top instructions).
- If you have gapped 8-mer data, you'll name each file according to the position of the gaps allowed for the 8-mers in that file. A 1 is a base and a . is a gap. For example, a file for 8-mers with single gaps after the 2nd and 3rd positions would end in _8mers_11.1.11111.txt. For ungapped k-mer data, use _8mers_11111111.txt. (You can also use this filename even if you only have ungapped data, if you want.) Note that there should be 8 1s in total regardless of how many gaps there are.
- Each 8-mer file must have the two corresponding k-mers (each the reverse complement of the other) in the first two columns, followed by the enrichment score in the third column.
- More recent files have the median signal intensity and Z-score in the 4th and 5th columns, respectively. If this information is available, please include it.
- A header line can be included in the file denoting what value is in each column, but please stick to the order just described.
- You can use the scripts rearrange_8mer_columns.sh and .pl to put the columns of your files in this order (see below).
- Examples:
- You can use this file as a reference for ungapped 8-mer files.
- You can use this file as a reference for gapped 8-mer files.
- Top enrichment file:
- This should be called [folder]_8mers_top_enrichment.txt, and contains the top-scoring k-mers with any number of gaps, along with their respective enrichment scores.
- The format is otherwise identical to the other 8-mer files.
- You can use this file as a reference.
- De Bruijn sequence file:
- This should be called [folder]_combinatorial.txt.
- De Bruijn files output from an older version of the Universal PBM Analysis Suite may be called _deBruijn.txt; the adjust_filenames script will adjust this (see below).
- It contains the 60-mer probe sequences along with raw fluorescence signal intensities.
- You can use this file as a reference.
- All-data file:
- This should be called [folder]_alldata.txt; it is another file output by the Universal PBM Analysis Suite.
- No further processing of this file is necessary.
- You can use this file as a reference.
- Regression file:
- This should be called [folder]_regression.txt; it is another file output by the Universal PBM Analysis Suite.
- No further processing of this file is necessary.
- You can use this file as a reference.
- Raw data file:
- This should be called [folder]_rawdata.txt; it is another file output by the Universal PBM Analysis Suite.
- No further processing of this file is necessary.
- You can use this file as a reference.
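The 8-mer rules in this section (a gap pattern with exactly eight 1s in the filename, and reverse-complement k-mers in the first two columns followed by a numeric enrichment score) lend themselves to a quick sanity check. The Python sketch below is illustrative only: all function names are hypothetical, columns are assumed to be separated by whitespace, and gapped k-mers are assumed to be written with a . at each gap position. A header line, if present, will fail the row check and should be skipped by the caller.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gap_pattern_from_filename(filename):
    """Return the pattern from a name ending in _8mers_<pattern>.txt, else None."""
    match = re.search(r"_8mers_([1.]+)\.txt$", filename)
    return match.group(1) if match else None


def is_valid_gap_pattern(pattern):
    """True if the pattern uses only 1 and . and has exactly eight 1s."""
    return set(pattern) <= {"1", "."} and pattern.count("1") == 8


def reverse_complement(kmer):
    # Gap characters (.) are left unchanged; only bases are complemented.
    return kmer.upper().translate(COMPLEMENT)[::-1]


def row_is_consistent(line):
    """True if columns 1 and 2 are reverse complements and column 3 is a number."""
    fields = line.split()
    if len(fields) < 3:
        return False
    try:
        float(fields[2])
    except ValueError:
        return False
    return reverse_complement(fields[0]) == fields[1].upper()
```

For example, gap_pattern_from_filename("Msn4_8mers_11.1.11111.txt") yields the pattern 11.1.11111, which is_valid_gap_pattern accepts because it contains eight 1s.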
Helpful scripts
Download these scripts here.
NOTE: It is advisable to back up your entire publication folder in bulk (e.g., using cp -r, or by making a zip) before running any of these scripts. Please be sure you understand what each script will do before running it!
See toward the bottom (Calling order) for a template sequence in which to run them.
Please open these scripts and read their comments so that you understand how to run them and what they will do before you run them. Be careful: they will delete files!
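Because the scripts delete files, the backup advised above is worth automating. As an alternative to cp -r, here is a minimal Python sketch; back_up_folder and the _backup suffix are hypothetical choices, not something the provided scripts expect.

```python
import shutil
from pathlib import Path


def back_up_folder(publication_dir, suffix="_backup"):
    """Copy the whole publication folder next to itself; return the copy's path."""
    source = Path(publication_dir).resolve()
    target = source.with_name(source.name + suffix)
    # copytree raises FileExistsError if the target already exists, so an
    # earlier backup is never silently overwritten.
    shutil.copytree(source, target)
    return target
```

Refusing to overwrite an existing backup is deliberate: a backup taken after a script has already run would replace the only clean copy.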
Calling order
Let's say you need to call every single one of the above shell scripts for your publication directory. Here is the order in which you would do it.
This isn't set in stone but is a good template to follow.
You can follow this order for any dataset and simply omit whichever scripts you don't need to run.
Here DIR_PATH is the full path to the directory for your publication in which you're keeping the data files.
NOTE: It is advisable to back up your entire publication folder in bulk (using cp -r) before running any of these scripts.
You should also check that each script worked correctly after EVERY step. If step 4, 5, or 6 did not work properly, call restore_from_backup.sh DIR_PATH with the appropriate file extension (pwm or txt, most likely).
In the event of any script not performing correctly or the way you need it to, contact the DBA (uniprobe@genetics.med.harvard.edu) so the script can be upgraded.
1. move_misnamed_files.sh DIR_PATH NEW_STR OLD_STR (filling in new and old names; call as many times as you need)
2. concatenate_data_files.sh DIR_PATH FILE_EXTENSION (where FILE_EXTENSION is as described above for this script; call as many times as you need)
3. adjust_filenames.sh DIR_PATH
4. generate_motif_line.sh DIR_PATH FORMAT (where FORMAT is 0 or 1 as described above for this script)
5. add_headers.sh DIR_PATH
6. add_algorithm_and_format_line.sh DIR_PATH EXTENSION ALGORITHM PROBABILITY (as needed; see script description)
7. rearrange_8mer_columns.sh DIR_PATH ESCORE_COL MEDIAN_COL ZSCORE_COL (see script description)
8. remove_backups.sh DIR_PATH pwm
9. remove_backups.sh DIR_PATH txt
Then, if all is well, you can remove the backup folder that you made before you started.
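To keep the calling order straight when you only need some of the scripts, it can help to generate the command lines first and review them before running anything. The Python sketch below only builds command strings and runs nothing. The script names and placeholder names come from the list above; plan_commands and its argument layout are hypothetical. It handles one call per script, so repeated calls (for example to move_misnamed_files.sh) and the final remove_backups.sh calls are left to you.

```python
import shlex

# Script names and argument placeholders, in the calling order given above.
CALLING_ORDER = [
    ("move_misnamed_files.sh", ["NEW_STR", "OLD_STR"]),
    ("concatenate_data_files.sh", ["FILE_EXTENSION"]),
    ("adjust_filenames.sh", []),
    ("generate_motif_line.sh", ["FORMAT"]),
    ("add_headers.sh", []),
    ("add_algorithm_and_format_line.sh", ["EXTENSION", "ALGORITHM", "PROBABILITY"]),
    ("rearrange_8mer_columns.sh", ["ESCORE_COL", "MEDIAN_COL", "ZSCORE_COL"]),
]


def plan_commands(dir_path, arguments):
    """Return command lines, in calling order, for the scripts in `arguments`.

    `arguments` maps a script name to a dict of placeholder name -> value.
    Scripts that are not keys of `arguments` are omitted from the plan.
    """
    commands = []
    for script, placeholders in CALLING_ORDER:
        if script not in arguments:
            continue
        values = [str(arguments[script][name]) for name in placeholders]
        # shlex.join quotes any argument that contains spaces.
        commands.append(shlex.join([script, dir_path] + values))
    return commands
```

For instance, passing {"generate_motif_line.sh": {"FORMAT": 1}, "adjust_filenames.sh": {}} returns the adjust_filenames.sh command first, regardless of the order of the dictionary keys.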