Integrative 'Omics Tools

Tissue-specific gene expression patterns are encoded in metazoan genomes primarily via DNA sequence motifs that are recognized by sequence-specific transcription factors (TFs). Chromatin immunoprecipitation (ChIP)-seq data on in vivo TF genomic occupancy have been used to infer cis regulatory elements and TF-DNA binding sites. However, accurate identification of the bound TFs can be complicated by cofactors modulating TF-DNA recognition in vivo. Depending on the TF, indirect DNA association through tethering by a sequence-specific TF with a different binding motif (“tethered binding”) can explain a significant fraction of a TF’s in vivo binding events. Tethering interactions expand the combinatorial complexity of the underlying regulatory networks, allowing genes to be co-regulated by multiple TFs.

To discriminate between direct and indirect binding and to infer recruiting factors, we analyze ChIP-seq data together with data on TFs’ intrinsic DNA-binding specificities. We have developed two novel tools that enable such analysis (Mariani et al., Cell Systems, 2017): a glossary of non-redundant TF-8mer “modules” of shared specificity for 671 metazoan TFs, and GENRE (Genomically Equivalent Negative REgions), a tunable tool for constructing matched genomic background sequences for the analysis of regulatory regions. By integrating gene expression (e.g., RNA-seq) data into our analysis, we can pinpoint which TF, among all the TFs associated with an enriched motif, is likely responsible for tethering the ChIPed factor in cells. Analysis of ChIP-seq data using our TF-8mer glossary and GENRE suggested novel TF-TF interactions. We anticipate that these tools, together with additional integrative genomics tools that we are developing, will aid in elucidating tissue-specific gene regulatory programs.
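
The expression-based step can be illustrated with a minimal Python sketch. It assumes that an enriched motif module maps to a list of TF names and that expression is available as one TPM value per gene. The function name, the 1.0 TPM cutoff, and the example values are hypothetical and chosen only for illustration; this is not the published pipeline.

```python
from typing import Dict, List, Tuple


def rank_candidate_tfs(
    module_tfs: List[str],
    expression_tpm: Dict[str, float],
    min_tpm: float = 1.0,  # assumed cutoff for calling a TF "expressed"
) -> List[Tuple[str, float]]:
    """Keep the TFs of an enriched module that are expressed in the
    profiled cells, sorted from highest to lowest expression."""
    expressed = [
        (tf, expression_tpm[tf])
        for tf in module_tfs
        # TFs missing from the expression table are treated as not expressed
        if expression_tpm.get(tf, 0.0) >= min_tpm
    ]
    return sorted(expressed, key=lambda pair: pair[1], reverse=True)


# Hypothetical input: four TFs sharing one motif module, with made-up TPM values.
module_tfs = ["GATA1", "GATA2", "GATA3", "GATA4"]
expression_tpm = {"GATA1": 85.0, "GATA2": 12.5, "GATA3": 0.2}

candidates = rank_candidate_tfs(module_tfs, expression_tpm)
for tf, tpm in candidates:
    print(f"{tf}\t{tpm:.1f} TPM")
```

In this toy input, only the TFs that pass the expression cutoff remain as candidate recruiting factors. A real analysis would also need to account for replicate variability and for TFs that are active at low transcript levels.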

Recent genome-wide chromatin accessibility profiling studies have provided catalogues of putative open regions, where TFs can recognize their motifs and regulate gene expression programs. In a recent study (Mariani et al., Genome Research, 2020), we developed MEDEA (Motif Enrichment in Differential Elements of Accessibility), a computational tool that analyzes high-throughput chromatin accessibility data to identify cell-type-specific accessible regions and the lineage-specific motifs associated with TF binding therein. To benchmark MEDEA, we used a panel of reference cell lines profiled by ENCODE and curated by the ENCODE-DREAM consortium. By comparing its results with RNA-seq data, ChIP-seq peaks, and DNase-seq footprints, we showed that MEDEA improves the detection of motifs associated with known lineage specifiers. We then applied MEDEA to 610 ENCODE DNase-seq datasets, where it revealed significant motifs even when absolute enrichment was low and identified novel regulators, such as NRF1 in kidney development. MEDEA performs well on both bulk and single-cell ATAC-seq data and is publicly available as part of our Glossary-GENRE suite for motif enrichment analysis.
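
For readers unfamiliar with motif enrichment, one common way to quantify it is the area under the ROC curve (AUROC) obtained when per-region motif scores are used to separate foreground regions (e.g., cell-type-specific accessible regions) from matched background regions. The Python sketch below is a generic illustration of that statistic, not a description of MEDEA's implementation. The function name and the scores are hypothetical.

```python
from typing import Sequence


def auroc(foreground: Sequence[float], background: Sequence[float]) -> float:
    """Probability that a randomly chosen foreground region has a higher
    motif score than a randomly chosen background region (ties count 0.5)."""
    if len(foreground) == 0 or len(background) == 0:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    # Direct pairwise comparison: quadratic in the number of regions,
    # which is fine for a small illustration.
    for f in foreground:
        for b in background:
            if f > b:
                wins += 1.0
            elif f == b:
                wins += 0.5
    return wins / (len(foreground) * len(background))


# Made-up best motif-match scores per region.
fg_scores = [0.9, 0.4, 0.7]
bg_scores = [0.3, 0.5, 0.1]

score = auroc(fg_scores, bg_scores)
print(f"AUROC = {score:.3f}")
```

An AUROC of 0.5 corresponds to no enrichment, and values approaching 1 indicate that the motif scores consistently higher in the foreground. Because the statistic depends only on how the scores rank against each other, its value reflects the choice of background set as much as the foreground.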