
End-to-end functional annotation pipeline integrating JASPAR transcription factor motif analysis, gene ontology enrichment, and protein-protein interaction network analysis
Source:R/analysis.R
profile_target_genes.RdThis function serves as a fully automated, end-to-end multi-omics analysis pipeline. It elegantly bridges 3D genomic interactions (e.g., Hi-C, HiChIP) with transcriptomic data (RNA-seq) to systematically characterize the functional landscape and regulatory mechanisms underlying the identified target genes.
The suite integrates multiple analytical modules:
Differential Expression (LFC) Profiling: Compares the Log2 Fold Change of target genes against the genomic background using tailored violin/boxplots.
GSEA Analysis: Evaluates the rank-distribution of custom target gene sets within the global differential expression landscape.
Expression & 3D Connectivity Dynamics: Generates Z-score expression heatmaps and correlates 3D structural connectivity (e.g., Hub degree) with gene expression/LFC using scatter and sophisticated Raincloud plots.
TF Motif Analysis: Scans JASPAR core motifs across proximal and distal loop anchors, generating motif enrichment rank scatter plots and sequence logos (SeqLogos).
Functional Enrichment (GO): Performs Gene Ontology (BP) enrichment, visualizing results through divergent concept networks (cnetplot) and aggregated faceted lollipop charts.
Interaction Networks (PPI): Constructs and visualizes high-confidence Protein-Protein Interaction networks via the STRING database.
Usage
profile_target_genes(
annotation_res,
diff_file,
lfc_col,
expr_matrix_file,
metadata_file,
target_source = c("loops", "targets"),
target_mapping_mode = c("all", "promoter"),
loop_types = c("E-P", "P-P"),
include_Filled = TRUE,
use_nearest_gene = FALSE,
group_order = NULL,
project_name = "Analysis",
out_dir = "./",
org_db = "org.Hs.eg.db",
run_motif = FALSE,
genome_id = "hg19",
motif_p_thresh = 1e-04,
motif_ntop = 5,
run_go = TRUE,
run_ppi = FALSE,
ppi_score = 400,
ppi_nSample = 400,
heatmap_nSample = 1000,
gsea_nSample = 99999,
cnet_nSample = 50,
stat_test = "wilcox.test",
cor_method = "pearson"
)Arguments
- annotation_res
List. The result object returned by
annotate_peaks_and_loopsorrefine_loop_anchors_by_expression.- diff_file
Character. Path to the differential expression file (CSV/TSV). Must contain gene IDs (rownames) and a Log Fold Change column.
- lfc_col
Character. The column name in
diff_filerepresenting Log2 Fold Change (e.g., "log2FoldChange").- expr_matrix_file
Character. Path to the normalized expression matrix (Genes x Samples, e.g., TPM or FPKM).
- metadata_file
Character. Path to the sample metadata file. Must contain "SampleID" and "Group" columns.
- target_source
Character vector. Source of target genes to analyze. Options include
"loops"(genes connected by specific loops) and"targets"(genes overlapping with external target BED).- target_mapping_mode
Character. Specifies the mapping strategy for target genes. Must be one of:
"all"(Default): Accepts all physically connected 3D targets (including Gene Bodies)."promoter": Strictly enforces that the 3D loop must explicitly anchor at a Promoter region.
- loop_types
Character vector. The specific loop types to analyze (e.g.,
c("E-P", "P-P")). Active only when"loops"is included intarget_source.- include_Filled
Logical. If
TRUE, utilizes the comprehensively merged gene assignment (retaining both loop-assigned targets and local host genes).- use_nearest_gene
Logical. If
TRUE, bypasses 3D loop-based gene assignment and strictly uses the nearest 1D gene (SYMBOL), serving as a spatial baseline reference.- group_order
Character vector. Optional factor levels to sort sample groups in visualizations.
- project_name
Character. Prefix for all output files and plot titles.
- out_dir
Character. Directory path for saving output PDF plots and CSV tables.
- org_db
Character. Organism annotation database (e.g., "org.Hs.eg.db" for human, "org.Mm.eg.db" for mouse).
- run_motif
Logical. Whether to perform Transcription Factor Binding Site (TFBS) motif analysis on proximal and distal anchors.
- genome_id
Character. Reference genome assembly for motif sequence extraction (e.g., "hg38", "hg19", "mm10").
- motif_p_thresh
Numeric. P-value threshold for
motifmatchrscanning.- motif_ntop
Numeric. Number of top enriched motifs to output as SeqLogos.
- run_go
Logical. Whether to perform Gene Ontology (GO) ORA enrichment.
- run_ppi
Logical. Whether to construct Protein-Protein Interaction networks via the STRING database (requires internet access).
- ppi_score
Numeric. Minimum combined confidence score for STRING interaction edges (0-1000).
- ppi_nSample
Numeric. Maximum number of top variable/LFC genes to include in the PPI network to prevent visual overcrowding.
- heatmap_nSample
Numeric. Maximum number of highly variable genes to plot in the expression heatmap.
- gsea_nSample
Numeric. Maximum number of target genes to sample for GSEA to maintain computational efficiency.
- cnet_nSample
Numeric. Number of top GO terms to display in the divergent concept network plot.
- stat_test
Character. Statistical test for LFC comparisons (
"wilcox.test"or"t.test").- cor_method
Character. Method for sample correlation matrices (
"pearson"or"spearman").
Value
An invisible nested list containing:
go_results: A list of top GO enrichment tables for each analyzed data source.target_gene_sets: The specific lists of gene IDs utilized for each analysis task.
Examples
# =========================================================================
# 1. Get paths to real example files included in the package
# =========================================================================
rdata_path <- system.file("extdata", "analysis_results.RData", package = "looplook")
expr_path <- system.file("extdata", "example_tpm.txt", package = "looplook")
diff_path <- system.file("extdata", "example_deg.txt", package = "looplook")
meta_path <- system.file("extdata", "example_coldata.txt", package = "looplook")
# Check if all necessary files exist before running
if (rdata_path != "" && expr_path != "" && diff_path != "" && meta_path != "") {
# Safely load the pre-computed annotation result from RData
temp_env <- new.env()
load(rdata_path, envir = temp_env)
raw_annotation <- temp_env[[ls(temp_env)[1]]]
out_base <- tempdir()
# =========================================================================
# Comprehensive Integrative Profiling (Loops + Targets)
#
# =========================================================================
res <- profile_target_genes(
annotation_res = raw_annotation,
diff_file = diff_path,
lfc_col = "log2FoldChange",
expr_matrix_file = expr_path,
metadata_file = meta_path,
target_source = c("targets"),
target_mapping_mode = "all",
include_Filled = FALSE,
use_nearest_gene = FALSE,
project_name = "Scenario_A_Integrative",
out_dir = out_base,
run_go = FALSE, # Set to TRUE in real analysis
run_ppi = FALSE
)
}
#> >>> Analysis Init | Root Project: Scenario_A_Integrative_LoopOnly
#> --- Reading files...
#> Loaded Promoter Centric Stats.
#>
#> ================================================================
#> >>> Processing Source: [targets]
#> Targets: Successfully located 'Assigned_Target_Genes'
#>
#> --- Task: Target_Genes (Valid Genes: 150) ---
#> Saved LFC Violin Plot: /tmp/Rtmpp3oEeP/Scenario_A_Integrative_LoopOnly_targets_Target_Genes_LFC_Violin.pdf
#> Warning: qvalue::qvalue() failed, returning NA for qvalue. Error: missing values and NaN's not allowed if 'na.rm' is FALSE
#> Saved GSEA Plot (Custom): /tmp/Rtmpp3oEeP/Scenario_A_Integrative_LoopOnly_targets_Target_Genes_GSEA.pdf
#> [Extra Analysis] Found 'n_Linked_Distal'. Running distal connectivity analysis...
#>
#> All analysis complete.