
A transcriptome-informed filtering framework to refine the regulatory element (anchor) classification and the loop-target linkage using gene expression profiles
Source:R/annotation.R
refine_loop_anchors_by_expression.RdA critical filtration engine that integrates quantitative RNA-seq data (e.g., TPM/FPKM) with 3D structural data to transition from a "Physical Contact Network" to a "Functionally Active Regulatory Network".
It systematically evaluates every loop anchor, purging transcriptional noise, splitting conjoined multi-gene loci, and biologically reclassifying silent promoters into enhancer-like regulatory elements.
Usage
refine_loop_anchors_by_expression(
annotation_res,
expr_matrix_file,
sample_columns,
threshold = 1,
unit_type = "TPM",
species = "hg38",
out_dir = "./results/filtered",
project_name = "HiChIP",
color_palette = "Set2",
karyo_bin_size = 1e+05,
reclassify_by_expression = TRUE,
hub_percentile = 0.95
)Arguments
- annotation_res
List. The raw foundational output object returned by
annotate_peaks_and_loops.- expr_matrix_file
Character. Path to the normalized expression matrix (rows = genes, columns = samples). Supports CSV or TSV.
- sample_columns
Character vector or Integer indices. The specific sample columns in the matrix to average for calculating the baseline expression.
- threshold
Numeric. The minimum expression value (e.g., TPM > 1) required to consider a gene "biologically active" (default:
1).- unit_type
Character. Expression unit used for plot labeling (e.g.,
"TPM"; default:"TPM").- species
Character. Reference genome build. Supported:
"hg38","hg19","mm10","mm9".- out_dir
Character. Directory path to save the generated PDF Karyotypes, Rose plots, and Excel results.
- project_name
Character. Prefix for all output files (automatically appends
"_Filtered"if not present).- color_palette
Character. Color brewer palette name for global visualization (default:
"Set2").- karyo_bin_size
Numeric. Bin size for Karyotype genomic density heatmaps (default:
1e5).- reclassify_by_expression
Logical. If
TRUE(default), mathematically silent Promoters (P) and Gene Bodies (G) are downgraded toePandeG, preserving the 3D loop but correcting its topological class.- hub_percentile
Numeric (0-1). Top percentile threshold used as a fallback for calculating Hub metrics if upstream inheritance is missing (default:
0.95).
Value
An invisible list containing the refined multi-omics datasets (also exported as _Refined_Results.xlsx):
loop_annotation: The fully filtered 3D network with updatedloop_type(e.g., eP-P) and activePutative_Target_Genes.anchor_annotation: Cluster annotations updated with expressed localized targets.promoter_centric_stats: Inherited and updated topological statistics for regulatory targets.distal_element_stats: Enhancer-level statistics preserving high-connectivity enhancer status.target_annotation: External features strictly linked to verified active loop components.
Details
Core Algorithmic Innovations:
Multi-Gene Parsing & Precision Filtration: Flawlessly handles merged gene assignments (e.g.,
"GeneA;GeneB") generated by upstream biotype-aware annotations. It splits, individually evaluates against the expression threshold, and securely recombines the surviving active targets.Biologically-Driven Downgrading (eP / eG): A purely physical Promoter (
P) loop anchor that exhibits no active transcription is biologically reclassified as an enhancer-like element (eP). Gene bodies without transcription becomeeG. This corrects the regulatory syntax (e.g., turning a silentP-Ploop into a functionally accurateeP-Penhancer-promoter loop).Topological Hub Inheritance (Crucial): Unlike naive filters that recalculate node degrees from scratch, this engine inherits the foundational 3D Hub classifications (e.g.,
Is_High_Connectivity_Gene) from the raw physical data. A structural hub remains structurally important; this function simply annotates its functional activation state, preventing the network from collapsing due to tissue-specific silencing.Automated Target Purification: Intelligently detects external mapping columns (e.g.,
Assigned_Target_Genes_Filled) and filters them based on expression, ensuring that auxiliary BED targets are strictly linked only to functionally verified active genes.
Examples
# 1. Get paths to the required example files in the package
rdata_path <- system.file("extdata", "analysis_results.RData", package = "looplook")
expr_path <- system.file("extdata", "example_tpm.txt", package = "looplook")
# 2. Check if files exist before running
if (rdata_path != "" && expr_path != "") {
# Safely load the pre-computed annotation result from RData
temp_env <- new.env()
load(rdata_path, envir = temp_env)
# Extract the first object found in the RData file
raw_annotation <- temp_env[[ls(temp_env)[1]]]
# =========================================================================
# Example : Advanced filtering WITH Transcriptome-Guided Reclassification
# =========================================================================
res_reclassified <- refine_loop_anchors_by_expression(
annotation_res = raw_annotation,
expr_matrix_file = expr_path,
sample_columns = c("con1", "con2"),
threshold = 1.0,
unit_type = "TPM",
species = "hg38",
out_dir = tempdir(),
project_name = "Example_Reclassified",
reclassify_by_expression = TRUE
)
# View the biologically corrected loop types (e.g., transition from P-P to eP-P)
print(table(res_reclassified$loop_annotation$loop_type))
}
#> >>> [Refinement] Project Name: Example_Reclassified_Filtered
#> >>> [Step 1] Loading Data & Expression Matrix...
#> [Info] 'a1_id'/'a2_id' columns missing. Reconstructing from coordinates...
#> >>> Active Genes (> 1 TPM): 0
#> >>> [Step 2] Updating Anchors & Loops...
#> >>> [Step 3] Updating Stats...
#> >>> [Step 4] Refining Target Annotations...
#> >>> [Step 5] Generating Visualizations...
#> Drawing Target Sankey Diagram...
#> Links is a tbl_df. Converting to a plain data frame.
#> file:////tmp/Rtmpp3oEeP/Example_Reclassified_Filtered_Target_Sankey.html screenshot completed
#> 2169 genes were dropped because they have exons located on both strands of
#> the same reference sequence or on more than one reference sequence, so cannot
#> be represented by a single genomic range.
#> Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
#> object, or use suppressMessages() to suppress this message.
#> 'select()' returned 1:1 mapping between keys and columns
#> >>> [Step 6] Exporting Refined Results...
#> Excel saved.
#> Refinement Complete.
#>
#> E-E E-eG E-eP eG-eG eG-eP eP-eP
#> 8 60 90 127 339 376