A transcriptome-informed filtering framework to refine the regulatory element (anchor) classification and the loop-target linkage using gene expression profiles — refine_loop_anchors_by

A critical filtration engine that integrates quantitative RNA-seq data (e.g., TPM/FPKM) with 3D structural data to transition from a "Physical Contact Network" to a "Functionally Active Regulatory Network".

It systematically evaluates every loop anchor, purging transcriptional noise, splitting conjoined multi-gene loci, and biologically reclassifying silent promoters into enhancer-like regulatory elements.

Usage

refine_loop_anchors_by_expression(
  annotation_res,
  expr_matrix_file,
  sample_columns,
  threshold = 1,
  unit_type = "TPM",
  species = "hg38",
  out_dir = "./results/filtered",
  project_name = "HiChIP",
  color_palette = "Set2",
  karyo_bin_size = 1e+05,
  reclassify_by_expression = TRUE,
  hub_percentile = 0.95
)

Arguments

annotation_res: List. The raw foundational output object returned by annotate_peaks_and_loops.
expr_matrix_file: Character. Path to the normalized expression matrix (rows = genes, columns = samples). Supports CSV or TSV.
sample_columns: Character vector or Integer indices. The specific sample columns in the matrix to average for calculating the baseline expression.
threshold: Numeric. The minimum expression value (e.g., TPM > 1) required to consider a gene "biologically active" (default: 1).
unit_type: Character. Expression unit used for plot labeling (e.g., "TPM"; default: "TPM").
species: Character. Reference genome build. Supported: "hg38", "hg19", "mm10", "mm9".
out_dir: Character. Directory path to save the generated PDF Karyotypes, Rose plots, and Excel results.
project_name: Character. Prefix for all output files (automatically appends "_Filtered" if not present).
color_palette: Character. Color brewer palette name for global visualization (default: "Set2").
karyo_bin_size: Numeric. Bin size for Karyotype genomic density heatmaps (default: 1e5).
reclassify_by_expression: Logical. If TRUE (default), mathematically silent Promoters (P) and Gene Bodies (G) are downgraded to eP and eG, preserving the 3D loop but correcting its topological class.
hub_percentile: Numeric (0-1). Top percentile threshold used as a fallback for calculating Hub metrics if upstream inheritance is missing (default: 0.95).

Value

An invisible list containing the refined multi-omics datasets (also exported as _Refined_Results.xlsx):

loop_annotation: The fully filtered 3D network with updated loop_type (e.g., eP-P) and active Putative_Target_Genes.
anchor_annotation: Cluster annotations updated with expressed localized targets.
promoter_centric_stats: Inherited and updated topological statistics for regulatory targets.
distal_element_stats: Enhancer-level statistics preserving high-connectivity enhancer status.
target_annotation: External features strictly linked to verified active loop components.

Details

Core Algorithmic Innovations:

Multi-Gene Parsing & Precision Filtration: Flawlessly handles merged gene assignments (e.g., "GeneA;GeneB") generated by upstream biotype-aware annotations. It splits, individually evaluates against the expression threshold, and securely recombines the surviving active targets.
Biologically-Driven Downgrading (eP / eG): A purely physical Promoter (P) loop anchor that exhibits no active transcription is biologically reclassified as an enhancer-like element (eP). Gene bodies without transcription become eG. This corrects the regulatory syntax (e.g., turning a silent P-P loop into a functionally accurate eP-P enhancer-promoter loop).
Topological Hub Inheritance (Crucial): Unlike naive filters that recalculate node degrees from scratch, this engine inherits the foundational 3D Hub classifications (e.g., Is_High_Connectivity_Gene) from the raw physical data. A structural hub remains structurally important; this function simply annotates its functional activation state, preventing the network from collapsing due to tissue-specific silencing.
Automated Target Purification: Intelligently detects external mapping columns (e.g., Assigned_Target_Genes_Filled) and filters them based on expression, ensuring that auxiliary BED targets are strictly linked only to functionally verified active genes.

Examples

# 1. Get paths to the required example files in the package
rdata_path <- system.file("extdata", "analysis_results.RData", package = "looplook")
expr_path <- system.file("extdata", "example_tpm.txt", package = "looplook")

# 2. Check if files exist before running
if (rdata_path != "" && expr_path != "") {
  # Safely load the pre-computed annotation result from RData
  temp_env <- new.env()
  load(rdata_path, envir = temp_env)
  # Extract the first object found in the RData file
  raw_annotation <- temp_env[[ls(temp_env)[1]]]

  # =========================================================================
  # Example : Advanced filtering WITH Transcriptome-Guided Reclassification
  # =========================================================================
  res_reclassified <- refine_loop_anchors_by_expression(
    annotation_res = raw_annotation,
    expr_matrix_file = expr_path,
    sample_columns = c("con1", "con2"),
    threshold = 1.0,
    unit_type = "TPM",
    species = "hg38",
    out_dir = tempdir(),
    project_name = "Example_Reclassified",
    reclassify_by_expression = TRUE
  )

  # View the biologically corrected loop types (e.g., transition from P-P to eP-P)
  print(table(res_reclassified$loop_annotation$loop_type))
}
#> >>> [Refinement] Project Name: Example_Reclassified_Filtered
#> >>> [Step 1] Loading Data & Expression Matrix...
#>     [Info] 'a1_id'/'a2_id' columns missing. Reconstructing from coordinates...
#>     >>> Active Genes (> 1 TPM): 0
#> >>> [Step 2] Updating Anchors & Loops...
#> >>> [Step 3] Updating Stats...
#> >>> [Step 4] Refining Target Annotations...
#> >>> [Step 5] Generating Visualizations...
#>     Drawing Target Sankey Diagram...
#> Links is a tbl_df. Converting to a plain data frame.
#> file:////tmp/Rtmpp3oEeP/Example_Reclassified_Filtered_Target_Sankey.html screenshot completed
#>   2169 genes were dropped because they have exons located on both strands of
#>   the same reference sequence or on more than one reference sequence, so cannot
#>   be represented by a single genomic range.
#>   Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
#>   object, or use suppressMessages() to suppress this message.
#> 'select()' returned 1:1 mapping between keys and columns
#> >>> [Step 6] Exporting Refined Results...
#>     Excel saved.
#> Refinement Complete.
#> 
#>   E-E  E-eG  E-eP eG-eG eG-eP eP-eP 
#>     8    60    90   127   339   376