Executive Summary
Current high-throughput technologies produce vast multi-omics data. However, finding shared biomarkers across these layers is challenging when relationships are nonlinear. This paper presents MRF-IMD, an unsupervised Multivariate Random Forest framework. It uses a novel metric, Inverse Minimal Depth (IMD), to prioritize shared features.
Problem
Linear methods (SPLS, CCA) fail to capture complex, nonlinear biological interactions across omics layers.
Solution
MRF-IMD captures nonlinear hubs and uses 3 selection strategies (Filter, Mixture, Transform) for robust discovery.
Impact Highlights
- Outperforms SPLS/CCA in nonlinear simulations.
- Identified 8 Tumor Clusters in Pan-Cancer analysis.
- Improved Dementia Prediction (P = 0.033) over existing scores.
The MRF-IMD Workflow
1. Multi-Omics Input
Samples with matched data (e.g., Gene Exp + Methylation). One layer acts as predictors (X), the other as multivariate response (Y).
2. Multivariate Forest
Fit a multivariate random forest. Each tree splits nodes so that the daughter nodes differ as much as possible in the multivariate response Y.
3. Calculate IMD
Compute Inverse Minimal Depth. Strong variables appear closer to the root (depth 0), resulting in high IMD scores.
4. Feature Selection
Apply selection strategy (Filter, Mixture, or Transform) to identify robust shared biomarkers.
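Step 3 of the workflow can be illustrated with a small, self-contained sketch. It assumes a hypothetical nested-dict tree format and takes IMD to be 1 / (1 + minimal depth), averaged over trees. The summary does not state the exact formula, so both choices are illustrative assumptions rather than the authors' definition.

```python
from collections import deque

# Hypothetical tree format (an assumption for illustration): an internal node is
# {"feature": name, "left": subtree, "right": subtree}; a leaf is None.

def minimal_depths(tree):
    """Return {feature: shallowest depth at which it splits}; the root is depth 0."""
    depths = {}
    queue = deque([(tree, 0)])
    while queue:
        node, depth = queue.popleft()
        if node is None:  # leaf: nothing to record
            continue
        # Breadth-first order visits shallow nodes first,
        # so the first depth recorded for a feature is its minimal depth.
        depths.setdefault(node["feature"], depth)
        queue.append((node["left"], depth + 1))
        queue.append((node["right"], depth + 1))
    return depths

def inverse_minimal_depth(forest, features):
    """Average 1 / (1 + minimal depth) over trees; a feature unused by a tree adds 0.

    The 1 / (1 + d) form is an assumed stand-in for the paper's IMD formula.
    """
    scores = {f: 0.0 for f in features}
    for tree in forest:
        depths = minimal_depths(tree)
        for f in features:
            if f in depths:
                scores[f] += 1.0 / (1.0 + depths[f])
    return {f: s / len(forest) for f, s in scores.items()}

# Toy forest with two trees; feature names are made up.
forest = [
    {"feature": "g1",
     "left": {"feature": "g2", "left": None, "right": None},
     "right": None},
    {"feature": "g1",
     "left": None,
     "right": {"feature": "g3", "left": None, "right": None}},
]
imd = inverse_minimal_depth(forest, ["g1", "g2", "g3", "g4"])
```

In this toy forest, `g1` splits at the root of both trees and gets the highest score, while `g4` never splits and scores zero, matching the idea that variables near the root receive high IMD.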
Strategy A: Filter
Selects variables whose IMD exceeds a threshold defined as the product of two parameters. Best for parsimonious, stable signatures (e.g., ~73 genes in BRCA).
Strategy B: Mixture
Fits a 2-component mixture model to separate signal from noise. Offers a balanced trade-off.
Strategy C: Transform
Standardizes IMD using a t-score. Best for detecting subtle signals or interaction effects.
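The three strategies can be sketched as follows, using only the Python standard library. The details are assumptions made for illustration, not the paper's procedure: the filter threshold is supplied by the caller, the mixture is a 1-D two-component Gaussian fitted by EM, and the t-score is taken to be a one-sample t statistic of a feature's per-tree IMD values against zero.

```python
import math
from statistics import mean, stdev, pvariance

def filter_select(imd, threshold):
    """Filter: keep features whose IMD exceeds a caller-supplied threshold."""
    return sorted(f for f, s in imd.items() if s > threshold)

def _normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_select(imd, n_iter=100, var_floor=1e-6):
    """Mixture: fit a 2-component Gaussian mixture by EM and keep features
    assigned (posterior > 0.5) to the component with the higher mean."""
    names = list(imd)
    x = [imd[f] for f in names]
    mu = [min(x), max(x)]                      # start components at the extremes
    var = [max(pvariance(x), var_floor)] * 2
    w = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in x]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each feature.
        for i, xi in enumerate(x):
            p = [w[k] * _normal_pdf(xi, mu[k], var[k]) for k in range(2)]
            total = sum(p)
            if total > 0.0:                    # keep old values if both underflow
                resp[i] = [pk / total for pk in p]
        # M-step: update weights, means, and (floored) variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            w[k] = nk / len(x)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var[k] = max(sum(r[k] * (xi - mu[k]) ** 2
                             for r, xi in zip(resp, x)) / nk, var_floor)
    signal = 0 if mu[0] > mu[1] else 1
    return sorted(f for f, r in zip(names, resp) if r[signal] > 0.5)

def t_scores(per_tree_imd):
    """Transform: mean / standard error of per-tree IMD (needs >= 2 trees)."""
    out = {}
    for f, vals in per_tree_imd.items():
        m, s = mean(vals), stdev(vals)
        if s > 0:
            out[f] = m / (s / math.sqrt(len(vals)))
        else:
            out[f] = math.inf if m > 0 else 0.0
    return out

# Toy scores: six noise-like features and three signal-like features.
imd = {"n1": 0.02, "n2": 0.04, "n3": 0.05, "n4": 0.06, "n5": 0.08, "n6": 0.03,
       "s1": 0.70, "s2": 0.80, "s3": 0.90}
kept_filter = filter_select(imd, threshold=0.5)
kept_mixture = mixture_select(imd)
t = t_scores({"a": [0.20, 0.25, 0.30, 0.25], "b": [0.0, 1.0, 0.0, 0.0]})
```

Features `a` and `b` have the same mean IMD, but `a` is consistent across trees and receives the larger t statistic. Under these assumptions, that is one way a transform-style score can surface subtle but stable signals.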
Simulation Benchmark
MRF-IMD is compared against established methods (SPLS, CCA) and ensemble learners. The performance gap is largest in nonlinear settings.
Area Under Precision-Recall Curve (PR-AUC)
In linear settings, MRF-IMD matches SPLS/CCA.
Key Takeaways
- Linear Scenarios: MRF-IMD achieves ~0.90 PR-AUC, comparable to SPLS (0.94) and PMDCCA (0.90). It stays competitive even when the linearity assumptions favor classical methods.
- Nonlinear Scenarios: This is the critical advantage. MRF-IMD maintains high performance (~0.71-0.81 PR-AUC), while SPLS drops to near-random levels (~0.04-0.07).
- Ensemble Comparison: Univariate ensemble learners adapted to this setting (GBM, XGBoost) underperform the specialized MRF framework in the multivariate context.
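For readers unfamiliar with the metric, the sketch below computes average precision, one common estimator of PR-AUC. The summary does not say which estimator the benchmark used, so this is an illustrative choice, and the labels and scores are made up.

```python
def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each true positive.

    y_true: 1 for a truly shared feature, 0 otherwise.
    scores: importance scores (higher means ranked earlier).
    Ties are broken by input order.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank   # precision at this recall step
    return total / n_pos

# Toy example: the two true features are ranked 1st and 3rd.
ap = average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.2, 0.1])
```

A random ranking scores roughly the fraction of features that are truly positive, which is why a PR-AUC of only a few hundredths can correspond to chance when true shared features are rare.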
Cancer Biomarker Discovery
Analysis of TCGA Breast (BRCA) and Colon (COAD) cohorts.
Top Identified Genes (IMD Weight)
Prognostic Stratification (Kaplan-Meier)
Pathway Enrichment
Pan-Cancer Clustering (Figure 4)
Shows how MRF-IMD features improve tumor stratification by comparing clustering on raw data (Figure 4a/b) with clustering on MRF-IMD-selected features (Figure 4c).
Cluster Definitions (Table 5)
Cluster Confusion Matrix (Figure 4d)
Mapping of the 8 identified IntNMF clusters (Columns) to the true 22 TCGA Cancer Types (Rows). Darker cells indicate higher sample overlap.
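A cross-tabulation of this kind can be built as in the sketch below. The labels are toy values chosen for illustration and are not results from the paper; the function name is hypothetical.

```python
from collections import Counter

def confusion_table(cancer_types, clusters):
    """Cross-tabulate true cancer types (rows) against cluster labels (columns)."""
    counts = Counter(zip(cancer_types, clusters))
    rows = sorted(set(cancer_types))
    cols = sorted(set(clusters))
    # Counter returns 0 for (type, cluster) pairs that never occur.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Toy labels for six samples (not TCGA results).
types = ["BRCA", "BRCA", "BRCA", "COAD", "COAD", "COAD"]
clusters = [1, 1, 2, 2, 2, 2]
rows, cols, table = confusion_table(types, clusters)
```

Each cell counts the samples of one cancer type assigned to one cluster, so a row concentrated in a single column indicates a cancer type that maps cleanly onto one cluster.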
Alzheimer's Disease Prediction (ADNI)
Integrates DNA methylation and gene expression for 538 participants (cognitively normal, CN, vs mild cognitive impairment, MCI). MRF-IMD identified markers such as ARL11 and S1PR1, which are linked to neuroinflammation.
Top Genes (Importance)
Dementia Conversion Prediction
MRF-IMD Performance
Significant separation of high/low risk groups.
P = 0.033
Competitor Performance
SPLS failed to stratify patients significantly.
P = 0.60