Preprocessing Affymetrix Gene Expression Arrays
Background
- GEO (Gene Expression Omnibus) is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted.
- A CEL file is a data file created by Affymetrix DNA microarray image analysis software. It contains the data extracted from “probes” on an Affymetrix GeneChip and can store thousands of data points, which may make it large in file size. CEL files can be processed by software algorithms and visualized on a 2D grid as part of an overall genome experiment.
R Example on Breast Cancer Data
Data: GES25055
Platform of data: hgu133a.db
library(hgu133a.db)
Install R package from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager") ## Install BiocManager
BiocManager::install("affy")
library(affy)
Full description of affy
on Bioconductor: affy
Download data
Download CEL file:
file = getGEOSuppFiles("GSE25055") ## Here we need to unzip the download zip file
affyob<- ReadAffy(celfile.path="GSE25055/GSE25055_RAW")
Download information data:
pinfo_25055 <- pData(phenoData(getGEO("GSE25055")[[1]]))
fRMA process
Full desciption of fRMA
: frma
Normal RMA
Robust multiarray analysis (RMA) is the most widely used preprocessing algorithm for Affymetrix and Nimblegen gene expression microarrays. Three steps are performed by RMA: bacckground correction, normalization, and summarization in a modular way.
- Backgound correction: BOLSTAD , B. (2004). Low-level analysis of high-density oligonucleotide array data: background, normalization and summarization, [PhD. Thesis]. Berkeley, CA: University of California.
- Normalization: Quantile normalization forcesthe probe intensity distribution to be thesame on all the arrays.
-
Summarization: Summarizing probe intensities into gene expression to be used in downstream analysis.
\[Y_{ijn} = \theta_{in}+\phi_{jn}+\epsilon_{ijn}\quad\quad (1)\]- $Y_{ijn}$ representing the $\log_2$ background corrected and normalized intensity of probe $j \in 1,…, J_n$ in probe set $n \in 1,…, N$ on array $i \in 1,…, I$.
- $\theta_{in}$ is the expression of probe set n on array $i$
- $\phi_{jn}$ is the probe effect for $j_{th}$ probe set $n$.
Current R package affy
uses median polish to estimate the $\theta$s.
Problem with RMA:
- Cannot computing the reference distribution used in quantile normalization
- Cannot estimating the $\phi$s and $Var(\epsilon_{ijn})$ in equation (1) by running RMA on a reference database of biologially diverse samples
frozen RMA
\[Y_{ijkn} = \theta_{in}+\phi_{jn}+\gamma_{jkn}+\epsilon_{ijkn}\quad\quad (2)\]- $k \in 1,…,K$ represent batch and random-effect term and $\gamma$ explains the variability in probe effects across batches.
Frozen RMA removed the batch effect from each intensity and create a probe effect-corrected intensity:
\[Y_{i j \ln }^{*} \equiv Y_{i j l n}-\hat{\phi}_{j n}=\theta_{i n}+\gamma_{j l n}+\varepsilon_{i j ln} \quad\quad (3)\]library(fMRA)
object <- frma(affyob, verbose=TRUE)
expp<-exprs(object)
colnames(expp)<-substr(colnames(expp),1,9)
Filter probes
Use probe with largest IQR to represent the gene
arrayIQR<-apply(expp,1,IQR)
probe<-rownames(expp)
uniqueGenes<-findLargest(as.vector(probe),arrayIQR,'hgu133a')
exp2<-expp[uniqueGenes,]
geneSymbol<-getSYMBOL(rownames(exp2),"hgu133a.db")
rownames(exp2)<-geneSymbol
Save expression data
write.csv(exp2,'Expr/exp_25055c.csv')