What is SERRF?

06/10/2020

SERRF is a QC-based sample normalization method designed for large-scale untargeted metabolomics data.

How does it work?

SERRF uses a machine learning algorithm, Random Forest (Breiman, 2001), to normalize the data. For each compound, SERRF builds a Random Forest model on the QC samples to estimate the systematic error (e.g. batch effects, day-to-day variation). The model is then applied to the study samples to remove that systematic error. On this website, the cross-validated relative standard deviation (RSD, a.k.a. coefficient of variation) of the QCs is used to evaluate performance.
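
As a small illustration of the evaluation metric named above, the sketch below computes the relative standard deviation of the QC intensities of a single compound. The function name rsd_percent and the example intensities are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; this is not the code the website runs.

```python
from statistics import mean, stdev


def rsd_percent(intensities):
    """Relative standard deviation (coefficient of variation) in percent.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    if len(intensities) < 2:
        raise ValueError("need at least two intensities")
    avg = mean(intensities)
    if avg == 0:
        raise ValueError("mean intensity is zero, RSD is undefined")
    return 100.0 * stdev(intensities) / avg


# Hypothetical QC intensities of one compound across five injections.
qc_intensities = [100.0, 110.0, 90.0, 105.0, 95.0]
print(round(rsd_percent(qc_intensities), 2))  # 7.91
```

In a cross-validated setting, the same calculation would be applied to QC samples that were held out while the model was trained.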

How does SERRF differ from other methods?

Sample normalization methods can be classified into three categories based on the type of reference samples or compounds they use: data-driven normalizations, internal standard (IS) based normalizations, and QC-based normalizations. SERRF is a QC-based normalization method. It uses pooled aliquots of the biological subject samples to normalize metabolite intensities.

Unlike most QC-based methods (e.g. batch-ratio, LOESS, SVM and eigenMS normalization), SERRF assumes that the systematic error is associated not only with batch effects and injection order, but also with the behavior of other compounds. Using the RF algorithm, SERRF automatically selects correlated compounds to normalize the systematic error, as summarized by the QC samples, for each compound.

Why use Random Forest?

The Random Forest (RF) algorithm, developed by Breiman (Breiman, 2001), is nonparametric, nonlinear, less prone to overfitting, relatively robust to outliers and noise, and fast to train (Touw, et al., 2013). These attributes are desirable for normalizing high-throughput untargeted metabolomics data. In addition, RF models can take correlations between metabolites into account by automatically assigning higher weights to the important compounds.

How good is SERRF?

No single normalization method always outperforms the others, but here are some benchmarks for SERRF.

P20 data (negative mode). Median average of 5-fold Monte-Carlo Cross-Validated QC RSD reduced from 26.5% to 6.3% (2nd best: LOESS 8.2%). Median average of External-validated QC RSD reduced from 27.1% to 9.5% (2nd best: LOESS 13.2%).
P20 data (positive mode). Median average of 5-fold Monte-Carlo Cross-Validated QC RSD reduced from 19.7% to 3.9% (2nd best: SVM 7.4%). Median average of External-validated QC RSD reduced from 17.1% to 8.2% (2nd best: cubic 11.3%).
ADNI data (positive mode). Median average of 5-fold Monte-Carlo Cross-Validated QC RSD reduced from 17.5% to 4.4% (2nd best: LOESS 11.3%).
ADNI data (negative mode). Median average of 5-fold Monte-Carlo Cross-Validated QC RSD reduced from 23.2% to 7.3% (2nd best: LOESS 12.3%).
GOLDN data (positive mode). Median average of 5-fold Monte-Carlo Cross-Validated QC RSD reduced from 21.6% to 3.4% (2nd best: LOESS 11.3%).
GOLDN data (negative mode). Median average of 5-fold Monte-Carlo Cross-Validated QC RSD reduced from 34.1% to 4.7% (2nd best: LOESS 8.4%).

Note: cross-validation is a critical procedure for dealing with overfitting.

Citation

Sili Fan, Tobias Kind, Tomas Cajka, Stanley L. Hazen, W. H. Wilson Tang, Rima Kaddurah-Daouk, Marguerite R. Irvin, Donna K. Arnett, Dinesh Kumar Barupal, and Oliver Fiehn. Systematic Error Removal using Random Forest (SERRF) for Normalizing Large-Scale Untargeted Lipidomics Data. Analytical Chemistry, Just Accepted Manuscript. DOI: 10.1021/acs.analchem.8b05592

Sili Fan

Principal statistician at the West Coast Metabolomics Center
GitHub: https://github.com/slfan2013
Email: slfan@ucdavis.edu
Kaggle: https://www.kaggle.com/bigdatafan

Step1: Prepare the data

The data must be in the first sheet of a .xlsx file.
Samples are in columns and compounds are in rows. The number of compounds is limited to fewer than 2000. Contact me if more help is needed.
batch: the batch of each sample, e.g. A, B, C.
sampleType: the type of each sample. Must include qc and sample (case sensitive).
time: the injection order or sample acquisition time, e.g. 1, 2, 3.
label: the labels of both samples and compounds, e.g. sample1, sample2, sample3, xylose, glucose, unknown1.
No: the numeric index of each compound, e.g. 1, 2, 3.

Download example
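
Before uploading, the requirements listed above can be checked with a few lines of code. The sketch below is only an illustration: the function check_dataset and its arguments are hypothetical, and it assumes the batch, sampleType and time rows have already been read from the spreadsheet into plain Python lists. It does not reproduce the format check performed by the website.

```python
REQUIRED_SAMPLE_TYPES = {"qc", "sample"}  # case sensitive, as stated above
MAX_COMPOUNDS = 2000  # the number of compounds must stay below this limit


def check_dataset(sample_types, batches, times, n_compounds):
    """Return a list of problems found; an empty list means no problem found."""
    problems = []
    missing = sorted(REQUIRED_SAMPLE_TYPES - set(sample_types))
    if missing:
        problems.append("sampleType is missing: " + ", ".join(missing))
    if not (len(sample_types) == len(batches) == len(times)):
        problems.append("batch, sampleType and time need one entry per sample")
    if n_compounds >= MAX_COMPOUNDS:
        problems.append("number of compounds must be below %d" % MAX_COMPOUNDS)
    return problems


# Hypothetical metadata for three samples and three compounds.
print(check_dataset(["qc", "sample", "qc"], ["A", "A", "B"], [1, 2, 3], 3))
# "QC" is not accepted because the check is case sensitive.
print(check_dataset(["QC", "sample"], ["A", "A"], [1, 2], 3))
```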

Step2: What if I have missing values?

Missing values must be empty cells in the .xlsx file.
The missing values of each compound are automatically replaced by half of the minimum observed value of that compound.
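
The half-minimum replacement described above can be sketched as follows. The function name impute_half_minimum is hypothetical, and representing an empty cell as None is an assumption made for the illustration.

```python
def impute_half_minimum(values):
    """Replace None entries with half of the smallest observed value.

    `values` holds the intensities of one compound across all samples;
    None stands for an empty cell (an assumption of this sketch).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed, so nothing to impute from
    fill = min(observed) / 2
    return [fill if v is None else v for v in values]


# Hypothetical intensities of one compound with two missing values.
print(impute_half_minimum([120.0, None, 80.0, 200.0, None]))
# [120.0, 40.0, 80.0, 200.0, 40.0]
```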

Step3: Start normalization

Go to Use SERRF (top of this page).
Click the Choose File button. (The file format is checked automatically after uploading.)
Click the Apply SERRF normalization button. (The normalization procedure takes several minutes, depending on the sample size and the number of compounds.)

Step4: Results and Download

PCA: principal component analysis is used to visualize the result. A good normalization puts the QCs into a dense cluster and makes the batches (distinguished by color) overlap.
RSD: the relative standard deviation is calculated on the validate samples for each compound. The median of all the metabolites' RSDs is used as the final evaluation. Low RSD values indicate good performance.
Download the result by clicking the Download Results button.
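
The RSD summary described above can be reproduced from per-compound RSD values with a short sketch. The function summarize_rsd and the example values are hypothetical; the 20% cut-off is the threshold used in the result summary on this page.

```python
from statistics import median


def summarize_rsd(rsd_by_compound, threshold=20.0):
    """Summarize per-compound RSDs (in percent).

    Returns the median RSD plus the number and percentage of compounds
    whose RSD is strictly below `threshold`.
    """
    values = list(rsd_by_compound.values())
    if not values:
        raise ValueError("no RSD values given")
    n_below = sum(1 for v in values if v < threshold)
    return {
        "median_rsd": median(values),
        "n_below": n_below,
        "pct_below": 100.0 * n_below / len(values),
    }


# Hypothetical RSDs (in percent) for four compounds.
rsds = {"xylose": 4.2, "glucose": 12.5, "unknown1": 27.0, "unknown2": 8.1}
summary = summarize_rsd(rsds)
print(summary["n_below"], summary["pct_below"])  # 3 75.0
```

Comparing this summary before and after normalization gives a single-number view of how much the normalization tightened the QC or validate samples.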

SERRF

Normalize your data with SERRF.

Using SERRF is easy. Just prepare your dataset in the same format as the example dataset and upload it by clicking the "Choose File" button. When the dataset is successfully uploaded, simply click the "Start SERRF" button. A few minutes later, your normalized dataset will be ready to download. Also, feel free to try our one-for-all R code. Still have questions?

Please click this link to go to our temporary website for SERRF. The underlying normalization code is the same.

The main server is under maintenance, and it is not known when the update will finish. Feel free to try our R code (see the second line of the only paragraph on this page).

Figure1. Principal Component Analysis score plot using the raw data.

Figure2. Principal Component Analysis score plot using the SERRF normalized data.

For each plot, the page reports the median QC RSD across all compounds, the number (%) of compounds with QC RSD < 20%, the median validate RSD across all compounds, and the number (%) of compounds with validate RSD < 20%.
