
I am preprocessing scRNA-seq data. What is the best-practice order for running ComBat batch-effect removal, data imputation (to mitigate dropout), and library size normalization?

I thought that library size normalization should be run first, since it is a per-cell normalization, followed by ComBat batch-effect removal. In the original paper (Johnson et al., 2007) it is stated that:

We assume that the data have been normalized and expression values have been estimated for all genes and samples.

However, I want to apply it to scRNA-seq data. Does this statement still hold? Additionally, I plan to apply imputation (e.g. with MAGIC) at the end. Is there any problem you can spot?

Update

I attach a PCA of an example Mus musculus dataset in which different colors represent different mice. It seems clear to me that the first two principal components are affected by batch (mouse ID).

[PCA plot of log-transformed data, cells colored by mouse ID]

Update 2

I reran the PCA on the raw counts data (the first PCA was on log-transformed data) and obtained a different picture of the dataset, in which batch effects do not seem to be prevalent.

[PCA plot of raw counts data, cells colored by mouse ID]

  • From what I can tell, MAGIC should be run on raw data, so that would be the first step.
    – burger
    Commented Jan 4, 2018 at 0:00
  • @burger MAGIC normalizes data before imputation, so it should be run at least after library size normalization. My concern is that using MAGIC before ComBat will amplify batch effects. Reading the paper I could not find any reference to batch effect removal.
    – gc5
    Commented Jan 4, 2018 at 16:32
  • The advice I got was that it is best to adjust for batch effects instead of removing them. Did you try adjusting for your batch effects? How big is your batch effect? (Do PCA, MDS, or dendrograms show a clear separation by batch, or by several batches?)
    – llrs
    Commented Jan 10, 2018 at 9:39
  • @Llopis Yes, actually by batch effect removal I meant adjusting for batch effects with ComBat; is that what you meant?
    – gc5
    Commented Jan 10, 2018 at 14:34
  • 1
    $\begingroup$ in my experience, the absolute first thing you need to do is normalize for library size. I suspect that if you color your cells according to size you will notice a clear correlation with PC1. $\endgroup$
    – galicae
    Commented Jan 25, 2018 at 15:01

1 Answer


MAGIC assumes the input data have been library-size normalized and either log or square-root transformed prior to imputation (see also: the MAGIC tutorial). Additionally, any graph-based method (MAGIC, PHATE, t-SNE, UMAP, spectral clustering, Louvain, etc.) will give flawed results if your data contain a batch effect, since the neighbourhood graph will reflect the structure of the batch effect, and, worse, imputation will further reinforce it.

Thus I would recommend the following pipeline (a minimal code sketch follows the list):

  • Library-size normalization
  • Square root (or log) transform
  • Batch effect removal
  • Imputation
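
As a rough illustration of this order, here is a minimal sketch using scanpy, which wraps both ComBat and MAGIC (the magic-impute package must be installed separately). The input file name and the "batch" column are assumptions, and a square-root transform could be substituted for the log transform.

```python
# Hedged sketch of the recommended order, assuming an AnnData object with raw
# counts and a "batch" column in adata.obs (both are assumptions here).
import scanpy as sc

adata = sc.read_h5ad("counts.h5ad")      # hypothetical raw-count input

sc.pp.normalize_total(adata)             # 1. library-size normalization
sc.pp.log1p(adata)                       # 2. log transform (sqrt also works)
sc.pp.combat(adata, key="batch")         # 3. batch-effect removal (ComBat)
sc.external.pp.magic(adata)              # 4. imputation (MAGIC)
```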

Regarding your update: the reason you don't see the batch effect in the raw counts data is simply that the batch effect is not visible in the most highly expressed genes. Prior to transformation, the principal source of variation in your data is the expression of the most highly expressed genes; this masks the batch effect rather than removing it. I recommend never working with raw molecule counts in scRNA-seq, as raw counts hide much of the heterogeneity in your dataset, which is precisely what you are looking for in doing single-cell RNA-seq.
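
One way to convince yourself of this (a sketch I am adding here, not part of the original answer) is to run PCA on the raw counts and on the normalized/log-transformed counts and compare how strongly PC1 tracks total counts per cell; the AnnData object and file name below are assumptions.

```python
# Compare how much PC1 is driven by library size before vs. after
# normalization and log transform (hypothetical input file).
import numpy as np
import scanpy as sc

adata = sc.read_h5ad("counts.h5ad")
adata.obs["total_counts"] = np.asarray(adata.X.sum(axis=1)).ravel()

# PCA on raw counts: dominated by library size / highly expressed genes
raw = adata.copy()
sc.pp.pca(raw, n_comps=2)

# PCA after normalization and log transform: heterogeneity (and batch) shows up
norm = adata.copy()
sc.pp.normalize_total(norm)
sc.pp.log1p(norm)
sc.pp.pca(norm, n_comps=2)

for name, a in [("raw", raw), ("log-normalized", norm)]:
    r = np.corrcoef(a.obsm["X_pca"][:, 0], a.obs["total_counts"])[0, 1]
    print(f"{name}: correlation of PC1 with total counts = {r:.2f}")
```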
