We have an experiment where samples were collected and sequenced in several batches (matched ATACseq and RNAseq). We would like to include the batch in our analysis formula if possible. The problem is that after quality control and removing substandard samples, some of the batches contain only one sample (we have several batches with many samples and a few with only one).

We have used DESeq2 to do the differential testing (for both ATAC and RNA) and would also like to do more downstream analysis (such as clustering). We have included the batch in our design formula for DESeq2. Is this a terrible thing to do? Because the model can now set the batch effect to be anything for those samples, are they even adding anything to the analysis?

Secondly, for the downstream analysis we are using rlog from DESeq2 and then passing the results to removeBatchEffect from limma. When we do this, we basically get 0s in every row for the samples that are in a batch on their own. (This makes sense: with only one sample in a batch, the linear model can always find a beta value that accounts for all of that sample's variance.)
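The behaviour described above can be reproduced with a minimal Python sketch. It is an illustration only, not limma's implementation: it assumes a batch-only model for a single gene, where the fitted value for each sample is simply its batch mean. The function name `center_within_batch` and the numbers are hypothetical.

```python
from collections import defaultdict

def center_within_batch(values, batches):
    """Subtract each batch's mean from its samples (one gene, batch-only model)."""
    groups = defaultdict(list)
    for v, b in zip(values, batches):
        groups[b].append(v)
    # Fitted value per batch is the batch mean.
    means = {b: sum(vs) / len(vs) for b, vs in groups.items()}
    # Residual = observed value minus the fitted batch mean.
    return [v - means[b] for v, b in zip(values, batches)]

# Hypothetical log-scale values for one gene across five samples.
values = [5.0, 7.0, 6.0, 9.5, 3.2]
batches = ["A", "A", "A", "B", "C"]  # B and C are singleton batches

residuals = center_within_batch(values, batches)
print(residuals)
```

The samples in batch A keep their spread around the batch mean, while the two singleton samples end up at exactly zero, because each one defines its own batch mean.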

Some solutions we have considered:

* Pool all the samples that are the only sample in their batch into a pooled batch.

* Do the differential analysis using all samples, but only use samples from multi-sample batches in the downstream analysis.

* Remove the single-sample batches from the analysis. We'd rather not do this, because it would severely reduce the number of samples in the analysis.
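For the first option, the relabelling itself is simple. Here is a rough Python sketch, assuming the batch labels are available as a plain list; `pool_singletons` and the label `"pooled"` are made-up names for illustration. Note that pooling implicitly treats all singleton samples as sharing one common batch effect, which is an assumption rather than something the data can confirm.

```python
from collections import Counter

def pool_singletons(batches, pooled_label="pooled"):
    """Replace every batch label that occurs only once with a shared label."""
    counts = Counter(batches)
    return [b if counts[b] > 1 else pooled_label for b in batches]

# Hypothetical batch labels: B and C each contain a single sample.
batches = ["A", "A", "B", "C", "D", "D", "D"]

pooled = pool_singletons(batches)
print(pooled)
```

If only one singleton batch exists, the pooled batch is still a singleton, so this only helps when there are at least two.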

Thanks.

I wouldn’t assume they’re useless just because some of your “batches” include only one sample. As an exploratory step, I’d run an analysis of variance on the larger batches to see the relative contribution of batch and experimental factors to the variance, which would help show how “big” the batch effects are. If they’re smaller than, or maybe even comparable to, the experimental effects, I might take an approach like the one you mentioned: combining the singletons, expanding the time periods, or otherwise broadening what constitutes a batch.
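As a rough idea of what that exploratory comparison could look like, here is a small Python sketch for a single gene. It assumes a crude one-factor-at-a-time measure (between-group sum of squares over total sum of squares, often called eta squared), not a full ANOVA. The function name and the numbers are hypothetical, and in an unbalanced or confounded design the two fractions would not separate this cleanly.

```python
from collections import defaultdict

def fraction_of_variance(values, labels):
    """Between-group sum of squares divided by total sum of squares."""
    grand = sum(values) / len(values)
    total_ss = sum((v - grand) ** 2 for v in values)
    groups = defaultdict(list)
    for v, g in zip(values, labels):
        groups[g].append(v)
    # Each group contributes n * (group mean - grand mean)^2.
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand) ** 2 for vs in groups.values()
    )
    return between_ss / total_ss if total_ss > 0 else 0.0

# Hypothetical log-scale values for one gene, using only multi-sample batches.
values = [4.0, 6.0, 5.0, 7.0]
batch = ["A", "A", "B", "B"]
condition = ["ctl", "trt", "ctl", "trt"]

batch_fraction = fraction_of_variance(values, batch)
condition_fraction = fraction_of_variance(values, condition)
print(batch_fraction, condition_fraction)
```

Running this over many genes and comparing the two distributions gives a first impression of whether batch or the experimental factor dominates.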

Since it can be hard to know what really matters or contributes to the strongest batch effects, another solution I would explore is the *sva* approach of inferring surrogate batch variables.