Reproducibility

In data sciences, reproducibility should be as much as important as in wet-lab experiments. However, as I will show in the below paragraphs, different analysis pipelines can create different results from RNA-sequencing raw-data. (learn more) This depends not only on the the bioinformatic programs, but also on the program-versions within the pipeline (learn more). Therefore, the following recommendations should be taken into consideration when planning and performing a bioinformatic analysis.

  • If you want to compare your results with published results, consider to mimic exactly the pipeline which was done in the publication. If possible also try to use the same program versions.
  • If this is not possible, try to re-analyze the published raw-data with your pipeline of choice, prior to analysis.
  • If you know, that your study might get expanded in the future, try to save the pipeline setup and control for the program versions. So you do not need to re-analyze your initial dataset, which might also change the outcome.
  • In your publications, try to describe the programs with its versions and the set parameters as exactly as possible. If possible also publish the raw-data files.

Let me help you with those issues by:

1. Setting up a custom-made pipeline with all your required specifications.

2. Analyzing your original data and/or published datasets with exactly the same pipeline and parameters.

3. Saving and archiving the pipeline in its current stage, so that it can be used in future settings.

4. Helping you to prepare the methods parts for your publications.

Contact me to discuss your needs.



Different bioinformatic analysis pipelines create different outcomes:

10 published RNA-sequencing raw-data file sets (5 controls VS 5 non-small cell lung cancer samples, paired-end data) were analyzed with 12 different analysis pipelines to identify differentially expressed genes. (Click here for detailed methods)

Those pipelines contain some of the most widely used programs for RNA-sequencing analysis (Aligners: STAR, Hisat2, Tophat2, Bowtie2, Bowtie, Downstream: Rsem+EBSeq, DESeq2, DESeq and EdgeR)

Results

  • In total 2815 down-regulated and 4087 up-regulated genes were identified with at least one pipeline (corrected p-value <= 0.05). Only 73 up-regulated as well as 132 down-regulated genes were identified with all pipelines.
  • Among the different pipelines, deviations between the identified log2-FoldChange values were identified.
  • Different aligners produce more consistent results than the downstream programs for Differential gene expression analysis.

You can perform a pairwise comparison of the different pipelines’ outcome with the form below.




Different program versions can lead to different analysis results.

Next, it was tested, how the program version of one of the downstream programs affects the analysis results. For that DESeq2 was applied with different version numbers on the data.

Results

  • In total 2236 down-regulated and 3537 up-regulated genes were identified with at least one pipeline (corrected p-value <= 0.05). 2189 up-regulated as well as 3313 down-regulated genes were identified with all pipelines.
    There is a total difference of 271 genes = 4.7%
  • Among the different DESeq2-versions, deviations between the identified log2-FoldChange values were also identified.

You can perform a pairwise comparison of the different DESeq2 outcome with the form below.




Methods

Samples

10 sets of Raw-data fastq-gz files from a paired-end RNA-Sequencing experiment were downloaded. Those contain 5 random lung cancer samples as well as five random healthy controls from ArrayExpress Dataset E-GEOD-81089. The datasets contain between 16 and 65 mio. paired reads with each having a length of 101 bases.

Pipelines

The pipelines contain the latest versions of programs which are highly used by the scientific community to analyze RNA-Sequencing data. To reduce complexity of the analysis, the raw-data as well as the binary-alignment files were not pre-processed prior to the analysis. Also the programs were used mostly with their default parameters. Corrected p-values of the outcome were defined in the EBseq results with the PPEE-value, in the EdgeR-results with the FDR-corrected p-value, and in the DESeq/DESeq2-results with the adusted p-value. An overview of all used programs and their program versions is shown in the table below.

PipelineAlignerExtraction
of genes
Differential gene
expression analysis
Remarks
Rsem_STAR_EBSeqSTAR (1)
version 2.7.0a
Rsem (2)
version 1.3.1
EBSeq (3)
version 1.24.0
“PPEE” were used as corrected p-values
Rsem_Bowtie2_EBSeqBowtie2 (4)
version 2.3.5.1
Rsem (2)
version 1.3.1
EBSeq (3)
version 1.24.0
“PPEE” were used as corrected p-values
Rsem_Bowtie_EBSeqBowtie (5)
version 1.2.2
Rsem (2)
version 1.3.1
EBSeq (3)
version 1.24.0
“PPEE” were used as corrected p-values
STAR_DESeq2STAR (1)
version 2.7.0a
featureCounts (6)
version 1.6.3
DESeq2 (7)
version 1.24.0
“padj” were used as corrected p-values
Hisat2_DESeq2Hisat2 (8)
version 2.1.0
featureCounts (6)
version 1.6.3
DESeq2 (7)
version 1.24.0
“padj” were used as corrected p-values
Tophat2_DESeq2Tophat2 (9)
version 2.1.1
featureCounts (6)
version 1.6.3
DESeq2 (7)
version 1.24.0
“padj” were used as corrected p-values
STAR_edgeRSTAR (1)
version 2.7.0a
featureCounts (6)
version 1.6.3
edgeR (10)
version 3.26.5
“FDR” were used as corrected p-values
Hisat2_edgeRHisat2 (8)
version 2.1.0
featureCounts (6)
version 1.6.3
edgeR (10)
version 3.26.5
“FDR” were used as corrected p-values
Tophat2_edgeRTophat2 (9)
version 2.1.1
featureCounts (6)
version 1.6.3
edgeR (10)
version 3.26.5
“FDR” were used as corrected p-values
STAR_DESeqSTAR (1)
version 2.7.0a
featureCounts (6)
version 1.6.3
DESeq (11)
version 1.36.0
“padj” were used as corrected p-values
Hisat2_DESeqHisat2 (8)
version 2.1.0
featureCounts (6)
version 1.6.3
DESeq (11)
version 1.36.0
“padj” were used as corrected p-values
Tophat2_DESeqTophat2 (9)
version 2.1.1
featureCounts (6)
version 1.6.3
DESeq (11)
version 1.36.0
“padj” were used as corrected p-values

DESeq2 analysis

The *.bam-files, which were created by the STAR aligner, were subjected to DESeq2- analysis after applying featureCounts. DESeq2 was set up in various versions. Those were installed using the “bioconda”-repository.

1. Dobin A et al., STAR: ultrafast universal RNA-seq aligner., Bioinformatics. 2013 Jan 1;29(1):15-21.
2. Li B and Dewey CN, RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome, BMC Bioinformatics. 2011 Aug 4;12:323.
3. Leng N et al., EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments, Bioinformatics. 2013 Apr 15;29(8):1035-43.
4. Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2, Nature Methods. 2012, 9:357-359.
5. Langmead B et al., Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biol. 2009;10(3):R25
6. Liao Y et al., featureCounts: an efficient general purpose program for assigning sequence reads to genomic features, Bioinformatics. 2014 Apr 1;30(7):923-30.
7. Love, MI et al., Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2, Genome Biol. 2014;15(12):550.
8. Kim D. et al., HISAT: a fast spliced aligner with low memory requirements, Nat Methods. 2015 Apr;12(4):357-60.
9. Trapnell C et al., TopHat: discovering splice junctions with RNA-Seq, Bioinformatics 2009 25(9):1105-1111.
10. Robinson MD et al., edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics. 2010 Jan 1;26(1):139-40.
11. Anders S and Huber W, Differential expression analysis for sequence count data, Genome Biol. 2010;11(10):R106.