Performing example analyses¶
Below are two example analyses than can be performed by the user. The first one is a single-species analysis with one species (Salmonella Typhimurium SL1344), while the second one is a multi-species analysis with two species (Staphylococcus aureus strain SH100 and Human Mast cells). The read files of the multi-species analysis consist of three libraries: Infected has both human and staphylococcus reads, while uninfected has only humand and steady_state only stapyhlococcus reads. Both analysis follow a similar workflow. The main differences occur when creating a new project and running deseq, since some information about the libraries and the species they contain has to be provided (See details bwlow, when running the subcommands).
Single-species analysis¶
Here you will be guided trough a small example analysis using a publicly available RNA-Seq from NCBI GEO that was part of a publication by Kröger et al.. This is a transcriptome analysis of Salmonella Typhimurium SL1344 in different conditions. We will generate several output files in different formats. The CSV (tabular separated plain text files) files can be opened with any spreadsheet program like LibreOffice or Excel. For inspecting the mappings (in BAM format) and coverage files (wiggle format) you can use a genome browser for example IGB or IGV.
Generating a project
At first we have to create the analysis folder and its input subfolder.
All output folders, except for the align output folder, will be created when the corresponding subcommand is called.
For this we use the create
subcommand.
Please not that all species of a project - even if it is only one species - need to be named via the –species flag
$ reademption create --project_path READemption_analysis --species salmonella="Salmonella Typhimurium"
Created folder "READemption_analysis" and required subfolders.
Please copy read files into folder "READemption_analysis/input/reads" and reference sequences files into folder/s "READemption_analysis/input/salmonella_reference_sequences".
This will result in a folder structure as shown here:
READemption_analysis
├── config.json
├── input
│ ├── reads
│ ├── salmonella_annotations
│ └── salmonella_reference_sequences
└── output
└── align
├── alignments
├── index
├── processed_reads
├── reports_and_stats
│ ├── stats_data_json
│ └── version_log.txt
└── unaligned_reads
Retrieving the input data
We have to download the reference sequence (FASTA format) as well as
the annotation file (GFF3 format) for Salmonella from NCBI. As we
will use the URL of Salmonella Typhimurium SL1344’s source FTP
folder it several times we store it in an environment variable called
FTP_SOURCE
.
$ FTP_SOURCE=ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/Salmonella_enterica_serovar_Typhimurium_SL1344_uid86645/
We download the reference sequence (the chromosome and three plasmids)
in FASTA format and store them in the reference_sequences
folder. The files are saved with a different suffix (.fa
instead
of .fna
) as some genome browser (e.g. IGB) will not accept them as
FASTA files otherwise.
$ wget -O READemption_analysis/input/salmonella_reference_sequences/NC_016810.fa $FTP_SOURCE/NC_016810.fna
$ wget -O READemption_analysis/input/salmonella_reference_sequences/NC_017718.fa $FTP_SOURCE/NC_017718.fna
$ wget -O READemption_analysis/input/salmonella_reference_sequences/NC_017719.fa $FTP_SOURCE/NC_017719.fna
$ wget -O READemption_analysis/input/salmonella_reference_sequences/NC_017720.fa $FTP_SOURCE/NC_017720.fna
We have to modify the header of the FASTA files as the sequence IDs have to be the same as the ones in the first column of the GGF3 files (see below) to be used in the gene quantification. This will be also necessary if both, FASTA and GFF3 files, will be loaded in the IGB.
$ sed -i "s/>/>NC_016810.1 /" READemption_analysis/input/salmonella_reference_sequences/NC_016810.fa
$ sed -i "s/>/>NC_017718.1 /" READemption_analysis/input/salmonella_reference_sequences/NC_017718.fa
$ sed -i "s/>/>NC_017719.1 /" READemption_analysis/input/salmonella_reference_sequences/NC_017719.fa
$ sed -i "s/>/>NC_017720.1 /" READemption_analysis/input/salmonella_reference_sequences/NC_017720.fa
Then we download the GFF3 files that contain the annotations
$ wget -P READemption_analysis/input/salmonella_annotations https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/210/855/GCF_000210855.2_ASM21085v2/GCF_000210855.2_ASM21085v2_genomic.gff.gz
$ gunzip READemption_analysis/input/salmonella_annotations/GCF_000210855.2_ASM21085v2_genomic.gff.gz
Finally, we need the reads of the RNA-Seq libraries. To save some time for running this examples we will work with subsampled libraries of 1M reads each. This will the limit informative value of the results which is acceptable as we just want to understand the workflow of the READemption. Please be aware that READemption can perform only basic quality trimming and adapter clipping. If this is not sufficient you can use the FASTX toolkit, cutadapt or other tools for the preprocessing.
$ wget -P READemption_analysis/input/reads http://reademptiondata.imib-zinf.net/InSPI2_R1.fa.bz2
$ wget -P READemption_analysis/input/reads http://reademptiondata.imib-zinf.net/InSPI2_R2.fa.bz2
$ wget -P READemption_analysis/input/reads http://reademptiondata.imib-zinf.net/LSP_R1.fa.bz2
$ wget -P READemption_analysis/input/reads http://reademptiondata.imib-zinf.net/LSP_R2.fa.bz2
We have now all the necessary data available. The input folder should look like this now:
$ ls READemption_analysis/input/*
READemption_analysis/input/salmonella_annotations:
NC_016810.gff NC_017718.gff NC_017719.gff NC_017720.gff
READemption_analysis/input/reads:
InSPI2_R1.fa.bz2 InSPI2_R2.fa.bz2 LSP_R1.fa.bz2 LSP_R2.fa.bz2
READemption_analysis/input/salmonella_reference_sequences:
NC_016810.fa NC_017718.fa NC_017719.fa NC_017720.fa
Processing and aligning the reads
The first step it the read processing and mapping. Via parameters we
tell READemption to use 4 CPU (-p 4
) and perform a poly-A-clipping
(--poly_a_clipping
) before the mapping.
$ reademption align -p 4 --poly_a_clipping --project_path READemption_analysis
Once this the mapping is done the file read_alignment_stats.csv
is
created which can be found in
READemption_analysis/output/align/reports_and_stats/
. It contains
several mapping statistics for example how many reads are successfully
aligned in total and how many were aligned to each replicon. We see
that more than 98 % of the reads are mapped for each library. Sorted
and indexed alignements in BAM format are stored in
READemption_analysis/output/align/alignments
. We could load them
into a genome browser but instead we continue with the next step.
Generating coverage files
In order to generate strand specific coverage files with different
normalizations we use the subcommand coverage
.
$ reademption coverage -p 4 --project_path READemption_analysis
The sets are stored in subfolder of
READemption_analysis/output/salmonella_coverage-raw/
, READemption_analysis/output/salmonella_coverage-tnoar_mil_normalized/
and READemption_analysis/output/salmonella_coverage-tnoar_min_normalized/
. The most oftenly used set
is stored in coverage-tnoar_min_normalized
. Here the coverage
values are normalized by the total number of aligned reads (TNOAR) of
the individual library and then multiplied by the lowest TNOAR value
of all libraries.
These files could be inspected for differential
RNA-Seq (dRNA-Seq - comparing libraries with and without Terminator
Exonuclease treatment) data in order to determine transcriptional
start sites. They can be loaded in a common genome browsers like IGB or IGV. Keep in mind that the
coverages of the reverse strand have negative values so you have to
adapt the scaling in some genome browsers.
Performing gene wise quantification
In this step we want to quantify the number of reads overlapping with
the locations of the annotation entries. With the --features
parameter we configure reademption
to just quantify CDS, tRNA and
rRNA entries.
$ reademption gene_quanti -p 4 --features CDS,tRNA,rRNA --project_path READemption_analysis
After the quantification we find tables that contain the combined
counting for all entries in
READemption_analysis/output/salmonella_gene_quanti_combined
. The
countings for mappings in sense and anti-sense are separately
listed. Besides the raw countings there are also tables for
countings normalized by the total number of reads, RPKM values and TPM (transcripts per million).
Performing differential gene expression analysis
To compare the gene expression of different conditions we apply the
subcommand deseq
which makes use of the R library DESeq2.
$ reademption deseq -l InSPI2_R1.fa.bz2,InSPI2_R2.fa.bz2,LSP_R1.fa.bz2,LSP_R2.fa.bz2 -c InSPI2,InSPI2,LSP,LSP -r 1,2,1,2 --libs_by_species salmonella=InSPI2_R1,InSPI2_R2,LSP_R1,LSP_R2 project_path READemption_analysis
We have to tell READemption which libraries are replicates of which
condition. This is done by the parameter -l
, -c
and -r
. -l
should hold a comma separated list of the libraries, -c
the
corresponding conditions and -r
the corresponding replicate number. In our case we have 4 libraries
(InSPI2_R1.fa.bz2
, InSPI2_R2.fa.bz2
, LSP_R1.fa.bz2
,
LSP_R2.fa.bz2
) and two conditions (which we call InSPI2
and
LSP
) and two times two replicates (R1 and R2 for each condition). Just to make this association easier to understand:
libs InSPI2_R1.fa.bz2 InSPI2_R2.fa.bz2 LSP_R1.fa.bz2 LSP_R2.fa.bz2
| | | |
conds InSPI2 InSPI2 LSP LSP
| | | |
reps 1 2 1 2
When you call deseq
it will compare all conditions with each other
and you can pick the comparison that you need. The raw DESeq2
results are enriched with the original annotation information and are
stored in
READemption_analysis/output/salmonella_deseq/deseq_with_annotations
Create plots
Finally we generate plots that visualize the results of the different
steps. viz_align
creates histograms of the read length
distribution for the untreated and treated reads (saved in
READemption_analysis/output/read_lengths_viz_align/
).
It also creates an overview of how many reads map to each species
and how many reads are species cross-mapped per library (saved in
READemption_analysis/output/all_species_viz_align/
. However, this folder can be neglected in a single species analysis).
$ reademption viz_align --project_path READemption_analysis
viz_gene_quanti
visualizes the gene wise countings. In our example
you will see that - as expected - the replicates are more similar to
each other than to the libs of the other condition. It also generates
bar plots that show the distribution of reads inside the different RNA
classes.
$ reademption viz_gene_quanti --project_path READemption_analysis
viz_deseq
generates MA-plots as well as volcano plots.
$ reademption viz_deseq --project_path READemption_analysis
Multi-species analysis¶
Here you will be guided trough a small example Dual RNA-seq analysis using a publicly available RNA-Seq from the European Nucleotide Archive (ENA) that was part of a publication by Goldmann et al.. This is a transcriptome analysis of Staphylococcus aureus strain SH100 and Human Mast cells in different conditions. The complete analysis is publicly available at Publisso. Note that we use only three of the five conditions (9 instead of all 15 libraries) to make the analysis less complicated. We will generate several output files in different formats. The CSV (tabular separated plain text files) files can be opened with any spreadsheet program like LibreOffice or Excel. For inspecting the mappings (in BAM format) and coverage files (wiggle format) you can use a genome browser for example IGB or IGV.
Generating a project
At first we have to create the analysis folder and its input subfolder.
All output folders, except for the align output folder, will be created when the corresponding subcommand is called.
For this we use the create
subcommand.
Please not that all species of a project - in this case two species - need to be named via the –species flag
$ reademption create --project_path READemption_analysis --species human="Homo sapiens" staphylococcus="Staphylococcus aureus"
Created folder "READemption_analysis" and required subfolders.
Please copy read files into folder "READemption_analysis/input/reads" and reference sequences files into folder/s "READemption_analysis/input/human_reference_sequences", "READemption_analysis/input/staphylococcus_reference_sequences".
This will result in a folder structure as shown here:
READemption_analysis
├── config.json
├── input
│ ├── human_annotations
│ ├── human_reference_sequences
│ ├── reads
│ ├── staphylococcus_annotations
│ └── staphylococcus_reference_sequences
└── output
└── align
├── alignments
├── index
├── processed_reads
├── reports_and_stats
│ ├── stats_data_json
│ └── version_log.txt
└── unaligned_reads
Retrieving the input data
We have to download the reference sequences (FASTA format) as well as the annotation files (GFF3 format) for both species.
We download the Staphylococcus genome to the corresponding folder and unpack it.
$ wget -O READemption_analysis/input/staphylococcus_reference_sequences/staphylococcus_genome.fa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/013/425/GCF_000013425.1_ASM1342v1/GCF_000013425.1_ASM1342v1_genomic.fna.gz
$ gunzip READemption_analysis/input/staphylococcus_reference_sequences/staphylococcus_genome.fa.gz
We download the Staphylococcus annotation to the corresponding folder and unpack it.
$ wget -O READemption_analysis/input/staphylococcus_annotations/staphylococcus_annotation.gff.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/013/425/GCF_000013425.1_ASM1342v1/GCF_000013425.1_ASM1342v1_genomic.gff.gz
$ gunzip READemption_analysis/input/staphylococcus_annotations/staphylococcus_annotation.gff.gz
We download the Human genome to the corresponding folder and unpack it.
$ wget -O READemption_analysis/input/human_reference_sequences/human_genome.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/GRCh38.p10.genome.fa.gz
$ gunzip READemption_analysis/input/human_reference_sequences/human_genome.fa.gz
We download the Human annotation to the corresponding folder and unpack it.
$ wget -O READemption_analysis/input/human_annotations/human_annotation.gff.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/gencode.v27.annotation.gff3.gz
$ gunzip READemption_analysis/input/human_annotations/human_annotation.gff.gz
The reference *Staphylococcus sequence was saved with a different suffix (.fa
instead
of .fna
) as some genome browser (e.g. IGB) will not accept them as
FASTA files otherwise.
Finally, we need the reads of the RNA-Seq libraries. To save some time for running this examples we will work with subsampled libraries of 10000 reads each. This will the limit informative value of the results which is acceptable as we just want to understand the workflow of the READemption. Please be aware that READemption can perform only basic quality trimming and adapter clipping. If this is not sufficient you can use the FASTX toolkit, cutadapt or other tools for the preprocessing.
$ wget https://raw.githubusercontent.com/Tillsa/Tillsa-2022-06-15-READemption_tutorial_data/main/Infected_replicate_1.fq https://raw.githubusercontent.com/Tillsa/Tillsa-2022-06-15-READemption_tutorial_data/main/Infected_replicate_2.fq \ https://raw.githubusercontent.com/Tillsa/Tillsa-2022-06-15-READemption_tutorial_data/main/Infected_replicate_3.fq https://raw.githubusercontent.com/Tillsa/Tillsa-2022-06-15-READemption_tutorial_data/main/Steady_state_replicate_1.fq https://raw.githubusercontent.com/Tillsa/Tillsa-2022-06-15-READemption_tutorial_data/main/Steady_state_replicate_2.fq https://raw.githubusercontent.com/Tillsa/Tillsa-2022-06-15-READemption_tutorial_data/main/Steady_state_replicate_3.fq https://raw.githubusercontent.com/Tillsa/Tillsa-2022-06-15-READemption_tutorial_data/main/Uninfected_replicate_1.fq https://raw.githubusercontent.com/Tillsa/Tillsa-2022-06-15-READemption_tutorial_data/main/Uninfected_replicate_2.fq https://raw.githubusercontent.com/Tillsa/Tillsa-2022-06-15-READemption_tutorial_data/main/Uninfected_replicate_3.fq -P READemption_analysis/input/reads
We have now all the necessary data available. The input folder should look like this now:
$ ls READemption_analysis/input/*
input/human_annotations:
human_annotation.gff
input/human_reference_sequences:
human_genome.fa
input/reads:
Infected_replicate_1.fq Infected_replicate_3.fq Steady_state_replicate_2.fq Uninfected_replicate_1.fq Uninfected_replicate_3.fq
Infected_replicate_2.fq Steady_state_replicate_1.fq Steady_state_replicate_3.fq Uninfected_replicate_2.fq
input/staphylococcus_annotations:
staphylococcus_annotation.gff
input/staphylococcus_reference_sequences:
staphylococcus_genome.fa
Processing and aligning the reads
The first step it the read processing and mapping. Via parameters we
tell READemption to use 4 CPU (-p 4
) and perform a poly-A-clipping
(--poly_a_clipping
) before the mapping.
$ reademption align -p 4 --poly_a_clipping --project_path READemption_analysis
Once this the mapping is done the file read_alignment_stats.csv
is
created which can be found in
READemption_analysis/output/align/reports_and_stats/
. It contains
several mapping statistics for example how many reads are successfully
aligned in total and how many were aligned to each species as well as the species cross aligned reads. Sorted
and indexed alignements in BAM format are stored in
READemption_analysis/output/align/alignments
. We could load them
into a genome browser but instead we continue with the next step.
Generating coverage files
In order to generate strand specific coverage files with different
normalizations we use the subcommand coverage
.
$ reademption coverage -p 4 --project_path READemption_analysis
The sets are stored in subfolder of
READemption_analysis/output/staphylococcus_coverage-raw/
, READemption_analysis/output/staphylococcus_coverage-tnoar_mil_normalized/
, READemption_analysis/output/staphylococcus_coverage-tnoar_min_normalized/
,
READemption_analysis/output/human_coverage-raw/
, READemption_analysis/output/human_coverage-tnoar_mil_normalized/
and READemption_analysis/output/human_coverage-tnoar_min_normalized/
.
The most oftenly used set is stored in coverage-tnoar_min_normalized
.
Here the coverage values are normalized by the total number of aligned reads (TNOAR) of
the individual library and then multiplied by the lowest TNOAR value
of all libraries.
These files could be inspected for differential
RNA-Seq (dRNA-Seq - comparing libraries with and without Terminator
Exonuclease treatment) data in order to determine transcriptional
start sites. They can be loaded in a common genome browsers like IGB or IGV. Keep in mind that the
coverages of the reverse strand have negative values so you have to
adapt the scaling in some genome browsers.
Performing gene wise quantification
In this step we want to quantify the number of reads overlapping with
the locations of the annotation entries. With the --features
parameter we configure reademption
to just quantify gene entries to save some time.
$ reademption gene_quanti -p 4 --features gene --project_path READemption_analysis
After the quantification we find tables that contain the combined
counting for all entries in
READemption_analysis/output/staphylococcus_gene_quanti_combined
and READemption_analysis/output/human_gene_quanti_combined
. The
countings for mappings in sense and anti-sense are separately
listed. Besides the raw countings there are also tables for
countings normalized by the total number of reads, RPKM values and TPM (transcripts per million).
Performing differential gene expression analysis
To compare the gene expression of different conditions we apply the
subcommand deseq
which makes use of the R library DESeq2.
$ reademption deseq -l Infected_replicate_1,Infected_replicate_2,Infected_replicate_3,Steady_state_replicate_1,Steady_state_replicate_2,Steady_state_replicate_3,Uninfected_replicate_1,Uninfected_replicate_2,Uninfected_replicate_3 -c infected,infected,infected,steady_state,steady_state,steady_state,uninfected,uninfected,uninfected -r 1,2,3,1,2,3,1,2,3 --libs_by_species human="Infected_replicate_1,Infected_replicate_2,Infected_replicate_3,Uninfected_replicate_1,Uninfected_replicate_2,Uninfected_replicate_3" staphylococcus="Infected_replicate_1,Infected_replicate_2,Infected_replicate_3,Steady_state_replicate_1,Steady_state_replicate_2,Steady_state_replicate_3" --size_factor=species --project_path READemption_analysis
We have to tell READemption which libraries are replicates of which
condition. This is done by the parameter -l
, -c
and -r
. -l
should hold a comma separated list of the libraries, -c
the
corresponding conditions and -r
the corresponding replicate number.
In our case we have 9 libraries (Infected_replicate_1
, Infected_replicate_2
, Infected_replicate_3
, Steady_state_replicate_1
, Steady_state_replicate_2
, Steady_state_replicate_3
, Uninfected_replicate_1
, Uninfected_replicate_2
, Uninfected_replicate_3
)
and three conditions (which we call infected
, steady_state
and uninfected
) and three times three replicates (R1, R2 and R3 for each condition). Just to make this association easier to understand:
libs Infected_replicate_1 Infected_replicate_2 Infected_replicate_3 Steady_state_replicate_1 Steady_state_replicate_2 Steady_state_replicate_3 Uninfected_replicate_1 Uninfected_replicate_2 Uninfected_replicate_3
| | | | | | | | |
conds infected infected infected steady_state steady_state steady_state uninfected uninfected uninfected
| | | | | | | | |
reps 1 2 3 1 2 3 1 2 3
Because we set the –size factor to species and set the species for each lib via –libs_by_species,
when you call deseq
it will calculate the size factors for normalization based on the reads of the current species.
Deseq will compare all the possible combinations of the libraries of a species
and you can pick the comparison that you need. The raw DESeq2
results are enriched with the original annotation information and are
stored in
READemption_analysis/output/staphyloccus_deseq/deseq_with_annotations
and READemption_analysis/output/human_deseq/deseq_with_annotations
Create plots
Finally we generate plots that visualize the results of the different
steps. viz_align
creates histograms of the read length
distribution for the untreated and treated reads (saved in
READemption_analysis/output/read_lengths_viz_align/
).
It also creates an overview of how many reads map to each species and how many reads are species cross-mapped per library.
$ reademption viz_align --project_path READemption_analysis
viz_gene_quanti
visualizes the gene wise countings. In our example
you will see that - as expected - the replicates are more similar to
each other than to the libs of the other condition. It also generates
bar plots that show the distribution of reads inside the different RNA
classes.
$ reademption viz_gene_quanti --project_path READemption_analysis
viz_deseq
generates MA-plots as well as volcano plots.
$ reademption viz_deseq --project_path READemption_analysis