READemption’s subcommands¶
Results separately for each species
In general the subcommands need at least one argument - the analysis
folder. The only exception is the create
subcommand, which needs the species
of a project.
After the initial project creation and the subsequent alignment,
which is (except for the statistics) species agnostic all other subcommands
(coverage
, gene_quanti
, deseq
, viz_align
, viz_gene_quanti
, viz_deseq
) present
their results by species.
Exclude or include species cross-mapped reads
Usually a multi-species project will contain cross-mapped reads that map to different species.
These reads are excluded from countings and normalization factors of the subcommands coverage
and gene_quanti
per default.
Since the other subcommands (deseq
, viz_gene_quanti
, viz_deseq
) are based on gene_quanti
,
species cross-mapped reads are also excluded from these subcommands.
This default behaviour can be turned off separately for countings [–count_cross_aligned_reads] and normalization [–normalize_cross_aligned_reads_included],
when executing coverage
or gene_quanti
.
Fragment building for paired-end reads
When using paired-end reads, READemption automatically builds fragments from aligned reads
as part of the align
subcommand.
This behaviour, if not needed, can be turned off [–no_fragment_building] to save time.
To continue skipping fragment building, the flag [–no_fragment_building] also needs to be set during
coverage
or gene_quanti
. Otherwise the fragments will be build and used by coverage
and gene_quanti
,
even if fragments were not built during align
.
create¶
create
generates the required folder structure for input and
output files. A folder prefix
and a display name
must be assigned to each species
,
even if the project contains only a single species.
Once these folders are created the input files have to
be placed into the correct locations. A reference_sequences
folder and an
annotation
folder will be created for each species, while the reads
folder
is common for all species.
The following example creates the input folders for a project with three different species:
$ reademption create --project_path reademption_analysis_triple \
--species \
human="Homo sapiens" \
staphylococcus="Staphylococcus aureus" \
influenza="Influenza A"
Created input folder structure:
As a minimal requirement,
RNA-Seqs reads in FASTA format (can be compressed with bzip2
or
gzip
) must be placed in input/reads
and the reference sequences of the different species
in FASTA format must be copied or linked to their corresponding species input folder
input/[species]_reference_sequences
. For the command gene_quanti
annotation files in GFF3 format have to be put into
input/[species]_annotations
.
Please note that a single species project will have only one annotations and one reference_sequences folder.
usage: reademption create [-h] --species SUB_FOLDER_NAME=DISPLAY_NAME
[SUB_FOLDER_NAME=DISPLAY_NAME ...] --project_path
PROJECT_PATH
Named Arguments¶
--species, -s | One key-value pair for each species. Key-value pairs consist of a ‘folder prefix’ (key) and a ‘display name’ (value) and are separated by an ‘=’-sign. The ‘folder prefix’ will be used to create input -and output folders for the given species. The ‘display name’ will be used for output files and figures. Syntax example: $ reademption create –species human=”Homo sapiens” staphylococcus=”Staphylococcus aureus” –project_path READemption_analysis |
--project_path, -f | |
Path of the project folder. |
align¶
align
performs the clipping and size filtering of the reads, as
well as the actual aligning to the reference sequences. It also
generates statistics about the steps (e.g. number of aligned reads,
number of mappings). As the result of this steps are needed by the
other subcommands it has to be run before the others. It requires
reads in FASTA or FASTQ format (or counterparts compressed with
gzip
or bzip2
) and reference sequences in FASTA
format. align
generates the read alignments in BAM format
(*.bam
) and also index files for those (*.bam.bai
). Is also
stores unmapped reads so that they can be inspected e.g. to search for
contaminations. The file
output/align/reports_and_stats/read_alignment_stats.csv
lists
several mapping statistics for the whole project, separated by species and per chromosome.
The folder output/align/reports_and_stats/stats_data_json/
contains files
with the original countings in JSON format. Please be aware that
READemption can perform only basic quality trimming and adapter
clipping. If this is not sufficient you can use the FASTX toolkit, cutadapt or other tools for the
preprocessing.
usage: reademption align [-h] --project_path PROJECT_PATH
[--min_read_length MIN_READ_LENGTH]
[--processes PROCESSES]
[--segemehl_accuracy SEGEMEHL_ACCURACY]
[--segemehl_evalue SEGEMEHL_EVALUE]
[--segemehl_bin SEGEMEHL_BIN] [--paired_end]
[--no_fragment_building]
[--max_fragment_length MAX_FRAGMENT_LENGTH] [--split]
[--poly_a_clipping] [--fastq]
[--min_phred_score MIN_PHRED_SCORE]
[--adapter ADAPTER] [--check_for_existing_files]
[--reverse_complement] [--progress]
[--crossalign_cleaning]
Named Arguments¶
--project_path, -f | |
Path of the project folder. | |
--min_read_length, -l | |
Minimal read length after clipping (default 12). Should be higher for eukaryotic species. Default: 12 | |
--processes, -p | |
Number of processes that should be used (default 1). Default: 1 | |
--segemehl_accuracy, -a | |
Segemehl’s minimal accuracy (in %) (default 95). Default: 95.0 | |
--segemehl_evalue, -e | |
Segemehl’s maximal e-value (default 5.0). Default: 5.0 | |
--segemehl_bin, -s | |
Segemehl’s binary path (default ‘segemehl.x’). Default: “segemehl.x” | |
--paired_end, -P | |
Use this if reads are originating from a paired-end sequencing. The members of a pair must be marked with ‘_p1’ and ‘_p2’ in front of the file type suffixes (e.g. ‘my_sample_p1.fa’ and ‘my_sample_p2.fa’ or ‘my_sample_p1.fa.bz2’ and ‘my_sample_p2.fa.bz2’). This option cannot be used with polyA tail clipping. Default: False | |
--no_fragment_building, -nb | |
Do not build fragments from paired end alignments. Not building the fragments saves time. If later subcommands like ‘gene_quanti’ or ‘coverage’ need the fragments, they will be build before. If you don’t want to work with fragments, also turn on ‘no_fragments’ when using the sucommands ‘gene_quanti’ and ‘coverage’ to avoid building the fragments when running these downstream subcommands. Default: False | |
--max_fragment_length, -mf | |
The maximum fragment length for all fragments. Fragments are either built from two mates or a single read. Default: False | |
--split, -S | Run segemehl with read splitting. Default: False |
--poly_a_clipping, -c | |
Perform polyA tail clipping. This option cannot be used for paired-end reads. Default: False | |
--fastq, -q | Input reads are in FASTQ not FASTA format. Default: False |
--min_phred_score, -Q | |
Minimal Phred score. Works only if read are given in FASTQ format. As soon as a based drop below this value it and all the nucleotides downstream of it will be trimmed off. | |
--adapter, -A | Adapter sequence. If it is found in a read it and all the nucleotides downstream will be trimmed off. |
--check_for_existing_files, -F | |
Check for existing files (e.g. from a interrupted previous run) and do not overwrite them if they exits. Attention! You have to take care that there are no partially generated files left! Default: False | |
--reverse_complement, -R | |
Map reverse complement of the input reads. Default: False | |
--progress, -g | Show progress of the segemehl mapping. Default: False |
--crossalign_cleaning, -x | |
Remove reads from BAM files that are cross-mapped to replicons of different species. Default: False |
coverage¶
coverage
generates strand specific coverage files in wiggle format based on the
read alignments. These wiggle files can be viewed in common genome
browser like the Integrated genome browser (IGB) or the Integrative genome viewer (IGV). Three sets of wiggle
files will be generated: raw counting values without normalization
(located in the folder coverage-raw), normalized by the total number
of aligned reads (abbreviated as tnoar) and the multiplied by the
lowest number of aligned reads of all considered libraries (in folder
coverage-tnoar_min_normalized) as well as normalized by the total
number of aligned reads and multiplied by one million
(coverage-tnoar_mil_normalized). The different normalizations make a
visual semi-quantitative comparative possible and enable to perform
transcription start site analysis (e.g. using tools like TSSPredator). For
each library and set there will be coverage files for the forward and
the reverse strand. The coverages for the forward strand have positive
values while the one for the reverse stand have negative values in
order to make a visual discrimination easy. Per default all reads and
each position of them will be considered. To calculate the coverages
only based on uniquely aligned read use the --unique_only
parameter. If only the first base should be considered add
--first_base_only
. Reads are aligned to multiple location will
account only in fraction to the values of the different positions. For
example a read that is mapped to three different location will
contribute a value of 1/3 to each of the nucleotides of these
positions. To turn off this behavior use
--skip_read_count_splitting
.
usage: reademption coverage [-h] --project_path PROJECT_PATH [--unique_only]
[--normalize_by_uniquely]
[--count_cross_aligned_reads]
[--normalize_cross_aligned_reads_included]
[--processes PROCESSES]
[--skip_read_count_splitting]
[--non_strand_specific]
[--coverage_style {global,first_base_only,last_base_only,centered}]
[--clip_length CLIP_LENGTH]
[--check_for_existing_files] [--paired_end]
[--no_fragments]
[--max_fragment_length MAX_FRAGMENT_LENGTH]
[--no_norm_by_fragments]
Named Arguments¶
--project_path, -f | |
Path of the project folder. | |
--unique_only, -u | |
Use uniquely aligned reads only. Default: False | |
--normalize_by_uniquely, -U | |
Normalize by the number of uniquely aligned reads. By default the normalization is done based on the total number of aligned reads even if only uniquely aligned reads are used for the coverage calculation. Default: False | |
--count_cross_aligned_reads, -x | |
Count species-cross-aligned reads for the coverage file creation, by default these reads are not counted. Default: False | |
--normalize_cross_aligned_reads_included, -X | |
Add species-cross-aligned reads to normalization, by default only the species specific reads are used. Default: False | |
--processes, -p | |
Number of processes that should be used (default 1). Default: 1 | |
--skip_read_count_splitting, -s | |
Do not split the read counting between different alignings. Default is to do the splitting. Default: False | |
--non_strand_specific, -d | |
Do not distict between the coverage of the forward and reverse strand but sum them to a single value for each base. Default: False | |
--coverage_style, -b | |
Possible choices: global, first_base_only, last_base_only, centered Select for coverage generation if only the first aligned base at the 5’ end of each read (‘first_base_only’) or the last aligned base at the 3’ end of each read (‘last_base_only’) is taken into account. The centered approach (‘centered’) clips a predefined number of nts from each alignment end and adds to the remaining genomic region a value divided by its length. By default the coverage is generated using the whole range of each alignment (‘global’). Default: “global” | |
--clip_length, -cl | |
Number of nucleotides that are clipped from each alignment end for centered approach. Default: 11 | |
--check_for_existing_files, -F | |
Check for existing files (e.g. from a interrupted previous run) and do not overwrite them if they exits. Attention! You have to take care that there are no partially generated files left! Default: False | |
--paired_end, -P | |
Use this if reads are originating from a paired-end sequencing. Default: False | |
--no_fragments, -nf | |
Count single alignments instead of fragments built after alignment. This option saves time if the fragments have not been build during alignment. Only use this if reads are originating from a paired-end sequencing. Default: False | |
--max_fragment_length, -mf | |
The maximum fragment length for all fragments. Fragments are either built from two mates or a single read. Default: False | |
--no_norm_by_fragments, -nff | |
Use total no. of single alignments instead of fragments built after alignment for the normalisation. This option saves time if the fragments have not been build during alignment. Only use this if reads are originating from a paired-end sequencing. Default: False |
gene_quanti¶
With gene_quanti
the number of reads overlapping with each of the
annotation entries is counted and the results are combined in
tables. At least one GGF3 file with annotations has to be placed in
input/annotations
. The sequence ID of the sequenced must be
precisely the same as the IDs used in the reference sequence FASTA
files. To specify the feature classes (the third column in the GFF3
file e.g. CDS, gene, rRNA, tRNA) that should be quantified the
parameter --features
can be used. Otherwise countings for all
annotation entries are generated. Per default sense and anti-sense
overlaps are counted and separately listed. gene_quanti
provides, besides the
raw read countings, Transcripts per Million (TPM) normalized read counts,
normalized by the total number of aligned reads of the given library (TNOAR),
as well as Reads per Kilobase Million (RPKM) normalized read counts
The results of the different counting methods are provided in separate CSV-tables.
usage: reademption gene_quanti [-h] --project_path PROJECT_PATH
[--min_overlap MIN_OVERLAP]
[--read_region {global,first_base_only,last_base_only,centered}]
[--clip_length CLIP_LENGTH]
[--no_count_split_by_alignment_no]
[--no_count_splitting_by_gene_no]
[--add_antisense] [--antisense_only]
[--non_strand_specific] [--processes PROCESSES]
[--features ALLOWED_FEATURES] [--unique_only]
[--pseudocounts] [--check_for_existing_files]
[--count_cross_aligned_reads]
[--normalize_cross_aligned_reads_included]
[--paired_end] [--no_fragments]
[--max_fragment_length MAX_FRAGMENT_LENGTH]
[--no_norm_by_fragments]
Named Arguments¶
--project_path, -f | |
Path of the project folder. | |
--min_overlap, -o | |
Minimal read-annotation-overlap (in nt) (default 1). Default: 1 | |
--read_region, -b | |
Possible choices: global, first_base_only, last_base_only, centered Select for gene-wise quantification if only the first aligned base at the 5’ end of each read (‘first_base_only’) or the last aligned base at the 3’ end of each read (‘last_base_only’) is taken into account. The centered approach (‘centered’) clips a predefined number of nts from each alignment end and calculates the overlap based on the remaining region. By default the overlap is calculated based on the whole range of each alignment (‘global’). Default: “global” | |
--clip_length, -cl | |
Number of nucleotides that are clipped from each alignment end for centered approach. Default: 11 | |
--no_count_split_by_alignment_no, -n | |
Do not split read countings by the number of alignments a read has. By default this count splitting is performed. Default: False | |
--no_count_splitting_by_gene_no, -l | |
Do not split read countings by the number of genes it overlaps with. By default this count splitting is performed. Default: False | |
--add_antisense, -a | |
Count anti-sense and sense read-gene-overlaps and report them separately. By default only sense overlaps are counted. Default: False | |
--antisense_only, -ao | |
Count only anti-sense read-gene-overlaps and no sense overlaps. By default only sense overlaps are counted. Default: False | |
--non_strand_specific | |
Use countings of reads overlapping with a gene on both strands and sum them up. Default: False | |
--processes, -p | |
Number of processes that should be used (default 1). Default: 1 | |
--features, -t | Comma separated list of features that should be considered (e.g. gene, cds, region, exon). Other feature will be skipped. If not specified all features will be considered. |
--unique_only, -u | |
Use uniquely aligned reads only. Default: False | |
--pseudocounts, -c | |
Add a pseudocount of 1 to each gene. Default: False | |
--check_for_existing_files, -F | |
Check for existing files (e.g. from a interrupted previous run) and do not overwrite them if they exits. Attention! You have to take care that there are no partially generated files left! Default: False | |
--count_cross_aligned_reads, -x | |
Count species-cross-aligned reads for the gene quantification. Bydefault these reads are not counted. Default: False | |
--normalize_cross_aligned_reads_included, -X | |
Add species-cross-aligned reads to normalization, by default only the species specific reads are used. Default: False | |
--paired_end, -P | |
Use this if reads are originating from a paired-end sequencing. Default: False | |
--no_fragments, -nf | |
Count single alignments instead of fragments built after alignment. This option saves time if the fragments have not been build during alignment. Only use this if reads are originating from a paired-end sequencing. Default: False | |
--max_fragment_length, -mf | |
The maximum fragment length for all fragments. Fragments are either built from two mates or a single read. Default: False | |
--no_norm_by_fragments, -nff | |
Use total no. of single alignments instead of fragments built after alignment for the normalisation. This option saves time if the fragments have not been build during alignment. Only use this if reads are originating from a paired-end sequencing. Default: False |
deseq¶
Differential gene expression can be performed using deseq
which
will run a DESeq2
analyses for all possible combinations of conditions. To allocated the
conditions to the libraries use the --libs
and --conditions
parameters (e.g. --libs
SamA_R1.fa,SamA_R2.fa,SamB_R1.fa,SamB_R2.fa --conditions
SamA,SamA,SamB,SamB
).
usage: reademption deseq [-h] --project_path PROJECT_PATH --libs LIBS
--conditions CONDITIONS --replicates REPLICATES
--libs_by_species SUB_FOLDER_NAME=LIST_OF_LIBRARIES
[SUB_FOLDER_NAME=LIST_OF_LIBRARIES ...]
[--size_factor {project,species,comparison}]
[--cooks_cutoff_off] [--fc_shrinkage_off]
Named Arguments¶
--project_path, -f | |
Path of the project folder. | |
--libs, -l | Comma separated list of libraries. |
--conditions, -c | |
Comma separated list of condition in the same order as their corresponding libraries. | |
--replicates, -r | |
Comma separated list of replicates in the same order as their corresponding libraries. | |
--libs_by_species, -s | |
One key-value pair for each species. Key-value pairs consist of the ‘folder prefix’ (key) and a ‘list of libraries’ and are separated by an ‘=’-sign. The ‘folder prefix’ has or have to be the same that was used when creating a new project Syntax example: $ reademption deseq –libs infected_1.fq,infected_2.fq,control_human_1.fq,control_human_2.fq,control_staph_1.fq,control_staph_2.fq –conditions infected,infected,control_human,control_human,control_staph,control_staph –replicates 1,2,1,2,1,2 –libs_by_species human=”infected_1.fq,infected_2.fq,control_human_1.fq,control_human2.fq” staphylococcus=”infected_1.fq,infected_2.fq,control_staphylococcus_1.fq,control_staphylococcus_2.fq” –project_path READemption_analysis | |
--size_factor, -b | |
Possible choices: project, species, comparison Select the libraries for the size factor estimation of DESeq2. Selecting the option ‘project’ will result in calculating the size factors for all comparisons based on all libraries of thewhole project (It doesn’t matter if the libraries have no reads of the species being inspected or if the libraries are part ofthe current comparison). The option ‘species’ will result in calculating the size factors for all comparisons of a species only based on the libraries of the given species. The option ‘comparison’ calculates the size factors for each comparison andonly based on the libraries that are being compared for thecurrent comparison. Default: “species” | |
--cooks_cutoff_off, -k | |
Default: False | |
--fc_shrinkage_off, -so | |
turn off log2 fold change shrinkage. Default: False |
viz_align¶
viz_align
plots histograms of the read length distributions of the
reads before and after the read clipping.
usage: reademption viz_align [-h] [--paired_end] --project_path PROJECT_PATH
Named Arguments¶
--paired_end, -P | |
Use this if reads are originating from a paired-end sequencing. Default: False | |
--project_path, -f | |
Path of the project folder. |
viz_gene_quanti¶
viz_gene_quanti
creates scatterplots in which the raw gene wise
quantification values are compared for each library pair
(all-against-all). For each comparison the pearson correllation
(r) coefficiant is. Additionally, bar charts that visualize the
distribution of the read counting of the different annotation classes
are plotted.
usage: reademption viz_gene_quanti [-h] [--paired_end] --project_path
PROJECT_PATH
Named Arguments¶
--paired_end, -P | |
Use this if reads are originating from a paired-end sequencing. Default: False | |
--project_path, -f | |
Path of the project folder. |
viz_deseq¶
viz_deseq
generates MA-plots of the comparison (log2 fold changes
vs. the base mean) as well as volcano plots (log2 fold changes
vs. p-values / adjusted p-values).
usage: reademption viz_deseq [-h] --project_path PROJECT_PATH
[--max_pvalue MAX_PVALUE]
Named Arguments¶
--project_path, -f | |
Path of the project folder. | |
--max_pvalue | Maximum adjusted p-value for genes considered to be regulated. Genes with adjusted p-values below will be marked red. Default: 0.05 |