READemption’s subcommands

Results separately for each species

In general the subcommands need at least one argument - the analysis folder. The only exception is the create subcommand, which needs the species of a project. After the initial project creation and the subsequent alignment, which is (except for the statistics) species agnostic all other subcommands (coverage, gene_quanti, deseq, viz_align, viz_gene_quanti, viz_deseq) present their results by species.

Exclude or include species cross-mapped reads

Usually a multi-species project will contain cross-mapped reads that map to different species. These reads are excluded from countings and normalization factors of the subcommands coverage and gene_quanti per default. Since the other subcommands (deseq, viz_gene_quanti, viz_deseq) are based on gene_quanti, species cross-mapped reads are also excluded from these subcommands. This default behaviour can be turned off separately for countings [–count_cross_aligned_reads] and normalization [–normalize_cross_aligned_reads_included], when executing coverage or gene_quanti.

Fragment building for paired-end reads

When using paired-end reads, READemption automatically builds fragments from aligned reads as part of the align subcommand. This behaviour, if not needed, can be turned off [–no_fragment_building] to save time. To continue skipping fragment building, the flag [–no_fragment_building] also needs to be set during coverage or gene_quanti. Otherwise the fragments will be build and used by coverage and gene_quanti, even if fragments were not built during align.

create

create generates the required folder structure for input and output files. A folder prefix and a display name must be assigned to each species, even if the project contains only a single species. Once these folders are created the input files have to be placed into the correct locations. A reference_sequences folder and an annotation folder will be created for each species, while the reads folder is common for all species.

The following example creates the input folders for a project with three different species:

$ reademption create --project_path reademption_analysis_triple \
      --species \
      human="Homo sapiens" \
      staphylococcus="Staphylococcus aureus" \
      influenza="Influenza A"

Created input folder structure:

reademption_analysis_triple
├── config.json
├── input
│      ├── human_annotations
│      ├── human_reference_sequences
│      ├── influenza_annotations
│      ├── influenza_reference_sequences
│      ├── reads
│      ├── staphylococcus_annotations
│      └── staphylococcus_reference_sequences
└── output
         └── align
                  ├── alignments
                  ├── index
                  ├── processed_reads
                  ├── reports_and_stats
                  │      ├── stats_data_json
                  │      └── version_log.txt
                  └── unaligned_reads

As a minimal requirement, RNA-Seqs reads in FASTA format (can be compressed with bzip2 or gzip) must be placed in input/reads and the reference sequences of the different species in FASTA format must be copied or linked to their corresponding species input folder input/[species]_reference_sequences. For the command gene_quanti annotation files in GFF3 format have to be put into input/[species]_annotations. Please note that a single species project will have only one annotations and one reference_sequences folder.

usage: reademption create [-h] --species SUB_FOLDER_NAME=DISPLAY_NAME
                          [SUB_FOLDER_NAME=DISPLAY_NAME ...] --project_path
                          PROJECT_PATH

Named Arguments

--species, -s One key-value pair for each species. Key-value pairs consist of a ‘folder prefix’ (key) and a ‘display name’ (value) and are separated by an ‘=’-sign. The ‘folder prefix’ will be used to create input -and output folders for the given species. The ‘display name’ will be used for output files and figures. Syntax example: $ reademption create –species human=”Homo sapiens” staphylococcus=”Staphylococcus aureus” –project_path READemption_analysis
--project_path, -f
 Path of the project folder.

align

align performs the clipping and size filtering of the reads, as well as the actual aligning to the reference sequences. It also generates statistics about the steps (e.g. number of aligned reads, number of mappings). As the result of this steps are needed by the other subcommands it has to be run before the others. It requires reads in FASTA or FASTQ format (or counterparts compressed with gzip or bzip2) and reference sequences in FASTA format. align generates the read alignments in BAM format (*.bam) and also index files for those (*.bam.bai). Is also stores unmapped reads so that they can be inspected e.g. to search for contaminations. The file output/align/reports_and_stats/read_alignment_stats.csv lists several mapping statistics for the whole project, separated by species and per chromosome. The folder output/align/reports_and_stats/stats_data_json/ contains files with the original countings in JSON format. Please be aware that READemption can perform only basic quality trimming and adapter clipping. If this is not sufficient you can use the FASTX toolkit, cutadapt or other tools for the preprocessing.

usage: reademption align [-h] --project_path PROJECT_PATH
                         [--min_read_length MIN_READ_LENGTH]
                         [--processes PROCESSES]
                         [--segemehl_accuracy SEGEMEHL_ACCURACY]
                         [--segemehl_evalue SEGEMEHL_EVALUE]
                         [--segemehl_bin SEGEMEHL_BIN] [--paired_end]
                         [--no_fragment_building]
                         [--max_fragment_length MAX_FRAGMENT_LENGTH] [--split]
                         [--poly_a_clipping] [--fastq]
                         [--min_phred_score MIN_PHRED_SCORE]
                         [--adapter ADAPTER] [--check_for_existing_files]
                         [--reverse_complement] [--progress]
                         [--crossalign_cleaning]

Named Arguments

--project_path, -f
 Path of the project folder.
--min_read_length, -l
 

Minimal read length after clipping (default 12). Should be higher for eukaryotic species.

Default: 12

--processes, -p
 

Number of processes that should be used (default 1).

Default: 1

--segemehl_accuracy, -a
 

Segemehl’s minimal accuracy (in %) (default 95).

Default: 95.0

--segemehl_evalue, -e
 

Segemehl’s maximal e-value (default 5.0).

Default: 5.0

--segemehl_bin, -s
 

Segemehl’s binary path (default ‘segemehl.x’).

Default: “segemehl.x”

--paired_end, -P
 

Use this if reads are originating from a paired-end sequencing. The members of a pair must be marked with ‘_p1’ and ‘_p2’ in front of the file type suffixes (e.g. ‘my_sample_p1.fa’ and ‘my_sample_p2.fa’ or ‘my_sample_p1.fa.bz2’ and ‘my_sample_p2.fa.bz2’). This option cannot be used with polyA tail clipping.

Default: False

--no_fragment_building, -nb
 

Do not build fragments from paired end alignments. Not building the fragments saves time. If later subcommands like ‘gene_quanti’ or ‘coverage’ need the fragments, they will be build before. If you don’t want to work with fragments, also turn on ‘no_fragments’ when using the sucommands ‘gene_quanti’ and ‘coverage’ to avoid building the fragments when running these downstream subcommands.

Default: False

--max_fragment_length, -mf
 

The maximum fragment length for all fragments. Fragments are either built from two mates or a single read.

Default: False

--split, -S

Run segemehl with read splitting.

Default: False

--poly_a_clipping, -c
 

Perform polyA tail clipping. This option cannot be used for paired-end reads.

Default: False

--fastq, -q

Input reads are in FASTQ not FASTA format.

Default: False

--min_phred_score, -Q
 Minimal Phred score. Works only if read are given in FASTQ format. As soon as a based drop below this value it and all the nucleotides downstream of it will be trimmed off.
--adapter, -A Adapter sequence. If it is found in a read it and all the nucleotides downstream will be trimmed off.
--check_for_existing_files, -F
 

Check for existing files (e.g. from a interrupted previous run) and do not overwrite them if they exits. Attention! You have to take care that there are no partially generated files left!

Default: False

--reverse_complement, -R
 

Map reverse complement of the input reads.

Default: False

--progress, -g

Show progress of the segemehl mapping.

Default: False

--crossalign_cleaning, -x
 

Remove reads from BAM files that are cross-mapped to replicons of different species.

Default: False

coverage

coverage generates strand specific coverage files in wiggle format based on the read alignments. These wiggle files can be viewed in common genome browser like the Integrated genome browser (IGB) or the Integrative genome viewer (IGV). Three sets of wiggle files will be generated: raw counting values without normalization (located in the folder coverage-raw), normalized by the total number of aligned reads (abbreviated as tnoar) and the multiplied by the lowest number of aligned reads of all considered libraries (in folder coverage-tnoar_min_normalized) as well as normalized by the total number of aligned reads and multiplied by one million (coverage-tnoar_mil_normalized). The different normalizations make a visual semi-quantitative comparative possible and enable to perform transcription start site analysis (e.g. using tools like TSSPredator). For each library and set there will be coverage files for the forward and the reverse strand. The coverages for the forward strand have positive values while the one for the reverse stand have negative values in order to make a visual discrimination easy. Per default all reads and each position of them will be considered. To calculate the coverages only based on uniquely aligned read use the --unique_only parameter. If only the first base should be considered add --first_base_only. Reads are aligned to multiple location will account only in fraction to the values of the different positions. For example a read that is mapped to three different location will contribute a value of 1/3 to each of the nucleotides of these positions. To turn off this behavior use --skip_read_count_splitting.

usage: reademption coverage [-h] --project_path PROJECT_PATH [--unique_only]
                            [--normalize_by_uniquely]
                            [--count_cross_aligned_reads]
                            [--normalize_cross_aligned_reads_included]
                            [--processes PROCESSES]
                            [--skip_read_count_splitting]
                            [--non_strand_specific]
                            [--coverage_style {global,first_base_only,last_base_only,centered}]
                            [--clip_length CLIP_LENGTH]
                            [--check_for_existing_files] [--paired_end]
                            [--no_fragments]
                            [--max_fragment_length MAX_FRAGMENT_LENGTH]
                            [--no_norm_by_fragments]

Named Arguments

--project_path, -f
 Path of the project folder.
--unique_only, -u
 

Use uniquely aligned reads only.

Default: False

--normalize_by_uniquely, -U
 

Normalize by the number of uniquely aligned reads. By default the normalization is done based on the total number of aligned reads even if only uniquely aligned reads are used for the coverage calculation.

Default: False

--count_cross_aligned_reads, -x
 

Count species-cross-aligned reads for the coverage file creation, by default these reads are not counted.

Default: False

--normalize_cross_aligned_reads_included, -X
 

Add species-cross-aligned reads to normalization, by default only the species specific reads are used.

Default: False

--processes, -p
 

Number of processes that should be used (default 1).

Default: 1

--skip_read_count_splitting, -s
 

Do not split the read counting between different alignings. Default is to do the splitting.

Default: False

--non_strand_specific, -d
 

Do not distict between the coverage of the forward and reverse strand but sum them to a single value for each base.

Default: False

--coverage_style, -b
 

Possible choices: global, first_base_only, last_base_only, centered

Select for coverage generation if only the first aligned base at the 5’ end of each read (‘first_base_only’) or the last aligned base at the 3’ end of each read (‘last_base_only’) is taken into account. The centered approach (‘centered’) clips a predefined number of nts from each alignment end and adds to the remaining genomic region a value divided by its length. By default the coverage is generated using the whole range of each alignment (‘global’).

Default: “global”

--clip_length, -cl
 

Number of nucleotides that are clipped from each alignment end for centered approach.

Default: 11

--check_for_existing_files, -F
 

Check for existing files (e.g. from a interrupted previous run) and do not overwrite them if they exits. Attention! You have to take care that there are no partially generated files left!

Default: False

--paired_end, -P
 

Use this if reads are originating from a paired-end sequencing.

Default: False

--no_fragments, -nf
 

Count single alignments instead of fragments built after alignment. This option saves time if the fragments have not been build during alignment. Only use this if reads are originating from a paired-end sequencing.

Default: False

--max_fragment_length, -mf
 

The maximum fragment length for all fragments. Fragments are either built from two mates or a single read.

Default: False

--no_norm_by_fragments, -nff
 

Use total no. of single alignments instead of fragments built after alignment for the normalisation. This option saves time if the fragments have not been build during alignment. Only use this if reads are originating from a paired-end sequencing.

Default: False

gene_quanti

With gene_quanti the number of reads overlapping with each of the annotation entries is counted and the results are combined in tables. At least one GGF3 file with annotations has to be placed in input/annotations. The sequence ID of the sequenced must be precisely the same as the IDs used in the reference sequence FASTA files. To specify the feature classes (the third column in the GFF3 file e.g. CDS, gene, rRNA, tRNA) that should be quantified the parameter --features can be used. Otherwise countings for all annotation entries are generated. Per default sense and anti-sense overlaps are counted and separately listed. gene_quanti provides, besides the raw read countings, Transcripts per Million (TPM) normalized read counts, normalized by the total number of aligned reads of the given library (TNOAR), as well as Reads per Kilobase Million (RPKM) normalized read counts The results of the different counting methods are provided in separate CSV-tables.

usage: reademption gene_quanti [-h] --project_path PROJECT_PATH
                               [--min_overlap MIN_OVERLAP]
                               [--read_region {global,first_base_only,last_base_only,centered}]
                               [--clip_length CLIP_LENGTH]
                               [--no_count_split_by_alignment_no]
                               [--no_count_splitting_by_gene_no]
                               [--add_antisense] [--antisense_only]
                               [--non_strand_specific] [--processes PROCESSES]
                               [--features ALLOWED_FEATURES] [--unique_only]
                               [--pseudocounts] [--check_for_existing_files]
                               [--count_cross_aligned_reads]
                               [--normalize_cross_aligned_reads_included]
                               [--paired_end] [--no_fragments]
                               [--max_fragment_length MAX_FRAGMENT_LENGTH]
                               [--no_norm_by_fragments]

Named Arguments

--project_path, -f
 Path of the project folder.
--min_overlap, -o
 

Minimal read-annotation-overlap (in nt) (default 1).

Default: 1

--read_region, -b
 

Possible choices: global, first_base_only, last_base_only, centered

Select for gene-wise quantification if only the first aligned base at the 5’ end of each read (‘first_base_only’) or the last aligned base at the 3’ end of each read (‘last_base_only’) is taken into account. The centered approach (‘centered’) clips a predefined number of nts from each alignment end and calculates the overlap based on the remaining region. By default the overlap is calculated based on the whole range of each alignment (‘global’).

Default: “global”

--clip_length, -cl
 

Number of nucleotides that are clipped from each alignment end for centered approach.

Default: 11

--no_count_split_by_alignment_no, -n
 

Do not split read countings by the number of alignments a read has. By default this count splitting is performed.

Default: False

--no_count_splitting_by_gene_no, -l
 

Do not split read countings by the number of genes it overlaps with. By default this count splitting is performed.

Default: False

--add_antisense, -a
 

Count anti-sense and sense read-gene-overlaps and report them separately. By default only sense overlaps are counted.

Default: False

--antisense_only, -ao
 

Count only anti-sense read-gene-overlaps and no sense overlaps. By default only sense overlaps are counted.

Default: False

--non_strand_specific
 

Use countings of reads overlapping with a gene on both strands and sum them up.

Default: False

--processes, -p
 

Number of processes that should be used (default 1).

Default: 1

--features, -t Comma separated list of features that should be considered (e.g. gene, cds, region, exon). Other feature will be skipped. If not specified all features will be considered.
--unique_only, -u
 

Use uniquely aligned reads only.

Default: False

--pseudocounts, -c
 

Add a pseudocount of 1 to each gene.

Default: False

--check_for_existing_files, -F
 

Check for existing files (e.g. from a interrupted previous run) and do not overwrite them if they exits. Attention! You have to take care that there are no partially generated files left!

Default: False

--count_cross_aligned_reads, -x
 

Count species-cross-aligned reads for the gene quantification. Bydefault these reads are not counted.

Default: False

--normalize_cross_aligned_reads_included, -X
 

Add species-cross-aligned reads to normalization, by default only the species specific reads are used.

Default: False

--paired_end, -P
 

Use this if reads are originating from a paired-end sequencing.

Default: False

--no_fragments, -nf
 

Count single alignments instead of fragments built after alignment. This option saves time if the fragments have not been build during alignment. Only use this if reads are originating from a paired-end sequencing.

Default: False

--max_fragment_length, -mf
 

The maximum fragment length for all fragments. Fragments are either built from two mates or a single read.

Default: False

--no_norm_by_fragments, -nff
 

Use total no. of single alignments instead of fragments built after alignment for the normalisation. This option saves time if the fragments have not been build during alignment. Only use this if reads are originating from a paired-end sequencing.

Default: False

deseq

Differential gene expression can be performed using deseq which will run a DESeq2 analyses for all possible combinations of conditions. To allocated the conditions to the libraries use the --libs and --conditions parameters (e.g. --libs SamA_R1.fa,SamA_R2.fa,SamB_R1.fa,SamB_R2.fa --conditions SamA,SamA,SamB,SamB).

usage: reademption deseq [-h] --project_path PROJECT_PATH --libs LIBS
                         --conditions CONDITIONS --replicates REPLICATES
                         --libs_by_species SUB_FOLDER_NAME=LIST_OF_LIBRARIES
                         [SUB_FOLDER_NAME=LIST_OF_LIBRARIES ...]
                         [--size_factor {project,species,comparison}]
                         [--cooks_cutoff_off] [--fc_shrinkage_off]

Named Arguments

--project_path, -f
 Path of the project folder.
--libs, -l Comma separated list of libraries.
--conditions, -c
 Comma separated list of condition in the same order as their corresponding libraries.
--replicates, -r
 Comma separated list of replicates in the same order as their corresponding libraries.
--libs_by_species, -s
 One key-value pair for each species. Key-value pairs consist of the ‘folder prefix’ (key) and a ‘list of libraries’ and are separated by an ‘=’-sign. The ‘folder prefix’ has or have to be the same that was used when creating a new project Syntax example: $ reademption deseq –libs infected_1.fq,infected_2.fq,control_human_1.fq,control_human_2.fq,control_staph_1.fq,control_staph_2.fq –conditions infected,infected,control_human,control_human,control_staph,control_staph –replicates 1,2,1,2,1,2 –libs_by_species human=”infected_1.fq,infected_2.fq,control_human_1.fq,control_human2.fq” staphylococcus=”infected_1.fq,infected_2.fq,control_staphylococcus_1.fq,control_staphylococcus_2.fq” –project_path READemption_analysis
--size_factor, -b
 

Possible choices: project, species, comparison

Select the libraries for the size factor estimation of DESeq2. Selecting the option ‘project’ will result in calculating the size factors for all comparisons based on all libraries of thewhole project (It doesn’t matter if the libraries have no reads of the species being inspected or if the libraries are part ofthe current comparison). The option ‘species’ will result in calculating the size factors for all comparisons of a species only based on the libraries of the given species. The option ‘comparison’ calculates the size factors for each comparison andonly based on the libraries that are being compared for thecurrent comparison.

Default: “species”

--cooks_cutoff_off, -k
 Default: False
--fc_shrinkage_off, -so
 

turn off log2 fold change shrinkage.

Default: False

viz_align

viz_align plots histograms of the read length distributions of the reads before and after the read clipping.

usage: reademption viz_align [-h] [--paired_end] --project_path PROJECT_PATH

Named Arguments

--paired_end, -P
 

Use this if reads are originating from a paired-end sequencing.

Default: False

--project_path, -f
 Path of the project folder.

viz_gene_quanti

viz_gene_quanti creates scatterplots in which the raw gene wise quantification values are compared for each library pair (all-against-all). For each comparison the pearson correllation (r) coefficiant is. Additionally, bar charts that visualize the distribution of the read counting of the different annotation classes are plotted.

usage: reademption viz_gene_quanti [-h] [--paired_end] --project_path
                                   PROJECT_PATH

Named Arguments

--paired_end, -P
 

Use this if reads are originating from a paired-end sequencing.

Default: False

--project_path, -f
 Path of the project folder.

viz_deseq

viz_deseq generates MA-plots of the comparison (log2 fold changes vs. the base mean) as well as volcano plots (log2 fold changes vs. p-values / adjusted p-values).

usage: reademption viz_deseq [-h] --project_path PROJECT_PATH
                             [--max_pvalue MAX_PVALUE]

Named Arguments

--project_path, -f
 Path of the project folder.
--max_pvalue

Maximum adjusted p-value for genes considered to be regulated. Genes with adjusted p-values below will be marked red.

Default: 0.05