READemption’s subcommands

In general the subcommands need at least one argument - the analysis folder.

create

create generates the required folder structure for input and output files. Once these folders are created the input files have to be placed into the correct locations. As a minimal requirement, RNA-Seqs reads in FASTA format (can be compressed with bzip2 or gzip) must be placed in input/reads and the reference sequence in FASTA format must be copied or linked in input/reference_sequences. For the command gene_quanti annotation files in GFF3 format have to be put into input/annotations.

usage: reademption create [-h] [--project_path PROJECT_PATH]

Named Arguments

--project_path, -f
 Path of the project folder.

align

align performs the clipping and size filtering of the reads, as well as the actual aligning to the reference sequences. It also generates statistics about the steps (e.g. number of aligned reads, number of mappings). As the result of this steps are needed by the other subcommands it has to be run before the others. It requires reads in FASTA or FASTQ format (or counterparts compressed with gzip or bzip2) and reference sequences in FASTA format. align generates the read alignments in BAM format (*.bam) and also index files for those (*.bam.bai). Is also stores unmapped reads so that they can be inspected e.g. to search for contaminations. The file output/align/reports_and_stats/read_alignment_stats.csv lists several mapping statistics. The folder output/align/reports_and_stats/stats_data_json/ contains files with the original countings in JSON format. Please be aware that READemption can perform only basic quality trimming and adapter clipping. If this is not sufficient you can use the FASTX toolkit, cutadapt or other tools for the preprocessing.

usage: reademption align [-h] [--project_path PROJECT_PATH]
                         [--min_read_length MIN_READ_LENGTH]
                         [--processes PROCESSES]
                         [--segemehl_accuracy SEGEMEHL_ACCURACY]
                         [--segemehl_evalue SEGEMEHL_EVALUE]
                         [--segemehl_bin SEGEMEHL_BIN] [--paired_end]
                         [--split] [--poly_a_clipping] [--fastq]
                         [--min_phred_score MIN_PHRED_SCORE]
                         [--adapter ADAPTER] [--check_for_existing_files]
                         [--reverse_complement] [--progress]
                         [--crossalign_cleaning CROSSALIGN_CLEANING_STRING]

Named Arguments

--project_path, -f
 Path of the project folder.
--min_read_length, -l
 

Minimal read length after clipping (default 12). Should be higher for eukaryotic species.

Default: 12

--processes, -p
 

Number of processes that should be used (default 1).

Default: 1

--segemehl_accuracy, -a
 

Segemehl’s minimal accuracy (in %) (default 95).

Default: 95.0

--segemehl_evalue, -e
 

Segemehl’s maximal e-value (default 5.0).

Default: 5.0

--segemehl_bin, -s
 

Segemehl’s binary path (default ‘segemehl.x’).

Default: “segemehl.x”

--paired_end, -P
 

Use this if reads are originating from a paired-end sequencing. The members of a pair must be marked with ‘_p1’ and ‘_p2’ in front of the file type suffixes (e.g. ‘my_sample_p1.fa’ and ‘my_sample_p2.fa’ or ‘my_sample_p1.fa.bz2’ and ‘my_sample_p2.fa.bz2’). This option cannot be used with polyA tail clipping.

Default: False

--split, -S

Run segemehl with read splitting.

Default: False

--poly_a_clipping, -c
 

Perform polyA tail clipping. This option cannot be used for paired-end reads.

Default: False

--fastq, -q

Input reads are in FASTQ not FASTA format.

Default: False

--min_phred_score, -Q
 Minimal Phred score. Works only if read are given in FASTQ format. As soon as a based drop below this value it and all the nucleotides downstream of it will be trimmed off.
--adapter, -A Adapter sequence. If it is found in a read it and all the nucleotides downstream will be trimmed off.
--check_for_existing_files, -F
 

Check for existing files (e.g. from a interrupted previous run) and do not overwrite them if they exits. Attention! You have to take care that there are no partially generated files left!

Default: False

--reverse_complement, -R
 

Map reverse complement of the input reads.

Default: False

--progress, -g

Show progress of the segemehl mapping.

Default: False

--crossalign_cleaning, -x
 Remove reads that are cross-mapped to replicons of different species. To associated species and replicons give a string in the following format: ‘<ORG_NAME_1>:<org_1_repl1>,<org_1_repl2>,..,<org_1_repl_n>;<ORG_NAME_2>:<org_2_repl1>,<org_2_repl2>,..,<org_2_repl_n>’

coverage

coverage generates strand specific coverage files in wiggle format based on the read alignments. These wiggle files can be viewed in common genome browser like the Integrated genome browser (IGB) or the Integrative genome viewer (IGV). Three sets of wiggle files will be generated: raw counting values without normalization (located in the folder coverage-raw), normalized by the total number of aligned reads (abbreviated as tnoar) and the multiplied by the lowest number of aligned reads of all considered libraries (in folder coverage-tnoar_min_normalized) as well as normalized by the total number of aligned reads and multiplied by one million (coverage-tnoar_mil_normalized). The different normalizations make a visual semi-quantitative comparative possible and enable to perform transcription start site analysis (e.g. using tools like TSSPredator). For each library and set there will be coverage files for the forward and the reverse strand. The coverages for the forward strand have positive values while the one for the reverse stand have negative values in order to make a visual discrimination easy. Per default all reads and each position of them will be considered. To calculate the coverages only based on uniquely aligned read use the --unique_only parameter. If only the first base should be considered add --first_base_only. Reads are aligned to multiple location will account only in fraction to the values of the different positions. For example a read that is mapped to three different location will contribute a value of 1/3 to each of the nucleotides of these positions. To turn off this behavior use --skip_read_count_splitting.

usage: reademption coverage [-h] [--project_path PROJECT_PATH] [--unique_only]
                            [--normalize_by_uniquely] [--processes PROCESSES]
                            [--skip_read_count_splitting]
                            [--non_strand_specific]
                            [--coverage_style {global,first_base_only,last_base_only,centered}]
                            [--clip_length CLIP_LENGTH]
                            [--check_for_existing_files]

Named Arguments

--project_path, -f
 Path of the project folder.
--unique_only, -u
 

Use uniquely aligned reads only.

Default: False

--normalize_by_uniquely, -U
 

Normalize by the number of uniquely aligned reads. By default the normalization is done based on the total number of aligned reads even if only uniquely aligned reads are used for the coverage calculation.

Default: False

--processes, -p
 

Number of processes that should be used (default 1).

Default: 1

--skip_read_count_splitting, -s
 

Do not split the read counting between different alignings. Default is to do the splitting.

Default: False

--non_strand_specific, -d
 

Do not distict between the coverage of the forward and reverse strand but sum them to a single value for each base.

Default: False

--coverage_style, -b
 

Possible choices: global, first_base_only, last_base_only, centered

Select for coverage generation if only the first aligned base at the 5’ end of each read (‘first_base_only’) or the last aligned base at the 3’ end of each read (‘last_base_only’) is taken into account. The centered approach (‘centered’) clips a predefined number of nts from each alignment end and adds to the remaining genomic region a value divided by its length. By default the coverage is generated using the whole range of each alignment (‘global’).

Default: “global”

--clip_length, -cl
 

Number of nucleotides that are clipped from each alignment end for centered approach.

Default: 11

--check_for_existing_files, -F
 

Check for existing files (e.g. from a interrupted previous run) and do not overwrite them if they exits. Attention! You have to take care that there are no partially generated files left!

Default: False

gene_quanti

With gene_quanti the number of reads overlapping with each of the annotation entries is counted and the results are combined in tables. At least one GGF3 file with annotations has to be placed in input/annotations. The sequence ID of the sequenced must be precisely the same as the IDs used in the reference sequence FASTA files. To specify the feature classes (the third column in the GFF3 file e.g. CDS, gene, rRNA, tRNA) that should be quantified the parameter --features can be used. Otherwise countings for all annotation entries are generated. Per default sense and anti-sense overlaps are counted and separately listed. gene_quanti provides, besides the raw read countings, Transcripts per Million (TPM) normalized read counts, normalized by the total number of aligned reads of the given library (TNOAR), as well as Reads per Kilobase Million (RPKM) normalized read counts The results of the different counting methods are provided in separate CSV-tables.

usage: reademption gene_quanti [-h] [--project_path PROJECT_PATH]
                               [--min_overlap MIN_OVERLAP]
                               [--read_region {global,first_base_only,last_base_only,centered}]
                               [--clip_length CLIP_LENGTH]
                               [--no_count_split_by_alignment_no]
                               [--no_count_splitting_by_gene_no]
                               [--add_antisense] [--antisense_only]
                               [--non_strand_specific] [--processes PROCESSES]
                               [--features ALLOWED_FEATURES] [--unique_only]
                               [--pseudocounts] [--check_for_existing_files]

Named Arguments

--project_path, -f
 Path of the project folder.
--min_overlap, -o
 

Minimal read-annotation-overlap (in nt) (default 1).

Default: 1

--read_region, -b
 

Possible choices: global, first_base_only, last_base_only, centered

Select for gene-wise quantification if only the first aligned base at the 5’ end of each read (‘first_base_only’) or the last aligned base at the 3’ end of each read (‘last_base_only’) is taken into account. The centered approach (‘centered’) clips a predefined number of nts from each alignment end and calculates the overlap based on the remaining region. By default the overlap is calculated based on the whole range of each alignment (‘global’).

Default: “global”

--clip_length, -cl
 

Number of nucleotides that are clipped from each alignment end for centered approach.

Default: 11

--no_count_split_by_alignment_no, -n
 

Do not split read countings by the number of alignments a read has. By default this count splitting is performed.

Default: False

--no_count_splitting_by_gene_no, -l
 

Do not split read countings by the number of genes it overlaps with. By default this count splitting is performed.

Default: False

--add_antisense, -a
 

Count anti-sense and sense read-gene-overlaps and report them separately. By default only sense overlaps are counted.

Default: False

--antisense_only, -x
 

Count only anti-sense read-gene-overlaps and no sense overlaps. By default only sense overlaps are counted.

Default: False

--non_strand_specific
 

Use countings of reads overlapping with a gene on both strands and sum them up.

Default: False

--processes, -p
 

Number of processes that should be used (default 1).

Default: 1

--features, -t Comma separated list of features that should be considered (e.g. gene, cds, region, exon). Other feature will be skipped. If not specified all features will be considered.
--unique_only, -u
 

Use uniquely aligned reads only.

Default: False

--pseudocounts, -c
 

Add a pseudocount of 1 to each gene.

Default: False

--check_for_existing_files, -F
 

Check for existing files (e.g. from a interrupted previous run) and do not overwrite them if they exits. Attention! You have to take care that there are no partially generated files left!

Default: False

deseq

Differential gene expression can be performed using deseq which will run a DESeq2 analyses for all possible combinations of conditions. To allocated the conditions to the libraries use the --libs and --conditions parameters (e.g. --libs SamA_R1.fa,SamA_R2.fa,SamB_R1.fa,SamB_R2.fa --conditions SamA,SamA,SamB,SamB).

usage: reademption deseq [-h] [--project_path PROJECT_PATH] --libs LIBS
                         --conditions CONDITIONS [--cooks_cutoff_off]
                         [--fc_shrinkage_off]

Named Arguments

--project_path, -f
 Path of the project folder.
--libs, -l Comma separated list of libraries.
--conditions, -c
 Comma separated list of condition in the same order as their corresponding libraries.
--cooks_cutoff_off, -k
 Default: False
--fc_shrinkage_off, -so
 

turn off log2 fold change shrinkage

Default: False

viz_align

viz_align plots histograms of the read length distributions of the reads before and after the read clipping.

usage: reademption viz_align [-h] [--paired_end] [--project_path PROJECT_PATH]

Named Arguments

--paired_end, -P
 

Use this if reads are originating from a paired-end sequencing.

Default: False

--project_path, -f
 Path of the project folder.

viz_gene_quanti

viz_gene_quanti creates scatterplots in which the raw gene wise quantification values are compared for each library pair (all-against-all). For each comparison the pearson correllation (r) coefficiant is. Additionally, bar charts that visualize the distribution of the read counting of the different annotation classes are plotted.

usage: reademption viz_gene_quanti [-h] [--paired_end]
                                   [--project_path PROJECT_PATH]

Named Arguments

--paired_end, -P
 

Use this if reads are originating from a paired-end sequencing.

Default: False

--project_path, -f
 Path of the project folder.

viz_deseq

viz_deseq generates MA-plots of the comparison (log2 fold changes vs. the base mean) as well as volcano plots (log2 fold changes vs. p-values / adjusted p-values).

usage: reademption viz_deseq [-h] [--project_path PROJECT_PATH]
                             [--max_pvalue MAX_PVALUE]

Named Arguments

--project_path, -f
 Path of the project folder.
--max_pvalue

Maximum adjusted p-value for genes considered to be regulated. Genes with adjusted p-values below will be marked red. (default 0.05)

Default: 0.05