Generating UMI-exclusive and UMI+Sequence structured reads

tags: Resimpy

Introduction

resimpy_general is a module that simulates reads consisting either of a UMI only or of a UMI plus a genomic sequence. The module takes its name from this general-purpose design. A typical command line looks like this:

resimpy_general \
-r seq_errs \
-rs umi \
-perm_num 3 \
-umiup 1 \
-umiul 10 \
-umi_num 50 \
-seq_len 20 \
-pcr_num 8 \
-pcr_err 0.0001 \
-seq_err 0.0001 \
-ampl_rate 0.85 \
-sim_thres 3 \
-spl_rate 1 \
-seq_errs "1e-3;1e-2;0.1" \
-out_dir ./
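
When the command is launched from a script rather than typed into a shell, passing the arguments as a list sidesteps any shell interpretation of the semicolons in -seq_errs. The following is a minimal sketch, not part of Resimpy: the helper name build_command is hypothetical, it uses only the flags shown above, and it assumes resimpy_general is available on the PATH.

```python
import shlex


def build_command(recipe, read_structure, fixed, varying_values, out_dir="./"):
    """Assemble the argument list for one resimpy_general run (hypothetical helper)."""
    cmd = ["resimpy_general", "-r", recipe, "-rs", read_structure]
    for flag, value in fixed.items():
        cmd += [f"-{flag}", str(value)]
    # In the examples of this page the varying flag has the same name as the recipe.
    cmd += [f"-{recipe}", ";".join(str(v) for v in varying_values)]
    cmd += ["-out_dir", out_dir]
    return cmd


# Fixed parameters copied from the example command above.
FIXED = {
    "perm_num": "3", "umiup": "1", "umiul": "10", "umi_num": "50",
    "seq_len": "20", "pcr_num": "8", "pcr_err": "0.0001", "seq_err": "0.0001",
    "ampl_rate": "0.85", "sim_thres": "3", "spl_rate": "1",
}

cmd = build_command("seq_errs", "umi", FIXED, ["1e-3", "1e-2", "0.1"])

# Print a shell-safe rendering; shlex.join quotes the semicolon-separated list.
print(shlex.join(cmd))

# To actually run it (requires Resimpy to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Because the arguments are handed over as a list, no quoting is needed in the list itself; quoting only matters when the command is pasted into a shell.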

The parameters are described below.

| Parameter acronym | Full name | Function |
| --- | --- | --- |
| r | recipe | specifies which module to run for your requirement |
| rs | read structure | e.g., umi+seq or umi |
| perm_num | permutation number | the number of in silico tests |
| umiup | UMI unit pattern | 1 for monomer blocks, 2 for dimer blocks, 3 for trimer blocks |
| umiul | UMI unit length (fixed) | the fixed length of a monomer UMI |
| umi_num | UMI number (fixed) | the fixed number of molecules/UMIs to be initiated in the initial read pool |
| sim_thres | similarity threshold (fixed) | the minimum number of nucleotides that must differ between any two randomly generated UMIs |
| seq_len | sequence length | the length of a genomic sequence |
| pcr_num | PCR number/cycle | a fixed number of PCR cycles |
| pcr_err | PCR error | a fixed DNA polymerase error rate during PCR |
| seq_err | sequencing error | a fixed sequencing error rate |
| ampl_rate | amplification rate | PCR amplification rate |
| spl_rate | subsampling rate | subsampling rate used for sequencing |
| seq_errs | sequencing errors | sequencing error rates separated by semicolons, e.g., 1e-3;1e-2;0.1 |
| pcr_errs | PCR errors | DNA polymerase error rates separated by semicolons, e.g., 1e-3;1e-2;0.1 |
| pcr_nums | PCR numbers | PCR cycle numbers separated by semicolons, e.g., 8;9;10;11;12 |
| umi_lens | UMI lengths | UMI lengths separated by semicolons, e.g., 8;9;10;11;12 |
| ampl_rates | amplification rates | amplification rates separated by semicolons, e.g., 0.1;0.2;0.3;0.4;0.5;0.6;0.7;0.8;0.9;1.0 |
| out_dir | output directory | the directory where results are written |
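
To make the roles of umi_num, umiul, umiup, and sim_thres concrete, here is a minimal sketch of how an initial UMI pool could be drawn. It is not Resimpy's implementation. It assumes that sim_thres is a minimum Hamming distance checked on the monomer UMIs, and that a dimer or trimer block repeats each nucleotide 2 or 3 times; the function names and the seed and max_tries arguments are hypothetical.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def generate_umis(umi_num, umiul, umiup=1, sim_thres=3, seed=0, max_tries=100_000):
    """Draw umi_num random UMIs of monomer length umiul (illustrative only).

    Assumption: every pair of monomer UMIs must differ in at least
    sim_thres positions. Candidates that are too close are rejected.
    """
    rng = random.Random(seed)
    umis = []
    tries = 0
    while len(umis) < umi_num:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("too many rejections; lower sim_thres or raise umiul")
        candidate = "".join(rng.choice("ACGT") for _ in range(umiul))
        if all(hamming(candidate, u) >= sim_thres for u in umis):
            umis.append(candidate)
    # Assumption: umiup expands each base into a block of identical bases,
    # e.g. umiup=3 turns "AC" into "AAACCC".
    return ["".join(base * umiup for base in u) for u in umis]


umis = generate_umis(umi_num=50, umiul=10, umiup=1, sim_thres=3)
print(umis[:3])
```

Rejection sampling is used here only because it is short; with 50 UMIs of length 10 and a threshold of 3, rejections are rare.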

Because -rs is set to umi in the example above, each read contains only one UMI. If -rs is set to umi+seq, each read contains one UMI and one genomic sequence. In each permutation test, reads are generated from one varying parameter (here seq_errs) together with all fixed parameters (such as pcr_num) except the fixed counterpart of the varying one. In this case seq_err is not applied because seq_errs is declared, so the reads can be examined across the varying sequencing error rates. This amounts to a one-factor-at-a-time experimental design. Similarly, for pcr_errs, pcr_nums, umi_lens, and ampl_rates, the command lines look like this:
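
The one-factor design described above can be sketched as a small parameter grid. This is an illustration rather than Resimpy's own code: the function name one_factor_runs and the OVERRIDES mapping are hypothetical, and the pairing of umi_lens with umiul is an assumption (the text only states the seq_errs/seq_err pairing explicitly).

```python
# Assumed mapping from each varying parameter to the fixed parameter it replaces.
OVERRIDES = {
    "seq_errs": "seq_err",
    "pcr_errs": "pcr_err",
    "pcr_nums": "pcr_num",
    "umi_lens": "umiul",      # assumption: UMI lengths replace the fixed unit length
    "ampl_rates": "ampl_rate",
}


def one_factor_runs(fixed, recipe, values, perm_num):
    """Return one parameter set per (permutation, varying value) pair."""
    target = OVERRIDES[recipe]
    runs = []
    for perm in range(perm_num):
        for value in values:
            params = dict(fixed)    # copy so the fixed settings stay untouched
            params[target] = value  # the varying value replaces its fixed counterpart
            runs.append({"perm": perm, **params})
    return runs


fixed = {
    "umiup": 1, "umiul": 10, "umi_num": 50, "seq_len": 20, "pcr_num": 8,
    "pcr_err": 0.0001, "seq_err": 0.0001, "ampl_rate": 0.85,
    "sim_thres": 3, "spl_rate": 1,
}

runs = one_factor_runs(fixed, "seq_errs", [1e-3, 1e-2, 0.1], perm_num=3)
for run in runs[:3]:
    print(run["perm"], run["seq_err"], run["pcr_num"])
```

Each permutation repeats the full list of varying values, so three permutations and three error rates give nine parameter sets, all sharing the same pcr_num, pcr_err, and so on.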

Reads changing with PCR errors

resimpy_general -r pcr_errs -rs umi+seq -perm_num 3 -umiup 1 -umiul 10 -umi_num 50 -seq_len 20 -pcr_num 8 -pcr_err 0.0001 -seq_err 0.0001 -ampl_rate 0.85 -sim_thres 3 -spl_rate 1 -pcr_errs "1e-3;1e-2;0.1" -out_dir ./

Reads changing with amplification rates

resimpy_general -r ampl_rates -rs umi+seq -perm_num 3 -umiup 1 -umiul 10 -umi_num 50 -seq_len 20 -pcr_num 8 -pcr_err 0.0001 -seq_err 0.0001 -ampl_rate 0.85 -sim_thres 3 -spl_rate 1 -ampl_rates "0.1;0.2;0.3;0.4;0.5;0.6;0.7;0.8;0.9;1.0" -out_dir ./

Reads changing with PCR numbers

resimpy_general -r pcr_nums -rs umi+seq -perm_num 3 -umiup 1 -umiul 10 -umi_num 50 -seq_len 20 -pcr_num 8 -pcr_err 0.0001 -seq_err 0.0001 -ampl_rate 0.85 -sim_thres 3 -spl_rate 1 -pcr_nums "6;7;8;9;10;11;12;13;14" -out_dir ./
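
When choosing a range for pcr_nums, it helps to have a rough idea of how the read pool grows with the cycle count. The sketch below is a back-of-the-envelope estimate, not Resimpy's model: it assumes that in every cycle each molecule is copied with probability ampl_rate, so the expected pool size grows by a factor of (1 + ampl_rate) per cycle, and that spl_rate is the fraction of the final pool kept for sequencing. The function name expected_reads is hypothetical.

```python
def expected_reads(umi_num, pcr_num, ampl_rate, spl_rate=1.0):
    """Expected number of sequenced reads under the simple growth model above."""
    amplified = umi_num * (1 + ampl_rate) ** pcr_num  # expected pool after PCR
    return amplified * spl_rate                       # fraction kept by subsampling


# Expected read counts for some of the cycle numbers used in the command above.
for cycles in (6, 8, 10, 12, 14):
    print(cycles, round(expected_reads(umi_num=50, pcr_num=cycles, ampl_rate=0.85)))
```

Under this assumption the pool grows geometrically, so the largest cycle numbers in the list dominate both run time and output size.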

Reads changing with UMI lengths

resimpy_general -r umi_lens -rs umi+seq -perm_num 3 -umiup 1 -umiul 10 -umi_num 50 -seq_len 20 -pcr_num 8 -pcr_err 0.0001 -seq_err 0.0001 -ampl_rate 0.85 -sim_thres 3 -spl_rate 1 -umi_lens "6;7;8;9;10;11;12" -out_dir ./