Generating UMI-exclusive and UMI+Sequence structured reads

tags: Resimpy

Introduction

resimpy_general is a module that simulates reads consisting of either a UMI alone or a UMI plus a genomic sequence. It is named for this general-purpose design. A typical command looks like this:

resimpy_general \
-r seq_errs \
-rs umi \
-perm_num 3 \
-umiup 1 \
-umiul 10 \
-umi_num 50 \
-seq_len 20 \
-pcr_num 8 \
-pcr_err 0.0001 \
-seq_err 0.0001 \
-ampl_rate 0.85 \
-sim_thres 3 \
-spl_rate 1 \
-seq_errs "1e-3;1e-2;0.1" \
-out_dir ./

The parameters are described below.

Parameter acronym	Full name	Function
r	recipe	the simulation recipe to run, i.e., which parameter is varied (seq_errs, pcr_errs, pcr_nums, umi_lens, or ampl_rates)
rs	read structure	e.g., umi+seq or umi
perm_num	permutation number	the number of in silico permutation tests
umiup	UMI unit pattern	1 for monomer blocks, 2 for dimer blocks, 3 for trimer blocks
umiul	UMI unit len fixed	the fixed length of a monomer UMI
umi_num	UMI number fixed	the fixed number of molecules/UMIs in the initial read pool
sim_thres	similarity threshold fixed	the minimum number of nucleotides that must differ between any two randomly generated UMIs
seq_len	sequence length	the length of a genomic sequence
pcr_num	PCR number/cycle	a fixed number of PCR cycles
pcr_err	PCR error	a fixed DNA polymerase error rate during PCR
seq_err	sequencing error	a fixed sequencing error rate
ampl_rate	amplification rate	the PCR amplification rate
spl_rate	subsampling rate	the subsampling rate used for sequencing
seq_errs	sequencing errors	sequencing error rates separated by semicolons, e.g., 1e-3;1e-2;0.1
pcr_errs	PCR errors	DNA polymerase error rates separated by semicolons, e.g., 1e-3;1e-2;0.1
pcr_nums	PCR numbers	PCR numbers separated by semicolons, e.g., 8;9;10;11;12
umi_lens	UMI lengths	UMI lengths separated by semicolons, e.g., 8;9;10;11;12
ampl_rates	amplification rates	amplification rates separated by semicolons, e.g., 0.1;0.2;0.3;0.4;0.5;0.6;0.7;0.8;0.9;1.0
out_dir	output directory	the directory where results are written
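To make umiup, umiul, umi_num, and sim_thres more concrete, here is a minimal Python sketch of how an initial UMI pool with these properties could be built. It is not Resimpy's implementation. It assumes that an n-mer block means each base of the monomer UMI is repeated n times, that the similarity threshold is a Hamming distance measured on the monomer UMIs, and that simple rejection sampling is used. The function names are hypothetical.

```python
import random

BASES = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def expand_units(monomer_umi, umiup):
    """Repeat every base umiup times (assumed meaning of 1/2/3 = monomer/dimer/trimer)."""
    return "".join(base * umiup for base in monomer_umi)

def generate_umis(umi_num, umiul, umiup, sim_thres, seed=0, max_tries=100_000):
    """Rejection-sample umi_num UMIs whose monomers pairwise differ in >= sim_thres positions."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == umi_num:
            break
        candidate = "".join(rng.choice(BASES) for _ in range(umiul))
        # Keep the candidate only if it is far enough from every accepted UMI.
        if all(hamming(candidate, u) >= sim_thres for u in accepted):
            accepted.append(candidate)
    if len(accepted) < umi_num:
        raise RuntimeError("could not find enough sufficiently distinct UMIs")
    return [expand_units(u, umiup) for u in accepted]

# Values from the example command, but with trimer blocks (umiup=3).
umis = generate_umis(umi_num=50, umiul=10, umiup=3, sim_thres=3)
print(len(umis), len(umis[0]))  # 50 30
```

With umiul 10 and umiup 3, every UMI in this sketch is 30 nt long, which is why umiul is described as the length of a monomer UMI.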

Because -rs is set to umi, each read contains only one UMI. If -rs is set to umi+seq, each read contains one UMI and one genomic sequence. In each permutation test, reads are generated under one varying parameter (here seq_errs), while every other parameter keeps its fixed value (e.g., pcr_num). The fixed seq_err is not applied because seq_errs is supplied, so the reads can be examined across the values of this single varying parameter. This is a one-factor-at-a-time experimental design. Similarly, for pcr_errs, pcr_nums, umi_lens, and ampl_rates, the CLIs look like this:

Reads changing with PCR errors

resimpy_general -r pcr_errs -rs umi+seq -perm_num 3 -umiup 1 -umiul 10 -umi_num 50 -seq_len 20 -pcr_num 8 -pcr_err 0.0001 -seq_err 0.0001 -ampl_rate 0.85 -sim_thres 3 -spl_rate 1 -pcr_errs "1e-3;1e-2;0.1" -out_dir ./

Reads changing with amplification rates

resimpy_general -r ampl_rates -rs umi+seq -perm_num 3 -umiup 1 -umiul 10 -umi_num 50 -seq_len 20 -pcr_num 8 -pcr_err 0.0001 -seq_err 0.0001 -ampl_rate 0.85 -sim_thres 3 -spl_rate 1 -ampl_rates "0.1;0.2;0.3;0.4;0.5;0.6;0.7;0.8;0.9;1.0" -out_dir ./

Reads changing with PCR numbers

resimpy_general -r pcr_nums -rs umi+seq -perm_num 3 -umiup 1 -umiul 10 -umi_num 50 -seq_len 20 -pcr_num 8 -pcr_err 0.0001 -seq_err 0.0001 -ampl_rate 0.85 -sim_thres 3 -spl_rate 1 -pcr_nums "6;7;8;9;10;11;12;13;14" -out_dir ./
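The amplification rate and the PCR number together determine how large the read pool grows, which is worth estimating before running long simulations. The sketch below is a back-of-the-envelope calculation, not Resimpy's code. It assumes a common PCR model in which each molecule is copied with probability ampl_rate in every cycle, so the expected pool grows by a factor of (1 + ampl_rate) per cycle, and that spl_rate is the fraction of the final pool that is sequenced. The function names are hypothetical.

```python
def expected_pool_size(umi_num, pcr_num, ampl_rate):
    """Expected number of molecules after pcr_num cycles.

    Assumption: each molecule is copied with probability ampl_rate per cycle,
    so the expected pool size is multiplied by (1 + ampl_rate) every cycle.
    """
    return umi_num * (1 + ampl_rate) ** pcr_num

def expected_sequenced_reads(umi_num, pcr_num, ampl_rate, spl_rate):
    """Expected number of reads kept after subsampling (assumed: a fraction spl_rate)."""
    return spl_rate * expected_pool_size(umi_num, pcr_num, ampl_rate)

# Growth of the example pool (50 UMIs, ampl_rate 0.85) over the PCR numbers used above.
for pcr_num in (6, 8, 10, 12, 14):
    print(pcr_num, round(expected_pool_size(50, pcr_num, 0.85)))
```

Under this assumption, the fixed settings on this page (50 UMIs, 8 cycles, amplification rate 0.85) give about 6,860 molecules in expectation, and every additional cycle multiplies that by 1.85.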

Reads changing with UMI lengths

resimpy_general -r umi_lens -rs umi+seq -perm_num 3 -umiup 1 -umiul 10 -umi_num 50 -seq_len 20 -pcr_num 8 -pcr_err 0.0001 -seq_err 0.0001 -ampl_rate 0.85 -sim_thres 3 -spl_rate 1 -umi_lens "6;7;8;9;10;11;12" -out_dir ./
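All of the commands above share the same fixed parameters and differ only in the recipe and its semicolon-separated list. The following Python sketch shows one way to express that one-factor design and to build the command lines programmatically. It uses only the flags shown on this page. The helper names and the mapping from each list to the fixed parameter it replaces (for example, umi_lens replacing umiul) are assumptions made for illustration, inferred by analogy with the seq_errs/seq_err case described above.

```python
import shlex

# Fixed parameters shared by the commands on this page.
FIXED = {
    "rs": "umi+seq", "perm_num": 3, "umiup": 1, "umiul": 10, "umi_num": 50,
    "seq_len": 20, "pcr_num": 8, "pcr_err": 0.0001, "seq_err": 0.0001,
    "ampl_rate": 0.85, "sim_thres": 3, "spl_rate": 1,
}

# Varying list -> fixed parameter it is assumed to replace.
OVERRIDES = {
    "seq_errs": "seq_err", "pcr_errs": "pcr_err", "pcr_nums": "pcr_num",
    "umi_lens": "umiul", "ampl_rates": "ampl_rate",
}

def parse_values(text):
    """Split a semicolon-separated list such as '1e-3;1e-2;0.1' into floats."""
    return [float(v) for v in text.split(";") if v.strip()]

def conditions(recipe, values_text, fixed=FIXED):
    """One parameter set per value of the varying parameter; all else stays fixed."""
    target = OVERRIDES[recipe]
    return [{**fixed, target: value} for value in parse_values(values_text)]

def build_command(recipe, values_text, fixed=FIXED, out_dir="./"):
    """Argument list for resimpy_general, in the same order as the commands above."""
    args = ["resimpy_general", "-r", recipe]
    for name, value in fixed.items():
        args += [f"-{name}", str(value)]
    return args + [f"-{recipe}", values_text, "-out_dir", out_dir]

cmd = build_command("pcr_nums", "6;7;8")
print(shlex.join(cmd))  # shell-safe form; the semicolon list is quoted
for cond in conditions("pcr_nums", "6;7;8"):
    print(cond["pcr_num"], cond["pcr_err"])  # pcr_num varies, pcr_err stays fixed
```

If resimpy_general is installed, the argument list can be passed to subprocess.run(cmd, check=True). Because no shell is involved in that call, the semicolons in the list need no quoting there.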