Python package for identifying genomic regions with siRNA enrichment. Takes a SCRAM2 alignment file and genome reference (FASTA) as inputs and writes enriched regions to file (FASTA).
Clone repository
pip install -e . from root directory
help(sir.extract_enriched_seqs)
extract_enriched_seqs(scram_alignment_file, reference_fa, output_fa, window=200, cutoff=30, abund_count=5, strand_ratio=0.2, padding=30)
scram_alignment_file: Output file of a single alignment length from the SCRAM2 aligner. Generally pick 21 or 22 nt files for siRNAs.
reference_fa: Genome reference file in FASTA format (the same reference as used by the SCRAM2 aligner)
output_fa: Output file to results identified regions in FASTA format
window: window size in nt to scan in. This will be automatically expanded if enriched windows are adjacent (within len window). Default = 200.
cutoff: Minimum alignment count for an siRNA to be included in identification. Default = 30.
abund_count: Minimum number of the siRNAs with an alignment count above cutoff for a window to be identified as an enriched region. Default = 5.
strand_ratio: Minimum ratio of abund_count of siRNAs on the lower abundance strand. Prevents identification of degraded single-stranded RNA regions. Default = 0.2.
padding: number of nucleotides to added to each end of the window for writing to file. Default = 30.