A CASE STUDY INTO PHAGE GENOME ASSEMBLY: ERWINIA AMYLOVORA PHAGE KEY
Zlatohurska M.
D.K. Zabolotny Institute of Microbiology and Virology of the NAS of Ukraine,
Department of bacteriophage molecular genetics
e-mail: zlatohurska@gmail.com
Phage genomes contain complex structures (long direct or inverted repeats, terminal redundancies) that complicate assembly of the whole-genome sequence from reads (Klumpp et al., 2012). This paper describes the genome assembly steps using Erwinia amylovora phage KEY as an example. The aim was to obtain the most reliable genome assembly to serve as the basis for downstream genome analyses.
Phage KEY was isolated from quince with symptoms of fire blight. Virion DNA was obtained by the phenol-chloroform extraction method. Sequencing was performed using the Illumina HiSeq 2500 platform at The Centre for Applied Genomics in the Hospital for Sick Children, Toronto, Canada.
Quality control of the raw Illumina data was performed using FastQC v0.11.9 (Andrews, 2010). The sequencing run yielded a total of 14,790,000 raw reads, each 126 bp long. The FastQC results indicated that the minimum read quality value was above Q30 (error probability 0.001). Additionally, a high sequence duplication level (between 10 and 100 copies) was detected for almost 80% of the sequences, and several overrepresented sequences were found. Next, the set of paired-end reads was processed with Trimmomatic (Bolger et al., 2014) to remove unpaired reads.
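The relation between a Phred quality value and the error probability quoted above (Q30 corresponds to 0.001) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the Phred+33 ASCII offset used for decoding a FASTQ quality string is an assumption; it is not stated in this abstract.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score Q to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability P back to a Phred score: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def min_quality(quality_string, offset=33):
    """Return the lowest Phred score in a FASTQ quality string.

    The offset of 33 (Phred+33 encoding) is an assumption about the data.
    """
    return min(ord(ch) - offset for ch in quality_string)
```

For example, `phred_to_error_probability(30)` evaluates to approximately 0.001, i.e. one expected error per 1,000 base calls.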
De novo assembly was performed using Velvet v7.0.4 (Zerbino et al., 2008), Unicycler v0.4.9b (Wick et al., 2017) and SPAdes v3.13.1 (Bankevich et al., 2012). Assembly statistics were assessed at the contig level using QUAST v5.0.2 (Gurevich et al., 2013). Assembler outputs were compared with regard to total assembly length, number of contigs, N50, NG50, and other metrics. None of the assemblers generated a single contig. The SPAdes output was considered the best choice for downstream manipulation.
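For readers unfamiliar with the N50 and NG50 metrics, the following Python sketch shows how they are commonly computed from a list of contig lengths. It is an illustration only, not the QUAST implementation; the function name and the example lengths are hypothetical.

```python
def nx(contig_lengths, fraction=0.5, reference_length=None):
    """Return the Nx (or NGx) value for a set of contig lengths.

    Contigs are sorted from longest to shortest and their lengths are summed
    until the running total reaches `fraction` of the target length. The
    length of the contig that crosses this threshold is returned.

    With reference_length=None the target is the total assembly length (N50
    for fraction=0.5). With an expected genome length supplied, the target is
    that length instead (NG50). Returns None if the threshold is never reached.
    """
    total = reference_length if reference_length is not None else sum(contig_lengths)
    threshold = total * fraction
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None


# Hypothetical contig lengths, not taken from the KEY assemblies.
lengths = [80, 70, 50, 40, 30, 20]
n50 = nx(lengths)                          # half of the 290 bp assembly
ng50 = nx(lengths, reference_length=400)   # half of a 400 bp expected genome
```

Because NG50 is computed against an expected genome length rather than the assembly length, it allows assemblies of different total size to be compared on the same scale.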
At the assembly graph level, the topological structure of the assembly was analyzed with Bandage v.8.1 (Wick et al., 2015). Many pairs of contigs with high sequence identity but different coverage values were found. Together with the overrepresented and duplicated sequences, this indicates that a mixture of highly related genomes was sequenced.
Using BLASTn, Pantoea phage vB_PagS_AAS21 (MK770119.1) was identified as the closest genome to KEY. SPAdes contigs were mapped onto the AAS21 genome as a reference sequence to identify low-coverage regions and possible sites of misassembly. As a result, a final contig of 118,944 bp suitable for further genome analyses was obtained.
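The abstract does not specify how low-coverage regions were located. As one possible illustration, the Python sketch below scans a list of per-base depth values along the reference (such as could be exported by a tool like samtools depth, which is an assumption) and reports runs of positions below a chosen threshold. The function name, the threshold and the example depths are hypothetical.

```python
def low_coverage_regions(depth, min_depth):
    """Return half-open, 0-based (start, end) intervals where depth < min_depth.

    `depth` is a sequence of per-base coverage values along the reference.
    """
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d < min_depth:
            if start is None:
                start = pos              # a low-coverage run begins here
        elif start is not None:
            regions.append((start, pos))  # the run ended at the previous base
            start = None
    if start is not None:
        regions.append((start, len(depth)))  # run extends to the reference end
    return regions


# Hypothetical depth profile, not data from the KEY mapping.
example_depth = [50, 48, 3, 2, 0, 45, 47, 1, 1]
regions = low_coverage_regions(example_depth, min_depth=10)
```

Regions reported this way are only candidates for inspection; a drop in depth may reflect a real difference from the reference rather than a misassembly.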