To analyze the LAUP sequence characteristics of the 2019-nCoV genome, we use JBLA (Jellyfish-based LAUPs analysis application) to compute the underrepresented sequences and common LAUPs.
As a single-stranded, positive-sense RNA virus, 2019-nCoV follows all the molecular rules of the RNA world. Two of the primary rules are the use of U as a nucleobase, instead of T as in DNA, and the formation of secondary structure, mostly intramolecular, by the single-stranded RNA molecule. To apply the k-mer and LAUP concepts to 2019-nCoV data analysis, we first convert the full sequence of the virus into k-mers of various lengths, and subsequently look for LAUPs by comparing them with two k-mer pools: one contains random k-mers generated within limited G+C and A+G content windows, and the other is generated from all unique, high-quality RNA virus genomes (grouped by the four Baltimore classes if necessary). In this protocol, LAUPs comprise sequence-derived permutations that are excluded by RNA viruses; in doing so, we are able not only to avoid the impact of genome sequencing errors but also to discover sequences that are negatively selected by viral populations. Virus-infected human and animal individuals and populations often serve as hosts in which viruses select for their best fitness by forming quasispecies, from which deleterious mutations are excluded as viral fitness evolves. In this process, LAUPs as a set of sequences are subject to selection in terms of secondary structures and as targets of cellular RNA surveillance and interaction systems (such as RNA degradation and miRNA targeting), which is complementary to the selection acting on protein sequences. These analyses of LAUPs can help us improve the accuracy of genome sequences and phylogenetic analyses, and advance our understanding of viral biology and host pathophysiology.
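
The k-mer conversion and pool comparison described above can be illustrated with a minimal Python sketch. It is not the JBLA implementation: a plain `collections.Counter` stands in for Jellyfish, and the function names (`count_kmers`, `random_pool`, `candidate_laups`), the rejection-sampling scheme, and the rule that a candidate is a pool k-mer with zero occurrences in the genome are all assumptions made for illustration. The virus-derived pool is not shown; it could be built by applying `count_kmers` to each reference genome.

```python
import random
from collections import Counter


def to_rna(seq):
    """Upper-case a sequence and write it in the RNA alphabet (T -> U)."""
    return seq.upper().replace("T", "U")


def count_kmers(seq, k):
    """Count every overlapping k-mer in the sequence (stand-in for Jellyfish)."""
    seq = to_rna(seq)
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def fraction(kmer, bases):
    """Fraction of positions in kmer occupied by any of the given bases."""
    return sum(kmer.count(b) for b in bases) / len(kmer)


def random_pool(k, n, gc_window, ag_window, seed=0):
    """Draw up to n distinct random k-mers whose G+C and A+G fractions
    fall inside the given inclusive (low, high) windows.

    Rejection sampling with an attempt cap is an assumed scheme, chosen
    only so that the loop always terminates.
    """
    rng = random.Random(seed)
    pool = set()
    attempts = 0
    while len(pool) < n and attempts < 1000 * n:
        attempts += 1
        kmer = "".join(rng.choice("ACGU") for _ in range(k))
        gc_ok = gc_window[0] <= fraction(kmer, "GC") <= gc_window[1]
        ag_ok = ag_window[0] <= fraction(kmer, "AG") <= ag_window[1]
        if gc_ok and ag_ok:
            pool.add(kmer)
    return pool


def candidate_laups(genome_counts, pool):
    """Pool k-mers that never occur in the genome (assumed criterion)."""
    return sorted(kmer for kmer in pool if genome_counts[kmer] == 0)
```

For example, `candidate_laups(count_kmers(genome, 8), random_pool(8, 10000, (0.3, 0.5), (0.4, 0.6)))` would list the sampled 8-mers missing from `genome`; the window bounds and k used here are placeholder values, not parameters taken from the protocol.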