Adriano de Bernardi Schneider (1) & Denis Jacob Machado (2)
(1) Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, USA;
(2) Laboratório de Anfíbios, Universidade de São Paulo, São Paulo, SP, Brazil.

We created a broad dataset of 133 Flaviviridae genomes available on NCBI (Flavivirus, plus 16 outgroup sequences from Pegivirus, Pestivirus, and Hepacivirus) and developed an annotation pipeline for these viruses, comprising:
- prediction of putative protein coding sequences with GeneWise using reference protein sequences from UniProt and NCBI;
- validation of best matches using TransDecoder, BlastP, and hmmscan;
- pooling orthologous loci for translation alignment using PAM250;
- removal of outliers through the Tukey method.
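The outlier-removal step listed above can be illustrated with a minimal sketch of Tukey's fences. The abstract does not say which quantity was screened, so this example assumes sequence length per locus; the function names, the sequence identifiers, the lengths, and the conventional multiplier k = 1.5 are all hypothetical choices made for illustration.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers."""
    # Quartiles computed with the "inclusive" method (Python 3.8+).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(length_by_id, k=1.5):
    """Keep only the entries whose value lies within the Tukey fences."""
    lower, upper = tukey_fences(list(length_by_id.values()), k)
    return {
        seq_id: length
        for seq_id, length in length_by_id.items()
        if lower <= length <= upper
    }


# Hypothetical predicted lengths (amino acids) for one locus.
example_lengths = {
    "seq_a": 495,
    "seq_b": 500,
    "seq_c": 501,
    "seq_d": 498,
    "seq_e": 503,
    "seq_f": 120,  # a truncated prediction, far below the others
}

kept = remove_outliers(example_lengths)
print(sorted(kept))
```

In this toy input the truncated prediction falls below the lower fence and is dropped, while the five sequences of similar length are retained.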
Stopping to compare apples and oranges: a homology-based phylogeny of Flaviviridae
Research on specific flaviviruses such as West Nile virus and dengue virus (DENV) has been intense over the past couple of decades, whereas other flaviviruses garnered little attention until recently. For example, scientists neglected Zika virus (ZIKV) until the 2015 outbreaks in the Pacific, the Americas, and Southeast Asia, which were later linked to severe neuropathology. We therefore argue that an updated and comprehensive phylogeny of flaviviruses, based on annotated genomes and emphasizing strong homology statements, can be a useful tool for epidemiologists and virologists.

We joined genes and aligned them with different algorithms (ClustalW, MAFFT, MUSCLE, and Geneious translation-based alignment). We also aligned non-annotated genomes to assess the effect of data partitioning and observed that, without annotation, different genes align together, generating several false hypotheses of homology. We assessed the effects of increasing sampling, adding outgroup sequences, partitioning the dataset, and analyzing the data under different optimality criteria for phylogenetic analysis (maximum likelihood, Bayesian inference, and parsimony). Some clades, such as the clade encompassing all tick-borne viruses and the clade of mammalian tick-borne viruses, depend on annotation and on the presence of outgroup sequences. Increasing sampling allowed us to visualize missing links within the phylogeny of Flavivirus and to recognize that previously non-monophyletic groups such as ZIKV + DENV form a clade. The results converge more strongly across optimality criteria than between annotated and non-annotated datasets or between datasets with and without outgroup sequences, indicating that annotation is critical in the phylogenomics of Flaviviridae. Furthermore, our results show that annotating Flaviviridae genomes is attainable in reasonable time and with limited resources, minimizes spurious homology statements, and leads to significant changes in tree topology.
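The abstract does not state how agreement between trees was measured. As one possible illustration, the sketch below computes the Robinson-Foulds distance, a common count of the bipartitions found in one tree but not the other. Representing trees as nested tuples, the function names, and the taxon labels A to E are assumptions made for this example; the two toy topologies are not results from this study.

```python
def leaf_sets(tree, collected):
    """Return the leaves under `tree`; record each internal node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, collected) for child in tree))
    collected.append(leaves)
    return leaves


def bipartitions(tree):
    """Return the non-trivial splits of a tree, ignoring the root position."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    reference = min(all_leaves)  # fixed taxon used to pick one side of a split
    splits = set()
    for clade in clades:
        # Always store the side that does NOT contain the reference taxon.
        side = all_leaves - clade if reference in clade else clade
        # Skip trivial splits (empty, a single leaf, or all but one leaf).
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
    return splits


def rf_distance(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Two hypothetical topologies on the same five taxa.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))

print(rf_distance(tree_1, tree_1))
print(rf_distance(tree_1, tree_2))
```

A distance of zero means the two trees imply the same set of groupings; larger values mean more groupings are unique to one tree. Both trees must contain exactly the same taxa for the comparison to be meaningful.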