The human genome is made up of protein-coding and non-protein-coding sequences. Only 2% of the human genome encodes protein; the remaining 98% is non-coding. For a long time, this 98% was considered "junk DNA", but we now know that it encodes non-coding RNAs and contains repetitive elements such as endogenous retroviruses (ERVs), SINEs, LINEs, and DNA transposons. ERVs are viral sequences originating from exogenous retroviruses that integrated into the human genome approximately 40 million years ago, and they comprise 8% of the human genome (1). ERVs are transcribed and translated during HIV infection, multiple sclerosis, and breast cancer (2-4), implying that ERVs are active elements that interact with the host and have the potential to significantly impact host biology. Yet the impact of the vast majority of ERVs on host biology is poorly understood.
The biggest challenge in ERV research is the lack of tools to study them: ERVs are highly repetitive and present in multiple copies throughout the genome, and their annotation in existing databases is incomplete and lacks consensus. The RepeatMasker annotation contains the largest number of ERV elements, but it lacks many autonomous ERVs that have been identified experimentally or in silico. As a result, the RepeatMasker track on the UCSC Genome Browser does not include many of these autonomous ERVs. We believed that the field would benefit from a curated database consisting mostly of autonomous ERVs, which has so far been lacking.
Another challenge in ERV research is that, despite the wealth of deep sequencing data available, filtering criteria must be applied to account for the repetitive nature of these elements and to increase confidence that sequencing reads are mapped to the correct ERV loci. We therefore believed that a dedicated pipeline for analyzing RNA sequencing data was necessary to determine the transcriptome of ERVs (the ERVome) on a genome-wide scale.
Using ERVmap, we have uncovered unique cell-type-specific ERVome signatures, an elevated ERVome in patients with systemic lupus erythematosus (SLE), and many differentially expressed ERVs in breast cancer tissues. These data imply the involvement of ERVs in cell differentiation and/or cell fate determination and reveal a potential for ERVs to underlie disease (Tokuyama M. et al., PNAS 2018). We hope that the community will use ERVmap to discover previously unidentified roles for ERVs in a range of phenotypes.