

In the study of transposable elements (TEs), generating a high-confidence set of consensus sequences that represents the diversity of TEs found in a given genome is a key step towards investigating these fascinating genomic elements. Many algorithms and pipelines are available to automatically identify putative TE families present in a genome. Despite the availability of these valuable resources, producing a library of high-quality full-length TE consensus sequences largely remains a process of manual curation. This know-how is often passed on from mentor to mentee within research groups, making it difficult for those outside the field to access this highly specialised skill. Our manuscript attempts to fill this gap by providing a set of detailed computer protocols, software recommendations and video tutorials for those aiming to manually curate TEs. Detailed step-by-step protocols, aimed at the complete beginner, are presented in the Supplementary Methods.

The set of programs and tools presented here will make the process of manual curation achievable and accessible to all researchers, and in particular to those new to the field of TEs.

Transposable elements (TEs) are mobile genetic entities generally found in multiple copies in the genome. They are ubiquitous across life, highly diverse, and can occupy large proportions of many eukaryotic genomes; for example, ~50% of the human genome is derived from TEs. Despite their ubiquity, TEs have historically been understudied in genomic analyses, partly because of their incomplete representation in assemblies produced from short-read sequencing. However, with the wide-ranging importance of TE biology attracting greater recognition, and many more genomes now being assembled to high standards following the advent of long-read sequencing technologies, researchers are paying increasing attention to the repetitive fraction of genomes.

TE identification has become an intrinsic part of genome projects and, in line with this, many de novo and homology-based algorithms have been developed (refs.). Although these algorithms have dramatically improved our capacity to identify TEs and other genomic repetitive sequences, in most cases they lack the exactitude required for certain downstream applications. For example, while automated repeat identification may be sufficient for general repeat masking (e.g. prior to gene annotation), detailed analysis of TE diversity and evolution within a genome generally requires greater accuracy. In particular, because they aim to identify as many TE candidate sequences as possible, automated tools are often greedy and report a number of chimeras (i.e. a fusion of two distinct TEs) that may appear only once or twice in the genome.
