The Key Terms section of this guide provides definitions and resources for helpful terminology relating to the subjects discussed.
Term In Bolded Blue - Each term is linked to the original source of its definition, which also provides access to additional resources. Clicking the link opens a new window that highlights where the definition appears in the source text.
Definition - The definitions have been copied directly from the sources they were selected from. The initial sources and related attributions can be found by clicking on the linked term.
Continuous (or 'contiguous') sequences produced in a de novo assembly, free of any gaps.
Regions with an excess or deficiency in the number of sequence reads originating as a result of platform differences in sequence chemistry, amplification or cloning.
The action of constructing the sequence of a genome from overlapping DNA sequences without guidance from a reference genome.
The average number of reads covering a particular base in the sequence being assembled.
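As a rough illustration of the definition above, average coverage can be estimated by dividing the total number of sequenced bases by the length of the sequence being assembled. The sketch below assumes every read base aligns to the sequence. The function name `mean_coverage` and the example numbers are hypothetical and do not come from any particular tool.

```python
def mean_coverage(read_lengths, assembly_length):
    """Estimate average depth as total read bases / assembly length.

    Simplifying assumption: every base of every read aligns to the assembly.
    """
    if assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    return sum(read_lengths) / assembly_length


# Hypothetical example: 1,000 reads of 150 bp over a 5,000 bp sequence.
print(mean_coverage([150] * 1000, 5000))  # 30.0
```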
A set of DNA fragments of approximately the same length that are paired-end sequenced.
Portions of chromosomes that stain densely, are typically gene poor and are rich in satellite sequences.
Variants that are insertions and deletions of sequence, typically 1 to 49 bp in size.
Strings of k consecutive letters extracted from a longer sequence, such as a read or a reference assembly.
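A minimal sketch of this definition: sliding a window of width k along a sequence yields every k-mer it contains. The function name `kmers` and the example sequence are illustrative only.

```python
def kmers(sequence, k):
    """Return all overlapping substrings of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 k-mers (none if k > n).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("GATTACA", 3))  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```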
A collection of paired-end or mate-pair reads derived from DNA fragments with a tightly controlled size range.
Data from a pair of reads sequenced from the same circularized DNA fragment. The circularization step allows for larger fragment sizes to be used. They provide the same information as paired-end reads to the assembler.
A general term for a form of DNA sequencing that measures trace signals from millions to hundreds of millions of amplified sequences at once, most frequently referring to sequencing produced by Illumina, Life Technologies and Complete Genomics platforms. Often referred to as next-generation or second-generation sequencing to distinguish it from long-read sequencing approaches (for example, single-molecule sequencing), which are sometimes referred to as third-generation sequencing.
Regions that have been incorrectly closed in a genome assembly despite additional sequences being present at these sites in the source genome.
A summary measure of read length distribution: 50% of the bases in the reads are in reads longer than the N50 value. Similarly, for de novo assemblies, 50% of the bases in the assembled contigs are in contigs longer than the N50 value.
A statistic in genomics defined as the length of the shortest contig such that contigs of that length or greater make up at least half of the total length of the assembly. It is commonly used as a metric to summarize the contiguity of an assembly.
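One common way to compute N50, sketched here under the assumption that contig (or read) lengths are available as a plain list of integers: sort the lengths from longest to shortest and accumulate them until the running total reaches half of the overall length. The function name `n50` and the example lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or read lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest to the shortest sequence.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where the running total covers half of all bases.
        if running * 2 >= total:
            return length


# Total length is 290; 80 + 70 = 150 is the first running total to reach 145.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```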
The relationship between two reads, the ends of which have highly similar sequences. The minimum length allowed for the corresponding sequence is an important parameter in assembly.
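The role of the minimum overlap length can be illustrated with a simplified sketch. It assumes exact matching between the end of one read and the start of another, whereas real assemblers tolerate some mismatches ("highly similar" rather than identical sequences). The function name `suffix_prefix_overlap` and its `min_length` parameter are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_length` bases exists.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    longest_possible = min(len(a), len(b))
    # Try the longest candidate overlap first, then shrink.
    for n in range(longest_possible, min_length - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


print(suffix_prefix_overlap("TTACGT", "CGTACC"))                # 3
print(suffix_prefix_overlap("TTACGT", "CGTACC", min_length=4))  # 0
```

Raising `min_length` discards short overlaps that may have arisen by chance, at the cost of missing some genuine ones.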
Two reads sequenced from opposite ends of the same fragment.
Data from a pair of reads sequenced from ends of the same DNA fragment. The genomic distance between the reads is approximately known and is used to constrain assembly solutions.
The assignment of genetic variants or alleles to one of two homologous chromosomes.
The fraction of query variants in the benchmark regions that match the benchmark variants, or true positives/(true positives + false positives).
Aligning a given read to a reference.
Small sequence fragments from larger molecules generated by a given sequencing technology; the length can range from 100 bp to >1 million bp, depending on the sequencing method.
The fraction of benchmark variants that are matched by query variants, or true positives/(true positives + false negatives).
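The precision and recall formulas given in this list can be written out directly. The sketch below assumes the true-positive, false-positive and false-negative counts have already been obtained by comparing query variants against a benchmark. The function names and the example counts are illustrative.

```python
def precision(true_positives, false_positives):
    """Fraction of query variants that match the benchmark: TP / (TP + FP)."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no query variants: precision is undefined")
    return true_positives / called


def recall(true_positives, false_negatives):
    """Fraction of benchmark variants found by the query: TP / (TP + FN)."""
    expected = true_positives + false_negatives
    if expected == 0:
        raise ValueError("no benchmark variants: recall is undefined")
    return true_positives / expected


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
print(precision(90, 10))  # 0.9
print(recall(90, 30))     # 0.75
```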
A haploid genome assembly to which sequencing reads are mapped and variants are called.
A material that is sufficiently stable (over time) and homogeneous (between vials) for its applications. For example, genomic reference materials from the US National Institute of Standards and Technology (NIST) are extensively characterized to develop benchmark variants and regions to reliably identify false positives and false negatives.
Characterizing a sample genome and its associated variation by mapping and aligning sequence reads to a reference genome sequence.
Highly repetitive DNA composed of thousands to tens of thousands of tandem repeats, usually between 100–300 bp in length, and frequently associated with heterochromatin.
The process of connecting assembled contigs even when the intervening sequence is unknown.
Sets of ordered and oriented contigs, with the approximate distances between contigs estimated by traversing paired-end sequences that anchor to different contigs. Scaffolds consist of both sequence contigs and gaps.
When a sequence is represented two or more times in a genome with high sequence identity and did not arise by retrotransposition. Often defined as paralogous sequences that share ≥90% sequence identity and are ≥1 kb in length.
Tandem repeats in which the individual unit of repetition is less than 10 bp long and varies in length between different individuals in a population.
A form of DNA sequencing in which signals are derived from single molecules, frequently referring to sequencing produced by Pacific Biosciences and Oxford Nanopore Technologies platforms.
Variants that are single-base substitutions. They are also commonly called single-nucleotide polymorphisms (SNPs) when they occur at an appreciable frequency (typically >1%) in the germ lines of the wider population.
Variable number of tandem repeats
Any tandem array of repeated sequence motifs that are highly variable in different individuals of a population. Historically, the term was used in reference to tandem repeats that varied on the scale of thousands of base pairs over the length of the array.