Guides @ UF: Genomic Data Processing: Key Terms

Key Terms and Definitions

Contigs

Continuous (or 'contiguous') sequences produced in a de novo assembly, free of any gaps.

Regions with an excess or deficiency in the number of sequence reads originating as a result of platform differences in sequence chemistry, amplification or cloning.

De novo assembly

The action of constructing the sequence of a genome from overlapping DNA sequences without guidance from a reference genome.

Depth of coverage

The average number of reads covering a particular base in the sequence being assembled.
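As a rough illustration, average depth is often estimated as the total number of sequenced bases divided by the length of the sequence being assembled. The sketch below assumes the read lengths and target length are already known; the function name `mean_depth` is hypothetical and not taken from any particular tool.

```python
def mean_depth(read_lengths, genome_length):
    """Estimate average depth of coverage as total read bases / target length.

    Assumes every read maps somewhere on the target; real pipelines count
    aligned bases per position instead.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length


# 1,000 reads of 150 bp over a 5,000 bp target give 30x average depth.
print(mean_depth([150] * 1000, 5000))  # 30.0
```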

Fragment library

A set of DNA fragments of approximately the same length that are paired-end sequenced.

Heterochromatic DNA

Portions of chromosomes that stain densely, are typically gene poor and are rich in satellite sequences.

Indels

Variants that are insertions and deletions of sequence, typically 1 to 49 bp in size.

k-mers

Strings of k consecutive letters extracted from a longer sequence, such as a read or a reference assembly.
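A minimal sketch of k-mer extraction, using a sliding window of width k over a sequence. The function name `kmers` is an illustrative choice, not a reference to a specific library.

```python
def kmers(sequence, k):
    """Return every substring of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 k-mers (none if k > n).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("GATTACA", 3))  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```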

Library

A collection of paired-end or mate-pair reads derived from DNA fragments with a tightly controlled size range.

Mate-pair data

Data from a pair of reads sequenced from the same circularized DNA fragment. The circularization step allows larger fragment sizes to be used. Mate-pair reads provide the same information to the assembler as paired-end reads.

Massively parallel sequencing

A general term for a form of DNA sequencing that measures trace signals from millions to hundreds of millions of amplified sequences at once, most frequently referring to sequencing produced by Illumina, Life Technologies and Complete Genomics platforms. Often referred to as next-generation or second-generation sequencing to distinguish it from long-read sequencing approaches (for example, single-molecule sequencing), which are sometimes referred to as third-generation sequencing.

Muted gaps

Regions that have been incorrectly closed in a genome assembly despite additional sequences being present at these sites in the source genome.

N50

A summary measure of read length distribution: at least 50% of the bases in the reads are in reads of length N50 or longer. Similarly, for de novo assemblies, at least 50% of the bases in the assembled contigs are in contigs of length N50 or longer.

N50 length

A statistic in genomics defined as the largest contig length L such that contigs of length L or greater account for at least half the total length of the assembly. It is commonly used as a metric to summarize the contiguity of an assembly.
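One common way to compute the N50 length is to sort contig lengths from longest to shortest and accumulate them until the running total reaches half of the assembly length. The sketch below follows that approach; the function name `n50` and the example lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or read) lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Total is 290; 80 + 70 = 150 is the first running sum to reach 145.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```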

Overlap

The relationship between two reads, the ends of which have highly similar sequences. The minimum length allowed for the corresponding sequence is an important parameter in assembly.
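To make the role of the minimum overlap length concrete, the sketch below finds the longest suffix of one read that matches a prefix of another. It is a simplification: it requires an exact match, whereas assemblers generally tolerate some mismatches, and the name `overlap_length` and its `min_length` parameter are hypothetical.

```python
def overlap_length(a, b, min_length=3):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_length` bases exists.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # Try the longest possible overlap first, then shrink.
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


print(overlap_length("ACGTTGCA", "TGCATTAG"))                # 4 ("TGCA")
print(overlap_length("ACGTTGCA", "TGCATTAG", min_length=5))  # 0
```

Raising the minimum length discards short, possibly spurious overlaps at the cost of missing genuine ones.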

Paired-end

Two reads sequenced from opposite ends of the same fragment.

Paired-end data

Data from a pair of reads sequenced from ends of the same DNA fragment. The genomic distance between the reads is approximately known and is used to constrain assembly solutions.

Phase

The assignment of genetic variants or alleles to one of two homologous chromosomes.

Precision

The fraction of query variants in the benchmark regions that match the benchmark variants, or true positives/(true positives + false positives).

Read mapping

Aligning a given read to a reference.

Reads

Small sequence fragments from larger molecules generated by a given sequencing technology; the length can range from 100 bp to >1 million bp, depending on the sequencing method.

Recall

The fraction of benchmark variants that are matched by query variants, or true positives/(true positives + false negatives).
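The precision and recall formulas defined on this page can be sketched directly from counts of true positives, false positives and false negatives. The counts in the example are made up for illustration.

```python
def precision(true_pos, false_pos):
    """true positives / (true positives + false positives)."""
    if true_pos + false_pos == 0:
        raise ValueError("no query variants to evaluate")
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """true positives / (true positives + false negatives)."""
    if true_pos + false_neg == 0:
        raise ValueError("no benchmark variants to evaluate")
    return true_pos / (true_pos + false_neg)


# Hypothetical call set: 90 matches, 10 extra calls, 30 missed variants.
print(precision(90, 10))  # 0.9
print(recall(90, 30))     # 0.75
```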

Reference genome assembly

A haploid genome assembly to which sequencing reads are mapped and variants are called.

Reference material

A material that is sufficiently stable (over time) and homogeneous (between vials) for its applications. For example, genomic reference materials from the US National Institute of Standards and Technology (NIST) are extensively characterized to develop benchmark variants and regions to reliably identify false positives and false negatives.

Resequencing

Characterizing a sample genome and its associated variation by mapping and aligning sequence reads to a reference genome sequence.

Satellite DNA

Highly repetitive DNA composed of thousands to tens of thousands of tandem repeats, each usually 100–300 bp in length, and frequently associated with heterochromatin.

Scaffolding

The process of connecting assembled contigs even when the intervening sequence is unknown.

Scaffolds

Sets of ordered and oriented contigs, with the approximate distances between contigs estimated by traversing paired-end sequences that anchor to different contigs. Scaffolds consist of both sequence contigs and gaps.
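As an illustration of how a scaffold combines contigs and gaps, the sketch below joins already ordered and oriented contigs with runs of `N` whose lengths stand in for the estimated gap sizes. Representing gaps as runs of `N` is a widely used convention, but the function name `build_scaffold` and the inputs here are hypothetical.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered, oriented contigs, filling each gap with 'N' characters."""
    if not contigs:
        raise ValueError("at least one contig is required")
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown intervening sequence
        parts.append(contig)
    return "".join(parts)


print(build_scaffold(["ACGT", "GGCC"], [5]))  # ACGTNNNNNGGCC
```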

Segmental duplication

A sequence that is represented two or more times in a genome with high sequence identity and that did not arise by retrotransposition. Often defined as paralogous sequences that share ≥90% sequence identity and are ≥1 kb in length.

Short tandem repeats

Tandem repeats in which the individual unit of repetition is less than 10 bp long and the overall length of the repeat array varies between different individuals in a population.

Single-molecule sequencing

A form of DNA sequencing in which signals are derived from single molecules, frequently referring to sequencing produced by Pacific Biosciences and Oxford Nanopore Technologies platforms.

Single-nucleotide variants

Variants that are single-base substitutions. They are also commonly called single-nucleotide polymorphisms (SNPs) when they occur at an appreciable frequency (typically >1%) in the germ lines of the wider population.

Variable number of tandem repeats

Any tandem array of repeated sequence motifs that is highly variable between individuals of a population. Historically, the term referred to tandem repeats that varied on the scale of thousands of base pairs over the length of the array.
