The genomics section of this guide provides definitions and resources for genomics-related terms.
Term In Bolded Blue - Each term is linked to the original source from which its definition was taken, which also provides access to additional resources. Clicking the link opens a new window that highlights where the definition appears in the source text.
Definition - The definitions have been copied directly from the sources they were selected from. The initial sources and related attributions can be found by clicking on the linked term.
(Alternative phrasing) - These are alternative phrases for the terms, which can be synonyms, acronyms, or other ways of referring to the concept.
[See also] - These are terms that are related to the defined term.
The enzymatic process of adding an acetyl group to a lysine residue on histone tails or on other proteins.
A chromosome with a centromere near to one end so that one arm is very short.
Evolution as a result of selection.
An evolutionary process that is directed by natural selection, which makes a population better adapted to live in an environment.
A metaphor used to describe the sequence of fixation of beneficial mutations that transform a low-fitness genotype into a genotype that is well-adapted to its environment.
An admixed population contains hybrids or offspring of individuals originating from genetically divergent parental populations.
A population formed recently from the mixing of two or more groups whose ancestors had long been separated.
The mixture of two or more genetically distinct populations.
(Hybridized)
The examination of gene-expression profiles by the high-density array of single-stranded DNA nucleotides.
When multiple variants in the same gene affect the same disease.
Co-dominant nuclear DNA markers that consist of enzymes that differ in their mobility on a charged gel.
An interspersed DNA sequence of 300 bp that belongs to the short interspersed element (SINE) family and is found in the genome of primates. Alu elements are composed of a head-to-tail dimer in which the first monomer is 140 bp long and the second is 170 bp long. In humans, there are ∼1.1 million copies of Alu elements, of which ∼500,000 copies are located in introns.
An interspersed DNA sequence of ∼300 base pairs (bp) that is found in the genomes of primates, which can be cleaved by the restriction enzyme AluI. They are composed of a head-to-tail dimer, with the first monomer ∼140-bp long and the second ∼170-bp long. In humans, there are 300,000–600,000 copies of Alu elements.
Gene amplification refers to an increase in the number of copies of a gene in a genome.
(Gene amplification)
Amplified fragment length polymorphism
A DNA fragment-length polymorphism that is revealed by a PCR-based DNA fingerprinting technique that generates dozens of polymorphic marker bands (presence or absence of a restriction enzyme site) in a single gel lane. The marker bands are usually dominant in that we generally cannot see the difference between a heterozygote and homozygote.
(AFLP)
Genetic markers ascertained for large differences in allele frequency between subpopulations that are genotyped to infer genetic ancestry in new samples.
The presence of an abnormal number of chromosomes, either more or less than the diploid number. It is associated with cell and organismal inviability, birth defects and cancer.
In molecular biology, the process by which two single strands of DNA hydrogen bond at complementary nucleotides to form a double-stranded molecule.
Oligonucleic acids that bind to a specific target molecule, such as a small molecule, protein or nucleic acid. Nucleic acid aptamers are typically developed through in vitro selection schemes but are also found naturally (for example, RNA aptamers in riboswitches).
A method for enriching whole genomic DNA for many regions of interest by hybridization to an array containing RNA or DNA sequences complementary to the regions of interest.
The bias in patterns of variation that results from using pre-ascertained SNPs.
Assay for transposase accessible chromatin sequencing
A method that uses the activity of a hyperactive transposase to cleave exposed DNA and add sequencing adapters. Regions that cannot be sequenced are inferred to be chromatin interacting.
(ATAC-seq, ATAC-sequencing, ATAC sequencing)
A stretch of sequence surrounding a polymorphism that has been associated with a phenotype, in which linkage disequilibrium levels between polymorphisms and the associated marker might be sufficiently high to drive the originally observed association.
A set of methods that are used to correlate polymorphisms in genotype to polymorphisms in phenotype in populations.
“Autosomal” means that the gene in question is located on one of the numbered, or non-sex, chromosomes.
Linear structures that assemble along the length of meiotic chromosomes. Axial elements become the lateral elements of the mature synaptonemal complex.
A sequencing method where a physical map is generated from overlapping bacterial artificial chromosome (BAC) clones tiled across a chromosome. Each BAC is then fragmented and sequenced. The sequenced fragments are aligned with the knowledge of the originating BAC.
Originally, backcross referred to the mating of an offspring with one of its parents, in which the offspring is heterozygous, with the parent being homozygous for one of the alleles in the offspring's genotype. Nowadays, backcross simply refers to a mating between individuals with those two genotypes.
Bacterial artificial chromosome
Bacterial artificial chromosomes (BACs) are DNA molecules assembled in vitro from defined constituents and are stably maintained as one large DNA fragment in Escherichia coli. Artificial chromosomes are useful for genome sequencing programs, for transduction of DNA segments into eukaryotic cells, and for functional characterization of genomic regions and entire viral genomes such as cytomegalovirus (CMV) genomes.
(BAC)
A form of selection in which multiple phenotypes (or alleles) are maintained in a population.
A series of known bases added to a template molecule either through ligation or amplification. After sequencing, these barcodes can be used to identify which sample a particular read is derived from.
(DNA barcode)
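The use of barcodes to assign reads back to samples can be sketched as follows. This is a minimal illustration, assuming the barcode sits at the start of each read, all barcodes have the same length, and only exact matches are accepted; the function name `demultiplex` and the sample names are hypothetical.

```python
def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using a fixed-length barcode at the read start.

    Assumption: exact matching only; reads with an unknown barcode are
    collected under the key "unassigned".
    """
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    k = lengths.pop()

    by_sample = {}
    for read in reads:
        sample = barcode_to_sample.get(read[:k], "unassigned")
        # Store the read with the barcode trimmed off.
        by_sample.setdefault(sample, []).append(read[k:])
    return by_sample


barcodes = {"ACGT": "sample_1", "TTAG": "sample_2"}  # hypothetical barcodes
reads = ["ACGTGGGCCC", "TTAGAAATTT", "ACGTCCCGGG", "GGGGAAAAAA"]
assigned = demultiplex(reads, barcodes)
```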
DNA barcoding is a tool for rapid species identification based on DNA sequences. DNA barcodes consist of a standardized short sequence of DNA (400–800 bp) that in principle should be easily generated and characterized for all species on the planet. DNA barcoding aims to use the information of one or a few gene regions to identify all species of life.
(DNA barcoding)
A conserved mRNA splicing mechanism. It is composed of the splicing signals and the core of the machinery is formed by five spliceosomal small nuclear ribonucleoproteins and an unknown number of proteins.
A cellular mechanism that repairs damaged DNA and is initiated by the activity of DNA glycosylases.
A system used by most next-generation sequencing platforms. When a one-base-encoded probe or a sequencing-by-synthesis approach is used, each signal is correctly correlated to a base.
A framework of statistical inference in which previous beliefs (or data) and likelihoods are combined to estimate a parameter of interest given the observed data.
A statistical perspective that focuses on the probability distribution of parameters before and after observing the data.
Most imputation methods provide a probabilistic prediction of the missing genotypes. The best guess genotype is that genotype which has the largest probability.
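The best-guess rule above amounts to taking the genotype with the highest predicted probability. A minimal sketch, assuming the imputation output for one individual at one site is available as a mapping from genotype to probability (the function name and the example values are hypothetical):

```python
def best_guess_genotype(genotype_probabilities):
    """Return the genotype with the largest predicted probability."""
    return max(genotype_probabilities, key=genotype_probabilities.get)


# Hypothetical imputation output for a single site.
posterior = {"AA": 0.08, "AG": 0.85, "GG": 0.07}
call = best_guess_genotype(posterior)
```

Note that the best guess discards the uncertainty in the prediction: a call made at probability 0.85 and one made at 0.99 look the same once reduced to a single genotype.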
Refers to two (possibly different) variants located on both alleles of the same gene.
An individual protein that is uniquely produced in a diseased state.
The use of artificial methods to modify the genetic material of living organisms or cells to produce novel compounds or to perform new functions.
A technique in which the treatment of DNA with bisulphite, which converts cytosines into uracils but does not modify methylated cytosines, is used to determine the DNA methylation pattern.
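The logic of reading methylation from bisulphite-treated DNA can be sketched in silico. This is a simplified illustration, assuming complete conversion, a single strand, and that converted uracils are read as thymine (T) after amplification; the function names are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulphite treatment followed by amplification.

    Unmethylated cytosines are converted (read as T); cytosines at the
    given 0-based methylated positions are left unchanged.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(sequence)
    )


def call_methylation(reference, converted):
    """Return positions where a reference C is still read as C."""
    return [
        i
        for i, (ref_base, read_base) in enumerate(zip(reference, converted))
        if ref_base == "C" and read_base == "C"
    ]


reference = "ACGTCCGA"
treated = bisulfite_convert(reference, methylated_positions=[1])
methylated_sites = call_methylation(reference, treated)
```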
Two paired or synapsed homologous chromosomes, each formed of two sister chromatids.
When n statistical tests are carried out, each has the potential (probability, p, the significance level) to return a false-positive result. If tests are independent of each other, the so-called experiment-wise probability that one or more tests show a false-positive result is approximately np. So, to achieve an experiment-wise false-positive rate of p, each individual test must only be allowed a false-positive error rate of p/n, which is referred to as the Bonferroni correction.
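The arithmetic in the definition above can be sketched as follows. The function names and the example numbers are illustrative assumptions; the second function computes the exact experiment-wise rate for independent tests, which np approximates when p is small.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the experiment-wise
    false-positive rate at or below `alpha` across `n_tests` tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def experiment_wise_rate(per_test_alpha, n_tests):
    """Probability of at least one false positive among `n_tests`
    independent tests, each run at `per_test_alpha`."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


# Example: one million tests at an experiment-wise rate of 0.05.
threshold = bonferroni_threshold(0.05, 1_000_000)
```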
A statistical approach that is often used to generate confidence intervals (measures of variation) around parameter estimates in which the data are re-sampled repeatedly (with replacement) using computer Monte Carlo simulations.
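A minimal sketch of the resampling idea, assuming the simple percentile method for the interval (one of several ways to build a bootstrap confidence interval). The function name, the default of 2,000 resamples and the example data are illustrative assumptions.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `data`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Re-sample the data with replacement and recompute the statistic each time.
    estimates = sorted(
        stat([rng.choice(data) for _ in range(n)]) for _ in range(n_resamples)
    )
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]


measurements = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]  # hypothetical data
low, high = bootstrap_ci(measurements)
```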
A marked reduction in population size that often results in the loss of genetic variation and more frequent matings among closely related individuals.
A mechanism of chromosomal instability caused by a cycle of telomere breaks and dicentric chromosome formation.
(BFB cycles, BFB's)
A conserved structural domain of ~40–50 amino acids that is commonly found in proteins associated with chromatin remodeling and with proteins that bind to acetylated lysine residues in histones.
Distinct sub-nuclear structures present in eukaryotic cells associated with RNA metabolism and ribonucleoprotein biogenesis.
Cap analysis of gene expression
The high-throughput sequencing of concatamers of DNA tags that are derived from the initial nucleotides of 5′ mRNA.
(CAGE)
The proteinaceous shell that packages the genetic material of the virus. Its structure is important in determining viral stability, delivery and host interactions.
(Viral envelope)
Any gene (or genes) harboured on the sequence of an extrachromosomal DNA (ecDNA) element.
Topological linkages between duplex DNA. Catenations between sister chromatids arise during replication.
A genetic marker that is functionally responsible for altering the severity of the phenotype.
A mode of gene effect that is restricted to the cell in which the gene is expressed.
Actual population size (total number of individuals) as compared to the theoretical effective population size.
A centimorgan (abbreviated cM) is a unit of measure for the frequency of genetic recombination; it corresponds to the relative distance between genes on a chromosome that have a crossover value of 1%. One centimorgan is equal to a 1% chance that two markers on a chromosome will become separated from one another due to a recombination event during meiosis (which occurs during the formation of egg and sperm cells).
(cM)
Repetitive region of the chromosome that attaches to the mitotic spindle and is responsible for ensuring accurate transmission of the genome during cell division.
A mechanism that monitors the fidelity of cellular events and triggers cell cycle arrest and possibly apoptosis when errors are not corrected. In meiosis, unrepaired DNA damage and synapsis failure trigger checkpoints that can halt meiotic progression.
A technique that assesses the mode of action of gene products by generating animals from a mixture of cells that are derived from two or more genetically distinct animals.
The product of chromosome replication in meiosis I. Chromatids are distinguished from chromosomes by the fact that the two daughter chromatids of one chromosome remain attached at their centromeres through meiosis I cell division.
A complex of DNA and histone proteins. The basic unit of chromatin is the nucleosome.
The extent to which proteins are able to interact with chromatinized DNA, which is regulated through nucleosome occupancy and other factors occluding access to DNA.
A technique that is used to identify the location of DNA-binding proteins and epigenetic marks in the genome. Genomic sequences containing the mark of interest are enriched by binding soluble DNA chromatin extracts (complexes of DNA and protein) to an antibody that recognizes the mark. Related techniques — such as methylated DNA immunoprecipitation — use antibodies to recognize DNA modifications directly.
(ChIP)
Chromatin immunoprecipitation followed by sequencing
A method used to analyse protein interactions with DNA by combining ChIP with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins.
(ChIP-seq, ChIP-sequencing)
A motor-driven process in which a loop-extruding factor translocates along the chromatin fibre in opposite directions, thereby growing a chromatin loop.
An ATP-dependent enzymatic process that alters histone–DNA interactions or regulates the position of nucleosomes. Chromatin remodelling can also be ATP-independent in the case of the facilitates chromatin transcription (FACT) complex.
The propensity of transcription factors (TFs) to be inhibited from binding motifs due to reversible chromatin features such as modification of DNA (CpG methylation) or presence and actions of chromatin proteins (such as nucleosomes).
A conserved structural domain of ~40–50 amino acids that is commonly found in proteins associated with chromatin remodelling and with proteins that bind to methylated lysine residues in histones.
A condition in which the rate of chromosome mis-segregation is elevated.
(CIN)
Chromosome conformation capture
A technique used to study the long-distance interactions between genomic regions. These interactions can be used to study the three-dimensional architecture of chromosomes in a cell nucleus.
Specific, largely non-overlapping areas in the nucleus that each chromosome occupies.
Chromosome-associated regulatory RNAs
Regulatory RNAs associated with the chromatin.
(carRNAs)
A massive chromosomal rearrangement resulting from a chromosome shattering event, characterized by more than 20 DNA fragments stitched together in an abnormal order.
Absorption spectroscopy method to detect the differential absorption of left- and right-handed light spectra for rapid evaluation of the secondary structures of macromolecules such as protein and DNA.
Non-coding DNA sequences that regulate transcription of genes located on the same chromosome. They include enhancers, promoters, insulators, silencing elements and tethering elements. Different classes of CREs can be identified using a combination of molecular markers, including chromatin accessibility and epigenetic modifications.
(CRE, CREs)
A gradient of variation across space. It usually refers to increased differences among populations in the frequency of an allele or trait with increased geographic distance.
Core clock genes are directly involved in the primary transcriptional–translational feedback loops. By contrast, clock-controlled genes are those genes whose expression is driven by the transcriptional–translational feedback loops within cells and tissues, resulting in circadian oscillations in their function.
Cells that originate from a common cell ancestor (progenitor) with identical genetic identity.
The production of an exact copy—specifically, an exact genetic copy—of a gene, cell, or organism.
A mathematical algorithm that organizes a set of items according to their similarity. For example, genes can be clustered according to their similarity in pattern of expression.
Groups of DNA templates in close spatial proximity, generated either through bead-based amplification or by solid-phase amplification. Bead-based approaches rely on emulsions to maintain template isolation during amplification. Solid-phase approaches rely on the template-to-bound-adapter ratio to probabilistically bind template molecules at a sufficient distance from each other.
The joining of genetic lineages to common ancestors when they are traced backwards in time.
Relating to the mathematical and statistical properties of genealogies. A modelling framework in which two DNA sequence lineages converge in a common ancestral sequence, going backwards in time.
The portion of a gene or an mRNA which actually codes for a protein.
(Coding sequence, CDS)
Genetic markers that allow the determination of both alleles at a diploid locus (for example, microsatellites, allozymes and single nucleotide polymorphisms); these differ from dominant markers in which the determination of heterozygotes is not always possible (for example, RAPDs and AFLPs).
A codon is a sequence of three DNA or RNA nucleotides that corresponds with a specific amino acid or stop signal during protein synthesis.
A multisubunit protein complex that mediates sister-chromatid cohesion in mitosis and is essential for topologically associating domain (TAD) formation.
A non-synonymous SNP present in at least 1% of the human population that is either overtly neutral or not known to influence disease in appreciable ways.
Comparative anchor-tag sequences
Exon sequences that are conserved across taxa allowing the design of primers that amplify in divergent species (for example, across mammal orders). CATS-like primers speed the discovery of SNPs (in exons or introns) and comparative genome mapping across taxa.
(CATS)
Complementation occurs when two mutations together result in a wild-type phenotype.
When an individual inherits two different recessive mutations, one from each parent, in the same gene that cause the same phenotype. An example would be a single-nucleotide variant causing a codon for an amino acid to be changed into a stop codon in one allele and a 4-bp deletion in the other allele: each of these variants knocks out its respective allele, resulting in neither copy functioning.
The existence of distinct mutations on opposite alleles of a single gene located on an autosomal chromosome.
A spurious association between a risk factor (a gene, exposure or interaction) and disease induced by the joint associations of some other variable with the risk factor and the disease that are independent of the risk factor. Confounding can also distort the magnitude of the association of a true risk factor with disease or mask it.
The transfer of genetic information from a donor to a recipient cell by a conjugative or mobile genetic element, often a conjugative plasmid.
In next-generation sequencing (NGS) routines that allow multiple overlapping reads from a single molecule of DNA, all related reads are aligned to each other and the most likely base at each position is determined. This process helps to overcome high, single-pass error rates. A high-quality consensus sequence derived from the circular template from Pacific Biosciences (PacBio) is called a circular consensus sequence (CCS).
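The step of choosing the most likely base at each position can be sketched as a per-column majority vote. This is a simplification, assuming the reads are already aligned to equal length and ignoring base quality scores and gaps; the function name is hypothetical.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over equal-length, pre-aligned reads.

    Ties are resolved in favour of the base seen first in that column.
    """
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("reads must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


# Three passes over the same molecule, each with one single-pass error.
passes = ["ACGT", "ACGA", "ACCT"]
consensus_sequence = consensus(passes)
```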
A base sequence in a DNA molecule (or an amino acid sequence in a protein) that has remained relatively unchanged throughout evolution.
A permanently condensed chromatin conformation that is repressive for transcription and is commonly found at repetitive regions of the genome, such as centromeres and telomeres.
Independent evolution from different ancestors that leads to similar characteristics.
Algorithms designed to learn from the data to uncover connections. CNNs are frequently used in image recognition and have been increasingly used to uncover relationships in biological data.
(CNN, CNNs)
Copy number variants (CNVs) are regions of the genome that vary in integer copy number.
(CNV, CNVs)
In the pedigree of a family with a condition, the segregation pattern shows how often the putative causal variant is found to coincide with the condition. When a variant coincides with the condition in a family, the condition and the variant are said to co-segregate.
A bacterial recombination vector that contains long inserted DNA sequences.
The number of sequence reads that have alignments that overlap a certain position. Because current sequencing strategies produce random reads, resulting in an uneven distribution of reads across the genome, a high average coverage is required to assure that most bases in the genome are covered by multiple reads.
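Per-position coverage and the average coverage can be computed directly from alignment coordinates. A minimal sketch, assuming each alignment is given as a 0-based, half-open `(start, end)` interval on a single reference sequence; the function name and the example intervals are hypothetical.

```python
def per_base_coverage(alignments, reference_length):
    """Count how many alignments overlap each reference position."""
    depth = [0] * reference_length
    for start, end in alignments:
        # Clip each interval to the reference before counting.
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    return depth


alignments = [(0, 4), (2, 6), (3, 8)]
depth = per_base_coverage(alignments, reference_length=10)
mean_coverage = sum(depth) / len(depth)
uncovered = depth.count(0)
```

In this toy example the average coverage is above 1, yet two positions are not covered at all, which illustrates why a high average is needed for most bases to be covered by multiple reads.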
A sequence of at least 200 bp with a greater number of CpG sites than expected for its GC content. These regions are often GC rich, typically undermethylated, and are found upstream of many mammalian genes.
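One common way to quantify "more CpG sites than expected" is the observed/expected CpG ratio, (number of CpG × sequence length) / (number of C × number of G). This formula is not stated in the definition above and is used here as an assumption; the function name is hypothetical, and no particular threshold is implied.

```python
def cpg_observed_expected(sequence):
    """Observed/expected CpG ratio for a DNA sequence.

    Expected CpG count is taken as (count of C * count of G) / length.
    """
    seq = sequence.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    if c_count == 0 or g_count == 0:
        return 0.0
    return cpg_count * len(seq) / (c_count * g_count)


rich = cpg_observed_expected("CGCGCGCG")   # CpG-dense toy sequence
poor = cpg_observed_expected("CCCCGGGG")   # same GC content, one CpG
```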
Integration, excision, and inversion of defined DNA segments commonly occur through site-specific recombination, a process of DNA breakage and reunion that requires no DNA synthesis or high-energy cofactor.
(Recombination)
A reciprocal exchange of DNA along chromatids such that the proximal end of one homologue becomes attached to the distal end of the other.
The CCCTC binding factor (CTCF) is a zinc-finger transcription factor that is enriched at the boundaries of TADs.
Created by introducing a donor nucleus into a cytoplast. Because cybrids contain the nuclear genes from one cell and the mitochondrial genes from another, they can be used to assess the contributions of mitochondrial genes and nuclear genes independently.
A mutation that does not affect fitness but is damaging to gene function.
The conversion of a continuous signal to a discrete signal.
Classical statistical pattern-recognition methods that are used to categorize samples into two classes of data.
The separation of chromosomes or chromatids during anaphase of mitosis or meiosis.
A form of selection in which extreme phenotypes are more fit than intermediate forms.
The addition of a unique molecular tag to each fragment of an individual's DNA so that after pooling with other DNA samples, the genotype of each individual in the pool can be reconstructed.
(DNA bar-coding, DNA-barcoding, barcoding)
A type II DNA topoisomerase that catalyses the ATP-dependent supercoiling of closed-circular dsDNA by strand breakage and rejoining reactions. Control of chromosomal topological transitions is essential for DNA replication and transcription in bacteria, making gyrase an effective target for antimicrobial agents.
A class of motor proteins that move along DNA and transiently separate duplexes into two single strands using energy from ATP hydrolysis.
Physical DNA–DNA interaction in the genome within 3D nuclear space.
The destruction of foreign dsDNA by a restriction endonuclease. The protection of self DNA from restriction is achieved by DNA methylation.
Irreversible and unintended DNA changes caused mainly by off-target activity of DNA-targeting modules with functional nucleases.
A chromatin region with a high rate of cleavage by DNase I due to its preference for open chromatin. DNase I hypersensitivity generally reflects transcription factor (TF) binding and a local reduction in nucleosome occupancy.
DNase I hypersensitivity site footprinting
An assay that identifies regions of the genome that lack nucleosome structure and are therefore readily degraded by the enzyme DNase I. Such regions tend to be associated with transcriptional activity. When coupled with sequencing, the ends of DNA fragments generated by treatment of chromatin with DNase I are sequenced.
The process of genetically adapting an animal or plant to better suit the needs of human beings (for example, breeding cattle for milk production).
The phenomenon whereby the expression levels of sex-linked genes are made equal in males and females of heterogametic species.
A visualization technique that allows the easy identification of matching nucleotides or amino acids (letters) between two sequences. For example, for two sequences X and Y, each letter has a unique coordinate on the x axis and the y axis respectively. When two letters are the same at a specified coordinate, a dot is plotted in the matrix at that position.
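The matrix described above can be sketched in a few lines. This version marks single-letter matches only, with no window or threshold filtering; the function names and the text rendering are illustrative assumptions.

```python
def dot_plot(seq_x, seq_y):
    """Boolean matrix with one row per letter of seq_y (y axis) and one
    column per letter of seq_x (x axis); True where the letters match."""
    return [[letter_x == letter_y for letter_x in seq_x] for letter_y in seq_y]


def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


plot = render(dot_plot("GAT", "GCT"))
```

Runs of matches along a diagonal indicate a stretch where the two sequences are identical.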
A serious form of DNA damage that is created enzymatically during meiosis and that stimulates repair by crossover or non-crossover recombination.
A twofold rotational symmetry relationship (in this case, a DNA arrangement in which a 5′→3′ sequence on one strand is juxtaposed with the same 5′→3′ sequence on the opposite strand). Transcripts from such regions have the capacity to form stem–loop structures.
A genetically distinct population within a widely spread species.
Recombination between nonhomologous sequences.
The increase in risk (or proportion of population variation) that is conferred by a given causal variant.
The size of the ideal constant-size population, in which the effects of random drift would be the same as those seen in the actual population.
(Ne)
Embryonic-stem-cell-mediated transgenesis
A method in which DNA is introduced into embryonic stem (ES) cells and integrates randomly, or through gene targeting, into the genome. Transgenic ES cells are delivered to the germline through the generation of (ES cell↔embryo) chimaeras.
The prevalent endogenous viral elements that are derived from retroviruses that have become integrated into the genome.
(ERV RNAs, ERVs, ERV RNA)
An intermediate phenotype that is heritable and associated with a disease but is not itself a symptom of the disease. Although there is little evidence to support the theory, it has been argued that endophenotypes would be a more tractable target for genetic analysis than the relevant disease state itself.
A process in which a somatic structural genomic rearrangement brings an enhancer into physical proximity of a gene it does not normally interact with, and activates it ectopically.
Interactions driven by the binding energy between molecules, such as homotypic interactions among chromatin states.
Changes that increase the number of accessible microstates in the system and do not require input energy.
Epidermal differentiation complex
A gene complex of >50 genes that encode proteins involved in terminal differentiation and cornification of skin epidermal keratinocytes.
(EDC)
Literally means 'outside conventional genetics'; this term describes any heritable change in gene expression that is not caused by a change in DNA sequence.
Heritable phenotypic changes that are independent of changes to the DNA sequence.
Chemical additions to DNA and histones that are associated with changes in gene expression and are heritable but do not alter the primary DNA sequence.
The combined features that enable stable propagation of different gene expression patterns from the same genome sequence. These include methylation of DNA at cytosine bases (mC), chemical modification of the histone proteins, chromatin accessibility and higher-order chromatin structures.
The state of those mechanisms that regulate gene expression and are transmitted to daughter cells.
In the context of transient transfection, this term refers to a plasmid target that is extra chromosomal.
Circular DNA that is not integrated in the genome.
These are created by screening the fitness of double mutants in a high-throughput manner. The results, when analysed as a whole, can reveal both positive and negative genetic interactions between genes and provide insights into biological pathways and protein–protein complexes in the cell.
Non-condensed chromatin state that is enriched in genes and permissive for transcription.
Features (such as feathers) that evolved by selection for one purpose (such as warmth) and were later adapted to a new purpose (such as flight).
The exome is the collection of known exons in our genome: this is the portion of the genome that is translated into proteins. As exons comprise only 1% of the genome and contain the most easily understood, functionally relevant information, sequencing of only the exome is a cheaper method of identifying most of the variants that are most likely to affect a trait.
Exon-primed intron-crossing PCR
EPIC primers are designed in conserved exons and amplify intron sequences that are generally more polymorphic than exons, which are therefore useful for the development of SNP or RFLP markers.
(EPIC-PCR, EPIC PCR)
The probability for a locus that two alleles drawn from its allele-frequency distribution are distinct.
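For allele frequencies p_1, ..., p_k at a locus, the probability that two independently drawn alleles are the same is the sum of the squared frequencies, so the probability that they are distinct is 1 minus that sum. A minimal sketch, assuming sampling with replacement from the frequency distribution (the function name is hypothetical):

```python
def expected_heterozygosity(allele_frequencies):
    """Probability that two alleles drawn from the allele-frequency
    distribution are distinct: 1 - sum(p_i ** 2)."""
    if abs(sum(allele_frequencies) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_frequencies)


two_equal_alleles = expected_heterozygosity([0.5, 0.5])
single_allele = expected_heterozygosity([1.0])
```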
The rate of phenotypic change that results from the continuing accumulation of new mutations (expressed mutation rate = total mutation rate − neutral mutation rate).
Short DNA sequences (several hundred base pairs) that are produced by reverse transcription of mRNA into DNA. ESTs are cDNAs that consist of exons and the sequences that flank exons. The sequencing of ESTs allows rapid identification ('tagging') of genes and can expedite DNA marker (SNP) development in coding genes.
(EST, ESTs)
Extrachromosomal DNA concatenation
A structure in which two or more closed circular DNAs are interlinked.
(ecDNA concatenation)
Reversibly condensed chromatin conformation that is transcriptionally silent.
This technique isolates nucleosome-free regions of DNA from chromatin during phenol:chloroform extraction.
(Formaldehyde-assisted isolation of regulatory elements followed by sequencing)
The proportion of false-positive test results out of all positive (significant) tests (note that the FDR is conceptually different to the significance level).
(FDR)
A study design in which many members of a family across several generations are sequenced. These studies are used to understand how phenotypes manifest within a particular genotype background.
Markers used to correct for drift that may occur during an experiment. These can be fluorescent beads or labels on the DNA that remain constant throughout the imaging experiment.
Disposable parts of a next-generation sequencing routine. Template DNA is immobilized within the flow cell where fluid reagents can be streamed into the cell and flushed away.
Fluorescence resonance energy transfer
A system in which energy can be transferred from one light-sensitive molecule to another. When the two molecules are in close proximity (≤30 nm), energy transferred between the two molecules modulates the intensity of a fluorescence signal.
(FRET, Förster resonance energy transfer, Forster resonance energy transfer)
A DNA region that only spans a sub-chromosomal arm proportion of the chromosome and is amplified at a high level; that is, more than eight copies.
If all four possible gametes are observed for two bi-allelic loci then this test infers that a recombination event must have occurred between them (under an infinite sites mutation model).
(FGT)
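The test above reduces to counting the distinct two-locus allele combinations seen in a sample of haplotypes. A minimal sketch, assuming phased haplotypes encoded as strings of "0" and "1" at bi-allelic sites (the function name and encoding are hypothetical):

```python
def four_gamete_test(haplotypes, site_a, site_b):
    """Return True if all four gametes (00, 01, 10, 11) are observed
    for the two bi-allelic sites, implying a recombination event
    between them under an infinite sites mutation model."""
    gametes = {(hap[site_a], hap[site_b]) for hap in haplotypes}
    return len(gametes) == 4


recombinant_sample = ["00", "01", "10", "11"]
non_recombinant_sample = ["00", "01", "11"]
recombination_inferred = four_gamete_test(recombinant_sample, 0, 1)
```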
Fourier transform infrared spectroscopy
A spectroscopy method that simultaneously collects the absorption, emission and photoconductivity of a wide spectral range at high resolution to measure the intensity and wavelength of light required to vibrate molecules in a sample.
The process of breaking large DNA fragments into smaller fragments. This can be achieved mechanically (by passing the DNA through a narrow passage), by sonication or enzymatically.
A genomic sequence that provides a function that is under selection and tends to be conserved between species. For example, a protein-coding region or transcription-factor binding site.
Alignment programs deal with insertions and deletions (indels) by introducing a 'gap' in the sequence that contains the deletion. The introduction of gaps and their extension decreases the overall alignment score by a certain value. This value is defined by a gap-opening penalty and a gap-extension penalty, both of which are used as parameters in alignment programs.
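How the two penalties combine can be sketched with an affine gap cost. Conventions differ between programs; this sketch assumes the opening penalty covers the first gap position and each further position adds the extension penalty. The function name and the default values are illustrative assumptions, not the defaults of any particular program.

```python
def affine_gap_penalty(gap_length, gap_open=10.0, gap_extend=0.5):
    """Total score deduction for a single gap of `gap_length` positions.

    Assumed convention: opening cost for the first position, extension
    cost for every additional position.
    """
    if gap_length <= 0:
        return 0.0
    return gap_open + gap_extend * (gap_length - 1)


one_long_gap = affine_gap_penalty(5)
five_short_gaps = 5 * affine_gap_penalty(1)
```

With these values a single five-position gap costs far less than five separate one-position gaps, which is the usual reason for setting the opening penalty higher than the extension penalty.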
A technique used to separate molecules on the basis of their ability to migrate through a semisolid gel in response to an electric current.
The emergence of a non-heritable extra copy of a gene in a somatic tissue. In microorganisms this term can be used interchangeably with gene duplication.
Originally coined to describe non-Mendelian segregation of alleles obtained from a single meiosis, this typically (but not always) refers to a non-reciprocal form of non-crossover recombination that results in the alteration of the sequence of a gene (or DNA sequence) to that of its homologue. In ectopic gene conversion, the donor and recipient DNA strands are not allelic copies of the same locus.
The amount of product produced from a gene; broadly equivalent to gene expression.
The emergence of a heritable copy of a gene.
The movement of genes among populations. Often expressed as the proportion of gene copies (or breeding individuals) that are immigrants from a different population.
An early term describing situations in which a gene has more than one function. Modern studies describe such genes as multifunctional.
The technique used to treat heritable diseases by replacing mutant genes with functional copies.
The independent distribution of genotype and environment in the source population.
Gene-environment-wide interaction study
A scan of the entire genome for interactions with various environmental exposures.
A phenomenon observed in autosomal dominant diseases in which some clinical manifestations develop earlier and are more severe with successive generations.
The random fluctuations in allele frequencies over time that are due to chance alone.
Alteration of the genetic makeup of an organism using the molecular methods of biotechnology.
The presence of a recombinational event in one region that affects the occurrence of recombinational events in adjacent regions. Positive interference, which is seen in eukaryotes, reduces the probability of using nearby hot spots in the same meiosis and causes a more even spacing of crossovers than would occur by chance.
An outline of genes and their location on a chromosome that is based on recombination frequencies between markers.
Animals in which homozygous mutations are carried by only a small clone of cells.
The effect of a single gene on multiple phenotypic traits. The underlying mechanism is related to the effects of the gene product on various targets.
Identifying gene variants in an individual that may lead to a genetic disease in that individual.
An organism whose genome has been artificially changed.
(GMO)
A method to identify which chromosome a DNA sequence is derived from. By examining polymorphisms, the chromosome of origin can be inferred by matching the reads that share the same variation.
The simultaneous genotyping of hundreds of loci from across the genome, which ideally includes mapped loci and different classes of loci such as allozymes, microsatellites and AFLPs, or synonymous (non-coding) and non-synonymous nucleotide polymorphisms.
An examination of common genetic variation across the genome that is designed to identify associations with traits, such as common diseases.
(GWAS)
The use of an ectopically supplied enzyme that adds chemical groups to DNA but is itself sensitive to the factors binding DNA, such as transcription factors (TFs) or nucleosomes. Its activity can subsequently be read out by sequencing.
Epigenetic marks that are differentially established during male and female gametogenesis and lead to allele-specific gene expression after fertilization.
The study of entire genomes, including the complete set of genes, their nucleotide sequence and organization, and their interactions within a species and with other species.
Four-stranded DNA secondary structures formed in G-rich sequences in which four guanines form a planar array via Hoogsteen base-pairing. These structures can cause replication stress.
Selection on traits that increase the relative fitness of populations or lineages of organisms at some fitness cost to individuals. All of the feasible mechanisms require selection on lineages or small interbreeding groups of related individuals in subdivided populations.
This occurs when a diploid organism has only one functional copy of a gene and both copies are required for correct function. This is one way that a protein-truncating mutation can influence predisposition to a disease.
A set of genetic markers that are present on a single chromosome and that show complete or nearly complete linkage disequilibrium — that is, they are inherited through generations without being changed by crossing over or other recombination mechanisms.
Long stretches (tens of megabases) along a chromosome that have low recombination rates (and relatively few haplotypes). Adjacent blocks are separated by recombination hot spots (short regions with high recombination rates).
An approach to association studies in which the co-inheritance of phenotypes and haplotypes — as opposed to single markers — is statistically analysed.
Helicos Genetic Analysis System
A sequencing technology based on single nucleotide addition. Each nucleotide contains a ‘virtual terminator’ that prevents the incorporation of multiple nucleotides per cycle.
Methylation of a residue on one strand within a palindromic target sequence but not of the corresponding residue within the palindromic target sequence on the complementary DNA strand. Not to be confused with monoallelic methylation, in which one allele of a locus is methylated in a diploid organism.
An animal with a transgene insertion on one chromosome of a homologous pair, rather than on each of the two homologous chromosomes (homozygote).
The type of zygosity in which only one allele contains a gene or mutation.
The proportion of total phenotypic variation that can be attributed to genetic effects (broad sense) or purely additive genetic effects (narrow sense). Narrow-sense heritability predicts the initial response of a population to selection and decreases over the course of selection.
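The two ratios can be written out as a small Python sketch. The variance components are assumed to be already estimated; the function names are illustrative. The last function uses the standard breeder's equation (R = h² × S), which is not stated in the definition above but is the usual way narrow-sense heritability is used to predict a response to selection.

```python
def broad_sense_heritability(var_genetic, var_phenotypic):
    """H^2: share of phenotypic variance due to all genetic effects."""
    return var_genetic / var_phenotypic


def narrow_sense_heritability(var_additive, var_phenotypic):
    """h^2: share of phenotypic variance due to additive genetic effects."""
    return var_additive / var_phenotypic


def response_to_selection(h2, selection_differential):
    """Breeder's equation, R = h^2 * S (single generation)."""
    return h2 * selection_differential


# Toy variance components, chosen only for illustration.
h2 = narrow_sense_heritability(var_additive=3.0, var_phenotypic=10.0)
print(h2, response_to_selection(h2, selection_differential=2.0))
```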
A phenotype that is at least partially transmitted genetically from parents to offspring.
A densely packaged form of chromatin that is associated with repressive histone modifications, DNA methylation and gene silencing.
Double-stranded DNA in which the sequences of the strands are not perfectly complementary.
The co-existence of mutant and wild-type mitochondrial DNA molecules within the same mitochondrion or within a cell.
A probabilistic model that is applied to protein- and DNA-sequence pattern recognition. HMMs represent a system as a set of discrete states and as transitions between those states. Each transition has an associated probability. HMMs are valuable because they enable a search or alignment algorithm to be built on firm probabilistic bases, and the parameters (transition probabilities) can be easily trained on a known data set.
(HMM)
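To make the states, transitions and probabilities concrete, here is a minimal Python sketch of the forward algorithm, which computes the probability of an observed sequence under an HMM. The two-state model, its state names and all probabilities are invented for illustration and are not taken from the source.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Probability of `observations` under the HMM (forward algorithm)."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Toy two-state model over a two-letter alphabet (hypothetical values).
STATES = ("gc_rich", "at_rich")
START_P = {"gc_rich": 0.5, "at_rich": 0.5}
TRANS_P = {
    "gc_rich": {"gc_rich": 0.9, "at_rich": 0.1},
    "at_rich": {"gc_rich": 0.2, "at_rich": 0.8},
}
EMIT_P = {
    "gc_rich": {"G": 0.6, "A": 0.4},
    "at_rich": {"G": 0.3, "A": 0.7},
}

print(forward_likelihood("GGA", STATES, START_P, TRANS_P, EMIT_P))
```

Training, as mentioned in the definition, would estimate the transition and emission tables from a known data set; that step is omitted here.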
A family of small, highly conserved basic proteins that are found in the chromatin of all eukaryotic cells and that associate with DNA to form a nucleosome. Two each of the core histones H2A, H2B, H3 and H4 make up an octameric nucleosome, around which DNA winds.
Covalent modifications to histone proteins, such as methylation, acetylation, phosphorylation, ubiquitylation and sumoylation, that take place at lysine, serine, threonine, arginine and other residues. Histone modifications are catalysed by a diverse panel of enzymes referred to as writers, removed by a different set of proteins known as erasers, and recognized by chromatin-binding proteins known as readers. Activity of CREs is directly linked to distinct histone modifications due to the activities of writers, erasers and readers.
Histone octamer lateral surface
The positively charged outer surface of the histone octamer around which DNA is wrapped.
Structurally distinct, non-typical versions of histone proteins. They are encoded by independent genes and are often subject to regulation that is distinct from that of the canonical histones.
A technique similar to ChIP–seq in which proteins bound to RNA — such as splicing factors — are immunoprecipitated and the RNA fragments are sequenced.
The point at which the strands of two dsDNA molecules exchange partners as an intermediate step in crossing over. Typically, two Holliday junctions are formed in the recombination pathway that gives rise to crossovers.
Chromosomal regions with DNA amplification presenting a uniform staining pattern with Giemsa nucleic acid stain.
(HSR, HSRs)
A template-based mechanism for accurate repair of double-stranded breaks in DNA.
A sequence run of identical bases.
An alternative base pairing in which the purine is flipped and forms different hydrogen bonds with partner bases. For adenines, the second hydrogen bond with the pyrimidine base is formed with N7 rather than N1. These alternative base pairs allow for additional structures beyond the double helix, including triplexes and quadruplexes.
A group of linked regulatory homeobox genes that are involved in patterning the animal body axis during development. Homeobox genes are defined as those that contain a 180-base-pair sequence that encodes a DNA-binding helix–turn–helix motif (a homeodomain).
Offspring that are produced by crossing two different populations within a single species.
Low activity of forms of a gene.
Refers to a variant that results in reduced but not eliminated function of the gene product.
Two or more alleles are identical by descent if they are identical copies of the same ancestral allele.
Two or more alleles are identical by state if they are identical. Alleles which are identical by state may or may not be identical by descent owing to the possibility of multiple mutation events.
Nonhomologous sequence recombination at the genomic DNA level.
A locus with monoallelic expression determined by the parental origin of the allele.
Genes in which one allele is expressed in a parent-of-origin-specific manner.
The epigenetic marking of a gene on the basis of parental origin, which in somatic tissues results in monoallelic expression.
Refers to the phenomenon in which some individuals who carry a pathogenic variant do not exhibit clinical signs.
A small insertion or deletion of nucleotides. If it occurs in an exon and is not a multiple of three in length, it results in a frameshift and usually the loss of gene function.
A model that assumes that there are an infinite number of nucleotide sites and consequently that each new mutation occurs at a different locus.
A genomic element that acts as a barrier, preventing interactions between contiguous regions of the genome.
The tendency of different traits to vary jointly in a coordinated manner throughout a morphological structure or even a whole organism.
The ratio of odds ratios for the relationship of one factor (for example, a gene) with disease across the levels of another factor (for example, an environmental exposure); as such, it is a measure of departure from a multiplicative joint effect.
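A short Python sketch of the "ratio of odds ratios" follows. It assumes a case-control design with one 2×2 table of gene-by-disease counts per exposure level; the table layout, function names and counts are illustrative assumptions.

```python
def odds_ratio(case_carriers, control_carriers, case_noncarriers, control_noncarriers):
    """Gene-disease odds ratio from one 2x2 table of counts."""
    return (case_carriers * control_noncarriers) / (control_carriers * case_noncarriers)


def interaction_odds_ratio(exposed_table, unexposed_table):
    """Ratio of the gene-disease odds ratios across two exposure levels.

    A value of 1 means the joint effect is exactly multiplicative.
    """
    return odds_ratio(*exposed_table) / odds_ratio(*unexposed_table)


# Toy counts: (case carriers, control carriers, case non-carriers, control non-carriers)
exposed = (40, 10, 60, 90)
unexposed = (20, 10, 80, 90)
print(interaction_odds_ratio(exposed, unexposed))
```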
A set of molecular components of the cell, such as proteins, and the interactions between them. The interactions can be physical (protein A binds protein B) or correlative (perturbing protein A alters protein B's activity).
A phenomenon in which the occurrence of a crossover recombination at one position on a chromosome suppresses the frequency of additional, nearby crossovers; inhibition decreases with physical distance.
The transfer of genetic material from one species to another by hybridization and repeated backcrossing.
The relative position of an intron within or between codons. Phase zero, one and two are defined by the position of an intron between two codons or after the first or second nucleotide of a codon, respectively.
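The phase rule above reduces to a remainder. In this minimal Python sketch the input is assumed to be the number of coding nucleotides that precede the intron, counted from the first base of the start codon; the function name is illustrative.

```python
def intron_phase(coding_bases_before_intron):
    """Return 0, 1 or 2.

    0: the intron lies between two codons.
    1: the intron follows the first nucleotide of a codon.
    2: the intron follows the second nucleotide of a codon.
    """
    return coding_bases_before_intron % 3


for n in (6, 7, 8):
    print(n, intron_phase(n))
```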
A cytogenetically anomalous chromosome characterized by the presence of two centromeres, with additional, identical copies of DNA segments joined end to end.
Proteins produced from the same genetic locus but which differ in exon order or combination.
Pairs of structurally distinct restriction enzymes with the same recognition sequence and the same cleavage positions.
Individuals that share some of their genes by recent common descent.
A member of the long interspersed transposable element (LINE) family, which is a type of large repetitive DNA sequence that inserts itself throughout the genome through retroposition. L1 retro-elements are ∼6,400 base pairs long and are abundant in the human genome.
Megabase-scale regions of the genome that interact with the nuclear lamina, are gene-poor, late-replicating and that correspond to heterochromatin and the B compartment.
The transfer of DNA, frequently cassettes of genes, between organisms.
The non-random association of alleles. For example, alleles of SNPs that reside near one another on a chromosome often occur in non-random combinations owing to infrequent recombination. Linkage disequilibrium is useful in genome-wide association studies as it reduces the number of SNPs that must be interrogated to determine genotypes across the genome. Conversely, strong linkage disequilibrium can complicate the identification of functional variants.
(LD)
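Two widely used summaries of the non-random association are the coefficient D and the squared correlation r². The Python sketch below computes both from phased two-locus haplotypes, assuming alleles are coded 0 and 1; the function name and the data are illustrative.

```python
def pairwise_ld(haplotypes):
    """Return (D, r2) for two bi-allelic loci.

    `haplotypes` is a list of (allele_at_locus_1, allele_at_locus_2)
    tuples with alleles coded 0/1 (assumed encoding).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus 1
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


print(pairwise_ld([(1, 1), (1, 1), (0, 0), (0, 0)]))  # complete association
print(pairwise_ld([(1, 1), (1, 0), (0, 1), (0, 0)]))  # no association
```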
Reads derived from the 10X Genomics synthetic long-read platform. These are discontinuous reads each sharing the same barcode, thus they are derived from the same original long molecule.
This occurs when a phenotype is caused by mutations at more than one gene locus, which suggests that the products of the genes belong to the same metabolic pathway.
(Genetic heterogeneity)
The logarithm of the likelihood ratio (odds) for genetic linkage versus no linkage at a given value of the recombination fraction.
(Logarithm of odds score)
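For the simplest case the score can be computed directly. The Python sketch below assumes phase-known, fully informative meioses, so the likelihood is binomial in the number of recombinants, and compares a chosen recombination fraction with 0.5 (no linkage). The function name and the counts are illustrative.

```python
import math


def lod_score(recombinants, meioses, theta):
    """log10 of L(theta) / L(0.5) for phase-known meioses.

    Requires 0 < theta <= 0.5 so that every logarithm is defined.
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    nonrecombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + nonrecombinants * math.log10(1 - theta))
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


# Toy example: 1 recombinant among 10 meioses, evaluated at theta = 0.1.
print(round(lod_score(1, 10, 0.1), 3))
```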
A statistical model for the dependency of a binomial (two-class) phenotype on a number of risk factors. The probability, p, for one of the two phenotype states is expressed in the form of its logit, log(p/(1 – p)), which is assumed to be predicted by the linear combination (weighted sum) of the risk factors.
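The logit and its inverse can be sketched in Python as follows. The intercept, weights and risk-factor values are placeholders; in practice they would be estimated from data, which this sketch does not attempt.

```python
import math


def logit(p):
    """log(p / (1 - p)) for a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def predicted_probability(intercept, weights, risk_factors):
    """Invert the logit of the weighted sum of risk factors."""
    linear = intercept + sum(w * x for w, x in zip(weights, risk_factors))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical model: one genotype dosage and one exposure indicator.
print(predicted_probability(-2.0, [0.7, 1.1], [1, 1]))
```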
Non-coding RNAs longer than 200 nucleotides.
(lncRNA, lncRNAs)
A model of how CTCF and cohesin are thought to form topologically associating domains (TADs), whereby cohesin is loaded onto the DNA and extrudes a loop until it is blocked by CTCF bound at the base of the loop.
An approach to genetic association studies that is focused on putatively functional SNPs, for example, identified by re-sequencing exons and other functional regions in relatively large samples, or directly in patients. This approach is also sometimes called direct.
The effects of a specific risk factor (gene or exposure) in the population as a whole, averaging over all other variables.
The part of a genealogical graph that corresponds to a single locus or stretch of DNA that is inherited without recombination.
In epistatic interactions between two loci associated with disease, each with three genotypes, the nine genotype pairs might each be associated with a certain penetrance (that is, the probability that the genotype pair leads to disease). From these penetrances and the genotype frequencies, marginal penetrances might be computed (that is, penetrances that are associated with the genotypes at one of the two loci).
The process by which new genetic markers are obtained — for example, by re-sequencing a subset of chromosomes in a population sample. If those markers are population-specific then inferences that are based on them in other populations might be biased through so-called ascertainment bias.
A computational technique for the efficient numerical calculation of likelihoods.
(MCMC)
A method that selects the phylogenetic tree that has the highest probability of explaining the sequence data, under a specific model of substitution (changes in the nucleotide or amino-acid sequence).
A statistical test that is commonly used for the comparison of between-species divergence and within-species polymorphism at replacement and synonymous sites to infer adaptive protein evolution.
A multisubunit protein complex that bridges transcription factors and the basal RNA polymerase II transcriptional machinery.
Methylated DNA is immunoprecipitated with an antibody against methylated cytosine and then sequenced by next-generation sequencing.
A disease that is carried in families in either a dominant or recessive manner and that is typically controlled by variants of large effect in a single gene.
A technique for studying the relationship between a biomarker and disease indirectly by studying the relationship of each to a gene that influences the biomarker.
Ordinary genomics studies the genome of a single organism. Metagenomics is the simultaneous study of a collection of many different species’ genomes in a single sample, typically that of microbial communities.
The enzymatic process of adding a methyl group to a lysine or an arginine residue on histone tails or other proteins. Alternatively, methyl groups can be added to DNA itself on cytosine bases.
Refers to transcription factor (TF) motifs that are bound with higher affinity when CpG dinucleotides within the motif are methylated.
Refers to transcription factor (TF) motifs that are bound with lower affinity when CpG dinucleotides within the motif are methylated.
Methylated DNA is identified by shotgun sequencing of bisulphite-converted DNA.
(Bisulphite conversion followed by sequencing, BS–seq, BS sequencing)
An enzyme that generates cuts preferentially within linker DNA between nucleosomes and in nucleosome-depleted regions. Coupling MNase digestion of chromatin with next-generation sequencing generates maps of nucleosome position and density.
(MNase)
Evolutionary processes or changes over relatively short time periods — such as change in allele frequencies, genotypic composition or gene expression — within or between populations.
The small nuclear structures that reside in the cytoplasm and contain damaged DNA fragments which were not incorporated into the main nucleus after mitosis.
A type of genetic marker in which individuals vary in their number of tandemly repeated copies of a short DNA unit.
Migration-drift genetic equilibrium
The balance between the loss of alleles through genetic drift and the gain of alleles through migration.
Minimum-description length approaches
A concept from information theory, in which all of the information contained in a system (for example, a sample of DNA sequences) is described in the most compact form possible.
A region of DNA in which repeat units of 10–50 bp are tandemly arranged in arrays 0.5–30 kb in length.
Ranging from 0 to 50%, this is the proportion of alleles at a locus that consists of the less frequent allele. This number does not take genotype into account.
(MAF)
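A minimal Python sketch of the calculation, assuming a bi-allelic locus and a flat list of observed allele copies (two per diploid individual); the function name is illustrative.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the less common allele at a bi-allelic locus.

    `alleles` lists every sampled allele copy, so genotypes are not
    taken into account. A monomorphic locus returns 0.0.
    """
    counts = Counter(alleles)
    if len(counts) > 2:
        raise ValueError("expected at most two alleles")
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / sum(counts.values())


print(minor_allele_frequency(list("AAAG")))
```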
A protein complex involved in the early stages of processing microRNA (miRNA) and RNA interference in animal cells.
A DNA-repair pathway that removes mismatched bases and corrects the insertion or deletion of short stretches of (repeated) DNA.
A heterogeneous complex composed of mitochondrial RNA and proteins involved in RNA regulation.
(MRG, MRGs)
Molecular mechanism driving circadian rhythms, consisting of transcriptional–translational feedback loops of core clock genes.
The use of molecular genetic techniques (for example, multiplex PCR, pulsed-field gel electrophoresis, Southern blotting and multilocus sequence typing) to genetically compare and characterize bacterial genomes.
A form of DNA lesion induced by DNA damaging agents, such as ultraviolet radiation, which on longer exposure can be converted into covalent crosslinks in the DNA. Mono-adducts can, to an extent, induce recombination in yeast, mammalian and bacterial cells.
Refers to one genetic variant located on one allele of a gene.
Distinctive phenotypes. Organisms that are classified together on the basis of similar physical features without knowledge of their genetic relationships.
An organism that consists of cells of more than one genotype. The strict definition requires that the genotypically different cells all derive from a single zygote. The term mosaic is also used more broadly to describe any organism comprised of cells of different genotypes.
A condition in which an animal contains multiple cell lineages with different genotypes.
Multi-locus genetic approaches
Genetic methods that make use of information from many loci; such approaches use nuclear loci because mitochondrial genes are typically inherited as one locus.
The accumulated deleterious alleles that are carried by a population at any given time.
A region in which the frequency of mutation is greater than expected, owing to specific structural and/or functional features of the protein or gene.
A segment of underwound DNA in which the two strands wind around the helical axis less than 360° every 10.5 bp and retain twist strain (free energy).
The random acquisition of a new function in the course of the accumulation of neutral mutations in duplicated genes.
Pairs of structurally distinct restriction enzymes with the same recognition sequence but with different cleavage positions.
The process by which a DNA sequence acquires many mutations over time that have no phenotypic effect, and are not acted on by Darwinian selection.
Loci that are not evolving directly in response to selection, the dynamics of which are controlled mainly by genetic drift and migration. These loci can, however, be influenced by selection on nearby (linked) loci.
A gene that has originated recently in the relevant evolutionary timescale.
Here, we define this as the use of established sequencing platforms, including the Illumina/Solexa Genome Analyzer, Roche/454 Genome Sequencer and Applied Biosystems SOLiD platforms, as well as newer platforms, such as the Helicos and Pacific Biosciences platforms.
(Next generation sequencing, NGS)
Cas9 that has either its HNH or RuvC nuclease domain catalytically inactivated, resulting in a Cas9 enzyme that can only cut one strand of targeted double-stranded DNA.
(nCas9)
The process of the accumulation of neutral mutations in a duplicated gene that renders the gene copy non-functional.
(Pseudogenization)
An error-prone mechanism for repairing double-stranded breaks in DNA involving the ligation of two free DNA ends.
(NHEJ)
A genetic variant that changes a codon for one amino acid to another amino acid. Many non-synonymous variants are well-tolerated, but others can cause a disease.
The area at the edge of the nucleus. It is normally associated with gene silencing.
An assay that directly measures the transcriptional activity of a gene by incorporation of labelled UTP into its mRNA.
A complex composed of mitochondrial DNA and its associated proteins that regulate the organization and expression of the mitochondrial genome.
The basic unit of chromatin, containing ∼147 bp of DNA wrapped around a histone octamer (which is composed of two copies each of histone 3 (H3), H4, H2A and H2B).
The distribution (or range) of values across which we expect to observe the value of the test statistic if the null hypothesis is true (for example, neutrality). When conducting a standard t-test, t is the test statistic and the null distribution is Student's t distribution with the appropriate degrees of freedom.
(Neutral distribution)
The odds of carrying a genetic variant (or other hazard exposure) in cases compared with controls. It can be used as a measure of effect size in case–control association studies. An odds ratio significantly different from one suggests that the genetic variant is associated with the disease or trait.
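As a worked illustration, the Python sketch below computes the odds ratio from carrier counts in cases and controls. The counts are invented, and the significance test mentioned in the definition is not included.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds of carrying the variant in cases divided by the odds in controls."""
    odds_in_cases = case_carriers / case_noncarriers
    odds_in_controls = control_carriers / control_noncarriers
    return odds_in_cases / odds_in_controls


# Toy counts: 30 of 100 cases and 10 of 100 controls carry the variant.
print(odds_ratio(30, 70, 10, 90))
```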
The effects arising due to non-specific and unintended targeting of DNA targeting modules such as zinc fingers, transcription activator-like effector (TALE) and CRISPR in the genome.
Short fragments of DNA produced by discontinuous replication on the lagging strand during DNA replication. Because the template for lagging strand synthesis is exposed in the 5′–3′ direction at the progressing replication fork, the nascent strand is composed of sequential Okazaki fragments created by DNA polymerase working backwards from the replication fork.
Oligonucleotides that contain a single interrogation base in a known position. The base corresponds to a fluorescent label on each probe. The remaining bases are either degenerate (any of the four bases) or universal (unnatural bases with nonspecific hybridization), allowing the probe to interact with many different possible template sequences.
A formal system for organizing knowledge, here used in the context of biological pathways as a means of synthesizing information about the function of genes and exposures and their joint roles in disease causation.
Genes that do not share any homology with genes from other species.
A cellular environment or host into which genetic material is transplanted to avoid undesired native host interference or regulation. Orthogonal hosts are often organisms with sufficient evolutionary distance from the native host.
Sequences, or genes, that have originated from a common ancestral sequence, or gene, by a speciation event.
Genome locations (or markers or base pairs) that show behaviour or patterns of variation that are extremely divergent from the rest of the genome (locus-specific effects), as revealed by simulations or statistical tests.
A method for simultaneously capturing and amplifying large numbers of regions of interest from whole genomic DNA. Each padlock probe has two complementary oligonucleotide sequences that flank a region of interest. The sequences are joined by a loop of DNA that ensures efficient joint hybridization and contains sequences for PCR with universal primers.
In paired-end sequencing, a DNA template is sequenced from both sides; the forward and reverse reads may or may not overlap. A deviation in the expected genome alignment between two ends of a paired-end read can indicate a structural variation.
A class of segmentation gene that determines segments along the anterior–posterior axis. The expression of pair-rule genes in a pattern of seven stripes that are perpendicular to the axis is regulated by another class of segmentation genes: the gap genes.
Pairwise linkage disequilibrium
The strength of association between alleles at two different markers.
(Pairwise LD)
Sequences, or genes, that have originated from a common ancestral sequence, or gene, by a duplication event.
As applied to phylogenetic reconstruction, a criterion for estimating historical changes by minimizing the number of substitution events that are required to explain how one DNA sequence evolves into another.
Partial epigenetic reprogramming
Delivery of factors that can de-differentiate cells into induced pluripotent stem cells, typically short term, to de-age the epigenetic state of cells.
Genomic islands that contain genes that are required for virulence. These islands are usually absent from non-pathogenic organisms and are acquired by horizontal gene transfer.
The study of drug interactions with the genome or proteome; also called toxicogenomics.
A process through which polymer chains (or segments of a polymer chain) spontaneously de-mix and segregate through the formation of immiscible phases.
Determining the haplotype phase (the arrangement of alleles at two loci on homologous chromosomes) from genotype data using statistical methods.
The production of a phenotype as a result of environmental factors, such as stress, which closely resembles a phenotype that normally results from specific gene expression or from gene mutation.
The study of the geographic distribution of phylogenetic lineages, usually within species, to reconstruct the origins and diffusion of lineages.
A representation of the physical distance between genes or genetic markers.
A small circular molecule of DNA found in bacteria that replicates independently of the main bacterial chromosome; plasmids code for some important traits for bacteria and can be used as vectors to transport DNA into bacteria in genetic engineering applications.
A phenomenon in which a gene can influence two or more independent characteristics.
Self-associating compartment-like structures marked by histone 3 lysine 27 trimethylation (H3K27me3).
(PADs)
Diseases that are mediated by numerous genetic variants that each individually contribute small effects.
Also known as polygenic risk score. A score that summarizes genetic liability to a trait or disease and is typically calculated by aggregating the weighted effect of many trait-associated genetic variants.
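The aggregation step can be sketched in Python as a weighted sum of allele dosages. The variant identifiers, effect weights and dosages below are hypothetical; real scores take their weights from association studies and usually apply further normalization, which is not shown.

```python
def polygenic_score(effect_weights, dosages):
    """Sum of effect weight times allele dosage over scored variants.

    `effect_weights` maps variant ID to per-allele weight; `dosages`
    maps variant ID to 0, 1 or 2 copies of the effect allele. Variants
    missing from `dosages` are skipped (an assumption of this sketch).
    """
    return sum(
        weight * dosages[variant]
        for variant, weight in effect_weights.items()
        if variant in dosages
    )


weights = {"variant_1": 0.2, "variant_2": -0.1, "variant_3": 0.05}  # hypothetical
person = {"variant_1": 2, "variant_2": 1}                            # hypothetical
print(polygenic_score(weights, person))
```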
A technique used to make multiple copies of DNA.
(PCR)
Contiguous sequences of nucleotides incorporated by the DNA polymerase while reading a template. These reads include sequences from adapters and can represent sequences from multiple passes around a circular template.
DNA structures containing many paired sister chromatids, which are produced by multiple rounds of DNA replication without cell division.
A marked reduction in population size followed by the survival and expansion of a small random sample of the original population. It often results in the loss of genetic variation and more frequent matings among closely related individuals.
The process of making inferences about the evolutionary and demographic history of a gene (or organism) on the basis of data on genetic variation in a species.
Parameters that characterize populations such as gene flow, migration rates, effective size, change in size, relatedness and phylogeny.
The phenomenon of an apparently homogeneous population that is actually composed of subgroups of individuals with distinct ancestral origins and differing allele frequencies at many loci. This leads to bias in the assessment of the significance of associations of a trait with particular loci.
Genetic differences between individuals as a consequence of the distribution of individuals in partially isolated populations.
Variegated expression patterns that arise owing to intercellular differences in epigenetic gene silencing, typically observed when reporter genes are brought into proximity with heterochromatin.
A process by which natural selection favours a single beneficial genotype over other genotypes and may drive this genotype to a high frequency in a population.
Pre-ascertained single nucleotide polymorphisms
SNPs that have already been detected in previous studies, usually from an extremely small sample of chromosomes.
(Pre-ascertained SNPs)
SNPs that are confined to a single population.
A proteoform is a defined form of a protein derived from a given gene with a specific amino acid sequence and localized post-translational modifications.
Study of the function of proteomes.
A gene that promotes the specialization and division of cells; however, when it is mutated or expressed at high levels, it causes abnormal cellular growth.
Phage or plasmid sequences that match one or more clustered, regularly interspaced short palindromic repeat (CRISPR) spacer sequences and are targeted during CRISPR interference.
A region on a sex chromosome that is homologous between the X chromosome and the Y chromosome. Successful meiosis in males requires a crossover in this pseudoautosomal region.
A photosensitizing chemical that is used for determining RNA–DNA structures in cells, and intercalates between two strands in duplex DNA. When attached to an oligonucleotide, psoralen forms interstrand crosslinks. When exposed to ultraviolet light, it forms photoadducts, crosslinked chemical bonds within adjacent bases.
A non-nested structural RNA motif formed upon base-pairing between the loop of a secondary structure element (such as a stem-loop (SL)) and any complementary region along the RNA.
Selection against deleterious alleles that arise in a population, preventing their increase in frequency and assuring their eventual disappearance from the gene pool.
A locus that controls a quantitative phenotypic trait, identified by showing a statistical association between genetic markers surrounding the locus and phenotypic measurements.
(QTL)
Qualitative traits consist of a discrete number of classes, such as 'affected' and 'unaffected'.
(Simple trait)
Quantitative traits occur with a continuous distribution.
(Complex trait)
A three-stranded nucleic acid structure that contains a DNA:RNA hybrid and a displaced strand of DNA.
Radiation-induced interspecific cell hybrids.
(RH)
Radiation hybrid (RH) mapping, a somatic cell genetic technique, was developed as a general approach for constructing long-range maps of mammalian chromosomes. This statistical method depends on x-ray breakage of chromosomes to determine the distances between DNA markers, as well as their order on the chromosome.
(RH mapping)
Random fluctuations in allele frequencies between generations owing to sampling effects. It increases as the effective population size decreases.
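The dependence on effective population size can be made concrete under the Wright–Fisher model, which is an assumption of this sketch rather than something the definition specifies. In that model the variance of the allele-frequency change in one generation is p(1 − p) / (2Ne) for a diploid population.

```python
def drift_variance(allele_frequency, effective_size):
    """One-generation variance in allele frequency (diploid Wright-Fisher)."""
    p = allele_frequency
    return p * (1 - p) / (2 * effective_size)


# Smaller effective size gives larger sampling variance.
for n_e in (10, 100, 1000):
    print(n_e, drift_variance(0.5, n_e))
```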
Denotes the probability of mutation from one amino acid to another (or from one nucleotide to another) for a given period of evolution. The best-known rate matrices are BLOSUM and PAM.
The sequence of bases from a single molecule of DNA (or RNA).
The means by which the 10X Genomics platform determines a synthetic long read. Discontinuous linked reads from the same genomic region are aligned to each other. No single linked read contains the entire long sequence; however, when they are stacked, full coverage is achieved.
The highest-quality single sequence for an insert, regardless of the number of passes.
A sequencing strategy used in the Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) platforms. In these approaches there is no pause after the detection of a base or series of bases, thus the sequence is derived in real-time.
A combination of DNA fragments generated by molecular cloning that does not exist in nature.
A protein that is expressed from recombinant DNA molecules.
The proportion of offspring that receives a recombinant haplotype from a parent, or the probability that recombination occurs between two loci.
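As a worked illustration of the definition above, the sketch below estimates the recombination frequency from offspring counts. The conversion to map distance uses Haldane's mapping function, which assumes no crossover interference; that assumption and the function names are illustrative and not part of the glossary.

```python
import math


def recombination_frequency(n_recombinant, n_parental):
    """Proportion of offspring that carry a recombinant haplotype."""
    total = n_recombinant + n_parental
    if total == 0:
        raise ValueError("no offspring were scored")
    return n_recombinant / total


def haldane_map_distance_cm(r):
    """Map distance in centimorgans under Haldane's function (no interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination frequency must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


if __name__ == "__main__":
    r = recombination_frequency(n_recombinant=20, n_parental=180)
    print(r, haldane_map_distance_cm(r))
```

For small values of r the map distance is close to 100 × r centimorgans; the two diverge as r approaches 0.5 because multiple crossovers between the loci go undetected.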
A region of the genome in which the per-generation recombination rate is substantially elevated above the genome-wide average.
The early, visible manifestations of sites of chiasmata and crossing over. They are recognized by immunochemical staining, typically for the protein MutL homologue 1, which is a component of late recombination nodules.
A process for identifying complex relationships in large sets by dividing them into a hierarchy of smaller and more homogeneous subgroups on the basis of the most statistically significant indicators.
Reduced representation bisulphite sequencing
This technique cuts genomic DNA with restriction enzymes to enrich for CG-rich regions, which are then converted through bisulphite treatment and sequenced with next-generation sequencing. Bisulphite treatment converts unmethylated C to uracil, which appears as T in sequencing reads, while leaving methylated C intact.
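The conversion step can be sketched in a few lines. The example below models only the top strand and ignores incomplete conversion and sequencing errors; the sequences and function names are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return the read expected after bisulphite treatment of one strand.

    Unmethylated C is read as T; C at a methylated position (0-based) is unchanged.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylated(reference, read):
    """Positions where the reference has C and the converted read still shows C."""
    pairs = zip(reference.upper(), read.upper())
    return [i for i, (ref, obs) in enumerate(pairs) if ref == "C" and obs == "C"]


if __name__ == "__main__":
    reference = "ACGTCCGA"
    read = bisulfite_convert(reference, methylated_positions={1})
    print(read)                              # ACGTTTGA
    print(call_methylated(reference, read))  # [1]
```

Comparing the converted read with the untreated reference recovers the methylated positions, which is the principle behind methylation calling from bisulphite data.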
Region in genomic DNA that can contribute to gene regulation.
A group of transcriptional units or operons that are coordinately controlled by a regulator.
The ability of telomerase to synthesize multiple telomeric repeats without dissociating from the telomere.
(RAP)
Cloning of entire organisms.
The relative ability of a genotype to pass on its genetic material to the next generation. Often measured as the proportion of offspring generated relative to other genotypes in the population.
In the context of recombination, strand-biased enzymatic removal of nucleotides at the site of a double-strand break. In most recombination models, resection occurs in the 5′ to 3′ direction.
An enzyme that recognizes a specific nucleotide sequence in DNA and cuts the DNA double strand at that recognition site, often with a staggered cut leaving short single strands or “sticky” ends.
(RE)
Restriction fragment length polymorphism
A fragment length variant in DNA sequences that is generated through the gain or loss of a restriction site owing to a DNA substitution.
(RFLP)
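The restriction enzyme and RFLP entries above can be illustrated together. The sketch below digests the top strand of two made-up alleles with EcoRI, which recognizes GAATTC and cuts after the G; the allele sequences and the function name are hypothetical, and the complementary strand is not modelled.

```python
def digest(seq, site="GAATTC", cut_offset=1):
    """Cut the top strand at every occurrence of a recognition site.

    cut_offset is the number of bases of the site left on the upstream fragment
    (1 for EcoRI, which cuts G^AATTC).
    """
    seq = seq.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


if __name__ == "__main__":
    allele_1 = "AAGAATTCTTGAATTCGG"  # two EcoRI sites
    allele_2 = "AAGAATTCTTGAGTTCGG"  # one substitution removes the second site
    print([len(f) for f in digest(allele_1)])  # [3, 8, 7]
    print([len(f) for f in digest(allele_2)])  # [3, 15]
```

A single base substitution removes one site, so the two alleles yield different fragment lengths, which is the signal an RFLP assay detects.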
Restriction-modification system
A set of enzymes found in many bacteria and archaea that protects the host genome from genomic parasites. Restriction–modification systems consist of sequence-specific restriction endonucleases, which target invading DNA, and associated DNA methyltransferases with similar recognition sequences, which protect the host genome from the action of the endonucleases.
A mobile genetic element. Its DNA is transcribed into RNA, which is reverse-transcribed into DNA and then inserted into a new location in the genome.
A form of genetic analysis that manipulates DNA to disrupt or affect the product of a gene to analyze the gene’s function.
The full range of RNA structures formed by the transcriptome of an organism.
A method of sequencing cDNA derived from RNA. This approach can be used to sequence both coding and non-coding RNA.
(RNA-seq, RNA sequencing)
A method of DNA amplification using a circular template. Briefly, DNA polymerase binds to a primed section of a circular DNA template. As the polymerase traverses the template, a new strand is synthesized. When the polymerase completes a full circle and encounters the double-stranded DNA (dsDNA) template, it displaces the template without degradation, thus creating a long ssDNA fragment composed of many copies of the template sequence.
(RCA)
An approach in which dye-labeled normal deoxynucleotides (dNTPs) and dideoxy-modified dNTPs are mixed. A standard PCR reaction is carried out and, as elongation occurs, some strands incorporate a dideoxy-dNTP, thus terminating elongation. The strands are then separated on a gel and the terminal base label of each strand is identified by laser excitation and spectral emission analysis.
A subfraction of genomic DNA consisting of short repetitive nucleotide sequences that are repeated a large number of times. These non-coding repeats are important for centromere and heterochromatin construction and separate from the rest of the genomic DNA on a density gradient because of their higher content of AT base pairs.
Sets of ordered and oriented contigs, with the approximate distances between contigs estimated by traversing paired-end sequences that anchor to different contigs. Scaffolds consist of both sequence contigs and gaps.
A short exact, or nearly exact, matching string of characters aligning between two sequences.
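A minimal sketch of seed finding, assuming exact matches of a fixed length k (the definition above also allows nearly exact seeds, which are not handled here). The function name and sequences are hypothetical.

```python
def find_seeds(a, b, k):
    """Return (i, j) pairs such that a[i:i+k] == b[j:j+k]."""
    # Index every k-mer of the first sequence by its start position.
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)
    # Look up every k-mer of the second sequence in that index.
    hits = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(find_seeds("ACGTACGT", "TTACGTT", 4))  # [(3, 1), (0, 2), (4, 2)]
```

Alignment tools typically use such hits as anchors and extend them into longer alignments; indexing one sequence first avoids comparing every pair of positions.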
Segmentation genes that are required for patterning the body along the anterior–posterior axis. They are expressed in a pattern of 14 stripes at the onset of gastrulation and following the expression of pair-rule genes.
The average proportional reduction in fitness of one genotype relative to another owing to selection.
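As a small worked example, relative fitness and the selection coefficient can be computed from offspring counts. The sketch assumes one common convention that is not stated in this glossary: fitness is scaled so that the fittest genotype has w = 1, and s = 1 - w. The genotype labels and counts are invented for illustration.

```python
def relative_fitness(offspring_per_genotype):
    """Scale mean offspring numbers so that the fittest genotype has w = 1."""
    best = max(offspring_per_genotype.values())
    if best <= 0:
        raise ValueError("at least one genotype must leave offspring")
    return {g: n / best for g, n in offspring_per_genotype.items()}


def selection_coefficients(offspring_per_genotype):
    """Selection coefficient s = 1 - w, relative to the fittest genotype."""
    w = relative_fitness(offspring_per_genotype)
    return {g: 1.0 - value for g, value in w.items()}


if __name__ == "__main__":
    counts = {"AA": 100, "Aa": 90, "aa": 50}  # hypothetical mean offspring numbers
    print(relative_fitness(counts))
    print(selection_coefficients(counts))
```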
The molecular footprint of a selection event from the recent past (for example, an excess of rare alleles at a locus relative to the abundance of rare alleles at loci across the rest of the genome).
The increase in frequency of an allele (and closely linked chromosomal segments) that is caused by selection for the allele. Sweeps initially reduce variation and subsequently lead to a local excess of rare alleles (homozygosity excess) as new unique mutations accumulate.
A cysteine protease that cleaves the α-kleisin subunit of cohesin at the onset of anaphase to allow sister chromatid disjunction.
An approach to genetic association studies that is focused on a set of genetic markers, often now called tagging SNPs, which are statistically associated with whichever variants influence the phenotype.
This uses oligonucleotide microarrays or oligonucleotide-coupled beads to select for regions of the genome, such as all exons (exome sequencing), for targeted sequencing.
Genes that are transcribed at different levels in males and females. Often thought to be a major underlying mechanism for sexually dimorphic phenotypes.
Single nucleotide polymorphism
SNVs that occur in > 1% of individuals in a sampled population are usually referred to as single-nucleotide polymorphisms (SNPs).
(SNP)
Sequence variations that include insertions and deletions in addition to base substitutions (which are known as SNPs).
(SNV)
In single-end sequencing, a DNA template is sequenced only in one direction.
A single-guide RNA molecule, composed of a CRISPR RNA (crRNA) fused to its corresponding trans-activating CRISPR RNA (tracrRNA) scaffold sequence, that directs the binding and nuclease activity of Cas9 enzymes.
(sgRNA)
The single-molecule real-time (SMRT) sequencing approach from Pacific Biosciences (PacBio) enables a single molecule of DNA to be sequenced multiple times. A single pass is one single iteration through a molecule.
A method for enriching whole genomic DNA for many regions of interest by hybridization to a complex library of RNA or DNA sequences in solution, followed by retrieval of the annealed hybrids.
The process by which the nucleus from an adult cell is transferred into a previously enucleated cell; the reconstructed oocyte is activated, which initiates subsequent development.
(SCNT)
In Mendelian disorders, an in vivo somatic genetic event that partially or totally counteracts the deleterious effect of the pathogenic germline mutation and provides a selective advantage over non-somatically modified cells.
A complex global response to DNA damage identified in bacteria that includes activation of multiple factors, leading to the stalling of cell division and alteration of DNA replication, recombination and repair to promote genome integrity and cell survival, at the cost of increased mutagenesis.
Features that arise as an unselected byproduct of selectively adaptive features, which are therefore easily co-opted to a new function.
A process of improvement of different aspects of gene function in each gene copy, which is driven by positive selection.
A cytogenetic technique used to simultaneously visualize all chromosomes in a cell by using different fluorescently labelled probes for each chromosome.
(SKY)
A large RNA–protein complex that catalyses the removal of introns from nuclear pre-mRNA.
A variant, usually found at the intron–exon boundary, that alters the splicing of an exon to its surrounding exons.
Selection that favours intermediate phenotypes over extreme phenotypes.
The step-by-step build-up of a regression model, which represents a dependent variable as a weighted sum (linear combination) of independent (risk) variables.
When both ends of a segment of DNA are anchored (for example, by proteins) and the DNA is pulled mechanically, it carries stretching tension coupled with twisting torsion along the helix and can be elongated by up to 70% without disrupting base pairs.
A variation larger than single-nucleotide polymorphisms (SNPs). This can include the insertion or deletion of blocks of DNA, inversions or translocations of DNA segments, and copy-number differences.
The process of the accumulation of degenerate mutations in gene copies that subdivides gene function among the duplicated genes. This term has been introduced to describe the mechanism of the duplication–degeneration–complementation model, but it is often used indiscriminately to describe any subdivision of function.
The sequences derived from a single pass as a polymerase traverses a DNA molecule multiple times. A subread is trimmed to exclude any adapter sequence.
Changes in the nucleotide sequences of coding genes that result in changes in the peptide sequence (that is, the replacement of an amino acid). These contrast with silent (or synonymous) changes in coding sequences, which do not result in changes in the peptide.
(Replacement changes)
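The distinction between nonsynonymous and synonymous changes can be checked by translating the codon before and after a substitution. The sketch below assumes the standard genetic code and DNA codons; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon_before, codon_after):
    """Label a codon change as 'synonymous' or 'nonsynonymous'."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    return "synonymous" if before == after else "nonsynonymous"


if __name__ == "__main__":
    print(classify_change("GAA", "GAG"))  # synonymous: Glu -> Glu
    print(classify_change("GAG", "GTG"))  # nonsynonymous: Glu -> Val
```

Changes that create or remove a stop codon are reported as nonsynonymous by this sketch, although they are often classified separately as nonsense or stop-loss changes.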
Twists applied to DNA that can occur in the same (positive) or opposite (negative) orientation to the double helix.
Multi-kilobase stretches of regulatory DNA that exhibit unusually strong occupancy of transcription factors and co-factors.
Chromosomal regions that encompass multiple genes that are inherited together because of close genetic linkage. Often supergenes are associated with chromosomal inversions, which prevent recombination with the alternative allele.
A process that bars conjugative transfer of a plasmid into recipient cells that already harbour a related plasmid.
Genetic divergence that leads to species formation in the same habitat.
A proteinaceous structure that forms between pairs of homologous chromosomes during synapsis and facilitates crossover recombination.
Short aligned segments between genome sequences from two species, which are believed to define an orthologous relationship.
A genomic region that is collinear in the order of genes (or of other DNA sequences) in a chromosomal region of two species.
Increased transcriptional activity of regions of the genome where extrachromosomal DNAs (ecDNAs) make a physical connection, similar to the effect of aneuploidy.
A genetic interaction in which the deletion of two genes at the same time results in lethality. An organism in which one gene is deleted and the other gene is present will still be viable.
An approach that investigates a biological phenomenon by assaying a wide range of levels of biological organization, from individual proteins to entire cellular networks.
A SNP chosen from a larger set of available SNPs for use in an association study. Tag SNPs are generally selected on the basis of favourable linkage disequilibrium properties.
Identifying sub-sets of markers ('tags') that describe patterns of association or haplotypes among larger marker sets.
A genetic marker that is correlated to a number of neighbouring variants such that the genetic information it contains is representative of these variants.
The process by which double-stranded DNA is cleaved by the transposase Tn5, creating short DNA fragments that are simultaneously tagged with PCR adapters. Tagmentation using Tn5 preferentially occurs at accessible or open chromatin and this property is used in ATAC-seq and other related assays.
The utility of SNPs chosen as tags in one population for use as tags in another population.
Extrachromosomal circular DNA molecules that contain telomeric repeat sequences.
A short repeat sequence of DNA at the end of chromosomes, which both protects and ensures the complete replication of chromosome ends.
A DNA fragment to be sequenced. The DNA is typically ligated to one or more adapter sequences where DNA sequencing will be initiated.
The process by which RNA templates are switched between viral genomes during reverse transcription.
The summary value (often a summary statistic) of a data set that is compared with a statistical distribution to determine whether the data set differs from that expected under a null hypothesis.
These are cis-regulatory elements that function to bring together distal DNA elements.
Quantitative traits that are discretely expressed in a limited number of phenotypes (usually two), but which are based on an assumed continuous distribution of factors that contribute to the trait (underlying liability).
A class of enzymes that are able to cleave one or both strands of DNA to release topological stress on the DNA duplex, and to link or unlink, knot or unknot associated DNA molecules.
Topologically associating domains
These are defined on population-level contact-frequency maps as domains of higher interaction frequency within a region than between regions.
(TAD, TADs)
Molecular complexes consisting of extrachromosomal DNAs (ecDNAs) and transcription machinery components, with high transcriptional activity of ecDNA sequences.
The uniformity of gene expression in a cell population, defined as a low variance in expression when scaled to the average level of expression.
Describes a cellular state in which very low to no active gene expression is observed, for example, in fully differentiated gametes.
The transfer of genetic information from one bacterial or archaeal cell to another by a phage particle containing chromosomal DNA.
Transfection is a procedure that introduces foreign nucleic acids into cells to produce genetically modified cells.
Genetic alteration of a cell resulting from the acquisition of genes from free DNA molecules in the surrounding environment.
Translesion synthesis polymerases
Polymerases that can catalyze DNA polymerization at damaged templates during replication and/or repair, although often with lower fidelity than replicative polymerases.
DNA sequences in the genome that replicate and insert themselves into various positions in the genome.
(TE, transposons, mobile elements)
The ability of a gene on one chromosome to influence the activity of an allele on the opposite chromosome when the chromosomes are paired.
A terminal deoxyuridine 5′-triphosphate nick-end-labelling assay. It involves the enzymatic labelling of the 3′ ends of partially degraded DNA in a cell undergoing apoptosis (and some other forms of cell death).
An assay system in which one protein is fused to an activation domain and the other to a DNA-binding domain, and both fusion proteins are expressed in cells. Expression of a reporter gene indicates that the two proteins physically interact.
Oligonucleotides that contain two adjacent interrogation bases in a known position. The bases correspond to a fluorescent label on each probe. The remaining bases are either degenerate (any of the four bases) or universal (unnatural bases with nonspecific hybridization) allowing the probe to interact with many different possible template sequences.
A system in which bases are discriminated by labelling Cs and Ts with a red or green fluorophore, respectively. Each A base is labelled with either a red or green fluorophore, but the two populations are mixed. During base discrimination, clusters that are either red or green are called either C or T, whereas clusters with a red and green mixed signal are called A. The G base is unlabelled, thus any cluster without a fluorophore signal is called G.
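The base-calling rule described above amounts to a lookup on two signals. This is a minimal sketch with a hypothetical function name; it treats each channel as simply present or absent, whereas real instruments work with continuous intensities.

```python
def call_base(red, green):
    """Call a base from the presence or absence of signal in two colour channels."""
    if red and green:
        return "A"  # mixed red and green signal
    if red:
        return "C"  # red only
    if green:
        return "T"  # green only
    return "G"      # no fluorophore signal


if __name__ == "__main__":
    signals = [(True, False), (False, True), (True, True), (False, False)]
    print("".join(call_base(r, g) for r, g in signals))  # CTAG
```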
Unequal sister chromatid exchange
A mitotic crossover event that leads to the exchange of genetic material between sister chromatids and is also a major repair pathway for double-strand breaks.
Both copies of a chromosome derived from one parent.
Refers to both copies of a chromosome originating from one parent (maternal or paternal) and the chromosome from the other parent being absent. Segmental uniparental isodisomy occurs when only part of a chromosome is affected.
An unpaired chromosome at metaphase I: usually one that has failed to synapse or recombine with its homologue.
Sequence data in which the phase of double heterozygotes was not determined.
Upstream open reading frames (uORFs) are cis-acting elements located within the 5'-leader sequence of transcripts and are defined by an initiation codon in-frame with a termination codon located upstream or downstream of its main ORF (mORF) initiation codon.
(uORF, uORFs)
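The definition above can be turned into a simple scan of the 5′ leader. The sketch below assumes ATG as the only initiation codon and a DNA-alphabet transcript; the example sequence, the main ORF start coordinate, and the function name are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(transcript, main_start):
    """Find uORFs whose ATG lies entirely within the 5' leader.

    main_start is the 0-based index of the main ORF initiation codon.
    Returns (start, end) pairs, where end is the index just past the stop codon;
    the stop may fall upstream or downstream of the main ORF start.
    """
    seq = transcript.upper()
    uorfs = []
    for start in range(0, main_start - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon, in frame with this ATG, until the first stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


if __name__ == "__main__":
    transcript = "GGATGAAATAGCCGCCATGGCTTAA"  # main ORF starts at index 16
    print(find_uorfs(transcript, main_start=16))  # [(2, 11)]
```

An upstream ATG with no in-frame stop codon before the end of the transcript is not reported, in line with the requirement for an in-frame termination codon.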
The nucleus of a terminally differentiated vegetative cell. It does not contribute genetic information to subsequent generations.
A gene responsible for the production of a molecule that contributes to the establishment of disease by bacterial pathogens.
A plasmid that carries virulence factor genes or pathogenicity islands.
Whole-exome and targeted sequencing
Sequencing of only exons or other selected regions. A system of capture or amplification is used to isolate or enrich for only exons or target regions. This is done by designing probes or primers for the regions of interest.
Sequencing of the entire genome without using methods for sequence selection.
(WGS)
The stage of development, which can vary widely between species, at which expression of the embryonic genome is strongly activated and thus control of development transfers from the maternal to the embryonic contribution.