The bioinformatics section of this guide provides definitions and resources for bioinformatics-related terms.
Term In Bolded Blue - Each term is linked to the original source from which its definition was taken, and the link also provides access to additional resources. Clicking the link opens a new window that highlights where the definition appears within the source text.
Definition - The definitions have been copied directly from the sources they were selected from. The initial sources and related attributions can be found by clicking on the linked term.
(Alternative phrasing) - These are alternative phrases for the terms, which can be synonyms, acronyms, or other ways of referring to the concept.
[See also] - These are terms that are related to the defined term.
A hardware device or software program that enhances the overall performance of the computer. A software accelerator implements as many system functions as possible in software and moves performance-critical functions into special-purpose external hardware to reduce compute time.
A graph indicating where adapter sequences occur in the reads.
Genetic mapping strategy that uses individuals whose genomes are mosaics of fragments that are descended from genetically distinct populations. This method exploits differences in allele frequencies in the founders to determine ancestry at a locus to map traits in a way that is broadly similar to an advanced intercross.
A standard χ2 (1 degree of freedom) association test computed as the number of samples times the squared correlation between genotype and phenotype.
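Restated as a formula (this simply rewrites the definition above), with N the number of samples and r the sample correlation between genotype and phenotype:

```latex
\chi^2_{\text{trend}} = N\,r^{2}, \qquad \chi^2_{\text{trend}} \sim \chi^2_{1} \ \text{under the null hypothesis of no association}
```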
A consequence of collecting a non-random subsample with a systematic bias, so that results based on the subsample are not representative of the entire sample.
The compressed binary version of SAM is called a BAM file.
(Binary Alignment Map file format, BAM file, BAM)
[See also]: SAM file format
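As a minimal sketch, conversion between the two formats is commonly done with samtools (the file names here are hypothetical):

```bash
samtools view -b aln.sam -o aln.bam   # SAM -> compressed BAM
samtools view -h aln.bam > aln.sam    # BAM -> SAM, keeping the header (-h)
```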
Bash is the shell, or command language interpreter, for the GNU operating system.
A technique for accounting for uncertainty about the correct model form (for example, the selection of variables to include in a multiple regression model) by averaging the effects of each possible variable over the set of all plausible models.
A statistical school of thought in which the posterior probability distribution for any unknown parameter or hypothesis given the observed data is used to carry out inference. Bayes theorem is used to construct the posterior distribution using the observed data and a prior distribution, often allowing the incorporation of useful knowledge into the analysis.
A technique for developing a minimal graphical representation of the connections among a large set of variables by examining the conditional independence relationships among pairs of variables given the other variables connected to them within the graph. This technique has been widely used for the analysis of gene co-expression data.
An indexing approach for storing the presence or absence of k-mers in a dataset; they have been leveraged to considerably reduce the amount of space and still run in constant time. However, they can have high false positive rates (that is, query hits when there are none).
A multiple comparisons adjustment for testing at a conventional significance level. It is based on multiplying the p value for a specific test by the total number of tests performed, and approximately controls the overall type I error rate (the probability of at least one false positive association) at the chosen significance level if the predictors are independent.
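As a worked example of the adjustment (numbers illustrative):

```latex
p_{\text{adj}} = \min(1,\ m\,p); \qquad \text{with } m = 1000 \text{ tests and } \alpha = 0.05,\ \text{the per-test threshold is } 0.05/1000 = 5 \times 10^{-5}
```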
A software package for mapping low-divergent sequences against a large reference genome. (BWA)
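A minimal sketch of a typical run (reference and read file names are hypothetical):

```bash
bwa index reference.fa                                           # build the index once
bwa mem reference.fa sample_R1.fastq sample_R2.fastq > aln.sam   # align paired-end reads
```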
The command to change locations in your file system. (cd)
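For example (directory names hypothetical):

```bash
cd ~/projects/seq_data   # move into a specific directory
cd ..                    # move up one level
cd                       # with no argument, return to the home directory
```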
The sequence alignment map (SAM) file format’s compressed representation of a read alignment to a reference.
(Concise idiosyncratic gapped alignment report strings)
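For illustration, a few CIGAR strings and their meanings (lengths are arbitrary):

```
100M       100 bases aligned to the reference (matches or mismatches)
50M2I48M   50 aligned bases, a 2-base insertion in the read, then 48 more aligned bases
5S95M      5 soft-clipped bases followed by 95 aligned bases
```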
The use of computing resources distributed in the ‘cloud-shaped’ Internet to store, manage and analyze data, rather than doing so on a local server or personal computer.
Coalescent-based statistical methods
Methods of reconstructing population history by simulating the genealogy of genes back to the most recent common ancestor of all alleles currently in the population.
Algorithm complexity is generally measured as an upper bound on its long-term growth rate: how its runtime or space requirements grow as the input size grows, rather than their absolute magnitude, so constants are omitted. In practice, a set of algorithms can share the same asymptotic complexity even though some of them are a constant 2, 3 or even 1,000 times slower than their counterparts in the set.
The amount of compute power (for example, central processing units (CPUs) and memory) that can be requested, allocated and used for computing.
A table of observations of two or more variables that might have a statistical relationship of interest. For each variable, a contingency table places each observation into one of a series of categories.
The current working directory is our default directory, i.e., the directory in which the computer assumes we want to run commands. (cwd)
Spurious differences in allele frequencies between cases and controls due to differences in sample collection, sample preparation and/or genotyping assay procedures.
A digital logic gate implements Boolean logic (such as AND, OR, or NOT) on one or more logic inputs to produce a single logic output. Electronic logic gates are implemented using diodes and transistors and operate on input voltages or currents, whereas biological logic gates operate on cellular molecules (chemical or biological).
Computer languages tailored to a specific domain such as genomics.
A technique for estimating the effects of each component of a large ensemble of related variables by assuming the ensemble has some common distribution and estimating the parameters of that distribution. Empirical Bayes estimators typically have better prediction error than estimating each one separately.
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data.
A study design in which exon capture technologies are used to obtain resequencing data covering all exonic regions for each individual in the study.
Expectation-maximization algorithm
A method for finding maximum-likelihood estimates of parameters in statistical models, in which the model depends on unobserved latent variables. It is an iterative method which alternates between performing an expectation (E) step and a maximization (M) step.
A branch of statistical theory that is concerned with the asymptotic properties of the largest samples from a probability distribution, that is, those from the tail of the distribution.
This controls the proportion of all reported positive associations that are expected to be false positives, and can be used to judge which of many associations are noteworthy.
Sample structure due to familial relatedness among samples.
A set of three people, comprising an individual plus both of the parents. In genetic association studies, the term 'affected family trio' denotes an individual with the phenotype of interest plus both of the parents, who effectively serve as controls.
A class of association tests that uses families with one or more affected children as the subjects rather than unrelated cases or controls. The analysis treats the allele that is transmitted to (one or more) affected children from each parent as a 'case' and the untransmitted alleles as 'controls' to avoid the effects of population structure.
FASTQ is a format for storing information about sequencing reads and their quality.
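Each FASTQ record spans four lines: an identifier beginning with @, the nucleotide sequence, a separator line beginning with +, and a quality string with one character per base. The record below is a made-up example:

```
@SEQ_ID_001
GATTTGGGGTTCAAAGCAGT
+
!''*((((***+))%%%++)
```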
A program to visualize the quality of sequencing reads. Rather than looking at quality scores for each individual read, FastQC looks at quality collectively across all reads within a sample.
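A minimal invocation (file and directory names hypothetical):

```bash
mkdir -p qc_results
fastqc sample_R1.fastq sample_R2.fastq -o qc_results/
```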
Field-programmable gate arrays
Hardware accelerators that can be configured/reprogrammed by a customer after manufacturing. They enable custom hardware acceleration without needing entirely new chips to be manufactured.
The part of the operating system that manages files and directories is called the file system.
Algorithms or devices for removing or enhancing parts or frequency components from a signal.
In a hierarchical model, the regression coefficients (for example, log relative risks for each variable) for the subject-level data on the association between risk factors and disease. Unlike a non-hierarchical model, these coefficients are treated as random variables with distributions described in the higher level(s) of the model rather than as model parameters to be estimated directly.
A name for the school of statistical thought in which support for a hypothesis or parameter value is assessed using the probability of the observed data (or more extreme data sets) given the hypothesis or value. This approach is usually contrasted with Bayesian statistics.
A measure of the genetic distance between two populations that describes the proportion of overall genetic variation that is due to differences between populations.
The directories on the computer are arranged into a hierarchy. The full path tells you where a directory is in that hierarchy.
An interaction in which one gene product alters the phenotypic effect of a second gene product.
A method of association testing in which cases and controls are matched for genetic ancestry, as inferred by principal components analysis or other methods.
Examination of DNA variation (typically SNPs) across the whole genome in a large number of individuals who have been matched for population ancestry and assessed for a disease or trait of interest. Correlations between variants and the trait are used to locate genetic risk factors.
A method for detecting (or detecting and correcting for) stratification based on the genome-wide inflation of association statistics.
Probabilistic prediction of genotypes that have not been measured experimentally.
Hardware accelerators that can process many pieces of data simultaneously. They were historically used primarily for rendering computer graphics, but the massive parallelism makes them useful for applications such as machine learning.
Command that lets you look at the beginning of a file.
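For example (file name hypothetical):

```bash
head -n 8 sample.fastq   # show the first 8 lines, i.e. the first two FASTQ records
```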
A class of statistical model that can be used to relate an observed process across the genome to an underlying, unobserved process of interest. Such models have been used to estimate population structure and admixture, for genotype imputation and for multiple testing.
A unique name for a sample or set of sequencing data; the name exists only for that data and refers to nothing else.
Imputation methods aim to fill in missing genotype data using a sparse set of genotypes (for example, from a genome-wide association scan) and a scaffold of linkage disequilibrium relationships (as provided by the HapMap data).
Generally one of the first steps in alignment. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves time during alignment.
In statistics, a variable that can be used to predict the value of an explanatory variable that is measured with error. The instrumental variable thereby indirectly yields an unbiased estimate of the relationship of the explanatory variable with an outcome variable.
Web-based genome browser. A high-performance, easy-to-use, interactive tool for the visual exploration of genomic data. It supports flexible integration of all the common types of genomic data and metadata, investigator-generated or publicly available, loaded from local or cloud sources.
Each time the loop runs (called an iteration).
A measure of the similarity between two sets, defined as the size of the intersection divided by the size of the union.
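Written as a formula, with a small worked example:

```latex
J(A,B) = \frac{|A \cap B|}{|A \cup B|}; \qquad A = \{1,2,3\},\ B = \{2,3,4\} \ \Rightarrow\ J = \frac{2}{4} = 0.5
```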
Joint segregation and linkage analysis
The use of family studies to estimate the parameters of a penetrance model. The parameters could include interactions between the unobserved major gene, which is linked to a marker, and environmental factors.
Genomic data normally come in long strings of nucleotides (A, C, G and T). Many genomic algorithms process these strings by looking at exact matches of length-k substrings, which are known as k-mers.
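For example, the 7-base string ACGTACG contains five 3-mers (k = 3):

```
ACG  CGT  GTA  TAC  ACG
```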
A graph showing any sequences that may have a positional bias within the reads. This is one of the FastQC data outputs.
A model involving one or more unobservable intermediate variables that represent the pathway connecting a cause (for example, exposures and genotypes) to an effect (for example, disease). Identifying the pathways typically requires the use of surrogates for the latent variables (for example, biomarkers) in addition to the observable cause and effect variables.
In a principal components analysis, a quantity that represents the contribution of one of the original variables (columns of the data matrix) to one of the principal components.
A computationally intensive method in which a polynomial regression is fitted to each point in the data and more weight is given to data nearer the point of interest. It is often applied to hybridization array data to remove differences in global signal intensity among data sets or colour channels.
A study in which repeated measurements are taken from the same subjects at different time points.
Loops are key to productivity improvements through automation as they allow us to execute commands repeatedly. Loops let you perform the same set of operations on multiple files with a single command.
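A minimal shell sketch (the fastqc step and file names are illustrative):

```bash
for filename in *.fastq
do
    echo "Processing ${filename}"
    fastqc "${filename}"
done
```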
A procedure that takes advantage of redundancy/repetition to reversibly transform a large file into a smaller one — for example, storing the string ‘ACGTACGTACGTACGTACGT’ as ‘5*(ACGT)’. Note that although shorter, the transformed string contains all the same information as the original.
Sometimes, we are willing to discard some information when compressing a file. For example, if we start with the data points ‘12.362, 15.212, 92.786’ we could round them and discard some precision to get ‘12, 15, 93’, which can be stored in less space. However, after lossy compression, although we can still reproduce data that look similar to the original and are in the same kind of format, they are no longer an exact replica.
A representation of microarray data in which M (vertical axis) is the intensity ratio between the red (R) and green (G) colour channels (M=log(R/G)) and A (horizontal axis) is the mean intensity (A=(logR+logG)/2). This representation is often used as a basis for normalizing microarray data, with the underlying assumptions that dye bias is dependent on signal intensity, that the majority of probes do not have very different signal intensities among channels and that approximately the same number of probes in each channel have signal intensities that are stronger than the equivalent probes in the other channel.
Any of many data analysis techniques, derived from the computer science field, for mining large data sets. The techniques are not specifically based on mathematical statistics theory.
MAQ is a software package that builds mapping assemblies from short reads generated by next-generation sequencing machines.
A measure of statistical dispersion that is less influenced by outliers and extreme values than standard deviation. It is defined as the median of the collection of absolute deviations from the data set's median.
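A small worked example for the data set {1, 2, 4, 7, 9}:

```latex
\text{median} = 4; \quad |x_i - 4| = \{3, 2, 0, 3, 5\}; \quad \text{MAD} = \text{median}\{0, 2, 3, 3, 5\} = 3
```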
In regard to sequencing, metadata is information recorded about the samples before they ever go to the sequencing center: it is the data about the data.
Ranging from 0 to 50%, this is the proportion of alleles at a locus that consists of the less frequent allele. This number does not take genotype into account.
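A worked example: among 100 diploid individuals there are 200 allele copies at a locus; if the rarer allele appears 30 times, then

```latex
\text{MAF} = \frac{30}{200} = 0.15 = 15\%
```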
A class of models in which phenotypes are modelled using both fixed effects (candidate SNPs and fixed covariates) and random effects (the phenotypic covariance matrix).
This analysis applies likelihood-based methods to data from a pedigree in which one or more members have genotypes available at a major gene. It derives the genotypes of untyped individuals by summing their conditional genotype probabilities using the genotypes available.
A set of components that work in an integrated fashion. Personal computers are modular systems: they have keyboards, displays, motherboards and hard drives, each of which represents a module.
A single computing processor with two or more independent computing units (called cores). Running multiple instructions on multiple cores at the same time can increase the overall speed of programs.
A dimensionality reduction technique, similar to principal components analysis, in which points in a high-dimensional space are projected into a lower-dimensional space while approximately preserving the distance between points.
The higher degree of statistical significance that is required for a particular association to be considered noteworthy when many possible associations are analysed simultaneously. Several adjustment methods can take account of this penalty, the best known of which is the Bonferroni correction.
A standard statistical technique for relating a single outcome variable to multiple explanatory variables, either all at once or using some variable selection method, such as stepwise forward selection or backward elimination.
An analysis in which multiple independent hypotheses are tested. Multiple testing must be taken into account during statistical analysis, as the combined probability of type I error increases in an unadjusted analysis.
Multivariate regression analysis
Regression analysis includes any technique for modelling and analysing several variables when the focus is on the relationship between a dependent variable (y) and one or more independent variables (x). In multivariate regression analysis, two or more dependent variables are included in one analysis.
A digital logic gate that implements logical NOR, or the negation of the OR operator. It produces a HIGH output (1) only if both inputs to the gate are LOW (0).
A list of sequences that occur more frequently than would be expected by chance. This is one of the FastQC data outputs.
Replace the output that was already present in the file.
Paired-end sequencing allows users to sequence both ends of a fragment and generate high-quality, alignable sequence data.
Parallel computing allows numerous calculations to be performed simultaneously, thereby accelerating computation. Based on this principle, many large-scale computational tasks can then be divided into smaller ones and solved on multiple machines concurrently.
The input data to a computer program can come in various formats. Before performing any type of complicated analysis, programs must first translate those data into an internal representation, in a process known as parsing.
Various types of information that can be used as predictor variables in the higher levels of a hierarchical model, specifically binary variables that indicate whether a particular gene or interaction has a role in a particular pathway.
Any technique from exploratory data analysis or machine learning for discovering non-random patterns in large data sets.
The percent of times that ‘N’ occurs at a position across all reads. If there is an increase at a particular position, this might indicate that something went wrong during sequencing. This is one of the FastQC data outputs.
Plots the proportion of each base at each position over all of the reads. Typically, we expect to see each base roughly 25% of the time at each position, but this often fails at the beginning or end of the read due to quality or adapter content. This is one of the FastQC data outputs.
A graph showing quality scores across all bases within a read. The x-axis displays the base position in the read, and the y-axis shows quality scores. This is one of the FastQC data outputs.
A density plot of average GC content in each of the reads. This is one of the FastQC data outputs.
A density plot of quality for all reads at all positions. This plot shows which quality scores are most common. This is one of the FastQC data outputs.
The flow cells used in sequencing machines are divided into tiles. This plot displays patterns in base quality along these tiles. Consistently low scores are often found around the edges, but hot spots can also occur in the middle if an air bubble was introduced at some point during the run. This is one of the FastQC data outputs.
Takes the output that would otherwise scroll by on the terminal and uses it as input to another command.
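For example, using the | (pipe) character to chain two commands:

```bash
ls *.fastq | wc -l   # list FASTQ files, then count how many there are
```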
Sample structure due to differences in genetic ancestry among samples.
The probability of correctly rejecting the null hypothesis when it is truly false. For association studies, the power can be considered as the probability of correctly detecting a genuine association.
A composite variable that summarizes the variation across a larger number of variables, each represented by a column of a matrix.
A statistical method that is used to simplify data sets by transforming a series of correlated variables into a smaller number of uncorrelated factors.
One way to examine a file is to print out all of the contents. Certain commands and programs will print (display) all of the contents on the screen.
Command to print working directory on the screen.
The characters to the left of the cursor, which show us that the shell is waiting for input.
A class of survival models in statistics. Survival models relate the time that passes before some event occurs to one or more covariates that may influence that quantity. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate.
With sequencing, each quality score represents the probability that the corresponding nucleotide call is incorrect. This quality score is logarithmically based, so a quality score of 10 reflects a base call accuracy of 90%, but a quality score of 20 reflects a base call accuracy of 99%. These probability values are the results from the base-calling algorithm and depend on how much signal was captured for the base incorporation.
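The relationship restated as a formula, where P is the probability that the base call is incorrect:

```latex
Q = -10 \log_{10} P \ \Rightarrow\ Q = 10 \leftrightarrow P = 0.1\ (90\%\ \text{accuracy}); \quad Q = 20 \leftrightarrow P = 0.01\ (99\%); \quad Q = 30 \leftrightarrow P = 0.001\ (99.9\%)
```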
A method for equalizing the total signal intensities and distributions of probe signal strengths among arrays or among colour channels on an array. It sorts all probes by signal strength and then matches probes at each rank position among arrays and forces the values at each rank position to be equal. An identical distribution of probe signal strengths among the arrays or colour channels is obtained.
This compares the observed data against data sampled from a theoretical distribution, in which deviation from the line of y = x indicates that the observed data are not behaving as expected. In the context of genome-wide association studies, it is often used to test for systematic false-positive associations.
Access to any element of stored data as easily and efficiently as any other.
Short-term storage that holds the data the computer is actively using, in order to speed up access.
A statistic describing the rank, across markers, of association of each marker. Rank statistics can be transformed into quantiles of a standard normal distribution that can be combined with other statistics.
The original data. This data has not been changed, edited, or manipulated in any way. In sequencing, this is data you get back from the sequencing center.
The process of determining where in the genome sequencing reads originated.
Describes the data files in the directory or documents how the files in that directory were generated. As the name suggests, it’s a file that we or others should read to understand the information in that directory.
Taking what would ordinarily be printed to the terminal screen and redirecting (diverting) it to another location.
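For example (file names hypothetical), > overwrites the target file while >> appends to it:

```bash
grep NNNNNNNNNN sample1.fastq > bad_reads.txt    # write matching lines to a new file
grep NNNNNNNNNN sample2.fastq >> bad_reads.txt   # append further matches to the same file
```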
Relative paths specify a location starting from the current location (regarding directories).
Testing the same variant of interest for association in diverse data sets.
The process of discovering the technological principles of a device, object or system through analysis of its structure, function and operation. It often involves taking a system apart and analysing its workings with the aim of making a new device or program that does the same thing without using any physical part of the original.
The root directory is the highest level directory in your file system and contains files that are important for your computer to perform its daily work.
A tab-delimited text file that contains information for each individual read and its alignment to the genome.
Samtools is a suite of programs for interacting with high-throughput sequencing data.
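A brief sketch of common subcommands (file names hypothetical):

```bash
samtools sort aln.bam -o aln.sorted.bam   # sort alignments by genomic coordinate
samtools index aln.sorted.bam             # build an index for fast random access
samtools flagstat aln.sorted.bam          # summary statistics about the alignments
```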
Scalability typically refers to how an algorithm handles larger amounts of data; for example, an algorithm scales well if its runtime and space requirements grow slowly enough that it can still solve the problem as the amount of data increases.
A distribution of duplicated sequences. In sequencing, we expect most reads to occur only once. If some sequences occur more than once, this might indicate enrichment bias (e.g. from PCR). If the samples are high coverage (or RNA-seq or amplicon), this might not be true. This is one of the FastQC outputs.
The distribution of sequence lengths of all reads in the file. If the data are raw, there is often one sharp peak; however, if the reads have been trimmed, there may be a distribution of shorter lengths. This is one of the FastQC outputs.
Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data. SRA stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis.
A shell is a computer program that presents a command line interface which allows you to control your computer using commands entered with a keyboard instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard/touchscreen combination.
Computation that operates as a single sequential series of operations without any parallelization. It is often used as a benchmark for the speed of a method without using any types of hardware tricks or multi-threaded acceleration.
These methods reduce the number of data points considered, while still capturing salient features of the underlying data, to minimize the computational resources required for large-scale analyses. Unlike lossy data compression, it is generally not possible to reproduce even an approximate copy of the original data, because the sketch only summarizes a few important features.
A nucleotide site at which two or more variants exist in a population. Most SNPs in genome-wide association studies are biallelic.
The correlations of each SNP to a given principal component in principal components analysis. The principal component coordinates of each sample are proportional to the sum of normalized genotypes weighted by SNP loadings.
Computer scientists traditionally measure the amount of computer memory (random access memory (RAM)) an algorithm needs to run by asking how the amount of memory needed scales with the size of the data. Often, the same types of terms are used as for time-complexity, and we speak of linear, log-linear or quadratic space algorithms.
Lines matching a specific set of characters.
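In the shell, lines like these are commonly found with grep (the pattern and file name here are illustrative):

```bash
grep NNNNNNNNNN sample.fastq      # print each line containing a run of ten Ns
grep -c NNNNNNNNNN sample.fastq   # count the matching lines instead of printing them
```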
A method for correcting for stratification in which samples are assigned to subpopulation clusters and evidence of association is stratified by cluster.
Command that lets you look at the end of a file.
Indexing refers to the incorporation of short sequences as tagged codes during the construction of a sequencing library, followed by the simultaneous parallel sequencing of libraries from many sources. The source of the DNA sequence for each read can be deduced from the index. This technique can be combined with targeted sequencing of regions of interest enriched by hybrid selection.
Application-specific integrated circuits developed by Google to accelerate machine learning workflows.
On a Mac or Linux machine, you can access a shell through a program called “Terminal”, which is already available on your computer. The Terminal is a window into which we will type commands.
Can be used to create, edit, or add text to files.
Computer scientists traditionally measure how fast an algorithm is by asking how the number of central processing unit (CPU) operations scales with the size of the data. An algorithm is linear time if doubling the amount of data to be processed doubles the number of CPU operations needed. An algorithm is quadratic time if doubling the amount of data quadruples (×4) the number of CPU operations. A log-linear time algorithm is only marginally slower than a linear time algorithm, although the exact scaling requires a bit more mathematical formalism to describe. Most practical algorithms are either linear or log-linear.
Transmission disequilibrium test
A family-based association test involving case–parent trios in which alleles transmitted from parents to children are compared with untransmitted alleles.
Removing some of the low quality sequences, reads, and bases to reduce false positive rates due to sequencing error.
Program to filter poor quality reads and trim poor quality bases from samples.
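A hedged single-end example (file names and thresholds are illustrative, and the exact invocation depends on how Trimmomatic is installed):

```bash
trimmomatic SE -phred33 sample.fastq sample.trimmed.fastq \
    SLIDINGWINDOW:4:20 MINLEN:25
```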
Web-based genome browser. Interactively visualize genomic data.
The four nucleotides that appear in DNA are abbreviated A, C, T, and G. Unknown nucleotides are represented with the letter N. An N appearing in a sequencing file represents a position where the sequencing machine was not able to confidently determine the nucleotide in that position.
A variant call is a conclusion that there is a nucleotide difference vs. some reference at a given position in an individual genome or transcriptome, often referred to as a Single Nucleotide Variant (SNV). The call is usually accompanied by an estimate of variant frequency and some measure of confidence.
The * character is a special type of character called a wildcard, which can be used to represent any number of any type of character.
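For example:

```bash
ls *.fastq        # list every file whose name ends in .fastq
ls SRR*_1.fastq   # list files that start with SRR and end in _1.fastq (names hypothetical)
```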
When working with high-throughput sequencing data, the raw reads you get off of the sequencer will need to pass through a number of different tools in order to generate your final desired output. The execution of this set of tools in a specified order is commonly referred to as a workflow or a pipeline.
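As an illustrative sketch only (the tool choices and file names are assumptions, not a prescribed pipeline), a simple read-processing workflow might chain tools defined elsewhere in this glossary:

```bash
bwa index reference.fa                                     # 0. index the reference (once)
fastqc sample.fastq                                        # 1. quality control
trimmomatic SE sample.fastq sample.trimmed.fastq \
    SLIDINGWINDOW:4:20 MINLEN:25                           # 2. trim low-quality bases/reads
bwa mem reference.fa sample.trimmed.fastq > sample.sam     # 3. align reads to the reference
samtools sort sample.sam -o sample.sorted.bam              # 4. sort and convert to BAM
samtools index sample.sorted.bam                           # 5. index for visualization
```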
Compressed files. They each contain multiple different types of output files for a single input file.