The bioinformatics section of this guide provides definitions and resources for bioinformatics-related terms.
Term In Bolded Blue - Each term is linked to the original source from which its definition was taken, and the link also provides access to additional resources. Clicking the link opens a new window that highlights where the definition appears within the source text.
Definition - The definitions have been copied directly from the sources they were selected from. The initial sources and related attributions can be found by clicking on the linked term.
(Alternative phrasing) - These are alternative phrases for the terms, which can be synonyms, acronyms, or other ways of referring to the concept.
[See also] - These are terms that are related to the defined term.
A hardware device or software program that enhances the overall performance of the computer. A software accelerator implements as many system functions as possible in software and moves performance-critical functions into special-purpose external hardware to reduce compute time.
A graph indicating where adapter sequences occur in the reads.
Genetic mapping strategy that uses individuals whose genomes are mosaics of fragments that are descended from genetically distinct populations. This method exploits differences in allele frequencies in the founders to determine ancestry at a locus to map traits in a way that is broadly similar to an advanced intercross.
A standard χ2 (1 degree of freedom) association test computed as the number of samples times the squared correlation between genotype and phenotype.
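Restated as a formula (this simply rewrites the definition above), with N the number of samples and r the sample correlation between genotype and phenotype:

```latex
\chi^2_{\text{trend}} = N\,r^{2}, \qquad \chi^2_{\text{trend}} \sim \chi^2_{1} \ \text{under the null hypothesis of no association}
```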
A consequence of collecting a non-random subsample with a systematic bias, so that results based on the subsample are not representative of the entire sample.
The compressed binary version of SAM is called a BAM file.
(Binary Alignment Map file format, BAM file, BAM)
[See also]: SAM file format
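As a minimal sketch, conversion between the two formats is commonly done with samtools (the file names here are hypothetical):

```bash
samtools view -b aln.sam -o aln.bam   # SAM -> compressed BAM
samtools view -h aln.bam > aln.sam    # BAM -> SAM, keeping the header (-h)
```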
Bash is the shell, or command language interpreter, for the GNU operating system.
A technique for accounting for uncertainty about the correct model form (for example, the selection of variables to include in a multiple regression model) by averaging the effects of each possible variable over the set of all plausible models.
A statistical school of thought in which the posterior probability distribution for any unknown parameter or hypothesis given the observed data is used to carry out inference. Bayes theorem is used to construct the posterior distribution using the observed data and a prior distribution, often allowing the incorporation of useful knowledge into the analysis.
A technique for developing a minimal graphical representation of the connections among a large set of variables by examining the conditional independence relationships among pairs of variables given the other variables connected to them within the graph. This technique has been widely used for the analysis of gene co-expression data.
An indexing approach for storing the presence or absence of k-mers in a dataset; they have been leveraged to considerably reduce the amount of space and still run in constant time. However, they can have high false positive rates (that is, query hits when there are none).
A multiple comparisons adjustment for testing at a conventional significance level. It is based on multiplying the p value for a specific test by the total number of tests performed, and approximately controls the overall type I error rate (the probability of at least one false positive association) at the chosen significance level if the predictors are independent.
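As a worked example of the adjustment (numbers illustrative):

```latex
p_{\text{adj}} = \min(1,\ m\,p); \qquad \text{with } m = 1000 \text{ tests and } \alpha = 0.05,\ \text{the per-test threshold is } 0.05/1000 = 5 \times 10^{-5}
```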
A software package for mapping low-divergent sequences against a large reference genome. (BWA)
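A minimal sketch of a typical run (reference and read file names are hypothetical):

```bash
bwa index reference.fa                                           # build the index once
bwa mem reference.fa sample_R1.fastq sample_R2.fastq > aln.sam   # align paired-end reads
```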
The command to change locations in your file system. (cd)
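For example (directory names hypothetical):

```bash
cd ~/projects/seq_data   # move into a specific directory
cd ..                    # move up one level
cd                       # with no argument, return to the home directory
```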
The sequence alignment map (SAM) file format’s compressed representation of a read alignment to a reference.
(Concise idiosyncratic gapped alignment report strings)
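For illustration, a few CIGAR strings and their meanings (lengths are arbitrary):

```
100M       100 bases aligned to the reference (matches or mismatches)
50M2I48M   50 aligned bases, a 2-base insertion in the read, then 48 more aligned bases
5S95M      5 soft-clipped bases followed by 95 aligned bases
```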
The use of computing resources distributed in the ‘cloud-shaped’ Internet to store, manage and analyze data, rather than doing so on a local server or personal computer.
Coalescent-based statistical methods
Methods of reconstructing population history by simulating the genealogy of genes back to the most recent common ancestor of all alleles currently in the population.
Algorithm complexity is generally measured as an upper bound on its long-term growth rate: how its runtime or space requirements grow as the input size grows, rather than their absolute magnitude, so constants are omitted. In practice, a set of algorithms can share the same asymptotic complexity even though some of them are a constant 2, 3 or even 1,000 times slower than their counterparts in the set.
The amount of compute power (for example, central processing units (CPUs) and memory) that can be requested, allocated and used for computing.
A table of observations of two or more variables that might have a statistical relationship of interest. For each variable, a contingency table places each observation into one of a series of categories.
The current working directory is our default directory, i.e., the directory in which the computer assumes we want to run commands. (cwd)
Spurious differences in allele frequencies between cases and controls due to differences in sample collection, sample preparation and/or genotyping assay procedures.
A digital logic gate implements Boolean logic (such as AND, OR, or NOT) on one or more logic inputs to produce a single logic output. Electronic logic gates are implemented using diodes and transistors and operate on input voltages or currents, whereas biological logic gates operate on cellular molecules (chemical or biological).
Computer languages tailored to a specific domain such as genomics.
A technique for estimating the effects of each component of a large ensemble of related variables by assuming the ensemble has some common distribution and estimating the parameters of that distribution. Empirical Bayes estimators typically have better prediction error than estimating each one separately.
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data.
A study design in which exon capture technologies are used to obtain resequencing data covering all exonic regions for each individual in the study.
Expectation-maximization algorithm
A method for finding maximum-likelihood estimates of parameters in statistical models, in which the model depends on unobserved latent variables. It is an iterative method which alternates between performing an expectation (E) step and a maximization (M) step.
A branch of statistical theory that is concerned with the asymptotic properties of the largest samples from a probability distribution, that is, those from the tail of the distribution.
This controls the proportion of all reported positive associations that are expected to be false positives, and can be used to judge which of many associations are noteworthy.
Sample structure due to familial relatedness among samples.
A set of three people, comprising an individual plus both of the parents. In genetic association studies, the term 'affected family trio' denotes an individual with the phenotype of interest plus both of the parents, who effectively serve as controls.
A class of association tests that uses families with one or more affected children as the subjects rather than unrelated cases or controls. The analysis treats the allele that is transmitted to (one or more) affected children from each parent as a 'case' and the untransmitted alleles as 'controls' to avoid the effects of population structure.
FASTQ is a format for storing information about sequencing reads and their quality.
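Each FASTQ record spans four lines: an identifier beginning with @, the nucleotide sequence, a separator line beginning with +, and a quality string with one character per base. The record below is a made-up example:

```
@SEQ_ID_001
GATTTGGGGTTCAAAGCAGT
+
!''*((((***+))%%%++)
```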
A program to visualize the quality of sequencing reads. Rather than looking at quality scores for each individual read, FastQC looks at quality collectively across all reads within a sample.
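A minimal invocation (file and directory names hypothetical):

```bash
mkdir -p qc_results
fastqc sample_R1.fastq sample_R2.fastq -o qc_results/
```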
Field-programmable gate arrays
Hardware accelerators that can be configured/reprogrammed by a customer after manufacturing. They enable custom hardware acceleration without needing entirely new chips to be manufactured.
The part of the operating system that manages files and directories is called the file system.
Algorithms or devices for removing or enhancing parts or frequency components from a signal.
In a hierarchical model, the regression coefficients (for example, log relative risks for each variable) for the subject-level data on the association between risk factors and disease. Unlike a non-hierarchical model, these coefficients are treated as random variables with distributions described in the higher level(s) of the model rather than as model parameters to be estimated directly.
A name for the school of statistical thought in which support for a hypothesis or parameter value is assessed using the probability of the observed data (or more extreme data sets) given the hypothesis or value. This approach is usually contrasted with Bayesian statistics.
A measure of the genetic distance between two populations that describes the proportion of overall genetic variation that is due to differences between populations.
The directories on the computer are arranged into a hierarchy. The full path tells you where a directory is in that hierarchy.
An interaction in which one gene product alters the phenotypic effect of a second gene product.
A method of association testing in which cases and controls are matched for genetic ancestry, as inferred by principal components analysis or other methods.
Examination of DNA variation (typically SNPs) across the whole genome in a large number of individuals who have been matched for population ancestry and assessed for a disease or trait of interest. Correlations between variants and the trait are used to locate genetic risk factors.
A method for detecting (or detecting and correcting for) stratification based on the genome-wide inflation of association statistics.
Probabilistic prediction of genotypes that have not been measured experimentally.
Hardware accelerators that can process many pieces of data simultaneously. They were historically used primarily for rendering computer graphics, but the massive parallelism makes them useful for applications such as machine learning.
Command that lets you look at the beginning of a file.
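For example (file name hypothetical):

```bash
head -n 8 sample.fastq   # show the first 8 lines, i.e. the first two FASTQ records
```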
A class of statistical model that can be used to relate an observed process across the genome to an underlying, unobserved process of interest. Such models have been used to estimate population structure and admixture, for genotype imputation and for multiple testing.
A unique name for a sample or set of sequencing data; the name exists only for that data and refers to nothing else.
Imputation methods aim to fill in missing genotype data using a sparse set of genotypes (for example, from a genome-wide association scan) and a scaffold of linkage disequilibrium relationships (as provided by the HapMap data).
Generally one of the first steps in alignment. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves time during alignment.
In statistics, a variable that can be used to predict the value of an explanatory variable that is measured with error. The instrumental variable thereby indirectly yields an unbiased estimate of the relationship of the explanatory variable with an outcome variable.
Web-based genome browser. A high-performance, easy-to-use, interactive tool for the visual exploration of genomic data. It supports flexible integration of all the common types of genomic data and metadata, investigator-generated or publicly available, loaded from local or cloud sources.
Each time the loop runs (called an iteration).
A measure of the similarity between two sets, defined as the size of the intersection divided by the size of the union.
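Written as a formula, with a small worked example:

```latex
J(A,B) = \frac{|A \cap B|}{|A \cup B|}; \qquad A = \{1,2,3\},\ B = \{2,3,4\} \ \Rightarrow\ J = \frac{2}{4} = 0.5
```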
Joint segregation and linkage analysis
The use of family studies to estimate the parameters of a penetrance model. The parameters could include interactions between the unobserved major gene, which is linked to a marker, and environmental factors.
Genomic data normally come in long strings of nucleotides (A, C, G and T). Many genomic algorithms process these strings by looking at exact matches of length-k substrings, which are known as k-mers.
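For example, the 7-base string ACGTACG contains five 3-mers (k = 3):

```
ACG  CGT  GTA  TAC  ACG
```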
A graph showing any sequences that may have a positional bias within the reads. This is one of the FastQC data outputs.
A model involving one or more unobservable intermediate variables that represent the pathway connecting a cause (for example, exposures and genotypes) to an effect (for example, disease). Identifying the pathways typically requires the use of surrogates for the latent variables (for example, biomarkers) in addition to the observable cause and effect variables.
In a principal components analysis, a quantity that represents the contribution of one of the original variables (columns of the data matrix) to one of the principal components.
A computationally intensive method in which a polynomial regression is fitted to each point in the data and more weight is given to data nearer the point of interest. It is often applied to hybridization array data to remove differences in global signal intensity among data sets or colour channels.
A study in which repeated measurements are taken from the same subjects at different time points.
Loops are key to productivity improvements through automation as they allow us to execute commands repeatedly. Loops let you perform the same set of operations on multiple files with a single command.
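A minimal shell sketch (the fastqc step and file names are illustrative):

```bash
for filename in *.fastq
do
    echo "Processing ${filename}"
    fastqc "${filename}"
done
```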
A procedure that takes advantage of redundancy/repetition to reversibly transform a large file into a smaller one — for example, storing the string ‘ACGTACGTACGTACGTACGT’ as ‘5*(ACGT)’. Note that although shorter, the transformed string contains all the same information as the original.
Sometimes, we are willing to discard some information when compressing a file. For example, if we start with the data points ‘12.362, 15.212, 92.786’ we could round them and discard some precision to get ‘12, 15, 93’, which can be stored in less space. However, after lossy compression, although we can still reproduce data that look similar to the original and are in the same kind of format, they are no longer an exact replica.
A representation of microarray data in which M (vertical axis) is the intensity ratio between the red (R) and green (G) colour channels (M=log(R/G)) and A (horizontal axis) is the mean intensity (A=(logR+logG)/2). This representation is often used as a basis for normalizing microarray data, with the underlying assumptions that dye bias is dependent on signal intensity, that the majority of probes do not have very different signal intensities among channels and that approximately the same number of probes in each channel have signal intensities that are stronger than the equivalent probes in the other channel.
Any of many data analysis techniques, derived from the computer science field, for mining large data sets. The techniques are not specifically based on mathematical statistics theory.
MAQ is a software package that builds mapping assemblies from short reads generated by next-generation sequencing machines.
A measure of statistical dispersion that is less influenced by outliers and extreme values than standard deviation. It is defined as the median of the collection of absolute deviations from the data set's median.
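A small worked example for the data set {1, 2, 4, 7, 9}:

```latex
\text{median} = 4; \quad |x_i - 4| = \{3, 2, 0, 3, 5\}; \quad \text{MAD} = \text{median}\{0, 2, 3, 3, 5\} = 3
```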
In regard to sequencing, metadata is information recorded about the samples before they ever go to the sequencing center: it is the data about the data.
Ranging from 0 to 50%, this is the proportion of alleles at a locus that consists of the less frequent allele. This number does not take genotype into account.
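A worked example: among 100 diploid individuals there are 200 allele copies at a locus; if the rarer allele appears 30 times, then

```latex
\text{MAF} = \frac{30}{200} = 0.15 = 15\%
```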
A class of models in which phenotypes are modelled using both fixed effects (candidate SNPs and fixed covariates) and random effects (the phenotypic covariance matrix).
This analysis applies likelihood-based methods to data from a pedigree in which one or more members have genotypes available at a major gene. It derives the genotypes of untyped individuals by summing their conditional genotype probabilities using the genotypes available.
A set of components that work in an integrated fashion. Personal computers are modular systems: they have keyboards, displays, motherboards and hard drives, each of which represents a module.
A single computing processor with two or more independent computing units (called cores). Running multiple instructions on multiple cores at the same time can increase the overall speed of programs.
A dimensionality reduction technique, similar to principal components analysis, in which points in a high-dimensional space are projected into a lower-dimensional space while approximately preserving the distance between points.
The higher degree of statistical significance that is required for a particular association to be considered noteworthy when many possible associations are analysed simultaneously. Several adjustment methods can take account of this penalty, the best known of which is the Bonferroni correction.
A standard statistical technique for relating a single outcome variable to multiple explanatory variables, either all at once or using some variable selection method, such as stepwise forward selection or backward elimination.
An analysis in which multiple independent hypotheses are tested. Multiple testing must be taken into account during statistical analysis, as the combined probability of type I error increases in an unadjusted analysis.
Multivariate regression analysis
Regression analysis includes any technique for modelling and analysing several variables when the focus is on the relationship between a dependent variable (y) and one or more independent variables (x). In multivariate regression analysis, two or more dependent variables are included in one analysis.
A digital logic gate that implements logical NOR, or the negation of the OR operator. It produces a HIGH output (1) only if both inputs to the gate are LOW (0).
A list of sequences that occur more frequently than would be expected by chance. This is one of the FastQC data outputs.
Replace the output that was already present in the file.
Paired-end sequencing allows users to sequence both ends of a fragment and generate high-quality, alignable sequence data.
Parallel computing allows numerous calculations to be performed simultaneously, thereby accelerating computation. Based on this principle, many large-scale computational tasks can then be divided into smaller ones and solved on multiple machines concurrently.
The input data to a computer program can come in various formats. Before performing any type of complicated analysis, programs must first translate those data into an internal representation, in a process known as parsing.
Various types of information that can be used as predictor variables in the higher levels of a hierarchical model, specifically binary variables that indicate whether a particular gene or interaction has a role in a particular pathway.
Any technique from exploratory data analysis or machine learning for discovering non-random patterns in large data sets.
The percent of times that ‘N’ occurs at a position across all reads. If there is an increase at a particular position, this might indicate that something went wrong during sequencing. This is one of the FastQC data outputs.
Plots the proportion of each base at each position over all of the reads. Typically, we expect to see each base roughly 25% of the time at each position, but this often fails at the beginning or end of the read due to quality or adapter content. This is one of the FastQC data outputs.
A graph showing quality scores across all bases within a read. The x-axis displays the base position in the read, and the y-axis shows quality scores. This is one of the FastQC data outputs.
A density plot of average GC content in each of the reads. This is one of the FastQC data outputs.
A density plot of quality for all reads at all positions. This plot shows which quality scores are most common. This is one of the FastQC data outputs.
The flow cells used in sequencing machines are divided into tiles. This plot displays patterns in base quality along these tiles. Consistently low scores are often found around the edges, but hot spots can also occur in the middle if an air bubble was introduced at some point during the run. This is one of the FastQC data outputs.
Takes the output that would otherwise scroll by on the terminal and uses it as input to another command.
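For example, using the | (pipe) character to chain two commands:

```bash
ls *.fastq | wc -l   # list FASTQ files, then count how many there are
```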
Sample structure due to differences in genetic ancestry among samples.
The probability of correctly rejecting the null hypothesis when it is truly false. For association studies, the power can be considered as the probability of correctly detecting a genuine association.
A composite variable that summarizes the variation across a larger number of variables, each represented by a column of a matrix.
A statistical method that is used to simplify data sets by transforming a series of correlated variables into a smaller number of uncorrelated factors.
One way to examine a file is to print out all of the contents. Certain commands and programs will print (display) all of the contents on the screen.
Command to print working directory on the screen.
The characters to the left of the cursor, which show us that the shell is waiting for input.
A class of survival models in statistics. Survival models relate the time that passes before some event occurs to one or more covariates that may influence that quantity. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate.
With sequencing, each quality score represents the probability that the corresponding nucleotide call is incorrect. This quality score is logarithmically based, so a quality score of 10 reflects a base call accuracy of 90%, but a quality score of 20 reflects a base call accuracy of 99%. These probability values are the results from the base-calling algorithm and depend on how much signal was captured for the base incorporation.
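The relationship restated as a formula, where P is the probability that the base call is incorrect:

```latex
Q = -10 \log_{10} P \ \Rightarrow\ Q = 10 \leftrightarrow P = 0.1\ (90\%\ \text{accuracy}); \quad Q = 20 \leftrightarrow P = 0.01\ (99\%); \quad Q = 30 \leftrightarrow P = 0.001\ (99.9\%)
```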
A method for equalizing the total signal intensities and distributions of probe signal strengths among arrays or among colour channels on an array. It sorts all probes by signal strength and then matches probes at each rank position among arrays and forces the values at each rank position to be equal. An identical distribution of probe signal strengths among the arrays or colour channels is obtained.
This compares the observed data against data sampled from a theoretical distribution, in which deviation from the line of y = x indicates that the observed data are not behaving as expected. In the context of genome-wide association studies, it is often used to test for systematic false-positive associations.
Access to any element of stored data as easily and efficiently as any other.
Short-term storage that holds the data the computer is actively using, in order to speed up access.
A statistic describing the rank, across markers, of association of each marker. Rank statistics can be transformed into quantiles of a standard normal distribution that can be combined with other statistics.
The original data. This data has not been changed, edited, or manipulated in any way. In sequencing, this is data you get back from the sequencing center.
The process of determining where in the genome sequencing reads originated.
Describes the data files in the directory or documents how the files in that directory were generated. As the name suggests, it’s a file that we or others should read to understand the information in that directory.
Taking what would ordinarily be printed to the terminal screen and redirecting (diverting) it to another location.
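For example (file names hypothetical), > overwrites the target file while >> appends to it:

```bash
grep NNNNNNNNNN sample1.fastq > bad_reads.txt    # write matching lines to a new file
grep NNNNNNNNNN sample2.fastq >> bad_reads.txt   # append further matches to the same file
```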
Relative paths specify a location starting from the current location (regarding directories).
Testing the same variant of interest for association in diverse data sets.
The process of discovering the technological principles of a device, object or system through analysis of its structure, function and operation. It often involves taking a system apart and analysing its workings with the aim of making a new device or program that does the same thing without using any physical part of the original.
The root directory is the highest level directory in your file system and contains files that are important for your computer to perform its daily work.
A tab-delimited text file that contains information for each individual read and its alignment to the genome.
Samtools is a suite of programs for interacting with high-throughput sequencing data.
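A brief sketch of common subcommands (file names hypothetical):

```bash
samtools sort aln.bam -o aln.sorted.bam   # sort alignments by genomic coordinate
samtools index aln.sorted.bam             # build an index for fast random access
samtools flagstat aln.sorted.bam          # summary statistics about the alignments
```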
Scalability typically refers to how an algorithm handles larger amounts of data; for example, an algorithm scales well if its runtime and space requirements grow slowly enough that it can still solve the problem as the amount of data increases.
A distribution of duplicated sequences. In sequencing, we expect most reads to occur only once. If some sequences occur more than once, this might indicate enrichment bias (e.g. from PCR). If the samples are high coverage (or RNA-seq or amplicon), this might not be true. This is one of the FastQC outputs.
The distribution of sequence lengths of all reads in the file. If the data are raw, there is often one sharp peak; however, if the reads have been trimmed, there may be a distribution of shorter lengths. This is one of the FastQC outputs.
Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data. SRA stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis.
A shell is a computer program that presents a command line interface which allows you to control your computer using commands entered with a keyboard instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard/touchscreen combination.
Computation that operates as a single sequential series of operations without any parallelization. It is often used as a benchmark for the speed of a method without using any types of hardware tricks or multi-threaded acceleration.
These methods reduce the number of data points considered, while still capturing salient features of the underlying data, to minimize the computational resources required for large-scale analyses. Unlike lossy data compression, it is generally not possible to reproduce even an approximate copy of the original data, because the sketch only summarizes a few important features.
A nucleotide site at which two or more variants exist in a population. Most SNPs in genome-wide association studies are biallelic.
The correlations of each SNP to a given principal component in principal components analysis. The principal component coordinates of each sample are proportional to the sum of normalized genotypes weighted by SNP loadings.
Computer scientists traditionally measure the amount of computer memory (random access memory (RAM)) an algorithm needs to run by asking how the amount of memory needed scales with the size of the data. Often, the same types of terms are used as for time-complexity, and we speak of linear, log-linear or quadratic space algorithms.
Lines matching a specific set of characters.
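In the shell, lines like these are commonly found with grep (the pattern and file name here are illustrative):

```bash
grep NNNNNNNNNN sample.fastq      # print each line containing a run of ten Ns
grep -c NNNNNNNNNN sample.fastq   # count the matching lines instead of printing them
```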
A method for correcting for stratification in which samples are assigned to subpopulation clusters and evidence of association is stratified by cluster.
Command that lets you look at the end of a file.
Indexing refers to the incorporation of short sequences as tagged codes during the construction of a sequencing library, followed by the simultaneous parallel sequencing of libraries from many sources. The source of the DNA sequence for each read can be deduced from the index. This technique can be combined with targeted sequencing of regions of interest enriched by hybrid selection.
Application-specific integrated circuits developed by Google to accelerate machine learning workflows.
On a Mac or Linux machine, you can access a shell through a program called “Terminal”, which is already available on your computer. The Terminal is a window into which we will type commands.
Can be used to create, edit, or add text to files.
Computer scientists traditionally measure how fast an algorithm is by asking how the number of central processing unit (CPU) operations scales with the size of the data. An algorithm is linear time if doubling the amount of data to be processed doubles the number of CPU operations needed. An algorithm is quadratic time if doubling the amount of data quadruples (×4) the number of CPU operations. A log-linear time algorithm is only marginally slower than a linear time algorithm, although the exact scaling requires a bit more mathematical formalism to describe. Most practical algorithms are either linear or log-linear.
Transmission disequilibrium test
A family-based association test involving case–parent trios in which alleles transmitted from parents to children are compared with untransmitted alleles.
Removing some of the low quality sequences, reads, and bases to reduce false positive rates due to sequencing error.
Program to filter poor quality reads and trim poor quality bases from samples.
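A hedged single-end example (file names and thresholds are illustrative, and the exact invocation depends on how Trimmomatic is installed):

```bash
trimmomatic SE -phred33 sample.fastq sample.trimmed.fastq \
    SLIDINGWINDOW:4:20 MINLEN:25
```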
Web-based genome browser. Interactively visualize genomic data.
The four nucleotides that appear in DNA are abbreviated A, C, T, and G. Unknown nucleotides are represented with the letter N. An N appearing in a sequencing file represents a position where the sequencing machine was not able to confidently determine the nucleotide in that position.
A variant call is a conclusion that there is a nucleotide difference vs. some reference at a given position in an individual genome or transcriptome, often referred to as a Single Nucleotide Variant (SNV). The call is usually accompanied by an estimate of variant frequency and some measure of confidence.
The * character is a special type of character called a wildcard, which can be used to represent any number of any type of character.
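For example:

```bash
ls *.fastq        # list every file whose name ends in .fastq
ls SRR*_1.fastq   # list files that start with SRR and end in _1.fastq (names hypothetical)
```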
When working with high-throughput sequencing data, the raw reads you get off of the sequencer will need to pass through a number of different tools in order to generate your final desired output. The execution of this set of tools in a specified order is commonly referred to as a workflow or a pipeline.
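As an illustrative sketch only (the tool choices and file names are assumptions, not a prescribed pipeline), a simple read-processing workflow might chain tools defined elsewhere in this glossary:

```bash
bwa index reference.fa                                     # 0. index the reference (once)
fastqc sample.fastq                                        # 1. quality control
trimmomatic SE sample.fastq sample.trimmed.fastq \
    SLIDINGWINDOW:4:20 MINLEN:25                           # 2. trim low-quality bases/reads
bwa mem reference.fa sample.trimmed.fastq > sample.sam     # 3. align reads to the reference
samtools sort sample.sam -o sample.sorted.bam              # 4. sort and convert to BAM
samtools index sample.sorted.bam                           # 5. index for visualization
```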
Compressed files. They each contain multiple different types of output files for a single input file.