Guides @ UF: Genomic Data Processing: FAQ

Frequently Asked Questions (FAQ)

What is de novo genome assembly?

De novo genome assembly is the analysis of DNA reads to produce the genome sequence of an individual without mapping individual reads to a reference genome (Olsen et al., 2023). Generally, this entails assembling a species' genome for the first time.

What are the main types of de novo assembly?

De novo genome assembly can be completed with short reads, long reads, or a hybrid approach including both short and long reads.

What to consider when working with one read type:

In cases where only short or only long reads are available, coverage depth should be considered when choosing a pipeline. Higher-coverage data can undergo additional filtering to retain only high-confidence reads, which in turn produces a more accurate assembly. Filtering lower-coverage data is not usually advised; in that case all available information should be used. If both short and long reads are available, a hybrid assembly can be considered.
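Coverage depth is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below illustrates that arithmetic; the function name and the example numbers are hypothetical, and the genome size is assumed to be known or estimated beforehand.

```python
def coverage_depth(read_lengths, genome_size):
    """Estimate mean coverage depth as total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


# Hypothetical example: 1,000 reads of 150 bp over a 5,000 bp genome.
depth = coverage_depth([150] * 1000, 5000)
print(f"Estimated coverage: {depth:.1f}x")  # Estimated coverage: 30.0x
```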

Hybrid approach with lower coverage long reads:

When long-read coverage is low, the long reads should be corrected with short reads prior to the long-read assembly (Haghshenas et al., 2016).

Hybrid approach with high coverage long reads:

If the long reads have higher coverage, a long read only assembly is possible (Koren et al., 2017), followed by polishing the genome with short reads (Walker et al., 2014).

What are some things to consider when deciding on a pipeline other than read type?

Additional metrics that should be considered are genome contiguity, genome coverage, and base calling error rate.
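Contiguity is often summarized with the N50 statistic, which this guide does not define; it is introduced here only as one common example. The N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch, with a hypothetical function name:

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Total is 54 bases; the three longest contigs (10 + 9 + 8) reach 27.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```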

Does post-processing application matter?

Pipelines and programs should be chosen based on how the assembled genome will be used. High genome coverage is prioritized if the assembled genome will serve as a reference genome for reference-guided assembly, while genomes assembled for functional genomic studies must emphasize high base calling accuracy in genic regions (Bayat et al., 2020).

What is an assembly graph?

An assembly graph connects DNA fragments to one another in order to build the genome. The output of a graph-based assembler is generally a set of sequences referred to as contigs, with no information about their order in the genome. Ideally, each contig would cover one entire chromosome. However, de novo assembly is rarely perfect, and one should not be discouraged if single contigs do not provide complete chromosomal coverage.

What is a contig?

Contigs are continuous (or 'contiguous') sequences produced in a de novo assembly, and are free of any gaps (Chaisson et al., 2015).

What are some of the common types of assembly graph?

Although new approaches emerge frequently, the common types of assembly graph are greedy, Eulerian (de Bruijn), and Overlap-Layout-Consensus (OLC) (Pevzner et al., 2001). Additionally, string graphs and the related A-Bruijn graphs are similar in concept to de Bruijn graphs (Chaisson et al., 2015).

Greedy assembly graphs:

Greedy graph assemblers use iterative extension as their one basic operation: any given read or contig is merged with the one with which it has the largest overlap (Bao et al., 2011).
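The greedy idea can be sketched in a few lines. This is an illustration under simplifying assumptions (error-free reads, exact suffix-prefix overlaps, a single strand), not the implementation of any particular assembler; the function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: remaining sequences stay separate contigs
        merged = seqs[i] + seqs[j][size:]
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]
    return seqs


print(greedy_assemble(["ATGGCG", "GCGTGC", "TGCAAT"]))  # ['ATGGCGTGCAAT']
```

Because the largest overlap is always taken first, repeats can lead a greedy assembler to merge reads that do not belong together.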

De Bruijn assembly graphs:

Assemblers using the de Bruijn graph approach aim to reduce fragment assembly to the classic graph-theoretical Eulerian path problem (Compeau et al., 2011). In short, the goal is to find a path that visits each edge of the de Bruijn graph (a short sequence fragment of fixed length taken from the reads) exactly once. A linear-time algorithm is used to find an Eulerian path and assemble contigs, which are then merged into a full-length genome sequence (Bao et al., 2011).
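A toy version of this idea is sketched below. It assumes error-free reads and that an Eulerian path exists, and it uses Hierholzer's algorithm as one well-known linear-time method; the source does not say which algorithm any given assembler uses. All function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer in the reads is one edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: walk every edge exactly once (assumes a path exists)."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start at a node with one more outgoing than incoming edge, if there is one.
    start = next((n for n in graph if len(graph[n]) - in_deg[n] == 1),
                 next(iter(graph)))
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Join overlapping nodes back into one sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


graph = de_bruijn_graph(["ATGGCGT", "GCGTGCA"], k=4)
print(spell(eulerian_path(graph)))  # ATGGCGTGCA
```

With a smaller k, repeated (k-1)-mers can make several Eulerian paths valid, which is one reason repeats are difficult for this approach.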

Overlap-Layout-Consensus (OLC) assembly graphs:

The OLC approach first detects overlaps between all reads. Overlapping reads are then merged repeatedly to form contigs, and each extension stops when it reaches a read that is heuristically determined to lie at the boundary of a repeat (Huang et al., 2003).

String and A-Bruijn assembly graphs:

String graphs and the related A-Bruijn graphs are similar in concept to de Bruijn graphs, but they use the full length of each sequence read instead of breaking sequences into k-mers (Chaisson et al., 2015). Put simply, k-mers are strings of k consecutive letters or bases extracted from a longer sequence (Nagarahan et al., 2013).
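For illustration, extracting the k-mers of a sequence takes a single sliding window. The function name is hypothetical.

```python
def kmers(sequence, k):
    """Return every substring of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ATGGCG", 3))  # ['ATG', 'TGG', 'GGC', 'GCG']
```

A sequence of length n yields n - k + 1 k-mers, so a 6-base sequence has four 3-mers.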

What are some of the options when attempting a hybrid approach?

There are many ways long and short reads can be utilized in a de novo assembly pipeline if both read types are available. A few examples are listed below.

Correcting long read errors using short reads.
Polishing assembled genomes from long reads using short reads.
Scaffolding contigs from a long read assembly pipeline using short reads.
Scaffolding contigs from a short read assembly pipeline using long reads.
Correcting errors in long reads using contigs created with short reads.
Assembling the genome three times (with only short reads, with only long reads, and with both short and long reads), then merging all three assemblies to produce a more accurate assembly.

What is preprocessing in de novo genome assembly?

Preprocessing steps in de novo assembly can include quality control by filtering reads to remove sequences that could increase the error rate, trimming of low quality bases, and fixing read errors to improve quality. Additionally, redundant reads can be removed to reduce the volume of data and improve processing, though this is generally only applied to short reads.
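Two of these steps, trimming low-quality bases and removing redundant reads, can be sketched as follows. The sketch assumes each read is a (sequence, Phred quality list) pair already parsed from a FASTQ file; the thresholds and function names are hypothetical and would need tuning for real data.

```python
def trim_low_quality_tail(seq, quals, min_q=20):
    """Trim bases from the 3' end until one meets the quality threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def preprocess(reads, min_q=20, min_len=4):
    """Trim each read, drop reads that become too short, remove exact duplicates."""
    seen, kept = set(), []
    for seq, quals in reads:
        seq, quals = trim_low_quality_tail(seq, quals, min_q)
        if len(seq) >= min_len and seq not in seen:
            seen.add(seq)
            kept.append((seq, quals))
    return kept


reads = [
    ("ACGTAC", [30, 30, 30, 30, 10, 5]),  # last two bases are low quality
    ("ACGT", [30, 30, 30, 30]),           # duplicate of the trimmed read above
    ("GG", [30, 30]),                     # too short to keep
]
print(preprocess(reads))  # [('ACGT', [30, 30, 30, 30])]
```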

How can I correct errors in long reads with short reads?

Correcting long reads with short reads is a well-established approach. It is usually done with a read mapper such as BWA. The entire long-read dataset is indexed first; then, for each short read, several subsequences, known as seeds, are identified. These seeds are used to search the long reads for exact matches.

Regions of a long read with enough seed matches are referred to as candidate regions. Each short read can then be aligned to its corresponding candidate regions. When enough short reads map to a long read to provide evidence of an error through a consensus process, the error can be corrected using the short-read data. Not all short reads may map on the first pass, so this correction method can be applied iteratively as many times as necessary.
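The seed, candidate region, and consensus steps can be sketched for a single long read. This is a heavily simplified illustration, not how BWA or any correction tool works internally: it handles substitution errors only (real long-read errors include many insertions and deletions, which require gapped alignment), and every name and threshold below is hypothetical.

```python
from collections import Counter, defaultdict


def correct_long_read(long_read, short_reads, seed_len=4, max_mismatch=2, min_votes=2):
    """Correct substitutions in one long read by majority vote of mapped short reads."""
    # Index every seed-length substring of the long read by position.
    index = defaultdict(list)
    for pos in range(len(long_read) - seed_len + 1):
        index[long_read[pos:pos + seed_len]].append(pos)

    pileup = [Counter() for _ in long_read]
    for read in short_reads:
        # Each exact seed hit votes for the offset where the short read would start.
        offsets = Counter()
        for i in range(0, len(read) - seed_len + 1, seed_len):
            for pos in index.get(read[i:i + seed_len], []):
                offsets[pos - i] += 1
        if not offsets:
            continue  # no seed matched: the read stays unmapped in this pass
        start = offsets.most_common(1)[0][0]
        if start < 0 or start + len(read) > len(long_read):
            continue
        # Ungapped comparison against the candidate region.
        region = long_read[start:start + len(read)]
        if sum(a != b for a, b in zip(read, region)) <= max_mismatch:
            for offset, base in enumerate(read):
                pileup[start + offset][base] += 1

    corrected = []
    for base, votes in zip(long_read, pileup):
        if votes:
            top, count = votes.most_common(1)[0]
            if top != base and count >= min_votes:
                base = top  # enough short reads disagree with the long read
        corrected.append(base)
    return "".join(corrected)


noisy = "ACGTTGCATACCGATAGGCT"  # position 9 should be G
shorts = ["TGCATGCCGA", "CATGCCGATA", "GTTGCATGCC"]
print(correct_long_read(noisy, shorts))  # ACGTTGCATGCCGATAGGCT
```

Running the function again on its own output with the short reads that failed to map gives the iterative variant described above.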

How do I correct errors in long reads without short reads?

Overlap detection is a common approach to error correction of long reads when no short reads are available. It involves finding overlaps between long reads and correcting errors through consensus. Since overlaps between long reads are not usually expected to match exactly, regions that are sufficiently long and similar are treated as overlaps.

Comparing every read against every other read in the dataset is not feasible, so indexing is often used to assist with the search. Once an index is created, k-mers shared between one read and the remaining reads in the dataset can be identified quickly. Once enough shared k-mers have been found to suggest an overlap between two reads, additional processing can confirm whether an overlap is actually present. The reads can then be aligned in the overlap region with a pairwise alignment algorithm.
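The indexing step can be sketched as a dictionary from each k-mer to the reads that contain it. The sketch only proposes candidate pairs; confirming the overlap and aligning the reads are left out. The function name and the `k` and `min_shared` values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(reads, k=5, min_shared=3):
    """Return {(i, j): count} for read pairs sharing at least min_shared distinct k-mers."""
    index = defaultdict(set)  # k-mer -> ids of the reads that contain it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)

    shared = defaultdict(int)
    for read_ids in index.values():
        for pair in combinations(sorted(read_ids), 2):
            shared[pair] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


reads = ["ACGTTGCATGCC", "GCATGCCGATAG", "TTTTTTTTTTTT"]
print(candidate_overlaps(reads))  # {(0, 1): 3}
```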

What can I do if my dataset is too large to index?

Although indexing the entire dataset is optimal, this is not always practical. For very large genomes where computing resources are a concern, there are two approaches to indexing the dataset. The first is partial indexing, where not all k-mers are stored in the index. This can cause some shared k-mers to be missed, though that is less of a concern if long overlapping regions are expected between reads.
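One way to store only some k-mers is to keep a k-mer when a checksum of its sequence falls in a chosen residue class. This is an assumption made for illustration; the guide does not specify a sampling scheme. The sketch uses `zlib.crc32` because Python's built-in `hash` of strings varies between runs, and the function names and `sample_every` parameter are hypothetical.

```python
import zlib
from collections import defaultdict


def keep_kmer(kmer, sample_every=4):
    """Deterministically keep roughly 1 in `sample_every` k-mers, based on a checksum."""
    return zlib.crc32(kmer.encode("ascii")) % sample_every == 0


def build_partial_index(reads, k=5, sample_every=4):
    """Index only the sampled k-mers: k-mer -> ids of the reads that contain it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if keep_kmer(kmer, sample_every):
                index[kmer].add(read_id)
    return index
```

Because the decision depends only on the k-mer itself, a k-mer is either kept in every read or dropped in every read, so a sampled k-mer shared by two reads is still found in both.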

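The second approach divides the dataset into parts and indexes one part at a time, so that only a fixed amount of resources is needed at any moment. The dataset can be divided into any number of parts: splitting it into n parts reduces the memory needed for the index by roughly a factor of n and increases processing by a similar factor, since every read must be searched against each part in turn.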
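The trade-off can be sketched by building the index for one part at a time and scanning all reads against it. This is an illustrative assumption about how the partitioning might be organized, not a description of a specific tool; the function names and the round-robin split are hypothetical.

```python
from collections import defaultdict


def build_index(reads_with_ids, k):
    """k-mer -> ids of the reads (within this part) that contain it."""
    index = defaultdict(set)
    for read_id, read in reads_with_ids:
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    return index


def shared_kmers_partitioned(reads, k=5, parts=2):
    """Count distinct shared k-mers per read pair, indexing one part at a time."""
    shared = defaultdict(set)
    ids = list(range(len(reads)))
    for p in range(parts):
        # Only this part's index is held in memory.
        part = [(i, reads[i]) for i in ids[p::parts]]
        index = build_index(part, k)
        # Every read is scanned against every part: `parts` passes in total.
        for query_id, read in enumerate(reads):
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                for hit_id in index.get(kmer, ()):
                    if hit_id != query_id:
                        pair = (min(hit_id, query_id), max(hit_id, query_id))
                        shared[pair].add(kmer)
    return {pair: len(found) for pair, found in shared.items()}


reads = ["ACGTTGCATGCC", "GCATGCCGATAG", "TTTTTTTTTTTT"]
print(shared_kmers_partitioned(reads, parts=2))  # {(0, 1): 3}
```

The result is the same for any number of parts; only the peak index size and the number of passes over the reads change.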
