Key Concepts

Massive Bioinformatics



13/12/2023




Adapter

An adapter is an oligonucleotide sequence that is attached to the ends of a DNA fragment by ligation or rapid chemistry. It carries a motor protein that separates the strands of the DNA molecule, helps the strand pass through the nanopore, and controls the speed at which the DNA translocates through the nanopore.

Alpha diversity

Alpha diversity refers to the diversity found within a single area, ecosystem, or sample, and is commonly measured in terms of species richness and species diversity. Species richness is the number of different species in a sample, whereas species diversity also takes into account how the microorganisms are distributed among those species.
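
The two measures can be illustrated with a minimal sketch. The species counts below are hypothetical, and the Shannon index is used here only as one commonly used diversity index; the text above does not name a specific index.

```python
import math

def species_richness(counts):
    """Number of species observed at least once in the sample."""
    return sum(1 for c in counts.values() if c > 0)

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over observed species."""
    total = sum(counts.values())
    h = 0.0
    for c in counts.values():
        if c > 0:
            p = c / total  # relative abundance of this species
            h -= p * math.log(p)
    return h

# Hypothetical read counts per species in one sample
sample = {"E. coli": 50, "B. subtilis": 30, "S. aureus": 20}
print(species_richness(sample))
print(round(shannon_index(sample), 3))
```

A sample dominated by one species has the same richness as an evenly distributed one but a lower Shannon index, which is why the two measures are usually reported together.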

Annotation

Genome annotation can be described as giving meaning to sequence data. During genome annotation, the locations of genes are identified, and then the coding regions and their functions are determined. First, portions of the genome that do not code for proteins are recognized; next, using this prior information, gene prediction is performed by identifying the elements on the genome; finally, functions are assigned to these elements (Banerjee et al., 2021).

Antibiotic Resistance Genes

Several mechanisms can give rise to antibiotic resistance in bacteria, such as producing enzymes that degrade antibiotics, removing the drug with efflux pumps, modifying the cellular target of the drug, activating a pathway that bypasses the activity of the drug, and eliminating the transmembrane pores through which the drug enters the cell. The genes that regulate these mechanisms are called antibiotic resistance genes.

BAM

BAM (Binary Alignment Map) files contain reads aligned to a reference genome. A BAM file consists of a header section and an alignment section. The header section holds information concerning the entire file, whereas the alignment section holds the name, sequence, and quality of each read together with its alignment information.

Barcoding

Barcoding is a technique that makes it possible to pool multiple samples and sequence them simultaneously on one device by attaching predetermined oligonucleotide sequences, called barcodes, to the ends of the DNA. The barcoded samples are then pooled and loaded onto a single flow cell.
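
After sequencing, pooled reads have to be assigned back to their samples (demultiplexing). The sketch below is a deliberately simplified illustration: it assumes each barcode sits at the very start of the read and matches exactly, whereas real demultiplexing tools tolerate sequencing errors. The sample names and barcode sequences are hypothetical.

```python
def demultiplex(reads, barcodes):
    """Assign each read to the sample whose barcode it starts with.

    reads    -- list of read sequences (strings)
    barcodes -- dict mapping sample name -> barcode sequence
    Returns a dict mapping sample name -> list of reads with the barcode removed.
    Reads that match no barcode are collected under "unclassified".
    """
    bins = {name: [] for name in barcodes}
    bins["unclassified"] = []
    for read in reads:
        for name, barcode in barcodes.items():
            if read.startswith(barcode):
                bins[name].append(read[len(barcode):])  # strip the barcode
                break
        else:
            # No barcode matched this read
            bins["unclassified"].append(read)
    return bins

# Hypothetical barcodes and pooled reads
barcodes = {"sample_A": "AAGG", "sample_B": "CCTT"}
pooled = ["AAGGACGTAC", "CCTTGGGTTT", "TTTTACGTAC"]
print(demultiplex(pooled, barcodes))
```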

Basecalling

In the Oxford Nanopore sequencing platform, DNA and RNA strands pass through the nanopores, creating an electrical signal. The process of converting this electrical signal into a sequence of bases (A, T, G, C) using neural networks is called “basecalling”. Tools such as Guppy are used for basecalling.

Benign

Not pathogenic. Indicates that a genetic variation, or cell and tissue, does not cause disease.

Beta Diversity

Beta diversity has several definitions and methods of calculation, but it usually refers to the differences between samples obtained from different environments. Mathematically, it is often defined as the ratio of gamma diversity (the total diversity across all samples) to alpha diversity.
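
The ratio definition can be sketched as follows, under the assumption that diversity is measured simply as species richness: gamma is the number of distinct species across all samples, and alpha is the mean number of species per sample. The species lists are hypothetical.

```python
def beta_diversity_ratio(samples):
    """Beta diversity as gamma / mean alpha, using species richness.

    samples -- list of collections, each holding the species seen in one sample
    """
    species_sets = [set(s) for s in samples]
    gamma = len(set().union(*species_sets))  # distinct species overall
    alpha = sum(len(s) for s in species_sets) / len(species_sets)  # mean richness
    return gamma / alpha

# Hypothetical species lists from two environments
soil = {"sp1", "sp2"}
water = {"sp2", "sp3"}
print(beta_diversity_ratio([soil, water]))  # 3 species overall / 2 per sample = 1.5
```

With this definition, a value of 1 means all samples contain the same species, and larger values mean the samples share fewer species.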

BLAST

BLAST (Basic Local Alignment Search Tool) is an algorithm for comparing an unknown nucleotide (DNA/RNA) or amino acid sequence, called the query, with reference sequences held in databases such as those of NCBI. The comparison shows which species the sequence is closest to, or identifies the species directly.

BRIG (Comparative Genomic Analysis)

BRIG (BLAST Ring Image Generator) is a free-access platform that generates images showing the similarity between a central reference sequence and other template sequences as a set of concentric rings, where BLAST comparisons are coloured on a scale defined by percentage identity (Blast ring image generator (brig), n.d.). Concentric rings are shapes that share the same axis or center. The generated images may include customized and/or annotated information, such as read coverage, assembly breakpoints, and collapsed repeats.

Contig

A contig is the “contiguous” state of small overlapping pieces of DNA. It is formed by joining overlapping pieces of DNA into a longer sequence. Sequences are assembled first into contigs and finally into a complete genomic sequence.
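
The idea of joining overlapping pieces can be shown with a toy sketch. It assumes error-free fragments and exact suffix-prefix overlaps, which real assemblers do not; the function name, the minimum overlap, and the sequences are hypothetical.

```python
def merge_overlap(left, right, min_overlap=3):
    """Join two fragments if a suffix of `left` equals a prefix of `right`.

    Tries the longest possible overlap first and returns the merged
    sequence, or None if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]  # keep the shared bases only once
    return None

# The fragments share the 4-base overlap "GTAC"
print(merge_overlap("ATGCGTAC", "GTACCTGA"))  # ATGCGTACCTGA
```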

Consensus Genome

A consensus sequence is a DNA, RNA, or protein sequence that represents related, aligned sequences. The consensus sequence of similar sequences can be defined in a variety of ways, but the most common nucleotide(s) or amino acid residue(s) at each location are usually used.
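
The "most common residue at each location" rule can be sketched as a simple majority vote per alignment column. The sketch assumes the sequences are already aligned to equal length; ties are broken by whichever residue appears first in the column, which is an arbitrary choice made for this example.

```python
from collections import Counter

def consensus(aligned_sequences):
    """Return the most common character in each column of an alignment."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    columns = zip(*aligned_sequences)  # iterate column by column
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

# Hypothetical aligned reads
reads = ["ACGT", "ACGA", "TCGT"]
print(consensus(reads))  # ACGT
```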

Coverage

In next-generation sequencing, “reference genome coverage” (breadth of coverage) indicates the percentage of the reference genome that is covered by sequenced reads. “Depth of coverage” refers to the number of reads covering a specific base in the selected nucleotide sequence. “Mean coverage” is obtained by dividing the total number of read bases aligned to the selected region by the length of the region. This value represents the average number of reads covering each nucleotide in that region.
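
A small sketch may make these quantities concrete. It assumes reads are given as hypothetical (start, end) intervals on the reference, 0-based with the end excluded, and computes the per-base depth, the mean coverage, and the fraction of the region covered by at least one read.

```python
def per_base_depth(region_length, reads):
    """Count how many reads cover each position of the region."""
    depths = [0] * region_length
    for start, end in reads:  # 0-based start, end excluded
        for pos in range(max(start, 0), min(end, region_length)):
            depths[pos] += 1
    return depths

def mean_coverage(depths):
    """Average number of reads covering each base of the region."""
    return sum(depths) / len(depths)

def covered_fraction(depths, min_depth=1):
    """Fraction of positions covered by at least `min_depth` reads."""
    return sum(1 for d in depths if d >= min_depth) / len(depths)

# Two hypothetical reads on a 10-base region
depths = per_base_depth(10, [(0, 5), (3, 8)])
print(depths)                    # [1, 1, 1, 2, 2, 1, 1, 1, 0, 0]
print(mean_coverage(depths))     # 1.0
print(covered_fraction(depths))  # 0.8
```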

De Novo Assembly (Flye)

Flye is a de novo assembly tool for processing single-molecule sequencing reads generated by ONT and PacBio. The Flye de novo assembler can be used on a wide variety of datasets, from 16S projects to large genomes such as mammalian and plant genomes. The Flye pipeline is a complete package that takes raw ONT or PacBio data as input and returns polished contigs. Flye generates repeat graphs and uses them to produce high-quality, contiguous assemblies (Schmid et al., 2018).

Dendrogram

A dendrogram is a tree-like plot that gives a graphical representation of hierarchical clustering. Each branch shows a cluster obtained at one step of the hierarchical clustering.

Depth

In next-generation sequencing, the term “depth” indicates the total number of reads covering a single nucleotide position in a specific region. “Allele depth” is the number of reads in which a specific variant of the selected nucleotide was detected. The “allele frequency” is obtained by dividing the allele depth by the total depth.
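
As a worked illustration, the sketch below takes the bases observed at one position across all reads (a hypothetical pileup string) and derives the total depth, the allele depths, and the allele frequencies.

```python
from collections import Counter

def allele_frequencies(pileup_bases):
    """Return (total depth, allele depths, allele frequencies) for one position.

    pileup_bases -- string with one base per read covering the position
    """
    depth = len(pileup_bases)
    if depth == 0:
        raise ValueError("position is not covered by any read")
    allele_depths = Counter(pileup_bases)
    frequencies = {allele: n / depth for allele, n in allele_depths.items()}
    return depth, dict(allele_depths), frequencies

# Ten hypothetical reads: six show A, four show G
depth, allele_depths, freqs = allele_frequencies("AAAAAAGGGG")
print(depth)          # 10
print(allele_depths)  # {'A': 6, 'G': 4}
print(freqs)          # {'A': 0.6, 'G': 0.4}
```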

DNA Methylation

Methylation is the addition of a methyl (-CH3) group to a molecule. Methylation of DNA is the covalent attachment of a methyl group to the cytosine nucleotide, forming 5-methylcytosine (5mC). DNA methylation is an epigenetic modification that plays a crucial part in normal development and can have different molecular functions depending on its location, primarily in the regulation of gene expression.

Exon

Exons are segments of a DNA or RNA molecule that contain the information for coding amino acids. Pre-mRNA includes both exons and introns, but during mRNA maturation the non-coding introns are cut out and only the coding exon sequences remain.

Fast5

This file format contains the raw signal data produced by the ONT sequencing platform. The Fast5 file format is based on the Hierarchical Data Format 5 (HDF5). The main data in a Fast5 file are “squiggles” that represent measurements taken from nanopores thousands of times per second.

Fastq

Fastq is another file format produced by sequencing. In nanopore sequencing, Fast5 files are converted to Fastq with Guppy. A Fastq file contains four lines of information for each read. The first line is the header, which begins with “@” and carries the read identifier. The second line contains the sequence itself. The third line begins with the “+” sign and is otherwise blank or contains additional information. Finally, the fourth line contains symbols that indicate the quality of each corresponding base in the sequence.
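
The four-line record structure can be read with a short parser. This is a minimal sketch that assumes each record occupies exactly four lines and that quality symbols use the common Phred+33 encoding (quality = ASCII code minus 33); the example record is made up.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality scores) for each 4-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding of quality symbols
        scores = [ord(symbol) - 33 for symbol in quality]
        yield header[1:], sequence, scores

# A made-up record: header, sequence, "+" line, quality symbols
example = "@read1 sample=demo\nACGT\n+\nII5!\n"
for name, seq, scores in read_fastq(io.StringIO(example)):
    print(name, seq, scores)  # read1 sample=demo ACGT [40, 40, 20, 0]
```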

Germline Mutations

Germline mutations are mutations in the sperm or egg cell that are passed directly from parents to children. Because they are present from embryogenesis onward, germline mutations are carried in all cells of the offspring and can be passed on to subsequent generations. About 5-10% of all cancers develop as a result of germline mutations, and these cancers are also called hereditary cancer syndromes.

Hypothesis Testing

Hypothesis testing is a type of statistical inference that uses data from a sample to examine a population parameter or a population probability distribution. First, a tentative assumption is made about the parameter or distribution. This assumption is called the null hypothesis and is denoted H0. An alternative hypothesis (denoted Ha), which states the opposite of the null hypothesis, is then formulated. The hypothesis-testing procedure uses sample data to decide whether H0 can be rejected. If H0 is rejected, the statistical conclusion is that the alternative hypothesis Ha is supported (Shreffler & Huecker, 2021).
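
The procedure can be illustrated with an exact one-sided binomial test, chosen here only because it needs nothing beyond the standard library. The scenario, the function name, and the 0.05 significance level are assumptions made for the example: H0 states that the success probability is 0.5, and Ha states that it is greater.

```python
from math import comb

def binomial_p_value(successes, trials, p0=0.5):
    """One-sided p-value: P(X >= successes) if H0 (probability = p0) is true."""
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical data: 9 successes observed in 10 trials
alpha = 0.05  # assumed significance level
p_value = binomial_p_value(9, 10)
print(p_value)  # 11/1024, about 0.0107
if p_value < alpha:
    print("Reject H0 in favour of Ha")
else:
    print("Fail to reject H0")
```

A p-value above the chosen level does not prove H0; it only means the sample gives insufficient evidence to reject it.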

Histone

Histones are small, basic, and highly conserved proteins that act as scaffolds for packaging DNA. A DNA fragment of approximately 150 bp in length is wrapped around the histone octamer, forming a nucleosome, the subunit of chromatin structure. Histones regulate gene activity by undergoing post-translational modifications such as methylation, acetylation, and phosphorylation.

Histone Acetylation

Addition of an acetyl group to histone octamers. An acetyl group is transferred from acetyl coenzyme A to a histone, which generally leads to the transition of chromatin to the active euchromatin structure.

Histone Methylation

Addition of a methyl group to histone octamers. Histones can be mono-, di-, or trimethylated. Depending on the number and the location of the methyl marks, these modifications can lead to increased gene expression or gene silencing.

Intron

Introns are segments of a gene that do not code for amino acids. Introns are common in eukaryotes, whereas they are very rare in prokaryotes. In the splicing step of pre-mRNA processing, introns are removed to produce the mature mRNA. Although introns are not translated, they have several functions in eukaryotes, such as enabling alternative splicing, enhancing gene expression, and controlling mRNA transport.

Multiple Alignment (Lastz, Mauve)

Multiple alignment is a method for comparing several sequences at once. In this technique, multiple sequences coming from different samples are aligned and compared according to their differences.

Principal Coordinates Analysis (PCoA)

Also known as metric multidimensional scaling, PCoA displays the (dis)similarities between objects in a two- or three-dimensional geometrical space (called Euclidean space). Each object is assigned a location in this space, and the (dis)similarities are calculated from the relative abundances of species or items. A PCoA plot is interpreted on the principle that objects lying closer to one another share more similarities than objects lying farther apart.

Quality Metrics

Quality metrics here refer to the metrics used to measure accuracy
in alignment. These metrics weigh the compatibility between the
query that is the subject of research and the corresponding segments
in the database.

Secondary Metabolites

Secondary metabolites are compounds that do not directly affect
growth or reproduction but give the organism a selective advantage.
For example, a compound produced by the organism might have an
inhibitory effect on another organism with which it competes.
Secondary metabolites are the primary source of many antibiotics
and other medically important compounds.

Sequence Alignment

The arrangement of two DNA or protein sequences in such a way that
the most similar elements are placed next to each other is called
alignment. Many bioinformatics tasks depend on successful
alignments.
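
To show what such an arrangement looks like, here is a minimal sketch of global pairwise alignment by dynamic programming (the classic Needleman-Wunsch approach). It is offered as one well-known method, not as the method any particular tool in this glossary uses, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for placing a[i-1] opposite b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

top, bottom, total = global_align("ACGT", "AGT")
print(top)     # ACGT
print(bottom)  # A-GT
print(total)   # 2
```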

Trimming (Trimmomatic)

The first step in a sequencing data analysis pipeline is read
trimming, which removes unwanted parts, such as adapter sequences and
low-quality bases, from the reads generated by a sequencer. The
changes it makes to the raw reads may influence all of the following
phases of the analysis pipeline.
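
One simple form of trimming is to cut low-quality bases off the end of a read. The sketch below is a simplified illustration of that idea and not Trimmomatic's implementation; the function name and the quality threshold of 20 are assumptions.

```python
def trim_trailing(sequence, qualities, min_quality=20):
    """Remove bases from the 3' end while their quality is below the threshold.

    sequence  -- read sequence (string)
    qualities -- per-base quality scores, same length as the sequence
    """
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1  # drop the last remaining base
    return sequence[:end], qualities[:end]

# Hypothetical read whose last two bases are of low quality
seq, quals = trim_trailing("ACGTAC", [30, 30, 30, 25, 10, 5])
print(seq)    # ACGT
print(quals)  # [30, 30, 30, 25]
```

Because trimmed reads are shorter, downstream steps such as alignment and coverage calculation see different input, which is why trimming choices can affect the rest of the pipeline.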