Dictionary of Key Concepts

Massive Bioinformatics

16/08/2023

 

 

TERM

DEFINITION

Adapter An adapter is an oligonucleotide sequence that is attached to the ends of a DNA fragment by ligation or rapid chemistry. It carries a motor protein that helps separate the strands of the DNA molecule, guides the strand through the nanopore, and determines the DNA’s translocation speed through the nanopore.
Alpha diversity Refers to the diversity found within a single sample, area, or ecosystem, and is commonly measured in terms of species richness and species diversity. Species richness is the number of different species in a sample; species diversity also takes into account how the microorganisms are distributed among those species.
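The two measures can be illustrated with a minimal sketch. It assumes the Shannon index as the diversity measure (one common choice among several), and the species counts are hypothetical.

```python
import math

def species_richness(counts):
    """Number of species observed at least once in the sample."""
    return sum(1 for c in counts.values() if c > 0)

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over the observed species."""
    total = sum(counts.values())
    h = 0.0
    for c in counts.values():
        if c > 0:
            p = c / total  # relative abundance of this species
            h -= p * math.log(p)
    return h

# Hypothetical sample: two species with equal abundance
sample = {"Escherichia coli": 50, "Bacillus subtilis": 50}
print(species_richness(sample))
print(round(shannon_index(sample), 4))
```

For a fixed richness, the Shannon index is highest when all species are equally abundant and falls as one species comes to dominate.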
Principal Coordinates Analysis (PCoA) Also known as metric multidimensional scaling, PCoA displays the (dis)similarities between objects in a two- or three-dimensional Euclidean space. Each object is assigned a location in this space such that the distances between objects reflect their (dis)similarities, which are typically calculated from the relative abundances of species or other items. A PCoA plot is interpreted by proximity: objects that lie closer to one another share more similarities than objects that lie farther apart.
Annotation Genome annotation can be described as giving meaning to sequence data. During annotation, the locations of genes are identified, and then coding regions and their functions are determined. First, portions of the genome that do not code for proteins are recognized; next, using this prior information, gene prediction is performed by identifying the elements on the genome; finally, functions are assigned to these elements (Banerjee et al., 2021).
Antibiotic Resistance Genes Bacteria can resist antibiotics through several mechanisms, such as producing enzymes that degrade the antibiotic, removing the drug with efflux pumps, modifying the cellular target of the drug, activating a pathway that bypasses the activity of the drug, and eliminating the transmembrane pores through which the drug enters the cell. The genes that encode or regulate these mechanisms are called antibiotic resistance genes.
BAM BAM (Binary Alignment Map) files store reads aligned to a reference genome in a compressed binary format. A BAM file contains a header section and an alignment section. The header section includes information concerning the entire file, whereas the alignment section includes the name, sequence, and quality of each read together with its alignment information.
Barcoding Barcoding is a technique that makes it possible to sequence multiple samples simultaneously on one device by attaching predetermined oligonucleotide sequences, called barcodes, to the ends of the DNA from each sample. The barcoded samples are then pooled together and loaded on a single flow cell, and the reads are assigned back to their samples by their barcodes.
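Sorting pooled reads back into their samples (demultiplexing) can be sketched as follows. This is a simplified illustration that assumes the barcode sits at the very start of each read and must match exactly; real demultiplexers tolerate sequencing errors. The barcode sequences and sample names are hypothetical.

```python
def demultiplex(reads, barcodes):
    """Assign each read to a sample by an exact barcode match at the read start.

    barcodes maps a barcode sequence to a sample name (hypothetical values).
    The barcode is removed from the reads that are assigned to a sample.
    """
    bins = {name: [] for name in barcodes.values()}
    bins["unclassified"] = []
    for read in reads:
        for barcode, name in barcodes.items():
            if read.startswith(barcode):
                bins[name].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read
            bins["unclassified"].append(read)
    return bins

barcodes = {"AAGG": "sample_1", "CCTT": "sample_2"}
reads = ["AAGGACGTAC", "CCTTTTGACA", "GGGGACGT"]
print(demultiplex(reads, barcodes))
```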
Basecalling In the Oxford Nanopore sequencing platform, DNA and RNA strands pass through nanopores, creating an electrical signal. The process of converting this electrical signal into a sequence of nucleotide bases using neural networks is called “basecalling”. Tools such as Guppy are used for basecalling.
Benign Not pathogenic. Indicates that a genetic variation, cell, or tissue does not cause disease.
Beta diversity Beta diversity has several definitions and methods of calculation, but it usually refers to the differences between samples obtained from different environments. Mathematically, it is commonly defined as the ratio of gamma diversity to alpha diversity.
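The ratio definition can be sketched in a few lines. The sketch assumes that diversity is measured as species richness, with gamma diversity taken as the richness of all samples pooled together and alpha diversity as the mean richness per sample. The species labels are hypothetical.

```python
def beta_diversity(samples):
    """Beta diversity as gamma diversity divided by mean alpha diversity.

    samples is a list of sets, each holding the species seen in one sample.
    """
    gamma = len(set().union(*samples))  # richness of the pooled samples
    mean_alpha = sum(len(s) for s in samples) / len(samples)
    return gamma / mean_alpha

# Hypothetical samples from two environments sharing one species
samples = [{"A", "B"}, {"B", "C"}]
print(beta_diversity(samples))
```

A value of 1 means every sample contains the same species; larger values indicate greater turnover between samples.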
BLAST The Basic Local Alignment Search Tool is an algorithm for comparing an unknown nucleotide (DNA/RNA) or amino acid sequence, called the query, with reference sequences in databases such as those hosted by NCBI. The results are used to determine which species the query sequence is closest to, or to identify the species directly.
BRIG (comparative genomic analysis) BLAST Ring Image Generator is a freely available platform that generates images showing the similarity between a central reference sequence and other sequences as a set of concentric rings, in which BLAST comparisons are coloured on a scale defined by percentage identity (Blast ring image generator (brig), n.d.). Concentric rings are rings that share the same center. The generated images may include customized and/or annotated information, such as read coverage, assembly breakpoints, and collapsed repeats.
Contig A contig is a “contiguous” sequence formed by joining small overlapping pieces of DNA into a longer sequence. Sequence reads are assembled first into contigs and finally into a complete genomic sequence.
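Joining two overlapping pieces can be illustrated with a toy sketch. It assumes error-free reads and an exact suffix-prefix overlap, which real assemblers do not require; the read sequences and the minimum overlap length are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0

def merge_reads(left, right, min_overlap=3):
    """Join two overlapping reads into one contig, or return None if no overlap."""
    n = overlap_length(left, right, min_overlap)
    if n == 0:
        return None
    return left + right[n:]

# The reads share the four bases GTAC
print(merge_reads("ATGCGTAC", "GTACCTGA"))
```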
Coverage In next-generation sequencing, “reference genome coverage” (also called breadth of coverage) indicates the percentage of the reference genome, or of a specific sequence, that is covered by sequencing reads. “Depth of coverage” refers to the number of reads that cover a specific base in the selected nucleotide sequence. “Mean coverage” is obtained by dividing the total number of read bases aligned to the selected region by the length of the region. This value represents the average number of reads covering each nucleotide in that region.
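These quantities can be illustrated with a minimal sketch. It assumes each alignment is given as a hypothetical 0-based, half-open (start, end) interval on the region. Mean coverage is computed here as the total number of aligned bases divided by the region length, which gives the average number of reads per nucleotide.

```python
def per_base_depth(region_length, alignments):
    """Count how many alignments cover each position of the region."""
    depth = [0] * region_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, region_length)):
            depth[pos] += 1
    return depth

def mean_coverage(depth):
    """Average number of reads covering each nucleotide."""
    return sum(depth) / len(depth)

def breadth_of_coverage(depth, min_depth=1):
    """Fraction of positions covered by at least min_depth reads."""
    return sum(1 for d in depth if d >= min_depth) / len(depth)

# Hypothetical 10-base region covered by two reads
depth = per_base_depth(10, [(0, 5), (3, 8)])
print(depth)
print(mean_coverage(depth), breadth_of_coverage(depth))
```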
Multiple alignment (Lastz, Mauve) Multiple alignment is a method for comparing more than two sequences at once. In this technique, multiple sequences coming from different samples are aligned to one another and compared according to their differences.
De novo Assembly (Flye) Flye is a de novo assembly tool for processing single-molecule sequencing reads generated by ONT and PacBio. The Flye de novo assembler can be used on a great variety of datasets, from 16S projects to large genomes such as mammalian and plant genomes. The Flye pipeline is a complete package that takes raw ONT or PacBio reads as input and produces polished contigs as output. Flye generates repeat graphs and resolves them into high-quality, contiguous assemblies (Schmid et al., 2018).
Dendrogram A dendrogram is a tree-like plot that gives a graphical representation of hierarchical clustering. Each branch shows a cluster obtained at one step of the hierarchical clustering.
Depth In next-generation sequencing, the term “depth” indicates the total number of reads covering a single nucleotide position in a specific region. “Allele depth” is the number of reads in which a specific variant of the selected nucleotide was detected. “Allele frequency” is obtained by dividing the allele depth by the total depth.
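The relationship between depth, allele depth, and allele frequency can be shown with a short sketch. It assumes the bases observed at one position are available as a string of read bases (a hypothetical pileup).

```python
from collections import Counter

def allele_frequencies(pileup_bases):
    """Allele frequency = allele depth / total depth at a single position."""
    total_depth = len(pileup_bases)
    allele_depths = Counter(pileup_bases)
    return {allele: n / total_depth for allele, n in allele_depths.items()}

# Hypothetical position covered by 10 reads: 6 show A and 4 show G
print(allele_frequencies("AAAAAAGGGG"))
```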
Sequence Alignment The arrangement of two or more DNA, RNA, or protein sequences in such a way that the most similar elements are placed opposite each other is called alignment. Many bioinformatics tasks depend on successful alignments.
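As an illustration of how an alignment can be computed, the sketch below implements Needleman-Wunsch global alignment, one classical dynamic-programming method for aligning two sequences. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell holds the best score for the two prefixes
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            if score[i][j] == diag:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

best, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(top)
print(bottom)
```

Gaps ("-") mark positions where one sequence has no counterpart in the other.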
DNA methylation Methylation is the addition of a methyl (-CH3) group to a molecule. Methylation of DNA is the covalent attachment of a methyl group to a cytosine nucleotide, forming 5-methylcytosine (5mC). DNA methylation is an epigenetic modification that plays a crucial part in normal development and can have different molecular functions depending on its location, primarily in the regulation of gene expression.
Exon Exons are segments of a DNA or RNA molecule that contain the information for coding amino acids. Pre-mRNA includes both exons and introns, but during mRNA maturation the non-coding introns are cut out and only the exon sequences remain.
Fast5 This file format contains the raw signal data produced by the ONT sequencing platform. The Fast5 file format is based on the Hierarchical Data Format 5 (HDF5). The main data in a Fast5 file are “squiggles”, which represent measurements taken from the nanopores thousands of times per second.
Fastq Fastq is another file format produced by sequencing. In nanopore sequencing, Fast5 files are converted to Fastq during basecalling with Guppy. A Fastq file contains four lines of information for each read. The first line is the header, which begins with “@” and identifies the read. The second line contains the sequence itself. The third line begins with a “+” sign and is either otherwise blank or contains additional information. Finally, the fourth line contains symbols that encode the quality of each corresponding base in the sequence.
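The four-part record layout can be illustrated with a minimal parser. It assumes each record occupies exactly four lines and that qualities use the Phred+33 encoding (quality score = ASCII code minus 33), which is an assumption about the file rather than something stated in this glossary. The read name and sequence are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (read_name, sequence, quality_scores) for each 4-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        # Assumed Phred+33 encoding: subtract 33 from each ASCII code
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# Hypothetical single-read file held in memory
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
for name, seq, scores in read_fastq(example):
    print(name, seq, scores)
```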
Germline mutations Mutations in the sperm or egg cell, which are passed directly from parents to children. Because they are present from embryogenesis onward, germline mutations are carried in all cells of the body and can be passed on to subsequent offspring. About 5-10% of all cancers develop as a result of germline mutations; the conditions that predispose to these cancers are called hereditary cancer syndromes.
Hypothesis testing Refers to a type of statistical inference that uses data from a sample to examine a population parameter or a population probability distribution. First, a tentative assumption is made about the parameter or distribution. This assumption is called the null hypothesis and is denoted H0. An alternative hypothesis (denoted Ha), which states the opposite of the null hypothesis, is then defined. The hypothesis-testing procedure uses the sample data to determine whether H0 can be rejected. If H0 is rejected, the statistical conclusion is that the alternative hypothesis Ha is supported (Shreffler & Huecker, 2021).
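The procedure can be made concrete with one possible test, an exact two-sided permutation test for a difference in means. This is an illustrative choice rather than the method of the cited source, and the measurements are hypothetical. H0 states that both groups come from the same distribution.

```python
from itertools import combinations

def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def mean_difference(sum_a):
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_difference(sum(group_a)))
    extreme = 0
    n_splits = 0
    # Under H0 every way of splitting the pooled values is equally likely
    for indices in combinations(range(len(pooled)), n_a):
        n_splits += 1
        sum_a = sum(pooled[i] for i in indices)
        if abs(mean_difference(sum_a)) >= observed - 1e-12:
            extreme += 1
    return extreme / n_splits

# Hypothetical measurements from two conditions
p_value = exact_permutation_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(p_value)
```

If the p-value falls below a significance level chosen in advance (0.05 is a common convention), H0 is rejected.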
Histone Histones are small, basic, and highly conserved proteins that act as scaffolds for packaging DNA. A DNA fragment of approximately 150 bp in length is wrapped around a histone octamer to form a nucleosome, the basic subunit of chromatin structure. Histones regulate gene activity by undergoing post-translational modifications such as methylation, acetylation, and phosphorylation.
Histone acetylation Addition of an acetyl group to histone octamers. An acetyl group is transferred from acetyl coenzyme A to a histone, which generally leads to the transition of chromatin to the active euchromatin structure.
Histone methylation Addition of a methyl group to histone octamers. Histones can be mono-, di-, or trimethylated. Depending on the number and location of the methyl marks, these modifications can lead to increased gene expression or to gene silencing.
Secondary Metabolites Compounds that do not directly affect the growth or reproduction of an organism but give it a selective advantage. For example, a compound produced by the organism might inhibit another organism with which it competes. Secondary metabolites are the primary source of many antibiotics and other medically important compounds.
Intron Introns are segments of a gene that do not code for amino acids. Introns are common in eukaryotes, whereas they are very rare in prokaryotes. In the splicing step of pre-mRNA processing, introns are removed to produce the mature mRNA. Although introns are not translated, they have several functions in eukaryotes, such as enabling alternative splicing, enhancing gene expression, and controlling mRNA transport.
Quality Metrics Quality metrics here refer to the metrics used to measure the accuracy of an alignment. These metrics quantify how well the query under investigation matches the corresponding segments in the database.
Trimming (Trimmomatic) Read trimming is the first step in a sequencing data analysis pipeline. It alters the read sequences generated by the sequencer, typically by removing adapter sequences and low-quality bases. Because it changes the raw reads, trimming may influence all subsequent steps of the analysis pipeline.
Consensus genome A consensus sequence is a DNA, RNA, or protein sequence that represents a set of related, aligned sequences. The consensus of similar sequences can be defined in a variety of ways, but the most common nucleotide or amino acid residue at each position is usually used.
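The "most common residue at each position" rule can be sketched as below. The sketch assumes the input sequences are already aligned and of equal length; ties are broken by whichever residue appears first in the column, which is an arbitrary choice. The sequences are hypothetical.

```python
from collections import Counter

def consensus_sequence(aligned_sequences):
    """Take the most common residue in each column of an alignment."""
    if len({len(s) for s in aligned_sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    columns = zip(*aligned_sequences)
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

# Hypothetical alignment of three related sequences
print(consensus_sequence(["ACGT", "ACGA", "ACCT"]))
```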
Malignant Harmful, pathogenic. Indicates that a genetic variation, cell, or tissue can cause disease.
Read Mapping The process of aligning each read obtained by sequencing to the corresponding reference genome is called read mapping. Each read may have one or more alignments to the reference genome.
ORF (Open Reading Frame) An open reading frame is a stretch of DNA that can potentially be translated: it begins with a start codon and continues, without interruption by a stop codon, until a stop codon is reached. Any DNA sequence can be read in several possible reading frames, and each of them may contain open reading frames. The search for a protein-coding gene, especially in prokaryotes, starts by looking for open reading frames, which gives ORFs a critical role in gene prediction.
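A minimal ORF scan is sketched below. It assumes the common working definition of an ORF as a start codon (ATG) followed in-frame by codons up to the first stop codon (TAA, TAG, or TGA), and it scans only the three forward-strand frames. The minimum length and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ORFs in the three forward frames.

    start is 0-based and end is exclusive; the stop codon is included.
    min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # close the ORF and look for the next one
    return orfs

print(find_orfs("CCATGAAATTTTAGCC"))
```

A fuller search would repeat the scan on the reverse complement to cover the remaining three frames.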
qPCR Also referred to as real-time PCR, qPCR (quantitative PCR) is a method that amplifies a targeted genomic region and measures the amplification with the help of fluorescent dyes. The main difference from classical PCR is that the amplified product (amplicons) is quantified in real time. The technique is based on sensitive detectors that measure the fluorescence of dyes bound to the amplicons.
Map to reference (Minimap2) Map to reference refers to mapping the reads to the reference genome of the organism of interest. In this process, the reads coming from the sequencing platform are aligned to their corresponding locations in the target reference genome. In this way, a map of the genome of the organism of interest can be constructed.
Shotgun metagenomics Shotgun metagenomic sequencing sequences the genomes of all cells in a community, without targeting particular organisms or genes, in order to determine the composition and function of that community. Because microbial communities are so widespread, research employing this approach spans a variety of domains. Investigating the soil microbiota, for example, has aided in the understanding and treatment of plant diseases.
Somatic mutations Mutations that occur in a single cell during a person’s lifetime as a result of DNA damage caused by various external factors. Somatic mutations are not passed on from parents to children; they arise at later stages of life. Sporadic cancers develop as a result of these mutations, which cause changes in the cell and in the cellular environment in which they occur.
PCA (Principal Component Analysis) Samples in a set can be compared with each other across multiple variables. The dataset can be represented in a high-dimensional space in which each coordinate axis corresponds to one variable. Because a large number of axes complicates comparisons between samples, PCA (Principal Component Analysis) simplifies the analysis by reducing the dimensionality. PCA begins by finding the first axis, the direction along which the variance of the sample set is greatest. Equivalently, this axis is the line that maximizes the spread of the perpendicular projections of the data points onto it, or that minimizes the mean of the squared distances from the points to the line. A second axis is then drawn perpendicular to the first, again in the direction of maximum remaining variance. The procedure continues by adding axes that are orthogonal to all previously drawn axes and that capture the maximum remaining variance. These axes are the principal components; keeping only the first few of them reduces the dimensionality substantially while losing only a small percentage of the variance in the data.
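The procedure can be sketched with NumPy, assuming NumPy is available. The sketch centers the data and uses the singular value decomposition, one standard way to obtain the orthogonal axes of maximum variance. The data matrix is hypothetical.

```python
import numpy as np

def pca(data, n_components=2):
    """Project samples (rows) onto the first principal components.

    Returns (scores, components, explained_variance_ratio).
    """
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)  # move the data cloud to the origin
    # Rows of vt are orthogonal axes ordered by decreasing variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T  # coordinates along the new axes
    variance = singular_values ** 2
    explained = variance / variance.sum()
    return scores, components, explained[:n_components]

# Hypothetical samples lying exactly on a line in two dimensions
data = [[0, 0], [1, 2], [2, 4], [3, 6]]
scores, components, explained = pca(data, n_components=1)
print(explained)
```

Because these points lie on a single line, one component captures essentially all of the variance, so the two-dimensional data can be described by one coordinate.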
Telomere DNA-protein structures located at the ends of chromosomes that protect the DNA from processes such as degradation and fusion. They are widely known for their importance in cellular aging. Since DNA ends cannot be completely copied during replication, telomeres gradually shorten with successive cell divisions.
Tumor mutation burden (TMB) The number of mutations per unit of coding region, usually expressed as mutations per megabase, in the genome of a tumor cell. Tumors with high TMB tend to respond better to immune checkpoint inhibitor therapies.
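The calculation can be shown in a brief sketch, assuming the common convention of reporting TMB as mutations per megabase of sequenced coding region. The mutation count and region size are hypothetical.

```python
def tumor_mutation_burden(n_mutations, covered_bases):
    """TMB in mutations per megabase (1 Mb = 1,000,000 bases)."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_mutations / (covered_bases / 1_000_000)

# Hypothetical tumor: 300 mutations across 30 Mb of coding sequence
print(tumor_mutation_burden(300, 30_000_000))
```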
Whole Genome Sequencing Whole genome sequencing (WGS) is the technique of determining the complete, or nearly complete, DNA sequence of an organism all at once. It produces a massive quantity of raw data, from which usable information can only be extracted through complex bioinformatic analyses. The analysis may focus on the complete genome (whole-genome analysis, WGA), the exome (whole-exome analysis), or a subset of genes, depending on the test’s goal.
Universal primers Primers designed to be complementary to the vector DNA at a position just upstream of the DNA segment inserted into the vector for sequencing. Because they are vector-specific, these primers can be used to sequence any DNA fragment ligated into the same type of vector.
UTR region Untranslated Region. Regions located at the 5′ and 3′ ends of messenger RNAs (mRNAs) that are not translated into protein. These regions regulate the translation of the mRNA by controlling its stability, function, and localization.
Domain The highest rank in taxonomy; the domain is the broadest biological category used to classify an organism. There are three domains: Bacteria, Archaea and Eukarya.
Variant-polymorphism-mutation Any change in the hereditary material (DNA/RNA) is called a variant (variation). Polymorphisms are genetic variations that occur widely in the population (the least common variant should be found in at least 1% of the population), and they are usually not associated with disease. Mutation, on the other hand, is a term used for rarer and abnormal alterations; a specific mutation can be considered a polymorphism once its frequency in the population increases to this level.