Biological Dark Matter Exploration using Data Mining for the Discovery of Antimicrobial Natural Products

Abstract The discovery of novel antimicrobials has slowed significantly over the last three decades. At the same time, humans rely increasingly on antimicrobials because of the progressive antimicrobial resistance in medical practices, human communities, and the environment. Data mining is currently considered a promising option in the discovery of new antibiotics. Some of the advantages of data mining are the ability to predict chemical structures from sequence data, anticipation of the presence of novel metabolites, the understanding of gene evolution, and the corroboration of data from multiple omics technologies. This review analyzes the state-of-the-art for data mining of bacterial, fungal, and plant genomic data, as well as metabologenomics. It also summarizes some of the most recent research accomplishments in the field, all pointing to innovation through uncovering and implementing the next generation of antimicrobials.

ening strains is imperative. There is an urgency to develop new strategies to control the emergence and spread of AMR to avoid further health emergencies from infectious agents [3]. In this scenario, developing new antibiotics is a crucial task that must occur as soon as possible, as recommended by health agencies such as the WHO, the European Centre for Disease Prevention and Control (ECDC), the US Centers for Disease Control and Prevention (CDC), the Infectious Diseases Society of America (IDSA), the Drugs for Neglected Diseases initiative (DNDi), and global non-profit organizations such as the Global Antibiotic Research and Development Partnership (GARDP) [6].
For this purpose, several strategies, including the exploration of natural resources using conventional and state-of-the-art methodologies, as well as synthetic procedures, are currently being applied. Among these cutting-edge approaches, data mining of biological dark matter stands among the most powerful alternatives to contain the health crisis generated by the spread of AMR. Its advantages include the ability to predict the functionalities of the chemical structures starting from their biosynthetic genes, as well as the possibility to improve and diversify these molecules, all prior to production in the laboratory, thus accelerating the discovery process and providing leads for the identification of completely new classes of molecules and functionalities.
Genome mining refers to the prediction of previously uncharacterized natural product biosynthetic gene clusters (BGCs) within sequence data, analysis of the enzymes encoded by these clusters, and the experimental identification of the resulting products [7]. This research strategy was developed after the exploration by the Hopwood [8] and Ōmura [9] groups of the first sequenced Streptomyces genomes, which led to the idea that a great number of secondary metabolite (SM) gene clusters appear not to be expressed under normal or laboratory conditions. The expanse of bacterial, fungal, and plant bioactive metabolites that remain to be explored has only widened with time and with the accumulation of sequencing data [10].
The continued interest from investigators led to the development of many tools, ever increasing in complexity, to explore this new-found data [11]. The structural, chemical, and biological characterization of the newly discovered molecules and enzymes depends on in silico prediction algorithms. From this perspective, two scenarios are possible: first, the top-down discovery processes, where the sequence data is initially analyzed, leading to predictions regarding possible functionalities, which afterwards need to be experimentally verified; and second, the bottom-up approaches, which involve linking experimental data to in silico predictions, followed by a confirmation step commonly including molecular biology techniques. Currently, the main data mining strategies can be broadly divided into phylogeny-based, target- or resistance-based, and cultivation-independent mining (comprising metagenomic data mining and single-cell genomics) (▶ Fig. 1). This review summarizes the state-of-the-art for data mining of bacterial, fungal, and plant genomic data, as well as metabologenomics, for the discovery of new antimicrobial NPs. These research accomplishments lay the foundations for the development of the next generation of drugs to overcome antimicrobial resistance. The bibliographic investigation was performed on Scopus, SciFinder, Web of Science, and PubMed, and the main search words and terms applied were "bacterial genomic data mining" (> 350 hits), "fungal genomic data mining" (> 100 hits), and "plants genomic data mining" (> 50 hits), then filtering with the word "antimicrobials" (total 105, 18, and 14 hits, respectively). Also, the word "metabologenomics" (17 hits) was used in the bibliography search. Most of the hits fell within the time span 2010-2020.

Discovery of Bioactive Bacterial Products Driven by Genome Mining
Bacteria are the most relevant source of genetic information for the identification of gene clusters, which could produce molecules with antimicrobial and anti-infective effects, as indicated from past discoveries (especially the phylum Actinobacteria and genus Streptomyces). The diversity of possible compounds encoded by bacterial DNA is for now difficult to estimate but has been predicted to contain many chemical classes which have not yet been characterized in a laboratory setting. Molecules belonging to these classes could bring some well-needed relief to medical practice and help contain, at least temporarily, the AMR crisis, which is why their exploration is essential [3].
The origin of this biochemical diversity resides in basic natural biological processes. The main driving forces behind the evolution of organisms are collaboration and competition. At the genetic level, evolutionary processes are defined by mutations, horizontal gene transfer, gene loss, gene duplication, reshuffling, and immunity development (CRISPR-Cas systems) [12]. These changes often lead to the development of clusters of genes encoding the assembly of compounds with antimicrobial effects, improved over millions of years of evolution. Such groups of genes are denominated BGCs. Estimates of the evolution of such clusters, proteins, and genomes, together with the identification of common ancestors and sequence similarity studies, have provided extremely efficient tools for uncovering new metabolites [13]. The huge diversity of chemical structures observed in natural products (NPs) research can thus be explained on the basis of co-evolutionary processes, leading to the diversification of biosynthetic mechanisms and enzymes as adaptations to new ecosystems and communities [14,15]. In this regard, phylogenetic analyses can be highly predictive of protein function and lead to engineering semi-synthetic variants and pathways to achieve improved molecules and greater product yields [16].
The number of studies involving phylogeny-based genome mining has increased exponentially over the last decade, due to the reduction in sequencing costs and worldwide distribution of the technology [17]. Some examples of the accomplishments this has made possible are described in the following paragraphs.

Top-down discoveries
Most NPs of microbial origin with a medical application originate from the genus Streptomyces, phylum Actinobacteria [18]. Top-down discoveries are therefore centered on this bacterial group, for two main reasons: estimations of gene diversity indicate a yet-unaccounted potential in these genomes; and most of the databases and tools started with, and are more efficient at investigating, Actinobacterial genes and metabolites, compared to recently discovered microorganisms [19]. This allows for greater confidence in making predictions, and the use of established experimental tools for their validation.
In Actinobacteria, both phylogeny-based and chemical structure-based estimations of BGCs indicate that genomes hold unknown clades of metabologenomic patterns of genes and molecules which remain to be characterized in the laboratory, in some taxonomic groups up to 50 % of their BGC content [11,20,21]. Among the members of Actinobacteria, the genus Streptomyces is the richest in bioactive metabolites such as antibiotics, volatile compounds, siderophores, and other molecules [22]. Recent studies focusing on Streptomyces diversity and distribution patterns of BGCs across all public genomes as of 2019 concluded that even strains with identical phylogenetic patterns present significantly different numbers of secondary clusters [23,24]. Thus, strain-level genetic surveys are required to categorize and characterize novel metabolites from this genus. Additionally, it was concluded that 16S-based metagenomics of environmental samples has greatly underestimated the number and value of BGCs in Streptomyces bacteria, as the data only offer a superficial view of the BGCs encoded in the whole genome, over-representing non-ribosomal peptides (NRPs), polyketides (PKs), terpenes, and lantipeptides.
These findings were further confirmed by Sivakala et al., who concluded that a collection of Streptomyces strains (from desert samples), averaging 36 BGCs per genome, displayed a unique biosynthetic potential. The majority of BGCs were predicted as novel and were characterized by an overall decrease in non-ribosomal peptide synthetase (NRPS) clusters and an increase in polyketide synthase (PKS) hybrid and terpene clusters [25]. Many predicted BGCs are also present in marine Streptomyces, from 16 to 84, according to Xu et al. [26]. These data show the dynamism of Streptomyces as a hyperdiverse taxonomic group which has evolved several pathways, substantially increasing the NPs chemical space [27].
▶ Fig. 1 Mining genome workflow: a1 Harvesting samples from different environments and natural sources; a2 Isolation of colonies for genomic analysis; a3 Determination of experimental conditions for maintaining viable cultures in the laboratory; a4 Whole genome for non-cultivable bacteria; a5 Molecular and microbiological characterization of strain submitted to genomic analysis; a6 DNA isolation for massive sequencing; a7 High-throughput sequencing for genomes, metagenomes, pangenomes, single-cell genome, transcriptome. b1 Preliminary assessment of genomic data; b2 Identification of BGC in genomes and metagenomes; b3 Prioritization process for BGCs using networking analysis, phylogenomic tools, and EvoMining strategies; b4 Correlation of genomics and metabolomics data to pair novel BGCs and metabolites; b5 Experimental confirmation of NPs production by different methods such as heterologous expression and mutant lines generation. Molecular characterization supporting the flow gene-enzyme-metabolite.
The group of Metsä-Ketelä has focused on the research of marine symbiotic Actinobacteria, including but not limited to Streptomyces [27]. Species from four rare genera (Actinomadura, Amycolatopsis, Saccharopolyspora, and Streptomonospora) were isolated for the first time from marine sponges harvested in the Persian Gulf. From these strains, the new bioactive aromatic polyketides denoted persiamycin A and 1-hydroxy-4-methoxy-2-naphthoic acid were discovered and their existence confirmed through experimental approaches (▶ Fig. 2).
Rare genera and uncultivable organisms are starting to be studied by a top-down approach as well. The evolution of sequencing technologies also allows, for instance, the assembly of complete or almost-complete genomes from metagenomic data, without the need for cultivating the microorganisms [28][29][30][31][32]. This will be discussed in the cultivation-independent data mining section.

Achievements and latest research examples
Recent advances in the discovery of new antimicrobials by genome mining include the exploration of new species from the Proteobacteria, Firmicutes and Actinobacteria taxa [33,34], enhancing the ability to predict and prioritize novel BGCs.
For instance, recent genome mining studies have identified the cyanobacterium Tolypothrix sp. PCC 7601 as a producer of tolypamide, a cyanobactin with antimicrobial activity structurally differentiated by bearing a prenyl moiety bound to a threonine residue (▶ Fig. 2) [35]. Another example of data mining in a new taxon is the genus Micromonospora, whose members can produce analogues of ramoplanin and enduracidin. In addition, some strains can simultaneously produce the antimicrobial chersinamycin (▶ Fig. 2) [36,37]. Finally, a re-examination of bacteria of the class Clostridia led to the discovery and experimental confirmation of an unconventional biosynthetic gene cluster for closthioamide (CTA) (▶ Fig. 2), a symmetric NRP composed of two diaminopropane-linked polythioamidated monomers, with a mechanism targeting DNA gyrase.
These are some of the most recent discoveries indicating that genome mining is a promising route to new SMs with anti-infective action [38]. In the last decade, sequencing capacity and cutting-edge computational methods to analyze these data evolved in parallel, and now there are publicly available platforms for genome mining. Some examples include antiSMASH (which relies on the highly curated MIBiG database), PRISM, BAGEL, and RiPPMiner [39][40][41][42][43]. Recent discoveries using these tools include, for instance, a Ni-Fe hydrogenase gene cluster (hyp cluster) and putative new BGCs in Streptomyces seoulensis [44]. In Staphylococcus aureus 4181, a strain involved in bovine mastitis, a bacteriocin denominated aureocin 4181 is involved in resistance against multidrug-resistant (MDR) staphylococci. Interestingly, both BAGEL4 and antiSMASH v.5.0 were used to identify the gene cluster coding for aureocin 4181 [45,46]. The findings were confirmed in the wet lab.
Machine learning is also an emerging tool for developing unsupervised computational models for predicting biological outcomes using BGCs as targets [47]. One example is RODEO (Rapid ORF Description and Evaluation Online), an online tool to identify BGCs, mainly those encoding ribosomally synthesized and post-translationally modified peptides (RiPPs) [48]. In this approach, the mapping of lasso peptide space generated a larger number of putative BGCs, compared to other methods. Moreover, their characterization in the laboratory confirmed the existence of new lasso peptides including citrulassin A, lagmycin and LP2006 (▶ Fig. 2) [48].
To prevent re-discovery of the same compounds, other deep learning-based methods have been designed to detect key genes from large-scale expression profiles and reduce redundant information. A highly sought-after advantage would be to retain only the relevant biological information from the breadth of the data and to provide robust predictions of the compounds using only sequence data [49]. This would substantially reduce the resources and time needed to confirm the data mining predictions and increase the probability of discovering bioactive molecules.
Instead of using expression data, other projects of this type used the PRISM discovery platform in tandem with automated LC-MS/MS searches, leading to the identification of aurantizolicin and cyanobactin produced by Cyanobacteria (a pathogen associated with gastroenteritis during the intake of contaminated water) (▶ Fig. 2) [50]. This bacterium appears to biosynthesize a broad group of metabolites with structural variability, generated by post-translational modifications, including prenylation.
For the chemical and biological characterization of newly identified metabolites, cloning and heterologous expression are sometimes necessary when the compound is not normally secreted in the growth medium, or the natural signals for its production are not identified. There is currently a limitation in the cloning of large DNA fragments, such as those responsible for producing SMs. Therefore, advances in cloning comprise the development of new heterologous hosts for BGC expression, e.g., anaerobic bacteria. Streptococcus mutans (S. mutans) UA159 was used as a host for anaerobic BGCs by functionally expressing a known pyrazinone BGC from Staphylococcus epidermidis (Sa. epidermidis) ATCC 35984 and two unidentified BGCs from the human oral bacteria S. mutans 35 and S. mutans NMT4863, respectively. Cloning and expression of these BGCs led to the discovery of a (2E)-decanoyl dipeptide SNC1-465 (▶ Fig. 2) [51].
Target-based genome mining can rely on the observation that BGCs also code for resistance genes meant to protect organisms from self-intoxication. In the case of the glycopeptide antibiotics complestatin and corbomycin, this strategy led to the elucidation of a novel mode of action by inhibition of autolysins (▶ Fig. 2) [52].
Adding to that, CRISPR/Cas9 genome editing has recently been made available for gene manipulation in a wide range of organisms, including Actinobacteria. For example, surfactin production requires key precursors including glutamate (Glu), valine (Val), and β-hydroxy fatty acid used in a biosynthetic pathway that initiates with their assembly by SrfAA, building and extension by SrfAB, extension and cyclization by SrfAC, and which is regulated by the operon srf. When this operon is disrupted by CRISPR/Cas9 editing, interestingly, the yield of plipastatin increases. In this example, the interruption of surfactin biosynthesis redirected its structural components toward the activation of an alternative pathway, which uses common components from the surfactin biosynthetic pathway. This experiment allowed the identification of plipastatin, a cyclic lipopeptide that is synthesized by nonribosomal peptide synthetases (NRPSs) and contains amino acids with a C14-C21 (R)-hydroxy fatty acid side chain linked by an amide bond [53].
▶ Fig. 2 Chemical structure of compounds identified by genome mining in bacteria.
Anticipated limitations in the expression of BGCs due to incompatibility between host and receptor bacterial DNA have recently been overcome with promising results using the CRAGE system [54]. In this technological advancement, a recombinase (RAGE system) can integrate in a single step a DNA fragment of between 10 and 48 kb. The main advantage is that, for the expression of large BGCs, changes in the regulation or expression control of DNA sequences in the original microorganism are now possible, avoiding the need for heterologous expression. In this study, 10 polyketide synthase and non-ribosomal peptide synthetase (PKS/NRPS) BGCs in the entomopathogenic bacterium Photorhabdus luminescens subsp. laumondii TT01 were modelled, confirming the enhanced production of 22 metabolites from six of the ten BGCs, the expression of which was modulated successfully by CRAGE-CRISPR in the wet lab [54,55].
Orphan BGCs have attracted attention since they have yet to reveal their potential as a source of NPs. Due to the increasing number of orphans, the first challenge is centered in their selection and prioritization and connecting them to biosynthesized small molecules. An approach to reduce the number of potential BGCs is to correlate putative antibiotic resistance genes that encode target-modified protein with the orphan BGCs in a method denominated by Tangʼs Group as "target-directed genome mining" [56]. Co-localization between an orphan PKS-NRPS hybrid BGC (tlm) with a putative fatty acid synthase resistance gene in a Salinispora bacterial genome led to laboratory production of several unusual thiotetronic acid NPs, including the well-known fatty acid synthase inhibitor thiolactomycin that was first described over 30 years ago, yet never at the genetic level in regard to biosynthesis and auto-resistance (▶ Fig. 2) [56].
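The co-localization idea behind target-directed genome mining can be illustrated with a minimal sketch. It is not the method of [56]: the data classes, the coordinates, the gene names, and the keyword match on free-text annotations are all hypothetical stand-ins, and a real workflow would detect resistance genes by homology search rather than by keyword.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int
    end: int
    annotation: str  # hypothetical free-text functional annotation


@dataclass(frozen=True)
class Cluster:
    cluster_id: str
    start: int
    end: int
    is_orphan: bool  # True when no product has been linked to the BGC


def prioritize_orphans(clusters, genes, resistance_keywords, flank=0):
    """Return (cluster_id, [gene names]) for orphan BGCs that contain,
    or lie within `flank` bp of, a putative resistance gene."""
    hits = []
    for cluster in clusters:
        if not cluster.is_orphan:
            continue
        lo, hi = cluster.start - flank, cluster.end + flank
        matched = [
            gene.name
            for gene in genes
            if gene.start >= lo
            and gene.end <= hi
            and any(k in gene.annotation.lower() for k in resistance_keywords)
        ]
        if matched:
            hits.append((cluster.cluster_id, matched))
    return hits


# Toy data (invented for illustration only).
clusters = [
    Cluster("bgc_1", 10_000, 40_000, True),
    Cluster("bgc_2", 100_000, 130_000, True),
    Cluster("bgc_3", 200_000, 220_000, False),
]
genes = [
    Gene("geneA", 12_000, 13_500, "putative fatty acid synthase homolog"),
    Gene("geneB", 105_000, 106_000, "hypothetical protein"),
    Gene("geneC", 205_000, 206_000, "putative fatty acid synthase homolog"),
]

hits = prioritize_orphans(clusters, genes, ("fatty acid synthase",))
print(hits)
```

In this toy input only bgc_1 is retained: bgc_2 has no resistance-like gene, and bgc_3 is skipped because it is not an orphan. The `flank` parameter is an assumption that lets the search extend beyond the predicted cluster borders, which are often imprecise.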
Taken together, the impetus provided by new DNA sequencing technologies, the improvement of computational tools to detect and select BGCs, the continuous reporting and sharing of genomic data, pan-genome analyses, and a deep understanding of the gene-enzyme-metabolite connection represent an indisputable ally in the new generation of natural product research.

Implementation of artificial intelligence strategies for data mining
The use of artificial intelligence (which in this type of approach is based on machine learning techniques) has allowed for repurposing medicines. In one brilliant example of repurposing, scientists used machine learning (ML) to teach a neural network what the best antibiotics "look like" from a chemical standpoint and applied it to a library of more than 6000 compounds already used in human medicine [57]. The best hit, halicin, turned out to be active against a wide variety of antibiotic-resistant bacteria, including E. coli, M. tuberculosis, and C. difficile, with a set of 21 other molecules still under review. Interestingly, the halicin mechanism of action is also a novelty, as it works by interfering with the electrochemical potential across bacterial membranes [57].
This course of action is followed by the iteration of novel principles and concepts regarding the drug targets, their mode of action and structural characteristics, which can afterwards be applied in the rational design of bioactive compounds. A whole range of ML strategies now focus on predictions related to small-molecule antibiotics and antimicrobial peptides (AMPs) [58]. While significant progress has been made, by, for instance, accelerating the drug progress towards animal trials, these strategies need to be supported by an improvement in data quality and availability, expanding the chemical space, drug repurposing and multidisciplinary collaborations to produce lead molecules adequate for human trials and thoroughly verify the findings through experimentation [58].

Synergy
Synergism is seen as the combination of two or more drugs for medical treatment, aiming to enhance drug efficacy; nevertheless, synergy in bacteria encompasses more elaborate examples. Synergy for the discovery of new antimicrobials can be two-faced: on one side, several BGCs in the same genome can act together to enhance a biological activity, and on the other side, the synergy of different strategies linked to genome mining can improve the ability to discover new antimicrobials. Regarding synergy in nature, recent studies have identified bacteria with the ability to produce more than one antimicrobial at the same time, which are biosynthesized by different clusters of genes located in adjacent genomic regions within their genomes, denominated super-clusters [59]. The simultaneous production of antimicrobials could play an ecological role to enhance the interspecies competition for acquiring nutrients. There is little information available about how the super-clusters are structured and even less information concerning their regulation and simultaneous production [60,61]. A first approximation revealed that super-clusters act cooperatively with synergistic potential, but cases in the current literature are scarce [62][63][64][65]. In nature, simultaneous production of antimicrobials has been identified in Streptomyces rapamycinicus, which can produce rapamycin, a compound with a weak antifungal activity that, in combination with actinoplanic acid (APL), acts as a strong antifungal agent against C. albicans, probably as an inhibitor of the enzyme farnesyl protein transferase, thereby increasing the effect of rapamycin. In this example, BGCs are analyzed by synteny (co-localization of genes) and results about the genetic organization seem to pinpoint different clusters [66].
The phylogenetic approach has been used to explain antimicrobial co-production, allowing for the introduction of novel computational methods based on the probabilities of modules exchange over time. A new property based on the spatial position and order of BGCs could mean a new biological parameter not yet evaluated in the bacterial genome.
Alternatively, synergy can be induced, for example, in response to scarcity of nutrients. Ferroverdins and bagremycins are produced by the same BGCs, but the production of one or the other depends on the availability of iron in the environment. In this situation, using activators or inducers can be a useful strategy in eliciting BGCs and facilitating the identification of simultaneous NPs (▶ Fig. 2) [67]. In the laboratory, production of novel metabolites by P. galatheae can be controlled by the addition of inducers, such as andrimid, to increase holomycin production, or trimethoprim, to attenuate it, indicating that activators or repressors can help elicit BGCs and facilitate natural product discovery (▶ Fig. 2) [61].
Another pathway to induce synergy of antimicrobials is through genetic engineering of modules of various clusters, leading to combinatorial chemistry and semisynthetic compounds with completely novel functions. Such explorations have already been reported; for example, the group of genes involved in the biosynthesis of aminocoumarin, which is divided into modules, can be integrated into new functional BGCs [68]. Finally, Prof. Medemaʼs group discusses some criteria for identifying when BGCs work together and promote synergism, based on genomic analyses: a) conservation across larger evolutionary time scales based on a phylogenetic approach, b) proximity between BGCs within the genome, c) co-regulated gene expression, and d) assessment of the likeness between BGCs for related targets [66]. Synergism of BGCs has begun to reveal new abilities of microorganisms that can be used for the discovery of potent combinations not yet evaluated.
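Of the criteria listed above, proximity between BGCs (criterion b) is the simplest to express computationally. The following is a minimal sketch under stated assumptions: the tuple layout, the BGC names, the coordinates, and the 5 kb gap threshold are all hypothetical, since the source gives no numeric cut-off. It groups BGCs on the same contig whose separation does not exceed the chosen gap and reports groups of two or more as candidate super-clusters.

```python
def candidate_superclusters(bgcs, max_gap):
    """Group BGCs that sit within `max_gap` bp of each other on one contig.

    bgcs: iterable of (name, contig, start, end) tuples.
    Returns a list of name lists, one per group of two or more BGCs.
    """
    groups = []
    current = []        # BGCs in the run being built
    current_end = None  # right-most coordinate seen in that run
    for name, contig, start, end in sorted(bgcs, key=lambda b: (b[1], b[2])):
        same_run = (
            bool(current)
            and contig == current[-1][1]
            and start - current_end <= max_gap
        )
        if same_run:
            current.append((name, contig, start, end))
            current_end = max(current_end, end)
        else:
            if len(current) > 1:
                groups.append([b[0] for b in current])
            current = [(name, contig, start, end)]
            current_end = end
    if len(current) > 1:
        groups.append([b[0] for b in current])
    return groups


# Toy coordinates (invented for illustration only).
bgcs = [
    ("bgc_a", "chr", 1_000, 30_000),
    ("bgc_b", "chr", 32_000, 60_000),
    ("bgc_c", "chr", 500_000, 520_000),
    ("bgc_d", "plasmid", 1_000, 20_000),
]

groups = candidate_superclusters(bgcs, max_gap=5_000)
print(groups)
```

Proximity alone only nominates candidates; the remaining criteria (evolutionary conservation, co-regulation, and related targets) would still have to be assessed before a pair of BGCs is treated as synergistic.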

Discovery of Bioactive Fungal Products Driven by Genome Mining
The fungal kingdom is the second most phylogenetically diverse and extensive kingdom in the tree of life. It is divided into twelve phyla which comprise about 200 orders [69] with an estimated 2.2-3.8 million species [70], of which only a small fraction (130 000) has been completely identified [71]. Fungi are crucial for preserving the ecological balance of any ecosystem, as they are, along with bacteria, natureʼs major decomposers [72]. Furthermore, these microorganisms have a massive economic impact as plant, crop, and animal pathogens, as well as in the food and pharmaceutical industries [73]. Among them, strains from the Dikarya subkingdom (Phylum Ascomycota and Basidiomycota) are commonly investigated in drug discovery programs, owing to their capacity to produce a myriad of bioactive SMs (~20 000) [74] which often represent an excellent starting point for finding new drugs [75,76]. Some fruitful examples include penicillin, considered the grandfather of antibiotics and the "wonder drug of World War II" [77], cephalosporins [78], the cholesterol lowering lovastatin (compactin) which gave origin to the statins family [79], myriocin, a precursor of fingolimod, the first oral treatment for multiple sclerosis [80], and lately, caspofungin (antifungal), a semisynthetic derivative of pneumocandin (▶ Fig. 3) [81]. However, up to the last two decades, the methodologies to identify and characterize NPs from any source relied on bioassay-guided programs, where the bioactivity of an extract and fractions is evaluated during the isolation of the bioactive metabolite [82]. This has led to the discovery of many outstanding drugs, principally antibacterial (penicillins, cephalosporins, quinolones, tetracyclines, macrolides, aminoglycosides, among others) [83] and anticancer (taxol, camptothecin, vincristine, vinblastine, podophyllotoxin, to mention a few) [84,85]; however, the re-isolation of known compounds via this process is a bottleneck.
This is demonstrated by the lack of new kinds of antibiotics in the market over the last 35 years. To overcome this, dereplication procedures [86] using LC-UV-MS/MS [87] and NMR [88,89] have been developed. Importantly, the emergence of the omics era, the fast-paced evolution of bio- and chemoinformatics tools to mine huge amounts of data, and the expansion of affordable whole-genome sequencing technologies have changed the way NP research is conducted, by exploring and comparing DNA sequences with annotated repositories of BGCs [90].
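One simple component of MS-based dereplication is accurate-mass matching: an observed ion is compared against the expected masses of known compounds, and features that match are flagged as probable re-isolations. The sketch below is an illustration under stated assumptions, not the procedure of [86] or [87]: it considers only [M+H]+ ions, uses a 5 ppm tolerance chosen arbitrarily, and relies on placeholder compound names and masses that do not correspond to real molecules. Real workflows also use retention time, UV spectra, and MS/MS fragmentation.

```python
PROTON_MASS = 1.007276  # Da; added to a neutral mass to obtain [M+H]+


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def dereplicate(observed_mz, known_neutral_masses, tol_ppm=5.0):
    """Return the names of known compounds whose [M+H]+ ion matches
    `observed_mz` within `tol_ppm`, sorted alphabetically."""
    matches = []
    for name, neutral_mass in known_neutral_masses.items():
        theoretical_mz = neutral_mass + PROTON_MASS
        if abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm:
            matches.append(name)
    return sorted(matches)


# Placeholder reference masses (hypothetical, not real compounds).
known = {
    "known_A": 300.0000,
    "known_B": 300.0100,
    "known_C": 450.2000,
}

already_known = dereplicate(301.0075, known)  # close to known_A [M+H]+
possibly_new = dereplicate(500.0000, known)   # no reference within 5 ppm
print(already_known, possibly_new)
```

A feature with no match is only a candidate for novelty, because the reference list may be incomplete and other adducts or isomers are not considered here.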
In fungi, the mechanisms that activate their extraordinary metabolic capacity are associated with environmental changes and complex ecological networks, turning on several metabolic pathways, and consequently the biosynthesis of exciting and unusual backbones [91]. In fungal genomes, the genes involved in the biosynthesis of NPs are concatenated in BGCs [92]. These clusters are mainly constituted by: i) a scaffold-defining synthase or synthetase (PK, terpene, NRP, etc.), that uses building blocks to generate the main structure; ii) modifying enzymes such as oxidases, transferases, reductases, and acyltransferases, among others, that incorporate or transform functional groups; iii) specific or global transcription factors that positively or negatively regulate the expression of the BGC; iv) DNA segments encoding proteins to ameliorate the SM toxicity, for example, efflux pumps, transporters, detoxifying enzymes or double copies of the target enzyme; and v) other hypothetical genes with unknown activity (▶ Fig. 4) [82,92,93].
▶ Fig. 3 Chemical structure of approved drugs inspired by fungal SMs.
As described elsewhere, fungal and bacterial BGCs are quite similar; however, there are some differences that make the characterization of the former more challenging [94]. For example, bacterial BGCs are normally congregated in operons and their transcription occurs in a single mRNA, whereas in fungi, genes are separated by non-coding DNA sequences, and are transcribed individually. According to the MIBiG (Minimum Information about a Biosynthetic Gene cluster) database, as of May 2021 more than 1900 BGCs of several chemical classes have been annotated (▶ Fig. 5 and Table S1, Supporting information) [95]. Among them, only 14.5 % are of fungal origin, with a major representation of fungi from the Pezizomycotina subphylum (Aspergillus, Penicillium, Fusarium, Alternaria, Phoma, Talaromyces, Beauveria, and Claviceps; 180 BGCs, ▶ Table 1).
Robey et al. [96] recently demonstrated that fungi represent a huge reservoir of SMs; however, less than 3 % of the BGCs have been linked to a chemical product. This conclusion was based on the analysis of 1037 genomes from Dikarya and non-Dikarya taxa, using antiSMASH, and exploration of the fungal chemical space. Results from this study predicted 36 399 BGCs ranging from 5 to 220 kb, organized in 12 067 gene cluster families (GCFs), which are mainly species-specific [96]. This finding could be supported by the fact that fungal chemodiversity is associated with three different evolutionary BGC processes: i) heterologous and paralogous BGCs functional divergence (accumulation of molecular differences that lead to changes in function or phenotype), ii) BGC horizontal transfer and iii) de novo BGC assembly; contributing to the immense repertoire of fungal SMs [92].
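The scale of these figures is easier to appreciate as simple ratios. The short calculation below uses only the three counts quoted above from [96]; the derived averages are back-of-the-envelope values computed here, not results reported by the original study.

```python
# Counts quoted in the text from Robey et al. [96].
genomes = 1037
bgcs = 36_399
gcfs = 12_067

bgcs_per_genome = bgcs / genomes  # average predicted BGCs per fungal genome
bgcs_per_gcf = bgcs / gcfs        # average BGCs per gene cluster family

print(f"{bgcs_per_genome:.1f} BGCs per genome")  # 35.1 BGCs per genome
print(f"{bgcs_per_gcf:.2f} BGCs per GCF")        # 3.02 BGCs per GCF
```

An average of roughly three BGCs per family is consistent with the statement that the families are mainly species-specific, since widely shared families would contain many more members across 1037 genomes.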

Some of the best achievements and latest research examples
Throughout history, fungi have only contributed with five antibiotics against bacteria (including mycophenolic acid, the previously cited β-lactams, fusidic acid, and retapamulin, a semisynthetic derivative of pleuromutilin), and two antifungals (griseofulvin and caspofungin) (▶ Fig. 6). Among them, fusidic acid (isolated in 1962 from Acremonium fusidioides, previously known as Fusidium coccineum [97], by Leo Pharma, Denmark) has been used over the past 45 years in Europe to treat skin and chronic bone and joint infections [98]. It inhibits protein synthesis by a new mechanism that involves binding to EF-G-GPD. This molecule is unique in its class; due to its novel structure and mode of action, microorganisms have not yet developed cross-resistance against it [98]. The most recently approved antibiotic from fungi is retapamulin [99]. This broad-spectrum antibiotic is a semisynthetic derivative of pleuromutilin, a diterpenoid isolated from Pleurotus mutilus in 1951 [100,101] that displayed moderate activity towards Gram-positive bacteria. Retapamulin was approved by the FDA in 2007 to topically treat impetigo, a skin infection caused by Staphylococcus aureus or Streptococcus pyogenes. Mechanistically, it selectively inhibits protein synthesis by binding to the 50S subunit of the bacterial ribosome in a different manner than macrolides [102]. Like fusidic acid, microorganisms have not developed cross resistance against retapamulin.
Despite the potential of fungi as a source of chemical entities for drug development, and the crucial need for new antimicrobials for human use against multidrug-resistant infections, most pharmaceutical companies have dropped their antimicrobial drug discovery programs, due to the low revenue and the severe regulatory frameworks imposed by regulatory agencies [103]. Nevertheless, since 2012, when IDSA requested the FDA and EMA (European Medicines Agency) to set more realistic and economic regulatory pathways for novel antibiotics registration and authorization, research attention has shifted back to microorganism-derived NPs [74]. In this regard, academia has taken the lead and many fungal NPs with antibacterial activity, or even targeting several other bacterial survival mechanisms, have been recently discovered. The methodologies involved in such programs are diverse, e.g., bioassay-guided isolation, taxonomy-based isolation, and targeted and untargeted metabolomics approaches applied to fungi grown under standard and modified conditions (One Strain Many Compounds, OSMAC); processes that combine all those schemes have also been successfully applied (Supporting Information).
▶ Fig. 4 Outline of a typical fungal BGC composed of a transcription factor, a scaffold-defining synthase or synthetase, decorative enzymes, detoxifying enzymes and other genes. Adapted from [93].

Fungal macrolides through genome mining coupled with heterologous expression
Recently, Asaiʼs group described a tripartite methodology that includes synthetic biology, genome mining and heterologous expression to uncover the fungal machinery involved in the biosynthesis of macrolides. This approach led to the identification and validation of BGCs comprising a highly reducing polyketide synthase (HR-PKS) and a thioesterase in the genome of Arthrinium phaeospermum (apmlA and apmlB, respectively) [104]. Correspondingly, the heterologous expression of the built BGCs in Aspergillus oryzae led to the isolation of phaeospelides A and B, two macrolides with 34- and 32-membered rings (▶ Fig. 7).
In a subsequent work, phylogenetic and network analysis of the HR-PKS gene exposed a family of macrolide BGCs, including a glycosylphosphatidylinositol−ethanolamine phosphate transferase (GPI-EPT) homolog. Genome mining of the GPI-EPT gene led to the identification of two BGCs, one from Aspergillus kawachii IFO 4308 (akml cluster) and one from Colletotrichum incanum MAFF 238704 (ciml cluster). Heterologous expression of several combinations of genes that make up the akml and ciml clusters resulted in the production and isolation of several 24-membered (AKML A-D) and 22-membered (CIML A-D) macrolides containing phosphoethanolamine or phosphocholine moieties (▶ Fig. 7). These macrolides were evaluated as antibacterial and antifungal agents in bioassay-guided programs, showing moderate activities [105].

Discovery of Bioactive Natural Products from Microbes by Metabologenomics
Microbial NPs have demonstrated their importance as drugs, in particular to treat cancer and infectious diseases. As mentioned above, high-throughput screening (HTS) processes and the creation of synthetic compound libraries were not as successful as the pharmaceutical industry expected in delivering a large number of developmental drug candidates. In contrast, methods that rely on genomics, bioinformatics, and chemical analysis (metabolomics), or the combination of them (metabologenomics), are now recognized as better alternatives for the discovery of new drugs from nature (▶ Fig. 8). The term metabologenomics was officially introduced in 2016 and has one main advantage over the traditional search for potential drugs: it does not require prior knowledge of the molecular structure or bioactivity of the compounds; instead, their detection by high-resolution mass spectrometry (HRMS) constitutes the basis for their experimental confirmation, thus avoiding silent BGCs [106]. Genome sequencing has become one of the most valuable tools to understand the potential of microbes to produce SMs.
▶ Fig. 6 Chemical structure of antibiotics derived from fungi.
▶ Fig. 7 Fungal macrolides identified by genome mining coupled with heterologous expression.
As knowledge of microbial BGCs and other important attributes within their genomes has increased, large-scale sequencing efforts have been made to accurately predict novel NP BGCs, to improve the large-scale assembly, cloning, and expression of BGCs, and to increase the diversity of new compounds [107]. On the other hand, metabolomics data obtained by using liquid chromatography coupled with high-resolution mass spectrometry (LC-HRMS) have allowed the identification of hundreds of metabolites for each strain, with high sensitivity and selectivity. Furthermore, MS data analysis tools, such as principal component and molecular networking (MN) analysis, have been successfully applied to demonstrate the chemical space and diversity of metabolites in samples, which could be correlated to the functional phenotype of the natural source [108]. When the genomic and metabolomic data are combined, simple correlation scores reveal the probability that observed NPs and a specific gene cluster family (GCF) are associated if they co-occur across multiple strains.
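The co-occurrence idea can be illustrated with a minimal sketch, assuming that presence/absence calls per strain are already available for both the metabolite ion and each GCF. A one-sided hypergeometric (Fisher exact) test is used here as a stand-in; it is not necessarily the scoring scheme used in the studies cited, and the strain identifiers and GCF names are hypothetical toy data.

```python
from math import comb

def cooccurrence_pvalue(metabolite_strains, gcf_strains, all_strains):
    """One-sided hypergeometric p-value for metabolite/GCF co-occurrence.

    Returns the number of strains carrying both features and the probability
    of seeing at least that many by chance. Both sets are assumed to be
    subsets of all_strains.
    """
    n_total = len(all_strains)
    n_met = len(metabolite_strains)
    n_gcf = len(gcf_strains)
    n_both = len(metabolite_strains & gcf_strains)
    upper = min(n_met, n_gcf)
    denom = comb(n_total, n_met)
    tail = sum(
        comb(n_gcf, k) * comb(n_total - n_gcf, n_met - k)
        for k in range(n_both, upper + 1)
    )
    return n_both, tail / denom

# Hypothetical toy data: ten strains, one metabolite ion, two GCFs.
strains = {f"s{i}" for i in range(1, 11)}
ion_detected = {"s1", "s2", "s3"}
gcf_a = {"s1", "s2", "s3", "s4"}   # co-occurs with the ion in 3 strains
gcf_b = {"s3", "s7", "s8", "s9"}   # co-occurs with the ion in 1 strain

both_a, p_a = cooccurrence_pvalue(ion_detected, gcf_a, strains)
both_b, p_b = cooccurrence_pvalue(ion_detected, gcf_b, strains)
print(f"GCF_A: shared strains={both_a}, p={p_a:.4f}")
print(f"GCF_B: shared strains={both_b}, p={p_b:.4f}")
```

In this toy case the lower p-value for GCF_A flags it as the more plausible biosynthetic origin of the ion; with real data, many ion/GCF pairs are tested, so a multiple-testing correction would also be needed.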
Certainly, the discovery of new specialized metabolites through the expression of their BGCs in heterologous hosts offers some advantages that complement traditional chemical studies in cultivable organisms [109,110]. However, there are many limitations for the expression of BGCs in heterologous hosts, such as the use of unstable and difficult-to-manipulate artificial chromosomes, the incorporation of incomplete BGCs in cosmids and fosmids, and the lack of specific analytical methods to differentiate the heterologously expressed products from the hostʼs endogenous NPs. To solve the latter, mass spectrometry (MS) metabolomics plays an important role in detecting such metabolites and correlating them with natural or induced BGCs.
In the next paragraphs, we review a series of examples on metabologenomics studies for the discovery of bioactive NPs. Without a doubt, these comprehensive studies between microbiologists, structural and molecular biologists, chemists, and bioinformatic analysts are now part of the new era in drug discovery. In fact, a community initiative to systematically document links between metabolome and (meta)genome data was recently developed in the form of the Paired Omics Data Platform (PoDP; https://pairedomicsdata.bioinformatics.nl/).

Rimosamides
McClure and collaborators [111] developed a metabolite/gene cluster correlation platform to elucidate the biosynthetic origins of a new family of NPs produced by Streptomyces rimosus NRRL B-2659, the rimosamides (▶ Fig. 9). The core of these compounds contains a depsipeptide bond at the point of bifurcation in their unusual, branched structures, and the compounds are closely related to the detoxins. These groups of molecules are biosynthesized by an NRPS and a PKS. The study involved the identification of the co-occurrence of an unknown compound (negative dereplication in NP databases) by LC-MS analysis of 178 extracts of actinobacteria, and specific gene cluster families in their annotated genomes. By doing so, 14 strains containing the compound matched six families of genes. Remarkably, almost all strains that encode these genes were found within the S. rimosus clade. The proposed BGC for these compounds was established by detailed bioinformatic analysis. Then, heterologous expression in S. lividans 66 and LC-MS analyses of its organic extract were performed to demonstrate the production of the rimosamides. Finally, the authors found that these compounds antagonize the antibiotic activity of blasticidin S in Bacillus cereus due to their structural similarity with the detoxins. Since blasticidin S has not been found in rimosamide-producing strains, these compounds could help them in the defense against other Streptomycetes.

Tambromycins
Following the same correlation approach described for the rimosamides, Goering and colleagues [112] uncovered a series of NPs and their BGCs in different actinomycetes, the tambromycins (▶ Fig. 9). These compounds are closely related to JBIR-34/35 and possess substructures of a hydroxylated and chlorinated indole, a methyl-oxazoline, and 2-methyl-serine. Successful correlation in six different strains of Streptomyces was demonstrated by extensive BGC analysis and LC-MS identification of the matching ions. The strain F-4474 was selected for fermentation and purification of the metabolite because it showed the highest intensity of the tambromycin ion. Functional annotation of genes in the tambromycin BGC revealed the presence of a tryptophan halogenase, a tryptophan aldolase, an aldehyde dehydrogenase, a methylation domain, and an alanine-hydroxymethyl transferase, consistent with the proposed biosynthesis for these compounds. In addition, feeding experiments with stable isotope labeled amino acids were carried out to confirm the biosynthesis of tambromycins. Unfortunately, the main compound (tambromycin) was inactive against a set of clinically important bacteria but showed antiproliferative activity against different cancerous B- and T-cell lines.

Tyrobetaines
Parkinson and colleagues [106] discovered a novel class of nonribosomal peptides (NRPs) with an unusual trimethylammonium tyrosine residue, the tyrobetaines, produced by Streptomyces sp. NRRL WC-3773 (▶ Fig. 9). In this study, MN analysis (clustering of molecular features based on MS and MS2 fragmentation patterns) was used to improve the metabologenomics correlation results. The BGC of the tyrobetaines was predicted using BLAST, and the boundaries for the BGC were estimated by comparing the BGCs from multiple tyrobetaine-producing strains. The domains identified, including an N-methyltransferase, a phosphopantetheinyl transferase, and a P450 monooxygenase, among others, were consistent with the proposed biosynthesis of these NPs. As for the rimosamides, S. lividans 66 was used for heterologous expression of these NRPs. Feeding experiments with stable isotope labeled L-tyrosine, L-leucine, L-alanine, methionine, and 3-chlorotyrosine coupled with MS analysis confirmed the biosynthesis of the tyrobetaines. Finally, tyrobetaine and its chlorinated derivative were tested against ESKAPE pathogens (Enterococcus faecalis ATCC 19433, S. aureus ATCC 29213, Klebsiella pneumoniae ATCC 27736, Acinetobacter baumannii ATCC 19606, Pseudomonas aeruginosa PAO1, and Escherichia coli ATCC 25922) and human lung cancer cell line A549, with no detectable activity at the assayed concentrations.
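The clustering step behind MN analysis can be sketched as follows, assuming each MS2 spectrum is available as a list of (m/z, intensity) peaks. This is a plain binned cosine score, a simplification: molecular networking platforms such as GNPS use a modified cosine that also accounts for precursor mass differences. The spectra, the bin width, and the 0.7 similarity threshold are hypothetical values chosen for illustration.

```python
from collections import defaultdict
from math import sqrt

def bin_spectrum(peaks, bin_width=0.01):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned

def cosine_score(peaks_a, peaks_b, bin_width=0.01):
    """Cosine similarity between two binned MS2 spectra (0 = unrelated, 1 = identical pattern)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def network_edges(spectra, threshold=0.7):
    """Connect every pair of spectra whose cosine score reaches the threshold."""
    names = sorted(spectra)
    return [
        (p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
        if cosine_score(spectra[p], spectra[q]) >= threshold
    ]

# Hypothetical toy spectra: x and y share a fragmentation pattern, z does not.
toy_spectra = {
    "x": [(100.0, 1.0), (150.0, 2.0)],
    "y": [(100.0, 2.0), (150.0, 4.0)],
    "z": [(120.0, 1.0)],
}
edges = network_edges(toy_spectra)
print(edges)
```

Features that end up connected in the resulting network are treated as structurally related, which is how analogs such as a parent compound and its chlorinated derivative can be grouped before correlation with a GCF.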

Stravidins
Stravidins were discovered from Streptomyces avidinii as components of an interesting antibiotic complex [113]. These peptide antibiotics synergistically achieve biotin deficiency in target microbes. The unique amino acid monomer amiclenomycin (Acm) inhibits the 7,8-diaminopelargonic acid (DAPA) aminotransferase (BioA), which is essential for survival and persistence of mycobacteria. Interestingly, other Acm-containing peptides have shown activity against Gram-negative bacteria.
Using the metabologenomics data set from a collection of Streptomyces strains, Montaser and Kelleher identified the NP/BGC correlation in nine strains. In particular, the strain NRRL S-98 showed a specific ion and fragments characteristic of the stravidins, which correlated to an orphan BGC. This compound was isolated and characterized as stravidin S5 (▶ Fig. 9). BLAST-P analysis revealed additional members of this GCF and allowed a biosynthetic pathway to be proposed for these compounds. Finally, heterologous expression of the stravidin BGC in S. lividans 66 using CRISPR-Cas9 technology was successfully accomplished and confirmed by UPLC-HRMS/MS analysis.

4-Hydroxy pyridones
Fungal 4-hydroxy pyridones are polyketide-NRP compounds with diverse structures and biological activities [114]. The biosynthesis of these compounds is initiated by the PKS-NRPS and a trans-acting enoyl reductase (ER) to yield a five-membered ring intermediate, acyltetramic acid. This compound undergoes an oxidative ring expansion by a cytochrome P450 enzyme to give the six-membered skeleton of 4-hydroxy-2-pyridone, followed by a variety of modifications by tailoring enzymes.
Genome mining of in-house strains led to the identification of the 4-hydroxy-pyridones BGC in the endophyte Tolypocladium sp. 49Y isolated from the leaf of Acorus tatarinowii. Then, the genes were introduced into a heterologous expression host, Aspergillus oryzae NSAR1 (AO1 strain). As predicted, a new compound, tolypoalbin (▶ Fig. 9), was detected by HPLC analysis in the host culture extract. A homologous gene, also present in the BGC, was then introduced into the AO1 strain to establish a three-gene strain (AO2). The latter produced additional new 4-hydroxy-pyridones. Some of these compounds showed significant antifungal effects against Cryptococcus neoformans in bioassay-guided experiments.

Terezine D
As mentioned, there are technological challenges in the insertion of large DNA fragments in homologous or heterologous systems for the expression of microbial NPs. Specifically, for the expression of fungal BGCs, fungal artificial chromosome (FAC) expression vectors have been used with some success [115]. In this context, the unbiased Random Shear BAC technology coupled with an autonomous fungal replicating element, AMA1, was applied to increase the transformation efficiency for the expression of FACs in A. nidulans [115]. The construction of the unbiased shuttle BAC library of A. terreus was possible since the genome of this fungus was fully sequenced and extensively annotated (56 BGCs). The successful transformation of A. nidulans was confirmed by LC-HRMS-MS/MS analysis of the transformant extracts, followed by the production of terezine D, a stable intermediate in the astechrome biosynthesis (▶ Fig. 9).

Acu-dioxomorpholines
Robey and collaborators used FAC-MS technology to elucidate the biosynthetic pathway for a rare group of metabolites, the acu-dioxomorpholines (▶ Fig. 9) produced by Aspergillus aculeatus [116]. The fungal genome was randomly sheared and ~100 kb fragments were inserted into FACs, A. nidulans/E. coli shuttle vectors. The FACs with the BGCs were then transformed into the heterologous host A. nidulans. MS-based metabolomics allowed the detection of SMs and associated them with each FAC, assessing the effect of BGC mutants. The biosynthesis of the acu-dioxomorpholines was proved by stable isotope feeding experiments. As expected, [D5-indole]-Trp and [D5-phenyl]-Phe were incorporated in the biosynthetic products.
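The readout of such a feeding experiment is a predictable mass shift in the product ion. A minimal sketch of that arithmetic follows, assuming a singly charged ion and that all five deuterium atoms of the labeled precursor are retained; the unlabeled m/z of 400.2000 and the observed values are hypothetical and are not taken from the study.

```python
# Monoisotopic masses (u) of protium and deuterium.
MASS_H = 1.00782503
MASS_D = 2.01410178

def labeled_mz(unlabeled_mz, n_deuterium, charge=1):
    """Expected m/z after replacing n_deuterium hydrogens with deuterium."""
    return unlabeled_mz + n_deuterium * (MASS_D - MASS_H) / charge

def within_ppm(observed, expected, tol_ppm=5.0):
    """True if the observed m/z matches the expected m/z within tol_ppm."""
    return abs(observed - expected) / expected * 1e6 <= tol_ppm

# Shift expected from one D5-labeled amino acid (e.g. a D5-indole or D5-phenyl ring).
d5_shift = labeled_mz(0.0, 5)

# Hypothetical product ion at m/z 400.2000 in the unlabeled culture.
expected = labeled_mz(400.2000, 5)
print(f"D5 shift: {d5_shift:.4f}, expected labeled m/z: {expected:.4f}")
print(within_ppm(405.2315, expected))  # hypothetical observed ion, consistent with D5
print(within_ppm(404.2251, expected))  # hypothetical observed ion, not consistent with D5
```

If both labeled precursors were incorporated into the same molecule, the same function with `n_deuterium=10` would give the expected doubly labeled ion.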

Discovery of Bioactive Plant Products Driven by Genome Mining
Plants have played a crucial role in the establishment and development of human societies, as some have been used empirically for thousands of years to treat a plethora of diseases [117]. Historically, plants were used as crude medicine or in preparations [117-119], and later on (19th century) they were used as starting material to isolate bioactive compounds, commencing with the separation of morphine from Papaver somniferum in 1817 [117,120]. Since then, more than 133 000 bioactive NPs from higher plants have been discovered [74]. Several approaches, including conventional chromatographic techniques [121], bioassay-guided isolation, MS and NMR metabolomics-guided isolation [122], and recently, multi-informative methodologies [123], have been followed. Unfortunately, as of now, there is a major gap in the application of genome mining methodologies to understand the biosynthesis of plant-derived NPs [124], and an even greater one in integrating such techniques into the discovery of new plant-derived bioactive NPs. The factors contributing to this lag are linked to the biosynthetic pathway through which a plant SM is biologically synthesized (sequence of enzymatic reactions), and the fact that genes participating in the assembly of a molecule are commonly scattered along the genome. Conversely, those found in microorganisms are organized in BGCs and are much easier to detect. Plant BGCs range from ~35 to hundreds of kb, and comprise from three to 10 genes [125,126], including those responsible for the first step in the biosynthesis of a SM (generally, but not always), and at least two encoding peripheral enzymes [127]. According to Nützmann et al., BGCs can be grouped in four types.
a) Archetypal cluster: it is compact and contains all genes required for the biosynthesis of SMs; b) core cluster + peripheral genes: most of the genes are contiguous but some are located at different loci (cis or trans, linked or unlinked with the core cluster, respectively); c) core cluster + satellite subgroups: the majority of genes for the first committed step are organized in a core cluster, but some genes (two or three) are present as satellites elsewhere in the genome; and d) core cluster + peripheral genes: the main cluster encodes some tailoring enzymes and the gene encoding the scaffold of the NP is elsewhere in the genome (cis or trans) [127].
The boom of plant BGCs started about 24 years ago with the discovery of the cyclic hydroxamic acid 2,4-dihydroxy-1,4-benzoxazin-3-one (DIBOA) BGC in Gramineae, which is assembled by five genes, Bx1 through Bx5 [128], and the antibacterial triterpenoids avenacins in oat (Avena spp.) [127,129]. Since then, more than fifty BGCs have been revealed in several species across the phylogenetic tree of the Plantae kingdom [130].
▶ Fig. 10 Percentage of annotated plant BGCs by chemical class.
Among the plant BGCs reported so far, triterpenoids are the most exploited, accounting for about 43.4 % (23 SM) of the total, with structures covering a limited but informative portion of the chemical space. Next, diterpenoids contribute with 28.3 % (15 SM), hydroxamic acids with 7.5 % (four SM), polyketides, steroidal alkaloids and cyanogenic glycosides with 5.7 % each (three SM individually), and benzylisoquinoline alkaloids and monoterpenoids with 1.9 % each (one CE representing each chemical class) (▶ Fig. 10) [125,127,131]. ▶ Fig. 11 shows the most relevant structure representative of every chemical class; for a deeper view of the structures reported, readers are referred to Nützmann et al. [127] and Polturak et al. [131].
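The shares quoted above can be re-derived from the per-class counts given in the text. The short check below is only a consistency sketch: the total of 53 is not stated in the source and is inferred here by summing those counts.

```python
# Number of plant SMs with an annotated BGC per chemical class, as listed in the text.
counts = {
    "triterpenoids": 23,
    "diterpenoids": 15,
    "hydroxamic acids": 4,
    "polyketides": 3,
    "steroidal alkaloids": 3,
    "cyanogenic glycosides": 3,
    "benzylisoquinoline alkaloids": 1,
    "monoterpenoids": 1,
}

total = sum(counts.values())  # inferred total, not given explicitly in the source
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct} %")
print(f"total: {total}")
```

The rounded values reproduce the percentages reported in the paragraph, which suggests that the figures refer to a common set of 53 entries.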
Biologically, plant SMs encoded in BGCs play essential roles in chemical ecology, especially in interactions with competing plants, herbivores, insects, bacteria, and fungi, demonstrating that they are only biosynthesized under biotic stress conditions, as defense mechanisms [113]. Thus, the study of BGCs in plants opens the door to better understanding and elucidating complex biosynthetic pathways, as well as to the discovery and exploitation of novel SMs that could be of great interest for the agrochemical and pharmaceutical industries, as they could potentially be developed into new biocides or candidate drugs to treat infectious diseases caused by human pathogens.

Perspectives
The limitations of genome mining are due to information and database availability on gene functions and biosynthetic pathways. Recent approaches suggest pathways to circumvent or alleviate these hurdles. Omics studies to obtain data on the genomes, transcriptomes, and metabolomes of the same samples can improve the coverage and bridge the gaps in knowledge by connecting molecules with their BGC of origin. Such explorative studies of biologically hyperdiverse geographical regions or of those with extreme conditions can complete our knowledge of the pre-existing chemical diversity of the tree of life. In addition, genomic data analyses provide possibilities to design suitable growth conditions in the laboratory for previously uncultivable organisms, and thus allow for their metabolites to be studied. Finally, genome editing capabilities have been greatly developed and have simplified the exploration of cryptic BGCs, especially with the advent of CRISPR-Cas technologies [55]. While the number of molecules uncovered by genome mining has increased, with some even indicating the presence of completely novel chemical classes or genetic entities, there is very little funding and public interest to take these discoveries to the clinical stage and confirm their suitability in clinical, medical, and industrial practice. Screening these molecules by cheaper and faster means, such as artificial intelligence/machine learning strategies or high-throughput in vitro testing in organoids, has been proposed as a way forward to decrease some of the risks of failure further along in their validation.

Supporting Information
Traditional approaches for antimicrobials discovery from fungi and fungal natural products with minimum information about biosynthetic gene cluster (Table S1) are available in the Supporting Information.

Contributorsʼ Statement
Drafting the manuscript, design of the review, and critical revision of the manuscript: J. Rivera-Chávez, C. D. Ceapa, and M. Figueroa. All authors read and approved the final version of the manuscript.