<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Genomic Contamination on Luc Cornet</title>
    <link>https://lcornet.github.io/tags/genomic-contamination/</link>
    <description>Recent content in Genomic Contamination on Luc Cornet</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <atom:link href="https://lcornet.github.io/tags/genomic-contamination/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Evaluation of Genomic Contamination Detection Tools and Influence of Horizontal Gene Transfer on Their Efficiency through Contamination Simulations at Various Taxonomic Ranks</title>
      <link>https://lcornet.github.io/publications/conta4/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://lcornet.github.io/publications/conta4/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Luc Cornet, Valérian Lupo, Stéphane Declerck and Denis Baurain&lt;/strong&gt; &lt;br&gt;&#xA;Genomic contamination remains a pervasive challenge in (meta)genomics, prompting the development of numerous detection tools. Despite the attention that this issue has attracted, a comprehensive comparison of the available tools is absent from the literature. Furthermore, the potential effect of horizontal gene transfer on the detection of genomic contamination has been little studied. In this study, we evaluated the detection efficiency of six widely used contamination detection tools. To this end, we developed a simulation framework using orthologous group inference as a robust basis for the simulation of contamination. Additionally, we implemented a variable mutation rate to simulate horizontal transfer. Our simulations covered six distinct taxonomic ranks, ranging from phylum to species. The evaluation of contamination levels revealed the suboptimal precision of the tools, attributed to significant cases of both over-detection and under-detection, particularly at the genus and species levels. Notably, only so-called “redundant” contamination was reliably estimated. Our findings underscore the necessity of employing a combination of tools, including Kraken2, for accurate contamination level assessment. We also demonstrate that none of the assayed tools confused contamination and horizontal gene transfer. Finally, we release CRACOT, a freely accessible contamination simulation framework, which holds promise in evaluating the efficacy of future algorithms.&lt;br&gt;&#xA;&lt;a href=&#34;https://doi.org/10.3390/applmicrobiol4010009&#34;&gt;https://doi.org/10.3390/applmicrobiol4010009&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Contamination detection in genomic data: more is not enough</title>
      <link>https://lcornet.github.io/publications/conta3/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://lcornet.github.io/publications/conta3/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Luc Cornet and Denis Baurain&lt;/strong&gt;&lt;br&gt;&#xA;The decreasing cost of sequencing and concomitant augmentation of publicly available genomes have created an acute need for automated software to assess genomic contamination. During the last 6 years, 18 programs have been published, each with its own strengths and weaknesses. Deciding which tools to use becomes more and more difficult without an understanding of the underlying algorithms. We review these programs, benchmarking six of them, and present their main operating principles. This article is intended to guide researchers in the selection of appropriate tools for specific applications. Finally, we present future challenges in the developing field of contamination detection.&lt;br&gt;&#xA;&lt;a href=&#34;https://link.springer.com/article/10.1186/s13059-022-02619-9&#34;&gt;https://link.springer.com/article/10.1186/s13059-022-02619-9&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics</title>
      <link>https://lcornet.github.io/publications/conta2/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://lcornet.github.io/publications/conta2/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Valérian Lupo, Mick Van Vlierberghe, Hervé Vanderschuren, Frédéric Kerff, Denis Baurain and Luc Cornet&lt;/strong&gt;&lt;br&gt;&#xA;Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.&lt;br&gt;&#xA;&lt;a href=&#34;https://doi.org/10.3389/fmicb.2021.755101&#34;&gt;https://doi.org/10.3389/fmicb.2021.755101&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Consensus assessment of the contamination level of publicly available cyanobacterial genomes</title>
      <link>https://lcornet.github.io/publications/conta1/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://lcornet.github.io/publications/conta1/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Luc Cornet, Loïc Meunier, Mick Van Vlierberghe, Raphaël R. Léonard, Benoit Durieu, Yannick Lara, Agnieszka Misztak, Damien Sirjacobs, Emmanuelle J. Javaux, Hervé Philippe, Annick Wilmotte, Denis Baurain&lt;/strong&gt;&lt;br&gt;&#xA;Publicly available genomes are crucial for phylogenetic and metagenomic studies, in which contaminating sequences can be the cause of major problems. This issue is expected to be especially important for Cyanobacteria because axenic strains are notoriously difficult to obtain and keep in culture. Yet, despite their great scientific interest, no data are currently available concerning the quality of publicly available cyanobacterial genomes. As reliably detecting contaminants is a complex task, we designed a pipeline combining six methods in a consensus strategy to assess the contamination level of 440 genome assemblies of Cyanobacteria. Two methods are based on published reference databases of ribosomal genes (SSU rRNA 16S and ribosomal proteins), one is indirectly based on a reference database of marker genes (CheckM), and three are based on complete genome analysis. Among those genome-wide methods, Kraken and DIAMOND blastx share the same reference database that we derived from Ensembl Bacteria, whereas CONCOCT does not require any reference database, instead relying on differences in DNA tetramer frequencies. Given that all six methods appear to have their own strengths and limitations, we used the consensus of their rankings to infer that &amp;gt;5% of cyanobacterial genome assemblies are highly contaminated by foreign DNA (i.e., contaminants were detected by 5 or 6 methods). Our results will help researchers to check the quality of publicly available genomic data before use in their own analyses. Moreover, we argue that journals should make the submission of raw read data along with genome assemblies mandatory in order to facilitate the detection of contaminants in sequence databases.&lt;br&gt;&#xA;&lt;a href=&#34;https://doi.org/10.1371/journal.pone.0200323&#34;&gt;https://doi.org/10.1371/journal.pone.0200323&lt;/a&gt;&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
