BlackOPs: increasing confidence in variant detection through mappability filtering.

Published in Nucleic Acids Research by Christopher R Cabanski, Matthew D Wilkerson, Matthew Soloway, Joel S Parker, Jinze Liu, Jan F Prins, J S Marron, Charles M Perou, D Neil Hayes

TLDR

  • BlackOPs is a new open-source tool that simulates RNA-seq and DNA whole exome sequences from the reference genome, aligns them, detects variants, and outputs a blacklist of positions and alleles caused by mismapping.

Abstract

Identifying variants using high-throughput sequencing data is currently a challenge because true biological variants can be indistinguishable from technical artifacts. One source of technical artifact results from incorrectly aligning experimentally observed sequences to their true genomic origin ('mismapping') and inferring differences in mismapped sequences to be true variants. We developed BlackOPs, an open-source tool that simulates experimental RNA-seq and DNA whole exome sequences derived from the reference genome, aligns these sequences by custom parameters, detects variants and outputs a blacklist of positions and alleles caused by mismapping. Blacklists contain thousands of artifact variants that are indistinguishable from true variants and, for a given sample, are expected to be almost completely false positives. We show that these blacklist positions are specific to the alignment algorithm and read length used, and BlackOPs allows users to generate a blacklist specific to their experimental setup. We queried the dbSNP and COSMIC variant databases and found numerous variants indistinguishable from mapping errors. We demonstrate how filtering against blacklist positions reduces the number of potential false variants using an RNA-seq glioblastoma cell line data set. In summary, accounting for mapping-caused variants tuned to experimental setups reduces false positives and, therefore, improves genome characterization by high-throughput sequencing.

Overview

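  • The study addresses variant identification from high-throughput sequencing data, which is difficult because true biological variants can be indistinguishable from technical artifacts. It focuses on one source of artifact: sequences aligned to a location other than their true genomic origin ("mismapping"), whose differences from the reference are then reported as variants.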
  • The study introduces BlackOPs, an open-source tool that simulates experimental RNA-seq and DNA whole exome sequences, aligns them, detects variants, and outputs a blacklist of positions and alleles caused by mismapping.
  • The primary objective is to develop a tool that generates a blacklist specific to an experimental setup to reduce false positives in genome characterization.
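
The simulate, align, detect, and blacklist steps above can be illustrated with a toy example. The sketch below is not BlackOPs itself: the sequences, the function names, and the naive ungapped best-hit aligner are all assumptions made for illustration. It tiles error-free reads across a spliced transcript, aligns them to a genome that also contains an intron-less near-copy of the gene (a hypothetical processed pseudogene), and records every mismatch. Because the simulated reads contain no true variants, every mismatch is a mapping artifact.

```python
# Toy illustration only: sequences, names, and the aligner are hypothetical.
EXON1, INTRON, EXON2 = "ACGTGCAT", "GGGGGG", "TGACCGTA"
SPACER = "AAAAAA"
TRANSCRIPT = EXON1 + EXON2                 # spliced transcript (no intron)
NEAR_COPY = "ACGTGCAC" + EXON2             # intron-less copy with one T>C change
GENOME = EXON1 + INTRON + EXON2 + SPACER + NEAR_COPY
READ_LEN = 8


def simulate_reads(transcript, read_len):
    """Tile error-free reads across the transcript, one per start position."""
    last_start = len(transcript) - read_len
    return [transcript[t:t + read_len] for t in range(last_start + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_best_hit(read, genome, max_mismatches):
    """Naive ungapped aligner: return the 0-based start of the window with the
    fewest mismatches (first window wins ties), or None if every window
    exceeds max_mismatches."""
    best_pos, best_dist = None, max_mismatches + 1
    for start in range(len(genome) - len(read) + 1):
        dist = hamming(read, genome[start:start + len(read)])
        if dist < best_dist:
            best_pos, best_dist = start, dist
    return best_pos


def build_blacklist(genome, transcript, read_len, max_mismatches=1):
    """Collect (position, reference allele, observed allele) for every
    mismatch between an aligned simulated read and the genome."""
    blacklist = set()
    for read in simulate_reads(transcript, read_len):
        start = align_best_hit(read, genome, max_mismatches)
        if start is None:
            continue  # read could not be placed within the mismatch limit
        for offset, base in enumerate(read):
            ref_base = genome[start + offset]
            if ref_base != base:
                blacklist.add((start + offset, ref_base, base))
    return blacklist


blacklist = build_blacklist(GENOME, TRANSCRIPT, READ_LEN)
for pos, ref, alt in sorted(blacklist):
    print(f"pos={pos} {ref}>{alt}")
```

In this toy setup, reads that span the exon junction cannot align contiguously to the gene, so the aligner places them wherever a one-mismatch match exists. One result is an artifact call at 0-based position 35 (C>T) inside the near-copy. Changing `READ_LEN` or `max_mismatches` changes which reads are misplaced, which is consistent with the finding that blacklists depend on the aligner and read length.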

Comparative Analysis & Findings

  • The study shows that the blacklist positions are specific to the alignment algorithm and read length used, and BlackOPs allows users to generate a blacklist tailored to their experimental setup.
  • By querying the dbSNP and COSMIC variant databases, the study found numerous variants indistinguishable from mapping errors.
  • Using an RNA-seq glioblastoma cell line data set, the study demonstrates that filtering variant calls against blacklist positions reduces the number of potential false variants.
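
The two operations described above, comparing blacklists across setups and filtering calls against a blacklist, can be sketched as follows. This is a minimal illustration under stated assumptions, not the BlackOPs file format or interface: the `Variant` record, the function names, and all coordinates are hypothetical. Matching is done on both position and allele, since the blacklist is described as listing positions and alleles.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical variant record: chromosome, position, ref and alt allele."""
    chrom: str
    pos: int
    ref: str
    alt: str


def filter_against_blacklist(calls, blacklist):
    """Split variant calls into (kept, removed); a call is removed only when
    its position AND alleles match a blacklist entry."""
    kept = [v for v in calls if v not in blacklist]
    removed = [v for v in calls if v in blacklist]
    return kept, removed


def jaccard(a, b):
    """Overlap between two blacklists: |A and B| / |A or B| (1.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up blacklists standing in for two experimental setups
# (for example, two read lengths run through the same aligner).
blacklist_50bp = {Variant("chr1", 100, "A", "G"), Variant("chr2", 200, "C", "T")}
blacklist_75bp = {Variant("chr1", 100, "A", "G")}

# Made-up variant calls from a sample sequenced with the first setup.
calls = [
    Variant("chr1", 100, "A", "G"),  # same position and allele as a blacklist entry
    Variant("chr1", 100, "A", "T"),  # same position, different allele
    Variant("chr3", 300, "G", "A"),  # not blacklisted
]

kept, removed = filter_against_blacklist(calls, blacklist_50bp)
overlap = jaccard(blacklist_50bp, blacklist_75bp)
print(len(kept), "kept;", len(removed), "removed; overlap =", overlap)
```

A low overlap between blacklists from different setups is the reason to filter with the blacklist generated for the aligner and read length actually used. Whether to drop a call that shares only the position of a blacklist entry, not the allele, is a design choice; the sketch keeps it.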

Implications and Future Directions

  • The study highlights the importance of accounting for mapping-caused variants tuned to experimental setups to reduce false positives and improve genome characterization by high-throughput sequencing.
  • Future research directions may include integrating BlackOPs with existing variant detection algorithms and developing more advanced methods for simulating experimental sequences.
  • BlackOPs may be useful to researchers working with high-throughput sequencing data, for example in rare-disease studies, where accurate variant detection is crucial.