Evo 2: Open‑Source AI Trained on Trillions of DNA Bases Across All Life Domains

Source: Ars Technica

Key Points

  • Evo 2 is an open‑source AI trained on trillions of DNA base pairs.
  • It incorporates genomes from bacteria, archaea and eukaryotes.
  • The model learns internal representations of regulatory DNA and splice sites.
  • Evo 2 builds on the earlier Evo system that excelled with bacterial genomes.
  • It addresses the complexity of eukaryotic genomes, including introns and scattered regulatory elements.
  • The system opens new avenues for bioinformatics research and collaboration.

Evo 2 is an open‑source artificial‑intelligence system that has been trained on trillions of base pairs of DNA from bacteria, archaea and eukaryotes. Building on the earlier Evo model, which excelled at predicting gene sequences in bacterial genomes, Evo 2 now learns internal representations of complex genomic features such as regulatory DNA, splice sites and the scattered elements that characterize eukaryotic genomes. The system demonstrates that large‑scale AI can capture patterns even in the most intricate parts of the genome, opening new possibilities for bioinformatics research.

Background and Motivation

Earlier coverage highlighted an AI system called Evo that was trained on an enormous number of bacterial genomes. The system could, when given sequences from a cluster of related genes, correctly identify the next gene or suggest a completely novel protein. This success relied on the relatively simple organization of bacterial genomes, where related genes are often clustered together and regulatory elements are compact.

Challenges with Complex Genomes

The original reporting noted uncertainty about whether the same approach would work with more complex genomes, such as those of eukaryotes. Eukaryotic DNA contains introns (non‑coding segments that interrupt coding regions) and regulatory sequences that can be scattered across vast stretches of DNA. The sequences that mark these features are weakly defined: only a few bases are strictly required, while many other positions merely show a probabilistic tendency toward particular bases. Additionally, eukaryotic genomes include large amounts of DNA that has been labeled as “junk,” including inactive viruses and damaged genes.
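
To make "weakly defined" concrete, here is a minimal sketch of a position weight matrix, a classic bioinformatics way of scoring a short site in which some bases are mandatory and others are only preferred. It is not how Evo 2 works internally. The motif, its probabilities, and the function names are all hypothetical, loosely modeled on the GT dinucleotide that begins most introns.

```python
import math

# Hypothetical 4-base motif (illustrative values, not measured data).
# Positions 0 and 1 are strictly required (G, then T).
# Positions 2 and 3 only show a statistical preference.
MOTIF = [
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumes all four bases are equally likely elsewhere


def log_odds_score(site):
    """Score one candidate site; higher means a better match to the motif."""
    if len(site) != len(MOTIF):
        raise ValueError("site must be as long as the motif")
    score = 0.0
    for probs, base in zip(MOTIF, site):
        p = probs.get(base, 0.0)
        if p == 0.0:
            # A forbidden base at any position rules the site out entirely.
            return float("-inf")
        score += math.log2(p / BACKGROUND)
    return score


def best_site(sequence):
    """Slide the motif along the sequence and return (score, start index)."""
    width = len(MOTIF)
    candidates = [
        (log_odds_score(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    # None signals that the sequence is shorter than the motif.
    return max(candidates, default=None)


if __name__ == "__main__":
    for site in ("GTAA", "GTCC", "GCAA"):
        print(site, log_odds_score(site))
    print(best_site("CCGTAACC"))
```

In this toy model, "GTAA" and "GTCC" both count as valid sites but score differently, while "GCAA" is rejected outright. Because the preferred positions contribute so little information, a long genome contains many chance matches, which is part of why such sites are hard to pick out.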

Evo 2: Extending the Model

Undeterred by these challenges, the team behind Evo set out to create Evo 2, an open‑source AI trained on genomes from all three domains of life: bacteria, archaea and eukaryotes. By ingesting trillions of base pairs of DNA, Evo 2 developed internal representations of key genomic features that are difficult for humans to spot, including regulatory DNA motifs and splice‑site boundaries.

Key Capabilities

Evo 2’s training enables it to recognize patterns across the full spectrum of genomic complexity. In bacterial genomes, it continues to leverage the straightforward organization of contiguous genes and compact regulatory systems. In eukaryotic genomes, it can parse intron‑containing genes, locate weakly defined regulatory sites, and differentiate functional sequences from the extensive non‑functional DNA that surrounds them.

Implications for Research

The emergence of Evo 2 suggests that large‑scale AI models can bridge the gap between simple and complex genomic architectures. By learning from vast, diverse datasets, such models may assist scientists in identifying regulatory elements, predicting gene structures and uncovering novel proteins across a wide range of organisms. The open‑source nature of Evo 2 also invites collaboration and further development within the bioinformatics community.
