How I used Google Colab’s A100 to basecall 100 GB of data within 3 hours (and it only cost me $2)

[4 min read] By Igor Bogorad 

Our new sequencing company processes a lot of data using Oxford Nanopore Technologies’ MinION flow cell. It’s an amazing instrument that sequences long strands of DNA by threading them through a protein pore, which generates an electrical signal as the DNA passes through. These signals are then decoded, or “basecalled,” using sophisticated AI-driven software such as Dorado. With a GPU attached, the raw signals can be converted to DNA sequences almost as fast as they are generated (live basecalling). While live basecalling usually works smoothly, hiccups can happen.

Recently, I hit one of those hiccups after updating Oxford Nanopore’s MinKNOW software: the live basecaller produced corrupted FASTQ files.

This problem was reported [here], and ONT’s support team quickly provided a patch [here]. While MinKNOW can perform post-run basecalling, this is painfully slow on modest GPUs (e.g., the RTX models ONT recommends for MinIONs [link]). Basecalling 100 GB this way could take 10+ hours.

But Dorado’s repository clearly states that it is “heavily optimized for Nvidia A100 and H100 GPUs” and performs best on these systems. Of course, buying an A100 outright (~$10,000) is unrealistic for most labs.

The workaround: use Google Colab’s A100 instances in the cloud.


Practical Setup

  • Colab Plan: If you’re not a heavy user, Colab’s Pay-As-You-Go option is great. $10 gives you 100 credits, which is more than enough for several sequencing runs. A100 GPUs cost 5 credits/hour ($0.50/hr).

  • Pro vs Pro+: A Pro subscription increases your chances of landing an A100. With Pro+, your priority is even higher. If the A100 isn’t immediately available, wait a bit and try again.
  • Select A100 under “Runtime” -> “Change runtime type”
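
Once the runtime starts, it is worth confirming that you actually landed an A100 before spending credits on a long job. Here is a minimal sketch that asks nvidia-smi for the GPU name. The helper names (gpu_names, has_a100) are my own for illustration, and the check assumes the reported name contains the string "A100".

```python
import subprocess

def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        # No driver or no GPU attached (e.g. a CPU-only runtime).
        return []
    return [line.strip() for line in out.splitlines() if line.strip()]

def has_a100(names):
    """True if any reported GPU name mentions A100 (assumed naming)."""
    return any("A100" in name for name in names)

names = gpu_names()
print(names if names else "No NVIDIA GPU visible")
print("A100 attached:", has_a100(names))
```

If this prints False, disconnect, wait a bit, and request the A100 again rather than basecalling on a slower card.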

Caveat: Upload bandwidth matters. At <50 Mbps, the upload time can outweigh the GPU speed benefit. Our lab at Bonneville Labs has fast fiber internet, which makes this practical.
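
To see why bandwidth matters, here is a quick back-of-the-envelope estimate of upload time for a 100 GB run. It assumes decimal gigabytes and a link that sustains its full rated speed, so real uploads will be somewhat slower. The function name and the example speeds other than 50 Mbps are just illustrative.

```python
def upload_hours(size_gb, mbps):
    """Ideal upload time in hours for size_gb gigabytes at mbps megabits/s."""
    megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return megabits / mbps / 3600  # seconds -> hours

for mbps in (50, 200, 1000):
    print(f"{mbps:>4} Mbps: {upload_hours(100, mbps):.1f} h")
```

At 50 Mbps the upload alone takes longer than the roughly 3 hours of basecalling, which is where the cloud approach starts to lose its appeal.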


Workflow (Simplified)

For those who aren’t full-time coders (like me), here’s the rough sequence:

  1. Mount Google Drive so Colab can access your data.
  2. Download and install Dorado (must repeat each session since Colab resets).
  3. Download a model (for example, dna_r10.4.1_e8.2_400bps_sup@v5.0.0).
  4. List POD5/FAST5 files and confirm paths.
  5. Run basecalling → output combined FASTQ.
  6. Demultiplex (if needed) using dorado demux.
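
The steps above can be sketched roughly as follows. This is not the notebook itself, just a minimal outline under some assumptions: Dorado has already been unpacked somewhere in the Colab session (step 2), and every path and the barcode kit name shown are hypothetical placeholders. The helper functions are my own names; they only build the Dorado command lines and list input files, so you can inspect everything before running it.

```python
import shlex
import subprocess
from pathlib import Path

MODEL = "dna_r10.4.1_e8.2_400bps_sup@v5.0.0"  # model named in step 3

def mount_drive(mount_point="/content/drive"):
    """Step 1: mount Google Drive (only works inside a Colab runtime)."""
    from google.colab import drive  # lazy import; not available elsewhere
    drive.mount(mount_point)

def list_signal_files(data_dir):
    """Step 4: return the POD5/FAST5 files under data_dir, sorted."""
    return sorted(
        p for p in Path(data_dir).rglob("*")
        if p.suffix.lower() in {".pod5", ".fast5"}
    )

def download_cmd(dorado):
    """Step 3: download the model into the current working directory."""
    return [dorado, "download", "--model", MODEL]

def basecall_cmd(dorado, model, data_dir):
    """Step 5: basecall a directory; Dorado writes the reads to stdout."""
    return [dorado, "basecaller", model, str(data_dir), "--emit-fastq"]

def demux_cmd(dorado, kit_name, reads, out_dir):
    """Step 6: split reads by barcode into out_dir."""
    return [dorado, "demux", "--kit-name", kit_name,
            "--output-dir", str(out_dir), str(reads)]

def run_to_file(cmd, out_path):
    """Run a command and capture its stdout in out_path."""
    with open(out_path, "wb") as fh:
        subprocess.run(cmd, stdout=fh, check=True)

# Hypothetical paths and kit name, shown only to print the commands.
dorado = "/content/dorado/bin/dorado"
data_dir = "/content/drive/MyDrive/run1/pod5"
fastq = "/content/drive/MyDrive/run1/calls.fastq"
for cmd in (download_cmd(dorado),
            basecall_cmd(dorado, MODEL, data_dir),
            demux_cmd(dorado, "SQK-NBD114-24", fastq, "/content/demux")):
    print(shlex.join(cmd))

# In Colab you would then run, in order:
#   mount_drive()
#   subprocess.run(download_cmd(dorado), check=True)
#   run_to_file(basecall_cmd(dorado, MODEL, data_dir), fastq)
#   subprocess.run(demux_cmd(dorado, "SQK-NBD114-24", fastq, "/content/demux"), check=True)
```

Writing the FASTQ straight to Drive means the result survives when the Colab session resets, at the cost of somewhat slower writes than local disk.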

Here is a link to the Google Colab notebook. Feel free to make a copy and edit it as you like.

Result: 100 GB basecalled in ~3 hours at a total compute cost of about $2.
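
For anyone who wants to check the math, here is the arithmetic behind that figure, using the Pay-As-You-Go numbers quoted earlier ($10 for 100 credits, 5 credits per A100 hour). The constant and function names are just for illustration.

```python
USD_PER_CREDIT = 10 / 100      # $10 buys 100 credits
A100_CREDITS_PER_HOUR = 5      # rate quoted for the A100 runtime

def compute_cost_usd(hours):
    """GPU cost in dollars for a given number of A100 hours."""
    return hours * A100_CREDITS_PER_HOUR * USD_PER_CREDIT

hours = 3
print(f"Credits used: {hours * A100_CREDITS_PER_HOUR}")
print(f"Compute cost: ${compute_cost_usd(hours):.2f}")
print(f"Throughput:   {100 / hours:.0f} GB per hour")
```

Three hours of pure basecalling comes to $1.50. Setup time in the same session (installing Dorado, downloading the model) also burns credits, which is why I round the total up to about $2.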


Why It Matters

Cheap access to powerful GPUs can be a lifesaver when you hit unexpected software issues. Beyond basecalling, A100s in Colab could also support other GPU-intensive genomics tasks.

Have you tried other fast or low-cost cloud solutions for nanopore data? Drop a note below—I’d love to compare strategies.

 

Interested in high quality and inexpensive nanopore sequencing? Try Angstrom Innovation at sequencing.angstrominno.com
