

bio_utils offers classes that hold biological data parsed from several file formats. Each instance has one attribute per field of its file format, plus a write() method that returns the original entry, properly formatted and followed by a newline character. Each section below gives a short description of what the format contains and a link to a detailed description of that format.

These classes currently live in the iterators subpackage but will be moved into their own package later; see Roadmap for details.
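
The attribute-per-field and write() pattern described above can be sketched as follows. This is only an illustration: `ExampleFastaEntry` and its attribute names are hypothetical and are not the bio_utils classes themselves, whose attribute names may differ.

```python
from dataclasses import dataclass


@dataclass
class ExampleFastaEntry:
    """Hypothetical stand-in for an entry class; not part of bio_utils."""

    id: str
    description: str
    sequence: str

    def write(self) -> str:
        # Rebuild the header, dropping the trailing space when there is
        # no description, then append the sequence and a final newline.
        header = f">{self.id} {self.description}".rstrip()
        return f"{header}\n{self.sequence}\n"


entry = ExampleFastaEntry("seq1", "example record", "ACGT")
text = entry.write()
```

Because write() returns a string ending in a newline, entries can be written back to a file one after another without adding separators.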


B6 files contain data describing the length and quality of an alignment between nucleotide or protein sequences. NCBI BLAST+ produces this format as output format 6, hence B6 (BLAST+ 6). The same format was called M8 in the older NCBI BLAST. drive5 provides a good, succinct description of this format. This class supports only the default B6 columns and does not accept arbitrary fields.
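
To make the default layout concrete, here is a minimal sketch that splits one B6 line into the twelve default BLAST+ output format 6 columns. `B6Record` and `parse_b6_line` are hypothetical names chosen for this example and are not the bio_utils API; the sample line is made up.

```python
from typing import NamedTuple


class B6Record(NamedTuple):
    """Hypothetical container for the twelve default outfmt 6 columns."""

    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


# One converter per column, in column order.
_CONVERTERS = (str, str, float, int, int, int, int, int, int, int, float, float)


def parse_b6_line(line: str) -> B6Record:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(_CONVERTERS):
        raise ValueError(f"expected 12 tab-separated fields, got {len(fields)}")
    return B6Record(*(conv(value) for conv, value in zip(_CONVERTERS, fields)))


# Made-up alignment line, for illustration only.
sample = "query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370\n"
record = parse_b6_line(sample)
```

Rejecting any line that does not have exactly twelve fields mirrors the restriction to the default format.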


FASTA files contain nucleotide and protein sequences differentiated by unique identifiers. Wikipedia provides both the history of the FASTA format and format specifications.
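
As a rough illustration of the format, the sketch below reads FASTA text record by record. `iter_fasta` is a hypothetical helper written for this example, not a bio_utils function, and it assumes the identifier is everything on the header line up to the first space.

```python
import io


def _split_header(header):
    # Identifier is the text before the first space; the rest is description.
    identifier, _, description = header.partition(" ")
    return identifier, description


def iter_fasta(handle):
    """Yield (id, description, sequence) tuples from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif line and header is not None:
            # Sequences may be wrapped over several lines; collect them all.
            chunks.append(line)
    if header is not None:
        yield _split_header(header) + ("".join(chunks),)


records = list(iter_fasta(io.StringIO(">seq1 first\nACGT\nTTGA\n>seq2\nGGCC\n")))
```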


Like FASTA files, FASTQ files contain nucleotide and protein sequences differentiated by unique identifiers. FASTQ files also contain quality scores indicating the confidence of each base or residue call. Wikipedia provides a description of the FASTQ format and the meaning of the various quality scores.
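
The quality line encodes one score per character. The sketch below decodes it under the assumption of the Phred+33 (Sanger) encoding; older files may use a different offset, so treat the offset as an assumption. Both function names are hypothetical and not part of bio_utils.

```python
PHRED_OFFSET = 33  # Assumed Phred+33 encoding; other offsets exist.


def phred33_to_scores(quality: str) -> list[int]:
    """Convert a FASTQ quality string into integer Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality]


def error_probability(score: int) -> float:
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)


scores = phred33_to_scores("!5I")
```

A score of 20 therefore corresponds to an estimated 1 in 100 chance that the call is wrong, and a score of 40 to 1 in 10,000.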


General Feature Format 3 (GFF3) files contain data on the location, type, and quality of a nucleotide or protein sequence annotation. Unlike previous versions, GFF3 supports an arbitrary number of hierarchical annotation levels. GMOD gives a very detailed walkthrough of this format.
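
A GFF3 feature line has nine tab-separated columns, and the hierarchy is expressed through the ID and Parent keys in the last column. The following sketch parses one such line; `parse_gff3_line` is a hypothetical helper for illustration, not the bio_utils API, and the sample line is made up.

```python
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line: str) -> dict:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated key=value pairs, percent-encoded.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


# Made-up feature line: an exon whose parent is an mRNA feature.
sample = "chr1\texample\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=mRNA1\n"
feature = parse_gff3_line(sample)
```

Following the Parent values from feature to feature is what allows any number of annotation levels, for example gene, mRNA, and exon.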


Sequence Alignment/Map (SAM) files contain details on the location and quality of an alignment. Many alignment programs produce SAM files as their default output. GitHub hosts a painfully detailed description of the SAM file format.
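
Each SAM alignment line starts with eleven mandatory tab-separated fields, optionally followed by tagged fields. The sketch below splits one line accordingly; `parse_sam_line` is a hypothetical helper for illustration and does not represent the bio_utils class, and the sample line is made up.

```python
# Mandatory SAM fields in order, each paired with its converter.
SAM_FIELDS = (
    ("qname", str), ("flag", int), ("rname", str), ("pos", int),
    ("mapq", int), ("cigar", str), ("rnext", str), ("pnext", int),
    ("tlen", int), ("seq", str), ("qual", str),
)


def parse_sam_line(line: str) -> dict:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_FIELDS):
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    record = {
        name: conv(value) for (name, conv), value in zip(SAM_FIELDS, fields)
    }
    # Anything after the eleventh column is an optional TAG:TYPE:VALUE field.
    record["optional"] = fields[len(SAM_FIELDS):]
    return record


# Made-up alignment line, for illustration only.
sample = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n"
alignment = parse_sam_line(sample)
is_unmapped = bool(alignment["flag"] & 0x4)  # Flag bit 0x4 marks unmapped reads.
```

Header lines, which begin with `@`, are not handled here and would need to be skipped before calling the function.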