Biopython is a library of freely available Python tools for computational biology. It makes it easy to write Python programs for bioinformatics.

The basic functionality provided by Biopython includes:

  • Parsers for various bioinformatics file formats (BLAST output, FASTA, GenBank, etc.)
  • Access to online services (NCBI, ExPASy, etc.)
  • A standard sequence class that handles sequences, sequence IDs, and sequence features
  • Interfaces to common bioinformatics programs such as NCBI BLAST (a sequence alignment tool) and the EMBOSS tools (for sequence analysis)
  • Tools for performing common operations on sequences, such as translation, transcription, and weight calculations
  • Integration with BioSQL, a sequence database schema


Before installing Biopython, you need to install its prerequisites, namely Python and NumPy.

  • Python

Biopython is available for Python 2.6, 2.7, 3.3, 3.4, and 3.5.

  • NumPy (Numerical Python)

To install NumPy, you can use pip:

    pip install numpy

  • Biopython

After installing NumPy, you can install Biopython using pip:

    pip install biopython

To check whether Biopython is installed properly, run this in a Python session:

    import Bio

If this gives an error, Biopython is not installed.
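
The same check can be done without triggering an error. The following is a minimal sketch, assuming Python 3; the helper name is_installed is hypothetical and introduced only for illustration. It looks the package up first and prints the installed version if it is found.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("Bio"):
    import Bio
    # Biopython exposes its release number as Bio.__version__
    print("Biopython version:", Bio.__version__)
else:
    print("Biopython is not installed; try: pip install biopython")
```

Knowing the version is useful because some of the modules used later in this article differ between Biopython releases.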

For detailed installation instructions, refer to the Python Package Index.


Let’s look at some of the features that make Biopython so useful!

  • Parsing various Biological file formats

Most biological data is stored in specialised file formats such as FASTA-formatted and GenBank-formatted text files. Parsing these files into data structures that a program can manipulate is a tedious and error-prone task, which the parsers provided in Biopython simplify.
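
To give a feel for what a parser has to do, here is a rough pure-Python sketch that reads FASTA text by hand. It is not how Biopython implements its parsers; the function name read_fasta and the two example records are made up for illustration, and the sketch ignores many details that a real parser handles.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle.

    Hypothetical helper for illustration only.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence data may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up FASTA content held in memory instead of a file on disk
example = io.StringIO(">seq1 hypothetical example\nATGCGTAC\nGATACA\n>seq2\nTTGACA\n")
records = list(read_fasta(example))
print(records)
```

Even this simple format needs bookkeeping for wrapped sequence lines and record boundaries, and richer formats such as GenBank need far more.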

Suppose you have a file named “ABC.fasta” in FASTA format. You can parse it and obtain a list of the sequence records stored in the file as follows:

    from Bio import SeqIO
    records = list(SeqIO.parse("ABC.fasta", "fasta"))

A file in GenBank format can be parsed similarly:

    from Bio import SeqIO
    records = list(SeqIO.parse("ABC.gbk", "genbank"))

  • Accessing Online Databases

Biopython can be used to access and download biological data from several online databases, such as NCBI and ExPASy. For NCBI, the Bio.Entrez module serves this purpose.

    >>> from Bio import Entrez
    >>> from Bio import SeqIO

    # provide your own email address (the value below is only a placeholder)
    >>> Entrez.email = "your.name@example.com"

    #IDs to be searched
    >>> records = ["P68871","Q96I25"]

    # fetch each record from the database in FASTA format and parse it
    >>> for rec_id in records:
    ...     handle = Entrez.efetch(db="protein", id=rec_id, rettype="fasta", retmode="text")
    ...     seqRec = SeqIO.read(handle, "fasta")
    ...     print(seqRec)
    ...     handle.close()

    ID: P68871.2
    Name: P68871.2
    Description: P68871.2 RecName: Full=Hemoglobin subunit beta; AltName: Full=Beta-globin; AltName: Full=Hemoglobin beta chain; Contains: RecName: Full=LVV-hemorphin-7; Contains: RecName: Full=Spinorphin
    Number of features: 0

    ID: Q96I25.1
    Name: Q96I25.1
    Description: Q96I25.1 RecName: Full=Splicing factor 45; AltName: Full=45 kDa-splicing factor; AltName: Full=RNA-binding motif protein 17
    Number of features: 0

  • Sequence Objects and common operations

Biopython has sequence objects that behave much like strings of letters. We can index them, take their length, iterate over their characters, slice them, and so on, just as we do with Python strings.

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC

    >>> my_seq = Seq("ATGCGTACGATACATACAGCGT" , IUPAC.unambiguous_dna)

    >>> len(my_seq)            # length of sequence
    22

    >>> my_seq.count("A")      # count occurrences of a character
    7

    >>> my_seq[3:7]            # slicing the sequence
    Seq('CGTA', IUPACUnambiguousDNA())

    >>> for letter in my_seq[:5]:
    ...     print(letter)      # iterating through characters
    ...
    A
    T
    G
    C
    G

Though the Seq objects in Biopython and standard Python strings have some similarities, there are two major differences.

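  • The Seq object supports many biologically relevant methods, such as translate(), transcribe(), and reverse_complement().
  • The Seq object has an important attribute, alphabet, which is an object describing what the individual characters making up the sequence string mean and how they should be interpreted. For example, a Seq object could denote a protein sequence or a DNA sequence. Note that the Bio.Alphabet module was removed in Biopython 1.78, so the alphabet argument and attribute shown here apply only to earlier releases; in later releases a Seq is created from the sequence string alone.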

Some of the biologically relevant operations that can be performed using Biopython methods are demonstrated below.

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> from Bio.SeqUtils import GC

    >>> my_seq = Seq("ATGCGTACGATACATACAGCGT" , IUPAC.unambiguous_dna)

    #calculating GC content of the DNA sequence
    >>> GC(my_seq)

    #complement of DNA sequence
    >>> my_seq.complement()

    #reverse complement of DNA
    >>> my_seq.reverse_complement()

    #simulating biological DNA strands
    >>> coding_dna = my_seq
    >>> template_dna = coding_dna.reverse_complement()
    >>> template_dna

    #transcription process (DNA -> mRNA)
    >>> messenger_rna = template_dna.reverse_complement().transcribe()
    >>> messenger_rna

    #translation process (mRNA -> Protein)
    >>> protein = messenger_rna.translate()
    >>> protein
    Seq('MRTIHTA', IUPACProtein())
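
To make concrete what methods such as complement(), reverse_complement(), and transcribe() compute, here is a rough pure-Python sketch that works on plain strings. The helpers gc_percent, complement, reverse_complement, and transcribe below are hypothetical functions written for illustration, not Biopython's implementation, and they assume an uppercase DNA string containing only A, C, G, and T.

```python
# Hypothetical helpers for illustration; not Biopython's implementation.
# They assume an uppercase DNA string containing only A, C, G and T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_percent(dna):
    """Percentage of bases that are G or C."""
    if not dna:
        return 0.0
    return 100.0 * sum(base in "GC" for base in dna) / len(dna)


def complement(dna):
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the strand, then read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: replace every T with U."""
    return coding_dna.replace("T", "U")


my_seq = "ATGCGTACGATACATACAGCGT"
print(round(gc_percent(my_seq), 2))
print(complement(my_seq))
print(reverse_complement(my_seq))
print(transcribe(my_seq))
```

The Seq methods do more than this sketch, for example handling ambiguous bases, so the Biopython versions remain the better choice for real data.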

Apart from these, Biopython offers many other features. If you are interested in bioinformatics and love programming in Python, Biopython is an excellent choice!

To learn more, check out the official documentation.