Biopython is a library of freely available Python tools for computational biology. It makes it easy to write Python programs for bioinformatics.
The basic functionality provided by Biopython includes:
- Parsers for various bioinformatics file formats (BLAST output, FASTA, GenBank, etc.)
- Access to online services (NCBI, ExPASy, etc.)
- A standard sequence class that deals with sequences, ids on sequences, and sequence features.
- Interfaces to some common bioinformatics programs, such as BLAST from NCBI (a tool for sequence alignment) and the EMBOSS tools (for sequence analysis)
- Tools for performing common operations on sequences, such as translation, transcription and weight calculations
- Integration with BioSQL, a sequence database schema
Before installing Biopython, you need to install its prerequisites, i.e. Python and NumPy.
- Python: Biopython is available for Python 2.6, 2.7, 3.3, 3.4 and 3.5 (http://www.python.org).
- NumPy (Numerical Python)
To install NumPy, you can use pip:
pip install numpy
After installing NumPy, you can install Biopython using pip:
pip install biopython
To check whether Biopython is installed properly, run this in a Python session:
import Bio
If this gives an error, Biopython is not installed.
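The same check can also be scripted. The snippet below is only a minimal sketch: the helper name biopython_version is made up for illustration, and it assumes nothing beyond the fact that Biopython installs a top-level package called Bio with a __version__ string.

```python
import importlib


def biopython_version():
    """Return the installed Biopython version string, or None if it is missing."""
    try:
        bio = importlib.import_module("Bio")
    except ImportError:
        # The import failed, so Biopython is not available in this environment.
        return None
    # Fall back to a placeholder if the attribute is ever absent.
    return getattr(bio, "__version__", "unknown")


version = biopython_version()
if version is None:
    print("Biopython is not installed")
else:
    print("Biopython version:", version)
```

Catching ImportError instead of letting the script crash makes it easy to print a friendly message, or to decide at run time whether an optional Biopython feature can be used.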
For detailed installation instructions, refer to the Python Package Index.
Let’s look at some of the functionality that makes Biopython awesome!
- Parsing various biological file formats
Most biological data is stored in special file formats, such as FASTA or GenBank formatted text files. Parsing these files into objects that can be manipulated in a programming language is a challenging task, which the parsers provided in Biopython make much simpler.
Suppose you have a file named “ABC.fasta” in FASTA format. You can parse the file and obtain a list of the sequence records stored in it as follows:
from Bio import SeqIO
records = list(SeqIO.parse("ABC.fasta", "fasta"))
A file in GenBank format can be parsed similarly:
from Bio import SeqIO
records = list(SeqIO.parse("ABC.gbk", "genbank"))
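To get a feel for the work the FASTA parser saves you, here is a rough pure-Python sketch of reading the same kind of file by hand. It is not how Biopython implements SeqIO.parse; the function read_fasta and the two-record example are assumptions made up for illustration, and the sketch only handles the simplest form of the format (a ">" header line followed by sequence lines).

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA file handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first word after ">" (empty if none is given).
            words = line[1:].split(None, 1)
            identifier = words[0] if words else ""
            chunks = []
        elif identifier is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    # Emit the last record once the file ends.
    if identifier is not None:
        yield identifier, "".join(chunks)


# Hypothetical two-record file, held in memory instead of on disk.
example = StringIO(
    ">seq1 first example\nATGCGTAC\nGATACA\n>seq2 second example\nTTGACA\n"
)
records = list(read_fasta(example))
for identifier, sequence in records:
    print(identifier, len(sequence))
```

Even this toy version has to deal with wrapped sequence lines and record boundaries. Richer formats such as GenBank add annotations and features on top, which is where a tested parser really pays off.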
- Accessing Online Databases
Biopython can be used to access and download biological data from several online databases, such as NCBI and ExPASy. For NCBI, the Bio.Entrez module can be used, as in the following session, which fetches two protein records in FASTA format.
>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> # provide your email-id
>>> Entrez.email = "[email protected]"
>>> # IDs to be searched
>>> records = ["P68871", "Q96I25"]
>>> # search the database and obtain data in FASTA format
>>> for rec_id in records:
...     handle = Entrez.efetch(db="protein", id=rec_id, rettype="fasta")
...     seqRec = SeqIO.read(handle, "fasta")
...     print(seqRec)
...     handle.close()
...
ID: P68871.2
Name: P68871.2
Description: P68871.2 RecName: Full=Hemoglobin subunit beta; AltName: Full=Beta-globin; AltName: Full=Hemoglobin beta chain; Contains: RecName: Full=LVV-hemorphin-7; Contains: RecName: Full=Spinorphin
Number of features: 0
Seq('MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDA...KYH', SingleLetterAlphabet())
ID: Q96I25.1
Name: Q96I25.1
Description: Q96I25.1 RecName: Full=Splicing factor 45; AltName: Full=45 kDa-splicing factor; AltName: Full=RNA-binding motif protein 17
Number of features: 0
Seq('MSLYDDLGVETSDSKTEGWSKNFKLLQSQLQVKKAALTQAKSQRTKQSTVLAPV...EQV', SingleLetterAlphabet())
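Behind the scenes, Bio.Entrez talks to NCBI's E-utilities web service. As a rough sketch of what an efetch request looks like, the URL can be assembled by hand. The helper efetch_url and the example address are illustrative assumptions, nothing is sent over the network here, and Biopython may add further parameters of its own.

```python
from urllib.parse import urlencode

# Address of the efetch endpoint of NCBI's E-utilities.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, rec_id, rettype, email):
    """Build (but do not send) an efetch request URL."""
    params = {"db": db, "id": rec_id, "rettype": rettype, "email": email}
    # urlencode escapes special characters such as "@" in the email address.
    return EFETCH_URL + "?" + urlencode(params)


url = efetch_url("protein", "P68871", "fasta", "you@example.com")
print(url)
```

In practice you would let Bio.Entrez build and send the request, since it returns a handle that SeqIO can read directly.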
- Sequence Objects and common operations
Biopython has sequence objects that behave much like strings of letters. We can perform operations such as indexing, calculating the length, iterating through the characters and slicing, just as we do with Python strings.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("ATGCGTACGATACATACAGCGT", IUPAC.unambiguous_dna)
>>> len(my_seq)  # length of sequence
22
>>> my_seq.count("A")  # count occurrences of a character
7
>>> my_seq[3:7]  # slicing the sequence
Seq('CGTA', IUPACUnambiguousDNA())
>>> for letter in my_seq[:5]:
...     print(letter)  # iterating through characters
...
A
T
G
C
G
Although Seq objects in Biopython and standard Python strings have some similarities, there are two major differences:
- The Seq object supports many biologically relevant methods, like complement(), reverse_complement(), transcribe() and translate().
- The Seq object has an important attribute, alphabet, which is an object describing what the individual characters making up the sequence string mean and how they should be interpreted. For example, a Seq object could denote a protein sequence or a DNA sequence.
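To make the idea of an alphabet a bit more concrete, here is a small pure-Python sketch. The letter sets and the helper fits_alphabet are assumptions chosen for illustration, not Biopython's API. One thing to be aware of: the Bio.Alphabet module was removed in Biopython 1.78, so on recent releases a Seq is created from the sequence string alone.

```python
# Letter sets assumed for illustration: the four unambiguous DNA bases
# and the twenty standard amino acids.
DNA_LETTERS = set("GATC")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")


def fits_alphabet(sequence, letters):
    """Return True if every character of the sequence belongs to the letter set."""
    return set(sequence.upper()).issubset(letters)


print(fits_alphabet("ATGCGTACGATACATACAGCGT", DNA_LETTERS))
print(fits_alphabet("MRTIHTA", DNA_LETTERS))
print(fits_alphabet("MRTIHTA", PROTEIN_LETTERS))
print(fits_alphabet("GATTACA", PROTEIN_LETTERS))
```

A string such as "GATTACA" passes both the DNA and the protein check, so the letters alone cannot tell you what a sequence means. That is the gap the alphabet attribute was designed to fill.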
Some of the biologically relevant operations that can be performed using Biopython methods are demonstrated below.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq("ATGCGTACGATACATACAGCGT", IUPAC.unambiguous_dna)
>>> # calculating GC content of the DNA sequence
>>> GC(my_seq)
45.45454545454545
>>> # complement of DNA sequence
>>> my_seq.complement()
Seq('TACGCATGCTATGTATGTCGCA', IUPACUnambiguousDNA())
>>> # reverse complement of DNA
>>> my_seq.reverse_complement()
Seq('ACGCTGTATGTATCGTACGCAT', IUPACUnambiguousDNA())
>>> # simulating biological DNA strands
>>> coding_dna = my_seq
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('ACGCTGTATGTATCGTACGCAT', IUPACUnambiguousDNA())
>>> # transcription process (DNA -> mRNA)
>>> messenger_rna = template_dna.reverse_complement().transcribe()
>>> messenger_rna
Seq('AUGCGUACGAUACAUACAGCGU', IUPACUnambiguousRNA())
>>> # translation process (mRNA -> Protein)
>>> protein = messenger_rna.translate()
>>> protein
Seq('MRTIHTA', IUPACProtein())
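If you are curious what these methods compute, the same results can be reproduced with plain Python strings. This is a dependency-free sketch for cross-checking the session above, not Biopython's implementation; all function names are made up for illustration, and the translation uses the standard genetic code and simply ignores a trailing incomplete codon.

```python
def gc_content(dna):
    """Percentage of G and C letters in a DNA string."""
    return 100.0 * sum(dna.count(base) for base in "GC") / len(dna)


# Pairing rules for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Swap every base for its pairing partner."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the strand and read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(coding_dna):
    """mRNA matching the coding strand: every T is replaced by U."""
    return coding_dna.replace("T", "U")


# Standard genetic code, with codons ordered by the bases U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate complete codons; a trailing partial codon is ignored."""
    usable = len(mrna) - len(mrna) % 3
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))


my_seq = "ATGCGTACGATACATACAGCGT"
print(gc_content(my_seq))
print(complement(my_seq))
print(reverse_complement(my_seq))
messenger_rna = transcribe(my_seq)
print(messenger_rna)
print(translate(messenger_rna))
```

The sequence has 22 letters, so the last letter does not form a complete codon and the protein has seven residues. With Biopython you get all of this from the Seq object, along with support for other codon tables.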
Apart from these, Biopython offers many other features. So, if you are interested in bioinformatics and love to program in Python, Biopython is the perfect choice for you!
To learn more, check out the official documentation.