Using PDBModel

The PDBModel class is the hearth of Biskit. PDBModel objects encapsulate coordinates and atom information. They allow you to modify and re-arrange atoms, chains or residues and are the hub for integrating results from external calculations with data from the original PDB file.

Note:

Between Biskit release 2.0.1 and 2.1, we performed some major refacturing of Biskit and, especially, PDBModel. This document gives some detailed examples for working with the new version of PDBModel.

1. Data structure
2. Creating a PDBModel
3. Saving a PDBModel
4. Working with coordinates
5. Working with profiles
6. Working with atoms
7. Manipulating PDBModels
8. Comparing PDBModels
9. Surfaces, Energies, Conservations...
10. Relation to other Biskit classes

1. Data structure

Coordinates are stored in the Numpy array 'xyz'; the additional atom infos from the PDB (name, residue_name, and many more) are stored in a PDBProfiles instance 'atoms' which can be used to also associate arbitrary other data to the atoms. Moreover, a PDBProfiles instance 'residues' can hold data associated to residues. A normal dictionary 'info' accepts any information about the whole model.

Note:

Previous versions of Biskit (>=2.0.1) had a somewhat different layout. The new PDBModel version is backward compatible to previous versions. Pickles of older PDBModel instances are converted on the fly. However, you need to have the old Numeric library installed (in parallel to numpy), if you want to unpickle older Biskit objects.

See also: PDBModel API reference (may be outdated)

Note: In the example code below 'm' stands for a PDBModel object. The Biskit.tools module is imported as 'T' and the numpy module is imported as 'N'.

2. Creating a PDBModel

A PDBModel can be created from a PDB file or from a previously saved binary (pickled) PDBModel object.

from a PDB file:

>>> m = PDBModel('~/biskit/docs/tutorial1/1R7M.pdb')
>>> m
[PDBModel 1R7M  5883 atoms,  749 residues,  8 chains]

from a pickled PDBModel:

>>> m = PDBModel('~/biskit/docs/tutorial1/rec.model')

## or alternatively:
>>> import Biskit.tools as T
>>> m = T.Load('~/biskit/docs/tutorial1/rec.model')

from another PDBModel in memory:
```
>>> m_child = PDBModel( m )
```

The file or object passed to the constructor becomes the source of the PDBModel. See the section on saving PDBModels below!

Quick inspection

There are some useful methods for a quick inspection of your structure:

PDBModel.report() gives you an overview over the atoms, residues and profiles of your model:

>>> m.report()

[PDBModel 1R7M  5883 atoms,  749 residues,  8 chains]
    * chain 0: NIKKNQVMNLGPNSKLLKEYKSQLIELNIEQFEAGIGLILGDAYIRSRDEGKTYCMQF..
    * chain 1: NIKKNQVMNLGPNSKLLKEYKSQLIELNIEQFEAGIGLILGDAYIRSRDEGKTYCMQF..
    * chain 2: acgctagggataacagggtaatac
    * chain 3: gtattaccctgttatccctagcgt
    * chain 4: acgctagggataacagggtaatac
    * chain 5: gtattaccctgttatccctagcgt
    * chain 6: ++++++
    * chain 7: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~..
 source: LocalPath[ {/data/raik/py/biskit|$projectRoot}/docs/tutorial1/1R7M.pdb ]
 15 atom profiles:    ['name', 'residue_number', 'type', 'alternate', 'name_o..
  0 residue profiles: []
  1 info records:     ['date']

PDBModel.sequence() returns the complete residue sequence of a structure as a string.
PDBModel.lenAtoms(), lenResidues, lenChains return the number of atoms, residues and chains, respectively
PDBModel.plot() will generate a quick & dirty 2-D sketch of your structure's backbone(s):

>>> m.plot()

3. Saving a PDBModel

A normal PDB file can be created from the model at any time:

>>> m.writePdb('~/my/modified.pdb')

This method has several options for dealing with TER records and atom names. Type 'print m.writePdb.__doc__' for the full list.

However, writing and reading PDB files is rather slow and additional information in .residues, .atoms, or .info is lost. We therefore recommend to pickle PDBModels as python objects to disc:

>>> m.saveAs( '~/my/repickled.model' )

PDBModel.saveAs produces an independent stand-alone PDBModel on disc, which can also be copied around between different machines (python pickles are platform independent).

There is a second, and slightly different way of saving PDBModels:

>>> Biskit.tools.Dump( m, '~/my/repickled.model')

However, if you use the Dump method in Bisit.tools, the memory-saving mechanism of Biskit is kicking in: The model remembers its source, which can be a PDB file or a pickled PDBModel on disc, and keeps track of whether xyz, or any profiles have been changed with respect to this source. Simple pickling with Dump will automatically trigger the PDBModel.slim() method and will check and reset the xyz array or any profiles to 'None' if they have not been modified since being read from disc.

4. Working with coordinates

The coordinates of your structure are stored in the variable PDBModel. xyz. xyz is a two-dimensional N x 3 numpy array where the list of atoms forms the first dimension and the x, y, and z coordinates the second. For example:

>>> m = T.Load( 'biskit/docs/tutorial1/rec.model' )
>>> m.xyz
array([[ 65.3809967 ,  37.31000137,  44.41699982],
       [ 64.07700348,  36.67399979,  44.08000183],
       [ 63.91699982,  36.47600174,  42.58100128],
       ...,
       [ 17.96199989,  26.05699921,  37.00999832],
       [ 18.97500038,  25.22699928,  37.73899841],
       [ 20.17900085,  26.12199974,  37.70299911]], dtype=float32)

That means you can access the three (x, y, z) coordinates of the first atom with:

>>> m.xyz[0]
array([ 65.3809967 ,  37.31000137,  44.41699982], dtype=float32)

... and the single z coordinate of the last atom with:

>>> m.xyz[-1][2]
37.702999115

The next example returns the z coordinate of all atoms as a one-dimensional array:

>>> m.xyz[:, 2]
array([ 44.41699982,  44.08000183,  42.58100128, ...,  37.00999832,
        37.73899841,  37.70299911], dtype=float32)

To the outside Python world, nympy arrays behave like lists of lists (of lists...). That means they have a len and can be iterated:

>>> len( m.xyz )
1854
>>> for x,y,z in m.xyz:
        if x > 99 or y > 99 or z > 99:
            print 'Warning: coordinate overflow!'

However, iterating over a numpy array is a bit like mounting a horse in front of a Porsche. Numpy arrays are great for high-speed number-crunching and can be very efficiently manipulated without ever typing a single 'for'. So the above would become:

>>> N.any( N.greater( m.xyz, 99 ) )
False

In this case, the difference (apart from looking better) is a speedup by a factor of 740, which is a pretty typical result. You can directly add, substract, and multiply arrays with '+', '-', and '*' operators. There are also numpy functions for summing, concatenating, clipping, sorting of arrays and hundreds of methods for more advanced mathematical applications. See: http://numpy.scipy.org/numpydoc/numpy-9.html

One last note: Rather than assigning a modified array back to PDBModel.xyz, it is saver to use the PDBModel.setXyz() method:

>> m.setXyz( m.xyz + (1,0,0) )

Otherwise, your changes may be lost when the model is pickled to disc (see the description of PDBModel.*source* above). Ideally, PDBModel.xyz should be implemented as a property and do this automatically but that would involve a conversion to new-style Python classes and break our existing pickles. So we have yet shied away from it. Also, if you ever happen to stumble across a PDBModel.xyz that is 'None', this probably means your model shares its coordinates with another source. Try PDBModel.getXyz() instead, which will fetch the coordinates for you.

5. Working with profiles

See API documentation of ProfileCollection and CrossView!

All atom- or residue-related data that are not coordinates are stored in two collections of so-called profiles. A ProfileCollection is basically a dictionary of arrays. There are two ProfileCollections in each PDBModel: PDBModel .atoms and PDBModel .residues.

5.1 Atom profiles

Let's focus on .atoms first:

>>> m.atoms
  ProfileCollection: 16 profiles of length 1854
name
{'isarray': False, 'default': None, 'changed': 0, 'version': '14/10/2007 ProfileCollection
 $Revision: 531 $; PDBModel $Revision: 556 $'}
      ['N', 'CA', 'C', 'O', 'CB', 'CG', 'OD1', 'ND2'...
residue_number
{'isarray': True, 'default': 0, 'changed': 0, 'version': '14/10/2007 ProfileCollection
 $Revision: 531 $; PDBModel $Revision: 556 $'}
      [  3   3   3 ..., 225 225 225]
...

The real output is much longer. By default PDBModel.atoms comes with 15 entries that are taken straight from the PDB file:

>>> m.atoms.keys()
['name', 'residue_number', 'insertion_code', 'alternate', 'name_original', 'chain_id',
 'occupancy', 'element', 'segment_id', 'charge', 'residue_name', 'after_ter',
 'serial_number', 'temperature_factor', 'type' ]

That means:

>>> m.atoms['element']

...will give you a list of length m.lenAtoms() with the Element character of each atom. For historic reasons, profiles with non-numeric values are stored as Python lists rather than as numpy arrays. This will change in some future Biskit version.

ProfileCollection (the mother class of 'atoms' and 'residues') offers many convenient methods to access, modify, iterate and even plot profiles.

For convenience, we have defined some shortcuts so that many profile methods can be directly accessed from the model:

>>> m['name']            ## fetch the atom profile 'name'
['H1', 'H2', 'N', 'H3', 'CA', 'CB', 'C', 'O', 'N', 'H', '...

>>> m['residue_name']   ## fetch the **atom** profile 'residue_name'
['ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'GLN', ...

Note that 'residue_name' still has one value per atom rather than per residue because it comes from the parsing of the PDB file. You can assign arbitrary values to the atoms of your structure by creating a new profile:

>>> m.atoms['running#'] = range( len( m ) )
>>> m['running#']
array([   0,    1,    2, ..., 1851, 1852, 1853], dtype=int32)

The long version of this command allows you to also assign meta-data to your new profile:

>>> m.atoms.set( 'running#', range( len(m) ), comment='running number for each atom' )

These meta-data can then be accessed like a dictionary within the infos dictionary:

>>> m.atoms.info['running#']
{'isarray': True, 'default': 0, 'version': '27/07/2008 ProfileCollection $Revision: 573 $;
 PDBModel $Revision: 613 $', 'comment': 'running number for each atom', 'changed': 1}

... or, more directly:

>>> m.atoms['running#', 'comment']
'running number for each atom'

5.2 Residue profiles

Now let's turn to the residue profiles. The PDBModel.*residues* ProfileCollection is initially empty. That may be a bit counter-intuitive when it comes to profiles like 'residue_name' which are initially read atom-by-atom from the PDB and thus end up in PDBModel.atoms rather than in PDBModel.residues. We haven't yet found a consistent solution for this that doesn't raise eyebrows in other situations -- suggestions are welcome.

If you want to have a real residue profile with the residue names, you can create a copy from the atom profile:

>>> m['resname'] = m.atom2resProfile('residue_name')

PDBModel.atom2resProfile will, by default, simply put the atom profile value of each residue's first atom into the residue profile. However, you can provide an alternative 'averaging' function to atom2resProfile. The following example creates two residue profiles that contain each residue's average or maximum temperature factor, respectively:

>>> m['temp_avg'] = m.atom2resProfile( 'temperature_factor', N.average )
>>> m['temp_max'] = m.atom2resProfile( 'temperature_factor', max )

The given function needs to accept a list of values as argument -- one for each atom of a given residue -- and return a single value for the whole residue. That means you will usually not need to write this function yourself but can use built-ins like min() or max() or some numpy functions like sum(), average(), etc.

Now let's do some simple calculation. We want to create a profile that lists the distance of each atom from the molecule's center of mass. This is how it works:

>>> import numpy as N
>>> diff = m.xyz - m.centerOfMass()
>>> dist = N.sqrt( N.sum( diff**2, axis=1 ) )
>>> m['from_center'] = dist

5.3 incomplete profiles

Sometimes you don't actually have a value for each atom or residue. So suppose, you extracted only C-alpha distances from your dist profile above. You can still create an atom profile from this if you also provide a mask designating the positions for which you actually have values:

>>> dist = N.compress( dist, m.maskCA() )  ## extract CA distances
>>> m.atoms.set('ca_from_center', dist, mask=m.maskCA(),
                    default=-1, comment='CA distances' )

The new profile will have distances associated to CA atoms and -1 in all other positions.

Note that we now use some optional parameters of ProfileCollection.set() (type 'print ProfileCollection.set.__doc__' for the full documentation). The parameter mask accepts a numpy array (or list) of m.lenAtoms() length in which all positions for which we have a value are marked with 'True' and all missing positions with 'False'. The number of 'True' in this mask must correspond to the length of the given profile! PDBModel has several methods to generate useful atom masks (think of them as atom selections). PDBModel.maskCA() generates a mask where only C-alpha positions are labelled 'True'.

6. Working with atoms

In the above sections, we have worked with lists of atom or residue values. PDBModel, moreover, also offers a more atom-centric view on the same data. You can fetch a virtual dictionary with the values of all atom profiles for one atom:

>>> m[0]
CrossView{'name': 'H1', 'residue_number': 1, 'insertion_code': '',
...

Which is the same as:

>>> m.atoms[0]

Changes to this dictionary change the underlying profiles at the position of this atom. Vice versa, changes to the profiles are mirrored in the 'dictionary'.

Iteration over a model m will give you one CrossView after the other:

>>> for atom in m:
...    if atom['element']=='C' and atom['temperature_factor'] > 70:
...        atom['charge'] = 1.

However, working with CrossViews is usually much slower than directly working with the underlying profiles. So the above example would be much more efficiently done like this:

>>> m['charge'] = N.greater( m['temperature_factor'], 70 ) * m.maskFrom('element', 'C')

See the description of CrossView for a detailed discussion of performance issues.

Iterating over PDBModel.residues will give you the CrossViews for each residue. Although you will need to create some residue profiles, first.

7. Manipulating PDBModels

Methods are provided to extract, remove, re-arrange, or concatenate atoms, chains or residues in ways similar to the handling of Numeric arrays. The major methods for this are take, compress, concat, takeResidues* and takeChains. All of these methods operate on coordinates, atom- and residue profiles simultaneously so that everything always stays synchronized.

Note: the following section has a bit of a brutal approach to the unprepared reader... stay with us until we have prepared a more gentle introduction...

Atoms, residues and chains can be selected by their indices (positions) or with masks (of 1 and 0 or True and False). And there are many methods to produce such masks. The following example extracts all CA atoms from a PDBModel and puts them into a new model 'm_ca':

 >>> mask_ca = m.maskFrom('name', 'CA')
 >>> mask_ca
 array([False,  True, False, ..., False, False, False], dtype=bool)

 >>> m_ca = m.compress( mask_ca )
 >>> m_ca
[PDBModel 1R7M   452 atoms,  452 residues,  3 chains]

PDBModel has a shortcut for this:

>>> m_ca = m.compress( m.maskCA() )

Masks are normal Numpy arrays and can be combined with logical_and, logical_or (simple * and + does the same) and other methods. Masks calculated from atoms can be converted to residue or chain masks and back with atom2resMask, atom2chainMask, res2atomMask, etc. methods.

Many PDBModel methods (called take-something) operate with indices rather than masks. Indices are lists of (atom, residue, or chain) positions. A mask can easily be converted into a list of indices with the numpy.nonzero() function. So:

>>> m_ca = m.compress( mask_ca )

Is equivalent to:

 >>> i_ca = N.nonzero( mask_ca )[0]
 >>> i_ca
 array([   1,    9,   17,   26,   35,   43,   52,   59,   67,   75,   83,
          87,   94,  102,  108,  117,  125,  133,  142,  151,  163,  172,
        ...
        3657, 3665, 3673, 3685, 3694, 3702, 5676, 5677, 5678, 5679, 5680,
        5681])

 >>> m_ca = m.take( i_ca )
 >>> m_ca
[PDBModel 1R7M   452 atoms,  452 residues,  3 chains]

Like masks, indices can be converted between atom, residue, and chain level with atom2resIndices, res2atomIndices, and so on. The following example builds a new model m2 from only the first and last chain of an existing model m:

>>> chain1 = m.takeChains( [0] )
>>> chain2 = m.takeChains( [-1] )
>>> m12 = chain1.concat( chain2 )

However more efficient and easier would be:

>>> m2 = m.takeChains( [0,-1] )

Which is the same as:

>>> m2 = m.take( m.chain2atomIndices( [0,-1] ) )

numpy.put() is used to convert a list of indices into a mask:

>>> cacb_pos = m.indicesFrom('name', ['CA','CB'] )
>>> mask = N.zeros( len(m) )
>>> N.put( mask, cacb_pos, 1 )
>>>   ## Test:
>>> N.all( mask == N.nonzero( cacb_pos ) )
True

8. Comparing PDBModels

You can compare and normalize the atom content of similar models with PDBModel.*compareAtoms*, calculate an (iterative) RMSD with PDBModel.*rms*, and superposition two equivalent (PDBModel.*fit*) or almost equivalent models (PDBModel.*magicFit*) -- again with or without iterative removal of variable regions.

Please consult the doc-string (__doc__) of these methods!

9. Surfaces, Energies, Conservations...

The class PDBDope adds surfaces, conservation profiles, energies and other information from external programs to PDBModel profiles:

>>> d = PDBDope( m )
>>> d.addConservation()  ## needs Hmmer and Pfam installed
>>> d.addSurfaceRacer()  ## access. and mol. surface from surfrace
>>> d.addDensity()       ## simple atomic density
>>> d.addFoldX()         ## puts fold-X scores into m.info
>>> d.addSecondaryStructure()  ## from DSSP

10. Relation to other Biskit classes

A Dock.Complex can be created from two PDBModels and offers methods for contact evaluation, scoring and other things:

>>> c = Complex( rec_model, lig_model )

The receptor and ligand PDBModels of a complex are accessible as:

>>> m1 = c.rec()
>>> m2 = c.lig()

A Trajectory adds a time dimension to a PDBModel and can be created from a list of PDBModels, PDBs or an Amber crd file. The underlying PDBModel of a trajectory t is accessed as:

>>> m = t.ref

And a PDBModel for a certain time point can be created by indexing:

>>> model_10 = t[10]

Personal tools