Homology modelling

How to use the Biskit.Mod homology modelling pipeline

What is Biskit/Mod?
Quickstart -- short instructions
Results
Detailed Step-by-step instructions
Contributions (The Making of)

What is Biskit/Mod?

Biskit/Mod is a python module for fully automatic or semi/automatic homology modelling.

There are two ways to use this module:

biskit/scripts/Mod/: interactive scripts for all steps of homology modelling
biskit/Biskit/Mod/: python library

Biskit/Mod implements homology modeling in roughly 4 steps:

search of homologous sequences and 3D structures (Blast)
multiple alignments (T-Coffee)
building of the models (Modeller)
cross validation by re-building the available templates

The pipeline works fully automatic but the different steps are as much as possible decoupled. This allows you to intervene manually in between the steps. For example, you can first manually change the template selection, repeat alignment and modeling, then optimize the alignment, and repeat only the modeling step.

The typical usage pattern would thus be:

run through all steps fully automatic once
inspect final model, trace problems to intermediate steps (selected sequences, selected templates, alignments, modeller parameters, ...)
modify intermediate results if necessary
re-run the subsequent steps

Quickstart -- short instructions

1) Preparing the project directory

Create one directory for each project of homology modeling. An example project is included in biskit: biskit/test/Mod/project/

Starting point (and only input) is a fasta file with the primary sequence of the target protein (60 characters per line). The default name for this file is target.fasta. If you use default file names and default parameters, you can run the following scripts without any options -- as long as you are in the project folder. Refer to the full documentation for any non-standard options, file names, etc. Advanced options of each script are available with: script.py -help

Ultra-quick instructions -- for a standard run w/o cross-validation, call in this order:

search_sequences.py
search_templates.py
clean_templates.py
align.py
model.py

2) Search homologous sequences

>>> search_sequences.py

3) Search homologous 3D structures (Templates)

>>> search_templates.py

4) Clean template PDB files

>>> clean_templates.py

5) Optional: Set up cross-validation

>>> setup_validation.py

6a) Built multiple alignment (non-parallel)

For main project:

>>> align.py

Repeat in each cross-validation folder (project_root/validation/????/)

>>> align.py

6b) Or built multiple alignment both for the main project and for the cross-validation (parallelised)

>>> pvm  ## start pvm if it is not running
>>> align_parallel.py

7a) Build models (non-parallel)

For main project:

>>> model.py

Repeat in each cross-validation folder (project_root/validation/????/)

>>> model.py

7b) Or build models both for the main project and for the cross-validation (parallelised)

>>> pvm  ## start pvm if it is not running
>>> model_parallel.py

8) Optional: Evaluation of cross validation

Example: benchmark.py -d list-of-folders

>>>  benchmark.py -d  ~/biskit/test/Mod/project/validation/*

Example: analyse.py -d list-of-folders

>>>  analyse.py -d  ~/biskit/test/Mod/project/

Results

project/analyse/global_results.out

contains for each template, their PDB code, the all atom rmsd (rmsd_aa) without iterative fitting, the c-alpha only rmsd (rmsd_ca) without iterative fitting, the rmsd_aa with the percentage of outliers during the iterative superposition, the rmsd_ca with the percentage of outliers also. Following the percentage of identities (mean) of the target with its templates and finally the number of templates which served to remodel each template.

project/analyse/local_results.out

contains for each residue of the target the rmsd profile given by each template using the multiple alignment. Then, a score, which is the mean rmsd. This score serves as final evaluation for the models of our target.

project/modeller/final_???????.pdb

PDBs files with the rmsd score from cross-validation put into the B-factor column. The user could visualize easily the result with a classic software as VMD or Pymol using a gradient of color for this score.

Detailed Step-by-step instructions

A) Prepare a project folder for an homology modeling

Put a file with the target sequence into an empty directory.

B) Search for homolgous sequences: search_sequences.py

Syntax:  search_sequences.py [-q |target.fasta| -o |outFolder| -log |logFile|
               -db |database| -e |e-value-cutoff| -aln |n_alignments|
               -psi
               -... additional options for blastall (see SequenceSearcher.py)]

Options: -q      fasta file with query sequence (default: ./target.fasta)
         -o      output folder for results (default: .)
         -log    log file (default: STDOUT)
         -db     sequence data base
         -e      E-value cutoff for sequence search
         -aln    number of alignments to be returned
         -simcut similarity threshold for blastclust (score < 3 or % identity)
         -simlen length threshold for clustering
         -ncpu   number of CPUs for clustering
         -psi    use PSI Blast instead, experimental!!

Result: A folder "sequences" in the project folder containing two files:

sequences/all.fasta: the list of all the homologous sequences found
sequences/nr.fasta: a list of non redundant homologous sequences (1/cluster)

You can edit the search result by removing (or adding) sequences from nr.fasta.

C) Search of homolgous 3D structures (Templates): search_templates.py

Syntax: search_templates.py [-q |target.fasta| -o |outFolder| -log |logFile|
                            -db |database| -e |e-value-cutoff|
                            -aln |n_alignments| -psi
                             -... additional options for blastall
                             (see SequenceSearcher.py)]

Options: -q    fasta file with query sequence (default: ./target.fasta)
         -o      output folder for results (default: .)
         -log    log file (default: STDOUT)
         -db     sequence data base
         -e      E-value cutoff for sequence search
         -aln    number of alignments to be returned
         -simcut similarity threshold for blastclust (score < 3 or % identity)
         -simlen length threshold for clustering
         -ncpu   number of CPUs for clustering
         -psi    use PSI Blast instead, experimental (currently not working)!!

Result: A folder "templates" in the project folder containing:

templates/all/: directory with all homologous structures (PDB format).
templates/all.fasta: the sequences of all template structures
templates/blast.out: the blast alignment
templates/cluster_result.out: clustering of all template sequences
templates/nr.fasta: non-redundant template sequences (1/cluster)
templates/templates.fasta: non-redundant template sequences taken directly from the PDB-files
templates/nr/: directory with the non-redundant template PDBs.
templates/nr/chain_index.txt: the PDBs (plus chain IDs) which will be cleaned for modeller and T-Coffee in the next step.

Edit chain_index.txt and templates.fasta to remove or add templates or force using different chains. If adding templates, make sure the new entry in chain_index.txt points to an existing file (best put into templates/nr/).

D) Cleaning of the PDB files: clean_templates.py

Needs two input files (see Search of homologous structures):

templates/nr/*.pdb
templates/nr/chain_index.txt

This step prepares the PDB files in the directory templates/nr for the multiple alignement step (T-Coffee) and the building of the models (Modeller).

The script changes non standard residues to standard ones, removes hydrogen atoms, removes atoms with multiple configurations etc...

Syntax:  clean_templates.py [-o |output_folder| -i |chainIndex| -log |logFile|]

Options: -o       output folder for results (default: .)
         -i       chain index file for templates
         -log   log file (default: STDOUT)

Result: A folder "t_coffee" in the folder /templates containing:

templates/t_coffee/*.alpha: CA input files for the structure alignment
templates/modeller/*.pdb: input PDBs for Modeller

E) Setting up the cross-validation: setup_validation.py

This step creates a new homology modelling project for each template structure. The aim is to re-model all known template structures (pretending we don't know their structure). setup_validation.py links in the cleaned structures from the previous steps.

Syntax:  setup_validation.py [ -o |project folder(s)| ]

Options:  -o  one or several project folders (default: current)

Result: A folder "validation" containing:

validation/|code|_pdb: a sub-folder for each template structure

Each sub-folder will contain (taking validation/1EX7 as example):
- reference.pdb: known structure of 1EX7
- target.fasta: the primary sequence of 1EX7
- t_coffee_template_files: the PDB files used as templates for re-building 1EX7
- sequences: link to the sequences directory
- templates/modeller: see D
- templates/t_coffee: see D
- templates/templates.fasta: see C

F) Multiple alignment with or without parallelisation: align.py and align_parallel.py

Build multiple alignments with T-Coffee. (A) one-by-one with by running align.py in the project folder and in each single cross-validation sub-folder. OR (B) by running the parallised align_parallel.py in the project folder.

A - align.py

None parallellized alignment:

Syntax: align.py [ -o |outFolder| -log |logFile| -h |host_computer| ]

Options: -o      output folder for results (default: .)
         -log    log file (default: STDOUT)
         -h      different computer for calculation  (default: local computer)
                -> must be accessible w/o password via ssh, check!

Troubleshooting:

If there are more than approximately 50 sequences t_coffe will eat up all the memory and the job will not finish. (This is taken care of by Aligner.py). This should be fixed in more recent versions of T-Coffee (v > 3.2) where T-Coffee, according to the manual "switches to a heuristic mode, named DPA, where DPA stands for Double Progressive Alignment."
If there is only one template structure, step 2 of T-coffee will not work. Skipp the structural alignment if only one template structure is provided!
In quite some cases the sequence retrieved from the nr PDB sequence database is different from the sequence extracted from the coordinates in the pdb-file. This will sometimes cause t-coffee to terminate with an error (2 sequences with the same name but with different sequences). Temporary solution: Choose another structure from the same cluster as the troublemaker.

Results: A folder "t_coffee" containing:

t_coffee/final.phylip: Phylip format.
t_coffee/final.pir_aln: Alignment in pir format.
t_coffee/final.score_html: alignment score per residue (HTML format)
t_coffee/t_coffee.inp: the used T-Coffee commands
t_coffee/t_coffee_log1-4: the T-Coffee log files of the 4 alignment steps

For more details, see the T-Coffee documentation.

You can edit final.pir_aln before running the model-building step.

B - align_parallel.py

Works in the same way but executes simultaneously several mutliple alignments from different project folders. This script needs PVM running.:

Syntax: align_parallel.py [ -d |list of folders| -h |host| -pdb |pdbFolder|
                       -ft |fastaTemplates| -fs |fastaSequences|
                       -fta |fastaTarget| -fe |ferror|]

Options: -d            [str], list of project directories [/validation/*]
         -h            int, number of hosts to be used [10]
         -a            first add hosts to pvm [yes]
         -pdb  str, pdbFolder for the pdb *.alpha
         -ft           str, path to 'templates.fasta'
         -fs           str, path to 'nr.fasta'
         -fta          str, path to 'target.fasta'
         -fe           str, path to the error file

Results: see Results of align.py

G) Building the models: model.py and model_parallel.py

As before, the Modeller step can be run in two different ways: (A) Either one-by-one non-parallel (model.py) or (B) parallelised (model_parallel.py).

A - model.py

None parallellized modelling:

Syntax:  model.py [ -o |outFolder| -log |logFile| -h |host_computer| ]

Options:  -o    project folder (default: .)
          -log  log file (default: STDOUT)
          -h    host computer for calculation  (default: local computer)
                -> must be accessible w/o password via ssh, check!

Result: A directory "modeller" containing: (For more details, see the Modeller 8v1 documentation.)

modeller/modeller.top or model_mult.py: input file for Modeller
modeller/target.B999900??.pdb: raw output PDB files, 10 models by default.
modeller/modeller.log: modeller log file
modeller/Modeller_Score.out: all the modeller scores (Objective Function) for each model (in increasing order).
modeller/model_??.pdb: output PDB files (10 by default) ordered by their modeller score.
modeller/PDBModels.list: pickled python object (ModelList) with same PDBs as PDBModel objects (in same order).
identities_cov.out: percent identity of sequences between target and the different templates (directly in the project folder).

B - model_parallel.py

Works in the same way but executes simultaneously several building step from different project folders. This script needs PVM running.:

Syntax:  model_parallel.py -d |list of folders| -h |host|
                         [-fta |fastaTarget| -pir |f_pir|
                          -tf |template_folder| -sm |starting_model|
                          -em |ending_model| -fe |ferror|]

Options:   -d   [str], list of project directory (full path)
           -h   int, number of hosts to be used
           -fta  str, path to find 'target.fasta'
           -pir  str, alignment filename
           -tf  str, directories for input atom files
           -sm   int, index of the first model
           -em   int, index of the last model
           -fe   str, filename to output errors from the Slave

Results: see model.py

H) Evaluation step by cross validation: benchmark.py

The first evaluation step is to determine how well we can reproduce a template structure by modeling them from the remaining templates. As the structure of each template is available we can then compare them with the dummy structures that we produce using Biskit/Mod.

First execute benchmark.py in each sub-directory of the validation folder.:

Syntax:   benchmark.py -d |list of folders|

Options:  -d   [str], list of project directory

To run all at once:

>>> benchmark.py -d validation/

Result: A folder validation/benchmark containing:

validation/????/benchmark/Fitted_??.pdb: Each dummy model superimposed on its known structure. The fitted structure is called fitted_00.pdb refering to the dummy model number 00 etc. Currently, the iterative superimposition method implemented in biskit/Biskit/rmsFit.py is used (max. 10 iterations).
validation/????/benchmark/rmsd_aa.out: gives the all-atom rmsd of the different dummy models. (1) without iterative fitting, (2) with iterative fitting and (3) the percentage of atoms that has been removed during the iterative fitting.
validation/????/benchmark/rmsd_ca.out: same as above, but only for C-alpha atoms
validation/????/benchmark/rmsd_res_??: gives the C-alpha rmsd for each residue in the current dummy model.
validation/????/benchmark/PDBModels.list: pickled PYTHON list of PDBModels. Same as modeller/PDBModels.list but now each model contains the benchmark information in the atom and residue profiles: 'rmsd_aa', 'rmsd_ca', 'rmsd_res'. See PDBModel.profile()!

I) Evaluate the dummy models: analyse.py

Here, the aim is to give an accurate way to evaluate the models created by the module. Three sorts of results are provided. The first, global, the second local, and the last one is a simple visualisation of the result directly on the 3D structure of the model (score RMSD).:

Syntax:  analyse.py -d |list of folders| [ -s ]

Options:  -d   [str], list of project directories
          -s   show 'final.pdb' in PyMol

Result: A folder "analyse" containing:

analyse/global_results.out: for each template, PDB code, the heavy atom rmsd(rmsd_aa) with and without iterative fitting, the C-alpha rmsd (rmsd_ca) with and without iterative fitting, the rmsd_aa with the percentage of outliers during the iterative fit, Moreover, the percentage of identities (mean) of the target with its templates, the best Modeller score and finally the number of templates which served to remodel each template.
analyse/local_results.out: contains for each residue of the target the rmsd profile given by each template using the multiple alignment. Then, a score, which is the mean rmsd. This score will served as final evaluation for the models of our target.
modeller/final.pdb: the "best" model where the B-factor (Temperature factor) are replaced by the score rmsd for visualisation, e.g., in VMD or Pymol.

Contributions (The Making of)

Michael Nilges worked out the alignment and modelling strategy (including T-Coffee parameters, Modeller input scripts etc).

R.G. & J.L. translated this strategy into the Biskit/Mod package and tested/bugfixed it all to full automation (well, more or less).

David Giganti bravely used Biskit/Mod for the first real-world application and had the idea of re-modelling the template structures for cross-validation.

Olivier Perin spent his very productive internship on implementing the cross-validation in Biskit/Mod and parallelising some of the steps (supervised by R.G.).

The parallelisation method used throughout Biskit (TrackingJobMaster/Slave) is based on code from Wolfgang Rieping. The iterative RMSD-fitting (rmsFit.py) was implemented by Michael Habeck

Personal tools