Computational Structural Biology group focusing on dissecting, understanding and predicting biomolecular interactions at the molecular level.

Supported by:

# Molecular Dynamics Simulation of the p53 N-terminal peptide

## General Overview

This tutorial introduces Molecular Dynamics (MD) simulations of proteins. The simulation protocol can be used as a starting point for the investigation of protein dynamics, provided your system does not contain non-standard groups. By the end of this tutorial, you should know the steps involved in setting up, running, and analyzing a simulation, including critically assessing the choices made at the different steps.

## Use of virtual machines (VMs)

For this module of the course we will be using NMRbox. NMRbox offers cloud-based virtual machines for executing various biomolecular software that can complement NMR (Nuclear Magnetic Resonance). NMRbox users can choose from 235 software packages that focus on research topics as metabolomics, molecular dynamics, structure, intrinsically disordered proteins or binding. One can search through all available packages on https://nmrbox.org/software.

### Register

To use virtual machines through NMRbox, one needs to register, preferably with their institutional account here. Since the registration has to be manually validated and it can take up to two business days, we strongly encourage students to do so before the course starts. After a successful validation you will receive an e-mail with your NMRbox username and password that you will be using while accessing your virtual machine.

### Accessing NMRbox

To run the virtual machine on a local computer, one needs to install VNCviewer. With the RealVNC client connects your computer to the NMRbox servers with a virtual desktop - graphical interface. More information about the VNC viewer is in the FAQ of NMRbox.

To choose a virtual machine, first log into the user dashboard https://nmrbox.org/user-dashboard. Download the zip file with bookmarks for the production NMRbox virtual machines here and extract the zip file. Back in VNCviewer click File -> Import connections and select the folder in which you extracted the content of the downloaded zip file. After importing, you will see the current release virtual machines. You can use any available virtual machine. The user-dashboard provides information on machine capabilities and recent compute load, thus it is clever to choose a less occupied one. Double click on one of the VMs. An Authentication panel appears. Enter your NMRbox username and password. Click on the Remember password box to have VNCviewer save your information. By default, your desktop remains running when you disconnect from it. If you login to your VM repeatedly you will see a screen symbol next to the VM you connected to recently. For more details follow the quick start guide for using NMRbox with VNCviewer here.

If everything runs correctly you should have a window with your virtual desktop open. In the virtual desktop you have an access to the internet with Chromium as browser or use various programs, including Pymol. Thus, you could run all three stages of this course here or transfer data between your local machine and the virtual machine. File transfer to and from the VM is quite straightforward and it is described here: https://nmrbox.org/faqs/file-transfer.

## Preparing the System

The preparation of the system is the heart of the simulation. Neglecting this stage can lead to artifacts or instability during the simulation. Each simulation must be prepared carefully, taking into consideration its purpose and the biological and chemical characteristics of the system under study.

### Selecting an initial structure

The first step is obviously the selection of a starting structure. The aim of this tutorial is to simulate a peptide of the N-terminal sequence of the transactivation domain of p53. The sequence of this peptide is given below, in FASTA format:

>P53_MOUSE
SQETFSGLWKLLPPE


Peptides are often very flexible molecules with short-lived secondary structure elements. Some can even adopt different structures depending on which protein partner they are interacting with, remaining in a disordered state if free in solution. As such, the effort of using an advanced method such as homology modeling for this peptide is very likely unwarranted. Instead, it is possible, and plausible, to generate structures of the peptide in three ideal conformations – helical, sheet, and polyproline-2 – which have been shown to represent the majority of the peptides deposited in the PDB. Generating these structures is a simple matter of manipulating backbone dihedral angles. Pymol has a utility script to do so, written by Robert Campbell.

The instructions shown in this tutorial refer only to the helical peptide, for simplicity. The successful completion of the tutorial requires, however, all three conformations to be simulated.

Generate an ideal structure for the peptide sequence using the build_seq script in PyMol, choose between helix/polypro/beta.

Pay attention when typing the sequence! A missing or swapped amino acid will render your simulation useless! Also carefully inspect the generated object whether it matches your expectations - you might want to adjust the residue numbers.

To change residue numbers within PyMol take a look at the help message of the alter command:

help alter

### Preparing the initial structure

In case of structures downloaded from the RCSB PDB, it is important to ensure that there are no missing atoms, as well as check for the presence of non-standard amino acids and other small ligands. Force fields usually contain parameters for natural amino acids and nucleotides, a few post-translational modifications, water, and ions. Exotic molecules such as pharmaceutical drugs and co-factors often have to be parameterized manually, which is a science on its own. Always judge if the presence of these exotic species is a necessity. In some cases, the ligands can be safely ignored and removed from the structure. As for missing residues and atoms, except hydrogens, it is absolutely necessary to rebuild them before starting a simulation. MODELLER is an excellent program for this purpose. In addition, some crystals diffract at a good enough resolution to distinguish water molecules in the density mesh. Save for very particular cases where these waters are the subject of the study, the best policy is to remove them altogether from the structure. Fortunately, most of these “problematic” molecules appear as hetero-atoms (HETATM) in the PDB file, and can therefore be removed rather easily with a simple sed command:

sed -e ‘/^HETATM/d’ 1XYZ.pdb > 1XYZ_clean.pdb

It is also good practice to run additional quality checks on the structure before starting the simulation. The refinement process in structure determination does not always yield a proper orientation of some side-chains, such as glutamine and asparagine, given the difficulty in distinguishing nitrogen and oxygen atoms in the density mesh. Also, the protonation state of several residues depends on the pH and can influence the protein’s hydrogen bonding network. For crystal structures, the PDB_REDO database contains refined versions of structures deposited in the RCSB PDB, which address some of these problems. Alternatively, there are web servers that allow these and other problems to be detected and corrected, such as WHATIF.

Since the initial structure of the p53 peptide was generated using Pymol and ideal geometries, there is no need to proceed with such checks.

### Structure Conversion and Topology Generation

A molecule is defined not only by the three-dimensional coordinates of its atoms, but also by the description of how these atoms are connected and how they interact with each other. The PDB file, which was generated or downloaded in the previous step, contains only the former. The description of the system in terms of atom types, charges, bonds, etc, is contained in the topology, which is specific to the force field used in the simulation. The choice of the force field must then not be taken lightly. For biomolecular systems, there are few major force fields – e.g. CHARMM, AMBER, GROMOS, OPLS – that have been parameterized to reproduce the properties of biological molecules, namely proteins. This has been, and continues to be, an area of active research since the very first day of molecular dynamics simulations. There are several literature reviews available in Pubmed that assess the quality and appropriateness of each force field and their several versions. Some are well-known for their artifacts, such as a biased propensity for alpha-helical conformations. Here, in this tutorial, we use the AMBER99SB-ILDN force field, which is widely used in sampling and folding simulations and has been shown to reproduce fairly well experimental data (source). Another, more practical, reason behind this choice is the availability of this force field in GROMACS.

Since the simulation takes place in a solvated environment, i.e. a box of water molecules, we have also to choose an appropriate solvent model. The model is simply an addition to the force field containing the properties of the water molecule, parameterized to reproduce specific properties such as density and freezing and vaporization temperatures. As such, particular water models tend to be tied to specific force fields. Due to the difficulties of reproducing the properties of water computationally - yes, even for such a simple molecule! - some models represent water with more than 3 atoms, using additional pseudo-particles to improve characteristics such as its electrostatic distribution. The water model suggested to use with the AMBER force field, and which we will use in this simulation, is the TIP3P model (for Transferable Interaction Potential with 3 Points), which was actually developed by the author of the OPLS force field. Really transferable!

This choice is usually limited by the force field, unless there is a specific need for a particular solvent model.

The GROMACS program pdb2gmx takes an initial structure and returns both a topology file (peptide.top) and a new structure (peptide.gro) that adheres to the force field atom naming conventions. To convert the structure and build the topology, pdb2gmx divides the molecule in several blocks, such as amino acids, and uses a force field-specific library of such building blocks to make the necessary conversions. Usually, the matching to the library is done through residue/atom names on each ATOM/HETATM line in the PDB file. If a residue (or atom) is not recognized, the program stops and returns an error.

In case of an error, make sure to read the error message. It often points very clearly to the problem and its solution.

Different force fields define different atom types and/or give different names to the same atom type. While the majority of the heavy atoms, i.e. non-hydrogen, have identical naming across most force fields, hydrogens do not. As such, the flag -ignh indicates that GROMACS should ignore these atoms when reading the structure and (re)generate their coordinates using ideal geometric parameters defined in the force field. Also, the program allows the user to define the status of the termini of the molecule through the -ter flag. Termini can be either charged (e.g. NH3+ and COO-), uncharged (e.g. NH2 and COOH), or capped by an additional chemical group (e.g. N-terminal acetyl and C-terminal amide). This is very important since leaving the termini charged (default) can lead to artificial charge-charge interactions, particular in small molecules. If a peptide is part of a larger structure, then it makes sense to cap the termini in order to neutralize their charge, as it would happen in reality. Read through the output of pdb2gmx and check the choices the program made for histidine protonation states and the resulting charge of the peptide.

The newly generated topology file is also worth some attention. It contains a listing of all the residues and their corresponding atoms, detailing the atom types, masses, and charges. Further, it contains a listing of all the bonds in the molecule, the angles, and the dihedral angles. Note that the topology file does not contain any information on their chemistry. This information is stored in internal parameter libraries that are defined at the very top of the topology file.

Open the peptide.top file in a text editor and browse through it.

; Include forcefield parameters
#include "amber99sb-ildn.ff/forcefield.itp"

[ moleculetype ]
; Name            nrexcl
Protein             3

[ atoms ]
;   nr       type  resnr residue  atom   cgnr     charge       mass  typeB    chargeB      massB
; residue   1 SER rtp NSER q +1.0
1         N3      1    SER      N      1     0.1849      14.01
2          H      1    SER     H1      2     0.1898      1.008
3          H      1    SER     H2      3     0.1898      1.008
4          H      1    SER     H3      4     0.1898      1.008
5         CT      1    SER     CA      5     0.0567      12.01
6         HP      1    SER     HA      6     0.0782      1.008
7         CT      1    SER     CB      7     0.2596      12.01
8         H1      1    SER    HB1      8     0.0273      1.008
9         H1      1    SER    HB2      9     0.0273      1.008
10         OH      1    SER     OG     10    -0.6714         16
11         HO      1    SER     HG     11     0.4239      1.008
12          C      1    SER      C     12     0.6163      12.01
13          O      1    SER      O     13    -0.5722         16   ; qtot 1


SER is in principle a neutral amino acid within a protein sequence. Can you rationalize why in this case the sum of the charges add up to +1?

### Periodic Boundary Conditions

This converted structure includes several atoms, namely hydrogen, that have been added according only to ideal geometric parameters. If generated with Pymol, it also has ideal backbone geometry. If it was otherwise downloaded from the RCSB PDB, the structure is also likely to contain certain chemical aspects (bond lengths, angles, interatomic distances) that are not considered ideal by the force field. In fact, merely changing force fields will cause the definition of ideal to change as well. The first step towards preparing the system is then to remove these “imperfections” as best as possible, which is normally achieved through an energy minimization of the system. This optimization method essentially forces a set of atoms to adhere, as best as possible, to the definitions of the force field. The larger the number of atoms in the system, the harder it is to have all of them to comply ideally with respect to all the definitions. For example, moving two atoms closer to reduce the strain from violating the definitions imposed by the van der Waals forces may cause the strain from the electrostatic term to increase.

Before minimizing the system, a general layout of the simulation setup has to be chosen. In other words, the peptide must be placed somewhere for this minimization to happen. Most modern simulations of proteins and peptides define periodic boundary conditions (PBC), which set a single unit cell that can be stacked infinitely. As a result, an infinite, periodic system is defined that avoids the problem of having hard boundaries (walls) that the molecules can literally bump into. When the protein crosses the wall on the left side, the periodic image to its right enters the current unit cell, maintaining a constant number of atoms in every unit cell. A simpler way to rationalize PBCs is to compare them to the snake game available in old Nokia cell phones. When the head of the snake crosses a boundary of the screen, it re-appears on the diametrically opposed edge.

Schematic representation of the idea of periodic boundary conditions (source).

The choice of the shape of the unit cell is also important, since this will define the volume in which the molecule is simulated. Molecular dynamics simulations are computationally demanding. The more molecules in the system, the more forces need to be calculated at each step. As such, while a cube can be perfectly stacked ad eternum it is not the most efficient shape from a volume standpoint (remember that simulations take place, usually, in a solvated environment!). Approximating the shape to a sphere is ideal, but spheres cannot be stacked. As such, only a few general shapes support the setup of periodic boundary conditions. One of those is the rhombic dodecahedron, which corresponds to the optimal packing of a sphere and is therefore the best choice for a freely rotating molecule such as a peptide or a protein.

Another thing to have in mind when setting up the PBCs is the size of the unit cell. Continuing with the snake analogy, it is not proper to have the snake’s head see its own tail. In other words, the cell must be sufficiently large to allow the molecule to cross the boundaries and still be at a sufficient distance from the next image that no force calculations are made between them. In GROMACS, this setting is defined as a distance from the molecule to the wall of the unit cell. This distance should not be arbitrarily large either, otherwise the box is to large and the simulation becomes computationally inefficient as your purpose is not to simulate water. Take the cutoff used to calculate non-bonded interactions (long range) in the force field as a rule of thumb. The distance to the wall must be larger than this value.

Also important is to consider possible conformational changes. The size of the box should allow for those to occur without introducing period image problems as explained above.

As with pdb2gmx, the GROMACS program editconf generates a sizable output that contains, for example, the volume and dimensions of the unit cell it just created. The dimensions use the triclinic matrix representation, in which the first three numbers specify the diagonal elements ($$xx, yy, zz$$) and the last six the off-diagonal elements ($$xy, xz, yx, yz, zx, zy$$).

What is the volume of the unit cell?

### Energy minimization of the structure in vacuum

Having defined the physical space where simulations can take place, the molecule can now be energy minimized. GROMACS uses a two-step process for any calculation involving the molecules and a force field. First, the user must combine the structure and the topology data, together with the simulation parameters, in a single control file. This file contains everything about the system and ensures the reproducibility of the simulation, provided the same force field is available on the machine. Another advantage of having such a self-contained file is that the preparation can take place in one machine while the calculations run on another. Again, simulations are computationally demanding. While the system can be easily prepared on a laptop, with the help of Pymol, GUI-enabled text editors, and all the other advantages of having a screen, calculations usually run on specialized clusters with hundreds of processing cores that provide only a command-line interface access. This will be relevant when running the production simulation. The intermediate calculations to prepare the system are comfortably small to run on a laptop.

The simulation parameters are contained in a separate file, usually with the .mdp extension. For simplicity, we provide these files in our GitHub repository) and also already in our virtual image in NMRBox, if you are using it (see \$MOLMOD_DATA/mdp/). These parameters specify, for example, the cutoffs used to calculate non-bonded interactions, the algorithm used to calculate the neighbors of each atom, the type of periodic boundary conditions (e.g. three-dimensional, bi-dimensional), and the algorithms to calculate non-bonded interactions. They also specify the type of simulation, for example energy minimization or molecular dynamics, and its length and time step if appropriate. Finally, they describe also the frequency with which GROMACS should write to disk the coordinates and energy values. Depending on the aim of the simulation, this writing frequency can be increased to have a higher temporal resolution at a cost of some computational efficiency (writing takes time). MDP files support hundreds of parameter settings, all of which are detailed in the GROMACS manual.

Although GROMACS is made of several utilities, its heart is the mdrun program. It is this code that runs all the simulations. The -deffnm flag is a very convenient option that sets the default file name for all file options, both input and output, avoiding multiple individual definitions. The -v flag tells mdrun to be verbose and in this case, print the potential energy of the system and the maximum force at each step of the minimization.

Steepest Descents:
Tolerance (Fmax)   =  1.00000e+01
Number of steps    =         5000
Step=    0, Dmax= 1.0e-02 nm, Epot=  4.80138e+03 Fmax= 1.83867e+04, atom= 57
Step=    1, Dmax= 1.0e-02 nm, Epot=  2.69755e+03 Fmax= 6.83824e+03, atom= 56
Step=    2, Dmax= 1.2e-02 nm, Epot=  1.46780e+03 Fmax= 4.07714e+03, atom= 57
Step=    3, Dmax= 1.4e-02 nm, Epot=  1.69036e+03 Fmax= 1.47891e+04, atom= 56
Step=    4, Dmax= 7.2e-03 nm, Epot=  1.19448e+03 Fmax= 6.00160e+03, atom= 56
Step=    5, Dmax= 8.6e-03 nm, Epot=  1.03838e+03 Fmax= 4.85784e+03, atom= 56
Step=    6, Dmax= 1.0e-02 nm, Epot=  1.11613e+03 Fmax= 1.01399e+04, atom= 56
Step=    7, Dmax= 5.2e-03 nm, Epot=  8.98891e+02 Fmax= 2.73856e+03, atom= 56
Step=    8, Dmax= 6.2e-03 nm, Epot=  8.39895e+02 Fmax= 4.76931e+03, atom= 56
Step=    9, Dmax= 7.5e-03 nm, Epot=  8.05094e+02 Fmax= 5.81049e+03, atom= 56
Step=   10, Dmax= 9.0e-03 nm, Epot=  7.77891e+02 Fmax= 5.97918e+03, atom= 56
...


The steepest descent algorithm used in this minimization calculates the gradient of the energy of the system at each step and extracts forces that push the system towards an energy minimum. As such, the potential energy must decrease. This is not the case for molecular dynamics and other minimization algorithms. The minimization ends when one of two conditions is met: either the maximum force is small than the provided threshold ($$10 kJ.mol^-1$$), and the minimization converged, or the algorithm reached the maximum number of steps defined in the parameter file (5000). Ideally, a minimization should run until convergence, but except for very specific scenarios such as normal mode analysis, this is not a strict requirement.

### Solvating the simulation box

The next step is to add solvent to the simulation box. The first molecular dynamics simulations of proteins were done in vacuum, but researchers quickly realized this was a major limitation. Water molecules interact with the protein, mediating interactions between residues. In addition, water as a solvent has a screening effect for long-range interactions, such as electrostatics. In vacuum, there is nothing to prevent two opposite charge atoms to feel each other even at a very long distance, as long as they are within the cutoff used for the simulation of course. With the addition of water, this interaction is dampened significantly. The effect of water-mediated interactions are also important when choosing the size of the box. The presence of a solute, the peptide, induces a particular ordering of the water molecules in its vicinity. This might have a ripple effect that propagates the effect of the solute and causes artifacts well beyond the theoretical non-bonded cutoff.

You should have already chosen the appropriate water model – TIP3P – when running pdb2gmx. The topology file is not required for the solvation. Essentially, this operation is just a matter of placing pre-calculated chunks of water molecules inside the box and remove those overlapping with protein atoms. No chemistry involved. However, the topology must be updated to reflect the addition of the solvent.

GROMACS backs up the previous topology file before updating it. Generally, GROMACS never overwrites files, instead copying the previous one and renaming it with # symbols. At the end of the new topology file, there is an additional entry listing the number of water molecules that are now in the structure. It also added a definition that loads the water model parameters.

Open the solvated structure in Pymol.

show cell

Why is it not a problem to have water and/or protein atoms sticking out of the box?

### Addition of ions: counter charge and physiological concentration

Besides water, the cellular environment contains a number of ions that maintain a certain chemical neutrality of the system. Adding some of these to the simulation box also increases the realism of the simulation. The GROMACS program genion performs this task, but requires as input a .tpr file. The addition of ions is done by replacing certain atoms that are already in the simulation box. Since removing atoms of the peptide is not quite desired, pay attention to the group you select when running genion. The -neutral flag indicates that an excess of one ionic species is allowed to neutralize the charge of the system, if there is any.

### Energy minimization of the solvated system

The addition of ions was the final step in setting up the system (chemically) for the simulation. From here on, all that is necessary is to relax the system in a controlled manner. Adding the solvent and the ions might have caused some unfavorable interactions, such as overlapping atoms and equal charges placed too close together.

### Restrained MD – relaxation of solvent and hydrogen atoms

Despite dissipating most of the strain in the system, energy minimization does not consider temperature, and therefore velocities and kinetic energy. When first running molecular dynamics, the algorithm assigns velocities to the atoms, which again stresses the system and might cause the simulation to become unstable. To avoid possible instabilities, the preparation setup here described includes several stages of molecular dynamics that progressively remove constraints on the system and as such, let it slowly adapt to the conditions in which the production simulation will run.

The .mdp file for this simulation is substantially different from those used for the minimization runs. First, the integrator is now md, which instructs mdrun to actually run molecular dynamics. Then, there are several new options that relate specifically to this algorithm: dt, t_coupl, ref_t, and gen_vel. At the top of the file, there is a preprocessing option that defines a particular flag -DPOSRES. In the topology file, there is a specific statement that is activated only when this flag is set, which relates to a file created by pdb2gmxposre.itp. This file contains position restraints for certain atoms of the system, which prevent them from moving freely during the simulations.

Which atoms are restrained during the simulation? What could be the purpose of using such restraints?

What is the length of the simulation in picoseconds?

The inclusion of velocity in this system caused the particles and the system to gain kinetic energy. This information is stored in an binary file format with extension .edr, which can be read using the GROMACS utility energy. This utility extracts the information from the energy file into tabular files that can then be turned into plots. Select the terms of interest by typing their numbers sequentially followed by Enter. To quit, type 0 and Enter. Use the xvg_plot.py utility to plot the resulting .xvg file, passing the -i flag to have an interactive session open. If you want to change the colors of the plot, run the script with the -h flag and refer to this page for the available color maps.

Extract and plot the temperature, potential, kinetic, and total energy of the system.

### Coupling the barostat – simulating in NPT conditions

Equilibration is often conducted in two stages: first, the system is simulated under a canonical ensemble (NVT) in which the number of molecules, volume, and temperature are kept constant. The goal is to let the system reach and stabilize at the desired temperature. The second step is to couple a barostat to the simulation and maintain a constant pressure, which resembles more closely the experimental conditions. While the temperature is controlled by adjusting the velocity of the particles, the pressure is kept constant by varying the volume of the simulation box ($$PV = NRT$$).

Equilibrate the system under NPT conditions and re-analyze the several thermodynamical variables, there is no need to edit this file.

Inside 04_npt_pr_PME.mdp we define the Berendsen barostat to be used, although this weak-coupling algorithm is not rigorously compatible with a full isothermal-isobaric (NPT) ensemble. Gromacs correctly complains about this by means of a warning message. In our case, we are just equilibrating the system, and using the Berendsen barostat is perfectly fine. Therefore the warning can be safely ignored by adding --maxwarn 1 at the end of the previous command.

What is better controlled: temperature or pressure? Why?

### Releasing the position restraints

By now, the system had time to adjust to the injection of velocities and the introduction of both temperature and pressure. The heavy atoms of the peptide are, however, still restrained to their initial positions. The next and final steps of the simulation setup release these restraints, progressively, until the system is completely unrestrained and fully equilibrated at the desired temperature and pressure, thus ready for the production simulation.

The strength of the restraints is defined in the posre.itp file, created by pdb2gmx. The value of the force constant defines how strictly the atom is restrained. As such, releasing the restraints is as simple as modifying the numbers on the file.

[ position_restraints ]
; atom  type	  fx	  fy	  fz
1     1    1000  1000  1000
4     1    1000  1000  1000
6     1    1000  1000  1000
9     1    1000  1000  1000
12     1    1000  1000  1000
13     1    1000  1000  1000
14     1    1000  1000  1000
17     1    1000  1000  1000


Decrease the strength of the force constant of the position restraints and re-run the system under NPT.

The final equilibration step is to completely remove the position restraints. This is done by removing the -DPOSRES definition at the beginning of the .mdp file, while maintaining all other parameters. For simplicity, we provide a further .mdp file without this definition (you don’t need to edit this file either).

## Production Simulation

Despite all these efforts, the system is unlikely to be in equilibrium already. The first few nanoseconds of a simulation, depending on the system, are in fact an equilibration period that should be discarded when performing any analysis on the properties of interest. To setup the simulation for production, all it takes it to generate a new .tpr file that contains the desired parameters, namely the number of steps that defines the simulation length. At this stage, there are plenty of questions to address that have varying degrees of influence in the performance of the calculations:

• At what time scale do the processes under study occur? How long should the simulation run for?
• What is the temporal resolution necessary to answer the research questions?
• Is there a need to store velocities and energies frequently?
• How often should the simulation information be written to the log file?

The simulation will run for 50 nanoseconds, which is sufficient to derive some insights on the conformational dynamics of such a small peptide. Bear in mind that a proper simulation to fully and exhaustively sample the entire landscape should last much longer, and probably make use of more advance molecular dynamics protocols such as replica exchange. In this case, since several students are expected to work on the same peptide, using different random seeds and starting from different initial conformations, we assume that individual simulations of 50 nanoseconds are informative enough.

The production run can be done in NMRBox, and we will run on the cluster over the next couple of days.

To run it in NMRBox, the only step missing is to generate a .tpr file containing the information for this simulation. Give this input file a clear name, combining the protein identifier (p53_helix, p53_extended, or p53_polypro) with your name or initials.

Run the production MD! This will take a few hours to complete.

gmx mdrun -v -deffnm p53_helix_CAH

If you wish to inspect the contents of the .tpr file, use the dump utility of GROMACS, which, as the name indicates, outputs the entire contents of the file to the screen. Pipe the output of the command to a text processor such as less or more (Unix joke) to paginate the output. Press q to quit the program.

gmx dump -s p53_helix_CAH.tpr | more

## Analysis of the Molecular Dynamics Simulation

The production run is only the beginning of the real work behind molecular dynamics simulations. The analysis of a simulation can be divided in several parts and varies substantially depending on the goal of the simulation and the research questions being asked. Generally, the first part of the analysis is to assess the quality and stability of the simulation in its entirety. If these indicate the simulation suffered from any problem, namely periodic image interactions, unstable temperature or pressure, or uncontrolled dynamics of the solute (i.e. unexpected unfolding of a protein), the simulation might have to be repeated. If otherwise the simulation is stable, then the analysis progresses to extract data that might help answer the research questions.

The production simulation produces a number of files, each containing different information. Depending on the options provided to mdrun, the names may vary. The extensions, however, remain the same. For most of the analysis, the only requirements are the compressed trajectory (.xtc) and energy (.edr) files.

• topol.tpr: Run input file, contains a complete description of the system at the start of the simulation.
• confout.gro: Structure file, contains the coordinates and velocities of the last step of the simulation.
• traj.trr: Full precision trajectory, contains the positions, velocities and forces over time.
• traj.xtc: Compressed trajectory, contains only coordinates (low precision: 0.001 nm)
• ener.edr: Energy file, contains energy, temperature, pressure and other related parameters over time
• md.log: Log file containing information about the simulation, namely performance, warnings, and errors.

### Quality Assurance

Before all else, it must be assured that the simulation finished properly. Many variables can cause a simulation to crash, especially problems related to the force field (if you use custom parameters) or insufficient or deficient equilibration of the system.

Another important source of information about the simulation and its successful conclusion is the log file. Most of this file contains information on the energies at each step of the simulation. At the end, there are several tables with detailed information about the performance of the simulation.

Writing checkpoint, step 25000000 at Thu Jul 16 21:58:00 2015

Energies (kJ/mol)
U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
5.44222e+02    4.34283e+02    1.81035e+01   -2.63369e+01    1.25435e+02
Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential
2.90324e+03    7.71580e+03   -9.23198e+04    2.82229e+02   -8.03229e+04
Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
1.50841e+04   -6.52387e+04    3.09273e+02   -3.82873e+02    2.53229e-05

<======  ###############  ==>
<====  A V E R A G E S  ====>
<==  ###############  ======>

Statistics over 25000001 steps using 250002 frames

Energies (kJ/mol)
U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
5.20505e+02    4.57178e+02    3.03600e+01   -1.14010e+01    1.26916e+02
Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential
2.89500e+03    8.09479e+03   -9.32391e+04    2.85641e+02   -8.08401e+04
Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
1.51323e+04   -6.57078e+04    3.10260e+02    1.51651e+00    0.00000e+00

Box-X          Box-Y          Box-Z
4.35387e+00    4.35387e+00    3.07865e+00

Total Virial (kJ/mol)
5.04350e+03   -1.47436e-02    1.02119e+00
-1.12579e-02    5.04338e+03   -2.17245e+00
1.02255e+00   -2.17242e+00    5.04266e+03

Pressure (bar)
1.89170e+00    2.46813e-01   -2.53503e-01
2.44845e-01    9.49682e-01    8.72892e-01
-2.54285e-01    8.72874e-01    1.70815e+00

Epot (kJ/mol)        Coul-SR          LJ-SR        Coul-14          LJ-14
Protein-Protein   -3.84871e+03   -2.18949e+02    2.89500e+03    1.26916e+02
Protein-non-Protein   -1.39745e+03   -2.24747e+02    0.00000e+00    0.00000e+00
non-Protein-non-Protein   -8.79929e+04    8.53849e+03    0.00000e+00    0.00000e+00

T-Protein  T-non-Protein
3.13254e+02    3.10150e+02

M E G A - F L O P S   A C C O U N T I N G

NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
W3=SPC/TIP3p  W4=TIP4p (single or pairs)
V&F=Potential and force  V=Potential only  F=Force only

Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
NB VdW [V&F]                          1606.118220        1606.118     0.0
Pair Search distance check          725959.726480     6533637.538     0.4
NxN Ewald Elec. + LJ [F]          18070249.146840  1409479433.454    84.7
NxN Ewald Elec. + LJ [V&F]          182525.653360    23545809.283     1.4
NxN Ewald Elec. [F]                2138562.944360   130452339.606     7.8
NxN Ewald Elec. [V&F]                21600.501696     1814442.142     0.1
1,4 nonbonded interactions            2328.446520      209560.187     0.0
Calc Weights                         74722.738140     2690018.573     0.2
Spread Q Bspline                   1594085.080320     3188170.161     0.2
Gather F Bspline                   1594085.080320     9564510.482     0.6
3D-FFT                             8683320.943800    69466567.550     4.2
Solve PME                            40790.304000     2610579.456     0.2
Reset In Box                          1245.381900        3736.146     0.0
CG-CoM                                1245.387762        3736.163     0.0
Propers                               2230.719750      510834.823     0.0
Impropers                              169.959600       35351.597     0.0
Virial                                2834.932800       51028.790     0.0
Stop-CM                                249.082242        2490.822     0.0
Calc-Ekin                             4981.521738      134501.087     0.0
Lincs                                 2283.236373      136994.182     0.0
Lincs-Mat                            49011.663888      196046.656     0.0
Constraint-V                         33260.265960      266082.128     0.0
Constraint-Vir                        3097.710213       74345.045     0.0
Settle                                9564.597738     3089365.069     0.2
(null)                                  42.489900           0.000     0.0
-----------------------------------------------------------------------------
Total                                              1664061187.059   100.0
-----------------------------------------------------------------------------

D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

av. #atoms communicated per step for force:  2 x 32013.5
av. #atoms communicated per step for LINCS:  2 x 1537.9

Part of the total run time spent waiting due to load imbalance: 1.7 %
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %

R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 18 MPI ranks

Computing:          Num   Num      Call    Wall time         Giga-Cycles
Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
Domain decomp.        18    1     212450     667.934      31260.638   1.9
DD comm. load         18    1     212450      39.920       1868.341   0.1
DD comm. bounds       18    1     212450     106.924       5004.253   0.3
Neighbor search       18    1     212451     647.826      30319.561   1.9
Comm. coord.          18    1    4036539    3017.213     141211.619   8.7
Force                 18    1    4248990   11911.658     557489.413  34.4
Wait + Comm. F        18    1    4248990    1200.017      56163.217   3.5
PME mesh              18    1    4248990   14556.938     681293.782  42.1
NB X/F buffer ops.    18    1   12322068     153.829       7199.495   0.4
Write traj.           18    1        888       0.358         16.767   0.0
Update                18    1    4248990     139.904       6547.776   0.4
Constraints           18    1    4248990    1876.305      87814.821   5.4
Comm. energies        18    1     424900     217.910      10198.637   0.6
Rest                                          57.370       2685.017   0.2
-----------------------------------------------------------------------------
Total                                      34594.107    1619073.337 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME redist. X/F       18    1    8497980   10470.597     490044.880  30.3
PME spread/gather     18    1    8497980    1694.125      79288.445   4.9
PME 3D-FFT            18    1    8497980     740.960      34678.396   2.1
PME 3D-FFT Comm.      18    1   16995960    1516.815      70989.972   4.4
PME solve Elec        18    1    4248990     125.464       5871.977   0.4
-----------------------------------------------------------------------------

Core t (s)   Wall t (s)        (%)
Time:   318506.837    34594.107      920.7
9h36:34
(ns/day)    (hour/ns)
Performance:       21.224        1.131
Finished mdrun on rank 0 Thu Jul 16 21:58:00 2015


## Visually inspecting the simulation

Although most of the analysis comes down to extracting data and plotting them, molecular dynamics is first and foremost about dynamical. As such, it is possible to extract the frames from the trajectory and combine them into a movie. This alone can inform substantially about the integrity of the peptide throughout the simulation. The following Pymol commands show the peptide in a sausage-like representation colored sequentially from N- to C-terminal. To manipulate the trajectory file, use trjconv, the GROMACS swiss-knife utility. When asked to select a group to output, choose Protein only, otherwise you will end up with a box of slushy water molecules obscuring the real action!

The peptide moves all around the box, wiggling as it diffuses through the water molecules and beyond the boundaries of the box. When the movie is over, use the intra_fit command to align all the frames, so that you can better observe peptide motions. Then replay the trajectory.

How does the peptide behave during the simulation? Did it unfold completely? Are there parts that remained stable while others didn’t?

Feel free to play around with Pymol. Zoom in on specific regions, such as where the peptide is most rigid or most flexible, and check the side chain conformations (show sticks). Feel free to waste some (CPU) time on making an nice image, using ray and png. Do mind that scenes that are too complex may cause the built-in ray-tracer of Pymol to crash, so in that case you can only get the image as you have it on screen using png directly. Check out the Pymol Gallery for inspiration, or ask your instructors for tips. If you have really a lot of time to waste, you can also make a movie of the trajectory, although this is probably best done outside the virtual machine of the course, for performance reasons. You might need to extract more frames from the simulation to make a sizable movie, depending on the frame rate you choose.

Then, in the command-line interface, assuming you are in the directory where Pymol stored all the .png files:

convert -delay 1 -loop 0 -dispose Background frame_*.png dynamics.gif

## Quantitative Quality Assurance

After a first visual inspection of the trajectory, assuming the simulation went smoothly, it is time to perform additional and more thorough checks regarding the quality of the simulation. This analysis involves testing for the convergence of the thermodynamic parameters, such as temperature, pressure, and the potential and kinetic energies. Sometimes, the convergence of a simulation is also checked in terms of the root mean square deviation (RMSD) of the atomic coordinates of each frame against the initial structure and/or the average structure. Since this simulation is of a very small and flexible peptide, it is expected that it does not converge, although there might be surprises! Finally, the occurrence of interactions between periodic images must be checked as well since, if these did occur, they might lead to artifacts in the simulation.

### Convergence of the thermodynamical parameters

Start off by extracting the thermodynamic parameters from the energy file, as done previously. Of interest are the temperature, pressure, potential energy, kinetic energy, unit cell volume, density, and the box dimensions. The energy file of the simulation contains several dozen terms. Some of the energetic terms are split in groups. These groups were defined in the .mdp file and can be used to isolate specific parts of the system for future analysis, for example, looking at the interaction between specific residues.

Have a look at the plot and see how the temperature fluctuates around the specified value ($$310 K$$). The Heat Capacity of the system can also be calculated from these fluctuations. The system temperature must be extracted from the .edr energy file together with the enthalpy (for NPT) or Etot (for NVT) values. Furthermore, we have to explicitly state how many molecules we have in the system with the -nmol option (you can refer to the end of the topology file to get the total number of molecules in your system). This will allow gmx energy to automatically calculate the heat capacity and show at the end of its output. Check the GROMACS manual for more details.

The equilibration of some terms takes longer than that of others. In particular, the temperature quickly converges to its equilibrium value, whereas for example the interaction between different parts of the system might take much longer.

### Calculation of the minimum distance between periodic images

A key point of any molecular dynamics simulation analysis where periodic boundary conditions were used is to check if there have been any direct interactions between neighboring images. Since the periodic images are just a trick to avoid having hard boundaries, such interactions are unphysical self-interactions and invalidate the results of the simulation.

The occurrence of a periodic image sighting can be overlooked if it is very transient and infrequent. If it does occur frequently or consistently over a stretch of the simulation, time to go back and re-do the whole setup. Also, not only direct interactions are of concern. As mentioned before, the water around the solute has a different structure than the bulk water. To be on the safe side, add an extra nanometer when calculating the allowed minimal distance.

### Conformational dynamics and stability I – Radius of Gyration

Before analyzing any structural parameter, the trajectory has to be massaged to avoid artifacts because of the periodic boundary conditions. In addition, all the analysis tools work faster if the trajectory contains only the necessary (protein) atoms and their information.

Perhaps not entirely relevant for this particular simulation, since the goal is to sample many conformations, but another part of the quality assurance of a simulation is checking the convergence of the structure itself. This can be done either by calculating the root mean square deviation (RMSD) of the atomic coordinates of each frame with respect to the initial structure or the average structure, but also by calculating the radius of gyration of the structure over the trajectory. The radius of gyration gives an indication of the shape of the molecule and compares to the experimentally obtainable hydrodynamic radius.

### Conformational dynamics and stability II – Root Mean Square Fluctuation (RMSF)

The structure of the peptide changes throughout the simulation, but not equally. Some regions are more flexible than others, usually due to differences in the amino acid sequence. The root mean square fluctuations capture, for each atom, the fluctuation about its average position and often correspond to the crystallographic temperature (or b) factors. Comparing this experimental measure with the RMSF profile can serve as an additional quality check for a simulation. The higher the temperature factor, the more mobile the atom. An interesting collateral of this analysis is the calculation of an average structure, which can be used for future analyses.

### Conformational dynamics and stability III – Root Mean Square Deviation (RMSD)

As the calculation of the RMSF also produced an average structure, it is now possible to calculate the root mean square deviation of the entire trajectory. This metric is commonly used as an indicator of convergence of the structure towards an equilibrium state. The RMSD is a distance measure, and as such is mostly meaningful for low values. Two frames that differ by $$10Å$$ from the average structure may well be entirely different conformations. The GROMACS tools rms allows such calculations, and in particular selecting only specific groups of atoms of the molecule, such as the backbone.

While the RMSD with respect to the initial structure is relevant, if it plateaus at a relatively high value, it does not inform on the stability of the conformation. As mentioned above, two structures at $$10Å$$ can be very different. For this reason, the RMSD with respect to the average structure is likely to offer a better perspective of the evolution of structural changes throughout the simulation.

## Structural Analysis

When asked for a selection choose “Protein” if no selection is specifically stated or does not follow logically from the text.

Having assured that the simulation has converged to an equilibrium state, and that its results are likely to be valid, it is time for some real analysis that provides answers to a research question. Analysis of simulation data can be divided in several categories. One comprises the interpretation of single conformations according to some functions to obtain a value, or a number of values, for each time point. Example of these are the previously calculated RMSD and radius of gyration metrics. Next to that, the analysis can be done in the time domain, e.g. through averaging, such as (auto)correlations or fluctuations. In the next section, several different types of analyses will be performed, each providing a different but complementary view into the trajectory. Some might not be strictly necessary for this simulation, but are included as an example of what can be done elsewhere.

When running the GROMACS programs to perform the analyses, pay attention to their output as well as the plots they generate.

### Hydrogen Bonds

Secondary structures (of proteins) are maintained by specific hydrogen bonding networks. Thus, the number of hydrogen bonds, both internal and between the peptide and the solvent. The presence or absence of a hydrogen bond is inferred from the distance between a donor/acceptor pair and the H-donor-acceptor angle. OH and NH groups are regarded as donors, while O is always classified as an acceptor. N is an acceptor by default as well, unless specifically disabled. GROMACS can calculate hydrogen bonds over full trajectories with the hbond program. The program output informs on the number of hydrogen bonds, distance and angle distributions, and an existence matrix of all internal hydrogen bonds over all frames. The number of hydrogen bonds alone is a proxy for the existence of secondary structures.

In addition to global analyses, many GROMACS programs support index files, which are created with the make_ndx program. These index files allow the creation of user-specified groups, such as single residues or stretches of residues. For example, it is possible to evaluate the creation of β-hairpins by checking the existence of hydrogen bonds between the two halves of the peptide. Assume you are working on a 14-residue long peptide. The syntax within make_ndx to create an index file to check for hydrogen bonds between the two halves is as follows:

r 1-7
name 17 half_1
r 8-14
name 18 half_2
q


On the basis of this analysis, is your peptide adopting a β-hairpin structure during the simulation?

### Secondary Structure

Among the most common parameters to analyse protein structure is the assignment of secondary structure elements, such as α-helices and β-sheets. One of the most popular tools for this purpose is the dssp software. Although not part of the GROMACS distribution, dssp can be freely obtained online at the PDB REDO github page, and integrated in many of its analysis tools. Specifically, the do_dssp tool produces a plot of the different secondary structure elements of each residue in the peptide as a function of time. This matrix, in .xpm format, can be converted into a Postscript file using the gmx xpm2ps tool, and then into a PDF file using ps2pdf. The xpm2ps utility allows a scaling flag, -by, that is useful for very short sequences, as well as a -rainbow flag that controls the coloring of the output.

## Analysis of time-averaged properties

This simulation considers only one conformation. To obtain proper sampling of the peptide conformational landscape, 50 nanoseconds do not suffice. However, trajectories starting from different initial structures or starting from the same structure with a different initial random seed explore different regions of the conformational landscape. It is then desirable to combine different trajectories together and therefore obtain a much larger body of data.

Obtain different (full) trajectories from 2 of your colleagues. If possible, try to be as diverse as possible regarding initial structures.

### Preparation of a concatenated trajectory

The first step is to trim the trajectories in order to remove the first 10 nanoseconds, which can be conservatively considered as equilibration. This operation is possible through trjconv and its -b flag, which allows the user to specify an offset previous to which the trajectory data is ignored. To be able to extract only the peptide atoms, trjconv requires an dummy index file.

Why doesn’t it matter which topology file is used to process the different trajectory files?

After all three trajectories are trimmed, they can be concatenated using the GROMACS program trjcat. Make sure to note down the order in which the trajectories are provided to trjcat. The concatenation requires two particular flags to be provided as input to the program: -cat, which avoids discarding double time frames, and -settime, which changes the starting time of the different trajectories interactively. Effectively, the second trajectory will start at 40 ns and the third at 80 ns. The program will prompt for an action during the concatenation: press c, which tells trjcat to append the next trajectory right after the last frame of the previous one.

### Root Mean Square Deviations – Part II

Although the root mean square deviation (RMSD) was already calculated to check for the convergence of the simulation, it can be used for a more advanced and in-depth analysis of conformational diversity. After all, the RMSD is a metric that compares structures. By performing an all-vs-all comparison with all frames in the concatenated trajectory, it is possible to identify groups of frames that share similar structures. This also quantifies the conformational diversity of a particular trajectory (or trajectories). The matrix allows also to detect and quantify the number of transitions between different conformations during the simulations. It is as relevant to have 10 different conformations or 2 that interconvert quickly. Since an all-vs-all RMSD matrix entails a very large number of pairwise comparisons, and the peptide conformations are different enough, use only backbone atoms to fit and calculate the RMSD.

### Cluster Analysis

Using the all-vs-all RMSD matrix calculated in the previous step, it is possible to quantitatively establish the number of groups of similar structures that a trajectory (or concatenated trajectories) samples. Using an unsupervised classification algorithm, clustering, structures that are similar to each other within a certain RMSD threshold are grouped together. The size of a cluster, the number of structures that belong to it, is also an indication of how favourable that particular region of the conformational landscape is in terms of free energy. GROMACS implements several clustering algorithms in the cluster program. Here, we will use the gromos clustering algorithm with a cutoff of $$2Å$$. Briefly, the algorithm first calculates how many frames are within $$2Å$$ of each particular frame, based on the RMSD matrix, and then selects the frame with the largest number of neighbors to form the first cluster. These structures are removed from the pool of available frames, and the calculation proceeds iteratively, until the next largest group is smaller than a pre-defined number. The cluster program produces a very large number of output files that inform on several different properties of the clusters. Importantly, it also produces a PDB file with the centroids, or representatives, of each cluster.

Cluster the RMSD matrix using the GROMOS method to quantitatively extract representative structures of the simulation. Choose peptide backbone for fitting and all-atoms of peptide as output. This is important, since we have will use the output structures for docking.

gmx cluster -f p53_concatenated.xtc -s p53_helix_CAH.tpr -dm p53_concatenated_RMSD-matrix.xpm -dist p53_concatenated_rmsd-distribution.xvg -o p53_concatenated_clusters.xpm -sz p53_concatenated_cluster-sizes.xvg -tr p53_concatenated_transitions.xpm -ntr p53_concatenated_transitions.xvg -clid p53_concatenated_cluster-id-over-time.xvg -cl p53_concatenated_clusters.pdb -cutoff 0.2 -method gromos

How many clusters did the algorithm find? Tune the cutoff to obtain a reasonable number of clusters (e.g. 10-15).

What is the clustering cutoff that allows the definition of that number of clusters? Do you think these clusters are meaningful, i.e. contain only similar structures?

Open the resulting PDB file in Pymol and compare the centroids of each cluster with the others.

Are there any meaningful differences between the largest clusters?

## Picking representatives of the simulation

The aim of this simulation exercise was the sample the conformational landscape of the p53 N-terminal transactivation peptide, in order to extract representatives that could be used to generate models of its interaction with the MDM2 protein. The last step of clustering provides an unbiased method to select structures that were sampled throughout most of the trajectory (large clusters) and are likely good candidates for seeding the docking calculations.

Select 5 representatives of the clusters you obtained in the previous step and create individual PDB files using Pymol.

## Congratulations!

By the end of this tutorial, you have (we hope!) learned how to setup a molecular dynamics simulation of a small peptide and how to critically interpret and validate your results. This is no small feat. The analyses we show here are just the tip of the iceberg of what you can extract from your trajectory. If you are serious about MD simulations, be sure to read the GROMACS documentation and get acquainted with the tools it offers.

You might want to use the representatives you just selected in the tutorial for data-driven docking calculations!