Differences

This shows you the differences between two versions of the page.

--- howto:biochem_qmmm [2017/07/18 12:16] – dvanrompaey
+++ howto:biochem_qmmm [2024/01/03 13:13] (current) – oschuett
@@ Line 1: / Line 1: @@
-=====QMMM for biochemical systems=====
+This page has been moved to: https://manual.cp2k.org/trunk/methods/qm_mm/builtin.html
-This tutorial will illustrate the setup process of a QM/MM protein - substrate system. The model system we will be using is chorismate mutase, an enzyme that catalyzes the conversion of chorismate to prephenate, which is a critical reaction in the shikimate pathway. One attractive feature of this enzyme is that it catalyzes the reaction in an electrostatic manner, without forming covalent bonds to the substrate. This allows us to treat the complete enzyme at the computationally cheap MM level, saving us the headache of setting up bonds across the QM/MM border. In the production simulations we will simulate the substrate at the QM level (using the semi-empirical AM1 method), but we will first equilibrate our system with all components at the MM level.
-{{:howto:substrate_in_protein.png?800|}}
-The enzyme Chorismate Mutase with its native substrate
-Before we can get started on the simulations, however, we need to provide CP2K with the proper input files. The coordinates for the chorismate mutase system can be obtained from the protein data bank under PDB ID 2CHT. The active complex consists of three protein subunits, forming three active sites, each of which catalyzes the reaction separately from the others. The PDB structure, however, contains twelve units with twelve inhibitors bound. In order to generate one functional complex, we need to remove the excess protein subunits and their inhibitors. In addition, the inhibitors bound to the crystal structure have to be replaced with copies of the substrate, using your favourite molecular editor. For this tutorial, you can find all input files **[here]({{ site.baseurl }}/inputfiles/MM.zip)**.
-===Generating a topology===
-Now that we have our coordinates, we need to use them to generate a topology that CP2K can use. CP2K supports CHARMM-type PSF files and Amber-type prmtop files. I will be using [[http://ambermd.org/#AmberTools|Ambertools]] to generate the input. Small molecule parametrization for the GAFF2 forcefield is performed with Antechamber and parmchk2. Charges are derived with the bcc method. Antechamber generates MM parameters for small molecules while Parmchk2 verifies this topology, reports missing parameters and outputs a best guess for these in an .frcmod file. For the purposes of this tutorial, we will take the parameters straight from antechamber/parmchk2, but for production simulations these should be examined cautiously. As the active complex has three ligands, we will perform this parameterization for each of them in a bash loop.
- <code>
-for i in A B C; do antechamber -i Lig${i}.mol2 -o Ligand${i}.mol2 -fi mol2 -fo mol2 -c bcc -pf yes -nc -2 -at gaff2 -j 5 -rn CH${i}; done
-for i in A B C; do parmchk2 -i Ligand${i}.mol2 -f mol2 -o Ligand${i}.frcmod -s 2; done
- </code>
-We will then use tleap to load in all the parameters from antechamber and parmchk2, generate parameters for the protein according to the Amber14 forcefield and combine everything into a complex. The commands for tleap can be placed in an input file and ran using `tleap -f script.in`.
- <code>
-source leaprc.protein.ff14SB
-source leaprc.gaff2
-source leaprc.water.tip3p
-ligA = loadmol2 LigandA.mol2
-ligB = loadmol2 LigandB.mol2
-ligC = loadmol2 LigandC.mol2
-loadamberparams LigandA.frcmod
-loadamberparams LigandB.frcmod
-loadamberparams LigandC.frcmod
-protein = loadPDB Protein.pdb
-complex = combine {protein ligA ligB ligC}
- </code>
-Now that we have assembled our complex, we have to solvate it. For simplicity's sake, we will use a cubic box with water extending out 14 A from the protein and neutralize with Na+ ions. The distance is determined by the cutoffs for nonbonded interactions, plus some buffer to account for cell shrinking. For production simulations you may want to switch to a dodecahedron or octahedron for increased efficiency.
- <code>
-solvateBox complex TIP3PBOX 14.0 iso
-addIonsRand complex  Na+ 0
-saveamberparm complex complex.prmtop complex.inpcrd
-quit
- </code>
-These two files contain everything we need to get started. Complex.prmtop contains the topology information and complex.inpcrd contains the coordinates. Using your favourite visualisation tool, check that the Na+ ions did not get placed right next to your ligand - their placement is random. The box size can be found at the bottom of the inpcrd file, followed by the vectors. These dimensions should be specified in the `&CELL` subsection of your input file.
-===Energy minimization===
-We can now move on to actually running CP2K. We will first run 1000 steps of energy minimization (or fewer, depending on the convergence) with the input file `em.inp` to eliminate bad contacts. For quantities like cell size, for instance, CP2K allows you to specify your units by placing them in brackets, for instance `TIMESTEP [fs] 0.5`.
-<code>
-cp2k can be run using a single process:
-cp2k.sopt em.inp > em.out
-or using several processes, where N is the number of processes:
-mpirun -np N cp2k.popt em.inp > em.out
-</code>
-Visualize your simulation to check for any abnormalities. Take a look at the output file, paying extra attention to any warnings you might receive. Note: CP2K will warn you about missing forcefield terms. You can have CP2K output these using `&FORCE_EVAL > &MM > &PRINT > &FF_INFO`.In this case, the missing terms are Urey-Bradley interactions. This is normal, as the Amber force field we are employing here does not use Urey-Bradley terms. A second warning states that our CRD file lacks velocities and box information will not be read. This is not a problem either, as we do not have velocities at this moment and box parameters are provided in the input file.
-In the next step we will perform 5 ps of dynamics in the NVT ensemble, to equilibrate the temperature of our simulation using this `nvt.inp`. We are using the &EXT RESTART section to restart with the last coordinates of the energy minimization procedure. Once the simulation has finished, take a look at the temperature file `NVT-1.PARTICLES.temp` to check that the temperature is oscillating around the correct value (in this case, 298K). Note that we are using the CSVR thermostat with a low time constant of 10 fs to efficiently and quickly control the system temperature. For production simulations a higher time constant such as 100 fs should be chosen to avoid interfering with the dynamics of the system.
-The next step is simulating our system in the constant pressure NPT ensemble to get the correct dimensions for our cell. `npt.inp` runs 50 ps of NPT, outputting the cell dimensions into a separate file every 100 timesteps. A C-C distance in of the chorismate molecules is constrained, in order to keep it near the reactive conformation to facilitate our later simulations in the QM/MM phase. Pressure control is achieved by adding a barostat with a reference pressure of 1 bar and a coupling time of 100fs, and changing the ensemble from NVT to NPT_I, indicating that the box will be scaled isotropically. We expect our cell to shrink at first, as the solvation procedure cannot fill up the cell completely due to overlap with the protein and the edges of the box. After a while the cell size should stabilize. Take a look at the cell dimensions in `NPT.cell`; we will be using these equilibrated cell dimensions for the next tutorial.
-Congratulations, you should now have a fully equilibrated MM system. We will now move on to QM/MM to simulate the bond-breaking process.We will also bias our simulation to ensure that we can see the conversion of chorismate into prephenate occur in a reasonable timeframe.
-====Moving on to QMMM====
-<img src="{{ site.baseurl }}/img/Reaction2.png" alt="Chorismate">
-<span class="caption text-muted"> Chorismate(left) is converted into prephenate(right) by Chorismate Mutase </span>
-As chorismate mutase performs its catalytic function through noncovalent interactions, we can afford to treat the entire enzyme at the MM level. For most biochemistry problems, the boundary will not be so clear and key residues will need to be treated at the QM level as well. This means the QM - MM border will cut through covalent bonds, necessitating the use of link atoms. For our simple system however, this is not necessary. We first have to specify which atoms should be treated at the QM level using the QM_KIND section:
-<code>
-&QM_KIND O
-  MM_INDEX 5668 5669 5670 5682 5685 5686
-&END QM_KIND
-&QM_KIND C
-  MM_INDEX 5663 5666 5667 5671 5673 5675 5676 5678 5680 5684
-&END QM_KIND
-&QM_KIND H
-  MM_INDEX  5664 5665 5672 5674 5677 5679 5681 5683
-&END QM_KIND
-</code>
-Now that the QM atoms are defined, CP2K needs to know how to treat these. We will use the semi-empirical AM1 method as our QM method, electrostatically embedded in the MM system: `E_COUPL COULOMB`. This allows for the QM region to be polarized by the MM environment. Mechanical embedding, where the QM region only interacts with the MM region as point charges and no polarization occurs, can be selected through `E_COUPL NONE`. In DFT calculations, the highly efficient [GEEP](http://pubs.acs.org/doi/full/10.1021/ct050123f) method can be used as well. For stability reasons, we need to make some slight modifications to our prmtop file. In the TIP3P water model, the hydrogens do not have a separate Lennard-Jones site. The same is true for the amber serine, threonine and tyrosine HO atoms. This can cause unphysical interactions with the QM system: the hydrogen atoms strongly interact with the QM electron clouds, crashing the system. In order to prevent this we will add LJ sites to water hydrogens and serine/threonine/tyrosine hydrogens with the ambertools utility parmed. We will change their LJ parameters to those of a GAFF2 type alcohol using the `changeLJSingleType` command. Take care when using this, as it changes the LJ type for all of the atoms that share the LJ type, not just the one you selected. For instance, changing the serine HO also alters all the threonine and tyrosine HO atoms in the system. In this case, this behaviour is useful, but it is strongly advised to use `printLJTypes` to get a good idea of what exactly you are modifying. For a more selective manner of doing this, the `addLJType` command is useful. This adds an LJ type for your selected atom or group of atoms, rather than modifying all occurrences of a specific LJ type. Confirm that `@5724` corresponds to a water hydrogen and that `@4843` is a serine HO using `printDetails @(atomnumber)` in parmed. If not, use `printDetails` to locate the correct atom numbers and modify the changeLJSingleType commands accordingly.
-<code>
-parmed complex.prmtop
-changeLJSingleType @5724 0.3019 0.047
-changeLJSingleType @4843 0.3019 0.047
-parmout complex_LJ_mod.prmtop
-quit
-</code>
-The input file `monitor.inp` will perform 5 ps of simulation in the NVT ensemble. Note that the parameter file should be changed to our modified file with LJ sites on the OH hydrogens. We have also defined a collective variable (CV), i.e. a variable that that describes our process of interest well and allows us to monitor this variable as a surrogate for the process. The CV for this system is the difference in distance between the C-O bond that we will be breaking and the C-C bond that we will be forming. Start the simulation and observe the collective variable output in the MONITOR-COLVAR.metadynLog file.
-===Metadynamics===
-We have now simulated a full QM/MM system for 5 ps. Unfortunately, as we can see in the metadynamics logfile and in the .xyz trajectory file, no reaction occurred. This makes sense, as there is a significant energy barrier to overcome. One option would be to run this simulation for months on end, hoping for the reaction to occur spontaneously. A second, more practical option would be to bias the system. We will use metadynamics to bias our collective variables (CVs). The metadynamics procedure adds new a gaussian potential at the current location after a number of MD steps, controlled by `NT_HILLS`, encouraging the system to explore new regions of CV space. The height of the gaussian can be controlled with the keyword `WW`, while the width can be controlled with `SCALE`. These three parameters determine how the energy surface will be explored. Exploring with very wide and tall gaussians will explore various regions of your system quickly, but with poor accuracy. On the other hand, if you use tiny gaussians, it will take a very long time to fully explore your CV space. An issue to be aware of is hill surfing: when gaussians are stacked on too quickly, the simulation can be pushed along by a growing wave of hills. Choose the `NT_HILLS` parameter large enough to allow the simulation to equilibrate after each hill addition.
-You can now run the metadynamics simulation using `metadynamics.inp`. The evolution of the collective variables over time can be observed in the `METADYN-COLVAR.metadynLog` file. The substrate can be found near the 3 to 8 bohr CV region, whereas the product can be found near the -3 to -8 bohr region. For this tutorial, we will stop the simulation once we have sampled the forward and reverse reaction once. In a production setting you should observe several barrier crossings and observe the change in free energy profile as a function of simulation time. The free energy landscape for the reaction can be obtained by integrating all the gaussians used to bias the system. Fortunately, CP2K comes with the graph.sopt tool that does this all for you. You will need to compile this program separately if you don't have it already. Run the tool passing the restart file as input.
-<code>
-graph.sopt -cp2k -file METADYN-1.restart -out fes.dat
-</code>
-This outputs the `fes.dat` file, containing a one-dimensional energy landscape. Plot it to obtain a graph charting the energy as a function of the distance difference.
- Observe the two energy minima for substrate and product, and the maximum associated with the transition state. From this graph we can estimate the activation energy in the enzyme, amounting to roughly 44 kcal/mol. Results from literature generally estimate the energy to be around 20 kcal/mol. While we are in the right order of magnitude, our estimate is not very accurate. This is due to a number of limitations of this setup: enzyme residues were not treated with QM, sampling was very limited and the inexpensive AM1 method was used for the QM part. This tutorial is intended as a starting point for further experimentation, and the interested reader is encouraged to explore this system further using more accurate techniques.