Subsections
- Calculation of the Order Parameter, \( \Psi \)
- Updating the Bias Potential, \( U(\Psi) \)
- Calculating the Biasing Force
- Modified Langevin Step
Calculation of the Order Parameter, \( \Psi \)
Calculating the order parameter
$$\begin{equation}
\Psi = \frac{1}{R_{0}^{3}} \left( \frac{V}{(2 \pi)^{3}} \int f(k) \lvert W_{-}({\bf k}) \rvert^{l} d{\bf k} \right)^{1/l},
\end{equation}$$
requires a Fourier transform of the real-space field, \(W_{-}({\bf r}) \), into reciprocal space to obtain \( W_{-}({\bf k}) \).
A Fourier transform can be a costly operation to perform on the host (CPU), the execution time scaling with the number of mesh
points used to represent the field. It may also be the case that the user wishes to output \( \Psi \) at every
time step, so this relatively costly calculation could end up being performed frequently. As a result, I have used CUDA's cuFFT library
to perform the Fourier transform on the GPU with the function:
cufftResult
cufftExecD2Z(
cufftHandle plan,
cufftDoubleReal *idata,
cufftDoubleComplex *odata);
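As a rough illustration of how such a transform might be set up (not the exact code), the plan creation and execution could look as follows, where the mesh dimensions mx, my, mz, the real-space device array w_gpu (a double*) and the reciprocal-space output wk_gpu (a cufftDoubleComplex*) are assumed names:
#include <cufft.h>

// Create a 3D double-precision real-to-complex plan (done once at set-up).
cufftHandle wr_to_wk;
cufftPlan3d(&wr_to_wk, mx, my, mz, CUFFT_D2Z);

// Transform W_-(r), held on the GPU, into W_-(k), also on the GPU.
cufftExecD2Z(wr_to_wk, w_gpu, wk_gpu);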
The next area that can be parallelised is identified by noting that the equation for the order parameter contains an integral
that becomes a summation on our discrete mesh:
$$\begin{equation}
\int f(k) \lvert W_{-}({\bf k}) \rvert^{l} d{\bf k} \rightarrow C \sum_{i} f(k_{i}) \lvert W_{-}({\bf k}_{i}) \rvert^{l}.
\end{equation}$$
If we look at the contribution to the sum from a particular wavevector \( {\bf k}_{i} \),
it is clear that neither \( f(k_{i}) \) nor \( W_{-}({\bf k}_{i}) \) depends on any other wavevector, \( {\bf k}_{j} \).
As a result, we can divide up the hard work by calculating each contribution in parallel on the GPU.
The next step would be to add the results from each \( k_{i} \) together again. We could achieve this by copying the
array of results from the GPU back to the host and summing them with the CPU. However, copying a large array between
host and GPU memory can become costly, especially if repeated often.
As a result, I have utilised the CUDA Thrust library function:
OutputType thrust::
transform_reduce(
InputIterator first,
InputIterator last,
UnaryFunction unary_op,
OutputType init,
BinaryFunction binary_op).
Whilst I will not go through the function in detail, it takes iterators specifying the first and last positions of
the input array, a function that will be applied to the input data (in our case, it calculates the product
\( f(k_{i}) \lvert W_{-}({\bf k}_{i}) \rvert^{l} \)), the initial summation value (zero), and a reduction operator
(i.e., the built-in operation
thrust::plus<double>()
for summation). These parameters are used to perform a sum of
the transformed input data on the GPU and finally return a single double precision variable (not a large array) to
host memory. In the code, the function looks like:
Psi = thrust::transform_reduce(
_z,
_z + _Mk,
Psi_calc(_ell, _M),
0.0,
thrust::plus<double>());
where _z is a
thrust::zip_iterator<IteratorTuple>
(not discussed here), _M and _Mk are constants corresponding to the array lengths in real and reciprocal space, and the function:
Psi_calc(double ell, int M)
calculates the product \( f(k_{i}) \lvert W_{-}({\bf k}_{i}) \rvert^{l} \), where ell is the constant \( l \).
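While the real implementation is not reproduced here, a functor of this kind might look roughly like the sketch below, assuming (purely for illustration) that the zip iterator dereferences to a tuple of \( f(k_{i}) \) and \( W_{-}({\bf k}_{i}) \), and that the FFT normalisation by M is handled inside the functor:
#include <thrust/tuple.h>
#include <cufft.h>
#include <cmath>

struct Psi_calc
{
    double ell;   // the exponent l in the order parameter
    int    M;     // number of real-space mesh points (FFT normalisation)

    Psi_calc(double ell, int M) : ell(ell), M(M) {}

    __host__ __device__
    double operator()(const thrust::tuple<double, cufftDoubleComplex> &t) const
    {
        double             fk = thrust::get<0>(t);    // f(k_i)
        cufftDoubleComplex wk = thrust::get<1>(t);    // W_-(k_i)
        double mag = sqrt(wk.x*wk.x + wk.y*wk.y) / M; // |W_-(k_i)|, normalised
        return fk * pow(mag, ell);                    // f(k_i)|W_-(k_i)|^l
    }
};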
The steps outlined above are contained within the public function:
double get_Psi(double *w_gpu)
in the metadynamics class. It requires only a pointer to the field array stored on the GPU as a parameter.
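For example, if the metadynamics object were called wtmd (a name used here only for illustration), obtaining the order parameter at a given step might look like:
double Psi_hat = wtmd.get_Psi(w_gpu);   // only a single double is copied back to the host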
Updating the Bias Potential, \( U(\Psi) \)
It is necessary to periodically update the bias potential by an amount \( \delta U(\Psi) \), and
its derivative by an amount \( \delta U'(\Psi) \). Furthermore, to perform a linear extrapolation of the free energy
once the system has become well-tempered, we also need to calculate the ensemble average of \( W_{-}^{2}({\bf r}) \)
as a function of \( \Psi \) (mathematics not shown). This is done by adding Gaussians to two functions, \( I_{0}(\Psi) \)
and \( I_{1}(\Psi) \), centred about the current value of the order parameter, \( \hat{\Psi} \).
The Gaussians added to \( I_{1}(\Psi) \) are scaled by a factor of
$$\begin{equation}
\hat{\langle W_{-}^{2} \rangle} = \frac{1}{M} \sum_{i=1}^{M} W_{-}^{2}({\bf r}_{i})
\end{equation}$$
relative to those added to \( I_{0}(\Psi) \).
Since the number of mesh points, \( M \), is large, this is another sum that can be calculated on the GPU via
thrust::
transform_reduce(...).
This time, we use the built-in operation,
thrust::square<double>()
to perform the square.
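A sketch of such a calculation, assuming w_gpu is a raw device pointer to \( W_{-}({\bf r}) \) of length _M (the wrapper function and its name are hypothetical), might read:
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/device_ptr.h>

// Hypothetical helper: estimate of the average of W_-^2 over the mesh.
double get_w2_hat(double *w_gpu, int M)
{
    thrust::device_ptr<double> w(w_gpu);                    // wrap the raw GPU pointer
    double sum = thrust::transform_reduce(w, w + M,
                                          thrust::square<double>(),  // W_-^2(r_i)
                                          0.0,
                                          thrust::plus<double>());   // summation
    return sum / M;                                         // (1/M) * sum
}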
The next step is to update \( \delta U(\Psi) \), \( \delta U'(\Psi) \), \( \delta I_{0}(\Psi) \) and \( \delta I_{1}(\Psi) \)
as the Gaussians are added. Depending on the range of \( \Psi \) being investigated and the resolution (i.e., the distance between
adjacent points, \( \Delta \Psi \) ), this is another procedure that can be significantly sped up through parallelisation.
It is possible as long as each GPU thread knows the value about which to centre the Gaussians, \( \hat{\Psi} \), and can work
out which value of \( \Psi \) it is itself responsible for, a relatively simple pair of requirements to satisfy.
This is achieved in the CUDA kernel:
__global__ void update_bias_fields_gpu(...),
into which \( \hat{\langle W_{-}^{2} \rangle} \) and \( \hat{\Psi} \) are passed as parameters (among others).
All four of the aforementioned arrays are updated in the above kernel, since \( \delta U'(\Psi) \) necessitates
calculating \( \delta U(\Psi) \), and both \( I_{0}(\Psi) \) and \( I_{1}(\Psi) \) can reuse the
Gaussian contribution already being used in the thread.
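A deliberately simplified sketch of such a kernel is shown below; the parameter names, the handling of the Gaussian amplitude A (whose well-tempering scaling is assumed to be computed on the host) and the relative scaling of the \( I_{0} \) and \( I_{1} \) updates are illustrative assumptions rather than the actual implementation:
// One thread per point of the bias grid, each adding its own slice of the Gaussians.
__global__ void update_bias_fields_sketch(double *U, double *dU,
                                          double *I0, double *I1,
                                          double Psi_hat, double w2_hat,
                                          double Psi_min, double dPsi,
                                          double A, double sigma, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double Psi = Psi_min + i * dPsi;                        // the Psi value owned by this thread
    double x   = Psi - Psi_hat;
    double G   = A * exp(-0.5 * x * x / (sigma * sigma));   // Gaussian centred on Psi_hat

    U[i]  += G;                                             // delta U(Psi)
    dU[i] += -x * G / (sigma * sigma);                      // delta U'(Psi), reusing G
    I0[i] += G;                                             // delta I0(Psi)
    I1[i] += w2_hat * G;                                    // delta I1(Psi), scaled by the W_-^2 average
}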
Finally, the steps discussed above are all performed by calling either of the overloaded functions:
void update_bias_field(double *w_gpu),
void update_bias_field(double Psi_hat,
double *w_gpu),
depending on whether \( \hat{\Psi} \) has already been calculated prior to the call (thereby saving an additional call to get_Psi(...)).
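For example, inside the main simulation loop one might write (object, variable and frequency names chosen purely for illustration):
// Reuse the Psi_hat already computed this step to avoid a second get_Psi(...) call.
if (step % bias_update_freq == 0) wtmd.update_bias_field(Psi_hat, w_gpu);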
Calculating the Biasing Force
The first step in calculating the biasing force for the Langevin step is to obtain the current value of the order parameter, \( \hat{\Psi} \),
with a call to the parallelised function get_Psi(...) described above.
We then use \( \hat{\Psi} \) to look up the derivative of the bias potential that we have been keeping track of and obtain \( U'({\hat{\Psi}}) \).
The next step is to calculate the functional derivative, \( \mathcal{D} \Psi / \mathcal{D} W_{-}({\bf r}) \)
(see Beardsley and Matsen for mathematical details).
We begin by calculating \( \mathcal{D} \Psi / \mathcal{D} W_{-}({\bf k}) \) on the GPU with a call to the kernel: __global__ void get_dPsi_dwk(...),
the result of which is Fourier transformed back into real space via the cuFFT function: cufftExecZ2D(...)
to give \( \mathcal{D} \Psi / \mathcal{D} W_{-}({\bf r}) \).
A final call to the GPU kernel: __global__ void aEqBc(...), leaves us with the biasing force,
$$\begin{equation}
F_{B}(\Psi) = U'(\hat{\Psi}) \left( \frac{\mathcal{D} \Psi}{\mathcal{D} W_{-}({\bf r})} \right).
\end{equation}$$
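Judging by its name, aEqBc presumably performs a simple product of the form a = bc; one plausible (purely illustrative) form, with \( U'(\hat{\Psi}) \) passed as a scalar, would be:
// Hypothetical form: a[i] = b * c[i], used here as F_B(r_i) = U'(Psi_hat) * dPsi/dW_-(r_i).
__global__ void aEqBc_sketch(double *a, double b, const double *c, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < M) a[i] = b * c[i];
}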
Modified Langevin Step
The WTMD simulation requires a modified Langevin step in which the biasing force is added to the update of \( W_{-}({\bf r}) \).
Since a simulation will often start with zero bias potential (and therefore no bias force), a flag has been
introduced as an argument to the Langevin function that allows the user to turn off the call to get_fBias(...) if it's not
required:
void langevin(..., bool useBias = true).
This avoids wasteful kernel calls and Fourier transforms whose only outcome would be to confirm that a zero bias potential produces zero bias force.
In the case where the biasing force does not need to be calculated, \( W_{-}({\bf r}) \) is updated using a standard
L-FTS Langevin step on the GPU via the kernel: __global__ void langevin_sym(...).
Otherwise, after getting the biasing force, \( F_{B} (\Psi) \), on the GPU by calling get_fBias(...), it is passed to a CUDA kernel to
update \( W_{-}({\bf r}) \) via a modified Langevin step:
__global__ void langevin_sym_bias(..., double *bias).
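While the actual kernel implements the symmetrised update of the standard L-FTS code, the role of the bias array can be illustrated with a deliberately simplified, Euler-Maruyama-style sketch (all names, and the update form itself, are assumptions):
// Simplified illustration only: the bias force is simply added to the usual
// thermodynamic force in the update of W_-(r) at each mesh point.
__global__ void langevin_bias_sketch(double *w, const double *force,
                                     const double *bias, const double *noise,
                                     double dt, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < M) w[i] += -(force[i] + bias[i]) * dt + noise[i];
}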