Complex Langevin Field-Theoretic Simulations (CL-FTS) of Diblock Copolymers on GPUs


Related Publications

Accurate universal predictions for block copolymer melts using field-theoretic simulations.
M. W. Matsen, J. D. Willis, and T. M. Beardsley
Langevin field-theoretic simulations (L-FTSs) have recently evaluated complete phase diagrams for diblock copolymer melts [Matsen et al., Phys. Rev. Lett. 2023, 130, 248101], but they involve a partial saddle-point approximation (PSPA). Although previous complex-Langevin field-theoretic simulations (CL-FTSs) calculated complete phase diagrams without the PSPA [Delaney and Fredrickson, J. Phys. Chem. B 2016, 120, 7615], the diagrams disagree with experiments. We find evidence attributing this to nonuniversal behavior resulting from long-range interactions used to remove an ultraviolet divergence. Therefore, we perform CL-FTSs on symmetric diblocks with contact interactions and use the renormalization procedure employed in the L-FTSs. This successfully removes the divergence and restores the universality, but, without the long-range interactions, the fields have a tendency to form intense delta-like "hot spots", causing the CL-FTSs to fail. Nevertheless, we find good agreement between the CL-FTSs and L-FTSs over the parameter space where the CL-FTSs are sufficiently well behaved. This implies that the renormalization compensates for the inaccuracy of the PSPA, thereby imparting L-FTSs with the ability to provide accurate universal predictions for block copolymer melts of high molecular weight.

Project Motivations


Scientific Motivations


  1. Compare CL-FTS to the (previously implemented) L-FTS method in terms of speed and accuracy.
  2. Investigate the instability reported in CL-FTS simulations.
  3. Attempt to study diblock copolymer phase behaviour.
  4. Determine the lower bound of molecular weights that can be investigated (limited by instability).
  5. Generate equilibrium phases.
  6. Try to detect order-disorder transitions.
  7. Attempt to produce diblock copolymer phase diagrams.
  8. Compare results to experiments, theory and other simulations.

Technical Motivations


  1. There is no saddle-point approximation applied to \(W_{+}\) in CL-FTS (it is present in L-FTS).
  2. Anderson mixing (the costly part of L-FTS) is not necessary in CL-FTS.
  3. Field-based models can be significantly faster than particle simulations for intermediate/long chains.
  4. Can take advantage of GPU hardware to maximise computational speed/efficiency.

Background Theory


Diblock Copolymers (DBCs)


A diblock copolymer consists of an A-block of \( N_{A} \) segments and a B-block of \( N_{B} \) segments (figure 1), joined together at one end to give a linear chain with \( N = N_{A} + N_{B} \) total segments and composition fraction \( f_{A} = N_{A}/N \).


Figure 1 - A diblock copolymer with \( N_{A} \) segments in the A-block and \( N_{B} \) segments in the B-block.


The strength of interactions between A and B segments is controlled by the Flory-Huggins parameter, \( \chi \), which is inversely related to temperature (i.e., low \( \chi \) corresponds to high temperature). When the A-B interactions in a melt of DBCs become sufficiently unfavourable (i.e., \( \chi \) becomes large enough) they microphase-separate, forming periodically ordered structures with domains on the nanometer scale. The equilibrium structures include the classical lamellar (L), cylindrical (C) and spherical (S) phases, as well as the complex gyroid (G) and Fddd phases (see figure 2).


Figure 2 - The equilibrium ordered structures: lamellar (L), cylindrical (C), spherical (S), gyroid (G) and Fddd.



Self-consistent Field Theory (SCFT)


Imagine looking at a single DBC in a melt. It has complicated interactions with all the other polymers in the system and capturing them all is a difficult mathematical and computational challenge.

SCFT addresses this problem by replacing the interactions that a polymer has with other chains by mean fields. The theory has been enormously successful in the study of DBCs, producing the famous equilibrium phase diagram shown in figure 3. It displays the various ordered phases, which are separated from the disordered phase by the order-disorder transition (ODT).

Figure 3 - The self-consistent field theory phase diagram for diblock copolymer melts.
Figure 4 - A schematic representation of the effect of fluctuations on the SCFT phase diagram in figure 3.


Limitations of SCFT


Unfortunately, the mean-field nature of SCFT means that it doesn't include the fluctuation effects that are inherent in real statistical systems. As a result, we are interested in fluctuation corrections to figure 3, which are controlled by the invariant polymerisation index, \( \bar{N} = a^{6}\rho_{0}^{2}N \), where \( \rho_{0} \) is the segment density and \( a \) is the statistical segment length.

SCFT corresponds to the case where \( \bar{N} = \infty \), with fluctuations increasing as \( \bar{N} \) is reduced. The main effects of fluctuations on the phase diagram in figure 3 are that the ODT gets shifted upward, and we get direct transitions between the disordered and ordered phases. For example, there will be a direct transition between gyroid and disorder, as indicated schematically in figure 4. We can see this behaviour in experiments...

Experimental Phase Diagrams


In 1994, Bates et al. performed experiments to produce phase diagrams for three different DBC chemistries: PE-PEP (\( \bar{N}=27 \times 10^{3} \)), PEP-PEE (\( \bar{N}=3.4 \times 10^{3} \)) and PI-PS (\( \bar{N}=1.1 \times 10^{3} \)). Each of the phase diagrams exhibited the spherical, cylindrical and lamellar phases, but no gyroid or Fddd. There was, however, a region of another phase, known as perforated lamellar, where SCFT predicts gyroid. Indeed, later experiments by Hajduk et al. and Vigild et al. showed that perforated lamellar is actually a long-lived metastable state that eventually converts to gyroid. More recent experiments by Wang et al. and Takenaka et al. have also confirmed the presence of Fddd on both sides of the phase diagram.

To reiterate, the significant differences between the experimental phase diagrams and the SCFT phase diagram are due to the effects of fluctuations near the ODT: the shifting up of the ODT and direct transitions between disorder and the ordered phases. It is these effects that we wish to capture in a field theory, and the next section shows how this problem is approached.

Polymer field theory


We start with a particle-based model (in our case a bead-spring model), which is fully specified by the polymer coordinates, \( {\bf r}_{\alpha,i} \), where \( \alpha = 1,2,\ldots,n \) indexes the \( n \) polymers in the system and \( i = 1,2,\ldots,N \) indexes the \( N \) monomers of a polymer. The particle-based Hamiltonian takes the form $$\begin{eqnarray} \frac{H(\{ {\bf r}_{\alpha,i} \})}{k_{B}T} &=& \frac{3}{2a^{2}} \sum\limits_{\alpha=1}^{n} \sum\limits_{i=1}^{N-1} ( {\bf r}_{\alpha,i} - {\bf r}_{\alpha,i+1})^{2} \nonumber \\ &+& \rho_{0}\chi_{b} \int \hat{\phi}_{A} \hat{\phi}_{B} d{\bf r}, \nonumber \end{eqnarray}$$ where \(\hat{\phi}_{A}\) and \(\hat{\phi}_{B}\) are the concentrations of A and B monomers. The first term is the stretching energy arising from bonds between monomers, while the second is the energy from A-B interactions.

We then perform a mathematical transformation to give us a field-based model with Hamiltonian $$\begin{eqnarray} \frac{H[W_{-},W_{+}]}{k_{B}T} &=& -n \ln Q[W_{-},W_{+}] \nonumber \\ &+& \rho_{0} \int \left( \frac{ W_{-}^{2}({\bf r})}{\chi_{b}} - W_{+}({\bf r}) \right) d{\bf r}. \nonumber \end{eqnarray}$$ This field-based Hamiltonian depends on two fields, \( W_{-}({\bf r}) \) and \( W_{+}({\bf r}) \), as well as the partition function of a single chain in these fields, \( Q \). The composition field, \( W_{-}({\bf r}) \), couples to the difference in A and B concentrations, while the pressure field, \( W_{+}({\bf r}) \), couples to the total concentration. Note that \( W_{+}({\bf r}) \) is imaginary-valued, which will become important as the theory progresses. In order to do calculations, we represent the fields on a mesh with grid spacing \( \Delta \), as shown in figure 6.


Figure 5 - A particle-based model of a diblock copolymer melt in a lamellar phase. The system is fully specified by the monomer coordinates, \( {\bf r}_{\alpha, i} \), where \( \alpha \) is the molecule index and \( i \) is the monomer index.
Figure 6 - A field-based model of a diblock copolymer melt in a lamellar phase. The system is fully specified by the real-valued composition field, \( W_{-}({\bf r}) \), and the imaginary-valued pressure field, \( W_{+}({\bf r}) \).


SCFT solution to polymer field theory


Self-consistent field theory obtains the fields by solving the saddle-point equations: \[ \left( \frac{\mathcal{D}H}{\mathcal{D}W_{-}} \right) = 0, \quad \left( \frac{\mathcal{D}H}{\mathcal{D}W_{+}} \right) = 0, \] yielding the saddle-point field configurations \( w_{-} \) and \( w_{+} \). The free energy can then be approximated by \( F = H [w_{-}, w_{+}] \). However, since SCFT does not include fluctuation effects, we have to take a different approach...

Field-Theoretic Simulations (FTS)


The free energy of a system is given by \( F = - k_{B}T \ln (Z) \), where the partition function, \( Z \), is a sum over all possible microstates. For the field-based system, the partition function takes the form \[ Z \propto \int \exp \left( - \frac{H[W_{-}, W_{+}]}{k_{B}T} \right) \mathcal{D}W_{-} \mathcal{D}W_{+}. \] In field theoretic simulations, one attempts to simulate the full partition function directly, but there are still some issues to contend with.

Firstly, the \( W_{+} \) field is imaginary-valued, which gives us a complex-valued Boltzmann weight (the exponential term). This means that it is not possible to use standard simulation techniques.

Secondly, since we represent the fields on a mesh, one might expect the results to become more accurate as the grid spacing is reduced. However, this is not the case, due to the appearance of an ultraviolet (UV) divergence that must also be dealt with.

Complex Langevin FTS (CL-FTS)


In this project, the complex-valued Boltzmann weight is dealt with by performing complex Langevin field-theoretic simulations (CL-FTS), where the fields evolve according to the equations: $$\begin{eqnarray} W_{-}({\bf r}; \tau + \delta \tau) &=& W_{-}({\bf r}; \tau) \nonumber \\ &-& \left( \frac{\mathcal{D}H}{\mathcal{D}W_{-}} \right) \delta \tau + \mathcal{N} (0, \sigma), \end{eqnarray}$$ and $$\begin{eqnarray} W_{+}({\bf r}; \tau + \delta \tau) &=& W_{+}({\bf r}; \tau) \nonumber \\ &-& \left( \frac{\mathcal{D}H}{\mathcal{D}W_{+}} \right) \delta \tau + i\mathcal{N} (0, \sigma), \end{eqnarray}$$ where \( \mathcal{N} (0,\sigma) \) is real-valued Gaussian noise with zero mean and standard deviation given by: \[ \sigma = \sqrt{\frac{2 \delta \tau}{\Delta^{3}\rho_{0}}}. \] However, there have been some issues with this approach discussed in the literature, including an instability in the complex-Langevin dynamics that makes it difficult to reduce \( \bar{N} \) to values typical of experiments. In previous work, we dealt with this issue by approximating the \( W_{+} \) field by its saddle-point value while allowing \( W_{-} \) to remain fully fluctuating, resulting in a real-valued Boltzmann weight and the ability to apply standard Langevin dynamics to evolve the simulation. Details of this L-FTS method can be seen here.
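To make the update rule above concrete, the sketch below shows how a single explicit update step of this form might look as a CUDA kernel. It is only a minimal sketch: the actual code uses the adaptive predictor-corrector scheme described later, and the kernel and array names here are illustrative assumptions.

#include <thrust/complex.h>

// Illustrative sketch: one explicit complex-Langevin update of W-(r) and W+(r)
// at every mesh point. dHdWm and dHdWp hold the functional derivatives
// DH/DW- and DH/DW+, while noise_m and noise_p hold real Gaussian random
// numbers drawn from N(0, sigma).
__global__ void cl_update(thrust::complex<double>* Wm, thrust::complex<double>* Wp,
                          const thrust::complex<double>* dHdWm,
                          const thrust::complex<double>* dHdWp,
                          const double* noise_m, const double* noise_p,
                          double dt, int M)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m < M) {
        // W-(r) receives real-valued noise...
        Wm[m] += -dHdWm[m] * dt + noise_m[m];
        // ...whereas W+(r) receives purely imaginary noise.
        Wp[m] += -dHdWp[m] * dt + thrust::complex<double>(0.0, noise_p[m]);
    }
}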

In this project, CL-FTS is implemented and compared to L-FTS in order to assess the relative stability, speed and precision of the two methods.

Dealing with the UV divergence


Now that we have dealt with the issue of the complex-valued Boltzmann weight arising in polymer field theory, we turn our attention back to the second problem: the UV divergence. Fortunately, polymer field theory is renormalisable. This means that the UV divergence can be removed by expressing all results in terms of an effective interaction parameter introduced by the Morse group. For the sake of simplicity, we can use the linear expression, \[ \chi = z_{\infty} \chi_{b}. \] Here, \( z_{\infty} \) is the relative number of intermolecular contacts (the number of interactions a chain segment has with segments of other molecules), a parameter that can be calculated analytically. The bare interaction parameter, \( \chi_{b} \), is the parameter used directly in the simulations. It is possible to go beyond the linear approximation and calculate higher-order contributions to \( \chi \) by performing a Morse calibration. However, the details are outside the scope of this brief introduction to the background theory.

CL-FTS Simulation Code Requirements


Broad Simulation Requirements


  1. Read an input file defining simulation parameters and initial fields.
  2. Define a 3D mesh and assign initial fields to it.
  3. Use a Langevin update to propagate both \( W_{-} \) and \( W_{+} \) forward in simulation time.
  4. Sample quantities such as the structure function, \( S(k) \), and order parameter, \( \Psi \), to facilitate detecting the ODT.
  5. Output the fields and averages so they can be visualised.
  6. Repeat steps 3 - 5.

Technical Simulation Requirements


  1. Use C/C++ (a low-level language) for fast execution.
  2. Identify code that would benefit from GPU parallelisation.
  3. Use the CUDA Thrust library, which provides native support for complex numbers.
  4. Write custom functions for execution on the GPU.

GPU Parallelised Code Areas


Subsections


  1. Propagator Calculation
  2. Calculating \( \partial \ln (Q) / \partial L \) to determine equilibrium box size
  3. Random Number Generation
  4. Langevin update of \( W_{-}({\bf r}) \) and \( W_{+}({\bf r}) \)
  5. Partition Function, \( Q \)
  6. Composition and Pressure fields, \( \phi_{-}({\bf r}) \) and \( \phi_{+}({\bf r}) \)
  7. Calculation of the Order Parameter, \( \Psi \)


Propagator Calculation


Without going into the full mathematical details, the simulation needs to calculate the composition field, \( \phi_{-}({\bf r}) \), and total concentration field, \( \phi_{+}({\bf r}) \), in order to update \( W_{-}({\bf r}) \) and \( W_{+}({\bf r}) \) at each time step of the simulation.

However, \( \phi_{-}({\bf r}) \) and \( \phi_{+}({\bf r}) \) themselves necessitate calculation of forward and backward propagators given by the recursion relations: \[ q_{i+1}({\bf r}) = h_{i+1}({\bf r}) \int g(R) q_{i}({\bf r} - {\bf R}) d{\bf R}, \] \[ q_{i-1}^{\dagger}({\bf r}) = h_{i-1}({\bf r}) \int g(R) q_{i}^{\dagger}({\bf r} - {\bf R}) d{\bf R}, \] where \( i \) indexes the polymer segment, \( h_{i}({\bf r}) \) is related to the \( i^{th} \) monomer's field energy and \( g({\bf r}) \) is related to the stretching energy of polymer bonds (mathematical details).

Calculating the propagators from these recursion relations is the most costly part of the FTS simulations, requiring transforms to and from reciprocal space for \( N-1 \) of the monomers. Since there are both forward and backward propagators, this results in \( 2(N-1) \) fast Fourier transforms (FFTs) having to be performed every time \( \phi_{-}({\bf r}) \) and \( \phi_{+}({\bf r}) \) are calculated. The fields must be evaluated at every time step (and in L-FTS, where the self-consistent equation for \( w_{+} \) is solved with \( n_{AM} \) Anderson-mixing iterations per step, the count rises to \( 2(N-1)n_{AM} \) FFTs per time step), so the cost soon adds up.
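To put a number on this: for a chain length typical of the simulations shown later (\( N = 90 \), as in figure 7), a single evaluation of \( \phi_{-}({\bf r}) \) and \( \phi_{+}({\bf r}) \) already requires \( 2(N-1) = 178 \) three-dimensional FFTs.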

In serial versions of the code, the transforms are done with the FFTW fast Fourier transform library. However, performing Fourier transforms is highly parallelisable so this portion of the code has been rewritten to take advantage of the CUDA cuFFT library on the GPU.
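To illustrate the structure of a single step of this recursion on the GPU, a simplified (un-batched) sketch is given below, assuming complex-to-complex (Z2Z) transforms since the CL-FTS fields are complex-valued. The helper kernel and array names are assumptions for illustration only; the production code instead batches the forward and backward propagators as described in the following subsections.

#include <cufft.h>
#include <cuComplex.h>

// Illustrative helper: out[m] = scale * a[m] * b[m], used for applying g(k) in
// reciprocal space and h_{i+1}(r) in real space.
__global__ void pointwise_mul(cufftDoubleComplex* out, const cufftDoubleComplex* a,
                              const cufftDoubleComplex* b, double scale, int M)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m < M) out[m] = cuCmul(make_cuDoubleComplex(scale, 0.0), cuCmul(a[m], b[m]));
}

// One (un-batched) step of the forward recursion: q_{i+1} = h_{i+1} * F^{-1}[ g(k) F[q_i] ].
void propagator_step(cufftHandle plan, cufftDoubleComplex* d_q,      // q_i in, q_{i+1} out
                     cufftDoubleComplex* d_qk,                       // k-space work array
                     const cufftDoubleComplex* d_gk,                 // bond function g(k)
                     const cufftDoubleComplex* d_h,                  // h_{i+1}(r)
                     int M)
{
    int T = 256, B = (M + T - 1) / T;
    cufftExecZ2Z(plan, d_q, d_qk, CUFFT_FORWARD);                    // q_i(r) -> q_i(k)
    pointwise_mul<<<B, T>>>(d_qk, d_qk, d_gk, 1.0, M);               // apply g(k)
    cufftExecZ2Z(plan, d_qk, d_q, CUFFT_INVERSE);                    // back to real space
    pointwise_mul<<<B, T>>>(d_q, d_q, d_h, 1.0 / M, M);              // apply h_{i+1}(r) and the 1/M factor of cuFFT's unnormalised inverse
}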

In order to maximise the benefits of performing FFTs on the GPU, there are a number of other considerations that must be taken into account.

Minimising Data Transfers

Firstly, it is important to minimise the number of data transfers between CPU (host) memory and GPU (device) memory due to the costly overhead associated with copying data between them. As a result, the code has been written such that all of the propagators are calculated on the GPU without any data needing to be copied back to the host. This required that \( h_{i}({\bf r}) \) and \( g({\bf r}) \) were both calculated and accessible in GPU memory.

Batched FFTs

Secondly, performing an FFT using either the FFTW or cuFFT libraries requires setup of a "plan". A plan provides the FFT library with the information it needs to perform the FFT, such as whether to do a forward/backward transform and the size/dimensionality of the mesh. Two plans are required in this code: one for transforming from real to reciprocal space, the other for transforming in the reverse direction.

Since we must calculate recursion relations for forward and backward propagators, a sensible first approach would be the following:

  1. Set up plans for forward (\( \mathcal{F_{r \rightarrow k}} \)) and backward (\( \mathcal{F_{k \rightarrow r}} \)) Fourier transforms using the cufftPlan3d function of the cuFFT library.
  2. Set initial conditions, \( q_{1}({\bf r}) \) and \( q_{N}^{\dagger}({\bf r}) \), of the propagators.
  3. Calculate \( q_{i+1}({\bf r}) \), the forward propagator of monomer \( i+1 \), which involves transforms to and from reciprocal space via the \( \mathcal{F_{r \rightarrow k}} \) and \( \mathcal{F_{k \rightarrow r}} \) cuFFT plans.
  4. Calculate \( q_{i-1}^{\dagger}({\bf r}) \), the backward propagator of monomer \( i-1 \), which involves transforms to and from reciprocal space via the \( \mathcal{F_{r \rightarrow k}} \) and \( \mathcal{F_{k \rightarrow r}} \) cuFFT plans.
  5. Repeat steps 3-4 until \( q_{N}({\bf r}) \) and \( q_{1}^{\dagger}({\bf r}) \) have been calculated.

The above algorithm certainly works, but can be improved upon by considering that it may not be fully utilising the GPU.

To maximise throughput on the GPU (i.e., the amount of computation completed per unit time), we would ideally give it enough work that all of its threads are utilised in parallel. However, the FFTs being performed on the GPU in the algorithm above are limited by the simulation's 3D mesh size. For example, suppose there are \( 60000 \) threads available on the GPU, but we run a simulation that uses only \( 30^{3} = 27000 \) threads during an FFT. That means there are another \( 33000 \) threads not being used. The problem is that, in spite of having plenty of work for the GPU to do, we have to calculate the propagators sequentially, since \( q_{i+1}({\bf r}) \) depends on \( q_{i}({\bf r}) \) and \( q_{i-1}^{\dagger}({\bf r}) \) depends on \( q_{i}^{\dagger}({\bf r}) \), etc.

However, it should be noted that \( q({\bf r}) \) and \( q^{\dagger}({\bf r}) \) can be calculated independently of one another, i.e., the calculations for \( q({\bf r}) \) do not have any dependence on \( q^{\dagger}({\bf r}) \) and vice versa. Therefore, there would be no issue with calculating both of them at the same time if it were possible. Fortunately, the cuFFT library provides the cufftPlanMany() function, which allows us to set up a plan in which multiple Fourier transforms are batch processed. Put another way, cufftPlanMany() enables us to fast Fourier transform \( q({\bf r}) \) and \( q^{\dagger}({\bf r}) \) on the GPU at the same time, thereby doubling the number of threads required and taking better advantage of the hardware for smaller systems.

The only complication in batch processing \( q({\bf r}) \) and \( q^{\dagger}({\bf r}) \) is that they must be in contiguous memory, i.e., the arrays for \( q({\bf r}) \) and \( q^{\dagger}({\bf r}) \) must sit next to each other in the computer's memory. As a result, the arrays in this simulation have been set up in such a way that \( q_{1}({\bf r}) \) and \( q^{\dagger}_{N}({\bf r}) \) reside next to each other in memory, as do \( q_{2}({\bf r}) \) and \( q^{\dagger}_{N-1}({\bf r}) \), etc.

This reduces our algorithm to:

  1. Set up plans for forward (\( \mathcal{\hat{F}_{r \rightarrow k}} \)) and backward (\( \mathcal{\hat{F}_{k \rightarrow r}} \)) Fourier transforms using the cufftPlanMany() function of the cuFFT library.
  2. Set initial conditions, \( q_{1}({\bf r}) \) and \( q_{N}^{\dagger}({\bf r}) \), of the propagators.
  3. Calculate \( q_{i+1}({\bf r}) \), the forward propagator of monomer \( i+1 \), and \( q_{i-1}^{\dagger}({\bf r}) \), the backward propagator of monomer \( i-1 \), at the same time via transforms to and from reciprocal space with the \( \mathcal{\hat{F}_{r \rightarrow k}} \) and \( \mathcal{\hat{F}_{k \rightarrow r}} \) cufftPlanMany() plans.
  4. Repeat step 3 until \( q_{N}({\bf r}) \) and \( q_{1}^{\dagger}({\bf r}) \) have been calculated.
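A sketch of setting up the contiguous propagator storage and the batched plan used in step 1 is given below. The function and array names, and the choice to store all \( N \) propagator pairs in one allocation, are illustrative assumptions rather than the code's exact layout.

#include <cufft.h>
#include <cuda_runtime.h>

// Illustrative setup of a contiguous propagator array and a batched cuFFT plan
// for an mx*my*mz mesh with M = mx*my*mz points. Each pair (q_i, q^dagger_{N+1-i})
// occupies two adjacent blocks of M complex values, so it can be transformed in one batch.
void setup_batched_ffts(int mx, int my, int mz, int N,
                        cufftDoubleComplex** d_qqd, cufftHandle* plan)
{
    size_t M = (size_t)mx * my * mz;

    // Single contiguous allocation holding q and q^dagger for all N monomers.
    cudaMalloc((void**)d_qqd, 2 * N * M * sizeof(cufftDoubleComplex));

    // Batched plan: two 3D Z2Z transforms (q and q^dagger) per execution.
    int n[3] = { mx, my, mz };
    cufftPlanMany(plan,
                  3, n,                 // rank 3 and mesh dimensions
                  NULL, 1, (int)M,      // NULL embed => packed input, batches M apart
                  NULL, 1, (int)M,      // NULL embed => packed output, batches M apart
                  CUFFT_Z2Z,            // double-precision complex-to-complex
                  2);                   // batch of 2: forward + backward propagator
}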


Calculating \( \partial \ln (Q) / \partial L \) to determine equilibrium box size


For a cubic mesh, the equilibrium simulation box size can be determined by adjusting the box lengths until the ensemble averages of \[ L_{\nu} \frac{\partial \ln (Q)}{\partial L_{\nu}} = \frac{L_{\nu}}{Q} \frac{\partial Q}{\partial L_{\nu}}, \] are equal in each of the \( \nu = x,y,z \) dimensions. Here, \( L_{\nu} \) is the length of the box in dimension \( \nu \) and \( Q \) is the partition function.

In order to determine this quantity efficiently, the propagators are calculated in reciprocal space. While the mathematical details are not shown here (see Beardsley and Matsen, and Riggleman and Fredrickson), the calculation follows a very similar scheme to that detailed in the Propagator Calculation section above. As a result, all of the same GPU parallelisations are used in calculating \( \partial \ln (Q) / \partial L \), and the already allocated GPU memory can be reused. An added benefit is that this procedure also returns the partition function, \( Q \).

Random Number Generation


At each "time" step in the simulation, Gaussian random noise is added to the fields at every point on the underlying mesh. Since the total number of mesh points scales with the cube of the linear dimension (e.g., a cubic mesh with linear dimension \( m_{x} = 64 \) has a total of \( M = m_{x}^{3} = 262,144 \) points), a significant number of random numbers can be required, and it makes sense to generate them in parallel on the GPU for speed. Furthermore, having the random numbers already accessible on the GPU means that we can subsequently use them to update the fields on the GPU (see next section), rather than incurring the cost of copying the random numbers back to the CPU (host) and performing the updates there.

To achieve GPU random number generation, I have taken advantage of the CUDA cuRAND library. Its curandGenerateNormalDouble(...) function generates normally distributed random numbers directly, rather than requiring a uniform distribution to be generated and transformed manually:

curandStatus_t curandGenerateNormalDouble(
  curandGenerator_t generator,
  double *outputPtr, size_t n,
  double mean, double stddev
);
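A minimal usage sketch is shown below. The generator type, seeding and wrapper function are assumptions for illustration, not necessarily the choices made in the actual code.

#include <curand.h>

// Illustrative: generate M Gaussian random numbers N(0, sigma) directly in GPU
// memory, ready to be added to the fields during the Langevin update.
void generate_noise(double* d_noise, size_t M, double sigma, unsigned long long seed)
{
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);   // default pseudo-RNG (assumed choice)
    curandSetPseudoRandomGeneratorSeed(gen, seed);

    // For pseudorandom generators, M should be even (normals are produced in pairs).
    curandGenerateNormalDouble(gen, d_noise, M, 0.0, sigma);

    curandDestroyGenerator(gen);
}

In practice the generator would be created once at start-up and reused at every step, rather than being rebuilt inside each call as in this sketch.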


Langevin update of \( W_{-}({\bf r}) \) and \( W_{+}({\bf r}) \)


During testing, it was found that the Leimkuhler-Matthews algorithm previously used in L-FTS did not provide adequate stability in CL-FTS (the numerics would blow up within a relatively short amount of simulation time). Instead, a predictor-corrector algorithm with an adaptive time step was used (mathematics not detailed here). The size of the time step taken during each Langevin update is scaled in inverse proportion to the largest force being exerted on the fields. This makes the simulation more stable by taking small time steps when forces are large, and larger time steps when the forces are small.

Fortunately, this algorithm can be applied to each mesh point independently, so the code has been written to update \( W_{-}({\bf r}) \) and \( W_{+}({\bf r}) \) in both the predictor and corrector steps on the GPU. Furthermore, the function thrust::max_element() uses the GPU to rapidly identify the largest forcing term needed for the adaptive time step.
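The sketch below illustrates the general idea of the adaptive step. The precise scaling and capping rule used in the code are not reproduced here; the function and parameter names are assumptions.

#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// Illustrative adaptive time step: shrink delta tau when the largest force on
// the fields is large, and let it return towards a nominal value when forces
// are small. d_force_mag holds |DH/DW| at every mesh point.
double adaptive_time_step(const thrust::device_vector<double>& d_force_mag,
                          double dt_nominal, double force_scale)
{
    // thrust::max_element runs the reduction on the GPU and returns an iterator;
    // dereferencing it copies the single maximum value back to the host.
    double f_max = *thrust::max_element(d_force_mag.begin(), d_force_mag.end());

    // Scale the step in inverse proportion to the largest force, capped at dt_nominal.
    double dt = dt_nominal * force_scale / f_max;
    return (dt < dt_nominal) ? dt : dt_nominal;
}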


Partition Function, \( Q \)


The partition function, \( Q \), is required for the Langevin updates of \( W_{-}({\bf r}) \) and \( W_{+}({\bf r}) \), as well as in the calculation of the composition field, \( \phi_{-}({\bf r}) \), and the pressure field, \( \phi_{+}({\bf r}) \). The simplest way to calculate \( Q \) is to integrate the forward propagator of the \( N^{th} \) monomer over all space: \[ Q = \int q_{N}({\bf r}) d{\bf r}, \] with the integral becoming a sum in the simulation code.

Since the propagator has already been computed on the GPU, it would be advantageous to calculate the sum there too, rather than copy \( q_{N}({\bf r}) \) back to the CPU (host) for summation. Doing so requires a reduction algorithm, so the reduce(...) function of the CUDA Thrust library is used.
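For example, with the propagator stored in a thrust::device_vector, the sum might look like the sketch below. The function name and the normalisation by the mesh cell volume are illustrative assumptions about conventions, not a statement of the code's exact definition.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/complex.h>

// Illustrative: Q = integral of q_N(r) over the box, evaluated as a sum on the
// GPU. d_qN holds q_N(r) at the M mesh points and dV is the mesh cell volume.
thrust::complex<double> partition_function(const thrust::device_vector<thrust::complex<double>>& d_qN,
                                           double dV)
{
    thrust::complex<double> sum =
        thrust::reduce(d_qN.begin(), d_qN.end(),
                       thrust::complex<double>(0.0, 0.0),          // initial value
                       thrust::plus<thrust::complex<double>>());   // reduction operator
    return sum * dV;                                               // discretised integral
}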

As a side note, \( Q \) is also calculated as a byproduct of determining \( \partial \ln (Q) / \partial L \). However, since the function to get \( \partial \ln (Q) / \partial L \) is not necessarily called at every Langevin step, the propagator reduction method (which must be called at each step) remains the primary method of determining \( Q \).


Composition and Pressure fields, \( \phi_{-}({\bf r}) \) and \( \phi_{+}({\bf r}) \)


The composition and pressure fields form part of the update to \( W_{-}({\bf r}) \) and \( W_{+}({\bf r}) \) at the next time step. These two fields are given by the equations \[ \phi_{-}({\bf r}) = \frac{1}{NQ} \sum\limits_{i=1}^{N} \gamma_{i} \frac{q_{i}({\bf r}) q_{i}^{\dagger}({\bf r})}{ h_{i}({\bf r}) }, \] and \[ \phi_{+}({\bf r}) = \frac{1}{NQ} \sum\limits_{i=1}^{N} \frac{q_{i}({\bf r}) q_{i}^{\dagger}({\bf r})}{ h_{i}({\bf r}) }, \] where \( \gamma_{i} = +1 \) if monomer \( i \) is A-type, and \( -1 \) if it is B-type.

These are costly fields to evaluate serially, because the product \( q_{i}({\bf r}) q_{i}^{\dagger}({\bf r}) \) has to be evaluated \( N \) times for every mesh point. However, as discussed in the previous sections, all of the quantities required in the above equations are already available on the GPU. It is therefore straightforward to consider a single \( i \) and calculate \( q_{i}({\bf r}) q_{i}^{\dagger}({\bf r}) \) for all mesh points in parallel. By repeating this process for the \( i = 1 \ldots N \) monomers and summing the results, we vastly reduce the time taken to evaluate the fields.
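A sketch of this scheme is given below. The kernel and array names are illustrative, and the final division of both fields by \( NQ \) is assumed to happen afterwards on the host side of the loop.

#include <thrust/complex.h>

// Illustrative kernel: accumulate the contribution of a single monomer i to
// phi-(r) and phi+(r) at every mesh point in parallel. It would be called in a
// host-side loop over i = 1...N, after which both fields are divided by N*Q.
__global__ void accumulate_phi(thrust::complex<double>* phi_m,        // phi-(r)
                               thrust::complex<double>* phi_p,        // phi+(r)
                               const thrust::complex<double>* q_i,    // q_i(r)
                               const thrust::complex<double>* qd_i,   // q^dagger_i(r)
                               const thrust::complex<double>* h_i,    // h_i(r)
                               double gamma_i,                        // +1 (A) or -1 (B)
                               int M)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m < M) {
        thrust::complex<double> w = q_i[m] * qd_i[m] / h_i[m];
        phi_p[m] += w;                 // total concentration
        phi_m[m] += gamma_i * w;       // A-B composition
    }
}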


Calculation of the Order Parameter, \( \Psi \)


As shown in detail in the project applying metadynamics to L-FTS, it was possible to identify the ordered phase via an order parameter: \[ \Psi = \frac{1}{R_{0}^{3}} \left( \frac{V}{(2 \pi)^{3}} \int f(k) \lvert W_{-}({\bf k}) \rvert^{l} d{\bf k} \right)^{1/l}, \] where \( R_{0} \) is the end-to-end length of a polymer, \( V \) is the volume of the system and \( f(k) \) is a Heaviside-like screening function that limits the contribution of large wavevectors to the integral. The parameter, \( l \), is chosen so that \( \Psi \) exhibits a large value when the system is in the ordered phase and a small value in the disordered phase. In L-FTS the fields are real-valued, so any choice of \( l \) will result in \( \Psi \) being real-valued - this is not the case for the complex fields in CL-FTS. Fortunately, the optimal choice of \( l=4 \) determined in L-FTS metadynamics reduces \( \Psi \) to: \[ \Psi = \frac{1}{R_{0}^{3}} \left( \frac{V}{(2 \pi)^{3}} \int f(k) \lvert W_{-}({\bf k})W_{-}({\bf -k}) \rvert^{2} d{\bf k} \right)^{1/4}, \] which continues to be real-valued in CL-FTS.

Calculating the order parameter requires a Fourier transform of the real-space field, \( W_{-}({\bf r}) \), into reciprocal space to obtain \( W_{-}({\bf k}) \). A Fourier transform can be a costly operation to perform on the host (CPU), with the execution time scaling with the number of mesh points used to represent the field. It may also be the case that the user wishes to output \( \Psi \) at every time step, so this relatively costly calculation could end up being performed frequently. As a result, I have used the CUDA cuFFT library to perform the Fourier transform on the GPU with the function:

cufftResult cufftExecD2Z(
   cufftHandle plan,
   cufftDoubleReal *idata,
   cufftDoubleComplex *odata);
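A minimal usage sketch of this call is shown below; the wrapper function and array names are illustrative, and in practice the plan would be created once and reused rather than rebuilt at every call.

#include <cufft.h>

// Illustrative: transform the real-space data w(r) (stored as doubles) into
// reciprocal space. For an mx*my*mz mesh, the D2Z output holds mx*my*(mz/2+1)
// complex coefficients.
void field_to_kspace(int mx, int my, int mz, double* d_wr, cufftDoubleComplex* d_wk)
{
    cufftHandle plan;
    cufftPlan3d(&plan, mx, my, mz, CUFFT_D2Z);   // real-to-complex 3D plan
    cufftExecD2Z(plan, d_wr, d_wk);
    cufftDestroy(plan);
}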

The next area that can be parallelised is identified by noting that the equation for the order parameter contains an integral that becomes a summation on our discrete mesh: $$\begin{equation} \int f(k) \lvert W_{-}({\bf k}) \rvert^{l} d{\bf k} \rightarrow C \sum_{i} f(k_{i}) \lvert W_{-}({\bf k}_{i}) \rvert^{l}. \end{equation}$$ If we look at the contribution to the sum from a particular wavevector, \( {\bf k}_{i} \), it is clear that neither \( f(k_{i}) \) nor \( W_{-}({\bf k}_{i}) \) depends on any other wavevector, \( {\bf k}_{j} \). As a result, we can divide up the hard work by calculating each contribution in parallel on the GPU. The next step is to add the results from each \( {\bf k}_{i} \) together again. We could achieve this by copying the array of results from the GPU back to the host and summing them with the CPU. However, copying a large array between host and GPU memory can become costly, especially if repeated often. As a result, I have utilised the CUDA Thrust library function:

OutputType thrust::transform_reduce(
   InputIterator first,
   InputIterator last,
   UnaryFunction unary_op,
   OutputType init,
   BinaryFunction binary_op);

Whilst I will not go through the function in detail, it takes iterators specifying the first and last positions of the input array, a function that will be applied to the input data (in our case, one that calculates the product \( f(k_{i}) \lvert W_{-}({\bf k}_{i}) \rvert^{l} \)), the initial summation value (zero), and a reduction operator (here, the built-in operation thrust::plus<double>() for summation). These parameters are used to perform a sum of the transformed input data on the GPU, finally returning a single double-precision value (not a large array) to host memory. In the code, the function call looks like:

Psi = thrust::transform_reduce(
   _z,
   _z + _M,
   Psi4_calc(),
   0.0,
   thrust::plus<double>());

where _z is a thrust::zip_iterator<IteratorTuple> (not discussed here), _M and _Mk are constants corresponding to the array lengths in real and reciprocal space, and the function:

Psi4_calc()

calculates the product \( f(k_{i}) \lvert W_{-}({\bf k}_{i}) W_{-}(-{\bf k}_{i}) \rvert^{2} \). The steps outlined above are contained within the public function:

double get_Psi4(thrust::device_vector<thrust::complex<double>> &w_gpu, bool isWkInput = false)

in the order_parameter class. It requires, as a parameter, a reference to either \( W_{-}({\bf r}) \) (isWkInput = false) or \( W_{-}({\bf k}) \) (isWkInput = true) stored on the GPU.
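To illustrate, a hypothetical version of the Psi4_calc functor might look like the following. The actual tuple layout behind the zip iterator is not shown in this write-up, so the argument packing here is an assumption.

#include <thrust/complex.h>
#include <thrust/tuple.h>

// Hypothetical functor for thrust::transform_reduce: given the screening value
// f(k) and the field values W-(k) and W-(-k) at one wavevector, return the
// contribution f(k) |W-(k) W-(-k)|^2 to the order-parameter sum.
struct Psi4_calc
{
    __host__ __device__
    double operator()(const thrust::tuple<double,
                                          thrust::complex<double>,
                                          thrust::complex<double>>& t) const
    {
        double fk = thrust::get<0>(t);
        thrust::complex<double> prod = thrust::get<1>(t) * thrust::get<2>(t);
        return fk * thrust::norm(prod);   // norm() gives the squared magnitude
    }
};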


GPU CL-FTS Code


The GPU code for this project can be found on my github page at: https://github.com/tmbeardsley/clfts_gpu.




Figure 7 - Real and imaginary parts of \( W_{-}({\bf r}) \) at \( \bar{N} = 10^{5} \), \( N = 90 \), \( \chi N = 12 \), \( m = 32 \) and \( L = 4.209 \).