10.2 CUDA Final Report

Individual Contributions

Justus Dreßler: all members contributed equally

Thorsten Kröhl: all members contributed equally

Julius Halank: all members contributed equally

MS-1 Preparation

We read through the CUDA C++ Documentation (well part of it) and applied it to a small test example. The test example (in /cudeTests) consisted of a simple vector addition kernel and a main function to test it. We did initialy pick up a bad habit from it though which is cudaMallocManaged, since this turned out to severely slow down the code.

__global__ void vectorAdd(int *a, int *b, int *c, int N) {
    //calculate global thread id
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    //range check
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

Vector Addition Kernel

It helped us understand the way the GPU calls our code though. Each thread is calling the function with the same parameters except an implicit index given too it, which is its position in all threads. We used this later to tell each thread which cell to calculate in the simulation. CUDA makes 2D programming very easy in that case because we can just get a 2D thread id and use it to index our 2D arrays.

MS-2 Initial Kernel Development

We identified (to no ones surprise) that the key area we could speed up was the WavePropagate class. Since its dependent on the solvers, we decided that we would always use FWave for the GPU version. We then implemented the tsunami_lab::patches::WavePropagationCUDA class, first by copying the WavePropagation2D class and then modifying it to use CUDA. We first only implemented the setGhostCells functions on the GPU, since it was the easiest to implement and most self contained.

__global__ void setGhostCellsX(tsunami_lab::t_real *io_h, tsunami_lab::t_real *io_hu, tsunami_lab::t_idx i_nx, tsunami_lab::t_idx i_ny, tsunami_lab::t_boundary i_boundaryLeft, tsunami_lab::t_boundary i_boundaryRight)
{
    tsunami_lab::t_idx l_x = blockIdx.x * blockDim.x + threadIdx.x;
    tsunami_lab::t_idx l_y = blockIdx.y * blockDim.y + threadIdx.y;
    if (l_x > i_nx + 1 || l_y > i_ny + 1)
    {
        return;
    }

    if (l_x == 0 && i_boundaryLeft == tsunami_lab::t_boundary::OPEN)
    {
        io_h[(i_nx+2) * l_y] = io_h[1 + (i_nx+2) * l_y];
        io_hu[(i_nx+2) * l_y] = io_hu[1 + (i_nx+2) * l_y];
    }
    else if (l_x == i_nx + 1 && i_boundaryRight == tsunami_lab::t_boundary::OPEN)
    {
        io_h[l_x + (i_nx+2) * l_y] = io_h[l_x - 1 + (i_nx+2) * l_y];
        io_hu[l_x + (i_nx+2) * l_y] = io_hu[l_x - 1 + (i_nx+2) * l_y];
    }
}

Ghost Cell Setter Kernel

They worked quite well after figuring out some indexing issues. Admittedly they are far from optimal in regards to performance since we spawn a thread for each cell despite only needing to calculate the ghost cells. This means we are spawning a lot of threads that just instantly finish. We ignored this though, since the code was working first and later turned out to run fast enough to not be a bottleneck anyways.

While developing the setGhostCells kernels we figured out a nice performance improvement to our code. In our initial code we modified the bathymetry cells in each setGhostCells call. This is not necessary since the bathymetry is constant and we can just copy it once at the start of the simulation. So we added a initGhostCells function to the WavePropagationCUDA and WavePropagation2D classes which sets the bathymetry for the ghost cells. This reduces the amount of writes in each setGhostCells call by 33%.

We also went a bit ahead of the schedule and already implemented all the other kernels for the timeStep(i_scaling) function at the heart of the simulation.

void tsunami_lab::patches::WavePropagationCUDA::timeStep(t_real i_scaling)
{
    dim3 l_blockSize(16, 16);
    dim3 l_numBlock((m_nCellsx+2-1)/l_blockSize.x+1, (m_nCellsy+2-1)/l_blockSize.y+1);

    setGhostCellsX<<<l_numBlock,l_blockSize>>>(m_h, m_hu, m_nCellsx, m_nCellsy, m_boundaryLeft, m_boundaryRight);

    cudaMemcpy(m_hTemp, m_h, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(m_huvTemp, m_hu, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToDevice);
    netUpdatesX<<<l_numBlock,l_blockSize>>>(m_h, m_hu, m_hTemp, m_huvTemp, m_b, m_nCellsx, m_nCellsy, i_scaling);

    setGhostCellsY<<<l_numBlock,l_blockSize>>>(m_h, m_hv, m_nCellsx, m_nCellsy, m_boundaryBottom, m_boundaryTop);

    cudaMemcpy(m_hTemp, m_h, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(m_huvTemp, m_hv, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToDevice);
    netUpdatesY<<<l_numBlock,l_blockSize>>>(m_h, m_hv, m_hTemp, m_huvTemp, m_b, m_nCellsx, m_nCellsy, i_scaling);
}

Time Step Function

This timestep function still follows the old structure of the WavePropagation2D class, but encapsulates all actions into kernels. The (m_nCellsx+2-1)/l_blockSize.x+1 in the l_numBlock calculation might seem a bit odd, but it is just a way to calculate the integer ceil of (m_nCellsx+2)/l_blockSize.x. Taking the ceil is necessary since we might not have a perfect multiple of the block size in the number of cells and with our structure we can’t spawn less threads than cells. cudaMemcpyDeviceToDevice replaced the std::copy calls from the CPU version since we now have to move our data on the GPU and assume that cudaMemcpy is faster than our own implementation would be. We also use the property of CUDA that we can call multiple kernels in a row and they automatically get executed in order (so we have no race conditions that f.E. cudaMemcpy starts before the setGhostCells function finishes).

While developing the netUpdates kernels we noticed that we had to integrate the FWave solver into the same file since kernels are limited to calling code defined in the same file. So we just copied the FWave solver into the WavePropagationCUDA class and made it a __device__ function, meaning it can only be called from the GPU.

__global__ void netUpdatesY(tsunami_lab::t_real *o_h, tsunami_lab::t_real *o_hv, tsunami_lab::t_real *i_hTemp,tsunami_lab::t_real *i_huvTemp, tsunami_lab::t_real *i_b, tsunami_lab::t_idx i_nx, tsunami_lab::t_idx i_ny, tsunami_lab::t_real i_scaling)
{
    tsunami_lab::t_idx l_x = blockIdx.x * blockDim.x + threadIdx.x;
    tsunami_lab::t_idx l_y = blockIdx.y * blockDim.y + threadIdx.y;

    if (l_x > i_nx + 1 || l_y > i_ny + 1)
    {
        return;
    }
    // determine top and bottom cell-id
    tsunami_lab::t_idx l_ceB = l_x + l_y * (i_nx + 2);
    tsunami_lab::t_idx l_ceT = l_x + (l_y + 1) * (i_nx + 2);

    // compute net-updates
    tsunami_lab::t_real l_netUpdates[2][2];

    netUpdatesCUDA(i_hTemp[l_ceB],
                   i_hTemp[l_ceT],
                   i_huvTemp[l_ceB],
                   i_huvTemp[l_ceT],
                   i_b[l_ceB],
                   i_b[l_ceT],
                   l_netUpdates[0],
                   l_netUpdates[1]);

    // update the cells' quantities
    atomicAdd(&o_h[l_ceB], -i_scaling * l_netUpdates[0][0]);
    atomicAdd(&o_hv[l_ceB], -i_scaling * l_netUpdates[0][1]);

    atomicAdd(&o_h[l_ceT], -i_scaling * l_netUpdates[1][0]);
    atomicAdd(&o_hv[l_ceT], -i_scaling * l_netUpdates[1][1]);
}

Net Updates Kernel

Another important update was adding atomicAdd() around our writes to the o_h and o_hv arrays since we can no longer guarantee that no 2 kernels write to the same cell at the same time.

We also updated the Constructor of the class to reserve both memory on the GPU and the CPU (with only the GPU needing the temporary arrays). We used cudaMalloc instead of cudaMallocManaged since we want those arrays to always be on the GPU instead of a first touch policy approach. The extra arrays on the CPU only get used for initialization of all the cells and later getting the data back from the class to the writer. For the latter we now mandate a prior call to a prepareDataAccess function which waits for the GPU to finish its calculation and then copies the data from the GPU to the CPU.

void tsunami_lab::patches::WavePropagationCUDA::prepareDataAccess()
{
    cudaDeviceSynchronize();
    cudaMemcpy(m_h_host, m_h, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(m_hu_host, m_hu, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(m_hv_host, m_hv, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToHost);
}

Data Access Preparation Function

MS-3 Implementation

With much work already done in week 2, we didn’t really touch the WavePropagationCUDA class in week 3 outside of a few tweaks, none with major impact. We did however start testing our performance of the CUDA version against the CPU version. It turned out to be much much faster in calculation, so much so that our calls to NetCdf became the major bottleneck instead.

Example Data of a run with the Tohoku Simulation, with 1000m Cellsize, 100 Frames and 18000 seconds simulated time:

total time: 2min 48s 378ms 535us 507ns
setup time: 1s 73ms 223us 470ns
calc time : 12s 369ms 302us 895ns
write time: 2min 34s 936ms 9us 142ns

CUDA Performance

total time: 3min 47s 819ms 319us 488ns
setup time: 1s 806ms 745us 853ns
calc time : 3min 14s 908ms 392us 693ns
write time: 31s 104ms 180us 942ns

CPU Performance

Both tests were run on an Intel i5-13600k, an NVIDIA RTX 3070 and 32GB of DDR5 RAM @ 6000MHz

The calculation time reduced drastically, by a factor of about 15, but the write time increased by a factor of about 5. This leads to a total time improvement of only 35%, which is pretty disappointing seeing the large improvement in the calculation time. We couldn’t figure out why the write time increased so much, since from the perspective of the NetCdf library our data should be the same. We also noticed an issue that our simulation was sometimes “stuttering”, meaning that sometimes Frames took significantly longer to be written than others. This also sometimes happened in the setup, which increased setup times from ~2s to over 40s. We couldn’t identify whether the culprit was something in our use of the NetCdf library or the CUDA library.

MS-4 Testing

We did not further optimize our CUDA code in this step despite setting out to seeing as the write time was dominating our total time now. This made changes to our CUDA code not seem very impactful. We did run the code through NVIDIAs NSight Compute profiles (the newer version of NVProf):

_images/10_Nsight_setGhostCells_1.png
_images/10_Nsight_setGhostCells_2.png

Set Ghost Cells Kernel Profile

We already alluded to this in MS-2 Initial Kernel Development section, but here we have the proof that our throughput and occupancy are quite low. This is due to the fact that we spawn a thread for each cell, despite only needing to calculate the ghost cells (so the vast majority of threads are doing nothing). We did not change this as despite its low throughput it was overall only a very minor part of the total time. As a reference the netUpdates kernels take 14 times as long as the setGhostCells kernels. The maximum calculation time speedup then would be around 7%, which is not bad but we deemed it not worth the time investment since the calculation time overall is (depending on number of frames) only about 8% of the total time as well. Which would make the maximum total time speedup of the setGhostCells kernels about 0.6%.

_images/10_Nsight_netUpdates_1.png
_images/10_Nsight_netUpdates_2.png

Net Updates Kernel Profile

This looks pretty good, we seem to have a good occupancy and a good memory throughput. We see that we are bottlenecked by the memory throughput, but we can’t really do anything about that. Overall we are happy with these results.

We also ran the coda with differing resolutions on Ara:

All tests were run with 100 frames and 18000 seconds simulated time with the CUDA kernels

Resolution

Total Time

Setup Time

Calculation Time

Write Time

4000m

23s 566ms 770us 810ns

20s 879ms 547us 439ns

4ms 169us 871ns

2s 683ms 53us 500ns

2000m

9s 22ms 518us 27ns

2s 992ms 497us 211ns

5ms 200us 927ns

6s 24ms 819us 889ns

1000m

43min 47s 134ms 788us 719ns

3s 253ms 829us 489ns

8ms 578us 417ns

43min 43s 872ms 380us 813ns

500m

2h 59min 52s 494ms 302us 997ns

5s 473ms 510us 557ns

13ms 87us 343ns

2h 59min 43s 623ms 213us 679ns

Performance on Ara with 2 NVIDIA Tesla P100 GPU’s, 2 Intel Xeon E5-2660v4 CPUs and 128GB DDR4 RAM

On Ara we see that Ara is significantly slower than our own home computers for our simulation (43 minutes on Ara vs 3 minutes on our PC’s). While Ara seems to boost great calculation times (though we are a bit sceptical of those results), the write time is a lot slower than on our home computers. Since the write time dominated the overall time already on our own computers, it should be no surprise that Ara is overall significantly slower in total then. We assume that the write time is so slow because of the shared file system BeeGFS on Ara and the relatively old CPU’s and slower RAM.

We also observe one of the “stutters” we mentioned in MS-3 Implementation section at the setup, where Ara takes over 20 seconds instead of the usual ~3 seconds at 4000m Cellsize.

And since we saw that our limiting factor was the write time we tested our code with differing amounts of frames:

All tests were run with 1000m resolution and 18000 seconds simulated time with the CUDA kernels on an Intel i5-13600k, an NVIDIA RTX 3070 and 32GB of DDR5 RAM @ 6000MHz

Frames

Total Time

Setup Time

Calculation Time

Write Time

100

2min 39s 116ms 662us 561ns

1s 905ms 48us 542ns

12s 266ms 751us 226ns

2min 24s 944ms 862us 793ns

75

1min 26s 759ms 169us 619ns

1s 226ms 376us 879ns

12s 277ms 471us 7ns

1min 13s 255ms 321us 733ns

50

19s 630ms 698us 731ns

1s 149ms 982us 374ns

12s 680ms 54us 844ns

5s 800ms 661us 513ns

25

14s 688ms 777us 997ns

1s 120ms 59us 687ns

12s 518ms 198us 788ns

1s 50ms 519us 522ns

Performance with differing amounts of frames using CUDA kernel

Frames

Total Time

Setup Time

Calculation Time

Write Time

100

4min 10s 389ms 335us 964ns

2s 139ms 793us 127ns

3min 22s 811ms 868us 815ns

45s 437ms 674us 22ns

75

4min 16s 535ms 842us 722ns

898ms 301us 662ns

3min 58s 287ms 648us 191ns

17s 349ms 892us 869ns

50

3min 21s 543ms 143us 507ns

839ms 248us 187ns

3min 17s 390ms 728us 987ns

3s 313ms 166us 333ns

25

3min 29s 606ms 801us 293ns

796ms 630us 221ns

3min 27s 603ms 875us 928ns

1s 206ms 295us 144ns

Performance with differing amounts of frames using our parallelized CPU code

We see that the write time scales non linearly with the amount of frames. The time difference did not matter much in the CPU code as the simulation is slower despite the write taking quite a bit of time. For the CUDA code it might be advisable to reduce the amount of frames to reduce the write time significantly. If the application would support it it would probably be beneficial to seperate the simulation into multiple smaller files to keep the writes performant. Ultimately a better parallizable and chunkable format would probably be advisable if one were to develop the simulation further. We don’t know why the writes take so much longer on the CUDA version than on the CPU version. NetCdf might have a maximum throughput problem when processing large files and wait for function calls asynchronously to finish. In that case the faster simulation would hinder the libraries ability to hide the write time behind the calculation time.

We validated the calculation accuracy by using the same test cases as in the CPU version.

Tohoku Simulation with 1000m Cellsize, 100 Frames and 18000 seconds simulated time on CUDA

Goals

We had 2 goals for this individual phase:

  1. Increase performance by at least 2 times compared to the parallelized CPU code

  2. Don’t compromise the accuracy of the simulation while doing so

Regarding the calculation time we definitely achieved our first goal. We went from over 3 minutes to about 12 seconds, which is a speedup of about 15 times.

Regarding the total time though it depends a lot on the amount of frames. With low frames (<= 50) we still reached our goal with a speedup of about 9 times. If we increase the amount of frames to 100 though we reduce our speedup to a meager 1.25 times.

Regarding the calculation accuracy we reused the same test cases that we used for the CPU version and they all passed without any issues. So we assume the accuracy is good enough.