10.2 CUDA Final Report

Links

Individual Contributions

Justus Dreßler: all members contributed equally

Thorsten Kröhl: all members contributed equally

Julius Halank: all members contributed equally

MS-1 Preparation

We read through the CUDA C++ Documentation (well part of it) and applied it to a small test example. The test example (in /cudeTests) consisted of a simple vector addition kernel and a main function to test it. We did initialy pick up a bad habit from it though which is cudaMallocManaged, since this turned out to severely slow down the code.

__global__ void vectorAdd(int *a, int *b, int *c, int N) {
    //calculate global thread id
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    //range check
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

Vector Addition Kernel

It helped us understand the way the GPU calls our code though. Each thread is calling the function with the same parameters except an implicit index given too it, which is its position in all threads. We used this later to tell each thread which cell to calculate in the simulation. CUDA makes 2D programming very easy in that case because we can just get a 2D thread id and use it to index our 2D arrays.

MS-2 Initial Kernel Development

We identified (to no ones surprise) that the key area we could speed up was the WavePropagate class. Since its dependent on the solvers, we decided that we would always use FWave for the GPU version. We then implemented the tsunami_lab::patches::WavePropagationCUDA class, first by copying the WavePropagation2D class and then modifying it to use CUDA. We first only implemented the setGhostCells functions on the GPU, since it was the easiest to implement and most self contained.

__global__ void setGhostCellsX(tsunami_lab::t_real *io_h, tsunami_lab::t_real *io_hu, tsunami_lab::t_idx i_nx, tsunami_lab::t_idx i_ny, tsunami_lab::t_boundary i_boundaryLeft, tsunami_lab::t_boundary i_boundaryRight)
{
    tsunami_lab::t_idx l_x = blockIdx.x * blockDim.x + threadIdx.x;
    tsunami_lab::t_idx l_y = blockIdx.y * blockDim.y + threadIdx.y;
    if (l_x > i_nx + 1 || l_y > i_ny + 1)
    {
        return;
    }

    if (l_x == 0 && i_boundaryLeft == tsunami_lab::t_boundary::OPEN)
    {
        io_h[(i_nx+2) * l_y] = io_h[1 + (i_nx+2) * l_y];
        io_hu[(i_nx+2) * l_y] = io_hu[1 + (i_nx+2) * l_y];
    }
    else if (l_x == i_nx + 1 && i_boundaryRight == tsunami_lab::t_boundary::OPEN)
    {
        io_h[l_x + (i_nx+2) * l_y] = io_h[l_x - 1 + (i_nx+2) * l_y];
        io_hu[l_x + (i_nx+2) * l_y] = io_hu[l_x - 1 + (i_nx+2) * l_y];
    }
}

Ghost Cell Setter Kernel

They worked quite well after figuring out some indexing issues. Admittedly they are far from optimal in regards to performance since we spawn a thread for each cell despite only needing to calculate the ghost cells. This means we are spawning a lot of threads that just instantly finish. We ignored this though, since the code was working first and later turned out to run fast enough to not be a bottleneck anyways.

While developing the setGhostCells kernels we figured out a nice performance improvement to our code. In our initial code we modified the bathymetry cells in each setGhostCells call. This is not necessary since the bathymetry is constant and we can just copy it once at the start of the simulation. So we added a initGhostCells function to the WavePropagationCUDA and WavePropagation2D classes which sets the bathymetry for the ghost cells. This reduces the amount of writes in each setGhostCells call by 33%.

We also went a bit ahead of the schedule and already implemented all the other kernels for the timeStep(i_scaling) function at the heart of the simulation.

void tsunami_lab::patches::WavePropagationCUDA::timeStep(t_real i_scaling)
{
    dim3 l_blockSize(16, 16);
    dim3 l_numBlock((m_nCellsx+2-1)/l_blockSize.x+1, (m_nCellsy+2-1)/l_blockSize.y+1);

    setGhostCellsX<<<l_numBlock,l_blockSize>>>(m_h, m_hu, m_nCellsx, m_nCellsy, m_boundaryLeft, m_boundaryRight);

    cudaMemcpy(m_hTemp, m_h, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(m_huvTemp, m_hu, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToDevice);
    netUpdatesX<<<l_numBlock,l_blockSize>>>(m_h, m_hu, m_hTemp, m_huvTemp, m_b, m_nCellsx, m_nCellsy, i_scaling);

    setGhostCellsY<<<l_numBlock,l_blockSize>>>(m_h, m_hv, m_nCellsx, m_nCellsy, m_boundaryBottom, m_boundaryTop);

    cudaMemcpy(m_hTemp, m_h, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(m_huvTemp, m_hv, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToDevice);
    netUpdatesY<<<l_numBlock,l_blockSize>>>(m_h, m_hv, m_hTemp, m_huvTemp, m_b, m_nCellsx, m_nCellsy, i_scaling);
}

Time Step Function

This timestep function still follows the old structure of the WavePropagation2D class, but encapsulates all actions into kernels. The (m_nCellsx+2-1)/l_blockSize.x+1 in the l_numBlock calculation might seem a bit odd, but it is just a way to calculate the integer ceil of (m_nCellsx+2)/l_blockSize.x. Taking the ceil is necessary since we might not have a perfect multiple of the block size in the number of cells and with our structure we can’t spawn less threads than cells. cudaMemcpyDeviceToDevice replaced the std::copy calls from the CPU version since we now have to move our data on the GPU and assume that cudaMemcpy is faster than our own implementation would be. We also use the property of CUDA that we can call multiple kernels in a row and they automatically get executed in order (so we have no race conditions that f.E. cudaMemcpy starts before the setGhostCells function finishes).

While developing the netUpdates kernels we noticed that we had to integrate the FWave solver into the same file since kernels are limited to calling code defined in the same file. So we just copied the FWave solver into the WavePropagationCUDA class and made it a __device__ function, meaning it can only be called from the GPU.

__global__ void netUpdatesY(tsunami_lab::t_real *o_h, tsunami_lab::t_real *o_hv, tsunami_lab::t_real *i_hTemp,tsunami_lab::t_real *i_huvTemp, tsunami_lab::t_real *i_b, tsunami_lab::t_idx i_nx, tsunami_lab::t_idx i_ny, tsunami_lab::t_real i_scaling)
{
    tsunami_lab::t_idx l_x = blockIdx.x * blockDim.x + threadIdx.x;
    tsunami_lab::t_idx l_y = blockIdx.y * blockDim.y + threadIdx.y;

    if (l_x > i_nx + 1 || l_y > i_ny + 1)
    {
        return;
    }
    // determine top and bottom cell-id
    tsunami_lab::t_idx l_ceB = l_x + l_y * (i_nx + 2);
    tsunami_lab::t_idx l_ceT = l_x + (l_y + 1) * (i_nx + 2);

    // compute net-updates
    tsunami_lab::t_real l_netUpdates[2][2];

    netUpdatesCUDA(i_hTemp[l_ceB],
                   i_hTemp[l_ceT],
                   i_huvTemp[l_ceB],
                   i_huvTemp[l_ceT],
                   i_b[l_ceB],
                   i_b[l_ceT],
                   l_netUpdates[0],
                   l_netUpdates[1]);

    // update the cells' quantities
    atomicAdd(&o_h[l_ceB], -i_scaling * l_netUpdates[0][0]);
    atomicAdd(&o_hv[l_ceB], -i_scaling * l_netUpdates[0][1]);

    atomicAdd(&o_h[l_ceT], -i_scaling * l_netUpdates[1][0]);
    atomicAdd(&o_hv[l_ceT], -i_scaling * l_netUpdates[1][1]);
}

Net Updates Kernel

Another important update was adding atomicAdd() around our writes to the o_h and o_hv arrays since we can no longer guarantee that no 2 kernels write to the same cell at the same time.

We also updated the Constructor of the class to reserve both memory on the GPU and the CPU (with only the GPU needing the temporary arrays). We used cudaMalloc instead of cudaMallocManaged since we want those arrays to always be on the GPU instead of a first touch policy approach. The extra arrays on the CPU only get used for initialization of all the cells and later getting the data back from the class to the writer. For the latter we now mandate a prior call to a prepareDataAccess function which waits for the GPU to finish its calculation and then copies the data from the GPU to the CPU.

void tsunami_lab::patches::WavePropagationCUDA::prepareDataAccess()
{
    cudaDeviceSynchronize();
    cudaMemcpy(m_h_host, m_h, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(m_hu_host, m_hu, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(m_hv_host, m_hv, (m_nCellsx+2) * (m_nCellsy+2) * sizeof(float), cudaMemcpyDeviceToHost);
}

Data Access Preparation Function

MS-3 Implementation

With much work already done in week 2, we didn’t really touch the WavePropagationCUDA class in week 3 outside of a few tweaks, none with major impact. We did however start testing our performance of the CUDA version against the CPU version. It turned out to be much much faster in calculation, so much so that our calls to NetCdf became the major bottleneck instead.

Example Data of a run with the Tohoku Simulation, with 1000m Cellsize, 100 Frames and 18000 seconds simulated time:

total time: 2min 48s 378ms 535us 507ns
setup time: 1s 73ms 223us 470ns
calc time : 12s 369ms 302us 895ns
write time: 2min 34s 936ms 9us 142ns

CUDA Performance

total time: 3min 47s 819ms 319us 488ns
setup time: 1s 806ms 745us 853ns
calc time : 3min 14s 908ms 392us 693ns
write time: 31s 104ms 180us 942ns

CPU Performance

Both tests were run on an Intel i5-13600k, an NVIDIA RTX 3070 and 32GB of DDR5 RAM @ 6000MHz

The calculation time reduced drastically, by a factor of about 15, but the write time increased by a factor of about 5. This leads to a total time improvement of only 35%, which is pretty disappointing seeing the large improvement in the calculation time. We couldn’t figure out why the write time increased so much, since from the perspective of the NetCdf library our data should be the same. We also noticed an issue that our simulation was sometimes “stuttering”, meaning that sometimes Frames took significantly longer to be written than others. This also sometimes happened in the setup, which increased setup times from ~2s to over 40s. We couldn’t identify whether the culprit was something in our use of the NetCdf library or the CUDA library.

MS-4 Testing

We did not further optimize our CUDA code in this step despite setting out to seeing as the write time was dominating our total time now. This made changes to our CUDA code not seem very impactful. We did run the code through NVIDIAs NSight Compute profiles (the newer version of NVProf):

_images/10_Nsight_setGhostCells_2.png — Set Ghost Cells Kernel Profile

We already alluded to this in MS-2 Initial Kernel Development section, but here we have the proof that our throughput and occupancy are quite low. This is due to the fact that we spawn a thread for each cell, despite only needing to calculate the ghost cells (so the vast majority of threads are doing nothing). We did not change this as despite its low throughput it was overall only a very minor part of the total time. As a reference the netUpdates kernels take 14 times as long as the setGhostCells kernels. The maximum calculation time speedup then would be around 7%, which is not bad but we deemed it not worth the time investment since the calculation time overall is (depending on number of frames) only about 8% of the total time as well. Which would make the maximum total time speedup of the setGhostCells kernels about 0.6%.

_images/10_Nsight_netUpdates_2.png — Net Updates Kernel Profile

This looks pretty good, we seem to have a good occupancy and a good memory throughput. We see that we are bottlenecked by the memory throughput, but we can’t really do anything about that. Overall we are happy with these results.

We also ran the coda with differing resolutions on Ara:

All tests were run with 100 frames and 18000 seconds simulated time with the CUDA kernels

Resolution	Total Time	Setup Time	Calculation Time	Write Time
4000m	23s 566ms 770us 810ns	20s 879ms 547us 439ns	4ms 169us 871ns	2s 683ms 53us 500ns
2000m	9s 22ms 518us 27ns	2s 992ms 497us 211ns	5ms 200us 927ns	6s 24ms 819us 889ns
1000m	43min 47s 134ms 788us 719ns	3s 253ms 829us 489ns	8ms 578us 417ns	43min 43s 872ms 380us 813ns
500m	2h 59min 52s 494ms 302us 997ns	5s 473ms 510us 557ns	13ms 87us 343ns	2h 59min 43s 623ms 213us 679ns

Performance on Ara with 2 NVIDIA Tesla P100 GPU’s, 2 Intel Xeon E5-2660v4 CPUs and 128GB DDR4 RAM

On Ara we see that Ara is significantly slower than our own home computers for our simulation (43 minutes on Ara vs 3 minutes on our PC’s). While Ara seems to boost great calculation times (though we are a bit sceptical of those results), the write time is a lot slower than on our home computers. Since the write time dominated the overall time already on our own computers, it should be no surprise that Ara is overall significantly slower in total then. We assume that the write time is so slow because of the shared file system BeeGFS on Ara and the relatively old CPU’s and slower RAM.

We also observe one of the “stutters” we mentioned in MS-3 Implementation section at the setup, where Ara takes over 20 seconds instead of the usual ~3 seconds at 4000m Cellsize.

And since we saw that our limiting factor was the write time we tested our code with differing amounts of frames:

All tests were run with 1000m resolution and 18000 seconds simulated time with the CUDA kernels on an Intel i5-13600k, an NVIDIA RTX 3070 and 32GB of DDR5 RAM @ 6000MHz

Frames	Total Time	Setup Time	Calculation Time	Write Time
100	2min 39s 116ms 662us 561ns	1s 905ms 48us 542ns	12s 266ms 751us 226ns	2min 24s 944ms 862us 793ns
75	1min 26s 759ms 169us 619ns	1s 226ms 376us 879ns	12s 277ms 471us 7ns	1min 13s 255ms 321us 733ns
50	19s 630ms 698us 731ns	1s 149ms 982us 374ns	12s 680ms 54us 844ns	5s 800ms 661us 513ns
25	14s 688ms 777us 997ns	1s 120ms 59us 687ns	12s 518ms 198us 788ns	1s 50ms 519us 522ns

Performance with differing amounts of frames using CUDA kernel

Frames	Total Time	Setup Time	Calculation Time	Write Time
100	4min 10s 389ms 335us 964ns	2s 139ms 793us 127ns	3min 22s 811ms 868us 815ns	45s 437ms 674us 22ns
75	4min 16s 535ms 842us 722ns	898ms 301us 662ns	3min 58s 287ms 648us 191ns	17s 349ms 892us 869ns
50	3min 21s 543ms 143us 507ns	839ms 248us 187ns	3min 17s 390ms 728us 987ns	3s 313ms 166us 333ns
25	3min 29s 606ms 801us 293ns	796ms 630us 221ns	3min 27s 603ms 875us 928ns	1s 206ms 295us 144ns

Performance with differing amounts of frames using our parallelized CPU code

We see that the write time scales non linearly with the amount of frames. The time difference did not matter much in the CPU code as the simulation is slower despite the write taking quite a bit of time. For the CUDA code it might be advisable to reduce the amount of frames to reduce the write time significantly. If the application would support it it would probably be beneficial to seperate the simulation into multiple smaller files to keep the writes performant. Ultimately a better parallizable and chunkable format would probably be advisable if one were to develop the simulation further. We don’t know why the writes take so much longer on the CUDA version than on the CPU version. NetCdf might have a maximum throughput problem when processing large files and wait for function calls asynchronously to finish. In that case the faster simulation would hinder the libraries ability to hide the write time behind the calculation time.

We validated the calculation accuracy by using the same test cases as in the CPU version.

Tohoku Simulation with 1000m Cellsize, 100 Frames and 18000 seconds simulated time on CUDA

Goals

We had 2 goals for this individual phase:

Increase performance by at least 2 times compared to the parallelized CPU code
Don’t compromise the accuracy of the simulation while doing so

Regarding the calculation time we definitely achieved our first goal. We went from over 3 minutes to about 12 seconds, which is a speedup of about 15 times.

Regarding the total time though it depends a lot on the amount of frames. With low frames (<= 50) we still reached our goal with a speedup of about 9 times. If we increase the amount of frames to 100 though we reduce our speedup to a meager 1.25 times.

Regarding the calculation accuracy we reused the same test cases that we used for the CPU version and they all passed without any issues. So we assume the accuracy is good enough.