Tutorial 9 - Runtime CPU/GPU Decision

The CPU/GPU data node creates several attributes whose data can feed either a CUDA-based GPU compute or a CPU-based compute, where the decision of which to use is made at runtime rather than at compile time. A few representative types are used; see Attribute Data Types for the full list of supported attribute types.

OgnTutorialCpuGpuData.ogn

The ogn file shows the implementation of a node named “omni.graph.tutorials.CpuGpuData”, which has attributes whose memory location is determined at runtime by the input named is_gpu. The algorithm of the node is implemented in CUDA, but in such a way that it can run on either the CPU or the GPU, depending on where the attribute data lives.

{
    "CpuGpuData" : {
        "version": 1,
        "categories": "tutorials",
        "memoryType": "any",
        "description": [
            "This is a tutorial node. It illustrates how to access data whose memory location, CPU or GPU, is ",
            "determined at runtime in the compute method. The data types are the same as for the purely CPU and ",
            "purely GPU tutorials, it is only the access method that changes. The input 'is_gpu' determines where ",
            "the data of the other attributes can be accessed."
        ],
        "metadata":
        {
            "uiName": "Tutorial Node: Attributes With CPU/GPU Data"
        },
        "inputs": {
            "is_gpu": {
                "type": "bool",
                "memoryType": "cpu",
                "description": ["Runtime switch determining where the data for the other attributes lives."],
                "default": false
            },
            "a": {
                "type": "float",
                "description": ["First value to be added in algorithm 1"],
                "default": 0.0
            },
            "b": {
                "type": "float",
                "description": ["Second value to be added in algorithm 1"],
                "default": 0.0
            },
            "points": {
                "type": "float[3][]",
                "description": ["Points to be moved by algorithm 2"],
                "default": []
            },
            "multiplier": {
                "type": "float[3]",
                "description": ["Amplitude of the expansion for the input points in algorithm 2"],
                "default": [1.0, 1.0, 1.0]
            }
        },
        "outputs": {
            "sum": {
                "type": "float",
                "description": ["Sum of the two inputs from algorithm 1"]
            },
            "points": {
                "type": "float[3][]",
                "description": ["Final positions of points from algorithm 2"]
            }
        },
        "tests": [
            { "inputs:is_gpu": false, "inputs:a": 1.0, "inputs:b": 2.0, "outputs:sum": 3.0 },
            { "inputs:is_gpu": false, "inputs:a": 5.0, "inputs:b": 3.0, "outputs:sum": 8.0 },
            {
                "inputs:is_gpu": false,
                "inputs:points": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
                "inputs:multiplier": [2.0, 3.0, 4.0],
                "outputs:points": [[2.0, 6.0, 12.0], [4.0, 9.0, 16.0]]
            },
            {
                "inputs:is_gpu": true,
                "inputs:a": 1.0, "inputs:b": 2.0,
                "inputs:points": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
                "inputs:multiplier": [2.0, 3.0, 4.0]
            }
        ],
        "$why_disabled": "GPU extraction is not yet implemented so for those tests only inputs are provided",
        "$tests_disabled_until_gpu_data_extraction_works": [
            { "inputs:is_gpu": true, "inputs:a": 1.0, "inputs:b": 2.0, "outputs:sum": 3.0, "gpu": ["outputs:sum"] },
            { "inputs:is_gpu": true, "inputs:a": 5.0, "inputs:b": 3.0, "outputs:sum": 8.0, "gpu": ["outputs:sum"] },
            {
                "inputs:is_gpu": true,
                "inputs:points": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
                "inputs:multiplier": [2.0, 3.0, 4.0],
                "outputs:points": [[2.0, 6.0, 12.0], [4.0, 9.0, 16.0]],
                "gpu": ["outputs:points"]
            }
        ]
    }
}

OgnTutorialCpuGpuData.cpp

The cpp file contains the implementation of the compute method, which checks the value of the is_gpu attribute and then extracts the data from the corresponding memory space to pass to the algorithms in the .cu file.

// Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto.  Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
//
#include <OgnTutorialCpuGpuDataDatabase.h>

// This helper file keeps the GPU and CPU functions organized for easy access.
// It's only for illustration purposes; you can choose to organize your functions any way you wish.

// The first algorithm is a simple add, to illustrate passing non-array types.
extern "C" void cpuGpuAddCPU(outputs::sum_t_cpu, inputs::a_t_cpu, inputs::b_t_cpu, size_t);
extern "C" void cpuGpuAddGPU(outputs::sum_t_gpu, inputs::a_t_gpu, inputs::b_t_gpu);

// The second algorithm shows the more common case of accessing array data.
extern "C" void cpuGpuMultiplierCPU(outputs::points_t_gpu, inputs::multiplier_t_gpu, inputs::points_t_gpu, size_t);
extern "C" void cpuGpuMultiplierGPU(outputs::points_t_gpu, inputs::multiplier_t_gpu, inputs::points_t_gpu, size_t);

class OgnTutorialCpuGpuData
{
public:
    static bool compute(OgnTutorialCpuGpuDataDatabase& db)
    {
        // Computing the size of the output is independent of CPU/GPU as the size update is lazy and will happen
        // when the data is requested from the flatcache.
        size_t numberOfPoints = db.inputs.points.size();
        db.outputs.points.resize(numberOfPoints);

        // Use the runtime input to determine where the calculation should take place.
        //
        // Another helpful use of this technique is to first measure the size of data being evaluated and
        // if it meets a certain minimum threshold move the calculation to the GPU.
        if (db.inputs.is_gpu())
        {
            cpuGpuAddGPU(db.outputs.sum.gpu(), db.inputs.a.gpu(), db.inputs.b.gpu());

            // GPU data is just raw pointers so the size must be passed down to the algorithm
            cpuGpuMultiplierGPU(
                db.outputs.points.gpu(), db.inputs.multiplier.gpu(), db.inputs.points.gpu(), numberOfPoints);
        }
        else
        {
            // CPU version calls the shared CPU/GPU algorithm directly. This could also be implemented in the
            // CUDA file as an extern "C" function but it is so simple that this is easier.
            cpuGpuAddCPU(db.outputs.sum.cpu(), db.inputs.a.cpu(), db.inputs.b.cpu(), 0);

            // Extract the CPU data and iterate over all of the points, calling the shared deformation algorithm.
            // The data is passed through as raw CUDA-compatible bytes so that the deformation algorithm can live
            // all in one file. If the code were all on the CPU side then the data types could be used as regular
            // arrays, as shown in the tutorial on array data types.
            auto rawOutputPoints = db.outputs.points.cpu().data();
            auto rawInputPoints = db.inputs.points.cpu().data();
            cpuGpuMultiplierCPU(&rawOutputPoints, &db.inputs.multiplier.cpu(), &rawInputPoints, numberOfPoints);
        }

        return true;
    }
};

REGISTER_OGN_NODE()

OgnTutorialCpuGpuData_CUDA.cu

The cu file contains the implementation of the deformation on the CPU and GPU using CUDA.

// Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto.  Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
//
#include <OgnTutorialCpuGpuDataDatabase.h>

// ======================================================================
// This set of methods is one way of efficiently implementing an algorithm to run on either CPU or GPU:
//     xxxCPU() - Looping over all inputs to call the node algorithm
//     xxxGPU() - Host function to launch the kernel to run the node algorithm
//     xxxCUDA() - Run the algorithm on the block of data in the kernel
//     xxx() - Implementation of the actual node algorithm
//
// This minimizes code duplication by keeping the node algorithm in a shared location. The tradeoff is that
// the shared code is restricted to data types that CUDA understands; in practice this is not much of a
// restriction, since the algorithm must be able to run on the GPU anyway.
//
// One thing to be careful of is when you are passing arrays of CPU data to a GPU function. Data can only be passed
// from the CPU to the GPU by value, which for simple types is the default since the data type will be something
// like "float", not "float&" or "float*". However when the CPU data is an array it has to first be copied to the
// GPU using something like cudaMalloc/cudaMemcpy/cudaFree. While possible, this is very inefficient. It is better
// to specify that the attribute be always on the gpu, or that the decision is made at runtime, so that the
// GPU conversion can be handled automatically.

// ======================================================================
// Algorithm 1 support - add two values together (something so trivial you would never use the GPU for it; here for
//                       illustrative purposes only)
//
namespace
{
__device__ __host__
void cpuGpuAdd(float* sum, float const* a, float const* b, size_t i)
{
    sum[i] = a[i] + b[i];
}

// CUDA kernel that runs the addition algorithm on the block value
__global__
void cpuGpuAddCUDA(outputs::sum_t sum, inputs::a_t a, inputs::b_t b)
{
    // Only thread 0 is valid here since there is a single sum value to compute
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (0 < i)
        return;

    cpuGpuAdd(sum, a, b, i);
}
}

// GPU kernel launcher
extern "C"
void cpuGpuAddGPU(outputs::sum_t sum, inputs::a_t a, inputs::b_t b)
{
    // Launch the GPU deformation using the minimum number of threads and blocks
    cpuGpuAddCUDA<<<1, 1>>>(sum, a, b);
}

// CPU version of the algorithm
extern "C"
void cpuGpuAddCPU(float& sum, float const& a, float const& b, size_t i)
{
    cpuGpuAdd(&sum, &a, &b, i);
}

// ======================================================================
// Algorithm 2 support - multiply every point in an array by a constant vector
//
namespace
{
// The shared algorithm applies the multiplier to the current point
extern "C"
__device__ __host__
void cpuGpuMultiplier(
    outputs::points_t outputPoints,
    inputs::multiplier_t multiplier,
    inputs::points_t inputPoints,
    size_t currentPointIndex)
{
    (*outputPoints)[currentPointIndex].x = (*inputPoints)[currentPointIndex].x * multiplier->x;
    (*outputPoints)[currentPointIndex].y = (*inputPoints)[currentPointIndex].y * multiplier->y;
    (*outputPoints)[currentPointIndex].z = (*inputPoints)[currentPointIndex].z * multiplier->z;
}

// The CUDA kernel applies the deformation algorithm to the point under its control
__global__
void cpuGpuMultiplierCUDA(
    outputs::points_t outputPoints,
    inputs::multiplier_t multiplier,
    inputs::points_t inputPoints,
    size_t numberOfPoints)
{
    // Make sure the current evaluation block is in range of the available points
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (numberOfPoints <= i)
        return;

    cpuGpuMultiplier(outputPoints, multiplier, inputPoints, i);
}
}

// The GPU algorithm launches a CUDA kernel on the set of points for deformation
extern "C"
void cpuGpuMultiplierGPU(
    outputs::points_t outputPoints,
    inputs::multiplier_t multiplier,
    inputs::points_t inputPoints,
    size_t numberOfPoints)
{
    // Split the work into 256 threads, an arbitrary number that could be more precisely tuned when necessary
    const int numberOfThreads = 256;
    // The block count is the point count divided by the thread count, rounded up so every point is covered
    const int numberOfBlocks = (numberOfPoints + numberOfThreads - 1) / numberOfThreads;

    // Launch the GPU deformation using the calculated number of threads and blocks
    cpuGpuMultiplierCUDA<<<numberOfBlocks, numberOfThreads>>>(outputPoints, multiplier, inputPoints, numberOfPoints);
}

// The CPU algorithm just loops over the list of points, applying the shared deformation to each point
extern "C"
void cpuGpuMultiplierCPU(
    outputs::points_t outputPoints,
    inputs::multiplier_t multiplier,
    inputs::points_t inputPoints,
    size_t numberOfPoints)
{
    for (size_t i = 0; i < numberOfPoints; i++)
    {
        cpuGpuMultiplier(outputPoints, multiplier, inputPoints, i);
    }
}

CPU/GPU Attribute Access

Here is how the attribute values are returned from the database. Until now the attribute name alone has sufficed as the database member, accessing the value through its operator(). The runtime switch of memory location is supported by the additional cpu() and gpu() accessors.

| CPU Function            | CPU Type        | GPU Function            | GPU Type        | CUDA Type      |
|-------------------------|-----------------|-------------------------|-----------------|----------------|
| inputs.a.cpu()          | const float&    | inputs.a.gpu()          | const float*    | const float*   |
| inputs.b.cpu()          | const float&    | inputs.b.gpu()          | const float*    | const float*   |
| outputs.sum.cpu()       | float&          | outputs.sum.gpu()       | float*          | float*         |
| inputs.multiplier.cpu() | const GfVec3f&  | inputs.multiplier.gpu() | const GfVec3f*  | const float3   |
| inputs.points.cpu()     | const GfVec3f*  | inputs.points.gpu()     | const GfVec3f** | const float3** |
| outputs.points.cpu()    | GfVec3f*        | outputs.points.gpu()    | GfVec3f**       | float3**       |

Type Information

An attribute whose memory location varies at runtime can be seen in three ways: on the CPU, on the CPU while being passed to the GPU, and on the GPU. Extra types are introduced to handle each of these. The CUDA types are handled as before, but on the CPU side there are extra types for the data being passed from the CPU to the GPU.

| CPU Type Name        | CPU Data Type  | GPU Type Name            | GPU Data Type   |
|----------------------|----------------|--------------------------|-----------------|
| inputs::a_t          | const float&   | inputs::a_t_gpu          | const float*    |
| inputs::b_t          | const float&   | inputs::b_t_gpu          | const float*    |
| outputs::sum_t       | float&         | outputs::sum_t_gpu       | float*          |
| inputs::multiplier_t | const GfVec3f& | inputs::multiplier_t_gpu | const GfVec3f*  |
| inputs::points_t     | const GfVec3f* | inputs::points_t_gpu     | const GfVec3f** |
| outputs::points_t    | GfVec3f*       | outputs::points_t_gpu    | GfVec3f**       |

On the C++ side the functions defined in the CUDA file are declared as:

extern "C" void cpuGpuMultiplierCPU(outputs::points_t, inputs::multiplier_t, inputs::points_t, size_t);
extern "C" void cpuGpuMultiplierGPU(outputs::points_t_gpu, inputs::multiplier_t_gpu, inputs::points_t_gpu, size_t);

The addition of the _gpu suffix mostly adds an extra layer of indirection to the values, since the data lives in GPU memory. Care must be taken to call the correct version with the correctly extracted data:

if (db.inputs.is_gpu())
{
    cpuGpuMultiplierGPU(
        db.outputs.points.gpu(),
        db.inputs.multiplier.gpu(),
        db.inputs.points.gpu(),
        numberOfPoints
    );
}
else
{
    // Note how array data is extracted in its raw form for passing to the function on the CUDA side.
    // This would be unnecessary if the implementation were entirely on the CPU side.
    cpuGpuMultiplierCPU(
        db.outputs.points.cpu().data(),
        db.inputs.multiplier.cpu(),
        db.inputs.points.cpu().data(),
        numberOfPoints
    );
}

On the CUDA side the function definitions use the existing CUDA types, so their signatures are:

extern "C" void cpuGpuMultiplierCPU(outputs::points_t, inputs::multiplier_t, inputs::points_t, size_t);
extern "C" void cpuGpuMultiplierGPU(outputs::points_t, inputs::multiplier_t, inputs::points_t, size_t);