OpenCL on CPU and GPU

From Gridkaschool

= Introduction =

OpenCL is a standard which defines a framework, an API and a programming language for parallel computation on heterogeneous systems such as desktop computers, high-performance computing servers and hand-held devices.
The standard is maintained by the Khronos Group and supported by a large consortium of industry leaders including Apple, Intel, AMD, NVIDIA and ARM. Influenced by NVIDIA’s CUDA on the GPU side and by OpenMP, which originates from the classical CPU side, the open OpenCL standard is characterized by a formulation abstract enough to support both CPU and GPU computing resources.

This is an ambitious goal, since providing an abstract interface together with peak performance is a challenging task. OpenCL employs a strict isolation of the computation work into fundamental units, the kernels. These kernels can be developed in the OpenCL C programming language, a subset of the C99 language with some additional OpenCL-specific keywords.

In general, these kernels are hardware independent and are compiled by the OpenCL runtime when they are loaded. To fully exploit the parallel execution of the kernel code, several kernel instances, the work items, are started to process a set of input values. The actual number of concurrently running work items is determined by the OpenCL system. How a concrete algorithm is partitioned into work items has to be decided by the programmer.
   
 
= Reference Material =
 
<ul>
<li><p>http://www.nvidia.com/object/cuda_opencl_1.html</p></li></ul>

= Programming Environment and Libraries =

The exercises use the OpenCL C++ wrapper API (e.g. <tt>cl::Buffer</tt>, <tt>cl::Kernel</tt>), a thin layer on top of the OpenCL C API.

= Project: Bootstrapping OpenCL and Vector Addition =

The first project is intended to give an overview of the essential ingredients of an OpenCL application. The various components are kept as simple as possible but will still allow you to gain an understanding of the OpenCL concepts. To save time, some of the initialization code is provided.

The main goal is to define a kernel which adds the floating point numbers of two input arrays into an output array in the following style: <math>C_i = A_i + B_i</math>. The index ''i'' denotes the location of the numbers in the arrays. Each OpenCL kernel instance will process one fixed value of ''i''. This will provide sufficient parallelism to execute on both the CPU and the GPU.

Although not a complex implementation, this code will already allow testing some of the performance aspects of OpenCL. In the second part of this task, some performance measurements will be performed to characterize the CPU and GPU hardware architectures.

== Compiling and running the test program ==

Open the folder <tt>project_vectoradd</tt>, create the build files using CMake and compile the application.

<pre>$ cd project_vectoradd/
[hauth@vdt-corei7avx project_vectoradd]$ cmake .
[hauth@vdt-corei7avx project_vectoradd]$ make
[100%] Built target vectoradd
[hauth@vdt-corei7avx project_vectoradd]$</pre>

Once the application has compiled successfully, run it. The output should be along the following lines:

<pre>$ ./vectoradd
Testing Platform : Intel(R) OpenCL
&gt; Selected Compute Device : Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
Transferring data to device memory took 2e-06 s
Running vectorAdd kernel took 2e-06 s
Transferring data to host memory took 0 s
All done</pre>

== Task: Understand the Existing Code ==

The first task is to read and understand the existing source code. To do so, edit the file <tt>vectoradd.cpp</tt> in your favorite text editor; use <tt>nano</tt> if you are not sure which tool to use.

<pre>$ nano vectoradd.cpp</pre>

Take your time to familiarize yourself with the source code already in the file. Some of the initial steps of setting up the OpenCL system are provided:

<ul>
<li><p>'''OpenCL compute context'''</p>
<p>An OpenCL platform is automatically selected, depending on the required device type (CPU/GPU), and a compute context is created. You can change the required device type by modifying the constant <tt>devType</tt>:</p>
<pre>// Desired Device type.
// can be CL_DEVICE_TYPE_GPU or CL_DEVICE_TYPE_CPU in this example
const cl_device_type devType = CL_DEVICE_TYPE_CPU;</pre></li>
<li><p>'''OpenCL command queue'''</p>
<p>A command queue is created on the context.</p></li>
<li><p>'''Kernel compile'''</p>
<p>The variable <tt>kernelSourceAdd</tt> contains the source code of the OpenCL kernel. This string is used to compile and register the kernel with the OpenCL runtime system. If you want to extend the kernel code, you have to add to the string contained in the variable <tt>kernelSourceAdd</tt>.</p></li>
<li><p>'''Host data buffers'''</p>
<p>The variables <tt>hostVectorA</tt>, <tt>hostVectorB</tt> and <tt>hostVectorC</tt> hold buffers which will later be used to transfer data to and from the OpenCL kernel. The underlying C++ type is <tt>std::vector</tt>, which can be conveniently used to acquire a contiguous block of memory in the host’s RAM.</p>
<pre>FloatingPoint vp = 23.0f;
FloatingPointVector hostVectorA(itemCount, vp);
FloatingPointVector hostVectorB(itemCount, vp);
FloatingPointVector hostVectorC(itemCount, vp);</pre></li>
<li><p>'''Various timing measurements'''</p>
<p>To quantify the runtime of the OpenCL operations, various timing objects are present. For the time measurements to be correct, only the operations indicated by the TODO comments must be placed between the creation of the <tt>Timing</tt> object and the call to <tt>EndWithReport()</tt> of the respective timing object.</p>
<pre>Timing t_transfer_input(&quot;Transferring data to device memory&quot;);
// TODO: transfer input buffer here
t_transfer_input.EndWithReport();</pre></li></ul>

== Task: Running the OpenCL Kernel ==

The OpenCL kernel is already created and associated with the compiled program:

<pre>cl::Kernel kernelAdd(pProgram, &quot;vectorAdd&quot;);</pre>

Use the <tt>enqueueNDRangeKernel(…)</tt> method on the command queue object to add a kernel instance to the queue. Call the <tt>finish()</tt> method on the command queue object to ensure the kernel is done processing before the host program continues execution. The <tt>global_work_size</tt> parameter must reflect the size of the input data stored in the variable <tt>itemCount</tt>.

Extend the kernel code and use the OpenCL functions <tt>get_global_id</tt>, <tt>get_global_size</tt>, <tt>get_local_id</tt>, <tt>get_local_size</tt> and <tt>get_work_dim</tt> to retrieve information about the position of the kernel instance in the overall kernel run. Use the <tt>printf</tt> function to output this information to the console. Be aware that the <tt>printf</tt> command is only available in kernels running on the CPU.

Run the kernel and check whether the output of the <tt>printf</tt> command is reasonable and you understand its meaning.

=== OpenCL Commands in this Task ===

<ul>
<li><p><tt>clEnqueueNDRangeKernel</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html</p></li>
<li><p><tt>clFinish</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clFinish.html</p></li>
<li><p><tt>printf</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/printfFunction.html</p></li>
<li><p><tt>get_global_id</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_global_id.html</p></li>
<li><p><tt>get_global_size</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_global_size.html</p></li>
<li><p><tt>get_local_id</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_local_id.html</p></li>
<li><p><tt>get_local_size</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_local_size.html</p></li>
<li><p><tt>get_work_dim</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_work_dim.html</p></li></ul>

== Task: Creating and Transferring Memory Buffers ==

In order to perform actual work with the OpenCL kernel, input data must be transferred to the computing device and the results copied back to the host to present them to the user or store them in the file system.

Every memory buffer, including its size and other properties, must be explicitly declared to OpenCL before running the kernel. The kernels themselves cannot dynamically allocate any memory, so all memory the kernels want to write to during their execution must also be allocated before the kernels run.

Use the class <tt>cl::Buffer</tt> to allocate three memory buffers which are able to hold the contents of the <tt>hostVectorA</tt>, <tt>hostVectorB</tt> and <tt>hostVectorC</tt> vectors. The two input buffers (<tt>hostVectorA</tt>, <tt>hostVectorB</tt>) can be created read-only and the output buffer (<tt>hostVectorC</tt>) can be created write-only. This access qualification only concerns how the OpenCL kernels can access these buffers; access from the host side is not restricted. If you want to look up the parameters of the <tt>cl::Buffer</tt> constructor, they are equivalent to those of the function <tt>clCreateBuffer()</tt>.

Use the method <tt>enqueueWriteBuffer</tt> on the command queue to transfer the content of the host vectors <tt>hostVectorA</tt> and <tt>hostVectorB</tt> to the OpenCL device. Be sure to use <tt>CL_TRUE</tt> for the <tt>blocking_write</tt> parameter; this ensures the buffers are completely copied before the method call returns. You can use the following code snippet to get a pointer to the beginning of a host buffer:

<pre>&amp;hostVectorA.front()</pre>

Use the method <tt>enqueueReadBuffer</tt> in the same fashion to retrieve the content of the output buffer into <tt>hostVectorC</tt> once the kernel run is complete. Run the application to ensure that everything works fine. The buffers are unchanged, as they are not yet used by the kernel; we will change this in the next task.

=== OpenCL Commands in this Task ===

<ul>
<li><p><tt>clCreateBuffer</tt> (indirect, via the <tt>cl::Buffer</tt> constructor)</p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateBuffer.html</p></li>
<li><p><tt>clEnqueueWriteBuffer</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html</p></li>
<li><p><tt>clEnqueueReadBuffer</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueReadBuffer.html</p></li></ul>

== Task: Passing Parameters to the Kernel and Performing a Calculation ==

Use the <tt>setArg</tt> method of the kernel object to pass all three OpenCL memory buffers to the kernel. Extend the source code of the OpenCL kernel to receive three buffers as parameters.

Within the kernel, compute the item of the output buffer which corresponds to the global id of the kernel instance, using the following formula: <math>C_i = A_i + B_i</math>

Once the kernel execution is complete, check the correctness of the results by outputting the <tt>hostVectorC</tt> content to the console.

=== OpenCL Commands in this Task ===

<ul>
<li><p><tt>cl::Kernel::setArg(…)</tt></p>
<p>http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.1.pdf</p></li></ul>

== Task: Running the Kernel on the GPU ==

Up to now, the OpenCL kernels were compiled and run on the CPU using Intel’s OpenCL SDK. Switch the device type to <tt>GPU</tt> and confirm that your program still performs as expected. Make sure to remove all <tt>printf</tt> calls which might be left in your kernel, as this call is not supported on the GPU.

== Task: Timing and Performance Tests ==

Until now, only a small number of floating point values have been added; this way, the code is easier to test and debug. In this step, the number of items in our work list will be increased substantially, so that some performance properties of the implementation and of the different computing hardware become visible.

First, make sure that no <tt>printf</tt> calls or other console output remain in your kernel or host program. Now, increase the number of work items by modifying the <tt>itemCount</tt> line:

<pre>const size_t itemCount = 1 * pow(10, 7);</pre>

Run the program both on the GPU and on the CPU and compare the runtime of the kernel alone with the runtime including the buffer transfers. Evaluate the following points:

* How does the CPU perform compared to the GPU?
* In your opinion, is this computing task complex enough to justify the overhead of transferring the data to the GPU?
* Compute the natural logarithm (<tt>log()</tt>) in the kernel after adding the two input values. What is the ratio of kernel runtime to buffer transfer time now?
* Does it become more profitable to run the kernel computing the logarithm on the GPU now?

=== OpenCL Commands in this Task ===

<ul>
<li><p><tt>log</tt></p>
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/log.html</p></li></ul>

== Optional Task: Using <tt>float4</tt> datatype ==

Switch from the scalar data type to the vector data type <tt>float4</tt>, which can store a four-dimensional vector. Can you still perform the same mathematical operation in the kernel as in the task ''Passing Parameters to the Kernel and Performing a Calculation''?

Note: the data type in OpenCL C is named <tt>float4</tt>, but in the C++ host application the name <tt>cl_float4</tt> is used to refer to it. Please investigate the following points:

* Does the memory transfer timing increase as expected (factor 4) in relation to the scalar <tt>float</tt> datatype case?
* Does the kernel runtime increase as expected (factor 4) in relation to the scalar <tt>float</tt> datatype case? If not, what might be the reason?

== Optional Task: Using HostPtr computations to Mitigate the Buffer Transfer Overhead of the CPU ==
   
 
= Project: N-Body Simulation =

One of the primary goals of early physics research was to understand and predict the movements of the celestial bodies, especially the planets within our solar system. An important achievement were Johannes Kepler’s laws of planetary motion, which analytically describe the motion of two celestial bodies. But for problems with more than two objects (<math>n > 2</math>) no universally valid analytical solution exists today. The term n-body problem is generally used to refer to this kind of situation.

Fortunately, the motion of objects in space follows the well-known Newtonian laws of motion, and therefore a numerical simulation of the n-body problem can be performed to study the properties of various starting conditions.

== Mathematical Formulation ==

The state of particle <math>i</math> in the simulation is fully described by its location <math>\vec{r}_i</math>, velocity <math>\vec{v}_i</math> and mass <math>m_i</math>. To advance the simulation from timestep <math>t</math> to timestep <math>t' = t + \Delta t</math>, the following computation has to be performed for each particle.

<ol>
<li><p>'''Compute the acceleration on each particle'''</p>
<p>Newton’s law of universal gravitation is used to calculate the acceleration which is applied to one particle by all other particles in the simulation.</p>
<p><math>\vec{a}_i = \sum_{j=1, j \neq i}^{n} G \frac{ m_j }{ \left| \vec{r}_j - \vec{r}_i \right|^2 } \hat{r}_{ij}</math></p>
<p>Here, <math>G</math> is the gravitational constant, which can be set to 1 for this example. <math>m_j</math> is the mass of the particle which interacts with the particle <math>i</math>. <math>\vec{r}_i</math> and <math>\vec{r}_j</math> are the position vectors of the two interacting particles. <math>\hat{r}_{ij}</math> is the unit vector between the two interacting particles. It can be computed using:</p>
<p><math>\hat{r}_{ij} = \frac{ \vec{r}_j - \vec{r}_i }{ \left| \vec{r}_j - \vec{r}_i \right| }</math></p></li>
<li><p>'''Compute the new velocity of the particle'''</p>
<p>The new velocity vector of particle <math>i</math> can be computed using: <math>\vec{v}_i' = \vec{a}_i \Delta t + \vec{v}_i</math></p></li>
<li><p>'''Compute the new location of the particle'''</p>
<p>The new absolute position of particle <math>i</math> can be computed using: <math>\vec{r}_i' = \frac{1}{2} \Delta t^2 \vec{a}_i + \vec{v}_i \Delta t + \vec{r}_i</math></p></li></ol>

   

Revision as of 12:47, 9 August 2012
