In general, these kernels are hardware independent and are compiled by the OpenCL runtime when they are loaded. To fully exploit the parallel execution of the kernel code, several kernel instances, the work items, are started to process a set of input values. The actual number of concurrently running work items is determined by the OpenCL system. How a concrete algorithm is partitioned into work items has to be decided by the programmer.

====Assigned exercise machines====

{| class="wikitable"
! UID !! Host name !! IP address
|-
| 068 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 074 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 075 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 077 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 082 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 084 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 085 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 088 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 089 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 090 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 102 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 103 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 104 || rz-tesla0.rz.uni-karlsruhe.de || 172.21.94.100
|-
| 110 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 113 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 117 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 121 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 123 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 125 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 128 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 012 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 132 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 134 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 137 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|-
| 139 || rz-tesla2.rz.uni-karlsruhe.de || 172.21.94.102
|}

= Tutorial Material =

<ul>
<li><p>Slides</p>
<p>http://hauth.web.cern.ch/hauth/opencl_introduction.pdf</p></li>
<li><p>Tasks</p>
<p>http://hauth.web.cern.ch/hauth/tutorial_opencl.pdf</p></li>
<li><p>Source code</p>
<p>http://hauth.web.cern.ch/hauth/tut_mcore.tar.gz</p>
<p>Use</p>
<pre>wget http://hauth.web.cern.ch/hauth/tut_opencl.tar.gz
tar xzf tut_opencl.tar.gz</pre>
<p>to download and extract the source code on your GridKa machine.</p></li>
</ul>
   
 
= Reference Material =

<ul>
<li><p>OpenCL C++ Wrapper</p>
<p>http://hauth.web.cern.ch/hauth/cl_hpp/</p></li>
<li><p>Khronos Group OpenCL</p>
<p>http://www.khronos.org/opencl/</p></li>
<li><p>NVIDIA OpenCL</p>
<p>http://www.nvidia.com/object/cuda_opencl_1.html</p></li></ul>
 
= Programming Environment and Libraries =
 
 
The exercises use the OpenCL C++ wrapper API (see the Reference Material above), which provides classes such as <tt>cl::Context</tt>, <tt>cl::CommandQueue</tt>, <tt>cl::Buffer</tt> and <tt>cl::Kernel</tt> on top of the plain C API.
 
 
= Project: Bootstrapping OpenCL and Vector Addition =
 
 
The first project is intended to give an overview of the essential ingredients of an OpenCL application. The various components are kept as simple as possible but still allow you to gain an understanding of the OpenCL concepts. To save time, some of the initialization code is provided.


The main goal is to define a kernel which adds the floating point numbers of two input arrays and stores the result in an output array, in the following style: <math>C_i = A_i + B_i</math> The index ''i'' is the location of the numbers in the arrays. Each OpenCL kernel instance will process one fixed value of ''i''. This provides sufficient parallelism to execute on both the CPU and the GPU.


Although not a complex implementation, this code already allows you to test some of the performance aspects of OpenCL. In the second part of this task, some performance measurements will be performed to characterize the CPU and GPU hardware architectures.
 
 
== Compiling and running the test program ==
 
 
Open the folder <tt>project_vectoradd</tt>, create the build files using CMake and compile the application.
 
 
 
 
<pre>$ cd project_vectoradd/
[hauth@vdt-corei7avx project_vectoradd]$ cmake .
[hauth@vdt-corei7avx project_vectoradd]$ make
[100%] Built target vectoradd
[hauth@vdt-corei7avx project_vectoradd]$</pre>
 
Once the application has been compiled successfully, run it. The output should be along the following lines:
 
 
<pre>$ ./vectoradd
Testing Platform : Intel(R) OpenCL
&gt; Selected Compute Device : Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
Transferring data to device memory took 2e-06 s
Running vectorAdd kernel took 2e-06 s
Transferring data to host memory took 0 s
All done</pre>
 
== Task: Understand the Existing Code ==
 
 
The first task is to read and understand the existing source code. To do so, you have to edit the file <tt>vectoradd.cpp</tt> in your favorite text editor. Use <tt>nano</tt> if you are not sure which tool to use.
 
 
<pre>$ nano vectoradd.cpp</pre>
 
Take your time to familiarize yourself with the source code which is already in the file. Some of the initial steps of setting up the OpenCL system are already provided:
 
 
<ul>
 
<li><p>'''OpenCL compute context'''</p>
 
<p>An OpenCL platform is automatically selected, depending on the required device type (CPU/GPU), and a compute context is created. You can change the required device type by modifying the constant <tt>devType</tt>:</p>
<pre>// Desired device type.
// Can be CL_DEVICE_TYPE_GPU or CL_DEVICE_TYPE_CPU in this example
const cl_device_type devType = CL_DEVICE_TYPE_CPU;</pre></li>
 
<li><p>'''OpenCL command queue'''</p>
 
<p>A command queue is created on the context.</p></li>
 
<li><p>'''Kernel compile'''</p>
 
<p>The variable <tt>kernelSourceAdd</tt> contains the source code of the OpenCL kernel. This string is used to compile and register the kernel with the OpenCL runtime system. If you want to extend the kernel code, you have to add to the string contained in the variable <tt>kernelSourceAdd</tt>.</p></li>
 
<li><p>'''Host data buffers'''</p>
 
<p>The variables <tt>hostVectorA</tt>, <tt>hostVectorB</tt> and <tt>hostVectorC</tt> hold buffers which will later be used to transfer data to and from the OpenCL kernel. The underlying C++ type is <tt>std::vector</tt>, which can be conveniently used to acquire a contiguous block of memory in the host’s RAM.</p>
<pre>FloatingPoint vp = 23.0f;
FloatingPointVector hostVectorA(itemCount, vp);
FloatingPointVector hostVectorB(itemCount, vp);
FloatingPointVector hostVectorC(itemCount, vp);</pre></li>
 
<li><p>'''Various timing measurements'''</p>
 
<p>To quantify the runtime of the OpenCL operations, various timing objects are present. For the time measurements to be correct, only the operations indicated by the TODO comments must be between the creation of the <tt>Timing</tt> object and the call to <tt>EndWithReport()</tt> of the respective timing object.</p>
 
<pre>Timing t_transfer_input(&quot;Transferring data to device memory&quot;);
// TODO: transfer input buffer here
t_transfer_input.EndWithReport();</pre></li></ul>
 
 
== Task: Running the OpenCL Kernel ==
 
 
The OpenCL kernel is already created and associated with the compiled program:
 
 
<pre>cl::Kernel kernelAdd(pProgram, &quot;vectorAdd&quot;);</pre>
 
Use the <tt>enqueueNDRangeKernel(…)</tt> method on the command queue object to add a kernel instance to the queue. Call the <tt>finish()</tt> method on the command queue object to ensure the kernel is done processing before the host program continues execution. The <tt>global_work_size</tt> parameter must reflect the size of the input data stored in the variable <tt>itemCount</tt>.
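As a hedged sketch (not the tutorial's exact solution), the two calls described above might look as follows with the C++ wrapper; <tt>queue</tt>, <tt>kernelAdd</tt> and <tt>itemCount</tt> are assumed to exist as described in the text:

```cpp
// Sketch: enqueue one kernel instance per array element, then wait
// until all work items have finished before the host continues.
queue.enqueueNDRangeKernel(kernelAdd,
                           cl::NullRange,           // no offset
                           cl::NDRange(itemCount),  // global work size
                           cl::NullRange);          // let OpenCL choose the local size
queue.finish();  // block until the kernel run is complete
```

Passing <tt>cl::NullRange</tt> as the local work size leaves the partitioning into work groups to the OpenCL implementation, which is sufficient for this task.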
 
 
Extend the kernel code and use the OpenCL functions <tt>get_global_id</tt>, <tt>get_global_size</tt>, <tt>get_local_id</tt>, <tt>get_local_size</tt> and <tt>get_work_dim</tt> to retrieve information about the position of the kernel instance in the overall kernel run. Use the <tt>printf</tt> function to output this information to the console. Be aware that the <tt>printf</tt> command is only available in kernels running on the CPU.
 
 
Run the kernel and see if the output of the <tt>printf</tt> command is reasonable and you understand the meaning.
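A minimal sketch of what such an instrumented kernel body might contain (the parameter list is omitted here; the exact signature belongs to the later tasks, and <tt>printf</tt> only works on the CPU device as noted above):

```c
__kernel void vectorAdd()
{
    // Where does this work item sit in the overall kernel run?
    printf("global id %d of %d | local id %d of %d | dims %d\n",
           (int)get_global_id(0), (int)get_global_size(0),
           (int)get_local_id(0),  (int)get_local_size(0),
           (int)get_work_dim());
}
```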
 
 
=== OpenCL Commands in this Task ===
 
 
<ul>
 
<li><p><tt>clEnqueueNDRangeKernel</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html</p></li>
 
<li><p><tt>clFinish</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clFinish.html</p></li>
 
<li><p><tt>printf</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/printfFunction.html</p></li>
 
<li><p><tt>get_global_id</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_global_id.html</p></li>
 
<li><p><tt>get_global_size</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_global_size.html</p></li>
 
<li><p><tt>get_local_id</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_local_id.html</p></li>
 
<li><p><tt>get_local_size</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_local_size.html</p></li>
 
<li><p><tt>get_work_dim</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_work_dim.html</p></li></ul>
 
 
== Task: Creating and Transfering Memory Buffers ==
 
 
In order to perform actual work with the OpenCL kernel, input data must be transferred to the computing device and copied back to the host to present it to the user or store it in the file system.
 
 
Every memory buffer, including its size and other properties, must be explicitly declared to OpenCL before running the kernel. The kernels themselves cannot dynamically allocate any memory, so all memory the kernels want to write to during their execution must also be allocated before the kernels run.
 
 
Use the class <tt>cl::Buffer</tt> to allocate three memory buffers which are able to hold all the contents of the <tt>hostVectorA</tt>, <tt>hostVectorB</tt> and <tt>hostVectorC</tt> vectors. The two input buffers (<tt>hostVectorA</tt>, <tt>hostVectorB</tt>) can be created read-only and the output buffer (<tt>hostVectorC</tt>) can be created write-only. This access qualification only concerns the fashion in which the OpenCL kernels can access these buffers; access from the host side is not restricted. If you want to look up the parameters of the <tt>cl::Buffer</tt> constructor, they are equivalent to the ones of the function <tt>clCreateBuffer()</tt>.
 
 
Use the method <tt>enqueueWriteBuffer</tt> on the command queue to transfer the content of host vectors <tt>hostVectorA</tt> and <tt>hostVectorB</tt> to the OpenCL device. Be sure to use <tt>CL_TRUE</tt> for the <tt>blocking_write</tt> parameter. This will ensure the buffers are completely copied before the method call returns. For the pointer to the data buffer you can use the following code snippet to get a pointer to the beginning of the host buffer.
 
 
<pre>&amp;hostVectorA.front()</pre>
 
Use the method <tt>enqueueReadBuffer</tt> in the same fashion to retrieve the content of the output buffer into <tt>hostVectorC</tt> once the kernel run is complete. Run the application to ensure that everything works fine. The buffers are unchanged, as they are not yet used by the kernel; we will change this in the next task.
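Put together, the buffer handling might look like the following sketch. The buffer names <tt>devA</tt>, <tt>devB</tt>, <tt>devC</tt> and <tt>bufferSize</tt> are placeholders chosen here, and <tt>float</tt> elements are assumed; <tt>context</tt>, <tt>queue</tt>, the host vectors and <tt>itemCount</tt> are those of the provided code:

```cpp
// Sketch: allocate device buffers and copy the input data to the device.
const size_t bufferSize = itemCount * sizeof(float);

cl::Buffer devA(context, CL_MEM_READ_ONLY,  bufferSize);  // kernel input
cl::Buffer devB(context, CL_MEM_READ_ONLY,  bufferSize);  // kernel input
cl::Buffer devC(context, CL_MEM_WRITE_ONLY, bufferSize);  // kernel output

// Blocking writes (CL_TRUE): the calls return only after the copy is done.
queue.enqueueWriteBuffer(devA, CL_TRUE, 0, bufferSize, &hostVectorA.front());
queue.enqueueWriteBuffer(devB, CL_TRUE, 0, bufferSize, &hostVectorB.front());

// After the kernel run: copy the result back to the host.
queue.enqueueReadBuffer(devC, CL_TRUE, 0, bufferSize, &hostVectorC.front());
```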
 
 
=== OpenCL Commands in this Task ===
 
 
<ul>
 
<li><p><tt>clCreateBuffer</tt> (indirectly via the <tt>cl::Buffer</tt> constructor)</p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateBuffer.html</p></li>
 
<li><p><tt>clEnqueueWriteBuffer</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html</p></li>
 
<li><p><tt>clEnqueueReadBuffer</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueReadBuffer.html</p></li></ul>
 
 
== Task: Passing parameters to the kernel and performing a calculation ==
 
 
Use the <tt>setArg</tt> method of the kernel object to pass all three OpenCL memory buffers to the kernel. Extend the source code of the OpenCL kernel to receive three buffers as parameters.
 
 
Within the kernel, compute the item of the output buffer which corresponds to the global id of the kernel instance using the following formula: <math>C_i = A_i + B_i</math>
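One possible form of the finished kernel is sketched below; the parameter order is a choice made here and must match the <tt>setArg</tt> indices used on the host side (e.g. <tt>kernelAdd.setArg(0, devA)</tt>, where the buffer names are placeholders):

```c
__kernel void vectorAdd(__global const float* a,
                        __global const float* b,
                        __global float* c)
{
    const size_t i = get_global_id(0);  // one work item per array index
    c[i] = a[i] + b[i];
}
```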
 
 
Once the kernel execution is complete, check the correctness of the results by outputting the <tt>hostVectorC</tt> content to the console.
 
 
=== OpenCL Commands in this Task ===
 
 
<ul>
 
<li><p><tt>cl::Kernel::setArg(…)</tt></p>
 
<p>http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.1.pdf</p></li></ul>
 
 
== Task: Running the Kernel on the GPU ==
 
 
Up to now, the OpenCL kernels were compiled and run on the CPU using Intel’s OpenCL SDK. Switch the device type to <tt>GPU</tt> and confirm that your program still performs as expected. Make sure to remove all <tt>printf</tt> calls which might be left in your kernel as this call is not supported by the GPU.
 
 
== Task: Timing and Performance Tests ==
 
 
Until now, only a small number of floating point values have been added. This way, the code is easier to test and debug. In this step, the number of items in our work list will be increased substantially, so that some performance properties of the implementation and of the different computing hardware become visible.
 
 
First, make sure that no <tt>printf</tt> or other console output remains in your kernel or host program. Now, increase the number of work items by modifying the <tt>itemCount</tt> line:
 
 
<pre>const size_t itemCount = 1 * pow(10, 7);</pre>
 
Run the program both for GPU and CPU and compare the runtime of the kernel and the runtime including the buffer transfers. Evaluate the following points:
 
 
* How does the CPU perform compared to the GPU?
* In your opinion, is this computing task complex enough to justify the overhead of transferring the data to the GPU?
* Compute the natural logarithm ( <tt>log()</tt> ) in the kernel after adding the two input values. What is the ratio of kernel runtime to buffer transfer time now?
* Does it become more profitable to run the kernel computing the logarithm on the GPU now?
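As a sketch of the suggested modification (reusing the kernel variables from the earlier tasks), only the output line of the kernel body changes:

```c
// Sketch: make the kernel compute-heavier, shifting the runtime
// balance away from the buffer transfers.
c[i] = log(a[i] + b[i]);  // OpenCL C built-in natural logarithm
```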
 
 
=== OpenCL Commands in this Task ===
 
 
<ul>
 
<li><p><tt>log</tt></p>
 
<p>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/log.html</p></li></ul>
 
 
== Optional Task: Using <tt>float4</tt> datatype ==
 
 
Switch from the scalar data type to the vector datatype <tt>float4</tt>. This data type can store a four-dimensional vector. Can you still perform the same mathematical operation in the kernel as in the task ''Passing parameters to the kernel and performing a calculation''?
 
 
Note: the data type in OpenCL C is named <tt>float4</tt>, but in the C++ host application the name <tt>cl_float4</tt> is used to refer to it. Please investigate the following points:
 
 
* Does the memory transfer time increase as expected (by a factor of 4) in relation to the scalar <tt>float</tt> datatype case?
* Does the kernel runtime increase as expected (by a factor of 4) in relation to the scalar <tt>float</tt> datatype case? If not, what might be the reason?
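A sketch of the vectorized kernel is given below; the kernel name <tt>vectorAdd4</tt> is a placeholder chosen here. In OpenCL C, the arithmetic operators act component-wise on vector types, so the kernel body barely changes:

```c
__kernel void vectorAdd4(__global const float4* a,
                         __global const float4* b,
                         __global float4* c)
{
    const size_t i = get_global_id(0);
    c[i] = a[i] + b[i];  // '+' operates component-wise on float4
}
```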
 
 
== Optional Task: Using HostPtr computations to Mitigate the Buffer Transfer Overhead of the CPU ==
 
 
= Project: N-Body Simulation =
 
 
One of the primary goals of early physics research was to understand and predict the movements of the celestial bodies, especially the planets within our solar system. An important achievement is Johannes Kepler’s laws of planetary motion, which analytically describe the motion of two celestial bodies. But for problems of more than two objects ( <math>n > 2</math> ), no universally valid analytical solution exists today. The term n-body problem is generally used to refer to this kind of situation.
 
 
Fortunately, the motion of objects in space follows Newton’s well-known laws of motion, and therefore a numerical simulation of the n-body problem can be performed to study the properties of various starting conditions.
 
 
== Mathematical Formulation ==
 
 
The state of particle <math>i</math> in the simulation is fully described by its location <math>\vec{r}_i</math>, velocity <math>\vec{v}_i</math> and mass <math>m_i</math>. To transfer the simulation from timestep <math>t</math> to timestep <math>t' = t + \Delta t</math>, the following computation has to be performed for each particle.
 
 
<ol>
 
<li><p>'''Compute the acceleration on each particle'''</p>
 
<p>Newton’s law of universal gravitation is used to calculate the acceleration which is applied to one particle by all other particles in the simulation.</p>
 
<p><math>\vec{a}_i = \sum_{j=1, i \neq j}^{n} G \frac{ m_j }{ \left| \vec{r}_j - \vec{r}_i \right|^2 } \hat{r}_{ij}</math></p>
 
<p>Here, <math>G</math> is the gravitational constant, which can be set to 1 for this example. <math>m_j</math> is the mass of the particle <math>j</math> which interacts with the particle <math>i</math>. <math>r_i</math> and <math>r_j</math> are the position vectors of the two interacting particles. <math>\hat{r}_{ij}</math> is the unit vector pointing from particle <math>i</math> to particle <math>j</math>. It can be computed using:</p>
 
<p><math>\hat{r_{ij}} = \frac{ r_j - r_i }{ \left| r_j - r_i \right| }</math></p></li>
 
<li><p>'''Compute the new velocity of the particle'''</p>
 
<p>The new velocity vector of particle <math>i</math> can be computed using: <math>v_i' = a_i \Delta t + v_i</math></p></li>
 
<li><p>'''Compute the new location of the particle'''</p>
 
<p>The new absolute position of particle <math>i</math> can be computed using: <math>r_i' = \frac{1}{2} \Delta t ^ 2 a_i + v_i \Delta t + r_i</math></p></li></ol>
 
 
 
 
 
===[[Internals:OpenCL|Technical specification/requirements]]===
 

''Latest revision as of 13:32, 30 August 2012''

= Introduction =

OpenCL is a standard which defines a framework, an API and a programming language for parallel computation on heterogeneous systems like desktop computers, high-performance computing servers and hand-held devices.

The standard is maintained by the Khronos Group and supported by a large consortium of industry leaders including Apple, Intel, AMD, NVIDIA and ARM. Influenced by NVIDIA’s CUDA from the GPU side and by OpenMP, which originates from the classical CPU side, the open OpenCL standard is characterized by a formulation which is abstract enough to support both CPU and GPU computing resources.

This is an ambitious goal, since providing an abstract interface together with peak performance is a challenging task. OpenCL employs a strict isolation of the computation work into fundamental units, the kernels. These kernels can be developed in the OpenCL C programming language, a subset of the C99 language with some additional OpenCL-specific keywords.