Difference between revisions of "Multi-threaded Programming"

From Gridkaschool

Revision as of 18:25, 3 August 2012

Introduction

OpenCL is a standard which defines a framework, an API and a programming language for parallel computation on heterogeneous systems such as client computer systems, high-performance computing servers and hand-held devices.

The standard is maintained by the Khronos Group and supported by a large consortium of industry leaders including Apple, Intel, AMD, NVIDIA and ARM. Influenced by NVIDIA’s CUDA from the GPU side and by OpenMP which originates from the classical CPU side, the open OpenCL standard is characterized by a formulation which is abstract enough to support both CPU and GPU computing resources.

This is an ambitious goal, since providing an abstract interface while still reaching peak performance is a challenging task. OpenCL strictly isolates the computational work into fundamental units, the kernels. These kernels are written in the OpenCL C programming language, a subset of the C99 language with some additional OpenCL-specific keywords.

In general, these kernels are hardware independent and compiled by the OpenCL runtime when they are loaded. To be able to fully exploit the parallel execution of the kernel code, several kernel instances, the work items, are started to process a set of input values. The actual number of concurrently running work items is determined by the OpenCL system. How a concrete algorithm can be partitioned into work items has to be decided by the programmer.
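The partitioning into work items can be illustrated without any OpenCL runtime. The following plain C++ sketch emulates the idea for vector addition: vectorAddWorkItem plays the role of the kernel body, and the loop in runAllWorkItems stands in for the runtime, which would execute the instances concurrently. Both function names are hypothetical and chosen for this illustration only; a real kernel is written in OpenCL C.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a kernel body: each work item handles exactly one element,
// identified by its global id (what get_global_id(0) returns in OpenCL C).
void vectorAddWorkItem(std::size_t globalId,
                       const std::vector<float>& a,
                       const std::vector<float>& b,
                       std::vector<float>& c)
{
    c[globalId] = a[globalId] + b[globalId];
}

// Stand-in for the OpenCL runtime: launches one work item per element.
// A real runtime may run many of these instances in parallel; this
// emulation simply runs them one after another.
void runAllWorkItems(const std::vector<float>& a,
                     const std::vector<float>& b,
                     std::vector<float>& c)
{
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t gid = 0; gid < c.size(); ++gid) {
        vectorAddWorkItem(gid, a, b, c);
    }
}
```

Because no work item reads or writes an element owned by another one, the order in which the instances run does not matter, which is what makes this partitioning suitable for parallel execution.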

Reference Material

  • NVIDIA OpenCL

    http://www.nvidia.com/object/cuda_opencl_1.html

Programming Environment and Libraries

c++ interface / blah

Project: Bootstrapping OpenCL and Vector Addition

TODO: give overview of what to do

Compiling and running the test program

Open the folder project_vectoradd, create the build files using CMake and compile the application.


$ cd project_vectoradd/
$ cmake .
$ make
[100%] Built target vectoradd
$ ./vectoradd

Once the application has compiled successfully, run it. The output should look similar to the following:

$ ./vectoradd
Testing Platform : Intel(R) OpenCL
> Selected Compute Device :        Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
Transferring data to device memory took 2e-06 s
Running vectorAdd kernel took 2e-06 s
Transferring data to host memory took 0 s
All done

Task: Understand the existing code

The first task is to read and understand the existing source code. To do so, open the file vectoradd.cpp in your favorite text editor. Use nano if you are not sure which tool to use.

$ nano vectoradd.cpp

Take your time to familiarize yourself with the source code which is already in the file. Some of the initial steps of setting up the OpenCL system are already provided:

  • OpenCL compute context

    An OpenCL platform is automatically selected, depending on the required device type (CPU/GPU), and a compute context is created. You can change the required device type by modifying the constant devType:

    // Desired Device type.
    // can be CL_DEVICE_TYPE_GPU or CL_DEVICE_TYPE_CPU in this example
    const cl_device_type devType = CL_DEVICE_TYPE_CPU;
  • OpenCL command queue

    A command queue is created on the context.

  • Kernel compile

    The variable kernelSourceAdd contains the source code of the OpenCL kernel. This string is used to compile and register the kernel with the OpenCL runtime system. If you want to extend the kernel code, you have to add to the string contained in the variable kernelSourceAdd.

  • Host data buffers

    The variables hostVectorA, hostVectorB and hostVectorC hold buffers which will be used later to transfer data to and from the OpenCL kernel. The underlying C++ type is std::vector, which can be conveniently used to acquire a contiguous block of memory in the host's RAM.

    FloatingPoint vp = 23.0f;
    FloatingPointVector hostVectorA(itemCount, vp);
    FloatingPointVector hostVectorB(itemCount, vp);
    FloatingPointVector hostVectorC(itemCount, vp);
  • Various timing measurements

    To quantify the runtime of the OpenCL operations, various timing objects are present. For the time measurements to be correct, nothing but the operations indicated by the TODO comments may be placed between the creation of the Timing object and the call to EndWithReport() of the respective timing object.

    Timing t_transfer_input("Transferring data to device memory");
    // TODO: transfer input buffer here
    t_transfer_input.EndWithReport();
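To make the timing pattern above more concrete, here is a minimal sketch of how such a helper could be built on std::chrono. It is a hypothetical re-creation: the Timing class shipped with the project may be implemented differently, and the return value of EndWithReport() is an assumption added for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <string>

// Hypothetical re-creation of a Timing helper; the class shipped with the
// project may be implemented differently.
class Timing {
public:
    // The measurement starts as soon as the object is created.
    explicit Timing(const std::string& label)
        : label_(label),
          start_(std::chrono::steady_clock::now()),
          ended_(false) {}

    // Stops the measurement, prints a report line and returns the elapsed
    // time in seconds (the return value is an assumption of this sketch).
    double EndWithReport()
    {
        assert(!ended_);  // each timer is meant to be ended exactly once
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        ended_ = true;
        std::cout << label_ << " took " << elapsed.count() << " s" << std::endl;
        return elapsed.count();
    }

private:
    std::string label_;
    std::chrono::steady_clock::time_point start_;
    bool ended_;
};
```

Since the clock starts in the constructor, every statement between the construction and EndWithReport() is included in the reported time, which is why unrelated work should be kept out of that region.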

Task: Running the OpenCL Kernel

The OpenCL kernel is already created and associated with the compiled program:

cl::Kernel kernelAdd(pProgram, "vectorAdd");

Use the enqueueNDRangeKernel(…) method on the command queue object to add a kernel instance to the queue. <ref>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html</ref>

Call the finish() method on the command queue object to ensure the kernel is done processing before the host program continues execution. <ref>http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clFinish.html</ref>
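One detail to decide when enqueueing the kernel is the global work size. If an explicit local work-group size is passed, OpenCL 1.x requires the global size to be a multiple of it. The helper below is a small sketch of the usual round-up calculation; its name is hypothetical and it is not part of vectoradd.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Rounds itemCount up to the next multiple of localSize.
// Hypothetical helper, not part of vectoradd.cpp.
std::size_t roundUpGlobalSize(std::size_t itemCount, std::size_t localSize)
{
    assert(localSize > 0);
    const std::size_t remainder = itemCount % localSize;
    return remainder == 0 ? itemCount : itemCount + (localSize - remainder);
}
```

With the OpenCL C++ wrapper, the result could then be passed along the lines of queue.enqueueNDRangeKernel(kernelAdd, cl::NullRange, cl::NDRange(globalSize), cl::NDRange(localSize)), followed by queue.finish(). Here queue is only a placeholder for whatever the command queue object is called in vectoradd.cpp. If the global size is padded this way, the kernel has to skip work items whose id is not smaller than itemCount. Alternatively, passing cl::NullRange as the local size leaves the choice of work-group size to the OpenCL implementation, so no padding is needed.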

Modifications to play around with

  • Switch to double. How does the runtime change for CPU/GPU?
  • Switch to the float4 vector type. Can you perform the same addition operations?
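As a hint for the float4 experiment: in OpenCL C the + operator on vector types works component-wise, so each work item can add four floats at once. The following plain C++ sketch emulates this on the host. Float4 and float4WorkItems are hypothetical names used only for this illustration, and the sketch assumes that the number of elements is a multiple of four.

```cpp
#include <cassert>
#include <cstddef>

// Plain C++ stand-in for the OpenCL C float4 type (hypothetical name).
struct Float4 {
    float x, y, z, w;
};

// Component-wise addition, mirroring what "a + b" does for float4 in OpenCL C.
Float4 operator+(const Float4& a, const Float4& b)
{
    return Float4{a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}

// Number of work items needed when each one processes a whole float4.
// Assumes itemCount is a multiple of 4.
std::size_t float4WorkItems(std::size_t itemCount)
{
    assert(itemCount % 4 == 0);
    return itemCount / 4;
}
```

The kernel body itself stays a single addition; what changes is the element type of the buffers and the number of work items that have to be enqueued.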

Project: N-Body Simulation

<references />


Technical specification/requirements