OpenCL™ Driver for Intel® HD, Iris™, and Iris™ Pro Graphics for Linux -- Release Notes
1 Version Information
This document covers the Intel® OpenCL Linux graphics device driver version r5.0-BUILD_ID, hereafter referred to as the intel-opencl-r5.0 driver, where BUILD_ID refers to the build ID of the distributed files.
Modern hardware has incredible performance improvements over older equipment, but if your programming habits haven't evolved to match, then all of these gains are lost. But what does parallel programming mean? Modern hardware is able to do two or more things at the same time, achieved by having additional physical cores and specialized hardware. Basically multiple brains. However, software needs to be designed with this multitasking capacity in mind, or else only one task will be worked on at a time.
Parallel programming makes use of multiple cores and additional hardware to complete tasks faster. The Intel SDK for OpenCL Applications gives programmers the tools they need to easily make their programs run faster and utilize modern hardware features, such as integrated graphics engines.
An SDK Full of Features
The Intel SDK for OpenCL Applications is a fully featured development environment that allows developers to design, build, and debug their code with parallel programming in mind. Develop in OpenCL for Windows, Linux, and Android devices. The debugger and analyzer allow you to troubleshoot and fine-tune settings for your parallel workloads. For developers who want to get started, there is a full community of coders developing and sharing tips and code samples for a wide variety of projects.
Support for the latest generation of Intel hardware and OpenCL 2.0 makes this SDK the most up-to-date solution for this type of development on Intel processors. Both Eclipse and Microsoft Visual Studio are fully supported, so you don't have to change your development software to make use of this SDK. Catapult your development and programs into the current generation with all the functionality of this free SDK.
The Latest and Greatest SDK for Parallel Programming
Processor clock speeds have not improved much in recent years. Increasingly, better performance is being achieved by doing more calculations at the same time. Although there is still overhead in managing several different processors working on the same job, the gains are still significant.
For programs to make use of these advancements, they need to make use of parallel programming. With the Intel SDK for OpenCL Applications, developers can quickly and effectively learn and make use of the latest programming techniques to push the limits of current hardware. From design to build, to troubleshooting, this is everything you need, with the support to make you a better programmer. Intel® SDK for OpenCL™ Applications lets you build high-performance heterogeneous applications for Windows, Linux and Android.
Optimize your application performance with Intel® Graphics compute technology. The Intel® OpenCL™ Code Builder makes it easy to build, debug and analyze your OpenCL™ application. Take advantage of the latest OpenCL™ standard version 2.0 on the latest Intel® processors. Some of the exciting features include support for shared virtual memory to reduce data transfer overhead, SPIR-V intermediate representation to maintain portability while protecting your IP, and integration into popular IDEs such as Microsoft Visual Studio and Eclipse.
Intel® is the largest supporter of OpenCL™ technology. The Intel® SDK for OpenCL™ Applications is for any developer that wants to improve the performance of their applications with OpenCL™ on the latest Intel® platforms.
You must have administrator privileges on the development system to install the necessary packages and drivers required for the host software development. The host system must be running one of the supported Windows and Linux operating systems listed on the page. Develop your host application for the Intel® FPGA SDK for OpenCL™ using one of the following development environments:
Windows OS systems:
- Intel FPGA SDK for OpenCL
- Board support package (BSP)
- Microsoft Visual Studio Professional version 2010 or later
Linux OS systems:
- Intel FPGA SDK for OpenCL
- BSP
- RPM (RPM Package Manager; originally Red Hat Package Manager)
- C compiler included with GCC
- Perl command version 5 or later
Intel® FPGA SDK for OpenCL™ provides two modes of development experience for users. For code builders, all the tools are integrated into the GUI, which allows them to design, compile, and debug the kernel. On the other hand, the command-line options are for conventional users.
Khronos Compatibility
Intel® FPGA SDK for OpenCL™ is based on a published Khronos Specification and is supported by many vendors who are part of the Khronos Group. Intel FPGA SDK for OpenCL has passed the Khronos Conformance Testing Process. It conforms to the OpenCL 1.0 standard and provides both the OpenCL 1.0 and OpenCL 2.0 headers from the Khronos Group.
Attention: The SDK currently does not support all OpenCL 2.0 application programming interfaces (APIs). If you use the OpenCL 2.0 headers and make a call to an unsupported API, the call returns an error code to indicate that the API is not fully supported.
The Intel FPGA SDK for OpenCL host runtime conforms with the OpenCL platform layer and API with some clarifications and exceptions, which can be found at the section of the Intel FPGA SDK for OpenCL Programming Guide. Other Related Links:. For more information on OpenCL, visit the page. Current conformance status can be found at the page. For more information on the OpenCL 1.0 standard, refer to by Khronos.
OpenCL Extensions
Channels (I/Os or Kernel)
The Intel® FPGA SDK for OpenCL™ channel extension provides a mechanism for passing data to kernels and synchronizing kernels with high efficiency and low latency. Use the following links for more information on how to implement, use, and emulate channels.
Note: If you want to leverage the capabilities of channels but have the ability to run your kernel program using other SDKs, implement OpenCL pipes. For more information on pipes, see the following section on pipes.
Pipes
Intel FPGA SDK for OpenCL provides preliminary support for OpenCL pipe functions, which are part of the OpenCL Specification version 2.0. They provide a mechanism for passing data to kernels and synchronizing kernels with high efficiency and low latency. The Intel FPGA SDK for OpenCL implementation of pipes is not fully conformant to the OpenCL Specification version 2.0. The goal of the SDK's pipe implementation is to provide a solution that works seamlessly on a different OpenCL 2.0-conformant device.
To enable pipes for Intel FPGA products, your design must meet certain requirements. See the following links for more information on how to implement OpenCL pipes.
You can assess the functionality of your OpenCL™ kernel by executing it on one or multiple emulation devices on an x86-64 Windows or Linux host. Compiling the design for emulation takes seconds to generate an .aocx file and lets you iterate on your design more effectively, without going through the hours required for the full compilation. For Linux systems, the emulator offers symbolic debug support. Symbolic debug allows you to locate the origins of functional errors in your kernel code. The link below has an overview of the design flow for OpenCL kernels and illustrates the different stages for which you can emulate your kernel. The Programming Guide contains more details on the differences between kernel operation on hardware and emulation.
With the Intel® FPGA SDK for OpenCL™ Offline Compiler technology, you do not need to change your kernel to fit it optimally into a fixed hardware architecture. Instead, the offline compiler customizes the hardware architecture automatically to accommodate your kernel requirements. In general, you should optimize a kernel that targets a single compute unit first. After you optimize this compute unit, increase the performance by scaling the hardware to fill the remainder of the FPGA. The hardware footprint of the kernel correlates with the time it takes for hardware compilation. Therefore, the more optimizations you can perform with a smaller footprint (that is, a single computing unit), the more hardware compilations you can perform in a given amount of time.
OpenCL Optimization for Intel FPGAs
To optimize the implementation of your design and get the maximum performance, understand your theoretical maximum performance and understand what your limitations are. Follow these steps:
1. Start with a simple, known-good functional implementation.
2. Use an emulator to validate the functionality.
3. Remove or minimize the pipeline stalls that are reported in the optimization report.
4. Plan memory access for optimal memory bandwidth.
5. Use a profiler to debug performance issues.
The Profiler gives more insight into system performance, which points you toward further optimizing the algorithm's memory usage. Remember that for FPGAs, the more resources that can be allocated, the more unrolling, parallelization, and higher performance can be attained.
Helpful Reports and Resources for Optimization
There are a number of system-generated reports available to users. These reports give insight into the code, resource usage, and hints on where to focus to further improve the performance.
Memory Optimization
Understanding memory systems is crucial to efficiently implement an application using OpenCL.
Global Memory Interconnect
Unlike a GPU, an FPGA can build any custom load-store unit (LSU) that is most optimal for your application. As a result, your ability to write OpenCL code that selects the ideal LSU types for your application might help improve the performance of your design significantly. For more information, refer to the section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Local Memory
Local memory is a complex system. Unlike typical GPU architecture where there are different levels of caches, an FPGA implements local memory in dedicated memory blocks inside the FPGA. For more information, refer to the section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Memory usage can be optimized in a number of ways to improve overall performance. For more information on some of the key techniques, refer to the section of the Intel FPGA SDK for OpenCL Best Practices Guide. For more information on the strategies to improve memory access efficiency, refer to the section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Pipelines
Understanding pipelines is crucial for getting the best performance out of your implementation. Efficient use of pipelines directly improves throughput. For more details, refer to the section of the Intel FPGA SDK for OpenCL Best Practices Guide. For more information on data transfer, refer to the section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Stall, Occupancy, Bandwidth
Profile your kernel to identify performance bottlenecks. For more information on how profiling information helps you identify poor memory or channel behaviors that lead to unsatisfactory kernel performance, refer to the section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Loop Optimization
For some tips on removing loop-carried dependencies in various scenarios for a single work-item kernel, refer to the section of the Intel FPGA SDK for OpenCL Best Practices Guide.
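One commonly described way to remove a loop-carried dependency on a floating-point accumulator is to replace the single running sum with a small shift register of partial sums, so that each iteration depends on a value written several iterations earlier rather than on the previous one. The following is a minimal sketch in plain C so the arithmetic can be checked on a CPU; the function names and the depth of 8 are illustrative assumptions, not values taken from the guide, and a real kernel would pick the depth to match the adder latency reported by the compiler.

```c
#include <stddef.h>

#define SHIFT_DEPTH 8  /* hypothetical depth; chosen here only for illustration */

/* Baseline: every iteration needs the sum produced by the previous iteration. */
double sum_naive(const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

/* Shift-register form: iteration i adds to a partial sum that was last
 * written SHIFT_DEPTH iterations earlier, which relaxes the dependency. */
double sum_shift_register(const double *data, size_t n)
{
    double shift_reg[SHIFT_DEPTH + 1];
    for (int j = 0; j <= SHIFT_DEPTH; ++j)
        shift_reg[j] = 0.0;

    for (size_t i = 0; i < n; ++i) {
        /* Accumulate into the tail using the oldest partial sum. */
        shift_reg[SHIFT_DEPTH] = shift_reg[0] + data[i];
        /* Shift every element one position toward the head. */
        for (int j = 0; j < SHIFT_DEPTH; ++j)
            shift_reg[j] = shift_reg[j + 1];
    }

    /* Combine the SHIFT_DEPTH partial sums into the final result. */
    double sum = 0.0;
    for (int j = 0; j < SHIFT_DEPTH; ++j)
        sum += shift_reg[j];
    return sum;
}
```

Because the additions happen in a different order, results for general floating-point data can differ from the naive loop in the last bits; for values that are exactly representable, such as small integers, both functions return the same sum.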
For more information on optimizing floating-point operations, refer to the section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Area Optimization
Area usage is an important design consideration if your OpenCL kernels are executable on FPGAs of different sizes.
When you design your OpenCL application, Intel recommends that you follow certain design strategies for optimizing hardware area usage. Optimizing kernel performance generally requires additional FPGA resources. In contrast, area optimization often results in decreased performance. During kernel optimization, Intel recommends that you run multiple versions of the kernel on the FPGA board to determine the kernel programming strategy that generates the best size versus performance trade-off. For more information on strategies for optimizing FPGA area usage, refer to the section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Reference Design Examples
Some design examples that illustrate the optimization techniques are as follows:
This example shows the optimization of the fundamental matrix multiplication operation using loop tiling to take advantage of the data reuse inherent in the computation. This example illustrates:
- Single-precision floating-point optimizations
- Local memory buffering
- Compile optimizations (loop unrolling, num_simd_work_items attribute)
- Floating-point optimizations
- Multiple device execution
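To make the loop-tiling idea concrete, the sketch below multiplies two square matrices tile by tile in plain C and includes a straightforward reference version for comparison. It only illustrates the loop structure and the data reuse within a tile; it is not the reference design's kernel, and the tile size of 4 and the function names are assumptions made for this example. In an OpenCL kernel, the tiles would typically be staged in local memory.

```c
#include <stddef.h>

#define TILE 4  /* hypothetical tile edge length */

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Reference: C = A * B for n x n row-major matrices. */
void matmul_naive(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
}

/* Tiled: the three outer loops walk over TILE x TILE blocks, so each block
 * of A and B is reused many times while it is still close to the compute. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t kk = 0; kk < n; kk += TILE) {
                /* Clamp the block bounds so n need not be a multiple of TILE. */
                size_t i_end = min_size(ii + TILE, n);
                size_t j_end = min_size(jj + TILE, n);
                size_t k_end = min_size(kk + TILE, n);

                for (size_t i = ii; i < i_end; ++i)
                    for (size_t k = kk; k < k_end; ++k) {
                        float a_ik = a[i * n + k];  /* reused across the j loop */
                        for (size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += a_ik * b[k * n + j];
                    }
            }
}
```

Both functions take row-major arrays of n * n floats; comparing their outputs on a small matrix whose size is not a multiple of the tile size is a quick way to check the boundary handling.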
This design example implements the time-domain finite impulse response (FIR) filter benchmark from the HPEC Challenge Benchmark Suite. For more information, refer to the page. This design is a great example of how FPGAs can provide far better performance than a GPU architecture for floating-point FIR filters. This example illustrates:
- Single-precision floating-point optimizations
- Efficient 1D sliding window buffer implementation
- Single work-item kernel optimization methods
This design example implements a video downscaler that takes 1080p input video and outputs 720p video at 110 frames per second. This example uses multiple kernels to efficiently read from and write to global memory. This example illustrates:
- Kernel channels
- Multiple simultaneous kernels
- Kernel-to-kernel channels
- Sliding window design pattern
- Memory access pattern optimizations
This design example is an OpenCL implementation of the Lucas Kanade optical flow algorithm. A dense, non-iterative, and non-pyramidal version with a window size of 52x52 is shown to run at over 80 frames per second on the Cyclone® V SoC Development Kit. This example illustrates:
- Single work-item kernel
- Sliding window design pattern
- Resource usage reduction techniques
- Visual output
Training
Online training specific to OpenCL optimization with design examples is also available.
If the estimated kernel performance from emulation is acceptable, you can choose to collect information about how your design performs while executing on the FPGA. You can instruct the Intel® FPGA SDK for OpenCL™ Offline Compiler to instrument performance counters in the Verilog code in the .aocx file with the -profile option.
During execution, the Intel FPGA SDK for OpenCL Profiler measures and reports performance data that are collected during the OpenCL kernel execution on the FPGA. You can then review the performance data in the Profiler GUI. The section of the Intel FPGA SDK for OpenCL Programming Guide contains more information on how to profile your kernel.
How to Analyze Profiling Data
Profiling information helps you identify poor memory or channel behaviors that lead to unsatisfactory kernel performance. The section of the Intel FPGA SDK for OpenCL Best Practices Guide contains more in-depth information on the Dynamic Profiler GUI and how to interpret profiling data such as stall, bandwidth, cache hits, and so on. It also contains Profiler analysis of several OpenCL design example scenarios.
Intel® FPGA SDK for OpenCL™ provides a compiler and tools for you to build and run OpenCL applications that target Intel FPGA products. If you only require the Intel FPGA SDK for OpenCL's kernel deployment functionality, download and install the Intel FPGA Runtime Environment (RTE) for OpenCL.
The RTE is a subset of the Intel FPGA SDK for OpenCL. Unlike the SDK, which provides an environment that enables the development and deployment of OpenCL kernel programs, the RTE provides tools and runtime components that enable you to build and execute a host program, and execute precompiled OpenCL kernel programs on target accelerator boards. Do not install the SDK and the RTE on the same host system. The SDK already contains the RTE.
Utilities and Host Runtime Libraries
The RTE for OpenCL provides utilities, host runtime libraries, drivers, and RTE-specific libraries and files.
The RTE utility includes commands you can invoke to perform high-level tasks. The RTE utilities are a subset of the Intel FPGA SDK for OpenCL utilities. The host runtime provides the OpenCL host platform API and runtime API for your OpenCL host application. The host runtime consists of the following libraries:
- Statically linked libraries provide OpenCL host APIs, hardware abstractions, and helper libraries.
- Dynamic link libraries (DLLs) provide hardware abstractions and helper libraries.
For more information on utilities and host runtime libraries, refer to the section of the Intel FPGA RTE for OpenCL Getting Started Guide.
You can now significantly reduce system latency by using host channels, which allow data to stream from the host directly into the FPGA kernel through the PCIe interface while bypassing the memory controller. The FPGA kernel can begin processing the data immediately and does not have to wait for the data transfer to complete.
Host channels are supported in the OpenCL runtime application programming interfaces (APIs) and include emulation support. For more details on host channels and emulation support, refer to the section of the Intel® FPGA SDK for OpenCL™ Programming Guide.
Profiling allows you to learn where your program spends its time and which functions are called. This information shows you which parts of your program run slower than you expected and might need a rewrite for faster execution.
It can also tell you which functions are being called more or less often than you expected.
Gprof
gprof is an open-source tool available on Linux operating systems for profiling source code. It works on time-based sampling: at regular intervals, the program counter is interrogated to determine which point in the code the execution has reached. To use gprof, recompile the source code using the compiler profiling flag -pg. Then run the executable to generate the profiling information: a file named “gmon.out”, containing all the information that gprof requires to produce human-readable profiling data, is generated.
Next, use the gprof tool in the following way:
$ gprof <executable> gmon.out > profiledata.txt
Here, <executable> stands for the program that was built with -pg, and profiledata.txt is the file that receives the human-readable profiling data. It contains two parts: the flat profile and the call graph.
The flat profile shows how much time your program spent in each function, and how many times that function was called. The call graph shows, for each function, which functions called it, which other functions it called, and how many times.
There is also an estimate of how much time was spent in the subroutines of each function. More information on the usage of gprof for profiling is available on the.
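As a small workload for trying out the gprof flow described above, the sketch below defines one deliberately slow function and one cheap function, plus a driver that calls both many times. All names and the iteration count are hypothetical. After a build with -pg and a run, the flat profile would be expected to attribute most of the self time to the loop-based function, while the call graph would list both functions as children of the driver.

```c
/* gprof_demo.c: a hypothetical workload for experimenting with gprof.
 *
 * Typical flow with GCC on Linux, once a main() that calls run_workload()
 * has been added:
 *   gcc -pg -O0 gprof_demo.c -o gprof_demo
 *   ./gprof_demo                               (writes gmon.out)
 *   gprof gprof_demo gmon.out > profiledata.txt
 */
#include <stdint.h>

/* Deliberately slow: an O(n) loop that should dominate the flat profile. */
uint64_t sum_by_loop(uint64_t n)
{
    uint64_t total = 0;
    for (uint64_t i = 1; i <= n; ++i)
        total += i;
    return total;
}

/* Cheap: closed-form result that should show very little self time. */
uint64_t sum_by_formula(uint64_t n)
{
    return n * (n + 1) / 2;
}

/* Calls both functions repeatedly so the sampler has enough to observe.
 * Returns the number of mismatches between the two methods. */
int run_workload(void)
{
    int mismatches = 0;
    for (uint64_t n = 1; n <= 20000; ++n)
        if (sum_by_loop(n) != sum_by_formula(n))
            ++mismatches;
    return mismatches;
}
```

Building without optimization (-O0) keeps the compiler from inlining the two small functions, which would otherwise hide them from the per-function report.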
Intel® VTune™ Amplifier
The Intel® VTune™ Amplifier helps you speed up and optimize execution of your code on Linux embedded platforms, Android, or Windows systems by providing the following types of analysis:
- Performance analysis: Find serial and parallel code bottlenecks, analyze algorithm choices and GPU engine usage, and understand where and how your application can benefit from available hardware resources.
- Intel Energy Profiler analysis: Analyze power events and identify those that waste energy.
For more information on the Intel VTune Amplifier, visit the website.
OpenCL™ host pipelined multithread provides a framework to achieve high throughput for algorithms where a large amount of input data needs to be processed and the processing of each data item needs to be done in sequential order. One of the best applications of this framework is in heterogeneous platforms where high-throughput hardware is used to accelerate the most time-consuming part of the application. The remaining parts of the algorithm must run in sequential order on other platforms such as CPUs, either to prepare the input data for the accelerated task or to use the output of that task to prepare the final output. In this scenario, although the performance of the algorithm is partially accelerated, the total system throughput is much lower because of the sequential nature of the original algorithm. Here, a new pipelined framework for high-throughput design is proposed. This framework is optimal for processing large input data through algorithms where data dependency forces sequential execution of all stages or tasks of the algorithm. FPGAs are widely used in the acceleration space.
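A rough way to see why pipelining the host stages raises throughput is to compare two idealized timing models: one where each item passes through every stage before the next item starts, and one where each stage runs in its own thread so that stages overlap. The sketch below is only that back-of-the-envelope model, assuming constant per-stage times and no queueing or synchronization overhead; the function names are hypothetical and nothing here is taken from the framework itself.

```c
#include <stddef.h>

/* Total time when items are processed one at a time through all stages. */
double sequential_time(const double *stage_time, size_t n_stages, size_t n_items)
{
    double per_item = 0.0;
    for (size_t s = 0; s < n_stages; ++s)
        per_item += stage_time[s];
    return per_item * (double)n_items;
}

/* Total time when stages overlap: the first item pays the full latency of
 * all stages, and every later item is limited by the slowest stage. */
double pipelined_time(const double *stage_time, size_t n_stages, size_t n_items)
{
    if (n_items == 0)
        return 0.0;

    double fill = 0.0;     /* latency of one item through every stage */
    double slowest = 0.0;  /* bottleneck stage sets the steady-state rate */
    for (size_t s = 0; s < n_stages; ++s) {
        fill += stage_time[s];
        if (stage_time[s] > slowest)
            slowest = stage_time[s];
    }
    return fill + slowest * (double)(n_items - 1);
}
```

For example, with three stages taking 2, 1, and 3 ms (say, input preparation on the CPU, the accelerated kernel, and output post-processing) and 100 items, the model gives 600 ms for sequential execution and 303 ms for the pipelined case, with the 3 ms stage as the bottleneck.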
![1.1 1.1](/uploads/1/2/4/3/124365227/496460868.jpg)
OpenCL defines a specific way for the CPU to offload tasks to the FPGA. The file attached below contains the common initialization steps needed for the host code to launch the FPGA kernel. The init function can be called from the main function to initialize the FPGA. The code first finds the device upon which the kernel will run, and then programs it with the .aocx file supplied in the same directory as the host executable. After the initialization steps in the code, the user must set the kernel arguments according to their design's needs. There is also a cleanup function that frees the resources after executing the kernel.
| Environment Variable | Description |
| --- | --- |
| ACL_HAL_DEBUG | Set this variable to a value of 1 to 5 to increase debug output from the hardware abstraction layer (HAL), which interfaces directly with the MMD layer. |
| ACL_PCIE_DEBUG | Set this variable to a value of 1 to 10,000 to increase debug output from the MMD. This variable setting is useful for confirming that the version ID register was read correctly and the UniPHY IP cores are calibrated. |
| ACL_PCIE_JTAG_CABLE | Set this variable to override the default quartus_pgm argument that specifies the cable number. The default is cable 1. If there are multiple Intel® FPGA Download Cables, you can specify a particular cable by setting this variable. |
| ACL_PCIE_JTAG_DEVICE_INDEX | Set this variable to override the default quartus_pgm argument that specifies the FPGA device index. By default, this variable has a value of 1. If the FPGA is not the first device in the JTAG chain, you can customize the value. |
| ACL_PCIE_USE_JTAG_PROGRAMMING | Set this variable to force the MMD to reprogram the FPGA using the JTAG cable instead of partial reconfiguration. |
| ACL_PCIE_DMA_USE_MSI | Set this variable if you want to use MSI for direct memory access (DMA) transfers on Windows OS. |
| CL_CONTEXT_COMPILER_MODE_INTELFPGA | Unset this variable or set it to a value of 3. The OpenCL™ host runtime reprograms the FPGA as needed, which it does at least once during initialization. To prevent the host application from programming the FPGA, set this variable to a value of 3. |

Due to a loop in the host program, users may experience the OpenCL™ system slowing down while running it. For more details about such a scenario, refer to the section of the Intel® FPGA SDK for OpenCL Programming Guide.
The Intel Code Builder for OpenCL is a software development tool available as part of the Intel FPGA SDK for OpenCL. It provides a set of Microsoft Visual Studio and Eclipse plug-ins that enable capabilities for creating, building, debugging, and analyzing Windows and Linux applications accelerated with OpenCL.
For more information, refer to the section of the Intel FPGA SDK for OpenCL Programming Guide. Title Description This video describes the out-of-box procedure for running two applications, OpenCL™ HelloWorld and OpenCL fast Fourier transform (FFT) on the Cyclone® V SoC using a Windows machine.
The video discusses why customers could potentially use this feature to have their custom processing blocks (RTL) in OpenCL kernel code. The video explains the design example, such as the makefiles and config files, and explains the compilation flow. The video also shows a demo of the design example. This video shows you how to download, install, and configure the tools required to develop OpenCL kernels and host code targeting Altera® SoC FPGAs.
This video shows you how to download and compile an example OpenCL application targeting the emulator that is built into the OpenCL. This video shows you how to compile the OpenCL kernel and host code targeting the FPGA and processor of the Cyclone V SoC FPGA. This video shows you how to set up the Cyclone V SoC board to run the OpenCL example and execute the host code and kernel on the board.