February 28, 2011, 8:48 PM — Nvidia is announcing on Monday an upgrade to its Cuda Toolkit for developing parallel applications using Nvidia GPUs, with the latest version offering enhancements for application performance and programming.
The Cuda 4.0 Toolkit is intended to make parallel programming easier and enable more developers to port applications to GPUs. A release candidate -- a precursor to a general release -- will be available for free beginning March 4.
Among the key features in version 4.0 is the company's GPUDirect 2.0 technology, which supports peer-to-peer communications among GPUs within a single server or workstation, enabling faster multi-GPU programming and application performance, the company said. "You can now have direct [data] transfers between the GPUs," said Sanford Russell, director of Cuda marketing at Nvidia. Also, a Unified Virtual Addressing capability offers a single merged address space for the main system memory and the GPU memories, for easier parallel programming, Nvidia said.
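A minimal sketch of what these two features look like together, assuming a system with two peer-capable GPUs on the same PCIe fabric (device numbers and buffer size are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // GPUDirect 2.0: check whether GPU 0 can read GPU 1's memory directly.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) { printf("peer access unsupported\n"); return 0; }

    const size_t bytes = 1 << 20;
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // enable the direct GPU0 <-> GPU1 path
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // With Unified Virtual Addressing, each pointer identifies the GPU that
    // owns it, so a generic cudaMemcpyDefault copy can route peer-to-peer
    // without staging through host memory.
    cudaMemcpy(buf1, buf0, bytes, cudaMemcpyDefault);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```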
Version 4.0 also enables multiple CPU host threads to share contexts on a single GPU, making it easier for multi-threaded applications to share one GPU. Conversely, a single CPU host thread can now access all GPUs in a system, so developers can coordinate work across multiple GPUs.
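The second case -- one host thread driving every GPU in the system -- can be sketched as follows (the kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    int count = 0;
    cudaGetDeviceCount(&count);

    // In CUDA 4.0, cudaSetDevice is a lightweight switch, so one host
    // thread can simply loop over every GPU and launch work on each.
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaFree(d);
    }
    return 0;
}
```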
Also, the Thrust C++ Template Performance Primitives Libraries in version 4.0 provide open source C++ parallel algorithms and data structures to make it easier to program in C++, Nvidia said. Routines such as parallel sorting are five to 100 times faster than with the C++ Standard Template Library (STL) and Threading Building Blocks (TBB), Intel's C++ template library.
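A short sketch of the Thrust style: containers mirror the STL, and a GPU-parallel sort looks like a call to `std::sort` (the data size here is illustrative):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main() {
    // Fill a host-side vector with random keys.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

    thrust::device_vector<int> d = h;   // copy the data to the GPU
    thrust::sort(d.begin(), d.end());   // parallel sort runs on the device
    thrust::copy(d.begin(), d.end(), h.begin());  // results back to the host
    return 0;
}
```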
"The new features of Cuda 4.0 are designed for the programmer who doesn't want to dive into some of the details that were required for the previous releases," Brown said.
MPI (Message Passing Interface) integration with Cuda applications enables data to be moved to and from GPU memory over InfiniBand when an application makes an MPI send or receive call. Version 4.0 also offers a set of image transformation operations for rapid development of imaging and computer vision applications. Also featured is a new GPU binary disassembler, enabling developers to see the output of Nvidia's compiler to better understand application behaviors.
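With this integration, a CUDA-aware MPI implementation can accept a device pointer directly in a send or receive call. A hedged sketch of that pattern (it assumes such an MPI library is installed and a two-rank job; ranks and sizes are illustrative):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;                          // buffer lives in GPU memory
    cudaMalloc(&d_buf, n * sizeof(float));

    // A CUDA-aware MPI recognizes the device pointer and moves the data
    // over InfiniBand without an explicit host staging copy.
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```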