Review of the CUDA debugger "NVIDIA Parallel Nsight 2.0"

Debugging parallel code is a tedious and costly process. Parallelization errors are hard to catch because the behavior of parallel applications is non-deterministic. Moreover, once an error has been detected, it is often difficult to reproduce. And after changing the code, it is hard to be sure that the error has been fixed rather than merely masked. Errors in parallel programs are very often heisenbugs. All of this creates an urgent need for convenient, full-featured tools for debugging parallel programs.
So, just over a year ago, NVIDIA released NVIDIA Parallel Nsight, a package of tools integrated into Microsoft Visual Studio 2008 SP1 and 2010 for debugging parallel programs written in CUDA. XaocCPS wrote about it on Habr at the time. Since then, the product has become more sophisticated and completely free. The latest version to date is 2.0. Let us look at the possible configurations, installation and setup, and the main features of NVIDIA Parallel Nsight.

Possible configurations

NVIDIA offers 4 hardware configuration options for installing Parallel Nsight, which differ in the ability to use certain tools:

Configuration	System with 1 GPU	System with 2 GPUs	Two systems, each with a GPU	System with 2 GPUs per machine (NVIDIA Multi-OS)
CUDA C/C++ Parallel Debugger	0	1	1	1
Direct3D Shader Debugger	0	0	1	1
Direct3D Graphics Inspector	1	1	1	1
Analyzer	1	1	1	1

NVIDIA calls option 4 the “ULTIMATE” configuration. NVIDIA Multi-OS is a virtual machine with support for the developer video driver. I considered setting up a similar system with VMWare, but it turned out to be impossible to install the developer driver on the video adapter of the virtual system.
NVIDIA lists the following system requirements, depending on the configuration chosen:

Hardware Requirements:

	Minimum	Recommended
Operating system	Windows® Vista SP3, Windows 7 or Windows HPC Server 2008 (32- or 64-bit)	Same as minimum
CPU	Intel Pentium Dual-core CPU equivalent @ 1.6 GHz	Intel Pentium Dual-core CPU equivalent @ 2.2 GHz or higher
RAM	Host: 2 GB. Target machine (the one that runs the computation): 2 GB	Host: 2 GB or more. Target machine: 4 GB or more
Free disk space	32-bit machine: 240 MB for Parallel Nsight. 64-bit machine: 330 MB for Parallel Nsight	32-bit machine with the Parallel Nsight host: 240 MB plus space for your project. 64-bit machine with the Parallel Nsight host: 330 MB plus space for your project. (If you use a remote machine to run or debug the application, the remote machine should have 240 MB of free space plus space for the debug build of your application.)
Output devices	Separate monitor for the compute GPU	DVI connection recommended
Local debugging (host and target on the same machine)	Two CUDA-capable GPUs (see the list of supported devices)	Same as minimum
Remote debugging (host and target on different machines)	Target machine: 1 CUDA-capable GPU. Host machine (where Visual Studio is installed): 1 GPU, which can be any GPU	Same as minimum
Supported GPUs	developer.nvidia.com/parallel-nsight-supported-gpus	developer.nvidia.com/parallel-nsight-supported-gpus

Software Requirements:

Display driver	An NVIDIA display driver that supports Parallel Nsight must be installed. If the computer has an NVIDIA video card, a driver is probably already installed, but Parallel Nsight requires an up-to-date version in order to function properly.	Same as minimum
Local debugging (host and target on the same machine)	.NET Framework 3.5 SP1. Visual Studio: Microsoft Visual Studio 2008 SP1 Standard Edition or higher, or Microsoft Visual Studio 2010	Same as minimum
Remote debugging (host and target on different machines)	Host machine: .NET Framework 3.5 SP1. Visual Studio: Microsoft Visual Studio 2008 SP1 Standard Edition or higher, or Microsoft Visual Studio 2010. Target machine: .NET Framework 3.5 SP1	Same as minimum
Network	Internet connection for downloading the installer. For remote debugging: a TCP/IP connection between the host and target machines.	Same as minimum

Install Parallel Nsight

To debug parallel code, a configuration with two CUDA-capable GPUs in one machine is enough (it would of course be much more interesting to cover the two-machine configuration, but unfortunately I cannot build one at the moment).
So I had to buy one of the cheapest CUDA-capable cards, a GeForce 210, in addition to my working card, a GeForce GTX 460. The following hardware configuration was thus prepared for installing Parallel Nsight:

Host:

CPU: QuadCore AMD Phenom II X4 965, 3918 MHz
Motherboard: Gigabyte GA-790FXTA-UD5 (3 PCI, 1 PCI-E x1, 3 PCI-E x16, 4 DDR3 DIMM, Audio, Dual Gigabit LAN, IEEE-1394)
Chipset: AMD 790FX, AMD K10
System memory: 4096 MB

Output:

Video adapter: NVIDIA GeForce 210 (512 MB)
Video adapter: NVIDIA GeForce GTX 460 (1024 MB)
Monitor: ENV LED2770h [NoDB] (AUBB1JA005271) (DVI)

As the operating system I used Windows 7 Enterprise x64. We also need Microsoft Visual Studio 2008 SP1 or later.

The required packages are available on the NVIDIA website. We will need:

Developer Drivers for WinVista and Win7
CUDA Toolkit
CUDA Computing SDK
Parallel Nsight 2.0.

Install the packages in the order listed. Now, when you open the new project wizard in Visual Studio, a new “NVIDIA” section should appear (the template ships with the CUDA Toolkit), containing the “NVIDIA CUDA 4.0” project type. Select it and create a project. If all the packages were installed correctly, the resulting “hello world” project can be compiled and run.
All OK? Then let us turn to the Parallel Nsight debugger itself. Since our machine is both the host and the target at once, you must first start the “Nsight Monitor” component. Open the code, set a breakpoint somewhere in the compute kernel, and launch the project with the dedicated button on the Nsight toolbar. Pay attention to a few points:

The project must be built in advance (the Nsight launch button does not compile it).
All breakpoints set outside the compute kernel are ignored when the program runs in Nsight debug mode. The reverse also holds: when the program is debugged in the normal mode, only breakpoints in ordinary (non-kernel) code are honored.
When you first run the Nsight debugger on Windows 7, you are likely to run into at least two problems: incompatibility with WPF hardware acceleration and with Windows Aero. Both must be turned off. The first is disabled by adding the following to the registry:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Software\Microsoft\Avalon.Graphics]
"DisableHWAcceleration"=dword:00000001
The second is turned off from the Control Panel. Alternatively, you can disable the check in Nsight itself: in Visual Studio, set Nsight -> Options -> Override local debugging checks to “True”, but this is fraught with problems. For example, if the code selects the video card that draws the desktop as the compute device and you start Nsight debugging, the system freezes permanently. It is not clear what exactly the incompatibility between Parallel Nsight and WPF/Aero consists of, since with the “Override local debugging checks” option enabled I observed no problems with either the debugger or these mechanisms.

So, here is the debugger stopped at a breakpoint:

Now, just as when debugging an ordinary application, you can inspect the values of the variables in scope. The full-featured watch window also lets you view arrays. In the screenshot above, the variable “A” is of type Matrix3:

typedef struct {
    int x_size;
    int y_size;
    int z_size;
    float4* elements;
} Matrix3;

The number of elements of the “elements” array that can be viewed is determined by the “Max array elements” parameter in the settings of the Parallel Nsight debugger.
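To find a particular cell in the flat “elements” array shown by the watch window, you need its linear index. The article does not say how the matrix is laid out, so the sketch below assumes that x varies fastest, then y, then z. The float4 stand-in and the helper names matrix3_offset and matrix3_count are hypothetical, and everything is written as plain host-side C++ so that it compiles without the CUDA headers:

```cpp
#include <cassert>
#include <cstddef>

// Host-side stand-in for CUDA's float4, so the sketch builds without CUDA headers.
struct float4 { float x, y, z, w; };

typedef struct {
    int x_size;
    int y_size;
    int z_size;
    float4* elements;
} Matrix3;

// Total number of float4 cells in the matrix.
inline std::size_t matrix3_count(const Matrix3& m) {
    return static_cast<std::size_t>(m.x_size) * m.y_size * m.z_size;
}

// Linear index of cell (x, y, z).
// ASSUMPTION (not stated in the article): x varies fastest, then y, then z.
inline std::size_t matrix3_offset(const Matrix3& m, int x, int y, int z) {
    assert(x >= 0 && x < m.x_size);
    assert(y >= 0 && y < m.y_size);
    assert(z >= 0 && z < m.z_size);
    return (static_cast<std::size_t>(z) * m.y_size + y) * m.x_size + x;
}
```

Comparing the computed offset with the “Max array elements” setting tells you whether the cell falls within the part of the array the debugger will display.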
As the values of blockIdx and threadIdx show, the debugger is stopped in the first thread of the first block of the grid. The question arises: how do you move to the thread you need? The Nsight tools include the “Nsight Cuda Device Summary” window, whose interface lets you move between warps in the neighborhood of the thread where execution stopped. The size of this neighborhood is determined by the hardware capabilities of the GPU. For example, when computing on the GeForce 210, two blocks of 4 warps each were available at the moment of stopping:

Similarly for the GeForce GTX 460:

Here 31 blocks are available. To move to a specific thread inside a warp, use the “Cuda Debug Focus” window (its interface also lets you move between blocks).
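When navigating first by warp and then by thread, it helps to know which warp a given threadIdx belongs to. Below is a minimal host-side C++ sketch, assuming the usual warp size of 32 and the thread ordering described in the CUDA programming guide (x varies fastest, then y, then z). Dim3, WarpPosition and the function names are hypothetical stand-ins, not part of Nsight or the CUDA runtime:

```cpp
#include <cassert>

// Hypothetical host-side stand-in for CUDA's dim3 / uint3.
struct Dim3 { unsigned x, y, z; };

// Warp index inside the block and lane (position inside the warp).
struct WarpPosition { unsigned warp; unsigned lane; };

// ASSUMPTION: 32 threads per warp, as on current NVIDIA GPUs.
constexpr unsigned kWarpSize = 32;

// Linear thread id inside a block: x + y * Dx + z * Dx * Dy.
inline unsigned linear_thread_index(Dim3 threadIdx, Dim3 blockDim) {
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y && threadIdx.z < blockDim.z);
    return threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;
}

// Which warp of the block the thread is in, and its lane within that warp.
inline WarpPosition warp_position(Dim3 threadIdx, Dim3 blockDim) {
    unsigned linear = linear_thread_index(threadIdx, blockDim);
    return { linear / kWarpSize, linear % kWarpSize };
}

// Number of warps needed to cover one block (rounded up).
inline unsigned warps_per_block(Dim3 blockDim) {
    unsigned threads = blockDim.x * blockDim.y * blockDim.z;
    return (threads + kWarpSize - 1) / kWarpSize;
}
```

For instance, a block of 128 threads is covered by 4 warps, and thread 70 of such a block sits in warp 2 at lane 6.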
Again a question: how do you get to a thread that does not fall within the neighborhood of the first thread? Conditional breakpoints are used for this. The syntax is as follows:
@blockIdx(x,y,z) && @threadIdx(x,y,z)
The debugger will stop in the specified thread, and the warp neighborhood around that thread will then be available.
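For a one-dimensional launch, the coordinates to put into such a condition can be derived from the global index of the thread you are interested in. The helper below is a hypothetical host-side C++ sketch (locate_thread and breakpoint_condition are illustrative names, not Nsight functions). It assumes a 1D grid of 1D blocks and simply formats the condition string in the syntax shown above:

```cpp
#include <cassert>
#include <string>

// Block index and thread index within the block, for a 1D launch.
struct ThreadCoords { unsigned block; unsigned thread; };

// ASSUMPTION: 1D grid and 1D blocks, so global = blockIdx.x * blockDim.x + threadIdx.x.
inline ThreadCoords locate_thread(unsigned global_index, unsigned block_size) {
    assert(block_size > 0);
    return { global_index / block_size, global_index % block_size };
}

// Builds the text to paste into the breakpoint condition field.
inline std::string breakpoint_condition(unsigned global_index, unsigned block_size) {
    ThreadCoords c = locate_thread(global_index, block_size);
    return "@blockIdx(" + std::to_string(c.block) + ",0,0) && @threadIdx(" +
           std::to_string(c.thread) + ",0,0)";
}
```

For example, with 128 threads per block, global thread 1000 lives in block 7 at thread 104, so the condition becomes @blockIdx(7,0,0) && @threadIdx(104,0,0).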
The NVIDIA Parallel Nsight package also includes powerful tools, called “Analysis Activity”, for analyzing CUDA parallel computations along various parameters, with graphs and so on, but that is a topic for a separate article.
My impressions of Parallel Nsight are entirely positive. A big plus, it seems to me, is the integration into the most popular development environment for Windows. I repeat that the product recently became completely free, which is very nice. And finally: it is the only tool for debugging CUDA programs under Windows, not counting the “NVIDIA Compute Visual Profiler” profiler.

Source: https://habr.com/ru/post/131882/
