
The eternal question of measuring time

It seemed that the long forum discussions about how to measure an algorithm's running time, which functions to use, and what accuracy to expect had finally ended. Unfortunately, we have to return to this question once more. On the agenda: how best to measure the speed of a parallel algorithm.

I must say right away that I will not give an exact recipe. I have only recently run into the problem of measuring the speed of parallel algorithms myself and am not an expert in this matter, so this article is more exploratory in nature. I will be glad to hear your opinions and recommendations; I think that together we will get to the bottom of the problem and work out an optimal solution.

The task is to measure the running time of a section of user code. Previously I used the following class for this:

  class Timing {
  public:
    void StartTiming();
    void StopTiming();
    double GetUserSeconds() const {
      return double(m_userTime) / 10000000.0;
    }
  private:
    __int64 GetUserTime() const;
    __int64 m_userTime;
  };

  __int64 Timing::GetUserTime() const {
    FILETIME creationTime;
    FILETIME exitTime;
    FILETIME kernelTime;
    FILETIME userTime;
    GetThreadTimes(GetCurrentThread(),
                   &creationTime, &exitTime,
                   &kernelTime, &userTime);
    __int64 curTime;
    curTime = userTime.dwHighDateTime;
    curTime <<= 32;
    curTime += userTime.dwLowDateTime;
    return curTime;
  }

  void Timing::StartTiming() {
    m_userTime = GetUserTime();
  }

  void Timing::StopTiming() {
    __int64 curUserTime = GetUserTime();
    m_userTime = curUserTime - m_userTime;
  }

This class is based on the GetThreadTimes function, which lets you separate the time spent running user code from the time spent in system functions. The class is intended to estimate the time a thread runs in user mode, so only the lpUserTime output parameter is used.

You can measure run time with tools such as Intel Parallel Amplifier, but it is often convenient to measure time from within the program itself; for example, this lets you write values to a log. So we will still try to create a class for measuring speed.

Consider some sample code, and use the Timing class to measure its execution time.

  int _tmain(int, _TCHAR *[])
  {
    Timing t;
    t.StartTiming();
    unsigned sum = 0;

    for (int i = 0; i < 1000000; i++)
    {
      char str[1000];
      for (int j = 0; j < 999; j++)
        str[j] = char(((i + j) % 254) + 1);
      str[999] = 0;
      for (char c = 'a'; c <= 'z'; c++)
        if (strchr(str, c) != NULL)
          sum += 1;
    }

    t.StopTiming();
    printf("sum = %u\n", sum);
    printf("%.3G seconds.\n", t.GetUserSeconds());
    return 0;
  }

In this form, the time measurement mechanism behaves as expected and, on my machine, reports about 7 seconds. The result is correct even on a multi-core machine, since it does not matter which cores are used while the algorithm runs (see Figure 1).
Figure 1 - Single-thread operation on a multi-core machine.

Now imagine that we want to use the capabilities of multi-core processors in our program, and we want to evaluate what parallelizing the algorithm with OpenMP gives us. Parallelizing our code requires adding just one line:

  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < 1000000; i++)
  {
    char str[1000];
    for (int j = 0; j < 999; j++)
      str[j] = char(((i + j) % 254) + 1);
    str[999] = 0;
    for (char c = 'a'; c <= 'z'; c++)
      if (strchr(str, c) != NULL)
        sum += 1;
  }

Now the program prints a running time of 1.6 seconds. Since the machine has 4 cores, one is tempted to say: "Hooray, we got a 4x speedup, and the time measurement confirms it."

Alas. We are measuring not the running time of the algorithm, but the running time of the main thread. In this case the measurement merely seems plausible, because the main thread worked as long as the additional ones. Let's run a simple experiment and explicitly request 10 threads instead of 4:
  #pragma omp parallel for reduction(+:sum) num_threads(10)

Logic suggests that this code should run for about the same time as the code parallelized into 4 threads. We have a four-core processor, so increasing the number of threads can only be expected to cause a slowdown. However, on the screen we will see a value of about 0.7 seconds.

This is an expected result, although we wanted to get quite a different one. 10 threads were created, each of which ran for about 0.7 seconds. That is exactly how long the main thread, whose time we measure with the Timing class, ran. As you can see, this method is not applicable for measuring the speed of programs with parallel sections of code. For clarity, Figure 2 illustrates this.
Figure 2 - This is how 10 threads might run on a machine with four cores.

Of course, you can always use the time() function, but its resolution is low and it does not let you separate user time from system time. In addition, other processes can affect the measurement and distort the results.

A favorite function of many developers for measuring speed is QueryPerformanceCounter. Let's build our speed measurement around it. In its simplest form, the measurement class looks like this:

  class Timing2 {
  public:
    void StartTiming();
    void StopTiming();
    double GetUserSeconds() const {
      return value;
    }
  private:
    double value;
    LARGE_INTEGER time1;
  };

  void Timing2::StartTiming()
  {
    QueryPerformanceCounter(&time1);
  }

  void Timing2::StopTiming()
  {
    LARGE_INTEGER performance_frequency, time2;
    QueryPerformanceFrequency(&performance_frequency);
    QueryPerformanceCounter(&time2);
    value = (double)(time2.QuadPart - time1.QuadPart);
    value /= performance_frequency.QuadPart;
  }

Unfortunately, on a multi-core machine this is no longer so simple. :) Here is what MSDN says about this function:

On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor affinity for a thread, use the SetThreadAffinityMask function.

Let's make improvements and tie the main thread to a single core:

  class Timing2 {
  public:
    void StartTiming();
    void StopTiming();
    double GetUserSeconds() const {
      return value;
    }
  private:
    DWORD_PTR oldmask;
    double value;
    LARGE_INTEGER time1;
  };

  void Timing2::StartTiming()
  {
    oldmask = SetThreadAffinityMask(::GetCurrentThread(), 1);
    QueryPerformanceCounter(&time1);
  }

  void Timing2::StopTiming()
  {
    LARGE_INTEGER performance_frequency, time2;
    QueryPerformanceFrequency(&performance_frequency);
    QueryPerformanceCounter(&time2);
    SetThreadAffinityMask(::GetCurrentThread(), oldmask);
    value = (double)(time2.QuadPart - time1.QuadPart);
    value /= performance_frequency.QuadPart;
  }

This measurement method still suffers from the drawback that it cannot separate system time from user time. And if other tasks run on the same core, the result can be very inaccurate. Nevertheless, it seems to me that this method, unlike GetThreadTimes, is still applicable to a parallel algorithm. If that is not the case, please correct me!

Let us see what the Timing and Timing2 classes show for different numbers of threads:

  int _tmain(int, _TCHAR *[])
  {
    Timing t;
    Timing2 t2;
    t.StartTiming();
    t2.StartTiming();
    unsigned sum = 0;

  //#pragma omp parallel for reduction(+:sum) num_threads(2)
  //#pragma omp parallel for reduction(+:sum) num_threads(4)
  //#pragma omp parallel for reduction(+:sum) num_threads(6)
  //#pragma omp parallel for reduction(+:sum) num_threads(10)
    for (int i = 0; i < 1000000; i++)
    {
      char str[1000];
      for (int j = 0; j < 999; j++)
        str[j] = char(((i + j) % 254) + 1);
      str[999] = 0;
      for (char c = 'a'; c <= 'z'; c++)
        if (strchr(str, c) != NULL)
          sum += 1;
    }

    t.StopTiming();
    t2.StopTiming();
    printf("sum = %u\n", sum);
    printf("GetThreadTimes: %.3G seconds.\n",
           t.GetUserSeconds());
    printf("QueryPerformanceCounter: %.3G seconds.\n",
           t2.GetUserSeconds());
    return 0;
  }


Let's summarize the data in the table shown in Figure 3.
Figure 3 - The running time of the algorithm in seconds measured using the GetThreadTimes and QueryPerformanceCounter functions on a quad-core machine.

As the table shows, as long as the number of threads does not exceed the number of cores, the GetThreadTimes function gives a result similar to QueryPerformanceCounter, which can create the impression that the measurements are correct. With a larger number of threads, however, its result can no longer be relied on.

I am waiting for your comments and descriptions of better ways to measure the running time of parallel algorithms.

Source: https://habr.com/ru/post/82832/

