
Implementing a singleton in a multithreaded application



Introduction


It is hard nowadays to imagine software that runs in a single thread. Of course, there are simple tasks for which one thread is more than enough, but most tasks of medium or high complexity use multithreading in one way or another. In this article I will talk about using singletons in a multithreaded environment. Despite its apparent simplicity, the topic hides many nuances and interesting questions, so I think it deserves an article of its own. Why singletons should be used, and how to use them correctly, is not discussed here; for those questions I recommend my previous articles on singletons [1], [2], [3]. Here we will look at how multithreading affects the implementation of singletons, and at the issues that come up during development.

Formulation of the problem


In previous articles, the following implementation of a singleton was considered:

template<typename T>
T& single()
{
    static T t;
    return t;
}

The idea of this function is quite simple: for any type T it creates an instance of that type on demand, i.e. "lazily", and the number of instances created through this function never exceeds one. As long as no instance is needed, there is no problem from the multithreading point of view (nor from the point of view of lifetime and other issues). But what happens if, in a multithreaded application, two or more threads call this function with the same type T at the same time?

C++ Standard


Before answering this question from a practical point of view, let us first look at the theory, i.e. at what the C++ standard says. Currently compilers support two standards: C++03 and C++11.

§6.7/4, C++03

The zero-initialization (8.5) of all local objects with static storage duration (3.7.1) is performed before any other initialization takes place. A local object of POD type (3.9) with static storage duration initialized with constant-expressions is initialized before its block is first entered. [...] Otherwise such an object is initialized the first time control passes through its declaration; such an object is considered initialized upon the completion of its initialization. [...] If control re-enters the declaration (recursively) while the object is being initialized, the behavior is undefined.

§6.7/4, C++11

The zero-initialization (8.5) of all block-scope variables with static storage duration (3.7.1) or thread storage duration (3.7.2) is performed before any other initialization takes place. Constant initialization (3.6.2) of a block-scope entity with static storage duration, if applicable, is performed before its block is first entered. [...] Otherwise such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. [...] If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization (*). If control re-enters the declaration recursively while the variable is being initialized, the behavior is undefined.

(*) The implementation must not introduce any deadlock around execution of the initializer.
( emphasis mine )

In short, the new standard says that if, while a variable is being initialized (i.e. the instance is being created), a second thread tries to access the same variable, that thread must wait for the initialization to complete, and the implementation must not allow deadlocks. The earlier standard, as you can see, says not a word about multithreading.
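This guarantee can be demonstrated with the standard library alone, without any framework. Below is a minimal sketch (not from the original article; the names Expensive, instance and race are mine, and a C++11-conformant compiler is assumed): several threads race to touch a function-local static whose constructor is deliberately slow, and a counter records how many times the constructor actually ran.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// counts how many times the constructor actually runs
std::atomic<int> constructions{0};

struct Expensive {
    Expensive() {
        ++constructions;
        // emulate a long-running initialization
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    int value = 42;
};

// Meyers singleton: since C++11 this initialization must be thread-safe
Expensive& instance() {
    static Expensive e;
    return e;
}

// start several threads that race to initialize the singleton and
// return the number of constructor runs observed afterwards
int race(int nthreads) {
    std::vector<std::thread> threads;
    for (int i = 0; i < nthreads; ++i)
        threads.emplace_back([] { (void)instance().value; });
    for (auto& t : threads) t.join();
    return constructions.load();
}
```

On a compiler that implements §6.7/4 of C++11, the constructor runs exactly once no matter how many threads race.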

It remains to find out which compilers really support the new standard, and which only pretend to. To do this, let us conduct the following experiment.

Experiment


For the multithreading primitives I will use the Ultimate++ framework. It is quite lightweight and easy to use. For the purposes of this article the choice is not fundamental (you could use boost, for example).

For our experiment, we will write a class, the creation of which takes quite a long time:

struct A
{
    A()
    {
        Cout() << '{';       // start of construction
        Thread::Sleep(10);   // sleep for 10 ms
        if (++x != 1)
            Cout() << '2';   // error: constructed more than once
        Cout() << '}';       // end of construction
    }
    ~A()
    {
        Cout() << '~';       // destruction
    }
    int x;
};

At the moment the object is created the value of x is 0, since we plan to use the class only through the singleton, i.e. with the static keyword, which guarantees that all POD members are zero-initialized. We then wait a while, emulating a long-running operation. At the end there is a check of the expected value; if it differs from one, we print an error marker. I used single-character output here to show the sequence of operations more clearly. I deliberately avoided longer messages, since they would require additional synchronization for multithreaded use, which is exactly what we want to avoid.
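The zero-initialization relied on here can be checked directly. A minimal sketch (the type and function names are mine, for illustration only): a POD type with no constructor still has its members zeroed when the object has static storage duration.

```cpp
// a POD type: no constructor, so nothing initializes x explicitly
struct Pod { int x; };

Pod global_pod;            // static storage duration => zero-initialized

int local_static_pod_x() {
    static Pod p;          // local static: zero-initialized before first use
    return p.x;
}
```

An automatic (stack) Pod, by contrast, would leave x indeterminate; it is the static storage duration that makes the "x starts at 0" assumption in class A safe.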

Next, we will write a function called when creating new threads:

void threadFunction(int i)
{
    Cout() << char('a'+i);   // lower-case letter: the thread has entered the function
    A& a = single<A>();      // obtain the singleton instance
    if (a.x == 0)
        Cout() << '0';       // error: the object is not initialized
    Cout() << char('A'+i);   // upper-case letter: the thread is about to exit
}

And we will call threadFunction simultaneously from 5 threads, emulating concurrent access to the singleton:

for (int i = 0; i < 5; ++i)
    Thread::Start(callback1(threadFunction, i));
Thread::ShutdownThreads();

For the experiment, I chose 2 compilers that are quite popular today: MSVC 2010 and GCC 4.5. Testing was also conducted using the MSVC 2012 compiler, the result was fully consistent with the 2010 version, so I’ll omit the mention of it later.

Startup result for GCC:
ab{cde}ABCDE~

Startup result for MSVC:
ab{0cB0dCe00DE}A~

Discussion of the results of the experiment


Let us discuss the results. For GCC the following happens:
  1. start threadFunction function for thread 1
  2. start threadFunction function for thread 2
  3. start of singleton initialization
  4. start threadFunction function for thread 3
  5. start threadFunction function for thread 4
  6. start threadFunction function for thread 5
  7. completion of singleton initialization
  8. exit from the threadFunction function consistently for all threads 1-5
  9. completion of the program and destruction of singleton

There are no surprises here: the singleton is initialized only once and the threadFunction function completes its work only after the completion of the initialization of the singleton => GCC correctly initializes the object in a multithreaded environment.

The situation with MSVC is somewhat different:
  1. start threadFunction function for thread 1
  2. start threadFunction function for thread 2
  3. start of singleton initialization
  4. error: singleton is not initialized
  5. start threadFunction function for thread 3
  6. exit from threadFunction function for thread 2
  7. error: singleton is not initialized
  8. ...
  9. exit from threadFunction function for thread 5
  10. completion of singleton initialization
  11. exit from threadFunction function for thread 1
  12. completion of the program and destruction of singleton

In this case the compiler starts initializing the singleton for the first thread, while for the rest it immediately returns an object that has not yet had time to initialize. Thus, MSVC does not ensure correct operation in a multithreaded environment.

Analysis of the results of the experiment


Let's try to understand where the difference between the two compilers' results comes from. To do this, we compile and disassemble the code:

GCC :
5       T& single()
   0x00418ad8 <+0>:   push   %ebp
   0x00418ad9 <+1>:   mov    %esp,%ebp
   0x00418adb <+3>:   sub    $0x28,%esp
6       {
7           static T t;
   0x00418ade <+6>:   cmpb   $0x0,0x48e070
   0x00418ae5 <+13>:  je     0x418af0 <single<A>()+24>
   0x00418ae7 <+15>:  mov    $0x49b780,%eax
   0x00418aec <+20>:  leave
   0x00418aed <+21>:  ret
   0x00418af0 <+24>:  movl   $0x48e070,(%esp)
   0x00418af7 <+31>:  call   0x485470 <__cxa_guard_acquire>
   0x00418afc <+36>:  test   %eax,%eax
   0x00418afe <+38>:  je     0x418ae7 <single<A>()+15>
   0x00418b00 <+40>:  movl   $0x49b780,(%esp)
   0x00418b07 <+47>:  call   0x4195d8 <A::A()>
   0x00418b0c <+52>:  movl   $0x48e070,(%esp)
   0x00418b13 <+59>:  call   0x4855cc <__cxa_guard_release>
   0x00418b18 <+64>:  movl   $0x485f04,(%esp)
   0x00418b1f <+71>:  call   0x401000 <atexit>
8           return t;
9       }
   0x00418b24 <+76>:  mov    $0x49b780,%eax
   0x00418b29 <+81>:  leave
   0x00418b2a <+82>:  ret

It can be seen that before calling the constructor of A the compiler inserts calls to the synchronization functions __cxa_guard_acquire / __cxa_guard_release, which makes it safe to call single concurrently while the singleton is being initialized.

MSVC :
T& single()
{
00E51420  mov   eax,dword ptr fs:[00000000h]
00E51426  push  0FFFFFFFFh
00E51428  push  offset __ehhandler$??$single@UA@@@@YAAAUA@@XZ (0EE128Eh)
00E5142D  push  eax
    static T t;
00E5142E  mov   eax,1
00E51433  mov   dword ptr fs:[0],esp                  ; set up exception handling
00E5143A  test  byte ptr [`single<A>'::`2'::`local static guard' (0F23944h)],al
00E51440  jne   single<A>+47h (0E51467h)              ; already initialized? skip construction
00E51442  or    dword ptr [`single<A>'::`2'::`local static guard' (0F23944h)],eax
00E51448  mov   ecx,offset t (0F23940h)
00E5144D  mov   dword ptr [esp+8],0                   ; set the guard flag
00E51455  call  A::A (0E51055h)                       ; call the constructor, with no synchronization
00E5145A  push  offset `single<A>'::`2'::`dynamic atexit destructor for 't'' (0EED390h)
00E5145F  call  atexit (0EA0AD1h)
00E51464  add   esp,4
    return t;
}
00E51467  mov   ecx,dword ptr [esp]
00E5146A  mov   eax,offset t (0F23940h)
00E5146F  mov   dword ptr fs:[0],ecx
00E51476  add   esp,0Ch
00E51479  ret

Here the compiler uses the variable at address 0x0F23944 as the initialization flag. If the object has not been initialized yet, the flag is set to one and the constructor is simply called. It can be seen that no synchronization is involved, which explains the result of our experiment.

A simple solution


There is a fairly simple way to solve our problem: before creating the object, we take a mutex that synchronizes access to it:

// mutex wrapper for protecting static initialization
struct StaticLock : Mutex::Lock
{
    StaticLock() : Mutex::Lock(mutex) { Cout() << '+'; }   // lock taken
    ~StaticLock()                     { Cout() << '-'; }   // lock released
private:
    static Mutex mutex;
};

Mutex StaticLock::mutex;

template<typename T>
T& single()
{
    StaticLock lock;   // here mutex.lock() is called
    static T t;        // object initialization
    return t;          // mutex.unlock() is called on scope exit
}

Startup Result:

ab+{cde}-A+-B+-C+-D+-E~

Sequence of operations:
  1. start threadFunction function for thread 1
  2. start threadFunction function for thread 2
  3. the global lock is taken: mutex.lock()
  4. start of singleton initialization
  5. start threadFunction function for thread 3
  6. start threadFunction function for thread 4
  7. start threadFunction function for thread 5
  8. completion of singleton initialization
  9. the global lock is released: mutex.unlock()
  10. exit from threadFunction function for thread 1
  11. the global lock is taken: mutex.lock()
  12. the global lock is released: mutex.unlock()
  13. exit from threadFunction function for thread 2
  14. ...
  15. exit from threadFunction function for thread 5
  16. completion of the program and destruction of the singleton

Such an implementation completely eliminates the problem of returning an uninitialized object: mutex.lock() is called before initialization begins, and mutex.unlock() after it completes, so the remaining threads wait for the initialization to finish before using the object. However, this approach has a significant drawback: the lock is taken on every call, regardless of whether the object is already initialized. To improve performance, we would like synchronization to be used only when accessing an object that has not yet been initialized (as GCC implements it).
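For readers without Ultimate++ at hand, the same always-lock approach can be sketched with the standard library (an illustrative translation, not the author's code; the names g_static_mutex and single_locked are mine):

```cpp
#include <mutex>

// one global lock guarding every lazy initialization,
// playing the role of StaticLock's static Mutex
std::mutex g_static_mutex;

template <typename T>
T& single_locked() {
    // the lock is taken on every call, even when t is already
    // constructed: correct, but it pays the locking cost each time
    std::lock_guard<std::mutex> lock(g_static_mutex);
    static T t;
    return t;
}
```

Note that on a fully conformant C++11 compiler the local static is already guarded by the compiler itself; the explicit mutex matters precisely for compilers, like the MSVC versions tested here, that do not yet provide that guarantee.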

Double-checked locking pattern


To implement the above idea, an approach called the Double-checked locking pattern ( DCLP ) is often used. Its essence is the following sequence of actions:
  1. check the condition: initialized or not? If yes, immediately return a reference to the object
  2. take the lock
  3. check the condition a second time; if initialized, release the lock and return the reference
  4. initialize the singleton
  5. change the condition to "initialized"
  6. release the lock and return the reference

This sequence makes the name clear: we check the condition twice, first before taking the lock and then immediately after. The point is that a failed first check does not necessarily mean the object is uninitialized, for example when two threads enter the function at the same time. Both threads see the state "not initialized"; one of them takes the lock and the other waits. Without the second check, the thread that was waiting on the lock would then reinitialize the singleton, which can lead to dire consequences.

DCLP can be illustrated by the following example:

template<typename T>
T& single()
{
    static T* pt;
    if (pt == 0)           // first check, without the lock
    {
        StaticLock lock;
        if (pt == 0)       // second check, under the lock
            pt = new T;
    }
    return *pt;
}

Here the pointer to the created object serves as the condition: if it is null, the object must be initialized. It would seem that everything is fine: there are no performance problems, and everything works. However, it turned out not to be so rosy. At one point it was even argued that this is not a pattern but an anti-pattern, i.e. it should not be used because it leads to subtle errors. Let's try to figure out why.

First, such a singleton is never deleted. This is not a huge problem: the lifetime of a singleton matches the lifetime of the application, so the operating system will clean up after it (unless, of course, some nontrivial finalization is required, such as writing to a log or sending a request to a database to update the application's status record).
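If such finalization is needed, the deletion can be registered with atexit, the same mechanism the compiler itself uses for local statics, as the disassembly above showed. A minimal sketch (my own illustration, with synchronization omitted for brevity; the name single_with_cleanup is hypothetical):

```cpp
#include <cstdlib>

// like the raw-pointer singleton, but registers deletion of the
// instance at normal program termination
template <typename T>
T& single_with_cleanup() {
    static T* pt = nullptr;
    if (pt == nullptr) {
        pt = new T;
        // a capture-less lambda may refer to pt here because pt has
        // static storage duration, so no capture is required
        std::atexit([] { delete pt; pt = nullptr; });
    }
    return *pt;
}
```

This restores the destructor call that the pointer-based DCLP version loses, at the cost of tying destruction order to atexit's reverse-registration rules.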

The second more serious problem is the following line:

 pt = new T; 

Let's look at it in more detail. This line can be rewritten as follows (I omit exception handling for brevity):

pt = static_cast<T*>(operator new(sizeof(T)));  // allocate raw memory for the object
new (pt) T;                                     // placement new: construct the object in that memory

That is, memory is allocated first, and only then is the object initialized by calling its constructor. It may therefore happen that the memory has already been allocated and pt has been updated while the object has not yet been constructed. If some thread then performs the first check outside the lock, single will return a reference to memory that is allocated but not initialized.
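For completeness: in C++11 this publication problem can be fixed with atomics, by storing the pointer with release ordering and loading it with acquire ordering. This is a sketch of a well-known modern variant, not the approach taken by the article (the name single_dclp is mine):

```cpp
#include <atomic>
#include <mutex>

template <typename T>
T& single_dclp() {
    static std::atomic<T*> pt{nullptr};
    static std::mutex m;
    T* p = pt.load(std::memory_order_acquire);     // first check, no lock
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(m);
        p = pt.load(std::memory_order_relaxed);    // second check, under the lock
        if (p == nullptr) {
            p = new T;
            // release: the constructor's writes become visible before
            // any other thread can observe the non-null pointer
            pt.store(p, std::memory_order_release);
        }
    }
    return *p;
}
```

The acquire/release pair is exactly what rules out the "pointer published before the object is constructed" interleaving described above.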

Let us now try to correct both problems described above.

Proposed approach


Let's introduce two functions for creating a singleton: one written as if the application were single-threaded, and another for multithreaded use:

// the original implementation, safe only in a single-threaded program
template<typename T>
T& singleUnsafe()
{
    static T t;
    return t;
}

// thread-safe wrapper for multithreaded use
template<typename T>
T& single()
{
    static T* volatile pt;
    if (pt == 0)
    {
        T* tmp;
        {
            StaticLock lock;
            tmp = &singleUnsafe<T>();
        }
        pt = tmp;
    }
    return *pt;
}

The idea is as follows. We know that our initial implementation (now the singleUnsafe function) works fine in a single-threaded application. So all we need is to serialize the calls, which proper use of the lock achieves. In a sense there are also two checks here, only the first check, outside the lock, uses the pointer, while the second uses the internal guard variable generated by the compiler. The volatile keyword is used to prevent the compiler from reordering the operations when optimizing aggressively. Note also that the assignment to pt happens outside the lock; this is done so that the processor cannot reorder the operations while executing the code.
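The same call serialization can also be expressed with std::call_once from C++11, which guarantees that the initializer runs exactly once and that concurrent callers wait for it. This is an alternative sketch, not the author's implementation (the name single_once is mine, and like the DCLP variant it never deletes the object):

```cpp
#include <mutex>

template <typename T>
T& single_once() {
    static T* pt = nullptr;
    static std::once_flag flag;
    // call_once runs the initializer exactly once; threads arriving
    // while it runs block until it completes
    std::call_once(flag, [] { pt = new T; });
    return *pt;
}
```

Both pt and flag here rely only on constant initialization, so no compiler-generated guard is needed for them even on a pre-C++11 MSVC-style implementation of local statics.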

The result of compiling such an implementation is shown below:

template<typename T>
T& single()
{
; set up exception handling
00083B30  push  0FFFFFFFFh
00083B32  push  offset __ehhandler$??$single@UA@@@@YAAAUA@@XZ (0A13B6h)
00083B37  mov   eax,dword ptr fs:[00000000h]
00083B3D  push  eax
00083B3E  mov   dword ptr fs:[0],esp
00083B45  push  ecx
    static T* volatile pt;
    if (pt == 0)
; first check, outside the lock
00083B46  mov   eax,dword ptr [pt (0E3950h)]
00083B4B  test  eax,eax
00083B4D  jne   single<A>+7Dh (83BADh)
    {
        T* tmp;
        {
            StaticLock lock;
; take the lock: EnterCriticalSection
00083B4F  push  offset staticMutex (0E3954h)
00083B54  mov   dword ptr [esp+4],offset staticMutex (0E3954h)
00083B5C  call  dword ptr [__imp__EnterCriticalSection@4 (0ED6A4h)]
            tmp = &singleUnsafe<T>();
00083B62  mov   eax,1
00083B67  mov   dword ptr [esp+0Ch],0
; second check: the compiler-generated guard variable
00083B6F  test  byte ptr [`singleUnsafe<A>'::`2'::`local static guard' (0E394Ch)],al
00083B75  jne   single<A>+68h (83B98h)
00083B77  or    dword ptr [`singleUnsafe<A>'::`2'::`local static guard' (0E394Ch)],eax
00083B7D  mov   ecx,offset t (0E3948h)
00083B82  mov   byte ptr [esp+0Ch],al
; call the constructor
00083B86  call  A::A (1105Fh)
00083B8B  push  offset `singleUnsafe<A>'::`2'::`dynamic atexit destructor for 't'' (0AD4D0h)
00083B90  call  atexit (60BB1h)
00083B95  add   esp,4
        }
; release the lock: LeaveCriticalSection
00083B98  push  offset staticMutex (0E3954h)
00083B9D  call  dword ptr [__imp__LeaveCriticalSection@4 (0ED6ACh)]
        pt = tmp;
; publish the pointer after the lock is released
00083BA3  mov   dword ptr [pt (0E3950h)],offset t (0E3948h)
    }
    return *pt;
}
00083BAD  mov   ecx,dword ptr [esp+4]
; return the pointer value in eax
00083BB1  mov   eax,dword ptr [pt (0E3950h)]
00083BB6  mov   dword ptr fs:[0],ecx
00083BBD  add   esp,10h
00083BC0  ret

I added comments to the assembly so that it is clear what is going on. The exception-handling code is interesting to note: quite an impressive chunk. Compare it with the GCC code, where stack unwinding is table-driven, with zero overhead when no exception is thrown. The x64 code generated by MSVC shows a slightly different approach to exception handling:

template<typename T>
T& single()
{
000000013F401600  push  rdi
000000013F401602  sub   rsp,30h
000000013F401606  mov   qword ptr [rsp+20h],0FFFFFFFFFFFFFFFEh
000000013F40160F  mov   qword ptr [rsp+48h],rbx
    static T* volatile pt;
    if (pt == 0)
000000013F401614  mov   rax,qword ptr [pt (13F4F5890h)]
000000013F40161B  test  rax,rax
000000013F40161E  jne   single<A>+75h (13F401675h)
    {
        T* tmp;
        {
            StaticLock lock;
000000013F401620  lea   rbx,[staticMutex (13F4F58A0h)]
000000013F401627  mov   qword ptr [lock],rbx
000000013F40162C  mov   rcx,rbx
000000013F40162F  call  qword ptr [__imp_EnterCriticalSection (13F50CCC0h)]
; nop !!!
000000013F401635  nop
            tmp = &singleUnsafe<T>();
000000013F401636  mov   eax,dword ptr [`singleUnsafe<A>'::`2'::`local static guard' (13F4F588Ch)]
000000013F40163C  lea   rdi,[t (13F4F5888h)]
000000013F401643  test  al,1
000000013F401645  jne   single<A>+65h (13F401665h)
000000013F401647  or    eax,1
000000013F40164A  mov   dword ptr [`singleUnsafe<A>'::`2'::`local static guard' (13F4F588Ch)],eax
000000013F401650  mov   rcx,rdi
000000013F401653  call  A::A (13F401087h)
000000013F401658  lea   rcx,[`singleUnsafe<A>'::`2'::`dynamic atexit destructor for 't'' (13F4A6FF0h)]
000000013F40165F  call  atexit (13F456664h)
; nop !!!
000000013F401664  nop
        }
000000013F401665  mov   rcx,rbx
000000013F401668  call  qword ptr [__imp_LeaveCriticalSection (13F50CCD0h)]
        pt = tmp;
000000013F40166E  mov   qword ptr [pt (13F4F5890h)],rdi
    }
    return *pt;
000000013F401675  mov   rax,qword ptr [pt (13F4F5890h)]
}
000000013F40167C  mov   rbx,qword ptr [rsp+48h]
000000013F401681  add   rsp,30h
000000013F401685  pop   rdi
000000013F401686  ret

I have specifically marked the nop instructions. They serve as markers for stack unwinding when an exception is thrown. This approach likewise has no runtime overhead when no exception is thrown.

Conclusions


So, it is time to draw conclusions. The article shows that compilers differ in their support for the new standard: GCC makes every effort to keep up with modern realities and correctly handles singleton initialization in a multithreaded environment, while MSVC lags somewhat behind, so the careful singleton implementation described in this article is required. That approach is a universal and efficient implementation without serious synchronization overhead.

P.S.


This article is an introduction to multithreading questions. It solves the problem of access during the creation of a singleton object. Further use of its data raises other serious questions, which will be discussed in detail in the next article.

Update


The singleton implementation was corrected based on comments on the article.

Literature


[1] Habrahabr: Using the Singleton pattern
[2] Habrahabr: Singleton and object lifetime
[3] :
[4] Final Committee Draft (FCD) of the C++0x standard
[5] C++ Standard — ANSI ISO IEC 14882 2003
[6] Ultimate++: C++ cross-platform rapid application development framework
[7] Boost C++ libraries
[8] Wikipedia: Double-checked locking
[9] , double-checked locking

Source: https://habr.com/ru/post/150276/

