
Fighting mysterious MSBuild failures in XamlTaskFactory

Our team is developing a cross-platform application core that has to build on Windows under Visual Studio 2015, on Linux with gcc 4.9+, and on macOS, iOS, Android, and Windows Phone 8.1+. To check the code automatically, we set up Jenkins builds for all the required configurations. Their job is to catch code that does not compile on one or more of the platforms, or fails unit tests, and to keep it away from the teams building the final applications until the appropriate fixes are made. Such a CI process lets a developer work locally in whatever operating system and development environment is convenient, be it Visual Studio, Xcode, QtCreator, or even vim + ninja, without fear that his changes will break the build or the tests in another environment.

In an ideal world, a red build on Jenkins (which is what we use as our build server) means a problem in the code. Having seen the red light on the monitor hanging in the corner of the room, the person on build duty should go and fix the problem it found. In reality, the reasons for a failed build can be quite different: a lost connection to the node doing the compilation, disk space running out, or the arrival of aliens. Such false positives cost the team extra time, dull attention, and generally erode trust in CI. This is the story of the fight against one such problem.

The problem was specific to MSBuild and looked like this in the build log:

20:03:56 "D:\jenkins\workspace\task\ws\...\SomeTarget.vcxproj" (default target) (429) -> 20:03:56 (_QtMetaObjectCompilerH target) -> 20:03:56 D:\jenkins\workspace\task\ws\...\SomeQtBasedTarget.targets(52,5): error MSB4175: The task factory "XamlTaskFactory" could not be loaded from the assembly "Microsoft.Build.Tasks.Core". Could not find file 'D:\jenkins\workspace\task\ws\TEMP\fv5nnzin.dll'. [D:\jenkins\workspace\jenkins\workspace\task\ws\...\SomeTarget.vcxpro] 

For a while the problem did not show up often, once every few days, and only made us curse and restart the failed build one more time. But after moving from virtual machines to shiny new bare-metal nodes, things got worse: random failures could happen several times a day. The situation was all the more unacceptable given how long the project took to build (tens of minutes, something we were fighting in parallel). Sometimes an urgent fix had to be pushed through CI, and after a long wait you could hit this very failure, and then you had to wait all over again.
So, what led to the error?

To generate projects we use gyp, which offers two ways to invoke an external command during the build: actions and rules. Actions are implemented via CustomBuild items inside vcxproj files.

An example from the documentation:

 <ItemGroup>
   <CustomBuild Include="faq.txt">
     <Message>Copying readme...</Message>
     <Command>copy %(Identity) $(OutDir)%(Identity)</Command>
     <Outputs>$(OutDir)%(Identity)</Outputs>
   </CustomBuild>
 </ItemGroup>

And everything is fine with those, they do not explode. Rules, however, use a different mechanism. The comment in the code reads as follows:
MSBuild rules are implemented using an XML file, a .targets file and a .props file. See blogs.msdn.com/b/vcblog/archive/2010/04/21/quick-help-on-vs2010-custom-build-rule.aspx .

How does it work? For each such rule, MSBuild generates C# source code (a .cs file) in %TEMP%, tries to compile a DLL from it and use it right away, and if that does not work out, throws an exception.
The comment says:
This occurs if there is a failure to compile the assembly. We will take care of the failure below.
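By the way, you can see this machinery at work by peeking into %TEMP% on the build node right after such a build; the file names are random, like the fv5nnzin.dll from the log above. A minimal check, assuming you are inspecting the same account the build runs under:

 :: List the transient C# sources and helper DLLs left in %TEMP%, oldest first;
 :: the names are random 8-character strings.
 dir /b /od "%TEMP%\*.cs" "%TEMP%\*.dll"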

And indeed, in the Windows event log, a couple of seconds before the error time (judging by the build server log), you can find roughly the following record of a C# compiler crash:

 Faulting application name: csc.exe, version: 4.6.1055.0, time stamp: 0x563c1a09
 Faulting module name: KERNELBASE.dll, version: 6.3.9600.18233, time stamp: 0x56bb4ebb
 Exception code: 0xc0000142
 Fault offset: 0x00000000000ecdd0
 Faulting process id: 0x1af4
 Faulting application start time: 0x01d1d13dbec0f5bd
 Faulting application path: C:\Windows\Microsoft.NET\Framework64\v4.0.30319\csc.exe
 Faulting module path: KERNELBASE.dll
 Report Id: fc6cf36d-3d30-11e6-8260-0cc47ab21249
 Faulting package full name:
 Faulting package-relative application ID:
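For reference, a sketch of how such crash records can be pulled from the Application event log on the command line (the record count here is arbitrary):

 :: Show the three most recent "Application Error" crash records, newest first, as plain text.
 wevtutil qe Application /q:"*[System[Provider[@Name='Application Error']]]" /c:3 /rd:true /f:text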

A Google search for similar errors suggested the culprit: the size of the desktop heap for non-interactive sessions. It did look plausible: the exception code matched, and the Jenkins slave agent was running as a Windows service.

Taking this hypothesis into work, I began to play with the SharedSection value in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows registry entry. Along the way, I accidentally managed to make the build fail with almost 100% probability, which made the debugging iterations somewhat faster. After reading a little more, I got to the "Allow service to interact with desktop" checkbox in the properties of the Jenkins service, and then to the NoInteractiveServices value under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Windows. But none of these attempts bore fruit: sometimes the builds passed, but no pattern emerged.
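For those who want to try the same thing: the value lives in the Windows string of the SubSystems key, and, as far as I understand, the third number in SharedSection is the desktop heap size in KB for non-interactive sessions, the one that services are said to exhaust. A quick way to inspect it:

 :: Print the current Windows value; it contains a fragment like
 :: "SharedSection=1024,20480,768", where the third number is the
 :: desktop heap (in KB) for non-interactive desktops.
 reg query "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems" /v Windows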

Continuing to dig into the peculiarities of processes started from Jenkins, I came across a post on StackOverflow. The author describes the default behavior of MSBuild when the /m option is given for building several projects in parallel. The bottom line is that MSBuild spawns the required number of copies of itself, worker nodes waiting for tasks. During the build, tasks are scattered across these nodes and executed in parallel. Once the build is finished, the nodes are not shut down and keep waiting for new tasks. That is exactly what we saw on Jenkins: after a build ended, MSBuild processes kept hanging around in memory, as the quick check below shows.
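A simple way to confirm the symptom on a build node after a build has finished:

 :: With node reuse enabled, idle MSBuild worker nodes stay alive
 :: even though the build itself is long over.
 tasklist /fi "imagename eq MSBuild.exe"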

I started experimenting. Having reproduced the build failure several times in a row, I killed all the MSBuild processes sitting in memory and, oh miracle, the next build succeeded! Then, armed with the instructions from StackOverflow, I changed our build script to set the MSBUILDDISABLENODEREUSE variable and to pass the /nr:false option to the MSBuild call. After that, all MSBuild processes started exiting at the end of the build instead of staying in memory.
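In our batch build script the change boiled down to roughly the following sketch (the solution name and the other msbuild arguments here are placeholders, not our real ones):

 :: One-off cleanup of worker nodes left over from previous builds.
 taskkill /f /im MSBuild.exe 2>nul
 :: Tell MSBuild not to keep worker nodes alive after the build finishes;
 :: the environment variable and the /nr (nodeReuse) switch do the same
 :: thing, we set both just in case.
 set MSBUILDDISABLENODEREUSE=1
 msbuild OurSolution.sln /m /nr:false /p:Configuration=Release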

The solution turned out to work. Almost two weeks have passed, and the problem has not reproduced once. And although I never got fully to the bottom of the root cause of the error, I found a workaround that works and, I hope, will help someone else.

Source: https://habr.com/ru/post/307104/

