
Battle for ADFS (Active Directory Federation Services)

Prehistory


The project began as a portal based on SharePoint 2007 and was later moved to SharePoint 2010. Initially, all users were in Active Directory. There was only one type of user, and the relationships between them were fairly simple. Then new types of users appeared, connected to each other in complex ways. Gradually the project also acquired various related subsystems, some running inside the portal and some outside. All of this complicated the authorization scheme.



What were the problems?


Up to a certain point, NTLM satisfied all our needs. The only problem was that when moving to a related service whose URL differed from the portal address, users had to re-enter their login and password. In principle, this could have been solved with the Web Application Proxy product. Later, however, Java modules appeared and the environment became largely heterogeneous. The need to give access to users from third-party domains also loomed on the horizon. It became clear that solving these problems required Single Sign-On (SSO). We decided to implement ADFS and move the portal and all services to this technology.

We discussed our vision with the customer and scheduled Day X, when everything was supposed to run on ADFS.

Course of events:



Day X - 1 year

First of all, we decided to switch one of the modules to ADFS.
The idea was to enable ADFS on the portal, hit every bump that could be hit, get burned wherever we could get burned, and so on. Having identified a number of problems, we successfully made this switch, as already described at habrahabr.ru/company/eastbanctech/blog/209834 .

Day X - 3 months

The first thing we started with was refactoring the code so that it would work correctly under claims authentication. A couple of examples of what we changed:

1. Assignment of permissions on SharePoint list items to users and Active Directory groups. Since permissions now had to be granted to claims, we had to rewrite all of these places. Fortunately, almost all of them used one of our libraries for working with AD (and those that did not were changed so that they would).

Instead of granting permissions to users in the "domain\user" form, permissions were now issued to claims of the form "i:0e.t|ADFS|user@domain"; for "domain\group", to the form "c:0-.t|ADFS|group" respectively. We also had to separate the cases where permissions are granted to a user from those where they are granted to a group, because without claims, domain users and groups look the same in MS SharePoint. Thus, the GetPrincipalName method, which returns the full name of a principal, was split into two:

public static string GetGroupPrincipalName(string group)
{
    if (string.IsNullOrEmpty(TrustedIdentityProviderName))
    {
        return string.Format(CultureInfo.InvariantCulture, "{0}\\{1}",
            CurrentDomain, GetPrincipalNameWithoutDomain(group));
    }
    return string.Concat("c:0-.t|", TrustedIdentityProviderName, "|",
        GetPrincipalNameWithoutDomain(@group));
}

public static string GetUserPrincipalName(string user)
{
    if (string.IsNullOrEmpty(TrustedIdentityProviderName))
    {
        return string.Format(CultureInfo.InvariantCulture, "{0}\\{1}",
            CurrentDomain,
            GetPrincipalNameWithoutDomain(user.ToLower(CultureInfo.InvariantCulture)));
    }
    return string.Concat("i:0e.t|", TrustedIdentityProviderName, "|",
        GetPrincipalNameWithoutDomain(user), TrustedIdentityProviderDomain);
}

2. Group membership checks
Group membership checks also went through this library, so not much had to change there. We only changed the part that determines the user name: it now had to handle both plain domain accounts and claims-based accounts:

public static string GetPrincipalNameWithoutDomain(string principal)
{
    if (string.IsNullOrEmpty(principal))
        return string.Empty;

    return Regex.Match(principal.Trim(),
        @"(i:)?0e.t.*\|(?<userName>[\d\w_&\.\s-]+)@[\w\d\._]|(c:)?0-.t.*\|(?<userName>[\d\w_&\.\s-]+)$|^(?<userName>[\d\w_&\.\s-]+)@[\w\d\._]|^(?<userName>[\d\w_&\.\s-]+)$|[\w]+\\(?<userName>[\d\w_&\.\s-]+)$")
        .Groups["userName"].Value;
}
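To illustrate what this method does, here is a minimal JavaScript sketch of the same parsing logic (the function name and sample principals are illustrative; JavaScript regexes do not allow the duplicate named groups the .NET pattern relies on, so each principal format is tried in turn):

```javascript
// Illustrative sketch of GetPrincipalNameWithoutDomain in JavaScript.
// Each supported principal format is tried in order; the first match
// yields the bare account or group name.
function getPrincipalNameWithoutDomain(principal) {
  if (!principal) return '';
  var p = principal.trim();
  var patterns = [
    /^(?:i:)?0e\.t.*\|([\w&.\s-]+)@[\w._]/, // ADFS user claim: i:0e.t|ADFS|user@domain
    /^(?:c:)?0-\.t.*\|([\w&.\s-]+)$/,       // ADFS group claim: c:0-.t|ADFS|group
    /^([\w&.\s-]+)@[\w._]/,                 // UPN form: user@domain
    /^\w+\\([\w&.\s-]+)$/,                  // down-level form: DOMAIN\user
    /^([\w&.\s-]+)$/                        // bare account name
  ];
  for (var i = 0; i < patterns.length; i++) {
    var m = p.match(patterns[i]);
    if (m) return m[1];
  }
  return '';
}
```

With this helper, `getPrincipalNameWithoutDomain('i:0e.t|ADFS|jdoe@corp.local')` and `getPrincipalNameWithoutDomain('CORP\\jdoe')` both return `jdoe`, which is exactly the property that let the library behave identically under NTLM and ADFS.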


Why was this done? At some point it turned out that part of the team was already working on servers moved to ADFS while the rest worked on servers with NTLM authentication. Since we managed to make our library behave identically under both NTLM and ADFS, development of new modules did not have to stop.

Day X - 2 months

After a month or two the portal came back to life: basic things worked, including the news module, business processes based on SharePoint Workflows, and much more.

After that we migrated the portal to the test environment, made a checklist of our countless modules, assigned an owner to each one, and proceeded according to the scheme:

• module done
• rolled out for testing
• found a bunch of bugs, fixed them
• module is ready

Along the way, we ran into a couple of interesting problems.

1. The first problem we encountered was calling SOAP services.

Most of the WCF services that "live" in the _vti_bin folder are called by scripts from the browser via webHttpBinding. Of the rest, some are used for inter-module communication, and the remainder are needed by other client systems we integrate with. Naturally, most of these interactions are based on SOAP (more convenient) and tied to NTLM authentication. First we tried to estimate how hard it would be to move all the clients (at least our own, built on WCF) to ADFS. We tried, were amazed at the number of steps required and the complexity of the client configuration, and abandoned the idea. We then spent some time trying to get SharePoint to serve two authentication schemes simultaneously (yes, so that users would not notice anything). It did not work: NTLM stubbornly refused to function (more on the reasons below).

Thus, we urgently needed to bring "real" NTLM back for the services. So we turned to a SharePoint capability: extending (Extend) a web application. The reasoning was that the browser-facing REST services, authenticated by the user's browser, would stay on the main address, while all the service-to-service endpoints would move to the address of the extended site running NTLM. It all seemed so simple... and then it began.
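Client-side, the switch was expected to be no more than an address change plus an NTLM binding. A sketch of such a client configuration fragment (the address, binding, and contract names are illustrative; the exact binding settings depend on the service):

```xml
<!-- Sketch of a client Web.config fragment for calling a WCF service
     on the NTLM-extended site; names and address are illustrative. -->
<system.serviceModel>
  <bindings>
    <basicHttpBinding>
      <binding name="NtlmBinding">
        <security mode="TransportCredentialOnly">
          <transport clientCredentialType="Ntlm" />
        </security>
      </binding>
    </basicHttpBinding>
  </bindings>
  <client>
    <!-- The extended-site address instead of the main (ADFS) portal address -->
    <endpoint address="http://portal-ntlm/_vti_bin/ModuleService.svc"
              binding="basicHttpBinding"
              bindingConfiguration="NtlmBinding"
              contract="IModuleService" />
  </client>
</system.serviceModel>
```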

We extended the portal, already moved to ADFS, using the standard SharePoint tool (Extend Web Application), choosing NTLM as the Authentication Provider for the extended site. The expected result: all consumers of the WCF services in SharePoint's ISAPI directory keep working as before, at most changing the address of the called service in the client section of their Web.config. It did not work out that way: the extension did not immediately fix WCF calls for NTLM users, and every call inexorably returned: "The HTTP request is unauthorized with client authentication scheme 'Ntlm'. The authentication header received from the server was 'Negotiate, NTLM'." The real cause was that when extending the main site, SharePoint copied the wrong authentication modules into the extension's Web.config, in the "modules" section:

 <add name="FederatedAuthentication" type="Microsoft.SharePoint.IdentityModel.SPFederationAuthenticationModule …" />
 <add name="SessionAuthentication" type="Microsoft.SharePoint.IdentityModel.SPSessionAuthenticationModule …" />
 <add name="SPWindowsClaimsAuthentication" type="Microsoft.SharePoint.IdentityModel.SPWindowsClaimsAuthenticationHttpModule …" />

Whereas the correct module for an NTLM site is:

 <add name="Session" type="System.Web.SessionState.SessionStateModule" /> 

After a long discussion, we decided to act in the following sequence:
- create the extension of the main site before moving it to ADFS, so that the previous, correct NTLM authentication module is preserved;
- then move the main portal site to ADFS.
If you have WCF services in ISAPI that must expose endpoints for both NTLM and ADFS access at the same time, they require the IIS site to support "Forms Authentication" and "Windows Authentication" simultaneously. In our case, the main SharePoint site and its extension deliberately do not support both authentication methods at once. To solve this problem we:

- created two subfolders in ISAPI: "ModuleServiceAdfs" and "ModuleServiceNtlm";

- copied the WCF service .svc file into both folders;

- created a separate Web.config in each folder: one for ADFS, the other for NTLM.
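The resulting layout can be sketched like this (folder names as in the steps above; the .svc file name is illustrative):

```
...\ISAPI\
    ModuleServiceAdfs\
        ModuleService.svc
        Web.config      <- claims modules (SPFederationAuthenticationModule, etc.)
    ModuleServiceNtlm\
        ModuleService.svc
        Web.config      <- plain NTLM module (SessionStateModule)
```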

2. The second problem was SAML token expiration.


Most requests to the server were made via ajax through jQuery. There are periodic situations when the SAML token becomes invalid (the token expires, the SharePoint application pool is recycled, the ADFS pool is recycled). The standard mechanism, redirecting to an authentication page that refreshes the token automatically or asks for a login/password and then returns to the original page, does not work in the case of jquery.ajax. And the scenario where the user is sent to the authentication page only to be automatically returned to the original page with the results of their work lost did not inspire enthusiasm. A quick search on the Internet led us to a solution. For those too lazy to follow the link, its essence is a preauth page plus a wrapper around all ajax requests which, on a 401 response from the server, loads this page in an iframe and then repeats the original request. Creating a single point through which all ajax requests pass was not a problem for us, because the application was already written in that manner (we described our approach in this and this article).

We developed the solution further with a few additions:


1. If several requests were made in a row, then instead of creating an iframe for each 401, for the second and subsequent requests we return the same deferred that was created for the first request.

2. This approach works when the token has expired or SharePoint has "forgotten" it. But it did not work when ADFS "forgot" us, in which case the login/password must be re-entered. In such cases the approach described in the article led to an endless cycle of loading the preauth page in the iframe without any result. A direct redirect to the authentication page was not pleasant either, since it meant losing the user's work. The solution was to show a modal login/password dialog when loading through the iframe did not help and we received a 401 again. The modal, in turn, calls a custom service that performs the authentication against ADFS. After authentication, we replay the original ajax request or requests.
The updated code looks like this:

refreshToken: function () {
    if (wcfDispatcherDef.frameLoadPromise === undefined) {
        return jquery.Deferred(function (d) {
            wcfDispatcherDef.frameLoadPromise = d;
            var iFrame = jquery('<iframe></iframe>');
            iFrame.hide();
            iFrame.appendTo('body');
            iFrame.attr('src', wcfDispatcherDef.PreauthUrl);
            iFrame.load(function () {
                setTimeout(function () {
                    wcfDispatcherDef.frameLoadPromise = undefined;
                    d.resolve();
                    iFrame.remove();
                }, 100);
            });
        });
    } else {
        return wcfDispatcherDef.frameLoadPromise;
    }
},

makeServiceCall: function (settings, initialPromise) {
    var self = this;
    var d = initialPromise || jquery.Deferred();
    var promise = jquery.ajax(settings)
        .done(function () {
            d.resolveWith(self.requestContext || self, jquery.makeArray(arguments));
        }).fail(function (error) {
            if (error.status * 1 === ETR.HttpStatusCode.Unauthorized
                    && wcfDispatcherDef.HandleUnauthorizedError === true) {
                if (initialPromise) {
                    wcfDispatcherDef.AuthDialog.show().done(function (result) {
                        if (result === true) {
                            self.makeServiceCall.call(self, settings, d).done(function () {
                                d.resolveWith(self.requestContext || self, jquery.makeArray(arguments));
                            });
                        } else {
                            router.navigate('#forbidden');
                        }
                    });
                } else {
                    self.refreshToken().then(function () {
                        self.makeServiceCall.call(self, settings, d).done(function () {
                            d.resolveWith(self.requestContext || self, jquery.makeArray(arguments));
                        });
                    });
                }
            } else {
                d.rejectWith(self.requestContext || self, jquery.makeArray(arguments));
            }
        });
    return d;
},


In addition, we solved many other interesting problems that we do not include here, since describing them would take a whole book.

Day X - 1 month


When we finished on the test environment, we finalized the instructions, assembled a huge update package, and ran a test migration on the pre-release server with real data. For three days we fought trivial issues (inevitably; no test environment is perfect), ran testing once more from the very beginning, and then scheduled Day X.

Day X


Day X was scheduled for Saturday; we arrived at the office at 9:00. At the client's office, their own administrator actually performs the deployment. When the deployment is done not by you but by someone else following your instructions, it is always scary, so all day we followed his every step through a shared screen in Lync. At 15:00 the migration was complete. We checked everything, found what had not taken off, and filed down the rough edges. By 18:00 everything else was working, and we went our separate ways, satisfied.

The first working day after Day X


It is Monday, the first working day, and HELL begins for us. The main findings:
a. The token goes stale more often than we thought;
b. The portal is slower than usual;
c. Users are constantly thrown to the login page, which makes it very, very hard to work;
d. The problem is not only with ajax requests: if a user spends half an hour filling in the standard new-list-item form, with 90% probability the changes are lost on save.

We get access to the production server, analyze the logs, and try to understand what is happening:

The first "surprise". We have a custom uploader of files with thumbnails, which saves temporary files on the disk (for cases when the object and files are created in one form and there is no place to attach files at the moment, but you need to show it in preview). So, the uploader saves temporary files in a subdirectory of the application, such as C: \ inetpub \ Sharepoint \ Files while creating its own subdirectories there, and then deleting them. These deletions led to the periodic recycle of the application pool. And since Sharepoint Logon Token Cache lives just in memory, you could say goodbye to all the sessions. I must admit that this Uploader has been living with us for a year now, and most likely it has overloaded the pool before, but nobody really noticed before that, with Windows authentication, this did not lead to a repeated request for credentials :). As a result, the target uploader folder was promptly changed and the orgy became smaller.

The second "surprise". Sometimes the process started eating up all the memory, which caused the pool to recycle. Catching exactly what overloads the portal while hundreds of people are working on it is neither simple nor fast. We analyzed database queries and logs, found the critical places that periodically swamped everything, figured out how to do them differently, rewrote them, and rolled the changes out... Breathing became easier. Again, this inefficiency had always existed, but we used to turn a blind eye to the periodic slowdowns: you never know what exactly is slow, and without direct access to the production server, live diagnostics is much harder, so the problem had been ignored.

The portal now works faster and no longer throws users out every 5 minutes, but a re-login still happens every half hour or hour. And here we discover basic things that should have been discovered at the very beginning: logon token lifetime management (for those interested in the details, see msdn.microsoft.com/ru-ru/library/office/hh147183(v=office.14).aspx). This topic had been missed because it barely surfaced during testing: a single test case easily fits into 5 minutes, testers switch users from case to case, and the session gets refreshed; a long working day from 9 to 6 is not emulated. We set the session lifetime to 1 day and everything seemed fine, but...
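For reference, the lifetimes covered by that article are managed through the SharePoint security token service configuration. A PowerShell sketch with illustrative values:

```powershell
# Sketch: adjust logon token lifetimes on the SharePoint STS (values are illustrative)
$sts = Get-SPSecurityTokenServiceConfig
$sts.FormsTokenLifetime = (New-TimeSpan -Minutes 480)               # forms/SAML session length
$sts.LogonTokenCacheExpirationWindow = (New-TimeSpan -Minutes 10)   # must stay shorter than the token lifetime
$sts.Update()
```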

Another surprise. Users keep getting thrown out. Not as regularly, but it happens, and en masse. Judging by the logs, the pool is not recycling at those moments. What is going on? We read up on how it all works and find a useful article about the caches: blogs.msdn.com/b/besidethepoint/archive/2013/03/27/appfabric-caching-and-sharepoint-1.aspx. It turns out the default cache size is only 250 tokens, and when the cache fills up, goodbye everyone :) We increase the cache size, euphoria sets in... The flow of negativity and angry letters subsides.

What's next


In principle, we could have stopped here, but curiosity took over. There was one more moment when sessions "went out". The specifics of the business and the pace of development are such that a rare day passes without a hotfix. A hotfix means recycling the application pool, and that was exactly when everyone ran to log in again.

We began to look, read, and learn what could be done in this situation. As usual, no literature describing SharePoint's internal mechanisms in detail could be found, so we brought in the heavy artillery: analyzing the modules SharePoint plugs in, disassembling assemblies, and seeing how it all works from the inside. A day of research, and we find the villain:

 <add type="Microsoft.SharePoint.IdentityModel.SPTokenCache, Microsoft.SharePoint.IdentityModel, Version=14.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c" /> 

And inside it:

 private KeyValuePair<string, SecurityToken>[] m_StrongCache;

This is how that StrongCache is implemented inside SPSecurityTokenCache.

Out of curiosity, we decided to try writing our own TokenCache, but an attempt to do it in 5 minutes failed. Digging deeper into the SharePoint assemblies, we found that the coupling of the components there is quite high. In the end we did write our own version of TokenCache for test purposes, although in places we still had to resort to Reflection.

Conclusion


What conclusions can be drawn, looking back? Changes like these resemble replacing bricks at the base of a pyramid that carries the entire structure. Of course, one can argue that this should not be done, that the system should be built upon rather than rebuilt. In real life, however, the moment will come when fundamental changes are necessary. And then, as we see it, you need to:
a. Thoroughly study the internal mechanisms of what is going to change, no matter how much the vendors talk about a "well-documented black box";
b. Model the operation of the system as realistically as possible, including scenarios of continuous use, high load, etc.;
c. Try to predict the consequences, and prepare rollback scripts in case the changes have to be backed out;
d. Agree with the client that failures are possible;
e. And brace yourself...

Source: https://habr.com/ru/post/260689/

