Let's talk about usernames

A couple of weeks ago I released django-registration 2.4.1. The 2.4.x series will be the last feature release of django-registration 2.x; from here on it will only receive bug fixes. The main branch is now being prepared for version 3.0, which is planned to remove a pile of obsolete cruft accumulated over the past decade of maintenance, and in which I will try to follow the best practices of modern Django applications.

I will write more about the new version soon, but right now I want to talk a little about a deceptively simple problem that I have to deal with: usernames. Yes, I could write one of those popular “Falsehoods programmers believe about X” articles, but I prefer to actually explain why this is harder than it seems and to offer some advice on how to solve the problem, rather than just mock people without any useful context.

An aside: the right way to handle identity

Usernames, as implemented on many sites and services and in many popular frameworks (including Django), are almost certainly the wrong way to solve the problem they are meant to solve. What we really need for identifying users is some combination of the following:

System-level identifier for foreign keys in the database.
Login ID to perform credentials verification.
Public ID to display to other users.

Many systems ask for a username and use that same username for all three purposes. This is probably wrong. A better approach is the tripartite identity pattern, in which each identifier is distinct, and several login identifiers and/or public identifiers can be associated with one system identifier.
Much of the pain of building and scaling an account system comes from ignoring this pattern. An annoyingly large number of hacks exist only to make systems that do not support the pattern look and behave as if they did.

So if you are developing a system from scratch now, in 2018, I would suggest taking this pattern as your basis. It means a little more work up front, but it will give you flexibility and save time later, and one day someone might even build a decent generic, reusable implementation (of course, I have thought about doing this for Django, and maybe one day I will).

For the rest of this article, we will assume that you use the more common implementation, in which a unique username serves at least as the system identifier and the login identifier, and most likely also as the public identifier shown to all users. By “username” I mean essentially any string identifier. You may have usernames like those on forums such as Reddit or Hacker News, or you may use email addresses or some other unique string. It doesn't matter: you are probably still using some kind of unique string, so you need to know about a few problems.

Uniqueness is harder than it looks.

You may be asking yourself: how hard can it be? You can simply create a unique column in the database, and you're done! Let's create a users table in Postgres:

    CREATE TABLE accounts (
        id SERIAL PRIMARY KEY,
        username TEXT UNIQUE,
        password TEXT,
        email_address TEXT
    );

Here is our table with users and a column with unique names. Easy!

Well, it's easy until we start thinking about real-world use. If you are registered as john_doe, what happens if I register as JOHN_DOE? It is a different username, but can I make people think that I am you? Will people accept my friend requests and share confidential information with me because they do not realize that, to a computer, a different case means a different character?

This is a simple thing that many systems get wrong. While researching this article, I discovered that Django's auth system does not enforce case-insensitive uniqueness of usernames, even though it gets many other things right. There is a ticket in the bug tracker to make usernames case-insensitive, but it is currently marked WONTFIX, because making usernames case-insensitive would break backward compatibility, and no one is sure how to do it or whether it should be done. I will probably think about enforcing this in django-registration 3.0, but I am not sure it can be done even there: any site that already has accounts differing only by case would run into problems.

So if you are building a system from scratch today, you should enforce case-insensitive uniqueness of usernames from the very beginning: john_doe, John_Doe and JOHN_DOE should all be treated as the same name. As soon as one of them is registered, the others become unavailable.
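As a rough illustration of enforcing this at the database level, here is a minimal sketch. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for Postgres, and the helper names make_accounts_db and try_register are hypothetical. Note that SQLite's NOCASE collation only folds ASCII letters, so this covers the john_doe example but not the Unicode cases discussed further on.

```python
import sqlite3


def make_accounts_db():
    """Create an in-memory table whose username column is unique regardless of ASCII case."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " username TEXT COLLATE NOCASE NOT NULL UNIQUE)"
    )
    return conn


def try_register(conn, username):
    """Return True if the name was inserted, False if it collides with an existing one."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO accounts (username) VALUES (?)", (username,))
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_accounts_db()
print(try_register(conn, "john_doe"))  # True
print(try_register(conn, "JOHN_DOE"))  # False: same name, different case
```

Letting the database reject the duplicate, rather than checking first and inserting afterwards, also avoids a race between two simultaneous registrations.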

But this is only the beginning. We live in a Unicode world, where comparing two names for equality is harder than simply evaluating username1 == username2. First, there are composed and decomposed forms of characters. They differ when compared as sequences of Unicode code points, but they look the same on screen. So you need to think about normalization, choose a normalization form (NFC or NFD), and then normalize every username to the chosen form before performing any uniqueness check.
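To make the composed/decomposed distinction concrete, here is a small sketch using only the standard library. The helper name same_after is hypothetical; the two spellings of “é” are just one convenient example.

```python
import unicodedata

composed = "\u00e9"     # é as one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT


def same_after(form, a, b):
    """Compare two strings after normalizing both to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)                   # False: different code points
print(same_after("NFC", composed, decomposed))  # True
print(same_after("NFD", composed, decomposed))  # True
```

Either form works for comparison, as long as every username goes through the same one.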

[Illustration from the article “Unicode normalization”; translator's note.]

Also, when designing case-insensitive uniqueness checks, you have to consider non-ASCII characters. Are the usernames StraßburgJoe and StrassburgJoe identical? The answer often depends on whether you normalize to lower case or to upper case before comparing. On top of that, Unicode has different kinds of decomposition; you can (and will) get different results for many strings depending on whether you use canonical equivalence or compatibility equivalence.

If all this is confusing (and it is, even if you are a Unicode expert!), I recommend following the advice of Unicode Technical Report 36 and normalizing usernames to form NFKC. If you use Django's UserCreationForm or a subclass of it (django-registration uses subclasses of UserCreationForm), this is already done for you. If you use Python without Django (or do not use UserCreationForm), you can do it in one line with a helper from the standard library:
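The following sketch shows both effects with the standard library only. The ligature string is an example chosen for illustration and does not come from any real system.

```python
import unicodedata

a = "Stra\u00dfburgJoe"  # StraßburgJoe
b = "StrassburgJoe"

# The answer depends on which case mapping is used for the comparison.
print(a.lower() == b.lower())        # False: 'ß' stays 'ß' in lower case
print(a.upper() == b.upper())        # True: 'ß' upper-cases to 'SS'
print(a.casefold() == b.casefold())  # True: casefold() maps 'ß' to 'ss'

# Canonical (NFC) and compatibility (NFKC) normalization can also disagree.
ligature = "\ufb01le"  # 'ﬁle', starting with LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", ligature) == "file")   # False
print(unicodedata.normalize("NFKC", ligature) == "file")  # True
```

str.casefold() exists for exactly this kind of caseless comparison, which is why it is usually a better choice than lower() or upper() here.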

    import unicodedata
    username_normalized = unicodedata.normalize('NFKC', username)
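Putting normalization and case-insensitivity together, one possible shape for a comparison key is sketched below. The function name username_key is hypothetical, and applying NFKC a second time after case folding is my own cautious assumption (case folding can occasionally produce text that is no longer normalized), not something the report prescribes in this exact form.

```python
import unicodedata


def username_key(username):
    """Return a key used only for uniqueness checks, never for display."""
    normalized = unicodedata.normalize("NFKC", username)
    folded = normalized.casefold()
    # Normalize again in case folding produced a non-normalized sequence.
    return unicodedata.normalize("NFKC", folded)


taken = {username_key("john_doe")}

print(username_key("JOHN_DOE") in taken)                     # True: case differs
print(username_key("\uff4a\uff4f\uff48\uff4e_doe") in taken)  # True: fullwidth letters
print(username_key("jane_doe") in taken)                     # False
```

Storing the key in its own unique column, next to the name as the user typed it, keeps the displayed name intact while still enforcing uniqueness on the canonical form.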

For other languages, look for a good Unicode library.

No, really, making uniqueness is harder than it seems.

Unfortunately, that's not all. Checking uniqueness case in normalized strings is the beginning, but it does not cover all cases that need to be caught. For example, consider the following username: jane_doe . Now consider another username: jane_doe . Is this the same username?

In the font that I use for this article, and in any font available for my blog, they seem to be the same. But for software, they are completely different , and still remain different after Unicode normalization and case-insensitive comparison (regardless of whether you chose a normal check in lower case or upper case).

To understand the reason, pay attention to the second code point. In one of the usernames, this is U+0061 LATIN SMALL LETTER A In the other, it's U+0430 CYRILLIC SMALL LETTER A And no Unicode normalization or removal of case sensitivity will make these code points the same, although they are often visually completely indistinguishable.
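If you want to see the difference for yourself, this short sketch lists the code points of both names side by side using the standard library. The helper name describe is hypothetical.

```python
import unicodedata


def describe(text):
    """List each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


latin = "jane_doe"
mixed = "j\u0430ne_doe"  # second character is Cyrillic

for left, right in zip(describe(latin), describe(mixed)):
    flag = "   " if left == right else "!= "
    print(f"{left:<32} {flag}{right}")

# Normalization plus case folding does not make the two names equal.
key_latin = unicodedata.normalize("NFKC", latin).casefold()
key_mixed = unicodedata.normalize("NFKC", mixed).casefold()
print(key_latin == key_mixed)  # False
```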

This is the basis of homographic attacks, which first became widely known in the context of internationalized domain names . And to solve the problem will need a little more work.

For network host names, one solution is to show names in Punycode, a representation designed for exactly this problem, which expresses names in any script using only ASCII characters. Applied to the usernames above, it makes the difference between them obvious. If you want to try it yourself, here is a Python one-liner and its result for the username with the Cyrillic character:

 >>> 'jne_doe'.encode('punycode') b'jne_doe-2fg'

(If you have trouble copying and pasting non-ASCII characters, this name can be written as the string literal 'j\u0430ne_doe'.)

But displaying usernames this way is not practical. Of course you could show Punycode everywhere, but that would break the display of many perfectly legitimate usernames containing non-ASCII characters. What we really want is to reject the username above at registration time. How do we do that?

This time we turn to Unicode Technical Report 39 and start reading sections 4 and 5. Sets of code points that are distinct from each other (even after normalization) but look identical or confusingly similar when rendered are called “confusables”, and Unicode provides mechanisms for detecting them.

The username in our example is what Unicode calls a “mixed-script confusable”, and this is what we want to detect. In other words, a username entirely in Latin script that happens to contain confusable characters can probably be considered fine. A username entirely in Cyrillic that contains confusable characters can probably be considered fine too. But a name made up mostly of Latin characters plus a single Cyrillic code point that happens to render like a Latin character is not acceptable.

Unfortunately, Python's standard library does not expose the full set of Unicode properties and tables needed for such a check. But a kind developer named Victor Felder wrote a suitable library and released it under an open-source license. Using the confusable_homoglyphs library, we can detect the problem:
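To give a feel for what “mixed script” means in code, here is a deliberately crude sketch that uses only the standard library. It is not the algorithm from Technical Report 39: it guesses a character's script from the first word of its Unicode name, and it flags any mixture of scripts, whether or not the characters involved are actually confusable. The function names scripts_used and looks_mixed_script are hypothetical.

```python
import unicodedata


def scripts_used(text):
    """Approximate the set of scripts in text from Unicode character names.

    Assumption: the first word of a letter's name ('LATIN', 'CYRILLIC',
    'GREEK', ...) identifies its script. This holds for the examples here
    but is not a substitute for the real Script property.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_mixed_script(text):
    """Return True if letters from more than one (approximate) script appear."""
    return len(scripts_used(text)) > 1


print(looks_mixed_script("jane_doe"))                  # False: all Latin
print(looks_mixed_script("\u0438\u0432\u0430\u043d"))  # False: all Cyrillic
print(looks_mixed_script("j\u0430ne_doe"))             # True: Latin plus Cyrillic
```

Because this flags every mixture, it would also reject legitimate names in writing systems that routinely combine scripts, so treat it as an illustration of the idea rather than a validator.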

    >>> from confusable_homoglyphs import confusables
    >>> s1 = 'jane_doe'
    >>> s2 = 'j\u0430ne_doe'
    >>> bool(confusables.is_dangerous(s1))
    False
    >>> bool(confusables.is_dangerous(s2))
    True

The actual return value of is_dangerous() for the second username is a data structure with detailed information about the potential problems, but the main point is that it lets you detect a string that mixes scripts and contains confusable code points. That is exactly what we need.

Django allows non-ASCII characters in usernames, but does not check for look-alike characters from different scripts. However, since version 2.3, django-registration depends on the confusable_homoglyphs library and uses its is_dangerous() function when validating usernames and email addresses. If you need to implement user registration in Django (or in Python generally) and cannot or do not want to use django-registration, I recommend using confusable_homoglyphs in the same way.

Did I mention that uniqueness is hard?

Once we are dealing with confusable Unicode code points, it makes sense to think about what to do with look-alike characters within a single script. For example, paypal and paypa1: in some fonts they are hard to tell apart. Everything I have suggested so far applies more or less universally, but here we enter territory specific to particular languages, scripts and regions. Decisions here should be made carefully and with the consequences in mind (for example, forbidding confusable Latin characters may produce more false positives than you would like). It is worth thinking about, though. The same goes for usernames that are distinct but still very similar to each other. At the database level you can check for this in several ways (for example, Postgres ships with support for Soundex and Metaphone, as well as Levenshtein distance and trigram fuzzy matching), but again, this has to be decided case by case rather than applied across the board.
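As an illustration of the “very similar names” check, here is a small pure-Python Levenshtein distance and a helper that lists existing names within a chosen distance of a candidate. The function names and the threshold of 1 are assumptions for the example; a suitable threshold, and whether to use such a check at all, depends on your users.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def too_similar(candidate, existing, max_distance=1):
    """Return the existing names within max_distance edits of candidate."""
    return [name for name in existing if levenshtein(candidate, name) <= max_distance]


print(levenshtein("paypal", "paypa1"))              # 1
print(too_similar("paypa1", ["paypal", "john_doe"]))  # ['paypal']
```

Comparing a candidate against every existing name does not scale to a large user table, which is one reason to prefer the database's own fuzzy-matching support where it is available.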

I want to mention one more uniqueness problem. It mostly concerns email addresses, which these days are often used as usernames (especially in services that rely on a third-party identity provider and use OAuth or similar protocols). Suppose we need to enforce the uniqueness of email addresses. How many different addresses are listed below?

johndoe@example.com
johndoe+yoursite@example.com
john.doe@example.com

There is no definite answer. Most mail servers have long ignored everything after a + sign in the local part of the address when determining the recipient. Many people, in turn, use this feature to add arbitrary text after the plus as an ad hoc system of labels and filters. Gmail also famously ignores dots (.) in the local part, including for custom domains hosted on its service, so without a DNS lookup it is generally impossible to know whether someone else's mail server treats johndoe and john.doe as different recipients.

So if you need unique email addresses, or you use email addresses as user identifiers, you probably need to remove all dots from the local part, as well as the + and any text after it, before performing the uniqueness check. django-registration does not currently do this, but I plan to add it in the 3.x series.

In addition, when checking email addresses for confusable Unicode code points, apply the check separately to the local part and to the domain. People cannot always change the script used in the domain, so they should not be punished for using different scripts in the local part and the domain part. If neither the local part nor the domain, taken separately, contains a confusable mixture of scripts, everything is probably fine (and this is the check that the django-registration validator performs).

You may run into many other problems with usernames that are too similar to count as “different”, but once you have handled case-insensitivity, normalization and mixed scripts, you quickly reach the territory of diminishing returns, especially since many of the remaining rules depend on language, script or region. This does not mean you should not think about them. It is just hard to give universal advice that suits everyone.
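A minimal sketch of that canonicalization might look like the following. The function name email_uniqueness_key is hypothetical, and lower-casing both parts is an extra assumption that follows from the case-insensitivity advice earlier in the article. The result is meant only as a key for uniqueness checks; mail should still be sent to the address the user actually entered.

```python
def email_uniqueness_key(address):
    """Return a comparison key: no dots in the local part, no '+tag', lower case."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a usable email address: {address!r}")
    # Drop the '+' and everything after it, then remove dots.
    local = local.split("+", 1)[0].replace(".", "")
    if not local:
        raise ValueError(f"empty local part after canonicalization: {address!r}")
    return f"{local.lower()}@{domain.lower()}"


addresses = [
    "johndoe@example.com",
    "johndoe+yoursite@example.com",
    "john.doe@example.com",
]
print({email_uniqueness_key(a) for a in addresses})  # {'johndoe@example.com'}
```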
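The sketch below shows the idea of checking the two parts separately. It is not the django-registration validator: as a stand-in for a real confusables check it uses a rough heuristic that guesses each letter's script from the first word of its Unicode name, and all function names are hypothetical.

```python
import unicodedata


def scripts_used(text):
    """Rough script detection from Unicode character names (a heuristic, not UTS 39)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def email_mixes_scripts(address):
    """Flag an address only if the local part or the domain mixes scripts on its own."""
    local, _, domain = address.rpartition("@")
    return any(len(scripts_used(part)) > 1 for part in (local, domain))


cyrillic_local = "\u0438\u0432\u0430\u043d@example.com"  # Cyrillic local part, Latin domain
spoofed_local = "j\u0430ne@example.com"                  # Latin local part with one Cyrillic letter

print(len(scripts_used(cyrillic_local)) > 1)  # True: a whole-address check would flag it
print(email_mixes_scripts(cyrillic_local))    # False: each part is consistent on its own
print(email_mixes_scripts(spoofed_local))     # True: the local part mixes scripts
```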

Let's turn the situation around a bit and consider a different type of problem.

Some names should be reserved

Many sites use the username not only as a field in the login form. Some create a profile page for each user and put the username in the URL. Some create email addresses for each user. Some create subdomains. So a few questions arise:

If your site puts the username in the URL of the profile page, what happens if I create a user named login? Suppose I then post the text “Our login page has moved, please click here to log in” with a link to my credential-harvesting site. How many people do you think I could fool?
If your site creates email addresses from usernames, what happens if I register as a user named webmaster or postmaster? Will I receive mail sent to those addresses at your domain? Can I obtain an SSL certificate for your domain with the right username and an automatically created email address?
If your site creates subdomains from usernames, what happens if I register as a user named www? Or smtp, or mail?

If you think these are just silly hypothetical questions, well, some of this has actually happened. And not once, but several times.

You can, and should, take precautions to ensure that, say, an automatically created subdomain for a user account does not conflict with an existing subdomain that you actually use for something, or that automatically created email addresses do not conflict with important and/or existing addresses.

But for maximum safety, you should probably just prevent certain usernames from being registered at all. I first saw this advice, along with a list of reserved names and the first two articles mentioned above, in this article by Jeffrey Thomas. Since version 2.1, django-registration has shipped with a list of reserved names, and the list grows with each version; it now has about a hundred entries.

In the django-registration list, the names are divided into several categories, which lets you build subsets of them to suit your needs (the default validator applies all of them, but you can reconfigure it with only the sets of reserved names you need):

Host names used for autodiscovery/autoconfiguration of some well-known services.
Host names associated with commonly used protocols.
Email addresses used by certificate authorities to verify domain ownership.
Email addresses listed in RFC 2142 that do not appear in any other set of reserved names.
Common no-reply@ addresses.
Strings matching sensitive file names (for example, cross-domain access policies).
A long list of other potentially sensitive names such as contact and login.

The django-registration validator also rejects any username that starts with .well-known, to protect anything that uses the RFC 5785 mechanism for “well-known URIs”.

As with confusable characters in usernames, I recommend copying the elements you need from the django-registration list and extending it as necessary. That list is itself an extended version of Jeffrey Thomas's list.
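A reserved-name check along these lines can be quite small. The sketch below is not django-registration's code or its list: RESERVED_NAMES is a tiny hypothetical subset built from the names mentioned in this article, and the comparison is done on a case-folded copy of the name on the assumption that uniqueness is case-insensitive.

```python
# Hypothetical, deliberately incomplete subset; a real list would be much longer.
RESERVED_NAMES = frozenset({
    "www", "smtp", "mail",      # host names tied to common protocols
    "webmaster", "postmaster",  # mailbox names with special meaning
    "login", "contact",         # other sensitive names
})

WELL_KNOWN_PREFIX = ".well-known"


def is_reserved(username, reserved=RESERVED_NAMES):
    """Return True if the username must not be registered."""
    key = username.casefold()
    return key in reserved or key.startswith(WELL_KNOWN_PREFIX)


print(is_reserved("WWW"))               # True
print(is_reserved(".well-known/acme"))  # True
print(is_reserved("john_doe"))          # False
```

Passing the set in as a parameter mirrors the idea of choosing only the categories that matter for a given site.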

This is just the beginning.

This is not a list of everything that can be done to validate usernames. If I tried to write a complete list, I would be stuck here forever. It is a good starting point, though, and I recommend following most or all of this advice. I hope the article has given some idea of the difficulties that can hide behind a seemingly “simple” problem like user accounts with usernames.

As I mentioned, Django and/or django-registration already perform most of these checks, and the ones they do not perform will probably be added at least in django-registration 3.0. Django itself may not be able to adopt such checks in the near future (or ever) because of serious backward-compatibility concerns. All the source code is open (under the BSD license), so feel free to copy, adapt and improve it.

If I missed something important, please let me know: you can report a bug or send a pull request to django-registration on GitHub, or simply contact me directly.

Source: https://habr.com/ru/post/349232/
