
As the project I now take an active part in expands, I began to run into typos in function argument names, like the one in the picture on the right, more and more often. Debugging such errors was especially costly in class constructors, when, with a long inheritance chain, the wrong parameter was passed to the base class, or not passed at all. Redesigning the interfaces around dedicated user structures like namedtuple instead of **kwargs had several problems:
- It worsens user interaction. You have to pass a specially constructed object to the function, and it is unclear what to do with optional arguments.
- It complicates development. When inheriting classes, the corresponding argument structures have to be inherited too. That won't work with namedtuple; you need to write your own tricky class. A pile of implementation work.
- And most importantly, it still does not fully protect against typos in the names.
The solution I finally arrived at cannot cover 100% of all possible cases, but in the required 80% (in my project, 100%) it does an excellent job. In short, it consists of analyzing the source (byte) code of the function, building a matrix of distances between the "real" names it finds and those passed from outside, and printing warnings according to specified criteria.
TDD
So, first let's state the task precisely. The following example should print 5 "suspicious" warnings:
```python
def foo(arg1, arg2=1, **kwargs):
    kwa1 = kwargs["foo"]
    kwa2 = kwargs.get("bar", 200)
    kwa3 = kwargs.get("baz") or 3000
    return arg1 + arg2 + kwa1 + kwa2 + kwa3

res = foo(0, arg3=100, foo=10, fo=2, bard=3, bas=4, last=5)
```
- Instead of arg2, arg3 was passed
- Instead of bar or baz, bas was passed
- Instead of bar, bard was passed
- In addition to foo, fo was passed
- last is simply superfluous
Similarly, the example with classes and inheritance should produce the same warnings plus one more (bog passed instead of boo):
```python
class Foo(object):
    def __init__(self, arg1, arg2=1, **kwargs):
        self.kwa0 = arg2
        self.kwa1 = kwargs["foo"]
        self.kwa2 = kwargs.get("bar", 200)
        self.kwa3 = kwargs.get("baz") or 3000

class Bar(Foo):
    def __init__(self, arg1, arg2=1, **kwargs):
        super(Bar, self).__init__(arg1, arg2, **kwargs)
        self.kwa4 = kwargs.get("boo")

bar = Bar(0, arg3=100, foo=10, fo=2, bard=3, bas=4, last=5, bog=6)
```
Task plan
- For the first example with a function we'll make a smart decorator; for the second, with classes, a metaclass. They must share all the heavy internal logic and in fact differ in nothing else. Therefore we first build an internal micro API, and on top of it the user-facing API. The decorator is called detect_misprints, and the metaclass KeywordArgsMisprintsDetector (heavy Java/C# legacy, yeah).
- The idea of the solution is to analyze the bytecode and build a distance matrix. These are independent steps, so the micro API will consist of two corresponding functions. I called them get_kwarg_names and check_misprints.
- To analyze the code we use the standard inspect and dis modules; to compute the distance between strings, pyxDamerauLevenshtein. The project required compatibility with Python 2 and 3, as well as PyPy. As you can see, these dependencies do not conflict with those requirements.
get_kwarg_names (extracting names from code)
The code here is a wall of text, so I'd rather just give you a link to it. The function takes a function as input and must return the set of keyword argument names it finds. I am not big on comments, so let me briefly walk through the main points.
The first thing to do is to find out whether the function has **kwargs at all. If not, return an empty set. Next, we determine the actual name of the "double star" argument, since **kwargs is merely a widely followed convention and nothing more. Then, as often happens in portable code, the logic splits, but not along the usual 2-vs-3 boundary: along <3.4 and >=3.4. The point is that proper support for disassembly (along with a complete refactoring of dis) appeared only in 3.4; before that, strange as it may seem, without third-party modules you could only print Python bytecode to stdout (sic!). The dis.get_instructions() function returns a generator over instances describing every bytecode instruction of the analyzed object. As far as I understand, the only reliable description of the bytecode is the header with its opcodes, which is, of course, sad, because the mapping of specific expressions to opcode sequences had to be determined experimentally.
We will match two patterns: var = kwargs["key"] and kwargs.get("key"[, default]).
```python
>>> from dis import dis
>>> def foo(**kwargs):
...     return kwargs["key"]
>>> dis(foo)
  2           0 LOAD_FAST                0 (kwargs)
              3 LOAD_CONST               1 ('key')
              6 BINARY_SUBSCR
              7 RETURN_VALUE
>>> def foo(**kwargs):
...     return kwargs.get("key", 0)
>>> dis(foo)
  2           0 LOAD_FAST                0 (kwargs)
              3 LOAD_ATTR                0 (get)
              6 LOAD_CONST               1 ('key')
              9 LOAD_CONST               2 (0)
             12 CALL_FUNCTION            2 (2 positional, 0 keyword pair)
             15 RETURN_VALUE
```
As you can see, in the first case it is the combination LOAD_FAST + LOAD_CONST, and in the second LOAD_FAST + LOAD_ATTR + LOAD_CONST. Instead of "kwargs" in the instruction argument, we should look for the name of the "double star" found at the beginning. For a detailed description of the bytecode I refer you to people well versed in it; we'll keep getting things done, that is, move on.
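The two opcode patterns above can be matched with a small state machine over dis.get_instructions(). The following is my simplified sketch of the idea, not the article's exact code (the function name matches the article's micro API, the internals are an assumption), and it requires Python >= 3.4:

```python
import dis
import inspect

def get_kwarg_names(func):
    # Sketch: find the real name of the "double star" argument, then scan
    # the bytecode for the kwargs["key"] and kwargs.get("key") patterns.
    star = inspect.getfullargspec(func).varkw
    if star is None:              # no **kwargs at all -> nothing to check
        return set()
    names = set()
    pending = False               # just saw LOAD_FAST <star> (maybe + .get)
    for ins in dis.get_instructions(func):
        if ins.opname == "LOAD_FAST" and ins.argval == star:
            pending = True
        elif pending and ins.opname in ("LOAD_ATTR", "LOAD_METHOD"):
            pending = ins.argval == "get"   # keep waiting only for .get
        elif pending and ins.opname == "LOAD_CONST" and isinstance(ins.argval, str):
            names.add(ins.argval)           # the subscript / .get key
            pending = False
        else:
            pending = False
    return names
```

LOAD_METHOD is checked alongside LOAD_ATTR because newer interpreters compile method calls differently; exact opcode sequences vary across versions, which is precisely why this mapping had to be found experimentally.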
And then comes an ugly regex-based workaround for old versions of Python. Using inspect.getsourcelines() we get the function's source lines and run a precompiled regular expression over each. This approach is even worse than bytecode analysis: for example, expressions spanning several lines, or several expressions separated by semicolons, are not detected in the current form. Well, that's why it's a workaround, so as not to strain too much... This part can objectively be improved; pull requests welcome :)
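The fallback might look roughly like this (the pattern and function name are my assumption, not the article's exact code), and it shares the known weaknesses: keys split across lines or accessed through an alias will be missed:

```python
import inspect
import re

def get_kwarg_names_regex(func, star="kwargs"):
    # Regex fallback sketch for Python < 3.4: match kwargs["key"] and
    # kwargs.get("key", ...) textually, line by line.
    pattern = re.compile(
        r"%s(?:\[|\.get\()\s*[\"'](\w+)[\"']" % re.escape(star))
    lines, _ = inspect.getsourcelines(func)
    names = set()
    for line in lines:
        names.update(pattern.findall(line))
    return names
```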
check_misprints (distance matrix)
Code. At the input we get the result of the previous stage, the keyword arguments actually passed, a mysterious tolerance, and a function for emitting warnings. For each passed argument we need to find the edit distance to each "real" one, i.e. those found during the bytecode analysis. In fact, there is no need to blindly compute the entire matrix: if a perfect match has already been found, there is no point in continuing. And, of course, the matrix is symmetric, so only half of it needs to be computed. It could probably be optimized further, but with a typical kwargs count below 30, O(n²) will do.
We will use the Damerau-Levenshtein distance, as a widely known, popular metric that the author understands :) It has been written about on Habr, for example. Several Python packages implement it; I chose pyxDamerauLevenshtein for the portability of the Cython it is written in and for its optimal linear memory consumption.
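For reference, the restricted Damerau-Levenshtein distance (optimal string alignment) is a small dynamic program; a dependency-free sketch, in case you'd rather not pull in the Cython package:

```python
def damerau_levenshtein(a, b):
    # Restricted Damerau-Levenshtein (optimal string alignment) distance:
    # the minimum number of insertions, deletions, substitutions and
    # adjacent transpositions turning a into b.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
    return d[len(a)][len(b)]
```

This is O(len(a) * len(b)) time; the library version additionally keeps only two matrix rows, hence its linear memory consumption.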
Then it's a matter of technique: if not a single even remotely similar expected name was found for an argument, we declare it categorically useless. If there are matches with a distance below tolerance, we voice our vague suspicions.
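A sketch of what this warning logic might look like (the signature is my assumption, not the article's exact code); distance here is any edit-distance callable, e.g. damerau_levenshtein_distance from pyxDamerauLevenshtein:

```python
def check_misprints(real_names, passed_names, distance, tolerance=2,
                    warn=print):
    # For every passed keyword: an exact hit ends the row early;
    # otherwise gather all "real" names within `tolerance` edits,
    # or declare the argument completely out of place.
    for name in passed_names:
        if name in real_names:
            continue
        similar = sorted(r for r in real_names
                         if distance(name, r) <= tolerance)
        if similar:
            warn("%r: did you mean %s?" % (name, " or ".join(similar)))
        else:
            warn("%r does not resemble any expected keyword" % name)
```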
detect_misprints
A classic decorator: we pre-compute the "real" keyword argument names once (sorry for the tautology), and on every call we invoke check_misprints.
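The decorator's shape might be sketched like this (assumed and simplified: to keep the example self-contained, the explicit parameters from the signature stand in for the bytecode-based get_kwarg_names, and a plain membership test stands in for check_misprints):

```python
import functools
import inspect
import warnings

def detect_misprints(func):
    # Names computed once, at decoration time (stand-in for
    # get_kwarg_names).
    real_names = set(inspect.signature(func).parameters)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Checked on every call (stand-in for check_misprints).
        for name in kwargs:
            if name not in real_names:
                warnings.warn("suspicious keyword argument: %r" % name)
        return func(*args, **kwargs)
    return wrapper
```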
KeywordArgsMisprintsDetector
Our metaclass intercepts the moment the class type is created (__init__, in which, once per class lifetime, the real names are computed) and the instantiation of a class instance (__call__, which invokes check_misprints). The only subtlety is that a class has an MRO and base classes whose constructors may also use **kwargs, so in __init__ we have to walk all the base classes and add each one's argument names to the common set.
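The metaclass shape might be sketched like this (again assumed and simplified: signature inspection stands in for the bytecode analysis so the example is self-contained):

```python
import inspect
import warnings

class KeywordArgsMisprintsDetector(type):
    def __init__(cls, name, bases, namespace):
        # Runs once per class: gather "real" names from every class in
        # the MRO, since base-class constructors may consume **kwargs too.
        super(KeywordArgsMisprintsDetector, cls).__init__(
            name, bases, namespace)
        cls._real_kwargs = set()
        for klass in cls.__mro__:
            init = klass.__dict__.get("__init__")
            if init is None:
                continue
            try:
                cls._real_kwargs.update(inspect.signature(init).parameters)
            except (TypeError, ValueError):
                pass  # e.g. object.__init__ is a slot wrapper
        cls._real_kwargs.discard("self")

    def __call__(cls, *args, **kwargs):
        # Runs on every instantiation: check the passed names
        # (stand-in for check_misprints).
        for name in kwargs:
            if name not in cls._real_kwargs:
                warnings.warn("suspicious keyword argument: %r" % name)
        return super(KeywordArgsMisprintsDetector, cls).__call__(
            *args, **kwargs)
```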
How to use
Simply add the decorator described above to the function or the metaclass to the class.
```python
@detect_misprints
def foo(**kwargs):
    ...

@six.add_metaclass(KeywordArgsMisprintsDetector)
class Foo(object):
    def __init__(self, **kwargs):
        ...
```
Summary
I have covered one way to deal with typos in **kwargs names; in my case it solved all the problems and met all the requirements. First we analyze the bytecode of a function (or just its source code on older versions of Python), and then we build a matrix of distances between the names used inside the function and those passed in by the user. The distance is computed according to Damerau-Levenshtein, and in the end a warning is printed in two error cases: when an argument is completely out of place, and when it merely resembles one of the "real" ones.
The source code from the article is posted on GitHub. I will be glad to receive fixes and improvements. I would also like to hear your opinion on whether this creation should be published to PyPI.