Why you should use pathlib

From the translator: Hi, Habr! I present a translation of the article Why you should be using pathlib and its follow-up, No really, pathlib is great. Python features such as asyncio, the := operator, and optional typing get a lot of attention these days. Meanwhile, less significant (although I would hesitate to call := a serious innovation) but very useful additions to the language risk slipping under the radar. In particular, I did not find any articles on this subject on Habr (except for one paragraph here), so I decided to fix that.

When I discovered the then-new pathlib module a few years ago, I naively decided that it was just a slightly awkward object-oriented version of the os.path module. I was wrong. pathlib is really wonderful!

In this article I will try to make you fall in love with pathlib. I hope it inspires you to use pathlib in any situation that involves working with files in Python.

Part 1.

`os.path` is clumsy

The os.path module has always been what we reached for when working with paths in Python. In principle it has everything you need, but the result often does not look very elegant.

Should we import it like this?

    import os.path

    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    TEMPLATES_DIR = os.path.join(BASE_DIR, 'templates')

Or like this?

    from os.path import abspath, dirname, join

    BASE_DIR = dirname(dirname(abspath(__file__)))
    TEMPLATES_DIR = join(BASE_DIR, 'templates')

Maybe join is too generic a name, and we should do something like this:

    from os.path import abspath, dirname, join as joinpath

    BASE_DIR = dirname(dirname(abspath(__file__)))
    TEMPLATES_DIR = joinpath(BASE_DIR, 'templates')

To me, none of the options above seems very comfortable. We pass strings to functions that return strings, which we pass to more functions that work with strings. It just so happens that they all contain paths, but they are still just strings.

Using strings as the input and output of os.path functions is inconvenient because the code has to be read from the inside out. I would like to convert these calls from nested to sequential. This is exactly what pathlib lets us do!

    from pathlib import Path

    BASE_DIR = Path(__file__).resolve().parent.parent
    TEMPLATES_DIR = BASE_DIR.joinpath('templates')

The os.path module requires nested function calls, but pathlib allows us to create chains of successive calls to methods and attributes of the Path class with an equivalent result.
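
To check that the nested and chained forms really agree, here is a minimal sketch. The path `/srv/app/config/settings.py` is made up for illustration, and the sketch uses `posixpath` and `PurePosixPath` (rather than `os.path` and `Path`) only so that it gives the same result on every platform without touching the disk:

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical file location, used only for illustration.
settings_file = '/srv/app/config/settings.py'

# Nested os.path style: read from the inside out.
nested = posixpath.join(
    posixpath.dirname(posixpath.dirname(settings_file)),
    'templates',
)

# Chained pathlib style: read from left to right.
chained = PurePosixPath(settings_file).parent.parent.joinpath('templates')

print(nested)        # /srv/app/templates
print(str(chained))  # /srv/app/templates
```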

I know what you think: stop, these Path objects are not the same as they were before, we no longer operate with strings of paths! We will return to this question later (hint: in almost any situation, these two approaches are interchangeable).

`os` is overloaded

The classic os.path module is designed for working with paths. But once you want to do something with a path (for example, create a directory), you need to reach for another module, often os.

os contains a bunch of utilities for working with files and directories: mkdir, getcwd, chmod, stat, remove, rename, rmdir. Also chdir, link, walk, listdir, makedirs, renames, removedirs, unlink, symlink. And a bunch of other things that are not related to file systems at all: fork, getenv, putenv, environ, getlogin, system, ... plus a few dozen more that I will not mention here.

The os module is designed for a wide range of tasks; it is a grab bag of everything related to the operating system. There are many utilities in os, but it is not always easy to find your way around: you often have to dig through the module before you find what you need.

pathlib transfers most of the file system functions to Path objects.

Here is code that creates the src/__pypackages__ directory and renames our .editorconfig file to src/.editorconfig:

    import os
    import os.path

    os.makedirs(os.path.join('src', '__pypackages__'), exist_ok=True)
    os.rename('.editorconfig', os.path.join('src', '.editorconfig'))

Here is similar code using Path:

    from pathlib import Path

    Path('src/__pypackages__').mkdir(parents=True, exist_ok=True)
    Path('.editorconfig').rename('src/.editorconfig')

Notice that the second sample code is much easier to read, because it is organized from left to right - this is all due to the chains of methods.

Don't forget `glob`

os and os.path are not the only modules with file-system-related functions. The glob module is also worth mentioning, and it is far from useless.

We can use the glob.glob function to search for files that match a pattern:

    from glob import glob

    top_level_csv_files = glob('*.csv')
    all_csv_files = glob('**/*.csv', recursive=True)

The pathlib module provides similar methods:

    from pathlib import Path

    top_level_csv_files = Path.cwd().glob('*.csv')
    all_csv_files = Path.cwd().rglob('*.csv')

After switching to pathlib, the need for the glob module disappears completely: everything you need is already part of Path objects.
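
One difference is worth keeping in mind: glob.glob returns a list of strings, while Path.glob and Path.rglob return lazy generators of Path objects. A small sketch, run against a throwaway directory whose layout is invented for the example:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: one CSV at the top level, one in a subdirectory.
    (root / 'data').mkdir()
    (root / 'top.csv').write_text('a,b\n')
    (root / 'data' / 'nested.csv').write_text('c,d\n')

    # glob() and rglob() yield Path objects lazily, so we collect them here.
    top_level = sorted(path.name for path in root.glob('*.csv'))
    recursive = sorted(path.name for path in root.rglob('*.csv'))

print(top_level)  # ['top.csv']
print(recursive)  # ['nested.csv', 'top.csv']
```

If you need a list (for example, to take its length), wrap the call in list() or sorted().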

`pathlib` makes simple things even easier

pathlib simplifies many complex situations, but beyond that makes some simple code snippets even easier .

Want to read all the text in one or more files?

You can open the file, read the contents, and close the file using the with block:

    from glob import glob

    file_contents = []
    for filename in glob('**/*.py', recursive=True):
        with open(filename) as python_file:
            file_contents.append(python_file.read())

Or you can use the read_text method of Path objects and a list comprehension to get the same result in a single expression:

    from pathlib import Path

    file_contents = [
        path.read_text()
        for path in Path.cwd().rglob('*.py')
    ]

And what if you need to write to a file?

Here is what it looks like using open :

    with open('.editorconfig', mode='wt') as config:
        config.write('# config goes here')

Or you can use the write_text method:

 Path('.editorconfig').write_text('# config goes here')
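
As a rough illustration of what these two methods do, here is a sketch that writes and reads a file in a temporary directory. Passing encoding='utf-8' explicitly is my own choice for the example, not something the original code does:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / '.editorconfig'

    # write_text() opens the file, writes the string, and closes the file.
    # It returns the number of characters written.
    written = config.write_text('# config goes here', encoding='utf-8')

    # read_text() is the matching one-call read.
    content = config.read_text(encoding='utf-8')

print(written)  # 18
print(content)  # # config goes here
```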

If for some reason you need to use open , either as a context manager or by personal preference, Path provides an open method, as an alternative:

    from pathlib import Path

    path = Path('.editorconfig')
    with path.open(mode='wt') as config:
        config.write('# config goes here')

Or, starting with Python 3.6, you can pass your Path directly to open:

    from pathlib import Path

    path = Path('.editorconfig')
    with open(path, mode='wt') as config:
        config.write('# config goes here')

Path objects make your code more obvious.

What do the following variables refer to? What do their values represent?

    person = '{"name": "Trey Hunner", "location": "San Diego"}'
    pycon_2019 = "2019-05-01"
    home_directory = '/home/trey'

Each variable points to a string. But each of them has different meanings: the first is JSON, the second is the date, and the third is the file path.

Such a representation of objects is slightly more useful:

    from datetime import date
    from pathlib import Path

    person = {"name": "Trey Hunner", "location": "San Diego"}
    pycon_2019 = date(2019, 5, 1)
    home_directory = Path('/home/trey')

JSON objects can be deserialized into a dictionary, dates can be represented natively with datetime.date, and file paths can be represented as Path objects.

Using Path objects makes your code more explicit. If you want to work with dates, you use date . If you want to work with file paths, use Path .
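
To make the point concrete, here is a small sketch that converts each of the three strings into its specific type and then uses an operation that only makes sense for that type. Note that date.fromisoformat needs Python 3.7 or later:

```python
import json
from datetime import date
from pathlib import Path

person = json.loads('{"name": "Trey Hunner", "location": "San Diego"}')
pycon_2019 = date.fromisoformat("2019-05-01")  # Python 3.7+
home_directory = Path('/home/trey')

# Each object now offers operations that fit what it represents.
print(person["location"])   # San Diego
print(pycon_2019.year)      # 2019
print(home_directory.name)  # trey
```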

I am not a big fan of OOP. Classes add an extra layer of abstraction, and abstractions sometimes complicate a system rather than simplify it. Still, I believe that pathlib.Path is a useful abstraction. It is also quickly becoming the commonly accepted solution.

Thanks to PEP 519, Path objects are becoming the standard way to work with paths. As of Python 3.6, most functions in os, shutil, and os.path work correctly with these objects. You can switch to pathlib transparently for the rest of your codebase!

What is missing in `pathlib`?

pathlib is cool, but it is not comprehensive. There are definitely a few features I wish the module included.

The first thing that comes to mind is the lack of Path methods equivalent to the shutil functions. You can pass Path objects to shutil functions for copying, deleting, and moving files and directories, but you cannot call those operations as methods on Path objects.

So, to copy a file, you need to do something like this:

    from pathlib import Path
    from shutil import copyfile

    source = Path('old_file.txt')
    destination = Path('new_file.txt')
    copyfile(source, destination)

There is also no analogue of os.chdir. This means you have to import it if you need to change the current directory:

    from pathlib import Path
    from os import chdir

    parent = Path('..')
    chdir(parent)

There is also no equivalent of the os.walk function. Although you can write your own function in the spirit of a walk without too much difficulty.
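
For illustration, a walk-like generator built on Path.iterdir might look like the sketch below. The function is my own and deliberately simplified: it walks top-down only, does no error handling, and lists symlinked directories as files instead of following them. (Python 3.12 and later also ship a Path.walk method, so this is only relevant on older versions.)

```python
import tempfile
from pathlib import Path

def walk(top):
    """Yield (directory, subdirectories, files) tuples, top-down, like os.walk."""
    top = Path(top)
    directories, files = [], []
    for entry in sorted(top.iterdir()):
        # Simplification: symlinked directories are listed as files.
        if entry.is_dir() and not entry.is_symlink():
            directories.append(entry)
        else:
            files.append(entry)
    yield top, directories, files
    for directory in directories:
        yield from walk(directory)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical tree used only to exercise the function.
    (root / 'pkg').mkdir()
    (root / 'setup.py').touch()
    (root / 'pkg' / '__init__.py').touch()
    visited = [
        (directory.relative_to(root).as_posix(), [f.name for f in files])
        for directory, _, files in walk(root)
    ]

print(visited)  # [('.', ['setup.py']), ('pkg', ['__init__.py'])]
```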

I hope that one day pathlib.Path objects will gain methods for some of the operations mentioned above. But even as things stand, I find it much simpler to use pathlib together with something else than to use os.path and everything else.

Do you always need to use `pathlib`?

Starting with Python 3.6, Path objects work almost everywhere you use strings. So I see no reason not to use pathlib if you are on Python 3.6 or higher.

If you use an earlier version of Python 3, you can always wrap a Path object in a str call to get a string when you need to return to string land. It is not very elegant, but it works:

    from os import chdir
    from pathlib import Path

    chdir(Path('/home/trey'))       # works on Python 3.6+
    chdir(str(Path('/home/trey')))  # works on earlier versions too

Part 2. Answers to questions.

After the first part was published, some people had questions. Someone said that my comparison of os.path and pathlib was unfair. Some said that os.path is so ingrained in the Python community that switching to a new library would take a very long time. I also saw some questions about performance.

In this part I would like to comment on these questions. Consider it both a defense of pathlib and a bit of a love letter to PEP 519.

Comparing `os.path` and `pathlib` fairly

In the last part, I compared the following two code fragments:

    import os
    import os.path

    os.makedirs(os.path.join('src', '__pypackages__'), exist_ok=True)
    os.rename('.editorconfig', os.path.join('src', '.editorconfig'))

    from pathlib import Path

    Path('src/__pypackages__').mkdir(parents=True, exist_ok=True)
    Path('.editorconfig').rename('src/.editorconfig')

This may seem like an unfair comparison, because os.path.join in the first example guarantees the correct separators on every platform, and I did nothing of the kind in the second example. In fact, everything is fine, because Path normalizes path separators automatically.

We can prove this by converting a Path object to a string on Windows:

    >>> str(Path('src/__pypackages__'))
    'src\\__pypackages__'

It makes no difference whether we use the joinpath method, a '/' in the path string, the / operator (another nice Path feature), or separate arguments to the Path constructor; we get the same result:

    >>> Path('src', '.editorconfig')
    WindowsPath('src/.editorconfig')
    >>> Path('src') / '.editorconfig'
    WindowsPath('src/.editorconfig')
    >>> Path('src').joinpath('.editorconfig')
    WindowsPath('src/.editorconfig')
    >>> Path('src/.editorconfig')
    WindowsPath('src/.editorconfig')

The last example confused some people, who assumed that pathlib is not smart enough to replace / with \ in the path string. Fortunately, everything is fine!

With Path objects, you no longer need to worry about the direction of the slashes: define all your paths using / , and the result will be predictable for any platform.
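
You do not need a Windows machine to check this. pathlib's pure path classes do no I/O, so the following sketch, which swaps Path for PureWindowsPath and PurePosixPath, can run anywhere:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths never touch the file system, so Windows behavior
# can be inspected on any platform.
spellings = [
    PureWindowsPath('src', '.editorconfig'),
    PureWindowsPath('src') / '.editorconfig',
    PureWindowsPath('src').joinpath('.editorconfig'),
    PureWindowsPath('src/.editorconfig'),
]

windows_strings = {str(path) for path in spellings}
posix_string = str(PurePosixPath('src/.editorconfig'))

print(windows_strings)  # {'src\\.editorconfig'}
print(posix_string)     # src/.editorconfig
```

All four spellings collapse to a single string, with the separator chosen by the path flavor rather than by how the path was written.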

You do not have to worry about normalizing the paths.

If you work on Linux or Mac, it is very easy to accidentally introduce bugs that only affect Windows users. Unless you are careful to use os.path.join and/or os.path.normcase to convert slashes to the ones appropriate for the current platform, you can write code that will not work correctly on Windows.

Here is an example of a Windows-specific bug:

    import sys
    import os.path

    directory = '.' if not sys.argv[1:] else sys.argv[1]
    new_file = os.path.join(directory, 'new_package/__init__.py')

Code like this, on the other hand, works correctly everywhere:

    import sys
    from pathlib import Path

    directory = '.' if not sys.argv[1:] else sys.argv[1]
    new_file = Path(directory, 'new_package/__init__.py')

Previously, the programmer was responsible for concatenating and normalizing paths, just as in Python 2, the programmer was responsible for deciding where to use unicode instead of bytes. This is not your task anymore - Path solves all such problems for you.
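
The difference between the two snippets can be reproduced on any platform. The sketch below is an approximation of what happens on Windows: it uses ntpath (the Windows implementation of os.path, importable everywhere) and PureWindowsPath in place of os.path and Path:

```python
import ntpath  # the Windows flavor of os.path, importable on any platform
from pathlib import PureWindowsPath

directory = '.'

joined = ntpath.join(directory, 'new_package/__init__.py')
as_path = PureWindowsPath(directory, 'new_package/__init__.py')

print(joined)        # .\new_package/__init__.py  (mixed separators)
print(str(as_path))  # new_package\__init__.py
```

The os.path version keeps the forward slash that was typed into the string, while the pathlib version normalizes it.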

I do not use Windows, and I do not have a Windows computer. But the huge number of people who will use my code will very likely use Windows, and I want everything to work correctly for them.

If there is a possibility that your code will run on Windows, you should seriously think about switching to pathlib .

Don't worry about normalization : use Path anyway when it comes to file paths.

It sounds cool, but I have a third-party library that does not use `pathlib` !

You have a large codebase that works with strings as paths. Why switch to pathlib if it means that everything needs to be rewritten?

Let's imagine that you have the following function:

    import os
    import os.path

    def make_editorconfig(dir_path):
        """Create .editorconfig file in given directory and return filename."""
        filename = os.path.join(dir_path, '.editorconfig')
        if not os.path.exists(filename):
            os.makedirs(dir_path, exist_ok=True)
            open(filename, mode='wt').write('')
        return filename

The function takes a directory and creates a .editorconfig file in it, like this:

    >>> import os.path
    >>> make_editorconfig(os.path.join('src', 'my_package'))
    'src/my_package/.editorconfig'

If you replace the strings with Path objects, everything still works:

    >>> from pathlib import Path
    >>> make_editorconfig(Path('src/my_package'))
    'src/my_package/.editorconfig'

But how?

os.path.join accepts Path objects (starting with Python 3.6). The same is true of os.makedirs.
In fact, the built-in open function accepts Path objects, shutil accepts Path objects, and everything in the standard library that used to accept a string should now work with both Path objects and strings.

Credit goes to PEP 519, which introduced the abstract os.PathLike class and declared that all built-in utilities for working with file paths should now accept both strings and Path objects.
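
A quick way to convince yourself is to pass Path objects to a few standard library functions. This sketch works in a temporary directory; the file names are invented for the example:

```python
import os
import os.path
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / 'src' / 'my_package'

    os.makedirs(package_dir, exist_ok=True)          # os accepts Path
    joined = os.path.join(package_dir, 'notes.txt')  # os.path too; it returns a str
    with open(Path(joined), mode='wt') as notes:     # open() accepts Path
        notes.write('hello')
    copied = shutil.copyfile(Path(joined), package_dir / 'copy.txt')  # shutil too

    joined_is_str = isinstance(joined, str)
    copy_text = Path(copied).read_text()
    is_pathlike = isinstance(package_dir, os.PathLike)

print(joined_is_str, copy_text, is_pathlike)  # True hello True
```

Note that os.path.join still returns a plain string even when it is given a Path.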

But in my favorite library there is a Path, better than the standard one!

You may already be using a third-party library that provides its own Path implementation, different from the standard one. Perhaps you like it more.

For example, django-environ, path.py, plumbum, and visidata each contain their own Path objects. Some of these libraries are older than pathlib and chose to inherit from str so that their objects could be passed to functions that expect strings as paths. Thanks to PEP 519, integrating third-party libraries into your code becomes easier, with no need to inherit from str.

Let's imagine that you do not want to use pathlib because Path objects are immutable, and you really, really want to change their state. Thanks to PEP 519, you can create your very own mutable version of Path. To do this, simply implement the __fspath__ method.
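
As a rough sketch of the idea, here is a tiny mutable path class. MutablePath and its location attribute are hypothetical names made up for this example; the only part that comes from PEP 519 is the __fspath__ method:

```python
import os
import tempfile

class MutablePath:
    """Hypothetical mutable path holder; not part of any real library."""

    def __init__(self, location):
        self.location = os.fspath(location)

    def __fspath__(self):
        # PEP 519 protocol: return the path as a str (or bytes).
        return self.location

with tempfile.TemporaryDirectory() as tmp:
    target = MutablePath(os.path.join(tmp, 'first.txt'))
    with open(target, mode='wt') as first:
        first.write('one')

    # Change the object in place, which pathlib.Path does not allow.
    target.location = os.path.join(tmp, 'second.txt')
    with open(target, mode='wt') as second:
        second.write('two')

    created = sorted(os.listdir(tmp))
    is_pathlike = isinstance(target, os.PathLike)

print(created)      # ['first.txt', 'second.txt']
print(is_pathlike)  # True
```

The class never inherits from str or from Path, yet open accepts it, and os.PathLike recognizes it because it defines __fspath__.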

Any home-grown Path implementation can now work natively with built-in Python functions that expect file paths. Even if you are not a fan of pathlib, the very fact that it exists is a big plus for third-party libraries with their own Path objects.

But `pathlib.Path` and `str` don't mix, right?

You might be thinking: that is all great, of course, but doesn't this sometimes-a-string, sometimes-a-Path approach add complexity to my code?

The answer is yes, to some extent. But this problem has a fairly simple workaround.

PEP 519 added a few more things besides PathLike: first, a way to convert any PathLike to a string, and second, a way to turn any PathLike into a Path.

Take two objects, a string and a Path (or anything with a __fspath__ method):

    from pathlib import Path
    import os.path

    p1 = os.path.join('src', 'my_package')
    p2 = Path('src/my_package')

The os.fspath function normalizes both objects and turns them into strings:

    >>> from os import fspath
    >>> fspath(p1), fspath(p2)
    ('src/my_package', 'src/my_package')

Path, in turn, accepts either of these objects in its constructor and converts it to a Path:

    >>> Path(p1), Path(p2)
    (PosixPath('src/my_package'), PosixPath('src/my_package'))

This means that you can convert the result of make_editorconfig back to a Path if necessary:

    >>> from pathlib import Path
    >>> Path(make_editorconfig(Path('src/my_package')))
    PosixPath('src/my_package/.editorconfig')

Although, of course, the best solution would be to rewrite make_editorconfig using pathlib .
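
One way to keep mixed inputs manageable during a gradual migration is to normalize them once, at the edge of your code. The helper below is only a sketch; to_path is a name I made up, not a standard function:

```python
import os
from pathlib import Path, PurePosixPath

def to_path(value):
    """Hypothetical helper: accept a str or any os.PathLike, return a Path."""
    # os.fspath() raises TypeError for objects that are not paths,
    # which gives early validation at the boundary.
    return Path(os.fspath(value))

from_string = to_path('src/my_package')
from_pathlike = to_path(PurePosixPath('src/my_package'))

try:
    to_path(42)
except TypeError:
    rejected = True
else:
    rejected = False

print(from_string == from_pathlike)  # True
print(rejected)                      # True
```

Everything past that boundary can then assume it is working with Path objects.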

`pathlib` is too slow

I have seen questions about the performance of pathlib several times. It is true: pathlib can be slow. Creating thousands of Path objects can have a noticeable effect on a program's behavior.

I decided to measure the performance of pathlib and os.path on my computer using two different programs that search for all .py files in the current directory.

Here is the os.walk version:

    from os import getcwd, walk

    extension = '.py'
    count = 0
    for root, directories, filenames in walk(getcwd()):
        for filename in filenames:
            if filename.endswith(extension):
                count += 1
    print(f"{count} Python files found")

And here is the version with Path.rglob:

    from pathlib import Path

    extension = '.py'
    count = 0
    for filename in Path.cwd().rglob(f'*{extension}'):
        count += 1
    print(f"{count} Python files found")

Testing the performance of programs that work with the file system is a tricky task, because the runtime can vary quite a lot. I decided to run each script 10 times and compared the best results for each program.

Both programs found 97,507 files in the directory in which I ran them. The first one worked in 1.914 seconds, the second one finished the work in 3.430 seconds.

When I set the extension='' parameter, these programs find approximately 600,000 files, and the difference increases. The first program worked in 1.888 seconds, and the second in 7.485 seconds.

So pathlib is about twice as slow when searching for .py files and about four times slower when listing everything in my home directory. The relative performance gap between pathlib and os is quite large.

In my case, this difference matters little. I searched through all the files in my directory and lost about 6 seconds. If I had to process 10 million files, I would most likely rewrite the code. But until that need arises, I can wait.

If you have a hot piece of code and pathlib clearly hurts its performance, there is nothing wrong with replacing it with an alternative. But you should not optimize code that is not a bottleneck: it is a waste of time that usually also produces less readable code without much gain.
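
If you want to repeat the measurement on your own machine, the best-of-ten approach described above could be sketched roughly as follows. The helper names and the tiny temporary tree are my own assumptions; the timings it prints say nothing about a real directory, so point it at one to get meaningful numbers:

```python
import os
import tempfile
import time
from pathlib import Path

def count_with_walk(top, extension):
    # Counts files only: os.walk reports directories separately.
    return sum(
        filename.endswith(extension)
        for _, _, filenames in os.walk(top)
        for filename in filenames
    )

def count_with_rglob(top, extension):
    # Note: rglob() would also match directories whose names fit the pattern.
    return sum(1 for _ in Path(top).rglob(f'*{extension}'))

def best_of(function, *args, repeats=10):
    """Return (result, best wall-clock time in seconds) over several runs."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        result = function(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

with tempfile.TemporaryDirectory() as tmp:
    # Tiny hypothetical tree, just enough to exercise both functions.
    for index in range(3):
        package = Path(tmp) / f'pkg{index}'
        package.mkdir()
        (package / 'module.py').touch()
        (package / 'notes.txt').touch()

    walk_count, walk_time = best_of(count_with_walk, tmp, '.py')
    rglob_count, rglob_time = best_of(count_with_rglob, tmp, '.py')

print(f'os.walk: {walk_count} files, best of 10: {walk_time:.6f}s')
print(f'rglob:   {rglob_count} files, best of 10: {rglob_time:.6f}s')
```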

Improved readability

I would like to end this stream of thoughts with a few examples of refactoring with pathlib. I took a couple of small code samples that work with files and made them use pathlib. I will leave most of the code without comment and let you be the judge: decide which version you like more.

Here is the make_editorconfig function that we saw earlier:

    import os
    import os.path

    def make_editorconfig(dir_path):
        """Create .editorconfig file in given directory and return filename."""
        filename = os.path.join(dir_path, '.editorconfig')
        if not os.path.exists(filename):
            os.makedirs(dir_path, exist_ok=True)
            open(filename, mode='wt').write('')
        return filename

And here is the version rewritten with pathlib:

    from pathlib import Path

    def make_editorconfig(dir_path):
        """Create .editorconfig file in given directory and return filepath."""
        path = Path(dir_path, '.editorconfig')
        if not path.exists():
            path.parent.mkdir(exist_ok=True, parents=True)
            path.touch()
        return path

Here is a console program that takes a string with a directory and prints the contents of the .gitignore file, if it exists:

    import os.path
    import sys

    directory = sys.argv[1]
    ignore_filename = os.path.join(directory, '.gitignore')
    if os.path.isfile(ignore_filename):
        with open(ignore_filename, mode='rt') as ignore_file:
            print(ignore_file.read(), end='')

The same, but with pathlib:

    from pathlib import Path
    import sys

    directory = Path(sys.argv[1])
    ignore_path = directory / '.gitignore'
    if ignore_path.is_file():
        print(ignore_path.read_text(), end='')

Here is a program that prints all duplicate files in the current folder and subfolders:

    from collections import defaultdict
    from hashlib import md5
    from os import getcwd, walk
    import os.path

    def find_files(filepath):
        for root, directories, filenames in walk(filepath):
            for filename in filenames:
                yield os.path.join(root, filename)

    file_hashes = defaultdict(list)
    for path in find_files(getcwd()):
        with open(path, mode='rb') as my_file:
            file_hash = md5(my_file.read()).hexdigest()
            file_hashes[file_hash].append(path)

    for paths in file_hashes.values():
        if len(paths) > 1:
            print("Duplicate files found:")
            print(*paths, sep='\n')

The same, but with pathlib:

    from collections import defaultdict
    from hashlib import md5
    from pathlib import Path

    def find_files(filepath):
        for path in Path(filepath).rglob('*'):
            if path.is_file():
                yield path

    file_hashes = defaultdict(list)
    for path in find_files(Path.cwd()):
        file_hash = md5(path.read_bytes()).hexdigest()
        file_hashes[file_hash].append(path)

    for paths in file_hashes.values():
        if len(paths) > 1:
            print("Duplicate files found:")
            print(*paths, sep='\n')

The changes are small but, in my opinion, they add up. The pathlib versions read more clearly.

`pathlib.Path`

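The / separators in pathlib.Path objects are converted automatically to the right separator for the current platform. However you build the path, the result is the same.

    >>> path1 = Path('dir', 'file')
    >>> path2 = Path('dir') / 'file'
    >>> path3 = Path('dir/file')
    >>> path3
    WindowsPath('dir/file')
    >>> path1 == path2 == path3
    True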

Python's standard library and built-ins (e.g. open) accept Path objects, so you can start using pathlib even if your dependencies do not!

    from shutil import move

    def rename_and_redirect(old_filename, new_filename):
        move(old_filename, new_filename)
        with open(old_filename, mode='wt') as f:
            f.write(f'This file has moved to {new_filename}')

    >>> from pathlib import Path
    >>> old, new = Path('old.txt'), Path('new.txt')
    >>> rename_and_redirect(old, new)
    >>> old.read_text()
    'This file has moved to new.txt'

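Third-party libraries do not have to use pathlib: they can provide their own PathLike objects, which, thanks to PEP 519, work with the built-in functions.

    >>> from plumbum import Path
    >>> my_path = Path('old.txt')
    >>> with open(my_path) as f:
    ...     print(f.read())
    ...
    This file has moved to new.txt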

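pathlib is sometimes slower than the alternatives, but this rarely matters (in my experience, at least), and you can always fall back to strings in performance-critical code.

In general, pathlib makes code more readable. Here is a short Python example:

    from pathlib import Path

    gitignore = Path('.gitignore')
    if gitignore.is_file():
        print(gitignore.read_text(), end='')

pathlib is great. Start using it!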

Source: https://habr.com/ru/post/453862/
