From the translator: Hi, Habr! Here is a translation of the article Why you should be using pathlib and its follow-up, No really, pathlib is great. A lot of attention now goes to Python features such as asyncio, the := operator, and optional typing. Meanwhile, less prominent but very useful additions to the language risk slipping under the radar (although I can hardly bring myself to call := a serious innovation). In particular, I found no articles on this subject on Habr (apart from one paragraph here), so I decided to correct the situation.
When I discovered the then-new pathlib module a few years ago, I naively decided that it was just a slightly awkward object-oriented version of the os.path module. I was wrong. pathlib is really wonderful!
In this article I will try to make you fall in love with pathlib. I hope it will inspire you to use pathlib in any situation that involves working with files in Python.
os.path is clumsy
The os.path module has always been what we reached for when working with paths in Python. In principle, it has everything you need, but the result often does not look very elegant.
Should you import it like this?
    import os.path

    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    TEMPLATES_DIR = os.path.join(BASE_DIR, 'templates')
Or like this?
    from os.path import abspath, dirname, join

    BASE_DIR = dirname(dirname(abspath(__file__)))
    TEMPLATES_DIR = join(BASE_DIR, 'templates')
Maybe the name join is too generic, and we should do something like this instead:
    from os.path import abspath, dirname, join as joinpath

    BASE_DIR = dirname(dirname(abspath(__file__)))
    TEMPLATES_DIR = joinpath(BASE_DIR, 'templates')
To me, none of the options above feels comfortable. We pass strings to functions that return strings, which we then pass to other functions that work with strings. It just so happens that they all contain paths, but they are still just strings.
Using strings for input and output makes the os.path functions awkward, because the code has to be read from the inside out. I would like to turn these nested calls into sequential ones. This is exactly what pathlib lets us do!
    from pathlib import Path

    BASE_DIR = Path(__file__).resolve().parent.parent
    TEMPLATES_DIR = BASE_DIR.joinpath('templates')
The os.path module requires nested function calls, but pathlib lets us chain successive method calls and attribute accesses on the Path class with an equivalent result.
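As a small sketch of that left-to-right chaining, the snippet below uses PurePosixPath so that the printed results are the same on every platform. The location /srv/project/app/settings.py is a made-up example, not something from the original article.

```python
from pathlib import PurePosixPath

# Hypothetical file location, used only for illustration.
settings_file = PurePosixPath('/srv/project/app/settings.py')

# Each step returns another path object (or a plain string for
# name/suffix/stem), so the chain reads from left to right.
base_dir = settings_file.parent.parent
templates_dir = base_dir / 'templates'

print(base_dir)              # /srv/project
print(templates_dir)         # /srv/project/templates
print(settings_file.name)    # settings.py
print(settings_file.suffix)  # .py
print(settings_file.stem)    # settings
```

Pure paths never touch the file system, which makes them convenient for experimenting with path arithmetic.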
I know what you are thinking: wait, these Path objects are not the same thing we had before; we are no longer working with path strings! We will return to this question later (hint: in almost any situation, the two approaches are interchangeable).
os is overloaded
The classic os.path module is designed for working with paths. But once you want to do something with a path (for example, create a directory), you need to reach for another module, often os.
os contains a bunch of utilities for working with files and directories: mkdir, getcwd, chmod, stat, remove, rename, rmdir. Also chdir, link, walk, listdir, makedirs, renames, removedirs, unlink, symlink. And a bunch of other things that have nothing to do with file systems: fork, getenv, putenv, environ, getlogin, system, ... plus a few dozen more that I will not mention here.
The os module is designed for a wide range of tasks; it is a grab bag of everything related to the operating system. There are many useful utilities in os, but it is not always easy to navigate: you often have to dig around in the module before you find what you need.
pathlib moves most of the file system functions onto Path objects.
Here is code that creates the src/__pypackages__ directory and renames our .editorconfig file to src/.editorconfig:
    import os
    import os.path

    os.makedirs(os.path.join('src', '__pypackages__'), exist_ok=True)
    os.rename('.editorconfig', os.path.join('src', '.editorconfig'))
Here is similar code using Path:
    from pathlib import Path

    Path('src/__pypackages__').mkdir(parents=True, exist_ok=True)
    Path('.editorconfig').rename('src/.editorconfig')
Notice that the second sample is much easier to read because it is organized from left to right, all thanks to method chaining.
glob
os and os.path are not the only modules with file system utilities. The glob module also deserves a mention; it is far from useless.
We can use the glob.glob function to find files matching a specific pattern:
    from glob import glob

    top_level_csv_files = glob('*.csv')
    all_csv_files = glob('**/*.csv', recursive=True)
The pathlib module provides similar methods:
    from pathlib import Path

    top_level_csv_files = Path.cwd().glob('*.csv')
    all_csv_files = Path.cwd().rglob('*.csv')
After switching to pathlib, the need for the glob module disappears completely: everything necessary is already part of Path objects.
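One difference is worth keeping in mind: glob.glob returns a list of strings, while Path.glob and Path.rglob return generators of Path objects. The sketch below illustrates this in a throwaway temporary directory; the file names are invented for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'data').mkdir()
    (root / 'top.csv').write_text('a,b\n')
    (root / 'data' / 'nested.csv').write_text('c,d\n')

    # glob() and rglob() are lazy, so wrap them in sorted() or list()
    # when a concrete, ordered list is needed.
    top_level = sorted(p.name for p in root.glob('*.csv'))
    everything = sorted(
        p.relative_to(root).as_posix() for p in root.rglob('*.csv')
    )

print(top_level)   # ['top.csv']
print(everything)  # ['data/nested.csv', 'top.csv']
```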
pathlib makes simple things even easier
pathlib simplifies many complex situations, but beyond that it also makes some simple code snippets even easier.
Want to read all the text in one or more files?
You can open the file, read its contents, and close the file using a with block:
    from glob import glob

    file_contents = []
    for filename in glob('**/*.py', recursive=True):
        with open(filename) as python_file:
            file_contents.append(python_file.read())
Or you can use the read_text method on Path objects and a list comprehension to get the same result in a single expression:
    from pathlib import Path

    file_contents = [
        path.read_text()
        for path in Path.cwd().rglob('*.py')
    ]
And what if you need to write to a file?
Here is what it looks like using open:
    with open('.editorconfig', mode='wt') as config:
        config.write('# config goes here')
Or you can use the write_text method:
    Path('.editorconfig').write_text('# config goes here')
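A minimal round trip with write_text and read_text might look like the sketch below. It works in a temporary directory so that no real .editorconfig is touched, and the explicit encoding='utf-8' argument is my own addition rather than something the article prescribes.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / '.editorconfig'

    # write_text() opens the file, replaces its contents, closes it,
    # and returns the number of characters written.
    written = config.write_text('root = true\n', encoding='utf-8')

    # read_text() opens, reads, and closes the file, returning a str.
    contents = config.read_text(encoding='utf-8')

print(written)         # 12
print(repr(contents))  # 'root = true\n'
```

Note that write_text always overwrites the whole file; appending still requires open with an append mode.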
If for some reason you need to use open, either as a context manager or out of personal preference, Path provides an open method as an alternative:
    from pathlib import Path

    path = Path('.editorconfig')
    with path.open(mode='wt') as config:
        config.write('# config goes here')
Or, starting with Python 3.6, you can pass a Path directly to open:
    from pathlib import Path

    path = Path('.editorconfig')
    with open(path, mode='wt') as config:
        config.write('# config goes here')
What do the following variables hold? What do their values mean?
    person = '{"name": "Trey Hunner", "location": "San Diego"}'
    pycon_2019 = "2019-05-01"
    home_directory = '/home/trey'
Each variable points to a string. But each string represents something different: the first is JSON, the second is a date, and the third is a file path.
This representation of the same objects is somewhat more useful:
    from datetime import date
    from pathlib import Path

    person = {"name": "Trey Hunner", "location": "San Diego"}
    pycon_2019 = date(2019, 5, 1)
    home_directory = Path('/home/trey')
JSON objects can be deserialized into a dictionary, dates can be represented natively using datetime.date, and file paths can be represented as Path objects.
Using Path objects makes your code more explicit. If you want to work with dates, you use date. If you want to work with file paths, use Path.
I am not a big advocate of OOP. Classes add an extra layer of abstraction, and abstractions sometimes complicate a system rather than simplify it. At the same time, I believe that pathlib.Path is a useful abstraction. It is also quickly becoming a commonly accepted one.
Thanks to PEP 519, Path objects are becoming the standard for working with paths. As of Python 3.6, most functions in os, shutil, and os.path work correctly with these objects. You can switch to pathlib today, transparently for your codebase!
What is missing from pathlib?
Although pathlib is great, it is not comprehensive. There are definitely a few features I wish the module included.
The first thing that comes to mind is the lack of Path methods equivalent to the shutil functions. Although you can pass Path objects to shutil functions for copying, deleting, and moving files and directories, you cannot call those operations as methods on Path objects.
So, to copy files, you need to do something like this:
    from pathlib import Path
    from shutil import copyfile

    source = Path('old_file.txt')
    destination = Path('new_file.txt')
    copyfile(source, destination)
There is also no equivalent of the os.chdir function. This means you have to import it if you need to change the current directory:
    from pathlib import Path
    from os import chdir

    parent = Path('..')
    chdir(parent)
There is also no equivalent of the os.walk function, although you can write your own walk-like function without much difficulty.
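One possible shape for such a helper is sketched below. The walk function here is hypothetical and is not part of the article or of pathlib as described in it; it simply mirrors the top-down behaviour of os.walk using Path.iterdir. Newer Python versions (3.12 and later) also ship a built-in Path.walk method.

```python
import tempfile
from pathlib import Path


def walk(top):
    """Yield (directory, subdirectories, files) as Path objects, top-down.

    A hypothetical helper written for this example; symlinked
    directories are listed but not followed.
    """
    top = Path(top)
    entries = sorted(top.iterdir())
    directories = [p for p in entries if p.is_dir()]
    files = [p for p in entries if not p.is_dir()]
    yield top, directories, files
    for directory in directories:
        if not directory.is_symlink():
            yield from walk(directory)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'pkg').mkdir()
    (root / 'pkg' / '__init__.py').write_text('')
    (root / 'README.txt').write_text('hello')

    found = [
        (current.relative_to(root).as_posix(), [f.name for f in files])
        for current, _, files in walk(root)
    ]

print(found)  # [('.', ['README.txt']), ('pkg', ['__init__.py'])]
```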
I hope that one day pathlib.Path objects will gain methods for some of these operations. But even as things stand, I find it much simpler to use pathlib plus a few extras than to use os.path and everything else.
Should you always use pathlib?
Starting with Python 3.6, Path objects work almost everywhere you use strings. So I see no reason not to use pathlib if you are on Python 3.6 or higher.
If you use an earlier version of Python 3, you can always wrap a Path object in a str call to get a string when you need to return to string land. It is not very elegant, but it works:
    from os import chdir
    from pathlib import Path

    chdir(Path('/home/trey'))       # Python 3.6+
    chdir(str(Path('/home/trey')))  # also works on earlier versions
After the first part was published, some people raised questions. Some said that my comparison of os.path and pathlib was unfair. Some said that os.path is so ingrained in the Python community that switching to a new library would take a very long time. I also saw some questions about performance.
In this part I would like to address these concerns. It can be read both as a defense of pathlib and as something of a love letter to PEP 519.
Comparing os.path and pathlib fairly
In the previous part, I compared the following two code fragments:
    import os
    import os.path

    os.makedirs(os.path.join('src', '__pypackages__'), exist_ok=True)
    os.rename('.editorconfig', os.path.join('src', '.editorconfig'))

    from pathlib import Path

    Path('src/__pypackages__').mkdir(parents=True, exist_ok=True)
    Path('.editorconfig').rename('src/.editorconfig')
This may seem like an unfair comparison, because using os.path.join in the first example guarantees the correct separators on all platforms, which I did not do in the second example. In fact, everything is fine, because Path automatically normalizes path separators.
We can prove this by converting a Path object to a string on Windows:
    >>> str(Path('src/__pypackages__'))
    'src\\__pypackages__'
It makes no difference whether we use the joinpath method, a '/' in the path string, the / operator (another nice Path feature), or pass separate arguments to the Path constructor; we get the same result:
    >>> Path('src', '.editorconfig')
    WindowsPath('src/.editorconfig')
    >>> Path('src') / '.editorconfig'
    WindowsPath('src/.editorconfig')
    >>> Path('src').joinpath('.editorconfig')
    WindowsPath('src/.editorconfig')
    >>> Path('src/.editorconfig')
    WindowsPath('src/.editorconfig')
The last example confused some people, who assumed that pathlib was not smart enough to replace / with \ in the path string. Fortunately, everything is fine!
With Path objects, you no longer need to worry about the direction of slashes: define all your paths using /, and the result will be predictable on any platform.
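If you do not have a Windows machine at hand, the pure path classes offer a way to check this behaviour anywhere. The following is a small sketch using PureWindowsPath and PurePosixPath, which only manipulate path strings and never access the file system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure path classes do no file system access, so both flavours can be
# instantiated on any operating system.
windows_path = PureWindowsPath('src/__pypackages__')
posix_path = PurePosixPath('src/__pypackages__')

print(str(windows_path))        # src\__pypackages__
print(str(posix_path))          # src/__pypackages__
print(windows_path.as_posix())  # src/__pypackages__

# The Windows flavour treats both separators as equivalent.
same = PureWindowsPath('src\\pkg') == PureWindowsPath('src/pkg')
print(same)                     # True
```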
If you work on Linux or Mac, it is very easy to accidentally introduce bugs that only affect Windows users. Unless you are careful to use os.path.join and/or os.path.normcase to convert slashes into the ones appropriate for the current platform, you can write code that will not work correctly on Windows.
Here is an example of a Windows-specific bug:
    import sys
    import os.path

    directory = '.' if not sys.argv[1:] else sys.argv[1]
    # On Windows this mixes separators: '.\\new_package/__init__.py'
    new_file = os.path.join(directory, 'new_package/__init__.py')
Whereas this code works correctly everywhere:
    import sys
    from pathlib import Path

    directory = '.' if not sys.argv[1:] else sys.argv[1]
    new_file = Path(directory, 'new_package/__init__.py')
Previously, the programmer was responsible for concatenating and normalizing paths, just as in Python 2 the programmer was responsible for deciding where to use unicode instead of bytes. This is no longer your job: Path solves all such problems for you.
I do not use Windows, and I do not own a Windows computer. But a huge number of the people who use my code very likely do use Windows, and I want everything to work correctly for them.
If there is any chance that your code will run on Windows, you should seriously consider switching to pathlib.
Don't worry about normalization: use Path whenever you deal with file paths.
But my code does not use pathlib!
You have a large codebase that works with strings as paths. Why switch to pathlib if it means rewriting everything?
Let's imagine that you have the following function:
    import os
    import os.path

    def make_editorconfig(dir_path):
        """Create .editorconfig file in given directory and return filename."""
        filename = os.path.join(dir_path, '.editorconfig')
        if not os.path.exists(filename):
            os.makedirs(dir_path, exist_ok=True)
            open(filename, mode='wt').write('')
        return filename
The function takes a directory and creates a .editorconfig file in it, like this:
    >>> import os.path
    >>> make_editorconfig(os.path.join('src', 'my_package'))
    'src/my_package/.editorconfig'
If you replace the string with a Path, everything still works:
    >>> from pathlib import Path
    >>> make_editorconfig(Path('src/my_package'))
    'src/my_package/.editorconfig'
But how?
os.path.join accepts Path objects (starting with Python 3.6). The same is true of os.makedirs.
In fact, the built-in open function accepts Path, shutil functions accept Path, and everything in the standard library that used to accept a string should now work with both Path objects and strings.
Credit for this goes to PEP 519, which introduced the abstract os.PathLike class and declared that all built-in utilities for working with file paths should now work with both strings and Path objects.
You may already be using a third-party library that provides its own Path implementation, different from the standard one. Perhaps you like it more.
For example, django-environ, path.py, plumbum, and visidata each contain their own Path objects. Some of these libraries are older than pathlib and chose to inherit from str so that their objects could be passed to functions that expect strings as paths. Thanks to PEP 519, integrating third-party libraries into your code becomes easier, with no need to inherit from str.
Let's imagine that you do not want to use pathlib because Path objects are immutable, and you really, really want to change their state. Thanks to PEP 519, you can create your own mutable version of Path. To do this, simply implement the __fspath__ method.
Any hand-written Path implementation can now work natively with built-in Python functions that expect file paths. Even if you do not like pathlib, the very fact of its existence is a big plus for third-party libraries with their own Path objects.
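To make the __fspath__ idea concrete, here is a minimal sketch. The MutablePath class is hypothetical and written only for this illustration; the point is that any object with an __fspath__ method is accepted by open, os.rename, os.path.exists, and similar functions.

```python
import os
import tempfile


class MutablePath:
    """Hypothetical mutable path type, written only for this example."""

    def __init__(self, *parts):
        self.parts = list(parts)

    def __fspath__(self):
        # Called by os.fspath(), open(), os.path.join() and friends.
        return os.path.join(*self.parts)


with tempfile.TemporaryDirectory() as tmp:
    path = MutablePath(tmp, 'notes.txt')
    with open(path, mode='wt') as notes:  # open() accepts any os.PathLike
        notes.write('hello')

    path.parts[-1] = 'renamed.txt'        # mutate the object in place
    os.rename(os.path.join(tmp, 'notes.txt'), path)
    exists = os.path.exists(path)
    is_pathlike = isinstance(path, os.PathLike)

print(exists, is_pathlike)  # True True
```

The class never inherits from str or from os.PathLike explicitly; implementing __fspath__ is enough for the isinstance check to succeed.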
pathlib.Path and str don't mix, right?
You might be thinking: that is all great, of course, but doesn't this sometimes-a-string, sometimes-a-Path approach add complexity to my code?
The answer is yes, to some extent. But this problem has a fairly simple workaround.
PEP 519 added a few more things besides PathLike: first, a way to convert any PathLike to a string, and second, a way to turn any PathLike into a Path.
Take two objects: a string and a Path (or anything with an __fspath__ method):
    from pathlib import Path
    import os.path

    p1 = os.path.join('src', 'my_package')
    p2 = Path('src/my_package')
The os.fspath function normalizes both objects and turns them into strings:
    >>> from os import fspath
    >>> fspath(p1), fspath(p2)
    ('src/my_package', 'src/my_package')
Meanwhile, the Path constructor accepts both of these objects and converts them to a Path:
    >>> Path(p1), Path(p2)
    (PosixPath('src/my_package'), PosixPath('src/my_package'))
This means that you can convert the result of make_editorconfig back to a Path if necessary:
    >>> from pathlib import Path
    >>> Path(make_editorconfig(Path('src/my_package')))
    PosixPath('src/my_package/.editorconfig')
Although, of course, the best solution would be to rewrite make_editorconfig using pathlib.
pathlib is too slow
I have seen questions about the performance of pathlib several times. It is true: pathlib can be slow. Creating thousands of Path objects can have a noticeable effect on a program's running time.
I decided to measure the performance of pathlib and os.path on my computer using two different programs that find all .py files under the current directory.
Here is the os.walk version:
    from os import getcwd, walk

    extension = '.py'
    count = 0
    for root, directories, filenames in walk(getcwd()):
        for filename in filenames:
            if filename.endswith(extension):
                count += 1
    print(f"{count} Python files found")
And here is the version with Path.rglob:
    from pathlib import Path

    extension = '.py'
    count = 0
    for filename in Path.cwd().rglob(f'*{extension}'):
        count += 1
    print(f"{count} Python files found")
Benchmarking programs that work with the file system is tricky, because run times can vary quite a lot. I ran each script 10 times and compared the best result for each program.
Both programs found 97,507 files in the directory where I ran them. The first finished in 1.914 seconds, the second in 3.430 seconds.
When I set extension = '', these programs find approximately 600,000 files, and the difference grows. The first program finished in 1.888 seconds, and the second in 7.485 seconds.
So pathlib is about twice as slow for files with the .py extension, and about four times slower when matching every file in my home directory. The relative performance gap between pathlib and os is quite large.
In my case, this speed difference matters little. I searched every file in my home directory and lost 6 seconds. If I had to process 10 million files, I would most likely rewrite the code. But as long as there is no such need, I can wait.
If you have a hot piece of code and pathlib clearly hurts its performance, there is nothing wrong with replacing it with an alternative. But you should not optimize code that is not a bottleneck: it is a waste of time that usually also leads to poorly readable code without much benefit.
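As one illustration of such an alternative, the sketch below counts files with os.scandir and plain strings, avoiding the creation of a Path object per file. The count_files helper is hypothetical and not taken from the article, and no speed-up is claimed here; whether it helps is something to measure on your own data.

```python
import os
import tempfile
from pathlib import Path


def count_files(top, extension=''):
    """Count files under top whose names end with extension.

    Hypothetical helper: uses os.scandir() and plain strings instead
    of Path objects, and does not follow symlinked directories.
    """
    count = 0
    with os.scandir(top) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                count += count_files(entry.path, extension)
            elif entry.name.endswith(extension):
                count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'pkg').mkdir()
    (root / 'pkg' / 'module.py').write_text('')
    (root / 'script.py').write_text('')
    (root / 'notes.txt').write_text('')

    scandir_count = count_files(root, '.py')
    pathlib_count = sum(1 for _ in root.rglob('*.py'))

print(scandir_count, pathlib_count)  # 2 2
```

The rest of the program can keep using Path objects: os.scandir accepts them as input, so only the hot loop needs to drop down to strings.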
I would like to end this stream of thoughts with some examples of refactoring with pathlib. I took a couple of small code samples that work with files and converted them to use pathlib. I will leave most of the code without comment for you to judge: decide which version you like more.
Here is the make_editorconfig function that we saw earlier:
    import os
    import os.path

    def make_editorconfig(dir_path):
        """Create .editorconfig file in given directory and return filename."""
        filename = os.path.join(dir_path, '.editorconfig')
        if not os.path.exists(filename):
            os.makedirs(dir_path, exist_ok=True)
            open(filename, mode='wt').write('')
        return filename
And here is the version rewritten with pathlib:
    from pathlib import Path

    def make_editorconfig(dir_path):
        """Create .editorconfig file in given directory and return filepath."""
        path = Path(dir_path, '.editorconfig')
        if not path.exists():
            path.parent.mkdir(exist_ok=True, parents=True)
            path.touch()
        return path
Here is a console program that takes a string with a directory and prints the contents of the .gitignore file, if it exists:
    import os.path
    import sys

    directory = sys.argv[1]
    ignore_filename = os.path.join(directory, '.gitignore')
    if os.path.isfile(ignore_filename):
        with open(ignore_filename, mode='rt') as ignore_file:
            print(ignore_file.read(), end='')
The same, but with pathlib:
    from pathlib import Path
    import sys

    directory = Path(sys.argv[1])
    ignore_path = directory / '.gitignore'
    if ignore_path.is_file():
        print(ignore_path.read_text(), end='')
Here is a program that prints all duplicate files in the current folder and subfolders:
    from collections import defaultdict
    from hashlib import md5
    from os import getcwd, walk
    import os.path

    def find_files(filepath):
        for root, directories, filenames in walk(filepath):
            for filename in filenames:
                yield os.path.join(root, filename)

    file_hashes = defaultdict(list)
    for path in find_files(getcwd()):
        with open(path, mode='rb') as my_file:
            file_hash = md5(my_file.read()).hexdigest()
            file_hashes[file_hash].append(path)

    for paths in file_hashes.values():
        if len(paths) > 1:
            print("Duplicate files found:")
            print(*paths, sep='\n')
The same, but with pathlib:
    from collections import defaultdict
    from hashlib import md5
    from pathlib import Path

    def find_files(filepath):
        for path in Path(filepath).rglob('*'):
            if path.is_file():
                yield path

    file_hashes = defaultdict(list)
    for path in find_files(Path.cwd()):
        file_hash = md5(path.read_bytes()).hexdigest()
        file_hashes[file_hash].append(path)

    for paths in file_hashes.values():
        if len(paths) > 1:
            print("Duplicate files found:")
            print(*paths, sep='\n')
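To sum up: you should be using pathlib.
Use pathlib.Path objects for your paths.
The / operator and the pathlib.Path constructor normalize path separators, so paths built in different ways compare equal: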
    >>> path1 = Path('dir', 'file')
    >>> path2 = Path('dir') / 'file'
    >>> path3 = Path('dir/file')
    >>> path3
    WindowsPath('dir/file')
    >>> path1 == path2 == path3
    True
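Python's built-in utilities (e.g. open) accept Path objects, so you can start using pathlib even in code that was not written with it in mind!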
    from shutil import move

    def rename_and_redirect(old_filename, new_filename):
        move(old_filename, new_filename)
        with open(old_filename, mode='wt') as f:
            f.write(f'This file has moved to {new_filename}')

    >>> from pathlib import Path
    >>> old, new = Path('old.txt'), Path('new.txt')
    >>> rename_and_redirect(old, new)
    >>> old.read_text()
    'This file has moved to new.txt'
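If you do not like pathlib, you can use a third-party library that provides PathLike objects. They will work with the built-in functions too, thanks to PEP 519.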
    >>> from plumbum import Path
    >>> my_path = Path('old.txt')
    >>> with open(my_path) as f:
    ...     print(f.read())
    ...
    This file has moved to new.txt
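pathlib is sometimes slow, but (in my experience, at least) the cases where this matters are rare, and you can always fall back to strings in the performance-critical parts of your code.
Overall, pathlib makes code more readable. Here is a short Python script that illustrates the point: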
    from pathlib import Path

    gitignore = Path('.gitignore')
    if gitignore.is_file():
        print(gitignore.read_text(), end='')
pathlib is great. Start using it!
Source: https://habr.com/ru/post/453862/