Speed up open().read() pattern by reducing the number of system calls #120754

cmaloney · 2024-06-19T19:36:57Z

Feature or enhancement

Proposal:

I came across some seemingly redundant fstat() and lseek() calls when working on a tool that scanned a directory of lots of small YAML files and loaded their contents as config. In tracing I found most execution time wasn't in the python interpreter but system calls (on top of NFS in that case, which made some I/O calls particularly slow).

I've been experimenting with a program that reads all .rst files in the python Docs directory to try and remove some of those redundant system calls..

Test Program

from pathlib import Path

nlines = []
for filename in Path("cpython/Doc").glob("**/*.rst"):
    nlines.append(len(filename.read_text()))

In my experimentation, with some tweaks to fileio can remove over 10% of the system calls the test program makes when scanning the whole Doc folders for .rst files on both macOS and Linux (don't have a Windows machine to measure on).

Current State (9 system calls)

Currently on my Linux machine to read a whole .rst file with the above code there is this series of system calls:

openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
ioctl(3, TCGETS, 0x7ffe52525930)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343
read(3, "", 1)                          = 0
close(3)                                = 0

Target State (7 5 system calls)

It would be nice to get it down to (for small files, large file caveat in PR / get an additional seek):

# Open the file
openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3
# Check if the open fd is a file or directory and early-exit on directories with a specialized error.
# With my changes we also stash the size information from this for later use as an estimate.
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
# Read the data directly into a PyBytes
read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343
# Read the EOF marker
read(3, "", 1)                          = 0
# Close the file
close(3)                                = 0

In a number of cases (ex. importing modules) there is often a fstat followed immediately by an open / read the file (which does another fstat typically), but that is an extension point and I want to keep that out of scope for now.

Questions rattling around in my head around this

Some of these are likely better for Discourse / longer form discussion, happy to start threads there as appropriate.

Is there a way to add a test for certain system calls happening with certain arguments and/or a certain amount of time? (I don't currently see a great way to write a test to make sure the number of system calls doesn't change unintentionally)
Running a simple python script (python simple.py that contains print("Hello, World!")) currently reads simple.py in full at least 4 times and does over 5 seeks. I have been pulling on that thread but it interacts with importlib as well as how the python compiler currently works, still trying to get my head around. Would removing more of those overheads be something of interest / should I keep working to get my head around it?
We could potentially save more
1. with readv (one readv call, two iovecs). I avoided this for now because _Py_read does quite a bit.
2. dispatching multiple calls in parallel using asynchronous I/O APIs to meet the python API guarantees; I am experimenting with this (backed by relatively new Linux I/O APIs but possibly for kqueue and epoll), but it's very experimental and feeling a lot like "has to be a interpreter primitive" to me to work effectively which is complex to plumb through. Very early days though, many thoughts, not much prototype code.
The _blksize member of fileio was added in bpo-21679. It is not used much as far as I can tell as its reflection _blksize in python or in the code. The only usage I can find is https://github.com/python/cpython/blob/main/Modules/_io/_iomodule.c#L365-L374, where we could just query for it when needed in that case to save some storage on all fileio objects. The behavior of using the stat returned st_blksize is part of the docs, so doesn't feel like we can fully remove it.

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Linked PRs

The text was updated successfully, but these errors were encountered:

mdboom · 2024-06-20T11:58:09Z

Running a simple python script (python simple.py that contains print("Hello, World!")) currently reads simple.py in full at least 4 times and does over 5 seeks. I have been pulling on that thread but it interacts with importlib as well as how the python compiler currently works, still trying to get my head around. Would removing more of those overheads be something of interest / should I keep working to get my head around it?

Startup time is always an important thing to optimize for.

cmaloney · 2024-06-21T06:34:58Z

I found the source of the remaining lseek(3, 0, SEEK_CUR) by using lldb to set a break point and get a backtrace. In _io.open() (_io_open_impl), by default, buffering is on, and that results in a BufferedIO* object being constructed. All of those call _buffered_init which calls _buffered_raw_tell to figure out the position so the buffered IO can track positions itself.

In the read whole file case that seems to be a lot of unnecessary work (Allocate a additional buffer and lock, do the seek, a list of "chunks" for read that get joined w/ special fast cases for "only one chunk", etc) which gets discarded shortly after. While it is likely possible to rewrite read_bytes() and read_text() to do less work, open().read() is common enough in code I'm planning to try and make that do less work overall, trying to avoid special-casing the operations.

cmaloney · 2024-06-27T03:04:32Z

Summary

# main
read_file_large: Mean +- std dev: 24.0 us +- 0.4 us

# readall_faster, removes a stat (#120755): 
read_file_large: Mean +- std dev: 21.2 us +- 0.6 us

# No isatty call (no pr yet)
read_file_large: Mean +- std dev: 20.2 us +- 0.3 us

No patch that removes the BufferedIO/TextIO seek passes tests yet. Planning to see what happens if stash whole stat result rather than lots of individual members.

Details

It touches a lot of code to remove the remaining seek. BufferedIO has implementation for the "absolute position" abs_pos to be optional / lazily filled. Once that is removed however, TextIO checks for "is this seekable" on construction, which is a boolean member that must be filled at construction currently. Still trying for a well contained way to do that.

The isatty call can in theory have a shortcut added around it using the fstat st_mode information. Just adding a "if this is a regular file" check before it + member accessible from fileio I get a speedup on my Mac :

main

python ../benchmark.py
.....................
read_file_small: Mean +- std dev: 8.43 us +- 0.12 us
.....................
read_file_large: Mean +- std dev: 24.0 us +- 0.4 us

cmaloney/readall_faster (#120755)

.....................
read_file_small: Mean +- std dev: 7.92 us +- 0.07 us
.....................
read_file_large: Mean +- std dev: 21.2 us +- 0.6 us

No isatty (local only currently):

.....................
read_file_small: Mean +- std dev: 7.32 us +- 0.09 us
.....................
read_file_large: Mean +- std dev: 20.2 us +- 0.3 us

That reduces the time for reading a single large .rst file in the cpython repo ~15% so far.

I built a little larger benchmark to more match code I ran into this with: "read all the .rst and .py in the cpython repo"

import pyperf
from pathlib import Path


def read_all(all_paths):
    for p in all_paths:
        p.read_bytes()


def read_file(path_obj):
    path_obj.read_text()


all_rst = list(Path("Doc").glob("**/*.rst"))
all_py = list(Path(".").glob("**/*.py"))
assert all_rst, "Should have found .rst files"
assert all_py, "Should have found .py source files"


runner = pyperf.Runner()
runner.bench_func("read_file_small", read_file, Path("Doc/howto/clinic.rst"))
runner.bench_func("read_file_large", read_file, Path("Doc/c-api/typeobj.rst"))

runner.bench_func("read_all_rst", read_all, all_rst)
runner.bench_func("read_all_py", read_all, all_py)

And on my Linux box read_all_py is ~9% faster with both changes at the moment.

I'm iterating on top of my existing PR code. Both removing the extra fstat and isatty require getting and stashing more data out of the initial stat result, so planning to see what the code looks like to just stash the whole stat result (although it is fairly big, so possibly with a hatch to unset it on write/seek or manually). This would also be a way to remove the _blksize member which is also only used at open() time (to figure out buffer size to pass to BufferedIO*).

…20755) This reduces the system call count of a simple program[0] that reads all the `.rst` files in Doc by over 10% (5706 -> 4734 system calls on my linux system, 5813 -> 4875 on my macOS) This reduces the number of `fstat()` calls always and seek calls most the time. Stat was always called twice, once at open (to error early on directories), and a second time to get the size of the file to be able to read the whole file in one read. Now the size is cached with the first call. The code keeps an optimization that if the user had previously read a lot of data, the current position is subtracted from the number of bytes to read. That is somewhat expensive so only do it on larger files, otherwise just try and read the extra bytes and resize the PyBytes as needeed. I built a little test program to validate the behavior + assumptions around relative costs and then ran it under `strace` to get a log of the system calls. Full samples below[1]. After the changes, this is everything in one `filename.read_text()`: ```python3 openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3` fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0` ioctl(3, TCGETS, 0x7ffdfac04b40) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343 read(3, "", 1) = 0 close(3) = 0 ``` This does make some tradeoffs 1. If the file size changes between open() and readall(), this will still get all the data but might have more read calls. 2. I experimented with avoiding the stat + cached result for small files in general, but on my dev workstation at least that tended to reduce performance compared to using the fstat(). [0] ```python3 from pathlib import Path nlines = [] for filename in Path("cpython/Doc").glob("**/*.rst"): nlines.append(len(filename.read_text())) ``` [1] Before small file: ``` openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0 ioctl(3, TCGETS, 0x7ffe52525930) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 lseek(3, 0, SEEK_CUR) = 0 fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0 read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343 read(3, "", 1) = 0 close(3) = 0 ``` After small file: ``` openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0 ioctl(3, TCGETS, 0x7ffdfac04b40) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343 read(3, "", 1) = 0 close(3) = 0 ``` Before large file: ``` openat(AT_FDCWD, "cpython/Doc/c-api/typeobj.rst", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=133104, ...}) = 0 ioctl(3, TCGETS, 0x7ffe52525930) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 lseek(3, 0, SEEK_CUR) = 0 fstat(3, {st_mode=S_IFREG|0644, st_size=133104, ...}) = 0 read(3, ".. highlight:: c\n\n.. _type-struc"..., 133105) = 133104 read(3, "", 1) = 0 close(3) = 0 ``` After large file: ``` openat(AT_FDCWD, "cpython/Doc/c-api/typeobj.rst", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=133104, ...}) = 0 ioctl(3, TCGETS, 0x7ffdfac04b40) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 lseek(3, 0, SEEK_CUR) = 0 read(3, ".. highlight:: c\n\n.. _type-struc"..., 133105) = 133104 read(3, "", 1) = 0 close(3) = 0 ``` Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com> Co-authored-by: Erlend E. Aasland <erlend.aasland@protonmail.com> Co-authored-by: Victor Stinner <vstinner@python.org>

cmaloney · 2024-07-04T07:55:00Z

Could this be reopened? My plan is to keep working on trying to simplify what happens in this case, would like to attach to this issue (strace pr followup test, removing seeks, possibly removing isatty)

Sometimes a large file is truncated (test_largefile). While estimated_size is used as a estimate (the read will stil get the number of bytes in the file), that it is much larger than the actual size of data can result in a significant over allocation and sometimes lead to a MemoryError / running out of memory. This brings the C implementation to match the Python _pyio implementation.

For POSIX, TTYs are never regular files, so if the interpreter knows the file is regular it doesn't need to do an additional system call to check if the file is a tty. The `open()` API requires a `stat` call at present in order to make sure the file being opened isn't a directory. That result includes the file mode which tells us if it is a regular file. There's a number of attributes from the stat which are stashed one off currently, move to stashing the whole object rather than just individual members. The stat object is reasonably large and currently the `stat_result.st_size` member cannot be modified from Python, which is needed by the `_pyio` implementation, so make the whole stat object optional. In the `_io` implementation this makes handling a stat failure simpler. At present there is no explicit user call to clear it, but if one is needed (ex. a program which has a lot of open FileIO objects and the memory becomes a problem) it would be straightforward to add. Ideally would be able to automatically clear (the values are generally used during I/O object initialization and not after. After a `write` they are no longer useful in current cases). It is fairly common pattern to scan a directory, look at the `stat` results (ex. is this file changed), and then open/read the file. In this PR I didn't update open's API to allow passing in a stat result to use, but that could be beneficial for some cases (ex. `importlib`). With this change on my Linux machine reading a small plain text file is down to 6 system calls. ``` openat(AT_FDCWD, "read_one.py", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=87, ...}) = 0 lseek(3, 0, SEEK_CUR) = 0 read(3, "from pathlib import Path\n\npath ="..., 88) = 87 read(3, "", 1) = 0 close(3) = 0 ```

For POSIX, TTYs are never regular files, so if the interpreter knows the file is regular it doesn't need to do an additional system call to check if the file is a tty. The `open()` API requires a `stat` call at present in order to make sure the file being opened isn't a directory. That result includes the file mode which tells us if it is a regular file. There's a number of attributes from the stat which are stashed one off currently, move to stashing the whole object rather than just individual members. The stat object is reasonably large and currently the `stat_result.st_size` member cannot be modified from Python, which is needed by the `_pyio` implementation, so make the whole stat object optional. In the `_io` implementation this makes handling a stat failure simpler. At present there is no explicit user call to clear it, but if one is needed (ex. a program which has a lot of open FileIO objects and the memory becomes a problem) it would be straightforward to add. Ideally would be able to automatically clear (the values are generally used during I/O object initialization and not after. After a `write` they are no longer useful in current cases). It is fairly common pattern to scan a directory, look at the `stat` results (ex. is this file changed), and then open/read the file. In this PR I didn't update open's API to allow passing in a stat result to use, but that could be beneficial for some cases (ex. `importlib`). With this change on my Linux machine reading a small plain text file is down to 6 system calls. ```python openat(AT_FDCWD, "read_one.py", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=87, ...}) = 0 lseek(3, 0, SEEK_CUR) = 0 read(3, "from pathlib import Path\n\npath ="..., 88) = 87 read(3, "", 1) = 0 close(3) = 0 ```

For POSIX, TTYs are never regular files, so if the interpreter knows the file is regular it doesn't need to do an additional system call to check if the file is a TTY. The `open()` Python builtin requires a `stat` call at present in order to ensure the file being opened isn't a directory. That result includes the file mode which tells us if it is a regular file. There are a number of attributes from the stat which are stashed one off currently, move to stashing the whole object rather than just individual members. The stat object is reasonably large and currently the `stat_result.st_size` member cannot be modified from Python, which is needed by the `_pyio` implementation, so make the whole stat object optional. In the `_io` implementation this makes handling a stat failure simpler. At present there is no explicit user call to clear it, but if one is needed (ex. a program which has a lot of open FileIO objects and the memory becomes a problem) it would be straightforward to add. Ideally would be able to automatically clear (the values are generally used during I/O object initialization and not after. After a `write` they are no longer useful in current cases). It is fairly common pattern to scan a directory, look at the `stat` results (ex. is this file changed), and then open/read the file. In this PR I didn't update open's API to allow passing in a stat result to use, but that could be beneficial for some cases (ex. `importlib`). With this change on my Linux machine reading a small plain text file is down to 6 system calls. ```python openat(AT_FDCWD, "read_one.py", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=87, ...}) = 0 lseek(3, 0, SEEK_CUR) = 0 read(3, "from pathlib import Path\n\npath ="..., 88) = 87 read(3, "", 1) = 0 close(3) = 0 ```

cmaloney · 2024-07-11T07:29:10Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up open().read() pattern by reducing the number of system calls #120754

Speed up open().read() pattern by reducing the number of system calls #120754

cmaloney commented Jun 19, 2024 •

edited by bedevere-app bot

Loading

mdboom commented Jun 20, 2024

cmaloney commented Jun 21, 2024

cmaloney commented Jun 27, 2024 •

edited

Loading

cmaloney commented Jul 4, 2024

cmaloney commented Jul 11, 2024

Speed up open().read() pattern by reducing the number of system calls #120754

Speed up open().read() pattern by reducing the number of system calls #120754

Comments

cmaloney commented Jun 19, 2024 • edited by bedevere-app bot Loading

Feature or enhancement

Proposal:

Test Program

Current State (9 system calls)

Target State (7 5 system calls)

Questions rattling around in my head around this

Has this already been discussed elsewhere?

Links to previous discussion of this feature:

Linked PRs

mdboom commented Jun 20, 2024

cmaloney commented Jun 21, 2024

cmaloney commented Jun 27, 2024 • edited Loading

Summary

Details

cmaloney commented Jul 4, 2024

cmaloney commented Jul 11, 2024

cmaloney commented Jun 19, 2024 •

edited by bedevere-app bot

Loading

cmaloney commented Jun 27, 2024 •

edited

Loading