I have a folder into which new files are constantly being added. I have a Python script that uses os.listdir() to find these files and then performs analysis on them automatically. However, the files are quite large, so they show up in os.listdir() before they've actually been completely written/copied. Is there some way to distinguish which files are not in the process of being moved? Comparing sizes with os.path.getsize() doesn't seem to work.

Raspbian Buster on Pi4 with Python 3.7.3. I am a noob to programming and linux.

Thanks!

  • Does this answer your question? Is a move operation in Unix atomic? See also, on the Unix StackExchange: Is mv atomic operation between two file systems? Commented Aug 4, 2020 at 3:59
  • A workaround would be to have the process that is creating the file do so in a temporary location on the same filesystem as the intended location, and only when it's done call rename to atomically move it to the final location where the Python program expects it (a minimal sketch follows these comments). Commented Aug 4, 2020 at 4:04
  • Thanks for the links! I couldn't find documentation on Linux rename moving files, but I see that rename(2) moves files. Or did you mean Python's os.rename? alexwlchan.net/2019/03/atomic-cross-filesystem-moves-in-python
    – rfii
    Commented Aug 13, 2020 at 16:47
  • The standard utility mv uses rename(2) to "move" files (as documented on the man page you found); likewise, Python's os.rename should behave the same way, with the additional Python data types being supported, which the interpreter unwraps into native types before making the appropriate system call(s). Commented Aug 14, 2020 at 3:15
  • ok thanks, so linux mv = linux rename(2) = python os.rename, but does not equal python shutil.move
    – rfii
    Commented Aug 14, 2020 at 4:02
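
A minimal sketch of that workaround (the paths here are hypothetical; the staging directory and the final directory must be on the same filesystem for os.rename to be atomic):

import os
import shutil

src = '/mnt/usb/big_file.dat'                     # hypothetical source
staging = '/data/incoming/.staging/big_file.dat'  # same filesystem as final
final = '/data/incoming/big_file.dat'             # folder the script watches

shutil.copy(src, staging)   # the slow copy happens outside the watched folder
os.rename(staging, final)   # atomic; the file only appears once complete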

2 Answers


For a conceptual explanation of atomic and cross-filesystem moves, refer to Atomic cross-filesystem moves in Python (it can really save you time).

You can take any of the following approaches to deal with your problem:

-> Monitor filesystem events with pyinotify (see this usage of pyinotify):
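
A minimal sketch of that approach, assuming a hypothetical watched folder /data/incoming; the IN_CLOSE_WRITE event fires only after the writing process has closed the file:

import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # called only once the writer has closed the file
        print('ready for analysis:', event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/data/incoming', pyinotify.IN_CLOSE_WRITE)
pyinotify.Notifier(wm, Handler()).loop()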

-> Lock the file for a few seconds using flock (flock locks are advisory, so the writer must cooperate):
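
A minimal sketch using fcntl.flock; note that this only helps if the process writing the file takes the same lock (filename is assumed to hold the path being checked):

import fcntl

with open(filename, 'rb') as f:
    try:
        # non-blocking exclusive lock; raises OSError if the writer holds it
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        # safe to analyse the file here
        fcntl.flock(f, fcntl.LOCK_UN)
    except OSError:
        pass  # still being written; try again later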

-> Use lsof to check which processes are currently using a particular file:

from subprocess import check_output, CalledProcessError

try:
    # lsof exits non-zero when no process has the file open,
    # so check_output raises CalledProcessError in that case
    check_output(['lsof', filename])   # filename: path of the file to check
except CalledProcessError:
    # no process is using the file; run your processing here
    pass

Just write your processing code in the except block and you are good to go.

-> Run a daemon that monitors the parent folder for any changes, using, e.g., the watchdog library (see this watchdog implementation):
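
A minimal sketch, assuming finished files are renamed into the hypothetical watched folder /data/incoming (so the move event signals completion):

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class NewFileHandler(FileSystemEventHandler):
    def on_moved(self, event):
        # fires when a finished file is renamed into the watched folder
        print('ready for analysis:', event.dest_path)

observer = Observer()
observer.schedule(NewFileHandler(), '/data/incoming')
observer.start()
observer.join()  # block forever; stop with observer.stop()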

-> Check whether the file is being used by another process by looping through the PIDs in /proc (assuming you have control over the program that is continuously adding the new files, so you can identify its id):
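
A minimal sketch of the /proc approach (it needs permission to read other processes' fd directories, so it may have to run as root):

import os

def file_in_use(path):
    target = os.path.realpath(path)
    for pid in filter(str.isdigit, os.listdir('/proc')):
        fd_dir = os.path.join('/proc', pid, 'fd')
        try:
            for fd in os.listdir(fd_dir):
                if os.path.realpath(os.path.join(fd_dir, fd)) == target:
                    return True  # some process still has the file open
        except (PermissionError, FileNotFoundError):
            continue  # process exited, or we may not inspect it
    return False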

-> Check whether a file has a handle on it using psutil:
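
A minimal sketch with psutil; scanning every process can be slow and needs sufficient permissions, so treat this as a sketch rather than a drop-in solution:

import psutil

def has_open_handle(path):
    for proc in psutil.process_iter(['open_files']):
        for f in proc.info['open_files'] or []:
            if f.path == path:
                return True  # some process still holds a handle
    return False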


In programming, this is called concurrency: computations happen simultaneously and the order of execution is not guaranteed. In your case, one program begins to read a file before another program has finished writing it. This particular problem is called the readers-writers problem, and it is actually fairly common in embedded systems.

There are a number of solutions to this problem, but the simplest and most common is a lock. The simplest kind of lock protects a resource from being accessed by more than one program at the same time; in effect, it makes sure that operations on the resource happen atomically.

A lock is implemented as an object that can be acquired or released (usually via methods of that object). A program tries to acquire the lock, waiting for as long as some other program holds it. Once acquired, the lock grants its holder the right to execute some block of code, after which the lock is released. Note that what I am calling a program here is typically called a thread.

In Python, you can use the threading.Lock object. First, create the lock:

from threading import Lock
file_lock = Lock()

Then, in each thread, acquire the lock before proceeding. If you pass blocking=True, the call suspends the thread until the lock is acquired, so no polling loop is required.

file_lock.acquire(blocking=True)
# atomic operation on the file goes here
file_lock.release()

Note that the same Lock object must be shared by every thread. Acquire the lock before reading from or writing to the file, and release it afterwards; that ensures those operations no longer happen at the same time.
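
Putting it together, a minimal sketch with a hypothetical writer thread and analyzer thread sharing one lock; the with statement acquires and releases the lock automatically:

import threading

file_lock = threading.Lock()

def writer():
    with file_lock:                        # blocks until the lock is free
        with open('data.bin', 'wb') as f:  # hypothetical file name
            f.write(b'...')

def analyzer():
    with file_lock:
        with open('data.bin', 'rb') as f:
            data = f.read()                # never sees a half-written file

threading.Thread(target=writer).start()
threading.Thread(target=analyzer).start()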

  • I appreciate all the explanation. It helps a lot. I am having a bit of trouble with the documentation. (A) Do I make a new lock for each file? (B) Am I understanding correctly that this solution requires I have two threads in one Python program (a file mover thread and a file analyzer thread) instead of two separate programs? (C) Also, can this work with Python's multiprocess subprocess, because I'd like to use its Queue feature? Thanks again!
    – rfii
    Commented Aug 14, 2020 at 5:07
