Getting started with pyfileindex#
pyfileindex keeps a pandas DataFrame in sync with a directory tree: new files and directories, modified files, and deletions are all reflected after calling update(). The index is plain tabular data, so you can filter, group, or join it like any other DataFrame.
Install it with:
pip install pyfileindex
or
conda install -c conda-forge pyfileindex
This notebook covers:
- Basic usage in the default polling mode (watch=False)
- The same operations in watch mode (watch=True), and how it differs
- Patterns for using pyfileindex inside your own project
Setup#
A couple of helpers used throughout this notebook: touch() to create/update a file like the Unix command, and filter_function() to restrict the index to .txt files.
import os
from pyfileindex import PyFileIndex

def touch(fname, times=None):
    with open(fname, "a"):
        os.utime(fname, times)

def filter_function(file_name):
    return ".txt" in file_name
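The substring test in filter_function() also accepts names that merely contain ".txt" somewhere, such as a backup file. If that matters for your data, a stricter check on the final extension is one option. The helper name has_txt_suffix below is hypothetical and not part of pyfileindex; this is a minimal sketch using only the standard library.

```python
import os

def has_txt_suffix(file_name):
    """Keep only names whose final extension is exactly .txt (hypothetical helper)."""
    return os.path.splitext(file_name)[1] == ".txt"

# The substring test matches any name containing ".txt"; the suffix test does not.
print(".txt" in "notes.txt.bak")        # True
print(has_txt_suffix("notes.txt.bak"))  # False
print(has_txt_suffix("notes.txt"))      # True
```

Either function can be passed as filter_function; the rest of this notebook keeps the simpler substring version.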
1. Polling mode (watch=False, the default)#
By default PyFileIndex does not run anything in the background. Every call to update() rescans the directory tree, compares it against the previous index, and updates the DataFrame. This has no cost between calls, but the cost of update() itself grows with the size of the tree.
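To make the "rescan and compare" idea concrete, here is a conceptual sketch using only the standard library. It is not pyfileindex's implementation: the functions snapshot() and diff() are hypothetical names chosen for illustration, and the real library stores its state in a DataFrame rather than a dictionary.

```python
import os
import tempfile

def snapshot(root):
    """Map every path below root to its modification time."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            state[full] = os.stat(full).st_mtime
    return state

def diff(old, new):
    """Return (added, changed, removed) paths between two snapshots."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(p for p in new.keys() & old.keys() if new[p] != old[p])
    return added, changed, removed

with tempfile.TemporaryDirectory() as root:
    before = snapshot(root)                       # state at the previous "update"
    os.makedirs(os.path.join(root, "bla"))
    open(os.path.join(root, "bla", "test.txt"), "a").close()
    added, changed, removed = diff(before, snapshot(root))
    print([os.path.relpath(p, root) for p in added])
```

Every call to snapshot() walks the whole tree, which is why the cost of a polling update grows with the number of entries even when nothing has changed.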
Initialise PyFileIndex#
pfi = PyFileIndex(path=".", filter_function=filter_function, debug=True)
pfi
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
pfi.dataframe is a pandas DataFrame with one row per file or directory. See the README for the full column reference (basename, path, dirname, is_directory, mtime, nlink).
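Because the index is an ordinary DataFrame, the usual pandas operations apply. The sketch below uses a small hand-built frame with the same column names as pfi.dataframe, so that it runs without a directory to scan; the paths and values are illustrative only.

```python
import pandas as pd

# Hand-built stand-in with the same columns as pfi.dataframe (values are illustrative).
df = pd.DataFrame({
    "basename": ["bla", "test.txt", "notes.txt"],
    "path": ["/data/bla", "/data/bla/test.txt", "/data/notes.txt"],
    "dirname": ["/data", "/data/bla", "/data"],
    "is_directory": [True, False, False],
    "mtime": [1.7e9, 1.7e9, 1.6e9],
    "nlink": [2, 1, 1],
})

files = df[~df["is_directory"]]               # drop directory rows
per_dir = files.groupby("dirname").size()     # number of files per directory
newest = files.sort_values("mtime").iloc[-1]  # most recently modified file

print(per_dir.to_dict())
print(newest["path"])
```

With a real index, replace df with pfi.dataframe and keep the rest unchanged.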
Update PyFileIndex#
pfi.update()
pfi
Changes: [] [] <StringArray>
[]
Length: 0, dtype: str
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
New directory#
A new row appears with is_directory=True.
os.makedirs("bla")
pfi.update()
pfi
Changes: <StringArray>
['/home/jovyan/notebooks/bla']
Length: 1, dtype: str <StringArray>
['/home/jovyan/notebooks']
Length: 1, dtype: str <StringArray>
[]
Length: 0, dtype: str
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 1 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
New sub directory#
Nested directories are picked up too – pyfileindex scans recursively.
os.makedirs("bla/bla")
pfi.update()
pfi
Changes: <StringArray>
['/home/jovyan/notebooks/bla/bla']
Length: 1, dtype: str <StringArray>
['/home/jovyan/notebooks/bla']
Length: 1, dtype: str <StringArray>
[]
Length: 0, dtype: str
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 1 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 3 |
| 3 | bla | /home/jovyan/notebooks/bla/bla | /home/jovyan/notebooks/bla | True | 1.782475e+09 | 2 |
New file#
The filtered .txt file shows up as a new row.
touch("bla/bla/test.txt")
pfi.update()
pfi
Changes: [] [] <StringArray>
[]
Length: 0, dtype: str
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 1 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 3 |
| 3 | bla | /home/jovyan/notebooks/bla/bla | /home/jovyan/notebooks/bla | True | 1.782475e+09 | 2 |
Another new file#
Files in different subdirectories are tracked independently.
touch("bla/test.txt")
pfi.update()
pfi
Changes: [] [] <StringArray>
[]
Length: 0, dtype: str
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 1 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 3 |
| 3 | bla | /home/jovyan/notebooks/bla/bla | /home/jovyan/notebooks/bla | True | 1.782475e+09 | 2 |
Touch an existing file#
Updating the modification time changes the mtime column for that row, without adding or removing rows.
touch("bla/bla/test.txt", (1330712280, 1330712292))
pfi.update()
pfi
Changes: [] [] <StringArray>
[]
Length: 0, dtype: str
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 1 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 3 |
| 3 | bla | /home/jovyan/notebooks/bla/bla | /home/jovyan/notebooks/bla | True | 1.782475e+09 | 2 |
Remove a file#
The corresponding row disappears from the index.
os.remove("bla/bla/test.txt")
pfi.update()
pfi
Changes: [] [] <StringArray>
[]
Length: 0, dtype: str
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 1 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 3 |
| 3 | bla | /home/jovyan/notebooks/bla/bla | /home/jovyan/notebooks/bla | True | 1.782475e+09 | 2 |
Remove a directory#
Removing a directory also drops its row; any files still inside would be removed from the index as well.
os.rmdir("bla/bla")
pfi.update()
pfi
Changes: <StringArray>
['/home/jovyan/notebooks/bla/test.txt']
Length: 1, dtype: str <StringArray>
['/home/jovyan/notebooks/bla']
Length: 1, dtype: str <StringArray>
['/home/jovyan/notebooks/bla/bla']
Length: 1, dtype: str
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 1 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 3 | test.txt | /home/jovyan/notebooks/bla/test.txt | /home/jovyan/notebooks/bla | False | 1.782475e+09 | 1 |
Clean up#
os.remove("bla/test.txt")
os.rmdir("bla")
pfi.update()
pfi
Changes: [] [] <StringArray>
['/home/jovyan/notebooks/bla', '/home/jovyan/notebooks/bla/test.txt']
Length: 2, dtype: str
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 1 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
2. Watch mode (watch=True)#
With watch=True, PyFileIndex starts a background thread (using watchfiles) that listens for file system events as they happen, instead of rescanning the tree on every update() call. This trades a small amount of background resource usage for much cheaper update() calls on large trees, since update() now just drains whatever change events have already been collected.
Three practical consequences:
- A change made immediately before calling update() may not have reached the background watcher yet. update() accepts a timeout argument (default 0.1s) to wait briefly for such pending changes before giving up and returning whatever is available.
- The background thread needs to be stopped explicitly with close(), or by using PyFileIndex as a context manager, once you’re done with it. Forgetting to do so leaks a thread for as long as the process runs.
- On network filesystems such as NFS, Lustre, or GPFS – common on HPC clusters – changes made by a different node are often not delivered through these OS-level notifications at all. If you are monitoring simulation output written by jobs running on other compute nodes, prefer the default polling mode (watch=False) instead.
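The "wait briefly, then return whatever is available" behaviour of the timeout can be pictured as draining a queue that a background thread fills. The sketch below is a conceptual illustration with the standard library only; the function drain() and the event tuples are assumptions made for this example and do not reflect how pyfileindex or watchfiles are implemented internally.

```python
import queue
import time

def drain(events, timeout=0.1):
    """Collect queued events, waiting up to `timeout` seconds for the first one."""
    collected = []
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        try:
            if collected or remaining <= 0:
                # Something already arrived or time is up: take only what is ready.
                collected.append(events.get_nowait())
            else:
                # Nothing yet: block until an event arrives or the deadline passes.
                collected.append(events.get(timeout=remaining))
        except queue.Empty:
            return collected

events = queue.Queue()
events.put(("added", "bla"))
events.put(("added", "bla/test.txt"))
print(drain(events))                # both queued events, returned immediately
print(drain(events, timeout=0.05))  # nothing pending: waits briefly, then []
```

A longer timeout makes it less likely that a change made just before the call is missed, at the price of a slower call when nothing has changed.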
Initialise PyFileIndex#
pfi = PyFileIndex(path=".", filter_function=filter_function, watch=True, debug=True)
pfi
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
Update PyFileIndex#
pfi.update()
pfi
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
New directory#
The watcher reports this as soon as update() drains the pending change – no rescan needed.
os.makedirs("bla")
pfi.update()
pfi
Changes: ['/home/jovyan/notebooks/bla'] []
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
New sub directory#
Nested directories are reported the same way, via the background watcher.
os.makedirs("bla/bla")
pfi.update()
pfi
Changes: ['/home/jovyan/notebooks/bla/bla'] []
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 3 | bla | /home/jovyan/notebooks/bla/bla | /home/jovyan/notebooks/bla | True | 1.782475e+09 | 2 |
New file#
The filtered .txt file shows up as a new row once update() drains the event.
touch("bla/bla/test.txt")
pfi.update()
pfi
Changes: ['/home/jovyan/notebooks/bla/bla/test.txt'] []
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 3 | bla | /home/jovyan/notebooks/bla/bla | /home/jovyan/notebooks/bla | True | 1.782475e+09 | 2 |
| 4 | test.txt | /home/jovyan/notebooks/bla/bla/test.txt | /home/jovyan/notebooks/bla/bla | False | 1.782475e+09 | 1 |
Another new file#
Files in different subdirectories are tracked independently.
touch("bla/test.txt")
pfi.update()
pfi
Changes: ['/home/jovyan/notebooks/bla/test.txt'] []
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 3 | bla | /home/jovyan/notebooks/bla/bla | /home/jovyan/notebooks/bla | True | 1.782475e+09 | 2 |
| 4 | test.txt | /home/jovyan/notebooks/bla/bla/test.txt | /home/jovyan/notebooks/bla/bla | False | 1.782475e+09 | 1 |
| 5 | test.txt | /home/jovyan/notebooks/bla/test.txt | /home/jovyan/notebooks/bla | False | 1.782475e+09 | 1 |
Touch an existing file#
The watcher reports a modification event, which updates the mtime column for that row.
touch("bla/bla/test.txt", (1330712280, 1330712292))
pfi.update()
pfi
Changes: ['/home/jovyan/notebooks/bla/bla/test.txt'] []
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 3 | bla | /home/jovyan/notebooks/bla/bla | /home/jovyan/notebooks/bla | True | 1.782475e+09 | 2 |
| 4 | test.txt | /home/jovyan/notebooks/bla/test.txt | /home/jovyan/notebooks/bla | False | 1.782475e+09 | 1 |
| 5 | test.txt | /home/jovyan/notebooks/bla/bla/test.txt | /home/jovyan/notebooks/bla/bla | False | 1.330712e+09 | 1 |
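The mtime value 1.330712e+09 in the last row is the second element of the tuple passed to touch(), expressed in seconds since the Unix epoch. The following standard-library sketch repeats the os.utime() call on a temporary file and converts the result to a readable UTC timestamp; the temporary file is only there to keep the example self-contained.

```python
import os
import tempfile
from datetime import datetime, timezone

with tempfile.TemporaryDirectory() as root:
    fname = os.path.join(root, "test.txt")
    open(fname, "a").close()
    os.utime(fname, (1330712280, 1330712292))  # (access time, modification time)
    mtime = os.stat(fname).st_mtime

stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
print(stamp.isoformat())  # 2012-03-02T18:18:12+00:00
```

The same conversion can be applied to the whole mtime column of the index with pandas, for example via pd.to_datetime(..., unit="s").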
Remove a file#
A deletion event removes the corresponding row from the index.
os.remove("bla/bla/test.txt")
pfi.update()
pfi
Changes: [] ['/home/jovyan/notebooks/bla/bla/test.txt']
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 3 | bla | /home/jovyan/notebooks/bla/bla | /home/jovyan/notebooks/bla | True | 1.782475e+09 | 2 |
| 4 | test.txt | /home/jovyan/notebooks/bla/test.txt | /home/jovyan/notebooks/bla | False | 1.782475e+09 | 1 |
Remove a directory#
Deleting a directory removes its row; any files still inside are removed as well.
os.rmdir("bla/bla")
pfi.update()
pfi
Changes: [] ['/home/jovyan/notebooks/bla/bla']
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 2 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 4 | test.txt | /home/jovyan/notebooks/bla/test.txt | /home/jovyan/notebooks/bla | False | 1.782475e+09 | 1 |
Clean up#
os.remove("bla/test.txt")
os.rmdir("bla")
pfi.update()
pfi
Changes: [] ['/home/jovyan/notebooks/bla', '/home/jovyan/notebooks/bla/test.txt']
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
Stop the background watcher#
If a PyFileIndex was created with watch=True, call close() to stop its background thread once you no longer need live updates.
pfi.close()
3. Using pyfileindex in your own project#
A few patterns that are useful once you embed pyfileindex in a larger application rather than calling it interactively.
Scoping to a subdirectory with open()#
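open() returns a new PyFileIndex restricted to a subdirectory, reusing the parent’s already-scanned data instead of rescanning from scratch. The output below is empty, presumably because bla is created after pfi was initialised and pfi.update() is not called before open().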
pfi = PyFileIndex(path=".", filter_function=filter_function)
os.makedirs("bla", exist_ok=True)
touch("bla/test.txt")
sub_pfi = pfi.open(path="bla")
sub_pfi
| basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|
Filtering for the files your application cares about#
filter_function is called once per file (not per directory) and decides whether that file is kept in the index. Use it to limit the index to file types or naming patterns relevant to your application, which keeps the DataFrame small and update() fast.
def only_python_files(file_name):
    return file_name.endswith(".py")
py_pfi = PyFileIndex(path=".", filter_function=only_python_files)
py_pfi
| | basename | path | dirname | is_directory | mtime | nlink |
|---|---|---|---|---|---|---|
| 0 | notebooks | /home/jovyan/notebooks | /home/jovyan | True | 1.782475e+09 | 1 |
| 1 | bla | /home/jovyan/notebooks/bla | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
| 2 | .ipynb_checkpoints | /home/jovyan/notebooks/.ipynb_checkpoints | /home/jovyan/notebooks | True | 1.782475e+09 | 2 |
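For naming patterns rather than a single extension, a small factory that builds the filter from shell-style patterns can be convenient. The helper make_pattern_filter below is hypothetical and not part of pyfileindex. Since this notebook does not state whether filter_function receives a bare file name or a full path, the sketch reduces its argument to the base name first so that it behaves the same in both cases.

```python
import fnmatch
import os

def make_pattern_filter(*patterns):
    """Build a filter that keeps names matching any shell-style pattern (hypothetical helper)."""
    def keep(file_name):
        base = os.path.basename(file_name)  # works for bare names and full paths alike
        return any(fnmatch.fnmatchcase(base, p) for p in patterns)
    return keep

keep_outputs = make_pattern_filter("*.txt", "run_*.log")
print(keep_outputs("run_001.log"))    # True
print(keep_outputs("notes.txt.bak"))  # False
```

The returned function can then be passed as filter_function when creating the index.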
Reliable cleanup with a context manager#
If you use watch=True inside a long-running application, prefer the context manager form over manually calling close() – it guarantees the background thread is stopped even if an exception is raised while the index is in use.
with PyFileIndex(path=".", filter_function=filter_function, watch=True) as live_pfi:
    live_pfi.update()
    print(len(live_pfi), "files tracked")
1 files tracked
os.remove("bla/test.txt")
os.rmdir("bla")
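Why the context manager form is safer can be shown with a stand-in object that owns a background thread. The Watcher class below is hypothetical and only mimics the close()/context-manager shape described above; it is a sketch of the general pattern, not of PyFileIndex itself.

```python
import threading

class Watcher:
    """Hypothetical stand-in for an object that owns a background thread."""

    def __init__(self):
        self._stop = threading.Event()
        # The thread simply blocks until close() sets the stop event.
        self._thread = threading.Thread(target=self._stop.wait, daemon=True)
        self._thread.start()

    def close(self):
        self._stop.set()
        self._thread.join()

    def is_running(self):
        return self._thread.is_alive()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()   # runs on normal exit and when an exception propagates
        return False   # do not swallow the exception

try:
    with Watcher() as w:
        raise RuntimeError("failure while the watcher is in use")
except RuntimeError:
    pass

print(w.is_running())  # False
```

With a manual close() call placed after the failing line, the thread would have kept running; the with block stops it regardless of how the block is left.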
Next steps#
See the README for installation details and citation information, and the source for the full API reference.