Getting started with pyfileindex#

pyfileindex keeps a pandas DataFrame in sync with a directory tree: new files and directories, modified files, and deletions are all reflected after calling update(). The index is plain tabular data, so you can filter, group, or join it like any other DataFrame.
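To make "like any other DataFrame" concrete, here is a minimal sketch that filters, groups, and sorts a table with the same six columns as the index. The DataFrame is written by hand as a stand-in for pfi.dataframe, and its paths and mtime values are made up for illustration.

```python
import pandas as pd

# Hand-written stand-in for pfi.dataframe; paths and mtimes are made up.
df = pd.DataFrame(
    {
        "basename": ["run_a", "out.txt", "log.txt", "run_b", "out.txt"],
        "path": [
            "/data/run_a",
            "/data/run_a/out.txt",
            "/data/run_a/log.txt",
            "/data/run_b",
            "/data/run_b/out.txt",
        ],
        "dirname": ["/data", "/data/run_a", "/data/run_a", "/data", "/data/run_b"],
        "is_directory": [True, False, False, True, False],
        "mtime": [1.70e9, 1.71e9, 1.72e9, 1.70e9, 1.73e9],
        "nlink": [2, 1, 1, 2, 1],
    }
)

# Filter: keep files only.
files = df[~df["is_directory"]]

# Group: number of files per parent directory.
files_per_dir = files.groupby("dirname").size()

# Sort: most recently modified file first.
newest = files.sort_values("mtime", ascending=False).iloc[0]

print(files_per_dir.to_dict())  # {'/data/run_a': 2, '/data/run_b': 1}
print(newest["path"])           # /data/run_b/out.txt
```

The same expressions should work unchanged on a real index, since only the column names are assumed.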

Install it with:

pip install pyfileindex

or

conda install -c conda-forge pyfileindex

This notebook covers:

  1. Basic usage in the default polling mode (watch=False)

  2. The same operations in watch mode (watch=True), and how it differs

  3. Patterns for using pyfileindex inside your own project

Setup#

Two helpers are used throughout this notebook: touch() creates a file or updates its timestamps, like the Unix command of the same name, and filter_function() restricts the index to files whose names contain .txt.

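import os

from pyfileindex import PyFileIndex


def touch(fname, times=None):
    # Create the file if it is missing, then set (atime, mtime); None means "now".
    with open(fname, "a"):
        os.utime(fname, times)


def filter_function(file_name):
    # Substring match: keeps any name containing ".txt", not only names ending in it.
    return ".txt" in file_name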

1. Polling mode (watch=False, the default)#

By default PyFileIndex does not run anything in the background. Every call to update() rescans the directory tree, compares it against the previous index, and updates the DataFrame. This has no cost between calls, but the cost of update() itself grows with the size of the tree.
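The rescan-and-compare idea can be sketched with the standard library alone. This is an illustration of polling in general, not pyfileindex's own implementation: the helpers snapshot() and diff() are hypothetical names, and the sketch walks the whole tree on every call, which is exactly why the cost grows with the size of the tree.

```python
import os
import tempfile


def snapshot(root):
    """Map every path below root to its modification time."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            state[full] = os.stat(full).st_mtime
    return state


def diff(old, new):
    """Return (added, changed, removed) paths between two snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, changed, removed


with tempfile.TemporaryDirectory() as root:
    before = snapshot(root)  # first "update": empty tree
    os.makedirs(os.path.join(root, "bla"))
    open(os.path.join(root, "bla", "test.txt"), "w").close()
    after = snapshot(root)  # second "update": full rescan
    added, changed, removed = diff(before, after)
    added_rel = [os.path.relpath(p, root) for p in added]

print(added_rel, changed, removed)
```

Nothing runs between the two snapshot() calls, which mirrors the "no cost between calls" property described above.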

Initialise PyFileIndex#

pfi = PyFileIndex(path=".", filter_function=filter_function, debug=True)
pfi
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2

pfi.dataframe is a pandas DataFrame with one row per file or directory. See the README for the full column reference (basename, path, dirname, is_directory, mtime, nlink).

Update PyFileIndex#

pfi.update()
pfi
Changes:  [] [] <StringArray>
[]
Length: 0, dtype: str
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2

New directory#

A new row appears with is_directory=True.

os.makedirs("bla")
pfi.update()
pfi
Changes:  <StringArray>
['/home/jovyan/notebooks/bla']
Length: 1, dtype: str <StringArray>
['/home/jovyan/notebooks']
Length: 1, dtype: str <StringArray>
[]
Length: 0, dtype: str
basename path dirname is_directory mtime nlink
0 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
1 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 2

New sub directory#

Nested directories are picked up too – pyfileindex scans recursively.

os.makedirs("bla/bla")
pfi.update()
pfi
Changes:  <StringArray>
['/home/jovyan/notebooks/bla/bla']
Length: 1, dtype: str <StringArray>
['/home/jovyan/notebooks/bla']
Length: 1, dtype: str <StringArray>
[]
Length: 0, dtype: str
basename path dirname is_directory mtime nlink
0 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
1 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 3
3 bla /home/jovyan/notebooks/bla/bla /home/jovyan/notebooks/bla True 1.782475e+09 2

New file#

The filtered .txt file shows up as a new row.

touch("bla/bla/test.txt")
pfi.update()
pfi
Changes:  [] [] <StringArray>
[]
Length: 0, dtype: str
basename path dirname is_directory mtime nlink
0 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
1 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 3
3 bla /home/jovyan/notebooks/bla/bla /home/jovyan/notebooks/bla True 1.782475e+09 2

Another new file#

Files in different subdirectories are tracked independently.

touch("bla/test.txt")
pfi.update()
pfi
Changes:  [] [] <StringArray>
[]
Length: 0, dtype: str
basename path dirname is_directory mtime nlink
0 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
1 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 3
3 bla /home/jovyan/notebooks/bla/bla /home/jovyan/notebooks/bla True 1.782475e+09 2

Touch an existing file#

Updating the modification time changes the mtime column for that row, without adding or removing rows.

touch("bla/bla/test.txt", (1330712280, 1330712292))
pfi.update()
pfi
Changes:  [] [] <StringArray>
[]
Length: 0, dtype: str
basename path dirname is_directory mtime nlink
0 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
1 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 3
3 bla /home/jovyan/notebooks/bla/bla /home/jovyan/notebooks/bla True 1.782475e+09 2

Remove a file#

The corresponding row disappears from the index.

os.remove("bla/bla/test.txt")
pfi.update()
pfi
Changes:  [] [] <StringArray>
[]
Length: 0, dtype: str
basename path dirname is_directory mtime nlink
0 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
1 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 3
3 bla /home/jovyan/notebooks/bla/bla /home/jovyan/notebooks/bla True 1.782475e+09 2

Remove a directory#

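Removing a directory also drops its row. os.rmdir() only succeeds on an empty directory; if a non-empty tree were deleted instead (for example with shutil.rmtree()), the files inside would be removed from the index as well.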

os.rmdir("bla/bla")
pfi.update()
pfi
Changes:  <StringArray>
['/home/jovyan/notebooks/bla/test.txt']
Length: 1, dtype: str <StringArray>
['/home/jovyan/notebooks/bla']
Length: 1, dtype: str <StringArray>
['/home/jovyan/notebooks/bla/bla']
Length: 1, dtype: str
basename path dirname is_directory mtime nlink
0 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
1 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 2
3 test.txt /home/jovyan/notebooks/bla/test.txt /home/jovyan/notebooks/bla False 1.782475e+09 1

Clean up#

os.remove("bla/test.txt")
os.rmdir("bla")
pfi.update()
pfi
Changes:  [] [] <StringArray>
['/home/jovyan/notebooks/bla', '/home/jovyan/notebooks/bla/test.txt']
Length: 2, dtype: str
basename path dirname is_directory mtime nlink
0 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
1 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1

2. Watch mode (watch=True)#

With watch=True, PyFileIndex starts a background thread (using watchfiles) that listens for file system events as they happen, instead of rescanning the tree on every update() call. This trades a small amount of background resource usage for much cheaper update() calls on large trees, since update() now just drains whatever change events have already been collected.

Three practical consequences:

  • A change made immediately before calling update() may not have reached the background watcher yet. update() accepts a timeout argument (default 0.1s) to wait briefly for such pending changes before giving up and returning whatever is available.

  • The background thread needs to be stopped explicitly with close(), or by using PyFileIndex as a context manager, once you’re done with it. Forgetting to do so leaks a thread for as long as the process runs.

  • On network filesystems such as NFS, Lustre, or GPFS – common on HPC clusters – changes made by a different node are often not delivered through these OS-level notifications at all. If you are monitoring simulation output written by jobs running on other compute nodes, prefer the default polling mode (watch=False) instead.
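The first two points can be illustrated with a small standard-library sketch. It is not pyfileindex's implementation: EventCollector and delayed_events() are hypothetical, and the "file system" is simulated by a generator that delivers one event after a short delay. What it shows is why a timeout on update() helps, and why close() has to stop and join the thread.

```python
import queue
import threading
import time


class EventCollector:
    """Hypothetical watcher stand-in: a thread fills a queue, update() drains it."""

    def __init__(self):
        self._events = queue.Queue()
        self._stop = threading.Event()
        self._thread = None

    def start(self, source):
        # `source` is any iterable of change events; here it simulates the OS.
        def run():
            for event in source:
                if self._stop.is_set():
                    break
                self._events.put(event)

        self._thread = threading.Thread(target=run, daemon=True)
        self._thread.start()

    def update(self, timeout=0.1):
        """Wait up to `timeout` seconds for a first event, then drain the rest."""
        drained = []
        try:
            drained.append(self._events.get(timeout=timeout))
            while True:
                drained.append(self._events.get_nowait())
        except queue.Empty:
            pass
        return drained

    def close(self):
        # Signal the thread to stop and wait until it has finished.
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


def delayed_events():
    time.sleep(0.2)  # the event arrives slightly after the change was made
    yield ("added", "bla/test.txt")


collector = EventCollector()
collector.start(delayed_events())
too_early = collector.update(timeout=0.0)  # event has not arrived yet
waited = collector.update(timeout=1.0)     # waits until the event shows up
collector.close()
print(too_early, waited)
```

With a zero timeout the first call returns before the event has been delivered; the second call blocks just long enough to receive it.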

Initialise PyFileIndex#

pfi = PyFileIndex(path=".", filter_function=filter_function, watch=True, debug=True)
pfi
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2

Update PyFileIndex#

pfi.update()
pfi
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2

New directory#

The background watcher records the event when the directory is created; update() only drains the pending change, so no rescan is needed.

os.makedirs("bla")
pfi.update()
pfi
Changes:  ['/home/jovyan/notebooks/bla'] []
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 2

New sub directory#

Nested directories are reported the same way, via the background watcher.

os.makedirs("bla/bla")
pfi.update()
pfi
Changes:  ['/home/jovyan/notebooks/bla/bla'] []
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 2
3 bla /home/jovyan/notebooks/bla/bla /home/jovyan/notebooks/bla True 1.782475e+09 2

New file#

The filtered .txt file shows up as a new row once update() drains the event.

touch("bla/bla/test.txt")
pfi.update()
pfi
Changes:  ['/home/jovyan/notebooks/bla/bla/test.txt'] []
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 2
3 bla /home/jovyan/notebooks/bla/bla /home/jovyan/notebooks/bla True 1.782475e+09 2
4 test.txt /home/jovyan/notebooks/bla/bla/test.txt /home/jovyan/notebooks/bla/bla False 1.782475e+09 1

Another new file#

Files in different subdirectories are tracked independently.

touch("bla/test.txt")
pfi.update()
pfi
Changes:  ['/home/jovyan/notebooks/bla/test.txt'] []
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 2
3 bla /home/jovyan/notebooks/bla/bla /home/jovyan/notebooks/bla True 1.782475e+09 2
4 test.txt /home/jovyan/notebooks/bla/bla/test.txt /home/jovyan/notebooks/bla/bla False 1.782475e+09 1
5 test.txt /home/jovyan/notebooks/bla/test.txt /home/jovyan/notebooks/bla False 1.782475e+09 1

Touch an existing file#

The watcher reports a modification event, which updates the mtime column for that row.

touch("bla/bla/test.txt", (1330712280, 1330712292))
pfi.update()
pfi
Changes:  ['/home/jovyan/notebooks/bla/bla/test.txt'] []
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 2
3 bla /home/jovyan/notebooks/bla/bla /home/jovyan/notebooks/bla True 1.782475e+09 2
4 test.txt /home/jovyan/notebooks/bla/test.txt /home/jovyan/notebooks/bla False 1.782475e+09 1
5 test.txt /home/jovyan/notebooks/bla/bla/test.txt /home/jovyan/notebooks/bla/bla False 1.330712e+09 1
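The mtime of 1.330712e+09 in the last row is the second element of the tuple passed to touch(): os.utime() expects (atime, mtime) in seconds since the epoch. A minimal standard-library check, using a temporary file rather than the notebook's directory:

```python
import os
import tempfile
from datetime import datetime, timezone

with tempfile.TemporaryDirectory() as root:
    fname = os.path.join(root, "test.txt")
    open(fname, "w").close()
    # os.utime takes (atime, mtime) in seconds since the epoch.
    os.utime(fname, (1330712280, 1330712292))
    mtime = os.stat(fname).st_mtime

print(mtime)
# Human-readable form of the same timestamp, in UTC.
print(datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat())
```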

Remove a file#

A deletion event removes the corresponding row from the index.

os.remove("bla/bla/test.txt")
pfi.update()
pfi
Changes:  [] ['/home/jovyan/notebooks/bla/bla/test.txt']
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 2
3 bla /home/jovyan/notebooks/bla/bla /home/jovyan/notebooks/bla True 1.782475e+09 2
4 test.txt /home/jovyan/notebooks/bla/test.txt /home/jovyan/notebooks/bla False 1.782475e+09 1

Remove a directory#

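Deleting a directory removes its row. os.rmdir() only succeeds on an empty directory; if a non-empty tree were deleted instead, the files inside would be removed from the index as well.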

os.rmdir("bla/bla")
pfi.update()
pfi
Changes:  [] ['/home/jovyan/notebooks/bla/bla']
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
2 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 2
4 test.txt /home/jovyan/notebooks/bla/test.txt /home/jovyan/notebooks/bla False 1.782475e+09 1

Clean up#

os.remove("bla/test.txt")
os.rmdir("bla")
pfi.update()
pfi
Changes:  [] ['/home/jovyan/notebooks/bla', '/home/jovyan/notebooks/bla/test.txt']
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2

Stop the background watcher#

If a PyFileIndex was created with watch=True, call close() to stop its background thread once you no longer need live updates.

pfi.close()

3. Using pyfileindex in your own project#

A few patterns that are useful once you embed pyfileindex in a larger application rather than calling it interactively.

Scoping to a subdirectory with open()#

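open() returns a new PyFileIndex restricted to a subdirectory, reusing the parent’s already-scanned data instead of rescanning from scratch. Note that bla and bla/test.txt are created after pfi was initialised and no update() is called before open(), which is presumably why the sub-index shown below is empty.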

pfi = PyFileIndex(path=".", filter_function=filter_function)
os.makedirs("bla", exist_ok=True)
touch("bla/test.txt")
sub_pfi = pfi.open(path="bla")
sub_pfi
basename path dirname is_directory mtime nlink
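Restricting already-scanned data to a subdirectory amounts to a row selection on the path column. The sketch below shows one way to do that with pandas; it is an assumption about the general approach, not a copy of what open() does internally, and the parent table is made up.

```python
import pandas as pd

# Made-up parent index; only the "path" column matters for this sketch.
parent = pd.DataFrame(
    {
        "path": [
            "/proj",
            "/proj/bla",
            "/proj/bla/test.txt",
            "/proj/blabla",
            "/proj/other.txt",
        ],
        "is_directory": [True, True, False, True, False],
    }
)

sub_root = "/proj/bla"
# Match the directory itself or anything below it. The trailing separator
# keeps the sibling "/proj/blabla" out of the result.
mask = (parent["path"] == sub_root) | parent["path"].str.startswith(sub_root + "/")
sub = parent[mask].reset_index(drop=True)

print(sub["path"].tolist())  # ['/proj/bla', '/proj/bla/test.txt']
```

Because no scan takes place, a selection like this can only return rows the parent already knows about.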

Filtering for the files your application cares about#

filter_function is called once per file (not per directory) and decides whether that file is kept in the index. Use it to limit the index to file types or naming patterns relevant to your application, which keeps the DataFrame small and update() fast.

def only_python_files(file_name):
    return file_name.endswith(".py")


py_pfi = PyFileIndex(path=".", filter_function=only_python_files)
py_pfi
basename path dirname is_directory mtime nlink
0 notebooks /home/jovyan/notebooks /home/jovyan True 1.782475e+09 1
1 bla /home/jovyan/notebooks/bla /home/jovyan/notebooks True 1.782475e+09 2
2 .ipynb_checkpoints /home/jovyan/notebooks/.ipynb_checkpoints /home/jovyan/notebooks True 1.782475e+09 2
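If an application cares about several file types, the filter can be built from a list of suffixes. make_suffix_filter() below is a hypothetical helper, not part of pyfileindex; it only assumes what this notebook already shows, namely that filter_function receives a file name and returns a boolean.

```python
def make_suffix_filter(*suffixes):
    """Build a filter accepting names that end in one of `suffixes` (case-insensitive)."""
    allowed = tuple(s.lower() for s in suffixes)

    def filter_function(file_name):
        # str.endswith accepts a tuple and returns True if any suffix matches.
        return file_name.lower().endswith(allowed)

    return filter_function


keep_outputs = make_suffix_filter(".txt", ".h5")

names = ["test.txt", "notes.txt.bak", "RESULT.H5", "script.py"]
by_suffix = [n for n in names if keep_outputs(n)]
# Substring test, as used by filter_function() in the Setup section:
by_substring = [n for n in names if ".txt" in n]

print(by_suffix)     # ['test.txt', 'RESULT.H5']
print(by_substring)  # ['test.txt', 'notes.txt.bak']
```

The comparison shows the practical difference between a suffix test and a substring test: only the latter keeps notes.txt.bak.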

Reliable cleanup with a context manager#

If you use watch=True inside a long-running application, prefer the context manager form over manually calling close() – it guarantees the background thread is stopped even if an exception is raised while the index is in use.

with PyFileIndex(path=".", filter_function=filter_function, watch=True) as live_pfi:
    live_pfi.update()
    print(len(live_pfi), "files tracked")
1 files tracked
os.remove("bla/test.txt")
os.rmdir("bla")
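Why the with form is safer can be seen in a self-contained sketch. Watcher is a hypothetical class that owns a background thread and is not pyfileindex's implementation; it only demonstrates the context manager protocol, where __exit__ runs even when the body raises.

```python
import threading


class Watcher:
    """Hypothetical resource that owns a background thread."""

    def __init__(self):
        self._stop = threading.Event()
        # The thread simply blocks until close() sets the stop event.
        self._thread = threading.Thread(target=self._stop.wait, daemon=True)
        self._thread.start()

    def close(self):
        self._stop.set()
        self._thread.join()

    def is_running(self):
        return self._thread.is_alive()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow the exception


try:
    with Watcher() as watcher:
        raise RuntimeError("something went wrong while the index was in use")
except RuntimeError:
    pass

print(watcher.is_running())  # False: __exit__ stopped the thread
```

With a manual close() call placed after the failing line, the thread would have kept running until the process exited.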

Next steps#

See the README for installation details and citation information, and the source for the full API reference.