
The key with S3, I think, is using it mostly as a blob store. We put the important metadata into Postgres so we can quickly select the things that need to be updated because something they depend on is newer. That way we don't have to touch S3 at all unless we need the actual data.
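As a rough sketch of that query pattern (the artifacts/sources schema here is made up for illustration, not our actual tables):

    # Hypothetical schema: each "artifacts" row tracks one S3 object plus
    # the metadata we query on, so S3 is never touched just to ask
    # "what is stale?".
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT a.s3_key
            FROM artifacts a
            JOIN sources s ON s.id = a.source_id
            WHERE s.updated_at > a.generated_at  -- source is newer: regenerate
        """)
        stale_keys = [row[0] for row in cur.fetchall()]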

When we actually need to manipulate or generate something in Python, we download from and upload to S3 and wrap the whole thing in a tempfile.TemporaryDirectory() so the local disk gets cleaned up when we're done. If you don't do this, you eventually end up with a bunch of garbage in /tmp/ that you have to deal with.
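Roughly like this, assuming boto3; the bucket name and the process_file() step are placeholders:

    import os
    import tempfile
    import boto3

    s3 = boto3.client("s3")

    def regenerate(src_key, dst_key):
        # Everything on local disk lives inside this directory, which is
        # removed automatically when the block exits -- even on an
        # exception -- so nothing accumulates in /tmp/.
        with tempfile.TemporaryDirectory() as tmpdir:
            src = os.path.join(tmpdir, "input")
            dst = os.path.join(tmpdir, "output")
            s3.download_file("my-bucket", src_key, src)
            process_file(src, dst)  # hypothetical transformation step
            s3.upload_file(dst, "my-bucket", dst_key)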

We also keep some longer-lived disk caches. Using the metadata in the db plus an os.stat() on the local file, we can tell whether a cache entry is up to date without hitting S3 at all. Managing the cache's size is easy too: we just delete whatever os.stat() says is old, since we can always fetch it from S3 again if we need it later.
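A sketch of both halves, with hypothetical paths and a hypothetical fetch helper:

    import os
    import time

    CACHE_DIR = "/var/cache/blobs"       # illustrative location
    MAX_AGE = 30 * 24 * 3600             # evict anything untouched ~30 days

    def cached_path(s3_key, db_mtime):
        path = os.path.join(CACHE_DIR, s3_key.replace("/", "_"))
        try:
            if os.stat(path).st_mtime >= db_mtime:
                return path  # cache is current; no S3 round trip needed
        except FileNotFoundError:
            pass
        fetch_from_s3(s3_key, path)  # hypothetical download helper
        return path

    def evict_old():
        # Old entries can simply be deleted: S3 stays the source of
        # truth, so anything evicted can be re-fetched later.
        now = time.time()
        for name in os.listdir(CACHE_DIR):
            path = os.path.join(CACHE_DIR, name)
            if now - os.stat(path).st_mtime > MAX_AGE:
                os.remove(path)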
