I have a compressed tar file containing a large number of PDF files. I'm trying to preprocess the data for a model in the following way:
- extract tar
- render each page of each PDF, downsample it, and store it as a .jpg file
- create a .txt label file for each of the .jpg files created in step 2 (I'm training an existing model that requires this input structure)
- compress all the newly created files (.jpg and .txt) into a new tar
- track the new tar in LFS
The issue: whenever I finish steps 2 and 3, the environment basically freezes (I cannot open, close, or access the terminal), and I have to kill and restart the environment, thereby losing the newly created files, since they haven't been added to LFS yet.
I'm pretty sure it's related to how git handles large quantities of files (I could see git processes at 100% CPU from time to time, even though I hadn't committed anything yet), but I don't know how to fix it. Note that the number of new files is on the order of 10k.
Any help appreciated.