
A great idea! I think tags are definitely a better way to organize most personal data than trees.

Also I like that they describe what data they actually change on your computer right on the homepage: "TMSU does not alter your files in any way: they remain unchanged on disk, or on the network, wherever you put them. TMSU maintains its own database and you simply gain an additional view, which you can mount, based upon the tags you set up."

Unfortunately building on a foundation of sand (meaning not TMSU's code, but Unix filesystems) has downsides:

https://github.com/oniony/TMSU/wiki/FAQ#why-does-tmsu-not-de...

" Why does TMSU not detect file moves and renames?

To detect file moves/renames would require a daemon process watching the file system for changes and support from the file system for these events. As some file systems cannot provide these events (e.g. remote file systems) a universal solution cannot be offered. Such a function may be added later for those file systems that do provide file move/modification events but adding support for this to TMSU is not a priority at this time.

The current solution is to periodically use the repair command which will detect moved/renamed files and also update fingerprints for modified files. (The limitation of this is that files that are both moved/renamed and modified cannot be detected.) "

Ouch.



Ding ding ding! This is the monkey in the wrench, as it were.

Tagging is a really useful idea, but it is also a naming thing, and as such it either lives in the naming infrastructure (aka dirents) or it rots over time. A simple example I used to use in the 'object naming' [1] days was: imagine that instead of house numbers on the street you wrote down last names. That works fine until somebody moves, and now not only have you shown up at the wrong house, you don't even have a chance of knowing what the correct house is. [2]

Microsoft's Longhorn project was way out there but took a swing at the actual problem: just make the file system an actual relational database. Then your home directory is simply 'select * from files where (owner = chuck);' It really does solve the problem at a more fundamental level, using naming by attribute rather than mapping. I got to observe that effort from the outside (I was at NetApp at the time), but I believe it died due to really horrible performance issues.
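To make the "naming by attribute" idea concrete, here's a toy sketch (not Longhorn, just an illustration) using sqlite3, with a made-up `files` table: once every file is a row, a "directory" is just a query.

```python
import sqlite3

# Each file is a row; ownership and tags are columns, not paths.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, owner TEXT, tags TEXT)"
)
con.executemany(
    "INSERT INTO files (name, owner, tags) VALUES (?, ?, ?)",
    [("notes.txt", "chuck", "text"),
     ("song.mp3", "chuck", "jazz mp3"),
     ("report.pdf", "alice", "work")],
)

# Chuck's "home directory" is simply a select by attribute:
home = [row[0] for row in
        con.execute("SELECT name FROM files WHERE owner = ? ORDER BY name",
                    ("chuck",))]
print(home)  # → ['notes.txt', 'song.mp3']
```

Renames and moves stop being a problem in this model, because nothing refers to a file by its path in the first place.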

I find it pretty awesome that people can lose files. Back when a "big" hard drive was 100MB it really wasn't all that hard to just look through all the files on it, but when it's a couple or three terabytes, all bets are off!

[1] Object File systems were all the rage in the early 2000's, files themselves were object ids and the naming was a database that connected object ids to user recognizable names. -- https://en.wikipedia.org/wiki/Object_storage

[2] The typical solution is to add "tombstones" or redirects at the previous address. That then is a layer of additional metadata to maintain, and sometimes the file doesn't move, it just changes value. (Trivial example: you have a file 'my-favorite-song.mp3' which is tagged 'jazz mp3', and then you discover techno and make something from Tiesto your favorite song; while the name and type are still valid, the tag 'jazz' is now invalid.)


Hmm, seems like they could have gone the other way: thrown everything into a DB, and then written a FUSE plugin to access it all through traditional file system mechanics. That would have allowed for gating direct access such that moves and renames could be dealt with accordingly. Of course, there are other problems with that approach, but probably not as many as you might think (the file system is a database, so you're really just choosing a back-end that is less likely to be directly accessed).


    they could have gone the other way, throw everything into
    a DB, and then wrote a fuse plugin to access it all
    through traditional file system
This is the Camlistore strategy!

    Of course, there are other problems with that approach
Could you elaborate more on these? I've never worked with FUSE.


The other problems I was alluding to weren't really with FUSE, but one that does pertain to FUSE is speed, since FUSE imposes overhead through a daemon running in user space, and the associated context switches. From just looking into it again, this may have been mitigated to some degree by the FUSE performance enhancements made in 2012.

Specifically, I was referring to the different off-the-shelf database systems which could be used. Each will have its own benefits and drawbacks for storing large chunks of data per record. Benefits might include (relatively) easy sharding or replication. Drawbacks might include not being space-efficient for removed files, not being as resilient to corruption from crashes (or corruption affecting more than the files in use), or overly aggressive use of memory to function efficiently.

If a custom database was developed, you could tailor to your exact needs, but then you have much more work to do, and a period of immaturity.

Off the top of my head, if I were designing a general purpose system for tagging files where people were expected to use it as a regular file system and some overhead from FUSE was acceptable, I think I would leverage the file system but in a different way. I would set up a specialized directory for the files themselves, store them hashed within it, have a BerkeleyDB database relate filename to hash and tags, and use FUSE to do direct file access. But that's my 5 minute assessment, so I reserve the right to change it completely given someone pointing out the obvious problems. :)
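A minimal sketch of that design, with a plain dict standing in for the BerkeleyDB index (names and functions here are made up for illustration): blobs live under their content hash, and the index maps logical name to (hash, tags), so a rename is just an index update.

```python
import hashlib
import os
import tempfile

# Content-addressed blob directory + name->(hash, tags) index.
def store(blobdir, index, name, data, tags):
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(blobdir, digest)
    if not os.path.exists(path):      # identical content is stored once
        with open(path, "wb") as f:
            f.write(data)
    index[name] = (digest, tags)
    return digest

def read(blobdir, index, name):
    digest, _tags = index[name]
    with open(os.path.join(blobdir, digest), "rb") as f:
        return f.read()

blobdir = tempfile.mkdtemp()
index = {}
store(blobdir, index, "my-favorite-song.mp3", b"mp3 bytes", ["jazz", "mp3"])

# Renaming touches only the index; the blob on disk never moves:
index["best-song.mp3"] = index.pop("my-favorite-song.mp3")
print(read(blobdir, index, "best-song.mp3"))  # → b'mp3 bytes'
```

A FUSE layer would then expose `read`/`store` as ordinary file operations, which is where the gating of moves and renames would happen.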


Couldn't they just create a hardlink in a private, hidden directory that they control, and then symlink to that?

Then, it's OK if the original file gets renamed or moved, as long as it stays on the same FS. You still have your hardlink, and so your symlink still works.
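A quick demonstration of the trick (paths and names here are invented for the example): the symlink points at a hardlink we control, so renaming the original file doesn't break the chain.

```python
import os
import tempfile

root = tempfile.mkdtemp()
original = os.path.join(root, "report.txt")
with open(original, "w") as f:
    f.write("hello")

# Hardlink into a hidden directory we control (same inode, same FS).
hidden = os.path.join(root, ".tagstore")
os.mkdir(hidden)
anchor = os.path.join(hidden, "report.txt")
os.link(original, anchor)

# Symlink to the hardlink, not to the original path.
tagview = os.path.join(root, "tagged-report.txt")
os.symlink(anchor, tagview)

# The user renames the original; the anchor still holds the inode.
os.rename(original, os.path.join(root, "renamed.txt"))

with open(tagview) as f:
    content = f.read()
print(content)  # → hello
```

The catch, as noted below, is deletion: unlinking the original path no longer frees the data while the hidden hardlink exists.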


What if you really want to delete the file, though (passwords, customer data, incriminating evidence)? Then you have to remember to delete it from this system too!


I've been considering writing something similar myself, and my plan had been to hardlink the files by their hash into my blobstore. It won't fix the move/modification case, but it would solve the problem for simple moves. But I guess they're trying to deal with remote filesystems, too, and I was not targeting those.
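A sketch of that hardlink-by-hash plan, under my own assumptions about the layout: each file is hardlinked into a blobstore under its content hash, so a simple move leaves the hash-named link intact, and a repair pass can re-associate the moved file by recomputing its hash.

```python
import hashlib
import os
import tempfile

def blob_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

root = tempfile.mkdtemp()
blobstore = os.path.join(root, ".blobs")
os.mkdir(blobstore)

original = os.path.join(root, "photo.jpg")
with open(original, "wb") as f:
    f.write(b"jpeg bytes")

# Hardlink into the blobstore under the content hash.
os.link(original, os.path.join(blobstore, blob_hash(original)))

# The user moves the file; the blobstore link is unaffected,
# and the moved file is identifiable by recomputing its hash.
moved = os.path.join(root, "vacation.jpg")
os.rename(original, moved)
found = blob_hash(moved) in os.listdir(blobstore)
print(found)  # → True
```

As noted, this breaks down when a file is moved *and* modified, since the new content hashes to a different name.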




