I am currently working on a student project (machine learning) in which we get access to a company's resources. They store their data on Windows servers, but we use Linux machines to access the data. It does not seem to be possible to set up quotas: the data is stored on a Windows server, and my advisers do not have access to the machine it is stored on. The problem is that every once in a while a student accidentally uses an ENORMOUS amount of disk space, which in turn wastes an enormous amount of space in backups. For example, I trained a model for 3 days and created snapshots of the model on a regular basis, which resulted in 100 GB of disk usage. This is a problem.
Is it possible to prevent something like this?
I was thinking about a cron job that runs every 30 minutes or so for every logged-in user. The job checks the disk usage in the user's home folder (e.g. with `du -s .`) and kills all of the user's jobs if they use too much disk space. My adviser had concerns that this would cost a significant amount of CPU time.
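To make the idea concrete, here is roughly what I had in mind (a rough, untested sketch; the 100 GB limit, the log file path, and the use of `pkill` are just placeholders I picked for illustration):

```bash
#!/bin/bash
# Sketch of the per-user disk check (untested). Assumes home directories
# are the ones reported by getent and that 100 GB is the limit.

LIMIT_KB=$((100 * 1024 * 1024))   # 100 GB expressed in KiB (du -sk reports KiB)

# Check every user that currently has at least one login session.
for user in $(who | awk '{print $1}' | sort -u); do
    home=$(getent passwd "$user" | cut -d: -f6)
    [ -d "$home" ] || continue

    usage_kb=$(du -sk "$home" 2>/dev/null | cut -f1)

    if [ "$usage_kb" -gt "$LIMIT_KB" ]; then
        echo "$(date): $user uses ${usage_kb} KiB, terminating their processes" >> /var/log/disk-check.log
        # Send SIGTERM first, so training jobs get a chance to shut down cleanly.
        pkill -TERM -u "$user"
    fi
done
```

The script would then be registered as a root cron entry, e.g. `*/30 * * * * /usr/local/sbin/disk-check.sh`.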
I've just tried it, and the first execution of `du -s .` takes significantly longer than subsequent executions. Why is that the case?
Would my proposed solution work, or are there better solutions for the environment I described? (We have root access to the machines we use, but not to the machine where our home folders are stored.)