animist
Aug 28, 2018
having spent time attempting to run jobs on shared GPUs that don't have virtual memory, i love returning to the land of overcommit

Cybernetic Vermin posted:

but even there i think overcommit is good practice. it is a really convenient feature and a great optimization, and doing anything more than trivial attempts to recover from oom becomes incredibly messy very quickly. not least because software would almost necessarily allocate defensively (e.g. allocate all the memory you could need at the start of a transaction rather than risk having to deal with an oom partway through)

tensorflow does this: by default it reserves nearly all available GPU memory at process start. this means you're SOL if you're trying to run a job on a shared node. wanna run some intensive physical simulations on this cluster? too bad, somebody is doing Machine Learning and that is more important than whatever you're trying to do
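
for what it's worth, the grab-everything behavior is a default that can be switched off. a minimal sketch, assuming a TensorFlow 2.x install: the helper name `enable_gpu_memory_growth` is made up for illustration, while the `TF_FORCE_GPU_ALLOW_GROWTH` variable and `tf.config.experimental.set_memory_growth` come from TensorFlow's GPU guide. it has to run before anything touches the GPU, and it only helps if the person hogging the node bothers to use it

```python
import os


def enable_gpu_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it up front.

    Hypothetical helper for illustration. Returns the number of GPUs configured
    (0 if TensorFlow is not installed or no GPU is visible).
    """
    # Read by TensorFlow when it initializes the GPU, so set it first.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

    try:
        import tensorflow as tf
    except ImportError:
        # No TensorFlow here; the environment variable is still set for
        # any child process that does have it.
        return 0

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before the GPU has been initialized,
        # otherwise TensorFlow raises a RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    print("GPUs set to grow on demand:", enable_gpu_memory_growth())
```

with growth enabled the process starts small and allocates more as the job needs it, which is the closest thing to overcommit-style behavior you get here. it does not hand memory back once it has been claimed, so a greedy job can still starve a neighbor later on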

of course if you try to reserve more memory than is available on a GPU, the allocation fails outright and your CUDA program will usually die on the spot (often with a segfault if nobody checked the error). that's also annoying

overcommit is fine
