Error info (these errors can occur at the same time):
tonyyan@tonyyan-X11SPI:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:65:00.0: GPU is lost. Reboot the system to recover this GPU
327.411fps
3ms 312.613fps
3ms 309.92fps
3ms 300.209fps
2ms 342.361fps
3ms 322.467fps
3ms 316.99fps
3ms 318.749fps
3ms 321.253fps
3ms 314.281fps
3ms 312.419fps
2ms 342.166fps
3ms 312.345fps
3ms 327.62fps
178ms 5.59761fps
192ms 5.19022fps
178ms 5.59837fps
Cuda failure: 999
Unable to open 'raise.c': Unable to read file '/build/glibc-S9d2JN/glibc-2.27/sysdeps/unix/sysv/linux/raise.c' (Error: Unable to resolve non-existing file '/build/glibc-S9d2JN/glibc-2.27/sysdeps/unix/sysv/linux/raise.c').
Reason: unknown
How this occurs:
- The GPU is lost some time after boot (usually several hours), even if nothing is running on it.
- While running a GPU-dependent process, such as model training or TensorRT inference, the FPS gradually drops until the process reports 'Cuda failure: 999'.
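The slowdown in the second case could be caught before the process dies by watching the per-frame latency. The following is only a minimal sketch: the function name `find_slowdown`, the baseline window of 5 frames, and the 10x threshold are assumptions made for illustration and are not part of the original program. The sample values are the latencies from the log above.

```python
from statistics import median


def find_slowdown(latencies_ms, baseline_window=5, factor=10.0):
    """Return the index of the first sample whose latency exceeds
    `factor` times the median of the first `baseline_window` samples,
    or None if no such sample exists.

    Hypothetical helper: the window size and factor are arbitrary choices.
    """
    if len(latencies_ms) <= baseline_window:
        return None
    baseline = median(latencies_ms[:baseline_window])
    for i in range(baseline_window, len(latencies_ms)):
        if latencies_ms[i] > factor * baseline:
            return i
    return None


# Per-frame latencies in ms, copied from the log in this report.
samples = [3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 178, 192, 178]
print(find_slowdown(samples))  # index of the first 178 ms frame
```

This does not explain or fix the failure. It only marks the frame at which the latency jumps, so the run can be stopped and the state saved before 'Cuda failure: 999' appears.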
Current solution:
Restart the computer.
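Since the only known recovery is a restart, it may help to detect the lost state automatically. The sketch below is an assumption-laden example, not a fix: it runs `nvidia-smi` and searches its output for the "GPU is lost" text quoted above. The names `gpu_is_lost` and `check_gpu` and the 30-second timeout are hypothetical; the timeout is there in case `nvidia-smi` itself stops responding.

```python
import subprocess

# Text taken from the nvidia-smi message quoted in this report.
LOST_MARKER = "GPU is lost"


def gpu_is_lost(smi_output: str) -> bool:
    """Return True if the given nvidia-smi output reports a lost GPU."""
    return LOST_MARKER in smi_output


def check_gpu(timeout_s: float = 30.0) -> str:
    """Run nvidia-smi once and classify the result.

    Returns one of: "ok", "lost", "error", "timeout", "not-installed".
    """
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except FileNotFoundError:
        return "not-installed"
    except subprocess.TimeoutExpired:
        return "timeout"
    # The message may be printed on either stream, so check both.
    if gpu_is_lost(result.stdout + result.stderr):
        return "lost"
    return "ok" if result.returncode == 0 else "error"
```

Calling `check_gpu()` periodically (for example from a cron job) and logging the time of the first "lost" result would also show how long after boot the GPU disappears.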