NVIDIA DGX A100 AI Support Agent

Florian Maurer

2025-08-22


At the institute, we have two DGX A100 units, each with eight 40 GB A100 GPUs. We had hardware problems with the units: the GPU bay stopped working after a voltage incident in the datacenter.

Once we renewed the support license, the hardware problem was resolved: a complete new GPU bay (including the 8 GPUs) was sent to the datacenter, and a few days later a technician came to replace the unit. Problem solved.

Packaging issues of NVIDIA drivers

Yet everyone knows how painful it is to install the correct version of an nvidia-driver package, and having a complete system from NVIDIA does not make it any easier.

So I had the following packaging problems:

sudo apt install nvidia-driver-570-open

The following packages have unmet dependencies:

 nvidia-persistenced : Depends: libnvidia-cfg1 (= 575.57.08-0ubuntu1)

There is no package named libnvidia-cfg1; the actual package is called libnvidia-cfg1-575.

I asked NVIDIA support how to resolve this packaging problem, which occurred after a fresh install of NVIDIA DGX OS 6 and updating it for a while.
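
To see why apt cannot satisfy a dependency like this, it can help to pull the pinned name and version out of the error and compare them with what the repositories actually offer. The following is a minimal sketch, not an official procedure: pinned_deps and inspect_dep are hypothetical helper names, and the lookup relies only on the standard apt-cache policy and apt-cache show subcommands.

```shell
# Sketch (hypothetical helpers): extract an exact-version pin from an apt
# "unmet dependencies" message and look up what the repositories provide.

# Reads apt error text on stdin; for each line containing a dependency of
# the form "name (= version)", prints "<name> <version>".
pinned_deps() {
    sed -nE 's/.*[: ]([a-z0-9][a-z0-9+.-]*) \(= ([^)]+)\).*/\1 \2/p'
}

# Shows installed/candidate versions and the Provides field for the given
# package names. Names unknown to apt are silently skipped by "show".
inspect_dep() {
    apt-cache policy "$@"
    apt-cache show "$@" 2>/dev/null | grep -E '^(Package|Version|Provides):'
}

# Example (not run here):
#   echo ' nvidia-persistenced : Depends: libnvidia-cfg1 (= 575.57.08-0ubuntu1)' | pinned_deps
#   inspect_dep nvidia-persistenced libnvidia-cfg1-575
```

If the Provides field of the versioned package does not list the unversioned name with a matching version, apt has nothing that satisfies the pin, which would be consistent with the error above.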

The response came promptly (pun intended):

Good day! Thanks for the update. Can we please try Purge and Reinstall NVIDIA Packages?

Remove all NVIDIA driver components and reinstall:

sudo apt-get purge "*nvidia*"

sudo apt update

sudo apt install nvidia-driver-570-open nvidia-fabricmanager-570

Sure, let's go ahead and do what NVIDIA says, right?

On a typical OS, this is fine, and it is recommended in a few places on the internet.

However, on DGX OS this also removes nvidia-base, which includes the NVIDIA repositories. With those gone, I had no way to re-add them short of a reinstall. Fortunately, I could copy the NVIDIA apt repository information from the second DGX A100 and install nvidia-base again. So NVIDIA support did not help here.
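
In hindsight, a dry run would have shown the damage before it happened. The sketch below is one way to guard a wildcard purge. It assumes that apt-get -s (simulate) lists planned removals as lines starting with "Remv" or "Purg", and that the repository definitions live under /etc/apt. The function names and the PROTECTED list are hypothetical, not part of any NVIDIA tooling.

```shell
# Sketch (hypothetical helpers): check a purge pattern with a dry run and
# keep a copy of the apt configuration before touching anything.

# Packages that must survive; nvidia-base carries the NVIDIA repositories here.
PROTECTED="nvidia-base"

# Reads "apt-get -s" output on stdin and prints every protected package
# that appears in a "Remv" or "Purg" line.
protected_hits() {
    awk -v list="$PROTECTED" '
        BEGIN { n = split(list, p, " "); for (i = 1; i <= n; i++) keep[p[i]] = 1 }
        ($1 == "Remv" || $1 == "Purg") && ($2 in keep) { print $2 }
    '
}

# Simulates the purge for one pattern; fails if a protected package would go.
safe_to_purge() {
    hits=$(apt-get -s purge "$1" | protected_hits)
    if [ -n "$hits" ]; then
        echo "refusing: would remove $hits" >&2
        return 1
    fi
}

# Copies the contents of an apt configuration directory to a backup location.
backup_apt_dir() {
    src="${1:-/etc/apt}"
    dest="$2"
    mkdir -p "$dest" && cp -a "$src"/. "$dest"/
}

# Example (not run here):
#   backup_apt_dir /etc/apt "$HOME/apt-backup"
#   safe_to_purge "*nvidia*570*" && sudo apt-get purge "*nvidia*570*"
```

With a check like this, the broad "*nvidia*" pattern would be rejected as long as nvidia-base is installed, while a version-specific pattern would only pass if the simulation leaves nvidia-base alone.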

My reply then was:

So your recommendation to run sudo apt-get purge "*nvidia*" did break the system. The docs say sudo apt-get purge "*nvidia*570*" instead which does not remove the repo itself and everything. We fixed this, but your chatbot might not be the best option if you want to reduce support amount

The bot said:

Thanks for the update. Please continue to monitor the node and do let us know your concerns.

I still had further packaging issues because of the libnvidia-cfg1 problem, but the bot did not seem to care, even after further replies.

So I have now installed nvidia-driver-575-open, which seems to be the only way to avoid a broken DGX A100 system and have apt update && apt dist-upgrade run correctly.

The bot asked for feedback, which I gave.

I would like to feed back that the instructions provided by the chat bot are cumbersome at best and sabotage at worst.

So in the end, while I had a thread to record my findings in, I still had to solve my packaging situation manually. The bot did tell me to do some things, but there is really no reason to trust it with anything.

I had more cases where the bot gave me incorrect information. In one case, I reached the Tier 3 Enterprise Support team, which seems to consist of actual humans.

I wonder where this whole mess will lead us one day.