Setting up the GPU for fastai with Ubuntu and Nvidia
fastai alternatives
https://www.fast.ai/2017/04/06/alternatives/
Resources
2018 Lesson 1 wiki: https://forums.fast.ai/t/wiki-thread-lesson-1/6825
course 2018: https://course18.fast.ai/lessonsml1/lesson1.html
2020 setup help: https://forums.fast.ai/t/setup-help/65529
Ubuntu setup help used:
- https://forums.fast.ai/t/pytorch-installation-in-conda-environment-failing/6703/19
- https://forums.fast.ai/t/platform-local-server-ubuntu/65851
updates 2020: https://forums.fast.ai/t/official-part-1-2020-updates-and-resources-thread/63376
^^ Has all the code pictures and text
2020 book: https://github.com/fastai/fastbook/blob/master/01_intro.ipynb
2020 software fastai repo: https://github.com/fastai/fastai
^^ this is fastai v2 written from scratch
2020 course nbs: https://github.com/fastai/fastbook/tree/master/clean
2020 readme page fastai: https://course.fast.ai/#How-do-I-get-started?
2020 lesson 1 wiki: https://forums.fast.ai/t/lesson-1-official-topic/65626
forums: https://forums.fast.ai/
Old
fastai2 - https://github.com/fastai/fastai2
coursev4 - https://github.com/fastai/course-v4
fastbook - https://github.com/fastai/fastbook
Prepping ubuntu
sudo apt-get update
sudo apt-get upgrade
Software & Updates also does the above, but sudo apt-get update seems to show errors when a PPA is broken.
If the PPA is not necessary anymore, the errors from sudo apt-get update can be fixed by commenting out its entry with # in the ppa directory (source: here).
I had errors in the PPA from opencpu, which seems to be used for R, but I don’t know if there is even a point in maintaining that PPA. For later. For now it is commented out with #.
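A minimal sketch of what "fixing with #" does, assuming the PPA entry sits in a .list file in the ppa directory. The function name comment_out_entries and the sample URL are made up for illustration. The code only transforms text, so writing the result back to the real file (which needs root) is left out.

```python
def comment_out_entries(list_text: str) -> str:
    """Prefix every active 'deb'/'deb-src' line with '# ' so apt skips it."""
    out = []
    for line in list_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(("deb ", "deb-src ")):
            out.append("# " + line)
        else:
            out.append(line)  # blank lines and existing comments stay as-is
    return "\n".join(out)


# Hypothetical contents of a PPA .list file (the URL is invented for illustration)
sample = (
    "deb http://ppa.example.invalid/opencpu/ubuntu xenial main\n"
    "# deb-src http://ppa.example.invalid/opencpu/ubuntu xenial main"
)
print(comment_out_entries(sample))
```

Running it on an already commented file changes nothing, so it is safe to apply twice.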
changing ubuntu desktop
All the info you need about desktop environments is here. There is Unity, there is GNOME, there is KDE…
Unity is the default up to Ubuntu 16.
Info about systemctl-suspend and pm-suspend
Installing fastai
Resources
How to install nvidia drivers is here and here and here.
question (no answer on stack) (archive)
Goal
I somehow installed the driver 3 years back and want to check if I have the latest ones and if they are upgraded.
Status
Additional drivers in Software & Updates shows a bunch of NVIDIA graphics drivers. Currently 390 is installed, and I see it working with the processes using the GPU, with:
nvidia-smi
I don't think I have ever added repos (like this):
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
I know this because I don’t see the graphics-drivers PPAs in
grep ^ /etc/apt/sources.list /etc/apt/sources.list.d/*
Question
- When I look on the Nvidia website there is a higher proprietary version, 450, based on the official Nvidia driver finder. When I look in launchpad it says 430 is the current supported version, but 450 is not shown in additional drivers. Why? What am I missing?
- I want to go to the latest driver available. Is 390 the latest stable release for Ubuntu 16.04? How do I figure this out?
- There are many ways to install 450 (I am afraid of conflicts or my system breaking). Which should I prefer now that I have 390 via additional drivers?
  - download from the Nvidia site and install
  - apt-get has its own install, sudo apt-get install nvidia-450, once I have the graphics-drivers PPAs
  - add PPAs and install from additional drivers
- Should I uninstall 390 so I can install 450, and if so how? (There is no option in additional drivers to unselect all options.) Is the following ok?
  $ sudo apt purge nvidia-390 -y
  $ sudo apt autoremove -y
  $ sudo apt autoclean
- Should I disable secure boot before installing and enable it afterwards?
Current Config
- Ubuntu 16.04 LTS & Windows 10 dual boot
- 8 gb ram
- Nvidia GeForce GTX 1050/PCIe/SSE2
Installing fastai with dependencies
Packages to be installed can be found in the environment.yml
What seems to work (period):
conda create --name fastaiclean
conda activate fastaiclean
conda install -c fastai -c pytorch -c anaconda fastai
conda install -c fastai -c pytorch -c anaconda gh
conda install -c conda-forge notebook
Yes to all the disclaimers such as additional packages to be installed. Checked as of jan 18 2021. STILL WORKS GREAT. :)
Based on this stack answer: We don’t need to install anaconda package as the miniconda install doesn’t seem to require it.
What doesn’t work
`conda create --name fastai --clone base`
Man page instructions and directly into existing anaconda setup leads to conflicts (not sure how to solve):
conda install -c fastai -c pytorch -c anaconda fastai gh anaconda
Error seems to be some HTTPS error with Pytorch.
What else seems to work (but didn’t try)
Source: forums of fastai
git clone https://github.com/fastai/fastai2
cd fastai2
conda env create -f environment.yml
conda activate fastai2
+other things based on errors you get
Another source: Forums of fast ai
Note about Pip
Just in case you haven’t found that one. They can coexist, but maybe not in a way you’d expect. Pip can easily be run inside conda. You can use conda to manage python versions, install some libraries and then use pip for the rest. That will totally work. What will not work are packages you installed before installing conda, but that’s because the python you run after that (installed by conda) does not have the same site-packages directory in sys.path. Ofc, you can install them again after installing conda, but they will be downloaded, unpacked, and possibly compiled (in case they do not have wheels) again. —fastaiforum
GPU not working (running intro.ipynb)
ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
Fix from this link,
conda install -c conda-forge ipywidgets
FASTAI works. Somehow the GPU was not used, and what took 1 min for the lecturer in lesson 1 took me 15 mins (at least that’s what it said).
Fixing slow computer (swap)
- Swap occupied and reverting back from swap
Based on this stack answer
sudo swapoff -a
sudo swapon -a
Fixing GPU not working
Installing the right drivers
Basically: the Additional drivers tab in Software & Updates does it for you.
Check which drivers are installed?
- dpkg --get-selections | grep nvidia
- Go to Additional drivers and see what is selected
Identify the latest stable driver for Ubuntu 16 and GTX 10 series GeForce 1050 by:
- Highest in Additional drivers
- Launchpad (graphics drivers ppa basically)
- launchpad says 430 (430.40)
  - Also the version names contained in the table below show 16.04 for 430.
Uninstall and Re-install
Add PPA.
sudo add-apt-repository ppa:graphics-drivers/ppa
Don’t uninstall! Just change the selection in additional drivers, reboot, and see that nvidia-smi shows 430.
Note: dpkg --get-selections | grep nvidia now shows nvidia-390 deinstall and nvidia-430 install. This doesn’t matter.
Note: It also says it is open source in my additional driver tab. This looks like a bug. It is proprietary as fuck.
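A small sketch for reading the dpkg --get-selections output mentioned in the note, assuming the usual two-column "package state" layout. In dpkg terms, deinstall means the package is selected for removal with its configuration files left behind, which is why the leftover nvidia-390 line is harmless. The function name parse_selections is mine, and the sample mirrors the two lines quoted above.

```python
def parse_selections(output: str) -> dict:
    """Map package name -> selection state from `dpkg --get-selections` output."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            name, state = parts
            states[name] = state
    return states


# Sample shaped like the two lines quoted in the note above
sample = "nvidia-390\t\t\t\tdeinstall\nnvidia-430\t\t\t\tinstall\n"
states = parse_selections(sample)
active = [pkg for pkg, state in states.items() if state == "install"]
print(active)
```

Only the packages in the install state are the ones actually in use.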
Resources on Nvidia graphics card
These are some of the links I looked into:
[1] –> https://askubuntu.com/a/851144/443958
[2] –> https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa
[3] –> https://askubuntu.com/a/937898/443958
[4] –> https://askubuntu.com/questions/1045241/ubuntu-18-04-how-do-i-install-drivers-for-my-nvidia-geforce-gtx-1050
[5] –> https://askubuntu.com/questions/61396/how-do-i-install-the-nvidia-drivers?noredirect=1&lq=1
What Nvidia driver should I install for Ubuntu 16.04
My question on AskUbuntu:
Which are the LATEST STABLE proprietary drivers for UBUNTU 16.04 and NVIDIA Geforce GTX 1050 (4gb)? There are simply too many answers, which I have compiled below:
- Nvidia site says 450 (450.66) is the latest for GTX 1050 (not ubuntu version specific)
  - download from Nvidia site and install
  - apt-get has its own install: sudo apt-get install nvidia-450
- Additional drivers tab in Software and Updates only shows version 390
  - Note: graphics-drivers PPAs have not been added yet.
- launchpad says 430 (430.40)
  Current long-lived branch release: nvidia-430 (430.40) Dropped support for Fermi series (https://nvidia.custhelp.com/app/answers/detail/a_id/4656)
- Launchpad also says old branch for GeForce 10 series (1050) is 390.129
  Old long-lived branch release: nvidia-390 (390.129) For GF1xx GPUs use nvidia-390 (390.129)
- Launchpad here shows overview of packages (scroll down). Here the version names contain 16.04 for 430.
- Ubuntu recommended drivers shows different version for ubuntu. If I go by this then only 384 has supported versions for 16.04 (Xenial Xerus)
Answer to NVIDIA driver to be used
Plan
The key to success is simply to go with the Additional Drivers tab recommendations. The Ubuntu Developers worked really hard to make that tab trustworthy and reliable. It only shows options that are compatible with your hardware from sources that you already subscribe to. After you add the PPA, then go with the Additional Drivers tab (#2 / 430). If you encounter problems, downgrade to 410 or 390. —my post on stack
Fixing issues with Cuda
pytorch needs to be compiled with the right cuda. If you look on pytorch’s website it is always installed with torchvision and a cuda version.
The CUDA version needs to be matched by the nvidia drivers as per table 1.
pytorch will use its own cuda (looks like). External CUDA driver from NVIDIA website not required.
Multiple versions of cuda in the system
I have Nvidia driver 430 on Ubuntu 16.04 with Geforce 1050. It comes with libcuda1-430 when I installed the driver from the additional drivers tab in ubuntu (Software and Updates). I installed pytorch with conda, which also installed the cudatoolkit.
- nvidia-smi says I have cuda version 10.1
- conda list tells me cudatoolkit version is 10.2.89
- print(torch.cuda.current_device()): I get 10.0.10? (it looks like):
  AssertionError: The NVIDIA driver on your system is too old (found version 10010)
- print(torch._C._cuda_getCompiledVersion(), 'cuda compiled version') tells me my version is 10.0.20?
  10020 cuda compiled version
What am I missing? What version of CUDA is my torch actually looking at? Why are there so many different versions?
what is your version of cuda
- nvidia-smi is just showing the version it can handle, according to Berriel from stack
- Using torch.version.cuda we can find out which version pytorch was built with. This was found to be 10.2 (as was the cudatoolkit version)
- 3 & 4: forget it.
Suggestion was to remove pytorch torchvision and cudatoolkit
conda remove pytorch torchvision cudatoolkit
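The two integers from points 3 and 4 of the question are worth decoding. CUDA encodes its version as 1000 * major + 10 * minor, so 10010 reads as 10.1 (what the driver supports, matching nvidia-smi) and 10020 reads as 10.2 (what this pytorch build was compiled against, matching cudatoolkit), not 10.0.10 and 10.0.20. A minimal sketch; the function name decode_cuda_version is mine:

```python
def decode_cuda_version(encoded: int) -> str:
    """Decode CUDA's integer version (1000 * major + 10 * minor) to 'major.minor'."""
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return f"{major}.{minor}"


print(decode_cuda_version(10010))  # the "found version" in the assertion error
print(decode_cuda_version(10020))  # the compiled version reported by torch
```

Read this way there are only two versions in play: the driver tops out at 10.1 while pytorch wants 10.2, which is exactly the "driver too old" complaint.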
Does your cuda work?
If testing goes well
In [1]: import torch
In [2]: torch.cuda.current_device()
Out[2]: 0
In [3]: torch.cuda.device(0)
Out[3]: <torch.cuda.device at 0x7efce0b03be0>
In [4]: torch.cuda.device_count()
Out[4]: 1
In [5]: torch.cuda.get_device_name(0)
Out[5]: 'GeForce GTX 950M'
In [6]: torch.cuda.is_available()
Out[6]: True
If wrong version of cuda
print(torch.cuda.device_count())
0
print(torch.cuda.current_device())
AssertionError:
The NVIDIA driver on your system is too old (found version 10010).
Googling the assertion error points to pytorch issues on github where they talk about the driver and the CUDA version not matching. Digging a bit deeper you find the NVIDIA website’s table of cuda and driver version numbers, Table 1: Nvidia site.
Downgrading cudatoolkit without conflicts
Current version of cuda:
torch.version.cuda
10.2
Version corresponding to NVIDIA driver (in this case 430)
- nvidia-smi says 10.1
- Table 1: Nvidia site also says 10.1 and below (roughly).
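The same check can be written down as a lookup. This is only a sketch: the minimum Linux driver versions below are the ones I recall from Table 1 on the Nvidia site for three toolkit releases, so verify them there before relying on them. The names MIN_DRIVER and driver_supports are made up.

```python
# Minimum Linux driver version per CUDA toolkit, as recalled from NVIDIA's
# "Table 1"; treat these numbers as an assumption and check the table itself.
MIN_DRIVER = {
    "10.2": (440, 33),
    "10.1": (418, 39),
    "10.0": (410, 48),
}


def driver_supports(driver: str, cuda: str) -> bool:
    """True if a driver version string (e.g. '430.40') meets the minimum for `cuda`."""
    have = tuple(int(part) for part in driver.split(".")[:2])
    return have >= MIN_DRIVER[cuda]


for cuda in ("10.2", "10.1"):
    print(cuda, driver_supports("430.40", cuda))
```

With driver 430.40 this says no to 10.2 and yes to 10.1, which agrees with what nvidia-smi reports.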
What doesn’t work?
In the current conda environment, installing cudatoolkit together with pytorch and torchvision gives tons of errors:
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1.168 -c pytorch
What works
Pytorch installation seems to be paired with pytorch torchvision and cudatoolkit according to official website.
Remove pytorch torchvision cudatoolkit:
conda remove pytorch torchvision cudatoolkit
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1.168 -c pytorch
Again it asks for several things. I say yes blindly, but seemingly NO CONFLICTS.
Check what torch.version.cuda is: 10.1. GREAT, we have successfully downgraded.
Checking seems successful based on SO answer
In [1]: import torch
In [2]: torch.cuda.current_device() ## previously this gave "old driver error"
Out[2]: 0
In [3]: torch.cuda.device(0)
Out[3]: <torch.cuda.device at 0x7efce0b03be0>
In [4]: torch.cuda.device_count()
Out[4]: 1
In [5]: torch.cuda.get_device_name(0)
Out[5]: 'GeForce GTX 950M'
In [6]: torch.cuda.is_available()
Out[6]: True
torch.cuda.current_device() previously gave the assertion error that the Nvidia driver was too old. Now it returns 0, which seems to be expected as per: SO answer.
Fresh error: now fastai is missing. It probably got removed when I removed or re-installed pytorch.
conda install -c pytorch -c fastai fastai
So far so good. Testing again:
eghx@eghx-nitro:~$ conda activate fastaiclean
(fastaiclean) eghx@eghx-nitro:~$ python
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device object at 0x7f14de991d30>
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name()
'GeForce GTX 1050'
>>> torch.cuda.is_available()
True
>>> print(torch.rand(2,3).cuda())
tensor([[0.2424, 0.8397, 0.6206],
[0.0982, 0.4568, 0.3958]], device='cuda:0')
>>> print(torch.rand(2,3).cuda())
tensor([[0.8775, 0.8157, 0.1333],
[0.6189, 0.9641, 0.5741]], device='cuda:0')
>>>
For the first time I saw another process on nvidia-smi
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1164      G   /usr/lib/xorg/Xorg                           293MiB |
|    0      2252      G   compiz                                       114MiB |
|    0     22866      C   python                                       431MiB |
+-----------------------------------------------------------------------------+
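The interactive checks above can be bundled into one function that is quick to rerun after every reinstall. A rough sketch: the function name cuda_report and the dictionary keys are mine, and it only uses torch calls that already appear in these notes (torch.version.cuda, torch.cuda.is_available, device_count, get_device_name). It returns early if torch is not importable at all.

```python
def cuda_report() -> dict:
    """Collect the same facts as the interactive session above in one call."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    report = {
        "torch_installed": True,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": 0,
        "devices": [],
    }
    if report["cuda_available"]:
        report["device_count"] = torch.cuda.device_count()
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(report["device_count"])
        ]
    return report


print(cuda_report())
```

If cuda_available comes back False while built_with_cuda is set, the driver and toolkit mismatch described earlier is the first thing I would check.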
Other tools
conda search -c pytorch cudatoolkit ## searching for available version
conda list pytorch ## current version installed
Making GPU work faster
So 2 things I can do. Install libraries that do faster work and then remove all other processes on the gpu except cuda.
Resources
Isolate integrated intel (igpu) from Nvidia (gpu)
How to configure iGPU for xserver and nvidia GPU for CUDA work BY FASTI STASON
How to configure igpu for xserver and nvidia gpu for cuda? by simple solution guy
Removing xorg processes
- change to the intel graphics card in nvidia settings, or use prime-select intel
- added export LD_LIBRARY_PATH=/usr/lib/nvidia-430:$LD_LIBRARY_PATH and then restarted….
- Put the export command above in .bashrc so it works with every terminal opened… LD_LIBRARY_PATH loses its value after the terminal session
Result
- visibly no difference in speed
- you only save 11% of the gpu.
- second screen doesn’t work
- need to spend even more and more time to fix it.
setup | time1 | time2 |
---|---|---|
xorg running on Nvidia | 1:56 | 1:19 |
no processes and missing screen | 2:03 | 1:15 |
Reverting back | 2:10 | 1:18 |
adding the lib stuff | 1:11 | 1:19 |
second time | 0:58 | 1:17 |
We can pursue this later if required.
Going back to xorg and all the same way
Nvidia forums: prime-select nvidia works.
Next: So let’s add the xconf file back and see what it does, or modify it.
If this allows the monitor to work, we keep it; otherwise we go back with prime-select nvidia.
I want to check out the xconf solution without even modifying any xconf process.
It’s too complicated to get the other screen to work and get cuda to run, especially for the complete lack of progress I got. So screw that.
sudo prime-select nvidia
Making GPU work faster with libraries
Don’t know what this does. Blindly followed it
conda uninstall --force jpeg libtiff -y
conda install -c conda-forge libjpeg-turbo
CC="cc -mavx2" pip install --no-cache-dir -U --force-reinstall --no-binary :all: --compile pillow-simd
- step 3 in the link
- medium post also has the same thing.
Unable to reproduce the “faster libraries” in another env
Unable to reproduce this in other python environment for some reason. Don’t know why…
Based on stack
conda activate fastaicleantest
conda list --json --export > agent18/DS-2020/fastaicleantest.json
conda deactivate
conda activate fastaiclean
conda list --json --export > agent18/DS-2020/fastaiclean.json
cd agent18/DS-2020/
diff fastaicleantest.json fastaiclean.json
0 differences found, even with a manual conda list. I don’t know where I screwed up the installation… or what to do to rectify it.
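A plain diff of the two files is sensitive to ordering and formatting, so here is a sketch that compares the environments package by package instead. It assumes the files hold what conda list --json prints (a JSON list of objects with name, version and build_string); I have not checked what adding --export changes. The function names and the two sample environments are made up.

```python
import json


def load_packages(json_text: str) -> dict:
    """Map package name -> (version, build) from `conda list --json` output."""
    return {
        pkg["name"]: (pkg.get("version"), pkg.get("build_string"))
        for pkg in json.loads(json_text)
    }


def diff_envs(a: dict, b: dict) -> dict:
    """Packages whose presence, version or build differs between two environments."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Made-up one-package environments, only to show the shape of the result
env_a = load_packages('[{"name": "pillow", "version": "7.2.0", "build_string": "py38_0"}]')
env_b = load_packages('[{"name": "pillow-simd", "version": "7.0.0.post3", "build_string": "pypi_0"}]')
print(diff_envs(env_a, env_b))
```

An empty result here would still not prove the environments behave the same, since a library rebuilt with different compiler flags can keep the same name and version.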
PIL error
ImportError: The _imaging extension was built for another version of Pillow or PIL:
Core version: 7.0.0.post3
Pillow version: 7.2.0
conda list pil
Apparently, “This is only an installation issue.”—stack.
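The error says the compiled core (7.0.0.post3, which looks like a pillow-simd version number) and the Python package (7.2.0) come from different installs. A quick way to see which of the two distributions Python itself thinks are installed, as a sketch using the standard library's importlib.metadata (Python 3.8+). The function name installed_versions is mine.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Return {distribution name: version or None} for each requested name."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


print(installed_versions(["pillow", "pillow-simd"]))
```

If both come back with a version, the two packages are fighting over the same PIL directory and one of them has to go.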
Uninstalling pillow and pillow-simd
conda uninstall pillow
Said yes to everything
conda uninstall pillow-simd
Doesn’t work. It says no:
(fastaiclean) eghx@eghx-nitro:~/agent18/DS-2020$ conda uninstall pillow pillow-smd
Collecting package metadata (repodata.json): done
Solving environment: failed
PackagesNotFoundError: The following packages are missing from the target environment:
- pillow-smd
Open jupyter notebook and do the following:
!pip uninstall -y pillow-simd
check with
conda list pil
now reinstall it…
CC="cc -mavx2" pip install --no-cache-dir -U --force-reinstall --no-binary :all: --compile pillow-simd
Was I supposed to install pillow first?
now somehow fastai is missing. :(
conda install -c pytorch -c fastai fastai
yes to all
OK that nonsense works.
BTW pillow is currently not installed…
Checked timings with the standard file and all seems well.
Other sources regarding installation
- medium post by some pisth
- Stack: How do I install NVIDIA and CUDA drivers into Ubuntu?
power off when sleep when GPU is full (recovery)
What seems to work with existing config:
- sleep/wakeup and shutdown normally
- If I shut down the gpu by closing the jupyter kernel then all is good.
- whether I shut down the gpu or not, I can still shut down.
What doesn’t work with existing config:
- when Jupyter finishes running and the gpu is full, sleep doesn’t work
- It hangs; (C-M-F7/F1) doesn’t work. It’s hung I think.
What all I attempted before screwing up the system.
I am tempted to try the following:
- 18.04 Screen remains blank after wake up from suspend
  - look into kernels
  - remove gdm and go to lightdm? (but I seem to already have lightdm)
  - hybrid settings in bios
  - install unity-ubuntu-desktop
  - nouveau.modeset=0 (but I am not even using nouveau)
- switch to lightdm
- do I have hybrid graphics?
- suspend, hibernate issues with nvidia
  - 5 steps to do. Maybe try this as it is from nvidia. Doesn’t work for 18 but might work for 16 I guess.
- trying to perhaps see the error
  - remove quiet splash to see where the error is
- other blind guesses
  - reinstall the driver (not sure, but it is in the comments)
  - downgrade perhaps (really not sure if this is the direction)
  - suggestion of using 390 and 16 seems to work for this guy (with instructions)
  - adding nomodeset in the grub screen?
  - changing kernel?
  - resetting monitors before sleeping? xrandr -s 0
This is ubuntu 16 pertinent.
- change powermizer to adaptive is one solution (change it back later)
- another unaccepted solution to the same question on Ask Ubuntu is to change grub to something else…
  GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi_rev_override=1 acpi_osi=Linux scsi_mod.use_blk_mq=1 nouveau.modeset=0 nouveau.runpm=0 mem_sleep_default=deep"
- Updating kernel to 4.17?
  - update kernel to next version?
  - kernel apparently didn’t do it at all for this guy with ubuntu 16 and 4.15 kernel
  - Another user reporting this is not working for them :( kernel 4.17
What all I did:
- Based on the Nvidia forum, added the following to grub and updated grub (sudo update-grub):
  GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi_rev_override=1 acpi_osi=Linux scsi_mod.use_blk_mq=1 nouveau.modeset=0 nouveau.runpm=0 mem_sleep_default=deep"
  sudo update-grub
  sudo gedit /etc/initramfs-tools/modules
  nvidia
  nvidia_modeset
  nvidia_uvm
  nvidia_drm
  sudo update-initramfs -u -k all
  Got some warning for this but apparently it doesn’t mean anything as I have an i5 system (Stack)
- Didn’t change the kernel as there was no evidence towards it. People changed to 4.17. Mine is currently 4.15.
- Tried different suspends (Suspend)
- Tried downgrading.
  sudo add-apt-repository ppa:graphics-drivers/ppa
  sudo apt update
  Selected the 384 version.
  It didn’t allow me to work with pytorch 1.6 so I wanted to go back.
- Tried upgrading back to 430
  sudo apt-get purge nvidia*
  Additional drivers tab was all greyed out.
  sudo apt remove nvidia-*
  sudo apt autoremove
  sudo ubuntu-drivers autoinstall
  Rebooted to the black screen of death (tried this) but all seemed ok. Added nomodeset to grub (in hindsight I could have used nouveau.modeset=0, I think).
  As a last step went to prime-select intel and it finally allowed me to boot with just the intel graphics card. Despite having 430 installed nothing works with it…
  Again blank screen, and then nouveau.modeset=0 did the trick.
- did a complete re-install of xserver, unity, lightdm and the nvidia drivers
So far I have one screen working and the gpu not being NVIDIA.
That’s when I gave up… Now god forbid I had to recover.
Learning about the linux architecture
What is x-server? X11 Xclient
https://medium.com/mindorks/x-server-client-what-the-hell-305bd0dc857f
https://askubuntu.com/a/264411/443958
Graphical interface/ Display manager: Unity, Gnome
Protocol between GI and DS: X11, wayland etc…
Display Server: Xorg | window manager: Compiz (resize or position window, close minimize etc…)
xorg packages : https://packages.ubuntu.com/search?keywords=xserver-xorg
sudo apt-get install installs packages, their dependencies and recommended packages.
To remove a package we do sudo apt-get purge or sudo apt-get remove --purge
Should we use purge, remove --purge, or remove?
https://askubuntu.com/a/231565/443958
18.04 Screen remains blank after wake up from suspend
What is nomodeset quiet and splash
https://askubuntu.com/questions/1024895/why-do-i-need-to-replace-quiet-splash-with-nomodeset
https://askubuntu.com/questions/747314/is-nomodeset-still-required
Recovering linux system from said fatality
“re-installing” xorg, lightdm, unity and then nvidia
Graphical interface/Display manager/Desktop environments: Unity, Gnome
Protocol between GI and DS: X11, wayland etc…
Display Server: Xorg | window manager: Compiz (resize or position window, close minimize etc…)
Display manager: Lightdm or gdm etc… (starting your login screen)
This has to be my best recovery so far. It took so much time to prep and know the details. But I can say I did it STM. I think I can. Usually I was so afraid of screwing up the system (rightly so), but now I read and read and read, see what others’ advice is, and know “roughly what the commands mean”. And then I plan out the commands and jump into it.
Based mainly on How to install Nvidia drivers, Graphics issues in 16.04 and re-installing xorg and xserver, What is gdm3, kdm, lightdm? How to install and remove them?.
Went to TTY screen which I invoked before logging in:
X-server and Xorg and Lightdm
sudo apt-get purge xorg-* "xserver-*"
sudo apt-get purge lightdm
Removing dependencies and recommended packages of the purged packages (if installed with them),
sudo apt-get autoremove
sudo apt-get install xorg xserver-xorg
sudo dpkg-reconfigure xorg
Configuring lightdm means selecting lightdm instead of gdm
sudo apt-get install lightdm
sudo dpkg-reconfigure lightdm
Now unity. On my system ubuntu-desktop was not installed; only unity was installed. But everywhere I looked ubuntu-desktop was included, so I just added it. A quick search showed that ubuntu-desktop has its own unity.
sudo apt-get purge ubuntu-desktop
sudo apt-get purge unity
sudo apt-get autoremove
sudo apt-get install unity ubuntu-desktop
Re-installing nvidia
sudo apt-get purge nvidia-*
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
To see the recognized drivers and which driver is recommended:
ubuntu-drivers devices
The following installs the recommended drivers. It can also be done manually with sudo apt-get install nvidia-430.
sudo ubuntu-drivers autoinstall
We can go and check in the additional drivers tab later.
The last step I did, and probably the most important, was to make an xorg.conf file. I somehow had this in the previous installation and nvidia-xconfig didn’t bring it back. So I literally copied a backup and pasted it.
Section "ServerLayout"
Identifier "layout"
Screen 0 "nvidia"
Inactive "intel"
EndSection
Section "Device"
Identifier "intel"
Driver "modesetting"
BusID "PCI:0@0:2:0"
Option "AccelMethod" "None"
EndSection
Section "Screen"
Identifier "intel"
Device "intel"
EndSection
Section "Device"
Identifier "nvidia"
Driver "nvidia"
BusID "PCI:1@0:0:0"
Option "ConstrainCursor" "off"
EndSection
Section "Screen"
Identifier "nvidia"
Device "nvidia"
Option "AllowEmptyInitialConfiguration" "on"
Option "IgnoreDisplayDevices" "CRT"
EndSection
That was it. I didn’t put anything in grub I think.
Check with nvidia-smi, nvidia-settings and the additional drivers tab.
Result
System works like in the past. the issue with not being able to sleep when GPU is full is still pending.
There are many people who were not able to solve this on their system. For some users it was concluded that it was a BIOS issue with their ASUS systems. Read this entire thread here. So if you can’t get it to work, it’s ok. It’s alright. We’re the same and there is no need to cry.
fastai functions not working
NameError: name ‘widgets’ is not defined
Is it supposed to be in fastbook?
Why do we need fastbook? What’s happening?
Where are these functions?
Is it in fastbook
Seems like it… tried it out on Gradient and I am certain it is this lack of fastbook and graphviz.
Re-installing until fastai fresh env
conda install -c fastai -c pytorch fastai cudatoolkit=10.1.243 gh
conda install -c conda-forge ipywidgets
didn't help
Unable to install fastbook via conda install
Also the python version seems to have been downgraded to python 3.7 (in fastaicleantest alone)
It seems to be a known issue that conda install doesn’t work whereas pip works. WTF.
1: https://github.com/conda/conda/issues/9868
2: https://github.com/KevinMusgrave/pytorch-metric-learning/issues/55
pip install seems to be the way to go…
Testing…
clean test seems to be destroyed. Won’t install graphviz
Fastai cuda not working AGAIN :(
Quick reinstall
conda install -c fastai -c pytorch fastai cudatoolkit=10.1.243 gh
ends up downgrading 3 packages including python
conda install -c conda-forge ipywidgets
ends up installing new packages
pip install -Uqq fastbook
0 output
conda install graphviz
Not tested the last part with graphviz. While installing fastaiclean4 I installed graphviz after gh, and then had problems with fastai recognizing that I had installed graphviz. Removed and reinstalled, and all seemed to be good.
fastaiclean is ok with the libtiff stuff
Sometimes I get the fastbook.setup_book() error related to graphviz... I don't know why I get it sometimes and not other times. Jesus
fastaiclean test is ruined (can’t recover)
fastaiclean4 is a fresh install without the libtiff stuff.
graphviz still causing issues… I don’t know why. It is not displaying the pictures via gv…
Kaggle and tabular stuff
conda install kaggle -c conda-forge
You will need a json file according to the api github…
To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the ‘Account’ tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials. Place this file in the location ~/.kaggle/kaggle.json (on Windows in the location C:\Users\<Windows-username>\.kaggle\kaggle.json - you can check the exact location, sans drive, with echo %HOMEPATH%). You can define a shell environment variable KAGGLE_CONFIG_DIR to change this location to $KAGGLE_CONFIG_DIR/kaggle.json (on Windows it will be %KAGGLE_CONFIG_DIR%\kaggle.json).
Create API token in kaggle gives the json file to put in ~/.kaggle/kaggle.json.
For your security, ensure that other users of your computer do not have read access to your credentials. On Unix-based systems you can do this with the following command:
chmod 600 ~/.kaggle/kaggle.json
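To confirm the chmod took effect, the permission bits can be checked from Python. A minimal sketch on a throwaway temporary file rather than the real ~/.kaggle/kaggle.json; the function name is_private is mine, and the check assumes a Unix-like system.

```python
import os
import stat
import tempfile


def is_private(path: str) -> bool:
    """True if neither group nor others have any permission bits on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


# Demonstrate on a throwaway file instead of the real credentials file
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name
os.chmod(demo_path, 0o644)   # world-readable, like a freshly downloaded file might be
before = is_private(demo_path)
os.chmod(demo_path, 0o600)   # same effect as: chmod 600 <file>
after = is_private(demo_path)
os.remove(demo_path)
print(before, after)
```

Calling is_private(os.path.expanduser("~/.kaggle/kaggle.json")) would run the same check on the real file.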
dtreeviz package.
conda install dtreeviz -c conda-forge
GPU full
RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0;
3.95 GiB total capacity; 3.01 GiB already allocated; 19.50 MiB free;
3.10 GiB reserved in total by PyTorch)
Tried changing the batch size (bs) as suggested by Jeremy, but reducing it to 32 and 16 didn’t yield any results.
A possibility could be to try and free up the space on the GPU used by compiz and Xorg. But it sounds like a bad idea, as it will take tons of time. I have already spent 1-2 full days on it without any results. My dual screen won’t work and I don’t know how to get it to work.
While training is using the system and the GPU it is impossible to do any other work, despite the RAM being completely free.
Considering all this, it seems wise to move to Paperspace. It seems faster, though I might end up having to pay money. But that is a choice I think I can figure out later if necessary.
Sometimes cuda memory becomes full even while running the cells fresh… It also takes a long time. Unsure how to reduce memory use, even at the cost of accuracy.
So for now, as Paperspace is faster and freer and I don’t use emacs anyway, I don’t see the value of trying to get it to work on my pc. Let’s move to Paperspace. The error that made me do this is the CUDA out of memory error above.
Lesson 1
- Let’s setup a gpu online first: https://course.fast.ai/start_gradient
- Look for GPU usage on PC: https://forums.fast.ai/t/wiki-thread-lesson-1/6825
Resources
- course notes by someone on medium
Jupyter
- Jupyter with crestle
- aws or PC
- paperspace (have )
Jupyter in emacs
https://tkf.github.io/emacs-ipython-notebook/
todo
- setup jupyter on emacs and browser to see how different it is….
- clone fast ai repo, instructions on website…
- finish this course lecture 1 at least. seeing it
- followed by assignments
- figure out how to use fast ai (pc or aws or whatever) with jupyter
Other useful links
Managing your data science project environments with Conda
Question
Suspend not working after GPU is full
On normal occasions when xorg and compiz are running on my gpu, I can Suspend peacefully. However, if I run some intense (90% GPU in use) training (via jupyter) related to pytorch, and subsequently suspend after the processes are over, it refuses to sleep/wakeup.
I am positive the GPU being full or not empty is causing the issue. I don’t know why “some process” possibly related to the GPU is not Suspending. When I run jupyter and run 1+1 (or a simple process) and Suspend, there are also no issues.
Question
Kernlog shows me nothing “fishy”. I have tried a bunch of online remedies. Now at a dead end.
How do I identify what is happening? any ideas?
Other symptoms
It sort-of sleeps but I still hear some sound from the laptop when I hit a key (it sounds as if it is booting up). And then blank screen after that. Sometimes I get to go to the TTY but can’t type anything.
What all I tried to rectify this issue?
Spent a good 5 full days understanding and searching and re-installing etc… Now at a dead end.
- Checked the kern logs (pastebin link) but didn’t see anything “fishy” (at 02:08 I start sleeping and at 10:21 I hit hard reset). Here is a tiny excerpt:
Oct 2 02:08:06 eghx-nitro NetworkManager[8152]: <info> [1601597286.6443] manager: sleep requested (sleeping: no enabled: yes)
Oct 2 02:08:06 eghx-nitro NetworkManager[8152]: <info> [1601597286.6443] manager: sleeping...
Oct 2 02:08:06 eghx-nitro NetworkManager[8152]: <info> [1601597286.6447] manager: NetworkManager state is now ASLEEP
Oct 2 02:08:06 eghx-nitro NetworkManager[8152]: <info> [1601597286.6453] device (wlp2s0): state change: activated -> deactivating (reason 'sleeping') [100 110 37]
Oct 2 02:08:06 eghx-nitro NetworkManager[8152]: <info> [1601597286.8169] device (wlp2s0): state change: deactivating -> disconnected (reason 'sleeping') [110 30 37]
Oct 2 02:08:06 eghx-nitro NetworkManager[8152]: <info> [1601597286.8356] dhcp4 (wlp2s0): canceled DHCP transaction, DHCP client pid 8328
Oct 2 02:08:06 eghx-nitro NetworkManager[8152]: <info> [1601597286.8356] dhcp4 (wlp2s0): state changed bound -> done
Oct 2 02:08:06 eghx-nitro NetworkManager[8152]: <info> [1601597286.8363] dns-mgr: Writing DNS information to /sbin/resolvconf
Oct 2 02:08:06 eghx-nitro kernel: [24100.153393] wlp2s0: deauthenticating from e8:cc:18:41:3c:15 by local choice (Reason: 3=DEAUTH_LEAVING)
Oct 2 02:08:07 eghx-nitro NetworkManager[8152]: <warn> [1601597287.0509] sup-iface[0xb4a6f0,wlp2s0]: connection disconnected (reason -3)
Oct 2 02:08:07 eghx-nitro NetworkManager[8152]: <info> [1601597287.0511] device (wlp2s0): supplicant interface state: completed -> disconnected
Oct 2 02:08:07 eghx-nitro NetworkManager[8152]: <info> [1601597287.0525] device (wlp2s0): state change: disconnected -> unmanaged (reason 'sleeping') [30 10 37]
Oct 2 02:08:08 eghx-nitro kernel: [24101.983885] PM: suspend entry (deep)
Oct 2 02:08:09 eghx-nitro kernel: [24101.983888] PM: Syncing filesystems ... done.
Oct 2 10:21:32 eghx-nitro kernel: [24103.953554] Freezing user space processes ... (elapsed 0.002 seconds) done.
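One thing that can be pulled out of the excerpt is the difference between the two clocks on each kernel line: the syslog wall-clock stamp at the start and the kernel's own uptime counter in square brackets. A rough sketch follows, with the regex, the function name kernel_events and the year 2020 as my assumptions (the epoch values in the NetworkManager lines fall in 2020).

```python
import re
from datetime import datetime

LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<msg>.*)$"
)


def kernel_events(log_text: str, year: int = 2020):
    """Yield (wall_clock, uptime_seconds, message) for each kernel line."""
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match:
            stamp = " ".join(match["stamp"].split())  # collapse padded day-of-month
            wall = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            yield wall, float(match["uptime"]), match["msg"]


excerpt = """\
Oct 2 02:08:09 eghx-nitro kernel: [24101.983888] PM: Syncing filesystems ... done.
Oct 2 10:21:32 eghx-nitro kernel: [24103.953554] Freezing user space processes ... (elapsed 0.002 seconds) done.
"""
(wall_a, up_a, _), (wall_b, up_b, _) = list(kernel_events(excerpt))
print("wall-clock gap:", (wall_b - wall_a).total_seconds(), "s")
print("kernel-clock gap:", round(up_b - up_a, 3), "s")
```

The wall clock jumps by about eight hours while the kernel counter moves by about two seconds, which suggests the "Freezing" line was produced right after the suspend request and only written to the log around 10:21. That is a guess about the logging, not a diagnosis of the hang.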
- Based on the Nvidia forum, added the following to grub and updated.
  GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi_rev_override=1 acpi_osi=Linux scsi_mod.use_blk_mq=1 nouveau.modeset=0 nouveau.runpm=0 mem_sleep_default=deep"
  Added the following to initramfs-tools/modules and updated.
  nvidia
  nvidia_modeset
  nvidia_uvm
  nvidia_drm
- Didn’t change the kernel as there was no evidence towards it. People changed to 4.17. Mine is currently 4.15.
- Blind try: trying different (Suspend)s
  systemctl suspend
  pm-suspend
- Tried downgrading the drivers to 384 from 430 by changing it at additional drivers. This was not useful as 384 was not capable of co-existing with pytorch=1.6.0.
- Complete remove and re-install of nvidia-430 as per here: purge, add-apt-repository ppa:graphics-drivers/ppa, update and autoinstall.
  This ended in the black screen of death. Recovered it with nouveau.modeset=0. Somehow the GPU was not working anymore.
- At this point did a complete re-install of xserver, unity, lightdm and nvidia-430 over the tty terminal before the login screen.
  This recovered the system to its previous state, i.e., suspend when the GPU is full hangs the system.
My system
- Ubuntu 16.04
- Nvidia 1050 GeForce