Setting up the GPU for fastai with Ubuntu and Nvidia

fastai alternatives

https://www.fast.ai/2017/04/06/alternatives/

Resources

2018 Lesson 1 wiki: https://forums.fast.ai/t/wiki-thread-lesson-1/6825

course 2018: https://course18.fast.ai/lessonsml1/lesson1.html

2020 setup help: https://forums.fast.ai/t/setup-help/65529

Ubuntu setup help used:

https://forums.fast.ai/t/pytorch-installation-in-conda-environment-failing/6703/19
https://forums.fast.ai/t/platform-local-server-ubuntu/65851

updates 2020: https://forums.fast.ai/t/official-part-1-2020-updates-and-resources-thread/63376

^^ Has all the code pictures and text

2020 book:https://github.com/fastai/fastbook/blob/master/01_intro.ipynb

2020 software fastai repo: https://github.com/fastai/fastai

^^ this is fastai v2 written from scratch

2020 course nbs: https://github.com/fastai/fastbook/tree/master/clean

2020 readme page fastai: https://course.fast.ai/#How-do-I-get-started?

2020 lesson 1 wiki: https://forums.fast.ai/t/lesson-1-official-topic/65626

forums: https://forums.fast.ai/


Old

fastai2 - https://github.com/fastai/fastai2
course-v4 - https://github.com/fastai/course-v4
fastbook - https://github.com/fastai/fastbook

Prepping ubuntu

sudo apt-get update
sudo apt-get upgrade

Software & Updates does the same thing through the GUI. Running sudo apt-get update from the terminal, however, showed errors coming from a PPA.

If the PPA is no longer needed, the errors go away once its entry is commented out with # in the PPA directory (/etc/apt/sources.list.d/) (source: here).

The errors came from the opencpu PPA, which seems to be used for R. I don’t know if there is even a point in maintaining that PPA. Something for later. For now it is commented out with #.

changing ubuntu desktop

All the info you need about desktop environments is here. There is Unity, there is GNOME, there is KDE…

Unity is the default up to Ubuntu 16.04.

Info about systemctl-suspend and pm-suspend

Installing fastai

Resources

How to install nvidia drivers is here and here and here.

question (no answer on stack) (archive)

Stack link

Goal

I somehow installed the driver 3 years back and want to check if I have the latest ones and if they are upgraded.

Status

Additional drivers in Software & Updates shows a bunch of NVIDIA graphics drivers. And currently 390 has been installed and I see it working with the processes using the GPU, with:

nvidia-smi

I don’t think I have ever added repos like this:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update 

I know this because I don’t see the graphics-drivers PPA’s in

grep ^ /etc/apt/sources.list /etc/apt/sources.list.d/*

Question

When I look on the Nvidia website, the official Nvidia driver finder lists a higher proprietary version, 450. Launchpad says 430 is the current supported version, but 450 is not shown in Additional Drivers. Why? What am I missing?
I want to go to the latest driver available. Is 390 the latest stable release for Ubuntu 16.04? How do I figure this out?
There are many ways to install 450 (I am afraid of conflicts or my system breaking). Which should I prefer now that I have 390 via Additional Drivers?
- download from the Nvidia site and install
- apt-get has its own install, sudo apt-get install nvidia-450, once I have the graphics-drivers PPA
- add the PPA and install from Additional Drivers
Should I uninstall 390 so I can install 450, and if so how? (There is no option in Additional Drivers to unselect all options.) Is the following ok?
```
 $ sudo apt purge nvidia-390 -y
 $ sudo apt autoremove -y
 $ sudo apt autoclean
```
Should I disable secure boot before installing and enable it afterwards?

Current Config

Ubuntu 16.04 LTS & Windows 10 dual boot
8 gb ram
Nvidia GeForce GTX 1050/PCIe/SSE2

Installing fastai with dependencies

Packages to be installed can be found in the environment.yml

What seems to work (period):

conda create --name fastaiclean

conda activate fastaiclean

conda install -c fastai -c pytorch -c anaconda fastai 

conda install -c fastai -c pytorch -c anaconda gh

conda install -c conda-forge notebook

Yes to all the disclaimers such as additional packages to be installed. Checked as of jan 18 2021. STILL WORKS GREAT. :)

Based on this stack answer: We don’t need to install anaconda package as the miniconda install doesn’t seem to require it.

What doesn’t work

`conda create --name fastai --clone base`

Man page instructions and directly into existing anaconda setup leads to conflicts (not sure how to solve):

conda install -c fastai -c pytorch -c anaconda fastai gh anaconda

Error seems to be some HTTPS error with Pytorch.

What else seems to work (but didn’t try)

Source: forums of fastai

git clone https://github.com/fastai/fastai2
cd fastai2
conda env create -f environment.yml
conda activate fastai2

+other things based on errors you get

Another source: Forums of fast ai

Note about Pip

Just in case you haven’t found that one. They can coexist, but maybe not in a way you’d expect. Pip can easily be run inside conda. You can use conda to manage python versions, install some libraries and then use pip for the rest. That will totally work. What will not work are packages you installed before installing conda, but that’s because the python you run after that (installed by conda) does not have the same site-packages directory in sys.path. Ofc, you can install them again after installing conda, but they will be downloaded, unpacked, and possibly compiled (in case they do not have wheels) again. —fastaiforum
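
To see which Python a terminal is actually running, and therefore which site-packages directory pip installs into, a minimal sketch like the one below can be run both inside and outside the conda env. It only uses the standard library; the name describe_python is mine and not from the forum post.

```python
import site
import sys


def describe_python():
    """Return where the running interpreter lives and where it looks for packages."""
    return {
        "executable": sys.executable,             # path of the python binary in use
        "prefix": sys.prefix,                     # root of the active environment
        "site_packages": site.getsitepackages(),  # directories packages are installed into
    }


for key, value in describe_python().items():
    print(f"{key}: {value}")
```

If the executable path points into the conda env, packages installed with the system pip before conda was set up will not be visible there, which matches the quote above.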

GPU not working (running intro.ipynb)

ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

Fix from this link,

conda install -c conda-forge ipywidgets

fastai works. Somehow the GPU was not used: what took 1 min for the lecturer in lesson 1 took me 15 mins (at least that’s what it said).

Fixing slow computer (swap)

Swap occupied and reverting back from swap

Based on this stack answer

sudo swapoff -a
sudo swapon -a
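
Before turning swap off it may be worth checking that whatever sits in swap fits back into RAM, since swapoff has to move it there. A rough sketch, assuming the usual Linux /proc/meminfo layout; swap_status is a hypothetical helper name and the SAMPLE numbers are made up for illustration.

```python
def swap_status(meminfo_text):
    """Parse /proc/meminfo-style text and report whether used swap fits in available RAM."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])  # values are reported in kB
    used = values.get("SwapTotal", 0) - values.get("SwapFree", 0)
    available = values.get("MemAvailable", 0)
    return {
        "swap_used_kib": used,
        "mem_available_kib": available,
        "fits_in_ram": used <= available,
    }


# Illustrative numbers only, not taken from my machine.
SAMPLE = """\
MemTotal:        8055180 kB
MemAvailable:    3145728 kB
SwapTotal:       2097148 kB
SwapFree:        1048572 kB
"""

print(swap_status(SAMPLE))
```

On the real system the text would come from reading /proc/meminfo instead of SAMPLE.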

Fixing GPU not working

Installing the right drivers

Basically: the Additional Drivers tab in Software & Updates does it for you.

Check which drivers are installed:

- dpkg --get-selections | grep nvidia
- Go to Additional Drivers and see what is selected

Identify the latest stable driver for Ubuntu 16 and the GeForce GTX 1050 (GTX 10 series) by:

- Highest in Additional Drivers
- Launchpad (the graphics-drivers PPA, basically)
  - Launchpad says 430 (430.40)
  - Also the version names contained in the table below show 16.04 for 430.

Uninstall and Re-install

Add PPA.

sudo add-apt-repository ppa:graphics-drivers/ppa

Don’t uninstall! Just change the selection in Additional Drivers, reboot, and check that nvidia-smi shows 430.

Note: dpkg --get-selections | grep nvidia now shows nvidia-390 deinstall and nvidia-430 install. This doesn’t matter.

Note: It also says it is open source in my Additional Drivers tab. This looks like a bug. It is proprietary as fuck.
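
For reference, deinstall in the dpkg listing means the package is selected for removal with its configuration files left behind, which would explain why nvidia-390 still shows up. A small sketch that filters such a listing down to the drivers that are actually selected for install; the function names are hypothetical and SAMPLE just mimics the two lines mentioned above.

```python
def parse_selections(text):
    """Map package name -> dpkg selection state ('install', 'deinstall', ...)."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            states[parts[0]] = parts[1]
    return states


def installed_nvidia_drivers(text):
    """Package names starting with 'nvidia-' whose state is 'install'."""
    return sorted(
        name
        for name, state in parse_selections(text).items()
        if name.startswith("nvidia-") and state == "install"
    )


# Mimics the output of: dpkg --get-selections | grep nvidia
SAMPLE = "nvidia-390\t\t\tdeinstall\nnvidia-430\t\t\tinstall\n"

print(installed_nvidia_drivers(SAMPLE))  # ['nvidia-430']
```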

Resources on Nvidia graphics card

Stack detailed resource

These are some of the links I looked into:

[1] –> https://askubuntu.com/a/851144/443958
[2] –> https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa
[3] –> https://askubuntu.com/a/937898/443958
[4] –> https://askubuntu.com/questions/1045241/ubuntu-18-04-how-do-i-install-drivers-for-my-nvidia-geforce-gtx-1050
[5] –> https://askubuntu.com/questions/61396/how-do-i-install-the-nvidia-drivers?noredirect=1&lq=1

What Nvidia driver should I install for Ubuntu 16.04

My question on AskUbuntu:

Which is the LATEST STABLE proprietary drivers for UBUNTU 16.04 and NVIDIA Geforce GTX 1050 (4gb)? There are simply too many answers which I have compiled below:

Nvidia site says 450 (450.66) is the latest for GTX 1050 (not Ubuntu-version specific)
- download from the Nvidia site and install
- apt-get has its own install, sudo apt-get install nvidia-450
Additional drivers tab in Software and Updates only shows version 390
- Note: the graphics-drivers PPA has not been added yet.
Launchpad says 430 (430.40)

Current long-lived branch release: nvidia-430 (430.40) Dropped support for Fermi series (https://nvidia.custhelp.com/app/answers/detail/a_id/4656)
Launchpad also says old branch for GeForce 10 series (1050) is 390.129

Old long-lived branch release: nvidia-390 (390.129) For GF1xx GPUs use nvidia-390 (390.129)
Launchpad here shows overview of packages (scroll down).

Here the version names contain 16.04 for 430.
Ubuntu recommended drivers shows different version for ubuntu. If I go by this then only 384 has supported versions for 16.04 (Xenial Xerus)


Answer to NVIDIA driver to be used

Plan

The key to success is simply to go with the Additional Drivers tab recommendations. The Ubuntu Developers worked really hard to make that tab trustworthy and reliable. It only shows options that are compatible with your hardware from sources that you already subscribe to. After you add the PPA, then go with the Additional Drivers tab (#2 / 430). If you encounter problems, downgrade to 410 or 390. —my post on stack

Fixing issues with Cuda

PyTorch needs to be built against the right CUDA version. On PyTorch’s website the install command always pairs pytorch with torchvision and a specific cudatoolkit version.

The CUDA version has to be supported by the installed Nvidia driver, as per Table 1.

PyTorch will use its own CUDA, the cudatoolkit that conda installs (looks like). A separate CUDA install from the NVIDIA website is not required.

Multiple versions of cuda in the system

Stack question by me.

I have Nvidia driver 430 on Ubuntu 16.04 with Geforce 1050. It comes with libcuda1-430 when I installed the driver from additional drivers tab in ubuntu (Software and Updates). I installed pytorch with conda which also installed the cudatoolkit.

1. nvidia-smi says I have CUDA version 10.1
2. conda list tells me the cudatoolkit version is 10.2.89
3. print(torch.cuda.current_device()) fails, and I get 10.0.10? (it looks like):

AssertionError: The NVIDIA driver on your system is too old (found version 10010)

4. print(torch._C._cuda_getCompiledVersion(), 'cuda compiled version') tells me my version is 10.0.20?

10020 cuda compiled version

What am I missing? What version of CUDA is my torch actually looking at? Why are there so many different versions?

what is your version of cuda

1. nvidia-smi just shows the highest CUDA version the driver can handle, according to Berriel from Stack.
2. Using torch.version.cuda we can find out which CUDA version PyTorch was built with. This was found to be 10.2 (same as the cudatoolkit version).
3 & 4. Forget it.
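
For what it is worth, the integers in the question (10010 and 10020) are not 10.0.10 and 10.0.20. CUDA encodes a version as 1000 * major + 10 * minor, so they decode to 10.1 and 10.2. That lines up with the rest: 10.1 is what the driver supports and 10.2 is what torch was built with. A tiny sketch of the decoding (decode_cuda_version is my own helper name):

```python
def decode_cuda_version(number):
    """Turn CUDA's integer encoding (1000 * major + 10 * minor) into 'major.minor'."""
    major, remainder = divmod(number, 1000)
    return f"{major}.{remainder // 10}"


for raw in (10010, 10020):
    print(raw, "->", decode_cuda_version(raw))
# 10010 -> 10.1
# 10020 -> 10.2
```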

Suggestion was to remove pytorch torchvision and cudatoolkit

conda remove pytorch torchvision cudatoolkit

Does your cuda work?

Source 1, Stack

If testing goes well

In [1]: import torch

In [2]: torch.cuda.current_device()
Out[2]: 0

In [3]: torch.cuda.device(0)
Out[3]: <torch.cuda.device at 0x7efce0b03be0>

In [4]: torch.cuda.device_count()
Out[4]: 1

In [5]: torch.cuda.get_device_name(0)
Out[5]: 'GeForce GTX 950M'

In [6]: torch.cuda.is_available()
Out[6]: True

If wrong version of cuda

print(torch.cuda.device_count())
0
print(torch.cuda.current_device())

AssertionError: 
The NVIDIA driver on your system is too old (found version 10010).

Googling the assertion error leads to PyTorch GitHub issues about the driver version and the CUDA version not matching. Digging a bit deeper, you find the NVIDIA page that maps CUDA versions to driver version numbers: Table 1: Nvidia site.

Downgrading cudatoolkit without conflicts

Current version of cuda:

torch.version.cuda

10.2

Version corresponding to NVIDIA driver (in this case 430)

nvidia-smi says 10.1
Table 1: Nvidia site also says 10.1 and below (roughly).
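
The lookup in Table 1 can be sketched as a small function. The minimum driver numbers below are what I recall from that table for Linux and only cover the CUDA versions relevant to these notes, so treat them as assumptions and check them against the Nvidia site; newest_listed_cuda is a hypothetical helper.

```python
# Minimum Linux driver per CUDA toolkit, recalled from Table 1 of NVIDIA's
# CUDA release notes. ASSUMPTION: verify these numbers on the Nvidia site.
MIN_DRIVER_FOR_CUDA = [
    ("10.2", (440, 33)),
    ("10.1", (418, 39)),
    ("10.0", (410, 48)),
    ("9.2", (396, 26)),
    ("9.1", (390, 46)),
    ("9.0", (384, 81)),
]


def newest_listed_cuda(driver_version):
    """Return the newest CUDA version in the list that the driver satisfies, or None."""
    parts = tuple(int(piece) for piece in driver_version.split(".")[:2])
    for cuda, minimum in MIN_DRIVER_FOR_CUDA:
        if parts >= minimum:
            return cuda
    return None


print(newest_listed_cuda("430.40"))  # 10.1
```

With these numbers driver 430.40 tops out at CUDA 10.1, which agrees with what nvidia-smi reports.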

What doesn’t work?

Installing a different cudatoolkit together with pytorch and torchvision directly on top of the current conda environment gives tons of errors.

conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1.168 -c pytorch

What works

According to the official website, pytorch, torchvision and cudatoolkit are installed together as a matched set.

Remove pytorch torchvision cudatoolkit:

conda remove pytorch torchvision cudatoolkit

conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1.168 -c pytorch

Again it asks for several things and I say yes blindly, but seemingly NO CONFLICTS.

Check what torch.version.cuda is: 10.1. GREAT, we have successfully downgraded.

Checking seems successful based on SO answer

In [1]: import torch

In [2]: torch.cuda.current_device() ## previously this gave "old driver error"
Out[2]: 0

In [3]: torch.cuda.device(0)
Out[3]: <torch.cuda.device at 0x7efce0b03be0>

In [4]: torch.cuda.device_count()
Out[4]: 1

In [5]: torch.cuda.get_device_name(0)
Out[5]: 'GeForce GTX 950M'

In [6]: torch.cuda.is_available()
Out[6]: True

torch.cuda.current_device() previously gave the assertion error that the Nvidia driver was too old. Now it returns 0, which seems to be expected as per: SO answer.

Fresh error: now fastai is missing. It probably got removed when I removed or re-installed pytorch.

conda install -c pytorch -c fastai fastai

So far so good. Testing again:

eghx@eghx-nitro:~$ conda activate fastaiclean
(fastaiclean) eghx@eghx-nitro:~$ python
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device object at 0x7f14de991d30>
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name()
'GeForce GTX 1050'
>>> torch.cuda.is_available()
True
>>> print(torch.rand(2,3).cuda())
tensor([[0.2424, 0.8397, 0.6206],
        [0.0982, 0.4568, 0.3958]], device='cuda:0')
>>> print(torch.rand(2,3).cuda())
tensor([[0.8775, 0.8157, 0.1333],
        [0.6189, 0.9641, 0.5741]], device='cuda:0')
>>> 

For the first time I saw another process on nvidia-smi
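
The checks from the session above can be bundled into one function so they can be rerun after every reinstall. This is only a sketch: cuda_report is a hypothetical helper, and it deliberately avoids calling torch.cuda.current_device() when CUDA is unavailable, since that is the call that raised the assertion error earlier.

```python
def cuda_report():
    """Collect the same facts as the interactive session, without raising."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        index = torch.cuda.current_device()
        report["device_count"] = torch.cuda.device_count()
        report["device_name"] = torch.cuda.get_device_name(index)
    return report


print(cuda_report())
```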

Other tools

	conda search -c pytorch cudatoolkit ## searching for available version
	
	conda list pytorch ## current version installed

Making GPU work faster

So there are 2 things I can do: install libraries that do the work faster, and remove all processes from the GPU except CUDA.

Resources

fastaiforum

Isolate integrated intel (igpu) from Nvidia (gpu)

How to configure iGPU for xserver and nvidia GPU for CUDA work BY FASTI STASON

How to configure igpu for xserver and nvidia gpu for cuda? by simple solution guy

Removing xorg processes

Switch to the Intel graphics card in nvidia-settings

or prime-select intel

Added the

 export LD_LIBRARY_PATH=/usr/lib/nvidia-430:$LD_LIBRARY_PATH

and then restarted….

Put the command from ‘2’ in .bashrc so it applies to every terminal that is opened… LD_LIBRARY_PATH loses its value after the terminal session.

Result

visibly no difference in speed
you only save 11% of the gpu.
second screen doesn’t work
need to spend even more and more time to fix it.

| Setup | time1 | time2 |
|---|---|---|
| xorg running on Nvidia | 1:56 | 1.19 |
| no processes and missing screen | 2.03 | 1.15 |
| Reverting back | 2.10 | 1.18 |
| adding the lib stuff | 1.11 | 1.19 |
| second time | 0.58 | 1.17 |

We can pursue this later if required.

Going back to xorg and all the same way

Nvidia forums: prime-select nvidia

works.

Next: So let’s add the xconf file back and see what it does. or modify it ..

If this allows monitor to work, we keep it otherwise we go back with prime-select nvidia

I want to check out the xconf solution without even modifying any xconf process.

components of a GUI

It’s too complicated to get other screen to work and get cuda to run.. Especially for the complete lack of progress I got. so screw that.

sudo prime-select nvidia

Making GPU work faster with libraries

Don’t know what this does. Blindly followed it

conda uninstall --force jpeg libtiff -y	
conda install -c conda-forge libjpeg-turbo 

CC="cc -mavx2" pip install --no-cache-dir -U --force-reinstall --no-binary :all: --compile pillow-simd

step 3 in the link
medium post also has the same thing.

Unable to reproduce the “faster libraries” in another env

Unable to reproduce this in other python environment for some reason. Don’t know why…

Based on stack

conda activate fastaicleantest
conda list --json --export > agent18/DS-2020/fastaicleantest.json
conda deactivate

conda activate fastaiclean
conda list --json --export > agent18/DS-2020/fastaiclean.json

cd agent18/DS-2020/

diff fastaicleantest.json fastaiclean.json

0 differences found, even with manual conda list. don’t know where I screwed up the installation… or what to do to rectify it.
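
A line-based diff can be noisy, so here is a sketch that compares the two JSON exports package by package instead. The key names (name, version, build_string, channel) are what I expect conda list --json to emit, and compare_envs is a hypothetical helper; the two sample strings are made up for illustration.

```python
import json


def load_packages(json_text):
    """Map package name -> (version, build, channel) from `conda list --json` output."""
    return {
        entry["name"]: (entry.get("version"), entry.get("build_string"), entry.get("channel"))
        for entry in json.loads(json_text)
    }


def compare_envs(first_json, second_json):
    """Report packages that exist in only one env or differ between the two."""
    first, second = load_packages(first_json), load_packages(second_json)
    return {
        "only_in_first": sorted(first.keys() - second.keys()),
        "only_in_second": sorted(second.keys() - first.keys()),
        "changed": sorted(n for n in first.keys() & second.keys() if first[n] != second[n]),
    }


# Illustrative entries only, not my real environments.
FIRST = '[{"name": "pillow", "version": "7.2.0", "build_string": "py38_0", "channel": "pkgs/main"}]'
SECOND = '[{"name": "pillow-simd", "version": "7.0.0.post3", "build_string": "pypi_0", "channel": "pypi"}]'

print(compare_envs(FIRST, SECOND))
```

Including the channel in the comparison should make pip-installed packages stand out, assuming conda lists them with a pypi channel.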

PIL error

ImportError: The _imaging extension was built for another version of Pillow or PIL:
Core version: 7.0.0.post3
Pillow version: 7.2.0

conda list pil

Apparently, “This is only an installation issue.”—stack.
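
The two version strings in the error look like they come from two different distributions that both install into the PIL directory: 7.0.0.post3 has the shape of a pillow-simd version and 7.2.0 of a regular Pillow one. A sketch, assuming Python 3.8+ (for importlib.metadata), that shows which of the two is installed according to the package metadata; pillow_distributions is my own helper name.

```python
from importlib import metadata


def pillow_distributions():
    """Return the installed version of pillow and pillow-simd, or None if absent."""
    found = {}
    for name in ("pillow", "pillow-simd"):
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


print(pillow_distributions())
```

If both entries show a version, the mismatch in the ImportError is probably one of them overwriting part of the other.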

Uninstalling pillow and pillow-simd

conda uninstall pillow

Said yes to everything

conda uninstall pillow-simd

Doesn’t work. It says no:

(fastaiclean) eghx@eghx-nitro:~/agent18/DS-2020$ conda uninstall pillow pillow-smd
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are missing from the target environment:
  - pillow-smd

Open jupyter notebook and do the following:

!pip uninstall -y pillow-simd

check with

conda list pil

now reinstall it…

CC="cc -mavx2" pip install --no-cache-dir -U --force-reinstall --no-binary :all: --compile pillow-simd

Was I supposed to install pillow first?

now somehow fastai is missing. :(

conda install -c pytorch -c fastai fastai

yes to all

OK that nonsense works.

BTW pillow is currently not installed…

Checked timings with the standard file and all seems well.

Other sources regarding installation

medium post by some pisth
Stack: How do I install NVIDIA and CUDA drivers into Ubuntu?

power off when sleep when GPU is full (recovery)

What seems to work with existing config:

sleep/wakeup and shutdown normally
If I shut down the gpu by closing the jupyter kernel then all is good.
whether I shutdown the gpu or not I can still shutdown.

What doesn’t work with existing config:

when Jupyter finishes running and the GPU is full, sleep doesn’t work
It hangs; switching TTYs (C-M-F7/F1) doesn’t work either. It’s hung, I think.

What all I attempted before screwing up the system.

I am tempted to try the following:

18.04 Screen remains blank after wake up from suspend
- look into kernels
- remove gdm and go to lightdm? (but I seem to already have lightdm)
- hybrid settings in bios
- install unity-ubuntu-desktop
- nouveau.modeset=0 (but i am not even using nouveau)
System freeze after suspend
- switch to lightdm
- do I have hybrid graphics?
suspend, hibernate issues with nvidia
- 5 steps to do. Maybe try this as it is from nvidia. Doesn’t work for 18 but might work for 16 I guess.
trying to perhaps see the error
- remove quiet splash to see where the error is
other blind guesses
- reinstall the driver (Not sure but it is in comments)
  - downgrade perhaps (really not sure if this is the direction)
  - suggestion of using 390 and 16 seems to work for this guy (with instructions)
- adding no modeset in grub screen?
- changing kernel?
- resetting monitors before sleeping? xrandr -s 0

This is ubuntu 16 pertinent.

Almost same issue
- change powermizer to adaptive is one solution (change it back later)
- other unaccepted solution of the same question in ask ubuntu is to change grub to something else…
```
  GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi_rev_override=1 acpi_osi=Linux scsi_mod.use_blk_mq=1 nouveau.modeset=0 nouveau.runpm=0 mem_sleep_default=deep"
```
Updating kernel to 4.17?
- update kernel to next version?
- kernel apparently didn’t do it all for this guy with ubuntu 16 and 4.15 kernel - Another user reporting this is not working for them :( kernel 4.17

What all I did:

Based on the Nvidia forum, added the following to /etc/default/grub and updated grub (sudo update-grub):

 GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi_rev_override=1 acpi_osi=Linux scsi_mod.use_blk_mq=1 nouveau.modeset=0 nouveau.runpm=0 mem_sleep_default=deep"

 sudo update-grub

Then added the nvidia modules to /etc/initramfs-tools/modules and rebuilt the initramfs:

 sudo gedit /etc/initramfs-tools/modules

 nvidia
 nvidia_modeset
 nvidia_uvm
 nvidia_drm

 sudo update-initramfs -u -k all

Got some warning for this, but apparently it doesn’t mean anything as I have an i5 system (source: Stack)

Didn’t change kernel as there was no evidence towards it. People changed to 4.17. Mine is currently 4.15.
Tried different suspends (Suspend)
Tried downgrading.
```
 sudo add-apt-repository ppa:graphics-drivers/ppa
 sudo apt update 
```
Selected 384 version.

It didn’t allow me to work with pytorch 1.6 so I wanted to go back.
Tried upgrading back to 430
```
 sudo apt-get purge nvidia*
```
Additional drivers tab was all greyed out.
```
 sudo apt remove nvidia-*  
 sudo apt autoremove  
	
 sudo ubuntu-drivers autoinstall
```
Rebooted to black screen of death (tried this) but all seemed ok. Added nomodeset to grub (in hindsight I could have used nouveau.modeset=0, I think).

As a last step ran prime-select intel, and it finally allowed me to boot with just the Intel graphics card. Despite having 430 installed, nothing works with it…

Again blank screen, and then nouveau.modeset=0 did the trick
did complete re-install of xserver, unity, lightdm and the nvidia drivers

So far I have one screen working and the GPU in use not being the NVIDIA one.

That’s when I gave up… Now, god forbid, I had to recover.

Learning about the linux architecture

What is x-server? X11 Xclient

https://medium.com/mindorks/x-server-client-what-the-hell-305bd0dc857f

https://askubuntu.com/a/264411/443958

Graphical interface / desktop environment: Unity, Gnome

Protocol between GI and DS: X11, wayland etc…

Display Server: Xorg | window manager: Compiz (resize or position window, close, minimize etc…)

xorg packages : https://packages.ubuntu.com/search?keywords=xserver-xorg

sudo apt-get install installs the packages plus their dependencies and recommended packages.

To remove a package we do sudo apt-get purge or sudo apt-get remove --purge

Should we use purge or remove --purge or remove

https://askubuntu.com/a/231565/443958

18.04 Screen remains blank after wake up from suspend

What is nomodeset quiet and splash

https://askubuntu.com/questions/1024895/why-do-i-need-to-replace-quiet-splash-with-nomodeset

https://askubuntu.com/questions/747314/is-nomodeset-still-required

Recovering linux system from said fatality

“re-installing” xorg, lightdm, unity and then nvidia

Graphical interface / desktop environment: Unity, Gnome

Protocol between GI and DS: X11, wayland etc…

Display Server: Xorg | window manager: Compiz (resize or position window, close, minimize etc…)

Display manager: Lightdm or gdm etc… (starting your login screen)

This has to be my best recovery so far. It took so much time to prep and get to know the details. But I can say I did it STM. I think I can. Usually I was so afraid of screwing up the system (rightly so), but now I read and read and read, see what others’ advice is, and know “roughly what the commands mean”. And then I plan out the commands and jump into it.

Based mainly on How to install Nvidia drivers, Graphics issues in 16.04 and re-installing xorg and xserver, What is gdm3, kdm, lightdm? How to install and remove them?.

Went to TTY screen which I invoked before logging in:

X-server and Xorg and Lightdm

sudo apt-get purge xorg-* "xserver-*"
sudo apt-get purge lightdm

Removing dependencies and recommended packages of the purged packages (if installed with them),

sudo apt-get autoremove

sudo apt-get install xorg xserver-xorg
sudo dpkg-reconfigure xorg

Configuring lightdm means selecting lightdm instead of gdm

sudo apt-get install lightdm
sudo dpkg-reconfigure lightdm

Now Unity. On my system ubuntu-desktop was not installed, only unity was. But everywhere I looked ubuntu-desktop was included, so I just added it. A quick search showed that ubuntu-desktop has its own unity.

sudo apt-get purge ubuntu-desktop
sudo apt-get purge unity

sudo apt-get autoremove

sudo apt-get install unity ubuntu-desktop

Re-installing nvidia

sudo apt-get purge nvidia-*
sudo add-apt-repository ppa:graphics-drivers/ppa

sudo apt-get update

To see the recognized drivers and what is the recommended driver etc…

ubuntu-drivers devices

This installs the recommended driver. It can also be done manually with sudo apt-get install nvidia-430.

sudo ubuntu-drivers autoinstall

We can go and check in the additional drivers tab later.

The last step I did and probably the most important was to make an xorg.conf file. This I somehow had in the previous installation and nvidia-xconfig didn’t bring it back. So I literally copied a back up and pasted it.

Section "ServerLayout"
    Identifier "layout"
    Screen 0 "nvidia"
    Inactive "intel"
EndSection

Section "Device"
    Identifier "intel"
    Driver "modesetting"
    BusID "PCI:0@0:2:0"
    Option "AccelMethod" "None"
EndSection

Section "Screen"
    Identifier "intel"
    Device "intel"
EndSection

Section "Device"
    Identifier "nvidia"
    Driver "nvidia"
    BusID "PCI:1@0:0:0"
    Option "ConstrainCursor" "off"
EndSection

Section "Screen"
    Identifier "nvidia"
    Device "nvidia"
    Option "AllowEmptyInitialConfiguration" "on"
    Option "IgnoreDisplayDevices" "CRT"
EndSection
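
One thing to keep in mind with the BusID lines: lspci prints PCI addresses in hexadecimal, while Xorg expects decimal numbers. A small sketch of the conversion, using the plain PCI:bus:device:function form rather than the bus@domain form in my file; lspci_to_busid is a hypothetical helper and the example addresses are simply the ones that would match the BusID values above.

```python
def lspci_to_busid(address):
    """Convert an lspci address like '01:00.0' (hex) to Xorg's 'PCI:1:0:0' (decimal)."""
    parts = address.split(":")
    bus, rest = parts[-2], parts[-1]  # ignore an optional leading PCI domain
    device, function = rest.split(".")
    return "PCI:{}:{}:{}".format(int(bus, 16), int(device, 16), int(function, 16))


print(lspci_to_busid("00:02.0"))  # PCI:0:2:0
print(lspci_to_busid("01:00.0"))  # PCI:1:0:0
```

The difference only matters once a bus number goes past 9: an lspci address of 0a:00.0 becomes PCI:10:0:0.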

That was it. I didn’t put anything in grub I think.

Check with

nvidia-smi

nvidia-settings

& additional drivers tab

Result

System works like in the past. The issue with not being able to sleep when the GPU is full is still pending.

There are many people who were not able to solve this on their system. For some users it was concluded that it was a BIOS issue with their ASUS systems. Read this entire thread here. So if you can’t get it to work, it’s ok. It’s alright. We’re the same and there is no need to cry.

fastai functions not working

NameError: name 'widgets' is not defined

Is it supposed to be in fastbook?

Why do we need fastbook? What’s happening?

Where are these functions?

Is it in fastbook

Seems like it… tried it out on Gradient and I am certain it is this lack of fastbook and graphviz

Re-installing until fastai fresh env

conda install -c fastai -c pytorch fastai cudatoolkit=10.1.243 gh

conda install -c conda-forge ipywidgets

didn't help

Unable to install fastbook via conda install

Also the python version seems to have been downgraded to python 3.7 (in fastaicleantest alone)

It seems to be a known issue that conda install doesn’t work whereas pip works. WTF.

1: https://github.com/conda/conda/issues/9868

2: https://github.com/KevinMusgrave/pytorch-metric-learning/issues/55

pip install seems to be the way to go…

Testing…

clean test seems to be destroyed. Won’t install graphviz

Fastai cuda not working AGAIN :(

Quick reinstall

conda install -c fastai -c pytorch fastai cudatoolkit=10.1.243 gh

ends up downgrading 3 packages including python

conda install -c conda-forge ipywidgets

ends up installing new packages

pip install -Uqq fastbook

0 output

conda install graphviz

Not tested the last part with graphviz. While installing fastaiclean4 I installed graphviz after gh, and then had problems with fastai recognizing that I had installed graphviz. Removed and reinstalled and all seemed to be good.

fastaiclean is ok with the libtiff stuff

Sometimes I get the fastbook.setup_book() error related to graphviz... I don’t know why I get it sometimes and not other times. Jesus

fastaicleantest is ruined (can’t recover)

fastaiclean4 is a fresh install without the libtiff stuff.

graphviz still causing issues… I don’t know why. It is not displaying the pictures via gv…

Kaggle and tabular stuff

conda install kaggle -c conda-forge

You will need a json file according to the api github…

To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the ‘Account’ tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials. Place this file in the location ~/.kaggle/kaggle.json (on Windows in the location C:\Users\<Windows-username>\.kaggle\kaggle.json - you can check the exact location, sans drive, with echo %HOMEPATH%). You can define a shell environment variable KAGGLE_CONFIG_DIR to change this location to $KAGGLE_CONFIG_DIR/kaggle.json (on Windows it will be %KAGGLE_CONFIG_DIR%\kaggle.json).

Create API token in kaggle gives the json file to put in ~/.kaggle/kaggle.json.

For your security, ensure that other users of your computer do not have read access to your credentials. On Unix-based systems you can do this with the following command:

chmod 600 ~/.kaggle/kaggle.json
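
The same permission check can be done from Python, for example at the top of a notebook before calling the API. A sketch using only the standard library; is_private and make_private are hypothetical helper names, and the demo runs on a temporary file instead of the real kaggle.json.

```python
import os
import stat
import tempfile


def is_private(path):
    """True when neither group nor others have any permission bits on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def make_private(path):
    """Python equivalent of `chmod 600 path`."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# Demo on a throwaway file (Unix permissions assumed).
with tempfile.NamedTemporaryFile() as handle:
    os.chmod(handle.name, 0o644)
    print("before:", is_private(handle.name))
    make_private(handle.name)
    print("after:", is_private(handle.name))
```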

dtreeviz package.

conda install dtreeviz -c conda-forge

GPU full

RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0;
3.95 GiB total capacity; 3.01 GiB already allocated; 19.50 MiB free;
3.10 GiB reserved in total by PyTorch)

Tried changing batch size (bs) as suggested by Jeremy but reducing it to 32 and 16 didn’t yield any results.
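
Instead of trying batch sizes by hand, the search could be automated by halving the batch size until one batch fits. This is only a sketch of the idea: run_one_batch stands for whatever trains a single batch at a given size (hypothetical, not a fastai function), and the fake version below just pretends that anything above 8 runs out of memory.

```python
def largest_batch_size_that_fits(run_one_batch, start=64, minimum=1):
    """Halve the batch size until run_one_batch(bs) stops raising an out-of-memory error."""
    bs = start
    while bs >= minimum:
        try:
            run_one_batch(bs)
            return bs
        except RuntimeError as error:
            if "out of memory" not in str(error):
                raise  # some other failure, do not hide it
            bs //= 2
    return None


def fake_run_one_batch(bs):
    """Stand-in for real training: pretends only batch sizes up to 8 fit on the GPU."""
    if bs > 8:
        raise RuntimeError("CUDA out of memory. (simulated)")


print(largest_batch_size_that_fits(fake_run_one_batch))  # 8
```

With real training code it probably also makes sense to call torch.cuda.empty_cache() between attempts. If even the minimum does not fit, the function returns None.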

One possibility is to free up the GPU memory used by compiz and Xorg. But it sounds like a bad idea, as it will take tons of time: I have already spent 1-2 full days on it without any results. My dual screen won’t work and I don’t know how to get it to work.

While the GPU is in use it is impossible to do any other work on the system, despite the RAM being completely free.

Considering all this, it seems wise to move to Paperspace. It seems faster. I might end up having to pay money, but that is a choice I think I can figure out later if necessary.

Sometimes CUDA memory becomes full even when running the cells fresh… It also takes a long time. Unsure how to reduce it, even at the cost of accuracy.

So for now, as Paperspace is faster and freer and I don’t use emacs anyway, I don’t see the value of trying to get it to work on my PC. Let’s move to Paperspace. The error that made me do this is the CUDA out of memory error above.

Lesson 1

Let’s setup a gpu online first: https://course.fast.ai/start_gradient
Look for GPU usage on PC: https://forums.fast.ai/t/wiki-thread-lesson-1/6825

Resources

Lesson wiki
course notes by someone on medium

Jupyter

Jupyter with crestle
aws or PC
paperspace (have )

Jupyter in emacs

https://tkf.github.io/emacs-ipython-notebook/

todo

setup jupyter on emacs and browser to see how different it is….

clone fast ai repo, instructions on website…

finish lecture 1 of this course at least (watching it)

followed by assignments
figure out how to use fastai (PC or AWS or whatever) with jupyter

Question

Suspend not working after GPU is full

Normally, when xorg and compiz are running on my GPU, I can suspend peacefully. However, if I run some intense (90% GPU in use) PyTorch training (via jupyter) and then suspend after the processes are over, the system refuses to sleep or wake up.

I am positive that the GPU being full, or not empty, is causing the issue. I don’t know why “some process”, possibly related to the GPU, is not suspending. When I run jupyter, run 1+1 (or some other simple process) and then suspend, there are no issues either.

Question

Kernlog shows me nothing “fishy”. I have tried a bunch of online remedies. Now at a dead end.

How do I identify what is happening? any ideas?

Other symptoms

It sort-of sleeps but I still hear some sound from the laptop when I hit a key (it sounds as if it is booting up). And then blank screen after that. Sometimes I get to go to the TTY but can’t type anything.

What all I tried to rectify this issue?

Spent a good 5 full days understanding and searching and re-installing etc… Now at a dead end.

Checked the kern logs (pastebin link) but didn’t see anything “fishy”. (at 02:08 I start sleeping and at 10:21 I hit hard reset).

Here is a tiny excerpt:

Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.6443] manager: sleep requested (sleeping: no  enabled: yes)
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.6443] manager: sleeping...
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.6447] manager: NetworkManager state is now ASLEEP
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.6453] device (wlp2s0): state change: activated -> deactivating (reason 'sleeping') [100 110 37]
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.8169] device (wlp2s0): state change: deactivating -> disconnected (reason 'sleeping') [110 30 37]
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.8356] dhcp4 (wlp2s0): canceled DHCP transaction, DHCP client pid 8328
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.8356] dhcp4 (wlp2s0): state changed bound -> done
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.8363] dns-mgr: Writing DNS information to /sbin/resolvconf
Oct  2 02:08:06 eghx-nitro kernel: [24100.153393] wlp2s0: deauthenticating from e8:cc:18:41:3c:15 by local choice (Reason: 3=DEAUTH_LEAVING)
Oct  2 02:08:07 eghx-nitro NetworkManager[8152]: <warn>  [1601597287.0509] sup-iface[0xb4a6f0,wlp2s0]: connection disconnected (reason -3)
Oct  2 02:08:07 eghx-nitro NetworkManager[8152]: <info>  [1601597287.0511] device (wlp2s0): supplicant interface state: completed -> disconnected
Oct  2 02:08:07 eghx-nitro NetworkManager[8152]: <info>  [1601597287.0525] device (wlp2s0): state change: disconnected -> unmanaged (reason 'sleeping') [30 10 37]
Oct  2 02:08:08 eghx-nitro kernel: [24101.983885] PM: suspend entry (deep)
Oct  2 02:08:09 eghx-nitro kernel: [24101.983888] PM: Syncing filesystems ... done.
Oct  2 10:21:32 eghx-nitro kernel: [24103.953554] Freezing user space processes ... (elapsed 0.002 seconds) done.
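
To find the point where the log goes quiet without reading it line by line, the biggest jump between consecutive timestamps can be computed. A sketch under a few assumptions: the lines use the syslog prefix shown above, the year (2020 here) has to be supplied because syslog does not record it, and largest_gap is my own helper name. The three sample lines are taken from the excerpt.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year=2020):
    """Parse the 'Oct  2 02:08:06' prefix of a syslog line; the year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def largest_gap(lines, year=2020):
    """Return (gap, line_before, line_after) for the biggest jump between timestamps."""
    best = None
    previous = None
    for line in lines:
        try:
            stamp = parse_stamp(line, year)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None:
            gap = stamp - previous[0]
            if best is None or gap > best[0]:
                best = (gap, previous[1], line)
        previous = (stamp, line)
    return best


LINES = [
    "Oct  2 02:08:08 eghx-nitro kernel: [24101.983885] PM: suspend entry (deep)",
    "Oct  2 02:08:09 eghx-nitro kernel: [24101.983888] PM: Syncing filesystems ... done.",
    "Oct  2 10:21:32 eghx-nitro kernel: [24103.953554] Freezing user space processes ... (elapsed 0.002 seconds) done.",
]

gap, before, after = largest_gap(LINES)
print(gap)  # 8:13:23
```

In the excerpt the wall clock jumps by about 8 hours between the last two lines while the kernel counter in brackets advances by only about 2 seconds, which may be worth mentioning when asking for help.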

Based on Nvidia forum added the following to grub and updated.

 GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi_rev_override=1 acpi_osi=Linux scsi_mod.use_blk_mq=1 nouveau.modeset=0 nouveau.runpm=0 mem_sleep_default=deep"

Added the following to /etc/initramfs-tools/modules and updated.

 nvidia
 nvidia_modeset
 nvidia_uvm
 nvidia_drm

Didn’t change kernel as there was no evidence towards it. People changed to 4.17. Mine is currently 4.15.
Blind try: Trying different (Suspend)s
```
 systemctl suspend
   
 pm-suspend
```
Tried downgrading the drivers from 430 to 384 by changing the selection in Additional Drivers. This was not useful, as 384 could not co-exist with pytorch=1.6.0.
Complete remove and re-install of nvidia-430 as per here:
purge, add-apt-repository ppa:graphics-drivers/ppa, update and autoinstall.

This ended in the black screen of death. Recovered it with nouveau.modeset=0. Somehow the GPU was not working anymore.
At this point did a complete re-install of xserver, unity, lightdm and nvidia-430 over a TTY terminal before the login screen.

This recovered the system to its previous state, i.e., suspend when the GPU is full hangs the system.

My system

Ubuntu 16.04
Nvidia 1050 GeForce
