fastai-lessons
Resources
2018 Lesson 1 wiki: https://forums.fast.ai/t/wiki-thread-lesson-1/6825
course 2018: https://course18.fast.ai/lessonsml1/lesson1.html
2020 setup help: https://forums.fast.ai/t/setup-help/65529
Ubuntu setup help used:
- https://forums.fast.ai/t/pytorch-installation-in-conda-environment-failing/6703/19
- https://forums.fast.ai/t/platform-local-server-ubuntu/65851
updates 2020: https://forums.fast.ai/t/official-part-1-2020-updates-and-resources-thread/63376
^^ Has all the code pictures and text
2020 book: https://github.com/fastai/fastbook/blob/master/01_intro.ipynb
2020 software fastai repo: https://github.com/fastai/fastai
^^ this is fastai v2 written from scratch
2020 course nbs: https://github.com/fastai/fastbook/tree/master/clean
2020 readme page fastai: https://course.fast.ai/#How-do-I-get-started?
2020 lesson 1 wiki: https://forums.fast.ai/t/lesson-1-official-topic/65626
forums: https://forums.fast.ai/
Help with setup: https://forums.fast.ai/t/setup-help/65529
Questions and answers: FASTai questions and answers for each lesson
Fastai blogs: fastai blogs in the fast.ai forum
Old
fastai2 - https://github.com/fastai/fastai2
course-v4 - https://github.com/fastai/course-v4
fastbook - https://github.com/fastai/fastbook
Collection of example projects
Lesson 1 Resources
Basics
Lesson 1 hw questions and solutions
Extra
Podcast + Writeups: Summaries + Things Jeremy Says to do + Qs
Lesson 1 notes
Basic training loop
#alt The basic training loop
gv('''ordering=in
model[shape=box3d width=1 height=0.7]
inputs->model->results; weights->model; results->performance
performance->weights[constraint=false label=update]''')
You basically have inputs->model->results. In the DL case you have a system that can learn and update something, i.e., inputs->model->results->performance: some metric is computed and used to update the weights.
inputs->model->results->performance
weights->model
performance->weights
For example, in the code below:
Inputs: images, how many times you look at each image (epochs), number of …
Model: CNN
Results: a trained model that finds cats and dogs with high probability
Performance: error_rate/accuracy on the validation data set
Weights: the tunable parameter values in the CNN model
What does this do?
# CLICK ME
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
path, get_image_files(path), valid_pct=0.2, seed=42,
label_func=is_cat, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
ImageDataLoaders
There are various different classes for different kinds of deep learning datasets and problems—here we’re using ImageDataLoaders. The first part of the class name will generally be the type of data you have, such as image, or text.
dls = ImageDataLoaders.from_name_func(
path, get_image_files(path), valid_pct=0.2, seed=42,
label_func=is_cat, item_tfms=Resize(224))
Convolutional Neural Networks
The fifth line of the code training our image recognizer tells fastai to create a convolutional neural network (CNN) and specifies what architecture to use (i.e. what kind of model to create), what data we want to train it on, and what metric to use.
learn = cnn_learner(dls, resnet34, metrics=error_rate)
There are many different architectures in fastai, which we will introduce in this book (as well as discussing how to create your own). Most of the time, however, picking an architecture isn’t a very important part of the deep learning process. It’s something that academics love to talk about, but in practice it is unlikely to be something you need to spend much time on. There are some standard architectures that work most of the time, and in this case we’re using one called ResNet that we’ll be talking a lot about during the book; it is both fast and accurate for many datasets and problems. The 34 in resnet34 refers to the number of layers in this variant of the architecture (other options are 18, 50, 101, and 152). Models using architectures with more layers take longer to train, and are more prone to overfitting (i.e. you can’t train them for as many epochs before the accuracy on the validation set starts getting worse). On the other hand, when using more data, they can be quite a bit more accurate.
cnn_learner also has a parameter pretrained, which defaults to True (so it’s used in this case, even though we haven’t specified it), which sets the weights in your model to values that have already been trained by experts to recognize a thousand different categories across 1.3 million photos (using the famous ImageNet dataset). A model that has weights that have already been trained on some other dataset is called a pretrained model.
fine_tune
cnn_learner only describes the “architecture”; fine_tune is what actually trains it.
Fine-tuning: a transfer learning technique where the parameters of a pretrained model are updated by training for additional epochs, using a different task to that used for pretraining.
So you train for additional epochs (an epoch being one pass over all the images).
I don’t get this part yet (come back to it later!):
When you use the fine_tune
method, fastai will use these tricks for you. There are a few parameters you can set (which we’ll discuss later), but in the default form shown here, it does two steps:
- Use one epoch to fit just those parts of the model necessary to get the new random head to work correctly with your dataset.
- Use the number of epochs requested when calling the method to fit the entire model, updating the weights of the later layers (especially the head) faster than the earlier layers (which, as we’ll see, generally don’t require many changes from the pretrained weights).
The head of a model is the part that is newly added to be specific to the new dataset. An epoch is one complete pass through the dataset. After calling fit
, the results after each epoch are printed, showing the epoch number, the training and validation set losses (the “measure of performance” used for training the model), and any metrics you’ve requested (error rate, in this case).
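As a rough sketch (my own illustration, not the exact fastai internals), those two steps could be written out explicitly, assuming learn is the cnn_learner created above and the LR values are purely illustrative:
learn.freeze()                                      # only the new random head is trainable
learn.fit_one_cycle(1)                              # step 1: one epoch to fit the head
learn.unfreeze()                                    # make the whole model trainable
learn.fit_one_cycle(1, lr_max=slice(1e-5, 1e-3))    # step 2: discriminative LRs, earlier layers change less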
Lesson 1 HW questions and answers
-
Do you need these for deep learning?
- Lots of math — False
- Lots of data — True
- Lots of expensive computers — False
- A PhD — False
-
Name five areas where deep learning is now the best in the world.
-
Natural language processing (NLP):: Answering questions; speech recognition; summarizing documents; classifying documents; finding names, dates, etc. in documents; searching for articles mentioning a concept
-
Computer vision:: Satellite and drone imagery interpretation (e.g., for disaster resilience); face recognition; image captioning; reading traffic signs; locating pedestrians and vehicles in autonomous vehicles
-
Medicine:: Finding anomalies in radiology images, including CT, MRI, and X-ray images; counting features in pathology slides; measuring features in ultrasounds; diagnosing diabetic retinopathy
-
Biology:: Folding proteins; classifying proteins; many genomics tasks, such as tumor-normal sequencing and classifying clinically actionable genetic mutations; cell classification; analyzing protein/protein interactions
-
Image generation:: Colorizing images; increasing image resolution; removing noise from images; converting images to art in the style of famous artists
-
Recommendation systems:: Web search; product recommendations; home page layout
-
Playing games:: Chess, Go, most Atari video games, and many real-time strategy games
-
Robotics:: Handling objects that are challenging to locate (e.g., transparent, shiny, lacking texture) or hard to pick up
-
Other applications:: Financial and logistical forecasting, text to speech, and much more…
-
-
What was the name of the first device that was based on the principle of the artificial neuron?
Mark I Perceptron
-
Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
- A set of processing units
- A state of activation
- An output function for each unit
- A pattern of connectivity among units
- A propagation rule for propagating patterns of activities through the network of connectivities
- An activation rule for combining the inputs impinging on a unit with the current state of that unit to produce an output for the unit
- A learning rule whereby patterns of connectivity are modified by experience
- An environment within which the system must operate
-
What were the two theoretical misunderstandings that held back the field of neural networks?
- A single layer of neurons could not handle simple functions like XOR (though adding a second layer fixes this)
- Neural networks were thought to be too big and slow to be useful
-
What is a GPU?
Graphics Processing Unit — it can run neural networks about 100 times faster than a CPU.
-
Open a notebook and execute a cell containing 1+1. What happens?
The output 2 is displayed.
-
Follow through each cell of the stripped version of the notebook for this chapter. Before executing each cell, guess what will happen.
Documentation can be reached by doc(PILImage.create), for example. Didn’t fully understand PILImage… but moving on… Let’s try some stuff.
-
Complete the Jupyter Notebook online appendix.
done.
-
Why is it hard to use a traditional computer program to recognize images in a photo?
This is very difficult because cats, dogs, or other objects, have a wide variety of shapes, textures, colors, and other features, and it is close to impossible to manually encode this in a traditional computer program.
No simple heuristics that you can come up with for the whole general problem.
-
What did Samuel mean by “weight assignment”?
how much each of the million parameters weigh!
“weight assignment” refers to the current values of the model parameters
-
What term do we normally use in deep learning for what Samuel called “weights”?
parameters
-
Draw a picture that summarizes Samuel’s view of a machine learning model.
inputs + parameters –> architecture –> predictions
predictions + labels –> loss
loss –> update
-
Why is it hard to understand why a deep learning model makes a particular prediction?
Because we are using a general function that can, in theory, solve almost anything, the learned parameters have no obvious human-readable interpretation.
-
What is the name of the theorem that shows that a neural network can solve any mathematical problem to any level of accuracy?
universal approximation theorem
-
What do you need in order to train a model?
Data (inputs and labels), an architecture, a loss function/metric, initial weights (possibly from a pretrained model), and a way to update the weights.
-
How could a feedback loop impact the rollout of a predictive policing model?
You end up predicting with biases. You start with predicting arrests based on the data that already exists. You are not “predicting crime”, but where “arrests” are going to be made.
Arrests are different from if crime took place… Arrests are a proxy. You are only as good as your proxy.
-
Do we always have to use 224×224-pixel images with the cat recognition model?
No, we can use whatever size we want; 224 is just a common default.
-
What is the difference between classification and regression?
Note: Classification and Regression: classification and regression have very specific meanings in machine learning. These are the two main types of model that we will be investigating in this book. A classification model is one which attempts to predict a class, or category. That is, it’s predicting from a number of discrete possibilities, such as “dog” or “cat.” A regression model is one which attempts to predict one or more numeric quantities, such as a temperature or a location. Sometimes people use the word regression to refer to a particular kind of model called a linear regression model; this is a bad practice, and we won’t be using that terminology in this book!
-
What is a validation set? What is a test set? Why do we need them?
Validation set: to validate the model during development. Test set: to check at the very end, to keep ourselves intellectually honest.
The validation set is used by us to test hyperparameters: we see that training was poor based on the validation set, then we go and vary hyperparameters such as the architecture. Although the algorithm doesn’t see the validation set directly, it effectively sees it through us.
We want to prevent the computer from memorizing.
The solution to this conundrum is to introduce another level of even more highly reserved data, the test set. Just as we hold back the validation data from the training process, we must hold back the test set data even from ourselves. It cannot be used to improve the model; it can only be used to evaluate the model at the very end of our efforts. In effect, we define a hierarchy of cuts of our data, based on how fully we want to hide it from training and modeling processes: training data is fully exposed, the validation data is less exposed, and test data is totally hidden. This hierarchy parallels the different kinds of modeling and evaluation processes themselves—the automatic training process with back propagation, the more manual process of trying different hyper-parameters between training sessions, and the assessment of our final result.
Having two levels of “reserved data”—a validation set and a test set, with one level representing data that you are virtually hiding from yourself—may seem a bit extreme. But the reason it is often necessary is because models tend to gravitate toward the simplest way to do good predictions (memorization), and we as fallible humans tend to gravitate toward fooling ourselves about how well our models are performing. The discipline of the test set helps us keep ourselves intellectually honest. That doesn’t mean we always need a separate test set—if you have very little data, you may need to just have a validation set—but generally it’s best to use one if at all possible.
-
What will fastai do if you don’t provide a validation set?
It will hold out 20% of the data as a validation set by default (valid_pct=0.2).
-
Can we always use a random sample for a validation set? Why or why not?
No; in time series you might want to use a time frame in the past to predict a time frame in the future.
why? not representative of most business use cases (where you are using historical data to build a model for use in the future)
-
What is overfitting? Provide an example.
Overfitting refers to when the model fits too closely to a limited set of data but does not generalize well to unseen data.
When you effectively fit to both your validation and training sets, you are bound to overfit.
Example: the fitted function stops being smooth and produces spurious values between the training points.
-
What is a metric? How does it differ from “loss”?
The concept of a metric may remind you of loss, but there is an important distinction. The entire purpose of loss is to define a “measure of performance” that the training system can use to update weights automatically. In other words, a good choice for loss is a choice that is easy for stochastic gradient descent to use. But a metric is defined for human consumption, so a good metric is one that is easy for you to understand, and that hews as closely as possible to what you want the model to do. At times, you might decide that the loss function is a suitable metric, but that is not necessarily the case.
They sound exactly the same to me. I don’t have an example of when loss and metric would differ, or what that would look like. ??
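A toy illustration (my own made-up numbers, not from the book) of how they can differ: a small improvement in the predictions can leave the error rate unchanged while the loss still goes down, which is why the loss is the better training signal:
import torch
labels = torch.ones(4)                                   # four images, all actually "dog"
before = torch.tensor([0.60, 0.70, 0.55, 0.90])          # predicted P(dog) before an update
after  = torch.tensor([0.70, 0.80, 0.60, 0.95])          # predicted P(dog) after a small update
error_rate = lambda p: ((p > 0.5) != labels.bool()).float().mean()
nll = lambda p: -torch.log(p).mean()                     # negative log likelihood of the correct class
print(error_rate(before), error_rate(after))             # 0.0 and 0.0 -> the metric doesn't move
print(nll(before), nll(after))                           # the loss decreases -> still useful for SGD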
-
How can pretrained models help?
Pretrained models have been trained on other problems that may be quite similar to the current task. For example, pretrained image recognition models were often trained on the ImageNet dataset, which has 1000 classes focused on a lot of different types of visual objects. Pretrained models are useful because they have already learned how to handle a lot of simple features like edge and color detection. However, since the model was trained for a different task than the one at hand, it cannot be used as-is; the last part (the head) is replaced and fine-tuned.
-
What is the “head” of a model?
The last part of the model — the part that you remove from a pretrained model and replace with newly added layers that you train yourself.
-
What kinds of features do the early layers of a CNN find? How about the later layers?
Earlier layers learn simple features like diagonal, horizontal, and vertical edges. Later layers learn more advanced features like car wheels, flower petals, and even outlines of animals.
-
Are image models only useful for photos?
No… there are examples using sound (converted to spectrograms), or anything else you can convert to an image.
-
What is an “architecture”?
template for the mathematical functions
-
What is segmentation?
Classifying every individual pixel of an image (e.g., labeling each pixel as road, building, or tree).
-
What is y_range used for? When do we need it?
It specifies the range of the target values (e.g., 0.5–5 for movie ratings); we need it for regression problems where the target is continuous.
-
What are “hyperparameters”?
Knobs we want to tune based on the validation set — parameters about the training process itself, which in turn affect how the model’s parameters are learned.
network architecture, learning rates, data augmentation strategies, and other factors
Training models requires various other parameters that define how the model is trained. For example, we need to define how long we train for, or what learning rate (how fast the model parameters are allowed to change) is used. These sorts of parameters are hyperparameters.
-
What’s the best way to avoid failures when using AI in an organization?
Have a training and a validation and a test set.
Experiment time
Experimenting is the most important step according to alumni of the course:
- Sign up for Bing & extract images
- Understand the data block API
- Build a model to perform classification
- Interpret the confusion matrix
- Analyse the results by plotting losses
- Export the model & run inference on any other image of your choice
Lesson 2 Resources
Lesson 2 hw questions and solutions
Lesson resources
- Lesson video (private)
- Covid-19 masks video (public)
- fastbook chapter 2
- fastbook Chapter 2 questionnaire solutions - feel free to contribute!
- Non-beginner discussion
- Part 1 (2020) - Weekly Beginner Only Review and Q&A
- Anki deck for Lessons 1 and 2
Links from lesson
- Masks research
- masks4all - from Czech Republic
- Publication “The ASA Statement on p -Values” & article “ Null Hypothesis Significance Testing Never Worked”
- Designing great data products
- Bing Image Search API
Deployment guides
-
Deploying the Bear Detector app with Flask + Gunicorn + Domain name with SSL certificate, plus Dockerizing the app: https://github.com/fastai/course-v3/blob/master/docs/deployment_flask_gunicorn_docker.md (just sent pull request to correct missing /docs in link to image oops)
-
Deployment Platform: SeeMe.ai - the AI marketplace: Easily deploy and share your Fastai (v1 and v2) models on the web, mobile, Python SDK, …. We have added support for fastai2 and will continue to support it throughout its development to a stable version. We created a fastai2 deployment quick guide, taking you through all the steps of training, deploying, using, and sharing your model with your friends and the world. Additional links: Deployment platform: Seeme.ai - wiki topic | Fastai v1 quick guide
-
Deployment Platform: Amazon SageMaker. Follow the steps outlined in the following post to train and deploy your fastai2 model with Amazon SageMaker.
-
Deployment Platform: Azure Functions. Follow the steps outlined in this Github repo to deploy a trained fastai2 model on Azure cloud’s serverless infrastructure service – the Azure Functions. Welcome your feedback and questions on this topic thread.
Deployment examples
- Bear Detector web app: a Starlette example for deployment in fastai2
- Derotate: a code template for image-to-class and image-to-image fastai2 deployments with web app, mobile app (in Flutter), and a notebook showing how to call the model remotely as an API.
- Bear detector web app deployed with flask: example of the bear detector app deployed with fast.ai v2, flask, gunicorn and nginx
Other useful links
- Lesson Summaries + Things Jeremy Says to do + Qs
- Notes by @Lankinen
- Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations
Lesson 2 notes
Getting bing images for the key
To get key once done go here or generally
Metrics are printed out after we look at every single image ONCE (each epoch each time you have “looked” at all the images).
overfitting example needed
loss and metric example needed
The loss could be error_rate, but in a classification problem a small change in parameters won’t change the classification from dog to cat, so the gradient would be zero almost everywhere. We therefore need something other than error_rate as the loss, and we continue to use metrics to understand overfitting.
Overfitting
Overfitting is measured by looking at the METRICS that come out of the validation set. The fastai program doesn’t see the validation set during training; performing well on data the model hasn’t seen before is the objective.
Don’t want to memorize
You still end up cheating/overfitting as you look at the metrics (produced from the validation data) and change the hyperparameters.
Something about training loss, validation loss and metric decreases, and how to identify if you are overtraining… didn’t quite get that (15:56)?
Fine tuning unclear?
It is unclear what he means when he says that first you use one epoch to fit the head, and then you use the requested number of epochs for the whole model:
When you use the fine_tune method, fastai will use these tricks for you. There are a few parameters you can set (which we’ll discuss later), but in the default form shown here, it does two steps:
Use one epoch to fit just those parts of the model necessary to get the new random head to work correctly with your dataset. Use the number of epochs requested when calling the method to fit the entire model, updating the weights of the later layers (especially the head) faster than the earlier layers (which, as we’ll see, generally don’t require many changes from the pretrained weights).
transfer learning
2012 imagenet winners using visualization Zeiler and Fergus
Applications
vision tabular text recsys
Where to look
Modelzoo pretrained model deep learning imagenet
Medical images not there
Read paper
High temperature and High Humidity reduce the transmission of covid-19
Check out the blog post on how to do machine learning and how to ask the right questions (1:08:00).
Brilliant
Jeremy uses p-values to question a paper. Someone finds a trend between temperature and R (a measure of covid transmissibility) by plotting it for 100 cities (52:00).
The question being: could this have happened by chance? Of course there is the whole causation vs correlation discussion, but this chance aspect is extremely interesting.
Jeremy takes a similar standard deviation and mean, randomly generates data for 100 cities, and sees what sort of R-vs-temperature relationship he gets. With this (admittedly simplistic) simulation he then ends up discrediting the paper. I think what would be nice is a p-value over many such simulations, showing where the paper’s prediction stands relative to the null hypothesis. :) GREAT.
Here he also effectively tests what happens when you look at only 100 cities.
But the American Statistical Association warns against p-values. Be careful while using them? What was the conclusion? (57:59)
One of the claims: the more data you use, the smaller the p-value becomes, so don’t rely on a fixed threshold?
Jeremy suggests a multivariate model instead, i.e., transmission dependent on GDP, city density, and temperature — and then you get a p-value of 0.01.
Check out work by Frank Harrell about p-values
Bing Image Search API
Gives info such as URLs of images from the internet which are labeled.
Error upon running notebook
RuntimeError: DataLoader worker (pid 19862) is killed by signal:
Killed.
Seems to be a memory issue. Somehow it switched to only CPU and the GPU plan got removed. Started a new notebook container.
Somehow the system in gradient changed from P5000 to CPU. I don’t understand how though.
| Instance | GPU | CPU |
|---|---|---|
| Free-P5000 | 16GB | 30GB |
| Free-GPU | 8GB | 30GB |
| Free-CPU | 0GB | 2GB |
| My PC | 4GB | 8GB |
Follow this issue; apparently it is not supposed to switch to “Free-CPU”. Very frustrating. And there seems to be a person from Paperspace who could potentially help.
Deployment errors
binder does not work
forked bear-viola from the main hosters and it didn’t work with binder and there was no way to see what the error was.
Heroku
With Heroku I had to add a requirements.txt and a runtime.txt (specifying the right version of Python for the wheels in requirements.txt). The Procfile allows adding the --debug option, which shows errors.
So, Procfile
with and without debug:
web: voila --port=$PORT --no-browser --enable_nbextensions=True obama_classifier_min_binder.ipynb
web: voila --debug --port=$PORT --no-browser --enable_nbextensions=True obama_classifier_min_binder.ipynb
requirements.txt now has, and seems to need, only the following for lesson 2. Note that Heroku only allows 950 MB slugs, so we use the CPU version of PyTorch as a result:
https://download.pytorch.org/whl/cpu/torch-1.6.0%2Bcpu-cp38-cp38-linux_x86_64.whl
https://download.pytorch.org/whl/cpu/torchvision-0.7.0%2Bcpu-cp38-cp38-linux_x86_64.whl
fastai==2.0.11
voila
ipywidgets
Lastly we need to specify the exact Python version in runtime.txt, otherwise we get the `not the right wheels` error.
python-3.8.5
Most useful links were the old fast.ai deployment-on-Heroku guide and the link from someone for whom this worked, aka whatgame.
Useful links
Other github repos from which I copied ideas
Successful deployment by agent
Writing the blog lessons fastai
I am thinking I should start with the basics of data…. how I clean, how I do other things and then finish with the app. Basically take people through the entire journey of development.
More info about fast yaml front matter is here
Lesson 2 HW questions and answers
Flashcards of Lesson 2 questions and answers
-
Provide an example of where the bear classification model might work poorly in production, due to structural or style differences in the training data.
- Say black and white images
- bear hiding behind tree
- bear and people perhaps?
- polar bear perhaps
- hand drawn bears
- night time images
- small in the picture bear?
-
Where do text models currently have a major deficiency?
Deep learning is also very good at generating context-appropriate text, such as replies to social media posts, and imitating a particular author’s style. It’s good at making this content compelling to humans too—in fact, even more compelling than human-generated text. However, deep learning is currently not good at generating correct responses! We don’t currently have a reliable way to, for instance, combine a knowledge base of medical information with a deep learning model for generating medically correct natural language responses. This is very dangerous, because it is so easy to create content that appears to a layman to be compelling, but actually is entirely incorrect.
-
What are possible negative societal implications of text generation models?
Another concern is that context-appropriate, highly compelling responses on social media could be used at massive scale—thousands of times greater than any troll farm previously seen—to spread disinformation, create unrest, and encourage conflict. (fake news)
-
In situations where a model might make mistakes, and those mistakes could be harmful, what is a good alternative to automating a process?
Regular interventions to check by human pros. E.g., medical diagnoses
For example, a machine learning model for identifying strokes in CT scans can alert high priority cases for expedited review, while other cases are still sent to radiologists for review
-
What kind of tabular data is deep learning particularly good at?
Deep learning is good at analyzing tabular data that includes natural language, or high cardinality categorical columns (i.e. Zip codes)
-
What’s a key downside of directly using a deep learning model for recommendation systems?
Machine learning approaches for recommendation systems will often only tell what products a user might like, and may not be recommendations that would be helpful to the user. For example, if a user is familiar with other books from the same author, it isn’t helpful to recommend those products even though the user bought the author’s book. Or, recommending products a user may have already purchased.
After training, the model knows what X has bought in the past. But that doesn’t mean X will want to buy the same or similar things again (e.g., another boxed set of Harry Potter novels).
-
What are the steps of the Drivetrain Approach?
Objective –> Levers –> DATA –> Models
Show most relevant search –> rankings of search results –> get data on linked sites –> build model
Vague… move on
-
How do the steps of the Drivetrain Approach map to a recommendation system?
The objective of a recommendation engine is to drive additional sales by surprising and delighting the customer with recommendations of items they would not have purchased without the recommendation. The lever is the ranking of the recommendations. New data must be collected to generate recommendations that will cause new sales . This will require conducting many randomized experiments in order to collect data about a wide range of recommendations for a wide range of customers. This is a step that few organizations take; but without it, you don’t have the information you need to actually optimize recommendations based on your true objective (more sales!)
Vague… Move on
-
Create an image recognition model using data you curate, and deploy it on the web.
done.
-
What is DataLoaders?
DataLoaders is a thin class that just stores whatever DataLoader objects you pass to it, and makes them available as train and valid. Although it’s a very simple class, it’s very important in fastai: it provides the data for your model.
-
What four things do we need to tell fastai to create DataLoaders?
- What kinds of data we are working with
- How to get the list of items
- How to label these items
- How to create the validation set
-
What does the splitter parameter to DataBlock do?
It splits the data into training and validation sets.
-
How do we ensure a random split always gives the same validation set?
seed in Data loaders or split ahead of time.
-
What letters are often used to signify the independent and dependent variables?
x for independent and y for dependent
The independent variable is the thing we are using to make predictions from, and the dependent variable is our target. In this case, our independent variables are images, and our dependent variables are the categories (type of bear) for each image
-
What’s the difference between the crop, pad, and squish resize approaches? When might you choose one over the others?
Crop cuts out a portion of the image, pad adds padding around it, and squish distorts the aspect ratio to fit. RandomResizedCrop is often preferred because a different crop is taken at each epoch, so across epochs all the information is still available.
-
What is data augmentation? Why is it needed?
Data augmentation creates random variations of the inputs (rotations, contrast changes, flips, etc.). It is needed so the model sees more variety and generalizes better to the validation set.
-
What is the difference between item_tfms and batch_tfms?
item_tfms are applied to items one by one; batch_tfms are transforms performed on a whole batch at a time (typically on the GPU).
-
What is a confusion matrix?
let’s see whether the mistakes the model is making are mainly thinking that grizzlies are teddies (that would be bad for safety!), or that grizzlies are black bears, or something else
It is calculated (Of course) based on the valid set.
-
What does export save?
A pickle file, i.e., the model in a readily usable form.
export saves both the architecture and the trained parameters of the neural network. It also saves how the DataLoaders are defined.
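A minimal usage sketch (the file name is the fastai default; the image path is hypothetical):
learn.export('export.pkl')                        # saves architecture, trained weights, and the DataLoaders definition
learn_inf = load_learner('export.pkl')            # later (e.g., on the server), load it back for inference
pred, pred_idx, probs = learn_inf.predict('images/some_bear.jpg')  # hypothetical image path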
-
What is it called when we use a model for getting predictions, instead of training?
Inference
-
What are IPython widgets?
widgets such as upload button.
-
When might you want to use CPU for deployment? When might GPU be better?
CPU for running on one image (deployment)
GPU for running on multiple images (aka modelling)
-
What are the downsides of deploying your app to a server, instead of to a client (or edge) device such as a phone or PC?
The application will require network connection, and there will be extra network latency time when submitting input and returning results. Additionally, sending private data to a network server can lead to security concerns.
-
What are three examples of problems that could occur when rolling out a bear warning system in practice?
- false positives
- unseen data
- unknown regions (night time imaging not good)
- speed of prediction
- resolution of images
-
What is “out-of-domain data”?
Data that is fundamentally different in some aspect compared to the model’s training data. For example, an object detector that was trained exclusively with outside daytime photos is given a photo taken at night.
-
What is “domain shift”?
This is when the type of data changes gradually over time. For example, an insurance company is using a deep learning model as part of their pricing algorithm, but over time their customers will be different, with the original training data not being representative of current data, and the deep learning model being applied on effectively out-of-domain data.
-
What are the three steps in the deployment process?
Manual process → limited scope deployment → gradual expansion.
Hacks
Documentation
https://stackoverflow.com/a/64708015/5986651
All things should have module attributes usually.
Finding info on a function: use function.__module__ to know which module it comes from. Use ? and ?? to see the signature and source code, and doc() to see what is happening — e.g. first?, first??.
In Python, __module__ is a class-level attribute; requires_grad_ is a member method of torch.Tensor, not a class itself, so it has no __module__ attribute of its own.
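For example, a quick check I would run in a notebook (the function is just a convenient one to inspect):
from fastai.data.external import untar_data
print(untar_data.__module__)   # tells you which module the function is defined in
# In a notebook cell you can also run:
# untar_data?      # signature and docstring
# untar_data??     # full source code
# doc(untar_data)  # fastai's doc() shows the docs with a link to the full documentation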
Other lesson 2 Resources
-
Example blogs, projects and tutorial
-
Lesson 1 TOP examples
Lesson 3 Resources
- fastbook chapter 2
- fastbook chapter 2 questionnaire solutions
- fastbook chapter 4 questionnaire solutions
Lesson resources
- Video link (private)
- fastbook chapter 2
- fastbook chapter 4
- Non-beginner discussion
- fastbook chapter 2 questionnaire solutions
- fastbook chapter 4 questionnaire solutions - feel free to contribute!
Links from lesson
- #mask4all
- Jeremy’s Washington Post article
- Building Machine Learning Powered Applications - by Emmanuel Ameisen
- Actionable Auditing - by Deborah Raji
Other useful links
Notes by @Lankinen
Lesson 3 Notes
-
Look at data properly e.g., looking at “good skin” and getting white people touching their faces
-
Always create a baseline
pytorch gradient
xt = tensor([3.,4.,10.]).requires_grad_()
def f(x): return (x**2).sum()  # example function; backward() needs a scalar output
yt = f(xt)
yt.backward()  # computes the gradients
xt.grad  # accesses the gradients at the points in xt (an attribute, not a method); here it equals 2*xt
torch.cat is for concatenation, view is for reshaping; -1 tells the function to figure out that dimension on its own.
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
unsqueeze is for redimensioning a “vector” into a tensor with an extra dimension, such as [2000, 1] or [1, 2000] from just [2000].
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
Getting a tuple from image+label
dset = list(zip(train_x,train_y))
x,y = dset[0]
x.shape,y
Todo
- run and understand the cells
- do hw cosins
Lesson 3 HW questions and answers
Coming days
- Lecture 3 video
- start with application notebook
- deploy application
-
think of where it can be used in my current setting
-
Do the questions
- chapter 2
- chapter 3??
Lesson 4
This is lesson 4, which is week 5 since last week was out of order.
Note: This is a wiki post - feel free to edit to add links from the lesson or other useful info.
Resources
- Video
- fastbook chapter 4
- fastbook chapter 4 questionnaire solutions
- fastbook chapter 5
- fastbook chapter 5 questionnaire solutions
- Non-beginner discussion
Links from lesson
Other useful links
Notes by @Lankinen
python R summary equivalent
Without pandas:
from scipy import stats
stats.describe(lst)
stats.scoreatpercentile(lst, (5, 10, 50, 90, 95))
Here is an example:
from scipy import stats
import numpy as np
stdev = 10
mu = 10
a=stdev*np.random.randn(100)+mu
stats.describe(a)
[OUT1]: DescribeResult(nobs=100, minmax=(-13.180682481878286, 40.6109521437826), mean=10.352380786199149, variance=103.27168865119998, skewness=0.13852516641657087, kurtosis=0.2691915766145532)
stats.scoreatpercentile(a,(5,10,50,90,95))
[OUT2]: array([-7.21731609, -3.22696662, 10.39364637, 21.78527621, 24.20685179])
plot tensor
im3_t = tensor(im3)
df = pd.DataFrame(im3_t[4:15,4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')
list zip; combine two tensors
dset= list(zip(train_x,train_y))
x,y = dset[0]
x.shape,y
[Out1]: (torch.Size([784]), tensor([1]))
generate quick tuples
https://discuss.pytorch.org/t/leaf-variable-was-used-in-an-inplace-operation/308
ds = L(enumerate(string.ascii_lowercase))
ds
make a quick list
import random
my_randoms = random.sample(range(100), 10)
leaf variable non-leaf variable
Loosely, tensors you create directly are leaf variables. Tensors that are the result of a differentiable operation are not leaf variables
For example:
w = torch.tensor([1.0, 2.0, 3.0]) # leaf variable
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True) # also leaf
variable
y = x + 1 # not a leaf variable
An in-place operation is something which modifies the data of a variable. For example:
x += 1 # in-place
y = x + 1 # not in place
PyTorch doesn’t allow in-place operations on leaf variables that have
requires_grad=True
(such as parameters of your model) because the
developers could not decide how such an operation should behave. If
you want the operation to be differentiable, you can work around the
limitation by cloning the leaf variable (or use a non-inplace version
of the operator).
x2 = x.clone() # clone the variable
x2 += 1 # in-place operation
runtimeerror leaf variable
RuntimeError: a leaf Variable that requires grad is being used in an
in-place operation.
The new style to express this is to use torch.no_grad() to signify that you don’t want to track the gradient of this operation (but z.data is OK too):
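# A minimal sketch of this newer style (variable names and values are my own, not from the lesson):
import torch
p = torch.ones(3, requires_grad=True)         # a leaf parameter
loss = (p ** 2).sum()
loss.backward()
with torch.no_grad():                         # don't track this update in the autograd graph
    p -= 0.1 * p.grad                         # in-place update on a leaf is allowed inside no_grad()
    p.grad.zero_()                            # reset the gradient for the next step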
shuffle=True not working in DataLoader
I don’t know what it is, but in lesson 4, if I run with shuffle=True I get validation results of 50%, meaning somewhere things are lost in translation, i.e., the correspondence between pictures and their labels doesn’t exist anymore.
Making grad zero
Two ways:
p.grad.zero_()
p.grad = None
course to do
Statistics course, SQL, keras, tensorflow
Lesson 4-5
Notes 4-5
So basically we start with:
-
Predictions for each breed: a [1x37] vector per image (? unsure exactly how it is calculated). I know how to do predictions per value (tens @ w + b).
-
Predictions as probabilities using softmax –> e^x / e^x.sum() –> the e^x also exaggerates the gaps (e^1 and e^1.2 differ by about 0.6) –> these are the activations.
-
Loss is based on NLL (negative log likelihood) –> instead of torch.where (which only handles 2 classes), for more classes we index the activations: -sm_acts[idx, targ], which is what F.nll_loss(sm_acts, targ, reduction='none') computes. The loss is based on the target.
-
Take the log of the activations to exaggerate the difference between 0.99 and 0.999; this is done in softmax as log_softmax. When optimizing, the system will then treat going from 0.99 to 0.999 as a big improvement, which is what we want, as suggested here.
-
log_softmax + negative log likelihood is cross entropy (see the check below).
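A small sanity check of that chain (made-up activations, my own example) showing that softmax + log + NLL matches PyTorch's cross_entropy:
import torch
import torch.nn.functional as F
acts = torch.randn(3, 5)                           # activations: batch of 3 images, 5 classes
targ = torch.tensor([0, 3, 4])                     # target class index for each image
sm_acts = torch.log_softmax(acts, dim=1)           # softmax then log, in one numerically stable step
nll = F.nll_loss(sm_acts, targ, reduction='none')  # picks out -log(prob) of the correct class per image
ce = F.cross_entropy(acts, targ, reduction='none') # log_softmax + nll in a single call
print(torch.allclose(nll, ce))                     # True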
-
What does path object do? is it an object?
-
Issues with my nn (why random was not working)?
Time to delve deep into this shit?
-
- Understand what’s happening with the DataBlock
-
using a number other than 0.5 in predictions
-
What do the @’s do?
-
Why does preds,_ = learn.get_preds(0) take time, and why is it different from preds,_ = learn.get_preds(dl=[(x,y)])?
-
How do you get many activations? for 3 and 7? How do you generate 37 activations? What are the activations in the 3 or 7 case? get_preds()
How do you get 2 predictions one for 3 category and 7 category where are the 2 activations?
How to get 2 activations for binary case which then becomes [0.4 and 0.6] based on softmax (exp(x)/exp(x).sum)
Lesson 5 nb3
This is a wiki post - feel free to edit to add links from the lesson or other useful info.
Note that the video and this thread are called “lesson 5” because that’s what this will be in the MOOC, even though it’s week 4 of the course.
Lesson resources
Links from lesson
Other useful links
- Datasheets for datasets
- Weapons of math destruction
- AI Generated Faces
- Papers/repos/tools on how to check for bias in your datasets/algorithms AND address it
- Markkula Center - Ethical Toolkit
- The Pollyannish Assumption
- Understanding the context and consequences of pre-trial detention
- Fast.ai community ethics resources & discussion thread. Author credit: nbharatula
- Montreal AI Ethics Institute Weekly Newsletter #6: Radioactive data, attacking deep RL, steering AI progress, sucker’s list, AI ethics in marketing and more …
Lesson 6 nb5 nb6 nb8
Note: This is a wiki post - feel free to edit to add links from the lesson or other useful info.
Resources
- Video (private)
- fastbook chapter 5
- fastbook chapter 5 questionnaire solutions - feel free to contribute!
- fastbook chapter 6
- fastbook chapter 6 questionnaire solutions - feel free to contribute!
- fastbook chapter 8
- fastbook chapter 8 questionnaire solutions - feel free to contribute!
- Non-beginner discussion
Links from lesson
- Python for Data Analysis - Wes McKinney - link
Other useful links
- BCE vs BCEWithLogitLoss - Pytorch Forum Discussion - link
- Mixed Precision Training Comparison - link
- Notes by @Lankinen
Lesson 6 nb 5 notes
Learner explanation and how to get better accuracy
- Learning rates need to be found
- Rule of thumb (ROT): use the LR at the minimum of the loss curve divided by 10
- Using the same LRs for all epochs doesn’t work
- e.g., start with Lr of first epoch and then check LR using LR_find #pnn
- Use fit_one_cycle –> apparently it starts with a low LR, gradually increases it until the middle of training, then reduces it again towards the end. I guess this is where the lr_max you provide comes in.
- Using the same LR across all layers is another criminal waste of time.
Discriminative LRs
Lower layers barely need any change; upper layers need more change. (See the sketch after this list.)
- fine_tune():
- Freezes the layer groups up to the last one (self.freeze)
- The number of epochs where only the last layer group is trained is usually freeze_epochs=1
- Now you do fit_one_cycle for freeze_epochs epochs
- then unfreeze()
- and fit_one_cycle for the requested number of epochs
-
If you find that training plateaus at the 8th epoch out of 12, rerun with only 8 epochs to avoid overshooting.
-
cnn_learner freezes the body by default <– Important
-
We could also choose a bigger architecture (but it takes more time to train and can potentially give a cuda gpu error)
- Might need to do cross-validation to improve accuracy
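A minimal sketch of the recipe above (my own illustration; LR values are purely indicative and dls is assumed to already be defined in the notebook):
learn = cnn_learner(dls, resnet34, metrics=error_rate)   # body frozen by default
lr_min, lr_steep = learn.lr_find()                       # rule of thumb: use roughly the minimum / 10
learn.fit_one_cycle(3, 3e-3)                             # train only the head for a few epochs
learn.unfreeze()
learn.lr_find()                                          # find a fresh LR for the unfrozen model
learn.fit_one_cycle(6, lr_max=slice(1e-6, 1e-4))         # discriminative LRs: earliest layers get the smallest LR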
Other things to speed up with same accuracy
from fastai.callback.fp16 import *
# mixed-precision training: call .to_fp16() on the learner
learn = cnn_learner(dls, resnet50, metrics=f1_score_multi).to_fp16()
Questions
-
How is one epoch trained? Is SGD changing per batch
-
Why does preds,_ = learn.get_preds(0) take time, and why is it different from preds,_ = learn.get_preds(dl=[(x,y)])?
-
Why lr_steep is important? why not lr_min?
-
Why is Jeremy training frozen for 3 epochs?
-
how to make the model better at the issues raised by interpretation class.
-
- HOW TO IMPROVE FURTHER? WHAT HAVE OTHER PEOPLE DONE?
- Look up the documentation for L and try using a few of the new methods that it adds.
- Look up the documentation for the Python pathlib module and try using a few methods of the Path class.
- For the loss you go all the way with NLL, but for the predictions you only do softmax?
Lesson 6 nb6 notes
Datablocks
Dataset:: A collection that returns a tuple of your independent and dependent variable for a single item
DataLoader:: An iterator that provides a stream of mini-batches, where each mini-batch is a tuple of a batch of independent variables and a batch of dependent variables
Datasets:: An object that contains a training Dataset and a validation Dataset
DataLoaders:: An object that contains a training DataLoader and a validation DataLoader
Dataset –> one tuple
DataLoader –> contains mini-batch tuples
Datasets –> contains both training and valid Dataset
DataLoaders –> contains both training and valid DataLoader (mini-batch tuples of both)
- Create a DataBlock
- Pass get_x and get_y functions; the dblock gives you dsets (just tuples) and dataloaders (mini-batches)
- All components together, including a custom splitter function:
def splitter(df):
train = df.index[~df['is_valid']].tolist()
valid = df.index[df['is_valid']].tolist()
return train,valid
dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
splitter=splitter,
get_x=get_x,
get_y=get_y)
dsets = dblock.datasets(df)
dsets.train[0]
- Including grabbing the right image portion using
item_tfms
dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
splitter=splitter,
get_x=get_x,
get_y=get_y,
item_tfms = RandomResizedCrop(128, min_scale=0.35))
dls = dblock.dataloaders(df)
Note there is already some randomization of order within the dsets, whether you create mini-batches or not (within the training set and within the valid set, I mean).
Tabular random splitter
np.random.seed(42)
msk = np.random.rand(len(df)) < 0.8
splits = (list(np.where(msk)[0]),list(np.where(~msk)[0]))
Losses
-
binary cross-entropy loss is the same as mnist_loss except for the log function:
def binary_cross_entropy(inputs, targets):
    inputs = inputs.sigmoid()
    return -torch.where(targets==1, inputs, 1-inputs).log().mean()
-
Losses
-
MSELoss –> distance measuring
- used for PointBlock (e.g., regressing image centers)
- Important: need math.sqrt(valid_loss) to get the final “error rate”
-
mnist_loss –> sigmoid the activations + torch.where (an if statement; basically an operation that picks the loss based on the target)
- Used for binary category problems (e.g., 3 vs 7)
-
Cross-entropy loss –> log_softmax (normalizing with exp) of the activations + an indexing function similar to torch.where
- e.g., more than two categories (3 or 7 or 8)
-
Binary cross-entropy loss –> sigmoid and then log + torch.where
F.binary_cross_entropy and its module equivalent nn.BCELoss calculate cross-entropy on a one-hot-encoded target, but do not include the initial sigmoid. Normally for one-hot-encoded targets you’ll want F.binary_cross_entropy_with_logits (or nn.BCEWithLogitsLoss), which do both sigmoid and binary cross-entropy in a single function, as in the preceding example.
The equivalent for single-label datasets (like MNIST or the Pet dataset), where the target is encoded as a single integer, is F.nll_loss or nn.NLLLoss for the version without the initial softmax, and F.cross_entropy or nn.CrossEntropyLoss for the version with the initial softmax.
-
Accuracy
Single Label Accuracy
def accuracy(inp, targ, axis=-1):
"Compute accuracy with `targ` when `pred` is bs * n_classes"
pred = inp.argmax(dim=axis)
return (pred == targ).float().mean()
Multi Label Accuracy
def accuracy_multi(inp, targ, thresh=0.5, sigmoid=True):
"Compute accuracy when `inp` and `targ` are the same size."
if sigmoid: inp = inp.sigmoid()
return ((inp>thresh)==targ.bool()).float().mean()
-
We need to pass the metric as a function, so we use partial to do so with different thresholds:
learn = cnn_learner(dls, resnet50, metrics=partial(accuracy_multi, thresh=0.2))
learn.fine_tune(3, base_lr=3e-3, freeze_epochs=4)
Partial
def say_hello(name, say_what="Hello"): return f"{say_what} {name}."
say_hello('Jeremy'),say_hello('Jeremy', 'Ahoy!')
f = partial(say_hello, say_what="Bonjour")
f("Jeremy"),f("Sylvain")
Threshold
It’s always good to be “practical”. We are going to use the validation set to determine the threshold needed. In theory we are “overfitting”
- Note: we are not picking one value from a bumpy curve, but from a really smooth curve.
xs = torch.linspace(0.05,0.95,29)
accs = [accuracy_multi(preds, targs, thresh=i, sigmoid=False) for i in xs]
plt.plot(xs,accs);
Looking at image files
- Get them.
img_files = get_image_files(path)
def img2pose(x): return Path(f'{str(x)[:-7]}pose.txt')
img2pose(img_files[0])
- Open them.
im = PILImage.create(img_files[0])
im.shape
Note: shape is (col_len, row_len) :(
Different types of Block
-
ImageBlock (for images)
-
PointBlock (for 2-point tensors, e.g., coordinates)
-
MultiCategoryBlock (multiple categories can apply to one item, e.g., several objects in the same image)
-
CategoryBlock (one category chosen at a time, e.g., pet breeds, or 3 or 7 or 9 ;) )
Splitter
Splits Training and Validation sets.
-
Split just one person
splitter=FuncSplitter(lambda o: o.parent.name=='13')
-
Split according to given df
def splitter(df):
train = df.index[~df['is_valid']].tolist()
valid = df.index[df['is_valid']].tolist()
return train,valid
dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
splitter=splitter,
get_x=get_x,
get_y=get_y)
dsets = dblock.datasets(df)
dsets.train[0]
-
Random splitter
splitter=RandomSplitter(seed=42)
-
grandparent splitter: GrandparentSplitter() splits items by the name of their grandparent folder (typically train/valid)
Y_range
cnn_learner
Here y_range is chosen for the regression model.
learn = cnn_learner(dls, resnet18, y_range=(-1,1))
Lesson 6 nb8 ka notes
Latent factors (scifi or action or comedy etc. but we don’t have any info on them), movielens ranking
Collaborative Filtering e.g., NETFLIX recommendations
-
Use cases are where we need to know what sort of diagnosis might apply to someone, figure out what they will click next, etc. — anywhere you learn from past behavior.
-
Cross-tab the data to look at the 3 columns: two of them become the x and y axes, with the ratings as the numbers in between.
-
CollabDataLoaders.from_df() needs a user_name, item_name (title), and rating_name:
CollabDataLoaders.from_df( ratings, valid_pct=0.2, user_name=None, item_name=None, rating_name=None, seed=None, path='.', bs=64, val_bs=None, shuffle_train=True, device=None, )
-
We know how to do matrix multiplication and find gradients for that; we don’t have a matrix operation for “look up an index”.
So we use one-hot encoding to extract the corresponding rows and columns.
-
Embedding is the matrix multiplication of a matrix (5x1000) with a one hot coded vector (1000x1) but is done with just indices without really creating the one hot coded vector.
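A small sketch of that equivalence (my own made-up sizes, not from the notebook):
import torch
n_users, n_factors = 10, 5
user_factors = torch.randn(n_users, n_factors)     # the latent-factor matrix
one_hot = torch.zeros(n_users)
one_hot[3] = 1.                                    # one-hot encoding of user index 3
via_matmul = one_hot @ user_factors                # matrix multiply picks out row 3
via_index = user_factors[3]                        # an embedding does this lookup directly by index
print(torch.allclose(via_matmul, via_index))       # True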
Embedding
- So basically we have a batch (x,y = dls.one_batch()). x.shape –> 64x2 (64 rows, 2 columns –> 1 user_name and 1 title_name per row)
- When we try to make a model, we can’t work with raw indices, so we want to use matrix multiplication. I.e., for 5 factors, 1000x5 will be the user_name factor matrix; we want only the 64 rows for this batch, so we would like to use a one-hot-coded matrix. Therefore we have:
(1000x5)^T @ (1000x1 one-hot) = 5x1 for each user_name
and similarly for each title_name
-
Instead of doing this for each batch, we use an “embedding”, which does the above lookup with indices but still allows gradients to be calculated.
Embedding
emb = Embedding(n_users, n_factors) “generates” a random matrix of that size. To this you can pass integer tensors to index into it: emb(tensor(3)) gives the 3rd row.
Creating a collab model from scratch to train
For this we use the PyTorch convention where we create a class (a dot-product class –> DotProduct(Module)) that, when called, by default always calls its forward method.
In this forward method we define the indexing using the Embedding objects.
Completely unsure how this embedding works, but here is the model…
class DotProduct(Module):
def __init__(self, n_users, n_movies, n_factors):
self.user_factors = Embedding(n_users, n_factors)
self.movie_factors = Embedding(n_movies, n_factors)
def forward(self, x):
users = self.user_factors(x[:,0])
movies = self.movie_factors(x[:,1])
return (users * movies).sum(dim=1)
With this, we get a random set of vectors at the start and then pass the indices of each batch to the embedding object to index into it. forward is called (similarly to __call__) each time we call the object as a function, i.e., A = DotProduct(9,12,5) followed by A(user_factors_batch) passes the batch’s independent variables to forward.
Embedding creates a random matrix of shape (n_users, n_factors), from which we can index with the user_names within each batch via self.user_factors(x[:,0]). The reason we use Embedding is that it comes with gradient capabilities.
x,y = dls.one_batch()
a = Embedding(n_users,n_factors)
a(x[:,0]) # will select all `user_names` in that title.
With y_range, sigmoid, and bias added :)
class DotProductBias(Module):
def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
self.user_factors = Embedding(n_users, n_factors)
self.user_bias = Embedding(n_users, 1)
self.movie_factors = Embedding(n_movies, n_factors)
self.movie_bias = Embedding(n_movies, 1)
self.y_range = y_range
def forward(self, x):
users = self.user_factors(x[:,0])
movies = self.movie_factors(x[:,1])
res = (users * movies).sum(dim=1, keepdim=True)
res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
return sigmoid_range(res, *self.y_range)
This is equivalent to:
class DotProductBias(Module):
def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
self.user_factors = create_params([n_users, n_factors])
self.user_bias = create_params([n_users])
self.movie_factors = create_params([n_movies, n_factors])
self.movie_bias = create_params([n_movies])
self.y_range = y_range
def forward(self, x):
users = self.user_factors[x[:,0]]
movies = self.movie_factors[x[:,1]]
res = (users*movies).sum(dim=1)
res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
return sigmoid_range(res, *self.y_range)
Lesson 6 code learnings
Transpose
dl_a = DataLoader(a,batch_size=8, shuffle=True)
b = first(dl_a)
b
list(zip(b[0],b[1]))
list(zip(*b))
RGB
im = image2tensor(Image.open("images/grizzly.jpg"))
_,axs = subplots(1,3)
for bear,ax,color in zip(im,axs,("Reds","Greens","Blues")):
show_image(255-bear,ax=ax, cmap=color)
Make own function using partial
partial
function in use:
def sigmoid_range(x, lo, hi): return torch.sigmoid(x) * (hi-lo) + lo
plot_function(partial(sigmoid_range,lo=-1,hi=1), min=-4, max=4)
Pandas Join merge
Join two tables by common column
ratings = ratings.merge(movies)
ratings.head()
Lesson 6 q&A discussion
-
how to extract from DL? and db’s in general
-
how to see summary
dblock.summary
-
why we need to use
to_cpu
x,y = to_cpu(dls.train.one_batch())
activs = learn.model(x)
activs.shape
-
get_items
vsget_x
-
What does normalize do here?
biwi = DataBlock(
blocks=(ImageBlock, PointBlock),
get_items=get_image_files,
get_y=get_ctr,
splitter=FuncSplitter(lambda o: o.parent.name=='13'),
batch_tfms=[*aug_transforms(size=(240,320)),
Normalize.from_stats(*imagenet_stats)]
)
-
discuss how embedding works
-
(users * movies).sum(dim=1)
but why? why not a dot product?
Lesson 7 nb 8
Collab filtering
-
DotProduct of factors.
-
NN
?? How does the collab NN version work?
When you see overfitting (i.e., valid_loss increases while training loss decreases).
Reduce Overfitting by
-
Reducing number of latent factors (but “biases” the shapes to simpler ones)
-
Weight decay
Weight decay or L2 regularization
It adds a “parabola-type” penalty to the loss.
When you have a steep, wiggly function, the tendency to overfit is high; when you have a smooth function, the tendency to overfit is reduced. By adding a smooth penalty like 0.1 * params**2 to the loss, we control the tendency to overfit, “apparently”. I don’t fully get it, but it works, let’s move on.
loss_with_wd = loss + wd * (parameters**2).sum()
Also gradient is easy to calculate.
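A minimal sketch (using the same assumed names loss, parameters, wd as above) of the two equivalent ways of applying it:
wd = 0.1
# option 1: add the penalty to the loss before calling backward()
loss_with_wd = loss + wd * (parameters ** 2).sum()
# option 2: since d/dp of wd * p**2 is 2 * wd * p, add it straight to the gradients instead
parameters.grad += 2 * wd * parameters.data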
Parameters
Anything you need to “learn” should be wrapped in nn.Parameter:
class T(Module):
def __init__(self): self.a = nn.Parameter(torch.ones(3))
L(T().parameters())
(#1) [Parameter containing:
tensor([1., 1., 1.], requires_grad=True)]
Plotting the pca of the latent factors
g = ratings.groupby('title')['rating'].count()## get movies and number of ratings for those movies
top_movies = g.sort_values(ascending=False).index.values[:1000] ## sort the above and get the name of movies
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies]) ## from movie names get indexes?
movie_w = learn.model.movie_factors[top_idxs].cpu().detach() ## Get 50 factors for those indices
movie_pca = movie_w.pca(3) ## do PCA of those factors
fac0,fac1,fac2 = movie_pca.t() ## transpose
idxs = np.random.choice(len(top_movies), 50, replace=False)
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()
Looking at data in the form of distance
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
## Get angle between two vectors is used here. Trivial:)
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]
Lesson 7 Decision tree and random forests tabular data
Decision tree (learn from statquest)
handling dates in tabular
Adds all these columns to the data.
'saleWeek saleYear saleMonth saleDay saleDayofweek saleDayofyear
saleIs_month_end saleIs_month_start saleIs_quarter_end
saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed'
df = add_datepart(df, 'saledate')
df_test = pd.read_csv(path/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')
' '.join(o for o in df.columns if o.startswith('sale')) ## joins the names of all the new 'sale*' columns into one string, just to display them
Processes
See Explanation on the docs here
Fastai's tabular comes with two processes, procs = [Categorify, FillMissing]. Categorify changes categories into numbers (with a vocab, just like dls.vocab, including an #na# class for missing categories). FillMissing fills missing continuous values with the median and adds an extra boolean column marking where the value was missing.
For neural nets (nns) we would also add Normalize, because unnormalized continuous variables can blow up the loss and drown out the other variables.
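A sketch of the NN version of the processing (the procs_nn / to_nn names are mine; df, cat, cont, dep_var and splits are assumed to be the ones defined around here):
procs_nn = [Categorify, FillMissing, Normalize]  # Normalize added on top of the tree procs
to_nn = TabularPandas(df, procs_nn, cat, cont, y_names=dep_var, splits=splits)
dls = to_nn.dataloaders(1024)  # a large batch size is fine for tabular data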
dft = pd.DataFrame({'a':[1,2,np.nan,0,0,2]})
tot = TabularPandas(dft, Categorify,cat_names=["a"])
dft.dtypes,tot.items.dtypes
tot.items
Shows how Categorify works. You see 0 instead of nan in tot.items. tot.classes["a"] gives the classes as ['#na#', 0.0, 1.0, 2.0] (so 0 in tot.items is the code for '#na#'). Not sure how to map back though.
Splitting train and valid by date
cond = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where( cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx),list(valid_idx))
Split to cont and cat variables
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)  ## the 1 is max_card: numeric columns with more than 1 level are treated as continuous
Defining a tabular dsets
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
to.classes shows the vocab, to.show shows the table with the vocab translated back, and to.items shows the underlying table with the encoded (untranslated) values.
Before initiating tabular dsets
## split to train and valid
msk = np.random.rand(len(df)) < 0.8
splits = (list(np.where(msk)[0]),list(np.where(~msk)[0]))
# df_tr = df[msk]
# df_vd = df[~msk]
# len(df_tr)/len(df), len(df_vd)/len(df)
len(splits[0])+len(splits[1]), len(df)
Leaf nodes and number of rows
-
At every split, the dater is divided, e.g., 50k gets split into 35k and 15k.
Eventually the dater in all the leaf nodes sums to the total dater length.
-
So if –> no. of leaf nodes = length of dater,
then –> in theory you have fit almost all of the training data one to one (#overfit).
So use min_samples_leaf=25: it won't create leaves with fewer than 25 samples.
Why are there more nodes than columns? There will be, bruv. The max number
of leaf nodes is 2^n_columns, which is an insane number of leaf nodes. But
you will max out on samples long before that. So don't worry. :)
How to get deep trees without overfitting? Bagging.
How do we deal with N/A?
Decision Tree
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
#!pip install dtreeviz
from dtreeviz.trees import *
m = DecisionTreeClassifier(max_leaf_nodes=4)
m.fit(xs,y)
draw_tree(m, xs, leaves_parallel=True, precision=2)
Gini impurity
Main idea is here. I make an example out of this statquest. Also look at this medium post.
I think it helps with some context: Let’s say there is a machine that can detect Heart Disease (HD). The machine can predict HD 30% of the time. Following is the sample we have:
 | HD | !HD |
---|---|---|
Machine | 30% | 70% |
This means the following cases are possible:
- Machine classify as HD and it is HD (P = 0.3*0.3)
- Machine classify as !HD and it is !HD (P = 0.7*0.7)
- Machine classify as HD but it is !HD
- Machine classify as !HD but it is HD
We pray that cases 3&4 happen less often. The sum of all probabilities is 1, so P(3&4) = 1 - 0.3^2 - 0.7^2 = 0.42. P(3&4) is the impurity, i.e., how bad the machine's prediction is, AKA GINI = 0.42.
The alternative is to check if someone is clutching his chest or not due to chest pain (CP) and then guess based on probability data if he has HD or not. The following is the sample we have. For each case we calculate the GINI. Then we take the average of it (assuming similar sample size) and this estimates the GINI impurity using CP to predict HD.
 | HD | !HD | GINI |
---|---|---|---|
!CP | 25% | 75% | 0.375 |
CP | 80% | 20% | 0.32 |
avg | NA | NA | ~0.35 |
The smaller the impurity the better. So we decide that instead of buying the machine (GINI=0.42) we can just use CP as an indicator (GINI ≈ 0.35, averaging the two rows with equal sample sizes).
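The arithmetic above is easy to sanity-check with a tiny helper (plain Python, nothing from the lesson):
def gini(*probs): return 1 - sum(p**2 for p in probs)   # impurity of one node
gini(0.3, 0.7)                                           # the machine: 0.42
(gini(0.25, 0.75) + gini(0.8, 0.2)) / 2                  # CP, equal sample sizes: ~0.35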
Feature importances
Weighted decrease in impurity, then normalized. Calculated as:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
To get the raw sklearn feature importances: m.tree_.compute_feature_importances(normalize=False)
Bootstrapping
Making a dataset by randomly picking rows with replacement. The size of the picked dataset is the same as the original.
Bagging (rand choose rows and make multiple models)
Here is the procedure that Breiman is proposing:
- Randomly choose a subset of the rows of your data (i.e., “bootstrap replicates of your learning set”).
- Train a model using this subset.
- Save that model, and then return to step 1 a few times.
- This will give you a number of trained models. To make a prediction, predict using all of the models, and then take the average of each of those model’s predictions.
The idea being you get random errors which are not correlated with one another, so when you average the models the error falls roughly like 1/sqrt(number of models).
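A minimal sketch of Breiman's recipe with plain sklearn trees (assuming xs, y, valid_xs are the DataFrames/Series from the TabularPandas above):
from sklearn.tree import DecisionTreeRegressor
import numpy as np

def bagged_trees(xs, y, n_models=40):
    trees = []
    for _ in range(n_models):
        idxs = np.random.choice(len(xs), len(xs), replace=True)      # bootstrap replicate of the rows
        trees.append(DecisionTreeRegressor(min_samples_leaf=25).fit(xs.iloc[idxs], y.iloc[idxs]))
    return trees

trees = bagged_trees(xs, y)
preds = np.stack([t.predict(valid_xs) for t in trees]).mean(0)        # average the models' predictions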
Random forest
In addition to the above randomly chosen rows per model (bagging), randomly selecting from a subset of columns at each node is Random forest.
check out the “Elements of Statistical Learning”
Note: You somehow don’t overfit, as you keep averaging and hence you plateau with accuracy. Which is very interesting. (lesson 7 Jeremy)
Out-of-bag-error
There are two reasons why your training and validation rmse can differ:
-
The random forest is not accurate enough,
-
There is an actual difference between the training and validation set (e.g., a time offset).
Using the out-of-bag (OOB) rmse we can determine which of the two it is, i.e., how meaningful the training vs validation rmse gap is.
How is it done?
-
Take each row, find the trees that did not use it, and compute the rmse from just those trees.
This gives an impression of how the rf works on unseen data (aka valid data). Beware: this seems to defeat the point when we use a time-series dataset.
-
Do it for every row.
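sklearn does this for you with oob_score=True; a sketch (assuming xs, y and the r_mse helper that appears further down in these notes):
from sklearn.ensemble import RandomForestRegressor
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15, oob_score=True, n_jobs=-1).fit(xs, y)
r_mse(m.oob_prediction_, y)   # each row is predicted only by the trees that never saw it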
In this lecture on tabular data they show a training error of 17%, OOB error of 21% and a validation error of 23%. Is my understanding below correct?
The training error is different from OOB error –> This implies overfitting. Because the OOB error is “much less” than the validation error –> This implies that the validation set is “out of domain” (e.g., from a different time series).
Model interpretation
- How confident are we in our predictions using a particular row of data?
- For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
- Which columns are the strongest predictors, which can we ignore?
- Which columns are effectively redundant with each other, for purposes of prediction?
- How do predictions vary, as we vary these columns?
Variance of prediction
We usually look at the mean of all the predictions per row. This can have high std or low std. In the case that the mean=0.23 and std=0.21 (I personally would not be very confident). We can use this info for each prediction we make to see if we should take a deeper look at something.
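A sketch of how to get that per-row spread (assuming m is the fitted forest and valid_xs the validation frame from around here):
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])   # shape: (n_trees, n_rows)
preds_std = preds.std(0)                                          # one std per row, across the trees
preds.mean(0)[:5], preds_std[:5]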
Question: I am quite unsure about how to interpret this. Are you supposed to compare against the std of the other rows, or?
Feature Importance (not fully sure)
The way these importances are calculated is quite simple yet elegant. The feature importance algorithm loops through each tree, and then recursively explores each branch. At each branch, it looks to see what feature was used for that split, and how much the model improves as a result of that split. The improvement (weighted by the number of rows in that group) is added to the importance score for that feature. This is summed across all branches of all trees, and finally the scores are normalized such that they add to 1.
Remove columns that are not important
From the above, select only columns that are > 0.005 in importance.
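A sketch (assuming fi is the importance dataframe with cols/imp columns, as produced by the rf_feat_importance-style helpers further down):
to_keep = fi[fi.imp > 0.005].cols
xs_imp, valid_xs_imp = xs[to_keep], valid_xs[to_keep]
len(xs.columns), len(xs_imp.columns)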
Remove redundant columns, i.e., correlated columns, i.e., (nearly) linearly dependent columns
-
Using clustering to see which columns are close to each other (see the sketch after this list).
-
Jeremy looks at the OOB score when keeping each variable vs removing it, and then finally removes all the ones that are duplicates. If you see "little" OOB change then they are fine to be thrown away.
-
He uses only 50k samples here, randomly chosen with replacement.
-
However I wonder why not also check the actual valid and train rmse as a result of these removals. Actually he ends up doing that with the whole dataset.
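A rough sketch of the clustering step (roughly what fastbook's cluster_columns does): hierarchical clustering on the Spearman rank correlation between columns, assuming xs_imp from the step above:
import numpy as np, scipy.stats
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform

corr = np.round(scipy.stats.spearmanr(xs_imp).correlation, 4)   # rank correlation between columns
dist = squareform(1 - corr, checks=False)                       # turn similarity into a condensed distance matrix
z = hc.linkage(dist, method='average')
hc.dendrogram(z, labels=xs_imp.columns.tolist(), orientation='left')
plt.show()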
Partial Dependence
Book to understand formalism of subtleties: “The Book of Why”
How does YearMade affect SalePrice, all else being equal?
The thing is we can’t just take average sale price for a particular year. This does not account for changes in the other variables.
So we resort to another kind of study, where we generate different datasets. DATASET1: all years changed to 1950, DATASET2: all years changed to 1970…
We can predict using our model on each data set. and plot the values.
For instance, how does YearMade impact sale price, all other things being equal?
To answer this question, we can’t just take the average sale price for each YearMade. The problem with that approach is that many other things vary from year to year as well, such as which products are sold, how many products have air-conditioning, inflation, and so forth.
Instead, what we do is replace every single value in the YearMade column with 1950, and then calculate the predicted sale price for every auction, and take the average over all auctions. Then we do the same for 1951, 1952, and so forth until our final year of 2011. This isolates the effect of only YearMade (even if it does so by averaging over some imagined records where we assign a YearMade value that might never actually exist alongside some other values).
But the N/A group can also show up in the partial dependence plot (as with ProductSize below).
The ProductSize partial plot is a bit concerning. It shows that the final group, which we saw is for missing values, has the lowest price. To use this insight in practice, we would want to find out why it’s missing so often, and what that means. Missing values can sometimes be useful predictors—it entirely depends on what causes them to be missing. Sometimes, however, they can indicate data leakage.
Not sure how to broach such a topic or what it means. :(
Tree interpreter
Similar to "feature importance", but for just one row of data, so you can understand what drove that particular prediction.
Random Forests is really bad on unseen data
It can’t predict what happens in a straight line projection between sales price and year for example.
Very hard for RF to extrapolate outside the data it has seen. Example, time series. Make sure the data is not “OUT OF DOMAIN”. HUh? what is that?
OUT OF DOMAIN DATER
So to identify if the valid set and training set are "different", i.e., "out of domain", we can train an rf to predict whether a row of dater comes from the training or the valid set.
If that rf's error is small, i.e., its predictive power is high, then the valid dataset really is "different" from the training set.
We can then look at the feature importance. In this problem it is saleElapsed, MachineID and SalesID that are most important for telling training and valid apart. It makes sense: you can imagine that the machine id keeps increasing over time, same with the sales id. :) So it's understandable that these predict the valid set.
If we remove these we lose some predictive power, but the error of the main model actually improves slightly. So we can consider removing these time-related parameters that we expect not to make much of a difference. This makes the system more resilient to time changes.
Similarly you can also remove certain dates to train better or just keep the latest dates.
NN
Also looks good for tab dater.
By default, for tabular data fastai creates a neural network with two hidden layers, with 200 and 100 activations, respectively. This works quite well for small datasets, but here we’ve got quite a large dataset, so we increase the layer sizes to 500 and 250:
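A sketch of what that looks like (assuming fastai.tabular.all is imported and dls comes from a TabularPandas with Normalize as above; the y_range for a log sale price is an assumption from the bulldozers example):
learn = tabular_learner(dls, layers=[500,250], y_range=(8,12), n_out=1, loss_func=F.mse_loss)
learn.lr_find()
learn.fit_one_cycle(5, 1e-2)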
Ensembling
Some sort of weighted or equally-weighted average of the results from two (or more) models.
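A sketch of the simplest version, averaging the RF and NN predictions on the same validation set (assuming m, learn, valid_xs, valid_y and the r_mse helper from the sections around here):
rf_preds = m.predict(valid_xs)                          # random forest predictions
nn_preds, _ = learn.get_preds()                         # NN predictions on its validation set
ens_preds = (to_np(nn_preds.squeeze()) + rf_preds) / 2  # equal-weight ensemble
r_mse(ens_preds, valid_y)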
Boosting
IT WILL OVERFIT.
- Train a small model that underfits your dataset.
- Calculate the predictions in the training set for this model.
- Subtract the predictions from the targets; these are called the “residuals” and represent the error for each point in the training set.
- Go back to step 1, but instead of using the original targets, use the residuals as the targets for the training.
- Continue doing this until you reach some stopping criterion, such as a maximum number of trees, or you observe your validation set error getting worse.
It is training a small model, getting the residuals, and then training more models to reduce these residuals… In the end you add up all of the predictions, unlike in bagging where you average them. There are gradient boosting machines, gradient boosted decision trees, and some other variants.
Entity embedding
Conclusion
-
Random forests are the easiest to train, because they are extremely resilient to hyperparameter choices and require very little preprocessing. They are very fast to train, and should not overfit if you have enough trees. But they can be a little less accurate, especially if extrapolation is required, such as predicting future time periods.
-
Gradient boosting machines in theory are just as fast to train as random forests, but in practice you will have to try lots of different hyperparameters. They can overfit, but they are often a little more accurate than random forests.
-
Neural networks take the longest time to train, and require extra preprocessing, such as normalization; this normalization needs to be used at inference time as well. They can provide great results and extrapolate well, but only if you are careful with your hyperparameters and take care to avoid overfitting.
HW
- Start a kaggle competition
- Do KNN
- and EE as well and see where that takes your results. Tabulate them.
Data leakage example in nb 9
In the paper “Leakage in Data Mining: Formulation, Detection, and Avoidance”, Shachar Kaufman, Saharon Rosset, and Claudia Perlich describe leakage as:
: The introduction of information about the target of a data mining problem, which should not be legitimately available to mine from. A trivial example of leakage would be a model that uses the target itself as an input, thus concluding for example that ‘it rains on rainy days’. In practice, the introduction of this illegitimate information is unintentional, and facilitated by the data collection, aggregation and preparation process.
They give as an example:
: A real-life business intelligence project at IBM where potential customers for certain products were identified, among other things, based on keywords found on their websites. This turned out to be leakage since the website content used for training had been sampled at the point in time where the potential customer has already become a customer, and where the website contained traces of the IBM products purchased, such as the word ‘Websphere’ (e.g., in a press release about the purchase or a specific product feature the client uses).
Data leakage is subtle and can take many forms. In particular, missing values often represent data leakage.
For instance, Jeremy competed in a Kaggle competition designed to predict which researchers would end up receiving research grants. The information was provided by a university and included thousands of examples of research projects, along with information about the researchers involved and data on whether or not each grant was eventually accepted. The university hoped to be able to use the models developed in this competition to rank which grant applications were most likely to succeed, so it could prioritize its processing.
Jeremy used a random forest to model the data, and then used feature importance to find out which features were most predictive. He noticed three surprising things:
- The model was able to correctly predict who would receive grants over 95% of the time.
- Apparently meaningless identifier columns were the most important predictors.
- The day of week and day of year columns were also highly predictive; for instance, the vast majority of grant applications dated on a Sunday were accepted, and many accepted grant applications were dated on January 1.
For the identifier columns, one partial dependence plot per column showed that when the information was missing the application was almost always rejected. It turned out that in practice, the university only filled out much of this information after a grant application was accepted. Often, for applications that were not accepted, it was just left blank. Therefore, this information was not something that was actually available at the time that the application was received, and it would not be available for a predictive model—it was data leakage.
In the same way, the final processing of successful applications was often done automatically as a batch at the end of the week, or the end of the year. It was this final processing date which ended up in the data, so again, this information, while predictive, was not actually available at the time that the application was received.
This example showcases the most practical and simple approaches to identifying data leakage, which are to build a model and then:
- Check whether the accuracy of the model is too good to be true.
- Look for important predictors that don’t make sense in practice.
- Look for partial dependence plot results that don’t make sense in practice.
Thinking back to our bear detector, this mirrors the advice that we provided in «chapter_production»—it is often a good idea to build a model first and then do your data cleaning, rather than vice versa. The model can help you identify potentially problematic data issues.
It can also help you identify which factors influence specific predictions, with tree interpreters.
Computer Vision Assignment Kaggle
-
Read about the competition (15 mins)
So what Kaggle does is to itself split the data into a training set and a test set. They release these to us and the training set contains the outcomes to predict, the test set does not. So the short answer is that you submit your outcome list for the test set that Kaggle has given you. What Kaggle does is to take your guesses and apply it to a random 50% of the test set and give you your results back (the exact metric used changes and can be found in the evaluation tab of the competition info). So during the competition you are getting feedback on a random static 50% of the test set. At the end of the competition they release how you went on the other 50% for your chosen top 2 guesses. So one thing to watch out for is overfitting on the 50% of the test set during the competition itself - it is common to see quite big changes to the leader board once a competiton has finished because of this. Hope this makes sense! — user
-
Come up with a plan
-
Submit one result (random selection or the top one)
-
How to not overfit?
-
Get a smaller subset
-
Come up with most basic solution and submit
-
How to improve?
-
EDA
- how many species have one item to train with?
- what are strategies to not “remember a picture”?
-
Strategies
- More epochs?
- Presizing?
- different model for the ones with few?
- Cross validation?
- Getting more data from the internet
- Augmentation for sure
- different sizes of tensor
- how to not overfit?
-
Prepare your arsenal of tools (1hr)
-
documentation?
TODO
Goals ———————— by Monday, 7th Dec @agent18
-
-Adding fastai and test one command
- download data as pdf from gradient
-
Download fastai stuff from gradient
- Make a random “csv” submission and SUBMIT TO KAGGLE (?)
- Make a simple model with shit accuracy and run with fastai (?)
- just find out how to increase time of GPU (?)
- Take a small subset (1%) and setup the CSV submission when species are missing
by Tuesday, morning
- Set up one run with shit accuracy with fastai
- Using a random sample extract parts of json you need (done)
- Make df with it
- Make data block
- getx
- gety
- resnet 18/resnet50.fp16
- do an eda of the 32k and how many specimens they have
by wednesday
Kaggle
-
Submit a csv for final results.
-
Save Version
when you want to save a version or when you want to submit that version's results.
Every Version seems to be like in VERSION CONTROL. e.g., you can select the Notebook Version and the output file generated.
-
Submit any versions output documents to get back on the leader board.
Don't understand the difference between a final submission and a non-final submission.
You may submit a maximum of 10 entries per day. You may select up to 5 final submissions for judging.
installing dtreeviz
You might get errors. Just proceed; it seems OK. Otherwise refer to this.
IEEE fraud detection
- First place solution part 1
- First place Part 2
- First place Finding UID
- Data and description
-
notebook on adverserial validation
- another competition
kaggle api installation and working
Python Learnings
-
pay more attention to code, don’t take things for granted. preferably check with an example data frame so you are absolutely sure of the transactions.
-
discuss on how to generate hypothesis..
-
I am quickly convinced by "proof". e.g., I was convinced that set(a) outputting 2000 values was enough proof that I had done it right. Which it was
-
Look at what is happening differently between your code and another working example. After identifying where things are going wrong, I didn't know how to probe further (e.g., the iloc/loc issue). Strip the problem down to basic issues. Try to see if you can write an MWE.
Json file opening
If error: UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in
position 0: invalid start byte
import json,codecs
with codecs.open("../input/herbarium-2020-fgvc7/nybg2020/train/metadata.json", 'r',
encoding='utf-8', errors='ignore') as f:
train_meta = json.load(f)
with open(train_json_path, encoding="utf8", errors='ignore') as f:
data = json.load(f)
Dataframe index iloc loc pandas
Very Big Disclaimer:
Sorting or any operation that shuffles the rows in the DF is DANGEROUS, because afterwards the row number is no longer the same as the Index (row name).
Pandas has Columns and indexes, column names and index names
Indexing
dft = pd.DataFrame(np.random.randint(0,100,size=(20, 4)), columns=list('ABCD'))
dft = dft.sort_values("D")
print(dft["C"].iloc[[0,1]],dft.loc[[0,1],"C"])
dft
iloc –> gives rows at positions 0 and 1 (NOT the index) –> [99, 50]
loc –> gives rows based on the INDEX (row name) –> [18, 7]
loc is based on Labels (Index, row name); LABEL BASED INDEXING
iloc is based on Positions (row number != index); POSITION BASED INDEXING
Table: dft
(index) A B C D
18 33 87 99 5
4 98 77 50 9
1 25 61 7 10
5 47 90 86 11
6 0 40 42 18
2 29 51 99 22
17 22 20 35 30
10 93 39 59 30
13 70 65 61 35
16 2 74 48 39
9 42 30 13 40
14 33 91 51 49
7 99 38 91 53
0 69 76 18 58
12 96 65 48 69
11 75 64 14 72
8 41 33 18 74
19 63 71 53 81
3 34 2 20 92
15 36 76 33 96
Using iloc and loc
iloc –> dft.iloc[0,0] –> 33 (row position 0, column position 0; dft.iloc[0] returns the whole first positional row)
loc –> dft.loc[0,"A"] –> 69
To use iloc
but with labels
dft.iloc[dft.index.get_loc(0),dft.columns.get_loc("C")]
Get index (rowname) or row number from value
dft.index
–> gives index order
dft.index.values
–> gives an array of index order
dft.index[dft["C"]==12] and dft[dft["C"]==12].index both give the same answer, i.e., the row name (INDEX)
Get row Number (!rowname)
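This heading is empty in my notes; one way to do it (using the dft table above, where 18 is the first index label after sorting):
dft.index.get_loc(18)          # -> 0: positional row number of the row whose index label is 18
np.where(dft.index == 18)[0]   # same thing via numpy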
Get colnumber and index number from value
dft.columns.get_loc("C")
Index order
dft.index.values
array([17, 10, 2, 13, 9, 18, 5, 6, 8, 15, 4, 11, 3, 12, 19, 14, 16,
0, 7, 1])
Can be reset using: dft.reset_index(drop=True, inplace=True)
Avoid issues by using
- reset index, pay attention to using loc (labels) or iloc (positions)
doing multiple operations on pandas
g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
g groups by title, then takes "rating" and counts the listings per group. The result lists each title and its number of counts. :)
top_movies sorts g in descending order, takes the index, and gets the names of the top 1000 movies.
Pandas groupby, sort_values, merge
Adding function by grouping.
df["len_rows"] = df.groupby("category_id")[["category_id"]].transform(len)
df = df.sort_values("len_rows")
df.reset_index(drop=True, inplace=True)
df = df_all.groupby("category_id").apply(lambda
x:x.sample(min(8,len(x)))).reset_index(drop=True)
df = df_all.groupby("category_id").head(n=10) # displays 10 of the top
Random selection of index values.
train_ind_1 = [random.sample(df[df["category_id"]==ctg].index.tolist(),1)[0] for ctg in ctg_unq]
Merging…
df = transaction.merge(identity)
reading fastai documents
Go here first and look at training. All important functions are listed: https://docs.fast.ai/vision.learner.html
Fastai github: https://github.com/fastai/fastai/blob/master/fastai/metrics.py#L141
Testing metrics: tracing the function/method definitions
F1Score
needs to be instantiated.
pred = torch.randint(0,10,(64,))
targ = torch.randint(0,10,(64,))
f1score_ins = F1Score(average="macro")
f1score_ins(pred,targ)
Learn needs a metric function, in this case f1score_ins. It looks like a function but is actually a class and needs to be instantiated. F1Score is a class built on the skm_to_fastai function, which returns a class object (AccumMetric). AccumMetric has a __call__ method, which basically allows you to call the object as a function.
So It will be:
f1_score_multi = F1Score(average="macro") ## instantiate the class so the object can be called like a function
learn = cnn_learner(dls,resnet18,metrics=f1_score_multi)
Star variable: *args unpacks a list/tuple, **kwargs unpacks a dict
>>> def functionA(*args, **kwargs): print(args); print(kwargs)  # example definition (assumed) that just prints what it receives
>>> lis=[1, 2, 3, 4]
>>> dic={'a': 10, 'b':20}
>>> functionA(*lis, **dic) #it is similar to functionA(1, 2, 3, 4, a=10, b=20)
(1, 2, 3, 4)
{'a': 10, 'b': 20}
Unsqueeze using None,
xs_lin = x_lin.unsqueeze(1)
x_lin.shape,xs_lin.shape
x_lin[:,None].shape
Both output: torch.Size([40, 1])
from torch.Size([40])
GPU CPU usage
def print_cpu_gpu_usage():
    !gpustat -cp
    !free -m
    #!top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1"%"}'
Nan na None Null inf
When you import a dataframe I think all missing values are encoded as NaN, whether the column is an object, boolean or other dtype. Use this to see them all in one go:
df.isna().sum()
Dropping them
df.dropna() #drop all rows that have any NaN values
df.dropna(how='all') #drop only if ALL columns are NaN
df.dropna(thresh=2) #Drop row if it does not have at least two values that are **not** NaN
FASTAI
The thing is fastai’s Tabular pandas has this attribute classes
to.classes["ProductCD"]
This shows : [‘#na#’, ‘C’, ‘H’, ‘R’, ‘S’, ‘W’]
FASTAI converts the dtype "category or object" columns to numerical (int). While building the VOCAB it changes NaN values to the #na# string. It always shows "#na#" for categorical variables.
Guess: I think categorical columns can have "#na#" when working with scikit-learn, but not continuous variables.
General
NaN is "not a number". R has na and Null etc… But Python doesn't. Pandas is built on top of numpy.
To detect NaN values numpy uses np.isnan(). To detect NaN values pandas uses either .isna() or .isnull(). The NaN values are inherited from the fact that pandas is built on top of numpy, while the two functions' names originate from R's DataFrames, whose structure and functionality pandas tried to mimic. —Stack
dropna() and fillna() handle missing values; it helps to remember them.
What are they? How to identify them in the context of pandas?
To check for both None and NaN, pd.isna() (or .isnull()) treats both as missing.
np.inf is used to denote infinity.
Pandas
For NaN / missing values in a numpy array use np.isnan(array):
x = np.array([np.NaN])
np.isnan(x)
array([ True])
Fastest way to replace nan and inf Stack
In [11]: df = pd.DataFrame([1, 2, np.inf, -np.inf])
In [12]: df.replace([np.inf, -np.inf], np.nan)
Out[12]:
0
0 1
1 2
2 NaN
3 NaN
pandas and tabular with GPU
This kaggle kernel is an introduction to using rapids.
RAPIDS is a collection of libraries, GitHub page here. The library cuDF has the same API as Pandas and does what Pandas does. The library cuML has the same API as Scikit-learn and does what Scikit-learn does.
Their purpose is to allow us to do our data analytics and machine learning on GPU instead of CPU, giving us speed increases between 10x and 1000x! They are open source libraries currently in development. They still have some bugs and are still missing some features that Pandas and Scikit-learn have, but the existing functions are very fast!
also check this out.
Pass by reference
Mutable objects (like lists) are the ones that can be changed in place when passed to a function; immutable objects (like ints and strings) cannot. Understanding this shows why we can modify a list inside a function but not an int variable.
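A tiny self-contained example of the difference:
def mutate(lst, n):
    lst.append(99)   # mutates the caller's list in place
    n += 99          # rebinds the local name only; the caller's int is unchanged

a, b = [1, 2], 1
mutate(a, b)
a, b                 # -> ([1, 2, 99], 1)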
## Column Names in dictionaries.
col_count = [0]
def count_col(str_sch,col_count=col_count):
tf = [bool(re.search("^"+str_sch+"[0-9]+",col)) for col in xs.columns.to_list()]
col_count[0]+=sum(tf)
return [str_sch,cols[tf].to_list()]
col_lst = [["other",cols_sngl]]; col_count[0]+=len(cols_sngl)
[col_lst.append(count_col(str_sch)) for str_sch in str_mltpl]
col_dict = dict(col_lst)
col_count
Pandas correlation
Pandas excludes nans while calculating corrs… So we really don’t know the contribution of the nans, and their spacing.
Calculating corr with other columns
remove certain columns pandas
tf_V = [bool(re.search("^V"+"[0-9]+",col)) for col in xs.columns.to_list()]
cols = xs.columns[list(~np.array(tf_V))]
len(cols)
split column by another column and count groupby
Most elegant solution
df.groupby(["id_31","isFraud"]).size().unstack(fill_value=0).rename(columns={0:'isFraud0',1:'isFraud1'}).reset_index()
had to reset index otherwise kept getting keyerror, group error etc….. Huh!
select and display one client
my_filter = df[uid].to_dict(orient='index')[10876]#[10302]
print(my_filter)
## Looking at one UID
df_all[feat].loc[(df_all[list(my_filter)] == pd.Series(my_filter)).all(axis=1)]
pandas and aggregate
https://stackoverflow.com/a/53781645/5986651
temp.groupby(['A', 'B'])[["C","D"]].agg([('average','mean'),('total','sum')])
pd.concat([temp, temp.groupby(['A', 'B'])[["C","D"]].transform("mean").add_suffix("_mean_uid")],
          axis=1)
Statistics learning
Several metrics are discussed here.
Confusion matrix, precision, recall
Everything adds up to total number. In python confusion_matrix
from
sklearn.metrics
has this transposed.
Actual isFraud | Actual !isFraud | |
---|---|---|
Predicted isFraud | TP | FP |
Predicted !isFraud | FN | TN |
Sensitivity TPR (recall): What percentage of Frauds were correctly identified.
TP/(TP+FN)
Specificity TNR (1-FPR) (recall of the negative class): What percentage of !Frauds were correctly identified
TN/(FP+TN)
Precision (across the same prediction): The precision of a class defines how trustable the result is when the model answers that a point belongs to that class. Of whatever it predicts as that class, it does a good job if precision is high.
TP/(TP+FP)
Think of P and N as two classes. There could be more classes.
Recall (across the same class): The recall of a class expresses how well the model is able to detect that class. If recall is low for a class then the model is unable to detect it reliably.
TP/(TP+FN)
Like your accuracy of prediction for that class.
ROC curve
-
Graph between TPR and FPR. If TPR is 1 it means that all samples have been correctly predicted for “true” class. If FPR is 1, it means that all samples of the “false” class, are predicted wrong.
-
What we want thus is a low FPR and a high TPR for a given threshold. We want the ratio of TPR to FPR to be as high as possible.
-
Above the diagonal (random-guess) line is good. Below the line is not.
python Precision, confusion, recall, balanced_accuracy
Calculate the precision score or recall of one label alone.
precision_score(y,m.predict(xs), average=None, labels=[1,0])
precision_score(y,m.predict(xs), average="macro",labels=[1])
precision_score(y,m.predict(xs), average="binary", pos_label=1)
def m_conf(m,xs,y):
plot_confusion_matrix(m,xs,y)
return confusion_matrix(y, m.predict(xs))
## Metrics functions
def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
#def m_accuracy(m,xs,y): return sum(m.predict(xs)==y)/len(xs)
def m_accuracy2(m,xs,y): return m.score(xs,y)
def m_conf(m,xs,y):
plot_confusion_matrix(m,xs,y)
return confusion_matrix(y, m.predict(xs))
def m_prec(m,xs,y): return precision_score(y,m.predict(xs), average=None, labels=[1,0])
def m_recall(m,xs,y): return recall_score(y,m.predict(xs), average=None, labels=[1,0])
Why sample_weight does not work is still a mystery to me…???
## is balanced accuracy the same as recall average? YES
from sklearn.metrics import balanced_accuracy_score
avg1 = (tr_rep["1"]["recall"]+tr_rep["0"]["recall"])/2
sample_weight = np.array([1 if i == 0 else 1000 for i in y])
avg2 = balanced_accuracy_score(y,m.predict(xs),sample_weight=sample_weight)
avg1,avg2
Unbalanced data
It is important to realize that we might not need to do anything about the class imbalance for this dataset.
I did try that. It helped my CatBoost models using hyperparameter class_weights=[1,2.5] but over/under sampling didn’t help my LGBM nor XGB models.
I don’t have any great insights but I think the proportion of isFraud in the test.csv file makes a difference. For example if the train.csv had 3% isFraud but the test.csv had 10% isFraud then it would be important to upsample so that your predictor would predict more isFraud than 3% in the test.csv. —topper
-
Balance the samples by over or undersampling
-
Sort of summary here.
-
There are python libraries that do it. SMOTE, “UnbalancedDataset” module has other features too. Check it out.
I tried using 29% data as opposed to 3% data of the minority class, and got 0.89 (AUC) on valid as opposed to 0.83 (AUC).
- consider undersampling (done, ROCKS: with 29% minority, a 5% increase in AUC)
- consider just oversampling with replacement (done, good for precision)
- consider oversampling with SMOTE, ROSE
- consider undersampling combined with oversampling (could try)
- consider weighting each observation (apparently can be done on RF) (improves by a percent).
There are more resources here
Classification when 80% of dataset is one class
Quora ideas on how to handle unb ds
Smote implementation
# ## Oversampling SMOTE
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42, n_jobs=-1,k_neighbors=15)
X,Y = sm.fit_resample(xs,y)
m = rf(X,Y)
Y.value_counts(), y.value_counts(),
ROS implementation
# #oversampling
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X,Y = ros.fit_resample(xs,y)
m = rf(X,Y)
Balancing weights manually
Weighing the minority class.
## Weighting the samples
wt = y.value_counts()[0]/y.value_counts()[1]
y.value_counts(),wt
sample_weight = np.array([1 if i == 0 else wt for i in y])
m = rf(xs,y, sample_weight=sample_weight)
trying pipeline for oversampling
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
pipe_ros = RandomOverSampler(random_state=42)
pipeline = Pipeline([("ros", pipe_ros),("rf",pipe_rf)])
pipeline.fit(xs,y)
Doesn't work. Testing the pipeline with just the rf, I was able to reproduce the plain-rf results.
Pipline with ROS + RF and gridsearchcv
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
## define pipeline
steps = [('ros', RandomOverSampler()), ('rf', RandomForestClassifier())]
pipeline = Pipeline(steps=steps)
# score={'AUC':'roc_auc',
# 'RECALL':'recall'}
param_grid = {"rf__max_features":[0.2,0.3,0.4,0.5,0.7,0.8],
"rf__max_samples":[0.2,0.4,0.5,0.6],
"rf__min_samples_leaf":[1,5,15,25,35,50],
"rf__n_estimators":[40]
}
cv_rf = GridSearchCV(pipeline, param_grid=param_grid, scoring="roc_auc", n_jobs=-1,
cv=10, verbose=3, return_train_score=True)
cv_rf.fit(xs,y)
Random forests
Using m.estimators_
and getting the mean of predictions or using m
(the entire model) for predicting is the same.
preds_p = np.stack([t.predict_proba(valid_xs)[:,1] for t in m.estimators_])
np.allclose(m.predict_proba(valid_xs)[:,1], preds_p.mean(0))  # exact == comparison of floats is fragile
Remove out of domain data (adversarial validation) OOD
nb9 tabular
## finding out of domain dater
df_dom = pd.concat([xs, valid_xs])
is_valid = np.array([0]*len(xs) + [1]*len(valid_xs))
m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom, df_dom)[:6]
## Removing OOD dater.
ood_drop = ["TransactionDT","TransactionID","D15"]
drop_var_xs(ood_drop, xs, valid_xs)
X.drop(ood_drop,axis=1, inplace=True)
m=rf(X,Y)
metrics(m,xs,y,valid_xs,valid_y)
Kaggle how to: here
Link from previous: here
Use of Adv Validation in IEEE fraudulent: here
AV done for IEEE but with test
## finding out of domain dater
df_dom = pd.concat([xs, test_xs],ignore_index=True)
is_test = pd.DataFrame([0]*len(xs) + [1]*len(test_xs))
len(df_dom)
idxT,idxV = train_test_split(np.arange(len(df_dom)),test_size=0.2)
m = xgb_av_fun(df_dom.iloc[idxT], is_test.iloc[idxT],
eval_set=[(df_dom.iloc[idxV],is_test.iloc[idxV])])
The idea online about AV is to do it and then select the training data most similar to your test data…
Grid search cv
## Getting best hyperparameters
from sklearn.model_selection import GridSearchCV
param_grid = {"max_features":[0.2,0.3,0.4,0.5,0.7,0.8],
"max_samples":[0.2,0.4,0.5,0.6],
"min_samples_leaf":[1,5,15,25,35,50]
}
cv_rf = GridSearchCV(estimator=m, param_grid=param_grid, scoring="roc_auc", n_jobs=-1,cv=5, verbose=3, return_train_score=True)
cv_rf.fit(xs,y)
## Get the results in a csv
pd.DataFrame(cv_rf.cv_results_).to_csv("gscv.csv")
## Save the pickle file of GSCV
import joblib
joblib.dump(cv_rf, 'GSCV1.pkl')
cv_rf_test = joblib.load("GSCV1.pkl")
Mean of all estimators gives the random forests results
## checking if mean of all estimators gives the random forest results
def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)
preds_p = np.stack([t.predict_proba(valid_xs)[:,1] for t in m.estimators_])
r_mse(m.predict_proba(valid_xs)[:,1],preds_p.mean(0))
Interpreting each decision tree
using waterfalls and tree interpreter
#hide
import warnings
warnings.simplefilter('ignore', FutureWarning)
from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall
draw_tree(m, xs, leaves_parallel=False, precision=2)
row = valid_xs.iloc[:5]
prediction,bias,contributions = treeinterpreter.predict(m, row.values)
print(prediction[0,1], bias[0,1], contributions[0][:,1].sum())
waterfall(valid_xs.columns, contributions[0][:,1], threshold=0.01, rotation_value=45,formatting='{:,.3f}');
The contributions are the differences in the mean value between each internal node and the previous node along the row's path. In the case of probabilities, they are the changes in the ratio of "true" to "false" values at each internal node.
The bias is the mean value at the top-most node (it doesn't change for any row). The sum of the contributions plus the bias gives the prediction.
Docs description here.
Partial dependence plot
from sklearn.inspection import plot_partial_dependence
fig,ax = plt.subplots(figsize=(12, 4))
plot_partial_dependence(m, valid_xs_imp, fi.cols[0:2],
grid_resolution=20, ax=ax);
The plot is made with the existing model: the chosen independent variable is set to the same value X for every row, the predictions are averaged, then the same is done for X+1, X+2, etc., and the averaged prediction is plotted against each value.
Handling missing values
In the IEEE fraud detection dataset, using RFs, removing variables with 25% N/A values led to a 1% increase (88% –> 89%) on the ROSed sample.
-
algorithms such as naive Bayes and k-nearest neighbours can work with missing values.
-
People seem to want to drop columns with more than ~30% missing. But if the variable is "important" then you might want to keep it. It is said that the age column in the Titanic dataset adds a lot of value to the final result despite around 30% of the age data being missing.
-
could use linear regression to determine the missing values
-
could use mean/median to determine the values (fastai in addition does another column where it finds where stuff is NaN)
-
MICE imputation???, check out variance of the variable etc…
Handling missing data only in test dataset
def na_check(col): return df.loc[:,col].isna().sum()>0,df_te.loc[:,col].isna().sum()>0
na_col = {col: na_check(col) for col in cont}
na_col_te = [col for col in cont if df.loc[:,col].isna().sum()==0 and df_te.loc[:,col].isna().sum()>0]
#na_col_te = ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14']
na_col_te
df = pd.concat([df.median().to_frame().T, df], ignore_index=True)
## add values to that row
df.loc[df.index[0],na_col_te] = np.nan
df.loc[df.index[0],"isFraud"]=0 ## forcing it to be an int rather than in between like 0.5
df.loc[df.index[0],na_col_te]
df.shape
Gradient boosting
IT WILL OVERFIT.
- Train a small model that underfits your dataset.
- Calculate the predictions in the training set for this model.
- Subtract the predictions from the targets; these are called the “residuals” and represent the error for each point in the training set.
- Go back to step 1, but instead of using the original targets, use the residuals as the targets for the training.
- Continue doing this until you reach some stopping criterion, such as a maximum number of trees, or you observe your validation set error getting worse.
It is training a small model, getting the residuals, and then training more models to reduce these residuals… In the end you add up all of the predictions, unlike in bagging where you average them. There are gradient boosting machines, gradient boosted decision trees, and some other variants.
-
Start with the mean value in one leaf
-
Get the residuals, and train an "underfit" tree on them
-
Scale that tree's contribution
-
Now make many more trees on the successive residuals (see the sketch below).
Note: use 8 to 32 leaves as the max.
Note2: "Scaling" means using a learning rate (e.g., 0.1) to take smaller steps towards the final value. "According to empirical evidence", taking lots of small steps in the right direction results in better predictions.
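A minimal sketch of the recipe above with plain sklearn trees (assuming xs, y are a training frame/series as elsewhere in these notes; all names are mine):
from sklearn.tree import DecisionTreeRegressor
import numpy as np

lr, n_trees, trees = 0.1, 50, []
pred = np.full(len(y), float(y.mean()))                          # start with the mean in one "leaf"
for _ in range(n_trees):
    resid = y - pred                                             # residuals = targets minus current predictions
    t = DecisionTreeRegressor(max_leaf_nodes=8).fit(xs, resid)   # small, underfitting tree
    trees.append(t)
    pred += lr * t.predict(xs)                                   # scale each tree's contribution by the learning rate
# for new data: y.mean() + lr * sum of every tree's prediction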
GB classification
Here we do classification by looking at log odds: log(odds) of something being "isFraud" or not. We convert log(odds) to a probability like in logistic regression:
p = e^(log(odds)) / (1 + e^(log(odds)))
Links
saving tabular pandas and loading them
save_pickle("to.pkl",to)
to = load_pickle(input_path+"fraud-detection/to.pkl")
XGBoost xgb
xgboost decides at training time whether missing values go into the right or left node. It chooses which to minimise loss. If there are no missing values at training time, it defaults to sending any new missings to the right node.—stack
implement kfold or stratified k fold
from sklearn.model_selection import KFold
skf = KFold(n_splits = 2, shuffle = True, random_state = 42)
from sklearn.model_selection import cross_validate
m_cv = cross_validate(m, df_dom, is_test, scoring='roc_auc', cv=skf,
n_jobs=-1, return_train_score=True)
KFold
can be substituted by StratifiedKFold
.
To get output you can do:
m_cv["test_score"],m_cv["train_score"],m_cv["fit_time"]
feature importance handy definitions
## Plot important features
def xgb_fi(m, df,df_real):
fi = pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
).sort_values('imp', ascending=False)
#fi["isCont"] = fi.cols.isin(cont)
fi["countNA"] = [df_real.loc[:,col].isna().sum()/len(df_real) if
col in df_real.columns else float("NaN") for col in fi.cols]
return fi
def plot_fi(fi):
return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)
Metrics
def m_rep(m,xs,y): return classification_report(y, m.predict(xs), labels=[1,0], digits=4, output_dict=True)
def metrics(m, xs, y , valid_xs, valid_y):
tr_rep = m_rep(m,xs,y)
vd_rep = m_rep(m,valid_xs,valid_y)
tr_auc = roc_auc_score(y,m.predict_proba(xs)[:,1])
oob_auc=0.0000
#oob_auc = roc_auc_score(y,m.oob_decision_function_[:,1])
vd_auc = roc_auc_score(valid_y,m.predict_proba(valid_xs)[:,1])
tr_rec_1 = tr_rep["1"]["recall"]
tr_rec_0 = tr_rep["0"]["recall"]
vd_rec_1 = vd_rep["1"]["recall"]
vd_rec_0 = vd_rep["0"]["recall"]
tr_prec_1 = tr_rep["1"]["precision"]
tr_prec_0 = tr_rep["0"]["precision"]
vd_prec_1 = vd_rep["1"]["precision"]
vd_prec_0 = vd_rep["0"]["precision"]
print('{:.4f} ; {:.4f} ; {:.4f}; {:.4f}, {:.4f}; {:.4f}, {:.4f}; {:.4f}, {:.4f}; {:.4f}, {:.4f}'
.format(tr_auc, oob_auc, vd_auc, tr_rec_1, tr_rec_0, vd_rec_1,vd_rec_0,tr_prec_1, tr_prec_0, vd_prec_1,vd_prec_0))
xgb cv for loop OOF out of fold predictions
%%time
from sklearn.model_selection import KFold
skf = KFold(n_splits = 5, shuffle = False)  # random_state only has an effect when shuffle=True (newer sklearn errors otherwise)
tr_pred = np.zeros(len(xs))
oof_pred = np.zeros(len(xs))
#te_preds = np.zeros(len(X_test))
for i, (idxT, idxV) in enumerate(skf.split(xs,y)):
print('Fold',i)
print(' rows of train =',len(idxT),'rows of holdout =',len(idxV))
m = xgb_fun(xs[cols].iloc[idxT], y.iloc[idxT],
eval_set=[(xs[cols].iloc[idxV],y.iloc[idxV])])
tr_pred[idxT] += m.predict_proba(xs[cols].iloc[idxT])[:,1]/(skf.n_splits-1)
oof_pred[idxV] += m.predict_proba(xs[cols].iloc[idxV])[:,1]
#te_preds += clf.predict_proba(test_xs[cols])[:,1]/skf.n_splits
del m; x=gc.collect()
print('{:.4f} ; {:.4f} ; {:.4f}'
.format(roc_auc_score(y,tr_pred),0.0000, roc_auc_score(y,oof_pred)))
Prediction of Test dataset code
def format_test_preds(preds,test_xs,comment):
#preds = m.predict_proba(test_xs[cols])[:,1]
print(pd.DataFrame(preds).describe())
## Make into submission dataframe
df_pred = pd.DataFrame({"TransactionID":test_xs.TransactionID.to_list(),
"isFraud": preds})
## save
df_pred.to_csv(comment+"my_subs.csv",index=False)
print(df_pred.shape)
del df_pred; x=gc.collect()
regex
R in front
r"ba[rzd]" is a raw string: the r prefix only stops Python from interpreting backslash escapes, so the pattern reaches the regex engine exactly as written. It does NOT make the [] literal; the character class [rzd] still matches "bar", "baz" or "bad".
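Quick check of both points:
import re
re.findall(r"ba[rzd]", "bar baz bad bat")   # -> ['bar', 'baz', 'bad']: the class is still a regex
len("\n"), len(r"\n")                        # -> (1, 2): r only changes how backslashes are read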
Cheat Sheet Table is here
Getting True False values from regex’ed columns
tf = [bool(re.search("^"+str_sch+"[0-9]+",col)) for col in
xs.columns.to_list()]
heatmap correlation
%%time
import seaborn as sns
xs_corr = xs.corr()
plt.figure(figsize=(15,15))
#mask = np.triu(xs_corr)
sns.heatmap(xs_corr, cmap='RdBu_r', annot=False, center=0.0)#, mask=mask)
plt.title('All Cols',fontsize=14)
plt.show()
corrwith
df[cols].corrwith(
df[col]).reset_index().sort_values(
0, ascending=False).reset_index(drop=True).rename(
{'index':'Column',0:'Correlation with ' + col}, axis=1)[:20]
cardinality ordered and unordered categorical data
Here they say when you need to model unordered categorical data you could use one-hot encoding. Otherwise, with label encoding you end up treating it like an ordered category, i.e., numerical treatment (A>5, A<5 in the decision tree), where certain splits might not be meaningful.
man… Great find! damn! Never knew these things existed. Thanks a ton. He essentially seems to be saying “not bad” about using “ordered categories” in a “non-ordered-way” with label encoding. right? and we can also specify that and label encode if necessary. I guess we can assume this is the case for “non-ordered categories” as well. What “not bad” means is not clear to me, but for now, this is good enough I guess. I can skip worrying about it and move on then… :slight_smile: thanks mate good one. :slight_smile: – Pandian on discord
Here, exactly jeremy talks about the above. For now we don’t think this is too big of a problem. :) Maybe we can check this later.
Also check this out.
This describes how to do it in python
splitting columns
cents and dollars
def split_cols(df, col = "TransactionAmt"):
df['cents'] = df[col].mod(1)
df[col] = df[col].floordiv(1)
R_emaildomain
def split_cols_str(df,col,delim="."):
df[[col+str(1),col+str(2)]] = df[col].str.split(delim,expand=True,n=1)
Replace string in column
%%time
lst_to_rep = [r"^.*chrome.*$",r"^.*aol.*$",r"^.*[Ff]irefox.*$",r"^.*google.*$",r"^.*ie.*$",r"^.*safari.*$",r"^.*opera.*$",r"^.*[Ss]amsung.*$",r"^.*edge.*$",r"^.*[Aa]ndroid.*$"]
lst_val = ["chrome","aol","firefox","google","ie","safari","opera","samsung","edge","chrome"]
df_all["id_31_browser"].replace(to_replace=lst_to_rep, value=lst_val,
regex=True,inplace=True);
day hr conversion from seconds
TransactionDT given in seconds…
df_all["TransactionHr"] = df_all["TransactionDT"]/(60*60)%24//1/24
df_all["TransactionHr2"] =
(df_all['TransactionDT']%(3600*24)/3600//1)/24.0 #CD's solution
Seconds to days…
df_all["TransactionDay"] = df_all["TransactionDT"]/(60*60)//24
df_all[["TransactionDay","TransactionDT"]].sample(10)
days to weekday
1477538/(3600*24)/7,1477538/(3600*24)%7,1477538/(3600*24)%7//1,1477538/(3600*24)%7/7
df_all["WkDayNum"] = df_all["TransactionDT"]/(3600*24)%7/7
df_all[["WkDayNum","TransactionDT"]].sample(10)
Encoding labels
def change_col_dtype(df,col):
unum = len(df[col].unique())
#print(col,unum,df[col].dtype)
df[col],_ = df[col].factorize() # Label encoding, nans changed to -1
df[col] = df[col].replace(-1,np.nan)
# if unum<128: df[col] = df[col].astype('int8')
# elif unum<32768: df[col] = df[col].astype('int16')
# else: df[col].astype('int32')
print(col,len(df[col].unique()),df[col].dtype)
encoding combine
def encode_CB2(df1,df2,uid):
newcol = "_".join(uid)
## make combined column
df1[newcol] = df1[uid].astype(str).apply(lambda x: '_'.join(x), axis=1)
df2[newcol] = df2[uid].astype(str).apply(lambda x: '_'.join(x), axis=1)
## concat and then factorize
temp_df = pd.concat([df1[newcol],df2[newcol]],axis=0)
temp_df,_ = temp_df.factorize(sort=True)
## unconcat
if temp_df.max()>32000:
df1[newcol+"_fact"] = temp_df[:len(df1)].astype('int32')
df2[newcol+"_fact"] = temp_df[len(df1):].astype('int32')
else:
df1[newcol+"_fact"] = temp_df[:len(df1)].astype('int16')
df2[newcol+"_fact"] = temp_df[len(df1):].astype('int16')
return [newcol,newcol+"_fact"]
datacrunch setup
Docs are here.
What I did is: to create an instance you need to create a key pair, for which I used:
ssh-keygen -t ed25519 -C "your_email@example.com"
Then I provided the public key to DC, and then to ssh into that shit I did
ssh -i path-to-private-key host-ip-address
sudo ssh -i /home/eghx/.ssh/id_ed25519 135.181.63.176
sudo ssh -i /home/eghx/.ssh/id_ed25519 username@135.181.63.176
shortcutted as dc_connect
For the first time it is suggested to create a new user.
When you go into root you can create a new user following these instructions.
To get into a user use su - username
.
currently unable to open jupyter not as root: https://askubuntu.com/questions/1038339/i-always-have-to-access-jupyter-notebook-as-a-root-user
fastai image
It turns out that in the "fastai" image there is already a user: "user". Only
under this "user" (not root or other users you create yourself) are things
like conda or pip set up; this is where anaconda is installed.
To go into this I needed to ssh the following:
sudo ssh -i /home/eghx/.ssh/id_ed25519 user@135.181.63.176
Kaggle API
-
Go to your Kaggle account and then create an API token
-
Copy the downloaded file from B to A.
scp Desktop/kaggle.json user@135.181.63.176:~/.kaggle/
-
check if it has reached. :)
-
change permissions; not sure why, but it is in the official kaggle documentation
chmod 600 ~/.kaggle/kaggle.json
-
401 permissions unauthorized?
Then just get a new token, possible that the old one expired.
changing permissions of folder created by root
sudo chown -R user:user agent18/
user:user
refers to the user –> user and the group –> user.
Extracting zip files from folder
for i in *.zip; do unzip "$i" -d "${i%%.zip}"; done
Extracting datasets from yours and from kernels
kaggle datasets download -d thejravichandran/fraud-detection
kaggle competitions download -c ieee-fraud-detection
kaggle datasets download thejravichandran/ieee-fraud-detection-joined-tables
kaggle kernels pull thejravichandran/fraud-detection-v10-understanding-adv-val
Error with fastai vision-related import statements
Check if fastcore version is correct using conda list fast
and
comparing it with the fastai github yaml content.
Setting up google cloud
https://www.kaggle.com/alexisbcook/get-started-with-google-cloud-platform
https://www.kaggle.com/akashram/auto-updating-data-using-kaggle-api-gcp
https://www.kaggle.com/rosebv/train-model-with-tensorflow-cloud (how to work with tensorflowcloud)
https://www.kaggle.com/product-feedback/159602 –> direct upgrade
Installing
Identify instance OS
cat /etc/*-release
Installing monitoring system
following these instructions: https://cloud.google.com/monitoring/agent/installation
df -h #for disk space
Follow this on SSH terminal (very very different from the terminal associated with jupyterlabs).
Check installed packages
dpkg --get-selections | grep nvidia
todo CV (11 days to go)
1. baseline
- move to gcp or datacrunch.io
- bigger dataset results
- callbacks to get export
- Other architectures
- Implement works from round 5
- what can you find from confusion matrix
- over-fitting, false +ves, false -ves result
- Write a blog
todo tabular
look at data (common man)- clean tables (understand nan na type stuff)
- build pipeline (markus)
- baseline
- Categories fixing
- do the entire nb (by tomorrow)
- tweak stuff (sunday)
- see if you can do something else (add novelty) (monday)
- overfitting reduction (tuesday)
- Write blog on the learnings.(wednesday)
Am thinking of starting with rf working with the valid for now. Let’s see where that gets us. this includes,
Decision trees don't work well here as we require probabilities and AUC. So let's kill it. Going back to the drawing board to look at the data.
Also Let’s stick to basic techniques. Let’s see what OOC error we get then come back and fine tune… Min. effort. quick draft PG style?
- Cleaning (replacing zeros strategies? categorical variables?)
- Bagging, RF, and DL
- looking at data and what else to remove cluster etc…
- reduction of size of columns (svd, clustering etc…)
- boosting?
- regression anova and other stuff later?
Also want to implement sigmoid stuff today for the other assignment. or test other architectures.
today
predict to test,
understand conf matrix, understand what output is required? does label selection make a difference. etc…
add the split: https://www.kaggle.com/c/ieee-fraud-detection/discussion/99993
“Looks like the data has a time split… so make sure you are not using time as a feature.”
“I confirm here that the data is a clean time series split on the TransactionDT. It looks like we might need to consider different validation techniques.”
Heat map might require clustering and scaling to see how they group. Not sure what to do after that…
now (deprecated)
Test on the bigg data, “balanced subsample”, and “oversampling”, and
and use the metrics generated from gridcv.
- reproduce conf matrix, what is tfr and tfr?
- reproduce praba –> roc coiv
-
fastai 5/5 plan to understand rf…
-
look at removing time features as adviced by Peter?
-
CV again understanding whats happening.
-
check where you are today in the whole plan (today lets dag)
-
remove time related features understand why?
-
SETUP proper CV and do the whole feature importance based on validxs again.
-
Make a plan for the coming days.
-
start new sheet
-
Print score function
-
Do gridsearchCV, do feature importance and clustering with check against valid, remove is_valid params, reduce overfitting, Look at params where you add shit loads of nans based on feature importance. Undersampling but not random… check correlation.
Rewrite the whole thing. Clean version.
Let’s remove features and check for isna issues of top features then do testing…
and then do the testing?
-
remove the parts that make it non resilient?
-
Set up RF
-
Setup GridsearchCV with certain params and then run cv on it 5. Number of features, etc…
-
Run weighted oversampling “balanced dataset” for unbalanced DS see if results change then try smote as well.
-
Run RF and check feature importance
-
Filter to 20-50
-
check isna for filled up columns and how we might be affecting it?
-
Filter further by running induvidual tests
-
Look at clustering to remove similar stuff.
-
See how to look at clustering and remove
-
run grid search one last time?
-
Make sure you have provided some extra care for isna types and IMB
-
Generating other columns with date?
-
extrapolate to the all dater.
-
- clusterning remove var, check current grid, make ok.
- Tomo look at overfitting and do the same thing?
- start with GBM asap
-
consider NNs perhaps.
-
final set
- try undersampling alone with not random undersampling but “most different row selection” undersampling.
- try what we have learnt so far…
- Look at GBM and understand the scene…
now
Goal ultimately: XGB, regression types, NN, or this.
-
check if you increase the ood numbers if you get better results. -
do adversarial validation on daterset between test and train using rfs and xgb. Now that I have access to both.
-
See an example of XGB
-
what is XGB?
-
reproduce identifying “same clients” in test and train…
-
Understand XGB, stratified kfold and then adversarial validation??? (8 videos + notes)
-
come up with plan for next 7 days. where do you want to be and check everyday. Parkinson’s law.
-
EDA: reproducing his plots and understanding what he means,
-
Run XGB and see where we are at with default
-
plan some tuning with cv
Also see stratified kfold. See what he has done and plan reproduction…
find out what they did for their unbalanced dataset. He didn’t care.
tomorrow
-
ensemble, gbm, regression
-
rf parameters and what they mean, what is the output actually outputting? auc? TP?
-
removing time elements
-
how to tune which parameters
-
why is rf overfitting.
-
Improve ERROR
-
REMOVING 200 columns?
-
heatmaps, oversampling, undersampling, fastai checks (boosting etc…) and remove variables and test, remove time variables and it’s corr variables, pca, normalizing?, isfraud a float or a class?
-
nn?
-
regression anova etc…
Experiments
- with valid without valid prediction for whole dataset
- 100k prediction vs 590k prediction
- wht columns can you remove?
observations
- categ split between 590k set and 50k sample is the same 3% in this case.
- time based split.
- Peter suggests to remove time details.
- for the final set take everything
-
Apparently most of the ML techniques are expecting balanced datasets —Statquest
 | tr recall | vd recall | tr prec | vd prec |
---|---|---|---|---|
baseline | 37.4 | 26.9 | 94.8 | 79.6 |
balanced (1,26) | 67.6 (80%) | 30.2 (12%) | 88.7 (-6.5%) | 75.3 (-5.5%) |
subsample balanced | 67.5 (80%) | 32.9 (22%) | 88.0 (-7.2%) | 70.9 (-11.7%) |
overweighting (1,1000) | 65.3 (80%) | 25.1 (-7%) | 84.5 (-10.9%) | 84.0 (5%) |
- You want to predict on a set that has times not seen in training. This will give you an idea of how good your dater is. In my case valid gives 0.88 while training gives 0.99 or something. Using the OOB score I think is pointless as it behaves like cross-validation: for the same reason we think cv is useless here, we also think oob is useless.
Reason for difference in train and valid auc score
-
different imbalance in train and valid
NO. Both have the same % of imbalance :) I would go on to say that the same will be the case of imbalance in the test dataset too as I get similar auc scores.
-
“overfitting”
-
time info inside
-
too many unnecessary rows
-
precision and recall of “1” too poor( 78% and 26% in valid). even training is bad.
plannensie
-
USing GPU???? rapids?
-
Get data processed (done)
-
Do AV like Konrad, but based on suggestions from Chris read part 1 and 2 of this as linked by Konrad
-
Use XGB to predicta awards section without imputing.
-
Userid identify
-
EDA
-
tackle NaNs
-
feature importance
-
Check validation strategy come back and see how to circumvent gpu issues. I am thinking of going to datacrunch today… use GPU, get a bit faster computations… How much faster are they… Please find out?
download the dater…
then we need to do other things… check fastai install!