Learning R
R on emacs
Look here for the installation and setup of R on Emacs.
Using R basics
To get a console simply do:
M-x R
The setup we now want to have is the R editor on the left and the R console on the right. Create a testing.R file. Run C-x 3 to split the window. Go to the other window (C-x o). Open the console using M-x R. The following command will run the file in the console:
C-c C-l
Some more helpful configuration things are given in here. This shall be taken into account later if needed.
The _ key is bound to <- by default in ESS mode. ess-smart-underscore tries to overcome this issue (I have not yet tested how).
https://stackoverflow.com/questions/15289995/how-to-get-help-in-r
Useful Keybindings, shortcuts
C-c C-v Help for object
Source has lots of other R ESS Emacs Key bindings
R packages and needed installation
When R is downloaded from CRAN we get the “base” R system. CRAN is also the primary location to obtain additional packages.
Installing R packages required
Packages are usually found on CRAN or the Bioconductor project. More info in the c1w2installingRpackages file. To install a package:
install.packages("packagename1")
It also installs the dependencies automatically.
In case you want to use Bioconductor:
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("packagename1","packagename2"))
You can also install packages directly in RStudio: Tools > Install Packages.
To load a library use:
library(packagename)
To see what is attached on the search path (and therefore which packages are loaded):
search()
To list the functions in a loaded package:
ls("package:packagename")
Installing xlsx
This is particularly painful to get, considering something about my R version being 3.5.
install.packages("xlsx")
will give an error about rJava. In order to install rJava follow the procedure below:
sudo apt-get install default-jdk
sudo R CMD javareconf
sudo apt-get install r-cran-rjava
This is where you will have more errors, something like this:
the following packages have unmet dependencies:
r-cran-rjava : Depends: r-api-3.4
E: Unable to correct problems, you have held broken packages.
So I followed this answer from Stack Overflow, which does the following on the terminal:
sudo add-apt-repository ppa:marutter/c2d4u3.5
sudo apt-get update
Then coming back to the rJava installation:
sudo apt-get install libgdal-dev libproj-dev
And finally in your R console (RStudio or terminal):
install.packages("rJava")
This, followed by install.packages("xlsx"), should do the trick. For the main installation of rJava I looked here.
installing XML
From the Stack Overflow answer here, go to the terminal and do:
sudo apt-get update
sudo apt-get install libxml2-dev
followed by install.packages("XML"). It works! Sometimes RCurl might also be needed.
Looks like I broke my R installation, or 3.5 is unstable?
installing RCurl
From this answer, go to the terminal and do:
sudo apt-get install libcurl4-openssl-dev
It works after this!
installing MySQL
According to here, we can use APT to install; nothing more is given. So I try based on DigitalOcean:
sudo apt-get install mysql-server
Running the security script:
mysql_secure_installation
This will prompt you for the root password you created in Step 1. You can press Y and then ENTER to accept the defaults for all the subsequent questions, with the exception of the one that asks if you’d like to change the root password. You just set it in Step 1, so you don’t have to change it now.
mysql_install_db    # before 5.7.6
mysqld --initialize # for 5.7.6
You are supposed to get this error:
2016-03-07T20:11:15.998193Z 0 [ERROR] --initialize specified but
the data directory has files in it. Aborting.
But I got this:
[Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use explicit_defaults_for_timestamp server option (see documentation for more details).
According to fromdual, the suggestion is to just follow the advice of the error. Some functionality is being deprecated.
Our advice is to enable the variable explicit_defaults_for_timestamp now on your testing systems so you can see if your application behave well and then you are prepared for the next release when this feature will become the default.
In the short term this warning is NOT dangerous. In the long term you have to be prepared for the deprecated functionality. You get rid of the warning by setting explicit_defaults_for_timestamp = 1 in your my.cnf [mysqld] section.
You can find the my.cnf
file in the following locations, and in
this order the values override each other:
 /etc/my.cnf
 /etc/mysql/my.cnf
 $MYSQL_HOME/my.cnf
 [datadir]/my.cnf
 ~/.my.cnf
You can also find your file by
find / -name my.cnf
I added the following in /etc/mysql/my.cnf
[mysqld]
explicit_defaults_for_timestamp = 1
Now try
mysqld --initialize
And I got a reduced error, without the timestamp stuff:
mysqld: Can't create directory '/var/lib/mysql/' (Errcode: 17 - File exists)
2018-11-12T20:13:45.024116Z 0 [ERROR] Aborting
Is this good? Not sure, so I look deeper. According to this accepted Stack Overflow answer for the error you need to:
sudo -i       # log into root
cd /var/lib/mysql
rm -r *
su username   # get back to the original user
mysqld --initialize
This still gave me the exact same error:
mysqld: Can't create directory '/var/lib/mysql/' (Errcode: 17 - File exists)
2018-11-12T20:13:45.024116Z 0 [ERROR] Aborting
I am not sure the accepted answer helped at all with removing /var/lib/mysql, but sudo helped:
sudo mysqld --initialize
worked… No error at all!
Testing:
systemctl status mysql.service
additional check:
mysqladmin -p -u root version
In R,
install.packages("RMySQL")
installing for hdf5
This will install packages from Bioconductor (http://bioconductor.org/), primarily used for genomics but also home to good “big data” packages. It can be used to interface with hdf5 data sets.
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
other packages installed
"plyr", "Hmisc", "reshape","jpeg", "stringr", "lubridate",
"quantmod", "reader", "rafalib","kernlab", "rmarkdown"
Learning R
I am currently doing the Data Science Specialization at 40€/month on Coursera (Course 2: R Programming).
My DS repository is on github.
R commands that might be useful

Useful for discovering new commands
help.search("concatenate")

syntax of new commands
str(function)

DATE
as.Date(Sys.time())
unclass removes all the formatting and stores the date as an integer counting from 1970-01-01.
unclass(as.Date(Sys.time()))

TIME
Used for storing in data frames
p <- as.POSIXct(Sys.time()) ## [1] "2013-01-24 22:04:14 EST"
y <- as.POSIXct("2012-10-25 06:00:00", tz = "GMT")
Used for getting individual components, e.g. seconds:
p <- as.POSIXlt(Sys.time()); p$sec

strptime
x <- strptime(datestring, "%B %d, %Y %H:%M")

factor, table, attr
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x
[1] yes yes no yes no
Levels: no yes
attr(,"levels")
[1] "no" "yes"

Missing values
 is.na(); could be integer NA or character NA
 is.nan(); NA is not NaN, but NaN is also NA

clear variable space based on this stack question
rm(list=ls())

str
Gives one-line info about an object; compactly displays large lists.
str(x)
 num [1:100] 1.87 2.51 7.07 1.93 3.53 ...
> f <- gl(40,10)
> str(f)
 Factor w/ 40 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1
str is really useful to get quick info on structure of object.

summary
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -8.560  -1.445   1.225   1.583   4.454  13.904

loading and unloading libraries (source)
library(library_name_not_in_quotes)
detach("package:RMySQL", unload=TRUE)
Calculating memory
According to stack and the coursera course,
memory required = no. of columns * no. of rows * 8 bytes/numeric
So, for example, if you have 1,500,000 rows and 120 columns you will need more than 1.34 GB of spare memory.
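A minimal sketch of that rule of thumb, using the example numbers above:
rows  <- 1500000
cols  <- 120
bytes <- rows * cols * 8   # 8 bytes per numeric value
bytes / 2^30               # ~1.34 GB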
Loop functions

apply
rowSums = apply(x, 1, sum) rowMeans = apply(x, 1, mean) colSums = apply(x, 2, sum) colMeans = apply(x, 2, mean)

lapply (list): takes a list and applies a function over every element. Example to get the first column of each matrix in the list:
lapply(x, function(elt) elt[,1])
 sapply: simplifying output of lapply
 mapply (list):
mapply(fun,variable_1_range,variable_2_range,...)
>mapply(rnorm,1:3,c(4,4,4),c(0.5,0.5,0.5))
[[1]]
[1] 3.310842
[[2]]
[1] 3.369922 3.665328
[[3]]
[1] 3.917087 4.670286 3.666672

 tapply (list): X has to be a vector; be careful
tapply(X, INDEX, FUN)
# equivalent to: lapply(split(x, f), mean)
> x <- c(rnorm(10), runif(10), rnorm(10, 1))
> f <- gl(3, 10)
> f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3
[24] 3 3 3 3 3 3 3
Levels: 1 2 3
> tapply(x, f, mean)
        1         2         3
0.1144464 0.5163468 1.2463678

split:
split(dataframe, dataframe$month)

Splitting more than one level can be done using interaction
interaction(gl(2,5), gl(5,2))

Examples using Loop functions

There will be an object called ‘iris’ in your workspace. In this dataset, what is the mean of ‘Sepal.Length’ for the species virginica?
library(datasets) data(iris)
The answer is obtained by:
mean(iris$Sepal.Length[iris$Species == "virginica"])
mean(iris$Sepal.Length[101:150])
s <- split(iris, iris$Species)
lapply(s, function(x) mean(x[,"Sepal.Length"]))
sapply will simplify the results into a named vector, which is cool:
s <- split(iris, iris$Species)
sapply(s, function(x) mean(x[,"Sepal.Length"]))
tapply needs X to be an atomic vector (not a data frame, not a list). But you can use it like this:
tapply(iris$Sepal.Length, iris$Species, mean)
with: you need to use the full names of the columns!
with(iris, tapply(Sepal.Length, Species, mean))
Debugging
Mostly you won't need them, or you can get quite far without using them.
"Actually, I was using R for 6 years and didn't know about any of these functions." - the teacher from Coursera

 traceback: which function call you are in and where the error occurred
traceback()

 debug: you say
debug(mean)
in your code, and any time mean()
is executed, it will stop at the first line and allow you to go line by line in a browser
debug(mean)

browser: In the console window you get this browser which doesn’t need much explanation

 trace: allows you to insert debugging code into a function at specific places

 recover: allows you to modify the error behavior so that you can browse the function call stack at the actual point of the error. More importantly,
options(error=recover); options(error=NULL)
This will get you to the line just before the error gets triggered, and you can do some things from there; what exactly you can do is not fully clear (see the small sketch below).
Functions

 Lazy evaluation happens with function arguments: they are only evaluated when used, so f(2) below works even though b is never supplied. A function also returns (and auto-prints) its last expression, i.e.,
f <- function(a, b) { a^2 }
f(2)
In case you don’t want to print the returned value then use
invisible(a^2)
for example. 
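A minimal sketch of invisible() (f is an illustrative name):
f <- function(a, b) { invisible(a^2) }
f(2)        # prints nothing
x <- f(2)   # but x is 4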
Argument matching happens based on
 Check for exact match for a named argument
 Check for a partial match
 Check for a positional match

 you can also pass ... as arguments, which will be passed on! (See the sketch below.)
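A small sketch of forwarding ... (my_plot is an illustrative name):
my_plot <- function(x, y, ...) {
  plot(x, y, ...)                   # col, pch, etc. get passed through
}
my_plot(1:10, rnorm(10), col = "blue", pch = 19)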
Static or lexical scoping
 Basically it looks for the variable in the environment in which the function was defined. functions can be defined in either the global environment or the function environment.
y < 10
f <- function(x) {
y < 2
y^2 + g(x)
}
g <- function(x) {
x*y
}
f(3) is 34.
As explained here in better detail. Consider:
a=1
b=2
f <- function(a,b)
{
return( function(x) {
a*x + b
})
}
g=f(2,1)
g(2)
Here g=f(2,1)
returns a function which looks like:
g <- function(x){
2*x+1
}
This is what happens in Lexical scoping.
<<-
Apparently <<- is used like this:
The operators ‘<<-’ and ‘->>’ are normally only used in functions, and cause a search to be made through parent environments for an existing definition of the variable being assigned. If such a variable is found (and its binding is not locked) then its value is redefined, otherwise assignment takes place in the global environment. --from the help
As said above, this is absolutely not clear unless you read the greatest explanation of all time, i.e., the man page. The basics of lexical scoping are fine: when you have a function (child) defined inside a function (parent), the child inherits the variables defined in the parent. I.e.,
a=1
b=2
f <- function(a,b)
{
return( function(x) {
a*x + b
})
}
g=f(2,1)
g(2)
But, say you want to modify the parent variables from within the child function; then you have to use <<- to gain access to modify the parent variables. See the example below from the R man page on Scope. Excellent example and description, copied verbatim for future direct reference.
The special assignment operator, <<-, is used to change the value associated with total. This operator looks back in enclosing environments for an environment that contains the symbol total and when it finds such an environment it replaces the value, in that environment, with the value of the right hand side. If the global or top-level environment is reached without finding the symbol total then that variable is created and assigned to there. For most users <<- creates a global variable and assigns the value of the right hand side to it. Only when <<- has been used in a function that was returned as the value of another function will the special behavior described here occur.
open.account <- function(total) {
list(
deposit = function(amount) {
if(amount <= 0)
stop("Deposits must be positive!\n")
total <<- total + amount
cat(amount, "deposited. Your balance is", total, "\n\n")
},
withdraw = function(amount) {
if(amount > total)
stop("You don't have that much money!\n")
total <<- total - amount
cat(amount, "withdrawn. Your balance is", total, "\n\n")
},
balance = function() {
cat("Your balance is", total, "\n\n")
}
)
}
ross <- open.account(100)
robert <- open.account(200)
ross$withdraw(30)
ross$balance()
robert$balance()
ross$deposit(50)
ross$balance()
ross$withdraw(500)
The following example shows more about the meaning of <<-, from this Stack Overflow answer.
new_counter <- function() {
  i <- 0
  function() {
    # do something useful, then ...
    i <<- i + 1
    i
  }
}
counter_one <- new_counter()
counter_two <- new_counter()
counter_one() # -> [1] 1
counter_one() # -> [1] 2
counter_two() # -> [1] 1
Probability distributions
Every distribution that R handles has four functions. There is a root name, for example, the root name for the normal distribution is norm. This root is prefixed by one of the letters
 p for “probability”, the cumulative distribution function (c. d. f.)
 q for “quantile”, the inverse c. d. f.
 d for “density”, the density function (p. f. or p. d. f.)
 r for “random”, a random variable having the specified distribution — Random website
For the normal distribution, these functions are pnorm, qnorm, dnorm, and rnorm. For the binomial distribution, these functions are pbinom, qbinom, dbinom, and rbinom. And so forth.
For a continuous distribution (like the normal), the most useful functions for doing problems involving probability calculations are the “p” and “q” functions (c. d. f. and inverse c. d. f.), because the density (p. d. f.) calculated by the “d” function can only be used to calculate probabilities via integrals and R doesn’t do integrals.
For a discrete distribution (like the binomial), the “d” function calculates the density (p. f.), which in this case is a probability
f(x) = P(X = x) and hence is useful in calculating probabilities. — Random website
Distributions are Normal, Poisson, binomial, etc.

 dnorm: gives the probability density function, i.e., the p.d.f. f(x), the density at a given value of x

pnorm: gives F(x) in P(X<=x)=F(x) i.e., c.d.f cumulative distribution function

 qnorm: gives x for a given P, based on x = F^(-1)(P), i.e., the i.c.d.f., inverse cumulative distribution function

rnorm: gives random variables in that distribution for given mean and standard deviation.
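A quick sketch of the four functions for the normal distribution:
dnorm(0)                      # density at x = 0, about 0.399
pnorm(1.96)                   # P(X <= 1.96), about 0.975
qnorm(0.975)                  # inverse c.d.f., about 1.96
rnorm(5, mean = 10, sd = 2)   # five random draws from N(10, 2^2)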
Lists
Quick aside - lists
mylist <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:25, ncol = 5))
head(mylist)
$letters
[1] "A" "b" "c"
$numbers
[1] 1 2 3
[[3]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf
Quick aside - lists
mylist[1]
$letters
[1] "A" "b" "c"
mylist$letters
[1] "A" "b" "c"
mylist[[1]]
[1] "A" "b" "c"
http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf
Dataframe

reading data with header column (default)
outcome <- read.csv("outcome-of-care-measures.csv", colClasses = "character"); head(outcome)

quick look at data
str(outcome[,11:16]); head(outcome[,11]); nrow(outcome); ncol(outcome); names(outcome)

 important: you need to actively remove NA if doing something like this:
dim(data[data$val==24,1])
otherwise NA rows are also counted, because logical subsetting keeps rows where the comparison is NA (see the sketch below)!
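A small sketch of that NA pitfall (data and val are illustrative names):
data <- data.frame(val = c(24, NA, 24, 7))
nrow(data[data$val == 24, , drop = FALSE])         # 3: the NA row is kept as an all-NA row
nrow(data[which(data$val == 24), , drop = FALSE])  # 2: which() drops the NA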
DataTable

 Much, much faster at creating, subsetting, grouping, and updating than data.frame, it appears.

All functions that accept data.frame work on data.table

creation (simple installation)
library(data.table) DT = data.table(x=rnorm(9),y=rep(c("a","b","c"),each=3),z=rnorm(9))

read and write table
write.table(big_df, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE) system.time(fread(file))

Subsetting columns
DT[,c(2,3)]

subsetting rows
DT[c(2,3),] DT[c(2,3),]

 and difference between DF and DT
DF[c(2,3)] != DT[c(2,3)]
 If only the first argument is present, DF[c(2,3)] gives columns and DT[c(2,3)] gives ROWS (see the sketch below).
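A quick sketch of that difference (DF and DT are illustrative objects):
library(data.table)
DF <- data.frame(x = 1:5, y = letters[1:5], z = runif(5))
DT <- data.table(DF)
DF[c(2, 3)]   # columns y and z
DT[c(2, 3)]   # rows 2 and 3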

Functions on columns
DT[,list(mean(x), sum(z))]
Result is still a DT.
DT[,table(y)]

Adding new columns
DT[,w:=z^2]; DT[,m:= {tmp <- (x+z); log2(tmp+5)}]; DT[,a:=x>0]

 function by: compute a new column grouped by column 'a'. 'a' can be boolean or character (these make most sense, but probably even a number can be used):
DT[, b := mean(x+w), by = a]

counting elements based on a factor
DT[,.N, by=x]


setting key helps in merging
DT1 <- data.table(x=c('a', 'a', 'b', 'dt1'), y=1:4)
DT2 <- data.table(x=c('a', 'b', 'dt2'), z=5:7)
setkey(DT1, x); setkey(DT2, x)
merge(DT1, DT2)
Merge happens based on x:
   x y z
1: a 1 5
2: a 2 5
3: b 3 6

help
The latest development version contains new functions like melt and dcast for data.tables https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable Here is a list of differences between data.table and data.frame http://stackoverflow.com/questions/13618488/what-you-can-do-with-data-frame-that-you-cant-in-data-table Notes based on Raphael Gottardo's notes https://github.com/raphg/Biostat578/blob/master/Advanced_data_manipulation.Rpres, who got them from Kevin Ushey.
^^copied from https://github.com/DataScienceSpecialization/courses/blob/master/03_GettingData/01_09_dataTable/index.md
expressions
{
x = 1
y = 2
}
k = {print(10); 5}
print(k)
[1] 10
[1] 5
Handling different Data
 Popular databases where data is stored: SQL, MongoDB.
file, downloading, directory

 getting and setting the working directory
getwd() setwd()

does file exist or not
file.exists("directoryName") dir.create("directoryName") if (!file.exists("data")){ dir.create("data") }

downloading a file from the internet
download.file(fileUrl, destfile="./data/camera.csv", method="curl") list.files("./data")

keep track of dateDownloaded
dateDownloaded <- date()

tab separated files or just normal text files with data are very well automatically loaded with this command.
read.table("")

 comma separated, with header=TRUE
read.csv("") or read.table("", sep=",", header=TRUE)

other parameters that might be useful:

 quote: tell R whether there are quoted values; quote="" means no quotes
 na.strings: set the character that represents a missing value
 nrows: how many rows to read from the top
 skip: how many rows to skip from the top

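A small sketch putting those arguments together (the file name is illustrative):
dat <- read.table("mydata.txt", sep = "\t", header = TRUE,
                  quote = "",        # no quoted values
                  na.strings = "?",  # '?' marks a missing value
                  nrows = 100,       # read at most 100 rows of data
                  skip = 2)          # skip the first 2 lines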

Read and write text files, large files, small files
https://rpubs.com/msundar/large_data_analysis
Have a look!
fixed width files fwf
According to this Stack Overflow post, which is a question directly from Coursera, read.fwf is a function used to read a fixed-width file.
The same thing can be done in several ways:
library(readr)
x <- read_fwf(
file="http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
skip=4,
fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))
x <- read.fwf(
file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
skip=4,
widths=c(12, 7, 4, 9, 4, 9, 4, 9, 4))
A negative width (e.g. -1) in widths skips that column quickly.
df <- read.fwf(
file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
widths=c(1, 9, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4),
skip=4
)
x <- readLines(con=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"))
# Skip 4 lines
x <- x[-(1:4)]
mydata <- data.frame(var1 = substr(x, 1, 10),
var2 = substr(x, 16, 19),
var3 = substr(x, 20, 23),
var4 = substr(x, 29, 32) # and so on and so on
)
However, having the header is a problem. According to ?read.fwf, when using the header argument you also need the sep argument.
The problem is discussed here. If necessary, the header and the content can be attached separately, or you can go to the extent of modifying the data to have dummy delimiters.
Excel
 installation was fucking intense. Look in the above sections for it.

reading excel files uses package
xlsx
library(xlsx) read.xlsx("./data.xlsx", sheetIndex=1, header=TRUE,colIndex=7:15, rowIndex=18:23)

writing
write.xlsx

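A minimal sketch of writing with the xlsx package (file and sheet names are illustrative):
library(xlsx)
write.xlsx(mtcars, file = "./cars.xlsx", sheetName = "cars", row.names = TRUE)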
tips

read.xlsx2
is faster but less stable for subsets than read.xlsx

XLConnect package has more options for manipulating Excel data.

XML
 example of XML from w3schools
<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>
Two of our famous Belgian Waffles with plenty of real maple syrup
</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>
Light Belgian waffles covered with strawberries and whipped cream
</description>
<calories>900</calories>
</food>

 for installation look in the above sections on installing packages. Some work needs to be done for this package and it takes 10-15 mins to install!

 Reading the file; the library is XML. The following does not work:
library(XML)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
doc <- xmlTreeParse(fileUrl, useInternal=TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
[1] "breakfast_menu"
names(rootNode)

 Reading the file; the library is XML, from this Stack Overflow answer and not the DSS course from Coursera!
library(XML)
library(RCurl)
fileURL <- "https://www.w3schools.com/xml/simple.xml"
xData <- getURL(fileURL)
doc <- xmlParse(xData)
rootNode <- xmlRoot(doc)
 Also this works, from the discussion forums of Coursera:
library(XML)
library(httr)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
doc <- xmlTreeParse(GET(fileUrl), useInternal=TRUE)
rootNode <- xmlRoot(doc)
The doc variable has all the text in the XML file; rootNode has only the relevant XML info.

 extracting info from rootNode or xmlRoot(doc):
xmlName(rootNode)
names(rootNode)
rootNode[[1]]
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
rootNode[[1]][[1]]
<name>Belgian Waffles</name>
xmlSApply(rootNode,xmlValue)
"Belgian Waffles$5.95Two of our famous Belgian Waffle ...

 XPath is the language of XML, or something like that. The following can be used to extract info from rootNode:
 /node Top level node
 //node Node at any level
 node[@attrname] Node with an attribute name
 node[@attrname='bob'] Node with attribute name attrname='bob'
xpathSApply(rootNode,"//price",xmlValue)

 Another example of XPath extraction, based on http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens
fileUrl <- "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens"
doc <- htmlTreeParse(rawToChar(GET(fileUrl)$content), useInternalNodes = TRUE)
xpathSApply(doc, "//title", xmlValue)
xpathSApply(doc, "//div[@class='score']", xmlValue)
teams <- xpathSApply(doc, "//div[@class='team-name']", xmlValue)
Even if http moved to https, it can handle it. Another example is as follows:
library(XML)
library(httr)
url < "http://scholar.google.com/citations?user=HII6C0AAAAJ&hl=en"
html < htmlTreeParse(rawToChar(GET(url)$content), useInternalNodes = TRUE)
xpathSApply(html, "//title", xmlValue)
xpathSApply(html, "//td[@class='gsc_a_c']", xmlValue)
Another way to do it is:
library(httr)
url < "http://scholar.google.com/citations?user=HII6C0AAAAJ&hl=en"
library(httr); html2 = GET(url)
content2 = content(html2,as="text")
parsedHtml = htmlParse(content2,asText=TRUE)
xpathSApply(parsedHtml, "//title", xmlValue)
This shit doesn't work:
doc <- htmlTreeParse(getURL(fileUrl), useInternal=TRUE)
You are able to extract the elements in div tags of a given class, i.e., the teams and scores from the website. The team-name class doesn't work anymore as it is no longer a class on that page.
Things that don’t work with XML and html parsing
url <- getURL("http://scholar.google.com/citations?user=HII6C0AAAAJ&hl=en", ssl.verifyPeer=FALSE)
html <- htmlTreeParse(url, useInternalNodes = TRUE)
doc <- xpathSApply(html, "//title", xmlValue)
doc
[1] "302 Moved"
JSON

Similar to XML but different in syntax and format.

 Simple installation in R using install.packages(). It has a curl dependency though.
It is a text file consisting of dataframes, each cell in the DF can also be a DF!

getting data
library(jsonlite); jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")

getting the “column” names
names(jsonData)

getting certain “columns” from the data
jsonData$id jsonData$column_name

accessing jsondata nested data across all rows
jsonData$owner$login

Writing DF to json
myjson <- toJSON(iris, pretty=TRUE); cat(myjson)

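A quick sketch of the round trip back from JSON:
library(jsonlite)
myjson <- toJSON(iris, pretty = TRUE)
iris2 <- fromJSON(myjson)   # back to a data frame
head(iris2)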
help or further resources
 http://www.json.org/

a good tutorial on jsonlite  http://www.rbloggers.com/newpackagejsonliteasmarterjsonencoderdecoder/
 json cran page
MySQL
MySQL is a type of database software, frequently used in internet-based applications. You need to install MySQL and install RMySQL for this.
The MySQL™ software delivers a very fast, multithreaded, multiuser, and robust SQL (Structured Query Language) database server. — Source
Anyways, it looks like you need a server-and-client way of working. So databases are stored in the server and accessed by the client. Still very vague, but moving on.
So I think what we are doing is running the server from within our own PC, and that is what the installation of the server was all about.
 mysqld is the server executable (one of them)
 mysql is the command line client
 mysqladmin is a maintenance or administrative utility
Basics
Starting and stopping and checking status
Status:
systemctl status mysql.service
Start:
sudo service mysql start
or
sudo /etc/init.d/mysql start
Stop:
sudo service mysql stop
Accessing as the client from the terminal, once the server is running, is done by:
mysql -u root -p
Connecting to servers online and using their stuff
UCSC server and instructions are here.
Library
library("RMySQL")
Connecting:
ucscDb <- dbConnect(MySQL(), user="genome",
                    host="genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb, "show databases;")
dbDisconnect(ucscDb)
[1] TRUE
Result contains list of all databases. You can connect to a particular database and extract its info.
hg19 <- dbConnect(MySQL(), user="genome", db="hg19",
                  host="genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)
Gives a list of tables in this database.
Identifying info about the table in a particular database.
dbListFields(hg19,"affyU133Plus2")
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
Reading the table:
affyData <- dbReadTable(hg19, "affyU133Plus2")
Getting a subset
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)
Not accidentally getting a ton of data
affyMisSmall <- fetch(query, n=10); dbClearResult(query);
dbDisconnect(hg19)
SQL commands
List of mysql commands, blog of other commands.
SQL within R without MySQL
sqldf
library is used for the purpose of using sql on R data frames
directly.
According to this Stack Overflow question, sqldf has to use a driver given by the argument drv. If nothing is specified, it looks for other loaded libraries and tries to use their driver, which is when, for a command like sqldf("select * from df limit 6", drv="SQLite"), you get errors regarding the driver.
So do one of the following:
detach("package:RMySQL", unload=TRUE)
options(sqldf.driver = "SQLite")
sqldf("select * from df limit 6", drv="SQLite")
SQL queries
sqldf("select distinct AGEP from acs")
sqldf("select pwgtp1 from acs where AGEP < 50")
HDF5
Hierarchical data format.
This lecture is modeled very closely on the rhdf5 tutorial that can be found here: http://www.bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.pdf
Start
library(rhdf5)
created = h5createFile("example.h5")
Create groups
created = h5createGroup("example.h5","foo")
created = h5createGroup("example.h5","baa")
created = h5createGroup("example.h5","foo/foobaa")
h5ls("example.h5") # list groups in the table
Write groups
Write matrix directly
A = matrix(1:10,nr=5,nc=2)
h5write(A, "example.h5","foo/A")
B = array(seq(0.1,2.0,by=0.1),dim=c(5,2,2))
attr(B, "scale") < "liter"
h5write(B, "example.h5","foo/foobaa/B")
h5ls("example.h5")
group name otype dclass dim
0 / baa H5I_GROUP
1 / foo H5I_GROUP
2 /foo A H5I_DATASET INTEGER 5 x 2
3 /foo foobaa H5I_GROUP
4 /foo/foobaa B H5I_DATASET FLOAT 5 x 2 x 2
Write data sets
df = data.frame(1L:5L,seq(0,1,length.out=5),
c("ab","cde","fghi","a","s"), stringsAsFactors=FALSE)
h5write(df, "example.h5","df")
h5ls("example.h5")
group name otype dclass dim
0 / baa H5I_GROUP
1 / df H5I_DATASET COMPOUND 5
2 / foo H5I_GROUP
3 /foo A H5I_DATASET INTEGER 5 x 2
4 /foo foobaa H5I_GROUP
5 /foo/foobaa B H5I_DATASET FLOAT 5 x 2 x 2
Reading data
readA = h5read("example.h5","foo/A")
readB = h5read("example.h5","foo/foobaa/B")
readdf= h5read("example.h5","df")
readA
Writing and reading chunks
h5write(c(12,13,14),"example.h5","foo/A",index=list(1:3,1))
h5read("example.h5","foo/A")
Notes and further resources
 hdf5 can be used to optimize reading/writing from disk in R
 The rhdf5 tutorial: http://www.bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.pdf
 The HDF group has information on HDF5 in general http://www.hdfgroup.org/HDF5/
Webscraping html reading
Webscraping: programmatically extracting data from the HTML code of websites.
 It can be a great way to get data How Netflix reverse engineered Hollywood
 Many websites have information you may want to programmatically read
 In some cases this is against the terms of service for the website
 Attempting to read too many pages too quickly can get your IP address blocked
http://en.wikipedia.org/wiki/Web_scraping
Getting data off webpages  readLines()
con = url("http://scholar.google.com/citations?user=HII6C0AAAAJ&hl=en")
htmlCode = readLines(con)
close(con)
[1] "<!DOCTYPE html><html><head><title>Jeff Leek  Google Scholar
Citations</title><meta name=\"robots\" content=\"noarchive\"><meta
httpequiv=\"ContentType\"
content=\"text/html;charset=ISO88591\"><meta
httpequiv=\"XUACompatible\" content=\"IE=Edge\"><meta
name=\"formatdetection\" content=\"telephone=no\"><link
rel=\"canonical\"
href=\"http://scholar.google.com/citations?user=HII6C0AAAAJ&hl=en\"><style
type=\"text/css\" media=\"screen,
Parsing with XML
The actual lecture notes is sooooooooooo wrong.
library(XML)
library(httr)
url < "http://scholar.google.com/citations?user=HII6C0AAAAJ&hl=en"
html < htmlTreeParse(rawToChar(GET(url)$content), useInternalNodes = TRUE)
xpathSApply(html, "//title", xmlValue)
[1] "Jeff Leek  Google Scholar Citations"
xpathSApply(html, "//td[@class='gsc_a_c']", xmlValue)
[1] "Cited by" "397" "259" "237" "172" "138" "125" "122"
[9] "109" "101" "34" "26" "26" "24" "19" "13"
[17] "12" "10" "10" "7" "6"
GET from the httr package
Somehow this works in the first try.
library(httr)
url < "http://scholar.google.com/citations?user=HII6C0AAAAJ&hl=en"
library(httr); html2 = GET(url)
content2 = content(html2,as="text")
parsedHtml = htmlParse(content2,asText=TRUE)
xpathSApply(parsedHtml, "//title", xmlValue)
xpathSApply(parsedHtml, "//td[@class='gsc_a_c']", xmlValue)
[1] "Jeff Leek  Google Scholar Citations"
Accessing websites with passwords
pg1 = GET("http://httpbin.org/basic-auth/user/passwd")
pg1
Response [http://httpbin.org/basic-auth/user/passwd]
Status: 401
Content-type: application/json
pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user","passwd"))
pg2
{
"authenticated": true,
"user": "user"
}
names(pg2)
[1] "url" "handle" "status_code" "headers" "cookies" "content"
[7] "times" "config"
http://cran.r-project.org/web/packages/httr/httr.pdf
Using handles
Handles are used when you want to authenticate once and let every subsequent request use the same authentication.
google = handle("http://google.com")
pg1 = GET(handle=google,path="/")
pg2 = GET(handle=google,path="search")
http://cran.r-project.org/web/packages/httr/httr.pdf
Notes and further resources
 R Bloggers has a number of examples of web scraping http://www.r-bloggers.com/?s=Web+Scraping
 The httr help file has useful examples http://cran.r-project.org/web/packages/httr/httr.pdf
 See later lectures on APIs
APIs
Source for API usage when the time is right, not now!
Example of making and getting info from github API.
library(httr)
library(httpuv)
# 1. Find OAuth settings for github:
# http://developer.github.com/v3/oauth/
oauth_endpoints("github")
# 2. To make your own application, register at
# https://github.com/settings/developers. Use any URL for the homepage URL
# (http://github.com is fine) and http://localhost:1410 as the callback url
#
# Replace your key and secret below.
myapp <- oauth_app("github",
key = "56b637a5baffac62cad9",
secret = "8e107541ae1791259e9987d544ca568633da2ebf")
# 3. Get OAuth credentials
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
# 4. Use API
gtoken < config(token = github_token)
req <- GET("https://api.github.com/users/jtleek/repos", gtoken)
stop_for_status(req)
lst <- content(req)
library(jsonlite)
# Convert to a data.frame
gitDF = fromJSON(toJSON(lst))
# Subset data.frame
gitDF[gitDF$full_name == "jtleek/datasharing", "created_at"]
Source
 Exact question answered for the Coursera question
 Source given by the Coursera course question
 Source given by jtleek
other sources
There is a package for that
 Roger has a nice video on how there are R packages for most things that you will want to access.
 Here I’m going to briefly review a few useful packages
 In general the best way to find out if the R package
exists is to Google “data storage mechanism R package”
 For example: “MySQL R package”
Interacting more directly with files
 file - open a connection to a text file
 url - open a connection to a url
 gzfile - open a connection to a .gz file
 bzfile - open a connection to a .bz2 file
 ?connections for more information

Remember to close connections
foreign package
 Loads data from Minitab, S, SAS, SPSS, Stata, Systat
 Basic functions read.foo
 read.arff (Weka)
 read.dta (Stata)
 read.mtp (Minitab)
 read.octave (Octave)
 read.spss (SPSS)
 read.xport (SAS)
 See the help page for more details http://cran.r-project.org/web/packages/foreign/foreign.pdf
Examples of other database packages
 RPostgreSQL provides a DBI-compliant database connection from R. Tutorial https://code.google.com/p/rpostgresql/, help file http://cran.r-project.org/web/packages/RPostgreSQL/RPostgreSQL.pdf
 RODBC provides interfaces to multiple databases including PostgreSQL, MySQL, Microsoft Access and SQLite. Tutorial - http://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf, help file - http://cran.r-project.org/web/packages/RODBC/RODBC.pdf
 RMongo http://cran.r-project.org/web/packages/RMongo/RMongo.pdf (example of RMongo http://www.r-bloggers.com/r-and-mongodb/) and rmongodb provide interfaces to MongoDB.
Reading images
 jpeg - http://cran.r-project.org/web/packages/jpeg/index.html
 readbitmap - http://cran.r-project.org/web/packages/readbitmap/index.html
 png - http://cran.r-project.org/web/packages/png/index.html
 EBImage (Bioconductor) - http://www.bioconductor.org/packages/2.13/bioc/html/EBImage.html
Reading GIS data
 rgdal - http://cran.r-project.org/web/packages/rgdal/index.html
 rgeos - http://cran.r-project.org/web/packages/rgeos/index.html
 raster - http://cran.r-project.org/web/packages/raster/index.html
Reading music data
 tuneR - http://cran.r-project.org/web/packages/tuneR/
 seewave - http://rug.mnhn.fr/seewave/
jpeg
readJPEG - Read a bitmap image stored in the JPEG format
writeJPEG - Write a bitmap image in JPEG format
-- library(help=jpeg)
Manipulating DATA (c3w3)
dealing with NA
final[complete.cases(final), ]
na.omit(your.data.frame)
final <- final[!(is.na(final$rnor)) & !(is.na(rawdata$cfam)),]
Also look at imputing where NA is replaced with other things
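A tiny sketch of mean imputation (the data frame and column are illustrative):
df <- data.frame(val = c(1, NA, 3, NA, 5))
df$val[is.na(df$val)] <- mean(df$val, na.rm = TRUE)  # replace NA with the column mean
df$val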
NA’s are ignored using which
X[which(X$var2 > 8),]
var1 var2 var3
4 1 10 11
5 4 9 13
dealing with blanks
https://stackoverflow.com/questions/12763890/excludeblankandnainr Question
https://stackoverflow.com/a/12764040/5986651 Answer
foo[foo==""] < NA
foo < na.omit(foo$column.name) ### Selecting random rows and setting value (Sampling)
set.seed(13435)
X <- data.frame("var1"=sample(1:5),"var2"=sample(6:10),"var3"=sample(11:15))
X <- X[sample(1:5),]; X$var2[c(1,3)] = NA
X
var1 var2 var3
1 2 NA 15
4 1 10 11
2 3 NA 12
3 5 6 14
5 4 9 13
Subsetting  quick review (2)
X[,1]
X[,"var1"]
X[1:2,"var2"]
Subsetting with %in%
restData[restData$zipCode %in% c("21212","21213"),]
Logicals and’s and or’s
X[(X$var1 <= 3 & X$var3 > 11),]
var1 var2 var3
1 2 NA 15
2 3 NA 12
X[(X$var1 <= 3 | X$var3 > 15),]
var1 var2 var3
1 2 NA 15
4 1 10 11
2 3 NA 12
Sorting
Excludes NA's unless an argument is used. If needed, use na.last=TRUE.
sort(X$var1)
[1] 1 2 3 4 5
sort(X$var1,decreasing=TRUE)
[1] 5 4 3 2 1
sort(X$var2,na.last=TRUE)
[1] 6 9 10 NA NA
Ordering
Different from sorting in that it orders the whole DF.
X[order(X$var1),]
var1 var2 var3
4 1 10 11
1 2 NA 15
2 3 NA 12
5 4 9 13
3 5 6 14
X[order(X$var1,X$var3),]
var1 var2 var3
4 1 10 11
1 2 NA 15
2 3 NA 12
5 4 9 13
3 5 6 14
Order(arrange) with some more capabilities (plyr)
library(plyr)
arrange(X,var1)
var1 var2 var3
1 1 10 11
2 2 NA 15
3 3 NA 12
4 4 9 13
5 5 6 14
arrange(X,desc(var1))
var1 var2 var3
1 5 6 14
2 4 9 13
3 3 NA 12
4 2 NA 15
5 1 10 11
Adding rows and columns
Directly add columns
X$var4 <- rnorm(5)
X
var1 var2 var3 var4
1 2 NA 15 0.18760
4 1 10 11 1.78698
2 3 NA 12 0.49669
3 5 6 14 0.06318
5 4 9 13 0.53613
or use cbind or rbind
Y <- cbind(X, rnorm(5))
Y
var1 var2 var3 var4 rnorm(5)
1 2 NA 15 0.18760 0.62578
4 1 10 11 1.78698 2.45084
2 3 NA 12 0.49669 0.08909
3 5 6 14 0.06318 0.47839
5 4 9 13 0.53613 1.00053
Col and row names
names(test) <- c("A","B","C","D","E","F","G","H","I","J","K")

faster
colnames(test) <- c("A","B","C","D","E","F","G","H","I","J","K")
Using dictionary-like lookups, similar to Python (plyr)
 You have a table (activity.names) of 2 columns, $V1 and $V2: V1 with the number, V2 with the character value to replace it with.
library(plyr)
traintest$Activity <-
mapvalues(traintest$Activity, activity.names$V1, activity.names$V2) # requires `plyr`
Notes and further resources
 R programming in the Data Science Track
 Andrew Jaffe’s lecture notes http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf
Summarizing data (c3w3)
Using STR, Summary, HEAD
if(!file.exists("./data")){dir.create("./data")}
fileUrl < "https://data.baltimorecity.gov/api/views/k5ryef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/restaurants.csv",method="curl")
restData < read.csv("./data/restaurants.csv")
Read first few lines of the DF
head(restData,n=3)
tail(restData,n=3)
Summary of data
Info about every column based on its class.
summary(restData$name,5)
MCDONALD'S POPEYES FAMOUS FRIED CHICKEN
8 7
SUBWAY KENTUCKY FRIED CHICKEN
6 5
(Other)
1301
In-depth info (str)
str
is more about meta info about the data, like what class each column is, the starting values, how many levels if a factor, etc.
str(restData)
str(restData)
'data.frame': 1327 obs. of 6 variables:
$ name : Factor w/ 1277 levels "#1 CHINESE KITCHEN",..: 9 3 992 1 2 4 5 6 7 8 ...
$ zipCode : int 21206 21231 21224 21211 21223 21218 21205 21211 21205 21231 ...
$ neighborhood : Factor w/ 173 levels "Abell","Arlington",..: 53 52 18 66 104 33 98 133 98 157 ...
$ councilDistrict: int 2 1 1 14 9 14 13 7 13 1 ...
$ policeDistrict : Factor w/ 9 levels "CENTRAL","EASTERN",..: 3 6 6 4 8 3 6 4 6 6 ...
$ Location.1 : Factor w/ 1210 levels "1 BIDDLE ST\nBaltimore, MD\n",..: 835 334 554 755 492 537 505 530 507 569 ...
Quantile info
quantile(restData$councilDistrict,na.rm=TRUE)
0% 25% 50% 75% 100%
1 2 9 11 14
quantile(restData$councilDistrict,probs=c(0.5,0.75,0.9))
table
Counts the number of occurrences of values in one column, or even of a combination of columns on the X and Y axes.
NA is removed by default.
table(restData$councilDistrict,useNA="ifany")
table(restData$councilDistrict,restData$zipCode)
Check for missing values
sum(is.na(restData$councilDistrict))
[1] 0
any(is.na(restData$councilDistrict))
[1] FALSE
all(restData$zipCode > 0)
[1] FALSE
Getting info about all columns
colSums(is.na(restData))
name zipCode neighborhood councilDistrict policeDistrict Location.1
0 0 0 0 0 0
all(colSums(is.na(restData))==0)
[1] TRUE
Values with specific characteristics
table(restData$zipCode %in% c("21212"))
FALSE TRUE
1299 28
Subsetting with %in%
restData[restData$zipCode %in% c("21212","21213"),]
name zipCode neighborhood councilDistrict
29 BAY ATLANTIC CLUB 21212 Downtown 11
39 BERMUDA BAR 21213 Broadway East 12
92 ATWATER'S 21212 Chinquapin ParkBelvedere 4
111 BALTIMORE ESTONIAN SOCIET
Cross tabs
Shows values of Freq in a table of Gender and Admit
xt <- xtabs(Freq ~ Gender + Admit, data=DF)
xt
Admit
Gender Admitted Rejected
Male 1198 1493
Female 557 1278
Even for third dimension
warpbreaks$replicate <- rep(1:9, len = 54)
xt = xtabs(breaks ~.,data=warpbreaks)
xt
, , replicate = 1
tension
wool L M H
A 26 18 36
B 27 42 20
, , replicate = 2
tension
wool L M H
A 30 21 21
B 14 26 21
With possibility to flatten it out!
ftable(xt)
Size of data set
fakeData = rnorm(1e5)
object.size(fakeData)
800040 bytes
print(object.size(fakeData),units="Mb")
0.8 Mb
Create new variables (c3w3)
Why create new variables?
 Often the raw data won’t have a value you are looking for
 You will need to transform the data to get the values you would like
 Usually you will add those values to the data frames you are working with
 Common variables to create
 Missingness indicators
 “Cutting up” quantitative variables
 Applying transforms
Creating sequences
Sample Data
if(!file.exists("./data")){dir.create("./data")}
fileUrl < "https://data.baltimorecity.gov/api/views/k5ryef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/restaurants.csv",method="curl")
restData < read.csv("./data/restaurants.csv")
Sometimes you need an index for your data set
s1 <- seq(1,10,by=2); s1
[1] 1 3 5 7 9
s2 <- seq(1,10,length=3); s2
[1] 1.0 5.5 10.0
x <- c(1,3,8,25,100); seq(along = x)
[1] 1 2 3 4 5
Creating new variable by subsetting
restData$nearMe = restData$neighborhood %in% c("Roland Park", "Homeland")
table(restData$nearMe)
FALSE TRUE
1314 13
Creating binary variables
restData$zipWrong = ifelse(restData$zipCode < 0, TRUE, FALSE)
table(restData$zipWrong,restData$zipCode < 0)
FALSE TRUE
FALSE 1326 0
TRUE 0 1
Creating categorical variables (CUT)
restData$zipGroups = cut(restData$zipCode,breaks=quantile(restData$zipCode))
table(restData$zipGroups)
(-2.123e+04,2.12e+04] (2.12e+04,2.122e+04] (2.122e+04,2.123e+04] (2.123e+04,2.129e+04]
337 375 282 332
table(restData$zipGroups,restData$zipCode)
-21226 21201 21202 21205 21206 21207 21208 21209 21210 21211 21212 21213
(-2.123e+04,2.12e+04] 0 136 201 0 0 0 0 0 0 0 0 0
(2.12e+04,2.122e+04] 0 0 0 27 30 4 1 8 23 41 28 31
(2.122e+04,2.123e+04] 0 0 0 0 0 0 0 0 0 0 0 0
(2.123e+04,2.129e+04] 0 0 0 0 0 0 0 0 0 0 0 0
Quiz question
Cut the GDP ranking into 5 separate quantile groups. Make a table versus Income.Group. How many countries
are Lower middle income but among the 38 nations with highest GDP?
Answer:5
pandian2$RankingGroups <- cut(pandian2$Ranking, breaks=quantile(pandian2$Ranking, probs=seq(0,1,0.2))); table(pandian2$RankingGroups, pandian2$Income.Group)
Easier cutting: cut2 from library(Hmisc)
library(Hmisc)
restData$zipGroups = cut2(restData$zipCode,g=4)
table(restData$zipGroups)
[-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287]
338 375 300 314
Creating factor variables
restData$zcf <- factor(restData$zipCode)
restData$zcf <- as.factor(restData$zipCode)
restData$zcf[1:10]
[1] 21206 21231 21224 21211 21223 21218 21205 21211 21205 21231
32 Levels: -21226 21201 21202 21205 21206 21207 21208 21209 21210 21211 21212 21213 21214 ... 21287
class(restData$zcf)
[1] "factor"
Levels of factor variables
yesno <- sample(c("yes","no"), size=10, replace=TRUE)
yesnofac = factor(yesno,levels=c("yes","no"))
levels(as.factor(yesno))
relevel(yesnofac,ref="no")
[1] "no" "yes"
[2] yes yes yes yes no yes yes yes no no
Levels: no yes
as.numeric(yesnofac)
[1] 1 1 1 1 2 1 1 1 2 2
Using the mutate function or do it directly
Need both HMISC and plyr.
library(Hmisc); library(plyr)
restData2 = mutate(restData,zipGroups=cut2(zipCode,g=4))
restData$zipGroups < cut2(restData$zipCode,g=4) # same result
table(restData2$zipGroups)
[-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287]
338 375 300 314
This stack source on different ways of adding columns to the main data frame but factored:
df1$Y.New <- ave(df1$Y, df1$X)
## or
library(dplyr)
df1 <- df1 %>%
group_by(X) %>%
mutate(Y.new = mean(Y))
Common transforms
abs(x) - absolute value
sqrt(x) - square root
ceiling(x) - ceiling(3.475) is 4
floor(x) - floor(3.475) is 3
round(x, digits=n) - round(3.475, digits=2) is 3.48
signif(x, digits=n) - signif(3.475, digits=2) is 3.5
cos(x), sin(x), etc.
log(x) - natural logarithm
log2(x), log10(x) - other common logs
exp(x) - exponentiating x
http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf http://statmethods.net/management/functions.html
Notes and further reading
 A tutorial from the developer of plyr - http://plyr.had.co.nz/09-user/
 Andrew Jaffe’s R notes http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf
 A nice lecture on categorical and factor variables http://www.stat.berkeley.edu/classes/s133/factors.html
Reshaping data (c3w3)
The goal is tidy data
 Each variable forms a column
 Each observation forms a row
 Each table/file stores data about one kind of observation (e.g. people/hospitals).
http://vita.had.co.nz/papers/tidy-data.pdf
Leek, Taub, and Pineda 2011 PLoS One
Melting dataframes
library(reshape2)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars$carname <- rownames(mtcars)
carMelt <- melt(mtcars, id=c("carname","gear","cyl"), measure.vars=c("mpg","hp"))
head(carMelt,n=3)
carname gear cyl variable value
1 Mazda RX4 4 6 mpg 21.0
2 Mazda RX4 Wag 4 6 mpg 21.0
3 Datsun 710 4 4 mpg 22.8
tail(carMelt,n=3)
carname gear cyl variable value
62 Ferrari Dino 5 6 hp 175
63 Maserati Bora 5 8 hp 335
64 Volvo 142E 4 4 hp 109
http://www.statmethods.net/management/reshape.html
dCasting data frames
cylData <- dcast(carMelt, cyl ~ variable)
cylData
cyl mpg hp
1 4 11 11
2 6 7 7
3 8 14 14
cylData <- dcast(carMelt, cyl ~ variable, mean)
cylData
cyl mpg hp
1 4 26.66 82.64
2 6 19.74 122.29
3 8 15.10 209.21
http://www.statmethods.net/management/reshape.html
Averaging values
head(InsectSprays)
  count spray
1    10     A
2     7     A
3    20     A
4    14     A
5    14     A
6    12     A
tapply(InsectSprays$count,InsectSprays$spray,sum)
#or
spIns = split(InsectSprays$count,InsectSprays$spray)
sprCount = lapply(spIns,sum)
unlist(sprCount) # to make it a table
#or
sapply(spIns,sum) # directly a table
A B C D E F
174 184 25 59 42 200
http://www.r-bloggers.com/a-quick-primer-on-split-apply-combine-problems/
Averaging across multiple variables
balt.NEI <- subset(NEI, fips==24510)
mn2 <- with(balt.NEI, tapply(Emissions, list(year,type), mean, na.rm=T))
library(dplyr)
mn20 <- balt.NEI %>% group_by(year,type) %>%
summarise(Pandian=mean(Emissions)) # result in clean format
library(plyr)
mn21 <- ddply(balt.NEI, .(type,year), summarize, Pandian=mean(Emissions))
Note: summarize and summarise perform the same function.
Warning: loading plyr masks the summarise of dplyr! You need to detach plyr before using dplyr functions like summarise again.
https://stackoverflow.com/a/27407856/5986651
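A small sketch of recovering dplyr's summarise after plyr has masked it:
detach("package:plyr", unload = TRUE)  # drop plyr from the search path
library(dplyr)                         # dplyr::summarise behaves as expected again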
Another way  plyr package
ddply(InsectSprays,.(spray),summarize,sum=sum(count))
spray sum
1 A 174
2 B 184
3 C 25
4 D 59
5 E 42
6 F 200
Summary: Average of col1 for factor(col2) (quiz)
 Pre-info
url <-
"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv"
download.file(url, "./c3w3.csv", method="curl")
gdp.data <- read.csv("./c3w3.csv", header=TRUE, skip=3)
gdp.data[gdp.data==""] <- NA # Make "" -> NA and then...
gdp.data <- gdp.data[!is.na(gdp.data$X),] # Remove NA based on one column
# alone
gdp.data <- gdp.data[!is.na(gdp.data$Ranking),] # Remove NA based on one column
# alone
gdp.data$X <- factor(gdp.data$X) # remove unused levels
gdp.data$Ranking <- factor(gdp.data$Ranking) # remove unused levels
is.na(gdp.data$Ranking)
gdp.data$Ranking <- as.numeric(as.character(gdp.data$Ranking))
url <-
"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv"
download.file(url, "./c3w3other.csv", method="curl")
edu.data <- read.csv("./c3w3other.csv", header=TRUE)
pandian <- merge(gdp.data, edu.data, by.x="X", by.y="CountryCode")
 Finding average of Ranking for different Income groups
ave.rank1 <- tapply(pandian$Ranking, pandian$Income.Group, mean)
ave.rank2 <- sapply(split(pandian$Ranking, pandian$Income.Group), mean) # simplify=T
ave.rank3 <- lapply(split(pandian$Ranking, pandian$Income.Group), mean)
ave.rank3 <- unlist(ave.rank3)
ave.rank4 <- ddply(pandian, .(Income.Group), summarize, average=mean(Ranking))
Creating a new column variable
 adding the column to the existing dataframe using
transform
Transform
pandian2 <- ddply(pandian, .(Income.Group), transform, average=mean(Ranking))
head(data.frame(pandian2$average,pandian2$Income.Group),n=50)
ave
is not fully understood, but we see the difference. Go deeper if necessary.
spraySums <- ddply(InsectSprays, .(spray), summarize, sum=ave(count, FUN=sum))
dim(spraySums)
[1] 72 2
head(spraySums)
spray sum
1 A 174
2 A 174
3 A 174
4 A 174
5 A 174
6 A 174
More information
 A tutorial from the developer of plyr - http://plyr.had.co.nz/09-user/
 A nice reshape tutorial http://www.slideshare.net/jeffreybreen/reshaping-data-in-r
 A good plyr primer - http://www.r-bloggers.com/a-quick-primer-on-split-apply-combine-problems/
 See also the functions
 acast - for casting as multidimensional arrays
 arrange - for faster reordering without using order() commands
 mutate - adding new variables
Merging data (c3w3)
Peer review experiment data
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895
Peer review data
if(!file.exists("./data")){dir.create("./data")}
fileUrl1 = "https://dl.dropboxusercontent.com/u/7710864/data/reviewsapr29.csv"
fileUrl2 = "https://dl.dropboxusercontent.com/u/7710864/data/solutionsapr29.csv"
download.file(fileUrl1,destfile="./data/reviews.csv",method="curl")
download.file(fileUrl2,destfile="./data/solutions.csv",method="curl")
reviews = read.csv("./data/reviews.csv"); solutions <- read.csv("./data/solutions.csv")
head(reviews,2)
id solution_id reviewer_id start stop time_left accept
1 1 3 27 1304095698 1304095758 1754 1
2 2 4 22 1304095188 1304095206 2306 1
head(solutions,2)
id problem_id subject_id start stop time_left answer
1 1 156 29 1304095119 1304095169 2343 B
2 2 269 25 1304095119 1304095183 2329 C
Merging data  merge()
 Merges data frames
 Important parameters: x,y,by,by.x,by.y,all
names(reviews)
[1] "id" "solution_id" "reviewer_id" "start" "stop" "time_left"
[7] "accept"
names(solutions)
[1] "id" "problem_id" "subject_id" "start" "stop" "time_left" "answer"
Example
mergedData = merge(reviews,solutions,by.x="solution_id",by.y="id",all=TRUE)
head(mergedData)
If all=TRUE, then for every row that has no match you will see a row with corresponding NAs.
solution_id id reviewer_id start.x stop.x time_left.x accept problem_id subject_id
1 1 4 26 1304095267 1304095423 2089 1 156 29
2 2 6 29 1304095471 1304095513 1999 1 269 25
3 3 1 27 1304095698 1304095758 1754 1 34 22
4 4 2 22 1304095188 1304095206 2306 1 19 23
5 5 3 28 1304095276 1304095320 2192 1 605 26
6 6 16 22 1304095303 1304095471 2041 1 384 27
start.y stop.y time_left.y answer
1 1304095119 1304095169 2343 B
2 1304095119 1304095183 2329 C
3 1304095127 1304095146 2366 C
4 1304095127 1304095150 2362 D
5 1304095127 1304095167 2345 A
6 1304095131 1304095270 2242 C
Default  merge all common column names
intersect(names(solutions),names(reviews))
[1] "id" "start" "stop" "time_left"
mergedData2 = merge(reviews,solutions,all=TRUE)
head(mergedData2)
id start stop time_left solution_id reviewer_id accept problem_id subject_id answer
1 1 1304095119 1304095169 2343 NA NA NA 156 29 B
2 1 1304095698 1304095758 1754 3 27 1 NA NA <NA>
3 2 1304095119 1304095183 2329 NA NA NA 269 25 C
4 2 1304095188 1304095206 2306 4 22 1 NA NA <NA>
5 3 1304095127 1304095146 2366 NA NA NA 34 22 C
6 3 1304095276 1304095320 2192 5 28 1 NA NA <NA>
Using join in the plyr package
Faster, but less full-featured - defaults to a left join; see the help file for more.
df1 = data.frame(id=sample(1:10),x=rnorm(10))
df2 = data.frame(id=sample(1:10),y=rnorm(10))
arrange(join(df1,df2),id)
id x y
1 1 0.2514 0.2286
2 2 0.1048 0.8395
3 3 0.1230 1.1165
4 4 1.5057 0.1121
5 5 0.2505 1.2124
6 6 0.4699 1.6038
7 7 0.4627 0.8060
8 8 1.2629 1.2848
9 9 0.9258 0.8276
10 10 2.8065 0.5794
If you have multiple data frames
df1 = data.frame(id=sample(1:10),x=rnorm(10))
df2 = data.frame(id=sample(1:10),y=rnorm(10))
df3 = data.frame(id=sample(1:10),z=rnorm(10))
dfList = list(df1,df2,df3)
join_all(dfList)
id x y z
1 6 0.39093 0.16670 0.56523
2 1 1.90467 0.43811 0.37449
3 7 1.48798 0.85497 0.69209
4 10 2.59440 0.39591 0.36134
5 3 0.08539 0.08053 1.01247
6 4 1.63165 0.13158 0.21927
7 5 0.50594 0.24256 0.44003
8 9 0.85062 2.08066 0.96950
9 2 0.63767 0.10069 0.09002
10 8 1.20439 1.29138 0.88586
More on merging data
 The quick R data merging page  http://www.statmethods.net/management/merging.html
 plyr information  http://plyr.had.co.nz/
 Types of joins  http://en.wikipedia.org/wiki/Join_(SQL)
Cleaning data: NA and "" (c3w4)
Cleaning data
So usually we have NA
or empty spaces ""
which really fuck with
us. Based on which column we are using we skip them either by using
functions or literally just cleaning up things.
This example shows a case of how to deal with NA
or empty spaces
""
, including comments.
 data
url <-
"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv"
download.file(url, "./c3w3.csv", method="curl")
gdp.data <- read.csv("./c3w3.csv", header=TRUE, skip=3)
 making "" -> NA
gdp.data[gdp.data==""] <- NA # Make "" -> NA and then...
 removing NA rows from the necessary columns (numeric and char)
gdp.data <- gdp.data[!is.na(gdp.data$X),] # Remove NA based on one column
# alone
gdp.data <- gdp.data[!is.na(gdp.data$Ranking),] # Remove NA based on one column
# alone
Manipulating factors (Important)

Manipulating factors is slightly different and can lead to errors.

removing unused levels is forcefully done
gdp.data$X <- factor(gdp.data$X) # remove unused levels
gdp.data$Ranking <- factor(gdp.data$Ranking) # remove unused levels
is.na(gdp.data$Ranking)
 converting factor to numeric for doing some arithmetic or arranging. BE CAREFUL
gdp.data$Ranking <- as.numeric(as.character(gdp.data$Ranking))

factors and headers (append data)
header.data <- factor(append(c(subject.header, y.header), as.character(X.header)))
q3 question on which reflection is done
Based on q3 coursera question:
Load the Gross Domestic Product data for the 190 ranked countries in this data set:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
Load the educational data from this data set:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv
Match the data based on the country shortcode. How many of the IDs match? Sort the data frame in descending order by GDP rank (so United States is last). What is the 13th country in the resulting data frame?
Original data sources:
http://data.worldbank.org/data-catalog/GDP-ranking-table
http://data.worldbank.org/data-catalog/ed-stats

Answer
189 matches, 13th country is St. Kitts and Nevis
Cleaning up colnames
Example  Baltimore camera data
https://data.baltimorecity.gov/Transportation/Baltimore-Fixed-Speed-Cameras/dz54-2aru
Data
if(!file.exists("./data")){dir.create("./data")}
fileUrl < "https://data.baltimorecity.gov/api/views/dz542aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/cameras.csv",method="curl")
cameraData < read.csv("./data/cameras.csv")
Removing Capitals tolower(), toupper()
names(cameraData)
[1] "address" "direction" "street" "crossStreet" "intersection" "Location.1"
names(cameraData) <- tolower(names(cameraData))
[1] "address" "direction" "street" "crossstreet" "intersection" "location.1"
Removing . and _ with strsplit()
 Good for automatically splitting variable names
 Important parameters: x, split
For . you need to use \\. because . is a metacharacter that has to be escaped.
splitNames = strsplit(names(cameraData),"\\.")
splitNames[[5]]
[1] "intersection"
splitNames[[6]]
[1] "Location" "1"
firstElement <- function(x){x[1]}
names(cameraData) <- sapply(splitNames, firstElement)
[1] "address" "direction" "street" "crossStreet" "intersection" "Location"
Peer review experiment data
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895
fileUrl1 <- "https://dl.dropboxusercontent.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 <- "https://dl.dropboxusercontent.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1, destfile="./data/reviews.csv", method="curl")
download.file(fileUrl2, destfile="./data/solutions.csv", method="curl")
reviews <- read.csv("./data/reviews.csv"); solutions <- read.csv("./data/solutions.csv")
head(reviews,2)
id solution_id reviewer_id start stop time_left accept
1 1 3 27 1304095698 1304095758 1754 1
2 2 4 22 1304095188 1304095206 2306 1
head(solutions,2)
id problem_id subject_id start stop time_left answer
1 1 156 29 1304095119 1304095169 2343 B
2 2 269 25 1304095119 1304095183 2329 C
replacing char with other chars sub()
 Important parameters: pattern, replacement, x
names(reviews)
[1] "id" "solution_id" "reviewer_id" "start" "stop" "time_left"
[7] "accept"
sub("_","",names(reviews),)
[1] "id" "solutionid" "reviewerid" "start" "stop" "timeleft" "accept"
Fixing character vectors  gsub()
sub() replaces only the first match
testName < "this_is_a_test"
sub("_","",testName)
gsub("_","",testName)
[1] "thisis_a_test"
[2] "thisisatest"
gsub() replaces every occurrence, not just the first.
Finding values  grep(),grepl()
grep
for location of value. grepl
for true or false which can be
passed as argument.
grep("Alameda",cameraData$intersection)
grep("Alameda",cameraData$intersection,value=TRUE)
table(grepl("Alameda",cameraData$intersection))
cameraData2 < cameraData[!grepl("Alameda",cameraData$intersection),]
[1] 4 5 36
[1] "The Alameda & 33rd St" "E 33rd & The Alameda" "Harford \n & The Alameda"
[2]FALSE TRUE
77 3
Use length() to check whether grep() found anything:
grep("JeffStreet",cameraData$intersection)
integer(0)
length(grep("JeffStreet",cameraData$intersection))
[1] 0
http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf
More useful string functions
library(stringr)
nchar("Jeffrey Leek")
[1] 12
substr("Jeffrey Leek",1,7)
[1] "Jeffrey"
paste("Jeffrey","Leek")
paste0("Jeffrey","Leek")
[1] "Jeffrey Leek"
[2] "JeffreyLeek"
To trim the empty spaces
str_trim("Jeff ")
[1] "Jeff"
Important points about text in data sets
 Names of variables should be
 All lower case when possible
 Descriptive (Diagnosis versus Dx)
 Not duplicated
 Not have underscores or dots or white spaces
 Variables with character values
 Should usually be made into factor variables (depends on application)
 Should be descriptive (use TRUE/FALSE instead of 0/1 and Male/Female versus 0/1 or M/F)
Identifying expressions with meta and literals
Regular expressions
 Regular expressions can be thought of as a combination of literals and metacharacters
 To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar
 Regular expressions have a rich set of metacharacters
Literals
Simplest pattern consists only of literals. The literal “nuclear” would match to the following lines:
Ooh. I just learned that to keep myself alive after a
nuclear blast! All I have to do is milk some rats
then drink the milk. Aweosme. :}
^I think
Beginning of line
i think we all rule for participating
i think i have been outed
morning$
end of line
well they had something this morning
then had to catch a tram home in the morning
[Bb][Uu][Ss][Hh]
will find any case combination of BUSH
Bush
BUSH
busHwalking
^[Ii] am
i am great!
I am mass
^[0-9][a-zA-Z]
matches lines starting with a number followed by a letter
7th inning
2nd half s
 [^?.]$ matches lines whose last character is not ? or .
i like basketballs
6 and 9
9.11
here . matches any single character
its stupid the post 911 rules
9/11
flood|fire|earth|wind|water
; flood or fire or …
is firewire like usb on none macs?
the global flood makes sense within the context of the bible
^[Gg]ood|[Bb]ad
Beginning of line for Good/good, or Bad/bad anywhere in the line
good to hear some good knews from someone here
Good afternoon fellow american infidels!

^([Gg]ood|[Bb]ad)
; for both it is the beginning of the line! 
[Gg]eorge( [Ww]\.)? [Bb]ush
; ? indicates the preceding group is optional. Also here we need to escape . so we use \.
George W. Bush
George Bushless
(.*)
; brackets with any number of chars
anyone wanna chat? (24, m, germany)
hello, 20.m here... ( east area + drives + webcam )
(he means older men)
[0-9]+ (.*)[0-9]+
; * means none or many of the item and + means at least 1 of the item.
working as MP here 720 MP battallion, 42nd birgade
so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we fin
[Bb]ush( +[^ ]+ +){1,5} debate
; { and } are referred to as interval quantifiers; they let us specify the minimum and maximum number of matches of an expression
Bush has historically won all major debates he’s done.
in my view, Bush doesn’t need these debates..

parentheses not only limit the scope of alternatives divided by a "|", but also can be used to "remember" text matched by the subexpression enclosed
 +([a-zA-Z]+) +\1 +
; a space + a word of at least one letter + space(s) + a repetition of the word captured in () + a space
time for bed, night night twitter!
^s(.*)s
matches the longest string starting with s and ending with s
sitting at starbucks
setting up mysql and rails

The greediness of * can be turned off with the ?, as in
^s(.*?)s$
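None of these patterns are shown in R above, so here is a minimal sketch (my own, using test strings taken from the matched examples in this section) of applying a few of them with grepl() and regmatches():
x <- c("i think we all rule for participating",
       "the global flood makes sense within the context of the bible",
       "sitting at starbucks")
grepl("^[Ii] think", x)                           # beginning-of-line literal
grepl("flood|fire|earth|wind|water", x)           # alternation with |
regmatches(x, regexpr("^s(.*)s", x))              # greedy: "sitting at starbucks"
regmatches(x, regexpr("^s(.*?)s", x, perl=TRUE))  # lazy:   "sitting at s"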
Quiz q4 grep question
Load the Gross Domestic Product data for the 190 ranked countries in this data set:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
Load the educational data from this data set:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv
Match the data based on the country shortcode. Of the countries for which the end of the fiscal year is available, how many end in June?
Original data sources:
http://data.worldbank.org/data-catalog/GDP-ranking-table
http://data.worldbank.org/data-catalog/ed-stats
Answer: 13
url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv"
download.file(url, "./c3w3.csv", method="curl")
gdp.data <- read.csv("./c3w3.csv", header=TRUE, skip=3)
gdp.data[gdp.data==""] <- NA # Make "" -> NA and then...
gdp.data <- gdp.data[!is.na(gdp.data$X),]       # remove rows with NA in this column
gdp.data <- gdp.data[!is.na(gdp.data$Ranking),] # remove rows with NA in this column
gdp.data$X <- factor(gdp.data$X)             # remove unused levels
gdp.data$Ranking <- factor(gdp.data$Ranking) # remove unused levels
is.na(gdp.data$Ranking)
gdp.data$Ranking <- as.numeric(as.character(gdp.data$Ranking))
url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv"
download.file(url, "./c3w3other.csv", method="curl")
edu.data <- read.csv("./c3w3other.csv", header=TRUE)
pandian <- merge(gdp.data, edu.data, by.x="X", by.y="CountryCode")
length(grep("[Ff]iscal(.*)[Jj]une [0-9]", pandian$Special.Notes))
Summary
 Regular expressions are used in many different languages; not unique to R.
 Regular expressions are composed of literals and metacharacters that represent sets or classes of characters/words
 Text processing via regular expressions is a very powerful way to extract data from “unfriendly” sources (not all data comes as a CSV file)
 Used with the functions grep, grepl, sub, gsub and others that involve searching for text strings (Thanks to Mark Hansen for some material in this lecture.)
Dates & Times
d1 = date() # class is "char"
d2 = Sys.Date() # class is "Date"
[1] "Sun Jan 12 17:48:33 2014"
[2] "20140112"
Formatting dates
%d = day as number (0-31), %a = abbreviated weekday, %A = unabbreviated weekday,
%m = month (00-12), %b = abbreviated month, %B = unabbreviated month,
%y = 2 digit year, %Y = four digit year
format(d2,"%a %b %d")
[1] "Sun Jan 12"
Creating dates
x = c("1jan1960", "2jan1960", "31mar1960", "30jul1960"); z = as.Date(x, "%d%b%Y")
z
[1] "19600101" "19600102" "19600331" "19600730"
z[1]  z[2]
Time difference of 1 days
as.numeric(z[1]z[2])
[1] 1
Converting to Julian
weekdays(d2)
[1] "Sunday"
months(d2)
[1] "January"
julian(d2)
[1] 16082
attr(,"origin")
[1] "19700101"
Lubridate
library(lubridate); ymd("20140108")
[1] "2014-01-08 UTC"
mdy("08/04/2013")
[1] "2013-08-04 UTC"
dmy("03042013")
[1] "2013-04-03 UTC"
http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
Dealing with times
ymd_hms("20110803 10:15:03")
[1] "20110803 10:15:03 UTC"
ymd_hms("20110803 10:15:03",tz="Pacific/Auckland")
[1] "20110803 10:15:03 NZST"
?Sys.timezone
http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
Some functions have slightly different syntax
x = dmy(c("1jan2013", "2jan2013", "31mar2013", "30jul2013"))
wday(x[1])
[1] 3
wday(x[1],label=TRUE)
[1] Tues
Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
DATES and TIMEs from Exploratory graphs assignment learnings C4
Mixing Date and Time for graphs. Source: Stack.
df$Date <- as.character(df$Date)
df$Time <- as.character(df$Time)
df$DateTime <- as.POSIXct(paste(df$Date, df$Time),
format="%d/%m/%Y %H:%M:%S")
library(lubridate)
df$Date.Time <- dmy_hms(paste(df$Date, df$Time))
Notes and further resources
 More information in this nice lubridate tutorial http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
 The lubridate vignette is the same content http://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html
 Ultimately you want your dates and times as class “Date” or the classes “POSIXct”, “POSIXlt”. For more information type
?POSIXlt
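A tiny sketch (my own) of moving between these classes in base R:
x <- "2014-01-12 17:48:33"
p <- as.POSIXct(x, tz = "UTC")   # compact representation: seconds since 1970-01-01
lt <- as.POSIXlt(p)              # list-like representation with named components
lt$hour                          # 17
lt$mon + 1                       # months are 0-based, so +1 gives 1 (January)
lt$year + 1900                   # years are counted from 1900, so 2014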
Quiz q c3w4
You can use the quantmod (http://www.quantmod.com/) package to get historical stock prices for publicly traded companies on the NASDAQ and NYSE. Use the following code to download data on Amazon’s stock price and get the times the data was sampled.
How many values were collected in 2012? How many values were collected on Mondays in 2012?
library(quantmod)
amzn = getSymbols("AMZN",auto.assign=FALSE)
sampleTimes = index(amzn)
library(lubridate)
sampleYears <- year(sampleTimes)
length(sampleYears[sampleYears==2012])
length(sampleTimes[year(sampleTimes)==2012 & weekdays(sampleTimes)=="maandag"]) # weekdays() is locale-dependent; "maandag" is Monday here
Working with TIME, difference, str_pad,strptime
mn1 <- df %>% group_by(interval) %>% summarize(num.of.steps.per.day=mean(steps))
# convert mn1 interval to date-time
# then subtract!
mn1$interval <- as.character(mn1$interval)
mn1$interval <- str_pad(mn1$interval, width=4, side="left", pad="0")
mn1$interval <- strptime(mn1$interval, "%H%M")
mn1
mn1$actualInterval <- mn1$interval - mn1$interval[1]
mn1$actualInterval <- as.numeric(mn1$actualInterval)
Data resources
Open Government Sites
 United Nations http://data.un.org/
 U.S. http://www.data.gov/
 United Kingdom http://data.gov.uk/
 France http://www.data.gouv.fr/
 Ghana http://data.gov.gh/
 Australia http://data.gov.au/
 Germany https://www.govdata.de/
 Hong Kong http://www.gov.hk/en/theme/psi/datasets/
 Japan http://www.data.go.jp/
 Many more http://www.data.gov/opendatasites
Gapminder
Survey data from the United States
Infochimps Marketplace
http://www.infochimps.com/marketplace
Kaggle
Collections by data scientists
 Hilary Mason http://bitly.com/bundles/hmason/1
 Peter Skomoroch https://delicious.com/pskomoroch/dataset
 Jeff Hammerbacher http://www.quora.com/JeffHammerbacher/IntroductiontoDataScienceDataSets
 Gregory Piatetsky-Shapiro http://www.kdnuggets.com/gps.html
 http://blog.mortardata.com/post/67652898761/6datasetlistscuratedbydatascientists
More specialized collections
 Stanford Large Network Data
 UCI Machine Learning
 KDD Nugets Datasets
 CMU Statlib
 Gene expression omnibus
 ArXiv Data
 Public Data Sets on Amazon Web Services
Some API’s with R interfaces
 twitter and twitteR package
 figshare and rfigshare
 PLoS and rplos
 rOpenSci
 Facebook and RFacebook
 Google maps and RGoogleMaps
Course4: Exploratory data Analysis
Principles
 Principle 1: Show comparisons
 Principle 2: Show causality, mechanism, explanation
 Principle 3: Show multivariate data
 Principle 4: Integrate multiple modes of evidence
 Principle 5: Describe and document the evidence

Principle 6: Content is king

Principle 1: Show comparisons

Evidence for a hypothesis is always relative to another competing hypothesis.

Always ask “Compared to What?”

 Principle 2: Show causality, mechanism, explanation, systematic structure
 What is your causal framework for thinking about a question?
 Principle 3: Show multivariate data
 Multivariate = more than 2 variables
 The real world is multivariate
 Need to “escape flatland”
 Principle 4: Integration of evidence

Completely integrate words, numbers, images, diagrams

Data graphics should make use of many modes of data presentation

Don’t let the tool drive the analysis


Principle 5: Describe and document the evidence with appropriate labels, scales, sources, etc.
 A data graphic should tell a complete story that is credible

Principle 6: Content is king
 Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content
Exploratory graphics
Air Pollution in the United States

The U.S. Environmental Protection Agency (EPA) sets national ambient air quality standards for outdoor air pollution

For fine particle pollution (PM2.5), the “annual mean, averaged over 3 years” cannot exceed $12~\mu g/m^3$.

Data on daily PM2.5 are available from the U.S. EPA web site

Question: Are there any counties in the U.S. that exceed that national standard for fine particle pollution?
Data
Annual average PM2.5 averaged over the period 2008 through 2010 is available in jtleek's repository; the csv file itself is at the URL used below.
fileUrl <- "https://raw.githubusercontent.com/jtleek/modules/master/04_ExploratoryAnalysis/exploratoryGraphs/data/avgpm25.csv"
download.file(fileUrl, destfile="./data/avgpm25.csv", method="curl")
pollution <- read.csv("./data/avgpm25.csv", colClasses = c("numeric", "character",
"factor", "numeric", "numeric"))
head(pollution)
## pm25 fips region longitude latitude
## 1  9.771 01003   east    -87.75    30.59
## 2  9.994 01027   east    -85.84    33.27
## 3 10.689 01033   east    -87.73    34.73
## 4 11.337 01049   east    -85.80    34.46
## 5 12.120 01055   east    -86.03    34.02
## 6 10.828 01069   east    -85.35    31.19
Do any counties exceed the standard of $12~\mu g/m^3$?
Summary of data, View data
One dimension
 Fivenumber summary
 Boxplots
 Histograms
 Density plot
 Barplot
Five Number Summary
summary(pollution$pm25)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.38 8.55 10.00 9.84 11.40 18.40
Boxplot
boxplot(pollution$pm25, col = "blue")

median (Q2/50th Percentile): the middle value of the dataset.

first quartile (Q1/25th Percentile): the middle number between the smallest number (not the “minimum”) and the median of the dataset.

third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.

interquartile range (IQR): 25th to the 75th percentile.

whiskers (shown in blue)

outliers (shown as green circles)

“maximum”: Q3 + 1.5*IQR

“minimum”: Q1 - 1.5*IQR
Histogram
Bar plot basically!
hist(pollution$pm25, col = "green")
Rug representation roughly informs about density by plotting below histogram.
hist(pollution$pm25, col = "green")
rug(pollution$pm25)
breaks
parameter is used for determining the number of segments on
the Histogram.
hist(pollution$pm25, col = "green", breaks = 100)
rug(pollution$pm25)
Overlaying Features
boxplot(pollution$pm25, col = "blue")
abline(h = 12)
hist(pollution$pm25, col = "green")
abline(v = 12, lwd = 2)
abline(v = median(pollution$pm25), col = "magenta", lwd = 4)
Barplot
A barplot is basically the histogram for categorical (factor) data, but without quantiles and all that shit!
barplot(table(pollution$region), col = "wheat", main = "Number of Counties in Each Region")
What plots to see >2 dimensions
Two dimensions
 Multiple/overlayed 1D plots (Lattice/ggplot2)
 Scatterplots
 Smooth scatterplots
greater than 2 dimensions
 Overlayed/multiple 2D plots; coplots
 Use color, size, shape to add dimensions
 Spinning plots
 Actual 3D plots (not that useful)
Multiple Boxplots
boxplot(pm25 ~ region, data = pollution, col = "red")
Multiple Histograms
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")
Scatterplot
with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)
Scatterplot  Using Color
with(pollution, plot(latitude, pm25, col = region))
abline(h = 12, lwd = 2, lty = 2)
Multiple Scatterplots
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))
Summary

Exploratory plots are “quick and dirty”

Let you summarize the data (usually graphically) and highlight any broad features

Explore basic questions and hypotheses (and perhaps rule them out)

Suggest modeling strategies for the “next step”
Further resources
Plotting systems
Functions like plot
in base, xyplot
in lattice, or qplot
in
ggplot2 will default to sending a plot to the screen device
The Base Plotting System
 “Artist’s palette” model
 Start with blank canvas and build up from there

Start with plot function (or similar)

 Use annotation functions to add/modify (text, lines, points, axis)
Convenient, mirrors how we think of building plots and analyzing data

 Can’t go back once plot has started (i.e. to adjust margins); need to plan in advance

Difficult to “translate” to others once a new plot has been created (no graphical “language”)
 Plot is just a series of R commands
Base Plot
library(datasets)
data(cars)
with(cars, plot(speed, dist))
The Lattice System

Plots are created with a single function call (xyplot, bwplot, etc.)
Most useful for conditioning types of plots: Looking at how y changes with x across levels of z

Things like margins/spacing set automatically because entire plot is specified at once

Good for putting many many plots on a screen

Sometimes awkward to specify an entire plot in a single function call

Annotation in plot is not especially intuitive

Use of panel functions and subscripts difficult to wield and requires intense preparation

Cannot “add” to the plot once it is created
Lattice Plot
library(lattice)
state <- data.frame(state.x77, region = state.region)
xyplot(Life.Exp ~ Income | region, data = state, layout = c(4, 1))
The ggplot2 System

Splits the difference between base and lattice in a number of ways

Automatically deals with spacings, text, titles but also allows you to annotate by “adding” to a plot

Superficial similarity to lattice but generally easier/more intuitive to use

Default mode makes many choices for you (but you can still customize to your heart’s desire)
ggplot2 Plot
library(ggplot2)
data(mpg)
qplot(displ, hwy, data = mpg)
Summary

Base: “artist’s palette” model

Lattice: Entire plot specified by one function; conditioning

ggplot2: Mixes elements of Base and Lattice
References
Paul Murrell (2011). R Graphics, CRC Press.
Hadley Wickham (2009). ggplot2, Springer.
What is a Graphics Device?

A graphics device is something where you can make a plot appear

A window on your computer (screen device); Linux
x11()

A PDF file (file device)

A PNG or JPEG file (file device)

A scalable vector graphics (SVG) file (file device)


The most common place for a plot to be “sent” is the screen device
 On Unix/Linux the screen device is launched with
x11()

When making a plot, you need to consider how the plot will be used to determine what device the plot should be sent to.
 The list of devices is found in ?Devices; there are also devices created by users on CRAN

For quick visualizations and exploratory analysis, usually you want to use the screen device
 Functions like plot in base, xyplot in lattice, or qplot in ggplot2 will default to sending a plot to the screen device

otherwise use the file device
Screen Device
 Call a plotting function like plot, xyplot, or qplot
library(datasets)
with(faithful, plot(eruptions, waiting)) ## Make plot appear on screen device
title(main = "Old Faithful Geyser data") ## Annotate with a title
File Device
The second approach to plotting is most commonly used for file devices:

Explicitly launch a graphics device

Call a plotting function to make a plot (Note: if you are using a file device, no plot will appear on the screen)

Annotate plot if necessary

Explicitly close graphics device with
dev.off()
(this is very important!)
pdf(file = "myplot.pdf") ## Open PDF device; create 'myplot.pdf' in my working directory
## Create plot and send to a file (no plot appears on screen)
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful Geyser data") ## Annotate plot; still nothing on screen
dev.off() ## Close the PDF file device
## Now you can view the file 'myplot.pdf' on your computer
Another example
png(filename="plot1.png", width=480, height=480)
hist(df$Global_active_power,col="red",main="Global Active Power",
xlab="Global Active Power (kilowatts)")
dev.off()
Vector: File Devices types
There are two basic types of file devices: vector and bitmap devices
Vector formats:

 pdf: useful for line-type graphics, resizes well, usually portable, not efficient if a plot has many objects/points
 svg: XML-based scalable vector graphics; supports animation and interactivity, potentially useful for web-based plots
 win.metafile: Windows metafile format (only on Windows)
 postscript: older format, also resizes well, usually portable, can be used to create encapsulated postscript files; Windows systems often don’t have a postscript viewer
Bitmap: File Devices
Bitmap formats

 png: bitmapped format, good for line drawings or images with solid colors, uses lossless compression (like the old GIF format), most web browsers can read this format natively, good for plotting many many many points, does not resize well
 jpeg: good for photographs or natural scenes, uses lossy compression, good for plotting many many many points, does not resize well, can be read by almost any computer and any web browser, not great for line drawings
 tiff: creates bitmap files in the TIFF format; supports lossless compression
 bmp: a native Windows bitmapped format
Multiple Open Graphics Devices
x11() # 1st screen
x11() # 2nd screen
dev.cur()
[1] X11cairo
3
dev.set(2)
[1] X11cairo
2
Copying Plots; display and save plots
Copying a plot to another device can be useful because some plots require a lot of code and it can be a pain to type all that in again for a different device.

 dev.copy: copy a plot from one device to another
 dev.copy2pdf: specifically copy a plot to a PDF file
NOTE: Copying a plot is not an exact operation, so the result may not be identical to the original.
library(datasets)
with(faithful, plot(eruptions, waiting)) ## Create plot on screen device
title(main = "Old Faithful Geyser data") ## Add a main title
dev.copy(png, file = "geyserplot.png") ## Copy my plot to a PNG file
dev.off() ## Don't forget to close the PNG device!
with(faithful, plot(eruptions,waiting))
dev.copy2pdf(file="pandian")
dev.off()
Example Multiplot base plotting system using legend
As part of assignment,
plot(df$DateTime, df$Sub_metering_1, type="l", col="black", ylab="Energy sub metering",xlab="")
lines(df$DateTime, df$Sub_metering_2, type="l", col="red")
lines(df$DateTime, df$Sub_metering_3, type="l", col="blue")
legend("topright",c("Sub_metering_1","Sub_metering_2","Sub_metering_3"),col=c("black","red","blue"),lty=1:3, cex=0.8)
The below doesn’t work. Spent way too much time on it.
## legend(1, 95,
##        legend=c("Sub_metering_1","Sub_metering_2","Sub_metering_3"),
##        col=c("black","red","blue"), lty=1:3, cex=0.8)
Language settings on plots
Do this and all languages will be in english on the plots
Sys.setlocale("LC_TIME", "en_US")
Summary

Plots must be created on a graphics device

The default graphics device is almost always the screen device, which is most useful for exploratory analysis

File devices are useful for creating plots that can be included in other documents or sent to other people

For file devices, there are vector and bitmap formats

Vector formats are good for line drawings and plots with solid colors using a modest number of points

Bitmap formats are good for plots with a large number of points, natural scenes or webbased plots
The Lattice Plotting System

The lattice plotting system is implemented using the following packages:

lattice: contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, levelplot
 grid: implements a different graphing system independent of the
“base” system; the lattice package builds on top of grid
 We seldom call functions from the grid package directly

The lattice plotting system does not have a “twophase” aspect with separate plotting and annotation like in base plotting
 All plotting/annotation is done at once with a single function call
Lattice Functions
 xyplot: this is the main function for creating scatterplots
 bwplot: box-and-whiskers plots (“boxplots”)
 histogram: histograms
 stripplot: like a boxplot but with actual points
 dotplot: plot dots on “violin strings”
 splom: scatterplot matrix; like pairs in the base plotting system
 levelplot, contourplot: for plotting “image” data
Lattice functions generally take a formula for their first argument, usually of the form
xyplot(y ~ x | f * g, data)

 We use the formula notation here, hence the ~.
 On the left of the ~ is the y-axis variable, on the right is the x-axis variable
 f and g are conditioning variables — they are optional
 the * indicates an interaction between two variables

The second argument is the data frame or list from which the variables in the formula should be looked up
 If no data frame or list is passed, then the parent frame is used.
 If no other arguments are passed, there are defaults that can be used.
Simple Lattice Plot
library(lattice)
library(datasets)
## Simple scatterplot.
xyplot(Ozone ~ Wind, data = airquality)
Looking at 3 variables at a time
library(datasets)
library(lattice)
## Convert 'Month' to a factor variable
airquality <- transform(airquality, Month = factor(Month))
xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1))
transform can be used or:
airquality$Month <- as.factor(airquality$Month)
Lattice Behavior
Lattice functions behave differently from base graphics functions in one critical way.

Base graphics functions plot data directly to the graphics device (screen, PDF file, etc.)

Lattice graphics functions return an object of class trellis

Lattice functions return “plot objects” that can, in principle, be stored (but it’s usually better to just save the code + data).
You can save the plot to an object and it won't display, just like regular function return values.
p <- xyplot(Ozone ~ Wind, data = airquality) ## Nothing happens!
print(p) ## Plot appears
xyplot(Ozone ~ Wind, data = airquality) ## Autoprinting
Lattice Panel Functions

Lattice functions have a panel function which controls what happens inside each panel of the plot.

The lattice package comes with default panel functions, but you can supply your own if you want to customize what happens in each panel

Panel functions receive the x/y coordinates of the data points in their panel (along with any optional arguments)
Simple XY plots for case with xy related and not related
set.seed(10)
x <- rnorm(100)
f <- rep(0:1, each = 50)
y <- x + f - f * x + rnorm(100, sd = 0.5)
f <- factor(f, labels = c("Group 1", "Group 2"))
xyplot(y ~ x | f, layout = c(2, 1)) ## Plot with 2 panels
Custom panel function: Each window can have similar features
## Custom panel function
xyplot(y ~ x | f, panel = function(x, y, ...) {
panel.xyplot(x, y, ...) ## First call the default panel function for 'xyplot'
panel.abline(h = median(y), lty = 2) ## Add a horizontal line at the median
})
Can view large amount of data as shown here
Lattice Panel Functions: Regression line
## Custom panel function
xyplot(y ~ x | f, panel = function(x, y, ...) {
panel.xyplot(x, y, ...) ## First call default panel function
panel.lmline(x, y, col = 2) ## Overlay a simple linear regression line
})
Many Panel Lattice Plot: Example from MAACS

Study: Mouse Allergen and Asthma Cohort Study (MAACS)

Study subjects: Children with asthma living in Baltimore City, many allergic to mouse allergen

Design: Observational study, baseline home visit + every 3 months for a year.

Question: How does indoor airborne mouse allergen vary over time and across subjects?
Ahluwalia et al., Journal of Allergy and Clinical Immunology, 2013
Summary

 Lattice plots are constructed with a single function call to a core lattice function (e.g. xyplot)
Aspects like margins and spacing are automatically handled and defaults are usually sufficient

The lattice system is ideal for creating conditioning plots where you examine the same kind of plot under many different conditions

Panel functions can be specified/customized to modify what is plotted in each of the plot panels
GGPlot2  mainly qplot()
What is ggplot2?
 An implementation of The Grammar of Graphics by Leland Wilkinson
 Written by Hadley Wickham (while he was a graduate student at Iowa State)
 A “third” graphics system for R (along with base and lattice)
 Available from CRAN via
install.packages()
 Web site: http://ggplot2.org (better documentation)
What is ggplot2?
 Grammar of graphics represents an abstraction of graphics ideas/objects
 Think “verb”, “noun”, “adjective” for graphics
 Allows for a “theory” of graphics on which to build new graphics and graphics objects
 “Shorten the distance from mind to page”
Grammar of Graphics
“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system”
Plotting Systems in R: Base
 “Artist’s palette” model
 Start with blank canvas and build up from there
 Start with plot function (or similar)
 Use annotation functions to add/modify (text, lines, points, axis)
Plotting Systems in R: Base
 Convenient, mirrors how we think of building plots and analyzing data
 Can’t go back once plot has started (i.e. to adjust margins); need to plan in advance
 Difficult to “translate” to others once a new plot has been created (no graphical “language”)
 Plot is just a series of R commands
Plotting Systems in R: Lattice
 Plots are created with a single function call (xyplot, bwplot, etc.)
 Most useful for conditioning types of plots: Looking at how $y$ changes with $x$ across levels of $z$
 Things like margins/spacing set automatically because entire plot is specified at once
 Good for putting many many plots on a screen
Plotting Systems in R: Lattice
 Sometimes awkward to specify an entire plot in a single function call
 Annotation in plot is not intuitive
 Use of panel functions and subscripts difficult to wield and requires intense preparation
 Cannot “add” to the plot once it’s created
Plotting Systems in R: ggplot2
 Split the difference between base and lattice
 Automatically deals with spacings, text, titles but also allows you to annotate by “adding”
 Superficial similarity to lattice but generally easier/more intuitive to use
 Default mode makes many choices for you (but you can customize!)
The Basics: qplot()
 Works much like the plot function in the base graphics system
 Looks for data in a data frame, similar to lattice, or in the parent environment
 Plots are made up of aesthetics (size, shape, color) and geoms (points, lines)
The Basics: qplot()
 Factors are important for indicating subsets of the data (if they are to have different properties); they should be labeled
 The qplot() function hides what goes on underneath, which is okay for most operations
 ggplot() is the core function and is very flexible for doing things qplot() cannot do
Example Dataset
library(ggplot2)
str(mpg)
'data.frame': 234 obs. of 11 variables:
$ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
$ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
$ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
$ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
$ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
$ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
$ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
$ cty : int 18 21 20 21 16 18 18 18 16 20 ...
$ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
$ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
$ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
ggplot2 “Hello, world!”
qplot(displ, hwy, data = mpg)
Modifying aesthetics
qplot(displ, hwy, data = mpg, color = drv)
Adding a geom
qplot(displ, hwy, data = mpg, geom = c("point", "smooth"))
Histograms
qplot(hwy, data = mpg, fill = drv)
Facets
Splits figure into as many based on the variable, for example months!
qplot(displ, hwy, data = mpg, facets = . ~ drv)
qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)
MAACS Cohort
 Mouse Allergen and Asthma Cohort Study
 Baltimore children (aged 5—17)
 Persistent asthma, exacerbation in past year
 Study indoor environment and its relationship with asthma morbidity
 Recent publication: http://goo.gl/WqE9j8
Example: MAACS
str(maacs)
'data.frame': 750 obs. of 5 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ eno : num 141 124 126 164 99 68 41 50 12 30 ...
$ duBedMusM: num 2423 2793 3055 775 1634 ...
$ pm25 : num 15.6 34.4 39 33.2 27.1 ...
$ mopos : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
Histogram of eNO
qplot(log(eno), data = maacs)
Histogram by Group
qplot(log(eno), data = maacs, fill = mopos)
Density Smooth
qplot(log(eno), data = maacs, geom = "density")
qplot(log(eno), data = maacs, geom = "density", color = mopos)
Scatterplots: eNO vs. PM$_{2.5}$
qplot(log(pm25), log(eno), data = maacs)
qplot(log(pm25), log(eno), data = maacs, shape = mopos)
qplot(log(pm25), log(eno), data = maacs, color = mopos)
Scatterplots: eNO vs. PM$_{2.5}$
qplot(log(pm25), log(eno), data = maacs, color = mopos,
geom = c("point", "smooth"), method = "lm")
Scatterplots: eNO vs. PM$_{2.5}$
qplot(log(pm25), log(eno), data = maacs, geom = c("point", "smooth"),
method = "lm", facets = . ~ mopos)
Summary of qplot()
 The qplot() function is the analog to plot() but with many built-in features
 Syntax somewhere in between base/lattice
 Produces very nice graphics, essentially publication ready (if you like the design)
 Difficult to go against the grain/customize (don’t bother; use full ggplot2 power in that case)
Resources
 The ggplot2 book by Hadley Wickham
 The R Graphics Cookbook by Winston Chang (examples in base plots and in ggplot2)
 ggplot2 web site (http://ggplot2.org)
 ggplot2 mailing list (http://goo.gl/OdW3uB), primarily for
developers
ggplot2 part 2
Basic Components of a ggplot2 Plot
 A data frame
 aesthetic mappings: how data are mapped to color, size
 geoms: geometric objects like points, lines, shapes.
 facets: for conditional plots.
 stats: statistical transformations like binning, quantiles, smoothing.
 scales: what scale an aesthetic map uses (example: male = red, female = blue).
 coordinate system
Building Plots with ggplot2
 When building plots in ggplot2 (rather than using qplot) the “artist’s palette” model may be the closest analogy
 Plots are built up in layers
 Plot the data
 Overlay a summary
 Metadata and annotation
Example: BMI, PM$_{2.5}$, Asthma
 Mouse Allergen and Asthma Cohort Study
 Baltimore children (age 5-17)
 Persistent asthma, exacerbation in past year
 Does BMI (normal vs. overweight) modify the relationship between PM$_{2.5}$ and asthma symptoms?
Basic Plot
library(ggplot2)
qplot(logpm25, NocturnalSympt, data = maacs, facets = . ~ bmicat,
geom = c("point", "smooth"), method = "lm")
Building Up in Layers
head(maacs)
logpm25 bmicat NocturnalSympt logno2_new
1 1.5362 normal weight 1 1.299
2 1.5905 normal weight 0 1.295
3 1.5218 normal weight 0 1.304
4 1.4323 normal weight 0 NA
5 1.2762 overweight 8 1.108
6 0.7139 overweight 0 0.837
g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
summary(g)
data: logpm25, bmicat, NocturnalSympt, logno2_new [554x4]
mapping: x = logpm25, y = NocturnalSympt
faceting: facet_null()
No Plot Yet!
g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
print(g)
Error: No layers in plot
First Plot with Point Layer
g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
g + geom_point()
Adding More Layers: Smooth geom
g + geom_point() + geom_smooth()
g + geom_point() + geom_smooth(method = "lm")
Adding More Layers: Facets
g + geom_point() + facet_grid(. ~ bmicat) + geom_smooth(method = "lm")
Annotation
 Labels: xlab(), ylab(), labs(), ggtitle()
 Each of the “geom” functions has options to modify
 For things that only make sense globally, use theme()
  Example: theme(legend.position = "none")
 Two standard appearance themes are included
  theme_gray(): the default theme (gray background)
  theme_bw(): more stark/plain (see the sketch below)
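A small sketch (my own combination of the options listed above) using the g object defined earlier; note theme_bw() has to come before theme(), otherwise it would reset the legend setting:
g + geom_point(aes(color = bmicat)) +
    ggtitle("MAACS Cohort") +
    theme_bw() +                     # plain black-and-white theme
    theme(legend.position = "none")  # drop the legend (a global option)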
Modifying Aesthetics
g + geom_point(color = "steelblue", size = 4, alpha = 1/2)
g + geom_point(aes(color = bmicat), size = 4, alpha = 1/2)
Modifying Labels
g + geom_point(aes(color = bmicat)) + labs(title = "MAACS Cohort") +
labs(x = expression("log " * PM[2.5]), y = "Nocturnal Symptoms")
Customizing the Smooth
g + geom_point(aes(color = bmicat), size = 2, alpha = 1/2) +
geom_smooth(size = 4, linetype = 3, method = "lm", se = FALSE)
Changing the Theme
g + geom_point(aes(color = bmicat)) + theme_bw(base_family = "Times")
A Note about Axis Limits
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50,2] <- 100 ## Outlier!
plot(testdat$x, testdat$y, type = "l", ylim = c(-3,3))
g <- ggplot(testdat, aes(x = x, y = y))
g + geom_line()
Axis Limits
g + geom_line() + ylim(-3, 3)
g + geom_line() + coord_cartesian(ylim = c(-3, 3))
More Complex Example
 How does the relationship between PM$_{2.5}$ and nocturnal symptoms vary by BMI and NO$_2$?
 Unlike our previous BMI variable, NO$_2$ is continuous
 We need to make NO$_2$ categorical so we can condition on it in the plotting
 Use the
cut()
function for this
Making NO$_2$ Tertiles
## Calculate the tertiles of the data
cutpoints <- quantile(maacs$logno2_new, seq(0, 1, length = 4), na.rm = TRUE)
## Cut the data at the tertiles and create a new factor variable
maacs$no2tert <- cut(maacs$logno2_new, cutpoints)
## See the levels of the newly created factor variable
levels(maacs$no2tert)
[1] "(0.378,1.2]" "(1.2,1.42]" "(1.42,2.55]"
Code for Final Plot
## Setup ggplot with data frame
g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
## Add layers
g + geom_point(alpha = 1/3) +
facet_wrap(bmicat ~ no2tert, nrow = 2, ncol = 4) +
geom_smooth(method="lm", se=FALSE, col="steelblue") +
theme_bw(base_family = "Avenir", base_size = 10) +
labs(x = expression("log " * PM[2.5])) +
labs(y = "Nocturnal Symptoms") +
labs(title = "MAACS Cohort")
Summary
 ggplot2 is very powerful and flexible if you learn the “grammar” and the various elements that can be tuned/modified
 Many more types of plots can be made; explore and mess around with the package (references mentioned in Part 1 are useful)
Important resource on legend, legend order and coloring
Was trying to change the labels of a legend; nothing worked, just like in this Stack Overflow question:
https://stackoverflow.com/questions/23635662/editing-legend-text-labels-in-ggplot
http://www.cookbook-r.com/Graphs/
We need to change the order of the factor so that breaks
corresponds with values
(see below)
mn5$fips <- factor(mn5$fips, levels=c("24510","06037")) # to change
# legend order later!
g <- ggplot(data=mn5, aes(x=year, y=Mean.Emissions, color=fips)) +
geom_line()
g + scale_colour_manual(breaks=c("24510","06037"),values=c("red","green"))
When you have a histogram and 2 vlines, how to add legend
https://stackoverflow.com/questions/37660694/add-legend-to-geom-vline
quantile_1 <- quantile(sf$Unit.Sales, prob = 0.25)
quantile_2 <- quantile(sf$Unit.Sales, prob = 0.75)
ggplot(aes(x = Unit.Sales), data = sf) +
geom_histogram(color = 'black', fill = NA) +
geom_vline(aes(xintercept=median(Unit.Sales)),
color="blue", linetype="dashed", size=1) +
geom_vline(aes(xintercept=mean(Unit.Sales)),
color="red", linetype="dashed", size=1) +
geom_vline(aes(xintercept=quantile_1), color="yellow", linetype="dashed", size=1)
ggplot histogram
http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)
g <- ggplot(data.frame(mn.mat), aes(x=mn.mat))
## hist and plots
g+geom_histogram(aes(y=..density..,fill="Distribution"),binwidth=0.5,color="black") +
scale_fill_manual("Histogram Legend", values=c("white")) +
geom_vline(aes(xintercept=mn.sample,color="Sample.mean"),linetype="dashed",size=0.5)+
geom_vline(aes(xintercept=mean(mn.mat),color="Mean.of.distribution"),linetype="dashed",size=1)+
scale_color_manual(name = "vLine legend", values = c( Sample.mean= "blue", Mean.of.distribution = "red"))+
ggtitle("Sampling distribution of Sample Mean") +xlab("Sample Mean, n=40") + ylab("density")
Hierarchical Clustering
Can we find things that are close together?
Clustering organizes things that are close into groups
 How do we define close?
 How do we group things?
 How do we visualize the grouping?

How do we interpret the grouping?
 An agglomerative approach
 Find closest two things
 Put them together
 Find next closest
 Requires
 A defined distance
 A merging approach
 Produces
 A tree showing how close things are to each other
How do we define close?
 Most important step
 Garbage in -> garbage out
 Distance or similarity
 Continuous - euclidean distance
 Continuous - correlation similarity
 Binary - manhattan distance
 Pick a distance/similarity that makes sense for your problem
Example distances  Euclidean
Point to point
In general:
\(\sqrt{(A_1-A_2)^2 + (B_1-B_2)^2 + \ldots + (Z_1-Z_2)^2}\) http://rafalab.jhsph.edu/688/lec/lecture5clustering.pdf
Example distances  Manhattan
Like moving in a grid
In general:
\[|A_1-A_2| + |B_1-B_2| + \ldots + |Z_1-Z_2|\] http://en.wikipedia.org/wiki/Taxicab_geometry
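Both distances are available through dist(); a minimal sketch (my own numbers) comparing them on a single pair of points:
pts <- data.frame(x = c(0, 3), y = c(0, 4))
dist(pts)                        # euclidean (the default): sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # |0-3| + |0-4| = 7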
Hierarchical clustering  example
set.seed(1234)
par(mar = c(0, 0, 0, 0))
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
plot(x, y, col = "blue", pch = 19, cex = 2)
text(x + 0.05, y + 0.05, labels = as.character(1:12))
Hierarchical clustering  dist
 Important parameters: x,method
dataFrame <- data.frame(x = x, y = y)
dist(dataFrame)
## 1 2 3 4 5 6 7 8 9
## 2 0.34121
## 3 0.57494 0.24103
## 4 0.26382 0.52579 0.71862
## 5 1.69425 1.35818 1.11953 1.80667
## 6 1.65813 1.31960 1.08339 1.78081 0.08150
## 7 1.49823 1.16621 0.92569 1.60132 0.21110 0.21667
## 8 1.99149 1.69093 1.45649 2.02849 0.61704 0.69792 0.65063
## 9 2.13630 1.83168 1.67836 2.35676 1.18350 1.11500 1.28583 1.76461
## 10 2.06420 1.76999 1.63110 2.29239 1.23848 1.16550 1.32063 1.83518 0.14090
## 11 2.14702 1.85183 1.71074 2.37462 1.28154 1.21077 1.37370 1.86999 0.11624
## 12 2.05664 1.74663 1.58659 2.27232 1.07701 1.00777 1.17740 1.66224 0.10849
## 10 11
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11 0.08318
## 12 0.19129 0.20803
Hierarchical clustering  hclust
dataFrame <- data.frame(x = x, y = y)
distxy <- dist(dataFrame)
hClustering <- hclust(distxy)
plot(hClustering)
Plots a dendrogram
Prettier dendrograms
Use this function for prettier dendrograms
myplclust <- function(hclust, lab = hclust$labels, lab.col = rep(1, length(hclust$labels)),
hang = 0.1, ...) {
## modification of plclust for plotting hclust objects *in colour*! Copyright
## Eva KF Chan 2009 Arguments: hclust: hclust object lab: a character vector
## of labels of the leaves of the tree lab.col: colour for the labels;
## NA=default device foreground colour hang: as in hclust & plclust Side
## effect: A display of hierarchical cluster with coloured leaf labels.
y <- rep(hclust$height, 2)
x <- as.numeric(hclust$merge)
y <- y[which(x < 0)]
x <- x[which(x < 0)]
x <- abs(x)
y <- y[order(x)]
x <- x[order(x)]
plot(hclust, labels = FALSE, hang = hang, ...)
text(x = x, y = y[hclust$order] - (max(hclust$height) * hang), labels = lab[hclust$order],
col = lab.col[hclust$order], srt = 90, adj = c(1, 0.5), xpd = NA, ...)
}
Pretty dendrograms
dataFrame <- data.frame(x = x, y = y)
distxy <- dist(dataFrame)
hClustering <- hclust(distxy)
myplclust(hClustering, lab = rep(1:3, each = 4), lab.col = rep(1:3, each = 4))
Even Prettier dendrograms
Site doesn’t work! Site broken!
http://gallery.r-enthusiasts.com/RGraphGallery.php?graph=79
Merging points  complete or average
method: the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of ‘”ward.D”’, ‘”ward.D2”’, ‘”single”’, ‘”complete”’, ‘”average”’ (= UPGMA), ‘”mcquitty”’ (= WPGMA), ‘”median”’ (= WPGMC) or ‘”centroid”’ (= UPGMC).
The method argument of hclust controls how clusters are merged: with "complete" the distance between two clusters is taken from the farthest points within the existing clusters, with "average" from the mean of all pairwise distances. A sketch comparing the two follows below.
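A minimal sketch (my own, reusing dataFrame from the dist example above) of the two linkage methods:
d <- dist(dataFrame)
hc.complete <- hclust(d, method = "complete") # farthest pair of points (the default)
hc.average  <- hclust(d, method = "average")  # mean of all pairwise distances (UPGMA)
par(mfrow = c(1, 2))
plot(hc.complete, main = "complete")
plot(hc.average, main = "average")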
heatmap()
Clusters rows based on their pairwise distance, just like hclust.
It does the same for the columns, reordering both so that rows/columns that are close to each other end up next to each other; the cell colours encode the values.
dataFrame <- data.frame(x = x, y = y)
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:12), ]
heatmap(dataMatrix)
In this example rows 9, 12, 6, 7, 1, 5, 11 are all "close" to each other, whereas 3, 2, 4 aren't!
Notes and further resources
 Gives an idea of the relationships between variables/observations
 The picture may be unstable
 Change a few points
 Have different missing values
 Pick a different distance
 Change the merging strategy
 Change the scale of points for one variable
 But it is deterministic
 Choosing where to cut isn’t always obvious
 Should be primarily used for exploration
 Rafa’s Distances and Clustering Video
 Elements of statistical learning
My notes on hierarchical clustering and heatmaps
Hierarchical clustering is when you compare rows or columns of a DF by first scaling them and the point is to see how everything is related to each other. For example:
s <- matrix(1:25, 5)
s[lower.tri(s)] = t(s)[lower.tri(s)]
heatmap(s)
heatmap() takes a data frame/matrix and then does hierarchical clustering on the
rows and columns. It uses dist() to compute distances based on the
method argument. For example, with the euclidean method it
basically takes 2 entire row vectors in n dimensions, measures
the euclidean distance between them and uses that for clustering. Same with the columns.
Kmeans! Can we find things that are close together?
 How do we define close?
 How do we group things?
 How do we visualize the grouping?
 How do we interpret the grouping?
How do we define close?
 Most important step
 Garbage in $\longrightarrow$ garbage out
 Distance or similarity
 Continuous - euclidean distance
 Continuous - correlation similarity
 Binary - manhattan distance
 Pick a distance/similarity that makes sense for your problem
Kmeans clustering
 A partitioning approach
 Fix a number of clusters
 Get “centroids” of each cluster
 Assign things to closest centroid
 Recalculate centroids
 Requires
 A defined distance metric
 A number of clusters
 An initial guess as to cluster centroids
 Produces
 Final estimate of cluster centroids
 An assignment of each point to clusters
Kmeans clustering  example
set.seed(1234)
par(mar = c(0, 0, 0, 0))
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
plot(x, y, col = "blue", pch = 19, cex = 2)
text(x + 0.05, y + 0.05, labels = as.character(1:12))
What Kmeans does?

starting centroids are guessed

assign points to closest centroid

recalculates centroids

reassigns points to closest centroid

update centroids
kmeans()
 Important parameters: x, centers, iter.max, nstart
dataFrame <- data.frame(x, y)
kmeansObj <- kmeans(dataFrame, centers = 3)
names(kmeansObj)
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
kmeansObj$cluster
## [1] 3 3 3 3 1 1 1 1 2 2 2 2
kmeans()
par(mar = rep(0.2, 4))
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)
Heatmaps
set.seed(1234)
dataMatrix <- as.matrix(dataFrame)[sample(1:12), ]
kmeansObj2 <- kmeans(dataMatrix, centers = 3)
par(mfrow = c(1, 2), mar = c(2, 4, 0.1, 0.1))
image(t(dataMatrix)[, nrow(dataMatrix):1], yaxt = "n")
image(t(dataMatrix)[, order(kmeansObj2$cluster)], yaxt = "n")
slides have an error: kmeansObj$cluster is used instead of kmeansObj2$cluster and gives shitty results!
Notes and further resources
 Kmeans requires a number of clusters
 Pick by eye/intuition (see the elbow sketch after this list)
 Pick by cross validation/information theory, etc.
 Determining the number of clusters
 Kmeans is not deterministic
 Different # of clusters
 Different number of iterations
 Rafael Irizarry’s Distances and Clustering Video
 Elements of statistical learning
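One common way to "pick by eye" is an elbow plot of the total within-cluster sum of squares against k; a minimal sketch (my own, reusing dataFrame from the K-means example above):
wss <- sapply(1:8, function(k) kmeans(dataFrame, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")  # look for the "elbow"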
Dimension reduction
Matrix data
set.seed(12345)
par(mar = rep(0.2, 4))
dataMatrix <- matrix(rnorm(400), nrow = 40)
image(1:10, 1:40, t(dataMatrix)[, nrow(dataMatrix):1])
Cluster the data
par(mar = rep(0.2, 4))
heatmap(dataMatrix)
What if we add a pattern?
set.seed(678910)
for (i in 1:40) {
# flip a coin
coinFlip <- rbinom(1, size = 1, prob = 0.5)
# if coin is heads add a common pattern to that row
if (coinFlip) {
dataMatrix[i, ] <- dataMatrix[i, ] + rep(c(0, 3), each = 5)
}
}
rep(c(0,3),each=5)
produces 0,0,0,0,0,3,3,3,3,3;
Adds mean to last 5 columns only
What if we add a pattern?  the data
You see split between first and last 5 columns!
par(mar = rep(0.2, 4))
image(1:10, 1:40, t(dataMatrix)[, nrow(dataMatrix):1])
What if we add a pattern?  the clustered data; heatmap
Appears random in the rows…
par(mar = rep(0.2, 4))
heatmap(dataMatrix)
Patterns in rows and columns; heatmap
order it and plot the heatmap
hh <- hclust(dist(dataMatrix))
dataMatrixOrdered <- dataMatrix[hh$order, ]
par(mfrow = c(1, 3))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1])
plot(rowMeans(dataMatrixOrdered), 40:1, , xlab = "Row Mean", ylab = "Row", pch = 19)
plot(colMeans(dataMatrixOrdered), xlab = "Column", ylab = "Column Mean", pch = 19)
Related problems
You have multivariate variables $X_1,\ldots,X_n$ so $X_1 = (X_{11},\ldots,X_{1m})$
 Find a new set of multivariate variables that are uncorrelated and explain as much variance as possible.
 If you put all the variables together in one matrix, find the best matrix created with fewer variables (lower rank) that explains the original data.
The first goal is statistical and the second goal is data compression.
Related solutions  PCA/SVD
SVD
If $X$ is a matrix with each variable in a column and each observation in a row then the SVD is a “matrix decomposition”
\[X = UDV^T\]where the columns of $U$ are orthogonal (left singular vectors), the columns of $V$ are orthogonal (right singular vectors) and $D$ is a diagonal matrix (singular values).
PCA
The principal components are equal to the right singular vectors if you first scale (subtract the mean, divide by the standard deviation) the variables.
This is pretty much it, it is a different way of doing the SVD I guess!
What is SVD?
X = U D V’, U spans the column space and V spans the row space and are orthonormal vectors (UU^T=I).
X –> mxn; U –> mxm; D –> mxn; V –> nxn
Rank of X, r <= min(m,n)
If r < min(m,n), then the first r columns of U span the column space and the first r columns of V span the row space; the remaining m-r columns of U and n-r columns of V span the corresponding null spaces.
Gilbert Strang lecture is the source of this content.
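A quick numerical check (my own sketch) that svd() in R returns exactly this decomposition:
set.seed(1)
X <- matrix(rnorm(20), nrow = 5)              # a 5 x 4 matrix
s <- svd(X)
dim(s$u); length(s$d); dim(s$v)               # 5x4, 4, 4x4
all.equal(X, s$u %*% diag(s$d) %*% t(s$v))    # TRUE: X = U D V'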
Components of the SVD  $u$ and $v$
svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1, 3))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1])
plot(svd1$u[, 1], 40:1, , xlab = "Row", ylab = "First left singular vector",
pch = 19)
plot(svd1$v[, 1], xlab = "Column", ylab = "First right singular vector", pch = 19)
When to scale and center
scale() does both centering and scaling when both arguments are TRUE, i.e., (x_i - mean(x_i)) / sd(x_i)
it depends on the type of data you have. For some types of well defined data, there may be no need to scale and center. A good example is geolocation data (longitudes and latitudes). If you were seeking to cluster towns, you wouldn’t need to scale and center their locations.
For data that is of different physical measurements or units, its probably a good idea to scale and center. For example, when clustering vehicles, the data may contain attributes such as number of wheels, number of doors, miles per gallon, horsepower etc. In this case it may be a better idea to scale and center since you are unsure of the relationship between each attribute.
The intuition behind that is that since many clustering algorithms require some definition of distance, if you do not scale and center your data, you may give attributes which have larger magnitudes more importance.
In the context of your problem, I would scale and center the data if it contains attributes like patient height, weight, age etc.
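A small sketch (my own) of what scale() actually computes column by column:
m <- matrix(rnorm(20), nrow = 5)
s1 <- scale(m)                             # center and scale each column
s2 <- sweep(sweep(m, 2, colMeans(m), "-"), # subtract column means
            2, apply(m, 2, sd), "/")       # divide by column standard deviations
all.equal(as.numeric(s1), as.numeric(s2))  # TRUE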
Components of the SVD  Variance calculation
svd1$d^2/sum(svd1$d^2)
par(mfrow = c(1, 2))
plot(svd1$d, xlab = "Column", ylab = "Singular value", pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Prop. of variance explained",
pch = 19)
SVD vs PCA; principal component analysis
Same as Right singular vectors. Nothing more. Look here for more info.
svd1 <- svd(scale(dataMatrixOrdered))
pca1 <- prcomp(dataMatrixOrdered, scale = TRUE)
plot(pca1$rotation[, 1], svd1$v[, 1], pch = 19, xlab = "Principal Component 1",
ylab = "Right Singular Vector 1")
abline(c(0, 1))
Components of the SVD  variance explained
constantMatrix <- dataMatrixOrdered*0
for(i in 1:dim(dataMatrixOrdered)[1]){constantMatrix[i,] <- rep(c(0,1),each=5)}
svd1 <- svd(constantMatrix)
par(mfrow=c(1,3))
image(t(constantMatrix)[,nrow(constantMatrix):1])
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Prop. of variance explained",pch=19)
Dimension reduction: identifying patterns
What if we add a second pattern?
set.seed(678910)
for (i in 1:40) {
# flip a coin
coinFlip1 <- rbinom(1, size = 1, prob = 0.5)
coinFlip2 <- rbinom(1, size = 1, prob = 0.5)
# if coin is heads add a common pattern to that row
if (coinFlip1) {
dataMatrix[i, ] <- dataMatrix[i, ] + rep(c(0, 5), each = 5)
}
if (coinFlip2) {
dataMatrix[i, ] <- dataMatrix[i, ] + rep(c(0, 5), 5)
}
}
hh <- hclust(dist(dataMatrix))
dataMatrixOrdered <- dataMatrix[hh$order, ]
Singular value decomposition  true patterns
svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1, 3))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1])
plot(rep(c(0, 1), each = 5), pch = 19, xlab = "Column", ylab = "Pattern 1")
plot(rep(c(0, 1), 5), pch = 19, xlab = "Column", ylab = "Pattern 2")
$v$ and patterns of variance in rows
svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1, 3))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1])
plot(svd2$v[, 1], pch = 19, xlab = "Column", ylab = "First right singular vector")
plot(svd2$v[, 2], pch = 19, xlab = "Column", ylab = "Second right singular vector")
$d$ and variance explained
svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1, 2))
plot(svd1$d, xlab = "Column", ylab = "Singular value", pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Percent of variance explained",
pch = 19)
Missing values NA
dataMatrix2 <- dataMatrixOrdered
## Randomly insert some missing data
dataMatrix2[sample(1:100, size = 40, replace = FALSE)] <- NA
svd1 <- svd(scale(dataMatrix2)) ## Doesn't work!
## Error: infinite or missing values in 'x'
Own imputing the missing values NA
mn2 <- df %>% group_by(interval) %>% mutate(mean.across.days=mean(steps, na.rm=TRUE))
## Are there differences in activity patterns between weekdays and
## weekends?
na.row <- which(is.na(mn2$steps), arr.ind=TRUE)
mn2$steps.imputed.without.NA <- mn2$steps
mn2$steps.imputed.without.NA[na.row] <- mn2$mean.across.days[na.row]
Imputing {impute}
library(impute) ## Available from http://bioconductor.org
dataMatrix2 <- dataMatrixOrdered
dataMatrix2[sample(1:100, size=40, replace=FALSE)] <- NA
dataMatrix2 <- impute.knn(dataMatrix2)$data
svd1 <- svd(scale(dataMatrixOrdered)); svd2 <- svd(scale(dataMatrix2))
par(mfrow=c(1,2)); plot(svd1$v[,1], pch=19); plot(svd2$v[,1], pch=19)
Face example
load("data/face.rda")
image(t(faceData)[, nrow(faceData):1])
Face example  variance explained
svd1 <- svd(scale(faceData))
plot(svd1$d^2/sum(svd1$d^2), pch = 19, xlab = "Singular vector", ylab = "Variance explained")
Face example  create approximations
svd1 <- svd(scale(faceData))
## Note that %*% is matrix multiplication
# Here svd1$d[1] is a constant
approx1 <- svd1$u[, 1] %*% t(svd1$v[, 1]) * svd1$d[1]
# In these examples we need to make the diagonal matrix out of d
approx5 <- svd1$u[, 1:5] %*% diag(svd1$d[1:5]) %*% t(svd1$v[, 1:5])
approx10 <- svd1$u[, 1:10] %*% diag(svd1$d[1:10]) %*% t(svd1$v[, 1:10])
Face example - plot approximations
par(mfrow = c(1, 4))
image(t(approx1)[, nrow(approx1):1], main = "(a)")
image(t(approx5)[, nrow(approx5):1], main = "(b)")
image(t(approx10)[, nrow(approx10):1], main = "(c)")
image(t(faceData)[, nrow(faceData):1], main = "(d)") ## Original data
Notes and further resources
 Scale matters
 PC’s/SV’s may mix real patterns
 Can be computationally intensive
 Advanced data analysis from an elementary point of view
 Elements of statistical learning
 Alternatives
 Factor analysis
 Independent components analysis
 Latent semantic analysis
Colors
Plotting and Color
 The default color schemes for most plots in R are horrendous
 I don’t have good taste and even I know that
 Recently there have been developments to improve the handling/specification of colors in plots/graphs/etc.
 There are functions in R and in external packages that are very handy
Color Utilities in R
- The grDevices package has two functions: colorRamp and colorRampPalette
- These functions take palettes of colors and help to interpolate between the colors
- The function colors() lists the names of colors you can use in any plotting function
Color Palette Utilities in R
- colorRamp: takes a palette of colors and returns a function that takes values between 0 and 1, indicating the extremes of the color palette (e.g. see the 'gray' function)
- colorRampPalette: takes a palette of colors and returns a function that takes integer arguments and returns a vector of colors interpolating the palette (like heat.colors or topo.colors)
colorRamp
[,1] [,2] [,3]
corresponds to [Red] [Green] [Blue]
> pal <- colorRamp(c("red", "blue"))
> pal(0)
[,1] [,2] [,3]
[1,] 255 0 0
> pal(1)
[,1] [,2] [,3]
[1,] 0 0 255
> pal(0.5)
[,1] [,2] [,3]
[1,] 127.5 0 127.5
colorRamp
> pal(seq(0, 1, len = 10))
[,1] [,2] [,3]
[1,] 255.00000 0 0
[2,] 226.66667 0 28.33333
[3,] 198.33333 0 56.66667
[4,] 170.00000 0 85.00000
[5,] 141.66667 0 113.33333
[6,] 113.33333 0 141.66667
[7,] 85.00000 0 170.00000
[8,] 56.66667 0 198.33333
[9,] 28.33333 0 226.66667
[10,] 0.00000 0 255.00000
colorRampPalette
> pal <- colorRampPalette(c("red", "yellow"))
> pal(2)
[1] "#FF0000" "#FFFF00"
> pal(10)
[1] "#FF0000" "#FF1C00" "#FF3800" "#FF5500" "#FF7100"
[6] "#FF8D00" "#FFAA00" "#FFC600" "#FFE200" "#FFFF00”
RColorBrewer Package

- One package on CRAN that contains interesting/useful color palettes
- There are 3 types of palettes
  - Sequential
  - Diverging
  - Qualitative
- Palette information can be used in conjunction with the colorRamp() and colorRampPalette() functions
https://www.google.com/search?q=rcolorbrewer+package&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjUgtT7wsLgAhXIaFAKHVk6CdsQ_AUIDigB&biw=792&bih=756#imgrc=FVFdJalwdTOVNM:
RColorBrewer and colorRampPalette
library(RColorBrewer)
cols <- brewer.pal(3, "BuGn")
cols
[1] "#E5F5F9" "#99D8C9" "#2CA25F"
pal <- colorRampPalette(cols)
image(volcano, col = pal(20))
The smoothScatter function
x <- rnorm(1000)
y <- rnorm(1000)
smoothScatter(x,y)
Some other plotting notes
- The rgb function can be used to produce any color via red, green, blue proportions
- Color transparency can be added via the alpha parameter to rgb
- The colorspace package can be used for a different control over colors
Scatterplot with no transparency
plot(x,y,pch=19)
Scatterplot with transparency
plot(x,y,col=rgb(0,0,0,0.2),pch=19)
Summary
- Careful use of colors in plots/maps/etc. can make it easier for the reader to get what you're trying to say (why make it harder?)
- The RColorBrewer package is an R package that provides color palettes for sequential, categorical, and diverging data
- The colorRamp and colorRampPalette functions can be used in conjunction with color palettes to connect data to colors
- Transparency can sometimes be used to clarify plots with many points
c4w4
Slightly processed data
load("data/samsungData.rda")
names(samsungData)[1:12]
## [1] "tBodyAccmean()X" "tBodyAccmean()Y" "tBodyAccmean()Z"
## [4] "tBodyAccstd()X" "tBodyAccstd()Y" "tBodyAccstd()Z"
## [7] "tBodyAccmad()X" "tBodyAccmad()Y" "tBodyAccmad()Z"
## [10] "tBodyAccmax()X" "tBodyAccmax()Y" "tBodyAccmax()Z"
table(samsungData$activity)
##
## laying sitting standing walk walkdown walkup
## 1407 1286 1374 1226 986 1073
Plotting average acceleration for first subject
par(mfrow = c(1, 2), mar = c(5, 4, 1, 1))
samsungData <- transform(samsungData, activity = factor(activity))
sub1 <- subset(samsungData, subject == 1)
plot(sub1[, 1], col = sub1$activity, ylab = names(sub1)[1])
plot(sub1[, 2], col = sub1$activity, ylab = names(sub1)[2])
legend("bottomright", legend = unique(sub1$activity), col = unique(sub1$activity),
pch = 1)
Need to use unique for the legend; you don't need to use unique in the plot call.
Clustering based just on average acceleration
This is a bad idea in this case as the avg acc is not able to distinguish between the different activities.
library(rafalib)
source("myplclust.R")
distanceMatrix <- dist(sub1[, 1:3])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering, lab.col = unclass(sub1$activity))
https://rdrr.io/cran/rafalib/src/R/myplclust.R
Plotting max acceleration for the first subject
par(mfrow = c(1, 2))
plot(sub1[, 10], pch = 19, col = sub1$activity, ylab = names(sub1)[10])
plot(sub1[, 11], pch = 19, col = sub1$activity, ylab =
names(sub1)[11])
legend("bottomright", legend = unique(sub1$activity), col = unique(sub1$activity), pch = 1)
A much better way to identify parameters for clustering and to set cutoffs for distinguishing the activities.
Clustering based on maximum acceleration
Now we see some useful clustering based on distance, taking in XYZ acc!
distanceMatrix <- dist(sub1[, 10:12])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering, lab.col = unclass(sub1$activity))
legend("topright", legend = unique(sub1$activity), col = unique(sub1$activity),
pch = 1)
Singular Value Decomposition
svd1 = svd(scale(sub1[, -c(562, 563)]))
par(mfrow = c(1, 2))
plot(svd1$u[, 1], col = sub1$activity, pch = 19)
plot(svd1$u[, 2], col = sub1$activity, pch = 19)
Find maximum contributor
plot(svd1$v[, 2], pch = 19)
New clustering with maximum contributor
maxContrib <- which.max(svd1$v[, 2])
distanceMatrix <- dist(sub1[, c(10:12, maxContrib)])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering, lab.col = unclass(sub1$activity))
New clustering with maximum contributor
names(samsungData)[maxContrib]
## [1] "fBodyAcc.meanFreq...Z"
K-means clustering (nstart=1, first try)
kClust <- kmeans(sub1[, -c(562, 563)], centers = 6)
table(kClust$cluster, sub1$activity)
##
## laying sitting standing walk walkdown walkup
## 1 0 0 0 50 1 0
## 2 0 0 0 0 48 0
## 3 27 37 51 0 0 0
## 4 3 0 0 0 0 53
## 5 0 0 0 45 0 0
## 6 20 10 2 0 0 0
K-means clustering (nstart=1, second try)
kClust <- kmeans(sub1[, -c(562, 563)], centers = 6, nstart = 1)
table(kClust$cluster, sub1$activity)
##
## laying sitting standing walk walkdown walkup
## 1 0 0 0 0 49 0
## 2 18 10 2 0 0 0
## 3 0 0 0 95 0 0
## 4 29 0 0 0 0 0
## 5 0 37 51 0 0 0
## 6 3 0 0 0 0 53
K-means clustering (nstart=100, first try)
kClust <- kmeans(sub1[, -c(562, 563)], centers = 6, nstart = 100)
table(kClust$cluster, sub1$activity)
##
## laying sitting standing walk walkdown walkup
## 1 18 10 2 0 0 0
## 2 29 0 0 0 0 0
## 3 0 0 0 95 0 0
## 4 0 0 0 0 49 0
## 5 3 0 0 0 0 53
## 6 0 37 51 0 0 0
K-means clustering (nstart=100, second try)
kClust <- kmeans(sub1[, -c(562, 563)], centers = 6, nstart = 100)
table(kClust$cluster, sub1$activity)
##
## laying sitting standing walk walkdown walkup
## 1 29 0 0 0 0 0
## 2 3 0 0 0 0 53
## 3 0 0 0 0 49 0
## 4 0 0 0 95 0 0
## 5 0 37 51 0 0 0
## 6 18 10 2 0 0 0
Cluster 1 Variable Centers (Laying)
plot(kClust$center[1, 1:10], pch = 19, ylab = "Cluster Center", xlab = "")
Cluster 2 Variable Centers (Walking)
plot(kClust$center[4, 1:10], pch = 19, ylab = "Cluster Center", xlab = "")
Air pollution example
setwd
setwd("./blablabla/")
Entry question: Has fine particle pollution in the U.S. decreased from 1999 to 2012?
Read in data from 1999
pm0 <- read.table("RD_501_88101_1999-0.txt", comment.char = "#", header = FALSE, sep = "|", na.strings = "")
dim(pm0)
head(pm0)
cnames <- readLines("RD_501_88101_1999-0.txt", 1)
print(cnames)
cnames <- strsplit(cnames, "|", fixed = TRUE)
print(cnames)
names(pm0) <- make.names(cnames[[1]])
head(pm0)
x0 <- pm0$Sample.Value
class(x0)
str(x0)
summary(x0)
mean(is.na(x0)) ## Are missing values important here?
Read in data from 2012
pm1 <- read.table("RD_501_88101_2012-0.txt", comment.char = "#", header = FALSE, sep = "|", na.strings = "", nrow = 1304290)
names(pm1) <- make.names(cnames[[1]]) # remove spaces in names
head(pm1)
dim(pm1)
x1 <- pm1$Sample.Value
class(x1)
Five number summaries for both periods
summary(x1)
summary(x0)
mean(is.na(x1)) ## Are missing values important here?
Make a boxplot of both 1999 and 2012
boxplot(x0, x1)
boxplot(log10(x0), log10(x1))
Check negative values in ‘x1’
summary(x1)
negative <- x1 < 0
sum(negative, na.rm = T)
mean(negative, na.rm = T)
dates <- pm1$Date
str(dates)
dates <- as.Date(as.character(dates), "%Y%m%d")
str(dates)
hist(dates, "month") ## Check what's going on in months 1-6
Plot a subset for one monitor at both times
Find a monitor for New York State that exists in both datasets
site0 <- unique(subset(pm0, State.Code == 36, c(County.Code, Site.ID)))
site1 <- unique(subset(pm1, State.Code == 36, c(County.Code, Site.ID)))
site0 <- paste(site0[,1], site0[,2], sep = ".")
site1 <- paste(site1[,1], site1[,2], sep = ".")
str(site0)
str(site1)
both <- intersect(site0, site1)
print(both)
Find how many observations available at each monitor
pm0$county.site <- with(pm0, paste(County.Code, Site.ID, sep = "."))
pm1$county.site <- with(pm1, paste(County.Code, Site.ID, sep = "."))
cnt0 <- subset(pm0, State.Code == 36 & county.site %in% both)
cnt1 <- subset(pm1, State.Code == 36 & county.site %in% both)
sapply(split(cnt0, cnt0$county.site), nrow)
sapply(split(cnt1, cnt1$county.site), nrow)
Choose county 63 and site ID 2008
pm1sub <- subset(pm1, State.Code == 36 & County.Code == 63 & Site.ID == 2008)
pm0sub <- subset(pm0, State.Code == 36 & County.Code == 63 & Site.ID == 2008)
dim(pm1sub)
dim(pm0sub)
Plot data for 2012
dates1 <- pm1sub$Date
x1sub <- pm1sub$Sample.Value
plot(dates1, x1sub)
dates1 <- as.Date(as.character(dates1), "%Y%m%d")
str(dates1)
plot(dates1, x1sub)
Plot data for 1999
dates0 <- pm0sub$Date
dates0 <- as.Date(as.character(dates0), "%Y%m%d")
x0sub <- pm0sub$Sample.Value
plot(dates0, x0sub)
Plot data for both years in same panel
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(dates0, x0sub, pch = 20)
abline(h = median(x0sub, na.rm = T))
plot(dates1, x1sub, pch = 20) ## Whoa! Different ranges
abline(h = median(x1sub, na.rm = T))
Find global range
rng <- range(x0sub, x1sub, na.rm = T)
rng
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(dates0, x0sub, pch = 20, ylim = rng)
abline(h = median(x0sub, na.rm = T))
plot(dates1, x1sub, pch = 20, ylim = rng)
abline(h = median(x1sub, na.rm = T))
Show statewide means and make a plot showing trend
head(pm0)
mn0 <- with(pm0, tapply(Sample.Value, State.Code, mean, na.rm = T))
str(mn0)
summary(mn0)
mn1 <- with(pm1, tapply(Sample.Value, State.Code, mean, na.rm = T))
str(mn1)
Make separate data frames for states / years
d0 <- data.frame(state = names(mn0), mean = mn0)
d1 <- data.frame(state = names(mn1), mean = mn1)
mrg <- merge(d0, d1, by = "state")
dim(mrg)
head(mrg)
Connect lines
par(mfrow = c(1, 1))
with(mrg, plot(rep(1, 52), mrg[, 2], xlim = c(.5, 2.5)))
with(mrg, points(rep(2, 52), mrg[, 3]))
segments(rep(1, 52), mrg[, 2], rep(2, 52), mrg[, 3])
Reproducible research (c5) Structure of Data analysis
Steps in a data analysis

- Define the question
- Define the ideal data set
- Determine what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction/modeling
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code
The key challenge in data analysis
“Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you had insufficient information and have to go find some?”
An example
Start with a general question
Can I automatically detect emails that are SPAM and those that are not?
Make it concrete
Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?
Define the ideal data set
- The data set may depend on your goal
  - Descriptive - a whole population
  - Exploratory - a random sample with many variables measured
  - Inferential - the right population, randomly sampled
  - Predictive - a training and test data set from the same population
  - Causal - data from a randomized study
  - Mechanistic - data about all components of the system
Say for example, http://www.google.com/about/datacenters/inside/
Determine what data you can access
 Sometimes you can find data free on the web
 Other times you may need to buy the data
 Be sure to respect the terms of use
 If the data don’t exist, you may need to generate it yourself
A possible solution
Open source data available on UCI!
http://archive.ics.uci.edu/ml/datasets/Spambase
Obtain the data
 Try to obtain the raw data
 Be sure to reference the source
 Polite emails go a long way
 If you will load the data from an internet source, record the url and time accessed
Our data set for spam emails!
kernlab seems to have data on spam email!
http://search.r-project.org/library/kernlab/html/spam.html
Clean the data
 Raw data often needs to be processed
- If it is pre-processed, make sure you understand how
- Understand the source of the data (census, sample, convenience sample, etc.)
- May need reformatting, subsampling - record these steps
- Determine if the data are good enough - if not, quit or change data
Our cleaned data set
# If it isn't installed, install the kernlab package with install.packages()
library(kernlab)
data(spam)
str(spam[, 1:5])
'data.frame': 4601 obs. of 5 variables:
$ make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
$ address: num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
$ all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
$ num3d : num 0 0 0 0 0 0 0 0 0 0 ...
$ our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
http://search.r-project.org/library/kernlab/html/spam.html
Structure of Data analysis 2 (c5)
Steps in a data analysis
 Define the question
 Define the ideal data set
 Determine what data you can access
 Obtain the data
 Clean the data

- Exploratory data analysis
- Statistical prediction/modeling
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code
An example
Start with a general question
Can I automatically detect emails that are SPAM or not?
Make it concrete
Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?
Our data set
http://search.r-project.org/library/kernlab/html/spam.html
Subsampling our data set
We need to generate a test and training set (prediction)
# If it isn't installed, install the kernlab package
library(kernlab)
data(spam)
# Perform the subsampling
set.seed(3435)
trainIndicator = rbinom(4601, size = 1, prob = 0.5)
table(trainIndicator)
## trainIndicator
## 0 1
## 2314 2287
trainSpam = spam[trainIndicator == 1, ]
testSpam = spam[trainIndicator == 0, ]
Exploratory data analysis
 Look at summaries of the data
 Check for missing data
 Create exploratory plots
 Perform exploratory analyses (e.g. clustering)
Names
names(trainSpam)
## [1] "make" "address" "all"
## [4] "num3d" "our" "over"
## [7] "remove" "internet" "order"
## [10] "mail" "receive" "will"
## [13] "people" "report" "addresses"
## [16] "free" "business" "email"
Head
head(trainSpam)
## make address all num3d our over remove internet order mail receive
## 1 0.00 0.64 0.64 0 0.32 0.00 0.00 0 0.00 0.00 0.00
## 7 0.00 0.00 0.00 0 1.92 0.00 0.00 0 0.00 0.64 0.96
## 9 0.15 0.00 0.46 0 0.61 0.00 0.30 0 0.92 0.76 0.76
## 12 0.00 0.00 0.25 0 0.38 0.25 0.25 0 0.00 0.00 0.12
## 14 0.00 0.00 0.00 0 0.90 0.00 0.90 0 0.00 0.90 0.90
## 16 0.00 0.42 0.42 0 1.27 0.00 0.42 0 0.00 1.27 0.00
## will people report addresses free business email you credit your font
## 1 0.64 0.00 0 0 0.32 0 1.29 1.93 0.00 0.96 0
## 7 1.28 0.00 0 0 0.96 0 0.32 3.85 0.00 0.64 0
## 9 0.92 0.00 0 0 0.00 0 0.15 1.23 3.53 2.00 0
## 12 0.12 0.12 0 0 0.00 0 0.00 1.16 0.00 0.77 0
## 14 0.00 0.90 0 0 0.00 0 0.00 2.72 0.00 0.90 0
## 16 0.00 0.00 0 0 1.27 0 0.00 1.70 0.42 1.27 0
## num000 money hp hpl george num650 lab labs telnet num857 data num415
## 1 0 0.00 0 0 0 0 0 0 0 0 0.00 0
## 7 0 0.00 0 0 0 0 0 0 0 0 0.00 0
## 9 0 0.15 0 0 0 0 0 0 0 0 0.15 0
## 12 0 0.00 0 0 0 0 0 0 0 0 0.00 0
## 14 0 0.00 0 0 0 0 0 0 0 0 0.00 0
## 16 0 0.42 0 0 0 0 0 0 0 0 0.00 0
## num85 technology num1999 parts pm direct cs meeting original project re
## 1 0 0 0.00 0 0 0.00 0 0 0.0 0 0
## 7 0 0 0.00 0 0 0.00 0 0 0.0 0 0
## 9 0 0 0.00 0 0 0.00 0 0 0.3 0 0
## 12 0 0 0.00 0 0 0.00 0 0 0.0 0 0
## 14 0 0 0.00 0 0 0.00 0 0 0.0 0 0
## 16 0 0 1.27 0 0 0.42 0 0 0.0 0 0
## edu table conference charSemicolon charRoundbracket charSquarebracket
## 1 0 0 0 0.000 0.000 0
## 7 0 0 0 0.000 0.054 0
## 9 0 0 0 0.000 0.271 0
## 12 0 0 0 0.022 0.044 0
## 14 0 0 0 0.000 0.000 0
## 16 0 0 0 0.000 0.063 0
## charExclamation charDollar charHash capitalAve capitalLong capitalTotal
## 1 0.778 0.000 0.000 3.756 61 278
## 7 0.164 0.054 0.000 1.671 4 112
## 9 0.181 0.203 0.022 9.744 445 1257
## 12 0.663 0.000 0.000 1.243 11 184
## 14 0.000 0.000 0.000 2.083 7 25
## 16 0.572 0.063 0.000 5.659 55 249
## type
## 1 spam
## 7 spam
## 9 spam
## 12 spam
## 14 spam
## 16 spam
Summaries
table(trainSpam$type)
##
## nonspam spam
## 1381 906
Plots
plot(trainSpam$capitalAve ~ trainSpam$type)
Plots
plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
Relationships between predictors
plot(log10(trainSpam[, 1:4] + 1))
Clustering
hCluster = hclust(dist(t(trainSpam[, 1:57])))
plot(hCluster)
New clustering
hClusterUpdated = hclust(dist(t(log10(trainSpam[, 1:55] + 1))))
plot(hClusterUpdated)
Statistical prediction/modeling
 Should be informed by the results of your exploratory analysis
 Exact methods depend on the question of interest
 Transformations/processing should be accounted for when necessary
 Measures of uncertainty should be reported
Statistical prediction/modeling
Logistic regression; prediction modelling, binomial etc…
trainSpam$numType = as.numeric(trainSpam$type) - 1
costFunction = function(x, y) sum(x != (y > 0.5))
cvError = rep(NA, 55)
library(boot)
for (i in 1:55) {
lmFormula = reformulate(names(trainSpam)[i], response = "numType")
glmFit = glm(lmFormula, family = "binomial", data = trainSpam)
cvError[i] = cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]
}
## Which predictor has minimum cross-validated error?
names(trainSpam)[which.min(cvError)]
## [1] "charDollar"
The explanation goes as follows. It took me about 2-3 hrs to wrap the whole thing around my head, but here goes my explanation.
Goal: Find a “simple model” (binomial regression) that has least
“error” (cross validated error), when we use the “simple model” to predict.
Dataset: We only use the trainSpam data set to make the predictive
model. Each cell holds a numerical value representing the occurrence of
a word or character (given by the column) for a given email (row). For
example, row 2 of the charDollar column has a value of 0.054, meaning
the 2nd mail has a charDollar score of 0.054 (the frequency of $
symbols in that mail).
The data is binomial in nature, i.e., 0 for nonspam and 1 for spam. This is made numerical with:
trainSpam$numType = as.numeric(trainSpam$type) - 1
Binomial regression: As the data is binomial, we fit a curve (a
binomial/logistic regression) that predicts the probability of a mail
being spam depending on the value. For example, look at the column
charDollar:
png(filename="glm.png")
lmFormula=numType~charDollar
plot(lmFormula,data=trainSpam, ylab="probability")
g=glm(lmFormula,family=binomial, data=trainSpam)
curve(predict(g,data.frame(charDollar=x),type="resp"),add=TRUE)
dev.off()
Here you see that for charDollar values > 0.5 there is almost a 100% probability that it is SPAM. This is how binomial regression is used.
The author looks at every column and makes a binomial regression fit.
This is done with the for loop, so the author now has 55 models.
Error estimation: The author wants to see which of these 55 models predicts "best". For this we use cross-validation.
cv.glm or cross-validation
The cross-validation works as follows: it divides the trainData further into TRAIN and TEST. The TRAIN data is used to compute the glm, and this glm is used to predict the outcome of the TEST data. This is done "in a particular way" many times and the results are averaged.
The CV uses a cost function to calculate the error.
The cost function (in this case) counts the number of failed
predictions. The TEST data is used for this. It takes two parameters,
X (observed TEST data) and Y (predicted data based on the glm),
and checks how many times the prediction failed:
costFunction = function(x,y) sum(x!=(y > 0.5))
y > 0.5 provides a cutoff to decide if a value is spam or not. So if
the predicted value is 0.6 then the prediction is SPAM (or 1). If
the predicted value is <= 0.5 then it is NOT SPAM (or 0).
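As a quick illustration (made-up numbers, not from the course): with observed labels c(0, 1, 1) and predicted probabilities c(0.2, 0.7, 0.4), only the third case falls on the wrong side of the cutoff, so the cost is 1.
costFunction <- function(x, y) sum(x != (y > 0.5))
costFunction(c(0, 1, 1), c(0.2, 0.7, 0.4))
## [1] 1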
With the for loop we cycle over every single column and in the end pick the column which has the least prediction error:
which.min(cvError)
P.S. It is very beneficial to look at how the glm binomial fitting
(including timestamp) is done, the explanation of the coefficients
that come from glm, and what it means to obtain a cross-validated
error. The course, however, made a steep jump in this aspect without
bothering to explain anything related to this whatsoever. Hope this is
helpful.
Question and my answer on stack as well
Get a measure of uncertainty
## Use the best model from the group
predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)
## Get predictions on the test set
predictionTest = predict(predictionModel, testSpam)
predictedSpam = rep("nonspam", dim(testSpam)[1])
## Classify as `spam' for those with prob > 0.5
predictedSpam[predictionModel$fitted > 0.5] = "spam"
Get a measure of uncertainty
## Classification table
table(predictedSpam, testSpam$type)
##
## predictedSpam nonspam spam
## nonspam 1346 458
## spam 61 449
## Error rate
(61 + 458)/(1346 + 458 + 61 + 449)
## [1] 0.2243
Interpret results
 Use the appropriate language
 describes
 correlates with/associated with
 leads to/causes
 predicts
 Give an explanation
 Interpret coefficients
 Interpret measures of uncertainty
Our example
 The fraction of characters that are dollar signs can be used to predict if an email is Spam
 Anything with more than 6.6% dollar signs is classified as Spam
 More dollar signs always mean more Spam under our prediction
 Our test set error rate was 22.4%
Challenge results
 Challenge all steps:
 Question
 Data source
 Processing
 Analysis
 Conclusions
 Challenge measures of uncertainty
 Challenge choices of terms to include in models
 Think of potential alternative analyses
Synthesize/writeup results
 Lead with the question
 Summarize the analyses into the story
 Don't include every analysis; include it only
 If it is needed for the story
 If it is needed to address a challenge
 Order analyses according to the story, rather than chronologically
 Include “pretty” figures that contribute to the story
In our example
 Lead with the question
 Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?
 Describe the approach
 Collected data from UCI -> created training/test sets
 Explored relationships
 Choose logistic model on training set by cross validation
 Applied to test, 78% test set accuracy
 Interpret results
 Number of dollar signs seems reasonable, e.g. “Make money with Viagra \$ \$ \$ \$!”
 Challenge results
 78% isn’t that great
 I could use more variables
 Why logistic regression?
Organizing Data Analysis (c5)
Data analysis files
 Data
 Raw data
 Processed data
 Figures
 Exploratory figures
 Final figures
 R code
 Raw / unused scripts
 Final scripts
 R Markdown files
 Text
 README files
 Text of analysis / report
Raw Data
<img class=center src=../../assets/img/medicalrecord.png height=’400’/>
 Should be stored in your analysis folder
 If accessed from the web, include url, description, and date accessed in README
Processed data
<img class=center src=../../assets/img/excel.png height=’400’/>
 Processed data should be named so it is easy to see which script generated the data.
 The processing script  processed data mapping should occur in the README
 Processed data should be tidy
Exploratory figures
<img class=center src=../../assets/img/example10.png height=’400’/>
 Figures made during the course of your analysis, not necessarily part of your final report.
 They do not need to be “pretty”
Final Figures
<img class=center src=../../assets/img/figure1final.png height=’400’/>
 Usually a small subset of the original figures
 Axes/colors set to make the figure clear
 Possibly multiple panels
Raw scripts
<img class=center src=../../assets/img/rawcode.png height=’350’/>
 May be less commented (but comments help you!)
 May be multiple versions
 May include analyses that are later discarded
Final scripts
<img class=center src=../../assets/img/finalscript2.png height=’300’/>
 Clearly commented
 Small comments liberally  what, when, why, how
 Bigger commented blocks for whole sections
 Include processing details
 Only analyses that appear in the final writeup
R markdown files
<img class=center src=../../assets/img/rmd.png height=’400’/>
 R markdown files can be used to generate reproducible reports
 Text and R code are integrated
 Very easy to create in Rstudio
Readme files
<img class=center src=../../assets/img/readme.png height=’400’/>
 Not necessary if you use R markdown
 Should contain step-by-step instructions for analysis
 Here is an example https://github.com/jtleek/swfdr/blob/master/README
Text of the document
<img class=center src=../../assets/img/swfdr.png height=’350’/>
 It should include a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)
 It should tell a story
 It should not include every analysis you performed
 References should be included for statistical methods
Further resources
 Information about a nonreproducible study that led to cancer patients being mistreated: The Duke Saga Starter Set
 Reproducible research and Biostatistics
 Managing a statistical analysis project guidelines and best practices
 Project template  a preorganized set of files for data analysis
Markdown R knitr (c5)
What is Markdown?

- Created by John Gruber and Aaron Swartz
- A simplified version of "markup" languages
- Allows one to focus on writing as opposed to formatting
- Simple/minimal intuitive formatting elements
- Easily converted to valid HTML (and other formats) using existing tools
- Complete information is available at http://daringfireball.net/projects/markdown/
- Some background information at http://daringfireball.net/2004/03/dive_into_markdown
What is R Markdown?

- R markdown is the integration of R code with markdown
- Allows one to create documents containing "live" R code
- R code is evaluated as part of the processing of the markdown
- Results from R code are inserted into the markdown document
- A core tool in literate statistical programming
What is R Markdown?

- R markdown can be converted to standard markdown using the knitr package in R
- Markdown can be converted to HTML using the markdown package in R
- Any basic text editor can be used to create a markdown document; no special editing tools needed
- The R markdown -> markdown -> HTML workflow can be easily managed using RStudio (but not required)
- These slides were written in R markdown and converted to slides using the slidify package
Literate programming pros and cons
- Text and code all in one place, logical order
- Data, results automatically updated to reflect external changes
- Code is live - automatic "regression test" when building a document
- Text and code all in one place can make documents difficult to read, especially if there is a lot of code
- Can substantially slow down processing of documents (although there are tools to help)
Knitr USAGE and syntax
- An R package written by Yihui Xie (while he was a grad student at Iowa State)
- Available on CRAN
- Supports RMarkdown, LaTeX, and HTML as documentation languages
- Can export to PDF, HTML
- Built right into RStudio for your convenience
Usage
library(knitr)
setwd(<working directory>)
knit2html("document.Rmd")
browseURL("document.html")
Knitr adds both code and the output, is useful for documentation.
\```{r firstchunk}
## R code goes here
\```
.Rmd –> .md –> .html
Hiding results!
\```{r firstchunk, echo=FALSE,results="hide"}
## R code goes here
time <- bla bla
rand <- rnorm(1)
\```
The current time is `r time`. My favourite random number is `r rand`.
Plotting and controlling figure height
\```{r ploting, fig.height=4}
plot(x,y) bla bla bla
\```
Making tables with xtable
\```{r tablespandian, results="asis"}
library(xtable)
xt <- xtable(summary(fit))
print(xt,type="html")
\```
Setting options to on a global level
\```{r settingblablaoptions, echo=FALSE}
knitr::opts_chunk$set(echo=FALSE, results="hide")
\```
https://yihui.name/knitr/options/
Options common
Output
results: "asis", "hide"
echo: TRUE, FALSE
message: FALSE
warning: FALSE
https://yihui.name/knitr/demo/output/ More info by author.
“Error” reported here: https://github.com/yihui/knitr/issues/220
Figures
fig.height: numeric
fig.width: numeric
Caching
What if one chunk takes a long time to run? All chunks have to be
recomputed every time you re-knit the file. The cache=TRUE option can
be set on a chunk-by-chunk basis to store the results of a computation.
After the first run, results are loaded from the cache.
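A minimal sketch of a cached chunk (the chunk name and the computation are made up for illustration):
\```{r slowchunk, cache=TRUE}
## hypothetical long computation; after the first knit the result is loaded from the cache
x <- mean(rnorm(1e7))
\```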
Useful help document from stack: https://stackoverflow.com/a/51409622/5986651
YAML Front Matter example
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, message=F, warning=F, cache=T)
```
More info in [Yihui's book](https://bookdown.org/yihui/rmarkdown/pdfdocument.html) and his blog.
### Knitr for pdf; plots not working!
https://stackoverflow.com/questions/55032228/packagepdftexdeferrorfilenotfound
I made a Stack question saying that it doesn't work.
However, I stumbled upon something by accident:
`rmarkdown::render("./code.rmd")` in the console seems to work with
plots. I just used this for now and finished the assignment. A lot of
issues here; the problem appears not to be within Emacs!
### Making an Rmd document work with ESS init.el
https://stackoverflow.com/a/23326318/5986651 shows how to add the
basics and setup rmd mode
polymode keybindings are given here https://polymode.github.io/usage/
Also the library `rmarkdown` seems to be needed along with pandoc!
So install pandoc with
sudo apt-get install pandoc
Got a lot of SQL errors but ignored them for now! And the system generates
an html nicely with `M-n e`.
Done! Peace!
Also for adding the code chunk you need to follow instructions given
here: https://emacs.stackexchange.com/a/27419/17941
(defun tws-insert-r-chunk (header)
"Insert an r-chunk in markdown mode. Necessary due to interactions between polymode and yas snippet"
(interactive "sHeader: ")
(insert (concat "```{r " header "}\n\n```"))
(forward-line 1))
M-x tws-insert-r-chunk
Emacs init file attempts and failures for keybinding!
;; (eval-after-load 'rmd-mode
;;   '(define-key rmd-mode-map (kbd "C-c r")
;;      'tws-insert-r-chunk))
(global-set-key (kbd "C-c r") 'tws-insert-r-chunk)
### Caching and the way to go
> It seems the default is set to FALSE and local chunk options
> override the global options but one thing **you could do is set the
> global options to cache by default by adding this to the top of your
> document**
`r opts_chunk$set(cache=TRUE)`
> Then for the sections you don't want cached ever you would
> explicitly set those sections to cache=FALSE.
>
> Then if you want to set the whole document to not cache anything you
> could change the global option to FALSE and rerun it.
>
> The problem is that if any of the chunk options are set to
> cache=TRUE then those will override the global setting and won't be
> rerun if you set the global option to FALSE. So I think the only
> way to achieve what you want is to change the default to cache=TRUE,
> explicitly set chunks that you don't want cached to have
> cache=FALSE, and then you can switch the global option to FALSE to
> do what you want when the time occurs.
Source: https://stackoverflow.com/a/10628731/5986651
### template Knitr doc and my workflow


---
title: "Effects of Storm Events on People and Economy"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, message=F, warning=F, cache=T)
```
Keep the cache at true in the main settings so that you can toggle when needed to do a full evaluation.
You use include=FALSE to have the chunk evaluated, but neither the code nor its output displayed.
And, echo=FALSE indicates that the code will not be shown in the final document (though any results/output would still be displayed).
C-c r
gives the r code section based on my init file!

```{r}
a <- 2
```

```{r libraries, include=F}
library(dplyr) # library that aids in grouping (mainly %>%)
```

```{r scatterplot, fig.width=8, fig.height=6}
plot(x,y)
```
Use `M-n e` for executing. PDF images don't work with this command but
html works well! Use `rmarkdown::render("./code.rmd")` in the console
and it works well. Recently that has been failing as well, so do the
html rendering and then in the end wrap it up with RStudio's console!
### Making a presentation
``` r
---
title: "Data Science Capstone - Word Prediction App"
date: "June 6, 2019"
output: slidy_presentation
---
```

This already brings up a presentation.
Uploading it to github
Uploading it to rpubs
Based on https://stackoverflow.com/a/32304336/5986651
result <- rpubsUpload(title = 'Your title',
                      htmlFile = 'your_html_file_and_path.html',
                      method = getOption('rpubs.upload.method', 'auto'))
browseURL(result$continueUrl)
That's all you need to do and the rest is self-explanatory. You get a website where you trim things down.
Update it to rpubs
Adding picture to presentation
based on https://stackoverflow.com/a/44665110/5986651
```{r, echo=FALSE}
knitr::include_graphics('./CaptureOutput.png')
```
Works: the picture makes it all the way into the final file.
remember id
"https://api.rpubs.com/api/v1/document/516159/bf2c070b171144b0bf949bca29e62ef2"
#### Resize picture
two methods as per
https://stackoverflow.com/questions/15625990/howtosetsizeforlocalimageusingknitrformarkdown
```{r}
knitr::include_graphics("path/to/image.png", dpi = 100)
```

```{r, out.width = "400px"}
knitr::include_graphics("path/to/image.png")
```
https://stackoverflow.com/questions/25415365/insertsidebysidepngimagesusingknitr/25454753
```{r, echo=FALSE,out.width="49%", out.height="20%",fig.cap="caption",fig.show='hold',fig.align='center'}
knitr::include_graphics(c("path/to/img1","path/to/img1"))
```
Levels of detail (c5w3)
tl;dr

- People are busy, especially managers and leaders
- Results of data analyses are sometimes presented in oral form, but often the first cut is presented via email
- It is often useful to break down the results of an analysis into different levels of granularity/detail
- Getting responses from busy people: http://goo.gl/sJDb9V
Hierarchy of Information: Research Paper

- Title / Author list
- Abstract
- Body / Results
- Supplementary Materials / the gory details
- Code / Data / really gory details
Hierarchy of Information: Email Presentation

- Subject line / Sender info
  - At a minimum, include one
  - Can you summarize findings in one sentence?
- Email body
  - A brief description of the problem / context; recall what was proposed and executed; summarize findings / results; 1-2 paragraphs
  - If action needs to be taken as a result of this presentation, suggest some options and make them as concrete as possible.
  - If questions need to be addressed, try to make them yes / no

Hierarchy of Information: Email Presentation

- Attachment(s)
  - R Markdown file
  - knitr report
    - Stay concise; don't spit out pages of code (because you used knitr we know it's available)
- Links to Supplementary Materials
  - Code / Software / Data
  - GitHub repository / Project web site
Caching computations
Not sure where to use it. The library is called cacher.
SI: probability (c6w1)
Statistical inference
Course Content: The course is taught via 13 lectures
- Introduction
- Probability
- Conditional Probability
- Expectations
- Variance
- Common Distributions
- Asymptotics
- T confidence intervals
- Hypothesis testing
- P-values
- Power
- Multiple Testing
- Resampling
https://www.youtube.com/playlist?list=PLplgQkQivXiBmGyzLrUjzsblmQsLtkzJ
 In these slides we will cover the basics of probability at low enough level to have a basic understanding for the rest of the series
 For a more complete treatment see the class Mathematical Biostatistics Boot Camp 1
 Youtube: www.youtube.com/playlist?list=PLplgQkQivXhk6qSyiNj51qamjAtZISJ
 Coursera: www.coursera.org/course/biostats
 Git: http://github.com/bcaffo/CaffoCoursera
Probability c6w1
Rules probability must follow
An event A occurs; a die is rolled and we get the outcome from the set {1,2,3,4,5,6}.
- P(nothing occurs) = 0; when a die is rolled the probability that we see none of 1,2,3,4,5,6 is 0.
- P(something occurs) = 1
- P(A) + P(Aᶜ) = 1, where Aᶜ is the event that A does not occur
- P(A∪B) = P(A) + P(B) if they are mutually exclusive
You roll a die. A={1,2}, B={3,4}. P(A)=2/6; P(B)=2/6; P(1 or 2 or 3 or 4) = P(A∪B) = 4/6.
- P(A∪B) = P(A) + P(B) - P(A∩B)
Example: A={1,2,3}; B={1,5,6}
- If A ⊂ B then P(A) ≤ P(B)
A={1,2}; B={1,2,3,4}
Example
The National Sleep Foundation (www.sleepfoundation.org) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problems of these sorts?
Assuming all of these are probabilities from the same population, i.e., P(sleep apnea in America) refers to the same population as P(RLS in North America).
A person can have SA and/or RLS; the two are not mutually exclusive. So
P(SA) = 0.03; P(RLS) = 0.1; P(at least one of SA or RLS) = P(SA) + P(RLS) - P(SA∩RLS). Answer: No, the events can occur simultaneously and so are not mutually exclusive.
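A quick sanity check in R: if, purely as an assumption for illustration, SA and RLS were independent, the overlap would be 0.03*0.10 and the union would be a bit less than 13%:
p.sa <- 0.03
p.rls <- 0.10
p.sa + p.rls - p.sa * p.rls   ## 0.127 under independence; 0.13 double counts the overlap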
Random variables
 A random variable is a numerical outcome of an experiment.
Roll a die and the outcome, say X (a random variable), is one of 1, 2, 3, 4, 5, 6.
 The random variables that we study will come in two varieties, discrete or continuous.
 Discrete random variable are random variables that take on only a countable number of possibilities and we talk about the probability that they take specific values
 Continuous random variables can conceptually take any value on the real line or some subset of the real line and we talk about the probability that they lie within some range
Examples of variables that can be thought of as random variables
Experiments that we use for intuition and building context
 The $(01)$ outcome of the flip of a coin
 The outcome from the roll of a die
Specific instances of treating variables as if random
 The web site traffic on a given day
 The BMI of a subject four years after a baseline measurement
 The hypertension status of a subject randomly drawn from a population
 The number of people who click on an ad
 Intelligence quotients for a sample of children
PMF and PDF
PMF is for discrete random variables and PDF (probability density function) is for continuous random variables.
Probability Mass Function example:
\[PMF(x) = (1/2)^x (1/2)^{1-x} \mbox{ for } x \in \{0,1\}; \quad PMF_2(x) = \theta^x (1-\theta)^{1-x}\]
PMF(x) is the probability for a fair coin flip: it gives 0.5 whether the random variable x takes the value 0 or 1. PMF_2 models a biased coin with success probability θ.
∑PMF(xᵢ) = 1
PDF is for continuous variables. Example: Suppose that the proportion of help calls that get addressed in a random day by a help line is given by
\[f(x) = \begin{cases} 2x & \mbox{for } 0 < x < 1 \\ 0 & \mbox{otherwise} \end{cases}\]
Is this a mathematically valid density?
The density f(x) = 2x for 0 < x < 1 looks like this:
x <- c(-0.5, 0, 1, 1, 1.5)
y <- c(0, 0, 2, 0, 0)
plot(x, y, lwd = 3, frame = FALSE, type = "l")
Example continued
What is the probability that 75% or fewer of calls get addressed?
1.5 * 0.75/2
## [1] 0.5625
pbeta(0.75, 2, 1)
## [1] 0.5625
CDF and survival function
Certain areas are so useful, we give them names
 The cumulative distribution function (CDF) of a random variable, $X$, returns the probability that the random variable is less than or equal to the value $x$ \(F(x) = P(X \leq x)\) (This definition applies regardless of whether $X$ is discrete or continuous.)
 The survival function of a random variable $X$ is defined as the probability that the random variable is greater than the value $x$ \(S(x) = P(X > x)\)
 Notice that $S(x) = 1 - F(x)$
Example
What are the survival function and CDF from the density considered before?
For $1 \geq x \geq 0$ \(F(x) = P(X \leq x) = \frac{1}{2} Base \times Height = \frac{1}{2} (x) \times (2 x) = x^2\)
\[S(x) = 1 - x^2\]
pbeta(c(0.4, 0.5, 0.6), 2, 1)
## [1] 0.16 0.25 0.36
Percentiles
You’ve heard of sample quantiles. If you were the 95th percentile on an exam, you know that 95% of people scored worse than you and 5% scored better. These are sample quantities. Here we define their population analogs.
Arrange grades in increasing order. If there were 100 people you are 95th.
Definition
The $\alpha^{th}$ quantile of a distribution with distribution function $F$ is the point $x_\alpha$ so that \(F(x_\alpha) = \alpha\)
 A percentile is simply a quantile with $\alpha$ expressed as a percent
 The median is the $50^{th}$ percentile
For example
The $95^{th}$ percentile of a distribution is the point so that:
 the probability that a random variable drawn from the population is less is 95%
 the probability that a random variable drawn from the population is more is 5%
Example
Consider a PDF of scores for 10000 people. The 0.5 quantile is the value some_score such that F(score <= some_score) = 0.5, where F(score <= some_score) is the area under the curve, i.e. the CDF (distribution function). What is the median of the distribution that we were working with before?
 We want to solve $0.5 = F(x) = x^2$
 Resulting in the solution
sqrt(0.5)
## [1] 0.7071
 Therefore, the median proportion of calls answered on a random day is about 0.7071.
Example continued
R can approximate quantiles for you for common distributions
qbeta(0.5, 2, 1)
## [1] 0.7071
Summary
 You might be wondering at this point “I’ve heard of a median before, it didn’t require integration. Where’s the data?”
 We are referring to population quantities. Therefore, the median being discussed is the population median.
 A probability model connects the data to the population using assumptions.
 Therefore the median we’re discussing is the estimand, the sample median will be the estimator
SI: Conditional probability c6w1
Conditional probability, motivation
 The probability of getting a one when rolling a (standard) die is usually assumed to be one sixth
 Suppose you were given the extra information that the die roll was an odd number (hence 1, 3 or 5)
 conditional on this new information, the probability of a one is now one third
Conditional probability, definition
The notation P(A | B) means the probability of A given that B has occurred. Roll a die.
- Consider our die roll example
- $B = \{1, 3, 5\}$
- $A = \{1\}$
P(A | B) = P(A∩B)/P(B). Draw a Venn diagram and you will understand.
P(A | B) = (1/6) / (3/6) = 1/3.
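A small simulation (my own check, not from the lecture) agrees with the 1/3 answer:
set.seed(1)
rolls <- sample(1:6, 100000, replace = TRUE)
mean(rolls[rolls %% 2 == 1] == 1)   ## estimate of P(roll = 1 | roll is odd), close to 1/3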
Bayes' rule
Bayes' rule allows us to reverse the conditioning set provided that we know some marginal probabilities:
\[P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A \mid B) P(B) + P(A \mid B^c)P(B^c)}\]
Diagnostic tests
- Let $+$ and $-$ be the events that the result of a diagnostic test is positive or negative respectively
- Let $D$ and $D^c$ be the events that the subject of the test has or does not have the disease respectively
- The sensitivity is the probability that the test is positive given that the subject actually has the disease, $P(+ \mid D)$
- The specificity is the probability that the test is negative given that the subject does not have the disease, $P(- \mid D^c)$
- The positive predictive value is the probability that the subject has the disease given that the test is positive, $P(D \mid +)$
- The negative predictive value is the probability that the subject does not have the disease given that the test is negative, $P(D^c \mid -)$
- The prevalence of the disease is the marginal probability of disease, $P(D)$
- The diagnostic likelihood ratio of a positive test, labeled $DLR_+$, is $P(+ \mid D) / P(+ \mid D^c)$, which is \(sensitivity / (1 - specificity)\)
- The diagnostic likelihood ratio of a negative test, labeled $DLR_-$, is $P(- \mid D) / P(- \mid D^c)$, which is \((1 - sensitivity) / specificity\)
Example
- A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%
- Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the positive predictive value?
- Mathematically, we want $P(D \mid +)$ given the sensitivity, $P(+ \mid D) = .997$, the specificity, $P(- \mid D^c) = .985$, and the prevalence $P(D) = .001$
Using Bayes’ formula
- In this population a positive test result only suggests a 6% probability that the subject has the disease
- (The positive predictive value is 6% for this test.) This is super low owing to the low prevalence!
 The low positive predictive value is due to low prevalence of disease and the somewhat modest specificity
 Suppose it was known that the subject was an intravenous drug user and routinely had intercourse with an HIV infected partner
 Notice that the evidence implied by a positive test result does not change because of the prevalence of disease in the subject’s population, only our interpretation of that evidence changes
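Plugging the numbers into Bayes' formula in R (my own calculation of the slide's result):
sens <- 0.997; spec <- 0.985; prev <- 0.001
ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
ppv   ## roughly 0.062, i.e. about 6%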
Likelihood ratios
- Therefore \[\frac{P(D \mid +)}{P(D^c \mid +)} = \frac{P(+ \mid D)}{P(+ \mid D^c)}\times \frac{P(D)}{P(D^c)}\]
i.e.,
\[\mbox{post-test odds of }D = DLR_+\times\mbox{pre-test odds of }D\]
The odds of X are P(X) / P(Xᶜ) = P(X) / (1 - P(X)).
- Post-test odds = likelihood ratio * pre-test odds.
- Similarly, $DLR_-$ relates the decrease in the odds of the disease after a negative test result to the odds of disease prior to the test.
HIV example revisited
 Suppose a subject has a positive HIV test
 $DLR_+ = .997 / (1 - .985) \approx 66$
 The result of the positive test is that the odds of disease is now 66 times the pretest odds
 Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease
HIV example revisited
 Suppose that a subject has a negative test result
- $DLR_- = (1 - .997) / .985 \approx .003$
- Therefore, the post-test odds of disease is now $.3\%$ of the pre-test odds given the negative test.
 Or, the hypothesis of disease is supported $.003$ times that of the hypothesis of absence of disease given the negative test result
Independence

- Two events $A$ and $B$ are independent if \(P(A \cap B) = P(A)P(B)\)
Think of two successive coin flips.
- Equivalently, if $P(A \mid B) = P(A)$
- Two random variables, $X$ and $Y$, are independent if for any two sets $A$ and $B$, $P([X \in A] \cap [Y \in B]) = P(X \in A)P(Y \in B)$
- Two coin flips both coming up heads has probability 0.5*0.5
 If $A$ is independent of $B$ then
 $A^c$ is independent of $B$
 $A$ is independent of $B^c$
 $A^c$ is independent of $B^c$
Example (star mark example)
 Volume 309 of Science reports on a physician who was on trial for expert testimony in a criminal trial
- Based on an estimated prevalence of sudden infant death syndrome of 1/8,543, the physician testified that the probability of a mother having two children with SIDS was $\left(\frac{1}{8,543}\right)^2$
- The mother was convicted of murder.
WOW! Apparently A and B are not independent. The chance of both events happening, given that it is the same mother, is different: the events are strongly dependent!
- That is, $P(A_1 \cap A_2)$ is not necessarily equal to $P(A_1)P(A_2)$

Biological processes that have a believed genetic or familial environmental component, of course, tend to be dependent within families

Relevant to this discussion, the principal mistake was to assume that the events of having SIDs within a family are independent
 (There are many other statistical points of discussion for this case.)
IID random variables (independent identically distributed)
 Random variables are said to be iid if they are independent and identically distributed
 Independent: statistically unrelated from one another
 Identically distributed: all having been drawn from the same population distribution
Unlike the SIDS example and like a coin toss, every coin toss is independent of the others and the probabilities come from the same distribution (0.5).
 iid random variables are the default model for random samples
 Many of the important theories of statistics are founded on assuming that variables are iid
 Assuming a random sample and iid will be the default starting point of inference for this class
SI: Expected values c6w1
Expected values or Mean
 Expected values are useful for characterizing a distribution
 The mean is a characterization of its center
 The variance and standard deviation are characterizations of how spread out it is
 Our sample expected values (the sample mean and variance) will estimate the population versions
The population mean
 The expected value or mean of a random variable is the center of its distribution

- For a discrete random variable $X$ with PMF $p(x)$, it is defined as follows: $E[X] = \sum_x x \, p(x)$. It is the center of mass of the system.
- $E[X]$ represents the center of mass of a collection of locations and weights, $x$ and $p(x)$.
The sample mean
 The sample mean estimates this population mean
 The center of mass of the data is the empirical mean
Why? Treat each observed data point as having probability 1/n; the center of mass of that empirical distribution is exactly the sample mean.
Example
Using manipulate
library(manipulate)
library(ggplot2)               # needed for ggplot
library(UsingR); data(galton)  # Galton parent/child height data
myHist <- function(mu){
g <- ggplot(galton, aes(x = child))
g <- g + geom_histogram(fill = "salmon",
binwidth=1, aes(y = ..density..), colour = "black")
g <- g + geom_density(size = 2)
g <- g + geom_vline(xintercept = mu, size = 2)
mse <- round(mean((galton$child - mu)^2), 3)
g <- g + labs(title = paste('mu = ', mu, ' MSE = ', mse))
g
}
manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
Getting an error that it needs to be run from Rstudio.
Example of a population mean

Suppose a coin is flipped and $X$ is declared $0$ or $1$ corresponding to a head or a tail, respectively; the expected value (mean) is 0.5.

Note, if thought about geometrically, this answer is obvious; if two equal weights are spaced at 0 and 1, the center of mass will be $.5$
Example
 Suppose that a die is rolled and $X$ is the number face up
 What is the expected value of $X$? \(E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5\)
 Again, the geometric argument makes this answer obvious without calculation.
What about a biased coin?
 Suppose that a random variable, $X$, is so that $P(X=1) = p$ and $P(X=0) = (1  p)$
 (This is a biased coin when $p\neq 0.5$)
 What is its expected value? \(E[X] = 0 * (1  p) + 1 * p = p\)
Continuous random variables
 For a continuous random variable, $X$, with density, $f$, the expected value is again exactly the center of mass of the density
Example
 Consider a density where $f(x) = 1$ for $x$ between zero and one
 (Is this a valid density?)
 Suppose that $X$ follows this density; what is its expected value?
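Both questions can be checked numerically in R (my own check): the density integrates to 1, so it is valid, and its center of mass is 1/2.
integrate(function(x) rep(1, length(x)), 0, 1)$value   ## total area = 1, so a valid density
integrate(function(x) x * 1, 0, 1)$value               ## expected value = 0.5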
Facts about expected values
 Recall that expected values are properties of distributions
 Note the average of random variables is itself a random variable and its associated distribution has an expected value

The center of this distribution is the same as that of the original distribution

Take several values from a std. normal distribution. Plot it and you can draw the original PDF.

Now take 10 observations and average it (likely to be closer to the center than one observation); Do this several times and you have another PDF which has the same Expected value aka mean as that of the PDF we began with.

- The average of 10 die rolls, taken N times and plotted, gives a shape like a normal distribution with the mean in the centre, just as you would have for N*10 individual die rolls
- The average of 20 die rolls will be more concentrated towards the centre, and so on with 30 die rolls. The variance is lower, but the mean is at the same place as that of individual die rolls.
Summarizing what we know
 Expected values are properties of distributions
 The population mean is the center of mass of population
 The sample mean is the center of mass of the observed data
 The sample mean is an estimate of the population mean
 The sample mean is unbiased if the population mean of its distribution is the mean that it’s trying to estimate
 The more data that goes into the sample mean, the more concentrated its density / mass function is around the population mean
SI c6w2 Variance
The variance
 The variance of a random variable is a measure of spread
- If $X$ is a random variable with mean $\mu$, the variance of $X$ is defined as \(Var(X) = E[(X - \mu)^2] = E[X^2] - E[X]^2\)
The derivation for the above is here.
 The expected (squared) distance from the mean
 Densities with a higher variance are more spread out than densities with a lower variance
 The square root of the variance is called the standard deviation
 The standard deviation has the same units as $X$
Example

What’s the variance from the result of a toss of a die?
 $E[X] = 3.5$
 $E[X^2] = 1 ^ 2 \times \frac{1}{6} + 2 ^ 2 \times \frac{1}{6} + 3 ^ 2 \times \frac{1}{6} + 4 ^ 2 \times \frac{1}{6} + 5 ^ 2 \times \frac{1}{6} + 6 ^ 2 \times \frac{1}{6} = 15.17$

$Var(X) = E[X^2] - E[X]^2 \approx 2.92$
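A one-line check in R (my own verification):
x <- 1:6
mean(x^2) - mean(x)^2   ## 2.9167, the variance of a single die roll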
Example: coin toss

What’s the variance from the result of the toss of a coin with probability of heads (1) of $p$?
 $E[X] = 0 \times (1  p) + 1 \times p = p$
 $E[X^2] = E[X] = p$, so $Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)$
The sample variance
 The sample variance is \(S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}\) (almost, but not quite, the average squared deviation from the sample mean)
 It is also a random variable
- It has an associated population distribution
- Its expected value is the population variance
- Its distribution gets more concentrated around the population variance with more data
 Its square root is the sample standard deviation
Variances of n die rolls
The variance of the sample is $S^2$.
The variance of the sample mean (the average of n rolls) is $S^2/n$.
Recall the mean
 Recall that the average of random sample from a population is itself a random variable
 We know that this distribution is centered around the population mean, $E[\bar X] = \mu$
 We also know what its variance is $Var(\bar X) = \sigma^2 / n$
 This is very useful, since we don’t have repeat sample means to get its variance; now we know how it relates to the population variance
 We call the standard deviation of a statistic a standard error
To summarize
 The sample variance, $S^2$, estimates the population variance, $\sigma^2$
 The distribution of the sample variance is centered around $\sigma^2$
 The variance of the sample mean is $\sigma^2 / n$
 Its logical estimate is $s^2 / n$
 The logical estimate of the standard error is $S / \sqrt{n}$
 $S$, the standard deviation, talks about how variable the population is
 $S/\sqrt{n}$, the standard error, talks about how variable averages of random samples of size $n$ from the population are
Simulation example: standard normal
Standard normals have variance 1; means of $n$ standard normals have standard deviation $1/\sqrt{n}$
A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal distribution. wiki
nosim <- 1000
n <- 10
sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))
## [1] 0.3156
1 / sqrt(n)
## [1] 0.3162
Simulation example: uniform distribution
https://en.wikipedia.org/wiki/Uniform_distribution_%28continuous%29
\[f(x) = \frac{1}{b-a} \mbox{ for } a \le x \le b; \quad f(x) = 0 \mbox{ otherwise}\]
See https://math.stackexchange.com/a/728072/332456 for an explanation of why the variance is $(b-a)^2/12$.
Standard uniforms have variance $1/12$; means of random samples of $n$ uniforms have sd = $1/\sqrt{12 \times n}$
nosim <- 1000
n <- 10
sd(apply(matrix(runif(nosim * n), nosim), 1, mean))
## [1] 0.09017
1 / sqrt(12 * n)
## [1] 0.09129
Binomial distribution
 Based on khan lecture.
Let's say we have 5 coin tosses and we are interested in the number of heads that come about!
X is the random variable whose outcome is the number of heads.
P(X=0) = 5C0 x (1/2)^0 x (1/2)^5
P(X=2) = 5C2 x (1/2)^2 x (1/2)^3
And so on. It will look like a normal distribution but it is discrete.
 Based on “Binomial probability example”
P(making basket) = 70%
P(failing) = 30%
P(X=2 in 6 attempts) = 6C2 x (0.7)^2 x (0.3)^4 = 15 x 0.49 x 0.0081 = 6%
 Mean of binomial probability.
E(X)= n x p(success)
For example, if we shoot with 60% success (per shot), then for 10 shots we would have made on an average = 10 x 60% = 6 baskets.
 Generalization of p of binomial distributions!
P(exactly k scores in n attempts) = nCk x f^k x (1-f)^(n-k)
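For instance, the basketball example above can be checked with dbinom (a sketch):
choose(6, 2) * 0.7^2 * 0.3^4      # 0.0595, about 6%
dbinom(2, size = 6, prob = 0.7)   # same value from the binomial mass function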
Simulation example: Poissons
Based on this khan video, we understand that the Poisson is nothing but the binomial in the limit as n goes to infinity (with np held fixed).
Also according to stack,
The difference between the two is that while both measure the number of certain random events (or “successes”) within a certain frame, the Binomial is based on discrete events, while the Poisson is based on continuous events. That is, with a binomial distribution you have a certain number, $n$, of “attempts,” each of which has probability of success $p$. With a Poisson distribution, you essentially have infinite attempts, with infinitesimal chance of success. That is, given a Binomial distribution with some $n,p$, if you let $n\rightarrow\infty$ and $p\rightarrow0$ in such a way that $np\rightarrow\lambda$, then that distribution approaches a Poisson distribution with parameter $\lambda$.
Mean=Variance= $\lambda$
Probability is given by
\[\Pr[X = k] = e^{-\lambda} \frac{\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots.\]
If Var = Mean = 4 and the number we average over is n = 10, then the SD of the mean is sqrt(variance/n).
nosim <- 1000
n <- 10
sd(apply(matrix(rpois(nosim * n, 4), nosim), 1, mean))
## [1] 0.6219
sqrt(4/n)
## [1] 0.6325
Simulation example: binomial distribution
Fair coin flips have variance $0.25$; means of random samples of $n$ coin flips have SD = sqrt(variance/n); same as poissons!
nosim <- 1000
n <- 10
sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE),
nosim), 1, mean))
## [1] 0.1587
1 / (2 * sqrt(n))
## [1] 0.1581
Data example
library(UsingR); data(father.son);
x <- father.son$sheight
n <- length(x)
Plot of the son’s heights
Let’s interpret these numbers
round(c(var(x), var(x) / n, sd(x), sd(x) / sqrt(n)),2)
## [1] 7.92 0.01 2.81 0.09
Summarizing what we know about variances
 The sample variance estimates the population variance
 The distribution of the sample variance is centered at what it's estimating
 It gets more concentrated around the population variance with larger sample sizes
 The variance of the sample mean is the population variance
divided by $n$
 The square root is the standard error
 It turns out that we can say a lot about the distribution of averages from random samples, even though we only get one to look at in a given data set
Common distributions c6w2
The Bernoulli distribution
 The Bernoulli distribution arises as the result of a binary outcome
 Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) $p$ and $1-p$ respectively
 The PMF for a Bernoulli random variable $X$ is \(P(X = x) = p^x (1 - p)^{1 - x}\)
 The mean of a Bernoulli random variable is $p$ and the variance is $p(1 - p)$
 If we let $X$ be a Bernoulli random variable, it is typical to call $X=1$ as a “success” and $X=0$ as a “failure”
Binomial trials
 The binomial random variables are obtained as the sum of iid Bernoulli trials
 In specific, let $X_1,\ldots,X_n$ be iid Bernoulli$(p)$; then $X = \sum_{i=1}^n X_i$ is a binomial random variable
 The binomial mass function is \(P(X = x) = \left( \begin{array}{c} n \\ x \end{array} \right) p^x(1 - p)^{n-x}\) for $x=0,\ldots,n$
Choose
 Recall that the notation \(\left( \begin{array}{c} n \\ x \end{array} \right) = \frac{n!}{x!(n-x)!}\) (read "$n$ choose $x$") counts the number of ways of selecting $x$ items out of $n$ without replacement disregarding the order of the items
Example
 Suppose a friend has $8$ children (oh my!), $7$ of which are girls and none are twins
 If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? \(\left( \begin{array}{c} 8 \\ 7 \end{array} \right) .5^{7}(1-.5)^{1} + \left( \begin{array}{c} 8 \\ 8 \end{array} \right) .5^{8}(1-.5)^{0} \approx 0.04\)
choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8
## [1] 0.03516
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
## [1] 0.03516
The normal distribution
 A random variable is said to follow a normal or Gaussian distribution with mean $\mu$ and variance $\sigma^2$ if the associated density is \((2\pi \sigma^2)^{-1/2}e^{-(x - \mu)^2/(2\sigma^2)}\). If $X$ is a RV with this density then $E[X] = \mu$ and $Var(X) = \sigma^2$
 We write $X\sim \mbox{N}(\mu, \sigma^2)$
 When $\mu = 0$ and $\sigma = 1$ the resulting distribution is called the standard normal distribution
 Standard normal RVs are often labeled $Z$
The standard normal distribution with reference lines
Facts about the normal density
If $X \sim \mbox{N}(\mu,\sigma^2)$ then \(Z = \frac{X - \mu}{\sigma} \sim N(0, 1)\)
If $Z$ is standard normal \(X = \mu + \sigma Z \sim \mbox{N}(\mu, \sigma^2)\)
More facts about the normal density by ThRa
Take a normal distribution with mean 0 and standard deviation 1.
0 is the middle; +1 is one standard deviation to the right, +2 is two standard deviations to the right, and so on.
The point 0 sd corresponds to the 50th percentile, +2 sd to about the 97.7th percentile, and +3 sd to about the 99.8th percentile.
pnorm(q) # gives distribution function value for given quantile
qnorm(p) # gives the other way around
CDF: F(X_alpha)= area under the curve until X_alpha
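A quick sketch showing that pnorm and qnorm are inverses of each other:
pnorm(1.96)         # ~0.975: area to the left of the quantile 1.96
qnorm(0.975)        # ~1.96: quantile with 0.975 area to its left
pnorm(qnorm(0.3))   # 0.3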

Approximately $68\%$, $95\%$ and $99\%$ of the normal density lies within $1$, $2$ and $3$ standard deviations from the mean, respectively
> pnorm(1) - pnorm(-1)
[1] 0.6826895
> pnorm(2) - pnorm(-2)
[1] 0.9544997
> pnorm(3) - pnorm(-3)
[1] 0.9973002

 $-1.28$, $-1.645$, $-1.96$ and $-2.33$ are the $10^{th}$, $5^{th}$, $2.5^{th}$ and $1^{st}$ percentiles of the standard normal distribution respectively
> qnorm(0.1)
[1] -1.281552

By symmetry, $1.28$, $1.645$, $1.96$ and $2.33$ are the $90^{th}$, $95^{th}$, $97.5^{th}$ and $99^{th}$ percentiles of the standard normal distribution respectively
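These quantiles can be read off directly in R (a sketch):
qnorm(c(0.90, 0.95, 0.975, 0.99))   # 1.28 1.645 1.96 2.33
qnorm(c(0.10, 0.05, 0.025, 0.01))   # the same values with a minus sign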
Question
 What is the $95^{th}$ percentile of a $N(\mu, \sigma^2)$ distribution?
 Quick answer in R
qnorm(.95, mean = mu, sd = sd)
 Or, because you have the standard normal quantiles memorized and you know that 1.645 is the 95th percentile you know that the answer has to be \(\mu + \sigma 1.645\)
 (In general $\mu + \sigma z_0$ where $z_0$ is the appropriate standard normal quantile)
Question
 What is the probability that a $\mbox{N}(\mu,\sigma^2)$ RV is larger than $x$?
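A sketch of the computation, with made-up values for mu, sigma and x:
mu <- 10; sigma <- 2; x <- 13
pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)   # P(X > x)
pnorm((x - mu) / sigma, lower.tail = FALSE)           # same thing after standardizing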
Example
Assume that the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50. What’s the probability of getting more than 1,160 clicks in a day?
It’s not very likely, 1,160 is 2.8 standard deviations from the mean
pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE)
## [1] 0.002555
pnorm(2.8, lower.tail = FALSE)
## [1] 0.002555
Example
Assume that the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50. What number of daily ad clicks would represent the one where 75% of days have fewer clicks (assuming days are independent and identically distributed)?
qnorm(0.75, mean = 1020, sd = 50)
## [1] 1054
The Poisson distribution
 Used to model counts

The Poisson mass function is \(P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}\) for $x=0,1,\ldots$
 The mean of this distribution is $\lambda$
 The variance of this distribution is $\lambda$
 Notice that $x$ ranges from $0$ to $\infty$
Some uses for the Poisson distribution
 Modeling count data
 Modeling event-time or survival data
 Modeling contingency tables
 Approximating binomials when $n$ is large and $p$ is small
Rates and Poisson random variables
 Poisson random variables are used to model rates
 $X \sim Poisson(\lambda t)$ where
 $\lambda = E[X / t]$ is the expected count per unit of time
 $t$ is the total monitoring time
Example
The number of people that show up at a bus stop is Poisson with a mean of $2.5$ per hour.
If watching the bus stop for 4 hours, what is the probability that $3$ or fewer people show up for the whole time?
ppois(3, lambda = 2.5 * 4)
## [1] 0.01034
Poisson approximation to the binomial
 When $n$ is large and $p$ is small the Poisson distribution is an accurate approximation to the binomial distribution
 Notation
 $X \sim \mbox{Binomial}(n, p)$
 $\lambda = n p$
 $n$ gets large
 $p$ gets small
Example, Poisson approximation to the binomial
We flip a coin with success probability $0.01$ five hundred times.
What’s the probability of 2 or fewer successes?
pbinom(2, size = 500, prob = 0.01)
## [1] 0.1234
ppois(2, lambda = 500 * 0.01)
## [1] 0.1247
Asymptotics
Asymptotics
 Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number)
 (Asymptopia is my name for the land of asymptotics, where everything works out well and there’s no messes. The land of infinite data is nice that way.)
 Asymptotics are incredibly useful for simple statistical inference and approximations
 (Not covered in this class) Asymptotics often lead to nice understanding of procedures
 Asymptotics generally give no assurances about finite sample performance
 Asymptotics form the basis for frequency interpretation of probabilities (the long run proportion of times an event occurs)
Law of large numbers in action
n <- 10000
means <- cumsum(rnorm(n))/(1:n)
library(ggplot2)
g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g
Law of large numbers in action, coin flip
means <- cumsum(sample(0:1, n, replace = TRUE))/(1:n)
g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0.5) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g
If we make infinite coin tosses then we get the right answer, law of large numbers!
The Central Limit Theorem
 The Central Limit Theorem (CLT) is one of the most important theorems in statistics
 For our purposes, the CLT states that the distribution of averages of iid variables (properly normalized) becomes that of a standard normal as the sample size increases
 The CLT applies in an endless variety of settings
 The result is that \(\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} = \frac{\sqrt{n} (\bar X_n - \mu)}{\sigma} = \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}\) has a distribution like that of a standard normal for large $n$.
 (Replacing the standard error by its estimated value doesn’t change the CLT)
 The useful way to think about the CLT is that $\bar X_n$ is approximately $N(\mu, \sigma^2 / n)$
Basically: when you take iid variables like coin flips and start forming the distribution of their averages, that distribution tends to a normal.
Example
 Simulate a standard normal random variable by rolling $n$ (six sided) dice
 Let $X_i$ be the outcome for die $i$
 Then note that $\mu = E[X_i] = 3.5$
 $Var(X_i) = 2.92$
 SE $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$
 Lets roll $n$ dice, take their mean, subtract off 3.5, and divide by $1.71 / \sqrt{n}$ and repeat this over and over
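A sketch of that experiment (nosim and n are arbitrary choices):
nosim <- 1000
n <- 30
means <- apply(matrix(sample(1:6, nosim * n, replace = TRUE), nosim), 1, mean)
z <- (means - 3.5) / (1.71 / sqrt(n))
hist(z, breaks = 30, probability = TRUE)   # should look roughly standard normal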
Result of our die rolling experiment
You see that for 10 rolls of a die it is still discrete, but for 30 rolls it almost makes the bell curve given by $\mu$ and the Variance!
Coin CLT

Let $X_i$ be the $0$ or $1$ result of the $i^{th}$ flip of a possibly unfair coin
 The sample proportion, say $\hat p$, is the average of the coin flips
 $E[X_i] = p$ and $Var(X_i) = p(1-p)$
 Standard error of the mean is $\sqrt{p(1-p)/n}$

Then \(\frac{\hat p - p}{\sqrt{p(1-p)/n}}\) will be approximately normally distributed
 Let’s flip a coin $n$ times, take the sample proportion of heads, subtract off .5 and multiply the result by $2 \sqrt{n}$ (divide by $1/(2 \sqrt{n})$)
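A sketch of that coin experiment for a fair coin (p = 0.5):
nosim <- 1000
n <- 100
phat <- apply(matrix(sample(0:1, nosim * n, replace = TRUE), nosim), 1, mean)
z <- (phat - 0.5) * 2 * sqrt(n)   # (phat - p) / sqrt(p(1-p)/n) with p = 0.5
hist(z, breaks = 30, probability = TRUE)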
Simulation results
Simulation results, $p = 0.9$
Galton’s quincunx
http://en.wikipedia.org/wiki/Bean_machine#mediaviewer/File:Quincunx_(Galton_Box)__Galton_1889_diagram.png
Confidence intervals thej
This is a hard topic almost impossible to follow from the coursera course.
Population
Let's say there are 100k people and you want to find out what proportion voted for
candidate A. Call the true population proportion p.
Sampling
But unfortunately we cannot measure the true proportion directly, so we take a
sample of n=100 and determine the sample proportion; say
54 people vote for A (success), so $\hat{p}=0.54$.
Normal distribution
Imagine a bell curve over different samples of 100. So from one sample of 100 we get $\hat{p}_1$, another 100 gives $\hat{p}_2$, etc. We make a plot out of it.
Obviously E[$\hat{p}$] is the same as the population proportion p, and a bell curve is formed eventually where the standard deviation, as we have seen earlier, is $\sigma_{\hat p} = \sqrt{p(1-p)/n}$. Note it is p that determines the SD of the bell!
2 things we are interested in
 What is the probability that any $\hat{p}$ lies within 2 sigma of the true proportion p
or conversely,
 What is the probability that the true proportion p lies within 2 sigma of $\hat{p}$
Note: Sigma can only be calculated from the true mean.
The problem and solution
But there is no way of knowing the true mean in advance and possibly ever, (for example think of determining who people are going to vote for in the country.)
So we do what is the next “best” possible stuff!
We try to take the SE (standard error), which is obtained by using $\hat{p}$ instead of p in the SD calculation. I.e., we draw a sample of n=100, find $\hat{p}=0.54$, and then use this to determine the SE rather than the SD.
What does this mean? and why?
There is a 95% probability that p is within 2 SE of $\hat{p}$.
In the above example SE is about 0.05, so there is a 95% probability that the true proportion p lies between 0.44 and 0.64.
It turns out that SE is a decent estimation of SD.
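A sketch of the voting example above (n = 100, phat = 0.54):
phat <- 0.54; n <- 100
se <- sqrt(phat * (1 - phat) / n)   # ~0.05
phat + c(-1, 1) * 2 * se            # roughly (0.44, 0.64)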
There are 3 types of interval findings in this notes

Finding SE with $\hat{p}$ instead of p (the true proportion), and computing the confidence interval

SE for biased-coin type problems, with 2 x SE = 1/sqrt(n)
 Using the Agresti interval by taking X+2 and n+4 (i.e., adding 2 more successes and 2 more failures!)
Give a confidence interval for the average height of sons
in Galton’s data
Here we take the sample of sons' heights and find the 95% confidence interval.
library(UsingR)
data(father.son)
x <- father.son$sheight
(mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x)))/12
## [1] 5.710 5.738
Confidence interval Walds interval
We imagine a (possibly biased) coin; the standard error of $\hat p$ is $\sqrt{\hat p(1-\hat p)/n}$.
 The interval takes the form \(\hat p \pm z_{1-\alpha/2}\sqrt{\hat p(1-\hat p)/n}\)
The maximum value p(1-p) can take is 1/4, at p=0.5, so $\sqrt{p(1-p)} \le 1/2$.
For 95% intervals this gives the quick bound \(\hat p \pm \frac{1}{\sqrt{n}}\)
This is used for quick "conservative" (maybe) estimates.
Example of voters for Wald’s interval
 Your campaign advisor told you that in a random sample of 100 likely voters,
56 intend to vote for you.
 Can you relax? Do you have this race in the bag?
 Without access to a computer or calculator, how precise is this estimate?
1/sqrt(100) = 0.1,
so a back-of-the-envelope calculation gives an approximate 95% interval of (0.46, 0.66)
 Not enough for you to relax; better go do more campaigning, as the interval is not entirely above 50%
 A rough guideline: with n=100 you get about one decimal place of accuracy in the SE, and with n=10000 about two decimal places.
round(1/sqrt(10^(1:6)), 3)
## [1] 0.316 0.100 0.032 0.010 0.003 0.001
2 ways of determining the confidence interval!
0.56 + c(-1, 1) * qnorm(0.975) * sqrt(0.56 * 0.44/100)
## [1] 0.4627 0.6573
binom.test(56, 100)$conf.int
## [1] 0.4572 0.6592
## attr(,"conf.level")
## [1] 0.95
Simulation for n=20
n <- 20
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage <- sapply(pvals, function(p) {
    phats <- rbinom(nosim, prob = p, size = n)/n
    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    mean(ll < p & ul > p)
})
When the true proportion is 0.5 the confidence interval covers the true value about 95% of the time (the CLT works well there), but for n=20 it does not work so well elsewhere: for values like p=0.1, the interval built from the SE covers p much less often than 95%.
Simulation for n=100
For n=100 the graph is rather spot on: for all p's there is roughly a 90-95% chance that the interval built from the SE covers the true proportion.
n <- 100
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage2 <- sapply(pvals, function(p) {
    phats <- rbinom(nosim, prob = p, size = n)/n
    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    mean(ll < p & ul > p)
})
What’s happening? Fix: Agresti/Coull interval

$n$ isn’t “large enough” for the CLT to be applicable for many of the values of $p$
 Quick fix, form the interval with \(\frac{X + 2}{n + 4}\)
 (Add two successes and failures, Agresti/Coull interval)
Use Agresti/Coull interval generally Bcaffo
Simulation with Agresti interval for n=20
Now let’s look at $n=20$ but adding 2 successes and failures
n <- 20
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage <- sapply(pvals, function(p) {
    phats <- (rbinom(nosim, prob = p, size = n) + 2)/(n + 4)
    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    mean(ll < p & ul > p)
})
It's a little conservative, i.e., the coverage stays above the nominal 95% for most values of p.
Poisson interval
 A nuclear pump failed 5 times out of 94.32 days, give a 95% confidence interval for the failure rate per day?
 $X \sim Poisson(\lambda t)$.
 Estimate $\hat \lambda = X/t$
 $Var(\hat \lambda) = \lambda / t$
 $\hat \lambda / t$ is our variance estimate
R code
x <- 5
t <- 94.32
lambda <- x/t
round(lambda + c(-1, 1) * qnorm(0.975) * sqrt(lambda/t), 3)
## [1] 0.007 0.099
poisson.test(x, T = 94.32)$conf
## [1] 0.01721 0.12371
## attr(,"conf.level")
## [1] 0.95
Simulating the Poisson coverage rate
Let’s see how this interval performs for lambda values near what we’re estimating
lambdavals <- seq(0.005, 0.1, by = 0.01)
nosim <- 1000
t <- 100
coverage <- sapply(lambdavals, function(lambda) {
    lhats <- rpois(nosim, lambda = lambda * t)/t
    ll <- lhats - qnorm(0.975) * sqrt(lhats/t)
    ul <- lhats + qnorm(0.975) * sqrt(lhats/t)
    mean(ll < lambda & ul > lambda)
})
Coverage
(Gets really bad for small values of lambda)
What if we increase t to 1000?
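A sketch of re-running the same coverage simulation with t = 1000:
lambdavals <- seq(0.005, 0.1, by = 0.01)
nosim <- 1000
t <- 1000
coverage <- sapply(lambdavals, function(lambda) {
    lhats <- rpois(nosim, lambda = lambda * t)/t
    ll <- lhats - qnorm(0.975) * sqrt(lhats/t)
    ul <- lhats + qnorm(0.975) * sqrt(lhats/t)
    mean(ll < lambda & ul > lambda)
})
With the longer monitoring time the coverage should sit much closer to 95%, even for small lambda.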
Summary
 The LLN states that averages of iid samples converge to the population means that they are estimating
 The CLT states that averages are approximately normal, with
distributions
 centered at the population mean
 with standard deviation equal to the standard error of the mean
 CLT gives no guarantee that $n$ is large enough
 Taking the mean and adding and subtracting the relevant
normal quantile times the SE yields a confidence interval for the mean
 Adding and subtracting 2 SEs works for 95% intervals
 Confidence intervals get wider as the coverage increases (why?)
 Confidence intervals get narrower with less variability or larger sample sizes
 The Poisson and binomial case have exact intervals that
don’t require the CLT
 But a quick fix for small sample size binomial calculations is to add 2 successes and failures
Questions c6w2

What is the variance of the distribution of the average of an IID draw of $n$ observations from a population with mean $\mu$ and variance $\sigma^2$?
$\sigma^2/n$

Suppose that diastolic blood pressures (DBPs) for men aged 35-44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. About what is the probability that a random 35-44 year old has a DBP less than 70?
pnorm(70, mean = 80, sd = 10)

Brain volume for adult women is normally distributed with a mean of about 1,100 cc for women with a standard deviation of 75 cc. What brain volume represents the 95th percentile?
qnorm(0.95, mean = 1100, sd = 75)

Refer to the previous question. Brain volume for adult women is about 1,100 cc for women with a standard deviation of 75 cc. Consider the sample mean of 100 random adult women from this population. What is the 95th percentile of the distribution of that sample mean?
qnorm(0.95, mean = 1100, sd = 75/sqrt(100))

You flip a fair coin 5 times, about what’s the probability of getting 4 or 5 heads?
pbinom(3, size = 5, prob = 0.5, lower.tail = FALSE)

The respiratory disturbance index (RDI), a measure of sleep disturbance, for a specific population has a mean of 15 (sleep events per hour) and a standard deviation of 10. They are not normally distributed. Give your best estimate of the probability that a sample mean RDI of 100 people is between 14 and 16 events per hour?
pnorm(16, mean = 15, sd = 1)  pnorm(14, mean = 15, sd = 1)
The population needn’t be normal but distribution of sample mean seems to be normally distributed.

Consider a standard uniform density. The mean for this density is .5 and the variance is 1 / 12. You sample 1,000 observations from this distribution and take the sample mean, what value would you expect it to be near?
Via the LLN it should be near .5 it seems!!!!

The number of people showing up at a bus stop is assumed to be Poisson with a mean of 5 people per hour. You watch the bus stop for 3 hours. About what's the probability of viewing 10 or fewer people?
ppois(10, lambda = 15)
Need to give this a bit more thought! But I nailed the answer! of course!
T confidence interval (c6w3)
T Confidence intervals
 In the previous, we discussed creating a confidence interval using the CLT
 They took the form $Est \pm ZQ \times SE_{Est}$
But according to jbstatistics, using the Z quantile implies the population sd is known. But whatever, we move on.

In this lecture, we discuss some methods for small samples, notably Gosset’s $t$ distribution and $t$ confidence intervals
 They are of the form $Est \pm TQ \times SE_{Est}$
 If you want a rule for whether to use a $t$ interval or a normal interval, just always use the $t$ interval
Gosset’s $t$ distribution
If you look in this jbstatistics video, you see how the curve looks for a given “degrees of Freedom”. No one dares to explain what it is though.
 Has thicker tails than the normal

Is indexed by a degrees of freedom; gets more like a standard normal as df gets larger

It assumes that the underlying data are iid Gaussian, with the result that \(\frac{\bar X - \mu}{S/\sqrt{n}}\) follows Gosset's $t$ distribution with $n-1$ degrees of freedom
 (If we replaced $S$ by $\sigma$ the statistic would be exactly standard normal)
 The interval is $\bar X \pm t_{n-1} S/\sqrt{n}$ where $t_{n-1}$ is the relevant quantile
So what changes in essence is the lack of $\sigma$, resulting in a t distribution whose quantiles are not Z anymore, they are t. This distribution approaches the normal as the dof goes to infinity!
Code for manipulate
Shows the normal and T plots with varying dofs!
library(ggplot2)
library(manipulate)
k <- 1000
xvals <- seq(-5, 5, length = k)
myplot <- function(df) {
    d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)), x = xvals,
                    dist = factor(rep(c("Normal", "T"), c(k, k))))
    g <- ggplot(d, aes(x = x, y = y))
    g <- g + geom_line(size = 2, aes(colour = dist))
    g
}
manipulate(myplot(mu), mu = slider(1, 20, step = 1))
Easier to see
Compares the z and t quantiles
pvals <- seq(0.5, 0.99, by = 0.01)
myplot2 <- function(df) {
    d <- data.frame(n = qnorm(pvals), t = qt(pvals, df), p = pvals)
    g <- ggplot(d, aes(x = n, y = t))
    g <- g + geom_abline(size = 2, col = "lightblue")
    g <- g + geom_line(size = 2, col = "black")
    g <- g + geom_vline(xintercept = qnorm(0.975))
    g <- g + geom_hline(yintercept = qt(0.975, df))
    g
}
manipulate(myplot2(df), df = slider(1, 20, step = 1))
Notes about the $t$ interval
 The $t$ interval technically assumes that the data are iid normal, though it is robust to this assumption
 It works well whenever the distribution of the data is roughly symmetric and mound shaped
 Paired observations are often analyzed using the $t$ interval by taking differences
 For large degrees of freedom, $t$ quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded

For skewed distributions, the spirit of the $t$ interval assumptions are violated
Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean
 In this case, consider taking logs or using a different summary like the median
 For highly discrete data, like binary, other intervals are available
Paired data — Sleep data
In R typing data(sleep)
brings up the sleep data originally
analyzed in Gosset’s Biometrika paper, which shows the increase in
hours for 10 patients on two soporific drugs. R treats the data as two
groups rather than paired despite the same patients using both the
drugs in succession.
The data
data(sleep)
head(sleep)
## extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6
Different ways of running the T.test t distribution
The 0.975 quantile is used (not 0.95), since the 95% interval is two sided.
g1 <- sleep$extra[1:10]
g2 <- sleep$extra[11:20]
difference <- g2 - g1
mn <- mean(difference)
s <- sd(difference)
n <- 10
mn + c(-1, 1) * qt(0.975, n - 1) * s/sqrt(n)
t.test(difference)
t.test(g2, g1, paired = TRUE)
t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)
The results
(After a little formatting)
## [,1] [,2]
## [1,] 0.7001 2.46
## [2,] 0.7001 2.46
## [3,] 0.7001 2.46
## [4,] 0.7001 2.46
Unpaired data — Independent group $t$ confidence intervals
 Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo
 We cannot use the paired t test because the groups are independent and may have different sample sizes
 We now present methods for comparing independent groups
Confidence interval
 Therefore a $(1 - \alpha)\times 100\%$ confidence interval for $\mu_y - \mu_x$ is \(\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}\, S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}\)
 The pooled variance estimator is \(S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)\)

Remember this interval is assuming a constant variance across the two groups
 According to the lecture, it is a reasonable assumption if we perform randomization, that variance is the same.
 If there is some doubt, assume a different variance per group, which we will discuss later
Example : unpaired data; var is equal
Based on Rosner, Fundamentals of Biostatistics
(Really a very good reference book)
 Comparing SBP for 8 oral contraceptive users versus 21 controls
 $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg
 $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg
 Pooled variance estimate
sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2)/(8 + 21 - 2))
132.86 - 127.44 + c(-1, 1) * qt(0.975, 27) * sp * (1/8 + 1/21)^0.5
## [1] -9.521 20.361
Mistakenly treating the (paired) sleep data as independent groups
n1 <- length(g1)
n2 <- length(g2)
sp <- sqrt(((n1 - 1) * sd(g1)^2 + (n2 - 1) * sd(g2)^2)/(n1 + n2 - 2))
md <- mean(g2) - mean(g1)
semd <- sp * sqrt(1/n1 + 1/n2)
rbind(md + c(-1, 1) * qt(0.975, n1 + n2 - 2) * semd,
      t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf,
      t.test(g2, g1, paired = TRUE)$conf)
##         [,1]  [,2]
## [1,] -0.2039 3.364
## [2,] -0.2039 3.364
## [3,]  0.7001 2.460
Var equal and Var non equal!
ChickWeight
data in R
library(datasets)
data(ChickWeight)
library(reshape2)
## define weight gain or loss
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1:2)] <- paste("time", names(wideCW)[-(1:2)], sep = "")
library(dplyr)
wideCW <- mutate(wideCW, gain = time21 - time0)
Plotting the raw data
noodle plot of the data per diet
Weight gain by diet
violin plot of the data per diet!
Let’s do a t interval
wideCW14 <- subset(wideCW, Diet %in% c(1, 4))
rbind(t.test(gain ~ Diet, paired = FALSE, var.equal = TRUE, data = wideCW14)$conf,
      t.test(gain ~ Diet, paired = FALSE, var.equal = FALSE, data = wideCW14)$conf)
##        [,1]   [,2]
## [1,] -108.1 -14.81
## [2,] -104.7 -18.30
Finding the Unequal variances
 Under unequal variances \(\bar Y - \bar X \pm t_{df} \times \left(\frac{S_x^2}{n_x} + \frac{S_y^2}{n_y}\right)^{1/2}\) where $t_{df}$ is calculated with degrees of freedom \(df = \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2}{\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)}\) will be approximately a 95% interval
 This works really well
 So when in doubt, just assume unequal variances
Example
 Comparing SBP for 8 oral contraceptive users versus 21 controls
 $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg
 $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg
 $df=15.04$, $t_{15.04, .975} = 2.13$
 Interval \(132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} = [-8.91, 19.75]\)
 In R,
t.test(..., var.equal = FALSE)
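A sketch of that interval computed by hand, using the numbers from the example:
nx <- 8;  sx <- 15.34; mx <- 132.86   # oral contraceptive users
ny <- 21; sy <- 18.23; my <- 127.44   # controls
df <- (sx^2/nx + sy^2/ny)^2 /
      ((sx^2/nx)^2/(nx - 1) + (sy^2/ny)^2/(ny - 1))             # ~15.04
mx - my + c(-1, 1) * qt(0.975, df) * sqrt(sx^2/nx + sy^2/ny)    # ~(-8.91, 19.75)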
Comparing other kinds of data
 For binomial data, there’s lots of ways to compare two groups
 Relative risk, risk difference, odds ratio.
 Chi-squared tests, normal approximations, exact tests.
 For count data, there are also Chi-squared tests and exact tests.
 We’ll leave the discussions for comparing groups of data for binary and count data until covering glms in the regression class.
 In addition, Mathematical Biostatistics Boot Camp 2 covers many special cases relevant to biostatistics.
Summary Agents

t intervals arise due to lack of the population $\sigma$
 T intervals can be obtained from t.test(), no need to memorize any insane formulas!
 T intervals can be paired (take differences within subjects) or unpaired (independent groups)
 Unpaired T intervals are further split based on whether the variances of the two groups are assumed equal or not!
 Variance is assumed to be equal when the sample is randomized!!! Whatever that is supposed to mean!
Hypothesis testing c6w3
 Hypothesis testing is concerned with making decisions using data
 A null hypothesis is specified that represents the status quo, usually labeled $H_0$
 The null hypothesis is assumed true and statistical evidence is required to reject it in favor of a research or alternative hypothesis
Example
 A respiratory disturbance index of more than $30$ events / hour, say, is considered evidence of severe sleep disordered breathing (SDB).
 Suppose that in a sample of $100$ overweight subjects with other risk factors for sleep disordered breathing at a sleep clinic, the mean RDI was $32$ events / hour with a standard deviation of $10$ events / hour.
 We might want to test the hypothesis that
 $H_0 : \mu = 30$
 $H_a : \mu > 30$
 where $\mu$ is the population mean RDI.

Disease : > 30 events/hr

Overweight sample n=100

Mu_sample = 32 events/hr
 Sigma_sample = 10 events/hr
Hypothesis testing
 The alternative hypotheses are typically of the form $<$, $>$ or $\neq$
 Note that there are four possible outcomes of our statistical decision process
| Truth | Decide | Result                |
|-------|--------|-----------------------|
| $H_0$ | $H_0$  | Correctly accept null |
| $H_0$ | $H_a$  | Type I error          |
| $H_a$ | $H_a$  | Correctly reject null |
| $H_a$ | $H_0$  | Type II error         |
Discussion
 Consider a court of law; the null hypothesis is that the defendant is innocent
 We require a standard on the available evidence to reject the null hypothesis (convict)
 If we set a low standard, then we would increase the percentage of innocent people convicted (type I errors); however we would also increase the percentage of guilty people convicted (correctly rejecting the null)

If we set a high standard, then we increase the percentage of innocent people let free (correctly accepting the null) while we would also increase the percentage of guilty people let free (type II errors)

H0: Defendant is innocent; Person does not have Disease

Ha: Defendant is guilty; Person has Disease

 If we set a very low standard of evidence (say, convicting everyone found anywhere near the scene), we make more Type 1 errors: we reject H0 too readily, so we convict more innocent people, but we also convict more of the guilty
 If we set a very high standard of evidence, the reverse happens (Type 2 errors): fewer innocent people are convicted, but more guilty people go free
In the case of diseases

If we lower the bar and accept more Type 1 errors, then we wrongly label more healthy people as having the disease, but we also correctly identify more people who do have the disease

and Vice Versa!
So we need to set some limit to minimize the type 1 error i.e., wrong accusation of people as having disease when they actually don’t.
The Rejection constant C
 Consider our sleep example again
 A reasonable strategy would reject the null hypothesis if $\bar X$ was larger than some constant, say $C$
 Typically, $C$ is chosen so that the probability of a Type I error, $\alpha$, is $.05$ (or some other relevant constant)

$\alpha$ = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct
 So if, in a random sample, $\bar X$ turns out to be larger than the 95th percentile of its distribution under H0, we reject H0
Rejection based on C Example!

Standard error of the mean $10 / \sqrt{100} = 1$

Under $H_0$ $\bar X \sim N(30, 1)$
How we are justified in using the std deviation of 1, I am not clear! Take this into class and identify exactly what you do not understand!
 We want to chose $C$ so that the $P(\bar X > C; H_0)$ is 5%

The 95th percentile of a normal distribution is 1.645 standard deviations above the mean
 If $C = 30 + 1 \times 1.645 = 31.645$
 Then the probability that a $N(30, 1)$ is larger than it is 5%
 So the rule “Reject $H_0$ when $\bar X \geq 31.645$” has the property that the probability of rejection is 5% when $H_0$ is true (for the $\mu_0$, $\sigma$ and $n$ given)
 So we want to identify if H0 shall be rejected.
 We take the distribution of sample means under H0 and see where the observed sample mean lies (within C or beyond C); a short sketch of this calculation follows.
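A sketch of the threshold calculation for this example (mu0 = 30, sigma = 10, n = 100):
mu0 <- 30; sigma <- 10; n <- 100
se <- sigma / sqrt(n)            # 1
C <- mu0 + qnorm(0.95) * se      # 31.645
C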
Discussion
 In general we don’t convert $C$ back to the original scale
 We would just reject because the Z-score, which is how many standard errors the sample mean is above the hypothesized mean, \(\frac{32 - 30}{10 / \sqrt{100}} = 2\), is greater than $1.645$
 Or, whenever $(\bar X - \mu_0) / (s/\sqrt{n}) > Z_{1-\alpha}$
Summary 1: General rules
 The $Z$ test for $H_0:\mu = \mu_0$ versus
 $H_1: \mu < \mu_0$
 $H_2: \mu \neq \mu_0$
 $H_3: \mu > \mu_0$
 Test statistic $TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
 Reject the null hypothesis when
  $H_1$: $TS \leq Z_{\alpha} = -Z_{1 - \alpha}$
  $H_2$: $|TS| \geq Z_{1 - \alpha / 2}$
  $H_3$: $TS \geq Z_{1 - \alpha}$
Failing to reject H0 and not, accepting H0!
 We have fixed $\alpha$ to be low, so if we reject $H_0$ then either our model is wrong or there is (only) a low probability that we have made an error
 We have not fixed the probability of a type II error, $\beta$; therefore we tend to say ``Fail to reject $H_0$’’ rather than accepting $H_0$
 Statistical significance is not the same as scientific significance
 The region of TS values for which you reject $H_0$ is called the rejection region
When to apply Z and when to apply T test! Agents!
From jbStatisctics’s explanation:
 Z statistic is used when $\sigma$ is known. As a result we have a normal distribution,
 t statistic is used when $\sigma$ is not known. We use the sample standard deviation $S$ and so have a t statistic with n-1 dofs!
\(t = (\bar{X} - \mu) / (S / \sqrt{n})\) has a t distribution with n-1 dofs!
 95% confidence intervals are obtained by (a short comparison of the quantiles follows after this list):
Z --> \(\bar{X} \pm 1.96\, \sigma / \sqrt{n}\)
T --> \(\bar{X} \pm qt(0.975, n-1)\, S / \sqrt{n}\)
 The $Z$ test requires the assumptions of the CLT and for $n$ to be large enough for it to apply
 If $n$ is small, then Gosset's $T$ test is performed exactly the same way, with the normal quantiles replaced by the appropriate Student's $T$ quantiles and $n-1$ df
 The probability of rejecting the null hypothesis when it is false is called power
 Power is used a lot to calculate sample sizes for experiments
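A quick sketch comparing the quantiles (the n values are arbitrary):
qnorm(0.975)        # 1.96
qt(0.975, df = 9)   # ~2.26: wider 95% interval when n = 10
qt(0.975, df = 99)  # ~1.98: nearly normal for larger n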
Example with n=16 and using the T statistic

Consider our example again. Suppose that $n= 16$ (rather than $100$)

The statistic \(\frac{\bar X - 30}{s / \sqrt{16}}\) follows a $T$ distribution with 15 df under $H_0$
Under $H_0$, the probability that it is larger that the 95th percentile of the $T$ distribution is 5%

The 95th percentile of the T distribution with 15 df is 1.7531 (obtained via qt(.95, 15))
So that our test statistic is now $\sqrt{16}(32 - 30) / 10 = 0.8$ « 1.7531 (the 95th percentile of the T with 15 dofs)

We now fail to reject.
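A sketch of that calculation:
ts <- sqrt(16) * (32 - 30) / 10   # 0.8
qt(0.95, 15)                      # 1.7531
ts > qt(0.95, 15)                 # FALSE, so we fail to reject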
Two sided tests
Not 95% anymore. It will become 97.5%; Just pay attention when you encounter such a problem.
 Suppose that we would reject the null hypothesis if in fact the mean was too large or too small
 That is, we want to test the alternative $H_a : \mu \neq 30$
 We will reject if the test statistic, $0.8$, is either too large or too small
 Then we want the probability of rejecting under the null to be 5%, split equally as 2.5% in the upper tail and 2.5% in the lower tail
 Thus we reject if our test statistic is larger than qt(.975, 15) or smaller than qt(.025, 15)
 This is the same as saying: reject if the absolute value of our statistic is larger than qt(0.975, 15) = 2.1314
 So we fail to reject the two sided test as well
 (If you fail to reject the one sided test, you know that you will fail to reject the two sided)
T test in R
library(UsingR); data(father.son)
t.test(father.son$sheight - father.son$fheight)
>
> One Sample t-test
>
> data: father.son$sheight - father.son$fheight
> t = 11.79, df = 1077, p-value < 2.2e-16
> alternative hypothesis: true mean is not equal to 0
> 95 percent confidence interval:
> 0.831 1.163
> sample estimates:
> mean of x
> 0.997
Connections with confidence intervals (Important)
 Consider testing $H_0: \mu = \mu_0$ versus $H_a: \mu \neq \mu_0$
 Take the set of all possible values for which you fail to reject $H_0$; this set is a $(1-\alpha)100\%$ confidence interval for $\mu$
 The same works in reverse; if a $(1-\alpha)100\%$ interval contains $\mu_0$, then we fail to reject $H_0$
library(UsingR); data(father.son)
t.test(father.son$sheight - father.son$fheight)
qt(0.975, length(father.son$sheight - father.son$fheight) - 1)
>
> One Sample t-test
>
> data: father.son$sheight - father.son$fheight
> t = 11.79, df = 1077, p-value < 2.2e-16
> alternative hypothesis: true mean is not equal to 0
> 95 percent confidence interval:
> 0.831 1.163
> sample estimates:
> mean of x
> 0.997
> 1.962
H0: Mu = 0 (no difference in heights)
Ha: Mu != 0 (two sided alternative hypothesis)
The t quantile for a 95% two-sided test is qt(0.975, 1077) = 1.962 (the 97.5th percentile).
The t statistic for the observed mean difference of 0.997 is 11.79, essentially the 100th percentile; the chance of seeing a statistic that extreme under H0 is essentially zero, which tells us the null hypothesis is BS!
So we reject the null hypothesis that the mean difference is 0.
If the t statistic had fallen below the 95th t percentile, we would have failed to reject H0, as shown here in the jbstatistics video.
The beauty is that we can also check with the confidence interval (0.83, 1.16): 0 is not in the 95% confidence interval, so, REJECT H0.
Two group intervals
 First, now you know how to do two group T tests since we already covered independent group T intervals
 Rejection rules are the same
 Test $H_0 : \mu_1 = \mu_2$
 Let’s just go through an example
chickWeight
data
Recall that we reformatted this data
library(datasets); data(ChickWeight); library(reshape2)
##define weight gain or loss
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1:2)] <- paste("time", names(wideCW)[-(1:2)], sep = "")
library(dplyr)
wideCW <- mutate(wideCW,
                 gain = time21 - time0
)
Equal variance T test comparing diets 1 and 4
wideCW14 <- subset(wideCW, Diet %in% c(1, 4))
t.test(gain ~ Diet, paired = FALSE,
       var.equal = TRUE, data = wideCW14)
>
> Two Sample t-test
>
> data: gain by Diet
> t = -2.725, df = 23, p-value = 0.01207
> alternative hypothesis: true difference in means is not equal to 0
> 95 percent confidence interval:
> -108.15 -14.81
> sample estimates:
> mean in group 1 mean in group 4
> 136.2 197.7
We reject the null hypothesis: 0 is not in the 95% confidence interval (equivalently, the t statistic falls in the rejection region).
Exact binomial test
 Recall this problem, Suppose a friend has $8$ children, $7$ of which are girls and none are twins
 Perform the relevant hypothesis test. $H_0 : p = 0.5$ $H_a : p > 0.5$
 What is the relevant rejection region so that the probability of rejecting is (less than) 5%?
| Rejection region | Type I error rate |
|------------------|-------------------|
| [0 : 8]          | 1                 |
| [1 : 8]          | 0.9961            |
| [2 : 8]          | 0.9648            |
| [3 : 8]          | 0.8555            |
| [4 : 8]          | 0.6367            |
| [5 : 8]          | 0.3633            |
| [6 : 8]          | 0.1445            |
| [7 : 8]          | 0.0352            |
| [8 : 8]          | 0.0039            |
Notes
 It’s impossible to get an exact 5% level test for this case due to the discreteness of the binomial.
 The closest is the rejection region [7 : 8]
 Any alpha level lower than 0.0039 is not attainable.
 For larger sample sizes, we could do a normal approximation, but you already knew this.
 Two sided test isn’t obvious.
 Given a way to do two sided tests, we could take the set of values of $p_0$ for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, BTW)
 For these problems, people always create a Pvalue (next lecture) rather than computing the rejection region.
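A sketch of the exact test for the 7-girls-out-of-8 example, via binom.test:
binom.test(7, 8, p = 0.5, alternative = "greater")$p.value   # ~0.0352
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)          # same tail probability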
Pvalues (c6w3)
 Most common measure of statistical significance
 Their ubiquity, along with concern over their interpretation and use
makes them controversial among statisticians
 http://warnercnr.colostate.edu/~anderson/thompson1.html
 Also see Statistical Evidence: A Likelihood Paradigm by Richard Royall
 Toward EvidenceBased Medical Statistics. 1: The P Value Fallacy by Steve Goodman
 The hilariously titled: The Earth is Round (p < .05) by Cohen.
 Some positive comments
What is a Pvalue?
Instead of fixing alpha yourself, you report a quantity that lets readers apply their own alpha.
For a one sided test it is simply 1 - CDF(observed test statistic) under the null. That's all!
Pvalues
 The Pvalue is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than that obtained
 If the Pvalue is small, then either $H_0$ is true and we have observed a rare event or $H_0$ is false
 Suppose that you get a $T$ statistic of $2.5$ for 15 df testing $H_0:\mu = \mu_0$ versus $H_a : \mu > \mu_0$.
 What’s the probability of getting a $T$ statistic as large as $2.5$?
pt(2.5, 15, lower.tail = FALSE)
## [1] 0.01225
 Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under $H_0$ is 0.0123
The attained significance level
 Our test statistic was $2$ for $H_0 : \mu_0 = 30$ versus $H_a:\mu > 30$.
 Notice that we rejected the one sided test when $\alpha = 0.05$, would we reject if $\alpha = 0.01$, how about $0.001$?
 The smallest value for alpha that you still reject the null hypothesis is called the attained significance level
 This is equivalent, but philosophically a little different from, the Pvalue
Notes
 By reporting a Pvalue the reader can perform the hypothesis test at whatever $\alpha$ level he or she choses
 If the Pvalue is less than $\alpha$ you reject the null hypothesis
 For two sided hypothesis test, double the smaller of the two one sided hypothesis test Pvalues
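For the T example above (T = 2.5, 15 df), a sketch of the two sided P-value:
2 * pt(2.5, 15, lower.tail = FALSE)   # ~0.0245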
Revisiting an earlier example (star mark) ;)
 Suppose a friend has $8$ children, $7$ of which are girls and none are twins
 If each gender has an independent $50$% probability for each birth, what’s the probability of getting $7$ or more girls out of $8$ births?
choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8
## [1] 0.03516
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
## [1] 0.03516
Poisson example
 Suppose that a hospital has an infection rate of 10 infections per 100 person/days at risk (rate of 0.1) during the last monitoring period.
 Assume that an infection rate of 0.05 is an important benchmark.
 Given the model, could the observed rate being larger than 0.05 be attributed to chance?
 Under $H_0: \lambda = 0.05$ so that $\lambda_0 100 = 5$
 Consider $H_a: \lambda > 0.05$.
JFC the writing is crazy!
Sample: 10 infections in 100 person-days => rate of 0.1
Expected, preferred rate: 5 infections per 100 person-days => rate of 0.05
H0: rate = 0.05
Ha: rate > 0.05
ppois(9, 5, lower.tail = FALSE)
## [1] 0.03183
quiz c6w3

In a population of interest, a sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is a 95% Student’s T confidence interval for the mean brain volume in this new population?
mn <- 1100
sp <- 30/3
mn + c(-1, 1) * qt(0.975, 8) * sp

A diet pill is given to 9 subjects over six weeks. The average difference in weight (follow up - baseline) is -2 pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the 95% T confidence interval to touch 0?
2 * 3 / qt(0.975, 8)   # about 2.60

In an effort to improve running performance, 5 runners were either given a protein supplement or placebo. Then, after a suitable washout period, they were given the opposite treatment. Their mile times were recorded under both the treatment and placebo, yielding 10 measurements with 2 per subject. The researchers intend to use a T test and interval to investigate the treatment. Should they use a paired or independent group T test and interval?

In a study of emergency room waiting times, investigators consider a new and the standard triage systems. To test the systems, administrators selected 20 nights and randomly assigned the new triage system to be used on 10 nights and the standard system on the remaining 10 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 3 hours with a variance of 0.60 while the average MWT for the old system was 5 hours with a variance of 0.68. Consider the 95% confidence interval estimate for the differences of the mean MWT associated with the new system. Assume a constant variance. What is the interval? Subtract in this order (New System  Old System).
-2.75 to -1.25

Suppose that you create a 95% T confidence interval. You then create a 90% interval using the same data. What can be said about the 90% interval with respect to the 95% interval?
Smaller obviously

To further test the hospital triage system, administrators selected 200 nights and randomly assigned a new triage system to be used on 100 nights and a standard system on the remaining 100 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 4 hours with a standard deviation of 0.5 hours while the average MWT for the old system was 6 hours with a standard deviation of 2 hours. Consider the hypothesis of a decrease in the mean MWT associated with the new treatment.
What does the 95% independent group confidence interval with unequal variances suggest vis a vis this hypothesis? (Because there’s so many observations per group, just use the Z quantile instead of the T.)
n1 <- n2 <- 100
xbar1 <- 4   # new system
xbar2 <- 6   # old system
s1 <- 0.5
s2 <- 2
xbar2 - xbar1 + c(-1, 1) * qnorm(0.975) * sqrt(s1^2/n1 + s2^2/n2)

Suppose that 18 obese subjects were randomized, 9 each, to a new diet pill and a placebo. Subjects’ body mass indices (BMIs) were measured at a baseline and again after having received the treatment or placebo for four weeks. The average difference from followup to the baseline (followup  baseline) was −3 kg/m2 for the treated group and 1 kg/m2 for the placebo group. The corresponding standard deviations of the differences was 1.5 kg/m2 for the treatment group and 1.8 kg/m2 for the placebo group. Does the change in BMI over the four week period appear to differ between the treated and placebo groups? Assuming normality of the underlying data and a common population variance, calculate the relevant 90% t confidence interval. Subtract in the order of (Treated  Placebo) with the smaller (more negative) number first.
n1 <- n2 <- 9
x1 <- -3    # treated
x2 <- 1     # placebo
s1 <- 1.5   # treated
s2 <- 1.8   # placebo
s <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2)/(n1 + n2 - 2))
(x1 - x2) + c(-1, 1) * qt(0.95, n1 + n2 - 2) * s * sqrt(1/n1 + 1/n2)
Power (c6w4)
Power is the probability of rejecting the null hypothesis when it is false
 Ergo, power (as its name would suggest) is a good thing; you want more power
 A type II error (a bad thing, as its name would suggest) is failing to reject the null hypothesis when it’s false; the probability of a type II error is usually called $\beta$
 Note Power $= 1 - \beta$
H_a is true but you decide H0; for example, mu>30 but you decide mu=30.
Khan academy notes
https://www.khanacademy.org/math/apstatistics/testssignificanceap/errorprobabilitiespower/v/introductiontopowerinsignificancetests
|                       | H0 true      | H0 false        |
|-----------------------|--------------|-----------------|
| Reject H0             | Type I error | Correct (POWER) |
| Failing to reject H0  | Correct      | Type II error   |
H0: mu=30
Ha: mu>30
Type I error: mu = 30 (H0 true) but you reject H0.
Type II error: mu > 30 (H0 false) but you fail to reject H0.
Imagine you take a random sample of n=100 from the population, with sigma = 4.
Under H0, $\bar X \sim N(30, \sigma^2/n)$; you check where the observed sample mean lies.
If the observed mean lies outside the central 95% of that distribution, you reject H0; this 5% is alpha.
If the observed mean lies within the central 95%, you fail to reject H0.
I.e., when H0 is true there is a 95% chance that you will be right (fail to reject H0 under H0)
and a 5% chance that you are wrong (reject H0 under H0): the Type I error.
Under Ha (say N(35, sigma^2/n)), you check where the observed sample mean lies relative to the same cutoff:
if it lies to the left of the cutoff, you fail to reject H0 (a Type II error);
if it lies to the right of the cutoff, you happily reject H0 (this probability is the POWER).
Power increases when?
 alpha up -> Power up (but the Type I error rate goes up too)
 n up -> Power up
 Less variability (smaller sigma) -> Power up
 Mu_a farther from the null value -> Power up
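A sketch of these effects for the running example (mu0 = 30), using a small helper function of my own making:
power_z <- function(mua, mu0 = 30, sigma = 4, n = 16, alpha = 0.05) {
    pnorm(mu0 + qnorm(1 - alpha) * sigma / sqrt(n),
          mean = mua, sd = sigma / sqrt(n), lower.tail = FALSE)
}
power_z(32)                 # baseline
power_z(32, n = 64)         # larger n -> more power
power_z(32, alpha = 0.10)   # larger alpha -> more power (and more Type I error)
power_z(35)                 # mua farther from mu0 -> more power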
Example continued
 $\mu_a = 32$, $\mu_0 = 30$, $n =16$, $\sigma = 4$
mu0 = 30; mua = 32; sigma = 4; n = 16; alpha = 0.05
z = qnorm(1 - alpha)
You get alpha from the below:
pnorm(mu0 + z * sigma / sqrt(n), mean = mu0, sd = sigma / sqrt(n),
lower.tail = FALSE)
You get Power from the below:
pnorm(mu0 + z * sigma / sqrt(n), mean = mua, sd = sigma / sqrt(n),
lower.tail = FALSE)
Graphical Depiction of Power
library(manipulate)
mu0 = 30
myplot <- function(sigma, mua, n, alpha){
g = ggplot(data.frame(mu = c(27, 36)), aes(x = mu))
g = g + stat_function(fun=dnorm, geom = "line",
args = list(mean = mu0, sd = sigma / sqrt(n)),
size = 2, col = "red")
g = g + stat_function(fun=dnorm, geom = "line",
args = list(mean = mua, sd = sigma / sqrt(n)),
size = 2, col = "blue")
xitc = mu0 + qnorm(1 - alpha) * sigma / sqrt(n)
g = g + geom_vline(xintercept=xitc, size = 3)
g
}
manipulate(
myplot(sigma, mua, n, alpha),
sigma = slider(1, 10, step = 1, initial = 4),
mua = slider(30, 35, step = 1, initial = 32),
n = slider(1, 50, step = 1, initial = 16),
alpha = slider(0.01, 0.1, step = 0.01, initial = 0.05)
)
Question
 When testing $H_a : \mu > \mu_0$, notice if power is $1 - \beta$, then \(1 - \beta = P\left(\bar X > \mu_0 + z_{1-\alpha} \frac{\sigma}{\sqrt{n}} ; \mu = \mu_a \right)\)
 where $\bar X \sim N(\mu_a, \sigma^2 / n)$
 Unknowns: $\mu_a$, $\sigma$, $n$, $\beta$
 Knowns: $\mu_0$, $\alpha$
 Specify any 3 of the unknowns and you can solve for the remainder
Notes

Power doesn't depend on $\mu_a$, $\sigma$ and $n$ individually, only through $\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}$
 The quantity $\frac{\mu_a - \mu_0}{\sigma}$ is called the effect size
 Unknowns: $\mu_a$, $\sigma$, $n$, $\beta$
 Knowns: $\mu_0$, $\alpha$
 Specify any 3 of the unknowns and you can solve for the remainder
Ttest power
 Consider calculating power for a Gossett’s $T$ test for our example
 The power is \(P\left(\frac{\bar X - \mu_0}{S /\sqrt{n}} > t_{1-\alpha, n-1} ~;~ \mu = \mu_a \right)\)
 Calculating this requires the noncentral t distribution.
power.t.test does this very well. Omit one of the arguments and it solves for it.
Example
power.t.test(n = 16, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$power
## [1] 0.604
power.t.test(n = 16, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$power
## [1] 0.604
power.t.test(n = 16, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$power
## [1] 0.604
Example
power.t.test(power = .8, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$n
## [1] 26.14
power.t.test(power = .8, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$n
## [1] 26.14
power.t.test(power = .8, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$n
## [1] 26.14
Key ideas Multiple testing C6w4
 Hypothesis testing/significance analysis is commonly overused
 Correcting for multiple testing avoids false positives or discoveries
 Two key components
 Error measure
 Correction
Three eras of statistics
The age of Quetelet and his successors, in which huge censuslevel data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?
The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple: Is treatment A better than treatment B?
The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information?
http://wwwstat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf
http://xkcd.com/882/ I don’t get this!
Types of errors
Suppose you are testing a hypothesis that a parameter $\beta$ equals zero versus the alternative that it does not equal zero. These are the possible outcomes.

|                     | $\beta=0$ | $\beta\neq0$ | Hypotheses |
|---------------------|-----------|--------------|------------|
| Claim $\beta=0$     | $U$       | $T$          | $m-R$      |
| Claim $\beta\neq 0$ | $V$       | $S$          | $R$        |
| Claims              | $m_0$     | $m-m_0$      | $m$        |
Type I error or false positive ($V$) Say that the parameter does not equal zero when it does
Type II error or false negative ($T$) Say that the parameter equals zero when it doesn’t
Error rates
False positive rate - The rate at which false results ($\beta = 0$) are called significant: $E\left[\frac{V}{m_0}\right]$
Family wise error rate (FWER) - The probability of at least one false positive ${\rm Pr}(V \geq 1)$
False discovery rate (FDR) - The rate at which claims of significance are false $E\left[\frac{V}{R}\right]$
 The false positive rate is closely related to the type I error rate http://en.wikipedia.org/wiki/False_positive_rate
Controlling the false positive rate FPR
If Pvalues are correctly calculated calling all $P < \alpha$ significant will control the false positive rate at level $\alpha$ on average.
Example: Suppose that you perform 10,000 tests and $\beta = 0$ for all of them.
Suppose that you call all $P < 0.05$ significant.
The expected number of false positives is: $10,000 \times 0.05 = 500$ false positives.
How do we avoid so many false positives? reduce alpha!
Controlling familywise error rate (FWER)
The Bonferroni correction is the oldest multiple testing correction.
Basic idea:
 Suppose you do $m$ tests
 You want to control FWER at level $\alpha$ so $Pr(V \geq 1) < \alpha$
 Calculate Pvalues normally
 Set $\alpha_{fwer} = \alpha/m$
 Call all $P$values less than $\alpha_{fwer}$ significant
Pros: Easy to calculate, conservative
Cons: May be very conservative
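A sketch on a handful of made-up P-values:
p <- c(0.001, 0.01, 0.02, 0.04, 0.2)
alpha <- 0.05; m <- length(p)
p < alpha / m                               # Bonferroni: compare to alpha/m
p.adjust(p, method = "bonferroni") < alpha  # equivalent, via adjusted P-values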
Controlling false discovery rate (FDR)
This is the most popular correction when performing lots of tests say in genomics, imaging, astronomy, or other signalprocessing disciplines.
Basic idea:
 Suppose you do $m$ tests
 You want to control FDR at level $\alpha$ so $E\left[\frac{V}{R}\right]$
 Calculate Pvalues normally
 Order the Pvalues from smallest to largest $P_{(1)},…,P_{(m)}$
 Call any $P_{(i)} \leq \alpha \times \frac{i}{m}$ significant
Pros: Still pretty easy to calculate, less conservative (maybe much less)
Cons: Allows for more false positives, may behave strangely under dependence
Example with 10 Pvalues
<img class=center src=fig/example10pvals.png height=450>
Controlling all error rates at $\alpha = 0.20$
Controlling FDR results in a line.
Controlling FWER results in a very conservative cutoff
Adjusted Pvalues
 One approach is to adjust the threshold $\alpha$
 A different approach is to calculate “adjusted pvalues”
 They are not pvalues anymore
 But they can be used directly without adjusting $\alpha$
Example:
 Suppose Pvalues are $P_1,\ldots,P_m$
 You could adjust them by taking $P_i^{fwer} = \min\{m \times P_i, 1\}$ for each P-value.
 Then if you call all $P_i^{fwer} < \alpha$ significant you will control the FWER.
Case study I: no true positives
set.seed(1010093)
pValues <- rep(NA, 1000)
for (i in 1:1000) {
    y <- rnorm(20)
    x <- rnorm(20)
    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
}
# Controls false positive rate
sum(pValues < 0.05)
## [1] 51
Case study I: no true positives
# Controls FWER
sum(p.adjust(pValues, method = "bonferroni") < 0.05)
## [1] 0
# Controls FDR
sum(p.adjust(pValues, method = "BH") < 0.05)
## [1] 0
Case study II: 50% true positives
set.seed(1010093)
pValues <- rep(NA, 1000)
for (i in 1:1000) {
x <- rnorm(20)
# First 500 beta=0, last 500 beta=2
if (i <= 500) {
y <- rnorm(20)
} else {
y <- rnorm(20, mean = 2 * x)
}
pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
}
trueStatus <- rep(c("zero", "not zero"), each = 500)
table(pValues < 0.05, trueStatus)
## trueStatus
## not zero zero
## FALSE 0 476
## TRUE 500 24
Case study II: 50% true positives
# Controls FWER
table(p.adjust(pValues, method = "bonferroni") < 0.05, trueStatus)
## trueStatus
## not zero zero
## FALSE 23 500
## TRUE 477 0
# Controls FDR
table(p.adjust(pValues, method = "BH") < 0.05, trueStatus)
## trueStatus
## not zero zero
## FALSE 0 487
## TRUE 500 13
Case study II: 50% true positives
Pvalues versus adjusted Pvalues
par(mfrow = c(1, 2))
plot(pValues, p.adjust(pValues, method = "bonferroni"), pch = 19)
plot(pValues, p.adjust(pValues, method = "BH"), pch = 19)
For Bonferroni the adjusted values are mostly 1, while for BH they increase more gradually with the P-values!
Notes and resources
Notes:
 Multiple testing is an entire subfield
 A basic Bonferroni/BH correction is usually enough
 If there is strong dependence between tests there may be problems
 Consider method = "BY"
Further resources:
 Multiple testing procedures with applications to genomics
 Statistical significance for genomewide studies
 Introduction to multiple testing
resampling: bootstrap and permutations c6w4
Both are popular and useful, but primarily for different uses. The permutation test is best for testing hypotheses and bootstrapping is best for estimating confidence intervals.
Also look at this
If you are using R, then they are all easy to implement. See, for instance, http://www.burnsstat.com/pages/Tutor/bootstrap_resampling.html
 The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics
 For example, how would one derive a confidence interval for the median?
 The bootstrap procedure follows from the so called bootstrap principle
Obtain 10k datasets from the same dataset!
library(UsingR)
data(father.son)
x <- father.son$sheight
n <- length(x)
B <- 10000
resamples <- matrix(sample(x,
n * B,
replace = TRUE),
B, n)
resampledMedians <- apply(resamples, 1, median)
Plot of sample distribution of the medians
The bootstrap principle
 Suppose that I have a statistic that estimates some population parameter, but I don’t know its sampling distribution
 The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution
Nonparametric bootstrap algorithm example

Bootstrap procedure for calculating confidence interval for the median from a data set of $n$ observations
i. Sample $n$ observations with replacement from the observed data resulting in one simulated complete data set
ii. Take the median of the simulated data set
iii. Repeat these two steps $B$ times, resulting in $B$ simulated medians
iv. These medians are approximately drawn from the sampling distribution of the median of $n$ observations; therefore we can
 Draw a histogram of them
 Calculate their standard deviation to estimate the standard error of the median
 Take the $2.5^{th}$ and $97.5^{th}$ percentiles as a confidence interval for the median
Example code
B <- 10000
resamples <- matrix(sample(x,
n * B,
replace = TRUE),
B, n)
medians <- apply(resamples, 1, median)
sd(medians)
## [1] 0.08424
quantile(medians, c(.025, .975))
## 2.5% 97.5%
## 68.43 68.81
Histogram of bootstrap resamples
g = ggplot(data.frame(medians = medians), aes(x = medians))
g = g + geom_histogram(color = "black", fill = "lightblue", binwidth = 0.05)
g
Notes on the bootstrap
 The bootstrap is nonparametric
 Better percentile bootstrap confidence intervals correct for bias
 There are lots of variations on bootstrap procedures; the book "An Introduction to the Bootstrap" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information
Group comparisons
 Consider comparing two independent groups.
 Example, comparing sprays B and C
Permutation tests
 Consider the null hypothesis that the distribution of the observations from each group is the same
 Then, the group labels are irrelevant
 Consider a data frame with count and spray
 Permute the spray (group) labels
 Recalculate the statistic
 Mean difference in counts
 Geometric means
 T statistic
 Calculate the percentage of simulations where the simulated statistic was more extreme (toward the alternative) than the observed
Variations on permutation testing (apparently too much info)
| Data type | Statistic           | Test name                 |
|-----------|---------------------|---------------------------|
| Ranks     | rank sum            | rank sum test             |
| Binary    | hypergeometric prob | Fisher's exact test       |
| Raw data  |                     | ordinary permutation test |
 Also, socalled randomization tests are exactly permutation tests, with a different motivation.
 For matched data, one can randomize the signs
 For ranks, this results in the signed rank test
 Permutation strategies work for regression as well
 Permuting a regressor of interest
 Permutation tests work very well in multivariate settings
Permutation test B v C
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]
y <- subdata$count
group <- as.character(subdata$spray)
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
observedStat
## [1] 13.25
mean(permutations > observedStat)
## [1] 0
Histogram of permutations B v C
Quiz 4 (c6w4) (lot of deaths)

A pharmaceutical company is interested in testing a potential blood pressure lowering medication. Their first examination considers only subjects that received the medication at baseline then two weeks later. The data are as follows (SBP in mmHg)
Consider testing the hypothesis that there was a mean reduction in blood pressure. Give the P-value for the associated two sided T test.
(Hint, consider that the observations are paired.)
0.087
bl <- c(140, 138, 150, 148, 135)
fu <- c(132, 135, 151, 146, 130)
t.test(fu, bl, alternative = "two.sided", paired = TRUE)
t.test(fu - bl, alternative = "two.sided")

A sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is the complete set of values of $\mu_0$ for which a test of $H_0: \mu = \mu_0$ would fail to reject the null hypothesis in a two sided 5% Student's t-test?
1100 + c(-1, 1) * qt(0.975, 8) * 30/sqrt(9)
1077 to 1123

Researchers conducted a blind taste test of Coke versus Pepsi. Each of four people was asked which of two blinded drinks given in random order that they preferred. The data was such that 3 of the 4 people chose Coke. Assuming that this sample is representative, report a Pvalue for a test of the hypothesis that Coke is preferred to Pepsi using a one sided exact test.
pbinom(2, size = 4, prob = 0.5, lower.tail = FALSE)
What the hell was the point of the 3/4???? Nothing! PNN!

Infection rates at a hospital above 1 infection per 100 person days at risk are believed to be too high and are used as a benchmark. A hospital that had previously been above the benchmark recently had 10 infections over the last 1,787 person days at risk. About what is the one sided Pvalue for the relevant test of whether the hospital is below the standard?
ppois(10, lambda = 0.01 * 1787)
10 is on the left!

Suppose that 18 obese subjects were randomized, 9 each, to a new diet pill and a placebo. Subjects' body mass indices (BMIs) were measured at a baseline and again after having received the treatment or placebo for four weeks. The average difference from follow-up to the baseline (follow-up - baseline) was −3 kg/m2 for the treated group and 1 kg/m2 for the placebo group. The corresponding standard deviations of the differences were 1.5 kg/m2 for the treatment group and 1.8 kg/m2 for the placebo group. Does the change in BMI appear to differ between the treated and placebo groups? Assuming normality of the underlying data and a common population variance, give a p-value for a two sided t test.
n1 <- n2 <- 9
x1 <- -3   ## treated
x2 <- 1    ## placebo
s1 <- 1.5  ## treated
s2 <- 1.8  ## placebo
s <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2)/(n1 + n2 - 2))
ts <- (x1 - x2)/(s * sqrt(1/n1 + 1/n2))
2 * pt(ts, n1 + n2 - 2)
Double p so that you get it on both sides!

Brain volumes for 9 men yielded a 90% confidence interval of 1,077 cc to 1,123 cc. Would you reject in a two sided 5% hypothesis test of
$H_0 : \mu = 1{,}078$?
Ans: No, you would fail to reject. The 95% interval would be wider than the 90% interval. Since 1,078 is in the narrower 90% interval, it would also be in the wider 95% interval. Thus, in either case it’s in the interval and so you would fail to reject.
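A small sketch of that duality, reconstructing the summary statistics from the reported 90% interval (the reconstruction is my own, not from the quiz):

```r
# The 90% CI was xbar +/- qt(0.95, 8) * se, so recover xbar and se from (1077, 1123)
n <- 9
xbar <- (1077 + 1123) / 2
se <- (1123 - 1077) / 2 / qt(0.95, n - 1)
tstat <- (xbar - 1078) / se
2 * pt(-abs(tstat), n - 1)   # about 0.11 > 0.05, so fail to reject
```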

Researchers would like to conduct a study of 100 healthy adults to detect a four year mean brain volume loss of $.01~mm^3$. Assume that the standard deviation of four year volume loss in this population is $.04~mm^3$. About what would be the power of the study for a 5% one sided test versus a null hypothesis of no volume loss?
pnorm(1.645 * 0.004, mean = 0.01, sd = 0.004, lower.tail = FALSE)

Researchers would like to conduct a study of $n$ healthy adults to detect a four year mean brain volume loss of $.01~mm^3$. Assume that the standard deviation of four year volume loss in this population is $.04~mm^3$. About what would be the value of $n$ needed for 90% power with a type one error rate of 5% in a one sided test versus a null hypothesis of no volume loss?
power.t.test(delta = 0.01, sd = 0.04, sig.level = 0.05, power = 0.9, type = "one.sample", alt = "one.sided")
ceiling((4 * (qnorm(0.95) - qnorm(0.1)))^2)

As you increase the type one error rate, $\alpha$, what happens to power?
As you require less evidence to reject, i.e. your $\alpha$ rate goes up, you will have larger power (see the sketch below).
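A minimal sketch of that relationship, reusing the study from the previous questions (the alpha values tried are arbitrary):

```r
# Power of the one-sided, one-sample test at several alpha levels
sapply(c(0.01, 0.05, 0.1), function(a)
  power.t.test(n = 100, delta = 0.01, sd = 0.04, sig.level = a,
               type = "one.sample", alt = "one.sided")$power)
# power increases monotonically with alpha
```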
Regression Models c7w1
(Perhaps surprisingly, this example is still relevant)
http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html
Predicting height: the Victorian approach beats modern genomics
Recent simply statistics post
(Simply Statistics is a blog by Jeff Leek, Roger Peng and Rafael Irizarry, who wrote this post, link on the image)
 “Data supports claim that if Kobe stops ball hogging the Lakers will win more”
 "Linear regression suggests that an increase of 1% in % of shots taken by Kobe results in a drop of 1.16 points (+/- 0.22) in score differential."
 How was it done? Do you agree with the analysis?
Questions for this class (too many labels!!!)
 Consider trying to answer the following kinds of questions:
 To use the parents' heights to predict children's heights.
 To try to find a parsimonious, easily described mean relationship between parents' and children's heights.
 To investigate the variation in children's heights that appears unrelated to parents' heights (residual variation).
 To quantify what impact genotype information has beyond parental height in explaining child height.
 To figure out how/whether and what assumptions are needed to generalize findings beyond the data in question.
 Why do children of very tall parents tend to be tall, but a little shorter than their parents and why children of very short parents tend to be short, but a little taller than their parents? (This is a famous question called ‘Regression to the mean’.)
Galton’s Data
 Let’s look at the data first, used by Francis Galton in 1885.
 Galton was a statistician who invented the term and concepts of regression and correlation, founded the journal Biometrika, and was the cousin of Charles Darwin.
 You may need to run install.packages("UsingR") if the UsingR library is not installed.
 Let's look at the marginal (parents disregarding children and children disregarding parents) distributions first.
 Parent distribution is all heterosexual couples.
 Correction for gender via multiplying female heights by 1.08.
 Overplotting is an issue from discretization. (do not overplot)
library(UsingR); data(galton); library(reshape); long <- melt(galton)
g < ggplot(long, aes(x = value, fill = variable))
g < g + geom_histogram(colour = "black", binwidth=1)
g < g + facet_grid(. ~ variable)
g
plot of histograms of children heights and parents heights!
Finding the middle via least squares
 Consider only the children’s heights.
 How could one describe the “middle”?
 One definition, let $Y_i$ be the height of child $i$ for $i = 1, \ldots, n = 928$, then define the middle as the value of $\mu$ that minimizes \(\sum_{i=1}^n (Y_i - \mu)^2\)
 This is the physical center of mass of the histogram.
 You might have guessed that the answer is $\mu = \bar Y$ (a quick numerical check follows).
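A minimal sketch confirming this numerically on Galton's data (assumes the UsingR package is installed; the search interval is an arbitrary choice):

```r
library(UsingR); data(galton)
sse <- function(mu) sum((galton$child - mu)^2)   # sum of squared deviations
optimize(sse, interval = c(60, 75))$minimum      # numerical minimizer ...
mean(galton$child)                               # ... equals the empirical mean
```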
Least squares finds a line with minimum “error”
Experiment
Use R studio’s manipulate to see what value of $\mu$ minimizes the sum of the squared deviations.
library(manipulate)
myHist <- function(mu){
mse <- mean((galton$child - mu)^2)
g <- ggplot(galton, aes(x = child)) + geom_histogram(fill = "salmon", colour = "black", binwidth=1)
g <- g + geom_vline(xintercept = mu, size = 3)
g <- g + ggtitle(paste("mu = ", mu, ", MSE = ", round(mse, 2), sep = ""))
g
}
manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
The least squares est. is the empirical mean for least error
g <- ggplot(galton, aes(x = child)) + geom_histogram(fill = "salmon", colour = "black", binwidth=1)
g <- g + geom_vline(xintercept = mean(galton$child), size = 3)
g
Proof that the least squares estimate is the empirical mean
\[\begin{align} \sum_{i=1}^n (Y_i - \mu)^2 & = \sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 \sum_{i=1}^n (Y_i - \bar Y) (\bar Y - \mu) + \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) + \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) \left(\sum_{i=1}^n Y_i - n \bar Y\right) + \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2 \\ & \geq \sum_{i=1}^n (Y_i - \bar Y)^2 \end{align}\]
Comparing children's heights and their parents' heights
ggplot(galton, aes(x = parent, y = child)) + geom_point()
Size of point represents number of points at that (X, Y) combination (See the Rmd file for the code).
Plot code not given here about how not to Overplot!
Regression through the origin
 Suppose that $X_i$ are the parents’ heights.
 Consider picking the slope $\beta$ that minimizes \(\sum_{i=1}^n (Y_i - X_i \beta)^2\)
 This is exactly using the origin as a pivot point picking the line that minimizes the sum of the squared vertical distances of the points to the line
 Use R studio’s manipulate function to experiment
 Subtract the means so that the origin is the mean of the parent and children’s heights
y <- galton$child - mean(galton$child)
x <- galton$parent - mean(galton$parent)
freqData <- as.data.frame(table(x, y))
names(freqData) <- c("child", "parent", "freq")
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
library(dplyr)  # for filter()
myPlot <- function(beta){
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour="grey50", aes(size = freq+20, show_guide = FALSE))
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")
g <- g + geom_abline(intercept = 0, slope = beta, size = 3)
mse <- mean((y - beta * x)^2)
g <- g + ggtitle(paste("beta = ", beta, "mse = ", round(mse, 3)))
g
}
manipulate(myPlot(beta), beta = slider(0.6, 1.2, step = 0.02))
The solution
In the next few lectures we’ll talk about why this is the solution
lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton)
Call:
lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)) -
1, data = galton)
Coefficients:
I(parent - mean(parent))
0.646
The - 1 in the formula removes the intercept term.
Summary AGENT

Linear regression is the line which minimizes the squared error in $Y$.

\[\sum_{i=1}^n (Y_i - \mu)^2\]
 $\mu$ is nothing but the mean (proof above!), so the error is minimized

We centre the data around the means; the fit then reduces to regression through the origin.

This is done by using:
lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton)
 the - 1 removes the intercept.
Notation c7w1
Some basic definitions
 In this module, we’ll cover some basic definitions and notation used throughout the class.
 We will try to minimize the amount of mathematics required for this class.
 No calculus is required.
Notation for data
 We write $X_1, X_2, \ldots, X_n$ to describe $n$ data points.
 As an example, consider the data set ${1, 2, 5}$ then
 $X_1 = 1$, $X_2 = 2$, $X_3 = 5$ and $n = 3$.
 We often use a different letter than $X$, such as $Y_1, \ldots , Y_n$.
 We will typically use Greek letters for things we don’t know. Such as, $\mu$ is a mean that we’d like to estimate.
The empirical mean
 Define the empirical mean as \(\bar X = \frac{1}{n}\sum_{i=1}^n X_i.\)
 Notice if we subtract the mean from data points, we get data that has mean 0. That is, if we define \(\tilde X_i = X_i - \bar X\), the mean of the $\tilde X_i$ is 0.
 This process is called "centering" the random variables.
 Recall from the previous lecture that the mean is the least squares solution for minimizing \(\sum_{i=1}^n (X_i - \mu)^2\)
The empirical standard deviation and variance
 Define the empirical variance as \(S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 = \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n \bar X ^ 2 \right)\)
 The empirical standard deviation is defined as $S = \sqrt{S^2}$. Notice that the standard deviation has the same units as the data.
 The data defined by $X_i / s$ have empirical standard deviation 1. This is called “scaling” the data. i.e., SCALING
Empirical seems to be associated with the sample!
Normalization
 The data defined by \(Z_i = \frac{X_i - \bar X}{s}\) have empirical mean zero and empirical standard deviation 1.
 The process of centering then scaling the data is called “normalizing” the data.
 Normalized data are centered at 0 and have units equal to standard deviations of the original data.
 Example, a value of 2 from normalized data means that data point was two standard deviations larger than the mean.
SCALING AND CENTERING
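A minimal sketch of centering and scaling by hand, checked against R's scale() (the numbers are the small vector reused in the quiz below):

```r
x <- c(8.58, 10.46, 9.01, 9.64, 8.86)
z <- (x - mean(x)) / sd(x)         # centre, then scale
c(mean(z), sd(z))                  # approximately 0 and exactly 1
all.equal(z, as.vector(scale(x)))  # TRUE
```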
The empirical covariance
 Consider now when we have pairs of data, $(X_i, Y_i)$.
 Their empirical covariance is \(Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X) (Y_i - \bar Y) = \frac{1}{n-1}\left( \sum_{i=1}^n X_i Y_i - n \bar X \bar Y\right)\)
 The correlation is defined as \(Cor(X, Y) = \frac{Cov(X, Y)}{S_x S_y}\) where $S_x$ and $S_y$ are the estimates of standard deviations for the $X$ observations and $Y$ observations, respectively.
Covariance
It is the measure of how X and Y deviate together from a linear relationship, i.e., whether X and Y would fall on a line (the exact slope "doesn't matter").
Correlation
\[Cor(X, Y) = \frac{Cov(X, Y)}{S_x S_y}\]
Just divide the covariance by the empirical SDs.
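A small sketch computing both quantities by hand and checking against cov() and cor() (toy simulated data):

```r
set.seed(1)
x <- rnorm(50); y <- x + rnorm(50)
n <- length(x)
covxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
c(covxy, cov(x, y))                    # same value
c(covxy / (sd(x) * sd(y)), cor(x, y))  # same value
```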
Some facts about correlation
 $Cor(X, Y) = Cor(Y, X)$
 $-1 \leq Cor(X, Y) \leq 1$
 $Cor(X,Y) = 1$ and $Cor(X, Y) = -1$ only when the $X$ or $Y$ observations fall perfectly on a positive or negative sloped line, respectively.
 $Cor(X, Y)$ measures the strength of the linear relationship between the $X$ and $Y$ data, with stronger relationships as $Cor(X,Y)$ heads towards 1 or -1.

$Cor(X, Y) = 0$ implies no linear relationship.
cor(x, 2*x) = 1, cor(x, -x) = -1
Cor measures the ability to fall on a straight line. Cov is not that useful other than in getting Cor, I think.
OLS c7w1
Fitting the best line
 Let $Y_i$ be the $i^{th}$ child’s height and $X_i$ be the $i^{th}$ (average over the pair of) parents’ heights.
 Consider finding the best line
 Child's Height = $\beta_0$ + $\beta_1 \times$ Parent's Height
 Use least squares \(\sum_{i=1}^n \{Y_i - (\beta_0 + \beta_1 X_i)\}^2\)
Results (beta0, beta1)
 The least squares model fit to the line $Y = \beta_0 + \beta_1 X$ through the data pairs $(X_i, Y_i)$ with $Y_i$ as the outcome obtains the line $Y = \hat \beta_0 + \hat \beta_1 X$ where \(\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X\)
 The line passes through the point $(\bar X, \bar Y)$
 The slope of the regression line with $X$ as the outcome and $Y$ as the predictor is $Cor(Y, X) Sd(X)/ Sd(Y)$.
 The slope is the same one you would get if you centered the data, $(X_i - \bar X, Y_i - \bar Y)$, and did regression through the origin.

If you normalized the data, $\left\{ \frac{X_i - \bar X}{Sd(X)}, \frac{Y_i - \bar Y}{Sd(Y)} \right\}$, the slope is $Cor(Y, X)$.
 Regression line passes through the centre
 centered data also has the same slope as the actual data coef(lm(yc~xc))[2]==coef(lm(y~x))[2]
Normalizing the values results in slope beta1=cor(y,x)
yn <- (y - mean(y))/sd(y)
xn <- (x - mean(x))/sd(x)
c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])
This lecture gives more info on the derivations.
Revisiting Galton’s data
Double check our calculations using R
y <- galton$child
x <- galton$parent
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0, beta1), coef(lm(y ~ x)))
(Intercept) x
[1,] 23.94 0.6463
[2,] 23.94 0.6463
Revisiting Galton’s data
Reversing the outcome/predictor relationship
beta1 <- cor(y, x) * sd(x) / sd(y)
beta0 <- mean(x) - beta1 * mean(y)
rbind(c(beta0, beta1), coef(lm(x ~ y)))
(Intercept) y
[1,] 46.14 0.3256
[2,] 46.14 0.3256
Revisiting Galton’s data
Regression through the origin yields an equivalent slope if you center the data first
yc <- y - mean(y)
xc <- x - mean(x)
beta1 <- sum(yc * xc) / sum(xc ^ 2)
c(beta1, coef(lm(y ~ x))[2])
x
0.6463 0.6463
Forcing the regression through origin
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, 1.59, 1.23, 0.65, 1.49, 0.05)
sum(y*x)/sum(x^2)
or
lm(y~0 + .,cbind(y,x))
lm(y ~ 0 + ., cbind(x, y)) also gives the same result PNN!
Revisiting Galton’s data
Normalizing variables results in the slope being the correlation
yn <- (y - mean(y))/sd(y)
xn <- (x - mean(x))/sd(x)
c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])
xn
0.4588 0.4588 0.4588
Quiz c7w1

x <- c(0.18, 1.54, 0.42, 0.95)
w <- c(2, 1, 3, 1)
w are the weights; what is the value of $\mu$ that minimizes $\sum_i w_i (x_i - \mu)^2$?
sum(x * w)/sum(w)

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, 1.59, 1.23, 0.65, 1.49, 0.05)
Fit the regression through the origin and get the slope treating y as the outcome and x as the regressor. (Hint, do not center the data since we want regression through the origin, not through the means of the data.)
coef(lm(y ~ x - 1))
sum(y * x)/sum(x^2)
coef(lm(y ~ 0 + ., cbind(x, y)))

Do data(mtcars) from the datasets package and fit the regression model with mpg as the outcome and weight as the predictor. Give the slope coefficient.
with(mtcars, cor(mpg, wt) * sd(mpg)/sd(wt))

Consider data with an outcome (Y) and a predictor (X). The standard deviation of the predictor is one half that of the outcome. The correlation between the two variables is .5. What value would the slope coefficient be for the regression model with Y as the outcome and X as the predictor?
1

Students were given two hard tests and scores were normalized to have empirical mean 0 and variance 1. The correlation between the scores on the two tests was 0.4. What would be the expected score on Quiz 2 for a student who had a normalized score of 1.5 on Quiz 1?
1.5 * 0.4

x <- c(8.58, 10.46, 9.01, 9.64, 8.86)
((x  mean(x))/sd(x))[1]

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, 1.59, 1.23, 0.65, 1.49, 0.05)
coef(lm(y ~ x))[1]

You know that both the predictor and response have mean 0. What can be said about the intercept when you fit a linear regression?
The intercept estimate is $\bar Y  \beta_1 \bar X$ and so will be zero.

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
mean(x)

beta1/ gamma1 ?
when beta1 is y~x slope and gamma1 is x~y slope!
Ans: var(y)/var(x)
rttm c7w2
Regression to the Mean examples Agent
 Why is it that the children of tall parents tend to be tall, but not as tall as their parents?
 Why do children of short parents tend to be short, but not as short as their parents?
 Why do parents of very short children tend to be short, but not as short as their child? And the same with parents of very tall children?
 Why do the best performing athletes this year tend to do a little worse the following year?
RTTM is about randomness
Consider the above examples: the claim is that even though the athlete may have put in the same work, some randomness makes the performance better or worse.
100% rttm
Think of a normal distribution: if among the first 100 numbers the maximum is about 2, then the chance that its paired value in the second 100 numbers is that high is very low (mean = 0, sd = 1; being 2 sd's out is well under a 5% chance in a bell curve).
x <- rnorm(100)
y <- rnorm(100)
odr <- order(x)
x[odr[100]]  # about 2
y[odr[100]]  # about 0.5 or so
Exam tests rttm
Test 1 has some mistakes but is still hard, Test 2 is also hard.
Students who did best on Test 1 tend to do relatively worse on Test 2; quizzes are imperfect judges of a student's potential.
Many people get by just doing previous-exam questions, for example. Think of the weaker students: they would have done better on the flawed Test 1 than their Test 2 grades suggest.
vaguely rttm!
Basically RTTM is the randomness in your life that raises or lowers your measured performance; it is the noise that doesn't allow your true potential to be measured.
Regression to the mean
 These phenomena are all examples of socalled regression to the mean
 Invented by Francis Galton in the paper “Regression towards mediocrity in hereditary stature” The Journal of the Anthropological Institute of Great Britain and Ireland , Vol. 15, (1886).
 Think of it this way, imagine if you simulated pairs of random normals
 The largest of the first simulated values would be the largest partly by chance, and the probability that their paired second simulations are smaller is high.
 In other words $P(Y < x \mid X = x)$ gets bigger as $x$ heads into the very large values.
 Similarly $P(Y > x \mid X = x)$ gets bigger as $x$ heads to very small values.
 Think of the regression line as the intrinsic part.
 Unless $Cor(Y, X) = 1$ the intrinsic part isn’t perfect
Regression to the mean
 Suppose that we normalize $X$ (child’s height) and $Y$ (parent’s height) so that they both have mean 0 and variance 1.
 Then, recall, our regression line passes through $(0, 0)$ (the mean of the X and Y).
 Then the slope of the regression line is $Cor(Y,X)$, regardless of which variable is the outcome (recall, both standard deviations are 1).
 Notice if $X$ is the outcome and you create a plot where $X$ is the horizontal axis, the slope of the least squares line that you plot is $1/Cor(Y, X)$.
library(UsingR)
data(father.son)
y <- (father.son$sheight - mean(father.son$sheight)) / sd(father.son$sheight)
x <- (father.son$fheight - mean(father.son$fheight)) / sd(father.son$fheight)
rho < cor(x, y)
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_point(size = 6, colour = "black", alpha = 0.2)
g = g + geom_point(size = 4, colour = "salmon", alpha = 0.2)
g = g + xlim(-4, 4) + ylim(-4, 4)
g = g + geom_abline(intercept = 0, slope = 1)
g = g + geom_vline(xintercept = 0)
g = g + geom_hline(yintercept = 0)
g = g + geom_abline(intercept = 0, slope = rho, size = 2)
g = g + geom_abline(intercept = 0, slope = 1 / rho, size = 2)
g
Discussion
 If you had to predict a son’s normalized height, it would be $Cor(Y, X) * X_i$
 If you had to predict a father’s normalized height, it would be $Cor(Y, X) * Y_i$
 Multiplication by this correlation shrinks toward 0 (regression toward the mean)
 If the correlation is 1 there is no regression to the mean (if father's height perfectly determines child's height and vice versa)
 Note, regression to the mean has been thought about quite a bit and generalized
Linear regression c7w2
Basic regression model with additive Gaussian errors.
 Least squares is an estimation tool, how do we do inference?
 Consider developing a probabilistic model for linear regression \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_{i}\)
 Here the $\epsilon_{i}$ are assumed iid $N(0, \sigma^2)$.

Note, $E[Y_i \mid X_i = x_i] = \mu_i = \beta_0 + \beta_1 x_i$
Note, $Var(Y_i \mid X_i = x_i) = \sigma^2$, the variance for a given X
Recap
 Model $Y_i = \mu_i + \epsilon_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i$ are iid $N(0, \sigma^2)$
 ML estimates of $\beta_0$ and $\beta_1$ are the least squares estimates \(\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X\)

$E[Y \mid X = x] = \beta_0 + \beta_1 x$
$Var(Y \mid X = x) = \sigma^2$
Changing regression coefficients beta1 beta0
If you multiply $X$ by 10 (changing its units), the slope gets divided by 10.
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + \frac{\beta_1}{a} (X_i a) + \epsilon_i = \beta_0 + \tilde \beta_1 (X_i a) + \epsilon_i\] Example: $X$ is height in $m$ and $Y$ is weight in $kg$. Then $\beta_1$ is $kg/m$. Converting $X$ to $cm$ implies multiplying $X$ by $100 cm/m$. To get $\beta_1$ in the right units, we have to divide by $100 cm /m$ to get it to have the right units.
Subtracting a constant 'a' from X changes the intercept but not the slope.
Example
diamond
data set from UsingR
Data is diamond prices (Singapore dollars) and diamond weight in carats (standard measure of diamond mass, 0.2 $g$). To get the data use
library(UsingR);
data(diamond)
plot(diamond$carat, diamond$price,
xlab = "Mass (carats)",
ylab = "Price (SIN $)",
bg = "lightblue",
col = "black", cex = 1.1, pch = 21,frame = FALSE)
abline(lm(price ~ carat, data = diamond), lwd = 2)
Fitting the linear regression model
fit <- lm(price ~ carat, data = diamond)
coef(fit)
(Intercept) carat
-259.6 3721.0
 We estimate an expected 3721.02 (SIN) dollar increase in price for every carat increase in mass of diamond.
 The intercept -259.63 is the expected price of a 0 carat diamond (not meaningful on its own).
0 carats value is worthless to you so you Center the Carats!
Subtracting a constant 'a' from X changes the intercept but not the slope.
fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)
(Intercept) I(carat - mean(carat))
500.1 3721.0
Thus $500.1 is the expected price for the average sized diamond of the data (0.2042 carats).
Changing scale
 A one carat increase in a diamond is pretty big, what about changing units to 1/10th of a carat?
 We can do this by just dividing the coefficient by 10.
 We expect a 372.102 (SIN) dollar change in price for every 1/10th of a carat increase in mass of diamond.
 Showing that it’s the same if we rescale the Xs and refit
fit3 <- lm(price ~ I(carat * 10), data = diamond)
coef(fit3)
(Intercept) I(carat * 10)
-259.6 372.1
Predicting the price of a diamond
newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx
[1] 335.7 745.1 1005.5
Use Predict next time!
predict(fit, newdata = data.frame(carat = newx))
1 2 3
335.7 745.1 1005.5
Plot with lm
library(UsingR)
data(diamond)
library(ggplot2)
g = ggplot(diamond, aes(x = carat, y = price))
g = g + xlab("Mass (carats)")
g = g + ylab("Price (SIN $)")
g = g + geom_point(size = 7, colour = "black", alpha=0.5)
g = g + geom_point(size = 5, colour = "blue", alpha=0.2)
g = g + geom_smooth(method = "lm", colour = "black")
g
residual variation c7w2
Motivating example
diamond
data set from UsingR
Data is diamond prices (Singapore dollars) and diamond weight
in carats (standard measure of diamond mass, 0.2 $g$). To get the data use library(UsingR); data(diamond)
library(UsingR)
data(diamond)
library(ggplot2)
g = ggplot(diamond, aes(x = carat, y = price))
g = g + xlab("Mass (carats)")
g = g + ylab("Price (SIN $)")
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha=0.5)
g = g + geom_point(size = 5, colour = "blue", alpha=0.2)
g
Residuals

Model $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$.
 Observed outcome $i$ is $Y_i$ at predictor value $X_i$
 Predicted outcome $i$ at predictor value $X_i$ is \(\hat Y_i = \hat \beta_0 + \hat \beta_1 X_i\)

Residual: the difference between the observed and predicted outcome, \(e_i = Y_i - \hat Y_i\)
Residual is nothing but the difference between data and linear regressor line
 The vertical distance between the observed data point and the regression line
 Least squares minimizes $\sum_{i=1}^n e_i^2$
 The $e_i$ can be thought of as estimates of the $\epsilon_i$.
Properties of the residuals

$E[e_i] = 0$. Intuitively, the residuals stand on either side of the line, so their sum comes out to 0

If an intercept is included, $\sum_{i=1}^n e_i = 0$ !!! Convinced by intuitive explanation here. Regression line is in some sense a mean around which to the left are certain values which are exactly equal to the right. PNN!
I think when they say "including the intercept" they mean we do not force the line through some other point like the origin, in which case the residual sum might not be zero.
sum(resid(lm(price ~ carat - 1, data = diamond))) != 0  # (tested)
sum(resid(lm(price ~ carat, data = diamond))) == 0
sum(resid(lm(xc ~ yc))) == 0  # TRUE

If a regressor variable, $X_i$, is included in the model $\sum_{i=1}^n e_i X_i = 0$. ???

Residuals are useful for investigating poor model fit by zooming in on the locations.

Positive residuals are above the line, negative residuals are below.

Residuals can be thought of as the outcome ($Y$) with the linear association of the predictor ($X$) removed.
 One differentiates residual variation (variation after removing the predictor) from systematic variation (variation explained by the regression model).
 Residual plots highlight poor model fit.
Residual variation is a random error like the one that occurs when you take the same measurement over and over again but your reading varies “randomly”
Calculate Residuals in 3 ways!
data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
e <- resid(fit)
yhat <- predict(fit)
max(abs(e - (y - yhat)))
## [1] 9.486e-13
max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))
## [1] 9.486e-13
Important properties
sum(e)=0
sum(e*X)=0
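A quick sketch verifying both properties on the diamond fit (assumes UsingR is installed):

```r
library(UsingR); data(diamond)
fit <- lm(price ~ carat, data = diamond)
e <- resid(fit)
sum(e)                  # ~ 0 because an intercept is included
sum(e * diamond$carat)  # ~ 0 because carat is included as a regressor
```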
Residuals are the assigned length of the red lines
```{r, echo = FALSE, fig.height=5, fig.width=5}
plot(diamond$carat, diamond$price,
xlab = "Mass (carats)",
ylab = "Price (SIN $)",
bg = "lightblue",
col = "black", cex = 2, pch = 21,frame = FALSE)
abline(fit, lwd = 2)
for (i in 1 : n)
lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red" , lwd = 2)
```
### Residuals versus X
```{r, echo = FALSE, fig.height=5, fig.width=5}
plot(x, e,
xlab = "Mass (carats)",
ylab = "Residuals (SIN $)",
bg = "lightblue",
col = "black", cex = 2, pch = 21,frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n)
lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)
```
Nonlinear data
```{r, echo = FALSE, fig.height=5, fig.width=5}
x = runif(100, -3, 3); y = x + sin(x) + rnorm(100, sd = .2); library(ggplot2)
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g
```
### Residual plot
```{r, echo = FALSE, fig.height=5, fig.width=5}
g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))),
aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2);
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("Residual")
g
```
Heteroskedasticity
```{r, echo = FALSE, fig.height=4.5, fig.width=4.5}
x <- runif(100, 0, 6); y <- x + rnorm(100, mean = 0, sd = .001 * x)
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g
```
**When residual increases with X**

### Getting rid of the blank space can be helpful
```{r, echo = FALSE, fig.height=4.5, fig.width=4.5}
g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))),
aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2);
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("Residual")
g
```
Diamond data residual plot
```{r, echo = FALSE, fig.height=4.5, fig.width=4.5}
diamond$e <- resid(lm(price ~ carat, data = diamond))
g = ggplot(diamond, aes(x = carat, y = e))
g = g + xlab("Mass (carats)")
g = g + ylab("Residual price (SIN $)")
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 7, colour = "black", alpha=0.5)
g = g + geom_point(size = 5, colour = "blue", alpha=0.2)
g
```

### Diamond data residual plot
```{r, echo = FALSE, fig.height=4.5, fig.width=4.5}
## price~1 gives variation about average
e = c(resid(lm(price ~ 1, data = diamond)),
resid(lm(price ~ carat, data = diamond)))
fit = factor(c(rep("Itc", nrow(diamond)),
rep("Itc, slope", nrow(diamond))))
g = ggplot(data.frame(e = e, fit = fit), aes(y = e, x = fit, fill = fit))
g = g + geom_dotplot(binaxis = "y", size = 2, stackdir = "center", binwidth = 20)
g = g + xlab("Fitting approach")
g = g + ylab("Residual price")
g
```
Estimating residual variation (n-2 dof)
 Model $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$.
 The ML estimate of $\sigma^2$ is $\frac{1}{n}\sum_{i=1}^n e_i^2$, the average squared residual.
 Most people use \(\hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2.\)
 The $n-2$ instead of $n$ is so that $E[\hat \sigma^2] = \sigma^2$
Diamond example
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
summary(fit)$sigma
## [1] 31.84
sqrt(sum(resid(fit)^2) / (n - 2))
## [1] 31.84
Total variation = regression variation + residual variation
 The total variability in our response is the variability around an intercept (think mean only regression) $\sum_{i=1}^n (Y_i - \bar Y)^2$
 The regression variability is the variability that is explained by adding the predictor $\sum_{i=1}^n (\hat Y_i - \bar Y)^2$
 The error (residual) variability is what's left over around the regression line $\sum_{i=1}^n (Y_i - \hat Y_i)^2$
 Neat fact (verified in the sketch below): \(\sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2\)
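A minimal sketch checking the decomposition on the diamond data (assumes UsingR is installed):

```r
library(UsingR); data(diamond)
fit <- lm(price ~ carat, data = diamond)
y <- diamond$price; yhat <- fitted(fit)
ssTot <- sum((y - mean(y))^2)    # total variation
ssReg <- sum((yhat - mean(y))^2) # regression variation
ssRes <- sum((y - yhat)^2)       # residual variation
c(ssTot, ssReg + ssRes)          # the two numbers agree
```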
R squared
https://www.youtube.com/watch?v=w2FKXOa0HGA
 R squared is the percentage of the total variability that is explained by the linear relationship with the predictor
1 - R^2 = residual variation (intercept and slope) / total variation (intercept only, lm(y ~ 1))
R^2 = sum_squared(regression fit - mean) / sum_squared(actual - mean) = 1 - residual variation / total variation (see the sketch below)
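A small sketch showing three equivalent ways of getting $R^2$ for the diamond fit (assumes UsingR is installed):

```r
library(UsingR); data(diamond)
fit <- lm(price ~ carat, data = diamond)
y <- diamond$price; yhat <- fitted(fit)
1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # 1 - residual / total variation
cor(diamond$price, diamond$carat)^2           # sample correlation squared
summary(fit)$r.squared                        # what lm reports
```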
Some facts about $R^2$
 $R^2$ is the percentage of variation explained by the regression model.
 $0 \leq R^2 \leq 1$
 $R^2$ is the sample correlation squared. ???
 $R^2$ can be a misleading summary of model fit. Look in the anscombe data set and the subsection below!
 Deleting data can inflate $R^2$.
 (For later.) Adding terms to a regression model always increases $R^2$.
 Do example(anscombe) to see the following data. Basically the same mean and variance of X and Y.
 Identical correlations (hence same $R^2$ ).
 Same linear regression relationship.
data(anscombe);example(anscombe)
```{r, echo = FALSE, fig.height=5, fig.width=5, results='hide'}
require(stats); require(graphics); data(anscombe)
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
## or ff[[2]] <- as.name(paste0("y", i)); ff[[3]] <- as.name(paste0("x", i))
mods[[i]] <- lmi <- lm(ff, data = anscombe)
#print(anova(lmi))
}
```

### How to derive R squared (Not required!)
#### For those that are interested
$$
\begin{align}
\sum_{i=1}^n (Y_i - \bar Y)^2
& = \sum_{i=1}^n (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 \\
& = \sum_{i=1}^n (Y_i - \hat Y_i)^2 +
2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) +
\sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\
\end{align}
$$
****
#### Scratch work
$(Y_i - \hat Y_i) = \{Y_i - (\bar Y - \hat \beta_1 \bar X) - \hat \beta_1 X_i\} = (Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X)$
$(\hat Y_i - \bar Y) = (\bar Y - \hat \beta_1 \bar X + \hat \beta_1 X_i - \bar Y)
= \hat \beta_1 (X_i - \bar X)$
$\sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y)
= \sum_{i=1}^n \{(Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X)\}\{\hat \beta_1 (X_i - \bar X)\}$
$=\hat \beta_1 \sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X) - \hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2$
$= \hat \beta_1^2 \sum_{i=1}^n (X_i - \bar X)^2 - \hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2 = 0$

### The relation between R squared and r(not required?)
#### (Again not required)
Recall that $(\hat Y_i - \bar Y) = \hat \beta_1 (X_i - \bar X)$
so that
$$
R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}
= \hat \beta_1^2 \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}
= Cor(Y, X)^2
$$
Since, recall,
$$
\hat \beta_1 = Cor(Y, X)\frac{Sd(Y)}{Sd(X)}
$$
So, $R^2$ is literally $r$ squared.
## Inference c7w1
### Recall our model and fitted values
* Consider the model
$$
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
$$
* $\epsilon \sim N(0, \sigma^2)$.
* We assume that the true model is known.
* We assume that you've seen confidence intervals and hypothesis tests before.
* $\hat \beta_0 = \bar Y - \hat \beta_1 \bar X$
* $\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)}$.

### Review (not required! hypothesis testing and confidence intervals)
* Statistics like $\frac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}}$ often have the following properties.
1. Is normally distributed, and has a finite sample Student's T distribution if the variance is replaced with a sample estimate (under normality assumptions).
2. Can be used to test $H_0 : \theta = \theta_0$ versus $H_a : \theta >, <, \neq \theta_0$.
3. Can be used to create a confidence interval for $\theta$ via $\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}$
where $Q_{1-\alpha/2}$ is the relevant quantile from either a normal or T distribution.
* In the case of regression with iid sampling assumptions and normal errors, our inferences will follow
very similarily to what you saw in your inference class.
* We won't cover asymptotics for regression analysis, but suffice it to say that under assumptions
on the ways in which the $X$ values are collected, the iid sampling model, and mean model,
the normal results hold to create intervals and confidence intervals

### Results
**Variation of the slope depends on $\sigma$ and inversely on the
variation of the X**
* $\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \sigma^2 /
\sum_{i=1}^n (X_i - \bar X)^2$
**variation of intercepts**
* $\sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) = \left(\frac{1}{n} +
\frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2$
* **In practice, $\sigma$ is replaced by its estimate.**
* It's probably not surprising that under iid Gaussian errors
$$
\frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}}
$$
follows a $t$ distribution with $n-2$ degrees of freedom and a normal distribution for large $n$.
* This can be used to create confidence intervals and perform
hypothesis tests.

### Understand output of lm(y~x) table
```r
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n - 2))
ssx <- sum((x - mean(x))^2)
seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")
```
The t-statistic is computed with 0 intercept and 0 slope as the null hypothesis values
Residual variation
a <- resid(lm(y ~ x))
sqrt(sum(a^2)/46)
Manually calculated coefTable compared with lm(y~x)
coefTable
Estimate Std. Error t value P(>|t|)
(Intercept) -259.6 17.32 -14.99 2.523e-19
x 3721.0 81.79 45.50 6.751e-40
fit <- lm(y ~ x);
summary(fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -259.6 17.32 -14.99 2.523e-19
x 3721.0 81.79 45.50 6.751e-40
Getting a confidence interval
sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
[1] -294.5 -224.8
(sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]) / 10
[1] 355.6 388.6
With 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars.
Prediction of outcomes
 Consider predicting $Y$ at a value of $X$
 Predicting the price of a diamond given the carat
 Predicting the height of a child given the height of the parents
 The obvious estimate for prediction at point $x_0$ is \(\hat \beta_0 + \hat \beta_1 x_0\)
 A standard error is needed to create a prediction interval.
 There’s a distinction between intervals for the regression line at point $x_0$ and the prediction of what a $y$ would be at point $x_0$.
 Line at $x_0$ se, $\hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$

Prediction interval se at $x_0$, $\hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$

There are two kinds of intervals we are interested in among those we see in the lm plots with ggplot,

The confidence interval for the line itself at x0 keeping in mind a population

the prediction interval at x0 keeping in mind the population.

Prediction variance varies with the following:

$\hat \sigma$: the intervals widen with more residual error (lower R^2).

$\frac{1}{n}$ goes down with more samples; sqrt(1/n) is always part of it.

$(x_0 - \bar X)^2$: prediction is best when $x_0$ is closest to the average $\bar X$.

$\sum_{i=1}^n (X_i - \bar X)^2$: more variability in the observed X's leads to less variability in the interval (a by-hand sketch follows).
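A minimal sketch computing both standard errors by hand and matching them against predict() on the diamond data (the prediction point x0 = 0.3 is an arbitrary choice):

```r
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
x0 <- 0.3
sigma <- summary(fit)$sigma
ssx <- sum((x - mean(x))^2)
seLine <- sigma * sqrt(1/n + (x0 - mean(x))^2 / ssx)      # se of the line at x0
sePred <- sigma * sqrt(1 + 1/n + (x0 - mean(x))^2 / ssx)  # se for a new Y at x0
yhat0 <- sum(coef(fit) * c(1, x0))
rbind(yhat0 + c(-1, 1) * qt(0.975, n - 2) * seLine,
      predict(fit, data.frame(x = x0), interval = "confidence")[2:3],
      yhat0 + c(-1, 1) * qt(0.975, n - 2) * sePred,
      predict(fit, data.frame(x = x0), interval = "prediction")[2:3])
```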
Plotting the prediction intervals and confidence intervals
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
library(ggplot2)
newx = data.frame(x = seq(min(x), max(x), length = 100))
p1 = data.frame(predict(fit, newdata = newx, interval = ("confidence")))
p2 = data.frame(predict(fit, newdata = newx, interval = ("prediction")))
p1$interval = "confidence"
p2$interval = "prediction"
p1$x = newx$x
p2$x = newx$x
dat = rbind(p1, p2)
names(dat)[1] = "y"
g = ggplot(dat, aes(x = x, y = y))
g = g + geom_ribbon(aes(ymin = lwr, ymax = upr, fill = interval), alpha = 0.2)
g = g + geom_line()
g = g + geom_point(data = data.frame(x = x, y = y), aes(x = x, y = y), size = 4)
g
```
http://www2.stat.duke.edu/~tjl13/s101/slides/unit6lec3H.pdf
Is very important especially this following line:
> Two type of intervals available:
>  Confidence interval for the average foster twin’s IQ
>  Prediction interval for a single foster twin’s IQ
The average (expected value of Y at x) goes with the confidence interval, and
a single new observation goes with the prediction interval
> A confidence interval gives a range for $\text{E}[y \mid x]$, as you
> say. A prediction interval gives a range for $y$ itself. Naturally,
> our best guess for $y$ is $\text{E}[y \mid x]$, so the intervals
> will both be centered around the same value, $x\hat{\beta}$.
### Predicting with new data
Absolute nonsense with respect to R's way of work!
https://stackoverflow.com/questions/15115909/feedingnewdatatorpredictfunction
Note:
Variables are first looked for in ‘newdata’ and then searched for
in the usual way (which will include the environment of the
formula used in the fit). A warning will be given if the
variables found are not of the same length as those in ‘newdata’
if it was supplied.
So if you do
newx <- 3
predict(fit, newdata=newx, interval=("prediction"))
This won't work. R is looking for the x variable name inside newx.
newx <- data.frame(x = x0)
predict(fit, newdata=newx, interval=("prediction"))
works! my god!

### Discussion an agent and Bcaffo's understanding!
* Both intervals have varying widths.
* Least width at the mean of the Xs.
* We are quite confident in the regression line, so that
interval is very narrow.
* If we knew $\beta_0$ and $\beta_1$ this interval would have zero width.
* The prediction interval must incorporate the variability
in the data around the line.
* Even if we knew $\beta_0$ and $\beta_1$ this interval would still have width.
**Plotting this shows a couple of things:**
 salmon bands are thin and represent the prediction of the line
itself (confidence)
 the confidence bands are thinner than the prediction bands.
 confidence bands (salmon)
 The confidence bands show the uncertainty in predicting the
line itself. If there are more and more points this uncertainty
would reduce. It is about beta1 the slope.
 the prediction bands show the uncertainty in predicting Y0. They
will always exist even when there are a million points as the
linear regression cannot account for everything.
 they grow smaller at the mean and then become big away from the
center.

### In R
```{r, fig.height=5, fig.width=5, echo=FALSE,results='hide'}
newdata <- data.frame(x = xVals)
p1 <- predict(fit, newdata, interval = ("confidence"))
p2 <- predict(fit, newdata, interval = ("prediction"))
plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
abline(fit, lwd = 2)
lines(xVals, p1[,2]); lines(xVals, p1[,3])
lines(xVals, p2[,2]); lines(xVals, p2[,3])
```
Based on khan academy
Quiz c7w2

Consider the following data with x as the predictor and y as the outcome. Give a P-value for the two sided hypothesis test of whether β1 from a linear regression model is 0 or not.
x <- c(0.61, 0.93, 0.83, 0.35, 0.54, 0.16, 0.91, 0.62, 0.62)
y <- c(0.67, 0.84, 0.6, 0.18, 0.85, 0.47, 1.1, 0.65, 0.36)
summary(lm(y ~ x))$coef

Consider the previous problem, give the estimate of the residual standard deviation.
summary(lm(y ~ x))$sigma

In the mtcars data set, fit a linear regression model of weight (predictor) on mpg (outcome). Get a 95% confidence interval for the expected mpg at the average weight. What is the lower endpoint?
data(mtcars)
fit <- lm(mpg ~ I(wt - mean(wt)), data = mtcars)
confint(fit)
or
fit <- lm(mpg ~ wt, data = mtcars)
predict(fit, newdata = data.frame(wt = mean(mtcars$wt)), interval = "confidence")

Refer to the previous question. Read the help file for mtcars. What is the weight coefficient interpreted as?
The estimated expected change in mpg per 1,000 lb increase in weight.

Consider again the mtcars data set and a linear regression model with mpg as predicted by weight (1,000 lbs). A new car is coming weighing 3000 pounds. Construct a 95% prediction interval for its mpg. What is the upper endpoint?
fit <- lm(mpg ~ wt, data = mtcars)
predict(fit, newdata = data.frame(wt = 3), interval = "prediction")

Consider again the mtcars data set and a linear regression model with mpg as predicted by weight (in 1,000 lbs). A "short" ton is defined as 2,000 lbs. Construct a 95% confidence interval for the expected change in mpg per 1 short ton increase in weight. Give the lower endpoint.
fit <- lm(mpg ~ wt, data = mtcars)
confint(fit)[2, ] * 2
fit <- lm(mpg ~ I(wt * 0.5), data = mtcars)
confint(fit)[2, ]
confint gives info about confidence of the slope and intercept as is asked in the question…
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285 1.878 19.858 < 2e-16 ***
x -10.689 1.118 -9.559 1.29e-10 ***
-10.689 + c(-1, 1) * 1.118 * qt(0.975, 30) gives the same values!

If my X from a linear regression is measured in centimeters and I convert it to meters what would happen to the slope coefficient?
It would get multiplied by 100

I have an outcome, Y, and a predictor, X, and fit a linear regression model with $Y = \beta_0 + \beta_1 X + \epsilon$ to obtain $\hat\beta_0$ and $\hat\beta_1$. What would be the consequence to the subsequent slope and intercept if I were to refit the model with a new regressor, $X + c$, for some constant $c$?
The new intercept would be $\hat\beta_0 - c\hat\beta_1$ (the slope is unchanged)

Refer back to the mtcars data set with mpg as an outcome and weight (wt) as the predictor. About what is the ratio of the sum of the squared errors, $\sum_{i=1}^n (Y_i - \hat Y_i)^2$, when comparing a model with just an intercept (denominator) to the model with the intercept and slope (numerator)?
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ 1, data = mtcars)
1 - summary(fit1)$r.squared
sse1 <- sum((predict(fit1) - mtcars$mpg)^2)
sse2 <- sum((predict(fit2) - mtcars$mpg)^2)
sse1/sse2

Do the residuals always have to sum to 0 in linear regression?
If an intercept is included, then they will sum to 0
3, 6 (except for X*a and beta/a), 7, and 5 with the confidence intervals are quite a mystery!
Summary of inference
There seem to be 3 confidence intervals:

Confint of slope
Can be found from
confint(fit)
or from the summary of the fit:
fit <- lm(mpg ~ I(wt * 0.5), data = mtcars)
confint(fit)[2, ]
OR
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285 1.878 19.858 < 2e-16 ***
x -10.689 1.118 -9.559 1.29e-10 ***
-10.689 + c(-1, 1) * 1.118 * qt(0.975, 30) gives the same values!

Conf interval of expected value (aka, average)
This is when you want to find the variation of the avg of Y at an x_i
This can be found with the predict function, but sometimes also with confint, which is confusing to me. The relation between the expected value and the slope is not clear!
data(mtcars)
fit <- lm(mpg ~ I(wt - mean(wt)), data = mtcars)
confint(fit)
or
fit <- lm(mpg ~ wt, data = mtcars)
predict(fit, newdata = data.frame(wt = mean(mtcars$wt)), interval = "confidence")
I don’t get the difference but whatever

Prediction interval for the actual value of Y and not the expected value
fit <- lm(mpg ~ wt, data = mtcars)
predict(fit, newdata = data.frame(wt = 3), interval = "prediction")
This is it! for now. As needed we can go into depths!
Note
The summary of the fit gives a lot of info. The t value is computed under the null hypothesis (coefficient equal to 0) and is not the 95% confidence interval thingy!
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285 1.878 19.858 < 2e-16 ***
x -10.689 1.118 -9.559 1.29e-10 ***
That about sums it up! Cheers!
c7w3 multivariate regression
Multivariable regression analyses Why?
 If I were to present evidence of a relationship between breath mint
usage (mints per day, X) and pulmonary function (measured in FEV),
you would be skeptical.
 Likely, you would say, ‘smokers tend to use more breath mints than non smokers, smoking is related to a loss in pulmonary function. That’s probably the culprit.’
 If asked what would convince you, you would likely say, ‘If nonsmoking breath mint users had lower lung function than nonsmoking nonbreath mint users and, similarly, if smoking breath mint users had lower lung function than smoking nonbreath mint users, I’d be more inclined to believe you’.
 In other words, to even consider my results, I would have to demonstrate that they hold while holding smoking status fixed.
Multivariable regression analyses Why people are interested?
 An insurance company is interested in how last year’s claims can
predict a person’s time in the hospital this year.
 They want to use an enormous amount of data contained in claims to predict a single number. Simple linear regression is not equipped to handle more than one predictor.
 How can one generalize SLR to incorporate lots of regressors for the purpose of prediction?

What are the consequences of adding lots of regressors?
 Surely there must be consequences to throwing variables in that aren’t related to Y?
 Surely there must be consequences to omitting variables that are?
With sufficient random vectors you can come up with 0 residuals
The linear model Equations

The general linear model extends simple linear regression (SLR) by adding terms linearly into the model. \(Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_{p} X_{pi} + \epsilon_{i} = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i}\)
Outcome = Predictor * coefficients

Here $X_{1i}=1$ typically, so that an intercept is included.

Least squares (and hence ML estimates under iid Gaussianity of the errors) minimizes
\(\sum_{i=1}^n \left(Y_i - \sum_{k=1}^p X_{ki} \beta_k\right)^2\)
Minimizing the overall error by summing the squared error at each point $i$ over all observations

Note, the important linearity is linearity in the coefficients. Thus \(Y_i = \beta_1 X_{1i}^2 + \beta_2 X_{2i}^2 + \ldots + \beta_{p} X_{pi}^2 + \epsilon_{i}\) is still a linear model. (We’ve just squared the elements of the predictor variables.)
Whether the same beta holds when you square the variable, I am not sure, but moving on!
How to get estimates Expected values Least squares!
 Recall that the LS estimate for regression through the origin, $E[Y_i]=X_{1i}\beta_1$, was $\sum X_i Y_i / \sum X_i^2$.
 Let’s consider two regressors, $E[Y_i] = X_{1i}\beta_1 + X_{2i}\beta_2 = \mu_i$.
* Least squares tries to minimize
$$ \sum_{i=1}^n (Y_i - X_{1i} \beta_1 - X_{2i} \beta_2)^2 $$
Result
\[\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y \mid X_2} e_{i, X_1 \mid X_2}}{\sum_{i=1}^n e_{i, X_1 \mid X_2}^2}\]
That is, the regression estimate for $\beta_1$ is the regression through the origin estimate having regressed $X_2$ out of both the response and the predictor.
What does this even mean?
 (Similarly, the regression estimate for $\beta_2$ is the regression through the origin estimate having regressed $X_1$ out of both the response and the predictor.)
 More generally, multivariate regression estimates are exactly those having removed the linear relationship of the other variables from both the regressor and response.
Example with two variables, simple linear regression (important)

$Y_{i} = \beta_1 X_{1i} + \beta_2 X_{2i}$ where $X_{2i} = 1$ is an intercept term.
Think of X2 as the number of people who smoke, you fix it at 1 person smoking (aka the intercept term!)
 Notice the fitted coefficient of $X_{2i}$ on $Y_{i}$ is $\bar Y$

The residuals $e_{i, Y \mid X_2} = Y_i - \bar Y$

 Notice the fitted coefficient of $X_{2i}$ on $X_{1i}$ is $\bar X_1$

The residuals $e_{i, X_1 \mid X_2} = X_{1i} - \bar X_1$


Thus \(\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y \mid X_2} e_{i, X_1 \mid X_2}}{\sum_{i=1}^n e_{i, X_1 \mid X_2}^2} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2} = Cor(X, Y) \frac{Sd(Y)}{Sd(X)}\)

 X2 is just the constant regressor 1, i.e., the intercept
 regressing Y and X1 on X2 and taking residuals is the same as centering them: yc and xc
 compute the regression-through-the-origin fit for yc ~ xc,
 which gives the same slope as the usual correlation formula (a quick check is sketched below)
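A quick check (my own, using mtcars) that the SLR slope equals Cor(X, Y) Sd(Y)/Sd(X), and that regression through the origin on the centered variables gives the same slope:
```{r}
x <- mtcars$wt; y <- mtcars$mpg
coef(lm(y ~ x))[2]
cor(x, y) * sd(y) / sd(x)
yc <- y - mean(y); xc <- x - mean(x)
coef(lm(yc ~ xc - 1))   ## regression through the origin on centered data: same slope
```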
The general case
 Least squares solutions have to minimize \(\sum_{i=1}^n (Y_i - X_{1i}\beta_1 - \ldots - X_{pi}\beta_p)^2\)
 The least squares estimate for the coefficient of a multivariate regression model is exactly regression through the origin with the linear relationships with the other regressors removed from both the regressor and the outcome by taking residuals.
 In this sense, multivariate regression “adjusts” a coefficient for the linear impact of the other variables.
For beta_1: all the variables from X2 to Xp have been "linearly" removed from both Y and X1.
Demonstration that it works using an example
Linear model with two variables
n = 100; x = rnorm(n); x2 = rnorm(n); x3 = rnorm(n)
y = 1 + x + x2 + x3 + rnorm(n, sd = .1)
ey = resid(lm(y ~ x2 + x3))
ex = resid(lm(x ~ x2 + x3))
sum(ey * ex) / sum(ex ^ 2)
coef(lm(ey ~ ex - 1))
coef(lm(y ~ x + x2 + x3))
So what you do is remove x2 and x3, i.e., take residuals from a fit on them. That way the residuals ey and ex are free of the x2 and x3 contributions. The `- 1` just drops the intercept: the residuals already have mean zero, so regression through the origin is appropriate (an intercept would be estimated as essentially zero anyway).
Interpretation of the coefficients
\(E[Y \mid X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k\)
\[E[Y \mid X_1 = x_1 + 1, \ldots, X_p = x_p] = (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k} \beta_k\]
\[E[Y \mid X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y \mid X_1 = x_1, \ldots, X_p = x_p] = (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k} \beta_k - \sum_{k=1}^p x_{k} \beta_k = \beta_1\]
So that the interpretation of a multivariate regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
In the next lecture, we'll do examples and go over context-specific interpretations.
Fitted values, residuals and residual variation
All of our SLR quantities can be extended to linear models
 Model $Y_i = \sum_{k=1}^p X_{ik} \beta_{k} + \epsilon_{i}$ where $\epsilon_i \sim N(0, \sigma^2)$
 Fitted responses $\hat Y_i = \sum_{k=1}^p X_{ik} \hat \beta_{k}$
 Residuals $e_i = Y_i - \hat Y_i$
 Variance estimate $\hat \sigma^2 = \frac{1}{n-p} \sum_{i=1}^n e_i ^2$
 To get predicted responses at new values, $x_1, \ldots, x_p$, simply plug them into the linear model $\sum_{k=1}^p x_{k} \hat \beta_{k}$
 Coefficients have standard errors, $\hat \sigma_{\hat \beta_k}$, and $\frac{\hat \beta_k - \beta_k}{\hat \sigma_{\hat \beta_k}}$ follows a $T$ distribution with $n-p$ degrees of freedom.
 Predicted responses have standard errors and we can calculate predicted and expected response intervals. (A quick hand computation of these quantities is sketched below.)
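A minimal sketch (my own, on the swiss data) computing the fitted values, residuals and the variance estimate by hand and checking against summary():
```{r}
data(swiss)
fit <- lm(Fertility ~ Agriculture + Education, data = swiss)
yhat <- fitted(fit)                       ## fitted responses
e <- resid(fit)                           ## residuals
all.equal(as.numeric(e), swiss$Fertility - as.numeric(yhat))  ## e_i = Y_i - Yhat_i
n <- nrow(swiss); p <- length(coef(fit))  ## p counts the intercept too
sigma2 <- sum(e^2) / (n - p)              ## variance estimate
c(sqrt(sigma2), summary(fit)$sigma)       ## identical
```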
Linear models
 Linear models are the single most important applied statistical and machine learning technique, by far.
 Some amazing things that you can accomplish with linear models
 Decompose a signal into its harmonics.
 Flexibly fit complicated functions.
 Fit factor variables as predictors.
 Uncover complex multivariate relationships with the response.
 Build accurate prediction models.
Multivariate examples (c7w3)
Data set for discussion
require(datasets); data(swiss); ?swiss
Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.
A data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].
 [,1] Fertility a common standardized fertility measure
 [,2] Agriculture % of males involved in agriculture as occupation
 [,3] Examination % draftees receiving highest mark on army examination
 [,4] Education % education beyond primary school for draftees
 [,5] Catholic % catholic (as opposed to protestant)
 [,6] Infant.Mortality live births who live less than 1 year
All variables but Fertility give proportions of the population.
Nice plot that shows all variations
```{r, fig.height=6, fig.width=10, echo = FALSE}
require(datasets); data(swiss); require(GGally); require(ggplot2)
g = ggpairs(swiss, lower = list(continuous = "smooth"), params = c(method = "loess"))
g
```
### Calling `lm` for all variables "lm(y~.)"
`summary(lm(Fertility ~ . , data = swiss))`
```{r, echo = FALSE}
summary(lm(Fertility ~ . , data = swiss))$coefficients
```
Agriculture -0.17: the slope without the contributions of the other variables!
Example interpretation (importance of removing variables)
 Agriculture is expressed in percentages (0  100)
 Estimate is -0.1721.
 Our model estimates an expected 0.17 decrease in standardized fertility for every 1% increase in the percentage of males involved in agriculture, holding the remaining variables constant.
 The ttest for $H_0: \beta_{Agri} = 0$ versus $H_a: \beta_{Agri} \neq 0$ is significant.
 Interestingly, the unadjusted estimate is
summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients
0.19
“If there hasn’t been randomization to protect you from confounding, you are going to have to come up with a dynamic process of choosing which variables to include, etc…”
How can adjustment reverse the sign of an effect? Let’s try a simulation.
```{r, echo = TRUE}
n <- 100; x2 <- 1 : n; x1 <- .01 * x2 + runif(n, -.1, .1); y <- -x1 + x2 + rnorm(n, sd = .01)
summary(lm(y ~ x1))$coef
summary(lm(y ~ x1 + x2))$coef
```
**The first output coefficient is about 95 instead of the true "-1". Quite a blunder! The marginal
model is picking up the large contribution of x2 to y: x1 grows with x2 (it is .01 * x2 plus a
little uniform noise), so the unadjusted fit attributes x2's effect to x1.**

**Plot of y vs x1 and x2 to see all trends in one image!**
Does show the "fake relationship" between y and x1, we also see how y
is dependent on x2 and x1 is dependent on x2.
```{r, echo = FALSE, fig.height=5, fig.width=10, results = 'show'}
dat = data.frame(y = y, x1 = x1, x2 = x2, ey = resid(lm(y ~ x2)), ex1
= resid(lm(x1 ~ x2)))
library(ggplot2)
g = ggplot(dat, aes(y = y, x = x1, colour = x2))
g = g + geom_point(colour="grey50", size = 5) + geom_smooth(method = lm, se = FALSE, colour = "black")
g = g + geom_point(size = 4)
g
```
So we need to remove the effect of x2 from both y and x1 i.e., by
taking the residual, which is the left over data.
```{r, echo = FALSE, fig.height=5, fig.width=10, results = 'show'}
g2 = ggplot(dat, aes(y = ey, x = ex1, colour = x2))
g2 = g2 + geom_point(colour = "grey50", size = 5) + geom_smooth(method = lm, se = FALSE, colour = "black") + geom_point(size = 4)
g2
```
It is very clear from this plot that, once x2's contribution is removed from both y and x1 (i.e., working with the residuals), the true negative relationship pops out.

### Back to this data set
* The sign reverses itself with the inclusion of Examination and
Education.
* The percent of males in the province working in agriculture is
negatively related to educational attainment (correlation of `r
cor(swiss$Agriculture, swiss$Education)`) and Education and
Examination (correlation of `r cor(swiss$Education,
swiss$Examination)`) are obviously measuring similar things.
* Is the positive marginal an artifact for not having accounted for,
say, Education level? (Education does have a stronger effect, by
the way.)
* At the minimum, anyone claiming that provinces that are more
agricultural have higher fertility rates would immediately be open
to criticism.

### What if we include an unnecessary variable?
z adds no new linear information, since it's a linear
combination of variables already included. R just drops
terms that are linear combinations of other terms.
```{r, echo = TRUE}
z <- swiss$Agriculture + swiss$Education
lm(Fertility ~ . + z, data = swiss)
```
When z is a linear combination of variables already in the model, R reports NA for its coefficient: once the components of z are accounted for, z carries no additional information.
Dummy variables are smart
 Consider the linear model \(Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_{i}\) where each $X_{i1}$ is binary so that it is a 1 if measurement $i$ is in a group and 0 otherwise. (Treated versus not in a clinical trial, for example.)
 Then for people in the group $E[Y_i] = \beta_0 + \beta_1$
 And for people not in the group $E[Y_i] = \beta_0$
 The LS fits work out so that $\hat \beta_0 + \hat \beta_1$ is the mean for those in the group and $\hat \beta_0$ is the mean for those not in the group.
 $\beta_1$ is interpreted as the increase or decrease in the mean comparing those in the group to those not.

Note that including a second binary variable that is 1 for those not in the group would be redundant: it would create three parameters to describe two means, and the two indicators are perfectly collinear (they sum to the intercept column), so one would be dropped.
 basically we can lm binary factor variables too! They give means, and changes compared to the reference group (a quick check is sketched below)!
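A minimal check (my own, using mtcars$am as the 0/1 variable) that the fit just returns the two group means:
```{r}
data(mtcars)
fit <- lm(mpg ~ am, data = mtcars)
coef(fit)[1]                  ## intercept = mean mpg for am == 0
coef(fit)[1] + coef(fit)[2]   ## intercept + slope = mean mpg for am == 1
tapply(mtcars$mpg, mtcars$am, mean)
```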
More than 2 levels
 Consider a multilevel factor level. For didactic reasons, let’s say a three level factor (example, US political party affiliation: Republican, Democrat, Independent)
 $Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2} \beta_2 + \epsilon_i$.
 $X_{i1}$ is 1 for Republicans and 0 otherwise.
 $X_{i2}$ is 1 for Democrats and 0 otherwise.
 If $i$ is Republican $E[Y_i] = \beta_0 +\beta_1$
 If $i$ is Democrat $E[Y_i] = \beta_0 + \beta_2$.
 If $i$ is Independent $E[Y_i] = \beta_0$.
 $\beta_1$ compares Republicans to Independents.
 $\beta_2$ compares Democrats to Independents.
 $\beta_1  \beta_2$ compares Republicans to Democrats.
 (Choice of reference category changes the interpretation.)
Warning: what you choose as your reference level changes the interpretation of the coefficients, since each coefficient is a comparison against that reference. The examples below make this concrete.
Insect Sprays (Understanding factors and dummy variables!)
```{r, echo = FALSE, fig.height=5, fig.width=5}
require(datasets); data(InsectSprays); require(stats); require(ggplot2)
g = ggplot(data = InsectSprays, aes(y = count, x = spray, fill = spray))
g = g + geom_violin(colour = "black", size = 2)
g = g + xlab("Type of spray") + ylab("Insect count")
g
```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.5000 1.132 12.8074 1.471e-19
sprayB 0.8333 1.601 0.5205 6.045e-01
sprayC -12.4167 1.601 -7.7550 7.267e-11
sprayD -9.5833 1.601 -5.9854 9.817e-08
sprayE -11.0000 1.601 -6.8702 2.754e-09
sprayF 2.1667 1.601 1.3532 1.806e-01
Spray A is missing because it is the reference, absorbed into the intercept B0 with mean 14.5; everything else is compared to it, i.e., the mean of spray B is 14.5 + 0.833, etc.
Spray A is the reference level (not one of the fitted dummy variables)!
Linear model fit, group A is the reference
```{r, echo= TRUE}
summary(lm(count ~ spray, data = InsectSprays))$coef
```
#### Hard coding the dummy variables
```{r, echo= TRUE}
summary(lm(count ~
I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
I(1 * (spray == 'F'))
, data = InsectSprays))$coef
```
#### What if we include all 6?
```{r, echo= TRUE}
summary(lm(count ~
I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
I(1 * (spray == 'F')) + I(1 * (spray == 'A')), data = InsectSprays))$coef
```
#### What if we omit the intercept?
```{r, echo= TRUE}
summary(lm(count ~ spray - 1, data = InsectSprays))$coef
library(dplyr)
summarise(group_by(InsectSprays, spray), mn = mean(count))
```
When you omit the intercept, each coefficient is a group mean, and all the p-values and t statistics test the null hypothesis that the group mean is 0, rather than a comparison against spray A.
Reordering the levels
spray2 <- relevel(InsectSprays$spray, "C")
summary(lm(count ~ spray2, data = InsectSprays))$coef
Summary
 If we treat Spray as a factor, R includes an intercept and omits the alphabetically first level of the factor.
 All ttests are for comparisons of Sprays versus Spray A.
 Empirical mean for A is the intercept.
 Other group means are the intercept plus their coefficient.
 If we omit an intercept, then it includes terms for all levels of the factor.
 Group means are the coefficients.
 Tests are tests of whether the groups are different than zero. (Are the expected counts zero for that spray.)
 If we want comparisons between, say, Spray B and C, we could refit the model with C (or B) as the reference level.
Other thoughts on this data
 Counts are bounded from below by 0, violates the assumption of normality of the errors.
 Also there are counts near zero, so both the actual assumption and the intent of the assumption are violated.
 Variance does not appear to be constant.
 Perhaps taking logs of the counts would help.
 There are 0 counts, so maybe log(Count + 1)
 Also, we'll cover Poisson GLMs for fitting count data. (Both options are sketched briefly below.)
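A minimal sketch (my own, not from the lecture) of the two options mentioned above for the InsectSprays counts:
```{r}
data(InsectSprays)
summary(lm(log(count + 1) ~ spray, data = InsectSprays))$coef            ## the log(Count + 1) trick
summary(glm(count ~ spray, family = poisson, data = InsectSprays))$coef  ## Poisson GLM, log link
```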
Modelling the data as b0 + b1*X1 + b2*X2 and as b0 + b1*X1 + b2*X2 + b3*X1*X2
Two types of modelling are possible, leading to either
two intercepts and one common slope (using +)
or
two intercepts and two different slopes (using * for the interaction)
Recall the swiss
data set
library(datasets); data(swiss)
head(swiss)
Create a binary variable
library(dplyr);
swiss = mutate(swiss, CatholicBin = 1 * (Catholic > 50))
Plot the data
```{r, fig.height=5, fig.width=8, echo = FALSE}
g = ggplot(swiss, aes(x = Agriculture, y = Fertility, colour = factor(CatholicBin)))
g = g + geom_point(size = 6, colour = "black") + geom_point(size = 4)
g = g + xlab("% in Agriculture") + ylab("Fertility")
g
```
#### No effect of religion
```{r, echo = TRUE}
summary(lm(Fertility ~ Agriculture, data = swiss))$coef
```
The associated fitted line
```{r, echo = FALSE, fig.width=8, fig.height=5}
fit = lm(Fertility ~ Agriculture, data = swiss)
g1 = g
g1 = g1 + geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], size = 2)
g1
```

#### Parallel lines, i.e., same slope different intercepts (using +)
```{r, echo = TRUE}
summary(lm(Fertility ~ Agriculture + factor(CatholicBin), data = swiss))$coef
```
```{r, echo = FALSE, fig.width=5, fig.height=4}
fit = lm(Fertility ~ Agriculture + factor(CatholicBin), data = swiss)
g1 = g
g1 = g1 + geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], size = 2)
g1 = g1 + geom_abline(intercept = coef(fit)[1] + coef(fit)[3], slope = coef(fit)[2], size = 2)
g1
```
1. b0 + b1*X + b2*0 (non-Catholic majority)
2. b0 + b1*X + b2*1 (Catholic majority)

#### Lines with different slopes and intercepts (using * instead of +)
```{r, echo = TRUE}
summary(lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss))$coef
```
```{r, echo = FALSE, fig.width=5, fig.height=4}
fit = lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss)
g1 = g
g1 = g1 + geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], size = 2)
g1 = g1 + geom_abline(intercept = coef(fit)[1] + coef(fit)[3], slope = coef(fit)[2] + coef(fit)[4], size = 2)
g1
```

#### Just to show you it can be done
```{r, echo = TRUE}
summary(lm(Fertility ~ Agriculture + Agriculture : factor(CatholicBin), data = swiss))$coef
```
adjustment (playing with different variables in lm) c7w3
Consider the following simulated data
Code for the first plot, rest omitted (See the git repo for the rest of the code.)
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), runif(n/2));
beta0 <- 0; beta1 <- 2; tau <- 1; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1 : (n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)
Simulation 1
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), runif(n/2));
beta0 <- 0; beta1 <- 2; tau <- 1; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1 : (n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)
```

### Discussion
#### Some things to note in this simulation
* The X variable is unrelated to group status
* The X variable is related to Y, but the intercept depends
on group status.
* The group variable is related to Y.
* The relationship between group status and Y is constant depending on X.
* The relationship between group and Y disregarding X is about the same as holding X constant

### Simulation 2
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), 1.5 + runif(n/2));
beta0 <- 0; beta1 <- 2; tau <- 0; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1 : (n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)
```
Discussion
Some things to note in this simulation
 The X variable is highly related to group status
 The X variable is related to Y, the intercept
doesn’t depend on the group variable.
 The X variable remains related to Y holding group status constant
 The group variable is marginally related to Y disregarding X.
 The model would estimate no adjusted effect due to group.
 There isn’t any data to inform the relationship between group and Y.
 This conclusion is entirely based on the model.
Simulation 3
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), .9 + runif(n/2));
beta0 <- 0; beta1 <- 2; tau <- -1; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1 : (n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)
```

### Discussion
#### Some things to note in this simulation
* Marginal association has red group higher than blue.
* Adjusted relationship has blue group higher than red.
* Group status related to X.
* There is some direct evidence for comparing red and blue
holding X fixed.

### Simulation 4
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(.5 + runif(n/2), runif(n/2));
beta0 <- 0; beta1 <- 2; tau <- 1; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1 : (n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)
```
Discussion
Some things to note in this simulation
 No marginal association between group status and Y.
 Strong adjusted relationship.
 Group status not related to X.
 There is lots of direct evidence for comparing red and blue holding X fixed.
Simulation 5
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2, -1, 1), runif(n/2, -1, 1));
beta0 <- 0; beta1 <- 2; tau <- 0; tau1 <- -4; sigma <- .2
y <- beta0 + x * beta1 + t * tau + t * x * tau1 + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1 : (n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3)
fit <- lm(y ~ x + t + I(x * t))
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2] + coef(fit)[4], lwd = 3)
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)
```

### Discussion
#### Some things to note from this simulation
* There is no such thing as a group effect here.
* The impact of group reverses itself depending on X.
* Both intercept and slope depends on group.
* Group status and X unrelated.
* There's lots of information about group effects holding X fixed.

#### Simulation 6
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
p <- 1
n <- 100; x2 <- runif(n); x1 <- p * runif(n) - (1 - p) * x2
beta0 <- 0; beta1 <- 1; tau <- 4; sigma <- .01
y <- beta0 + x1 * beta1 + tau * x2 + rnorm(n, sd = sigma)
plot(x1, y, type = "n", frame = FALSE)
abline(lm(y ~ x1), lwd = 2)
co.pal <- heat.colors(n)
points(x1, y, pch = 21, col = "black", bg = co.pal[round((n - 1) * x2 + 1)], cex = 2)
```
Do this to investigate the bivariate relationship
library(rgl)
plot3d(x1, x2, y)
Residual relationship
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)), frame = FALSE, col = "black", bg = "lightblue", pch = 21, cex = 2)
abline(lm(I(resid(lm(x1 ~ x2))) ~ I(resid(lm(y ~ x2)))), lwd = 2)
```

### Discussion
#### Some things to note from this simulation
* X1 unrelated to X2
* X2 strongly related to Y
* Adjusted relationship between X1 and Y largely unchanged
by considering X2.
* Almost no residual variability after accounting for X2.

### Some final thoughts
* Modeling multivariate relationships is difficult.
* Play around with simulations to see how the
inclusion or exclusion of another variable can
change analyses.
* The results of these analyses deal with the
impact of variables on associations.
* Ascertaining mechanisms or cause are difficult subjects
to be added on top of difficulty in understanding multivariate associations.
## Diagnostics c7w3
### The linear model
* Specified as $Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i}$
* We'll also assume here that $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$
* Define the residuals as
$e_i = Y_i - \hat Y_i = Y_i - \sum_{k=1}^p X_{ik} \hat \beta_k$
* Our estimate of residual variation is $\hat \sigma^2 = \frac{\sum_{i=1}^n e_i^2}{n-p}$, the $n-p$ so that $E[\hat \sigma^2] = \sigma^2$

```{r, fig.height = 5, fig.width = 5}
data(swiss); par(mfrow = c(2, 2))
fit <- lm(Fertility ~ . , data = swiss); plot(fit)
```
Influential, high leverage and outlying points
```{r, fig.height = 5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; x <- rnorm(n); y <- x + rnorm(n, sd = .3)
plot(c(-3, 6), c(-3, 6), type = "n", frame = FALSE, xlab = "X", ylab = "Y")
abline(lm(y ~ x), lwd = 2)
points(x, y, cex = 2, bg = "lightblue", col = "black", pch = 21)
points(0, 0, cex = 2, bg = "darkorange", col = "black", pch = 21)
points(0, 5, cex = 2, bg = "darkorange", col = "black", pch = 21)
points(5, 5, cex = 2, bg = "darkorange", col = "black", pch = 21)
points(5, 0, cex = 2, bg = "darkorange", col = "black", pch = 21)
```

### Summary of the plot
Calling a point an outlier is vague.
* Outliers can be the result of spurious or real processes.
* Outliers can have varying degrees of influence.
* Outliers can conform to the regression relationship (i.e being marginally outlying in X or Y, but not outlying given the regression relationship).
* Upper left hand point has low leverage, low influence, outlies in a way not conforming to the regression relationship.
* Lower left hand point has low leverage, low influence and is not to be considered an outlier in any sense.
* Upper right hand point has high leverage, but chooses not to exert it and thus would have low actual influence by conforming to the regression relationship of the other points.
* Lower right hand point has high leverage and would exert it if it were included in the fit.

### Influence measures of points! of data not variables
* Do `?influence.measures` to see the full suite of influence measures
in stats. The measures include
* `rstandard` - standardized residuals, i.e., residuals divided by their
standard deviations
* `rstudent` - standardized residuals, residuals divided by their
standard deviations, where the ith data point was deleted in the
calculation of the standard deviation for the residual to follow a
t distribution
* `hatvalues` - measures of leverage
* `dffits` - change in the predicted response when the $i^{th}$
point is deleted in fitting the model. **one value for one point**.
* `dfbetas` - change in **individual coefficients** when the $i^{th}$
point is deleted in fitting the model.
* `cooks.distance` - overall change in the coefficients when the
$i^{th}$ point is deleted.
* `resid` - returns the ordinary residuals
* `resid(fit) / (1 - hatvalues(fit))` where `fit` is the linear
model fit returns the PRESS residuals, i.e. the leave one out
cross validation residuals - the difference in the response and
the predicted response at data point $i$, where it was not
included in the model fitting (a quick check of this identity is sketched after this list).
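A small check (my own, on mtcars) that `resid(fit) / (1 - hatvalues(fit))` really reproduces the leave-one-out (PRESS) residuals:
```{r}
fit <- lm(mpg ~ wt, data = mtcars)
press <- resid(fit) / (1 - hatvalues(fit))
loo <- sapply(1 : nrow(mtcars), function(i) {
  refit <- lm(mpg ~ wt, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(refit, newdata = mtcars[i, ])
})
max(abs(press - loo))   ## essentially zero
```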

### How do I use all of these things?
* Be wary of simplistic rules for diagnostic plots and measures. The
use of these tools is context specific. It's better to understand
what they are trying to accomplish and use them judiciously.
* Not all of the measures have meaningful absolute scales. You can
look at them relative to the values across the data.
* They probe your data in different ways to diagnose different
problems.
* Patterns in your residual plots generally indicate some poor aspect
of model fit. These can include:
* Heteroskedasticity (non constant variance).
* Missing model terms.
* Temporal patterns (plot residuals versus collection order).
* Residual QQ plots investigate normality of the errors.
* Leverage measures (hat values) can be useful for diagnosing data
entry errors.
* Influence measures get to the bottom line, 'how does deleting or
including this point impact a particular aspect of the model'.
**Plot residual vs fitted values** to check for systematic patterns
that you are missing

### Case 1
```{r, fig.height=5, fig.width=5, echo=FALSE}
n <- 100; x <- c(10, rnorm(n)); y <- c(10, c(rnorm(n)))
plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
abline(lm(y ~ x))
```
The code
n <- 100; x <- c(10, rnorm(n)); y <- c(10, c(rnorm(n)))
plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
abline(lm(y ~ x))
 The point
c(10, 10)
has created a strong regression relationship where there shouldn’t be one.
Showing a couple of the diagnostic values
fit <- lm(y ~ x)
round(dfbetas(fit)[1 : 10, 2], 3)
round(hatvalues(fit)[1 : 10], 3)
Case 2
```{r, fig.height=5, fig.width=5, echo=FALSE}
x <- rnorm(n); y <- x + rnorm(n, sd = .3)
x <- c(5, x); y <- c(5, y)
plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
fit2 <- lm(y ~ x)
abline(fit2)
```

### Looking at some of the diagnostics
```{r, echo = TRUE}
round(dfbetas(fit2)[1 : 10, 2], 3)
round(hatvalues(fit2)[1 : 10], 3)
```
Example described by Stefanski TAS 2007 Vol 61.
```{r, fig.height=4, fig.width=4}
## Don't everyone hit this server at once. Read the paper first.
dat <- read.table('http://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_flat.txt', header = FALSE)
pairs(dat)
```

### Got our P-values, should we bother to do a residual plot?
```{r}
summary(lm(V1 ~ . - 1, data = dat))$coef
```
Residual plots zoom in on poor model fits; they show us how much of the systematic part we have captured. Run the above to see more!
Residual plot
P-values significant, O RLY?
```{r, fig.height=4, fig.width=4, echo = TRUE}
fit <- lm(V1 ~ . - 1, data = dat); plot(predict(fit), resid(fit), pch = '.')
```
Shows how bad the fit is as we completely missed the systematic
patterns!

### Back to the Swiss data
```{r, fig.height = 5, fig.width = 5, echo=FALSE}
data(swiss); par(mfrow = c(2, 2))
fit <- lm(Fertility ~ . , data = swiss); plot(fit)
```
Multiple variables regression (c7w3)
Multivariable regression
 We have an entire class on prediction and machine learning, so we’ll
focus on modeling.
 Prediction has a different set of criteria, needs for interpretability and standards for generalizability.
 In modeling, our interest lies in parsimonious, interpretable representations of the data that enhance our understanding of the phenomena under study.
 A model is a lens through which to look at your data. (I attribute this quote to Scott Zeger)
 Under this philosophy, what’s the right model? Whatever model connects the data to a true, parsimonious statement about what you’re studying.
 There are nearly uncountable ways that a model can be wrong, in this lecture, we’ll focus on variable inclusion and exclusion.
 Like nearly all aspects of statistics, good modeling decisions are
context dependent.
 A good model for prediction versus one for studying mechanisms versus one for trying to establish causal effects may not be the same. If only I could give an example!
The Rumsfeldian triplet
There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know. Donald Rumsfeld
In our context
 (Known knowns) Regressors that we know we should check to include in the model and have.
 (Known Unknowns) Regressors that we would like to include in the model, but don’t have.
 (Unknown Unknowns) Regressors that we don’t even know about that we should have included in the model.
General rules (very important!)
 Omitting variables results in bias in the coefficients of
interest - unless the included regressors are uncorrelated with the omitted
ones.

This is why we randomize treatments: it attempts to uncorrelate our treatment indicator with the variables that we don't put in the model. We need to make our own example of this when writing (a small simulation of omitted-variable bias is sketched after this list). That's why A/B testing and clinical trials are so powerful; but if there are too many confounding things, then even randomization is not going to help you.

(If there’s too many unobserved confounding variables, even randomization won’t help you.)

 Including variables that we shouldn’t have increases standard errors
of the regression variables. No BIAS!!!!!!
 Actually, including any new variables increases (actual, not estimated) standard errors of other regressors. So we don't want to idly throw variables into the model.
 The model must tend toward perfect fit as the number of nonredundant regressors approaches $n$.
 $R^2$ increases monotonically as more regressors are included.
 The SSE decreases monotonically as more regressors are included.
Need to check these out!
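A small simulation (my own, not from the lecture) of the first rule: omitting a correlated regressor biases the coefficient of interest, while including it recovers the truth:
```{r}
set.seed(2)
n <- 1000
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n)          ## x1 is correlated with x2
y  <- 1 + 1 * x1 + 2 * x2 + rnorm(n)
coef(lm(y ~ x1))[2]        ## omitting x2: clearly biased away from the true value 1
coef(lm(y ~ x1 + x2))[2]   ## including x2: close to 1
```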
Plot of $R^2$ versus $n$
For simulations, as the number of variables included increases to $n = 100$. No actual regression relationship exists in any simulation.
```{r, fig.height=5, fig.width=5, echo=FALSE}
n <- 100
plot(c(1, n), 0 : 1, type = "n", frame = FALSE, xlab = "p", ylab = "R^2")
r <- sapply(1 : n, function(p) {
  y <- rnorm(n); x <- matrix(rnorm(n * p), n, p)
  summary(lm(y ~ x))$r.squared
})
lines(1 : n, r, lwd = 2)
abline(h = 1)
```
Awesome! shows exactly what the problem is during my DP sessions!

### Variance inflation; all three are rnormed; including variables!
```{r, echo = TRUE}
n <- 100; nosim <- 1000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n);
betas <- sapply(1 : nosim, function(i){
y <- x1 + rnorm(n, sd = .3)
c(coef(lm(y ~ x1))[2],
coef(lm(y ~ x1 + x2))[2],
coef(lm(y ~ x1 + x2 + x3))[2])
})
round(apply(betas, 1, sd), 5)
```
It is the actual variance of the coefficient (estimated here by simulation across the nosim repetitions) that gets inflated, not necessarily the estimated variance from a single fit.
The tiny differences below are just Monte Carlo error.
As long as the added variables are independent of x1, we seem to be cool!
x1 x1 x1
0.02839 0.02872 0.02884
Variance inflation (when there is a relation between x1 and x3)
```{r, echo = TRUE}
n <- 100; nosim <- 1000
x1 <- rnorm(n); x2 <- x1/sqrt(2) + rnorm(n) /sqrt(2)
x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2);
betas <- sapply(1 : nosim, function(i){
y <- x1 + rnorm(n, sd = .3)
c(coef(lm(y ~ x1))[2],
coef(lm(y ~ x1 + x2))[2],
coef(lm(y ~ x1 + x2 + x3))[2])
})
round(apply(betas, 1, sd), 5)
```
x1 x1 x1
0.03131 0.04270 0.09653
If the variable you include (x3) is highly correlated with the variable you are interested in (x1), then you inflate its variance badly (roughly 3 times, as shown above!)
Think about what you wanna include and why!
If you included height and weight in the same model, both of them are highly correlated. Maybe check the correlation before adding variables!
As long as they are randomized it is still ok? or it somehow reduces the correlation (investigate!)
Variance inflation factors (dp!)
 Notice variance inflation was much worse when we included a variable
that was highly related to x1.
 We don't know $\sigma$, so we can only estimate the increase in the actual standard error of the coefficients for including a regressor.
 However, $\sigma$ drops out of the relative standard errors. If one sequentially adds variables, one can check the variance (or sd) inflation for including each one.
 When the other regressors are actually orthogonal to the regressor of interest, then there is no variance inflation.

The variance inflation factor (VIF) is the increase in the variance for the ith regressor compared to the ideal setting where it is orthogonal to the other regressors.
Ratio of current case with all regressors and the case with noncorrelated regressors! Look into how this works out! Shown below! * (The square root of the VIF is the increase in the sd ...)
 Remember, variance inflation is only part of the picture. We want to include certain variables, even if they dramatically inflate our variance.
Revisting our previous simulation
```{r, echo = TRUE}
## doesn't depend on which y you use
y <- x1 + rnorm(n, sd = .3)
a <- summary(lm(y ~ x1))$cov.unscaled[2,2]
c(summary(lm(y ~ x1 + x2))$cov.unscaled[2,2],
  summary(lm(y ~ x1 + x2 + x3))$cov.unscaled[2,2]) / a
temp <- apply(betas, 1, var); temp[2 : 3] / temp[1]
```

### Swiss data
```{r}
data(swiss);
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
a <- summary(fit1)$cov.unscaled[2,2]
fit2 <- update(fit1, Fertility ~ Agriculture + Examination)
fit3 <- update(fit1, Fertility ~ Agriculture + Examination + Education)
c(summary(fit2)$cov.unscaled[2,2],
summary(fit3)$cov.unscaled[2,2]) / a
```
Swiss data VIFs,
library(car)
fit <- lm(Fertility ~ . , data = swiss)
vif(fit)
sqrt(vif(fit)) #I prefer sd
Agriculture Examination Education Catholic Infant.Mortality
1.511 1.917 1.666 1.392 1.052
We see Infant.Mortality and expect it to be not too bothered by the other variables, but Examination and Education clearly are (they measure similar things).
What about residual variance estimation?
 Assuming that the model is linear with additive iid errors (with
finite variance), we can mathematically describe the impact of
omitting necessary variables or including unnecessary ones.
 If we underfit the model, the variance estimate is biased.
 If we correctly or overfit the model, including all necessary
covariates and/or unnecessary covariates, the variance estimate is
unbiased.
 However, the variance of the variance estimate is larger if we include unnecessary variables. (A small simulation of these points is sketched below.)
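A small simulation (my own, not from the lecture) of these statements: the underfit model's variance estimate is biased upward, while the correct and overfit models are roughly unbiased:
```{r}
set.seed(3)
n <- 100; nosim <- 1000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   ## x3 is an unnecessary regressor
sims <- sapply(1 : nosim, function(i) {
  y <- x1 + x2 + rnorm(n)                        ## true residual variance is 1
  c(under = summary(lm(y ~ x1))$sigma^2,         ## omits x2
    right = summary(lm(y ~ x1 + x2))$sigma^2,
    over  = summary(lm(y ~ x1 + x2 + x3))$sigma^2)
})
rowMeans(sims)   ## "under" is well above 1; "right" and "over" are close to 1
```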
Covariate model selection
 Automated covariate selection is a difficult topic. It depends
heavily on how rich of a covariate space one wants to explore.
 The space of models explodes quickly as you add interactions and polynomial terms.
 In the prediction class, we’ll cover many modern methods for traversing large model spaces for the purposes of prediction.
 Principal components or factor analytic models on covariates are often useful for reducing complex covariate spaces.

Good design can often eliminate the need for complex model searches at analyses; though often control over the design is limited; Randomization for example…
For example, in biostatistics, to measure the effectiveness of aspirin, a crossover design gives the same person both treatments, with a washout period in between before testing the drug the next time!
 If the models of interest are nested and without lots of parameters differentiating them, it’s fairly uncontroversial to use nested likelihood ratio tests. (Example to follow.)
 My favorite approach is as follows. Given a coefficient that I'm interested in, I like to use covariate adjustment and multiple models to probe that effect to evaluate it for robustness and to see what other covariates knock it out. This isn't a terribly systematic approach, but it tends to teach you a lot about the data as you get your hands dirty.
How to do nested model testing in R
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit1, Fertility ~ Agriculture + Examination + Education)
fit5 <- update(fit1, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
anova(fit1, fit3, fit5)
Analysis of Variance Table
Model 1: Fertility ~ Agriculture
Model 2: Fertility ~ Agriculture + Examination + Education
Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
Infant.Mortality
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 6283
2 43 3181 2 3102 30.2 8.6e-09 ***
3 41 2105 2 1076 10.5 0.00021 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This looks like a good way to understand again, what is a good parameter to include! Use above only when there is possibility of nesting.
Summary

use anova of 2 fits to determine in a nested fashion if a variable adds value or not!

otherwise just look at the slopes to determine whether they have an impact or not

use influence measures to determine if certain points influence the data!

VIF is to measure the extra variance from including a regressor!
 more variables => more variance if the regressors are correlated rather than orthogonal
 example, Examination and Education!

fewer variables => biased results if the omitted variables are confounders
 always try to remove the contributions of other variables by looking at residuals!
Quiz

Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as confounder. Give the adjusted estimate for the expected change in mpg comparing 8 cylinders to 4.
fit <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
summary(fit)$coef

Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a possible confounding variable. Compare the effect of 8 versus 4 cylinders on mpg for the adjusted and unadjusted by weight models. Here, adjusted means including the weight variable as a term in the regression model and unadjusted means the model without weight included. What can be said about the effect comparing 8 and 4 cylinders after looking at models with and without weight included?
Holding weight constant, cylinder appears to have less of an impact on mpg than if weight is disregarded.
It is both true and sensible that including weight would attenuate the effect of number of cylinders on mpg.
Not sure what they mean by more impact (p-value, anova tests, slopes, etc…)

Consider the mtcars data set. Fit a model with mpg as the outcome that considers number of cylinders as a factor variable and weight as confounder. Now fit a second model with mpg as the outcome that considers the interaction between number of cylinders (as a factor variable) and weight. Give the P-value for the likelihood ratio test comparing the two models and suggest a model using 0.05 as a type I error rate significance benchmark.
The P-value is larger than 0.05. So, according to our criterion, we would fail to reject, which suggests that the interaction terms may not be necessary.
fit1 <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
fit2 <- lm(mpg ~ factor(cyl) * wt, data = mtcars)
summary(fit1)$coef
summary(fit2)$coef
anova(fit1, fit2)

Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight included in the model as
lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars)
The estimated expected change in MPG per one ton increase in weight for a specific number of cylinders (4, 6, 8).

Consider the following data set
x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)
Give the hat diagonal for the most influential point
influence(lm(y ~ x))$hat
## showing how it's actually calculated
xm <- cbind(1, x)
diag(xm %*% solve(t(xm) %*% xm) %*% t(xm))

Consider the following data set
x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)
Give the slope dfbeta for the point with the highest hat value.
influence.measures(lm(y ~ x))

Consider a regression relationship between Y and X with and without adjustment for a third variable Z. Which of the following is true about comparing the regression coefficient between Y and X with and without adjustment for Z.
It is possible for the coefficient to reverse sign after adjustment. For example, it can be strongly significant and positive before adjustment and strongly significant and negative after adjustment.
See lecture 02_03 for various examples.
optional quiz
Your assignment is to study how income varies across college major categories. Specifically answer: “Is there an association between college major category and income?”
To get started, start a new R/RStudio session with a clean workspace. To do this in R, you can use the q() function to quit, then reopen R. The easiest way to do this in RStudio is to quit RStudio entirely and reopen it. After you have started a new session, run the following commands. This will load a data.frame called college for you to work with.
install.packages("devtools")
devtools::install_github("jhudsl/collegeIncome")
library(collegeIncome)
data(college)
devtools::install_github("jhudsl/matahari")
library(matahari)
## To start and end
dance_start(value = FALSE, contents = FALSE)
dance_save("~/Desktop/college_major_analysis.rds")
Based on your analysis, would you conclude that there is a significant association between college major category and income?
Ans: No; for one, the p-values are high, indicating no significant differences between major categories!
Mean and boxplots don’t seem to show variability
My answer
df$ranks doesn't seem to be derived from df$median; in that case I don't understand the rank variable, so I almost don't look at it!
With this, I look at only the variables in isolation, i.e., removing the other contributions as needed: fit <- lm(median ~ . - major - major_code, data = df); summary(fit)
Result:
major_categoryAgriculture & Natural Resources 0.8917
major_categoryBiology & Life Science 0.4126
major_categoryBusiness 0.2229
major_categoryCommunications & Journalism 0.1541
major_categoryComputers & Mathematics 0.0680
Most of the categories are above a 20 % significance level and all above 5%. This goes to show that when I look at median income, and the categories, there is not a “significant” level of association!
Also, the same output shows the correlation between p25th and p75th. I guess this is enough! Using a lot of variables (as I have done above) might increase the variance, but I am not sure what the significance of that is for this case!
my work: published: in rds matahari stuff: https://github.com/agent18/linearregression/blob/master/college_major_analysis.rds
better solution by prof for later consumption!
https://d3c33hcgiwev3.cloudfront.net/fd9c88bac4ae1ea84e1994f141541ef2_solution.pdf?Expires=1556064000&Signature=XbGR26cRgoTYnJe209LjxxU4U1T~edPw7Ri0akbYlaX70aQodTO10w6YTlSRv~38qwJXFgL6AIcQpBLdKwgr~r8KkJhi6iXBME4iLJGcvNmnfcpaED6RFYH4Y1jy36CcKWJx9l6uHDSt5yzyRfH6qCUgStRsrHexcTw7Nh~BM&KeyPairId=APKAJLTNE6QMUY6HBC5A
I have also saved this here: https://github.com/agent18/linearregression/commit/a91f8a84668e3201146281ebbd65e8b35792509b
p hacking
Your assignment is to study how income varies across different categories of college majors. You will be using data from a study of recent college graduates. Make sure to use good practices that you have learned so far in this course and previous courses in the specialization. In particular, it is good practice to specify an analysis plan early in the process to avoid the "p-hacking" behavior of trying many analyses to find one that has desired results. If you want to learn more about "p-hacking", you can visit https://projects.fivethirtyeight.com/p-hacking/
GLM – Don’t read Horrible lecture and notes 0 EX (C7w4)
Linear models
 Linear models are the most useful applied statistical
technique. However, they are not without their limitations.
 Additive response models don't make much sense if the response is discrete, or strictly positive.
 Additive error models often don’t make sense, for example if the outcome has to be positive.
 Transformations are often hard to interpret.
 There’s value in modeling the data on the scale that it was collected.
 Particularly interpretable transformations, natural logarithms in specific, aren't applicable for negative or zero values.
Generalized linear models
 Introduced in a 1972 RSSB paper by Nelder and Wedderburn.
 Involves three components
 An exponential family model for the response.
 includes normal, binom and poisson
 A systematic component via a linear predictor.
 A link function that connects the means of the response to the linear predictor.
 An exponential family model for the response.
Example, linear models
 Assume that $Y_i \sim N(\mu_i, \sigma^2)$ (the Gaussian distribution is an exponential family distribution.)
 Define the linear predictor to be $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$.
 The link function as $g$ so that $g(\mu) = \eta$.
 For linear models $g(\mu) = \mu$ so that $\mu_i = \eta_i$
 This yields the same likelihood model as our additive error Gaussian linear model \(Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i}\) where $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$
Example, logistic regression
 Assume that $Y_i \sim Bernoulli(\mu_i)$ so that $E[Y_i] = \mu_i$ where $0\leq \mu_i \leq 1$.
 Linear predictor $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$
 Link function $g(\mu) = \eta = \log\left( \frac{\mu}{1  \mu}\right)$ $g$ is the (natural) log odds, referred to as the logit.
 Note then we can invert the logit function as \(\mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} ~~~\mbox{and}~~~ 1 - \mu_i = \frac{1}{1 + \exp(\eta_i)}\) Thus the likelihood is \(\prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1-y_i} = \exp\left(\sum_{i=1}^n y_i \eta_i \right) \prod_{i=1}^n (1 + \exp(\eta_i))^{-1}\)
Example, Poisson regression
 Assume that $Y_i \sim Poisson(\mu_i)$ so that $E[Y_i] = \mu_i$ where $0\leq \mu_i$
 Linear predictor $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$
 Link function $g(\mu) = \eta = \log(\mu)$
 Recall that $e^x$ is the inverse of $\log(x)$ so that \(\mu_i = e^{\eta_i}\) Thus, the likelihood is \(\prod_{i=1}^n (y_i !)^{-1} \mu_i^{y_i}e^{-\mu_i} \propto \exp\left(\sum_{i=1}^n y_i \eta_i - \sum_{i=1}^n \mu_i\right)\)
Some things to note
 In each case, the only way in which the likelihood depends on the data is through \(\sum_{i=1}^n y_i \eta_i = \sum_{i=1}^n y_i\sum_{k=1}^p X_{ik} \beta_k = \sum_{k=1}^p \beta_k\sum_{i=1}^n X_{ik} y_i\) Thus we don't need the full data, only $\sum_{i=1}^n X_{ik} y_i$. This simplification is a consequence of choosing so-called 'canonical' link functions.
 (This has to be derived.) All models achieve their maximum at the root of the so-called normal equations \(0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i\) where $W_i$ are the derivative of the inverse of the link function.
About variances
\(0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i\)
 For the linear model $Var(Y_i) = \sigma^2$ is constant.
 For the Bernoulli case $Var(Y_i) = \mu_i (1 - \mu_i)$
 For the Poisson case $Var(Y_i) = \mu_i$.
 In the latter cases, it is often relevant to have a more flexible variance model, even if it doesn't correspond to an actual likelihood \(0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i (1 - \mu_i ) } W_i ~~~\mbox{and}~~~ 0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i} W_i\)
 These are called ‘quasilikelihood’ normal equations
Odds and ends
 The normal equations have to be solved iteratively. Resulting in $\hat \beta_k$ and, if included, $\hat \phi$.
 Predicted linear predictor responses can be obtained as $\hat \eta = \sum_{k=1}^p X_k \hat \beta_k$
 Predicted mean responses as $\hat \mu = g^{-1}(\hat \eta)$
 Coefficients are interpreted as \(g(E[Y \mid X_k = x_k + 1, X_{\sim k} = x_{\sim k}]) - g(E[Y \mid X_k = x_k, X_{\sim k}=x_{\sim k}]) = \beta_k\) or the change in the link function of the expected response per unit change in $X_k$ holding other regressors constant.
 Variations on Newton/Raphson's algorithm are used to do it.
 Asymptotics are used for inference usually.
 Many of the ideas from linear models can be brought over to GLMs. (A minimal glm() call for each family is sketched below.)
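A minimal sketch (my own, on mtcars, just to show the glm() interface for the three families discussed above; the choice of variables is only illustrative):
```{r}
fit_gauss <- glm(mpg ~ wt, family = gaussian, data = mtcars)   ## identity link: same fit as lm
fit_binom <- glm(am ~ wt, family = binomial, data = mtcars)    ## logit link for a 0/1 outcome
fit_pois  <- glm(carb ~ wt, family = poisson, data = mtcars)   ## log link for a count outcome
coef(fit_binom)                               ## on the linear predictor (log odds) scale
head(predict(fit_binom, type = "response"))   ## back on the probability scale
```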
Logistic regression(c7w4)
Key ideas
 Frequently we care about outcomes that have two values
 Alive/dead
 Win/loss
 Success/Failure
 etc
 Called binary, Bernoulli or 0/1 outcomes
 Collection of exchangeable binary outcomes for the same covariate data are called binomial outcomes.
Example Baltimore Ravens win/loss
Ravens Data
```{r loadRavens,cache=TRUE}
download.file("https://dl.dropboxusercontent.com/u/7710864/data/ravensData.rda",
              destfile = "./data/ravensData.rda", method = "curl") # doesn't work
load("./data/ravensData.rda")
head(ravensData)
```
Download from here!
https://github.com/jtleek/dataanalysis/blob/master/week5/002binaryOutcomes/data/ravensData.rda?raw=true

### Linear regression
$$ RW_i = b_0 + b_1 RS_i + e_i $$
$RW_i$ - 1 if a Ravens win, 0 if not
$RS_i$ - Number of points Ravens scored
$b_0$ - probability of a Ravens win if they score 0 points (because
winning is a binary variable, values between 0 and 1 imply a
probability, I think!)
$b_1$ - increase in probability of a Ravens win for each additional point
$e_i$ - residual variation due

### Linear regression in R
```{r linearReg, dependson = "loadRavens", cache=TRUE}
lmRavens <- lm(ravensData$ravenWinNum ~ ravensData$ravenScore)
summary(lmRavens)$coef
summary(lm(ravensData$ravenWinNum~ravensData$ravenScore))
plot(ravensData$ravenScore,ravensData$ravenWinNum)
abline(lm(ravensData$ravenWinNum~ravensData$ravenScore))
```
Odds
Binary Outcome 0/1: \(RW_i\)
Probability (0,1): \(\rm{Pr}(RW_i \mid RS_i, b_0, b_1 )\)
Odds $(0,\infty)$: \(\frac{\rm{Pr}(RW_i \mid RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i \mid RS_i, b_0, b_1)}\)
Log odds $(-\infty,\infty)$:
\[\log\left(\frac{\rm{Pr}(RW_i \mid RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i \mid RS_i, b_0, b_1)}\right)\]
^^ is the LOGIT
Linear vs. logistic regression
Linear
Modeling it as linear
\[RW_i = b_0 + b_1 RS_i + e_i\]
or
\[E[RW_i \mid RS_i, b_0, b_1] = b_0 + b_1 RS_i\]
The expected value of a 0/1 outcome is a probability, e.g., 0.5 for a fair coin.
Logistic
Modelling it as odds!
x = b0+b1*RSi
p = e^x/(1+e^x) — expit (the inverse logit)
or
log(p/(1-p)) = x — logit — log odds
LOG ODDS OF PROBABILITY = f(regressors) = b0+b1*RSi
\[\rm{Pr}(RW_i \mid RS_i, b_0, b_1) = \frac{\exp(b_0 + b_1 RS_i)}{1 + \exp(b_0 + b_1 RS_i)}\]
\[\log\left(\frac{\rm{Pr}(RW_i \mid RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i \mid RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i\]
Interpreting Logistic Regression
\[\log\left(\frac{\rm{Pr}(RW_i \mid RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i \mid RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i\]
log(p/(1-p)) = x — logit — log odds ; x = b0 + b1*RSi
p = e^x/(1+e^x)
$b_0$ - Log odds of a Ravens win if they score zero points
$b_1$ - Log odds ratio of win probability for each point scored (compared to zero points)
$\exp(b_1)$ - Odds ratio of win probability for each point scored (compared to zero points)
A better explanation on the actual logic of exp(beta1) is given here.
In the end exp(beta1) gives the odds ratio! (Tiny logit/expit helpers are sketched below.)
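A tiny sketch (my own helper functions, just for intuition) for moving between probabilities and log odds:
```{r}
logit <- function(p) log(p / (1 - p))       ## probability -> log odds
expit <- function(x) exp(x) / (1 + exp(x))  ## log odds -> probability
logit(0.75)          ## log(3): odds of 3 to 1
expit(logit(0.75))   ## back to 0.75
```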
Odds
 Imagine that you are playing a game where you flip a coin with success probability $p$.
 If it comes up heads, you win $X$. If it comes up tails, you lose $Y$.

What should we set $X$ and $Y$ for the game to be fair?
\(E[earnings]= X p - Y (1 - p) = 0\)

Implies
\(\frac{Y}{X} = \frac{p}{1 - p}\)
 The odds can be said as “How much should you be willing to pay for a
$p$ probability of winning a dollar?”
 (If $p > 0.5$ you have to pay more if you lose than you get if you win.)
 (If $p < 0.5$ you have to pay less if you lose than you get if you win.)
Visualizing fitting logistic regression curves
library(manipulate)   ## needed for manipulate() and slider()
x <- seq(-10, 10, length = 1000)
manipulate(
plot(x, exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x)),
type = "l", lwd = 3, frame = FALSE),
beta1 = slider(-2, 2, step = .1, initial = 2),
beta0 = slider(-2, 2, step = .1, initial = 0)
)
It's an S curve when you plot p = e^x/(1+e^x).
Ravens logistic regression
```{r logReg, dependson = "loadRavens"}
logRegRavens <- glm(ravensData$ravenWinNum ~ ravensData$ravenScore, family = "binomial")
summary(logRegRavens)
```
**It automatically assumes that the link function is the logit function!**

### Ravens fitted values
```{r dependson = "logReg",fig.height=4,fig.width=4}
plot(ravensData$ravenScore,logRegRavens$fitted,pch=19,col="blue",xlab="Score",ylab="Prob Ravens Win")
```
It’s actually a complete S curve but it is not shown here.
Odds ratios and confidence intervals
```{r dependson = "logReg",fig.height=4,fig.width=4}
exp(logRegRavens$coeff)
exp(confint(logRegRavens))
```
**The above gives the exponentiated coefficients (odds ratios) and their confidence intervals.**

### ANOVA for logistic regression
```{r dependson = "logReg",fig.height=4,fig.width=4}
anova(logRegRavens,test="Chisq")
```
Interpreting Odds Ratios
 Not probabilities
 Odds ratio of 1 = no difference in odds
 Log odds ratio of 0 = no difference in odds
 Odds ratio < 0.5 or > 2 commonly a “moderate effect”

Relative risk $\frac{\rm{Pr}(RW_i \mid RS_i = 10)}{\rm{Pr}(RW_i \mid RS_i = 0)}$ is often easier to interpret, but harder to estimate. For small probabilities RR $\approx$ OR, but they are not the same!
Further resources
 Wikipedia on Logistic Regression
 Logistic regression and glms in R
 Brian Caffo’s lecture notes on: Simpson’s paradox, Casecontrol studies
 Open Intro Chapter on Logistic Regression
Logistic regression and linear Regression for factors — Agent
Linear regression with X as factor
y <- mtcars$mpg
x <- mtcars$vs
boxplot(y[x == 0], y[x != 0])  ## shows the two group distributions (and their means)
a <- lm(y ~ x)
All LM can do is predict mpg if the vs is 0 or 1. That’s it and it does so based on means!
The equation is Y = mean_0 + (mean_1 - mean_0)*x.
predict(a, newdata = data.frame(x = 1))
Ans: 24
predict(a, newdata = data.frame(x = 0))
Ans: 16.
Ans are nothing but the means!
In LM the Y is a continuous variable! In logistic regression Y is 0 or 1 as well.
Source with timestamp!
Linear regression with X continuous and Y factor
y <- mtcars$vs ## vmodel
y <- as.numeric(as.character(y))
x <- mtcars$mpg ## vmodel
plot(x, y)
fit <- lm(as.numeric(y) ~ x)
abline(fit)
predict(fit, newdata=data.frame(x=mean(x)))
predict(fit, newdata=data.frame(x=max(x)))
Y is a probability of getting a 1. Higher the X higher the probability. But probability cannot be greater than 1
Logistic regression y factor and X continuous
log(p/(1-p)) = beta0 + beta1*x = out
beta0 is the intercept of the log odds plot vs X.
beta1 is the slope of the line, i.e., change in log odds per change in X.
A change in log odds is nothing but a log odds ratio: log A - log B = log(A/B).
p = e^(out)/(1+e^(out))
The probability is given as above!
fit <- glm(y ~ x, family = "binomial") ## gives beta0 and beta1
plot(x,y)
points(x,fit$fitted.values,pch="", col="red")
We see an S curve!
Logistic regression y factor and X also Factor!
The greatest and most useful source was:
x <- mtcars$am
y <- mtcars$vs
y <- as.numeric(as.character(y))
x <- as.numeric(as.character(x))
glm(y ~ x, family = "binomial")
table(x, y)
c(log(7/12), log(7/6) - log(7/12)) ## glm manual: intercept and slope
Quiz 4: q1
Consider the space shuttle data set shuttle in the MASS library. Consider modeling the use of the autolander as the outcome (variable name use). Fit a logistic regression model with autolander (variable auto) use (labeled as "auto" 1) versus not (0) as predicted by wind sign (variable wind). Give the estimated odds ratio for autolander use comparing head winds, labeled as "head" in the variable wind (numerator), to tail winds (denominator).
library(MASS)   ## for the shuttle data
shuttle$use2 <- as.integer(shuttle$use == "auto")
shuttle$wind2 <- as.integer(shuttle$wind == "head")
logRegShuttle <- glm(use2 ~ wind2, family = binomial, shuttle)
## or
table(y,x)
log(55/73) - log(56/72) ## slope
log(56/72) ## intercept
Poisson regression c7w4
Key ideas
 Many data take the form of counts
 Calls to a call center
 Number of flu cases in an area
 Number of cars that cross a bridge
 Data may also be in the form of rates
 Percent of children passing a test
 Percent of hits to a website from a country
 Linear regression with transformation is an option
Poisson distribution
 The Poisson distribution is a useful model for counts and rates
 Here a rate is count per some monitoring time
 Some examples uses of the Poisson distribution
 Modeling web traffic hits
 Incidence rates
 Approximating binomial probabilities with small $p$ and large $n$
 Analyzing contingency table data
The Poisson mass function
 $X \sim Poisson(t\lambda)$ if \(P(X = x) = \frac{(t\lambda)^x e^{-t\lambda}}{x!}\) For $x = 0, 1, \ldots$.
 The mean of the Poisson is $E[X] = t\lambda$, thus $E[X / t] = \lambda$
 The variance of the Poisson is $Var(X) = t\lambda$.
 The Poisson tends to a normal as $t\lambda$ gets large.
```{r simPois,fig.height=4,fig.width=8, cache=TRUE}
par(mfrow = c(1, 3))
plot(0 : 10, dpois(0 : 10, lambda = 2), type = "h", frame = FALSE)
plot(0 : 20, dpois(0 : 20, lambda = 10), type = "h", frame = FALSE)
plot(0 : 200, dpois(0 : 200, lambda = 100), type = "h", frame = FALSE)
```

### Poisson distribution
#### Sort of, showing that the mean and variance are equal
```{r}
x <- 0 : 10000; lambda = 3
mu <- sum(x * dpois(x, lambda = lambda))
sigmasq <- sum((x - mu)^2 * dpois(x, lambda = lambda))
c(mu, sigmasq)
```
Example: Leek Group Website Traffic
 Consider the daily counts to Jeff Leek’s web site
http://biostat.jhsph.edu/~jleek/
 Since the unit of time is always one day, set $t = 1$ and then the Poisson mean is interpreted as web hits per day. (If we set $t = 24$, it would be web hits per hour).
Website data
```{r leekLoad,cache=TRUE} download.file(“https://dl.dropboxusercontent.com/u/7710864/data/gaData.rda”,destfile=”./data/gaData.rda”,method=