R

R-chi-squared-2.png

R and Python (ok, and Stata) are the most commonly used scripting laguages/programming languages/Statistical Softwares for statistical computating. Although R and Python are more popular among Staticians, Data Scientists, and Political Scientists, Stata is very popular among Economists. You find below some short codes and references for R that I have used and decided to compile here in this site for quick reference.


Data Manipulation

Change encoding

One of the annoying problems when working with text data in different languages is the encoding issues. Here is a simple way to convert from one format to another for linux users (see here also). As a side note, for people out there working with Portuguese text data, “latin-1” or “latin1” is a commonly used enconding systems.

#+Begin iconv -f UTF-8 -t ISO-8859-1 in.txt > out.txt -f ENCODING the encoding of the input -t ENCODING the encoding of the output #+end

Data.frame/matrix

Ordering data.frame by columns

Problem: order a data.frame by one, two, or more or its columns.

Solution: use with() and order()

x <- data.frame(id=c('a','b','a','b','c'),num=c(2,2,1,3,0),other=c(10,9,8,7,2))
x
x[with(x, order(id, num)),]

id num other
1 a 2 10
2 b 2 9
3 a 1 8
4 b 3 7
5 c 0 2
id num other
3 a 1 8
1 a 2 10
2 b 2 9
4 b 3 7
5 c 0 2

Reshape

Suppose you have a data.frame where the first variable is a case (e.g. country),the columns are groups (e.g. years of the observation) and the values in the cells are information about the case (e.g GDP). Problem: reshape the data frame from wide format (columns represent years) to long format (the first column will be country, the second the year, the third GDP)

Solution: reshape()

# from wide to long
df        <- data.frame(id1=c('a','b','c','d','e'),id2=c('k','l','m','n','o'),
				 '2000'=c(1:5),'2001'=c(101:105))
names(df) <- gsub(names(df), replacement='', pattern='X')
df        <- reshape(df,
				  direction='long',            # new format
				  idvar=c("id1",'id2'),        # id var
				  varying=3:ncol(df),          # just the number of the columns
				  timevar='year',              # name of the var will receive the yera
				  v.names='GDP',               # name of the var will receive the cell values
				  times=names(df[3:ncol(df)])) # Value of timevar
df

# from long to wide
df        <- reshape(df,
				  direction='wide',            # new format
				  idvar=c("id1",'id2'),        # id var
				  timevar='year',              # name of the var will receive the yera
				  v.names='GDP')               # name of the var will receive the cell values
df

id1 id2 year GDP
a.k.2000 a k 2000 1
b.l.2000 b l 2000 2
c.m.2000 c m 2000 3
d.n.2000 d n 2000 4
e.o.2000 e o 2000 5
a.k.2001 a k 2001 101
b.l.2001 b l 2001 102
c.m.2001 c m 2001 103
d.n.2001 d n 2001 104
e.o.2001 e o 2001 105
id1 id2 GDP.2000 GDP.2001
a.k.2000 a k 1 101
b.l.2000 b l 2 102
c.m.2000 c m 3 103
d.n.2000 d n 4 104
e.o.2000 e o 5 105

excluding rows with NA in the data.frame or matrix

x <- as.data.frame(matrix(c(NA,2,3,4,1,3,3,NA), ncol=2))
x
na.omit(x)
x[complete.cases(x),]

V1 V2
1 NA 1
2 2 3
3 3 3
4 4 NA
V1 V2
2 2 3
3 3 3
V1 V2
2 2 3
3 3 3

remove duplicated lines

This function remove duplicated lines according to given variables. Example from here

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b  df[!duplicated(df), ]
  a b
1 A 1
3 A 2
4 B 4
5 B 1
7 C 2

Factors

Change order of levels

This example show how to change the order of the levels in a factor. (original from R-bloggers)

## generate data
x = factor(sample(letters[1:5],100, replace=TRUE))

print(levels(x))  ## This will show the levels of x are "Levels: a b c d e"

## To reorder the levels:
## note, if x is not a factor use levels(factor(x))
x = factor(x,levels(x)[c(4,5,1:3)])

print(levels(x))  ## Now "Levels: d e a b c"
[1] "a" "b" "c" "d" "e"
[1] "d" "e" "a" "b" "c"

Drop unused levels

y <- droplevels(y)

List

extract specific elements of in a list of vectors

l <- list(a=0:10,b=10:20)
do.call("sum", lapply(l, "[[", 10))
lapply(l,'[[',5)
lapply(l,'[',2)

Regular Expression

It is quite common to face a situation where we have to organize data that contains strings or text. Below there is a brief exposition about how to handle strings in R. If you really want to learn regular expression (regexp), I would suggest regexp puzzles, as this website.

## =============================================================
## File: text_manipulation
## -----------------------
## Author: Diogo Ferrari
##
## This file has a couple of functions (with explanations)
## to manipulate strings (text) in R
##
## =============================================================

## =============================================================
## Preambule
## =============================================================
## - R uses some classes of characteres defined  by Portable
##   Operating System Interface (POSIX) and ASCII.
## - IEEE POSIX released the Basic Regular Expression (BRE).
## - Later, it release Extended Regular Expression (ERE). Both are supported by R.
## - IEEE POSIX have actually three sets: BRE, ERE, S(imple)RE,last deprecated
## - Those patterns define two types of characters: literal and metacharacters.
## - Metacharacters are characters that have special meaning. They inform the string
## how to behave. It is not literally part the string itself.
## - Metacharacters perform some action in the string.
## - Metacharacters must be preceded, in R, by a backslash (\)
## in other to be considered a literal character and not a metacharacter.
##
## - As metacharacter we have:
##   - ERE/BRE reserved characters
##       BRE: {}[]()^$.*\
##       ERE: ? + |
##   - Escape character
##       Two characters (or more), starting with \ (backslash)
##       (Therefore to make it a character we must an additional backslash)
##
## - In practice, we have a target pattern or possible patterns
## that we want to find for some practical reason.
## - Regular Expresion helps us to performe this task.
## - Using metacharacter we can create patterns to match a large
## number of possible strings ar one, without having to
## create an extensive list of all possibilities.
##
## Useful for help:
## ?regex
##
## Useful packages:
## - stringi: http://cran.r-project.org/web/packages/stringi/stringi.pdf
## - stringr
## =============================================================

## Escape character
## ---------------
## \n   : new line
## \'   : single quote
## \"   : double quote
## \r   : carriege (cursor) return to begining of the line
## \t   : tab character
## \b   : backspace
## \a   : make a beep when char is printed
## \f   : form feed
## \v   : vertical tab
## \\   : backslash itself
## \nnn : character with given octal code (see http://character-code.com).
## \xnn : character with given hex code (see http://character-code.com)

## Other possibilities (http://cran.r-project.org/doc/manuals/R-lang.html#Literal-constants)
## \unnnn or \u{nnnn} multibyte locales are supported. Unicode character with given hex code
## \Unnnnnnnn \U{nnnnnnnn}

## Examples
## --------
cat('In two\nlines \n')                         ## new line
cat('a\'aa \n')                                 ## single quote
cat("aA\aaa \n")                                ## beep
cat("aA\baa \n")                                ## backspace
cat('Subst first letter by 1\r1\n')             ## return carriege
cat("aA\n\vaa \n")                              ## vertical space
cat('cedilha in octal code is \\307: \307  \n') ## octal code \nnn
cat('cedilha in octal code is \\307: \307  \n') ## octal code \xnn
cat('cedilha in octal code is \\xC7: \xC7  \n') ## octal code \xnn

## to include the metacharacter as string: two backslashes (\\)
cat('In one\\n lines\n')

## IEEE POSIX BRE and ERE
## ----------------------
## [ ]    : matches any string that have any character specified
##          inside the brackets in that position.
##          - using - inside allows a range
##          - using - at begining or end treats - literally
##          - using ] at begining treats ] literally
##          - scape characters not allowed inside [ ]
##          - [^ ] is the negation of what is inside (dif from ^ alone)
##          Ex: '[Hh]vana' will find 'havana' and 'Havana'
##               a[a-d4-6]c matches 'aac', 'a4c', 'a6c', but not 'a7c' nor 'aa4c'
##               a[-4-6]c matches 'a-c', 'a4c', 'a6c', but not 'aac' nor 'aa4c'
##               a[^4-6]c matches 'a-c', 'aSc', 'aac', but not 'a4c' 'a5c' 'aa4c'
## |      : boolean "or" match any of the whole listed options before and after |
##          Ex: 'the cat|a dog' will match 'the cat' and 'a dog''
## .      : matches any character in that position (including space)
##          Note: [.] matches the literal "."
## ^      : matches the FIRST character of the string
## $      : matches the LAST character of the string
## ( )    : allows to mark a substring inside a string and
##          using \\x, where x=1,2,...,9 we can match in another
##          position what the xth () have marked
##          Ex: 'wo([Rr])ld is \\1ed' will match
##              'world is red' and
##              'woRld is Red' but not
##              'world is Red' b/c it marked 'r' in the first position
## Quantifiers
## ?      : ZERO or ONE occurrence of the character imediately
##          preceding "?"
##          Ex: 'PSD?B' matches 'PSDB' and 'PSB'
## *      : ZERO or ONE or MULTIPLE occurrence of the imediately
##          preceding character
##          Ex: 'PSD?B' matches 'PSDB' and 'PSB' and 'PSDDB' and 'PSDDDB' ...
## +      : at least ONE or MULTIPLE occurrence of the imediately
##          preceding character
##          Ex: 'PSD?B' matches 'PSDB' and 'PSDDB' and 'PSDDDB'... but not 'PSB'
## {n,m}  : at LEAST n and at MOST m occurences of the character
##          imediately preciding {n,m}

## important remark:  space is a character as any other (a,b,c, 5, etc)
##                    therefore 'this' is diff from ' this'

## Examples
## Note: grep is explained below (you may read that section first)
## Note: we can print the value instead of index by setting value=T
## 1. getting the index
grep(pattern="12",  x=c('123','345'))
## 2. getting the values
grep(pattern="12",  x=c('123','345'), value=T)

## [ ]
grep(pattern='o[rR]ld', x=c('world', ' something else', 'WoRld'))
grep(pattern='[rR]', x=c('world', 'sjdj r sjdj', '  something else', 'WoRld'))
## [ ] for ranges, [- ] and [^ ]
grep(pattern='a[b-d]c', x=c('abc','acc','ab','aee'))
grep(pattern='a[b-d1-4]c', x=c('abc','acc','ab','aee','a3c','a5c','a-c'))
grep(pattern='a[-b-d1-4]c', x=c('abc','acc','ab','aee','a3c','a5c','a-c'))
grep(pattern='a[^b-d1-4]c', x=c('abc','acc','ab','aee','a3c','a5c','a-c'))
## |
grep(pattern='world', x=c('world', 'sjdj r sjdj', '  something else', 'woRld'))
grep(pattern='world|woRld', x=c('world', 'sjdj r ', '  something else', 'woRld'))
grep(pattern='wor|Rld', x=c('world', 'sjdj r sjdj', '  something else', 'woRld'))
grep(pattern='wor|Rld', x=c('world', 'sjdj r sjdj', '  something else', 'WoRld'))
grep(pattern='wo(r|R)ld', x=c('world', 'wolld r ', '  something else', 'woRld'))
grep(pattern='wor|l|Rld', x=c('world', 'wolld r ', '  something else', 'woRld'))
grep(pattern='wo(r|l|R)ld', x=c('world', 'wolld r ', ' something else', 'woRld'))
grep(pattern='the cat|a dog', x=c('cat',' the cat',' dog', 'a dog'))
## .
grep(pattern='.', x=c('.world', 'wold r ', ' something.ee', 'woRld'))
grep(pattern='[.]', x=c('.world', 'wold r ', ' something.ee', 'woRld'))
grep(pattern='.w', x=c('.world', 'wold r ', ' something.ee', ' woRld'))
## ^
grep(pattern='^ ', x=c('.world', 'wold r ', ' something.ee', ' woRld'))
grep(pattern='^w', x=c('.world', 'wold r ', ' something.ee', ' woRld'))
grep(pattern='^ ', x=c('.world', 'wold r ', ' something.ee', ' woRld'))
grep(pattern='^ s', x=c('.world', 'wold r ', ' something.ee', ' woRld'))
grep(pattern='^.', x=c('.world', 'wold r ', ' something.ee', ' woRld'))
## $
grep(pattern=' $', x=c('.world', 'wold r ', ' something.ee', ' woRl'))
grep(pattern=' $', x=c('.world', 'wol  \nd r ', ' something.ee', ' woRl'))
grep(pattern='d$', x=c('.world', 'wold r ', ' something.ee', ' woRl'))
grep(pattern='[dl]$', x=c('.world', 'wold r ', ' something.ee', ' woRl'))
grep(pattern='d|l$', x=c('.world', 'wold r ', ' something.ee', ' woRl')) # note the first d can appear anywhere while l just at the end,compare with next example
grep(pattern='d$|l$', x=c('.world', 'wold r ', ' something.ee', ' woRl'))
grep(pattern='(d|l)$', x=c('.world', 'wold r ', ' something.ee', ' woRl'))
##  ( )
grep(pattern='w(a|o)R', x=c('.woR - rl - d','w-old r ', ' s-omeng.ee', ' waRl'))
grep(pattern='w(a|o)R', x=c('.woR - rl - d', 'woR a' ,
							'w-old r ', ' s-omeng.ee', ' waRl'))
grep(pattern='w(a|o)R \\1', x=c('.woR - rl - d', 'woR o' , 'waR waR',
							'w-old r ', ' s-omeng.ee', ' waRl'))
grep(pattern= 'wo([Rr])ld is red',x=c('woRld is red', 'world is red','adsfa'))
grep(pattern= 'wo([Rr])ld is \\1ed',x=c('woRld is Red', 'world is red','adsfa'))
grep(pattern= 'wo([Rr])ld is \\1ed',x=c('woRld is red', 'world is red','adsfa'))
grep(pattern= 'wo([Rr])ld is \\1ed',x=c('world is red', 'world is Red','adsfa'))
grep(pattern= 'wo([Rr])ld is \\1ed',x=c('world is red', 'world is Red','adsfa'))
## {n,m}
grep(patter='ab{2,4}c',x=c('ac','abc','abbc','abbbc','abbbbc', 'abbbbbc'))

## Class build-in of (literal) characters
## --------------------------------------
## R uses a class of character defined as POSIX and ASCII as shortcuts.
## We can use either one. They are:

## POSIX            ASCII                                   info
## [[:alnum:]]   or  [A-Za-z0-9]                          : it includes '_'
## [[:alpha:]]   or  [A-Za-z]                             : it includes '_'
## [[:blank:]]   or  [ \t]                                : space and tab
## [[:cntrl:]]   or  [\x00-\x1F\x7F]                      : control characters
## [[:digit:]]   or  [0-9] :
## [[:graph:]]   or  [\x21-\x7E]                          : visible characters
## [[:lower:]]   or  [a-z]
## [[:print:]]   or  [\x20-\x7E]                          : visible characters plus space
## [[:punct:]]   or  [!"##$%&'()*+,-./:;?@[\]^_`{|}~.] : Punctuation characters
## [[:space:]]   or  [ \t\r\n\v\f]                        : white spaces
## [[:upper:]]   or  [A-Z]:
## [[:xdigit:]   or  [A-Fa-f0-9]                          : Hexadecimal digits

## Note
## [^A-Za-z0-9]     = negation of [A-Za-z0-9]

## Examples
##---------
text <- c('aBc123','abc123','abc-123','123-abc','123',' abc123', '\1.2+3','ab')
grep(pattern="[A-Za-z]",            x=text, value = TRUE)
grep(pattern="[1-9]",               x=text, value = TRUE)
grep(pattern='[./:;?@^_`{|}~-]', x=text, value = TRUE)
grep(pattern="[[:upper:]]",         x=text, value = TRUE)
grep(pattern="[[:blank:]]",         x=text, value = TRUE)
grep(pattern="[[:punct:]]",         x=text, value = TRUE)

## ==============================================================
## grep - GET THE POSITIONS (OR VALUES) IN THE VECTOR THAT MATCH THE PATTERN
## ==============================================================
text <- c("0.1234", ".1267", "111", "00012")
## 1. getting the index
grep(pattern=".12",  x=text)
grep(pattern="\\.12",x=text)

## 2. getting the values
grep(pattern=".12",  x=text, value=T)
grep(pattern="\\.12",x=text, value=T)

## ==============================================================
## nchar - COUNTS THE NUMBER OF CHAR IN EACH STRING OF THE VECTOR
## ==============================================================
## Count the numer of Characteres
nchar(c('numero de letras','abc'))

## ==============================================================
## sub and gsub - SUBSTITUTE THE PATTERN INSIDE A STRING
## ==============================================================
text <- c("Limpar \n\n texto", "...nesse \n tambem")
## just the fisrt matched PATTERN
sub(pattern="\n",replacement="",x=text)
## replace all matched string in the elements, i.e. in GENERAL
gsub(pattern="\n",replacement='',x=text) 

## ==============================================================
## chartr - SUBSTITUTE THE PATTERN WITH ANOTHER OF SAME SIZE INSIDE A STRING
## ==============================================================
## for characters instead of string
## old and new must have the same length
chartr (old="Limpar", new="Novo12", text)
## see?
chartr (old="Limpar", new="Novo", text)   

## ==============================================================
## gregexpr - TAKE THE POSITION OF THE PATTERN INSIDE COMPONENT OF THE VECTOR
## ==============================================================
text <- c("001234", "x1267", "111 1a6")
gregexpr('12', text)

## ==============================================================
## substr - TO TAKE A SUBSTRING FROM AN ELEMENT OF A VECTOR
## ==============================================================
## excluding what is inside the parenthesis
text <- c("001234333 (comment) up here",
		  "123671111 (another) up here 111 but not up (and) here",
		  "111")
begin <- as.vector(regexpr('\\(.*?\\)', text))
size  <- attr(regexpr('\\(.*?\\)', text),'match.length')
substr (text, start=begin, stop=begin+size-1)
## and if we want to exclude it
pattern <- substr (text, start=begin, stop=begin+size-1)
for (i in 1:length(pattern))
	text[i] <- gsub(pattern=pattern[i], x=text[i],repl='')
for (i in 1:length(pattern))
	text[i] <- gsub(pattern='\\(\\) ', x=text[i],repl='')

## ==============================================================
## strsplit - SPLITING COMPONENTS OF A VECTOR
## ==============================================================
text <- c('This part.c and that part', 'second.c example')
strsplit (text, '\\.c')

## ==============================================================
## EXCLUDING SPACES OF A STRING  VECTOR
## ==============================================================
text <- c(" Limpar \n\n texto", "...nesse \n tambem ")
gsub(pattern="^ ",replacement="",x=text)
gsub(pattern=" $",replacement='',x=text)

## or
require(gdata)
trim(text)  ## remove spaces on the begining and on the end of the string
## ==============================================================

Data Analyses

GLM

Change the reference category in GLM

Problem: change the reference category in glm

Solution: relevel()

df |t|)
(Intercept)  4.412e-01  7.130e-02   6.188 2.02e-09 ***
x1b         -2.377e-16  7.094e-02   0.000    1.000
x1a          3.178e-16  7.094e-02   0.000    1.000
x2           3.720e-04  1.003e-03   0.371    0.711
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.2516399)

    Null deviance: 74.520  on 299  degrees of freedom
Residual deviance: 74.485  on 296  degrees of freedom
AIC: 443.41

Number of Fisher Scoring iterations: 2

Graphics

Cartesian Coordinate-like plot for mathematical functions

I think the plots of mathematical functions are ugly in R. There is no standard package (to the best of my knowledge) to create those traditional graphics of mathematical functions using Cartesian Coordinate system with the origem (0,0) or (0,0,0) in the right place. I created myself two functions that extends the base plot functionalities and generate such plots. Below you see some examples and the source file for that generate the plots in the style I am refering to. Use it as you wish.

The source files are

A reproductible example is below.

source('/plotCC.R')
source('/curveCC.R')

## =====================================================
## Standard R plot
## =====================================================
f <- quote(x^2-3)
g <- quote(x^3)

x <- seq(-3,3,length=50)
plot(x,eval(f), type='l', col='blue')
points(x,eval(g), type='l', col='red')
legend <- c(as.expression(f), as.expression(g) )
legend('bottomright', lty=1, col=c('blue','red'), legend=legend, bty='n')
title('ugly Standard better for functions')

par(las=1,cex.axis=.7,bty='l', pch=20, cex.main=.9, mar=c(4,5,3,2))
plot(x,eval(f), type='l', col='blue')
points(x,eval(g), type='l', col='red', add=T)
legend <- c(as.expression(f), as.expression(g) )
legend('bottomright', lty=1, col=c('blue','red'), legend=legend, bty='n')
title('Standard plot a little better')

## =====================================================
## Plot CC
## =====================================================
f <- quote(x^2-3)
g <- quote(x^3)

x <- seq(-3,3,length=50)
plotCC(x,eval(f), col='blue')
plotCC(x,eval(g), col='red', add=T)
legend <- c(as.expression(f), as.expression(g) )
legend('bottomright', lty=1, col=c('blue','red'), legend=legend, bty='n')
title('Using plotCC or curve CC')
dev.off()

## another method
f <- function(x) return(x^2-3)
g <- function(x) return(x^3)
x <- seq(-3,3,length=50)
plotCC(x,f(x), col='blue')
plotCC(x,g(x), col='red', add=T)
legend <- c(as.expression(bquote(f*(x)==x^2-3)), as.expression(bquote(f(x)==x^3)) )
legend('bottomright', lty=1, col=c('blue','red'), legend=legend, bty='n')
title('y=f(x)')

## =====================================================
## curve CC
## =====================================================
## not necessary to create x
f <- function(x) return(x^2-3)
g <- function(x) return(x^3)
curveCC(f,xmin=-3,xmax=3, ymax=6,  col='blue')
curveCC(g,xmin=-3,xmax=3, ymax=6,  col='red', add=T)
legend('bottomright', lty=1, col=c('blue','red'), legend=c('f','g'), bty='n')

plotCC.png

Color Palletes in R

The two documents below summarize some available colors in R.

Color Palette 1

Color Palette 2

Probability distribution (Plots)

Exponential Distribution

# Exponential
exppdf      <- function(x) lambda*exp(-lambda*x)

par(las=1, cex.axis=.7,bty='l', cex.main=.9, mar=c(4,5,3,2), col='lightblue')
# parameter
lambda      <-1
plot(exppdf,0,10, main=bquote(f(X==x)==.(lambda)*e^(-.(lambda)*x)), ylab=expression(f(x)==lambda*e^(-lambda*x)))

percentis   <- qexp(seq(.2,1,.2), rate=lambda)
segments(x0= percentis,x1= percentis,y0=0,y1=dexp(percentis, rate=lambda),lt=2)
text(x=percentis,y=dexp(percentis, rate=lambda), label=paste(seq(20,100,20),'%'),pos=4, cex=.7)

expDist.png

X^2 Distribution

# chi-squared distribution
distribution <- function(x) dchisq(x,df=df)
# Parameter
df<-1

par(las=1, cex.axis=.7,bty='l', cex.main=.9, mar=c(4,5,3,2))
plot(distribution,0,20, bty='l',  las=1,cex.axis=.7,ylab='Density', col='lightblue', main=expression(chi^2~'-distribution; '~f(x)==frac(1,2^(frac(2,k))*Tau*(frac(2,k)) )*x^(frac(2,k)~-1)*e^(-~frac(x,2))),ylim=c(0,.3))

# other values for the parameter
for (i in 2:7){
df=i;par(new=T)
plot(distribution,0,20, bty='l', col=i, lty=i,add=T) }
legend('topright', legend=paste('df=',1:7), col=1:7, lty=1:7, cex=.7)

chiDist.png

Beta Distribution

The beta distribution is given by

X\sim Beta(\alpha,\beta)

f_X(x) = \begin{cases}\displaystyle \frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1} & , x\in (0,1)\\ 0 & , \text{otherwise} \end{cases}

Where

\displaystyle B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}

and

\displaystyle\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}dx

require(gtools)

# Beta
betaDist        <- function(x) dbeta(x, alpha,beta)

# parameter
values     <- c(.3,1,1.5,3)
parameters <- permutations(values, n=length(values), r=2, repeats.allowed=T)

par(las=1,cex.axis=.7,bty='l', pch=20, cex.main=.9, mar=c(4,5,3,2))
alpha      <-parameters[1,1]
beta       <-parameters[1,2]

par(las=1,cex.axis=.7,bty='l', pch=20, cex.main=.9,mfrow=c(nrow(parameters)/4,nrow(parameters)/4),mar=c(1,1,1,1))
for( i in 1:nrow(parameters)){
	alpha      <-parameters[i,1]
	beta       <-parameters[i,2]
	plot(betaDist,0,1, col='black')
	legend('topright', legend=bquote(atop(alpha == .(alpha),beta == .(beta))),bty='n', cex=1.4)
}

betaDistr.png

Density with shaded area (example with Gaussian Curve)

mu = 0
sd = 1
x0 = -Inf
x1 = 1

par(las=1, cex.axis=.7,bty='l', cex.main=.9, mar=c(4,5,3,2))
plot(dnorm,mu-4*sd,mu+4*sd, bty='l',las=1, cex.axis=.8, ylab=expression(f(x)), col='darkgrey')
shaded <- seq(mu-4*sd,x1,.2)
polygon(x=c(x1, shaded),y=c(0,dnorm(shaded)), col='lightgrey', lty=0)
segments(x0=x1,y0=0,y1=dnorm(x1), lty=2)

normDistShaded_ex1.png

mean=100; sd=15
lb=80; ub=105

x   <- seq(-4,4,length=100)*sd + mean
hx  <- dnorm(x,mean,sd)

par(las=1, cex.axis=.7,bty='l', cex.main=.9, mar=c(4,5,3,2))
plot(x, hx, type="l", xlab="X Values", ylab="Density",main="Normal Distribution", axes=T, las=1, cex.axis=.7)
i   = lb & x <= ub
lines(x, hx)
polygon(c(lb,x[i],ub), c(0,hx[i],0), col="grey") 

area    <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
result  <- paste("P(",lb,"< IQ <",ub,") =",signif(area, digits=3))

normDistShaded_ex2.png

Bivariate Normal and its contour plot

The bivariate normal distribution is given by:

\displaystyle f_{x_1,x_2}(x_1,x_2)=\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}e^{-z/(2(1-\rho^2)}

where

\displaystyle z=\frac{(x_1-\mu_1)^2}{\sigma_1^2}-\frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}+\frac{(x_2-\mu_2)^2}{\sigma_2^2}

and

\displaystyle \rho=\text{corr}(X_1,X_2)=\frac{\text{Cov}(X_1,X_2)}{\sigma_1\sigma_2}

require(plot3D)
require(mvtnorm)

mu1 <- 1    # expected value of x
sig1    <- 0.5  # variance of x
mu2 <- 2    # expected value of y
sig2    <- 2    # variance of y
rho <- .3   # corr(x, y)

x <- seq(-2,5,length=100)
y <- seq(-2,5,length=100)

jointpdf <- function(x,y) dmvnorm(cbind(x,y), mean=c(mu1,mu2),sigma=matrix(c(sig1, rho, rho,sig2), byrow=T,nrow=2))
fxy <- outer(x, y, jointpdf)

par(mfrow=c(2,2), mar=c(2,2,2,2))
persp3D(x, y,fxy, theta=10, phi=30, expand=0.8 ,shade = 0.03, colkey = FALSE)
contour2D(fxy, x=x,y=y,NAcol = "black", colkey = FALSE)
image2D(fxy, x=x,y=y,NAcol = "black", colkey = FALSE)
image2D(fxy, x=x,y=y,NAcol = "black",contour = TRUE, lphi = 0.7)

Bivariate Normal and its contour plot.png

Histogram with adjusted curve

x <- rbinom(n=5000,size=50,prob=.5)
hist(x, prob=T, col='lightgrey', las=1, cex.axis=.8)
lines(density(x), col='blue', lty=2) # adjust=1
lines(density(x,adjust=2), col='green', lty=2)
lines(density(x,adjust=3), col='red', lty=1)
legend('topright',legend=c('Adjust 1','Adjust 2','Adjust 3'), lty=c(2,2,1),col=c('blue', 'green','red'))

Histogram with adjusted curve1.png

x <- rbinom(n=5000,size=50,prob=.5)
hist(x, prob=T, col='lightgrey', las=1, cex.axis=.8)
curve(dnorm(x, mean=.5*50,sd=sqrt(.5*.5*50)), add=T, col='red',lty=2)

Histogram with adjusted curve2.png

ggplot

Clean plots with ggplot

require(ggplot2)
data                        <- data.frame(income=rnorm(300,1000,300),gender=c(0,1))
data$income[data$gender==1] <- 5*data$income[data$gender==1]
data$x                      <- 4*data$income + 10*(data$gender + 100) + rnorm(300,0,3000)
data$gender                 <- factor(data$gender, levels=c(0,1),labels=c('M','F'))
model                       <- lm(income~x+(gender),data=data)

colours     <- c('red', 'darkgreen')
opts        <- theme_bw()+theme(legend.position='bottom',
						 panel.grid.major=element_blank(),
						 panel.grid.minor=element_blank(),
						 panel.border=element_blank(),
						 axis.line=element_line(size=.2))

ggplot(data, aes(x = x, y = income)) +
		geom_point(aes(colour=gender), size=1.5)+
		stat_smooth(method='lm') +
		opts+
		scale_color_manual(values=colours )+
		scale_fill_manual(values=colours )+
					ylab('income')

Clean plots with ggplot.png

Force Multiple plots

Some function do not allow plot more than one plot using mfrow. One solution is the following:

p1 <- plot(1,1, main='figure 1')
p2 <- plot(1,1, main='figure 2')
p3 <- plot(1,1, main='figure 3')
p4 <- plot(1,1, main='figure 4')

print(p1, position = c(0, 0, 0.5, .5), more = TRUE)
print(p2, position = c(0.5, 0, 1, .50), more = TRUE)
print(p3, position = c(0.5, .5, 1, 1), more = TRUE)
print(p4, position = c(0, .5, .5, 1))

Math annotation

Annotations using math symbol and values of variables

Problem: plotting a annotation with greek characters and a value of a variables

Solution: bquote

par(las=1,cex.axis=.7,bty='l', pch=20, cex.main=.9, mar=c(4,5,3,2)) 

lambda=3
xPois <- rpois(100000, lambda=3)
hist(xPois, main=
		 bquote('Poisson Distribution: '~~
				f(lambda)~'='~frac(e^-lambda*lambda^x,x*'!')~'; ' # function
				~lambda==.(lambda)),                              # including var
	 xlab='x',xlim=c(0,17),col='gray')

AnnotPoisson.png

Expressions for some probability distributions to plot

# Normal
expr.norm <- expression(italic(paste(displaystyle(f(x)~"="~frac(1,sqrt(2*pi*sigma^scriptscriptstyle("2")))~e^{frac(-1,2*sigma^{scriptscriptstyle("2")})*(x-mu)^scriptscriptstyle("2")})~~~~
	displaystyle(list(paste(-infinity<x) <infinity, paste(-infinity<mu) <infinity, paste(0<sigma^scriptscriptstyle("2")) <infinity)))))

# Uniform
expr.unif <- expression(italic(paste(displaystyle(f(x)~"="~frac(1,b-a)~~~~displaystyle(paste(-infinity<paste(a<=paste(x<=b))) <infinity)))))

# t-distribution
expr.t <- expression(italic(paste(displaystyle(f(x)~"="~frac(Gamma~bgroup("(",frac(nu+1,2),")"),sqrt(nu*pi)~Gamma~bgroup("(",frac(nu,2),")"))~bgroup("(",1+frac(x^2,nu),")")^{-frac(nu+1,2)})~~~~displaystyle(list(paste(-infinity<x)  0)))))

# f-distribution
expr.F <- expression(italic(paste(displaystyle(f(x)~"="~frac(Gamma~bgroup("(",frac(nu[1]+nu[2],2),")"),Gamma~bgroup("(",frac(nu[1],2),")")~Gamma~bgroup("(",frac(nu[2],2),")"))~bgroup("(",frac(nu[1],nu[2]),")")^{frac(nu[1],2)}~frac(x^{frac(nu[1],2)-1},bgroup("(",1+frac(d[1],d[2])*x,")")^{frac(d[1]+d[2],2)}))~~~~displaystyle(paste(0<=x)  0))))

# gamma
expr.gam <- expression(italic(paste(displaystyle(f(x)~"="~frac(beta^alpha,Gamma(alpha))*x^{alpha-1}*e^{-beta*x})~~~~displaystyle(list(paste(0<x) <infinity, paste(0<alpha) <infinity, paste(0<beta) <infinity)))))

# exponential
expr.exp <- expression(italic(paste(displaystyle(f(x)~"="~lambda*e^{-lambda*x})~~~~displaystyle(list(paste(0<=x) 0)))))

# chi-squared
expr.chisq <- expression(italic(paste(frac(1,2^{frac(nu,2)}*Gamma~bgroup("(",frac(nu,2),")"))*x^{frac(nu,2)-1}*e^{-frac(x,2)}
~~~~displaystyle(list(paste(0<=x) <infinity, nu %in% bold(N))))))

# log-normal
expr.lnorm <- expression(italic(paste(displaystyle(f(x)~"="~frac(1,x*sigma*sqrt(2*pi))~e^{-frac((log(x)-mu)^2,2*sigma^2)})~~~~displaystyle(list(paste(0<x) <infinity, paste(-infinity<log(mu)) <infinity, paste(0<sigma^scriptscriptstyle("2")) <infinity)))))

# beta
expr.beta <- expression(italic(paste(displaystyle(f(x)~"="~frac(Gamma(alpha+beta),Gamma(alpha)*Gamma(beta))*x^{alpha-1}*(1-x)^{beta-1})~~~~displaystyle(list(paste(0<=x) <=1, paste(0<alpha) <infinity, paste(0<beta) <infinity)))))

par(mar=c(0,0,2,0), mfrow=c(8,1),xaxt='n', yaxt='n')
plot(1,1,type='n',axes=FALSE)
	title('Normal')
	text(1,1,label=expr.norm, main='Normal')
#plot(1,1,pch=F)
#title('Uniform')
#    text(1,1,label=expr.unif)
plot(1,1,type='n',axes=FALSE)
title('t-distribution')
	text(1,1,label=expr.t)
plot(1,1,type='n',axes=FALSE)
title('F-distribution')
	text(1,1,label=expr.F)
plot(1,1,type='n',axes=FALSE)
title('Gamma')
	text(1,1,label=expr.gam)
plot(1,1,type='n',axes=FALSE)
title('exponential')
	text(1,1,label=expr.exp)
plot(1,1,type='n',axes=FALSE)
title(bquote(chi^2-'distribution'))
	text(1,1,label=expr.chisq)
plot(1,1,type='n',axes=FALSE)
title('log-normal')
	text(1,1,label=expr.lnorm)
plot(1,1,type='n',axes=FALSE)
title('Beta')
	text(1,1,label=expr.beta)

AnnotDistribExpression.png

Maps

Plotting maps using R (example with Brazilian Municipal level data)

This is an example of how to plot GEO data using R. In this example, I will plot three maps of Brazilian municipalities and the color of each city in the maps will be mapped, in each plot, to its GDP, to the proportion of votes of the above 50% each president received in the 2014 Brazilian election, and proportion of families receiving the assistance from the governamental program called Bolsa Familia. To reproduce the example you need

  1. Shapefiles (Municipalities, States, Region)
  2. Information about municipalities GDP (here or here)
  3. Information about votes (here)
  4. Information about bolsa familia (here)
  5. A file linking the ID of the cities used by the Shapefiles (in the example here such ID is the Brazilian Institute of Geography and Statistics’ code for municipalities) and the ID of the cities used by electoral data (not provided here).

Note: as I didn’t provide the last item, you may not be able to reproduce the map with the result of the election. You should be able to reproduce using the GDP and Bolsa Familia, though. The script below is used to plot the GDP per capita in each city (the blue-colored map below). You can easily adapt it to plot the proportion of families receiving Bolsa Familia in each city.

The tricky part is to create categories to attach colors that correspond to the values of a possible continuous variable. If there is a limited amount of colors available, you must create as many categories as the number of colors. Lets say you have 500 different colors but 5000 cities, you must create 500 categories of GDP. One suggestion would be using the percentiles. It may not make much difference in the final result, but you should compare the results when using colors by category and the results using colors in the continuous variable without making this transformation.

I present a map colored according to the proportion of votes each candidate received in the second round of 2014 Brazilian Election in each municipality along with another map that represents the proportion of families that receive Bolsa Familia in each city. Bolsa familia is a federal redistributive welfare program. Although it looks like so, *these two maps do not imply that Bolsa Familia explains or is correlated to presidencial vote. They are just two maps, and any stronger statement about causes of (factors correlated to) outcome of election must be further verified with appropriate statistical techniques*.

Note: you need to download and install the R packages listed in the script. You also need to change the path and name of the files your are loading in the script according to your local file path. So, check where you saved the files and the name of them and change the script accordingly.

# =====================================================
# MapBRMunic.R
# --------------
# Author: Diogo A. Ferrari
#
#
# =====================================================
#
rm(list=ls())

# options
quartz(width=6,height=6)
par(las=1,cex.axis=.7,bty='l', pch=20, cex.main=.9, mar=c(4,5,3,2)) 

# packages
require(maptools)
require(descr)
require(RColorBrewer)
require(plotrix)

# path and files used
pathShapeFileMun    <- "/Users/myshapefiles/municipios_2010.shp"
pathShapeFileStates <- "/Users/myshapefiles/estados_2010.shp"
pathShapeFileRegions    <- "/Users/myshapefiles/regioes_2010.shp"
pathDataPIB     <- "/Users/data/pib2010.csv"

# Municipal level data
# ================================================================
# Load files
# ================================================================
# shapefiles
municBR   <- readShapePoly(fn=pathShapeFileMun)
statesBR  <- readShapePoly(fn=pathShapeFileStates)
regionsBR <- readShapePoly(fn=pathShapeFileRegions)
# GDP
pib <- read.csv2(pathDataPIB)
# ================================================================

# ================================================================
# merging shapefile and data
# ================================================================
# this order var will make it possible to go back to the original order
municBR@data$order <- 1:nrow(municBR@data)

# NOTE: municipalities more municipalities in the pib file. we will ignore them her
municBR@data <- merge(municBR@data, pib,
					  by.x='codigo_ibg',by.y='C._digo', all.x=T)
# creating GDP per capita
municBR@data$X2010 <- municBR@data$X2010/as.numeric(as.character(municBR@data$populacao))

# ================================================================
# plot
# ================================================================
#brewer.pal.info
myPaletteBlue <- brewer.pal(9,'Blues')
createColors  <- colorRampPalette(myPaletteBlue)
myColorsBlue  <- createColors(nrow(municBR@data))
#check the colors
#pie(rep(1,nrow(municBR@data)), col=myColorsBlue, lty=0, labels=NA)

# good way to use colors: creating categories of income (as many as colors we have)
municBR@data        <- municBR@data[order(municBR@data$X2010),]
numberColors        <- length(unique(myColorsBlue))
# the cut points are arbitrary, and you can choose others
cuts            <- quantile(municBR@data$X2010,
								 probs=(0:numberColors)/numberColors)
municBR@data$GDPcat <- cut(municBR@data$X2010,breaks = cuts,include.lowest=T)

# ordering the vector by GDP
municBR@data        <- municBR@data[order(municBR@data$X2010),]

# merging GDPcat and their colors
colorOfCats     <- data.frame(GDPcat=levels(municBR@data$GDPcat),colors=unique(myColorsBlue))
municBR@data        <- merge(municBR@data,colorOfCats)

# going back to the original order of the data in the shapefile
municBR@data        <- municBR@data[order(municBR@data$order),]

plot(municBR, col=as.character(municBR@data$colors), lty=0, main='')
plot(statesBR, add=T) # if you want to emphasize the states boundaries
plot(regionsBR, add=T, lwd=3) # to emphasize the macroregions boundaries

# legend
color.legend(xl=-70, xr=-60,yb=-25,yt=-26,
			 legend=c('lowest','','highest'),
			 rect.col=myColorsBlue, gradient='x',
			 cex=.7, pos=c(1,1,1))
text(x=-68,y=-24,label='GDP per capita', cex=1)

2014_Presidential_2ndRound_municip.png

mapGDPMuniRight.png

Bolsa Familia and Proportion of Votes

From R to Latex

Exporting R table to Latex with mathematical annotation

This script saves a file .tex that contains only a latex table environment with the information of the R table. Mathematical annotation are allowed in the name of the columns or in the lines. After saving the table, you can use the command \input{<path-to-the-table>/<file-name>.tex} in your main latex file and use it as usual.

## path to save the table
pathTables <- '/tables'

## table
tab

The code will give

\sigma^2 Mean
0.20 1.00
0.40 2.00

Open xlsx file in R in Ubuntu

The original post is here. I am just reproducing the solution

Error while installing xlsx package in R, ubuntu :

> install.packages(“xlsx”)

configure: error: One or more Java configuration variables are not set. Make sure R is configured with full Java support (including JDK). Run R CMD javareconf as root to add Java support to R.

If you don’t have root privileges, run R CMD javareconf -e to set all Java-related variables and then install rJava.

ERROR: configuration failed for package ‘rJava’

  • removing ‘/home/nitin/R/x86_64-pc-linux-gnu-library/3.1/rJava’

Warning in install.packages : installation of package ‘rJava’ had non-zero exit status ERROR: dependency ‘rJava’ is not available for package ‘xlsxjars’

  • removing ‘/home/nitin/R/x86_64-pc-linux-gnu-library/3.1/xlsxjars’

Warning in install.packages : installation of package ‘xlsxjars’ had non-zero exit status ERROR: dependencies ‘rJava’, ‘xlsxjars’ are not available for package ‘xlsx’

  • removing ‘/home/nitin/R/x86_64-pc-linux-gnu-library/3.1/xlsx’

Warning in install.packages : installation of package ‘xlsx’ had non-zero exit status

The downloaded source packages are in ‘/tmp/RtmpNKD9KK/downloaded_packages’

Solution : I did the following and it worked :

sudo R CMD javareconf apt-get install r-cran-rjava install.packages(“xlsx”) =============================================================================

Error while loading xlsx package in R

> library(xlsx) Loading required package: rJava Error : .onLoad failed in loadNamespace() for ‘rJava’, details: call: dyn.load(file, DLLpath = DLLpath, …) error: unable to load shared object ‘/usr/lib/R/site-library/rJava/libs/rJava.so’: libjvm.so: cannot open shared object file: No such file or directory Error: package ‘rJava’ could not be loaded

Solution :

$ sudo updatedb $ locate libjvm.so /home/nitin/Softwares/java6/jdk1.6.0_07/jre/lib/i386/client/libjvm.so /home/nitin/Softwares/java6/jdk1.6.0_07/jre/lib/i386/server/libjvm.so /home/nitin/Softwares/java7/jdk1.7.0/jre/lib/i386/client/libjvm.so /home/nitin/Softwares/java7/jdk1.7.0/jre/lib/i386/server/libjvm.so /home/nitin/Softwares/jre1.7.0_21/lib/amd64/server/libjvm.so /usr/lib/java6/jdk1.6.0_45/jre/lib/amd64/server/libjvm.so /usr/lib/java7/jdk1.7.0/jre/lib/i386/client/libjvm.so /usr/lib/java7/jdk1.7.0/jre/lib/i386/server/libjvm.so /usr/lib/jvm/java-6-openjdk-amd64/jre/lib/amd64/cacao/libjvm.so /usr/lib/jvm/java-6-openjdk-amd64/jre/lib/amd64/jamvm/libjvm.so /usr/lib/jvm/java-6-openjdk-amd64/jre/lib/amd64/server/libjvm.so

$ sudo ln -s /usr/lib/java6/jdk1.6.0_45/jre/lib/amd64/server/libjvm.so /usr/lib

Now run > library(xlsx)

Issues with packages

Installing Car in Ubuntu

If the regular instalation does not work, try

sudo apt-get install r-cran-car

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s