
Econometrics in R


Grant V. Farnsworth*

October 26, 2008

*This paper was originally written as part of a teaching assistantship and has subsequently become a personal reference.

I learned most of this stuff by trial and error, so it may contain inefficiencies, inaccuracies, or incomplete explanations. If you find something here suboptimal or have suggestions, please let me know. Until at least 2009 I can be contacted at g-farnsworth@…

Contents

1 Introductory Comments
 1.1 What is R?
 1.2 How is R Better Than Other Packages?
 1.3 Obtaining R
 1.4 Using R Interactively and Writing Scripts
 1.5 Getting Help

2 Working with Data
 2.1 Basic Data Manipulation
 2.2 Caveat: Math Operations and the Recycling Rule
 2.3 Important Data Types
  2.3.1 Vectors
  2.3.2 Arrays, Matrices
  2.3.3 Dataframes
  2.3.4 Lists
  2.3.5 S3 Classes
  2.3.6 S4 Classes
 2.4 Working with Dates
 2.5 Merging Dataframes
 2.6 Opening a Data File
 2.7 Working With Very Large Data Files
  2.7.1 Reading fields of data using scan()
  2.7.2 Utilizing Unix Tools
  2.7.3 Using Disk instead of RAM
  2.7.4 Using RSQLite
 2.8 Issuing System Commands—Directory Listing
 2.9 Reading Data From the Clipboard
 2.10 Editing Data Directly

3 Cross Sectional Regression
 3.1 Ordinary Least Squares
 3.2 Extracting Statistics from the Regression
 3.3 Heteroskedasticity and Friends
  3.3.1 Breusch-Pagan Test for Heteroskedasticity
  3.3.2 Heteroskedasticity (Autocorrelation) Robust Covariance Matrix
 3.4 Linear Hypothesis Testing (Wald and F)
 3.5 Weighted and Generalized Least Squares
 3.6 Models With Factors/Groups

4 Special Regressions
 4.1 Fixed/Random Effects Models
  4.1.1 Fixed Effects
  4.1.2 Random Effects
 4.2 Qualitative Response
  4.2.1 Logit/Probit
  4.2.2 Multinomial Logit
  4.2.3 Ordered Logit/Probit
 4.3 Tobit and Censored Regression
 4.4 Quantile Regression
 4.5 Robust Regression - M Estimators
 4.6 Nonlinear Least Squares
 4.7 Two Stage Least Squares on a Single Structural Equation
 4.8 Systems of Equations
  4.8.1 Seemingly Unrelated Regression
  4.8.2 Two Stage Least Squares on a System

5 Time Series Regression
 5.1 Differences and Lags
 5.2 Filters
  5.2.1 Canned AR and MA Filters
  5.2.2 Manual Filtration
  5.2.3 Hodrick Prescott Filter
  5.2.4 Kalman Filter
 5.3 ARIMA/ARFIMA
 5.4 ARCH/GARCH
  5.4.1 Basic GARCH – garch()
  5.4.2 Advanced GARCH – garchFit()
  5.4.3 Miscellaneous GARCH – Ox G@RCH
 5.5 Correlograms
 5.6 Predicted Values
 5.7 Time Series Tests
  5.7.1 Durbin-Watson Test for Autocorrelation
  5.7.2 Box-Pierce and Breusch-Godfrey Tests for Autocorrelation
  5.7.3 Dickey-Fuller Test for Unit Root
 5.8 Vector Autoregressions (VAR)

6 Plotting
 6.1 Plotting Empirical Distributions
 6.2 Contour Plots
 6.3 Adding Legends and Stuff
 6.4 Adding Arrows, Text, and Markers
 6.5 Multiple Plots
 6.6 Saving Plots—png, jpg, eps, pdf, xfig
 6.7 Adding Greek Letters and Math Symbols to Plots
 6.8 Other Graphics Packages

7 Statistics
 7.1 Working with Common Statistical Distributions
 7.2 P-Values
 7.3 Sampling from Data

8 Math in R
 8.1 Matrix Operations
  8.1.1 Matrix Algebra and Inversion
  8.1.2 Factorizations
 8.2 Numerical Optimization
  8.2.1 Unconstrained Minimization
  8.2.2 Minimization with Linear Constraints
 8.3 Numerical Integration

9 Programming
 9.1 Writing Functions
 9.2 Looping
 9.3 Avoiding Loops
  9.3.1 Applying a Function to an Array (or a Cross Section of it)
  9.3.2 Replicating
 9.4 Conditionals
  9.4.1 Binary Operators
  9.4.2 WARNING: Conditionals and NA
 9.5 The Ternary Operator
 9.6 Outputting Text
 9.7 Pausing/Getting Input
 9.8 Timing Blocks of Code
 9.9 Calling C functions from R
  9.9.1 How to Write the C Code
  9.9.2 How to Use the Compiled Functions
 9.10 Calling R Functions from C

10 Changing Configurations
 10.1 Default Options
  10.1.1 Significant Digits
  10.1.2 What to do with NAs
  10.1.3 How to Handle Errors
  10.1.4 Suppressing Warnings

11 Saving Your Work
 11.1 Saving the Data
 11.2 Saving the Session Output
 11.3 Saving as LaTeX

12 Final Comments

13 Appendix: Code Examples
 13.1 Monte Carlo Simulation
 13.2 The Haar Wavelet
 13.3 Maximum Likelihood Estimation
 13.4 Extracting Info From a Large File
 13.5 Contour Plot

1 Introductory Comments

1.1 What is R?

R is an implementation of the object-oriented mathematical programming language S. It is developed by statisticians around the world and is free software, released under the GNU General Public License. Syntactically and functionally it is very similar (if not identical) to S+, the popular statistics package.

1.2 How is R Better Than Other Packages?

R is much more flexible than most software used by econometricians because it is a modern mathematical programming language, not just a program that does regressions and tests. This means our analysis need not be restricted to the functions included in the default package. There is an extensive and constantly expanding collection of libraries online for use in many disciplines. As researchers develop new algorithms and processes, the corresponding libraries get posted on the R website. In this sense R is always at the forefront of statistical knowledge. Because of the ease and flexibility of programming in R, it is easy to extend.

The S language is the de facto standard for statistical science. Reading the statistical literature, we find that examples and even pseudo-code are written in R-compatible syntax. Since most users have a statistical background, the jargon used by R experts sometimes differs from what an econometrician (especially a beginning econometrician) may expect. A primary purpose of this document is to eliminate this language barrier and allow the econometrician to tap into the work of these innovative statisticians.

Code written for R can be run on many computational platforms with or without a graphical user interface, and R comes standard with some of the most flexible and powerful graphics routines available anywhere.

And of course,R is completely free for any use.

1.3 Obtaining R

The R installation program can be downloaded free of charge from the R website, http://www.r-project.org. Because R is a programming language and not just an econometrics program, most of the functions we will be interested in are available through libraries (sometimes called packages) obtained from the R website. To obtain a library that does not come with the standard installation, follow the CRAN link on the above website. Under contrib you will find a list of compressed libraries ready for download. Click on the one you need and save it somewhere you can find it later. If you are using a gui, start R and click install package from local directory under the package menu. Then select the file that you downloaded. Now the package will be available for use in the future. If you are using R under linux, install new libraries by issuing the following command at the command prompt: "R CMD INSTALL packagename"

Alternately you can download and install packages at once from inside R by issuing a command like

>install.packages(c("car","systemfit"),repos="http://cran.r-project.org",dep=TRUE)

which installs the car and systemfit libraries. The repos parameter is usually auto-configured, so there is normally no need to specify it. The dependencies or dep parameter indicates that R should download packages that these depend on as well, and is recommended. Note: you must have administrator (or root) privileges on your computer to install the program and packages.

Contributed Packages Mentioned in this Paper and Why

(* indicates package is included by default)

adapt  Multivariate numerical integration
car  Regression tests and robust standard errors
DBI  Interact with databases
dse1  State space models, Kalman filtration, and Vector ARMA
filehash  Use hard disk instead of RAM for large datasets
fSeries  Garch models with nontrivial mean equations
fracdiff  Fractionally integrated ARIMA models
foreign*  Loading and saving data from other programs
ggplot2  Graphics and plotting
graphics*  Contour graphs and arrows for plots
grid  Graphics and plotting
Hmisc  LaTeX export
lattice  An alternate type of contour plot and other graphics
lmtest  Breusch-Pagan and Breusch-Godfrey tests
MASS*  Robust regression, ordered logit/probit
Matrix  Matrix norms and other matrix algebra
MCMCpack  Inverse gamma distribution
MNP  Multinomial probit via MCMC
nlme*  Nonlinear fixed and random effects models
nls*  Nonlinear least squares
nnet  Multinomial logit/probit
quantreg  Quantile regressions
R.matlab  Read matlab data files
RSQLite  Interact with SQL databases
sandwich (and zoo)  Heteroskedasticity and autocorrelation robust covariance
sem  Two stage least squares
survival*  Tobit and censored regression
systemfit  SUR and 2SLS on systems of equations
ts*  Time series manipulation functions
tseries  Garch, ARIMA, and other time series functions
VAR  Vector autoregressions
xtable  Alternative LaTeX export
zoo  Required by the sandwich package

From time to time we can get updates of the installed packages by running update.packages().

1.4 Using R Interactively and Writing Scripts

We can interact directly with R through its command prompt. Under windows the prompt and what we type are in red and the output it returns is blue, although you can control the font colors through "GUI preferences" in the edit menu. Pressing the up arrow will generally cycle through commands from the history. Notice that R is case sensitive and that every function call has parentheses at the end. Instead of issuing commands directly we can load script files that we have previously written, which may include new function definitions.

Script files generally have the extension ".R". These files contain commands as you would enter them at the prompt, and they are recommended for any project of more than a few lines. In order to load a script file named "mcmc.R" we would use the command

>source("mcmc.R")

One way to run R is to have a script file open in an external text editor and run periodically from the R prompt. Commands executed from a script file may not print as much output to the screen as they do when run interactively. If we want interactive-level verbosity, we can use the echo argument

>source("mcmc.R",echo=TRUE)

If no path is specified to the script file, R assumes that the file is located in the current working directory. The working directory can be viewed or changed via R commands

>getwd()

[1]"/home/gvfarns/r"

>setwd("/home/gvfarns")

>getwd()

[1]"/home/gvfarns"

or under windows using the menu item change working directory. Also note that when using older versions of R under windows the slashes must be replaced with double backslashes.

>getwd()

[1]"C:\\Program Files\\R\\rw1051\\bin"

>setwd("C:\\Program Files\\R\\scripts")

>getwd()

[1]"C:\\Program Files\\R\\scripts"

We can also run R in batch (noninteractive) mode under linux by issuing the command "R CMD BATCH scriptname.R". The output will be saved in a file named scriptname.Rout. Batch mode is also available under windows using Rcmd.exe instead of Rgui.exe.

Since every command we will use is a function that is stored in one of the libraries, we will often have to load libraries before working. Many of the common functions are in the library base, which is loaded by default. For access to any other function, however, we have to load the appropriate library.

>library(foreign)

will load the library that contains the functions for reading and writing data that is formatted for other programs, such as SAS and Stata. Alternately (under windows), we can pull down the package menu and select library.

1.5 Getting Help

There are several methods of obtaining help in R

>?qt

>help(qt)

>help.start()

>help.search("covariance")

Preceding the command with a question mark or giving it as an argument to help() gives a description of its usage and functionality. The help.start() function brings up a menu of help options, and help.search() searches the help files for the word or phrase given as an argument. Many times, though, the best help available can be found by a search online. Remember as you search that the syntax and functionality of R is almost identical to that of the proprietary statistical package S+.

The help tools above only search through the R functions that belong to packages on your computer. A large percentage of R questions I hear are of the form "Does R have a function to do..." Users do not know if functionality exists because the corresponding package is not installed on their computer. To search the R website for functions and references, use

>RSiteSearch("Kalman Filter")

The results from the search should appear in your web browser.

2 Working with Data

2.1 Basic Data Manipulation

R allows you to create many types of data storage objects, such as numbers, vectors, matrices, strings, and dataframes. The command ls() gives a list of all data objects currently available. The command rm() removes the data object given it as an argument. We can determine the type of an object using the command typeof() or its class type (which is often more informative) using class().

Entering the name of the object typically echoes its data to the screen. In fact, a function is just another data member in R. We can see a function's code by typing its name without parentheses.
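A quick sketch of these inspection commands (the object names here are invented for illustration):

```r
x <- 3.14            # a numeric scalar
s <- "hello"         # a character string
ls()                 # lists the names of objects in the workspace
typeof(x)            # "double" -- the internal storage type
class(x)             # "numeric" -- the (often more informative) class
rm(s)                # remove s from the workspace
```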

The command for creating and/or assigning a value to a data object is the less-than sign followed by the minus sign.

>g<-7.5

creates a numeric object called g, which contains the value 7.5. True vectors in R (as opposed to one dimensional matrices) are treated as COLUMN vectors, when the distinction needs to be made.

>f<-c(7.5,6,5)

>F<-t(f)

uses the c() (concatenate) command to create a vector with values 7.5, 6, and 5. c() is a generic function that can be used on multiple types of data. The t() command transposes f to create a 1x3 matrix, because "vectors" are always column vectors. The two data objects f and F are separate because of the case sensitivity of R. The command cbind() concatenates the objects given it side by side: into an array if they are vectors, and into a single dataframe if they are columns of named data.

>dat<-cbind(c(7.5,6,5),c(1,2,3))

Similarly, rbind() concatenates objects by rows, one above the other, and assumes that vectors given it are ROW vectors (see 2.3.1).

Notice that if we were concatenating strings instead of numeric values, we would have to put the strings in quotes. Alternately, we could use the Cs() command from the Hmisc library, which eliminates the need for quotes.

>Cs(Hey,you,guys)

[1] "Hey"  "you"  "guys"

Elements in vectors and similar data types are indexed using square brackets. R uses one-based indexing.

>f

[1] 7.5 6.0 5.0

>f[2]

[1] 6

Notice that for multidimensional data types, such as matrices and dataframes, leaving an index blank refers to the whole column or row corresponding to that index. Thus if foo is a 4x5 array of numbers,

>foo

will print the whole array to the screen,

>foo[1,]

will print the first row,

>foo[,3]

will print the third column, etc. We can get summary statistics on the data in foo using summary(), and we can determine its dimensionality using the NROW() and NCOL() commands. More generally, we can use the dim() command to learn the dimensions of many R objects.
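As a minimal sketch of these commands, assuming a made-up 4x5 array of random numbers:

```r
foo <- matrix(rnorm(20), nrow=4, ncol=5)  # a hypothetical 4x5 array
NROW(foo)         # 4, the number of rows
NCOL(foo)         # 5, the number of columns
dim(foo)          # both at once: 4 5
summary(foo[,1])  # summary statistics for the first column
```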

If we wish to extract or print only certain rows or columns, we can use the concatenation operator.

>oddfoo<-foo[,c(1,3,5)]

makes a 4x3 array out of columns 1, 3, and 5 of foo and saves it in oddfoo. By prepending the subtraction operator, we can remove certain columns

>nooddfoo<-foo[,-c(1,3,5)]

makes a 4x2 array out of columns 2 and 4 of foo (i.e., it removes columns 1, 3, and 5). We can also use comparison operators to extract certain columns or rows.

>smallfoo<-foo[foo[,1]<1,]

compares each entry in the first column of foo to one and inserts the row corresponding to each match into smallfoo. We can also reorder data. If wealth is a dataframe with columns year, gdp, and gnp, we could sort the data by year using order() or extract a period of years using the colon operator

>wealth<-wealth[order(wealth$year),]

>firstten<-wealth[1:10,]

>eighty<-wealth[wealth$year==1980,]

This sorts by year and puts the first ten years of data in firstten. All rows from year 1980 are stored in eighty (notice the double equals sign).

Using double instead of single brackets for indexing changes the behavior slightly. Basically, it doesn't allow referencing multiple objects using a vector of indices, as the single bracket case does. For example,

>w[[1:10]]

does not return a vector of the first ten elements of w, as it would in the single bracket case. Also, it strips off attributes and types. If the variable is a list, indexing it with single brackets yields a list containing the data; double brackets return the (vector of) data itself. Most often, when getting data out of lists, double brackets are wanted; otherwise single brackets are more common.
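The difference is easiest to see on a small list (the names here are invented for illustration):

```r
w <- list(a=1:3, b="text")
w[1]           # a list of length one, containing the vector 1:3
w[[1]]         # the vector 1:3 itself, attributes stripped
class(w[1])    # "list"
class(w[[1]])  # "integer"
```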

Occasionally we have data in the incorrect form (i.e., as a dataframe when we would prefer to have a matrix). In this case we can use the as. family of conversion functions. If all the values in goo are numeric, we could put them into a matrix named mgoo with the command

>mgoo<-as.matrix(goo)

Other data manipulation operations can be found in the standard R manual and online.There are a lot of them.

2.2 Caveat: Math Operations and the Recycling Rule

Mathematical operations such as addition and multiplication operate elementwise by default. The matrix algebra operations are generally surrounded by % (see section 8). The danger here happens when one tries to do math using certain objects of different sizes. Instead of halting and issuing an error as one might expect, R uses a recycling rule to decide how to do the math; that is, it repeats the values in the smaller data object. For example,

>a<-c(1,3,5,7)

>b<-c(2,8)

>a+b

[1]  3 11  7 15

Only if the dimensions are not multiples of each other does R return a warning (although it still does the computation)

>a<-c(2,4,16,7)

>b<-c(2,8,9)

>a+b

[1]  4 12 25  9

Warning message:

longer object length

is not a multiple of shorter object length in: a + b

At first the recycling rule may seem like a dumb idea (and it can cause errors if the programmer is not careful) but this is what makes operations like scalar addition and scalar multiplication of vectors and matrices (as well as vector-to-matrix addition) possible. One just needs to be careful about dimensionality when doing elementwise math in R.

Notice that although R recycles vectors when added to other vectors or data types, it does not recycle when adding, for example, two matrices. Adding matrices or arrays of different dimensions to each other produces an error.
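A brief sketch of both behaviors, using an arbitrary 2x3 matrix:

```r
m <- matrix(1:6, nrow=2)   # a 2x3 matrix
m + 10                     # scalar recycling: adds 10 to every element
# m + matrix(1:4, nrow=2)  # 2x3 plus 2x2: error, non-conformable arrays
```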

2.3 Important Data Types

2.3.1 Vectors

The most fundamental numeric data type in R is an unnamed vector. A scalar is, in fact, a 1-vector. Vectors are more abstract than one dimensional matrices because they do not contain information about whether they are row or column vectors, although when the distinction must be made, R usually assumes that vectors are columns.

The vector abstraction away from rows/columns is a common source of confusion in R for people familiar with matrix oriented languages, such as matlab. The confusion associated with this abstraction can be shown by an example

a<-c(1,2,3)

b<-c(4,6,8)

Now we can make a matrix by stacking vertically or horizontally, and R assumes that the vectors are either rows or columns, respectively.

>cbind(a,b)
     a b
[1,] 1 4
[2,] 2 6
[3,] 3 8
>rbind(a,b)
  [,1] [,2] [,3]
a    1    2    3
b    4    6    8

One function assumed that these were column vectors and the other that they were row vectors. The take home lesson is that a vector is not a one dimensional matrix, so don't expect them to work as they do in a linear algebra world. To convert a vector to an Nx1 matrix (a column vector) for use in linear algebra-type operations, use as.matrix().

Note that t() returns a matrix, so the object t(t(a)) is not the same as a.

2.3.2 Arrays, Matrices

In R, homogeneous (all elements are of the same type) multivariate data may be stored as an array or a matrix. A matrix is a two-dimensional object, whereas an array may be of many dimensions. These data types may or may not have special attributes giving names to columns or rows (although one cannot reference a column using the $ operator as with dataframes). Note that one cannot make a matrix, array, or vector out of two different types of data (numeric and character, for example): either they will be coerced into the same type or an error will occur.
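For reference, matrices and arrays are typically created with matrix() and array(); the dimensions and names below are arbitrary:

```r
A <- matrix(1:6, nrow=2, ncol=3)   # filled column by column by default
B <- array(1:24, dim=c(2,3,4))     # a three-dimensional array
rownames(A) <- c("r1","r2")        # optional dimension names
A["r1", 2]                         # element in row r1, column 2
```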

2.3.3 Dataframes

Most econometric data will be in the form of a dataframe. A dataframe is a collection of vectors (we think of them as columns) containing data, which need not all be of the same type, but each column must have the same number of elements. Each column has a title by which the whole vector may be addressed. If goo is a 3x4 dataframe with titles age, gender, education, and salary, then we can print the salary column with the command

>goo$salary

or view the names of the columns in goo

>names(goo)

Most mathematical operations affect multidimensional data elementwise (unlike some mathematical languages, such as matlab). From the previous example,

>salarysq<-(goo$salary)^2

creates a vector named salarysq with entries equal to the square of the corresponding entries in goo$salary. Output from actions can also be saved in the original variable, for example,

>salarysq<-sqrt(salarysq)

replaces each of the entries in salarysq with its square root.

>goo$lnsalary<-log(salarysq)

adds a column named lnsalary to goo, containing the log of the salary.

2.3.4 Lists

A list is more general than a dataframe. It is essentially a bunch of data objects bound together, optionally with a name given to each. These data objects may be scalars, strings, dataframes, or any other type. Functions that return many elements of data (like summary()) generally bind the returned data together as a list, since functions return only one data object. As with dataframes, we can see what objects are in a list (by name if they have them) using the names() command and refer to them either by name (if existent) using the $ symbol or by number using brackets. Remember that referencing a member of a list is generally done using double, not single, brackets (see section 2.1). Sometimes we would like to simplify a list into a vector. For example, the function strsplit() returns a list containing substrings of its argument. In order to make them into a vector of strings, we must change the list to a vector using unlist(). Lists sometimes get annoying, so unlist() is a surprisingly useful function.
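A small sketch of the strsplit()/unlist() pattern (the string is arbitrary):

```r
parts <- strsplit("gdp,gnp,year", split=",")  # returns a list
parts[[1]]                                    # the character vector inside
v <- unlist(parts)                            # now a plain character vector
v[2]                                          # "gnp"
```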

2.3.5 S3 Classes

Many functions return an object containing many types of data, like a list, but would like R to know something about what type of object it is. A list with an associated "class" attribute designating what type of list it is is an S3 class. If the class is passed as an argument, R will first search for an appropriately named function. If x is of class foo and you print it with

>print(x)

the print() routine first searches for a function named print.foo() and will use that if it exists. Otherwise it will use the generic print.default(). For example, if x is the output of a call to lm(), then

>print(x)

will call print.lm(x), which prints regression output in a meaningful and aesthetically pleasing manner.

S3 lists are quite simple to use. They are really just lists with an extra attribute. We can create them either using the class() function or by just adding the class attribute after creation of a list.

>h<-list(a=rnorm(3),b="This shouldn’t print")

>class(h)<-"myclass"

>print.myclass<-function(x){cat("A is:",x$a,"\n")}

>print(h)

A is: -0.710968 -1.611896 0.6219214

If we were to call print() without assigning the class, we would get a different result.

Many R packages include extensions to common generic functions like print(), summary(), and plot() which operate on the particular classes produced by that package. The ability of R to choose a function to execute depending on the class of the data passed to it makes interacting with new classes very convenient. On the other hand, many extensions have options specific to them, so we must read the help file on that particular extension to know how to best use it. For example, we should read up on the regression print routine using

>?summary.lm

instead of

>?summary

2.3.6 S4 Classes

S4 classes are a recent addition to R. They generically hold data and functions, just like S3 classes, but have some technical advantages, which transcend the scope of this document. For our purposes, the most important difference between an S3 and S4 class is that attributes of the latter are referenced using @ instead of $, and an S4 class can only be created using the new() command.

>g<-garchFit(~arma(0,1)+garch(2,3),y)

>fitvalues<-g@fit

2.4 Working with Dates

The standard way of storing dates internally in R is as an object of class Date. This allows for such things as subtraction of one date from another, yielding the number of days between them. To convert data to dates, we use as.Date(). This function takes as input a character string and a format. If the given vector of dates is stored as a numeric format (like "20050627") it should be converted to a string using as.character() first. The format argument informs the code what part of the string corresponds to what part of the date. Four digit year is %Y, two digit year is %y, numeric month is %m, alphabetic (abbreviated) month is %b, alphabetic (full) month is %B, day is %d. For other codes, see the help files on strptime. For example, if d is a vector of dates formatted like "2005-Jun-27", we could use

>g<-as.Date(d,format="%Y-%b-%d")

Internally, Date objects are numeric quantities, so they don't take up very much memory.
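Since Date objects are numeric, differences and comparisons work as expected; a quick sketch using arbitrary dates:

```r
d1 <- as.Date("20050627", format="%Y%m%d")
d2 <- as.Date("20050704", format="%Y%m%d")
d2 - d1              # Time difference of 7 days
as.numeric(d2 - d1)  # 7, as a plain number
d2 > d1              # TRUE
```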

We can perform the reverse operation of as.Date(), formatting or extracting parts of a Date object, using format(). For example, given a column of numbers like "20040421", we can extract a character string representing the year using

>year<-format(as.Date(as.character(v$DATE),format="%Y%m%d"),format="%Y")

Although we can extract day of the week information in string or numeric form using format(), a simpler interface is available using the weekdays() function.

>mydates<-as.Date(c("19900307","19900308"),format="%Y%m%d")

>weekdays(mydates)

[1] "Wednesday" "Thursday"

Notice that in order to get a valid date, we need to have the year, month, and day. Suppose we were using monthly data; we would need to assign each data point to, for example, the first day of the month. In the case of a numeric format such as "041985", where the first two digits represent the month and the last four the year, we can simply add 10,000,000 to the number, convert to string, and use the format "%d%m%Y". In the case of a string date such as "April 1985" we can use

>as.Date(paste("1",v$DATE),format="%d%B%Y")

Notice that in order for paste to work correctly, v$DATE must be a vector of character strings. Some read methods automatically convert strings to factors by default, which we can rectify by passing the as.is=T keyword to the read method or converting back using as.character().

2.5 Merging Dataframes

If we have two dataframes covering the same time or observations but not completely aligned, we can merge the two dataframes using merge(). Either we can specify which column to use for aligning the data, or merge() will try to identify column names in common between the two.

For example, if B is a dataframe of bond prices over a certain period of time with a column named date containing the dates of the observations, and E is a similar dataframe of equity prices over about the same time period, we could merge them into one dataframe using

>OUT<-merge(B,E)

If the date column was named date in B but day in E,the command would instead be

>OUT<-merge(B,E,by.x="date",by.y="day")

Notice that by default merge() includes only rows in which data are present in both B and E. To put NA values in the empty spots instead of omitting the rows, we include the all=T keyword. We can also specify whether to include NA values only from one or the other using the all.x and all.y keywords.
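A toy sketch of the two behaviors, with made-up data (the dates and prices are arbitrary):

```r
B <- data.frame(date=c(1,2,3), bond=c(99.1, 99.4, 99.2))
E <- data.frame(date=c(2,3,4), equity=c(10.2, 10.5, 10.1))
merge(B, E)            # dates 2 and 3 only: the intersection
merge(B, E, all=TRUE)  # dates 1 through 4, with NA where a series is missing
```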

2.6 Opening a Data File

R is able to read data from many formats. The most common format is a text file with data separated into columns and with a header above each column describing the data. If blah.dat is a text file of this type and is located on the windows desktop, we could read it using the command

>mydata<-read.table("C:/WINDOWS/Desktop/blah.dat",header=TRUE)

Now mydata is a dataframe with named columns, ready for analysis. Note that R assumes that there are no labels on the columns, and gives them default values, if you omit the header=TRUE argument. Now let's suppose that instead of blah.dat we have blah.dta, a stata file.

>library(foreign)

>mydata<-read.dta("C:/WINDOWS/Desktop/blah.dta")

Stata files automatically have headers.

Another data format we may read is .csv comma-delimited files (such as those exported by spreadsheets). These files are very similar to those mentioned above, but use punctuation to delimit columns and rows. Instead of read.table(), we use read.csv(). Fixed width files can be read using read.fwf().

Matlab (.mat) files can be read using readMat() from the R.matlab package. The function writeMat() from the same package writes matlab data files.

2.7 Working With Very Large Data Files

R objects can be as big as our physical computer memory (and operating system) will allow, but R is not designed for very large datasets. This means that extremely large objects can slow everything down tremendously and suck up RAM greedily. The read.table() family of routines assumes that we are not working with very large data sets and so is not careful to conserve memory[1]. They load everything at once and probably make at least one copy of it. A better way to work with huge datasets is to read the file a line (or group of lines) at a time. We do this using connections. A connection is an R object that points to a file or other input/output stream. Each time we read from a connection, the location in the file from which we read moves forward.

Before we can use a connection, we must create it using file() if our data source is a file or url() for an online source (there are other types of connections too). Then we use open() to open it. Now we can read one or many lines of text using readLines(), read fields of data using scan(), or write data using cat() or one of the write.table() family of routines. When we are done we close using close().
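The whole workflow can be sketched as follows; the file name is hypothetical, and the chunk size is arbitrary:

```r
f <- file("huge.txt")    # create the connection (file assumed to exist)
open(f, "r")             # open it for reading
repeat {
  lines <- readLines(f, n=100)   # read up to 100 lines from where we left off
  if (length(lines) == 0) break  # zero lines returned: end of file
  # ... process this chunk of lines here ...
}
close(f)                 # release the connection
```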

2.7.1 Reading fields of data using scan()

Reading fields of data from a huge file is a very common task, so we give it special attention. The most important argument to scan() is what, which specifies what type of data to read in. If the file contains columns of data, what should be a list, with each member of the list representing a column of data. For example, if the file contains a name followed by a comma separator and an age, we could read a single line using

>a<-scan(f,what=list(name="",age=0),sep=",",nlines=1)

where f is an open connection. Now a is a list with fields name and age. Example 13.4 shows how to read from a large data file.

If we try to scan when the connection has reached the end of the file, scan() returns an empty list. We can check it using length() in order to terminate our loop.

Frequently we know how many fields are in each line and we want to make sure scan() gets all of them, filling the missing ones with NA. To do this we specify fill=T. Notice that in many cases scan() will fill empty fields with NA anyway.

Unfortunately scan() returns an error if it tries to read a line and the data it finds is not what it is expecting. For example, if the string "UNK" appeared under the age column in the above example, we would have an error. If there are only a few possible exceptions, they can be passed to scan() as na.strings. Otherwise we need to read the data in as strings and then convert to numeric or other types using as.numeric() or some other tool.

Notice that reading one line at a time is not the fastest way to do things. R can comfortably read 100, 1000, or more lines at a time. Increasing how many lines are read per iteration could speed up large reads considerably. With large files, we could read lines 1000 at a time, transform them, and then write 1000 at a time to another open connection, thereby keeping system memory free.

If all of the data is of the same type and belongs in the same object (a 2000x2000 numeric matrix, for example) we can use scan() without including the nlines argument and get tremendously faster reads. The resulting vector would need only to be converted to type matrix.
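As a sketch of the chunked-read pattern described above (the file contents here are made up for illustration), we can read a comma separated file of name/age pairs a few lines at a time:

```r
# Create a tiny demo file so the sketch is self-contained;
# in practice f would point at the real (huge) data file.
demo <- tempfile()
writeLines(c("alice,30", "bob,25", "carol,41"), demo)

f <- file(demo)   # create the connection
open(f)           # open it for reading
total <- 0
repeat {
  # read up to 2 lines per iteration (use 1000 or more in practice)
  chunk <- scan(f, what=list(name="", age=0), sep=",",
                nlines=2, quiet=TRUE)
  if (length(chunk$name) == 0) break   # nothing read: end of file
  total <- total + length(chunk$name)  # ...process chunk$name, chunk$age...
}
close(f)
total
```

Note that at end of file the components of the returned list have length zero, so testing the length of one field is a reliable termination check.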

2.7.2Utilizing Unix Tools

If you are using R on a linux/unix machine2 you can use various unix utilities (like grep and awk) to read only the columns and rows of your file that you want. The utility grep trims out rows that do or do not contain a specified pattern. The programming language awk is a record oriented tool that can pull out and manipulate columns as well as rows based on a number of criteria.

1 According to the help file for read.table() you can improve memory usage by informing read.table() of the number of rows using the nrows parameter. On unix/linux you can obtain the number of rows in a text file using "wc -l".

2 Thanks to Wayne Folta for these suggestions.

Some of these tools are useful within R as well. For example, we can preallocate our dataframes according to the number of records (rows) we will be reading in. To know how large a dataframe to allocate for the calls in the example above, we could use

> howmany <- as.numeric(system("grep -c ',C,' file.dat", intern=T))

Since allocating and reallocating memory is one of the time consuming parts of the scan() loop, we can save a lot of time and trouble this way. To just determine the number of rows, we can use the utility wc.

> totalrows <- as.numeric(strsplit(system("wc -l Week.txt", intern=T), split=" ")[[1]][1])

Here system() returns the number of rows, but with the file name as well; strsplit() breaks the output into words, and we then convert the first word to a number.

The bottom line is that we should use the right tool for the right job. Unix utilities like grep, awk, and wc can be fast and dirty ways to save a lot of work in R.

2.7.3Using Disk instead of RAM

Unfortunately, R uses only system RAM by default. So if the dataset we are loading is very large it is likely that our operating system will not be able to allocate a large enough chunk of system RAM for it, resulting in termination of our program with a message that R could not allocate a vector of that size. Although R has no built in disk cache system, there is a package called filehash that allows us to store our variables on disk instead of in system RAM. Clearly this will be slower, but at least our programs will run as long as we have sufficient disk space and our file size does not exceed the limits of our operating system. And sometimes that makes all the difference.

Instead of reading a file into memory, we read into a database and load that database as an environment

> dumpDF(read.table("large.txt", header=T), dbName="mydb")

> myenv <- db2env(db="mydb")

Now we can operate on the data within its environment using with()

> with(myenv, z <- y + x)

Actually I haven't spent much time with this package so I'm not familiar with its nuances, but it seems very promising3.

2.7.4Using RSQLite

Many times the best way to work with large datasets is to store them in SQL databases and then just query the stuff you need using MySQL or SQLite from within R. This functionality is available using the RSQLite and DBI libraries.

2.8Issuing System Commands—Directory Listing

Sometimes we want to issue a command to the operating system from inside of R. For example, under unix we may want to get a list of the files in the current directory that begin with the letter x. We could execute this command using

>system("ls x*")

xaa xab xac xad xae

If we want to save the output of the command as an R object, we use the keyword intern

> files <- system("ls x*", intern=T)

3For more information see https://www.wendangku.net/doc/b716781314.html,/2007/09/dealing-with-large-data-set-in-r.html

2.9Reading Data From the Clipboard

When importing from other applications, such as spreadsheets and database managers, the quickest way to get the data is to highlight it in the other application, copy it to the desktop clipboard, and then read the data from the clipboard. R treats the clipboard like a file, so we use the standard read.table() command

> indata <- read.table("clipboard")

2.10Editing Data Directly

R has a built-in spreadsheet-like interface for editing data. It's not very advanced, but it works in a pinch. Suppose a is a dataframe; we can edit it in place using

>data.entry(a)

Any changes we make (including changing the type of data contained in a column) will be reflected in a immediately. If we want to save the changed data to a different variable, we could have used

> b <- de(a)

Notice that both de() and data.entry() return a variable of type list. If what we wanted was a dataframe, for example, we need to convert it back after editing.

The function edit() works like de() but for many different data types. In practice, it calls either de() or the system default text editor (which is set using options()).

A similar useful function is fix(), which edits an R object in place. fix() operates on any kind of data: dataframes are opened with de() and functions are opened with a text editor. This can be useful for quick edits either to data or code.

3Cross Sectional Regression

3.1Ordinary Least Squares

Let's consider the simplest case. Suppose we have a data frame called byu containing columns for age, salary, and exper. We want to regress salary on various forms of age and exper. A simple linear regression might be

> lm(byu$salary ~ byu$age + byu$exper)

or alternately:

> lm(salary ~ age + exper, data=byu)

As a third alternative, we could "attach" the dataframe, which makes its columns available as regular variables

>attach(byu)

> lm(salary ~ age + exper)

Notice the syntax of the model argument (using the tilde). The above command would correspond to the linear model

salary = β0 + β1 age + β2 exper + ε    (1)

Using lm() results in an abbreviated summary being sent to the screen, giving only the β coefficient estimates. For more exhaustive analysis, we can save the results as a data member or "fitted model"

> result <- lm(salary ~ age + exper + age*exper, data=byu)

> summary(result)

> myresid <- result$resid

> vcov(result)

The summary() command, run on raw data, such as byu$age, gives statistics, such as the mean and median (these are also available through their own functions, mean and median). When run on an lm object, summary() gives important statistics about the regression, such as p-values and the R².

The residuals and several other pieces of data can also be extracted from result, for use in other computations. The variance-covariance matrix (of the beta coefficients) is accessible through the vcov() command.

Notice that more complex formulae are allowed, including interaction terms (specified by multiplying two data members) and functions such as log() and sqrt(). Unfortunately, in order to include a power term, such as age squared, we must either first compute the values, then run the regression, or use the I() operator, which forces computation of its argument before evaluation of the formula

> byu$agesq <- (byu$age)^2

> result <- lm(salary ~ age + agesq + log(exper) + age*log(exper), data=byu)

or

> result <- lm(salary ~ age + I(age^2) + log(exper) + age*log(exper), data=byu)

Notice that if we omit the I() operator and don't explicitly generate a power term variable (like agesq) then lm() will not behave as expected4, but it will not give us an error or warning, so we must be careful.
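A quick simulated example shows why this matters: in formula syntax x^2 is interpreted as formula algebra (factor crossing), not arithmetic, so without I() the squared term silently collapses into x.

```r
set.seed(1)
x <- rnorm(100)
y <- 1 + 2*x + 3*x^2 + rnorm(100)

bad  <- lm(y ~ x + x^2)     # x^2 collapses to x: only intercept and x
good <- lm(y ~ x + I(x^2))  # I() protects the arithmetic: three terms

length(coef(bad))
length(coef(good))
```

Comparing the coefficient counts (2 versus 3) makes the silent omission visible.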

In order to run a regression without an intercept, we simply specify the intercept explicitly, traditionally with a zero.

> result <- lm(smokes ~ 0 + male + female, data=smokerdata)
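With two mutually exclusive dummies and no intercept, the fitted coefficients are simply the group means, which makes a handy sanity check (simulated data with illustrative names):

```r
set.seed(7)
male   <- rep(c(1, 0), each = 50)
female <- 1 - male
smokes <- 0.3*male + 0.5*female + rnorm(100, sd = 0.1)

fit <- lm(smokes ~ 0 + male + female)
coef(fit)                                           # the two coefficients...
c(mean(smokes[male == 1]), mean(smokes[female == 1]))  # ...equal the group means
```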

3.2Extracting Statistics from the Regression

The most important statistics and parameters of a regression are stored in the lm object or the summary object. Consider the smoking example above

> output <- summary(result)

> SSR <- deviance(result)

> LL <- logLik(result)

> DegreesOfFreedom <- result$df

> Yhat <- result$fitted.values

> Coef <- result$coefficients

> Resid <- result$residuals

> s <- output$sigma

> RSquared <- output$r.squared

> CovMatrix <- s^2*output$cov

> aic <- AIC(result)

Where SSR is the residual sum of squares, LL is the log likelihood statistic, Yhat is the vector of fitted values, Resid is the vector of residuals, s is the estimated standard deviation of the errors (assuming homoskedasticity), CovMatrix is the variance-covariance matrix of the coefficients (also available via vcov()), aic is the Akaike information criterion, and other statistics are as named.

Note that the AIC criterion is defined by R as

AIC = -2 log L(p) + 2p

where p is the number of estimated parameters and L(p) is the likelihood. Some econometricians prefer to call AIC/N the information criterion. To obtain the Bayesian Information Criterion (or Schwarz Bayesian Criterion) we use AIC but specify a different "penalty" parameter as follows

> sbc <- AIC(result, k=log(NROW(smokerdata)))

which means

SBC = -2 log L(p) + p log(N)
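To check these definitions against R's output, we can reproduce AIC() and the SBC by hand from logLik() (simulated data; note that p counts the error variance as an estimated parameter, so a two-regressor model has p = 3):

```r
set.seed(2)
x <- rnorm(50)
y <- 1 + 2*x + rnorm(50)
result <- lm(y ~ x)

ll <- logLik(result)
p  <- attr(ll, "df")       # estimated parameters: two betas plus sigma
n  <- NROW(x)

aic <- -2*as.numeric(ll) + 2*p
sbc <- -2*as.numeric(ll) + p*log(n)

all.equal(aic, AIC(result))
all.equal(sbc, AIC(result, k = log(n)))
```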

4 Generally it just omits power terms whose first power is already included in the regression.

3.3Heteroskedasticity and Friends

3.3.1Breusch-Pagan Test for Heteroskedasticity

In order to test for the presence of heteroskedasticity, we can use the Breusch-Pagan test from the lmtest package. Alternatively we can use the ncv.test() function from the car package. They work pretty much the same way. After running the regression, we call the bptest() function with the fitted regression.

> unrestricted <- lm(z ~ x)

> bptest(unrestricted)

        Breusch-Pagan test

data:  unrestricted
BP = 44.5465, df = 1, p-value = 2.484e-11

This performs the "studentized" version of the test. In order to be consistent with some other software (including ncv.test()) we can specify studentize=FALSE.
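For intuition, the studentized (Koenker) form of the statistic can be computed by hand in base R: regress the squared residuals on the regressors and use n·R² as a χ² statistic with degrees of freedom equal to the number of regressors (simulated heteroskedastic data; this should correspond to what bptest() reports by default):

```r
set.seed(8)
x <- rnorm(200)
z <- 1 + x + rnorm(200)*(1 + abs(x))   # error variance grows with |x|
fit <- lm(z ~ x)

aux  <- lm(resid(fit)^2 ~ x)           # auxiliary regression
BP   <- length(x) * summary(aux)$r.squared
pval <- pchisq(BP, df = 1, lower.tail = FALSE)
BP
pval
```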

3.3.2Heteroskedasticity(Autocorrelation)Robust Covariance Matrix

In the presence of heteroskedasticity, the OLS estimates remain unbiased, but the OLS estimates of the variance of the beta coefficients are no longer correct. In order to compute the heteroskedasticity consistent covariance matrix5 we use the hccm() function (from the car library) instead of vcov(). The diagonal entries are variances and off-diagonals are covariance terms.

This functionality is also available via the vcovHC() command in the sandwich package. Also in that package is the heteroskedasticity and autocorrelation robust Newey-West estimator, available in the function vcovHAC() or the function NeweyWest().
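To demystify what these functions return, White's basic (HC0) estimator can be assembled by hand from the model matrix (simulated data; hccm() and vcovHC() apply finite-sample refinements to this formula by default):

```r
set.seed(3)
x <- rnorm(200)
y <- 1 + 2*x + rnorm(200)*(1 + abs(x))   # heteroskedastic errors
fit <- lm(y ~ x)

X <- model.matrix(fit)                   # n x k design matrix
e <- resid(fit)
bread <- solve(crossprod(X))             # (X'X)^{-1}
meat  <- t(X) %*% (X * e^2)              # X' diag(e^2) X
hc0   <- bread %*% meat %*% bread        # the sandwich
sqrt(diag(hc0))                          # robust standard errors
```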

3.4Linear Hypothesis Testing(Wald and F)

The car package provides a function that automatically performs linear hypothesis tests. It does either an F or a Wald test using either the regular or adjusted covariance matrix, depending on our specifications. In order to test hypotheses, we must construct a hypothesis matrix and a right hand side vector. For example, if we have a model with five parameters, including the intercept, and we want to test against

H0: β0 = 0, β2 + β3 = 1

The hypothesis matrix and right hand side vector would be

    [ 1 0 0 0 0 ]       [ 0 ]
    [ 0 0 1 1 0 ] β  =  [ 1 ]

and we could implement this as follows

> unrestricted <- lm(y ~ x1 + x2 + x3 + x4)

> rhs <- c(0, 1)

> hm <- rbind(c(1,0,0,0,0), c(0,0,1,1,0))

> linear.hypothesis(unrestricted, hm, rhs)

Notice that if unrestricted is an lm object, an F test is performed by default; if it is a glm object, a Wald χ² test is done instead. The type of test can be modified through the type argument.

Also, if we want to perform the test using heteroskedasticity or autocorrelation robust standard errors, we can either specify white.adjust=TRUE to use White standard errors, or we can supply our own covariance matrix using the vcov parameter. For example, if we had wished to use the Newey-West corrected covariance matrix above, we could have specified

5 obtaining the White standard errors, or rather, their squares.

> linear.hypothesis(unrestricted, hm, rhs, vcov=NeweyWest(unrestricted))

See the section on heteroskedasticity robust covariance matrices for information about the NeweyWest() function. We should remember that the specification white.adjust=TRUE corrects for heteroskedasticity using an improvement to the White estimator. To use the classic White estimator, we can specify white.adjust="hc0".
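The statistic computed in the default case is straightforward to reproduce by hand, which clarifies the mechanics (simulated data, same hypothesis matrix as above):

```r
set.seed(4)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- 0.2 + 0.5*x1 + 0.7*x2 + 0.3*x3 + 0.6*x4 + rnorm(n)
unrestricted <- lm(y ~ x1 + x2 + x3 + x4)

hm  <- rbind(c(1,0,0,0,0), c(0,0,1,1,0))   # hypothesis matrix
rhs <- c(0, 1)                             # right hand side
b   <- coef(unrestricted)
V   <- vcov(unrestricted)
q   <- nrow(hm)                            # number of restrictions

# F = (Rb - r)' [R V R']^{-1} (Rb - r) / q
discrep <- hm %*% b - rhs
Fstat <- as.numeric(t(discrep) %*% solve(hm %*% V %*% t(hm)) %*% discrep) / q
pval  <- pf(Fstat, q, df.residual(unrestricted), lower.tail = FALSE)
```

For linear models this Wald-style F is algebraically identical to the restricted-versus-unrestricted sum of squares F.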

3.5Weighted and Generalized Least Squares

You can do weighted least squares by passing a vector containing the weights to lm().

> result <- lm(smokes ~ 0 + male + female, data=smokerdata, weights=myweights)
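To see what the weights argument does (simulated data, illustrative names): weighted least squares is the same as OLS after multiplying every variable, including the intercept column, by the square root of its weight:

```r
set.seed(5)
x <- rnorm(60)
y <- 1 + 2*x + rnorm(60)
w <- runif(60, 0.5, 2)

wls <- lm(y ~ x, weights = w)

# by hand: scale each observation by sqrt(w), including the intercept
sw <- sqrt(w)
manual <- lm(I(sw*y) ~ 0 + sw + I(sw*x))

coef(wls)
coef(manual)   # same values, different names
```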

Generalized least squares is available through the lm.gls() command in the MASS library. It takes a formula, weighting matrix, and (optionally) a dataframe from which to get the data as arguments.

The glm() command provides access to a plethora of other generalized regression methods. See the help file for more details.

3.6Models With Factors/Groups

There is a separate datatype for qualitative factors in R. When a variable included in a regression is of type factor, the requisite dummy variables are automatically created. For example, if we wanted to regress the adoption of personal computers (pc) on the number of employees in the firm (emple) and include a dummy for each state (where state is a vector of two letter abbreviations), we could simply run the regression

> summary(lm(pc ~ emple + state))

Call:
lm(formula = pc ~ emple + state)

Residuals:
     Min       1Q   Median       3Q      Max
 -1.7543  -0.5505   0.3512   0.4272   0.5904

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.572e-01  6.769e-02   8.232   <2e-16 ***
emple        1.459e-04  1.083e-05  13.475   <2e-16 ***
stateAL     -4.774e-03  7.382e-02  -0.065    0.948
stateAR      2.249e-02  8.004e-02   0.281    0.779
stateAZ     -7.023e-02  7.580e-02  -0.926    0.354
stateDE      1.521e-01  1.107e-01   1.375    0.169
...
stateFL     -4.573e-02  7.136e-02  -0.641    0.522
stateWY      1.200e-01  1.041e-01   1.153    0.249
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4877 on 9948 degrees of freedom
Multiple R-Squared: 0.02451, Adjusted R-squared: 0.01951
F-statistic: 4.902 on 51 and 9948 DF, p-value: < 2.2e-16

The three dots indicate that some of the coefficients have been removed for the sake of brevity.

In order to convert data (either of type string or numeric) to a factor, simply use the factor() command. It can even be used inside the regression. For example, if we wanted to do the same regression, but by a numeric code specifying an area, we could use the command

> myout <- lm(pc ~ emple + factor(naics6))

which converts naics6 into a factor, generates the appropriate dummies, and runs a standard regression.
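We can inspect the dummies R generates from a factor by looking at the model matrix directly (a toy example with made-up state labels):

```r
state <- factor(c("NY", "NY", "TX", "UT", "TX"))
emple <- c(10, 20, 30, 40, 50)

# one dummy column per factor level except the first, which is the baseline
mm <- model.matrix(~ emple + state)
colnames(mm)
```

Here NY (the first level alphabetically) is absorbed into the intercept, and stateTX and stateUT appear as dummy columns.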

4Special Regressions

4.1Fixed/Random E?ects Models

Warning: The definitions of fixed and random effects models are not standardized across disciplines. I describe fixed and random effects estimation as these terms are generally used by econometricians. The terms "fixed" and "random" have historical roots and are econometrically misleading.

Within the context of economics, fixed and random effects estimators are panel data models that account for cross sectional variation in the intercept. Letting i denote the cross sectional index (or the one by which data is grouped) and t the time index (or the index that varies within a group), a standard fixed effects model can be written

y_it = α + u_i + βX_it + ε_it.    (2)

Essentially, each individual has a different time-invariant intercept (α + u_i). Usually we are interested in β but not any of the u_i. A random effects model has the same mean equation, but imposes the additional restriction that the individual specific effect is uncorrelated with the explanatory variables X_it. That is, E[u_i X_it] = 0. Econometrically this is a more restrictive version of the fixed effects estimator (which allows for arbitrary correlation between the "effect" and exogenous variables). One should not let the unfortunate nomenclature confuse the relationship between these models.

4.1.1Fixed E?ects

A simple way to do a fixed effects estimation, particularly if the cross sectional dimension is not large, is to include a dummy for each individual, that is, make the cross sectional index a factor. If index identifies the individuals in the sample, then

> lm(y ~ factor(index) + x)

will do a fixed effects estimation and will report the correct standard errors on β. Unfortunately, in cases where there are many individuals in the sample and we are not interested in the value of their fixed effects, the lm() results are awkward to deal with and the estimation of a large number of u_i coefficients could render the problem numerically intractable.

A more common way to estimate fixed effects models is to remove the fixed effect by time demeaning each variable (the so called within estimator). Since demeaning removes anything constant within a group, including the intercept, equation (2) becomes

(y_it − ȳ_i) = β(X_it − X̄_i) + ζ_it.    (3)

Most econometric packages (for example, stata's xtreg) use this method when doing fixed effects estimation. Using R, one can manually time demean the independent and dependent variables. If d is a dataframe containing year, firm, profit, and predictor and we wish to time demean each firm's profit and predictor, we could run something like

> g <- d

> for (i in unique(d$firm)) {

+   timemean <- colMeans(d[d$firm==i, c("profit","predictor")])

+   g$profit[d$firm==i] <- d$profit[d$firm==i] - timemean["profit"]

+   g$predictor[d$firm==i] <- d$predictor[d$firm==i] - timemean["predictor"]

+ }

> output <- lm(profit ~ predictor, data=g)
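The same within transformation can be written more compactly with ave(), which returns group means aligned with the data. A self-contained check on a simulated panel confirms that the demeaned slope matches the dummy-variable (factor) estimate exactly:

```r
set.seed(6)
d <- data.frame(firm = rep(1:10, each = 8))
d$predictor <- rnorm(80)
d$profit <- rnorm(10)[d$firm] + 2*d$predictor + rnorm(80)  # firm intercepts

g <- d
g$profit    <- d$profit    - ave(d$profit,    d$firm)  # subtract firm means
g$predictor <- d$predictor - ave(d$predictor, d$firm)

within_fit <- lm(profit ~ predictor, data = g)
dummy_fit  <- lm(profit ~ factor(firm) + predictor, data = d)

coef(within_fit)["predictor"]
coef(dummy_fit)["predictor"]   # identical slopes
```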

Notice that the standard errors reported by this regression will be biased downward. The σ̂² reported by lm() is computed using

σ̂² = SSR / (NT − K)
