文档库 最新最全的文档下载
当前位置:文档库 › bandwidthSelection

bandwidthSelection

bandwidthSelection
bandwidthSelection

Elementary Research about Bandwidth Selection of Univariate Kernel Density Estimation in NEI PM2.5

Emissions Data

Li Xuefeng

Friday,April10,2015

======================================================================== Introduction

Fine particulate matter(PM2.5)is an ambient air pollutant for which there is strong evidence that it is

harmful to human health.In the United States,the Environmental Protection Agency(EPA)is tasked with

setting national ambient air quality standards for?ne PM and for tracking the emissions of this pollutant

into the atmosphere.Approximatly every3years,the EPA releases its database on emissions of PM2.5.This

database is known as the National Emissions Inventory(NEI).You can read more information about the NEI

at the EPA National Emissions Inventory web site.

For each year and for each type of PM source,the NEI records how many tons of PM2.5were emitted from

that source over the course of the entire year.The data that you will use for this assignment are for1999,

2002,2005,and2008.

Data

The data set was downloaded from:

https://https://www.wendangku.net/doc/7314443215.html,/exdata-013/human_grading/view/courses/973507/assessments/4/submissions

The original zip?le contains two?les:

PM2.5Emissions Data(summarySCC_PM25.rds):This?le contains a data frame with all of the PM2.5

emissions data for1999,2002,2005,and2008.For each year,the table contains number of tons of PM2.5

emitted from a speci?c type of source for the entire year.Here are the?rst few rows.

##fips SCC Pollutant Emissions type year

##40900110100401PM25-PRI15.714POINT1999

##80900110100404PM25-PRI234.178POINT1999

##120900110100501PM25-PRI0.128POINT1999

##160900110200401PM25-PRI 2.036POINT1999

##200900110200504PM25-PRI0.388POINT1999

##240900110200602PM25-PRI 1.490POINT1999

?ps:A?ve-digit number(represented as a string)indicating the U.S.county

SCC:The name of the source as indicated by a digit string(see source code classi?cation table)

Pollutant:A string indicating the pollutant

Emissions:Amount of PM2.5emitted,in tons

type:The type of source(point,non-point,on-road,or non-road)

year:The year of emissions recorded

Source Classi?cation Code Table(Source_Classi?cation_Code.rds):This table provides a mapping from the

SCC digit strings in the Emissions table to the actual name of the PM2.5source.The sources are categorized

in a few di?erent ways from more general to more speci?c and you may choose to explore whatever categories

you think are most useful.For example,source“10100101”is known as“Ext Comb/Electric Gen/Anthracite

Coal/Pulverized Coal”.

Exploratory Data Analysis

1.Generates plot1.png to answer the question:“Have total emissions from PM

2.5decreased in the United

States from1999to2008?”

The following script constructs a plot of total PM2.5emissions(in million tons)[y-axis]vs Year[x-axis],and

stores the result in plot1.png.

Load PM2.5emission data to data.frame object‘NEI’(unless emission data as already been loaded,as indicated by whether‘NEI’exists)

if(!exists("NEI")){

NEI<-readRDS("summarySCC_PM25.rds")

##coerce year(character)into factor

NEI$year<-as.factor(NEI$year)

}

Load source classi?cation codes to data.frame object‘SCC’(unless SCC data as already been loaded,as indicated by whether‘SCC’exists)

if(!exists("SCC"))

SCC<-readRDS("Source_Classification_Code.rds")

##Calculates total emissions within each year

totalEmissionsYear<-tapply(NEI$Emissions,NEI$year,sum)

Generates plot1.png

##png("plot1.png")

#plot(seq_along(NEI$Emissions),NEI$Emissions/1e6,type="l",xlab="",

#ylab="PM2.5emissions(in million tons)")

plot(names(totalEmissionsYear),totalEmissionsYear/1e6,type="l",xaxt="n", xlab="Year",ylab="PM2.5emissions(in million tons)",main="PM2.5emissions in USA") points(names(totalEmissionsYear),totalEmissionsYear/1e6,pch=19)

axis(1,at=seq(1999,2008,by=3),labels=names(totalEmissionsYear))

4

567

PM2.5 emissions in USA

Y ear

P M 2.5 e m i s s i o n s (i n m i l l i o n t o n s

)

1999

2002

20052008

##dev.off()

========================================================2.Generates plot2.png to answer the question:“Have total emissions from PM2.5decreased in the Baltimore City,Maryland (?ps ==”24510“)from 1999to 2008?”

The following script constructs a plot of total PM2.5emissions (in tons)[y-axis]vs Year [x-axis],for Baltimore City,and adds a linear regression (in red)to more clearly display the trend.The result is stored in plot2.png.Load PM2.5emission data to data.frame object ‘NEI’(unless emission data as already been loaded,as indicated by whether ‘NEI’exists)##if (!exists("NEI")){

##NEI <-readRDS("summarySCC_PM25.rds")####coerce year (character)into factor ##NEI$year <-as.factor(NEI$year)}

##Load source classification codes to data.frame object SCC (unless SCC data as already ##been loaded,as indicated by whether SCC exists)##if (!exists("SCC"))

##

SCC <-readRDS("Source_Classification_Code.rds")

Gets the subset of the data that corresponds to Baltimore City,Maryland –?ps 24510(unless it has already been generated,as indicated by whether ‘NEIbalt’exists)

if (!exists ("NEIbalt"))

NEIbalt <-subset (NEI,fips =="24510")

##Calculates the total emissions within each year for Baltimore City totalEmissionsBalt <-tapply (NEIbalt$Emissions,NEIbalt$year,sum)Generates plot2.png ##png("plot2.png")

head (NEIbalt$Emissions)##[1]

6.53278.880

0.92010.37610.85983.025

plot (seq_along (NEIbalt$Emissions),NEIbalt$Emissions,type ="b",

ylab ="PM2.5emissions (in tons)")

500100015002000

200400600800100

seq_along(NEIbalt$Emissions)

P M 2.5 e m i s s i o n s (i n t o n s )

plot (names (totalEmissionsBalt),totalEmissionsBalt,type ="l",xaxt ="n",

xlab ="Year",ylab ="PM2.5emissions (in tons)",

main ="PM2.5emissions in Baltimore City,Maryland")##Add a linear regression line

abline (lm (as.vector (totalEmissionsBalt)~as.numeric (names (totalEmissionsBalt))),

lwd =2,col ="red")

points (names (totalEmissionsBalt),totalEmissionsBalt,pch =19)

axis (1,at =seq (1999,2008,by =3),labels =names (totalEmissionsBalt))legend ("topright","linear regression",lty =1,lwd =2,col ="red")

2000

24002800

3200

PM2.5 emissions in Baltimore City, Maryland

Y ear

P M 2.5 e m i s s i o n s (i n t o n s )

1999

2002

2005

2008

##dev.off()

=========================================================3.Generates plot3.png to answer the question:“Of the four types of sources indicated by the type (point,nonpoint,onroad,nonroad)variable,which of these four sources have seen decreases in emissions from 1999-2008for Baltimore City?Which have seen increases in emissions from 1999-2008?”

The following script constructs a lattice-type plot using the ggplot2system,which shows,for each of the four source types,the total PM2.5emissions (in tons)[y-axis]vs Year [x-axis],for Baltimore City.It also adds a linear regressions (in red)to more clearly display the trend.The result is stored in plot3.png.Calculates the total emissions within each year and source type for Baltimore City

library (ggplot2)##Load the ggplot2plotting system (if not available,installs it)library (reshape2)##Load the reshape2package (if not available,installs it)

totEmissionsBaltByType <-tapply (NEIbalt$Emissions,list (NEIbalt$year,NEIbalt$type),sum)## totEmissionsBaltByType is now a matrix object with years as rows and source types as ##columns.Convert into a data.frame object for future manipulation:totEmissionsBaltByType <-as.data.frame (totEmissionsBaltByType)totEmissionsBaltByType$year <-rownames (totEmissionsBaltByType)

##Use melt to make the column names a variable,so that the result is a data.frame object with ##columns: year , variable (the source type)and value (the total emissions for each year ##and source type):

totEmissionsBaltByType <-melt (totEmissionsBaltByType,id="year",

measure.vars=colnames (totEmissionsBaltByType)[1:4])

##for cosmetics,reformats the source name NONPOINT to NON-POINT :

totEmissionsBaltByType$variable<-factor(totEmissionsBaltByType$variable,

labels=c("NON-ROAD","NON-POINT","ON-ROAD","POINT"))

Loads‘grid’package to be able to use the unit()function when adding space between the lattice-panels. Generates plot3.png

library(grid)

##png("plot3.png")

print(qplot(as.numeric(year),value,data=totEmissionsBaltByType,

facets=.~variable,geom="point",xlab="Year",

ylab="PM2.5emissions(in tons)",

main="PM2.5emissions by source type in Baltimore City,Maryland")+

geom_line()+

geom_smooth(method="lm",se=FALSE,colour="red",

aes(linetype="linear regression"))+

scale_x_continuous(breaks=as.numeric(totEmissionsBaltByType$year))+

theme(axis.text.x=element_text(angle=45,hjust=1),

panel.margin=unit(0.5,"lines"))+

theme(legend.position="top",legend.margin=unit(-0.5,"cm"),

legend.title=element_blank()))

##dev.off()

=========================================================

4.Generates plot4.png to answer the question:“Across the United States,how have emissions from coal combustion-related sources changed from1999-2008?”

The following script constructs a plot which shows the total PM2.5emissions(in thousand tons)[y-axis]

vs Year[x-axis].Due to the ambiguity as to what is considered to be a coal combustion-related source,we considered four types of selection schemes:

1.“coal”&“combustion”(or abbreviation“comb”)in‘https://www.wendangku.net/doc/7314443215.html,’(black)

2.simply“coal”in‘https://www.wendangku.net/doc/7314443215.html,’(magenta)

3.“coal”in‘SCC.Level.Three’(green)

4.“coal”in‘EI.Sector’(blue)

We also add a linear regression(in red)to display the global trend,but only for the selection scheme1(since

the others follow the same trend,as can be seen from the plot).The result is stored in plot4.png.

Identify the rows of SSCs satisfying the selection scheme1.“coal”&“combustion”(or abbreviation“comb”)

in‘https://www.wendangku.net/doc/7314443215.html,’.Note(|/|$)is added to“coal”and“comb”to guanratee that the actual word“coal”and the abbreviation of“combustion”,respectively,are being matched(preventing matches from substrings).

xCoal<-grep("coal(|/|$)|lignite",tolower(SCC$https://www.wendangku.net/doc/7314443215.html,))

xCombustion<-grep("combustion|comb(|/|$)",tolower(SCC$https://www.wendangku.net/doc/7314443215.html,))

xCoalCombustion<-intersect(xCombustion,xCoal)

Identify the rows of SSCs satisfying the alternative selection schemes:2.naive(simply“coal”in‘https://www.wendangku.net/doc/7314443215.html,’) xCoalNaive<-grep("coal(|/|$)",tolower(SCC$https://www.wendangku.net/doc/7314443215.html,))

##3."coal"in SCC.Level.Three

xCoalThreeLevel<-grep("coal(|/|$)",tolower(SCC$SCC.Level.Three))

##4."coal"in EI.Sector

xCoalEISector<-grep("coal(|/|$)",tolower(SCC$EI.Sector))

##Obtain the SCCs corresponding to the selected rows for the different selection schemes: selCoalCombSCC<-SCC[xCoalCombustion,]$SCC

selCoalCombSCCNaive<-SCC[xCoalNaive,]$SCC

selCoalCombSCCThreeLevel<-SCC[xCoalThreeLevel,]$SCC

selCoalCombSCCEISector<-SCC[xCoalEISector,]$SCC

##Obtains the data whose SCCs are contained in the selected SCCs,for different selection

##schemes

NEIcoalComb<-subset(NEI,SCC%in%selCoalCombSCC)

NEIcoalCombNaive<-subset(NEI,SCC%in%selCoalCombSCCNaive)

NEIcoalCombThreeLevel<-subset(NEI,SCC%in%selCoalCombSCCThreeLevel) NEIcoalCombEISector<-subset(NEI,SCC%in%selCoalCombSCCEISector)

##Calculates the total emissions within each year,for the different selection schemes: EmissionsCoalComb<-tapply(NEIcoalComb$Emissions,NEIcoalComb$year,sum) EmissionsCoalCombNaive<-tapply(NEIcoalCombNaive$Emissions,NEIcoalCombNaive$year,sum) EmissionsCoalCombThreeLevel<-tapply(NEIcoalCombThreeLevel$Emissions,

NEIcoalCombThreeLevel$year,sum) EmissionsCoalCombEISector<-tapply(NEIcoalCombEISector$Emissions,NEIcoalCombEISector$year,sum)

Generates plot4.png

##png("plot4.png")

plot (names (EmissionsCoalComb),EmissionsCoalComb/1e3,type ="l",xaxt ="n",

xlab ="Year",ylab ="PM2.5emissions (in thousand tons)",main ="PM2.5coal combustion-related emissions in USA

(for different coal combustion-related selection schemes)",ylim =c (340,610))abline (lm (as.vector (EmissionsCoalComb/1e3)~as.numeric (names (EmissionsCoalComb))),

lwd =2,col ="red")

points (names (EmissionsCoalComb),EmissionsCoalComb /1e3,pch =19)##selection scheme 2

lines (names (EmissionsCoalCombNaive),EmissionsCoalCombNaive /1e3,col ="magenta")

points (names (EmissionsCoalCombNaive),EmissionsCoalCombNaive /1e3,pch =19,col ="magenta")##selection scheme 3

lines (names (EmissionsCoalCombThreeLevel),EmissionsCoalCombThreeLevel /1e3,col ="green")

points (names (EmissionsCoalCombThreeLevel),EmissionsCoalCombThreeLevel /1e3,pch =19,col ="green")##selection scheme 4

lines (names (EmissionsCoalCombEISector),EmissionsCoalCombEISector /1e3,col ="blue")

points (names (EmissionsCoalCombEISector),EmissionsCoalCombEISector /1e3,pch =19,col ="blue")axis (1,at =seq (1999,2008,by =3),labels =names (EmissionsCoalComb))legend ("bottomleft",c ("1","1(linear regression)","2","3","4"),

col =c ("black","red","magenta","green","blue"),lty =1,pch =c (19,-1,19,19,19),title ="Selection schemes")

350400

450

500550

600PM2.5 coal combustion?related emissions in USA

(for different coal combustion?related selection schemes)

Y ear

P M 2.5 e m i s s i o n s (i n t h o u s a n d t o n s )

1999

2002

2005

2008

##dev.off()

=========================================================

5.Generates plot5.png to answer the question:“How have emissions from motor vehicle sources changed from 1999-2008in Baltimore City?”

The following script constructs a plot which shows the total PM2.5emissions(in tons)[y-axis]vs Year[x-axis], for Baltimore City,from motor vehicle sources.(The chosen criterion for identifying“motor vehicle”sources is the presence of“vehicle”(or its abbreviated form“veh”)in‘EI.Sector’and“onroad”in‘Data.Category’.) We also add a linear regressions(in red)to more clearly display the trend.The result is stored in plot5.png. Identify the rows of SSCs satisfying the selection criterion of being“motor vehicle”sources.The chosen selection criterion is the presence of“vehicle”(or its abbreviated form“veh”)in‘EI.Sector’and“onroad”in ‘Data.Category’:

xVeh<-grep("vehicle|veh(|/|$)",tolower(SCC$EI.Sector))

xOnroad<-grep("onroad",tolower(SCC$Data.Category))

xVehicle<-intersect(xVeh,xOnroad)

##Obtain the SCCs corresponding to the selected rows

selVehicleSCC<-SCC[xVehicle,]$SCC

##Obtains the data whose SCCs are contained in the selected SCCs

NEIbaltVehicle<-subset(NEIbalt,SCC%in%selVehicleSCC)

##Calculates the total emissions within each year,for Baltimore City EmissionsBaltVehicle<-tapply(NEIbaltVehicle$Emissions,NEIbaltVehicle$year,sum)

Generates plot5.png

##png("plot5.png")

plot(names(EmissionsBaltVehicle),EmissionsBaltVehicle,type="l",xaxt="n", xlab="Year",ylab="PM2.5emissions(in tons)",

main="PM2.5motor vehicle emissions in Baltimore City,Maryland",ylim=c(50,350)) abline(lm(as.vector(EmissionsBaltVehicle)~as.numeric(names(EmissionsBaltVehicle))), lwd=2,col="red")

points(names(EmissionsBaltVehicle),EmissionsBaltVehicle,pch=19)

axis(1,at=seq(1999,2008,by=3),labels=names(EmissionsBaltVehicle))

legend("topright","linear regression",lty=1,lwd=2,col="red")

50

100200

300PM2.5 motor vehicle emissions in Baltimore City, Maryland

Y ear

P M 2.5 e m i s s i o n s (i n t o n s )

19992002

2005

2008

##dev.off()

=========================================================6.Generates plot6.png to answer the question:“Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County,California (?ps ==”06037“).Which city has seen greater changes over time in motor vehicle emissions?”

The following script constructs a plot which shows the relative total PM2.5emissions (in %)[y-axis](normalised to the year 1999=100%)vs Year [x-axis],for both Baltimore City and Los Angeles County,from motor vehicle sources.(The chosen criterion for identifying “motor vehicle”sources is the presence of “vehicle”(or its abbreviated form “veh”)in ‘EI.Sector’and “onroad”in ‘Data.Category’.)The result is stored in plot6.png.Identify the rows of SSCs satisfying the selection criterion of being “motor vehicle”sources.The chosen selection criterion is the presence of “vehicle”(or its abbreviated form “veh”)in ‘EI.Sector’and “onroad”in ‘Data.Category’:

xVeh <-grep ("vehicle|veh(|/|$)",tolower (SCC$EI.Sector))xOnroad <-grep ("onroad",tolower (SCC$Data.Category))xVehicle <-intersect (xVeh,xOnroad)

##Obtain the SCCs corresponding to the selected rows selVehicleSCC <-SCC[xVehicle,]$SCC

##Obtains the data whose SCCs are contained in the selected SCCs for Baltimore City NEIbaltVehicle <-subset (NEIbalt,SCC %in%selVehicleSCC)

##Obtains the data whose SCCs are contained in the selected SCCs for Los Angles County NEIlaVehicle <-subset (NEI,fips =="06037"&SCC %in%selVehicleSCC)

##Calculates the total emissions within each year,for Baltimore City and Los Angeles County

EmissionsBaltVehicle <-tapply (NEIbaltVehicle$Emissions,NEIbaltVehicle$year,sum)EmissionsBaltVehicleLA <-tapply (NEIlaVehicle$Emissions,NEIlaVehicle$year,sum)Generates plot6.png

##png("plot6.png")

plot (names (EmissionsBaltVehicle),

EmissionsBaltVehicle /EmissionsBaltVehicle[[1]]*1e2,type ="l",xaxt ="n",xlab ="Year",ylab ="Relative PM2.5emissions (in %)",

main ="Relative PM2.5motor vehicle emissions(for Baltimore City and Los Angeles County)",ylim =c (20,120))

points (names (EmissionsBaltVehicle),

EmissionsBaltVehicle /EmissionsBaltVehicle[[1]]*1e2,pch =19)

lines (names (EmissionsBaltVehicleLA),

EmissionsBaltVehicleLA /EmissionsBaltVehicleLA[[1]]*1e2,col ="red")

points (names (EmissionsBaltVehicleLA),

EmissionsBaltVehicleLA /EmissionsBaltVehicleLA[[1]]*1e2,pch =19,col ="red")

axis (1,at =seq (1999,2008,by =3),labels =names (EmissionsBaltVehicle))legend (0,0,c ("f","g"),col =c ("black","red"))

legend ("right",c ("Baltimore City,Maryland","Los Angeles County,California"),

col =c ("black","red"),pch =19,lty =c (1,1))

204060

80100

120Relative PM2.5 motor vehicle emissions(for Baltimore City and Los Angeles County)

Y ear

R e l a t i v e P M 2.5 e m i s s i o n s (i n %)

19992002

2005

2008

##dev.off()

Emissions Kernel Density Esitimation

Normal reference rule-of-thumb

balE<-NEIbalt$Emissions

len<-length(balE)

bal1<-balE[1:100]

range(bal1)

##[1]0.03241.16

varr<-var(bal1)

#Compute bandwidth using rule-of-thumb

h_pilot<-1.06*sqrt(varr)*100^(-1/5)

h_rot<-1.06*min(sqrt(varr),(quantile(bal1)[4]-quantile(bal1)[2])/1.34)*100^(-1/5) eks<-seq(-10,10,by=0.2)

ex1<-matrix(0,length(eks),100)

#Using Gaussian kernel to estimate density

for(i in1:length(eks)){

for(j in1:100)

ex1[i,j]<-exp(-(bal1[j]-eks[i])^2/(2*h_rot^(2)))

}

exr1<-c()

for(i in1:length(eks)){

exr1[i]<-0.01*(1/h_rot)*(1/sqrt(2*pi))*sum(ex1[i,])

}

#plot(eks,exr1,type="l",xlab="x",ylab="f(x)")

Plug-in method

#Compute integral of2-order derivative squared

DI<-matrix(0,100,100)

for(i in1:100){

for(j in1:100){

DI[i,j]<-(1-(bal1[j]-bal1[i])^2/(2*h_pilot^2))*

exp(-(bal1[j]-bal1[i])^2/(4*h_pilot^2))

}

}

SDI<-(1/(100^2*h_pilot^6))*(1/(4*sqrt(pi)))*sum(DI)

#Compute bandwidth using plug-in method

h_plug1<-(1/(2*SDI*sqrt(pi)))^(1/5)*100^(-1/5)

ex2<-matrix(0,length(eks),100)

#Estimate Emissions density with Gaussian kernel

for(i in1:length(eks)){

for(j in1:100)

ex2[i,j]<-exp(-(bal1[j]-eks[i])^2/(2*h_plug1^(2)))

}

exr2<-c()

for(i in1:length(eks)){

exr2[i]<-0.01*(1/h_plug1)*(1/sqrt(2*pi))*sum(ex2[i,])

}

#plot(eks,exr2,type="l",xlab="x",ylab="f(x)")

1.With Gaussian kernel

h_plug2<-((10/3)*h_pilot^6)^(1/5)*100^(-1/5)

ex3<-matrix(0,length(eks),100)

#Estimate Emissions density with Gaussian kernel

for(i in1:length(eks)){

for(j in1:100){

if(((bal1[j]-eks[i])/h_plug2)^2<=1)

ex3[i,j]<-0.75*(1/(100*h_plug2))*

(1-((bal1[j]-eks[i])/h_plug2)^2)

else ex3[i,j]<-0

}

}

exr3<-c()

for(i in1:length(eks)){

exr3[i]<-sum(ex3[i,])

}

#plot(eks,exr3,type="l",xlab="x",ylab="f(x)")

2.With Epanechnikov kernel

Least square cross-validation

#Compute optimal bandwidth with Epanechnikov kernel

dis<-matrix(0,100,100)

for(i in1:100){

for(j in1:100){

dis[i,j]<-(bal1[j]-bal1[i])^2

}

}

h_cv<-0.01*sqrt((15/7)*sum(dis))

#Estimate Emissions density with Epanechnikov kernel

ex4<-matrix(0,length(eks),100)

for(i in1:length(eks)){

for(j in1:100)

ex4[i,j]<-0.75*(1/(100*h_cv))*(1-((bal1[j]-eks[i])/h_cv)^2) }

exr4<-c()

for(i in1:length(eks)){

exr4[i]<-sum(ex4[i,])

}

#plot(eks,exr4,type="l",xlab="x",ylab="f(x)")

Plot all the density estimations in single graph

plot (eks,exr1,xlab ="x",ylab ="f(x)",type ="l",main ="Emissions Density Estimation about

Different kernels and Bandwidth Selection Methods")lines (eks,exr2,col ="red")lines (eks,exr3,col ="blue")lines (eks,exr4,col ="green")

legend ("topright",c ("GK(RoT)","GK(P-in)","EK(P-in)","EK(LSCV)"),

lty ="solid",col =c ("black","red","blue","green"),bty ="n")

?10

?50510

0.00

0.050.100.15

Emissions Density Estimation about

Different kernels and Bandwidth Selection Methods

x

f (x )

GK(RoT)GK(P?in)EK(P?in)EK(LSCV)

相关文档