Wednesday, May 26, 2010

final project idea


Comparing the influence of geometry in spatial analysis

Starting with Tobler’s first law—“Everything is related to everything else, but near things are more related than distant things"—it is clear that how we measure or define nearness or distantness becomes extremely important. However if this measurement is held constant, then what other factors can distort our understanding or accuracy of spatial analysis? I propose that the geometry of a given location will also affect the nature of the spatial autocorrelation. Using the African countries of The Gambia and Tanzania as case studies, this project will look at how differing geometric shapes will/can effect the outcomes of spatial analysis. For each country I’ll create four data sets, two uniform distributions and two normal distributions at two densities to see if the linear shape of The Gambia plays a significant role verse the more regular almost circular shape of Tanzania.

Above are maps of the constructed data sets generated in R and projected using ArcGIS. Bibliography to follow but suggestions are highly welcomed. 

Tuesday, May 25, 2010

GeoDa, R and ArcGIS do spatial analysis

This week's assignment clearly illustrates that the tool is only as good as the user. While the spatial analysis capabilities of GeoDa, R and ArcGIS might be vast and varied, my lack of knowledge and understanding of the possibilities, limitations and, in general, the significance of geospatial statistical methods is the biggest hurdle.

With that understanding, I felt that would most identify (and find most useful) with the tool I've used the most: ArcGIS. However, without a solid understanding of how to map out spatial statistics I couldn't produce a mapped image of the Moran's I that I choose to calculate with each program. (And to my credit, I don't think this information can be mapped.) R and GeoDa were better tools for visualizing the statistical data particularly R. If the goal is to map out geospatial data, then ArcGIS would be preferable as color schemes, symbology and attributes can be easily manipulated. However in dealing with statistical results, R does a better job. GeoDa, I feel, tries to mediate between the two, but doesn't quite live up to the respective powers of the fancier programs. 

For this assignment I used each tool to create a spatial matrix of the k nearest neighbor variety and set k=1 and k=2. Then I calculated Moran's I and visualized the results the best I could in each respective program.

GeoDA


R

##print out from R moran.test

> moran.test(US$obama,nb2listw(US.knn1, style="W"))

        Moran's I test under randomisation

data:  US$obama 
weights: nb2listw(US.knn1, style = "W") 

Moran I statistic standard deviate = 3.2767, p-value =
0.0005252
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
       0.53038969       -0.02083333        0.02830000

> moran.test(US$obama,nb2listw(US.knn2, style="W"))

        Moran's I test under randomisation

data:  US$obama 
weights: nb2listw(US.knn2, style = "W") 

Moran I statistic standard deviate = 4.3288, p-value =
7.495e-06
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
       0.51891274       -0.02083333        0.01554667


ArcGIS

 ##printout from ArcGIS, no visualizations were created

Executing: SpatialAutocorrelation lower48 obama true "Get Spatial Weights From File" "Euclidean Distance" None # F:\R_work\week_8\knn1.swm 0 0 0
Start Time: Tue May 25 19:45:12 2010
Running script SpatialAutocorrelation...
WARNING 000916: The input feature class does not appear to contain projected data.

 Global Moran's I Summary
Moran's Index:   0.530499
Expected Index:  -0.020833
Variance:        0.027516
Z Score:         3.323665
p-value:         0.000888


Executing: SpatialAutocorrelation lower48 obama true "Get Spatial Weights From File" "Euclidean Distance" None # F:\R_work\week_8\knn2.swm 0 0 0
Start Time: Tue May 25 19:51:10 2010
Running script SpatialAutocorrelation...
WARNING 000916: The input feature class does not appear to contain projected data.

 Global Moran's I Summary
Moran's Index:   0.524584
Expected Index:  -0.020833
Variance:        0.015769
Z Score:         4.343416
p-value:         0.000014

Monday, May 17, 2010

Spatial autocorrelation and Moran's I


This first map (in yellow/red) was created in ArcGIS. I downloaded a shape file from DIVA GIS which contained basic information about the different administrative boundaries.  Then I merged the tables in ArcMap (after a little doctoring in Excel) with data about trees planted downloaded from Tanzania's countrystat website run by the FAO.
 
The second (in blue/green) was created from the same data in when imported into R.

Creating a spatial weights matrix
w.cols = 1:26
w.rows = 1:26

#create spatial weights matrix from neighbor object
w.mat.knn = nb2mat(TZ.knn1, zero.policy=TRUE)
w.mat.dist = nb2mat(TZ.dist.150, zero.policy=TRUE)
image(w.cols,w.rows,w.mat.dist,col=brewer.pal(3,"BuPu"))


#return binary spatial weights matrix
w.mat.knn
#print out of 0s and 1s

#visualize binary spatial weights matrix
image(w.cols,w.rows,w.mat.knn,col=brewer.pal(3,"BuPu"))


#create and visualize distance-based spatial weights matrix;
w.mat.dist = nb2mat(TZ.dist.250, zero.policy=TRUE)
image(w.cols,w.rows,w.mat.dist,col=brewer.pal(9,"PuRd"))


Moran's I test
##for distance = 350
moran.plot(tz$TREES_2008,nb2listw(TZ.dist.350),labels=tz$F1)
moran.test(tz$TREES_2008,nb2listw(TZ.dist.350, style="W"))


######print out from moran.test###

# Moran's I test under randomisation
#data:  tz$TREES_2008
#weights: nb2listw(TZ.dist.350, style = "W") 
#Moran I statistic standard deviate = 0.7428, p-value = 0.2288
#alternative hypothesis: greater
#sample estimates:
#Moran I statistic       Expectation          Variance
#      0.011532877      -0.040000000       0.004812588



##for distance = 500
moran.plot(tz$TREES_2008,nb2listw(TZ.dist.500),labels=tz$F1)
moran.test(tz$TREES_2008,nb2listw(TZ.dist.500, style="W"))

##print out##
#Moran's I test under randomisation

#data:  tz$TREES_2008 
#weights: nb2listw(TZ.dist.500, style = "W")
#Moran I statistic standard deviate = 0.7428, p-value = 0.2288
#alternative hypothesis: greater
#sample estimates:
#Moran I statistic       Expectation          Variance
 #   0.011532877      -0.040000000       0.004812588


##for distance = 1000
moran.plot(tz$TREES_2008,nb2listw(TZ.dist.1000),labels=tz$F1)
moran.test(tz$TREES_2008,nb2listw(TZ.dist.1000, style="W"))


#Moran's I test under randomisation

#data:  tz$TREES_2008 
#weights: nb2listw(TZ.dist.1000, style = "W")
#Moran I statistic standard deviate = 0.0731, p-value = 0.4709
#alternative hypothesis: greater
#sample estimates:
#Moran I statistic       Expectation          Variance
#    -3.936736e-02     -4.000000e-02      7.496845e-05

A note on spatial autocorrelation
Spatial autocorrelation is an understanding that real-life phenomena tend to be most similar the closer they are together. In nature, things ten to cluster in patches of populations or are spread out in gradient where there is a concentration at one point and a slow dispersion outward. In this way we can see that the characteristics of a phenomena can be estimated by using information from surrounding areas.

I could give the example of my family members. My mother and father’s families both come from the same town in western Minnesota. Now many aunts, uncles and cousins no longer live in that town but still live in Minnesota, South Dakota or Wisconsin. Knowing the distance from "home" it is likely that we can predict the possibility I would have a family member living there. And it is true. There are only very few outliers like me who live as far away as California and Texas. 

Of course this completely depends on how you define “close”. As with changing the scale of investigation in last week’s assignment, measurements of spatial autocorrelation change as you change the parameters in which you measure closeness. Another factor in this example would be city size. If you factor in distance from home and size of the city, I feel that you could closely predict the number of family members I have now living in a given area.  

However the data that I presented above (then number of trees planted in Tanzania in 2008 by district) does not seem to be spatially correlated (with some exceptions). It’s true that as the way I define distance changes (only using absolute distance from the centroids of the districts and not using a k nearest neighbor), the Moran’s I test changes, but not significantly. I do find that the significant outliers of Iringa and Shinyanga to be interesting. Without knowing much about how the data was collected and what the numbers really mean, I do know that these are two of the districts that are known for their significant forest cover. Much more than that, I can not say.

Monday, May 10, 2010

R meets ArcGIS

What I find most interesting is that the uniformally distribution of points, which is supposed to reflect “on the ground” homogeneity (or close to it), is that it actually maps out spatial patterns particularly at course resolution. This variation among the different scales of analysis for the uniform distribution (green maps) is an example of modifiable areal unit problem (MAUP). The MAUP occurs when individual point data is aggregated into set areas. In this example the set areas change with the resolution. We can see that in the 3x3, 5x5 and even the 7x7 images there seems to be a higher concentration of points on the left of the map. When the resolution of area is increased to 10x10, that relationship goes away and we start to see the uniform distribution that was meant all along.

For the normal distribution map/plot, the lower resolutions seem to show the concentration of points in the center (or up and right of center) but points still seem to be distributed throughout the entire frame. As the resolution increases, the actual concentration of points becomes smaller.

The MAUP is closely related to the ecological fallacy in that as scales of analysis change, so do the conclusions that we might draw from them. If we think of the example normal distribution as locations of saplings produced from a single tree, we would expect to see a concentration of trees around the original. As stated above, the coarser resolutions show a wider distribution of saplings. However the apple, indeed, does not fall far from the tree, and the finer resolutions how the clustering better.
 


R Code for creating data sets
setwd("/Users/Erin/R_work/ArcGIS")

#generate 2 sets of 250 numbers between 0 and 1000 from a uniform distribution;
x=runif(250,0,1000)
y=runif(250,0,1000)
#puts the two sets together and saves as a .csv file
uniform250 = round(cbind(x,y),0)
write.csv(uniform250,file="uniform250.csv")
 

#generates 2 sets of 250 numbers from a normal distribution with a mean of 100 and variance of 100
x1=rnorm(250,100,100)
y1=rnorm(250,100,100)
#puts the two sets together and saves as a .csv file
normal250 = round(cbind(x1,y1),0)
write.csv(normal250,file="normal250.csv")

##creates a 2 row, 3 column image that contains the scatter plots of both the normal and uniform distribution of the 2 sets of 250 numbers. Then plots the histogram of all created number sets.
par(mfrow=c(2,3))
for(i in 1:6){
plot(x,y,xlim=c(0,999),ylim=c(0,999))
hist(x)
hist(y)
plot(x1,y1,xlim=c(-199,299),ylim=c(-199,299))
hist(x1)
hist(y1)}

Tuesday, May 4, 2010

R is for Round Two


This week in R, I chose data from the United States Department of Transportation -- Federal Highway Administration. The data included information about state motor-vehicle registration in 2008 and were separated in to vehicle registration for cars, truck and buses both private or commercial and public. It gave totals of registered vehicles for each state.
At first I was struggling with plotting something meaningful as I kept creating bargraphs that no matter what variable I chose would have a value of one for every state. I (h/t Kebonye) figured out this was because the data was loading in as a string rather than a number. The data file was fixed in Excel and we were cookin’ with gas!
Now having readable numbers in the columns, I was able to plot charts showing total numbers of registered vehicles by state. However, I felt that this type of information wasn’t helpful and therefore added two other columns of information—total population by state and regions. I added the variables of regions in the Excel file to be able to work with the data in some geographic form. This, I felt was the only was to get any sense of space beyond individual states. The population data was added so I could normalized the total number of vehicles by total population.

With the additional variables I could do some math and calculated total registered trucks and cars per person. This information was then visualized as a boxplot organized by regions for both trucks and cars.

Finally the data about cars and trucks per person were graphed in a bivariate plot to show the relationship between the two variables and a regression line was added.



Note: For the scatter plot of vehicles per person with selected states identified I had to create a separate spreadsheet in Excel to sort the data by vehicles per person. If I didn’t do this, I could not identify the state correctly.

##Working with stats from the 48 states. State motor-vehicle registration in 2008 data from US DoT Federal Highway Administration available at http://www.fhwa.dot.gov/policyinformation/statistics/2008/mv1.cfm

setwd("/Users/Erin/R_work")
#changes where the working directory is
getwd()


#read Tab deliminated files motor_veh_08
drive <- read.delim("/Users/Erin/R_work/motor_veh_08.txt", header=TRUE, na.strings="NA", dec=".")

#show working files
ls()

#make variables accessible
attach(drive)
names(drive)



##creating car and truck density per population
cars <- total_car/population
trucks <- total_truck/population

#creating box plots
boxplot(trucks~region, ylab="trucks per capita", xlab="region", main="Trucks per person by Region", data=drive, col=5)
boxplot(cars~region, ylab="cars per capita", xlab="region", main="Cars per person by Region", data=drive, col=2)

#scatter chart cars per population
plot(cars, xlab="States", ylab="registered vehicles per capita", main="2008 Registered Vehicles per Capita: selected states")
identify(cars, labels=state, cex=0.7)

#scatter chart trucks per population
plot(trucks, xlab="States", ylab="registered trucks per capita", main="2008 Registered Trucks per Capita: selected states")
identify(trucks, labels=state, cex=0.7)

#select the points on the graph to be labled, right click to stop identification

##figure 4
#cars vs. trucks
plot (cars, trucks, xlab="registered cars per capita", ylab="registered trucks per capita", main="Registered cars and trucks by state, 2008", col=3)
abline(lm(cars~trucks), col=3)

All I have to say is bring on ArcGIS!

Wednesday, April 28, 2010

R is Recession

For data, I traveled to the US Department of Labor. Bureau of Labor Statistics and was met with an overwhelming amount of files, charts, tables and numbers.





Tuesday, April 20, 2010

week 3 -- improved hockey stick?




I don’t believe that this graph is an improvement to the traditional hockey stick image. Even with access to the raw data or even processed data, I was not able to understand the information well enough to provide a critical reinterpretation. However this graph changes the aspect of the graph by showing simply the estimated surface temperatures by using data collected from lake cores from Lake Tanganika in East Africa. I chose this data set because in Mann et al. 1998 they acknowledged few if any data from the African continent. While I understand that Mann et al. were working with the data sets they had available, I thought it would be interesting to work with data that are from a drastically under-represented area.


Where these charts differ the greatest from the hockey stick projection, is that they do not show the departures of temperature from the 1961 to 1990 average but rather just display the estimated temperatures based on lake core analyses. In fact they do not show contemporary temperatures at all. I tried this same projection with other data sets and I came up with similar looking charts that seem not to have any consistent pattern. In this way we can see that perhaps one single study cannot add to current climate change debates, however, only aggregate compilations of these studies can give weight and force to the international, scholarly and secular, and very public debate on climate change.

The elementary graphs above were produced with excel based on data from
Tierney, J.E., J.M. Russell, Y. Huang, J.S. Sinninghe Damsté, E.C. Hopmans, and A.S. Cohen. 2008.  "Northern Hemisphere Controls on Tropical Southeast African 
Climate During the Past 60,000 Years" Science, Vol. 322, No. 5899, pp. 252-255, 10 October 2008.  
available here.