Wednesday, May 26, 2010

final project idea


Comparing the influence of geometry in spatial analysis

Starting with Tobler’s first law—“Everything is related to everything else, but near things are more related than distant things”—it is clear that how we measure or define nearness becomes extremely important. However, if this measurement is held constant, what other factors can distort the accuracy of our spatial analysis? I propose that the geometry of a given location will also affect the nature of the spatial autocorrelation. Using the African countries of The Gambia and Tanzania as case studies, this project will look at how differing geometric shapes can affect the outcomes of spatial analysis. For each country I’ll create four data sets: two uniform distributions and two normal distributions, each at two densities, to see if the linear shape of The Gambia plays a significant role versus the more regular, almost circular shape of Tanzania.

Above are maps of the constructed data sets generated in R and projected using ArcGIS. Bibliography to follow but suggestions are highly welcomed. 
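A minimal sketch of how the four point sets for one country might be generated in R. The bounding box is a placeholder rectangle and the densities (100 and 500 points) are assumptions, not the actual values used; real country outlines would come from a shapefile before projecting in ArcGIS:

```r
# Sketch only: placeholder bounding box and densities
set.seed(42)

make_points <- function(n, kind) {
  if (kind == "uniform") {
    cbind(x = runif(n, 0, 1000), y = runif(n, 0, 1000))
  } else {
    # normal distribution centered in the box; spread is a placeholder
    cbind(x = rnorm(n, 500, 150), y = rnorm(n, 500, 150))
  }
}

uniform_low  <- make_points(100, "uniform")
uniform_high <- make_points(500, "uniform")
normal_low   <- make_points(100, "normal")
normal_high  <- make_points(500, "normal")

# e.g. write.csv(uniform_low, "uniform_low.csv") for import into ArcGIS
```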

Tuesday, May 25, 2010

GeoDa, R and ArcGIS do spatial analysis

This week's assignment clearly illustrates that the tool is only as good as the user. While the spatial analysis capabilities of GeoDa, R and ArcGIS might be vast and varied, my lack of knowledge and understanding of the possibilities, limitations and, in general, the significance of geospatial statistical methods is the biggest hurdle.

With that understanding, I felt I would most identify with (and find most useful) the tool I've used the most: ArcGIS. However, without a solid understanding of how to map out spatial statistics, I couldn't produce a mapped image of the Moran's I that I chose to calculate with each program. (And to my credit, I don't think this information can be mapped.) R and GeoDa were better tools for visualizing the statistical data, particularly R. If the goal is to map out geospatial data, then ArcGIS would be preferable, as color schemes, symbology and attributes can be easily manipulated. However, in dealing with statistical results, R does a better job. GeoDa, I feel, tries to mediate between the two, but doesn't quite live up to the respective strengths of the more specialized programs.

For this assignment I used each tool to create a spatial weights matrix of the k-nearest-neighbor variety, setting k=1 and k=2. Then I calculated Moran's I and visualized the results as best I could in each respective program.
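In R, that workflow uses the spdep package. The sketch below uses toy stand-ins for the coordinates and the variable (the actual objects were built from the state data, e.g. `US$obama`); only the function calls match what was run:

```r
library(spdep)  # provides knearneigh, knn2nb, nb2listw, moran.test

# Toy stand-ins: 48 random "centroids" and a random variable
set.seed(1)
coords <- cbind(runif(48), runif(48))
obama  <- rnorm(48)

# Build k-nearest-neighbor neighbor lists for k=1 and k=2
US.knn1 <- knn2nb(knearneigh(coords, k = 1))
US.knn2 <- knn2nb(knearneigh(coords, k = 2))

# Row-standardized weights ("W") and the Moran's I test for each
moran.test(obama, nb2listw(US.knn1, style = "W"))
moran.test(obama, nb2listw(US.knn2, style = "W"))
```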

GeoDa


R

##print out from R moran.test

> moran.test(US$obama,nb2listw(US.knn1, style="W"))

        Moran's I test under randomisation

data:  US$obama 
weights: nb2listw(US.knn1, style = "W") 

Moran I statistic standard deviate = 3.2767, p-value =
0.0005252
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
       0.53038969       -0.02083333        0.02830000

> moran.test(US$obama,nb2listw(US.knn2, style="W"))

        Moran's I test under randomisation

data:  US$obama 
weights: nb2listw(US.knn2, style = "W") 

Moran I statistic standard deviate = 4.3288, p-value =
7.495e-06
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
       0.51891274       -0.02083333        0.01554667


ArcGIS

##print out from ArcGIS, no visualizations were created

Executing: SpatialAutocorrelation lower48 obama true "Get Spatial Weights From File" "Euclidean Distance" None # F:\R_work\week_8\knn1.swm 0 0 0
Start Time: Tue May 25 19:45:12 2010
Running script SpatialAutocorrelation...
WARNING 000916: The input feature class does not appear to contain projected data.

 Global Moran's I Summary
Moran's Index:   0.530499
Expected Index:  -0.020833
Variance:        0.027516
Z Score:         3.323665
p-value:         0.000888


Executing: SpatialAutocorrelation lower48 obama true "Get Spatial Weights From File" "Euclidean Distance" None # F:\R_work\week_8\knn2.swm 0 0 0
Start Time: Tue May 25 19:51:10 2010
Running script SpatialAutocorrelation...
WARNING 000916: The input feature class does not appear to contain projected data.

 Global Moran's I Summary
Moran's Index:   0.524584
Expected Index:  -0.020833
Variance:        0.015769
Z Score:         4.343416
p-value:         0.000014

Monday, May 17, 2010

Spatial autocorrelation and Moran's I


This first map (in yellow/red) was created in ArcGIS. I downloaded a shapefile from DIVA-GIS which contained basic information about the different administrative boundaries. Then I merged the tables in ArcMap (after a little doctoring in Excel) with data about trees planted, downloaded from Tanzania's CountrySTAT website run by the FAO.
 
The second (in blue/green) was created from the same data when imported into R.

Creating a spatial weights matrix
library(spdep)         #nb2mat
library(RColorBrewer)  #brewer.pal

#row and column indices for the 26 regions
w.cols = 1:26
w.rows = 1:26

#create binary spatial weights matrices from the neighbor objects
#(style="B" gives 0/1 weights; the nb2mat default, style="W", row-standardizes)
w.mat.knn = nb2mat(TZ.knn1, style="B", zero.policy=TRUE)
w.mat.dist = nb2mat(TZ.dist.150, style="B", zero.policy=TRUE)
image(w.cols,w.rows,w.mat.dist,col=brewer.pal(3,"BuPu"))

#return binary spatial weights matrix
w.mat.knn
#print out of 0s and 1s

#visualize binary spatial weights matrix
image(w.cols,w.rows,w.mat.knn,col=brewer.pal(3,"BuPu"))

#create and visualize the TZ.dist.250 distance-based spatial weights matrix
w.mat.dist = nb2mat(TZ.dist.250, style="B", zero.policy=TRUE)
image(w.cols,w.rows,w.mat.dist,col=brewer.pal(9,"PuRd"))


Moran's I test
##for distance = 350
moran.plot(tz$TREES_2008,nb2listw(TZ.dist.350),labels=tz$F1)
moran.test(tz$TREES_2008,nb2listw(TZ.dist.350, style="W"))


######print out from moran.test###

# Moran's I test under randomisation
#data:  tz$TREES_2008
#weights: nb2listw(TZ.dist.350, style = "W") 
#Moran I statistic standard deviate = 0.7428, p-value = 0.2288
#alternative hypothesis: greater
#sample estimates:
#Moran I statistic       Expectation          Variance
#      0.011532877      -0.040000000       0.004812588



##for distance = 500
moran.plot(tz$TREES_2008,nb2listw(TZ.dist.500),labels=tz$F1)
moran.test(tz$TREES_2008,nb2listw(TZ.dist.500, style="W"))

##print out##
#Moran's I test under randomisation

#data:  tz$TREES_2008 
#weights: nb2listw(TZ.dist.500, style = "W")
#Moran I statistic standard deviate = 0.7428, p-value = 0.2288
#alternative hypothesis: greater
#sample estimates:
#Moran I statistic       Expectation          Variance
#      0.011532877      -0.040000000       0.004812588


##for distance = 1000
moran.plot(tz$TREES_2008,nb2listw(TZ.dist.1000),labels=tz$F1)
moran.test(tz$TREES_2008,nb2listw(TZ.dist.1000, style="W"))


#Moran's I test under randomisation

#data:  tz$TREES_2008 
#weights: nb2listw(TZ.dist.1000, style = "W")
#Moran I statistic standard deviate = 0.0731, p-value = 0.4709
#alternative hypothesis: greater
#sample estimates:
#Moran I statistic       Expectation          Variance
#    -3.936736e-02     -4.000000e-02      7.496845e-05

A note on spatial autocorrelation
Spatial autocorrelation is the understanding that real-life phenomena tend to be most similar the closer they are together. In nature, things tend to cluster in patches of populations or are spread out in a gradient, where there is a concentration at one point and a slow dispersion outward. In this way we can see that the characteristics of a phenomenon can be estimated by using information from surrounding areas.

I could give the example of my family members. My mother’s and father’s families both come from the same town in western Minnesota. Many aunts, uncles and cousins no longer live in that town but still live in Minnesota, South Dakota or Wisconsin. Knowing a place's distance from "home," we could likely predict the probability that I have a family member living there. And it is true: there are only a few outliers like me who live as far away as California and Texas.

Of course this completely depends on how you define “close”. As with changing the scale of investigation in last week’s assignment, measurements of spatial autocorrelation change as you change the parameters by which you measure closeness. Another factor in this example would be city size. If you factor in both distance from home and the size of the city, I feel that you could closely predict the number of family members I have now living in a given area.

However, the data that I presented above (the number of trees planted in Tanzania in 2008 by district) does not seem to be spatially autocorrelated (with some exceptions). It’s true that as the way I define distance changes (only using absolute distance from the centroids of the districts and not using k nearest neighbors), the Moran’s I test changes, but not significantly. I do find the significant outliers of Iringa and Shinyanga interesting. Without knowing much about how the data were collected and what the numbers really mean, I do know that these are two of the districts known for their significant forest cover. Much more than that, I cannot say.

Monday, May 10, 2010

R meets ArcGIS

What I find most interesting is that the uniform distribution of points, which is supposed to reflect “on the ground” homogeneity (or close to it), actually maps out spatial patterns, particularly at coarse resolution. This variation among the different scales of analysis for the uniform distribution (green maps) is an example of the modifiable areal unit problem (MAUP). The MAUP occurs when individual point data are aggregated into set areas. In this example the set areas change with the resolution. We can see that in the 3x3, 5x5 and even the 7x7 images there seems to be a higher concentration of points on the left of the map. When the resolution is increased to 10x10, that relationship goes away and we start to see the uniform distribution that was meant all along.

For the normal distribution map/plot, the lower resolutions seem to show the concentration of points in the center (or up and right of center) but points still seem to be distributed throughout the entire frame. As the resolution increases, the actual concentration of points becomes smaller.

The MAUP is closely related to the ecological fallacy in that as scales of analysis change, so do the conclusions that we might draw from them. If we think of the example normal distribution as locations of saplings produced from a single tree, we would expect to see a concentration of trees around the original. As stated above, the coarser resolutions show a wider distribution of saplings. However the apple, indeed, does not fall far from the tree, and the finer resolutions show the clustering better.
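The MAUP effect described above can be sketched directly in R by counting the same points in grids of different resolutions. The data here are regenerated (so the exact counts differ from the maps), and the function name is my own:

```r
# Sketch: aggregate one uniform point set at two resolutions to see the MAUP
set.seed(1)
x <- runif(250, 0, 1000)
y <- runif(250, 0, 1000)

# Count points falling in each cell of an n x n grid over the same extent
counts_at <- function(n) {
  breaks <- seq(0, 1000, length.out = n + 1)
  table(cut(x, breaks, include.lowest = TRUE),
        cut(y, breaks, include.lowest = TRUE))
}

counts_at(3)   # coarse grid: cell totals vary, suggesting spurious clusters
counts_at(10)  # finer grid: counts look closer to the intended uniformity
```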
 


R Code for creating data sets
setwd("/Users/Erin/R_work/ArcGIS")

#generate 2 sets of 250 numbers between 0 and 1000 from a uniform distribution;
x=runif(250,0,1000)
y=runif(250,0,1000)
#puts the two sets together and saves as a .csv file
uniform250 = round(cbind(x,y),0)
write.csv(uniform250,file="uniform250.csv")
 

#generates 2 sets of 250 numbers from a normal distribution with a mean of 100 and standard deviation of 100
x1=rnorm(250,100,100)
y1=rnorm(250,100,100)
#puts the two sets together and saves as a .csv file
normal250 = round(cbind(x1,y1),0)
write.csv(normal250,file="normal250.csv")

##creates a 2-row, 3-column image containing the scatter plots of both the uniform and normal distributions of the 2 sets of 250 numbers, plus the histograms of all created number sets
par(mfrow=c(2,3))
#the six plot calls below fill the 2x3 grid exactly once; no loop is needed
plot(x,y,xlim=c(0,999),ylim=c(0,999))
hist(x)
hist(y)
plot(x1,y1,xlim=c(-199,299),ylim=c(-199,299))
hist(x1)
hist(y1)

Tuesday, May 4, 2010

R is for Round Two


This week in R, I chose data from the United States Department of Transportation's Federal Highway Administration. The data included information about state motor-vehicle registration in 2008 and were separated into registrations for cars, trucks and buses, both private/commercial and public. It gave totals of registered vehicles for each state.
At first I struggled to plot something meaningful, as I kept creating bar graphs that, no matter what variable I chose, would have a value of one for every state. I (h/t Kebonye) figured out this was because the data were loading in as strings rather than numbers. The data file was fixed in Excel and we were cookin’ with gas!
Now having readable numbers in the columns, I was able to plot charts showing total numbers of registered vehicles by state. However, I felt that this type of information wasn’t helpful and therefore added two other columns of information: total population by state and regions. I added the region variable in the Excel file to be able to work with the data in some geographic form. This, I felt, was the only way to get any sense of space beyond individual states. The population data were added so I could normalize the total number of vehicles by total population.

With the additional variables I could do some math and calculate total registered trucks and cars per person. This information was then visualized as boxplots organized by region for both trucks and cars.

Finally, the data about cars and trucks per person were graphed in a bivariate plot to show the relationship between the two variables, and a regression line was added.



Note: For the scatter plot of vehicles per person with selected states identified, I had to create a separate spreadsheet in Excel to sort the data by vehicles per person. If I didn’t do this, I could not identify the states correctly.

##Working with stats from the 48 states. State motor-vehicle registration in 2008 data from US DoT Federal Highway Administration available at http://www.fhwa.dot.gov/policyinformation/statistics/2008/mv1.cfm

setwd("/Users/Erin/R_work")
#changes where the working directory is
getwd()


#read tab-delimited file motor_veh_08
drive <- read.delim("/Users/Erin/R_work/motor_veh_08.txt", header=TRUE, na.strings="NA", dec=".")

#show working files
ls()

#make variables accessible
attach(drive)
names(drive)



##creating car and truck density per population
cars <- total_car/population
trucks <- total_truck/population

#creating box plots
boxplot(trucks~region, ylab="trucks per capita", xlab="region", main="Trucks per person by Region", data=drive, col=5)
boxplot(cars~region, ylab="cars per capita", xlab="region", main="Cars per person by Region", data=drive, col=2)

#scatter chart cars per population
plot(cars, xlab="States", ylab="registered vehicles per capita", main="2008 Registered Vehicles per Capita: selected states")
identify(cars, labels=state, cex=0.7)

#scatter chart trucks per population
plot(trucks, xlab="States", ylab="registered trucks per capita", main="2008 Registered Trucks per Capita: selected states")
identify(trucks, labels=state, cex=0.7)

#select the points on the graph to be labeled; right click to stop identification

##figure 4
#cars vs. trucks
plot(cars, trucks, xlab="registered cars per capita", ylab="registered trucks per capita", main="Registered cars and trucks by state, 2008", col=3)
#regress trucks on cars (y~x) so the line matches the plot's axes
abline(lm(trucks~cars), col=3)

All I have to say is bring on ArcGIS!

Wednesday, April 28, 2010

R is Recession

For data, I traveled to the U.S. Department of Labor's Bureau of Labor Statistics and was met with an overwhelming amount of files, charts, tables and numbers.





Tuesday, April 20, 2010

week 3 -- improved hockey stick?




I don’t believe that this graph is an improvement on the traditional hockey stick image. Even with access to the raw data, or even processed data, I was not able to understand the information well enough to provide a critical reinterpretation. However, this graph changes the perspective by simply showing estimated surface temperatures derived from lake cores from Lake Tanganyika in East Africa. I chose this data set because Mann et al. 1998 acknowledged few if any data from the African continent. While I understand that Mann et al. were working with the data sets they had available, I thought it would be interesting to work with data from a drastically under-represented area.


Where these charts differ most from the hockey stick projection is that they do not show the departures of temperature from the 1961 to 1990 average but rather just display the estimated temperatures based on lake core analyses. In fact, they do not show contemporary temperatures at all. I tried this same projection with other data sets and came up with similar-looking charts that seem not to have any consistent pattern. In this way we can see that perhaps no single study can add to current climate change debates; only aggregate compilations of these studies can give weight and force to the international, scholarly and secular, and very public debate on climate change.

The elementary graphs above were produced with Excel based on data from
Tierney, J.E., J.M. Russell, Y. Huang, J.S. Sinninghe Damsté, E.C. Hopmans, and A.S. Cohen. 2008. "Northern Hemisphere Controls on Tropical Southeast African Climate During the Past 60,000 Years" Science, Vol. 322, No. 5899, pp. 252-255, 10 October 2008.
available here.

climate science essay -- the power of the visual

It seems that the debate on climate science in the media has cycles--peaks and valleys--just as a line graph of climate variation on earth does. In the months before and after a high-profile international conference, newspaper articles, blog posts and TV media stories increase in frequency and intensity. Debates become heated as each side attempts to poke holes in its opponents’ arguments and destabilize their positions. New information comes to light; it is challenged and rebutted. Then the public interest in the discussion dies down and we concern ourselves with other debates such as health care, the failing economy, and many, many more. The interest in climate change and climate science debates seems to lie dormant until the next international meeting or vote on the issue. Yet the total trajectory of how climate science has fared in this arena is most interesting. The March 18 issue of The Economist states that “…the scientists’ shameful mistakes have certainly changed perceptions. They have not, however, changed science itself.” If indeed the science hasn’t changed, then what exactly was it that led, and still feeds, the ongoing rollercoaster of public debate? Did scientists simply learn to present their findings to a larger audience? Can we blame the media for yet again making a sensation out of a molehill? And what is most influential—the methodology, the findings, how they are visually presented, or how the discussion plays out for the general public?

            Possibly the beginning of this debate was the publication of Mann et al. in Nature in 1998 (often referred to as MBH98) and the first appearance of the famous hockey stick graph. Using various paleoclimate data sets and recent instrument-recorded surface temperatures, they were able to construct a northern hemisphere reconstruction of past climates calibrated with current temperature data. Their conclusion was that, in the northern hemisphere, the last 50 years were unnaturally warm compared to the previous 2000, and they suggested this is due to anthropogenic forces. Their line graph was then challenged by outside observers for its correctness and completeness. They were accused of misrepresenting the data, leaving data out and operating in a cohort of scientists who “rubber stamp” each other’s work. Critics claimed that climate scientists were withholding their data, which suggested they had something to hide. Blogs and articles produced by climate change skeptic watchdogs appeared and gathered support while picking apart details of the various reports.

            Perhaps most influential on the opposing side were Stephen McIntyre and Ross McKitrick who published papers questioning the statistical analyses of MBH98. The blog of McIntyre (Climate Audit) provided fuel for the climate skeptics and a controversy for the media to cover.  

            The role of imagery and data projections is certainly central to the debate. While the climate change debate might have started among peer-reviewed journals and academic circles of the physical science world, the discussion morphed onto the secular scene. If we think about the phrase “an image is worth a thousand words,” then we must consider the images that the general public is viewing alongside ideas such as global warming, increasing CO2 levels, and melting ice caps, in addition to data-centric, statistically rendered graphs and charts.

            While the famed “hockey stick” projection, with its estimates of error and red line climbing high in the late 20th century, may be the visualization that most scientists point to in order to give weight to the vast and diverse studies of surface temperatures, proxy measurements of lake, ice or soil cores and tree rings, and atmospheric gases, it is certainly a target for skeptics. Blog posts such as “the broken hockey stick” or “the hockey stick debunked” produced completely flat-line or U-shaped graphs.


I would argue that these types of images aren’t the ones that have the most impact. Instead, think of all the times images of polar bears trapped on melting ice caps, or of barren, dry and cracked soil, are shown in conjunction with climate change debates. Or other images like maps of projected temperature increases, sea level rise and historic/contemporary glaciers.


            Still more powerful, especially early in the climate science debates, might be Al Gore’s An Inconvenient Truth. This documentary film is based on the science behind the publications and recommendations of the IPCC. Here a hockey stick-like chart needs an elevator to raise Gore high enough to show the rising levels of CO2 in the atmosphere. An Inconvenient Truth brought together statistical charts, historic photographs, and heart-wrenching cinematography of ecosystems in harsh transition. I believe that it is these bundles of images that provide the greatest impact to the greatest population. Debate on whether or not scientists maliciously left out data or conveniently tweaked data into conformity will continue. Hockey stick charts and accompanying images of an earth in potentially treacherous transition were the first to bring climate change, and especially human-driven climate change, to the general public. They are now what all science and rhetoric has to build on and react to.


Images of "debunked hockey sticks" are found on Jo Nova's blog
Other images come from a wiki called Global Warming Art
And Al Gore on a lift can be watched here

Monday, April 19, 2010

Gallery-Remote Sensing

Fig. 1. Example of remote-sensing images used to assess flooding regimes on the Connecticut River. a) Additive image of ETM+ bands 4 and 7 for September; b) same for April; c) slope layer; d) composite image (R:G:B = September:April:slope. Areas of flooding are in pink. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 2. Extraction of spring flooded areas, Northampton, MA. a) original imagery of September 2001; b) original imagery of April 2001; c) composite image; d) binary grid, with darker areas representing areas of overbank flooding.


 Fig. 3. Example output of the ecological condition model for longleaf pine sandhills at Eglin AFB, Florida, scaled to 1-ha monitoring units for a) 2001, and b) 2007. Tier 1 represents high-quality sandhills while Tier 4 reflects degraded sandhills.

 Fig. 4. a) Change in longleaf pine condition tier by area (ha) from 2001 to 2007 at Eglin AFB as assessed by the ecological condition model. b) A matrix showing how the tier values of Eglin's 1-ha management units transitioned from 2001 to 2007. For example, 7703 ha moved from Tier 2 to Tier 1 over the time period. Shaded cells indicate 1-ha management units that did not experience a change in condition from 2001 to 2007 (total of 80,141 ha).

 Fig. 5. Flow chart summarizing the spatial modeling approach used to assess the ecological condition of longleaf pine sandhills across Eglin AFB.


All images are from
Wiens J., Sutter R., Anderson M., Blanchard J., Barnett A., Aguilar-Amuchastegui N., Avery C., Laine S. (2009) "Selecting and conserving lands for biodiversity: The role of remote sensing" Remote Sensing of Environment, 113 (7), pp. 1370-1381.

Tuesday, April 6, 2010

week 1 assignment

To find images for this assignment, I used a Google Scholar search and browsed through journal articles looking for maps or map-like projections. I wanted to look specifically at images that show spatial relationships in a traditional way. While I know that much of the data in environmental geography journals is not in map-like form, I was shocked to see just how little there really was. I was especially surprised to find images that were not easily understandable, or that were more confusing than helpful.




These first two images are from the same article (Petit et al. 2001); only the top one appears in grey scale. Working with a lot of satellite images, I see this kind of projection often. At first I thought I would post the grey scale version to show just how absurd it is to publish this type of map in a print journal without color plates, or how un-useful it becomes when printed on a grey scale printer. Yet the color version doesn't seem to be helpful either. Without any other features on the image except a grid and a simple key, it doesn't give any additional information to the reader. It is as if the authors put the image in the article simply to prove that they did the data analysis with a Landsat scene. (Or perhaps in the late 90s, when the authors did this study, they paid a lot of money for the single image and were sure to use it as much as they could!)





The second image is of the relationship between baobab trees and villages in Mali. Here I imagine these data were collected through field measurements, with most of the coded information not having a direct spatial relationship; yet the purpose of the study was to examine how attributes like the age of both the village and the trees relate spatially. While I think the visualization adds to the analysis, I still don't see the spatial relationships clearly.



Therefore, as a final image I wanted to show a geovisualization that would really say more with a map-like image (or here, a time series of images) than with text or another sort of chart or graph. I turned to the recent New York Times article about where to get a cab. (h/t Timur) Here a heat map of where to find a cab in NYC displays information that could not be understood in other ways...at least not to the extent or with the ease that it is here.

Therefore I think these images call into question how we visualize certain kinds of data (obviously) and whether or not a geovisualization is the best way to display the data. Particularly with the satellite images, these data are, at their creation, spatially linked--each pixel of information is directly bordered by 4 (to 8) other pixels of information. So when does breaking this relationship benefit the viewer? What kinds of data are better represented in tabular or descriptive terms? I know that when I work with these satellite-derived data, I feel I cannot betray them by representing them in any other form, yet I can see how it's not necessarily helpful to view them in this way.


Articles referenced
Duvall C.S. (2007) Human settlement and baobab distribution in southwestern Mali. Journal of Biogeography 34(11):1947–1961

Grynbaum, M. (2010) "Need a Cab? New Analysis Shows Where to Find One" The New York Times. April 2.


Petit, C., Scudder, T. and Lambin, E. (2001) 'Quantifying processes of land-cover change by remote sensing: resettlement and rapid land-cover changes in south-eastern Zambia', International Journal of Remote Sensing, 22(17): 3435–3456