Methods Master Thesis

by ustroetz

This blog post explains the Methods of my Thesis. I will show how I intend to answer two questions: how significant are Spatial Predictors for Timber Harvest Costs, and is it possible to calculate Timber Harvest Costs solely from Spatial Predictors? The Methods can be split into four parts. First, a harvest cost model is developed. Next, the data to be analyzed are created based on the model. These data are then statistically analyzed, and based on the analysis two cost equations are created: one expressing Timber Harvest Cost based on all Predictors, and the other expressing Timber Harvest Cost solely based on the Spatial Predictors. Last, the spatially explicit cost equation is used to create a Cost Surface covering the entire Colorado State Forest.

1. Model Development
A harvest model was developed to estimate relative Harvest Costs per ton ($/ton). The model is based on the Fuel Reduction Cost Simulator (FRCS) (Fight, Hartsough and Noordijk 2006) software. Relevant formulas for a ground-based mechanized-felling whole tree system were taken from the FRCS and were written up in a Python script. The script allows iterations over the model in order to analyze it.
The model consists of three processing activities: felling, transportation to the landing, and processing at the landing. Trees are felled and bunched from drive-to-tree machines (which are assumed for flat ground), or swingboom and self-leveling versions (which are assumed for steeper terrain). Rubber-tired grapple skidders transport bunches to the landing. Trees are processed mechanically with stroke or single-grip processors at the landing.
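To illustrate what "iterating over the model" can look like, here is a minimal sketch. It assumes the model is available as a function of the four stand-level inputs introduced in the next section. The names `iterate_model` and `toy_cost` are hypothetical, and `toy_cost` is only a stand-in, not an FRCS formula.

```python
from itertools import product


def iterate_model(cost_model, slopes, skid_distances, trees_per_acre, volumes_per_tree):
    """Run cost_model on every combination of the four inputs.

    cost_model is any callable taking (s, sd, tpa, vpt) and returning a cost.
    Returns one dict per combination, ready to be written to a table.
    """
    rows = []
    for s, sd, tpa, vpt in product(slopes, skid_distances, trees_per_acre, volumes_per_tree):
        rows.append({"S": s, "SD": sd, "TPA": tpa, "VPT": vpt,
                     "C": cost_model(s, sd, tpa, vpt)})
    return rows


def toy_cost(s, sd, tpa, vpt):
    # Stand-in only: NOT an FRCS formula, just something to iterate over.
    return s + sd


# Hypothetical input ranges, chosen only for illustration.
rows = iterate_model(toy_cost, [0, 10, 20], [100, 500], [100, 200], [10, 20])
print(len(rows))  # 3 * 2 * 2 * 2 combinations
```

Passing the model in as a function keeps the loop independent of the actual cost formulas, so the same loop works for any version of the model.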

2. Data Creation
For the statistical analysis, dependent and independent variables need to be defined and created. In the following I will explain the creation of the dataset. The dataset contains the independent variables, which are the input data to the cost model, and the dependent variable, which is the Harvest Cost calculated by the cost model.
The model takes in four input variables per timber stand, which are the independent variables of the analysis:

  • Slope (S)
  • Skidding Distance (SD)
  • Trees per Acre (TPA)
  • Volume per Tree (VPT)

The dataset is based on 74 timber sales from the Colorado State Forest (CSF) from the years 2005 to 2014. The timber sale data include values for TPA and VPT. Slope was derived from a Slope raster, and Skidding Distance was calculated as the Euclidean Distance to the nearest road.
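The Skidding Distance calculation can be sketched as follows. This is a minimal illustration, assuming each stand is represented by a single point and the road network by a list of sampled points, all in the same projected coordinate system. The function name `skidding_distance` and the coordinates are hypothetical.

```python
import math


def skidding_distance(stand_xy, road_points):
    """Straight-line (Euclidean) distance from a stand point to the nearest road point."""
    if not road_points:
        raise ValueError("at least one road point is required")
    x, y = stand_xy
    return min(math.hypot(x - rx, y - ry) for rx, ry in road_points)


# Hypothetical coordinates, chosen only for illustration.
roads = [(3.0, 4.0), (10.0, 10.0)]
print(skidding_distance((0.0, 0.0), roads))  # nearest road point is (3, 4)
```

Because the roads are reduced to sample points here, the result is only as accurate as the sampling density. A point-to-segment distance would be needed for an exact distance to a road line.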

3. Statistical Analyses
The produced data were statistically analyzed. First, descriptive statistics of the data were produced. Next, I developed a regression model expressing the Cost as a function of the four input variables. The following model resulted:

C = -3.667098572 + 133.515209875 x VPT^(-0.72) - 0.003088015 x TPA + 0.305091203 x S + 0.007587668 x SD + εi
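As a worked example, the equation can be written as a small Python function. The coefficients are copied from the equation above and the error term is left out. The function name `full_cost` and the stand values are hypothetical, and the units of the inputs are assumed to match those of the underlying dataset, which this post does not spell out.

```python
def full_cost(vpt, tpa, s, sd):
    """Predicted relative harvest cost ($/ton) from the full regression equation.

    vpt: Volume per Tree (must be > 0), tpa: Trees per Acre,
    s: Slope, sd: Skidding Distance. The error term is omitted.
    """
    return (-3.667098572
            + 133.515209875 * vpt ** -0.72
            - 0.003088015 * tpa
            + 0.305091203 * s
            + 0.007587668 * sd)


# Hypothetical stand values, chosen only for illustration.
print(round(full_cost(vpt=20.0, tpa=150.0, s=15.0, sd=400.0), 2))
```

The negative exponent on VPT means the predicted cost per ton falls as the trees get larger, while Slope and Skidding Distance both raise it.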

Next I validated the model, testing that it is reasonable and statistically significant. Several indicators were investigated for this. First, the statistical significance of the model as a whole was verified. Next, the significance of each coefficient was tested. Then I explored the usefulness of the model and how well it fits the data. The last step was to explore how well the data satisfy the assumptions of a linear regression. Below are the summary statistics of the model. The full validation can be found in my paper.

lm(formula = C ~ I(VPT^(-0.72)) + TPA + S + SD, data = costData)

Residuals:
   Min     1Q Median     3Q    Max
-7.842 -0.649  0.042  0.626 77.691

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -3.667e+00  1.546e-02 -237.16   <2e-16 ***
I(VPT^(-0.72))  1.335e+02  5.993e-02 2227.67   <2e-16 ***
TPA            -3.088e-03  3.428e-05  -90.08   <2e-16 ***
S               3.051e-01  3.362e-04  907.56   <2e-16 ***
SD              7.588e-03  4.515e-06 1680.67   <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.422 on 161185 degrees of freedom
Multiple R-squared: 0.9828, Adjusted R-squared: 0.9828
F-statistic: 2.309e+06 on 4 and 161185 DF, p-value: < 2.2e-16

As you can see, the model has a very high R-squared (0.98) and all coefficients are significant. More on this follows in the Discussion chapter.
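The model itself was fitted in R, as the output above shows. For readers who prefer Python, the same kind of fit can be sketched with NumPy's least-squares solver. This is only an illustration on synthetic data, not the thesis data: the function name `fit_cost_model`, the input ranges, and the noise level are all assumptions.

```python
import numpy as np


def fit_cost_model(vpt, tpa, s, sd, cost):
    """Ordinary least squares for C ~ VPT^(-0.72) + TPA + S + SD.

    Returns the coefficients (intercept first) and the R-squared of the fit.
    """
    vpt = np.asarray(vpt, dtype=float)
    cost = np.asarray(cost, dtype=float)
    # Design matrix: intercept column, transformed VPT, then the linear terms.
    X = np.column_stack([np.ones_like(vpt), vpt ** -0.72, tpa, s, sd])
    coef, *_ = np.linalg.lstsq(X, cost, rcond=None)
    residuals = cost - X @ coef
    r2 = 1.0 - (residuals ** 2).sum() / ((cost - cost.mean()) ** 2).sum()
    return coef, r2


# Synthetic data (assumed ranges), generated from the rounded coefficients plus noise.
rng = np.random.default_rng(0)
n = 500
vpt = rng.uniform(1, 50, n)
tpa = rng.uniform(50, 500, n)
s = rng.uniform(0, 40, n)
sd = rng.uniform(0, 2000, n)
true_coef = np.array([-3.667, 133.5, -0.003088, 0.3051, 0.007588])
cost = (true_coef[0] + true_coef[1] * vpt ** -0.72 + true_coef[2] * tpa
        + true_coef[3] * s + true_coef[4] * sd + rng.normal(0, 1.0, n))

coef, r2 = fit_cost_model(vpt, tpa, s, sd, cost)
print(coef, r2)
```

The fixed exponent of -0.72 is applied to VPT before fitting, so the model stays linear in its coefficients, which is what `I(VPT^(-0.72))` does in the R formula.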

After I created the regression model with all predictors, I derived a spatially explicit regression model with only the spatial predictors. The following model resulted:

C = 22.8038 + 0.3272 x S + 0.007578 x SD + εi

I validated this model the same way as the previous one. Its summary statistics are shown below:

lm(formula = C ~ S + SD, data = costData)

Residuals:
   Min     1Q Median     3Q    Max
-24.01  -4.94  -1.19   3.42 446.41

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.280e+01  4.950e-02   460.7   <2e-16 ***
S           3.272e-01  1.979e-03   165.3   <2e-16 ***
SD          7.578e-03  2.655e-05   285.4   <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.379 on 161187 degrees of freedom
Multiple R-squared: 0.4045, Adjusted R-squared: 0.4045
F-statistic: 5.475e+04 on 2 and 161187 DF, p-value: < 2.2e-16

The R-squared (0.40) is considerably lower than that of the previous model (0.98), but it is still something to work with.

4. Cost Surface
Now that the spatially explicit regression model has been created, I can create my Cost Surface. For this I used a Slope Raster. It serves as the reference raster for the Cost Surface, and its values are also used in the regression model. Each pixel of the Cost Surface contains the relative Harvest Cost. For each pixel I measured the Euclidean Distance to the closest road and put that distance, together with the pixel's Slope value, into the regression, which yields a cost value for each pixel. All calculations were done on NumPy arrays, which are fairly fast to process. Calculating the Distance for each pixel was nevertheless very computationally intensive, which is why I ended up calculating it only for the southern part of the state forest. On my little MacBook this took five days of processing.
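The per-pixel calculation can be sketched in NumPy as follows. This is not the script I actually ran, just a minimal illustration under some assumptions: the roads are given as a boolean raster aligned with the Slope Raster, distances are measured between pixel centers, and `cell_size` converts pixel units into the distance unit the regression expects. The function names are hypothetical.

```python
import numpy as np


def distance_to_nearest_road(road_mask, cell_size=1.0):
    """Euclidean distance from every pixel to the nearest road pixel (brute force)."""
    roads = np.argwhere(road_mask)  # (n_roads, 2) row/col indices of road pixels
    if roads.shape[0] == 0:
        raise ValueError("road_mask contains no road pixels")
    rows, cols = np.indices(road_mask.shape)
    pixels = np.column_stack([rows.ravel(), cols.ravel()])  # (n_pixels, 2)
    # Pairwise offsets between all pixels and all road pixels.
    diff = pixels[:, None, :] - roads[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return nearest.reshape(road_mask.shape) * cell_size


def cost_surface(slope, road_mask, cell_size=1.0):
    """Apply the spatially explicit equation C = 22.8038 + 0.3272*S + 0.007578*SD per pixel."""
    sd = distance_to_nearest_road(road_mask, cell_size)
    return 22.8038 + 0.3272 * slope + 0.007578 * sd


# Tiny hypothetical raster: flat terrain, one road pixel in the top-left corner.
slope = np.zeros((3, 3))
road_mask = np.zeros((3, 3), dtype=bool)
road_mask[0, 0] = True
print(cost_surface(slope, road_mask, cell_size=10.0))
```

The brute-force comparison of every pixel with every road pixel needs time and memory proportional to pixels times road pixels, which becomes prohibitive for large rasters. Processing the raster in blocks, or using a dedicated distance transform, would be the usual ways around that.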