Application Example Master Thesis

In order to visualize and demonstrate the results from my Master Thesis, I created a small web app. You can access it at www.wald.io.

The app visualizes the Cost Surface of the CSF as a web map. The Cost Surface is based on the regression model with the spatial variables, giving foresters a complete overview of the forest’s potential Harvest Costs. The application also allows the user to digitize a forest stand and then reports the potential Harvest Costs for the digitized stand. The user gets immediate feedback since the Cost Surface is pre-generated. In addition, if the user has inventory data for the stand, they can enter this information and the application returns the exact Harvest Cost based on the full harvest model.

The Cost Surface is served as a WMS with GeoServer. The app itself is served as a Python Flask app, so all the back-end processing is done with the same Python scripts used for the research itself. The front end is simply done with Leaflet and jQuery. Below is a detailed overview of the application structure.
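A minimal sketch of what such a Flask back end could look like. The route name, payload format, and rounding are my illustrative assumptions, not the actual wald.io code; the coefficients are those of the spatially explicit regression:

```python
# Sketch of a Flask endpoint returning the potential Harvest Cost for a stand.
# Route name and JSON payload are illustrative assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Coefficients of the spatially explicit regression: intercept ($/ton),
# Slope ($/%), Skidding Distance ($/ft)
INTERCEPT, B_SLOPE, B_SKID = 22.8038, 0.3272, 0.007578

@app.route("/harvest-cost", methods=["POST"])
def harvest_cost():
    stand = request.get_json()
    # In the real app, mean Slope (%) and Skidding Distance (ft) would be
    # derived from the digitized polygon and the pre-generated Cost Surface.
    cost = INTERCEPT + B_SLOPE * stand["slope"] + B_SKID * stand["skid_dist"]
    return jsonify({"harvest_cost_per_ton": round(cost, 2)})
```

A request like `POST /harvest-cost {"slope": 10, "skid_dist": 1000}` would then return the cost for that stand.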



Discussion and Conclusion Master Thesis

In this post I want to discuss a couple of things regarding my research before I draw my final conclusion.

I want to talk a little bit about the results of the Statistical Analysis and about the actual Harvest Costs for the State Forest.


During the Statistical Analysis, two outlier patterns kept recurring which I think have a significant influence on the results:

The first pattern, that high Harvest Costs occur in extremely densely stocked stands with a low volume per tree, is a realistic real-world condition and expected silvicultural behavior: if stands are extremely densely stocked, the Volume per Tree value is low. Therefore these outliers were not removed from the dataset. From a practical point of view, however, such stands are not likely to be harvested.

The second pattern is that high Harvest Costs arise in stands with a high Slope value.

This is also expected behavior: the steeper the slope, the more expensive the harvest. But under real-world conditions, slopes steeper than 40% are not harvested with ground-based machinery. Yet the given stands of the CSF are located in this terrain and are designated by the CSF as harvestable areas. Therefore these outliers were also not removed from the dataset.

Even though both patterns can occur in real-world situations, it is likely that they caused the unexplained variance in Cost. Future studies should definitely take that into account and should consider removing those outliers.

Interpretation Regression Model

I could write a whole lot about possible interpretations of the regression model; you can read all of that in the full paper. But I want to give you one sample calculation that highlights the results fairly well:

The intercept of the spatially explicit regression is 22.80 $/ton. This is the Cost if the stand is on absolutely flat ground (Slope of 0%) and located right at the road (Skidding Distance of 0 ft.). The coefficient for Skidding Distance is 0.0076 $/ft and the Slope coefficient is 0.33 $/%. Since both coefficients are positive, Harvest Cost increases as Slope increases or as the distance to the road increases. For each percent increase in Slope, the Harvest Cost increases by 0.33 $/ton. The Harvest Cost also increases by 0.0076 $/ton for each foot of Skidding Distance, or, expressed in other units, by 7.58 $/ton for each additional 1,000 feet to skid. So let's say the stand is 1,000 ft away from the road and the Slope is 10%. The basic harvest cost of 22.80 $/ton would increase to 33.70 $/ton (=22.80 $/ton + 0.0076 $/ft x 1000 ft + 0.33 $/% x 10%).
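The sample calculation above can be reproduced in a few lines of Python, using the rounded coefficients quoted in the text:

```python
def spatial_harvest_cost(slope_pct, skid_dist_ft):
    """Relative Harvest Cost ($/ton) from the spatially explicit regression.

    Rounded coefficients from the text: intercept 22.80 $/ton,
    0.33 $/% for Slope, 0.0076 $/ft for Skidding Distance.
    """
    return 22.80 + 0.33 * slope_pct + 0.0076 * skid_dist_ft

# The worked example: 10 % slope, 1,000 ft from the road (≈ 33.70 $/ton)
print(spatial_harvest_cost(10, 1000))
```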

Harvest Costs Colorado State Forest 

The calculated Harvest Costs for the specific stands of the Colorado State Forest serve, in a way, as a validation of this research. The regression with all four predictors is useful since it produced almost identical results to the full model: the mean of all stands’ Harvest Costs differed by only 0.75 $/ton and the standard deviation by 0.11 $/ton. This confirms the high R-squared of 0.98, but also the regression’s unexplained variance of 1.72%. The spatially explicit regression differs more from these results: its mean differs from the mean of the regression with all predictors by 3.69 $/ton. At 8.6 $/ton, the standard deviation of the spatially explicit regression is significantly lower than the full regression model’s standard deviation of 10.18 $/ton. This is because the spatially explicit regression is missing the two Non-Spatial Predictors and therefore assumes fixed values for those variables, which makes the variance in Cost smaller.

The produced Cost Surface for the southern part of the Colorado State Forest, with a mean Harvest Cost of 40.83 $/ton and a standard deviation of 15.75 $/ton, is higher than the means and standard deviations of the other calculations. The high mean and standard deviation result from the fact that many stands (or, in the case of the Cost Surface, pixels) are not connected to roads, which produces extremely high Skidding Distance values. The map shown in a previous post clearly shows that stands close to a road are in the lower price range (green color). The Cost Surface is therefore only useful in areas where roads already exist, yet usually roads are not created until a stand is to be harvested, which makes planning and cost calculation very difficult. A way to estimate where potential logging roads will be located is thus needed to calculate meaningful Harvest Costs for areas without road access.


In my conclusion I want to come back to my original research question:

The research showed that Spatial Predictors predict 40% of Timber Harvest Costs. The remaining 60% are predicted by the variables Trees per Acre and Volume per Tree. Therefore the first research question, which asks what the significance of Spatial Predictors on Timber Harvest Costs is, can be answered as follows: Spatial Predictors have a significance of 40% on Timber Harvest Costs.

The second research question, which asks if it is possible to calculate Timber Harvest Costs solely based on Spatial Predictors, depends on the use case:

With this method it is not possible to calculate an absolute Harvest Cost, because the spatially explicit regression model’s R-squared of 0.4045 is too low to calculate Harvest Costs solely based on Spatial Predictors.

But this study was conducted in order to answer whether it is possible to calculate Timber Harvest Costs for use in optimization models. Optimization models require iterating through millions of potential solutions and comparing results in terms of an objective function. For this kind of optimization an R-squared of 0.4045 is sufficient because it gives relative Harvest Costs, which allows optimization models to compare the Costs of different stands and scenarios. These models do not require absolute Harvest Costs.

Therefore the results of this research make it possible to include Harvest Costs in optimization models for ecological forestry approaches. With their inclusion optimization models are significantly improved.


Results Master Thesis

Three things result from this research:

  1. From the Statistical Analysis, a better understanding of the influence of the input variables on Harvest Cost, as well as the spatially explicit cost equation
  2. The specific Harvest Costs for the Colorado State Forest
  3. The Cost Surface based on the spatially explicit regression for the Colorado State Forest


1. Statistical Results

Overall the four variables explain 98.28% of the Harvest Cost using the given model. All of them have an importance in predicting Cost. The Trees per Acre variable, with 3.2% explanatory value, has the least importance, followed by Slope with 9.5%. Skidding Distance is the most important spatial variable with 29.9% explanatory value. And Volume per Tree is the most important variable out of the four, with 55.7% explanatory value. The two spatial variables taken together explain 39.4% of the Cost.

     Relative importance metrics:
     I(VPT^(-0.72)) 0.55669591
     TPA            0.03239262
     S              0.09457912
     SD             0.29918160

The big thing that results from the statistical analysis is the spatially explicit regression. It was validated in the statistics section and has an R-squared of 0.4045. The degree to which it is useful is discussed later.

C = 22.80 + 0.3272 x S + 0.007578 x SD + εi

2. Harvest Costs Colorado State Forest

For the 74 timber sale stands of the CSF, the Harvest Costs based on the full model, on the regression model with all predictors, and on the spatially explicit regression model were calculated. The mean Harvest Cost of all stands is 36.38 $/ton from the full model, 35.62 $/ton from the full equation, and 39.3 $/ton from the spatially explicit equation. The full model and the full equation have roughly the same standard deviations, while the spatially explicit equation’s standard deviation is significantly lower.

The graph below gives an exemplary overview of the Harvest Costs of 10 stands, comparing the different models:


3. Cost Surface

The last, and most important, thing that results from this paper is the Cost Surface. Check out the Application Example at www.wald.io to see the full Cost Surface. In it you can really see the heavy influence of the Skidding Distance on the Cost: everything close to a road is green (less expensive), and the further away you get, the more expensive it gets (red). I created the Cost Surface with numpy arrays, which made the process quite fast. But since I only have an old MacBook, the process for the southern part of the Forest still took five days. Once I have a more powerful machine again, I will calculate the Cost Surface for the entire forest. Below is an image of the Cost Surface and the roads covering the State Forest.


Methods Master Thesis

This blog post will explain the Methods of my Thesis. I will show how I intend to answer what the significance of Spatial Predictors on Timber Harvest Costs is and whether it is possible to calculate Timber Harvest Costs solely based on Spatial Predictors. The Methods can be split into four basic parts. First, a harvest cost model is developed. Next, based on the model, the data to be analyzed are created. After that, these data are statistically analyzed, and based on the analysis two cost equations are created: one expressing Timber Harvest Cost based on all Predictors and the other expressing Timber Harvest Cost solely based on the Spatial Predictors. Last, the developed spatially explicit cost equation is used to create a Cost Surface covering the entire Colorado State Forest.

1. Model Development
A harvest model was developed to estimate relative Harvest Costs per ton ($/ton). The model is based on the Fuel Reduction Cost Simulator (FRCS) (Fight, Hartsough and Noordijk 2006) software. Relevant formulas for a ground-based mechanized-felling whole tree system were taken from the FRCS and were written up in a Python script. The script allows iterations over the model in order to analyze it.
The model consists of three processing activities: felling, transportation to the landing, and processing at the landing. Trees are felled and bunched by drive-to-tree machines (assumed for flat ground) or by swing-boom and self-leveling versions (assumed for steeper terrain). Rubber-tired grapple skidders transport the bunches to the landing, where the trees are processed mechanically with stroke or single-grip processors.
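Structurally, the script sums the cost of the three activities per ton. The sketch below shows that structure only; the per-activity formulas and all numbers are simple placeholders I made up, not the actual FRCS equations (which also take Trees per Acre into account):

```python
# Structural sketch of the harvest model: total cost per ton is the sum of
# the three processing activities. All formulas and numbers below are
# placeholders, NOT the FRCS equations.

def felling_cost(vpt, slope):
    """Felling and bunching ($/ton); the machine type depends on the slope."""
    base = 8.0 if slope <= 20 else 11.0  # drive-to-tree vs. self-leveling (placeholder)
    return base + 2.0 / vpt              # smaller trees cost more per ton

def skidding_cost(skid_dist):
    """Rubber-tired grapple skidders move the bunches to the landing ($/ton)."""
    return 4.0 + 0.005 * skid_dist       # placeholder

def processing_cost(vpt):
    """Mechanical processing with stroke or single-grip processors ($/ton)."""
    return 6.0 + 1.5 / vpt               # placeholder

def harvest_cost(vpt, slope, skid_dist):
    return felling_cost(vpt, slope) + skidding_cost(skid_dist) + processing_cost(vpt)
```

Writing the model as plain functions like this is what makes it easy to iterate over many input combinations for the analysis.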

2. Data Creation
For the statistical analysis, dependent and independent variables need to be defined and created. In the following I will explain the creation of the dataset. The dataset contains the independent variables, which are the input data to the cost model and the dependent variable, which is the actual Harvest Cost calculated by the cost model.
The model takes in four input variables per timber stand, which are the independent variables of the analysis:

  • Slope (S)
  • Skidding Distance (SD)
  • Trees per Acre (TPA)
  • Volume per Tree (VPT)

The data were created based on data for 74 timber sales from the CSF, ranging from 2005 to 2014. These data include values for TPA and VPT. Slope was derived from a Slope raster, and Skidding Distance was calculated as the Euclidean Distance to the nearest road.

3. Statistical Analyses
The produced data were statistically analyzed. First, descriptive statistics of the data are given. Next, I developed a regression model expressing the Cost with the four input variables. The following model resulted:

C = -3.667098572 + 133.515209875 x VPT^(-0.72) - 0.003088015 x TPA + 0.305091203 x S + 0.007587668 x SD + εi
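As a quick sanity check, the fitted equation can be evaluated directly in Python (the stand values in the example are made up for illustration):

```python
def full_model_cost(vpt, tpa, s, sd):
    """Harvest Cost ($/ton) from the regression with all four predictors.

    Coefficients copied from the fitted model above: VPT in cubic feet,
    TPA in trees per acre, S in %, SD in feet.
    """
    return (-3.667098572
            + 133.515209875 * vpt ** -0.72
            - 0.003088015 * tpa
            + 0.305091203 * s
            + 0.007587668 * sd)

# A hypothetical stand: 30 cu. ft. per tree, 200 trees/acre,
# 15 % slope, 800 ft skidding distance
print(round(full_model_cost(30, 200, 15, 800), 2))
```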

Next I validated the model, testing to assure that it is reasonable and statistically significant. Several indicators were investigated for that. First, the statistical significance of the model was verified. Next, the significances of the coefficients were tested. Then the usefulness of the model and how well the data fit it were explored. The last step was to check how well the data satisfy the assumptions of a linear regression. Below are the summary statistics of the model; the full validation can be found in my paper.

Call:
lm(formula = C ~ I(VPT^(-0.72)) + TPA + S + SD, data = costData)

Residuals:
   Min     1Q Median     3Q    Max
-7.842 -0.649  0.042  0.626 77.691

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -3.667e+00  1.546e-02 -237.16   <2e-16 ***
I(VPT^(-0.72))  1.335e+02  5.993e-02 2227.67   <2e-16 ***
TPA            -3.088e-03  3.428e-05  -90.08   <2e-16 ***
S               3.051e-01  3.362e-04  907.56   <2e-16 ***
SD              7.588e-03  4.515e-06 1680.67   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.422 on 161185 degrees of freedom
Multiple R-squared: 0.9828, Adjusted R-squared: 0.9828
F-statistic: 2.309e+06 on 4 and 161185 DF, p-value: < 2.2e-16

As you can see, it has an excellent R-squared and all coefficients are significant. More on this will be discussed in the Discussion chapter.

After I created the regression model with all predictors, I derived a spatially explicit regression model with only the spatial predictors. The following model resulted:

C = 22.8038 + 0.3272 x S + 0.007578 x SD + εi

I validated the model the same way as the previous model. Below again its summary statistics:

Call:
lm(formula = C ~ S + SD, data = costData)

Residuals:
   Min     1Q Median     3Q    Max
-24.01  -4.94  -1.19   3.42 446.41

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.280e+01  4.950e-02   460.7   <2e-16 ***
S           3.272e-01  1.979e-03   165.3   <2e-16 ***
SD          7.578e-03  2.655e-05   285.4   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.379 on 161187 degrees of freedom
Multiple R-squared: 0.4045, Adjusted R-squared: 0.4045
F-statistic: 5.475e+04 on 2 and 161187 DF, p-value: < 2.2e-16

The R-squared is not as good as in the previous model, but still something to work with.

4. Cost Surface
Now that the spatially explicit regression model was created, I could create my Cost Surface. For this I simply used a Slope raster: it serves as the reference raster for the Cost Surface, and its values are used in the regression model. The Cost Surface contains in each pixel the relative Harvest Cost. For this I measured the Euclidean Distance from each pixel to the closest road and put that distance into the regression, together with the pixel’s Slope value. The regression then yields a value for each pixel. All the calculation was done in NumPy arrays, which is fairly fast to process. Still, calculating the distance for each pixel was very compute-intensive, which is the reason I ended up only calculating it for the southern part of the state forest. On my little MacBook this took five days of processing.
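The per-pixel computation can be sketched with NumPy and SciPy’s Euclidean distance transform. The toy rasters and the pixel size below are my stand-ins for the CSF Slope raster and road network, not the actual data:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Toy rasters: slope in %, roads as a boolean mask (True = road pixel)
slope = np.array([[ 5., 10., 20.],
                  [ 0., 15., 35.],
                  [ 8., 12., 40.]])
roads = np.array([[True, False, False],
                  [True, False, False],
                  [True, False, False]])

PIXEL_SIZE_FT = 100.0  # assumed pixel resolution

# Euclidean distance from every pixel to the nearest road pixel, in feet.
# distance_transform_edt measures distance to the nearest zero, so invert the mask.
skid_dist = distance_transform_edt(~roads) * PIXEL_SIZE_FT

# Apply the spatially explicit regression to every pixel at once
cost_surface = 22.8038 + 0.3272 * slope + 0.007578 * skid_dist
```

Because both the distance transform and the regression are vectorized array operations, the whole surface is computed without any per-pixel Python loop.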

Motivation Master Thesis

Today I will introduce the bigger topic of my Master Thesis. I will explain the motivation of my research and what the actual research problem is.

Who cares about it?

In the last year I worked for Ecotrust, an environmental non-profit organization in the Pacific Northwest of the US. One of the tools they work on is the Forest Planner, an online tool for forest management and scenario planning in Oregon and Washington. Essentially it is an optimization model for ecological forestry. Optimization models require considering a variety of spatial features, including Harvest Costs, in order to maximize triple bottom line returns (check out the picture for a reminder of what that is).

The models require a pre-generated dataset with the potential Harvest Costs for the entire landscape in order to iterate through millions of potential solutions and compare results in terms of an objective function. Since the composition and structure of the forest systems are usually not available for an entire landscape, a model is required that calculates Harvest Costs solely based on Spatial Predictors, which can be determined via Geographic Information Systems. Currently no existing study investigates the significance of Spatial Predictors on Timber Harvest Cost. Therefore it is also not known whether that significance is high enough to calculate Timber Harvest Costs solely based on Spatial Predictors. This is actually what I am trying to answer with my thesis.

What is the actual question?

  1. First, I want to find out what the significance of Spatial Predictors on Timber Harvest Costs is.
  2. Then, I want to answer whether it is possible to calculate Timber Harvest Costs solely based on Spatial Predictors.

If the significance of Spatial Predictors on Harvest Costs turns out to be high enough to calculate the Costs solely based on Spatial Predictors, I will be able to create a Cost Surface for an entire region with GIS. This Cost Surface can then be used in the Forest Planner to quickly iterate through millions of potential solutions and find the best answer for how we treat our forests.

In the next post I will go through my Methods and explain how I will find out what the significance of Spatial Predictors on Timber Harvest Costs is.

Structure Master Thesis

For me it is crucial to have, early on, a structure in mind of how the thesis will look later. In the following I will briefly touch on each major point of the thesis.


Obviously it starts out with an introduction, reviewing the literature around different harvest cost models and the limited literature on statistical analyses of these models. An important part will also be defining terms like Harvest Cost, Skidding Distance, Spatial Variables, etc. Towards the end the hypothesis will be explained, which will roughly be:

Harvest Costs are driven to ???% by the spatial variables.

Last, since the study is heavily focused on the Colorado State Forest, a quick introduction to this area and the specifics of it will be given.


The biggest part will be the methods. After explaining how the cost model works, I will explain the data creation in detail, which is a core part of analyzing the costs.

The actual statistical analysis will also be explained here.


This part will present the results of my statistical analyses.

Application Example

Here I want to demonstrate what can be done with the results of my analysis. I created a web application where you can digitize timber stands and get immediate results for the costs. A cost raster will also be visible in the application, showing the harvest costs for every location in the Colorado State Forest.


Discussion & Conclusion

In the end I will review my results and discuss their critical components.

Introduction to my Master’s Thesis Research

During my internship at Ecotrust I created a cost model that estimates timber harvest costs. It tells you how much it costs to harvest a timber stand.


The cost model is part of Ecotrust’s Forest Planner, an online tool for forest management and scenario planning.

In my research for my Master Thesis I want to analyze the driving factors that influence harvest costs, based on the created model. The model takes four input variables to calculate the harvest cost for a stand:

  • Slope in % (S)
  • Skidding Distance in feet (SD)
  • Trees per Acre (TPA)
  • Volume per Tree in cubic feet (VPT)

It returns a Harvest Cost per ton, which can be extrapolated to the stand level.

Research Question

I want to investigate if it is possible to predict relative Harvest Costs (per ton) with only explicit spatial variables, which are in the case of the model Slope and Skidding Distance.

Data Production

In order to investigate the influence of each variable on the Harvest Costs, I need test input data. For that I will take the Colorado State Forest’s (CSF) timber sales data from the last ten years and run it through my cost model.

Statistical Analysis

With the created test datasets I will conduct a regression analysis, aiming for a regression model like this:

Harvest Cost ≈ β0 + β1 TPA + β2 VPT + β3 SD + β4 S

In the next step I will remove the non-spatial variables Trees per Acre and Volume per Tree.

Harvest Cost ≈ β0 + β1 SD + β2 S
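Regressions like the two above could be fit, for example, with Python’s statsmodels formula API. The snippet below uses synthetic data with known coefficients purely to illustrate the workflow; the real analysis will run on the cost model’s output for the CSF timber sales:

```python
# Illustrative only: fit the spatially explicit regression on synthetic data
# with known coefficients, to show the intended workflow.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "S":  rng.uniform(0, 40, n),     # Slope in %
    "SD": rng.uniform(0, 2000, n),   # Skidding Distance in ft
})
# Synthetic Harvest Cost: known coefficients plus some noise
df["C"] = 22.8 + 0.33 * df["S"] + 0.0076 * df["SD"] + rng.normal(0, 1, n)

model = smf.ols("C ~ S + SD", data=df).fit()
print(model.params)  # recovers roughly 22.8, 0.33 and 0.0076
```

Dropping the non-spatial variables is then just a matter of changing the formula string.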

My hypothesis is that Slope and Skidding Distance are sufficient to still come up with a reasonable Harvest Cost.

What is it good for?

Based on this formula I would be able to predict the cost for any given location in the CSF, solely based on Skidding Distance and Slope. I could pre-generate a Harvest Cost raster covering the entire CSF, showing the relative Harvest Cost for every location.


Module Visualization and Cartography

Finally the day has come: 600 days after starting my first module, I handed in my last mandatory module, Visualization and Cartography. A module you would expect at the beginning of a GIS degree was, for some reason, the last one. After producing many, many maps for various courses, I actually learned what matters when creating them.

Even though I am currently not planning to focus on map making (at least print maps), knowing a little bit about visualization will always be handy; you have to communicate your work and ideas somehow. Inspired by Steve Jobs’ Stanford commencement speech, in which he points out the influence that taking a calligraphy class had on his life, I had wanted to learn calligraphy anyway, and it is an important part of visualization. So I was excited to learn something about visualization and cartography.

It was the first time in all of the modules that the lectures were taught as an online class rather than as a PDF class. But only kind of: we got a zip folder with an HTML file inside, and the HTML only worked in Windows browsers. So on the one hand, cool: online. On the other, not cool: some weird HTML only working in a Windows browser. Why couldn’t they host the content? I finally got it to work, with the help of another student, by changing the user agent in Safari’s developer menu.
The HTML was very static. Unfortunately no JavaScript jumping around, even though the topic of the class is begging for it. But that’s just the form. Let’s get to the content.

The module had six big lectures: Introduction, Cartography language, Abstraction, Labeling/Typography, Map design, and Surfaces.
The six lectures were accompanied by four assignments: we had to produce a thematic map about tourism, design a poster (see the Cyprus image), produce a couple of maps with a focus on classification, and answer questions about visual variables, like the use of color.

Most of the topics covered in the lectures were not limited to maps (e.g. the use of color). It was a little disappointing, though, that the entire class focused on static maps; not a single word was said about web maps.

The class also mainly scratched the surface of the topics and didn’t give me more information or inspiration about design than Wikipedia could. In addition, clicking through the HTML jungle wasn’t exactly motivating.
From my point of view, this module definitely needs an update, in both form and content.

Optional Module Application Development (using Java)

We do it every day, almost every second: using apps. New apps pop up daily. We have apps for text editing, for cooking our meals, for keeping track of our workouts, for everything. Just in time for the big app boom, UNIGIS released the optional module “Application Development (using Java)”. It fit right into my programming-focused GIS path after I finished the OSM and Python modules, so I went ahead and selected it.

The module was structured in 13 lessons and four assignments.

It started out with a broad introduction to different application types and use cases, which was also the first assignment: we were given three different use cases and had to identify the proper programming language. Two were about general geo-processing and one was about an iPhone app. Since I do Python geo-processing on a daily basis, the first two weren’t too difficult to figure out. The answer for the iPhone app was delivered by a quick Google search. I was surprised: I thought all that stuff was written in JavaScript, but iPhone apps are actually written in Objective-C, though they can be ‘translated’ from JavaScript.

JavaScript was then also the topic of the next lesson and assignment. We had to modify an OpenLayers JavaScript example and pep up the HTML a little. This was nothing new, since I had learned that extensively during the Web Mapping Applications Summer School in Girona last year.

Lesson and assignment three were about the other end of the game: server-side scripting! We were given a PHP query and had to explain what it does.

After that was a long stretch of lessons without assignments. The topics covered were: system architecture, object-oriented programming, and programming environment.

Finally we touched Java itself. The following seven lectures covered everything necessary to build an actual Java app, which was the last assignment. All of that was new to me. We used Eclipse to build the app. The app is supposed to receive data from a GPS device and bring it to a server, which eventually publishes the information. The conceptual model (from the UNIGIS lectures) below illustrates the individual steps.
Our task was to develop the receiving, processing, validating, and triggering parts of the app, i.e. the server component. It was quite tricky, but very well documented, and we received detailed step-by-step instructions. Eventually we could test our app against a server set up by the instructor.

Developing the app was definitely the hardest and most time consuming part of the module. Though I learned a lot: starting from learning some Java to actually understanding what steps are necessary to receive data and bring it to a server.

The module was well structured and managed, going from a general app introduction all the way to developing a Java app without losing or boring the student.

Module Spatial Statistics

The latest module, “Spatial Statistics”, was of great interest to me, since I currently do a lot of statistics with R for my internship project, analyzing the influence of the spatial variables on harvest costs. R puts out great reports with lots of numbers, which I am not always sure how to interpret. So I was very excited to gain a more fundamental knowledge of statistics.

I was also glad to read in the instructor’s welcoming email that he puts an emphasis on the theoretical and methodological background of the tools used, and that was indeed the case during the module. We mainly used ArcGIS’s Geostatistical Analyst extension for the spatial analysis, and the steps necessary to get the results were very well documented for us. This way we could focus on the outputs of the tool rather than on learning yet another tool; in general the Geostatistical Analyst is pretty intuitive. We also used SPSS for analysis. I had already used it a little during my bachelor’s degree and was hoping we would use R this time, since I use R a lot for my work projects: it integrates better with Python and is open source.

So what was the module about?

The first lecture was a great refresher of statistical terms: things I had heard many times before. But as with many things, the more often one hears them and thinks about them, the more one understands them. Slowly I am getting the hang of all those statistical terms.

From there we moved on and learned the differences between estimate statistics and test statistics, and about autocorrelation.

Next we got to Point Pattern Analysis. One of its methods is, for instance, the Nearest Neighbor Analysis. It is used a lot in crime statistics to test whether certain crimes show spatial clustering; the typical example is, of course, that the crime rate is higher in low-income neighborhoods. We had to analyze whether the spatial distribution of Salzburg’s citizens’ education levels is random or shows a pattern. For example, are people with university degrees clustered in a certain neighborhood, or are they spread out all over the city?
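The idea behind the Nearest Neighbor Analysis can be sketched in a few lines of Python: compare the observed mean nearest-neighbor distance against the value expected for a completely random pattern. The coordinates below are random toy data, not the Salzburg dataset:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform(0, 1000, size=(200, 2))  # toy point pattern in a 1000 x 1000 m area

# Observed mean distance to each point's nearest neighbour
# (k=2 because the closest hit is the point itself)
dists, _ = cKDTree(points).query(points, k=2)
observed = dists[:, 1].mean()

# Expected mean distance under complete spatial randomness: 1 / (2 * sqrt(density))
density = len(points) / (1000 * 1000)
expected = 1 / (2 * np.sqrt(density))

# Nearest Neighbor Index: < 1 indicates clustering, > 1 dispersion
nni = observed / expected
```

For a uniformly random pattern like this one, the index comes out close to 1; clustered crime locations or clustered university graduates would push it well below 1.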

On we went to the topic of variography. Variography rests on the interesting assumption that two points close to each other take on close values because these values were generated under similar physical conditions (Isaaks and Srivastava, 1989). We put that to use right away with another analysis of Salzburg’s population: using the ArcGIS Geostatistical Analyst, we had to examine whether there is spatial autocorrelation of non-EU citizens in the city of Salzburg. A trend analysis showed, for instance, that the majority of non-EU citizens live in the centre of Salzburg.
Screen Shot 2013-11-24 at 11.14.29
We stayed with that topic and learned about the connection between variography and interpolation. Towards the end of the module we finally came to Regression Analysis, the topic I was especially interested in, since that is exactly what I do in my harvest cost research. With a multivariate regression analysis we had to test the following hypothesis:

The furnishing of an apartment is related to its market value, the ownership situation, the size of the apartment, and how many people live there.

The analysis was done with the very user-friendly program SPSS. We had to try out forward and backward selection and discuss the differences in the outputs.

The last lecture and assignment were about clustering: how clusters are created, how to interpret them, necessary preconditions, etc.

For me, the use-case examples in the assignments were very concrete and therefore made the often abstract statistical methods a lot more accessible. The newly learned knowledge will be of great help for future statistical analyses.