Using Machine Learning to derive Soil
Information from Soil Reflectance
Composites (SCMaP)
Wissenschaftliche Arbeit zur Erlangung des akademischen Grades
Bachelor of Science
im Studiengang
Geographie
am Institut für Geographie der Universität zu Köln
Autor:
Simon Donike
Matrikelnummer 5976642
22. November 2019
Erstgutachter: Zweitgutachterin:
Prof. Dr. Georg Bareth Dr. Uta Heiden
Universität zu Köln Deutsches Zentrum für Luft- und Raumfahrt e. V.
i
Table of Contents
i. Acknowledgement ........................................................................................................ iii
ii. List of Figures .............................................................................................................. iv
iii. List of Tables ............................................................................................................... v
iv. List of Equations .......................................................................................................... v
v. List of Abbreviations.................................................................................................... vi
1. Introduction ................................................................................................................... 1
1.1 Significance of Soil Earth Observation ............................................................. 1
1.2 Visibility of Soil Parameters in Earth Observation ........................................... 2
1.3 Soil Composite Mapping Processor (SCMaP) .................................................. 2
1.3.1 SCMaP ..................................................................................................... 2
1.3.2 SCMaP Product –Soil Reflectance Composite ........................................ 5
1.4 Research Objectives........................................................................................... 6
2. Data and Study Area ..................................................................................................... 7
2.1 Study Area ......................................................................................................... 7
2.2 Soil Reflectance Composite ............................................................................... 8
2.3 Soil Sample Point Information ........................................................................ 11
2.3.1 Point Samples ........................................................................................ 11
2.3.2 Preprocessing ......................................................................................... 13
3. Methodology ............................................................................................................... 17
3.1 Analysis of distributions within dataset ........................................................... 17
3.2 Correlation Coefficients................................................................................... 18
3.3 Component Analysis ........................................................................................ 20
3.3.1 Principal Component Analysis (PCA) ................................................... 20
3.3.2 Linear Discriminant Analysis (LDA) .................................................... 21
3.4 Random Decision Forest Classifier ................................................................. 22
3.4.1 Random Decision Forest Methodology ................................................. 22
3.4.2 Random Decision Forest implementation and parameters .................... 25
3.5 Validation ........................................................................................................ 27
4. Results ......................................................................................................................... 29
4.1 Statistical Results ............................................................................................. 29
4.1.1 Representation of Soil Parameter Class Distributions ........................... 30
ii
4.1.2 Spectral Distribution of Parameters by Classes ..................................... 31
4.1.3 Correlation Coefficients ........................................................................ 37
4.1.4 Principal Component Analysis and Linear Discriminant Analysis ....... 39
4.2 Validation and Quality Control ....................................................................... 41
4.3 RDF Classification Results – Map Products ................................................... 42
5. Discussion ................................................................................................................... 43
6. Conclusion................................................................................................................... 46
Bibliography .................................................................................................................... 48
Appendix ......................................................................................................................... 54
Eidesstaatliche Erklärung ................................................................................................ 56
iii
i. Acknowledgement
Firstly, I would like to thank Prof. Dr. Georg Bareth. He allowed me to explore the
many possibilities within the German Aerospace Center (DLR), enabling me to find
my own scientific interest.
Most importantly, I express my gratitude to the German Aerospace Center in
Oberpfaffenhofen, Germany. I am thankful for the trust extended to me and my
capabilities, especially by Dr. Uta Heiden and Prof. Dr. Claudia Künzer for giving
me the opportunity to spend the time of this thesis at the DLR.
Being surrounded by a group of extremely knowledgeable and motivated experts who
never ignored any questions that arose deeply furthered my understanding of the
scientific framework and the methods used in this thesis. I express my gratitude to
Dr. Marianne Jilge, Dr. Stefanie Holzwarth, Dr. Nicole Pinnel, Chaonan Xi, Martin
Habermayer, Martin Bachmann and especially Simone Zepp, who was unfortunate
enough to share an office with me, having to bear the brunt of my (sometimes trivial)
questions.
Also, the Bavarian State Offices for Agriculture and Environment made this thesis
possible by kindly providing non-public soil samples from their databases.
The Institute for Geography at the University of Cologne, Germany as well as the
University of Turku, Finland enabled me to develop the knowledge and skills
necessary to conduct this project. I am very thankful for the opportunities provided
by these institutions, which allowed me to explore the field of Geography and find
my personal interests and strengths within.
Lastly, my deepest gratitude and appreciation goes to my parents, who gave both me
and my sister the proverbial “roots to grow and wings to fly”, as well as supported
me in every step of my life.
iv
ii. List of Figures
Figure 1: SCMaP Threshold Selection............................................................................. 5
Figure 2: Workflow .......................................................................................................... 7
Figure 3: Study area in relation to Germany and the state of Bavaria ............................. 7
Figure 4: Excerpt of area of RSC ................................................................................... 10
Figure 5: Distribution of Soil Sampling Locations, color-coded by origin. .................. 13
Figure 6: Preprocessing of Soil Sample Locations. ....................................................... 15
Figure 7: Histogram of pixels in 3x3 matrix around original locations ......................... 16
Figure 8: First 3 layers of singular decision tree ............................................................ 24
Figure 9: Examplary location of „quality control” matrix ............................................. 28
Figure 10: Distribution of C
org
classes in soil samples .................................................. 30
Figure 11: Distribution of Soil Type classes in soil samples ......................................... 30
Figure 12: Distribution of Soil Texture classes in soil samples ..................................... 31
Figure 13: Reflectance Values for C
org
Classes ............................................................. 32
Figure 14: Mean Reflectances by C
org
Class .................................................................. 32
Figure 15: Scatterplot for C
org
numerical values per Band ............................................ 33
Figure 16: Reflectance Values for Soil Type Classes .................................................... 35
Figure 17: Mean Reflectancescby Soil Type Classes .................................................... 36
Figure 18: 3D PCA and LDA graphs for each soil parameter ....................................... 40
Figure 19: Side-by-side comparison of RSC and predicted C
org
content map ............... 42
Figure 20(A): Reflectance Values for Soil Texture Classes .......................................... 54
Figure 21(A): Mean Reflectance for Soil Texture Classes ............................................ 54
Figure 22(A): Predicted Soil Type Map, excerpt ........................................................... 55
Figure 23(A): Predicted Soil Texture Map, excerpt ...................................................... 55
v
iii. List of Tables
Table 1: Landsat Bands used in SCMaP ........................................................................ 10
Table 2: Soil Type Classes ............................................................................................. 14
Table 3: Soil Texture Classes ......................................................................................... 14
Table 4: Soil Organic Carbon Content Classes .............................................................. 15
Table 5: Difference Matrix of all C
org
classes for Band 4 (NIR) ................................... 33
Table 6: Difference Matrix of all Soil Type classes for Band 4 (NIR) .......................... 36
Table 7: C
org
and Reflectance Pearson correlation coefficients per Band ...................... 38
Table 8: C
org
and Reflectance Spearman correlation coefficients per Band .................. 38
Table 9: Internal RDF Prediction Verification ............................................................... 41
Table 10: Matrix “quality control” accuracy.................................................................. 41
Table 11(A): Difference Matrix of all Soil Texture Classes for Band 5 (SWIR1) ........ 54
iv. List of Equations
Equation 1: Modified NDVI for SCMaP ......................................................................... 4
Equation 2: Pearson correlation coefficient................................................................... 19
Equation 3: Variation ratio for axis location (LDA) ..................................................... 22
vi
v. List of Abbreviations
ATCOR Atmospheric and Topographic Correction
BGR – Bundesanstalt für Geowissenschaften und Rohstoffe, Federal Office for
Geosciences and Raw Materials
BKKABodenkundliche Kartieranleitung, German Soil Systematic
CLC CORINE Land Cover
CORINE – Coordination of Information on the Environment
C
org
– Organic Carbon
DLR – Deutsches Zentrum für Luft- und Raumfahrt, German Aerospace Center
ETM+ - Enhanced Thematic Mapper Plus (Landsat Sensor)
LDA - Linear Discriminant Analysis
LfL – Bayrisches Landesamt für Landwirtschaft, Bavarian State Office for Agriculture
LFUBayrisches Landesamt für Umwelt, Bavarian State Office for the Environment
NDVINormalized Difference Vegetation Index
PC1/PC2/PC3 – Principal Component 1/2/3
PCAPrincipal Component Analysis
PVPhotosynthetically active vegetation index
SCMaP – Soil Composite Mapping Processor
SRCSoil Reflectance Composite
SSL Soil Sampling Location
RDF – Random Decision Forest
TMThematic Mapper (Landsat Sensor)
1
1. Introduction
1.1 Significance of Soil Earth Observation
The soils on earth are an existential and fundamental resource for humanity
(D
OMINATI et al. 2010: 1858–1868), as well as for the planets ecosystem (YOUNG et
al. 2004: 113–132). The 2005 Millennium Ecosystem Assessment identified four
major services granted by soils (Millennium Ecosystem Assessment: 2005):
- providing direct and indirect food, as well as water, wood, fiber and fuel,
- regulating services regarding water, climate, floods and erosion,
- cultural services in the domains of recreation, spirituality and aesthetics.
- supporting services as to the nutrition cycle, providing a habitat and
supporting biodiversity.
Thus, soils play an important regulating and limiting role for the atmosphere,
lithosphere, hydrosphere and biosphere (S
ZABOLCS: 1994: 33–39). Because of its
many variations, soils are “one of the most complex biomaterials on the planet”
(ADHIKARI et al. 2016: 101–111), also playing a role in climate change (OMUTO et al.
2013: 81). Since soils are important for many economic and ecological factors,
information about them needs to be embedded into political decision making and
supported by accurate and cost-effective scientific data (D
AILY et al. 1997: 113–132).
In the past, information about the soils and their dynamics have been mostly gained
through in-situ sampling, which is rather expensive and time-consuming (HANKS et
al. 1962: 530
; RUBIN et al. 1963: 247–521). Since the opening of the Landsat Archives
(W
OODCOCK et al. 2008: 1011), large scale images over long time periods are
available and used for land monitoring (H
ANSEN et al. 2012: 66–74). A great effort
has been made using multispectral satellite and aircraft imagery to gain soil
information (MULDER et al. 2011: 1–19), in order to expand existing soil databases
(B
EN-DOR et al. 2008: 321–392). In the age of precision farming, where soil
information is essential, satellite-based information can fill the gap between small
scale measurements (CANDIAGO et al. 2015: 4026–4047) and very coarse information
as the Harmonized World Soil Database (N
ACHTERGAELE et al. 2009: 33-37). Not
only providing this information on a large spatial scale, satellite-based information
2
can also provide continuous monitoring on a much higher temporal scale by providing
images and their analysis in short intervals.
1.2 Visibility of Soil Parameters in Earth Observation
The research conducted thus far mostly focuses on aircraft multi- and hyperspectral
imagery, since these data acquisition methods provide a higher spectral resolution and
fewer disturbances (OCHSNER et al. 2013: 1888–1919; STEVENS et al. 2008: 395
404). In the past, extensive studies have been conducted regarding the detection and
quantification of the organic carbon content of the soil, as well as for soil texture and
soil type detection (B
ARNES et al. 2000: 731–741; JARMER et al. 2003: 115–123). The
visibility of soil parameters in multispectral and hyperspectral imagery was
investigated by Bayer et al in 2016 for the C
org
content (BAYER et al. 2016: 3997
4010), Lakshmi et al in 2015 for soil texture (L
AKSHMI et al. 2015: 1452–1460) and
Nanni et al in 2012 for a soil type discrimination (N
ANNI et al. 2012: 103–112) and
found to be generally possible.
This thesis also attempts to identify distinguishing reflectance characteristics but
based on the Soil Composite Mapping Processor (SCMaP). Since these distinctions
have been successfully identified previously, this thesis tries to recreate these results
based on the SCMaP output.
1.3 Soil Composite Mapping Processor (SCMaP)
1.3.1 SCMaP
The basis for this thesis is the product derived from the Soil Composite Mapping
Processor (SCMaP) as developed by Derek Rogge et al. (2018) at the German
Aerospace Center (DLR). This processor utilizes Landsat imagery from 1984 until
2014 to create an exposed soil map on a regional scale, using the temporal dimension
of the data to overcome soil exposure inconsistencies and atmospheric influences by
looking at vegetation cycles (R
OGGE et al. 2018: 1–17).
Landsat-data from missions 4 (Sensor: TM), 5 (Sensor: TM) and 7 (Sensor: ETM+)
was used due to their free availability and most importantly their long-term
continuous data acquisition (W
OODCOCK et al. 2008: 1011). This processor is then
3
applied to create the aforementioned soil exposure map, which is used as the data
basis for this thesis.
SCMaP was initially developed for Germany based on 36 path/row combinations with
a scene size of 170 km north-south and 183 km east-west (paths 192-197, rows 22-
27). The TM sensor onboard of Landsat 4 and 5 collects spectral information on six
bands ranging from 0.45 to 2.35 μm with a spectral resolution of 30 m. On board of
the Landsat 7 satellite, the ETM+ sensor was placed, which has a similar setup as the
TM with the addition of a panchromatic band at 15m spatial resolution. After the scan
line corrector (SLC) failure of Landsat 7 in May 2003, 22% of any given scene is
missing. Different areas are affected by the failure for every flyover, thus image
stacking can be used to provide continuous coverage.
The data type of the downloaded images is the compressed GeoTIFF format in an
Universal Transverse Mercator Projection with the WGS 84 datum (ROGGE et al.
2018: 1–17). The dataset is made of scenes classed as “Tier 1 precision and terrain
corrected” (L1T) and hence is deemed appropriate and accurate enough for time-
series analysis by the USGS.
In order to achieve a high quality of detection of exposed soils, artifacts and visual
obstructions need to be removed. According to Zhang et al. (ZHANG et al. 2005: 357
371), the global annual mean cloud coverage in Landsat satellite image datasets is
approximately 66%. Identifying clouds, as well as their shadows, proved to be crucial,
since their darkening and brightening effects complicate many remote sensing
applications. The “Fmask” algorithm of Zhu and Woodcock (2012) is used, which
has a detection rate of 96.4% (Z
HU et al. 2005: 357–371).
Secondly, the ATCOR software introduced in 2008 by R. Richter et al. is used to
minimize the effects of atmospheric disturbances such as water vapor and aerosols.
Starting from the “digital number” value, the individual pixel metadata consisting of
coordinates, illumination angle and the date of acquisition are taken into account to
derive the bottom of atmosphere reflectances via the top of atmosphere radiances
(RICHTER et al. 2006: 2077–2085).
Due to snow- and cloud cover during the winter, images taken within the first and last
30 days of the year are eliminated. After the application of the processor for Germany
as done by the DLR, 9,331 Landsat scenes remain.
4
After the preprocessing, the scenes are transformed into 1° by 1° tiles and broken up
into six time frames of five years. The only exception is the first timeframe, which
contains the data of 6 years (1. 1984-89, 2. 1990-94, 3. 1995-99, 4. 2000-04, 5. 2005-
09, 6. 2010-14). Additionally, all data from 1984 until 2014 are combined into a
seventh timeframe by the DLR, which is the final SCMaP product showing the
exposed soils used in this thesis (R
OGGE et al. 2018: 1–17). This product will be
refered to as SRC.
Fundamentally, SCMaP tries to identify exposed soil pixels (without vegetation) in
the images. Other than in very rare conditions in the Bavarian Alps above the tree line
or along the shore of oceans and rivers, naturally exposed soil occurs very rarely. A
key characteristic of exposed soil is a very low vegetation index, such as the
Normalized difference vegetation index (NDVI), which has been used since the early
days of Earth Observation (ROUSE et al. 1974: 309–317). This vegetation index uses
the high reflectance rate of healthy green plants in the near infrared range from 0.7 to
1.1 µm to create a photosynthetically active vegetation index (PV) (KRIEGLER et al.
1969).
Rogge et al. decided to use a modified version of the NDVI for the highest/lowest PV
selection. The blue channel was introduced to the Index, in order to reduce the higher
reflectance effect of thin haze in pixels, which were not correctly revised during the
preprocessing stage.
Equation 1: PV determination, modified NDVI for SCMaP
A low value of the aforementioned vegetation index value by itself is not sufficient
to detect exposed soils, since water and other anthropogenically influenced regions
as well as non-photosynthetically active vegetation have a low PV value as well.
Since exposed soils most commonly occur on agriculturally used land (in Germany),
their distinguishing characteristic is a high seasonal variability of the PV. Due to
growing and harvesting/plowing seasons, agriculturally active areas experience at
least one, if not several distinct changes in PV every given year (R
OGGE et al. 2018:
1–17). Because of the high temporal resolution of the Landsat data, these changes can
5
be detected and quantified in order to create a composite image. This composite image
encompasses both the highest and lowest PV values for each pixel.
The next step is overlaying the composite with the Coordination of Information on
the Environment (CORINE) Land Cover data set (CLC). After eliminating those
pixels, which underwent a change in the CLC classification, 5000 pixels were
randomly selected and then plotted to show their maximum and minimum PV value,
as well as their CLC classification group (Figure 1) (ROGGE et al. 2018: 1–17).
Pixels classed as agricultural
land can be clearly
distinguished from other
pixels by their significant
difference between maximum
and minimum PV, a clear
cluster of the “fields”
classification is visible.
Different classifications such
as coniferous and deciduous
trees have a similar maximum
PV value, but the lower
minimum PV value increases the variability, enabling the distinction between the
classes.
Finally, thresholds are manually chosen to “box in” this cluster of soil pixels after
careful examination of the five test tiles. An automatization of the threshold selection
process is currently under development (ROGGE et al. 2018: 1–17).
1.3.2 SCMaP Product Soil Reflectance Composite
The main result of SCMaP is the SRC, which shows the mean reflectance value of
those pixels, which were identified to have undergone significant changes in PV and
are thus classed as exposed soil. Those composites are created for each timeframe and
show the average reflectance value for each band of the stack of images at any given
soil pixel. The error rate for identifying exposed soils of the SRC over the whole 30
year period is stated as 2.24% (Rogge et al. 2018: 1–17).
Figure 1: SCMaP Threshold Selection
(adapted from
ROGGE et al. 2018: 1–17)
6
1.4 Research Objectives
The aim of this thesis is to extract a maximum of soil related information from the
Soil Composite Mapping Processor product, using the technique of machine learning.
More specifically, the SRC of the years 1984-2014 is used.
SCMaP, as previously described, is a method for identifying bare soil pixels from
averaged reflectance data from these points; the reflectance composite thus contains
spectral information about the soils. By examining the SRC pixel colours (cf. “2.2
Reflectance Soil Composite”), areas of conformity can be clearly distinguished from
other areas, even though all pixels represent bare soil. Based on this observation, the
idea to try to extract information about soil parameters was conceived and this
hypothesis formed. Variances in reflectance could possibly be traced down to local
differences in soil parameters.
Based on in-situ soil samples and their reflectance value of the according SRC pixel,
predictions will be made for all other SRC pixels regarding their soil parameters.
Since the pixels have to be assigned to predefined classes, the machine learning
method of random decision forests was chosen in order to make these predictions.
The objectives can be more precisely articulated with the following questions:
-Do the reflectance values of SCMaP pixels correlate with the soil parameters of
carbon organic content, soil type and soil texture?
-Can a random decision forest perform an accurate classification of the entirety of the
SCMaP product?
Figure 2 shows the workflow of this thesis. After acquiring and preprocessing the
input data, a database is built, containing the spectral information of the sampling
locations as well as their soil parameters. This database is then analysed for statistical
correlations between the reflectance and the soil parameters, checking to what degree
the classes are spectrally distinguishable. Using the input database, a random decision
forest machine learning classification method is built, which predicts the soil
parameters for all pixels of the SCMaP composite. The predictions are consequently
transformed into three soil parameters maps. The previously obtained statistical
results are used to check the accuracy of the predictions.
7
Figure 2: Workflow
Green: inputs, yellow: processing steps, red: results
2. Data and Study Area
2.1 Study Area
Based on the 1° by 1° tiles of the SCMaP output, twelve of these tiles were chosen as
the study area (Figure 3).
Figure 3: Study area in relation to Germany and the state of Bavaria
These tiles were selected because they mostly cover Bavaria and exclude the
southernmost region of the state where the Alps are located, preventing alpine
8
samples with different soil genesis and sediment configurations from distorting the
dataset. It is focused on Bavaria because of the availability of geocoded soil samples
(cf. “2.3 Soil Sample Point Information”). Some areas of the tiles also cover territories
of the German federal states of Baden-rttemberg in the west, Hesse, Thuringia,
and Saxony in the north as well as parts of the Czech Republic and Austria in the east.
As a result of the central European location of the study area, the climate is classed
as “temperate oceanic” according to Köppen (Cfb). Mean annual temperatures range
from 9°C to 4°C with a precipitation of 550 mm up to 2500 mm (CHEN D. et al.
2015: 69–79).
Geomorphologically the area consists of a sedimentary filled basin formed by glacial
advances during the last ice age at the foot of the Alps in the south surrounding
Munich. In the north of Munich are the mountain ranges of the Franconian Jura in the
northwest and the Bavarian and Bohemian Forests in the northeast. The northernmost
area encompasses the South German Scarp lands in the northwest as well as parts of
the Thuringian-Franconian Highlands. These are relatively low mountain ranges with
peaks just shy of 1000m, formed by the variscan orogeny.
Mostly, soils in the study area are classed as Cambisols and Luvisols (this includes
their German classification of brown earths), covering about 45% of the area. With
15% coverage Regosols are the next most frequent, followed by water stagnation soils
like Stagnosols at 12%, according to the BKKA (AD-HOC AG Boden: 2005: 173-
175) , and their equivalent soil classes of the World Reference Base for Soil
Resources system (IUSS Working Group WRB 2006).
The CLC classification shows that 31% of the study area is cropland, 37% is covered
by forests and 18% is grassland.
2.2 Soil Reflectance Composite
As a spectral database, this thesis uses the Soil Reflectance Composite, as processed
by the DLR after the SCMaP workflow, combined for the years 1984 until 2014. The
long timespan is chosen for this thesis because of the higher number of individual
pixels classed as fields and cropland over the larger timespan. This results in a larger
repository for comparing reflectance values to the SSLs. In the 5-year time frames on
the other hand, the total amount of pixels is smaller and thus the likelihood of a soil
9
sample intersecting a soil pixel is decreased. Additionally, most of the SSLs have
been taken at dates distributed from the 1980s to the 2000s, choosing a 5-year period
would limit the available SSLs.
Most importantly, averaging out the reflectance values over a greater amount of time
helps reduce variability due to medium-term weather influences such as droughts,
exceptionally long summers or extended wet periods.
The spectral basis for this thesis consequently consists of about 63 million 30 m by
30 m pixels from Landsat missions 4, 5, and 7 which have been identified as exposed
soil, containing spectral bands (Table 1) averaged out over the timeframe 1984-2014.
Band 6 is a thermal band with a range from 10.40 - 12.50μ and was therefore removed.
An excerpt of the final SRC shows the pixels identified as exposed soils as an RGB
image, all other pixels are removed (white background) (Figure 4).
10
Table 1: Landsat Bands used in SCMaP
Band Number
Band Name
Range
Band 1
Blue
0.45 - 0.52 μm
Band 2
Green
0.52 - 0.60 μm
Band 3
Red
0.63 - 0.69 μm
Band 4
Near-Infrared
0.77 - 0.90 μm
Band 5
Short-Wave Infrared 1
1.55 - 1.75 μm
Band 7
Short-Wave Infrared 2
2.09 - 2.35 μm
Figure 4: Excerpt of area of SRC for Bands 7 (represented as R), 5 (represented as G)
and 3 (represented as B). False-color image. Areas of conformity in color are visible.
11
2.3 Soil Sample Point Information
2.3.1 Point Samples
In order to create an input database with sample locations and their known soil
parameters, several data sources are compiled together. Two state agencies provided
their non-public data points to this thesis. Additionally, a database from the
European Statistical Office is used. The different sources, their methods of
acquiring the data and its representation is explained below.
LFU
The Bavarian State Office for the Environment (Bayerisches Landesamt für Umwelt
- LFU) maintains a network of sampling locations in Bavaria. Sites are selected based
on an 8 km by 8 km grid which is drawn over the state, exact sampling locations are
then picked within a 500-meter radius around the grid nodes to find a suitable
position. Special attention is given to choosing sites as homogeneously as possible
regarding altitude, relief, soil type, parent rock and vegetation (W
IESMEIER et al.
2014: 208–220). Only sample locations which are situated within an agricultural field
were made available by the LFU, 2086 in total.
The soil type attribute of these points is given in abbreviations according to the
German soil systematic BKKA (AD-HOC AG Boden: 2005: 173-175), for example
“RR-BB” for rendzina/brown earth, as well as their highest-tier classification of the
soil group, in the aforementioned example “B” for brown earth.
Top soil texture is given in four classes according to the BKKA (AD-HOC AG
Boden: 2005: 132-133),: sand, with a diameter of 0.063 mm to 2 mm; silt, with a
diameter of 0.002 mm to 0.063 mm; clay, with a diameter smaller than 0.002 mm and
loam, which is an even mixture of the three diameter classes.
Appended are finer distinguishing prefixes, such as “Sl” to indicate the presence of
other diameter classes, with the dominant class written in a capital letter.
The unit in which the carbon organic content is listed is mg/g.
LfL
The Bavarian State Office for Agriculture (Bayerische Landesanstalt für
Landwirtschaft - LfL) is part of a network of permanent soil sampling sites,
aggregated and maintained by the German Federal Environmental Agency. All points
12
located within the state of Bavaria and on land classified as agriculturally used were
provided to this study by the LfL (in total 118 sampling locations). More than 400
parameters are recorded for each location, including all of the soil parameters relevant
in this study, as well as for example land use, in-depth soil analysis, and radioactive
measurements (SCHILLI et al. 2011: 16) .The network exists since the 1980s, each
location is revisited once every 1 to 10 years, depending on the individual point
(KAUFMANN-BOLL et al. 2011: 4).
Since several measurements at different points in time are present for any given point,
they need to be modified. Samples taken outside of the timeframe were removed, the
remaining C
org
values were averaged and then transformed into their according
classes (cf. “2.3.2 Preprocessing”). None of the points showed a change in soil type
or soil texture, thus the classification could be easily transferred. After the
transformation, one single value was left for each soil parameter used in this thesis.
LUCAS
The European Statistical Office (Eurostat) conducts the “Land Use/Cover Area frame
Survey” (LUCAS) since 2001 and is available for all EU member states since 2012.
In this survey, sample points are placed over the European Union in a grid with a
width of 2 km in between nodes. Each of the 1.1 million points is sampled and classed
individually. Many different measurements are taken, including the humus content of
the top soil, but not the soil class or the soil texture (T
ÓTH et al. 2013: 3-6). The humus
content can be easily converted into the organic content of the topsoil, so that the
LUCAS information can be used to expand the C
org
training dataset, but not the soil
type and soil texture datasets. Unlike the other two data sources, the LUCAS dataset
is not limited to Bavaria, meaning that there are also points located within areas which
fall outside of the state into other German federal states as well as into the Czech
Republic and Austria. A total of 505 LUCAS points are available for the twelve test
tiles.
13
2.3.2 Preprocessing
The data from the different sources need to be unified into the same format. Merging
the data from all sources is only possible if the classifications for the soil parameters
follow the same structure. As mentioned in the data descriptions the classification
methods differ from each other.
Information regarding soil type is converted to the highest tier classification as given
in the German soil systematic BKKA (AD-HOC AG Boden: 2005: 173-175). The
classes are presented in Table 2.
Figure 5: Distribution of Soil Sampling Locations, color-coded by origin.
14
Table 2: Soil Type Classes
Symbol
German Name
(BKKA)
Translated Name
A
Auenboden
Floodplain Soil
B
Braunerde
Brown Earth
D
Pelosol
Pelosol
G
Gley
Gley
L
Lessivés
Lessivés
M
Marsch
Marshland Soil
P
Pelosol
Pelosol
R
Rendzina
Rendzina
S
Stauwasserböden
Water Stagnation Soils
T
Tschernosem
Chernozem
Y
Terr.-Anthr. Boden
Anthrosol
Soil texture classifications are also converted to their highest-tier classes by keeping
the majuscule letter of the abbreviation given in the BKKA (AD-HOC AG Boden:
2005: 173-175), see Table 3.
Table 3: Soil Texture Classes
Symbol
Name
Grain Size
S
Sand
0.063 mm to 2 mm
U
Silt
0.002 mm to 0.063 mm
T
Clay
< 0.002 mm
L
Loam
Even mixture of the three
C
org
values were given in different units by the different data sources. The LFU used
mg/g, while LfL used percentage. The conversion to percent is easily done, but the
LUCAS database gives C
org
values only indirectly by specifying the humus content
of the top soil. According to the BKKA, humus is made up of 58% organic carbon
(AD-HOC AG Boden: 2005: 107). Therefore, multiplying the humus content by the
factor 0.58 returns the C
org
content in its original unit, which was also percent. All of
the C
org
-information is thus converted to the same unit and format (percentage of soil
mass).
As seen in Table 4, the values are classed according to the carbon content
classification of the BKKA (AD-HOC AG Boden: 2005: 109).
15
Table 4: Soil Organic Carbon Content Classes
The tables from the data providers can be merged because of the data conformity after
the aforementioned restructuring and conversion. LUCAS points are given the soil
type and soil texture value of nA”, simplifying their detection and exclusion from
soil type and texture classifications further down the line.
In order to create a joined spatial database with correct location data, the point
information is reprojected into the same map projection. Since both Bavarian
agencies use the Gauß-Krüger coordinate system, they are warped into WGS1984
EPSG:4326 to fit the LUCAS data as well as the WGS1984 formatted .tiff SCMaP
mosaics which are used in this thesis. Also, unique IDs are given to each point
location.
Figure 6: Preprocessing of Soil Sample Locations.
Sankey Diagram, width of bars is proportional to amount of SSLs.
Grey: data sources, red: eliminated SSLs, green: final dataset, yellow: composition of
final dataset.
16
Following the conversions and attribute-related changes, the 2,709 data points need
to be adapted to the spectral data and the study area.
As shown in the Sankey Diagram (Figure 6), 95 points are eliminated because they
are not within the twelve tiles. 792 sample points are excluded because they were not
taken within the 1984 - 2014 timeframe. Even though it is not likely that soil type,
C
org
content or soil texture will change during a few decades, it is still preferred that
the sample point was actually once depicted in the spectral data.
Afterwards, an intersect is performed to retain only those locations which are situated
on a pixel of the SRC. Another 374 points are discarded during this step.
Within the SRC pixels, severe spectral outliers are expected. At a pixel size of 30 m,
the pixel might consist of 50% cropland and 50% road or pavement. To eliminate
distortion by these outliers, a matrix of 3 by 3 pixels around the SSLs is created. SSL
pixels which display a deviation of more than twice the standard deviation compared
to the surrounding pixels in the matrix are removed (Figure 7). By doing so, another
155 pixels are excluded.
Figure 7: Histogram of pixels in 3x3 matrix around original locations, also showing
standard deviation multiples.
17
After all ground-sampled data is merged, they need to be combined with the spectral
data. Reflectance data from the SRC is now gathered by sampling the pixel
reflectance value of every location from the list. Values for all six bands are added to
each respective point as attributes in the list.
The final input database is hence an indexed list containing x and y coordinates, their
classifications regarding soil type, C
org
content and soil texture aggregated and
uniformized; as well as the six local reflectance values extracted from the SRC for
1293 entries. A column showing the C
org
numerical value is also included.
975 of these points have attributes regarding soil type and/or texture, due to the
LUCAS database almost all 1293 locations have an attribute regarding their C
org
content. It is important to note that due to in-field sampling errors, some SSLs do not
have a valid data entry. The attribute itself is not removed because it has entries for
the other parameters. The statistical methods and Random Decision Forest methods
used automatically ignore these fields (“no entry” percentage: C
org
0,15%, soil type
2,36%, soil texture 1,33%)
3. Methodology
3.1 Analysis of distributions within dataset
Initially, the input dataset is analysed with a focus on the soil parameters, gaining
knowledge on the distribution of the classes is essential. Regarding the three different
soil parameters, it is important to know how the classes are distributed within the
dataset. An overwhelming majority of one or several classes could drain out other
classes or limit the validity of their spectral boundaries if there are only a few points
available in the dataset for a given class. Therefore, the class distributions are
visualised as bar graphs.
Then, the aggregated training dataset with their spectral information is statistically
analysed in order to visually determine correlations between different classes and
their reflectance values. All points are plotted as individual graphs. Plotting all points
together results in a visualisation of the variance within the dataset. Combining the
18
points and adding graphs showing the mean reflectance as well as the standard
deviation can show the variance inside the classes, as well as expose possible outliers.
Since the C
org
soil parameter also carries a numerical value, a scatterplot can further
visualize the distribution of those values. Consequently, a scatterplot is created for
each band, plotting the C
org
value against the reflectance value of the bands. Colouring
the points according to their class shows the distribution of reflectance values of the
classes in direct comparison to the other classes of this individual band, enabling a
visual judgement as to how far the classes are varying.
Then, for each soil parameter, the mean spectra of all classes are plotted together,
allowing for a visual comparison of the graphs, as well as enabling the visual
identification of the distinguishability of the classes.
In order to back up the previous visual observations with numbers, a table showing
the differences between the classes is created. For each soil parameter, the classes are
plotted against each other showing the differences in reflectance between them for
their manually chosen most distinctive single band. The differences between the
classes are shown as percentages, enabling the cross-referencing of all classes for this
individual band. This allows for the identification of classes with a higher percentage
in spectral difference in singular band and thus have a higher chance of correct
prediction.
3.2 Correlation Coefficients
Two statistical correlation methods are implemented to find possible correlations in
the dataset where numerical values are available (C
org
), the Pearson (Pearson R) and
Spearman correlation coefficients (Spearman R). Since the other soil parameters are
classed and without numerical values, correlation cannot be examined in those cases.
For each of the six bands, a correlation between C
org
value and reflectance is
examined using the “SciPy” open source scientific computing library for Python 3.7
(M
ILLMAN et al. 2011: 9–12).
The scipy.stats.pearsonr utility calculates the correlation according to the method
derived by Karl Pearson (PEARSON 1896: 253–318). As a measure of linear strength
19
of association between two observations (bivariable), the coefficient ranges from -1
and 1, which represent the highest correlation values for negative and positive
correlation, to 0, which indicates no correlation (C
HEN et al. 1981: 135; CROUX et al.
2010: 497–515).
Equation 2: Pearson correlation coefficient
As this simplified formula (Equation 2) shows, the Pearson correlation coefficient
(R) is the covariance of each set of the bivariable (x and y), divided by their multiplied
standard deviations (σx, σy) (P
ARK: 2018: 213–265). Pearson R is applied to the
training data in order to evaluate a possible direct linear correlation.
Spearman R works similar to Pearson’s method, but the variable sets are ranked
beforehand. Each variable is given it’s rank along the axis as its new value, starting
at the lowest value with rank one. This replacement of absolute values with their rank
from lowest to highest removes the variability and is thus useful in detecting
monotinic trends within the dataset.
With the values changed to their corresponding ranks, the Pearson R method is
executed (W
EAVER et al.: 2018: 435–448). The resulting Spearman correlation
coefficient takes the same form as the Pearson coefficient, meaning +1 and -1 stand
for highest positive and negative correlation, while 0 indicates no correlation (CROUX
et al. 2010: 497–515). Due to its ranking system, the Spearman R is able to more
accurately detect monotonic relationships, thus relationships where while variable x
declines in value, variable y never increases and vice versa (B
ORKOWF 2002: 271
286). Removing the absolute values makes this coefficient less vulnerable regarding
outliers and as long as the x and y values develop in the direction of the same algebraic
sign (regardless of their absolute value), a correlation is detected. (R
OWE 2015: 311–
335). Spearman R is chosen to detect a possible monotonic trend relationship between
reflectance and C
org
content, in case there is a correlation between rising reflectance
values and rising C
org
content or vice versa.
20
3.3 Component Analysis
3.3.1 Principal Component Analysis (PCA)
While dealing with multispectral data such as the training dataset, it is challenging to
visualize the variance inherent in the spectral bands. Each band stands for one
dimension in which the data might vary. In order to reduce the dimensionality, a
Principal Component Analysris is performed (P
EARSON 1901: 559-572). PCA has
established itself as a statistical tool for multivariate analysis and is widely used in
research dealing with large datasets, including spectroscopy (JAMES: 2013: 127-173).
The aim to break down the attributes and identify the most significant data dimensions
is achieved by linearly transforming the data to uncover the principal scatter direction
of each dimension.
The best transformation is found by visualising the attributes as a scatterplot. Then,
the centre of all data points from all classes is found and the origin of the graph is
moved to this location (DE SILVA 2017: 15–17). The next step is drawing a line, which
will later become the new axis, intersecting the origin and most accurately describing
the axis of variance of the points, followed by the reprojection of the points onto that
line. The squared sum of all distances from each original point towards the new axis
is called “eigenvalue”, the vector describing the tilt of the original axis to the newly