Using Machine Learning to derive Soil
Information from Soil Reflectance
Composites (SCMaP)
Wissenschaftliche Arbeit zur Erlangung des akademischen Grades
Bachelor of Science
im Studiengang
Geographie
am Institut für Geographie der Universität zu Köln
Autor:
Simon Donike
Matrikelnummer 5976642
22. November 2019
Erstgutachter: Zweitgutachterin:
Prof. Dr. Georg Bareth Dr. Uta Heiden
Universität zu Köln Deutsches Zentrum für Luft- und Raumfahrt e. V.
i
Table of Contents
i. Acknowledgement ........................................................................................................ iii
ii. List of Figures .............................................................................................................. iv
iii. List of Tables ............................................................................................................... v
iv. List of Equations .......................................................................................................... v
v. List of Abbreviations.................................................................................................... vi
1. Introduction ................................................................................................................... 1
1.1 Significance of Soil Earth Observation ............................................................. 1
1.2 Visibility of Soil Parameters in Earth Observation ........................................... 2
1.3 Soil Composite Mapping Processor (SCMaP) .................................................. 2
1.3.1 SCMaP ..................................................................................................... 2
1.3.2 SCMaP Product –Soil Reflectance Composite ........................................ 5
1.4 Research Objectives........................................................................................... 6
2. Data and Study Area ..................................................................................................... 7
2.1 Study Area ......................................................................................................... 7
2.2 Soil Reflectance Composite ............................................................................... 8
2.3 Soil Sample Point Information ........................................................................ 11
2.3.1 Point Samples ........................................................................................ 11
2.3.2 Preprocessing ......................................................................................... 13
3. Methodology ............................................................................................................... 17
3.1 Analysis of distributions within dataset ........................................................... 17
3.2 Correlation Coefficients................................................................................... 18
3.3 Component Analysis ........................................................................................ 20
3.3.1 Principal Component Analysis (PCA) ................................................... 20
3.3.2 Linear Discriminant Analysis (LDA) .................................................... 21
3.4 Random Decision Forest Classifier ................................................................. 22
3.4.1 Random Decision Forest Methodology ................................................. 22
3.4.2 Random Decision Forest implementation and parameters .................... 25
3.5 Validation ........................................................................................................ 27
4. Results ......................................................................................................................... 29
4.1 Statistical Results ............................................................................................. 29
4.1.1 Representation of Soil Parameter Class Distributions ........................... 30
ii
4.1.2 Spectral Distribution of Parameters by Classes ..................................... 31
4.1.3 Correlation Coefficients ........................................................................ 37
4.1.4 Principal Component Analysis and Linear Discriminant Analysis ....... 39
4.2 Validation and Quality Control ....................................................................... 41
4.3 RDF Classification Results – Map Products ................................................... 42
5. Discussion ................................................................................................................... 43
6. Conclusion................................................................................................................... 46
Bibliography .................................................................................................................... 48
Appendix ......................................................................................................................... 54
Eidesstaatliche Erklärung ................................................................................................ 56
iii
i. Acknowledgement
Firstly, I would like to thank Prof. Dr. Georg Bareth. He allowed me to explore the
many possibilities within the German Aerospace Center (DLR), enabling me to find
my own scientific interest.
Most importantly, I express my gratitude to the German Aerospace Center in
Oberpfaffenhofen, Germany. I am thankful for the trust extended to me and my
capabilities, especially by Dr. Uta Heiden and Prof. Dr. Claudia Künzer for giving
me the opportunity to spend the time of this thesis at the DLR.
Being surrounded by a group of extremely knowledgeable and motivated experts who
never ignored any questions that arose deeply furthered my understanding of the
scientific framework and the methods used in this thesis. I express my gratitude to
Dr. Marianne Jilge, Dr. Stefanie Holzwarth, Dr. Nicole Pinnel, Chaonan Xi, Martin
Habermayer, Martin Bachmann and especially Simone Zepp, who was unfortunate
enough to share an office with me, having to bear the brunt of my (sometimes trivial)
questions.
Also, the Bavarian State Offices for Agriculture and Environment made this thesis
possible by kindly providing non-public soil samples from their databases.
The Institute for Geography at the University of Cologne, Germany as well as the
University of Turku, Finland enabled me to develop the knowledge and skills
necessary to conduct this project. I am very thankful for the opportunities provided
by these institutions, which allowed me to explore the field of Geography and find
my personal interests and strengths within.
Lastly, my deepest gratitude and appreciation goes to my parents, who gave both me
and my sister the proverbial “roots to grow and wings to fly”, as well as supported
me in every step of my life.
iv
ii. List of Figures
Figure 1: SCMaP Threshold Selection............................................................................. 5
Figure 2: Workflow .......................................................................................................... 7
Figure 3: Study area in relation to Germany and the state of Bavaria ............................. 7
Figure 4: Excerpt of area of RSC ................................................................................... 10
Figure 5: Distribution of Soil Sampling Locations, color-coded by origin. .................. 13
Figure 6: Preprocessing of Soil Sample Locations. ....................................................... 15
Figure 7: Histogram of pixels in 3x3 matrix around original locations ......................... 16
Figure 8: First 3 layers of singular decision tree ............................................................ 24
Figure 9: Examplary location of „quality control” matrix ............................................. 28
Figure 10: Distribution of C
org
classes in soil samples .................................................. 30
Figure 11: Distribution of Soil Type classes in soil samples ......................................... 30
Figure 12: Distribution of Soil Texture classes in soil samples ..................................... 31
Figure 13: Reflectance Values for C
org
Classes ............................................................. 32
Figure 14: Mean Reflectances by C
org
Class .................................................................. 32
Figure 15: Scatterplot for C
org
numerical values per Band ............................................ 33
Figure 16: Reflectance Values for Soil Type Classes .................................................... 35
Figure 17: Mean Reflectancescby Soil Type Classes .................................................... 36
Figure 18: 3D PCA and LDA graphs for each soil parameter ....................................... 40
Figure 19: Side-by-side comparison of RSC and predicted C
org
content map ............... 42
Figure 20(A): Reflectance Values for Soil Texture Classes .......................................... 54
Figure 21(A): Mean Reflectance for Soil Texture Classes ............................................ 54
Figure 22(A): Predicted Soil Type Map, excerpt ........................................................... 55
Figure 23(A): Predicted Soil Texture Map, excerpt ...................................................... 55
v
iii. List of Tables
Table 1: Landsat Bands used in SCMaP ........................................................................ 10
Table 2: Soil Type Classes ............................................................................................. 14
Table 3: Soil Texture Classes ......................................................................................... 14
Table 4: Soil Organic Carbon Content Classes .............................................................. 15
Table 5: Difference Matrix of all C
org
classes for Band 4 (NIR) ................................... 33
Table 6: Difference Matrix of all Soil Type classes for Band 4 (NIR) .......................... 36
Table 7: C
org
and Reflectance Pearson correlation coefficients per Band ...................... 38
Table 8: C
org
and Reflectance Spearman correlation coefficients per Band .................. 38
Table 9: Internal RDF Prediction Verification ............................................................... 41
Table 10: Matrix “quality control” accuracy.................................................................. 41
Table 11(A): Difference Matrix of all Soil Texture Classes for Band 5 (SWIR1) ........ 54
iv. List of Equations
Equation 1: Modified NDVI for SCMaP ......................................................................... 4
Equation 2: Pearson correlation coefficient................................................................... 19
Equation 3: Variation ratio for axis location (LDA) ..................................................... 22
vi
v. List of Abbreviations
ATCOR Atmospheric and Topographic Correction
BGR – Bundesanstalt für Geowissenschaften und Rohstoffe, Federal Office for
Geosciences and Raw Materials
BKKABodenkundliche Kartieranleitung, German Soil Systematic
CLC CORINE Land Cover
CORINE – Coordination of Information on the Environment
C
org
– Organic Carbon
DLR – Deutsches Zentrum für Luft- und Raumfahrt, German Aerospace Center
ETM+ - Enhanced Thematic Mapper Plus (Landsat Sensor)
LDA - Linear Discriminant Analysis
LfL – Bayrisches Landesamt für Landwirtschaft, Bavarian State Office for Agriculture
LFUBayrisches Landesamt für Umwelt, Bavarian State Office for the Environment
NDVINormalized Difference Vegetation Index
PC1/PC2/PC3 – Principal Component 1/2/3
PCAPrincipal Component Analysis
PVPhotosynthetically active vegetation index
SCMaP – Soil Composite Mapping Processor
SRCSoil Reflectance Composite
SSL Soil Sampling Location
RDF – Random Decision Forest
TMThematic Mapper (Landsat Sensor)
1
1. Introduction
1.1 Significance of Soil Earth Observation
The soils on earth are an existential and fundamental resource for humanity
(D
OMINATI et al. 2010: 1858–1868), as well as for the planets ecosystem (YOUNG et
al. 2004: 113–132). The 2005 Millennium Ecosystem Assessment identified four
major services granted by soils (Millennium Ecosystem Assessment: 2005):
- providing direct and indirect food, as well as water, wood, fiber and fuel,
- regulating services regarding water, climate, floods and erosion,
- cultural services in the domains of recreation, spirituality and aesthetics.
- supporting services as to the nutrition cycle, providing a habitat and
supporting biodiversity.
Thus, soils play an important regulating and limiting role for the atmosphere,
lithosphere, hydrosphere and biosphere (S
ZABOLCS: 1994: 33–39). Because of its
many variations, soils are “one of the most complex biomaterials on the planet”
(ADHIKARI et al. 2016: 101–111), also playing a role in climate change (OMUTO et al.
2013: 81). Since soils are important for many economic and ecological factors,
information about them needs to be embedded into political decision making and
supported by accurate and cost-effective scientific data (D
AILY et al. 1997: 113–132).
In the past, information about the soils and their dynamics have been mostly gained
through in-situ sampling, which is rather expensive and time-consuming (HANKS et
al. 1962: 530
; RUBIN et al. 1963: 247–521). Since the opening of the Landsat Archives
(W
OODCOCK et al. 2008: 1011), large scale images over long time periods are
available and used for land monitoring (H
ANSEN et al. 2012: 66–74). A great effort
has been made using multispectral satellite and aircraft imagery to gain soil
information (MULDER et al. 2011: 1–19), in order to expand existing soil databases
(B
EN-DOR et al. 2008: 321–392). In the age of precision farming, where soil
information is essential, satellite-based information can fill the gap between small
scale measurements (CANDIAGO et al. 2015: 4026–4047) and very coarse information
as the Harmonized World Soil Database (N
ACHTERGAELE et al. 2009: 33-37). Not
only providing this information on a large spatial scale, satellite-based information
2
can also provide continuous monitoring on a much higher temporal scale by providing
images and their analysis in short intervals.
1.2 Visibility of Soil Parameters in Earth Observation
The research conducted thus far mostly focuses on aircraft multi- and hyperspectral
imagery, since these data acquisition methods provide a higher spectral resolution and
fewer disturbances (OCHSNER et al. 2013: 1888–1919; STEVENS et al. 2008: 395
404). In the past, extensive studies have been conducted regarding the detection and
quantification of the organic carbon content of the soil, as well as for soil texture and
soil type detection (B
ARNES et al. 2000: 731–741; JARMER et al. 2003: 115–123). The
visibility of soil parameters in multispectral and hyperspectral imagery was
investigated by Bayer et al in 2016 for the C
org
content (BAYER et al. 2016: 3997
4010), Lakshmi et al in 2015 for soil texture (L
AKSHMI et al. 2015: 1452–1460) and
Nanni et al in 2012 for a soil type discrimination (N
ANNI et al. 2012: 103–112) and
found to be generally possible.
This thesis also attempts to identify distinguishing reflectance characteristics but
based on the Soil Composite Mapping Processor (SCMaP). Since these distinctions
have been successfully identified previously, this thesis tries to recreate these results
based on the SCMaP output.
1.3 Soil Composite Mapping Processor (SCMaP)
1.3.1 SCMaP
The basis for this thesis is the product derived from the Soil Composite Mapping
Processor (SCMaP) as developed by Derek Rogge et al. (2018) at the German
Aerospace Center (DLR). This processor utilizes Landsat imagery from 1984 until
2014 to create an exposed soil map on a regional scale, using the temporal dimension
of the data to overcome soil exposure inconsistencies and atmospheric influences by
looking at vegetation cycles (R
OGGE et al. 2018: 1–17).
Landsat-data from missions 4 (Sensor: TM), 5 (Sensor: TM) and 7 (Sensor: ETM+)
was used due to their free availability and most importantly their long-term
continuous data acquisition (W
OODCOCK et al. 2008: 1011). This processor is then
3
applied to create the aforementioned soil exposure map, which is used as the data
basis for this thesis.
SCMaP was initially developed for Germany based on 36 path/row combinations with
a scene size of 170 km north-south and 183 km east-west (paths 192-197, rows 22-
27). The TM sensor onboard of Landsat 4 and 5 collects spectral information on six
bands ranging from 0.45 to 2.35 μm with a spectral resolution of 30 m. On board of
the Landsat 7 satellite, the ETM+ sensor was placed, which has a similar setup as the
TM with the addition of a panchromatic band at 15m spatial resolution. After the scan
line corrector (SLC) failure of Landsat 7 in May 2003, 22% of any given scene is
missing. Different areas are affected by the failure for every flyover, thus image
stacking can be used to provide continuous coverage.
The data type of the downloaded images is the compressed GeoTIFF format in an
Universal Transverse Mercator Projection with the WGS 84 datum (ROGGE et al.
2018: 1–17). The dataset is made of scenes classed as “Tier 1 precision and terrain
corrected” (L1T) and hence is deemed appropriate and accurate enough for time-
series analysis by the USGS.
In order to achieve a high quality of detection of exposed soils, artifacts and visual
obstructions need to be removed. According to Zhang et al. (ZHANG et al. 2005: 357
371), the global annual mean cloud coverage in Landsat satellite image datasets is
approximately 66%. Identifying clouds, as well as their shadows, proved to be crucial,
since their darkening and brightening effects complicate many remote sensing
applications. The “Fmask” algorithm of Zhu and Woodcock (2012) is used, which
has a detection rate of 96.4% (Z
HU et al. 2005: 357–371).
Secondly, the ATCOR software introduced in 2008 by R. Richter et al. is used to
minimize the effects of atmospheric disturbances such as water vapor and aerosols.
Starting from the “digital number” value, the individual pixel metadata consisting of
coordinates, illumination angle and the date of acquisition are taken into account to
derive the bottom of atmosphere reflectances via the top of atmosphere radiances
(RICHTER et al. 2006: 2077–2085).
Due to snow- and cloud cover during the winter, images taken within the first and last
30 days of the year are eliminated. After the application of the processor for Germany
as done by the DLR, 9,331 Landsat scenes remain.
4
After the preprocessing, the scenes are transformed into 1° by 1° tiles and broken up
into six time frames of five years. The only exception is the first timeframe, which
contains the data of 6 years (1. 1984-89, 2. 1990-94, 3. 1995-99, 4. 2000-04, 5. 2005-
09, 6. 2010-14). Additionally, all data from 1984 until 2014 are combined into a
seventh timeframe by the DLR, which is the final SCMaP product showing the
exposed soils used in this thesis (R
OGGE et al. 2018: 1–17). This product will be
refered to as SRC.
Fundamentally, SCMaP tries to identify exposed soil pixels (without vegetation) in
the images. Other than in very rare conditions in the Bavarian Alps above the tree line
or along the shore of oceans and rivers, naturally exposed soil occurs very rarely. A
key characteristic of exposed soil is a very low vegetation index, such as the
Normalized difference vegetation index (NDVI), which has been used since the early
days of Earth Observation (ROUSE et al. 1974: 309–317). This vegetation index uses
the high reflectance rate of healthy green plants in the near infrared range from 0.7 to
1.1 µm to create a photosynthetically active vegetation index (PV) (KRIEGLER et al.
1969).
Rogge et al. decided to use a modified version of the NDVI for the highest/lowest PV
selection. The blue channel was introduced to the Index, in order to reduce the higher
reflectance effect of thin haze in pixels, which were not correctly revised during the
preprocessing stage.
Equation 1: PV determination, modified NDVI for SCMaP
A low value of the aforementioned vegetation index value by itself is not sufficient
to detect exposed soils, since water and other anthropogenically influenced regions
as well as non-photosynthetically active vegetation have a low PV value as well.
Since exposed soils most commonly occur on agriculturally used land (in Germany),
their distinguishing characteristic is a high seasonal variability of the PV. Due to
growing and harvesting/plowing seasons, agriculturally active areas experience at
least one, if not several distinct changes in PV every given year (R
OGGE et al. 2018:
1–17). Because of the high temporal resolution of the Landsat data, these changes can
5
be detected and quantified in order to create a composite image. This composite image
encompasses both the highest and lowest PV values for each pixel.
The next step is overlaying the composite with the Coordination of Information on
the Environment (CORINE) Land Cover data set (CLC). After eliminating those
pixels, which underwent a change in the CLC classification, 5000 pixels were
randomly selected and then plotted to show their maximum and minimum PV value,
as well as their CLC classification group (Figure 1) (ROGGE et al. 2018: 1–17).
Pixels classed as agricultural
land can be clearly
distinguished from other
pixels by their significant
difference between maximum
and minimum PV, a clear
cluster of the “fields”
classification is visible.
Different classifications such
as coniferous and deciduous
trees have a similar maximum
PV value, but the lower
minimum PV value increases the variability, enabling the distinction between the
classes.
Finally, thresholds are manually chosen to “box in” this cluster of soil pixels after
careful examination of the five test tiles. An automatization of the threshold selection
process is currently under development (ROGGE et al. 2018: 1–17).
1.3.2 SCMaP Product Soil Reflectance Composite
The main result of SCMaP is the SRC, which shows the mean reflectance value of
those pixels, which were identified to have undergone significant changes in PV and
are thus classed as exposed soil. Those composites are created for each timeframe and
show the average reflectance value for each band of the stack of images at any given
soil pixel. The error rate for identifying exposed soils of the SRC over the whole 30
year period is stated as 2.24% (Rogge et al. 2018: 1–17).
Figure 1: SCMaP Threshold Selection
(adapted from
ROGGE et al. 2018: 1–17)
6
1.4 Research Objectives
The aim of this thesis is to extract a maximum of soil related information from the
Soil Composite Mapping Processor product, using the technique of machine learning.
More specifically, the SRC of the years 1984-2014 is used.
SCMaP, as previously described, is a method for identifying bare soil pixels from
averaged reflectance data from these points; the reflectance composite thus contains
spectral information about the soils. By examining the SRC pixel colours (cf. “2.2
Reflectance Soil Composite”), areas of conformity can be clearly distinguished from
other areas, even though all pixels represent bare soil. Based on this observation, the
idea to try to extract information about soil parameters was conceived and this
hypothesis formed. Variances in reflectance could possibly be traced down to local
differences in soil parameters.
Based on in-situ soil samples and their reflectance value of the according SRC pixel,
predictions will be made for all other SRC pixels regarding their soil parameters.
Since the pixels have to be assigned to predefined classes, the machine learning
method of random decision forests was chosen in order to make these predictions.
The objectives can be more precisely articulated with the following questions:
-Do the reflectance values of SCMaP pixels correlate with the soil parameters of
carbon organic content, soil type and soil texture?
-Can a random decision forest perform an accurate classification of the entirety of the
SCMaP product?
Figure 2 shows the workflow of this thesis. After acquiring and preprocessing the
input data, a database is built, containing the spectral information of the sampling
locations as well as their soil parameters. This database is then analysed for statistical
correlations between the reflectance and the soil parameters, checking to what degree
the classes are spectrally distinguishable. Using the input database, a random decision
forest machine learning classification method is built, which predicts the soil
parameters for all pixels of the SCMaP composite. The predictions are consequently
transformed into three soil parameters maps. The previously obtained statistical
results are used to check the accuracy of the predictions.