vignettes/technicalities.Rmd
technicalities.Rmd
crestObj
object
In crestr
, all the CREST-related data are stored within
a single S3 object of the class crestObj
. Most package
functions will take a crestObj
as their primary input and
return an updated version of that object. In practice, a
crestObj
is a nested list that contains five sub-lists,
each one grouping a specific type of information, such as the
calibration data, the fitted climate responses, or the reconstructions
(Fig. 1). Wrapper functions have been implemented to
manipulate and modify the information contained in a
crestObj
, and users are never expected to manually modify
their crestObj
– even if it is possible. The five sub-lists
contain the following information:
inputs
: contains the input data (e.g. the
counts/percentages of the fossil proxy data, the ages of the samples or
the names of the fossil taxa).
parameters
: contains the parameters provided at the
different stages of the analysis (e.g. the tailoring of the
gbif4crest calibration dataset or the fitting and combination
of the PDFs).
modelling
: contains all the data related to the
estimation of the PDFs (e.g. the occurrence data (the
‘distributions’) used to estimate the PDFs, the climate space of the
study area, or the PDFs themselves).
reconstructions
: contains all the results
(e.g. best estimates, synthetic error measurements, as well as
the full distribution of the reconstruction).
misc
: contains some additional metadata relative to
the reconstruction (e.g. the site location or, most
importantly, information related to the proxy-species associated process
described in section).
crestr
Five different input data files are compatible with crestr.
However, most applications will only require two files (the
df
and PSE
files, see below) to be created.
More specific applications may require up to four of these files. All
the files can be prepared outside the R environment and imported using
standard R functions.
df
)
The df
data frame is only required if
crestr
is used for reconstucting climate and can be omitted
if the objective is to model the climate response(s) of different taxa.
df
should be a data frame with the different samples
entered as rows, with either the age, depth or sample ID as first column
and the fossil data in the subsequent columns. df
can
countain raw counts, percentages, presence/absence (1s and 0s) or even
relative weights to be used in the reconstruction.
The df
data frame is required if crestr is used
for reconstructing climate and can be omitted if the objective is
limited to modelling the climate response(s) of different taxa.
df
is a data frame with the samples entered as rows, with
either the age, depth, or sample ID as the first column and the fossil
data in the subsequent columns. df
can contain raw counts,
percentages, presence/absence (1s and 0s) or even relative weights to be
used in the reconstruction. In the case study presented here,
215 terrestrial and aquatic pollen taxa were observed in 181 samples, so
that df
is a data frame with 216 columns (the age of the
samples followed by the pollen taxa) and 181 rows containing pollen
percentages of terrestrial taxa.
PSE
) table
The PSE
data frame is required to use the
gbif4crest calibration dataset. It is used to associate
individual species available in the TAXA
table with their
corresponding fossil taxon. When all the fossil taxa are identified at
the species level, the PSE
table is a simple data frame
with one row per taxon (see for instance the row corresponding to
Elais guineensis in Table 1). However, fossil
taxa are most often identified at a lower taxonomic resolution
(sub-genus, genus, sub-family, family). These varying levels of
identification should be encoded in the PSE
file to link
one or more (groups of) species to their common fossil taxon name
(i.e. group together all the species that are likely to have
produced the observed fossil). Several species can be assigned to a
taxon at once by limiting the taxonomic description at the family or
genus level (e.g. Artemisia in Table
1). A PSE
file is composed of five columns
(Table 1). The first one (Level) contains an
integer that indicates the level of taxonomic resolution of the row (1
for Family, 2 for Genus, 3 for Species and 4 for taxa that should be
excluded from the reconstruction, e.g. ‘Triletes spores’ in the
example case study). The fifth column ProxyName contains the
name of the taxon. All the taxa recorded in the df
dataset
should be listed here, or they will be excluded. Columns two to four
contain the taxonomic classification of that taxon as Family,
Genus and Species, respectively. For simplicity, a
pre-formatted version of the PSE
table with the names of
all the taxa to study can be generated by crestr using the createPSE(list_of_taxa)
function that generates a spreadsheet with the correct structure and
with the Level and ProxyName columns automatically.
Table 1 Example classification of four pollen taxa
from the example case study, each one with a different level of
taxonomic resolution. The last column ‘Taxonomic resolution’ is added
here for illustrative purposes and should not included in the
PSE
table.
Level | Family | Genus | Species | ProxyName | Taxonomic resolution |
---|---|---|---|---|---|
1 | Asteraceae | Asteraceae undiff. | Family | ||
2 | Asteraceae | Stoebe | Stoebe-type | Subfamily | |
2 | Asteraceae | Elytropappus | Stoebe-type | Subfamily | |
2 | Asteraceae | Artemisia | Artemisia | Genus level | |
3 | Arecaceae | Elaeis | Elaeis guineensis | Elaeis guineensis | Species |
4 | Triletes spores | To be excluded |
The species - taxon associated is performed in sequential steps by the crest.get_modern_data() function. First, crestr classifies the taxa with the lowest taxonomical resolution (i.e. when Level is equal to one) and then increases the resolution Level by Level. In the example in Table 1, different taxonomic resolution levels are provided for different plant species belonging to the highly diverse Asteraceae family (the daisy family). To distribute all the Asteraceae species observed across the study area to their appropriate taxon, all the species are first classified as ‘Asteraceae undiff.’ (first row, Level = 1). In a second time, the classification of some of these Asteraceae species is refined when reaching the better-resolved sub-groups (Stoebe-type and Artemisia at Level = 2). At the end of the process, the ‘Asteraceae undiff.’ group only contains Asteraceae species that grow in the study area but are not part of the genera Stoebe, Elytropappus or Artemisia. The latter are categorised separately as Stoebe-type or Artemisia.
Additional taxa can also be added to the PSE
file to
exclude species known not to be part of a group. For instance, this
‘trick’ could have been used to simplify the climate response of the
‘Asteraceae undiff.’ group by excluding more species from it, even if
the pollen grains corresponding to these species have not been observed.
This categorisation process can be time-consuming, as all the taxa must
be classified in a unique PSE
table, and this process will
often require many iterations to be optimised. The results of the
different assignments are stored in the crestObj
returned
by the crest.get_modern_data()
function and can be evaluated by checking
rcnstrctn$misc$taxa_notes
.
distributions
)
Users that prefer fitting proxy-climate responses from their own
calibration data instead of the proposed gbif4crest dataset
should prepare a distributions
dataset following the
specific structure presented in Table 2. The first two
columns should contain species names (or any unique identifiers) and the
corresponding proxy name, respectively. If more than one species
correspond to one taxon, the PDFs will be fitted in two steps. The
following two columns contain the coordinates of the species occurrence
data. Finally, the last columns contain the climate values to be
reconstructed. An optional column called weight
can be
added to distributions
in the fifth position (i.e.
between the coordinates and the climate variables) if one wants to weigh
the different observations. For example, the (relative) abundance of
taxa observed from modern proxy assemblages can be used when fitting the
PDFs to give more importance to the observations where that abundance is
highest.
Table 2 Template for the data frame. The weights column, here indicated with a ’*’, is optional and can be omitted or its values all set to 1 to assign the same weight to each observation. The number of rows the of table should correspond to the number of unique occurrences available.
Species name | Taxon Name | Longitude | Latitude | Weight* | clim_1 | … | clim_n |
---|---|---|---|---|---|---|---|
Stoebe plumosa | Stoebe-type | 18.875 | -34.375 | 20 | 15.8 | … | 711 |
Elytropappus rhinocerotis | Stoebe-type | 18.375 | -33.625 | 32 | 16.9 | … | 477 |
… | … | … | … | … | … | … | … |
Elaeis guineensis | Elaeis guineensis | -4.375 | 10.875 | 4 | 27.4 | … | 1020 |
climate_space
data frame
This data frame is only necessary if the users use a personal
calibration dataset (distributions
) instead of the
gbif4crest dataset. This data frame enables 1) using the
climate space weighting option and 2) including plots of modern climate
in the different diagnostic tools. Its structure is straightforward,
with the first two columns containing longitudes and latitudes and the
subsequent columns the climate variables to reconstruct. The spatial
resolution and the ordering of the climate variables should be identical
to the distributions
table (Table 2).
However, the arrangement of the rows is not important.
selectedTaxa
data frame
The last data frame that may be used to inform the reconstruction is
a data frame of ones and zeros called selectedTaxa
. This
data frame has as many rows and columns as there are taxa and climate
variables, respectively. Each entry, which should be either 1 or 0,
indicates if the taxon should be used to reconstruct the climate
variable (value = 1) or not (value = 0). If a selectedTaxa
data frame is not provided, a default data frame with all entries set to
1 is added to the crestObj
at initialisation. Users can
then modify this information at any point using the includeTaxa()
and excludeTaxa()
built-in functions. The crest.get_modern_data()
function also modifies this data frame by setting the value to -1 when
the PSE
classification failed for a taxon or when the
amount of data in the study area is insufficient to fit a reliable
PDF.