An uncertainty-focused database approach to extract spatiotemporal trends from qualitative and discontinuous lake-status histories

De Cort, Gijs Limnology Unit, Department of Biology, Ghent University, Belgium Division of Ocean and Climate Physics, Lamont-Doherty Earth Observatory, Columbia University, USA Department of Earth Sciences, Royal Museum for Central Africa, Belgium E-mail: gijs.decort@ugent.be Chevalier, Manuel Institute of Earth Surface Dynamics, University of Lausanne, Switzerland Institute of Geosciences, Sect. Meteorology, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany E-mail: manuel.chevalier@unil.ch Twitter: @ChevalierManu Burrough, Sallie L. School of Geography and the Environment, University of Oxford, UK E-mail: sallie.burrough@ouce.ox.ac.uk Twitter: @SLBurrough Chen, Christine Y. Massachusetts Institute of Technology-Woods Hole Oceanographic Institution Joint Program in Oceanography, USA Department of Earth, Atmospheric and Planetary Sciences, Massachusetts Institute of Technology, USA Division of Geological and Planetary Sciences, California Institute of Technology, USA Now at Division of Nuclear and Chemical Sciences, Lawrence Livermore National Laboratory, USA E-mail: chen127@llnl.gov Twitter: @earth2christine Harrison, Sandy P. School of Archaeology, Geography and Environmental Science, University of Reading, UK E-mail: s.p.harrison@reading.ac.uk


Introduction
Understanding hydroclimate dynamics on timescales longer than a few decades depends largely on natural environmental archives that have recorded past climate changes. Changes in lake volume and/or area provide important information about past hydroclimates because they reflect changes in the balance of precipitation and evaporation over the lake and its catchment (Cheddadi et al., 1997;Harrison et al., 2002). The first efforts to reconstruct late-Quaternary palaeo-hydroclimate dynamics from lake data started in the late 1970's (Street and Grove, 1979) and led to the creation of several regional compilations (e.g., Harrison, 1989;Fang, 1991;Harrison, 1993;Tarasov et al., 1994;Yu and Harrison, 1996;Harrison et al., 1996;Jolly et al., 1998).
These regional databases were subsequently homogenised and integrated into the first Global Lake Status Database (GLSDB; Qin et al., 1998) that contained histories of 'lake status' (a term that covers all parameters of lake size, i.e. depth, level, area and volume) across the world spanning some or all of the last 30,000 radiocarbon years (~34,000 cal yr BP). Initially, the GLSDB records were categorised into three qualitative lake-status classes -'low ', 'intermediate' and 'high'at 1000 radiocarbon-year intervals. Later, this structure was adapted to provide continuous records and allow an assessment of dating quality and flexible definitions of status class. The GLSDB has been used to assess palaeoenvironmental changes since the Last Glacial Maximum (LGM) and to evaluate climate-model hindcasting performance for key time windows in the past (e.g., Qin et al., 1998;Kohfeld and Harrison, 2000;Coe and Harrison, 2002;Tierney et al., 2011;Bartlein et al., 2017).
Unfortunately, the GLSDB was last updated in the late 1990s and does not include the many new palaeolimnological studies published since then. An update of the chronologies of the records stored in the database is also necessary because all the chronologies were reported as uncalibrated radiocarbon dates, and because many other geochronological methods have since been applied to palaeolimnological archives (e.g., Kutterolf et al., 2016;Roberts et al., 2018;Chen et al., 2020;Hatfield et al., 2020). In parallel, awareness has grown regarding the necessity to account for uncertainties when interpreting paleoenvironmental records and their age models (e.g., Telford et al., 2004;Heegaard et al., 2005;Blaauw, 2010;Blaauw and Christen, 2011;Parnell et al., 2011;Breitenbach 2012). In its original form, the GLSDB cannot account for such uncertainties. One particular consequence of such limited consideration of uncertainties is that all the records are treated as equivalent. Such an assumption is obviously problematic as some records are more informative than others, due to better dating and/or more straightforward interpretations. This paper presents a new relational database structure that efficiently stores lake-status histories, including both chronological and status uncertainties, and demonstrates how such an uncertainty-focused database can enable more complex reconstructions of past environmental change. As a proof-of-concept, we present a new compilation of 67 late-Quaternary lake-status histories from eastern and southern Africa (ESA), many of which are discrete and discontinuous. We employ a new probabilistic approach based on a Monte Carlo (MC) algorithm and empirical orthogonal functions (EOFs) to identify the dominant hydroclimate features across ESA during the last 20,000 years. Finally, based on the results of this concept study, we explore and discuss the relative strengths and weaknesses of our framework.

Lake status as an indicator of past climate
The balance between precipitation and evaporation (P-E) exerts a universal control on lacustrine systems by regulating lake volume, which itself depends on lake depth and surface area (Mason et al., 1994;Cheddadi et al., 1996;Harrison et al., 2002). Responses of lake volume to climatic change are most striking in endorheic, or 'closed' lakes, i.e. lakes without an (above-ground) outflow where most or all water leaving the lake is lost through evaporation. Closed lakes are found most often in semi-arid environments, where catchment evapotranspiration exceeds water input, thus constraining lake-level equilibria below overflow level (Meybeck, 1995). While the response of exorheic, or 'open' lakes, to climatic change is often less dramatic, these also undergo lake-level changes if changes in the hydrologic budget are not compensated by inflow/outflow adjustments (Cheddadi et al., 1996). On longer time scales, many lakes can be classified as transitional, shifting between open and closed states as changing climate controls their water balance (Kalff, 2001).
Past lake status is derived from two main lines of evidence. The first source is geomorphic features formed by marginal processes that reflect displacements of the past shore or near-shore environment ('palaeoshorelines'). Palaeoshorelines above the present water level can be dated using radiometric or luminescence dating techniques to estimate the timing at which they were last active. Combined with elevation data and knowledge of basin morphometryand if necessary corrected for post-depositional deformation that might occur through tectonic vertical displacement (e.g., Garcin et al., 2009) or isostatic adjustments of the lithosphere (e.g., Chen and Maloof, 2017)they allow for a direct, quantitative reconstruction of water level and lake hydrology at their time of formation. Palaeoshorelines represent discrete, rather than continuous, recordings of lake-level history as they are often only deposited and preserved under specific hydrological conditions such as water-level stasis. As such, they only provide snapshots of past hydrological states. Palaeoshoreline records tend to dominate the palaeolimnological record in semi-arid to arid regions, where the climate regime necessary to maintain steady state lake-full conditions deviates most strongly from present-day climate conditions (Bowler, 1986;).
The second source of evidence is derived from the lakebed, where the progressive deposition of sediments results in sequences representative of the character of the overlying water column through time. Such stratigraphies can be retrieved as sediment cores or as outcropping sections of sedimentary landforms above the modern water surface. Lakebed archives can be (but are not necessarily) uninterrupted deposits and thus provide a continuous time series of past lake status. Absolute reconstructions of water depth can be made from lake-bed records using sequences of cores to trace the limit of lacustrine deposition through time (e.g., Digerfeldt, 1965;Schneider and Tobolski, 1985;Winkler et al., 1986;Almqvist-Jacobsen, 1995;Shuman et al., 2009) but the procedure is timeconsuming. Generally lake-bed records only provide indirect information of relative changes in lake status approximated through sensors of water depth such as sedimentological characteristics or floral and faunal community composition. For these reasons, lake-level reconstructions derived from lakebed archives tend to be continuous but qualitative in nature.
Reconstructing lake status and associated climatic change from sedimentology and geomorphology can be challenging for several reasons (e.g., Gasse, 2000;Holmes and Hoelzmann, 2017): 1. Like all natural palaeoenvironmental archives, the chronological control on lake records is dependent on the dating method applied. Apart from lakes containing annually varved sediments, dating of lacustrine deposits relies mostly on radiometric techniques with varying precision.
2. Lake-status archives can be incomplete and low stands are often underrepresented. In sedimentological records, low water levels or dry stands can inhibit deposition and even remove previously deposited sediments, thus creating stratigraphic hiatuses or unconformities. Similarly, palaeoshorelines above a lakebed only reflect instances where water level was higher than today, and prolonged high stands can erode previously formed shoreline structures thereby eliminating evidence for older episodes of lower water level ).
3. The information about lake-status changes tends to decrease further back in time, which can be partly explained by practical fieldwork constraints (i.e., coring older deposits is technically challenging and costly) but also reflects difficulties of obtaining records after lake desiccation because of compaction, pedogenesis and/or mineralisation of the lakebed (e.g., Stager et al., 2002;De Cort et al., 2018). Lake coring is therefore often restricted to the time window following the most recent dry stand. The result is that lake-status data is often skewed towards recent periods of higher water level, which have a higher probability of being recorded than periods of greater age and/or lower lake status.
4. The sensitivity of lake level to climate is site-specific, as factors such as physical and ecological catchment characteristics and groundwater inflow or outflow can affect the amplitude and timing of lake response to variability in P-E (Vassiljev et al. 1998). Different archives also show different levels of sensitivity to lake-status changes, which may differ between indicators but also be non-stationary through both space and time for the same indicator (Juggins, 2013). 5. Lake-status changes can also be driven by non-climatic factors, such as gradual basin infilling, basin deformation (e.g. by tectonism or volcanism) or sea-level change. These processes can mask or modify the climate signal preserved in sedimentary or geomorphic archives.
Lakes vary widely in their climate sensitivity. Most lakes act as low-pass filters that mainly respond to lowerfrequency climate variability (Mason et al., 1994;Liu and Schwartz, 2014). Catchment size and morphometry, as well as relative groundwater flux, are major factors in determining the sensitivity of lake level to climate change (Vassiljev et al. 1998;Olaka et al., 2010). The time needed for a lake to return to equilibrium creates a lagged response which usually varies from months to a few years (Yi and Zhang, 2015) but can take multiple millennia in extremely large catchments (Singarayer et al., 2019).

The new relational lake-status database
This section describes the principles behind the construction of the new database, as applied to the ESA sites used in our case study (see section 5).

Data selection
Studies containing data on past lake-status change were included in the database if they fulfilled the following criteria: 1. Data and publications were peer-reviewed; 2. Chronological data was available to anchor past lake variability in time (e.g., radiometric dating, reliable historical accounts or instrumental records); 3. Evidence indicates that lake-level changes were primarily driven by climate and were interpreted as such in the original literature. Sites or sections of individual records where changes in lake level were demonstrably non-climatic in origin (e.g., direct influence of sea-level changes, volcanism, basin infilling or anthropogenic activities) were excluded to avoid mixing environmental signals; 4. The palaeoenvironmental evidence reflected changes in lake level, depth, volume and/or area rather than other hydrological variables such as precipitation, runoff, or moisture source. Interpretations from the original literature were followed in this matter, unless competing evidence (e.g., from contradictory studies) demanded otherwise.

Compilation and revision of chronological information
A wide range of approaches for age-model construction were used in the original literature, so we built standardised chronologies for all records. Radiocarbon dates were recalibrated using IntCal13 for Northern Hemisphere sites  and SHCal13 for Southern Hemisphere sites . Agedepth models for sediment profiles were reconstructed using the Bayesian age-depth modelling software Bacon in R (Blaauw and Christen, 2011;R Core Team, 2020). Core-top ages were included in the age model when intact recovery of the surface sediment was reported, in which case the core top was assumed to correspond to the time of sampling. The specifics of all the age-depth models, such as the exclusion of anomalous outlying dates, the excision of event deposits, the inclusion of sediment hiatuses, and the settings of the Bacon run were adjusted on an ad-hoc basis for each individual sequence to ensure the most reliable results and with the objective of staying as close as possible to interpretations in the original literature.
The discrete nature of palaeoshorelines means that calibrated ages obtained from them were used directly as a shoreline age, except in the case where multiple shoreline dates overlapped within their uncertainty ranges to such a degree that they could be interpreted as representing a single continuous time interval for a specific lake status.
New chronologies were not created for sites where i) the original chronological data was (partly) unavailable but the site was considered fundamental for regional reconstruction, or ii) where the published age model was already created using Bacon with the appropriate calibration curve. Original 14 C measurements and associated information, as well as data from other geochronological techniques, were retained in the database to enable future updates.

Construction of consensus lake-status histories
Many lines of evidence only provide indirect evidence of past lake status. Thus, assumptions on the relationship between such proxies and lake level are often required to reconstruct lake status. Our standard approach was to combine all available sources of information on the assumption that if several lines of evidence point to a change in water depth or area, this consensus is real (Harrison et al., 1991;Harrison and Digerfeldt, 1993). For every site, the information on past changes in lake status was combined with the revised chronologies to produce sequences of lake-status episodes. Each episode is defined by i) a best-estimate of the start and end age, each with their associated 2-σ credible intervals, taken from individual/clustered ages from discontinuous archives (Fig. 1a) or from the reconstructed age-depth models of sediment cores (Fig. 1b); and ii) a relative qualitative lake-status class represented by a positive integer, in which a higher status (higher number) signifies a higher relative lake level, volume or surface area (1 being the lowest status documented and not necessarily meaning drystand; Fig. 1). The weighted mean of the probability distribution produced by the Bacon model, or in the case of palaeoshorelines, of the single defining chronological point, was taken as the best estimate of the 'true' age.
There is no upper limit to the number of different lake-status classes assigned to a lake, as this depends on the available lines of evidence and their sensitivity to lake-status change, the resolution of the available information and the total time span of the archive. The ranking is specific to each lake and status classes cannot be directly compared between lakes. In general, the original authors' interpretation was followed unless there were contrasting conclusions published in the literature. When the status at a particular time relative to other periods in a lake's history was ambiguous, lower and upper limits on the possible lake status class were defined. This occurred when status was based on ambiguous data interpretation, or when several archives were used that do not overlap sufficiently in time or type of evidence so that comparison of status in different periods of a lake's history was not straightforward.
Assignment of lake-status classes to lakebed sediments and palaeoshorelines was challenging in different ways.
The density of palaeoshoreline dates was often insufficient to create continuous reconstructions. Additional assumptions were often required concerning the duration of events represented by features dated by only one or a few ages. The same was true for other discontinuous records such as non-shoreline geomorphological features or archaeological artefacts. With such 'snapshot' information, event duration had to be estimated conservatively, informed by the nature of the deposit and the sensitivity of the basin to hydrological changes. For example, raised beaches may be assigned shorter durations (on the order of years) than thick carbonate deposits that accumulated in the littoral zone over multiple decades or centuries. Similarly, some lake basins are well buffered against shortterm water-level fluctuations whereas others exhibit significant variability on (sub-)annual time scales, depending on basin characteristics such as morphometry, hydrography, and vegetation.

Database structure and content
Data on site characteristics, chronology and lake-status history was assembled into 9 tables, which were combined into a relational database and exported as a SQLite3 file (Fig. 2 Coding Source -This table lists which lake-status indicators were available in the consulted literature, and which of these were used in the consensus lake-status reconstruction.
Lake Size -This table provides information on absolute water depth or lake-level elevation as recorded by geomorphological features. Best estimates of the mean, lower and upper still-water bound of each shoreline, and corresponding estimates and uncertainties of lake surface area and lake volume, are given if available from the original literature.
Coding -This table contains the lake-status histories. For each site, a sequence of episodes is defined. Each of these episodes is defined by a best-estimate start and end age, together with their associated 2σ uncertainties, and by a best estimate, a minimum and a maximum value for lake status. The Coding table also links lake-status episodes to corresponding quantitative information, if available in the Lake Size table.
A number of additional tables is used to store lists of accepted values for specific fields or to link the different data tables together (Fig. 2, see Data availability).

Analytical approach
Integrating temporal and lake-status uncertainties allows for the extraction of robust spatiotemporal palaeoclimatic trends from data (e.g., Breitenbach et al. 2012;Anchukaitis and Tierney, 2013;Chevalier and Chase 2015). Here, we designed a specific Monte-Carlo (MC) framework to iteratively sample the lake-status histories, taking account of their uncertainties in chronology and lake status, to produce a large number of possible historical scenarios (section 4.1). These randomised iterations were subsequently used to extract the principal spatiotemporal modes of variability (section 4.2). A flowchart of this analytical approach is shown in Fig. 3.

Monte-Carlo sampling of time-and status-uncertain lake histories
Our MC algorithm iteratively sampled the Coding table to produce a large number of variants of each lake's status history; in our application (see section 5), we used 10,000 iterations. In each iteration, uncertainty ranges of start and end ages of subsequent status episodes were sampled unidirectionally through time to conserve the chronological sequence of defined episodes. The outcome of the sampling of a given time point (i.e. the start or end of a certain status episode) was thus constrained by the uncertainty defined for that point and by the result of the priorly sampled point. To avoid biasing the sampled records either towards younger or towards older ageswhich can happen especially for sequences of relatively short episodes with highly overlapping temporal uncertainty rangesthe sampling was performed by randomly moving either forward or backward in time. This means that for any lake, roughly half of the iterations sampled the coding table in an old-to-young sequence and the other half of the iterations sampled the coding table starting at the youngest and ending at the oldest episode.
Similar to the sampling of the start and end dates of each episode, the status class was also sampled from its uncertainty range but without being constrained by the priorly sampled point, since we assumed that status class is independent between episodes.
Although the distribution of estimated ages produced by radiocarbon calibration or Bayesian age-depth modelling are generally not symmetric or even unimodal, only the model weighted means and upper and lower 95% confidence intervals are used to define the sequence of episodes that make up a lake's history in the database.
Similarly, probability distributions of status classes are also not normally distributed and may be highly asymmetrical (see Fig. 1c). To cope with the asymmetrical nature of age uncertainties, the MC algorithm samples start and end ages of status episodes following a probability density function defined by a split-normal distribution, also known as a two-piece normal distribution. The mode of the split-normal distribution is equal to the best estimate of that age (Fig. 1c) and the left and right standard deviations (1σ) of the distribution are derived from the lower and upper limit of the 2σ range as defined in the coding table. Uncertain status classes were set to follow a triangular probability density function, peaking at best-estimate values and linearly decreasing towards 0 at values of minimum-1 and maximum+1, which precludes obtaining values outside of the defined minimum-tomaximum range (Fig. 1c). The split-normal (for ages) and triangular (for status class) distributions are scaled to ensure that their cumulated probabilities sum to 1.

Recursively-Subtracted Empirical Orthogonal Function analysis
To extract regional palaeoclimatic trends, MC-sampled lake-history variants were generated for a set of lakes of interest. These histories were then analysed using Empirical Orthogonal Functions (EOFs), which decompose data of high spatiotemporal dimensionality into its dominant spatiotemporal features of variability. Recursively-Subtracted Empirical Orthogonal Function (RSEOF) analysis is a specific case of EOF analysis that is able to deal with data sets with high proportions of missing values (Taylor et al., 2013), as can be the case with lake-status data. All MC-produced lake histories were interpolated to a common time scale with regular spacing. In our proofof-concept study using African lakes (see Section 5), a time step of 5 years was employed. Time series were centred and scaled to a mean of 0 and a standard deviation of 1 prior to the analysis. A reference RSEOF was also calculated for the scenario where all ages and statuses of included sites were assumed at their best-estimate value.
The result was an ensemble of 10,000 (MC-generated) + 1 (best-estimate) sets of principal components (PCs), each summarizing the dominant spatiotemporal modes of variability in their corresponding set of lake-status histories. By construction, the sign of each eigenvector is arbitrary (Legendre & Legendre, 2012). To homogenise the sign of corresponding eigenvectors across all RSEOF iterations, each PC was multiplied by -1 if this resulted in a reduction of its distance to the corresponding PC of the set of best-estimate lake-status histories, i.e. the sign and direction of each eigenvector was aligned with the sign and direction of the eigenvector of the reference RSEOF. For each RSEOF iteration, unicity of EOF modes was assessed using North's Rule of Thumb, which determines which eigenvectors are distinct from their nearest neighbour (North, 1982). RSEOF analyses were carried out using the R package sinkr (Taylor, 2017; R Core Team, 2020).

Regional setting
Our study areaeastern and southern Africa (ESA)is delimited by the Sahel to the north and by the Congo Basin and the Angolan highlands to the west (Fig. 4) and encompasses different climatic zones (Gasse et al., 2008;Burrough and Thomas, 2013). Tropical East Africa and Madagascar mainly receive monsoonal rainfall from the Indian Ocean associated with the yearly migration of the tropical rain belt, although Atlantic moisture can also significantly contribute in western areas (Nicholson, 2017;2018). In southern Africa, the strength of the monsoon decreases with distance from the Indian Ocean, grading from the subtropical grasslands and savannas in the east into the arid Kalahari over the southern African plateau and finally into the hyper-arid Namib Desert along the west coast. This pattern is additionally influenced by the cold Benguela current in the Atlantic, further depriving the Namib of rain. The southern and southwestern coasts of South Africa are dominated by a Mediterranean climate, where winter rains prevail (Tyson and Preston-White 2000).

Lake sediments are the dominant source of palaeoenvironmental and palaeoclimatic reconstructions for East
Africa (Verschuren, 2003). Tectonic activity and faulting associated with the East-African Rift System (EARS) have produced a string of basins which are arranged in two distinct rift branches (Chorowicz, 2005). The western branch, or Albertine Rift, holds several large and deep freshwater lakes (Tiercelin and Lezzar, 2002). Lakes of the eastern branch, or Gregory Rift, are typically smaller, shallower and more saline (Schagerl and Renaut, 2016).
Additionally, volcanic activity associated with the EARS created a number of explosion craters, many of which hold small but sometimes relatively deep lakes. High densities of these crater lakes are found for example in the Rungwe Volcanic Province of southern Tanzania (Delalande et al., 2008) and the maar crater lake districts of southwestern Uganda (Melack, 1978). Locally concentrated small lakes of glacial origin are found in the alpine zones of the highest peaks and mountain ranges of East Africa (e.g., Mahaney, 2004;Tiercelin et al., 2008).
Because of southern Africa's dry conditions; a deep covering of sand related to its cratonic origins (Haddon and McCarthy, 2005); and the restricted influence of rifting, the southern African interior has less potential for deep lakes than East Africa. Today, most lakes and wetlands are found along the coast of South Africa, where they are often under significant marine influence (Hill, 1975;Whitfield et al., 2017). Nevertheless, the subcontinental interior holds many endorheic basins of various sizes, most of which are perennially dry or hold seasonally filled playas today, but where geomorphological features provide evidence for long-lived standing water in the past.

The ESA lake-status database
We identified a total of 67 lakes in ESA that met the criteria defined in Sect. 3.1 (Fig. 4, Table 1). Sites from the original GLSDB were only incorporated here if they met these criteria, and their GLSDB histories were critically re-evaluated in the process. Data for these sites was sourced from 244 publications (see Supplementary Material).
Their catchments were delineated using the HydroSHEDS database (Lehner et al., 2008) or, for the smallest drainage basins, by manual delineation using Google satellite imagery and Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model (GDEM) v2 data (ASTER GDEM is a product of METI and NASA). The ESA database is available online, along with supplementary documentation for each individual site (see Data availability).
The current nature of the lakes in these basins reflects regional climate. Most basins are topographically closed today (54/67). Seasonally or perennially dry basins are disproportionately common in the interior and western portions of southern Africa, while 10 of the 13 currently overflowing lakes are located within 12° of the Equator.
Slightly more than half the sites (39/67) occur below 1000 m a.s.l., while 4/67 of sites lie above 2000 m asl. Most lakes have a surface area smaller than 100 km² (50/67) while 9/67 are currently larger than 1000 km². Catchment area ranges from < 1 km² for the smallest crater lakes to 1,205,959 km² for the Makgadikgadi basin (Fig. 5a). The most common sources of evidence for lake status (Table 1) ( (1967)   The chronological data is summarised in Fig. 6 (Fig. 7) can be traced at least in part to the many relatively small and shallow lakes in East Africa that desiccate relatively regularly and therefore do not contain archives older than a couple of millennia (Verschuren, 2003). This bias in archive development by local climate is inherent to palaeolimnology in arid to semi-arid climate zones.
Although long histories are available from catchments of all sizes, there is a positive correlation between catchment size and the age of the oldest available information on lake status (Fig. 5a). Large catchments are generally associated with a higher variety in depositional environments, producing a diversity in palaeolimnological evidence that is more likely to survive for extended periods of time. Larger basins may also get more attention from the scientific community, resulting in a higher likelihood of deep-drilling campaigns or intensive geomorphological surveys. The largest catchments in the database are of such dimensions that their water levels are no longer driven solely by local climate but follow an integrated response to distinct climate patterns within several climate zones (e.g., ). On average more status classes were defined for long histories from large catchments (Fig. 5a). This might reflect a higher sensitivity to moisture-balance change that comes with large, climatologically and ecologically diverse catchments and higher catchment to lake area ratios (Street, 1980;Olaka et al., 2010), but it is also likely that longer records contain evidence of higheramplitude climate variability justifying more status classes. Shorter records typically have more status classes defined per unit of time (not shown) and the number of defined status episodes increases closer to the present (Fig.   5b), reflecting the different resolutions typically employed when studying archives of different length.

Application of joint MC-RSEOF analysis
We applied our joint MC-RSEOF framework (Fig. 3) to extract the dominant histories of the studied lakes across ESA during the past 20,000 years. Very few sites completely cover this time window: some sites have continuous histories but do not extend back to 20,000 years, others' histories are fragmentary across that time window. The RSEOF method allows for incomplete datasets, but the meaning of the PCs is directly related to the amount of information available in the system. Temporal coverage determines how complete a certain lake-status history is over the studied period, while the number of status classes reflects the sensitivity of a lake to undergo significant change and to record that change. The number of included lakes is also likely to have an influence on the PCs.
To determine to what extent the outcomes of our analyses are affected by temporal coverage, sensitivity to environmental change, data resolution and the number of the included lake-status records, we investigated different combinations of these site-selection criteria. Temporal coverage of the 20,000-0 cal yr BP window was set to 10, 25, 33, 50, 75 or 100 %, and the minimum required number of status episodes covering (part of) that window was set to 2 or 4, which resulted in the number of sites meeting these criteria, and thus being retained for subsequent RSEOF analyses, varying between 7 and 45 (Fig. 8a). does not change significantly when the time-coverage criterion is changed between 33 and 100 %, i.e. the advantage of using more complete records is balanced by the reduction in their number. The time series of EOF1 is relatively consistent across tested subsets, despite the EOFs being based on a variable number of lake sites (Fig.   8b). This supports the notion that EOF1 reflects lake-status behaviour that is relatively consistent across the study region. Fig. 9 highlights the RSEOF results for the subset of sites that cover at least 33% of 20,000-0 cal BP and for which at least 4 status episodes were defined during this time window. The scree plot (Fig. 9b) shows eigenvalues levelling from EOF3 onward, and North's rule of thumb identifies only the first 2 EOFs as non-overlapping. The first EOF explains 40.3 ± 3.8% of the variability in lake status across the MC ensemble of 10,000 simulations, or 43.6% in the best-estimates scenario. When interpreted as a trajectory of lake status, sites that score positively on this EOF tend to have low lake status at 20,000 cal yr BP, culminating in a minimum around c. 17,500-16,500 cal yr BP (Fig. 9c, d). Afterwards, lake status increases towards the start of the Holocene, although with a temporary reversal of this trend between c. 13,000-12,000 cal yr BP. A prolonged maximum in lake status is registered between c. 11,000-5,000 cal yr BP. After c. 4,000 cal yr BP, levels are intermediate between those of the start of the record and those of the early Holocene. All the sites that correlate positively to EOF1 are located in the EARS and Madagascar (Fig. 9a, c). Sites negatively correlated to EOF1 (i.e. showing opposite trends in lake status) are confined to interior and western southern Africa, with the exception of Lake Masoko in the southern Rift. The EOF biplot also shows the distinction between interior southern Africa and East Africa along the first axis, while homogeneous distribution among negatively scoring subregions reveals little to no spatial pattern within the EARS (Fig. 9a). The consistency of EOF1 across experiments with differential site selection (Fig. 8b), as well as the relatively narrow 95-% uncertainty envelope that is similar to EOF1 of the best-estimate lake histories (Fig. 9d),

Millennial
suggest that the dominant millennial-scale lake-status patterns in ESA over the last 20,000 years are widely represented and robust against the combined influence of chronological and status uncertainties. The first EOF shows that lake status variability over the last 20,000 years is dominated by a coherent pattern of regionally opposed behaviour between the EARS and interior southern Africa.
The late-Glacial portion of EOF1 is remarkably similar to the climatic trend of the deglaciation in Africa according to transient palaeoclimate model simulations (Otto-Bliesner et al., 2014) and important palaeoclimate records from Africa (e.g., deMenocal 2000; Weldeab et al., 2007;Tierney and deMenocal 2013). Heinrich Stadial 1, which has been described as an extreme climatic event for the Afro-Asian monsoon region (Stager et al., 2011), is associated with minimum lake status across the EARS and high lake status in interior southern Africa between ca. 18,000-16,000 cal yr BP. From ca. 15,000 cal yr BP, lake-status EOF1 shows the increase of monsoonal precipitation over East Africa as a possible response to rising greenhouse-gas concentrations (Otto-Bliesner et al., 2014). This evolution is temporarily interrupted during the Younger Dryas (YD), a global event that is also captured in our EOF1 (Rasmussen et al., 2006;Carlson, 2013). With their almost uniformly positive loading on EOF1, lakes within the EARS had a high status during the early to mid-Holocene, concurrent with the widespread African Humid Period (Jolly et al., 1998;deMenocal et al., 2000;Shanahan et al., 2015). In contrast to northern Africa, where a similar early-to mid-Holocene humidity maximum was induced by the precessional expansion of the West-African monsoon (Kutzbach and Otto-Bliesner, 1982;Otto-Bliesner et al., 2014), the causes for tropical East Africa's similar moisture-balance behaviour during this episode remain unclear (Tierney et al., 2011;Liu et al., 2017;Reid et al., 2019;Wang et al., 2019).
The number of informative sites through time reflects the trends reconstructed by EOF1 (Fig. 7). The maximum of informative Kalahari sites is observed between 18,500-15,500 cal yr BP, as well as prolonged recording in the Namib between 17,000-11,000 cal yr BP, meaning that shorelines formed across the region during these times due to increased lake status. Afterwards, Holocene palaeoshorelines are less common across the interior of southern Africa, except for a short-lived increase in Kalahari sites around 6000-5500 cal yr BP. The aridification of southern-African drylands, mostly reflected in our database through absence of lake-status evidence, has been reconstructed from other data sources (Goudie and Thomas, 1986;Telfer and Thomas, 2006;Thomas, 2016, Chase et al. 2019. In contrast, the number of records from the EARS increases as lakes recovered from widespread lowstands or desiccation during HS1 (Stager et al., 2011). High numbers of sites reflect maximum positive moisture balance during the early Holocene, before a slight decrease during the mid Holocene and a final rise towards maximum numbers from ca. 4000 cal yr BP to the present.
More caution is necessary when interpreting higher-order EOFs of uncertain time series (e.g., Anchukaitis and Tierney, 2013). While significantly different from its nearest neighbours, EOF2 has large uncertainties and a low explanatory power (Fig. 9f), explaining 19.8 ± 1.9% of the total variance across all MC iterations, and 18.7% in the case when only best-estimate lake-status histories are used. The main difference between EOF2 and EOF1 time series is the Holocene section, where EOF2 values for the early to mid-Holocene are similar to those of the preceding millennia, and the last 4,000 years show the most negative values of the entire period (Fig. 9f). The uncertainty range of EOF2 is considerably wider than the uncertainty range of EOF1. While sites south of 10°S load positively on EOF2, there is no evident spatial pattern for the EARS (Fig. 9a, e). We were not able to identify a mechanism for this trend with such a scatter of positive and negative loadings across the EARS. However, its general trend and consistent correlation to sites across the Kalahari could support the general evolution towards drier conditions over the last 20,000 years.
The trends and events identified in ESA have been described in earlier studies based either on individual records or from syntheses of selected palaeoclimate records (e.g., Chase et al., 2010;Stager et al., 2011;Stone, 2014;Singarayer and Burrough, 2015). While this paper does not aim at exploring the causal climatological processes behind them, the coherence of our results with well-established climatic events as evident in independent records demonstrates that robust palaeoclimate trends can be identified from qualitative and discontinuous lake-status data when age-model and interpretation uncertainties are taken into account.

Discussion and recommendations
Any collection of palaeoenvironmental data is confronted with what can be called a 'synthesis dilemma': the choice between targeting a low number of high-quality records at the expense of spatial and/or temporal coverage, or including as much data as possible, thereby increasing the internal noise of the data compilation due to lower average data quality. For the former, the curation of records often requires subjective interpretations of what is considered 'high-quality', which is not without its own limitations. The lake-status synthesis method presented in this paper adopts a more conservative approach and includes as many lakes as possible. Lake-status histories are often fragmentary and of variable temporal resolution, making them challenging to incorporate in palaeoenvironmental syntheses that often focus on selected archives or records of higher continuity and/or resolution (e.g., Tierney et al., 2013;McKay and Kaufman, 2014;Steiger et al., 2018;Atsawawaranunt et al., 2019). Our ESA case study nevertheless demonstrates that robust palaeoclimate patterns can still be extracted when the proxy and dating uncertainties are taken into account, even from collections of lake-status histories containing highly discontinuous and uncertain time series. Most interestingly, our final product derived from our MC-RSEOF approach has relatively narrow uncertaintiescompared to the uncertainties of the composing records (see Fig. 1), thus highlighting its capacity to extract reliable signals, even from noisy data.
Uncertainty in relative lake status can have different causes, such as poorly documented sedimentation processes, uncertain translation of biological indicators to environmental change, and uncertainties comparing environmental indicators, especially when different indicators are used for different parts of a record. Consideration of uncertainty in the interpretation of a record is rarely taken into account in palaeoclimate-data syntheses, but can be important (e.g., Lohmann et al., 2013). The definition of minimum and maximum possible lake status, which together with the best estimate describe a triangular probability density function, is a practical solution to such challenges.
The incorporation of temporal uncertainty is crucial to assess past environmental variability ( to Anchukaitis and Tierney (2013), the instability introduced by temporal uncertainty is one of the main limits to recovering higher-order modes of variability from MC-sampled uncertain time series. When using our algorithm to study a fixed-boundary time window, chronological uncertainty not only affects the temporal data distribution but also the number of sites providing those data. In practice, the number of sites meeting a certain temporalcoverage criterion varies considerably between the 10,000 iterations of our last-20k experiments (Fig. 8a). While EOF1 exhibits clear similarities with the well-known succession of events that characterise the last 20,000 years, the meaning of EOF2 was less clear, and may support the results of Anchukaitis and Tierney (2013). Because our MC-RSEOF approach simultaneously integrates chronological and lake-status uncertainty, the uncertainty of its product is influenced by both factors. Since the lake-status classification system of the database is qualitative, all derived products should be treated as qualitative as well. However, quantitative information is available for a subset of the lakes of our database (Table 1) for more specific studies on climate-driven water budget.
In addition to the proposed methodology, our database format also offers major improvements in terms of uniformity and precision of lake-status chronologies over the GLSDB. The preservation of relevant information in a relational database format enables exploring and selecting sites based on sensitivity or temporal coverage, or other characteristics such as catchment size or type of chronological data. While we primarily focused on the past 20,000 years, different site selection based on density or other aspects of dating control could be advisable when studying shorter time scales or time windows. Similarly, caution is advised when studying poorly represented time windows or regions, where inclusion or exclusion of individual sites may significantly alter the resulting EOFs.
Our database format also improves upon the GLSDB in terms of uniformity and precision of lake-status chronologies. Almost one third of all reported radiocarbon dates were not calibrated upon original publication ( Fig. 6). Our updated chronologies commonly resulted in significant changes in comparison to their original counterparts. An important factor in radiocarbon-based chronologies is the dated material and, correspondingly, old-carbon effects. While our data compilation conserves information on reported carbon-reservoir effects, these effects are often not investigated in the original studies, which may cast additional uncertainty even on relatively high-resolution chronologies.
Our new database approach offers a new perspective on the use of lake-status data for comparative studies between reconstructions based on natural archives and palaeoclimate model simulations. Data-model comparisons have become an integral part of evaluating model performance (Kohfeld and Harrison, 2000;Harrison et al., 2014) but their effectiveness requires standardised quantification of data uncertainty (Harrison et al., 2016). Our new database significantly improves the incorporation of such uncertainties in assessing model simulations using lakestatus data and provides significant improvements over previous lake-status syntheses, such as the widely used GLSDB.
The utility of our analytical framework combining MC sampling of uncertainties and RSEOF identification of dominant modes of variability extends beyond lake-status data. Indeed, it can be applied to any palaeoenvironmental data set that comes in the form of time series with chronological, analytical and/or interpretational uncertainties. This opens new possibilities for more inclusive syntheses, employing a standardised methodology across all archive types, data sources and palaeoenvironmental variables.
However, to enable the development of such large-scale studies, adequate reporting of data is crucial. Most of the challenges involved in assembling the ESA lake-status database were related to incomplete reporting of data.
While we fully acknowledge that reporting data for purposes beyond the original goal of a paper can be complex and time-consuming, the following points emphasise key elements to consider in order to foster the best use of palaeolimnological data by independent groups: • Raw data (e.g., lake-level curves, stratigraphic profiles, time series of key indicators) were not always accessible. Where the information was not available from the data producers, we had to digitise the data from figures, which at best led to a loss of data quality, and possibly to a distortion of some signals.
• Stratigraphic data series were often only expressed against the calculated age and not versus original core or section depth. The inclusion of depth information, either in the original publication or in archived data sets, is essential for future updates of the chronologies (e.g., when the radiocarbon calibration curves are updated).
• In some cases, chronological data was only presented graphically. Dating information presented in tabular form allow for both a quick assessment of the raw measurements and their errors, and contribute to a better reuse of the data (Millard, 2014;Courtney Mustaphi et al., 2019).
• The details of the construction of the published chronology were sometimes incomplete. More details on the radiocarbon calibration curve used and/or the method of interpolation between multiple age points would contribute to a more accurate reproduction of the data.
• We encourage a clear definition and discussion of which environmental variables are reconstructed based on the available evidence.

Conclusion
Lake-status histories are important sources of information on past climate change, but their synthesis and translation to palaeoclimate change is challenging. We present a new relational lake-status database that combines updated chronologies with the incorporation of uncertainties. We have shown that the combination of MC sampling to integrate the full range of data uncertainty and RSEOF to extract the dominant modes of variability allows the identification of dominant historical trends, and an assessment of their robustness. The application of this approach to ESA identified the dominant millennial-scale palaeoclimate features of the last 20,000 years, consistent with the current understanding of environmental change in the region, including a tendency towards anti-phased lake-status behaviour between the EARS and interior and eastern southern Africa. These results highlight the potential contributions lake-status data can make to large-scale palaeoclimate syntheses, despite their often discontinuous and qualitative nature. The data and methods presented here serve as a proof-of-concept study for future large-scale syntheses of lake status in other regions of the world.

Data availability
The -The ESA lake-status database, in both SQLite and spreadsheet format.
-Description and constraints of all database fields.
-Documentation for all sites included in the database. This includes relevant background information of the site, presentation of original literature from which lake-status information was sourced, and discussion of consensus lake-status history.
-Catchment shapefiles of all included sites.
-R code for MC-RSEOF analysis on our database format.       Peer-reviewed preprint