2.2 Workflow overview

The synthesis process followed a five-step workflow: data downloading, data quality control and cleaning, data aggregation, gap-filling of the daily time series, and finally writing to NetCDF format. To extract the desired data, we carefully inspected the source websites for information about how the original data were measured, processed, and recorded. Our data cleaning and quality control procedures included scanning for unrealistic values and cross-checking data flag reports. After unrealistic values were removed, any time series recorded at sub-daily intervals were aggregated to daily time steps. Subsequently, three levels of gap-filling (interpolation, regression, and climate catalog; see Section 2.4) were applied to the daily data. The resulting data were stored in NetCDF format using a consistent structure and layout, together with metadata providing additional information, including variable units, station names, locations, and record lengths.

2.3 Data downloading and cleaning

For each site, we acquired (where available) time series of streamflow, precipitation, air temperature, solar radiation, relative humidity, wind direction, wind speed, SWE, snow depth, vapor pressure, soil moisture, soil temperature, and isotope values. To facilitate cross-watershed research and intercomparison of datasets, variable names and units were standardized following the format suggested by Addor et al. (2020) for large-sample hydrology datasets. As detailed in the data pipeline Jupyter Notebooks accompanying the CHOSEN database, we aggregated any hourly time series to daily values in one of two ways: cumulative variables were summed and rate variables were averaged.
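As a minimal illustration of this aggregation step, assuming the hourly records for one station are held in a pandas DataFrame with a datetime index (the column names below are hypothetical, not the exact CHOSEN variable names):

```python
import pandas as pd

# Hypothetical hourly records for one station; column names and values
# are illustrative only.
hourly = pd.DataFrame(
    {
        "precipitation": [0.0, 1.2, 0.4, 0.0],         # cumulative variable (mm)
        "air_temperature": [10.1, 11.3, 12.0, 12.4],   # rate/state variable (degC)
    },
    index=pd.date_range("2002-04-01 00:00", periods=4, freq="h"),
)

# Cumulative variables are summed to daily totals,
# rate variables are averaged to daily means.
daily = pd.DataFrame(
    {
        "precipitation": hourly["precipitation"].resample("D").sum(),
        "air_temperature": hourly["air_temperature"].resample("D").mean(),
    }
)
```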

2.4 Gap-filling methods

Gaps in the cleaned and aggregated daily data were filled using one of three techniques, depending on the length of the gap and the availability of complementary data. The first technique, linear interpolation between the two nearest non-missing values, was applied to gaps shorter than seven days, over which seasonal effects can be considered negligible. Longer gaps were filled by regression for catchments with multiple monitoring stations (Pappas et al., 2014). To implement the spatial regression, we first evaluated the correlation coefficients between the station with missing values and all other stations within the watershed, and then used the data from the station with the highest correlation coefficient to estimate the regression parameters. If the highest correlation coefficient was less than 0.7, or if no data were available from other stations over the same period, the missing values were reconstructed using the climate catalog technique. The climate catalog method filled gaps using data from the same site in a different year, specifically the year containing at least nine months of data whose correlation with the gap year was the highest and exceeded 0.7. For example, suppose a catchment's only streamflow gauge was missing all of April's measurements in 2002. In this case, we would first group the available data by year and calculate the correlation coefficients between the daily streamflows in 2002 and those in every other year. If the 2002 data correlated most strongly with the data from 2006, then the 2006 April 1st value replaced the missing April 1st 2002 value, with the addition of a Gaussian random number scaled by the standard deviation of all April 1st values across the years of record.
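As a minimal sketch of the spatial-regression step, assuming the watershed's daily series are held in a pandas DataFrame with one column per station (the function name and structure below are illustrative, not the actual CHOSEN implementation):

```python
import numpy as np
import pandas as pd

def fill_by_regression(df: pd.DataFrame, target: str, min_corr: float = 0.7) -> pd.Series:
    """Fill gaps in df[target] using the best-correlated neighbouring station.

    df holds daily series for all stations in a watershed (one column per
    station). Returns a copy of the target series with regression-based
    estimates where possible.
    """
    filled = df[target].copy()

    # Correlation between the target station and every other station,
    # computed on their overlapping (pairwise complete) records.
    corr = df.corr()[target].drop(target)
    if corr.empty or corr.max() < min_corr:
        return filled  # defer to the climate catalog method instead
    best = corr.idxmax()

    # Estimate regression parameters from days where both stations have data.
    both = df[[target, best]].dropna()
    slope, intercept = np.polyfit(both[best], both[target], deg=1)

    # Reconstruct target values on days where only the neighbour has data.
    gaps = filled.isna() & df[best].notna()
    filled[gaps] = slope * df.loc[gaps, best] + intercept
    return filled
```

The climate catalog step works analogously, except that the candidate predictors are other years of record at the same station rather than neighbouring stations, and the donor value is perturbed with the scaled Gaussian noise described above.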
Figure 2. Data pipeline and visualizations of cleaning methods: (a) interpolation, (b) regression, and (c) climate catalog.
To ensure the quality of the reconstructed data (whether interpolated, regressed, or based on the climate catalog), we deleted any reconstructed values that fell outside the thresholds originally used to detect unrealistic data. After all gap-filling methods were applied, a flag table was generated indicating the technique used to create each filled data point. All Python processing scripts are publicly available on GitLab (https://gitlab.com/esdl/chosen) and will be published on Zenodo (DOI: 10.5281/zenodo.4060384), along with a Jupyter Notebook tutorial.
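A minimal sketch of this post-reconstruction screening, assuming hypothetical plausibility bounds and illustrative flag labels (the actual thresholds and flag conventions are defined in the CHOSEN processing scripts):

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility bounds for one variable; the real thresholds
# are variable-specific and defined in the processing scripts.
LOWER, UPPER = 0.0, 5000.0

def screen_and_flag(filled: pd.Series, flags: pd.Series) -> tuple[pd.Series, pd.Series]:
    """Remove reconstructed values that fall outside the plausibility bounds.

    flags records how each value was produced, e.g. "observed",
    "interpolation", "regression", or "climate_catalog" (labels here are
    illustrative). Out-of-range reconstructed values are deleted and their
    flags reset to indicate a remaining gap.
    """
    reconstructed = flags != "observed"
    bad = reconstructed & ((filled < LOWER) | (filled > UPPER))
    screened = filled.mask(bad, np.nan)
    new_flags = flags.mask(bad, "missing")
    return screened, new_flags
```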

2.5 NetCDF data product

We stored and published the processed data in NetCDF format. NetCDF is emerging as the data standard for large-sample hydrology, as well as for other large-sample products across the geosciences, particularly climate science and remote sensing (Liu et al., 2016; Romañach et al., 2015; Signell et al., 2008). The NetCDF library is designed to read and write multi-dimensional scientific data in a well-structured manner: it supports multiple coordinate dimensions, which readily accommodate multiple measurement stations. The machine-independent format makes the data highly accessible and easily portable across computing platforms. Data (variables) and metadata (corresponding attributes) are intrinsically linked and stored in the same file, making the dataset self-documenting.
We generated one NetCDF file for each watershed to store its data and metadata. Each file contains four kinds of variables. Hydrometeorological variables are stored in two-dimensional arrays (i.e., time by location), along with flag variables of the same number and dimensions. The timestamp variable is a one-dimensional array of measurement dates and times. Lastly, a grid variable contains information about the gauges and monitoring stations, including their names, latitudes, and longitudes. The attributes include website links, units, full names, and record start and end dates (Figure 3).
Figure 3. Variables, corresponding dimensions, and attributes in the NetCDF files.
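As a minimal sketch of how one of these files can be accessed, for example with xarray (the file name, variable name, and station dimension below are assumed for illustration; the exact names are documented in each file's metadata):

```python
import xarray as xr

# Open one watershed's NetCDF file (path is illustrative).
ds = xr.open_dataset("example_watershed.nc")

print(ds.data_vars)            # hydrometeorological variables and their flag counterparts
print(ds["streamflow"].attrs)  # per-variable attributes: units, full name, record period, link

# Extract the daily series at the first monitoring station as a pandas Series.
station_dim = ds["streamflow"].dims[-1]  # station/location dimension
q = ds["streamflow"].isel({station_dim: 0}).to_series()
```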