Assessing the utility of synthetic data: A density ratio perspective, Netherlands
Keywords: data dissemination, synthetic data, quality of synthetic data
UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE
CONFERENCE OF EUROPEAN STATISTICIANS
Expert meeting on Statistical Data Confidentiality
26–28 September 2023, Wiesbaden
ASSESSING THE UTILITY OF SYNTHETIC DATA: A DENSITY RATIO PERSPECTIVE

Thom Benjamin Volker (Utrecht University, the Netherlands; Statistics Netherlands, the Netherlands)
Peter-Paul de Wolf (Statistics Netherlands, the Netherlands)
Erik-Jan van Kesteren (Utrecht University, the Netherlands)

[email protected], [email protected], [email protected]
Abstract Synthetic data can be a solution to reduce disclosure risks that arise when disseminating research data to the public. However, for the synthetic data to be useful for general inferential purposes, it is paramount that its distribution is similar to the distribution of the observed data. Often, data disseminators consider multiple synthetic data models and make refinements in an iterative fashion. After each adjustment, it is crucial to evaluate whether the quality of the synthetic data has actually improved. Although many methods exist to provide such an evaluation, their results are often incomplete or even misleading. To improve the evaluation strategy for synthetic data, and thereby the quality of synthetic data itself, we propose to use the density ratio estimation framework. Using techniques from this field, we show how an interpretable utility measure can be obtained from the ratio of the observed and synthetic data densities. We show how the density ratio estimation framework bridges the gap between fit-for-purpose and global utility measures, and discuss how it can also be used to evaluate analysis-specific utility. Using empirical examples, we show that density ratio estimation improves on existing (global) utility measures by providing higher statistical power and offering a fine-grained view of discrepancies between the observed and synthetic data. Moreover, we describe several additional advantages of the approach, such as providing a measure of utility on the level of individual synthetic data points, automatic model selection without requiring user specification, and readily available high-dimensional extensions. We conclude that density ratio estimation provides a promising framework in synthetic data generation workflows and present an R-package with functionality to implement the approach.
Introduction
In recent years, the academic interest in synthetic data has exploded. Synthetic data are increasingly being used as a solution to overcome privacy and confidentiality issues that are inherently linked to the dissemination of research data. National statistical institutes and other government agencies have started to disseminate synthetic data to the public while restricting access to the original data to protect sensitive information (e.g., Abowd, Stinson, and Benedetto 2006; Hawala 2008; Drechsler 2012). At the same time, researchers began to share synthetic versions of their research data to comply with open science standards (e.g., van de Wiel et al. 2023; Obermeyer et al. 2019; Zettler et al. 2021). Rather than sharing the original research data, a synthetic surrogate is shared to facilitate reviewing of the data processing and analysis pipeline. Additionally, synthetic data is increasingly being used for training machine learning models (Nikolenko 2021). On a lower level, synthetic data can be used in model testing pipelines (before access to the real data is provided), for data exploration, and for educational purposes.
At its core, the idea of synthetic data is to replace values from the observed data with new values that are generated from a model. In this way, it is possible to generate an entirely new synthetic data set (commonly referred to as the fully synthetic data approach; Rubin 1993), but also to replace just those values that are sensitive or that would yield a high risk of disclosure when released (an approach called partially synthetic data; Little 1993). Both approaches attempt to build a model that incorporates as much of the information in the real data as possible, given a pre-specified privacy risk level that is still deemed acceptable. The models used to generate synthetic data were originally closely related to methods used for multiple imputation of missing data, such as fully conditional specification (Volker and Vink 2021) or sequential regression (Nowok, Raab, and Dibben 2016). Recently, significant improvements in generative modelling sparked the scientific interest in synthetic data in the computer science community, leading to novel synthesis methods (e.g., Patki, Wedge, and Veeramachaneni 2016; Xu et al. 2019). Combined with work on formal privacy guarantees, this resulted in new models that explicitly control the level of privacy risk in synthesis methods (Jordon, Yoon, and Schaar 2019; Torkzadehmahani, Kairouz, and Paten 2019). Through both methodological advances and practical implementations, data synthesis has evolved into an increasingly popular approach to enhance data dissemination.
Regardless of these developments, the main challenge when generating synthetic data remains to adequately balance the privacy risk with the utility (i.e., quality) of the synthetic data. On the upper limit of this privacy-utility trade-off, the synthesis model captures the information in the observed data so precisely that the real data is exactly reproduced, resulting in the same privacy loss as when disseminating the real data. In statistical terms, the synthesis model is overparameterized to such an extent that there are no degrees of freedom left, and there is thus no randomness involved in the generation of the synthetic values. On the lower limit of the trade-off, synthetic values are generated without borrowing any information from the real data. For example, we could place the value 0 or a random draw from a standard normal distribution for every record and every variable, such that the synthetic data contains only noise. Synthetic data sets sit somewhere between these extremes: they contain some information from the real data, yielding some disclosure risk, but they also resemble the real data to some extent, yielding more than zero utility. Because not all information is captured, the utility of the synthetic data will always be lower than the utility of the real data. The question that naturally arises is where on the privacy-utility continuum the synthetic data is located: how much information is sacrificed, and which aspects of the real data are reproduced in the synthetic data. From the perspective of the data provider, it is important to know how informative the released data is, while the user wants to know whether their analysis can be reliably performed. Additionally, the data provider can use knowledge about the utility to finetune the synthesis model and improve the synthetic data quality.
To evaluate the utility of synthetic data, three classes of utility measures have been distinguished (for a thorough review of these measures, see Drechsler and Haensch 2023): fit-for-purpose measures, global utility measures, and analysis-specific utility measures. Fit-for-purpose measures are often the first step in assessing the quality of the synthetic data. They typically involve comparing the univariate distributions of the observed and synthetic data (for example using visualization techniques or goodness-of-fit measures). Although these measures provide an initial impression of the quality of the synthesis models used, this picture is by definition limited, because only one or two variables are assessed at the same time. Hence, complex relationships between variables will always be out of scope. Global utility measures build on the fit-for-purpose measures, but attempt to capture the quality of the entire multivariate distribution of the synthetic data relative to the observed data in a single, global, indicator. This can be done using some distance measure (e.g., the Kullback-Leibler divergence; see Karr et al. 2006), but also by estimating how well a prediction model can distinguish between the observed and synthetic data, using the predicted probabilities (propensity scores; Rosenbaum and Rubin 1983) as a measure of discrepancy (e.g., the propensity score mean squared error, 𝑝𝑀𝑆𝐸 ; Woo et al. 2009; Snoke et al. 2018). While global utility measures paint a
rather complete picture, and provide information over the entire range of the data, they tend to be too general. That is, global utility measures can be so broad that important discrepancies between the real and synthetic are missed, and a synthetic data set with high global utility might still yield analyses with results that are far from the results from real data analyses (see Drechsler 2022). Lastly, the analysis-specific utility measures quantify to what extent analyses performed on the synthetic data align with the same analyses on the observed data. These measures can, for example, evaluate how similar the coefficients of a regression model are (e.g., using the confidence interval overlap; Karr et al. 2006), or whether prediction models trained on the synthetic and observed data perform comparably in terms of evaluation metrics. However, analysis-specific utility generally does not carry over: high specific utility for one analysis does not at all imply high utility for another analysis. Since data providers typically do not know which analyses will be performed with the synthetic data, it is impossible to provide analysis-specific utility measures for all potentially relevant analyses (see also Drechsler 2022).
In this paper, we propose to use the framework of density ratio estimation (Sugiyama, Suzuki, and Kanamori 2012a) to place all of the above measures under a common umbrella. We show empirically that this approach performs at least as well as various existing utility measures, while providing a more fine-grained view of the misfit of the synthetic data. Moreover, the typically non-parametric nature of density ratio estimation in combination with automatic model selection mitigates the model specification burden of existing utility measures such as the 𝑝𝑀𝑆𝐸. In short, density ratio estimation compares the (multivariate) distributions of two data sets (e.g., two different samples or groups) by directly estimating the ratio of their densities. Crucially, this method does not estimate the densities of the observed and synthetic data separately and subsequently take their ratio, but estimates the density ratio directly, which has been shown to yield better performance (e.g., Kanamori, Hido, and Sugiyama 2009). The idea is that if two data sets are drawn from the same data-generating mechanism, the sampled data should be similar, and the ratio of their densities should be close to one over the entire multivariate space. This approach readily extends from univariate to bivariate and multivariate densities, and thus bridges the gap between fit-for-purpose and global utility measures. Additionally, we briefly discuss how density ratio estimation can be used to compare the distributions of parameters of observed and synthetic data, to incorporate analysis-specific utility measures as well. Hence, we show that it is a versatile approach that is useful in the entire domain of data utility.
On the privacy side, too, several promising advances have been made to quantify the amount of information leakage through synthetic data. Important work has been done to build formal privacy guarantees into the synthesis models through differential privacy (Dwork 2006). In addition to these privacy-by-design mechanisms, some measures exist to quantify the privacy loss of synthetic data after generation (e.g., McClure and Reiter 2016; Reiter and Mitra 2009; Hu 2019). However, the practical applicability of these measures depends on whether the data is fully or partially synthetic, and especially in the case of the former, their practical applicability is often limited (for an extensive discussion of these issues, see Drechsler and Haensch 2023). More research on measures to evaluate disclosure risks in synthetic data is thus certainly needed, but in this paper we focus exclusively on measuring utility.
In what follows, we describe the density ratio estimation framework by summarizing some of the work in this area, and show how it provides a useful framework for measuring utility of synthetic data. Subsequently, we illustrate how the method can be used in practice by providing multiple examples, and empirically compare its performance to existing utility measures. Lastly, we discuss how density ratio estimation relates to existing utility measures, describe current shortcomings of the approach and relate these shortcomings to avenues for future work.
Density ratio estimation
The density ratio estimation framework was originally developed in the machine learning community for the comparison of two probability distributions (for an overview, see Sugiyama, Suzuki, and Kanamori 2012a). The framework has been shown to be applicable to prediction (Sugiyama et al. 2010; Sugiyama 2010), outlier detection (Hido et al. 2008), change-point detection in time-series (Liu et al. 2013), importance weighting under domain adaptation (i.e., sample selection bias; Kanamori, Hido, and Sugiyama 2009), and, importantly, two-sample homogeneity tests (Sugiyama, Suzuki, et al. 2011). The general idea of density ratio estimation is depicted in Figure 1, and boils down to comparing two distributions by modelling the density ratio 𝑟 (𝒙) between the probability distributions of the numerator samples, taken from the synthetic data distribution, 𝑝𝑠𝑦𝑛 (𝒙), and the denominator samples, taken from the observed data distribution, 𝑝𝑜𝑏𝑠 (𝒙), such that
$$r(\boldsymbol{x}) = \frac{p_{syn}(\boldsymbol{x})}{p_{obs}(\boldsymbol{x})}. \quad (1)$$
Figure 1. Example of the density ratio of two normal distributions with different means and variances (i.e., 𝑁 (0, 1) and 𝑁 (1, 2)). Note that the density ratio is itself not a proper density.
This specification has the intuitive interpretation that if the density ratio is large, too many synthetic values will be generated in that region, whereas if the density ratio is small, there will be too few synthetic observations, both relative to the observed data. An intuitive approach to estimating 𝑟 (𝒙) from samples of 𝑝𝑜𝑏𝑠 (𝒙) and 𝑝𝑠𝑦𝑛 (𝒙) would be to estimate the observed and synthetic data density separately, for example using kernel density estimation (Scott 1992), and subsequently compute the ratio of these estimated densities. However, density estimation is one of the hardest tasks in statistical learning, unavoidably leading to estimation errors for both densities. When subsequently taking the ratio of the estimated densities, the estimation errors might be magnified, resulting in a poorer estimate of the density ratio than necessary as compared to direct estimation. An alternative is to specify and estimate a model directly for the ratio without first estimating the separate densities. Extensive simulations on a wide variety of tasks showed that this approach typically outperforms density ratio estimation through naive kernel density estimation, especially when the dimensionality of the data increases (e.g., Kanamori, Suzuki, and Sugiyama 2012; Hido et al. 2008; Kanamori, Hido, and Sugiyama 2009).
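As a concrete illustration of Figure 1 and of the instability of the two-step approach, the R sketch below computes the density ratio of the two normal densities analytically and via two separately estimated kernel densities. It assumes, as in Figure 1, that the second parameter denotes a variance, and it takes N(1, 2) as the numerator density; object names are illustrative.

# Analytic density ratio of the Figure 1 example versus a naive two-step estimate
x_grid <- seq(-4, 6, length.out = 200)
r_true <- dnorm(x_grid, mean = 1, sd = sqrt(2)) / dnorm(x_grid, mean = 0, sd = 1)

set.seed(1)
num <- rnorm(500, mean = 1, sd = sqrt(2))   # numerator (synthetic-like) sample
den <- rnorm(500, mean = 0, sd = 1)         # denominator (observed-like) sample
d_num <- density(num, from = -4, to = 6, n = 200)
d_den <- density(den, from = -4, to = 6, n = 200)
r_kde <- d_num$y / d_den$y                  # ratio of two separately estimated densities

plot(x_grid, r_true, type = "l", xlab = "x", ylab = "Density ratio")
lines(x_grid, r_kde, lty = 2)               # typically erratic in the tails, where the
                                            # denominator density is close to zero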
Over the past years, several methods for direct density ratio estimation have been developed. Typically, these methods aim to minimize some discrepancy $\mathcal{D}(\hat{r}(\boldsymbol{x}), r(\boldsymbol{x}))$ between the true density ratio and some density ratio model. One commonly used discrepancy measure is the following squared error

$$S_0(\hat{r}(\boldsymbol{x}), r(\boldsymbol{x})) = \frac{1}{2} \int \big(\hat{r}(\boldsymbol{x}) - r(\boldsymbol{x})\big)^2\, p_{obs}(\boldsymbol{x})\, d\boldsymbol{x}, \quad (2)$$
which can be considered as the expected discrepancy between the two functions over the density of the observed data. One could also use other discrepancy measures, such as the binary or unnormalized Kullback-Leibler divergence or Basu’s power divergence (which are all members of the family of Bregman divergences; for a detailed discussion, see Sugiyama, Suzuki, and Kanamori 2012b). It is convenient to model the density ratio with a linear model, such that
$$\hat{r}(\boldsymbol{x}) = \boldsymbol{\varphi}(\boldsymbol{x})\boldsymbol{\theta}, \quad (3)$$
where 𝝋(𝒙) is a non-negative basis function vector that transforms the data from a 𝑝-dimensional to a 𝑏-dimensional space, and 𝜽 is a 𝑏-dimensional parameter vector. Although the model is linear in its parameters, the density ratio itself is a non-linear function of the data if 𝝋(𝒙) is a non-linear transformation of the data, which it typically is.
To illustrate the idea of density ratio estimation, we briefly review one method from the field: unconstrained least squares importance fitting (Kanamori, Hido, and Sugiyama 2009), which will also be used in our illustrations in the upcoming section. The authors show that the squared error can be rewritten as
$$
\begin{aligned}
S_0(\hat{r}(\boldsymbol{x}), r(\boldsymbol{x})) &= \frac{1}{2} \int \hat{r}(\boldsymbol{x})^2 p_{obs}(\boldsymbol{x})\, d\boldsymbol{x} - \int \hat{r}(\boldsymbol{x}) r(\boldsymbol{x}) p_{obs}(\boldsymbol{x})\, d\boldsymbol{x} + \frac{1}{2} \int r(\boldsymbol{x})^2 p_{obs}(\boldsymbol{x})\, d\boldsymbol{x} \\
&= \frac{1}{2} \int \hat{r}(\boldsymbol{x})^2 p_{obs}(\boldsymbol{x})\, d\boldsymbol{x} - \int \hat{r}(\boldsymbol{x}) p_{syn}(\boldsymbol{x})\, d\boldsymbol{x} + C, \quad (4)
\end{aligned}
$$
where 𝑟 (𝒙) in the second term on the first line is rewritten in terms of the ratio of 𝑝𝑠𝑦𝑛 (𝒙) over 𝑝𝑜𝑏𝑠 (𝒙). After dropping the irrelevant (with respect to the data) constant 𝐶, and substituting the density ratio model as defined in Equation 3, we have
$$S(\hat{r}(\boldsymbol{x}), r(\boldsymbol{x})) = \frac{1}{2} \int \boldsymbol{\theta}' \boldsymbol{\varphi}(\boldsymbol{x})' \boldsymbol{\varphi}(\boldsymbol{x}) \boldsymbol{\theta}\, p_{obs}(\boldsymbol{x})\, d\boldsymbol{x} - \int \boldsymbol{\varphi}(\boldsymbol{x}) \boldsymbol{\theta}\, p_{syn}(\boldsymbol{x})\, d\boldsymbol{x} \quad (5)$$
as the objective function. The integrals in Equation 5 are typically not available, but can be replaced by empirical averages, such that
$$\hat{S}(\hat{r}(\boldsymbol{x}), r(\boldsymbol{x})) = \frac{1}{2}\, \boldsymbol{\theta}' \left( \frac{1}{n_{obs}} \boldsymbol{\varphi}(\boldsymbol{x}_{obs})' \boldsymbol{\varphi}(\boldsymbol{x}_{obs}) \right) \boldsymbol{\theta} - \left( \frac{1}{n_{syn}} \boldsymbol{\varphi}(\boldsymbol{x}_{syn})' \boldsymbol{1}_{n_{syn}} \right)' \boldsymbol{\theta}. \quad (6)$$
It follows directly that the parameter vector 𝜽 can be estimated as
$$\hat{\boldsymbol{\theta}} = \left( \frac{1}{n_{obs}} \boldsymbol{\varphi}(\boldsymbol{x}_{obs})' \boldsymbol{\varphi}(\boldsymbol{x}_{obs}) \right)^{-1} \left( \frac{1}{n_{syn}} \boldsymbol{\varphi}(\boldsymbol{x}_{syn})' \boldsymbol{1}_{n_{syn}} \right), \quad (7)$$
which shows the least-squares nature of the problem. Because one would expect the density ratio to be non-negative, a non-negativity constraint for $\boldsymbol{\theta}$ can be added to the optimization problem, which would yield a convex quadratic optimization problem that can be solved with dedicated software. However, ignoring the non-negativity constraint has the advantage that the minimizer of Equation 6 has an analytical expression, which is numerically stable and computationally very efficient. The corresponding downside of possibly obtaining negative estimated density ratio values can be remedied by setting negative values of $\hat{r}(\boldsymbol{x})$ to 0.
From here, we are left with two remaining tasks. First, one typically wants to add a regularization parameter 𝜆 to the objective function to prevent overfitting and ensure positive-definiteness. In the unconstrained realm, a ridge penalty (𝜆/2)𝜽 ′𝜽 is typically added to the optimization problem in Equation 6. Adding this to the solution in Equation 7 yields
$$\hat{\boldsymbol{\theta}} = \left( \frac{1}{n_{obs}} \boldsymbol{\varphi}(\boldsymbol{x}_{obs})' \boldsymbol{\varphi}(\boldsymbol{x}_{obs}) + \lambda \boldsymbol{I}_b \right)^{-1} \left( \frac{1}{n_{syn}} \boldsymbol{\varphi}(\boldsymbol{x}_{syn})' \boldsymbol{1}_{n_{syn}} \right), \quad (8)$$
where 𝑰𝑏 denotes a 𝑏 × 𝑏 identity matrix. The regularization parameter 𝜆 can be chosen via cross-validation. Conveniently, the leave-one-out cross-validation score can also be computed analytically when using unconstrained least-squares importance fitting (see Section 3.4 in Kanamori, Hido, and Sugiyama 2009). Second, we need to specify the basis functions used in the density ratio model. A common choice is to use a Gaussian kernel, which quantifies the similarity between observations as
$$\boldsymbol{\varphi}(\boldsymbol{x}) = \boldsymbol{K}(\boldsymbol{x}, \boldsymbol{c}) = \exp\left( -\frac{\lVert \boldsymbol{x} - \boldsymbol{c} \rVert^2}{2\sigma^2} \right), \quad (9)$$
where $\boldsymbol{c}$ denotes the Gaussian centers and $\sigma$ controls the kernel width. The bandwidth parameter $\sigma$ can also be selected using cross-validation. Typically, a subset of the numerator samples is chosen as the Gaussian centers, because the density ratio tends to take large values at locations where the numerator density has more mass than the denominator density. To estimate the density ratio accurately, we may use many kernels where the density ratio is expected to be large, whereas few kernels might suffice in the locations where the density ratio is small. Hence, we place many kernels where the synthetic data density is large, by taking a sample of the synthetic records as Gaussian centers, with the number of centers $n_c$ dependent on the computational resources available (but typically $\min(100, n_{syn}) \leq n_c \leq \min(1000, n_{syn})$).
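To make the estimator concrete, a minimal, self-contained R sketch of this procedure is given below. It is not the densityratio package implementation: the helper name ulsif_sketch() is ours, and the kernel width and regularization parameter are fixed rather than selected by cross-validation.

# Minimal uLSIF sketch (Equations 8 and 9); sigma and lambda fixed for illustration
ulsif_sketch <- function(x_syn, x_obs, n_centers = 100, sigma = 1, lambda = 0.1) {
  x_syn <- as.matrix(x_syn)
  x_obs <- as.matrix(x_obs)
  # Gaussian centers: a random subset of the synthetic (numerator) samples
  centers <- x_syn[sample(nrow(x_syn), min(n_centers, nrow(x_syn))), , drop = FALSE]
  gauss <- function(x, ctrs, sigma) {
    # Gaussian kernel basis functions phi(x), one column per center (Equation 9)
    d2 <- outer(rowSums(x^2), rowSums(ctrs^2), "+") - 2 * x %*% t(ctrs)
    exp(-d2 / (2 * sigma^2))
  }
  phi_obs <- gauss(x_obs, centers, sigma)          # n_obs x b basis matrix
  phi_syn <- gauss(x_syn, centers, sigma)          # n_syn x b basis matrix
  H <- crossprod(phi_obs) / nrow(x_obs)            # (1 / n_obs) * phi' phi
  h <- colMeans(phi_syn)                           # (1 / n_syn) * phi' 1
  theta <- solve(H + lambda * diag(ncol(H)), h)    # ridge-regularised solution (Equation 8)
  rhat <- function(x) {
    # estimated density ratio, truncated at zero
    pmax(gauss(as.matrix(x), centers, sigma) %*% theta, 0)
  }
  list(theta = theta, centers = centers, sigma = sigma, rhat = rhat)
}

For a univariate sample, fit <- ulsif_sketch(x_syn, x_obs) followed by fit$rhat(grid) evaluates the estimated ratio over a grid of values.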
After estimating the density ratio, one can assess whether the numerator and denominator densities differ significantly via a permutation test. To this end, Sugiyama, Suzuki, et al. (2011) propose a two-sample test that quantifies the discrepancy between the numerator (synthetic) and denominator (observed) samples through the density ratio, using the Pearson divergence P(𝑝𝑠𝑦𝑛 (𝒙), 𝑝𝑜𝑏𝑠 (𝒙)) as a test statistic:
$$\hat{P}(p_{syn}(\boldsymbol{x}), p_{obs}(\boldsymbol{x})) = \frac{1}{2 n_{syn}} \sum_{i=1}^{n_{syn}} \hat{r}\big(\boldsymbol{x}^{(i)}_{syn}\big) - \frac{1}{n_{obs}} \sum_{j=1}^{n_{obs}} \hat{r}\big(\boldsymbol{x}^{(j)}_{obs}\big) + \frac{1}{2}. \quad (10)$$
Intuitively, this discrepancy captures how different the synthetic data is from the observed data by measuring the distance from the density ratio at the observed data points to the density ratio at the synthetic data points. As we show in our empirical examples, this statistic is difficult to interpret in an absolute sense. However, we show that it is useful as a relative measure of fit of the different synthetic data sets. Additionally, the value of the test statistic can be used to construct a hypothesis test for the lack of fit of the synthetic data using a permutation test. An empirical 𝑝-value can then be calculated as the proportion of test statistics under the null model that are greater than the observed test statistic. In this way, it can be assessed whether the synthetic data model is misspecified, by comparing the observed value to what can be expected under a correctly specified synthesis model.
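The test can be sketched as follows, reusing the ulsif_sketch() helper introduced above; this is again an illustration rather than the densityratio implementation.

# Pearson divergence (Equation 10) and a permutation test for the fit of the synthetic data
pearson_divergence <- function(rhat, x_syn, x_obs) {
  mean(rhat(x_syn)) / 2 - mean(rhat(x_obs)) + 1 / 2
}

permutation_test <- function(x_syn, x_obs, n_perm = 100) {
  fit  <- ulsif_sketch(x_syn, x_obs)
  stat <- pearson_divergence(fit$rhat, x_syn, x_obs)
  pooled <- rbind(as.matrix(x_syn), as.matrix(x_obs))
  n_syn  <- nrow(as.matrix(x_syn))
  perm_stats <- replicate(n_perm, {
    idx <- sample(nrow(pooled))                   # randomly reassign the group labels
    syn_star <- pooled[idx[1:n_syn], , drop = FALSE]
    obs_star <- pooled[idx[-(1:n_syn)], , drop = FALSE]
    fit_star <- ulsif_sketch(syn_star, obs_star)
    pearson_divergence(fit_star$rhat, syn_star, obs_star)
  })
  # empirical p-value: share of permuted statistics at least as large as the observed one
  list(statistic = stat, p_value = mean(perm_stats >= stat))
}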
Density ratio estimation as a utility measure: Simulated and empirical examples
In this section, we illustrate density ratio estimation using unconstrained least-squares importance fitting. In a small simulation, we show that the method gives reasonable results when the goal is to estimate a density ratio in several parametric examples. Subsequently, we use these examples to show how the results of density ratio estimation can be used as a measure of utility, and we describe how a lack of fit of the synthesis model can be inferred from the density ratio. Starting with univariate examples, we compare the density ratio two-sample test with existing goodness-of-fit measures (the Kolmogorov-Smirnov test and the 𝑝𝑀𝑆𝐸). As a final illustration, we build upon the work by Drechsler (2022), and showcase how density ratio estimation improves upon utility assessment through the 𝑝𝑀𝑆𝐸 in a multivariate example. All analyses were conducted in R (Version 4.3.0; R Core Team 2023), and the code is available on GitHub. The software used to perform density ratio estimation is implemented in an R-package called densityratio (Volker 2023).
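In its most basic form, the package interface mirrors Equation 1, with the synthetic data as the numerator samples and the observed data as the denominator samples. A minimal sketch is shown below; the exact arguments and printed output of ulsif() and summary() may differ, and synthetic_cps and observed_cps are placeholder object names.

library(densityratio)

# numerator = synthetic data, denominator = observed data, as in Equation 1
fit <- ulsif(synthetic_cps, observed_cps)
summary(fit)   # inspect the fitted density ratio model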
Density ratio estimation in simulated univariate examples
To provide an intuition about the performance of unconstrained least-squares importance fitting, we apply it to a simplified example of a typical situation in the synthetic data field. When creating synthetic data, we often have a complex, usually unknown, data distribution that we want to approximate with a model. We generally lack information to correctly model real-world phenomena, and even if we had sufficient information, some important factors might be missing from the data, or the model might be so complex that it is infeasible to actually simulate data from it. For the sake of clarity of illustration, we generate univariate data according to four true data-generating mechanisms:
1. Laplace($\mu = 1$, $b = 1$)
2. Log-normal($\mu_{\log} = \log\{\mu^2 / \sqrt{\mu^2 + \sigma^2_x}\}$, $\sigma^2_{\log} = \log\{1 + \sigma^2_x / \mu^2\}$), with $\mu = 1$ and $\sigma^2_x = 2$
3. Location-scale $t$-distribution $lst(\mu = 1, \tau^2 = 1, \nu = 4)$
4. Normal($\mu = 1$, $\sigma^2_x = 2$)
Note that these four distributions all have the same population mean $\mu = 1$ and the same population variance $\sigma^2_x = 2$. From each distribution, we generate 200 data sets of size $n_{obs} = 250$. For all scenarios, we approximate the true data-generating mechanism by drawing 200 data sets of size $n_{syn} = 250$ from a normal distribution (Normal($\mu = 1$, $\sigma^2_x = 2$)), such that we accurately model the mean and variance of each true data-generating distribution (see also Figure 2 for a graphical depiction of the true and synthetic data densities). Note that in the fourth scenario, we thus model the true data-generating distribution correctly, which is included to get some intuition on how density ratio estimation performs when we specify the synthesis model correctly. All density ratios were estimated with the exact same model specifications: we used 100 observations from the synthetic data as Gaussian centers and performed cross-validation over 10 values of the Gaussian kernel width $\sigma$ and 10 values of the regularization parameter $\lambda$.
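The data-generating step of this simulation can be sketched as follows; the seed and object names are illustrative, the Laplace draws use a difference of exponentials to avoid an extra dependency, and the parameterizations follow the list above.

# Data generation for the four scenarios (all with mean 1 and variance 2)
set.seed(123)
n  <- 250
mu <- 1
v  <- 2

x_laplace <- mu + rexp(n) - rexp(n)            # Laplace(1, 1): variance 2 * b^2 = 2
meanlog   <- log(mu^2 / sqrt(mu^2 + v))
sdlog     <- sqrt(log(1 + v / mu^2))
x_lnorm   <- rlnorm(n, meanlog, sdlog)         # log-normal with mean 1, variance 2
x_lst     <- mu + rt(n, df = 4)                # lst(1, 1, 4): variance nu / (nu - 2) = 2
x_norm    <- rnorm(n, mu, sqrt(v))             # Normal(1, 2)

x_syn <- rnorm(n, mu, sqrt(v))                 # synthesis model used in every scenario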
Figure 3 shows how the estimated density ratios for the 200 simulated datasets in each scenario (the blue lines in each subfigure) compared to the true density ratios (the black lines). In each of the four figures, the estimated density ratios follow the general trend of the true density ratios. In the top-left plot, showing the ratio of the normal distribution over the Laplace distribution, the density ratio decreases at the sides, then increases when moving towards the center, but decreases again close to the center. The same can be observed in the bottom-left plot, which shows the normal distribution over the 𝑙𝑠𝑡-distribution. In the top right panel, the estimated density ratios are typically large for negative values, very close to zero (or even negative) around the peak of the log-normal distribution, and subsequently increasing and later on decreasing again. In the bottom right panel, where both distributions are identical, the majority of the estimated density ratios are very flat, tending towards zero to some extent at the edges of the figure where only few data points are located. Moreover, all figures show some highly variable estimated density ratios due to modest overfitting regardless of the cross-validation scheme, whereas the normal versus log-normal figure shows many highly variable estimates outside of the center of the figure, due to the fact that either the synthetic or the observed data has only few cases in these regions. Normally, the stability of the estimates increases with the sample size. Figure 3 also shows one of the main advantages of density ratio estimation as a utility measure, in the sense that it provides a quantification of the fit for every data point. At those locations where the estimated density ratio takes large values, there are too many synthetic observations compared to what should be expected based on the observed data, whereas at the points where the density ratio is close to zero, there are too few synthetic observations relative to the observed data. Likewise, a high density ratio value for a synthetic record indicates that this point deviates from what would be typical under the observed data-generating mechanism.
Figure 2. True and synthetic data densities for the examples considered (Laplace, Log-normal, $t$ and Normal); all distributions have mean $\mu = 1$ and variance $\sigma^2_x = 2$. Note that the true and synthetic data densities in the bottom right plot are completely overlapping.
As it is hard to infer from visualizations whether the misfit could arise from chance alone, or whether the synthetic data model is misspecified, we formally evaluate the fit of the synthetic data by performing statistical inference using the Pearson divergence as a measure of discrepancy (see Equation 10). To explore the properties of the corresponding permutation test, we compare it in terms of power and Type I error rate with the Kolmogorov-Smirnov test and with a 𝑝𝑀𝑆𝐸-based test, obtained by performing a permutation test and assessing the proportion of times the permuted 𝑝𝑀𝑆𝐸s are larger than the observed 𝑝𝑀𝑆𝐸 (Snoke et al. 2018). The 𝑝𝑀𝑆𝐸s are calculated by using the utility.tab() function with default settings from the R-package synthpop (Nowok, Raab, and Dibben 2016). Table 1 shows that in terms of evaluating the misfit of the synthetic data, the density ratio-based test has statistical power similar to the 𝑝𝑀𝑆𝐸-based test. That is, when the synthetic data model differs from the observed data-generating mechanism, the density ratio-based test and the 𝑝𝑀𝑆𝐸-based test indicate significant misfit in approximately 60% of the simulations for the Laplace data, 100% for the log-normal data and 50% for the location-scale 𝑡-distributed data. Both methods achieve considerably higher power than the Kolmogorov-Smirnov test. When the synthesis model is correctly specified, all three methods achieve a nominal Type I error rate close to 0.05.
Table 1. Proportion of significant tests for the fit of the synthetic data.
Data          Density ratio   Kolmogorov-Smirnov   𝑝𝑀𝑆𝐸
Laplace       0.620           0.375                0.610
Log-normal    1.000           1.000                1.000
𝑙𝑠𝑡           0.495           0.235                0.495
Normal        0.050           0.045                0.040
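For a single replicate, the two competing tests can be computed along the following lines. Note that the pMSE below is a hand-rolled logistic-regression version shown only for illustration, whereas the results in Table 1 are based on synthpop's utility.tab(); x_obs and x_syn continue the simulation sketch above.

# Competing tests for one replicate (here the Laplace scenario)
x_obs <- x_laplace

ks.test(x_obs, x_syn)                          # two-sample Kolmogorov-Smirnov test

pmse <- function(x_obs, x_syn) {
  dat <- data.frame(x = c(x_obs, x_syn),
                    syn = rep(c(0, 1), c(length(x_obs), length(x_syn))))
  p <- fitted(glm(syn ~ x, family = binomial, data = dat))   # propensity scores
  c_prop <- length(x_syn) / nrow(dat)                        # expected propensity score
  mean((p - c_prop)^2)
}

obs_pmse <- pmse(x_obs, x_syn)
n_obs    <- length(x_obs)
pooled   <- c(x_obs, x_syn)
perm_pmse <- replicate(500, {
  idx <- sample(length(pooled))
  pmse(pooled[idx[1:n_obs]], pooled[idx[-(1:n_obs)]])
})
mean(perm_pmse >= obs_pmse)                    # empirical p-value of the pMSE test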
Figure 3. Estimated density ratios by unconstrained least-squares importance fitting in four univariate examples: A Laplace distribution, a log-normal distribution, a 𝑡-distribution and a normal distribution, all approximated by a normal distribution with the same mean and variance as the original distributions.
Density ratio estimation for synthetic Current Population Survey data
To evaluate the properties of density ratio estimation in a real-life example, we repeat Drechsler’s (2022) illustration of the 𝑝𝑀𝑆𝐸 as a fit-for-purpose and global utility measure with a subset of the March 2000 U.S. Current Population Survey, but now using density ratio estimation. Notably, we use the exact same (default) density ratio model specifications as in the previous simulations, both for evaluating the utility of the variables separately and over the synthetic data sets as a whole. We use exactly the same data as Drechsler (2022), that is, the variables Sex, Race, Marital Status, Highest attained education level, Age, Social security payments, Household property taxes and Household income, measured on 𝑛 = 5000 individuals (descriptive statistics are provided in Table 2 and a graphical depiction of the numeric variables is shown in Figure 6, both in Appendix A). Note that the continuous variables are typically non-normal, while the variables Household property taxes and Social security payments in addition have a point-mass at zero. Data synthesis is done using the R-package synthpop (Nowok, Raab, and Dibben 2016), using the same synthesis strategies as used in Drechsler (2022). We thus refer to this paper for details about the synthesis strategies, and only briefly describe the synthesis models here. Three of the synthetic data sets are created using parametric models: Sex and Race are synthesized with logistic regression models, Marital status and Highest attained education are synthesized using multinomial regression, and all continuous variables are synthesized using a linear model. The parametric synthesis models build up in complexity in how they model the continuous variables in the following way: the first model (labelled naive) does not take the distributions of the variables into account, and models the variables on the original scale; the second model (called transformed) transforms the variables by taking their cubic root and subsequently applies a linear model to the transformed variables; the third model (labelled semi-continuous) also transforms all variables to the cubic root scale, but in addition separately models the point mass at zero for the variables Household property taxes and Social security payments separately, after which a linear model is used for the non-zero values. The non-parametric synthesis model applies classification and regression trees (CART) to all variables, augmented by smoothing through kernel density estimation in the terminal nodes. For all strategies, 𝑚 = 5 synthetic data sets are generated, and the utility is assessed by averaging the Pearson divergence over those sets. As noted
by Drechsler (2022), based on the sequential refinements of the parametric synthesis models, one would expect the utility to improve with every parametric model, leaving open how the CART models compare to the parametric models.

Figure 4. Real and synthetic data distributions for the variables age, household income (income), household property taxes (tax) and social security payments (social security) on a cubic root scale (using $f(x) = \mathrm{sign}(x)|x|^{1/3}$).
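Before turning to the results in Figure 4, the synthesis step itself can be sketched with synthpop's syn() function. The method vectors, the data frame name cps, the assumed column order and the further options below are illustrative stand-ins for the exact specifications in Drechsler (2022).

library(synthpop)

# Naive parametric synthesis: logistic / multinomial models for the categorical
# variables and linear models on the original scale for the continuous variables.
# Assumed column order of the (hypothetical) data frame 'cps':
# sex, race, marital, educ, age, income, tax, ss
m_naive   <- c("logreg", "logreg", "polyreg", "polyreg", "norm", "norm", "norm", "norm")
syn_naive <- syn(cps, method = m_naive, m = 5)

# CART synthesis for all variables; the paper additionally applies kernel density
# smoothing in the terminal nodes via synthpop's smoothing argument.
syn_cart <- syn(cps, method = "cart", m = 5)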
Figure 4 shows how the increasing complexity of the synthesis models leads to increasingly realistic synthetic data distributions (all variables are plotted on a cubic root scale using $f(x) = \mathrm{sign}(x)|x|^{1/3}$ to also allow for negative values). It is evident that the naive synthesis strategy does a poor job for all variables except Age, whereas the transformed strategy does a poor job for Tax and Social security. The semi-continuous strategy seems to fit well for all variables, similarly to the data created with CART, although the latter method preserves the non-normality of the non-zero values in Social security slightly better. The insights from visual inspection are entirely corroborated by the relative Pearson divergences as given by density ratio estimation (see Figure 5). For all variables, the naive synthesis method performs worst. Typically, the transformed synthesis improves the synthetic data to some extent, although the difference is relatively small for Age, because here the naive synthesis strategy already performed reasonably well. For both Age and Income, the transformed strategy performs similarly to both the semi-continuous and the CART strategies, because for these variables there is no point-mass to model separately. For the variables where a point-mass is modeled separately (e.g., Social security and Tax), the semi-continuous approach clearly outperforms the transformed strategy. Lastly, CART outperforms the naive and transformed strategies, and performs highly similarly to the semi-continuous approach.
When modelling the density ratio over all variables in the data simultaneously (including the categorical variables, for simplicity recoded as numeric variables to be included in density ratio estimation), we see the same picture emerging.
Figure 5. Pearson divergence estimates after different synthesis strategies for the separate variables and the synthetic data sets as a whole.
Figure 5 shows the stepwise improvements in utility when refining the synthesis models. Naive synthesis clearly performs worst, followed by the transformed strategy. Both strategies are outperformed by the semi-continuous approach, which performs more or less on par with CART. These results compare favorably with utility assessment through the 𝑝𝑀𝑆𝐸 as reported in Drechsler (2022). The evaluation of utility through the 𝑝𝑀𝑆𝐸 shows no improvement when going from naive to transformed synthesis, whereas some 𝑝𝑀𝑆𝐸 models qualified the naive approach as better than the transformed approach. The improvement from the first two strategies to semi-continuous and CART was picked up by most 𝑝𝑀𝑆𝐸 models. Hence, the utility assigned by density ratio estimation was more in line with the refinements to the synthesis models than the utility scores that were obtained with the 𝑝𝑀𝑆𝐸 .
Discussion
When creating synthetic data with the goal of private data release, it is crucial to evaluate its quality. This allows the data provider to decide whether the synthetic data is useful for the purposes of the release or requires further refinements, and to inform the data user about the analyses that can be reliably conducted. In this paper, we showed that density ratio estimation provides a promising framework to evaluate the utility of synthetic data and we implemented the approach in the R-package densityratio (Volker 2023). In a small simulation, we showed that for sample sizes as small as 250 observations, it was possible to obtain a rather accurate estimate of the true density ratio. Moreover, in terms of statistical power, density ratio estimation performed on par with the 𝑝𝑀𝑆𝐸 and outperforms the Kolmogorov-Smirnov test in the univariate comparisons considered. When evaluating density ratio estimation on multiple synthetic versions of a real-world data set, we showed that the method was able to pick up all improvements in the synthesis models made, in contrast to the 𝑝𝑀𝑆𝐸 (as shown by Drechsler 2022). Moreover, whereas Drechsler (2022) showed that quantification of the utility through the 𝑝𝑀𝑆𝐸 was highly dependent on the propensity score model, density ratio estimation possesses automatic model selection in terms of its hyperparameters, and thus requires almost no user-specification. We emphasize that we used the same default settings for our simulations and for modelling all individual variables and the entire data sets in our empirical example, regardless of the varying scales of the variables and other variable-specific peculiarities, such as point masses and non-normality.
Although this paper focused on comparing density ratio estimation with existing utility measures, we note that there are also many connections between them. For example, the 𝑝𝑀𝑆𝐸 can be considered as an instance of density ratio
estimation, in which the propensity scores are used to model the density ratio. Specifically, the propensity scores can be transformed into the posterior odds of a record belonging to the synthetic data versus the observed data, which yields an estimate of the density ratio. Additionally, Sugiyama, Suzuki, and Kanamori (2012a) show that density ratio estimation can be regarded as divergence estimation between the numerator and denominator density. As such, the framework also encompasses estimation of, for example, the Kullback-Leibler divergence, proposed as a utility measure by Karr et al. (2006). Lastly, density ratio estimation can be seen as an extension to continuous data of the “ratio of estimates” utility measure (Taub, Elliot, and Sakshaug 2020), which is defined for categorical data as the ratio of observed and synthetic frequencies (scaled to be between 0 and 1 by putting the largest count in the denominator). In this sense, the density ratio framework encapsulates various measures to evaluate the utility of synthetic data.
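To make this connection concrete: writing $e(\boldsymbol{x})$ for the propensity score, i.e., the probability that a record in the stacked observed and synthetic data is synthetic given $\boldsymbol{x}$, Bayes' rule gives (with the marginal class probabilities estimated by the relative sample sizes)

$$r(\boldsymbol{x}) = \frac{p_{syn}(\boldsymbol{x})}{p_{obs}(\boldsymbol{x})} = \frac{e(\boldsymbol{x})}{1 - e(\boldsymbol{x})} \cdot \frac{n_{obs}}{n_{syn}},$$

so that any propensity score model fitted to the stacked data implicitly defines a density ratio model.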
Expanding upon the appealing properties discussed in this paper, we foresee three additional advantages of the density ratio framework. (1) Utility on the level of individual data points. The density ratio is estimated over the entire (multivariate) space of the data, and these estimates can be used to quantify the deviation of every synthetic data point with respect to the observed data. These values can help to identify sub-spaces that are poorly reproduced in the synthetic data, but they might also yield additional benefits. On a low level, these values might be used to discard observations that are considered as being too far from the observed data to be realistic, or resample observations that are typical in the observed data but occur infrequently in the synthetic data. On a higher level, one could potentially use density ratio values to reweigh analyses with synthetic data to bring the results closer to the real data. Future research should evaluate the merits of this approach, but also potential privacy risks of disseminating such weights. (2) Density ratios for specific utility. Another potential benefit is that the use of the method is not necessarily restricted to the level of the data at hand. Density ratio estimation could give rise to analysis-specific utility measures by applying the framework on the posterior distributions of parameters (or an approximation hereof). That is, if the distribution of the parameters of the analysis model can be approximated, for example by a multivariate normal distribution, or when samples from the parameter distribution are available, it is possible to either analytically calculate the density ratio, or estimate it using the techniques described above. The resulting density ratio can then again be used to quantify how similar the distributions are. (3) Extensions to high-dimensional data. When the number of variables grows large relative to the number of observations, direct density ratio estimation through unconstrained least-squares importance fitting might become inaccurate. However, the density ratio estimation framework possesses readily available extensions that include dimension reduction as part of the estimation process, which yields the advantage of simultaneously optimizing the density ratio solution with the dimension-reduced subspace of the data (Sugiyama, Yamada, et al. 2011).
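As a small illustration of the first point, the estimated density ratio values can be inspected record by record. The sketch below continues the univariate example and the ulsif_sketch() helper from the methods section; the cut-offs are arbitrary illustrations, not recommendations.

# Per-record utility: inspect the estimated density ratio value of every synthetic observation
fit   <- ulsif_sketch(x_syn, x_obs)
r_syn <- as.numeric(fit$rhat(x_syn))

too_dense  <- which(r_syn > 3)       # regions that the synthesis model oversamples
too_sparse <- which(r_syn < 1 / 3)   # regions that the synthesis model undersamples

# Speculative use discussed above: reweight synthetic-data analyses towards the
# observed data with weights proportional to 1 / r(x)
w <- 1 / pmax(r_syn, 1e-3)
weighted.mean(x_syn, w)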
Finally, let us remark that there are several open questions that need to be addressed before density ratio estimation can be fully incorporated in synthetic data evaluation pipelines. First, methodological research should investigate how to deal with categorical variables. In the density ratio estimation framework, the focus has almost exclusively been on numeric data, whereas in practical situations, categorical data is all too common. In this paper, we dealt with the issue by simply transforming the categorical variables into numeric variables, but other techniques might yield more accurate results. To name three other strategies, one could transform the categorical variables into dummy variables, use a different distance metric that allows for categorical data when specifying the kernel, or assume an underlying continuous latent space, and model the categorical variables on this space. Second, which default settings to use in density ratio estimation is still an open question. Although we showed that our default settings performed reasonably, most choices lack a strong theoretical justification. Potentially, the utility of synthetic data can be evaluated much more accurately by, for example, choosing a different kernel, choosing the centers in the Gaussian kernel in a different way, or using a broader range of bandwidth and regularization parameters. Lastly, it must be evaluated what information from density ratio estimation can be released to the public without incurring severe privacy risks. Presumably, releasing the Pearson divergence, potentially augmented with a 𝑝-value to indicate the lack of fit of the synthetic data, will yield only little additional privacy risk. However, releasing visualizations of the estimated density ratio or the estimated density ratio values themselves might cause unacceptable threats, especially for observations in the tails of the distribution. Future research can make efforts to privatize the output from density ratio estimation, or at least investigate what risks are related to releasing the output of the estimation process. With these promising avenues for extensions in mind, we conclude that the density ratio estimation framework provides a viable and intuitive alternative to existing utility measures that can enhance synthetic data workflows.
Acknowledgements
We are grateful to Dr. Jörg Drechsler for sharing his cleaned version of the Current Population Survey data and the corresponding analysis code.
References
Abowd, John M., Martha Stinson, and Gary Benedetto. 2006. “Final Report to the Social Security Administration on the SIPP/SSA/IRS Public Use File Project.” Longitudinal Employer-Household Dynamics Program, U.S. Bureau of the Census, Washington, DC. https://ecommons.cornell.edu/bitstream/handle/1813/43929/SSAfinal.pdf?sequence=3&isAllowed=y.
Drechsler, Jörg. 2012. “New Data Dissemination Approaches in Old Europe – Synthetic Datasets for a German Establishment Survey.” Journal of Applied Statistics 39 (2): 243–65. https://doi.org/10.1080/02664763.2011.584523.
———. 2022. “Challenges in Measuring Utility for Fully Synthetic Data.” In Privacy in Statistical Databases, edited by Josep Domingo-Ferrer and Maryline Laurent, 220–33. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-13945-1_16.
Drechsler, Jörg, and Anna-Carolina Haensch. 2023. “30 Years of Synthetic Data.” https://doi.org/10.48550/ARXIV.2304.02107.
Dwork, Cynthia. 2006. “Differential Privacy.” In Automata, Languages and Programming, edited by Michele Bugliesi, Bart Preneel, Vladimiro Sassone, and Ingo Wegener, 1–12. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/11787006_1.
Hawala, Sam. 2008. Producing Partially Synthetic Data to Avoid Disclosure. http://www.asasrms.org/Proceedings/y2008/Files/301018.pdf.
Hido, Shohei, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. 2008. “Inlier-Based Outlier Detection via Direct Density Ratio Estimation.” In 2008 Eighth IEEE International Conference on Data Mining, edited by Fosca Giannotti, Dimitrios Gunopulos, Franco Turini, Carlo Zaniolo, Naren Ramakrishnan, and Xindong Wu, 223–32. https://doi.org/10.1109/ICDM.2008.49.
Hu, Jingchen. 2019. “Bayesian Estimation of Attribute and Identification Disclosure Risks in Synthetic Data.” Transactions on Data Privacy 12: 61–89. http://www.tdp.cat/issues16/tdp.a313a18.pdf.
Jordon, James, Jinsung Yoon, and Mihaela van der Schaar. 2019. “PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees.” In International Conference on Learning Representations. https://openreview.net/forum?id=S1zk9iRqF7.
Kanamori, Takafumi, Shohei Hido, and Masashi Sugiyama. 2009. “A Least-Squares Approach to Direct Importance Estimation.” Journal of Machine Learning Research 10 (48): 1391–1445. http://jmlr.org/papers/v10/kanamori09a.html.
Kanamori, Takafumi, Taiji Suzuki, and Masashi Sugiyama. 2012. “Statistical Analysis of Kernel-Based Least-Squares Density-Ratio Estimation.” Machine Learning 86 (3): 335–67. https://doi.org/10.1007/s10994-011-5266-3.
Karr, Alan F., Christine N. Kohnen, Anna Oganian, Jerome P. Reiter, and Ashish P. Sanil. 2006. “A Framework for Evaluating the Utility of Data Altered to Protect Confidentiality.” The American Statistician 60 (3): 224–32. https://doi.org/10.1198/000313006X124640.
Little, Roderick J. A. 1993. “Statistical Analysis of Masked Data.” Journal of Official Statistics 9 (2): 407–7. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/statistical-analysis-of-masked-data.pdf.
Liu, Song, Makoto Yamada, Nigel Collier, and Masashi Sugiyama. 2013. “Change-Point Detection in Time-Series Data by Relative Density-Ratio Estimation.” Neural Networks 43: 72–83. https://doi.org/10.1016/j.neunet.2013.01.012.
McClure, David, and Jerome P Reiter. 2016. “Assessing Disclosure Risks for Synthetic Data with Arbitrary Intruder Knowledge.” Statistical Journal of the IAOS 32 (1): 109–26. https://doi.org/10.3233/SJI-160957.
Nikolenko, Sergey I. 2021. Synthetic Data for Deep Learning. Springer International Publishing. https://doi.org/10.1007/978-3-030-75178-4.
Nowok, Beata, Gillian M. Raab, and Chris Dibben. 2016. “Synthpop: Bespoke Creation of Synthetic Data in R.” Journal of Statistical Software 74 (11). https://doi.org/10.18637/jss.v074.i11.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53. https://doi.org/10.1126/science.aax2342.
Patki, Neha, Roy Wedge, and Kalyan Veeramachaneni. 2016. “The Synthetic Data Vault.” 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), October. https://doi.org/10.1109/dsaa.2016.49.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Reiter, Jerome P., and Robin Mitra. 2009. “Estimating Risks of Identification Disclosure in Partially Synthetic Data.” Journal of Privacy and Confidentiality 1 (1). https://doi.org/10.29012/jpc.v1i1.567.
Rosenbaum, Paul R., and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55. https://doi.org/10.1093/biomet/70.1.41.
Rubin, Donald B. 1993. “Statistical Disclosure Limitation.” Journal of Official Statistics 9 (2): 461–68. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/discussion-statistical-disclosure-limitation2.pdf.
Scott, David W. 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley. https://doi.org/10.1002/9780470316849.
Snoke, Joshua, Gillian M. Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. 2018. “General and Specific Utility Measures for Synthetic Data.” Journal of the Royal Statistical Society. Series A (Statistics in Society) 181 (3): 663–88. https://doi.org/10.1111/rssa.12358.
Sugiyama, Masashi. 2010. “Superfast-Trainable Multi-Class Probabilistic Classifier by Least-Squares Posterior Fitting.” IEICE Transactions on Information and Systems E93-D (10). https://doi.org/10.1587/transinf.E93.D.2690.
Sugiyama, Masashi, Taiji Suzuki, Yuta Itoh, Takafumi Kanamori, and Manabu Kimura. 2011. “Least-Squares Two-Sample Test.” Neural Networks 24 (7): 735–51. https://doi.org/10.1016/j.neunet.2011.04.003.
Sugiyama, Masashi, Taiji Suzuki, and Takafumi Kanamori. 2012a. Density Ratio Estimation in Machine Learning. Cambridge University Press. https://doi.org/10.1017/CBO9781139035613.
———. 2012b. “Density-Ratio Matching Under the Bregman Divergence: A Unified Framework of Density-Ratio Estimation.” Annals of the Institute of Statistical Mathematics 64 (5): 1009–44. https://doi.org/10.1007/s10463-011-0343-8.
Sugiyama, Masashi, Ichiro Takeuchi, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Daisuke Okanohara. 2010. “Conditional Density Estimation via Least-Squares Density Ratio Estimation.” In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, edited by Yee Whye Teh and Mike Titterington, 9:781–88. Proceedings of Machine Learning Research. Chia Laguna Resort, Sardinia, Italy: PMLR. https://proceedings.mlr.press/v9/sugiyama10a.html.
Sugiyama, Masashi, Makoto Yamada, Paul von Bünau, Taiji Suzuki, Takafumi Kanamori, and Motoaki Kawanabe. 2011. “Direct Density-Ratio Estimation with Dimensionality Reduction via Least-Squares Hetero-Distributional Subspace Search.” Neural Networks 24 (2): 183–98. https://doi.org/10.1016/j.neunet.2010.10.005.
Taub, Jennifer, Mark Elliot, and Joseph W. Sakshaug. 2020. “The Impact of Synthetic Data Generation on Data Utility with Application to the 1991 UK Samples of Anonymised Records.” Transactions on Data Privacy 13: 1–23. http://www.tdp.cat/issues16/tdp.a306a18.pdf.
Torkzadehmahani, Reihaneh, Peter Kairouz, and Benedict Paten. 2019. “DP-CGAN: Differentially Private Synthetic Data and Label Generation.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. https://doi.org/10.1109/cvprw.2019.00018.
van de Wiel, Mark A., Gwenaël G. R. Leday, Jeroen Hoogland, Martijn W. Heymans, Erik W. van Zwet, and Ailko H. Zwinderman. 2023. “Think Before You Shrink: Alternatives to Default Shrinkage Methods Can Improve Prediction Accuracy, Calibration and Coverage.” https://doi.org/10.48550/ARXIV.2301.09890.
Volker, Thom Benjamin. 2023. “Densityratio: Direct Estimation of the Ratio of Densities of Two Groups of Observations.” https://github.com/thomvolker/densityratio.
Volker, Thom Benjamin, and Gerko Vink. 2021. “Anonymiced Shareable Data: Using Mice to Create and Analyze Multiply Imputed Synthetic Datasets.” Psych 3 (4): 703–16. https://doi.org/10.3390/psych3040045.
Woo, Mi-Ja, Jerome P. Reiter, Anna Oganian, and Alan F. Karr. 2009. “Global Measures of Data Utility for Microdata Masked for Disclosure Limitation.” Journal of Privacy and Confidentiality 1 (1). https://doi.org/10.29012/jpc.v1i1.568.
Xu, Lei, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. 2019. “Modeling Tabular Data Using Conditional GAN.” In Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf.
Zettler, Ingo, Christoph Schild, Lau Lilleholt, Lara Kroencke, Till Utesch, Morten Moshagen, Robert Böhm, Mitja D. Back, and Katharina Geukes. 2021. “The Role of Personality in COVID-19-Related Perceptions, Evaluations, and Behaviors: Findings Across Five Samples, Nine Traits, and 17 Criteria.” Social Psychological and Personality Science 13 (1): 299–310. https://doi.org/10.1177/19485506211001680.
Appendix A - Description of the CPS data
Table 2. Descriptive statistics of the considered subset of observations and variables in the March 2000 U.S. Current Population Survey.
Overall (N = 5000)

Sex
  Male                             2823 (56.5%)
  Female                           2177 (43.5%)
Race
  White                            4272 (85.4%)
  Non-white                         728 (14.6%)
Marital status
  Married                          2698 (54.0%)
  Separated                         564 (11.3%)
  Widowed                           692 (13.8%)
  Single                            145 (2.9%)
  Widowed or divorced               901 (18.0%)
Highest attained education level
  No High School Diploma            815 (16.3%)
  High School Diploma              1610 (32.2%)
  Associate or bachelor's degree   2121 (42.4%)
  Master's degree or higher         454 (9.1%)
Age
  Mean (SD)                        48.2 (16.8)
  Median [Min, Max]                46.0 [15.0, 90.0]
Social security payments
  Mean (SD)                        2180 (4680)
  Median [Min, Max]                0 [0, 50000]
Household property taxes
  Mean (SD)                        1020 (2500)
  Median [Min, Max]                450 [0, 98400]
Household income
  Mean (SD)                        54000 (50400)
  Median [Min, Max]                40700 [1.00, 583000]
Figure 6. Histograms of the considered subset of observations and continuous variables in the March 2000 U.S. Current Population Survey.
Assessing the utility of synthetic data: A density ratio perspective
Thom Benjamin Volker, Peter-Paul de Wolf & Erik-Jan van Kesteren
UNECE Expert Meeting on Statistical Data Confidentiality
Synthetic data utility

Intuitively
• How different are the synthetic data from the observed data?
• Is the synthetic data (almost) as useful as the observed data?
Practically
• Can we tell the observed and synthetic data apart?
• Can we obtain inferences from the synthetic data that are similar to inferences from the observed data?
Utility is hard to measure

The utility of synthetic data depends on what it’s used for
How can we know what the synthetic data will be used for?
We can’t…
We need good measures for general utility (distributional similarity)
• If the observed and synthetic data have indistinguishable distributions, they should allow for similar inferences.
Existing general utility measures

pMSE
• Practical, easy to use
• Not straightforward to specify the propensity score model
• Increasingly difficult to use when the dimensionality of the data increases relative to the sample size
Kullback-Leibler divergence
• Theoretically very elegant
• Hard to estimate in practice
Utility as a density ratio
$$r(x) = \frac{p_{syn}(x)}{p_{obs}(x)}$$
Evaluating utility: density ratios
Example: Density ratios for utility

Data – U.S. Current Population Survey (n = 5000)*
• Four continuous variables (age, household income, social security payments, household property taxes)
• Four categorical variables (sex, race, marital status, educational attainment)
Synthesis strategies
• Linear models; transformations; semi-continuous + transformations; smoothed CART
• Logistic / multinomial regression
* Our gratitude goes out to Jörg Drechsler for his willingness to share the data and synthesis script
Assessing utility
1. Estimate the density ratio using a kernel model
• Unconstrained least-squares importance fitting (uLSIF; Kanamori et al., 2009)
2. Compute a discrepancy measure for each synthetic dataset
• Pearson divergence: $\widehat{PE}(X_{syn}, X_{obs}) = \frac{1}{2 n_{syn}} \sum_{i=1}^{n_{syn}} \hat{r}\big(x_{syn}^{(i)}\big) - \frac{1}{n_{obs}} \sum_{j=1}^{n_{obs}} \hat{r}\big(x_{obs}^{(j)}\big) + \frac{1}{2}$
3. Compare the test statistics between synthetic data sets
All implemented in the R-package densityratio
Assessing utility in R

library(densityratio)

# for every variable in every synthetic dataset, do:
dr_var <- ulsif(cps_synthetic$var, cps_real$var)
summary(dr_var)

# and for every entire synthetic dataset, do:
dr <- ulsif(cps_synthetic, cps_real)
summary(dr)
The way forward…

Density ratio estimation provides:
• An intuitive framework for the evaluation of synthetic data;
• Cross-validation for automatic hyperparameter selection;
• Readily available extensions to high dimensional settings;
• Utility scores for individual data points

But requires research into:
• How to deal with categorical data;
• How to make best use of side products (e.g., individual utility scores)