
UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Disclosure control issues in complex medical data

Elizabeth Green, Felix Ritchie, Jim Smith, David Western, Paul White

University of the West of England

[email protected]

Abstract

The COVID-19 pandemic accelerated routine access to medical records for research. In the UK, platforms including OpenSAFELY and NHS Digital, alongside emerging hospital trust based Trusted Research Environments (TREs), demonstrate the utility of, and need for, medical researchers accessing and using microdata safely and securely. Whilst many apply traditional principles-based SDC standards to statistical outputs, complexity arises when considering complex medical data which is required to remain highly detailed; for example genome, medical imaging, or fMRI data, where the output often includes reference to individual observations. Current imaging libraries and databases have demonstrated awareness of, and need for, metadata standards, but consideration of both input and output protection is less clear. Given the need to retain observations with a high level of detail, this presentation discusses present considerations for potential SDC solutions and invites conversation from the wider community.


1 Introduction

The use of medical data for research purposes has clear public benefit and direct impact. Medical data is by nature highly detailed and specific to an individual: it is important to include a wide range of observations and background information to allow practitioners to make informed decisions and choices around treatment. Specific medical tests, such as genome analysis or an MRI scan, generate large volumes of data which are specific to the individual and are evaluated and examined as a whole entity: not just one particular fraction of an MRI scan is used; the whole scan is used and retained.

Historically, medical research has long been intertwined with the delivery and provision of care to patients; as such, research is conducted with direct informed consent and an expectation that the data will be used to further knowledge in the area. The medical data is of course highly detailed, and the number of observations used in a study can be low due to the rarity of a disease or because data collection is limited to particular hospitals or sites. As a result, research outputs can be highly detailed, with descriptive tables and survival curves often including singular observations.

In contrast, microdata used in social science is often not directly collected by the researcher (for example census data) so informed consent specific to the research is not obtained. When it comes to accessing and publishing data outputs, social science has established data repositories and access arrangements for research with clear standards for statistical disclosure control (SDC) within both shared datasets and research outputs.

The aim of this paper is: first, to outline some present examples of sharing of medical data and of medical data outputs; and second, to reflect on the disciplinary differences in disclosure control, and whether these are due to a lack of awareness or a lack of concern. We use three commonplace examples of shared data to illustrate some of the issues and the expectations of the public health world. Finally, we reflect on ways forward and where medical science may benefit from the experience of social scientists.

It should be noted that this paper is not intended to embarrass organisations or researchers: examples where potential disclosure and poor practice have been identified by the team are de-identified and described. The team has not directly referenced these examples, and we encourage the community to have an open conversation about how to integrate SDC standards when sharing data.


2 Medical examples

2.1 Genomic data

The devil is in the detail. A genome provides the complete set of all the genetic information in an organism. Genomic analysis (for example, microarray data) allows for the investigation of genes, and provides the necessary insights for developing cures, vaccines, and identification of new diseases and diagnostic tests. Whilst the sharing of individual genome data has facilitated remarkable breakthroughs in fields such as genetics and personalized medicine, it also raises significant privacy concerns.

The current practice of 'anonymisation' of genomic data is to remove direct identifiers (for example, name, patient ID) and indirect identifiers (hospital, postcode) (Bonomi, Huang and Ohno-Machado, 2020). However, other variables such as patient age, gender and prognosis are not redacted. Below is an example of 'anonymised' genome array data, available via a public website which does not require sign-in. The data is associated with a published research article; a condition of publication with the journal is that the raw data must be made available.

Data collection: The DRAGoN Hospital for Exhausted Researchers

Participant characteristics:

Participant number   Gender   Age   Prognosis
1                    Male     48    Bad - chronic insomnia
2                    Female   31    Good

Xlsx attachment with participant 1 microarray, participant 2 microarray, etc.

The main issue here is not only the level of detail presented in the participant characteristics list, but also the level of detail within the array/genome dataset: it is effectively the raw output of the individual's entire genetic array. Whilst research has advanced our understanding of the roles of different structural points, mutations, and markers, we are still in the process of identifying and discovering the roles of specific genetic markers. Therefore, when considering SDC we need to be aware that what is considered non-sensitive today may become sensitive in the near future (Ritchie and Smith, 2019; McKay et al., 2022).

For medical research it is difficult to define what information is disclosive and what is not. For example, it is possible to extract information about the individual such as eye colour, hair colour, hair texture (curly), baldness, physical traits etc from array data. Previous studies demonstrated the possibility of generating 3D face maps based on genomic data which could be used to reidentify individuals (Lippert et al. 2017, Crouch et al. 2018, Venkatesaramani and Vorobeychik, 2021). From a social science perspective we would be considering whether a form of input disclosure control could be employed; alternatively, could we safeguard who is accessing the data, and what might the consequences be if we did introduce such practices?

Input SDC on the sharing of genomic data is only one part of the puzzle; there are also disclosure issues in research outputs. As previously explained, the data is uploaded to a shared platform available for anyone to download, and this sharing is often a mandatory requirement of both funders and journals. Below (Figure 1) is an example of a published survival analysis which outlines the probability of survival over time for patients with a particular disease. With small numbers of observations it is easy to identify when individuals die at specific time points; accompanying the survival curve is a table detailing the change in numbers across time.


Figure 1 Example Kaplan-Meier curve with low numbers

Survival analysis is commonly used in medical research to demonstrate the relationship between diagnosis (or treatment) and death. Concern around disclosure relates to the number of observations between each step down in the curve, with detailed graphs often showing a step down involving fewer than three observations. O'Keefe et al. (2012) suggest smoothing and incorporating confidence intervals, while SDAP (2019) proposes checking to ensure thresholds are met within each step change. A simple form of such a check is sketched below.
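As a rough illustration of the threshold check described above, the sketch below flags step changes in the numbers-at-risk table accompanying a Kaplan-Meier curve that involve fewer than a chosen number of observations. The data, function name and threshold are illustrative and not taken from the cited guidance.

```python
# Minimal sketch: flag Kaplan-Meier step changes involving fewer than `threshold` observations.
def risky_steps(numbers_at_risk, threshold=3):
    """Return (time_index, drop) pairs where fewer than `threshold` observations leave the risk set."""
    flagged = []
    for i in range(1, len(numbers_at_risk)):
        drop = numbers_at_risk[i - 1] - numbers_at_risk[i]
        if 0 < drop < threshold:
            flagged.append((i, drop))
    return flagged

# Example: numbers at risk at successive time points (illustrative values).
at_risk = [20, 18, 15, 14, 10, 9]
print(risky_steps(at_risk))  # [(1, 2), (3, 1), (5, 1)] -> step changes needing suppression or smoothing
```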

Interestingly, a tool which specifically generates Kaplan-Meier plots for genomic research is widely used within the medical community: https://kmplot.com/ (Gyorffy, 2023). This open-access, free-to-use website allows researchers to perform survival analysis on different gene expressions from a database of over 30,000 samples. The user can select the cancer subtype they wish to research and then the level of analysis (see Figure 4). By default the website is set to censor the plot at a threshold, but the user is able to turn this function off.

Figure 2 Demonstration of a confidentialised output taken from O’Keefe et al. (2012) p134

Figure 3 Guidance for SDC in Kaplan-Meier graphs (Welpton, 2019)


While this is an extremely useful resource for researchers, it is also of potential concern. It seems likely that very small subsets of the data could be selected and associated with personal characteristics; these would not produce meaningful graphs, but they could be used to challenge the anonymisation of the data.

Figure 4 The Kaplan-Meier plotter website (kmplot.com)

2.2 Inappropriate use of medical dermal images

In dermatology, photographic capture of clinical findings is routine, with digital images providing support and awareness in practice, research, training, and education. One publicly available tool is the DermAtlas (available at http://www.dermatlas.net/reference/index.cfm), which stores a wide array of clinical images demonstrating the presentation of different dermatological conditions. Anyone can access this tool and explore the wide range of photos it holds. In terms of impact, this tool can aid health professionals in identifying and evaluating their own patients; it can also be used by the general public to help them feel empowered or to understand their own conditions and potential diagnoses.

As the skin is the largest organ of the body, some dermatological conditions are localised to intimate areas; this, coupled with an array of clinical photos spanning the age range, made the dermatology archive susceptible to misuse. Lehmann, Cohen and Kim (2006) described the discovery, ongoing detection, and management of misuse of DermAtlas content over a period of four years. A shocking 14.3% of all referrals originated from pornography/fetish sites (Lehmann, Cohen and Kim, 2006).

This leads to concerns about how to safely share medical information from what is undoubtedly a valuable medical resource. In response, any referral from a pornography/fetish site resulted in the user being presented with a denial page (Lehmann, Cohen and Kim, 2006). DermAtlas also implemented filters based on user query patterns, with IP addresses making frequent queries for genital images being restricted. Restricted IP addresses were still able to use DermAtlas, but were presented with thumbnail-sized images and were unable to retrieve full images of genital sites. However, this approach was not straightforward: for example, the NHS in the UK and US military services were inappropriately restricted. A simplified sketch of these controls is given below.
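The following sketch illustrates the general shape of the two controls described above (a denial page for blocked referrers, and thumbnail-only access for restricted IP addresses). It is a hypothetical illustration, not DermAtlas's actual code; the domains and addresses are invented.

```python
# Minimal sketch of referrer- and IP-based access controls of the kind described above.
BLOCKED_REFERRER_DOMAINS = {"example-fetish-site.test"}  # hypothetical referrer blocklist
RESTRICTED_IPS = {"203.0.113.7"}                         # hypothetical IPs flagged for frequent sensitive queries

def serve_image(referrer_domain: str, client_ip: str) -> str:
    if referrer_domain in BLOCKED_REFERRER_DOMAINS:
        return "denial_page"        # referral from a blocked site: show a denial page
    if client_ip in RESTRICTED_IPS:
        return "thumbnail_only"     # flagged address: withhold full-size images of sensitive sites
    return "full_image"

# Note the false-positive risk mentioned above: legitimate clinical users sharing a flagged
# address (e.g. a hospital network) would also be restricted.
print(serve_image("example-fetish-site.test", "198.51.100.1"))  # denial_page
print(serve_image("a-hospital.test", "203.0.113.7"))            # thumbnail_only
print(serve_image("a-hospital.test", "198.51.100.1"))           # full_image
```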

DermAtlas presents an interesting example of the complexities of hosting data in the public domain for a universal audience. The benefits of the tool for both the public and health professionals are clear, but the tool is also being used for purposes not intended by its designers. When considering potential solutions for de-identification or anonymisation of medical photos, current practice in social research where direct informed consent has not been obtained (such as when photographing a busy city) is often to use object and face detection software to automatically mask individuals (Fitwi et al., 2021). For clinical dermatological case photos, the current simplistic approach to redaction is to mask the eyes and mouth, but for many case photos there is no form of redaction, and sometimes it is not possible to redact the eyes and mouth. We therefore assume, as is common practice within medical research, that the emphasis is on obtaining direct informed consent, and that the patient consents to the data being held within the public domain. However, can such consent be truly informed when unanticipated uses are made of the data? DermAtlas, and other similar tools, face an impossible triad: how can we retain detailed photographs, provide an open-access tool, and ensure no misuse?

2.3 fMRI scans

Functional magnetic resonance imaging, or functional MRI (fMRI), provides a highly detailed image of the blood flow and structure of a body part. These scans are used to assist in the treatment of the patient (diagnosis) but also in medical research. Due to the large volume of highly detailed data produced by these scans, sharing this information has proven invaluable for medical research. Current examples of sharing fMRI data include the Brain Imaging Data Structure (BIDS) website, https://bids.neuroimaging.io/, where users can contribute, access, and download de-identified fMRI data.

In terms of input disclosure control, BIDS requires contributors to remove all direct identifiers and to 'deface' the scan images (which can be achieved using a module such as https://raamana.github.io/visualqc/gallery_defacing.html). Interestingly, facial reconstruction based on detailed medical scans (such as CT or fMRI) has been achieved: Schwarz et al. (2019) found that face-recognition software achieved a re-identification rate of 83% (70 of 84 participants) when comparing participants' MRI scans to photographs.

BIDS ensures that the data entering the service is de-identified by providing excellent support to depositors, ensuring that uploaded data is stripped of direct identifiers and defaced. However, uploading data and publishing in tandem is common practice, so the sticky issue of secondary disclosure is more apparent in this example. To highlight this, a recently published journal article states that the data used in the publication has been deposited in BIDS, yet within the article the participants' demographic characteristics are highly detailed, with low numbers in particular cells and distinctive characteristics. If the identity of the depositor is known, this increases the chance of knowing where the sample comes from (i.e. which hospital or patient group), dramatically increasing the chances of re-identification. Finally, with more researchers using data repositories such as BIDS to deposit datasets used in publications, information already in the public domain about the dataset may be crucial for re-identification, but it is not necessarily considered by the individual depositor. The problem here is not on the repository input side, but a lack of statistical disclosure control awareness among authors, demonstrating the need for training and standards in the medical community. A simple check that authors could run on such tables is sketched below.
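As an illustration of the kind of check an author could run before publishing a participant characteristics table alongside a deposited dataset, the sketch below flags cross-tabulation cells that fall below a minimum count threshold. The data, column names and threshold are illustrative only.

```python
# Minimal sketch: flag small cells in a participant characteristics cross-tabulation.
import pandas as pd

# Illustrative demographic table (not from any real study).
demographics = pd.DataFrame({
    "sex":      ["F", "F", "F", "F", "M", "M", "M", "M"],
    "age_band": ["30-39", "30-39", "30-39", "40-49", "30-39", "40-49", "40-49", "40-49"],
})

table = pd.crosstab(demographics["sex"], demographics["age_band"])
threshold = 3

cells = table.stack()                       # one row per (sex, age_band) cell
small_cells = cells[cells < threshold]      # cells below the chosen threshold
print(table)
print("Cells needing suppression or aggregation:")
print(small_cells)
```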

3 Discussion

We are not stating that the above examples are necessarily disclosive or provide direct identification: a number of steps would be required to re-identify an individual, and the value to an intruder would be questionable. For example, it is safe to assume that social media profile pictures in the public domain are not going to be viable for identification or reconstruction of an fMRI scan. Venkatesaramani and Vorobeychik (2021) found that the overall effectiveness of re-identification (when using social media photos) was substantially lower than previously suggested, as the literature often uses high-quality data (both genomic and photographic) which is not consistent with real-life scenarios. Conceptualisation of what constitutes a reasonable threat is beyond the scope of this paper.


Nevertheless, the three examples have highlighted a number of issues and challenges for disclosure control, on both the input and output side, as well as in how data is shared. Many of these challenges are unique to the data, and traditional methods used for disclosure control in social research may be inappropriate. There are also some very unexpected factors, for example the actual uses to which DermAtlas has been put. Going forward, what mitigations and recommendations might social scientists offer the medical community?

On microdata access we must always accept a level of risk, and risk needs to be conceptualised in terms of its realism (i.e. what is the true likelihood of an intruder doing this for nefarious gain, and can we ever meaningfully and, more importantly, reliably measure this risk?). It is also essential that, whilst discussing risk, we also discuss benefit: we are all too familiar with the invaluable findings and applications of health research, and to potentially halt or delay such findings is a harm in itself. So, whilst we highlight areas of weakness and vulnerability, we must objectively generate new paths going forward.

Our primary concern is the lack of standards, guidance and continuity: practice is not being checked, reviewed or updated against current practices known within the SDC community (for example, thresholds). Perhaps this demonstrates a lack of training and awareness around SDC, as the examples do show attempts at de-identification. This could also be an area in which re-identification back to the individual is important, for example if the research generates incidental findings on an individual and it is necessary for them to receive an intervention. Consent for data to be shared is often obtained directly, with individuals being more inclined to trust the research and a presumption that they have "agreed to use this for research and we said we would anonymise it...".

What about outputs? Sharing the data seems to happen in tandem with the outputs, so output SDC is not as relevant; but is this an output or an input issue? Should we consider the attached journal participant characteristics tables as secondary disclosure, or is this an example of input disclosure? What is clear, however, is a desire to de-identify and a concern around ethics and consent in the medical community. Derrick et al. (2022) highlight that training in output SDC is mostly limited to TRE users, and much medical research on very sensitive data is not traditionally held in TREs (compared to social science), so moving forward training appears to be low-hanging fruit in supporting disclosure control in this area.

4 Future considerations

Identification of problem/risk: at first glance this appears poor practice when compared to standards in social science, but is it a genuine risk? How do we balance genuine risk vs perceived risk vs utility of the data?

Training: what is done and to what level? (Social scientists are again not great here, but perhaps have experience and conceptual understandings of thresholds, rounding, etc.)

Standards: what is done and to what level? Can we support a harmonised approach?

Is open sharing good? In social science the move has been to open access, not open data, i.e. anyone with a genuine reason has access to the data but not everyone gets access. We need to review data sharing models and also pressures from funders and journals.

We especially welcome views from the medical research community: [email protected]


5 References

Bonomi, L., Huang, Y., & Ohno-Machado, L. (2020). Privacy challenges and research opportunities for genomic data sharing. Nature genetics, 52(7), 646-654.

Crouch, D. J., Winney, B., Koppen, W. P., Christmas, W. J., Hutnik, K., Day, T., ... & Bodmer, W. F. (2018). Genetics of the human face: Identification of large-effect single gene variants. Proceedings of the National Academy of Sciences, 115(4), E676-E685.

Derrick, B., Green, E., Ritchie, F., & White, P. (2022, September). The Risk of Disclosure When Reporting Commonly Used Univariate Statistics. In International Conference on Privacy in Statistical Databases (pp. 119- 129). Cham: Springer International Publishing.

Fitwi, A., Chen, Y., Zhu, S., Blasch, E., & Chen, G. (2021). Privacy-preserving surveillance as an edge service based on lightweight video protection schemes using face de-identification and window masking. Electronics, 10(3), 236.

Gyorffy B: Discovery and ranking of the most robust prognostic biomarkers in serous ovarian cancer, Geroscience, 2023, doi: 10.1007/s11357-023-00742-4.

Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., ... & Craig, D. W. (2008). Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS genetics, 4(8), e1000167.

Lehmann, C. U., Cohen, B. A., & Kim, G. R. (2006). Detection and management of pornography-seeking in an online clinical dermatology atlas. Journal of the American Academy of Dermatology, 54(4), 633-637.

Lippert, C., Sabatini, R., Maher, M. C., Kang, E. Y., Lee, S., Arikan, O., ... & Venter, J. C. (2017). Identification of individuals by trait prediction using whole-genome sequencing data. Proceedings of the National Academy of Sciences, 114(38), 10166-10171.

McKay, F., Williams, B. J., Prestwich, G., Bansal, D., Hallowell, N., & Treanor, D. (2022). The ethical challenges of artificial intelligence‐driven digital pathology. The Journal of Pathology: Clinical Research, 8(3), 209-216.

O'Keefe, C. M., Sparks, R. S., McAullay, D., & Loong, B. (2012). Confidentialising survival analysis output in a remote data access system. Journal of Privacy and Confidentiality, 4(1).

Schwarz CG, Kremers WK, Therneau TM, et al. (2019) Identification of anonymous MRI research participants with face-recognition software. N Engl J Med; 381:1684-6.

Venkatesaramani, R., Malin, B. A., & Vorobeychik, Y. (2021). Re-identification of individuals in genomic datasets using public face images. Science advances, 7(47), eabg3296.

Welpton, Richard (2019). SDC Handbook. figshare. Book. https://doi.org/10.6084/m9.figshare.9958520.v1

Disclosure control in complex medical outputs

Elizabeth Green, Felix Ritchie, Jim Smith, David Western, Paul White

University of the West of England

Overview

➢Medical research and positioning

➢ Current practices

➢ Alignment with TRE standards

➢Examples

➢ Solutions?

➢ Future considerations

Medical research

Traditionally:

∙ Revolves around direct informed consent and primary data collection

∙ Some bad examples of mismanagement of data and ethics... Henrietta Lacks, for example

∙ Easy to see direct benefit/ public good

Sharing medical data and disclosure control

∙ Varied practices from depositing raw data in the public domain to secure access

∙ Tools are being developed- uptake poor

∙ Unforeseen consequences encountered

∙ Benefit to society

Genomic Data

∙ Tissue sample extracted, analysis is then conducted

∙ Microarray data of specimen often deposited/ shared in the public domain (remember knowledge advancing continually)

∙ Descriptive variables provided

Statistical Disclosure Control?

∙ Basic SDC principles- counts and thresholds

∙ Kaplan-Meier curves often result in low numbers, potential for low number of observations between each step.

∙ Not directly disclosive alone; requires linking to contextual information (provided in the report).

∙ Recommendations exist: O'Keefe et al. (2012) suggest smoothing and adding CIs; SDAP (2019) requires thresholds to be met.

∙ Online tools- generating a Kaplan-Meier curve with SDC inbuilt

Dermatology photographs

• Online photo repository providing examples of different skin disorders.

• Used as an aid to help experts and public identify different ailments.

• Huge public benefit

• However unintended consequence was found...

Statistical Disclosure Control? Solutions?

fMRI scan

• Huge amount of data
• Brain anatomy and structure (remember, knowledge is advancing continually)
• Variation in sharing: online repositories, open access
• Concern for disclosure is based on rebuilding the face from the structure

Digital Facial Reconstruction Sorbonne University

FMRI solutions?

Scalp the face! Remove or roughen the facial attributes
- Potential loss of data
- Disclosure elsewhere

Statistical Disclosure Control?

∙ SDC training and principles for medical data users!

∙ Unique attributes, highly detailed information, informed consent, understanding risk

∙ Use and implementation of tools and solutions

∙ Universal agreement on standards
∙ What is anonymised data?

∙ Evaluation of data access

∙ Development of a network to support this work?

Thank you! Elizabeth Green [email protected]

This work is funded by UK Research and Innovation [Grant Number MC_PC_23006] as part of Phase 1 of the DARE UK (Data and Analytics Research Environments UK) programme, delivered in partnership with Health Data Research UK (HDR UK) and Administrative Data Research UK (ADR UK).



UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26–28 September 2023, Wiesbaden

Experiments on Federated Data Synthesis

Claire Little, Mark Elliot, Richard Allmendinger (University of Manchester, UK)

[email protected]

Abstract

Federated Learning (FL) is a decentralized approach to statistical model training in which training is performed across multiple clients to produce a global model. This approach can be used where multiple sites have data but do not have enough data to generate the required statistical power and cannot for legal, commercial or ethical reasons share their data. One paradigm case is randomized control trials for rare diseases. With FL, training data stays with each local client and is not shared or exchanged with other clients, so the use of FL can reduce privacy and security risks (compared to methods that pool multiple data sources) while addressing data access and heterogeneity problems. This study explores the feasibility of using FL to generate synthetic microdata, allowing multiple organizations to contribute to the construction of combined synthetic datasets (possibly for wider release) without the need to share or distribute their own data. The primary issue is whether it is possible in principle to produce good enough quality synthetic data, and the study here focuses on this as a proof of concept before going on to discuss the issue of risk measurement. The results show that the approach is feasible and, crucially, in the main experiment the synthetic datasets better represented the full population than random samples of that population do. However, the experiments are on toy datasets and the next step is to expand the dataset size.

1 Introduction

To enable the safe release of data, Statistical Disclosure Control (SDC) methods (Hundepool et al., 2012) can be applied to remove or alter disclosive information. Data synthesis (Rubin, 1993; Little, 1993) is an alternative to SDC which uses models of the original dataset to generate artificial data with the same structure and statistical properties as the original but (in the case of full synthesis) not containing any of the original data. In this study, we explore the feasibility of federated synthesis, allowing multiple organizations to contribute to the construction of combined synthetic datasets (possibly for wider release) without the need to share or distribute their own data. The primary issue is whether it is possible in principle to produce good enough quality synthetic data, and the study here focuses on this as a proof of concept before going on to discuss the issue of risk measurement. The next section presents background information on data synthesis and federated learning, Section 3 outlines the methodology, Section 4 provides the results of our experiments, Section 5 discusses the results and their implications, and final thoughts and ideas for future work can be found in Section 6.

2 Background

2.1 Data Synthesis

Data Synthesis (Rubin, 1993; Little, 1993) is an alternative to SDC and uses models built using the original dataset to generate artificial data with the same structure and statistical properties as the original but (in the case of full synthesis) not containing any of the original data. Synthetic data may be used where access to the original data is not possible or restricted due to privacy constraints. For example, the approval process to acquire access to safeguarded data can be lengthy, potentially delaying research; in these situations synthetic data can allow researchers to test code or plan analysis whilst awaiting access. Synthetic data may also be used to augment (add more records to) existing datasets. There is an increasing number of techniques to generate synthetic data, including statistical methods (such as Nowok et al. (2016); Zhang et al. (2017)), and deep learning (DL) methods based on neural networks (NN) such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), variational autoencoders (VAE) (Kingma and Welling, 2014), large language models (Radford et al., 2019), diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020), and genetic algorithms (GAs) (Chen et al., 2017, 2018).

2.2 Federated Learning

Federated Learning (FL) (McMahan et al., 2017) is a method that allows multiple clients (or devices) to collaboratively build a shared model without the clients transmitting or exchanging their raw data. In the context of synthetic data, this could allow multiple clients (organisations, users, etc.) to produce a shared synthetic dataset, without the need to share their own individual private data thereby minimising disclosure risk. It could allow the linkage of datasets that would otherwise be unlikely to be linked in the traditional sense, thereby producing opportunities to access unique synthetic data that is potentially more diverse, and richer, than each participant's synthetic dataset alone. This paper explores the feasibility of using FL together with a GA to produce a combined synthetic dataset, which as far as we are aware has not been attempted so far. The early focus of FL was its use on mobile and edge devices (e.g. Bonawitz et al. (2016); Konecny et al. (2016)), where an FL model could have many massively distributed clients, each with potentially different computational capabilities, limited communication and unbalanced data. An example of its usage is Google's Gboard (keyboard) application, which trains a model on each mobile device (when it is idle) using the local data and then sends only model updates (parameters) to the server; this allows it to predict the next word when typing, suggest emojis and discover new words (McMahan and Thakurta, 2022).

As described by Kairouz et al. (2021), interest has increased in the use of FL for other (non-mobile) applications, such as allowing cross-organisational collaboration to train models. For example, in healthcare, sensitive data is difficult to access and tightly regulated, making sharing/pooling data between institutions prohibitive; FL can allow the creation of more robust models, trained on a larger and more diverse pool of data than a single institution could provide, without the need to exchange or centralise sensitive medical data (Rieke et al., 2020; Kumar and Singla, 2021).

FL has generally been used to produce shared models (such as predictive models) collectively trained on each client's data. A central server controls the process but does not access any of the client data. NN-based methods are typically used, where each client receives the current model weights from the central server, trains the model on their own data and then sends the model weights (or parameters) back to the server. All the clients' weights are then aggregated (typically using the FedAvg, or Federated Averaging, algorithm (McMahan et al., 2017)) by the server, which updates the global shared model; a minimal sketch of this aggregation step is given at the end of this section. The model is then sent back to the clients and the process continues until some stopping condition is met.

There is a small body of research into the use of FL to generate synthetic data. We use microdata for this study and therefore focus on methods designed for tabular data (i.e. structured data comprising rows and columns containing mixed-type features, such as categorical and numerical). Duan et al. (2023), Fang et al. (2022) and Zhao et al. (2021) use GAN-based methods to generate synthetic data, with a GAN training on each client and each sending the model weights to the server to aggregate, etc. (each client generates the final synthetic data individually using the shared model). Weldon et al. (2021) use a GAN on the clients and on the server, but differ in that the server GAN generates the final synthetic dataset. Lomurno et al. (2023) present a different method, using VAEs, with each client training a data generator locally. The clients send their models (generators) to the server, but they are not aggregated or combined (as is typical in FL); in the final phase, each client can access the set of generators (from all clients) stored on the server and use some or all of these to generate their own synthetic data. Here we use a GA to generate synthetic data on the server, which is then sent to the clients, who each calculate the fitness (utility) score and send it back to the server, where all client scores are combined and used to create the next generation of synthetic datasets. Qu et al. (2020) also generate synthetic data on the server, which is sent to the clients to evaluate, but this employs a GAN-based method, uses image data and focusses on the use-case where clients are temporary (i.e. they may not be available for the whole process).
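To make the FedAvg aggregation step described above concrete, the following is a minimal sketch of weighted averaging of client parameters; the parameter vectors and client sizes are illustrative, and real FL frameworks handle much more (model state, communication, secure aggregation, etc.).

```python
# Minimal sketch of FedAvg-style aggregation: average client parameter vectors,
# weighted by each client's number of records.
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of per-client parameter vectors."""
    total = sum(client_sizes)
    stacked = np.stack(client_weights)                      # shape: (n_clients, n_params)
    weights = np.array(client_sizes, dtype=float) / total   # proportion of data held by each client
    return (stacked * weights[:, None]).sum(axis=0)         # aggregated global parameter vector

# Illustrative example: two clients with different amounts of data.
global_params = fed_avg(
    client_weights=[np.array([0.2, 1.0, -0.5]), np.array([0.4, 0.8, -0.1])],
    client_sizes=[100, 300],
)
print(global_params)  # pulled closer to the larger client's parameters
```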

2.3 Study Aims

In this study, our objective is to assess the feasibility of using a federated learning to generate a combined synthetic dataset. The research questions are as follows.

RQ1: Can a federated synthesis model reproduce the joint distribution of combined distributed datasets?

RQ1.1: What information does the server need to be able to reproduce that joint distribution?

RQ2: Is the utility of a synthesised combined dataset at least as good as that of the samples held by each client?

3 Methodology

The study is a simulation of a server and two clients. The basic simulation scenario is that the server generates synthetic data, which is then sent to the clients, who each calculate the similarity of the synthetic data to the sample that they hold and then send those similarity scores back to the server, where all client scores are combined and used to create the next generation of synthetic datasets. Our machine learning model of choice is the genetic algorithm. In Section 3.1 we describe GAs and motivate this choice, then in Section 3.2 we describe the data that we use and how it was set up for the simulation.


Table 1. Simple binary original dataset with ten rows, sampled from UK 1991 Census data, which was split into two five-row datasets, for clients A and B.

AGE  MSTATUS  SEX  LTILL  TENURE  client
1    2        2    2      2       A
1    1        1    2      2       A
1    1        2    2      2       A
2    2        2    2      1       A
1    1        1    2      1       A
2    2        2    2      1       B
1    2        1    2      1       B
1    1        2    2      1       B
1    1        2    1      2       B
1    1        1    2      1       B

3.1 Genetic Algorithms

Genetic Algorithms (GAs) (Holland, 1992) perform iterative optimisation. There are three main (biologically inspired) operators: selection (parental and environmental), crossover, and mutation. Broadly speaking, an initial population of candidate solutions is specified (in this case, a candidate solution is a synthetic dataset), and the fitness (the utility) of the candidates is calculated. The parental selection operator is used to select candidates (parents) to reproduce for a new population, with fitter candidates more likely to be selected. A crossover operator combines some of the parents (there are a variety of methods for this) to produce new candidate solutions (children). A mutation operator then mutates some of the candidates (i.e. randomly changes some of the features). The children or a combination of children and parents form the population of the next generation (this step is called environmental selection). This process is repeated multiple times (generations), using the fitness to guide it, with ideally fitter solutions produced with each generation. Commonly, the process terminates when a specified number of generations has been produced or a particular fitness level has been reached. GAs are flexible in that there are many parameters that can be changed or set, and the fitness function can be designed for the specific purpose. Work by Chen et al. (2017, 2018) has shown the feasibility of using GAs to generate synthetic microdata, and demonstrated the viability of using risk and utility as conflicting objectives (Chen et al., 2019). More recently, Thogarchety and Das (2023) used a GA approach to produce synthetic data to augment class imbalanced datasets and Liu et al. (2023) presented a GA method that generates synthetic data capable of approximating a range of statistical queries.
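To make the GA loop described above concrete, the sketch below evolves candidate binary synthetic datasets using binary tournament selection, mutation and elitism. It is a simplified illustration rather than the authors' implementation: the fitness function is left to the caller, and mutation here simply flips a binary value rather than drawing replacements from client-supplied univariate distributions as in the method described later.

```python
# Minimal GA sketch: evolve candidate binary synthetic datasets towards higher fitness.
import numpy as np

rng = np.random.default_rng(0)

def evolve(fitness, n_rows=10, n_cols=5, pop_size=50, generations=150, mutation_rate=0.05):
    # Initial population: pop_size candidate datasets with cell values in {1, 2}.
    population = rng.integers(1, 3, size=(pop_size, n_rows, n_cols))
    for _ in range(generations):
        scores = np.array([fitness(c) for c in population])
        children = []
        for _ in range(pop_size):
            # Binary tournament selection: the better of two random candidates becomes a parent.
            i, j = rng.integers(0, pop_size, size=2)
            parent = population[i] if scores[i] >= scores[j] else population[j]
            child = parent.copy()
            # Mutation: flip each cell between 1 and 2 with probability mutation_rate.
            mask = rng.random(child.shape) < mutation_rate
            child[mask] = 3 - child[mask]
            children.append(child)
        children = np.array(children)
        child_scores = np.array([fitness(c) for c in children])
        # Elitism: keep the best pop_size candidates from parents and children combined.
        combined = np.concatenate([population, children])
        combined_scores = np.concatenate([scores, child_scores])
        best = np.argsort(combined_scores)[::-1][:pop_size]
        population = combined[best]
    return population[0]  # population is ordered best-first after the final elitism step

# Example usage with a toy fitness: the fraction of cells equal to 1.
best = evolve(fitness=lambda cand: (cand == 1).mean())
print(best)
```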

3.2 Data

A (very) small binary dataset was used, which was randomly split into two datasets (of equal sizes) to represent two clients named client A and client B. The UK 1991 Census (University of Manchester and ONS, 2013) microdata was used, with 10 rows randomly sampled (from the same geographic area). Table 1 displays the data, with five variables (respectively: age, marital status, sex, long-term illness, and housing tenure) which were all converted to binary (using values of 1 and 2). This is called the original data set. It was randomly split into two five-row datasets, one representing client A and one client B, these are identified in Table 1.


Table 2. Parameters that were fixed in the experiments.

Parameter                                  Type        Value chosen          Further details
No. of clients                             Simulation  VARIES                -
Initial metadata sent by clients           Simulation  Univariates           -
Combination of client scores               Simulation  VARIES                -
No. of objectives for GA                   Simulation  1                     Similarity (utility)
SDC applied to the output sent to server   Simulation  None                  -
Output passed to client by server          Simulation  VARIES                -
Population size                            Model       50                    -
Parental selection                         Model       Binary tournament     k=2
Mutation rate                              Model       0.05                  -
Crossover operator                         Model       None                  -
Environmental selection                    Model       Elitism               -
No. of generations                         Experiment  150                   -
Choice of dataset                          Experiment  UK Census microdata   1991
No. of rows (per client)                   Experiment  VARIES                -
No. of variables                           Experiment  5                     -
Type of variables                          Experiment  Binary                -
No. of runs                                Experiment  5                     -

Table 3. Parameters varied by experiment.

Parameter                            Experiment 1                Experiment 2                Experiment 3
No. of clients                       1                           2                           2
Combination of client scores         N/A                         None                        Mean
Output passed to client by server    Synthetic clients dataset   Synthetic clients dataset   Synthetic combined dataset
No. of rows (per client)             10                          5                           5

3.3 Method and Parameters

The potential range of variation in the simulation is huge. There are three types of parameters that could be varied in the study design:

Model parameters: changeable settings for the GA (e.g., mutation rate).

Simulation parameters: variations in the scenario being presented (e.g., number of clients).

Experimental parameters: elements of the study design that are not part of the simulation itself (e.g., number of runs, data choices).

A set of these is shown in Table 2. For the proof-of-concept experiments we have chosen one value for most of these parameters; a much simpler set than might be used in practice. As well as using a very small sample (of real data), we kept the model complexity low. This simplicity assists us with the interpretation of the results. We have varied four of the parameters across three experiments; these are shown in Table 3. The first two experiments are used to establish a baseline. In experiment 1 we have just a single client; in effect, this tests whether a GA can reproduce the original data when unencumbered by the distributed data. In experiment 2 we split the data across two clients, but the server has a separate interaction with each client and is then deemed to combine the data at the end. This is in effect a minor variation on experiment 1. Experiment 3 is the main experiment, and we now describe what is simulated in more detail.


The experiment 3 simulation is represented graphically in Figure 1. The GA runs on the central server, and at the start of the process (labeled Initialisation, in the figure) each of the clients sends metadata about their individual data to the server. At the most basic, the server would need to know the variable names and the size (how many records) of the data. It is expected that the clients will agree in advance on the variables to be included. For this experiment, the clients send the univariate distributions (this information is used by the GA to mutate the data) and the number of records in each dataset.

Figure 1. An illustration of the federated synthesis simulation used for Experiment 3, with a server and two clients.

The server then combines the distributions of each client by taking the average to calculate a combined distribution. An initial population of synthetic datasets (candidates) is generated; these are drawn from the uniform distributions of the five variables.

The datasets in the initial population have the same number of records as the combined client data would have, and the same variables. For this simple model, only one objective is assessed by the clients: the similarity between the multivariate distribution of the client's data and that of each of the synthetic datasets passed by the server to the client. The clients then send those similarity scores back to the server. In detail, the similarity measure calculates the proportion of every combination of values in the synthetic (candidate) and client data, then takes the mean of the absolute differences; this is then subtracted from 1 so that the similarity score takes a value between zero and one (where 0 indicates no similarity and 1 indicates an identical distribution). A sketch of this measure is given below. Once the server receives each of the client scores, it calculates the mean to produce an overall score for each synthetic dataset. In experiment 3 the server simply averages the client scores, but they could be combined in other ways (e.g. using the lowest or the highest, or weighted by how similar the clients' scores are to each other). This completes the initialisation phase.

The main repeating process involves selection and mutation, but not crossover; this was excluded to reduce complexity. Firstly, parents are selected from the population using tournament selection (two synthetic datasets are randomly selected and the one with the highest similarity score wins). Two parents produce two children (i.e., two datasets), which are simply copies of themselves (where crossover is used, the children may be a combination of the parents), and the same number of children are produced as the population size. Then, each child is mutated with a probability equal to the mutation rate (0.05), with the replacement value being drawn from the relevant univariate distribution. The children are then sent to the clients, who score the similarity and send the scores back to the server, where they are aggregated. Finally, elitism is used to select the next generation: the best (those with the highest similarity) of the parents and the children are retained for the next generation (e.g. if a parent has higher similarity than a child, the parent is retained). This process is repeated for a set number of generations.
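The sketch below gives one interpretation of the similarity measure described above: it compares the proportions of each full-row combination of values in the candidate and client data, takes the mean absolute difference over the combinations observed in either dataset (the text does not specify exactly which set of combinations is averaged over), and subtracts the result from 1. It is an illustration, not the authors' exact implementation.

```python
# Minimal sketch of the similarity (utility) score: 1 minus the mean absolute difference
# between the joint-combination proportions of the candidate and the client data.
from collections import Counter

def similarity(candidate_rows, client_rows):
    def proportions(rows):
        counts = Counter(map(tuple, rows))
        n = len(rows)
        return {combo: c / n for combo, c in counts.items()}

    p_cand, p_client = proportions(candidate_rows), proportions(client_rows)
    combos = set(p_cand) | set(p_client)  # every combination seen in either dataset
    mean_abs_diff = sum(abs(p_cand.get(c, 0.0) - p_client.get(c, 0.0)) for c in combos) / len(combos)
    return 1.0 - mean_abs_diff

# Example: client A's five rows from Table 1; an identical candidate scores 1.0.
client_a = [(1, 2, 2, 2, 2), (1, 1, 1, 2, 2), (1, 1, 2, 2, 2), (2, 2, 2, 2, 1), (1, 1, 1, 2, 1)]
print(similarity(client_a, client_a))  # 1.0
```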

4 Results

Each experiment was repeated five times (using different random seeds). The plots in Figures 2-4 give the mean similarity score across the population for each of the generations for which the GA was run.

Figure 2 shows the results for experiment 1. For all runs the GA converged (that is, the synthetic datasets reproduced the original). The goal would not generally be to reproduce the original dataset, but this demonstrates that the GA works (albeit on a very small dataset). In experiment 2, the GA was run separately on both clients' five-row datasets, with the results plotted in Figure 3. Each of the five runs converged to one (that is, all runs reproduced the clients' data), and so the clients' resulting data could be combined to reproduce the original dataset.

The results for experiment 3 (as described in Figure 1) are illustrated in Figure 4. Panels 1 and 2 illustrate the scores calculated by clients A and B; these individual scores are sent to the server, which aggregates them, as displayed in panel 3. The aggregated score is what drives the GA (clients A and B do not see each other's scores; they only communicate with the server). The plot highlights that the synthetic datasets generated in run 3.3 scored highly with client A but poorly with client B; however, when aggregated by the server, all five of the runs look remarkably similar. Run 3.3 is interesting: until about generation 20 the gradient on both clients is very similar to the other runs, but around that generation a bifurcation happens. This appears to be the result of the process falling into a local optimum in which client A's dataset was optimised at the expense of client B's. This was the result of some mutation around generation 20 (subsequent test runs with the same starting seed failed to reproduce this result). Panel 4 shows the similarity scores of the synthetic datasets produced at each generation against the real combined data; by definition this would not be possible in a real-life scenario, since the original data would not be available, but it is calculated post hoc to evaluate how the overall model is working (i.e., we are more interested in whether the server is reproducing the overall dataset than whether it replicates individual client distributions).

Figure 2. Experiment 1, the mean (n=50) similarity of five randomly initialised runs of the GA on the original ten-row dataset. Note the truncated y axis.

Figure 3. Experiment 2, the mean (n=50) similarity of five randomly initialised runs of the GA on the five-row datasets of client A and client B. Note the truncated y axis.

Panel 4 shows that for all but run 3.3 the model converges on the original data, that is, each run reproduces the original dataset. This is a particularly fascinating finding, as it has done this despite the evaluations from the clients indicating sub-optimality. The baseline is included to indicate the combined client-to-server data similarity. Panel 3 of Figure 4 illustrates that (at least in this example) it may be difficult for the server to determine how well the overall model is performing. Other methods of combining the client scores were also considered as variations on experiment 3 (minimum, weighted, and alternating); the results are shown in the Appendices.


Figure 4. Experiment 3, the mean (n=50) similarity scores of five randomly initialised runs of the server GA, showing client A (1), client B (2), the server aggregated scores (3) and the comparison against the original data (4). Note the truncated y axes.

5 Discussion

The results of experiment 3 demonstrate our proof of concept: four out of five of the runs reproduced the original data. Fascinatingly, they did this despite the mean evaluation scores from the clients indicating that the operation had not achieved unity. This, however, was simply a reflection of each client's own sample not fully representing the combined dataset's structure. Thus the synthetic datasets were a better representation of the 'real' combined dataset than the 'samples' held by each client. This emergent reproducibility shows how the approach could deliver the desired outcome of producing analytically useful datasets synthesised across distributed datasets.

The experiments reported here focused on the single objective of utility, and in this case the goal was to reproduce the original data. In a real-life scenario, there would also be a consideration of risk: reproducing the original data would not be desirable. A way to incorporate risk would be to use a multi-objective approach within the GA and explore options such as Pareto optimality. The flexibility of GAs means that different utility and risk measures could easily be added. Another angle would be to use deep learning methods (such as GANs and diffusion models) and adapt them to multi-objective optimisation (GANs are already widely used within FL). The fact that in our experiments it was not clear on the server that the original data had been reproduced might be thought of as useful in terms of disclosure risk, but it would also mean that in this mechanism we could not rely on server-side restraint to manage risk.

An obvious and important expansion of these experiments is to test the method on larger and more complex datasets. Firstly, we need to establish whether the emergent reproducibility effect scales. Also, for very large datasets it may simply not be practical to send the entire population of datasets at each generation, and so alternatives may need to be explored. Another important expansion is to examine the effect of having more than two clients. The flexibility of the method also means that there are many parameters that can be experimented with.

6 Conclusion

The purpose of this study was as a proof of concept to determine whether using FL together with a GA to produce synthetic data was feasible. We have shown that it is feasible, albeit with a very small dataset, and with the focus being only synthetic data utility. The results are promising and there are many areas of future work including testing this on larger, more complex datasets, using a multiobjective approach that incorporates risk, and experimenting more generally with the various parameters.

References

Bonawitz, K. A., V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2016). Practical secure aggregation for federated learning on user-held data. In NIPS Workshop on Private Multi-Party Machine Learning. https://arxiv.org/abs/1611.04482.

Chen, Y., M. Elliot, and D. Smith (2018). The application of genetic algorithms to data synthesis: a comparison of three crossover methods. In Privacy in Statistical Databases. PSD 2018, pp. 160–171. Springer.

Chen, Y., M. J. Elliot, and J. W. Sakshaug (2017). Genetic algorithms in matrix representation and its application in synthetic data. In UNECE Worksession on Statistical Confidentiality. https://unece.org/fileadmin/ DAM/stats/documents/ece/ces/ge.46/2017/2_Genetic_algorithms.pdf.

Chen, Y., J. Taub, and M. J. Elliot (2019). Trade-off between information utility and disclosure risk in ga synthetic data generator. In Joint UNECE/Eurostat Expert Meeting on Statistical Data Con- fidentiality. https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2019/mtg1/ SDC2019_S3_UK_Chen_Taub_Elliot_AD.pdf.

Duan, S., C. Liu, P. Han, X. Jin, X. Zhang, T. He, H. Pan, and X. Xiang (2023). Ht-fed-gan: Federated generative model for decentralized tabular data synthesis. Entropy 25(1). DOI: 10.3390/e25010088.

Fang, M. L., D. S. Dhami, and K. Kersting (2022). Dp-ctgan: Differentially private medical data generation using ctgans. In M. Michalowski, S. S. R. Abidi, and S. Abidi (Eds.), Artificial Intelligence in Medicine, pp. 178–188. Springer International Publishing. DOI: 2022.10.1007/978-3-031-09342-5_17.

Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative Adversarial Nets. In Proceedings of the Advances in Neu- ral Information Processing Systems, Volume 27. https://papers.nips.cc/paper/2014/file/ 5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

Ho, J., A. Jain, and P. Abbeel (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851. https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.

Holland, J. H. (1992). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press.

Hundepool, A., J. Domingo-Ferrer, L. Franconi, S. Giessing, E. Schulte Nordholt, K. Spicer, and P.-P. de Wolf (2012). Statistical Disclosure Control. Wiley Series in Survey Methodology. John Wiley & Sons, Ltd. ISBN: 978-1-119-97815-2.

Kairouz, P., H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2021). Advances and open problems in federated learning. Foundations and Trends in Machine Learning 14(1–2), 1–210. DOI: 10.1561/2200000083.

Kingma, D. and M. Welling (2014). Auto-encoding variational bayes. DOI: 10.48550/ARXIV.1312.6114.

Konecny, J., H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon (2016). Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private Multi-Party Machine Learning. https://arxiv.org/abs/1610.05492.

Kumar, Y. and R. Singla (2021). Federated Learning Systems for Healthcare: Perspective and Recent Progress, pp. 141–156. Cham: Springer International Publishing. DOI:10.1007/978-3-030-70604-3_6.

Little, R. (1993). Statistical Analysis of Masked Data. Journal of Official Statis- tics 9(2), 407–426. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/ statistical-analysis-of-masked-data.pdf.

Liu, T., J. Tang, G. Vietri, and Z. S. Wu (2023). Generating private synthetic data with genetic algorithms. DOI: 10.48550/arXiv.2306.03257.

Lomurno, E., A. Archetti, L. Cazzella, S. Samele, L. Di Perna, and M. Matteucci (2023). Sgde: Secure generative data exchange for cross-silo federated learning. In Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition, pp. 205–214. Association for Computing Machinery. DOI: 10.1145/3573942.3573974.

McMahan, B., E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. PMLR. http://proceedings.mlr.press/v54/mcmahan17a/mcmahan17a.pdf.

McMahan, B. and A. Thakurta (2022). Federated learning with formal differential privacy guarantees. https: //ai.googleblog.com/2022/02/federated-learning-with-formal.html, accessed 2023-05-24.

Nowok, B., G. Raab, and C. Dibben (2016). Synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software 74(11). DOI: 10.18637/jss.v074.i11.

Qu, H., Y. Zhang, Q. Chang, Z. Yan, C. Chen, and D. Metaxas (2020). Learn distributed gan with temporary discriminators. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm (Eds.), Computer Vision – ECCV 2020, pp. 175–192. Springer International Publishing. DOI: 10.1007/978-3-030-58583-9_11.

Radford, A., J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019). Language models are unsupervised multitask learners. OpenAI blog 1(8), 9. https://d4mucfpksywv.cloudfront.net/ better-language-models/language-models.pdf.

Rieke, N., J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein, et al. (2020). The future of digital health with federated learning. NPJ digital medicine 3(1), 119. DOI:10.1038/s41746-020-00323-1.

Rubin, D. B. (1993). Statistical Disclosure Limitation. Journal of Official Statistics 9(2), 461–468. https: //ecommons.cornell.edu/bitstream/handle/1813/23033/rubin-1993.pdf?sequence=7.

Sohl-Dickstein, J., E. Weiss, N. Maheswaranathan, and S. Ganguli (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, Volume 37, pp. 2256–2265. https://proceedings.mlr.press/v37/sohl-dickstein15.html.

Thogarchety, P. and K. Das (2023). Synthetic data generation using genetic algorithm. In 2023 2nd International Conference for Innovation in Technology (INOCON), pp. 1–6. DOI: 10.1109/INOCON57975.2023.10101072.

University of Manchester and ONS (2013). Census 1991: Individual Sample of Anonymised Records for Great Britain (SARs). http://doi.org/10.5255/UKDA-SN-7210-1.


Weldon, J., T. Ward, and E. Brophy (2021). Generation of synthetic electronic health records using a federated gan. DOI: 10.48550/arXiv.2109.02543.

Zhang, J., G. Cormode, C. Procopiuc, D. Srivastava, and X. Xiao (2017). PrivBayes: Private data release via Bayesian networks. ACM Transactions on Database Systems 42(4). DOI: 10.1145/2588555.2588573.

Zhao, Z., R. Birke, A. Kunar, and L. Y. Chen (2021). Fed-tgan: Federated learning framework for synthesizing tabular data. DOI: 10.48550/arXiv.2108.07927.


A Using the worst client scores to drive the GA, rather than averaging

Figure 5. Mean (n=50) similarity scores of five randomly initialised runs of the server GA, where only the worst (lowest) client score is used to drive the GA (rather than averaging both client scores). Showing client A (1), client B (2), the worst scores (3) and the comparison against the original data (4). Note the truncated y axes.


B Using weighted averaged score to drive the GA

Figure 6. Mean (n=50) similarity scores of five randomly initialised runs of the server GA, where a weighted averaged score is used to drive the GA. Where the client scores are close (the clients agree) the scores are weighted higher, where they are far apart (the clients disagree) the scores are weighted lower. Showing client A (1), client B (2), the server weighted averaged scores (3) and the comparison against the original data (4). Note the truncated y axes.
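The exact weighting formula is not given; purely as an illustration, the following minimal Python sketch shows one way a server could down-weight the combined score when the two client scores disagree (the function name and the agreement term are assumptions, not the authors' implementation).

```python
def weighted_server_score(score_a: float, score_b: float) -> float:
    """Illustrative only: combine two client similarity scores so that
    close agreement between clients increases the driving score.

    The agreement term (1 - |score_a - score_b|) is an assumption; the
    text only states that close scores are weighted higher and distant
    scores lower.  Scores are assumed to lie in [0, 1]."""
    mean_score = (score_a + score_b) / 2.0
    agreement = 1.0 - abs(score_a - score_b)
    return mean_score * max(agreement, 0.0)

# Clients that agree closely yield a higher driving score:
print(weighted_server_score(0.80, 0.78))  # ~0.774
print(weighted_server_score(0.80, 0.40))  # ~0.36
```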


C Alternating the client scores to drive the GA, rather than averaging

Figure 7. Mean (n=50) similarity scores of five randomly initialised runs of the server GA, where the alternating client score is used to drive the GA (five generations using client A, five using client B, etc.). Showing client A (1), client B (2), the server mean alternating scores (3) and the comparison against the original data (4). Note the truncated y axes.



Experiments on Federated Data Synthesis

Claire Little, Mark Elliot, Richard Allmendinger

University of Manchester

Questions? https://tinyurl.com/QuestionsUoM

Federated Learning (FL)
FL (McMahan et al., 2017) is a decentralized approach to training statistical models
• Multiple clients can produce one global model

• Clients do not share or exchange their own data

• Can reduce privacy and security risks (compared to methods that combine multiple data sources)

• Allows models to train on data that is more representative of the whole distribution

• Useful where clients do not possess enough data to generate the required statistical power

Federated Learning (FL)
Central server controls the process (but does not access any client data)
• Initialises model, sends to each client
• Typically, neural network type models are used

Each client trains the model on their own data
• Sends updates (parameters or model weights) back to server

Server aggregates the client updates
• Sends updated model back to clients

Iterative process
• Training usually terminated when a specific criterion is met, e.g., maximum number of iterations

NVIDIA - A centralized-server approach to federated learning. https://blogs.nvidia.com/blog/2019/10/13/what-is-federated-learning/
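As a hedged illustration of the loop just described (not the authors' code), one round of the federated-averaging scheme of McMahan et al. (2017) can be sketched in Python; here the model is simply a parameter vector and the local update function stands in for client-side training.

```python
import numpy as np

def fedavg_round(global_params, clients, local_update):
    """One round of federated averaging (sketch).

    global_params : np.ndarray                 current global model parameters
    clients       : list of client datasets    data never leaves the client
    local_update  : fn(params, data) -> (new_params, n_examples)
    """
    updates, weights = [], []
    for data in clients:                        # each client trains locally
        new_params, n = local_update(global_params.copy(), data)
        updates.append(new_params)              # only parameters are shared
        weights.append(n)
    weights = np.array(weights, dtype=float)
    weights /= weights.sum()
    # server aggregates: weighted average of the client parameters
    return sum(w * u for w, u in zip(weights, updates))
```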

Federated Synthesis
Using FL to generate synthetic data
• Emerging research field

• Small body of research focussing mostly on image data

• Less research on tabular data

• Methods predominantly use GANs (Generative Adversarial Networks, Goodfellow et al. 2014)

Is it possible to produce useful synthetic microdata in a federated way? • Proof of concept using Genetic Algorithm (GA)

Genetic Algorithms (GAs)
GAs (Holland, 1992) perform iterative optimisation, training over multiple generations
• Three main biologically inspired operators: Selection, Crossover, Mutation

➢ Initial population of candidate solutions (candidate solution = synthetic dataset)
➢ Fitness (similarity to original data) of each candidate calculated
➢ Select fitter candidates (parents) to reproduce for new population
➢ Crossover – combines parents to produce new candidates (children)
➢ Mutation – randomly change some of the candidates' features
➢ Next generation – children, or combination of best (fittest) parents and children (elitism)
➢ Repeat process multiple times (generations) using fitness to guide
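A minimal sketch of this kind of server-side loop follows, assuming binary data and mutation-only evolution as in the study; the function and variable names are invented, and the only client involvement is the similarity-score call, so no client data is shared.

```python
import numpy as np

rng = np.random.default_rng(0)

def server_ga(n_rows, n_cols, client_scores, pop_size=20, generations=200,
              mutation_rate=0.05):
    """Mutation-only GA run on the server (illustrative sketch).

    client_scores(candidate) must return a list of similarity scores, one per
    client, each computed locally against that client's own data."""
    # initial population of random binary candidate datasets
    population = [rng.integers(0, 2, size=(n_rows, n_cols))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # fitness = average of the per-client similarity scores
        fitness = [np.mean(client_scores(cand)) for cand in population]
        order = np.argsort(fitness)[::-1]
        parents = [population[i] for i in order[: pop_size // 2]]
        children = []
        for parent in parents:                    # elitism: parents are kept
            child = parent.copy()
            mask = rng.random(child.shape) < mutation_rate
            child[mask] = 1 - child[mask]         # flip mutated cells
            children.append(child)
        population = parents + children
    return max(population, key=lambda c: np.mean(client_scores(c)))
```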

Study Design – Data
A (very) simple binary dataset, randomly sampled from UK 1991 Census microdata (University of Manchester, 2023)
• Small dataset to enable understanding

• 10 rows, 5 binary variables

• “Original” dataset

• Randomly split into two five-row datasets

• representing two clients (A and B)

AGE  MSTATUS  SEX  LTILL  TENURE  client
 1      2      2     2      2       A
 1      1      1     2      2       A
 1      1      1     2      2       A
 2      2      2     2      1       A
 1      1      1     2      1       A
 2      2      2     2      1       B
 1      2      2     2      1       B
 1      1      1     2      1       B
 1      1      1     1      2       B
 1      1      1     2      1       B

Study Design – Parameters
Huge potential range of variation in the simulation. Three types of parameters:
• Model: changeable settings for the GA (e.g., mutation rate)

• Simulation: variations in the scenario being presented (e.g., number of clients)

• Experimental: elements that are not part of the simulation itself (e.g., data choice, number of runs)

Model complexity is kept low to aid with interpreting the results

• Focus only on utility (not risk)

• Small dataset

• GA uses mutation but not crossover

• Two clients for FL

Study Design - Parameters

Results – Experiment 1
Running GA on the original dataset (10 rows):
• All five randomly initialised runs converged, i.e., they reproduced the original dataset

Results – Experiment 2
Running GA separately on client A and B datasets (5 rows each):
• For each, all five randomly initialised runs converged and reproduced the original dataset

Results – Experiment 3
FL with two clients (A and B):
• All but one of the randomly initialised runs converged and reproduced the original datasets
• Panel 4 would not be available in reality – used for evaluation
• Convergence achieved despite the evaluations from clients, and the server aggregated score indicating suboptimality

Discussion
Experiment 3 demonstrates proof of concept:
• Analytically useful datasets were synthesised across distributed datasets

It was not clear on the server that the original data had been reproduced:
• Might be useful in terms of disclosure risk
• Means we cannot rely on server-side restraint to minimise risk

Caveats and future work
Experiments conducted on small sample of binary Census microdata:
• May not scale to larger, more complex data
• Very large datasets may be computationally impractical

Would need to consider different parameters:
• More than 2 clients

Single-objective focus on utility:
• In a real-life scenario, the goal would not be to reproduce the original data
• Risk would need to be factored in
  ◦ A multi-objective approach within the GA could be used
  ◦ Deep learning methods also a possibility

Questions? https://tinyurl.com/QuestionsUoM

Email: [email protected]

References McMahan, B., E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. PMLR. http://proceedings.mlr.press/v54/mcmahan17a/mcmahan17a.pdf

Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Volume 27. https://papers.nips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf

Holland, J. H. (1992). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press.

University of Manchester, Cathie Marsh Centre for Census and Survey Research, Office for National Statistics, Census Division. (2023). Census 1991: Individual Sample of Anonymised Records for Great Britain (SARs). [data collection]. UK Data Service. SN: 7210, DOI: http://doi.org/10.5255/UKDA-SN-7210-1


SACRO: semi-automated output checking, University of the West of England

output checking, confidential data, automate checking, analytical languages, secure environments, 

Languages and translations
English

UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert meeting on Statistical Data Confidentiality 26–28 September 2023, Wiesbaden

SACRO: Semi-Automated Checking Of Research Outputs

Jim Smith1, Richard Preen1, Maha Albashir1, Felix Ritchie1, Elizabeth Green1, Simon Davy2, Pete Stokes2, Sebastian Bacon2
1: University of the West of England, UK; 2: Bennett Institute, University of Oxford

[email protected]

Abstract

Output checking can require significant resources, acting as a barrier to scaling up the research use of confidential data. We report on a project, SACRO, that is developing a general-purpose, semi-automatic output checking system that works across the range of restricted research environments. SACRO is designed to:

• Automate checking of the most common statistics, using best-practice principles-based modelling.
• Support researchers using the major analytical languages (R, Python and Stata), with minimal changes, by exploiting the ‘wrapper’ approach successfully trialled already.
• Support secure environments with different operating models and output checking workflows, through a process of co-design to maximise usability.

SACRO builds on previous work: ACRO (funded by Eurostat and reported in the 2021 Workshop), which established the proof of concept; and Py-ACRO, which showed how a software-independent tool might be developed. It differs from those earlier projects in the wider range of statistics covered and the requirement to achieve general applicability. To do this, the project draws on our extensive networks of practitioners. A series of workshops and ‘hands-on’ evaluations ensure the design frameworks support buy-in from a wide range of prospective users across health and social sciences, and from the public and private sectors.

1 Introduction

Statistical agencies and other custodians of secure facilities such as Trusted Research Environments (TREs) Hubbard et al. (2020) provide researchers with access to confidential data under the ‘Five Safes’ framework Ritchie (2017). This enforces five orthogonal layers of safety procedures, and the last requires explicit checking of research outputs for disclosure risk. This can be a time-consuming and costly task, requiring skilled staff. This paper discusses the development of an open source tool for automating the statistical disclosure control (SDC) of routine research outputs. The goal is to make the clearance process more efficient and timely, and to allow the skilled checkers to focus their attention on the less straightforward cases. The purpose of the tool (SACRO, for Semi-Automated Checking of Research Outputs) is to assist researchers and output checkers by distinguishing between research output that is safe to publish, output that requires further analysis, and output that cannot be published because of substantial disclosure risk. This work builds upon a previous Eurostat-funded project Green et al. (2020, 2021) in which Green, Ritchie and Smith developed a proof-of-concept prototype for the proprietary Stata software. The primary new contributions reported in this paper are:

• The implementation of a Python toolkit.
• An extensible multi-language platform with interfaces familiar to users of popular statistical tools.
• ‘Skins’ in Stata and the language R, demonstrating cross-language support.
• An open source repository with examples, help, documentation, etc.

2 Background

The Five Safes framework Ritchie (2017) is a set of principles that enable services to provide safe research access to their data and has been adopted by a range of TREs, including the Office for National Statistics (ONS), Health Data Research UK (HDR-UK), and the National Institute for Health Research Design Service (NIHR), as well as many others worldwide. Ensuring the last of these, ‘safe outputs’, is a complex and often costly human labour-intensive process. Automated output checking aims to improve the rigour and consistency of the output disclosure control process and reduce human workload by automatically identifying, reporting, and (optionally) suppressing disclosive outputs where possible and categorising outputs as ‘safe’ or ‘unsafe’. ‘Safe’ outputs requiring no or minimal further changes can be expedited through the clearing process whereas ‘unsafe’ outputs can be prioritised for human review Ritchie (2008).

A small number of SDC tools have been produced to assist in the process of achieving ‘safe outputs’, such as tauArgus and sdcTable1; however these are primarily designed for users such as National Statistics Institutes, as they require expert knowledge of SDC to use effectively. Moreover, they are designed for tabular outputs, and do not cover the range of statistics produced by researchers.

With the aim of improving the efficiency of the process, and (where applicable) reducing the amount of user training required, a recent Eurostat project Green et al. (2021) developed a proof-of-concept prototype in Stata where primary disclosure is regulated by a set of simple rules. For example, a minimum threshold rule applied to the number of observations used by a statistic ensures that there is sufficient uncertainty with respect to any individual respondent. Dominance rules protect large respondent values from being approximated where the contribution to a statistic is dominated by only a few individuals. For example, the p%-rule sorts the N observations by magnitude and checks whether the sum of the smallest N − 3 observations is at least p% of the largest observation. The NK rule checks that the largest N observations contribute less than K% of the total. Also, not all aggregation statistics are permitted: reporting minima or maxima values of a subgroup is prohibited, and regressions are protected by checking that the residual degrees of freedom exceed a minimum threshold.

1Respectively, https://github.com/sdcTools/tauargus and https://github.com/sdcTools/sdcTable
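To make the two dominance rules concrete, the short Python sketch below implements them exactly as described above (this is illustrative, not ACRO's internal code; the default parameter values mirror Table 1, expressed as fractions).

```python
import numpy as np

def p_percent_ok(values, p=0.1):
    """p%-rule as described in the text: after sorting the N observations,
    the sum of the smallest N-3 must be at least p (as a fraction) of the
    largest observation."""
    v = np.sort(np.asarray(values, dtype=float))
    return v[:-3].sum() >= p * v[-1]

def nk_ok(values, n=2, k=0.9):
    """NK rule: the largest n observations must contribute less than
    k (as a fraction) of the cell total."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    total = v.sum()
    return total > 0 and v[:n].sum() < k * total

cell = [100.0, 50.0, 3.0, 2.0, 1.0]
print(p_percent_ok(cell), nk_ok(cell))  # False False: dominated by two contributors
```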

Building on the experience of the initial proof-of-concept, funding was secured from the UK Research Council’s DARE initiative2 for the project Semi-Automated Checking of Researcher Outputs (SACRO), which involves:

• Computer scientists with backgrounds ranging from AI research to commercial software development.
• A range of TREs as co-designers of a toolset.
• SDC theorists and statisticians to provide a conceptual framework for handling different types of output and providing guidance to researchers and output checkers.
• Public Involvement and Engagement specialists and groups to develop a consensus statement around the use of (semi-)automation in disclosure control.
• Researchers from a previous DARE project examining the output checking of machine learning models trained on sensitive data within a TRE Jefferson et al. (2022).

In this paper we report on the principal tools developed within the SACRO project, specifically:

1. A toolkit for researchers to use within TREs that produces automated reports on disclosure risk with minimal changes to their practice – simply prefixing common commands with the word ‘acro’.
2. Explicit support for researchers to reduce the number of disclosive outputs they request.
3. Cross-language support, with exemplar interfaces provided for Stata and R.
4. Support for the output types that our TRE partners tell us form the majority of requested releases.
5. A stand-alone viewer for TRE output staff to facilitate rapid, informed, and audited decision making.
6. A revised guide incorporating theoretical developments, directly linked to its implementation in SACRO.

3 The SACRO toolkit

SACRO is composed of three parts which may be deployed independently: the main ‘ACRO-engine’, a stand-alone viewer, and ‘AI-SDC’ - support for disclosure control of machine learning models (described elsewhere).

3.1 Design Philosophy

The operational design philosophy is extensively documented in Green et al. (2020), who studied the characteristics that an automated solution needs to have to be feasible, effective, and a positive choice for users. Essential criteria are that it should be:

• Acceptable to users, output checkers and TRE managers;
• Able to implement an organisation’s business rules for primary and secondary disclosure, which may vary across datasets or users;
• Comprehensive, even if the automated tool’s response is “I don’t know, so this needs manual checking”;
• Consistent, providing the same results across different studies within a TRE, and across TREs;
• Able to support exceptions under principles-based regimes;
• Scalable over users and outputs.

Key operational requirements were for the tool to work in different technical environments, and to be easily updated through well understood mechanisms. This meant separating the software itself (distributed through a recognised channel3) from the specification of a given TRE’s risk appetite (held in a human- and machine-readable and editable file). Acceptability to users was identified as the most crucial element. If researchers and output checkers see the tool as something that makes their life better and easier, then they are more likely to use it effectively. Hence, designing the user interface was identified as a separate workstream in SACRO, and given the same resources as the design and implementation of the output-checking component. This is also one reason why SACRO set up a large network of potential users and testers (see Sec. 6 below).

2https://dareuk.org.uk/

3for example, PyPI (https://pypi.org) or CRAN (https://cran.r-project.org)

[Figure 1 (schematic): researchers issue analysis commands prefixed by ‘acro’ in Python, R, Stata, etc.; light-weight translation functions pass them to the Python disclosure control checks (threshold, dominance, degrees of freedom; applying cell suppression and other measures), built on standard Python libraries (Pandas for tables, statsmodels for regression) and reading a TRE-specific file detailing risk appetite; on finalise, an Excel spreadsheet or JSON file with details and recommendations for each requested output is produced for TRE staff to approve, discuss or reject.]

Figure 1. Schematic illustration of ACRO.

The ‘proof-of-concept’ version of ACRO did not address secondary disclosure (such as checking for differencing across tables), for two reasons. First, business rules for secondary checking are often not clear or comprehensive. Second, ACRO/SACRO works by intercepting commands and assessing disclosure risk at the time the output is being produced. Analysing results post-hoc is a considerably harder problem, requiring the researcher to produce a lot more information and also locate the other outputs to be compared. Although SACRO does not currently (as of July 2023) carry out secondary disclosure review, we are investigating how to at least flag potential differencing risks across the set of outputs from a research ‘session’, and in future, create a library of outputs which might allow secondary disclosure to be assessed, even if only partially.

3.2 Workflow

ACRO Preen et al. (2023) is an open source toolkit (MIT License) that provides a light-weight ‘skin’ that sits over well-known analysis tools, in a variety of languages researchers might use. The process is illustrated in Fig. 1. This adds functionality to identify potentially disclosive outputs against a range of commonly used disclosure tests and report to researchers and TREs reasons why outputs should not be released ‘as-is’. It creates simple summary documents TRE staff can use to streamline their workflow. ACRO has been designed with the following aims:

• Reducing barriers to adoption via a front-end application programming interface (API) that is similar to those already commonly used by researchers in their favoured language.

• Providing researchers with: immediate feedback on the results of disclosure checks (on-screen alongside their query results); facilities to add comments or exception requests, and control over what is submitted for review, e.g., removing disclosive outputs if they use feedback to design non-disclosive ones.

• Having a single back-end code base constituting a single source of truth for performing checks, with extensibility for different languages and ongoing support and consistency.

• Providing easy-to-understand help and documentation.

In practice, researchers prepare their data and statistical queries in the usual way, in their preferred language, using common commands prefixed by ‘acro’. The lightweight ACRO translation functions then call the Python back-end, which executes the queries and performs the requisite output checks. The results of the checks and the queries are immediately displayed to the researcher, and full details are stored in a list. When the user calls acro.finalise() to end their session, outputs and all SDC details are saved to file for review by a TRE output checker. A schematic illustration of the ACRO workflow is shown in Figure 1 and some notebooks demonstrating example code usage and output are available via the ACRO project wiki4.

4https://github.com/AI-SDC/ACRO/wiki
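As a brief illustration of this workflow (a sketch, not taken from the project wiki: the dataframe is invented, and the top-level import path is an assumption; the commands themselves are those documented in Sec. 3.4):

```python
import pandas as pd
from acro import ACRO  # installed via `pip install acro` (import path assumed)

# invented example data
df = pd.DataFrame({
    "region": ["north", "south", "north", "east"] * 25,
    "sex": ["f", "m"] * 50,
    "income": range(100),
})

acro = ACRO()                             # start a session; optionally pass the TRE's config
table = acro.crosstab(df.region, df.sex)  # checked cross tabulation, feedback shown on screen
acro.print_outputs()                      # list the outputs queued for checking
acro.finalise("outputs", "json")          # write outputs plus SDC details for the output checker
```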


3.3 Checks Implemented

For tabular data (e.g., cross tabulation and pivot tables), we prohibit the reporting of the maximum or minimum value in any cell that represents a sub-group of one or more contributors. Moreover, we suppress the value of the aggregation statistic (mean, median, variance, etc.) for any cell deemed to be sensitive, and report the reason. ACRO currently supports the three most common tests for sensitivity: ensuring the number of contributors is above a frequency threshold, and testing for dominance via the p% and NK rules. ACRO builds a series of suppression masks, which indicate which cells are to be suppressed for each check. A summary outcome table indicating which suppression rule was applied to each cell is presented to the researcher (the grey box in Fig. 2), alongside the query results. For regressions, e.g., linear, probit and logit regression, the tests verify that the number of degrees of freedom exceeds a threshold. Immediate feedback on all these checks is designed to support researchers to improve their practice and so reduce the SDC bottleneck by making fewer disclosive requests.

The checking of graphical plots is not currently implemented, as this is a complex problem with many different methods for producing visualisations. However, we expect to have some support by Autumn 2023. As noted above, all of these tests and checks are configurable according to the TRE’s risk appetite. The data custodian, e.g., a TRE staff member, specifies the parameter values used for the output checks in a YAML5 configuration file, which is loaded upon ACRO initialisation. The default ACRO parameters are shown in Table 1. Future releases will offer the option to override these at a dataset, or even attribute, level.

Table 1. ACRO Default Parameters for sensitivity tests

Description                                   Parameter           Value
Min frequency threshold for tabular data      safe_threshold      10.0
Min degrees-of-freedom for analytical stats   safe_dof_threshold  10.0
N parameter in NK test                        safe_nk_n           2.0
K parameter in NK test                        safe_nk_k           0.9
Min ratio for p% test                         safe_pratio_p       0.1
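The following stand-alone sketch (not ACRO's implementation) shows how a frequency-threshold suppression mask of the kind described above can be built and applied to a cross tabulation, using the default safe_threshold of 10 from Table 1; all names here are illustrative.

```python
import pandas as pd

def threshold_suppress(index, columns, values, aggfunc="mean", safe_threshold=10):
    """Sketch: suppress aggregate cells with fewer than safe_threshold contributors."""
    agg = pd.crosstab(index, columns, values=values, aggfunc=aggfunc)
    counts = pd.crosstab(index, columns)           # number of contributors per cell
    mask = counts < safe_threshold                 # True where the cell must be suppressed
    outcome = mask.replace({True: "threshold", False: "ok"})
    return agg.mask(mask), outcome                 # suppressed cells become NaN

# hypothetical usage with the invented dataframe from the earlier sketch:
# suppressed, outcome = threshold_suppress(df.region, df.sex, df.income)
```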

3.4 The SACRO Python ‘Engine’

Python is a popular multi-platform language widely used for data analysis and machine learning. PyPI provides a simple package management system for distributing open source Python libraries. Pandas and Statsmodels6 are industry-standard, mature, popular, and well-supported Python packages for data analysis, statistical testing, and statistical data exploration. Pandas is currently used by more than 55% of all Python users Python Software Foundation (2021) and there are many web-sites and user groups providing help with formulating queries. The use of Python as the primary implementation therefore enables the leveraging of existing expertise and community support with these packages, so that the ACRO front-end can be as similar as possible to the API researchers already know and trust, and further facilitates the rapid development of disclosure checking functionality on the back-end. As the PyPI distribution system is simple and allows the use of semantic versioning, it supports a rapid and iterative develop-and-deploy strategy to provide continuing functionality and improvements. For example, the current version of ACRO may be installed [or updated] as simply as:

pip install [--upgrade] acro

The currently implemented methods are listed below, split into analysis commands and session management commands. For more details see the ACRO project documentation7.

5https://yaml.org

6https://github.com/pandas-dev/pandas and https://www.statsmodels.org/stable/index.html respectively

7https://ai-sdc.github.io/ACRO/


3.4.1 Analysis commands for Researchers. These are implemented via the use of multiple inheritance from Pandas and Statsmodels. For making tables, the relevant methods are:

: crosstab(index, columns[, values, rownames, . . . ]) Compute a simple cross tabulation of two (or more) factors, with options for hierarchies in rows/columns and multiple aggregation functions. Same API as pandas.crosstab.

: pivot_table(data[, values, index, columns, . . . ]) Create a spreadsheet-style pivot table as a DataFrame. Same API as pandas.pivot_table.

and for regression analysis:

: logit(endog, exog[, missing, check_rank]) Fits Logit model. Same API as statsmodels.discrete.discrete_model.Logit.

: logitr(formula, data[, subset, drop_cols]) Fits Logit model from an R-style formula and DataFrame. Same API as statsmodels.formula.api.logit.

: ols(endog[, exog, missing, hasconst]) Fits Ordinary Least Squares Regression. Same API as statsmodels.regression.linear_model.OLS.

: olsr(formula, data[, subset, drop_cols]) Fits Ordinary Least Squares Regression from an R-style formula and DataFrame. Same API as statsmodels.formula.api.ols.

: probit(endog, exog[, missing, check_rank]) Fits Probit model. Same API as statsmodels.discrete.discrete_model.Probit.

: probitr(formula, data[, subset, drop_cols]) Fits Probit model from an R-style formula and DataFrame. Same API as statsmodels.formula.api.probit.
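For example (a sketch, not taken from the project documentation; it assumes the ACRO session `acro` and DataFrame `df` from the earlier workflow sketch), the R-style formula interface can be used exactly as in statsmodels, with the residual-degrees-of-freedom check applied before the result is queued for release:

```python
# fit a checked OLS regression via the R-style formula interface
results = acro.olsr("income ~ region + sex", data=df)
acro.print_outputs()   # the regression is queued with its degrees-of-freedom check recorded
```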

3.4.2 Session Management Commands.

: ACRO(config, suppress) Creates an ACRO session object with optional parameters for a config (risk appetite) filename and whether disclosive tables should have suppression applied (default False).

: print_outputs() Prints the current results dictionary - i.e., the outputs that would be sent for checking.

: remove_output(key) Removes an output from the results dictionary.

: rename_output(key, newname) Assigns a new (ideally more self-explanatory) name to an output from the results dictionary.

: add_comments(key,text) Allows researcher to add a description for an output

: add_exception(key,text) Allows a user to request and justify an exception to strict rules-based checking.

: custom_output(filename,description) Adds a file containing output from unsupported analysis to an ACRO session for inclusion in outputs shown in viewer.

: finalise(directory_name, format) Creates a results file for checking in the desired format (json or xlsx).


» safe_table = acro.crosstab( df.recommend, df.parents, values=df.children, aggfunc="mean")

» print(safe_table)

INFO:get_summary:fail; threshold: 4 cells may need suppressing

INFO:outcome_df:
parents          great_pret   pretentious   usual
recommend
not_recom        ok           ok            ok
priority         ok           ok            ok
recommend        threshold    threshold     threshold
spec_prior       ok           ok            ok
very_recommend   threshold    ok            ok

INFO:acro:add(): output_1

grant_type       great_pret   pretentious   usual
recommend
not_recom        1440         1440          1440
priority         858          1484          1924
recommend        0            0             0
spec_prior       2022         1264          758
very_recom       0            132           196

Figure 2. Example ACRO query for the ‘nursery’ data (top), with immediate disclosure control reporting (middle, grey background - pink onscreen) followed by the output (bottom). This ‘researcher-view’ corresponds to the top image in the viewer screenshots.

An example ACRO query run on the nursery admission dataset8 and its output is shown in Fig. 2. This is the ‘researcher-view’ of the output at run-time. The corresponding ‘TRE-view’ is shown in the top screenshot in Fig. 3. This example does not have an aggregation function so dominance rules are not applied; otherwise they would also show in the ‘INFO’ section of the report in any relevant cells. Note that if the user starts their session with acro = ACRO(suppress=True) then any disclosive cells would have their values set to NaN.

3.5 The R interface to ACRO

The R front-end is an example of cross-language support. It provides a set of wrapper functions that execute the Python back-end checking via the reticulate9 package, which provides automatic conversions for many types, e.g., R data frame to Pandas DataFrame. A session is created when the acro package is called: source("../acro.R"); thereafter the acro methods work as callable functions with the prefix acro_, e.g., acro_rename_output(output5, "xy-plot") etc., and to end a session the user calls acro_finalise(results_dir, "json"). For regressions, the common R lm() and glm() functions were shadowed with equivalent versions implemented as acro_lm() and acro_glm(), respectively. For tabular data, the dplyr10 package is commonly used within R, however no simple cross tabulation or pivot table functions are provided; instead various combinations of groupby() and summarize() etc. are used. Therefore, at this stage of development, the Python cross tabulation and pivot table functions were directly interfaced with acro_crosstab() and acro_pivot_table().

8https://www.openml.org/search?type=data&sort=runs&id=1568&status=active

9https://github.com/rstudio/reticulate

10https://github.com/tidyverse/dplyr


3.6 Stata Interface

This makes extensive use of Stata’s SFIToolkit library to manage a Python session, transfer data in memory from Stata to a Pandas dataframe in the Python session, and results back to the Stata window. A simple acro.ado file defines a new function acro which takes as parameters either one of the ACRO session management methods (adding init() to start a session) or the name of a standard Stata function such as table, regress, etc. Stata’s inbuilt parsing functions are used to separate out the parts of the command and pass them as lists to a Python function parse_and_run() which handles the rest of the translation between the two languages.

4 SACRO Viewer for Output Checking

We have also created an open-source platform-independent stand-alone viewer for output checkers to use to: view outputs and their risks; make decisions with reasons (all recorded for auditing purposes); and produce zipped packages of files for release Open-Safely (2023). Figure 3 illustrates two screenshots from the version currently (July 2023) being evaluated by TREs. The viewer supports and renders a range of different file types for results from unsupported queries. A separate script lets TRE staff create an ACRO session from a set of output files in a directory, and hence use the viewer for making and recording decisions, even if the researcher has not used ACRO during their analysis. Automated disclosure risk analysis is not provided in those cases.

5 Linking theory and implementation

As part of the project, the SACRO team committed to review and re-develop the theory and operational guidelines for output SDC. The aim was threefold; first, to bring together key points from the OSDC literature (and fill in some of the theoretical gaps) to provide an integrated guide to both theory and practice of output checking; second, to develop a new approach to OSDC based on classifications into groups (see Derrick et al. (2023), for details); third, to explicitly link theory to operational rules and their implementation in manual and automatic checking regimes. The third aim is essential to demonstrating that SACRO is not seen as a ‘black box’ implementing its own rules, but is fully integrated into core theory. It is also important for showing how manual and automatic output checking necessarily differs. For example, dominance checks are almost impossible for a human, but straightforward for computers; on the other hand, computers cannot easily identify whether zero cells in tables are structural or disclosive, but humans can. The purpose of the guide is to show precisely what checks have been made, where differences occur between humans and computers, and why they are necessary.

6 Engagement with TREs

One of the lessons learned from the original Stata version of ACRO Green et al. (2021) was the importance of user buy-in. Although that version met its design goals (and has subsequently been adopted by Eurostat in its TRE), reaction to it was a mixture of “this looks useful, I’ll give it a go”, “this looks useful, I’ll wait to see it installed before I commit myself”, and “I’ve read the installation manual and have no idea what’s going on, so it’s a no”. As a result, that version of ACRO has remained largely within the project remit: a demonstration of possibilities. The SACRO project was intended to involve co-design from the outset to take ACRO to the next stage of general utility and application. This involved three tests:

1. Would a new tool be acceptable to users?
2. Would a new tool be acceptable to output checkers?
3. Could a new tool be installed in secure research environments?


Figure 3. Two screenshots of the viewer. The left-hand column shows the list of files requested. In the top image, the colouring of file names suggests which files require special attention. In the lower image, background colour-coding and tick/cross symbols show the decisions made by the output checker. The top image shows the checker viewing a table that fails disclosure tests, with problematic cells highlighted in red. The bottom shows an acceptable table. Also in this image the top right-hand panel shows the option to view the TRE ‘risk appetite’ expanded.


The SACRO project took two approaches. First, six TREs (OpenSafely at the University of Oxford, and the five Scottish Safe Havens) were funded as co-investigators on the project to provide detailed feedback on user and output checker perspectives (OpenSafely also took the lead in the design of the user interface). This group also directly tested the feasibility of installing and allowing the Python code to run on their systems, as TREs differ in their perceptions of Python’s ‘riskiness’. Second, the SACRO team contacted a large number of TREs in the UK and abroad, and set up a network of interested parties potentially willing to be testers. Several engagement events with this group identified how they worked and what they would expect from an automatic solution. At the time of writing (July 2023), the first ‘external’ TREs are starting to install and run the tool with genuine users. SACRO has a workpackage dedicated to helping TREs set up their systems, and then collecting evaluation feedback. This aims to make sure that the tool is tested in as wide a variety of environments as possible, given the time constraints. A secondary aim is to involve TREs in the development, to build a sense of ownership and lay the foundations for widespread adoption. This helps to address the concerns of ‘wait-and-see’ TREs.

7 Future Plans

By the current project end in October 2023 we aim to have added support for: more common types of analyses (including simple plots); different versions of Stata; and more ways of creating tables within R. Additional features and improved user experience will be facilitated by the involvement of end-users and output checkers. Beyond then, UWE has committed to web hosting various resources for the indefinite future, and partners have agreed to continue support and development of the toolkits. We are keen to engage with any interested parties to enrich and build an on-going community of support for SACRO.

References

Derrick, B., E. Green, F. Ritchie, J. Smith, and P. White (2023). Towards a comprehensive theory and practice of output SDC. In UNECE/Eurostat Workshop on Statistical Data Confidentiality.

Green, E., F. Ritchie, and J. Smith (2020). Understanding output checking. Technical report, European Commission (Eurostat - Methodology Directorate).

Green, E., F. Ritchie, and J. Smith (2021, October). Automatic checking of research outputs (ACRO): A tool for dynamic disclosure checks. ESS Statistical Working Papers 2021, 1–27. doi: 10.2785/75954.

Hubbard, T., G. Reilly, S. Varma, and D. Seymour (2020, July). Trusted research environments (TRE) green paper. ZENODO 2020, 1–31. doi: 10.5281/zenodo.4594704.

Jefferson, E., J. Liley, M. Malone, S. Reel, A. Crespi-Boixader, X. Kerasidou, F. Tava, A. McCarthy, R. Preen, A. Blanco-Justicia, E. Mansouri-Benssassi, J. Domingo-Ferrer, J. Beggs, A. Chuter, C. Cole, F. Ritchie, A. Daly, S. Rogers, and J. Smith (2022, September). GRAIMATTER Green Paper: Recommendations for disclosure control of trained Machine Learning (ML) models from Trusted Research Environments (TREs).

Open-Safely (2023). Sacro:a tool for fast, secure and effective output checking, which can work in any TRE. https://github.com/opensafely-core/sacro.

Preen, R. J., J. Smith, M. Albashir, and S. Davy (2023). ACRO. https://github.com/AI-SDC/ACRO.

Python Software Foundation (2021). Python developers survey 2021 results. https://lp.jetbrains.com/python-developers-survey-2021/. Accessed: 24/07/2023.

Ritchie, F. (2008). Disclosure detection in research environments in practice. In Joint UNECE/Eurostat work session on statistical data confidentiality, Volume WP. 73. United Nations Statistical Commission and Economic Commission for Europe Conference of Europe Statisticians, European Commission Statistical Office of the European Communities (Eurostat).

Ritchie, F. (2017, September). The ‘five safes’: A framework for planning, designing and evaluating data access solutions. Zenodo 2017, 1–5. doi: 10.5281/zenodo.897821.



DARE UK

SACRO: Professor Jim Smith, UNECE Expert meeting on Statistical Data Confidentiality 2023

SACRO: Semi-Automated Checking of Research Outputs

Professor Jim Smith,

University of the West of England

[Diagram (Five Safes in a TRE): Safe People, Safe Projects, Safe Setting, Safe Data, Safe Outputs. A subset of pseudonymised data flows from the data repository into the researcher analytical environment inside the TRE; exports from the TRE (graphs or tables of summary results, AI trained models, software using AI models) pass through the disclosure control checking process before becoming research results for publication.]

(Semi) Automating this bottleneck!


Who are we? (alphabetically)

Universities

• Aberdeen

• Dundee

• Durham

• Edinburgh

• Oxford

• UWE

Public Data Bodies
• Health Data Research UK
• NHS Scotland
• Public Health Scotland
• Research Data Scotland

TREs
• DASH (Aberdeen/Grampian)
• DataLoch (Edinburgh)
• HIC (Dundee)
• eDRIS (Public Health Scot)
• OpenSafely (Oxford)

External steering group:
UK: Office for National Statistics, ESRC, DARE
Global: Eurostat, SDC-GESIS, ICPSR (US)

The current situation

[Diagram: confidential data held in the TRE is analysed by the researcher using Stata, R or Python; the analysis produces output files (File1 … Filen) that are sent as a release request to TRE staff, who make the decision.]

SACRO in a nutshell

[Diagram: as in the current situation, but the analysis in Stata, R or Python now runs through the ACRO engine, which reads the TRE risk appetite and adds an ACRO report to the output files (File1 … Filen); TRE staff use the report alongside the release request to reach their decision.]

Similar for Machine Learning Models

Except that we:

• Run a range of ‘inference’ attacks

• Aim to support more ‘user journeys’

• Don’t have a set of ‘tried and trusted’ guidelines to work with

ACRO: analytic commands

Using the same API as pandas, adding checks for cell count and dominance:

• acro.crosstab()

• acro.pivot_table()

Using the same API as statsmodels, adding checks for DoF:

• acro.logit(), acro.logitr()

• acro.ols(), acro.olsr()

• acro.probit(), acro.probitr()

R versions are prefixed by “acro_”

• Also support R’s built-in ‘table’ command

Stata versions are prefixed by “acro ”

• Code currently captures and translates the commands: table, regress, probit, logit

ACRO: user commands for session management

• acro = ACRO()

• acro.suppress = [True, False]

• acro.print_outputs()

• acro.remove_output(key)

• acro.rename_output(key, newname)

• acro.add_exception(key, text)

• acro.add_comment(key, text)

• acro.custom_output(filename, description)

• acro.finalise(output_directory, format)

Example: what happens if a researcher requests a disclosive table

SACRO viewer for TRE output checkers

[Annotated screenshot: a panel listing outputs to review; an option to view the TRE risk appetite; the selected output shown with problematic cells highlighted; for each output, its type, the recommendation, comments from the user and any exception request; comments are needed to override a recommendation; finally, overall comments are recorded and a release package created.]

Get Involved

ACRO ‘engine’:

• https://github.com/AI-SDC/ACRO

Viewer:

• https://github.com/opensafely-core/sacro

Consensus statement: [email protected]

Anything else: [email protected]

Thanks to:

• All the project partners and TRE staff for all their feedback on the tools as they developed

• Members of the public for input to the consensus statement

• External steering group

• You for listening

This work is funded by UK Research and Innovation [Grant Number MC_PC_23006], as part of Phase 1 of the DARE UK (Data and Analytics Research Environments UK) programme, delivered in partnership with Health Data Research UK (HDR UK) and Administrative Data Research UK (ADR UK).

Towards a comprehensive theory and practice of output SDC, University of the West of England

statistical disclosure control of outputs, OSDC, secure research environments, OSDC associated risks, 


UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Towards a comprehensive theory and practice of output SDC

Ben Derrick(University of the West of England, UK)

Elizabeth Green(University of the West of England, UK)

Felix Ritchie(University of the West of England, UK)

Paul White(University of the West of England, UK)

e-mail: [email protected]

Abstract

In 2000, the statistical disclosure control of outputs (OSDC) was largely limited to models of table protection developed

by and intended for national statistical institutes (NSIs), as a particular branch of general SDC theory. However, in this

century OSDC as a field of enquiry has expanded significantly, reflecting the importance of secure research environments

run by NSIs and others. OSDC is still a relatively under-developed field compared to SDC for tables or microdata. There

are a small number of practitioner guides, and some theoretical articles, but this is a diffuse literature.

In the UK, a consortium of universities and data providers is collaborating to provide an integrated analysis of output

checking including

- Key theoretical and operational concepts (eg safe statistics, principles-based OSDC)

- A comprehensive listing of statistics, associated risks, and mitigation measures as well as various practical elements to

support output checking.

A key element of this is a theory-driven classification which enables us to have that comprehensive listing whilst still

limiting the dimensionality of OSDC guidelines to a manageable number of rules. This paper explains this model and how

it has been co-developed with RDCs and others, and considers whether this provides a sustainable model for future

development of the OSDC field.


1 Introduction

Increasingly social scientists are making use of confidential data for research. This has accelerated in the 21st

century with the growth of secure environments, referred to as ‘safe havens’, ‘secure data centres’, ‘research

data centres’, ‘trusted research environments’ (TREs) and similar names. These TREs provide standardised

secure access to a range of sensitive datasets for research purposes. In OECD countries these are now common

as part of the portfolio of research data services offered by National Statistics Institutes (NSIs), and academic

groups are also adopting them.

TREs have introduced one substantial change to the way social scientists work. When working with

confidential data, researchers are generally unaware of the potential disclosure risk in statistical outputs, as this

is not covered in research methods courses (Derrick et al, 2022). However, TREs generally require researchers

to submit outputs for a confidentiality review before release (Green et al, 2021). The efficiency of this process

relies substantially on the researchers being aware of confidentiality risks and actively aiming to produce non-

disclosive outputs (Alves and Ritchie, 2019). Hence, most TREs (Green et al, 2021) provide researchers with

some training and/or guidelines in output statistical disclosure control (OSDC). Some organisations that allow

downloads have also provided OSDC guidelines eg Eurostat (2015).

The practice of output checking, and the training of researchers and checkers, lags considerably behind other

areas of confidential data protection, such as source data anonymisation. For many years, OSDC was limited to

models of table protection (frequencies and magnitudes) developed by and intended for national statistical

institutes (NSIs). In this century OSDC as a field of enquiry has expanded significantly, largely as a result of

the growth of TREs and the need to cover the much wider range of outputs generated by researchers.

Nevertheless, general OSDC is still a relatively under-developed field compared to SDC for tables or

microdata.

A part of the problem is that the conceptual framework for generalised OSDC is lacking. There are a small

number of practitioner guides, and a few theoretical articles, but this is a sparse literature. However, that

literature does contain the seeds for a new overarching framework; in particular, the realisation that statistics

could be grouped to minimise the need for rules covering every potential output.

In 2023 the UK academic funding council UKRI funded the project SACRO (Semi-automated checking of

research outputs; see Green et al, 2023a) to deliver a general-purpose toolkit for automating output checking

processes, based on the Eurostat funded pilot ACRO (Green, Ritchie and Smith 2020 and 2021). As part of the

project, the team undertook to provide a comprehensive review of SDC theory, integrated with practical

guidelines. A key part of the project was to formalise the use of classifications (‘statbarns’) and push the

concept to its limit to minimise the dimensionality problem.

This paper describes the statbarn concept, how it was operationalised, and how it simplifies disclosure control

processes (both automatic and manual). As of July 2023, this is still a work in progress, so we review the

current status and highlight areas where research needs to be done.

2 Generalised OSDC development1

Statistical disclosure control (SDC, sometimes called statistical disclosure limitation) is the practice of using

statistical analysis to ensure that the use of confidential or sensitive data does not breach the privacy of the data

subjects. SDC can be split into ‘input SDC’ (removing identifying information from the data before analysis is

carried out) and ‘output SDC’ (checking that statistical aggregates do not reveal information).

1 This short review is based on our own understanding and experience in the last two decades. We would very much

appreciate comments from colleagues working in this area as to the accuracy of our representation.


Input SDC is a very well-established process. It has a large and stable literature, a large evidence base of the

efficacy of different measures in different circumstances, and software tools implementing these to de-identify

datasets. Research methods courses rarely teach formal de-identification, but researchers are usually given

some basic guidance on broad principles.

In contrast, OSDC is a largely unknown quantity. Until 2000, ‘output SDC’ (had the term been coined then)

would have been seen as the need to protect frequency and magnitude tables from inadvertent disclosure. This

field had seen some study, and there was a relatively well-established literature, but it remained a specialist

area, even for statisticians. We are not aware of research methods courses, then or now, that teach this as a

matter of course, with one exception.

The exception is courses in the production of official statistics, which do cover OSDC for tables. Until recently,

SDC was very heavily influenced by the needs of national statistics institutes (NSIs), who produce statistical

tables and, increasingly, microdata for secondary analysis. These organisations promoted research into relevant

SDC, which explains the overwhelming focus on tables for OSDC. The first OSDC papers not focusing on

tables appear to be Reznek (2004), Reznek and Riggs (2005) and Corscadden et al (2006), all tackling specific

problems.

In 2003 the TRE at the UK Office for National Statistics was set up, and it was run by social science

researchers rather than the teams producing official statistics. The ONS team realised that (a) the literature on

tabular OSDC was of limited value in research environments, and (b) the vast majority of research outputs had

no guidance at all. As a result, the team began developing guidelines with a research focus. This included an

analysis of the principles behind output SDC for research (Ritchie, 2007), and the first statement of ‘safe

statistics’ (Ritchie, 2008).

The concept of ‘safe statistics’ is key for efficient processing of research outputs. It recognises that certain

types of output have no meaningful disclosure risk in any reasonable use. For example, the regression

coefficients cannot by themselves reveal an individual value, nor can they be differenced to reveal individual

values, nor are they affected by special cases such as single observations in a category (Ritchie, 2019). Of

course, it is possible to construct special cases such that the regression is informative about individuals, but

these have no meaningful research purpose. For all reasonable purposes, regressions coefficients are non-

informative about individuals in all cases2, and therefore they do not need to undergo output checking.

Ritchie (2016) proposed a method for classifying outputs as safe or unsafe:

- Does the statistic itself pose a risk in the case of low numbers, extreme values or something else which is a

legitimate value?

- If the statistic is compared to another with one more observation, does any differencing risk arise?

- Are there any other reasonable risks to disclosure, specific to this statistic?

If the answer to all three of these is ‘no’ then the statistic is classified as ‘safe’. The innovation in Ritchie

(2008) was that the classification should be based upon the mathematical characteristics of the statistic, not the

statistical ones; in other words, a ‘safe’ statistic should be safe irrespective of the data it is calculated on.

The ONS guidelines formed the basis for Brandt et al (2010; subsequently re-released, with minor revisions, as

Bond et al, 2016). This Eurostat-sponsored project (complementing a second piece on ‘traditional’ SDC;

Hundepool et al, 2010) aimed to provide the first comprehensive guide for researchers and output checkers. The

guide covered broad theory, including a discussion of safe statistics; guidelines and ‘rules’ on specific statistics,

grouped into similar types; and suggestions for operationalising good practice, including training. Brandt et al

(2010) has been the basis for many of the practice manuals now being produced by NSIs and others for TRE

users.

Despite its influence, Brandt et al (2010) has some significant limitations. The most obvious is that the list of

statistics covered is not comprehensive but selective, reflecting the interests of the report committee. Thus, it

2 There are basic rules that can be checked to make sure that the regression is a genuine regression (sufficient degrees of

freedom to be clear this is not an equation, regression must not be saturated to ensure this is an estimate and not a table

masquerading as a regression) but in genuine situations we would not expect these conditions to occur.


is strong on the measures used by social scientists but has significant gaps relating to health research, for

example. The second limitation is that the recommendations are presented ‘as is’ with little in the way of

explanation as to why this came about. A third limitation is that the report is very laconic, offering rules but

very little in the way of practical interpretation for researchers or output checkers. Subsequent manuals based

on the guide have managed to address some of these; for example, the popular SDAP manual (Griffiths et al,

2019) has both a wider range of statistics, and a commentary for output checkers on how to usefully assess the

output.

However, the major limitation of Brandt et al (2010) is that there is no overall integrating conceptual

framework. The guide reduces the range of rules somewhat by grouping statistics, but these groupings are as likely to be

based on whether statistics are commonly presented together as on their disclosure characteristics. Moreover, the

structure of the guide implies that any additional statistics will need to have their own rules added, rather than

being seen as variations on existing ones. Other manuals follow this (implicit) approach as well, listing outputs

and associated rules as if they were separate entities. The implications of safe statistics and the grouping

approach used in Brandt et al (2010) have not been followed through. We consider this now.

3 Conceptual foundations of an integrated approach

Analysts use a great range of statistical techniques in their models. Devising statistical rules for all of these

separately is not feasible. However, it is possible to combine statistics into groups based not on statistical

relation but on common disclosure risks and solutions. For example:

- means and totals are identical in terms of the disclosure risk for all practical purposes

- means and frequencies generate the same risks of low numbers and potential for differencing

- means have the potential for dominance

- survival tables are frequencies but they also generate an implicit secondary table

So a grouping would put means, totals, frequency tables and survival tables into three different disclosure

groups.

Everything in the groups should have the same risks and solutions. For example, suppression, rounding or noise

addition are valid solutions to disclosure risks in both frequency and survival tables, but on the latter they need

to be implemented in a different way to allow for the monotonic relationship between cells.
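As a concrete illustration of one such mitigation, the sketch below applies simple threshold suppression to a frequency table; the threshold of 10 and the toy data are illustrative assumptions, not a recommended rule.

```python
import pandas as pd

# toy data; column names and the threshold are illustrative assumptions
df = pd.DataFrame({"region": ["A"] * 12 + ["B"] * 3,
                   "outcome": ["yes"] * 10 + ["no"] * 2 + ["yes"] * 3})

def suppress_small_cells(table: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    """Primary suppression: blank any cell whose count falls below the threshold."""
    return table.mask(table < threshold)

counts = pd.crosstab(df["region"], df["outcome"])
safe_counts = suppress_small_cells(counts)   # cells below the threshold become NaN (suppressed)
```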

The advantages of this approach are both statistical and operational:

- Fewer rules/cases for researchers and output checkers to learn

- More consistent treatment of outputs

- Clearer distinctions between outputs

- Easier to develop the theoretical basis for any guidance

- Easier to update guidance when it changes (which it does)

- Adding new statistics is now a case of ‘what category does it fall into?’ rather than ‘what rules are needed?’

- Output checker (and researcher) training can focus on the risky classes rather than trying to cover all cases


Because classification is used in this field in many different ways, we refer to the groupings as ‘statistical

barns’ or ‘statbarns’3.

The real value of this comes from finding that, in terms of disclosure characteristics, the minimum number of

statbarns is fairly small. To a researcher, estimation of a hazard model bears little analytical relation to a

quantile regression; but they pose the same disclosure risks: that is, no meaningful risk in any reasonable use,

and so the only test needed is to make sure that this is a genuine research use. In the case of estimated models, the

tests are always

- Are there sufficient residual degrees of freedom (ie making sure this is a model, not an equation)?

- Is the model saturated (explanatory factors all categorical and all fully interacted ie making sure this is not a table

masquerading as an estimate)?

And just like that, a large and essential part of research output is consigned to the box ‘nothing to see here’.
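To make those two checks concrete, a minimal sketch follows; the degrees-of-freedom threshold and the function signature are illustrative assumptions, not the SACRO implementation.

```python
def model_checks(n_obs: int, n_params: int,
                 all_predictors_categorical: bool, fully_interacted: bool,
                 dof_threshold: int = 10) -> bool:
    """Return True if an estimated model passes both 'genuine regression' checks."""
    residual_dof = n_obs - n_params
    sufficient_dof = residual_dof >= dof_threshold                         # an estimate, not an equation
    not_saturated = not (all_predictors_categorical and fully_interacted)  # not a table in disguise
    return sufficient_dof and not_saturated

model_checks(n_obs=5000, n_params=12,
             all_predictors_categorical=False, fully_interacted=False)     # True
```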

4 The SACRO classification model

As it currently stands, the SACRO model contains fourteen statbarns:

Barn | Example | Class | Status
1 Frequencies | Frequency tables | Unsafe | Very well understood
2 Statistical hypothesis tests | t-stats, p-stats, f-stats | Safe | Provisional
3 Correlation coefficients | Regression coefficients | Safe | Confirmed
4 Position | Median, quartiles, min, max | Unsafe | Provisional
5 Shape | s.d., skewness, kurtosis | Safe | Provisional
6 Linear aggregations | Means, totals | Unsafe | Very well understood
7 Mode | n/a | Safe | Confirmed
8 Smooth distributions | Kernel density functions | Safe | Provisional
9 Concentration ratios | Herfindahl index | Safe | Provisional
10 Calculated ratios | Odds & risk ratios | Unsafe | Provisional
11 Implicit tables | Hazard/survival tables | Unsafe | Provisional
12 Linked/multi-level tables | Nested categorical data | ? | No knowledge
13 Clusters | Cluster analysis | ? | No knowledge
14 Gini/Lorenz curves | n/a | ? | No knowledge

It is clear that some of these statbarns cover a very large number of cases (‘correlation coefficients’ cover linear

and non-linear regression, ANOVA, ANCOVA, pairwise correlation etc). In contrast, the disclosure risks of the

mode are unlike any other statistic, and so it merits its own class. This shows the importance of identifying

exactly what are the disclosure characteristics of a particular statistic.

The act of creating the list is itself a useful exercise, forcing one to consider what are the meaningful

differences. For example, mean and median are often grouped together in OSDC guidelines, but they have quite

different characteristics. On the other hand, maxima and minima are often dealt with on their own but they can

be considered as a special case of percentiles. This means that we no longer need separate rules for

‘structural’ end points (such as 0% or 100% in a proportion variable) but can apply general percentile rules.

This list is likely to undergo change over time. Even in the development process, the list changed as more

statistics were deemed to be of the same type, and others demanded a new type. The process of identifying risks

3 The term originally came from an analogy with a farmer trying to organise her livestock, but as a neologism it has the

advantage of being unambiguous


and defining OSDC guidelines for each class is crucial, as this is usually the point at which it becomes clear

whether a new type is needed or not. It may also be the case that trying to identify a minimal set is counter-

productive. As noted, formally maxima/minima can be treated as percentiles; but in terms of communication of

risk to researchers, it may be sensible to separate them again. Finally, we have created some categories as, at

the moment, we don’t have enough information to be comfortable that they fit an existing category. Category

12 “linked/multiple tables” is an example – it seems like these should be covered by frequency tables, but we

suspect there are nuances which need to be explored, and so creating it as a separate category shows the need for

more understanding.

The coverage of OSDC theory is decidedly patchy. The ‘status’ column has four values:

Very well understood | The disclosure issues, things to be checked and protection mechanisms have been comprehensively studied and there is a consensus

Confirmed | These have not been so well studied (conclusions rest on one or two papers) but we are confident that the conclusions and guidance are robust, well-founded and comprehensive

Provisional | We have confidence in our conclusions but this is based on extrapolation from other types, and from our own understanding; there is substantial further work to be done (for example, on the impact of extreme values) before the classification can be confirmed

No knowledge | While we may have a suspicion of how these should be seen, basic analysis has not been carried out

At present, the focus is to get the ‘provisional’ status raised to ‘confirmed’.

The list above is provisional and was devised by the SACRO team based at the University of the West of

England, Bristol. SACRO’s network of output checkers was consulted as to whether this was a sensible

approach in general; the response was positive, but expected: earlier evidence-gathering sessions had already

indicated a desire for simplification of the current OSDC landscape. The initial categories seemed both sensible

and comprehensive, although these are likely to be modified as they develop in practice.

Of more concern to the output checkers was how they (and researchers) would easily check the guidelines for

statistics. This is achieved by a look-up table, linking statistics to the appropriate statbarn, from which the

corresponding checks, problems and solutions can be found.

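The look-up table itself is not reproduced here, but a minimal sketch of how such a mapping might be implemented follows; the statistic names, statbarn assignments and guidance entries are illustrative excerpts, not the full SACRO listing.

```python
# Illustrative look-up from an individual statistic to its statbarn and guidance.
STATBARN = {
    "count table": "frequencies",
    "pie chart": "frequencies",
    "mean": "linear aggregations",
    "median": "position",
    "regression coefficient": "correlation coefficients",
    "kaplan-meier": "implicit tables",
}

BARN_GUIDANCE = {
    "frequencies": {"class": "unsafe",
                    "checks": ["low counts", "differencing", "class disclosure"],
                    "mitigations": ["cell suppression", "noise addition", "rounding"]},
    "correlation coefficients": {"class": "safe",
                                 "checks": ["residual degrees of freedom",
                                            "model not saturated"],
                                 "mitigations": []},
}

def guidance_for(statistic: str) -> dict:
    """Return the statbarn and its associated checks/mitigations for a statistic."""
    barn = STATBARN[statistic.lower()]
    return {"statbarn": barn, **BARN_GUIDANCE.get(barn, {})}

guidance_for("count table")   # -> frequencies barn, with its checks and mitigations
```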

The look-up table will be created as a searchable file, but the output tools being developed by the SACRO project (Green et

al, 2023) intend to incorporate this in the user front end. Researchers and output checkers should be able to

click on a link to see more information about the output, drawn from the statbarn classification. In the initial

project this will only include basic data such as that shown above, but in future it may be useful to expand the

information on each classification. This highlights the advantage of classification: the SACRO coders only

need to know the statbarn code and then can draw all this information from a finite set of outputs.

5 Graphical outputs

Graphs do not present new issues. In theory, every graph can be represented as a table in some way, and so the

above rules could be applied. To take an obvious example, a pie chart or a histogram are clearly just one-way

tabulations, whereas a waterfall graph is a two-way table. As a counter example, a kernel density estimate could

be represented as a mathematical form, but in practice is almost always shown graphically. In practice, we need

separate rules because (a) the quantity of information differs, and (b) precision is likely to be lower in a graph.

Consider the Kaplan-Meier graph, which is simply a survival table re-presented, usually in proportional form

(we assume that counts and proportions are equally disclosive as the total from which the proportion is

calculated is likely to be published somewhere). Survival tables are classed as ‘unsafe but very low risk’

because, even in the case of a unit being identified, the personal information content in the survival table is

negligible. Griffiths et al (2019) suggest that the underlying survival table should be supplied along with the

graph, but this can cause more problems:

In the left-hand graph, the source table would have 15 steps and be checkable by a human. But that table would

have precise numbers easily readable, whereas getting the exact figures from the graph depends on the way that

the image was produced (and even then, some laborious analysis). In the right-hand diagram, a survival table

with 100 rows in it is much harder to assess accurately, whereas identifying individual data points from the

image has become harder.

The above graphs are presented as numbers. Formally Kaplan-Meier graphs should show the survival rate

rather than numbers (ie 0%-100%). In theory this makes graphs slightly more disclosive than the survival table:

tables are likely to limit the number of decimal points shown, whereas the full decimal value may be used in

creating the graph points.

Given the low information content in any data point, even if relating to one person, producing survival tables

alongside graphs seems to increase risk rather than reducing it. Hence, the current guidance from SACRO is

that Kaplan-Meier graphs should be released subject to the researcher confirming that each step and the end

point meets thresholds.

The objective for the SACRO guide is that it will show the statbarns that each graph falls into (which in itself

might lead to additional statbarns being defined, as in the case of kernel densities), but will concentrate on the

practical assessment; in particular, how graphical representation adjusts the perspective on what is discoverable


from an output. Again, this is the value of the grouping – we can see what we should be looking for in the

output.

6 Conclusion

As the use of confidential microdata for research rises, so does the need for efficient and effective OSDC.

OSDC for research has made considerable advances in this century, but guidelines have tended to develop on

an ad hoc basis as new statistical queries are raised. The strategic approach being taken by SACRO and

described in this paper attempts to provide a longer-term solution to the problem.

The idea of grouping statistics was first raised in Ritchie (2008) partly as a response to proliferation of OSDC

rules emerging from research use of the ONS TRE. While the safe-unsafe classification is crude, it highlights

how applying a structure can significantly improve operational as well as statistical outcomes. Classification

also changes the way we think about outputs. When Brandt et al (2010) was written, the implication was that

additional statistics would require new rules. In the statbarn model, risk assessment for a new statistic should be

a matter of deciding whether it fits into an existing category. If it does, then no further work is needed. If not,

then a new category is added, but this should be a rare event.

The statbarn approach is part of the development of a wider set of operational guidelines aiming to bring

consistency between theory and practice to output checking.

7 References

Alves, K., & Ritchie, F. (2020). Runners, repeaters, strangers and aliens: Operationalising efficient output

disclosure control. Statistical Journal of the IAOS, 36(4), 1281-1293.

Brandt M., Franconi L., Guerke C., Hundepool A., Lucarelli M., Mol J., Ritchie F., Seri G. and Welpton R.

(2010), Guidelines for the checking of output based on microdata research, Final report of ESSnet sub-

group on output SDC

Bond S., Brandt M., de Wolf P-P (2015) Guidelines for Output Checking. Eurostat.

Corscadden, L., Enright J., Khoo J., Krsnich F., McDonald S., and Zeng I. (2006) Disclosure assessment of

analytical outputs, mimeo, Statistics New Zealand, Wellington

Derrick, B., Green, E., Ritchie, F., Smith J. & White, P. (2022, April). Disclosure protection: a systemic gap in

statistical training?. Paper presented at Scottish Economic Society Annual Conference 2022: Special

session 'Protecting confidentiality in social science research outputs', Glasgow

Eurostat (2015) Self-study material for Microdata users. Eurostat.

Green, E., Ritchie, F., Tava, F., Ashford, W., & Ferrer Breda, P. (2021, July). The present and future of

confidential microdata access: Post-workshop report.

Green, E., Ritchie, F., & Smith, J. (2020). Understanding output checking. Luxembourg: European

Commission (Eurostat - Methodology Directorate)

Green, E., Ritchie, F., & Smith, J. (2021). Automatic Checking of Research Outputs (ACRO): A tool for

dynamic disclosure checks. ESS Statistical Working Papers, 2021 Edition

Griffiths E., Greci C., Kotrotsios Y., Parker S., Scott J., Welpton R., Wolters A. and Woods C. (2019)

Handbook on Statistical Disclosure Control for Outputs. Safe Data Access Professionals Working

Group.

Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Lenz, R., Longhurst, J., Schulte Nordholt, E.,

Seri, G. and De Wolf, P. (2010). Handbook on Statistical Disclosure Control. ESSNet SDC.

Reznek, A. (2004) Disclosure risks in cross-section regression models, mimeo, Center for Economic Studies,

US Bureau of the Census, Washington


Reznek A. and Riggs T. (2005) "Disclosure Risks in Releasing Output Based on Regression Residuals" ASA

2004 Proceedings, Section on Government Statistics and Section on Social Statistics pp1397-1404

Ritchie F. (2007) Statistical disclosure control in a research environment, mimeo, Office for National Statistics;

available as WISERD Data Resources Paper No. 6

Ritchie F. (2008) “Disclosure detection in research environments in practice”, in Work session on statistical

data confidentiality 2007; Eurostat; pp399-406

Ritchie, F. (2014). Operationalising ‘safe statistics’: The case of linear regression. UWE Working Papers in

Economics no 14/10. Bristol

Ritchie, F. (2019). Analyzing the disclosure risk of regression coefficients. Transactions on data privacy, 12(2),

145-173

Smith J., Preen R., Ritchie F., Green E., Stokes P., & Bacon S. (2023) SACRO: Semi-Automated Checking Of

Research Outputs. Paper prepared for the 2023 UNECE/Eurostat Workshop on Statistical Data

Confidentiality, September.

Towards a comprehensive theory and practice of output checking Ben Derrick

Elizabeth Green

Felix Ritchie

Paul White

Data Research Access & Governance Network

UNECE/Eurostat

Expert Group on

Statistical Data

Confidentiality

September 2023

A brief history of output SDC

[Timeline graphic: in the beginning: tables; late 20th century: tables and everything else; 21st century: SACRO, 2023]

SACRO Feb-Oct 2023

• Review/revise theory

• Re-integrate theory and practice guidelines

• Tools (regular SDC and AI-SDC)

• Community engagement

now

next

Why a new guideline/manual?

User/checker guides: general, structured by use, maybe organisation-specific

Theory papers: specific, rarely structural

Practice papers: operational, not integrated

Structured? okay. Comprehensive? No way!

New guide: structured, comprehensive, definitive, theoretically sound, evidenced/sourced, practice-relevant (but not necessarily user-friendly)

• Building block: ‘safe statistic’

o unsafe: data-dependent; check before release

o safe: no disclosure risk [minimal check before] release

o based on mathematical (not statistical) characteristics

• Expand: define all statistics by

o common disclosure characteristics

o common mitigation responses

The statistical barn

• Place homologous statistical analysis into ‘statbarns’ eg

o histogram, count table, pie chart → 'frequencies'

FREQUENCIES

Applying the group rules

• In the Frequencies barn we know all outputs are

FREQUENCIES

• With any statistic in the Frequencies barn we need to check:

• Low counts

• Differencing

• Class disclosure

• We would apply the following rules in this barn

• Minimum count

• Appropriate mitigation techniques for this barn are

• Cell suppression, noise addition, rounding

The statistical barn

• Place homologous statistical analysis into ‘statbarns’ eg

o histogram, count table, pie chart → 'frequencies'

o median, interquartile range → 'position'

o ANOVA, proportional hazards → 'correlation'

FREQUENCIES

The barns so far

1. Frequencies
2. Statistical hypothesis tests
3. Correlation coefficients
4. Position
5. End points
6. Shape
7. Means and totals
8. Mode
9. Non-linear concentration ratios
10. Calculated risk ratios
11. Hazard/survival tables
12. Clusters
13. Linked/multi-level tables
14. Gini coefficient

• Each barn has its own set of rules for output checking

Progress (as of today)

• Pretty confident on decisions…

• Using in output checker training – positive feedback

• Statbarn model is basis for SACRO (coming up)

• Some rethinking; some revelations

Rethinking example: survival tables

2010 O’Keefe et al JPC

• No detail in tables

• Blur lines in K-M graphs

2019 SDAP manual

• treat as frequency table

• Minimum thresholds

o tables & graphs

2023 DRAGonTome

• Risk

1. outliers

2. absolute dates

3. differencing via subsets

• Recommendation

o Approve unless the above

o Apply threshold to final count


mode

regression on single binary variable

linear vs non-linear concentration ratios

Min/max versus medians and percentiles

Revelation example

• What is disclosure?

• Small numbers + finite values = rationale for higher thresholds

• Evidential vs structural zeros

Still to be done

• To be written:

o Basic concepts

o Operational issues

o Directory of other guides

o FAQs for researchers and output checkers

• Classifications

o Lookup table – is it comprehensive?

o Web pages

• Community buy-in

o More feedback!


Do samples taken from a synthetic microdata population replicate the relationship between samples taken from an original population and that population? University of Manchester

disclosure risk in sample surveys, k-anonymity, synthetic population, original population, synthetic data, 


UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Do samples taken from a synthetic microdata population replicate

the relationship between samples taken from an original

population?

Mark Elliot, Claire Little and Richard Allmendinger (University of Manchester)

[email protected]

Abstract

Assessment of disclosure risk in sample surveys by data controllers who don’t have access to the population

data is constrained by verifiability challenges. A sample unique may not be a population unique. Statistics

generated at the sample level may not carry over to the population level. Privacy models such as k-anonymity

simply may not make sense when applied to sample data (or only make sense for some scenarios). This study

aims to understand whether samples generated from a synthetic population present the same relationship, in

terms of risk and utility, to the synthetic population, as samples generated from the original population. Note

that this is a very different question from the more general questions about the utility of synthetic data which

compares the synthetic and original data. Here we are comparing two relationships. This opens the possibility of

being able to test and set parameters for models of risk assessment to be applied to real data using synthetic data.


1 Introduction

This document explores whether the relationship between a population dataset and samples drawn

from it is maintained when the samples are drawn from (and compared to) a synthetic version of the

same population. This extends the work of Little et al. (2022), where samples were used to determine

the sample equivalence of synthetic data to the original dataset (for example, to be able to say “the

synthetic dataset has utility equivalent to a 10% original data sample and risk equivalent to a 5%

sample”). In real-life scenarios the population data may not be available, so if synthetic samples were

able to mimic this relationship, it would be useful.

As visualised below, two scenarios are explored: Experiment A (Figure 1), where we do not have

access to the original population data but have a synthetic dataset generated from it that is the same

size as the original population; and Experiment B (Figure 2), where we have a sample of the original

population dataset and from that create a larger synthetic population. An extension to Experiment B

(named B2) is to include the original sample within the synthetic population.

Figure 1: Diagram of data relationships for Experiment A

Experiments were performed using the UK 1991 Census dataset (although it may make sense to

repeat these experiments on other Census datasets in the future). The synthetic data was generated

using Synthpop (Nowok et al., 2016). This was selected because in previous experiments it produced

data with the highest utility compared to other methods (although it should be noted this came with

higher disclosure risk). It may make sense to also experiment with other methods in the future.

The next section introduces the dataset and data/sample generation approach adopted in this study.

Section 3 describes the risk and utility measures used, and Section 4 presents an analysis of

Experiment A and B. Finally, Section 5 concludes the paper and discusses areas for future research.

2 Data

2.1 UK 1991 Census

A subset of the UK 1991 Individual Sample of Anonymised Records for Great Britain (SARs) was

used to simulate a population. The SARs data was downloaded from the UK Data Service on


29/05/21.1 This consists of a 2% sample of the population of Great Britain (excluding Northern

Ireland), with 1,116,181 individual records and 67 attributes. The dataset includes children and adults

and contains information on topics such as age, gender, ethnicity, employment, and housing. To

reduce the computational load the data was subsetted on geographical region (the REGIONP

attribute); there are 12 regions, and the West Midlands was randomly selected for use in this study.

Details of each of the variables are contained in Appendix A. The subset consisted of 104,267 records

(9.34% of overall sample) and fifteen variables (thirteen categorical, two numeric). This subset will be

henceforth referred to as the original population.

Figure 2: Data relationships for Experiment B

2.2 Synthetic Data Generation

Synthpop, developed by Nowok et al. (2016), was used to generate the synthetic data. Synthetic data

the same size as the original population (104,267 records) was generated. Default parameters were

used, with the visit sequence ordered with numerical variables first, followed by categorical variables

with least number of categories to most (with ties decided alphabetically). That gave a visit sequence

of: AGE, HOURS, LTILL, SEX, QUALNUM, MSTATUS, TENURE, RELAT, FAMTYPE,

SOCLASS, ECONPRIM, ETHGROUP, TRANWORK, AREAP, COBIRTH.

2.3 Sample Generation

Random samples of sizes 99%, 98%, 97%, 96%, 95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%,

10%, 5%, 4%, 3%, 2%, 1%, 0.5%, 0.25%, 0.1% were drawn (without replacement) from both the

original and synthetic populations. For each sample size 100 samples were drawn. This follows the

framework developed in earlier experiments (as reported in Little et al., 2022).
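A minimal sketch of this sampling scheme follows; the use of the replicate index as the random seed is an illustrative assumption.

```python
import pandas as pd

# Sampling fractions used in the experiments, expressed as proportions of the population.
FRACTIONS = [0.99, 0.98, 0.97, 0.96, 0.95, 0.90, 0.80, 0.70, 0.60, 0.50,
             0.40, 0.30, 0.20, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01,
             0.005, 0.0025, 0.001]

def draw_samples(population: pd.DataFrame, n_reps: int = 100):
    """Yield (fraction, replicate, sample): 100 samples per size, drawn without replacement."""
    for frac in FRACTIONS:
        for rep in range(n_reps):
            yield frac, rep, population.sample(frac=frac, replace=False, random_state=rep)
```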

3 Risk and Utility Measures

For calculating the associated risk and utility the sample datasets were measured against the

population dataset. That is, the synthetic samples were measured against the synthetic population they

1 Study Number 7210 (Office for National Statistics, Census Division, University of Manchester, Cathie Marsh

Centre for Census and Survey Research 2013).


were sampled from, and the original samples were measured against the original population that they

were sampled from. Risk-Utility (R-U) maps, as developed by Duncan et al. (2004), were used to

visualise the trade-off between risk and utility.

3.1 TCAP for Disclosure Risk

Elliot (2014) and Taub et al. (2018) introduced a measure for the disclosure risk of synthetic data

called the Correct Attribution Probability (CAP) score. The disclosure risk is calculated using an

adaptation used in Taub et al. (2019) called the Targeted Correct Attribution Probability (TCAP).

TCAP is based on a scenario whereby an intruder has partial knowledge about a particular individual.

Specifically, they know (i) the values for some of the variables in the dataset (the keys) and (ii) that

the individual is in the original dataset. We assume that the intruder wishes to infer the value of a

sensitive variable (the target) for that individual. The TCAP metric is then the probability that those

matched records yield a correct value for the target variable (i.e., that the adversary makes a correct

attribution inference).

Three target variables, and corresponding key variables were identified from the UK Census data. For

each target, the TCAP score was calculated using sets of 3, 4, 5 and 6 keys. The overall mean of the

TCAP scores (for each of the target and key combinations) was calculated as the overall disclosure

risk score.

The TCAP statistic has a value between 0 and 1; a low value indicates that the synthetic dataset

carries little risk of disclosure whereas a score close to 1 indicates a higher risk. A baseline value can

be calculated (the usual one being the probability of the intruder being correct if they drew randomly

from the univariate distribution of the target variable) and then the TCAP score is rescaled so that the

baseline equals zero.2 We refer to the rescaled TCAP value as the marginal TCAP, i.e., it is the

increase in risk above the baseline. Rescaling is performed by subtracting the baseline from the TCAP

score and then dividing by 1 minus the baseline. For all experiments the targets were:

• LTILL : baseline = 0.774

• FAMTYPE : baseline = 0.223

• TENURE : baseline = 0.329

With a mean baseline of 0.442. The keys for each were:

• 6 keys: AREAP, AGE, SEX, MSTATUS, ETHGROUP, ECONPRIM

• 5 keys: AREAP, AGE, SEX, MSTATUS, ETHGROUP

• 4 keys: AREAP, AGE, SEX, MSTATUS

• 3 keys: AREAP, AGE, SEX
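A minimal sketch of the marginal TCAP rescaling described above follows; the raw TCAP value in the example call is illustrative, while the baseline of 0.442 is the mean baseline reported for these experiments.

```python
def marginal_tcap(tcap: float, baseline: float) -> float:
    """Rescale a TCAP score so that the baseline (random guess from the target's
    univariate distribution) maps to zero."""
    return (tcap - baseline) / (1.0 - baseline)

print(marginal_tcap(tcap=0.85, baseline=0.442))   # ~0.731 (raw value is illustrative)
```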

3.2 Utility

Following previous work (Little et al. 2022) the mean of the Ratio of Counts (ROC) and Confidence

Interval Overlap (CIO) was calculated as the overall utility score. This was to provide a more

complete view of the utility, rather than just using a single measure.

3.2.1 Ratio of Counts (ROC)

The Ratio of Counts (ROC) was calculated for univariate and bivariate cross tabulations of the data.

This is calculated by taking the ratio of the synthetic and original data estimates (where the smaller is

divided by the larger one). Thus, given two corresponding estimates (for example, the number of

records with SEX = female in the original dataset, compared to the number in the synthetic dataset),

where yorig is the estimate from the original data and ysynth is the corresponding estimate from the

synthetic data, the ROC is calculated as:

\[ \mathrm{ROC} = \frac{\min(y_{orig},\, y_{synth})}{\max(y_{orig},\, y_{synth})} \]

2 This does create the possibility of a synthetic dataset receiving a negative TCAP score (which can still be

plotted on the R-U map) but that simply indicates a risk level below that of the baseline and will only occur in

degenerate cases.


If yorig = ysynth then the ROC = 1. Where the original and synthetic (or sample) datasets are of different

sizes (as is the case when calculating the ROC for the various sample datasets) the proportion, rather

than the count can be used. The ROC was calculated over univariate and bivariate cross-tabulations of

the data and takes a value between 0 and 1. For each variable the ROC was averaged across categories

to give an overall score.
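A minimal sketch of the univariate ROC calculation follows; proportions are used so that datasets of different sizes can be compared, and the example column name is illustrative.

```python
import pandas as pd

def univariate_roc(orig: pd.Series, synth: pd.Series) -> float:
    """Ratio of counts for one variable, averaged across categories."""
    p_orig = orig.value_counts(normalize=True)
    p_synth = synth.value_counts(normalize=True)
    cats = p_orig.index.union(p_synth.index)
    p_orig = p_orig.reindex(cats, fill_value=0.0)
    p_synth = p_synth.reindex(cats, fill_value=0.0)
    both = pd.concat([p_orig, p_synth], axis=1)
    ratios = both.min(axis=1) / both.max(axis=1)   # smaller estimate over larger, per category
    return float(ratios.mean())

# e.g. univariate_roc(original["SEX"], synthetic["SEX"]), with hypothetical dataframes
```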

3.2.2 Confidence Interval Overlap (CIO)

To calculate the CIO (using 95% confidence intervals), the coefficients from regression models built

on the original and synthetic datasets are used. The CIO, proposed by Karr et al. (2006), is defined as:

\[ \mathrm{CIO} = \frac{1}{2}\left\{ \frac{\min(u_o, u_s) - \max(l_o, l_s)}{u_o - l_o} + \frac{\min(u_o, u_s) - \max(l_o, l_s)}{u_s - l_s} \right\} \]

where uo, lo and us, ls denote the respective upper and lower bounds of the confidence intervals for the

original and synthetic/sample data. This can be summarised by the average across all regression

coefficients, with a higher CIO indicating greater utility (maximum value is 1 and a negative value

indicating no overlap).

For each synthetic (or sample) dataset two logistic regressions were performed, and the CIO (between

the same regression on the original data) for each was calculated. The mean CIO over all coefficients

was used (where a negative overlap was equivalent to no overlap and therefore set to zero). The mean

of the two CIOs was then calculated as the overall score.
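A minimal sketch of the CIO for a single coefficient, following the formula above, is given here; the confidence-interval bounds in the example call are illustrative.

```python
def cio(l_o: float, u_o: float, l_s: float, u_s: float) -> float:
    """Confidence interval overlap for one coefficient (original vs synthetic bounds)."""
    overlap = min(u_o, u_s) - max(l_o, l_s)
    value = 0.5 * (overlap / (u_o - l_o) + overlap / (u_s - l_s))
    return max(value, 0.0)   # negative overlap treated as no overlap, as in the paper

print(cio(l_o=0.10, u_o=0.30, l_s=0.15, u_s=0.35))   # 0.75
```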

The target variables were marital status (MSTATUS) and housing tenure (TENURE), and they were

converted into a binary attribute: for marital status this was married (or living as married) and

anything else; and for tenure this was whether an individual owns their property (or lives in property

that is owned by a family member), and anything else. Eight variables were used as predictors, using

more would seem to overcomplicate the models. The predictors were: AGE, ECONPRIM,

ETHGROUP, LTILL, QUALNUM, SEX, SOCLASS, and TENURE or MSTATUS (whichever was

not the target).

4 Results

4.1 Experiment A

This is the scenario where we do not have access to the original/population data but have a synthetic dataset

the same size created from it. This explores using a synthetic dataset to model the relationship

between samples and population data. To be clear, throughout this section, the original dataset (the

UK 1991 sample, n=104,267) is referred to as the original population, and the synthetic dataset

created from this is referred to as the synthetic population. The samples are referred to as original

samples and synthetic samples.

The synthetic population was created (using Synthpop) from the original population. The synthetic

population had utility = 0.7596 and Marginal TCAP = 0.7228 (to 4dp) compared to the original.

Samples were drawn from the synthetic population to determine if the results follow the same patterns

as samples drawn from the original population. The same sample sizes were used as in previous

experiments (0.1%, 0.25%, …, 99%, see Little et al., 2022).

The utility and TCAP scores for each sample size were calculated by measuring against the 100%

synthetic population dataset, not the original population since this would not be available in this

scenario. The baseline TCAP scores (used for calculating Marginal TCAP) were calculated from the

100% synthetic population, and these vary slightly from the original population:

• Original TCAP baseline = 0.442

• Synthetic TCAP baseline = 0.441

For each sample size 100 datasets were drawn, and the results are the mean of the 100. The risk and

utility of the synthetic samples were contrasted with the equivalent results from the original samples.

Tables with the mean utility and TCAP scores for each sample size, and the standard deviation (all

values less than 0.04) are contained in Appendix B. Figure 3 displays the R-U map for the original

sample data at each sample size, together with the results for the synthetic sample data. The plot and

tables indicate that the relationship (in terms of risk and utility) between synthetic samples and the


synthetic population follows closely to the relationship between the original samples and original

population. However, the synthetic samples have moderately higher risk (particularly around the 50%

sample size) and moderately lower utility.

Figure 3: R-U map showing the original samples and the synthetic samples (mean of n=100) in experiment A.

Appendix B contains a table with the mean absolute error (MAE) and standard deviation (SD) of the

synthetic utility and TCAP values (when calculated against the original samples), for each sample

size. Figure 4 illustrates the values in the table, displaying the MAE of the utility and TCAP scores. It

highlights that the MAE in terms of utility is low and generally decreases as sample size increases,

whereas whilst the MAE for the TCAP is also low it displays an interesting curve around the 50%

point and then decreases beyond that as sample size increases.

4.2 Experiment B

In this scenario the original (UK 1991 Census sample, n=104,267) dataset represents the

population; we then:

• take smaller samples from the original population (1%, 2%, 3%, 4%, 5%)

• generate synthetic populations (the same size as the original population) from the smaller

samples

• then draw multiple samples of different sizes from each synthetic population

• calculate the risk and utility of the samples and contrast with original population samples

This is perhaps the more likely scenario (compared to Experiment A) since we do not usually have

access to the population data – it is more likely a small sample will be provided, and we can then use

this to generate a synthetic population. From this synthetic population samples can be drawn and the

resulting utility and risk of these can be compared to the equivalent results from the original

population samples.


Figure 4: Mean Absolute Error of the utility and marginal TCAP for each synthetic sample size (calculated against the

original samples), with error bars showing ±1 standard deviation.

To calculate the utility and risk, the synthetic samples are measured against the synthetic population

they were drawn from. They are not measured against the original population as that data would not

be available.

4.2.1 Samples to generate the synthetic populations

Sample sizes of 1%, 2%, … 5% were drawn from the population data; Table 1 lists the number of

records in each sample. Note that only one sample was (randomly) drawn for each size; this is because

hundreds of datasets emanate from each of these individual samples, so to keep

complexity down only one of each size was drawn initially.

Table 1: Number of records for each sample size

Sample size | 1% | 2% | 3% | 4% | 5%
Number of records | 1042 | 2085 | 3128 | 4170 | 5213

Synthpop was used to generate a synthetic population from each sample, using default parameters

(and with the visit sequence as detailed in Section 2.2). One synthetic population the same size as the

original population (104,267) was generated for each sample; therefore 5 synthetic populations were

produced. Table 2 indicates the utility and risk values for each synthetic population measured against

the original population. It highlights that (even with these small sample sizes), the utility of a

population generated from a smaller sample is lower than the utility of a population generated from a

larger sample, as might be expected. The risk (TCAP) exhibits a different pattern, and it is notable

that the TCAP score for the synthetic population generated from a 1% sample is higher than that for

the 2% and 3% sample populations.

For each of these five synthetic populations, random samples the same size as used in previous

experiments (0.1%, 0.25%, …, 99%, see Little et al., 2022) were drawn (without replacement). For

each sample size 100 samples were drawn.


Table 2: Utility and risk scores for each synthetic population, to 3dp

Synthetic population generated from a: | Utility | TCAP | Marginal TCAP
1% sample | 0.539 | 0.669 | 0.407
2% sample | 0.585 | 0.638 | 0.351
3% sample | 0.591 | 0.648 | 0.370
4% sample | 0.616 | 0.670 | 0.409
5% sample | 0.643 | 0.678 | 0.423

4.2.2 Utility and Risk

Appendix C contains tables with the results for utility and Appendix D for TCAP. To calculate the

utility and TCAP the synthetic samples are measured against the synthetic population they were

drawn from (they are not compared against the original population as that data would not be

available). Figure 5 plots (in the left panel) the utility for each of the

synthetic populations at different sample sizes, with the original population plotted for comparison.

The plot highlights that, regardless of the synthetic population origin (whether it was generated from a

1% sample of the original population or a 5% sample) the relationship between the utility and the

sample proportion is similar.

Figure 5: The utility (left) and marginal TCAP (right) for samples drawn from the synthetic populations, contrasted with

samples from the original population, in experiment B

The panel on the right in Figure 5 displays the marginal TCAP results for each synthetic population.

This illustrates that, whilst they all follow a similar curve, the synthetic samples all overestimate the

TCAP compared to the original samples - the samples taken from the synthetic population generated

from a 1% sample of the original population particularly so.

The R-U map (plotting the utility against the marginal TCAP) can be visualised for each synthetic

population. Figure 6 plots them all in one plot, alongside the original population results. Whilst they

all follow a similar pattern, the results from synthetic populations generated from smaller original

samples tend to have higher TCAP values than those generated from larger samples.


Figure 6: R-U map contrasting the results for samples generated from synthetic populations to the original population (with

sample sizes labelled) in experiment B.

Plots and tables of the MAE (and standard deviation) are in Appendix C (utility) and Appendix D

(marginal TCAP). The marginal TCAP plot indicates that the overall pattern of the MAE fluctuates at

lower sample sizes and then generally decreases as the sample size gets larger. The samples from the

synthetic population generated from a 1% sample of the original data have higher MAE than those

generated from larger samples. The samples generated from a 2%, 3% and 4% synthetic population

exhibit unusual behaviour in that they are not in the order one might expect; this is likely due to

variation in the samples for the TCAP key and target variables.

5 Final Thoughts

The results show that, at least in terms of the risk and utility of samples drawn from a synthetic

population, the relationship is similar to the results obtained by drawing samples from the original

population. For Experiment A, which used a synthetic population generated directly from the original

population, the relationship between the synthetic samples and the synthetic population follows

closely the relationship between the original samples and the original population; the lines on the R-U

map were very close together when compared.

For Experiment B, which is perhaps a more likely scenario (since we do not usually have access to the

population data), synthetic populations were generated from samples (of varying sizes) drawn from

the original population. For each synthetic population samples were drawn, and the risk and utility

calculated, with the results compared (in terms of risk and utility) to the results of samples drawn

from the original population. For each of the synthetic populations, the overall relationship, in terms

of the curve on the R-U map, is similar to the original population results. However, each of the

synthetic populations had higher risk (TCAP), pushing the curve upwards; and as the sample that the

synthetic population was generated from gets smaller the curve moves further away from the original

population curve.


Further work on this might involve using a different data synthesizer – Synthpop was selected because

it generally produces data of high utility (and therefore higher risk) – but it may make sense to

perform these experiments with synthetic data of lower utility/risk to determine whether the results

replicate. It is also possible that using different risk and utility metrics may produce different results.

Repeating the experiments with different datasets may also make sense. As in previous work, we have

used a sample to represent the population data, so a further extension would be to access population

data and repeat these experiments.

6 References

Duncan, G.T., Keller-McNulty, S.A. and Stokes, S.L. (2004). Database Security and Confidentiality:

Examining Disclosure Risk vs. Data Utility through the R-U Confidentiality Map.

Elliot, M. (2014). Final Report on the Disclosure Risk Associated with the Synthetic Data Produced

by the SYLLS Team. [online]. Available from:

https://hummedia.manchester.ac.uk/institutes/cmist/archive-publications/reports/2015-02 -Report on

disclosure risk analysis of synthpop synthetic versions of LCF_ final.pdf.

Karr, A.F. et al. (2006). A framework for evaluating the utility of data altered to protect

confidentiality. American Statistician, 60(3), pp.224–232.

Little, C., Elliot, M. and Allmendinger, R. (2022). Comparing the Utility and Disclosure Risk of

Synthetic Data with Samples of Microdata. In Privacy in Statistical Databases. PSD 2022. Springer

International Publishing, pp. 234–249. [online]. Available from: https://doi.org/10.1007/978-3-031-13945-1_17.

Nowok, B., Raab, G.M. and Dibben, C. (2016). Synthpop: Bespoke creation of synthetic data in R.

Journal of Statistical Software, 74(11).

Office for National Statistics, Census Division, University of Manchester, Cathie Marsh Centre for

Census and Survey Research. (2013). Census 1991: Individual Sample of Anonymised Records for

Great Britain (SARs). UK Data Service. [online]. Available from: http://doi.org/10.5255/UKDA-SN-7210-1 [Accessed May 29, 2021].

Taub, J. et al. (2019). Creating the Best Risk-Utility Profile: The Synthetic Data Challenge. In Joint

UNECE/Eurostat Work Session on Statistical Data Confidentiality.

Taub, J. et al. (2018). Differential Correct Attribution Probability for Synthetic Data: An Exploration.

In Privacy in Statistical Databases. pp. 122–137. [online]. Available from:

http://dx.doi.org/10.1007/978-3-319-99771-1_9.


7 Appendix A

The UK 1991 Census dataset sample, 104267 records and 15 variables:

Variable Name | Description | Number of Values | Number Missing

AREAP | Individual SAR area, e.g., Birmingham, Solihull | 21 | 0

AGE | Age (range 0-95) | 94 | 0

COBIRTH | Country of birth | 42 | 0

ECONPRIM | Primary economic position, e.g., Employee FT, Student, Retired (omits individuals < 16) | 10 | 21467 (20.6%)

ETHGROUP | Ethnic group, e.g., White, Black Caribbean | 10 | 0

FAMTYPE | Family type, e.g., Married no children, Cohabiting with children (n/a for individuals in communal establishments or with no family) | 9 | 0

HOURS | Number of hours worked weekly (range 1-81; excludes individuals aged <=16 and those who have not worked in the previous ten years) | 72 | 46979 (45.1%)

LTILL | Limiting long-term illness; two categories: Yes or No | 2 | 0

MSTATUS | Marital status, e.g., Single, married, divorced (individuals < 16 are categorised as 'single') | 5 | 0

QUALNUM | Number of higher educational qualifications; three categories: 0, 1 or 2+ (individuals < 18 have a "0") | 3 | 0

RELAT | Relationship to household head, e.g., Head, spouse, daughter | 8 | 2113 (2.0%)

SEX | Sex; two categories: Male or Female | 2 | 0

SOCLASS | Social class (based on occupation), e.g., Professional, skilled (omits individuals < 16 and those not in paid work in the last 10 years) | 9 | 44537 (42.7%)

TENURE | Tenure of household space, e.g., Owner occupied outright, rented privately (omits individuals not in a household) | 7 | 2113 (2.0%)

TRANWORK | Mode of transport to work, e.g., Bus, on foot (omits individuals not in employment in the week before Census) | 11 | 59249 (56.8%)


8 Appendix B

Experiment A: the mean utility and TCAP scores for each synthetic sample size (to 3dp, n=100), contrasted with the mean utility and TCAP of samples taken from the original population

Overall utility TCAP (3 targets) Marginal TCAP (3 targets)

Sample size Original Synthetic Original Synthetic Original Synthetic

0.1% 0.424 0.420 0.609 0.607 0.300 0.298

0.25% 0.503 0.497 0.613 0.612 0.306 0.306

0.5% 0.559 0.554 0.617 0.618 0.313 0.317

1% 0.610 0.605 0.627 0.627 0.331 0.333

2% 0.657 0.653 0.643 0.643 0.360 0.362

3% 0.682 0.680 0.655 0.655 0.382 0.384

4% 0.702 0.701 0.664 0.666 0.398 0.403

5% 0.715 0.712 0.674 0.675 0.416 0.419

10% 0.762 0.760 0.710 0.713 0.480 0.486

20% 0.810 0.808 0.762 0.768 0.574 0.585

30% 0.842 0.840 0.800 0.807 0.641 0.656

40% 0.865 0.864 0.831 0.840 0.696 0.713

50% 0.887 0.887 0.858 0.868 0.746 0.764

60% 0.905 0.904 0.885 0.895 0.794 0.812

70% 0.922 0.921 0.913 0.921 0.843 0.859

80% 0.940 0.939 0.941 0.947 0.895 0.905

90% 0.960 0.960 0.970 0.973 0.947 0.952

95% 0.974 0.974 0.985 0.986 0.974 0.976

96% 0.977 0.977 0.988 0.989 0.979 0.981

97% 0.980 0.980 0.991 0.992 0.984 0.985

98% 0.985 0.985 0.994 0.995 0.989 0.990

99% 0.990 0.990 0.997 0.997 0.995 0.995

Experiment A: the standard deviation to 4dp (n=100) of the utility and TCAP scores for the original and synthetic data samples

Overall utility TCAP (3 targets) Marginal TCAP (3 targets)

Sample size Original Synthetic Original Synthetic Original Synthetic

0.1% 0.0106 0.0125 0.0192 0.0185 0.0344 0.0331

0.25% 0.0114 0.0122 0.0108 0.0107 0.0193 0.0192

0.5% 0.0101 0.0108 0.0077 0.0078 0.0138 0.0139

1% 0.0078 0.0089 0.0061 0.0062 0.0109 0.0110

2% 0.0064 0.0076 0.0044 0.0039 0.0080 0.0070

3% 0.0066 0.0066 0.0034 0.0030 0.0061 0.0053

4% 0.0060 0.0057 0.0029 0.0031 0.0052 0.0055

5% 0.0068 0.0066 0.0028 0.0031 0.0050 0.0056

10% 0.0054 0.0065 0.0024 0.0022 0.0042 0.0039

20% 0.0059 0.0061 0.0021 0.0018 0.0037 0.0032

30% 0.0049 0.0060 0.0019 0.0019 0.0035 0.0033

40% 0.0067 0.0050 0.0016 0.0018 0.0028 0.0033

50% 0.0048 0.0049 0.0022 0.0017 0.0039 0.0030

60% 0.0045 0.0041 0.0021 0.0017 0.0037 0.0030

70% 0.0041 0.0041 0.0018 0.0017 0.0032 0.0031

80% 0.0036 0.0038 0.0021 0.0014 0.0038 0.0025

90% 0.0027 0.0028 0.0017 0.0014 0.0030 0.0025

95% 0.0019 0.0018 0.0013 0.0010 0.0024 0.0017

96% 0.0019 0.0016 0.0012 0.0010 0.0021 0.0018

97% 0.0016 0.0015 0.0011 0.0008 0.0020 0.0014

98% 0.0011 0.0011 0.0009 0.0008 0.0016 0.0014

99% 0.0010 0.0009 0.0005 0.0005 0.0010 0.0009


Experiment A: Mean Absolute Error (n=100) and standard deviation to 4dp of the utility and TCAP

values of synthetic samples compared to the original samples

Overall utility TCAP (3 targets) Marginal TCAP (3 targets)

Sample size MAE SD MAE SD MAE SD

0.1% 0.0108 0.0074 0.0147 0.0113 0.0262 0.0201

0.25% 0.0111 0.0081 0.0080 0.0071 0.0143 0.0127

0.5% 0.0088 0.0075 0.0062 0.0049 0.0114 0.0090

1% 0.0079 0.0065 0.0048 0.0038 0.0088 0.0068

2% 0.0060 0.0062 0.0032 0.0023 0.0060 0.0043

3% 0.0052 0.0043 0.0024 0.0017 0.0045 0.0031

4% 0.0043 0.0039 0.0029 0.0022 0.0059 0.0044

5% 0.0053 0.0050 0.0026 0.0021 0.0051 0.0040

10% 0.0053 0.0047 0.0029 0.0018 0.0061 0.0035

20% 0.0050 0.0039 0.0054 0.0018 0.0106 0.0032

30% 0.0045 0.0044 0.0075 0.0019 0.0142 0.0033

40% 0.0040 0.0032 0.0091 0.0018 0.0169 0.0033

50% 0.0039 0.0030 0.0101 0.0017 0.0186 0.0030

60% 0.0034 0.0026 0.0097 0.0017 0.0177 0.0030

70% 0.0035 0.0025 0.0084 0.0017 0.0153 0.0031

80% 0.0030 0.0025 0.0055 0.0014 0.0100 0.0025

90% 0.0022 0.0018 0.0029 0.0014 0.0054 0.0024

95% 0.0014 0.0012 0.0013 0.0008 0.0024 0.0014

96% 0.0012 0.0011 0.0012 0.0007 0.0022 0.0013

97% 0.0012 0.0009 0.0009 0.0006 0.0016 0.0011

98% 0.0009 0.0007 0.0008 0.0005 0.0014 0.0009

99% 0.0007 0.0005 0.0005 0.0003 0.0009 0.0005


9 Appendix C

Experiment B: Mean utility of original samples and synthetic samples, by sample size to 3dp. This is the mean utility (across 100 samples) of each sample size (the rows) for each of the synthetic populations (columns).

Sample size   Original population   Synthetic population generated from: 1% sample   2% sample   3% sample   4% sample   5% sample

0.1% 0.424 0.429 0.425 0.428 0.425 0.425

0.25% 0.503 0.509 0.506 0.505 0.505 0.500

0.5% 0.559 0.569 0.564 0.564 0.566 0.558

1% 0.610 0.626 0.617 0.618 0.619 0.611

2% 0.657 0.673 0.666 0.667 0.666 0.660

3% 0.682 0.700 0.692 0.694 0.694 0.687

4% 0.702 0.718 0.712 0.714 0.711 0.706

5% 0.715 0.733 0.725 0.728 0.727 0.721

10% 0.762 0.776 0.771 0.773 0.772 0.766

20% 0.810 0.823 0.817 0.820 0.818 0.813

30% 0.842 0.851 0.848 0.849 0.848 0.844

40% 0.865 0.874 0.871 0.871 0.872 0.868

50% 0.887 0.894 0.891 0.892 0.893 0.890

60% 0.905 0.911 0.909 0.909 0.909 0.907

70% 0.922 0.927 0.926 0.925 0.925 0.924

80% 0.940 0.944 0.944 0.943 0.943 0.941

90% 0.960 0.963 0.962 0.962 0.962 0.961

95% 0.974 0.976 0.974 0.975 0.975 0.974

96% 0.977 0.978 0.978 0.978 0.978 0.977

97% 0.980 0.981 0.981 0.981 0.981 0.980

98% 0.985 0.985 0.985 0.985 0.985 0.984

99% 0.990 0.990 0.990 0.990 0.990 0.990

Experiment B: the standard deviation to 4dp (n=100) of the utility for samples taken from the original population, and the five synthetic populations

Sample size   Original population   Synthetic population generated from: 1% sample   2% sample   3% sample   4% sample   5% sample

0.1% 0.0106 0.0140 0.0125 0.0145 0.0113 0.0116

0.25% 0.0114 0.0110 0.0120 0.0119 0.0102 0.0113

0.5% 0.0101 0.0095 0.0092 0.0102 0.0097 0.0084

1% 0.0078 0.0086 0.0087 0.0085 0.0075 0.0076

2% 0.0064 0.0073 0.0066 0.0061 0.0066 0.0070

3% 0.0066 0.0062 0.0069 0.0070 0.0066 0.0064

4% 0.0060 0.0064 0.0059 0.0057 0.0073 0.0071

5% 0.0068 0.0062 0.0067 0.0058 0.0064 0.0057

10% 0.0054 0.0071 0.0063 0.0062 0.0056 0.0060

20% 0.0059 0.0056 0.0057 0.0047 0.0060 0.0064

30% 0.0049 0.0058 0.0056 0.0064 0.0055 0.0060

40% 0.0067 0.0059 0.0051 0.0053 0.0053 0.0048

50% 0.0048 0.0050 0.0052 0.0052 0.0049 0.0047

60% 0.0045 0.0045 0.0046 0.0051 0.0046 0.0045

70% 0.0041 0.0046 0.0046 0.0044 0.0044 0.0043

80% 0.0036 0.0035 0.0035 0.0034 0.0037 0.0038

90% 0.0027 0.0025 0.0024 0.0025 0.0026 0.0028

95% 0.0019 0.0020 0.0020 0.0019 0.0019 0.0019

96% 0.0019 0.0016 0.0018 0.0017 0.0017 0.0018

97% 0.0016 0.0014 0.0014 0.0012 0.0014 0.0016

98% 0.0011 0.0012 0.0014 0.0012 0.0011 0.0015

99% 0.0010 0.0009 0.0009 0.0009 0.0009 0.0010


Experiment B: The MAE to 4dp (n=100) between the utility of the original population samples and each of the synthetic population samples

Sample size   Synthetic population generated from: 1% sample   2% sample   3% sample   4% sample   5% sample

0.1% 0.0121 0.0097 0.0109 0.0092 0.0093

0.25% 0.0103 0.0099 0.0095 0.0081 0.0092

0.5% 0.0118 0.0084 0.0097 0.0091 0.0069

1% 0.0166 0.0096 0.0097 0.0101 0.0065

2% 0.0159 0.0092 0.0102 0.0096 0.0060

3% 0.0181 0.0104 0.0128 0.0125 0.0066

4% 0.0158 0.0103 0.0119 0.0108 0.0067

5% 0.0179 0.0107 0.0128 0.0121 0.0068

10% 0.0138 0.0092 0.0110 0.0096 0.0057

20% 0.0125 0.0077 0.0098 0.0086 0.0061

30% 0.0097 0.0076 0.0085 0.0074 0.0056

40% 0.0092 0.0067 0.0069 0.0072 0.0047

50% 0.0079 0.0057 0.0065 0.0069 0.0044

60% 0.0065 0.0056 0.0054 0.0052 0.0042

70% 0.0060 0.0047 0.0045 0.0045 0.0038

80% 0.0048 0.0046 0.0039 0.0042 0.0032

90% 0.0037 0.0027 0.0027 0.0028 0.0025

95% 0.0022 0.0016 0.0017 0.0016 0.0016

96% 0.0018 0.0016 0.0016 0.0015 0.0015

97% 0.0014 0.0013 0.0011 0.0013 0.0013

98% 0.0011 0.0011 0.0010 0.0010 0.0011

99% 0.0008 0.0007 0.0008 0.0007 0.0008

Experiment B: the standard deviation for the MAE of the utility, to 4dp

Sample size   Synthetic population generated from: 1% sample   2% sample   3% sample   4% sample   5% sample

0.1% 0.0085 0.0079 0.0102 0.0067 0.0069

0.25% 0.0071 0.0072 0.0073 0.0063 0.0070

0.5% 0.0069 0.0063 0.0063 0.0074 0.0048

1% 0.0079 0.0055 0.0063 0.0063 0.0042

2% 0.0071 0.0056 0.0049 0.0057 0.0042

3% 0.0062 0.0059 0.0056 0.0058 0.0047

4% 0.0063 0.0048 0.0051 0.0052 0.0045

5% 0.0061 0.0055 0.0052 0.0058 0.0043

10% 0.0056 0.0047 0.0046 0.0050 0.0037

20% 0.0051 0.0045 0.0039 0.0046 0.0037

30% 0.0046 0.0039 0.0043 0.0041 0.0034

40% 0.0044 0.0041 0.0042 0.0041 0.0031

50% 0.0042 0.0037 0.0037 0.0037 0.0032

60% 0.0039 0.0032 0.0034 0.0034 0.0025

70% 0.0032 0.0032 0.0027 0.0028 0.0028

80% 0.0027 0.0025 0.0024 0.0026 0.0026

90% 0.0019 0.0017 0.0017 0.0019 0.0015

95% 0.0014 0.0012 0.0012 0.0012 0.0010

96% 0.0011 0.0011 0.0011 0.0011 0.0010

97% 0.0010 0.0009 0.0007 0.0009 0.0010

98% 0.0008 0.0008 0.0007 0.0007 0.0010

99% 0.0005 0.0005 0.0005 0.0005 0.0006

Experiment B: the MAE for the utility by sample proportion, for each synthetic population, with error bars indicating ± 1 standard deviation

10 Appendix D

Experiment B: the mean (n=100) marginal TCAP values from each of the synthetic populations, and the original population, to 3dp.

Sample size   Original population   Synthetic population (1%)   Synthetic population (2%)   Synthetic population (3%)   Synthetic population (4%)   Synthetic population (5%)

0.1% 0.300 0.432 0.410 0.373 0.385 0.378

0.25% 0.306 0.448 0.414 0.378 0.388 0.381

0.5% 0.313 0.458 0.425 0.390 0.398 0.393

1% 0.331 0.482 0.440 0.415 0.418 0.412

2% 0.360 0.516 0.468 0.448 0.450 0.439

3% 0.382 0.539 0.489 0.472 0.473 0.461

4% 0.398 0.558 0.504 0.491 0.491 0.478

5% 0.416 0.574 0.520 0.506 0.510 0.492

10% 0.480 0.635 0.584 0.569 0.574 0.556

20% 0.574 0.717 0.670 0.658 0.670 0.646

30% 0.641 0.776 0.731 0.723 0.738 0.711

40% 0.696 0.822 0.781 0.775 0.790 0.765

50% 0.746 0.861 0.823 0.820 0.833 0.811

60% 0.794 0.894 0.863 0.860 0.872 0.853

70% 0.843 0.924 0.899 0.898 0.907 0.892

80% 0.895 0.951 0.934 0.933 0.939 0.930

90% 0.947 0.976 0.968 0.967 0.970 0.965

95% 0.974 0.988 0.984 0.984 0.985 0.983

96% 0.979 0.991 0.987 0.987 0.988 0.986

97% 0.984 0.993 0.991 0.990 0.991 0.990

98% 0.989 0.995 0.994 0.993 0.994 0.993

99% 0.995 0.998 0.997 0.997 0.997 0.997

Experiment B: The standard deviation (to 4dp) of the marginal TCAP scores for samples from each of the synthetic populations. The original population results are included for comparison.

Sample size   Original population   Synthetic population (1%)   Synthetic population (2%)   Synthetic population (3%)   Synthetic population (4%)   Synthetic population (5%)

0.1% 0.0344 0.0374 0.0456 0.0429 0.0389 0.0374

0.25% 0.0193 0.0246 0.0250 0.0256 0.0270 0.0241

0.5% 0.0138 0.0176 0.0179 0.0162 0.0170 0.0177

1% 0.0109 0.0097 0.0128 0.0135 0.0118 0.0114

2% 0.0080 0.0082 0.0086 0.0086 0.0088 0.0092

3% 0.0061 0.0066 0.0066 0.0080 0.0062 0.0067

4% 0.0052 0.0052 0.0059 0.0067 0.0059 0.0058

5% 0.0050 0.0046 0.0057 0.0052 0.0048 0.0047

10% 0.0042 0.0031 0.0040 0.0045 0.0042 0.0040

20% 0.0037 0.0025 0.0039 0.0037 0.0032 0.0032

30% 0.0035 0.0025 0.0028 0.0032 0.0027 0.0031

40% 0.0028 0.0024 0.0028 0.0025 0.0024 0.0031

50% 0.0039 0.0021 0.0025 0.0025 0.0021 0.0027

60% 0.0037 0.0016 0.0019 0.0024 0.0019 0.0025

70% 0.0032 0.0016 0.0020 0.0020 0.0021 0.0023

80% 0.0038 0.0011 0.0016 0.0017 0.0013 0.0018

90% 0.0030 0.0008 0.0012 0.0013 0.0010 0.0015

95% 0.0024 0.0005 0.0009 0.0009 0.0008 0.0007

96% 0.0021 0.0005 0.0008 0.0008 0.0007 0.0009

97% 0.0020 0.0005 0.0008 0.0008 0.0006 0.0007

98% 0.0016 0.0004 0.0006 0.0007 0.0005 0.0007

99% 0.0010 0.0003 0.0004 0.0004 0.0003 0.0005

Experiment B: the MAE of the marginal TCAP for each synthetic population by sample proportion, with error bars indicating ± 1 standard deviation

Do samples taken from a synthetic microdata population replicate the relationship between samples taken from an original population?

Mark Elliot, Claire Little, Richard Allmendinger

University of Manchester

Introduction Is the relationship between:

• a population dataset and samples drawn from it

replicated by

• a synthetic version of the same population and samples drawn from it?

Population data usually unavailable - if synthetic samples can mimic this relationship, it would be useful

Extends previous work (Little et al., 2022) using samples to determine the sample equivalence of synthetic data to the original dataset • (to be able to say, for example, “the synthetic dataset has utility equivalent to a 10% original sample and

risk equivalent to a 5% original sample”)

Study Design - Data UK 1991 Census microdata (University of Manchester, 2023) is used to represent the population • subsetted on geographical region (West Midlands)

• 104267 records

• 15 variables (13 categorical, 2 numerical)

Area | Age | Country of birth | Economic group | Ethnic group | Family type | Hours worked | Long term illness | Marital status | Num qualifications | Relationship | Sex | Social class | Transport to work | Housing tenure

Sandwell | 7 | England | NA | Bangladeshi | Married dep. children | NA | No | Single | None | Child | M | NA | NA | Own outright

Coventry | 40 | England | Employee FT | White | NA | 50 | No | Married | None | NA | F | Manag. tech | Car | NA

Walsall | 70 | England | Retired | White | Married no children | 39 | Yes | Married | None | Household head | M | Part skilled | NA | Own buying

Study Design synthpop (Nowok et al. 2016) used to generate synthetic data • Default parameters

• Visit sequence ordered by ascending number of categories, with numerical variables first

Data samples were drawn randomly without replacement

Various sample fractions • 0.1%, 0.25%, 0.5%, 1%, 2%, 3%, 4%, 5%, 10%, 20%, …, 80%, 90%, 95%, 96%, 97%, 98%, 99%

◦ 22 overall

• n = 100 samples randomly drawn for each sample fraction

• 2200 samples
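As a concrete illustration of this setup, the sketch below (in R, since synthpop is an R package) shows one way the repeated sampling and the synthesis step could be wired together. It is a minimal sketch under assumptions: the population is assumed to sit in a data frame called `pop`, only a few of the 22 sample fractions are listed, the seed is illustrative, and the synthpop call uses default parameters apart from the output size; it is not the authors' code.

```r
# Minimal sketch, not the authors' code. Assumes the 1991 Census extract is in
# a data frame `pop` (104,267 rows, 15 variables).
library(synthpop)

set.seed(2023)                                                            # illustrative seed
fractions <- c(0.001, 0.0025, 0.005, 0.01, 0.05, 0.10, 0.50, 0.90, 0.99)  # subset of the 22

# 100 simple random samples (drawn without replacement) per fraction
samples <- lapply(fractions, function(f) {
  replicate(100,
            pop[sample(nrow(pop), size = round(f * nrow(pop))), ],
            simplify = FALSE)
})

# Experiment B style: synthesise a population-sized dataset from a 1% sample.
# `k` sets the number of synthetic rows; a visit.sequence argument could be
# supplied to order the synthesis as described on this slide.
samp_1pct <- pop[sample(nrow(pop), size = round(0.01 * nrow(pop))), ]
syn_pop   <- syn(samp_1pct, k = nrow(pop), seed = 2023)$syn
```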

Study Design – Metrics Disclosure Risk • For synthetic data reidentification risk not meaningful

• Attribution is possible

• Measured using the Targeted Correct Attribution Probability (TCAP) (Taub & Elliot, 2019) ◦ Probability that an intruder makes a correct attribution inference about a particular target variable, given partial

knowledge (key variables)

• We use marginal TCAP score ◦ Calculate baseline – probability of intruder being correct if they drew randomly from univariate distribution of

target variable

◦ Scale TCAP score between baseline and 1

◦ marginal TCAP indicates risk above the baseline

◦ Value between -x and 1, where a higher value indicates greater risk
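A minimal sketch of the baseline scaling described above is shown below. The exact baseline formula (the chance of a correct guess when drawing at random from the target's univariate distribution) is our reading of the slide, not code taken from the authors, and the function names are hypothetical.

```r
# Illustrative only. `tcap` is a raw TCAP score for one target variable;
# `target_orig` is that target variable in the original data.
marginal_tcap <- function(tcap, target_orig) {
  p <- table(target_orig) / length(target_orig)
  baseline <- sum(p^2)  # assumed: probability of a correct attribution when the
                        # intruder draws a value at random from the univariate
                        # distribution of the target variable
  (tcap - baseline) / (1 - baseline)  # 0 = baseline, 1 = always correct;
                                      # values below the baseline come out negative
}
```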

Study Design – Metrics Utility • Confidence Interval Overlap (CIO) (Karr et al., 2006)

◦ Logistic regressions performed on synthetic and original data (using same target/predictors for each)

◦ Regression coefficients are compared

◦ Score between 0 (no overlap) and 1

• Ratio of Counts/Estimates (ROC) ◦ For univariate and bivariate cross-tabulations

◦ Compares proportion of synthetic and original data estimates by taking the ratio

◦ Score between 0 and 1

• Overall utility score ◦ Mean of CIO, ROC univariate and ROC bivariate

◦ Value between 0 and 1, where a higher value indicates greater utility
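The two utility components can be sketched as follows; these are illustrative formulas consistent with the descriptions above (and, for the CIO, with Karr et al., 2006), not the authors' implementation, and all function names are hypothetical.

```r
# Confidence Interval Overlap for a single coefficient: the overlapping length
# as a share of each interval's width, averaged (0 = no overlap, 1 = identical).
cio <- function(lo_orig, hi_orig, lo_syn, hi_syn) {
  overlap <- max(0, min(hi_orig, hi_syn) - max(lo_orig, lo_syn))
  0.5 * (overlap / (hi_orig - lo_orig) + overlap / (hi_syn - lo_syn))
}

# Ratio of Counts for one (cross-)tabulation: smaller estimate over larger
# estimate in each cell, averaged; two empty cells are treated as agreeing.
roc <- function(orig_counts, syn_counts) {
  hi <- pmax(orig_counts, syn_counts)
  lo <- pmin(orig_counts, syn_counts)
  mean(ifelse(hi == 0, 1, lo / hi))
}

# Overall utility, as described above: mean of the CIO, the univariate ROC and
# the bivariate ROC (each already averaged over coefficients / tables).
overall_utility <- function(cio_mean, roc_uni, roc_bi) {
  mean(c(cio_mean, roc_uni, roc_bi))
}
```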

Study Design – Metrics Risk-Utility comparison • R-U confidentiality map (developed by Duncan et al. 2004)

• Plots utility against risk (TCAP) score

• Ideally disclosure risk is minimised, utility is maximised

Synthetic / Sample data • Utility and risk metrics calculated in the same way for samples of original data as for

samples of synthetic data ◦ By comparing against the dataset that the samples were drawn from

• Allows comparison on R-U map
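Given a table of results with one row per sample size, the R-U map reduces to a simple plot of utility against risk; the snippet below is a sketch assuming a data frame `results` with columns `utility` and `marginal_tcap` (both hypothetical names).

```r
# Illustrative R-U map in base R: mean utility plotted against mean marginal
# TCAP for each sample fraction.
plot(results$marginal_tcap, results$utility, type = "b", pch = 19,
     xlab = "Disclosure risk (marginal TCAP)", ylab = "Utility",
     main = "R-U map (sketch)")
```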

Results - Experiment A A synthetic population was generated from the original population

Random samples taken from both populations

Risk and utility calculated for each sample compared to the population it was sampled from

Results compared

Experiment A: Risk-Utility map showing the original samples and synthetic samples

Experiment A: Individual plots showing the original samples and synthetic samples for:

Utility Risk (Marginal TCAP)

Mean Absolute Error of the utility and marginal TCAP for each synthetic sample size (calculated against the original samples, error bars show +- 1 standard deviation)

Results - Experiment B UK 1991 Census data represents the population

Take samples from the population (1%, 2%, 3%, 4%, 5%)

Generate synthetic populations from the samples

Random samples taken from original and synthetic populations

Risk and utility calculated for each sample compared to the population it was sampled from

Results compared

Experiment B Synthetic population generated from smaller samples

• A more likely scenario

Process:

• Take samples from the original population

• 1%, 2%, 3%, 4%, 5%

• From each sample, a synthetic dataset the same size as the population (n=104267) was generated

• Utility increases with sample size

• TCAP differs

Synthetic population generated from a:

Utility Marginal TCAP

1% sample 0.539 0.407

2% sample 0.585 0.351

3% sample 0.591 0.370

4% sample 0.616 0.409

5% sample 0.643 0.423

Risk-Utility map contrasting the results for samples drawn from synthetic populations to those drawn from original population

Individual plots contrasting the results for samples drawn from synthetic populations to samples drawn from the original population, for:

Utility Risk (Marginal TCAP)

Mean Absolute Error of the utility and marginal TCAP for each synthetic sample size (calculated against the original samples, error bars show +- 1 standard deviation)

Utility Risk (marginal TCAP)

An aside: Risk-Utility map contrasting the results for samples drawn from synthetic populations to those drawn from the original population, where the synthetic population also contains the original sample used to generate it

• very little difference whether or not the original sample is included

Observations Experiment A → Synthetic population generated from original population • Relationship between synthetic samples and the synthetic population follows closely the

relationship between original samples and the original population

Experiment B → Synthetic populations generated from samples drawn from original population • Overall relationship similar to the original population results (similar curve on the R-U map)

• But the smaller the original sample (used to generate the synthetic population) the more the risk is overestimated

• Utility similar no matter the original sample size

Caveats Experiments conducted on samples of Census microdata

◦ May not generalise to full population data

Only one data synthesis method used ◦ Synthpop – which tends to create high utility (but also higher risk) synthetic data

Only one dataset used ◦ It may be useful to repeat this on other datasets

Underestimation of the risk of samples, relative to synthetic data ◦ Whilst synthetic data should not contain re-identification risk, sample data does

Risk measure uses a response knowledge attribution disclosure ◦ OK for Census data, but presence detection may be a significant risk in other data

Different risk and utility metrics may produce different results

Future Work Run experiments on full population data

Use different data synthesis methods

Use different datasets

Assess other utility measures

Assess other disclosure control methods

References Nowok, B., Raab, G.M. and Dibben, C., 2016. synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), pp.1-26.

Little, C., Elliot, M. & Allmendinger, R., 2022, Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata. In Privacy in Statistical Databases: International Conference, PSD 2022, Paris, France, September 21–23, 2022, Proceedings. Lecture Notes in Computer Science vol. 13463 LNCS, Springer Nature, Cham, Switzerland, pp. 234-249. https://doi.org/10.1007/978-3-031-13945-1_17

University of Manchester, Cathie Marsh Centre for Census and Survey Research, Office for National Statistics, Census Division. (2023). Census 1991: Individual Sample of Anonymised Records for Great Britain (SARs). [data collection]. UK Data Service. SN: 7210, DOI: http://doi.org/10.5255/UKDA-SN-7210-1

Taub, J., Elliot, M., Raab, G., Charest, A., Chen, C., O'Keefe, C. M., Nixon, M. P., Snoke, J., Slavkovic, A., 2019. The synthetic data challenge. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality. https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2019/mtg1/SDC2019_S3_UK_Synthethic_Data_Challenge_Elliot_AD.pdf

Duncan, G.T., Keller-McNulty, S.A. and Stokes, S.L., 2004. Database security and confidentiality: examining disclosure risk vs. data utility through the RU confidentiality map.

Karr, A.F., Kohnen, C.N., Oganian, A., Reiter, J.P., Sanil, A.P.: A framework for evaluating the utility of data altered to protect confidentiality. Am. Stat. 60(3), 224–232 (2006).


Intruder testing – an empirical measure of the quality of Census 2021 England and Wales Disclosure Control methods, ONS UK

confidentiality rules, individual data, swapping records, cell key method, disclosure rules, intruder testing,

Languages and translations
English

UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Title : Intruder testing – an empirical measure of the quality of Census 2021

England and Wales Disclosure Control methods

Author(s) Samantha Trace (Office For National Statistics) Dominic Nelson (Office For National Statistics

e-mail [email protected]

Abstract

By law, the Office for National Statistics (ONS) must protect the confidentiality of respondents to

Census 2021. We protected the confidentiality of individuals' data in three ways: swapping records

between areas, applying a cell key method to each table, and applying disclosure rules in deciding

which tables could be published. To assess the effectiveness of these methods and provide assurance,

an intruder test was performed on Census 2021 data using a secure version of the outputs system. 51

intruders were recruited to attempt to identify individuals in the planned data outputs. 30 Intruders

took part, 81 claims were made, and more than half of these claims (41/81) were incorrect. Further

steps were taken reduce the risks identified by the test, making the data the majority of these claims

were made from no longer possible to access through the Create a Custom Dataset system. This gave

the Office for National Statistics evidence there was sufficient uncertainty in the data to meet the

standard required by legal guidance and we would meet our ethical duty to protect confidentiality.


1 Introduction

The Office for National Statistics (ONS) has legal obligations under the Statistics and Registration

Service Act (SRSA, 2007) Section 39 and the Data Protection Act (2018) that require the ONS not to

reveal the identity or private information about an individual or organisation.

We have a pledge to respondents that the information will only be used for statistical purposes, so we

must look after and protect the information that is provided to us. Moreover, a breach of disclosure

could lead to criminal proceedings against an individual who has released or authorised the release of

personal information, as defined under Section 39 of the SRSA.

The SRSA defines "personal information" as information that identifies a particular person if the

identity of that person:

• is specified in the information

• can be deduced from the information

• can be deduced from the information taken together with any other published information

Therefore, in order for data to be released, the risk of identifying individuals from it, potentially with

additional publicly available information, must be minimal.

Intruder testing is an empirical test to check that the measures applied to make data sufficiently

difficult to identify individuals within have been successful. This involves recruiting ‘friendly

intruders’ who emulate the actions of potential ‘real intruders’ upon the data.

The standard that needs to be met is suggested by the National Statistician’s Guidance, “the design

and selection of intruder scenarios should be informed by the means likely reasonably to be used to

identify an individual in the statistic”.

So, intruder tests are designed to measure what could be done with the means likely to be available to an opportunistic attacker; they do not have to cover every imaginable scenario, just the most probable.

The 2011 Census outputs were tested in this way, and the findings were useful in providing assurance

that the disclosure control measures used on the data were adequate, and provided evidence as to what further steps should be taken to reduce disclosure risk further. Other ad-hoc exercises have been

undertaken by the ONS as required since, with the same purpose – to determine the level of

identification risk in a dataset.

For Census 2021, new disclosure control methods were required for a new output system. On top of

the imputation of missing records done to make the Census as representative as it can be, which also

adds doubt as to whether a particular record is ‘real’ or not, there were new measures in place to

protect the data:

• Targeted Record Swapping – swapping households that are marked as unique in the data with

a similar record in the local area. The geographies were changed for between 7% and 10% of

households, and for between 2% and 5% of individuals in communal establishments.

• Cell Key Perturbation - this adds noise to the figures, making slight changes to cell counts (including zero cell counts) by a method which means that where the same records are presented in a cell, the published number remains consistent (a minimal illustration of this mechanism is sketched below). A typical dataset would have around 14% of cell counts perturbed by a small amount, and small counts were more likely to have been perturbed than large counts.


• Disclosure rules (in the Create a Custom Dataset system) – automated rules including

measures of how many small counts are in the table, that can stop data being given for an area.

These methods were intended to combine as a ‘lighter touch’ approach, allowing some detail to be

possible at low level geography, whilst maintaining the usefulness of the data within the new Create a

custom dataset (CACD) system, and other census outputs. The CACD system allows users to create

their own multivariate datasets, so the rules are set to prevent the possibility of identifying a single

record and building up a list of potential attributes. The level of identification risk should still be

minimal, using information public or private.
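The following self-contained sketch illustrates the cell key idea referred to above. It is an illustration of the general mechanism only: the perturbation table, key range and all names are toy values chosen here for illustration, not the ONS implementation or its parameters, and zero counts (which the real method also perturbs) are not handled.

```r
# Toy sketch of the cell key method (illustrative only, not ONS's implementation).
# Every record carries a fixed random "record key"; a cell's key is the sum of
# its records' keys modulo a constant, so the same set of records always yields
# the same key and hence the same perturbed count.
set.seed(1)
n_records   <- 1000
max_key     <- 256
record_keys <- sample(0:(max_key - 1), n_records, replace = TRUE)  # fixed per record

# Toy perturbation table: rows = cell count (capped at 20), columns = cell key,
# entries = small adjustment applied to the published count.
ptable <- matrix(sample(c(-1, 0, 0, 0, 1), 20 * max_key, replace = TRUE), nrow = 20)

perturb_cell <- function(record_ids) {
  count    <- length(record_ids)                        # true cell count
  cell_key <- sum(record_keys[record_ids]) %% max_key   # same records -> same key
  count + ptable[min(count, nrow(ptable)), cell_key + 1]
}

perturb_cell(c(3, 17, 42))  # the same set of records always returns the same value
```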

2 The Intruder Test

2.1 Method

51 intruders, all ONS employees, were recruited. All had appropriate security clearances and

consented to an enhanced non-disclosure agreement. They were given training on how to use the

output system, and possible methods of working against our statistical disclosure controls. A safe area

of an approved file management system was set up, and they were given access to individualised

folders to record their findings and keep notes.

A version of the planned outputs system was created on a secure internal-access platform and loaded

with the usual resident database. This is the main basis for Census outputs as it includes all people

who are ‘usually resident’ at the enumeration address at the time of the census. This was also

programmed with all the current planned variables and classifications for those variables. A version of

the planned statistical disclosure rules was placed in this system, to auto-control outputs requested by

intruders, and deny access if the output did not pass these rules. The system had built-in perturbation, so outputs were automatically created with some values slightly changed.

The data placed in the system had targeted swapping already applied and imputed records present, just

as it would be when published. The main Census 2021 geographies were available in this system; the smallest geography used was Output Area (OA), an area with at least 100 persons in it, though more typically around 400.

Intruders were given individual access to the system and encouraged to collaborate on a private Teams channel and to share resources, such as web pages, hints and tips. An errors log was set up to record system issues. For each claim, intruders recorded the details, including the geography, variables and classifications used, the name and address of the individual claimed to have been found, and their confidence in the identification as a percentage.

Claims were transcribed from the individual file folders to a single sheet that the checkers had access

to. These checkers were from a different team to ensure the data was fully firewalled from the

intruders, and no actual disclosure would result from the exercise.

The checkers had access to record level data, so could determine whether a claim was correct, partial,

or incorrect. A correct claim would match on name and approximate address. Inaccurate address

matches were counted as correct so long as they would have been within the geographical area used to

make the claim.

Inaccurate name matching was counted as incorrect. A partial match would be where a claim was

made on a 1 in a cell, where more records would have been in that cell but were perturbed down to 1.


2.2 Limitations

We had considered engaging a third party to take part in the test; however, we could not be sure of the start time, and there are few companies engaged in exactly this sort of testing that could have gained

security clearances in time, so it was deemed impractical to engage a third party in this exercise.

Therefore, there may be some organisational biases in our exercise.

Although attempts were made to recruit people from more sparsely populated areas of England and

Wales, most people were still clustered geographically around ONS offices and reflect the socio-

demographic mix of ONS staff rather than the general population.

Intruders also had to use their spare time around their regular work, and the exercise ran in August

when many took leave, although it took place over three weeks to allow more people to participate.

The dataset looked at was not the full range of planned Census outputs. The final system includes not

just Usual Resident, but also Usual Residents in Households and Communal Establishments,

Households and Household reference Persons. The Usual Resident dataset used was taken to be a

sufficient test of the general level of risk in the data.

2.3 Results

2.31 Claims

81 identification claims were made, excluding duplicates. These claims are where an intruder highlighted a '1' cell count in a dataset, gave the details of this, and claimed they knew which person it related to. Two claims listed various methods of approaching the same identification; in these cases this was still counted as one claim, and measures such as cell count were taken from the first table stated.

40/81 or 49% of identification claims were correct (the intruder correctly named an individual in a

cell)

8/81 or 10% of identification claims were partially correct (the intruder correctly names an individual

in a cell of apparent size 1, but the cell count is greater than 1 – due to cell key perturbation – the cell

could have been representing any of the people in it)

33/81 or 41% of identification claims were incorrect, the record marked in the cell did not relate to the

individual named.

No attribute claims were made; an attribute claim is where an intruder claims to have found something new about a person through the data presented.

Of the initial 51, 12 dropped out, citing workload or holiday as reasons, and a further 9 filed no notes

and made no claims. Of the 30 intruders that took part, 6 (20%) did not make any claims. Reasons

cited included not being able to claim anything with certainty, some may also have lacked time to

spend on the project.

2.32 Confidence


Figure 1: Confidence, correctness and number of claims

This histogram shows numbers of claims by the percentage confidence the intruder reported in the

claim, banded by whether they were correct, partially correct or incorrect.

Confidence in claims ranged from 7.5% to 100%. The mean confidence placed in a claim was 73.6%; the median was 80%.

2.33 Cell Counts and correctness

The cell count is the number of cells (row * columns) present in the table used to make the claim.

A wide range of table sizes were used to inform claims, (range 7 – 2100, mean 183, median 182).

Figure 2: Cell counts and correctness

The scatter plot shows claims rated by percentage correctness: partially correct claims are scored as 50% correct and fully correct claims as 100% correct, with the number of cells in the dataset used for the claim on the vertical axis. One outlier (cell count 2100) was removed. This shows a weak positive correlation (R² = 0.0986); with the outlier included, the relationship was zero. This could suggest that higher cell counts may increase the possibility of identification within limits, but very high cell counts may not.
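For illustration, a relationship of this kind can be summarised with a simple linear fit; the snippet below is an assumed sketch with toy values, not the analysis code used for the exercise.

```r
# Illustrative only: correctness scored 0 / 0.5 / 1 and regressed on the number
# of cells in the table used for the claim (toy data shown).
claims <- data.frame(cell_count  = c(120, 80, 300, 150, 220),
                     correctness = c(1, 0, 0.5, 1, 0))
fit <- lm(correctness ~ cell_count, data = claims)
summary(fit)$r.squared   # analogous to the R-squared quoted for Figure 2
```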

2.34 Variables Used

To assess which variables were most likely to result in a claim, and which in a correct claim, the claims were coded to variable type. Any table in which a single classification made up the bulk of the cells was coded to that variable: for example, any claim using single year of age, or single year of age plus another less detailed classification such as sex, was coded to 'age', whereas any claim using a 3-part country of birth classification, 10-part age, and sex would be coded 'multivariate'. Variables with only a few claims each, such as country_of_birth, were coded to 'other'.

Table 1: Number of claims by variables used in the datasets those claims came from

Variable Number of Claims Number of Correct Claims % Claims that were correct

Age 35 21 60%

Multivariate 28 12 43%

Occupation 9 2 22%

Other 8 5 63%

The table shows claims where age was the main component had the highest number of claims, and

highest number of correct claims. Multivariate tables were less than 50% likely to yield a correct

claim, and occupation was unlikely to result in a correct claim. The main cause of correct claims from

the ‘Other’ category were claims using country_of_birth.

2.35 Geography

Table 2: Number and correctness of claims by Geography used in datasets

Geography of the table used for the claim   Number of Claims   Number of Correct Claims   % of Claims made that were correct   Mean % confidence   Mean cell count

OA 67 34 51% 75 142

LSOA 9 5 56% 73 248

MSOA 5 1 20% 47 610

The largest geography used for any claim was Middle Super Output Area (MSOA), Output area (OA)

was the main area of risk with the bulk of claims (65/81 or 80%) being made using OA datasets. It

was also the focus of correct claims (34/40 or 85%). There were few claims at MSOA, and only one


correct claim. Lower Super Output Area (LSOA) again was used in few claims, and though these were

majority correct, with such a small sample it cannot be concluded that this would always be more

likely to be correct or not.

Table 3: Subject of the Disclosure Claim

Number of Claims   Correct claims   Percentage of Claims correct

Family and friends 59 25 42%

People from news/ web 16 11 68%

Self-identification 6 4 67%

Many of those known about through news or online articles were centenarians, identified through age

and location.

2.36 Other

Though intruders were given access to 'fixed' tables as csv files, and at least 7 intruders used them, there were no correct claims from these.

Qualitative evidence suggested the intruders found the new flexible outputs system very easy to use (rated 4.3 out of 5 by the 15 intruders surveyed), and low amounts of time (5-30 minutes) were recorded as typical to arrive at a claim, though it is hard to calculate the total time taken per claim accurately, as time spent logged in could not be taken as an indication of time spent on this project.

Intruder feedback suggested that the disclosure rules built into the system were working as intended

and when they tried to obtain a cell value of 1 at lower geography, the rules prevented this by denying

the data.

3 Discussion

The overall results show that over half of identification claims were incorrect. However, unlike other

intruder testing exercises carried out previously by ONS, intruders were fairly unlikely to make claims

where they had low confidence. Almost all claims were made with a confidence of 60% or greater.

Generally, the higher percentage of confidence the intruder rated a claim, the more likely they were to

be correct. Although this was statistically significant, the relationship was not strong, and a significant portion of those who were over 90% confident were still incorrect or partially correct (35%, or 13/37).

The exercise on 2011 census data saw a drop off in percentage correctness at very high confidence

claims which was not seen here. Possibly, the ease of using the system may have made all intruders

more confident, and meant intruders went for easier identifications, rather than putting forward ones

they were less sure of.


The method used for this exercise did not allow us to know whether an identification was wrong due

to swapping, or other reasons – only if it was perturbed and therefore a ‘partial’. Therefore, it is hard

to evaluate the success of swapping as a single method from this evidence.

Cell counts of tables present an unclear picture, as no clear correlation was found between table size (in cell count) and correctness. With smaller tables it may be easier to be sure where a person is represented, whereas a larger table makes it more likely that a small count can be found on which to base an identification claim. It seems more detailed classifications may offer additional risk in some circumstances, but this is dependent on geography.

There were no claims at any geography higher than MSOA. It is likely that an intruder would have far

more confidence over a claim at lower geographies since they may have considerable knowledge as to

who lived in an OA with which they are familiar, but far more uncertainty as the geography level

increases. Observing a cell count of 1 in an OA may convince them that the person they know is the

only one with that combination of attributes. They might have less certainty at MSOA that the 1

corresponds to the subject of the claim given the lower likelihood of familiarity with the individuals in

the population, as well as ‘noise’ introduced by error, imputation, record swapping and the cell key

method.

The high level of claims and correct claims at OA make this the main area of risk to address in

planned outputs. Claims made at OA also had the highest level of confidence with an average of 75%

confidence expressed in the claims. The variables used for these claims were consistent with the

general picture, that is, age was a main variable used for identifications, followed by other detailed

classifications such as occupation and country_of_birth. Multivariate tables made the basis for 22 of

the OA claims, of which most were incorrect or partially correct (13/22 or 59%), which demonstrates

that the protections did well at protecting multivariate data as they were designed to do.

Whilst most of the claims were correct at LSOA (5/9 or 56%) this was a small sample and could

equally have been majority incorrect with one fewer correct claim. However, some of the claims made

at OA could equally have been made at LSOA, as they are small enough to make small counts

prevalent, and intruders might have a moderate level of familiarity with most residents within a typical

sized LSOA (1600 people). The level of confidence in LSOA claims was not much less than that

shown in claims made from OA level tables (73% confidence in LSOA, 75% in OA claims). A

majority of LSOA claims (5/9 or 56%) were based in multivariate tables, though a minority of these

were correct (2/5 or 40%). The mean cell count of tables used for claims at LSOA was consequently

much higher.

There was little risk of a correct claim (only 1/6 or 17%) from an MSOA table, so this supported

earlier evaluations of the data that looked only at the sparsity of the likely tables, and restricted fixed-

table outputs of detailed univariates to MSOA geography. The cell counts used for MSOA tables were

higher on average, which is unsurprising given the higher population (typically 7000) that would have

to be divided in the classifications to obtain a cell count of ‘1’ to base an identification upon. The level

of confidence was also significantly lower at average 47%.

That age was shown as a specific risk should be noted; however, some of these claims were claims

made using already publicly available information on Centenarians so arguably the disclosure came

from these sources, not the output. That said, many claims were also identifying people who happened

to be the only one of that age in their area, so single year of age at Output Area geography has been

shown as a specific risk to mitigate.


The variables used for correct claims supports current thinking that more ‘definite’ variables are more

disclosive, that is age and country_of_birth are both variables that are likely to be reported

consistently by the person filling in the Census.

Claims based upon occupation were very unlikely to be correct on the other hand, which may be due

to uncertainty about how the question may have been interpreted by the person answering, and how

their answer would have been coded by the automated processing system.

Multivariate claims are also less likely to be correct, possibly because increasing the number of

variables increases the chances an answer would not have been given or been recorded the way the

intruder guessed. The level of risk in these detailed univariates was still limited to smaller sized

geography, so there is no evidence from this test to restrict the use of these variables at MSOA or

higher geography.

In terms of the variables that relate to special category data there was no evidence that variables such

as health, disability, ethnicity, religion, sexual_orientation and gender_identity, all of which were

included in the test, were at significant risk of correct identification claims. This may be due to the

protections put in place for these, and the less definite nature of these variables. Though we know 7

intruders tried to use the sexual_orientation and gender_identity datasets, these were made available

separately through .csv files which may have made them harder to access. In the final outputs they

would not be available below MSOA, so this intruder testing exercise seems to support that decision

in terms of sufficient protection for that data.

The test was conducted pragmatically, and therefore recruited people with more statistical awareness

and knowledge of the data than would be found in the general population, as they were ONS

employees. This may be taken as a slightly over-stringent test, as it may over-estimate the risks from

intruder attempts made by the public.

4. Conclusion

The standard to be met to fulfil legal requirements is that claims should not be made with both confidence and correctness. The level of risk in the current planned outputs, as found by this exercise, would meet these legal definitions of safety, and additional steps were taken to decrease this risk further.

In response to the findings, the rules in the table builder were altered to restrict the availability of

detailed classifications at lower geography, and one more detailed topic summary was replaced with a

classification with fewer categories that consequently posed less risk. The majority of claims made

here would not be possible to make using the actual output system.

Perturbation, swapping, the disclosure rules and general level of doubt in the data together were

shown to be effective at preventing correct identifications.

Awareness of perturbation and swapping did not appear to result in lower levels of intruders' confidence in making claims, so this alone cannot be relied upon to meet the legal standards. Further

steps were also taken to ensure LSOA level data was protected by restriction of the level of detail

available at this geography.


The evidence seen here, with lower risk at MSOA, supports the decision to limit the geography of

usual residents in communal establishments and households to MSOA, even though those datasets

were not included in the test.

The CACD system has been launched since this test took place, and sees some 900,000 interactions

per month (ONS data), demonstrating the usefulness of Census data delivered in a flexible and

immediate format. If this system is to be employed for a wider range of statistical products, further

intruder testing should be considered as a means of measuring and mitigating disclosure risk in those

datasets.

Intruder testing is a highly useful exercise for data providers to employ, where the level of risk

presented by a dataset is in doubt. It gives evidence on the likely level of risk, where that risk lies, and

can inform appropriate action to mitigate those risks.

1

Intruder Testing

Census 2021 England and Wales

Risk and Utility in the Create a Custom Dataset System Sam Trace

2

Background

• Key Census 2021 White Paper promise ‘Every person’s identity will be protected, not only through secure handling and storage of their data, but also by ensuring that our statistical publications do not identify individuals’

• Since 2011, there has been exponential growth in information publicly available about individuals

• There is an all new customizable system for Census 2021

• Census 2021 has new methods protecting the data

3

Statistical Disclosure Control (SDC) methods

• Targeted Record Swapping – identifying people and Households that stand

out in the data, swapping them with a similar record in a nearby area.

• Cell Key Perturbation - this adds noise to the figures, making slight

changes to cell counts

• Disclosure rules – automated rule-based checks run by the system, which

decide if there is a low enough disclosure risk to allow the release of a

dataset.

How do we check these have done enough?

4

Intruder Testing

• Intruder testing is where ‘friendly’ intruders try to identify people in the data to check the risk level

• Census 2011 outputs were intruder tested before release

• It is a practical check to see if the methods worked

• The point of the exercise is to try and find out if it is possible to identify individuals in the data

5

Legal Standard for outputs

• There must be ‘sufficient uncertainty’ about any identification from a small count

• Identifications made with publicly available information in combination with the data are included

• Testers do not need to be specialist hackers

• Methods must cover the ‘means likely reasonably to be used’

6

Method

• Recruit intruders – ONS people only

• Consent intruders

• Train them and advise of the disclosure control methods

• Get the data on a secure pre-release system

• Intruders try to identify individuals in the data

• Collate results including feedback

• Analyse in Excel

7

Results

• 51 Intruders recruited

• 30 confirmed as working on the project

• 24 intruders made claims

• 81 Claims made (excluding duplicates)

8

Claims

Correct 49%   Incorrect 41%   Partial 10%

10

Confidence and Correctness

12

Variables Used

Correct All % Correct

Age 21 35 60%

Multi 12 29 41%

Occupation 2 8 25%

Other 5 9 56%

40 81 49%

13

Cell Counts

Correct All % Correct

0-49 2 9 22%

50-99 3 9 33%

100-149 9 20 45%

150-199 6 9 67%

200+ 20 34 59%

Total 40 81

14

Options

Remove detailed classifications from the Create Your Own Dataset system

• Loss of useful classifications at higher geography

• There may be other classifications not tried that also pose a risk

Limit max number of Cells

• Loss of useful functionality at higher geography

Specify Max cells specific to geography for univariates

• Would prevent the main risk

15

Limit max cells by Geography?

• The majority of datasets used for claims and correct claims used would not be available

• Might need to apply to LSOA too as some OA claims could equally have been successful at LSOA

• MSOA claims were already likely to be unsuccessful

16

Conclusions

• Detail available at low geography was a risk that was addressed in the live release system

• Some variables carry higher risk than others

• Changes to rules effectively blocked the main risks identified

• Automated rules in the Create a Custom dataset system worked to make claims harder to arrive at

17

Actions

• Limit detail available at low geography

• Keep detailed topic summaries at MSOA level geography

• Releases could take place as planned

18

User Experience We asked the intruders their opinions of the new system

19

Ease of use

Choosing Variables

Choosing Classifications

Data was clear

Speakers

Samantha Trace

Methodologist

Statistical Disclosure Control

Office for National Statistics

21


Intruder testing for Census 2021 England and Wales– checking risk and utility in Build Your Own system , ONS, UK

confidentiality, individual data, cell key method,  disclosure rules, intruder test

Languages and translations
English

UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Title : Intruder testing – an empirical measure of the quality of Census 2021

England and Wales Disclosure Control methods

Author(s) Samantha Trace (Office For National Statistics) Dominic Nelson (Office For National Statistics

e-mail [email protected]

Abstract

By law, the Office for National Statistics (ONS) must protect the confidentiality of respondents to

Census 2021. We protected the confidentiality of individuals' data in three ways: swapping records

between areas, applying a cell key method to each table, and applying disclosure rules in deciding

which tables could be published. To assess the effectiveness of these methods and provide assurance,

an intruder test was performed on Census 2021 data using a secure version of the outputs system. 51

intruders were recruited to attempt to identify individuals in the planned data outputs. 30 Intruders

took part, 81 claims were made, and more than half of these claims (41/81) were incorrect. Further

steps were taken reduce the risks identified by the test, making the data the majority of these claims

were made from no longer possible to access through the Create a Custom Dataset system. This gave

the Office for National Statistics evidence there was sufficient uncertainty in the data to meet the

standard required by legal guidance and we would meet our ethical duty to protect confidentiality.

2

1 Introduction

The Office for National Statistics (ONS) has legal obligations under the Statistics and Registration

Service Act (SRSA, 2007) Section 39 and the Data Protection Act (2018) that require the ONS not to

reveal the identity or private information about an individual or organisation.

We have a pledge to respondents that the information will only be used for statistical purposes, so we

must look after and protect the information that is provided to us. Moreover, a breach of disclosure

could lead to criminal proceedings against an individual who has released or authorised the release of

personal information, as defined under Section 39 of the SRSA.

The SRSA defines "personal information" as information that identifies a particular person if the

identity of that person:

• is specified in the information

• can be deduced from the information

• can be deduced from the information taken together with any other published information

Therefore, in order for data to be released, the risk of identifying individuals from it, potentially with

additional publicly available information, must be minimal.

Intruder testing is an empirical test to check that the measures applied to make data sufficiently

difficult to identify individuals within have been successful. This involves recruiting ‘friendly

intruders’ who emulate the actions of potential ‘real intruders’ upon the data.

The standard that needs to be met is suggested by the National Statistician’s Guidance, “the design

and selection of intruder scenarios should be informed by the means likely reasonably to be used to

identify an individual in the statistic”.

So, intruder tests are designed to measure what could be done with the means likely to be available to

an opportunistic attacker; they do not have to cover every imaginable scenario, just the most probable.

The 2011 Census outputs were tested in this way, and the findings were useful in providing assurance

that the disclosure control measures used on the data were adequate, and provided evidence as to what

further steps should be taken to further reduce disclosure risk. Other ad-hoc exercises have been

undertaken by the ONS as required since, with the same purpose – to determine the level of

identification risk in a dataset.

For Census 2021, new disclosure control methods were required for a new output system. On top of

the imputation of missing records done to make the Census as representative as it can be, which also

adds doubt as to whether a particular record is ‘real’ or not, there were new measures in place to

protect the data:

• Targeted Record Swapping – swapping households that are marked as unique in the data with

a similar record in the local area. The geographies were changed for between 7% and 10% of

households, and for between 2% and 5% of individuals in communal establishments.

• Cell Key Perturbation - this adds noise to the figures, making slight changes to cell counts

including zero cell counts, by a method which means that where the same records are

presented in a cell, the number should remain consistent. A typical dataset would have around

14% of cell counts perturbed by a small amount, and small counts were more likely to have

been perturbed than large counts. A minimal illustration of this method appears after this list.


• Disclosure rules (in the Create a Custom Dataset system) – automated rules including

measures of how many small counts are in the table, that can stop data being given for an area.
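To make these protections concrete, the sketch below illustrates the general idea of the cell key method and a simple small-count rule in Python. It is illustrative only: the record keys, the toy p-table, the noise values and the blocking threshold are invented for the example and are not the parameters or rules used by ONS.

```python
# Minimal sketch of the cell key method and a small-count disclosure rule.
# Illustrative only: the record keys, p-table and threshold are invented
# for this example, not the values used by ONS.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Microdata: each record carries a fixed random "record key".
people = pd.DataFrame({
    "oa": rng.choice(["OA1", "OA2"], size=1000),
    "age_band": rng.choice(["16-44", "45-64", "65+"], size=1000),
    "record_key": rng.integers(0, 256, size=1000),
})

def perturbed_table(df, rows, cols):
    """Cross-tabulate df and perturb each cell via its cell key."""
    grouped = df.groupby([rows, cols])
    counts = grouped.size()
    # Cell key: sum of the record keys of the records in the cell, modulo 256.
    # The same set of records always yields the same cell key, so the same
    # cell receives the same small adjustment wherever it appears.
    cell_keys = grouped["record_key"].sum() % 256
    # Toy p-table: look up a perturbation in {-1, 0, +1} from the cell key.
    noise = (cell_keys % 3) - 1
    return (counts + noise).clip(lower=0).unstack(fill_value=0)

def passes_small_count_rule(table, max_small_cells=5):
    """Toy automated rule: block the table if it has too many counts of 1 or 2."""
    small = ((table > 0) & (table <= 2)).to_numpy().sum()
    return small <= max_small_cells

tab = perturbed_table(people, "oa", "age_band")
print(tab)
print("released" if passes_small_count_rule(tab) else "blocked")
```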

These methods were intended to combine as a ‘lighter touch’ approach, allowing some detail to be

possible at low level geography, whilst maintaining the usefulness of the data within the new Create a

custom dataset (CACD) system, and other census outputs. The CACD system allows users to create

their own multivariate datasets, so the rules are set to prevent the possibility of identifying a single

record and building up a list of potential attributes. The level of identification risk should still be

minimal, using information public or private.

2 The Intruder Test

2.1 Method

51 intruders, all ONS employees, were recruited. All had appropriate security clearances and

consented to an enhanced non-disclosure agreement. They were given training on how to use the

output system, and possible methods of working against our statistical disclosure controls. A safe area

of an approved file management system was set up, and they were given access to individualised

folders to record their findings and keep notes.

A version of the planned outputs system was created on a secure internal-access platform and loaded

with the usual resident database. This is the main basis for Census outputs as it includes all people

who are ‘usually resident’ at the enumeration address at the time of the census. This was also

programmed with all the current planned variables and classifications for those variables. A version of

the planned statistical disclosure rules was placed in this system, to auto-control outputs requested by

intruders, and deny access if the output does not pass these rules. The system had built in perturbation

so automatically created outputs with some values slightly changed.

The data placed in the system had targeted swapping already applied and imputed records present, just

as it would be when published. The main census 2021 geographies were available in this system, the

smallest geography used was output area (OA), an area with at least 100 persons in it, though more

typically 400 persons.

Intruders were given individual access to the system, encouraged to collaborate on a private Teams

channel, and to share resources, such as web pages, hints and tips. An errors log was set up to record

system issues, and the details of the claim, including geography, variables and classifications used, as

well as the name and address of the individual being claimed as found, and the confidence level in the

identification as a percentage.

Claims were transcribed from the individual file folders to a single sheet that the checkers had access

to. These checkers were from a different team to ensure the data was fully firewalled from the

intruders, and no actual disclosure would result from the exercise.

The checkers had access to record level data, so could determine whether a claim was correct, partial,

or incorrect. A correct claim would match on name and approximate address. Inaccurate address

matches were counted as correct so long as they would have been within the geographical area used to

make the claim.

Inaccurate name matching was counted as incorrect. A partial match would be where a claim was

made on a 1 in a cell, where more records would have been in that cell but were perturbed down to 1.


2.2 Limitations

We had considered engaging a third party to take part in the test, however we could not be sure of

start time, and there are few companies engaged in exactly this sort of testing that could have gained

security clearances in time, so it was deemed impractical to engage a third party in this exercise.

Therefore, there may be some organisational biases in our exercise.

Although attempts were made to recruit people from more sparsely populated areas of England and

Wales, most people were still clustered geographically around ONS offices and reflect the socio-

demographic mix of ONS staff rather than the general population.

Intruders also had to use their spare time around their regular work, and the exercise ran in August

when many took leave, although it took place over three weeks to allow more people to participate.

The dataset looked at was not the full range of planned Census outputs. The final system includes not

just Usual Resident, but also Usual Residents in Households and Communal Establishments,

Households and Household reference Persons. The Usual Resident dataset used was taken to be a

sufficient test of the general level of risk in the data.

2.3 Results

2.31 Claims

81 identification claims were made, excluding duplicates. These claims are where an intruder

highlighted a ‘1’ cell count in a dataset, gave the details of this, and claimed they knew which

person it related to. Some (2) claims listed various methods to approach the same identification; in

these cases this was still counted as one claim and measures such as cell count were taken from the

first table stated.

40/81 or 49% of identification claims were correct (the intruder correctly named an individual in a cell).

8/81 or 10% of identification claims were partially correct (the intruder correctly named an individual in a cell of apparent size 1, but the cell count was greater than 1 due to cell key perturbation, so the cell could have been representing any of the people in it).

33/81 or 41% of identification claims were incorrect (the record marked in the cell did not relate to the individual named).

No attribute claims were made; an attribute claim is where an intruder claims to have found something

new about a person through the data presented.

Of the initial 51, 12 dropped out, citing workload or holiday as reasons, and a further 9 filed no notes

and made no claims. Of the 30 intruders that took part, 6 (20%) did not make any claims. Reasons

cited included not being able to claim anything with certainty, some may also have lacked time to

spend on the project.

2.32 Confidence


Figure 1: Confidence, correctness and number of claims

This histogram shows numbers of claims by the percentage confidence the intruder reported in the

claim, banded by whether they were correct, partially correct or incorrect.

Confidence in claims ranged from 7.5% to 100%.

The mean confidence placed in a claim was 73.6%; the median was 80%.

2.33 Cell Counts and correctness

The cell count is the number of cells (rows × columns) present in the table used to make the claim.

A wide range of table sizes was used to inform claims (range 7 – 2100, mean 183, median 182).

Figure 2: Cell counts and correctness

The scatter plot shows claims rated by percentage correctness (partially correct claims counted as 50% correct, fully correct claims as 100%), plotted against the number of cells in the dataset used for the claim. One outlier (cell count 2100) was removed. This shows a weak positive correlation (R² = 0.0986); with the outlier included, the relationship was zero. This could suggest that higher cell counts may increase the possibility of identification within limits, but very high cell counts may not.

2.34 Variables Used

To assess which variables were most likely to result in a claim, and which in a correct claim, the

claims were coded to variable type. Any table constructed with a single classification making the bulk

of the cells would be coded to that variable, e.g., any claim using single year of age, or single year of

age plus another less detailed classification such as sex, was coded to ‘age’; any claim using a 3-part

country of birth classification, 10-part age, and sex would be coded ‘multivariate’. A few variables

with only a few claims each were coded to ‘other’, such as country_of_birth.

Table 1: Number of claims by variables used in the datasets those claims came from

Variable Number of Claims Number of Correct Claims % Claims that were correct

Age 35 21 60%

Multivariate 28 12 43%

Occupation 9 2 22%

Other 8 5 63%

The table shows claims where age was the main component had the highest number of claims, and

highest number of correct claims. Multivariate tables were less than 50% likely to yield a correct

claim, and occupation was unlikely to result in a correct claim. The main cause of correct claims from

the ‘Other’ category was claims using country_of_birth.

2.35 Geography

Table 2: Number and correctness of claims by Geography used in datasets

Geography of the table used for the claim   Number of Claims   Number of Correct Claims   % of Claims made that were correct   Mean % confidence   Mean cell count

OA 67 34 51% 75 142

LSOA 9 5 56% 73 248

MSOA 5 1 20% 47 610

The largest geography used for any claim was Middle Super Output Area (MSOA). Output Area (OA)

was the main area of risk with the bulk of claims (65/81 or 80%) being made using OA datasets. It

was also the focus of correct claims (34/40 or 85%). There were few claims at MSOA, and only one


correct claim. Lower Super Output Area (LSOA) again was used in few claims, and though the majority of

these were correct, the sample is too small to conclude whether claims at this geography would usually

be correct.

Table 3: Subject of the Disclosure Claim

Number of Claims   Correct claims   Percentage of Claims correct

Family and friends 59 25 42%

People from news/ web 16 11 68%

Self-identification 6 4 67%

Many of those known about through news or online articles were centenarians, identified through age

and location.

2.36 Other

Though intruders were given access to ‘fixed’ tables as csv files, and at least 7 intruders used them,

there were no correct claims from these.

Qualitative evidence suggested the intruders found the new flexible outputs system was very easy to

use (rated 4.3 out of 5 by the 15 intruders surveyed), and low amounts of time were recorded as

typical to arrive at a claim (5-30 minutes) though it is hard to calculate total time taken per claim

accurately as time spent logged in could not be taken as an indication of time spent on this project.

Intruder feedback suggested that the disclosure rules built into the system were working as intended

and when they tried to obtain a cell value of 1 at lower geography, the rules prevented this by denying

the data.

3 Discussion

The overall results show that over half of identification claims were incorrect. However, unlike other

intruder testing exercises carried out previously by ONS, intruders were fairly unlikely to make claims

where they had low confidence. Almost all claims were made with a confidence of 60% or greater.

Generally, the higher the confidence the intruder placed in a claim, the more likely it was to

be correct. Although this was statistically significant, the relationship was not strong, and a

significant portion of those who were over 90% confident were still incorrect or partially correct (35%

or 13/37).

The exercise on 2011 census data saw a drop off in percentage correctness at very high confidence

claims which was not seen here. Possibly, the ease of using the system may have made all intruders

more confident, and meant intruders went for easier identifications, rather than putting forward ones

they were less sure of.


The method used for this exercise did not allow us to know whether an identification was wrong due

to swapping, or other reasons – only if it was perturbed and therefore a ‘partial’. Therefore, it is hard

to evaluate the success of swapping as a single method from this evidence.

Cell counts of tables present an unclear picture, as no clear correlation was found between table size

(cell count) and correctness. Smaller tables may make it easier to be sure where a person is represented,

whereas a larger table makes it more likely to yield a small count on which to base an identification claim. It seems

more detailed classifications may offer additional risk in some circumstances, but dependent on

geography.

There were no claims at any geography higher than MSOA. It is likely that an intruder would have far

more confidence over a claim at lower geographies since they may have considerable knowledge as to

who lived in an OA with which they are familiar, but far more uncertainty as the geography level

increases. Observing a cell count of 1 in an OA may convince them that the person they know is the

only one with that combination of attributes. They might have less certainty at MSOA that the 1

corresponds to the subject of the claim given the lower likelihood of familiarity with the individuals in

the population, as well as ‘noise’ introduced by error, imputation, record swapping and the cell key

method.

The high level of claims and correct claims at OA makes this the main area of risk to address in

planned outputs. Claims made at OA also had the highest level of confidence with an average of 75%

confidence expressed in the claims. The variables used for these claims were consistent with the

general picture, that is, age was a main variable used for identifications, followed by other detailed

classifications such as occupation and country_of_birth. Multivariate tables made the basis for 22 of

the OA claims, of which most were incorrect or partially correct (13/22 or 59%), which demonstrates

that the protections did well at protecting multivariate data as they were designed to do.

Whilst most of the claims were correct at LSOA (5/9 or 56%) this was a small sample and could

equally have been majority incorrect with one fewer correct claim. However, some of the claims made

at OA could equally have been made at LSOA, as they are small enough to make small counts

prevalent, and intruders might have a moderate level of familiarity with most residents within a typical

sized LSOA (1600 people). The level of confidence in LSOA claims was not much less than that

shown in claims made from OA level tables (73% confidence in LSOA, 75% in OA claims). A

majority of LSOA claims (5/9 or 56%) were based in multivariate tables, though a minority of these

were correct (2/5 or 40%). The mean cell count of tables used for claims at LSOA was consequently

much higher.

There was little risk of a correct claim (only 1/6 or 17%) from an MSOA table, so this supported

earlier evaluations of the data that looked only at the sparsity of the likely tables, and restricted fixed-

table outputs of detailed univariates to MSOA geography. The cell counts used for MSOA tables were

higher on average, which is unsurprising given the higher population (typically 7000) that would have

to be divided in the classifications to obtain a cell count of ‘1’ to base an identification upon. The level

of confidence was also significantly lower, at an average of 47%.

That age was shown as a specific risk should be noted; however, some of these claims were claims

made using already publicly available information on Centenarians so arguably the disclosure came

from these sources, not the output. That said, many claims were also identifying people who happened

to be the only one of that age in their area, so single year of age at Output Area geography has been

shown as a specific risk to mitigate.


The variables used for correct claims supports current thinking that more ‘definite’ variables are more

disclosive, that is age and country_of_birth are both variables that are likely to be reported

consistently by the person filling in the Census.

Claims based upon occupation were very unlikely to be correct on the other hand, which may be due

to uncertainty about how the question may have been interpreted by the person answering, and how

their answer would have been coded by the automated processing system.

Multivariate claims are also less likely to be correct, possibly because increasing the number of

variables increases the chances an answer would not have been given or been recorded the way the

intruder guessed. The level of risk in these detailed univariates was still limited to smaller sized

geography, so there is no evidence from this test to restrict the use of these variables at MSOA or

higher geography.

In terms of the variables that relate to special category data there was no evidence that variables such

as health, disability, ethnicity, religion, sexual_orientation and gender_identity, all of which were

included in the test, were at significant risk of correct identification claims. This may be due to the

protections put in place for these, and the less definite nature of these variables. Though we know 7

intruders tried to use the sexual_orientation and gender_identity datasets, these were made available

separately through .csv files which may have made them harder to access. In the final outputs they

would not be available below MSOA, so this intruder testing exercise seems to support that decision

in terms of sufficient protection for that data.

The test was conducted pragmatically, and therefore recruited people with more statistical awareness

and knowledge of the data than would be found in the general population, as they were ONS

employees. This may be taken as a slightly over-stringent test, as it may over-estimate the risks from

intruder attempts made by the public.

4. Conclusion

The standard to be met to fulfil legal requirements is that claims should not be made with both

confidence and correctness. The level of risk in the current planned outputs found by this

exercise would meet these legal definitions of safety, and additional steps were taken to decrease this

risk further.

In response to the findings, the rules in the table builder were altered to restrict the availability of

detailed classifications at lower geography, and one more detailed topic summary was replaced with a

classification with fewer categories that consequently posed less risk. The majority of claims made

here would not be possible to make using the actual output system.

Perturbation, swapping, the disclosure rules and general level of doubt in the data together were

shown to be effective at preventing correct identifications.

Awareness of perturbation and swapping did not appear to result in lower levels of intruders’

confidence in making claims, so this alone cannot be relied upon to meet the legal standards. Further

steps were also taken to ensure LSOA level data was protected by restriction of the level of detail

available at this geography.


The evidence seen here, with lower risk at MSOA, supports the decision to limit the geography of

usual residents in communal establishments and households to MSOA, even though those datasets

were not included in the test.

The CACD system has been launched since this test took place, and sees some 900,000 interactions

per month (ONS data), demonstrating the usefulness of Census data delivered in a flexible and

immediate format. If this system is to be employed for a wider range of statistical products, further

intruder testing should be considered as a means of measuring and mitigating disclosure risk in those

datasets.

Intruder testing is a highly useful exercise for data providers to employ, where the level of risk

presented by a dataset is in doubt. It gives evidence on the likely level of risk, where that risk lies, and

can inform appropriate action to mitigate those risks.

Smoothing the way for secure data access using synthetic data

de-identified form, accredited researchers, safe access, synthetic data


UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Smoothing the way for secure data access using synthetic data

Authors: Richard Welpton and Emily Oliver, Economic and Social Research Council (ESRC), UK

[email protected] [email protected]

Abstract

In the UK, sensitive and potentially disclosive data (including survey and government-owned administrative data) are kept securely and safely in de-identified form and are only accessible to accredited researchers through Secure Data Environments (SDEs). Using this data for research has enormous potential, although access can be constrained by the need for researchers to understand enough about these complex datasets for them to submit a viable project proposal, tensioned against the resource required for data owners to assess every application to use it, and data guardians to answer questions from researchers about the data. For the researcher, they need to be very invested to engage: they can’t see the data in advance of applying for it; can’t test it to see if it will answer their research question; it can take a long time to get hold of; and when they do, it might not contain what they need. It's also burdensome for the SDE as the researcher needs to spend a lot of time in the SDE exploring and preparing data ready for analysis. The resource costs to both researcher and SDE can be considerable.

Low-fidelity synthetic data can be an effective tool to improve the researcher journey because it can lower the barriers to understanding the data before giving researchers access to the real data. As well as accessing it for training purposes, researchers can use it for exploratory analysis to determine if the real data includes the variables they need. In turn, this can help support researchers to improve the quality of their applications for funding and data access; and to develop and test their code while they are waiting for access to the real data. Researchers can continue to develop their code outside of the SDE, therefore minimising the time and resources spent inside the environment. In the UK, only a small number of data services provide access to synthetic data, despite the development of numerous methods for creating synthetic data in the last decade or so.

The Economic and Social Research Council (ESRC, the UK funding council for social and economic research in the UK) has invested in a programme of work to support the creation and routine operationalisation to supply low-fidelity synthetic data to support data access for research and improve the efficiency of SDEs. This has been done largely through ESRC’s Administrative Data Research UK (ADR UK) programme. They have:

 Conducted an in-depth study of the concerns and myths held by government data owners surrounding synthetic data production and use;

 Funded the creation of a Python Notebook tool to create synthetic data easily, at low cost and minimal risk which has been tested and approved by government departments;

 Formed a position statement across its UK partnership setting the vision for synthetic data within its wider remit and mission;

 Embarked on a significant project to explore the utility and use cases of different approaches to synthetic data creation and to evaluate the efficacy of different models to provide recommendations for how synthetic data production can be achieved at scale whilst still acceptable to data owners;


 Developed a public dialogue on the acceptability of synthetic data, and public understanding of it and its uses to increase trust and confidence in its development for research for public good.

This session will describe the secure data landscape within which synthetic data sits in the UK and explain the approach taken by ESRC and ADR UK to utilise it as a catalyst for better quality applications for funding and data access, and a smoother researcher journey. We will demonstrate the effectiveness of provisioning access to low-fidelity data by describing how it makes the researcher journey for accessing data and use of data in a SDE more productive, while simultaneously reducing the burden for data custodians and maintaining confidentiality.

1 Introduction

Considerable progress has been achieved to improve access to sensitive data for research, particularly in the UK. For example, the Office for National Statistics (ONS) launched the Virtual Microdata Laboratory (VML) in the mid-2000s (later to become the SRS – the Secure Research Service). In 2011, the UK Data Archive established the Secure Data Service (now UK Data Service Secure Lab). ESRC’s ADR UK programme, a partnership between government and academic groups across all four UK nations, creates linked datasets from administrative sources, making these available to researchers through four Trusted Research Environments (TREs): SAIL databank (ADR Wales); NISRA (ADR Northern Ireland); eDRIS/Research Data Scotland (ADR Scotland); and ONS Secure Research Service (ADR England). These are all examples of Secure Data Environments (SDEs), also known as Trusted Research Environments (TREs).

These facilities have become commonplace across the health and social science research sectors because they offer a robust approach to accessing sensitive data. They reassure data owners that data they are responsible for, on behalf of the public, can be accessed safely (mitigating risk to individuals in the data) according to the principles of the Five Safes Framework1. SDEs are now considered the default option as far as access to sensitive data is concerned.

SDEs enable a range of data sources to be accessed securely. Consequently, researchers can better explain a range of health, social and economic phenomena. Examples of these data include:

 Business survey microdata available in the SRS and Secure Lab (these are sensitive due to the difficulty of anonymising the data and keeping enough utility in the data to undertake research)

 Detailed versions of social survey data also available in the SRS and Secure Lab, where the additional detail such as very low-level geographies or occupation codes not available in downloadable versions of the data offer new research insights.

 ADR UK has supported UK and devolved governments to make a range of administrative datasets available through their network of four SDEs. These data are de-identified, but not suitable for download because of their sensitivity, and offer utility for researchers.

 Health data such as cancer registration data and records from primary and secondary care services are accessible to researchers through organisations such as NHS England, and other SDEs.

 Linked health and administrative datasets are also now becoming available to researchers through ADR UK’s network of SDEs.

The UK benefits from a legal climate that permits use of such data for research purposes; but culturally the use of the data described above continues to provide ethical and public perception challenges. Concerns about the misuse of data are understandably a constant feature of public debate in this area. This underlines the important role that SDEs have in maintaining the social licence to use these data for research in the public good. When managed through the Five Safes Framework, secure access to these data through an SDE provides assurance that such access leads to safe use in the public good.


Despite the SDE solution, it should be pointed out that the cost of setting up and operating an SDE is high. Unlike distribution of data, secure access to data through an SDE requires:

 A technological solution (controlling access to researchers, data, projects; coupled with computational processing power)

 An auditable information governance and assurance framework

 Expert staff (technology, research, data management, statistical disclosure control etc.)

An SDE can only support as many data sources, researchers and research projects as its technology and staff capacity can allow. For example, between 2007 and 2010, the VML could support about 12 researchers accessing the facility simultaneously (the number of physical desks available at the offices where researchers could sit to visit the facility). When the Secure Data Service was launched in 2011, it could allow 40 researchers to remotely access the service at any one time; this was increased to 150 recently, following funding from ADR UK to expand and improve the service.

Other capacity constraints remain:

Inputs: procedures that researchers must navigate to access data in an SDE often require the researcher to explain in detail how they will use the data to address their research hypothesis. While metadata and documentation can help (when available), researchers often cannot describe accurately how they will use the data until they actually have access to the data. This creates uncertainty and can lengthen the application process.

Quality and completeness of information: occasionally, researchers who have spent considerable time gaining approvals for access to data discover that the data are not suitable for their research when they finally acquire access: a significant opportunity cost for them (and the data owner and SDE that support their access).

Outputs: In an SDE, researchers need to have their research outputs checked for potential disclosure before being released, a process known as statistical disclosure control. This is largely a manual process: SDE staff receive and process these requests. The ability to support researchers can be constrained simply by the number of staff available to service these requests.

Throughput: Much research involves exploring data and methods before a research question is answered. This iterative process relies on computing power to process data. In practice, little of this processing effort leads to a direct research output (for example, it may take several iterations to estimate a research model that yields statistical results that a researcher decides to publish). Yet depending on the technical architecture of the SDE, researchers may be competing for available compute resource, such as CPU/GPU memory.

One solution to address these constraints is to simply invest more money into SDEs, so more staff can be recruited, and more computational capacity can be sourced, etc. Despite such efforts in recent years, the demand to access these data sources continues to grow. SDEs are unlikely to be able to scale up to keep pace with this demand indefinitely.

This paper describes the potential of synthetic data to reduce these bottlenecks. We provide a vision whereby synthetic versions of sensitive data are routinely produced to:

 Enable researchers to assess data before making an application to access them; making sure they are the right data to support their research and help them accurately justify their use of the data when applying to access the data.

 Support the iterative process of research methodology and execution outside of the SDE and thereby reducing demand on SDE computational resources and demand for staff time to undertake statistical disclosure control (accepting the latter may be automated or partially automated in the future).


The next section outlines in more detail the challenges that researchers experience. We proceed by explaining ADR UK’s efforts to pilot the generation of synthetic data, and then describe how these synthetic data can support researchers and enable SDEs to work more efficiently given their limited resources, resulting in improved outcomes for researchers and the policy world they support.

2 Challenges for researchers

Using sensitive data for research, such as administrative data, has huge potential, not least because there is so much of it. Administrative data, by its very nature, includes everyone. The datasets are enormous and complex, rich with potential for discovering insights about behaviours, trends, implications and consequences for individuals, communities and the policies and services they are dependent upon. By linking datasets and combining survey and administrative data, these insights can be even deeper, and the things they can tell us can be transformational.

Access approvals can be slow to gain particularly for linked administrative datasets, because typically each data owner (that is, the government department, local authority or other public body) will want to approve requests. For the researcher this is dependent upon:

 Knowing what data they want to access – including the dataset, the variables, and even within the variable, the period of time they want to consider. Generally, a data owner will not want to give permission for a researcher to access any data they do not need to answer their specific question (the principle of only providing the minimum data necessary to address the research question). If the data has good, accessible documentation (metadata, user guide etc) this could be possible. Otherwise, they might need to rely on access to an expert who has used the data before and knows it well. The researcher needs to be specific and accurate in their request, but knowing enough about the data to do this before they make the request is not always possible.

 Getting a response from the data owner: this is dependent on the data owner having adequate resources in place to respond to data requests. The data owner needs staff who know and understand the data, who also have the time and remit to respond to these queries. If the data is deemed particularly useful by researchers and/or it does not have good and accessible documentation, the data owner might be inundated with requests for it. During times of political turmoil, such as during and post-elections, industrial action or national crises, processing data access application queries might be deprioritised.

Although a researcher can apply to be accredited to access secure data, it is generally only when the data owner has indicated approval can the researcher apply through more predictable channels: applying to the relevant SDE and getting confirmation from research approval panels.

Dr Paul Calcraft2 has described the process of applying to access linked administrative data in the UK as trying to buy a second-hand car without being able to see it or test drive it first: Does it have all its parts, is anything missing, does it do what you think it will do, are there any quirks you should know about? In applying for data, one cannot see it in advance of applying for it; and one cannot test it to see if it will answer the research question. Accessing the data can be lengthy without certainty it will contain the information needed. Figure 1 sets out the process researchers need to follow to access secure data in England.


Figure 1: Process of access to secure data in England

3 Synthetic data as a solution

Bypassing much of the system for accessing secure data by instead accessing a version which is not real data and therefore does not need to be held securely, could be one solution for researchers. At the very least, using a synthetic version of the data to find out if you really do want to embark on a protracted process to access the real data, could be valuable. In this section we describe how this prospect should be considered.

3.1 Types of synthetic data and their potential

The utility of synthetic data for different applications is, of course, central to the question of its potential. High fidelity synthetic data which mimics the original data and preserves the statistical relationships between variables could reduce costs and complexities for the researcher, as it could also allow for analyses which are extremely close to those done on the real data. The use of such high fidelity synthetic data does come with a degree of risk for the data owner however, particularly if people misinterpreted findings from such data, or it was ‘passed off’ as real data.

Low fidelity synthetic data can, on the other hand, significantly reduce, if not remove, the risks for data owners, as analyses of the data would not generate meaningful results. It can also provide the researcher with easy access to a dataset which can be used to prepare code, test code, become familiar with the format of the data and learn how it can be used. It can also be used for training purposes, to raise awareness about the data.

In the UK, only a small number of data services provide access to any synthetic data, despite the development of numerous methods for creating it in the last decade or so. Making the production of low fidelity synthetic datasets more common could be beneficial to researchers and data managers alike. However, public perception of it is currently unclear and could be reputationally damaging if not addressed alongside other considerations.

For the purposes of this paper, we have described SDEs as ‘remote access’ solutions, in which the researcher can access and ‘see’ the data they have applied to access to undertake their research. Another approach is the


‘remote execution’ model, where a researcher develops statistical programming code using synthetic data, then submits their code to be run remotely on the data. Statistical outputs are then returned to the researcher, subject to a statistical disclosure control check. Recent developments have included Application Programming Interfaces (APIs) to automate this process (such as DataShield, OpenSafely). Remote execution relies heavily on accurate synthetic data to ensure that the researchers can submit accurate statistical programming code; it may fail otherwise, to the frustration and delay of the research.

3.2 Developments

In 2020, ADR UK commissioned the Behavioural Insights Team (BIT) to undertake an investigation into the attitudes to and appetite for the provision of synthetic data by government departments. The intention was to understand the concerns and barriers to it with a view to being able to tackle these head on in a more informed way. It identified technical considerations, risk aversion and lack of knowledge, the use of advanced privacy-preserving technologies, and the need for better understanding of public attitudes to synthetic data alongside clearer communication as the key influencing factors. The results of the study are set out in the project report, Accelerating public policy research with synthetic data, and led to recommendations to:

 Encourage the use and sharing of low-fidelity synthetic data to support rapid discovery of whether the dataset is appropriate for answering the research question; to develop and test code before full access is available; reducing delays in the process, including the amount of time needed to be spent in a secure environment;

 Expand the use of synthetic data for training so that researchers can be exposed to relevant idiosyncratic datasets earlier, thus improving their efficiency on live projects;

 Develop a cross-government repository of synthetic data for restricted access without a specific project proposal to allow for better design and more refined project proposals, and for this to be fed by a semi-automated pipeline to routinely generate low-fidelity synthetic data.

The study was followed up with the development of a synthetic data generation tool in the form of a prototype Python notebook which could be used by government analysts or researchers to generate low-fidelity synthetic datasets quickly and easily. It creates a version of the data that follows the structure and some of the patterns found in the real data. As such, it is plausible and represents the data as a whole. At the same time, because it does not preserve statistical relationships between columns, it reveals very little - if anything - about any individual in the dataset. The tool has now been extensively tested and is available for use. Users need Python (preferably Python 3), two common Python libraries (NumPy and pandas), and a software tool for viewing, editing, and running Python notebooks such as VSCode or Jupyter.
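As an illustration of the general approach (not the BIT notebook itself), the sketch below generates low-fidelity synthetic data with NumPy and pandas by sampling each column independently from its observed distribution: the output keeps the structure and marginal patterns of the real data while deliberately preserving no relationships between columns. The function name and example data are invented for the illustration.

```python
# Minimal sketch of low-fidelity synthetic data generation: each column is
# sampled independently from its own observed distribution, so structure and
# marginal patterns are retained but cross-column relationships are not.
import numpy as np
import pandas as pd

def low_fidelity_synth(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    synth = {}
    for col in real.columns:
        values = real[col].dropna()
        if pd.api.types.is_numeric_dtype(values):
            # Draw from the empirical distribution (with replacement).
            synth[col] = rng.choice(values.to_numpy(), size=n_rows, replace=True)
        else:
            # Sample categories in proportion to their observed frequencies.
            freq = values.value_counts(normalize=True)
            synth[col] = rng.choice(freq.index.to_numpy(), size=n_rows, p=freq.to_numpy())
        # Reintroduce missingness at roughly the observed rate.
        miss_rate = real[col].isna().mean()
        synth[col] = pd.Series(synth[col]).mask(rng.random(n_rows) < miss_rate)
    return pd.DataFrame(synth)

# A researcher could develop and test code against `fake` while waiting for
# access to the real extract inside the SDE.
real = pd.DataFrame({"age": [34, 51, 27, 63, np.nan], "region": ["E", "W", "E", "E", "W"]})
fake = low_fidelity_synth(real, n_rows=1000)
print(fake.head())
```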

The BIT developers also produced a user guide which provides clear, step-by-step instructions, including how to ensure your system can run it. It guides the user through methods to run the cells in the notebook, explains how output files can be saved, and even tells you how to check that the notebook has worked. There is a useful section on troubleshooting as well as further information for more advanced users.

In an attempt to visualise the benefits of synthetic data for researchers using the process set out above in Figure 1, we have indicated on Figure 2 where the efficiencies could lie.


Figure 2: Proposed efficiencies on process when access to synthetic data is added:

Of course, low-fidelity synthetic data is not a silver bullet. There will be instances where higher fidelity synthetic data is both more appropriate and more useful. In an ADR UK-led workshop at the International Population Data Linkage Network (IPDLN) conference in 2022, where different approaches to creating synthetic data were discussed, participants agreed that the value of different tools was entirely reliant on the end utility of the synthetic dataset2. Partners from ADR UK have taken their own approaches to developing synthetic data according to need and appetite in the devolved nations of Wales, Scotland and Northern Ireland and these have recently been published as an Interim Position Statement on Synthetic Data. It sets out ADR UK’s vision for synthetic data and frames it in the wider context of its remit and mission. The statement is intentionally ‘interim’ because of the dynamic nature of this topic and our growing understanding of issues and opportunities associated with it.

3.3 Putting synthetic data into practice

While the case for the provision and use of synthetic data is powerful, data owners remain cautious, and we need to find effective ways of engaging the public in discussions about the creation of synthetic data. As such, we are a long way from seeing synthetic data operationalised to the point where trusted research environments can produce it routinely and facilitate access to it at scale. There is also a lack of evidence to support decisions among data owners and data services about how the governance around this might be best implemented. Data owners and services need real-world use case studies on costs and benefits to inform more systematic approaches to creation and sharing of synthetic data.

To inform future practice, ESRC and ADR UK are opening a joint research call to fund individuals and teams to explore how the potential of synthetic data can be harnessed at scale. Recipients of these grants will evaluate the current uptake, utility and governance of synthetic versions of datasets held in SDEs, including the benefits, costs and challenges to researchers, data owners and the SDEs themselves. They will also support a qualitative study of public understanding of and attitudes to synthetic data. The results of these funded projects will


collaboratively inform a report and recommendations for how synthetic data production and provision can be achieved at scale and with the trust and support of stakeholders, including the public.

4 Discussion: Challenges and opportunities

The use of synthetic data provides an opportunity to reduce demand for SDE access, as analysis to complete projects within an SDE environment could be carried out more quickly. Our desire is that SDEs operate as efficiently as possible, and synthetic data, in our opinion, offers a way to improve that efficiency, in the following ways:

It can enable researchers to make much more accurate data access applications. A benefit of this is that researchers will have more certainty that the data they are interested in accessing will support their research. Synthetic data should reduce the number of researchers who apply to access data, are set-up by the SDE to access data, but realise the data cannot support their research after all.

Researchers ought to be able to construct a significant amount of their statistical programming code outside of the SDE; and only use the SDE to refine and run the code on real data. This means they spend less time logged into the SDE and less time using compute resources for iterative coding.

If the use of synthetic data did create more opportunities to train and engage researchers in accessing sensitive data within an SDE environment, improve the quality of applications to access the data held, and also improve the efficiency of how SDEs operate, this may all drive up the use of this data for research in the public good. The process of producing useful synthetic data requires time, skills and customisation although much of the process can also be automated3. There are further challenges to address, including:

 Deciding which organisation is best placed to produce the synthetic data. The data owning organisation, or the organisation running the SDE?

 Should Digital Object Identifiers and other techniques be adopted to monitor version control and use of the synthetic data?

 What training and guidance should be made available to ensure that researchers do not inadvertently try to publish statistical findings that have been drawn from the synthetic version of the data, instead of the real data?

 How do we engage the public in discussions about the creation of synthetic data?

5 Conclusions

Synthetic data provides opportunities to smooth the researcher journey to access sensitive data via an SDE and reduce the burden on data owners and SDEs supporting researchers requesting such access. However, few use cases exist in the literature that evaluate the benefits and costs to stakeholders (researchers, data owners and SDEs), which is hindering scaled production and routine use of it. Evidence of public understanding and positive acceptance is not clear. Other barriers as described in this paper are not insurmountable and could, in the long run, reduce costs for stakeholders if automated systems were put in place. The benefits of the use of synthetic data are becoming clearer as more research is funded using secure data, the complexity of new, linked datasets increases, computational power increases and data science skills become better recognised for research across disciplines. For access to secure data to keep up with demand, synthetic data is a strong enabler and an important consideration for progress.


References:

1. Ritchie, F. 2008. Secure access to confidential microdata: four years of the Virtual Microdata Laboratory. Economic and Labour Market Review, vol 2, No. 5.

2. ADR UK. Approaches to creating synthetic data: Workshop at IPDLN conference 2022.

3. Nowok, B., Raab, G.M., and Dibben, C., 2017. ‘Providing Bespoke Synthetic Data for the UK Longitudinal Studies and Other Sensitive Data with the Synthpop Package for R 1’. Statistical Journal of the IAOS 33/3: 785–796. DOI: 10.3233/SJI-150153

Respondent centric survey design and data collection – the Transformed Labour Force Survey - Colin Beavan-Seymour, Maria Tortoriello and Sabina Kastberg (Office for National Statistics, United Kingdom)


Respondent Centric Survey Design and Data Collection – Transformed Labour Force Survey

UNECE Expert Meeting 2023

Maria Tortoriello Principal Social Researcher

Colin Beavan-Seymour Principal Social Researcher

UNECE Expert Meeting 2023

Talk outline

Part 1 – Survey Design

• What is the purpose of the Transformed Labour Force Survey?

• Survey Design – sample, collection modes

• Return rates

Part 2 – Implementation of an Adaptive Survey Design

• Why use an Adaptive Survey Design?

• How was it developed?

• How was it implemented?

• Initial findings

UNECE Expert Meeting 2023

Part 1 – Survey Design

Colin Beavan-Seymour

UNECE Expert Meeting 2023

What is the Transformed Labour Force Survey?

• A new survey which will collect data on key labour market measures

• Developed with a respondent centric approach

• Qualitative and quantitative research

• Online first

• A rationalisation and redevelopment / rethink of how to measure core labour

market concepts

• Extensive qualitative research with members of the public, interviewers,

data users

UNECE Expert Meeting 2023

The journey so far…

Timeline, 2017 to 2022/23:

• Tests 1 & 2 – Online response rates; Engagement strategies

• Test 3 – Mixed mode (online & F2F); Statistical outcomes

• Test 4 – Online attrition test – response rates across 3 waves

• TLFS Beta – Online only in response to pandemic

• Addition of Telephone – Online & telephone collection

• Knock-to-nudge – Using an Adaptive Survey Design

UNECE Expert Meeting 2023

Sample Design

[Diagram: Transformed Labour Force Survey Wave 1 through TLFS Waves 2-5, with routes to the Opinions Survey and other social surveys; sample sizes of 140,000 households and 40,000 households shown.]

UNECE Expert Meeting 2023

What data did this give us?

• A return rate (complete returns & partials) of around 37.5% - a great start!

• However, we were still seeing similar biases in the responding sample that other voluntary surveys in the UK were

experiencing, despite the online mode and user-centric design:

• A large proportion of respondents were over 55, many over 65 – fewer respondents of working age, more economically

inactive

• A majority of respondents owned their homes, many without a mortgage or loan

• Respondents with a white ethnic background comprised the vast majority of the data, under-representation from

other ethnic backgrounds

• The vast majority of data was from the online mode – only a small percentage was from telephone collections

• The 2018 test indicated that interviewers visiting households can increase response from under-represented areas

• But… with a large scale survey of over 500,000 a year… how can we increase the quality of the data collection but

keep the cost of the operation down?

UNECE Expert Meeting 2023

Part 2 – Adaptive Survey Design

Maria Tortoriello

UNECE Expert Meeting 2023

What is an Adaptive Survey Design (ASD)?

In November 2022 we implemented an ASD for the TLFS.

• What is an ASD?

➢ Dividing a sample into smaller groups that have

similar characteristics (segmentation)

➢ Applying alternative survey design features for

different groups: • modes, materials, incentives

➢ Objective is to improve targeted survey outcomes • reduce bias, reduce costs

Why use an Adaptive Survey Design?

• TLFS data collection strategy same for all sampled addresses = no adaptive survey design

• Experiencing differential non-response bias which affects estimates

• Statistical processing enables weighting of sample to account for some

bias, but confidence in estimates would only improve with higher quality input data.

• Next step for TLFS was to introduce additional modes - Face to Face follow up

• One size does not fit all!

• ASD allows you to target the right respondents in the right way, rather than targeting all

respondents in the same way = more efficient use of field resources

UNECE Expert Meeting 2023

How was the Adaptive Survey Design developed?

• Closely followed work of Statistics Netherlands (Schouten, B et al.)

• A key objective of ASD is to divide the sample into strata in order to define targeted protocols for each of

the strata

• A logistic regression model was applied to historical TLFS data to identify auxiliary variables strongly

associated with response to formulate the ASD strata.

• Variables considered were Index of Multiple Deprivation (IMD), Urban/Rural Classification, Country of

Birth, Age & Ethnicity (limited by available data).

• Derived and examined CV, R-Indicators and Partial R-Indicators to identify the variables and categories

of variables driving variation in response propensities (a minimal computational sketch follows this list)

• Strongest predictors of response:

• Age (<45)

• Urban/Rural Classification (Urban)

• Index of Multiple Deprivation (IMD deciles 1-4)
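A minimal computational sketch of this step is given below, using invented data and a simple unweighted logistic model (scikit-learn is used here purely for illustration): response propensities are estimated from the auxiliary variables, and the R-indicator, the coefficient of variation and an unconditional partial R-indicator are derived from them in the manner of Schouten et al.

```python
# Minimal sketch, assuming an unweighted sample: fit a response-propensity
# model on auxiliary frame variables, then summarise representativity with
# the R-indicator R = 1 - 2*S(rho) and the coefficient of variation of the
# propensities. Column names and data are invented for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
frame = pd.DataFrame({
    "under_45": rng.integers(0, 2, n),
    "urban": rng.integers(0, 2, n),
    "imd_1_4": rng.integers(0, 2, n),
})
# Simulated response: younger, urban, more-deprived addresses respond less.
logit = 0.2 - 0.6 * frame["under_45"] - 0.4 * frame["urban"] - 0.5 * frame["imd_1_4"]
frame["responded"] = rng.random(n) < 1 / (1 + np.exp(-logit))

X = frame[["under_45", "urban", "imd_1_4"]]
model = LogisticRegression().fit(X, frame["responded"])
rho = model.predict_proba(X)[:, 1]            # estimated response propensities

r_indicator = 1 - 2 * rho.std()               # 1 = perfectly representative response
cv = rho.std() / rho.mean()                   # variation relative to the mean propensity
print(f"R-indicator: {r_indicator:.3f}, CV: {cv:.3f}")

# Unconditional partial R-indicator for one variable: the between-category
# variation of the propensities (larger = that variable drives more bias).
between = frame.assign(rho=rho).groupby("under_45")["rho"].mean()
shares = frame["under_45"].value_counts(normalize=True).sort_index()
partial = np.sqrt((shares * (between - rho.mean()) ** 2).sum())
print(f"Partial R-indicator (under_45): {partial:.3f}")
```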

UNECE Expert Meeting 2023

ASD: Iteration 1

➢ STRATA 1 = Urban, less deprived areas, 45+

➢ STRATA 2 = urban, more deprived areas, 16-44

➢ STRATA 3 = urban, less deprived areas, 16-44

➢ STRATA 4 = urban, more deprived areas, 45+

➢ STRATA 5 = non-urban, more deprived areas, 16-44

➢ STRATA 6 = non-urban, more deprived areas, 45+

➢ STRATA 7 = non-urban, less deprived areas, 16-44

➢ STRATA 8 = non-urban, less deprived areas, 45+

high priority strata

• Potential to include numerous interventions in the ASD (e.g. mode, incentive, materials..)

• Keeping it simple with 1 intervention = ‘Knock to Nudge’ (KtN) follow up

• ASD will target KtN data collection at under-represented strata based on response propensities in order to

reduce the variation in response propensities for a selected set of auxiliary variables.

• This will ensure that data collection resources are used in the most efficient way whilst increasing response

from historically underrepresented population groups (a stratum-assignment sketch follows below).
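The sketch below illustrates the stratification step: each sampled address is mapped to one of the eight strata listed above from the three binary auxiliary variables, and a knock-to-nudge follow-up flag is set for a chosen set of strata. The flagged strata and the example addresses are assumptions for illustration, not the operational priority list.

```python
# Illustrative sketch: assign each sampled address to one of the eight ASD
# strata from the three auxiliary variables, then flag strata chosen for
# knock-to-nudge follow-up. The flagged set is an assumption for the example.
import pandas as pd

def asd_stratum(under_45: bool, urban: bool, more_deprived: bool) -> int:
    """Map the three binary auxiliary variables to strata 1-8 as listed above."""
    if urban:
        if more_deprived:
            return 2 if under_45 else 4
        return 3 if under_45 else 1
    if more_deprived:
        return 5 if under_45 else 6
    return 7 if under_45 else 8

sample = pd.DataFrame({
    "address_id": [101, 102, 103, 104],
    "under_45": [True, False, True, False],
    "urban": [True, True, False, False],
    "more_deprived": [True, False, True, False],
})
sample["stratum"] = [
    asd_stratum(a, b, c)
    for a, b, c in zip(sample["under_45"], sample["urban"], sample["more_deprived"])
]
ktn_strata = {2, 3, 4, 5}                      # assumed under-represented strata
sample["ktn_follow_up"] = sample["stratum"].isin(ktn_strata)
print(sample)
```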

UNECE Expert Meeting 2023

ASD Optimisation approach

• We are following a structured ‘trial and error’ approach to optimising our

ASD.

• The optimum solution is unknown and experimental testing is needed

• Start with a simple design that can be accommodated using existing

systems

• Document, evaluate, learn, extend…

• Grow – add features to the ASD as technical and admin systems improve

over time

UNECE Expert Meeting 2023

Early results

• ASD Evaluation project - ongoing

• Operational evaluation – evaluating optimal set up of KtN

o Optimal number of visits = 2/3

o Best days to make contact: Monday, Tuesday, Sunday

o Best time of day to make contact: between 3pm-8pm

o KtN not working as well in London and North West regions

• Data quality evaluation

• Improving variability in response across strata

• Small improvements in representativity of data

o Statistically significant increase in response from 'hard to reach' groups

First ‘full’ knock-to-nudge month

Thank you for listening!

Any questions?

Contact details:

[email protected]

[email protected]

UNECE Expert Meeting 2023

Survey Research and Development Principles: 11 value statements that facilitate Respondent Centred Design - Laura Wilson (Office for National Statistics, United Kingdom)


Rethinking Data Collection

Survey Research and Development Principles: 11 value statements that facilitate Respondent Centred Design

Laura Wilson, Principal Researcher and data collection expert, UK Government Data Quality Hub ([email protected])

12 June 2023

UNECE Expert Meeting on Statistical Data Collection

Design Principles: what exactly are they and why are they needed?

UNECE Expert Meeting on Statistical Data Collection

Design principles…

• Value statements

• Help us to be Respondent Centred

• Define good design

• Provide clear and practical recommendations for all to follow

• Educational aid

• Support change, consistency and decision making

UNECE Expert Meeting on Statistical Data Collection

ONS’ Design Principles: 11 Survey Strategy Research and Development Principles

UNECE Expert Meeting on Statistical Data Collection

UNECE Expert Meeting on Statistical Data Collection

Be different when you need to be

Principle 1

Take an optimode and adaptive approach to design

Principle 2

Evidence informs decision making

Principle 3

Data users lead the way

Principle 4

Respondents have the answers

Principle 5

Everyone counts

Principle 6

Trust, roles and responsibilities

Principle 7

It’s our job to make things simpler

Principle 8

Follow, reuse, and refresh

Principle 9

Iterate, learn, and share

Principle 10

Think about the whole service and solve problems as a whole

Principle 11

https://analysisfunction.civilservice.gov.uk/policy-store/office-for-national-statistics-ons-survey-strategy-research-and-development-principles-ssrdp/

UNECE Expert Meeting on Statistical Data Collection

Be different when you need to be

Principle 1

UNECE Expert Meeting on Statistical Data Collection

Take an optimode and adaptive approach to design

Principle 2

UNECE Expert Meeting on Statistical Data Collection

Evidence informs decision making

Principle 3

UNECE Expert Meeting on Statistical Data Collection

Data users lead the way

Principle 4

UNECE Expert Meeting on Statistical Data Collection

Respondents have the answers

Principle 5

Respondent Centred Design Framework (RCDF): https://analysisfunction.civilservice.gov.uk/policy-store/a-user-centred-design-approach-to-surveys/

UNECE Expert Meeting on Statistical Data Collection

Everyone counts

Principle 6

UNECE Expert Meeting on Statistical Data Collection

Trust, roles and responsibilities

Principle 7

UNECE Expert Meeting on Statistical Data Collection

It’s our responsibility to make things simpler

Principle 8

UNECE Expert Meeting on Statistical Data Collection

Follow, reuse, and refresh

Principle 9

UNECE Expert Meeting on Statistical Data Collection

Iterate, learn, and share

Principle 10

UNECE Expert Meeting on Statistical Data Collection

Think about the whole service and solve problems as a whole

Principle 11

UNECE Expert Meeting on Statistical Data Collection


Thank you – questions? Laura Wilson

[email protected]

UNECE Expert Meeting on Statistical Data Collection

2023 abstract UNECE Expert Meeting on Statistical Data Collection 'Rethinking Data Collection' online (12 - 14 June 2023)

Title:

Survey Research and Development Principles: 11 value statements that facilitate Respondent Centred Design.

Speaker:

Laura Wilson

Abstract: To successfully achieve the paradigm shift to a state where respondents are central and integral to survey design, we first need to know the values that underpin that new state. This is where design principles step in: they are value statements that set the standards and ways of working for all to follow. They are used to support change, consistency and decision making within teams and across organisations.

Design principles foster a common understanding of what it takes to make a survey respondent centred and they define what good design looks like. Having clear and practical recommendations for research and development teams to follow means that they are more likely to design successful surveys. They can also be used as an educational tool with stakeholders and staff to share and help explain the ethos and future vision.

At ONS, we’ve created 11 Survey Research and Development Principles for the new ONS Survey Strategy. They are:

1. Be different when you need to be
2. Take an optimode and an adaptive approach to design
3. Evidence informs our decisions, not assumptions
4. Data users lead the way
5. Respondents have the answers
6. Everyone counts
7. Trust, roles and responsibilities
8. Achieving simplicity is on us
9. Follow, reuse and refresh
10. Iterate, learn and share
11. Think whole service and solve whole problems

These will be used by all teams creating surveys at ONS. During this talk I will step through them and share how they help to facilitate Respondent Centred Design.

___________________________________________________________

Paper:

The ONS Survey Strategy Research and Development Principles

Be different when you need to be

When we find something that works, for example, a letter template or a question pattern, we use it widely. We:

• follow harmonised standards to improve the quality and comparability of our data across government
• use consistency to build legitimacy and brand recognition
• use tried and tested products to improve our ways of working and help us all achieve our goals

But, we also allow ourselves to take a different approach when our evidence shows we need to. This prevents us from complicating the respondent journey, which could compromise user needs. We always aim for consistency and not uniformity.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• aligns with other initiatives across the organisation
• products should reduce complexity and burden that exists in the system

Take an optimode and an adaptive approach to design

We develop each product, for example a letter or questionnaire, in the best way for the mode, or modes, it is administered in. We also develop each product for its mode-specific users. This is known as “optimode” design. By designing in this way, we can help respondents give us the data we need by reducing respondent burden. It also creates products that are more user friendly to our internal users, including interviewers and call centre staff.

We tailor each product to the medium it uses and the specific needs of the users in that mode. This helps us get the data we need and improves data quality.

We use adaptive web design during development which allows the layout to adapt to the screen size appropriately. We design for mobile screens first, and then larger ones. That’s because it helps challenge us to think about the minimum content needed. We justify each piece of content being added, and refer to user needs, user stories and user journeys to do so.

ONS Survey Strategy Delivery Principles

This SSRDP links to the ONS Survey Strategy Delivery Principles that “products should reduce complexity and burden that exists in the system”.

Evidence informs our decisions, not assumptions

The designs of our surveys and their products are based on evidence. We:

• do not make assumptions about our users’ needs
• do not make a design decision if there is no evidence to support and inform it
• avoid assumption-led design as this will lead us to produce the wrong thing

Evidence and insights can be gathered from many sources. For example, we could complete some research with respondents or explore existing data to inform our next actions.

ONS Survey Strategy Delivery Principles

This SSRDP links to the ONS Survey Strategy Delivery Principles that “decisions are backed up by evidence”.

Data users lead the way

Our surveys meet the data users’ needs because our design journey starts with them. We invest time with our users to learn about their data intentions. This includes understanding how they intend to use and analyse the data.

We let our data users lead the way by providing the concepts to be investigated, but they do not design the content itself. We avoid getting data users to design the content because the designs will not be respondent centred.

Once we understand our users’ needs we use this information, alongside the respondent needs, to inform the design of the respondent centred survey products.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• clear ownership and accountability
• decisions at right level
• decisions are backed up by evidence

Respondents have the answers

We listen to our respondents and know what they need. This means we design the right thing. We do not make design decisions based on our assumptions, personal views, and biases. Instead, we carry out research to learn about respondent mental models and needs. We explore the cognition and usability of our household and business survey products through testing. We involve interviewers, call centre staff, survey processors and operational staff in the development of products to incorporate their needs and their insights on respondents. This could include insights about issues with an existing questionnaire, for example.

We always ensure we learn about what respondents need, rather than what they want. We use our analysis of respondent needs to develop assets such as respondent journeys and stories which inform the design of survey products.

We follow the Respondent Centred Design Framework to ensure we design based on needs to create respondent centred products.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• decisions are backed up by evidence
• decisions at right level
• products should reduce complexity and burden that exists in the system

Everyone counts

Our statistics reflect the experiences of everyone in our society. This means all our surveys are designed to be inclusive and compliant with accessibility legislation.

We think about all types of respondents from the start because we want everyone to be able to take part in our surveys. We increase response and representation in our data by removing barriers to interaction and participation created through exclusionary design. We follow harmonised standards to ensure our survey questions are inclusive and that we collect representative data.

Inclusive and accessible designs reduce burden for all respondents, not just those with additional needs or disabilities. For example, we aim to design each product to meet the average reading age of the UK. This makes our products easier to understand which improves the overall respondent experience. Inclusive and accessible designs improve the quality of our data and build trust in our statistics.

These ways of working also apply to products that are developed for internal ONS users that are part of running a successful survey. This includes interviewers, call centre staff, survey processors, and operational staff. This provides equal opportunity to our workforce by ensuring everyone can use our products.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• products should reduce complexity and burden that exists in the system
• aligns with other initiatives across the organisation
• contribution to survey and ONS strategic outcomes is clear

Trust, roles, and responsibilities

We trust and involve the right people at the right time at every phase of a project. We define clear roles and responsibilities, which helps us run a successful survey and achieve our goals. We are transparent about the design process with stakeholders and involve them in the development journey.

Everyone involved in a project clearly understands their purpose and expected contribution at every point of designing and developing a survey. They know where their role begins and ends, which helps ensure people with the right skills for the job are assigned to the right part of the design and development process. This allows the organisation to fully benefit from the investment made to employ and train these people who are experts in their roles. It also avoids products being influenced and designed by the wrong people at the wrong time, which can lead to the wrong thing being built.

For example, data users are responsible for providing their data needs and analytical requirements to the research and design teams. The research and design teams then fulfil their role in the process, which is to conduct the research to develop the appropriate designs to meet user needs. The roles are clear: the data users do not dictate the design of the questions, as the research and design teams are trusted and skilled to produce the right product to meet their needs.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

 clear ownership and accountability  decisions at right level

It's our responsibility to make things simpler

We have a responsibility to make our surveys easy to understand and use. We do the hard work to make our surveys simple, which removes that burden from our staff and respondents.

We prioritise the respondent experience because we know that without doing so, we risk not achieving our goals. We develop surveys that meet the needs of respondents and data users by investing time and resources into the early research, design, and testing phases of a project. We monitor respondent burden and use the insights to inform decision making.

We develop surveys that do not rely upon staff intervention and lengthy help and guidance to get the data we need. Instead, they are clear and highly usable on their own, without the need for much additional support or advice. Through good design we empower our respondents to take part in our surveys and provide us the data we need. We only add additional help where research shows that further support is needed.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• products should reduce complexity and burden that exists in the system
• aligns with other initiatives across the organisation
• contribution to survey and ONS strategic outcomes is clear

Follow, reuse, and refresh

We follow best practice and standards in the design and development of our surveys. This ensures our surveys are high quality, modern, and sustainable.

When we have found something that works, we make it reusable and shareable instead of starting from the beginning of the development process every time. Our harmonised standards are good examples of this. This approach ensures others can benefit from the investment made in developing that product. It also avoids duplication of effort and spending of public money on creating the same thing.

Sometimes we may need to take a different approach. It is important to remember that each survey is different and may need bespoke products or solutions. The decision to do something different and stray from best practice and standards is always based on evidence and respondent needs, not assumptions, personal views, or biases. For example, harmonised standards are used as the starting point, but they can be adapted to meet the needs of specific surveys.

We continuously refresh our knowledge and understanding of best practice and standards, which allows us to constantly add to our evidence base. We look to the research of others around the world to inform our work, but we keep in mind the importance of country context.

We refresh our surveys and carry out continuous improvement to our content to ensure they remain relevant.

We use administrative data or other data sources, where available, to reduce survey length, respondent burden and operational demands while improving processing and the quality of statistical outputs.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• aligns with other initiatives across the organisation
• products should reduce complexity and burden that exists in the system
• proposals can be sustainably resourced and financed
• decisions are backed up by evidence

Iterate, learn, and share

We take an agile approach to developing our surveys, which helps us avoid the risk of building the wrong thing and finding that out too late. We always test our survey products and processes with respondent users before releasing them. We iterate and refine them based on research insights and not assumptions, which ensures we develop something that meets the users’ needs.

We are transparent about what does not work and we abandon these things when our research shows they are not suitable. We then work to find an alternative solution. We always test our survey products along the full end-to-end respondent journey to ensure we are providing our respondents with highly usable and coherent products.

Sometimes we may find a problem with a questionnaire after the live phase. There are several ways we might find this out, for example through interviewer feedback, respondent feedback, or by looking at the amount of imputation needed. When this happens, we flexibly adapt and improve the questionnaire rather than needing to run big re-development projects.

We involve topic experts from inside and outside of ONS to support with:

• the design of our surveys and their products
• how our surveys are run

We share our insights and learnings widely with others internally and externally. We abandon what does not work in favour of finding something that does, and we remember that discovering something does not work is a valid insight. We share prototypes and progress widely to gather feedback from people with different areas of expertise to create better products for our users.

We recognise the importance of bringing our stakeholders on our development journey to ensure successful survey design. This is why we involve them at all stages of a project.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• decisions are backed up by evidence
• decisions at right level

Think about the whole service and solve problems as a whole

We see our surveys and the data they generate as an ecosystem. We are aware of the interdependencies of each phase in the data lifecycle and how our decisions at the beginning and throughout affect the quality of the data we collect and produce.

When a survey uses multiple modes for data collection, we do not view or develop each mode separately. Instead, we think about all aspects together and consider them as one so we can create successful surveys that meet respondent needs. When there is a problem, we think about how it affects the whole respondent journey and data lifecycle. We then work to fix the problem accordingly.

We use intelligence from paradata, processing and analysis of data collection to help the future design of our surveys. For example, we learn about where we are doing lots of imputation as this may mean that we need to review the questions and concepts for clarity.

We think holistically about the design and appearance of our products to maximise response and data quality. We ensure all products respondents interact with are consistent in tone and appearance. This helps build trust and a brand identity.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• products should reduce complexity & burden that exists in the system
• contribution to survey and ONS strategic outcomes is clear
• aligns with other initiatives across the organisation