
UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Disclosure control issues in complex medical data

Elizabeth Green, Felix Ritchie, Jim Smith, David Western, Paul White

University of the West of England

[email protected]

Abstract

The COVID-19 pandemic accelerated routine access to medical records for research. In the UK, platforms including OpenSAFELY and NHS Digital, alongside emerging hospital trust based Trusted Research Environments (TREs), demonstrate the utility of, and need for, medical researchers accessing and using microdata safely and securely. Whilst many apply traditional principles-based SDC standards to statistical outputs, complexity arises when considering complex medical data which is required to remain highly detailed; for example genome, medical imaging, or fMRI data, where the output often includes reference to individual observations. Current imaging libraries and databases have demonstrated awareness of, and need for, metadata standards, but consideration of both input and output protection is less clear. Given the need to retain observations with a high level of detail, this presentation discusses present considerations for potential SDC solutions and invites conversation from the wider community.


1 Introduction

The use of medical data for research purposes has clear public benefit and direct impact. Medical data is by nature highly detailed and specific to an individual: it is important to include a wide range of observations and background information to allow practitioners to make informed decisions and choices around treatment. Specific medical tests, such as genome analysis or an MRI scan, generate large volumes of data which are specific to the individual and are evaluated and examined as a whole entity: not just one particular fraction of an MRI scan is used; the whole scan is used and retained.

Historically, medical research has long been intertwined with the delivery and provision of care to patients; as such, research is conducted with direct informed consent and an expectation that the data will be used to further knowledge in the area. The medical data is of course highly detailed, and the number of observations used in a study can be low due to the rarity of a disease or because data collection is limited to particular hospitals or sites. As a result, research outputs can be highly detailed, with descriptive tables and survival curves often including singular observations.

In contrast, microdata used in social science is often not directly collected by the researcher (for example census data) so informed consent specific to the research is not obtained. When it comes to accessing and publishing data outputs, social science has established data repositories and access arrangements for research with clear standards for statistical disclosure control (SDC) within both shared datasets and research outputs.

The aim of this paper is: first, to outline some present examples of sharing of medical data and of medical data outputs; and second, to reflect on the disciplinary differences in disclosure control, and whether these are due to a lack of awareness or a lack of concern. We use three commonplace examples of shared data to illustrate some of the issues and the expectations of the public health world. Finally, we reflect on ways forward and where medical science may benefit from the experience of social scientists.

It should be noted that this paper is not intended to embarrass organisations or researchers: examples where potential disclosure and poor practice have been identified by the team are de-identified and described. The team has not directly referenced these examples, and we encourage the community to have an open conversation about how to integrate SDC standards when sharing data.


2 Medical examples

2.1 Genomic data

The devil is in the detail. A genome provides the complete set of all the genetic information in an organism. Genomic analysis (for example, microarray data) allows for the investigation of genes, and provides the necessary insights for developing cures, vaccines, and identification of new diseases and diagnostic tests. Whilst the sharing of individual genome data has facilitated remarkable breakthroughs in fields such as genetics and personalized medicine, it also raises significant privacy concerns.

The current practice of 'anonymisation' of genomic data is to remove direct identifiers (for example, name, patient ID) and indirect identifiers (hospital, postcode) (Bonomi, Huang and Ohno-Machado, 2020). However, other variables such as patient age, gender and prognosis are not redacted. Below is an example of 'anonymised' genome array data, available via a public website which does not require sign-in. The data is associated with a published research article; a condition of publication with the journal is that the raw data must be made available.

Data collection: The DRAGoN Hospital for Exhausted Researchers

Participant characteristics:

Participant number   Gender   Age   Prognosis
1                    Male     48    Bad - chronic insomnia
2                    Female   31    Good

Xlsx attachment with participant 1 microarray, participant 2 microarray, etc.

The main issue here is not only the level of detail presented in the participant characteristics list, but also the level of detail within the array/genome dataset: it is effectively the raw output of the individual's entire genetic array. Whilst research has advanced our understanding of the roles of different structural points, mutations, and markers, we are still in the process of identifying and discovering the roles of specific genetic markers. Therefore, when considering SDC we need to be aware that what is considered non-sensitive today may become sensitive in the near future (Ritchie and Smith, 2019; McKay et al., 2022).

For medical research it is difficult to define what information is disclosive and what is not. For example, it is possible to extract information about the individual such as eye colour, hair colour, hair texture (curly), baldness, physical traits etc from array data. Previous studies demonstrated the possibility of generating 3D face maps based on genomic data which could be used to reidentify individuals (Lippert et al. 2017, Crouch et al. 2018, Venkatesaramani and Vorobeychik, 2021). From a social science perspective we would be considering whether a form of input disclosure control could be employed; alternatively, could we safeguard who is accessing the data, and what might the consequences be if we did introduce such practices?

Input SDC on the sharing of genomic data is only one part of the puzzle; there are also disclosure issues in research outputs. As previously explained, the data is uploaded to a shared platform available for anyone to download, and this sharing is often a mandatory requirement of both funders and journals. Below (Figure 1) is an example of a published survival analysis which outlines the probability of survival over time for patients with a particular disease. With small numbers of observations it is easy to identify when individuals die at specific time points; accompanying the survival curve is a table detailing the change in numbers across time.


Figure 1 Example Kaplan-Meier curve with low numbers

Survival analysis is commonly used in medical research to demonstrate the relationship between diagnosis (or treatment) and death. Concern around disclosure relates to the number of observations between each step down in the curve, with detailed graphs often showing a step down involving fewer than three observations. O'Keefe et al. (2012) suggest smoothing and incorporating confidence intervals, while SDAP (2019) proposes checking to ensure thresholds are met within each step change. A simple form of such a check is sketched below.
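As a rough illustration of the threshold check described above, the sketch below flags step changes in the numbers-at-risk table accompanying a Kaplan-Meier curve that involve fewer than a chosen number of observations. The data, function name and threshold are illustrative and not taken from the cited guidance.

```python
# Minimal sketch: flag Kaplan-Meier step changes involving fewer than `threshold` observations.
def risky_steps(numbers_at_risk, threshold=3):
    """Return (time_index, drop) pairs where fewer than `threshold` observations leave the risk set."""
    flagged = []
    for i in range(1, len(numbers_at_risk)):
        drop = numbers_at_risk[i - 1] - numbers_at_risk[i]
        if 0 < drop < threshold:
            flagged.append((i, drop))
    return flagged

# Example: numbers at risk at successive time points (illustrative values).
at_risk = [20, 18, 15, 14, 10, 9]
print(risky_steps(at_risk))  # [(1, 2), (3, 1), (5, 1)] -> step changes needing suppression or smoothing
```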

Interestingly, a tool which specifically generates Kaplan-Meier plots for genomic research is widely used within the medical community: https://kmplot.com/ (Gyorffy, 2023). This open-access, free-to-use website allows researchers to perform survival analysis on different gene expressions from a database of over 30,000 samples. The user can select the cancer subtype they wish to research and then the level of analysis (see Figure 4). By default the website is set to censor the plot at a threshold, but the user is able to turn this function off.

Figure 2 Demonstration of a confidentialised output taken from O’Keefe et al. (2012) p134

Figure 3 Guidance for SDC in Kaplan-Meier graphs (Welpton, 2019)


While this is an extremely useful resource for researchers, it is also of potential concern. It seems likely that very small subsets of the data could be selected and associated with personal characteristics; these would not produce meaningful graphs, but they could be used to challenge the anonymisation of the data.

Figure 4 The Kaplan-Meier plotter website (kmplot.com)

2.2 Inappropriate use of medical dermal images

In dermatology, photographic capture of clinical findings is routine, with digital images providing support and awareness in practice, research, training, and education. One publicly available tool is the DermAtlas (available at http://www.dermatlas.net/reference/index.cfm), which stores a wide array of clinical images demonstrating the presentation of different dermatological conditions. Anyone can access this tool and explore the wide range of photos it holds. In terms of impact, this tool can aid health professionals in identifying and evaluating their own patients; it can also be used by the general public to help them feel empowered or to understand their own conditions and potential diagnoses.

As the skin is the largest organ of the body, some dermatological conditions are localised to intimate areas; this, coupled with an array of clinical photos spanning the age range, made the dermatology archive susceptible to misuse. Lehmann, Cohen and Kim (2006) described the discovery, ongoing detection, and management of misuse of DermAtlas content over a period of four years. A shocking 14.3% of all referrals originated from pornography/fetish sites (Lehmann, Cohen and Kim, 2006).

This leads to concerns about how to safely share medical information from what is undoubtedly a valuable medical resource. In response, any referral from a pornography/fetish site resulted in the user being presented with a denial page (Lehmann, Cohen and Kim, 2006). DermAtlas also implemented filters based on user query patterns, with IP addresses making frequent queries for genital images being restricted. Restricted IP addresses were still able to use DermAtlas, but were presented with thumbnail-sized images and were unable to retrieve full images of genital sites. However, this approach was not straightforward: for example, the NHS in the UK and US military services were inappropriately restricted. A simplified sketch of these controls is given below.
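The following sketch illustrates the general shape of the two controls described above (a denial page for blocked referrers, and thumbnail-only access for restricted IP addresses). It is a hypothetical illustration, not DermAtlas's actual code; the domains and addresses are invented.

```python
# Minimal sketch of referrer- and IP-based access controls of the kind described above.
BLOCKED_REFERRER_DOMAINS = {"example-fetish-site.test"}  # hypothetical referrer blocklist
RESTRICTED_IPS = {"203.0.113.7"}                         # hypothetical IPs flagged for frequent sensitive queries

def serve_image(referrer_domain: str, client_ip: str) -> str:
    if referrer_domain in BLOCKED_REFERRER_DOMAINS:
        return "denial_page"        # referral from a blocked site: show a denial page
    if client_ip in RESTRICTED_IPS:
        return "thumbnail_only"     # flagged address: withhold full-size images of sensitive sites
    return "full_image"

# Note the false-positive risk mentioned above: legitimate clinical users sharing a flagged
# address (e.g. a hospital network) would also be restricted.
print(serve_image("example-fetish-site.test", "198.51.100.1"))  # denial_page
print(serve_image("a-hospital.test", "203.0.113.7"))            # thumbnail_only
print(serve_image("a-hospital.test", "198.51.100.1"))           # full_image
```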

DermAtlas presents an interesting example of the complexities of hosting data in the public domain for a universal audience. The benefits of the tool for both the public and health professionals are clear, but the tool is also being used for purposes not intended by its designers. When considering potential solutions for de-identification or anonymisation of medical photos, current practice in social research where direct informed consent has not been obtained (such as when photographing a busy city) is often to use object and face detection software to automatically mask individuals (Fitwi et al., 2021). For clinical dermatological case photos, the current simplistic approach to redaction is to mask the eyes and mouth, but for many case photos there is no form of redaction, and sometimes it is not possible to redact the eyes and mouth. We therefore assume, as is common practice within medical research, that the emphasis is on obtaining direct informed consent, and that the patient consents to the data being held within the public domain. However, can such consent be truly informed when unanticipated uses are made of the data? DermAtlas, and other similar tools, face an impossible triad: how can we retain detailed photographs, provide an open-access tool, and ensure no misuse?

2.3 fMRI scans

Functional magnetic resonance imaging, or functional MRI (fMRI), provides a highly detailed image of the blood flow and structure of a body part. These scans are used to assist in the treatment of the patient (diagnosis) but also in medical research. Due to the large volume of highly detailed data produced by these scans, sharing this information has proven invaluable for medical research. Current examples of sharing fMRI data include the Brain Imaging Data Structure (BIDS) website, https://bids.neuroimaging.io/, where users can contribute, access, and download de-identified fMRI data.

In terms of input disclosure control, BIDS requires contributors to remove all direct identifiers and to 'deface' the scan images (which can be achieved using a module such as https://raamana.github.io/visualqc/gallery_defacing.html). Interestingly, facial reconstruction based on detailed medical scans (such as CT or fMRI) has been achieved: Schwarz et al. (2019) found that face-recognition software achieved a re-identification rate of 83% (70 of 84 participants) when comparing participants' MRI scans to photographs.

BIDS ensures that the data entering the service is de-identified by providing excellent support to depositors, ensuring that uploaded data is stripped of direct identifiers and defaced. However, uploading data and publishing in tandem is common practice, so the sticky issue of secondary disclosure is more apparent in this example. To highlight this, a recently published journal article states that the data used in the publication has been deposited in BIDS, yet within the article the participants' demographic characteristics are highly detailed, with low numbers in particular cells and distinctive characteristics. If the identity of the depositor is known, this increases the chance of knowing where the sample comes from (i.e. which hospital or patient group), dramatically increasing the chances of re-identification. Finally, with more researchers using data repositories such as BIDS to deposit datasets used in publications, information already in the public domain about the dataset may be crucial for re-identification, but it is not necessarily considered by the individual depositor. The problem here is not on the repository input side, but a lack of statistical disclosure control awareness among authors, demonstrating the need for training and standards in the medical community. A simple check that authors could run on such tables is sketched below.
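As an illustration of the kind of check an author could run before publishing a participant characteristics table alongside a deposited dataset, the sketch below flags cross-tabulation cells that fall below a minimum count threshold. The data, column names and threshold are illustrative only.

```python
# Minimal sketch: flag small cells in a participant characteristics cross-tabulation.
import pandas as pd

# Illustrative demographic table (not from any real study).
demographics = pd.DataFrame({
    "sex":      ["F", "F", "F", "F", "M", "M", "M", "M"],
    "age_band": ["30-39", "30-39", "30-39", "40-49", "30-39", "40-49", "40-49", "40-49"],
})

table = pd.crosstab(demographics["sex"], demographics["age_band"])
threshold = 3

cells = table.stack()                       # one row per (sex, age_band) cell
small_cells = cells[cells < threshold]      # cells below the chosen threshold
print(table)
print("Cells needing suppression or aggregation:")
print(small_cells)
```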

3 Discussion

We are not stating that the above examples are necessarily disclosive or provide direct identification: a number of steps would be required to re-identify an individual, and the value to an intruder would be questionable. For example, it is safe to assume that social media profile pictures in the public domain are not going to be viable for identification or reconstruction of an fMRI scan. Venkatesaramani and Vorobeychik (2021) found that the overall effectiveness of re-identification (when using social media photos) was substantially lower than previously suggested, as the literature often uses high-quality data (both genomic and photographic) which is not consistent with real-life scenarios. Conceptualisation of what constitutes a reasonable threat is beyond the scope of this paper.


Nevertheless, the three examples have highlighted a number of issues and challenges for disclosure control, on both the input and output side, as well as in how data is shared. Many of these challenges are unique to the data, and traditional methods used for disclosure control in social research may be inappropriate. There are also some very unexpected factors, for example the actual uses to which DermAtlas has been put. Going forward, what mitigations and recommendations might social scientists offer the medical community?

On microdata access we must always accept a level of risk, and risk needs to be conceptualised in terms of its realism (i.e. what is the true likelihood of an intruder doing this for nefarious gain, and can we ever meaningfully and, more importantly, reliably measure this risk?). It is also essential that, whilst discussing risk, we also discuss benefit: we are all too familiar with the invaluable findings and applications of health research, and to potentially halt or delay such findings is a harm in itself. So, whilst we highlight areas of weakness and vulnerability, we must objectively generate new paths going forward.

Our primary concern is the lack of standards, guidance and continuity: practice is not being checked, reviewed or updated against current practices known within the SDC community (for example, thresholds). Perhaps this demonstrates a lack of training and awareness around SDC, as the examples do show attempts at de-identification. This could also be an area in which re-identification back to the individual is important, for example if the research generates incidental findings on an individual and it is necessary for them to receive an intervention. Consent for data to be shared is often obtained directly, with individuals being more inclined to trust the research and a presumption that they have "agreed to use this for research and we said we would anonymise it...".

What about outputs? Sharing the data seems to happen in tandem with the outputs, so output SDC is not as relevant; but is this an output or an input issue? Should we consider the attached journal participant characteristics tables as secondary disclosure, or is this an example of input disclosure? What is clear, however, is a desire to de-identify and a concern around ethics and consent in the medical community. Derrick et al. (2022) highlight that training in output SDC is mostly limited to TRE users, and much medical research on very sensitive data is not traditionally held in TREs (compared to social science), so moving forward training appears to be low-hanging fruit in supporting disclosure control in this area.

4 Future considerations

Identification of problem/risk: at first glance this appears poor practice when compared to standards in social science, but is it a genuine risk? How do we balance genuine risk vs perceived risk vs utility of the data?

Training: what is done and to what level? (Social scientists are again not great here, but perhaps have experience and conceptual understandings of thresholds, rounding, etc.)

Standards: what is done and to what level? Can we support a harmonised approach?

Is open sharing good? In social science the move has been to open access, not open data, i.e. anyone with a genuine reason has access to the data but not everyone gets access. We need to review data sharing models and also pressures from funders and journals.

We especially welcome views from the medical research community: [email protected]


5 References

Bonomi, L., Huang, Y., & Ohno-Machado, L. (2020). Privacy challenges and research opportunities for genomic data sharing. Nature genetics, 52(7), 646-654.

Crouch, D. J., Winney, B., Koppen, W. P., Christmas, W. J., Hutnik, K., Day, T., ... & Bodmer, W. F. (2018). Genetics of the human face: Identification of large-effect single gene variants. Proceedings of the National Academy of Sciences, 115(4), E676-E685.

Derrick, B., Green, E., Ritchie, F., & White, P. (2022, September). The Risk of Disclosure When Reporting Commonly Used Univariate Statistics. In International Conference on Privacy in Statistical Databases (pp. 119- 129). Cham: Springer International Publishing.

Fitwi, A., Chen, Y., Zhu, S., Blasch, E., & Chen, G. (2021). Privacy-preserving surveillance as an edge service based on lightweight video protection schemes using face de-identification and window masking. Electronics, 10(3), 236.

Gyorffy B: Discovery and ranking of the most robust prognostic biomarkers in serous ovarian cancer, Geroscience, 2023, doi: 10.1007/s11357-023-00742-4.

Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., ... & Craig, D. W. (2008). Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS genetics, 4(8), e1000167.

Lehmann, C. U., Cohen, B. A., & Kim, G. R. (2006). Detection and management of pornography-seeking in an online clinical dermatology atlas. Journal of the American Academy of Dermatology, 54(4), 633-637.

Lippert, C., Sabatini, R., Maher, M. C., Kang, E. Y., Lee, S., Arikan, O., ... & Venter, J. C. (2017). Identification of individuals by trait prediction using whole-genome sequencing data. Proceedings of the National Academy of Sciences, 114(38), 10166-10171.

McKay, F., Williams, B. J., Prestwich, G., Bansal, D., Hallowell, N., & Treanor, D. (2022). The ethical challenges of artificial intelligence‐driven digital pathology. The Journal of Pathology: Clinical Research, 8(3), 209-216.

O'Keefe, C. M., Sparks, R. S., McAullay, D., & Loong, B. (2012). Confidentialising survival analysis output in a remote data access system. Journal of Privacy and Confidentiality, 4(1).

Schwarz CG, Kremers WK, Therneau TM, et al. (2019) Identification of anonymous MRI research participants with face-recognition software. N Engl J Med; 381:1684-6.

Venkatesaramani, R., Malin, B. A., & Vorobeychik, Y. (2021). Re-identification of individuals in genomic datasets using public face images. Science advances, 7(47), eabg3296.

Welpton, Richard (2019). SDC Handbook. figshare. Book. https://doi.org/10.6084/m9.figshare.9958520.v1

Disclosure control in complex medical outputs

Elizabeth Green, Felix Ritchie, Jim Smith, David Western, Paul White

University of the West of England

Overview

➢Medical research and positioning

➢ Current practices

➢ Alignment with TRE standards

➢Examples

➢ Solutions?

➢ Future considerations

Medical research

Traditionally:

∙ Revolves around direct informed consent and primary data collection

∙ Some bad examples of mismanagement of data and ethics... Henrietta Lacks, for example

∙ Easy to see direct benefit/ public good

Sharing medical data and disclosure control

∙ Varied practices from depositing raw data in the public domain to secure access

∙ Tools are being developed- uptake poor

∙ Unforeseen consequences encountered

∙ Benefit to society

Genomic Data

∙ Tissue sample extracted, analysis is then conducted

∙ Microarray data of specimen often deposited/ shared in the public domain (remember knowledge advancing continually)

∙ Descriptive variables provided

Statistical Disclosure Control?

∙ Basic SDC principles- counts and thresholds

∙ Kaplan-Meier curves often result in low numbers, potential for low number of observations between each step.

∙ Not directly disclosive alone; requires linking to contextual information (provided in the report).

∙ Recommendations exist: O'Keefe et al. (2012) suggest smoothing and adding CIs; SDAP (2019) requires thresholds to be met.

∙ Online tools- generating a Kaplan-Meier curve with SDC inbuilt

Dermatology photographs

• Online photo repository providing examples of different skin disorders.

• Used as an aid to help experts and public identify different ailments.

• Huge public benefit

• However unintended consequence was found...

Statistical Disclosure Control? Solutions?

fMRI scan

• Huge amount of data
• Brain anatomy and structure (remember, knowledge is advancing continually)
• Variation in sharing: online repositories, open access
• Concern for disclosure is based on rebuilding the face from the structure

Digital Facial Reconstruction Sorbonne University

FMRI solutions?

Scalp the face! Remove or roughen the facial attributes
- Potential loss of data
- Disclosure elsewhere

Statistical Disclosure Control?

∙ SDC training and principles for medical data users!

∙ Unique attributes, highly detailed information, informed consent, understanding risk

∙ Use and implementation of tools and solutions

∙ Universal agreement on standards
∙ What is anonymised data?

∙ Evaluation of data access

∙ Development of a network to support this work?

Thank you! Elizabeth Green [email protected]

This work is funded by UK Research and Innovation [Grant Number MC_PC_23006] as part of Phase 1 of the DARE UK (Data and Analytics Research Environments UK) programme, delivered in partnership with Health Data Research UK (HDR UK) and Administrative Data Research UK (ADR UK).



UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26–28 September 2023, Wiesbaden

Experiments on Federated Data Synthesis

Claire Little, Mark Elliot, Richard Allmendinger (University of Manchester, UK)

[email protected]

Abstract

Federated Learning (FL) is a decentralized approach to statistical model training in which training is performed across multiple clients to produce a global model. This approach can be used where multiple sites have data but do not have enough data to generate the required statistical power and cannot for legal, commercial or ethical reasons share their data. One paradigm case is randomized control trials for rare diseases. With FL, training data stays with each local client and is not shared or exchanged with other clients, so the use of FL can reduce privacy and security risks (compared to methods that pool multiple data sources) while addressing data access and heterogeneity problems. This study explores the feasibility of using FL to generate synthetic microdata, allowing multiple organizations to contribute to the construction of combined synthetic datasets (possibly for wider release) without the need to share or distribute their own data. The primary issue is whether it is possible in principle to produce good enough quality synthetic data, and the study here focuses on this as a proof of concept before going on to discuss the issue of risk measurement. The results show that the approach is feasible and, crucially, in the main experiment the synthetic datasets better represented the full population than random samples of that population do. However, the experiments are on toy datasets and the next step is to expand the dataset size.

1 Introduction

To enable the safe release of data, Statistical Disclosure Control (SDC) methods (Hundepool et al., 2012) can be applied to remove or alter disclosive information. Data synthesis (Rubin, 1993; Little, 1993) is an alternative to SDC which uses models of the original dataset to generate artificial data with the same structure and statistical properties as the original but (in the case of full synthesis) not containing any of the original data. In this study, we explore the feasibility of federated synthesis, allowing multiple organizations to contribute to the construction of combined synthetic datasets (possibly for wider release) without the need to share or distribute their own data. The primary issue is whether it is possible in principle to produce good enough quality synthetic data, and the study here focuses on this as a proof of concept before going on to discuss the issue of risk measurement. The next section presents background information on data synthesis and federated learning, Section 3 outlines the methodology, Section 4 provides the results of our experiments, Section 5 discusses the results and their implications, and final thoughts and ideas for future work can be found in Section 6.

2 Background

2.1 Data Synthesis

Data Synthesis (Rubin, 1993; Little, 1993) is an alternative to SDC and uses models built using the original dataset to generate artificial data with the same structure and statistical properties as the original but (in the case of full synthesis) not containing any of the original data. Synthetic data may be used where access to the original data is not possible or restricted due to privacy constraints. For example, the approval process to acquire access to safeguarded data can be lengthy, potentially delaying research; in these situations synthetic data can allow researchers to test code or plan analysis whilst awaiting access. Synthetic data may also be used to augment (add more records to) existing datasets. There is an increasing number of techniques to generate synthetic data, including statistical methods (such as Nowok et al. (2016); Zhang et al. (2017)), and deep learning (DL) methods based on neural networks (NN) such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), variational autoencoders (VAE) (Kingma and Welling, 2014), large language models (Radford et al., 2019), diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020), and genetic algorithms (GAs) (Chen et al., 2017, 2018).

2.2 Federated Learning

Federated Learning (FL) (McMahan et al., 2017) is a method that allows multiple clients (or devices) to collaboratively build a shared model without the clients transmitting or exchanging their raw data. In the context of synthetic data, this could allow multiple clients (organisations, users, etc.) to produce a shared synthetic dataset, without the need to share their own individual private data thereby minimising disclosure risk. It could allow the linkage of datasets that would otherwise be unlikely to be linked in the traditional sense, thereby producing opportunities to access unique synthetic data that is potentially more diverse, and richer, than each participant's synthetic dataset alone. This paper explores the feasibility of using FL together with a GA to produce a combined synthetic dataset, which as far as we are aware has not been attempted so far. The early focus of FL was its use on mobile and edge devices (e.g. Bonawitz et al. (2016); Konecny et al. (2016)), where an FL model could have many massively distributed clients, each with potentially different computational capabilities, limited communication and unbalanced data. An example of its usage is Google's Gboard (keyboard) application, which trains a model on each mobile device (when it is idle) using the local data and then sends only model updates (parameters) to the server; this allows it to predict the next word when typing, suggest emojis and discover new words (McMahan and Thakurta, 2022).

As described by Kairouz et al. (2021), interest has increased in the use of FL for other (non-mobile) applications, such as allowing cross-organisational collaboration to train models. For example, in healthcare, sensitive data is difficult to access and tightly regulated, making sharing/pooling data between institutions prohibitive; FL can allow the creation of more robust models, trained on a larger and more diverse pool of data than a single institution could provide, without the need to exchange or centralise sensitive medical data (Rieke et al., 2020; Kumar and Singla, 2021).

FL has generally been used to produce shared models (such as predictive models) collectively trained on each client's data. A central server controls the process but does not access any of the client data. NN-based methods are typically used, where each client receives the current model weights from the central server, trains the model on their own data and then sends the model weights (or parameters) back to the server. All the clients' weights are then aggregated (typically using the FedAvg, or Federated Averaging, algorithm (McMahan et al., 2017)) by the server, which updates the global shared model; a minimal sketch of this aggregation step is given at the end of this section. The model is then sent back to the clients and the process continues until some stopping condition is met.

There is a small body of research into the use of FL to generate synthetic data. We use microdata for this study and therefore focus on methods designed for tabular data (i.e. structured data comprising rows and columns containing mixed-type features, such as categorical and numerical). Duan et al. (2023), Fang et al. (2022) and Zhao et al. (2021) use GAN-based methods to generate synthetic data, with a GAN training on each client and each sending the model weights to the server to aggregate, etc. (each client generates the final synthetic data individually using the shared model). Weldon et al. (2021) use a GAN on the clients and on the server, but differ in that the server GAN generates the final synthetic dataset. Lomurno et al. (2023) present a different method, using VAEs, with each client training a data generator locally. The clients send their models (generators) to the server, but they are not aggregated or combined (as is typical in FL); in the final phase, each client can access the set of generators (from all clients) stored on the server and use some or all of these to generate their own synthetic data. Here we use a GA to generate synthetic data on the server, which is then sent to the clients, who each calculate the fitness (utility) score and send it back to the server, where all client scores are combined and used to create the next generation of synthetic datasets. Qu et al. (2020) also generate synthetic data on the server, which is sent to the clients to evaluate, but this employs a GAN-based method, uses image data and focusses on the use-case where clients are temporary (i.e. they may not be available for the whole process).
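To make the FedAvg aggregation step described above concrete, the following is a minimal sketch of weighted averaging of client parameters; the parameter vectors and client sizes are illustrative, and real FL frameworks handle much more (model state, communication, secure aggregation, etc.).

```python
# Minimal sketch of FedAvg-style aggregation: average client parameter vectors,
# weighted by each client's number of records.
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of per-client parameter vectors."""
    total = sum(client_sizes)
    stacked = np.stack(client_weights)                      # shape: (n_clients, n_params)
    weights = np.array(client_sizes, dtype=float) / total   # proportion of data held by each client
    return (stacked * weights[:, None]).sum(axis=0)         # aggregated global parameter vector

# Illustrative example: two clients with different amounts of data.
global_params = fed_avg(
    client_weights=[np.array([0.2, 1.0, -0.5]), np.array([0.4, 0.8, -0.1])],
    client_sizes=[100, 300],
)
print(global_params)  # pulled closer to the larger client's parameters
```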

2.3 Study Aims

In this study, our objective is to assess the feasibility of using a federated learning to generate a combined synthetic dataset. The research questions are as follows.

RQ1: Can a federated synthesis model reproduce the joint distribution of combined distributed datasets?

RQ1.1: What information does the server need to be able to reproduce that joint distribution?

RQ2: Is the utility of a synthesised combined dataset at least as good as that of the samples held by each client?

3 Methodology

The study is a simulation of a server and two clients. The basic simulation scenario is that the server generates synthetic data, which is then sent to the clients, who each calculate the similarity of the synthetic data to the sample that they hold and then send those similarity scores back to the server, where all client scores are combined and used to create the next generation of synthetic datasets. Our machine learning model of choice is the genetic algorithm. In Section 3.1 we describe GAs and motivate this choice, then in Section 3.2 we describe the data that we use and how it was set up for the simulation.


Table 1. Simple binary original dataset with ten rows, sampled from UK 1991 Census data, which was split into two five-row datasets, for clients A and B.

AGE  MSTATUS  SEX  LTILL  TENURE  client
1    2        2    2      2       A
1    1        1    2      2       A
1    1        2    2      2       A
2    2        2    2      1       A
1    1        1    2      1       A
2    2        2    2      1       B
1    2        1    2      1       B
1    1        2    2      1       B
1    1        2    1      2       B
1    1        1    2      1       B

3.1 Genetic Algorithms

Genetic Algorithms (GAs) (Holland, 1992) perform iterative optimisation. There are three main (biologically inspired) operators: selection (parental and environmental), crossover, and mutation. Broadly speaking, an initial population of candidate solutions is specified (in this case, a candidate solution is a synthetic dataset), and the fitness (the utility) of the candidates is calculated. The parental selection operator is used to select candidates (parents) to reproduce for a new population, with fitter candidates more likely to be selected. A crossover operator combines some of the parents (there are a variety of methods for this) to produce new candidate solutions (children). A mutation operator then mutates some of the candidates (i.e. randomly changes some of the features). The children or a combination of children and parents form the population of the next generation (this step is called environmental selection). This process is repeated multiple times (generations), using the fitness to guide it, with ideally fitter solutions produced with each generation. Commonly, the process terminates when a specified number of generations has been produced or a particular fitness level has been reached. GAs are flexible in that there are many parameters that can be changed or set, and the fitness function can be designed for the specific purpose. Work by Chen et al. (2017, 2018) has shown the feasibility of using GAs to generate synthetic microdata, and demonstrated the viability of using risk and utility as conflicting objectives (Chen et al., 2019). More recently, Thogarchety and Das (2023) used a GA approach to produce synthetic data to augment class imbalanced datasets and Liu et al. (2023) presented a GA method that generates synthetic data capable of approximating a range of statistical queries.
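To make the GA loop described above concrete, the sketch below evolves candidate binary synthetic datasets using binary tournament selection, mutation and elitism. It is a simplified illustration rather than the authors' implementation: the fitness function is left to the caller, and mutation here simply flips a binary value rather than drawing replacements from client-supplied univariate distributions as in the method described later.

```python
# Minimal GA sketch: evolve candidate binary synthetic datasets towards higher fitness.
import numpy as np

rng = np.random.default_rng(0)

def evolve(fitness, n_rows=10, n_cols=5, pop_size=50, generations=150, mutation_rate=0.05):
    # Initial population: pop_size candidate datasets with cell values in {1, 2}.
    population = rng.integers(1, 3, size=(pop_size, n_rows, n_cols))
    for _ in range(generations):
        scores = np.array([fitness(c) for c in population])
        children = []
        for _ in range(pop_size):
            # Binary tournament selection: the better of two random candidates becomes a parent.
            i, j = rng.integers(0, pop_size, size=2)
            parent = population[i] if scores[i] >= scores[j] else population[j]
            child = parent.copy()
            # Mutation: flip each cell between 1 and 2 with probability mutation_rate.
            mask = rng.random(child.shape) < mutation_rate
            child[mask] = 3 - child[mask]
            children.append(child)
        children = np.array(children)
        child_scores = np.array([fitness(c) for c in children])
        # Elitism: keep the best pop_size candidates from parents and children combined.
        combined = np.concatenate([population, children])
        combined_scores = np.concatenate([scores, child_scores])
        best = np.argsort(combined_scores)[::-1][:pop_size]
        population = combined[best]
    return population[0]  # population is ordered best-first after the final elitism step

# Example usage with a toy fitness: the fraction of cells equal to 1.
best = evolve(fitness=lambda cand: (cand == 1).mean())
print(best)
```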

3.2 Data

A (very) small binary dataset was used, which was randomly split into two datasets (of equal sizes) to represent two clients named client A and client B. The UK 1991 Census (University of Manchester and ONS, 2013) microdata was used, with 10 rows randomly sampled (from the same geographic area). Table 1 displays the data, with five variables (respectively: age, marital status, sex, long-term illness, and housing tenure) which were all converted to binary (using values of 1 and 2). This is called the original data set. It was randomly split into two five-row datasets, one representing client A and one client B, these are identified in Table 1.


Table 2. Parameters that were fixed in the experiments.

Parameter                                  Type        Value chosen          Further details
No. of clients                             Simulation  VARIES                -
Initial metadata sent by clients           Simulation  Univariates           -
Combination of client scores               Simulation  VARIES                -
No. of objectives for GA                   Simulation  1                     Similarity (utility)
SDC applied to the output sent to server   Simulation  None                  -
Output passed to client by server          Simulation  VARIES                -
Population size                            Model       50                    -
Parental selection                         Model       Binary tournament     k=2
Mutation rate                              Model       0.05                  -
Crossover operator                         Model       None                  -
Environmental selection                    Model       Elitism               -
No. of generations                         Experiment  150                   -
Choice of dataset                          Experiment  UK Census microdata   1991
No. of rows (per client)                   Experiment  VARIES                -
No. of variables                           Experiment  5                     -
Type of variables                          Experiment  Binary                -
No. of runs                                Experiment  5                     -

Table 3. Parameters varied by experiment.

Parameter                            Experiment 1                Experiment 2                Experiment 3
No. of clients                       1                           2                           2
Combination of client scores         N/A                         None                        Mean
Output passed to client by server    Synthetic clients dataset   Synthetic clients dataset   Synthetic combined dataset
No. of rows (per client)             10                          5                           5

3.3 Method and Parameters

The potential range of variation in the simulation is huge. There are three types of parameters that could be varied in the study design:

Model parameters: changeable settings for the GA (e.g., mutation rate).

Simulation parameters: variations in the scenario being presented (e.g., number of clients).

Experimental parameters: elements of the study design that are not part of the simulation itself (e.g., number of runs, data choices).

A set of these is shown in Table 2. For the proof-of-concept experiments we have chosen one value for most of these parameters; a much simpler set than might be used in practice. As well as using a very small sample (of real data), we kept the model complexity low. This simplicity assists us with the interpretation of the results. We have varied four of the parameters across three experiments; these are shown in Table 3. The first two experiments are used to establish a baseline. In experiment 1 we have just a single client; in effect, this tests whether a GA can reproduce the original data when unencumbered by the distributed data. In experiment 2 we split the data across two clients, but the server has a separate interaction with each client and is then deemed to combine the data at the end. This is in effect a minor variation on experiment 1. Experiment 3 is the main experiment, and we now describe what is simulated in more detail.


The experiment 3 simulation is represented graphically in Figure 1. The GA runs on the central server, and at the start of the process (labeled Initialisation, in the figure) each of the clients sends metadata about their individual data to the server. At the most basic, the server would need to know the variable names and the size (how many records) of the data. It is expected that the clients will agree in advance on the variables to be included. For this experiment, the clients send the univariate distributions (this information is used by the GA to mutate the data) and the number of records in each dataset.

Figure 1. An illustration of the federated synthesis simulation used for Experiment 3, with a server and two clients.

The server then combines the distributions of each client by taking the average to calculate a combined distribution. An initial population of synthetic datasets (candidates) is generated; these are drawn from the uniform distributions of the five variables.

The datasets in the initial population have the same number of records as the combined client data would have, and the same variables. For this simple model, only one objective is assessed by the clients: the similarity between the multivariate distribution of the client's data and that of each of the synthetic datasets passed by the server to the client. The clients then send those similarity scores back to the server. In detail, the similarity measure calculates the proportion of every combination of values in the synthetic (candidate) and client data, then takes the mean of the absolute differences; this is then subtracted from 1 so that the similarity score takes a value between zero and one (where 0 indicates no similarity and 1 indicates an identical distribution). A sketch of this measure is given below. Once the server receives each of the client scores, it calculates the mean to produce an overall score for each synthetic dataset. In experiment 3 the server simply averages the client scores, but they could be combined in other ways (e.g. using the lowest or the highest, or weighted by how similar the clients' scores are to each other). This completes the initialisation phase.

The main repeating process involves selection and mutation, but not crossover; this was excluded to reduce complexity. Firstly, parents are selected from the population using tournament selection (two synthetic datasets are randomly selected and the one with the highest similarity score wins). Two parents produce two children (i.e., two datasets), which are simply copies of themselves (where crossover is used, the children may be a combination of the parents), and the same number of children are produced as the population size. Then, each child is mutated with a probability equal to the mutation rate (0.05), with the replacement value being drawn from the relevant univariate distribution. The children are then sent to the clients, who score the similarity and send the scores back to the server, where they are aggregated. Finally, elitism is used to select the next generation: the best (those with the highest similarity) of the parents and the children are retained for the next generation (e.g. if a parent has higher similarity than a child, the parent is retained). This process is repeated for a set number of generations.
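The sketch below gives one interpretation of the similarity measure described above: it compares the proportions of each full-row combination of values in the candidate and client data, takes the mean absolute difference over the combinations observed in either dataset (the text does not specify exactly which set of combinations is averaged over), and subtracts the result from 1. It is an illustration, not the authors' exact implementation.

```python
# Minimal sketch of the similarity (utility) score: 1 minus the mean absolute difference
# between the joint-combination proportions of the candidate and the client data.
from collections import Counter

def similarity(candidate_rows, client_rows):
    def proportions(rows):
        counts = Counter(map(tuple, rows))
        n = len(rows)
        return {combo: c / n for combo, c in counts.items()}

    p_cand, p_client = proportions(candidate_rows), proportions(client_rows)
    combos = set(p_cand) | set(p_client)  # every combination seen in either dataset
    mean_abs_diff = sum(abs(p_cand.get(c, 0.0) - p_client.get(c, 0.0)) for c in combos) / len(combos)
    return 1.0 - mean_abs_diff

# Example: client A's five rows from Table 1; an identical candidate scores 1.0.
client_a = [(1, 2, 2, 2, 2), (1, 1, 1, 2, 2), (1, 1, 2, 2, 2), (2, 2, 2, 2, 1), (1, 1, 1, 2, 1)]
print(similarity(client_a, client_a))  # 1.0
```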

4 Results

Each experiment was repeated five times (using different random seeds). The plots in Figures 2-4 give the mean similarity score across the population for each of the generations for which the GA was run.

Figure 2 shows the results for experiment 1. For all runs the GA converged (that is, the synthetic datasets reproduced the original). The goal would not generally be to reproduce the original dataset, but this demonstrates that the GA works (albeit on a very small dataset). In experiment 2, the GA was run separately on both clients' five-row datasets, with the results plotted in Figure 3. Each of the five runs converged to one (that is, all runs reproduced the clients' data), and so the clients' resulting data could be combined to reproduce the original dataset.

The results for experiment 3 (as described in Figure 1) are illustrated in Figure 4. Panels 1 and 2 illustrate the scores calculated by clients A and B; these individual scores are sent to the server, which aggregates them, as displayed in panel 3. The aggregated score is what drives the GA (clients A and B do not see each other's scores; they only communicate with the server). The plot highlights that the synthetic datasets generated in run 3.3 scored highly with client A but poorly with client B; however, when aggregated by the server, all five of the runs look remarkably similar. Run 3.3 is interesting: until about generation 20 the gradient on both clients is very similar to the other runs, but around that generation a bifurcation happens. This appears to be the result of the process falling into a local optimum in which client A's dataset was optimised at the expense of client B's. This was the result of some mutation around generation 20 (subsequent test runs with the same starting seed failed to reproduce this result). Panel 4 shows the similarity scores of the synthetic datasets produced at each generation against the real combined data; by definition this would not be possible in a real-life scenario, since the original data would not be available, but it is calculated post hoc to evaluate how the overall model is working (i.e., we are more interested in whether the server is reproducing the overall dataset than whether it replicates individual client distributions).

Figure 2. Experiment 1, the mean (n=50) similarity of five randomly initialised runs of the GA on the original ten-row dataset. Note the truncated y axis.

Figure 3. Experiment 2, the mean (n=50) similarity of five randomly initialised runs of the GA on the five-row datasets of client A and client B. Note the truncated y axis.

Panel 4 shows that for all but run 3.3 the model converges on the original data, that is, each run reproduces the original dataset. This is a particularly fascinating finding, as it has done this despite the evaluations from the clients indicating sub-optimality. The baseline is included to indicate the combined client-to-server data similarity. Panel 3 of Figure 4 illustrates that (at least in this example) it may be difficult for the server to determine how well the overall model is performing. Other methods of combining the client scores were also considered as variations on experiment 3 (minimum, weighted, and alternating); the results are shown in the Appendices.


Figure 4. Experiment 3, the mean (n=50) similarity scores of five randomly initialised runs of the server GA, showing client A (1), client B (2), the server aggregated scores (3) and the comparison against the original data (4). Note the truncated y axes.

5 Discussion

The results of experiment 3 demonstrate our proof of concept: four out of five of the runs reproduced the original data. Fascinatingly, they did this despite the mean evaluation scores from the clients indicating that the operation had not achieved unity. This, however, was simply a reflection of each client's own sample not fully representing the combined dataset's structure. Thus the synthetic datasets were a better representation of the 'real' combined dataset than the 'samples' held by each client. This emergent reproducibility shows how the approach could deliver the desired outcome of producing analytically useful datasets synthesised across distributed datasets.

The experiments reported here focused on the single objective of utility, and in this case the goal was to reproduce the original data. In a real-life scenario, there would also be a consideration of risk: reproducing the original data would not be desirable. A way to incorporate risk would be to use a multi-objective approach within the GA and explore options such as Pareto optimality. The flexibility of GAs means that different utility and risk measures could easily be added. Another angle would be to use deep learning methods (such as GANs and diffusion models) and adapt them to multi-objective optimisation (GANs are already widely used within FL). The fact that in our experiments it was not clear on the server that the original data had been reproduced might be thought of as useful in terms of disclosure risk, but it would also mean that in this mechanism we could not rely on server-side restraint to manage risk.

An obvious and important expansion of these experiments is to test the method on larger and more complex datasets. Firstly, we need to establish whether the emergent reproducibility effect scales. Also, for very large datasets it may simply not be practical to send the entire population of datasets at each generation, and so alternatives may need to be explored. Another important expansion is to examine the effect of having more than two clients. The flexibility of the method also means that there are many parameters that can be experimented with.

6 Conclusion

The purpose of this study was as a proof of concept to determine whether using FL together with a GA to produce synthetic data was feasible. We have shown that it is feasible, albeit with a very small dataset, and with the focus being only synthetic data utility. The results are promising and there are many areas of future work including testing this on larger, more complex datasets, using a multiobjective approach that incorporates risk, and experimenting more generally with the various parameters.

References

Bonawitz, K. A., V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2016). Practical secure aggregation for federated learning on user-held data. In NIPS Workshop on Private Multi-Party Machine Learning. https://arxiv.org/abs/1611.04482.

Chen, Y., M. Elliot, and D. Smith (2018). The application of genetic algorithms to data synthesis: a comparison of three crossover methods. In Privacy in Statistical Databases. PSD 2018, pp. 160–171. Springer.

Chen, Y., M. J. Elliot, and J. W. Sakshaug (2017). Genetic algorithms in matrix representation and its application in synthetic data. In UNECE Worksession on Statistical Confidentiality. https://unece.org/fileadmin/ DAM/stats/documents/ece/ces/ge.46/2017/2_Genetic_algorithms.pdf.

Chen, Y., J. Taub, and M. J. Elliot (2019). Trade-off between information utility and disclosure risk in ga synthetic data generator. In Joint UNECE/Eurostat Expert Meeting on Statistical Data Con- fidentiality. https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2019/mtg1/ SDC2019_S3_UK_Chen_Taub_Elliot_AD.pdf.

Duan, S., C. Liu, P. Han, X. Jin, X. Zhang, T. He, H. Pan, and X. Xiang (2023). Ht-fed-gan: Federated generative model for decentralized tabular data synthesis. Entropy 25(1). DOI: 10.3390/e25010088.

Fang, M. L., D. S. Dhami, and K. Kersting (2022). Dp-ctgan: Differentially private medical data generation using ctgans. In M. Michalowski, S. S. R. Abidi, and S. Abidi (Eds.), Artificial Intelligence in Medicine, pp. 178–188. Springer International Publishing. DOI: 2022.10.1007/978-3-031-09342-5_17.

Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative Adversarial Nets. In Proceedings of the Advances in Neu- ral Information Processing Systems, Volume 27. https://papers.nips.cc/paper/2014/file/ 5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

Ho, J., A. Jain, and P. Abbeel (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851. https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.

Holland, J. H. (1992). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press.

Hundepool, A., J. Domingo-Ferrer, L. Franconi, S. Giessing, E. Schulte Nordholt, K. Spicer, and P.-P. de Wolf (2012). Statistical Disclosure Control. Wiley Series in Survey Methodology. John Wiley & Sons, Ltd. ISBN: 978-1-119-97815-2.

Kairouz, P., H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2021). Advances and open problems in federated learning. Foundations and Trends in Machine Learning 14(1–2), 1–210. DOI: 10.1561/2200000083.

Kingma, D. and M. Welling (2014). Auto-encoding variational bayes. DOI: 10.48550/ARXIV.1312.6114.

Konecny, J., H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon (2016). Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private Multi-Party Machine Learning. https://arxiv.org/abs/1610.05492.

Kumar, Y. and R. Singla (2021). Federated Learning Systems for Healthcare: Perspective and Recent Progress, pp. 141–156. Cham: Springer International Publishing. DOI:10.1007/978-3-030-70604-3_6.

Little, R. (1993). Statistical Analysis of Masked Data. Journal of Official Statis- tics 9(2), 407–426. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/ statistical-analysis-of-masked-data.pdf.

Liu, T., J. Tang, G. Vietri, and Z. S. Wu (2023). Generating private synthetic data with genetic algorithms. DOI: 10.48550/arXiv.2306.03257.

Lomurno, E., A. Archetti, L. Cazzella, S. Samele, L. Di Perna, and M. Matteucci (2023). Sgde: Secure generative data exchange for cross-silo federated learning. In Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition, pp. 205–214. Association for Computing Machinery. DOI: 10.1145/3573942.3573974.

McMahan, B., E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. PMLR. http://proceedings.mlr.press/v54/mcmahan17a/mcmahan17a.pdf.

McMahan, B. and A. Thakurta (2022). Federated learning with formal differential privacy guarantees. https: //ai.googleblog.com/2022/02/federated-learning-with-formal.html, accessed 2023-05-24.

Nowok, B., G. Raab, and C. Dibben (2016). Synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software 74(11). DOI: 10.18637/jss.v074.i11.

Qu, H., Y. Zhang, Q. Chang, Z. Yan, C. Chen, and D. Metaxas (2020). Learn distributed gan with temporary discriminators. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm (Eds.), Computer Vision – ECCV 2020, pp. 175–192. Springer International Publishing. DOI: 10.1007/978-3-030-58583-9_11.

Radford, A., J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019). Language models are unsupervised multitask learners. OpenAI blog 1(8), 9. https://d4mucfpksywv.cloudfront.net/ better-language-models/language-models.pdf.

Rieke, N., J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein, et al. (2020). The future of digital health with federated learning. NPJ digital medicine 3(1), 119. DOI:10.1038/s41746-020-00323-1.

Rubin, D. B. (1993). Statistical Disclosure Limitation. Journal of Official Statistics 9(2), 461–468. https: //ecommons.cornell.edu/bitstream/handle/1813/23033/rubin-1993.pdf?sequence=7.

Sohl-Dickstein, J., E. Weiss, N. Maheswaranathan, and S. Ganguli (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, Volume 37, pp. 2256–2265. https://proceedings.mlr.press/v37/sohl-dickstein15.html.

Thogarchety, P. and K. Das (2023). Synthetic data generation using genetic algorithm. In 2023 2nd International Conference for Innovation in Technology (INOCON), pp. 1–6. DOI: 10.1109/INOCON57975.2023.10101072.

University of Manchester and ONS (2013). Census 1991: Individual Sample of Anonymised Records for Great Britain (SARs). http://doi.org/10.5255/UKDA-SN-7210-1.


Weldon, J., T. Ward, and E. Brophy (2021). Generation of synthetic electronic health records using a federated gan. DOI: 10.48550/arXiv.2109.02543.

Zhang, J., G. Cormode, C. Procopiuc, D. Srivastava, and X. Xiao (2017). PrivBayes: Private data release via Bayesian networks. ACM Transactions on Database Systems 42(4). DOI: 10.1145/2588555.2588573.

Zhao, Z., R. Birke, A. Kunar, and L. Y. Chen (2021). Fed-tgan: Federated learning framework for synthesizing tabular data. DOI: 10.48550/arXiv.2108.07927.


A Using the worst client scores to drive the GA, rather than averaging

Figure 5. Mean (n=50) similarity scores of five randomly initialised runs of the server GA, where only the worst (lowest) client score is used to drive the GA (rather than averaging both client scores). Showing client A (1), client B (2), the worst scores (3) and the comparison against the original data (4). Note the truncated y axes.


B Using weighted averaged score to drive the GA

Figure 6. Mean (n=50) similarity scores of five randomly initialised runs of the server GA, where a weighted averaged score is used to drive the GA. Where the client scores are close (the clients agree) the scores are weighted higher, where they are far apart (the clients disagree) the scores are weighted lower. Showing client A (1), client B (2), the server weighted averaged scores (3) and the comparison against the original data (4). Note the truncated y axes.
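The exact weighting formula is not given; purely as an illustration, the following minimal Python sketch shows one way a server could down-weight the combined score when the two client scores disagree (the function name and the agreement term are assumptions, not the authors' implementation).

```python
def weighted_server_score(score_a: float, score_b: float) -> float:
    """Illustrative only: combine two client similarity scores so that
    close agreement between clients increases the driving score.

    The agreement term (1 - |score_a - score_b|) is an assumption; the
    text only states that close scores are weighted higher and distant
    scores lower.  Scores are assumed to lie in [0, 1]."""
    mean_score = (score_a + score_b) / 2.0
    agreement = 1.0 - abs(score_a - score_b)
    return mean_score * max(agreement, 0.0)

# Clients that agree closely yield a higher driving score:
print(weighted_server_score(0.80, 0.78))  # ~0.774
print(weighted_server_score(0.80, 0.40))  # ~0.36
```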


C Alternating the client scores to drive the GA, rather than averaging

Figure 7. Mean (n=50) similarity scores of five randomly initialised runs of the server GA, where the alternating client score is used to drive the GA (five generations using client A, five using client B, etc.). Showing client A (1), client B (2), the server mean alternating scores (3) and the comparison against the original data (4). Note the truncated y axes.



Experiments on Federated Data Synthesis

Claire Little, Mark Elliot, Richard Allmendinger

University of Manchester

Questions? https://tinyurl.com/QuestionsUoM

Federated Learning (FL)
FL (McMahan et al., 2017) is a decentralized approach to training statistical models
• Multiple clients can produce one global model

• Clients do not share or exchange their own data

• Can reduce privacy and security risks (compared to methods that combine multiple data sources)

• Allows models to train on data that is more representative of the whole distribution

• Useful where clients do not possess enough data to generate the required statistical power

Federated Learning (FL)
Central server controls the process (but does not access any client data)
• Initialises model, sends to each client
• Typically, neural network type models are used

Each client trains the model on their own data
• Sends updates (parameters or model weights) back to server

Server aggregates the client updates
• Sends updated model back to clients

Iterative process
• Training usually terminated when a specific criterion is met, e.g., maximum number of iterations

NVIDIA - A centralized-server approach to federated learning. https://blogs.nvidia.com/blog/2019/10/13/what-is-federated-learning/
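As a hedged illustration of the loop just described (not the authors' code), one round of the federated-averaging scheme of McMahan et al. (2017) can be sketched in Python; here the model is simply a parameter vector and the local update function stands in for client-side training.

```python
import numpy as np

def fedavg_round(global_params, clients, local_update):
    """One round of federated averaging (sketch).

    global_params : np.ndarray                 current global model parameters
    clients       : list of client datasets    data never leaves the client
    local_update  : fn(params, data) -> (new_params, n_examples)
    """
    updates, weights = [], []
    for data in clients:                        # each client trains locally
        new_params, n = local_update(global_params.copy(), data)
        updates.append(new_params)              # only parameters are shared
        weights.append(n)
    weights = np.array(weights, dtype=float)
    weights /= weights.sum()
    # server aggregates: weighted average of the client parameters
    return sum(w * u for w, u in zip(weights, updates))
```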

Federated Synthesis
Using FL to generate synthetic data
• Emerging research field

• Small body of research focussing mostly on image data

• Less research on tabular data

• Methods predominantly use GANs (Generative Adversarial Networks, Goodfellow et al. 2014)

Is it possible to produce useful synthetic microdata in a federated way? • Proof of concept using Genetic Algorithm (GA)

Genetic Algorithms (GAs)
GAs (Holland, 1992) perform iterative optimisation, training over multiple generations
• Three main biologically inspired operators: Selection, Crossover, Mutation

➢ Initial population of candidate solutions (candidate solution = synthetic dataset)
➢ Fitness (similarity to original data) of each candidate calculated
➢ Select fitter candidates (parents) to reproduce for new population
➢ Crossover – combines parents to produce new candidates (children)
➢ Mutation – randomly change some of the candidates' features
➢ Next generation – children, or combination of best (fittest) parents and children (elitism)
➢ Repeat process multiple times (generations) using fitness to guide
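A minimal sketch of this kind of server-side loop follows, assuming binary data and mutation-only evolution as in the study; the function and variable names are invented, and the only client involvement is the similarity-score call, so no client data is shared.

```python
import numpy as np

rng = np.random.default_rng(0)

def server_ga(n_rows, n_cols, client_scores, pop_size=20, generations=200,
              mutation_rate=0.05):
    """Mutation-only GA run on the server (illustrative sketch).

    client_scores(candidate) must return a list of similarity scores, one per
    client, each computed locally against that client's own data."""
    # initial population of random binary candidate datasets
    population = [rng.integers(0, 2, size=(n_rows, n_cols))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # fitness = average of the per-client similarity scores
        fitness = [np.mean(client_scores(cand)) for cand in population]
        order = np.argsort(fitness)[::-1]
        parents = [population[i] for i in order[: pop_size // 2]]
        children = []
        for parent in parents:                    # elitism: parents are kept
            child = parent.copy()
            mask = rng.random(child.shape) < mutation_rate
            child[mask] = 1 - child[mask]         # flip mutated cells
            children.append(child)
        population = parents + children
    return max(population, key=lambda c: np.mean(client_scores(c)))
```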

Study Design – Data
A (very) simple binary dataset, randomly sampled from UK 1991 Census microdata (University of Manchester, 2023)
• Small dataset to enable understanding

• 10 rows, 5 binary variables

• “Original” dataset

• Randomly split into two five-row datasets

• representing two clients (A and B)

AGE  MSTATUS  SEX  LTILL  TENURE  client
 1      2      2     2      2       A
 1      1      1     2      2       A
 1      1      1     2      2       A
 2      2      2     2      1       A
 1      1      1     2      1       A
 2      2      2     2      1       B
 1      2      2     2      1       B
 1      1      1     2      1       B
 1      1      1     1      2       B
 1      1      1     2      1       B

Study Design – Parameters
Huge potential range of variation in the simulation. Three types of parameters:
• Model: changeable settings for the GA (e.g., mutation rate)

• Simulation: variations in the scenario being presented (e.g., number of clients)

• Experimental: elements that are not part of the simulation itself (e.g., data choice, number of runs)

Model complexity is kept low to aid with interpreting the results

• Focus only on utility (not risk)

• Small dataset

• GA uses mutation but not crossover

• Two clients for FL

Study Design - Parameters

Results – Experiment 1
Running GA on the original dataset (10 rows):
• All five randomly initialised runs converged, i.e., they reproduced the original dataset

Results – Experiment 2
Running GA separately on client A and B datasets (5 rows each):
• For each, all five randomly initialised runs converged and reproduced the original dataset

Results – Experiment 3
FL with two clients (A and B):
• All but one of the randomly initialised runs converged and reproduced the original datasets
• Panel 4 would not be available in reality – used for evaluation
• Convergence achieved despite the evaluations from clients, and the server aggregated score indicating suboptimality

Discussion
Experiment 3 demonstrates proof of concept:
• Analytically useful datasets were synthesised across distributed datasets

It was not clear on the server that the original data had been reproduced:
• Might be useful in terms of disclosure risk
• Means we cannot rely on server-side restraint to minimise risk

Caveats and future work
Experiments conducted on small sample of binary Census microdata:
• May not scale to larger, more complex data
• Very large datasets may be computationally impractical

Would need to consider different parameters:
• More than 2 clients

Single-objective focus on utility:
• In a real-life scenario, the goal would not be to reproduce the original data
• Risk would need to be factored in
  ◦ A multi-objective approach within the GA could be used
  ◦ Deep learning methods also a possibility

Questions? https://tinyurl.com/QuestionsUoM

Email: [email protected]

References McMahan, B., E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. PMLR. http://proceedings.mlr.press/v54/mcmahan17a/mcmahan17a.pdf

Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Volume 27. https://papers.nips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf

Holland, J. H. (1992). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press.

University of Manchester, Cathie Marsh Centre for Census and Survey Research, Office for National Statistics, Census Division. (2023). Census 1991: Individual Sample of Anonymised Records for Great Britain (SARs). [data collection]. UK Data Service. SN: 7210, DOI: http://doi.org/10.5255/UKDA-SN-7210-1


SACRO: semi-automated output checking, University of the West of England

output checking, confidential data, automate checking, analytical languages, secure environments, 

Languages and translations
English

UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert meeting on Statistical Data Confidentiality 26–28 September 2023, Wiesbaden

SACRO: Semi-Automated Checking Of Research Outputs

Jim Smith1, Richard Preen1, Maha Albashir1, Felix Ritchie1, Elizabeth Green1, Simon Davy2, Pete Stokes2, Sebastian Bacon2
1: University of the West of England, UK; 2: Bennett Institute, University of Oxford

[email protected]

Abstract

Output checking can require significant resources, acting as a barrier to scaling up the research use of confidential data. We report on a project, SACRO, that is developing a general-purpose, semi-automatic output checking system that works across the range of restricted research environments. SACRO is designed to:

• Automate checking of the most common statistics, using best-practice principles-based modelling.
• Support researchers using the major analytical languages (R, Python and Stata), with minimal changes, by exploiting the ‘wrapper’ approach successfully trialled already.
• Support secure environments with different operating models and output checking workflows, through a process of co-design to maximise usability.

SACRO builds on previous work: ACRO (funded by Eurostat and reported in the 2021 Workshop), which established the proof of concept; and Py-ACRO, which showed how a software-independent tool might be developed. It differs from those earlier projects in the wider range of statistics covered and the requirement to achieve general applicability. To do this, the project draws on our extensive networks of practitioners. A series of workshops and ‘hands-on’ evaluations ensure the design frameworks support buy-in from a wide range of prospective users across health and social sciences, and from the public and private sectors.

1 Introduction

Statistical agencies and other custodians of secure facilities such as Trusted Research Environments (TREs) Hubbard et al. (2020) provide researchers with access to confidential data under the ‘Five Safes’ framework Ritchie (2017). This enforces five orthogonal layers of safety procedures, and the last requires explicit checking of research outputs for disclosure risk. This can be a time-consuming and costly task, requiring skilled staff. This paper discusses the development of an open source tool for automating the statistical disclosure control (SDC) of routine research outputs. The goal is to make the clearance process more efficient and timely, and to allow the skilled checkers to focus their attention on the less straightforward cases. The purpose of the tool (SACRO, for Semi-Automated Checking of Research Outputs) is to assist researchers and output checkers by distinguishing between research output that is safe to publish, output that requires further analysis, and output that cannot be published because of substantial disclosure risk. This work builds upon a previous Eurostat-funded project Green et al. (2020, 2021) in which Green, Ritchie and Smith developed a proof-of-concept prototype for the proprietary Stata software. The primary new contributions reported in this paper are:

• The implementation of a Python toolkit.
• An extensible multi-language platform with interfaces familiar to users of popular statistical tools.
• ‘Skins’ in Stata and the language R, demonstrating cross-language support.
• An open source repository with examples, help, documentation, etc.

2 Background

The Five Safes framework Ritchie (2017) is a set of principles that enable services to provide safe research access to their data and has been adopted by a range of TREs, including the Office for National Statistics (ONS), Health Data Research UK (HDR-UK), and the National Institute for Health Research Design Service (NIHR), as well as many others worldwide. Ensuring the last of these, ‘safe outputs’, is a complex and often costly human labour-intensive process. Automated output checking aims to improve the rigour and consistency of the output disclosure control process and reduce human workload by automatically identifying, reporting, and (optionally) suppressing disclosive outputs where possible and categorising outputs as ‘safe’ or ‘unsafe’. ‘Safe’ outputs requiring no or minimal further changes can be expedited through the clearing process whereas ‘unsafe’ outputs can be prioritised for human review Ritchie (2008).

A small number of SDC tools have been produced to assist in the process of achieving ‘safe outputs’, such as tauArgus and sdcTable1; however these are primarily designed for users such as National Statistics Institutes, as they require expert knowledge of SDC to use effectively. Moreover, they are designed for tabular outputs, and do not cover the range of statistics produced by researchers.

With the aim of improving the efficiency of the process, and (where applicable) reducing the amount of user training required, a recent Eurostat project Green et al. (2021) developed a proof-of-concept prototype in Stata where primary disclosure is regulated by a set of simple rules. For example, a minimum threshold rule applied to the number of observations used by a statistic ensures that there is sufficient uncertainty with respect to any individual respondent. Dominance rules protect large respondent values from being approximated where the contribution to a statistic is dominated by only a few individuals. For example, the p%-rule sorts the N observations by magnitude and checks whether the sum of the smallest N − 3 observations is at least p% of the largest observation. The NK rule checks that the largest N observations contribute less than K% of the total. Also, not all aggregation statistics are permitted: reporting minima or maxima values of a subgroup is prohibited, and regressions are protected by checking that the residual degrees of freedom exceed a minimum threshold.

1Respectively, https://github.com/sdcTools/tauargus and https://github.com/sdcTools/sdcTable
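To make the two dominance rules concrete, the short Python sketch below implements them exactly as described above (this is illustrative, not ACRO's internal code; the default parameter values mirror Table 1, expressed as fractions).

```python
import numpy as np

def p_percent_ok(values, p=0.1):
    """p%-rule as described in the text: after sorting the N observations,
    the sum of the smallest N-3 must be at least p (as a fraction) of the
    largest observation."""
    v = np.sort(np.asarray(values, dtype=float))
    return v[:-3].sum() >= p * v[-1]

def nk_ok(values, n=2, k=0.9):
    """NK rule: the largest n observations must contribute less than
    k (as a fraction) of the cell total."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    total = v.sum()
    return total > 0 and v[:n].sum() < k * total

cell = [100.0, 50.0, 3.0, 2.0, 1.0]
print(p_percent_ok(cell), nk_ok(cell))  # False False: dominated by two contributors
```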

Building on the experience of the initial proof-of-concept, funding was secured from the UK Research Council’s DARE initiative2 for the project Semi-Automated Checking of Researcher Outputs (SACRO), which involves:

• Computer scientists with backgrounds ranging from AI research to commercial software development.
• A range of TREs as co-designers of a toolset.
• SDC theorists and statisticians to provide a conceptual framework for handling different types of output and providing guidance to researchers and output checkers.
• Public Involvement and Engagement specialists and groups to develop a consensus statement around the use of (semi-)automation in disclosure control.
• Researchers from a previous DARE project examining the output checking of machine learning models trained on sensitive data within a TRE Jefferson et al. (2022).

In this paper we report on the principal tools developed within the SACRO project, specifically:

1. A toolkit for researchers to use within TREs that produces automated reports on disclosure risk with minimal changes to their practice – simply prefixing common commands with the word ‘acro’.
2. Explicit support for researchers to reduce the number of disclosive outputs they request.
3. Cross-language support, with exemplar interfaces provided for Stata and R.
4. Support for the output types that our TRE partners tell us form the majority of requested releases.
5. A stand-alone viewer for TRE output staff to facilitate rapid, informed, and audited decision making.
6. A revised guide incorporating theoretical developments, directly linked to its implementation in SACRO.

3 The SACRO toolkit

SACRO is composed of three parts which may be deployed independently: the main ‘ACRO-engine’, a stand-alone viewer, and ‘AI-SDC’ - support for disclosure control of machine learning models (described elsewhere).

3.1 Design Philosophy

The operational design philosophy is extensively documented in Green et al. (2020), who studied the characteristics that an automated solution needs to have to be feasible, effective, and a positive choice for users. Essential criteria are that it should be:

• Acceptable to users, output checkers and TRE managers;
• Able to implement an organisation’s business rules for primary and secondary disclosure, which may vary across datasets or users;
• Comprehensive, even if the automated tool’s response is “I don’t know, so this needs manual checking”;
• Consistent, providing the same results across different studies within a TRE, and across TREs;
• Able to support exceptions under principles-based regimes;
• Scalable over users and outputs.

Key operational requirements were for the tool to work in different technical environments, and to be easily updated through well understood mechanisms. This meant separating the software itself (distributed through a recognised channel3) from the specification of a given TRE’s risk appetite (held in a human- and machine-readable and editable file). Acceptability to users was identified as the most crucial element. If researchers and output checkers see the tool as something that makes their life better and easier, then they are more likely to use it effectively. Hence, designing the user interface was identified as a separate workstream in SACRO, and given the same resources as the design and implementation of the output-checking component. This is also one reason why SACRO set up a large network of potential users and testers (see Sec. 6 below).

2https://dareuk.org.uk/

3for example, PyPI (https://pypi.org) or CRAN (https://cran.r-project.org)

[Figure 1 (schematic): researchers issue analysis commands prefixed by ‘acro’ in Python, R, Stata, etc.; light-weight translation functions pass them to the Python disclosure control checks (threshold, dominance, degrees of freedom; applying cell suppression and other measures), built on standard Python libraries (Pandas for tables, statsmodels for regression) and reading a TRE-specific file detailing risk appetite; on finalise, an Excel spreadsheet or JSON file with details and recommendations for each requested output is produced for TRE staff to approve, discuss or reject.]

Figure 1. Schematic illustration of ACRO.

The ‘proof-of-concept’ version of ACRO did not address secondary disclosure (such as checking for differencing across tables), for two reasons. First, business rules for secondary checking are often not clear or comprehensive. Second, ACRO/SACRO works by intercepting commands and assessing disclosure risk at the time the output is being produced. Analysing results post-hoc is a considerably harder problem, requiring the researcher to produce a lot more information and also locate the other outputs to be compared. Although SACRO does not currently (as of July 2023) carry out secondary disclosure review, we are investigating how to at least flag potential differencing risks across the set of outputs from a research ‘session’, and in future, create a library of outputs which might allow secondary disclosure to be assessed, even if only partially.

3.2 Workflow

ACRO Preen et al. (2023) is an open source toolkit (MIT License) that provides a light-weight ‘skin’ that sits over well-known analysis tools, in a variety of languages researchers might use. The process is illustrated in Fig. 1. This adds functionality to identify potentially disclosive outputs against a range of commonly used disclosure tests and report to researchers and TREs reasons why outputs should not be released ‘as-is’. It creates simple summary documents TRE staff can use to streamline their workflow. ACRO has been designed with the following aims:

• Reducing barriers to adoption via a front-end application programming interface (API) that is similar to those already commonly used by researchers in their favoured language.

• Providing researchers with: immediate feedback on the results of disclosure checks (on-screen alongside their query results); facilities to add comments or exception requests, and control over what is submitted for review, e.g., removing disclosive outputs if they use feedback to design non-disclosive ones.

• Having a single back-end code base constituting a single source of truth for performing checks, with extensibility for different languages and ongoing support and consistency.

• Providing easy-to-understand help and documentation.

In practice, researchers prepare their data and statistical queries in the usual way, in their preferred language, using common commands prefixed by ‘acro’. The lightweight ACRO translation functions then call the Python back-end, which executes the queries and performs the requisite output checks. The results of the checks and the queries are immediately displayed to the researcher, and full details are stored in a list. When the user calls acro.finalise() to end their session, outputs and all SDC details are saved to file for review by a TRE output checker. A schematic illustration of the ACRO workflow is shown in Figure 1 and some notebooks demonstrating example code usage and output are available via the ACRO project wiki4.

4https://github.com/AI-SDC/ACRO/wiki
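As a brief illustration of this workflow (a sketch, not taken from the project wiki: the dataframe is invented, and the top-level import path is an assumption; the commands themselves are those documented in Sec. 3.4):

```python
import pandas as pd
from acro import ACRO  # installed via `pip install acro` (import path assumed)

# invented example data
df = pd.DataFrame({
    "region": ["north", "south", "north", "east"] * 25,
    "sex": ["f", "m"] * 50,
    "income": range(100),
})

acro = ACRO()                             # start a session; optionally pass the TRE's config
table = acro.crosstab(df.region, df.sex)  # checked cross tabulation, feedback shown on screen
acro.print_outputs()                      # list the outputs queued for checking
acro.finalise("outputs", "json")          # write outputs plus SDC details for the output checker
```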


3.3 Checks Implemented

For tabular data (e.g., cross tabulation and pivot tables), we prohibit the reporting of the maximum or minimum value in any cell that represents a sub-group of one or more contributors. Moreover, we suppress the value of the aggregation statistic (mean, median, variance, etc.) for any cell deemed to be sensitive, and report the reason. ACRO currently supports the three most common tests for sensitivity: ensuring the number of contributors is above a frequency threshold, and testing for dominance via the p% and NK rules. ACRO builds a series of suppression masks, which indicate which cells are to be suppressed for each check. A summary outcome table indicating which suppression rule was applied to each cell is presented to the researcher (the grey box in Fig. 2), alongside the query results. For regressions, e.g., linear, probit and logit regression, the tests verify that the number of degrees of freedom exceeds a threshold. Immediate feedback on all these checks is designed to support researchers to improve their practice and so reduce the SDC bottleneck by making fewer disclosive requests.

The checking of graphical plots is not currently implemented, as this is a complex problem with many different methods for producing visualisations. However, we expect to have some support by Autumn 2023. As noted above, all of these tests and checks are configurable according to the TRE’s risk appetite. The data custodian, e.g., a TRE staff member, specifies the parameter values used for the output checks in a YAML5 configuration file, which is loaded upon ACRO initialisation. The default ACRO parameters are shown in Table 1. Future releases will offer the option to override these at a dataset, or even attribute, level.

Table 1. ACRO Default Parameters for sensitivity tests

Description                                   Parameter           Value
Min frequency threshold for tabular data      safe_threshold      10.0
Min degrees-of-freedom for analytical stats   safe_dof_threshold  10.0
N parameter in NK test                        safe_nk_n           2.0
K parameter in NK test                        safe_nk_k           0.9
Min ratio for p% test                         safe_pratio_p       0.1
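The following stand-alone sketch (not ACRO's implementation) shows how a frequency-threshold suppression mask of the kind described above can be built and applied to a cross tabulation, using the default safe_threshold of 10 from Table 1; all names here are illustrative.

```python
import pandas as pd

def threshold_suppress(index, columns, values, aggfunc="mean", safe_threshold=10):
    """Sketch: suppress aggregate cells with fewer than safe_threshold contributors."""
    agg = pd.crosstab(index, columns, values=values, aggfunc=aggfunc)
    counts = pd.crosstab(index, columns)           # number of contributors per cell
    mask = counts < safe_threshold                 # True where the cell must be suppressed
    outcome = mask.replace({True: "threshold", False: "ok"})
    return agg.mask(mask), outcome                 # suppressed cells become NaN

# hypothetical usage with the invented dataframe from the earlier sketch:
# suppressed, outcome = threshold_suppress(df.region, df.sex, df.income)
```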

3.4 The SACRO Python ‘Engine’

Python is a popular multi-platform language widely used for data analysis and machine learning. PyPI provides a simple package management system for distributing open source Python libraries. Pandas and Statsmodels6 are industry-standard, mature, popular, and well-supported Python packages for data analysis, statistical testing, and statistical data exploration. Pandas is currently used by more than 55% of all Python users Python Software Foundation (2021) and there are many web-sites and user groups providing help with formulating queries. The use of Python as the primary implementation therefore enables the leveraging of existing expertise and community support with these packages, so that the ACRO front-end can be as similar as possible to the API researchers already know and trust, and further facilitates the rapid development of disclosure checking functionality on the back-end. As the PyPI distribution system is simple and allows the use of semantic versioning, it supports a rapid and iterative develop-and-deploy strategy to provide continuing functionality and improvements. For example, the current version of ACRO may be installed [or updated] as simply as:

pip install [--upgrade] acro

The currently implemented methods are listed below, split into analysis commands and session management commands. For more details see the ACRO project documentation7.

5https://yaml.org

6https://github.com/pandas-dev/pandas and https://www.statsmodels.org/stable/index.html respectively

7https://ai-sdc.github.io/ACRO/


3.4.1 Analysis commands for Researchers. These are implemented via the use of multiple inheritance from Pandas and Statsmodels. For making tables, the relevant methods are:

: crosstab(index, columns[, values, rownames, . . . ]) Compute a simple cross tabulation of two (or more) factors, with options for hierarchies in rows/columns and multiple aggregation functions. Same API as pandas.crosstab.

: pivot_table(data[, values, index, columns, . . . ]) Create a spreadsheet-style pivot table as a DataFrame. Same API as pandas.pivot_table.

and for regression analysis:

: logit(endog, exog[, missing, check_rank]) Fits Logit model. Same API as statsmodels.discrete.discrete_model.Logit.

: logitr(formula, data[, subset, drop_cols]) Fits Logit model from an R-style formula and DataFrame. Same API as statsmodels.formula.api.logit.

: ols(endog[, exog, missing, hasconst]) Fits Ordinary Least Squares Regression. Same API as statsmodels.regression.linear_model.OLS.

: olsr(formula, data[, subset, drop_cols]) Fits Ordinary Least Squares Regression from an R-style formula and DataFrame. Same API as statsmodels.formula.api.ols.

: probit(endog, exog[, missing, check_rank]) Fits Probit model. Same API as statsmodels.discrete.discrete_model.Probit.

: probitr(formula, data[, subset, drop_cols]) Fits Probit model from an R-style formula and DataFrame. Same API as statsmodels.formula.api.probit.
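For example (a sketch, not taken from the project documentation; it assumes the ACRO session `acro` and DataFrame `df` from the earlier workflow sketch), the R-style formula interface can be used exactly as in statsmodels, with the residual-degrees-of-freedom check applied before the result is queued for release:

```python
# fit a checked OLS regression via the R-style formula interface
results = acro.olsr("income ~ region + sex", data=df)
acro.print_outputs()   # the regression is queued with its degrees-of-freedom check recorded
```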

3.4.2 Session Management Commands.

: ACRO(config, suppress) Creates an ACRO session object with optional parameters for a config (risk appetite) filename and whether disclosive tables should have suppression applied (default False).

: print_outputs() Prints the current results dictionary - i.e., the outputs that would be sent for checking.

: remove_output(key) Removes an output from the results dictionary.

: rename_output(key, newname) Assigns a new (ideally more self-explanatory) name to an output from the results dictionary.

: add_comments(key,text) Allows researcher to add a description for an output

: add_exception(key,text) Allows a user to request and justify an exception to strict rules-based checking.

: custom_output(filename,description) Adds a file containing output from unsupported analysis to an ACRO session for inclusion in outputs shown in viewer.

: finalise(directory_name, format) Creates a results file for checking in the desired format (json or xlsx).


» safe_table = acro.crosstab( df.recommend, df.parents, values=df.children, aggfunc="mean")

» print(safe_table)

INFO:get_summary:fail; threshold: 4 cells may need suppressing

INFO:outcome_df:
parents          great_pret   pretentious   usual
recommend
not_recom        ok           ok            ok
priority         ok           ok            ok
recommend        threshold    threshold     threshold
spec_prior       ok           ok            ok
very_recommend   threshold    ok            ok

INFO:acro:add(): output_1

grant_type       great_pret   pretentious   usual
recommend
not_recom        1440         1440          1440
priority         858          1484          1924
recommend        0            0             0
spec_prior       2022         1264          758
very_recom       0            132           196

Figure 2. Example ACRO query for the ‘nursery’ data (top), with immediate disclosure control reporting (middle, grey background - pink onscreen) followed by the output (bottom). This ‘researcher-view’ corresponds to the top image in the viewer screenshots.

An example ACRO query run on the nursery admission dataset8 and its output is shown in Fig. 2. This is the ‘researcher-view’ of the output at run-time. The corresponding ‘TRE-view’ is shown in the top screenshot in Fig. 3. This example does not have an aggregation function so dominance rules are not applied; otherwise they would also show in the ‘INFO’ section of the report in any relevant cells. Note that if the user starts their session with acro = ACRO(suppress=True) then any disclosive cells would have their values set to NaN.

3.5 The R interface to ACRO

The R front-end is an example of cross-language support. It provides a set of wrapper functions that execute the Python back-end checking via the reticulate9 package, which provides automatic conversions for many types, e.g., R data frame to Pandas DataFrame. A session is created when the acro package is called: source("../acro.R"); thereafter the acro methods work as callable functions with the prefix acro_, e.g., acro_rename_output(output5, "xy-plot") etc., and to end a session the user calls acro_finalise(results_dir, "json"). For regressions, the common R lm() and glm() functions were shadowed with equivalent versions implemented as acro_lm() and acro_glm(), respectively. For tabular data, the dplyr10 package is commonly used within R, however no simple cross tabulation or pivot table functions are provided; instead various combinations of groupby() and summarize() etc. are used. Therefore, at this stage of development, the Python cross tabulation and pivot table functions were directly interfaced with acro_crosstab() and acro_pivot_table().

8https://www.openml.org/search?type=data&sort=runs&id=1568&status=active

9https://github.com/rstudio/reticulate

10https://github.com/tidyverse/dplyr


3.6 Stata Interface

This makes extensive use of Stata’s SFIToolkit library to manage a Python session, transfer data in memory from Stata to a Pandas dataframe in the Python session, and results back to the Stata window. A simple acro.ado file defines a new function acro which takes as parameters either one of the ACRO session management methods (adding init() to start a session) or the name of a standard Stata function such as table, regress, etc. Stata’s inbuilt parsing functions are used to separate out the parts of the command and pass them as lists to a Python function parse_and_run() which handles the rest of the translation between the two languages.

4 SACRO Viewer for Output Checking

We have also created an open-source platform-independent stand-alone viewer for output checkers to use to: view outputs and their risks; make decisions with reasons (all recorded for auditing purposes); and produce zipped packages of files for release Open-Safely (2023). Figure 3 illustrates two screenshots from the version currently (July 2023) being evaluated by TREs. The viewer supports and renders a range of different file types for results from unsupported queries. A separate script lets TRE staff create an ACRO session from a set of output files in a directory, and hence use the viewer for making and recording decisions, even if the researcher has not used ACRO during their analysis. Automated disclosure risk analysis is not provided in those cases.

5 Linking theory and implementation

As part of the project, the SACRO team committed to review and re-develop the theory and operational guidelines for output SDC. The aim was threefold; first, to bring together key points from the OSDC literature (and fill in some of the theoretical gaps) to provide an integrated guide to both theory and practice of output checking; second, to develop a new approach to OSDC based on classifications into groups (see Derrick et al. (2023), for details); third, to explicitly link theory to operational rules and their implementation in manual and automatic checking regimes. The third aim is essential to demonstrating that SACRO is not seen as a ‘black box’ implementing its own rules, but is fully integrated into core theory. It is also important for showing how manual and automatic output checking necessarily differs. For example, dominance checks are almost impossible for a human, but straightforward for computers; on the other hand, computers cannot easily identify whether zero cells in tables are structural or disclosive, but humans can. The purpose of the guide is to show precisely what checks have been made, where differences occur between humans and computers, and why they are necessary.

6 Engagement with TREs

One of the lessons learned from the original Stata version of ACRO Green et al. (2021) was the importance of user buy-in. Although that version met its design goals (and has subsequently been adopted by Eurostat in its TRE), reaction to it was a mixture of “this looks useful, I’ll give it a go”, “this looks useful, I’ll wait to see it installed before I commit myself”, and “I’ve read the installation manual and have no idea what’s going on, so it’s a no”. As a result, that version of ACRO has remained largely within the project remit: a demonstration of possibilities. The SACRO project was intended to involve co-design from the outset to take ACRO to the next stage of general utility and application. This involved three tests:

1. Would a new tool be acceptable to users?
2. Would a new tool be acceptable to output checkers?
3. Could a new tool be installed in secure research environments?


Figure 3. Two screenshots of the viewer. The left-hand column shows the list of files requested. In the top image, the colouring of file names suggests which files require special attention. In the lower image, background colour-coding and tick/cross symbols show the decisions made by the output checker. The top image shows the checker viewing a table that fails disclosure tests, with problematic cells highlighted in red. The bottom shows an acceptable table. Also in this image the top right-hand panel shows the option to view the TRE ‘risk appetite’ expanded.


The SACRO project took two approaches. First, six TREs (OpenSafely at the University of Oxford, and the five Scottish Safe Havens) were funded as co-investigators on the project to provide detailed feedback on user and output checker perspectives (OpenSafely also took the lead in the design of the user interface). This group also directly tested the feasibility of installing and allowing the Python code to run on their systems, as TREs differ in their perceptions of Python’s ‘riskiness’. Second, the SACRO team contacted a large number of TREs in the UK and abroad, and set up a network of interested parties potentially willing to be testers. Several engagement events with this group identified how they worked and what they would expect from an automatic solution. At the time of writing (July 2023), the first ‘external’ TREs are starting to install and run the tool with genuine users. SACRO has a workpackage dedicated to helping TREs set up their systems, and then collecting evaluation feedback. This aims to make sure that the tool is tested in as wide a variety of environments as possible, given the time constraints. A secondary aim is to involve TREs in the development, to build a sense of ownership and lay the foundations for widespread adoption. This helps to address the concerns of ‘wait-and-see’ TREs.

7 Future Plans

By the current project end in October 2023 we aim to have added support for: more common types of analyses (including simple plots); different versions of Stata; and more ways of creating tables within R. Additional features and improved user experience will be facilitated by the involvement of end-users and output checkers. Beyond then, UWE has committed to web hosting various resources for the indefinite future, and partners have agreed to continue support and development of the toolkits. We are keen to engage with any interested parties to enrich and build an on-going community of support for SACRO.

References

Derrick, B., E. Green, F. Ritchie, J. Smith, and P. White (2023). Towards a comprehensive theory and practice of output SDC. In UNECE/Eurostat Workshop on Statistical Data Confidentiality.

Green, E., F. Ritchie, and J. Smith (2020). Understanding output checking. Technical report, European Commission (Eurostat - Methodology Directorate).

Green, E., F. Ritchie, and J. Smith (2021, October). Automatic checking of research outputs (ACRO): A tool for dynamic disclosure checks. ESS Statistical Working Papers 2021, 1–27. doi: 10.2785/75954.

Hubbard, T., G. Reilly, S. Varma, and D. Seymour (2020, July). Trusted research environments (TRE) green paper. ZENODO 2020, 1–31. doi: 10.5281/zenodo.4594704.

Jefferson, E., J. Liley, M. Malone, S. Reel, A. Crespi-Boixader, X. Kerasidou, F. Tava, A. McCarthy, R. Preen, A. Blanco-Justicia, E. Mansouri-Benssassi, J. Domingo-Ferrer, J. Beggs, A. Chuter, C. Cole, F. Ritchie, A. Daly, S. Rogers, and J. Smith (2022, September). GRAIMATTER Green Paper: Recommendations for disclosure control of trained Machine Learning (ML) models from Trusted Research Environments (TREs).

Open-Safely (2023). Sacro:a tool for fast, secure and effective output checking, which can work in any TRE. https://github.com/opensafely-core/sacro.

Preen, R. J., J. Smith, M. Albashir, and S. Davy (2023). ACRO. https://github.com/AI-SDC/ACRO.

Python Software Foundation (2021). Python developers survey 2021 results. https://lp.jetbrains.com/python-developers-survey-2021/. Accessed: 24/07/2023.

Ritchie, F. (2008). Disclosure detection in research environments in practice. In Joint UNECE/Eurostat work session on statistical data confidentiality, Volume WP. 73. United Nations Statistical Commission and Economic Commission for Europe Conference of Europe Statisticians, European Commission Statistical Office of the European Communities (Eurostat).

Ritchie, F. (2017, September). The ‘five safes’: A framework for planning, designing and evaluating data access solutions. Zenodo 2017, 1–5. doi: 10.5281/zenodo.897821.



DARE UK

SACRO: Professor Jim Smith, UNECE Expert meeting on Statistical Data Confidentiality 2023

SACRO: Semi-Automated Checking of Research Outputs

Professor Jim Smith,

University of the West of England

[Diagram (Five Safes in a TRE): Safe People, Safe Projects, Safe Setting, Safe Data, Safe Outputs. A subset of pseudonymised data flows from the data repository into the researcher analytical environment inside the TRE; exports from the TRE (graphs or tables of summary results, AI trained models, software using AI models) pass through the disclosure control checking process before becoming research results for publication.]

(Semi) Automating this bottleneck!


Who are we? (alphabetically)

Universities

• Aberdeen

• Dundee

• Durham

• Edinburgh

• Oxford

• UWE

Public Data Bodies
• Health Data Research UK
• NHS Scotland
• Public Health Scotland
• Research Data Scotland

TREs
• DASH (Aberdeen/Grampian)
• DataLoch (Edinburgh)
• HIC (Dundee)
• eDRIS (Public Health Scot)
• OpenSafely (Oxford)

External steering group:
UK: Office for National Statistics, ESRC, DARE
Global: Eurostat, SDC-GESIS, ICPSR (US)

The current situation

[Diagram: confidential data held in the TRE is analysed by the researcher using Stata, R or Python; the analysis produces output files (File1 … Filen) that are sent as a release request to TRE staff, who make the decision.]

SACRO in a nutshell

[Diagram: as in the current situation, but the analysis in Stata, R or Python now runs through the ACRO engine, which reads the TRE risk appetite and adds an ACRO report to the output files (File1 … Filen); TRE staff use the report alongside the release request to reach their decision.]

Similar for Machine Learning Models

Except that we:

• Run a range of ‘inference’ attacks

• Aim to support more ‘user journeys’

• Don’t have a set of ‘tried and trusted’ guidelines to work with

ACRO: analytic commands

Using the same API as pandas, adding checks for cell count and dominance:

• acro.crosstab()

• acro.pivot_table()

Using the same API as statsmodels, adding checks for DoF:

• acro.logit(), acro.logitr()

• acro.ols(), acro.olsr()

• acro.probit(), acro.probitr()

R versions are prefixed by “acro_”

• Also support R’s built-in ‘table’ command

Stata versions are prefixed by “acro ”

• Code currently captures and translates the commands: table, regress, probit, logit

ACRO: user commands for session management

• acro = ACRO()

• acro.suppress = [True, False]

• acro.print_outputs()

• acro.remove_output(key)

• acro.rename_output(key, newname)

• acro.add_exception(key, text)

• acro.add_comment(key, text)

• acro.custom_output(filename, description)

• acro.finalise(output_directory, format)

Example: what happens if a researcher requests a disclosive table

SACRO viewer for TRE output checkers

[Annotated screenshot: a panel listing outputs to review; an option to view the TRE risk appetite; the selected output shown with problematic cells highlighted; for each output, its type, the recommendation, comments from the user and any exception request; comments are needed to override a recommendation; finally, overall comments are recorded and a release package created.]

Get Involved

ACRO ‘engine’:

• https://github.com/AI-SDC/ACRO

Viewer:

• https://github.com/opensafely-core/sacro

Consensus statement: [email protected]

Anything else: [email protected]

Thanks to:

• All the project partners and TRE staff for all their feedback on the tools as they developed

• Members of the public for input to the consensus statement

• External steering group

• You for listening

This work is funded by UK Research and Innovation [Grant Number MC_PC_23006], as part of Phase 1 of the DARE UK (Data and Analytics Research Environments UK) programme, delivered in partnership with Health Data Research UK (HDR UK) and Administrative Data Research UK (ADR UK).

Towards a comprehensive theory and practice of output SDC, University of the West of England

statistical disclosure control of outputs, OSDC, secure research environments, OSDC associated risks, 


UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Towards a comprehensive theory and practice of output SDC

Ben Derrick(University of the West of England, UK)

Elizabeth Green(University of the West of England, UK)

Felix Ritchie(University of the West of England, UK)

Paul White(University of the West of England, UK)

e-mail: [email protected]

Abstract

In 2000, the statistical disclosure control of outputs (OSDC) was largely limited to models of table protection developed

by and intended for national statistical institutes (NSIs), as a particular branch of general SDC theory. However, in this

century OSDC as a field of enquiry has expanded significantly, reflecting the importance of secure research environments

run by NSIs and others. OSDC is still a relatively under-developed field compared to SDC for tables or microdata. There

are a small number of practitioner guides, and some theoretical articles, but this is a diffuse literature.

In the UK, a consortium of universities and data providers is collaborating to provide an integrated analysis of output

checking including

- Key theoretical and operational concepts (eg safe statistics, principles-based OSDC)

- A comprehensive listing of statistics, associated risks, and mitigation measures as well as various practical elements to

support output checking.

A key element of this is a theory-driven classification which enables us to have that comprehensive listing whilst still

limiting the dimensionality of OSDC guidelines to a manageable number of rules. This paper explains this model and how

it has been co-developed with RDCs and others, and considers whether this provides a sustainable model for future

development of the OSDC field.


1 Introduction

Increasingly social scientists are making use of confidential data for research. This has accelerated in the 21st

century with the growth of secure environments, referred to as ‘safe havens’, ‘secure data centres’, ‘research

data centres’, ‘trusted research environments’ (TREs) and similar names. These TREs provide standardised

secure access to a range of sensitive datasets for research purposes. In OECD countries these are now common

as part of the portfolio of research data services offered by National Statistics Institutes (NSIs), and academic

groups are also adopting them.

TREs have introduced one substantial change to the way social scientists work. When working with

confidential data, researchers are generally unaware of the potential disclosure risk in statistical outputs, as this

is not covered in research methods courses (Derrick et al, 2022). However, TREs generally require researchers

to submit outputs for a confidentiality review before release (Green et al, 2021). The efficiency of this process

relies substantially on the researchers being aware of confidentiality risks and actively aiming to produce non-

disclosive outputs (Alves and Ritchie, 2019). Hence, most TREs (Green et al, 2021) provide researchers with

some training and/or guidelines in output statistical disclosure control (OSDC). Some organisations that allow

downloads have also provided OSDC guidelines eg Eurostat (2015).

The practice of output checking, and the training of researchers and checkers, lags considerably behind other

areas of confidential data protection, such as source data anonymisation. For many years, OSDC was limited to

models of table protection (frequencies and magnitudes) developed by and intended for national statistical

institutes (NSIs). In this century OSDC as a field of enquiry has expanded significantly, largely as a result of

the growth of TREs and the need to cover the much wider range of outputs generated by researchers.

Nevertheless, general OSDC is still a relatively under-developed field compared to SDC for tables or

microdata.

A part of the problem is that the conceptual framework for generalised OSDC is lacking. There are a small

number of practitioner guides, and a few theoretical articles, but this is a sparse literature. However, that

literature does contain the seeds for a new overarching framework; in particular, the realisation that statistics

could be grouped to minimise the need for rules covering every potential output.

In 2023 the UK academic funding council UKRI funded the project SACRO (Semi-automated checking of

research outputs; see Green et al, 2023a) to deliver a general-purpose toolkit for automating output checking

processes, based on the Eurostat funded pilot ACRO (Green, Ritchie and Smith 2020 and 2021). As part of the

project, the team undertook to provide a comprehensive review of SDC theory, integrated with practical

guidelines. A key part of the project was to formalise the use of classifications (‘statbarns’) and push the

concept to its limit to minimise the dimensionality problem.

This paper describes the statbarn concept, how it was operationalised, and how it simplifies disclosure control

processes (both automatic and manual). As of July 2023, this is still a work in progress, so we review the

current status and highlight areas where research needs to be done.

2 Generalised OSDC development1

Statistical disclosure control (SDC, sometimes called statistical disclosure limitation) is the practice of using

statistical analysis to ensure that the use of confidential or sensitive data does not breach the privacy of the data

subjects. SDC can be split into ‘input SDC’ (removing identifying information from the data before analysis is

carried out) and ‘output SDC’ (checking that statistical aggregates do not reveal information).

1 This short review is based on our own understanding and experience in the last two decades. We would very much

appreciate comments from colleagues working in this area as to the accuracy of our representation.


Input SDC is a very well-established process. It has a large and stable literature, a large evidence base of the

efficacy of different measures in different circumstances, and software tools implementing these to de-identify

datasets. Research methods courses rarely teach formal de-identification, but researchers are usually given

some basic guidance on broad principles.

In contrast, OSDC is a largely unknown quantity. Until 2000, ‘output SDC’ (had the term been coined then)

would have been seen as the need to protect frequency and magnitude tables from inadvertent disclosure. This

field had seen some study, and there was a relatively well-established literature, but it remained a specialist

area, even for statisticians. We are not aware of research methods courses, then or now, that teach this as a

matter of course, with one exception.

The exception is courses in the production of official statistics, which do cover OSDC for tables. Until recently,

SDC was very heavily influenced by the needs of national statistics institutes (NSIs), who produce statistical

tables and, increasingly, microdata for secondary analysis. These organisations promoted research into relevant

SDC, which explains the overwhelming focus on tables for OSDC. The first OSDC papers not focusing on

tables appear to be Reznek (2004), Reznek and Riggs (2005) and Corscadden et al (2006), all tackling specific

problems.

In 2003 the TRE at the UK Office for National Statistics was set up, and it was run by social science

researchers rather than the teams producing official statistics. The ONS team realised that (a) the literature on

tabular OSDC was of limited value in research environments, and (b) the vast majority of research outputs had

no guidance at all. As a result, the team began developing guidelines with a research focus. This included an

analysis of the principles behind output SDC for research (Ritchie, 2007), and the first statement of ‘safe

statistics’ (Ritchie, 2008).

The concept of ‘safe statistics’ is key for efficient processing of research outputs. It recognises that certain

types of output have no meaningful disclosure risk in any reasonable use. For example, the regression

coefficients cannot by themselves reveal an individual value, nor can they be differenced to reveal individual

values, nor are they affected by special cases such as single observations in a category (Ritchie, 2019). Of

course, it is possible to construct special cases such that the regression is informative about individuals, but

these have no meaningful research purpose. For all reasonable purposes, regressions coefficients are non-

informative about individuals in all cases2, and therefore they do not need to undergo output checking.

Ritchie (2016) proposed a method for classifying outputs as safe or unsafe:

- Does the statistic itself pose a risk in the case of low numbers, extreme values or something else which is a

legitimate value?

- If the statistic is compared to another with one more observation, does any differencing risk arise?

- Are there any other reasonable risks to disclosure, specific to this statistic?

If the answer to all three of these is ‘no’ then the statistic is classified as ‘safe’. The innovation in Ritchie

(2008) was that the classification should be based upon the mathematical characteristics of the statistic, not the

statistical ones; in other words, a ‘safe’ statistic should be safe irrespective of the data it is calculated on.

The ONS guidelines formed the basis for Brandt et al (2010; subsequently re-released, with minor revisions, as

Bond et al, 2016). This Eurostat-sponsored project (complementing a second piece on ‘traditional’ SDC;

Hundepool et al, 2010) aimed to provide the first comprehensive guide for researchers and output checkers. The

guide covered broad theory, including a discussion of safe statistics; guidelines and ‘rules’ on specific statistics,

grouped into similar types; and suggestions for operationalising good practice, including training. Brandt et al

(2010) has been the basis for many of the practice manuals now being produced by NSIs and others for TRE

users.

Despite its influence, Brandt et al (2010) has some significant limitations. The most obvious is that the list of

statistics covered is not comprehensive but selective, reflecting the interests of the report committee. Thus, it

2 There are basic rules that can be checked to make sure that the regression is a genuine regression (sufficient degrees of

freedom to be clear this is not an equation, regression must not be saturated to ensure this is an estimate and not a table

masquerading as a regression) but in genuine situations we would not expect these conditions to occur.


is strong on the measures used by social scientists but has significant gaps relating to health research, for

example. The second limitation is that the recommendations are presented ‘as is’ with little in the way of

explanation as to why this came about. A third limitation is that the report is very laconic, offering rules but

very little in the way of practical interpretation for researchers or output checkers. Subsequent manuals based

on the guide have managed to address some of these; for example, the popular SDAP manual (Griffiths et al,

2019) has both a wider range of statistics, and a commentary for output checkers on how to usefully assess the

output.

However, the major limitation of Brandt et al (2010) is that there is no overall integrating conceptual

framework. The guide reduces the range of rules somewhat by grouping statistics, but these groupings are as likely to be

based on whether statistics are commonly presented together as on their disclosure characteristics. Moreover, the

structure of the guide implies that any additional statistics will need to have their own rules added, rather than

being seen as variations on existing ones. Other manuals follow this (implicit) approach as well, listing outputs

and associated rules as if they were separate entities. The implications of safe statistics and the grouping

approach used in Brandt et al (2010) have not been followed through. We consider this now.

3 Conceptual foundations of an integrated approach

Analysts use a great range of statistical techniques in their models. Devising statistical rules for all of these

separately is not feasible. However, it is possible to combine statistics into groups based not on statistical

relation but on common disclosure risks and solutions. For example:

- means and totals are identical in terms of the disclosure risk for all practical purposes

- means and frequencies generate the same risks of low numbers and potential for differencing

- means have the potential for dominance

- survival tables are frequencies but they also generate an implicit secondary table

So a grouping would put means, totals, frequency tables and survival tables into three different disclosure

groups.

Everything in the groups should have the same risks and solutions. For example, suppression, rounding or noise

addition are valid solutions to disclosure risks in both frequency and survival tables, but on the latter they need

to be implemented in a different way to allow for the monotonic relationship between cells.
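As a concrete illustration of one such mitigation, the sketch below applies simple threshold suppression to a frequency table; the threshold of 10 and the toy data are illustrative assumptions, not a recommended rule.

```python
import pandas as pd

# toy data; column names and the threshold are illustrative assumptions
df = pd.DataFrame({"region": ["A"] * 12 + ["B"] * 3,
                   "outcome": ["yes"] * 10 + ["no"] * 2 + ["yes"] * 3})

def suppress_small_cells(table: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    """Primary suppression: blank any cell whose count falls below the threshold."""
    return table.mask(table < threshold)

counts = pd.crosstab(df["region"], df["outcome"])
safe_counts = suppress_small_cells(counts)   # cells below the threshold become NaN (suppressed)
```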

The advantages of this approach are both statistical and operational:

- Fewer rules/cases for researchers and output checkers to learn

- More consistent treatment of outputs

- Clearer distinctions between outputs

- Easier to develop the theoretical basis for any guidance

- Easier to update guidance when it changes (which it does)

- Adding new statistics is now a case of ‘what category does it fall into?’ rather than ‘what rules are needed?’

- Output checker (and researcher) training can focus on the risky classes rather than trying to cover all cases


Because classification is used in this field in many different ways, we refer to the groupings as ‘statistical

barns’ or ‘statbarns’3.

The real value of this comes from finding that, in terms of disclosure characteristics, the minimum number of

statbarns is fairly small. To a researcher, estimation of a hazard model bears little analytical relation to a

quantile regression; but they pose the same disclosure risks: that is, no meaningful risk in any reasonable use,

and so the only test needed is to make sure that this is a genuine research use. In the case of estimated models, the

tests are always

- Are there sufficient residual degrees of freedom (ie making sure this is a model, not an equation)?

- Is the model saturated (explanatory factors all categorical and all fully interacted ie making sure this is not a table

masquerading as an estimate)?

And just like that, a large and essential part of research output is consigned to the box ‘nothing to see here’.
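To make those two checks concrete, a minimal sketch follows; the degrees-of-freedom threshold and the function signature are illustrative assumptions, not the SACRO implementation.

```python
def model_checks(n_obs: int, n_params: int,
                 all_predictors_categorical: bool, fully_interacted: bool,
                 dof_threshold: int = 10) -> bool:
    """Return True if an estimated model passes both 'genuine regression' checks."""
    residual_dof = n_obs - n_params
    sufficient_dof = residual_dof >= dof_threshold                         # an estimate, not an equation
    not_saturated = not (all_predictors_categorical and fully_interacted)  # not a table in disguise
    return sufficient_dof and not_saturated

model_checks(n_obs=5000, n_params=12,
             all_predictors_categorical=False, fully_interacted=False)     # True
```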

4 The SACRO classification model

As it currently stands, the SACRO model contains fourteen statbarns:

Barn | Example | Class | Status
1 Frequencies | Frequency tables | Unsafe | Very well understood
2 Statistical hypothesis tests | t-stats, p-stats, f-stats | Safe | Provisional
3 Correlation coefficients | Regression coefficients | Safe | Confirmed
4 Position | Median, quartiles, min, max | Unsafe | Provisional
5 Shape | s.d., skewness, kurtosis | Safe | Provisional
6 Linear aggregations | Means, totals | Unsafe | Very well understood
7 Mode | n/a | Safe | Confirmed
8 Smooth distributions | Kernel density functions | Safe | Provisional
9 Concentration ratios | Herfindahl index | Safe | Provisional
10 Calculated ratios | Odds & risk ratios | Unsafe | Provisional
11 Implicit tables | Hazard/survival tables | Unsafe | Provisional
12 Linked/multi-level tables | Nested categorical data | ? | No knowledge
13 Clusters | Cluster analysis | ? | No knowledge
14 Gini/Lorenz curves | n/a | ? | No knowledge

It is clear that some of these statbarns cover a very large number of cases (‘correlation coefficients’ cover linear

and non-linear regression, ANOVA, ANCOVA, pairwise correlation etc). In contrast, the disclosure risks of the

mode are unlike any other statistic, and so it merits its own class. This shows the importance of identifying

exactly what are the disclosure characteristics of a particular statistic.

The act of creating the list is itself a useful exercise, forcing one to consider what are the meaningful

differences. For example, mean and median are often grouped together in OSDC guidelines, but they have quite

different characteristics. On the other hand, maxima and minima are often dealt with on their own but they can

be considered as a special case of percentiles. This means that we no longer need separate rules for

‘structural’ end points (such as 0% or 100% in a proportion variable) but can apply general percentile rules.

This list is likely to undergo change over time. Even in the development process, the list changed as more

statistics were deemed to be of the same type, and others demanded a new type. The process of identifying risks

3 The term originally came from an analogy with a farmer trying to organise her livestock, but as a neologism it has the

advantage of being unambiguous


and defining OSDC guidelines for each class is crucial, as this is usually the point at which it becomes clear

whether a new type is needed or not. It may also be the case that trying to identify a minimal set is counter-

productive. As noted, formally maxima/minima can be treated as percentiles; but in terms of communication of

risk to researchers, it may be sensible to separate them again. Finally, we have created some categories as, at

the moment, we don’t have enough information to be comfortable that they fit an existing category. Category

12 “linked/multiple tables” is an example – it seems like these should be covered by frequency tables, but we

suspect there are nuances which need to be explored, and so creating it as a separate category shows the need for

more understanding.

The coverage of OSDC theory is decidedly patchy. The ‘status’ column has four values:

Very well understood | The disclosure issues, things to be checked and protection mechanisms have been comprehensively studied and there is a consensus

Confirmed | These have not been so well studied (conclusions rest on one or two papers) but we are confident that the conclusions and guidance are robust, well-founded and comprehensive

Provisional | We have confidence in our conclusions but this is based on extrapolation from other types, and from our own understanding; there is substantial further work to be done (for example, on the impact of extreme values) before the classification can be confirmed

No knowledge | While we may have a suspicion of how these should be seen, basic analysis has not been carried out

At present, the focus is to get the ‘provisional’ status raised to ‘confirmed’.

The list above is provisional and was devised by the SACRO team based at the University of the West of

England, Bristol. SACRO’s network of output checkers was consulted as to whether this was a sensible

approach in general; the response was positive, but expected: earlier evidence-gathering sessions had already

indicated a desire for simplification of the current OSDC landscape. The initial categories seemed both sensible

and comprehensive, although these are likely to be modified as they develop in practice.

Of more concern to the output checkers was how they (and researchers) would easily check the guidelines for

statistics. This is achieved by a look-up table, linking statistics to the appropriate statbarn, from which the

corresponding checks, problems and solutions can be found.

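The look-up table itself is not reproduced here, but a minimal sketch of how such a mapping might be implemented follows; the statistic names, statbarn assignments and guidance entries are illustrative excerpts, not the full SACRO listing.

```python
# Illustrative look-up from an individual statistic to its statbarn and guidance.
STATBARN = {
    "count table": "frequencies",
    "pie chart": "frequencies",
    "mean": "linear aggregations",
    "median": "position",
    "regression coefficient": "correlation coefficients",
    "kaplan-meier": "implicit tables",
}

BARN_GUIDANCE = {
    "frequencies": {"class": "unsafe",
                    "checks": ["low counts", "differencing", "class disclosure"],
                    "mitigations": ["cell suppression", "noise addition", "rounding"]},
    "correlation coefficients": {"class": "safe",
                                 "checks": ["residual degrees of freedom",
                                            "model not saturated"],
                                 "mitigations": []},
}

def guidance_for(statistic: str) -> dict:
    """Return the statbarn and its associated checks/mitigations for a statistic."""
    barn = STATBARN[statistic.lower()]
    return {"statbarn": barn, **BARN_GUIDANCE.get(barn, {})}

guidance_for("count table")   # -> frequencies barn, with its checks and mitigations
```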

The look-up table will be created as a searchable file, but the output tools being developed by the SACRO project (Green et

al, 2023) intend to incorporate this in the user front end. Researchers and output checkers should be able to

click on a link to see more information about the output, drawn from the statbarn classification. In the initial

project this will only include basic data such as that shown above, but in future it may be useful to expand the

information on each classification. This highlights the advantage of classification: the SACRO coders only

need to know the statbarn code and then can draw all this information from a finite set of outputs.

5 Graphical outputs

Graphs do not present new issues. In theory, every graph can be represented as a table in some way, and so the

above rules could be applied. To take an obvious example, a pie chart or a histogram are clearly just one-way

tabulations, whereas a waterfall graph is a two-way table. As a counter example, a kernel density estimate could

be represented as a mathematical form, but in practice is almost always shown graphically. In practice, we need

separate rules because (a) the quantity of information differs, and (b) precision is likely to be lower in a graph.

Consider the Kaplan-Meier graph, which is simply a survival table re-presented, usually in proportional form

(we assume that counts and proportions are equally disclosive as the total from which the proportion is

calculated is likely to be published somewhere). Survival tables are classed as ‘unsafe but very low risk’

because, even in the case of a unit being identified, the personal information content in the survival table is

negligible. Griffiths et al (2019) suggest that the underlying survival table should be supplied along with the

graph, but this can cause more problems:

In the left-hand graph, the source table would have 15 steps and be checkable by a human. But that table would

have precise numbers easily readable, whereas getting the exact figures from the graph depends on the way that

the image was produced (and even then, some laborious analysis). In the right-hand diagram, a survival table

with 100 rows in it is much harder to assess accurately, whereas identifying individual data points from the

image has become harder.

The above graphs are presented as numbers. Formally Kaplan-Meier graphs should show the survival rate

rather than numbers (ie 0%-100%). In theory this makes graphs slightly more disclosive than the survival table:

tables are likely to limit the number of decimal points shown, whereas the full decimal value may be used in

creating the graph points.

Given the low information content in any data point, even if relating to one person, producing survival tables

alongside graphs seems to increase risk rather than reducing it. Hence, the current guidance from SACRO is

that Kaplan-Meier graphs should be released subject to the researcher confirming that each step and the end

point meets thresholds.

The objective for the SACRO guide is that it will show the statbarns that each graph falls into (which in itself

might lead to additional statbarns being defined, as in the case of kernel densities), but will concentrate on the

practical assessment; in particular, how graphical representation adjusts the perspective on what is discoverable


from an output. Again, this is the value of the grouping – we can see what we should be looking for in the

output.

6 Conclusion

As the use of confidential microdata for research rises, so does the need for efficient and effective OSDC.

OSDC for research has made considerable advances in this century, but guidelines have tended to develop on

an ad hoc basis as new statistical queries are raised. The strategic approach being taken by SACRO and

described in this paper attempts to provide a longer-term solution to the problem.

The idea of grouping statistics was first raised in Ritchie (2008) partly as a response to proliferation of OSDC

rules emerging from research use of the ONS TRE. While the safe-unsafe classification is crude, it highlights

how applying a structure can significantly improve operational as well as statistical outcomes. Classification

also changes the way we think about outputs. When Brandt et al (2010) was written, the implication was that

additional statistics would require new rules. In the statbarn model, risk assessment for a new statistic should be

a matter of deciding whether it fits into an existing category. If it does, then no further work is needed. If not,

then a new category is added, but this should be a rare event.

The statbarn approach is part of the development of a wider set of operational guidelines aiming to bring

consistency between theory and practice to output checking.

7 References

Alves, K., & Ritchie, F. (2020). Runners, repeaters, strangers and aliens: Operationalising efficient output

disclosure control. Statistical Journal of the IAOS, 36(4), 1281-1293.

Brandt M., Franconi L., Guerke C., Hundepool A., Lucarelli M., Mol J., Ritchie F., Seri G. and Welpton R.

(2010), Guidelines for the checking of output based on microdata research, Final report of ESSnet sub-

group on output SDC

Bond S., Brandt M., de Wolf P-P (2015) Guidelines for Output Checking. Eurostat.

Corscadden, L., Enright J., Khoo J., Krsnich F., McDonald S., and Zeng I. (2006) Disclosure assessment of

analytical outputs, mimeo, Statistics New Zealand, Wellington

Derrick, B., Green, E., Ritchie, F., Smith J. & White, P. (2022, April). Disclosure protection: a systemic gap in

statistical training?. Paper presented at Scottish Economic Society Annual Conference 2022: Special

session 'Protecting confidentiality in social science research outputs', Glasgow

Eurostat (2015) Self-study material for Microdata users. Eurostat.

Green, E., Ritchie, F., Tava, F., Ashford, W., & Ferrer Breda, P. (2021, July). The present and future of

confidential microdata access: Post-workshop report.

Green, E., Ritchie, F., & Smith, J. (2020). Understanding output checking. Luxembourg: European

Commission (Eurostat - Methodology Directorate)

Green, E., Ritchie, F., & Smith, J. (2021). Automatic Checking of Research Outputs (ACRO): A tool for

dynamic disclosure checks. ESS Statistical Working Papers, 2021 Edition

Griffiths E., Greci C., Kotrotsios Y., Parker S., Scott J., Welpton R., Wolters A. and Woods C. (2019)

Handbook on Statistical Disclosure Control for Outputs. Safe Data Access Professionals Working

Group.

Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Lenz, R., Longhurst, J., Schulte Nordholt, E.,

Seri, G. and De Wolf, P. (2010). Handbook on Statistical Disclosure Control. ESSNet SDC.

Reznek, A. (2004) Disclosure risks in cross-section regression models, mimeo, Center for Economic Studies,

US Bureau of the Census, Washington


Reznek A. and Riggs T. (2005) "Disclosure Risks in Releasing Output Based on Regression Residuals" ASA

2004 Proceedings, Section on Government Statistics and Section on Social Statistics pp1397-1404

Ritchie F. (2007) Statistical disclosure control in a research environment, mimeo, Office for National Statistics;

available as WISERD Data Resources Paper No. 6

Ritchie F. (2008) “Disclosure detection in research environments in practice”, in Work session on statistical

data confidentiality 2007; Eurostat; pp399-406

Ritchie, F. (2014). Operationalising ‘safe statistics’: The case of linear regression. UWE Working Papers in

Economics no 14/10. Bristol

Ritchie, F. (2019). Analyzing the disclosure risk of regression coefficients. Transactions on data privacy, 12(2),

145-173

Smith J., Preen R., Ritchie F., Green E., Stokes P., & Bacon S. (2023) SACRO: Semi-Automated Checking Of

Research Outputs. Paper prepared for the 2023 UNECE/Eurostat Workshop on Statistical Data

Confidentiality, September.

Towards a comprehensive theory and practice of output checking Ben Derrick

Elizabeth Green

Felix Ritchie

Paul White

Data Research Access & Governance Network

UNECE/Eurostat

Expert Group on

Statistical Data

Confidentiality

September 2023

A brief history of output SDC

[Timeline graphic: in the beginning: tables; late 20th century: tables and everything else; 21st century: SACRO, 2023]

SACRO Feb-Oct 2023

• Review/revise theory

• Re-integrate theory and practice guidelines

• Tools (regular SDC and AI-SDC)

• Community engagement

now

next

Why a new guideline/manual?

User/checker guides: general, structured by use, maybe organisation-specific

Theory papers: specific, rarely structural

Practice papers: operational, not integrated

Structured? okay. Comprehensive? No way!

New guide: structured, comprehensive, definitive, theoretically sound, evidenced/sourced, practice-relevant (but not necessarily user-friendly)

• Building block: ‘safe statistic’

o unsafe: data-dependent; check before release

o safe: no disclosure risk [minimal check before] release

o based on mathematical (not statistical) characteristics

• Expand: define all statistics by

o common disclosure characteristics

o common mitigation responses

The statistical barn

• Place homologous statistical analysis into ‘statbarns’ eg

o histogram, count table, pie chart → 'frequencies'

FREQUENCIES

Applying the group rules

• In the Frequencies barn we know all outputs are

FREQUENCIES

• With any statistic in the Frequencies barn we need to check:

• Low counts

• Differencing

• Class disclosure

• We would apply the following rules in this barn

• Minimum count

• Appropriate mitigation techniques for this barn are

• Cell suppression, noise addition, rounding

The statistical barn

• Place homologous statistical analysis into ‘statbarns’ eg

o histogram, count table, pie chart → 'frequencies'

o median, interquartile range → 'position'

o ANOVA, proportional hazards → 'correlation'

FREQUENCIES

The barns so far

1. Frequencies
2. Statistical hypothesis tests
3. Correlation coefficients
4. Position
5. End points
6. Shape
7. Means and totals
8. Mode
9. Non-linear concentration ratios
10. Calculated risk ratios
11. Hazard/survival tables
12. Clusters
13. Linked/multi-level tables
14. Gini coefficient

• Each barn has its own set of rules for output checking

Progress (as of today)

• Pretty confident on decisions…

• Using in output checker training – positive feedback

• Statbarn model is basis for SACRO (coming up)

• Some rethinking; some revelations

Rethinking example: survival tables

2010 O’Keefe et al JPC

• No detail in tables

• Blur lines in K-M graphs

2019 SDAP manual

• treat as frequency table

• Minimum thresholds

o tables & graphs

2023 DRAGonTome

• Risk

1. outliers

2. absolute dates

3. differencing via subsets

• Recommendation

o Approve unless the above

o Apply threshold to final count


mode

regression on single binary variable

linear vs non-linear concentration ratios

Min/max versus medians and percentiles

Revelation example

• What is disclosure?

• Small numbers + finite values = rationale for higher thresholds

• Evidential vs structural zeros

Still to be done

• To be written:

o Basic concepts

o Operational issues

o Directory of other guides

o FAQs for researchers and output checkers

• Classifications

o Lookup table – is it comprehensive?

o Web pages

• Community buy-in

o More feedback!


Do samples taken from a synthetic microdata population replicate the relationship between samples taken from an original population and that population? University of Manchester

disclosure risk in sample surveys, k-anonymity, synthetic population, original population, synthetic data, 


UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Do samples taken from a synthetic microdata population replicate

the relationship between samples taken from an original

population?

Mark Elliot, Claire Little and Richard Allmendinger (University of Manchester)

[email protected]

Abstract

Assessment of disclosure risk in sample surveys by data controllers who don’t have access to the population

data is constrained by verifiability challenges. A sample unique may not be a population unique. Statistics

generated at the sample level may not carry over to the population level. Privacy models such as k-anonymity

simply may not make sense when applied to sample data (or only make sense for some scenarios). This study

aims to understand whether samples generated from a synthetic population present the same relationship, in

terms of risk and utility, to the synthetic population, as samples generated from the original population. Note

that this is a very different question from the more general questions about the utility of synthetic data which

compares the synthetic and original data. Here we are comparing two relationships. This opens the possibility of

being able to test and set parameters for models of risk assessment to be applied to real data using synthetic data.


1 Introduction

This document explores whether the relationship between a population dataset and samples drawn

from it is maintained when the samples are drawn from (and compared to) a synthetic version of the

same population. This extends the work of Little et al. (2022), where samples were used to determine

the sample equivalence of synthetic data to the original dataset (for example, to be able to say “the

synthetic dataset has utility equivalent to a 10% original data sample and risk equivalent to a 5%

sample”). In real-life scenarios the population data may not be available, so if synthetic samples were

able to mimic this relationship, it would be useful.

As visualised below, two scenarios are explored: Experiment A (Figure 1), where we do not have

access to the original population data but have a synthetic dataset generated from it that is the same

size as the original population; and Experiment B (Figure 2), where we have a sample of the original

population dataset and from that create a larger synthetic population. An extension to Experiment B

(named B2) is to include the original sample within the synthetic population.

Figure 1: Diagram of data relationships for Experiment A

Experiments were performed using the UK 1991 Census dataset (although it may make sense to

repeat these experiments on other Census datasets in the future). The synthetic data was generated

using Synthpop (Nowok et al., 2016). This was selected because in previous experiments it produced

data with the highest utility compared to other methods (although it should be noted this came with

higher disclosure risk). It may make sense to also experiment with other methods in the future.

The next section introduces the dataset and data/sample generation approach adopted in this study.

Section 3 describes the risk and utility measures used, and Section 4 presents an analysis of

Experiment A and B. Finally, Section 5 concludes the paper and discusses areas for future research.

2 Data

2.1 UK 1991 Census

A subset of the UK 1991 Individual Sample of Anonymised Records for Great Britain (SARs) was

used to simulate a population. The SARs data was downloaded from the UK Data Service on


29/05/21.1 This consists of a 2% sample of the population of Great Britain (excluding Northern

Ireland), with 1,116,181 individual records and 67 attributes. The dataset includes children and adults

and contains information on topics such as age, gender, ethnicity, employment, and housing. To

reduce the computational load the data was subsetted on geographical region (the REGIONP

attribute); there are 12 regions, and the West Midlands was randomly selected for use in this study.

Details of each of the variables are contained in Appendix A. The subset consisted of 104,267 records

(9.34% of overall sample) and fifteen variables (thirteen categorical, two numeric). This subset will be

henceforth referred to as the original population.

Figure 2: Data relationships for Experiment B

2.2 Synthetic Data Generation

Synthpop, developed by Nowok et al. (2016), was used to generate the synthetic data. Synthetic data

the same size as the original population (104,267 records) was generated. Default parameters were

used, with the visit sequence ordered with numerical variables first, followed by categorical variables

with least number of categories to most (with ties decided alphabetically). That gave a visit sequence

of: AGE, HOURS, LTILL, SEX, QUALNUM, MSTATUS, TENURE, RELAT, FAMTYPE,

SOCLASS, ECONPRIM, ETHGROUP, TRANWORK, AREAP, COBIRTH.

2.3 Sample Generation

Random samples of sizes 99%, 98%, 97%, 96%, 95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%,

10%, 5%, 4%, 3%, 2%, 1%, 0.5%, 0.25%, 0.1% were drawn (without replacement) from both the

original and synthetic populations. For each sample size 100 samples were drawn. This follows the

framework developed in earlier experiments (as reported in Little et al., 2022).
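A minimal sketch of this sampling scheme follows; the use of the replicate index as the random seed is an illustrative assumption.

```python
import pandas as pd

# Sampling fractions used in the experiments, expressed as proportions of the population.
FRACTIONS = [0.99, 0.98, 0.97, 0.96, 0.95, 0.90, 0.80, 0.70, 0.60, 0.50,
             0.40, 0.30, 0.20, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01,
             0.005, 0.0025, 0.001]

def draw_samples(population: pd.DataFrame, n_reps: int = 100):
    """Yield (fraction, replicate, sample): 100 samples per size, drawn without replacement."""
    for frac in FRACTIONS:
        for rep in range(n_reps):
            yield frac, rep, population.sample(frac=frac, replace=False, random_state=rep)
```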

3 Risk and Utility Measures

For calculating the associated risk and utility the sample datasets were measured against the

population dataset. That is, the synthetic samples were measured against the synthetic population they

1 Study Number 7210 (Office for National Statistics, Census Division, University of Manchester, Cathie Marsh

Centre for Census and Survey Research 2013).


were sampled from, and the original samples were measured against the original population that they

were sampled from. Risk-Utility (R-U) maps, as developed by Duncan et al. (2004), were used to

visualise the trade-off between risk and utility.

3.1 TCAP for Disclosure Risk

Elliot (2014) and Taub et al. (2018) introduced a measure for the disclosure risk of synthetic data

called the Correct Attribution Probability (CAP) score. The disclosure risk is calculated using an

adaptation used in Taub et al. (2019) called the Targeted Correct Attribution Probability (TCAP).

TCAP is based on a scenario whereby an intruder has partial knowledge about a particular individual.

Specifically, they know (i) the values for some of the variables in the dataset (the keys) and (ii) that

the individual is in the original dataset. We assume that the intruder wishes to infer the value of a

sensitive variable (the target) for that individual. The TCAP metric is then the probability that those

matched records yield a correct value for the target variable (i.e., that the adversary makes a correct

attribution inference).

Three target variables, and corresponding key variables were identified from the UK Census data. For

each target, the TCAP score was calculated using sets of 3, 4, 5 and 6 keys. The overall mean of the

TCAP scores (for each of the target and key combinations) was calculated as the overall disclosure

risk score.

The TCAP statistic has a value between 0 and 1; a low value indicates that the synthetic dataset

carries little risk of disclosure whereas a score close to 1 indicates a higher risk. A baseline value can

be calculated (the usual one being the probability of the intruder being correct if they drew randomly

from the univariate distribution of the target variable) and then the TCAP score is rescaled so that the

baseline equals zero.2 We refer to the rescaled TCAP value as the marginal TCAP, i.e., it is the

increase in risk above the baseline. Rescaling is performed by subtracting the baseline from the TCAP

score and then dividing by 1 minus the baseline. For all experiments the targets were:

• LTILL : baseline = 0.774

• FAMTYPE : baseline = 0.223

• TENURE : baseline = 0.329

With a mean baseline of 0.442. The keys for each were:

• 6 keys: AREAP, AGE, SEX, MSTATUS, ETHGROUP, ECONPRIM

• 5 keys: AREAP, AGE, SEX, MSTATUS, ETHGROUP

• 4 keys: AREAP, AGE, SEX, MSTATUS

• 3 keys: AREAP, AGE, SEX
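A minimal sketch of the marginal TCAP rescaling described above follows; the raw TCAP value in the example call is illustrative, while the baseline of 0.442 is the mean baseline reported for these experiments.

```python
def marginal_tcap(tcap: float, baseline: float) -> float:
    """Rescale a TCAP score so that the baseline (random guess from the target's
    univariate distribution) maps to zero."""
    return (tcap - baseline) / (1.0 - baseline)

print(marginal_tcap(tcap=0.85, baseline=0.442))   # ~0.731 (raw value is illustrative)
```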

3.2 Utility

Following previous work (Little et al. 2022) the mean of the Ratio of Counts (ROC) and Confidence

Interval Overlap (CIO) was calculated as the overall utility score. This was to provide a more

complete view of the utility, rather than just using a single measure.

3.2.1 Ratio of Counts (ROC)

The Ratio of Counts (ROC) was calculated for univariate and bivariate cross tabulations of the data.

This is calculated by taking the ratio of the synthetic and original data estimates (where the smaller is

divided by the larger one). Thus, given two corresponding estimates (for example, the number of

records with SEX = female in the original dataset, compared to the number in the synthetic dataset),

where yorig is the estimate from the original data and ysynth is the corresponding estimate from the

synthetic data, the ROC is calculated as:

\[ \mathrm{ROC} = \frac{\min(y_{orig},\, y_{synth})}{\max(y_{orig},\, y_{synth})} \]

2 This does create the possibility of a synthetic dataset receiving a negative TCAP score (which can still be

plotted on the R-U map) but that simply indicates a risk level below that of the baseline and will only occur in

degenerate cases.


If yorig = ysynth then the ROC = 1. Where the original and synthetic (or sample) datasets are of different

sizes (as is the case when calculating the ROC for the various sample datasets) the proportion, rather

than the count can be used. The ROC was calculated over univariate and bivariate cross-tabulations of

the data and takes a value between 0 and 1. For each variable the ROC was averaged across categories

to give an overall score.
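A minimal sketch of the univariate ROC calculation follows; proportions are used so that datasets of different sizes can be compared, and the example column name is illustrative.

```python
import pandas as pd

def univariate_roc(orig: pd.Series, synth: pd.Series) -> float:
    """Ratio of counts for one variable, averaged across categories."""
    p_orig = orig.value_counts(normalize=True)
    p_synth = synth.value_counts(normalize=True)
    cats = p_orig.index.union(p_synth.index)
    p_orig = p_orig.reindex(cats, fill_value=0.0)
    p_synth = p_synth.reindex(cats, fill_value=0.0)
    both = pd.concat([p_orig, p_synth], axis=1)
    ratios = both.min(axis=1) / both.max(axis=1)   # smaller estimate over larger, per category
    return float(ratios.mean())

# e.g. univariate_roc(original["SEX"], synthetic["SEX"]), with hypothetical dataframes
```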

3.2.2 Confidence Interval Overlap (CIO)

To calculate the CIO (using 95% confidence intervals), the coefficients from regression models built

on the original and synthetic datasets are used. The CIO, proposed by Karr et al. (2006), is defined as:

\[ \mathrm{CIO} = \frac{1}{2}\left\{ \frac{\min(u_o, u_s) - \max(l_o, l_s)}{u_o - l_o} + \frac{\min(u_o, u_s) - \max(l_o, l_s)}{u_s - l_s} \right\} \]

where uo, lo and us, ls denote the respective upper and lower bounds of the confidence intervals for the

original and synthetic/sample data. This can be summarised by the average across all regression

coefficients, with a higher CIO indicating greater utility (maximum value is 1 and a negative value

indicating no overlap).

For each synthetic (or sample) dataset two logistic regressions were performed, and the CIO (between

the same regression on the original data) for each was calculated. The mean CIO over all coefficients

was used (where a negative overlap was equivalent to no overlap and therefore set to zero). The mean

of the two CIOs was then calculated as the overall score.
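A minimal sketch of the CIO for a single coefficient, following the formula above, is given here; the confidence-interval bounds in the example call are illustrative.

```python
def cio(l_o: float, u_o: float, l_s: float, u_s: float) -> float:
    """Confidence interval overlap for one coefficient (original vs synthetic bounds)."""
    overlap = min(u_o, u_s) - max(l_o, l_s)
    value = 0.5 * (overlap / (u_o - l_o) + overlap / (u_s - l_s))
    return max(value, 0.0)   # negative overlap treated as no overlap, as in the paper

print(cio(l_o=0.10, u_o=0.30, l_s=0.15, u_s=0.35))   # 0.75
```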

The target variables were marital status (MSTATUS) and housing tenure (TENURE), and they were

converted into a binary attribute: for marital status this was married (or living as married) and

anything else; and for tenure this was whether an individual owns their property (or lives in property

that is owned by a family member), and anything else. Eight variables were used as predictors, using

more would seem to overcomplicate the models. The predictors were: AGE, ECONPRIM,

ETHGROUP, LTILL, QUALNUM, SEX, SOCLASS, and TENURE or MSTATUS (whichever was

not the target).

4 Results

4.1 Experiment A

This is the scenario where we do not have access to the original/population data but have a synthetic dataset

the same size created from it. This explores using a synthetic dataset to model the relationship

between samples and population data. To be clear, throughout this section, the original dataset (the

UK 1991 sample, n=104,267) is referred to as the original population, and the synthetic dataset

created from this is referred to as the synthetic population. The samples are referred to as original

samples and synthetic samples.

The synthetic population was created (using Synthpop) from the original population. The synthetic

population had utility = 0.7596 and Marginal TCAP = 0.7228 (to 4dp) compared to the original.

Samples were drawn from the synthetic population to determine if the results follow the same patterns

as samples drawn from the original population. The same sample sizes were used as in previous

experiments (0.1%, 0.25%, …, 99%, see Little et al., 2022).

The utility and TCAP scores for each sample size were calculated by measuring against the 100%

synthetic population dataset, not the original population since this would not be available in this

scenario. The baseline TCAP scores (used for calculating Marginal TCAP) were calculated from the

100% synthetic population, and these vary slightly from the original population:

• Original TCAP baseline = 0.442

• Synthetic TCAP baseline = 0.441

For each sample size 100 datasets were drawn, and the results are the mean of the 100. The risk and

utility of the synthetic samples were contrasted with the equivalent results from the original samples.

Tables with the mean utility and TCAP scores for each sample size, and the standard deviation (all

values less than 0.04) are contained in Appendix B. Figure 3 displays the R-U map for the original

sample data at each sample size, together with the results for the synthetic sample data. The plot and

tables indicate that the relationship (in terms of risk and utility) between synthetic samples and the


synthetic population follows closely to the relationship between the original samples and original

population. However, the synthetic samples have moderately higher risk (particularly around the 50%

sample size) and moderately lower utility.

Figure 3: R-U map showing the original samples and the synthetic samples (mean of n=100) in experiment A.

Appendix B contains a table with the mean absolute error (MAE) and standard deviation (SD) of the

synthetic utility and TCAP values (when calculated against the original samples), for each sample

size. Figure 4 illustrates the values in the table, displaying the MAE of the utility and TCAP scores. It

highlights that the MAE in terms of utility is low and generally decreases as sample size increases,

whereas whilst the MAE for the TCAP is also low it displays an interesting curve around the 50%

point and then decreases beyond that as sample size increases.

4.2 Experiment B

In this scenario the original (UK 1991 Census sample, n=104,267) dataset represents the

population; we then:

• take smaller samples from the original population (1%, 2%, 3%, 4%, 5%)

• generate synthetic populations (the same size as the original population) from the smaller

samples

• then draw multiple samples of different sizes from each synthetic population

• calculate the risk and utility of the samples and contrast with original population samples

This is perhaps the more likely scenario (compared to Experiment A) since we do not usually have

access to the population data – it is more likely a small sample will be provided, and we can then use

this to generate a synthetic population. From this synthetic population samples can be drawn and the

resulting utility and risk of these can be compared to the equivalent results from the original

population samples.


Figure 4: Mean Absolute Error of the utility and marginal TCAP for each synthetic sample size (calculated against the

original samples), with error bars showing ±1 standard deviation.

To calculate the utility and risk, the synthetic samples are measured against the synthetic population

they were drawn from. They are not measured against the original population as that data would not

be available.

4.2.1 Samples to generate the synthetic populations

Sample sizes of 1%, 2%, … 5% were drawn from the population data; Table 1 lists the number of

records in each sample. Note that only one sample was (randomly) drawn for each size; this is because

hundreds of datasets emanate from each of these individual samples, so to keep

complexity down only one of each size was drawn initially.

Table 1: Number of records for each sample size

Sample size | 1% | 2% | 3% | 4% | 5%
Number of records | 1042 | 2085 | 3128 | 4170 | 5213

Synthpop was used to generate a synthetic population from each sample, using default parameters

(and with the visit sequence as detailed in Section 2.2). One synthetic population the same size as the

original population (104,267) was generated for each sample; therefore 5 synthetic populations were

produced. Table 2 indicates the utility and risk values for each synthetic population measured against

the original population. It highlights that (even with these small sample sizes), the utility of a

population generated from a smaller sample is lower than the utility of a population generated from a

larger sample, as might be expected. The risk (TCAP) exhibits a different pattern, and it is notable

that the TCAP score for the synthetic population generated from a 1% sample is higher than that for

the 2% and 3% sample populations.

For each of these five synthetic populations, random samples the same size as used in previous

experiments (0.1%, 0.25%, …, 99%, see Little et al., 2022) were drawn (without replacement). For

each sample size 100 samples were drawn.


Table 2: Utility and risk scores for each synthetic population, to 3dp

Synthetic population generated from a: | Utility | TCAP | Marginal TCAP
1% sample | 0.539 | 0.669 | 0.407
2% sample | 0.585 | 0.638 | 0.351
3% sample | 0.591 | 0.648 | 0.370
4% sample | 0.616 | 0.670 | 0.409
5% sample | 0.643 | 0.678 | 0.423

4.2.2 Utility and Risk

Appendix C contains tables with the results for utility and Appendix D for TCAP. To calculate the

utility and TCAP the synthetic samples are measured against the synthetic population they were

drawn from (they are not compared against the original population as that data would not be

available). Figure 5 plots (in the left panel) the utility for each of the

synthetic populations at different sample sizes, with the original population plotted for comparison.

The plot highlights that, regardless of the synthetic population origin (whether it was generated from a

1% sample of the original population or a 5% sample) the relationship between the utility and the

sample proportion is similar.

Figure 5: The utility (left) and marginal TCAP (right) for samples drawn from the synthetic populations, contrasted with

samples from the original population, in experiment B

The panel on the right in Figure 5 displays the marginal TCAP results for each synthetic population.

This illustrates that, whilst they all follow a similar curve, the synthetic samples all overestimate the

TCAP compared to the original samples - the samples taken from the synthetic population generated

from a 1% sample of the original population particularly so.

The R-U map (plotting the utility against the marginal TCAP) can be visualised for each synthetic

population. Figure 6 plots them all in one plot, alongside the original population results. Whilst they

all follow a similar pattern, the results from synthetic populations generated from smaller original

samples tend to have higher TCAP values than those generated from larger samples.


Figure 6: R-U map contrasting the results for samples generated from synthetic populations to the original population (with

sample sizes labelled) in experiment B.

Plots and tables of the MAE (and standard deviation) are in Appendix C (utility) and Appendix D

(marginal TCAP). The marginal TCAP plot indicates that the overall pattern of the MAE fluctuates at

lower sample sizes and then generally decreases as the sample size gets larger. The samples from the

synthetic population generated from a 1% sample of the original data have higher MAE than those

generated from larger samples. The samples generated from a 2%, 3% and 4% synthetic population

exhibit unusual behaviour in that they are not in the order one might expect; this is likely due to

variation in the samples for the TCAP key and target variables.

5 Final Thoughts

The results show that, at least in terms of the risk and utility of samples drawn from a synthetic

population, the relationship is similar to the results obtained by drawing samples from the original

population. For Experiment A, which used a synthetic population generated directly from the original

population, the relationship between the synthetic samples and the synthetic population follows

closely the relationship between the original samples and the original population; the lines on the R-U

map were very close together when compared.

For Experiment B, which is perhaps a more likely scenario (since we do not usually have access to the

population data), synthetic populations were generated from samples (of varying sizes) drawn from

the original population. For each synthetic population samples were drawn, and the risk and utility

calculated, with the results compared (in terms of risk and utility) to the results of samples drawn

from the original population. For each of the synthetic populations, the overall relationship, in terms

of the curve on the R-U map, is similar to the original population results. However, each of the

synthetic populations had higher risk (TCAP), pushing the curve upwards; and as the sample that the

synthetic population was generated from gets smaller the curve moves further away from the original

population curve.


Further work on this might involve using a different data synthesizer – Synthpop was selected because

it generally produces data of high utility (and therefore higher risk) – but it may make sense to

perform these experiments with synthetic data of lower utility/risk to determine whether the results

replicate. It is also possible that using different risk and utility metrics may produce different results.

Repeating the experiments with different datasets may also make sense. As in previous work, we have

used a sample to represent the population data, so a further extension would be to access population

data and repeat these experiments.

6 References

Duncan, G.T., Keller-McNulty, S.A. and Stokes, S.L. (2004). Database Security and Confidentiality:

Examining Disclosure Risk vs. Data Utility through the R-U Confidentiality Map.

Elliot, M. (2014). Final Report on the Disclosure Risk Associated with the Synthetic Data Produced

by the SYLLS Team. [online]. Available from:

https://hummedia.manchester.ac.uk/institutes/cmist/archive-publications/reports/2015-02 -Report on

disclosure risk analysis of synthpop synthetic versions of LCF_ final.pdf.

Karr, A.F. et al. (2006). A framework for evaluating the utility of data altered to protect

confidentiality. American Statistician, 60(3), pp.224–232.

Little, C., Elliot, M. and Allmendinger, R. (2022). Comparing the Utility and Disclosure Risk of

Synthetic Data with Samples of Microdata. In Privacy in Statistical Databases. PSD 2022. Springer

International Publishing, pp. 234–249. [online]. Available from: https://doi.org/10.1007/978-3-031-13945-1_17.

Nowok, B., Raab, G.M. and Dibben, C. (2016). Synthpop: Bespoke creation of synthetic data in R.

Journal of Statistical Software, 74(11).

Office for National Statistics, Census Division, University of Manchester, Cathie Marsh Centre for

Census and Survey Research. (2013). Census 1991: Individual Sample of Anonymised Records for

Great Britain (SARs). UK Data Service. [online]. Available from: http://doi.org/10.5255/UKDA-SN-7210-1 [Accessed May 29, 2021].

Taub, J. et al. (2019). Creating the Best Risk-Utility Profile: The Synthetic Data Challenge. In Joint

UNECE/Eurostat Work Session on Statistical Data Confidentiality.

Taub, J. et al. (2018). Differential Correct Attribution Probability for Synthetic Data: An Exploration.

In Privacy in Statistical Databases. pp. 122–137. [online]. Available from:

http://dx.doi.org/10.1007/978-3-319-99771-1_9.


7 Appendix A

The UK 1991 Census dataset sample, 104267 records and 15 variables:

Variable Name | Description | Number of Values | Number Missing

AREAP | Individual SAR area, e.g., Birmingham, Solihull | 21 | 0

AGE | Age (range 0-95) | 94 | 0

COBIRTH | Country of birth | 42 | 0

ECONPRIM | Primary economic position, e.g., Employee FT, Student, Retired (omits individuals < 16) | 10 | 21467 (20.6%)

ETHGROUP | Ethnic group, e.g., White, Black Caribbean | 10 | 0

FAMTYPE | Family type, e.g., Married no children, Cohabiting with children (n/a for individuals in communal establishments or with no family) | 9 | 0

HOURS | Number of hours worked weekly (range 1-81; excludes individuals aged <=16 and those who have not worked in the previous ten years) | 72 | 46979 (45.1%)

LTILL | Limiting long-term illness; two categories: Yes or No | 2 | 0

MSTATUS | Marital status, e.g., Single, married, divorced (individuals < 16 are categorised as 'single') | 5 | 0

QUALNUM | Number of higher educational qualifications; three categories: 0, 1 or 2+ (individuals < 18 have a "0") | 3 | 0

RELAT | Relationship to household head, e.g., Head, spouse, daughter | 8 | 2113 (2.0%)

SEX | Sex; two categories: Male or Female | 2 | 0

SOCLASS | Social class (based on occupation), e.g., Professional, skilled (omits individuals < 16 and those not in paid work in the last 10 years) | 9 | 44537 (42.7%)

TENURE | Tenure of household space, e.g., Owner occupied outright, rented privately (omits individuals not in a household) | 7 | 2113 (2.0%)

TRANWORK | Mode of transport to work, e.g., Bus, on foot (omits individuals not in employment in the week before Census) | 11 | 59249 (56.8%)


8 Appendix B

Experiment A: the mean utility and TCAP scores for each synthetic sample size (to 3dp, n=100), contrasted with the mean utility and TCAP of samples taken from the original population

Overall utility TCAP (3 targets) Marginal TCAP (3 targets)

Sample size Original Synthetic Original Synthetic Original Synthetic

0.1% 0.424 0.420 0.609 0.607 0.300 0.298

0.25% 0.503 0.497 0.613 0.612 0.306 0.306

0.5% 0.559 0.554 0.617 0.618 0.313 0.317

1% 0.610 0.605 0.627 0.627 0.331 0.333

2% 0.657 0.653 0.643 0.643 0.360 0.362

3% 0.682 0.680 0.655 0.655 0.382 0.384

4% 0.702 0.701 0.664 0.666 0.398 0.403

5% 0.715 0.712 0.674 0.675 0.416 0.419

10% 0.762 0.760 0.710 0.713 0.480 0.486

20% 0.810 0.808 0.762 0.768 0.574 0.585

30% 0.842 0.840 0.800 0.807 0.641 0.656

40% 0.865 0.864 0.831 0.840 0.696 0.713

50% 0.887 0.887 0.858 0.868 0.746 0.764

60% 0.905 0.904 0.885 0.895 0.794 0.812

70% 0.922 0.921 0.913 0.921 0.843 0.859

80% 0.940 0.939 0.941 0.947 0.895 0.905

90% 0.960 0.960 0.970 0.973 0.947 0.952

95% 0.974 0.974 0.985 0.986 0.974 0.976

96% 0.977 0.977 0.988 0.989 0.979 0.981

97% 0.980 0.980 0.991 0.992 0.984 0.985

98% 0.985 0.985 0.994 0.995 0.989 0.990

99% 0.990 0.990 0.997 0.997 0.995 0.995

Experiment A: the standard deviation to 4dp (n=100) of the utility and TCAP scores for the original and synthetic data samples

Overall utility TCAP (3 targets) Marginal TCAP (3 targets)

Sample size Original Synthetic Original Synthetic Original Synthetic

0.1% 0.0106 0.0125 0.0192 0.0185 0.0344 0.0331

0.25% 0.0114 0.0122 0.0108 0.0107 0.0193 0.0192

0.5% 0.0101 0.0108 0.0077 0.0078 0.0138 0.0139

1% 0.0078 0.0089 0.0061 0.0062 0.0109 0.0110

2% 0.0064 0.0076 0.0044 0.0039 0.0080 0.0070

3% 0.0066 0.0066 0.0034 0.0030 0.0061 0.0053

4% 0.0060 0.0057 0.0029 0.0031 0.0052 0.0055

5% 0.0068 0.0066 0.0028 0.0031 0.0050 0.0056

10% 0.0054 0.0065 0.0024 0.0022 0.0042 0.0039

20% 0.0059 0.0061 0.0021 0.0018 0.0037 0.0032

30% 0.0049 0.0060 0.0019 0.0019 0.0035 0.0033

40% 0.0067 0.0050 0.0016 0.0018 0.0028 0.0033

50% 0.0048 0.0049 0.0022 0.0017 0.0039 0.0030

60% 0.0045 0.0041 0.0021 0.0017 0.0037 0.0030

70% 0.0041 0.0041 0.0018 0.0017 0.0032 0.0031

80% 0.0036 0.0038 0.0021 0.0014 0.0038 0.0025

90% 0.0027 0.0028 0.0017 0.0014 0.0030 0.0025

95% 0.0019 0.0018 0.0013 0.0010 0.0024 0.0017

96% 0.0019 0.0016 0.0012 0.0010 0.0021 0.0018

97% 0.0016 0.0015 0.0011 0.0008 0.0020 0.0014

98% 0.0011 0.0011 0.0009 0.0008 0.0016 0.0014

99% 0.0010 0.0009 0.0005 0.0005 0.0010 0.0009


Experiment A: Mean Absolute Error (n=100) and standard deviation to 4dp of the utility and TCAP

values of synthetic samples compared to the original samples

Overall utility TCAP (3 targets) Marginal TCAP (3 targets)

Sample size MAE SD MAE SD MAE SD

0.1% 0.0108 0.0074 0.0147 0.0113 0.0262 0.0201

0.25% 0.0111 0.0081 0.0080 0.0071 0.0143 0.0127

0.5% 0.0088 0.0075 0.0062 0.0049 0.0114 0.0090

1% 0.0079 0.0065 0.0048 0.0038 0.0088 0.0068

2% 0.0060 0.0062 0.0032 0.0023 0.0060 0.0043

3% 0.0052 0.0043 0.0024 0.0017 0.0045 0.0031

4% 0.0043 0.0039 0.0029 0.0022 0.0059 0.0044

5% 0.0053 0.0050 0.0026 0.0021 0.0051 0.0040

10% 0.0053 0.0047 0.0029 0.0018 0.0061 0.0035

20% 0.0050 0.0039 0.0054 0.0018 0.0106 0.0032

30% 0.0045 0.0044 0.0075 0.0019 0.0142 0.0033

40% 0.0040 0.0032 0.0091 0.0018 0.0169 0.0033

50% 0.0039 0.0030 0.0101 0.0017 0.0186 0.0030

60% 0.0034 0.0026 0.0097 0.0017 0.0177 0.0030

70% 0.0035 0.0025 0.0084 0.0017 0.0153 0.0031

80% 0.0030 0.0025 0.0055 0.0014 0.0100 0.0025

90% 0.0022 0.0018 0.0029 0.0014 0.0054 0.0024

95% 0.0014 0.0012 0.0013 0.0008 0.0024 0.0014

96% 0.0012 0.0011 0.0012 0.0007 0.0022 0.0013

97% 0.0012 0.0009 0.0009 0.0006 0.0016 0.0011

98% 0.0009 0.0007 0.0008 0.0005 0.0014 0.0009

99% 0.0007 0.0005 0.0005 0.0003 0.0009 0.0005


9 Appendix C

Experiment B: Mean utility of original samples and synthetic samples, by sample size to 3dp. This is the mean utility (across 100 samples) of each sample size (the rows) for each of the synthetic populations (columns).

Sample size   Original population   Synthetic population generated from: 1% sample   2% sample   3% sample   4% sample   5% sample

0.1% 0.424 0.429 0.425 0.428 0.425 0.425

0.25% 0.503 0.509 0.506 0.505 0.505 0.500

0.5% 0.559 0.569 0.564 0.564 0.566 0.558

1% 0.610 0.626 0.617 0.618 0.619 0.611

2% 0.657 0.673 0.666 0.667 0.666 0.660

3% 0.682 0.700 0.692 0.694 0.694 0.687

4% 0.702 0.718 0.712 0.714 0.711 0.706

5% 0.715 0.733 0.725 0.728 0.727 0.721

10% 0.762 0.776 0.771 0.773 0.772 0.766

20% 0.810 0.823 0.817 0.820 0.818 0.813

30% 0.842 0.851 0.848 0.849 0.848 0.844

40% 0.865 0.874 0.871 0.871 0.872 0.868

50% 0.887 0.894 0.891 0.892 0.893 0.890

60% 0.905 0.911 0.909 0.909 0.909 0.907

70% 0.922 0.927 0.926 0.925 0.925 0.924

80% 0.940 0.944 0.944 0.943 0.943 0.941

90% 0.960 0.963 0.962 0.962 0.962 0.961

95% 0.974 0.976 0.974 0.975 0.975 0.974

96% 0.977 0.978 0.978 0.978 0.978 0.977

97% 0.980 0.981 0.981 0.981 0.981 0.980

98% 0.985 0.985 0.985 0.985 0.985 0.984

99% 0.990 0.990 0.990 0.990 0.990 0.990

Experiment B: the standard deviation to 4dp (n=100) of the utility for samples taken from the original population, and the five synthetic populations

Sample size   Original population   Synthetic population generated from: 1% sample   2% sample   3% sample   4% sample   5% sample

0.1% 0.0106 0.0140 0.0125 0.0145 0.0113 0.0116

0.25% 0.0114 0.0110 0.0120 0.0119 0.0102 0.0113

0.5% 0.0101 0.0095 0.0092 0.0102 0.0097 0.0084

1% 0.0078 0.0086 0.0087 0.0085 0.0075 0.0076

2% 0.0064 0.0073 0.0066 0.0061 0.0066 0.0070

3% 0.0066 0.0062 0.0069 0.0070 0.0066 0.0064

4% 0.0060 0.0064 0.0059 0.0057 0.0073 0.0071

5% 0.0068 0.0062 0.0067 0.0058 0.0064 0.0057

10% 0.0054 0.0071 0.0063 0.0062 0.0056 0.0060

20% 0.0059 0.0056 0.0057 0.0047 0.0060 0.0064

30% 0.0049 0.0058 0.0056 0.0064 0.0055 0.0060

40% 0.0067 0.0059 0.0051 0.0053 0.0053 0.0048

50% 0.0048 0.0050 0.0052 0.0052 0.0049 0.0047

60% 0.0045 0.0045 0.0046 0.0051 0.0046 0.0045

70% 0.0041 0.0046 0.0046 0.0044 0.0044 0.0043

80% 0.0036 0.0035 0.0035 0.0034 0.0037 0.0038

90% 0.0027 0.0025 0.0024 0.0025 0.0026 0.0028

95% 0.0019 0.0020 0.0020 0.0019 0.0019 0.0019

96% 0.0019 0.0016 0.0018 0.0017 0.0017 0.0018

97% 0.0016 0.0014 0.0014 0.0012 0.0014 0.0016

98% 0.0011 0.0012 0.0014 0.0012 0.0011 0.0015

99% 0.0010 0.0009 0.0009 0.0009 0.0009 0.0010


Experiment B: The MAE to 4dp (n=100) between the utility of the original population samples and each of the synthetic population samples

Sample size   Synthetic population generated from: 1% sample   2% sample   3% sample   4% sample   5% sample

0.1% 0.0121 0.0097 0.0109 0.0092 0.0093

0.25% 0.0103 0.0099 0.0095 0.0081 0.0092

0.5% 0.0118 0.0084 0.0097 0.0091 0.0069

1% 0.0166 0.0096 0.0097 0.0101 0.0065

2% 0.0159 0.0092 0.0102 0.0096 0.0060

3% 0.0181 0.0104 0.0128 0.0125 0.0066

4% 0.0158 0.0103 0.0119 0.0108 0.0067

5% 0.0179 0.0107 0.0128 0.0121 0.0068

10% 0.0138 0.0092 0.0110 0.0096 0.0057

20% 0.0125 0.0077 0.0098 0.0086 0.0061

30% 0.0097 0.0076 0.0085 0.0074 0.0056

40% 0.0092 0.0067 0.0069 0.0072 0.0047

50% 0.0079 0.0057 0.0065 0.0069 0.0044

60% 0.0065 0.0056 0.0054 0.0052 0.0042

70% 0.0060 0.0047 0.0045 0.0045 0.0038

80% 0.0048 0.0046 0.0039 0.0042 0.0032

90% 0.0037 0.0027 0.0027 0.0028 0.0025

95% 0.0022 0.0016 0.0017 0.0016 0.0016

96% 0.0018 0.0016 0.0016 0.0015 0.0015

97% 0.0014 0.0013 0.0011 0.0013 0.0013

98% 0.0011 0.0011 0.0010 0.0010 0.0011

99% 0.0008 0.0007 0.0008 0.0007 0.0008

Experiment B: the standard deviation for the MAE of the utility, to 4dp

Sample size   Synthetic population generated from: 1% sample   2% sample   3% sample   4% sample   5% sample

0.1% 0.0085 0.0079 0.0102 0.0067 0.0069

0.25% 0.0071 0.0072 0.0073 0.0063 0.0070

0.5% 0.0069 0.0063 0.0063 0.0074 0.0048

1% 0.0079 0.0055 0.0063 0.0063 0.0042

2% 0.0071 0.0056 0.0049 0.0057 0.0042

3% 0.0062 0.0059 0.0056 0.0058 0.0047

4% 0.0063 0.0048 0.0051 0.0052 0.0045

5% 0.0061 0.0055 0.0052 0.0058 0.0043

10% 0.0056 0.0047 0.0046 0.0050 0.0037

20% 0.0051 0.0045 0.0039 0.0046 0.0037

30% 0.0046 0.0039 0.0043 0.0041 0.0034

40% 0.0044 0.0041 0.0042 0.0041 0.0031

50% 0.0042 0.0037 0.0037 0.0037 0.0032

60% 0.0039 0.0032 0.0034 0.0034 0.0025

70% 0.0032 0.0032 0.0027 0.0028 0.0028

80% 0.0027 0.0025 0.0024 0.0026 0.0026

90% 0.0019 0.0017 0.0017 0.0019 0.0015

95% 0.0014 0.0012 0.0012 0.0012 0.0010

96% 0.0011 0.0011 0.0011 0.0011 0.0010

97% 0.0010 0.0009 0.0007 0.0009 0.0010

98% 0.0008 0.0008 0.0007 0.0007 0.0010

99% 0.0005 0.0005 0.0005 0.0005 0.0006

Experiment B: the MAE for the utility by sample proportion, for each synthetic population, with error bars indicating ± 1 standard deviation

10 Appendix D

Experiment B: the mean (n=100) marginal TCAP values from each of the synthetic populations, and the original population, to 3dp.

Sample size   Original population   Synthetic population (1%)   Synthetic population (2%)   Synthetic population (3%)   Synthetic population (4%)   Synthetic population (5%)

0.1% 0.300 0.432 0.410 0.373 0.385 0.378

0.25% 0.306 0.448 0.414 0.378 0.388 0.381

0.5% 0.313 0.458 0.425 0.390 0.398 0.393

1% 0.331 0.482 0.440 0.415 0.418 0.412

2% 0.360 0.516 0.468 0.448 0.450 0.439

3% 0.382 0.539 0.489 0.472 0.473 0.461

4% 0.398 0.558 0.504 0.491 0.491 0.478

5% 0.416 0.574 0.520 0.506 0.510 0.492

10% 0.480 0.635 0.584 0.569 0.574 0.556

20% 0.574 0.717 0.670 0.658 0.670 0.646

30% 0.641 0.776 0.731 0.723 0.738 0.711

40% 0.696 0.822 0.781 0.775 0.790 0.765

50% 0.746 0.861 0.823 0.820 0.833 0.811

60% 0.794 0.894 0.863 0.860 0.872 0.853

70% 0.843 0.924 0.899 0.898 0.907 0.892

80% 0.895 0.951 0.934 0.933 0.939 0.930

90% 0.947 0.976 0.968 0.967 0.970 0.965

95% 0.974 0.988 0.984 0.984 0.985 0.983

96% 0.979 0.991 0.987 0.987 0.988 0.986

97% 0.984 0.993 0.991 0.990 0.991 0.990

98% 0.989 0.995 0.994 0.993 0.994 0.993

99% 0.995 0.998 0.997 0.997 0.997 0.997

Experiment B: The standard deviation (to 4dp) of the marginal TCAP scores for samples from each of the synthetic populations. The original population results are included for comparison.

Sample size   Original population   Synthetic population (1%)   Synthetic population (2%)   Synthetic population (3%)   Synthetic population (4%)   Synthetic population (5%)

0.1% 0.0344 0.0374 0.0456 0.0429 0.0389 0.0374

0.25% 0.0193 0.0246 0.0250 0.0256 0.0270 0.0241

0.5% 0.0138 0.0176 0.0179 0.0162 0.0170 0.0177

1% 0.0109 0.0097 0.0128 0.0135 0.0118 0.0114

2% 0.0080 0.0082 0.0086 0.0086 0.0088 0.0092

3% 0.0061 0.0066 0.0066 0.0080 0.0062 0.0067

4% 0.0052 0.0052 0.0059 0.0067 0.0059 0.0058

5% 0.0050 0.0046 0.0057 0.0052 0.0048 0.0047

10% 0.0042 0.0031 0.0040 0.0045 0.0042 0.0040

20% 0.0037 0.0025 0.0039 0.0037 0.0032 0.0032

30% 0.0035 0.0025 0.0028 0.0032 0.0027 0.0031

40% 0.0028 0.0024 0.0028 0.0025 0.0024 0.0031

50% 0.0039 0.0021 0.0025 0.0025 0.0021 0.0027

60% 0.0037 0.0016 0.0019 0.0024 0.0019 0.0025

70% 0.0032 0.0016 0.0020 0.0020 0.0021 0.0023

80% 0.0038 0.0011 0.0016 0.0017 0.0013 0.0018

90% 0.0030 0.0008 0.0012 0.0013 0.0010 0.0015

95% 0.0024 0.0005 0.0009 0.0009 0.0008 0.0007

96% 0.0021 0.0005 0.0008 0.0008 0.0007 0.0009

97% 0.0020 0.0005 0.0008 0.0008 0.0006 0.0007

98% 0.0016 0.0004 0.0006 0.0007 0.0005 0.0007

99% 0.0010 0.0003 0.0004 0.0004 0.0003 0.0005

Experiment B: the MAE of the marginal TCAP for each synthetic population by sample proportion, with error bars indicating ± 1 standard deviation

Do samples taken from a synthetic microdata population replicate the relationship between samples taken from an original population?

Mark Elliot, Claire Little, Richard Allmendinger

University of Manchester

Introduction Is the relationship between:

• a population dataset and samples drawn from it

replicated by

• a synthetic version of the same population and samples drawn from it?

Population data usually unavailable - if synthetic samples can mimic this relationship, it would be useful

Extends previous work (Little et al., 2022) using samples to determine the sample equivalence of synthetic data to the original dataset • (to be able to say, for example, “the synthetic dataset has utility equivalent to a 10% original sample and

risk equivalent to a 5% original sample”)

Study Design - Data UK 1991 Census microdata (University of Manchester, 2023) is used to represent the population • subsetted on geographical region (West Midlands)

• 104267 records

• 15 variables (13 categorical, 2 numerical)

Area | Age | Country of birth | Economic group | Ethnic group | Family type | Hours worked | Long term illness | Marital status | Num qualifications | Relationship | Sex | Social class | Transport to work | Housing tenure

Sandwell | 7 | England | NA | Bangladeshi | Married dep. children | NA | No | Single | None | Child | M | NA | NA | Own outright

Coventry | 40 | England | Employee FT | White | NA | 50 | No | Married | None | NA | F | Manag. tech | Car | NA

Walsall | 70 | England | Retired | White | Married no children | 39 | Yes | Married | None | Household head | M | Part skilled | NA | Own buying

Study Design synthpop (Nowok et al. 2016) used to generate synthetic data • Default parameters

• Visit sequence ordered by ascending number of categories, with numerical variables first

Data samples were drawn randomly without replacement

Various sample fractions • 0.1%, 0.25%, 0.5%, 1%, 2%, 3%, 4%, 5%, 10%, 20%, …, 80%, 90%, 95%, 96%, 97%, 98%, 99%

◦ 22 overall

• n = 100 samples randomly drawn for each sample fraction

• 2200 samples
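As a concrete illustration of this setup, the sketch below (in R, since synthpop is an R package) shows one way the repeated sampling and the synthesis step could be wired together. It is a minimal sketch under assumptions: the population is assumed to sit in a data frame called `pop`, only a few of the 22 sample fractions are listed, the seed is illustrative, and the synthpop call uses default parameters apart from the output size; it is not the authors' code.

```r
# Minimal sketch, not the authors' code. Assumes the 1991 Census extract is in
# a data frame `pop` (104,267 rows, 15 variables).
library(synthpop)

set.seed(2023)                                                            # illustrative seed
fractions <- c(0.001, 0.0025, 0.005, 0.01, 0.05, 0.10, 0.50, 0.90, 0.99)  # subset of the 22

# 100 simple random samples (drawn without replacement) per fraction
samples <- lapply(fractions, function(f) {
  replicate(100,
            pop[sample(nrow(pop), size = round(f * nrow(pop))), ],
            simplify = FALSE)
})

# Experiment B style: synthesise a population-sized dataset from a 1% sample.
# `k` sets the number of synthetic rows; a visit.sequence argument could be
# supplied to order the synthesis as described on this slide.
samp_1pct <- pop[sample(nrow(pop), size = round(0.01 * nrow(pop))), ]
syn_pop   <- syn(samp_1pct, k = nrow(pop), seed = 2023)$syn
```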

Study Design – Metrics Disclosure Risk • For synthetic data reidentification risk not meaningful

• Attribution is possible

• Measured using the Targeted Correct Attribution Probability (TCAP) (Taub & Elliot, 2019) ◦ Probability that an intruder makes a correct attribution inference about a particular target variable, given partial

knowledge (key variables)

• We use marginal TCAP score ◦ Calculate baseline – probability of intruder being correct if they drew randomly from univariate distribution of

target variable

◦ Scale TCAP score between baseline and 1

◦ marginal TCAP indicates risk above the baseline

◦ Value between -x and 1, where a higher value indicates greater risk
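A minimal sketch of the baseline scaling described above is shown below. The exact baseline formula (the chance of a correct guess when drawing at random from the target's univariate distribution) is our reading of the slide, not code taken from the authors, and the function names are hypothetical.

```r
# Illustrative only. `tcap` is a raw TCAP score for one target variable;
# `target_orig` is that target variable in the original data.
marginal_tcap <- function(tcap, target_orig) {
  p <- table(target_orig) / length(target_orig)
  baseline <- sum(p^2)  # assumed: probability of a correct attribution when the
                        # intruder draws a value at random from the univariate
                        # distribution of the target variable
  (tcap - baseline) / (1 - baseline)  # 0 = baseline, 1 = always correct;
                                      # values below the baseline come out negative
}
```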

Study Design – Metrics Utility • Confidence Interval Overlap (CIO) (Karr et al., 2006)

◦ Logistic regressions performed on synthetic and original data (using same target/predictors for each)

◦ Regression coefficients are compared

◦ Score between 0 (no overlap) and 1

• Ratio of Counts/Estimates (ROC) ◦ For univariate and bivariate cross-tabulations

◦ Compares proportion of synthetic and original data estimates by taking the ratio

◦ Score between 0 and 1

• Overall utility score ◦ Mean of CIO, ROC univariate and ROC bivariate

◦ Value between 0 and 1, where a higher value indicates greater utility
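The two utility components can be sketched as follows; these are illustrative formulas consistent with the descriptions above (and, for the CIO, with Karr et al., 2006), not the authors' implementation, and all function names are hypothetical.

```r
# Confidence Interval Overlap for a single coefficient: the overlapping length
# as a share of each interval's width, averaged (0 = no overlap, 1 = identical).
cio <- function(lo_orig, hi_orig, lo_syn, hi_syn) {
  overlap <- max(0, min(hi_orig, hi_syn) - max(lo_orig, lo_syn))
  0.5 * (overlap / (hi_orig - lo_orig) + overlap / (hi_syn - lo_syn))
}

# Ratio of Counts for one (cross-)tabulation: smaller estimate over larger
# estimate in each cell, averaged; two empty cells are treated as agreeing.
roc <- function(orig_counts, syn_counts) {
  hi <- pmax(orig_counts, syn_counts)
  lo <- pmin(orig_counts, syn_counts)
  mean(ifelse(hi == 0, 1, lo / hi))
}

# Overall utility, as described above: mean of the CIO, the univariate ROC and
# the bivariate ROC (each already averaged over coefficients / tables).
overall_utility <- function(cio_mean, roc_uni, roc_bi) {
  mean(c(cio_mean, roc_uni, roc_bi))
}
```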

Study Design – Metrics Risk-Utility comparison • R-U confidentiality map (developed by Duncan et al. 2004)

• Plots utility against risk (TCAP) score

• Ideally disclosure risk is minimised, utility is maximised

Synthetic / Sample data • Utility and risk metrics calculated in the same way for samples of original data as for

samples of synthetic data ◦ By comparing against the dataset that the samples were drawn from

• Allows comparison on R-U map
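Given a table of results with one row per sample size, the R-U map reduces to a simple plot of utility against risk; the snippet below is a sketch assuming a data frame `results` with columns `utility` and `marginal_tcap` (both hypothetical names).

```r
# Illustrative R-U map in base R: mean utility plotted against mean marginal
# TCAP for each sample fraction.
plot(results$marginal_tcap, results$utility, type = "b", pch = 19,
     xlab = "Disclosure risk (marginal TCAP)", ylab = "Utility",
     main = "R-U map (sketch)")
```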

Results - Experiment A A synthetic population was generated from the original population

Random samples taken from both populations

Risk and utility calculated for each sample compared to the population it was sampled from

Results compared

Experiment A: Risk-Utility map showing the original samples and synthetic samples

Experiment A: Individual plots showing the original samples and synthetic samples for:

Utility Risk (Marginal TCAP)

Mean Absolute Error of the utility and marginal TCAP for each synthetic sample size (calculated against the original samples, error bars show +- 1 standard deviation)

Results - Experiment B UK 1991 Census data represents the population

Take samples from the population (1%, 2%, 3%, 4%, 5%)

Generate synthetic populations from the samples

Random samples taken from original and synthetic populations

Risk and utility calculated for each sample compared to the population it was sampled from

Results compared

Experiment B Synthetic population generated from smaller samples

• A more likely scenario

Process:

• Take samples from the original population

• 1%, 2%, 3%, 4%, 5%

• From each sample, a synthetic dataset the same size as the population (n=104267) was generated

• Utility increases with sample size

• TCAP differs

Synthetic population generated from a:

Utility Marginal TCAP

1% sample 0.539 0.407

2% sample 0.585 0.351

3% sample 0.591 0.370

4% sample 0.616 0.409

5% sample 0.643 0.423

Risk-Utility map contrasting the results for samples drawn from synthetic populations to those drawn from original population

Individual plots contrasting the results for samples drawn from synthetic populations to samples drawn from the original population, for:

Utility Risk (Marginal TCAP)

Mean Absolute Error of the utility and marginal TCAP for each synthetic sample size (calculated against the original samples, error bars show +- 1 standard deviation)

Utility Risk (marginal TCAP)

An aside: Risk-Utility map contrasting the results for samples drawn from synthetic populations to those drawn from the original population, where the synthetic population also contains the original sample used to generate it

• very little difference whether or not the original sample is included

Observations Experiment A → Synthetic population generated from original population • Relationship between synthetic samples and the synthetic population follows closely the

relationship between original samples and the original population

Experiment B → Synthetic populations generated from samples drawn from original population • Overall relationship similar to the original population results (similar curve on the R-U map)

• But the smaller the original sample (used to generate the synthetic population) the more the risk is overestimated

• Utility similar no matter the original sample size

Caveats Experiments conducted on samples of Census microdata

◦ May not generalise to full population data

Only one data synthesis method used ◦ Synthpop – which tends to create high utility (but also higher risk) synthetic data

Only one dataset used ◦ It may be useful to repeat this on other datasets

Underestimation of the risk of samples, relative to synthetic data ◦ Whilst synthetic data should not contain re-identification risk, sample data does

Risk measure uses a response knowledge attribution disclosure ◦ OK for Census data, but presence detection may be a significant risk in other data

Different risk and utility metrics may produce different results

Future Work Run experiments on full population data

Use different data synthesis methods

Use different datasets

Assess other utility measures

Assess other disclosure control methods

References Nowok, B., Raab, G.M. and Dibben, C., 2016. synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), pp.1-26.

Little, C., Elliot, M. & Allmendinger, R., 2022, Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata. In Privacy in Statistical Databases: International Conference, PSD 2022, Paris, France, September 21–23, 2022, Proceedings. Lecture Notes in Computer Science vol. 13463 LNCS, Springer Nature, Cham, Switzerland, pp. 234-249. https://doi.org/10.1007/978-3-031-13945-1_17

University of Manchester, Cathie Marsh Centre for Census and Survey Research, Office for National Statistics, Census Division. (2023). Census 1991: Individual Sample of Anonymised Records for Great Britain (SARs). [data collection]. UK Data Service. SN: 7210, DOI: http://doi.org/10.5255/UKDA-SN-7210-1

Taub, J., Elliot, M., Raab, G., Charest, A., Chen, C., O'Keefe, C. M., Nixon, M. P., Snoke, J., Slavkovic, A., 2019. The synthetic data challenge. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality. https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2019/mtg1/SDC2019_S3_UK_Synthethic_Data_Challenge_Elliot_AD.pdf

Duncan, G.T., Keller-McNulty, S.A. and Stokes, S.L., 2004. Database security and confidentiality: examining disclosure risk vs. data utility through the RU confidentiality map.

Karr, A.F., Kohnen, C.N., Oganian, A., Reiter, J.P., Sanil, A.P.: A framework for evaluating the utility of data altered to protect confidentiality. Am. Stat. 60(3), 224–232 (2006).


Intruder testing – an empirical measure of the quality of Census 2021 England and Wales Disclosure Control methods, ONS UK

confidentiality rules, individual data, swapping records, cell key method, disclosure rules, intruder testing,

Languages and translations
English

UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Title : Intruder testing – an empirical measure of the quality of Census 2021

England and Wales Disclosure Control methods

Author(s) Samantha Trace (Office For National Statistics) Dominic Nelson (Office For National Statistics

e-mail [email protected]

Abstract

By law, the Office for National Statistics (ONS) must protect the confidentiality of respondents to

Census 2021. We protected the confidentiality of individuals' data in three ways: swapping records

between areas, applying a cell key method to each table, and applying disclosure rules in deciding

which tables could be published. To assess the effectiveness of these methods and provide assurance,

an intruder test was performed on Census 2021 data using a secure version of the outputs system. 51

intruders were recruited to attempt to identify individuals in the planned data outputs. 30 Intruders

took part, 81 claims were made, and more than half of these claims (41/81) were incorrect. Further

steps were taken reduce the risks identified by the test, making the data the majority of these claims

were made from no longer possible to access through the Create a Custom Dataset system. This gave

the Office for National Statistics evidence there was sufficient uncertainty in the data to meet the

standard required by legal guidance and we would meet our ethical duty to protect confidentiality.


1 Introduction

The Office for National Statistics (ONS) has legal obligations under the Statistics and Registration

Service Act (SRSA, 2007) Section 39 and the Data Protection Act (2018) that require the ONS not to

reveal the identity or private information about an individual or organisation.

We have a pledge to respondents that the information will only be used for statistical purposes, so we

must look after and protect the information that is provided to us. Moreover, a breach of disclosure

could lead to criminal proceedings against an individual who has released or authorised the release of

personal information, as defined under Section 39 of the SRSA.

The SRSA defines "personal information" as information that identifies a particular person if the

identity of that person:

• is specified in the information

• can be deduced from the information

• can be deduced from the information taken together with any other published information

Therefore, in order for data to be released, the risk of identifying individuals from it, potentially with

additional publicly available information, must be minimal.

Intruder testing is an empirical test to check that the measures applied to make data sufficiently

difficult to identify individuals within have been successful. This involves recruiting ‘friendly

intruders’ who emulate the actions of potential ‘real intruders’ upon the data.

The standard that needs to be met is suggested by the National Statistician’s Guidance, “the design

and selection of intruder scenarios should be informed by the means likely reasonably to be used to

identify an individual in the statistic”.

So, intruder tests are designed to measure what could be done with the means likely to be available to an opportunistic attacker; they do not have to cover every imaginable scenario, just the most probable.

The 2011 Census outputs were tested in this way, and the findings were useful in providing assurance

that the disclosure control measures used on the data were adequate, and provided evidence as to what further steps should be taken to reduce disclosure risk further. Other ad-hoc exercises have been

undertaken by the ONS as required since, with the same purpose – to determine the level of

identification risk in a dataset.

For Census 2021, new disclosure control methods were required for a new output system. On top of

the imputation of missing records done to make the Census as representative as it can be, which also

adds doubt as to whether a particular record is ‘real’ or not, there were new measures in place to

protect the data:

• Targeted Record Swapping – swapping households that are marked as unique in the data with

a similar record in the local area. The geographies were changed for between 7% and 10% of

households, and for between 2% and 5% of individuals in communal establishments.

• Cell Key Perturbation - this adds noise to the figures, making slight changes to cell counts (including zero cell counts) by a method which means that where the same records are presented in a cell, the published number remains consistent (a minimal illustration of this mechanism is sketched below). A typical dataset would have around 14% of cell counts perturbed by a small amount, and small counts were more likely to have been perturbed than large counts.


• Disclosure rules (in the Create a Custom Dataset system) – automated rules including

measures of how many small counts are in the table, that can stop data being given for an area.

These methods were intended to combine as a ‘lighter touch’ approach, allowing some detail to be

possible at low level geography, whilst maintaining the usefulness of the data within the new Create a

custom dataset (CACD) system, and other census outputs. The CACD system allows users to create

their own multivariate datasets, so the rules are set to prevent the possibility of identifying a single

record and building up a list of potential attributes. The level of identification risk should still be

minimal, using information public or private.
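The following self-contained sketch illustrates the cell key idea referred to above. It is an illustration of the general mechanism only: the perturbation table, key range and all names are toy values chosen here for illustration, not the ONS implementation or its parameters, and zero counts (which the real method also perturbs) are not handled.

```r
# Toy sketch of the cell key method (illustrative only, not ONS's implementation).
# Every record carries a fixed random "record key"; a cell's key is the sum of
# its records' keys modulo a constant, so the same set of records always yields
# the same key and hence the same perturbed count.
set.seed(1)
n_records   <- 1000
max_key     <- 256
record_keys <- sample(0:(max_key - 1), n_records, replace = TRUE)  # fixed per record

# Toy perturbation table: rows = cell count (capped at 20), columns = cell key,
# entries = small adjustment applied to the published count.
ptable <- matrix(sample(c(-1, 0, 0, 0, 1), 20 * max_key, replace = TRUE), nrow = 20)

perturb_cell <- function(record_ids) {
  count    <- length(record_ids)                        # true cell count
  cell_key <- sum(record_keys[record_ids]) %% max_key   # same records -> same key
  count + ptable[min(count, nrow(ptable)), cell_key + 1]
}

perturb_cell(c(3, 17, 42))  # the same set of records always returns the same value
```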

2 The Intruder Test

2.1 Method

51 intruders, all ONS employees, were recruited. All had appropriate security clearances and

consented to an enhanced non-disclosure agreement. They were given training on how to use the

output system, and possible methods of working against our statistical disclosure controls. A safe area

of an approved file management system was set up, and they were given access to individualised

folders to record their findings and keep notes.

A version of the planned outputs system was created on a secure internal-access platform and loaded

with the usual resident database. This is the main basis for Census outputs as it includes all people

who are ‘usually resident’ at the enumeration address at the time of the census. This was also

programmed with all the current planned variables and classifications for those variables. A version of

the planned statistical disclosure rules was placed in this system, to auto-control outputs requested by

intruders, and deny access if the output did not pass these rules. The system had built-in perturbation, so outputs were automatically created with some values slightly changed.

The data placed in the system had targeted swapping already applied and imputed records present, just

as it would be when published. The main Census 2021 geographies were available in this system; the smallest geography used was Output Area (OA), an area with at least 100 persons in it, though more typically around 400.

Intruders were given individual access to the system and encouraged to collaborate on a private Teams channel and to share resources, such as web pages, hints and tips. An errors log was set up to record system issues. For each claim, intruders recorded the details, including the geography, variables and classifications used, the name and address of the individual claimed to have been found, and their confidence in the identification as a percentage.

Claims were transcribed from the individual file folders to a single sheet that the checkers had access

to. These checkers were from a different team to ensure the data was fully firewalled from the

intruders, and no actual disclosure would result from the exercise.

The checkers had access to record level data, so could determine whether a claim was correct, partial,

or incorrect. A correct claim would match on name and approximate address. Inaccurate address

matches were counted as correct so long as they would have been within the geographical area used to

make the claim.

Inaccurate name matching was counted as incorrect. A partial match would be where a claim was

made on a 1 in a cell, where more records would have been in that cell but were perturbed down to 1.


2.2 Limitations

We had considered engaging a third party to take part in the test; however, we could not be sure of the start time, and there are few companies engaged in exactly this sort of testing that could have gained

security clearances in time, so it was deemed impractical to engage a third party in this exercise.

Therefore, there may be some organisational biases in our exercise.

Although attempts were made to recruit people from more sparsely populated areas of England and

Wales, most people were still clustered geographically around ONS offices and reflect the socio-

demographic mix of ONS staff rather than the general population.

Intruders also had to use their spare time around their regular work, and the exercise ran in August

when many took leave, although it took place over three weeks to allow more people to participate.

The dataset looked at was not the full range of planned Census outputs. The final system includes not

just Usual Resident, but also Usual Residents in Households and Communal Establishments,

Households and Household reference Persons. The Usual Resident dataset used was taken to be a

sufficient test of the general level of risk in the data.

2.3 Results

2.31 Claims

81 identification claims were made, excluding duplicates. These claims are where an intruder highlighted a '1' cell count in a dataset, gave the details of this, and claimed they knew which person it related to. Two claims listed various methods of approaching the same identification; in these cases this was still counted as one claim, and measures such as cell count were taken from the first table stated.

40/81 or 49% of identification claims were correct (the intruder correctly named an individual in a

cell)

8/81 or 10% of identification claims were partially correct (the intruder correctly names an individual

in a cell of apparent size 1, but the cell count is greater than 1 – due to cell key perturbation – the cell

could have been representing any of the people in it)

33/81 or 41% of identification claims were incorrect, the record marked in the cell did not relate to the

individual named.

No attribute claims were made; an attribute claim is where an intruder claims to have found something new about a person through the data presented.

Of the initial 51, 12 dropped out, citing workload or holiday as reasons, and a further 9 filed no notes

and made no claims. Of the 30 intruders that took part, 6 (20%) did not make any claims. Reasons

cited included not being able to claim anything with certainty, some may also have lacked time to

spend on the project.

2.32 Confidence


Figure 1: Confidence, correctness and number of claims

This histogram shows numbers of claims by the percentage confidence the intruder reported in the

claim, banded by whether they were correct, partially correct or incorrect.

Confidence in claims ranged from 7.5% to 100%. The mean confidence placed in a claim was 73.6%; the median was 80%.

2.33 Cell Counts and correctness

The cell count is the number of cells (row * columns) present in the table used to make the claim.

A wide range of table sizes were used to inform claims, (range 7 – 2100, mean 183, median 182).

Figure 2: Cell counts and correctness

The scatter plot shows claims rated by percentage correctness: partially correct claims are scored as 50% correct and fully correct claims as 100% correct, with the number of cells in the dataset used for the claim on the vertical axis. One outlier (cell count 2100) was removed. This shows a weak positive correlation (R² = 0.0986); with the outlier included, the relationship was zero. This could suggest that higher cell counts may increase the possibility of identification within limits, but very high cell counts may not.
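For illustration, a relationship of this kind can be summarised with a simple linear fit; the snippet below is an assumed sketch with toy values, not the analysis code used for the exercise.

```r
# Illustrative only: correctness scored 0 / 0.5 / 1 and regressed on the number
# of cells in the table used for the claim (toy data shown).
claims <- data.frame(cell_count  = c(120, 80, 300, 150, 220),
                     correctness = c(1, 0, 0.5, 1, 0))
fit <- lm(correctness ~ cell_count, data = claims)
summary(fit)$r.squared   # analogous to the R-squared quoted for Figure 2
```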

2.34 Variables Used

To assess which variables were most likely to result in a claim, and which in a correct claim, the claims were coded to variable type. Any table in which a single classification made up the bulk of the cells was coded to that variable: for example, any claim using single year of age, or single year of age plus another less detailed classification such as sex, was coded to 'age', whereas any claim using a 3-part country of birth classification, 10-part age, and sex would be coded 'multivariate'. Variables with only a few claims each, such as country_of_birth, were coded to 'other'.

Table 1: Number of claims by variables used in the datasets those claims came from

Variable Number of Claims Number of Correct Claims % Claims that were correct

Age 35 21 60%

Multivariate 28 12 43%

Occupation 9 2 22%

Other 8 5 63%

The table shows claims where age was the main component had the highest number of claims, and

highest number of correct claims. Multivariate tables were less than 50% likely to yield a correct

claim, and occupation was unlikely to result in a correct claim. The main cause of correct claims from

the ‘Other’ category were claims using country_of_birth.

2.35 Geography

Table 2: Number and correctness of claims by Geography used in datasets

Geography of the table used for the claim   Number of Claims   Number of Correct Claims   % of Claims made that were correct   Mean % confidence   Mean cell count

OA 67 34 51% 75 142

LSOA 9 5 56% 73 248

MSOA 5 1 20% 47 610

The largest geography used for any claim was Middle Super Output Area (MSOA), Output area (OA)

was the main area of risk with the bulk of claims (65/81 or 80%) being made using OA datasets. It

was also the focus of correct claims (34/40 or 85%). There were few claims at MSOA, and only one


correct claim. Lower Super Output Area (LSOA) again was used in few claims, and though these were

majority correct, with such a small sample it cannot be concluded that this would always be more

likely to be correct or not.

Table 3: Subject of the Disclosure Claim

Number of Claims   Correct claims   Percentage of Claims correct

Family and friends 59 25 42%

People from news/ web 16 11 68%

Self-identification 6 4 67%

Many of those known about through news or online articles were centenarians, identified through age

and location.

2.36 Other

Though intruders were given access to 'fixed' tables as csv files, and at least 7 intruders used them, there were no correct claims from these.

Qualitative evidence suggested the intruders found the new flexible outputs system very easy to use (rated 4.3 out of 5 by the 15 intruders surveyed), and low amounts of time (5-30 minutes) were recorded as typical to arrive at a claim, though it is hard to calculate the total time taken per claim accurately, as time spent logged in could not be taken as an indication of time spent on this project.

Intruder feedback suggested that the disclosure rules built into the system were working as intended

and when they tried to obtain a cell value of 1 at lower geography, the rules prevented this by denying

the data.

3 Discussion

The overall results show that over half of identification claims were incorrect. However, unlike other

intruder testing exercises carried out previously by ONS, intruders were fairly unlikely to make claims

where they had low confidence. Almost all claims were made with a confidence of 60% or greater.

Generally, the higher percentage of confidence the intruder rated a claim, the more likely they were to

be correct. Although this was statistically significant, the relationship was not strong, and a significant portion of those who were over 90% confident were still incorrect or partially correct (35%, or 13/37).

The exercise on 2011 census data saw a drop off in percentage correctness at very high confidence

claims which was not seen here. Possibly, the ease of using the system may have made all intruders

more confident, and meant intruders went for easier identifications, rather than putting forward ones

they were less sure of.


The method used for this exercise did not allow us to know whether an identification was wrong due

to swapping, or other reasons – only if it was perturbed and therefore a ‘partial’. Therefore, it is hard

to evaluate the success of swapping as a single method from this evidence.

Cell counts of tables present an unclear picture, as no clear correlation was found between table size (in cell count) and correctness. With smaller tables it may be easier to be sure where a person is represented, whereas a larger table makes it more likely that a small count can be found on which to base an identification claim. It seems more detailed classifications may offer additional risk in some circumstances, but this is dependent on geography.

There were no claims at any geography higher than MSOA. It is likely that an intruder would have far

more confidence over a claim at lower geographies since they may have considerable knowledge as to

who lived in an OA with which they are familiar, but far more uncertainty as the geography level

increases. Observing a cell count of 1 in an OA may convince them that the person they know is the

only one with that combination of attributes. They might have less certainty at MSOA that the 1

corresponds to the subject of the claim given the lower likelihood of familiarity with the individuals in

the population, as well as ‘noise’ introduced by error, imputation, record swapping and the cell key

method.

The high level of claims and correct claims at OA make this the main area of risk to address in

planned outputs. Claims made at OA also had the highest level of confidence with an average of 75%

confidence expressed in the claims. The variables used for these claims were consistent with the

general picture, that is, age was a main variable used for identifications, followed by other detailed

classifications such as occupation and country_of_birth. Multivariate tables made the basis for 22 of

the OA claims, of which most were incorrect or partially correct (13/22 or 59%), which demonstrates

that the protections did well at protecting multivariate data as they were designed to do.

Whilst most of the claims were correct at LSOA (5/9 or 56%) this was a small sample and could

equally have been majority incorrect with one fewer correct claim. However, some of the claims made

at OA could equally have been made at LSOA, as they are small enough to make small counts

prevalent, and intruders might have a moderate level of familiarity with most residents within a typical

sized LSOA (1600 people). The level of confidence in LSOA claims was not much less than that

shown in claims made from OA level tables (73% confidence in LSOA, 75% in OA claims). A

majority of LSOA claims (5/9 or 56%) were based in multivariate tables, though a minority of these

were correct (2/5 or 40%). The mean cell count of tables used for claims at LSOA was consequently

much higher.

There was little risk of a correct claim (only 1/6 or 17%) from an MSOA table, so this supported

earlier evaluations of the data that looked only at the sparsity of the likely tables, and restricted fixed-

table outputs of detailed univariates to MSOA geography. The cell counts used for MSOA tables were

higher on average, which is unsurprising given the higher population (typically 7000) that would have

to be divided in the classifications to obtain a cell count of ‘1’ to base an identification upon. The level

of confidence was also significantly lower at average 47%.

That age was shown as a specific risk should be noted; however, some of these claims were claims

made using already publicly available information on Centenarians so arguably the disclosure came

from these sources, not the output. That said, many claims were also identifying people who happened

to be the only one of that age in their area, so single year of age at Output Area geography has been

shown as a specific risk to mitigate.


The variables used for correct claims supports current thinking that more ‘definite’ variables are more

disclosive, that is age and country_of_birth are both variables that are likely to be reported

consistently by the person filling in the Census.

Claims based upon occupation were very unlikely to be correct on the other hand, which may be due

to uncertainty about how the question may have been interpreted by the person answering, and how

their answer would have been coded by the automated processing system.

Multivariate claims are also less likely to be correct, possibly because increasing the number of

variables increases the chances an answer would not have been given or been recorded the way the

intruder guessed. The level of risk in these detailed univariates was still limited to smaller sized

geography, so there is no evidence from this test to restrict the use of these variables at MSOA or

higher geography.

In terms of the variables that relate to special category data there was no evidence that variables such

as health, disability, ethnicity, religion, sexual_orientation and gender_identity, all of which were

included in the test, were at significant risk of correct identification claims. This may be due to the

protections put in place for these, and the less definite nature of these variables. Though we know 7

intruders tried to use the sexual_orientation and gender_identity datasets, these were made available

separately through .csv files which may have made them harder to access. In the final outputs they

would not be available below MSOA, so this intruder testing exercise seems to support that decision

in terms of sufficient protection for that data.

The test was conducted pragmatically, and therefore recruited people with more statistical awareness

and knowledge of the data than would be found in the general population, as they were ONS

employees. This may be taken as a slightly over-stringent test, as it may over-estimate the risks from

intruder attempts made by the public.

4. Conclusion

The standard to be met to fulfil legal requirements is that claims should not be made with both confidence and correctness. The level of risk in the current planned outputs, as found by this exercise, would meet these legal definitions of safety, and additional steps were taken to decrease this risk further.

In response to the findings, the rules in the table builder were altered to restrict the availability of

detailed classifications at lower geography, and one more detailed topic summary was replaced with a

classification with fewer categories that consequently posed less risk. The majority of claims made

here would not be possible to make using the actual output system.

Perturbation, swapping, the disclosure rules and general level of doubt in the data together were

shown to be effective at preventing correct identifications.

Awareness of perturbation and swapping did not appear to result in lower levels of intruders' confidence in making claims, so this alone cannot be relied upon to meet the legal standards. Further

steps were also taken to ensure LSOA level data was protected by restriction of the level of detail

available at this geography.


The evidence seen here, with lower risk at MSOA, supports the decision to limit the geography of

usual residents in communal establishments and households to MSOA, even though those datasets

were not included in the test.

The CACD system has been launched since this test took place, and sees some 900,000 interactions

per month (ONS data), demonstrating the usefulness of Census data delivered in a flexible and

immediate format. If this system is to be employed for a wider range of statistical products, further

intruder testing should be considered as a means of measuring and mitigating disclosure risk in those

datasets.

Intruder testing is a highly useful exercise for data providers to employ, where the level of risk

presented by a dataset is in doubt. It gives evidence on the likely level of risk, where that risk lies, and

can inform appropriate action to mitigate those risks.

1

Intruder Testing

Census 2021 England and Wales

Risk and Utility in the Create a Custom Dataset System Sam Trace

2

Background

• Key Census 2021 White Paper promise ‘Every person’s identity will be protected, not only through secure handling and storage of their data, but also by ensuring that our statistical publications do not identify individuals’

• Since 2011, there has been exponential growth in information publicly available about individuals

• There is an all new customizable system for Census 2021

• Census 2021 has new methods protecting the data

3

Statistical Disclosure Control (SDC) methods

• Targeted Record Swapping – identifying people and Households that stand

out in the data, swapping them with a similar record in a nearby area.

• Cell Key Perturbation - this adds noise to the figures, making slight

changes to cell counts

• Disclosure rules – automated rule-based checks run by the system, which

decide if there is a low enough disclosure risk to allow the release of a

dataset.

How do we check these have done enough?

4

Intruder Testing

• Intruder testing is where ‘friendly’ intruders try to identify people in the data to check the risk level

• Census 2011 outputs were intruder tested before release

• It is a practical check to see if the methods worked

• The point of the exercise is to try and find out if it is possible to identify individuals in the data

5

Legal Standard for outputs

• There must be ‘sufficient uncertainty’ about any identification from a small count

• Identifications made with publicly available information in combination with the data are included

• Testers do not need to be specialist hackers

• Methods must cover the ‘means likely reasonably to be used’

6

Method

• Recruit intruders – ONS people only

• Consent intruders

• Train them and advise of the disclosure control methods

• Get the data on a secure pre-release system

• Intruders try to identify individuals in the data

• Collate results including feedback

• Analyse in Excel

7

Results

• 51 Intruders recruited

• 30 confirmed as working on the project

• 24 intruders made claims

• 81 Claims made (excluding duplicates)

8

Claims

Correct 49%   Incorrect 41%   Partial 10%

10

Confidence and Correctness

12

Variables Used

Correct All % Correct

Age 21 35 60%

Multi 12 29 41%

Occupation 2 8 25%

Other 5 9 56%

40 81 49%

13

Cell Counts

Correct All % Correct

0-49 2 9 22%

50-99 3 9 33%

100-149 9 20 45%

150-199 6 9 67%

200+ 20 34 59%

Total 40 81

14

Options

Remove detailed classifications from the Create Your Own Dataset system

• Loss of useful classifications at higher geography

• There may be other classifications not tried that also pose a risk

Limit max number of Cells

• Loss of useful functionality at higher geography

Specify Max cells specific to geography for univariates

• Would prevent the main risk

15

Limit max cells by Geography?

• The majority of datasets used for claims and correct claims used would not be available

• Might need to apply to LSOA too as some OA claims could equally have been successful at LSOA

• MSOA claims were already likely to be unsuccessful

16

Conclusions

• Detail available at low geography was a risk that was addressed in the live release system

• Some variables carry higher risk than others

• Changes to rules effectively blocked the main risks identified

• Automated rules in the Create a Custom dataset system worked to make claims harder to arrive at

17

Actions

• Limit detail available at low geography

• Keep detailed topic summaries at MSOA level geography

• Releases could take place as planned

18

User Experience We asked the intruders their opinions of the new system

19

Ease of use

Choosing Variables

Choosing Classifications

Data was clear

Speakers

Samantha Trace

Methodologist

Statistical Disclosure Control

Office for National Statistics

21


Intruder testing for Census 2021 England and Wales– checking risk and utility in Build Your Own system , ONS, UK

confidentiality, individual data, cell key method,  disclosure rules, intruder test

Languages and translations
English

UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Title : Intruder testing – an empirical measure of the quality of Census 2021

England and Wales Disclosure Control methods

Author(s) Samantha Trace (Office For National Statistics) Dominic Nelson (Office For National Statistics

e-mail [email protected]

Abstract

By law, the Office for National Statistics (ONS) must protect the confidentiality of respondents to

Census 2021. We protected the confidentiality of individuals' data in three ways: swapping records

between areas, applying a cell key method to each table, and applying disclosure rules in deciding

which tables could be published. To assess the effectiveness of these methods and provide assurance,

an intruder test was performed on Census 2021 data using a secure version of the outputs system. 51

intruders were recruited to attempt to identify individuals in the planned data outputs. 30 Intruders

took part, 81 claims were made, and more than half of these claims (41/81) were incorrect. Further

steps were taken reduce the risks identified by the test, making the data the majority of these claims

were made from no longer possible to access through the Create a Custom Dataset system. This gave

the Office for National Statistics evidence there was sufficient uncertainty in the data to meet the

standard required by legal guidance and we would meet our ethical duty to protect confidentiality.

2

1 Introduction

The Office for National Statistics (ONS) has legal obligations under the Statistics and Registration

Service Act (SRSA, 2007) Section 39 and the Data Protection Act (2018) that require the ONS not to

reveal the identity or private information about an individual or organisation.

We have a pledge to respondents that the information will only be used for statistical purposes, so we

must look after and protect the information that is provided to us. Moreover, a breach of disclosure

could lead to criminal proceedings against an individual who has released or authorised the release of

personal information, as defined under Section 39 of the SRSA.

The SRSA defines "personal information" as information that identifies a particular person if the

identity of that person:

• is specified in the information

• can be deduced from the information

• can be deduced from the information taken together with any other published information

Therefore, in order for data to be released, the risk of identifying individuals from it, potentially with

additional publicly available information, must be minimal.

Intruder testing is an empirical test to check that the measures applied to make data sufficiently

difficult to identify individuals within have been successful. This involves recruiting ‘friendly

intruders’ who emulate the actions of potential ‘real intruders’ upon the data.

The standard that needs to be met is suggested by the National Statistician’s Guidance, “the design

and selection of intruder scenarios should be informed by the means likely reasonably to be used to

identify an individual in the statistic”.

So, intruder tests are designed to measure what could be done with the means likely to be available to

an opportunistic attacker; they do not have to cover every imaginable scenario, just the most probable.

The 2011 Census outputs were tested in this way, and the findings were useful in providing assurance

that the disclosure control measures used on the data were adequate, and provided evidence as to what

further steps should be taken to further reduce disclosure risk. Other ad-hoc exercises have been

undertaken by the ONS as required since, with the same purpose – to determine the level of

identification risk in a dataset.

For Census 2021, new disclosure control methods were required for a new output system. On top of

the imputation of missing records done to make the Census as representative as it can be, which also

adds doubt as to whether a particular record is ‘real’ or not, there were new measures in place to

protect the data:

• Targeted Record Swapping – swapping households that are marked as unique in the data with

a similar record in the local area. The geographies were changed for between 7% and 10% of

households, and for between 2% and 5% of individuals in communal establishments.

• Cell Key Perturbation - this adds noise to the figures, making slight changes to cell counts

including zero cell counts, by a method which means that where the same records are

presented in a cell, the number should remain consistent. A typical dataset would have around

14% of cell counts perturbed by a small amount, and small counts were more likely to have

been perturbed than large counts. A minimal illustration of this method appears after this list.


• Disclosure rules (in the Create a Custom Dataset system) – automated rules including

measures of how many small counts are in the table, that can stop data being given for an area.
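To make these protections concrete, the sketch below illustrates the general idea of the cell key method and a simple small-count rule in Python. It is illustrative only: the record keys, the toy p-table, the noise values and the blocking threshold are invented for the example and are not the parameters or rules used by ONS.

```python
# Minimal sketch of the cell key method and a small-count disclosure rule.
# Illustrative only: the record keys, p-table and threshold are invented
# for this example, not the values used by ONS.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Microdata: each record carries a fixed random "record key".
people = pd.DataFrame({
    "oa": rng.choice(["OA1", "OA2"], size=1000),
    "age_band": rng.choice(["16-44", "45-64", "65+"], size=1000),
    "record_key": rng.integers(0, 256, size=1000),
})

def perturbed_table(df, rows, cols):
    """Cross-tabulate df and perturb each cell via its cell key."""
    grouped = df.groupby([rows, cols])
    counts = grouped.size()
    # Cell key: sum of the record keys of the records in the cell, modulo 256.
    # The same set of records always yields the same cell key, so the same
    # cell receives the same small adjustment wherever it appears.
    cell_keys = grouped["record_key"].sum() % 256
    # Toy p-table: look up a perturbation in {-1, 0, +1} from the cell key.
    noise = (cell_keys % 3) - 1
    return (counts + noise).clip(lower=0).unstack(fill_value=0)

def passes_small_count_rule(table, max_small_cells=5):
    """Toy automated rule: block the table if it has too many counts of 1 or 2."""
    small = ((table > 0) & (table <= 2)).to_numpy().sum()
    return small <= max_small_cells

tab = perturbed_table(people, "oa", "age_band")
print(tab)
print("released" if passes_small_count_rule(tab) else "blocked")
```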

These methods were intended to combine as a ‘lighter touch’ approach, allowing some detail to be

possible at low level geography, whilst maintaining the usefulness of the data within the new Create a

custom dataset (CACD) system, and other census outputs. The CACD system allows users to create

their own multivariate datasets, so the rules are set to prevent the possibility of identifying a single

record and building up a list of potential attributes. The level of identification risk should still be

minimal, using information public or private.

2 The Intruder Test

2.1 Method

51 intruders, all ONS employees, were recruited. All had appropriate security clearances and

consented to an enhanced non-disclosure agreement. They were given training on how to use the

output system, and possible methods of working against our statistical disclosure controls. A safe area

of an approved file management system was set up, and they were given access to individualised

folders to record their findings and keep notes.

A version of the planned outputs system was created on a secure internal-access platform and loaded

with the usual resident database. This is the main basis for Census outputs as it includes all people

who are ‘usually resident’ at the enumeration address at the time of the census. This was also

programmed with all the current planned variables and classifications for those variables. A version of

the planned statistical disclosure rules was placed in this system, to auto-control outputs requested by

intruders, and deny access if the output does not pass these rules. The system had built in perturbation

so automatically created outputs with some values slightly changed.

The data placed in the system had targeted swapping already applied and imputed records present, just

as it would be when published. The main census 2021 geographies were available in this system, the

smallest geography used was output area (OA), an area with at least 100 persons in it, though more

typically 400 persons.

Intruders were given individual access to the system, encouraged to collaborate on a private Teams

channel, and to share resources, such as web pages, hints and tips. An errors log was set up to record

system issues, and the details of the claim, including geography, variables and classifications used, as

well as the name and address of the individual being claimed as found, and the confidence level in the

identification as a percentage.

Claims were transcribed from the individual file folders to a single sheet that the checkers had access

to. These checkers were from a different team to ensure the data was fully firewalled from the

intruders, and no actual disclosure would result from the exercise.

The checkers had access to record level data, so could determine whether a claim was correct, partial,

or incorrect. A correct claim would match on name and approximate address. Inaccurate address

matches were counted as correct so long as they would have been within the geographical area used to

make the claim.

Inaccurate name matching was counted as incorrect. A partial match would be where a claim was

made on a 1 in a cell, where more records would have been in that cell but were perturbed down to 1.


2.2 Limitations

We had considered engaging a third party to take part in the test, however we could not be sure of

start time, and there are few companies engaged in exactly this sort of testing that could have gained

security clearances in time, so it was deemed impractical to engage a third party in this exercise.

Therefore, there may be some organisational biases in our exercise.

Although attempts were made to recruit people from more sparsely populated areas of England and

Wales, most people were still clustered geographically around ONS offices and reflect the socio-

demographic mix of ONS staff rather than the general population.

Intruders also had to use their spare time around their regular work, and the exercise ran in August

when many took leave, although it took place over three weeks to allow more people to participate.

The dataset looked at was not the full range of planned Census outputs. The final system includes not

just Usual Resident, but also Usual Residents in Households and Communal Establishments,

Households and Household reference Persons. The Usual Resident dataset used was taken to be a

sufficient test of the general level of risk in the data.

2.3 Results

2.31 Claims

81 identification claims were made, excluding duplicates. These claims are where an intruder

highlighted a ‘1’ cell count in a dataset, gave the details of this, and claimed they knew which

person it related to. Some (2) claims listed various methods to approach the same identification; in

these cases this was still counted as one claim and measures such as cell count were taken from the

first table stated.

40/81 or 49% of identification claims were correct (the intruder correctly named an individual in a cell).

8/81 or 10% of identification claims were partially correct (the intruder correctly named an individual in a cell of apparent size 1, but the cell count was greater than 1 due to cell key perturbation, so the cell could have been representing any of the people in it).

33/81 or 41% of identification claims were incorrect (the record marked in the cell did not relate to the individual named).

No attribute claims were made; an attribute claim is where an intruder claims to have found something

new about a person through the data presented.

Of the initial 51, 12 dropped out, citing workload or holiday as reasons, and a further 9 filed no notes

and made no claims. Of the 30 intruders that took part, 6 (20%) did not make any claims. Reasons

cited included not being able to claim anything with certainty, some may also have lacked time to

spend on the project.

2.32 Confidence


Figure 1: Confidence, correctness and number of claims

This histogram shows numbers of claims by the percentage confidence the intruder reported in the

claim, banded by whether they were correct, partially correct or incorrect.

Confidence in claims ranged from 7.5% to 100%.

The mean confidence placed in a claim was 73.6%; the median was 80%.

2.33 Cell Counts and correctness

The cell count is the number of cells (rows × columns) present in the table used to make the claim.

A wide range of table sizes was used to inform claims (range 7 – 2100, mean 183, median 182).

Figure 2: Cell counts and correctness

The scatter plot shows claims rated by percentage correctness (partially correct claims counted as 50% correct, fully correct claims as 100%), plotted against the number of cells in the dataset used for the claim. One outlier (cell count 2100) was removed. This shows a weak positive correlation (R² = 0.0986); with the outlier included, the relationship was zero. This could suggest that higher cell counts may increase the possibility of identification within limits, but very high cell counts may not.

2.34 Variables Used

To assess which variables were most likely to result in a claim, and which in a correct claim, the

claims were coded to variable type. Any table constructed with a single classification making the bulk

of the cells would be coded to that variable, e.g., any claim using single year of age, or single year of

age plus another less detailed classification such as sex, was coded to ‘age’; any claim using a 3-part

country of birth classification, 10-part age, and sex would be coded ‘multivariate’. A few variables

with only a few claims each were coded to ‘other’, such as country_of_birth.

Table 1: Number of claims by variables used in the datasets those claims came from

Variable Number of Claims Number of Correct Claims % Claims that were correct

Age 35 21 60%

Multivariate 28 12 43%

Occupation 9 2 22%

Other 8 5 63%

The table shows claims where age was the main component had the highest number of claims, and

highest number of correct claims. Multivariate tables were less than 50% likely to yield a correct

claim, and occupation was unlikely to result in a correct claim. The main cause of correct claims from

the ‘Other’ category was claims using country_of_birth.

2.35 Geography

Table 2: Number and correctness of claims by Geography used in datasets

Geography of the table used for the claim   Number of Claims   Number of Correct Claims   % of Claims made that were correct   Mean % confidence   Mean cell count

OA 67 34 51% 75 142

LSOA 9 5 56% 73 248

MSOA 5 1 20% 47 610

The largest geography used for any claim was Middle Super Output Area (MSOA). Output Area (OA)

was the main area of risk with the bulk of claims (65/81 or 80%) being made using OA datasets. It

was also the focus of correct claims (34/40 or 85%). There were few claims at MSOA, and only one


correct claim. Lower Super Output Area (LSOA) again was used in few claims, and though the majority of

these were correct, the sample is too small to conclude whether claims at this geography would usually

be correct.

Table 3: Subject of the Disclosure Claim

Number of Claims   Correct claims   Percentage of Claims correct

Family and friends 59 25 42%

People from news/ web 16 11 68%

Self-identification 6 4 67%

Many of those known about through news or online articles were centenarians, identified through age

and location.

2.36 Other

Though intruders were given access to ‘fixed’ tables as csv files, and at least 7 intruders used them,

there were no correct claims from these.

Qualitative evidence suggested the intruders found the new flexible outputs system was very easy to

use (rated 4.3 out of 5 by the 15 intruders surveyed), and low amounts of time were recorded as

typical to arrive at a claim (5-30 minutes) though it is hard to calculate total time taken per claim

accurately as time spent logged in could not be taken as an indication of time spent on this project.

Intruder feedback suggested that the disclosure rules built into the system were working as intended

and when they tried to obtain a cell value of 1 at lower geography, the rules prevented this by denying

the data.

3 Discussion

The overall results show that over half of identification claims were incorrect. However, unlike other

intruder testing exercises carried out previously by ONS, intruders were fairly unlikely to make claims

where they had low confidence. Almost all claims were made with a confidence of 60% or greater.

Generally, the higher the confidence the intruder placed in a claim, the more likely it was to

be correct. Although this was statistically significant, the relationship was not strong, and a

significant portion of those who were over 90% confident were still incorrect or partially correct (35%

or 13/37).

The exercise on 2011 census data saw a drop off in percentage correctness at very high confidence

claims which was not seen here. Possibly, the ease of using the system may have made all intruders

more confident, and meant intruders went for easier identifications, rather than putting forward ones

they were less sure of.


The method used for this exercise did not allow us to know whether an identification was wrong due

to swapping, or other reasons – only if it was perturbed and therefore a ‘partial’. Therefore, it is hard

to evaluate the success of swapping as a single method from this evidence.

Cell counts of tables present an unclear picture, as no clear correlation was found between table size

(cell count) and correctness. Smaller tables may make it easier to be sure where a person is represented,

whereas a larger table makes it more likely to yield a small count on which to base an identification claim. It seems

more detailed classifications may offer additional risk in some circumstances, but dependent on

geography.

There were no claims at any geography higher than MSOA. It is likely that an intruder would have far

more confidence over a claim at lower geographies since they may have considerable knowledge as to

who lived in an OA with which they are familiar, but far more uncertainty as the geography level

increases. Observing a cell count of 1 in an OA may convince them that the person they know is the

only one with that combination of attributes. They might have less certainty at MSOA that the 1

corresponds to the subject of the claim given the lower likelihood of familiarity with the individuals in

the population, as well as ‘noise’ introduced by error, imputation, record swapping and the cell key

method.

The high level of claims and correct claims at OA makes this the main area of risk to address in

planned outputs. Claims made at OA also had the highest level of confidence with an average of 75%

confidence expressed in the claims. The variables used for these claims were consistent with the

general picture, that is, age was a main variable used for identifications, followed by other detailed

classifications such as occupation and country_of_birth. Multivariate tables made the basis for 22 of

the OA claims, of which most were incorrect or partially correct (13/22 or 59%), which demonstrates

that the protections did well at protecting multivariate data as they were designed to do.

Whilst most of the claims were correct at LSOA (5/9 or 56%) this was a small sample and could

equally have been majority incorrect with one fewer correct claim. However, some of the claims made

at OA could equally have been made at LSOA, as they are small enough to make small counts

prevalent, and intruders might have a moderate level of familiarity with most residents within a typical

sized LSOA (1600 people). The level of confidence in LSOA claims was not much less than that

shown in claims made from OA level tables (73% confidence in LSOA, 75% in OA claims). A

majority of LSOA claims (5/9 or 56%) were based in multivariate tables, though a minority of these

were correct (2/5 or 40%). The mean cell count of tables used for claims at LSOA was consequently

much higher.

There was little risk of a correct claim (only 1/6 or 17%) from an MSOA table, so this supported

earlier evaluations of the data that looked only at the sparsity of the likely tables, and restricted fixed-

table outputs of detailed univariates to MSOA geography. The cell counts used for MSOA tables were

higher on average, which is unsurprising given the higher population (typically 7000) that would have

to be divided in the classifications to obtain a cell count of ‘1’ to base an identification upon. The level

of confidence was also significantly lower, at an average of 47%.

That age was shown as a specific risk should be noted; however, some of these claims were claims

made using already publicly available information on Centenarians so arguably the disclosure came

from these sources, not the output. That said, many claims were also identifying people who happened

to be the only one of that age in their area, so single year of age at Output Area geography has been

shown as a specific risk to mitigate.


The variables used for correct claims supports current thinking that more ‘definite’ variables are more

disclosive, that is age and country_of_birth are both variables that are likely to be reported

consistently by the person filling in the Census.

Claims based upon occupation were very unlikely to be correct on the other hand, which may be due

to uncertainty about how the question may have been interpreted by the person answering, and how

their answer would have been coded by the automated processing system.

Multivariate claims are also less likely to be correct, possibly because increasing the number of

variables increases the chances an answer would not have been given or been recorded the way the

intruder guessed. The level of risk in these detailed univariates was still limited to smaller sized

geography, so there is no evidence from this test to restrict the use of these variables at MSOA or

higher geography.

In terms of the variables that relate to special category data there was no evidence that variables such

as health, disability, ethnicity, religion, sexual_orientation and gender_identity, all of which were

included in the test, were at significant risk of correct identification claims. This may be due to the

protections put in place for these, and the less definite nature of these variables. Though we know 7

intruders tried to use the sexual_orientation and gender_identity datasets, these were made available

separately through .csv files which may have made them harder to access. In the final outputs they

would not be available below MSOA, so this intruder testing exercise seems to support that decision

in terms of sufficient protection for that data.

The test was conducted pragmatically, and therefore recruited people with more statistical awareness

and knowledge of the data than would be found in the general population, as they were ONS

employees. This may be taken as a slightly over-stringent test, as it may over-estimate the risks from

intruder attempts made by the public.

4. Conclusion

The standard to be met to fulfil legal requirements is that claims should not be made with both

confidence and correctness. The level of risk in the current planned outputs found by this

exercise would meet these legal definitions of safety, and additional steps were taken to decrease this

risk further.

In response to the findings, the rules in the table builder were altered to restrict the availability of

detailed classifications at lower geography, and one more detailed topic summary was replaced with a

classification with fewer categories that consequently posed less risk. The majority of claims made

here would not be possible to make using the actual output system.

Perturbation, swapping, the disclosure rules and general level of doubt in the data together were

shown to be effective at preventing correct identifications.

Awareness of perturbation and swapping did not appear to result in lower levels of intruders’

confidence in making claims, so this alone cannot be relied upon to meet the legal standards. Further

steps were also taken to ensure LSOA level data was protected by restriction of the level of detail

available at this geography.


The evidence seen here, with lower risk at MSOA, supports the decision to limit the geography of

usual residents in communal establishments and households to MSOA, even though those datasets

were not included in the test.

The CACD system has been launched since this test took place, and sees some 900,000 interactions

per month (ONS data), demonstrating the usefulness of Census data delivered in a flexible and

immediate format. If this system is to be employed for a wider range of statistical products, further

intruder testing should be considered as a means of measuring and mitigating disclosure risk in those

datasets.

Intruder testing is a highly useful exercise for data providers to employ, where the level of risk

presented by a dataset is in doubt. It gives evidence on the likely level of risk, where that risk lies, and

can inform appropriate action to mitigate those risks.

Smoothing the way for secure data access using synthetic data

de-identified form, accredited researchers, safe access, synthetic data


UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Smoothing the way for secure data access using synthetic data

Authors: Richard Welpton and Emily Oliver, Economic and Social Research Council (ESRC), UK

[email protected] [email protected]

Abstract

In the UK, sensitive and potentially disclosive data (including survey and government-owned administrative data) are kept securely and safely in de-identified form and are only accessible to accredited researchers through Secure Data Environments (SDEs). Using this data for research has enormous potential, although access can be constrained by the need for researchers to understand enough about these complex datasets for them to submit a viable project proposal, tensioned against the resource required for data owners to assess every application to use it, and data guardians to answer questions from researchers about the data. For the researcher, they need to be very invested to engage: they can’t see the data in advance of applying for it; can’t test it to see if it will answer their research question; it can take a long time to get hold of; and when they do, it might not contain what they need. It's also burdensome for the SDE as the researcher needs to spend a lot of time in the SDE exploring and preparing data ready for analysis. The resource costs to both researcher and SDE can be considerable.

Low-fidelity synthetic data can be an effective tool to improve the researcher journey because it can lower the barriers to understanding the data before giving researchers access to the real data. As well as accessing it for training purposes, researchers can use it for exploratory analysis to determine if the real data includes the variables they need. In turn, this can help support researchers to improve the quality of their applications for funding and data access; and to develop and test their code while they are waiting for access to the real data. Researchers can continue to develop their code outside of the SDE, therefore minimising the time and resources spent inside the environment. In the UK, only a small number of data services provide access to synthetic data, despite the development of numerous methods for creating synthetic data in the last decade or so.

The Economic and Social Research Council (ESRC, the UK funding council for social and economic research in the UK) has invested in a programme of work to support the creation and routine operationalisation to supply low-fidelity synthetic data to support data access for research and improve the efficiency of SDEs. This has been done largely through ESRC’s Administrative Data Research UK (ADR UK) programme. They have:

 Conducted an in-depth study of the concerns and myths held by government data owners surrounding synthetic data production and use;

 Funded the creation of a Python Notebook tool to create synthetic data easily, at low cost and minimal risk which has been tested and approved by government departments;

 Formed a position statement across its UK partnership setting the vision for synthetic data within its wider remit and mission;

 Embarked on a significant project to explore the utility and use cases of different approaches to synthetic data creation and to evaluate the efficacy of different models to provide recommendations for how synthetic data production can be achieved at scale whilst still acceptable to data owners;


 Developed a public dialogue on the acceptability of synthetic data, and public understanding of it and its uses to increase trust and confidence in its development for research for public good.

This session will describe the secure data landscape within which synthetic data sits in the UK and explain the approach taken by ESRC and ADR UK to utilise it as a catalyst for better quality applications for funding and data access, and a smoother researcher journey. We will demonstrate the effectiveness of provisioning access to low-fidelity data by describing how it makes the researcher journey for accessing data and use of data in a SDE more productive, while simultaneously reducing the burden for data custodians and maintaining confidentiality.

1 Introduction

Considerable progress has been achieved to improve access to sensitive data for research, particularly in the UK. For example, the Office for National Statistics (ONS) launched the Virtual Microdata Laboratory (VML) in the mid-2000s (later to become the SRS – the Secure Research Service). In 2011, the UK Data Archive established the Secure Data Service (now UK Data Service Secure Lab). ESRC’s ADR UK programme, a partnership between government and academic groups across all four UK nations, creates linked datasets from administrative sources, making these available to researchers through four Trusted Research Environments (TREs): SAIL databank (ADR Wales); NISRA (ADR Northern Ireland); eDRIS/Research Data Scotland (ADR Scotland); and ONS Secure Research Service (ADR England). These are all examples of Secure Data Environments (SDEs), also known as Trusted Research Environments (TREs).

These facilities have become commonplace across the health and social science research sectors because they offer a robust approach to accessing sensitive data. They reassure data owners that data they are responsible for, on behalf of the public, can be accessed safely (mitigating risk to individuals in the data) according to the principles of the Five Safes Framework1. SDEs are now considered the default option as far as access to sensitive data is concerned.

SDEs enable a range of data sources to be accessed securely. Consequently, researchers can better explain a range of health, social and economic phenomena. Examples of these data include:

 Business survey microdata available in the SRS and Secure Lab (these are sensitive due to the difficulty of anonymising the data and keeping enough utility in the data to undertake research)

 Detailed versions of social survey data also available in the SRS and Secure Lab, where the additional detail such as very low-level geographies or occupation codes not available in downloadable versions of the data offer new research insights.

 ADR UK has supported UK and devolved governments to make a range of administrative datasets available through their network of four SDEs. These data are de-identified, but not suitable for download because of their sensitivity, and offer utility for researchers.

 Health data such as cancer registration data and records from primary and secondary care services are accessible to researchers through organisations such as NHS England, and other SDEs.

 Linked health and administrative datasets are also now becoming available to researchers through ADR UK’s network of SDEs.

The UK benefits from a legal climate that permits use of such data for research purposes; but culturally the use of the data described above continues to provide ethical and public perception challenges. Concerns about the misuse of data are understandably a constant feature of public debate in this area. This underlines the important role that SDEs have in maintaining the social licence to use these data for research in the public good. When managed through the Five Safes Framework, secure access to these data through an SDE provides assurance that such access leads to safe use in the public good.


Despite the SDE solution, it should be pointed out that the cost of setting up and operating an SDE is high. Unlike distribution of data, secure access to data through an SDE requires:

 A technological solution (controlling access to researchers, data, projects; coupled with computational processing power)

 An auditable information governance and assurance framework

 Expert staff (technology, research, data management, statistical disclosure control etc.)

An SDE can only support as many data sources, researchers and research projects as its technology and staff capacity can allow. For example, between 2007 and 2010, the VML could support about 12 researchers accessing the facility simultaneously (the number of physical desks available at the offices where researchers could sit to visit the facility). When the Secure Data Service was launched in 2011, it could allow 40 researchers to remotely access the service at any one time; this was increased to 150 recently, following funding from ADR UK to expand and improve the service.

Other capacity constraints remain:

Inputs: procedures that researchers must navigate to access data in an SDE often require the researcher to explain in detail how they will use the data to address their research hypothesis. While metadata and documentation can help (when available), researchers often cannot describe accurately how they will use the data until they actually have access to the data. This creates uncertainty and can lengthen the application process.

Quality and completeness of information: occasionally, researchers who have spent considerable time gaining approvals for access to data discover that the data are not suitable for their research when they finally acquire access: a significant opportunity cost for them (and the data owner and SDE that support their access).

Outputs: In an SDE, researchers need to have their research outputs checked for potential disclosure before being released, a process known as statistical disclosure control. This is largely a manual process: SDE staff receive and process these requests. The ability to support researchers can be constrained simply by the number of staff available to service these requests.

Throughput: Much research involves exploring data and methods before a research question is answered. This iterative process relies on computing power to process data. In practice, little of this processing effort leads to a direct research output (for example, it may take several iterations to estimate a research model that yields statistical results that a researcher decides to publish). Yet depending on the technical architecture of the SDE, researchers may be competing for available compute resource, such as CPU/GPU memory.

One solution to address these constraints is to simply invest more money into SDEs, so more staff can be recruited, and more computational capacity can be sourced, etc. Despite such efforts in recent years, the demand to access these data sources continues to grow. SDEs are unlikely to be able to scale up to keep pace with this demand indefinitely.

This paper describes the potential of synthetic data to reduce these bottlenecks. We provide a vision whereby synthetic versions of sensitive data are routinely produced to:

 Enable researchers to assess data before making an application to access them; making sure they are the right data to support their research and help them accurately justify their use of the data when applying to access the data.

 Support the iterative process of research methodology and execution outside of the SDE and thereby reducing demand on SDE computational resources and demand for staff time to undertake statistical disclosure control (accepting the latter may be automated or partially automated in the future).


The next section outlines in more detail the challenges that researchers experience. We proceed by explaining ADR UK’s efforts to pilot the generation of synthetic data, and then describe how these synthetic data can support researchers and enable SDEs to work more efficiently given their limited resources, resulting in improved outcomes for researchers and the policy world they support.

2 Challenges for researchers

Using sensitive data for research, such as administrative data, has huge potential, not least because there is so much of it. Administrative data, by its very nature, includes everyone. The datasets are enormous and complex, rich with potential for discovering insights about behaviours, trends, implications and consequences for individuals, communities and the policies and services they are dependent upon. By linking datasets and combining survey and administrative data, these insights can be even deeper, and the things they can tell us can be transformational.

Access approvals can be slow to gain particularly for linked administrative datasets, because typically each data owner (that is, the government department, local authority or other public body) will want to approve requests. For the researcher this is dependent upon:

 Knowing what data they want to access – including the dataset, the variables, and even within the variable, the period of time they want to consider. Generally, a data owner will not want to give permission for a researcher to access any data they do not need to answer their specific question (the principle of only providing the minimum data necessary to address the research question). If the data has good, accessible documentation (metadata, user guide etc) this could be possible. Otherwise, they might need to rely on access to an expert who has used the data before and knows it well. The researcher needs to be specific and accurate in their request, but knowing enough about the data to do this before they make the request is not always possible.

 Getting a response from the data owner: this is dependent on the data owner having adequate resources in place to respond to data requests. The data owner needs staff who know and understand the data, who also have the time and remit to respond to these queries. If the data is deemed particularly useful by researchers and/or it does not have good and accessible documentation, the data owner might be inundated with requests for it. During times of political turmoil, such as during and post-elections, industrial action or national crises, processing data access application queries might be deprioritised.

Although a researcher can apply to be accredited to access secure data, it is generally only when the data owner has indicated approval can the researcher apply through more predictable channels: applying to the relevant SDE and getting confirmation from research approval panels.

Dr Paul Calcraft2 has described the process of applying to access linked administrative data in the UK as trying to buy a second-hand car without being able to see it or test drive it first: Does it have all its parts, is anything missing, does it do what you think it will do, are there any quirks you should know about? In applying for data, one cannot see it in advance of applying for it; and one cannot test it to see if it will answer the research question. Accessing the data can be lengthy without certainty it will contain the information needed. Figure 1 sets out the process researchers need to follow to access secure data in England.


Figure 1: Process of access to secure data in England

3 Synthetic data as a solution

Bypassing much of the system for accessing secure data by instead accessing a version which is not real data and therefore does not need to be held securely, could be one solution for researchers. At the very least, using a synthetic version of the data to find out if you really do want to embark on a protracted process to access the real data, could be valuable. In this section we describe how this prospect should be considered.

3.1 Types of synthetic data and their potential

The utility of synthetic data for different applications is, of course, central to the question of its potential. High fidelity synthetic data which mimics the original data and preserves the statistical relationships between variables could reduce costs and complexities for the researcher, as it could also allow for analyses which are extremely close to those done on the real data. The use of such high fidelity synthetic data does come with a degree of risk for the data owner however, particularly if people misinterpreted findings from such data, or it was ‘passed off’ as real data.

Low fidelity synthetic data can, on the other hand, significantly reduce, if not remove, the risks for data owners, as analyses of the data would not generate meaningful results. It can also provide the researcher with easy access to a dataset which can be used to prepare code, test code, become familiar with the format of the data and learn how it can be used. It can also be used for training purposes, to raise awareness about the data.

In the UK, only a small number of data services provide access to any synthetic data, despite the development of numerous methods for creating it in the last decade or so. Making the production of low fidelity synthetic datasets more common could be beneficial to researchers and data managers alike. However, public perception of it is currently unclear and could be reputationally damaging if not addressed alongside other considerations.

For the purposes of this paper, we have described SDEs as ‘remote access’ solutions, in which the researcher can access and ‘see’ the data they have applied to access to undertake their research. Another approach is the


‘remote execution’ model, where a researcher develops statistical programming code using synthetic data, then submits their code to be run remotely on the data. Statistical outputs are then returned to the researcher, subject to a statistical disclosure control check. Recent developments have included Application Programming Interfaces (APIs) to automate this process (such as DataShield, OpenSafely). Remote execution relies heavily on accurate synthetic data to ensure that the researchers can submit accurate statistical programming code; it may fail otherwise, to the frustration and delay of the research.

3.2 Developments

In 2020, ADR UK commissioned the Behavioural Insights Team (BIT) to undertake an investigation into the attitudes to and appetite for the provision of synthetic data by government departments. The intention was to understand the concerns and barriers to it with a view to being able to tackle these head on in a more informed way. It identified technical considerations, risk aversion and lack of knowledge, the use of advanced privacy-preserving technologies, and the need for better understanding of public attitudes to synthetic data alongside clearer communication as the key influencing factors. The results of the study are set out in the project report, Accelerating public policy research with synthetic data, and led to recommendations to:

 Encourage the use and sharing of low-fidelity synthetic data to support rapid discovery of whether the dataset is appropriate for answering the research question; to develop and test code before full access is available; reducing delays in the process, including the amount of time needed to be spent in a secure environment;

 Expand the use of synthetic data for training so that researchers can be exposed to relevant idiosyncratic datasets earlier, thus improving their efficiency on live projects;

 Develop a cross-government repository of synthetic data for restricted access without a specific project proposal to allow for better design and more refined project proposals, and for this to be fed by a semi-automated pipeline to routinely generate low-fidelity synthetic data.

The study was followed up with the development of a synthetic data generation tool in the form of a prototype Python notebook which could be used by government analysts or researchers to generate low-fidelity synthetic datasets quickly and easily. It creates a version of the data that follows the structure and some of the patterns found in the real data. As such, it is plausible and represents the data as a whole. At the same time, because it does not preserve statistical relationships between columns, it reveals very little - if anything - about any individual in the dataset. The tool has now been extensively tested and is available for use. Users need Python (preferably Python 3), two common Python libraries (NumPy and pandas), and a software tool for viewing, editing, and running Python notebooks such as VSCode or Jupyter.
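As an illustration of the general approach (not the BIT notebook itself), the sketch below generates low-fidelity synthetic data with NumPy and pandas by sampling each column independently from its observed distribution: the output keeps the structure and marginal patterns of the real data while deliberately preserving no relationships between columns. The function name and example data are invented for the illustration.

```python
# Minimal sketch of low-fidelity synthetic data generation: each column is
# sampled independently from its own observed distribution, so structure and
# marginal patterns are retained but cross-column relationships are not.
import numpy as np
import pandas as pd

def low_fidelity_synth(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    synth = {}
    for col in real.columns:
        values = real[col].dropna()
        if pd.api.types.is_numeric_dtype(values):
            # Draw from the empirical distribution (with replacement).
            synth[col] = rng.choice(values.to_numpy(), size=n_rows, replace=True)
        else:
            # Sample categories in proportion to their observed frequencies.
            freq = values.value_counts(normalize=True)
            synth[col] = rng.choice(freq.index.to_numpy(), size=n_rows, p=freq.to_numpy())
        # Reintroduce missingness at roughly the observed rate.
        miss_rate = real[col].isna().mean()
        synth[col] = pd.Series(synth[col]).mask(rng.random(n_rows) < miss_rate)
    return pd.DataFrame(synth)

# A researcher could develop and test code against `fake` while waiting for
# access to the real extract inside the SDE.
real = pd.DataFrame({"age": [34, 51, 27, 63, np.nan], "region": ["E", "W", "E", "E", "W"]})
fake = low_fidelity_synth(real, n_rows=1000)
print(fake.head())
```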

The BIT developers also produced a user guide which provides clear, step-by-step instructions, including how to ensure your system can run it. It guides the user through methods to run the cells in the notebook, explains how output files can be saved, and even tells you how to check that the notebook has worked. There is a useful section on troubleshooting as well as further information for more advanced users.

In an attempt to visualise the benefits of synthetic data for researchers using the process set out above in Figure 1, we have indicated on Figure 2 where the efficiencies could lie.


Figure 2: Proposed efficiencies on process when access to synthetic data is added:

Of course, low-fidelity synthetic data is not a silver bullet. There will be instances where higher fidelity synthetic data is both more appropriate and more useful. In an ADR UK-led workshop at the International Population Data Linkage Network (IPDLN) conference in 2022, where different approaches to creating synthetic data were discussed, participants agreed that the value of different tools was entirely reliant on the end utility of the synthetic dataset2. Partners from ADR UK have taken their own approaches to developing synthetic data according to need and appetite in the devolved nations of Wales, Scotland and Northern Ireland and these have recently been published as an Interim Position Statement on Synthetic Data. It sets out ADR UK’s vision for synthetic data and frames it in the wider context of its remit and mission. The statement is intentionally ‘interim’ because of the dynamic nature of this topic and our growing understanding of issues and opportunities associated with it.

3.3 Putting synthetic data into practice

While the case for the provision and use of synthetic data is powerful, data owners remain cautious, and we need to find effective ways of engaging the public in discussions about the creation of synthetic data. As such, we are a long way from seeing synthetic data operationalised to the point where trusted research environments can produce it routinely and facilitate access to it at scale. There is also a lack of evidence to support decisions among data owners and data services about how the governance around this might be best implemented. Data owners and services need real-world use case studies on costs and benefits to inform more systematic approaches to creation and sharing of synthetic data.

To inform future practice, ESRC and ADR UK are opening a joint research call to fund individuals and teams to explore how the potential of synthetic data can be harnessed at scale. Recipients of these grants will evaluate the current uptake, utility and governance of synthetic versions of datasets held in SDEs, including the benefits, costs and challenges to researchers, data owners and the SDEs themselves. They will also support a qualitative study of public understanding of and attitudes to synthetic data. The results of these funded projects will


collaboratively inform a report and recommendations for how synthetic data production and provision can be achieved at scale and with the trust and support of stakeholders, including the public.

4 Discussion: Challenges and opportunities

The use of synthetic data provides an opportunity to reduce demand for SDE access, as analysis to complete projects within an SDE environment could be carried out more quickly. Our desire is that SDEs operate as efficiently as possible, and synthetic data, in our opinion, offers a way to improve that efficiency, in the following ways:

It can enable researchers to make much more accurate data access applications. A benefit of this is that researchers will have more certainty that the data they are interested in accessing will support their research. Synthetic data should reduce the number of researchers who apply to access data, are set-up by the SDE to access data, but realise the data cannot support their research after all.

Researchers ought to be able to construct a significant amount of their statistical programming code outside of the SDE; and only use the SDE to refine and run the code on real data. This means they spend less time logged into the SDE and less time using compute resources for iterative coding.

If the use of synthetic data did create more opportunities to train and engage researchers in accessing sensitive data within an SDE environment, improve the quality of applications to access the data held, and also improve the efficiency of how SDEs operate, this may all drive up the use of this data for research in the public good. The process of producing useful synthetic data requires time, skills and customisation although much of the process can also be automated3. There are further challenges to address, including:

 Deciding which organisation is best placed to produce the synthetic data. The data owning organisation, or the organisation running the SDE?

 Should Digital Object Identifiers and other techniques be adopted to monitor version control and use of the synthetic data?

 What training and guidance should be made available to ensure that researchers do not inadvertently try to publish statistical findings that have been drawn from the synthetic version of the data, instead of the real data?

 How do we engage the public in discussions about the creation of synthetic data?

5 Conclusions

Synthetic data provides opportunities to smooth the researcher journey to access sensitive data via an SDE and reduce the burden on data owners and SDEs supporting researchers requesting such access. However, few use cases exist in the literature that evaluate the benefits and costs to stakeholders (researchers, data owners and SDEs), which is hindering scaled production and routine use of it. Evidence of public understanding and positive acceptance is not clear. Other barriers as described in this paper are not insurmountable and could, in the long run, reduce costs for stakeholders if automated systems were put in place. The benefits of the use of synthetic data are becoming clearer as more research is funded using secure data, the complexity of new, linked datasets increases, computational power increases and data science skills become better recognised for research across disciplines. For access to secure data to keep up with demand, synthetic data is a strong enabler and an important consideration for progress.


References:

1. Ritchie, F. 2008. Secure access to confidential microdata: four years of the Virtual Microdata Laboratory. Economic and Labour Market Review, vol 2, No. 5.

2. ADR UK. Approaches to creating synthetic data: Workshop at IPDLN conference 2022.

3. Nowok, B., Raab, G.M., and Dibben, C., 2017. ‘Providing Bespoke Synthetic Data for the UK Longitudinal Studies and Other Sensitive Data with the Synthpop Package for R 1’. Statistical Journal of the IAOS 33/3: 785–796. DOI: 10.3233/SJI-150153

Respondent centric survey design and data collection – the Transformed Labour Force Survey - Colin Beavan-Seymour, Maria Tortoriello and Sabina Kastberg (Office for National Statistics, United Kingdom)


Respondent Centric Survey Design and Data Collection – Transformed Labour Force Survey

UNECE Expert Meeting 2023

Maria Tortoriello Principal Social Researcher

Colin Beavan-Seymour Principal Social Researcher

UNECE Expert Meeting 2023

Talk outline

Part 1 – Survey Design

• What is the purpose of the Transformed Labour Force Survey?

• Survey Design – sample, collection modes

• Return rates

Part 2 – Implementation of an Adaptive Survey Design

• Why use an Adaptive Survey Design?

• How was it developed?

• How was it implemented?

• Initial findings

UNECE Expert Meeting 2023

Part 1 – Survey Design

Colin Beavan-Seymour

UNECE Expert Meeting 2023

What is the Transformed Labour Force Survey?

• A new survey which will collect data on key labour market measures

• Developed with a respondent centric approach

• Qualitative and quantitative research

• Online first

• A rationalisation and redevelopment / rethink of how to measure core labour

market concepts

• Extensive qualitative research with members of the public, interviewers,

data users

UNECE Expert Meeting 2023

The journey so far…

Timeline, 2017 to 2022/23:

• Tests 1 & 2 – Online response rates; Engagement strategies

• Test 3 – Mixed mode (online & F2F); Statistical outcomes

• Test 4 – Online attrition test – response rates across 3 waves

• TLFS Beta – Online only in response to pandemic

• Addition of Telephone – Online & telephone collection

• Knock-to-nudge – Using an Adaptive Survey Design

UNECE Expert Meeting 2023

Sample Design

[Diagram: Transformed Labour Force Survey Wave 1 through TLFS Waves 2-5, with routes to the Opinions Survey and other social surveys; sample sizes of 140,000 households and 40,000 households shown.]

UNECE Expert Meeting 2023

What data did this give us?

• A return rate (complete returns & partials) of around 37.5% - a great start!

• However, we were still seeing similar biases in the responding sample that other voluntary surveys in the UK were

experiencing, despite the online mode and user-centric design:

• A large proportion of respondents were over 55, many over 65 – fewer respondents of working age, more economically

inactive

• A majority of respondents owned their homes, many without a mortgage or loan

• Respondents with a white ethnic background comprised the vast majority of the data, under-representation from

other ethnic backgrounds

• The vast majority of data was from the online mode – only a small percentage was from telephone collections

• The 2018 test indicated that interviewers visiting households can increase response from under-represented areas

• But… with a large scale survey of over 500,000 a year… how can we increase the quality of the data collection but

keep the cost of the operation down?

UNECE Expert Meeting 2023

Part 2 – Adaptive Survey Design

Maria Tortoriello

UNECE Expert Meeting 2023

What is an Adaptive Survey Design (ASD)?

In November 2022 we implemented an ASD for the TLFS.

• What is an ASD?

➢ Dividing a sample into smaller groups that have

similar characteristics (segmentation)

➢ Applying alternative survey design features for

different groups: • modes, materials, incentives

➢ Objective is to improve targeted survey outcomes • reduce bias, reduce costs

Why use an Adaptive Survey Design?

• TLFS data collection strategy same for all sampled addresses = no adaptive survey design

• Experiencing differential non-response bias which affects estimates

• Statistical processing enables weighting of sample to account for some

bias, but confidence in estimates would only improve with higher quality input data.

• Next step for TLFS was to introduce additional modes - Face to Face follow up

• One size does not fit all!

• ASD allows you to target the right respondents in the right way, rather than targeting all

respondents in the same way = more efficient use of field resources

UNECE Expert Meeting 2023

How was the Adaptive Survey Design developed?

• Closely followed work of Statistics Netherlands (Schouten, B et al.)

• A key objective of ASD is to divide the sample into strata in order to define targeted protocols for each of

the strata

• A logistic regression model was applied to historical TLFS data to identify auxiliary variables strongly

associated with response to formulate the ASD strata.

• Variables considered were Index of Multiple Deprivation (IMD), Urban/Rural Classification, Country of

Birth, Age & Ethnicity (limited by available data).

• Derived and examined CV, R-Indicators and Partial R-Indicators to identify the variables and categories

of variables driving variation in response propensities (a minimal computational sketch follows this list)

• Strongest predictors of response:

• Age (<45)

• Urban/Rural Classification (Urban)

• Index of Multiple Deprivation (IMD deciles 1-4)
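A minimal computational sketch of this step is given below, using invented data and a simple unweighted logistic model (scikit-learn is used here purely for illustration): response propensities are estimated from the auxiliary variables, and the R-indicator, the coefficient of variation and an unconditional partial R-indicator are derived from them in the manner of Schouten et al.

```python
# Minimal sketch, assuming an unweighted sample: fit a response-propensity
# model on auxiliary frame variables, then summarise representativity with
# the R-indicator R = 1 - 2*S(rho) and the coefficient of variation of the
# propensities. Column names and data are invented for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
frame = pd.DataFrame({
    "under_45": rng.integers(0, 2, n),
    "urban": rng.integers(0, 2, n),
    "imd_1_4": rng.integers(0, 2, n),
})
# Simulated response: younger, urban, more-deprived addresses respond less.
logit = 0.2 - 0.6 * frame["under_45"] - 0.4 * frame["urban"] - 0.5 * frame["imd_1_4"]
frame["responded"] = rng.random(n) < 1 / (1 + np.exp(-logit))

X = frame[["under_45", "urban", "imd_1_4"]]
model = LogisticRegression().fit(X, frame["responded"])
rho = model.predict_proba(X)[:, 1]            # estimated response propensities

r_indicator = 1 - 2 * rho.std()               # 1 = perfectly representative response
cv = rho.std() / rho.mean()                   # variation relative to the mean propensity
print(f"R-indicator: {r_indicator:.3f}, CV: {cv:.3f}")

# Unconditional partial R-indicator for one variable: the between-category
# variation of the propensities (larger = that variable drives more bias).
between = frame.assign(rho=rho).groupby("under_45")["rho"].mean()
shares = frame["under_45"].value_counts(normalize=True).sort_index()
partial = np.sqrt((shares * (between - rho.mean()) ** 2).sum())
print(f"Partial R-indicator (under_45): {partial:.3f}")
```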

UNECE Expert Meeting 2023

ASD: Iteration 1

➢ STRATA 1 = Urban, less deprived areas, 45+

➢ STRATA 2 = urban, more deprived areas, 16-44

➢ STRATA 3 = urban, less deprived areas, 16-44

➢ STRATA 4 = urban, more deprived areas, 45+

➢ STRATA 5 = non-urban, more deprived areas, 16-44

➢ STRATA 6 = non-urban, more deprived areas, 45+

➢ STRATA 7 = non-urban, less deprived areas, 16-44

➢ STRATA 8 = non-urban, less deprived areas, 45+

high priority strata

• Potential to include numerous interventions in the ASD (e.g. mode, incentive, materials..)

• Keeping it simple with 1 intervention = ‘Knock to Nudge’ (KtN) follow up

• ASD will target KtN data collection at under-represented strata based on response propensities in order to

reduce the variation in response propensities for a selected set of auxiliary variables.

• This will ensure that data collection resources are used in the most efficient way whilst increasing response

from historically underrepresented population groups (a stratum-assignment sketch follows below).
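The sketch below illustrates the stratification step: each sampled address is mapped to one of the eight strata listed above from the three binary auxiliary variables, and a knock-to-nudge follow-up flag is set for a chosen set of strata. The flagged strata and the example addresses are assumptions for illustration, not the operational priority list.

```python
# Illustrative sketch: assign each sampled address to one of the eight ASD
# strata from the three auxiliary variables, then flag strata chosen for
# knock-to-nudge follow-up. The flagged set is an assumption for the example.
import pandas as pd

def asd_stratum(under_45: bool, urban: bool, more_deprived: bool) -> int:
    """Map the three binary auxiliary variables to strata 1-8 as listed above."""
    if urban:
        if more_deprived:
            return 2 if under_45 else 4
        return 3 if under_45 else 1
    if more_deprived:
        return 5 if under_45 else 6
    return 7 if under_45 else 8

sample = pd.DataFrame({
    "address_id": [101, 102, 103, 104],
    "under_45": [True, False, True, False],
    "urban": [True, True, False, False],
    "more_deprived": [True, False, True, False],
})
sample["stratum"] = [
    asd_stratum(a, b, c)
    for a, b, c in zip(sample["under_45"], sample["urban"], sample["more_deprived"])
]
ktn_strata = {2, 3, 4, 5}                      # assumed under-represented strata
sample["ktn_follow_up"] = sample["stratum"].isin(ktn_strata)
print(sample)
```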

UNECE Expert Meeting 2023

ASD Optimisation approach

• We are following a structured ‘trial and error’ approach to optimising our

ASD.

• The optimum solution is unknown and experimental testing is needed

• Start with a simple design that can be accommodated using existing

systems

• Document, evaluate, learn, extend…

• Grow – add features to the ASD as technical and admin systems improve

over time

UNECE Expert Meeting 2023

Early results

• ASD Evaluation project - ongoing

• Operational evaluation – evaluating optimal set up of KtN

o Optimal number of visits = 2/3

o Best days to make contact: Monday, Tuesday, Sunday

o Best time of day to make contact: between 3pm-8pm

o KtN not working as well in London and North West regions

• Data quality evaluation

• Improving variability in response across strata

• Small improvements in representativity of data

o Statistically significant increase in response from 'hard to reach' groups

First ‘full’ knock-to-nudge month

Thank you for listening!

Any questions?

Contact details:

[email protected]

[email protected]

UNECE Expert Meeting 2023

Survey Research and Development Principles: 11 value statements that facilitate Respondent Centred Design - Laura Wilson (Office for National Statistics, United Kingdom)


Rethinking Data Collection

Survey Research and Development Principles: 11 value statements that facilitate Respondent Centred Design

Laura Wilson, Principal Researcher and data collection expert, UK Government Data Quality Hub ([email protected])

12 June 2023

UNECE Expert Meeting on Statistical Data Collection

Design Principles: what exactly are they and why are they needed?

UNECE Expert Meeting on Statistical Data Collection

Design principles…

• Value statements

• Help us to be Respondent Centred

• Define good design

• Provide clear and practical recommendations for all to follow

• Educational aid

• Support change, consistency and decision making

UNECE Expert Meeting on Statistical Data Collection

ONS’ Design Principles: 11 Survey Strategy Research and Development Principles

UNECE Expert Meeting on Statistical Data Collection

UNECE Expert Meeting on Statistical Data Collection

Be different when you need to be

Principle 1

Take an optimode and adaptive approach to design

Principle 2

Evidence informs decision making

Principle 3

Data users lead the way

Principle 4

Respondents have the answers

Principle 5

Everyone counts

Principle 6

Trust, roles and responsibilities

Principle 7

It’s our job to make things simpler

Principle 8

Follow, reuse, and refresh

Principle 9

Iterate, learn, and share

Principle 10

Think about the whole service and solve problems as a whole

Principle 11

https://analysisfunction.civilservice.gov.uk/policy-store/office-for-national-statistics-ons-survey-strategy-research-and-development-principles-ssrdp/

UNECE Expert Meeting on Statistical Data Collection

Be different when you need to be

Principle 1

UNECE Expert Meeting on Statistical Data Collection

Take an optimode and adaptive approach to design

Principle 2

UNECE Expert Meeting on Statistical Data Collection

Evidence informs decision making

Principle 3

UNECE Expert Meeting on Statistical Data Collection

Data users lead the way

Principle 4

UNECE Expert Meeting on Statistical Data Collection

Respondents have the answers

Principle 5

Respondent Centred Design Framework (RCDF): https://analysisfunction.civilservice.gov.uk/policy-store/a-user-centred-design-approach-to-surveys/

UNECE Expert Meeting on Statistical Data Collection

Everyone counts

Principle 6

UNECE Expert Meeting on Statistical Data Collection

Trust, roles and responsibilities

Principle 7

UNECE Expert Meeting on Statistical Data Collection

It’s our responsibility to make things simpler

Principle 8

UNECE Expert Meeting on Statistical Data Collection

Follow, reuse, and refresh

Principle 9

UNECE Expert Meeting on Statistical Data Collection

Iterate, learn, and share

Principle 10

UNECE Expert Meeting on Statistical Data Collection

Think about the whole service and solve problems as a whole

Principle 11

UNECE Expert Meeting on Statistical Data Collection


Thank you – questions? Laura Wilson

[email protected]

UNECE Expert Meeting on Statistical Data Collection

2023 abstract UNECE Expert Meeting on Statistical Data Collection 'Rethinking Data Collection' online (12 - 14 June 2023)

Title:

Survey Research and Development Principles: 11 value statements that facilitate Respondent Centred Design.

Speaker:

Laura Wilson

Abstract: To successfully achieve the paradigm shift to a state where respondents are central and integral to survey design, we first need to know the values that underpin that new state. This is where design principles step in: they are value statements that set the standards and ways of working for all to follow. They are used to support change, consistency and decision making within teams and across organisations.

Design principles foster a common understanding of what it takes to make a survey respondent centred and they define what good design looks like. Having clear and practical recommendations for research and development teams to follow means that they are more likely to design successful surveys. They can also be used as an educational tool with stakeholders and staff to share and help explain the ethos and future vision.

At ONS, we’ve created 11 Survey Research and Development Principles for the new ONS Survey Strategy. They are:

1. Be different when you need to be
2. Take an optimode and an adaptive approach to design
3. Evidence informs our decisions, not assumptions
4. Data users lead the way
5. Respondents have the answers
6. Everyone counts
7. Trust, roles and responsibilities
8. Achieving simplicity is on us
9. Follow, reuse and refresh
10. Iterate, learn and share
11. Think whole service and solve whole problems

These will be used by all teams creating surveys at ONS. During this talk I will step through them and share how they help to facilitate Respondent Centred Design.

___________________________________________________________

Paper:

The ONS Survey Strategy Research and Development Principles

Be different when you need to be

When we find something that works, for example, a letter template or a question pattern, we use it widely. We:

• follow harmonised standards to improve the quality and comparability of our data across government
• use consistency to build legitimacy and brand recognition
• use tried and tested products to improve our ways of working and help us all achieve our goals

But, we also allow ourselves to take a different approach when our evidence shows we need to. This prevents us from complicating the respondent journey, which could compromise user needs. We always aim for consistency and not uniformity.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• aligns with other initiatives across the organisation
• products should reduce complexity and burden that exists in the system

Take an optimode and an adaptive approach to design

We develop each product, for example a letter or questionnaire, in the best way for the mode, or modes, it is administered in. We also develop each product for its mode-specific users. This is known as “optimode” design. By designing in this way, we can help respondents give us the data we need by reducing respondent burden. It also creates products that are more user friendly to our internal users, including interviewers and call centre staff.

We tailor each product to the medium it uses and the specific needs of the users in that mode. This helps us get the data we need and improves data quality.

We use adaptive web design during development which allows the layout to adapt to the screen size appropriately. We design for mobile screens first, and then larger ones. That’s because it helps challenge us to think about the minimum content needed. We justify each piece of content being added, and refer to user needs, user stories and user journeys to do so.

ONS Survey Strategy Delivery Principles

This SSRDP links to the ONS Survey Strategy Delivery Principles that “products should reduce complexity and burden that exists in the system”.

Evidence informs our decisions, not assumptions

The designs of our surveys and their products are based on evidence. We:

• do not make assumptions about our users’ needs
• do not make a design decision if there is no evidence to support and inform it
• avoid assumption-led design as this will lead us to produce the wrong thing

Evidence and insights can be gathered from many sources. For example, we could complete some research with respondents or explore existing data to inform our next actions.

ONS Survey Strategy Delivery Principles

This SSRDP links to the ONS Survey Strategy Delivery Principles that “decisions are backed up by evidence”.

Data users lead the way

Our surveys meet the data users’ needs because our design journey starts with them. We invest time with our users to learn about their data intentions. This includes understanding how they intend to use and analyse the data.

We let our data users lead the way by providing the concepts to be investigated, but they do not design the content itself. We avoid getting data users to design the content because the designs will not be respondent centred.

Once we understand our users’ needs we use this information, alongside the respondent needs, to inform the design of the respondent centred survey products.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• clear ownership and accountability
• decisions at right level
• decisions are backed up by evidence

Respondents have the answers

We listen to our respondents and know what they need. This means we design the right thing. We do not make design decisions based on our assumptions, personal views, and biases. Instead, we carry out research to learn about respondent mental models and needs. We explore the cognition and usability of our household and business survey products through testing. We involve interviewers, call centre staff, survey processors and operational staff in the development of products to incorporate their needs and their insights on respondents. This could include insights about issues with an existing questionnaire, for example.

We always ensure we learn about what respondents need, rather than what they want. We use our analysis of respondent needs to develop assets such as respondent journeys and stories which inform the design of survey products.

We follow the Respondent Centred Design Framework to ensure we design based on needs to create respondent centred products.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• decisions are backed up by evidence
• decisions at right level
• products should reduce complexity and burden that exists in the system

Everyone counts

Our statistics reflect the experiences of everyone in our society. This means all our surveys are designed to be inclusive and compliant with accessibility legislation.

We think about all types of respondents from the start because we want everyone to be able to take part in our surveys. We increase response and representation in our data by removing barriers to interaction and participation created through exclusionary design. We follow harmonised standards to ensure our survey questions are inclusive and that we collect representative data.

Inclusive and accessible designs reduce burden for all respondents, not just those with additional needs or disabilities. For example, we aim to design each product to meet the average reading age of the UK. This makes our products easier to understand which improves the overall respondent experience. Inclusive and accessible designs improve the quality of our data and build trust in our statistics.

These ways of working also apply to products that are developed for internal ONS users that are part of running a successful survey. This includes interviewers, call centre staff, survey processors, and operational staff. This provides equal opportunity to our workforce by ensuring everyone can use our products.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• products should reduce complexity and burden that exists in the system
• aligns with other initiatives across the organisation
• contribution to survey and ONS strategic outcomes is clear

Trust, roles, and responsibilities

We trust and involve the right people at the right time at every phase of a project. We define clear roles and responsibilities, which helps us run a successful survey and achieve our goals. We are transparent about the design process with stakeholders and involve them in the development journey.

Everyone involved in a project clearly understands their purpose and expected contribution at every point of designing and developing a survey. They know where their role begins and ends, which helps ensure people with the right skills for the job are assigned to the right part of the design and development process. This allows the organisation to fully benefit from the investment made to employ and train these people who are experts in their roles. It also avoids products being influenced and designed by the wrong people at the wrong time, which can lead to the wrong thing being built.

For example, data users are responsible for providing their data needs and analytical requirements to the research and design teams. The research and design teams then fulfil their role in the process, which is to conduct the research to develop the appropriate designs to meet user needs. The roles are clear: the data users do not dictate the design of the questions, as the research and design teams are trusted and skilled to produce the right product to meet their needs.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

 clear ownership and accountability  decisions at right level

It's our responsibility to make things simpler

We have a responsibility to make our surveys easy to understand and use. We do the hard work to make our surveys simple, which removes that burden from our staff and respondents.

We prioritise the respondent experience because we know that without doing so, we risk not achieving our goals. We develop surveys that meet the needs of respondents and data users by investing time and resources into the early research, design, and testing phases of a project. We monitor respondent burden and use the insights to inform decision making.

We develop surveys that do not rely upon staff intervention and lengthy help and guidance to get the data we need. Instead, they are clear and highly usable on their own, without the need for much additional support or advice. Through good design we empower our respondents to take part in our surveys and provide us the data we need. We only add additional help where research shows that further support is needed.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• products should reduce complexity and burden that exists in the system
• aligns with other initiatives across the organisation
• contribution to survey and ONS strategic outcomes is clear

Follow, reuse, and refresh

We follow best practice and standards in the design and development of our surveys. This ensures our surveys are high quality, modern, and sustainable.

When we have found something that works, we make it reusable and shareable instead of starting from the beginning of the development process every time. Our harmonised standards are good examples of this. This approach ensures others can benefit from the investment made in developing that product. It also avoids duplication of effort and spending of public money on creating the same thing.

Sometimes we may need to take a different approach. It is important to remember that each survey is different and may need bespoke products or solutions. The decision to do something different and stray from best practice and standards is always based on evidence and respondent needs, not assumptions, personal views, or biases. For example, harmonised standards are used as the starting point, but they can be adapted to meet the needs of specific surveys.

We continuously refresh our knowledge and understanding of best practice and standards, which allows us to constantly add to our evidence base. We look to the research of others around the world to inform our work, but we keep in mind the importance of country context.

We refresh our surveys and carry out continuous improvement to our content to ensure they remain relevant.

We use administrative data or other data sources, where available, to reduce survey length, respondent burden and operational demands while improving processing and the quality of statistical outputs.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• aligns with other initiatives across the organisation
• products should reduce complexity and burden that exists in the system
• proposals can be sustainably resourced and financed
• decisions are backed up by evidence

Iterate, learn, and share

We take an agile approach to developing our surveys, which helps us avoid the risk of building the wrong thing and finding that out too late. We always test our survey products and processes with respondent users before releasing them. We iterate and refine them based on research insights and not assumptions, which ensures we develop something that meets the users’ needs.

We are transparent about what does not work and we abandon these things when our research shows they are not suitable. We then work to find an alternative solution. We always test our survey products along the full end-to-end respondent journey to ensure we are providing our respondents with highly usable and coherent products.

Sometimes we may find a problem with a questionnaire after the live phase. There are several ways we might find this out, for example through interviewer feedback, respondent feedback, or by looking at the amount of imputation needed. When this happens, we flexibly adapt and improve the questionnaire rather than needing to run big re-development projects.

We involve topic experts from inside and outside of ONS to support with:

• the design of our surveys and their products
• how our surveys are run

We share our insights and learnings widely with others internally and externally. We abandon what does not work in favour of finding something that does, and we remember that discovering something does not work is a valid insight. We share prototypes and progress widely to gather feedback from people with different areas of expertise to create better products for our users.

We recognise the importance of bringing our stakeholders on our development journey to ensure successful survey design. This is why we involve them at all stages of a project.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• decisions are backed up by evidence
• decisions at right level

Think about the whole service and solve problems as a whole

We see our surveys and the data they generate as an ecosystem. We are aware of the interdependencies of each phase in the data lifecycle and how our decisions at the beginning and throughout affect the quality of the data we collect and produce.

When a survey uses multiple modes for data collection, we do not view or develop each mode separately. Instead, we think about all aspects together and consider them as one so we can create successful surveys that meet respondent needs. When there is a problem, we think about how it affects the whole respondent journey and data lifecycle. We then work to fix the problem accordingly.

We use intelligence from paradata, processing and analysis of data collection to help the future design of our surveys. For example, we learn about where we are doing lots of imputation as this may mean that we need to review the questions and concepts for clarity.

We think holistically about the design and appearance of our products to maximise response and data quality. We ensure all products respondents interact with are consistent in tone and appearance. This helps build trust and a brand identity.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• products should reduce complexity & burden that exists in the system
• contribution to survey and ONS strategic outcomes is clear
• aligns with other initiatives across the organisation