Reconciling AI and Data Privacy: The Tune Insight way

Executive Summary

Tune Insight’s vision is to transform the paradigm of the data economy into an insight economy: one that better protects sensitive data, that is more secure, fair, and protective of privacy and confidentiality rights, and that thereby reconciles the promise of artificial intelligence with data privacy.

This document defines Tune Insight’s principles in the responsible provisioning and use of Artificial Intelligence (AI) tools in order to preserve the high level of data privacy and compliance that we strive to offer in all our products. This is especially relevant in siloed or federated scenarios dealing with sensitive or regulated data such as in the healthcare and financial sectors. In these cases, the use of AI tools becomes critical in extracting the required actionable insights from data that must always remain under your control.

In this document, we give concrete examples of AI systems with varying levels of complexity and explain the main applicable principles of proportionality, risk minimization, and safe data reuse. These principles must be taken into account to successfully conduct a risk analysis and streamline compliance, as well as to guarantee the required level of governance and control over your data.

Furthermore, we explain the technical and organizational approaches needed to enforce these principles and how these approaches materialize in our product portfolio, in particular our Secure Federated Data Space, which supports the most sensitive and critical healthcare and financial scenarios.

About AI

Artificial Intelligence or AI systems and AI projects refer to any software that includes machine learning, deep learning, natural language processing or neural network components. This comprises all partially or entirely autonomous systems, as well as systems whose development includes fitting parameters based on data. The participants of an AI project are the parties that take part in its development as well as the data owners whose data is used to fit the system. We provide below a descriptive list of widespread, representative examples of AI systems for which Tune Insight provides solutions:

Generalized Linear Models (GLM)

The Generalized Linear Models category covers machine learning models composed of a single linear layer and an optional non-linear activation. It includes linear and logistic regression models, used for regression and classification tasks on tabular data. They can be securely trained under fully homomorphic encryption in both centralized and federated scenarios, ensuring that neither the private data nor the partially trained models aggregated after each round are disclosed to the other participants. Additionally, they can be served in a privacy-preserving way through Model-as-a-Service (see our upcoming MaaS article), allowing users to run inference and classification tasks on their encrypted data without disclosing the model to the users.
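
As a minimal sketch of what such a model looks like, a logistic model is simply a single linear layer followed by a sigmoid activation. The snippet below uses plain PyTorch and omits the encryption and federation layers that Tune Insight adds on top; the dimensions and training data are dummy values.

```python
# Minimal sketch of a Generalized Linear Model for binary classification:
# a single linear layer plus a sigmoid activation (encryption and federation omitted).
import torch
import torch.nn as nn

class LogisticModel(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)   # the single linear layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x))     # non-linear activation

model = LogisticModel(n_features=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

x, y = torch.randn(32, 10), torch.randint(0, 2, (32, 1)).float()  # dummy batch
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```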

Deep Neural Networks

The Deep Neural Networks category includes machine learning models composed of a large number of layers. They are used in the most advanced medical and industrial tasks, such as diagnosis on medical images or anomaly detection on industrial components, respectively. This category includes Convolutional Neural Networks (CNN), Residual Neural Networks (ResNet), Long Short-Term Memory networks (LSTM) and Multilayer Perceptrons (MLP). They can be trained collaboratively through Privacy Preserving Federated Learning, providing the additional data protection guarantees needed in the most sensitive tasks, which Federated Learning alone cannot address. These models can also be served in a privacy-preserving way through MaaS. For example, doctors can benefit from advanced AI models served by external entities and applied to their encrypted sensitive data.
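
For illustration only, a small convolutional classifier of the kind this category covers might look as follows. This is a generic sketch, not one of the production architectures deployed on the platform; the input size and number of classes are arbitrary.

```python
# Sketch of a small Convolutional Neural Network for image classification
# (a generic example of the deep model family, not a production architecture).
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = SmallCNN()(torch.randn(1, 3, 64, 64))  # e.g. a 64x64 RGB dermatology image
```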

Generative AI

Generative AI models are systems trained to generate data of any kind. This category contains the heaviest machine learning models that require powerful hardware to be trained and to be used. In particular, it includes Large Language Models (LLMs) and image generation models. It is notably difficult to properly balance utility and privacy when resorting to generative AI, hence special care has to be taken when using these systems. For this purpose, we have established a set of principles for responsible use of Generative AI that our product development and operation always follow, and which translate into technical requirements for systems trained with private or protected data.

Large Language Models

Large Language Models (LLMs) are integrated in Tune Insight’s products for specific, well-defined tasks, e.g., to improve the user experience with helpers for database query generation in the user interface (see our LLM integration article). Besides frontend enhancements, LLMs can be collaboratively fine-tuned on Tune Insight’s platform via Low-Rank Adaptation (LoRA) to meet more specific requirements in text generation tasks. This approach enables different parties to collaboratively develop Large Language Models for specific tasks based on their respective proprietary data.
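
As a hedged sketch of what attaching LoRA adapters to a pretrained model involves, the snippet below assumes the Hugging Face transformers and peft libraries. The base model and hyperparameters are illustrative, and the collaborative, secure aggregation of the adapters on the platform is not shown here.

```python
# Sketch of attaching Low-Rank Adaptation (LoRA) adapters to a pretrained LLM
# with the Hugging Face peft library. Model name and hyperparameters are
# illustrative; secure federated aggregation of the adapters is not shown.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layers in GPT-2
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trained
```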

Risk assessment

As a necessary step for guaranteeing compliance with the most recent and evolving regulations on AI [2] and data protection [1,3], the development and use of AI applications must be preceded by a risk assessment. It consists of evaluating the risks related to data privacy, including the sensitivity of the data as well as the likelihood and impact of potential attacks and breaches. All the participants involved in the process using AI must be accounted for in the risk assessment and must be aware of it before the process is implemented in a production environment.

Tune Insight’s Principles

Aligned with our mission and vision, we aim to reconcile the advances of AI with data protection. This is conveyed in our 3 main principles for a responsible use of AI: proportionality, risk minimization, and safe data reuse. These 3 principles guide the development of all Tune Insight’s solutions and, in particular, they form the core of our AI-related products.

Proportionality

Aligned with Art. 5 of the General Data Protection Regulation (GDPR) [1], every AI system supported by our platform must have clearly defined objectives and must not be used for any other purpose. Those objectives must be shared in a transparent manner between all the parties involved in the system’s development. AI systems built in compliance with this proportionality principle are strongly linked to their objectives and any capability allowing the system to go beyond those objectives should be prevented. The objectives of the system cannot be changed without a proper redefinition of the conditions involving all the participants.

Responsible use of Generative AI and Large Language Models (LLMs)

Generative AI and LLMs present additional risks with regard to proportionality. Models in this category comprise a very large number of parameters, require a high memory capacity, and imply a high risk of data leakage. The use of such components in an AI system must therefore be relevant and proportionate to the general objectives the AI system serves; without this scoping, it is not possible to deliver appropriate privacy protection. Moreover, the integration of Generative AI and LLMs in the corresponding business processes must confine their usage and their intended outputs to the aforementioned objectives. This restriction must be guaranteed by an additional layer enforcing strict rules on the allowed outputs. For example, an LLM can be used to classify human-written text describing a patient’s condition into a diagnosis. In this case, a dedicated layer restricts the output to one of the authorized answers, i.e., an element of a predefined list of diagnoses. This mechanism prevents asking arbitrary, unrestricted questions to the model about possibly protected data, which would fall outside the scope of the task and beyond the system’s objectives.

To comply with the proportionality principle as well as the risk minimization principle defined hereafter, Tune Insight avoids high-risk AI systems in all its products and solutions. High-risk AI systems, as defined in Annex III of the European Artificial Intelligence Act (AI Act) [2], do not allow an exhaustive and clear definition of their objectives.
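
A minimal sketch of the kind of output-restriction layer described above is given below: whatever text the generative model produces, only an element of a predefined list of diagnoses is ever released. The diagnosis list and the `generate_diagnosis` wrapper are hypothetical placeholders, not part of Tune Insight’s products.

```python
# Minimal sketch of an output-restriction layer: the generative model's raw text
# is only released if it maps onto a predefined list of authorized diagnoses.
# `generate_diagnosis` is a hypothetical wrapper around an LLM call.
ALLOWED_DIAGNOSES = {"melanoma", "basal cell carcinoma", "benign nevus"}

def restrict_output(raw_model_output: str) -> str:
    """Map the model output to an authorized label, or refuse to answer."""
    normalized = raw_model_output.strip().lower()
    for diagnosis in ALLOWED_DIAGNOSES:
        if diagnosis in normalized:
            return diagnosis
    return "out of scope"  # anything else is blocked, never returned verbatim

def classify_description(patient_description: str, generate_diagnosis) -> str:
    raw = generate_diagnosis(patient_description)  # LLM call, out of scope here
    return restrict_output(raw)
```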

Risk Minimization

As a second axis to align with Arts. 5 (principles), 25 (data protection by design and by default), and 32 (security of processing) of the GDPR [1], our risk minimization principle defines the rules applied during system development and usage (in particular, AI-based models) in order to minimize the risks related to sensitive data. Risk minimization comprises two rules described hereafter: data minimization and minimal information transfer.

Data Minimization

Before training, deploying or adopting machine learning or AI systems, the training data must be defined and restricted. Data should be used if and only if it serves the objectives of the AI model, and this selection must be done at the finest granularity level, considering the following procedures (a minimal sketch in code follows the list):

  • Dataset source selection: Each data source used for the training of ML or AI models must be relevant to the system’s objectives, and no source may be included in an AI project without explicit justification.
  • Horizontal selection: Inside each dataset taking part in an AI model development, a proper filtering of the samples must be implemented in order to exclude all the samples, subsets or chunks of the data that are considered to be irrelevant for the system’s final objective. This filtering must be set and applied prior to the model’s development and it must select the necessary subsets of the data in accordance with the objectives of the task.
  • Vertical selection: When the data or part of the data is tabular or multimodal, a selection of the required features or groups of features must be done in order to exclude the fields that are not required to fulfill the system’s task.
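
As a minimal sketch of these three selection steps on a tabular dataset, data minimization with pandas might look as follows; the file name, filter condition, and column list are purely illustrative assumptions.

```python
# Sketch of data minimization on a tabular dataset with pandas: keep only the
# relevant source, the relevant rows (horizontal selection) and the relevant
# columns (vertical selection). Names and thresholds are illustrative.
import pandas as pd

records = pd.read_csv("oncology_site_A.csv")          # dataset source selection

cohort = records[records["diagnosis_year"] >= 2015]   # horizontal selection:
                                                       # only samples relevant to the task

features = cohort[["age", "tumor_stage", "treatment", "outcome"]]  # vertical selection:
                                                       # only the fields the model needs
```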

Minimal Information Transfer

To guarantee data owners total control, governance, and ownership over their data, the development of AI models must not require moving or replicating raw data for training purposes. During training, the data must be loaded by each participant from its own source, without replication or transfer. In addition to reducing the risk of data breaches through replicated datasets, this also constitutes an optimal configuration for fitting models on up-to-date information and for strengthening data integrity. This applies to all settings involving AI systems, including, but not limited to, collaborative building of machine learning models, federated learning, and decentralized learning. In such scenarios, only insights, corresponding to the minimal information required for the model to achieve its objectives, are shared between the participants.
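
The sketch below illustrates this pattern for a simple least-squares model: each participant loads its data locally and only the updated model parameters cross the boundary. The function names are illustrative, and the encryption of the shared updates that Tune Insight’s platform adds on top is omitted here.

```python
# Sketch of minimal information transfer in federated training: each participant
# fits the model on locally loaded data and only the model parameters are shared;
# raw data never leaves its source. Encryption of the shared updates is omitted.
import numpy as np

def local_update(global_weights: np.ndarray, load_local_data) -> np.ndarray:
    """One local training step; `load_local_data` reads from the participant's own source."""
    X, y = load_local_data()                          # data stays on the participant's side
    grad = X.T @ (X @ global_weights - y) / len(y)    # gradient of a least-squares loss
    return global_weights - 0.01 * grad               # only these weights are shared

def federated_round(global_weights, participants):
    updates = [local_update(global_weights, p) for p in participants]
    return np.mean(updates, axis=0)                   # aggregation of the shared insights only
```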

Safe Data Reuse

The safe data reuse principle sets the standards for a responsible and safe secondary use of health, mobility, environmental, agricultural or public data. Before data can be repurposed for a project that differs from its original collection intent, it must undergo the necessary validation and adaptation steps in accordance with the principles of proportionality and risk minimization. Permission to use data in a specific project does not grant authorization for its utilization in any other context. Data validation for secondary use follows a procedure including filtering, de-identification and validation of the underlying AI system. Compliance with this principle allows the valorization of specific data in multiple contexts and opens the door to multidisciplinary studies and collaborations, in alignment with the European Data Governance Act (DGA) [3] and, in healthcare scenarios, with the European Health Data Space regulation (EHDS) [4].

Technical Approach

To materialize these principles, Tune Insight’s platform (the Secure Federated Data Space) makes use of a range of technical solutions (listed in Figure 1), which are combined to meet the data protection level required by highly sensitive applications. This section provides an overview of these technologies and explains how they are integrated in Tune Insight’s products.

Figure 1. Technologies used in Tune Insight’s products and their purpose.


Fully Homomorphic Encryption (FHE)

Fully Homomorphic Encryption (FHE) enables secure operations on private data, ensuring the highest level of security. With FHE, data remains encrypted by its owner, preventing unauthorized parties from deriving any information during processing. Tune Insight integrates FHE into its solutions using Lattigo, its award-winning, open-source Multiparty Homomorphic Encryption library (see more in the Lattigo v5.0.0 announcement). Lattigo, which will soon release its version 6.0.0, supports various quantum-safe homomorphic encryption schemes, providing the precision required for advanced machine learning tasks such as running inference on Convolutional Neural Networks. For efficiency reasons, we combine homomorphic operations with other methods, described hereafter, to achieve high scalability and performance without compromising the security level in federated scenarios. Additionally, Homomorphic Encryption can also be used to train Generalized Linear Models on encrypted data, allowing secure collaborative model construction in Federated Learning.
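
To build intuition for computing on encrypted data, the toy example below uses the textbook Paillier scheme. This is only an additively homomorphic scheme, not the quantum-safe lattice-based schemes implemented by Lattigo, and the parameters are far too small for any real use; it simply shows that ciphertexts can be combined without ever decrypting the inputs.

```python
# Toy illustration of computing on encrypted data, using textbook Paillier
# (additively homomorphic only). This is NOT Lattigo and NOT quantum-safe FHE;
# it only shows that ciphertexts can be combined without decrypting the inputs.
import random
from math import gcd, lcm

p, q = 1000003, 1000033                  # toy primes, far too small for real use
n, n2 = p * q, (p * q) ** 2
lam = lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                     # valid because the generator g = n + 1

def encrypt(m: int) -> int:
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n) * mu % n

c1, c2 = encrypt(17), encrypt(25)
c_sum = (c1 * c2) % n2                   # homomorphic addition on ciphertexts
assert decrypt(c_sum) == 42              # the computing party never saw 17 or 25
```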

Secure Multiparty Computation (SMPC)

Secure Multiparty Computation is a cryptographic technique that enables multiple participants to collaboratively apply a function over their respective inputs while ensuring that each participant’s inputs remain private. In the context of Tune Insight’s Secure Federated Data Space, we do not use SMPC on the data but on the keys, so that decryption of the resulting models is implemented as a collective protocol. It ensures that the decryption of the machine learning models or the computation results cannot be done without the participation and the approval of all data owners involved in the process.
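
As a toy illustration of collective key control (not Tune Insight’s actual protocol), the sketch below additively secret-shares a key so that reconstruction, and hence decryption, requires every participant’s share.

```python
# Toy sketch of collective key control: the decryption key is additively
# secret-shared so that decryption is only possible when every participant
# contributes its share. This illustrates the idea, not Tune Insight's protocol.
import secrets

MODULUS = 2**127 - 1

def share_key(key: int, n_parties: int) -> list[int]:
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((key - sum(shares)) % MODULUS)   # all shares sum to the key
    return shares

def reconstruct_key(shares: list[int]) -> int:
    return sum(shares) % MODULUS                   # needs ALL shares

key = secrets.randbelow(MODULUS)
shares = share_key(key, n_parties=3)
assert reconstruct_key(shares) == key
assert reconstruct_key(shares[:2]) != key          # a strict subset reveals nothing useful
                                                   # (equality fails except with negligible probability)
```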

FL/FA Aggregations

In Tune Insight’s Secure Federated Data Space, training on siloed or federated data guarantees that the individually trained models are always encrypted before being shared for aggregation (see our secure federated learning article). This guarantee, made possible by a combination of FHE and SMPC, ensures that only global aggregates are visible to the participants and permits a more efficient application of differential privacy, yielding more performant models for a given target differential privacy level.
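
The sketch below illustrates the resulting property, namely that only the global aggregate is ever visible, using simple pairwise masking. Tune Insight achieves this with FHE and SMPC; the masking here is purely illustrative because it fits in a few lines.

```python
# Toy sketch of the "only the aggregate is visible" property: each participant
# masks its model update so that individual updates are hidden, yet the masks
# cancel out in the sum. (Tune Insight uses FHE + SMPC instead of masking.)
import numpy as np

rng = np.random.default_rng(0)
updates = [rng.normal(size=4) for _ in range(3)]     # local model updates

# Pairwise masks: party i adds mask_ij and party j subtracts it, so masks cancel.
masks = {(i, j): rng.normal(size=4) for i in range(3) for j in range(3) if i < j}

masked = []
for i, update in enumerate(updates):
    m = update.copy()
    for (a, b), mask in masks.items():
        if a == i:
            m += mask
        elif b == i:
            m -= mask
    masked.append(m)                                  # only these masked values leave the site

aggregate = sum(masked) / 3                           # equals the true average
assert np.allclose(aggregate, sum(updates) / 3)
```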

Differential Privacy

Differential Privacy is essential in protecting the training data used to build AI models. It consists of adding carefully controlled random perturbations during the development of the models to limit the leakage of private information from the released computation results. This has been shown to prevent the disclosure of private information from trained models through various attacks such as membership inference attacks [5], model inversion attacks [6] and reconstruction attacks [7]. In our Secure Federated Data Space, differential privacy noise is added to the aggregated models and computation results released to the participants in order to protect the private input information.
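
As a minimal sketch of the underlying mechanism, the example below releases a differentially private mean using the Gaussian mechanism. The clipping bound and the (epsilon, delta) parameters are illustrative choices, not the calibration used in the platform.

```python
# Minimal sketch of the Gaussian mechanism: calibrated noise is added to a
# released aggregate so that any single record's influence is bounded.
# The clipping bound, epsilon and delta are illustrative choices.
import numpy as np

def dp_mean(values: np.ndarray, clip: float, epsilon: float, delta: float) -> float:
    clipped = np.clip(values, -clip, clip)
    sensitivity = 2 * clip / len(values)             # max change from altering one record
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return float(clipped.mean() + np.random.normal(0, sigma))

ages = np.array([34, 51, 47, 62, 29], dtype=float)
print(dp_mean(ages, clip=100.0, epsilon=1.0, delta=1e-5))
```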

Synthetic Data

Synthetic data plays a crucial role in the development of data-oriented computations, allowing collective processes to be validated before the integration of real sensitive data, which is often subject to restrictive constraints. Synthetic data generators create synthetic datasets that preserve selected statistical properties of the original data while ensuring the protection of sensitive information. Synthetic data introduces a tradeoff between privacy and utility and should be used according to the aforementioned proportionality and minimization principles. We avoid general-purpose generators that would result in poor privacy or poor preservation of the target properties. Instead, Tune Insight’s platform integrates generators specifically tailored to chosen target workflows in order to optimize the utility-privacy tradeoff and guarantee leakage minimization, while enabling more exploratory work to set up and test collective processes without exposing the real data to risk.
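
For intuition only, the toy generator below preserves simple per-column marginals (mean and standard deviation for numeric columns, category frequencies otherwise). The workflow-specific generators integrated in the platform preserve more structure than this sketch.

```python
# Toy synthetic-data generator that only preserves simple per-column marginals.
# Real workflow-specific generators preserve more structure; this is only a sketch.
import numpy as np
import pandas as pd

def synthesize(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    synthetic = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # numeric column: sample from a normal with the same mean and std
            synthetic[col] = rng.normal(df[col].mean(), df[col].std(), n_rows)
        else:
            # categorical column: sample categories with the same frequencies
            freqs = df[col].value_counts(normalize=True)
            synthetic[col] = rng.choice(freqs.index, size=n_rows, p=freqs.values)
    return pd.DataFrame(synthetic)
```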

FHE Model Conversion

Tune Insight’s Secure Model-as-a-Service (MaaS) tools can convert most PyTorch models into secure versions that work with encrypted input data. However, the top-performing AI solutions in classical settings may not be optimal under FHE. To maximize performance on encrypted data, models can often be optimized through architecture refinement before being converted for secure integration with MaaS. Along with its MaaS platform, Tune Insight provides its customers with the tools to adapt their solutions through FHE-oriented model optimization. Additionally, a guide to developing FHE-friendly models can be provided to help create models specifically designed for encrypted-data inference.
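
As an example of the kind of architecture refinement involved (not Tune Insight’s converter itself), the sketch below replaces operations that are expensive or unsupported under FHE, such as ReLU and max pooling, with polynomial-friendly alternatives.

```python
# Sketch of a common FHE-oriented refinement: non-polynomial operations (ReLU,
# max pooling) are replaced with polynomial-friendly ones (squaring, average
# pooling), since FHE schemes evaluate additions and multiplications natively.
# This illustrates the kind of change involved, not Tune Insight's converter.
import torch
import torch.nn as nn

class Square(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * x                      # degree-2 polynomial activation

class FHEFriendlyCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), Square(), nn.AvgPool2d(2),
            nn.Flatten(),
            nn.Linear(8 * 32 * 32, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

logits = FHEFriendlyCNN()(torch.randn(1, 3, 64, 64))
```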

Secure Federated Data Space

As a realization of our principles applied to software for collective data analysis and machine learning, Tune Insight's Secure Federated Data Space (SFDS) connects the previously discussed technologies to facilitate collaborative computations. Through its web user interface, the SFDS provides clarity and transparency to data owners regarding the processes applied to their data.

In the SFDS, a project's setup provides a comprehensive definition of the computations and requires each participant's informed approval of the terms governing the use of private data before any data source is connected to the process. To assist with parameter selection, the computations can be simulated with synthetic data generated for specific contexts. While FHE and SMPC secure the computations in the backend, a differential privacy budget restricts the leakage from the outputs and limits the number of operation executions to prevent abuse or misuse of real data.

The SFDS can be deployed on premises or in the cloud, but always close to the data source, to uphold our principles at the design level. To minimize infrastructure-dependent adaptations, the application is containerized with Docker and can be rapidly deployed on a Kubernetes cluster. It is compatible with various network topologies, adapting to the organizational structure and requirements of the application and its participants. Figure 2 displays an arrangement of 3 institutions securely connected with the SFDS to run collective analyses.

Figure 2. Secure Federated Data Space Architecture


The SFDS has enabled secure collaborations and analyses on sensitive data, notably patient data across hospitals. Applications range from basic statistics, such as survival curves of patients undergoing immunotherapy treatment, to the collaborative training of neural-network classifiers for detecting pigmented skin lesions in dermatology images (see our article on Secure Federated Learning).

References

  1. European General Data Protection Regulation (GDPR), Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), European Commission, 27 April 2016.
  2. European Artificial Intelligence Act (AI Act), Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts, European Commission, 21 April 2021.
  3. European Data Governance Act (DGA), Regulation (EU) 2022/868 of the European Parliament and of the Council of 30 May 2022 on European data governance and amending Regulation (EU) 2018/1724 (Data Governance Act), European Commission, 30 May 2022.
  4. European Health Data Space (EHDS), Proposal for a REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL on the European Health Data Space, European Commission, 3 May 2022.
  5. Shokri, R., Stronati, M., Song, C. and Shmatikov, V., 2017, May. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP) (pp. 3-18). IEEE.
  6. Fredrikson, M., Jha, S. and Ristenpart, T., 2015, October. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security (pp. 1322-1333).
  7. Balle, B., Cherubin, G. and Hayes, J., 2022, May. Reconstructing training data with informed adversaries. In 2022 IEEE Symposium on Security and Privacy (SP) (pp. 1138-1156). IEEE.

Related Articles by Tune Insight

  1. Secure Federated Learning with Tune Insight encrypted computing platform
  2. LLM integration in Tune Insight Products
  3. Tune Insight Open-source Multiparty Homomorphic Encryption Library Lattigo v5.0.0
  4. MaaS: secure Model-as-a-Service (upcoming), description and demonstration of running deep learning models under fully homomorphic encryption, with benchmarks on tasks of varying complexity.
  5. Differential Privacy in Machine Learning (upcoming), how to apply differential privacy in machine learning for an effective privacy gain, and how to set sensitivity in advanced machine learning tasks.
  6. Homomorphic Encryption in Federated Learning (upcoming), how Homomorphic Encryption improves model accuracy in secure Federated Learning.
  7. Privacy & Generative AI (upcoming), can Generative AI be privacy preserving, what is its purpose, and how can LLMs be used responsibly?
