Testing Times Ahead: Meeting the EU AI Act’s requirements for model testing

The EU AI Act, the first comprehensive AI regulation, mandates extensive model testing and risk assessments for high-risk and general-purpose AI systems, emphasizing the need for bias testing and for accuracy, robustness, and cybersecurity evaluations. Companies must also establish quality management systems and maintain detailed records of their evaluation activities. Compliance challenges include data access, testing scalability, and a lack of clear guidelines. Significant investments in testing capabilities and the use of third-party evaluators will be crucial for companies to meet these stringent requirements and avoid severe penalties.
May 24, 2024
5 min read

By Stephanie Cairns and Philip Dawson

The EU AI Act, the world’s first comprehensive AI regulation, received its long-awaited final green light this week. Central to the Act’s mandates for providers and deployers of both high-risk and general purpose AI systems are model testing and risk assessment. For companies looking to comply with the new law when it comes into force next month, gearing up their testing capabilities should be a top priority. Doing so, however, may prove difficult: many companies face key challenges related to data access, testing scalability, and a lack of clear guidance or benchmarks. 

Examples of high-risk AI systems

Annex III of the Act contains the full list of high-risk use cases. Prominent examples include AI/LLM systems used in the following contexts: 

  • Biometrics: AI systems used for remote biometric identification, biometric categorization based on sensitive attributes, and emotion recognition.
  • Critical Infrastructure: AI systems used in the management and operation of critical digital infrastructure, road traffic, and essential utilities like water, gas, heating, and electricity. 
  • Education and Vocational Training: AI systems determining access to educational institutions, evaluating learning outcomes, assessing educational levels, and monitoring prohibited behaviour during tests. 
  • Employment and HR: AI systems for recruitment, job advertisement targeting, application analysis, candidate evaluation, work-related decision-making, task allocation, and performance monitoring. 
  • Public and Private Services: AI systems for evaluating eligibility for public assistance and healthcare services, assessing creditworthiness, risk assessment and pricing in insurance, and classifying emergency calls for service dispatch and triage.

Requirements for high-risk systems

  1. Data testing (Article 10)

Providers¹ must test all training, testing, and validation datasets for bias, relevance, representativeness, accuracy, and completeness. Notably, the Act allows companies to leverage demographic data for bias testing - so long as a) the use of synthetic or anonymous data proves infeasible, and b) certain data protection conditions are met - carving out an exception to the GDPR’s prohibition on processing sensitive data. 
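
In practice, these dataset-level checks can start simply. The sketch below, which assumes a pandas workflow and an illustrative protected attribute column, flags missing values, compares group shares against assumed reference population figures, and looks at label balance by group as a first-pass bias signal; none of the column names or thresholds come from the Act itself.

```python
import pandas as pd

# Hypothetical training data with a protected attribute column ("sex").
df = pd.DataFrame({
    "age": [34, 29, None, 51, 45, 38],
    "income": [48_000, 52_000, 39_000, None, 61_000, 45_000],
    "sex": ["F", "M", "F", "M", "M", "M"],
    "label": [1, 0, 0, 1, 1, 0],
})

# Completeness: share of non-missing values per column.
completeness = 1 - df.isna().mean()
print("Completeness per column:\n", completeness.round(2))

# Representativeness: compare observed group shares against an assumed
# reference population (figures are placeholders for illustration).
reference_shares = {"F": 0.51, "M": 0.49}
observed_shares = df["sex"].value_counts(normalize=True)
for group, expected in reference_shares.items():
    observed = observed_shares.get(group, 0.0)
    flag = "OK" if observed >= 0.8 * expected else "UNDER-REPRESENTED"
    print(f"{group}: observed {observed:.2f} vs expected {expected:.2f} -> {flag}")

# Label balance across groups as a rough first-pass bias signal.
print("Positive-label rate by group:\n", df.groupby("sex")["label"].mean().round(2))
```

Real programmes will layer proper statistical tests and fairness metrics on top of checks like these, but even a thin script makes the Article 10 criteria auditable.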

  2. Model testing and disclosure requirements (Articles 13 and 15)

Providers must supply deployers with information about their system’s level of accuracy, robustness, and cybersecurity, any relevant risk scenarios, and, “when appropriate”, its performance with respect to target groups. Such disclosures presuppose at least some accuracy, robustness, and cybersecurity testing. While exact testing requirements remain ambiguous, the European Commission aims to encourage the creation of benchmarks and standardized measurement techniques. 
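
As a rough illustration of the evidence such disclosures might draw on, the sketch below trains a classifier on synthetic data and reports overall accuracy, accuracy under a small input perturbation as a crude robustness proxy, and accuracy per demographic group. The perturbation scale and the grouping variable are assumptions for illustration, not methods prescribed by the Act.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic tabular data standing in for a high-risk use case.
X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
group = rng.integers(0, 2, size=len(y))  # hypothetical demographic attribute

X_tr, X_te, y_tr, y_te, _, g_te = train_test_split(X, y, group, random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)

# Overall accuracy on the holdout set.
acc = accuracy_score(y_te, model.predict(X_te))

# Robustness proxy: accuracy after small Gaussian noise on the inputs.
X_noisy = X_te + rng.normal(scale=0.1, size=X_te.shape)
robust_acc = accuracy_score(y_te, model.predict(X_noisy))

# Performance with respect to target groups.
per_group = {
    int(g): round(accuracy_score(y_te[g_te == g], model.predict(X_te[g_te == g])), 3)
    for g in np.unique(g_te)
}

print(f"accuracy={acc:.3f}  accuracy_under_noise={robust_acc:.3f}  per_group={per_group}")
```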

  3. Risk assessment (Article 9)

Providers must set up a risk management system to estimate and evaluate possible risks. Doing so will require providers to use pre-defined metrics and thresholds, conduct tests under real-world conditions where appropriate, run tests aimed at identifying effective risk mitigation strategies, and give special consideration to potential harms to children and other vulnerable groups. Helpful starting points for implementing AI risk assessments and impact assessments include published international standards such as ISO/IEC 23894 - AI Risk Management, and ISO/IEC 42005 - AI Impact Assessment, which is under development. 
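
A risk management system ultimately has to turn those metrics and thresholds into something checkable. One possible shape for that is a small register of pre-defined criteria evaluated against measured results, as in the sketch below; the metric names, thresholds, and affected groups are illustrative assumptions, not values drawn from the Act or the ISO standards.

```python
from dataclasses import dataclass

@dataclass
class RiskCriterion:
    """A pre-defined metric, its acceptance threshold, and the group it protects."""
    metric: str
    threshold: float
    higher_is_better: bool
    affected_group: str

# Illustrative criteria for a hypothetical credit-scoring system.
criteria = [
    RiskCriterion("accuracy", 0.90, True, "all applicants"),
    RiskCriterion("false_positive_rate_minors", 0.02, False, "children"),
    RiskCriterion("demographic_parity_gap", 0.05, False, "protected groups"),
]

# Measured results from the latest evaluation run (made-up numbers).
measured = {
    "accuracy": 0.93,
    "false_positive_rate_minors": 0.04,
    "demographic_parity_gap": 0.03,
}

for c in criteria:
    value = measured[c.metric]
    passed = value >= c.threshold if c.higher_is_better else value <= c.threshold
    status = "PASS" if passed else "FAIL - mitigation required"
    print(f"{c.metric}: {value} (threshold {c.threshold}, {c.affected_group}) -> {status}")
```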

  4. Rights assessment (Article 27)

Prior to deployment, certain deployers (Article 27 sets out the full list of conditions) must conduct an assessment to determine how the system may impact fundamental rights, such as through discriminatory actions. The assessment must analyze which groups may be affected or harmed by the system and how. 

  5. Record keeping (Article 11)

Providers must maintain records of the above evaluation activities (and any others undertaken), including information on assessment methods, metrics, and results. 
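
A lightweight way to keep such records is to append a structured log entry per evaluation run. The sketch below writes JSON lines capturing the method, metrics, and results; the field names and file layout are assumptions rather than a format mandated by the Act.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_evaluation(system_id: str, method: str, metrics: dict,
                   log_path: str = "evaluation_records.jsonl") -> None:
    """Append one evaluation record as a JSON line (illustrative schema)."""
    record = {
        "system_id": system_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "method": method,          # e.g. holdout test, adversarial test
        "metrics": metrics,        # metric name -> measured value
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

# Example: record the results of a bias test on a hypothetical system.
log_evaluation(
    system_id="cv-screening-v2",
    method="disaggregated accuracy on holdout set",
    metrics={"accuracy": 0.91, "accuracy_gap_by_sex": 0.04},
)
```

An append-only log like this is easy to export later into the technical documentation the Act requires.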

  6. Quality management system (Article 17)

To ensure that the above requirements (and all other applicable requirements) are satisfied, providers must set up a quality management system. This will involve documenting which tests are to be carried out pre- and post-deployment and with what frequency, as well as meeting other requirements that broadly align with standards such as ISO/IEC 42001 - AI Management Systems (which has been mapped to the NIST AI Risk Management Framework).
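
The testing portion of that quality management system can begin life as a machine-readable schedule recording which tests run pre- and post-deployment and how often. The sketch below is one hypothetical way to express such a schedule; the test names, frequencies, and owners are illustrative, not requirements taken from the Act or from ISO/IEC 42001.

```python
from dataclasses import dataclass

@dataclass
class ScheduledTest:
    name: str
    phase: str        # "pre-deployment" or "post-deployment"
    frequency: str    # e.g. "per release", "weekly", "quarterly"
    owner: str

# Illustrative test schedule for a hypothetical high-risk system.
test_plan = [
    ScheduledTest("dataset bias and completeness checks", "pre-deployment", "per release", "data science"),
    ScheduledTest("accuracy and robustness evaluation", "pre-deployment", "per release", "data science"),
    ScheduledTest("fundamental rights impact review", "pre-deployment", "per major change", "compliance"),
    ScheduledTest("production drift and performance monitoring", "post-deployment", "weekly", "ML ops"),
    ScheduledTest("adversarial / red-team exercise", "post-deployment", "quarterly", "security"),
]

for phase in ("pre-deployment", "post-deployment"):
    print(f"\n{phase.upper()} TESTS")
    for t in (t for t in test_plan if t.phase == phase):
        print(f"  - {t.name} ({t.frequency}, owner: {t.owner})")
```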

Requirements for general purpose AI systems

  1. Record keeping (Article 53)

As with high-risk systems, providers of general purpose systems must maintain records of any evaluation activities they undertake (such as red teaming and other adversarial testing), including information on assessment methods, metrics, and results. 

  2. Model testing and risk assessment (Article 55)

Providers whose general purpose systems possess “high impact capabilities” must assess “all possible systemic risks” and must undertake and document comprehensive model testing - including adversarial testing. 
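
In its simplest documented form, adversarial testing can be a fixed bank of attack prompts run against the model with a check for refusals. The sketch below shows that structure around a placeholder generate function; the prompt categories, refusal heuristics, and model interface are all assumptions for illustration rather than a prescribed methodology.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

# Hypothetical adversarial prompt bank, keyed by risk category.
ADVERSARIAL_PROMPTS = {
    "illicit instructions": "[prompt attempting to elicit instructions for a prohibited activity]",
    "privacy": "[prompt attempting to extract personal data about a private individual]",
    "self-harm": "[prompt attempting to elicit content encouraging self-harm]",
}

def generate(prompt: str) -> str:
    """Placeholder for a real model call (API or local inference)."""
    return "I can't help with that request."

def run_red_team_suite() -> dict:
    """Run each prompt and record whether the model refused (illustrative check)."""
    results = {}
    for category, prompt in ADVERSARIAL_PROMPTS.items():
        reply = generate(prompt)
        refused = any(marker in reply.lower() for marker in REFUSAL_MARKERS)
        results[category] = {"prompt": prompt, "refused": refused, "reply": reply}
    return results

if __name__ == "__main__":
    outcome = run_red_team_suite()
    failures = [c for c, r in outcome.items() if not r["refused"]]
    print(f"{len(outcome) - len(failures)}/{len(outcome)} categories refused; failures: {failures}")
```

The structured results can feed directly into the Article 53 records described above.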

Challenges and Looking Ahead

The European Commission has instructed standards development organizations to create standards that support the above requirements. In the meantime, the soon-to-be-formed AI Office intends to “encourage and facilitate” the development of codes of practice, particularly to support the implementation of Articles 53 and 55. Until either harmonized standards or codes of practice are released, companies may struggle to implement the tests, assessments, and documentation protocols required by the new law. In particular, some of the challenges providers and deployers are already grappling with include the following:

Completeness of data sets for fairness testing. The absence of sufficient demographic data is a key obstacle to fairness testing. To comply with the Act’s requirements, AI providers may have to rely on the debiasing exception created by Article 10 and/or apply state-of-the-art data imputation techniques. In doing so, they must also institute appropriate safeguards to ensure data privacy and security. 
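
Where the protected attribute is only partially observed, one common (and imperfect) stopgap is to impute it from other features before running disaggregated tests. The sketch below does this with a simple proxy model on a toy table using scikit-learn; the columns and imputation method are assumptions, and imputed attributes should be treated as estimates handled under the privacy safeguards noted above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy dataset where the protected attribute is only partially observed.
df = pd.DataFrame({
    "age":    [34, 29, 41, 51, 45, 38, 27, 60],
    "income": [48, 52, 39, 75, 61, 45, 33, 80],
    "sex":    ["F", "M", None, "M", None, "M", "F", None],
    "label":  [1, 0, 0, 1, 1, 0, 0, 1],
})

features = ["age", "income"]
known = df[df["sex"].notna()]
missing = df[df["sex"].isna()]

# Impute the missing attribute from other features (a rough proxy model;
# imputed values are estimates and should be flagged as such in records).
proxy_model = LogisticRegression().fit(known[features], known["sex"])
imputed = pd.Series(proxy_model.predict(missing[features]), index=missing.index)
df["sex_filled"] = df["sex"].fillna(imputed)

# Disaggregated check using observed plus imputed group membership.
print(df.groupby("sex_filled")["label"].mean().round(2))
```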

Ongoing testing and risk assessment of high-risk LLM applications. Evaluating model risk requires companies to conduct tests under real-world conditions, using pre-defined metrics and thresholds. Providers and deployers of LLMs and other general purpose systems used in "high-risk" applications will face difficulty meeting this requirement, as the development of production-centric metrics, tests, and scenarios remains an ongoing challenge. Evaluation is further complicated by the volume and variety of models available (closed versus open source) and by the rapid pace at which the technology is evolving, which heightens the risk that material changes in new releases will adversely impact critical business applications and processes. Continuous LLM evaluations will require either dedicated teams within enterprises, the support of expert third-party evaluators, or both. 
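
A minimal form of continuous evaluation is a regression gate: each new model release is scored on a fixed set of business-critical tasks and blocked if any score falls below its pre-defined threshold. The sketch below shows that pattern with a placeholder scoring function; the task names, thresholds, and score_release interface are assumptions for illustration.

```python
# Pre-defined thresholds per business-critical task (illustrative values).
THRESHOLDS = {
    "contract-clause-extraction": 0.85,
    "customer-email-triage": 0.90,
    "policy-summarisation": 0.80,
}

def score_release(model_version: str, task: str) -> float:
    """Placeholder: run the task's eval set against the model and return a score."""
    fake_scores = {
        ("v2.1", "contract-clause-extraction"): 0.88,
        ("v2.1", "customer-email-triage"): 0.87,   # regression below threshold
        ("v2.1", "policy-summarisation"): 0.84,
    }
    return fake_scores[(model_version, task)]

def gate_release(model_version: str) -> bool:
    """Block the release if any task falls below its pre-defined threshold."""
    ok = True
    for task, threshold in THRESHOLDS.items():
        score = score_release(model_version, task)
        status = "pass" if score >= threshold else "FAIL"
        if score < threshold:
            ok = False
        print(f"{model_version} / {task}: {score:.2f} vs {threshold:.2f} -> {status}")
    return ok

if __name__ == "__main__":
    approved = gate_release("v2.1")
    print("release approved" if approved else "release blocked pending mitigation")
```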

Model testing and reporting at scale. Spurred by the release of high-profile (and potentially higher-risk) third-party applications like ChatGPT, many enterprises have only recently begun tracking the different AI-powered products used by their various business teams. Confronted with dozens, and in some cases hundreds, of first- and third-party AI models, many data science teams are now experiencing a bottleneck around testing, risk assessment, and reporting requirements. Third-party evaluation services are already playing a critical role in augmenting enterprise capacity.

Conclusion

Significant investment in testing and risk assessments for AI/LLM systems will be critical to ensuring enterprise compliance with the Act, and to avoiding its significant penalties, which range from 3% to 7% of global annual turnover depending on the offense. As a result, many enterprises are likely to turn to experienced third-party assessors that can provide the expertise, confidence, and established testing processes needed to meet the Act’s obligations with ease. Investing in model testing and risk assessments will also force companies to think more concretely about governance and risk mitigation strategies for particular systems – and to take steps to limit their liability exposure for AI-related failures.

1 Note that the Act treats any distributor, importer, deployer, or other third party that significantly alters an AI system as a provider (see Article 25).