Comparing Differentiable Logics for Learning with Logical Constraints111This paper is an extended version of our paper [1] published in FMAS 2023, containing a revised experimental setup leading to a fairer, more meaningful comparison. Section 1.3 explains our extensions and contributions in more detail.

Thomas Flinkow thomas.flinkow@mu.ie Barak A. Pearlmutter Rosemary Monahan
Abstract

Extensive research on formal verification of machine learning systems indicates that learning from data alone often fails to capture underlying background knowledge such as specifications implicitly available in the data. Various neural network verifiers have been developed to ensure that a machine-learnt model satisfies correctness and safety properties, however, they typically assume a trained network with fixed weights. A promising approach for creating machine learning models that inherently satisfy constraints after training is to encode background knowledge as explicit logical constraints that guide the learning process via so-called differentiable logics. In this paper, we experimentally compare and evaluate various logics from the literature, presenting our findings and highlighting open problems for future work.

keywords:
machine learning , neuro-symbolic , differentiable logic , verification
journal: Science of Computer Programming
\AtBeginDocument\RenewCommandCopy\qty\SI\UseTblrLibrary

booktabs

\affiliation

[1]organization=Department of Computer Science, addressline=Maynooth University, city=Maynooth, Co. Kildare, country=Ireland \affiliation[2]organization=Hamilton Institute, addressline=Maynooth University, city=Maynooth, Co. Kildare, country=Ireland

1 Introduction

Advancements in machine learning (ML) in the past few years indicate great potential for applying ML to various domains. Autonomous systems are one such application domain, but using ML components in such a safety-critical domain presents unique new challenges for formal verification. These include

  • 1.

    ML failing to learn background knowledge from data alone [2],

  • 2.

    neural networks being susceptible to adversarial inputs [3, 4],

  • 3.

    and a lack of specifications, generally and especially when continuous learning is permitted [5, 6, 7].

Addressing these challenges is even more important and more difficult when the ML-enabled autonomous system is permitted to continue to learn after deployment, either to adapt to changing environments or to correct and improve itself when errors are detected [8].

1.1 Formal verification of neural networks

A multitude of neural network verifiers have been presented in the past few years. We refer the reader to the Neural Network Verification Competition (VNN-COMP) reports [9, 10, 11, 12] for an overview of state-of-the-art neural network verifiers, and to Huang et al. [13], Liu et al. [14], Urban and Miné [15], Albarghouthi [16], Meng et al. [17] for in-depth surveys on neural network verification.

Reluplex, one of the first verifiers for neural networks, was provided by Katz et al. [18]. State-of-the-art tools are its successor Marabou [19], along with NNV [20], MN-BaB [21], and α,β𝛼𝛽\alpha,\betaitalic_α , italic_β-CROWN [22, 23, 24, 25, 26, 27] (the winner of the 2021–2023 VNN-COMP competitions [9, 10, 11]). However, as noted by Kwiatkowska [28] they typically assume trained networks with fixed weights and do not target the learning process itself.

1.2 Loss-based methods for guiding network training

There are multiple reasons for injecting background knowledge into neural networks: improved performance, learning with noisy or sparse data, or to guarantee compliance of the network predictions with the background knowledge [29]. One step in the direction of correct-by-construction neural networks are so-called differentiable logics, which transform a logical constraint ϕitalic-ϕ\phiitalic_ϕ into an additional constraint loss term CsubscriptC\mathcal{L}_{\text{C}}caligraphic_L start_POSTSUBSCRIPT C end_POSTSUBSCRIPT that measures how close the network is to satisfying the constraint. This is in addition to standard loss222Without loss of generality, only cross-entropy loss will be used in the following. such as standard cross-entropy loss CEsubscriptCE\mathcal{L}_{\text{CE}}caligraphic_L start_POSTSUBSCRIPT CE end_POSTSUBSCRIPT, which is a measure of the difference between the predicted and true probabilities.

The total loss \mathcal{L}caligraphic_L is then a weighted sum of constraint and cross-entropy loss, as shown in Eq. 1, where λCEsubscript𝜆CE\lambda_{\text{CE}}italic_λ start_POSTSUBSCRIPT CE end_POSTSUBSCRIPT and λCsubscript𝜆C\lambda_{\text{C}}italic_λ start_POSTSUBSCRIPT C end_POSTSUBSCRIPT balance the different loss terms, 𝒙0subscript𝒙0\boldsymbol{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is training data, and 𝒚𝒚\boldsymbol{y}bold_italic_y the true label.

(𝒙0,𝒚,ϕ):=λCECE(𝒙0,𝒚)+λCC(𝒙0,𝒚,ϕ).assignsubscript𝒙0𝒚italic-ϕsubscript𝜆CEsubscriptCEsubscript𝒙0𝒚subscript𝜆CsubscriptCsubscript𝒙0𝒚italic-ϕ\mathcal{L}(\boldsymbol{x}_{0},\boldsymbol{y},\phi):=\lambda_{\text{CE}}% \mathcal{L}_{\text{CE}}(\boldsymbol{x}_{0},\boldsymbol{y})+\lambda_{\text{C}}% \mathcal{L}_{\text{C}}(\boldsymbol{x}_{0},\boldsymbol{y},\phi).caligraphic_L ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_y , italic_ϕ ) := italic_λ start_POSTSUBSCRIPT CE end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT CE end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_y ) + italic_λ start_POSTSUBSCRIPT C end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT C end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_y , italic_ϕ ) . (1)

The optimal network weights 𝜽+superscript𝜽\boldsymbol{\theta}^{+}bold_italic_θ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT, needed to map the input to a desired output, are then obtained by minimising the total loss using standard gradient descent, where 𝜽𝜽\boldsymbol{\theta}bold_italic_θ denotes the network weights, formally shown in Eq. 2.

𝜽+=argmin𝜽(𝒙,𝒚,ϕ).superscript𝜽subscriptargmin𝜽𝒙𝒚italic-ϕ\boldsymbol{\theta}^{+}=\operatorname{arg\,min}_{\boldsymbol{\theta}}\;% \mathcal{L}(\boldsymbol{x},\boldsymbol{y},\phi).bold_italic_θ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT = start_OPFUNCTION roman_arg roman_min end_OPFUNCTION start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT caligraphic_L ( bold_italic_x , bold_italic_y , italic_ϕ ) . (2)

In order to translate logical constraints into this constraint loss, a mapping must be defined that allows for real-valued truth values and is differentiable (almost everywhere) for use with standard gradient-based methods. In Section 2 we provide an overview of two of these mappings (so-called differentiable logics) from popular literature, namely DL2 and fuzzy logics.

1.3 Our Contributions

In this section we outline our contributions and improvements over our work in [1] which provided an experimental comparison of differentiable logics. The experiments in this earlier work were limited to constraints that utilise data that is already in the training set, hence not allowing the constraints to have full effect. We now extend that work, through a complete revision of the experimental setup which provides for a more measurable experimental comparison of differentiable logics as explained in Section 4 below. We achieve this improved comparison by finding counterexamples outside of the data set that violate the logical constraint using Projected Gradient Descent [30] (PGD), thus ensuring that the differentiable logics have an impact on the learning process.

We also improve our presentation of differentiable logics in Section 2 so that differences are easily identified. Further theoretical comparison metrics from the literature, such as derivatives and consistency, are now included in Section 3. In our previous work, the comparison of differentiable logics suffered from a lack of fairness, due to a difficult-to-tune hyperparameter that prevented the logics to perform at their best. Our extended work addresses this through use of an adaptive loss-balancing approach called GradNorm [31], which allows each logic to perform close to optimally within our experiments, facilitating a fairer comparison of the logics in Section 4.

Our implementation of various differentiable logics in Python has been vastly improved, allowing their reuse and adaptation in the training of other neural networks, on new data sets with further constraints. Additionally, the logics are now implemented, not only for use in training with PyTorch [32], but also to allow the investigation of mathematical properties (such as derivatives and integrals) in a automated manner with SymPy [33] and Numpy [34].

1.4 Notation & Definitions

In the following, vectors will be denoted using bold face, e.g. 𝒙𝒙\boldsymbol{x}bold_italic_x.

Without loss of generality, we consider only classifier neural networks that output probabilities. The function approximated by the neural network will be denoted by 𝒩𝒩\mathcal{N}caligraphic_N, and for our purposes will be a function 𝒩:m[0,1]n:𝒩superscript𝑚superscript01𝑛\mathcal{N}:\mathbb{R}^{m}\to[0,1]^{n}caligraphic_N : blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT → [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, mapping images to a vector containing the predicted probabilities of the input belonging to any of the n𝑛nitalic_n possible classes. Each vector element 𝒩(𝒙)k𝒩subscript𝒙𝑘\mathcal{N}(\boldsymbol{x})_{k}caligraphic_N ( bold_italic_x ) start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT represents the prediction for class k𝑘kitalic_k (where 0k<n0𝑘𝑛0\leq k<n0 ≤ italic_k < italic_n). We further define Bϵ(𝒙)subscript𝐵italic-ϵ𝒙B_{\epsilon}(\boldsymbol{x})italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x ) to be the subscript\ell_{\infty}roman_ℓ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ball with radius ϵitalic-ϵ\epsilonitalic_ϵ around 𝒙0subscript𝒙0\boldsymbol{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, i.e.

Bϵ(𝒙0):={𝒙m𝒙𝒙0ϵ}.assignsubscript𝐵italic-ϵsubscript𝒙0conditional-set𝒙superscript𝑚subscriptdelimited-∥∥𝒙subscript𝒙0italic-ϵB_{\epsilon}(\boldsymbol{x}_{0}):=\{\boldsymbol{x}\in\mathbb{R}^{m}\mid\lVert% \boldsymbol{x}-\boldsymbol{x}_{0}\rVert_{\infty}\leq\epsilon\}.italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) := { bold_italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ∣ ∥ bold_italic_x - bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ≤ italic_ϵ } . (3)

This gives us a shorthand to refer to the set of all points close to point 𝒙0subscript𝒙0\boldsymbol{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT with respect to the subscript\ell_{\infty}roman_ℓ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT norm. This will be convenient later when discussing local robustness, where a property is supposed to hold not just for one particular input, but also for all points that are similar to that input. Lastly, given logical formula ϕitalic-ϕ\phiitalic_ϕ, we let [[ϕ]]Lsubscriptdelimited-[]delimited-[]italic-ϕ𝐿[\![\phi]\!]_{\text{$L$}}[ [ italic_ϕ ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT denote its many-valued relaxation under a specific logic L𝐿Litalic_L, based on notation from Ślusarz et al. [35].

2 A comparison of DL2 and Fuzzy Logics

Deep Learning with Differentiable Logics (DL2) is a system built by Fischer et al. [36] for querying and training neural networks with logical constraints that are mapped into [0,)0[0,\infty)[ 0 , ∞ ), where 00 represents absolute truth, and any positive value represents a degree of falsehood that is directly used as a penalty. Whereas DL2 was designed specifically for deep learning contexts, fuzzy logics are well-studied logical systems that also happen to be suitable for use as a differentiable logic due to their many-valued nature, with operators that are often differentiable almost everywhere. Fuzzy logics express degrees of truth in the unit interval [0,1]01[0,1][ 0 , 1 ], with absolute falsity mapped to 00, and absolute truth mapped to 1111.

The first in-depth study investigating the use of fuzzy logics in loss functions (called Differentiable Fuzzy Logics) was provided by van Krieken et al. [37] and presented insights into the learning properties of differentiable logic operators. The use of t-norm based logical loss was also investigated by Marra et al. [38], who focused on keeping logical relations intact, instead of combining arbitrary fuzzy logic operators.

As a follow up, Ślusarz et al. [35] provide a common presentation of differentiable logics including DL2, fuzzy logics, and Signal Temporal Logic (STL), in a unified, extensible framework called Logic of Differentiable Logics (LDL). This framework allows the investigation of properties of differentiable logics in general, without having to consider each logic on its own. Additionally, it is used by the Vehicle [39, 40, 41] tool, providing an expressive domain specific language for stating neural network specifications which can be compiled to both verifiers and interactive theorem provers (ITPs).

The LDL framework also provides for mechanised proofs of properties of differentiable logics, thus providing a stepping stone for the development of programming language support for verification of machine learning [42].

2.1 Comparison of DL2 and Fuzzy Logics

In the following, we present the components of the logics that we use in our experimental comparison when mapping logical constraints ϕΦitalic-ϕΦ\phi\in\Phiitalic_ϕ ∈ roman_Φ into real-valued loss. These logics are the DL2 logic relaxation [[]]DL2:Φ[0,):subscriptdelimited-[]delimited-[]DL2Φ0[\![\cdot]\!]_{\text{DL2}}:\Phi\to[0,\infty)[ [ ⋅ ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT : roman_Φ → [ 0 , ∞ ), and the fuzzy logic based relaxation [[]]FL:Φ[0,1]:subscriptdelimited-[]delimited-[]FLΦ01[\![\cdot]\!]_{\text{FL}}:\Phi\to[0,1][ [ ⋅ ] ] start_POSTSUBSCRIPT FL end_POSTSUBSCRIPT : roman_Φ → [ 0 , 1 ]. Here, we provide details of their atomic terms, their operators and their quantifiers.

Atomic terms

Atomic terms in DL2 are comparisons that are translated as

[[xy]]DL2:=max{xy,0},assignsubscriptdelimited-[]delimited-[]𝑥𝑦DL2𝑥𝑦0\displaystyle[\![x\leq y]\!]_{\text{DL2}}:=\max\{x-y,0\},[ [ italic_x ≤ italic_y ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT := roman_max { italic_x - italic_y , 0 } , (4)
[[xy]]DL2:=ξ[x=y],assignsubscriptdelimited-[]delimited-[]𝑥𝑦DL2𝜉delimited-[]𝑥𝑦\displaystyle[\![x\neq y]\!]_{\text{DL2}}:=\xi[x=y],[ [ italic_x ≠ italic_y ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT := italic_ξ [ italic_x = italic_y ] , (5)

where ξ>0𝜉0\xi>0italic_ξ > 0 is a constant (found not to have significant influence [36]), and [x=y]delimited-[]𝑥𝑦[x=y][ italic_x = italic_y ] is the indicator, i.e. [x=y]=1delimited-[]𝑥𝑦1[x=y]=1[ italic_x = italic_y ] = 1 if x=y𝑥𝑦x=yitalic_x = italic_y, and [x=y]=0delimited-[]𝑥𝑦0[x=y]=0[ italic_x = italic_y ] = 0, otherwise.

Fuzzy logics typically do not define fuzzy comparison operators, which is why Ślusarz et al. [35] introduce a fuzzy comparison mapping shown in Eq. 6 below:

[[]]FL:[0,1]2[0,1],[[]]FL=1max{xyx+y,0}.:subscriptdelimited-[]delimited-[]FLformulae-sequencesuperscript01201subscriptdelimited-[]delimited-[]FL1𝑥𝑦𝑥𝑦0[\![\leq]\!]_{\text{FL}}:[0,1]^{2}\to[0,1],\qquad[\![\leq]\!]_{\text{FL}}=1-% \max\left\{\frac{x-y}{x+y},0\right\}.[ [ ≤ ] ] start_POSTSUBSCRIPT FL end_POSTSUBSCRIPT : [ 0 , 1 ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT → [ 0 , 1 ] , [ [ ≤ ] ] start_POSTSUBSCRIPT FL end_POSTSUBSCRIPT = 1 - roman_max { divide start_ARG italic_x - italic_y end_ARG start_ARG italic_x + italic_y end_ARG , 0 } . (6)

As with all fuzzy logic operators, the truth values of the terms x,y𝑥𝑦x,yitalic_x , italic_y must be mapped into [0,1]01[0,1][ 0 , 1 ] by some oracle. In our constraints, atomic terms are comparisons, so we take the liberty of changing this mapping to the one shown in Eq. 7, allowing us to forgo the need for an external oracle. Note that although this is not technically a pure fuzzy logic operator anymore, fuzzy logics usually do not have fuzzy comparison operators at all.

[[]]FL:2[0,1],[[]]FL=1max{xy,0}|x|+|y|:subscriptdelimited-[]delimited-[]FLformulae-sequencesuperscript201subscriptdelimited-[]delimited-[]FL1𝑥𝑦0𝑥𝑦[\![\leq]\!]_{\text{FL}}:\mathbb{R}^{2}\to[0,1],\qquad[\![\leq]\!]_{\text{FL}}% =1-\frac{\max\{x-y,0\}}{|x|+|y|}[ [ ≤ ] ] start_POSTSUBSCRIPT FL end_POSTSUBSCRIPT : blackboard_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT → [ 0 , 1 ] , [ [ ≤ ] ] start_POSTSUBSCRIPT FL end_POSTSUBSCRIPT = 1 - divide start_ARG roman_max { italic_x - italic_y , 0 } end_ARG start_ARG | italic_x | + | italic_y | end_ARG (7)

Note also that this mapping has a property that we intuitively might wish to hold: for example, we might want 2120212021\leq 2021 ≤ 20 to be as much of a violation as 21 00020 0002100020000$21\,000$\leq$20\,000$21 000 ≤ 20 000. This cannot be achieved in DL2, where the violation depends only on the absolute difference of values.

Conjunction and disjunction

In DL2, conjunction and disjunction are mapped to addition and multiplication, thus being associative and commutative:

[[xy]]DL2:=[[x]]DL2+[[y]]DL2,assignsubscriptdelimited-[]delimited-[]𝑥𝑦DL2subscriptdelimited-[]delimited-[]𝑥DL2subscriptdelimited-[]delimited-[]𝑦DL2\displaystyle[\![x\wedge y]\!]_{\text{DL2}}:=[\![x]\!]_{\text{DL2}}+[\![y]\!]_% {\text{DL2}},[ [ italic_x ∧ italic_y ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT := [ [ italic_x ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT + [ [ italic_y ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT , (8)
[[xy]]DL2:=[[x]]DL2[[y]]DL2.assignsubscriptdelimited-[]delimited-[]𝑥𝑦DL2subscriptdelimited-[]delimited-[]𝑥DL2subscriptdelimited-[]delimited-[]𝑦DL2\displaystyle[\![x\vee y]\!]_{\text{DL2}}:=[\![x]\!]_{\text{DL2}}\cdot[\![y]\!% ]_{\text{DL2}}.[ [ italic_x ∨ italic_y ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT := [ [ italic_x ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT ⋅ [ [ italic_y ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT . (9)

Fuzzy logics are based on functions T:[0,1]2[0,1]:𝑇superscript01201T:[0,1]^{2}\to[0,1]italic_T : [ 0 , 1 ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT → [ 0 , 1 ] that are commutative, associative, monotonic, and satisfy T(1,y)=y𝑇1𝑦𝑦T(1,y)=yitalic_T ( 1 , italic_y ) = italic_y. These are called triangular norms [43] (abbreviated as t-norms) and generalise conjunction. A t-conorm (also called s-norm) generalises disjunction and can be obtained from a t-norm using S(x,y)=1T(1x,1y)𝑆𝑥𝑦1𝑇1𝑥1𝑦S(x,y)=1-T(1-x,1-y)italic_S ( italic_x , italic_y ) = 1 - italic_T ( 1 - italic_x , 1 - italic_y ); thus, for our fuzzy logic mapping we have:

[[xy]]L:=TL(x,y),assignsubscriptdelimited-[]delimited-[]𝑥𝑦𝐿subscript𝑇𝐿𝑥𝑦\displaystyle[\![x\wedge y]\!]_{\text{$L$}}:=T_{L}(x,y),[ [ italic_x ∧ italic_y ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT := italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ( italic_x , italic_y ) , (10)
[[xy]]L:=SL(x,y).assignsubscriptdelimited-[]delimited-[]𝑥𝑦𝐿subscript𝑆𝐿𝑥𝑦\displaystyle[\![x\vee y]\!]_{\text{$L$}}:=S_{L}(x,y).[ [ italic_x ∨ italic_y ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT := italic_S start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ( italic_x , italic_y ) . (11)

Negation

Because DL2 maps into [0,)0[0,\infty)[ 0 , ∞ ), a separate translation of negation does not exist. Instead, negation is handled by pushing it inwards to the level of comparison, e.g. [[¬(xy)]]DL2=[[y<x]]DL2subscriptdelimited-[]delimited-[]𝑥𝑦DL2subscriptdelimited-[]delimited-[]𝑦𝑥DL2[\![\lnot(x\leq y)]\!]_{\text{DL2}}=[\![y<x]\!]_{\text{DL2}}[ [ ¬ ( italic_x ≤ italic_y ) ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT = [ [ italic_y < italic_x ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT.

In fuzzy logics, negation is a function N:[0,1][0,1]:𝑁0101N:[0,1]\to[0,1]italic_N : [ 0 , 1 ] → [ 0 , 1 ] such that N(1)=0𝑁10N(1)=0italic_N ( 1 ) = 0 and for all x𝑥xitalic_x, N(N(x))x𝑁𝑁𝑥𝑥N(N(x))\geq xitalic_N ( italic_N ( italic_x ) ) ≥ italic_x. We will only use the standard strong negation N(x)=1x𝑁𝑥1𝑥N(x)=1-xitalic_N ( italic_x ) = 1 - italic_x in the following.

Implication

DL2 does not provide a separate translation for implication, instead, material implication is used:

[[xy]]DL2:=[[¬xy]]DL2.assignsubscriptdelimited-[]delimited-[]𝑥𝑦DL2subscriptdelimited-[]delimited-[]𝑥𝑦DL2[\![x\to y]\!]_{\text{DL2}}:=[\![\lnot x\vee y]\!]_{\text{DL2}}.[ [ italic_x → italic_y ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT := [ [ ¬ italic_x ∨ italic_y ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT . (12)

In fuzzy logics, there are multiple ways to obtain fuzzy implications. Table 1 lists the definitions of the mentioned t-norms, t-conorms as well as these implications. Baczyński and Jayaram [44] give a detailed overview of fuzzy implications.

  • 1.

    From a t-conorm S𝑆Sitalic_S and fuzzy negation N𝑁Nitalic_N, one obtains a so-called (S,N)𝑆𝑁(S,N)( italic_S , italic_N )-implication (which generalises material implication) as I(x,y)=S(N(x),y)𝐼𝑥𝑦𝑆𝑁𝑥𝑦I(x,y)=S(N(x),y)italic_I ( italic_x , italic_y ) = italic_S ( italic_N ( italic_x ) , italic_y )). Example (S,N)𝑆𝑁(S,N)( italic_S , italic_N )-implications are the Kleene-Dienes implication IKleene-Dienes(x,y):=max{1x,y}assignsubscript𝐼Kleene-Dienes𝑥𝑦1𝑥𝑦I_{\text{Kleene-Dienes}}(x,y):=\max\{1-x,y\}italic_I start_POSTSUBSCRIPT Kleene-Dienes end_POSTSUBSCRIPT ( italic_x , italic_y ) := roman_max { 1 - italic_x , italic_y } and Reichenbach implication IReichenbach(x,y):=1x+xyassignsubscript𝐼Reichenbach𝑥𝑦1𝑥𝑥𝑦I_{\text{Reichenbach}}(x,y):=1-x+xyitalic_I start_POSTSUBSCRIPT Reichenbach end_POSTSUBSCRIPT ( italic_x , italic_y ) := 1 - italic_x + italic_x italic_y, both with the standard negation N(x)=1x𝑁𝑥1𝑥N(x)=1-xitalic_N ( italic_x ) = 1 - italic_x.

  • 2.

    Other implications generalise the intuitionistic implication and are called R𝑅Ritalic_R-implications, because they use the t-norm residuum R(x,y):=sup{t[0,1]T(x,t)y}assign𝑅𝑥𝑦supremumconditional-set𝑡01𝑇𝑥𝑡𝑦R(x,y):=\sup\{t\in[0,1]\mid T(x,t)\leq y\}italic_R ( italic_x , italic_y ) := roman_sup { italic_t ∈ [ 0 , 1 ] ∣ italic_T ( italic_x , italic_t ) ≤ italic_y }. Example R𝑅Ritalic_R-implications are the Gödel and Goguen implications.

  • 3.

    The Łukasiewicz implication IŁukasiewicz(x,y):=min{1x+y,1}assignsubscript𝐼Łukasiewicz𝑥𝑦1𝑥𝑦1I_{\text{\L ukasiewicz}}(x,y):=\min\{1-x+y,1\}italic_I start_POSTSUBSCRIPT Łukasiewicz end_POSTSUBSCRIPT ( italic_x , italic_y ) := roman_min { 1 - italic_x + italic_y , 1 } is both an (S,N)𝑆𝑁(S,N)( italic_S , italic_N )-implication and an R𝑅Ritalic_R-implication.

  • 4.

    Other implications are neither—the Yager implication, for example, is an f𝑓fitalic_f-generated implication that is obtained using f(x)=lnx𝑓𝑥𝑥f(x)=-\ln xitalic_f ( italic_x ) = - roman_ln italic_x in I(x,y)=f1(xf(y))𝐼𝑥𝑦superscript𝑓1𝑥𝑓𝑦I(x,y)=f^{-1}(xf(y))italic_I ( italic_x , italic_y ) = italic_f start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_x italic_f ( italic_y ) ) (with the understanding that 0=0000\cdot\infty=00 ⋅ ∞ = 0).

Additionally, van Krieken et al. [37] propose sigmoidal implications in order to improve the derivatives of the original implication, while preserving its characteristics. With the standard sigmoidal function shown in  Eq. 13, Eq. 14 shows the sigmoidal implication I(s)superscript𝐼𝑠I^{(s)}italic_I start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT for any given fuzzy implication I𝐼Iitalic_I and a parameter s𝑠sitalic_s for controlling the steepness. In our experiments, we use the sigmoidal Reichenbach implication with s=9𝑠9s=9italic_s = 9, as suggested by van Krieken et al. [37].

σ(x)=1(1+exp(x))𝜎𝑥11𝑥\sigma(x)=\frac{1}{(1+\exp(-x))}italic_σ ( italic_x ) = divide start_ARG 1 end_ARG start_ARG ( 1 + roman_exp ( start_ARG - italic_x end_ARG ) ) end_ARG (13)
I(s)(x,y)=(1+exp(s/2))σ(sI(x,y)s/2)1exp(s/2)1superscript𝐼𝑠𝑥𝑦1𝑠2𝜎𝑠𝐼𝑥𝑦𝑠21𝑠21I^{(s)}(x,y)=\dfrac{(1+\exp(\nicefrac{{s}}{{2}}))\sigma(sI(x,y)-\nicefrac{{s}}% {{2}})-1}{\exp(\nicefrac{{s}}{{2}})-1}italic_I start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT ( italic_x , italic_y ) = divide start_ARG ( 1 + roman_exp ( start_ARG / start_ARG italic_s end_ARG start_ARG 2 end_ARG end_ARG ) ) italic_σ ( italic_s italic_I ( italic_x , italic_y ) - / start_ARG italic_s end_ARG start_ARG 2 end_ARG ) - 1 end_ARG start_ARG roman_exp ( start_ARG / start_ARG italic_s end_ARG start_ARG 2 end_ARG end_ARG ) - 1 end_ARG (14)

Universal quantification

In DL2, Fischer et al. [36] provide an important insight: training to satisfy a universally quantified constraint can be approximated by finding a counterexample that violates the constraint, and using that counterexample in training. This counterexample can be found by an arbitrary sampling function—DL2 uses Projected Gradient Descent (PGD) to find the worst possible perturbation around particular inputs. In DL2, universal quantification thus is not general, and is treated outside of the logic.

With t-norms being associative and commutative, universal fuzzy quantification is given by repeated application of conjunction.

Note that the LDL framework can handle arbitrarily nested, general universal (and existential) quantifiers.

Other operators

Based on the comparison, negation, conjunction and disjunction, and implication operators shown above, other statements can be derived, such as [[xy]]L=[[(xy)(yx)]]L[\![x\leftrightarrow y]\!]_{\text{$L$}}=[\![(x\to y)\wedge(y\to x)]\!]_{\text{% $L$}}[ [ italic_x ↔ italic_y ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT = [ [ ( italic_x → italic_y ) ∧ ( italic_y → italic_x ) ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT. Note that as shown in Table 1, we will only use symmetric configurations, meaning a logic will consist of some t-norm for conjunction, the dual t-conorm for disjunction, and implication will be either an (S,N)𝑆𝑁(S,N)( italic_S , italic_N )-implication based on the t-conorm, or the R-implication based on the t-norm. Symmetric configurations usually retain logical relations, but they might not perform as well as arbitrarily combined fuzzy operators. An analysis of both symmetric as well as arbitrary combinations is provided by van Krieken et al. [37].

Table 1: Common t-norms, t-conorms and fuzzy implications.

{tblr} colspec=lQ[c, mode=dmath]Q[c, mode=dmath]Q[c, mode=dmath], row1=font=, mode=text, cell22=r=2c, cell23=r=2c, cell34=r=3c, cell52=r=2c, cell53=r=2c, vline2-4 = 2-Zdotted, hline3-Y = dotted, Logic L𝐿Litalic_L & T-norm TL(x,y)subscript𝑇𝐿𝑥𝑦T_{L}(x,y)italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ( italic_x , italic_y ) S-norm SL(x,y)subscript𝑆𝐿𝑥𝑦S_{L}(x,y)italic_S start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ( italic_x , italic_y ) Implication IL(x,y)subscript𝐼𝐿𝑥𝑦I_{L}(x,y)italic_I start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ( italic_x , italic_y )
Gödel min{x,y} max{x,y} {1,if x¡y,y,else.
Kleene-Dienes S_L(N(x), y)
\addlinespaceŁukasiewicz max{0,x+y-1} min{1, x+y}
\addlinespaceReichenbach xy x+y-xy
Goguen {1,if x¡y,yx,else.
Yager (p1𝑝1p\geq 1italic_p ≥ 1) max{1-((1-x)^p+(1-y)^p)^1/p1𝑝\nicefrac{{1}}{{p}}/ start_ARG 1 end_ARG start_ARG italic_p end_ARG,0} min{(x^p+y^p)^1/p1𝑝\nicefrac{{1}}{{p}}/ start_ARG 1 end_ARG start_ARG italic_p end_ARG, 1} {1,if x=y=0,y/x𝑦𝑥\nicefrac{{y}}{{x}}/ start_ARG italic_y end_ARG start_ARG italic_x end_ARG,else.

3 Theoretical Comparison

We now compare DL2 and fuzzy logics, discussing soundness, shadow-lifting and derivatives, Modus Ponens and Modus Tollens reasoning, as well as consistency.

3.1 Soundness

Soundness is a highly-desirable—usually even non-negotiable—property any logical system should have. Not all differentiable logics presented herein are sound, e.g., it is known that Łukasiewicz and Yager fuzzy logics are unsound. Ślusarz et al. [35] and Affeldt et al. [42] provide mechanised proofs for the soundness of DL2, STL, and all other fuzzy logics mentioned above.

3.2 Shadow-lifting (Derivatives of Conjunction)

Table 2: The derivatives of the DL2 and fuzzy conjunctions and their properties. Note that all conjunctions shown are commutative, so we only show the derivatives with respect to x𝑥xitalic_x.

\DefTblrTemplatecaptiondefault {talltblr}[ notea = For brevity, we let 𝒮=(1x)p+(1y)p𝒮superscript1𝑥𝑝superscript1𝑦𝑝\mathcal{S}=(1-x)^{p}+(1-y)^{p}caligraphic_S = ( 1 - italic_x ) start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT + ( 1 - italic_y ) start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT., ] colspec=lQ[c, m, mode=dmath]Q[c, m, mode=dmath]Q[l, m, mode=text]Q[c, m, mode=dmath], row1=font=, mode=text, Logic L𝐿Litalic_L & Conjunction [[xy]]Lsubscriptdelimited-[]delimited-[]𝑥𝑦𝐿[\![x\wedge y]\!]_{\text{$L$}}[ [ italic_x ∧ italic_y ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT [[xy]]Lxpartial-derivative𝑥subscriptdelimited-[]delimited-[]𝑥𝑦𝐿\displaystyle\partialderivative{[\![x\wedge y]\!]_{\text{$L$}}}{x}divide start_ARG ∂ start_ARG [ [ italic_x ∧ italic_y ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_ARG end_ARG start_ARG ∂ start_ARG italic_x end_ARG end_ARG Behaviour Shadow-
lifting
DL2 x+y 1 strong derivative
Gödel min{x,y} {1,if x≤y0 strong derivative
Łukasiewicz max{0,x+y-1} {1,if x+y¿10 strong derivative;
but vanishing on half
of the domain
Reichenbach xy y derivative low, if y𝑦yitalic_y low;
(not suitable for learning)
Yager\TblrNotea (p1𝑝1p\geq 1italic_p ≥ 1) max{1-S^1/p1𝑝\nicefrac{{1}}{{p}}/ start_ARG 1 end_ARG start_ARG italic_p end_ARG,0} {(1-x)p-1S1/p11𝑝1\nicefrac{{1}}{{p-1}}/ start_ARG 1 end_ARG start_ARG italic_p - 1 end_ARG,if S1/p1𝑝\nicefrac{{1}}{{p}}/ start_ARG 1 end_ARG start_ARG italic_p end_ARG¿10 derivative vanishes on
considerable part of
the domain; derivative
high, if x𝑥xitalic_x or y𝑦yitalic_y low
(suitable for learning)

Varnai and Dimarogonas [45] introduce the shadow-lifting property for conjunction, which requires the truth value of a conjunction to increase when the truth value of a conjunct increases; formally expressed333Note that the original version from Varnai and Dimarogonas [45] is defined for conjunctions with M𝑀Mitalic_M terms, as STL has conjunctions with M𝑀Mitalic_M terms by designs. in Eq. 15: the shadow-lifting property is satisfied, if for any ρ0𝜌0\rho\neq 0italic_ρ ≠ 0,

[[x1x2]]Lxi|x1=x2=ρ>0for all i{1,2}.formulae-sequenceevaluated-atpartial-derivativesubscript𝑥𝑖subscriptdelimited-[]delimited-[]subscript𝑥1subscript𝑥2𝐿subscript𝑥1subscript𝑥2𝜌0for all 𝑖12\partialderivative{[\![x_{1}\wedge x_{2}]\!]_{\text{$L$}}}{x_{i}}\bigg{|}_{x_{% 1}=x_{2}=\rho}>0\quad\text{for all }i\in\{1,2\}.divide start_ARG ∂ start_ARG [ [ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∧ italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_ARG end_ARG start_ARG ∂ start_ARG italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG end_ARG | start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_ρ end_POSTSUBSCRIPT > 0 for all italic_i ∈ { 1 , 2 } . (15)

This property is highly desirable for learning, as it allows for gradual improvement. For example, the formula 0.10.90.10.90.1\wedge 0.90.1 ∧ 0.9 should be more true than 0.10.20.10.20.1\wedge 0.20.1 ∧ 0.2, but the Gödel t-norm TG(x,y)=min{x,y}subscript𝑇G𝑥𝑦𝑥𝑦T_{\text{G}}(x,y)=\min\{x,y\}italic_T start_POSTSUBSCRIPT G end_POSTSUBSCRIPT ( italic_x , italic_y ) = roman_min { italic_x , italic_y } yields the same truth value in both cases. Note that Varnai and Dimarogonas [45] have proven that no conjunction operator can be associative and shadow-lifting at the same time.

DL2 uses addition for conjunction, trivially satisfying shadow-lifting. The only t-norm to satisfy the shadow-lifting property is the product t-norm T(x,y)=xy𝑇𝑥𝑦𝑥𝑦T(x,y)=xyitalic_T ( italic_x , italic_y ) = italic_x italic_y. The Gödel, Łukasiewicz, and Yager t-norms are not differentiable everywhere due to their use of the min\minroman_min and max\maxroman_max operators, and do not satisfy shadow-lifting [42]. However, a conjunction that enjoys shadow-lifting might not be favourable for other reasons; as noted by van Krieken et al. [37], the derivatives of the product t-norm will be low if x𝑥xitalic_x and y𝑦yitalic_y are both low, making it hard for the learning process to make progress at all. In contrast to that, the DL2 conjunction is not just shadow-lifting, but also has strong derivatives regardless of the values of x𝑥xitalic_x and y𝑦yitalic_y. Table 2 shows the derivatives of the conjunctions presented earlier.

3.3 Modus Ponens and Modus Tollens Reasoning (Derivatives of Implication)

Table 3: The derivatives of the DL2 and fuzzy implications and their properties. A symbol in the Modus Ponens (or, respectively, the Modus Tollens) column indicates that the particular implication closely follows Modus Ponens (or, respectively, Modus Tollens) reasoning, while a indicates that this particular implication does not follow Modus Ponens (or, respectively, Modus Tollens) reasoning.

\DefTblrTemplatecaptiondefault {talltblr}[ notea=Derivatives vanish on half of the domain., noteb=Modus Ponens reasoning is only possible for x>1𝑥1x>1italic_x > 1. However, when x<1𝑥1x<1italic_x < 1 (corresponding to being very confident in x𝑥xitalic_x), it is only possible to decrease confidence in y𝑦yitalic_y, which is the opposite of what should happen with Modus Ponens., ] colspec=lQ[c, m, mode=dmath]Q[c, m, mode=dmath]Q[c, m, mode=dmath]Q[c, m, mode=text]Q[c, m, mode=text], row1=font=, mode=text, Logic L𝐿Litalic_L & Implication [[xy]]Lsubscriptdelimited-[]delimited-[]𝑥𝑦𝐿[\![x\wedge y]\!]_{\text{$L$}}[ [ italic_x ∧ italic_y ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT [[xy]]Lxpartial-derivative𝑥subscriptdelimited-[]delimited-[]𝑥𝑦𝐿\displaystyle\partialderivative{[\![x\to y]\!]_{\text{$L$}}}{x}divide start_ARG ∂ start_ARG [ [ italic_x → italic_y ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_ARG end_ARG start_ARG ∂ start_ARG italic_x end_ARG end_ARG [[xy]]Lypartial-derivative𝑦subscriptdelimited-[]delimited-[]𝑥𝑦𝐿\displaystyle\partialderivative{[\![x\to y]\!]_{\text{$L$}}}{y}divide start_ARG ∂ start_ARG [ [ italic_x → italic_y ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_ARG end_ARG start_ARG ∂ start_ARG italic_y end_ARG end_ARG Modus
Ponens Modus
Tollens
DL2 y(1-x) -y 1-x ()\TblrNoteb
Gödel {1,if x¡y,yelse. 0 {0,if x¡y\TblrNotea,1,if x¿y.
Kleene-Dienes max{1-x,y} {0,if x+y¿1-1,if x+y¡1. {0,if x+y¡11,if x+y¿1.
Łukasiewicz min{1-x+y,1} {0,if x¡y\TblrNotea-1,if x¿y. {0,if x¡y\TblrNotea1,if x¿y.
Reichenbach 1-x+xy y-1 x
Goguen {1,if x≤y,y/x𝑦𝑥\nicefrac{{y}}{{x}}/ start_ARG italic_y end_ARG start_ARG italic_x end_ARG,else. {0,if x≤y\TblrNotea,-y/x2𝑦superscript𝑥2\nicefrac{{y}}{{x^{2}}}/ start_ARG italic_y end_ARG start_ARG italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG,else. {0,if x≤y\TblrNotea,1/x1𝑥\nicefrac{{1}}{{x}}/ start_ARG 1 end_ARG start_ARG italic_x end_ARG,else.
Yager (p1𝑝1p\geq 1italic_p ≥ 1) {1,if x,y=0,y/x𝑦𝑥\nicefrac{{y}}{{x}}/ start_ARG italic_y end_ARG start_ARG italic_x end_ARG,else. {0,if x,y=0,yxlogy,else. {0,if x,y=0,xyx-y,else.

Given an implication xy𝑥𝑦x\to yitalic_x → italic_y, x𝑥xitalic_x is called the antecedent, and y𝑦yitalic_y is called the consequent.

Modus Ponens reasoning (affirming the antecedent) allows to infer y𝑦yitalic_y from x𝑥xitalic_x. On the other hand, Modus Tollens reasoning (denying the consequent) is used to infer ¬x𝑥\lnot x¬ italic_x from ¬y𝑦\lnot y¬ italic_y.

Due to the different notions of truth in DL2 and fuzzy logics, in the following we will not refer to truth values directly, but rather use the concept of confidence.

Modus Ponens reasoning needs to be able to increase the confidence in the consequent y𝑦yitalic_y, if the confidence in the antecedent x𝑥xitalic_x is high. Modus Tollens reasoning needs to be able to decrease the confidence in the antecedent x𝑥xitalic_x, if the confidence in the consequent y𝑦yitalic_y is low. Thus, in DL2, increasing the confidence in ϕitalic-ϕ\phiitalic_ϕ means decreasing the value of ϕitalic-ϕ\phiitalic_ϕ, whereas in fuzzy logics, increasing the confidence in ϕitalic-ϕ\phiitalic_ϕ corresponds to increasing the value of ϕitalic-ϕ\phiitalic_ϕ. Table 3 lists derivatives of the DL2 and fuzzy implications, along with a description of their behaviour, and whether they permit Modus Ponens and Modus Tollens reasoning.

As noted by van Krieken et al. [37], not only is most background knowledge phrased as implications, but also there are often far more negative examples than they are positive ones in ML contexts, thus requiring Modus Tollens reasoning more frequently. Choosing a suitable implication that performs well in the presence of this Modus Ponens / Modus Tollens imbalance is thus an important task to guarantee best learning.

In the following, we examine derivatives of [[xy]]Lsubscriptdelimited-[]delimited-[]𝑥𝑦𝐿[\![x\to y]\!]_{\text{$L$}}[ [ italic_x → italic_y ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT and investigate whether they permit Modus Ponens reasoning, and, more importantly, Modus Tollens reasoning. For the Gödel, Kleene-Dienes, Łukasiewicz, Reichenbach, and Goguen implication, we briefly summarise known results from the literature [37], and contribute investigations into the derivatives and Modus Ponens and Modus Tollens behaviour for DL2 and the Yager implication.

  • 1.

    In DL2, Modus Tollens reasoning is not possible; since the derivative with respect to the antecedent x𝑥xitalic_x is y𝑦-y- italic_y, it is never possible to increase x𝑥xitalic_x (i.e. to decrease the confidence in x𝑥xitalic_x). Modus Ponens reasoning is possible only for values of x>1𝑥1x>1italic_x > 1, which means when we are very confident in x𝑥xitalic_x, it is not possible to increase the confidence in y𝑦yitalic_y—even worse, the confidence in y𝑦yitalic_y will be decreased.

  • 2.

    For the Gödel implication, derivatives with respect to the antecedent x𝑥xitalic_x do not exist; it is thus impossible to perform Modus Tollens reasoning. Whenever x>y𝑥𝑦x>yitalic_x > italic_y, the confidence in y𝑦yitalic_y is arbitrarily increased, even when the confidence in x𝑥xitalic_x is low; thus not exactly following Modus Ponens. Further, when x<y𝑥𝑦x<yitalic_x < italic_y, derivatives vanish.

  • 3.

    The Kleene-Dienes implication does not closely follow Modus Ponens or Modus Tollens; instead, the confidence in the antecedent x𝑥xitalic_x is decreased when the confidence in x𝑥xitalic_x and y𝑦yitalic_y is low, and the confidence in the consequent y𝑦yitalic_y is increased, if the confidence in x𝑥xitalic_x and y𝑦yitalic_y is high.

  • 4.

    The Łukasiewicz logic has another issue: the confidence in the antecedent can never be decreased; it therefore does not follow Modus Tollens reasoning. Whenever we are more confident in antecedent x𝑥xitalic_x than the in consequent y𝑦yitalic_y, the confidence in y𝑦yitalic_y will be increased, which does not closely follow Modus Ponens. Further, when x<y𝑥𝑦x<yitalic_x < italic_y, derivatives vanish.

  • 5.

    The Reichenbach implication closely follows Modus Ponens (whenever the confidence in the antecedent x𝑥xitalic_x is high, the confidence in the consequent y𝑦yitalic_y will be increased) and Modus Tollens (if the confidence in the consequent y𝑦yitalic_y is low, the confidence in the antecedent x𝑥xitalic_x will be decreased).

  • 6.

    The Goguen implication does not follow Modus Ponens: whenever the confidence in x𝑥xitalic_x is very low, the confidence in y𝑦yitalic_y will be increased strongly. Modus Tollens reasoning is possible, however, as y/x2𝑦superscript𝑥2-\nicefrac{{y}}{{x^{2}}}- / start_ARG italic_y end_ARG start_ARG italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG gets smaller for increasing values of x𝑥xitalic_x, it also behaves the opposite of what should be expected: when the confidence in x𝑥xitalic_x is very high, it should be decreased faster than when it is low. Further, when xy𝑥𝑦x\leq yitalic_x ≤ italic_y, derivatives vanish.

  • 7.

    The Yager implication follows Modus Tollens reasoning but not Modus Ponens reasoning; when the confidence in x𝑥xitalic_x is high but the confidence in y𝑦yitalic_y is low, derivatives with respect to y𝑦yitalic_y are low instead of high; but when x𝑥xitalic_x is high, derivatives are high, too.

3.4 Consistency (Integrals of Fuzzy Logic Operators)

Whereas derivatives of logical operators give insights into how they behave in the learning process, another view is from a logic perspective, looking at their integrals, as proposed by Grespan et al. [46], who introduce a notion of consistency (that is, a degree of truth) for fuzzy logics based on the intuition that a fuzzy logic tautology τ𝜏\tauitalic_τ should be absolutely true for all possible values, i.e. its integral should evaluate to 1111, formally expressed in Eq. 16 for a given fuzzy logic relaxation L𝐿Litalic_L.

[0,1][[τ(x1,,xn)]]Ldxndx1subscript01subscriptdelimited-[]delimited-[]𝜏subscript𝑥1subscript𝑥𝑛𝐿differential-dsubscript𝑥𝑛differential-dsubscript𝑥1\int\cdots\int_{[0,1]}[\![\tau(x_{1},\ldots,x_{n})]\!]_{\text{$L$}}\,\mathrm{d% }x_{n}\cdots\mathrm{d}x_{1}∫ ⋯ ∫ start_POSTSUBSCRIPT [ 0 , 1 ] end_POSTSUBSCRIPT [ [ italic_τ ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT roman_d italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ⋯ roman_d italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (16)

We use the same representative set of axioms as [46], and evaluate the consistency of these for all fuzzy logics listed in Table 1. The results are displayed in Table 4 and indicate that the Łukasiewicz logic and the sigmoidal Reichenbach logic are the most consistent, whereas Gödel is generally the least consistent for the set of axioms chosen.

Note that this is not a definite metric to evaluate fuzzy logics with, but rather depends on the specific set of axioms chosen and should only be used as an indicator. It is not possible to evaluate DL2 with this measure due to its lack of a general negation and the fact that atomic terms in DL2 are comparison and not absolute truth values.

Table 4: The consistency of different fuzzy logics over a representative set of axioms. It can be observed that the Łukasiewicz, sigmoidal Reichenbach, and Goguen logics are the most, and the Gödel logic being the least consistent for this set of axioms.

{tblr} colspec=lccccccc, cells=mode=dmath, row1=font=, mode=text, rowZ=font=, mode=text, Tautology & Gödel Kleene-Dienes Łukasiewicz Reichenbach Goguen sig. Reichenbach Yager
Axiom schemata
P →(Q →P) 0.67 0.79 1 0.92 1 0.99 0.89
(P →(Q →R)) →( (P →Q) →(P →R) ) 0.75 0.75 0.96 0.87 0.93 0.97 0.85
(¬P →¬Q) →(Q →P) 0.79 0.75 1 0.86 0.90 0.98 0.82
Primitive propositions
(P ∨P) →P 0.50 0.75 0.75 0.75 0.69 0.88 0.72
Q →(P ∨Q) 0.83 0.79 1 0.92 1 0.98 0.90
(P ∨Q) →(Q ∨P) 0.67 0.75 1 0.86 1 0.96 0.85
(P ∨(Q ∨R)) →(Q ∨(P ∨R)) 0.75 0.78 1 0.91 1 0.98 0.91
(Q →R) →( (P ∨Q) →(P ∨R)) 0.88 0.76 1 0.90 1 0.99 0.90
Law of excluded middle
P ∨¬P 0.75 0.75 1 0.83 0.83 0.83 0.81
Law of contradiction
¬(P ∧¬P) 0.75 0.75 1 0.83 0.83 0.83 0.81
Law of double negation
P ↔¬(¬P)) 0.50 0.75 1 0.70 1 0.91 0.69
Principles of transposition
(P ↔Q) ↔(¬P ↔¬Q) 0.17 0.67 1 0.61 0.59 0.93 0.57
( (P ∧Q) →R ) ↔( (P ∧¬R) →¬Q) 0.51 0.64 0.67 0.67 0.65 0.73 0.64
Laws of tautology
P ↔(P ∧P) 0.50 0.75 0.75 0.69 0.50 0.87 0.44
P ↔(P ∨P) 0.50 0.75 0.75 0.69 0.69 0.87 0.69
Laws of absorption
(P →Q) ↔(P ↔(P ∧Q)) 0.33 0.71 0.83 0.66 0.67 0.94 0.49
Q →(P ↔(P ∧Q)) 0.33 0.75 1 0.82 1 0.98 0.57
Assoc., comm., dist. laws
(P ∧(Q ∨R)) ↔( (P ∧Q) ∨(P ∧R) ) 0.42 0.72 0.90 0.69 0.89 0.90 0.67
(P ∨(Q ∧R)) ↔( (P ∨Q) ∧(P ∨R) ) 0.58 0.72 0.90 0.69 0.90 0.90 0.69
De Morgan’s laws
¬(P ∧Q) ↔(¬P ∨¬Q) 0.67 0.75 1 0.75 1 0.93 0.78
¬(P ∨Q) ↔(¬P ∧¬Q) 0.33 0.75 1 0.75 1 0.93 0.73
Material excluded middle
(P →Q) ∨(Q →P) 1 0.83 1 0.97 1 1 1
Average Consistency 0.60 0.75 0.93 0.79 0.87 0.92 0.75

4 Experimental Evaluation & Results

In this section we revise the experimental set up in Flinkow et al. [1] so that constraints are not limited to those that utilise data that is already in the training set. We achieve an improved comparison of logics by finding counterexamples outside of the data set that violate the logical constraint using PGD, thus ensuring that the differentiable logics have optimal positive impact on the learning process.

4.1 Setup

Our comparison experiment is implemented in PyTorch [32] and based on the original DL2 experiments [36, 47]. The code, together with the experimental data, is available on https://github.com/tflinkow/comparing-differentiable-logics. The goal of our experiment is to compare the differentiable logics introduced in Section 2, and investigate which logic performs most favourable, focusing specifically on implication and conjunction, as these have noticeable consequences for the learning process.

All experiments were conducted on Google Colab using a system with an Intel Xeon CPU and an NVIDIA A100 GPU. All experiments were run for 50 epochs. The baseline for each experiment was achieved by training without any logical loss translation. We train with the MNIST [48], CIFAR-10 and CIFAR-100 [49], and the German Traffic Sign Recognition Benchmark (GTSRB) [50] data sets with various logical constraints. For MNIST, we use a simple convolutional neural network (CNN) with two convolutional layers (32 and 64 filters), max-pooling, and two fully connected layers (128 and 10 units) followed by a softmax output layer. For CIFAR-10, CIFAR-100, and GTSRB, we use the VGG16 [51] architecture. Various image manipulation techniques are applied, such as random cropping, flipping, rotation, and colour changes. We use the Adam [52] optimiser with learning rates of [scientific-notation=true]0.0001 and [scientific-notation=true]0.00001 and batch sizes of 256 and 512.

The results reported in the tables in Figs. 2, 3, 5, 6 and 7 are derived by identifying the best combination of prediction and accuracy of the last {10} of experimental runs. The best combination is the one from these that maximises the product of prediction and constraint accuracy, thus effectively selecting the scenario with the highest combined performance (so as not to focus on only prediction or only constraint accuracy).

In each table, the overall best combination of prediction and constraint accuracy is shown in bold face. The plots in Figs. 2, 3, 5, 6 and 7 show how prediction and constraint accuracy change over time. The best combinations (i.e. the values reported in the tables) are marked in each plot.

4.2 Training with Counterexamples.

A major strength of DL2 is the ability to learn with constraints that can refer to data outside of the training set. This is achieved by finding a counterexample that violates the constraint. Given an input 𝒙0subscript𝒙0\boldsymbol{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and label 𝒚𝒚\boldsymbol{y}bold_italic_y, our goal is to train the network to satisfy constraints of the form

𝒙Bϵ(𝒙0)ϕ(𝒙0,𝒙,𝒚).for-all𝒙subscript𝐵italic-ϵsubscript𝒙0italic-ϕsubscript𝒙0𝒙𝒚\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x}_{0})\to\phi(\boldsymbol{x% }_{0},\boldsymbol{x},\boldsymbol{y}).∀ bold_italic_x ∈ italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) → italic_ϕ ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_x , bold_italic_y ) . (17)

This restriction allows PGD to find an approximate counterexample 𝒙superscript𝒙\boldsymbol{x}^{*}bold_italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT within ϵitalic-ϵ\epsilonitalic_ϵ-distance of 𝒙0subscript𝒙0\boldsymbol{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT that does not satisfy ϕ(𝒙0,𝒙,𝒚)italic-ϕsubscript𝒙0superscript𝒙𝒚\phi(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y})italic_ϕ ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , bold_italic_y ). This is achieved by minimising the loss of ¬ϕitalic-ϕ\lnot\phi¬ italic_ϕ (or, equivalently, maximising the loss of ϕitalic-ϕ\phiitalic_ϕ). Formally,

𝒙=argmax𝒙Bϵ(𝒙0)C(𝒙0,𝒙,𝒚,ϕ).superscript𝒙subscriptargmax𝒙subscript𝐵italic-ϵsubscript𝒙0subscriptCsubscript𝒙0𝒙𝒚italic-ϕ\boldsymbol{x}^{*}=\operatorname{arg\,max}_{\boldsymbol{x}\in B_{\epsilon}(% \boldsymbol{x}_{0})}\;\mathcal{L}_{\text{C}}(\boldsymbol{x}_{0},\boldsymbol{x}% ,\boldsymbol{y},\phi).bold_italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = start_OPFUNCTION roman_arg roman_max end_OPFUNCTION start_POSTSUBSCRIPT bold_italic_x ∈ italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT C end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_x , bold_italic_y , italic_ϕ ) . (18)

Learning to satisfy constraints of the form shown in Eq. 17 thus corresponds to a two-step process of first finding a counterexample 𝒙superscript𝒙\boldsymbol{x}^{*}bold_italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, and then minimising the total loss

(𝒙0,𝒚,ϕ)=λCECE(𝒙0,y)+λCC(𝒙0,𝒙,𝒚,ϕ),subscript𝒙0𝒚italic-ϕsubscript𝜆CEsubscriptCEsubscript𝒙0𝑦subscript𝜆CsubscriptCsubscript𝒙0superscript𝒙𝒚italic-ϕ\mathcal{L}(\boldsymbol{x}_{0},\boldsymbol{y},\phi)=\lambda_{\text{CE}}% \mathcal{L}_{\text{CE}}(\boldsymbol{x}_{0},y)+\lambda_{\text{C}}\mathcal{L}_{% \text{C}}(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y},\phi),caligraphic_L ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_y , italic_ϕ ) = italic_λ start_POSTSUBSCRIPT CE end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT CE end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_y ) + italic_λ start_POSTSUBSCRIPT C end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT C end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , bold_italic_y , italic_ϕ ) , (19)

where the constraint loss for DL2 is given by

C(𝒙0,𝒙,𝒚,ϕ)=[[ϕ(𝒙0,𝒙,𝒚)]]DL2,subscriptCsubscript𝒙0superscript𝒙𝒚italic-ϕsubscriptdelimited-[]delimited-[]italic-ϕsubscript𝒙0superscript𝒙𝒚DL2\mathcal{L}_{\text{C}}(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y},% \phi)=[\![\phi(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y})]\!]_{% \text{DL2}},caligraphic_L start_POSTSUBSCRIPT C end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , bold_italic_y , italic_ϕ ) = [ [ italic_ϕ ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , bold_italic_y ) ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT , (20)

or, respectively,

C(𝒙0,𝒙,𝒚,ϕ)=1[[ϕ(𝒙0,𝒙,𝒚)]]L,subscriptCsubscript𝒙0superscript𝒙𝒚italic-ϕ1subscriptdelimited-[]delimited-[]italic-ϕsubscript𝒙0superscript𝒙𝒚𝐿\mathcal{L}_{\text{C}}(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y},% \phi)=1-[\![\phi(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y})]\!]_{% \text{$L$}},caligraphic_L start_POSTSUBSCRIPT C end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , bold_italic_y , italic_ϕ ) = 1 - [ [ italic_ϕ ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , bold_italic_y ) ] ] start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT , (21)

for a given fuzzy logic L𝐿Litalic_L (to address the inverse notion of truth).

4.3 Balancing loss

The additional constraint loss term CsubscriptC\mathcal{L}_{\text{C}}caligraphic_L start_POSTSUBSCRIPT C end_POSTSUBSCRIPT introduces hyperparameters λCEsubscript𝜆CE\lambda_{\text{CE}}italic_λ start_POSTSUBSCRIPT CE end_POSTSUBSCRIPT and λCsubscript𝜆C\lambda_{\text{C}}italic_λ start_POSTSUBSCRIPT C end_POSTSUBSCRIPT that are responsible for balancing the two loss terms in Eq. 19. These hyperparameters need to be well fine-tuned to properly balance the different loss terms. This is a prerequisite for the logical loss to perform optimally and have the most impact. Our previous work used grid search to find approximate values; however, despite being time-consuming, it is also unlikely to yield optimal values, which had consequences for the fairness of our comparison.

In this paper, we use GradNorm by Chen et al. [31] to provide a fairer comparison. GradNorm is an adaptive loss balancing algorithm in multi-task learning scenarios, where the total loss is a weighted sum of individual task losses. GradNorm treats these loss weights as learnable parameters that are allowed to vary for each epoch. The network is penalised when backpropagated gradients from any task are too large or too small, which should lead to the different tasks training at similar rates. GradNorm at least matches, and often outperforms, exhaustive grid search, not only reducing the time needed to train with close-to-optimal weights, but also leading to overall improved task performance. GradNorm provides only one hyperparameter, the so-called asymmetry parameter α𝛼\alphaitalic_α, where higher values of α𝛼\alphaitalic_α lead to stronger training rate balancing.

Another approach for balancing loss was recently proposed by Li et al. [53], and also has the advantage of avoiding what the authors call shortcut satisfaction—a well-known phenomenon He et al. [54] call implication bias: neural networks trained with logical constraints have a tendency to vacuously satisfy a constraint xy𝑥𝑦x\to yitalic_x → italic_y by learning to decrease the confidence in x𝑥xitalic_x.

Refer to caption
Figure 1: The prediction λP(t)subscript𝜆P𝑡\lambda_{\text{P}}(t)italic_λ start_POSTSUBSCRIPT P end_POSTSUBSCRIPT ( italic_t ) and constraint weights λC(t)subscript𝜆C𝑡\lambda_{\text{C}}(t)italic_λ start_POSTSUBSCRIPT C end_POSTSUBSCRIPT ( italic_t ) learned with GradNorm (α=0.1𝛼0.1\alpha=0.1italic_α = 0.1) for the 𝖱𝗈𝖻𝗎𝗌𝗍𝗇𝖾𝗌𝗌(ϵ=0.4,δ=0.02)𝖱𝗈𝖻𝗎𝗌𝗍𝗇𝖾𝗌𝗌formulae-sequenceitalic-ϵ0.4𝛿0.02\mathsf{Robustness}(\epsilon=0.4,\delta=0.02)sansserif_Robustness ( italic_ϵ = 0.4 , italic_δ = 0.02 ) constraint on GTSRB with the Gödel logic.

4.4 Choice of constraints

In the following, we explain the logical constraints used during training. We also present the logical operators they are meant to evaluate.

Comparing comparison operators

Classification robustness [55] is usually expressed as

𝒙Bϵ(𝒙0)argmaxi𝒩(𝒙)=y,for-all𝒙subscript𝐵italic-ϵsubscript𝒙0subscriptargmax𝑖𝒩𝒙𝑦\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x}_{0})\to\operatorname{arg% \,max}_{i}\mathcal{N}(\boldsymbol{x})=y,∀ bold_italic_x ∈ italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) → start_OPFUNCTION roman_arg roman_max end_OPFUNCTION start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT caligraphic_N ( bold_italic_x ) = italic_y , (22)

where y𝑦yitalic_y is the true label. This formulation involves the non-differentiable argmaxargmax\operatorname{arg\,max}roman_arg roman_max operator, making it unsuitable for use in a differentiable loss function setting.

Another variant in DL2 is proposed to circumvent this in Fischer et al. [36], as shown in Eq. 23. Note that it only compares the network’s prediction for the true class against some threshold value (chosen to be δ=0.52𝛿0.52\delta=0.52italic_δ = 0.52).

𝒙Bϵ(𝒙0)𝒩(𝒙0)kδ.for-all𝒙subscript𝐵italic-ϵsubscript𝒙0𝒩subscriptsubscript𝒙0𝑘𝛿\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x}_{0})\to\mathcal{N}(% \boldsymbol{x}_{0})_{k}\geq\delta.∀ bold_italic_x ∈ italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) → caligraphic_N ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ≥ italic_δ . (23)

Casadio et al. [55] note that both of these robustness formulations fail to capture the intended semantics, because they compare the prediction for the true class instead of the current network output. Learning to satisfy these constraints can lead to situations where the network is technically robust around 𝒙0subscript𝒙0\boldsymbol{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, because the prediction for all 𝒙Bϵ(𝒙0)𝒙subscript𝐵italic-ϵsubscript𝒙0\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x}_{0})bold_italic_x ∈ italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) is the same as the one for 𝒙0subscript𝒙0\boldsymbol{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, but is not the correct class. In other words, these constraints capture too much; the standard loss term already penalises incorrect predictions.

Therefore, we are using the notion of standard robustness [55] for our Robustness constraint, shown in Eq. 24:

𝒙Bϵ(𝒙)𝒩(𝒙)𝒩(𝒙0)δ.for-all𝒙subscript𝐵italic-ϵ𝒙subscriptdelimited-∥∥𝒩𝒙𝒩subscript𝒙0𝛿\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x})\to\lVert\mathcal{N}(% \boldsymbol{x})-\mathcal{N}(\boldsymbol{x}_{0})\rVert_{\infty}\leq\delta.∀ bold_italic_x ∈ italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x ) → ∥ caligraphic_N ( bold_italic_x ) - caligraphic_N ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ≤ italic_δ . (24)

As explained before, the 𝒙Bϵ(𝒙)for-all𝒙subscript𝐵italic-ϵ𝒙\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x})∀ bold_italic_x ∈ italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x ) part will be handled outside of the logic by employing PGD to find a counterexample; therefore, this robustness constraint can only be used to compare the different loss translations for comparison, i.e. [[xy]]DL2subscriptdelimited-[]delimited-[]𝑥𝑦DL2[\![x\leq y]\!]_{\text{DL2}}[ [ italic_x ≤ italic_y ] ] start_POSTSUBSCRIPT DL2 end_POSTSUBSCRIPT shown in Eq. 4, and the fuzzy logic one [[xy]]FLsubscriptdelimited-[]delimited-[]𝑥𝑦FL[\![x\leq y]\!]_{\text{FL}}[ [ italic_x ≤ italic_y ] ] start_POSTSUBSCRIPT FL end_POSTSUBSCRIPT shown in Eq. 7. We use this constraint on MNIST and GTSRB.

Results

Refer to caption
Refer to caption
Figure 2: The 𝖱𝗈𝖻𝗎𝗌𝗍𝗇𝖾𝗌𝗌(ϵ=0.8,δ=0.01)𝖱𝗈𝖻𝗎𝗌𝗍𝗇𝖾𝗌𝗌formulae-sequenceitalic-ϵ0.8𝛿0.01\mathsf{Robustness}(\epsilon=0.8,\delta=0.01)sansserif_Robustness ( italic_ϵ = 0.8 , italic_δ = 0.01 ) constraint on MNIST.
Refer to caption
Refer to caption
Figure 3: The 𝖱𝗈𝖻𝗎𝗌𝗍𝗇𝖾𝗌𝗌(ϵ=0.4,δ=0.01)𝖱𝗈𝖻𝗎𝗌𝗍𝗇𝖾𝗌𝗌formulae-sequenceitalic-ϵ0.4𝛿0.01\mathsf{Robustness}(\epsilon=0.4,\delta=0.01)sansserif_Robustness ( italic_ϵ = 0.4 , italic_δ = 0.01 ) constraint on GTSRB.

There are no significant differences between DL2 and fuzzy logics when it comes to training with the robustness constraint on MNIST, as seen in Fig. 2. Both DL2 and fuzzy logic can bring the constraint accuracy from {32.37} for the baseline to {95.42} for DL2, and {96.58} for the fuzzy logic. The prediction accuracy is not negatively affected when training with constraint loss. This is likely due to the fact that MNIST is not a particularly difficult dataset to train with.

On GTSRB, the differences are more noticeable: The baseline has a constraint accuracy of {28.45}, whereas the DL2 one has {56.46}, and the fuzzy logic one has {68.76}. Prediction accuracy is not noticeably reduced for DL2, and slightly improved over the baseline for the fuzzy logic operator. See Fig. 3 for detailed results.

Comparing conjunction and disjunction

In order to compare conjunction and disjunction, we use the Groups constraint introduced in [36] to force the network to make strong decisions: a group of related elements must have a high combined probability, or a low combined probability.

We denote the probability of a group G𝐺Gitalic_G by the sum of probabilities 𝒩(𝒙)g𝒩subscript𝒙𝑔\mathcal{N}(\boldsymbol{x})_{g}caligraphic_N ( bold_italic_x ) start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT for each element g𝑔gitalic_g in that group, i.e.

pG=gG𝒩(𝒙)g,subscript𝑝𝐺subscript𝑔𝐺𝒩subscript𝒙𝑔p_{G}=\sum_{g\in G}\mathcal{N}(\boldsymbol{x})_{g},italic_p start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_g ∈ italic_G end_POSTSUBSCRIPT caligraphic_N ( bold_italic_x ) start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT , (25)

and then obtain the group constraint Groups as shown in Eq. 26 below:

𝒙Bϵ(𝒙)G𝒢pGδpG1δ.for-all𝒙subscript𝐵italic-ϵ𝒙subscript𝐺𝒢subscript𝑝𝐺𝛿subscript𝑝𝐺1𝛿\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x})\to\bigwedge_{G\in% \mathcal{G}}p_{G}\leq\delta\;\vee\;p_{G}\geq 1-\delta.∀ bold_italic_x ∈ italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x ) → ⋀ start_POSTSUBSCRIPT italic_G ∈ caligraphic_G end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ≤ italic_δ ∨ italic_p start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ≥ 1 - italic_δ . (26)

Groups consist of classes of a similar type (e.g. speed limit signs for GTSRB) and are thus used to encode a similarity relation on top of the data. We use this constraint on the GTSRB dataset with the set of groups 𝒢𝒢\mathcal{G}caligraphic_G being shown in Fig. 5, and on CIFAR-100 with the groups induced by the “coarse” labels provided, such as “fish”, “flowers”, “people”, etc.

Refer to caption
Figure 4: The traffic sign groups for the Groups constraint on GTSRB [50].

Results

Refer to caption
Refer to caption
Figure 5: The 𝖦𝗋𝗈𝗎𝗉𝗌(ϵ=0.6,δ=0.02)𝖦𝗋𝗈𝗎𝗉𝗌formulae-sequenceitalic-ϵ0.6𝛿0.02\mathsf{Groups}(\epsilon=0.6,\delta=0.02)sansserif_Groups ( italic_ϵ = 0.6 , italic_δ = 0.02 ) constraint on GTSRB.
Refer to caption
Refer to caption
Figure 6: The 𝖦𝗋𝗈𝗎𝗉𝗌(ϵ=0.2,δ=0.09)𝖦𝗋𝗈𝗎𝗉𝗌formulae-sequenceitalic-ϵ0.2𝛿0.09\mathsf{Groups}(\epsilon=0.2,\delta=0.09)sansserif_Groups ( italic_ϵ = 0.2 , italic_δ = 0.09 ) constraint on CIFAR-100.

On GTSRB, as seen in Fig. 5, all logics significantly improve constraint accuracy, while there is a considerable drop in prediction accuracy for all of them. The Gödel logic is able to bring constraint accuracy from {22.15} for the baseline to {99.39}. Prediction accuracy with the Gödel logic drops from {56.73} for the baseline to only {34.76}, which is still higher than the prediction accuracy of any other logic we trained with.

The Gödel logic performs the best overall in this experiment. On CIFAR-100, prediction accuracy drops from {32.16} for all logics but the Łukasiewicz logic, which surpasses the baseline prediction accuracy slightly—however, its constraint accuracy drops even below the baseline with only {14.41}, compared to the baseline with {45.49}.

All other logics notice a drop in prediction accuracy, but are able to increase constraint accuracy (often significantly). The best logic in this experiment was the Reichenbach logic, with a prediction accuracy of {25.62} and {72.75}, however, both the Gödel and Yager logics yield prediction and constraint accuracies that are only slightly worse.

Comparing implication

Lastly, to compare implication, we use a constraint based on the class-similarity constraint from DL2 [36] shown in Eq. 27:

𝒙Bϵ(𝒙)a,b,c𝒯(argmaxi𝒩(𝒙)i=a𝒩(𝒙)b𝒩(𝒙)c).for-all𝒙subscript𝐵italic-ϵ𝒙subscript𝑎𝑏𝑐𝒯subscriptargmax𝑖𝒩subscript𝒙𝑖𝑎𝒩subscript𝒙𝑏𝒩subscript𝒙𝑐\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x})\to\bigwedge_{\langle a,b% ,c\rangle\in\mathcal{T}}\left(\operatorname{arg\,max}_{i}\mathcal{N}(% \boldsymbol{x})_{i}=a\to\mathcal{N}(\boldsymbol{x})_{b}\geq\mathcal{N}(% \boldsymbol{x})_{c}\right).∀ bold_italic_x ∈ italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x ) → ⋀ start_POSTSUBSCRIPT ⟨ italic_a , italic_b , italic_c ⟩ ∈ caligraphic_T end_POSTSUBSCRIPT ( start_OPFUNCTION roman_arg roman_max end_OPFUNCTION start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT caligraphic_N ( bold_italic_x ) start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_a → caligraphic_N ( bold_italic_x ) start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ≥ caligraphic_N ( bold_italic_x ) start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) . (27)

This constraint would not be suitable to compare fuzzy implications with, because the premise of the implication (i.e. argmaxi𝒩(𝒙)i=asubscriptargmax𝑖𝒩subscript𝒙𝑖𝑎\operatorname{arg\,max}_{i}\mathcal{N}(\boldsymbol{x})_{i}=astart_OPFUNCTION roman_arg roman_max end_OPFUNCTION start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT caligraphic_N ( bold_italic_x ) start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_a) is either absolute truth or absolute falsity, and by definition all fuzzy logics behave the same in these cases.

Our ClassSimilarity constraint therefore is a modified variant shown in Eq. 28, where the premise is replaced with a fuzzy truth value:

𝒙Bϵ(𝒙)a,b,c𝒯(𝒩(𝒛)a1/n𝒩(𝒛)b𝒩(𝒛)c),for-all𝒙subscript𝐵italic-ϵ𝒙subscript𝑎𝑏𝑐𝒯𝒩subscript𝒛𝑎1𝑛𝒩subscript𝒛𝑏𝒩subscript𝒛𝑐\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x})\to\bigwedge_{\langle a,b% ,c\rangle\in\mathcal{T}}\left(\mathcal{N}(\boldsymbol{z})_{a}\geq\nicefrac{{1}% }{{n}}\to\mathcal{N}(\boldsymbol{z})_{b}\geq\mathcal{N}(\boldsymbol{z})_{c}% \right),∀ bold_italic_x ∈ italic_B start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_x ) → ⋀ start_POSTSUBSCRIPT ⟨ italic_a , italic_b , italic_c ⟩ ∈ caligraphic_T end_POSTSUBSCRIPT ( caligraphic_N ( bold_italic_z ) start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ≥ / start_ARG 1 end_ARG start_ARG italic_n end_ARG → caligraphic_N ( bold_italic_z ) start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ≥ caligraphic_N ( bold_italic_z ) start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) , (28)

where n𝑛nitalic_n is the number of classes, and 𝒯𝒯\mathcal{T}caligraphic_T is a set of tuples a,b,c𝑎𝑏𝑐\langle a,b,c\rangle⟨ italic_a , italic_b , italic_c ⟩, that expresses the condition “a𝑎aitalic_a is more similar to b𝑏bitalic_b than to c𝑐citalic_c”. For example, cat,dog,frogcatdogfrog\langle\text{cat},\text{dog},\text{frog}\rangle⟨ cat , dog , frog ⟩ expresses that a cat is more similar to a dog than to a frog.

This constraint is used to introduce background knowledge into CIFAR-10.

Results

Refer to caption
Refer to caption
Figure 7: The 𝖢𝗅𝖺𝗌𝗌𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒(ϵ=0.6)𝖢𝗅𝖺𝗌𝗌𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒italic-ϵ0.6\mathsf{ClassSimilarity}(\epsilon=0.6)sansserif_ClassSimilarity ( italic_ϵ = 0.6 ) constraint on CIFAR-10.

As shown in Fig. 7, the class-similarity constraint was already reasonably well satisfied for the baseline with {72.15} constraint accuracy. All logics are able to improve on this constraint accuracy; the Łukasiewicz implication leads to the highest constraint accuracy of {97.38}.

Prediction accuracy drops from {59.50} for the baseline with all logics; DL2 leads to the biggest drop to {30.13}, whereas the least drop is with the sigmoidal Reichenbach logic with {54.66} prediction accuracy, followed by the Łukasiewicz logic with {54.34} prediction accuracy.

The Łukasiewicz logic performs the best overall in this experiment.

5 Discussion and Future Work

The theoretic comparison in Section 3 provided different ways to look at logic relaxations.

  • 1.

    DL2 and all fuzzy logics apart from Łukasiewicz and Yager are sound. (Section 3.1)

  • 2.

    The shadow-lifting property of a conjunction is desirable for learning, as it allows for gradual satisfaction. The investigation in Section 3.2 showed that no fuzzy logic conjunction apart from the Reichenbach one satisfies shadow-lifting. The DL2 conjunction satisfies both shadow-lifting and has behaviour that is well-suited for learning, having strong derivatives everywhere. The Reichenbach conjunction, despite satisfying shadow-lifting, does not exhibit behaviour suitable for learning with derivatives being low when they should be high. The Yager conjunction vanishes on a considerable part of its domain, but its derivatives are well-suited for learning.

  • 3.

    Investigating implication with respect to whether they follow Modus Ponens, or, more importantly, Modus Tollens reasoning, in Section 3.3 showed that only the Reichenbach logic closely follows both Modus Ponens and Modus Tollens. Almost no other logic follows Modus Ponens and Modus Tollens reasoning, suggesting that training with these would not lead to as useful updates during training.

  • 4.

    In Section 3.4, we looked at the logics from a logic perspective as opposed to looking at their behaviour in the training process. The Łukasiewicz, sigmoidal Reichenbach, and Goguen logics were the most consistent, with the Gödel logic being the least consistent.

Our experimental evaluation complements the theoretical one, and shows most importantly, that training with any logic will generally lead to improved constraint accuracy, at the expense of prediction accuracy. This phenomenon is commonly known and was first reported by Tsipras et al. [56].

Our experimental evaluation results however do not closely mirror the theoretical results:

  • 1.

    When comparing conjunction with the groups constraint on GTSRB, the Gödel, Reichenbach, and Yager logics generally performed best, despite the Gödel and Yager logic not satisfying shadow-lifting.

  • 2.

    Comparing implication, the Łukasiewicz logic performs best, despite it not following Modus Ponens and Modus Tollens reasoning, in addition to having vanishing derivatives on half of the domain.

In general, it seems that shadow-lifting, the ability to perform Modus Ponens or Modus Tollens reasoning, as well as consistency, do not seem to matter as much as the operators having strong derivatives.

With PGD and GradNorm444Note that GradNorm allows to circumvent the need to find optimal values to balance prediction and constraint loss which works reasonable well most of the time, matching or even surpassing exhaustive grid search, but can sometimes lead to nonoptimal results when one of the tasks has finished training and cannot be improved upon. In these cases, GradNorm becomes stuck, and the training process needs to be stopped before then for best results. See Fig. 8 for a visualisation of this phenomenon on MNIST., the differences in performance between the logical relaxations more clear, but still depend highly on the specific task at hand.

5.1 Lessons Learned

The large number of hyperparameters, including standard ML hyperparameters (such as batch size, optimiser, learning rate, and network architecture) and hyperparameters introduced by the PGD counterexample finding approach (such as norm (we chose the subscript\ell_{\infty}roman_ℓ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT norm because we considered it more interpretable), radius, and step size) and the asymmetry hyperparameter introduced by GradNorm make it difficult (if not impossible) to provide a definite answer for the question of what logical relaxation to train to satisfy arbitrary constraints on arbitrary data sets on an arbitrary network in the future.

Instead of trying to find a single best logic that works well in all possible use cases, it might be a more fruitful research direction to explore what constraints require certain properties the logics need to exhibit.

Refer to caption
Figure 8: A plot showing the constraint accuracy and constraint loss weight λC(t)subscript𝜆C𝑡\lambda_{\text{C}}(t)italic_λ start_POSTSUBSCRIPT C end_POSTSUBSCRIPT ( italic_t ) for 𝖱𝗈𝖻𝗎𝗌𝗍𝗇𝖾𝗌𝗌(ϵ=0.8,δ=0.01)𝖱𝗈𝖻𝗎𝗌𝗍𝗇𝖾𝗌𝗌formulae-sequenceitalic-ϵ0.8𝛿0.01\mathsf{Robustness}(\epsilon=0.8,\delta=0.01)sansserif_Robustness ( italic_ϵ = 0.8 , italic_δ = 0.01 ) for DL2 and fuzzy logic on MNIST. Note that the constraint loss weight is decreasing over time, and the constraint accuracy drops sharply some time after λCsubscript𝜆C\lambda_{\text{C}}italic_λ start_POSTSUBSCRIPT C end_POSTSUBSCRIPT becomes negative. This can be explained by GradNorm becoming stuck and unstable when one of the tasks is too easy (and considering that MNIST is not a difficult data set to train with), which the authors mention in their paper [31] and leave for future work. Note that λCE(t)subscript𝜆CE𝑡\lambda_{\text{CE}}(t)italic_λ start_POSTSUBSCRIPT CE end_POSTSUBSCRIPT ( italic_t ) is not shown here because it can be obtained via to the relationship λCE(t)+λC(t)=2subscript𝜆CE𝑡subscript𝜆C𝑡2\lambda_{\text{CE}}(t)+\lambda_{\text{C}}(t)=2italic_λ start_POSTSUBSCRIPT CE end_POSTSUBSCRIPT ( italic_t ) + italic_λ start_POSTSUBSCRIPT C end_POSTSUBSCRIPT ( italic_t ) = 2.

5.2 Future Work

Our experiments have shown that learning with differentiable logics can generally improve how much a ML model satisfies a constraint. Imposing logical constraints on the training process in this manner could be a step in the direction of verified ML, allowing the use of continuous-learning in self-improving ML-enabled autonomous systems. It has to be noted that in contrast to formal verifiers, training with logical loss does not formally guarantee properties to hold in all possible cases.

We highlight a few more areas for future work in the following.

Reusing logical constraints during inference

Because of the differentiable logics acting as a regulariser during training, any logical constraints imposed on the learning process are unavailable during inference. The trained model can therefore not make use of the logical constraints to check its predictions, for example to attach confidence scores to its predictions. Giunchiglia et al. [29] provide a survey of learning with logical constraints and also investigate ways to guarantee satisfaction of logical constraints after training, by means of constraining the output.

Probabilistic logics

Despite expressing satisfaction of formulas on [0,1]01[0,1][ 0 , 1 ], fuzzy logics are inherently not probabilistic, having been designed instead for reasoning in the presence of vagueness555Which has consequences for the semantics: for example, in probability theory, given P=Q=0.5𝑃𝑄0.5P=Q=0.5italic_P = italic_Q = 0.5, the probability of “P𝑃Pitalic_P and Q𝑄Qitalic_Q” is Pr[PQ]=0.25Pr𝑃𝑄0.25\operatorname{Pr}[P\cap Q]=0.25roman_Pr [ italic_P ∩ italic_Q ] = 0.25, whereas in fuzzy logic, given p=q=0.5𝑝𝑞0.5p=q=0.5italic_p = italic_q = 0.5, the value of pq𝑝𝑞p\wedge qitalic_p ∧ italic_q is 0.50.50.50.5 using the Gödel t-norm min{x,y}𝑥𝑦\min\{x,y\}roman_min { italic_x , italic_y }, or 00, using the Łukasiewicz t-norm max{0,x+y1}0𝑥𝑦1\max\{0,x+y-1\}roman_max { 0 , italic_x + italic_y - 1 }.. We point to DeepProbLog [57] as one example for a probabilistic logic for use with deep learning. In the context of neural networks, which often output probabilities, it could be more natural to reason about probabilities instead of vagueness, especially for constraints that include probabilities [7].

Properties & expressivity

A common problem with verifying ML is the lack of specifications, as noted by Seshia et al. [5], Leucker [6], Farrell et al. [7]. Most properties in the literature are limited to robustness against slight perturbations, although differentiable logics can not only relate network inputs and outputs, but could also refer to the inner workings of the neural network, such as weights and activation states. A related area is to investigate whether learning with logical loss can be used to show that desired properties continue to hold when retraining the network.

Lastly, logical constraints are often encoded as propositional logic constraints. For example, the ROAD-R [58] benchmark provides videos annotated with propositional logic constraints encoding background knowledge, but more expressive logical constraints are planned for future work. Looking at differentiable logics beyond propositional logic is a direction worth exploring; Varnai and Dimarogonas [45] provide a differentiable logic for STL, and Leung et al. [59], Xie et al. [60], Xu et al. [61] have translations for Linear-time Temporal Logic (LTL).

Acknowledgements

This work was supported by Science Foundation Ireland [grant number 20/FFP-P/8853].

References

  • Flinkow et al. [2023] T. Flinkow, B. A. Pearlmutter, R. Monahan, Comparing Differentiable Logics for Learning Systems: A Research Preview, Electronic Proceedings in Theoretical Computer Science 395 (2023) 17–29. doi:10.4204/EPTCS.395.3. arXiv:2311.09809.
  • Wang et al. [2018] S. Wang, K. Pei, J. Whitehouse, J. Yang, S. Jana, Efficient Formal Safety Analysis of Neural Networks, in: Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018.
  • Szegedy et al. [2014] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, 2014. doi:10.48550/arXiv.1312.6199. arXiv:1312.6199.
  • Goodfellow et al. [2015] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and Harnessing Adversarial Examples, 2015. doi:10.48550/arXiv.1412.6572. arXiv:1412.6572.
  • Seshia et al. [2018] S. A. Seshia, A. Desai, T. Dreossi, D. J. Fremont, S. Ghosh, E. Kim, S. Shivakumar, M. Vazquez-Chanlatte, X. Yue, Formal Specification for Deep Neural Networks, in: S. K. Lahiri, C. Wang (Eds.), Automated Technology for Verification and Analysis, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2018, pp. 20–34. doi:10.1007/978-3-030-01090-4_2.
  • Leucker [2020] M. Leucker, Formal Verification of Neural Networks?, in: G. Carvalho, V. Stolz (Eds.), Formal Methods: Foundations and Applications, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 3–7. doi:10.1007/978-3-030-63882-5_1.
  • Farrell et al. [2023] M. Farrell, A. Mavridou, J. Schumann, Exploring Requirements for Software that Learns: A Research Preview, in: A. Ferrari, B. Penzenstadler (Eds.), Requirements Engineering: Foundation for Software Quality, Lecture Notes in Computer Science, Springer Nature Switzerland, Cham, 2023, pp. 179–188. doi:10.1007/978-3-031-29786-1_12.
  • Cheng and Yan [2021] C.-H. Cheng, R. Yan, Continuous Safety Verification of Neural Networks, in: 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2021, pp. 1478–1483. doi:10.23919/DATE51398.2021.9473994.
  • Bak et al. [2021] S. Bak, C. Liu, T. Johnson, The Second International Verification of Neural Networks Competition (VNN-COMP 2021): Summary and Results, 2021. doi:10.48550/arXiv.2109.00498. arXiv:2109.00498.
  • Müller et al. [2022] M. N. Müller, C. Brix, S. Bak, C. Liu, T. T. Johnson, The Third International Verification of Neural Networks Competition (VNN-COMP 2022): Summary and Results, 2022. doi:10.48550/arXiv.2212.10376. arXiv:2212.10376.
  • Brix et al. [2023a] C. Brix, S. Bak, C. Liu, T. T. Johnson, The Fourth International Verification of Neural Networks Competition (VNN-COMP 2023): Summary and Results, 2023a. doi:10.48550/arXiv.2312.16760. arXiv:2312.16760.
  • Brix et al. [2023b] C. Brix, M. N. Müller, S. Bak, T. T. Johnson, C. Liu, First three years of the international verification of neural networks competition (VNN-COMP), International Journal on Software Tools for Technology Transfer 25 (2023b) 329–339. doi:10.1007/s10009-023-00703-4.
  • Huang et al. [2020] X. Huang, D. Kroening, W. Ruan, J. Sharp, Y. Sun, E. Thamo, M. Wu, X. Yi, A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability, Computer Science Review 37 (2020) 100270. doi:10.1016/j.cosrev.2020.100270.
  • Liu et al. [2021] C. Liu, T. Arnon, C. Lazarus, C. Strong, C. Barrett, M. J. Kochenderfer, Algorithms for Verifying Deep Neural Networks, Foundations and Trends in Optimization 4 (2021) 244–404. doi:10.1561/2400000035.
  • Urban and Miné [2021] C. Urban, A. Miné, A Review of Formal Methods applied to Machine Learning (2021). doi:10.48550/arXiv.2104.02466. arXiv:2104.02466.
  • Albarghouthi [2021] A. Albarghouthi, Introduction to Neural Network Verification, 2021. arXiv:2109.10317.
  • Meng et al. [2022] M. H. Meng, G. Bai, S. G. Teo, Z. Hou, Y. Xiao, Y. Lin, J. S. Dong, Adversarial Robustness of Deep Neural Networks: A Survey from a Formal Verification Perspective, IEEE Transactions on Dependable and Secure Computing (2022) 1–1. doi:10.1109/TDSC.2022.3179131.
  • Katz et al. [2017] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks, in: R. Majumdar, V. Kunčak (Eds.), Computer Aided Verification, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 97–117. doi:10.1007/978-3-319-63387-9_5.
  • Katz et al. [2019] G. Katz, D. A. Huang, D. Ibeling, K. Julian, C. Lazarus, R. Lim, P. Shah, S. Thakoor, H. Wu, A. Zeljić, D. L. Dill, M. J. Kochenderfer, C. Barrett, The Marabou Framework for Verification and Analysis of Deep Neural Networks, in: I. Dillig, S. Tasiran (Eds.), Computer Aided Verification, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 443–452. doi:10.1007/978-3-030-25540-4_26.
  • Tran et al. [2020] H.-D. Tran, X. Yang, D. Manzanas Lopez, P. Musau, L. V. Nguyen, W. Xiang, S. Bak, T. T. Johnson, NNV: The Neural Network Verification Tool for Deep Neural Networks and Learning-Enabled Cyber-Physical Systems, in: S. K. Lahiri, C. Wang (Eds.), Computer Aided Verification, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 3–17. doi:10.1007/978-3-030-53288-8_1.
  • Ferrari et al. [2021] C. Ferrari, M. N. Mueller, N. Jovanović, M. Vechev, Complete Verification via Multi-Neuron Relaxation Guided Branch-and-Bound, in: International Conference on Learning Representations, 2021.
  • Zhang et al. [2018] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, L. Daniel, Efficient Neural Network Robustness Certification with General Activation Functions, 2018. doi:10.48550/arXiv.1811.00866. arXiv:1811.00866.
  • Xu et al. [2020] K. Xu, Z. Shi, H. Zhang, Y. Wang, K.-W. Chang, M. Huang, B. Kailkhura, X. Lin, C.-J. Hsieh, Automatic Perturbation Analysis for Scalable Certified Robustness and Beyond, 2020. doi:10.48550/arXiv.2002.12920. arXiv:2002.12920.
  • Wang et al. [2021] S. Wang, H. Zhang, K. Xu, X. Lin, S. Jana, C.-J. Hsieh, J. Z. Kolter, Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Complete and Incomplete Neural Network Robustness Verification, 2021. doi:10.48550/arXiv.2103.06624. arXiv:2103.06624.
  • Xu et al. [2021] K. Xu, H. Zhang, S. Wang, Y. Wang, S. Jana, X. Lin, C.-J. Hsieh, Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers, 2021. doi:10.48550/arXiv.2011.13824. arXiv:2011.13824.
  • Zhang et al. [2022] H. Zhang, S. Wang, K. Xu, L. Li, B. Li, S. Jana, C.-J. Hsieh, J. Z. Kolter, General Cutting Planes for Bound-Propagation-Based Neural Network Verification, 2022. doi:10.48550/arXiv.2208.05740. arXiv:2208.05740.
  • Shi et al. [2024] Z. Shi, Q. Jin, Z. Kolter, S. Jana, C.-J. Hsieh, H. Zhang, Neural Network Verification with Branch-and-Bound for General Nonlinearities, 2024. doi:10.48550/arXiv.2405.21063. arXiv:2405.21063.
  • Kwiatkowska [2019] M. Kwiatkowska, Safety verification for deep neural networks with provable guarantees, in: Leibniz International Proceedings in Informatics, LIPIcs, volume 140, 2019. doi:10.4230/lipics.concur.2019.1.
  • Giunchiglia et al. [2022] E. Giunchiglia, M. C. Stoian, T. Lukasiewicz, Deep Learning with Logical Constraints, in: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Vienna, Austria, 2022, pp. 5478–5485. doi:10.24963/ijcai.2022/767.
  • Madry et al. [2018] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards Deep Learning Models Resistant to Adversarial Attacks, in: International Conference on Learning Representations, 2018.
  • Chen et al. [2018] Z. Chen, V. Badrinarayanan, C.-Y. Lee, A. Rabinovich, GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks, in: Proceedings of the 35th International Conference on Machine Learning, PMLR, 2018, pp. 794–803.
  • Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An Imperative Style, High-Performance Deep Learning Library, in: Advances in Neural Information Processing Systems, volume 32, Curran Associates, Inc., 2019.
  • Meurer et al. [2017] A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, Am. Kumar, S. Ivanov, J. K. Moore, S. Singh, T. Rathnayake, S. Vig, B. E. Granger, R. P. Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M. J. Curry, A. R. Terrel, Š. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, A. Scopatz, SymPy: Symbolic computing in Python, PeerJ Computer Science 3 (2017) e103. doi:10.7717/peerj-cs.103.
  • Harris et al. [2020] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, T. E. Oliphant, Array programming with NumPy, Nature 585 (2020) 357–362. doi:10.1038/s41586-020-2649-2.
  • Ślusarz et al. [2023] N. Ślusarz, E. Komendantskaya, M. Daggitt, R. Stewart, K. Stark, Logic of Differentiable Logics: Towards a Uniform Semantics of DL, in: EPiC Series in Computing, volume 94, EasyChair, 2023, pp. 473–493. doi:10.29007/c1nt.
  • Fischer et al. [2019] M. Fischer, M. Balunovic, D. Drachsler-Cohen, T. Gehr, C. Zhang, M. Vechev, DL2: Training and Querying Neural Networks with Logic, in: Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019, pp. 1931–1941.
  • van Krieken et al. [2022] E. van Krieken, E. Acar, F. van Harmelen, Analyzing Differentiable Fuzzy Logic Operators, Artificial Intelligence 302 (2022) 103602. doi:10.1016/j.artint.2021.103602. arXiv:2002.06100.
  • Marra et al. [2023] G. Marra, F. Giannini, M. Diligenti, M. Maggini, M. Gori, T-Norms Driven Loss Functions for Machine Learning, Applied Intelligence 53 (2023) 18775–18789. doi:10.1007/s10489-022-04383-6. arXiv:1907.11468.
  • Daggitt et al. [2022] M. L. Daggitt, W. Kokke, R. Atkey, L. Arnaboldi, E. Komendantskya, Vehicle: Interfacing Neural Network Verifiers with Interactive Theorem Provers, 2022. arXiv:2202.05207.
  • Daggitt et al. [2024a] M. L. Daggitt, W. Kokke, R. Atkey, N. Slusarz, L. Arnaboldi, E. Komendantskaya, Vehicle: Bridging the Embedding Gap in the Verification of Neuro-Symbolic Programs, 2024a. doi:10.48550/arXiv.2401.06379. arXiv:2401.06379.
  • Daggitt et al. [2024b] M. Daggitt, W. Kokke, N. Ślusarz, R. Atkey, M. Casadio, E. Komendantskaya, Vehicle, 2024b.
  • Affeldt et al. [2024] R. Affeldt, A. Bruni, E. Komendantskaya, N. Ślusarz, K. Stark, Taming Differentiable Logics with Coq Formalisation, 2024. doi:10.48550/arXiv.2403.13700. arXiv:2403.13700.
  • Klement et al. [2000] E. P. Klement, R. Mesiar, E. Pap, Triangular Norms, volume 8 of Trends in Logic, Springer Netherlands, Dordrecht, 2000. doi:10.1007/978-94-015-9540-7.
  • Baczyński and Jayaram [2008] M. Baczyński, B. Jayaram, Fuzzy Implications, number v. 231 in Studies in Fuzziness and Soft Computing, Springer Verlag, Berlin, 2008.
  • Varnai and Dimarogonas [2020] P. Varnai, D. V. Dimarogonas, On Robustness Metrics for Learning STL Tasks, in: 2020 American Control Conference (ACC), 2020, pp. 5394–5399. doi:10.23919/ACC45564.2020.9147692.
  • Grespan et al. [2021] M. M. Grespan, A. Gupta, V. Srikumar, Evaluating Relaxations of Logic for Neural Networks: A Comprehensive Study, 2021. doi:10.48550/arXiv.2107.13646. arXiv:2107.13646.
  • Eth [2024] Eth-sri/dl2, SRI Lab, ETH Zurich, 2024.
  • Lecun et al. [1998] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324. doi:10.1109/5.726791.
  • Krizhevsky [2009] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images (2009).
  • Stallkamp et al. [2011] J. Stallkamp, M. Schlipsing, J. Salmen, C. Igel, The German Traffic Sign Recognition Benchmark: A multi-class classification competition, in: The 2011 International Joint Conference on Neural Networks, IEEE, San Jose, CA, USA, 2011, pp. 1453–1460. doi:10.1109/IJCNN.2011.6033395.
  • Simonyan and Zisserman [2015] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015. doi:10.48550/arXiv.1409.1556. arXiv:1409.1556.
  • Kingma and Ba [2017] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, 2017. doi:10.48550/arXiv.1412.6980. arXiv:1412.6980.
  • Li et al. [2022] Z. Li, Z. Liu, Y. Yao, J. Xu, T. Chen, X. Ma, J. Lü, Learning with Logical Constraints but without Shortcut Satisfaction, in: The Eleventh International Conference on Learning Representations, 2022.
  • He et al. [2023] H. He, W. Dai, M. Li, Reduced Implication-bias Logic Loss for Neuro-Symbolic Learning, 2023. doi:10.48550/arXiv.2208.06838. arXiv:2208.06838.
  • Casadio et al. [2022] M. Casadio, E. Komendantskaya, M. L. Daggitt, W. Kokke, G. Katz, G. Amir, I. Refaeli, Neural Network Robustness as a Verification Property: A Principled Case Study, in: S. Shoham, Y. Vizel (Eds.), Computer Aided Verification, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2022, pp. 219–231. doi:10.1007/978-3-031-13185-1_11.
  • Tsipras et al. [2018] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, A. Madry, Robustness May Be at Odds with Accuracy, in: International Conference on Learning Representations, 2018.
  • Manhaeve et al. [2018] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, L. De Raedt, DeepProbLog: Neural Probabilistic Logic Programming, in: Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018.
  • Giunchiglia et al. [2023] E. Giunchiglia, M. C. Stoian, S. Khan, F. Cuzzolin, T. Lukasiewicz, ROAD-R: The autonomous driving dataset with logical requirements, Machine Learning 112 (2023) 3261–3291. doi:10.1007/s10994-023-06322-z.
  • Leung et al. [2021] K. Leung, N. Aréchiga, M. Pavone, Backpropagation through Signal Temporal Logic Specifications: Infusing Logical Structure into Gradient-Based Methods, 2021. doi:10.48550/arXiv.2008.00097. arXiv:2008.00097.
  • Xie et al. [2021] Y. Xie, F. Zhou, H. Soh, Embedding Symbolic Temporal Knowledge into Deep Sequential Models, in: 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 4267–4273. doi:10.1109/ICRA48506.2021.9561952.
  • Xu et al. [2022] Z. Xu, Y. S. Rawat, Y. Wong, M. Kankanhalli, M. Shah, Don’t Pour Cereal into Coffee: Differentiable Temporal Logic for Temporal Action Segmentation, in: Advances in Neural Information Processing Systems, 2022.