Comparing Differentiable Logics for Learning with Logical Constraints¹¹1This paper is an extended version of our paper [1] published in FMAS 2023, containing a revised experimental setup leading to a fairer, more meaningful comparison. Section 1.3 explains our extensions and contributions in more detail.

Thomas Flinkow thomas.flinkow@mu.ie Barak A. Pearlmutter Rosemary Monahan

Abstract

Extensive research on formal verification of machine learning systems indicates that learning from data alone often fails to capture underlying background knowledge such as specifications implicitly available in the data. Various neural network verifiers have been developed to ensure that a machine-learnt model satisfies correctness and safety properties, however, they typically assume a trained network with fixed weights. A promising approach for creating machine learning models that inherently satisfy constraints after training is to encode background knowledge as explicit logical constraints that guide the learning process via so-called differentiable logics. In this paper, we experimentally compare and evaluate various logics from the literature, presenting our findings and highlighting open problems for future work.

keywords:

machine learning , neuro-symbolic , differentiable logic , verification

^†^†journal: Science of Computer Programming

\AtBeginDocument\RenewCommandCopy\qty\SI\UseTblrLibrary

booktabs

\affiliation

[1]organization=Department of Computer Science, addressline=Maynooth University, city=Maynooth, Co. Kildare, country=Ireland \affiliation[2]organization=Hamilton Institute, addressline=Maynooth University, city=Maynooth, Co. Kildare, country=Ireland

1 Introduction

Advancements in machine learning (ML) in the past few years indicate great potential for applying ML to various domains. Autonomous systems are one such application domain, but using ML components in such a safety-critical domain presents unique new challenges for formal verification. These include

1.

ML failing to learn background knowledge from data alone [2],
2.

neural networks being susceptible to adversarial inputs [3, 4],
3.

and a lack of specifications, generally and especially when continuous learning is permitted [5, 6, 7].

Addressing these challenges is even more important and more difficult when the ML-enabled autonomous system is permitted to continue to learn after deployment, either to adapt to changing environments or to correct and improve itself when errors are detected [8].

1.1 Formal verification of neural networks

A multitude of neural network verifiers have been presented in the past few years. We refer the reader to the Neural Network Verification Competition (VNN-COMP) reports [9, 10, 11, 12] for an overview of state-of-the-art neural network verifiers, and to Huang et al. [13], Liu et al. [14], Urban and Miné [15], Albarghouthi [16], Meng et al. [17] for in-depth surveys on neural network verification.

Reluplex, one of the first verifiers for neural networks, was provided by Katz et al. [18]. State-of-the-art tools are its successor Marabou [19], along with NNV [20], MN-BaB [21], and $\alpha,\beta$ -CROWN [22, 23, 24, 25, 26, 27] (the winner of the 2021–2023 VNN-COMP competitions [9, 10, 11]). However, as noted by Kwiatkowska [28] they typically assume trained networks with fixed weights and do not target the learning process itself.

1.2 Loss-based methods for guiding network training

There are multiple reasons for injecting background knowledge into neural networks: improved performance, learning with noisy or sparse data, or to guarantee compliance of the network predictions with the background knowledge [29]. One step in the direction of correct-by-construction neural networks are so-called differentiable logics, which transform a logical constraint $\phi$ into an additional constraint loss term $\mathcal{L}_{\text{C}}$ that measures how close the network is to satisfying the constraint. This is in addition to standard loss²²2Without loss of generality, only cross-entropy loss will be used in the following. such as standard cross-entropy loss $\mathcal{L}_{\text{CE}}$ , which is a measure of the difference between the predicted and true probabilities.

The total loss $\mathcal{L}$ is then a weighted sum of constraint and cross-entropy loss, as shown in Eq. 1, where $\lambda_{\text{CE}}$ and $\lambda_{\text{C}}$ balance the different loss terms, $\boldsymbol{x}_{0}$ is training data, and $\boldsymbol{y}$ the true label.

\mathcal{L}(\boldsymbol{x}_{0},\boldsymbol{y},\phi):=\lambda_{\text{CE}}% \mathcal{L}_{\text{CE}}(\boldsymbol{x}_{0},\boldsymbol{y})+\lambda_{\text{C}}% \mathcal{L}_{\text{C}}(\boldsymbol{x}_{0},\boldsymbol{y},\phi).

(1)

The optimal network weights $\boldsymbol{\theta}^{+}$ , needed to map the input to a desired output, are then obtained by minimising the total loss using standard gradient descent, where $\boldsymbol{\theta}$ denotes the network weights, formally shown in Eq. 2.

\boldsymbol{\theta}^{+}=\operatorname{arg\,min}_{\boldsymbol{\theta}}\;% \mathcal{L}(\boldsymbol{x},\boldsymbol{y},\phi).

(2)

In order to translate logical constraints into this constraint loss, a mapping must be defined that allows for real-valued truth values and is differentiable (almost everywhere) for use with standard gradient-based methods. In Section 2 we provide an overview of two of these mappings (so-called differentiable logics) from popular literature, namely DL2 and fuzzy logics.

1.3 Our Contributions

In this section we outline our contributions and improvements over our work in [1] which provided an experimental comparison of differentiable logics. The experiments in this earlier work were limited to constraints that utilise data that is already in the training set, hence not allowing the constraints to have full effect. We now extend that work, through a complete revision of the experimental setup which provides for a more measurable experimental comparison of differentiable logics as explained in Section 4 below. We achieve this improved comparison by finding counterexamples outside of the data set that violate the logical constraint using Projected Gradient Descent [30] (PGD), thus ensuring that the differentiable logics have an impact on the learning process.

We also improve our presentation of differentiable logics in Section 2 so that differences are easily identified. Further theoretical comparison metrics from the literature, such as derivatives and consistency, are now included in Section 3. In our previous work, the comparison of differentiable logics suffered from a lack of fairness, due to a difficult-to-tune hyperparameter that prevented the logics to perform at their best. Our extended work addresses this through use of an adaptive loss-balancing approach called GradNorm [31], which allows each logic to perform close to optimally within our experiments, facilitating a fairer comparison of the logics in Section 4.

Our implementation of various differentiable logics in Python has been vastly improved, allowing their reuse and adaptation in the training of other neural networks, on new data sets with further constraints. Additionally, the logics are now implemented, not only for use in training with PyTorch [32], but also to allow the investigation of mathematical properties (such as derivatives and integrals) in a automated manner with SymPy [33] and Numpy [34].

1.4 Notation & Definitions

In the following, vectors will be denoted using bold face, e.g. $\boldsymbol{x}$ .

Without loss of generality, we consider only classifier neural networks that output probabilities. The function approximated by the neural network will be denoted by $\mathcal{N}$ , and for our purposes will be a function $\mathcal{N}:\mathbb{R}^{m}\to[0,1]^{n}$ , mapping images to a vector containing the predicted probabilities of the input belonging to any of the $n$ possible classes. Each vector element $\mathcal{N}(\boldsymbol{x})_{k}$ represents the prediction for class $k$ (where $0\leq k<n$ ). We further define $B_{\epsilon}(\boldsymbol{x})$ to be the $\ell_{\infty}$ ball with radius $\epsilon$ around $\boldsymbol{x}_{0}$ , i.e.

B_{\epsilon}(\boldsymbol{x}_{0}):=\{\boldsymbol{x}\in\mathbb{R}^{m}\mid\lVert% \boldsymbol{x}-\boldsymbol{x}_{0}\rVert_{\infty}\leq\epsilon\}.

(3)

This gives us a shorthand to refer to the set of all points close to point $\boldsymbol{x}_{0}$ with respect to the $\ell_{\infty}$ norm. This will be convenient later when discussing local robustness, where a property is supposed to hold not just for one particular input, but also for all points that are similar to that input. Lastly, given logical formula $\phi$ , we let $[\![\phi]\!]_{\text{$L$}}$ denote its many-valued relaxation under a specific logic $L$ , based on notation from Ślusarz et al. [35].

2 A comparison of DL2 and Fuzzy Logics

Deep Learning with Differentiable Logics (DL2) is a system built by Fischer et al. [36] for querying and training neural networks with logical constraints that are mapped into $[0,\infty)$ , where $0$ represents absolute truth, and any positive value represents a degree of falsehood that is directly used as a penalty. Whereas DL2 was designed specifically for deep learning contexts, fuzzy logics are well-studied logical systems that also happen to be suitable for use as a differentiable logic due to their many-valued nature, with operators that are often differentiable almost everywhere. Fuzzy logics express degrees of truth in the unit interval $[0,1]$ , with absolute falsity mapped to $0$ , and absolute truth mapped to $1$ .

The first in-depth study investigating the use of fuzzy logics in loss functions (called Differentiable Fuzzy Logics) was provided by van Krieken et al. [37] and presented insights into the learning properties of differentiable logic operators. The use of t-norm based logical loss was also investigated by Marra et al. [38], who focused on keeping logical relations intact, instead of combining arbitrary fuzzy logic operators.

As a follow up, Ślusarz et al. [35] provide a common presentation of differentiable logics including DL2, fuzzy logics, and Signal Temporal Logic (STL), in a unified, extensible framework called Logic of Differentiable Logics (LDL). This framework allows the investigation of properties of differentiable logics in general, without having to consider each logic on its own. Additionally, it is used by the Vehicle [39, 40, 41] tool, providing an expressive domain specific language for stating neural network specifications which can be compiled to both verifiers and interactive theorem provers (ITPs).

The LDL framework also provides for mechanised proofs of properties of differentiable logics, thus providing a stepping stone for the development of programming language support for verification of machine learning [42].

2.1 Comparison of DL2 and Fuzzy Logics

In the following, we present the components of the logics that we use in our experimental comparison when mapping logical constraints $\phi\in\Phi$ into real-valued loss. These logics are the DL2 logic relaxation $[\![\cdot]\!]_{\text{DL2}}:\Phi\to[0,\infty)$ , and the fuzzy logic based relaxation $[\![\cdot]\!]_{\text{FL}}:\Phi\to[0,1]$ . Here, we provide details of their atomic terms, their operators and their quantifiers.

Atomic terms

Atomic terms in DL2 are comparisons that are translated as

	$\displaystyle[\![x\leq y]\!]_{\text{DL2}}:=\max\{x-y,0\},$		(4)
	$\displaystyle[\![x\neq y]\!]_{\text{DL2}}:=\xi[x=y],$		(5)

where $\xi>0$ is a constant (found not to have significant influence [36]), and $[x=y]$ is the indicator, i.e. $[x=y]=1$ if $x=y$ , and $[x=y]=0$ , otherwise.

Fuzzy logics typically do not define fuzzy comparison operators, which is why Ślusarz et al. [35] introduce a fuzzy comparison mapping shown in Eq. 6 below:

[\![\leq]\!]_{\text{FL}}:[0,1]^{2}\to[0,1],\qquad[\![\leq]\!]_{\text{FL}}=1-% \max\left\{\frac{x-y}{x+y},0\right\}.

(6)

As with all fuzzy logic operators, the truth values of the terms $x,y$ must be mapped into $[0,1]$ by some oracle. In our constraints, atomic terms are comparisons, so we take the liberty of changing this mapping to the one shown in Eq. 7, allowing us to forgo the need for an external oracle. Note that although this is not technically a pure fuzzy logic operator anymore, fuzzy logics usually do not have fuzzy comparison operators at all.

[\![\leq]\!]_{\text{FL}}:\mathbb{R}^{2}\to[0,1],\qquad[\![\leq]\!]_{\text{FL}}% =1-\frac{\max\{x-y,0\}}{|x|+|y|}

(7)

Note also that this mapping has a property that we intuitively might wish to hold: for example, we might want $21\leq 20$ to be as much of a violation as $21\,000$\leq$20\,000$ . This cannot be achieved in DL2, where the violation depends only on the absolute difference of values.

Conjunction and disjunction

In DL2, conjunction and disjunction are mapped to addition and multiplication, thus being associative and commutative:

	$\displaystyle[\![x\wedge y]\!]_{\text{DL2}}:=[\![x]\!]_{\text{DL2}}+[\![y]\!]_% {\text{DL2}},$		(8)
	$\displaystyle[\![x\vee y]\!]_{\text{DL2}}:=[\![x]\!]_{\text{DL2}}\cdot[\![y]\!% ]_{\text{DL2}}.$		(9)

Fuzzy logics are based on functions $T:[0,1]^{2}\to[0,1]$ that are commutative, associative, monotonic, and satisfy $T(1,y)=y$ . These are called triangular norms [43] (abbreviated as t-norms) and generalise conjunction. A t-conorm (also called s-norm) generalises disjunction and can be obtained from a t-norm using $S(x,y)=1-T(1-x,1-y)$ ; thus, for our fuzzy logic mapping we have:

	$\displaystyle[\![x\wedge y]\!]_{\text{$L$}}:=T_{L}(x,y),$		(10)
	$\displaystyle[\![x\vee y]\!]_{\text{$L$}}:=S_{L}(x,y).$		(11)

Negation

Because DL2 maps into $[0,\infty)$ , a separate translation of negation does not exist. Instead, negation is handled by pushing it inwards to the level of comparison, e.g. $[\![\lnot(x\leq y)]\!]_{\text{DL2}}=[\![y<x]\!]_{\text{DL2}}$ .

In fuzzy logics, negation is a function $N:[0,1]\to[0,1]$ such that $N(1)=0$ and for all $x$ , $N(N(x))\geq x$ . We will only use the standard strong negation $N(x)=1-x$ in the following.

Implication

DL2 does not provide a separate translation for implication, instead, material implication is used:

[\![x\to y]\!]_{\text{DL2}}:=[\![\lnot x\vee y]\!]_{\text{DL2}}.

(12)

In fuzzy logics, there are multiple ways to obtain fuzzy implications. Table 1 lists the definitions of the mentioned t-norms, t-conorms as well as these implications. Baczyński and Jayaram [44] give a detailed overview of fuzzy implications.

1.

From a t-conorm $S$ and fuzzy negation $N$ , one obtains a so-called $(S,N)$ -implication (which generalises material implication) as $I(x,y)=S(N(x),y)$ ). Example $(S,N)$ -implications are the Kleene-Dienes implication $I_{\text{Kleene-Dienes}}(x,y):=\max\{1-x,y\}$ and Reichenbach implication $I_{\text{Reichenbach}}(x,y):=1-x+xy$ , both with the standard negation $N(x)=1-x$ .
2.

Other implications generalise the intuitionistic implication and are called $R$ -implications, because they use the t-norm residuum $R(x,y):=\sup\{t\in[0,1]\mid T(x,t)\leq y\}$ . Example $R$ -implications are the Gödel and Goguen implications.
3.

The Łukasiewicz implication $I_{\text{\L ukasiewicz}}(x,y):=\min\{1-x+y,1\}$ is both an $(S,N)$ -implication and an $R$ -implication.
4.

Other implications are neither—the Yager implication, for example, is an $f$ -generated implication that is obtained using $f(x)=-\ln x$ in $I(x,y)=f^{-1}(xf(y))$ (with the understanding that $0\cdot\infty=0$ ).

Additionally, van Krieken et al. [37] propose sigmoidal implications in order to improve the derivatives of the original implication, while preserving its characteristics. With the standard sigmoidal function shown in Eq. 13, Eq. 14 shows the sigmoidal implication $I^{(s)}$ for any given fuzzy implication $I$ and a parameter $s$ for controlling the steepness. In our experiments, we use the sigmoidal Reichenbach implication with $s=9$ , as suggested by van Krieken et al. [37].

\sigma(x)=\frac{1}{(1+\exp(-x))}

(13)

I^{(s)}(x,y)=\dfrac{(1+\exp(\nicefrac{{s}}{{2}}))\sigma(sI(x,y)-\nicefrac{{s}}% {{2}})-1}{\exp(\nicefrac{{s}}{{2}})-1}

(14)

Universal quantification

In DL2, Fischer et al. [36] provide an important insight: training to satisfy a universally quantified constraint can be approximated by finding a counterexample that violates the constraint, and using that counterexample in training. This counterexample can be found by an arbitrary sampling function—DL2 uses Projected Gradient Descent (PGD) to find the worst possible perturbation around particular inputs. In DL2, universal quantification thus is not general, and is treated outside of the logic.

With t-norms being associative and commutative, universal fuzzy quantification is given by repeated application of conjunction.

Note that the LDL framework can handle arbitrarily nested, general universal (and existential) quantifiers.

Other operators

Based on the comparison, negation, conjunction and disjunction, and implication operators shown above, other statements can be derived, such as $[\![x\leftrightarrow y]\!]_{\text{$L$}}=[\![(x\to y)\wedge(y\to x)]\!]_{\text{% $L$}}$ . Note that as shown in Table 1, we will only use symmetric configurations, meaning a logic will consist of some t-norm for conjunction, the dual t-conorm for disjunction, and implication will be either an $(S,N)$ -implication based on the t-conorm, or the R-implication based on the t-norm. Symmetric configurations usually retain logical relations, but they might not perform as well as arbitrarily combined fuzzy operators. An analysis of both symmetric as well as arbitrary combinations is provided by van Krieken et al. [37].

Table 1: Common t-norms, t-conorms and fuzzy implications.

{tblr} colspec=lQ[c, mode=dmath]Q[c, mode=dmath]Q[c, mode=dmath], row1=font=, mode=text, cell22=r=2c, cell23=r=2c, cell34=r=3c, cell52=r=2c, cell53=r=2c, vline2-4 = 2-Zdotted, hline3-Y = dotted, Logic $L$ & T-norm $T_{L}(x,y)$ S-norm $S_{L}(x,y)$ Implication $I_{L}(x,y)$
Gödel min{x,y} max{x,y} {1,if x¡y,y,else.
Kleene-Dienes S_L(N(x), y)
\addlinespaceŁukasiewicz max{0,x+y-1} min{1, x+y}
\addlinespaceReichenbach xy x+y-xy
Goguen {1,if x¡y,yx,else.
Yager ( $p\geq 1$ ) max{1-((1-x)^p+(1-y)^p)^ $\nicefrac{{1}}{{p}}$ ,0} min{(x^p+y^p)^ $\nicefrac{{1}}{{p}}$ , 1} {1,if x=y=0, $\nicefrac{{y}}{{x}}$ ,else.

3 Theoretical Comparison

We now compare DL2 and fuzzy logics, discussing soundness, shadow-lifting and derivatives, Modus Ponens and Modus Tollens reasoning, as well as consistency.

3.1 Soundness

Soundness is a highly-desirable—usually even non-negotiable—property any logical system should have. Not all differentiable logics presented herein are sound, e.g., it is known that Łukasiewicz and Yager fuzzy logics are unsound. Ślusarz et al. [35] and Affeldt et al. [42] provide mechanised proofs for the soundness of DL2, STL, and all other fuzzy logics mentioned above.

3.2 Shadow-lifting (Derivatives of Conjunction)

Table 2: The derivatives of the DL2 and fuzzy conjunctions and their properties. Note that all conjunctions shown are commutative, so we only show the derivatives with respect to

x

\DefTblrTemplatecaptiondefault {talltblr}[ notea = For brevity, we let $\mathcal{S}=(1-x)^{p}+(1-y)^{p}$ ., ] colspec=lQ[c, m, mode=dmath]Q[c, m, mode=dmath]Q[l, m, mode=text]Q[c, m, mode=dmath], row1=font=, mode=text, Logic $L$ & Conjunction $[\![x\wedge y]\!]_{\text{$L$}}$ $\displaystyle\partialderivative{[\![x\wedge y]\!]_{\text{$L$}}}{x}$ Behaviour Shadow-
lifting
DL2 x+y 1 strong derivative ✓
Gödel min{x,y} {1,if x≤y0 strong derivative ✗
Łukasiewicz max{0,x+y-1} {1,if x+y¿10 strong derivative;
but vanishing on half
of the domain ✗
Reichenbach xy y derivative low, if $y$ low;
(not suitable for learning) ✓
Yager\TblrNotea ( $p\geq 1$ ) max{1-S^ $\nicefrac{{1}}{{p}}$ ,0} {(1-x)p-1S $\nicefrac{{1}}{{p-1}}$ ,if S $\nicefrac{{1}}{{p}}$ ¿10 derivative vanishes on
considerable part of
the domain; derivative
high, if $x$ or $y$ low
(suitable for learning) ✗

Varnai and Dimarogonas [45] introduce the shadow-lifting property for conjunction, which requires the truth value of a conjunction to increase when the truth value of a conjunct increases; formally expressed³³3Note that the original version from Varnai and Dimarogonas [45] is defined for conjunctions with $M$ terms, as STL has conjunctions with $M$ terms by designs. in Eq. 15: the shadow-lifting property is satisfied, if for any $\rho\neq 0$ ,

\partialderivative{[\![x_{1}\wedge x_{2}]\!]_{\text{$L$}}}{x_{i}}\bigg{|}_{x_{% 1}=x_{2}=\rho}>0\quad\text{for all }i\in\{1,2\}.

(15)

This property is highly desirable for learning, as it allows for gradual improvement. For example, the formula $0.1\wedge 0.9$ should be more true than $0.1\wedge 0.2$ , but the Gödel t-norm $T_{\text{G}}(x,y)=\min\{x,y\}$ yields the same truth value in both cases. Note that Varnai and Dimarogonas [45] have proven that no conjunction operator can be associative and shadow-lifting at the same time.

DL2 uses addition for conjunction, trivially satisfying shadow-lifting. The only t-norm to satisfy the shadow-lifting property is the product t-norm $T(x,y)=xy$ . The Gödel, Łukasiewicz, and Yager t-norms are not differentiable everywhere due to their use of the $\min$ and $\max$ operators, and do not satisfy shadow-lifting [42]. However, a conjunction that enjoys shadow-lifting might not be favourable for other reasons; as noted by van Krieken et al. [37], the derivatives of the product t-norm will be low if $x$ and $y$ are both low, making it hard for the learning process to make progress at all. In contrast to that, the DL2 conjunction is not just shadow-lifting, but also has strong derivatives regardless of the values of $x$ and $y$ . Table 2 shows the derivatives of the conjunctions presented earlier.

3.3 Modus Ponens and Modus Tollens Reasoning (Derivatives of Implication)

Table 3: The derivatives of the DL2 and fuzzy implications and their properties. A ✓ symbol in the Modus Ponens (or, respectively, the Modus Tollens) column indicates that the particular implication closely follows Modus Ponens (or, respectively, Modus Tollens) reasoning, while a ✗ indicates that this particular implication does not follow Modus Ponens (or, respectively, Modus Tollens) reasoning.

\DefTblrTemplatecaptiondefault {talltblr}[ notea=Derivatives vanish on half of the domain., noteb=Modus Ponens reasoning is only possible for $x>1$ . However, when $x<1$ (corresponding to being very confident in $x$ ), it is only possible to decrease confidence in $y$ , which is the opposite of what should happen with Modus Ponens., ] colspec=lQ[c, m, mode=dmath]Q[c, m, mode=dmath]Q[c, m, mode=dmath]Q[c, m, mode=text]Q[c, m, mode=text], row1=font=, mode=text, Logic $L$ & Implication $[\![x\wedge y]\!]_{\text{$L$}}$ $\displaystyle\partialderivative{[\![x\to y]\!]_{\text{$L$}}}{x}$ $\displaystyle\partialderivative{[\![x\to y]\!]_{\text{$L$}}}{y}$ Modus
Ponens Modus
Tollens
DL2 y(1-x) -y 1-x (✓)\TblrNoteb ✗
Gödel {1,if x¡y,yelse. 0 {0,if x¡y\TblrNotea,1,if x¿y. ✗ ✗
Kleene-Dienes max{1-x,y} {0,if x+y¿1-1,if x+y¡1. {0,if x+y¡11,if x+y¿1. ✗ ✗
Łukasiewicz min{1-x+y,1} {0,if x¡y\TblrNotea-1,if x¿y. {0,if x¡y\TblrNotea1,if x¿y. ✗ ✗
Reichenbach 1-x+xy y-1 x ✓ ✓
Goguen {1,if x≤y, $\nicefrac{{y}}{{x}}$ ,else. {0,if x≤y\TblrNotea,- $\nicefrac{{y}}{{x^{2}}}$ ,else. {0,if x≤y\TblrNotea, $\nicefrac{{1}}{{x}}$ ,else. ✗ ✗
Yager ( $p\geq 1$ ) {1,if x,y=0, $\nicefrac{{y}}{{x}}$ ,else. {0,if x,y=0,yxlogy,else. {0,if x,y=0,xyx-y,else. ✗ ✓

Given an implication $x\to y$ , $x$ is called the antecedent, and $y$ is called the consequent.

Modus Ponens reasoning (affirming the antecedent) allows to infer $y$ from $x$ . On the other hand, Modus Tollens reasoning (denying the consequent) is used to infer $\lnot x$ from $\lnot y$ .

Due to the different notions of truth in DL2 and fuzzy logics, in the following we will not refer to truth values directly, but rather use the concept of confidence.

Modus Ponens reasoning needs to be able to increase the confidence in the consequent $y$ , if the confidence in the antecedent $x$ is high. Modus Tollens reasoning needs to be able to decrease the confidence in the antecedent $x$ , if the confidence in the consequent $y$ is low. Thus, in DL2, increasing the confidence in $\phi$ means decreasing the value of $\phi$ , whereas in fuzzy logics, increasing the confidence in $\phi$ corresponds to increasing the value of $\phi$ . Table 3 lists derivatives of the DL2 and fuzzy implications, along with a description of their behaviour, and whether they permit Modus Ponens and Modus Tollens reasoning.

As noted by van Krieken et al. [37], not only is most background knowledge phrased as implications, but also there are often far more negative examples than they are positive ones in ML contexts, thus requiring Modus Tollens reasoning more frequently. Choosing a suitable implication that performs well in the presence of this Modus Ponens / Modus Tollens imbalance is thus an important task to guarantee best learning.

In the following, we examine derivatives of $[\![x\to y]\!]_{\text{$L$}}$ and investigate whether they permit Modus Ponens reasoning, and, more importantly, Modus Tollens reasoning. For the Gödel, Kleene-Dienes, Łukasiewicz, Reichenbach, and Goguen implication, we briefly summarise known results from the literature [37], and contribute investigations into the derivatives and Modus Ponens and Modus Tollens behaviour for DL2 and the Yager implication.

1.

In DL2, Modus Tollens reasoning is not possible; since the derivative with respect to the antecedent $x$ is $-y$ , it is never possible to increase $x$ (i.e. to decrease the confidence in $x$ ). Modus Ponens reasoning is possible only for values of $x>1$ , which means when we are very confident in $x$ , it is not possible to increase the confidence in $y$ —even worse, the confidence in $y$ will be decreased.
2.

For the Gödel implication, derivatives with respect to the antecedent $x$ do not exist; it is thus impossible to perform Modus Tollens reasoning. Whenever $x>y$ , the confidence in $y$ is arbitrarily increased, even when the confidence in $x$ is low; thus not exactly following Modus Ponens. Further, when $x<y$ , derivatives vanish.
3.

The Kleene-Dienes implication does not closely follow Modus Ponens or Modus Tollens; instead, the confidence in the antecedent $x$ is decreased when the confidence in $x$ and $y$ is low, and the confidence in the consequent $y$ is increased, if the confidence in $x$ and $y$ is high.
4.

The Łukasiewicz logic has another issue: the confidence in the antecedent can never be decreased; it therefore does not follow Modus Tollens reasoning. Whenever we are more confident in antecedent $x$ than the in consequent $y$ , the confidence in $y$ will be increased, which does not closely follow Modus Ponens. Further, when $x<y$ , derivatives vanish.
5.

The Reichenbach implication closely follows Modus Ponens (whenever the confidence in the antecedent $x$ is high, the confidence in the consequent $y$ will be increased) and Modus Tollens (if the confidence in the consequent $y$ is low, the confidence in the antecedent $x$ will be decreased).
6.

The Goguen implication does not follow Modus Ponens: whenever the confidence in $x$ is very low, the confidence in $y$ will be increased strongly. Modus Tollens reasoning is possible, however, as $-\nicefrac{{y}}{{x^{2}}}$ gets smaller for increasing values of $x$ , it also behaves the opposite of what should be expected: when the confidence in $x$ is very high, it should be decreased faster than when it is low. Further, when $x\leq y$ , derivatives vanish.
7.

The Yager implication follows Modus Tollens reasoning but not Modus Ponens reasoning; when the confidence in $x$ is high but the confidence in $y$ is low, derivatives with respect to $y$ are low instead of high; but when $x$ is high, derivatives are high, too.

3.4 Consistency (Integrals of Fuzzy Logic Operators)

Whereas derivatives of logical operators give insights into how they behave in the learning process, another view is from a logic perspective, looking at their integrals, as proposed by Grespan et al. [46], who introduce a notion of consistency (that is, a degree of truth) for fuzzy logics based on the intuition that a fuzzy logic tautology $\tau$ should be absolutely true for all possible values, i.e. its integral should evaluate to $1$ , formally expressed in Eq. 16 for a given fuzzy logic relaxation $L$ .

\int\cdots\int_{[0,1]}[\![\tau(x_{1},\ldots,x_{n})]\!]_{\text{$L$}}\,\mathrm{d% }x_{n}\cdots\mathrm{d}x_{1}

(16)

We use the same representative set of axioms as [46], and evaluate the consistency of these for all fuzzy logics listed in Table 1. The results are displayed in Table 4 and indicate that the Łukasiewicz logic and the sigmoidal Reichenbach logic are the most consistent, whereas Gödel is generally the least consistent for the set of axioms chosen.

Note that this is not a definite metric to evaluate fuzzy logics with, but rather depends on the specific set of axioms chosen and should only be used as an indicator. It is not possible to evaluate DL2 with this measure due to its lack of a general negation and the fact that atomic terms in DL2 are comparison and not absolute truth values.

Table 4: The consistency of different fuzzy logics over a representative set of axioms. It can be observed that the Łukasiewicz, sigmoidal Reichenbach, and Goguen logics are the most, and the Gödel logic being the least consistent for this set of axioms.

{tblr} colspec=lccccccc, cells=mode=dmath, row1=font=, mode=text, rowZ=font=, mode=text, Tautology & Gödel Kleene-Dienes Łukasiewicz Reichenbach Goguen sig. Reichenbach Yager
Axiom schemata
P →(Q →P) 0.67 0.79 1 0.92 1 0.99 0.89
(P →(Q →R)) →( (P →Q) →(P →R) ) 0.75 0.75 0.96 0.87 0.93 0.97 0.85
(¬P →¬Q) →(Q →P) 0.79 0.75 1 0.86 0.90 0.98 0.82
Primitive propositions
(P ∨P) →P 0.50 0.75 0.75 0.75 0.69 0.88 0.72
Q →(P ∨Q) 0.83 0.79 1 0.92 1 0.98 0.90
(P ∨Q) →(Q ∨P) 0.67 0.75 1 0.86 1 0.96 0.85
(P ∨(Q ∨R)) →(Q ∨(P ∨R)) 0.75 0.78 1 0.91 1 0.98 0.91
(Q →R) →( (P ∨Q) →(P ∨R)) 0.88 0.76 1 0.90 1 0.99 0.90
Law of excluded middle
P ∨¬P 0.75 0.75 1 0.83 0.83 0.83 0.81
Law of contradiction
¬(P ∧¬P) 0.75 0.75 1 0.83 0.83 0.83 0.81
Law of double negation
P ↔¬(¬P)) 0.50 0.75 1 0.70 1 0.91 0.69
Principles of transposition
(P ↔Q) ↔(¬P ↔¬Q) 0.17 0.67 1 0.61 0.59 0.93 0.57
( (P ∧Q) →R ) ↔( (P ∧¬R) →¬Q) 0.51 0.64 0.67 0.67 0.65 0.73 0.64
Laws of tautology
P ↔(P ∧P) 0.50 0.75 0.75 0.69 0.50 0.87 0.44
P ↔(P ∨P) 0.50 0.75 0.75 0.69 0.69 0.87 0.69
Laws of absorption
(P →Q) ↔(P ↔(P ∧Q)) 0.33 0.71 0.83 0.66 0.67 0.94 0.49
Q →(P ↔(P ∧Q)) 0.33 0.75 1 0.82 1 0.98 0.57
Assoc., comm., dist. laws
(P ∧(Q ∨R)) ↔( (P ∧Q) ∨(P ∧R) ) 0.42 0.72 0.90 0.69 0.89 0.90 0.67
(P ∨(Q ∧R)) ↔( (P ∨Q) ∧(P ∨R) ) 0.58 0.72 0.90 0.69 0.90 0.90 0.69
De Morgan’s laws
¬(P ∧Q) ↔(¬P ∨¬Q) 0.67 0.75 1 0.75 1 0.93 0.78
¬(P ∨Q) ↔(¬P ∧¬Q) 0.33 0.75 1 0.75 1 0.93 0.73
Material excluded middle
(P →Q) ∨(Q →P) 1 0.83 1 0.97 1 1 1
Average Consistency 0.60 0.75 0.93 0.79 0.87 0.92 0.75

4 Experimental Evaluation & Results

In this section we revise the experimental set up in Flinkow et al. [1] so that constraints are not limited to those that utilise data that is already in the training set. We achieve an improved comparison of logics by finding counterexamples outside of the data set that violate the logical constraint using PGD, thus ensuring that the differentiable logics have optimal positive impact on the learning process.

4.1 Setup

Our comparison experiment is implemented in PyTorch [32] and based on the original DL2 experiments [36, 47]. The code, together with the experimental data, is available on https://github.com/tflinkow/comparing-differentiable-logics. The goal of our experiment is to compare the differentiable logics introduced in Section 2, and investigate which logic performs most favourable, focusing specifically on implication and conjunction, as these have noticeable consequences for the learning process.

All experiments were conducted on Google Colab using a system with an Intel Xeon CPU and an NVIDIA A100 GPU. All experiments were run for 50 epochs. The baseline for each experiment was achieved by training without any logical loss translation. We train with the MNIST [48], CIFAR-10 and CIFAR-100 [49], and the German Traffic Sign Recognition Benchmark (GTSRB) [50] data sets with various logical constraints. For MNIST, we use a simple convolutional neural network (CNN) with two convolutional layers (32 and 64 filters), max-pooling, and two fully connected layers (128 and 10 units) followed by a softmax output layer. For CIFAR-10, CIFAR-100, and GTSRB, we use the VGG16 [51] architecture. Various image manipulation techniques are applied, such as random cropping, flipping, rotation, and colour changes. We use the Adam [52] optimiser with learning rates of [scientific-notation=true]0.0001 and [scientific-notation=true]0.00001 and batch sizes of 256 and 512.

The results reported in the tables in Figs. 2, 3, 5, 6 and 7 are derived by identifying the best combination of prediction and accuracy of the last {10} of experimental runs. The best combination is the one from these that maximises the product of prediction and constraint accuracy, thus effectively selecting the scenario with the highest combined performance (so as not to focus on only prediction or only constraint accuracy).

In each table, the overall best combination of prediction and constraint accuracy is shown in bold face. The plots in Figs. 2, 3, 5, 6 and 7 show how prediction and constraint accuracy change over time. The best combinations (i.e. the values reported in the tables) are marked in each plot.

4.2 Training with Counterexamples.

A major strength of DL2 is the ability to learn with constraints that can refer to data outside of the training set. This is achieved by finding a counterexample that violates the constraint. Given an input $\boldsymbol{x}_{0}$ and label $\boldsymbol{y}$ , our goal is to train the network to satisfy constraints of the form

\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x}_{0})\to\phi(\boldsymbol{x% }_{0},\boldsymbol{x},\boldsymbol{y}).

(17)

This restriction allows PGD to find an approximate counterexample $\boldsymbol{x}^{*}$ within $\epsilon$ -distance of $\boldsymbol{x}_{0}$ that does not satisfy $\phi(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y})$ . This is achieved by minimising the loss of $\lnot\phi$ (or, equivalently, maximising the loss of $\phi$ ). Formally,

\boldsymbol{x}^{*}=\operatorname{arg\,max}_{\boldsymbol{x}\in B_{\epsilon}(% \boldsymbol{x}_{0})}\;\mathcal{L}_{\text{C}}(\boldsymbol{x}_{0},\boldsymbol{x}% ,\boldsymbol{y},\phi).

(18)

Learning to satisfy constraints of the form shown in Eq. 17 thus corresponds to a two-step process of first finding a counterexample $\boldsymbol{x}^{*}$ , and then minimising the total loss

\mathcal{L}(\boldsymbol{x}_{0},\boldsymbol{y},\phi)=\lambda_{\text{CE}}% \mathcal{L}_{\text{CE}}(\boldsymbol{x}_{0},y)+\lambda_{\text{C}}\mathcal{L}_{% \text{C}}(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y},\phi),

(19)

where the constraint loss for DL2 is given by

\mathcal{L}_{\text{C}}(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y},% \phi)=[\![\phi(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y})]\!]_{% \text{DL2}},

(20)

or, respectively,

\mathcal{L}_{\text{C}}(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y},% \phi)=1-[\![\phi(\boldsymbol{x}_{0},\boldsymbol{x}^{*},\boldsymbol{y})]\!]_{% \text{$L$}},

(21)

for a given fuzzy logic $L$ (to address the inverse notion of truth).

4.3 Balancing loss

The additional constraint loss term $\mathcal{L}_{\text{C}}$ introduces hyperparameters $\lambda_{\text{CE}}$ and $\lambda_{\text{C}}$ that are responsible for balancing the two loss terms in Eq. 19. These hyperparameters need to be well fine-tuned to properly balance the different loss terms. This is a prerequisite for the logical loss to perform optimally and have the most impact. Our previous work used grid search to find approximate values; however, despite being time-consuming, it is also unlikely to yield optimal values, which had consequences for the fairness of our comparison.

In this paper, we use GradNorm by Chen et al. [31] to provide a fairer comparison. GradNorm is an adaptive loss balancing algorithm in multi-task learning scenarios, where the total loss is a weighted sum of individual task losses. GradNorm treats these loss weights as learnable parameters that are allowed to vary for each epoch. The network is penalised when backpropagated gradients from any task are too large or too small, which should lead to the different tasks training at similar rates. GradNorm at least matches, and often outperforms, exhaustive grid search, not only reducing the time needed to train with close-to-optimal weights, but also leading to overall improved task performance. GradNorm provides only one hyperparameter, the so-called asymmetry parameter $\alpha$ , where higher values of $\alpha$ lead to stronger training rate balancing.

Another approach for balancing loss was recently proposed by Li et al. [53], and also has the advantage of avoiding what the authors call shortcut satisfaction—a well-known phenomenon He et al. [54] call implication bias: neural networks trained with logical constraints have a tendency to vacuously satisfy a constraint $x\to y$ by learning to decrease the confidence in $x$ .

Refer to caption — Figure 1: The prediction $\lambda_{\text{P}}(t)$ and constraint weights $\lambda_{\text{C}}(t)$ learned with GradNorm ( $\alpha=0.1$ ) for the $\mathsf{Robustness}(\epsilon=0.4,\delta=0.02)$ constraint on GTSRB with the Gödel logic.

4.4 Choice of constraints

In the following, we explain the logical constraints used during training. We also present the logical operators they are meant to evaluate.

Comparing comparison operators

Classification robustness [55] is usually expressed as

\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x}_{0})\to\operatorname{arg% \,max}_{i}\mathcal{N}(\boldsymbol{x})=y,

(22)

where $y$ is the true label. This formulation involves the non-differentiable $\operatorname{arg\,max}$ operator, making it unsuitable for use in a differentiable loss function setting.

Another variant in DL2 is proposed to circumvent this in Fischer et al. [36], as shown in Eq. 23. Note that it only compares the network’s prediction for the true class against some threshold value (chosen to be $\delta=0.52$ ).

\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x}_{0})\to\mathcal{N}(% \boldsymbol{x}_{0})_{k}\geq\delta.

(23)

Casadio et al. [55] note that both of these robustness formulations fail to capture the intended semantics, because they compare the prediction for the true class instead of the current network output. Learning to satisfy these constraints can lead to situations where the network is technically robust around $\boldsymbol{x}_{0}$ , because the prediction for all $\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x}_{0})$ is the same as the one for $\boldsymbol{x}_{0}$ , but is not the correct class. In other words, these constraints capture too much; the standard loss term already penalises incorrect predictions.

Therefore, we are using the notion of standard robustness [55] for our Robustness constraint, shown in Eq. 24:

\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x})\to\lVert\mathcal{N}(% \boldsymbol{x})-\mathcal{N}(\boldsymbol{x}_{0})\rVert_{\infty}\leq\delta.

(24)

As explained before, the $\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x})$ part will be handled outside of the logic by employing PGD to find a counterexample; therefore, this robustness constraint can only be used to compare the different loss translations for comparison, i.e. $[\![x\leq y]\!]_{\text{DL2}}$ shown in Eq. 4, and the fuzzy logic one $[\![x\leq y]\!]_{\text{FL}}$ shown in Eq. 7. We use this constraint on MNIST and GTSRB.

Results

There are no significant differences between DL2 and fuzzy logics when it comes to training with the robustness constraint on MNIST, as seen in Fig. 2. Both DL2 and fuzzy logic can bring the constraint accuracy from {32.37} for the baseline to {95.42} for DL2, and {96.58} for the fuzzy logic. The prediction accuracy is not negatively affected when training with constraint loss. This is likely due to the fact that MNIST is not a particularly difficult dataset to train with.

On GTSRB, the differences are more noticeable: The baseline has a constraint accuracy of {28.45}, whereas the DL2 one has {56.46}, and the fuzzy logic one has {68.76}. Prediction accuracy is not noticeably reduced for DL2, and slightly improved over the baseline for the fuzzy logic operator. See Fig. 3 for detailed results.

Comparing conjunction and disjunction

In order to compare conjunction and disjunction, we use the Groups constraint introduced in [36] to force the network to make strong decisions: a group of related elements must have a high combined probability, or a low combined probability.

We denote the probability of a group $G$ by the sum of probabilities $\mathcal{N}(\boldsymbol{x})_{g}$ for each element $g$ in that group, i.e.

p_{G}=\sum_{g\in G}\mathcal{N}(\boldsymbol{x})_{g},

(25)

and then obtain the group constraint Groups as shown in Eq. 26 below:

\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x})\to\bigwedge_{G\in% \mathcal{G}}p_{G}\leq\delta\;\vee\;p_{G}\geq 1-\delta.

(26)

Groups consist of classes of a similar type (e.g. speed limit signs for GTSRB) and are thus used to encode a similarity relation on top of the data. We use this constraint on the GTSRB dataset with the set of groups $\mathcal{G}$ being shown in Fig. 5, and on CIFAR-100 with the groups induced by the “coarse” labels provided, such as “fish”, “flowers”, “people”, etc.

Results

On GTSRB, as seen in Fig. 5, all logics significantly improve constraint accuracy, while there is a considerable drop in prediction accuracy for all of them. The Gödel logic is able to bring constraint accuracy from {22.15} for the baseline to {99.39}. Prediction accuracy with the Gödel logic drops from {56.73} for the baseline to only {34.76}, which is still higher than the prediction accuracy of any other logic we trained with.

The Gödel logic performs the best overall in this experiment. On CIFAR-100, prediction accuracy drops from {32.16} for all logics but the Łukasiewicz logic, which surpasses the baseline prediction accuracy slightly—however, its constraint accuracy drops even below the baseline with only {14.41}, compared to the baseline with {45.49}.

All other logics notice a drop in prediction accuracy, but are able to increase constraint accuracy (often significantly). The best logic in this experiment was the Reichenbach logic, with a prediction accuracy of {25.62} and {72.75}, however, both the Gödel and Yager logics yield prediction and constraint accuracies that are only slightly worse.

Comparing implication

Lastly, to compare implication, we use a constraint based on the class-similarity constraint from DL2 [36] shown in Eq. 27:

\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x})\to\bigwedge_{\langle a,b% ,c\rangle\in\mathcal{T}}\left(\operatorname{arg\,max}_{i}\mathcal{N}(% \boldsymbol{x})_{i}=a\to\mathcal{N}(\boldsymbol{x})_{b}\geq\mathcal{N}(% \boldsymbol{x})_{c}\right).

(27)

This constraint would not be suitable to compare fuzzy implications with, because the premise of the implication (i.e. $\operatorname{arg\,max}_{i}\mathcal{N}(\boldsymbol{x})_{i}=a$ ) is either absolute truth or absolute falsity, and by definition all fuzzy logics behave the same in these cases.

Our ClassSimilarity constraint therefore is a modified variant shown in Eq. 28, where the premise is replaced with a fuzzy truth value:

\forall\boldsymbol{x}\in B_{\epsilon}(\boldsymbol{x})\to\bigwedge_{\langle a,b% ,c\rangle\in\mathcal{T}}\left(\mathcal{N}(\boldsymbol{z})_{a}\geq\nicefrac{{1}% }{{n}}\to\mathcal{N}(\boldsymbol{z})_{b}\geq\mathcal{N}(\boldsymbol{z})_{c}% \right),

(28)

where $n$ is the number of classes, and $\mathcal{T}$ is a set of tuples $\langle a,b,c\rangle$ , that expresses the condition “ $a$ is more similar to $b$ than to $c$ ”. For example, $\langle\text{cat},\text{dog},\text{frog}\rangle$ expresses that a cat is more similar to a dog than to a frog.

This constraint is used to introduce background knowledge into CIFAR-10.

Results

As shown in Fig. 7, the class-similarity constraint was already reasonably well satisfied for the baseline with {72.15} constraint accuracy. All logics are able to improve on this constraint accuracy; the Łukasiewicz implication leads to the highest constraint accuracy of {97.38}.

Prediction accuracy drops from {59.50} for the baseline with all logics; DL2 leads to the biggest drop to {30.13}, whereas the least drop is with the sigmoidal Reichenbach logic with {54.66} prediction accuracy, followed by the Łukasiewicz logic with {54.34} prediction accuracy.

The Łukasiewicz logic performs the best overall in this experiment.

5 Discussion and Future Work

The theoretic comparison in Section 3 provided different ways to look at logic relaxations.

1.

DL2 and all fuzzy logics apart from Łukasiewicz and Yager are sound. (Section 3.1)
2.

The shadow-lifting property of a conjunction is desirable for learning, as it allows for gradual satisfaction. The investigation in Section 3.2 showed that no fuzzy logic conjunction apart from the Reichenbach one satisfies shadow-lifting. The DL2 conjunction satisfies both shadow-lifting and has behaviour that is well-suited for learning, having strong derivatives everywhere. The Reichenbach conjunction, despite satisfying shadow-lifting, does not exhibit behaviour suitable for learning with derivatives being low when they should be high. The Yager conjunction vanishes on a considerable part of its domain, but its derivatives are well-suited for learning.
3.

Investigating implication with respect to whether they follow Modus Ponens, or, more importantly, Modus Tollens reasoning, in Section 3.3 showed that only the Reichenbach logic closely follows both Modus Ponens and Modus Tollens. Almost no other logic follows Modus Ponens and Modus Tollens reasoning, suggesting that training with these would not lead to as useful updates during training.
4.

In Section 3.4, we looked at the logics from a logic perspective as opposed to looking at their behaviour in the training process. The Łukasiewicz, sigmoidal Reichenbach, and Goguen logics were the most consistent, with the Gödel logic being the least consistent.

Our experimental evaluation complements the theoretical one, and shows most importantly, that training with any logic will generally lead to improved constraint accuracy, at the expense of prediction accuracy. This phenomenon is commonly known and was first reported by Tsipras et al. [56].

Our experimental evaluation results however do not closely mirror the theoretical results:

1.

When comparing conjunction with the groups constraint on GTSRB, the Gödel, Reichenbach, and Yager logics generally performed best, despite the Gödel and Yager logic not satisfying shadow-lifting.
2.

Comparing implication, the Łukasiewicz logic performs best, despite it not following Modus Ponens and Modus Tollens reasoning, in addition to having vanishing derivatives on half of the domain.

In general, it seems that shadow-lifting, the ability to perform Modus Ponens or Modus Tollens reasoning, as well as consistency, do not seem to matter as much as the operators having strong derivatives.

With PGD and GradNorm⁴⁴4Note that GradNorm allows to circumvent the need to find optimal values to balance prediction and constraint loss which works reasonable well most of the time, matching or even surpassing exhaustive grid search, but can sometimes lead to nonoptimal results when one of the tasks has finished training and cannot be improved upon. In these cases, GradNorm becomes stuck, and the training process needs to be stopped before then for best results. See Fig. 8 for a visualisation of this phenomenon on MNIST., the differences in performance between the logical relaxations more clear, but still depend highly on the specific task at hand.

5.1 Lessons Learned

The large number of hyperparameters, including standard ML hyperparameters (such as batch size, optimiser, learning rate, and network architecture) and hyperparameters introduced by the PGD counterexample finding approach (such as norm (we chose the $\ell_{\infty}$ norm because we considered it more interpretable), radius, and step size) and the asymmetry hyperparameter introduced by GradNorm make it difficult (if not impossible) to provide a definite answer for the question of what logical relaxation to train to satisfy arbitrary constraints on arbitrary data sets on an arbitrary network in the future.

Instead of trying to find a single best logic that works well in all possible use cases, it might be a more fruitful research direction to explore what constraints require certain properties the logics need to exhibit.

5.2 Future Work

Our experiments have shown that learning with differentiable logics can generally improve how much a ML model satisfies a constraint. Imposing logical constraints on the training process in this manner could be a step in the direction of verified ML, allowing the use of continuous-learning in self-improving ML-enabled autonomous systems. It has to be noted that in contrast to formal verifiers, training with logical loss does not formally guarantee properties to hold in all possible cases.

We highlight a few more areas for future work in the following.

Reusing logical constraints during inference

Because of the differentiable logics acting as a regulariser during training, any logical constraints imposed on the learning process are unavailable during inference. The trained model can therefore not make use of the logical constraints to check its predictions, for example to attach confidence scores to its predictions. Giunchiglia et al. [29] provide a survey of learning with logical constraints and also investigate ways to guarantee satisfaction of logical constraints after training, by means of constraining the output.

Probabilistic logics

Despite expressing satisfaction of formulas on $[0,1]$ , fuzzy logics are inherently not probabilistic, having been designed instead for reasoning in the presence of vagueness⁵⁵5Which has consequences for the semantics: for example, in probability theory, given $P=Q=0.5$ , the probability of “ $P$ and $Q$ ” is $\operatorname{Pr}[P\cap Q]=0.25$ , whereas in fuzzy logic, given $p=q=0.5$ , the value of $p\wedge q$ is $0.5$ using the Gödel t-norm $\min\{x,y\}$ , or $0$ , using the Łukasiewicz t-norm $\max\{0,x+y-1\}$ .. We point to DeepProbLog [57] as one example for a probabilistic logic for use with deep learning. In the context of neural networks, which often output probabilities, it could be more natural to reason about probabilities instead of vagueness, especially for constraints that include probabilities [7].

Properties & expressivity

A common problem with verifying ML is the lack of specifications, as noted by Seshia et al. [5], Leucker [6], Farrell et al. [7]. Most properties in the literature are limited to robustness against slight perturbations, although differentiable logics can not only relate network inputs and outputs, but could also refer to the inner workings of the neural network, such as weights and activation states. A related area is to investigate whether learning with logical loss can be used to show that desired properties continue to hold when retraining the network.

Lastly, logical constraints are often encoded as propositional logic constraints. For example, the ROAD-R [58] benchmark provides videos annotated with propositional logic constraints encoding background knowledge, but more expressive logical constraints are planned for future work. Looking at differentiable logics beyond propositional logic is a direction worth exploring; Varnai and Dimarogonas [45] provide a differentiable logic for STL, and Leung et al. [59], Xie et al. [60], Xu et al. [61] have translations for Linear-time Temporal Logic (LTL).

Acknowledgements

This work was supported by Science Foundation Ireland [grant number 20/FFP-P/8853].

References

Flinkow et al. [2023] T. Flinkow, B. A. Pearlmutter, R. Monahan, Comparing Differentiable Logics for Learning Systems: A Research Preview, Electronic Proceedings in Theoretical Computer Science 395 (2023) 17–29. doi:10.4204/EPTCS.395.3. arXiv:2311.09809.
Wang et al. [2018] S. Wang, K. Pei, J. Whitehouse, J. Yang, S. Jana, Efficient Formal Safety Analysis of Neural Networks, in: Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018.
Szegedy et al. [2014] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, 2014. doi:10.48550/arXiv.1312.6199. arXiv:1312.6199.
Goodfellow et al. [2015] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and Harnessing Adversarial Examples, 2015. doi:10.48550/arXiv.1412.6572. arXiv:1412.6572.
Seshia et al. [2018] S. A. Seshia, A. Desai, T. Dreossi, D. J. Fremont, S. Ghosh, E. Kim, S. Shivakumar, M. Vazquez-Chanlatte, X. Yue, Formal Specification for Deep Neural Networks, in: S. K. Lahiri, C. Wang (Eds.), Automated Technology for Verification and Analysis, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2018, pp. 20–34. doi:10.1007/978-3-030-01090-4_2.
Leucker [2020] M. Leucker, Formal Verification of Neural Networks?, in: G. Carvalho, V. Stolz (Eds.), Formal Methods: Foundations and Applications, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 3–7. doi:10.1007/978-3-030-63882-5_1.
Farrell et al. [2023] M. Farrell, A. Mavridou, J. Schumann, Exploring Requirements for Software that Learns: A Research Preview, in: A. Ferrari, B. Penzenstadler (Eds.), Requirements Engineering: Foundation for Software Quality, Lecture Notes in Computer Science, Springer Nature Switzerland, Cham, 2023, pp. 179–188. doi:10.1007/978-3-031-29786-1_12.
Cheng and Yan [2021] C.-H. Cheng, R. Yan, Continuous Safety Verification of Neural Networks, in: 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2021, pp. 1478–1483. doi:10.23919/DATE51398.2021.9473994.
Bak et al. [2021] S. Bak, C. Liu, T. Johnson, The Second International Verification of Neural Networks Competition (VNN-COMP 2021): Summary and Results, 2021. doi:10.48550/arXiv.2109.00498. arXiv:2109.00498.
Müller et al. [2022] M. N. Müller, C. Brix, S. Bak, C. Liu, T. T. Johnson, The Third International Verification of Neural Networks Competition (VNN-COMP 2022): Summary and Results, 2022. doi:10.48550/arXiv.2212.10376. arXiv:2212.10376.
Brix et al. [2023a] C. Brix, S. Bak, C. Liu, T. T. Johnson, The Fourth International Verification of Neural Networks Competition (VNN-COMP 2023): Summary and Results, 2023a. doi:10.48550/arXiv.2312.16760. arXiv:2312.16760.
Brix et al. [2023b] C. Brix, M. N. Müller, S. Bak, T. T. Johnson, C. Liu, First three years of the international verification of neural networks competition (VNN-COMP), International Journal on Software Tools for Technology Transfer 25 (2023b) 329–339. doi:10.1007/s10009-023-00703-4.
Huang et al. [2020] X. Huang, D. Kroening, W. Ruan, J. Sharp, Y. Sun, E. Thamo, M. Wu, X. Yi, A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability, Computer Science Review 37 (2020) 100270. doi:10.1016/j.cosrev.2020.100270.
Liu et al. [2021] C. Liu, T. Arnon, C. Lazarus, C. Strong, C. Barrett, M. J. Kochenderfer, Algorithms for Verifying Deep Neural Networks, Foundations and Trends in Optimization 4 (2021) 244–404. doi:10.1561/2400000035.
Urban and Miné [2021] C. Urban, A. Miné, A Review of Formal Methods applied to Machine Learning (2021). doi:10.48550/arXiv.2104.02466. arXiv:2104.02466.
Albarghouthi [2021] A. Albarghouthi, Introduction to Neural Network Verification, 2021. arXiv:2109.10317.
Meng et al. [2022] M. H. Meng, G. Bai, S. G. Teo, Z. Hou, Y. Xiao, Y. Lin, J. S. Dong, Adversarial Robustness of Deep Neural Networks: A Survey from a Formal Verification Perspective, IEEE Transactions on Dependable and Secure Computing (2022) 1–1. doi:10.1109/TDSC.2022.3179131.
Katz et al. [2017] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks, in: R. Majumdar, V. Kunčak (Eds.), Computer Aided Verification, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 97–117. doi:10.1007/978-3-319-63387-9_5.
Katz et al. [2019] G. Katz, D. A. Huang, D. Ibeling, K. Julian, C. Lazarus, R. Lim, P. Shah, S. Thakoor, H. Wu, A. Zeljić, D. L. Dill, M. J. Kochenderfer, C. Barrett, The Marabou Framework for Verification and Analysis of Deep Neural Networks, in: I. Dillig, S. Tasiran (Eds.), Computer Aided Verification, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 443–452. doi:10.1007/978-3-030-25540-4_26.
Tran et al. [2020] H.-D. Tran, X. Yang, D. Manzanas Lopez, P. Musau, L. V. Nguyen, W. Xiang, S. Bak, T. T. Johnson, NNV: The Neural Network Verification Tool for Deep Neural Networks and Learning-Enabled Cyber-Physical Systems, in: S. K. Lahiri, C. Wang (Eds.), Computer Aided Verification, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 3–17. doi:10.1007/978-3-030-53288-8_1.
Ferrari et al. [2021] C. Ferrari, M. N. Mueller, N. Jovanović, M. Vechev, Complete Verification via Multi-Neuron Relaxation Guided Branch-and-Bound, in: International Conference on Learning Representations, 2021.
Zhang et al. [2018] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, L. Daniel, Efficient Neural Network Robustness Certification with General Activation Functions, 2018. doi:10.48550/arXiv.1811.00866. arXiv:1811.00866.
Xu et al. [2020] K. Xu, Z. Shi, H. Zhang, Y. Wang, K.-W. Chang, M. Huang, B. Kailkhura, X. Lin, C.-J. Hsieh, Automatic Perturbation Analysis for Scalable Certified Robustness and Beyond, 2020. doi:10.48550/arXiv.2002.12920. arXiv:2002.12920.
Wang et al. [2021] S. Wang, H. Zhang, K. Xu, X. Lin, S. Jana, C.-J. Hsieh, J. Z. Kolter, Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Complete and Incomplete Neural Network Robustness Verification, 2021. doi:10.48550/arXiv.2103.06624. arXiv:2103.06624.
Xu et al. [2021] K. Xu, H. Zhang, S. Wang, Y. Wang, S. Jana, X. Lin, C.-J. Hsieh, Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers, 2021. doi:10.48550/arXiv.2011.13824. arXiv:2011.13824.
Zhang et al. [2022] H. Zhang, S. Wang, K. Xu, L. Li, B. Li, S. Jana, C.-J. Hsieh, J. Z. Kolter, General Cutting Planes for Bound-Propagation-Based Neural Network Verification, 2022. doi:10.48550/arXiv.2208.05740. arXiv:2208.05740.
Shi et al. [2024] Z. Shi, Q. Jin, Z. Kolter, S. Jana, C.-J. Hsieh, H. Zhang, Neural Network Verification with Branch-and-Bound for General Nonlinearities, 2024. doi:10.48550/arXiv.2405.21063. arXiv:2405.21063.
Kwiatkowska [2019] M. Kwiatkowska, Safety verification for deep neural networks with provable guarantees, in: Leibniz International Proceedings in Informatics, LIPIcs, volume 140, 2019. doi:10.4230/lipics.concur.2019.1.
Giunchiglia et al. [2022] E. Giunchiglia, M. C. Stoian, T. Lukasiewicz, Deep Learning with Logical Constraints, in: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Vienna, Austria, 2022, pp. 5478–5485. doi:10.24963/ijcai.2022/767.
Madry et al. [2018] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards Deep Learning Models Resistant to Adversarial Attacks, in: International Conference on Learning Representations, 2018.
Chen et al. [2018] Z. Chen, V. Badrinarayanan, C.-Y. Lee, A. Rabinovich, GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks, in: Proceedings of the 35th International Conference on Machine Learning, PMLR, 2018, pp. 794–803.
Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An Imperative Style, High-Performance Deep Learning Library, in: Advances in Neural Information Processing Systems, volume 32, Curran Associates, Inc., 2019.
Meurer et al. [2017] A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, Am. Kumar, S. Ivanov, J. K. Moore, S. Singh, T. Rathnayake, S. Vig, B. E. Granger, R. P. Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M. J. Curry, A. R. Terrel, Š. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, A. Scopatz, SymPy: Symbolic computing in Python, PeerJ Computer Science 3 (2017) e103. doi:10.7717/peerj-cs.103.
Harris et al. [2020] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, T. E. Oliphant, Array programming with NumPy, Nature 585 (2020) 357–362. doi:10.1038/s41586-020-2649-2.
Ślusarz et al. [2023] N. Ślusarz, E. Komendantskaya, M. Daggitt, R. Stewart, K. Stark, Logic of Differentiable Logics: Towards a Uniform Semantics of DL, in: EPiC Series in Computing, volume 94, EasyChair, 2023, pp. 473–493. doi:10.29007/c1nt.
Fischer et al. [2019] M. Fischer, M. Balunovic, D. Drachsler-Cohen, T. Gehr, C. Zhang, M. Vechev, DL2: Training and Querying Neural Networks with Logic, in: Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019, pp. 1931–1941.
van Krieken et al. [2022] E. van Krieken, E. Acar, F. van Harmelen, Analyzing Differentiable Fuzzy Logic Operators, Artificial Intelligence 302 (2022) 103602. doi:10.1016/j.artint.2021.103602. arXiv:2002.06100.
Marra et al. [2023] G. Marra, F. Giannini, M. Diligenti, M. Maggini, M. Gori, T-Norms Driven Loss Functions for Machine Learning, Applied Intelligence 53 (2023) 18775–18789. doi:10.1007/s10489-022-04383-6. arXiv:1907.11468.
Daggitt et al. [2022] M. L. Daggitt, W. Kokke, R. Atkey, L. Arnaboldi, E. Komendantskya, Vehicle: Interfacing Neural Network Verifiers with Interactive Theorem Provers, 2022. arXiv:2202.05207.
Daggitt et al. [2024a] M. L. Daggitt, W. Kokke, R. Atkey, N. Slusarz, L. Arnaboldi, E. Komendantskaya, Vehicle: Bridging the Embedding Gap in the Verification of Neuro-Symbolic Programs, 2024a. doi:10.48550/arXiv.2401.06379. arXiv:2401.06379.
Daggitt et al. [2024b] M. Daggitt, W. Kokke, N. Ślusarz, R. Atkey, M. Casadio, E. Komendantskaya, Vehicle, 2024b.
Affeldt et al. [2024] R. Affeldt, A. Bruni, E. Komendantskaya, N. Ślusarz, K. Stark, Taming Differentiable Logics with Coq Formalisation, 2024. doi:10.48550/arXiv.2403.13700. arXiv:2403.13700.
Klement et al. [2000] E. P. Klement, R. Mesiar, E. Pap, Triangular Norms, volume 8 of Trends in Logic, Springer Netherlands, Dordrecht, 2000. doi:10.1007/978-94-015-9540-7.
Baczyński and Jayaram [2008] M. Baczyński, B. Jayaram, Fuzzy Implications, number v. 231 in Studies in Fuzziness and Soft Computing, Springer Verlag, Berlin, 2008.
Varnai and Dimarogonas [2020] P. Varnai, D. V. Dimarogonas, On Robustness Metrics for Learning STL Tasks, in: 2020 American Control Conference (ACC), 2020, pp. 5394–5399. doi:10.23919/ACC45564.2020.9147692.
Grespan et al. [2021] M. M. Grespan, A. Gupta, V. Srikumar, Evaluating Relaxations of Logic for Neural Networks: A Comprehensive Study, 2021. doi:10.48550/arXiv.2107.13646. arXiv:2107.13646.
Eth [2024] Eth-sri/dl2, SRI Lab, ETH Zurich, 2024.
Lecun et al. [1998] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324. doi:10.1109/5.726791.
Krizhevsky [2009] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images (2009).
Stallkamp et al. [2011] J. Stallkamp, M. Schlipsing, J. Salmen, C. Igel, The German Traffic Sign Recognition Benchmark: A multi-class classification competition, in: The 2011 International Joint Conference on Neural Networks, IEEE, San Jose, CA, USA, 2011, pp. 1453–1460. doi:10.1109/IJCNN.2011.6033395.
Simonyan and Zisserman [2015] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015. doi:10.48550/arXiv.1409.1556. arXiv:1409.1556.
Kingma and Ba [2017] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, 2017. doi:10.48550/arXiv.1412.6980. arXiv:1412.6980.
Li et al. [2022] Z. Li, Z. Liu, Y. Yao, J. Xu, T. Chen, X. Ma, J. Lü, Learning with Logical Constraints but without Shortcut Satisfaction, in: The Eleventh International Conference on Learning Representations, 2022.
He et al. [2023] H. He, W. Dai, M. Li, Reduced Implication-bias Logic Loss for Neuro-Symbolic Learning, 2023. doi:10.48550/arXiv.2208.06838. arXiv:2208.06838.
Casadio et al. [2022] M. Casadio, E. Komendantskaya, M. L. Daggitt, W. Kokke, G. Katz, G. Amir, I. Refaeli, Neural Network Robustness as a Verification Property: A Principled Case Study, in: S. Shoham, Y. Vizel (Eds.), Computer Aided Verification, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2022, pp. 219–231. doi:10.1007/978-3-031-13185-1_11.
Tsipras et al. [2018] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, A. Madry, Robustness May Be at Odds with Accuracy, in: International Conference on Learning Representations, 2018.
Manhaeve et al. [2018] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, L. De Raedt, DeepProbLog: Neural Probabilistic Logic Programming, in: Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018.
Giunchiglia et al. [2023] E. Giunchiglia, M. C. Stoian, S. Khan, F. Cuzzolin, T. Lukasiewicz, ROAD-R: The autonomous driving dataset with logical requirements, Machine Learning 112 (2023) 3261–3291. doi:10.1007/s10994-023-06322-z.
Leung et al. [2021] K. Leung, N. Aréchiga, M. Pavone, Backpropagation through Signal Temporal Logic Specifications: Infusing Logical Structure into Gradient-Based Methods, 2021. doi:10.48550/arXiv.2008.00097. arXiv:2008.00097.
Xie et al. [2021] Y. Xie, F. Zhou, H. Soh, Embedding Symbolic Temporal Knowledge into Deep Sequential Models, in: 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 4267–4273. doi:10.1109/ICRA48506.2021.9561952.
Xu et al. [2022] Z. Xu, Y. S. Rawat, Y. Wong, M. Kankanhalli, M. Shah, Don’t Pour Cereal into Coffee: Differentiable Temporal Logic for Temporal Action Segmentation, in: Advances in Neural Information Processing Systems, 2022.