In their famous paper on adversarial examples, Szegedy et al. conjecture that the set of adversarial examples is dense in the set of regular samples, much like how the rational numbers are dense in the reals:

> Possible explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found near virtually every test case.

I find it interesting that, to my knowledge, this conjecture has never been formally investigated in the literature. I attempt to shed some light on it here, and show that it leads to a paradoxical conclusion: if the set of adversarial examples is assumed to be dense, we can infer a surprisingly large lower bound on the robustness of the classifier.

As a reminder, in topology, a subset \(A\) of a topological space \(X\) is called dense in \(X\) if every point \(x \in X\) either belongs to \(A\) or is a limit point of \(A\). A point \(x \in X\) is a limit point of \(A\) if every neighborhood of \(x\) intersects \(A\) in some point other than \(x\). For example, \(\mathbb{Q}\) is dense in \(\mathbb{R}\), since every open interval around any real number contains a rational number. In particular, \(A\) being dense in \(X\) implies that every point \(x \in X \setminus A\) must be a limit point of \(A\).

Let’s restrict our attention now to the specific type of topological space known as a metric space, which additionally comes with a distance function \(d(\cdot, \cdot)\). Consider a continuous mapping \(f: X \to Y\) from one metric space \((X, d_X)\) to another metric space \((Y, d_Y)\). If \(A\) is dense in \(X\), then every element \(x \in X \setminus A\) must be a limit point of \(A\). In metric spaces, neighborhoods correspond to open balls, so this means that every open ball centered at \(x\), no matter how small its radius, contains a point of \(A\).

For simplicity, I will focus on the case of binary classification in \(\mathbb{R}^n\). That is, we have a function \(f: \mathbb{R}^n \to \mathbb{R}\) that maps \(n\)-dimensional vectors onto a score which can be turned into a class label as follows:

\[\hat{f}(x) = \begin{cases} 1 & f(x) > 0,\\ 0 & f(x) \leq 0. \end{cases}\]
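To make this concrete, here is a minimal sketch of such a classifier in Python. The linear score function and its weights are arbitrary placeholders of my own choosing, standing in for whatever continuous model one might actually train:

```python
import numpy as np

# Placeholder weights for a linear score function standing in for a
# continuous classifier f: R^3 -> R. Any continuous model would do.
w = np.array([1.0, -2.0, 0.5])
b = 0.1

def f(x: np.ndarray) -> float:
    """Continuous score f(x)."""
    return float(w @ x + b)

def f_hat(x: np.ndarray) -> int:
    """Thresholded class label: 1 if f(x) > 0, else 0."""
    return 1 if f(x) > 0 else 0
```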

Assume that \(f\) is continuous, so that for every \(a \in \mathbb{R}^n\),

\[\lim_{x \to a}~f(x) = f(a).\]

Define the set of \(\varepsilon\)-adversarials for \(f\) as follows:

\[A = \{ \tilde{x} \in \mathbb{R}^n \mid \exists x \in X: d(x, \tilde{x}) \leq \varepsilon \mbox{ and } \hat{f}(x) \neq \hat{f}(\tilde{x}) \}.\]

In words, an adversarial example is a point \(\tilde{x} \in \mathbb{R}^n\) that lies \(\varepsilon\)-close to some data sample \(x \in X\) but is assigned a different class label than \(x\). I now formalize the conjecture by Szegedy et al. as the statement that the set \(A\) is dense in \(X\). I do so under a few mild assumptions:

  1. The classifier \(f\) is continuous.
  2. The set \(X \setminus A\) is not empty.
  3. The set \(A\) is not empty.

Continuity is a common assumption for \(f\), since we must generally be able to train it using gradient descent. This algorithm relies on the existence of derivatives of \(f\), and differentiability implies continuity. The second assumption means that not all data samples are themselves adversarial examples, which should also be uncontroversial; I leave open the possibility that some data samples are adversarial examples, just not all of them. The third assumption is empirically known to be true for most classifiers.
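To make the definition of \(A\) concrete: when \(X\) is a finite set of samples, membership can be checked by brute force. The sketch below reuses `f_hat` from the earlier snippet; the sample set is a hypothetical stand-in for \(X\):

```python
import numpy as np

def is_adversarial(x_tilde: np.ndarray, samples: np.ndarray, eps: float) -> bool:
    """Check membership in A: is some sample within distance eps of
    x_tilde assigned a different label than x_tilde?"""
    return any(
        np.linalg.norm(x - x_tilde) <= eps and f_hat(x) != f_hat(x_tilde)
        for x in samples
    )

# Hypothetical stand-in for the data set X.
samples = np.array([[1.0, 0.2, 0.3], [-0.5, 0.4, 0.0]])
print(is_adversarial(np.array([0.9, 0.25, 0.3]), samples, eps=0.2))
```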

Fix any \(x \in X \setminus A\) and assume without loss of generality that \(f(x) > 0\). Define a sequence of open balls \(B_1(x), B_2(x), \dots\), where each \(B_n(x)\) is centered at \(x\) and has radius \(1/2^n\). Since \(A\) is dense in \(X\), the point \(x\) is a limit point of \(A\), so each ball \(B_n(x)\) contains a point \(a_n \in A\). We thus obtain a sequence \(a_1, a_2, \dots\) which converges to \(x\). This yields an important observation:

Property 1. There exists some \(m \in \mathbb{N}\) so that \(f(a_n) > 0\) for all \(n > m\).

Proof. Continuity of \(f\) implies

\[\lim_{n \to \infty}~f(a_n) = f(x) > 0.\]

Hence, for every \(\delta > 0\) there exists an \(m \in \mathbb{N}\) so that \(f(x) - \delta < f(a_n)\) for every \(n > m\). The result then follows by choosing \(\delta = f(x)/2\), which gives \(f(a_n) > f(x)/2 > 0\) for all \(n > m\). \(\blacksquare\)
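The continuity step in this proof is easy to illustrate numerically. In the sketch below, an arbitrary fixed direction stands in for the (unknown) directions of an actual adversarial sequence; the point is only that the scores \(f(a_n)\) eventually exceed \(f(x)/2\):

```python
import numpy as np

# Same placeholder linear classifier as in the earlier sketches.
w, b = np.array([1.0, -2.0, 0.5]), 0.1
f = lambda x: float(w @ x + b)

x = np.array([1.0, 0.2, 0.3])  # a point with f(x) > 0
u = np.array([0.0, 1.0, 0.0])  # arbitrary unit direction
for n in range(1, 12):
    a_n = x + u / 2**n         # a point inside the ball B_n(x) of radius 1/2^n
    print(n, f(a_n), f(a_n) > f(x) / 2)
```

For this particular choice of direction, the condition \(f(a_n) > f(x)/2\) first holds at \(n = 3\) and remains true from then on, matching the “for all \(n > m\)” in Property 1.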

Property 1 establishes that, eventually, the adversarials in the sequence \((a_n)\) must receive the same class label as \(x\). Hence they cannot be adversarial examples of \(x\) itself. This shows that every data point \(x \in X \setminus A\) must have a non-empty “region of stability” that is devoid of adversarials for that specific sample \(x\). However, this region must then contain adversarials for samples other than \(x\). Let \(\rho(x)\) be the radius of the largest open ball centered at \(x\) that is contained within the stability region of \(x\). For sufficiently large \(n\), the point \(a_n\) will lie within the stability region of \(x\) and will be adversarial for some \(x_n \in X\). It then follows that

\[\hat{f}(a_n) = \hat{f}(x) \neq \hat{f}(x_n).\]

Since \(x_n\) is classified differently from \(x\), it cannot lie within the stability region of \(x\), which immediately gives \(d(x, x_n) \geq \rho(x)\). The triangle inequality also yields

\[d(x, x_n) \leq d(x, a_n) + d(a_n, x_n) \leq \frac{1}{2^n} + \varepsilon.\]

We therefore find

\[\rho(x) \leq d(x, x_n) \leq \frac{1}{2^n} + \varepsilon.\]

Taking the limit as \(n \to \infty\) yields

\[\rho(x) \leq \varepsilon.\]
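The stability radius \(\rho(x)\) is not directly observable, but we can estimate it crudely. The following Monte-Carlo sketch, which again reuses the placeholder classifier, samples random directions and bisects for the smallest radius at which the label flips. It is a heuristic, not a certified bound: it only detects flips that are visible at the probe radius along the sampled directions:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_rho(f_hat, x, r_max=10.0, n_dirs=1000, n_bisect=40):
    """Heuristic estimate of the stability radius rho(x) of the point x
    under the labeling function f_hat."""
    label = f_hat(x)
    rho = r_max
    for _ in range(n_dirs):
        u = rng.normal(size=x.shape)
        u /= np.linalg.norm(u)          # random unit direction
        if f_hat(x + rho * u) == label:
            continue                    # no flip detected at radius rho
        lo, hi = 0.0, rho               # a flip occurs somewhere in (lo, hi]
        for _ in range(n_bisect):
            mid = (lo + hi) / 2
            if f_hat(x + mid * u) == label:
                lo = mid
            else:
                hi = mid
        rho = min(rho, hi)              # shrink the running estimate
    return rho
```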

We also know that \(\rho(x) > 0\) for all \(x \in X \setminus A\), provided no data sample lies exactly on the decision boundary: since \(f\) is continuous and \(f(x) > 0\), the preimage \(f^{-1}((0, \infty))\) is an open set containing \(x\), and so it contains an open ball of positive radius around \(x\). This means we have a non-trivial lower bound for \(\varepsilon\):

\[\label{eq:bound}\tag{1} \varepsilon \geq \sup_{x \in X \setminus A}~\rho(x) > 0.\]

The interesting thing about inequality \(\eqref{eq:bound}\) is that it crucially hinges on the assumption that \(A\) is dense. If we can only assume continuity of \(f\), we get no further than the rather trivial observation that

\[\label{eq:triv}\tag{2} \varepsilon \geq \inf_{x \in X}~\rho(x).\]
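Continuing the hypothetical sketches from above, the contrast between the two bounds can be made explicit: \(\eqref{eq:bound}\) is driven by the best stability radius among the non-adversarial samples, while \(\eqref{eq:triv}\) is driven by the worst one over all samples. This reuses `estimate_rho`, `f_hat`, `is_adversarial`, and the stand-in sample set from the earlier sketches:

```python
eps = 0.2
rhos_all = [estimate_rho(f_hat, x) for x in samples]
rhos_nonadv = [r for x, r in zip(samples, rhos_all)
               if not is_adversarial(x, samples, eps)]

bound_dense = max(rhos_nonadv)    # eq. (1): sup of rho(x) over X \ A
bound_trivial = min(rhos_all)     # eq. (2): inf of rho(x) over X
```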

Without denseness of \(A\), there is no requirement that adversarial examples lie arbitrarily close to every \(x \in X\), and so it is perfectly possible that certain samples have no adversarial examples within a certain radius at all. This would happen for any \(x\) with \(\rho(x) > \varepsilon\). The argument I present here shows that this cannot happen if \(A\) is assumed to be dense in \(X\). Under this assumption, there must exist a sequence of adversarial examples converging to \(x\) for every \(x \in X \setminus A\). In particular, every \(x \in X \setminus A\) will have adversarial examples within its stability region. These must necessarily be adversarial for some other sample \(x' \neq x\), which must differ from \(x\) in classification and hence must lie at a distance of at least \(\rho(x)\) but, in the limit, at most \(\varepsilon\) from \(x\). It follows that \(\rho(x) \leq \varepsilon\). Denseness of \(A\) implies this inequality holds for all \(x \in X \setminus A\), which establishes the result.

The inequality \(\eqref{eq:bound}\) may strike some as paradoxical, because it means that we can give larger lower bounds on the robustness of a classifier as adversarial examples become more abundant in the input space. This is, of course, an illusion: while it is true that \(\eqref{eq:bound}\) will be larger than \(\eqref{eq:triv}\), this only holds if \(f\) is kept constant. But if \(f\) is kept constant, then the set of adversarial examples must also be the same. An assumption of denseness of adversarials therefore likely implies that \(\rho(x)\) must shrink relative to classifiers for which adversarials are not dense. That said, denseness of adversarials does have the nice corollary that the robustness of \(f\) does not depend on the worst stability radius (as in \(\eqref{eq:triv}\)), but on the best one, which may be cause for optimism if this conjecture turns out to be realistic.