Negative Selection Algorithms

Broadly defined, a negative selection algorithm is any classification algorithm that mimics or simulates the process of negative selection in the vertebrate immune system, an idea first developed by Stephanie Forrest and co-authors. Such algorithms belong to the group of one-class classifiers: They are trained on unlabelled data sampled from a certain subregion of the problem domain, and then used to determine whether or not new unseen datapoints belong to the same subregion. Negative selection algorithms are based on so-called detector, which can be understood as patterns that mach to small subsets of the problem domain. For example, in the problem domain U={0,1}⁵ of binary strings of length 5, a detector might be the pattern 1***1 matching all strings that start and end with a 1.

Pseudocode for a common implementation scheme of this approach is presented below.

Algorithm Negative-Selection.
Input: A S ⊂ U ("self-set"); a set M ⊂ U ("monitor set"); an integer n
Output: For each element m ∈ M, either "self" or "non-self".

// training phase
1 d ← empty set
2 while |D|< n do
3 d ← random detector
4 if d does not match any element of S then
5 insert d into D

// classification phase
6 for each m ∈ m do
7 if m matches any detector d ∈ D then
8 output "m is non-self" (an anomaly)
9 else
10 output "m is self"

Most commonly, negative selection algorithms are applied to anomaly detection or novelty detection problems. For instance, the problem domain U might be the set of all unicode strings, the samples might be legitimate e-mail messages, and the task might be to determine whether a new e-mail is legitimate or spam.

Efficiency Problems

Direct implementations of negative selection algorithm schemes like the one presented above frequently suffer from the following two efficiency problems, which are quite independent of the type of detector used:

Huge number of detectors required: The sensitivity of the algorithm, i.e. its true positive rate with respect to anomalies, increases with the number n of detectors used. The value for n that is required to achieve good sensitivity may be prohibitively large.
Difficulties to generate detectors: If the input sample S is quite large, it can be difficult to find detectors that do not match any element of S. In particular, the expected number of iterations of the while-loop in lines 2-5 required to find even a single detector can be of exponential order in the size of the input.

Addressing Efficiency Issues by Detector Compression

Various solutions of the above problems have been suggested for concrete detector types, see e.g. Zhou Ji's work on V-detectors for euclidean problem domains. In a series of papers, my co-authors and I explored the use of efficient data structures to represent the detector set D. This approach can sometimes solve both problems mentioned above at once: In such cases, a compressed structure representing all detectors that do not match any s ∈ S (and therefore, a detector set with maximal sensitivity) can be generated in polynomial time, and be used for classification instead of the individual detectors.

A modified pseudo-code for algorithms following the detector compression approach would be the following:

Algorithm Efficient-Negative-Selection.
Input: A S ⊂ U ("self-set"); a set M ⊂ U ("monitor set"); an integer n
Output: For each element m ∈ M, either "self" or "non-self".

// training phase
1 D ← compress( set of all detecors that do not match any element of S )

// classification phase
2 for each m ∈ M do
3 output number of detectors d ∈ D that match m

Note that, in addition to the detector compression, our approach also generalizes the output of the original negative selection algorithm: By counting the number of matching detectors, each m ∈ M is assigned an "anomaly score". In particular, for an m that also occurred in S, this score will be 0. A threshold for the output score can be used to discriminate between normal and anomalous elements, and by varying this threshold, the user can control the trade-off between sensitivity and specificity.

Implementations of Efficient Negative Selection

For research purposes, I have implemented the algorithms outlined in our papers for efficient negative selection with so-called r-chunk and r-contiguous detectors. Because these algorithms are somewhat intricate, I considered it useful to post my implementations here for reference purposes and as a basis for further empirical research.

Executable JAR file with sources
Sample data used in the examples below: english.train, english.test, and english.labels.
R script to compute the AUC value (see below): auc.r

Negative Selection with r-Contiguous detectors

The following command trains a negative selection algorithm with 4-contiguous detectors on the sample strings of length 10 from the file english.train, and then outputs the number of matching detectors for each line from the file english.test:

java -jar negsel.jar -c -n 10 -r 4 -self english.train < english.test

As you will see the resulting numbers are quite huge, so you might wish to output their logarithms instead. You can do so by adding the "-l" switch:

java -jar negsel.jar -c -l -n 10 -r 4 -self english.train < english.test

Unless otherwise stated, the alphabet to be used is determined by examining the training file and using all letters that occur in it. However, you may also explicitly provide a file that contains all letters of the alphabet (which may be useful if your training file doesn't actually cover all letters). This is done via the following syntax:

java -jar negsel.jar -c -l -n 10 -r 4 -alphabet file://english.train -self english.train < english.test

Moreover, the "-alphabet" switch also supports various built-in alphabets. Run "java -jar negsel.jar" without any further options to see them (this will also expose your other options).

It has to be mentioned that, while being a polynomial-time algorithm, the algorithm used by the above commands is still quite slow and may be actually too slow when used with big files and big alphabets. Often, similarly good classification results may be obtained by not actually counting the detectors, but using a different kind of anomaly score, which is easier to compute. This is done by omitting the "-c" switch (and also the "-l" switch).

java -jar negsel.jar -n 10 -r 4 -alphabet file://english.train -self english.train < english.test

The above command now outputs for each string a number ranging from 0 to 10 (well actually from 3 to 10). This number is the length of the longest contiguous match between the test string and any r-contiguous detector that does not match any string from the training file. Hence, the number is always at least equal to r-1.

Negative Selection with r-Chunk detectors

The "-k" switch may be used for all the above examples in order to use r-chunk instead of r-contiguous detectors. For example, the number of r-chunk detectors matching each test string would be computed as follows:

java -jar negsel.jar -k -c -n 10 -r 4 -self english.train < english.test

This number always ranges from 0 to n-r+1. Hence there is in general no need to logarithmize it with the "-l" switch. Because the computation is also much simpler and hence faster than for r-contiguous detectors, omitting the "-c" switch does not change anything in this case.

Evaluation of the classification performance

This is somewhat beyond the scope of this document. However, if you can use R, you might find my R script useful which computes the AUC value of the classification result as follows (this is a single line):

java -jar negsel.jar -l -c -n 10 -r 3 -self english.train < english.test | paste - english.labels | ./auc.r

Questions?

If you are using this code or have any questions on the algorithms or methodology in general, please drop me a mail at: johannes.textor }at{ gmx.de.