1 Introduction

Connected components analysis (CCA) is a common step in many image processing applications, extracting features such as area or size of arbitrarily shaped objects in a binary image. It is based on connected components labelling (CCL), which creates a labelled image of the same dimensions as the original image where all pixels of each connected component are assigned a unique label. Most recent CCL algorithms carry out three phases: scan, analysis and relabelling [8, 13, 19, 34, 35]. In the scan phase, a provisional label is assigned to each object pixel. If more than one label is assigned to a single connected component, this relationship is detected and memorised. In the analysis phase, one label is chosen to represent each connected component in the labelled image. Most state-of-the-art connected components labelling algorithms perform this analysis using some form of union-find data structure and algorithm [9, 12, 35], although they may not always explicitly mention it by this name [7]. The relabelling phase requires a second pass through the image and replaces each provisional label by its representative label. As a result, all pixels of a connected component are assigned the same label.

In connected components analysis, a feature vector is derived from each connected component. Since the set of feature vectors is of primary interest, a labelled image is essentially only an auxiliary data structure. If features are extracted during the scan phase then relabelling is redundant, and the three processing phases can, therefore, be reduced to only two: scan and analyse. Analysing the image while it is scanned resolves data associations on-the-fly [2], and this is the principle behind the recently developed class of single-pass CCA algorithms [1, 32]. Such CCA algorithms allow stream processing of the input image and reduce memory requirements [18], since only the labels of the current and the previous row are required for further processing. Previous single-pass CCL algorithms are based on contour tracing [5].

Some CCA and CCL algorithms are adapted and optimised to the instruction sets or memory architectures of the hardware device they are used on [3, 11, 19]. Many single-pass algorithms are motivated by the idea of creating a CCA algorithm from which an efficient customised high-performance architecture, built from basic processing and storage elements, can be derived [18]. This is realised by:

  • Single-pass processing

For CCA, a labelled image does not need to be stored; therefore, there is no need to maintain, optimise or accelerate the labelled image data structure or its memory accesses when the image is processed in only a single pass.

  • Linear processing time

A necessary condition for real-time processing is that the algorithm's complexity is linear in the number of pixels in the image, because the binary input image is either read from memory or received as a pixel stream.

  • One lookup per pixel to determine the representative label

In many CCL and CCA algorithms, the union-find data structures which represent equivalence relations are mapped to arrays [8, 12, 13, 28, 35]. Their union-find algorithms require several lookups per pixel to identify which connected component a pixel is associated with. Most single-pass CCA algorithms reduce this to one lookup per pixel by implicitly using a novel, context-based optimisation of the classical union-find algorithm. The single lookup property is especially important for a dedicated hardware architecture because it enables the system to process the pixel stream at the pixel clock rate.

The contributions of this paper are:

  • State-of-the-art CCL and CCA algorithms are analysed in terms of the union-find algorithm (Sect. 2). In particular, single-pass algorithms are placed within this context, and the corresponding optimised union-find algorithm is identified and analysed.

  • Section 3 presents a full algorithmic description of the state-of-the-art Single Lookup CCA (SLCCA) hardware architecture from [18].

  • A proof of the correctness of the SLCCA algorithm is provided (Sect. 4). This proves that the single lookup of the optimised union-find algorithm is sufficient for CCA. This is the first formal proof of single-pass CCA algorithms; prior outlines of proof [1] are both informal and incomplete.

  • From this, a novel optimised Double Lookup CCA algorithm (DLCCA) is derived in Sect. 5, with fewer total lookups required.

  • Pixel-based and run-based algorithms are unified by proving that it is only necessary to find the equivalent label of the first pixel in a run when propagating labels from one row to the next, enabling run-length encoding to be used for storing the label image.

  • The trade-offs between different CCA algorithms are analysed in terms of memory operations and the required resources in Sect. 6.

2 Union-Find in CCL and CCA Algorithms

First, what is meant by a connected component is formally defined. The binary input image I identifies object and background pixels on a discrete grid in Cartesian space of width W and height H. Let imagePos be the set of all positions in I,

$$\begin{aligned} imagePos = \{(i,j):0\le i<W, 0\le j<H ,\ i,j \in \mathbb {N} \}. \end{aligned}$$
(1)

Pixels outside the image are assumed to be background.

$$\begin{aligned} I[p] =\left\{ \begin{array}{ll} 0, &{}\quad \forall p \notin imagePos, \\ 1, &{}\quad \text {if } p = (i,j) \text { is an object pixel}, \\ 0, &{}\quad \text {if } p = (i,j) \text { is not an object pixel}. \end{array}\right. \end{aligned}$$
(2)

Two pixels \(p_1\) and \(p_2\) are adjacent if

$$\begin{aligned} \left\| p_1 - p_2\right\| =1. \end{aligned}$$
(3)

Adjacent object pixels are connected. Here, 8-connectivity is assumed (i.e., using \(\left\| \cdot \right\| _{L\infty }\)), although the same techniques can be applied for 4-connectivity (using \(\left\| \cdot \right\| _{L1}\)).
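For illustration, the adjacency test of (3) can be written under either connectivity as the short sketch below; it assumes pixel positions are integer \((x,y)\) tuples, and the function name is purely illustrative.

```python
def adjacent(p1, p2, connectivity=8):
    """Adjacency test of Eq. (3): 8-connectivity uses the L-infinity norm,
    4-connectivity the L1 norm."""
    dx, dy = abs(p1[0] - p2[0]), abs(p1[1] - p2[1])
    if connectivity == 8:
        return max(dx, dy) == 1   # L-infinity distance of 1
    return dx + dy == 1           # L1 distance of 1

# The diagonal neighbour is 8-adjacent but not 4-adjacent.
assert adjacent((0, 0), (1, 1), 8) and not adjacent((0, 0), (1, 1), 4)
```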

Definition 1

Connectedness Two object pixels in I, \(p_{1}\) and \(p_{2}\), belong to the same connected component if there is a path of connected object pixels in I between \(p_1\) and \(p_2\).

This is denoted as \(p_1 \longleftrightarrow p_2\), which can be defined recursively as:

$$\begin{aligned} p_1 \longleftrightarrow p_2 \;\Leftrightarrow \; \left\{ \begin{array}{l} \left\| p_1 - p_2\right\| = 1 \ \wedge \ I[p_1] = I[p_2] = 1,\\ \exists \, p_i : I[p_i] = 1 \ \wedge \ p_1 \longleftrightarrow p_i \ \wedge \ p_i \longleftrightarrow p_2. \end{array}\right. \end{aligned}$$
(4)

The base case holds true if \(p_1\) and \(p_2\), are adjacent object pixels of I. The recursive case holds true if there is an object pixel \(p_i\) with a connected path to both \(p_1\) and \(p_2\).

Definition 2

Connected component A maximal set of mutually connected object pixels in I is called a connected component. Each connected component represents a separate image object in I.

2.1 Union-Find

Problems which require the manipulation of disjoint sets by carrying out intermixed find and union operations are called union-find problems [31]. Within the context of CCL, union-find is used for managing the set of labels associated with a single connected component, and for selecting the representative label for a component.

2.1.1 Graph Notation

The most common union-find data structure to represent disjoint sets (distinct components) is a directed forest. Each provisional label assigned to a connected component is represented by a vertex. A directed forest is an acyclic graph where directed edges, referred to as arcs, link pairs of vertices, indicating the relationship between the associated labels. The following graph notation represents the directed forest structure F as a set of vertices \(V(F)\) and edges \(E(F)\):

$$\begin{aligned} \begin{aligned}&F = (V,E)\\&V(F) = \{v_{0},\dots ,v_{n-1}\}\\&E(F)= \{(v_{i_{0}}\rightarrow v_{j_{0}}),\ldots ,(v_{i_{m-1}}\rightarrow v_{j_{m-1}})\}. \end{aligned} \end{aligned}$$
(5)

For each edge, \(v_i \rightarrow v_j\), \(v_i\) is the child vertex, and \(v_j\) is the parent. Each vertex has exactly one parent (except for a root vertex which has no parent), with the edge represented by a pointer to its parent. A vertex may have many children; vertices with no children are leaf vertices. A path from vertex \(v_1\) to vertex \(v_2\) is denoted \(v_1 \mapsto v_{2}\), which consists of a sequence of vertices \(v_1\rightarrow v_i \rightarrow \cdots \rightarrow v_2\), where each pair of consecutive vertices is an arc in \(E(F)\).

Definition 3

Rooted Tree A tree (or more formally, a directed rooted tree) T is a subgraph of F comprising a root vertex \(v_{r}\) and all of its descendants.

Each vertex belongs to exactly one tree, and there is a path following the edges of T from every vertex in the tree to \(v_{r}\) [24]. Therefore:

$$\begin{aligned} V(T_{v_r})=\{v_{i}:v_{i}\in V(F)\wedge v_{i}\mapsto v_{r}\}. \end{aligned}$$
(6)

A tree is associated with one connected component in the image. The root vertex of each tree serves as the representative element for the set. Each tree is referred to by its root \(v_{r}\) (and its associated representative label for the connected component).

Definition 4

Level of a vertex in a tree The level of a vertex v, level(v), is the number of arcs between v and the root, \(v_r\).

The level of the root vertex is therefore 0, and for all other vertices the level is one higher than the level of its parent:

$$\begin{aligned} level(v)=\left\{ \begin{array}{ll} 0, &{}\quad v = v_r {\text { (it is root)}},\\ level(parent(v))+1, &{}\quad {\text {otherwise}}. \end{array}\right. \end{aligned}$$
(7)

Definition 5

Height of a tree The height of a tree T, height(T), is the maximum level of a vertex in \(V(T)\).

$$\begin{aligned} height(T)=\max \{level(v_{i}):v_{i}\in V(T)\}. \end{aligned}$$
(8)
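As a concrete reading of (7) and (8), the sketch below computes levels and heights for a forest stored as a parent map in which each root points to itself (the convention adopted for the merger table in Sect. 3); the names are illustrative.

```python
def level(parent, v):
    """Eq. (7): number of arcs between v and its root; a root has level 0."""
    n = 0
    while parent[v] != v:        # walk towards the root
        v = parent[v]
        n += 1
    return n

def height(parent, vertices):
    """Eq. (8): maximum level over the vertices of one tree."""
    return max(level(parent, v) for v in vertices)

# Example: root 1 with child 2 and grandchild 3 gives a tree of height 2.
parent = {1: 1, 2: 1, 3: 2}
assert level(parent, 3) == 2 and height(parent, [1, 2, 3]) == 2
```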

Connected components labelling sequentially assigns a label L[p] to each pixel p, with the goal of eventually assigning the same label to all pixels belonging to a single connected component. Since there is a one-to-one relationship between labels and vertices of the forest F, in the discussion here the term label is synonymous with a vertex of \(V(F)\). Assigning \(L[p]:=L_{v_p}\) is therefore equivalent to associating p with vertex \(L_{v_p}\). An example image is shown in Fig. 1a, and the corresponding forest derived from this image is shown in Fig. 1b.

Fig. 1 a A label is assigned to each pixel in raster scan order, also showing the neighbourhood of a pixel at position \(p_x=(x,y)\); b the union-find data structure F of the image from a

2.1.2 Union-Find Algorithms

Union-find algorithms have three key operations. MakeSet(e) creates a set \(S_{e}\) consisting of a single element e; Union(e,f) replaces the sets \(S_{e}\) and \(S_{f}\) by \(S_{e}\cup S_{f}\) [15]; and Find(e) returns the representative element of the set containing e [15].

With a forest structure in the context of CCL, the operations have the following meanings: MakeSet creates a new tree within F and corresponds to assigning a new label to a new connected component; Union joins two trees into a single tree, corresponding to merging two previously disjoint connected components; and Find returns the root vertex of the tree which contains a specified vertex, corresponding to finding the representative label of a connected component.

Algorithms 1–3 present common variations of union-find which are discussed in the following. These algorithms operate on a directed forest data structure, F, which contains \(n_{F}\) vertices. Adding a vertex to F, changing the parent of a vertex or looking up the parent of a vertex in F are each referred to in the following as one uf-instruction (union-find instruction).


QuickFind-based union-find (Algorithm 1) [15] maintains F so that every leaf is directly connected to the root. A Find, therefore, consists of one uf-instruction. A Union checks which vertices of F belong to the changed rooted tree and changes each of their parents to the new root. A Union can, therefore, require up to \(2n_{F}\) uf-instructions in the worst case.

QuickUnion-based union-find (Algorithm 2) [15] combines two trees by making the root of one tree the parent of the root of the other tree. This requires two Finds for one Union; a Find requires up to \(n_{F}\) uf-instructions in the worst case [27]. Both QuickFind and QuickUnion have quadratic run time in the worst case [27].

QuickUnion with path compression (Algorithm 3) [15] joins all vertices which are visited during a Find directly to the root vertex. Whenever these vertices are accessed again, they will point directly to the root (at the time that the path was compressed). The worst-case run time of QuickUnion with path compression grows with the inverse Ackermann function [30] (which is quasi-linear for practical cases) when the tree size of the union-find data structure is balanced with a heuristic such as union-by-rank [30], which is not discussed in this paper.
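To make the trade-offs concrete, the following minimal sketch implements the three variants described above with a Python list as the parent array; labels are small integers, roots point to themselves, and the layout is illustrative rather than that of any cited implementation.

```python
class QuickFind:
    """Every vertex points directly at its root, so Find is one lookup."""
    def __init__(self, n):
        self.parent = list(range(n))          # MakeSet for n elements
    def find(self, e):
        return self.parent[e]
    def union(self, e, f):
        re_, rf = self.find(e), self.find(f)
        if re_ != rf:                         # re-point the whole tree of rf
            for v in range(len(self.parent)):
                if self.parent[v] == rf:
                    self.parent[v] = re_

class QuickUnion:
    """Union links one root under the other; Find walks up to the root."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, e):
        while self.parent[e] != e:
            e = self.parent[e]
        return e
    def union(self, e, f):
        self.parent[self.find(f)] = self.find(e)

class QuickUnionPC(QuickUnion):
    """QuickUnion with path compression: vertices visited by Find are
    re-attached directly to the root."""
    def find(self, e):
        root = e
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[e] != root:         # compress the visited path
            self.parent[e], e = root, self.parent[e]
        return root
```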

For connected components labelling or analysis, the sequence of Union and Find operations depends on the input image. This can be used to derive a more efficient union-find algorithm for the special case of CCA and CCL of two-dimensional images.

2.2 Improved Union-Find

Single-pass CCA requires the label for each pixel to be resolved on-the-fly so that the contribution of the pixel to the feature vector can be allocated to the correct component. For streamed images, the order of operations is determined by the order in which the pixels are scanned, along with the local connectivity.

The pixels of the input image I are streamed or scanned row-wise from the top-left position (0, 0) to the bottom-right position \((W-1,H-1)\). A position \(p_{1}\) preceding another position \(p_{2}\) in the raster scan order is denoted as \(p_{1}\prec p_{2}\).

As the data structures are updated dynamically as the pixels are processed, it is necessary to define the structures that represent the state after processing each pixel. Let \(p_x\) be the current pixel. The set visited contains all pixels which have already been visited after processing the current pixel:

$$\begin{aligned} visited= \{ p_x \} \cup \{p:p \prec p_x \}. \end{aligned}$$
(9)

Two pixels \(p_1,p_2\) are connected in image I as scanned so far, if they are connected by a path of adjacent object pixels in visited. Equation (4) can be extended to define \(p_1 \!\underset{visited}{\longleftrightarrow }p_2\) as

$$\begin{aligned} p_1 \!\underset{visited}{\longleftrightarrow }p_2 \;\Leftrightarrow \; \left\{ \begin{array}{l} p_1,p_2 \in visited \ \wedge \ \left\| p_1 - p_2\right\| = 1 \ \wedge \ I[p_1] = I[p_2] = 1,\\ \exists \, p_i \in visited : I[p_i] = 1 \ \wedge \ p_1 \!\underset{visited}{\longleftrightarrow }p_i \ \wedge \ p_i \!\underset{visited}{\longleftrightarrow }p_2. \end{array}\right. \end{aligned}$$
(10)

As F is updated as each pixel is processed, let \(F_{p_x}^-\) be the state of F before processing pixel \(p_x\), and \(F_{p_x}\) be the resultant state after processing.

Definition 6

Component segment All pixels belonging to the same connected component after processing pixel \(p_x\) are a component segment.

Component segments therefore correspond to sets of pixels with labels associated with individual trees in \(F_{p_x}\) and are subsets of the final connected components of I.


Algorithm 4, a context-based union-find algorithm, exploits the order of Union and Find operations combined with an age-balancing heuristic [8] to achieve linear run time and requires fewer uf-instructions in the worst case than QuickFind (Algorithm 1), QuickUnion (Algorithm 2) or QuickUnion with path compression (Algorithm 3). Age-balancing ensures that the label assigned to the earliest pixel of a connected component encountered during a scan is always the root vertex.

Context-based union-find combines the best features of QuickFind and QuickUnion. Like QuickFind, the Find requires only one uf-instruction. The Union of two vertices makes one root vertex the parent of the other, similar to QuickUnion with the addition of age-balancing. In addition to MakeSet, Union and Find operations, a fourth operation, Flatten, is introduced which performs the equivalent of path compression by making the root vertex the parent of all vertices in a tree.

In QuickUnion with path compression (Algorithm 3), path compression is performed within the Find, processing from the leaves towards the root by following the arcs of \(E(F)\) as they are searched by Find [29]. In contrast, Flatten starts at the root vertex and processes towards the leaves. To accelerate this, arcs joining vertices with \(level(v) >1\) that will be encountered in subsequent processing are recorded in a stack during the Union operations.

Normally, Find is used to determine the root of a label vertex [30]. Since context-based union-find replaces Find by a single lookup, it returns only the parent of a vertex. For convenience, every root vertex points to itself, i.e., \(parent[v_r] = v_r\). A single lookup is equivalent to a Find for trees of \(height(T) \le 1\); the lookup will only return the root vertex for vertices of level zero or one.
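A minimal software sketch of these operations is given below; it assumes labels are issued by an incrementing counter, so that numeric order stands in for the \(\prec \) relation, and for brevity it records every merger arc for Flatten, whereas the architecture in Sect. 3 records only the non-propagating ones. Correctness of the single-lookup Flatten relies on arcs being produced in raster-scan order, as proven in Sect. 4.

```python
class ContextUnionFind:
    """Context-based union-find: one-lookup Find, age-balanced Union, and
    a Flatten driven by a stack of merger arcs recorded during the row."""
    def __init__(self):
        self.parent = {}            # merger table: label -> parent label
        self.arcs = []              # (l_max, l_min) arcs recorded for Flatten
        self.next_label = 1

    def make_set(self):
        label = self.next_label
        self.next_label += 1
        self.parent[label] = label  # a root points to itself
        return label

    def find(self, label):
        return self.parent[label]   # a single lookup only

    def union(self, a, b):
        """Age-balancing: the older (numerically smaller) label stays the root."""
        l_min, l_max = min(a, b), max(a, b)
        self.parent[l_max] = l_min
        self.arcs.append((l_max, l_min))   # remember the arc for Flatten
        return l_min

    def flatten(self):
        """Reduce every tree to height <= 1 by revisiting the recorded arcs
        in reverse order (from the root towards the leaves)."""
        while self.arcs:
            l_max, l_min = self.arcs.pop()
            self.parent[l_max] = self.find(l_min)
```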

Table 1 A comparison of properties of representative CCL and CCA algorithms

Definition 7

Stale label A label \(L_{s}\) is called a stale label if a single lookup does not yield the root label.

A necessary condition for the Find not to return a stale label is that the CCA algorithm using Algorithm 4 must ensure that Flatten is always called before a Find is applied to a vertex with level larger than one. As outlined in [1], and proven in Sect. 4, this can be achieved by calling Flatten after processing each image row. One situation where \(height(T) = 2\) during processing is identified. In SLCCA this exception is managed by deferring the second lookup, whereas in DLCCA the second lookup is performed explicitly.

2.3 State-of-the-Art CCL and CCA Algorithms

Since the introduction of the classic connected components labelling algorithm by Rosenfeld et al. [26], CCL has been improved in many aspects. A summary of several properties of (mainly) modern CCL and CCA algorithms is given in Table 1 which compares:

  • Number of passes

  • Scan mode and scan order

  • Worst-case run time, and how this was evaluated

  • Categorisation of set merging algorithm used

Rosenfeld’s classical CCL algorithm [26] is a two-pass algorithm where the first pass uses a binary image as an input and creates a provisionally labelled image. If more than one label is assigned to a connected component, these labels are stored in an equivalence table. The equivalence relations detected during the first scan are resolved at the end of the first pass by iteratively sorting and replacing the entries of the equivalence table until the table contains one entry for each connected component. After this process, each entry of the equivalence table contains all provisional labels assigned to its connected component in the first pass, sorted in ascending order, starting with the smallest label, which serves as the representative element. During the second pass, all the object pixels of the provisionally labelled image are replaced by their representative values from the equivalence table. This assigns the same label to each pixel of a connected component.

Dillencourt et al. [8] proposed a general two-pass CCL algorithm (GCCL) for different image representations such as 2-D arrays and quad-trees. This algorithm uses QuickUnion with path compression extended by an age-balancing heuristic embedded into the Union operation. Using this heuristic, it is formally proven that the algorithm scales linearly with the number of pixels in I.

Di Stefano et al. [7] describe a simple and efficient connected components labelling (SEL) algorithm. It requires two passes to label all pixels, using an equivalence table as the union-find data structure and carrying out the QuickFind algorithm. The algorithm is improved for the worst-case image; however, the image pattern that becomes the new worst case with the proposed improvement still requires a quadratic number of uf-instructions.

Suzuki et al. [28] proposed a multi-pass CCL algorithm using a connection table to store the relations between provisional labels. This algorithm is, therefore, referred to as the scan plus connection table (SCT) CCL algorithm. Previous multi-pass algorithms propagated labels by neighbourhood operations. The algorithm in [28] creates a forest structure stored in the connection table during the first scan, with one tree structure for each connected component consisting of provisional labels as vertices. Every scan over the image decreases the height of the tree structure in the connection table by one. The algorithm merges disjoint sets; however, it cannot be categorised as a union-find algorithm such as those of Sect. 2.1. The run time is stated to be linear in the number of pixels, as determined by experimental evaluation. It should be noted, however, that it would be difficult to experimentally distinguish between linear processing and run times proportional to the inverse Ackermann function [30] with small images. Most of the images used for evaluation require four or fewer passes for final labelling [28].

In the two-pass CCL algorithm presented by Wu et al. [35], the union-find data structure is represented by an array; it is therefore referred to as scan plus array-based union-find (SAUF). QuickUnion with path compression is used to maintain this array-based union-find data structure. To accelerate the label selection process for each pixel, a decision tree is proposed, reducing the number of neighbourhood labels that must be accessed. A formal proof of the linear run time of the algorithm is given.

The CCL algorithm by He et al. [12] is a two-pass algorithm which run-length encodes the binary image during the first pass and processes these runs in the second pass. The algorithm uses a union-find data structure stored in an array which is updated by an optimised variant of QuickFind. To avoid updating all entries of the array for a Union operation, an additional linked list is maintained for each tree structure in the array containing all the vertices of the tree structure. A Union on two vertices links the two lists and updates the equivalence table entries of these vertices to the root vertex. This set merging algorithm is referred to as the Equivalent Label Sets strategy (ELS) [11, 14]. In [13] they optimise their algorithm to only process runs of object pixels in the second pass, and in [11] extend the algorithm to also compute the Euler number. Since He and Chao [11] focus on feature extraction, only the part involved in CCL is considered, and it is referred to as HCS.

For light speed labelling (LSL), Lacassagne et al. [19] identified memory accesses and conditional statements as the key issues slowing down CCL on state-of-the-art processors with a RISC architecture. Their algorithm consequently optimises these by distributing the labelling process over three passes, replacing the conditional operations. For set merging, a variation of QuickUnion is applied. In [3], LSL was identified as requiring the fewest processing cycles per pixel when carried out on a general-purpose processor.

Chang et al. [5] follow a completely different approach. Instead of scanning the image in raster scan order, connected components are identified by contour tracing, which requires random access to the image data. During the raster scan, when an unlabelled component is encountered, its border is traced using contour tracing. During this process, control information is stored in the labelled image so that pixels surrounded by already labelled pixels can be labelled when scanning resumes. Contour tracing (CT) avoids the need for set merging. The authors claim this to be a single-pass algorithm; however, random access to the input image during contour tracing effectively means that more than one pass is required. In Table 1 it is, therefore, denoted as a 1.5 pass algorithm. The required random access makes this algorithm less practical for recent processors and dedicated hardware architectures.

Grana et al. [10] made the observation that all of the pixels within a \(2 \times 2\) block will have the same label. They extended the idea of pixel-based labelling to processing a \(2 \times 2\) block of pixels at a time. Block-based processing operates in a raster scan of \(2 \times 2\) blocks, hence it is identified as a modified raster scan in Table 1. Like Wu et al. [35], a decision tree approach is used to minimise the number of neighbourhood accesses during label assignment. The decision tree is considerably more complex than that for processing single pixels, so Grana et al. developed an algorithm to derive the optimal decision tree. The set merging algorithm is QuickUnion with path compression, with the trees updated online (whenever a merger occurs).

All of the two-pass CCL algorithms use a set merge algorithm which requires either a minimum of two instructions for a Find, or have a Union operation which scales quadratically with the number of labels.

Single-pass CCA algorithms require that the component feature vector be accumulated while determining the connectivity in the first pass. All of the two-pass CCL algorithms use the second pass for relabelling, so could potentially be converted into single-pass CCA algorithms by accumulating the feature data during the first pass. However, single-pass CCA algorithms have generally been designed in terms of hardware architectures, optimised for directly processing a video stream. With stream processing, the processing, including feature vector accumulation, is performed in a pipelined manner in hardware.

The original single-pass (OSP) CCA algorithm by Bailey and Johnston [1] introduced the principle behind the context-based union-find algorithm, although it was not identified in terms of union-find. The union-find graph was represented by storing the links in a merger table. The algorithm was based on the one lookup per pixel paradigm, with the use of a stack to optimise the Flatten operation. This built on earlier work [2] which introduced the parallel data table and merging the data on-the-fly as regions merged.

Trein et al. [32] accelerated the processing by using run-length encoding. Hence it is labelled RLSP for run-length single-pass CCA. The run-length encoding takes multiple input pixels in parallel, with the runs subsequently processed as one segment (or one overlap between segments) per clock cycle. To manage mergers, they used a pointer from the old label to the new label so that the data from extending an old label could be assigned to the correct component, and the current label assigned to the run. The simple use of pointers in this way corresponds to a QuickUnion, which requires a quadratic number of uf-instructions in the worst case. Data accumulated for each component is output as soon as it is detected that a run is not extended, enabling the memory for data accumulation to be reused.

Bailey’s OSP algorithm was optimised by Ma et al. [22] to significantly reduce the size of the data and merger tables through aggressive relabelling (AR). Each row is relabelled beginning with label 1 on the left, requiring translation of labels from one row to the next. The original context-based union-find is used, although a second lookup is required for the translation associated with relabelling. The two lookups are pipelined in the hardware implementation. One interesting feature of relabelling is that many mergers are managed by the translation table rather than the merger table, reducing the time required for the Flatten operation at the end of each row.

Klaiber et al. [18] took a different approach to reduce the memory requirements while retaining the single lookup paradigm (SLCCA) through label recycling. Augmented labels are introduced to maintain the age-balancing heuristic to ensure correct operation of the context-based union-find algorithm. This algorithm is described more fully in Sect. 3, and proven to have linear run time in Sect. 4.6. Insights gained from the proof of correctness have led to the optimised DLCCA, presented later in this paper.

Jeong et al. [16] removed the need for union-find completely by directly replacing all instances of the old label by the new label whenever a merger occurs. This removes the need for an equivalence table or merger table. However, it requires implementing the label memory (or the buffer caching the temporary labels) as content-addressable memory (CAM). In hardware, the parallel update of the content-addressable memory cannot be implemented using the memory blocks on an FPGA; instead Jeong used a multiplexed shift register. Although this method recycles labels immediately after mergers, it does not detect completed objects until the end of the frame, requiring the size of the data table to be proportional to the image area in the worst case.

3 Algorithmic Description of SLCCA


Of all the single-pass algorithms, the SLCCA algorithm [18] was chosen for formal proof because it satisfies all of the requirements outlined in the introduction, and it currently represents the state-of-the-art of single-pass CCA algorithms in terms of efficiency of resources and processing speed. The algorithm underlying the hardware architecture of SLCCA is presented in Algorithm 5. Its constituents are explained in Algorithms 5.1–5.5 presented at the points in the paper where the corresponding algorithmic background is explained in detail.

The double for loop in lines 1 and 2 of Algorithm 5 performs the raster scan through the image. When processing streamed data, these loops are implicit in the order that pixels arrive. The three operations for each pixel can be implemented in one clock cycle each in hardware and can be pipelined, enabling one pixel to be processed per clock cycle. At the end of each row, a Flatten is invoked to ensure that any tree whose vertices may be accessed during the following row has \(height(T) \le 1\). In parallel with the pixel processing, when it is detected that a connected component is complete, its associated feature vector is output.

The union-find label graph, F, is realised as a 1-D array, the merger table, MT, indexed by the label, \(L_v\), corresponding to each vertex, v. The arcs, \(E(F)\), are represented by storing the label of parent(v) in \(MT[L_v]\) (each vertex has only one parent). So that lookup of a root vertex, \(v_r\), returns a valid label, every root vertex points to itself, i.e., \(MT[L_{v_r}]=L_{v_r}\).

The provisional label assigned to pixel \(p_x = (x,y)\) is saved in a label image, \(L[p_x]\). In CCA, the labelled image is not required as output; however, one row must be maintained for propagation of labels. L is therefore stored as a 1-D array indexed by column, i.e., L[x].

The abbreviations and names of data structures used in the following are summarised in Table 2.

Table 2 Nomenclature used in the following sections

3.1 Update Neighbourhood

Definition 8

Neighbourhood The neighbourhood \(\eta \) is the set of four positions that have already been processed, adjacent to the current pixel at position \(p_x\) (see Fig. 1a), i.e.,

$$\begin{aligned} \begin{aligned} \eta&= \{(-\,1,-\,1),(0,-\,1), (1,-\,1), (-\,1,0)\} + p_x \\&= \{A,B,C,D \}. \end{aligned} \end{aligned}$$
(11)

The provisional labels assigned to positions in \(\eta \) are therefore L[A], L[B], L[C] and L[D]. The current resolved labels (after a Find) associated with \(\eta \) are contained in variables \(L_A\), \(L_B\), \(L_C\) and \(L_D\). These are realised as registers in the SLCCA hardware architecture. The label assigned to the current pixel is denoted \(L_{p_x}\).

Since adjacent object pixels at the positions \(\eta \) will already have the same label as a result of prior processing, a merger of component segments (requiring a union of the corresponding trees) can only occur between non-adjacent pixels, i.e., between \(L_{A}\) and \(L_{C}\), or \(L_{D}\) and \(L_{C}\) [35], as shown in Fig. 2. As an optimisation, \(L_{AorD}\) is introduced to refer to the label of \(L_{A}\) or \(L_{D}\), i.e., all mergers consist of the two labels \(L_{AorD}\) and \(L_{C}\).

Fig. 2 Merger patterns possible in the labels of neighbourhood \(L_{\eta }\)

To move from one window position to the next of the same row, the label values are shifted as given in Algorithm 5.1. The superscript \(^-\) denotes the corresponding neighbourhood at the previous position. This shifting requires only the label coming into position C to be looked up with a Find operation (on line 20).
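A minimal sketch of this shift is given below, with a dictionary nb standing in for the four label registers, L_row_above for the stored label row, and MT for the merger table; the function signature and names are illustrative assumptions.

```python
def shift_neighbourhood(nb, L_row_above, MT, new_x, label_assigned):
    """Shift the resolved neighbourhood labels when moving from column
    new_x - 1 to new_x within the same row.  Only the label entering
    position C needs a (single-lookup) Find; A, B and D are reused."""
    nb['A'] = nb['B']                  # old B becomes the new A
    nb['D'] = label_assigned           # label just assigned to the previous pixel
    nb['B'] = nb['C']                  # old C becomes the new B
    col_C = new_x + 1                  # C sits one column ahead on the row above
    incoming = L_row_above[col_C] if col_C < len(L_row_above) else 0
    nb['C'] = MT[incoming] if incoming else 0   # the single lookup per pixel
    return nb
```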


3.2 Update Data Structures

3.2.1 Label Selection

The set \(L_{\eta }\) denotes all object pixel labels in the neighbourhood of the current pixel.

$$\begin{aligned} L_{\eta } :=\{L_{AorD},L_B,L_C\}\backslash \{0\}. \end{aligned}$$
(12)

When a pixel is processed, it is assigned a label \(L_{p_x}\). Background pixels are assigned label 0. For object pixels, a label from \(L_{\eta }\) is propagated to the current pixel where possible.

A new label operation is performed if an object pixel has no object pixels in its neighbourhood, i.e., it is assigned the next available new label (a MakeSet on \(F_{p_x}^-\) creating a new tree). Conceptually, to achieve age-balancing, the new label (called newLabel in line 23) is provided by a counter, which is incremented for each new label. The new label operation sets \(MT[newLabel]:=newLabel\). To more easily detect stale labels, a flag \(IsRoot\) is associated with each label.

A label copy operation propagates the one label in \(L_{\eta }\) (as determined by the function posMin in line 46) to the current position of the labelled image \(L[p_x]\).

A merger pattern is detected when \(L_{AorD}\) and \(L_{C}\) have different labels and neither is background, i.e., when \((I[A] \vee I[D]) \wedge I[C] \wedge L_{AorD}\ne L_C\). The last term on line 28 is required to manage the case where the label of C is stale (this will be discussed further in Sect. 3.3). A merger operation makes the label which first appears in the raster scan, \(L_\mathrm{min}\), the parent label of \(L_\mathrm{max}\). This corresponds to a Union merging separate trees in \(F_{p_x}^-\). The vertex associated with \(L_\mathrm{max}\) is no longer a root so the flag \(IsRoot[L_\mathrm{max}]\) is cleared.
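A minimal sketch of this decision logic follows, assuming a union-find object providing make_set and union as sketched in Sect. 2.2; the stale-label term of Sect. 3.3 and the feature-vector updates of Sect. 3.2.2 are omitted, and the use of numeric order for \(\prec \) is an illustrative simplification (augmented labels handle reuse in Sect. 3.2.3).

```python
def select_label(nb, uf):
    """Choose between new label, label copy and merger for an object pixel.
    nb holds the resolved neighbourhood labels (0 = background)."""
    l_aord = nb['A'] or nb['D']          # A and D always carry the same label
    l_c = nb['C']
    candidates = {l for l in (l_aord, nb['B'], l_c) if l}

    if not candidates:                   # isolated pixel: new label (MakeSet)
        return uf.make_set()
    if l_aord and l_c and l_aord != l_c: # merger pattern (Union)
        return uf.union(l_aord, l_c)     # the older label survives as the root
    return min(candidates)               # label copy: propagate the older label
```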


Definition 9

Propagating and non-propagating patterns A merger pattern is propagating if \(L_{AorD}\prec L_{C}\); otherwise it is non-propagating.

Figure 3a shows an example of a propagating merger pattern, where the label \(L_{AorD}\) is propagated through several mergers. Figure 3b is an example of a sequence of non-propagating merger patterns.

A sequence of non-propagating mergers can result in labels having \(level(v) \ge 1\) on the next row. These are resolved by flattening the trees at the end of each row.

Fig. 3 a Propagating and b non-propagating merger patterns

3.2.2 Feature Vector Collection

Definition 10

Feature vector The feature vector of an image component is an n-tuple composed of functions of the component’s pixel pattern.

Connected components analysis is concerned with deriving the feature vector for each connected component. To accumulate the feature vectors of component segments, a data table, DT, maintains one feature vector for each label. An operator \(\circ \) is defined for combining the feature vectors when a merger operation is induced. The initial feature vector (IFV) is the feature vector of a single pixel. Table 3 presents the data structures, the initial feature vectors, and the combining operation for extracting area, bounding box and first-order moment of connected components.

Table 3 Data structure and combining operator for the feature vectors area, bounding box and first-order moment

For a background pixel, nothing needs to be saved in DT. A new label operation writes the IFV of the current pixel to \(DT[L_{p_x}]\) (line 26). A label copy operation combines the current pixel’s IFV with the feature vector stored in DT (line 47). A merger operation combines the feature vectors of the object labels in \(L_{\eta }\) with the IFV and stores the result in \(DT[L_\mathrm{min}]\) (line 43); the data table entry at index \(L_\mathrm{max}\) is also invalidated.
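For example, the Table 3 feature vectors for area, bounding box and first-order moment could be realised as below; the field names are illustrative.

```python
def initial_fv(x, y):
    """Initial feature vector (IFV) of the single pixel at (x, y)."""
    return {'area': 1,
            'bbox': (x, x, y, y),        # (x_min, x_max, y_min, y_max)
            'sum_x': x, 'sum_y': y}      # first-order moments

def combine(fv1, fv2):
    """The combining operator applied to two feature vectors."""
    x0a, x1a, y0a, y1a = fv1['bbox']
    x0b, x1b, y0b, y1b = fv2['bbox']
    return {'area': fv1['area'] + fv2['area'],
            'bbox': (min(x0a, x0b), max(x1a, x1b),
                     min(y0a, y0b), max(y1a, y1b)),
            'sum_x': fv1['sum_x'] + fv2['sum_x'],
            'sum_y': fv1['sum_y'] + fv2['sum_y']}
```

In this reading, a new label operation would store initial_fv(x, y) in DT, while label copy and merger operations fold further pixels and segments in with combine.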

3.2.3 Label Reuse

The memory requirements of MT and DT are proportional to the number of labels used, which in the worst case is proportional to the image area [1]. However, at any time, the number of feature vectors updated in one image row is only proportional to the image width [17, 21]. Memory requirements can be significantly reduced by recycling labels no longer in use, enabling entries of MT and DT to be reused after a connected component is completed. Rather than use a counter, newLabel is obtained from a FIFO, \(LabelFIFO\), initialised with the set \(LabelFIFO_{init}\), which contains all possible labels [18]:

$$\begin{aligned} LabelFIFO_{init} = \left\{ 1,\ldots , \lceil \tfrac{W+5}{2} \rceil \right\} . \end{aligned}$$
(13)

Labels which are ready for reuse are queued at the end of \(LabelFIFO\).

To detect when a connected component has been completed, a tag, \(LastLine\), is associated with each label. During the raster scan, whenever a label, \(L_{p_x}\), is updated, its \(LastLine\) tag is updated with the current image row

$$\begin{aligned} LastLine[L_{p_x}]:=y \quad \text {when} \ L_{p_x}\ne 0 \end{aligned}$$
(14)

to reflect that the component is not completed. Labels for which \(LastLine\) is not updated from one line to the next are detected as completed (as described in Sect. 3.5), enabling the labels of completed components to be recycled and reused.

After every merger operation, label \(L_\mathrm{max}\) is no longer required. However, it must not be reused for one image row since the labelled image L still might contain \(L_\mathrm{max}\) in the current image row to the left of the current position. Writing \(L_\mathrm{max}\) to the end of \(LabelFIFO\) ensures that it is not assigned to a new connected component within the following image row.

The reuse of labels in this way requires modifying the method used to determine \(L_\mathrm{min}\) and \(L_\mathrm{max}\). New labels produced by a counter strictly increase in scan order. Therefore, realising the \(\prec \)-operator as a comparison is sufficient. When reusing labels, the numeric labels are not necessarily assigned to component segments in increasing order. Therefore augmented labels are introduced to realise the functionality of the \(\prec \)-operator with label reuse.

An augmented label is a two-tuple consisting of the row number L.rw in which the label is first assigned and L.index which is used as an address to access array data structures. For example, \(DT[L_{p_x}]\) translates to \(DT[L_{p_x}.index]\). The row number rw is used for decisions in merger operations. The evaluation of \(L_{AorD}\prec L_C\) (line 29) is thus realised as

$$\begin{aligned} L_{AorD}.rw \le L_C .rw. \end{aligned}$$
(15)

This ensures that \(L_\mathrm{min}\) is always the label created earlier during processing, leading to correct age-balancing behaviour when a merger pattern is detected [1].

When a new label is assigned to a component segment, its index is pulled from the head of \(LabelFIFO\), i.e., newLabel in Algorithm 5.2 line 23 is realised as

$$\begin{aligned} \begin{aligned}newLabel.rw:=&y, \\ newLabel.index\leftarrow&LabelFIFO. \end{aligned} \end{aligned}$$
(16)
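A minimal sketch of label recycling with augmented labels, following (13), (15) and (16); the Python names are illustrative.

```python
from collections import deque, namedtuple

# Augmented label: rw is the row in which the label is first issued,
# index is the address used into MT and DT.
Label = namedtuple('Label', ['rw', 'index'])

def init_label_fifo(width):
    """LabelFIFO initialised with all possible indices, Eq. (13)."""
    return deque(range(1, (width + 6) // 2 + 1))   # ceil((W + 5) / 2) labels

def new_label(label_fifo, y):
    """Pull a fresh index from the head of LabelFIFO, Eq. (16)."""
    return Label(rw=y, index=label_fifo.popleft())

def precedes(l1, l2):
    """The age relation of Eq. (15): the label first issued on the earlier
    (or the same) row is treated as the older one."""
    return l1.rw <= l2.rw

def recycle(label_fifo, label):
    """Queue a retired index at the tail so it is not reused for one row."""
    label_fifo.append(label.index)
```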

3.3 Resolve Stale Labels

A stale label within \(L_{\eta }\) requires an additional lookup to determine the root vertex. Rather than performing this lookup immediately, SLCCA defers this until the root label appears in \(L_{\eta }\). If a non-root label is assigned to \(L_{p_x}\), as determined from the \(IsRoot\) flag, the feature vectors of the object labels in \(L_{\eta }\) are combined and stored in data table entry \(DT[L_{p_x}]\) for later combination with the feature vector of the root of \(L_{p_x}\). The non-root label is pushed onto the stale label stack (SLS) (Algorithm 5.2 line 50) until its root appears in \(L_{\eta }\). To avoid duplicate entries, which lead to increased memory requirements and processing times, a label is only added to SLS if it differs from the top entry, SLS.head.

When SLS.head is equal to L[C], then the lookup to determine \(L_{C}\) will return the label associated with the root vertex. Algorithm 5.3 then combines the feature vector of the stale label with the feature vector of the current component segment, and stores the result in \(DT[L_{C}]\). The data table entry associated with the label popped from the stack is then invalidated.

This enables on-the-fly processing of the feature vectors of reachable stale labels.
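A minimal sketch of this deferral is given below, with SLS as a plain list used as a stack; the combine operator is the one sketched in Sect. 3.2.2 and is passed in explicitly.

```python
def defer_stale(SLS, stale_label):
    """Push a non-root label onto the stale label stack, avoiding a
    duplicate of the current top entry (SLS.head)."""
    if not SLS or SLS[-1] != stale_label:
        SLS.append(stale_label)

def resolve_stale(SLS, label_under_C, resolved_C, DT, combine):
    """When SLS.head equals L[C], the single lookup of C has returned the
    root; fold the deferred feature vector into the root's entry."""
    if SLS and SLS[-1] == label_under_C:
        stale = SLS.pop()
        DT[resolved_C] = combine(DT[resolved_C], DT.pop(stale))  # pop invalidates the stale entry
```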


3.4 Flattening Trees in F

A prerequisite for Algorithm 4 to produce correct results is that all trees of the forest structure in MT are reduced to \(height(T) \le 1\). This can be achieved by using path compression, which is embodied in the Flatten operation.

Since minimum labels propagate to the right due to the raster scan by assigning \(L_\mathrm{min}\) to \(L_{p_x}\), the height of a tree in \(F_{p_x}\) is increased by one for each non-propagating merger pattern. Therefore, the arc from \(L_\mathrm{max}\) to \(L_\mathrm{min}\) created by a union operation induced by a non-propagating merger pattern is pushed onto the stack FS to accelerate flattening (Algorithm 5.2 line 35).


At the end of each image row, Flatten is invoked as listed in Algorithm 5.4. This pops the arcs off the flatten stack FS, visiting them in reverse order, effectively performing a scan from the root to the leaves in the reverse of the order in which the tree was constructed. The vertex associated with label \(L_\mathrm{max}\) in FS is made the child of the root found for the corresponding \(L_\mathrm{min}\), which successively connects each label to the root, flattening the forest structure in MT to a height of one.

3.5 Detecting Completed Connected Components

Label reuse requires the data of completed components to be removed from the data table DT so that the label can be recycled. A connected component is completed when no further pixels are added to the component in the current row. This cannot be checked until the end of the current row is reached, so in practice, it is checked while the next row is processed. That is, a connected component with label l can be detected as completed if \(LastLine[l]\) was last updated on row \(y-2\) (it was not extended onto the previous row, as indicated in Fig. 4), i.e.,

$$\begin{aligned} LastLine[l]=y-2. \end{aligned}$$
(17)

The data table, DT, is searched for feature vectors of completed connected components once per row, in parallel with the update process. When a completed component is detected, its feature vector from the data table is output. The data table entry is then cleared for reuse by a subsequent connected component, and the label is recycled by returning it to the end of the \(LabelFIFO\). This process is represented in Algorithm 5.5.

Of course, all remaining objects are completed after processing the last row of the image.

Note that in a hardware implementation, it is unnecessary to store all the bits of y in \(LastLine\). Two bits are sufficient to satisfy (17) unambiguously.
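A minimal sketch of this per-row sweep, with DT as a dictionary of active feature vectors and output any callable that consumes completed components; the names are illustrative.

```python
def detect_completed(DT, LastLine, label_fifo, y, output):
    """Once per row, emit and recycle components not extended onto row y - 1.
    Eq. (17): a component with label l is complete if LastLine[l] == y - 2;
    in hardware only two bits of the row number need to be kept, since any
    still-active label was updated on row y or y - 1."""
    for l in list(DT.keys()):
        if LastLine[l] == y - 2:      # not extended onto the previous row
            output(DT.pop(l))         # emit the completed feature vector
            label_fifo.append(l)      # recycle the label index
```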


4 Proof of Correctness of the SLCCA Algorithm

In this section, it is shown that the correct feature vector is extracted for each connected component in a binary input image, I, using the algorithm presented in Sect. 3. In particular, it is shown that replacing Find of the classical union-find algorithm by a single lookup as in Algorithm 5.1, the Flatten operation of Algorithm 5.4, and the deferred lookup of stale labels in Algorithm 5.3 all result in the extraction of the correct feature vectors for the connected components in I. To do this, a top-down hierarchical proof will be used.

I is only processed once in raster scan order. The current pixel \(p_x=(x,y)\) is assigned a label based on the labels in its neighbourhood \(\eta \). Therefore, only the labels in L of the previous line are used to determine the subsequent labels in the scan process. It is convenient to divide the corresponding positions into two sets relative to \(p_x\), as depicted in Fig. 4. The set \(leftPos\) contains the pixel positions of the current row to the left of \(p_x\):

$$\begin{aligned} leftPos=\{(i,y):0\le i<x ,\ i \in \mathbb {N} \}, \end{aligned}$$
(18)

and \(rightPos\) contains the pixel positions of the previous row to the right of \(p_x\):

$$\begin{aligned} rightPos=\{(i,y-1):x<i<W ,\ i \in \mathbb {N} \}. \end{aligned}$$
(19)
Fig. 4 Visualisation of the positions in the sets visited, \(rightPos\) and \(leftPos\) in the image

Replacing Find by a single lookup to determine the connected component’s root label works correctly for labels associated with vertices of \(level(l) \le 1\). The feature vectors of these labels can be easily accumulated and associated with their connected components.

However, as a result of several mergers, a label can become stale (\(level(l)>1\)). To associate such labels correctly with their connected components, additional steps are required. For this, it is convenient to identify the set of vertices (labels) that may be encountered when processing the rest of the current row (before the next call of Flatten).

Definition 11

Reachable vertices These are the labels of L in \(rightPos\) and their parents:

$$\begin{aligned} V_{reachable}= \bigcup _{p_r \in rightPos}\{L[p_r],\, parent(L[p_r])\}. \end{aligned}$$
(20)

4.1 Outline of Correctness Proof

Labels in \(leftPos\) of \(level(l)>1\) (created by a sequence of non-propagating mergers) are not reachable in the current row, which will be shown in Lemma 12. For these labels, calling Flatten at the end of the image row is sufficient as shown in Corollary 13. Therefore, the feature vectors of the associated patterns are correctly determined. The correctness of Flatten for compressing the forest structure, F, represented within the merger table, MT, is shown in Theorem 14.

Labels in \(rightPos\) of \(level(l)>1\) can only be created by a combination of two merger patterns, one in \(leftPos\) and one in \(rightPos\), as shown in Lemma 16. In this case, Lemma 18 shows that the root will always be encountered before the end of the image row. Therefore, storing the stale label on the stale label stack, SLS, enables the additional lookup to be deferred, while still associating the accumulated data with the correct connected component (Theorem 20). Finally, it is shown in Theorem 21 that any resulting labels of \(level(l)>1\) are also reduced to level 1 by Flatten.

These show that the results of SLCCA are correct.

4.2 Non-propagating Mergers

Since each connected component is represented by a tree in F, the arguments given in the following subsections of this proof refer to a single connected component.

A non-propagating merger has \(L_C \prec L_{AorD}\), so \(L_{AorD}\) is made a child of \(L_C\), increasing \(level(L_{AorD})\) by 1.

Lemma 12

After a non-propagating merger, only the root label, \(L_C\), is reachable.

Proof

For \(L_{AorD}\) to be reachable, it must be connected to a position in \(rightPos\) through the pixels that have already been processed. This requires \(L_{AorD}\prec L_{C}\) which contradicts the requirements of a non-propagating merger. Therefore \(L_{AorD}\) is not reachable [1]. \(\square \)

A sequence of two or more non-propagating mergers will result in stale labels in \(leftPos\) (see Fig. 3b). However, none of these will appear in the neighbourhood before the end of the image row.

Corollary 13

Delaying the Flatten operation until the end of the row will not affect the assigning of correct labels.

When moving to the start of the next row, \(leftPos\) becomes \(rightPos\) so all of the labels in the current row become reachable again. Therefore Flatten must reduce the maximum level of a label to 1.

Theorem 14

The Flatten operation as described in Algorithm 5.4 results in a forest structure \(F_p\) where each rooted tree is of height \(\le 1\).

Proof

Each non-propagating merger pattern increases the level of the vertex associated with label \(L_{AorD}\) by one. Since the labels of successive non-propagating mergers are strictly decreasing (\(L_C \prec L_{AorD}\)), each successive merger grows the tree adding a new root vertex. Therefore revisiting the mergers in reverse order will follow the vertices of a sequence of non-propagating mergers in order from the root to leaf, as illustrated in Fig. 5. This is facilitated by pushing the non-propagating mergers onto a stack, FS, as they occur, saving \(L_{AorD}\) as \(L_\mathrm{max}\) and \(L_C\) as \(L_\mathrm{min}\), respectively. Popping the pair of labels off the stack performs the reverse scan from root back to the leaves. If \(level(L_\mathrm{min}) \le 1\) then assigning \(MT[L_\mathrm{max}]:=\textsc {Find}(L_\mathrm{min})\) will make \(level(L_\mathrm{max})=1\) for each iteration within Algorithm 5.4.

As a result of the reverse scan, \(level(L_\mathrm{min}) \le 1\) for non-propagating mergers. It will be shown in Theorem 21 that this is also true for stale labels following a propagating merger (referred to as reachable stale labels).

Consequently, \(level(v) \le 1\ \forall \ v \in V(F_p)\) before processing the next line. \(\square \)

Fig. 5 A sequence of 3 labels followed by 3 non-propagating merger patterns, with the arcs recorded after the last pixel of the current row is processed. The solid arrows represent the arcs pushed onto stack FS which are used for Flatten; the dotted arrows represent the arcs stored in merger table MT

Since non-propagating mergers can result in trees requiring the Flatten operation, an obvious question is “why not make all mergers propagating mergers?”, i.e., to always select \(L_{AorD}\) as the root of a merger. As demonstrated in [1], this does not prevent the building of trees of height greater than 1, and since it is not known in advance which vertices will have their level increased, such a scheme would require all mergers to be stacked for checking, not just non-propagating mergers.

4.3 Propagating Mergers

After a propagating merger (\(L_{AorD}\prec L_C\)), the label L[C] is still reachable, and its level will be increased by 1. If \(level(L[C]) > 1\) then L[C] becomes a reachable stale label. In contrast to stale labels resulting from non-propagating mergers, reachable stale labels can appear in the neighbourhood \(\eta \) of \(p_x\) before the end of the current image row. The following will investigate how a reachable stale label can be created.

Theorem 15

Labels resulting from mergers that occurred more than one row previously will be reduced to the root label before the current row is processed.

Proof

Flatten will reduce the maximum level of a label to level 1 at the end of a row (Theorem 14). In the absence of additional mergers, when scanning the following row, Find(L[C]) will perform the lookup, returning the root label. \(\square \)

Lemma 16

A reachable stale label can only be created by a non-propagating merger in \(rightPos\) followed by a propagating merger in \(leftPos\).

Proof

The level of a label can only increase as a result of a merger. Therefore at least two mergers are required to make a reachable label stale. From Theorem 15, these mergers must have occurred in the previous W scanned pixels, where W is the image width, i.e., in \(leftPos\) or \(rightPos\). From Lemma 12, the label increased by a non-propagating merger is not reachable, therefore to create a reachable stale label, any mergers in \(leftPos\) must be propagating mergers. In a sequence of such mergers, each merger links \(L_C\) to the root label so successive propagating mergers do not increase the height of the tree (see Fig. 3a). Similarly, from a propagating merger in \(rightPos\), only the root label is reachable. A sequence of one or more non-propagating mergers in \(rightPos\) will only provide a reachable label of \(level(v)\le 1\) (Theorem 14). Therefore, the only way to get a reachable label with \(level >1\) is through a non-propagating merger in \(rightPos\) followed by a propagating merger in \(leftPos\). \(\square \)

Two examples of such mergers are shown in Fig. 6. On the previous row, labels \(l_1\) and \(l_2\) merge, which makes \(level(l_2)=1\). Then, the propagating merger between \(l_0\) and \(l_1\) in \(leftPos\) results in \(level(l_2)=2\). Note that this also requires \(l_0 \prec l_1\) so that the level of \(l_2\) is increased. The single lookup of \(l_2\) at position \(p_x\) results in assigning \(L_{p_x}:=\textsc {Find}(l_2) = l_1\), which is a non-root label, rather than the root \(l_0\).

Fig. 6 Two examples of images containing stale label \(l_2\). A non-root label is assigned to \(L_{p_x}\), because a stale label is in the neighbourhood

Definition 17

Bridge patterns A bridge pattern is a component segment in which an object label appears more than once in the same image row separated by background pixels.

A reachable stale label requires a bridge pattern between the merger in \(leftPos\) and the merger in \(rightPos\) as is shown in Fig. 6.

4.4 Feature Vector Accumulation of Reachable Stale Labels

To determine the root vertex of reachable stale labels, a maximum of two lookups are necessary, which are distributed to two different positions in the image (in Algorithms 5.1 line 20 and 5.3). It is, therefore, necessary to show that for every possible image pattern which contains a reachable stale label, these two lookups are performed.

Lemma 18

The appearance of a reachable stale label \(l_{2}\) in the neighbourhood \(L_{\eta }\) of the current pixel is always followed by the appearance of \(l_{1}=parent(l_{2})\) in \(rightPos\) before the end of the current row.

Proof

From Lemma 16, a reachable stale label in \(L_{\eta }\) implies a non-propagating merger pattern in \(rightPos\), between \(l_2\) and \(l_1\). Since \(l_1 \prec l_2\), \(l_{1}=parent(l_{2})\), and \(l_1\) will have been written to the labelled image, L. During the ongoing scan, label \(l_1\) will therefore appear in L[C]. \(\square \)

The stale label stack, SLS, is used for caching the stale label while waiting for its parent to appear in the neighbourhood. The stack is necessary, because the stale label may not necessarily be in the neighbourhood when its parent appears.

Lemma 19

A stack is sufficient for searching for the parent of a stale label.

Proof

Consider the case where a different stale label \(l_2\) is encountered before the parent of the current stale label \(l_1\) is found, i.e., \(parent(l_1)\) has not yet been encountered. Therefore, there is a path in visited between \(l_1\) on the left of \(l_2\) and \(parent(l_1)\) on the right of \(l_2\). To become a stale label, \(l_2\) requires an earlier label on each side (Lemma 16): \(l_{left}\) and \(l_{right}\), such that \(l_{left} \prec l_{right}=parent(l_2) \prec l_2\). Since \(l_1\) appears on both sides of this group, this implies either \(l_1 \prec l_{left}\) or \(l_1 = l_{left}\). This requires \(parent(l_1) \prec parent(l_2)\), therefore \(parent(l_1)\) cannot be in between \(l_2\) and \(parent(l_2)\). So the stale label \(l_2\) must be resolved before \(l_1\), making a stack appropriate. \(\square \)

Theorem 20

The feature vectors of the pixels of a reachable stale label pattern are always associated with their connected component.

Proof

The use of the flag \(IsRoot\) enables every non-root label assigned to \(L_{p_x}\) to be detected when the feature vector is updated in the data table DT. Since \(IsRoot\) indicates that the label is stale, temporarily buffering the feature vector and recording the label in SLS (Algorithm 5.2 line 50) enables the feature vector to later be combined with that of the root label. When the parent of the stale label is encountered (Lemma 18), the data is combined with that of the correct root label. Since the merger in \(rightPos\) is non-propagating (Lemma 16), the stale label will not appear again beyond the merger point. \(\square \)

4.5 Flattening Reachable Stale Labels

When reaching the end of a row, there will be no instances of the previously stale label in \(leftPos\), because the stale label will have been looked up returning its parent. The parent of the stale label may have been propagated into the label image, L. Since \(level(parent(l_{stale})) = 1\) any subsequent non-propagating mergers involving that component will increase the level to 2 or more. Therefore, to ensure that the maximum height after calling Flatten is 1, it is also necessary to include in the Flatten operation any reachable stale label assigned to \(p_x\) (pushed onto SLS and resolved before the end of a row).

Theorem 21

Pushing the reachable stale label onto the flatten stack, FS, when the reachable stale label is resolved is sufficient to correctly flatten reachable stale labels.

Proof

Non-propagating mergers following the event of resolving a reachable stale label are pushed onto the flatten stack after the reachable stale label. Therefore, these non-propagating mergers will be flattened first, ensuring that Flatten on the reachable stale label will yield the root label. Any non-nested sequence of non-propagating mergers will similarly be correctly flattened in the reverse order.

Next consider a nested sequence of reachable stale labels, where an inner reachable stale label \(l_{inner}\) is created after an earlier reachable stale label \(l_{outer}\) is created, but before it is resolved. Since \(l_{outer} \prec l_{inner}\) (see the proof of Lemma 19), if they are part of the same tree, \(l_{outer}\) will be closer to the root than \(l_{inner}\). Therefore, \(l_{outer}\) must be flattened before \(l_{inner}\), requiring it to be pushed onto the flatten stack later. During the processing, \(l_{outer}\) is encountered before \(l_{inner}\) and will be pushed onto the stale label stack earlier than \(l_{inner}\). When the labels are resolved (Lemma 19), \(l_{inner}\) will be resolved first, and consequently be pushed onto the flatten stack earlier than \(l_{outer}\), as required. \(\square \)
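The ordering argument above can be made concrete with a small sketch of the end-of-row Flatten, assuming (as a simplification of the paper's data structures) that the merger table MT maps each label to its parent, with roots mapping to themselves, and that FS holds the labels pushed in the order described in Theorem 21; the function and variable names are illustrative.

```cpp
#include <cstdint>
#include <stack>
#include <vector>

// End-of-row Flatten sketch: pop labels in reverse push order and shorten
// each one's merger-table entry so that it points directly at a root.
void flattenEndOfRow(std::vector<uint32_t>& mergerTable,   // MT: label -> parent
                     std::stack<uint32_t>& flattenStack) { // FS
    while (!flattenStack.empty()) {
        uint32_t label = flattenStack.top();
        flattenStack.pop();
        // Labels pushed later are popped (and flattened) first, so by the
        // time 'label' is processed its parent's entry already points at the
        // root, and a single assignment reduces the tree height to one.
        mergerTable[label] = mergerTable[mergerTable[label]];
    }
}
```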

4.6 Processing Complexity of SLCCA

New label, label copy and merger operations require constant processing time per pixel. The processing time for these is clearly linear in the number of pixels. Processing stack FS at the end of each image row by Flatten is data dependent, but is bounded by the number of non-propagating merger operations. There can be a maximum of \(\lfloor \frac{W-1}{2}\rfloor \) non-propagating merger patterns per line of W pixels, so the processing time for this is also linear in the number of pixels.

The worst-case pattern with regards to the total number of uf-instructions [1] has an average of \(\lfloor \frac{W}{5}\rfloor \) merger patterns per row and is shown in Fig. 7. Since even this worst case is bounded by a constant number of merger patterns per row, the algorithm from Sect. 3 scales linearly with the image size.
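As a concrete check of these per-row bounds for a 512-pixel-wide image (an illustrative calculation, not a result from the paper):

\[ \left\lfloor \frac{W-1}{2} \right\rfloor = \left\lfloor \frac{511}{2} \right\rfloor = 255, \qquad \left\lfloor \frac{W}{5} \right\rfloor = \left\lfloor \frac{512}{5} \right\rfloor = 102 . \]

Both bounds are proportional to the image width, so the total number of merger patterns per image, and hence the associated processing time, grows linearly with the number of pixels.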

Fig. 7 Stair pattern inducing the maximum number of non-propagating merger operations [1]

4.7 Insights Gained

The proof of correctness of SLCCA demonstrates that performing a Flatten at the end of each row is not only a necessary condition of the improved context-based union-find, but it is also a sufficient condition. In particular, this allows the processing of sequences of non-propagating mergers to be deferred until the end of each image row.

It has also identified a limitation of the OSP algorithm of Bailey and Johnston [1]. There, reachable stale labels were not considered, and as a result, were not included within the Flatten operation at the end of each row, potentially leading to erroneous results in some circumstances. An example of this is using CCA for blob counting, by incrementing the count for each new label operation, and decrementing the count for each merger operation. A reachable stale label can result in an additional merger between already merged components, giving an incorrect count. In SLCCA, the stale label stack ensures the data from pixels with stale labels are assigned to the correct feature vector.

5 Optimised DLCCA Algorithm

The problem associated with reachable stale labels may be overcome if a second lookup can be performed: to determine whether or not a label is a root, it is necessary either to look up the \(IsRoot\) flag (from the data table DT) or to perform a second lookup within the merger table MT.

In this section, it is shown that if two lookups are made within the merger table then it is only necessary to look up the first pixel in a run of pixels. Consequently, the total number of lookups is less than the number of pixels in the image.

5.1 Properties of a Double Lookup

Lemma 22

A double lookup will always yield the root label.

Proof

A stale label has \(level(l) > 1\). Lemma 16 describes the only way of achieving reachable stale labels, which is through a specific combination of two mergers. The maximum height of a reachable tree is 2 levels. Therefore two lookups are always sufficient to reach the root label. \(\square \)

Theorem 23

Within a run of consecutive object pixels in \(rightPos\), the root label of all pixels in the run is the same as the root label of the first pixel in the run.

Proof

The labels within a run in \(rightPos\) can only be different if there has been a merger (if there is no merger, the label simply propagates). After a merger, any adjacent labels have the same root. \(\square \)

These imply that it is only necessary to look up the first pixel in a run to find the root, and a double lookup is sufficient. The first object pixel in a run will be followed either by another object pixel or by a background pixel. It is not necessary to look up background pixels, so the total number of accesses to the merger table, MT, is less than the total number of pixels in the image, satisfying the single lookup per pixel requirement (on average).
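Under the same simplified view of the merger table (each entry holding a label's parent, with roots pointing to themselves), the double lookup at the start of a run reduces to the following one-line sketch; only the first object pixel of a run calls it, and the returned root is reused for the remaining pixels of the run.

```cpp
#include <cstdint>
#include <vector>

// Double lookup (Lemma 22): two chained reads of MT are sufficient because a
// reachable tree has a maximum height of two levels.
uint32_t rootOfRun(const std::vector<uint32_t>& mergerTable,
                   uint32_t firstLabelInRun) {
    return mergerTable[mergerTable[firstLabelInRun]];
}
```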

5.2 DLCCA Algorithm

The DLCCA algorithm (Algorithm 6) is similar to that for SLCCA, with minor changes to UpdateNeighbourhood and UpdateDataStructures. The double lookup means that ResolveStaleLabels is no longer required; however, Flatten and ReadFinishedFeatureVectors remain unchanged.


UpdateNeighbourhood (Algorithm 6.1) has to be modified to perform the double lookup at the start of a run. During a run (on line 21), the label assigned to \(L_B\) on line 15 or 17 is repeated.

UpdateDataStructures (Algorithm 6.2) is simpler than for SLCCA. It is no longer necessary to record the \(IsRoot\) since the double lookup will always return the root. Similarly, the stale label stack, SLS, is no longer required. Label assignment, tree flattening, data table update, and completed object detection remain the same.


The fact that only the first object pixel in a run needs to be looked up implies that the label image L may be compressed using run-length encoding. DLCCA therefore unifies pixel-based processing with run-based processing methods, since any subsequent processing can be done on runs. For example, it is similar to He et al.’s CCL algorithms [11, 12] which use run-length encoding to optimise the second (relabelling) pass, and in [11] for feature extraction (Euler number). However, DLCCA differs from Trein’s single-pass run-length algorithm (RLSP) [32] in that it still processes the neighbourhood one pixel at a time, whereas RLSP processes one overlap between runs in each clock cycle.
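To make the run-based view concrete, the following illustrative C++ sketch run-length encodes one row of the label image into (label, length) pairs; the assumption that background pixels carry the label 0 is a convention adopted here for illustration only.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Run-length encode one row of labels; consecutive equal labels (including
// the assumed background label 0) collapse into a single (label, length) run.
std::vector<std::pair<uint32_t, uint32_t>>
encodeRow(const std::vector<uint32_t>& labelRow) {
    std::vector<std::pair<uint32_t, uint32_t>> runs;
    for (uint32_t label : labelRow) {
        if (!runs.empty() && runs.back().first == label) {
            ++runs.back().second;           // extend the current run
        } else {
            runs.emplace_back(label, 1u);   // start a new run
        }
    }
    return runs;
}
```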

In the worst case (with alternating object and background pixels), there is no speed advantage to run-based processing, although it can reduce the processing for more typical images at the expense of more complex processing logic. However, unless the image is streamed in at more than one pixel per clock cycle, there is limited real advantage.

5.3 Implementation Issues of DLCCA

In hardware, each lookup requires a clock cycle. The two lookups can be pipelined to ensure that the root label is loaded into the neighbourhood window. Pipelining the second lookup is possible because the following pixel does not need to be looked up (it is either a background pixel or part of a run).

However, pipelining can create a race condition where the label being looked up is updated in MT in the same clock cycle as the second lookup, as illustrated in Fig. 8. At time t, the merger between A and C links label 2 to label 1. With pipelining, this is written to the merger table in clock cycle \(t+1\). In parallel with this, pixel E is looked up in the merger table in cycles t and \(t+1\). The first lookup (at t) looks up label 3 and returns 2. The second lookup (at \(t+1\)) is of label 2, which would return the old root, label 2, because the merger table is not updated with the merger until the end of \(t+1\). For correct operation, this requires the hardware for MT either to perform a write-before-read, or to have bypass logic that forwards the new value being written.
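A behavioural sketch of the bypass option is given below; it only models the forwarding decision (read the value being written in the same cycle) and is not a hardware description. The port model and names are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Merger table with write-before-read behaviour (bypass/forwarding):
// if a read and a write target the same address in the same cycle, the read
// returns the newly written value rather than the stale stored one.
class MergerTablePort {
public:
    explicit MergerTablePort(std::size_t size) : mem_(size) {
        for (std::size_t i = 0; i < size; ++i)
            mem_[i] = static_cast<uint32_t>(i);  // every label starts as a root
    }

    // One clock cycle with an optional write and a simultaneous read.
    uint32_t cycle(bool writeEnable, uint32_t writeAddr, uint32_t writeData,
                   uint32_t readAddr) {
        uint32_t readData = (writeEnable && readAddr == writeAddr)
                                ? writeData        // bypass the memory array
                                : mem_[readAddr];  // normal read
        if (writeEnable) mem_[writeAddr] = writeData;
        return readData;
    }

private:
    std::vector<uint32_t> mem_;
};
```

In the scenario of Fig. 8, the write linking label 2 to label 1 and the second lookup of label 2 occur in the same cycle, and the bypass returns 1 rather than the stale value 2.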

Fig. 8 DLCCA race condition. The merger in cycle t is written to MT in cycle \(t+1\). The second lookup of E looks up the same label in cycle \(t+1\)

6 Comparison and Discussion

6.1 Evaluation Method

Modern CCA and CCL algorithms are often tailored to the cache hierarchy of general-purpose processors (GPP) [3, 11, 19], which consists of several levels of on-chip and off-chip memory. For such processors, the average number of clock cycles to process a pixel of a random image is a meaningful metric to compare algorithms [4]. However, the suitability of a CCA or CCL algorithm for a hardware architecture depends on the interaction of the algorithm with the basic building elements of the technology used and the arrangement of these elements. The freedom to arrange the basic building elements of the hardware device facilitates the use of parallelism and helps to reduce processing or I/O bottlenecks.

The number and speed of lookup operations are crucial for carrying out CCA and CCL, as discussed in the introduction. Unlike a GPP, hardware architectures realised on an ASIC or FPGA do not have a fixed memory model; the three available memory types (on-chip registers, on-chip memory and off-chip memory) can be arranged and connected to make the best use of lookup operations and to provide data at the exact time they are required. The bandwidth of on-chip registers and memory is significantly higher, and the latency significantly lower, than off-chip memory. Therefore, SLCCA and DLCCA are designed to fit completely in on-chip registers and memories, which are limited on current FPGA devices.

Unlike in a cache hierarchy, where the cost of a read or write operation depends on the hierarchy level, the FPGA on-chip memory model provides random read and write operations at constant cost. Therefore, the total number of memory operations required to process an image provides a good estimate of how suitable a CCA or CCL algorithm is for a hardware architecture.

To compare variants of CCA or CCL algorithms with different numbers of passes, different scan modes and different set merging algorithms, the number of memory access instructions is considered.

Definition 24

Memory access instruction: A memory access instruction (MAI) is a single read access from, or a single write access to, an indexed data structure.

In particular, the state-of-the-art CCA and CCL algorithms are examined with regards to

  • total number of memory accesses,

  • degree of parallelism and

  • required memory resources

to process a stream of binary pixels.

6.2 MAIs for DLCCA

To evaluate and compare the state-of-the-art CCA or CCL algorithms, each algorithm was implemented in C++ or Java. The code was instrumented to count the number of MAIs. Since CCL algorithms are only concerned with outputting a labelled image, feature vector collection was also added to their implementations.

Fig. 9 Memory access instructions (MAIs) on each data structure of DLCCA for processing \(512 \times 512\) images with different object pixel densities

Figure 9 shows the number of MAIs DLCCA requires to extract the feature vectors of the components in a random \(512 \times 512\) pixel image as a function of object pixel density. Each colour in Fig. 9 depicts the number of MAIs on one of the data structures; the upper bound shows the total number of MAIs. Read and write accesses to the labelled image, L, are also performed in parallel using dual-port memory. Although the number of MAIs required for L could be reduced by run-length encoding, these accesses are in parallel with the other data structures, so in practice little would be gained. DLCCA is designed to access all data structures in parallel (except for the flatten stack, FS, during Flatten at the end of each row). Therefore, the maximum number of MAIs carried out in parallel depends on the maximum number of MAIs on a single data structure plus the MAIs required on FS.

The label and the feature vector associated with the current pixel, \(L_{p_x}\), can be stored in registers, with the other feature vectors stored in on-chip memory. Since the label L[C] can belong to a different connected component from \(L_{AorD}\), the parent label of L[C] must be determined for every pixel. DLCCA performs a double lookup (i.e., M[M[L[C]]], in successive clock cycles) to find the root label. Since the root labels of consecutive object pixels are all the same, two lookups are always sufficient to determine the label to assign to all pixels of a run. Therefore, at most two MAIs on the merger table MT are necessary for a run of consecutive object pixels.

The labels of \(L_{AorD}\) and \(L_B\) are derived from the labels of L[C] and \(L[p_x]\) of the previous position. It is, therefore, sufficient to store them in registers. As the \(LabelFIFO\) is only accessed when new label patterns or completed connected components are detected, the number of MAIs on the \(LabelFIFO\) is highest around an object pixel density of \(40\%\). DLCCA does not store a fully labelled image, therefore L is only accessed for labels assigned to the previously processed image row. For each pixel in the input image I, there is one read access on L to retrieve the pixel coming into the local neighbourhood, and one write access to store the provisional label assigned to a pixel. For consecutive object pixels, the feature vector associated with label \(L_{p_x}\) is cached in registers [18], which reduces the number of MAIs on the data table (DT); DT has its highest number of MAIs at an object pixel density of around \(50\%\). The number of MAIs on the flatten stack (FS) is highest for images with a stair pattern, which have an object pixel density of \(40\%\).

6.3 Evaluation of MAIs

To compare the number of memory access instructions of CCA and CCL algorithms, the following cost metric is applied:

  • Successive reads from the same position of a data structure can be buffered in a register and are, therefore, counted as one MAI.

  • Successive writes to the same position of a data structure can be cached in a register and are, therefore, counted as one MAI.

  • Receiving the input image I as a stream (as in a hardware implementation) is not a memory access instruction per se, i.e., requires zero MAIs. However, for a fair comparison, these read accesses are counted as one MAI each (effectively streaming from memory).

This metric does not try to show which CCA algorithm runs the fastest on a general-purpose processor, but indicates the potential speed of a CCA or CCL algorithm when realised as a hardware architecture. In fact, the results of [3] show that LSL requires the smallest number of processing cycles per pixel on Intel and ARM processors.
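A minimal sketch of how a data structure could be instrumented to count MAIs under the metric above is shown below, with successive reads from, or writes to, the same position coalesced into a single MAI; the class and member names are illustrative, and the actual instrumentation used for the evaluation is not given in the paper.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Array wrapper that counts memory access instructions (MAIs).
class CountedArray {
public:
    explicit CountedArray(std::size_t size) : data_(size, 0) {}

    uint32_t read(std::size_t index) {
        count(Op::Read, index);
        return data_[index];
    }

    void write(std::size_t index, uint32_t value) {
        count(Op::Write, index);
        data_[index] = value;
    }

    std::size_t mais() const { return mais_; }

private:
    enum class Op { None, Read, Write };

    // Successive accesses of the same kind to the same position are assumed
    // to be buffered in a register and therefore counted as one MAI.
    void count(Op op, std::size_t index) {
        if (!(op == lastOp_ && index == lastIndex_)) ++mais_;
        lastOp_ = op;
        lastIndex_ = index;
    }

    std::vector<uint32_t> data_;
    Op lastOp_ = Op::None;
    std::size_t lastIndex_ = 0;
    std::size_t mais_ = 0;
};
```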

Fig. 10 Comparison of the number of memory access instructions required by CCL and CCA algorithms for processing \(512 \times 512\) images with different object pixel densities

Figure 10 represents the total number of MAIs for extracting the feature vectors of the connected components in a random \(512 \times 512\) pixel image by DLCCA, SLCCA [18], OSP [1], AR [22], RLSP [32], CAM [16], LSL [19], HCS [11], CT [5] and Rosenfeld’s classical algorithm [26] applying QuickUnion (RQU) (see Table 1 for algorithm abbreviations).

RLSP, HCS and LSL encode and process runs of pixels from the input image, which explains the large difference of MAIs between an image with object pixel density around \(50\%\) and an empty or filled image. For the algorithms HCS, CAM, RLSP and LSL most MAIs are required when processing random images between \(48\%\) and \(55\%\) object pixel density.

Fig. 11 The number of memory access instructions required for processing worst-case images (chessboard, stairs, feather pattern) and natural images from the SIPI database [33] and from the Berkeley BSDS300 dataset [23]. All images are \(512 \times 512\) pixels

The number of MAIs for SLCCA and OSP is almost equal since the basic processing principle is very similar (although OSP does not use relabelling). AR is also similar to SLCCA and DLCCA, but requires one additional lookup per pixel for the translation table associated with aggressive relabelling. DLCCA is an advancement of SLCCA and requires up to \(25\%\) fewer MAIs due to caching lookups. The number of MAIs for DLCCA increases with the object pixel density up to \(43\%\); above \(43\%\), the number of MAIs decreases again.

The bar diagrams in Fig. 11a through c show the number of MAIs required for processing worst-case images with a chessboard pattern, stair pattern and feather pattern [1]. Among the explored algorithms, the chessboard pattern with a granularity of one pixel has been shown to require the maximum number of MAIs for LSL, HCS, CT, RLSP and RQU. Although DeBock and Philips [6] identified a tree pattern as the worst-case pattern for HCS with respect to the run time, our analysis shows that the chessboard pattern requires more MAIs. The stair pattern from Fig. 7 requires the maximum number of MAIs for OSP, CAM, SLCCA and DLCCA. For AR, all of the mergers associated with the stair pattern are managed by relabelling of objects from one row to the next. The merger table (and flatten stack) is only required when both component segments already have a label on the current row, which can only occur with a bridge pattern. This requires an image such as the feather pattern to induce the maximum number of MAIs for AR [22].

Figure 11d, e shows the average number of MAIs required for processing the more than 300 natural reference images from the USC-SIPI database [33] and the Berkeley BSDS300 dataset [23]. For the comparison, these images are scaled to a size of \(512 \times 512\) pixels and binarised with a global threshold value determined by Otsu’s algorithm [25]. In general, the methods which make use of run-length encoding benefit from such images through their ability to access complete runs of pixels with a single MAI.

To compare the processing time that each algorithm can guarantee, the worst-case pattern of each algorithm is used for the comparison. Table 4 lists the total number of MAIs per pixel (the sum of MAIs on all memory structures) for processing the worst-case patterns.

CAM requires the fewest MAIs due to its content-addressable memory. Every update of the content-addressable memory is counted as a single MAI, even if multiple locations in the memory with the same label are updated. OSP, AR, SLCCA and DLCCA have a similar range for the sum of MAIs, as these methods are all based on OSP. AR requires more MAIs than OSP due to the additional translation table. SLCCA requires more MAIs than OSP because the data table is continuously searched for finished feature vectors. DLCCA requires fewer MAIs than OSP as it caches labels of continuous runs, whereas OSP requires one lookup for each pixel to determine a label’s parent.

HCS, RLSP and LSL perform run-length encoding of the input images before processing the images. The maximum number of MAIs for those algorithms is required when processing the chessboard pattern which essentially is a series of runs with a length of a single pixel. Therefore, run-length encoding does not gain an advantage for the worst case.

RQU assigns up to \(\lceil W \times H / 4 \rceil \) provisional labels during the first pass over the input image. Merging these provisional labels at the end of the image and assigning the final labels to L in a second pass account for the difference in comparison with CAM.
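For the \(512 \times 512\) images used in the comparison, this bound evaluates to (an illustrative calculation):

\[ \left\lceil \frac{W \times H}{4} \right\rceil = \left\lceil \frac{512 \times 512}{4} \right\rceil = 65536 \]

provisional labels in the worst case, which must then be merged at the end of the image and relabelled in the second pass.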

CT traces the contour of each connected component in the input image and, therefore, requires multiple read and write operations which are strictly sequential for each input pixel.

6.4 Evaluation of Parallelism

This subsection evaluates the parallelism of the examined algorithms. Some algorithms (especially software algorithms) accomplish parallel processing by means of static or dynamic scheduling on a superscalar general-purpose processor. This scheduling cannot be identified from the algorithm’s description alone. A good measure for software algorithms is, therefore, the cycles-per-pixel (cpp) measure established by Cabaret and Lacassagne [3], as it relates to an algorithm executed on a specific processor. The parallelism of those algorithms is, therefore, implicit and dependent on the hardware the algorithm is run on.

To compare parallelism for a hardware architecture, such as an FPGA implementation, an explicit description of the hardware architecture and a mapping of the algorithm to it is necessary. This is the case for the following algorithms: OSP, RLSP, AR, SLCCA, DLCCA and CAM. Therefore, only these are discussed with regards to parallel MAIs in Table 4. Other algorithms may well contain pipelined or parallel MAIs; however, these depend on the particular processor used, making their analysis beyond the scope of this evaluation.

AR’s additional lookup in the translation table operates in parallel with the other memory structures and, therefore, does not diminish the performance. Similarly, the scanning of the data table by SLCCA and DLCCA to detect completed components uses a second memory port, enabling it to operate in parallel.

CAM requires one parallel MAI per pixel due to the use of a content-addressable memory. Whenever two component segments merge, the provisional label is immediately replaced by the representative label, so no further processing is required.

In contrast, OSP, AR, SLCCA and DLCCA require additional MAIs at the end of each row for flattening stale labels from non-propagating mergers. This overhead is proportional to the number of mergers cached on the flatten stack (FS). As indicated earlier, AR caches fewer mergers as a result of the relabelling process, with a worst-case overhead of 1 in 16 (approximately 6.3%). OSP, SLCCA and DLCCA have a worst-case overhead of 1 in 5 (20%). However, OSP also has an additional sequential overhead of processing completed components at the end of the frame. With AR, SLCCA and DLCCA, label reuse requires identifying completed components on the fly.

This suggests that the number of MAIs required for Flatten could be reduced by modifying the algorithm so that a more complex image is needed to induce the worst case. In examining the stair pattern, a new label is assigned to a component segment, only for it to subsequently be merged with an existing segment. If the allocation of a new label could be deferred, then the merger would be unnecessary. Such a change would also require modifying the row buffer to be run-length encoded, rather than storing every pixel. This would reduce the worst-case overhead from 1 in 5 to 1 in 8 (or 12.5%). For typical images, however, the number of non-propagating mergers is relatively low and the overhead of the Flatten operation is negligible (see FS in Fig. 9).

While RLSP can gain on images with large blobs (with long runs), in the worst case, with alternating sequences of individual pixels from the chessboard pattern, run-length coding does not help. RLSP has serial dependencies within the matching process that cannot easily be pipelined, resulting in an increased parallel MAI score.

Table 4 Comparison of MAIs per pixel for the worst-case patterns

6.5 Evaluation of Resources

While CAM gives the best performance in terms of MAIs, this comes at a heavy cost in terms of resources. Rather than implementing the provisional label cache, L, as a simple memory, the need to replace every instance of an old label during a merger requires the buffer to be implemented using registers. On an FPGA, this is implemented as a shift register with a multiplexer between each stage. While this situation may be improved by implementing the content-addressable memory in VLSI, it would still require significantly more logic than a simple memory-based row buffer.

The main limitation of OSP is that it must maintain data structures (DT and MT) that are proportional to the image area [22]. AR improved this by relabelling each row from L as it is processed, reducing the size of the data structures to the width of the image. SLCCA reduced the total on-chip memory required with improved memory management through label recycling (with augmented labels) [18]. This avoids the need for the additional translation table required by AR, making it better suited for hardware implementation. DLCCA improves the number of MAIs for finding the representative (root) labels (access to MT).

The major advantage of RLSP over the other run-based algorithms (such as HCS or LSL) is that it is a true single-pass algorithm. This allows the memory used by finished connected components to be recycled for subsequent ones. The memory requirements are, therefore, proportional to the image width.

7 Conclusions

Single-pass CCA algorithms are a relatively new class of algorithms designed and optimised for processing streamed image data using an embedded or hardware architecture, by extracting component feature vectors directly from the pixel stream. Real-time operation necessitates processing streamed pixel data at one pixel per clock cycle.

This paper provides the first detailed algorithmic perspective of single-pass CCA algorithms identifying and discussing the implicitly used set merging algorithms. These CCA algorithms have been examined and compared with CCL in terms of the union-find algorithm used for managing object mergers. Through this analysis, single-pass CCA algorithms have been unified with more conventional CCL algorithms on an analytical basis.

It has been shown that many single-pass algorithms use a single lookup variant of union-find. This variant is directly based on the order in which Union and Find operations are encountered in the context of processing a two-dimensional image as a pixel stream. The Find is replaced by a single lookup, which is only valid for trees of height less than or equal to one level. This requires an additional Flatten operation to reduce the height of labels to at most level 1 (to avoid stale labels) before any Find (or Union) is performed on those labels. It is shown that a sufficient condition for this is performing a Flatten at the end of each image row.

One of the key paradigms of single-pass algorithms is the online resolution of mergers, enabling the feature vector extracted from each component to be extracted and merged on the fly. The ability to defer the Flatten operation to the end of each row significantly relaxes the sequential data dependencies, enabling pipelined stream processing on an FPGA at 1 pixel per clock cycle.

The proof of correctness, and associated analysis, has shown the circumstances that lead to stale labels, where additional processing is required to ensure that data from each pixel is correctly associated with its corresponding connected component. Although early work on single-pass connected components analysis [1, 22] had identified sequences of non-propagating mergers as one instance of stale label creation, more complex cases involving propagating mergers had not previously been identified. From this insight, it may readily be seen that some algorithms from the literature are either incorrect in their operation (as described) or incomplete in their description, for example [20].

Algorithm analysis has also shown that the issues associated with stale labels can be avoided by using a second lookup. This led to the DLCCA algorithm, which performs the two lookups in successive clock cycles at the start of each run of pixels. Pipelining the lookups in this way, and only looking up the first pixel in a run, is shown to reduce the overall number of memory accesses. It also provides a unification between pixel-based and run-length-based algorithms.

In analysing the operation of single-pass algorithms, there is an obvious trade-off between processing speed and resources. Jeong et al.’s algorithm [16] avoids the overhead of flattening the trees at the end of each row, by immediately removing all references to the old label. This makes it potentially the fastest single pixel per clock cycle algorithm when implemented in hardware. However, the cost of this is replacing the RAM-based row buffer with a significantly more resource intensive multiplexed shift register. Ma et al.’s aggressive relabelling [22] incurs a small overhead at the end of each row for the Flatten, at the cost of additional resources for the translation table (and an additional lookup, although this can be pipelined). Klaiber et al. [18] reduce this (and the associated memory required) at the cost of a higher Flatten overhead.

It is hoped that, by outlining the necessary and sufficient conditions for correct operation, as well as comparing the strengths and weaknesses of existing CCA and CCL algorithms, this analysis will inspire further attempts at optimising the class of single-pass CCA algorithms.