Introduction

The urine sediment examination of biological particles in microscopic images is one of the most commonly performed in vitro diagnostic screening tests in clinical laboratories, and it plays an important role in evaluating the kidney and genitourinary system and in monitoring body state. General indications for urinalysis include: suspected urinary tract infection or urinary stone formation; non-infectious renal or post-renal diseases; and pregnant women and patients with diabetes mellitus or metabolic disorders who may have proteinuria, glycosuria, ketosis or acidosis/alkalosis [15, 18].

Traditionally, trained technicians count the number of each kind of urinary sediment particle by visual inspection. Manual urine sediment examination works, but it is labor-intensive, time-consuming, subjective, and operator-dependent in high-volume laboratories. All these issues have motivated many automated methods for the analysis of urine microscopy images (e.g. [1, 2, 19, 21, 25, 33]). As shown in Fig. 1a, almost all of them follow a multi-stage pipeline, i.e., first generating candidate regions based on segmentation and then extracting hand-crafted features over the regions for classification. Therefore, the performance of these methods depends heavily on the accuracy of the segmentation and the effectiveness of the hand-crafted features. However, due to the complicated characteristics of urinary images, precise segmentation of the particles of interest is quite difficult, or even impossible, and the resulting hand-crafted region features are often less discriminative.

Fig. 1 The pipelines for urinary particle recognition

To avoid the segmentation stage and improve the discriminability of features, as shown in Fig. 1b, we propose to exploit CNNs to automatically learn task-specific features and perform urinary particle recognition in an end-to-end manner. Specifically, we treat urinary particle recognition as object detection and exploit two well-known CNN-based object detection methods, Faster R-CNN [29] and SSD [23], along with their variants, including Multiple Scale Faster R-CNN (MS-FRCNN) [13], Faster R-CNN with online hard example mining (OHEM-FRCNN) [34], and the proposed Trimmed SSD, to locate and recognize the urinary particles. These end-to-end methods perform no explicit segmentation or hand-crafted feature extraction, but automatically learn more discriminative features from the annotated images. We also investigate different factors, such as training strategies, network structures, fine-tuning tricks, and data augmentation, to make these methods more appropriate for urinary particle recognition.

We summarize our contributions as follows:

  • We exploit Faster R-CNN [29] and SSD [23] for urine particle recognition. These methods are segmentation-free and learn task-specific features in an end-to-end manner.

  • We investigate various factors to improve the performance of Faster R-CNN [29] and its variants [13, 34] for urine particle recognition.

  • We propose a scheme, Trimmed SSD, to prune the network structure adopted in SSD [23] to achieve better performance for urine particle recognition.

  • We obtain a best mAP of 84.1% while taking only 72 ms per image for the recognition of 7 categories of urine sediment particles. Importantly, we also obtain a best AP of 77.2% for cast particles, the most valuable but most difficult-to-detect ingredient [4, 6, 40].

The remainder of this paper is organized as follows: Section “Related work” reviews related work on urine particle recognition and CNN-based object detection methods. Section “Meta-architectures” describes the detection architectures for urinary particle recognition, including Faster R-CNN, SSD and their variants. Section “Experiments” details the organization of the urinalysis database and provides extensive experimental analysis. Section “Adding bells & whistles” presents more intuitive experimental comparisons. Section “Conclusions” concludes the paper.

Related work

Urine particle recognition

The recognition of urinary sediment particles has been extensively studied following the traditional multi-stage pipeline (Fig. 1a), and a variety of approaches can be adopted in each stage.

Ranzato et al. [25] first obtained patches of interest by a detection algorithm, and then extracted invariant features based on “local jets” [31]. Although the system presents reliable recognition results on a pollen dataset, the localization accuracy for the patches of interest needed improvement. In [2], a new technique based on the adaptive discrete wavelet entropy energy was proposed for feature extraction, preceded by image preprocessing stages including noise reduction, contrast enhancement and segmentation; for classification, an artificial neural network (ANN) classifier was selected to achieve the best performance. Liang et al. [21] adopted a two-step process (a first location step and a second tuning step) to segment particle contours, and proposed a two-tier classification strategy to better reduce the false positive rate caused by impurities and poorly focused regions. Shen et al. [33] used AdaBoost to select a small subset of typical Haar features for support vector machine (SVM) classification, and improved system speed via a cascade accelerating algorithm. Zhou et al. [40] demonstrated an easily implemented automatic urinalysis system employing an SVM classifier to distinguish casts from other particles. After separating adhesive particles with the watershed algorithm, Li et al. [19] proposed to combine the Gabor filter with the scattering transform for robust feature description.

The above-mentioned conventional recognition models work for automated urinalysis, but, importantly, all stages (i.e., segmentation, feature extraction and classification) need to be carefully designed. In addition, the complicated characteristics of urine microscopic images bring further challenges to this task. Therefore, there is an increasing demand for better solutions that rely more on automatic learning and less on hand-designed heuristics.

CNN-based object detection

OverFeat [32] made the earliest effort to apply deep CNNs to learn highly discriminative yet invariant features for object detection, and achieved a significant improvement of more than 50% mAP compared to the best methods at that time, which were based on hand-crafted features. Since then, many advanced CNN-based methods (e.g. [7, 8, 23, 26,27,28,29]) have been proposed for high-quality object detection; they can be roughly classified into two categories: object proposal-based and regression-based.

An object proposal-based method first generates a series of proposals by applying region proposal methods and then classifies each proposal as background or a category-specific object. The notable R-CNN [8] generates about 2,000 region proposals by selective search [36] and warps each proposal box to a fixed size to extract CNN features for SVM classification, computing the features separately for every proposal. SPP-net [11] introduces a spatial pyramid pooling layer that can flexibly handle variable-size inputs, which avoids repeatedly computing the convolutional features (they are computed only once per image) and therefore accelerates R-CNN significantly. Fast R-CNN [7] extends SPP-net by replacing the spatial pyramid pooling layer with a ROI pooling layer and jointly optimizing a classification loss and a bounding-box regression loss. It can fine-tune all layers in an end-to-end manner, which significantly speeds up both training and testing.
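For illustration, the ROI pooling operation at the heart of Fast R-CNN is available directly in torchvision; the minimal sketch below (feature-map and box values are illustrative only) pools every region, whatever its size, to a fixed 7 × 7 grid:

```python
# Illustrative ROI pooling: each ROI, regardless of size, is pooled
# to the same 7x7 grid on the shared convolutional feature map.
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 512, 38, 50)        # conv feature map (stride 16 on an 800x600 image)
# Each ROI is (batch_index, x1, y1, x2, y2) in *image* coordinates.
rois = torch.tensor([[0.,  64.,  48., 256., 192.],
                     [0., 300., 200., 420., 330.]])
# spatial_scale maps image coordinates onto the feature map (1/16 here).
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7])
```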

Handcrafted region proposal methods such as selective search [36] or EdgeBoxes [41] are often time-consuming, which immediately becomes the bottleneck of an object detection system. Faster R-CNN [29] proposes a region proposal network (RPN) for generating region proposals and merges the RPN and Fast R-CNN into a single network by sharing their full-image convolutional features, thus enabling nearly cost-free region proposal generation. Faster R-CNN is flexible and robust to many follow-up improvements (e.g., [13, 14, 16, 20, 22, 38, 39]), and has achieved top performance on several benchmarks [10].

Regression-based methods reformulate object detection as a regression problem over separated bounding boxes and class-specific probabilities, and detect objects by regular and dense sampling over locations, scales and aspect ratios [9, 23, 26,27,28]. They do not require a proposal generation stage and are therefore much simpler than proposal-based methods. SSD [23] and YOLO [26,27,28] are two representative regression-based methods: YOLO [26] opened the door to real-time CNN-based object detection, and SSD [23] was proposed to improve YOLO's performance on small objects and its localization accuracy. Generally, regression-based methods are much faster than proposal-based methods, but their detection accuracy usually lags behind that of the proposal-based methods [39].

Meta-architectures

In this paper, we focus primarily on Faster R-CNN [29] and its structural variants, i.e., multiple scale Faster R-CNN (MS-FRCNN) [13] and Faster R-CNN with online hard example mining (OHEM-FRCNN) [34], for urinary particle recognition. We also investigate the performance of SSD [23] on urinary particle recognition and propose a trimmed variant, named Trimmed SSD.

Faster R-CNN and its variants

Faster R-CNN

[29] is a single unified network which integrates a fully convolutional region proposal generator (RPN) with a fast region-based object detector (Fast R-CNN) [7]. As shown in Fig. 2a, this deep detection framework can also be described as the pipeline of “shareable CNN feature extraction + region proposal generation + region classification and regression”. Moreover, to predict objects across multiple scales and aspect ratios, Faster R-CNN [29] adopts a pyramid of anchors with multiple scales and aspect ratios, which is a key component for sharing features without extra cost.
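As a concrete illustration of the anchor mechanism (a minimal sketch, not the authors' code), the following enumerates a scale/aspect-ratio pyramid at one feature-map location:

```python
# Enumerate Faster R-CNN-style anchors centered at the origin;
# the RPN slides this set over every feature-map position.
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (len(scales)*len(ratios), 4) anchors as (x1, y1, x2, y2)."""
    anchors = []
    for scale in scales:          # scale 8 on a stride-16 map ~ 128x128 pixels
        area = (base_size * scale) ** 2
        for ratio in ratios:      # ratio = height / width
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# 3 scales x 3 ratios = 9 anchors, i.e. the default {128^2, 256^2, 512^2} set;
# adding scales 4 and 2 would contribute the 64^2 and 32^2 anchors used below.
print(make_anchors().shape)  # (9, 4)
```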

Fig. 2 The architectures of Faster R-CNN, MS-FRCNN and OHEM-FRCNN

MS-FRCNN

[13] is a follow-up improvement that keeps the RPN unchanged and builds a more sophisticated network for the Fast R-CNN detector by combining both global context and local appearance features. As Fig. 2b shows, each object proposal receives three feature tensors through ROI pooling from the last three convolutional layers. After L2 normalization of each tensor, the outputs are concatenated and compressed to maintain the same size as in the original architecture.
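A rough PyTorch sketch of this fusion, under assumed VGG-16-like layer strides and channel counts, might look as follows:

```python
# Sketch of MS-FRCNN-style fusion: ROI-pool three conv layers,
# L2-normalize each pooled tensor, concatenate, and compress with
# a 1x1 conv back to the original 512-channel size.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_pool

class MSFusion(torch.nn.Module):
    def __init__(self, in_channels=(256, 512, 512), out_channels=512):
        super().__init__()
        self.squeeze = torch.nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, feats, rois):
        scales = (1 / 4, 1 / 8, 1 / 16)   # assumed strides of conv3/conv4/conv5
        pooled = [roi_pool(f, rois, (7, 7), spatial_scale=s)
                  for f, s in zip(feats, scales)]
        normed = [F.normalize(t, p=2, dim=1) for t in pooled]  # channel-wise L2 norm
        return self.squeeze(torch.cat(normed, dim=1))          # (K, 512, 7, 7)

sizes = ((150, 200), (75, 100), (38, 50))  # conv3/4/5 maps for an 800x600 input
feats = [torch.randn(1, c, h, w) for c, (h, w) in zip((256, 512, 512), sizes)]
rois = torch.tensor([[0., 64., 48., 256., 192.]])
print(MSFusion()(feats, rois).shape)       # torch.Size([1, 512, 7, 7])
```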

OHEM-FRCNN

is a combination of online hard example mining (OHEM) [34] and Faster R-CNN [29]. OHEM [34] is a novel bootstrapping technique for modern CNN-based object detectors trained purely online with SGD. Instead of a heuristically sampled mini-batch [29], it eliminates several commonly used heuristics and hyperparameters and automatically selects hard examples by their loss. As Fig. 2c shows, at each iteration, given the feature map from the shareable convolutional network and the ROIs from the RPN, the read-only ROI network performs a forward pass and computes the loss for all input ROIs. Then the regular ROI network computes forward and backward passes only for the hard examples selected by the hard ROI sampling module, according to a distribution that favors diverse, high-loss candidates.
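The selection step can be sketched as follows; `per_roi_loss` is a hypothetical stand-in for the read-only ROI network, and OHEM's NMS-based de-duplication of overlapping ROIs is omitted for brevity:

```python
# Simplified OHEM: score every ROI with a gradient-free forward pass,
# then run the regular forward/backward pass only on the hardest ROIs.
import torch

def per_roi_loss(roi_features: torch.Tensor) -> torch.Tensor:
    """Hypothetical placeholder: one scalar loss per ROI."""
    return roi_features.square().mean(dim=1)

roi_features = torch.randn(2000, 4096, requires_grad=True)  # features of all ROIs

with torch.no_grad():                      # read-only pass: no gradients kept
    losses = per_roi_loss(roi_features)
_, hard_idx = torch.topk(losses, k=128)    # keep the 128 highest-loss ROIs

loss = per_roi_loss(roi_features[hard_idx]).mean()
loss.backward()                            # gradients flow only through hard ROIs
```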

SSD and the Proposed Trimmed SSD

SSD

[23] can be decomposed into a truncated base network (usually a VGG-16 net [35]) and several auxiliary convolutional layers used as feature maps and predictors. Unlike Faster R-CNN [29], SSD increases detection speed by removing region proposal generation and the subsequent pixel or feature resampling stages. Unlike YOLO [26], it improves detection quality by applying a set of small convolutional filters to multiple feature maps to predict confidences and box offsets for objects of various sizes, as shown in Fig. 3.
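The prediction mechanism can be sketched as a single 3 × 3 convolution per source feature map (the sizes below are illustrative for a 300 × 300 input):

```python
# SSD-style prediction head: for each of num_anchors default boxes at
# each cell, one 3x3 conv predicts num_classes confidences + 4 offsets.
import torch

num_classes, num_anchors = 8, 6            # 7 particle categories + background
head = torch.nn.Conv2d(512, num_anchors * (num_classes + 4),
                       kernel_size=3, padding=1)

fmap = torch.randn(1, 512, 38, 38)          # e.g. conv4_3 for a 300x300 input
out = head(fmap)                            # (1, anchors*(classes+4), 38, 38)
# Reshape to (batch, boxes, classes+4): every cell contributes num_anchors boxes.
out = out.permute(0, 2, 3, 1).reshape(1, -1, num_classes + 4)
print(out.shape)                            # torch.Size([1, 8664, 12])
```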

Fig. 3 The architecture of SSD

Trimmed SSD

is the proposed method, a tailored version of the original SSD model [23] for urinary particle recognition. As Fig. 3 shows, from bottom to top, the original SSD selects conv4_3, fc7 (a convolutional layer), conv6_2, conv7_2, conv8_2, conv9_2 and pool6 as feature maps to produce confidences and locations. If we directly transfer it to urinary particle recognition with only 7 categories, it may produce a large number of redundant predictions that interfere with the final detection performance, and the framework is too complicated to fit our dataset well. For simplification, we remove several top convolutional layers from the auxiliary network of SSD, which leads to the Trimmed SSD.
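The trimming can be summarized by the source-layer lists below; whether pool6 stays attached after the removal is our reading of the description, not a configuration taken from the original model:

```python
# Illustrative only: SSD prediction source layers before and after trimming,
# following the layer names quoted above.
ssd300_sources  = ["conv4_3", "fc7", "conv6_2",
                   "conv7_2", "conv8_2", "conv9_2", "pool6"]
trimmed_sources = ["conv4_3", "fc7", "conv6_2", "pool6"]  # conv7/8/9 removed (pool6 assumed kept)
```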

Experiments

As there is no standard benchmark available, we first establish a database consisting of 6,804 urinary microscopic images with ground truth boxes marked by clinical experts. All annotated images have a size of 800 × 600 and contain objects from 7 categories of urinary sediment particles, i.e., erythrocyte (eryth), leukocyte (leuko), epithelial cell (epith), crystal (cryst), cast, mycete, epithelial nuclei (epithn), plus one background class. Figure 4 shows the 7 categories of urinary sediment particles from our database, each of which includes many subcategories with various shapes.

Fig. 4 Selected samples of urinary sediment particles

In fact, our 6,804 annotated images contain a total of 273,718 ground truth annotations, of which meaningless background occupies 230,919, up to eighty-four percent. We remove images containing only noise and finally obtain 5,376 useful images, whose ground truth boxes comprise 21,815 for eryth, 6,169 for leuko, 6,175 for epith, 1,644 for cryst, 3,663 for cast, 2,083 for mycete and 687 for epithn. From the final 5,376 images, we randomly select 268 images (1/20) as the test set and use the rest as the trainval set, of which the train set makes up 5/6. Figure 5 shows the details of the dataset organization and category distribution: the top pie chart shows how the 5,376 images are organized into train/val/test sets, and the bottom bar graphs display the detailed object distribution of this imbalanced dataset.
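For clarity, the split proportions above correspond to the following sketch (file names are placeholders):

```python
# Reproduce the split proportions: 1/20 of 5,376 images for test,
# the rest as trainval, of which 5/6 forms the train set.
import random

images = [f"urine_{i:05d}.jpg" for i in range(5376)]   # placeholder names
random.seed(0)
random.shuffle(images)

n_test = len(images) // 20                 # 268 test images
test, trainval = images[:n_test], images[n_test:]
n_train = len(trainval) * 5 // 6
train, val = trainval[:n_train], trainval[n_train:]
print(len(train), len(val), len(test))     # 4256 852 268
```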

Fig. 5 Dataset organization and categories distribution

By default, we use the PASCAL-style Average Precision (AP) at a single IoU threshold of 0.5 and the mAP as metrics to evaluate the different detection architectures. Due to the limited data, we adopt the widely used transfer learning mechanism, i.e., we first initialize with models pre-trained on the ImageNet dataset [30] and then fine-tune them on our own dataset.
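For reference, the IoU criterion underlying AP at 0.5 can be computed as follows (a minimal sketch): a detection counts as a true positive only when its IoU with a ground-truth box of the same class reaches 0.5.

```python
# Intersection-over-union for axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # 0.333... -> no match at 0.5
```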

Urinary particle recognition based on Faster R-CNN

Feature extractors

We first apply a convolutional feature extractor to the input image to obtain high-level features. We mainly focus on four feature extractors: ZF net [37], VGG-16 net [35], ResNet [12] (including ResNet-50 and ResNet-101), and PVANet [16]. We fine-tune all convolutional layers of ZF net and PVANet, but only conv3_1 and up for VGG-16 net and ResNet.

Training strategies

When training Faster R-CNN, we fine-tune the pre-trained models with SGD for 70k mini-batch iterations (unless specified otherwise), with a mini-batch size of 128 on 1 NVIDIA Titan X GPU, a momentum of 0.9 and a weight decay of 0.0005. We start from a learning rate of 0.001 and decrease it by a factor of 10 after 50k iterations. Fine-tuning PVANet [16], however, adopts the plateau learning rate policy: a base learning rate of 0.003, a gamma of 0.3165 and a different weight decay of 0.0002. As is well known, there are two training solutions: 4-step alternating training and approximate joint training (also called end-to-end training). In order to select the more effective and efficient solution for the subsequent network training, we design this experiment based on ZF net [37] and VGG-16 net [35]. Table 1 shows that approximate joint training takes less time yet yields a higher mAP (nearly the same accuracy on VGG-16 net), so the following series of experiments all adopt the end-to-end training solution.
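For reference, the step policy used for ZF and VGG-16 amounts to the following minimal sketch (assumed semantics: a single tenfold drop at 50k of the 70k iterations):

```python
# Step learning-rate schedule: 0.001 until 50k iterations, 0.0001 after.
def learning_rate(iteration, schedule=((0, 0.001), (50_000, 0.0001))):
    lr = schedule[0][1]
    for start, value in schedule:
        if iteration >= start:
            lr = value
    return lr

print(learning_rate(10_000), learning_rate(60_000))  # 0.001 0.0001
```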

Table 1 Training time and mAP by different training solutions

Anchor scales

Unlike generic objects in natural images, the particles of urinary sediment vary widely in shape, size and number. Moreover, some urinary microscopic images contain many small objects (like erythrocytes and leukocytes), so as many anchor scales as possible should be covered in our experiment, especially small ones.

We compare the detection results under varying anchor scales. First, for the ZF, VGG-16 and ResNet networks we choose the default settings (anchor scales of {128², 256², 512²} and aspect ratios of {1:1, 1:2, 2:1}) as benchmarks. Then, keeping the aspect ratios unchanged, we gradually add anchors with smaller scales (i.e., 64² and 32²). The comparative results are listed in Table 2, which shows that more anchors yield a higher mAP in general. However, extending the anchor scales from {64², 128², 256², 512²} to {32², 64², 128², 256², 512²} does not achieve better performance on either ZF net or VGG-16 net. This is mainly because the capacity of these networks becomes saturated, since we do get an accuracy boost when using ResNet-50 and ResNet-101. Further, we remove the 512² scale for comparison on the ZF and VGG-16 nets only. On ZF net, the scales {64², 128², 256²} use the same 9 anchors as {128², 256², 512²} but outperform them by 3.4% mAP. Similarly, on VGG-16 net, the performance is improved by 0.5% mAP. This indicates that most particles in our dataset are small objects and that small anchor scales are indispensable. In addition, we note that deeper networks take more test time, but anchor scales have little impact on detection cost. Finally, it is worth mentioning that PVANet, which performs best, takes less test time despite its deeper layers, partly because of its richer anchor set (5 scales × 5 aspect ratios) and thin structure.

Table 2 Comparisons of detection results using different networks and different anchor scales

Data augmentation

Commonly, adopting data augmentation in deep learning expands the training samples, avoids over-fitting and improves test accuracy, especially for small-scale training sets. Faster R-CNN already adopts a horizontal flip to augment the training set. Empirically, we append a vertical flip to further expand the training data. For comparison, we also remove all data augmentation and use only the original data in the training stage. Table 3 shows that adopting a horizontal flip or a vertical flip alone does increase the mAP; however, appending a vertical flip on top of a horizontal flip brings no further benefit.
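Note that flip augmentation for detection must also mirror the ground-truth boxes; a minimal sketch:

```python
# Flipping the image requires mirroring the box coordinates as well.
def hflip_boxes(boxes, img_w):
    """boxes: list of (x1, y1, x2, y2) -> horizontally flipped boxes."""
    return [(img_w - x2, y1, img_w - x1, y2) for (x1, y1, x2, y2) in boxes]

def vflip_boxes(boxes, img_h):
    return [(x1, img_h - y2, x2, img_h - y1) for (x1, y1, x2, y2) in boxes]

print(hflip_boxes([(10, 20, 110, 120)], img_w=800))  # [(690, 20, 790, 120)]
```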

Table 3 The effect of data augmentation on test precision

MS-FRCNN

Generally, Faster R-CNN obtains excellent performance on several natural image benchmarks, where objects usually occupy a large portion of the image. But as mentioned in the previous section, most objects in urine sediment microscopic images are small and of low resolution. Faster R-CNN uses only one higher convolutional layer as the feature map, which makes it hard to detect small objects because of the large stride and receptive field size. Inspired by [24], which combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer for accurate and detailed segmentation, many multi-scale approaches have been proposed [3, 5, 13, 16, 17], so that the size of the receptive field can match objects of various sizes, especially small instances.

In order to validate the effectiveness of multi-scale methods for urinary particle recognition, we conduct a series of experiments based on the MS-FRCNN architecture [13]. The final results are shown in Table 4. Overall, MS-FRCNN takes more test time per image and its mAPs are slightly worse than those of the original Faster R-CNN (FRCNN). But we can make an interesting observation from the table: as the number of anchors increases, the gap between the final precisions becomes smaller (a difference of 0.4%). In addition, the accuracy on small objects (i.e., eryth, leuko, and epithn) is superior to that without the multi-scale structure. It should be mentioned that PVANet also contains a multi-scale structure [17]; we argue that the excellent performance of PVANet partly benefits from it.

Table 4 Comparisons on ZF net using different anchor scales when adding a multi-scale structure from MS-FRCNN

OHEM-FRCNN

We choose Faster R-CNN as the base object detector and embed the novel bootstrapping technique, online hard example mining (OHEM). As reported in Table 5, OHEM improves the mAP of Faster R-CNN (FRCNN) from 79.5% to 81% while taking approximately the same test time. Specifically, all categories except leukocyte yield better APs, with erythrocyte, cast and epithelial nuclei benefiting the most. In addition, the gains from OHEM can be increased by enlarging and complicating the training set.

Table 5 Comparisons between FRCNN and OHEM-FRCNN on the VGG-16 net using the same anchor scales of {32², 64², 128², 256², 512²}

Urinary particle recognition based on SSD

When training SSD, we fine-tune a pre-trained model with SGD for 120k mini-batch iterations, with a mini-batch size of 32 on 1 NVIDIA Titan X GPU (a mini-batch size of 16 on 2 GPUs when training SSD500), a momentum of 0.9 and a weight decay of 0.0005. By default, we adopt the multistep learning rate policy with a base learning rate of 0.001 (0.01 when batch normalization is used for all newly added layers), step values of [80,000, 100,000, 120,000] and a gamma of 0.1.

Scales of the default boxes

We have seen that SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales at each feature map cell. In order to relate the default boxes from different feature maps to corresponding receptive fields, Liu et al. [23] designed a scale strategy that regularly but coarsely maps specific boxes to specific areas of the image: the lowest feature map has a minimum scale of Smin, the highest feature map has a maximum scale of Smax, and the scales of all feature maps in between are regularly spaced (for more details, please refer to [23]).
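Concretely, with $m$ feature maps used for prediction, the scale of the $k$-th map follows the rule given in [23]:

$$
s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}\,(k - 1), \qquad k \in [1, m],
$$

with defaults $s_{\min} = 0.2$ and $s_{\max} = 0.9$.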

Considering the many small particles in urine sediment images, we empirically adjust the scales of the default boxes when training SSD300. The experimental results are listed in Table 6, from which we can see that decreasing the minimum scale from 0.2 to 0.1 (the maximum scale of 0.9 remains unchanged) increases mAP by 2%, and the AP of cast increases by 14.1%.

Table 6 Detection results using the SSD model, where SSD300 has an input size of 300 × 300, SSD500 increases it to 500 × 500, and the penultimate row, SSD300, represents a Trimmed SSD with the conv7, conv8, and conv9 layers removed

Input sizes

Generally, increasing the size of input images can improve detection accuracy, especially for small objects. We therefore increase the input size from 300 × 300 to 500 × 500. Here we train SSD500 only once, with a minimum scale of 0.1 and a maximum scale of 0.9. Unfortunately, we obtain a poor mAP of 65.8% (the last row in Table 6), because the batch size had to be decreased from 32 to 16 to run this model within limited GPU resources. We argue that better results could be achieved if the batch size were increased.

Trimmed SSD

In order to reduce the complexity of the original SSD method and avert over-fitting on the small-scale urinalysis database, we propose a Trimmed SSD that removes the conv7, conv8, and conv9 layers of SSD300. Surprisingly, as shown in the penultimate row of Table 6, such simple pruning does yield a better mAP (a boost of 4.1%).

Adding bells & whistles

As mentioned, the Faster R-CNN methods exceed the SSD methods in detection accuracy, but are often slower due to the region proposal generation stage. In this section, we study in detail the impact of several factors on region proposal generation, building on the above experiments. Further, we compare PVANet against VGG-16 on specific detection performance more intuitively.

Region proposal generation

Anchor scales

We further analyze the influence of anchor scales on object proposals using the VGG-16 net. The recall curves for different anchor scales at different numbers of proposals are plotted in Fig. 6a; the corresponding detection performances are shown in Table 2 (the VGG-16 module). As Fig. 6a shows, starting from the default anchor scales {128², 256², 512²}, the proposal recall increases gradually as smaller scales (i.e., 64² and 32²) are added, and the scales {64², 128², 256²} outperform the scales {128², 256², 512²} by a significant margin, coming close to the scales {64², 128², 256², 512²}. These results are consistent with the detection accuracy in terms of mAP and APs, and indicate two key points about anchor scales: (1) more is better, and (2) smaller is better. A reasonable design of anchor scales benefits both proposal generation and final detection.

Fig. 6 Analysis of region proposal generation on the urinalysis database. a Recall versus number of proposals for different anchor scales using VGG-16 with a fixed IoU of 0.5. b Recall versus number of proposals for different networks with a fixed IoU of 0.5. c Recall versus IoU threshold for different networks with a fixed number of proposals (600)

Feature extractors

Table 2 has already illustrated the detection performances, with respect to accuracy and speed, of different networks. Treating the RPN as a class-agnostic object detector, we further investigate the proposal quality of the different networks. Figure 6b, plotting recall versus number of proposals with a loose IoU of 0.5, shows little difference between the networks when adopting the best anchor scales of {32², 64², 128², 256², 512²}. For higher IoU thresholds, as shown in Fig. 6c, the recall of PVANet drops faster than that of the other networks.

PVANet vs. VGG-16

Although PVANet performs worse in region proposal generation, it eventually achieves the highest detection result, an mAP of 84.1%, as shown in Table 2. We conjecture that this owes to its more sophisticated Fast R-CNN detector stage. To verify the conjecture, we compare PVANet against VGG-16; Fig. 7 shows the precision-recall curves of the two networks. PVANet stably maintains higher precision as recall increases, whereas the precision of the VGG-16 net drops sharply at high recall. For the detection of cast and epithelial nuclei, PVANet also performs better than the VGG-16 net.

Fig. 7 Precision versus recall. a VGG-16 net with anchor scales of {32², 64², 128², 256², 512²} and an mAP of 79.5%. b PVANet with an mAP of 84.1%

Conclusions

In this paper, we treat urinary particle recognition as object detection and propose to exploit modern CNN-based object detectors, Faster R-CNN and SSD, for automatic urinary particle recognition. These methods are segmentation-free and learn task-specific features in an end-to-end manner. When applying Faster R-CNN, SSD and their variants to urinary particle recognition, we effectively adopt the mechanism of deep transfer learning. Moreover, we conduct extensive experimental analysis to demonstrate the impact of various factors, including training strategies, network structures, anchor scales, and so on. After comprehensive experiments, we obtain a best mAP of 84.1% with a test time of 70 ms per image using Faster R-CNN on PVANet.