1 Introduction

Indoor object detection and indoor scene understanding are basic tasks for many applications including autonomous robot navigation [48] and mobility assistive devices for people with visual impairments (VIP) [17].

For independent mobility, VIPs need to perceive the relevant objects in their immediate surroundings. As VIPs cannot see landmarks or such indoor objects, an assistive device must indicate their presence.

Perceiving indoor objects in real indoor scenes is a challenging task, as many complex issues such as background complexity, occlusion and viewpoint changes must be taken into account. To address this problem, a fully labeled indoor object dataset was built for detection purposes. It consists of 8000 indoor images covering 16 of the most frequent indoor landmark object classes.

Moreover, robotic and human navigation assistance requires real-time processing. Deep Convolutional Neural Networks (DCNN) may be a solution to achieve such temporal performance.

A deep CNN combines two concepts: deep learning and convolutional neural networks. The combination involves millions of parameters that encode the representation of the acquired images; these parameters, which are relevant to a specific task, are learned during the training phase.

Furthermore, DCNNs differ substantially from traditional object detection approaches. Indeed, DCNN models build powerful ad hoc object representations by extracting increasingly informative features in each layer of the network.

A key particularity of deep learning models is the hierarchical representation of features. Features computed in intermediate layers can be reused across different applications and tasks, while features produced by the last layers are specific to the targeted application and to the dataset used. The convolutional part of a DCNN (layers closer to the input) extracts general features, while the classification part (layers closer to the output) extracts task-specific features.

Deep learning detection models can be divided into two principal categories: region proposal-based models (such as R-CNN [15], Fast R-CNN [16], Faster R-CNN [43] and Mask R-CNN [19]) and proposal-free methods (such as YOLO [42], YOLO9000 [41], YOLOv3 [40] and SSD [35]).

The efficiency and accuracy of object detection with deep learning models come from the fact that a DCNN learns dense per-pixel features from a large number of images during training; deep CNN structures thus ensure good extraction of the most relevant image features.

State-of-the-art deep learning models rely heavily on large-scale datasets such as ImageNet [9], MS COCO [33], PASCAL VOC 2007 [13] and VOC 2012 [14], yet they generally underperform on indoor object detection, since indoor images present complex scenes. Moreover, despite the variety of backgrounds, multiple indoor objects, multiple positions and different scales, the objects covered by these datasets correspond to classic situations and do not consider the specific needs of visually impaired people (VIP).

This paper proposes a new fully labeled dataset for indoor object detection and recognition that is relevant to VIP mobility. The originality of the proposed dataset comes from the inclusion of new characteristics of 3D scenes not considered so far and relevant to VIP mobility. Such new training data makes object recognition more robust and may be used in any (assistive) navigation system. This paper also presents the first evaluation of the YOLOv3 architecture on indoor object detection. The aim of this work is to provide visually impaired persons with a robust indoor object detection system that helps them better explore and interact with their surrounding environment and better integrate into daily life. The proposed work achieves very encouraging results in terms of detection accuracy and speed, which meet VIP mobility requirements, and maintains high detection performance in challenging conditions such as extreme lighting, heavy occlusion and high intra- and inter-class variation.

The remainder of the paper is organized as follows: Section 2 reviews related work on indoor object detection. Section 3 presents the proposed multi-class indoor object detection and recognition dataset. Section 4 describes the proposed approach for indoor object detection based on a pre-trained DCNN. Section 5 outlines the experimental evaluation of the proposed approach and discusses the obtained results. Finally, Section 6 concludes the paper and proposes further extensions of the work.

2 Indoor object recognition: related works

Many researchers have shown strong interest in real-time indoor object detection. The challenge is to detect objects correctly and accurately in an image or a video. Indoor environments generally differ from outdoor scenery: indoor scenes are typically composed of a wide range of background elements and different interior decorations.

Since the appearance of RGB-D sensors, such as Kinect cameras, which provide not only color but also depth information, many RGB-D-based works have been used to guide indoor robot navigation [25]. However, detection becomes very challenging when it is used to recognize specific objects or unknown obstacles in unfamiliar natural environments [20].

Depth sensors have also been widely used for object detection and recognition during simultaneous localization and mapping (SLAM). Chae et al. [6] introduced a framework for indoor object recognition for SLAM in indoor scenery.

Many other classic works address indoor object detection with machine learning techniques [30, 37]. However, this category of methods usually involves complex pipeline designs, which make them highly dependent on computational resources and very expensive computationally; the real-time constraint is usually not met.

During the last few years, DCNN models have gained great attention in many computer vision tasks. They have been used for indoor object recognition [10, 11], indoor object segmentation [8, 44], detection tasks [46], the Internet of Things (IoT) [31], grasping force prediction [36] and authentication systems [24, 32]. To enhance indoor object detection, it is necessary to build new reliable classification and detection systems. Kim et al. [27] trained a deep CNN model, ConvNet, to be used for autonomous indoor navigation of robots. Another DCNN-based work, PoseNet, which focuses on predicting objects together with their poses, was introduced by Kendall et al. [26]. This model combines the strengths of DCNN models with SLAM techniques while respecting a hard real-time constraint.

Chen et al. [7] presented a new visual indoor positioning system based on CNN models. To better address the indoor positioning problem, the authors proposed a localization method consisting of feature extraction using a DCNN followed by pose estimation.

Sho et al. [45] proposed a deep learning-based indoor positioning system to satisfy the increasing demand for such systems. The authors adopted a DCNN to implement their orientation-free positioning system: their hybrid approach relies on WiFi fingerprint images, with a CNN used to classify locations.

After many years of research in the field of deep neural networks, DCNNs still constitute the best choice for many computer vision problems and are widely used for indoor scene recognition [34]. To design their scene classification system, the authors used ResNet in all its versions (ResNet-50, ResNet-101, ResNet-152) [18], with ImageNet-11K [29] and Places365 [49] for training. Bashiri et al. [3] developed a deep learning-based system dedicated to detecting three object classes (doors, stairs and signs); building an appropriate representation of a specific environment remains a challenging issue in robotics. Escalona et al. [12] presented a 3D object detection system based on RGB-D cameras. They used semantic image labeling, which is very suitable for human-robot interaction as it facilitates the robot's interaction with its surrounding environment.

Over the last few years, data-driven DCNNs have outperformed classic approaches, and many indoor object detection approaches have been proposed. However, none of them is suitable for the independent mobility of VIPs. The following sections present a novel, efficient indoor object detection approach based on a DCNN model, along with a new dataset created to train and test the proposed detection system.

3 Proposed multi-class indoor object detection and recognition dataset (IODR)

Several indoor object datasets have been proposed in the literature. Bachiri et al. [4] proposed an indoor dataset for indoor object classification containing 20,000 indoor images and 3 indoor classes (door, sign, stairs). Quattoni et al. [39] proposed a dataset designed specifically for indoor scene recognition, with 15,620 images covering 67 indoor scene categories. Xiao et al. [47] proposed an extensive database named “Scene UNderstanding” (SUN) containing 899 scene categories and over 130,519 images; it is used mainly for scene understanding and covers various categories, both indoor and urban or natural outdoor scenes. Existing indoor datasets used for indoor object detection remain limited: they do not capture a wide variety of indoor objects while covering challenging situations such as lighting changes, occlusion and varied object positions. For this reason, we collected and fully labeled an indoor object dataset covering various challenging situations, to be used subsequently to train and test the proposed indoor object detection system.

The Multi-class Indoor Object Detection and Recognition (IODR) dataset is a new fully labeled indoor object dataset [1]. This fully annotated dataset (§3.1) can be highly recommended for training and testing different DCNN models (§3.2) while prototyping a new indoor navigation assistance system.

3.1 Data preparation and annotation

The biggest challenge is to provide VIPs with relevant information on their indoor navigation environment. Some images of the proposed dataset are collected from the NAVIIS project [21]; new indoor images contain indoor objects vital for VIP mobility (identified together with VIPs) under different lighting conditions and with complex backgrounds. The dataset was labeled using the LabelImg software [22]. The new original dataset encompasses 8000 annotated indoor images in which 16 indoor object classes are considered.

Figure 1 presents an example of the proposed annotation process with the LabelImg tool: objects are delimited by their rectangular bounding boxes, whose coordinates in the image are stored with each annotation.
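
The annotation format is not detailed here; assuming the LabelImg annotations are exported in YOLO txt format (one line per object with a class ID and a bounding box normalized to [0, 1]), a minimal parser could look like the following sketch (file names and layout are illustrative):

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class BBox:
    class_id: int    # index into the 16 indoor classes (Table 1)
    x_center: float  # coordinates normalized to [0, 1]
    y_center: float
    width: float
    height: float


def load_yolo_annotation(txt_path: Path) -> List[BBox]:
    """Parse one YOLO-format annotation file (assumed layout: one object per line)."""
    boxes = []
    for line in txt_path.read_text().splitlines():
        if not line.strip():
            continue
        cls, xc, yc, w, h = line.split()
        boxes.append(BBox(int(cls), float(xc), float(yc), float(w), float(h)))
    return boxes


# Hypothetical usage: boxes = load_yolo_annotation(Path("annotations/corridor_0001.txt"))
```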

Fig. 1

LabelImg annotation example: .jpg image (left picture) and its annotated equivalent (bbox) (right picture)

The proposed dataset is composed of many categories of indoor objects. It contains 8000 indoor images captured under different lighting conditions, in order to obtain a robust (scene illumination invariant) dataset. Two image resolutions are present in the dataset: 1616 × 1232 and 4592 × 3448.

The collected dataset contains 16 main landmark object classes that are usually present in any indoor scene, especially in corridors: door, light switch, smoke detector, chair, fire extinguisher, sign, window, heating, electricity box, stairs, table, security button, trash can, elevator and notice table. All images are in .jpg format.

The proposed dataset provides various characteristics important for VIP mobility; it is original in terms of:

  • Light invariance: objects are taken under different lighting conditions (day, night, blurred).

  • Geometrical change invariance: the objects are taken under different angles and poses.

  • Relevance: the objects provided in the proposed dataset are vital for VIP indoor mobility.

  • Occlusion: parts of the objects are hidden or overlapped by other objects.

  • Highlighting the presence of dangerous situations, such as descending stairs, to ensure safe mobility for the VIP.

  • The proposed dataset is very suitable to develop new robust indoor object detection systems.

  • High inter and intra-class variation.

Figure 2 illustrates the wide intra-class variation of doors in the proposed dataset: doors appear with many shapes, poses, colors and textures. Annotations cover different door poses, door status (opened or closed) and materials with different textures (wood, glass, iron). The biggest strength of the proposed dataset is that it provides many challenging conditions in order to perform robust training that copes with different indoor environments belonging to various buildings and remains relevant to VIP mobility.

Fig. 2

Doors intra-class variation

In contrast to existing indoor datasets, the proposed dataset takes into account many challenging conditions, such as heavy occlusion, different lighting conditions and complex backgrounds, in order to increase the robustness of the indoor object detector. It also provides high inter- and intra-class variation to build an accurate detector, and it is well suited to multi-object problems as it covers various indoor object classes.

3.2 Training and testing subsets

After annotation, the training and testing configurations must be prepared. The dataset was divided into training and testing sets: 66% of the dataset was reserved for training and the rest was used for testing. The proposed dataset contains 16 indoor object classes. Table 1 lists all indoor object class names and IDs, which support a better understanding of the indoor scenes. The original images included in the dataset were selected with respect to two main issues of VIP mobility: first, providing the most relevant indoor objects and landmarks; second, providing a good annotation to better understand the indoor scene.
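
Only the 66%/34% ratio is specified; a minimal sketch of such a random split (directory layout and file naming are assumptions for illustration):

```python
import random
from pathlib import Path


def split_dataset(image_dir: str, train_ratio: float = 0.66, seed: int = 0):
    """Randomly split the image files into training and testing subsets."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * train_ratio)
    return images[:n_train], images[n_train:]


# Hypothetical usage: train_set, test_set = split_dataset("IODR/images")
# With 8000 images this would give 5280 training and 2720 testing images.
```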

Table 1 Indoor object class names and IDs present in the collected indoor dataset

4 Proposed architecture for indoor object detection

Deep learning models have proved highly effective in computer vision, in particular for object detection tasks. Precise and fast indoor object detection and recognition, in images and videos, is a very important task as it supports VIP understanding of, and interaction with, the external world.

YOLOv3 offers the best compromise between speed and accuracy for object detection [40], which makes it the best choice for this type of application, especially indoor navigation assistance for visually impaired persons. Indoor navigation assistance systems require (fast) real-time object detection as well as high detection accuracy, since safe displacement is the target.

As classic DCNN training requires a long time, the proposed system uses transfer learning [38], which requires less data. Transfer learning, a fast-growing area of artificial intelligence and especially of deep learning, commonly builds upon pretrained deep CNN models.
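
The concrete fine-tuning code is not part of the paper; as an illustration only, a generic transfer-learning step in PyTorch could look like the sketch below, which reuses COCO-pretrained weights wherever layer shapes still match and leaves the class-dependent detection layers to be re-initialized (the model object stands for whatever YOLOv3 implementation is fine-tuned):

```python
import torch
import torch.nn as nn


def load_pretrained_backbone(model: nn.Module, coco_checkpoint: str) -> nn.Module:
    """Initialize a detector with COCO-pretrained weights, keeping only the layers
    whose shapes still match after changing the class count (80 -> 16 classes);
    the remaining (class-dependent) layers keep their fresh initialization."""
    state = torch.load(coco_checkpoint, map_location="cpu")
    own = model.state_dict()
    compatible = {k: v for k, v in state.items() if k in own and v.shape == own[k].shape}
    model.load_state_dict(compatible, strict=False)
    return model


# Fine-tuning then continues on the indoor dataset, typically with a small
# learning rate, e.g. torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9).
```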

This section provides an overview of the Darknet-53 used by YOLO v3 as a feature extractor (§ 4.1) followed by details of the proposed architecture which is used for indoor object detection (§ 4.2).

4.1 YOLO V3 backbone: Darknet-53

YOLOv3 relies on a custom fully convolutional neural network named Darknet-53 [40]. It makes use of residual blocks, skip connections and up-sampling, and allows fine-grained features to be detected in images. Darknet-53 originally comprises 53 convolution layers trained on ImageNet [9]; it is mainly composed of 3 × 3 and 1 × 1 convolution layers with skip connections. Table 2 lists all processing layers of Darknet-53, while Fig. 3 outlines the architecture of its residual blocks.

Table 2 Darknet-53 Architecture contents
Fig. 3

Residual block architecture

First, the top of the network uses a convolution layer with a 3 × 3 kernel. The image size is down-sampled by strided convolutions instead of pooling layers. As mentioned in [2], using strided convolutions instead of pooling is more efficient in terms of memory and speed.

Darknet-53 also deploys a set of residual blocks, where each block is composed of 1 × 1 and 3 × 3 convolution layers. Table 3 compares the performance of Darknet-53 and ResNet [18] in terms of accuracy and BFLOP count.
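
A minimal PyTorch-style sketch of these two building blocks (strided down-sampling convolution and a Darknet residual block), written for illustration rather than taken from the Darknet source:

```python
import torch
import torch.nn as nn


def conv_bn_leaky(in_ch, out_ch, kernel_size, stride=1):
    """Convolution + batch normalization + LeakyReLU, the basic Darknet-53 unit."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )


class DarknetResidual(nn.Module):
    """1 x 1 bottleneck followed by a 3 x 3 convolution, with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = conv_bn_leaky(channels, channels // 2, 1)
        self.conv2 = conv_bn_leaky(channels // 2, channels, 3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))


# Down-sampling by a strided 3 x 3 convolution instead of pooling, then one residual block:
downsample = conv_bn_leaky(64, 128, 3, stride=2)
block = DarknetResidual(128)
y = block(downsample(torch.randn(1, 64, 208, 208)))  # -> shape (1, 128, 104, 104)
```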

Table 3 Performance comparison of the Darknet-53 and ResNet backbones [40]

Table 3 shows that the Darknet-53 implementation is more efficient than that of ResNet-152 [18]: it achieves 1457 BFLOP/s, which makes it about two times faster than ResNet-152 with comparable accuracy.

4.2 YOLOv3 for object detection

YOLOv3 [40] is the third version of the YOLO family [41, 42]. This version brings many speed and accuracy improvements. YOLOv3 extends the Darknet-53 backbone with an additional convolutional detection head, leading to 106 convolution layers in total. Generally, the YOLO family models formulate detection as a regression problem.

Objects in images usually have different sizes: small, medium and big (relative to the image size). For indoor object detection it is important to detect all objects regardless of their size.

YOLOv3 [40] shows a better ability to detect objects at multiple scales. Indeed, YOLOv3 adopts a Feature Pyramid Network-like (FPN-like) structure to detect objects at different scales. The FPN scheme involves two data movements: a bottom-up pass and a top-down pass (cf. Fig. 4). In the bottom-up pass (feature maps down-sampled by 2 at each stage), the semantic information (object characteristics) increases but the localization precision decreases; in the top-down pass, up-sampling increases the localization accuracy, using the information provided by the additional lateral connections coming from the bottom-up pass.

Fig. 4

FPN-like structure used in the YOLOv3 architecture [40]

To perform the feature detection, the input image is subdivided into a grid of detection cells.

The multi-scale detection algorithm (top-down data movement) implements the following three steps (cf. Fig. 5); a simplified code sketch is given after the list:

  • Step 1 (big object detection): Prediction (localization) of the features of big objects using the last feature map (of the top layer)

  • Step 2 (medium object detection): the feature map of Step 1 is up-sampled by 2 and merged with the bottom-up feature map of the same size; a convolution is applied to the merged feature map and the locations of medium-size objects are predicted.

  • Step 3 (small object detection): the feature map of the convolution layer of Step 2 is up-sampled by 2 and concatenated with the corresponding bottom-up feature map of the same size; a convolution is applied to the resulting feature map and the locations of small objects are predicted.
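
A simplified sketch of this top-down merging (channel counts are illustrative; the actual YOLOv3 head also applies 1 × 1 convolutions before up-sampling):

```python
import torch
import torch.nn.functional as F


def merge_scales(c3, c4, c5):
    """Illustrative FPN-like merging of three backbone feature maps.

    c3, c4 and c5 are progressively coarser bottom-up feature maps (strides 8,
    16 and 32). Big objects are predicted from c5 directly; medium and small
    objects use up-sampled coarse maps concatenated with the matching
    bottom-up maps (the lateral connections)."""
    p5 = c5                                                         # step 1: big objects
    p4 = torch.cat([F.interpolate(p5, scale_factor=2), c4], dim=1)  # step 2: medium objects
    p3 = torch.cat([F.interpolate(p4, scale_factor=2), c3], dim=1)  # step 3: small objects
    return p3, p4, p5


c3 = torch.randn(1, 256, 52, 52)
c4 = torch.randn(1, 512, 26, 26)
c5 = torch.randn(1, 1024, 13, 13)
p3, p4, p5 = merge_scales(c3, c4, c5)  # spatial sizes 52x52, 26x26 and 13x13
```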

Fig. 5

YOLOv3 model simplified Architecture [23]

The prediction of object localizations (bounding boxes) is performed by a 1 × 1 convolution layer whose output has, for each grid cell, B × (4 + 1 + C) channels, where B is the number of bounding boxes predicted per cell, “4” refers to the bbox attributes (tx, ty, tw, th), “1” is the object confidence and C is the number of classes. In the proposed approach, 3 boxes per grid cell (one per anchor) and 16 indoor classes are used, giving an output of 3 × (4 + 1 + 16) = 63 channels per grid cell.
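
A few lines are enough to check these sizes for the configuration used here (3 anchors per cell, 16 classes, 416 × 416 input as in the example further below):

```python
B, C = 3, 16                       # anchors per grid cell, indoor classes
per_cell = B * (4 + 1 + C)         # 4 bbox attributes + 1 objectness + C class scores
print(per_cell)                    # 63 output channels of the 1 x 1 prediction convolution

for grid in (13, 26, 52):          # the three detection scales for a 416 x 416 input
    print(grid, grid * grid * B)   # boxes predicted at this scale
# Total: (13*13 + 26*26 + 52*52) * 3 = 10,647 boxes.
```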

For each predicted bbox, YOLOv3 assigns an objectness score, which quantifies how likely an image window is to contain an object. The objectness score is computed with independent logistic classifiers; using such classifiers reduces the computational complexity of the processing.

A bbox is predicted through 4 raw outputs (tx, ty, tw, th) (cf. Fig. 6): (tx, ty) encode the position of the bbox center relative to its grid cell, while (tw, th) encode its width and height relative to the anchor dimensions. Assuming (cx, cy) are the top-left corner coordinates of the grid cell in the feature map, the final predicted bbox parameters bx, by, bw, bh are obtained with the following equations (a small decoding sketch is given after the parameter list below):

$$ {\displaystyle \begin{array}{c}{b}_x=\sigma \left({t}_x\right)+{c}_x\\ {}{b}_y=\sigma \left({t}_y\right)+{c}_y\\ {}{b}_w={p}_w\,{e}^{t_w}\\ {}{b}_h={p}_h\,{e}^{t_h}\end{array}} $$
Fig. 6

Object detection approach based on the bounding box technique used in YOLOv3 [40]

Where:

  • pw, ph are the width and height of the anchor box (prior) associated with the prediction; 3 bboxes are predicted at each cell of the output feature map;

  • σ is the sigmoid function σ(x) = 1/(1 + e^{-x}).
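
A small NumPy sketch of this decoding for a single prediction (an illustration of the equations above, not the authors' implementation; coordinates are expressed in feature-map grid units):

```python
import numpy as np


def decode_bbox(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw outputs (tx, ty, tw, th) into a box (bx, by, bw, bh).

    (cx, cy) is the top-left corner of the grid cell and (pw, ph) the anchor
    (prior) width and height; the sigmoid keeps the center inside its cell."""
    sigma = lambda v: 1.0 / (1.0 + np.exp(-v))
    bx = sigma(tx) + cx
    by = sigma(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh


# Hypothetical call: decode_bbox(0.2, -0.1, 0.0, 0.3, cx=4, cy=7, pw=3.6, ph=2.8)
```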

Since YOLOv3 makes predictions at 3 different scales, with 3 anchors per scale, 9 different anchor sizes are used in total.

For example, for a 416 × 416 input image, YOLOv3 predicts ((52 × 52) + (26 × 26) + (13 × 13)) × 3 = 10,647 bboxes over the 3 scales. This number is large; to reduce it:

  • first, bboxes are filtered according to their objectness scores (with a specific threshold);

  • second, among overlapping predictions, the boxes that do not have the maximum score are discarded by non-maximum suppression (NMS); a minimal sketch of this filtering is given after the list.
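
A minimal sketch of this two-stage filtering, an objectness threshold followed by greedy IoU-based NMS (the threshold values are illustrative, not those of the trained system):

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)


def filter_boxes(boxes, scores, score_thr=0.5, iou_thr=0.45):
    """Keep boxes above the objectness threshold, then apply greedy NMS."""
    keep = scores >= score_thr
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(-scores)
    selected = []
    while order.size:
        best = order[0]
        selected.append(best)
        overlaps = iou(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps < iou_thr]
    return boxes[selected], scores[selected]
```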

5 Experiments and results

Indoor navigation assistance requires real-time object detection as well as high detection accuracy. Good accuracy and better speed (compared to other DCNN models) make YOLOv3 the best choice for real-time object detection in a mobility assistive device. This paper goes a step further in addressing indoor object detection with a DCNN: not only classifying objects but also providing their localization in the current indoor scene. All of this is tested experimentally.

The experiments on indoor object detection and recognition were carried out with the proposed indoor dataset. Images in this dataset are taken in real interior environments; the collected indoor dataset consists of 16 indoor landmark object classes highly present in any indoor environment. The average precision of every indoor object class present in the dataset is the quality criterion of the proposed approach.

This section presents the training experiments (§5.1) and test experiments with the proposed annotated dataset (§5.2).

5.1 Training experiments

Training a convolutional neural network requires a huge amount of data. For this purpose, we used the proposed indoor object detection dataset to feed the DCNN.

The training step consists in finding a set of rules that best classify the objects. This process performs all the tasks needed to train the indoor object classifier. During training, the pretrained model is fine-tuned on multiple indoor images with multiple points of view, different lighting conditions and complex backgrounds.

The proposed system runs on an HP workstation equipped with an Intel Xeon E5-2683 v4 processor and an Nvidia Quadro M4000 GPU with 8 GB of integrated memory.

Several steps were performed when training the DCNN model:

  • First, the network is initialized with weights pretrained on the COCO dataset [33];

  • Second, the pretrained model is fine-tuned on the proposed collected dataset.

During training, a binary cross-entropy loss is applied for class prediction. Three anchors are used at each scale, which gives an output tensor of N × N × [3 × (4 + 1 + 16)], where 4 is the number of bounding box offsets, 1 is the objectness prediction, 16 is the number of classes and N × N is the grid size.
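
The class term is thus a multi-label binary cross-entropy (one independent logistic output per class rather than a softmax); a small sketch of this term for one predicted box, written for illustration:

```python
import numpy as np


def class_bce_loss(class_logits, target_class, num_classes=16):
    """Binary cross-entropy over the 16 independent per-class logistic outputs
    of one predicted box; the ground-truth class is encoded as a one-hot vector."""
    targets = np.zeros(num_classes)
    targets[target_class] = 1.0
    probs = 1.0 / (1.0 + np.exp(-np.asarray(class_logits, dtype=float)))
    eps = 1e-7
    return -np.mean(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps))


# Hypothetical call: class_bce_loss(np.random.randn(16), target_class=3)
```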

The proposed dataset was split into two subsets, one for training and the other for testing. At the beginning of the training step, images were resized to the network input resolution of 608 × 608.

For the training process, YOLOv3 uses Stochastic Gradient Descent (SGD) [5] with momentum as the optimizer of the loss function. SGD updates the parameters at each training step, but it performs updates with high variance, which causes strong oscillations of the objective function and slows convergence towards a minimum. For this reason, the YOLOv3 training uses SGD with momentum: the momentum term reduces the oscillations and speeds up convergence. Plain SGD is given by Eq. (1):

$$ w=w-\eta \,{\nabla}_w J\left(w;{x}^i;{y}^i\right) $$
(1)
where w denotes the model parameters (weights and biases), J(w; x^i; y^i) is the objective function evaluated on the training example (x^i, y^i), ∇w J is its gradient and η is the learning rate.

Momentum adds a fraction γ of the update vector of the previous step to the current update vector. The momentum update is given by Eq. (2):

$$ {V}_t=\gamma {V}_{t-1}+\eta \,{\nabla}_w J(w) $$
(2)
$$ w=w-{V}_t $$

By adding the momentum method to SGD, the model converges faster with fewer oscillations. However, because of the accumulated speed, the momentum optimizer can overshoot and miss the global or a local minimum.
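
A compact NumPy sketch of the momentum update of Eq. (2) (the learning rate and γ values are illustrative):

```python
import numpy as np


def momentum_step(w, grad, v, lr=0.001, gamma=0.9):
    """One SGD-with-momentum update: v_t = gamma * v_{t-1} + lr * grad, then w = w - v_t."""
    v = gamma * v + lr * grad
    w = w - v
    return w, v


# Hypothetical usage on a parameter vector of size 10:
# w, v = np.zeros(10), np.zeros(10)
# w, v = momentum_step(w, grad=np.random.randn(10), v=v)
```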

In the proposed experiments we initially used SGD with momentum. To address the overshooting problem of the momentum optimizer, we propose to replace it with the ADAM optimizer [28]. Like momentum, ADAM speeds up parameter updates by keeping an exponentially decayed average of past gradients m_t (the first moment of the gradient). In addition, ADAM stores an exponentially decayed average of past squared gradients v_t (the second moment of the gradient). Moreover, it computes an adaptive learning rate for each parameter and updates the parameters at each training step.

$$ {m}_t={\beta}_1\ast {m}_{t-1}+\left(1-{\beta}_1\right)\ast {g}_t $$
(3)
$$ {V}_t={\beta}_2\ast {V}_{t-1}+\left(1-{\beta}_2\right)\ast {g_t}^2 $$
(4)

where β1 and β2 are decay rates close to 1 and g_t is the gradient at step t.

Adam performs a bias correction of the first and second moment estimates. The bias-corrected first (\( {\hat{m}}_t \)) and second (\( {\hat{v}}_t \)) moments are estimated with the following equations:

$$ {\hat{m}}_t=\frac{m_t}{1-{\beta}_1^t} $$
(5)
$$ {\hat{v}}_t=\frac{v_t}{1-{\beta}_2^t} $$
(6)

Then, Adam updates the network parameters using the corrected first and second moments, as in Eq. (7):

$$ {w}_{t+1}={w}_t-\frac{\eta }{\sqrt{{\hat{v}}_t+\epsilon }}\ast {\hat{m}}_t $$
(7)
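
A NumPy sketch of the Adam update described by Eqs. (3)-(7); the hyperparameter values shown are the commonly used defaults, not necessarily those used to train the proposed system:

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, then the parameter step."""
    m = beta1 * m + (1 - beta1) * grad         # Eq. (3): first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2    # Eq. (4): second moment estimate
    m_hat = m / (1 - beta1 ** t)               # Eq. (5): bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # Eq. (6): bias-corrected second moment
    w = w - lr * m_hat / np.sqrt(v_hat + eps)  # Eq. (7): parameter update
    return w, m, v
```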

By using the ADAM optimizer to train the proposed detection system, we gained around 2% in the mean average precision (mAP).

We trained the proposed detection system by using two optimizers: the momentum and the ADAM. Table 4 reports the results in mAP obtained by using the two methods.

Table 4 Comparison of mAP when using momentum and ADAM optimizers

5.2 Test experiments

This section describes the performed experiments and the results obtained for indoor object detection. The mean average precision (mAP) was selected to evaluate the performance of the proposed detection system. The mAP is the average of the precisions over all classes present in the collected dataset, while the average precision (AP) gives the detection accuracy for a specific indoor object class. The detection performance obtained on the test subset is summarized in Table 5.

Table 5 Average precision (AP) results for the different indoor object classes

The proposed object detection system achieves a mean average precision (mAP) of 73.19%. Almost perfect recognition was obtained for the chair and table categories; good performance was obtained for many other indoor classes such as electricity box, heating, elevator, door and trash can.

The proposed detection system struggles with two indoor classes (smoke detector and light switch). For the rest of the indoor object classes, the detection system achieves good performance.

To further characterize the model's performance, the numbers of true positives (TP), false positives (FP) and false negatives (FN) are computed, from which precision, recall and F1-score are derived (Table 6).

Table 6 Evaluation metrics used in the proposed detection system
$$ Precision=\frac{TP}{TP+ FP} $$
(8)
$$ Recall=\frac{TP}{TP+ FN} $$
(9)
$$ F1- score=2\ast \frac{Precision\ast Recall}{Precision+ Recall} $$
(10)
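
These metrics follow directly from the TP/FP/FN counts; a small helper illustrating Eqs. (8)-(10):

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall and F1-score from detection counts (Eqs. (8)-(10))."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Example with made-up counts: detection_metrics(tp=90, fp=10, fn=20)
# -> approximately (0.900, 0.818, 0.857)
```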

As far as VIP mobility is concerned, the system should provide data about an upcoming object ahead of time. It is relevant to assume that:

- the optimal distance between an indoor object and a VIP, sufficient to warn him/her in advance, is about 5 m;

- the walking speed of the VIP is 1.4 m/s (the speed of a typical pedestrian).

Consequently, the VIP needs about 3.57 s to reach the indoor object, so a processing speed of roughly 2 FPS is sufficient to warn the user in time. The proposed indoor object detection system achieves a processing speed of 12 FPS (about 83 ms per frame), and therefore matches the needs of VIP mobility.

As shown in Table 7, the proposed indoor object detection system achieves better results than those obtained in [10] on the indoor dataset for the three common classes, and it also outperforms [10] when the (indoor + FoV) dataset is used, with higher detection accuracies for the chair and table classes. Note that these better results are obtained even though our system was trained and tested under challenging conditions, including high inter- and intra-class variation.

Table 7 Comparison of results obtained by our method and those obtained in [10]

Figure 7 presents a detection example using images from the fully labeled indoor object detection and recognition dataset. All indoor objects present in the input image are detected. Moreover, the door is detected despite being open and captured from a challenging angle. Each of the two trash cans, standing very close to one another, is detected by the proposed system, and the smoke detector and the light switch are also successfully detected despite their small size in the input image.

Fig. 7

A Detection example of the proposed system

It can be concluded that the proposed indoor object detection system achieves a high recognition rate for objects of different sizes and respects the real-time constraints imposed by the VIP mobility speed.

6 Conclusion

This paper presented a new indoor detection system designed for indoor assistance navigation for visually impaired people.

The proposed indoor dataset provides data that can be used by researchers in the computer vision field to develop new deep convolutional neural networks (DCNN) to be included in indoor robotic navigation systems, natural mobility of humanoid robots, and any system assisting the physical or virtual navigation of human beings.

The proposed indoor object detection and recognition (IODR) dataset contains 8000 images covering 16 landmark object categories. The indoor images provided in this dataset present various challenging situations, making the training and testing of the deep CNN robust to the complex situations encountered during inference.

The proposed dataset provides images that are highly relevant for VIP mobility. The evaluation of the proposed system on the new fully annotated dataset leads to a detection precision of 73.19% mAP. This encouraging accuracy may be increased by adding more data during DCNN model training.

Future work targets improving the system's mAP and integrating the proposed indoor detection system into embedded devices such as an intelligent cane.