Keywords

1 Introduction

Detecting object is a technique connecting with the problem of object detection from photos, or videos that fall under a particular category is the main aim of computer vision and image processing. Object detection in 3D image plays an important role in designing effective real-world systems for recognizing and detecting all recognised objects in a image is the main goal of object detection. There are various purpose of locating and recognition of the objects includes counting the object, face recognition, character recognition, and independent driving in a surveillance camera.

Many teams of engineers and scientists are working to address each of these “eternal” challenges through trial and error. As modern technology solutions seem to be more expensive, the task of creating self-sufficient software tools for problem solving is being developed and solved in depth overseas.

A vast number of 3D objects are frequently utilized in 3D graphics, which is becoming more popular throughout the world. In order to use 3D objects, an extraction method is required. This work explains how to visualize a 3D object using the deep learning technology. There has been a lot of study on object detection, but most of it has focused on identifying objects from 2D pixel data generated from camera. In this instance, objects are identified by referring pixel data.

Objects in the image can be seen and pointed out by viewers. The human visual system is very accurate and rapid, it can do challenging tasks including instantly differentiating between a variety of obstacles. Now computers probably learn to recognise and differentiate various components inside an image, due to the availability of larger data sets, faster GPUs, and smarter algorithms.

While the local rendering of an object is painting a small rectangular or square box around objects in a photograph, image classification entails classifying an image. These tasks, drawing a rectangular or square box around each and every object of interest in the image, makes obtaining an object always a difficult task. All of these topics are cited as object recognition issues.

Identifying objects in digital photos involves a number of related processes known as object recognition. In order to deal with local practise and monitoring operations, a family of methods known as regional-based convolutional neural networks or R-CNNs is used to develop a model.

2 Related Work

Hough space of LiDAR point clouds is used to overcome the problem arises such as unstructured distribution, disordered arrangement, and large amounts of data. This requires high computational complexity and it is very difficult to classify 3D objects. CNN model is used to classify multidimensional objects [1].

Perceptual organization technique is used configure the grouping and structure which is unchangeable over a wide range of viewpoints to reduce the size of the search space during model-based matching, probabilistic ranking technique is used. The last technique used is spatial correspondence brings the projections of three-dimensional models into direct correspondence with the image by solving for unknown viewpoint and model parameters [2].

Ning Hao discuss the 3D object detection using three stage objection model. This model improves the efficiency compared with traditional 3D object detection model. Point CNN model for 3D object detection is also discussed [3]. A F M Saifuddin Saif et al. discuss the models, challenges and applications using deep learning techniques for multimodal object detection in future [4].

Current techniques for object recognition uses efficient machine learning methods. To optimize the performance, huge datasets are used, apply more powerful models, and techniques to prevent overfitting. ImageNet classification with deep convolutional neural networks is an efficient technique to classify high resolution images. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way SoftMax [5].

Deep learning techniques are widely used in 3 D object detection in the field of automated driving, medical image analysis, virtual/augmented reality, artificial intelligence robots, and other areas. This is a very challenging and is a hot research topic in the current scenario. There are many techniques existing to solve the problems, but deep learning methods gives optimized result [6].

3 Methodology

3.1 Objectives

  • Identifying 3D objects from an image.

  • Classifying and assign the labels to each object.

3.2 Methodology

Traditionally object recognition was carried out using hand-crafted features like HoG or Haar before the advent of contemporary Convolutional Neural Network (CNN) architectures and large-scale picture datasets like ImageNet. CNNs were first used to recognize objects using externally segmented objects, and later, Region Proposal Networks (RPNs) were used.

Three methods have been used to detect 3D objects. The first strategy, (among many others) relies entirely on monocular visual information to estimate the spatial placement of the objects. A second strategy has attempted to combine the data, as in using both the camera and LiDAR as complimentary data sources. RPN was utilized in this data fusion approach to compute regions of interest and classification in the picture space and perform final location over the LiDAR data.

The third method of 3D object detection computes object detections in 3D using point cloud data and either information from stereo cameras or LiDARs. Examples of this strategy from the past include and more recently, 3D space has been transformed into a voxel grid and 3D convolutions have been used to analyze the spatial data from point clouds. In the point cloud category, using 2D CNNs on a LiDAR point cloud of the front view or a bird's eye view (BEV) is a relatively recent innovation.

To enhance the problems of existing system and to make it better for the Object detection and identification proposed system has been implemented. Object detection Technology is used to identify the object by using python Models and libraries like TensorFlow and OpenCV.

In this work, several commonly used datasets for 3D object detection are reviewed and corelated them with Objectron. The computer vision problem of object detection has been extensively investigated. However, 2D object prediction has received the majority of attention. A various application in robotics and automation, self-driving vehicles, image capturing, and augmented reality are made possible by improving prediction to 3D, which allows one to record the size of object's location and object orientation in the real world. 2D prediction only offers 2D bounding boxes. Even though 2D machine vision is very advanced and has been broadly used in the industries, 3D object detection from 2D photography is a challenging challenge to address due to the insufficient data and the range of data appearance and shapes of items within a category.

Image preprocessing. Pre-processing has been done to enhance the image quality, then image analysis is more effective and successful. Through preprocessing, undesirable distortions are eliminated and enhance certain properties of the image that are important for the application. Those traits are altered depending on the application. During preprocessing image data is transformed into the format compatible for algorithm to process and data analysis.

R-CNN. The R-CNN algorithm is mainly used for progressive visual object detection which fuses convolution neural network generated replacement with bottom-up region approach. R-CNN uses a technique called region proposal to generate prospective bounding boxes around the images before implementing a classifier to the recommended boxes (Fig. 1).

Fig. 1
A flow chart of the system architecture. An input image is followed by pre-processing, M L pipeline, object identification, and ends with object classification.

System architecture

System architecture

MediaPipe. Machine learning pipelines are built using the MediaPipe framework to handle real time including audio and video. This cross-platform system is supported by the desktop/server, Android, iOS, and embedded systems like the Raspberry Pi and Jetson Nano.

Because of the ease of setup and ubiquity of the Python programming language, MediaPipe Python solutions are the best for beginners. The MediaPipe Framework's flexibility allows for customization. But before diving into customizing, we advise being familiar with a number of ready-made solutions. Recognize the internal APIs connected to them, then modify the results to produce your fascinating apps.

ML Pipelines for 3D Object Detection. To estimate the bounding box around 3D object from a single RGB image, this system developed two pipelines: a two-stage pipeline and single-stage pipeline. While maintaining and improving efficiency, the two-stage pipeline is three times faster than the one-stage pipeline. The two-stage pipeline excels in detecting a single dominating item, the single-stage pipeline excels in detecting multiple objects.

Two-stage pipe. The illustration in Fig. 2 shows our two-stage pipeline. In the first stage 2D crop of the object is located in using an object detector. The second stage calculates the 3D bounding box using the image crop. In order to avoid running the object detector every frame, it additionally computes the cropped object from 2D image for the following frame.

Fig. 2
A block flow diagram of the 2-stage pipe. It consists of 3 frames. An image is sent to the object detector. Key points prediction receives images from frame 1 and the object detector and sends them to 2-D to 3-D lifting. It is followed by frames 2 and 3.

Two stage pipe

Single-Stage pipe

This framework uses multi-task learning strategy such as combining, detection and regression to estimate jointly the shape of an item. Depending on the ground truth, annotation is provided. This is not required, if the shape of the image is not proper for identification in the training data (Fig. 3).

Fig. 3
A block flow diagram of the single-stage pipe. The backbone includes an encoder and decoder, followed by the heads, which are detection and regression. It leads to post-processing that includes 2-D projections and a 3-D bounding box.

Single-stage pipe

Box Tracking. The box tracking system takes image frames from a video or camera stream and computes the monitored box positions for each frame by starting box positions with timestamps, signaling 2D region of interest to track. The starting box positions in this particular use case are determined by object detection, but the starting position can also be supplied directly by the user or by another system. Three key parts make up our solution: a motion analysis part, a flow packager part, and a box tracking part. The box tracking solution is represented as a Media Pipe subgraph, with each component represented as a Media Pipe calculator.

Coordinate Systems. Every item is having its own coordinate frame. The source is in the center of the 3D bounding box with +x pointing right, +y pointing up, and +z pointing forward. The object coordinate specification shown below (Fig. 4).

Fig. 4
An illustration represents the coordinate system in a cuboid. It consists of an X-Y-Z axis with the center as (0, 0, 0). The Y-axis is up and the Z-axis is front.

Coordinate system

Camera Coordinate. Scale, rotation, and translation in relation to the camera coordinate frame are the parameters that define a 3D object. The definition of the, the camera coordinate API is as follows: +x points to the right, +y points to the up, and −z points to the scene. To work with box landmarks, one can first derive landmark coordinates in object frame by scaling a origin centered unit box with scale, then transform to camera frame by applying rotation and translation (Fig. 5).

Fig. 5
An illustration represents the camera coordinate in a trapezoid prism. It consists of an X-Y-Z axis and a negative X-axis. It has four coordinates.

Camera coordinate

TensorFlow. To locate objects, a TensorFlow object identification API based on an SSD deep learning framework was employed. We were able to use the model weights provided by this API since it has previously learnt 90 items. For this research, the detection model's source code was studied and updated in order to pinpoint the precise position of objects in real time. Using the source code, the camera's coordinates were found. It is possible to recognize and locate things in an image or video using the object detection computer vision technology. For example, object detection may be used to count and monitor the exact positions of all the items in a scene while precisely identifying each one of them.

OpenCV. The box tracking system takes image frames from a video or camera stream and computes the monitored box positions for each frame by starting box positions with timestamps, signaling 2D regions of interest to track. The starting box positions in this particular use case are determined by object detection, but the starting position can also be supplied directly by the user or by another system. Three key parts make up our solution: a motion analysis part, a flow packager part, and a box tracking part. The box tracking solution is represented as a MediaPipe subgraph, with each component represented as a MediaPipe calculator.

4 Results and Performance Evaluation

Initially the system will capture the input image from webcam after that images will pre-processing in order to get an enhanced image and to extract some useful information and it improves the image quality. 3D object uses the pixel values at that point of an image and describe how bright that pixels, and what color it should be. The computer reads any image as a range of values between 0 and 255 using RGB colors it will plot the pixels. For feature extraction a MediaPipe multistage pipeline is applied. This is known as MediaPipe Holistic. The input taken from web camera, the MediaPipe Holistic uses individual models for each object components using a region-appropriate image resolution.

Next R-CNN algorithm will apply. The region of interest or region proposal will be generated by R-CNN in the I stage. Selective search algorithm will be use to segment the image. This algorithm will divide the image based on size, texture, color and shape. At this stage object proposal will be generating by R-CNN pipeline, it is of different scale. This object proposal is challenging to region proposal. Different features have to extract from the proposal. SVM classification algorithm will be used for classifying and labeling the images. To improve the performance of prediction bounding box regression has been applied.

The main goal is to detect objects in any input image and drawing boundary around them. This technique is used to create potential bounding boxes for images before applying to the suggested box after applying the R-CNN MediaPipe objectrone. This MediaPipe objectrone is a real-time 3D object detection solution for every objects. It detects objects in 2D images and also estimate their poses. LiDAR dataset is used as a training dataset. The accuracy obtained is better and more accurate after using this technique. This proposed work is trained and undergone 300 epoch and achieved 90 percent accuracy.

Table 1 Algorithm

4.1 Pooling Layers

Pooling is a technique in CNN for generalizing feature extraction by convolutional filters and helping the neural network to recognize features independent of their location in the image. This method gives the average of features present in the image. The below table figure shows the feature map of average pooling.

Max pool (Figs. 6, 7, 8, 9, 10, and 11).

Fig. 6
A table illustration represents the average pooling feature map. A 4 by 4 matrix of numbers is converted to a 2 by 2 matrix with filters and strides.

Average pooling feature map

Fig. 7
A photograph depicts wooden and plastic chairs. Both the chairs are detected by two 3-D bounding boxes.

3D bounding object detection chairs

Fig. 8
A close-up photograph depicts a pair of shoes worn by a man. Two three-dimensional bounding boxes detect the pair of shoes.

3D bounding object detection shoes

Fig. 9
A photograph depicts a cup of tea on a table. A three-dimensional bounding box detects the cup.

3D bounding object detection cup

Fig. 10
A photograph depicts a man riding on a horse. The photo is in the shape of a parallelogram. A text written as input image in the bottom left.

Original image

Fig. 11
A photograph depicts a man riding on a horse. The photo is in the shape of a parallelogram. Three detected spots represent the regions of interest from a proposal method.

ROI using selective search method

5 Conclusion

This framework provides a tracking system for 3D objects that enables immediate real-time tracking of 3D bounding boxes on devices. This system recommends a technique which initializes the 3D posture first with a neural network, then uses a planar surface tracker to follow the object's pose through all the images.

The main goal of the System is "3D Object Detection System " is to identify several items in various kinds of images especially in medical images. In order to accomplish this, edge features from the image are retrieved. It makes use of a vast image database to accurately identify and recognize objects. The accuracy obtained from this method is 90%. The user interface for this system will make it simple to retrieve the desired photographs. In the Future research work developing a robust model for multidimensional object detection and classification in real time images.