
1 Introduction

In assembly tasks, several elements need to be addressed, such as the orientation of the parts and the assembly sequence. To assist assembly tasks, several authors have proposed augmented reality systems [13]. They proposed that images augmented with instructions may be an alternative to the traditional paper manual and could improve assembly time. Implementations of augmented reality systems have been compared to computer-assisted instructions, paper manuals and tutorials given by an expert. Although augmented reality (AR) systems did not improve assembly time compared to a tutorial by an expert [2], it was shown that assembly time can be improved in comparison to computer-assisted instructions [1, 3] and paper manuals [13]. Moreover, when an augmented reality system was used, the number of errors in the assembly process was reduced [1, 3].

This research consists of the development of an augmented reality system to assist the assembly process in real time as a replacement for conventional assembly manuals. The proposed system behaves as an interactive instruction manual in which virtual models and graphic signs are overlaid on a video of the real scene so that the user is guided until the final assembled state is reached. To prove the concepts described here, a tablet is placed between the user and the workspace so that a set of randomly placed assembly parts can be viewed on the display. From the video feed, the parts to be assembled are identified through image processing techniques, and graphic signs show the user where each piece should be assembled. Lastly, the system checks the assembly state and informs the user when the final state is reached.

The developed system has the following characteristics:

  • Fiducial markers are not used: objects are detected by image processing and identified by matching against models stored in a database;

  • The solution is known beforehand: the system is not for finding a solution to the assembly problem;

  • The system does not teach the assembly process to the user; its purpose is to guide the user through the assembly process;

  • The display is placed between the user and the assembly workspace and remains fixed during the whole process;

  • A general assembly guidance process is proposed and validated;

  • The tablet implementation runs as embedded software; only the tablet's resources are used and no network connection is required.

This paper presents the results obtained by the object recognition subsystem and the assembly guidance process of the developed system, called Assembly Assisted by Augmented Reality (A3R).

2 Augmented Reality in Assembly Tasks

One of the first efforts to build an AR tool was shown in [4] to guide the assembly of cable harnesses. Information derived from the engineering design of parts and processes comes to the factory floor in the form of templates, assembly guides, cable lists and location markings. Manufacturing expenses and delays can be minimized if changes in the engineering design of parts and processes are quickly mirrored in these templates, assembly guides, cable lists and location markings. To address this issue, an AR system composed of a wearable computer and a head mounted display (HMD) was demonstrated in [4]. Later, researchers demonstrated applications to guide printer maintenance [5] and car door-lock assembly [6].

Specifically regarding AR applications in assembly processes, the research in this field can be analyzed under four main aspects that enable the AR system implementation: display, type of tracking system, user interaction and platform. Additionally, an assembly sequence representation, which is needed to guide the user through the assembly process, is also analyzed.

2.1 Display

Basically, there are two different approaches to show information to the user in AR systems: optical see-through and video see-through. In the optical see-through approach, a half-transparent mirror is placed in front of the user's eyes. The world can be seen through the half-transparent mirror while virtual information is reflected on it, thereby combining the real world and virtual information. Optical see-through displays do not present a parallax effect; however, the combination of virtual elements through a half-transparent mirror can reduce the brightness and contrast of virtual images. In the video see-through approach, the user sees the world captured by cameras and information is superimposed on the digitized video. Video see-through approaches are cheaper and easier to implement. Besides, it is easier to add or remove elements from the display, as the world is already digitized. Still, as the camera and the user's eyes have different fields of view, video see-through approaches suffer from a parallax effect. As for the placement of the display, some devices are placed on the user's head with the display positioned in front of the eyes (head mounted displays, HMDs), projectors can be used to display information directly on real objects (spatial displays), and some displays are held within hand's reach (hand-held displays).

In the literature, optical see-through [3, 7] and video see-through HMDs [1, 7–13] are commonly used in AR systems, as they leave both of the user's hands free for the assembly process. However, HMDs available on the market are bulky, expensive or require a connection to a computer.

Another approach is the use of spatial displays, which use projectors to display information directly on real objects. This is the technology most integrated into the environment, and several users can interact simultaneously. Projector-based AR system implementations are shown in [14–16]; however, the only use of a projector as a spatial display to guide the user to the solution of a 3D puzzle is presented in [15]. Although spatial systems enable more immersion in the task than the other alternatives, they require controlled environment illumination and preparation of workspace parameters to correctly superimpose information on the assembly parts.

Smartphones, tablets and PDAs represent hand-held displays. A system composed of a smartphone and a PC to guide the assembly process of a 3D puzzle is shown in [17]. While the PC performed all the image processing, the smartphone was used to display images augmented with assembly information.

2.2 Tracking

When augmented reality is used only to show information to the user and it is not required to superimpose information aligned with the real world, it is called augmented reality without context, as demonstrated in [3, 8, 15, 18]. In this kind of AR, images, animations and texts are shown to the user on a display, similarly to an electronic manual. The main difference between an electronic manual and AR without context is faster access to assembly information without a change of attention, as the information is on the same display as the real world seen by the user.

Augmented reality with context requires knowledge of the camera and object positions. In some cases, the camera and object positions are not known beforehand; therefore, methods to extract this information, often called tracking, are needed. The main challenge for tracking methods is the depth information, which is lost when a camera captures the world. Some AR systems use a combination of a camera and another sensor to acquire the depth information: magnetic sensors [1], infrared sensors [19] and multiple cameras [20] have been used to extract depth data.

Specifically in AR applications for assembly processes, the main approach to tracking is the application of computer vision techniques to images taken by a single camera. Two different approaches using computer vision techniques are found in the literature: tracking with fiducial markers and marker-less tracking. Fiducial markers make tracking and recognition more robust; however, they require preparation of the workspace. AR systems which use the fiducial marker tracking approach are presented in [9–12, 16, 17, 21–23].

There are mainly two kinds of marker-less tracking: frame-to-frame tracking and tracking by detection. Frame-to-frame methods estimate the current position by using information from previous frames. Tracking by detection estimates the position by matching features extracted from images to features extracted from a known object model. An AR system that combines frame-to-frame and tracking-by-detection methods was shown in [24]. While frame-to-frame tracking is achieved by an implementation of a particle filter algorithm, tracking by detection is done by matching features from several 2D synthetic views of the 3D geometry, generated by moving a virtual camera along a sphere.

2.3 User Interaction

Some researchers used a mouse and PC keyboard [9, 10, 16] or a smartphone keyboard [17] to enable interaction with the AR system, while others proposed interfaces with selectable panels by using a 3D stick tool [21], hand and gesture recognition methods [12] and haptic interfaces [20].

Two kinds of tools are used to interact in AR environments: computer vision tools and haptic tools. A system controlled by hand gestures is demonstrated in [10]. The detection of hands is made by color segmentation, and virtual menus are opened by holding the hands over an activation area. A stick tool detected by image segmentation to interact with virtual panels is presented in [21]. Following the work in [21], the segmentation technique was improved in [25, 26] to recognize hands and gestures, respectively.

An AR system with voice control was presented in [10, 27]. The voice control is made by connecting the AR system to a stand-alone voice command system which recognizes simple English commands like "next", "previous", "next phase" and "previous phase".

Additionally, an AR system with a haptic device, in which users wear a device to manipulate virtual objects as if they were real objects, was demonstrated in [20]; it enables a more integrated and natural way to interact with the virtual elements.

2.4 Platform

Several AR systems were implemented by a combination of a PC and an HMD or monitor [1, 3, 8, 10–16, 24, 27, 28]. The PC's versatility, diversity of connections enabling several devices to be connected at the same time, wide availability on the market and processing power are the main reasons for choosing PCs as platforms for AR systems. Still, there are restrictions when HMDs are used, as a connection to a PC is required.

One main requirement of AR systems is near real-time response. Manipulation of 3D virtual objects, tracking and object recognition require suitable processing power and memory. Due to this requirement, few approaches have been developed on mobile devices, and markers for object recognition or tracking have been used to reduce the need for complex image processing algorithms. An AR system using a mobile phone as a client and a PC as a server in a client/server architecture is shown in [17]. A mobile phone is used to take photos of the workspace with markers and the images are sent to a server; the images are processed on the server and the assembly information is then sent back to the mobile phone. In the research presented in [24], a marker-less tracking subsystem for AR systems was described. Images at 640\(\,\times \,\)480 resolution were used to estimate the camera pose in hundreds of milliseconds on a PC equipped with an Intel Core i7-860 CPU and 3 GB of RAM. The development of an AR system with the same architecture as proposed in [17] was suggested in [24] to eliminate the need to move bulky hardware components, so that the mobile device's computational requirements could be lowered. Up until this work, only one project implemented on a tablet [23] was found in the literature, mainly due to limitations on processing power and memory. Nowadays, mobile devices are becoming increasingly powerful platforms. The iPad is a mobile device that stands out for its computing power, which allows the execution of image processing algorithms in real time and may be used for AR systems [29].

2.5 Assembly Sequence Representation

Whenever an intelligent control system for assembly processes is designed, a way to describe the assembly sequence is required, and some assembly sequence representations have been proposed in [30, 31]. The choice of representation impacts the memory required to store the information and the difficulty of deriving the representation from an assembly process.

The following definitions are used for the assembly sequence representation:

  • Subassembly—a set of joined parts; it can contain one or more parts

  • Assembly task—is a connection between subassemblies

  • Assembly sequence—ordered list of assembly tasks

  • Assembly state—list of connected subassemblies

  • Assembly layout—represents the disposition of the parts in space

A mechanical assembly is a connection of solid parts forming a stable unit. The assembly parts are connected by surface contact, reducing their degrees of freedom. Subassemblies are sets containing one or more joined parts. Whenever subassemblies are connected, an assembly task is performed. An assembly sequence is an ordered succession of assembly tasks or assembly states. An assembly process starts with all parts separated and ends with all parts connected to form a whole object.

An assembly sequence representation named the Assembly Sequence Table (AST) was introduced in [31]. This representation is used to describe all possible solutions to an assembly problem for evaluation and selection of an assembly sequence. Table 1 shows an example of an AST. The first column is the level number; the second column contains the assembly states, representing all feasible subassemblies formed after the assembly tasks are performed; the assembly tasks represent the feasible subassemblies from the previous level to be joined. At any level L, the assembly states column has all the subassemblies with L parts. Hence, at level 1 only subassemblies with one part are listed in the assembly states, at level 2 there are three assembly states with sets containing two parts each, and so on.

Table 1 AST example of parts identified by a, b, c and d

3 The Proposed System

The Assembly Assisted by Augmented Reality (A3R) system was developed to run as embedded software on an off-the-shelf tablet with a camera, display, gyroscope and accelerometer sensors. It uses a video see-through approach to show the assembly space to the user and graphical signs to guide the user through the solution of the assembly process. Consequently, no customized device such as an HMD is used and no connection to another computer is required. The parts to be assembled are identified by object recognition techniques, so markers are not used as an aid to the part identification process and no environment preparation is needed. Instead of markers, models are used to identify the parts to be assembled.

Fig. 1 Proposed system. a Models and instructions loading from the internet. b System setup

Table 2 Example of the proposed system assembly sequence representation

In the setup phase, the proposed system loads the models and the assembly sequence representation from files or from the internet, as pictured in Fig. 1a. Models are composed of a label identifying the assembly part and a set of edges. Each edge is described by a label and a set of points representing the beginning and the end of the edge. For each assembly part model, features are extracted and stored in a database. The assembly sequence is composed of a list of tasks, and each task contains the sequence order and a set of connections. Each connection is composed of a set of labels identifying the edges to be connected in order to join the subassemblies. A representation derived from [31] is used to represent the assembly sequence. The AST, described in [31], contains all possible solutions to the assembly process, whereas in the proposed system only one assembly solution is represented per assembly process, so there is no need to represent the assembly state at each step; only the assembly tasks column from the AST representation is used. An example of the assembly sequence representation is presented in Table 2. The Sequence column corresponds to the sequence order and the Task column represents the assembly task. An assembly task contains sets of edge labels. In Sequence 0, it is assumed that the first assembly part (I) is located at \((x, y)=(0,0)\) on a Cartesian base, requiring only the edges that lie on the \(x=0\) and \(y=0\) axes; therefore, edges "I3" and "I0" are on the \(x=0\) and \(y=0\) axes, respectively. In Sequence 1, the Task column contains the connection between edge "T0" from the "T" assembly part and edge "I2" from the "I" assembly part. Finally, in Sequence 2, the Task column contains the connection between edges "W0" and "T6" and the connection between edges "W1" and "T5", from assembly parts "W" and "T", respectively. The assembly guidance is carried out by reading the table from Sequence 0 to the last sequence, stepping to the next sequence whenever the connection between the subassemblies is detected.
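
For illustration, this representation could be held in memory with structures like the following C++ sketch; the type and field names (Connection, AssemblyTask) and the pseudo-edge labels used for the axes of Sequence 0 are assumptions made here, not the actual A3R data model.

```cpp
#include <string>
#include <vector>

// A connection joins two labeled edges, e.g. edge "T0" of part "T"
// to edge "I2" of part "I".
struct Connection {
    std::string edgeA;
    std::string edgeB;
};

// One row of the assembly sequence representation (Table 2).
struct AssemblyTask {
    int sequence;                         // sequence order
    std::string part;                     // part to be moved in this task
    std::vector<Connection> connections;  // edge pairs to be joined
};

// Sequence corresponding to Table 2. For Sequence 0 the placement of edges
// "I3" and "I0" on the x=0 and y=0 axes is written as pseudo-edges, which is
// only an illustrative convention.
const std::vector<AssemblyTask> kSequence = {
    {0, "I", {{"I3", "axis x=0"}, {"I0", "axis y=0"}}},
    {1, "T", {{"T0", "I2"}}},
    {2, "W", {{"W0", "T6"}, {"W1", "T5"}}},
};
```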

Fig. 2 System block diagram. a Assembly process diagram. b Object recognition subsystem

Still in the setup phase, the intrinsic camera parameters and the lens distortion are determined by a camera calibration process and stored in the database. Later, this information is used for the estimation of the object position and to correct the distortions caused by the lens in the camera image.

In the assembly guidance phase, the tablet device is placed between the user and the assembly parts so that the user can see the assembly parts on the tablet's display in a video see-through approach, as shown schematically in Fig. 1b.

Figure 2a shows the assembly process algorithm. First, the system retrieves an assembly task from the ordered assembly sequence list. If the assembly task is not the last one, the part associated with the task is identified by the object recognition subsystem and highlighted on the tablet's display to show the user which part to move. Graphical signs indicate the following actions to the user: flip the assembly part, if necessary, and connect the part to a subassembly. While a highlighted assembly part is not connected to the subassembly, the system continuously indicates the assembly part and its placement for connection, even while the user moves the part around the workspace, except in the case of occlusion, which is not handled at this time. After the part is placed on the indicated location and the connection to the subassembly is detected, the system steps to the next assembly task. This loop is repeated until all parts are placed on the assembly layout and the proper connections are validated, at which point the assembly process is completed.
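
A minimal sketch of this guidance loop, assuming the AssemblyTask structure sketched earlier; the recognition, drawing and detection steps are left as caller-supplied callbacks, which is an assumption for illustration and not the actual A3R code.

```cpp
#include <functional>
#include <string>
#include <vector>

// Callbacks standing in for the recognition, drawing and detection steps of
// Fig. 2a; they are placeholders for illustration, not the actual A3R modules.
struct GuidanceCallbacks {
    std::function<void(const std::string&)> highlightPart;      // mark part to move
    std::function<void(const AssemblyTask&)> drawPlacementHint;  // flip sign, cells, arrow
    std::function<bool(const AssemblyTask&)> connectionDetected;
    std::function<void()> playConfirmation;                      // audible feedback
};

void runAssemblyGuidance(const std::vector<AssemblyTask>& sequence,
                         const GuidanceCallbacks& cb) {
    for (const AssemblyTask& task : sequence) {
        // Keep indicating the part and its placement until the connection
        // to the current subassembly is detected, even if the user moves it.
        while (!cb.connectionDetected(task)) {
            cb.highlightPart(task.part);
            cb.drawPlacementHint(task);
        }
        cb.playConfirmation();  // then step to the next assembly task
    }
    // All tasks performed and validated: the assembly process is completed.
}
```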

For each image frame, a sequence of image processing algorithms is executed in the object recognition subsystem, as shown in Fig. 2b. The first algorithm corrects the lens distortion of the camera image using the lens distortion parameters previously estimated in the setup phase. After the undistortion, objects in the workspace are segmented. Due to the tablet's camera position, the segmented parts are rectified to remove the perspective distortion. Features are extracted from the segmented contours and then features invariant to rotation and scale are computed. In the object recognition algorithm, these features are used to match the parts to the models.

4 Implementation

The implemented system uses an iPad Air 2 as the platform, with a video see-through approach. An 18 mm focal length wide-angle lens was placed over the tablet camera to broaden the field of view. Xcode was used to implement the software and the OpenCV library was used to develop the image processing algorithms. All implemented routines were written in the Objective-C and C++ languages.

As a use case, the system was implemented to guide the assembly process of a pentomino puzzle [32]. This puzzle is composed of 12 pieces, each piece being the result of 5 squares concatenated by their edges. Figure 3 shows the pentomino puzzle pieces and the name given to each piece, which is used to identify each individual part and in the assembly sequence table. A common puzzle objective is to tile all the pieces into a rectangular box with an area of 6\(\,\times \,\)10 squares. This particular configuration has 2,339 possible solutions. This application was selected because it represents a generic assembly process with known solutions, where the user can place the pieces in well-defined positions starting from a random distribution of pieces on the workspace.

Fig. 3 Pentomino pieces identified by letters

In the setup phase, the camera calibration is performed using a method implemented in the OpenCV library [33]. Using the tablet's camera with the wide-angle lens attached, twenty images of a chessboard in arbitrary orientations were taken to estimate the camera intrinsic matrix and the lens distortion parameters. If the sum of the squared distances between the observed projection points and the projected points (the re-projection error) is less than 0.5, the camera intrinsic matrix and the lens distortion parameters are stored for later use; otherwise the calibration process is repeated. The correction (undistortion) of the camera lens distortion is made in two steps: computing the mapping of points from the distorted space to the undistorted space and interpolating the pixels of the undistorted image after the mapping. The mapping result is the same for all camera images taken with the same lens; therefore, for optimization purposes, it is cached once computed and only the interpolation of pixels of the undistorted image is processed for each camera image.
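
A sketch of this calibration and caching scheme with the OpenCV calibration API; the chessboard corner extraction is omitted, and the use of the RMS re-projection error returned by calibrateCamera as the acceptance criterion, as well as the Undistorter wrapper itself, are assumptions of this sketch.

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Calibrate from chessboard views and cache the undistortion maps.
struct Undistorter {
    cv::Mat cameraMatrix, distCoeffs, map1, map2;

    bool calibrate(const std::vector<std::vector<cv::Point3f>>& objectPoints,
                   const std::vector<std::vector<cv::Point2f>>& imagePoints,
                   cv::Size imageSize) {
        std::vector<cv::Mat> rvecs, tvecs;
        double err = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                         cameraMatrix, distCoeffs, rvecs, tvecs);
        if (err >= 0.5) return false;  // calibration must be repeated

        // The mapping is the same for every image taken with this lens,
        // so it is computed once and cached.
        cv::initUndistortRectifyMap(cameraMatrix, distCoeffs, cv::Mat(),
                                    cameraMatrix, imageSize, CV_16SC2, map1, map2);
        return true;
    }

    // Per-frame step: only the pixel interpolation is executed for each image.
    void undistort(const cv::Mat& src, cv::Mat& dst) const {
        cv::remap(src, dst, map1, map2, cv::INTER_LINEAR);
    }
};
```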

Assembly part images were segmented in four steps: (1) color segmentation is performed by thresholding the pixels by hue, saturation and value; (2) the Canny edge detector [34] is used to detect edges; (3) a morphological closing with a 2\(\,\times \,\)2 rectangular element is applied to close open borders in the detected edges; (4) a border following algorithm, as described in [35], is used to join the borders into vectors which represent the border points in counter-clockwise order.
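
A sketch of these four steps with OpenCV; the HSV and Canny thresholds and the contour retrieval mode are placeholders, since the text does not specify them.

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Four-step segmentation of the assembly parts.
std::vector<std::vector<cv::Point>> segmentParts(const cv::Mat& bgr) {
    // (1) color segmentation by thresholding hue, saturation and value
    cv::Mat hsv, mask;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);
    cv::inRange(hsv, cv::Scalar(0, 60, 60), cv::Scalar(180, 255, 255), mask);

    // (2) Canny edge detection (thresholds are placeholders)
    cv::Mat edges;
    cv::Canny(mask, edges, 50, 150);

    // (3) morphological closing with a 2x2 rectangular element to close borders
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(2, 2));
    cv::morphologyEx(edges, edges, cv::MORPH_CLOSE, kernel);

    // (4) border following: join the border points into contours
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(edges, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_NONE);
    return contours;
}
```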

In the particular case of the 2D puzzle solution guidance, it is necessary to rectify the perspective distortion so that the assembly parts can be processed as in a top-down view. Given that the puzzle pieces lie on the plane \(Z=0\), we can relate a point d on the image to a point D on the plane \(Z=0\):

$$\begin{aligned} sd=HD \end{aligned}$$
(1)

where H is the homography matrix and s is an arbitrary scale.

The image points are computed as if the image were taken by a virtual camera placed at \(C_{v}=(0,0,h_{v})_{g}\) in the global coordinate system O, as shown in Fig. 4. Using (1), a point in the coordinate system O can be represented in the coordinate system C as:

$$\begin{aligned} s_{c}d_{c}=A\left[ \begin{array}{ccc} -\mathrm{cos}(\theta ) &{} 0 &{} h \cdot \frac{\mathrm{sin}^{2}(\theta )}{\mathrm{cos}(\theta )}\\ 0 &{} 1 &{} 0\\ \mathrm{sin}(\theta ) &{} 0 &{} h \cdot \mathrm{sin}(\theta ) \end{array}\right] D=H_{c}D \end{aligned}$$
(2)

and a point in the coordinate system O can be represented in the coordinate system V as:

$$\begin{aligned} s_{v}d_{v}=A\left[ \begin{array}{ccc} \mathrm{cos}(\pi ) &{} 0 &{} 0\\ 0 &{} 1 &{} 0\\ -\mathrm{sin}(\pi ) &{} 0 &{} h_{v} \end{array}\right] D=H_{v}D \end{aligned}$$
(3)

where the \(d_{c}\) and \(d_{v}\) are the points in the images of the real camera and the virtual camera, respectively; A is the camera intrinsic matrix; D is the point on the plane of the assembly parts; \(H_{x}\) is the homography matrix of the real camera or virtual camera; \(\theta \) is the device inclination angle; h is the distance between the camera and the device opposite side; \(h_{v}\) is the height of the virtual camera.

Relating Eqs. (2) and (3), we can relate an image point to the corresponding point as seen by the virtual camera:

$$\begin{aligned} \alpha d_{v}=\left[ \begin{array}{ccc} -1 &{} 0 &{} 0\\ 0 &{} 1 &{} 0\\ 0 &{} 0 &{} h_{v} \end{array}\right] \left[ \begin{array}{ccc} -\mathrm{cos}(\theta ) &{} 0 &{} h \cdot \frac{\mathrm{sin}^{2}(\theta )}{\mathrm{cos}(\theta )}\\ 0 &{} 1 &{} 0\\ \mathrm{sin}(\theta ) &{} 0 &{} h \cdot \mathrm{sin}(\theta ) \end{array}\right] _{c}^{-1}d_{c} \end{aligned}$$
(4)

or

$$\begin{aligned} \alpha d_{v}=H_{v}H_{c}^{-1}d_{c} \end{aligned}$$
(5)

where \(\alpha \) is an arbitrary constant scale.

Fig. 4 Perspective effect removal: camera coordinate system C, global coordinate system O and virtual camera coordinate system V

The distance h between the camera and the opposite side of the tablet is known, and the gyroscope and accelerometer sensors of the tablet are used to compute the inclination angle \(\theta \).
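
Putting Eqs. (2), (3) and (5) together, the rectification can be sketched as below; the function name, the choice of output image size and the assumption that A is a 3\(\,\times \,\)3 double-precision matrix are illustrative, not the actual implementation.

```cpp
#include <cmath>
#include <opencv2/imgproc.hpp>

// Map the camera image to the view of a virtual top-down camera at height h_v.
// A is the 3x3 intrinsic matrix (CV_64F), theta the device inclination from the
// sensors and h the distance between the camera and the opposite side.
cv::Mat rectifyTopDown(const cv::Mat& image, const cv::Mat& A,
                       double theta, double h, double h_v) {
    double c = std::cos(theta), s = std::sin(theta);

    // extrinsic part of the real-camera homography, Eq. (2)
    cv::Mat Mc = (cv::Mat_<double>(3, 3) <<
                  -c, 0, h * s * s / c,
                   0, 1, 0,
                   s, 0, h * s);
    // extrinsic part of the virtual-camera homography, Eq. (3)
    cv::Mat Mv = (cv::Mat_<double>(3, 3) <<
                  -1, 0, 0,
                   0, 1, 0,
                   0, 0, h_v);

    cv::Mat Hc = A * Mc;         // Eq. (2)
    cv::Mat Hv = A * Mv;         // Eq. (3)
    cv::Mat H  = Hv * Hc.inv();  // Eq. (5): alpha * d_v = Hv * Hc^-1 * d_c

    cv::Mat rectified;
    cv::warpPerspective(image, rectified, H, image.size());  // output size kept equal
    return rectified;
}
```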

Following the rectification, the Ramer–Douglas–Peucker algorithm [36, 37], as implemented in OpenCV, is used to approximate the segmented contours by polygons, which compresses the edge points and reduces noise. A precision of \(1\,\%\) of the contour length was used in the algorithm.
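
In OpenCV this step reduces to a single call; the wrapper below is only an illustrative sketch with the 1 % tolerance described in the text.

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Ramer-Douglas-Peucker approximation of a segmented contour with a tolerance
// of 1% of the contour length.
std::vector<cv::Point> approximateContour(const std::vector<cv::Point>& contour) {
    std::vector<cv::Point> poly;
    double epsilon = 0.01 * cv::arcLength(contour, /*closed=*/true);
    cv::approxPolyDP(contour, poly, epsilon, /*closed=*/true);
    return poly;
}
```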

The approximated contours of the segmented parts change if the assembly part is rotated or if it is farther from or nearer to the camera. The parts in the image are recognized by matching them to the models in a database, so the extraction of features that are invariant to translation, rotation and scale is necessary to make the comparison between segmented parts and models possible. The models in the database are loaded in the setup phase and are described in more detail later in this section. As presented in [38], the turning function describes a shape contour in a representation that is invariant to translation, rotation and scale. Suppose that a polygon is represented by a curve; this curve can be re-scaled so that its total length is equal to 1. The turning function \(\phi (s)\) measures the change of the angle \(\phi \) along the curve as a function of the curve length s. Following the curve's path in the counter-clockwise direction, the angle increases on counter-clockwise turns and decreases on clockwise turns, as the turns are accumulated along the curve. The turning function has the following properties: it is translation invariant and scale invariant, a rotation corresponds to a shift of the accumulated turns, and the choice of the initial starting point corresponds to a shift in the curve segment length.
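
A sketch of how a turning function can be computed for a closed polygon traversed counter-clockwise; the TurnSegment representation and the handling of the starting direction are choices made here for illustration, not the paper's implementation.

```cpp
#include <cmath>
#include <opencv2/core.hpp>
#include <vector>

// One piece of the turning function: along a fraction `length` of the
// unit-perimeter contour, the accumulated tangent angle is `angle`.
struct TurnSegment {
    double length;  // normalized edge length (the lengths sum to 1)
    double angle;   // accumulated turning angle on this edge
};

static double segLength(const cv::Point2f& d) { return std::hypot(d.x, d.y); }

// Turning function of a closed polygon traversed counter-clockwise: lengths are
// normalized by the perimeter (translation and scale invariance) and turns are
// accumulated, increasing on counter-clockwise turns and decreasing on
// clockwise turns.
std::vector<TurnSegment> turningFunction(const std::vector<cv::Point2f>& poly) {
    const size_t n = poly.size();
    std::vector<TurnSegment> tf(n);

    double perimeter = 0.0;
    for (size_t i = 0; i < n; ++i)
        perimeter += segLength(poly[(i + 1) % n] - poly[i]);

    double angle = 0.0, prevDir = 0.0;
    for (size_t i = 0; i < n; ++i) {
        cv::Point2f d = poly[(i + 1) % n] - poly[i];
        double dir = std::atan2(d.y, d.x);
        if (i == 0) {
            angle = dir;  // direction of the first edge
        } else {
            double turn = dir - prevDir;
            // wrap the turn into (-pi, pi] so it accumulates with the right sign
            while (turn <= -CV_PI) turn += 2 * CV_PI;
            while (turn >   CV_PI) turn -= 2 * CV_PI;
            angle += turn;
        }
        prevDir = dir;
        tf[i] = {segLength(d) / perimeter, angle};
    }
    return tf;
}
```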

The matching between two polygons is done by calculating a distance, as discussed in [39]; the method used is the polygonal method. The choice of the initial starting point affects the turning function, so it is necessary to compute the minimum distance over all possible turning functions obtained from different initial starting points. The following metric is used to match two closed polygons A and B given their turning functions \(\phi _{1}(s)\) and \(\phi _{2}(s)\):

$$\begin{aligned} D(A,B)=\min _{\underset{u\in [0,1]}{\alpha \in \mathfrak {R}}}\left[ \intop _{0}^{1}(\phi _{1}(s)-\phi _{2}(s+u)+\alpha )^{2}ds\right] ^{\frac{1}{2}} \end{aligned}$$
(6)

The optimal \(\alpha \) is:

$$\begin{aligned} \alpha =\intop _{0}^{1}\left[ \phi _{1}(s)-\phi _{2}(s)\right] ds-2\pi u \end{aligned}$$
(7)

where u is the choice of a starting point and \(\alpha \) is the rotation difference between the two polygons.

The turning function is computed for each segmented object. Then the minimum distance between the turning function of the segmented object and each model's turning function in the database is computed using Eq. (6). The minimum is reached when the best orientation between the curves and the best starting point are selected, and a match is detected when the minimum distance to the model is less than a defined threshold of 0.8. Due to the nature of the pentomino pieces, some flipped pieces are not recognized as the same object, because the turning function of a flipped piece differs depending on the piece's shape. Therefore, the turning functions of the pentomino pieces that present a distinct turning function in the flipped state were also included. In this way, some pentomino pieces have two possible turning functions, as is the case for the "P" piece.
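
A sketch of this distance computation, assuming the TurnSegment representation from the previous sketch. The functions are discretized on a uniform grid and the shift u is restricted to grid positions, which approximates Eq. (6) rather than reproducing the exact polygonal algorithm of [39]; the optimal \(\alpha \) of Eq. (7) is absorbed in the closed-form variance of the sampled differences.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <opencv2/core.hpp>
#include <vector>

// Sample a (piecewise-constant) turning function at N uniform arc-length
// positions; `tf` is the output of turningFunction() above.
std::vector<double> sampleTurning(const std::vector<TurnSegment>& tf, int N) {
    std::vector<double> samples(N);
    double segStart = 0.0;
    size_t seg = 0;
    for (int j = 0; j < N; ++j) {
        double sj = (j + 0.5) / N;
        while (seg + 1 < tf.size() && sj >= segStart + tf[seg].length) {
            segStart += tf[seg].length;
            ++seg;
        }
        samples[j] = tf[seg].angle;
    }
    return samples;
}

// Discrete approximation of Eq. (6); f1 and f2 must have the same size.
double turningDistance(const std::vector<double>& f1, const std::vector<double>& f2) {
    const int N = static_cast<int>(f1.size());
    double best = std::numeric_limits<double>::max();
    for (int k = 0; k < N; ++k) {
        double sum = 0.0, sumSq = 0.0;
        for (int j = 0; j < N; ++j) {
            // phi2(s+u): crossing s = 1 adds the full 2*pi turn of a closed curve
            double wrap = (j + k >= N) ? 2 * CV_PI : 0.0;
            double diff = f1[j] - (f2[(j + k) % N] + wrap);
            sum += diff;
            sumSq += diff * diff;
        }
        double meanSqResidual = (sumSq - sum * sum / N) / N;  // minimum over alpha
        best = std::min(best, meanSqResidual);
    }
    return std::sqrt(std::max(best, 0.0));
}
```

A segmented contour would then be accepted as a given part when its minimum distance to that model's turning function, including the flipped variant where one exists, falls below the 0.8 threshold.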

Each assembly part in the pentomino puzzle has a unique shape; therefore, it is necessary to handle parts identified more than once in the object recognition algorithm. If a model is matched to more than one contour, the segmented image object with the minimum distance to the model is kept as the match to that model. Also, due to noise in the captured image, the segmentation algorithm can detect noise as a small contour. The elimination of such noise contours is made by comparing the arc lengths of contours matched to the same model: if the arc length of one contour is less than 20 % of the arc length of the competing contour, the contour with the smaller arc length is discarded as a possible match to the model.
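
A sketch of this filtering step; the Candidate structure and the grouping of candidates by model label are assumptions made for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// A candidate match between a segmented contour and a model.
struct Candidate {
    int contourIndex;   // index of the segmented contour
    double distance;    // turning-function distance to the model
    double arcLength;   // contour perimeter in pixels
};

// Keep at most one contour per model: small noise contours (arc length below
// 20% of the competing contour) are discarded, otherwise the contour with the
// smallest distance wins.
std::map<std::string, Candidate>
filterMatches(const std::map<std::string, std::vector<Candidate>>& byModel) {
    std::map<std::string, Candidate> kept;
    for (const auto& entry : byModel) {
        const Candidate* best = nullptr;
        for (const Candidate& c : entry.second) {
            if (!best) { best = &c; continue; }
            if (c.arcLength < 0.2 * best->arcLength) continue;   // c is noise
            if (best->arcLength < 0.2 * c.arcLength ||           // best is noise
                c.distance < best->distance)
                best = &c;
        }
        if (best) kept[entry.first] = *best;
    }
    return kept;
}
```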

For the assembly guidance process, the implemented system loads the assembly part models and the solution description from files in the setup phase. The assembly part models are loaded from a JSON formatted file which lists the assembly parts with the following information:

  1. Assembly part label;

  2. Set of connected edges, each edge is labeled and described by vertices.

Figure 5 shows a JSON formatted file with the description of a pentomino part model. In this example, the pentomino piece "V" is described by a label and by an array of edges. The edges are described by a label "Vn" (with n between 0 and 5) and the model's vertices. Each vertex is described by a point on a Cartesian base. The models from the JSON formatted file are stored in a database and the models' turning functions for object recognition are derived from them.

Fig. 5 Assembly part JSON example. In this example the "V" part of the pentomino puzzle is described in the Parts array

Figure 6 shows a JSON formatted file with the description of an assembly sequence. The assembly sequence is represented by a list of assembly tasks containing:

  1. "Sequence"—the sequence order;

  2. "Part"—the assembly part model to be moved in the task;

  3. "Connections"—the assembly connections, represented by a set of connected edges.

Fig. 6 Assembly sequence JSON example

An assembly layout is used for guidance on the user interface. The assembly layout is a grid of cells displayed over the camera image to guide the connection of assembly parts by indicating the part location within the whole object. The location of an assembly part is shown by highlighting the cells to be occupied, as determined by the assembly task. The cells to be highlighted for each assembly task are computed by:

  1. reading the models and the solution description;

  2. translating and rotating the assembly part of each assembly task to its placement on the assembly layout;

  3. estimating the cells to be highlighted for each assembly task.

The rotation of the assembly part model is computed by rotating the part associated with the assembly task until all connected edges are parallel. The translation is computed by estimating the distance between the connected edges, which are all parallel after the rotation. Given the rotation and translation of the assembly part model, the validity of the connection is verified: a connection is valid whenever the rotated and translated part does not overlap another subassembly. The system steps through each assembly task until all the parts are verified to have valid connections. When this condition occurs at the end of the assembly sequence, the object is considered completely assembled.
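
The placement computation could be sketched as follows, assuming unit grid cells and polygon vertices expressed in layout coordinates; the function names, the rasterization by cell centers and the single-edge-pair interface are simplifications made here, not the actual A3R implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <opencv2/imgproc.hpp>
#include <set>
#include <utility>
#include <vector>

using Cell  = std::pair<int, int>;   // layout grid cell (unit squares)
using Cells = std::set<Cell>;

static cv::Point2f rotateTranslate(const cv::Point2f& p, double a, cv::Point2f t) {
    double c = std::cos(a), s = std::sin(a);
    return {static_cast<float>(c * p.x - s * p.y + t.x),
            static_cast<float>(s * p.x + c * p.y + t.y)};
}

// Place a part for one assembly task: rotate it so the connected edges become
// parallel (in opposite directions), translate it so the edges meet, rasterize
// the placed polygon into layout cells and reject placements overlapping
// occupied cells. Returns the cells to highlight, or an empty set if invalid.
Cells placePart(const std::vector<cv::Point2f>& part,            // model vertices
                std::pair<cv::Point2f, cv::Point2f> partEdge,    // edge on the part
                std::pair<cv::Point2f, cv::Point2f> targetEdge,  // edge on subassembly
                const Cells& occupied) {
    if (part.empty()) return {};

    double angle = std::atan2(targetEdge.first.y - targetEdge.second.y,
                              targetEdge.first.x - targetEdge.second.x) -
                   std::atan2(partEdge.second.y - partEdge.first.y,
                              partEdge.second.x - partEdge.first.x);
    cv::Point2f t = targetEdge.second -
                    rotateTranslate(partEdge.first, angle, cv::Point2f(0, 0));

    std::vector<cv::Point2f> placed;
    for (const auto& v : part) placed.push_back(rotateTranslate(v, angle, t));

    // bounding box of the placed polygon
    float minX = placed[0].x, maxX = placed[0].x;
    float minY = placed[0].y, maxY = placed[0].y;
    for (const auto& p : placed) {
        minX = std::min(minX, p.x); maxX = std::max(maxX, p.x);
        minY = std::min(minY, p.y); maxY = std::max(maxY, p.y);
    }

    // rasterize: a unit cell belongs to the part when its center lies inside
    Cells cells;
    for (int y = static_cast<int>(std::floor(minY)); y < std::ceil(maxY); ++y)
        for (int x = static_cast<int>(std::floor(minX)); x < std::ceil(maxX); ++x)
            if (cv::pointPolygonTest(placed, cv::Point2f(x + 0.5f, y + 0.5f), false) > 0)
                cells.insert({x, y});

    for (const Cell& c : cells)
        if (occupied.count(c)) return {};  // overlap: connection is not valid
    return cells;                          // cells to highlight for this task
}
```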

5 Results and Discussion

To evaluate the segmentation and object recognition algorithms, we processed 35 images with the 12 pentomino pieces placed at random orientations and positions without occlusion. From a total of 420 pieces, the algorithm segmented 413 pieces; thus, the segmentation rate is 0.98.

Analyzing the pieces which could not be segmented, we noticed that the segmentation failed due to borders that were not fully closed. The morphological closing operation with one iteration of a 2\(\,\times \,\)2 rectangular element was not sufficient to close some borders, and the border following algorithm accumulated the points of the border as an open contour. The segmentation result may be improved if a larger element is used in the morphological closing operation; on the other hand, larger elements may deform the shape, which makes the identification task more challenging.

Fig. 7 Example of pentomino part recognition assuming unique pentominoes

For the object recognition, we assumed that the pentomino pieces are unique. In the distance computation between the turning functions of segmented contours and models, only segmented contours with a distance lower than 0.8 are selected as candidates for identification. As the object recognition depends directly on the segmentation result, only successfully segmented parts were used for the object recognition. From a total of 413 segmented parts, 407 were correctly identified. Figure 7 shows an example of the identified parts and their labels. We noticed that some of the unrecognized puzzle parts were near the edge of the camera image: they were affected by the lens distortion correction, which acts more aggressively on pixels far from the center of the image and deforms the piece shape.

It is important to evaluate the execution time of each algorithm, as near real-time performance is desired in AR systems. We placed the pentomino pieces so that the camera could capture all 12 pieces and executed the algorithms 500 times. The execution time for each process is summarized in Table 3. The segmentation execution time accounted for a large share of the whole process execution time. As the assembly guidance algorithm is executed in parallel in the implemented system, its execution time does not affect the image processing and object recognition execution times. The derivation of the turning functions of segmented contours and the distance computation can be parallelized; therefore, the object recognition execution time can be improved by running part of the algorithm in a parallel process.

Table 3 Algorithms execution time
Fig. 8 Indication to the user of the part to be flipped. a Alternating symbol for the currently detected part in white. b Alternating symbol for the part in the final flipped state in green

When an assembly part needs to be flipped, the system alternates between highlighting the currently detected part with a white symbol and showing the part in its final flipped state with a green symbol, as shown in Fig. 8a, b, respectively, in order to indicate the need for this action to the user. The system also produces an audible warning whenever the need for a flip action is detected. Figure 9 presents an intermediate state of the assembly guidance process. The assembly part to be moved in this sequence is highlighted, and an arrow indicates to the user where the part must be moved in order to connect it to the subassembly. The highlighted cells near the arrow tip represent the placement location of the part for connection, and the remaining filled cells represent parts, or subassemblies, already assembled and connected in previous steps. When the system detects that the part is placed in the right position and connected to the subassembly, an audible confirmation informs the user and the system steps to the next assembly sequence. This process is repeated until the assembly process is detected as completed, as illustrated in Fig. 9b.

Fig. 9 Assembly process guidance viewed by the user. a Assembly process in an intermediate state. b Completed assembly process

6 Conclusion

This paper presented an AR system to guide a user through an assembly process. The system is based on image processing techniques to detect the parts to be assembled and to follow the assembly process, and it uses graphical signs to inform the user of each step in the assembly sequence. Segmentation and object recognition are two important aspects of this project. Assembly part segmentation is made by a combination of color segmentation, the Canny edge detector, a morphological closing operation and a border following algorithm; a segmentation rate of 0.98 has been achieved. Segmented assembly parts are identified by computing the turning function and the distance between the turning functions of the segmented assembly parts and the models. We made the assumption that all 12 pentomino parts are present in the evaluated images, and a filter algorithm has been used to reduce the likelihood of a noise contour being incorrectly labeled as a pentomino part. We achieved a rate of 0.98 for object recognition with this process.

We evaluated the execution time of each image processing step: undistortion, perspective rectification, segmentation and object recognition. A mean of approximately 109 ms for the whole process and an execution time of 20.7 ms for the object recognition of 12 pentomino pieces were achieved on an iPad Air 2. We reckon that in about two generations of hardware upgrades real-time performance will be possible.

The graphical interface guides the user through a series of movement operations such as flipping a part and connecting it to a subassembly. Colored symbols and audible confirmations and warnings give the necessary instructions and feedback to the user. The system has been implemented and tested with several puzzle solutions, from the starting point, where all parts are placed randomly in the workspace, until the full assembly is completed. General descriptions of the assembly model, assembly task, assembly sequence and assembly layout were proposed, derived from [31], implemented and validated.