1 Introduction

Ubiquitous computing is becoming a reality thanks to the fast diffusion of smartphones and tablets and to their increasing processing and 3D graphics power. A novel human-computer interaction paradigm, called the post-desktop approach [1], is closely related to pervasive computing: the devices are no longer personal computers, but tiny, even invisible devices embedded in almost any type of surrounding object, including cars, tools, appliances, clothing, etc., all communicating through increasingly interconnected networks [2]. Nowadays, these mobile devices are candidates to replace desktop PCs in our daily life and, in the near future, to play an important role also in industrial scenarios. The incorporation of ubiquitous devices, micro projectors, and embedded 3D scanners can lead to a revolutionary way to interact with 3D CAD models. Therefore, in the area of computer aided design methods, the study of the potential and the limits of desk-less interfaces in industrial use is a very important issue.

In particular, current CAD interfaces show evident limits in design review (DR). DR is a crucial step in product lifecycle management (PLM). Its goal is to spot, as early as possible in the production chain, product and process weaknesses, errors and manufacturing problems. One critical aspect of the design evaluation of an industrial component is the understanding of the engineering model. DR requires an efficient workspace to understand complex 3D geometries, to browse a large number of components in an assembly and to select and manipulate them. In real industrial scenarios, we can find the following conditions conflicting with the use of a traditional desk interface: the lack of a clean desk, the user wearing gloves, and the necessity of comparing virtual (i.e. “ideal” CAD models) with real, defective or incomplete parts or assemblies.

This paper aims to study a new generation of gesture-based interfaces, called natural interfaces, to facilitate technical discussion and CAD model navigation (see Fig. 1). Natural interfaces are designed to “reuse existing skills for interacting directly with content” [3], and therefore they are particularly user friendly. However, due to their novelty and to the lack of gesture recognition support in commercial CAD kernels, the literature on this topic is still scarce or too general, while a specific CAD-oriented methodology is necessary.

Fig. 1

Natural gestures to navigate 3D CAD models in augmented reality

Differently from other similar approaches in literature, in this work we integrate natural interfaces in an augmented reality environment. Using bimanual interaction on a virtual active surface, the user can navigate, inspect, and interact with CAD models.

The paper is organized as follows: we start with a brief survey of related works in Sect. 2 and analyze the design review requirements in Sect. 3. Section 4 contains a detailed description of the virtual active surface concept, and in Sect. 5 we detail the implementation. In Sect. 6 we present a case study and the users' response, while in Sect. 7 we conclude the paper and outline future work.

2 Related works

It is a well-known issue in industry that CAD software and downstream applications hardly support DR, because their human-computer interfaces are specifically oriented to expert CAD users seated at an office desk and are not suited for natural 3D model navigation and collaboration. Figure 2 depicts a simplified taxonomy of the common interaction metaphors for the manipulation of 3D virtual objects according to Bowman et al. [4]. They split the interaction metaphors into two main branches: exocentric and egocentric. In the former, the exocentric, also called the “perspective of God”, the user interacts from outside the virtual environment. An example of this approach is the world-in-miniature [5], where the user is provided with a miniature model of the whole scene and the interactions are mapped to the real-scale world. In the latter, the egocentric, the user interacts from inside the environment according to two metaphors: the virtual hand and the virtual pointer. The first method needs 3D tracking to map the user's real hands to a virtual representation of them, in order to simulate real-life gestures and benefit from natural skills and coordination. With the second metaphor, the virtual pointer, the user selects and manipulates objects by pointing at them. A very common example of this approach is the “laser ray”: an imaginary linear vector emanating from the user's eye through the virtual pointer, which emulates the mouse pointer. Each of the presented approaches has positive aspects and drawbacks, and there is no optimal solution for all possible scenarios: model size, number of elements, object complexity, user expertise, etc.

Fig. 2

Classification of interaction metaphors according to Bowman et al.

In a previous work [6] we presented a desk-less egocentric interface for enriching paper technical drawings with virtual technical content (2D images, 3D graphics, annotations). The system was based on two advanced technologies: tangible user interfaces (TUIs) and augmented reality (AR). TUIs enable a person to interact with digital information through the physical environment [7]. They are designed as an alternative paradigm to conventional GUIs, allowing users to manipulate objects in virtual space using physical, thus “tangible”, objects. The study proved the advantages of a desk-less approach in engineering discussion; however, it disclosed some limits: the “fiducial”-based tracking, the restricted number of available commands and the need to have at least one hand free and clean. These conditions have a deep impact on usability in industrial scenarios, where the multiple markers needed to keep the tracking robust reduced the useful information area. We also needed a hybrid interface, supporting the tangible one with a traditional GUI, to offer a larger choice of commands. Finally, tangible interfaces proved to be very intuitive, but in some industrial scenarios, like maintenance, an interface which leaves the hands free could be more appropriate.

Buchmann et al. [8] introduced one significant step towards egocentric and highly natural interfaces with FingARtips: a gesture-based system for the direct manipulation of virtual objects. They attached fiducial markers to each finger to track the fingertips of a person and to derive gestures. This system allows the user to pick up virtual objects in 3D space. Reifinger et al. [9] presented a similar approach, but using infrared tracking for hand and gesture recognition: infrared-reflective markers are attached to the fingertips of the user, who can grasp virtual objects as in the physical world.

Recent technologies such as 3D depth cameras (e.g. Microsoft Kinect\(^\circledR \) [10], see Fig. 3), can potentially allow low-cost optical-based gesture recognition without the need to wear or handle any awkward device [11].

Fig. 3

The XBox Kinect\({^\circledR }\) specifications and outputs

Recognizing human gestures in real time in a general environment is a complex task, which involves multiple multidisciplinary aspects such as motion modelling, motion analysis, pattern recognition, and machine learning. For hand detection, which is an essential component of gesture recognition, the most common approaches in the literature use colour information (skin segmentation) and motion information. However, these approaches struggle with changing light conditions and make hand motion tracking not robust [12]. Recently, 2D vision-based gesture recognition systems have received a boost from the additional aid provided by depth cameras [13, 14]. A lot of research effort is spent on improving the algorithms in terms of latency, robustness and precision, but few studies have been carried out on usability in a specific application field. Our contribution to the state of the art of this emerging technology is to apply natural interfaces to the CAD domain, and in particular to support engineering 3D model navigation for design review. In fact, the idea of gesture-based CAD interaction is not new in the literature and is deeply connected with virtual reality (VR) technology [15]. In a previous research using a VR setup, we investigated the asymmetry of 3D input through user studies [16]: pointing accuracy along the depth direction (i.e., in front of the user) is significantly lower than along the other (horizontal/vertical) directions. Valentini [17] presented a novel approach to achieve realistic and real-time CAD assembly simulations in an augmented reality environment using 3D natural interfaces and a depth camera. A real-time solver allowed the picking of multiple simple 3D objects (e.g. cylinders, cubes, etc.), while the depth information resolved occlusions between virtual and real objects (including the user's body).

All the presented interaction metaphors are designed to work with 3D graphic models without taking into account the engineering knowledge that they contain. This paper aims to explore an egocentric virtual pointer interface with a specific design review approach for CAD assemblies.

3 Design review requirements

A significant issue in current DR tools is usability. Differently from CAD modeling, the audience of DR is much wider and ranges from specialized designers to marketing, management and quality control professionals. This heterogeneous group of users can benefit from a natural interface. In a previous research we already experimented with natural interfaces based on 3D interaction, where user tests proved the advantages but also the disadvantages of a completely unconstrained 3D navigation [18]. In fact, a fully unconstrained and direct 3D interaction, while theoretically being the most natural, can be awkward due to the model dimensions, the tracking latency and precision, the user's attention allocation, the interaction anisotropy and the lack of tactile feedback. In the specific domain of design review, our approach aims to improve usability by reducing the available degrees of freedom in the natural interaction. In practice, DR requires the implementation of a limited but effective subset of CAD functions. These functions should be accessible without mouse/keyboard input, for the aforementioned reasons.

Another important aspect of DR is the straightforward integration in the engineering design workflow, according to industry standards and practice (e.g. file formats, technical knowledge, etc.). Most of the natural interaction systems described in the literature focus mainly on geometry and aesthetic rendering. A CAD model, however, is not just geometry and materials: it contains a lot of technical data which need to be visualized. The extraction of non-geometrical data and the way the user can interact with them are still open issues. Each CAD vendor hides the data structure in proprietary formats, and therefore the only feasible way to access the data is through neutral formats. In particular, we decided to use the STEP format, which is a de facto standard supported by most commercial CAD systems [19]. The retrieved information can be used to navigate the models exploiting the engineering knowledge that is embedded in the CAD design. Figure 4 depicts the CAD data workflow. In the first step, independently from the specific CAD platform, the models are exported in the STEP file format.

Fig. 4

CAD data integration workflow

In the second step, the AR converter module, based on the OpenCascade CAD kernel, prepares the data files for the AR application. The main function of this module is tessellation: it converts each part model from its B-rep mathematical representation into a separate mesh file. The system supports both STEP application protocols: AP 203 “Configuration controlled 3D designs of mechanical parts and assemblies” and AP 214 “Core data for automotive mechanical design processes”. The level of detail of the triangulation and, consequently, the precision of the graphical representation can be optimized for the specific visualization hardware in order to obtain real-time interaction. The local translations and rotations of the single CAD parts are flattened to a common world reference system by traversing the assembly structure. The assembly structure, the part filenames and other CAD-related data (e.g. volume and constraints) are stored in a custom 3D model XML file. This is the input file for the AR application, which allows the user to browse parts and assemblies using simple gestures and visual feedback according to our novel virtual active surface paradigm.
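As an illustration of this conversion step, the following minimal C++ sketch (assuming the OpenCascade kernel mentioned above) reads a STEP file and triangulates its B-rep geometry; the deflection values and the command-line handling are illustrative assumptions, and the per-part export and placement flattening are only indicated in comments.

// Hedged sketch of the AR converter tessellation step (OpenCascade/OCCT).
#include <STEPControl_Reader.hxx>
#include <IFSelect_ReturnStatus.hxx>
#include <BRepMesh_IncrementalMesh.hxx>
#include <TopoDS_Shape.hxx>
#include <iostream>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: arconverter model.stp\n"; return 1; }

    // Read the neutral STEP file (AP 203 and AP 214 are both handled by OCCT).
    STEPControl_Reader reader;
    if (reader.ReadFile(argv[1]) != IFSelect_RetDone) return 1;
    reader.TransferRoots();
    TopoDS_Shape shape = reader.OneShape();

    // Triangulate the B-rep: the linear deflection controls the level of detail
    // and can be tuned for the target visualization hardware.
    const double linDeflection = 0.5;   // assumed value (model units)
    const double angDeflection = 0.3;   // radians, assumed
    BRepMesh_IncrementalMesh mesher(shape, linDeflection, false, angDeflection);

    // In the real converter, each part would now be written to a separate mesh
    // file (.osgt), its placement flattened to the world reference system, and
    // the assembly structure stored in the custom XML model file.
    std::cout << "tessellation done" << std::endl;
    return 0;
}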

4 The virtual active surface

Our interaction approach was inspired by the well-known multi-touch interaction metaphor (e.g. Apple iPad\({^\circledR }\)). The proposed interaction technique follows the design recommendations by Dünser et al. [20]: their guidelines demand a reduced cognitive load, an immediate feedback, and an error-tolerant system. We need to consider that the user is always in movement, and that some human actions do not necessarily indicate an intended interaction (e.g. answering a mobile phone). Moreover, in the interface design we need to consider the limited precision of current low-cost depth cameras: approximately 1 cm at a distance of 2 m. We implemented a bimanual egocentric pointer paradigm with an imaginary virtual plane located approximately 50 cm in front of the user's chest (see Fig. 5).

Fig. 5

Concept of the virtual active surface

If the user's hands are near this plane, the system recognizes an intentional interaction and consequently triggers an action if it is associated with a previously classified gesture.

The active surface is designed to separate the interaction volume from the “personal” area between the plane and the user. The personal area is a private untracked zone, which allows normal working activity (e.g. sketching on paper or handling objects) without triggering any event. The system recognizes the user's chest position and moves the plane accordingly.

A big issue we experienced in the implementation of this metaphor is the lack of tactile feedback. Our solution is to provide visual feedback when the user touches the virtual plane, by displaying a semi-transparent red frame at the margin of the user's field of view (see Fig. 6). The active surface geometry is defined by three parameters: width (\(w\)), height (\(h\)) and depth (\(d\)). Their values must be adapted to the user's anthropometry for an ergonomic cursor mapping. The depth \(d\) also represents the collision sensitivity of the active surface: the interaction is active if the centre of the hand is contained between the two back/front planes, see Fig. 6.

Fig. 6

Active plane depth limits

These three parameters are defined during a preliminary calibration phase in which the user extends his/her arms. Visual cursors are displayed on the active plane as semi-transparent proxies (2D discs, 3D spheres, or even a virtual hand model).
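A minimal sketch of the hand-to-cursor mapping follows; the data structures and the normalization are our own illustrative assumptions, while the geometric parameters (\(w\), \(h\), \(d\)) and the chest-relative plane come from the description above.

// Hedged sketch: tests whether a tracked hand lies inside the active surface
// slab and, if so, maps it to a normalized 2D cursor position.
struct Vec3 { float x, y, z; };

struct ActiveSurface {
    Vec3  chest;    // tracked chest position (camera frame, metres)
    float w, h, d;  // width, height and depth from the calibration phase
    float offset;   // plane distance from the chest, e.g. 0.5 m
};

// Returns true if the hand activates the surface; outputs a cursor in [0,1]^2.
bool mapHandToCursor(const ActiveSurface& s, const Vec3& hand,
                     float& u, float& v) {
    // Depth of the virtual plane (a point closer to the camera has smaller z).
    float planeZ = s.chest.z - s.offset;
    if (hand.z < planeZ - 0.5f * s.d || hand.z > planeZ + 0.5f * s.d)
        return false;                            // hand is outside the slab
    // Map lateral/vertical displacement from the chest to cursor coordinates.
    u = (hand.x - s.chest.x) / s.w + 0.5f;
    v = (hand.y - s.chest.y) / s.h + 0.5f;
    return u >= 0.f && u <= 1.f && v >= 0.f && v <= 1.f;
}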

The cursor state is obtained from a gesture recognition module. Although we could exploit all the gestures obtainable with five fingers, in this metaphor we use only three states: open hand state (OS), fist state (FS), and bang state (BS). We use the open hand for idle cursor visual feedback, the “grasping” fist for selection, and the three-finger state for navigation (see Fig. 7). The user has visual feedback of these states because the proxies change colour accordingly. The 3D object interaction is obtained by ray casting the 2D cursors according to the virtual pointer metaphor.

Fig. 7

Hand states

4.1 Basic CAD interaction

Our interface provides both single-hand (left or right) and bimanual gesture interaction. To visualize the cursor(s), the hand(s) must act on the virtual plane, whose depth \(d\) is usually around 30 cm.

The classical scene navigation is obtained with three functions: orbit, pan and zoom. The user can orbit the scene with a single hand in BS. With both hands in BS, the system can perform pan and zoom at the same time. If the motion of the two hands points in roughly the same direction, a pan action is activated. To zoom in, the user moves the two hands closer to each other, and to zoom out the user moves them away from each other. The midpoint between the hands controls the centre of the scaling operation (i.e., the only point that remains fixed).
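A possible implementation of the bimanual pan/zoom update is sketched below; the cursor structure and the frame-to-frame formulation are assumptions, while the simultaneous pan/zoom and the fixed midpoint follow the description above.

#include <cmath>

struct Cursor { float x, y; };   // normalized cursor position on the active plane

struct NavUpdate {
    float panX = 0.f, panY = 0.f;   // scene translation
    float scale = 1.f;              // ratio of hand distances between frames
    Cursor pivot{0.5f, 0.5f};       // midpoint of the hands: fixed point of the scaling
};

NavUpdate bimanualUpdate(const Cursor& prevL, const Cursor& prevR,
                         const Cursor& curL,  const Cursor& curR) {
    NavUpdate u;
    // Pan: average displacement of the two hands (both in BS).
    u.panX = 0.5f * ((curL.x - prevL.x) + (curR.x - prevR.x));
    u.panY = 0.5f * ((curL.y - prevL.y) + (curR.y - prevR.y));
    // Zoom: change of the inter-hand distance; the sign of the change is mapped
    // to zoom in/out according to the convention described above.
    float dPrev = std::hypot(prevR.x - prevL.x, prevR.y - prevL.y);
    float dCur  = std::hypot(curR.x - curL.x,  curR.y - curL.y);
    if (dPrev > 1e-4f) u.scale = dCur / dPrev;
    // The midpoint between the hands is the centre of the scaling operation.
    u.pivot = {0.5f * (curL.x + curR.x), 0.5f * (curL.y + curR.y)};
    return u;
}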

Object selection is obtained by moving the cursor onto the object and changing the state to FS, simulating a grasping action.

4.2 Files interaction

As regards managing CAD files with natural interaction, we already presented a visual file manager in a previous work [21].

In the current approach, differently from that solution, opening a CAD file is as simple and intuitive as scrolling the contact list on a smartphone. The documentation browsing bar (see bottom of Fig. 8) appears and disappears automatically when the user's pointer reaches the lower zone of the active surface. The CAD files are visually represented as miniaturized icons on the document browsing bar. The user scrolls the files with a hand in FS moving towards left or right (see Fig. 9). Once found, the file is loaded by selecting and dragging its icon to the centre of the active surface area. The current CAD document is closed by reducing the zoom until the hands stay joined for more than 3 s.

Fig. 8

The virtual workspace provided to the user

Fig. 9

Documentation browsing bar concept

4.3 Interactive exploded view

Once an assembly file is loaded, it is ready for navigation. The user can explode it by applying a zoom-in-like gesture with both hands in FS; correspondingly, the user can recompose the assembly by applying a zoom-out-like gesture with both hands in FS. In the exploded view the user is able to visualize and select a sub-assembly from the documentation browsing bar, which is updated according to the assembly context.
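The interactive explosion can be sketched as a radial offset of each part away from the assembly centre, scaled by a factor driven by the bimanual FS gesture; the data structures below are hypothetical, and only the gesture mapping is taken from the description above.

#include <vector>
#include <cmath>

struct Vec3f { float x, y, z; };

struct PartNode {
    Vec3f centre;   // part centre in world coordinates (from the XML model file)
    Vec3f offset;   // translation applied for the exploded view
};

// factor = 0 means fully assembled; it is increased or decreased while both
// hands are in FS, following the zoom-like gesture described above.
void explode(std::vector<PartNode>& parts, const Vec3f& assemblyCentre, float factor) {
    for (PartNode& p : parts) {
        Vec3f dir{p.centre.x - assemblyCentre.x,
                  p.centre.y - assemblyCentre.y,
                  p.centre.z - assemblyCentre.z};
        float len = std::sqrt(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
        if (len < 1e-6f) continue;          // parts at the centre stay in place
        p.offset = {dir.x / len * factor, dir.y / len * factor, dir.z / len * factor};
    }
}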

In the next section we describe the hardware setup and the software architecture we implemented to provide the presented natural interface.

5 Implementation

The system is composed of two distinct PCs that operate in parallel: one for the user tracking and one for the AR application. This approach was necessary due to the computational requirements of the user interaction, but it also increases the flexibility of the entire system during the development and testing phases.

5.1 Hardware

Figure 10 shows an overview of the hardware setup of the presented AR application. The user stands in a fixed position in front of the main working area. Although any AR display system (e.g. a head-mounted display) can be effectively integrated with natural interaction, we decided to use a simple monitor-based AR system. It incorporates a 24” widescreen LCD monitor and a video camera mounted on a tripod. The video camera used for pattern-based tracking is a Creative Live Cam Video IM Ultra webcam (\(1{,}280 \times 960\) pixel at 30 fps), located next to the user's head and aligned towards the working space. This configuration simulates a camera attached to a head-mounted display. The user observes the augmented scene on the screen of the monitor, and the Kinect device is located under the monitor, facing the user, to detect her/his gestures. The Kinect video camera provides RGB colour images with a resolution of \(640 \times 480\) pixel and 12-bit depth images. The user does not see these images during normal usage.

Fig. 10

The hardware setup

5.2 Software

The software is written in C++ using object-oriented programming and open source libraries. Figure 11 depicts a schematic overview of the two applications we implemented for the system. The hand tracker system is divided into a skeleton computation module and a hand state recognition module. Their function is to generate real-time events containing the hand positions in 3D space and their states. The skeleton computation relies on the OpenNI framework (http://www.openni.org), an application programming interface that provides a middleware component to retrieve the images from the Kinect and to determine the positions of the user's limbs, in particular the approximate hand locations.

Fig. 11

The software architecture

Since the user anthropometrics are registered in each session, our system can recognize different users and recall their calibration data.
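A minimal sketch of the skeleton computation module is given below, assuming the OpenNI 1.x C++ wrapper; error handling and the calibration callbacks that OpenNI requires before skeleton tracking starts are omitted for brevity.

#include <XnCppWrapper.h>

int main() {
    xn::Context context;
    context.Init();

    xn::UserGenerator userGen;
    userGen.Create(context);
    userGen.GetSkeletonCap().SetSkeletonProfile(XN_SKEL_PROFILE_UPPER);
    context.StartGeneratingAll();

    for (;;) {
        context.WaitAndUpdateAll();

        XnUserID users[4];
        XnUInt16 nUsers = 4;
        userGen.GetUsers(users, nUsers);
        if (nUsers == 0 || !userGen.GetSkeletonCap().IsTracking(users[0])) continue;

        // Approximate 3D positions (millimetres, camera frame) of the limbs
        // used by the interface: the chest/torso and both hands.
        XnSkeletonJointPosition torso, lhand, rhand;
        userGen.GetSkeletonCap().GetSkeletonJointPosition(users[0], XN_SKEL_TORSO, torso);
        userGen.GetSkeletonCap().GetSkeletonJointPosition(users[0], XN_SKEL_LEFT_HAND, lhand);
        userGen.GetSkeletonCap().GetSkeletonJointPosition(users[0], XN_SKEL_RIGHT_HAND, rhand);
        // These positions drive the virtual active surface and are projected
        // onto the depth image to crop the hand regions of interest.
    }
    return 0;
}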

The hand state recognition module uses an image processing algorithm based on OpenCV (http://opencv.willowgarage.com/), an open source computer vision library. The starting point of the algorithm is the depth image retrieved from the Kinect as shown in Fig. 12 on the right.

Fig. 12

An example of the Kinect camera image (left) and the depth image (right)

The system segments the user silhouette from the background using the depth information and projects the 3D positions of the hands of the user skeleton onto the depth image plane. We define two \(80 \times 80\) pixel square regions of interest around the projected positions and apply a threshold function. The final output is composed of two black and white images which contain only the hand outlines. The hand shape is described using the Hu set of invariant moments, which are invariant to translations and rotations and thus increase the system robustness. To identify the hand gesture, we compare the estimated Hu moments with a pre-defined set using a support vector machine, a non-probabilistic binary linear classifier. The pre-defined Hu moments of each gesture are calculated from 200 image samples. We limited the hand states to just three (OS, FS and BS) for two main reasons. Firstly, image processing in an uncontrolled environment is very challenging and the classifier error rate grows more than linearly with the number of hand states; moreover, we observed that false or wrong gesture detections are very frustrating for the user (e.g. triggering one command instead of another). We chose three states as an optimal trade-off between robustness and flexibility for our DR task. The second reason is that some hand configurations proved to be uncomfortable and wearisome (e.g. indicating the number two), especially if repeated or held for more than a few seconds. Our states proved not to be tiring for hands and fingers and are easy to learn and to remember because they mimic real object manipulation.
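The hand-state classification can be sketched as follows with the OpenCV 2.x API; the region-of-interest size and the Hu-moment/SVM pipeline come from the description above, while the depth-slab threshold of ±100 mm and the function interface are illustrative assumptions.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/ml/ml.hpp>

enum HandState { OPEN_HAND = 0, FIST = 1, BANG = 2 };   // OS, FS, BS

HandState classifyHand(const cv::Mat& depth16,       // 16-bit Kinect depth image
                       cv::Point handPx,              // hand joint projected to pixels
                       unsigned short handDepthMm,    // depth of the hand joint (mm)
                       CvSVM& svm)                    // SVM trained on 200 samples per gesture
{
    // 80x80 region of interest centred on the projected hand position.
    cv::Rect roi(handPx.x - 40, handPx.y - 40, 80, 80);
    roi &= cv::Rect(0, 0, depth16.cols, depth16.rows);

    // Threshold: keep only pixels in a thin slab around the hand depth,
    // producing a binary image that contains just the hand outline.
    cv::Mat mask;
    cv::inRange(depth16(roi), cv::Scalar(handDepthMm - 100),
                cv::Scalar(handDepthMm + 100), mask);

    // Hu moments are invariant to translation and rotation of the silhouette.
    cv::Moments m = cv::moments(mask, true);
    double hu[7];
    cv::HuMoments(m, hu);

    cv::Mat feature(1, 7, CV_32F);
    for (int i = 0; i < 7; ++i) feature.at<float>(0, i) = static_cast<float>(hu[i]);

    // The SVM compares the estimated moments with the pre-trained gesture set.
    return static_cast<HandState>(static_cast<int>(svm.predict(feature)));
}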

The 3D hand positions and states are submitted as network events via UDP/IP to the AR application. The augmented reality application can be broken into four modules: the visualization engine, the tracking, the flexible interface and the model manager. The main function of the visualization engine is to perform the real-time overlay and registration of the virtual scene over the real world captured by the camera. This module is based on OpenSceneGraph (an OpenGL-based graphics library, http://www.openscenegraph.org) for the rendering and on the ARToolKit library [22] to perform the scene tracking using image-based markers.
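The link between the two PCs can be sketched as a simple datagram sender; only the UDP/IP transport is stated above, while the packet layout and the port number are assumptions.

#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

struct HandEvent {        // hypothetical wire format
    float x, y, z;        // hand position (metres, camera frame)
    int   hand;           // 0 = left, 1 = right
    int   state;          // 0 = OS, 1 = FS, 2 = BS
};

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    sockaddr_in dest{};
    dest.sin_family = AF_INET;
    dest.sin_port   = htons(5005);                 // assumed port
    inet_pton(AF_INET, "127.0.0.1", &dest.sin_addr);

    HandEvent ev{0.12f, 0.34f, 1.80f, 1, 2};       // e.g. right hand in BS
    sendto(sock, &ev, sizeof(ev), 0,
           reinterpret_cast<const sockaddr*>(&dest), sizeof(dest));

    close(sock);
    return 0;
}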

The application interface module collects the events from the hand tracking and decodes them into a sequence of signals for the application finite state machine. The state machine activates the application functions in a flexible way, because it is described by a standardized UML-XMI model. This approach allowed us to easily explore different interface metaphors by simply changing an XML file instead of doing it programmatically. The last component of the AR application, the model manager, is expressly dedicated to model-related activities. The model manager reads the XML model file generated by the AR converter, parses the assembly structure and manages visualization, registration, assembly configurations, dynamic model loading and explosions.

This file is based on the schema ARProTo.xsd (see https://homepages.uni-paderborn.de/rafael76/ARXML/ARProTo.xsd). Figure 13 shows the XML file of a simple product composed of several cubic parts. The file contains a sequence of ”Model3D” elements, each representing a single part with the following data: a unique ID, the filename of the OSG mesh model (extension .osgt), the location in the global coordinate system (instead of the relative one provided by the CAD system) and the volume. The volume information is the exact value computed by the CAD kernel, instead of the approximate internal volume of the mesh. The volume is used by the model manager to optimize the visualization by unloading, at runtime, small components located far from the user's view.

Fig. 13

A simple example of the XML file generated by AR converter module

The hierarchical structure of the assembly is stored in a separate ”Structure” element, a sequence of ”LinkType” elements. In the example in Fig. 13, the second-level part SOLID59 belongs to the subassembly PRODUCT L156, which is a child of the CADROOT main assembly. The links between the parts can be used in the interactive explosion to generate the steps of the animation. The Type attribute was designed to contain detailed information about the nature of the constraints among the parts. Unfortunately, none of the CAD systems we tested exported this information in the STEP format, and therefore we set Type to ”UNKNOWN”.

Another important data element of the XML file is the ARToolKit element, which contains the ARToolKit tracking configuration: the camera calibration file and the mapping between each fiducial marker (e.g. “patt.hiro”) and the model unique ID.
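A hypothetical fragment consistent with the description of the XML model file is shown below; the element and attribute names are illustrative and are not copied from Fig. 13.

<!-- Illustrative sketch of the AR converter output (schema ARProTo.xsd). -->
<ARProTo>
  <Model3D ID="SOLID59" File="SOLID59.osgt"
           Location="120.0 35.0 0.0 0.0 0.0 0.0" Volume="125000.0"/>
  <Structure>
    <Link Type="UNKNOWN" Parent="CADROOT" Child="PRODUCT L156"/>
    <Link Type="UNKNOWN" Parent="PRODUCT L156" Child="SOLID59"/>
  </Structure>
  <ARToolKit Calibration="camera_para.dat">
    <Marker Pattern="patt.hiro" ModelID="CADROOT"/>
  </ARToolKit>
</ARProTo>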

6 The case study

We compared the proposed virtual active surface interface with a natural interaction interface we previously presented [18], which allows a completely unconstrained 3D navigation. For our test case we used a mechatronic design: the Bebot, designed with CATIA at the University of Paderborn in collaboration with the Heinz Nixdorf Institute.

We carried out a usability test with 12 students from the local engineering faculty in Paderborn. All the users (male) were proficient with at least one CAD system, in particular CATIA. No user had previous experience with hand gesture-based interaction techniques. The Bebot model (169 parts in total, organized in 5 levels of sub-assemblies) was correctly exported in STEP and visualized with the AR application. The original CAD assembly was larger than 200 MB, while the converted mesh models in .osgt format occupied 32 MB. We let the users try the application for 20 min without any instruction. The users were asked to simply examine the model and to understand the main components and their function. The application ran on a Windows 7 PC with an Intel Xeon 3.6 GHz processor, 6 GB RAM and an NVIDIA Quadro 5500 GPU. Figure 14 shows a screenshot taken during the selection phase on the Bebot. Figure 15 shows an exploded view animation triggered by a bimanual fist gesture. The users found exploded views very helpful to understand complex 3D models when occlusion occurs and for the selection of single components. After the tests we interviewed the participants to collect their opinions.

Fig. 14

Selection with right hand

Fig. 15

Model explosion

The post experiment questionnaire featured five-point Likert scale questions (1 = most negative; 5 = most positive) to evaluate ease of use, satisfaction level, and intuitiveness for each interaction mode.

Figure 16 compares the responses of the users regarding the overall ease of use of the 2D interaction versus the 3D one, showing that the former is preferred (median value 4 vs. 3). Also, when directly asked whether the completion of a task is more difficult using 3D techniques than using 2D techniques, the users strongly agreed (median value 5).

Fig. 16

Survey response histograms for overall ease of use. Median values for each condition are shown as triangles

Figure 17 compares the responses for a specific task (Pan & Orbit). The 2D interaction is once more preferred (median value 4 vs. 3).

Fig. 17

Survey response histograms for Pan & Orbit Task. Median values for each condition are shown as triangles

In a direct question about object manipulation, the users agreed (median value 4) that moving and rotating objects using the 2D interaction techniques is easier than using the unconstrained 3D techniques.

The results from these user ratings are clearly consistent in favour of the presented active plane interaction techniques. We can justify this non-trivial result with two main explanations. Firstly, all the users in the test already had experience and familiarity with desktop 2D CAD interfaces, so the presented approach fits seamlessly with their skills. Secondly, due to the limitations of current depth camera tracking, a constrained movement can be more effective and precise than a completely free one.

7 Conclusions and future works

This paper presents a set of natural interaction techniques to facilitate a desk-less interaction with 3D CAD models. In our approach the user interacts with mono/bimanual hand gestures on a virtual plane to control CAD models in an AR environment. We developed a module to access geometry and engineering information directly from standard STEP CAD files; therefore, each assembly/part can be selected, examined, and navigated using natural hand gestures. The navigation includes orbiting, panning, and zooming. A dynamic exploded view is provided to understand complex assemblies when occlusion occurs and to select single components. One advantage of this desk-less approach, in addition to its simplicity, is that all commands can be triggered even in a noisy and dirty industrial environment, with bare hands or hands covered by protective gloves. The case study showed the feasibility of hand gesture-based techniques and that the presented virtual plane approach is preferred by the users over an unconstrained 3D navigation. As regards the integration in the current industry workflow, the STEP format proved very effective in exporting the geometry, the hierarchical assembly structure, the engineering materials and the exact volume data, but in our tests we could not retrieve any constraint information (i.e. kinematic joints). This information could have been used to develop better explosion and interaction strategies. In future works we will focus on two main aspects. First, we will improve the gesture algorithm with specific filtering to obtain a more precise and stable tracking. Secondly, we will find a way to access the kinematic constraints of the model in order to provide real-time simulation of multi-body assemblies. It is also important to note that, although we designed the presented system specifically for industrial applications (e.g. shop floor maintenance, on-site quality control, etc.), we can picture different scenarios where it can be used to improve 3D model navigation in non-desktop environments (e.g. scientific or medical data visualization, advertising, etc.).