
1 Introduction

Urban environments have been extensively studied for many purposes, such as path planning, improving the acquisition of images, creating 3D reconstructions [27], or performing semantic segmentation of roads using RGB-thermal images [29]. The creation of datasets and benchmarks focused on metropolitan areas is therefore a topic of interest, and new classified point clouds of cities and datasets of multimodal aerial sources are being released, both for semantic segmentation [3] and for path planning [21]. These recent publications show that, despite the progress in computer vision techniques, there is still a need for a common framework to evaluate them under the same conditions in different domains.

Image-based 3D reconstruction techniques, in particular, have long struggled to find appropriate ground-truth data for evaluating the algorithms. Numerous research efforts have addressed this by creating benchmarks: from the earliest works [19], which provided a few samples to evaluate the algorithms, to the more recent and complete benchmarks covering multiple scenarios [7, 18]. However, it is still challenging to cover a large extension of a city with sufficient accuracy. Generally, specific equipment such as a LiDAR is needed to collect the ground truth, making it unaffordable for most researchers. In particular, to cover a large extension of an urban environment, not only the equipment but also the planning of a mission (e.g., the flight of a helicopter) is a barrier to creating new datasets.

A metropolitan area can accommodate a great variety of elements: buildings, roads, parks, etc. This diversity can challenge 3D reconstruction techniques, and the available ground-truth data does not typically provide detailed information about the elements the city contains. However, in the same way that general-purpose 3D reconstruction benchmarks include a great variety of scenarios, a benchmark for the study of an urban environment would benefit from having this type of information. Previous studies [30] remark on the importance of including a categorization of the city to support the observations made about the quality of the reconstructions.

The creation of 3D models from a collection of images has been studied for many years [2, 4, 10, 16, 17, 22, 25] and is a fundamental problem of computer vision [5]. Some techniques target specific cases, such as aerial images of a mountainous area [14], which can later be used in augmented reality applications [13], but they are normally general-purpose methods. A complete pipeline to obtain an image-based reconstruction includes two stages: Structure-from-Motion (SfM), to recover a sparse model and the camera poses, and Multi-View Stereo (MVS), to obtain a detailed dense point cloud. Very popular open-source techniques are normally used in 3D reconstruction studies [23], but there are also commercial solutions with different licenses that can produce 3D models from images, and these are rarely included in such studies.

In this paper, we present an extension of previous work [15] with an evaluation of image-based 3D reconstruction pipelines not limited to combinations of open-source techniques. We expand the previous work by including commercial applications in the evaluation, thus improving the study of a metropolitan area. We use the annotated point cloud of DublinCity [30] and build on the initial study in [15], where the city is analyzed. The analysis includes the evaluation not only at scene level but also per category of urban element, as was done in the benchmark.

2 Previous Work

Benchmarks have been used to evaluate image-based reconstructions for decades. The first attempts were limited in various aspects [19]: they had a small number of ground-truth models (only two), they focused on only one part of the 3D reconstruction pipeline (MVS), and the ground truth was acquired in a very controlled environment. Over time, these types of benchmarks were enhanced with different sets of images taken under specific lighting conditions and with additional 3D ground-truth models [1]. Nevertheless, the limitation of having all images acquired in a controlled setting was not tackled until the EPFL benchmark was released [24], in which the provided ground truth was acquired with a terrestrial LiDAR. Since then, outdoor scenarios started to be considered in the evaluations. However, as the pioneer for outdoor environments, it shares a drawback with the very first indoor benchmarks: the variety of captured models was small and the coverage of the scene was narrow. Still, it represented significant progress.

The limitation regarding the variety of models and scenarios was overcome by two other benchmarks released at around the same time: Tanks and Temples [7] and the ETH3D benchmark [18]. The former is used to evaluate the output of the complete image-based 3D reconstruction pipeline (i.e., the dense point cloud) and, instead of providing a collection of images, gives a video as input. The latter focuses on the MVS stage and therefore provides the camera calibration and poses along with the set of images. A terrestrial LiDAR was used in these benchmarks to capture the ground truth, so in some scenes containing buildings, details such as the roofs were not covered.

Other types of LiDAR that can be used to acquire the ground truth are 3D Mobile Laser Scanning (MLS) and Aerial Laser Scanning (ALS). This type of equipment is the one typically used to acquire ground-truth data covering large metropolitan areas. Among city benchmarks, some do not include images, such as the TerraMobilita/iQmulus benchmark [26], the Paris-rue-Madame dataset [20] and the Oakland 3D Point Cloud dataset [11]. Others do not use the images to create the 3D reconstructions (as in the Toronto/Vaihingen ISPRS benchmark used in [28]); instead, the point cloud itself is used. Some benchmarks that do include images, such as the KITTI benchmark [8] and the ISPRS Test Project on Urban Classification and 3D Building Reconstruction [12], are not focused on the evaluation of the MVS output; instead, they focus on other tasks such as stereo, object detection, or building reconstruction from roofs.

Some of the benchmarks cited above provide an annotated point cloud of the city [11, 20, 26] but, as pointed out, they do not include images; the exceptions are the ISPRS Test Project on Urban Classification and 3D Building Reconstruction [12] and the Ai3Dr benchmark [15]. The main difference between these last two regarding the ground-truth model is its density: about six points per \({\text {m}}^2\) for the former and around 350 points per \({\text {m}}^2\) for the latter. Also, the latter uses a very richly annotated point cloud [30] and two different types of images, as opposed to a single set in the former. Our work is an extension of the evaluation presented in the Ai3Dr benchmark, enlarging the set of pipelines evaluated under the same conditions.

Fig. 1. Areas under evaluation. The areas of the city considered in this study are highlighted in the map with a top view of the ground truth.

Fig. 2. Hidden areas. On the left, an example of a hidden area (in color) over the whole region (in grey). On the right, the hidden area represented in 3D [15].

3 Urban Environment

Table 1. Number and percentage of points per class in each evaluated area, along with the number and percentage of points used as the hidden zone [15].

The metropolitan area evaluated in this paper is the city of Dublin. In particular, the same extension used in the Ai3Dr benchmark [15] is considered for the evaluation. As explained there, two different regions of the city were selected for evaluation, which include a variety of buildings, streets and parks with diverse styles, sizes and distributions, from the Trinity College campus to modern buildings, as well as other structures such as rail tracks. These regions are depicted in Fig. 1 and we refer to them as Area 1 and Area 2 from now on.

Fig. 3. Oblique and nadir images. A sample of oblique images (left) and nadir ones (right).

3.1 Ground Truth

A selection of the DublinCity [30] annotated point cloud, covering a representative part of the city, is used as ground truth for this evaluation. The areas were carefully selected from the whole extension of the city in [15], discarding those with any of the following characteristics: a small extension (less than \(250\times 250 \text { m}^2\) of the city), a low percentage of points in certain categories such as trees or grass, elements that are only temporarily in the city and can degrade performance (e.g., cranes), or less than 90% of points classified. These criteria can be expressed as a simple filter, as sketched below.
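The sketch below is not the original implementation: the `Tile` structure and the vegetation threshold are assumptions we introduce, since [15] quantifies only the minimum extension and the 90% classification requirement.

```python
from dataclasses import dataclass

MIN_AREA_M2 = 250 * 250      # minimum extension stated in the text
MIN_CLASSIFIED = 0.90        # at least 90% of the points must be classified
MIN_VEGETATION = 0.05        # hypothetical minimum fraction of tree/grass points

@dataclass
class Tile:
    """Per-tile statistics (hypothetical container, for illustration only)."""
    area_m2: float
    classified_fraction: float    # fraction of points with a class label
    vegetation_fraction: float    # fraction of tree + grass points
    has_temporary_elements: bool  # e.g., cranes

def keep_tile(t: Tile) -> bool:
    """Return True if a candidate ground-truth tile passes every criterion."""
    return (t.area_m2 >= MIN_AREA_M2
            and t.classified_fraction >= MIN_CLASSIFIED
            and t.vegetation_fraction >= MIN_VEGETATION
            and not t.has_temporary_elements)
```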

Accordingly, the representative areas Area 1 and Area 2 correspond to these selected pieces of ground truth. This ensures a balanced distribution of points in each category while avoiding parts that could degrade the evaluation. It also guarantees that the content is representative and diverse enough for an evaluation per urban object category. Lastly, the extent of the city to be analyzed is adequate for the algorithms to process it in a reasonable amount of time.

As explained in [15], we also use hidden areas for the evaluation (see an example in Fig. 2). These consist of sections of a particular area, selected so that the class distribution remains meaningful for the evaluation. However, the specific sections of the ground truth are not disclosed to the final user, to avoid fine-tuning to a specific region during the online evaluation.

3.2 Image Sets

We use three different sets of images to build the 3D models of each area: oblique, nadir, and a combination of both. The oblique and nadir images come from the groups described in [30] (Fig. 3 shows a sample of each). As the original dataset contains a significantly large number of images, both oblique and nadir, we use the selection made in [15], keeping only the images that make a meaningful contribution to the 3D reconstruction of each of our areas. For the selection, COLMAP's SfM algorithm [16] was run on the complete image datasets, and every image that did not contribute to at least 1500 3D points was discarded. This was done for both the oblique and nadir image sets.
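This selection criterion can be reproduced with the pycolmap bindings; the sketch below is an illustrative reimplementation, with the 1500-point threshold taken from [15] and a hypothetical model path:

```python
import pycolmap

def select_images(sparse_model_path: str, min_points3d: int = 1500):
    """Keep only images contributing to at least `min_points3d` 3D points
    of the sparse SfM reconstruction."""
    rec = pycolmap.Reconstruction(sparse_model_path)
    keep = []
    for image in rec.images.values():
        # Count the 2D features of this image that were triangulated.
        contributed = sum(1 for p in image.points2D if p.has_point3D())
        if contributed >= min_points3d:
            keep.append(image.name)
    return keep

# Hypothetical usage: kept = select_images("work/sparse/0")
```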

3.3 Urban Categorization

Table 1 illustrates how the 3D points in the ground-truth models are distributed with respect to their class, both for the complete models and for the hidden areas. The table shows that roof is the class with the highest point count, and its percentage in the hidden areas is also more balanced than that of the other classes. The class door has the lowest point count, and the undefined data, which could potentially introduce errors or other inaccuracies, represents at most 9% of the points in all areas. Other relevant observations are that Area 1 has almost twice as many window points as Area 2 and four times as many roof window points, while Area 2 has three times more grass points.

4 Experimental Setup

The goal of this study is to evaluate the performance of image-based 3D reconstruction techniques in the metropolitan area described in Sect. 3. The inputs of the 3D reconstruction processes are three different sets of aerial images (oblique, nadir and combined) and the GPS data captured during the flights. This information is given in the EPSG:29902 reference system in separate files and, in particular, the nadir set also has geographic information embedded as Exif metadata. All this information is available in the Ai3Dr benchmark. As this study is an extension of [15], we analogously evaluate the final result of the 3D reconstruction process (i.e., the final dense point cloud generated by the methods) and not the intermediate stages (SfM or MVS) separately.
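For instance, the Exif GPS positions (WGS84 latitude/longitude) can be projected into this reference system with pyproj; a minimal sketch, assuming the Exif tags are already parsed into decimal degrees:

```python
from pyproj import Transformer

# WGS84 geographic coordinates -> TM75 / Irish Grid (EPSG:29902).
_to_irish_grid = Transformer.from_crs("EPSG:4326", "EPSG:29902", always_xy=True)

def gps_to_epsg29902(lat: float, lon: float):
    """Project an Exif GPS position into the benchmark's reference system."""
    easting, northing = _to_irish_grid.transform(lon, lat)  # (x, y): lon first
    return easting, northing
```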

The reasons for choosing this approach are essentially two. Firstly, it makes the evaluation adaptable to new approaches that may not follow the typical steps of the pipeline. Secondly, since the available camera position measurements are based on GPS and no additional information (e.g., the orientation of the camera) is given, these measurements serve only as a coarse approximation and are not intended to be used as ground truth for evaluating intermediate steps.

4.1 3D Reconstruction Techniques

We can divide the 3D reconstruction techniques evaluated into two groups: open-source techniques and commercial applications. The former are freely available; they usually focus on one specific step of the process, but some of them can be combined to complete the pipeline. Users have access to the code, the techniques are typically used to solve general-purpose reconstruction problems, and they expose multiple parameters that can be configured to adapt the algorithms to specific cases. Commercial applications are in general prepared for specific types of reconstruction; they offer free trials, but one eventually needs to pay for the software. The details of their algorithms are not disclosed, they include intermediate refinement steps, they have usable interfaces, and their default parameters are well adjusted for the task. They are also fast and well optimized.

We combine several open-source SfM and MVS algorithms to create the open-source 3D reconstruction pipelines. Firstly, for SfM, we use COLMAP [16], which includes a geometric verification strategy that helps improve robustness in both initialization and triangulation, and an improved bundle adjustment algorithm with outlier filtering. Furthermore, we use two different SfM approaches implemented in OpenMVG: a global one [10] based on the fusion of relative motions between image pairs, and an incremental one [9] that iteratively adds new estimations to the initial reconstruction, minimizing the drift with successive steps of non-linear refinement. Secondly, for MVS, we use COLMAP's approach [17], which jointly estimates depth and normal information and performs pixel-wise view selection using photometric and geometric priors. We also use OpenMVS, which performs efficient patch-based stereo matching followed by a depth-map refinement process.

There are several commercial applications used to create image-based 3D reconstructions in professional environments. We evaluate three of the most popular: Agisoft Metashape, Pix4D and RealityCapture. Agisoft Metashape is a stand-alone photogrammetry software that generates 3D models from a collection of images, which can be used for cultural heritage work, GIS applications, etc. Its professional edition covers a large range of specific applications, with features such as digital elevation model generation, dense point cloud editing and multispectral imagery processing. It also offers a Python scripting interface and runs on Windows, Linux and macOS. Pix4D offers a photogrammetry suite specialized in mobile and drone mapping, with solutions for inspection, agriculture and surveying. We selected the Pix4Dmapper product to make the reconstructions because, although Pix4Dmatic is intended for large-scale photogrammetry, we found that more information (e.g., the orientation of the camera) was needed to initiate the standard procedure. Pix4D is not available for Linux. RealityCapture is a photogrammetry software recently acquired by Epic Games. It offers solutions for different tasks such as 2D-3D mapping, 3D printing, full-body scans, assets for games, etc. Its authors have a long trajectory and recognition in the computer vision community, having created CMPMVS [6]. The application is available for Windows only.

Fig. 4. Scheme of the 3D reconstruction pipelines tested. (1) to (6) as in [15], (7) Metashape, (8) Pix4D, (9) RealityCapture.

4.2 Pipelines Under Study

Using the aforementioned 3D reconstruction techniques, we list here the pipelines tested in this study:

  1. COLMAP(SfM) + COLMAP(MVS)

  2. COLMAP(SfM) + OpenMVS

  3. OpenMVG-g + COLMAP(MVS)

  4. OpenMVG-g + OpenMVS

  5. OpenMVG-i + COLMAP(MVS)

  6. OpenMVG-i + OpenMVS

  7. Metashape

  8. Pix4Dmapper

  9. RealityCapture

As depicted in Fig. 4, pipelines (7) to (9), which correspond to the commercial applications, are treated as complete solutions from input to output. For pipelines (1)-(6), we use the same configuration as in [15]; these pipelines are assembled from open-source techniques and require four stages: SfM, geo-registration, data preparation and MVS. Feature detection and matching are done in the SfM step. Then, the GPS information is used to coarsely register the sparse cloud to the ground truth. The next step prepares the data for densification, and the final one is the densification itself (a scripted sketch of these four stages is given below). A different approach is followed in pipelines (7)-(9) because they are closed solutions covering all stages of the pipeline. Their coarse registration is done by establishing EPSG:29902 as the coordinate system of the output. The nadir images contain enough information to be geo-referenced without additional data, but for the oblique set, the geographic information provided in the benchmark is needed. Depending on the method, the registration can be done at different stages: for example, (7) can compute the reconstruction in a local coordinate system and change it when exporting the point cloud, whereas (8) requires the coordinate estimation beforehand to produce good results.
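As an illustration, the four open-source stages can be scripted end to end; the sketch below follows pipeline (1) with COLMAP's v3.6 command-line interface. The directories and the GPS reference file are hypothetical; the file is assumed to follow `model_aligner`'s convention of one `name x y z` line per image, with coordinates in EPSG:29902.

```python
import subprocess

def run(cmd):
    """Run one pipeline stage, aborting if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

IMAGES, WORK = "images", "work"  # hypothetical directories

# 1) SfM: feature detection, matching and sparse reconstruction.
run(["colmap", "feature_extractor", "--database_path", f"{WORK}/db.db",
     "--image_path", IMAGES])
run(["colmap", "exhaustive_matcher", "--database_path", f"{WORK}/db.db"])
run(["colmap", "mapper", "--database_path", f"{WORK}/db.db",
     "--image_path", IMAGES, "--output_path", f"{WORK}/sparse"])

# 2) Geo-registration: coarse alignment to the GPS camera positions.
run(["colmap", "model_aligner", "--input_path", f"{WORK}/sparse/0",
     "--ref_images_path", f"{WORK}/gps_epsg29902.txt",
     "--robust_alignment_max_error", "3.0",
     "--output_path", f"{WORK}/geo"])

# 3) Data preparation: undistort the images for the MVS stage.
run(["colmap", "image_undistorter", "--image_path", IMAGES,
     "--input_path", f"{WORK}/geo", "--output_path", f"{WORK}/dense"])

# 4) MVS: per-view depth estimation and fusion into the dense point cloud.
run(["colmap", "patch_match_stereo", "--workspace_path", f"{WORK}/dense"])
run(["colmap", "stereo_fusion", "--workspace_path", f"{WORK}/dense",
     "--output_path", f"{WORK}/dense/fused.ply"])
```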

The versions of the software used in this study are:

  • COLMAP v3.6

  • OpenMVG v1.5

  • OpenMVS v1.1

  • Agisoft Metashape Pro v1.7.5 for Linux

  • Pix4D v4.7.5 for Windows

  • RealityCapture v1.2 for Windows

The parameter configurations in the open-source pipelines depend on the stage of the pipeline and the method, as indicated in [15]. In general terms, COLMAP's parameters are the same as in DublinCity [30] in all stages of the pipeline, OpenMVG uses its default parameters, and OpenMVS uses the parameters reported in the ETH3D benchmark. We also used the default parameters in Metashape, RealityCapture and Pix4D.

4.3 Alignment

The output of the aforementioned image-based 3D reconstruction pipelines is coarsely registered to the ground truth thanks to the geographical information. This is the same situation as in [15], where it is shown that the coarse registration needs a refinement process in order to be well aligned. The registration refinement typically consists of a 7DoF ICP algorithm, a strategy followed in [7], for example. Schops et al. [18] use a more sophisticated approach based on the color information of the laser scan, something that is not available in our case. For our approach, we follow the same strategy as in [15], refining the coarse registration with a 7DoF ICP process between the reconstructed point cloud and the ground truth.
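A minimal sketch of this refinement with Open3D, assuming the coarse geo-registration already places the two clouds within ICP's convergence basin; `with_scaling=True` extends the rigid 6DoF estimate to a 7DoF similarity transform (scale, rotation and translation):

```python
import open3d as o3d

def refine_alignment(recon_path: str, gt_path: str, max_corr_dist: float = 1.0):
    """Refine a coarsely geo-registered reconstruction against the ground
    truth with a 7DoF (similarity) ICP and return the aligned cloud."""
    recon = o3d.io.read_point_cloud(recon_path)
    gt = o3d.io.read_point_cloud(gt_path)
    result = o3d.pipelines.registration.registration_icp(
        recon, gt, max_corr_dist,
        estimation_method=o3d.pipelines.registration
            .TransformationEstimationPointToPoint(with_scaling=True))
    return recon.transform(result.transformation)
```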

Fig. 5. Qualitative 3D reconstruction results. Point clouds obtained with the oblique and nadir images combined in Area 1 (top) and Area 2 (bottom).

5 Experimental Results

Following the work in [7, 15, 30], we use the following measurements to evaluate performance:

  • Precision, P: measures the accuracy of the reconstruction.

    $$\begin{aligned} P(d) = \frac{| dist_{I \rightarrow G}(d)|}{|I|} \cdot 100 \end{aligned}$$
    (1)
  • Recall, R: measures the completeness of the reconstruction.

    $$\begin{aligned} R(d) = \frac{| dist_{G \rightarrow I}(d)|}{|G|} \cdot 100 \end{aligned}$$
    (2)
  • F score, F: the harmonic mean of P and R.

    $$\begin{aligned} F(d) = \frac{2\,P(d)\,R(d)}{P(d)+R(d)} \end{aligned}$$
    (3)

where d is a given threshold distance, I is the point cloud under evaluation and G is the ground-truth point cloud. \( | \cdot | \) denotes cardinality, and \(dist_{I \rightarrow G}(d)\) is the set of points in I with a distance to G of less than d; \(dist_{G \rightarrow I}(d)\) is analogous (i.e., \(dist_{A \rightarrow B}(d)= \{ a \in A \mid \min \limits _{b \in B} \Arrowvert a - b \Arrowvert _{2} < d \} \), with A and B being point clouds). To perform the evaluation per class, each point under evaluation is assigned the same class as its nearest neighbor in the ground truth. Although there are plenty of metrics that could measure the quality of the reconstruction algorithms, such as the mean distance to the ground truth used in [23], P and R (and therefore F) are more robust to outliers.
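In practice, these measurements reduce to nearest-neighbor queries; a sketch with SciPy (array and function names are ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def evaluate(I: np.ndarray, G: np.ndarray, d: float):
    """P, R and F at threshold d (Eqs. (1)-(3)) for a reconstruction I and
    ground truth G, both given as (N, 3) arrays of point coordinates."""
    d_IG, _ = cKDTree(G).query(I)  # distance of each point in I to G
    d_GI, _ = cKDTree(I).query(G)  # distance of each point in G to I
    P = 100.0 * np.count_nonzero(d_IG < d) / len(I)  # precision, Eq. (1)
    R = 100.0 * np.count_nonzero(d_GI < d) / len(G)  # recall, Eq. (2)
    F = 2 * P * R / (P + R) if P + R > 0 else 0.0    # F score, Eq. (3)
    return P, R, F

def transfer_classes(I: np.ndarray, G: np.ndarray, gt_labels: np.ndarray):
    """For the per-class evaluation: each reconstructed point inherits the
    class of its nearest ground-truth neighbor."""
    _, idx = cKDTree(G).query(I)
    return gt_labels[idx]

# Usage at the reported threshold (25 cm): P, R, F = evaluate(I, G, d=0.25)
```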

For each pipeline, set of images and area (including the hidden parts), we calculate P, R, and F. We evaluated values of d ranging from 1 cm to 100 cm; as expected, every method scores better as d grows, since the tolerance increases. The results reported in this evaluation use a value of 25 cm for d, as in [15, 30], which represents a good compromise between the limitations of the image resolution and the meaningfulness of the precision: a very small distance would mean poor performance for all the methods, while a larger distance would make the precision less informative.

5.1 Scene Level Evaluation

Table 2 shows the evaluation at scene level. This means that all the points in the ground truth are treated in the same way, ignoring the class they belong to. We can see in the results that the reconstructions made with the oblique set achieve the lowest recall values, compared with the reconstructions obtained with the other sets of images, in both areas (also in the hidden parts). Therefore, for this set of images, a good precision value is decisive for achieving a good F score. Among the pipelines created with open-source methods, COLMAP + COLMAP has the best performance for this type of images in both areas. However, some commercial solutions outperform it. Specifically, the best score in Area 2 is obtained with Pix4D, whereas in Area 1 it is obtained with Metashape.

In the nadir set, the recall is usually higher than the precision, so accuracy is not as decisive as in the oblique sets for obtaining a good F score. These results suggest that having different camera angles and less coverage of the same parts of the scene (as is the case in the oblique set but not in the nadir one) makes the recall decrease while the precision remains similar. As can be observed, among the open-source methods, COLMAP + COLMAP is the best pipeline in Area 1 whereas OpenMVG-i + OpenMVS is the best in Area 2 (even in the hidden parts). Moreover, among the commercial solutions, we also see different winning methods in Area 1 and Area 2: RealityCapture and Pix4D, respectively. For the reconstructions obtained with the combined imagery, OpenMVG-g + OpenMVS and OpenMVG-i + OpenMVS are the open-source pipelines with the highest F score: the former in Area 1 and the latter in Area 2. The commercial solutions behave slightly differently, as they present their best results with the same method for both the nadir and the combined sets of images. In fact, the F scores of the reconstructions obtained with the nadir and the combined sets are much more similar for the commercial applications than for the others, which suggests that nadir images are handled very well by the commercial solutions. We can also observe that some open-source pipelines achieve higher recall values on the nadir and combined sets, although this is not enough to beat the commercial ones. However, for the combined set of images, where the open-source techniques obtain their best results, the difference between these and the worst-performing commercial application is not very significant: 79.36 vs. 79.5 for OpenMVG-g + OpenMVS and Pix4D, respectively, in Area 1 and, analogously, 81.13 vs. 81.53 for OpenMVG-i + OpenMVS and Metashape in Area 2.

Table 2. Study of urban areas (quantitative). Each row shows the results of a specific 3D reconstruction pipeline giving the precision / recall / F score for d=25cm obtained for the reconstruction in each set of images in each area. The best score for each area and image set is in bold letters and the pipelines are as follows: (1) COLMAP + COLMAP, (2) COLMAP + OpenMVS, (3) OpenMVG-g + COLMAP, (4) OpenMVG-g + OpenMVS, (5) OpenMVG-i + COLMAP, (6) OpenMVG-i + OpenMVS, (7) Metashape, (8) Pix4D, (9) RealityCapture.
Table 3. Study of F score per urban element. Column P indicates the pipeline that generated the best F score. If the pipeline or F score calculated with the hidden ground-truth differs from those ones calculated with the complete one, they are shown in square brackets. Pipelines are numbered as: (1) COLMAP + COLMAP, (2) COLMAP + OpenMVS, (3) OpenMVG-g + COLMAP, (4) OpenMVG-g + OpenMVS, (5) OpenMVG-i + COLMAP, (6) OpenMVG-i + OpenMVS, (7) Metashape, (8) Pix4D, (9) RealityCapture.
Table 4. Study of precision per urban element. Column P indicates the pipeline that generated the best precision. If the pipeline or precision calculated with the hidden ground-truth differs from those ones calculated with the complete one, they are shown in square brackets. Pipelines are numbered as: (1) COLMAP + COLMAP, (2) COLMAP + OpenMVS, (3) OpenMVG-g + COLMAP, (4) OpenMVG-g + OpenMVS, (5) OpenMVG-i + COLMAP, (6) OpenMVG-i + OpenMVS, (7) Metashape, (8) Pix4D, (9) RealityCapture.
Table 5. Study of recall per urban element. Column P indicates the pipeline that generated the best recall. If the pipeline or recall calculated with the hidden ground-truth differs from those ones calculated with the complete one, they are shown in square brackets. Pipelines are numbered as: (1) COLMAP + COLMAP, (2) COLMAP + OpenMVS, (3) OpenMVG-g + COLMAP, (4) OpenMVG-g + OpenMVS, (5) OpenMVG-i + COLMAP, (6) OpenMVG-i + OpenMVS, (7) Metashape, (8) Pix4D, (9) RealityCapture.

The qualitative results obtained from the reconstructions with the oblique and nadir images together in Area 1 and Area 2 are shown in Fig. 5. The rendering was done using the same configuration (e.g., point size) to make the results comparable. We can observe that the point cloud obtained with COLMAP + COLMAP (Fig. 5 (a)) is sharper than the one obtained with COLMAP + OpenMVS (Fig. 5 (b)) in both areas, in accordance with the precision values (74.89 and 27.93 in Area 1; 76.54 and 24.7 in Area 2). We can also see that the highest precision value in Area 1, 87.87, is obtained with RealityCapture (Fig. 5 (i)), with a very sharp reconstruction. Moreover, there are also differences in the completeness of the reconstructions: OpenMVG-i + OpenMVS (Fig. 5 (f)) and OpenMVG-g + OpenMVS (Fig. 5 (d)) are denser than the rest, this time among all the methods including the commercial ones, in both areas. However, their F scores (79.36 and 76.76 in Area 1; 80.64 and 81.13 in Area 2) confirm that, as can be appreciated in the images, the commercial software gives sharper reconstructions (Fig. 5 (g)-(i)); thanks to their higher precision, even the best F scores of the open-source pipelines fall below those of all the commercial applications (79.5, 80.68 and 80.99 in Area 1; 81.54, 82.73 and 85.06 in Area 2).

5.2 Urban Category Centric Evaluation

Additionally, we present a summary of the same measurements calculated above, but this time per urban element category. This summary comprises three tables, one per measurement: F score in Table 3, precision in Table 4 and recall in Table 5. Each row contains the results for a specific class (i.e., urban category) and each column corresponds to a set of images. The value presented is the maximum score obtained among the nine pipelines tested (see Sect. 4.2), and the pipeline that produced it is shown in column P. The results for the hidden area are presented in square brackets when they differ from those calculated with the complete area.

Analyzing the per-class results across all the available image sets, we observe that, although in [15] the method that most frequently achieved the maximum precision was COLMAP + COLMAP, the commercial applications are better in all the categories, and Pix4D is the one that most frequently obtains the maximum precision. Roof, sidewalk, street and grass are the categories with the best results. Looking at the recall, OpenMVG-i + OpenMVS was the pipeline that most frequently achieved the highest scores among the open-source methods [15]. When we include the commercial software in the analysis, we see that, unlike for precision, it is not the best across all categories and sets of images. On the results for the reconstruction from the combined set, we can see that OpenMVG-g + OpenMVS and OpenMVG-i + OpenMVS are still the best in the majority of classes (in Area 1 and Area 2, respectively), in accordance with the results obtained in the scene-level evaluation.

Looking at the F score, the class with the lowest values is tree, and these results are heavily influenced by the low recall. We can see that for this class the winning methods are the open-source solutions, whereas for the rest of the categories the commercial solutions are the best. These results confirm the hypothesis in [30]: trees in the parks of the city can degrade the scores of the reconstructions. We can also analyze the results depending on the image set under study. For example, with the combined set, according to [15], the pipelines with the best performance in the majority of classes among the open-source techniques were OpenMVG-g + OpenMVS in Area 1 and OpenMVG-i + OpenMVS in Area 2. These results are in accordance with the ones discussed before, which do not consider the class information (Table 2). However, when looking at the nine pipelines, the methods RealityCapture and Pix4D (which were the best in Area 1 and Area 2, respectively, in the scene evaluation) behave differently at the level of specific urban elements. RealityCapture is the most frequent winner among all the urban categories, whereas Pix4D only wins in roof and r. window (which account for around 31% of the points), suggesting that it won at scene level mainly because it is the best in this dominant category.

5.3 General Pipeline Evaluation

We can also observe that, in general, the open-source pipelines that obtained the best results are COLMAP + COLMAP, OpenMVG-g + OpenMVS and OpenMVG-i + OpenMVS. These results are in accordance with previous studies that used the same kind of metric, where COLMAP + COLMAP and OpenMVG-i + OpenMVS obtained the best results [7]. In particular, in that study OpenMVG-g + OpenMVS never outperforms COLMAP + COLMAP, but this situation is plausible in our study given the different camera trajectories (an aerial grid configuration vs. a circle around an object), software versions and parameters used. Pix4D was also compared in that study, obtaining the best results for some categories, in accordance with what we observe here.

COLMAP used as MVS is better than OpenMVS only when applied after COLMAP's SfM, whereas OpenMVS is better when using the other SfM methods tested. This highlights the need to test not a particular MVS method in isolation but a complete pipeline, since the result is influenced by the output of the SfM step, the data conversion and preparation for the MVS step, and memory and computing limitations. The commercial applications used in the evaluation give better results than the open-source pipelines in general, and they are prepared to handle large amounts of data as part of their photogrammetry pipelines. Also, some of them provide by default a 3D mesh along with the point cloud. However, they have other disadvantages, such as supporting only certain operating systems, and none of them is freely available.

6 Conclusion

We have presented in this paper a study of image-based 3D reconstruction pipelines in a metropolitan area, using exclusively aerial images. This study considers not only open-source 3D reconstruction techniques but also commercial photogrammetry solutions widely used in industry. The final dense point cloud is evaluated at scene level and per urban category, thus allowing a finer examination. Thanks to that, we can see the influence of specific urban categories (e.g., roof) on the F scores. We also provide evidence for the hypothesis that parks can degrade the F score in a scene-level evaluation, mainly because of the presence of trees. We have concluded that the commercial applications achieve better scores in the majority of the scenarios, but the best open-source solutions are not far behind. When choosing the best 3D reconstruction pipeline, aspects other than the F score may also be important: the budget, the equipment, the possibility of adapting the algorithms to specific needs, etc. In this study we have created an exhaustive and comprehensive review, and we believe it can be useful for those who want to see how 3D reconstruction methods perform using aerial images of a city.