Enhanced 3D Urban Scene Reconstruction and Point Cloud Densification using Gaussian Splatting and Google Earth Imagery (2024)

Kyle-Yilin Gao [1] (y56gao@uwaterloo.ca, ORCID 0000-0002-8320-6308): Conceptualization, Methodology, Validation, Formal analysis, Investigation, Data Curation, Writing - Original Draft, Visualization, Writing - Review & Editing

Dening Lu [1] (d62lu@uwaterloo.ca, ORCID 0000-0003-0316-0299): Investigation, Writing - Review & Editing

Hongjie He [1] (hongjie.he@uwaterloo.ca, ORCID 0000-0003-3839-5821): Investigation, Writing - Review & Editing

Linlin Xu [1] (l44xu@uwaterloo.ca, ORCID 0000-0002-6833-6462), corresponding author: Conceptualization, Resources, Writing - Review & Editing, Supervision

Jonathan Li [2] (junli@uwaterloo.ca, ORCID 0000-0001-7899-0049), corresponding author: Conceptualization, Resources, Writing - Review & Editing, Supervision

[1] Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
[2] Department of Geography and Environmental Management, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada

Corresponding authors: l44xu@uwaterloo.ca (Linlin Xu), junli@uwaterloo.ca (Jonathan Li)

Abstract

Although large-scale 3D urban scene reconstruction and modelling from remote sensing images is crucial for key applications such as digital twins and smart cities, it remains a difficult task due to uncertainties in heterogeneous datasets and geometry models. This paper presents a Gaussian splatting based approach for 3D urban scene modelling and geometry retrieval, with the following contributions. First, we develop and implement a 3D Gaussian splatting (3DGS) approach for large-scale 3D urban scene modelling from heterogeneous remote sensing images. Second, we design a point cloud densification approach within the proposed 3DGS model to improve the quality of 3D geometry extraction of urban scenes. Leveraging Google Earth imagery from different sensors, the proposed approach is tested on the region of the University of Waterloo, demonstrating that it greatly improves reconstructed point cloud quality over Multi-View-Stereo approaches. Third, we design and conduct extensive experiments on multi-source large-scale Google Earth remote sensing images across ten cities to compare the 3DGS approach with neural radiance field (NeRF) approaches, demonstrating view-synthesis results that greatly outperform previous state-of-the-art 3D view-synthesis approaches.

keywords:

3D Gaussian Splatting, Novel View Synthesis, Photogrammetry, Multi-View-Stereo, Point Cloud

1 Introduction

3D reconstruction and modelling from 2D images has received great interest given recent advances in photorealistic view synthesis methods with 3D reconstruction capabilities. From a technical perspective, it is an interdisciplinary research area spanning computer vision, computer graphics, and photogrammetry. It finds applications in multiple domains, including autonomous navigation aided by 3D scene understanding (Badue et al., 2021), remote sensing and photogrammetry for crafting 3D maps essential for navigation, urban planning, and administration (Biljecki et al., 2015). Moreover, it extends to geographic information systems incorporating urban digital twins (Lehner and Dorffner, 2020; Schrotter and Hürzeler, 2020), as well as augmented and virtual reality platforms integrating photorealistic scene reconstructions (Carozza et al., 2014; Rohil and Ashok, 2022).

This paper focuses on remote sensing-based large-scale view synthesis using 3D Gaussian Splatting (3DGS), as well as 3D geometry extraction from the trained Gaussian Splatting model. Using only images from Google Earth Studio, we train a 3D Gaussian splatting model which outperforms previous NeRF-based models. We quantify and benchmark the view synthesis performance on a large-scale urban dataset with 10 cities captured from Google Earth, as well as on our region of study. We also extract and densify the 3D geometry of the region of study using 3DGS, which we compare against a Multi-View-Stereo dense reconstruction. To our knowledge, this is the first use of 3D Gaussian Splatting for large-scale remote sensing-based 3D reconstruction and view synthesis.

2 Background and Related Work

2.1 Urban 3D Photogrammetry

Photogrammetry extracts 3D geometry and potentially other physical information from 2D images. Remote sensing-based urban photogrammetry for 3D city modelling relies on drones, aerial platforms, or satellites, whereby buildings of interest are captured at an oblique/off-nadir angle; this is often referred to as oblique photogrammetry. In large-scale scenes, other land uses and land covers may be present, posing additional challenges. Ground-based and airborne LiDAR scanners can also be used to generate very accurate 3D models, sometimes in conjunction with image-based methods. However, images are in general more accessible, both in terms of sensors and data availability.

Traditional (non-deep-learning) methods which generate 3D point clouds/geometry from images are grouped into two types: Structure-from-Motion (SfM), which generates sparse point clouds, and Multi-View-Stereo (MVS), which generates dense point clouds (Musialski et al., 2013). The most fundamental method is perhaps Structure-from-Motion, which relies on multi-view geometry and projective geometry to establish the relationship between 3D points and their 2D projections onto imaging planes. Key points are extracted in each 2D image, matched across images with scene overlap, triangulated into three dimensions, and typically further calibrated/error-corrected using bundle adjustment or other methods, resulting in a sparse point cloud 3D reconstruction. The sparse point cloud can then be meshed and/or turned into digital surface models. Sparse SfM photogrammetry is typically applied as a preprocessing step, as shown in various works (Yalcin and Selcuk, 2015; Lingua et al., 2017), to help with further dense reconstruction or data fusion with 3D scanned point clouds. Sparse SfM point clouds can only retrieve scene geometry; they cannot reproduce realistic 3D lighting of the scene, which is crucial for AR/VR-based applications and other applications that rely heavily on visualization.

In urban settings, Multi-View-Stereo (MVS) also requires oblique imagery to capture the geometry of buildings and their facades. Fundamentally, Multi-View-Stereo differs from sparse SfM photogrammetry in that MVS aims for a dense reconstruction by making use of 3D information in each pixel of the 2D images, as opposed to specific key points in the 2D images. This can be done using various methods such as plane sweeping, stereo vision and depth map fusion, or even deep learning methods. MVS methods are typically divided into two categories: volume-based and point-cloud-based (Musialski et al., 2013; Jensen et al., 2014). Various authors (Yalcin and Selcuk, 2015; Toschi et al., 2017; Lingua et al., 2017; Rong et al., 2020; Pepe et al., 2022; Liao et al., 2024) have employed MVS for dense urban 3D reconstruction, which can also be meshed for purposes such as digital surface modelling and geophysics simulations. However, compared to sparse SfM photogrammetry, dense MVS photogrammetry is much more computationally intensive, especially in terms of memory. Additionally, dense MVS photogrammetry typically requires sparse SfM photogrammetry, or at least the camera poses which are typically obtained from it, as a preprocessing step. Dense reconstructions, although more visually appealing than sparse reconstructions, are still not photorealistic since they cannot model the directional dependence of lighting in the scene.

2.2 Neural Radiance Fields and Urban 3D Reconstruction/View synthesis

In recent years, Neural Radiance Field (NeRF) methods (Mildenhall et al., 2021) have dominated the field of novel view synthesis. Trained on posed images of a scene, NeRF methods use a differentiable rendering process to learn an implicit (Barron et al., 2021, 2022) or hybrid (Müller et al., 2022) scene representation, typically as density and directional color fields parameterized by a Multi-Layer Perceptron (MLP). The scene representation is then rendered into 2D images using a differentiable volume rendering process, allowing for scene representation learning via pixel-by-pixel supervised learning using back-propagation of a photometric loss. Certain explicit scene representation models (Yu et al., 2021; Fridovich-Keil et al., 2022; Chen et al., 2022) use almost identical differentiable rendering pipelines, but store their scene representations explicitly, forgoing the use of decoding MLPs (although some of these methods allow for a shallow decoding MLP, blurring the line between explicit and hybrid scene representations).

To synthesize images, NeRF methods employ differentiable volume rendering, generating the pixel color $C$ via alpha blending of local colors $c_i$ using local densities $\sigma_i$ along a ray with sampling intervals $\delta_i$. This is given by

C = \sum_i c_i \alpha_i T_i    (1)

where $c_i$ and $\sigma_i$ are sampled from the learned radiance field (e.g., the NeRF MLP), and

\alpha_i = 1 - \exp(-\sigma_i \delta_i) \quad \text{and} \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j).    (2)
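For concreteness, a minimal NumPy sketch of the compositing in Eqs. (1)-(2) is given below; the sample colors, densities, and intervals are placeholder values, and an actual NeRF pipeline would obtain them by querying the learned field along each ray.

```python
import numpy as np

def composite_ray(colors, sigmas, deltas):
    """Alpha-composite per-sample colors along one ray following Eqs. (1)-(2).

    colors: (N, 3) sampled radiance values c_i
    sigmas: (N,)   sampled densities sigma_i
    deltas: (N,)   sampling interval lengths delta_i
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)         # alpha_i = 1 - exp(-sigma_i * delta_i)
    trans = np.cumprod(1.0 - alphas)                # running product of (1 - alpha_j)
    trans = np.concatenate([[1.0], trans[:-1]])     # T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1
    weights = alphas * trans
    return (weights[:, None] * colors).sum(axis=0)  # C = sum_i c_i * alpha_i * T_i

# toy example with three samples along a single ray
C = composite_ray(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
                  np.array([0.5, 1.0, 2.0]),
                  np.array([0.1, 0.1, 0.1]))
```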

Urban scenes, which are unbounded, full of transient objects (such as pedestrians and cars), and subject to changing lighting conditions, pose a challenge to the learning of 3D scene representations. Methods such as NeRF-W (Martin-Brualla et al., 2021), Mip-NeRF 360 (Barron et al., 2022), Block-NeRF (Tancik et al., 2022), and Urban Radiance Fields (Rematas et al., 2022) proposed solutions to some of these problems, and are suited for ground-level view synthesis and 3D urban reconstruction.

Aerial 3D reconstruction and view synthesis from remote sensing images has also been attempted with methods such as BungeeNeRF/CityNeRF (Xiangli et al., 2022), Mega-NeRF (Turki et al., 2022), Shadow NeRF (Derksen and Izzo, 2021), and Sat-NeRF (Marí et al., 2022). These methods attempt to solve problems such as piecing together local NeRFs into a large-scale urban scene, multi-scale city view synthesis, and shadow-aware scene reconstruction for high-rises. BungeeNeRF (Xiangli et al., 2022) is of particular interest, as we extract a Google Earth dataset of our region of study using a similar method.

2.3 3D Gaussian Splatting

3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) was first developed in 2023 as a view synthesis method competing against existing NeRF view synthesis methods. Compared with the vanilla NeRF method, the vanilla Gaussian Splatting method learns the 3D scene and synthesizes novel views orders of magnitude faster, and achieves a view synthesis quality comparable to, and often exceeding, that of the best NeRF models, at the cost of a much larger memory footprint and a required structure-from-motion (SfM) (Schonberger and Frahm, 2016) initialization/preprocessing step. The workflow is visualized in Figure 2.3.

[Figure: overall 3D Gaussian Splatting workflow]

The SfM preprocessing is exactly the standard sparse photogrammetric process, which identifies 2D key points, matches overlapping images, triangulates key points into 3D, and error-corrects through bundle adjustment or some other method. Compared to standard photogrammetry, which can at best project image colors onto the point cloud as flat (lighting-less) colors, 3DGS is able to photorealistically reproduce the directionally dependent lighting of the scene, which is crucial for many applications. It is also able to fine-tune the geometry of the scene using photometric (color-based) objectives against ground truth pictures, as opposed to only minimizing re-projection errors as in photogrammetry. Compared to NeRF models, 3DGS produces more natural 3D geometry, with a natural correspondence between the learned 3D positional means of the 3D Gaussian functions and a 3D point cloud representation of the scene geometry.

Representing the scene as 3D Gaussian functions, and representing lighting as spherical harmonic (SH) coefficients attached to these Gaussians, 3D Gaussian Splatting produces 2D images via a differentiable tile-based Gaussian rasterizer: Gaussians are projected into two dimensions according to the novel view's view frustum, and the projected Gaussians are alpha-blended to produce per-pixel colors in the novel view. The novel views are supervised against ground truth images to train the Gaussian Splatting parameters. To the best of our knowledge, this is the first work to attempt large-scale remote sensing-based 3D reconstruction and view synthesis using 3D Gaussian Splatting, although recent works (Kerbl et al., 2024; Zhou et al., 2024) have applied Gaussian Splatting to large-scale urban street-level datasets.

3 Method

3.1 Region of Study

The region of study is the Kitchener-Waterloo Region in Ontario, Canada, centered at the University of Waterloo. The city of Waterloo has a population of approximately 121,000 according to the 2021 census and occupies 64.06 km$^2$ of land (Statistics Canada, 2023). The University of Waterloo lies at 43.472°N, 80.550°W, and its main campus occupies 4.50 km$^2$. At the city scale, the study area comprises various land use and land cover features such as urban roads, buildings, agriculture and other land uses, low vegetation, water, mixed temperate forest, and other land covers. The study area is centered at the Environment-1 (EV-1) building located at 43.468°N, 80.542°W, and covers an area of roughly 165 km$^2$. We perform large-scale view synthesis at the city scale, and 3D point cloud comparison at the neighborhood scale. The Google Earth images retrieved for the scene are primarily from Landsat/Copernicus, Airbus, the Scripps Institution of Oceanography (SIO), and the National Oceanic and Atmospheric Administration (NOAA).

The University of Waterloo lies on the traditional land of Neutral, Anishinaabeg and Haudenosaunee peoples. The University of Waterloo is situated on the Haldimand Tract, the land promised to the Six Nations that includes six miles on each side of the Grand River.

3.2 Google Earth Studio Datasets

For the region of study, we used as camera paths seven concentric circles at different altitudes, radii, and tilt angles centered around the EV-1 building at the University of Waterloo, Waterloo, Ontario, Canada. The first of these circles has a radius of 500 m and an elevation of 475 m. The last circle has a radius of 7250 m and an elevation of 3690 m. All images point towards a target slightly above the EV-1 building (at an elevation of 390 m) at 43.468°N, 80.542°W. The final circle's images have a tilt angle of approximately 65.5° with respect to the horizontal, with some deviations (within $\sim$0.3°). We gathered 401 images using Google Earth Studio along the camera path defined by these circles. The region of study and camera poses, along with the sparse SfM results, can be seen in Figure 3.2. During preprocessing, we observed poor SfM point cloud reconstruction results further than 6 km away from the scene center, reasonable SfM reconstruction within 6 km, and good SfM reconstruction within 1 km, where individual buildings can be identified. The SfM preprocessing resulted in a sparse point cloud with 337,382 points, which were used to initialize the 3D Gaussian functions for 3DGS. This multi-scale Google Earth Studio (Alphabet Inc., 2015-2024) dataset was inspired by the BungeeNeRF dataset (Xiangli et al., 2022), which we also use for a multi-city large-scale view-synthesis benchmark.
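To illustrate how such an orbital path can be parameterized, the following sketch computes camera positions on one circle around a target; the function name, frame count, and conversion factors are illustrative assumptions of ours, not part of Google Earth Studio (where the orbits are defined as keyframed camera paths in the Studio interface).

```python
import numpy as np

def orbit_positions(center_lat, center_lon, radius_m, elevation_m, n_frames=60):
    """Illustrative camera positions on one orbital circle around a scene center.

    Offsets are computed in a local east-north frame and converted to approximate
    latitude/longitude with a rough metres-per-degree factor near the center.
    """
    angles = np.linspace(0.0, 2.0 * np.pi, n_frames, endpoint=False)
    east = radius_m * np.cos(angles)
    north = radius_m * np.sin(angles)
    lat = center_lat + north / 111_320.0
    lon = center_lon + east / (111_320.0 * np.cos(np.radians(center_lat)))
    alt = np.full(n_frames, elevation_m)
    return np.stack([lat, lon, alt], axis=1)   # each camera looks back at the scene center

# first orbit of the Waterloo scene: 500 m radius at 475 m elevation around EV-1
poses = orbit_positions(43.468, -80.542, radius_m=500.0, elevation_m=475.0)
```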

For the BungeeNeRF scenes, we used the Google Earth Studio camera paths specified by BungeeNeRF (Xiangli et al., 2022). The BungeeNeRF dataset consists of 10 scenes for 10 cities. Each scene is centered around a particular landmark, with camera paths defined by concentric circles of different orbit radii and elevations, with the scene coverage reaching city-wide at the highest elevation. Detailed information for the 10 BungeeNeRF scenes and the Waterloo scene can be found in Table 1. The New York scene centered at 56 Leonard and the San Francisco scene centered at Transamerica were used as the main scenes for the view reconstruction benchmark in BungeeNeRF (Xiangli et al., 2022), and have 459 and 455 images respectively. These two scenes were rendered at 30 frames per second as a 1:30 minute video. All other scenes contain 221 images, rendered with a frame limit of 220+1 along the fixed camera path, and were used for additional visualizations. We note that the original BungeeNeRF paper contained two additional scenes (Sydney and Seattle), but the Google Earth Studio camera paths were not provided for these two scenes.

Google Earth Studio provides a platform for generating multi-view aerial/satellite images by simply specifying camera poses and scene locations. Google Earth Studio produces composite images from various government and commercial sources, as well as rendered images from 3D models constructed using remote sensing images from these sources. These include Landsat/Copernicus, Airbus, NOAA, U.S. Navy, USGS, and Maxar images and datasets taken at different times. An obvious example of a composite image can be observed in the two bottom-right images in Figure 4.3, with different water colors indicating different data sources and/or acquisition times.

[Figure: region of study with camera poses and sparse SfM reconstruction results]
Table 1: Details of the 10 BungeeNeRF scenes and the Waterloo scene.

City          | Landmark          | Lowest (m) | Highest (m)
New York      | 56 Leonard        | 290        | 3,389
San Francisco | Transamerica      | 326        | 2,962
Chicago       | Pritzker Pavilion | 365        | 6,511
Quebec        | Château Frontenac | 166        | 3,390
Amsterdam     | New Church        | 95         | 2,3509
Barcelona     | Sagrada Familia   | 299        | 8,524
Rome          | Colosseum         | 130        | 8,225
Los Angeles   | Hollywood         | 660        | 12,642
Bilbao        | Guggenheim        | 163        | 7,260
Paris         | Pompidou          | 159        | 2,710
Waterloo      | EV-1              | 500        | 3,690

3.3 Structure from Motion Preprocessing and Sparse Point Cloud Extraction

The standard implementation of 3D Gaussian Splatting relies on COLMAP (Schonberger and Frahm, 2016) for preprocessing. This SfM preprocessing takes a collection of unordered images with unknown camera poses, and outputs the camera pose of each image, as well as a sparse point cloud. Like all SfM methods, COLMAP SfM consists of the following steps.

Feature Extraction: For each image $I_i$, key points $x_j \in R^2$ are identified and assigned robust, view-invariant local features $f_j \in R^n$. Scale Invariant Feature Transform (SIFT) features (Lowe, 1999) are used by default in COLMAP, and provide robust features which allow the same 3D point to be identified across multiple images as its respective projected 2D key points.

Matching: By searching through images and their respective features, potentially overlapping image pairs with matching key point features are identified.

Geometric Verification: A scene graph, with images as nodes and edges connecting overlapping images, is constructed by verifying potentially overlapping image pairs. This verification is done by estimating a valid homography for a potentially connected image pair using a robust estimation technique such as a variant of Random Sample Consensus (RANSAC) by Fischler and Bolles (1981).

Image Registration: Starting from an initial image pair whose key points are triangulated into 3D, new images with overlap according to the scene graph are added to the scene by solving the Perspective-n-Point problem (Fischler and Bolles, 1981), which estimates camera poses given a number of 3D points and their 2D projections. This step robustly estimates the pose of each newly registered image.

Triangulation: Given key points as viewed from two images with known poses, key points are triangulated (Hartley and Zisserman, 2003) into 3D. Newly registered images extend the scene by allowing for more key points to be triangulated into the 3D reconstruction.

Error Correction: To correct errors in registration and triangulation, bundle adjustment (Triggs et al., 2000) is performed by jointly optimizing the camera poses $P_c \in SE(3)$ and the 3D points $X_k \in R^3$, minimizing the reprojection loss $E$ given by the squared error between the reprojection $\pi_{P_c}(X_k)$ of each 3D point onto the image plane and the observed key point $x_j \in R^2$:

E = \sum_j \rho_j \left( \pi_{P_c}(X_k) - x_j \right)^2.    (3)

Schonberger and Frahm (2016) introduced various innovations improving the geometric verification, improving the robustness of the initialization and triangulation, introducing a next best view selection method and an iterative and more efficient bundle adjustment method, resulting in the COLMAP SfM library.
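As a practical note, this sparse preprocessing can be reproduced with the COLMAP command-line tools; the sketch below (invoked from Python, with placeholder paths and default options) runs feature extraction, exhaustive matching, and incremental mapping.

```python
import os
import subprocess

def run_colmap_sfm(image_dir: str, workspace: str) -> None:
    """Run COLMAP's sparse SfM pipeline: SIFT feature extraction, exhaustive
    matching, and incremental mapping (registration, triangulation, bundle
    adjustment). Paths are placeholders; all options are left at their defaults."""
    database = f"{workspace}/database.db"
    os.makedirs(f"{workspace}/sparse", exist_ok=True)
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", database,
                    "--image_path", image_dir], check=True)
    subprocess.run(["colmap", "exhaustive_matcher",
                    "--database_path", database], check=True)
    subprocess.run(["colmap", "mapper",
                    "--database_path", database,
                    "--image_path", image_dir,
                    "--output_path", f"{workspace}/sparse"], check=True)

run_colmap_sfm("images", "colmap_workspace")
```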

3.4 Multi-View-Stereo Dense 3D Reconstruction

The MVS dense reconstruction we used as the ground truth/reference geometry of the region of study was obtained with COLMAP's MVS algorithm (Schönberger et al., 2016), which is based on joint view selection and depth map estimation (Zheng et al., 2014). The method is summarized as follows.

Depth and normal map estimation: To estimate the depth $\theta_l \in R$ and normal $n_l \in R^3$ at a pixel $l$ of a reference image $X^{ref}$, a joint likelihood function is used. $\boldsymbol{X} = \{X^{ref}, X^1, \ldots, X^m, \ldots, X^M\}$ is the collection of all images (one reference image and $M$ source images). $\boldsymbol{Z} = \{Z_l^m \mid l = 1 \ldots L,\, m = 1 \ldots M\}$ is the set of occlusion indicators, with $Z_l^m = 1$ if image $X^m$ is selected for the depth estimation of pixel $l$ in $X^{ref}$, and zero if occluded. $\boldsymbol{\theta} = \{\theta_l \mid l = 1 \ldots L\}$ are the depths at each pixel $l$ of $X^{ref}$, which are to be recovered. $\boldsymbol{N} = \{n_l \mid l = 1 \ldots L\}$ are the normals of $X^{ref}$, also to be recovered. The joint likelihood is given by

P(\boldsymbol{X}, \boldsymbol{Z}, \boldsymbol{\theta}, \boldsymbol{N}) = \prod_l \prod_m \left[ P(Z^m_{l,t} \mid Z^m_{l-1,t}, Z^m_{l,t-1})\, P(X^m_l \mid Z^m_l, \theta_l, n_l)\, P(\theta_l, n_l \mid \theta^m_l, n^m_l) \right]    (4)

where $m$ indexes over source images, $l$ indexes pixels or patches in the reference image $X^{ref}$, and $t$ denotes the optimization iteration. The first term $P(Z^m_{l,t} \mid Z^m_{l-1,t}, Z^m_{l,t-1})$ enforces spatially smooth and temporally (in terms of optimization steps) consistent occlusion maps. The second term $P(X^m_l \mid Z^m_l, \theta_l, n_l)$ enforces photometric consistency between the reference image and the source images. The third term $P(\theta_l, n_l \mid \theta^m_l, n^m_l)$ enforces depth and normal maps consistent with multi-view geometry. Readers are referred to Schönberger et al. (2016) for the construction of each term of this joint likelihood function and its optimization.

Filtering and fusion: First, depth and normal maps for each image are estimated as in the previous step. Photometric and geometric constraints are then used to filter outliers: an observation $x_l$ is discarded if its support set $S_l = \{x^m_l\}$, the set of source-image observations satisfying both the geometric and photometric constraints, has fewer than 3 elements (i.e., the reference pixel is kept only if it can be observed, while satisfying both constraints, in at least 3 other images). A directed graph of consistent pixels is defined with the supported pixels as nodes and edges pointing from reference to source images. The fusion is initialized at the node with maximum support (observed by the most source images while satisfying the photometric and geometric constraints). Recursively, connected nodes are collected under a depth consistency constraint, a normal consistency constraint, and a reprojection error bound constraint. The collection's elements are fused when there are no more nodes satisfying all three constraints. The fused point becomes part of the output dense point cloud, with its location $p_j$ and normal $n_j$ averaged over the collection's elements. The fused nodes are culled from the graph and the process is repeated until the graph is empty. The final output is a dense point cloud with normals, which can be meshed via Poisson Surface Reconstruction (Kazhdan and Hoppe, 2013), as we have done, or using other methods if desired.
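In practice, this dense stage can also be run with the COLMAP command-line tools on top of an existing sparse reconstruction; a sketch with placeholder paths and default options is given below (undistortion, PatchMatch stereo for depth/normal maps, and fusion into a dense point cloud).

```python
import subprocess

def run_colmap_mvs(image_dir: str, sparse_model: str, dense_workspace: str) -> None:
    """Run COLMAP's dense MVS stage: image undistortion, PatchMatch stereo
    (depth and normal map estimation), and depth-map fusion into a dense
    point cloud with normals."""
    subprocess.run(["colmap", "image_undistorter",
                    "--image_path", image_dir,
                    "--input_path", sparse_model,
                    "--output_path", dense_workspace], check=True)
    subprocess.run(["colmap", "patch_match_stereo",
                    "--workspace_path", dense_workspace], check=True)
    subprocess.run(["colmap", "stereo_fusion",
                    "--workspace_path", dense_workspace,
                    "--output_path", f"{dense_workspace}/fused.ply"], check=True)

run_colmap_mvs("images", "colmap_workspace/sparse/0", "dense_workspace")
```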

3.5 3D Gaussian Splatting

3D Gaussian Splatting (Kerbl et al., 2023), which we briefly describe in this subsection, is used as the foundation for our 3D urban reconstruction and view-synthesis experiments in the region of study and in the benchmarks.

From 2D images of a scene, 3D Gaussian Splatting learns and represents the scene geometry as (unnormalized) 3D Gaussian functions with mean $\mu \in R^3$ and $3 \times 3$ covariance matrix $\Sigma$, given by

G(x) = e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}.    (7)

The scene lighting and color are learned as third-order spherical harmonic coefficients for each color channel, attached to each Gaussian. Each Gaussian is also assigned a local (conic) opacity $\sigma$; combined with the 3D mean and covariance matrix, this results in a total of 59 trainable parameters per Gaussian. The 3D covariance matrix $\Sigma$ is learned as a 3D diagonal scaling matrix $S$ and a rotation represented by a quaternion $(r, i, j, k)$, from which a 3D rotation matrix $R$ is reconstructed as follows:

R = \begin{bmatrix} 1 - 2(j^2 + k^2) & 2(ij - kr) & 2(ik + jr) \\ 2(ij + kr) & 1 - 2(i^2 + k^2) & 2(jk - ir) \\ 2(ik - jr) & 2(jk + ir) & 1 - 2(i^2 + j^2) \end{bmatrix}.    (8)

The 3D covariance matrix is then given by

\Sigma = R S S^T R^T.    (9)
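A minimal NumPy sketch of Eqs. (8)-(9), building one Gaussian's covariance from its learned quaternion and scales, is given below; the function name and sample values are ours, for illustration only.

```python
import numpy as np

def covariance_from_quat_scale(q, s):
    """Build a Gaussian's 3D covariance from its rotation quaternion q = (r, i, j, k)
    and its diagonal scales s, following Eqs. (8)-(9)."""
    r, i, j, k = q / np.linalg.norm(q)   # normalize the quaternion before use
    R = np.array([[1 - 2 * (j * j + k * k), 2 * (i * j - k * r), 2 * (i * k + j * r)],
                  [2 * (i * j + k * r), 1 - 2 * (i * i + k * k), 2 * (j * k - i * r)],
                  [2 * (i * k - j * r), 2 * (j * k + i * r), 1 - 2 * (i * i + j * j)]])
    S = np.diag(s)
    return R @ S @ S.T @ R.T             # Sigma = R S S^T R^T, Eq. (9)

# identity rotation with anisotropic scales as a toy example
Sigma = covariance_from_quat_scale(np.array([1.0, 0.0, 0.0, 0.0]),
                                   np.array([0.5, 0.2, 0.1]))
```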

A sparse initial point cloud and training image camera poses are first computed using a structure from motion library such as COLMAP (Schonberger and Frahm, 2016). A Gaussian is initialized at each point in the sparse point cloud, and trained using the differentiable tile-based rasterizer.

3.5.1 Rasterization

The tile-based rasterizer divides the image into 16 × 16 pixel tiles. For each tile, a view frustum is projected into the 3D scene. 3D Gaussians are accumulated/assigned per tile according to their overlap with the view frustum, and are projected into 2D along with their covariance matrices $\Sigma$. Starting in homogeneous coordinates, this is given by

\Sigma' = J W \Sigma W^T J^T    (10)

where $W$ is the view transformation and $J$ is the affine approximation of the projective transformation. The projective transformation is a matrix multiplication in homogeneous coordinates in the case of a linear camera model such as the pinhole model used with the standard 3DGS model, in which case $J$ is simply obtained from the intrinsic camera matrix. The third component of $\mu'$, and the third row and column of $\Sigma'$, are then dropped to obtain the 2D mean and the 2D covariance matrix in the imaging plane in Cartesian coordinates.
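The following sketch illustrates Eq. (10) for a pinhole camera; it assumes the covariance has already been expressed in camera coordinates (so the view transformation $W$ is folded in), and keeps only the 2 × 2 image-plane block. Names and values are illustrative.

```python
import numpy as np

def project_covariance_2d(Sigma_cam, mean_cam, fx, fy):
    """Project a 3D covariance (already in camera coordinates) into the image plane,
    Sigma' = J Sigma J^T, with J the Jacobian (affine approximation) of the pinhole
    projection evaluated at the Gaussian's camera-space mean (x, y, z)."""
    x, y, z = mean_cam
    J = np.array([[fx / z, 0.0, -fx * x / z ** 2],
                  [0.0, fy / z, -fy * y / z ** 2]])
    return J @ Sigma_cam @ J.T   # 2x2 covariance of the splatted 2D Gaussian

Sigma_2d = project_covariance_2d(np.eye(3) * 0.01,
                                 mean_cam=np.array([0.5, -0.2, 4.0]),
                                 fx=1000.0, fy=1000.0)
```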

Gaussians are then sorted by tile and by depth. For each pixel in a tile, the pixel's color is generated via alpha blending, accumulating direction-dependent colors evaluated from the learned SH coefficients. For each Gaussian being blended, the $\alpha_i$ at pixel location $x$ is obtained by evaluating the associated 2D Gaussian scaled by its learned opacity $a_i$:

\alpha_i(x) = a_i G_{2D}(x)    (11)

where $G_{2D}(x)$ is the Gaussian of (7) projected into 2D and onto the image plane via (10).

The rasterizer generates an image which is compared to the ground truth image using a photometric $L_1$ loss and $L_{D\text{-}SSIM}$, a Structural Dissimilarity (D-SSIM) loss based on the Structural Similarity Index Measure (Wang et al., 2004), via

L = (1 - \lambda) L_1 + \lambda L_{D\text{-}SSIM}    (12)

with $\lambda$ being an adjustable weighting parameter defaulting to 0.2. The trainable parameters are back-propagated through the differentiable rasterization and optimized using Adam (Kingma and Ba, 2014).
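A hedged PyTorch sketch of the objective in Eq. (12) is given below; `ssim_fn` stands in for an external SSIM implementation (such as the one bundled with the reference 3DGS code or the pytorch-msssim package) and is not defined here.

```python
import torch

def photometric_loss(rendered, target, lam=0.2, ssim_fn=None):
    """Compute L = (1 - lambda) * L1 + lambda * L_D-SSIM, with L_D-SSIM = 1 - SSIM.
    `rendered` and `target` are image tensors; `ssim_fn` is an assumed external
    SSIM implementation returning a scalar similarity in [0, 1]."""
    l1 = torch.abs(rendered - target).mean()
    d_ssim = (1.0 - ssim_fn(rendered, target)) if ssim_fn is not None else 0.0
    return (1.0 - lam) * l1 + lam * d_ssim
```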

3.5.2 Densification and Pruning

3D Gaussian Splatting also densifies/grows new Gaussians in regions with a high view-space positional gradient (above the threshold $\tau_{pos} = 2.0 \times 10^{-4}$ by default). These regions correspond to neighborhoods with missing geometric features and regions where a few Gaussians cover large areas of the scene. Low-variance Gaussians with high view-space positional gradients are duplicated, whereas high-variance Gaussians are split into two, with their standard deviations divided by a factor of 1.6. This is illustrated in Figure 3.5.2.

Unimportant Gaussians are also pruned. Gaussians that are essentially transparent, with opacity less than a user-defined threshold ($a < \epsilon_a$, default value $5 \times 10^{-3}$), are deleted. Every 3000 iterations (or some other number of the user's choosing), every Gaussian's opacity is set to zero, then allowed to be re-optimized, and Gaussians are culled where needed. This process controls the number of floater artifacts and helps control the total number of Gaussians. We believe that this densification and density control process can allow for point cloud reconstruction of similar density, and potentially similar quality, compared to a dense reconstruction, given a good dataset.
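The selection logic can be summarized with the schematic PyTorch sketch below; the thresholds mirror the defaults quoted above, while `scale_threshold` is an illustrative stand-in for the scene-extent-based size criterion used by the reference implementation.

```python
import torch

def densification_masks(grads, scales, opacities,
                        tau_pos=2e-4, eps_a=5e-3, scale_threshold=0.01):
    """Flag Gaussians for cloning, splitting, or pruning.

    grads:     (N, 2) accumulated view-space positional gradients
    scales:    (N, 3) per-axis standard deviations of each Gaussian
    opacities: (N, 1) learned opacities
    """
    over_grad = grads.norm(dim=-1) > tau_pos            # under-reconstructed regions
    is_large = scales.max(dim=-1).values > scale_threshold
    clone_mask = over_grad & ~is_large                  # duplicate small Gaussians
    split_mask = over_grad & is_large                   # split large Gaussians (scales / 1.6)
    prune_mask = opacities.squeeze(-1) < eps_a          # remove near-transparent Gaussians
    return clone_mask, split_mask, prune_mask
```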

[Figure: densification of 3D Gaussians by cloning small Gaussians and splitting large ones]

3.6 Evaluation Metrics

For the quality of synthesized images, we use Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index Measure (SSIM) (Wang et al., 2004), and the Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) as full-reference image assessment metrics comparing generated views to ground truth views. PSNR is a good indicator of the presence of noise and visual artifacts, whereas SSIM and LPIPS have been shown to better correlate with human judgement of the visual similarity of an image to its reference.

For point cloud geometry assessment, we used point-to-point (D1) mean squared error (MSE), point-to-surface (D2) MSE, the Hausdorff distance, and the Chamfer distance, all of which compare a lower-quality point cloud to its reference point cloud. We note that metrics such as D1 and D2 MSE do not penalize differences in point density, only deviations of existing points from ground truth/reference points. On the other hand, the Chamfer and Hausdorff distances better capture the difference between the distributions of points, including differences in point density.
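For reference, a generic sketch of the two distribution-level metrics is given below, using nearest-neighbour queries in both directions; this is a standard symmetric formulation, not necessarily the exact tool used to produce the reported numbers.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_hausdorff(points_a, points_b):
    """Symmetric Chamfer and Hausdorff distances between two (N, 3) point clouds."""
    d_ab, _ = cKDTree(points_b).query(points_a)   # each point in A to its nearest in B
    d_ba, _ = cKDTree(points_a).query(points_b)   # each point in B to its nearest in A
    chamfer = 0.5 * (d_ab.mean() + d_ba.mean())
    hausdorff = max(d_ab.max(), d_ba.max())
    return chamfer, hausdorff
```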

4 Experiments and Results

4.1 Experiment Setup

Both the COLMAP preprocessing and the 3D Gaussian Splatting optimization were performed on an RTX 3080 GPU with 10 GB of VRAM and an i9-10900KF CPU, with PyTorch version 2.1.1 and CUDA toolkit version 12.1. We note that the GPU VRAM limitation is especially relevant, as it is always possible to grow more and more Gaussians to achieve higher and higher visual reconstruction quality at the cost of memory and storage when using 3D Gaussian Splatting.

[Figure: view-synthesis results for the region of study and the extracted 3DGS point cloud]

4.2 3D Novel View Synthesis of the Region of Study

For the region of study, we used COLMAP SfM (Schonberger and Frahm, 2016) preprocessing and extracted 3D points and camera poses from the 400 two-dimensional images. The experiments were performed with a Mip-NeRF 360 (Barron et al., 2022) style training/validation split: one in eight images ($\sim$12.5%) were reserved for testing purposes. The 1920 by 1080 resolution images were downsampled by a factor of 4 during training due to GPU memory constraints. We started densification at the 1000th iteration and trained for 50,000 iterations, densifying every 100 iterations. We used an initial positional learning rate of $3.2 \times 10^{-5}$ and a scale learning rate of $2 \times 10^{-3}$. The other training hyperparameters were kept at their defaults.
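For reproducibility, an illustrative training invocation is shown below, assuming the reference graphdeco-inria gaussian-splatting train.py script; the flag names may differ between versions and the paths are placeholders, but the values mirror the settings described above.

```python
import subprocess

# Illustrative call to the reference 3DGS training script (assumed interface):
# 1/4 resolution, 50k iterations, densification every 100 iterations starting at
# iteration 1000, positional lr 3.2e-5, scaling lr 2e-3.
subprocess.run(["python", "train.py",
                "--source_path", "waterloo_scene",        # COLMAP workspace of the scene
                "--model_path", "output/waterloo",
                "--resolution", "4",
                "--iterations", "50000",
                "--densify_from_iter", "1000",
                "--densification_interval", "100",
                "--position_lr_init", "3.2e-5",
                "--scaling_lr", "2e-3"], check=True)
```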

The results are shown in Table 2 and Figure 4.1, in conjunction with the further view-synthesis experiments on the BungeeNeRF dataset. We achieve high view-synthesis visual quality on both the training and test sets. From visual inspection, the rendered images are nearly indistinguishable from the ground truth images. This is also supported by the visual assessment metrics, with SSIM scores near 1 and LPIPS scores near 0, indicating almost perfect visual agreement between ground truth and generated images. The PSNR values of around 30 dB are also indicative of good image quality and a low level of noise. This is comparable to the PSNR of a compressed image with respect to its full-sized original under a good lossy compression algorithm (Netravali, 2013), which is impressive considering the 3DGS model was trained at 1/4 resolution.

4.3 3D Novel View Synthesis of Bungee-NeRF Scenes

For the BungeeNeRF scenes, the experiments were performed with the Mip-NeRF 360 style training and validation split described previously. The experimental settings were kept the same as for the Waterloo scene, except that we reduced the total number of training iterations to 30,000. BungeeNeRF provided detailed benchmarks for the New York and San Francisco scenes, which were used for their main view-synthesis experiments. We performed detailed comparisons of Gaussian Splatting against BungeeNeRF, vanilla NeRF, and Mip-NeRF for these two scenes. Additionally, we trained and evaluated Gaussian Splatting models for the remaining eight scenes whose camera paths were provided by BungeeNeRF.

As observed in Table 3, across both the New York and San Francisco scenes, we see a large increase in view-synthesis quality according to all three metrics. The visual quality improvement from BungeeNeRF to 3DGS is much larger than the improvement from vanilla NeRF, or any other benchmarked model, to BungeeNeRF. We also note that this large increase in view-synthesis quality does not come at the cost of training time. In fact, the training of Gaussian Splatting models is three to four orders of magnitude faster than that of implicit NeRF models such as vanilla NeRF (Mildenhall et al., 2021), Mip-NeRF (Barron et al., 2021), and BungeeNeRF (Xiangli et al., 2022). Gaussian Splatting models achieve higher view-synthesis quality with faster training and rendering times at the cost of memory and storage requirements (Gao et al., 2022).

[Figure: qualitative comparison between ground truth Google Earth Studio images and 3DGS rendered views for the BungeeNeRF scenes]

As observed in the qualitative comparison in Figure 4.3, the images rendered using 3DGS are of high visual quality and difficult to distinguish visually from the ground truth images, except for a Google Earth watermark noticeable at the bottom right of the ground truth images. Compared with the ground truth images, we observe that the rendered images have slightly blurrier edges at the smallest scale ($\sim$300 m altitude) and that certain street-level details are slightly less sharp at the largest scale ($\sim$3000 m altitude). Also noticeable in both the ground truth Google Earth images and the rendered images is the piecing together of multiple data sources at the largest scale, in the bottom row of Figure 4.3. We notice a visibly discontinuous and grid-like change in the coloration of the water going from the San Francisco shoreline to the San Francisco Bay and Golden Gate area, likely indicating areas where different aerial and/or satellite images were stitched together. This effect was also learned by the 3DGS model, as is visible in the corresponding rendered images.

Table 2: View-synthesis quality of 3DGS on the Waterloo scene and the BungeeNeRF scenes.

Dataset       | Train PSNR↑ | Test PSNR↑ | Test SSIM↑ | Test LPIPS↓
Waterloo      | 32.3        | 30.5       | 0.953      | 0.0535
New York      | 31.5        | 30.7       | 0.964      | 0.0500
San Francisco | 30.8        | 29.9       | 0.952      | 0.0669
Amsterdam     | 32.3        | 29.7       | 0.948      | 0.0535
Barcelona     | 31.2        | 28.1       | 0.937      | 0.0659
Chicago       | 32.3        | 30.0       | 0.959      | 0.0460
Los Angeles   | 32.0        | 28.6       | 0.914      | 0.0937
Paris         | 31.6        | 28.5       | 0.953      | 0.0509
Rome          | 32.7        | 27.0       | 0.861      | 0.1127
Quebec        | 32.9        | 30.1       | 0.953      | 0.0603
Bilbao        | 32.1        | 27.2       | 0.851      | 0.1415

In addition, we also tested the performance of 3DGS on the other BungeeNeRF Google Earth Studio scenes. We note a 0.7 to 5.7 PSNR drop when moving from the training set to the test set across the scenes in Table 2, indicating a certain degree of overfitting to the training views. We notice the overfitting is more severe on the $\sim$200-image scenes than on the $\sim$450-image New York and San Francisco scenes, with our 400-image Waterloo scene lying in the middle. The Bilbao scene, centered on the Guggenheim museum, has by far the worst-performing 3DGS reconstruction. We observe that this is perhaps due to a combination of the complex building shape of the Guggenheim museum, a lack of sufficient training views at low altitude, and a poorer-quality Google Earth 3D model at off-nadir view angles at low altitude, which resulted in poor-quality training images.

Table 3: View-synthesis comparison against NeRF-based baselines on the New York (56 Leonard) and San Francisco (Transamerica) scenes.

Method                                                      | New York PSNR↑ | LPIPS↓ | SSIM↑ | San Francisco PSNR↑ | LPIPS↓ | SSIM↑
NeRF (D=8, Skip=4) (Mildenhall et al., 2021)                | 21.7 | 0.320 | 0.636 | 22.6 | 0.318 | 0.690
NeRF w/ WPE (D=8, Skip=4) (Mildenhall et al., 2021)         | 21.6 | 0.365 | 0.633 | 22.4 | 0.331 | 0.680
Mip-NeRF-small (D=8, Skip=4) (Barron et al., 2021)          | 22.0 | 0.344 | 0.648 | 22.7 | 0.327 | 0.687
Mip-NeRF-large (D=10, Skip=4) (Barron et al., 2021)         | 22.2 | 0.318 | 0.666 | 22.5 | 0.330 | 0.686
Mip-NeRF-full (D=10, Skip=4,6,8) (Barron et al., 2021)      | 22.3 | 0.266 | 0.689 | 22.8 | 0.314 | 0.699
BungeeNeRF (same iter. as baselines) (Xiangli et al., 2022) | 23.5 | 0.235 | 0.739 | 23.6 | 0.265 | 0.749
BungeeNeRF (until convergence) (Xiangli et al., 2022)       | 24.5 | 0.160 | 0.815 | 24.4 | 0.192 | 0.801
3DGS                                                        | 30.7 | 0.050 | 0.964 | 29.9 | 0.067 | 0.952

4.4 3D Reconstruction of the Region of Study

For the 3D reconstruction experiments, due to the computational constraints of multi-view-stereo (MVS) densification, which is even more memory intensive than 3DGS, we extracted the sparse point cloud using the first 50 images along the first-level camera path. Then, using the method of Schönberger et al. (2016), we generated depth and normal maps, which we used to produce a dense point cloud serving as the ground truth/reference 3D geometry of the EV-1 neighborhood. We then trained a 3D Gaussian Splatting model on these first 50 images, after which the Gaussian positional means were extracted as a new, 3DGS densified point cloud of 1,856,968 points, starting from a sparse point cloud of 24,740 points, roughly a 75-fold increase in point count. The positional means extracted as the 3DGS densified point cloud are visualized in Figure 4.1 (rasterized at a Gaussian scale of $10^{-3}$). In comparison, the MVS densification resulted in 2,528,969 points. The MVS densification results are visualized in Figure 4.4.
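Extracting the densified point cloud from a trained model amounts to reading the Gaussians' positional means; a small sketch is given below, assuming the standard 3DGS point_cloud.ply export layout and an illustrative output path.

```python
import numpy as np
from plyfile import PlyData

def gaussian_means_to_points(ply_path):
    """Read the trained Gaussians' positional means (x, y, z) from the exported
    point_cloud.ply and return them as an (N, 3) array, i.e. the 3DGS densified
    point cloud used in the geometry comparison."""
    vertices = PlyData.read(ply_path)["vertex"]
    return np.stack([vertices["x"], vertices["y"], vertices["z"]], axis=1)

points = gaussian_means_to_points(
    "output/waterloo/point_cloud/iteration_50000/point_cloud.ply")
```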

[Figure: sparse, 3DGS densified, and MVS densified point clouds of the EV-1 neighborhood, with local Hausdorff distance visualizations]

We first note that the initial sparse point cloud and the 3DGS densified point cloud are aligned with each other. However, the MVS densified point cloud, which we take as the ground truth/reference point cloud, was offset from both by a rotation, a translation, and non-affine deformations far from the origin. This is visible in the last row of Figure 4.4, and becomes even more obvious as the view is extended further. As such, we cropped all three point clouds and performed point cloud registration to align them. We used the iterative closest point (ICP) algorithm (Besl and McKay, 1992) to register both the initial and the 3DGS densified point clouds to the MVS densified point cloud. This process aligned all three point clouds up to translations and rotations, but we still observe slight non-affine deformations as the distance from the origin and the height increase post-registration, as can be seen in Figure 4.4.
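A minimal Open3D sketch of this registration step is shown below; the file names are placeholders and the correspondence distance depends on the scene's arbitrary SfM scale, so both are assumptions.

```python
import open3d as o3d

def icp_align(source_path, target_path, max_corr_dist=5.0):
    """Rigidly register a source point cloud (sparse or 3DGS densified) to the
    reference MVS densified point cloud using point-to-point ICP."""
    source = o3d.io.read_point_cloud(source_path)
    target = o3d.io.read_point_cloud(target_path)
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
    source.transform(result.transformation)   # apply the estimated rigid transform
    return source

aligned = icp_align("gs_densified.ply", "mvs_dense.ply")
```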

The cropping resulted in 12,773, 244,849, and 1,270,820 points for the sparse, 3DGS densified, and MVS densified point clouds respectively. We noticed that both the initial sparse point cloud and the MVS densified point cloud were much denser at the center of the scene than at the edges, whereas the 3DGS densified point cloud had a proportionately more uniform point density than the two aforementioned point clouds. As such, the cropping reduced the number of points in the sparse and MVS densified point clouds by roughly a factor of $\sim$2, whereas it reduced the number of points in the 3DGS densified point cloud by a factor of $\sim$7.5.

We then compared both point clouds to the dense MVS point cloud fused from depth and normal maps, using D1 (point-to-point) MSE, D2 (point-to-surface) MSE, the Hausdorff distance, and the Chamfer distance. We observe that the 3DGS densified point cloud has marginally higher D1 and D2 MSE (with respect to the MVS densified point cloud) than the sparse initial point cloud. However, neither MSE metric penalizes differences in point density; they only measure the presence of outliers and noise points. On the other hand, the Hausdorff and Chamfer distances better reflect differences between the distributions of points, and in terms of these two metrics the 3DGS densified point cloud agrees much better with the reference MVS densified point cloud than the sparse point cloud does. This is also corroborated by visual inspection of Figure 4.4. We additionally plotted the local Hausdorff distance with respect to the reference MVS densified point cloud in Figure 4.4, which helped highlight the non-affine distortion between the reference MVS densified point cloud and the two others.

                          Sparse         3DGS Densified
Points                    24740          1856968
Points post-cropping      12773          244849
D1 MSE ↓                  7.625×10⁻³     8.154×10⁻³
D2 MSE ↓                  6.879×10⁻³     7.297×10⁻³
Hausdorff distance ↓      8.753×10⁻¹     3.745×10⁻¹
Chamfer distance ↓        2.546×10⁻²     1.615×10⁻²
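To make the comparison concrete, the following is a minimal sketch of these point-cloud metrics, assuming the aligned clouds are available as (N, 3) NumPy arrays. D1 MSE is taken here as the mean squared nearest-neighbour distance to the reference; D2 (point-to-surface) additionally requires reference normals and is omitted. Chamfer and Hausdorff conventions vary (symmetric mean versus maximum of the two directed distances), so the exact scaling may differ from the values reported in the table.

import numpy as np
from scipy.spatial import cKDTree

def directed_nn_distances(src, dst):
    """Distance from every point in src to its nearest neighbour in dst."""
    return cKDTree(dst).query(src, k=1)[0]

def compare_to_reference(cloud, reference):
    d_cr = directed_nn_distances(cloud, reference)
    d_rc = directed_nn_distances(reference, cloud)
    d1_mse = np.mean(d_cr ** 2)                   # point-to-point MSE (one direction)
    chamfer = 0.5 * (d_cr.mean() + d_rc.mean())   # symmetric Chamfer distance
    hausdorff = max(d_cr.max(), d_rc.max())       # symmetric Hausdorff distance
    return d1_mse, chamfer, hausdorff

The per-point distances d_cr can also serve as a local Hausdorff/error map, coloured onto the cloud to visualize where the non-affine distortion concentrates, in the spirit of the visualization in Figure 4.4.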

5 Discussions

We note that Google Earth Studio produces composite images and images rendered from 3D models, constructed from remote sensing images from a variety of governmental and commercial sources, including Landsat, Copernicus, Airbus, NOAA, the U.S. Navy, USGS, and Maxar, taken at different times. This can be both an advantage and a disadvantage. Low-altitude images with far-from-vertical (off-nadir) points of view rely on Google Earth Engine's own 3D models, which are limited in detail compared to real remote sensing images. On the other hand, the variety of data sources benefits the robustness of the 3D Gaussian Splatting model, which is trained on images from different sensors under different photometric and radiometric conditions. The disadvantages are also counterbalanced by the ease with which Google Earth Studio allows the creation of a multi-scale dataset with a spiraling camera path, suited for a large-scale 3D scene centered around a neighborhood of interest in a city.
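To illustrate what such a camera path looks like, the following is a small, hypothetical sketch of a descending spiral of camera positions centered on a neighborhood of interest; the center coordinates, radii, altitudes, and frame count are placeholder values, and the resulting keyframes would still need to be entered into or imported by Google Earth Studio.

import math

def spiral_keyframes(center_lat, center_lon, n_frames=180, turns=5,
                     start_alt_m=2000.0, end_alt_m=300.0, radius_deg=0.01):
    """Yield (lat, lon, altitude_m) camera positions along a descending, tightening spiral."""
    for i in range(n_frames):
        t = i / (n_frames - 1)
        angle = 2.0 * math.pi * turns * t
        r = radius_deg * (1.0 - 0.5 * t)     # tighten the orbit as the camera descends
        alt = start_alt_m + t * (end_alt_m - start_alt_m)
        yield (center_lat + r * math.cos(angle),
               center_lon + r * math.sin(angle),
               alt)

# Example: a few keyframes around the University of Waterloo campus (approximate coordinates).
for lat, lon, alt in spiral_keyframes(43.4723, -80.5449, n_frames=6):
    print(f"lat={lat:.5f}, lon={lon:.5f}, alt={alt:.0f} m")

Because such a path keeps looking back at the same center at progressively lower altitudes, the scene center is observed at many scales and viewing angles, which is what makes it well suited to large-scale reconstruction centered on a single neighborhood.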

When recovering 3D geometry (as a 3D point cloud) from the SfM preprocessing, the 3DGS densified point cloud, and even the MVS densified dense point cloud, we notice a mild to strong presence of noise, which should be addressed in future 3DGS research. In our 3D reconstruction and densification experiments, we used the MVS densified point cloud as ground truth, even though it was also constructed from 2D images. Despite recovering good-quality dense 3D surfaces, the MVS densified point cloud was offset from both the initial sparse point cloud and the 3DGS densified point cloud by a non-affine transformation, which should be investigated further. In the future, for geometry-recovery benchmarks, we believe a scanned point cloud, such as one from a LiDAR source, would be more accurate as ground truth. A future project could use a scanned point cloud as ground truth and register and georeference both the MVS and 3DGS densified point clouds to it, in order to properly study the geometry of these densifications and to enable further mapping and GIS applications. We also note that the memory requirement of the COLMAP MVS densification was larger than that of the 3DGS densification, which is one reason we performed the densification experiment at a smaller scale with fewer images. Despite these concerns, and although 3DGS was not built as a 3D geometry extraction tool, it is reasonably able to recover scene geometry through densification and optimization of the Gaussian positions.

The high GPU memory requirement of 3D Gaussian Splatting prevents high-resolution reconstruction across the entire large-scale scene. Due to the chosen camera path, the center of the scene is well reconstructed at all altitudes and densely populated by Gaussians, which results in high-quality rendered images. However, in other neighborhoods further away from the scene center, we are only able to achieve high-quality reconstruction at high altitude and struggle near the ground. Although certain advances, currently in preprints, have attempted to address the memory issue, many of these models compress the trained Gaussian Splatting model post-training, reducing model storage requirements, but do not achieve a significant reduction in working memory during training.

We expect the reduction of working memory requirements to be a future research direction. This would also allow for better reconstruction across multiple neighborhoods, perhaps using more complex camera paths in Google Earth Studio, such as multiple spirals arranged hierarchically and centered around each neighborhood of interest, or space-filling curves providing dense camera coverage across the entire large-scale scene. Alternatively, large-scale 3D reconstruction schemes that piece together multiple local models, such as Mega-NeRF (Turki et al., 2022), could be considered. Another future research direction is remote sensing-based, large-scale, semantics-aware 3D reconstruction and semantic synthesis. For urban scenes, this research area is expected to find applications in urban digital twin creation, urban monitoring, and urban/land-use planning. It can also extend land-use/land-cover segmentation to three dimensions, which has a multitude of research and commercial applications. These are the research areas we are currently investigating.

6 Conclusion

By simply leveraging Google Earth imagery, we captured an aerial off-nadir dataset of the region of study, photorealistically rendered the scene, and recovered its geometry. We compared 3DGS with NeRF methods on a large-scale urban reconstruction dataset across 10 cities, and performed a careful study of the 3D point cloud densification capability of 3DGS, comparing and visualizing the densification against Multi-View-Stereo dense reconstruction in our region of study. Between the Multi-View-Stereo densified point cloud and the 3DGS densified point cloud, we find both a rigid misalignment (rotation and translation), which we remove with point cloud registration, and a residual non-linear deformation, which we quantify and visualize. We hope our study and experiments help future research in large-scale remote sensing-based 3D Gaussian Splatting for both view synthesis and geometry retrieval.

\printcredits

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

References

  • Alphabet Inc., 2015-2024. Google Earth Studio. URL: https://www.google.com/earth/studio/.
  • Badue, C., Guidolini, R., Carneiro, R.V., Azevedo, P., Cardoso, V.B., Forechi, A., Jesus, L., Berriel, R., Paixao, T.M., Mutz, F., et al., 2021. Self-driving cars: A survey. Expert Systems with Applications 165, 113816.
  • Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P., 2021. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5855–5864.
  • Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P., 2022. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5470–5479.
  • Besl, P.J., McKay, N.D., 1992. Method for registration of 3-D shapes, in: Sensor Fusion IV: Control Paradigms and Data Structures, SPIE. pp. 586–606.
  • Biljecki, F., Stoter, J., Ledoux, H., Zlatanova, S., Çöltekin, A., 2015. Applications of 3D city models: State of the art review. ISPRS International Journal of Geo-Information 4, 2842–2889.
  • Carozza, L., Tingdahl, D., Bosché, F., Van Gool, L., 2014. Markerless vision-based augmented reality for urban planning. Computer-Aided Civil and Infrastructure Engineering 29, 2–17.
  • Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H., 2022. TensoRF: Tensorial radiance fields, in: European Conference on Computer Vision, Springer. pp. 333–350.
  • Derksen, D., Izzo, D., 2021. Shadow neural radiance fields for multi-view satellite photogrammetry, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1152–1161.
  • Fischler, M.A., Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–395.
  • Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A., 2022. Plenoxels: Radiance fields without neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5501–5510.
  • Gao, K., Gao, Y., He, H., Lu, D., Xu, L., Li, J., 2022. NeRF: Neural radiance field in 3D vision, a comprehensive review. arXiv preprint arXiv:2210.00379.
  • Hartley, R., Zisserman, A., 2003. Multiple View Geometry in Computer Vision. Cambridge University Press.
  • Jensen, R., Dahl, A., Vogiatzis, G., Tola, E., Aanæs, H., 2014. Large scale multi-view stereopsis evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 406–413.
  • Kazhdan, M., Hoppe, H., 2013. Screened Poisson surface reconstruction. ACM Transactions on Graphics (ToG) 32, 1–13.
  • Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., 2023. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42, 1–14.
  • Kerbl, B., Meuleman, A., Kopanas, G., Wimmer, M., Lanvin, A., Drettakis, G., 2024. A hierarchical 3D Gaussian representation for real-time rendering of very large datasets. ACM Transactions on Graphics 44.
  • Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lehner, H., Dorffner, L., 2020. Digital geoTwin Vienna: Towards a digital twin city as geodata hub.
  • Liao, Y., Zhang, X., Huang, N., Fu, C., Huang, Z., Cao, Q., Xu, Z., Xiong, X., Cai, S., 2024. High completeness multi-view stereo for dense reconstruction of large-scale urban scenes. ISPRS Journal of Photogrammetry and Remote Sensing 209, 173–196.
  • Lingua, A., Noardo, F., Spanò, A., Sanna, S., Matrone, F., 2017. 3D model generation using oblique images acquired by UAV. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 42, 107–115.
  • Lowe, D.G., 1999. Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, IEEE. pp. 1150–1157.
  • Marí, R., Facciolo, G., Ehret, T., 2022. Sat-NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cameras, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1311–1321.
  • Martin-Brualla, R., Radwan, N., Sajjadi, M.S., Barron, J.T., Dosovitskiy, A., Duckworth, D., 2021. NeRF in the wild: Neural radiance fields for unconstrained photo collections, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7210–7219.
  • Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R., 2021. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65, 99–106.
  • Müller, T., Evans, A., Schied, C., Keller, A., 2022. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG) 41, 1–15.
  • Musialski, P., Wonka, P., Aliaga, D.G., Wimmer, M., Van Gool, L., Purgathofer, W., 2013. A survey of urban reconstruction, in: Computer Graphics Forum, Wiley Online Library. pp. 146–177.
  • Netravali, A.N., 2013. Digital Pictures: Representation, Compression, and Standards. Springer.
  • Pepe, M., Fregonese, L., Crocetto, N., 2022. Use of SfM-MVS approach to nadir and oblique images generated through aerial cameras to build 2.5D map and 3D models in urban areas. Geocarto International 37, 120–141.
  • Rematas, K., Liu, A., Srinivasan, P.P., Barron, J.T., Tagliasacchi, A., Funkhouser, T., Ferrari, V., 2022. Urban radiance fields, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12932–12942.
  • Rohil, M.K., Ashok, Y., 2022. Visualization of urban development 3D layout plans with augmented reality. Results in Engineering 14, 100447.
  • Rong, Y., Zhang, T., Zheng, Y., Hu, C., Peng, L., Feng, P., 2020. Three-dimensional urban flood inundation simulation based on digital aerial photogrammetry. Journal of Hydrology 584, 124308.
  • Schonberger, J.L., Frahm, J.M., 2016. Structure-from-motion revisited, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113.
  • Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M., 2016. Pixelwise view selection for unstructured multi-view stereo, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, Springer. pp. 501–518.
  • Schrotter, G., Hürzeler, C., 2020. The digital twin of the city of Zurich for urban planning. PFG–Journal of Photogrammetry, Remote Sensing and Geoinformation Science 88, 99–112.
  • Statistics Canada, 2023. 2021 Census of Population. Statistics Canada Catalogue. URL: https://www12.statcan.gc.ca/census-recensement/2021/dp-pd/prof/index.cfm?Lang=E.
  • Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., Kretzschmar, H., 2022. Block-NeRF: Scalable large scene neural view synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8248–8258.
  • Toschi, I., Ramos, M., Nocerino, E., Menna, F., Remondino, F., Moe, K., Poli, D., Legat, K., Fassi, F., et al., 2017. Oblique photogrammetry supporting 3D urban reconstruction of complex scenarios. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 42, 519–526.
  • Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W., 2000. Bundle adjustment—a modern synthesis, in: Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms, Corfu, Greece, September 21–22, 1999, Proceedings, Springer. pp. 298–372.
  • Turki, H., Ramanan, D., Satyanarayanan, M., 2022. Mega-NeRF: Scalable construction of large-scale NeRFs for virtual fly-throughs, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12922–12931.
  • Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 600–612.
  • Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B., Lin, D., 2022. BungeeNeRF: Progressive neural radiance field for extreme multi-scale scene rendering, in: European Conference on Computer Vision, Springer. pp. 106–122.
  • Yalcin, G., Selcuk, O., 2015. 3D city modelling with oblique photogrammetry method. Procedia Technology 19, 424–431.
  • Yu, A., Li, R., Tancik, M., Li, H., Ng, R., Kanazawa, A., 2021. PlenOctrees for real-time rendering of neural radiance fields, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5752–5761.
  • Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O., 2018. The unreasonable effectiveness of deep features as a perceptual metric, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.
  • Zheng, E., Dunn, E., Jojic, V., Frahm, J.M., 2014. PatchMatch based joint view selection and depthmap estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1510–1517.
  • Zhou, H., Shao, J., Xu, L., Bai, D., Qiu, W., Liu, B., Wang, Y., Geiger, A., Liao, Y., 2024. HUGS: Holistic urban 3D scene understanding via Gaussian splatting. arXiv preprint arXiv:2403.12722.