Enhanced 3D Urban Scene Reconstruction and Point Cloud Densification using Gaussian Splatting and Google Earth Imagery (2024)

Kyle-Yilin Gao [1] (y56gao@uwaterloo.ca, ORCID 0000-0002-8320-6308): Conceptualization, Methodology, Validation, Formal analysis, Investigation, Data Curation, Writing - Original Draft, Visualization, Writing - Review & Editing

Dening Lu [1] (d62lu@uwaterloo.ca, ORCID 0000-0003-0316-0299): Investigation, Writing - Review & Editing

Hongjie He [1] (hongjie.he@uwaterloo.ca, ORCID 0000-0003-3839-5821): Investigation, Writing - Review & Editing

Linlin Xu [1] (l44xu@uwaterloo.ca, ORCID 0000-0002-6833-6462), corresponding author: Conceptualization, Resources, Writing - Review & Editing, Supervision

Jonathan Li [2] (junli@uwaterloo.ca, ORCID 0000-0001-7899-0049), corresponding author: Conceptualization, Resources, Writing - Review & Editing, Supervision

[1] Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
[2] Department of Geography and Environmental Management, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada

Corresponding authors: l44xu@uwaterloo.ca (Linlin Xu), junli@uwaterloo.ca (Jonathan Li)

Abstract

Although large-scale 3D urban scene reconstruction and modelling from remote sensing images is crucial for key applications such as digital twins and smart cities, it remains a difficult task due to uncertainties in heterogeneous datasets and geometry models. This paper presents a Gaussian splatting based approach for 3D urban scene modelling and geometry retrieval, with the following contributions. First, we develop and implement a 3D Gaussian splatting (3DGS) approach for large-scale 3D urban scene modelling from heterogeneous remote sensing images. Second, we design a point cloud densification approach within the proposed 3DGS model to improve the quality of 3D geometry extraction of urban scenes. Leveraging Google Earth imagery from different sensors, the proposed approach is tested on the region of the University of Waterloo, demonstrating that it greatly improves reconstructed point cloud quality over Multi-View-Stereo approaches. Third, we design and conduct extensive experiments on multi-source large-scale Google Earth remote sensing images across ten cities to compare the 3DGS approach with neural radiance field (NeRF) approaches, demonstrating view-synthesis results that greatly outperform previous state-of-the-art 3D view-synthesis approaches.

keywords:

3D Gaussian Splatting, Novel View Synthesis, Photogrammetry, Multi-View-Stereo, Point Cloud

1 Introduction

3D reconstruction and modelling from 2D images has received great interest given recent advances in photorealistic view synthesis methods with 3D reconstruction capabilities. From a technical perspective, it is an interdisciplinary research area spanning computer vision, computer graphics, and photogrammetry. It finds applications in multiple domains, including autonomous navigation aided by 3D scene understanding (Badue et al., 2021), remote sensing and photogrammetry for crafting 3D maps essential for navigation, urban planning, and administration (Biljecki et al., 2015). Moreover, it extends to geographic information systems incorporating urban digital twins (Lehner and Dorffner, 2020; Schrotter and Hürzeler, 2020), as well as augmented and virtual reality platforms integrating photorealistic scene reconstructions (Carozza et al., 2014; Rohil and Ashok, 2022).

This paper focuses on remote sensing-based large-scale view synthesis using 3D Gaussian Splatting (3DGS), as well as 3D geometry extraction from the trained Gaussian Splatting model. Using only images from Google Earth Studio, we train a 3D Gaussian splatting model which outperforms previous NeRF-based models. We quantify and benchmark the view synthesis performance on a large-scale urban dataset with 10 cities captured from Google Earth, as well as on our region of study. We also extract and densify the 3D geometry of the region of study using 3DGS, which we compare against a Multi-View-Stereo dense reconstruction. To our knowledge, this is the first use of 3D Gaussian Splatting for large-scale remote sensing-based 3D reconstruction and view synthesis.

2 Background and Related Work

2.1 Urban 3D Photogrammetry

Photogrammetry extracts 3D geometry and potentially other physical information from 2D images. Remote sensing-based urban photogrammetry for 3D city modelling relies on drones, aerial platforms, or satellites, whereby buildings of interest are captured at an oblique/off-nadir angle; this is often referred to as oblique photogrammetry. In large-scale scenes, other land uses and land covers may be present, posing additional challenges. Ground-based and airborne LiDAR scanners can also be used to generate very accurate 3D models, sometimes in conjunction with image-based methods. However, images are in general more accessible, both in terms of sensors and data availability.

Traditional (non-deep-learning) methods which generate 3D point clouds/geometry from images are grouped into two types: Structure-from-Motion (SfM), which generates sparse point clouds, and Multi-View-Stereo (MVS), which generates dense point clouds (Musialski et al., 2013). The most fundamental method is perhaps Structure-from-Motion, which relies on multi-view geometry and projective geometry to establish the relationship between 3D points and their 2D projections onto imaging planes. Key points are extracted in each 2D image, matched across images with scene overlap, triangulated into three dimensions, and typically further calibrated/error-corrected using bundle adjustment or other methods, resulting in a sparse point cloud 3D reconstruction. The sparse point cloud can then be meshed and/or turned into digital surface models. Sparse SfM photogrammetry is typically applied as a preprocessing step, as shown in various works (Yalcin and Selcuk, 2015; Lingua et al., 2017), to help with further dense reconstruction or data fusion with 3D scanned point clouds. Sparse SfM point clouds can only retrieve scene geometry; they cannot reproduce realistic 3D lighting of the scene, which is crucial for AR/VR-based applications and other applications that rely heavily on visualization.

In urban settings, Multi-View-Stereo (MVS) also requires oblique imagery to capture the geometry of buildings and their facades. Fundamentally, Multi-View-Stereo differs from sparse SfM photogrammetry in that MVS aims for a dense reconstruction by making use of 3D information in each pixel of the 2D images, as opposed to specific key points in the 2D images. This can be done using various methods such as plane sweeping, stereo vision and depth map fusion, or even deep learning methods. MVS methods are typically divided into two categories: volume-based and point-cloud-based (Musialski et al., 2013; Jensen et al., 2014). Various authors (Yalcin and Selcuk, 2015; Toschi et al., 2017; Lingua et al., 2017; Rong et al., 2020; Pepe et al., 2022; Liao et al., 2024) have employed MVS for dense urban 3D reconstruction, which can also be meshed for purposes such as digital surface modelling and geophysics simulations. However, compared to sparse SfM photogrammetry, dense MVS photogrammetry is much more computationally intensive, especially in terms of memory. Additionally, dense MVS photogrammetry typically requires sparse SfM photogrammetry, or at least the camera poses which are typically obtained from it, as a preprocessing step. Dense reconstructions, although more visually appealing than sparse reconstructions, are still not photorealistic since they cannot model the directional dependence of lighting in the scene.

2.2 Neural Radiance Fields and Urban 3D Reconstruction/View synthesis

In recent years, Neural Radiance Field (NeRF) methods (Mildenhall et al., 2021) have dominated the field of novel view synthesis. Trained on posed images of a scene, NeRF methods use a differentiable rendering process to learn an implicit (Barron et al., 2021, 2022) or hybrid (Müller et al., 2022) scene representation, typically as density and directional color fields parameterized by a Multi-Layer Perceptron (MLP). The scene representation is then rendered into 2D images using a differentiable volume rendering process, allowing for scene representation learning via pixel-by-pixel supervised learning using back-propagation of a photometric loss. Certain explicit scene representation models (Yu et al., 2021; Fridovich-Keil et al., 2022; Chen et al., 2022) use almost identical differentiable rendering pipelines, but store their scene representations explicitly, forgoing the use of decoding MLPs (although some of these methods allow for a shallow decoding MLP, blurring the line between explicit and hybrid scene representations).

To synthesize images, NeRF methods employ differentiable volume rendering, generating the pixel color $C$ via alpha blending of local colors $c_i$ using local densities $\sigma_i$ along a ray with sampling intervals $\delta_i$. This is given by

C = \sum_i c_i \alpha_i T_i    (1)

where $c_i$ and $\sigma_i$ are sampled from the learned radiance field (e.g., the NeRF MLP), and

\alpha_i = 1 - \exp(-\sigma_i \delta_i) \quad \text{and} \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j).    (2)
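For concreteness, a minimal NumPy sketch of the compositing in Eqs. (1)-(2) is given below; the sample colors, densities, and intervals are placeholder values, and an actual NeRF pipeline would obtain them by querying the learned field along each ray.

```python
import numpy as np

def composite_ray(colors, sigmas, deltas):
    """Alpha-composite per-sample colors along one ray following Eqs. (1)-(2).

    colors: (N, 3) sampled radiance values c_i
    sigmas: (N,)   sampled densities sigma_i
    deltas: (N,)   sampling interval lengths delta_i
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)         # alpha_i = 1 - exp(-sigma_i * delta_i)
    trans = np.cumprod(1.0 - alphas)                # running product of (1 - alpha_j)
    trans = np.concatenate([[1.0], trans[:-1]])     # T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1
    weights = alphas * trans
    return (weights[:, None] * colors).sum(axis=0)  # C = sum_i c_i * alpha_i * T_i

# toy example with three samples along a single ray
C = composite_ray(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
                  np.array([0.5, 1.0, 2.0]),
                  np.array([0.1, 0.1, 0.1]))
```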

Urban scenes, which are unbounded, full of transient objects (such as pedestrians and cars), and subject to changing lighting conditions, pose a challenge to the learning of 3D scene representations. Methods such as NeRF-W (Martin-Brualla et al., 2021), Mip-NeRF 360 (Barron et al., 2022), Block-NeRF (Tancik et al., 2022), and Urban Radiance Fields (Rematas et al., 2022) proposed solutions to some of these problems, and are suited for ground-level view synthesis and 3D urban reconstruction.

Aerial 3D reconstruction and view synthesis from remote sensing images has also been attempted with methods such as BungeeNeRF/CityNeRF (Xiangli et al., 2022), Mega-NeRF (Turki et al., 2022), Shadow NeRF (Derksen and Izzo, 2021), and Sat-NeRF (Marí et al., 2022). These methods attempt to solve problems such as piecing together local NeRFs into a large-scale urban scene, multi-scale city view synthesis, and shadow-aware scene reconstruction for high-rises. BungeeNeRF (Xiangli et al., 2022) is of particular interest, as we extract a Google Earth dataset of our region of study using a similar method.

2.3 3D Gaussian Splatting

3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) was first developed in 2023 as a view synthesis method competing against existing NeRF view synthesis methods. Compared with the vanilla NeRF method, the vanilla Gaussian Splatting method learns the 3D scene and synthesizes novel views orders of magnitude faster, and achieves a view synthesis quality comparable to, and often exceeding, that of the best NeRF models, at the cost of a much larger memory footprint and a required structure-from-motion (SfM) (Schonberger and Frahm, 2016) initialization/preprocessing step. The workflow is visualized in Figure 2.3.

[Figure: overall 3D Gaussian Splatting workflow]

The SfM preprocessing is exactly the standard sparse photogrammetric process, which identifies 2D key points, matches overlapping images, triangulates key points into 3D, and error-corrects through bundle adjustment or some other method. Compared to standard photogrammetry, which can at best project image colors onto the point cloud as flat (lighting-less) colors, 3DGS is able to photorealistically reproduce the directionally dependent lighting of the scene, which is crucial for many applications. It is also able to fine-tune the geometry of the scene using photometric (color-based) objectives against ground truth pictures, as opposed to only minimizing re-projection errors as in photogrammetry. Compared to NeRF models, 3DGS produces more natural 3D geometry, with a natural correspondence between the learned 3D positional means of the 3D Gaussian functions and a 3D point cloud representation of the scene geometry.

Representing the scene as 3D Gaussian functions, and representing lighting as spherical harmonic (SH) coefficients attached to these Gaussians, 3D Gaussian Splatting produces 2D images via a differentiable tile-based Gaussian rasterizer: Gaussians are projected into two dimensions according to the novel view's view frustum, and the projected Gaussians are alpha-blended to produce per-pixel colors in the novel view. The novel views are supervised against ground truth images to train the Gaussian Splatting parameters. To the best of our knowledge, this is the first work to attempt large-scale remote sensing-based 3D reconstruction and view synthesis using 3D Gaussian Splatting, although recent works (Kerbl et al., 2024; Zhou et al., 2024) have applied Gaussian Splatting to large-scale urban street-level datasets.

3 Method

3.1 Region of Study

The region of study is the Kitchener-Waterloo Region in Ontario, Canada, centered at the University of Waterloo. The city of Waterloo has a population of approximately 121,000 according to the 2021 census and occupies 64.06 km$^2$ of land (Statistics Canada, 2023). The University of Waterloo lies at 43.472°N, 80.550°W, and its main campus occupies 4.50 km$^2$. At the city scale, the study area comprises various land use and land cover features such as urban roads, buildings, agriculture and other land uses, low vegetation, water, mixed temperate forest, and other land covers. The study area is centered at the Environment-1 (EV-1) building located at 43.468°N, 80.542°W, and covers an area of roughly 165 km$^2$. We perform large-scale view synthesis at the city scale, and 3D point cloud comparison at the neighborhood scale. The Google Earth images retrieved for the scene are primarily from Landsat/Copernicus, Airbus, the Scripps Institution of Oceanography (SIO), and the National Oceanic and Atmospheric Administration (NOAA).

The University of Waterloo lies on the traditional land of Neutral, Anishinaabeg and Haudenosaunee peoples. The University of Waterloo is situated on the Haldimand Tract, the land promised to the Six Nations that includes six miles on each side of the Grand River.

3.2 Google Earth Studio Datasets

For the region of study, we used as camera paths seven concentric circles at different altitudes, radii, and tilt angles centered around the EV-1 building at the University of Waterloo, Waterloo, Ontario, Canada. The first of these circles has a radius of 500 m and an elevation of 475 m. The last circle has a radius of 7250 m and an elevation of 3690 m. All images point towards a target slightly above the EV-1 building (at an elevation of 390 m) at 43.468°N, 80.542°W. The final circle's images have a tilt angle of approximately 65.5° with respect to the horizontal, with some deviations (within $\sim$0.3°). We gathered 401 images using Google Earth Studio along the camera path defined by these circles. The region of study and camera poses, along with the sparse SfM results, can be seen in Figure 3.2. During preprocessing, we observed poor SfM point cloud reconstruction results further than 6 km away from the scene center, reasonable SfM reconstruction within 6 km, and good SfM reconstruction within 1 km, where individual buildings can be identified. The SfM preprocessing resulted in a sparse point cloud with 337,382 points, which were used to initialize the 3D Gaussian functions for 3DGS. This multi-scale Google Earth Studio (Alphabet Inc., 2015-2024) dataset was inspired by the BungeeNeRF dataset (Xiangli et al., 2022), which we also use for a multi-city large-scale view-synthesis benchmark.
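To illustrate how such an orbital path can be parameterized, the following sketch computes camera positions on one circle around a target; the function name, frame count, and conversion factors are illustrative assumptions of ours, not part of Google Earth Studio (where the orbits are defined as keyframed camera paths in the Studio interface).

```python
import numpy as np

def orbit_positions(center_lat, center_lon, radius_m, elevation_m, n_frames=60):
    """Illustrative camera positions on one orbital circle around a scene center.

    Offsets are computed in a local east-north frame and converted to approximate
    latitude/longitude with a rough metres-per-degree factor near the center.
    """
    angles = np.linspace(0.0, 2.0 * np.pi, n_frames, endpoint=False)
    east = radius_m * np.cos(angles)
    north = radius_m * np.sin(angles)
    lat = center_lat + north / 111_320.0
    lon = center_lon + east / (111_320.0 * np.cos(np.radians(center_lat)))
    alt = np.full(n_frames, elevation_m)
    return np.stack([lat, lon, alt], axis=1)   # each camera looks back at the scene center

# first orbit of the Waterloo scene: 500 m radius at 475 m elevation around EV-1
poses = orbit_positions(43.468, -80.542, radius_m=500.0, elevation_m=475.0)
```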

For the BungeeNeRF scenes, we used the Google Earth Studio camera paths specified by BungeeNeRF (Xiangli et al., 2022). The BungeeNeRF dataset consists of 10 scenes for 10 cities. Each scene is centered around a particular landmark, with camera paths defined by concentric circles of different orbit radii and elevations, with the scene coverage reaching city-wide at the highest elevation. Detailed information for the 10 BungeeNeRF scenes and the Waterloo scene can be found in Table 1. The New York scene centered at 56 Leonard and the San Francisco scene centered at Transamerica were used as the main scenes for the view reconstruction benchmark in BungeeNeRF (Xiangli et al., 2022), and have 459 and 455 images respectively. These two scenes were rendered at 30 frames per second as a 1:30 minute video. All other scenes contain 221 images, rendered with a frame limit of 220+1 along the fixed camera path, and were used for additional visualizations. We note that the original BungeeNeRF paper contained two additional scenes (Sydney and Seattle), but the Google Earth Studio camera paths were not provided for these two scenes.

Google Earth Studio provides a platform for generating multi-view aerial/satellite images by simply specifying camera poses and scene locations. Google Earth Studio produces composite images from various government and commercial sources, as well as rendered images from 3D models constructed using remote sensing images from these sources. These include Landsat/Copernicus, Airbus, NOAA, U.S. Navy, USGS, and Maxar images and datasets taken at different times. An obvious example of a composite image can be observed in the two bottom-right images in Figure 4.3, with different water colors indicating different data sources and/or acquisition times.

[Figure: region of study with camera poses and sparse SfM reconstruction results]
Table 1: Details of the 10 BungeeNeRF scenes and the Waterloo scene.

City          | Landmark          | Lowest (m) | Highest (m)
New York      | 56 Leonard        | 290        | 3,389
San Francisco | Transamerica      | 326        | 2,962
Chicago       | Pritzker Pavilion | 365        | 6,511
Quebec        | Château Frontenac | 166        | 3,390
Amsterdam     | New Church        | 95         | 2,3509
Barcelona     | Sagrada Familia   | 299        | 8,524
Rome          | Colosseum         | 130        | 8,225
Los Angeles   | Hollywood         | 660        | 12,642
Bilbao        | Guggenheim        | 163        | 7,260
Paris         | Pompidou          | 159        | 2,710
Waterloo      | EV-1              | 500        | 3,690

3.3 Structure from Motion Preprocessing and Sparse Point Cloud Extraction

The standard implementation of 3D Gaussian Splatting relies on COLMAP (Schonberger and Frahm, 2016) for preprocessing. This SfM preprocessing takes a collection of unordered images with unknown camera poses, and outputs the camera pose of each image, as well as a sparse point cloud. Like all SfM methods, COLMAP SfM consists of the following steps.

Feature Extraction: For each image $I_i$, key points $x_j \in R^2$ are identified and assigned robust, view-invariant local features $f_j \in R^n$. Scale Invariant Feature Transform (SIFT) features (Lowe, 1999) are used by default in COLMAP, and provide robust features which allow the same 3D point to be identified across multiple images as its respective projected 2D key points.

Matching: By searching through images and their respective features, potentially overlapping image pairs with matching key point features are identified.

Geometric Verification: A scene graph, with images as nodes and edges connecting overlapping images, is constructed by verifying potentially overlapping image pairs. This verification is done by estimating a valid homography for a potentially connected image pair using a robust estimation technique such as a variant of Random Sample Consensus (RANSAC) by Fischler and Bolles (1981).

Image Registration: Starting from an initial image pair whose key points are triangulated into 3D, new images with overlap according to the scene graph are added to the scene by solving the Perspective-n-Point problem (Fischler and Bolles, 1981), which estimates camera poses given a number of 3D points and their 2D projections. This step robustly estimates the pose of each newly registered image.

Triangulation: Given key points as viewed from two images with known poses, key points are triangulated (Hartley and Zisserman, 2003) into 3D. Newly registered images extend the scene by allowing for more key points to be triangulated into the 3D reconstruction.

Error Correction: To correct errors in registration and triangulation, bundle adjustment (Triggs et al., 2000) is performed by jointly optimizing the camera poses $P_c \in SE(3)$ and the 3D points $X_k \in R^3$, minimizing the reprojection loss $E$ given by the squared error between the reprojection $\pi_{P_c}(X_k)$ of each 3D point onto the image plane and the observed key point $x_j \in R^2$:

E = \sum_j \rho_j \left( \pi_{P_c}(X_k) - x_j \right)^2.    (3)

Schonberger and Frahm (2016) introduced various innovations improving the geometric verification, improving the robustness of the initialization and triangulation, introducing a next best view selection method and an iterative and more efficient bundle adjustment method, resulting in the COLMAP SfM library.
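As a practical note, this sparse preprocessing can be reproduced with the COLMAP command-line tools; the sketch below (invoked from Python, with placeholder paths and default options) runs feature extraction, exhaustive matching, and incremental mapping.

```python
import os
import subprocess

def run_colmap_sfm(image_dir: str, workspace: str) -> None:
    """Run COLMAP's sparse SfM pipeline: SIFT feature extraction, exhaustive
    matching, and incremental mapping (registration, triangulation, bundle
    adjustment). Paths are placeholders; all options are left at their defaults."""
    database = f"{workspace}/database.db"
    os.makedirs(f"{workspace}/sparse", exist_ok=True)
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", database,
                    "--image_path", image_dir], check=True)
    subprocess.run(["colmap", "exhaustive_matcher",
                    "--database_path", database], check=True)
    subprocess.run(["colmap", "mapper",
                    "--database_path", database,
                    "--image_path", image_dir,
                    "--output_path", f"{workspace}/sparse"], check=True)

run_colmap_sfm("images", "colmap_workspace")
```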

3.4 Multi-View-Stereo Dense 3D Reconstruction

The MVS dense reconstruction we used as the ground truth/reference geometry of the region of study was obtained with COLMAP's MVS algorithm (Schönberger et al., 2016), which is based on joint view selection and depth map estimation (Zheng et al., 2014). The method is summarized as follows.

Depth and normal map estimation: To estimate the depth $\theta_l \in R$ and normal $n_l \in R^3$ at a pixel $l$ of a reference image $X^{ref}$, a joint likelihood function is used. $\boldsymbol{X} = \{X^{ref}, X^1, \ldots, X^m, \ldots, X^M\}$ is the collection of all images (one reference image and $M$ source images). $\boldsymbol{Z} = \{Z_l^m \mid l = 1 \ldots L,\, m = 1 \ldots M\}$ is the set of occlusion indicators, with $Z_l^m = 1$ if image $X^m$ is selected for the depth estimation of pixel $l$ in $X^{ref}$, and zero if occluded. $\boldsymbol{\theta} = \{\theta_l \mid l = 1 \ldots L\}$ are the depths at each pixel $l$ of $X^{ref}$, which are to be recovered. $\boldsymbol{N} = \{n_l \mid l = 1 \ldots L\}$ are the normals of $X^{ref}$, also to be recovered. The joint likelihood is given by

P(\boldsymbol{X}, \boldsymbol{Z}, \boldsymbol{\theta}, \boldsymbol{N}) = \prod_l \prod_m \left[ P(Z^m_{l,t} \mid Z^m_{l-1,t}, Z^m_{l,t-1})\, P(X^m_l \mid Z^m_l, \theta_l, n_l)\, P(\theta_l, n_l \mid \theta^m_l, n^m_l) \right]    (4)

where $m$ indexes over source images, $l$ indexes pixels or patches in the reference image $X^{ref}$, and $t$ denotes the optimization iteration. The first term $P(Z^m_{l,t} \mid Z^m_{l-1,t}, Z^m_{l,t-1})$ enforces spatially smooth and temporally (in terms of optimization steps) consistent occlusion maps. The second term $P(X^m_l \mid Z^m_l, \theta_l, n_l)$ enforces photometric consistency between the reference image and the source images. The third term $P(\theta_l, n_l \mid \theta^m_l, n^m_l)$ enforces depth and normal maps consistent with multi-view geometry. Readers are referred to Schönberger et al. (2016) for the construction of each term of this joint likelihood function and its optimization.

Filtering and fusion: First, depth and normal maps for each image are estimated as in the previous step. Photometric and geometric constraints are then used to filter outliers: an observation $x_l$ is discarded if its support set $S_l = \{x^m_l\}$, the set of source-image observations satisfying both the geometric and photometric constraints, has fewer than 3 elements (i.e., the reference pixel is kept only if it can be observed, while satisfying both constraints, in at least 3 other images). A directed graph of consistent pixels is defined with the supported pixels as nodes and edges pointing from reference to source images. The fusion is initialized at the node with maximum support (observed by the most source images while satisfying the photometric and geometric constraints). Recursively, connected nodes are collected under a depth consistency constraint, a normal consistency constraint, and a reprojection error bound constraint. The collection's elements are fused when there are no more nodes satisfying all three constraints. The fused point becomes part of the output dense point cloud, with its location $p_j$ and normal $n_j$ averaged over the collection's elements. The fused nodes are culled from the graph and the process is repeated until the graph is empty. The final output is a dense point cloud with normals, which can be meshed via Poisson Surface Reconstruction (Kazhdan and Hoppe, 2013), as we have done, or using other methods if desired.
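In practice, this dense stage can also be run with the COLMAP command-line tools on top of an existing sparse reconstruction; a sketch with placeholder paths and default options is given below (undistortion, PatchMatch stereo for depth/normal maps, and fusion into a dense point cloud).

```python
import subprocess

def run_colmap_mvs(image_dir: str, sparse_model: str, dense_workspace: str) -> None:
    """Run COLMAP's dense MVS stage: image undistortion, PatchMatch stereo
    (depth and normal map estimation), and depth-map fusion into a dense
    point cloud with normals."""
    subprocess.run(["colmap", "image_undistorter",
                    "--image_path", image_dir,
                    "--input_path", sparse_model,
                    "--output_path", dense_workspace], check=True)
    subprocess.run(["colmap", "patch_match_stereo",
                    "--workspace_path", dense_workspace], check=True)
    subprocess.run(["colmap", "stereo_fusion",
                    "--workspace_path", dense_workspace,
                    "--output_path", f"{dense_workspace}/fused.ply"], check=True)

run_colmap_mvs("images", "colmap_workspace/sparse/0", "dense_workspace")
```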

3.5 3D Gaussian Splatting

3D Gaussian Splatting (Kerbl et al., 2023), which we briefly describe in this subsection, is used as the foundation for our 3D urban reconstruction and view-synthesis experiments in the region of study and in the benchmarks.

From 2D images of a scene, 3D Gaussian Splatting learns and represents the scene geometry as (unnormalized) 3D Gaussian functions with mean $\mu \in R^3$ and $3 \times 3$ covariance matrix $\Sigma$, given by

G(x) = e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}.    (7)

The scene lighting and color are learned as third-order spherical harmonic coefficients for each color channel, attached to each Gaussian. Each Gaussian is also assigned a local (conic) opacity $\sigma$; combined with the 3D mean and covariance matrix, this results in a total of 59 trainable parameters per Gaussian. The 3D covariance matrix $\Sigma$ is learned as a 3D diagonal scaling matrix $S$ and a rotation represented by a quaternion $(r, i, j, k)$, from which a 3D rotation matrix $R$ is reconstructed as follows:

R = \begin{bmatrix} 1 - 2(j^2 + k^2) & 2(ij - kr) & 2(ik + jr) \\ 2(ij + kr) & 1 - 2(i^2 + k^2) & 2(jk - ir) \\ 2(ik - jr) & 2(jk + ir) & 1 - 2(i^2 + j^2) \end{bmatrix}.    (8)

The 3D covariance matrix is then given by

\Sigma = R S S^T R^T.    (9)
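A minimal NumPy sketch of Eqs. (8)-(9), building one Gaussian's covariance from its learned quaternion and scales, is given below; the function name and sample values are ours, for illustration only.

```python
import numpy as np

def covariance_from_quat_scale(q, s):
    """Build a Gaussian's 3D covariance from its rotation quaternion q = (r, i, j, k)
    and its diagonal scales s, following Eqs. (8)-(9)."""
    r, i, j, k = q / np.linalg.norm(q)   # normalize the quaternion before use
    R = np.array([[1 - 2 * (j * j + k * k), 2 * (i * j - k * r), 2 * (i * k + j * r)],
                  [2 * (i * j + k * r), 1 - 2 * (i * i + k * k), 2 * (j * k - i * r)],
                  [2 * (i * k - j * r), 2 * (j * k + i * r), 1 - 2 * (i * i + j * j)]])
    S = np.diag(s)
    return R @ S @ S.T @ R.T             # Sigma = R S S^T R^T, Eq. (9)

# identity rotation with anisotropic scales as a toy example
Sigma = covariance_from_quat_scale(np.array([1.0, 0.0, 0.0, 0.0]),
                                   np.array([0.5, 0.2, 0.1]))
```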

A sparse initial point cloud and training image camera poses are first computed using a structure from motion library such as COLMAP (Schonberger and Frahm, 2016). A Gaussian is initialized at each point in the sparse point cloud, and trained using the differentiable tile-based rasterizer.

3.5.1 Rasterization

The tile-based rasterizer divides the image into 16 × 16 pixel tiles. For each tile, a view frustum is projected into the 3D scene. 3D Gaussians are accumulated/assigned per tile according to their overlap with the view frustum, and are projected into 2D along with their covariance matrices $\Sigma$. Starting in homogeneous coordinates, this is given by

\Sigma' = J W \Sigma W^T J^T    (10)

where $W$ is the view transformation and $J$ is the affine approximation of the projective transformation. The projective transformation is a matrix multiplication in homogeneous coordinates in the case of a linear camera model such as the pinhole model used with the standard 3DGS model, in which case $J$ is simply obtained from the intrinsic camera matrix. The third component of $\mu'$, and the third row and column of $\Sigma'$, are then dropped to obtain the 2D mean and the 2D covariance matrix in the imaging plane in Cartesian coordinates.
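The following sketch illustrates Eq. (10) for a pinhole camera; it assumes the covariance has already been expressed in camera coordinates (so the view transformation $W$ is folded in), and keeps only the 2 × 2 image-plane block. Names and values are illustrative.

```python
import numpy as np

def project_covariance_2d(Sigma_cam, mean_cam, fx, fy):
    """Project a 3D covariance (already in camera coordinates) into the image plane,
    Sigma' = J Sigma J^T, with J the Jacobian (affine approximation) of the pinhole
    projection evaluated at the Gaussian's camera-space mean (x, y, z)."""
    x, y, z = mean_cam
    J = np.array([[fx / z, 0.0, -fx * x / z ** 2],
                  [0.0, fy / z, -fy * y / z ** 2]])
    return J @ Sigma_cam @ J.T   # 2x2 covariance of the splatted 2D Gaussian

Sigma_2d = project_covariance_2d(np.eye(3) * 0.01,
                                 mean_cam=np.array([0.5, -0.2, 4.0]),
                                 fx=1000.0, fy=1000.0)
```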

Gaussians are then sorted by tile and by depth. For each pixel in a tile, the pixel's color is generated via alpha blending, accumulating direction-dependent colors evaluated from the learned SH coefficients. For each Gaussian being blended, the $\alpha_i$ at pixel location $x$ is obtained by evaluating the associated 2D Gaussian scaled by its learned opacity $a_i$:

\alpha_i(x) = a_i G_{2D}(x)    (11)

where $G_{2D}(x)$ is the Gaussian of (7) projected into 2D and onto the image plane via (10).

The rasterizer generates an image which is compared to the ground truth image using a photometric $L_1$ loss and $L_{D\text{-}SSIM}$, a Structural Dissimilarity (D-SSIM) loss based on the Structural Similarity Index Measure (Wang et al., 2004), via

L = (1 - \lambda) L_1 + \lambda L_{D\text{-}SSIM}    (12)

with $\lambda$ being an adjustable weighting parameter defaulting to 0.2. The trainable parameters are back-propagated through the differentiable rasterization and optimized using Adam (Kingma and Ba, 2014).
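A hedged PyTorch sketch of the objective in Eq. (12) is given below; `ssim_fn` stands in for an external SSIM implementation (such as the one bundled with the reference 3DGS code or the pytorch-msssim package) and is not defined here.

```python
import torch

def photometric_loss(rendered, target, lam=0.2, ssim_fn=None):
    """Compute L = (1 - lambda) * L1 + lambda * L_D-SSIM, with L_D-SSIM = 1 - SSIM.
    `rendered` and `target` are image tensors; `ssim_fn` is an assumed external
    SSIM implementation returning a scalar similarity in [0, 1]."""
    l1 = torch.abs(rendered - target).mean()
    d_ssim = (1.0 - ssim_fn(rendered, target)) if ssim_fn is not None else 0.0
    return (1.0 - lam) * l1 + lam * d_ssim
```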

3.5.2 Densification and Pruning

3D Gaussian Splatting also densifies/grows new Gaussians in regions with a high view-space positional gradient (above the threshold $\tau_{pos} = 2.0 \times 10^{-4}$ by default). These regions correspond to neighborhoods with missing geometric features and regions where a few Gaussians cover large areas of the scene. Low-variance Gaussians with high view-space positional gradients are duplicated, whereas high-variance Gaussians are split into two, with their standard deviations divided by a factor of 1.6. This is illustrated in Figure 3.5.2.

Unimportant Gaussians are also pruned. Gaussians that are essentially transparent, with opacity less than a user-defined threshold ($a < \epsilon_a$, default value $5 \times 10^{-3}$), are deleted. Every 3000 iterations (or some other number of the user's choosing), every Gaussian's opacity is set to zero, then allowed to be re-optimized, and Gaussians are culled where needed. This process controls the number of floater artifacts and helps control the total number of Gaussians. We believe that this densification and density control process can allow for point cloud reconstruction of similar density, and potentially similar quality, compared to a dense reconstruction, given a good dataset.
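The selection logic can be summarized with the schematic PyTorch sketch below; the thresholds mirror the defaults quoted above, while `scale_threshold` is an illustrative stand-in for the scene-extent-based size criterion used by the reference implementation.

```python
import torch

def densification_masks(grads, scales, opacities,
                        tau_pos=2e-4, eps_a=5e-3, scale_threshold=0.01):
    """Flag Gaussians for cloning, splitting, or pruning.

    grads:     (N, 2) accumulated view-space positional gradients
    scales:    (N, 3) per-axis standard deviations of each Gaussian
    opacities: (N, 1) learned opacities
    """
    over_grad = grads.norm(dim=-1) > tau_pos            # under-reconstructed regions
    is_large = scales.max(dim=-1).values > scale_threshold
    clone_mask = over_grad & ~is_large                  # duplicate small Gaussians
    split_mask = over_grad & is_large                   # split large Gaussians (scales / 1.6)
    prune_mask = opacities.squeeze(-1) < eps_a          # remove near-transparent Gaussians
    return clone_mask, split_mask, prune_mask
```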

[Figure: densification of 3D Gaussians by cloning small Gaussians and splitting large ones]

3.6 Evaluation Metrics

For the quality of synthesized images, we use Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index Measure (SSIM) (Wang et al., 2004), and the Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) as full-reference image assessment metrics comparing generated views to ground truth views. PSNR is a good indicator of the presence of noise and visual artifacts, whereas SSIM and LPIPS have been shown to better correlate with human judgement of the visual similarity of an image to its reference.

For point cloud geometry assessment, we used point-to-point (D1) mean squared error (MSE), point-to-surface (D2) MSE, the Hausdorff distance, and the Chamfer distance, all of which compare a lower-quality point cloud to its reference point cloud. We note that metrics such as D1 and D2 MSE do not penalize differences in point density, only deviations of existing points from ground truth/reference points. On the other hand, the Chamfer and Hausdorff distances better capture the difference between the distributions of points, including differences in point density.
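For reference, a generic sketch of the two distribution-level metrics is given below, using nearest-neighbour queries in both directions; this is a standard symmetric formulation, not necessarily the exact tool used to produce the reported numbers.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_hausdorff(points_a, points_b):
    """Symmetric Chamfer and Hausdorff distances between two (N, 3) point clouds."""
    d_ab, _ = cKDTree(points_b).query(points_a)   # each point in A to its nearest in B
    d_ba, _ = cKDTree(points_a).query(points_b)   # each point in B to its nearest in A
    chamfer = 0.5 * (d_ab.mean() + d_ba.mean())
    hausdorff = max(d_ab.max(), d_ba.max())
    return chamfer, hausdorff
```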

4 Experiments and Results

4.1 Experiment Setup

Both the COLMAP preprocessing and the 3D Gaussian Splatting optimization were performed on an RTX 3080 GPU with 10 GB of VRAM and an i9-10900KF CPU, with PyTorch version 2.1.1 and CUDA toolkit version 12.1. We note that the GPU VRAM limitation is especially relevant, as it is always possible to grow more and more Gaussians to achieve higher and higher visual reconstruction quality at the cost of memory and storage when using 3D Gaussian Splatting.

[Figure: view-synthesis results for the region of study and the extracted 3DGS point cloud]

4.2 3D Novel View Synthesis of the Region of Study

For the region of study, we used COLMAP SfM (Schonberger and Frahm, 2016) preprocessing and extracted 3D points and camera poses from the 400 two-dimensional images. The experiments were performed with a Mip-NeRF 360 (Barron et al., 2022) style training/validation split: one in eight images ($\sim$12.5%) were reserved for testing purposes. The 1920 by 1080 resolution images were downsampled by a factor of 4 during training due to GPU memory constraints. We started densification at the 1000th iteration and trained for 50,000 iterations, densifying every 100 iterations. We used an initial positional learning rate of $3.2 \times 10^{-5}$ and a scale learning rate of $2 \times 10^{-3}$. The other training hyperparameters were kept at their defaults.
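For reproducibility, an illustrative training invocation is shown below, assuming the reference graphdeco-inria gaussian-splatting train.py script; the flag names may differ between versions and the paths are placeholders, but the values mirror the settings described above.

```python
import subprocess

# Illustrative call to the reference 3DGS training script (assumed interface):
# 1/4 resolution, 50k iterations, densification every 100 iterations starting at
# iteration 1000, positional lr 3.2e-5, scaling lr 2e-3.
subprocess.run(["python", "train.py",
                "--source_path", "waterloo_scene",        # COLMAP workspace of the scene
                "--model_path", "output/waterloo",
                "--resolution", "4",
                "--iterations", "50000",
                "--densify_from_iter", "1000",
                "--densification_interval", "100",
                "--position_lr_init", "3.2e-5",
                "--scaling_lr", "2e-3"], check=True)
```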

The results are shown in Table 2 and Figure 4.1, in conjunction with the further view-synthesis experiments on the BungeeNeRF dataset. We achieve high view-synthesis visual quality on both the training and test sets. From visual inspection, the rendered images are nearly indistinguishable from the ground truth images. This is also supported by the visual assessment metrics, with SSIM scores near 1 and LPIPS scores near 0, indicating almost perfect visual agreement between ground truth and generated images. The PSNR values of around 30 dB are also indicative of good image quality and a low level of noise. This is comparable to the PSNR of a compressed image with respect to its full-sized original under a good lossy compression algorithm (Netravali, 2013), which is impressive considering the 3DGS model was trained at 1/4 resolution.

4.3 3D Novel View Synthesis of Bungee-NeRF Scenes

For the BungeeNeRF scenes, the experiments were performed with the Mip-NeRF 360 style training and validation split described previously. The experimental settings were kept the same as for the Waterloo scene, except that we reduced the total number of training iterations to 30,000. BungeeNeRF provided detailed benchmarks for the New York and San Francisco scenes, which were used for their main view-synthesis experiments. We performed detailed comparisons of Gaussian Splatting against BungeeNeRF, vanilla NeRF, and Mip-NeRF for these two scenes. Additionally, we trained and evaluated Gaussian Splatting models for the remaining eight scenes whose camera paths were provided by BungeeNeRF.

As observed in Table 3, across both the New York and San Francisco scenes, we see a large increase in view-synthesis quality according to all three metrics. The visual quality improvement from BungeeNeRF to 3DGS is much larger than the improvement from vanilla NeRF, or any other benchmarked model, to BungeeNeRF. We also note that this large increase in view-synthesis quality does not come at the cost of training time. In fact, the training of Gaussian Splatting models is three to four orders of magnitude faster than that of implicit NeRF models such as vanilla NeRF (Mildenhall et al., 2021), Mip-NeRF (Barron et al., 2021), and BungeeNeRF (Xiangli et al., 2022). Gaussian Splatting models achieve higher view-synthesis quality with faster training and rendering times at the cost of memory and storage requirements (Gao et al., 2022).

[Figure: qualitative comparison between ground truth Google Earth Studio images and 3DGS rendered views for the BungeeNeRF scenes]

As observed in the qualitative comparison in Figure 4.3, the images rendered using 3DGS are of high visual quality and difficult to distinguish visually from the ground truth images, except for a Google Earth watermark noticeable at the bottom right of the ground truth images. Compared with the ground truth images, we observe that the rendered images have slightly blurrier edges at the smallest scale ($\sim$300 m altitude) and that certain street-level details are slightly less sharp at the largest scale ($\sim$3000 m altitude). Also noticeable in both the ground truth Google Earth images and the rendered images is the piecing together of multiple data sources at the largest scale, in the bottom row of Figure 4.3. We notice a visibly discontinuous and grid-like change in the coloration of the water going from the San Francisco shoreline to the San Francisco Bay and Golden Gate area, likely indicating areas where different aerial and/or satellite images were stitched together. This effect was also learned by the 3DGS model, as is visible in the corresponding rendered images.

Table 2: View-synthesis quality of 3DGS on the Waterloo scene and the BungeeNeRF scenes.

Dataset       | Train PSNR↑ | Test PSNR↑ | Test SSIM↑ | Test LPIPS↓
Waterloo      | 32.3        | 30.5       | 0.953      | 0.0535
New York      | 31.5        | 30.7       | 0.964      | 0.0500
San Francisco | 30.8        | 29.9       | 0.952      | 0.0669
Amsterdam     | 32.3        | 29.7       | 0.948      | 0.0535
Barcelona     | 31.2        | 28.1       | 0.937      | 0.0659
Chicago       | 32.3        | 30.0       | 0.959      | 0.0460
Los Angeles   | 32.0        | 28.6       | 0.914      | 0.0937
Paris         | 31.6        | 28.5       | 0.953      | 0.0509
Rome          | 32.7        | 27.0       | 0.861      | 0.1127
Quebec        | 32.9        | 30.1       | 0.953      | 0.0603
Bilbao        | 32.1        | 27.2       | 0.851      | 0.1415

In addition, we also tested the performance of 3DGS on the other BungeeNeRF Google Earth Studio scenes. We note a 0.7 to 5.7 PSNR drop when moving from the training set to the test set across the scenes in Table 2, indicating a certain degree of overfitting to the training views. We notice the overfitting is more severe on the $\sim$200-image scenes than on the $\sim$450-image New York and San Francisco scenes, with our 400-image Waterloo scene lying in the middle. The Bilbao scene, centered on the Guggenheim museum, has by far the worst-performing 3DGS reconstruction. We observe that this is perhaps due to a combination of the complex building shape of the Guggenheim museum, a lack of sufficient training views at low altitude, and a poorer-quality Google Earth 3D model at off-nadir view angles at low altitude, which resulted in poor-quality training images.

Table 3: View-synthesis comparison against NeRF-based baselines on the New York (56 Leonard) and San Francisco (Transamerica) scenes.

Method                                                      | New York PSNR↑ | LPIPS↓ | SSIM↑ | San Francisco PSNR↑ | LPIPS↓ | SSIM↑
NeRF (D=8, Skip=4) (Mildenhall et al., 2021)                | 21.7 | 0.320 | 0.636 | 22.6 | 0.318 | 0.690
NeRF w/ WPE (D=8, Skip=4) (Mildenhall et al., 2021)         | 21.6 | 0.365 | 0.633 | 22.4 | 0.331 | 0.680
Mip-NeRF-small (D=8, Skip=4) (Barron et al., 2021)          | 22.0 | 0.344 | 0.648 | 22.7 | 0.327 | 0.687
Mip-NeRF-large (D=10, Skip=4) (Barron et al., 2021)         | 22.2 | 0.318 | 0.666 | 22.5 | 0.330 | 0.686
Mip-NeRF-full (D=10, Skip=4,6,8) (Barron et al., 2021)      | 22.3 | 0.266 | 0.689 | 22.8 | 0.314 | 0.699
BungeeNeRF (same iter. as baselines) (Xiangli et al., 2022) | 23.5 | 0.235 | 0.739 | 23.6 | 0.265 | 0.749
BungeeNeRF (until convergence) (Xiangli et al., 2022)       | 24.5 | 0.160 | 0.815 | 24.4 | 0.192 | 0.801
3DGS                                                        | 30.7 | 0.050 | 0.964 | 29.9 | 0.067 | 0.952

4.4 3D Reconstruction of the Region of Study

For the 3D reconstruction experiments, due to the computational constraints of multi-view-stereo (MVS) densification, which is even more memory intensive than 3DGS, we extracted the sparse point cloud using the first 50 images along the first-level camera path. Then, using the method of Schönberger et al. (2016), we generated depth and normal maps, which we used to produce a dense point cloud serving as the ground truth/reference 3D geometry of the EV-1 neighborhood. We then trained a 3D Gaussian Splatting model on these first 50 images, after which the Gaussian positional means were extracted as a new, 3DGS densified point cloud of 1,856,968 points, starting from a sparse point cloud of 24,740 points, roughly a 75-fold increase in point count. The positional means extracted as the 3DGS densified point cloud are visualized in Figure 4.1 (rasterized at a Gaussian scale of $10^{-3}$). In comparison, the MVS densification resulted in 2,528,969 points. The MVS densification results are visualized in Figure 4.4.
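Extracting the densified point cloud from a trained model amounts to reading the Gaussians' positional means; a small sketch is given below, assuming the standard 3DGS point_cloud.ply export layout and an illustrative output path.

```python
import numpy as np
from plyfile import PlyData

def gaussian_means_to_points(ply_path):
    """Read the trained Gaussians' positional means (x, y, z) from the exported
    point_cloud.ply and return them as an (N, 3) array, i.e. the 3DGS densified
    point cloud used in the geometry comparison."""
    vertices = PlyData.read(ply_path)["vertex"]
    return np.stack([vertices["x"], vertices["y"], vertices["z"]], axis=1)

points = gaussian_means_to_points(
    "output/waterloo/point_cloud/iteration_50000/point_cloud.ply")
```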

[Figure: sparse, 3DGS densified, and MVS densified point clouds of the EV-1 neighborhood, with local Hausdorff distance visualizations]

We first note that the initial sparse point cloud and the 3DGS densified point cloud are aligned with each other. However, the MVS densified point cloud, which we take as the ground truth/reference point cloud, was offset from both by a rotation, a translation, and non-affine deformations far from the origin. This is visible in the last row of Figure 4.4, and becomes even more obvious as the view is extended further. As such, we cropped all three point clouds and performed point cloud registration to align them. We used the iterative closest point (ICP) algorithm (Besl and McKay, 1992) to register both the initial and the 3DGS densified point clouds to the MVS densified point cloud. This process aligned all three point clouds up to translations and rotations, but we still observe slight non-affine deformations as the distance from the origin and the height increase post-registration, as can be seen in Figure 4.4.
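A minimal Open3D sketch of this registration step is shown below; the file names are placeholders and the correspondence distance depends on the scene's arbitrary SfM scale, so both are assumptions.

```python
import open3d as o3d

def icp_align(source_path, target_path, max_corr_dist=5.0):
    """Rigidly register a source point cloud (sparse or 3DGS densified) to the
    reference MVS densified point cloud using point-to-point ICP."""
    source = o3d.io.read_point_cloud(source_path)
    target = o3d.io.read_point_cloud(target_path)
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
    source.transform(result.transformation)   # apply the estimated rigid transform
    return source

aligned = icp_align("gs_densified.ply", "mvs_dense.ply")
```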

The cropping resulted in 12,773, 244,849, and 1,270,820 points for the sparse, 3DGS densified, and MVS densified point clouds respectively. We noticed that both the initial sparse point cloud and the MVS densified point cloud were much denser at the center of the scene than at the edges, whereas the 3DGS densified point cloud had a proportionately more uniform point density than the two aforementioned point clouds. As such, the cropping reduced the number of points in the sparse and MVS densified point clouds by roughly a factor of $\sim$2, whereas it reduced the number of points in the 3DGS densified point cloud by a factor of $\sim$7.5.

We then compared both point clouds to the dense MVS point cloud fused from depth and normal maps, using D1 (point-to-point) MSE, D2 (point-to-surface) MSE, the Hausdorff distance, and the Chamfer distance. We observe that the 3DGS densified point cloud has marginally higher D1 and D2 MSE (with respect to the MVS densified point cloud) than the sparse initial point cloud. However, neither MSE metric penalizes differences in point density; they only measure the presence of outliers and noise points. On the other hand, the Hausdorff and Chamfer distances better reflect differences between the distributions of points, and in terms of these two metrics the 3DGS densified point cloud agrees much better with the reference MVS densified point cloud than the sparse point cloud does. This is also corroborated by visual inspection of Figure 4.4. We additionally plotted the local Hausdorff distance with respect to the reference MVS densified point cloud in Figure 4.4, which helped highlight the non-affine distortion between the reference MVS densified point cloud and the two others.

                          Sparse         3DGS Densified
Points                    24740          1856968
Points post-cropping      12773          244849
D1 MSE ↓                  7.625×10⁻³     8.154×10⁻³
D2 MSE ↓                  6.879×10⁻³     7.297×10⁻³
Hausdorff distance ↓      8.753×10⁻¹     3.745×10⁻¹
Chamfer distance ↓        2.546×10⁻²     1.615×10⁻²
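To make the comparison concrete, the following is a minimal sketch of these point-cloud metrics, assuming the aligned clouds are available as (N, 3) NumPy arrays. D1 MSE is taken here as the mean squared nearest-neighbour distance to the reference; D2 (point-to-surface) additionally requires reference normals and is omitted. Chamfer and Hausdorff conventions vary (symmetric mean versus maximum of the two directed distances), so the exact scaling may differ from the values reported in the table.

import numpy as np
from scipy.spatial import cKDTree

def directed_nn_distances(src, dst):
    """Distance from every point in src to its nearest neighbour in dst."""
    return cKDTree(dst).query(src, k=1)[0]

def compare_to_reference(cloud, reference):
    d_cr = directed_nn_distances(cloud, reference)
    d_rc = directed_nn_distances(reference, cloud)
    d1_mse = np.mean(d_cr ** 2)                   # point-to-point MSE (one direction)
    chamfer = 0.5 * (d_cr.mean() + d_rc.mean())   # symmetric Chamfer distance
    hausdorff = max(d_cr.max(), d_rc.max())       # symmetric Hausdorff distance
    return d1_mse, chamfer, hausdorff

The per-point distances d_cr can also serve as a local Hausdorff/error map, coloured onto the cloud to visualize where the non-affine distortion concentrates, in the spirit of the visualization in Figure 4.4.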

5 Discussions

We note that Google Earth Studio produces composite images and images rendered from 3D models, constructed from remote sensing images from a variety of governmental and commercial sources, including Landsat, Copernicus, Airbus, NOAA, the U.S. Navy, USGS, and Maxar, taken at different times. This can be both an advantage and a disadvantage. Low-altitude images with far-from-vertical (off-nadir) points of view rely on Google Earth Engine's own 3D models, which are limited in detail compared to real remote sensing images. On the other hand, the variety of data sources benefits the robustness of the 3D Gaussian Splatting model, which is trained on images from different sensors under different photometric and radiometric conditions. The disadvantages are also counterbalanced by the ease with which Google Earth Studio allows the creation of a multi-scale dataset with a spiraling camera path, suited for a large-scale 3D scene centered around a neighborhood of interest in a city.
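To illustrate what such a camera path looks like, the following is a small, hypothetical sketch of a descending spiral of camera positions centered on a neighborhood of interest; the center coordinates, radii, altitudes, and frame count are placeholder values, and the resulting keyframes would still need to be entered into or imported by Google Earth Studio.

import math

def spiral_keyframes(center_lat, center_lon, n_frames=180, turns=5,
                     start_alt_m=2000.0, end_alt_m=300.0, radius_deg=0.01):
    """Yield (lat, lon, altitude_m) camera positions along a descending, tightening spiral."""
    for i in range(n_frames):
        t = i / (n_frames - 1)
        angle = 2.0 * math.pi * turns * t
        r = radius_deg * (1.0 - 0.5 * t)     # tighten the orbit as the camera descends
        alt = start_alt_m + t * (end_alt_m - start_alt_m)
        yield (center_lat + r * math.cos(angle),
               center_lon + r * math.sin(angle),
               alt)

# Example: a few keyframes around the University of Waterloo campus (approximate coordinates).
for lat, lon, alt in spiral_keyframes(43.4723, -80.5449, n_frames=6):
    print(f"lat={lat:.5f}, lon={lon:.5f}, alt={alt:.0f} m")

Because such a path keeps looking back at the same center at progressively lower altitudes, the scene center is observed at many scales and viewing angles, which is what makes it well suited to large-scale reconstruction centered on a single neighborhood.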

When recovering 3D geometry (as a 3D point cloud) from the SfM preprocessing, the 3DGS densified point cloud, and even the MVS densified dense point cloud, we notice a mild to strong presence of noise, which should be addressed in future 3DGS research. In our 3D reconstruction and densification experiments, we used the MVS densified point cloud as ground truth, even though it was also constructed from 2D images. Despite recovering good-quality dense 3D surfaces, the MVS densified point cloud was offset from both the initial sparse point cloud and the 3DGS densified point cloud by a non-affine transformation, which should be investigated further. In the future, for geometry-recovery benchmarks, we believe a scanned point cloud, such as one from a LiDAR source, would be more accurate as ground truth. A future project could use a scanned point cloud as ground truth and register and georeference both the MVS and 3DGS densified point clouds to it, in order to properly study the geometry of these densifications and to enable further mapping and GIS applications. We also note that the memory requirement of the COLMAP MVS densification was larger than that of the 3DGS densification, which is one reason we performed the densification experiment at a smaller scale with fewer images. Despite these concerns, and although 3DGS was not built as a 3D geometry extraction tool, it is reasonably able to recover scene geometry through densification and optimization of the Gaussian positions.

The high GPU memory requirement of 3D Gaussian Splatting prevents high-resolution reconstruction across the entire large-scale scene. Due to the chosen camera path, the center of the scene is well reconstructed at all altitudes and densely populated by Gaussians, which results in high-quality rendered images. However, in other neighborhoods further away from the scene center, we are only able to achieve high-quality reconstruction at high altitude and struggle near the ground. Although certain advances, currently in preprints, have attempted to address the memory issue, many of these models compress the trained Gaussian Splatting model post-training, reducing model storage requirements, but do not achieve a significant reduction in working memory during training.

We expect the reduction of working memory requirements to be a future research direction. This would also allow for better reconstruction across multiple neighborhoods, perhaps using more complex camera paths in Google Earth Studio, such as multiple spirals arranged hierarchically and centered around each neighborhood of interest, or space-filling curves providing dense camera coverage across the entire large-scale scene. Alternatively, large-scale 3D reconstruction schemes that piece together multiple local models, such as Mega-NeRF (Turki et al., 2022), could be considered. Another future research direction is remote sensing-based, large-scale, semantics-aware 3D reconstruction and semantic synthesis. For urban scenes, this research area is expected to find applications in urban digital twin creation, urban monitoring, and urban/land-use planning. It can also extend land-use/land-cover segmentation to three dimensions, which has a multitude of research and commercial applications. These are the research areas we are currently investigating.

6 Conclusion

By simply leveraging Google Earth imagery, we captured an aerial off-nadir dataset of the region of study, photorealistically rendered the scene, and recovered its geometry. We compared 3DGS with NeRF methods on a large-scale urban reconstruction dataset across 10 cities, and performed a careful study of the 3D point cloud densification capability of 3DGS, comparing and visualizing the densification against Multi-View-Stereo dense reconstruction in our region of study. Between the Multi-View-Stereo densified point cloud and the 3DGS densified point cloud, we find both a rigid misalignment (rotation and translation), which we remove with point cloud registration, and a residual non-linear deformation, which we quantify and visualize. We hope our study and experiments help future research in large-scale remote sensing-based 3D Gaussian Splatting for both view synthesis and geometry retrieval.

\printcredits

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

References

  • Alphabet Inc., 2015-2024. Google Earth Studio. URL: https://www.google.com/earth/studio/.
  • Badue, C., Guidolini, R., Carneiro, R.V., Azevedo, P., Cardoso, V.B., Forechi, A., Jesus, L., Berriel, R., Paixao, T.M., Mutz, F., et al., 2021. Self-driving cars: A survey. Expert Systems with Applications 165, 113816.
  • Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P., 2021. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5855–5864.
  • Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P., 2022. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5470–5479.
  • Besl, P.J., McKay, N.D., 1992. Method for registration of 3-D shapes, in: Sensor Fusion IV: Control Paradigms and Data Structures, SPIE. pp. 586–606.
  • Biljecki, F., Stoter, J., Ledoux, H., Zlatanova, S., Çöltekin, A., 2015. Applications of 3D city models: State of the art review. ISPRS International Journal of Geo-Information 4, 2842–2889.
  • Carozza, L., Tingdahl, D., Bosché, F., Van Gool, L., 2014. Markerless vision-based augmented reality for urban planning. Computer-Aided Civil and Infrastructure Engineering 29, 2–17.
  • Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H., 2022. TensoRF: Tensorial radiance fields, in: European Conference on Computer Vision, Springer. pp. 333–350.
  • Derksen, D., Izzo, D., 2021. Shadow neural radiance fields for multi-view satellite photogrammetry, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1152–1161.
  • Fischler, M.A., Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–395.
  • Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A., 2022. Plenoxels: Radiance fields without neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5501–5510.
  • Gao, K., Gao, Y., He, H., Lu, D., Xu, L., Li, J., 2022. NeRF: Neural radiance field in 3D vision, a comprehensive review. arXiv preprint arXiv:2210.00379.
  • Hartley, R., Zisserman, A., 2003. Multiple View Geometry in Computer Vision. Cambridge University Press.
  • Jensen, R., Dahl, A., Vogiatzis, G., Tola, E., Aanæs, H., 2014. Large scale multi-view stereopsis evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 406–413.
  • Kazhdan, M., Hoppe, H., 2013. Screened Poisson surface reconstruction. ACM Transactions on Graphics (ToG) 32, 1–13.
  • Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., 2023. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42, 1–14.
  • Kerbl, B., Meuleman, A., Kopanas, G., Wimmer, M., Lanvin, A., Drettakis, G., 2024. A hierarchical 3D Gaussian representation for real-time rendering of very large datasets. ACM Transactions on Graphics 44.
  • Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lehner, H., Dorffner, L., 2020. Digital geoTwin Vienna: Towards a digital twin city as geodata hub.
  • Liao, Y., Zhang, X., Huang, N., Fu, C., Huang, Z., Cao, Q., Xu, Z., Xiong, X., Cai, S., 2024. High completeness multi-view stereo for dense reconstruction of large-scale urban scenes. ISPRS Journal of Photogrammetry and Remote Sensing 209, 173–196.
  • Lingua, A., Noardo, F., Spanò, A., Sanna, S., Matrone, F., 2017. 3D model generation using oblique images acquired by UAV. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 42, 107–115.
  • Lowe, D.G., 1999. Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, IEEE. pp. 1150–1157.
  • Marí, R., Facciolo, G., Ehret, T., 2022. Sat-NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cameras, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1311–1321.
  • Martin-Brualla, R., Radwan, N., Sajjadi, M.S., Barron, J.T., Dosovitskiy, A., Duckworth, D., 2021. NeRF in the wild: Neural radiance fields for unconstrained photo collections, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7210–7219.
  • Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R., 2021. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65, 99–106.
  • Müller, T., Evans, A., Schied, C., Keller, A., 2022. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG) 41, 1–15.
  • Musialski, P., Wonka, P., Aliaga, D.G., Wimmer, M., Van Gool, L., Purgathofer, W., 2013. A survey of urban reconstruction, in: Computer Graphics Forum, Wiley Online Library. pp. 146–177.
  • Netravali, A.N., 2013. Digital Pictures: Representation, Compression, and Standards. Springer.
  • Pepe, M., Fregonese, L., Crocetto, N., 2022. Use of SfM-MVS approach to nadir and oblique images generated through aerial cameras to build 2.5D map and 3D models in urban areas. Geocarto International 37, 120–141.
  • Rematas, K., Liu, A., Srinivasan, P.P., Barron, J.T., Tagliasacchi, A., Funkhouser, T., Ferrari, V., 2022. Urban radiance fields, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12932–12942.
  • Rohil, M.K., Ashok, Y., 2022. Visualization of urban development 3D layout plans with augmented reality. Results in Engineering 14, 100447.
  • Rong, Y., Zhang, T., Zheng, Y., Hu, C., Peng, L., Feng, P., 2020. Three-dimensional urban flood inundation simulation based on digital aerial photogrammetry. Journal of Hydrology 584, 124308.
  • Schonberger, J.L., Frahm, J.M., 2016. Structure-from-motion revisited, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113.
  • Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M., 2016. Pixelwise view selection for unstructured multi-view stereo, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, Springer. pp. 501–518.
  • Schrotter, G., Hürzeler, C., 2020. The digital twin of the city of Zurich for urban planning. PFG–Journal of Photogrammetry, Remote Sensing and Geoinformation Science 88, 99–112.
  • Statistics Canada, 2023. 2021 Census of Population. Statistics Canada Catalogue. URL: https://www12.statcan.gc.ca/census-recensement/2021/dp-pd/prof/index.cfm?Lang=E.
  • Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., Kretzschmar, H., 2022. Block-NeRF: Scalable large scene neural view synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8248–8258.
  • Toschi, I., Ramos, M., Nocerino, E., Menna, F., Remondino, F., Moe, K., Poli, D., Legat, K., Fassi, F., et al., 2017. Oblique photogrammetry supporting 3D urban reconstruction of complex scenarios. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 42, 519–526.
  • Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W., 2000. Bundle adjustment—a modern synthesis, in: Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms, Corfu, Greece, September 21–22, 1999, Proceedings, Springer. pp. 298–372.
  • Turki, H., Ramanan, D., Satyanarayanan, M., 2022. Mega-NeRF: Scalable construction of large-scale NeRFs for virtual fly-throughs, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12922–12931.
  • Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 600–612.
  • Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B., Lin, D., 2022. BungeeNeRF: Progressive neural radiance field for extreme multi-scale scene rendering, in: European Conference on Computer Vision, Springer. pp. 106–122.
  • Yalcin, G., Selcuk, O., 2015. 3D city modelling with oblique photogrammetry method. Procedia Technology 19, 424–431.
  • Yu, A., Li, R., Tancik, M., Li, H., Ng, R., Kanazawa, A., 2021. PlenOctrees for real-time rendering of neural radiance fields, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5752–5761.
  • Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O., 2018. The unreasonable effectiveness of deep features as a perceptual metric, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.
  • Zheng, E., Dunn, E., Jojic, V., Frahm, J.M., 2014. PatchMatch based joint view selection and depthmap estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1510–1517.
  • Zhou, H., Shao, J., Xu, L., Bai, D., Qiu, W., Liu, B., Wang, Y., Geiger, A., Liao, Y., 2024. HUGS: Holistic urban 3D scene understanding via Gaussian splatting. arXiv preprint arXiv:2403.12722.