Research on 3D Reconstruction Technology of Indoor Buildings Based on Depth Prediction

Dashi Qiu

doi:10.54097/8a7mdq55

Keywords:

3D Reconstruction, Depth Prediction, Multi-view, Indoor Scene

Abstract

To address the limitations of traditional 3D indoor scene reconstruction methods in resource-constrained environments, this paper proposes a depth-prediction-based 3D reconstruction method for indoor buildings. The method first uses a pre-trained image encoder to extract multi-scale features from the input images. These features are combined with metadata (ray direction, depth information, and relative pose distance) to construct a feature volume, which is then processed by a 2D convolutional neural network. A multi-scale depth prediction strategy progressively refines the depth estimate, producing high-quality depth predictions that support more detailed 3D reconstruction. On the public ScanNet dataset, the proposed method significantly outperforms traditional depth estimation approaches, with a 21% improvement in the threshold accuracy metric δ < 1.05. In 3D reconstruction tasks, it achieves near state-of-the-art performance (F-Score = 0.658) while supporting online, real-time reconstruction with low memory consumption and a per-frame latency of only 72 ms.
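The threshold accuracy metric δ < 1.05 mentioned above is commonly defined as the fraction of pixels whose predicted-to-true depth ratio, max(d_pred / d_gt, d_gt / d_pred), falls below the threshold. The following is a minimal sketch under that assumption; the abstract does not spell out the definition, and the function name `threshold_accuracy` and the flat-list input format are hypothetical choices made for illustration, not the paper's evaluation code.

```python
from typing import Sequence


def threshold_accuracy(pred: Sequence[float],
                       gt: Sequence[float],
                       threshold: float = 1.05) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    Assumption: pixels with non-positive ground-truth depth are treated
    as missing and excluded, as is common for sensor depth maps.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")

    valid = 0
    hits = 0
    for p, g in zip(pred, gt):
        if g <= 0:
            # No usable ground truth at this pixel; skip it.
            continue
        valid += 1
        # A non-positive prediction can never satisfy the ratio test.
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1

    if valid == 0:
        raise ValueError("no pixels with valid ground-truth depth")
    return hits / valid


# Toy depths in metres; the last pixel has no ground truth.
pred_depth = [2.0, 1.0, 3.3, 4.0]
gt_depth = [2.04, 1.2, 3.0, 0.0]
score = threshold_accuracy(pred_depth, gt_depth)
```

In this toy example only the first pixel (ratio 1.02) passes the 1.05 threshold out of three valid pixels. A threshold this tight rewards only near-exact predictions, which is why it is more sensitive to fine depth errors than the looser δ < 1.25 variant.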
