Self-supervised learning for BIM element classification using a joint embedding predictive architecture

Shi, Jack Wei Lun; Solihin, Wawan; Weng, Yufeng; Zhao, Yimin; Poh, Leong Hien; Yeoh, Justin Ker-Wei

doi:10.1016/j.autcon.2026.107075

Automation in Construction

Jack Wei Lun Shi¹, Wawan Solihin¹,², Yufeng Weng¹, Yimin Zhao¹,
Leong Hien Poh¹, Justin Ker-Wei Yeoh¹

¹Department of Civil and Environmental Engineering, National University of Singapore
²Research and Innovation, NovaCITYNETS Pte. Ltd.

Abstract

The development of scalable models for automated Building Information Modeling (BIM) element classification is hindered by the reliance on supervised learning, which requires expensive and laborious manual data annotation. This paper introduces a pre-trained model that leverages a Joint Embedding Predictive Architecture for self-supervised learning on unlabeled 3D point cloud representations of individual BIM elements. By predicting the latent representations of masked regions of element geometry, the proposed model learns rich geometric features that achieve competitive accuracy on a downstream classification task, outperforming existing supervised methods without heavy data augmentation, while excelling in data-scarce scenarios. This paper mitigates the data annotation bottleneck and establishes a path toward developing a foundation model for BIM geometry, enabling more scalable, data-efficient, and generalizable representation learning in the Architecture, Engineering, and Construction domain.

Why pre-train a model for BIM geometry?

The development of scalable models for automated BIM element classification is hindered by the reliance on supervised learning, which requires expensive and laborious manual data annotation. BIM-JEPA introduces a pre-trained model that leverages a Joint Embedding Predictive Architecture for self-supervised learning on unlabeled 3D point cloud representations of individual BIM elements. Unlike conventional approaches, which train separate and task-specific models from scratch, a single pre-trained model can be efficiently fine-tuned for a wide range of downstream tasks.

Fig. 1. Differences between other approaches and our approach.

How BIM-JEPA learns

BIM-JEPA learns by predicting the embeddings of masked portions of a BIM element from a given context, entirely within a shared latent space. From an ordered sequence of geometric patches, the model samples one larger context block and four smaller target blocks. The context block represents a partial view of the object (i.e., 40% to 75% of the patch embeddings), while the target blocks represent different masked regions (i.e., 15% to 20% of the patch embeddings) that the model must predict. The indices of patch embeddings chosen for the context differ from those for the targets to avoid trivial solutions. Predicting in the latent space rather than reconstructing raw points lets the model focus on high-level geometry and avoids the need for heavy data augmentation.

Fig. 4. Visualization of the context blocks and target blocks of various BIM elements.
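The block sampling described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that each block is a contiguous run in the ordered patch sequence, that the context ratio is applied before overlapping indices are removed, and that there are 64 patches. The function names `sample_block` and `sample_context_and_targets` are hypothetical.

```python
import random


def sample_block(num_patches, min_ratio, max_ratio, rng):
    """Sample a contiguous run of patch indices covering a random fraction of the sequence."""
    ratio = rng.uniform(min_ratio, max_ratio)
    length = max(1, round(ratio * num_patches))
    start = rng.randint(0, num_patches - length)  # inclusive bounds keep the run in range
    return list(range(start, start + length))


def sample_context_and_targets(num_patches, num_targets=4, seed=None, max_tries=100):
    """Return (context, targets): one context block and several target blocks.

    Assumption: indices covered by any target are removed from the context,
    so the context never contains a patch the model is asked to predict.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        # Four smaller target blocks, each 15% to 20% of the patches.
        targets = [sample_block(num_patches, 0.15, 0.20, rng) for _ in range(num_targets)]
        target_set = set().union(*targets)
        # One larger context block, 40% to 75% of the patches before overlap removal.
        context = sample_block(num_patches, 0.40, 0.75, rng)
        context = [i for i in context if i not in target_set]
        if context:  # resample in the rare case where the targets cover the whole context
            return context, targets
    raise RuntimeError("could not sample a non-empty context")


# Hypothetical example: an element tokenized into 64 ordered patches.
context, targets = sample_context_and_targets(num_patches=64, seed=0)
print(len(context), [len(t) for t in targets])
```

Because overlapping indices are dropped, the effective context can end up smaller than 40% of the patches in this sketch; whether the original method enforces the ratio before or after removal is not stated here.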

Datasets

BIM-JEPA is pre-trained on 907,349 samples of individual BIM elements, drawn from three large-scale sources of real-world IFC/BIM geometry. Although these are labeled datasets, the labels were not used for the pre-training task. For the downstream classification task, the model is evaluated on two datasets: IFCNetCore, a curated subset of 20 balanced classes that provides a standard for general classification accuracy, and BIMGEOM, a more imbalanced set of 13 classes that assesses the robustness and generalization capabilities of the model.

Point cloud samples from BIMGEOM and IFCNetCore — Fig. 6. a) Point cloud samples from BIMGEOM with 13 classes and b) point cloud samples from IFCNetCore with 20 classes.

Results

BIM-JEPA establishes the best performance across all metrics on IFCNetCore. The model achieves an overall accuracy of 89.37% and a mean class accuracy of 86.63%, surpassing all baseline methods with minimal data augmentation during the fine-tuning phase (i.e., only scaling), thus relying primarily on the rich features learned during pre-training.

Table 2. Accuracy metrics on IFCNetCore.
Model	Overall Acc. (%)	Mean Class Acc. (%)	Precision (%)	Recall (%)	F1 (%)
MVCNN	86.97	85.54	87.48	86.97	86.93
MeshNet	85.75	83.32	86.45	85.75	85.72
DGCNN	82.30	79.11	83.26	82.30	82.15
SpaRSE-BIM	81.59	83.02	82.78	81.59	81.80
BIM-JEPA	89.37	86.63	89.38	89.37	87.69
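The two accuracy columns measure different things: overall accuracy counts every sample equally, while mean class accuracy averages the per-class accuracies so that each class counts equally. The following is a minimal sketch of both, using made-up labels that are not taken from either dataset; the function name is hypothetical.

```python
from collections import defaultdict


def overall_and_mean_class_accuracy(y_true, y_pred):
    """Return (overall accuracy, mean of per-class accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    overall = sum(correct.values()) / len(y_true)
    mean_class = sum(correct[c] / total[c] for c in total) / len(total)
    return overall, mean_class


# Toy, made-up labels: 8 walls (all correct) and 2 stairs (one misclassified as wall).
y_true = ["wall"] * 8 + ["stair"] * 2
y_pred = ["wall"] * 8 + ["wall", "stair"]
overall, mean_class = overall_and_mean_class_accuracy(y_true, y_pred)
print(overall, mean_class)  # 0.9 0.75
```

In the toy example, a single error on the rare class barely moves overall accuracy but halves that class's accuracy, which is why the two columns are reported side by side.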

On the more imbalanced BIMGEOM dataset, BIM-JEPA attains 92.43% overall accuracy and 89.53% mean class accuracy, exceeding SpaRSE-BIM (90.47% and 87.66%, respectively). On the weighted average metrics, BIM-JEPA surpasses SpaRSE-BIM in precision, recall, and F1, while its higher macro averages indicate more balanced performance on under-represented BIM element classes.

Table 4. Accuracy metrics on BIMGEOM.
Model	Overall Acc. (%)	Mean Class Acc. (%)	Macro Avg. P / R / F1 (%)	Weighted Avg. P / R / F1 (%)
Mesh to Graph	85	76	–	–
5-NN Graph	83	78	–	–
SpaRSE-BIM	90.47	87.66	84.70 / 87.66 / 85.78	91.43 / 90.47 / 90.77
BIM-JEPA	92.43	89.53	89.05 / 89.53 / 89.20	92.57 / 92.43 / 92.44
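Macro averages weight every class equally, whereas weighted averages weight each class by its number of true samples (its support). A useful consequence is that weighted recall equals overall accuracy and macro recall equals mean class accuracy, which is consistent with the matching values in Table 4. The sketch below illustrates this with made-up labels; `averaged_prf` is a hypothetical helper, and it assumes that only classes present in the ground truth are averaged.

```python
def averaged_prf(y_true, y_pred):
    """Return (macro, weighted) tuples of (precision, recall, F1).

    Assumption: only classes that occur in y_true are averaged.
    """
    rows = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((precision, recall, f1, tp + fn))  # last entry is the class support
    n = sum(row[3] for row in rows)
    macro = tuple(sum(row[i] for row in rows) / len(rows) for i in range(3))
    weighted = tuple(sum(row[i] * row[3] for row in rows) / n for i in range(3))
    return macro, weighted


# Toy, made-up labels: 8 walls (all correct) and 2 stairs (one misclassified as wall).
y_true = ["wall"] * 8 + ["stair"] * 2
y_pred = ["wall"] * 8 + ["wall", "stair"]
macro, weighted = averaged_prf(y_true, y_pred)
print("macro recall:", macro[1])        # equals mean class accuracy
print("weighted recall:", weighted[1])  # equals overall accuracy
```

On an imbalanced dataset, a model can score well on the weighted averages by doing well on the frequent classes alone, so the gap between the macro and weighted columns is a rough indicator of how rare classes are handled.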

Quality of the learned representations

To provide qualitative insight into the structure of the learned feature spaces, the test-set embeddings are projected onto a two-dimensional plane using Principal Component Analysis (PCA), with the background regions representing the linear decision boundaries learned by an SVM. The pre-trained encoder yields features that appear as a single, undifferentiated cluster, typical of a self-supervised model before it has been adapted to a downstream task. Upon fine-tuning, class structure becomes visually apparent: the features coalesce into distinct class-specific groups, culminating in the fine-tuned prelogits, which form tighter, denser, and better-separated clusters than those of the other models.

Fig. 10. PCA visualization of learned representations from various models. Top row: a) DGCNN, b) MeshNet, c) MVCNN. Bottom row: d) pre-trained BIM-JEPA encoder, e) fine-tuned BIM-JEPA encoder, f) fine-tuned BIM-JEPA prelogits.

Performance in low-data regimes

The data efficiency curves demonstrate that the model is highly sample-efficient, capturing a large fraction of its final performance with only a small subset of the labeled data. Notably, with just 25% of the training set, BIM-JEPA reaches a mean overall accuracy of approximately 75% on IFCNetCore and 82% on BIMGEOM. This rapid convergence strongly suggests that the self-supervised pre-training effectively equips the model with robust and generalizable representations, thereby reducing its dependency on extensive labeled datasets for the downstream task.

Fig. 11. Data efficiency curves on IFCNetCore and BIMGEOM averaged over 5 random seeds.

The N-shot learning experiments further evaluate the ability of BIM-JEPA to generalize from a very small number of labeled training examples per class. For BIMGEOM, the mean overall accuracy steadily climbs from approximately 50% with 5 shots to over 80% with 140 shots. A similar, albeit more compressed, trend is observed on IFCNetCore, where accuracy increases from around 40% to nearly 70% as the number of shots grows from 5 to 36.

Fig. 12. N-shot learning on IFCNetCore and BIMGEOM averaged over 5 random seeds.
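An N-shot protocol of this kind can be sketched as follows: for each seed, draw at most N labeled examples per class, fine-tune on that subset, and average the resulting accuracy over the seeds. This is an illustrative sketch only. The names `n_shot_indices` and `mean_over_seeds` are hypothetical, the `evaluate` callback stands in for the fine-tuning and testing step, and the seed values 0 to 4 are an assumption (the source states only that 5 random seeds are used).

```python
import random
from collections import defaultdict


def n_shot_indices(labels, n_shots, seed):
    """Pick up to n_shots training indices per class, reproducibly for a given seed."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        k = min(n_shots, len(pool))  # rare classes may have fewer than n_shots samples
        chosen.extend(rng.sample(pool, k))
    return sorted(chosen)


def mean_over_seeds(evaluate, labels, n_shots, seeds=(0, 1, 2, 3, 4)):
    """Average a score over several random N-shot subsets.

    `evaluate` is a placeholder callback: it receives the selected training
    indices and returns a score such as overall accuracy.
    """
    scores = [evaluate(n_shot_indices(labels, n_shots, s)) for s in seeds]
    return sum(scores) / len(scores)


# Toy, made-up label list: 10 walls and 3 stairs, sampled with 5 shots per class.
labels = ["wall"] * 10 + ["stair"] * 3
subset = n_shot_indices(labels, n_shots=5, seed=0)
print([labels[i] for i in subset])
```

Capping the draw at the class size keeps the sketch usable on imbalanced label sets, where some classes may contain fewer than N training samples.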

That's the big picture! If you're curious about the methodology, experiments, and all the details, come check out the full paper.

Citation

@article{shi2026selfsupervised,
  title={Self-supervised learning for BIM element classification using a joint embedding predictive architecture},
  author={Shi, Jack Wei Lun and Solihin, Wawan and Weng, Yufeng and Zhao, Yimin and Poh, Leong Hien and Yeoh, Justin K.W.},
  journal={Automation in Construction},
  volume={190},
  pages={107075},
  year={2026},
  publisher={Elsevier}
}