Researchers at The University of Hong Kong and Xiaomi released Utonia, the first self-supervised point cloud encoder trained across five 3D domains at once: satellite scans, outdoor street LiDAR, indoor room scans, standalone object models, and point clouds built from regular video. Built on a 137-million-parameter Point Transformer V3 backbone, it was trained on 250,000 cross-domain scenes plus 1 million 3D object assets.

Three techniques make joint training work:

- randomly hiding color and surface data during training, so the model stays robust when sensors differ;
- rescaling all point clouds to a shared spatial unit;
- a position encoding that avoids locking geometry to fixed grid coordinates.

One model matches or beats separate domain-specific encoders across the board: 81.1% on ScanNet indoor segmentation, 82.2% on nuScenes outdoor segmentation, and 95.2% on ScanObjectNN object classification. Robotic grasping success jumps to 82.1% when robot policies use Utonia's 3D features, up from 74.7% with older encoders.
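The first two techniques are simple to picture in code. The sketch below is illustrative only, not Utonia's actual implementation: the function name, parameters, and the choice of a 10-unit target extent are all assumptions, but it shows the general shape of attribute dropout plus spatial rescaling.

```python
import numpy as np

def augment_scene(points, colors=None, normals=None,
                  target_extent=10.0, drop_prob=0.5, rng=None):
    """Hypothetical sketch of two cross-domain training tricks:
    1) randomly hide color/normal attributes so the encoder cannot
       depend on channels that some sensors never provide;
    2) rescale every cloud to a shared spatial unit so indoor rooms,
       street scans, and small objects occupy comparable coordinates.
    """
    rng = rng or np.random.default_rng()

    # Shared spatial unit: scale the cloud's largest axis-aligned
    # extent to `target_extent` (the value 10.0 is an assumption).
    mins, maxs = points.min(axis=0), points.max(axis=0)
    scale = target_extent / max((maxs - mins).max(), 1e-6)
    points = (points - mins) * scale

    # Attribute dropout: with probability `drop_prob`, zero out an
    # entire attribute channel for this scene.
    if colors is not None and rng.random() < drop_prob:
        colors = np.zeros_like(colors)
    if normals is not None and rng.random() < drop_prob:
        normals = np.zeros_like(normals)
    return points, colors, normals
```

Dropping whole channels per scene, rather than per point, forces the encoder to handle entire datasets that lack color or normals, which is the situation joint training actually faces.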
3D understanding has been fragmented: one model for indoor rooms, another for self-driving, another for small objects. Teams building robots, AR headsets, or autonomous vehicles each trained their own encoder from scratch. Utonia replaces all of them with one pretrained model, and it reveals emergent behaviors that only appear during joint training. The model automatically learns that a toy car in a CAD dataset and a real car on a street share the same structure, something no single-domain model could discover. Weights are on HuggingFace under a CC-BY-NC 4.0 license.
This follows the same trajectory as language and vision: specialized models giving way to unified foundations that benefit all tasks. In 3D, fragmentation persisted longer because point cloud formats vary drastically across sensors. Utonia suggests the unification era for sparse 3D data has started.
Read More: Utonia’s cross-domain spatial understanding connects to the broader challenge of spatial consistency discussed in The Trinity of Consistency.
Sources:
- Utonia Paper (arXiv:2603.03283)
- Utonia GitHub Repository (469 stars)
- Utonia Model Weights on HuggingFace
- HuggingFace Paper Discussion (164 upvotes, #1 paper of the day)
Citation
@misc{kabui2026,
  author = {{Kabui, Charles}},
  title = {Utonia: {One} {Encoder} for {All} {3D} {Point} {Clouds}},
  date = {2026-03-13},
  url = {https://toknow.ai/posts/utonia-universal-point-cloud-encoder-3d-foundation-model/},
  langid = {en-GB}
}
