DOI of the published article: 10.1109/ICIPTM69057.2026.11465825 (https://ieeexplore.ieee.org/document/11465825)
Substitute-Space Embeddings for Label-Free Syntax: Unsupervised AI for POS Discovery
DOI:
https://doi.org/10.31224/6170

Keywords:
Unsupervised learning, part-of-speech induction, word embeddings, substitute distributions, spherical embeddings, syntax acquisition, multilingual NLP, distributional semantics

Abstract
This paper reinterprets part-of-speech induction as an AI representation-learning problem, embedding words alongside their probabilistic substitutes to induce discrete categories without labels. A spherical embedding objective maps target words, substitute distributions, and auxiliary orthographic/morphological cues into a shared space where clusters align with syntactic functions, enabling token- and type-level induction via simple clustering. Experiments on English and more than 17 other languages, using standardized PTB, MULTEXT-East, and CoNLL-X corpora, show state-of-the-art many-to-one and V-measure scores and analyze sensitivity to embedding dimension, substitute set size, and feature augmentations. The approach highlights how classic language models and unsupervised embeddings can yield emergent structure, offering a scalable path to label-free linguistic analysis in low-resource AI settings.
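The pipeline the abstract describes (embed on the unit sphere, cluster, score against gold tags) can be sketched in miniature. This is a hypothetical illustration, not the paper's actual objective or features: it clusters L2-normalized vectors with a cosine-similarity ("spherical") k-means and scores the induced clusters with the many-to-one accuracy mentioned in the abstract; the toy data stands in for real word embeddings.

```python
# Hypothetical sketch of type-level POS induction: spherical k-means over
# unit-norm embeddings, evaluated with many-to-one accuracy. The paper's
# actual embedding objective and substitute features are not reproduced here.
import numpy as np

def spherical_kmeans(X, k, iters=50, seed=0):
    """Cluster rows of X (assumed L2-normalized) by cosine similarity."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialization: start random, then pick least-similar rows.
    idx = [int(rng.integers(len(X)))]
    while len(idx) < k:
        sims = (X @ X[idx].T).max(axis=1)
        idx.append(int(sims.argmin()))
    C = X[idx].copy()
    for _ in range(iters):
        labels = (X @ C.T).argmax(axis=1)      # nearest centroid by cosine
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                c = pts.sum(axis=0)
                C[j] = c / np.linalg.norm(c)   # re-project onto the sphere
    return labels

def many_to_one(pred, gold):
    """Map each induced cluster to its most frequent gold tag; score accuracy."""
    correct = 0
    for c in np.unique(pred):
        _, counts = np.unique(gold[pred == c], return_counts=True)
        correct += counts.max()
    return correct / len(gold)

# Toy data: two well-separated "syntactic" directions on the unit sphere.
rng = np.random.default_rng(1)
A = rng.normal([5.0, 0.0], 0.1, (50, 2))
B = rng.normal([0.0, 5.0], 0.1, (50, 2))
X = np.vstack([A, B])
X /= np.linalg.norm(X, axis=1, keepdims=True)
gold = np.array([0] * 50 + [1] * 50)
pred = spherical_kmeans(X, k=2)
print(round(many_to_one(pred, gold), 2))  # → 1.0
```

In practice the embeddings would come from the learned substitute-space representation, the cluster count would match the target tagset size, and V-measure would be reported alongside many-to-one, since many-to-one alone rewards over-fragmented clusterings.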
Downloads
License
Copyright (c) 2026 Vipul Razdan

This work is licensed under a Creative Commons Attribution 4.0 International License.