This preprint has been published as a journal article.
DOI of the published article: 10.1109/ICIPTM69057.2026.11465825 (https://ieeexplore.ieee.org/document/11465825)
Preprint / Version 1

Substitute-Space Embeddings for Label-Free Syntax: Unsupervised AI for POS Discovery


DOI:

https://doi.org/10.31224/6170

Keywords:

Unsupervised learning, part-of-speech induction, word embeddings, substitute distributions, spherical embeddings, syntax acquisition, multilingual NLP, distributional semantics

Abstract

This paper reinterprets part-of-speech induction as an AI representation-learning problem, embedding words alongside their probabilistic substitutes to induce discrete categories without labels. A spherical embedding objective maps target words, substitute distributions, and auxiliary orthographic/morphological cues into a shared space where clusters align with syntactic functions, enabling token- and type-level induction via simple clustering. Experiments across English and 17+ languages use standardized PTB, MULTEXT-East, and CoNLL-X corpora, showing state-of-the-art many-to-one and V-measure scores and analyzing sensitivity to embedding dimension, substitute set size, and feature augmentations. The approach highlights how classic language models and unsupervised embeddings can yield emergent structure, offering a scalable path to label-free linguistic analysis in low-resource AI settings.
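The clustering step described in the abstract, grouping words in a shared spherical embedding space and scoring induced clusters against gold tags, can be sketched as follows. This is an illustrative toy, not the paper's implementation: it runs spherical k-means (cosine similarity over unit-normalized vectors) on hypothetical 2-D embeddings and computes the standard many-to-one score mentioned in the abstract. The vectors, tags, and parameters are all made up for demonstration.

```python
# Toy sketch: spherical k-means over unit-normalized embeddings,
# evaluated with the many-to-one score. Illustrative only; not the
# paper's actual method or data.
import math
import random
from collections import Counter

def normalize(v):
    """Project a vector onto the unit sphere (no-op for the zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def spherical_kmeans(points, k, iters=20, seed=0):
    """Cluster unit-normalized points by cosine similarity; return assignments."""
    rng = random.Random(seed)
    pts = [normalize(p) for p in points]
    centers = [list(p) for p in rng.sample(pts, k)]
    assign = [0] * len(pts)
    for _ in range(iters):
        # Assign each point to the center with the highest cosine similarity.
        for i, p in enumerate(pts):
            assign[i] = max(range(k),
                            key=lambda c: sum(a * b for a, b in zip(p, centers[c])))
        # Recompute each center as the normalized mean of its members.
        for c in range(k):
            members = [pts[i] for i in range(len(pts)) if assign[i] == c]
            if members:
                centers[c] = normalize([sum(col) for col in zip(*members)])
    return assign

def many_to_one(pred, gold):
    """Map each induced cluster to its most frequent gold tag, then score accuracy."""
    best = {c: Counter(g for p, g in zip(pred, gold) if p == c).most_common(1)[0][0]
            for c in set(pred)}
    return sum(best[p] == g for p, g in zip(pred, gold)) / len(gold)

# Hypothetical 2-D "embeddings" for four tokens with gold tags N, N, V, V.
emb = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
gold = ["N", "N", "V", "V"]
labels = spherical_kmeans(emb, k=2)
score = many_to_one(labels, gold)  # 1.0 on this cleanly separable toy data
```

In the paper's setting the embeddings would come from the spherical objective over words and their substitute distributions, and k would match the tagset size; the clustering and many-to-one evaluation proceed as above.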

Downloads

Download data is not yet available.


Posted

2026-01-08