This preprint has been published as a journal article.
DOI of the published article: 10.1109/ICIPTM69057.2026.11465825 (https://ieeexplore.ieee.org/document/11465825)
Preprint / Version 1

Substitute-Space Embeddings for Label-Free Syntax: Unsupervised AI for POS Discovery


DOI:

https://doi.org/10.31224/6170

Keywords:

Unsupervised learning, part-of-speech induction, word embeddings, substitute distributions, spherical embeddings, syntax acquisition, multilingual NLP, distributional semantics

Abstract

This paper reinterprets part-of-speech induction as an AI representation-learning problem, embedding words alongside their probabilistic substitutes to induce discrete categories without labels. A spherical embedding objective maps target words, substitute distributions, and auxiliary orthographic/morphological cues into a shared space where clusters align with syntactic functions, enabling token- and type-level induction via simple clustering. Experiments across English and 17+ languages use standardized PTB, MULTEXT-East, and CoNLL-X corpora, showing state-of-the-art many-to-one and V-measure scores and analyzing sensitivity to embedding dimension, substitute set size, and feature augmentations. The approach highlights how classic language models and unsupervised embeddings can yield emergent structure, offering a scalable path to label-free linguistic analysis in low-resource AI settings.
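The clustering step described in the abstract, grouping words in a shared spherical embedding space and scoring induced clusters against gold tags, can be sketched as follows. This is an illustrative toy, not the paper's implementation: it runs spherical k-means (cosine similarity over unit-normalized vectors) on hypothetical 2-D embeddings and computes the standard many-to-one score mentioned in the abstract. The vectors, tags, and parameters are all made up for demonstration.

```python
# Toy sketch: spherical k-means over unit-normalized embeddings,
# evaluated with the many-to-one score. Illustrative only; not the
# paper's actual method or data.
import math
import random
from collections import Counter

def normalize(v):
    """Project a vector onto the unit sphere (no-op for the zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def spherical_kmeans(points, k, iters=20, seed=0):
    """Cluster unit-normalized points by cosine similarity; return assignments."""
    rng = random.Random(seed)
    pts = [normalize(p) for p in points]
    centers = [list(p) for p in rng.sample(pts, k)]
    assign = [0] * len(pts)
    for _ in range(iters):
        # Assign each point to the center with the highest cosine similarity.
        for i, p in enumerate(pts):
            assign[i] = max(range(k),
                            key=lambda c: sum(a * b for a, b in zip(p, centers[c])))
        # Recompute each center as the normalized mean of its members.
        for c in range(k):
            members = [pts[i] for i in range(len(pts)) if assign[i] == c]
            if members:
                centers[c] = normalize([sum(col) for col in zip(*members)])
    return assign

def many_to_one(pred, gold):
    """Map each induced cluster to its most frequent gold tag, then score accuracy."""
    best = {c: Counter(g for p, g in zip(pred, gold) if p == c).most_common(1)[0][0]
            for c in set(pred)}
    return sum(best[p] == g for p, g in zip(pred, gold)) / len(gold)

# Hypothetical 2-D "embeddings" for four tokens with gold tags N, N, V, V.
emb = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
gold = ["N", "N", "V", "V"]
labels = spherical_kmeans(emb, k=2)
score = many_to_one(labels, gold)  # 1.0 on this cleanly separable toy data
```

In the paper's setting the embeddings would come from the spherical objective over words and their substitute distributions, and k would match the tagset size; the clustering and many-to-one evaluation proceed as above.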

Downloads

Download data is not yet available.


Posted

2026-01-08