Preprint / Version 1

Is a Large Context Window all you need? Exploring Time To First Token (TTFT)-context size tradeoff for Autoregressive LLMs

Authors

  • Anuran Roy, Alchemyst AI
  • Arnab Sengupta, Alchemyst AI
  • Saptarshi Pani, Alchemyst AI

DOI:

https://doi.org/10.31224/4666

Keywords:

Information Retrieval, Large Language Models (LLMs), AI Engineering, Multi-agent systems

Abstract

Recent advancements in auto-regressive large language models (henceforth referred to as LLMs) have significantly expanded context window capacities, with Meta’s Llama 4 Scout achieving a 10-million-token input length. This expansion is facilitated by techniques like Rotary Position Embedding (RoPE) and YaRN (Yet Another RoPE extensioN), which encode positional information through rotational transformations, enabling models to process longer sequences effectively.

This advancement opens up a host of opportunities for the ubiquitous LLMs. Yet attention mechanisms are barely sub-quadratic in nature. Extending context windows therefore introduces latency challenges, especially in scenarios where even sub-second delays can cause catastrophic failures at scale in real-life use cases, and many of these failures can be silent.

This paper examines the trade-offs between context size and latency, highlighting the need for improved context retrieval strategies that do not bloat the queries sent to the LLMs in question.
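
As a rough illustration of why longer contexts put pressure on time to first token, the sketch below estimates how the attention cost of the prefill pass grows with prompt length. It assumes plain dense self-attention (the fully quadratic case, not a sub-quadratic variant) and ignores projections, softmax, and kernel-level optimizations. The function name and the model dimensions are hypothetical and are not taken from the paper.

```python
def attention_prefill_flops(n_tokens: int, d_model: int, n_layers: int) -> int:
    """Rough FLOP estimate for attention during prefill (dense attention assumed).

    Per layer, computing QK^T and the attention-weighted sum over V each take
    about 2 * n^2 * d_model FLOPs when summed over all heads.
    """
    return 4 * n_tokens ** 2 * d_model * n_layers


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    d_model, n_layers = 4096, 32
    baseline = attention_prefill_flops(8_000, d_model, n_layers)
    for n in (8_000, 80_000, 800_000):
        flops = attention_prefill_flops(n, d_model, n_layers)
        # Relative cost versus the 8k-token prompt.
        print(f"{n:>8} tokens: {flops:.3e} FLOPs ({flops / baseline:.0f}x baseline)")
```

Under these assumptions, a tenfold increase in prompt length raises the attention cost of prefill roughly a hundredfold, which is the kind of growth that motivates keeping retrieved context small.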

Author Biographies

Anuran Roy, Alchemyst AI

AI Engineering Team

Arnab Sengupta, Alchemyst AI

AI Engineering Team

Saptarshi Pani, Alchemyst AI

AI Engineering Team

Posted

2025-06-02