The API Paradigm
Definition: LLM API Call Structure
Modern LLM APIs use a message-based interface:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    # The system prompt is a top-level parameter in the Anthropic API,
    # not a message with role "system".
    system="You are a wireless expert.",
    messages=[
        {"role": "user", "content": "Explain OFDM."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
```
Key parameters: model, messages, max_tokens, temperature,
top_p, stop_sequences, and tools.
The API paradigm decouples the model from the application: you send text in and get text out, with no local GPU required.
Definition: Token-Based Pricing
LLM APIs charge per token: $\text{cost} = c_{\text{in}} n_{\text{in}} + c_{\text{out}} n_{\text{out}}$. Output tokens ($c_{\text{out}}$) typically cost around 3–5× more than input tokens ($c_{\text{in}}$) because generation is autoregressive (sequential) while input processing is parallel.
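To make the formula concrete, here is a minimal per-call cost calculation; the per-million-token prices are illustrative placeholders, not any provider's current rates.

```python
# Illustrative prices in USD per million tokens (placeholders, not
# actual rates); note that c_out is several times c_in.
PRICE_IN_PER_MTOK = 3.00    # c_in
PRICE_OUT_PER_MTOK = 15.00  # c_out

def call_cost(n_in: int, n_out: int) -> float:
    """Dollar cost of one call: c_in * n_in + c_out * n_out."""
    return (n_in * PRICE_IN_PER_MTOK + n_out * PRICE_OUT_PER_MTOK) / 1e6

# A 2,000-token prompt with a 500-token completion:
print(f"${call_cost(2_000, 500):.4f}")  # $0.0135
```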
Definition: Structured Output
Structured output constrains the LLM to produce valid JSON, XML, or other formats:
```python
import json
from openai import OpenAI

# response_format is the OpenAI-style interface; the Anthropic API
# achieves the same guarantee through forced tool use (see below).
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    response_format={"type": "json_object"},
)
result = json.loads(response.choices[0].message.content)
```
This eliminates parsing errors and enables reliable pipeline integration.
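With the Anthropic API specifically, the same guarantee comes from tool use: define a tool whose input schema is the JSON you want, then force the model to call it, so the arguments already conform to the schema. A minimal sketch; the tool name and schema here are illustrative.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical schema for this illustration.
paper_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "keywords"],
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[{
        "name": "record_paper",
        "description": "Record extracted paper metadata.",
        "input_schema": paper_schema,
    }],
    tool_choice={"type": "tool", "name": "record_paper"},  # force the call
    messages=[{"role": "user", "content": "Extract metadata from: ..."}],
)

# The result arrives as a tool_use block whose .input already
# matches the schema, so no string parsing is required.
result = next(b.input for b in response.content if b.type == "tool_use")
```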
Definition: Streaming Responses
Streaming returns tokens as they are generated, reducing perceived latency for interactive applications:
```python
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1024,
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
Time to first token (TTFT) is typically 200-500ms.
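The streaming interface above also makes TTFT easy to measure for your own deployment; a rough sketch:

```python
import time
import anthropic

client = anthropic.Anthropic()

start = time.perf_counter()
first_token_at = None

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Explain OFDM."}],
    max_tokens=256,
) as stream:
    for text in stream.text_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first chunk arrived

print(f"TTFT: {(first_token_at - start) * 1e3:.0f} ms")
```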
Definition: Context Window
The context window is the maximum number of tokens the model can process in a single call, input and output combined: $n_{\text{in}} + n_{\text{out}} \leq L_{\text{context}}$.
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Gemini 1.5 Pro | 1M tokens |
| LLaMA 3 70B | 128K tokens |
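Because input and output share the window, it is worth a feasibility check before a call. The sketch below uses the rough "four characters per token" heuristic for English text; exact counts require the provider's tokenizer or token-counting endpoint, and the default 200K window is illustrative.

```python
def fits_context(prompt: str, max_output_tokens: int,
                 context_window: int = 200_000) -> bool:
    """Rough check that n_in + n_out <= L_context.

    Uses the ~4 chars/token heuristic for English text, so this is
    an estimate, not an exact token count.
    """
    est_input_tokens = len(prompt) // 4
    return est_input_tokens + max_output_tokens <= context_window

# A ~500,000-character document against a 200K-token window:
print(fits_context("x" * 500_000, max_output_tokens=1024))  # True (125K + 1K <= 200K)
```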
Theorem: Prompt Cost Optimization
For a fixed task, the total API cost over $N$ calls is $C_{\text{total}} = N \, (c_{\text{in}} n_{\text{in}} + c_{\text{out}} n_{\text{out}})$. System prompts and few-shot examples are re-sent with every call, so their tokens count toward $n_{\text{in}}$ each time. Prompt caching (when available) can reduce repeated input costs by up to 90%.
A 2000-token system prompt sent 10,000 times costs the same as processing a 20M-token document, so prompt caching is essential.
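With the Anthropic API, caching is opted into per content block via cache_control; later calls that reuse the exact same prefix read it from cache at a reduced input rate. A minimal sketch, where the short system prompt stands in for a long one:

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "You are a wireless expert. ..."  # imagine ~2,000 tokens

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        # Mark this block cacheable; repeat calls sharing this exact
        # prefix are billed at the reduced cached-input rate.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Explain OFDM."}],
)
```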
Example: Complete API Call with Error Handling
Write a robust LLM API call with retry logic, timeout, and structured output parsing.
Implementation
```python
import json

import anthropic
from tenacity import retry, stop_after_attempt, wait_exponential

client = anthropic.Anthropic(timeout=30.0)  # per-request timeout in seconds

# Retry transient failures (rate limits, timeouts, malformed JSON)
# with exponential backoff, giving up after three attempts.
@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(3))
def analyze_paper(abstract: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="Extract: title, keywords, methodology. Return JSON.",
        messages=[{"role": "user", "content": abstract}],
    )
    # content is a list of blocks; the first holds the generated text.
    return json.loads(response.content[0].text)
```
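Usage is then a single call. A malformed JSON response raises inside json.loads and is retried along with transport errors; the abstract and returned keys here are illustrative.

```python
abstract = "We propose a pilot-based OFDM channel estimator ..."
metadata = analyze_paper(abstract)  # retried automatically on failure
print(metadata.get("title"), metadata.get("keywords"))
```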
Example: Batch Processing Papers with Rate Limiting
Process 100 paper abstracts through an LLM API with proper rate limiting and progress tracking.
Implementation
```python
import asyncio
import json

import anthropic

async_client = anthropic.AsyncAnthropic()

async def async_analyze(abstract: str) -> dict:
    # Async counterpart of analyze_paper above (retries omitted for brevity).
    response = await async_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="Extract: title, keywords, methodology. Return JSON.",
        messages=[{"role": "user", "content": abstract}],
    )
    return json.loads(response.content[0].text)

async def process_papers(abstracts, max_concurrent=5):
    sem = asyncio.Semaphore(max_concurrent)  # caps in-flight requests

    async def process_one(abstract):
        async with sem:
            return await async_analyze(abstract)

    # gather preserves input order; progress tracking (e.g. tqdm) omitted.
    return await asyncio.gather(*(process_one(a) for a in abstracts))
```
[Interactive widget: API Cost Calculator – estimate costs for different LLM API usage patterns]
[Interactive widget: Prompt Token Analyzer – analyze token distribution in different prompt strategies]
[Diagram: LLM API Pipeline]
Quick Check
Why do output tokens cost more than input tokens?
- Output tokens use more memory
- Output tokens are generated sequentially (autoregressive), while input tokens are processed in parallel (correct)
- Output tokens are higher quality

Each of the $T$ output tokens requires a full forward pass through the model, while the entire prompt is encoded in a single parallel pass, so generation is roughly $T$ times slower than encoding.
Common Mistake: No Retry Logic for API Calls
Mistake:
Making API calls without retry logic or error handling.
Correction:
Always implement exponential backoff with retries (e.g., via the tenacity or backoff libraries): APIs enforce rate limits and occasionally return transient errors.
Key Takeaway
LLM APIs provide a simple text-in, text-out interface with token-based pricing. Always implement retry logic, use structured output for reliable parsing, and consider prompt caching to reduce costs.
Why This Matters: LLMs for Simulation Parameter Selection
LLM APIs can analyze simulation requirements and suggest parameters: given a paper description, an LLM can recommend appropriate channel models, modulation schemes, and SNR ranges, dramatically accelerating the experiment design phase.
See full treatment in Chapter 49
Historical Note: From Fine-Tuning to Prompting
2020–present. Before GPT-3 (2020), using NLP models required fine-tuning on task-specific data. The API paradigm introduced by OpenAI enabled "prompting": specifying the task in natural language rather than through training data. This shifted the bottleneck from ML engineering to prompt design.
Prompt
The input text sent to an LLM that specifies the task, context, and desired output format.
Context Window
The maximum number of tokens an LLM can process in a single inference call, including both input and output.
Structured Output
A constraint on LLM generation that forces the output to conform to a specific format like JSON schema.
Streaming
An API mode that returns tokens incrementally as they are generated, rather than waiting for the complete response.
Time to First Token (TTFT)
The latency from sending an API request to receiving the first generated token, typically 200-500ms for cloud APIs.