The API Paradigm

Definition: LLM API Call Structure

Modern LLM APIs use a message-based interface:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    system="You are a wireless expert.",  # Anthropic takes the system prompt as a top-level parameter, not a message
    messages=[
        {"role": "user", "content": "Explain OFDM."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(response.content[0].text)

Key parameters: model, messages, max_tokens, temperature, top_p, stop_sequences, and tools.

The API paradigm decouples the model from the application. You send text in and get text out; no GPU is needed.

Definition: Token-Based Pricing

LLM APIs charge per token:

$\text{Cost} = T_\text{in} \times p_\text{in} + T_\text{out} \times p_\text{out}$

Output tokens ($p_\text{out}$) typically cost 3-5× more than input tokens ($p_\text{in}$) because generation is autoregressive (sequential) while input processing is parallel. For example, at illustrative rates of $3 per million input tokens and $15 per million output tokens, a call with 10,000 input tokens and 1,000 output tokens costs $0.03 + $0.015 = $0.045.

Definition: Structured Output

Structured output constrains the LLM to produce valid JSON, XML, or other formats. The example below uses the OpenAI-style response_format parameter from the chat-completions API:

import json

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    # JSON mode requires the word "JSON" to appear somewhere in the messages
    messages=[{"role": "user", "content": "List three channel models as JSON."}],
    response_format={"type": "json_object"},
)
result = json.loads(response.choices[0].message.content)

This guarantees syntactically valid JSON (though not a particular schema), eliminating a common class of parsing errors and enabling reliable pipeline integration.
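
Anthropic-style APIs lack a response_format parameter; a common alternative is to define a tool whose input_schema is the desired JSON schema and force the model to call it. A minimal sketch, assuming the Anthropic client from the first example (the tool name and schema are illustrative):

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize OFDM as structured data."}],
    tools=[{
        "name": "record_summary",
        "description": "Record a structured summary of a topic.",
        "input_schema": {
            "type": "object",
            "properties": {
                "topic": {"type": "string"},
                "key_points": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["topic", "key_points"],
        },
    }],
    tool_choice={"type": "tool", "name": "record_summary"},
)
result = response.content[0].input  # the forced tool call's arguments, already parsed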

Definition: Streaming Responses

Streaming returns tokens as they are generated, reducing perceived latency for interactive applications:

prompt = "Explain OFDM."  # any user prompt

with client.messages.stream(  # reuses the Anthropic client from the first example
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1024,
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Time to first token (TTFT) is typically 200-500ms.

Definition: Context Window

The context window is the maximum number of tokens the model can process in a single call: $T_\text{in} + T_\text{out} \le T_\text{max}$.

Model               Context Window
GPT-4o              128K tokens
Claude 3.5 Sonnet   200K tokens
Gemini 1.5 Pro      1M tokens
LLaMA 3.1 70B       128K tokens
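
A quick sanity check before a call is to estimate whether the prompt fits. A minimal sketch, assuming the rough heuristic of ~4 characters per token for English text (exact counts depend on the model's tokenizer):

def fits_context(prompt: str, max_output_tokens: int, t_max: int = 200_000) -> bool:
    """Rough check that T_in + T_out <= T_max, using ~4 chars/token for English."""
    estimated_input_tokens = len(prompt) // 4  # heuristic, not the real tokenizer
    return estimated_input_tokens + max_output_tokens <= t_max

# A 600,000-character document plus a 1,024-token answer budget fits in 200K
print(fits_context("x" * 600_000, max_output_tokens=1024))  # True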

Theorem: Prompt Cost Optimization

For a fixed task, the total API cost over $N$ calls is:

$C_\text{total} = N \cdot (T_\text{system} + T_\text{user} + T_\text{few-shot}) \cdot p_\text{in} + N \cdot T_\text{out} \cdot p_\text{out}$

System prompts and few-shot examples are re-sent with every call. Prompt caching (when available) can reduce the cost of this repeated input by up to 90%.

A 2000-token system prompt sent 10,000 times costs the same as processing a 20M-token document once; prompt caching is essential at that scale.
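
One concrete mechanism is Anthropic's prompt caching, which marks a long, stable prompt prefix as cacheable so later calls reuse it at a discounted rate. A minimal sketch, assuming the client from the first example (the system prompt content is a placeholder):

long_system_prompt = "You are a wireless expert. ..."  # stands in for a ~2000-token instruction block

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,
            "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "Explain OFDM."}],
)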

Example: Complete API Call with Error Handling

Write a robust LLM API call with retry logic, timeout, and structured output parsing.
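
One possible solution sketch, using the tenacity library for exponential backoff and a client-level timeout (the helper name, system prompt, and query are illustrative):

import json

import anthropic
from tenacity import retry, stop_after_attempt, wait_exponential

client = anthropic.Anthropic(timeout=30.0)  # fail fast instead of hanging

@retry(wait=wait_exponential(multiplier=1, min=1, max=30), stop=stop_after_attempt(5))
def ask_json(prompt: str) -> dict:
    """Call the API with retries; malformed JSON raises and triggers a retry."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="Respond with a single valid JSON object and nothing else.",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)

result = ask_json("Suggest an SNR range and modulation scheme for an OFDM link study.")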

Example: Batch Processing Papers with Rate Limiting

Process 100 paper abstracts through an LLM API with proper rate limiting and progress tracking.
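
A minimal sketch under simple assumptions: a 50-requests-per-minute budget (check your account's actual quota) and a naive sleep-based limiter; in practice you would combine this loop with the retry logic above:

import time

import anthropic

client = anthropic.Anthropic()
REQUESTS_PER_MINUTE = 50  # assumed quota

def summarize(abstract: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Summarize in one sentence:\n\n{abstract}"}],
    )
    return response.content[0].text

abstracts = ["..."] * 100  # placeholder: load the real abstracts here
summaries = []
for i, abstract in enumerate(abstracts, start=1):
    summaries.append(summarize(abstract))
    print(f"{i}/{len(abstracts)} abstracts processed")  # progress tracking
    time.sleep(60 / REQUESTS_PER_MINUTE)  # naive rate limiting: space requests evenly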

API Cost Calculator (interactive): estimate costs for different LLM API usage patterns.
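
A minimal sketch of the calculation such a tool performs, using the cost formula above (the per-million-token prices are illustrative placeholders):

def api_cost(n_calls: int, t_in: int, t_out: int,
             p_in: float = 3.0, p_out: float = 15.0) -> float:
    """Total cost in dollars; p_in/p_out are illustrative $ per million tokens."""
    return n_calls * (t_in * p_in + t_out * p_out) / 1_000_000

# 10,000 calls, each with a 2,000-token prompt and a 500-token response
print(f"${api_cost(10_000, 2_000, 500):,.2f}")  # $135.00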

Prompt Token Analyzer (interactive): analyze token distribution in different prompt strategies.
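
A sketch of the kind of breakdown such a tool produces, here using the tiktoken library as a stand-in tokenizer (Claude's tokenizer differs, so counts are approximate; the prompt parts are illustrative):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer; real tokenizers vary by model

parts = {
    "system": "You are a wireless expert.",
    "few_shot": "Q: What is QPSK? A: A four-point phase modulation scheme...",
    "user": "Explain OFDM.",
}
counts = {name: len(enc.encode(text)) for name, text in parts.items()}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>9}: {n:3d} tokens ({n / total:.0%})")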

LLM API Pipeline (figure): from user query through prompt construction, API call, and response parsing.

Quick Check

Why do output tokens cost more than input tokens?

(a) Output tokens use more memory
(b) Output tokens are generated sequentially (autoregressive), while input tokens are processed in parallel
(c) Output tokens are higher quality

Answer: (b). Generation produces one token at a time, so each output token requires a full forward pass, while all input tokens are processed in a single parallel pass.

Common Mistake: No Retry Logic for API Calls

Mistake: Making API calls without retry logic or error handling.

Correction: Always implement exponential backoff with retry (using a library such as tenacity or backoff); APIs have rate limits and transient errors.
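
A minimal sketch of the corrected pattern with the backoff library, retrying on the Anthropic SDK's rate-limit and connection errors (the wrapped function is illustrative):

import anthropic
import backoff

client = anthropic.Anthropic()

@backoff.on_exception(backoff.expo,
                      (anthropic.RateLimitError, anthropic.APIConnectionError),
                      max_tries=5)
def call_api(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text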

Key Takeaway

LLM APIs provide a simple text-in, text-out interface with token-based pricing. Always implement retry logic, use structured output for reliable parsing, and consider prompt caching to reduce costs.

Why This Matters: LLMs for Simulation Parameter Selection

LLM APIs can analyze simulation requirements and suggest parameters: given a paper description, an LLM can recommend appropriate channel models, modulation schemes, and SNR ranges, dramatically accelerating the experiment design phase.

See full treatment in Chapter 49

Historical Note: From Fine-Tuning to Prompting

2020-present

Before GPT-3 (2020), using NLP models required fine-tuning on task-specific data. The API paradigm introduced by OpenAI enabled "prompting": specifying the task in natural language rather than through training data. This shifted the bottleneck from ML engineering to prompt design.

Key Terms

Prompt

The input text sent to an LLM that specifies the task, context, and desired output format.

Context Window

The maximum number of tokens an LLM can process in a single inference call, including both input and output.

Structured Output

A constraint on LLM generation that forces the output to conform to a specific format like JSON schema.

Streaming

An API mode that returns tokens incrementally as they are generated, rather than waiting for the complete response.

Time to First Token (TTFT)

The latency from sending an API request to receiving the first generated token, typically 200-500ms for cloud APIs.