The API Paradigm
Definition: LLM API Call Structure
Modern LLM APIs use a message-based interface:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    # The system prompt is a top-level parameter in the Anthropic API,
    # not a message with role "system".
    system="You are a wireless expert.",
    messages=[
        {"role": "user", "content": "Explain OFDM."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
```
Key parameters: model, messages, max_tokens, temperature,
top_p, stop_sequences, and tools.
The API paradigm decouples the model from the application: you send text in and get text out, with no local GPU required.
Definition: Token-Based Pricing
LLM APIs charge per token: $\text{cost} = c_{\text{in}} n_{\text{in}} + c_{\text{out}} n_{\text{out}}$. Output tokens ($c_{\text{out}}$) typically cost around 3–5× more than input tokens ($c_{\text{in}}$) because generation is autoregressive (sequential) while input processing is parallel.
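To make the formula concrete, here is a minimal per-call cost calculation; the per-million-token prices are illustrative placeholders, not any provider's current rates.

```python
# Illustrative prices in USD per million tokens (placeholders, not
# actual rates); note that c_out is several times c_in.
PRICE_IN_PER_MTOK = 3.00    # c_in
PRICE_OUT_PER_MTOK = 15.00  # c_out

def call_cost(n_in: int, n_out: int) -> float:
    """Dollar cost of one call: c_in * n_in + c_out * n_out."""
    return (n_in * PRICE_IN_PER_MTOK + n_out * PRICE_OUT_PER_MTOK) / 1e6

# A 2,000-token prompt with a 500-token completion:
print(f"${call_cost(2_000, 500):.4f}")  # $0.0135
```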
Definition: Structured Output
Structured output constrains the LLM to produce valid JSON, XML, or other formats:
```python
import json
from openai import OpenAI

# response_format is the OpenAI-style interface; the Anthropic API
# achieves the same guarantee through forced tool use (see below).
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    response_format={"type": "json_object"},
)
result = json.loads(response.choices[0].message.content)
```
This eliminates parsing errors and enables reliable pipeline integration.
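With the Anthropic API specifically, the same guarantee comes from tool use: define a tool whose input schema is the JSON you want, then force the model to call it, so the arguments already conform to the schema. A minimal sketch; the tool name and schema here are illustrative.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical schema for this illustration.
paper_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "keywords"],
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[{
        "name": "record_paper",
        "description": "Record extracted paper metadata.",
        "input_schema": paper_schema,
    }],
    tool_choice={"type": "tool", "name": "record_paper"},  # force the call
    messages=[{"role": "user", "content": "Extract metadata from: ..."}],
)

# The result arrives as a tool_use block whose .input already
# matches the schema, so no string parsing is required.
result = next(b.input for b in response.content if b.type == "tool_use")
```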
Definition: Streaming Responses
Streaming returns tokens as they are generated, reducing perceived latency for interactive applications:
```python
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1024,
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
Time to first token (TTFT) is typically 200-500ms.
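The streaming interface above also makes TTFT easy to measure for your own deployment; a rough sketch:

```python
import time
import anthropic

client = anthropic.Anthropic()

start = time.perf_counter()
first_token_at = None

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Explain OFDM."}],
    max_tokens=256,
) as stream:
    for text in stream.text_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first chunk arrived

print(f"TTFT: {(first_token_at - start) * 1e3:.0f} ms")
```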
Definition: Context Window
The context window is the maximum number of tokens the model can process in a single call, input and output combined: $n_{\text{in}} + n_{\text{out}} \leq L_{\text{context}}$.
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Gemini 1.5 Pro | 1M tokens |
| LLaMA 3 70B | 128K tokens |
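Because input and output share the window, it is worth a feasibility check before a call. The sketch below uses the rough "four characters per token" heuristic for English text; exact counts require the provider's tokenizer or token-counting endpoint, and the default 200K window is illustrative.

```python
def fits_context(prompt: str, max_output_tokens: int,
                 context_window: int = 200_000) -> bool:
    """Rough check that n_in + n_out <= L_context.

    Uses the ~4 chars/token heuristic for English text, so this is
    an estimate, not an exact token count.
    """
    est_input_tokens = len(prompt) // 4
    return est_input_tokens + max_output_tokens <= context_window

# A ~500,000-character document against a 200K-token window:
print(fits_context("x" * 500_000, max_output_tokens=1024))  # True (125K + 1K <= 200K)
```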
Theorem: Prompt Cost Optimization
For a fixed task, the total API cost over $N$ calls is $C_{\text{total}} = N \, (c_{\text{in}} n_{\text{in}} + c_{\text{out}} n_{\text{out}})$. System prompts and few-shot examples are re-sent with every call, so their tokens count toward $n_{\text{in}}$ each time. Prompt caching (when available) can reduce repeated input costs by up to 90%.
A 2000-token system prompt sent 10,000 times costs the same as processing a 20M-token document, so prompt caching is essential.
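With the Anthropic API, caching is opted into per content block via cache_control; later calls that reuse the exact same prefix read it from cache at a reduced input rate. A minimal sketch, where the short system prompt stands in for a long one:

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "You are a wireless expert. ..."  # imagine ~2,000 tokens

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        # Mark this block cacheable; repeat calls sharing this exact
        # prefix are billed at the reduced cached-input rate.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Explain OFDM."}],
)
```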
Example: Complete API Call with Error Handling
Write a robust LLM API call with retry logic, timeout, and structured output parsing.
Implementation
```python
import json

import anthropic
from tenacity import retry, stop_after_attempt, wait_exponential

client = anthropic.Anthropic(timeout=30.0)  # per-request timeout in seconds

# Retry transient failures (rate limits, timeouts, malformed JSON)
# with exponential backoff, giving up after three attempts.
@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(3))
def analyze_paper(abstract: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="Extract: title, keywords, methodology. Return JSON.",
        messages=[{"role": "user", "content": abstract}],
    )
    # content is a list of blocks; the first holds the generated text.
    return json.loads(response.content[0].text)
```
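Usage is then a single call. A malformed JSON response raises inside json.loads and is retried along with transport errors; the abstract and returned keys here are illustrative.

```python
abstract = "We propose a pilot-based OFDM channel estimator ..."
metadata = analyze_paper(abstract)  # retried automatically on failure
print(metadata.get("title"), metadata.get("keywords"))
```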
Example: Batch Processing Papers with Rate Limiting
Process 100 paper abstracts through an LLM API with proper rate limiting and progress tracking.
Implementation
```python
import asyncio
import json

import anthropic

async_client = anthropic.AsyncAnthropic()

async def async_analyze(abstract: str) -> dict:
    # Async counterpart of analyze_paper above (retries omitted for brevity).
    response = await async_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="Extract: title, keywords, methodology. Return JSON.",
        messages=[{"role": "user", "content": abstract}],
    )
    return json.loads(response.content[0].text)

async def process_papers(abstracts, max_concurrent=5):
    sem = asyncio.Semaphore(max_concurrent)  # caps in-flight requests

    async def process_one(abstract):
        async with sem:
            return await async_analyze(abstract)

    # gather preserves input order; progress tracking (e.g. tqdm) omitted.
    return await asyncio.gather(*(process_one(a) for a in abstracts))
```
[Interactive widget: API Cost Calculator – estimate costs for different LLM API usage patterns]
[Interactive widget: Prompt Token Analyzer – analyze token distribution in different prompt strategies]
[Diagram: LLM API Pipeline]
Quick Check
Why do output tokens cost more than input tokens?
- Output tokens use more memory
- Output tokens are generated sequentially (autoregressive), while input tokens are processed in parallel (correct)
- Output tokens are higher quality

Each of the $T$ output tokens requires a full forward pass through the model, while the entire prompt is encoded in a single parallel pass, so generation is roughly $T$ times slower than encoding.
Common Mistake: No Retry Logic for API Calls
Mistake:
Making API calls without retry logic or error handling.
Correction:
Always implement exponential backoff with retries (e.g., via the tenacity or backoff libraries): APIs enforce rate limits and occasionally return transient errors.
Key Takeaway
LLM APIs provide a simple text-in, text-out interface with token-based pricing. Always implement retry logic, use structured output for reliable parsing, and consider prompt caching to reduce costs.
Why This Matters: LLMs for Simulation Parameter Selection
LLM APIs can analyze simulation requirements and suggest parameters: given a paper description, an LLM can recommend appropriate channel models, modulation schemes, and SNR ranges, dramatically accelerating the experiment design phase.
See full treatment in Chapter 49
Historical Note: From Fine-Tuning to Prompting
2020–present. Before GPT-3 (2020), using NLP models required fine-tuning on task-specific data. The API paradigm introduced by OpenAI enabled "prompting": specifying the task in natural language rather than through training data. This shifted the bottleneck from ML engineering to prompt design.
Prompt
The input text sent to an LLM that specifies the task, context, and desired output format.
Context Window
The maximum number of tokens an LLM can process in a single inference call, including both input and output.
Structured Output
A constraint on LLM generation that forces the output to conform to a specific format like JSON schema.
Streaming
An API mode that returns tokens incrementally as they are generated, rather than waiting for the complete response.
Time to First Token (TTFT)
The latency from sending an API request to receiving the first generated token, typically 200-500ms for cloud APIs.