Annotations #847

Draft
wants to merge 12 commits into main
6 changes: 2 additions & 4 deletions docs/index.md
@@ -17,12 +17,10 @@ Here's a [YouTube video demo](https://www.youtube.com/watch?v=QUXQNi6jQ30) and [
Background on this project:
- [llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs](https://simonwillison.net/2023/May/18/cli-tools-for-llms/)
- [The LLM CLI tool now supports self-hosted language models via plugins](https://simonwillison.net/2023/Jul/12/llm/)
- [Accessing Llama 2 from the command-line with the llm-replicate plugin](https://simonwillison.net/2023/Jul/18/accessing-llama-2/)
- [Run Llama 2 on your own Mac using LLM and Homebrew](https://simonwillison.net/2023/Aug/1/llama-2-mac/)
- [Catching up on the weird world of LLMs](https://simonwillison.net/2023/Aug/3/weird-world-of-llms/)
- [LLM now provides tools for working with embeddings](https://simonwillison.net/2023/Sep/4/llm-embeddings/)
- [Build an image search engine with llm-clip, chat with models with llm chat](https://simonwillison.net/2023/Sep/12/llm-clip-and-chat/)
- [Many options for running Mistral models in your terminal using LLM](https://simonwillison.net/2023/Dec/18/mistral/)
- [You can now run prompts against images, audio and video in your terminal using LLM](https://simonwillison.net/2024/Oct/29/llm-multi-modal/)
- [Structured data extraction from unstructured content using LLM schemas](https://simonwillison.net/2025/Feb/28/llm-schemas/)

For more check out [the llm tag](https://simonwillison.net/tags/llm/) on my blog.

11 changes: 11 additions & 0 deletions docs/openai-models.md
@@ -55,6 +55,8 @@ OpenAI Chat: o1-preview
OpenAI Chat: o1-mini
OpenAI Chat: o3-mini
OpenAI Completion: gpt-3.5-turbo-instruct (aliases: 3.5-instruct, chatgpt-instruct)
OpenAI Chat: gpt-4o-search-preview
OpenAI Chat: gpt-4o-mini-search-preview
```
<!-- [[[end]]] -->

@@ -64,6 +66,15 @@ See [the OpenAI models documentation](https://platform.openai.com/docs/models) f

[o1-pro](https://platform.openai.com/docs/models/o1-pro) is not available through the Chat Completions API used by LLM's default OpenAI plugin. You can install the new [llm-openai-plugin](https://github.com/simonw/llm-openai-plugin) plugin to access that model.

## Model features

The following features work with OpenAI models:

- {ref}`System prompts <usage-system-prompts>` can be used to provide instructions that have a higher weight than the prompt itself.
- {ref}`Attachments <usage-attachments>`. Many OpenAI models support image inputs - check which ones do using `llm models --options`. Any model that accepts images can also accept PDFs.
- {ref}`Schemas <usage-schemas>` can be used to influence the JSON structure of the model output.
- {ref}`Model options <usage-model-options>` can be used to set parameters like `temperature`. Use `llm models --options` for a full list of supported options.
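
As a rough sketch, here is how these features can be combined from the Python API. The model ID, file path and schema below are illustrative placeholders, and this assumes an OpenAI API key has already been configured:

```python
import llm

# Placeholder model ID - any OpenAI chat model that supports attachments and schemas will do
model = llm.get_model("gpt-4o-mini")

# A system prompt, an image attachment and a model option (temperature) together
response = model.prompt(
    "Describe this image in one sentence",
    system="You are a terse assistant",
    attachments=[llm.Attachment(path="photo.jpg")],  # placeholder path
    temperature=0.5,
)
print(response.text())

# A JSON schema influencing the structure of the output
structured = model.prompt(
    "Invent a dog",
    schema={
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],
        "additionalProperties": False,
    },
)
print(structured.text())
```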

(openai-models-embedding)=

## OpenAI embedding models
54 changes: 52 additions & 2 deletions docs/plugins/advanced-model-plugins.md
@@ -9,6 +9,7 @@ Features to consider for your model plugin include:
- Including support for {ref}`Async models <advanced-model-plugins-async>` that can be used with Python's `asyncio` library.
- Support for {ref}`structured output <advanced-model-plugins-schemas>` using JSON schemas.
- Handling {ref}`attachments <advanced-model-plugins-attachments>` (images, audio and more) for multi-modal models.
- Supporting {ref}`annotations <advanced-model-plugins-annotations>` for models that return different types of text, or objects that should be attached to sections of the response.
- Tracking {ref}`token usage <advanced-model-plugins-usage>` for models that charge by the token.

(advanced-model-plugins-api-keys)=
@@ -58,7 +59,7 @@ class MyAsyncModel(llm.AsyncModel):

async def execute(
self, prompt, stream, response, conversation=None
) -> AsyncGenerator[str, None]:
) -> AsyncGenerator[Union[llm.Chunk, str], None]:
if stream:
completion = await client.chat.completions.create(
model=self.model_id,
@@ -82,7 +83,7 @@ class MyAsyncModel(llm.AsyncKeyModel):
...
async def execute(
self, prompt, stream, response, conversation=None, key=None
) -> AsyncGenerator[str, None]:
) -> AsyncGenerator[Union[llm.Chunk, str], None]:
```


@@ -243,3 +244,52 @@ This example logs 15 input tokens, 340 output tokens and notes that 37 tokens we
```python
response.set_usage(input=15, output=340, details={"cached": 37})
```

(advanced-model-plugins-annotations)=

## Models that return annotations

Some models may return additional structured data to accompany their text output. LLM calls these **annotations**. Common use-cases for these include:

- Reasoning models that return a portion of text representing "thinking" tokens prior to the main response.
- Models that return structured citation information attached to portions of the text.
- Search models that return references to the search results used to generate the response.

Model plugins can return these annotations directly from their `execute()` method. This method usually yields a series of strings - to attach an annotation to one of those strings, yield a `Chunk` object instead:

```python
import llm

...
# Inside the execute() method:
yield llm.Chunk(
text="This has an annotation",
annotation={
"title": "Document title",
"url": "https://example.com/document",
}
)
```
The `annotation=` value must be a dictionary but can take any shape. LLM automatically records the annotation along with the start and end index of the generated text it is attached to.

Some annotations need to be attached to a single point in the response rather than a span of text. In that case, set the `text=` parameter to `None`.
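
Putting this together, a streaming `execute()` method might look something like the following sketch. The `client` object and the shape of its events are hypothetical stand-ins for whatever API your plugin wraps:

```python
import llm


def execute(self, prompt, stream, response, conversation=None):
    # `client` and its event objects are hypothetical stand-ins for the API this plugin wraps
    for event in client.stream(prompt.prompt):
        if event.citation:
            # Attach an annotation to this span of generated text
            yield llm.Chunk(
                text=event.text,
                annotation={
                    "title": event.citation.title,
                    "url": event.citation.url,
                },
            )
        elif event.marker:
            # A point annotation: no text span, just a position in the output
            yield llm.Chunk(text=None, annotation={"marker": event.marker})
        else:
            # Plain strings can still be yielded as before
            yield event.text
```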

Some models do not return their annotations as part of the stream, instead producing them at the end of the response with start and end indexes indicating which parts of the text they apply to. This is often the case for non-streaming APIs.

For these cases, call `response.add_annotations()` at the end of the `execute()` method:

```python
response.add_annotations([
llm.Annotation(
start_index=0,
end_index=10,
data={
"title": "Document title",
"url": "https://example.com/document"
}
)
])
```
The method accepts a list of `llm.Annotation` objects, each with `start_index=` and `end_index=` integers and a `data=` dictionary describing the annotation.

For annotations attached to a point rather than a range, the `start_index=` and `end_index=` should be the same integer value.
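
For example, a point annotation marking a position 42 characters into the response might look like this sketch (the `data=` contents here are illustrative):

```python
response.add_annotations([
    llm.Annotation(
        start_index=42,
        end_index=42,
        data={"note": "reasoning summary ends here"},
    )
])
```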
5 changes: 5 additions & 0 deletions docs/templates.md
@@ -59,6 +59,11 @@ This can be combined with the `-m` option to specify a different model:
curl -s https://llm.datasette.io/en/latest/ | \
llm -t summarize -m gpt-3.5-turbo-16k
```
Templates can also be specified as full URLs to YAML files:
```bash
llm -t https://raw.githubusercontent.com/simonw/llm-templates/refs/heads/main/python-app.yaml \
'Python app to pick a random line from a file'
```

(prompt-templates-list)=

29 changes: 29 additions & 0 deletions docs/usage.md
@@ -45,6 +45,7 @@ Will run a prompt of:
```
For models that support them, {ref}`system prompts <usage-system-prompts>` are a better tool for this kind of prompting.

(usage-model-options)=
### Model options

Some models support options. You can pass these using `-o/--option name value` - for example, to set the temperature to 1.5 run this:
@@ -754,6 +755,34 @@ OpenAI Completion: gpt-3.5-turbo-instruct (aliases: 3.5-instruct, chatgpt-instru
Include the log probabilities of most likely N per token
Features:
- streaming
OpenAI Chat: gpt-4o-search-preview
Options:
temperature: float
max_tokens: int
top_p: float
frequency_penalty: float
presence_penalty: float
stop: str
logit_bias: dict, str
seed: int
search_context_size: str
Features:
- streaming
- async
OpenAI Chat: gpt-4o-mini-search-preview
Options:
temperature: float
max_tokens: int
top_p: float
frequency_penalty: float
presence_penalty: float
stop: str
logit_bias: dict, str
seed: int
search_context_size: str
Features:
- streaming
- async

```
<!-- [[[end]]] -->
4 changes: 4 additions & 0 deletions llm/__init__.py
@@ -4,11 +4,13 @@
NeedsKeyException,
)
from .models import (
Annotation,
AsyncConversation,
AsyncKeyModel,
AsyncModel,
AsyncResponse,
Attachment,
Chunk,
Conversation,
EmbeddingModel,
EmbeddingModelWithAliases,
@@ -31,10 +33,12 @@
import struct

__all__ = [
"Annotation",
"AsyncConversation",
"AsyncKeyModel",
"AsyncResponse",
"Attachment",
"Chunk",
"Collection",
"Conversation",
"get_async_model",
39 changes: 26 additions & 13 deletions llm/cli.py
@@ -10,6 +10,7 @@
AsyncConversation,
AsyncKeyModel,
AsyncResponse,
Chunk,
Collection,
Conversation,
Response,
@@ -561,6 +562,8 @@ async def inner():
)
if should_stream:
for chunk in response:
if isinstance(chunk, Chunk) and chunk.annotation:
print(chunk.annotation)
print(chunk, end="")
sys.stdout.flush()
print("")
@@ -2524,7 +2527,28 @@ def logs_db_path():
return user_dir() / "logs.db"


def _parse_yaml_template(name, content):
try:
loaded = yaml.safe_load(content)
except yaml.YAMLError as ex:
raise click.ClickException("Invalid YAML: {}".format(str(ex)))
if isinstance(loaded, str):
return Template(name=name, prompt=loaded)
loaded["name"] = name
try:
return Template(**loaded)
except pydantic.ValidationError as ex:
msg = "A validation error occurred:\n"
msg += render_errors(ex.errors())
raise click.ClickException(msg)


def load_template(name):
if name.startswith("https://") or name.startswith("http://"):
response = httpx.get(name)
response.raise_for_status()
return _parse_yaml_template(name, response.text)

if ":" in name:
prefix, rest = name.split(":", 1)
loaders = get_template_loaders()
@@ -2541,19 +2565,8 @@ def load_template(name):
path = template_dir() / f"{name}.yaml"
if not path.exists():
raise click.ClickException(f"Invalid template: {name}")
try:
loaded = yaml.safe_load(path.read_text())
except yaml.YAMLError as ex:
raise click.ClickException("Invalid YAML: {}".format(str(ex)))
if isinstance(loaded, str):
return Template(name=name, prompt=loaded)
loaded["name"] = name
try:
return Template(**loaded)
except pydantic.ValidationError as ex:
msg = "A validation error occurred:\n"
msg += render_errors(ex.errors())
raise click.ClickException(msg)
content = path.read_text()
return _parse_yaml_template(name, content)


def get_history(chat_id):