Messages¶
Messages are the fundamental unit of interaction with chat-based language models. They consist of a role and some form of content. For text-only language models, the content is typically just a string. However, with multimodal language models that can process images, audio, text, and other modalities, the content object becomes more complex.
In practice, the content that a language model can consume forms a markup language, where there are different content blocks for text, images, audio, tool use, and so on.
Challenges with LLM APIs¶
The potential complexity of a message object has led language model APIs to establish message specifications that are often quite pedantic, even when users only want to pass around simple types like strings or images. This issue is compounded by the fact that most language model APIs are automatically generated using tools like Stainless, which take an API spec and build multi-language client-side API bindings. Because these APIs are automatically generated, they can’t be optimized for user-friendliness.
For example, many prompt engineering libraries exist primarily to solve the inconvenience of indexing into responses from APIs like OpenAI’s. This complexity in both specifying prompts and handling responses can make working with language models unnecessarily cumbersome for developers.
result: str = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of the moon?"}
    ]
)["choices"][0]["message"]["content"]  # highlight this line

# versus, with a typical prompt engineering library:
result: str = my_prompt_engineering_library("prompt")
Likewise, specifying prompts themselves is quite cumbersome. Because language model provider client bindings are often automatically generated, they lack developer-friendly affordances, so users must be as verbose and pedantic as possible when constructing prompts. Consider the complexity of passing an input with both text and images to a language model API:
result: str = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [  # highlight these lines
            {"type": "text", "text": "What is the capital of the moon?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]}
    ]
)["choices"][0]["message"]["content"]
In essence, the user has to explicitly specify two different content blocks and their types, even though those types are implicit and could be inferred. This is because the language bindings use typed dictionaries and validators generated by Stainless or similar code-generation tools. While not inherently wrong, this approach creates a gap in developer experience, making the code less readable and more cumbersome to work with.
This leads us to a core philosophy in ell:
“Using language models is just passing around strings, except when it’s not.”
Users should be able to specify the minimal amount of complexity necessary for the data they want to pass to a language model. To achieve this, we’ve drawn inspiration from machine learning and scientific computing libraries like TensorFlow, PyTorch, and NumPy to create a new type of message API. In this API, type coercion and implicit inference are key features that enhance the developer experience.
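To make the inspiration concrete, consider the kind of coercion NumPy performs (NumPy is only the model for this design, not a dependency of ell):
import numpy as np

# Plain nested Python lists are coerced into a typed 2-D array;
# the dtype and shape are inferred rather than declared by the user.
a = np.array([[1, 2], [3, 4]])

# Scalars broadcast automatically; no explicit casting or wrapping is needed.
b = a + 1.5
ell’s Message API applies the same philosophy to constructing prompts.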
The ell Message API¶
Our API centers around two key objects: Messages and ContentBlocks.
- pydantic model ell.Message¶
- Fields:
content (List[ell.types.message.ContentBlock])
role (str)
- classmethod model_validate_json(json_str: str) → Message¶
Custom validation to handle deserialization from JSON string
- serialize_content(content: List[ContentBlock])¶
Serialize content blocks to a format suitable for JSON
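A minimal round-trip sketch (assuming pydantic v2’s standard model_dump_json, which Message inherits as a pydantic model):
import ell
from ell import Message

# Serialize a message to JSON, then restore it via the custom validator.
message = ell.user("What is the capital of the moon?")
json_str = message.model_dump_json()  # standard pydantic v2 serialization
restored = Message.model_validate_json(json_str)
assert restored.role == "user"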
- pydantic model ell.ContentBlock¶
- Fields:
audio (numpy.ndarray | List[float] | None)
image (ell.types.message.ImageContent | None)
parsed (pydantic.main.BaseModel | None)
text (ell.types._lstr._lstr | str | None)
tool_call (ell.types.message.ToolCall | None)
tool_result (ell.types.message.ToolResult | None)
- serialize_parsed(value: BaseModel | None, _info)¶
- property content¶
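As the optional fields above suggest, a ContentBlock is intended to carry one kind of content at a time, with the other fields left as None. A short sketch:
from ell import ContentBlock

# Construct a block with a single content field set; the rest remain None.
block = ContentBlock(text="What is the capital of the moon?")
assert block.text == "What is the capital of the moon?"
assert block.image is None and block.audio is None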
Solving the construction problem¶
The Message and ContentBlock objects solve the problem of pedantic construction by incorporating type coercion directly into their constructors.
Consider constructing a message that contains both text and an image. Traditionally, you might need to create a Message with a role and two ContentBlocks - one for text and one for an image:
from ell import Message, ContentBlock

message = Message(
    role="user",
    content=[
        ContentBlock(text="What is the capital of the moon?"),
        ContentBlock(image=some_PIL_image_object)
    ]
)
However, the Message object can infer the types of content blocks within it. This allows for a more concise construction:
message = Message(
    role="user",
    content=["What is the capital of the moon?", some_PIL_image_object]
)
Furthermore, if a message contains only one type of content (for example, just an image), we also support shape coercion:
message = Message(
    role="user",
    content=some_PIL_image_object
)
Coercion is an important concept in ell, and you can read more about it in the Content Block Coercion API reference page.
Common roles¶
Ell’s message API provides several helper functions for constructing messages with the roles common to language model APIs. These functions essentially partially apply the Message constructor with a specific role, and all of the type coercion and convenience described above is handled automatically:
message = ell.user(["What is the capital of the moon?", some_PIL_image_object])
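Conceptually (a hypothetical sketch, not ell’s actual source), each helper just fixes the role and forwards the content:
from ell import Message

def user(content):
    # Morally equivalent to ell.user: fix role="user" and let Message's
    # coercion normalize the content into content blocks.
    return Message(role="user", content=content)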
- ell.system(content: ContentBlock | str | ToolCall | ToolResult | ImageContent | ndarray | Image | BaseModel | List[ContentBlock | str | ToolCall | ToolResult | ImageContent | ndarray | Image | BaseModel]) → Message¶
Create a system message with the given content.
Args: content: The content of the system message; any supported type is coerced into content blocks.
Returns: Message: A Message object with role set to ‘system’ and the provided content.
- ell.user(content: ContentBlock | str | ToolCall | ToolResult | ImageContent | ndarray | Image | BaseModel | List[ContentBlock | str | ToolCall | ToolResult | ImageContent | ndarray | Image | BaseModel]) → Message¶
Create a user message with the given content.
Args: content: The content of the user message; any supported type is coerced into content blocks.
Returns: Message: A Message object with role set to ‘user’ and the provided content.
- ell.assistant(content: ContentBlock | str | ToolCall | ToolResult | ImageContent | ndarray | Image | BaseModel | List[ContentBlock | str | ToolCall | ToolResult | ImageContent | ndarray | Image | BaseModel]) → Message¶
Create an assistant message with the given content.
Args: content: The content of the assistant message; any supported type is coerced into content blocks.
Returns: Message: A Message object with role set to ‘assistant’ and the provided content.
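For example, an entire multimodal exchange can be assembled from these helpers (some_PIL_image_object is the placeholder PIL image used throughout this page):
import ell

conversation = [
    ell.system("You are a helpful assistant."),
    ell.user(["What is the capital of the moon?", some_PIL_image_object]),
    ell.assistant("The moon has no capital."),
]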
Solving the parsing problem¶
Complex message structures shouldn’t mean complex interactions. Drawing inspiration from rich HTML APIs and JavaScript’s document selector API, as well as BeautifulSoup’s helper functions for extracting text from HTML documents, we’ve built convenient functions for interacting with the contents of a message.
To understand why this approach is necessary, let’s examine how we might parse output from the traditional OpenAI API if the model had multimodal capabilities. This example will illustrate the complexity of handling various content types without a unified message structure.
import openai

# Assume we have a response from a hypothetical multimodal language model
response = openai.chat.completions.create(
    model="gpt-5-omni",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": "Draw me a sketch version of this image"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]}
    ]
)

# Access the message content from the OpenAI response
message_content = response.choices[0].message.content

# Check for different types of content in the traditional OpenAI API format
has_image = any(content.get('type') == 'image_url' for content in message_content if isinstance(content, dict))
has_text = any(content.get('type') == 'text' for content in message_content if isinstance(content, dict))
has_tool_call = 'function_call' in response.choices[0].message

if has_image:
    image_content = [content['image_url']['url'] for content in message_content if isinstance(content, dict) and content.get('type') == 'image_url']
    show(image_content[0])

if has_text:
    # Extract text content
    text_content = [content['text'] for content in message_content if isinstance(content, dict) and content.get('type') == 'text']
    print(text_content[0])

if has_tool_call:
    print("The message contains a tool call.")
Now let’s see how we can do the same thing using ell’s message API. In the following example, we’ll use the @ell.complex decorator, which is similar to @ell.simple; however, instead of returning a string after calling the language model program, it returns a Message object representing the model’s response. This allows language model responses to carry multimodal output, including structured and tool-call output. You can learn more about this in the @ell.complex section.
import ell
from PIL import Image as PILImage

@ell.complex(model="gpt-5-omni")
def draw_sketch(image: PILImage.Image):
    return [
        ell.system("You are a helpful assistant."),
        ell.user(["Draw me a sketch version of this image", image]),
    ]

response = draw_sketch(some_PIL_image_object)

if response.images:
    show(response.images[0])
if response.text:
    print(response.text)
if response.tool_calls:
    print("The message contains a tool call.")
The following convenience functions and properties are available on a Message object:
- property Message.text: str¶
Returns all text content, replacing non-text content with their representations.
Example
>>> message = Message(role="user", content=["Hello", PILImage.new('RGB', (100, 100)), "World"])
>>> message.text
'Hello\n<PilImage>\nWorld'
- property Message.text_only: str¶
Returns only the text content, ignoring non-text content.
Example
>>> message = Message(role="user", content=["Hello", PILImage.new('RGB', (100, 100)), "World"])
>>> message.text_only
'Hello\nWorld'
- property Message.tool_calls: List[ToolCall]¶
Returns a list of all tool calls.
Example
>>> tool_call = ToolCall(tool=lambda x: x, params=BaseModel())
>>> message = Message(role="user", content=["Text", tool_call])
>>> len(message.tool_calls)
1
- property Message.tool_results: List[ToolResult]¶
Returns a list of all tool results.
Example
>>> tool_result = ToolResult(tool_call_id="123", result=[ContentBlock(text="Result")])
>>> message = Message(role="user", content=["Text", tool_result])
>>> len(message.tool_results)
1
- property Message.parsed: BaseModel | List[BaseModel]¶
Returns a list of all parsed content.
Example
>>> class CustomModel(BaseModel):
...     value: int
>>> parsed_content = CustomModel(value=42)
>>> message = Message(role="user", content=["Text", ContentBlock(parsed=parsed_content)])
>>> len(message.parsed)
1
- property Message.images: List[ImageContent]¶
Returns a list of all image content.
Example
>>> from PIL import Image as PILImage
>>> image1 = ImageContent(url="https://example.com/image.jpg")
>>> image2 = ImageContent(image=PILImage.new('RGB', (200, 200)))
>>> message = Message(role="user", content=["Text", image1, "More text", image2])
>>> len(message.images)
2
>>> isinstance(message.images[0], ImageContent)
True
>>> message.images[0].url
'https://example.com/image.jpg'
>>> isinstance(message.images[1].image, PILImage.Image)
True
- property Message.audios: List[ndarray | List[float]]¶
Returns a list of all audio content.
Example
>>> audio1 = np.array([0.1, 0.2, 0.3])
>>> audio2 = np.array([0.4, 0.5, 0.6])
>>> message = Message(role="user", content=["Text", audio1, "More text", audio2])
>>> len(message.audios)
2
- Message.call_tools_and_collect_as_message(parallel=False, max_workers=None)¶
Execute all tool calls contained in the message, optionally in parallel, and collect their results into a new Message.
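A hedged usage sketch (assuming a response that contains tool calls, as in the draw_sketch example above):
# Run every tool call in the response, optionally in parallel, and gather
# the results into a single follow-up message to send back to the model.
if response.tool_calls:
    result_message = response.call_tools_and_collect_as_message(parallel=True, max_workers=4)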