Skip to content

Add multiple modalities in a single message #89

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

domenic
Copy link
Collaborator

@domenic domenic commented Mar 27, 2025

I'm posting this as a draft, because I'm starting to believe this illustrates that all our shorthands are getting too complicated and error-prone. The difference between the two examples in the PR is easy to miss, and I'm not very happy about this API shape now.

I have an alternative proposal: let's get rid of a lot of the defaulting and overloads, and just require verbose prompting almost all of the time.

I will propose an alternative PR with that.

@domenic
Copy link
Collaborator Author

domenic commented Mar 27, 2025

I will propose an alternative PR with that.

First, let's try to have some discussion on alternatives. Here are some choices in the design space:

Always require the full format

The base format for a prompt is: Array<{ role, content: Array<{ type, data }>>. Always require people to do this. So e.g. a simple text prompt becomes

const response = await session.prompt([
  { role: "user", content: [{ type: "text", data: "the text" }]}
]);

(NOTE: I am distinguishing here between data, the innermost thing, and content, the outermost thing. Right now our API uses content for both, with no data. And, if there are shorthands, we probably should pick a single name and stick with it, because it's confusing to make web developers switch between content vs. data depending on how much shorthand they use. But for the purposes of this discussion it is good to distinguish.)

Add a single string shorthand

Probably 80% of the use cases are are just text prompts. We could support "the text" as a shorthand for

[{ role: "user", content: [{ type: "text", data: "the text" }]}]

and then support zero other shorthands. This might be a reasonable balance between ease of use and simplicity / precision for the nontrivial cases.

OpenAI-esque shorthands

The latest OpenAI responses API supports the equivalents of the following shorthands:

  • The single string shorthand
  • Array<{ role, content: string }>, where content: aString expands to content: [{ type: "text", data: aString }].

This maybe grabs another chunk of the simple use cases, without being too ambiguous. By disallowing multimodal prompts in the shorthand, it avoids the confusing case which this PR discusses.

This shorthand format also seems to be supported by the Python transformers library.

Why I stopped believing in defaults

It's tempting to go further.

  • Can't we just make the default role be "user", if none is provided?
  • Shouldn't we allow { content: "a string" }, without requiring type: "text"? Surely it's obvious that a string is text, right?
  • For cases where we accept an array, shouldn't we just accept a single item and convert that to an array for you?

But I think this puts us back into confusing territory, because now you can almost reproduce the problematic example:

const response = await session.prompt([
  { content: "Here is an image: " },
  { type: "image", content: imageBytes },
  { content: ". Please describe it." }
]);

This sends three separate user-role messages, which is not what the developer intended.


So I'm currently leaning toward stopping at OpenAI-esque shorthands. Let me create a second PR to see what people think.

domenic added a commit that referenced this pull request Mar 27, 2025
As discussed in #89 (comment), the shorthands can cause some confusion in this case.
domenic added a commit that referenced this pull request Mar 27, 2025
As discussed in #89 (comment), the shorthands can cause some confusion in this case.
@clarkduvall
Copy link

I agree with removing many of the shorthands to make things less confusing, the only one IMO that may be useful is to allow directly passing an object rather than an array for both top level and content fields. So allowing something like:

session.prompt({ role: "user", content: { type: "text", data: "the text" }});

Not sure if that form would be used enough to warrant allowing that though, so I don't feel too strongly about it.

@domenic
Copy link
Collaborator Author

domenic commented Mar 28, 2025

I think you're right that a single item -> array shorthand does not hurt. However, my instinct is to leave it out until someone asks for it. Especially since other APIs do not seem to support it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants