Add multiple modalities in a single message #89
First, let's try to have some discussion on alternatives. Here are some choices in the design space:

Always require the full format

The base format for a prompt is:

const response = await session.prompt([
  { role: "user", content: [{ type: "text", data: "the text" }]}
]);

(NOTE: I am distinguishing here between

Add a single string shorthand

Probably 80% of the use cases are just text prompts. We could support a single string as shorthand for [{ role: "user", content: [{ type: "text", data: "the text" }]}] and then support zero other shorthands. This might be a reasonable balance between ease of use and simplicity / precision for the nontrivial cases.

OpenAI-esque shorthands

The latest OpenAI responses API supports the equivalents of the following shorthands:
This maybe grabs another chunk of the simple use cases, without being too ambiguous. By disallowing multimodal prompts in the shorthand, it avoids the confusing case which this PR discusses. This shorthand format also seems to be supported by the Python

Why I stopped believing in defaults

It's tempting to go further.

But I think this puts us back into confusing territory, because now you can almost reproduce the problematic example:

const response = await session.prompt([
  { content: "Here is an image: " },
  { type: "image", content: imageBytes },
  { content: ". Please describe it." }
]);

This sends three separate user-role messages, which is not what the developer intended. So I'm currently leaning toward stopping at OpenAI-esque shorthands. Let me create a second PR to see what people think.
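
To make the ambiguity concrete, here is a minimal sketch of how an implementation might expand per-message defaults. `expandWithDefaults` is a hypothetical helper, not part of the proposed API. It assumes that a missing `role` defaults to "user" and a missing `type` defaults to "text", and `imageBytes` is just placeholder data.

```javascript
// Hypothetical helper (not part of the proposed API): expands shorthand
// messages into the full format.
// Assumption: missing role -> "user", missing type -> "text".
function expandWithDefaults(messages) {
  return messages.map(({ role = "user", type = "text", content }) => ({
    role,
    content: [{ type, data: content }],
  }));
}

const imageBytes = new Uint8Array([0x89, 0x50, 0x4e, 0x47]); // placeholder bytes

// The shorthand from the example: each array entry becomes its own message.
const expanded = expandWithDefaults([
  { content: "Here is an image: " },
  { type: "image", content: imageBytes },
  { content: ". Please describe it." },
]);

// What the developer most likely meant: one message with three content parts.
const intended = [
  {
    role: "user",
    content: [
      { type: "text", data: "Here is an image: " },
      { type: "image", data: imageBytes },
      { type: "text", data: ". Please describe it." },
    ],
  },
];

console.log(expanded.length); // 3 separate user-role messages
console.log(intended.length); // 1 message
```

Under these assumed defaults, nothing in the shorthand signals that the three entries belong together, so the expansion cannot recover the intended single message.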
As discussed in #89 (comment), the shorthands can cause some confusion in this case.
I agree with removing many of the shorthands to make things less confusing. The only one that, IMO, may be useful is to allow directly passing an object rather than an array for both top level and
Not sure if that form would be used enough to warrant allowing that though, so I don't feel too strongly about it.
I think you're right that a single item -> array shorthand does not hurt. However, my instinct is to leave it out until someone asks for it. Especially since other APIs do not seem to support it.
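
For reference, a single item -> array shorthand could be normalized with very little code. The sketch below is only an illustration: `wrapSingle` and `normalizePrompt` are hypothetical names, and it assumes the shorthand would apply both to the top-level message list and to each message's `content`.

```javascript
// Hypothetical helper: wraps a lone item in an array, leaves arrays untouched.
function wrapSingle(value) {
  return Array.isArray(value) ? value : [value];
}

// Hypothetical normalizer. Assumption: the shorthand is accepted at the top
// level (one message instead of a list) and for `content` (one part instead
// of a list).
function normalizePrompt(input) {
  return wrapSingle(input).map((message) => ({
    role: message.role,
    content: wrapSingle(message.content),
  }));
}

// A single message object with a single content part...
const normalized = normalizePrompt({
  role: "user",
  content: { type: "text", data: "the text" },
});

// ...expands to the full format used elsewhere in this thread.
console.log(JSON.stringify(normalized));
```

Unlike the defaulting case, this expansion is unambiguous: an item is wrapped only when it is not already an array, so the full format passes through unchanged.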
I'm posting this as a draft, because I'm starting to believe this illustrates that all our shorthands are getting too complicated and error-prone. The difference between the two examples in the PR is easy to miss, and I'm not very happy about this API shape now.
I have an alternative proposal: let's get rid of a lot of the defaulting and overloads, and just require verbose prompting almost all of the time.
I will propose an alternative PR with that.