Conversation

@lesters commented May 21, 2024

This adds the option of applying a chat template, such as those found in GGUF files, to the prompt before generating tokens. This is a suggestion, as I couldn't find a way of doing this in the code today.
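To make the idea concrete, here is a rough standalone sketch of what applying a chat template to a list of messages amounts to. This is not the code in this PR; the ChatML-style tokens and the class and method names are only illustrative, since the real template comes from the GGUF metadata and is applied on the C++ side.

```java
import java.util.List;
import java.util.Map;

// Illustration only, not the PR code: a chat template turns a list of
// role/content messages into the single prompt string the model expects.
public class ChatTemplateSketch {

    // Formats messages using the ChatML convention, one of the formats that
    // GGUF chat templates commonly encode.
    static String applyChatMlTemplate(List<Map<String, String>> messages) {
        StringBuilder prompt = new StringBuilder();
        for (Map<String, String> message : messages) {
            prompt.append("<|im_start|>")
                  .append(message.get("role"))
                  .append("\n")
                  .append(message.get("content"))
                  .append("<|im_end|>\n");
        }
        // Leave the assistant turn open so generation continues from here.
        prompt.append("<|im_start|>assistant\n");
        return prompt.toString();
    }

    public static void main(String[] args) {
        List<Map<String, String>> messages = List.of(
                Map.of("role", "system", "content", "You are a helpful assistant."),
                Map.of("role", "user", "content", "What is a GGUF file?")
        );
        System.out.println(applyChatMlTemplate(messages));
    }
}
```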

@kherud (Owner) commented May 21, 2024

Hey @lesters, thanks for the pull request! This was indeed not possible before, so it's a nice addition.

In general, it's best to align the C++ code as closely as possible with the llama.cpp server code, since that makes it easier to maintain in the long term. There, I think, the chat template is loaded once when initializing a model (so via ModelParameters instead of InferenceParameters). Do you think it's necessary to be able to change the template for each inference?

However, the server has a separate endpoint for chat completions, which the Java binding doesn't have. So to still be able to choose for each inference whether to use the chat template, I think using something like your PARAM_USE_CHAT_TEMPLATE is the way to go.
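To spell out that split, here is a rough sketch; none of these names are the binding's real API. It only illustrates loading the template once at model initialization and toggling its use per inference with something like PARAM_USE_CHAT_TEMPLATE:

```java
// Hypothetical sketch of the discussed design, not the binding's real API:
// the template is resolved once when the model is created, and each request
// only decides whether it is applied.
public class ChatTemplateDesignSketch {

    static class ModelParams {
        String chatTemplate; // loaded once, e.g. read from the GGUF metadata
    }

    static class InferenceParams {
        boolean useChatTemplate = true; // per-request toggle (PARAM_USE_CHAT_TEMPLATE)
        String prompt;
    }

    // In the real binding the template is applied on the C++ side against the
    // model's template string; here the transformation is only marked.
    static String buildPrompt(ModelParams model, InferenceParams request) {
        if (request.useChatTemplate && model.chatTemplate != null) {
            return "[" + model.chatTemplate + "] " + request.prompt;
        }
        return request.prompt;
    }

    public static void main(String[] args) {
        ModelParams model = new ModelParams();
        model.chatTemplate = "chatml";

        InferenceParams request = new InferenceParams();
        request.prompt = "What is a GGUF file?";

        System.out.println(buildPrompt(model, request)); // templated
        request.useChatTemplate = false;
        System.out.println(buildPrompt(model, request)); // raw prompt
    }
}
```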

I will review the PR in more detail tomorrow.

@lesters (Author) commented May 22, 2024

Thanks, @kherud. I think you are right; this probably belongs in ModelParameters, as you most likely don't want to change it for each inference. I'll make the change.

@kherud merged commit 50c85b7 into kherud:master May 22, 2024
@kherud (Owner) commented May 22, 2024

Looks good to me, thanks for the work.

@lesters deleted the generate-with-chat-template branch May 23, 2024