Add safetensors support #17580
Conversation
safetensors files tend to be downloaded/pulled more often than gguf. This introduces support for both formats. It will help us run performance comparisons using the exact same models/files in vLLM and llama.cpp.
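A rough sketch of the kind of side-by-side comparison this is meant to enable, pointing both engines at the same safetensors repository (commands, ports, and the model name are illustrative; exact flags may differ between versions):

```sh
# Serve the same safetensors artifact with both engines (illustrative).
# vLLM loads the Hugging Face repo directly:
vllm serve HuggingFaceTB/SmolLM2-135M-Instruct --port 8000

# With this PR, llama.cpp can load the same safetensors repo without a GGUF conversion:
llama-server -hf HuggingFaceTB/SmolLM2-135M-Instruct --port 8080

# The same benchmark or request load can then be run against both endpoints.
```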
So we can load these natively just like gguf. Signed-off-by: Eric Curtin <[email protected]>
I think this is good for review now. I did most of the testing with: llama-server -hf HuggingFaceTB/SmolLM2-135M-Instruct
@ggerganov @CISC PTAL
Sooo,
This PR already has 7 positive emoji reactions (👍/🚀) in just a few days. That indicates a strong user desire to run safetensors models directly, without the intermediate conversion step. Allowing native safetensors loading drastically simplifies the workflow for new users: it removes the friction of needing convert-hf-to-gguf.py, making llama.cpp more of a "drop-in" replacement for other runners like vLLM, and it saves users from duplicating models that are often GBs in size. Being able to run the exact same file artifacts as other inference engines is also useful for fair and accurate performance comparisons.

To alleviate the stress you mentioned, I am happy to volunteer to help maintain these mappings or triage issues related to the safetensors loader if that helps get this across the line. Documenting that safetensors support hasn't been as battle-tested here as gguf might also be worthwhile. @ggerganov WDYT?
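For reference, a rough sketch of the conversion workflow being discussed, versus direct loading with this PR (script and tool names are from recent llama.cpp and may vary by version; file paths are illustrative):

```sh
# Current route: fetch the safetensors repo, convert to GGUF, optionally quantize, then run.
python convert_hf_to_gguf.py ./SmolLM2-135M-Instruct --outfile smollm2-135m-f16.gguf --outtype f16
llama-quantize smollm2-135m-f16.gguf smollm2-135m-q4_k_m.gguf Q4_K_M
llama-server -m smollm2-135m-q4_k_m.gguf

# With this PR: load the safetensors repo directly, with no intermediate GGUF on disk.
llama-server -hf HuggingFaceTB/SmolLM2-135M-Instruct
```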
I don't think it's worth the complexity. There are various tensor manipulations needed during conversion to GGUF that would be difficult to implement at this level. It would duplicate, or even exceed, the effort of maintaining the conversion. No mmap support. JSON bloat. Etc. What we could do in this direction is to think about abstracting the model loader and providing a public interface to implement it. This would allow 3rd parties to implement model loading and move this complexity outside of
I hope this is not a serious argument.
Just to be clear, do you want to open the door to running safetensors models (and other models) via the public interface you describe here?
My point wasn't to argue for merging based on popularity, but to highlight a clear signal of user friction. The engagement suggests that the current conversion step is a significant pain point. Do you feel otherwise?
The counter-arguments are:
If you want to discuss this feature, probably also tag all the maintainers who spend their effort maintaining the model loading infrastructure.
Most users don't understand the underlying workings and the maintainers' efforts to keep things as efficient as possible, so user friction is not the only indicator here. Obviously, throughout the discussion we mentioned twice how this can be stressful for human maintainers, and you don't seem to take this seriously.
A public loader interface sounds much more appealing.
I am hesitant to mass-tag people, as I recently received complaints about causing unnecessary notifications. As an aside, a list of who the maintainers are would be helpful; I'm not 100% sure anymore, although I have a rough idea. This file isn't 100% reflective of who the maintainers are: https://github.com/ggml-org/llama.cpp/blob/master/CODEOWNERS (for example, I am not a maintainer, but I think most if not all of the others listed there are). It would be nice to know. @CISC suggested something similar here:
I agree user friction is not the only indicator here. It's reasonable to bring it up as a discussion point; we often do.
To be clear, I am taking the feedback regarding maintainer burnout and architectural complexity seriously; in fact, I directly addressed it. My intention was not to dismiss the maintenance burden, but to ensure the user experience was represented in the trade-off analysis. We are discussing a potential interface, and I hope it doesn't upset people that we are having the discussion. I have noticed increasing stress in llama.cpp over the last few weeks in particular; it's not something I noticed when I first started contributing. I feel like I'm walking on eggshells somewhat with every pull request I open, and I certainly don't want to make your jobs harder. My goal is to help improve the project, not add to the burden.
That is not a feeling anyone should have, but at the same time maintainers have to deliver messages from time to time that can seem quite harsh, and often it's not really possible to do that in a universally "nice" way, certainly not in written form. The stress you've noticed is a manifestation of the increased workload; very few people are equipped to handle that without some adverse effect. It's important to keep in mind that any perceived slight or rejection is most likely neither intentional nor personal. At the end of the day everyone here is volunteering their time and experience as best they can, so take care, be mindful, and make sure it's something you enjoy the challenge of.
I believe there is a distinct difference between blunt technical feedback and personal condescension. Here are some examples. "While your AI is not capable enough to explain this to you @ericcurtin , I will..." is condescending and attacks me personally; it's from PR #17579, which was a draft PR at the time, and there wasn't even an active conversation on that PR. "Obviously... you don't seem to take this seriously.", from this PR, attacks my intent and professionalism. For some reason, there was even negative feedback given towards my PR in a conversation here: which is the most perplexing, since there hadn't been an active conversation on that PR since January. Is every PR I open being chased? This feedback is intentionally personal.
I am here to volunteer my time and help improve the project, just like everyone else. While I accept that stress runs high, maintaining a baseline of professional respect is essential for contributors and maintainers; it was like this when I first started contributing a year ago. I also think there should be conversations about how maintainers scale across the codebase, rather than having maybe one or two people looking at a large number of PRs; the project has grown beyond that stage, and it feels like there are fewer maintainers than before in a growing project. I am closing the PR; the conversation can continue if need be.
Ok, I agree, that was not a pleasant tone, but it was clearly made from a standpoint of exasperation. Given the context I would say it was not really directed towards you, but at the slog of reviewing AI-generated code lately. That does not excuse it though, as you are a long-time collaborator with valuable contributions.
Could not agree more; harder to realize, though.
Counter point, if I may: that comment was made more than 2 days after you were questioned on whether the code was written by AI, and you offered up no explanation other than "I mark it experimental for good reason". Since you didn't refute it (and AI-written, completely untested PRs are on the rise), ngxson understandably assumed that you had had an AI write the entire thing without reviewing the results. It was also an active PR until shortly before the comment, and multiple days after the initial comments.

Bear in mind as well that on open source projects like this, language barriers exist. People who don't natively speak English can very often come across as rude or condescending simply by not knowing how best to express themselves, so we should all take everything as professionally as possible until given a reason not to.

I think your biggest lesson from this should be to only publish draft PRs, not mark them ready for review until they work, and address feedback in a timely manner once a PR is open and marked for review. These people take time out of their day and development cycle to help get everyone's code in; their time is very valuable, and we should all treat it as such. Just my 2 cents, no offense intended; just trying to examine the situation as an outsider and provide feedback/context.