Llama 2 on YunoHost

Hi there,

I was wondering, is there anyone smarter than me who has figured out how to make Llama 2 into an app on YunoHost?

I have no idea how well it could be expected to perform on a typical entry-level SSD VPS from OVH.

Thanks for the help!

Hmm, that will depend a lot on your VPS. I have tried to run the smallest model on my local machine (an old iMac), but inference time is just unusable. On the other hand, I’ve been playing with RWKV. It’s an RNN-based architecture inspired by transformers that runs really fast on CPU. I’ve started working on a Flask API server so it can run on a VPS and serve inference to web projects, but there’s still a lot of work to do. In the meantime, I’m learning to use Weaviate to store document embeddings to enhance the responses. All of that should run with LangChain.
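
To give an idea of what that Flask wrapper could look like, here is a minimal sketch: a single /generate endpoint that takes a prompt and returns a completion. The endpoint name, JSON payload shape, and the generate() stub are my own assumptions; you would plug the actual RWKV (or other CPU) inference call in there.

```python
# Minimal sketch of the kind of Flask inference wrapper described above.
# The /generate endpoint and payload shape are illustrative assumptions;
# replace the generate() placeholder with your real CPU inference call (RWKV, etc.).
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate(prompt: str) -> str:
    # Placeholder: swap in the actual model call here.
    return f"(placeholder completion for: {prompt[:60]})"

@app.route("/generate", methods=["POST"])
def generate_endpoint():
    data = request.get_json(force=True)
    completion = generate(data.get("prompt", ""))
    return jsonify({"completion": completion})

if __name__ == "__main__":
    # Keep it on localhost and let the YunoHost reverse proxy sit in front of it.
    app.run(host="127.0.0.1", port=5000)
```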

In short, I believe that in the near future we could have open-source, transparent large language model applications that are self-hosted and run on CPU only. But there’s still a lot of work ahead if you don’t want to use cloud SaaS.

(Also for context, I’ve tried Llama 2 with q4_1 quantization. You could try a lower-bit quantization to get faster inference and lower RAM and CPU usage, but the lower you quantize, the worse the responses get. Also, so far the only non-cloud/non-GPU demos I’ve seen with fast response times were running on M2 Macs.)
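
For reference, this is roughly what CPU inference on a quantized model looks like with llama-cpp-python. This is a sketch, not from the original post: the model file name, thread count, and prompt are placeholders to adjust to your setup.

```python
# Sketch of CPU inference on a quantized model with llama-cpp-python.
# Picking a lower-bit quantization trades answer quality for speed and RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.q4_1.gguf",  # hypothetical local file
    n_ctx=2048,    # context window
    n_threads=4,   # match your VPS core count
)

out = llm("Q: What is YunoHost? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```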

Okay, so I’m back with good news: Mistral 7B Instruct runs great on CPU (with GGUF quantization). I’m currently working on a basic demo app with Streamlit to provide self-hosted LLM processing.
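
A minimal version of that kind of Streamlit demo could look like the sketch below, assuming llama-cpp-python loads the GGUF file. The model path, parameters, and prompt template are my assumptions, not the actual demo code.

```python
# Sketch of a Streamlit front-end over a local GGUF model via llama-cpp-python.
import streamlit as st
from llama_cpp import Llama

@st.cache_resource
def load_model():
    # Hypothetical local path to a quantized Mistral 7B Instruct GGUF file.
    return Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048, n_threads=4)

llm = load_model()

st.title("Self-hosted LLM demo")
prompt = st.text_area("Prompt")

if st.button("Generate") and prompt:
    # Mistral Instruct expects the prompt wrapped in [INST] ... [/INST].
    out = llm(f"[INST] {prompt} [/INST]", max_tokens=256)
    st.write(out["choices"][0]["text"])
```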

That would be awesome :heart_eyes: