With all of the hype surrounding OpenAI’s ChatGPT product and its rapid development over the last year, it has been fascinating to learn about the technology facilitating these AI breakthroughs, known as Large Language Models (LLMs). Even though this feels like one of the most exciting times in technological development since the invention of the Internet, I know plenty of people who barely use ChatGPT, or have never used it at all.

When I first started learning about ChatGPT, I had no idea that it was possible to run the same technology on my own PC. However, when Meta’s research LLM “LLaMA” leaked to the general public, it set off a wildfire of open-source LLM research and hype from people who had never been into machine learning or the AI space before. It is not a particularly “user-friendly” space when it comes to proper documentation and easy-to-use tools, but as more and more people get involved with running LLMs locally, things have started improving on that front.

To keep this functional and get you up and running ASAP, I am not going to dive into every possible detail or bit of context related to local LLMs in this short guide. I just want to help people explore this fascinating technology and realize how important it really is.

Benefits of a Local Model

While the models available from companies like OpenAI, Microsoft, and Google certainly produce better results than most local models, any data you provide to them via chat input or uploaded files/images can be used for training or whatever other purposes they desire. You are beholden to their terms of service, so your ability to freely express yourself is limited, and the models themselves will refrain from discussing certain things due to the controls these companies place on their content. By having a local LLM available, you gain the following:

  • Total control of your data: Nothing communicates outside of your computer. You can configure the model to be available to others on your local network, but the key point is that your data stays on whichever machine the model is actually loaded and running on.

  • No censoring (optional): One of the most apparent qualities of the original LLaMA leak (and LLaMA 2 later on) was that it was completely unfiltered and uncensored when responding to prompts. This was intentional on Meta’s part, and it has been greatly lauded by the local LLM community. However, there are plenty of models, such as Vicuna and Mistral, which have alignment and content guards if you don’t mind censored output.

Setup: GGUFs and text-generation-web-ui

To start, download the items below. For reference, my specs are a Ryzen 7 5700X, 64 GB of RAM, and an NVIDIA GeForce RTX 3080 (10 GB VRAM) GPU:

  • OpenHermes-2.5-Mistral-7B-16K-GGUF by NurtureAI: This is my favorite 7B model thus far, and because TheBloke quantized it, it can run comfortably on basically any PC with 16 GB of RAM. The “16K” in the name refers to the context window: the number of “tokens” the model can hold in its “memory” during a single conversation (see the short sketch after this list for a concrete illustration).

  • text-generation-web-ui by oobabooga: The one-click installer files in this are really solid now and make maintaining this frontend a breeze, which is why I use it. It also feels like the least bloated and cleanest of the various frontends, though that is just my personal opinion.
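If you are curious what “tokens” and “context” look like in practice, here is a minimal sketch using the llama-cpp-python library (the same llama.cpp-based loader text-generation-web-ui relies on for GGUF files, as far as I understand). The model filename is a placeholder for whichever quantization you actually downloaded.

```python
from llama_cpp import Llama

# Placeholder filename: point this at the GGUF file you downloaded.
llm = Llama(
    model_path="openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf",
    n_ctx=16384,  # the "16K" context window
    verbose=False,
)

# Tokenize a sentence to see how text maps onto tokens.
text = "The context window limits how much of a conversation the model can remember."
tokens = llm.tokenize(text.encode("utf-8"))
print(f"{len(tokens)} tokens out of a {llm.n_ctx()}-token context window")
```

Roughly speaking, once a conversation grows past that token budget, the oldest messages start falling out of the model’s “memory.”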

Usage and Tips

  1. Extract the “text-generation-web-ui” folder, and run the “start_windows.bat” file. Answer the prompts according to your hardware and let the installer take care of the rest. Close the installer after it finishes.

  2. Cut/paste your GGUF file into the “models” directory inside your new text-generation-web-ui folder (the small check script after these steps can confirm the file ended up in the right place).

  3. Run “start_windows.bat” again.

  • On future runs, run “update_windows.bat” before running the “start_windows.bat” file. Updates land so frequently that it is essential to stay current, or you risk a broken install due to missing Python dependencies and the like.
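As a quick sanity check for step 2, something like the following sketch will list whatever GGUF files are sitting in the models directory. The folder path is a placeholder; adjust it to wherever you extracted text-generation-web-ui.

```python
from pathlib import Path

# Placeholder path: adjust to wherever you extracted text-generation-web-ui.
models_dir = Path("text-generation-webui/models")

ggufs = sorted(models_dir.glob("*.gguf"))
if not ggufs:
    print("No GGUF files found - did the model land in the right folder?")
for gguf in ggufs:
    size_gb = gguf.stat().st_size / 1e9
    print(f"Found {gguf.name} ({size_gb:.1f} GB)")
```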

text-generation-web-ui Config

  • I recommend Mirostat as the sampling method.

  • Change “Maximum tokens generated” to the maximum value. This ensures lengthy responses when they are needed. You can always hit the “Stop” button to kill a response mid-generation if it is going too long.

  • With 10 GB of VRAM, I usually put 25 layers on the GPU, and I match the CPU thread count to the number of physical cores on my processor. If you didn’t know already, hyperthreading technically gives you “double” the number of threads, but those extra logical threads share execution resources with the physical cores and add only a fraction of a real core’s throughput. Plus, you want your machine to remain usable while it is generating responses. The sketch below shows rough equivalents of these settings outside the web UI.
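For anyone who prefers to see these knobs as code, here is a minimal sketch of roughly equivalent settings using llama-cpp-python directly, assuming it is installed with GPU support. The model filename is a placeholder, and the numbers mirror my 10 GB VRAM, 8-core setup rather than anything universal.

```python
from llama_cpp import Llama

# Placeholder filename; the values mirror the web UI settings described above.
llm = Llama(
    model_path="openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf",
    n_ctx=16384,      # 16K context window
    n_gpu_layers=25,  # layers offloaded to a 10 GB GPU
    n_threads=8,      # physical cores, not logical (hyperthreaded) threads
)

output = llm(
    "Explain what a context window is, in one paragraph.",
    max_tokens=-1,    # no hard cap, similar to maxing out "Maximum tokens generated"
    mirostat_mode=2,  # Mirostat 2.0 sampling
    mirostat_tau=5.0,
    mirostat_eta=0.1,
)
print(output["choices"][0]["text"])
```

As I understand it, Mirostat steers generation toward a target level of “surprise” (tau) instead of relying on fixed temperature and top-p values, which is part of why it behaves well across different kinds of prompts.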