How to run Open Source LLMs locally using Llamafile

Llamafile is a framework developed by Mozilla that simplifies the distribution and execution of Large Language Models (LLMs) by packaging them into a single executable file. This file, known as a “llamafile,” combines the model weights and a specially-compiled version of llama.cpp with Cosmopolitan Libc, allowing the model to run locally on most computers without the need for additional dependencies or installations.

Key Features of Llamafile:

Single-File Executable: Llamafile collapses the complexity of LLMs into a single file that can be easily distributed and run on various operating systems.
Local Execution: The llamafile runs entirely on local hardware, providing a secure and private way to use LLMs without relying on cloud services.
Web UI and API: When started, llamafile hosts a web UI chat server and provides an OpenAI API-compatible chat completions endpoint, enabling local interaction and integration with other applications.
Performance Optimizations: Llamafile includes performance improvements for CPU and GPU inference, making it efficient for local AI applications.
Model Compatibility: Llamafile supports a variety of models, including those available on HuggingFace, and can be used for tasks like chat, image analysis, and more.

How Llamafile works

A llamafile is an executable LLM that you can run on your own computer. It contains the weights for a given open LLM, as well as everything needed to actually run that model on your computer. There’s nothing to install or configure (with a few caveats, discussed in subsequent sections of this document).

This is all accomplished by combining llama.cpp with Cosmopolitan Libc, which provides some useful capabilities:

1) Llamafiles can run on multiple CPU microarchitectures. The llamafile team has added runtime dispatching to llama.cpp, which lets new Intel systems use modern CPU features without trading away support for older computers.

2) Llamafiles can run on multiple CPU architectures. This is done by concatenating AMD64 and ARM64 builds with a shell script that launches the appropriate one. The llamafile file format is compatible with WIN32 and most UNIX shells. It can also be easily converted (by either you or your users) to the platform-native format whenever required.

3) Llamafiles can run on six OSes (macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD). If you make your own llamafiles, you’ll only need to build your code once, using a Linux-style toolchain. The GCC-based compiler provided by the llamafile team is itself an Actually Portable Executable, so you can build your software for all six OSes from whichever one you prefer for development.

4) The weights for an LLM can be embedded within the llamafile. The llamafile team has added support for PKZIP to the GGML library. This lets uncompressed weights be mapped directly into memory, similar to a self-extracting archive. It enables quantized weights distributed online to be prefixed with a compatible version of the llama.cpp software, thereby ensuring that the originally observed behavior can be reproduced indefinitely.

5) Finally, with the tools included in this project you can create your own llamafiles, using any compatible model weights you want. You can then distribute these llamafiles to other people, who can easily make use of them regardless of what kind of computer they have.
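The idea behind point 4, that an uncompressed (stored) ZIP entry can be mapped into memory without extracting it, can be illustrated with a short Python sketch. This is not llamafile's implementation (that lives in C/C++ inside GGML); the function name map_stored_member and the archive layout are assumptions made only for illustration.

```python
import mmap
import struct
import zipfile


def map_stored_member(archive_path, member_name):
    """Memory-map a ZIP archive and locate one uncompressed member in it.

    Returns (mapping, start, end) so that mapping[start:end] is the
    member's raw bytes. Hypothetical helper, for illustration only.
    """
    with zipfile.ZipFile(archive_path) as zf:
        info = zf.getinfo(member_name)
        if info.compress_type != zipfile.ZIP_STORED:
            raise ValueError("member is compressed and cannot be mapped directly")

    with open(archive_path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # A ZIP local file header has 30 fixed bytes; the file-name length is
    # stored at offset 26 and the extra-field length at offset 28.
    name_len, extra_len = struct.unpack_from("<HH", mapped, info.header_offset + 26)
    start = info.header_offset + 30 + name_len + extra_len
    return mapped, start, start + info.file_size
```

Because the entry is stored rather than compressed, its bytes sit in the file exactly as they would in a standalone weights file, so the operating system can page them in on demand instead of the program copying them into memory first.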

How to Use Llamafile:

Download: Obtain a llamafile for the desired model from repositories like HuggingFace.
Make Executable: Ensure the file is executable by adjusting permissions (e.g., using chmod on macOS, Linux, or BSD) or by renaming it with a .exe extension on Windows.
Run: Execute the llamafile to start the model server and access the web UI or API endpoints.
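If you would rather script the “Make Executable” step than do it by hand, a minimal Python sketch might look like the following. The helper name prepare_llamafile is hypothetical; it sets the execute bits on POSIX systems (roughly what chmod +x does) and appends .exe on Windows.

```python
import os
import stat


def prepare_llamafile(path):
    """Make a downloaded llamafile runnable and return its (possibly new) path.

    Hypothetical helper: on Windows the file is renamed to end in .exe,
    elsewhere the execute permission bits are added.
    """
    if os.name == "nt":
        if path.lower().endswith(".exe"):
            return path
        new_path = path + ".exe"
        os.rename(path, new_path)
        return new_path

    mode = os.stat(path).st_mode
    # Add execute permission for user, group, and others.
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path
```

For example, prepare_llamafile("mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile") would return the path you can then run from the terminal.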

How to set up

1) Download the LLM model in the llamafile format.

Example: https://huggingface.co/Mozilla/Mixtral-8x7B-Instruct-v0.1-llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile?download=true

2) Open your computer’s terminal.

3) If you’re using macOS, Linux, or BSD, you’ll need to grant permission for your computer to execute this new file. (You only need to do this once.)

chmod +x mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile

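4) If you’re on Windows, rename the file by adding “.exe” to the end. (Note that Windows limits executables to 4 GB, so a llamafile as large as this Mixtral example cannot be run this way; pick a smaller model on Windows.)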

5) Run the llamafile, for example:

./mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile

6) Your browser should open automatically and display a chat interface. (If it doesn’t, open your browser yourself and point it at http://localhost:8080.)

7) When you’re done chatting, return to your terminal and hit Control-C to shut down llamafile.
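Large models can take a while to load, so a script that talks to the server may need to wait until it is actually listening. The sketch below polls the address from step 6 using only the Python standard library; the function name wait_for_server and its default timeout values are assumptions, not part of llamafile.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8080", timeout=30.0, interval=0.5):
    """Poll url until something answers over HTTP or timeout seconds pass.

    Returns True as soon as the server responds, False if it never does.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval + 1):
                return True
        except urllib.error.HTTPError:
            # The server replied, just with an error status: it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Nothing listening yet (or connection dropped): retry shortly.
            time.sleep(interval)
    return False
```

Start the llamafile in one terminal, then call wait_for_server() from your script before sending any requests.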

Python API Example

If you are familiar with the popular OpenAI Python package (installed with pip install openai), you can use the same package and point it at the local LLM served by your llamafile instead of at OpenAI’s servers.

#!/usr/bin/env python3
from openai import OpenAI

# Point the client at the local llamafile server rather than OpenAI's hosted API.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # or "http://<your API server IP>:<port>/v1"
    api_key="sk-no-key-required",  # placeholder; the client requires some value
)
completion = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[
        {"role": "system", "content": "You are TecheonsGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"},
    ],
)
# Print only the generated text rather than the whole message object.
print(completion.choices[0].message.content)
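If you would rather not install the openai package, the same chat completions endpoint can be called with the Python standard library. The sketch below assumes the server is reachable at the base URL used above; build_chat_request and chat are hypothetical helper names, and the request body simply mirrors the fields passed in the example above.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, model="LLaMA_CPP"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key, same as in the OpenAI client example.
            "Authorization": "Bearer sk-no-key-required",
        },
        method="POST",
    )


def chat(base_url, messages):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(base_url, messages)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a llamafile running, chat("http://localhost:8080/v1", [{"role": "user", "content": "Write a limerick about python exceptions"}]) should return the model's reply as a string.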
