Llama 2 Everywhere (L2E)

LLamas Everywhere!

Standalone and 64bit Binary Portable Llama 2 Inference in one file of C

A friendly fork of the excellent llama2.c

Our goal is to mirror the progress of https://github.com/karpathy/llama2.c and add improvements such as OpenCL / Vulkan GPU inference, a web server, etc., which would not fit upstream due to the minimalism, simplicity, and elegance constraints there.

Features

Portability Features

  • Single Executable that runs on

    • GNU/Systemd
    • *BSD (FreeBSD, OpenBSD, NetBSD)
    • XNU's Not UNIX (Mac)
    • Bare Metal (Not fully functional yet but almost...)
    • Windows
  • Runs on ARM64 (aarch64), x86_64

  • Standalone

    • Embedded model

These combined features depend on a specific cosmocc toolchain: https://github.com/jart/cosmopolitan

Building this with gcc or clang instead results in normal platform-specific binaries, similar to upstream.
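
For reference, a plain gcc build roughly mirroring the upstream Makefile looks like this (the exact flags in this repo's Makefile may differ):

gcc -O3 -o run run.c -lm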

Read more: How to build

Performance Features

CPU

  • OpenBLAS
  • CBLAS
  • BLIS
  • Intel MKL (WIP)
  • ArmPL (WIP)
  • Apple Accelerate Framework (CBLAS)

CPU/GPU

  • OpenMP
  • OpenACC

Both the OpenMP and OpenACC builds currently run on the host CPU and do not offload to the GPU.

GPU

  • OpenCL (via CLBlast)
  • Vulkan
  • CUDA

Download the prebuilt run.com binary from the releases page.

llama2.c

Cute Llama

A friendly fork of the excellent llama2.c

The original repository offers a full-stack solution for training and running inference on the Llama 2 LLM architecture using PyTorch and a simple 500-line C file. The focus is on minimalism and simplicity, and the repo is a young project under active development. The author recommends looking at the TinyStories paper for inspiration, as small LLMs can perform strongly in narrow domains. The C inference engine in run.c was the main focus of the project; the Llama 2 architecture is hard-coded and there are no dependencies.

Feel the Magic

git clone https://github.com/trholding/llama2.c.git
cd llama2.c
make runfast
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
./run stories15M.bin

You can also prompt the model with a prefix:

./run stories42M.bin -t 0.8 -n 256 -i "A big dog"

When prompting, it is best to also pass the temperature (-t) and steps (-n) flags explicitly, as in the example above.

Output

A big dog named Zip. He loved to run and play in the sun. He was a happy dog. One day, Zip saw a little bird on the ground. The bird looked helpless. Zip wanted to help the bird. He ran to the bird and lay down next to it. Zip and the bird became friends. They played together every day. Zip would carry the bird to play in the trees. The bird would fly around, and Zip would bark. They were very happy together.

Models

The original author trained a series of small models on TinyStories, which took a few hours to train on their setup. The 110M model took around 24 hours. The models are hosted on huggingface hub:

model | dim | n_layers | n_heads | max context length | parameters | val loss | download
OG    | 288 | 6        | 6       | 256                | 15M        | 1.072    | stories15M.bin
42M   | 512 | 8        | 8       | 1024               | 42M        | 0.847    | stories42M.bin
110M  | 768 | 12       | 12      | 1024               | 110M       | 0.760    | stories110M.bin

The upstream project owner trained the llama2.c storyteller models on a 4x A100 40GB box provided by Lambda Labs.

Quick note on sampling: the recommendation for good results is to use -t 1.0 -p 0.9, i.e. top-p sampling at 0.9 with temperature 1.0 (this is the default). To control the diversity of samples, use either the temperature (vary -t between 0 and 1 and keep top-p off with -p 0) or the top-p value (vary -p between 0 and 1 and keep -t at 1), but not both.
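
For example, assuming the upstream -p flag is available in this build (model file and prompt are illustrative):

./run stories110M.bin -t 1.0 -p 0.9 -i "Once upon a time"

or, for temperature-only sampling:

./run stories110M.bin -t 0.8 -p 0 -i "Once upon a time"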

A converted Meta Llama 2 7B model can also be run, though inference is slow:

./run llama2_7b.bin

Usage

Full Usage

./run <checkpoint_file> -t [temperature] -n [steps] -b [buffertokens] -i [prompt]

Where

<checkpoint_file> is the mandatory checkpoint / model file,
e.g. stories15M.bin, stories42M.bin, or stories110M.bin.

-t is the optional temperature, in the range 0.0 to 1.0, with a default of 0.9.
A temperature of 0 makes outputs for the same (or no) prompt reproducible.

-n is the optional number of steps, in the range 1 to 256, with a default of 256.
A value of 0 sets it to the model's context length.
This option defines the number of tokens to infer and output.

-b is the optional number of tokens to buffer, in the range 1 to the model's context length, with a default of 1.
A value of 0 sets it to the model's context length.
Buffering improves interactive performance. Try values such as 4, 8, 16, 32, 64, 128 ... YMMV!

-i is the optional input prompt, such as "A car", used to guide inference.
If omitted, the model will generate on its own.
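
Putting these options together, an illustrative full invocation (values are examples only):

./run stories110M.bin -t 0.8 -n 256 -b 8 -i "A big dog"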

Minimal Usage

./run <checkpoint_file>

Platforms

Multi OS build

make cosmorun

The binary will boot on bare metal and also run on any 64-bit OS such as Linux, *BSD, or Windows; it also runs (more slowly, via emulation) on aarch64 Mac and Linux.

Currently, when used to boot bare metal, it won't be able to find the models. This is a toolchain issue with an anticipated fix.

The performance of this build is more than twice that of the basic build.
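
If your shell refuses to execute the APE directly (a known quirk with Actually Portable Executables on some systems), launching it through sh is the usual workaround:

sh ./run.com stories15M.bin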

Linux

Centos 7 / Amazon Linux 2018

make rungnu or make runompgnu to use OpenMP.

Other Linux Distros / Mac

make runfast or make runomp to use OpenMP.

Windows

Build on Windows:

build_msvc.bat in a Visual Studio Command Prompt

The MSVC build uses OpenMP and the maximum number of threads suitable for your CPU unless you set the OMP_NUM_THREADS environment variable.
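
For example, to cap the thread count in a Command Prompt session before running the MSVC build (the binary name run.exe is assumed here):

set OMP_NUM_THREADS=8
run.exe stories15M.bin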

Build on Linux and Windows:

make win64 to build with the MinGW compiler toolchain.

Performance

Basic

This build does not enable any optimizations.

make run

This can be used as the baseline build against which the performance of other builds can be compared.

Fast

This build enables a basic performance boost with compiler-provided optimizations.

make runfast

Build with Acceleration

OpenMP

This build enables acceleration via OpenMP

make runomp

Requires OpenMP libraries and a compiler with OpenMP support to be available on the system, e.g. apt install clang libomp-dev on Ubuntu.

When you run inference, make sure to set the number of threads via the OpenMP environment variables, e.g.:

OMP_NUM_THREADS=4 ./run out/model.bin

Using more threads is not always faster.
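
For context, OpenMP acceleration comes from parallelizing the hot matrix-vector multiply across threads. A minimal sketch of the idea (not necessarily the exact code in run.c); the pragma is simply ignored when the compiler is invoked without OpenMP support:

/* xout (d,) = w (d,n) @ x (n,) -- the dominant cost in transformer inference */
void matmul(float *xout, const float *x, const float *w, int n, int d) {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}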

OpenACC

This build enables acceleration via OpenACC

make runoacc

Requires OpenACC libraries and a compiler with OpenACC support to be available on the system.

OpenBLAS

This build enables acceleration via OpenBLAS

make runopenblas

Requires OpenBLAS to be installed on the system.
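
On Debian or Ubuntu, for example, the development package can typically be installed with:

sudo apt install libopenblas-dev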

BLIS

This build enables acceleration via BLIS

make runblis

Requires BLIS compiled with ./configure --enable-cblas -t openmp,pthreads auto to be installed on the system.

Intel oneAPI MKL

This build enables acceleration via Intel® oneAPI Math Kernel Library on x86_64 systems and Intel Mac OS - WIP

make runmkl

Requires Intel oneAPI MKL to be installed on the system.
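
With a default oneAPI installation, the MKL build environment is typically activated by sourcing the setvars script before building (path assumes the standard install location):

source /opt/intel/oneapi/setvars.sh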

Arm Performance Library (ArmPL)

This build enables acceleration via the Arm Performance Library on ARM64 systems running Linux or Mac OS - WIP

make runarmpl

Requires ArmPL to be installed on the system.

Apple Accelerate

This build enables BLAS acceleration via Apple Accelerate on Mac OS - Testing

make runaccel

Requires Apple Accelerate to be available on the system.

Note: Needs testing.

Generic CBLAS

This build enables acceleration with any Netlib CBLAS interface compatible libraries

make runblas

Requires any BLAS library that provides the Netlib CBLAS interface, such as LAPACK, to be installed on the system.

CLBlast (GPU/OpenCL)

This build enables tuned GPU acceleration via OpenCL with CLBlast

make runclblast

Requires CLBlast compiled with cmake -DNETLIB=ON to be installed on the system.

Note: Currently runs much slower than the CPU builds! This requires investigation; memory I/O may be a bottleneck on the test system.
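
A rough sketch of building and installing CLBlast with its Netlib CBLAS API enabled (steps follow the CLBlast project's CMake workflow; adjust paths and options to your system):

git clone https://github.com/CNugteren/CLBlast
cd CLBlast
cmake -S . -B build -DNETLIB=ON
cmake --build build
sudo cmake --install build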

Portable Binary Build

Have you ever wanted to run inference on a baby Llama 2 model with a single executable on any OS, or even as the OS? No? Well, now you can!

By using the Cosmopolitan libc toolchain to build llama2.c, we get the following features:

  • Executable that runs on

    • GNU/Systemd
    • FreeBSD
    • OpenBSD
    • NetBSD
    • XNU's Not UNIX
    • Bare Metal (-D COSMO_METAL) (*Not fully functional yet)
    • Windows
  • Runs on

    • ARM64 via Blink x86-64 emulation (-D COSMO_BLINK) (Slow)
    • x86_64
  • Standalone

    • Embedded model in executable (-D COSMO_ZIP)

Instructions

Get and build the Cosmopolitan libc toolchain:

Follow the instructions at https://github.com/jart/cosmopolitan

Or do:

sudo mkdir -p /opt
sudo chmod 1777 /opt
git clone https://github.com/jart/cosmopolitan /opt/cosmo
cd /opt/cosmo
make -j8 toolchain
mkdir -p /opt/cosmos/bin
export PATH="/opt/cosmos/bin:$PATH"
echo 'PATH="/opt/cosmos/bin:$PATH"' >>~/.profile
sudo ln -sf /opt/cosmo/tool/scripts/cosmocc /opt/cosmos/bin/cosmocc
sudo ln -sf /opt/cosmo/tool/scripts/cosmoc++ /opt/cosmos/bin/cosmoc++

Example build to generate an Actually Portable Executable (APE):

make cosmorun

Run it, or copy it to any supported system, and run:

If the model is embedded:

./run.com

Otherwise:

./run.com model.bin
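
Since an APE is also a valid zip archive, one possible way to embed a model (assuming a -D COSMO_ZIP build reads these file names from the executable's zip store) is to append the files with the standard zip tool:

zip run.com stories15M.bin tokenizer.bin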

TODO

  • CLI Chat
  • Web Chat
  • Fix baremetal cosmo boot model loading
  • Alt model embedding
  • NetBSD Rump Kernel Boot (R&D)
  • Unikraft unikernel Boot (R&D)
  • GNU/Linux Minimal Boot
  • Intel MKL Acceleration (WIP)
  • Arm Performance Libraries (WIP)
  • Apple Accelerate BLAS (Testing)
  • EFI Capsule
  • OpenCL pure
  • Vulkan
  • CUDA
  • OpenMP SIMD?
  • Optimize OpenMP & OpenACC (WIP)
  • Documentation (WIP)
  • Clang builds (Makefile)
  • Quantization (16, 4, 2)
  • Minimize Code
  • Split extras into conditional header file
  • Update Github CI/CD workflow
  • Apply changes to closely resemble upstream
  • Raspi Zero Demo

Changelog

See commits.

Contributing

  • All pull requests that are merged upstream will be automatically applied here, as we closely mirror upstream.
  • I merge pull requests that improve performance, even if they were rejected upstream.
  • Performance and usability improvement contributions are welcome.

Developer Status

See "Developer Status" issue.

Current status: Busy since ~Aug 6, 2023, away on bigger IRL projects; just merging for now. Issues will be addressed after ~7 days.

Notable projects

License

MIT