Llama 2 Everywhere (L2E)

LLamas Everywhere!

Standalone, Binary Portable, Bootable Llama 2

The primary objective of Llama 2 Everywhere (L2E) is to ensure its compatibility across a wide range of devices, from booting on repurposed Chromebooks discarded by school districts to high-density unikernel deployments in enterprises.

We believe that in the future, by harnessing a legion of small, specialized LLMs with modest hardware requirements that are networked, distributed and self-coordinated, L2E has the potential to democratize access to AI and unlock collective intelligence that surpasses that of a single large LLM.

A compelling current use case for L2E is training small models on diverse textual sources, including textbooks, open books, and comprehensive corpora like SlimPajama, and deploying them with L2E as bootable instances on outdated school computers. This deployment scenario proves particularly valuable in school libraries or classrooms where internet connectivity is limited or unavailable, letting such a machine serve as an information gateway* for students without constant reliance on the internet.

By pursuing the vision of Llama 2 Everywhere, we aim to create an inclusive AI ecosystem that can adapt to diverse environments and empower individuals and communities on a global scale.

My research goal is to train models on various kinds of hardware telemetry data, with the hope that the models learn to interpret sensor inputs and control actuators based on the insights they glean from them. This research direction may open up exciting possibilities in fields such as automation, space, robotics and IoT, where L2E can play a pivotal role in bridging the gap between AI and physical systems.

A friendly fork of the excellent @karpathy's llama2.c

I will be mirroring the progress of https://github.com/karpathy/llama2.c every week, adding portability and performance improvements and convenience features such as a web interface, which certainly would not fit upstream due to the minimalistic elegance requirements there.

* How do we make sure that the output is factual and not hallucinated?

It's a chicken and egg problem. This has to be explored and figured out along the way. Some ideas in mind are:

  1. Topic-specialized models which are frequently updated, perhaps every month or two.
  2. Fact-checking & moderation specialized models which moderate or fact-check other models' output.
  3. Reduce / mitigate hallucinations through output validation (both neural and rule-based).
  4. Prompt rewriting, both neural and rule-based.
  5. Educators / students / users can flag answers. Administrators could update rules.

Features

NEW - L2E OS (Linux Kernel)

Have you ever wanted to really boot and inference a baby Llama 2 model on a computer? No? Well, now you can!


Probably Releasing today!

NEW - Linux Kernel Module

Have you ever wanted to do cat /dev/llama and echo "Sudo make me a sandwich!" > /dev/llama, or pass a kernel parameter such as l2e.quest="What is the meaning of life?" ? No? Well, as luck would have it, now you can!
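
Once the (WIP) module is loaded, interacting with it would look roughly like the following sketch; the device node and the l2e.quest parameter name are taken from the description above and may change before release:

# read generated text from the device node
cat /dev/llama

# feed the model a prompt through the device node
echo "Sudo make me a sandwich!" > /dev/llama

# or pass a prompt at boot time by appending a kernel parameter
# to the kernel command line, e.g.:
#   l2e.quest="What is the meaning of life?"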


Probably Releasing today! (WIP)

NEW - Unikernel Build

Have you ever wanted to boot and inference a herd of 1000s of virtual baby Llama 2 models on big ass enterprise servers? No? Well, now you can!


Just do the following to build:

make run_unik_qemu_x86_64

Please note that the requirements - unikraft and musl sources - will be cloned automatically before building.

Once the build completes (it takes a while), run L2E like this:

qemu-system-x86_64 -m 256m -accel kvm -kernel build/L2E_qemu-x86_64

You can also run with the -nographic option to interact directly in the terminal.

qemu-system-x86_64 -m 256m -accel kvm -kernel build/L2E_qemu-x86_64 -nographic
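
To get a feel for running a small herd, you can also launch several instances in parallel. A minimal sketch, assuming the same kernel image and enough host memory (each guest reserves 256 MB):

# launch four unikernel instances in the background, each logging to its own file
for i in $(seq 1 4); do
  qemu-system-x86_64 -m 256m -accel kvm -kernel build/L2E_qemu-x86_64 -nographic > l2e_$i.log 2>&1 &
done
wait   # block until all instances exit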

Download and try this and the cosmocc build in the latest release.

Portability Features

  • Single Executable that runs on any x86_64 OS (cosmocc builds)
      • GNU Linux
      • GNU/Systemd
      • *BSD (NetBSD, OpenBSD, FreeBSD)
      • XNU's Not UNIX (Mac)
      • Bare Metal Boot (BIOS & EFI) (Not fully functional yet but almost...)
      • Windows
      • Runs on ARM64 via inbuilt BLINK emulation
  • Standalone
      • Embedded model and tokenizer via ZipOS (cosmocc), INCBIN, strliteral
  • Usability
      • Hacky CLI Chat - use any _incbin, _strlit or _zipos build.

Some combined features depend on a specific cosmocc toolchain: https://github.com/jart/cosmopolitan

Building this with gcc or clang would result in normal binaries similar to upstream.

Read more: How to build

Performance Features

CPU

  • OpenBLAS
  • CBLAS
  • BLIS
  • Intel MKL (WIP)
  • ArmPL (WIP)
  • Apple Accelerate Framework (CBLAS) (WIP/Testing)

CPU/GPU

  • OpenMP
  • OpenACC

Both OpenMP and OpenACC builds currently use host CPU and do not offload to GPU.

GPU

  • OpenCL (via CLBlast) (Direct - planned)
  • OpenGL
  • Vulkan
  • CUDA

Download the prebuilt run.com binary from releases

llama2.c


A friendly fork of the excellent llama2.c

The original repository offers a full-stack solution for training and running inference on the Llama 2 LLM architecture using PyTorch and a simple 500-line C file. The focus is on minimalism and simplicity, and the repo is a young project that is still being actively developed. The author recommends looking at the TinyStories paper for inspiration, as small LLMs can have strong performance in narrow domains. The C inference engine in run.c was the main focus of the project, and the Llama 2 architecture is hard-coded with no dependencies.
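
For reference, training one of these small TinyStories models with the bundled scripts looks roughly like this; the commands follow upstream llama2.c and may differ slightly between versions:

# download and pretokenize the TinyStories dataset
python tinystories.py download
python tinystories.py pretokenize
# train a small model (see train.py and configurator.py for the defaults and overrides)
python train.py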

Feel the Magic

git clone https://github.com/trholding/llama2.c.git
cd llama2.c
make runfast
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
./run stories15M.bin

You can also prompt the model with a prefix:

./run stories42M.bin -t 0.8 -n 256 -i "A big dog"

When prompting, pass the temperature (-t), the number of steps (-n) and the prompt (-i) as shown above; see the Usage section below for the defaults.

Output

A big dog named Zip. He loved to run and play in the sun. He was a happy dog. One day, Zip saw a little bird on the ground. The bird looked helpless. Zip wanted to help the bird. He ran to the bird and lay down next to it. Zip and the bird became friends. They played together every day. Zip would carry the bird to play in the trees. The bird would fly around, and Zip would bark. They were very happy together.

Models

The original author trained a series of small models on TinyStories, which took a few hours to train on their setup. The 110M model took around 24 hours. The models are hosted on the Hugging Face Hub:

model   dim   n_layers   n_heads   max context length   parameters   val loss   download
OG      288   6          6         256                  15M          1.072      stories15M.bin
42M     512   8          8         1024                 42M          0.847      stories42M.bin
110M    768   12         12        1024                 110M         0.760      stories110M.bin

The upstream project owner trained the llama2.c storyteller models on a 4X A100 40GB box provided by Lambda Labs.

A quick note on sampling: for good results the recommendation is -t 1.0 -p 0.9, i.e. top-p (nucleus) sampling at 0.9 with temperature 1.0 (this is the default). To control the diversity of samples, vary either the temperature (vary -t between 0 and 1 and turn top-p off with -p 0) or the top-p value (vary -p between 0 and 1 and keep -t at 1), but not both.
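
For example, the two recommended ways of controlling diversity look like this (using the flags documented in the Usage section below):

# temperature only: lower -t for more conservative output, with top-p turned off
./run stories110M.bin -t 0.8 -p 0 -i "Once upon a time"

# top-p only: keep -t at 1.0 and tighten -p for less diverse output
./run stories110M.bin -t 1.0 -p 0.8 -i "Once upon a time"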

./run llama2_7b.bin

A converted Meta Llama 2 7B model can be inferenced, albeit slowly.
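
To obtain llama2_7b.bin, the Meta weights first have to be exported with the bundled export.py. A rough sketch; the exact flag names depend on your export.py version, and the weights path is a placeholder:

# convert Meta's Llama 2 7B checkpoint into the llama2.c binary format
python export.py llama2_7b.bin --meta-llama /path/to/llama-2-7b
# then run it (expect very slow generation on CPU)
./run llama2_7b.bin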

Usage

Full Usage

Usage:   run <checkpoint> [options]
Example: ./run model.bin -n 256 -i "Once upon a time"
Options:
  -t <float>  temperature in [0,inf], default 1.0
  -p <float>  p value in top-p (nucleus) sampling in [0,1] default 0.9
  -s <int>    random seed, default time(NULL)
  -n <int>    number of steps to run for, default 256. 0 = max_seq_len
  -b <int>    number of tokens to buffer, default 1. 0 = max_seq_len
  -x <int>    extended info / stats, default 1 = on. 0 = off
  -i <string> input prompt
  -z <string> optional path to custom tokenizer

<checkpoint> is the mandatory checkpoint / model file.
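
A fuller invocation combining several of these options might look like this (tokenizer.bin here is the default tokenizer shipped in the repo; pass your own path if you use a custom one):

# reproducible run: fixed seed, 256 steps, explicit tokenizer and prompt
./run model.bin -s 42 -n 256 -t 0.9 -p 0.9 -z tokenizer.bin -i "Once upon a time"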

Minimal Usage

./run <checkpoint_file>

Platforms

Multi OS build

make run_cosmocc

The binary will boot on bare metal and also run on any 64-bit OS such as Linux, *BSD and Windows, and (more slowly) on AArch64 Mac & Linux.

Currently, when used to boot bare metal, it won't be able to find the models. This is down to a toolchain feature with an anticipated PR merge.

The performance of this build is more than twice that of the basic build.

The cosmopolitan toolchain is a requirement for this build to work. Read here: How to build

Please note that the Multi OS binaries built with cosmocc can cause false positives with AV software such as Microsoft Defender and on VirusTotal.

The issue is that AVs consider unsigned binaries, or binaries that contain binary signatures for multiple OSes in one file, suspicious. Get more insight here: https://github.com/trholding/llama2.c/issues/8 and https://github.com/jart/cosmopolitan/issues/342

Linux

Centos 7 / Amazon Linux 2018

make rungnu or make runompgnu to use OpenMP.

Other Linux Distros / Mac

make runfast or make runomp to use OpenMP.

Windows

Build on windows:

build_msvc.bat in a Visual Studio Command Prompt

The MSVC build will use OpenMP and the maximum number of threads suitable for your CPU unless you set the OMP_NUM_THREADS environment variable.

Build on Linux and Windows:

make win64 to use the mingw compiler toolchain.

Android

See @lordunix's post on how to build this on Android within Termux:

https://github.com/trholding/llama2.c/issues/7#issue-1867639275

TODO.

Performance

Basic

This build does not enable any optimizations.

make run

This can be used as a baseline build against which the performance of other builds can be compared.

Fast

This build enables basic performance boost with compiler provided optimizations.

make runfast

Build with Acceleration

OpenMP

This build enables acceleration via OpenMP

make run_cc_openmp

Requires OpenMP libraries and a compiler with OpenMP support to be available on the system, e.g. apt install clang libomp-dev on Ubuntu.

When you run inference make sure to use OpenMP flags to set the number of threads, e.g.:

OMP_NUM_THREADS=4 ./run out/model.bin

More threads is not always better.
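
Since the optimal thread count depends on the machine and the model, a quick sweep can help find it. A minimal sketch using the documented flags, with a fixed seed so runs are comparable:

# try a few thread counts and compare the reported tok/s
for t in 1 2 4 8; do
  echo "OMP_NUM_THREADS=$t"
  OMP_NUM_THREADS=$t ./run out/model.bin -s 1 -n 64 -i "Once upon a time"
done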

OpenACC

This build enables acceleration via OpenACC

make run_cc_openacc

Requires OpenACC libraries and compiler with OpenACC support to be available on system.

OpenBLAS

This build enables acceleration via OpenBLAS

make run_cc_openblas

Requires OpenBLAS to be installed on system.

BLIS

This build enables acceleration via BLIS

make run_cc_blis

Requires BLIS compiled with ./configure --enable-cblas -t openmp,pthreads auto to be installed on system.

Intel oneAPI MKL

This build enables acceleration via Intel® oneAPI Math Kernel Library on x86_64 systems and Intel Mac OS - WIP

make run_cc_mkl

Requires Intel oneAPI MKL to be installed on system.

Arm Performance Library (ArmPL)

This build enables acceleration via Arm Performance Library on ARM64 systems such as Linux or Mac OS - WIP

make run_cc_armpl

Requires ArmPL to be installed on system.

Apple Accelerate

This build enables BLAS acceleration via Apple Accelerate on Mac OS - Testing

make run_cc_mac_accel

Requires Apple Accelerate to be available on system.

Note: Needs testing.

Generic CBLAS

This build enables acceleration with any Netlib CBLAS interface compatible libraries

make run_cc_cblas

Requires any BLAS library with Netlib CBLAS interface such as LAPACK to be installed on system.

CLBlast (GPU/OpenCL)

This build enables tuned GPU acceleration via OpenCL with CLBlast

make run_cc_clblast

Requires CLBlast compiled with cmake -DNETLIB=ON to be installed on system.

Note: Currently runs much slower than CPU! This requires investigation; memory I/O may be a bottleneck on the test system.

Portable Binary Build

Have you ever wanted to inference a baby Llama 2 model with a single executable on any OS or *as OS? No? Well, now you can!

By making use of the Cosmopolitan libc toolchain to build llama2.c, we get those features.

Instructions

Get and build the cosmopolitan libc toolchain:

Follow instructions at https://github.com/jart/cosmopolitan

Or do:

sudo mkdir -p /opt
sudo chmod 1777 /opt
git clone https://github.com/jart/cosmopolitan /opt/cosmo
export PATH="/opt/cosmo/bin:/opt/cosmos/bin:$PATH"
echo 'PATH="/opt/cosmo/bin:/opt/cosmos/bin:$PATH"' >>~/.profile
cosmocc --update   # pull cosmo and build/rebuild toolchain

Example build to generate an Actually Portable Executable (APE) with embedded model:

mkdir out
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin -O out/model.bin
make run_cosmocc_incbin

Example build to generate an APE:

make run_cosmocc

Run or copy to any supported system and run:

If model is embedded:

./run.com

Else

./run.com model.bin

All 'make' targets

Run make with one of the targets below to build that variant.

Example:

make run_cc_openmp

Targets:

NEW:

run_unik_qemu_x86_64 - Unikernel + embedded model Build (QEMU/x86_64)
run 			- Default build
rungnu 			- Generic linux distro build
runompgnu 		- Generic linux distro + OpenMP build 
runfast 		- Optimized build
run_cc_armpl 		- ARM PL BLAS accelerated build (ARM64) (WIP)
run_cc_blis 		- BLIS accelerated build
run_cc_cblas 		- Generic CBLAS accelerated build
run_cc_clblast 		- CLBlast OpenCL CBLAS GPU accelerated build
run_cc_mac_accel	- Mac OS CBLAS via Accelerate Framework (WIP/TEST)
run_cc_mkl 		- Intel MKL CBLAS build (x86_64 / intel Mac)(WIP)
run_cc_openacc		- OpenACC accelerated build
run_cc_openblas		- Openblas CBLAS accelerated build
run_cc_openmp		- OpenMP accelerated build
run_gcc_openmp_incbin	- Gcc + OpenMP + embedded model fast build
run_gcc_openmp_strlit	- Gcc + OpenMP + embedded model build
run_clang_openmp_incbin - Clang + OpenMP + embedded model fast build
run_clang_openmp_strlit	- Clang + OpenMP + embedded model build
run_gcc_static_incbin	- Static gcc + OpenMP + embedded model fast build
run_gcc_static_strlit	- Static gcc + OpenMP + embedded model build
run_clang_static_incbin - Static clang + OpenMP + embedded model fast build
run_clang_static_strlit - Static clang + OpenMP + embedded model build
run_cosmocc		- Portable + cosmocc
run_cosmocc_incbin	- Portable + cosmocc + embedded model fast build (All OSes)
run_cosmocc_strlit	- Portable + cosmocc + embedded model build (All OSes)
run_cosmocc_zipos	- Portable + cosmocc + embedded zip model build(All OSes)

All builds with embedded models need the model to be in the out/ directory, and the model file has to be named model.bin.

Example:

mkdir out
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin -O out/model.bin

TODO

  • Python Binding + Streamlit Demo (priority)
  • Web UI + Minimal OpenAI API compat (priority)
  • GNU/Linux kernel + efistub + cpio + l2e as init boot image (priority)
  • Users need better docs / howto / example, especially VM related.
  • Train a small test model on open books. (I need to figure out sourcing the compute)
  • Unikraft unikernel Boot (WIP/Testing) (Task: Multi Arch + Firecracker VM support)
  • Intel MKL BLAS Acceleration (WIP)
  • Arm Performance Libraries (WIP)
  • Apple Accelerate BLAS (WIP/Testing)
  • NetBSD Rump Kernel Boot (R&D, attempt failed, needs deep study)
  • sgemm OpenGL acceleration (next)
  • sgemm SSE, AVX acceleration
  • Fix baremetal cosmo boot model loading (pending)
  • OpenMP SIMD (pending)
  • OpenCL pure
  • Vulkan
  • CUDA
  • MPI / PVM / PBLAS
  • cFS App
  • Android support (both ndk builds and minimal APK)
  • Various uC demos (ESP32, ESP8266, Pico) - load models via network, Raspi Zero Demo
  • Quantization (16, 4 , 2)
  • CLara (tried / broken) / SunCL Distributed OpenCL support
  • Fix broken MSVC build (!) yikes
  • Split, re-order, rebase repo.

Changelog

See commits.

Contributing

  • All pull requests that are merged upstream will be automatically applied here, as we closely mirror upstream.
  • I merge pull requests that improve performance, even if they were rejected upstream.
  • Performance and usability improvement contributions are welcome.

Developer Status

See "Developer Status" issue.

Current status: Busy since Aug ~6 2023, away on bigger IRL projects. Just merging stuff. Addressing all issues every ~7 days.

Gratitude & Credits

Thank you to the creators of the following libraries and tools, and to their contributors:

Notable projects

License

MIT