# Llama 2 Everywhere (L2E)

## llama2.dart

This is a fork of Andrej Karpathy's llama2.c, implemented in (almost) pure Dart; the only exception is a small argument-parsing utility library.

To run:

Install Dart (on macOS, via Homebrew):

```sh
brew tap dart-lang/dart
brew install dart
```

Install the `args` parsing dependency:

```sh
dart pub add args
```
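For reference, here is a minimal sketch of how the `args` package can wire up the `-c` / `-i` flags used in the run command below. The option names are illustrative assumptions, not necessarily the ones run.dart uses:

```dart
import 'package:args/args.dart';

void main(List<String> argv) {
  // Hypothetical flag setup mirroring the -c/-i flags shown below;
  // run.dart's actual parser may name things differently.
  final parser = ArgParser()
    ..addOption('checkpoint', abbr: 'c', help: 'path to a model .bin checkpoint')
    ..addOption('prompt', abbr: 'i', help: 'input prompt');
  final args = parser.parse(argv);
  print('checkpoint: ${args['checkpoint']}, prompt: ${args['prompt']}');
}
```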

Download a model checkpoint (TinyStories models hosted on Hugging Face):

```sh
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
```

Then run inference, passing a checkpoint and a prompt:

```sh
dart run run.dart -c ./stories15M.bin -i "PROMPT GOES HERE"
```

## Performance

Dart delivers surprisingly decent performance for a single-threaded language, though it starts to struggle at 110M. Tested on an M2 Max chip:

| Model | tok/s |
| ----- | ----- |
| 15M   | 17.78 |
| 42M   | 6.43  |
| 110M  | 2.47  |
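Figures like these can be reproduced by timing the generation loop; below is a minimal sketch using Dart's `Stopwatch`, where `nextToken()` is a hypothetical stand-in for the model's forward-pass-plus-sampling step (not the repo's actual code):

```dart
// Sketch of a throughput measurement; nextToken() stands in for
// run.dart's real per-token work (transformer forward pass + sampling).
int nextToken() => 0;

void main() {
  const steps = 256; // number of tokens to generate
  final sw = Stopwatch()..start();
  for (var i = 0; i < steps; i++) {
    nextToken();
  }
  sw.stop();
  final tokPerSec = steps * 1e6 / sw.elapsedMicroseconds;
  print('tok/s: ${tokPerSec.toStringAsFixed(2)}');
}
```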

## Original README

An extract from the original repo:

*(image: Cute Llama)*

Train the Llama 2 LLM architecture in PyTorch, then inference it with one simple 700-line C file (run.c). You might think that you need many-billion-parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough (ref: the TinyStories paper). This repo is a "fullstack" train + inference solution for the Llama 2 LLM, with a focus on minimalism and simplicity.

As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B. Work on model quantization is currently ongoing.
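Because the weights are plain fp32, the checkpoint files are easy to inspect. Here is a minimal Dart sketch that reads the config header, assuming llama2.c's layout of seven little-endian int32 fields followed by the float32 weight tensors (an assumption about the format, not code from this repo):

```dart
import 'dart:io';
import 'dart:typed_data';

void main() {
  // Assumes llama2.c's checkpoint layout: 7 x int32 header, then fp32 weights.
  final bytes = File('stories15M.bin').readAsBytesSync();
  final header = ByteData.sublistView(bytes);
  const fields = [
    'dim', 'hidden_dim', 'n_layers', 'n_heads',
    'n_kv_heads', 'vocab_size', 'seq_len',
  ];
  for (var i = 0; i < fields.length; i++) {
    print('${fields[i]} = ${header.getInt32(i * 4, Endian.little)}');
  }
  // The rest of the file is little-endian float32 weight data, viewable
  // via Float32List.sublistView(bytes, fields.length * 4).
}
```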

Please note that this repo started recently as a fun weekend project: I took my earlier nanoGPT, tuned it to implement the Llama 2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in run.c. So the project is young and moving quickly. Hat tip to the awesome llama.cpp for inspiring this project. Compared to llama.cpp, I wanted something super simple, minimal, and educational, so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.

Please refer to ORIGINAL.md or the upstream repo for more information on llama2.c.