From 37157bc0a380994749b5325fd653fe21ec9439c2 Mon Sep 17 00:00:00 2001 From: Atamurad Hezretkuliyev Date: Sun, 27 Aug 2023 02:27:47 +0300 Subject: [PATCH 01/29] Update README.md Fixed unclosed code block quotes --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index b6bd418..f4bdeb2 100644 --- a/README.md +++ b/README.md @@ -109,6 +109,7 @@ Chat with Code Llama Instruct: python export.py codellama2_7b_instruct.bin --meta-llama /path/to/CodeLlama-7b-Instruct python tokenizer.py --tokenizer-model=/path/to/CodeLlama-7b-Instruct/tokenizer.model ./run codellama2_7b_instruct.bin -m chat -z /path/to/CodeLlama-7b-Instruct/tokenizer.bin +``` ## hugginface models From 1ebb27f090e10117964f1fa54a0be32d10a5a6e1 Mon Sep 17 00:00:00 2001 From: Jani Monoses Date: Sun, 27 Aug 2023 12:21:11 +0300 Subject: [PATCH 02/29] Do parameter count calculations in 64 bits to not overflow in case of very large models --- run.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/run.c b/run.c index 9329b93..fdbb16d 100644 --- a/run.c +++ b/run.c @@ -115,26 +115,28 @@ void free_run_state(RunState* s) { void memory_map_weights(TransformerWeights *w, Config* p, float* ptr, int shared_weights) { int head_size = p->dim / p->n_heads; + // make sure the multiplications below are done in 64bit to fit the parameter counts of 13B+ models + unsigned long n_layers = p->n_layers; w->token_embedding_table = ptr; ptr += p->vocab_size * p->dim; w->rms_att_weight = ptr; - ptr += p->n_layers * p->dim; + ptr += n_layers * p->dim; w->wq = ptr; - ptr += p->n_layers * p->dim * (p->n_heads * head_size); + ptr += n_layers * p->dim * (p->n_heads * head_size); w->wk = ptr; - ptr += p->n_layers * p->dim * (p->n_kv_heads * head_size); + ptr += n_layers * p->dim * (p->n_kv_heads * head_size); w->wv = ptr; - ptr += p->n_layers * p->dim * (p->n_kv_heads * head_size); + ptr += n_layers * p->dim * (p->n_kv_heads * head_size); w->wo = ptr; - ptr += 
p->n_layers * (p->n_heads * head_size) * p->dim; + ptr += n_layers * (p->n_heads * head_size) * p->dim; w->rms_ffn_weight = ptr; - ptr += p->n_layers * p->dim; + ptr += n_layers * p->dim; w->w1 = ptr; - ptr += p->n_layers * p->dim * p->hidden_dim; + ptr += n_layers * p->dim * p->hidden_dim; w->w2 = ptr; - ptr += p->n_layers * p->hidden_dim * p->dim; + ptr += n_layers * p->hidden_dim * p->dim; w->w3 = ptr; - ptr += p->n_layers * p->dim * p->hidden_dim; + ptr += n_layers * p->dim * p->hidden_dim; w->rms_final_weight = ptr; ptr += p->dim; ptr += p->seq_len * head_size / 2; // skip what used to be freq_cis_real (for RoPE) @@ -249,7 +251,7 @@ float* forward(Transformer* transformer, int token, int pos) { memcpy(x, content_row, dim*sizeof(*x)); // forward all the layers - for(int l = 0; l < p->n_layers; l++) { + for(unsigned long l = 0; l < p->n_layers; l++) { // attention rmsnorm rmsnorm(s->xb, x, w->rms_att_weight + l*dim, dim); From fab753db3a3dfdb55a8055600bba8177d82a127d Mon Sep 17 00:00:00 2001 From: Diego Marcos Segura Date: Mon, 28 Aug 2023 14:43:04 -0700 Subject: [PATCH 03/29] Add llama2.c-web to the list of projects in README.md --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index f4bdeb2..12cb12e 100644 --- a/README.md +++ b/README.md @@ -344,6 +344,8 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg - [llama2.cs](https://github.com/trrahul/llama2.cs) by @[trrahul](https://github.com/trrahul): a C# port of this project - Dart - [llama2.dart](https://github.com/yiminghan/llama2.dart) by @[yiminghan](https://github.com/yiminghan/llama2.dart): one-file dart port of this project, works with Flutter! +- Web + - [llama2c-web](https://github.com/dmarcos/llama2.c-web) by @[dmarcos](https://github.com/dmarcos): Super simple way to build unmodified llama2.c to WASM and run it in the browser. 
[Demo](https://diegomarcos.com/llama2.c-web/) - WebAssembly - [icpp-llm](https://github.com/icppWorld/icpp-llm): LLMs for the Internet Computer - [llama2.c - Llama 2 Everywhere](https://github.com/trholding/llama2.c) by @[trholding](https://github.com/trholding): Standalone, Bootable & Portable Binary Llama 2 From c5ec6e21b8659d6d3500a2af3ac1dfe7f3e19ae1 Mon Sep 17 00:00:00 2001 From: Jani Monoses Date: Tue, 29 Aug 2023 17:47:55 +0300 Subject: [PATCH 04/29] Use long long so it works with MSVC --- run.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/run.c b/run.c index fdbb16d..c6ec94a 100644 --- a/run.c +++ b/run.c @@ -116,7 +116,7 @@ void free_run_state(RunState* s) { void memory_map_weights(TransformerWeights *w, Config* p, float* ptr, int shared_weights) { int head_size = p->dim / p->n_heads; // make sure the multiplications below are done in 64bit to fit the parameter counts of 13B+ models - unsigned long n_layers = p->n_layers; + unsigned long long n_layers = p->n_layers; w->token_embedding_table = ptr; ptr += p->vocab_size * p->dim; w->rms_att_weight = ptr; @@ -251,7 +251,7 @@ float* forward(Transformer* transformer, int token, int pos) { memcpy(x, content_row, dim*sizeof(*x)); // forward all the layers - for(unsigned long l = 0; l < p->n_layers; l++) { + for(unsigned long long l = 0; l < p->n_layers; l++) { // attention rmsnorm rmsnorm(s->xb, x, w->rms_att_weight + l*dim, dim); From ab19aa08045f0f30db4291641ece301d7cc339f3 Mon Sep 17 00:00:00 2001 From: Brandon Rowlett Date: Wed, 30 Aug 2023 14:54:41 -0500 Subject: [PATCH 05/29] Setting UTF encoding, otherwise windows breaks with UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 971: character maps to --- tinystories.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tinystories.py b/tinystories.py index 800d73a..814732d 100644 --- a/tinystories.py +++ b/tinystories.py @@ -88,7 +88,7 @@ def train_vocab(vocab_size): shard_filenames = 
sorted(glob.glob(os.path.join(data_dir, "*.json"))) print(f"Writing temporary file {tiny_file} with {num_shards} shards...") - with open(tiny_file, "w") as of: + with open(tiny_file, "w", encoding="utf-8") as of: for shard in tqdm(shard_filenames[:num_shards]): with open(shard, "r") as f: data = json.load(f) From a69ee269c5e7c4ed06c3fc8c56b66ef22438edb3 Mon Sep 17 00:00:00 2001 From: Daniel Furrer Date: Sun, 3 Sep 2023 22:37:10 +0200 Subject: [PATCH 06/29] Update run.c Remove duplicate word in comments. --- run.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/run.c b/run.c index c6ec94a..efb254f 100644 --- a/run.c +++ b/run.c @@ -756,7 +756,7 @@ void generate(Transformer *transformer, Tokenizer *tokenizer, Sampler *sampler, // forward the transformer to get logits for the next token float* logits = forward(transformer, token, pos); - // advance the state state machine + // advance the state machine if (pos < num_prompt_tokens - 1) { // if we are still processing the input prompt, force the next prompt token next = prompt_tokens[pos + 1]; From 0b3a5e17fd30ba7382c1d3a3258bc6994fd9430a Mon Sep 17 00:00:00 2001 From: Andrew Date: Sun, 3 Sep 2023 17:54:54 -0400 Subject: [PATCH 07/29] added fortran clone --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index f4bdeb2..8d135e8 100644 --- a/README.md +++ b/README.md @@ -346,6 +346,8 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg - [llama2.dart](https://github.com/yiminghan/llama2.dart) by @[yiminghan](https://github.com/yiminghan/llama2.dart): one-file dart port of this project, works with Flutter! 
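The `encoding="utf-8"` change in PATCH 05 can be exercised in isolation. On Windows, `open()` without an explicit encoding falls back to the ANSI code page (often cp1252), which cannot represent characters such as U+200B and raises `UnicodeEncodeError`. The snippet below is a minimal sketch (the temp-file path is arbitrary), not part of the patch series:

```python
import os
import tempfile

text = "story with a zero-width space: \u200b"

# Writing with an explicit UTF-8 encoding succeeds regardless of the
# platform's default code page (the default is what broke on Windows).
path = os.path.join(tempfile.mkdtemp(), "tiny.txt")
with open(path, "w", encoding="utf-8") as of:
    of.write(text)

with open(path, "r", encoding="utf-8") as f:
    assert f.read() == text

# The Windows failure can be simulated on any platform by forcing cp1252,
# which has no mapping for '\u200b':
try:
    text.encode("cp1252")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```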
- WebAssembly - [icpp-llm](https://github.com/icppWorld/icpp-llm): LLMs for the Internet Computer +- Fortran + - [llama2.f90](https://github.com/rbitr/llama2.f90): a Fortran port of this project - [llama2.c - Llama 2 Everywhere](https://github.com/trholding/llama2.c) by @[trholding](https://github.com/trholding): Standalone, Bootable & Portable Binary Llama 2 - [llama2.c-zh - Bilingual Chinese and English](https://github.com/chenyangMl/llama2.c-zh) by @[chenyangMl](https://github.com/chenyangMl): Expand tokenizer to support training and inference in both Chinese and English From 3ac620572e962282e4f7c77248e1d30adfe000a1 Mon Sep 17 00:00:00 2001 From: Li Yazhou Date: Sun, 10 Sep 2023 21:22:05 +0800 Subject: [PATCH 08/29] add another rust implementation --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 000d944..bae48ef 100644 --- a/README.md +++ b/README.md @@ -312,6 +312,7 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg - [llama2-rs](https://github.com/danielgrittner/llama2-rs) by @[danielgrittner](https://github.com/danielgrittner): a Rust port of this project - [llama2.rs](https://github.com/lintian06/llama2.rs) by @[lintian06](https://github.com/lintian06): A Rust port of this project - [pecca.rs](https://github.com/rahoua/pecca-rs) by @[rahoua](https://github.com/rahoua): A Rust port leveraging [ndarray](https://github.com/rust-ndarray/ndarray), supports BLAS. + - [llama2.rs](https://github.com/flaneur2020/llama2.rs) by @[flaneur2020](https://github.com/flaneur2020): A Rust port of this project. 
- Go - [go-llama2](https://github.com/tmc/go-llama2) by @[tmc](https://github.com/tmc): a Go port of this project - [llama2.go](https://github.com/nikolaydubina/llama2.go) by @[nikolaydubina](https://github.com/nikolaydubina): a Go port of this project From 38011d070a2824297bc14634a5dcf8b5ab30a4a2 Mon Sep 17 00:00:00 2001 From: Aydyn Tairov Date: Mon, 11 Sep 2023 12:29:25 +0100 Subject: [PATCH 09/29] Add link to pure Mojo port of project --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 000d944..94601f1 100644 --- a/README.md +++ b/README.md @@ -350,6 +350,8 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg - [icpp-llm](https://github.com/icppWorld/icpp-llm): LLMs for the Internet Computer - Fortran - [llama2.f90](https://github.com/rbitr/llama2.f90): a Fortran port of this project +- Mojo + - [llama2.🔥](https://github.com/tairov/llama2.mojo) by @[tairov](https://github.com/tairov): pure Mojo port of this project - [llama2.c - Llama 2 Everywhere](https://github.com/trholding/llama2.c) by @[trholding](https://github.com/trholding): Standalone, Bootable & Portable Binary Llama 2 - [llama2.c-zh - Bilingual Chinese and English](https://github.com/chenyangMl/llama2.c-zh) by @[chenyangMl](https://github.com/chenyangMl): Expand tokenizer to support training and inference in both Chinese and English From 9d73a377fb1aee70c0b8c335ce3d06eb12ae74c8 Mon Sep 17 00:00:00 2001 From: Juarez Bochi Date: Mon, 11 Sep 2023 14:05:37 -0400 Subject: [PATCH 10/29] Use CLOCK_MONOTONIC instead of realtime --- run.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/run.c b/run.c index 0eaa655..af3d375 100644 --- a/run.c +++ b/run.c @@ -724,7 +724,7 @@ int sample(Sampler* sampler, float* logits) { long time_in_ms() { // return time in milliseconds, for benchmarking the model speed struct timespec time; - clock_gettime(CLOCK_REALTIME, &time); + clock_gettime(CLOCK_MONOTONIC, &time); return time.tv_sec 
* 1000 + time.tv_nsec / 1000000; } From 38c58ac336544ba08a814dea89135b0d07fd7450 Mon Sep 17 00:00:00 2001 From: Andrej Date: Tue, 12 Sep 2023 11:17:04 +0300 Subject: [PATCH 11/29] Revert "Minor fix: Use CLOCK_MONOTONIC instead of CLOCK_REALTIME" --- run.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/run.c b/run.c index 484d709..efb254f 100644 --- a/run.c +++ b/run.c @@ -726,7 +726,7 @@ int sample(Sampler* sampler, float* logits) { long time_in_ms() { // return time in milliseconds, for benchmarking the model speed struct timespec time; - clock_gettime(CLOCK_MONOTONIC, &time); + clock_gettime(CLOCK_REALTIME, &time); return time.tv_sec * 1000 + time.tv_nsec / 1000000; } From c568f6952dbcc6d5c19fdec1a9c76dc649f14be4 Mon Sep 17 00:00:00 2001 From: Nicky Pochinkov Date: Tue, 12 Sep 2023 19:50:59 +0100 Subject: [PATCH 12/29] added option to export to huggingface format --- export.py | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 69 insertions(+), 1 deletion(-) diff --git a/export.py b/export.py index 4143f70..39b13d8 100644 --- a/export.py +++ b/export.py @@ -265,6 +265,73 @@ def version2_export(model, filepath, group_size=64): out_file.close() print(f"wrote {filepath}") +def hf_export(llama_model, filepath, group_size=64): + """ Generate the pytorch_model.bin state_dict and config.json for HuggingFace """ + + # Generate LlamaModel state_dict + def permute_original(w, n_heads=llama_model.params.n_heads, dim1=llama_model.params.dim, dim2=llama_model.params.dim): + return w.view(dim1, dim2).reshape(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2) + + hf_state_dict = {} + + # Transfer weights from llama model to the HF state dictionary format + hf_state_dict['model.embed_tokens.weight'] = llama_model.tok_embeddings.weight.clone() + hf_state_dict['model.norm.weight'] = llama_model.norm.weight.clone() + + for i, layer in enumerate(llama_model.layers): + layer_id = layer.layer_id # Assuming llama.c 
layers have layer_id + hf_state_dict[f'model.layers.{i}.input_layernorm.weight'] = llama_model.layers[layer_id].attention_norm.weight.clone() + hf_state_dict[f'model.layers.{i}.self_attn.q_proj.weight'] = permute_original(llama_model.layers[layer_id].attention.wq.weight.clone()) + hf_state_dict[f'model.layers.{i}.self_attn.k_proj.weight'] = permute_original(llama_model.layers[layer_id].attention.wk.weight.clone()) + hf_state_dict[f'model.layers.{i}.self_attn.v_proj.weight'] = llama_model.layers[layer_id].attention.wv.weight.clone() + hf_state_dict[f'model.layers.{i}.self_attn.o_proj.weight'] = llama_model.layers[layer_id].attention.wo.weight.clone() + hf_state_dict[f'model.layers.{i}.post_attention_layernorm.weight'] = llama_model.layers[layer_id].ffn_norm.weight.clone() + hf_state_dict[f'model.layers.{i}.mlp.gate_proj.weight'] = llama_model.layers[layer_id].feed_forward.w1.weight.clone() + hf_state_dict[f'model.layers.{i}.mlp.down_proj.weight'] = llama_model.layers[layer_id].feed_forward.w2.weight.clone() + hf_state_dict[f'model.layers.{i}.mlp.up_proj.weight'] = llama_model.layers[layer_id].feed_forward.w3.weight.clone() + + hf_state_dict['lm_head.weight'] = llama_model.output.weight.clone() + + + # Generate LlamaConfig (seen in transformers.models.llama.configuration_llama) + from transformers.models.llama.configuration_llama import LlamaConfig + + # Extract necessary attributes from llama.c model + vocab_size = llama_model.params.vocab_size + hidden_size = llama_model.params.dim + intermediate_size = llama_model.layers[0].feed_forward.w1.weight.shape[0] + num_hidden_layers = llama_model.params.n_layers + num_attention_heads = llama_model.params.n_heads + num_key_value_heads = llama_model.params.n_kv_heads + max_position_embeddings = llama_model.params.max_seq_len + rms_norm_eps = llama_model.params.norm_eps + + # TODO values for: pretraining_tp, initializer_range, use_cache, + # tie_word_embeddings, rope_theta, and rope_scaling. 
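The `permute_original` helper above reorders each head's rows so the rotary half-pairs match HuggingFace's q/k layout. A small numpy analogue (toy dimensions, not a real model's) shows the transform and confirms it is losslessly invertible, which is what makes the export round-trippable:

```python
import numpy as np

def permute(w, n_heads, dim1, dim2):
    # Same axis gymnastics as permute_original, in numpy: split each head's
    # rows into (half, 2) rotary pairs, then swap the pair and half axes.
    return (w.reshape(n_heads, dim1 // n_heads // 2, 2, dim2)
             .transpose(0, 2, 1, 3)
             .reshape(dim1, dim2))

def unpermute(w, n_heads, dim1, dim2):
    # Inverse: split as (2, half) and swap the axes back.
    return (w.reshape(n_heads, 2, dim1 // n_heads // 2, dim2)
             .transpose(0, 2, 1, 3)
             .reshape(dim1, dim2))

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))  # toy shape: 2 heads, head_size 4
assert np.array_equal(unpermute(permute(w, 2, 8, 8), 2, 8, 8), w)
```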
+ + config = LlamaConfig( + vocab_size=vocab_size, + hidden_size=hidden_size, + intermediate_size=intermediate_size, + num_hidden_layers=num_hidden_layers, + num_attention_heads=num_attention_heads, + num_key_value_heads=num_key_value_heads, + max_position_embeddings=max_position_embeddings, + rms_norm_eps=rms_norm_eps, + # Manual + architectures=["LlamaForCausalLM"], + hidden_act="silu", + ) + + + # Save files in directory filepath + # First make the directory if it doesn't exist + os.makedirs(filepath, exist_ok=True) + + # Save the state dictionary in .pt format + torch.save(hf_state_dict, os.path.join(filepath, "pytorch_model.pt")) + config.save_pretrained(filepath) + # ----------------------------------------------------------------------------- # Load / import functions @@ -401,7 +468,6 @@ def load_hf_model(model_path): model.eval() return model - # ----------------------------------------------------------------------------- # API entrypoint @@ -412,6 +478,8 @@ def model_export(model, filepath, version): version1_export(model, filepath) elif version == 2: version2_export(model, filepath) + elif version == -1: + hf_export(model, filepath) else: raise ValueError(f"unknown version {version}") From 6360a539013c7f371fbca1376663825f080d6f2b Mon Sep 17 00:00:00 2001 From: Nicky Pochinkov Date: Tue, 12 Sep 2023 19:53:26 +0100 Subject: [PATCH 13/29] fixed whitespace --- export.py | 1 + 1 file changed, 1 insertion(+) diff --git a/export.py b/export.py index 39b13d8..e4d863a 100644 --- a/export.py +++ b/export.py @@ -468,6 +468,7 @@ def load_hf_model(model_path): model.eval() return model + # ----------------------------------------------------------------------------- # API entrypoint From bf9a1162e1bd2e1d492bd637eb26fd1c04d22d79 Mon Sep 17 00:00:00 2001 From: Nicky Pochinkov Date: Tue, 12 Sep 2023 19:55:28 +0100 Subject: [PATCH 14/29] Added error handling for LlamaConfig import --- export.py | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git 
a/export.py b/export.py index e4d863a..3bbbb47 100644 --- a/export.py +++ b/export.py @@ -268,6 +268,13 @@ def version2_export(model, filepath, group_size=64): def hf_export(llama_model, filepath, group_size=64): """ Generate the pytorch_model.bin state_dict and config.json for HuggingFace """ + try: + from transformers.models.llama.configuration_llama import LlamaConfig + except ImportError: + print("Error: transformers package is required to load huggingface models") + print("Please run `pip install transformers` to install it") + return None + # Generate LlamaModel state_dict def permute_original(w, n_heads=llama_model.params.n_heads, dim1=llama_model.params.dim, dim2=llama_model.params.dim): return w.view(dim1, dim2).reshape(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2) @@ -294,7 +301,6 @@ def hf_export(llama_model, filepath, group_size=64): # Generate LlamaConfig (seen in transformers.models.llama.configuration_llama) - from transformers.models.llama.configuration_llama import LlamaConfig # Extract necessary attributes from llama.c model vocab_size = llama_model.params.vocab_size From 3da6cc1b21fabf6ac2ac44f3798bd1b9000b59ed Mon Sep 17 00:00:00 2001 From: Bernardo Ramos Date: Wed, 13 Sep 2023 16:09:39 -0300 Subject: [PATCH 15/29] readme: add another javascript port --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index aaf7ded..b613fee 100644 --- a/README.md +++ b/README.md @@ -325,6 +325,7 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg - [llama2.cpp](https://github.com/leloykun/llama2.cpp) by @[leloykun](https://github.com/leloykun): a C++ port of this project - JavaScript - [llama2.js](https://github.com/epicure/llama2.js) by @[epicure](https://github.com/epicure): a JavaScript port of this project + - [llamajs](https://github.com/agershun/llamajs) by @[agershun](https://github.com/agershun): a JavaScript port of this project - 
[llama2.ts](https://github.com/wizzard0/llama2.ts) by @[oleksandr_now](https://twitter.com/oleksandr_now): a TypeScript port of this project. Full Llama2-7B capable. - [llama2.c-emscripten](https://github.com/gohai/llama2.c-emscripten) by @[gohai](https://github.com/gohai): Emscripten (JavaScript) port, based on @ggerganov's initial prototype - Zig From 593d846bc3d9460c66925d7d3281e67c1b2df5d1 Mon Sep 17 00:00:00 2001 From: Bernardo Ramos Date: Thu, 14 Sep 2023 01:13:08 +0000 Subject: [PATCH 16/29] use key and value from kv cache --- run.c | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) diff --git a/run.c b/run.c index efb254f..615ef38 100644 --- a/run.c +++ b/run.c @@ -83,16 +83,13 @@ void malloc_run_state(RunState* s, Config* p) { s->hb = calloc(p->hidden_dim, sizeof(float)); s->hb2 = calloc(p->hidden_dim, sizeof(float)); s->q = calloc(p->dim, sizeof(float)); - s->k = calloc(kv_dim, sizeof(float)); - s->v = calloc(kv_dim, sizeof(float)); s->att = calloc(p->n_heads * p->seq_len, sizeof(float)); s->logits = calloc(p->vocab_size, sizeof(float)); s->key_cache = calloc(p->n_layers * p->seq_len * kv_dim, sizeof(float)); s->value_cache = calloc(p->n_layers * p->seq_len * kv_dim, sizeof(float)); // ensure all mallocs went fine if (!s->x || !s->xb || !s->xb2 || !s->hb || !s->hb2 || !s->q - || !s->k || !s->v || !s->att || !s->logits || !s->key_cache - || !s->value_cache) { + || !s->att || !s->logits || !s->key_cache || !s->value_cache) { fprintf(stderr, "malloc failed!\n"); exit(EXIT_FAILURE); } @@ -105,8 +102,6 @@ void free_run_state(RunState* s) { free(s->hb); free(s->hb2); free(s->q); - free(s->k); - free(s->v); free(s->att); free(s->logits); free(s->key_cache); @@ -256,6 +251,11 @@ float* forward(Transformer* transformer, int token, int pos) { // attention rmsnorm rmsnorm(s->xb, x, w->rms_att_weight + l*dim, dim); + // key and value point to the kv cache + int loff = l * p->seq_len * kv_dim; // kv cache layer offset for convenience + s->k = 
s->key_cache + loff + pos * kv_dim; + s->v = s->value_cache + loff + pos * kv_dim; + // qkv matmuls for this position matmul(s->q, s->xb, w->wq + l*dim*dim, dim, dim); matmul(s->k, s->xb, w->wk + l*dim*kv_dim, dim, kv_dim); @@ -278,13 +278,6 @@ float* forward(Transformer* transformer, int token, int pos) { } } - // save key,value at this time step (pos) to our kv cache - int loff = l * p->seq_len * kv_dim; // kv cache layer offset for convenience - float* key_cache_row = s->key_cache + loff + pos * kv_dim; - float* value_cache_row = s->value_cache + loff + pos * kv_dim; - memcpy(key_cache_row, s->k, kv_dim * sizeof(*key_cache_row)); - memcpy(value_cache_row, s->v, kv_dim * sizeof(*value_cache_row)); - // multihead attention. iterate over all heads int h; #pragma omp parallel for private(h) From 411c5bd2db9a87e94e1bd1a6c7b7ca117adc4b01 Mon Sep 17 00:00:00 2001 From: Bernardo Ramos Date: Thu, 14 Sep 2023 07:14:45 +0000 Subject: [PATCH 17/29] reorganize variables --- run.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/run.c b/run.c index 615ef38..e1a4ec2 100644 --- a/run.c +++ b/run.c @@ -83,13 +83,13 @@ void malloc_run_state(RunState* s, Config* p) { s->hb = calloc(p->hidden_dim, sizeof(float)); s->hb2 = calloc(p->hidden_dim, sizeof(float)); s->q = calloc(p->dim, sizeof(float)); - s->att = calloc(p->n_heads * p->seq_len, sizeof(float)); - s->logits = calloc(p->vocab_size, sizeof(float)); s->key_cache = calloc(p->n_layers * p->seq_len * kv_dim, sizeof(float)); s->value_cache = calloc(p->n_layers * p->seq_len * kv_dim, sizeof(float)); + s->att = calloc(p->n_heads * p->seq_len, sizeof(float)); + s->logits = calloc(p->vocab_size, sizeof(float)); // ensure all mallocs went fine if (!s->x || !s->xb || !s->xb2 || !s->hb || !s->hb2 || !s->q - || !s->att || !s->logits || !s->key_cache || !s->value_cache) { + || !s->key_cache || !s->value_cache || !s->att || !s->logits) { fprintf(stderr, "malloc failed!\n"); exit(EXIT_FAILURE); } From 
b259fb44321caa1ca1a91dbb744a54ab36ba5863 Mon Sep 17 00:00:00 2001 From: jackpeck <114533165+jackpeck@users.noreply.github.com> Date: Sat, 16 Sep 2023 13:43:10 +0100 Subject: [PATCH 18/29] Add link to pure OCaml port --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index aaf7ded..aeba166 100644 --- a/README.md +++ b/README.md @@ -353,6 +353,8 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg - [llama2.f90](https://github.com/rbitr/llama2.f90): a Fortran port of this project - Mojo - [llama2.🔥](https://github.com/tairov/llama2.mojo) by @[tairov](https://github.com/tairov): pure Mojo port of this project +- OCaml + - [llama2.ml](https://github.com/jackpeck/llama2.ml) by @[jackpeck](https://github.com/jackpeck): an OCaml port of this project - [llama2.c - Llama 2 Everywhere](https://github.com/trholding/llama2.c) by @[trholding](https://github.com/trholding): Standalone, Bootable & Portable Binary Llama 2 - [llama2.c-zh - Bilingual Chinese and English](https://github.com/chenyangMl/llama2.c-zh) by @[chenyangMl](https://github.com/chenyangMl): Expand tokenizer to support training and inference in both Chinese and English From f38055dfb637c1d50b5f1ac2999a6d54cf8fa2ca Mon Sep 17 00:00:00 2001 From: Nicky Pochinkov Date: Sat, 16 Sep 2023 14:07:48 +0100 Subject: [PATCH 19/29] add option to set dtype for export --- export.py | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/export.py b/export.py index 3bbbb47..e0f7a9b 100644 --- a/export.py +++ b/export.py @@ -265,7 +265,7 @@ def version2_export(model, filepath, group_size=64): out_file.close() print(f"wrote {filepath}") -def hf_export(llama_model, filepath, group_size=64): +def hf_export(llama_model, filepath, group_size=64, dtype=torch.float16): """ Generate the pytorch_model.bin state_dict and config.json for HuggingFace """ try: @@ -282,22 +282,22 @@ def hf_export(llama_model, filepath, group_size=64): 
hf_state_dict = {} # Transfer weights from llama model to the HF state dictionary format - hf_state_dict['model.embed_tokens.weight'] = llama_model.tok_embeddings.weight.clone() - hf_state_dict['model.norm.weight'] = llama_model.norm.weight.clone() + hf_state_dict['model.embed_tokens.weight'] = llama_model.tok_embeddings.weight.clone().to(dtype) + hf_state_dict['model.norm.weight'] = llama_model.norm.weight.clone().to(dtype) for i, layer in enumerate(llama_model.layers): layer_id = layer.layer_id # Assuming llama.c layers have layer_id - hf_state_dict[f'model.layers.{i}.input_layernorm.weight'] = llama_model.layers[layer_id].attention_norm.weight.clone() - hf_state_dict[f'model.layers.{i}.self_attn.q_proj.weight'] = permute_original(llama_model.layers[layer_id].attention.wq.weight.clone()) - hf_state_dict[f'model.layers.{i}.self_attn.k_proj.weight'] = permute_original(llama_model.layers[layer_id].attention.wk.weight.clone()) - hf_state_dict[f'model.layers.{i}.self_attn.v_proj.weight'] = llama_model.layers[layer_id].attention.wv.weight.clone() - hf_state_dict[f'model.layers.{i}.self_attn.o_proj.weight'] = llama_model.layers[layer_id].attention.wo.weight.clone() - hf_state_dict[f'model.layers.{i}.post_attention_layernorm.weight'] = llama_model.layers[layer_id].ffn_norm.weight.clone() - hf_state_dict[f'model.layers.{i}.mlp.gate_proj.weight'] = llama_model.layers[layer_id].feed_forward.w1.weight.clone() - hf_state_dict[f'model.layers.{i}.mlp.down_proj.weight'] = llama_model.layers[layer_id].feed_forward.w2.weight.clone() - hf_state_dict[f'model.layers.{i}.mlp.up_proj.weight'] = llama_model.layers[layer_id].feed_forward.w3.weight.clone() + hf_state_dict[f'model.layers.{i}.input_layernorm.weight'] = llama_model.layers[layer_id].attention_norm.weight.clone().to(dtype) + hf_state_dict[f'model.layers.{i}.self_attn.q_proj.weight'] = permute_original(llama_model.layers[layer_id].attention.wq.weight.clone()).to(dtype) + 
hf_state_dict[f'model.layers.{i}.self_attn.k_proj.weight'] = permute_original(llama_model.layers[layer_id].attention.wk.weight.clone()).to(dtype) + hf_state_dict[f'model.layers.{i}.self_attn.v_proj.weight'] = llama_model.layers[layer_id].attention.wv.weight.clone().to(dtype) + hf_state_dict[f'model.layers.{i}.self_attn.o_proj.weight'] = llama_model.layers[layer_id].attention.wo.weight.clone().to(dtype) + hf_state_dict[f'model.layers.{i}.post_attention_layernorm.weight'] = llama_model.layers[layer_id].ffn_norm.weight.clone().to(dtype) + hf_state_dict[f'model.layers.{i}.mlp.gate_proj.weight'] = llama_model.layers[layer_id].feed_forward.w1.weight.clone().to(dtype) + hf_state_dict[f'model.layers.{i}.mlp.down_proj.weight'] = llama_model.layers[layer_id].feed_forward.w2.weight.clone().to(dtype) + hf_state_dict[f'model.layers.{i}.mlp.up_proj.weight'] = llama_model.layers[layer_id].feed_forward.w3.weight.clone().to(dtype) - hf_state_dict['lm_head.weight'] = llama_model.output.weight.clone() + hf_state_dict['lm_head.weight'] = llama_model.output.weight.clone().to(dtype) # Generate LlamaConfig (seen in transformers.models.llama.configuration_llama) @@ -335,7 +335,7 @@ def hf_export(llama_model, filepath, group_size=64): os.makedirs(filepath, exist_ok=True) # Save the state dictionary in .pt format - torch.save(hf_state_dict, os.path.join(filepath, "pytorch_model.pt")) + torch.save(hf_state_dict, os.path.join(filepath, "pytorch_model.bin")) config.save_pretrained(filepath) From fc11cc387b47efd98ca4ac0956f715d2e5451c41 Mon Sep 17 00:00:00 2001 From: Nicky Pochinkov Date: Sat, 16 Sep 2023 18:10:36 +0100 Subject: [PATCH 20/29] Changed code so that lm_head and token_embed are tied --- export.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/export.py b/export.py index e0f7a9b..d87a0d5 100644 --- a/export.py +++ b/export.py @@ -297,7 +297,9 @@ def hf_export(llama_model, filepath, group_size=64, dtype=torch.float16): 
hf_state_dict[f'model.layers.{i}.mlp.down_proj.weight'] = llama_model.layers[layer_id].feed_forward.w2.weight.clone().to(dtype) hf_state_dict[f'model.layers.{i}.mlp.up_proj.weight'] = llama_model.layers[layer_id].feed_forward.w3.weight.clone().to(dtype) - hf_state_dict['lm_head.weight'] = llama_model.output.weight.clone().to(dtype) + # llama2.c uses tied weights, so we reference the embed_tokens.weights instead + #hf_state_dict['lm_head.weight'] = llama_model.output.weight.clone().to(dtype) + hf_state_dict['lm_head.weight'] = hf_state_dict['model.embed_tokens.weight'] # Generate LlamaConfig (seen in transformers.models.llama.configuration_llama) From 19f40a2a717e6bff858a551b7fd8776b11edcbd7 Mon Sep 17 00:00:00 2001 From: Nicky Pochinkov Date: Sat, 16 Sep 2023 18:32:21 +0100 Subject: [PATCH 21/29] Made default hf export torch.float32 --- export.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/export.py b/export.py index d87a0d5..dc55234 100644 --- a/export.py +++ b/export.py @@ -265,7 +265,7 @@ def version2_export(model, filepath, group_size=64): out_file.close() print(f"wrote {filepath}") -def hf_export(llama_model, filepath, group_size=64, dtype=torch.float16): +def hf_export(llama_model, filepath, group_size=64, dtype=torch.float32): """ Generate the pytorch_model.bin state_dict and config.json for HuggingFace """ try: From a61173d6b9dd544631a73808ffa89592ef8fa6e9 Mon Sep 17 00:00:00 2001 From: Nicky Pochinkov Date: Sat, 16 Sep 2023 18:32:31 +0100 Subject: [PATCH 22/29] Added CLI dtype code --- export.py | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/export.py b/export.py index dc55234..79dba5e 100644 --- a/export.py +++ b/export.py @@ -480,7 +480,8 @@ def load_hf_model(model_path): # ----------------------------------------------------------------------------- # API entrypoint -def model_export(model, filepath, version): +def model_export(model, filepath, version, dtype=torch.float32): + # TODO: add dtype export 
support for other versions if version == 0: legacy_export(model, filepath) elif version == 1: @@ -488,7 +489,7 @@ elif version == 2: version2_export(model, filepath) elif version == -1: - hf_export(model, filepath) + hf_export(model, filepath, dtype=dtype) else: raise ValueError(f"unknown version {version}") @@ -528,11 +529,13 @@ if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("filepath", type=str, help="the output filepath") parser.add_argument("--version", default=0, type=int, help="the version to export with") + parser.add_argument("--dtype", type=str, help="dtype of the model (fp16, fp32)", default="fp32") group = parser.add_mutually_exclusive_group(required=True) group.add_argument("--checkpoint", type=str, help="model checkpoint, .pt file") group.add_argument("--meta-llama", type=str, help="meta llama model path") group.add_argument("--hf", type=str, help="huggingface model path") args = parser.parse_args() + dtype = {"fp16": torch.float16, "fp32": torch.float32}[args.dtype] if args.checkpoint: model = load_checkpoint(args.checkpoint) @@ -545,4 +548,4 @@ if __name__ == "__main__": parser.error("Can't load input model!") # export - model_export(model, args.filepath, args.version) + model_export(model, args.filepath, args.version, dtype) From ffea28751614cf161e513c09e2c2fd1635115a42 Mon Sep 17 00:00:00 2001 From: Nicky Pochinkov Date: Sat, 16 Sep 2023 18:46:27 +0100 Subject: [PATCH 23/29] updated comment .pt -> .bin --- export.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/export.py b/export.py index 79dba5e..19336b6 100644 --- a/export.py +++ b/export.py @@ -336,7 +336,7 @@ def hf_export(llama_model, filepath, group_size=64, dtype=torch.float32): # First make the directory if it doesn't exist os.makedirs(filepath, exist_ok=True) - # Save the state dictionary in .pt format + # Save the state dictionary in .bin format, and config as .json torch.save(hf_state_dict, 
os.path.join(filepath, "pytorch_model.bin")) config.save_pretrained(filepath) From d3c25b10a6d0ca89c563a192267c24c04379ba27 Mon Sep 17 00:00:00 2001 From: Nicky Pochinkov Date: Thu, 21 Sep 2023 16:36:36 +0200 Subject: [PATCH 24/29] Add checks/config for tied embedding weights --- export.py | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/export.py b/export.py index 19336b6..08c6813 100644 --- a/export.py +++ b/export.py @@ -297,10 +297,14 @@ def hf_export(llama_model, filepath, group_size=64, dtype=torch.float32): hf_state_dict[f'model.layers.{i}.mlp.down_proj.weight'] = llama_model.layers[layer_id].feed_forward.w2.weight.clone().to(dtype) hf_state_dict[f'model.layers.{i}.mlp.up_proj.weight'] = llama_model.layers[layer_id].feed_forward.w3.weight.clone().to(dtype) - # llama2.c uses tied weights, so we reference the embed_tokens.weights instead - #hf_state_dict['lm_head.weight'] = llama_model.output.weight.clone().to(dtype) + # llama2.c usually uses tied weights -> reference the embed_tokens.weights instead hf_state_dict['lm_head.weight'] = hf_state_dict['model.embed_tokens.weight'] + # We check that the embeddings are tied, else use manual output weights + _embeddings_are_tied: bool = torch.equal(llama_model.tok_embeddings.weight, llama_model.output.weight) + if not _embeddings_are_tied: + hf_state_dict['lm_head.weight'] = llama_model.output.weight.clone().to(dtype) + # Generate LlamaConfig (seen in transformers.models.llama.configuration_llama) @@ -326,6 +330,7 @@ def hf_export(llama_model, filepath, group_size=64, dtype=torch.float32): num_key_value_heads=num_key_value_heads, max_position_embeddings=max_position_embeddings, rms_norm_eps=rms_norm_eps, + tie_word_embeddings=_embeddings_are_tied, # Manual architectures=["LlamaForCausalLM"], hidden_act="silu", From 2dedad6ceaa68bb4b5d101cb69034ec7aa1ee6c5 Mon Sep 17 00:00:00 2001 From: Nicky Pochinkov Date: Thu, 21 Sep 2023 16:38:06 +0200 Subject: [PATCH 25/29] Added support for repeated kv 
weights --- export.py | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/export.py b/export.py index 08c6813..787b1ef 100644 --- a/export.py +++ b/export.py @@ -276,20 +276,29 @@ def hf_export(llama_model, filepath, group_size=64, dtype=torch.float32): return None # Generate LlamaModel state_dict - def permute_original(w, n_heads=llama_model.params.n_heads, dim1=llama_model.params.dim, dim2=llama_model.params.dim): - return w.view(dim1, dim2).reshape(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2) - hf_state_dict = {} + # Sometimes we have repeated key values for the heads + dim = llama_model.params.dim + num_key_value_heads = llama_model.params.n_kv_heads + n_rep = llama_model.params.n_heads // num_key_value_heads + key_value_dim = dim // n_rep + + # HuggingFace needs the weights permuted. + # See: https://github.com/huggingface/transformers/blob/b132c1703eb1c8bd9dfa4ad6a9be2bfd6ef819e9/src/transformers/models/llama/convert_llama_weights_to_hf.py#L122 + def permute_original(w, n_heads=llama_model.params.n_heads, dim1=dim, dim2=dim): + return w.view(dim1, dim2).reshape(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2) + # Transfer weights from llama model to the HF state dictionary format hf_state_dict['model.embed_tokens.weight'] = llama_model.tok_embeddings.weight.clone().to(dtype) hf_state_dict['model.norm.weight'] = llama_model.norm.weight.clone().to(dtype) + # Add each layer's weights to the HF state dictionary for i, layer in enumerate(llama_model.layers): - layer_id = layer.layer_id # Assuming llama.c layers have layer_id + layer_id = layer.layer_id hf_state_dict[f'model.layers.{i}.input_layernorm.weight'] = llama_model.layers[layer_id].attention_norm.weight.clone().to(dtype) hf_state_dict[f'model.layers.{i}.self_attn.q_proj.weight'] = permute_original(llama_model.layers[layer_id].attention.wq.weight.clone()).to(dtype) - 
hf_state_dict[f'model.layers.{i}.self_attn.k_proj.weight'] = permute_original(llama_model.layers[layer_id].attention.wk.weight.clone()).to(dtype) + hf_state_dict[f'model.layers.{i}.self_attn.k_proj.weight'] = permute_original(llama_model.layers[layer_id].attention.wk.weight.clone(), num_key_value_heads, key_value_dim, dim).to(dtype) hf_state_dict[f'model.layers.{i}.self_attn.v_proj.weight'] = llama_model.layers[layer_id].attention.wv.weight.clone().to(dtype) hf_state_dict[f'model.layers.{i}.self_attn.o_proj.weight'] = llama_model.layers[layer_id].attention.wo.weight.clone().to(dtype) hf_state_dict[f'model.layers.{i}.post_attention_layernorm.weight'] = llama_model.layers[layer_id].ffn_norm.weight.clone().to(dtype) @@ -318,8 +327,9 @@ def hf_export(llama_model, filepath, group_size=64, dtype=torch.float32): max_position_embeddings = llama_model.params.max_seq_len rms_norm_eps = llama_model.params.norm_eps - # TODO values for: pretraining_tp, initializer_range, use_cache, - # tie_word_embeddings, rope_theta, and rope_scaling. + # TODO check values for: + # pretraining_tp, initializer_range, use_cache, + # rope_theta, and rope_scaling. 
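The k_proj change above matters because with grouped-query attention wk is (key_value_dim, dim) rather than (dim, dim), so it has to be permuted with the KV head count, not the query head count. A standalone sketch of the same reordering, with hypothetical toy sizes and numpy standing in for torch (not code from the patch):

```python
import numpy as np

# Hypothetical toy config, not from the patch: 8 query heads sharing 4 KV heads.
dim, n_heads, n_kv_heads = 64, 8, 4
n_rep = n_heads // n_kv_heads        # 2 query heads per KV head
key_value_dim = dim // n_rep         # KV projections are narrower: 32

def permute_original(w, n_heads, dim1, dim2):
    # Reorder each head's rotary dims from interleaved (llama2.c)
    # to split-halves (HuggingFace) layout, mirroring export.py.
    return (w.reshape(n_heads, dim1 // n_heads // 2, 2, dim2)
             .transpose(0, 2, 1, 3)
             .reshape(dim1, dim2))

wq = np.arange(dim * dim, dtype=np.float32).reshape(dim, dim)
wk = np.arange(key_value_dim * dim, dtype=np.float32).reshape(key_value_dim, dim)

q_hf = permute_original(wq, n_heads, dim, dim)               # shape (64, 64)
k_hf = permute_original(wk, n_kv_heads, key_value_dim, dim)  # shape (32, 64)
```

Only the q/k projections need this treatment, since they are the ones RoPE rotates; wv and wo are copied over unchanged, as the diff shows.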
config = LlamaConfig( vocab_size=vocab_size, From 9fdb1316c7f63c3f3629f5f737ac7f1750528704 Mon Sep 17 00:00:00 2001 From: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Date: Sun, 1 Oct 2023 11:18:33 +0530 Subject: [PATCH 26/29] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index aaf7ded..f03fd25 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,7 @@ Please note that this repo started recently as a fun weekend project: I took my [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/karpathy/llama2.c/blob/master/run.ipynb) -First, navigate to the folder when you keep your projects and clone this repository to this folder: +First, navigate to the folder where you keep your projects and clone this repository to this folder: ```bash git clone https://github.com/karpathy/llama2.c.git From d0237abd32e553317a2bd80ecd5d4c621ddd307a Mon Sep 17 00:00:00 2001 From: Andrej Date: Thu, 5 Oct 2023 15:19:39 -0700 Subject: [PATCH 27/29] Bring back legendary tag line :D hahaha --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index aaf7ded..7e243f4 100644 --- a/README.md +++ b/README.md @@ -4,6 +4,8 @@ Cute Llama

+Have you ever wanted to inference a baby [Llama 2](https://ai.meta.com/llama/) model in pure C? No? Well, now you can! + Train the Llama 2 LLM architecture in PyTorch then inference it with one simple 700-line C file ([run.c](run.c)). You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough (ref: [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) paper). This repo is a "fullstack" train + inference solution for Llama 2 LLM, with focus on minimalism and simplicity. As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B. Work on model quantization is currently ongoing. From 2752ab69499f96aa6f3636936bd47d633fa34162 Mon Sep 17 00:00:00 2001 From: Akshay Trikha Date: Fri, 6 Oct 2023 20:27:24 -0700 Subject: [PATCH 28/29] hugginface --> huggingface --- README.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 7e243f4..f335d1a 100644 --- a/README.md +++ b/README.md @@ -113,7 +113,7 @@ python tokenizer.py --tokenizer-model=/path/to/CodeLlama-7b-Instruct/tokenizer.m ./run codellama2_7b_instruct.bin -m chat -z /path/to/CodeLlama-7b-Instruct/tokenizer.bin ``` -## hugginface models +## huggingface models We can load any huggingface models that use the Llama 2 architecture. See the script [export.py](export.py) and the `--hf` flag to export the model .bin file. @@ -121,12 +121,12 @@ We can load any huggingface models that use the Llama 2 architecture. See the sc For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). The 110M took around 24 hours. 
I am hosting them on huggingface hub [tinyllamas](https://huggingface.co/karpathy/tinyllamas), both in the original PyTorch .pt, and also in the llama2.c format .bin: -| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download -| --- | --- | --- | --- | --- | --- | --- | --- | --- | -| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | [stories260K](https://huggingface.co/karpathy/tinyllamas/tree/main/stories260K) -| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | [stories15M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin) | -| 42M| 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | [stories42M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin) | -| 110M| 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | [stories110M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin) | +| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download | +| ----- | --- | -------- | ------- | ---------- | ------------------ | ---------- | -------- | ------------------------------------------------------------------------------------------ | +| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | [stories260K](https://huggingface.co/karpathy/tinyllamas/tree/main/stories260K) | +| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | [stories15M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin) | +| 42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | [stories42M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin) | +| 110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | [stories110M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin) | You'll notice that the 110M model is equivalent to GPT-1 in size. Alternatively, this is also the smallest model in the GPT-2 series (`GPT-2 small`), except the max context length is only 1024 instead of 2048. 
The only notable changes from GPT-1/2 architecture is that Llama uses RoPE relatively positional embeddings instead of absolute/learned positional embeddings, a bit more fancy SwiGLU non-linearity in the MLP, RMSNorm instead of LayerNorm, bias=False on all Linear layers, and is optionally multiquery (but this is not yet supported in llama2.c). @@ -145,7 +145,7 @@ Then train our model: python train.py ``` -**brief training guide**. See the train.py script for more exotic launches and hyperparameter overrides. Here is a brief guide to how to set the parameters. Look at the table at the very end of the [Chinchilla paper](https://arxiv.org/abs/2203.15556) to get a sense of how the Transformer parameters (dim, n_layers, n_heads) grow or shrink together. Extrapolate/interpolate this pattern to get bigger or smaller transformers. Set the max context length however you wish, depending on the problem: this should be the max number of tokens that matter to predict the next token. E.g. Llama 2 uses 2048. Next, you want the _total_ batch size per update (printed by the script as "tokens per iteration will be:") to be somewhere around 100K tokens for medium-sized applications. For tiny applications it could be lower, for large training (e.g. GPTs/LLamas) it is usually ~0.5M, or even more. You get there by first maxing out the batch_size to whatever your system allows (e.g. mine was 16 in a recent run because after that my GPU runs out of memory), and then you want to increase gradient_accumulation_steps to be as high as necessary to reach the total batch size of ~100K. Finally, you want to tune your learning_rate (LR). You want this to be as high as your training allows. Very small networks can get away with a large LR (e.g. 1e-3 or even higher). Large networks need lower LRs. 3e-4 is a safe choice in most medium-sized applications, but can be too low for small networks, so try to increase it! Finally, max_iters is the length of training. Play with different settings. 
I mostly only ever tune these parameters and leave most of the others unchanged. Here is an example of how I trained the 110M model, which I don't think is anywhere near optimal, but looked sensible to me: dim 768, n_layers 12, n_heads 12 (so size of each head is 768 / 12 = 64 channels), seq len of 1024, batch size 16 (this is the most that fit my A100 40GB GPU), gradient_accumulation_steps = 8 was needed to get total tokens batch size to be 16 batch size * 1024 tokens in sequence * 8 grad_accum = 131,072 tokens per update. Good. Learning rate 4e-4 (probably a little too low). max_iters 200K (probably a bit too high). Dropout 0.1, as that usually helps a bit at medium size. That was it. I ran using Distributed Data Parallel (DDP) on 4 GPUs on my cloud machine, training took ~day or so. +**brief training guide**. See the train.py script for more exotic launches and hyperparameter overrides. Here is a brief guide to how to set the parameters. Look at the table at the very end of the [Chinchilla paper](https://arxiv.org/abs/2203.15556) to get a sense of how the Transformer parameters (dim, n*layers, n_heads) grow or shrink together. Extrapolate/interpolate this pattern to get bigger or smaller transformers. Set the max context length however you wish, depending on the problem: this should be the max number of tokens that matter to predict the next token. E.g. Llama 2 uses 2048. Next, you want the \_total* batch size per update (printed by the script as "tokens per iteration will be:") to be somewhere around 100K tokens for medium-sized applications. For tiny applications it could be lower, for large training (e.g. GPTs/LLamas) it is usually ~0.5M, or even more. You get there by first maxing out the batch_size to whatever your system allows (e.g. mine was 16 in a recent run because after that my GPU runs out of memory), and then you want to increase gradient_accumulation_steps to be as high as necessary to reach the total batch size of ~100K. 
Finally, you want to tune your learning_rate (LR). You want this to be as high as your training allows. Very small networks can get away with a large LR (e.g. 1e-3 or even higher). Large networks need lower LRs. 3e-4 is a safe choice in most medium-sized applications, but can be too low for small networks, so try to increase it! Finally, max_iters is the length of training. Play with different settings. I mostly only ever tune these parameters and leave most of the others unchanged. Here is an example of how I trained the 110M model, which I don't think is anywhere near optimal, but looked sensible to me: dim 768, n_layers 12, n_heads 12 (so size of each head is 768 / 12 = 64 channels), seq len of 1024, batch size 16 (this is the most that fit my A100 40GB GPU), gradient_accumulation_steps = 8 was needed to get total tokens batch size to be 16 batch size _ 1024 tokens in sequence _ 8 grad_accum = 131,072 tokens per update. Good. Learning rate 4e-4 (probably a little too low). max_iters 200K (probably a bit too high). Dropout 0.1, as that usually helps a bit at medium size. That was it. I ran using Distributed Data Parallel (DDP) on 4 GPUs on my cloud machine, training took ~day or so. Totally understand if you want to skip model training, for simple demo just download one of the pretrained models (see [models](#models) section), e.g.: From c97befa7d27062315657560338bf09c300378052 Mon Sep 17 00:00:00 2001 From: Akshay Trikha Date: Fri, 6 Oct 2023 20:33:44 -0700 Subject: [PATCH 29/29] remove accidental linting --- README.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index f335d1a..a05288a 100644 --- a/README.md +++ b/README.md @@ -121,12 +121,12 @@ We can load any huggingface models that use the Llama 2 architecture. See the sc For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). 
The 110M took around 24 hours. I am hosting them on huggingface hub [tinyllamas](https://huggingface.co/karpathy/tinyllamas), both in the original PyTorch .pt, and also in the llama2.c format .bin: -| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download | -| ----- | --- | -------- | ------- | ---------- | ------------------ | ---------- | -------- | ------------------------------------------------------------------------------------------ | -| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | [stories260K](https://huggingface.co/karpathy/tinyllamas/tree/main/stories260K) | -| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | [stories15M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin) | -| 42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | [stories42M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin) | -| 110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | [stories110M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin) | +| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download +| --- | --- | --- | --- | --- | --- | --- | --- | --- | +| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | [stories260K](https://huggingface.co/karpathy/tinyllamas/tree/main/stories260K) +| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | [stories15M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin) | +| 42M| 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | [stories42M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin) | +| 110M| 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | [stories110M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin) | You'll notice that the 110M model is equivalent to GPT-1 in size. Alternatively, this is also the smallest model in the GPT-2 series (`GPT-2 small`), except the max context length is only 1024 instead of 2048. 
The only notable changes from GPT-1/2 architecture is that Llama uses RoPE relatively positional embeddings instead of absolute/learned positional embeddings, a bit more fancy SwiGLU non-linearity in the MLP, RMSNorm instead of LayerNorm, bias=False on all Linear layers, and is optionally multiquery (but this is not yet supported in llama2.c). @@ -145,7 +145,7 @@ Then train our model: python train.py ``` -**brief training guide**. See the train.py script for more exotic launches and hyperparameter overrides. Here is a brief guide to how to set the parameters. Look at the table at the very end of the [Chinchilla paper](https://arxiv.org/abs/2203.15556) to get a sense of how the Transformer parameters (dim, n*layers, n_heads) grow or shrink together. Extrapolate/interpolate this pattern to get bigger or smaller transformers. Set the max context length however you wish, depending on the problem: this should be the max number of tokens that matter to predict the next token. E.g. Llama 2 uses 2048. Next, you want the \_total* batch size per update (printed by the script as "tokens per iteration will be:") to be somewhere around 100K tokens for medium-sized applications. For tiny applications it could be lower, for large training (e.g. GPTs/LLamas) it is usually ~0.5M, or even more. You get there by first maxing out the batch_size to whatever your system allows (e.g. mine was 16 in a recent run because after that my GPU runs out of memory), and then you want to increase gradient_accumulation_steps to be as high as necessary to reach the total batch size of ~100K. Finally, you want to tune your learning_rate (LR). You want this to be as high as your training allows. Very small networks can get away with a large LR (e.g. 1e-3 or even higher). Large networks need lower LRs. 3e-4 is a safe choice in most medium-sized applications, but can be too low for small networks, so try to increase it! Finally, max_iters is the length of training. Play with different settings. 
I mostly only ever tune these parameters and leave most of the others unchanged. Here is an example of how I trained the 110M model, which I don't think is anywhere near optimal, but looked sensible to me: dim 768, n_layers 12, n_heads 12 (so size of each head is 768 / 12 = 64 channels), seq len of 1024, batch size 16 (this is the most that fit my A100 40GB GPU), gradient_accumulation_steps = 8 was needed to get total tokens batch size to be 16 batch size _ 1024 tokens in sequence _ 8 grad_accum = 131,072 tokens per update. Good. Learning rate 4e-4 (probably a little too low). max_iters 200K (probably a bit too high). Dropout 0.1, as that usually helps a bit at medium size. That was it. I ran using Distributed Data Parallel (DDP) on 4 GPUs on my cloud machine, training took ~day or so. +**brief training guide**. See the train.py script for more exotic launches and hyperparameter overrides. Here is a brief guide to how to set the parameters. Look at the table at the very end of the [Chinchilla paper](https://arxiv.org/abs/2203.15556) to get a sense of how the Transformer parameters (dim, n_layers, n_heads) grow or shrink together. Extrapolate/interpolate this pattern to get bigger or smaller transformers. Set the max context length however you wish, depending on the problem: this should be the max number of tokens that matter to predict the next token. E.g. Llama 2 uses 2048. Next, you want the _total_ batch size per update (printed by the script as "tokens per iteration will be:") to be somewhere around 100K tokens for medium-sized applications. For tiny applications it could be lower, for large training (e.g. GPTs/LLamas) it is usually ~0.5M, or even more. You get there by first maxing out the batch_size to whatever your system allows (e.g. mine was 16 in a recent run because after that my GPU runs out of memory), and then you want to increase gradient_accumulation_steps to be as high as necessary to reach the total batch size of ~100K. 
Finally, you want to tune your learning_rate (LR). You want this to be as high as your training allows. Very small networks can get away with a large LR (e.g. 1e-3 or even higher). Large networks need lower LRs. 3e-4 is a safe choice in most medium-sized applications, but can be too low for small networks, so try to increase it! Finally, max_iters is the length of training. Play with different settings. I mostly only ever tune these parameters and leave most of the others unchanged. Here is an example of how I trained the 110M model, which I don't think is anywhere near optimal, but looked sensible to me: dim 768, n_layers 12, n_heads 12 (so size of each head is 768 / 12 = 64 channels), seq len of 1024, batch size 16 (this is the most that fit my A100 40GB GPU), gradient_accumulation_steps = 8 was needed to get total tokens batch size to be 16 batch size * 1024 tokens in sequence * 8 grad_accum = 131,072 tokens per update. Good. Learning rate 4e-4 (probably a little too low). max_iters 200K (probably a bit too high). Dropout 0.1, as that usually helps a bit at medium size. That was it. I ran using Distributed Data Parallel (DDP) on 4 GPUs on my cloud machine, training took ~day or so. Totally understand if you want to skip model training, for simple demo just download one of the pretrained models (see [models](#models) section), e.g.:
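The total-batch-size arithmetic the guide walks through can be sanity-checked in a couple of lines (the values are the ones quoted for the 110M run):

```python
# Tokens per optimizer update, as train.py reports ("tokens per iteration will be:").
batch_size = 16         # the most that fit a single A100 40GB in the run described
seq_len = 1024          # max context length
grad_accum_steps = 8    # gradient_accumulation_steps
tokens_per_update = batch_size * seq_len * grad_accum_steps
print(tokens_per_update)  # 131072, comfortably past the ~100K target
```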