Andrej Karpathy | 8417cb438d | Merge branch 'utf8' of https://github.com/atamurad/llama2.c into feature/utf8 | 2023-08-15 00:18:53 +00:00
Andrej Karpathy | ea4cedc588 | add ability to export custom tokenizer to .bin format for run.c file | 2023-08-13 02:00:19 +00:00
Andrej Karpathy | 4c6f0af9ff | add the ability to train a custom sentencepiece tokenizer with a given vocab_size, and pretok with it. some more changes still needed to merge this branch, in train.py and ofc run.c. did this in a sadly bit ugly, but fully backwards compatible way. basically when we use custom tokenizer we create a whole new directory structure for that | 2023-08-11 03:58:22 +00:00
atamyrat | c02865df30 | prompt tokenizer improvements: utf8 support, add_dummy_prefix and byte_fallback options to match sentencepiece | 2023-08-07 13:12:44 +03:00
Andrej Karpathy | b4bb47bb7b | big change: adding prompting. many LOC, but critical. ty @atamurad for the first draft, i ended up tuning it quite a bit. | 2023-07-28 04:12:54 +00:00
Andrej Karpathy | 3bfa5665d1 | delete the run_wrap file! yay. ty @python273 and @ggerganov for code snippets | 2023-07-24 04:02:57 +00:00
Andrej Karpathy | 5baaf9df06 | small format tweaks, get rid of prints in tokenizer | 2023-07-23 17:09:23 +00:00
Andrej Karpathy | 5b161abb9a | somewhere ~20 hours later | 2023-07-23 05:23:45 +00:00