support Aquila-7B model series #2487

Merged
6 commits merged into ggml-org:master on Aug 2, 2023

Conversation

@ftgreat (Contributor) commented Aug 2, 2023

We released the Aquila-7B model series (related issue), which is based on Chinese and English knowledge.
The models are also open-sourced on HuggingFace and FlagAI.

Because these models use a BPE tokenizer, our pull request to support the BPE tokenizer has already been merged.

Could the Aquila-7B models be added to llama.cpp? Thanks for your review.

ldwang and others added 6 commits July 15, 2023 14:12
Signed-off-by: ldwang <ftgreat@gmail.com>
@monatis merged commit 220d931 into ggml-org:master on Aug 2, 2023
@goerch (Collaborator) commented Aug 6, 2023

I'm trying to convert Aquila-7B with

python.exe convert.py models\Aquila-7B --vocabtype bpe

The first problem I ran into was the missing encoding in

        if self.vocabtype == "bpe":
          self.sentencepiece_tokenizer = json.loads(open(str(fname_tokenizer), encoding='utf-8').read())
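A self-contained version of that fix might look like the following sketch (the helper name is made up for illustration; the point is simply to force UTF-8 rather than the platform default when reading the HuggingFace-style vocab.json):

import json
from pathlib import Path

def load_bpe_vocab(fname_tokenizer: Path) -> dict:
    # Hypothetical helper: read vocab.json explicitly as UTF-8 so the result does
    # not depend on the platform's default encoding (e.g. cp1252 on Windows).
    with open(fname_tokenizer, encoding="utf-8") as f:
        return json.load(f)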

Now I'm stuck with

Exception: Vocab size mismatch (model has 100008, but models\Aquila-7B\vocab.json has 100000).  Most likely you are missing added_tokens.json (should be in models\Aquila-7B).

Edit: is UTF-8 the correct encoding?
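For what it's worth, HuggingFace tokenizer and vocabulary files are plain UTF-8 JSON, so that encoding should be right. The two numbers in the mismatch can be reproduced with a short check along these lines (a sketch, assuming a standard HuggingFace config.json with a vocab_size field sits next to vocab.json):

import json
from pathlib import Path

model_dir = Path("models/Aquila-7B")  # adjust to your checkout

# The model declares its full embedding size in config.json, while vocab.json
# only holds the base BPE vocabulary; the gap is the number of added tokens.
config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
vocab = json.loads((model_dir / "vocab.json").read_text(encoding="utf-8"))

print("model vocab size:", config["vocab_size"])  # the exception above reports 100008
print("vocab.json size :", len(vocab))            # the exception above reports 100000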

@goerch (Collaborator) commented Aug 6, 2023

OK, I found the missing added_tokens in tokenizer.json:

  "added_tokens": [
    {
      "id": 0,
      "content": "<|endoftext|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 100000,
      "content": "<|startofpiece|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100001,
      "content": "<|endofpiece|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100002,
      "content": "<|LDWANG|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100003,
      "content": "[MASK]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100004,
      "content": "[gMASK]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100005,
      "content": "[sMASK]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100006,
      "content": "[CLS]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100007,
      "content": "</s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    }
  ],

Now I'm not sure how to incorporate them.

@goerch (Collaborator) commented Aug 6, 2023

Manually generating 'added_tokens.json' with content

{
  "<|endoftext|>": 0,
  "<|startofpiece|>": 100000,
  "<|endofpiece|>": 100001,
  "<|LDWANG|>": 100002,
  "[MASK]": 100003,
  "[gMASK]": 100004,
  "[sMASK]": 100005,
  "[CLS]": 100006,
  "</s>": 100007
}

results in

Exception: Expected added token IDs to be sequential and start at 9; got [0, 100000, 100001, 100002, 100003, 100004, 100005, 100006, 100007]

Edit: removing the entry for "<|endoftext|>" seems to fix the problem.
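
Putting the pieces together, added_tokens.json could also be generated directly from tokenizer.json with a small script along these lines (a sketch; the filter keeps only IDs at or above the size of vocab.json, which drops the <|endoftext|> entry with id 0 that the base vocabulary already covers):

import json
from pathlib import Path

model_dir = Path("models/Aquila-7B")  # adjust to your checkout

tokenizer = json.loads((model_dir / "tokenizer.json").read_text(encoding="utf-8"))
base_vocab = json.loads((model_dir / "vocab.json").read_text(encoding="utf-8"))

# Keep only tokens that extend the base vocabulary (IDs 100000-100007 here);
# the added token IDs are then sequential starting at len(base_vocab), which is
# what the check in convert.py accepts.
added = {
    tok["content"]: tok["id"]
    for tok in tokenizer["added_tokens"]
    if tok["id"] >= len(base_vocab)
}

(model_dir / "added_tokens.json").write_text(
    json.dumps(added, indent=2, ensure_ascii=False), encoding="utf-8"
)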
