This repository was archived by the owner on Apr 8, 2025. It is now read-only.

CodeBERT support for embeddings #488

Merged 3 commits on Aug 31, 2020

lambdaofgod
Contributor

I've added support for CodeBERT models. These models essentially use the RoBERTa class and tokenizer, so it was only necessary to load the Hugging Face model.
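
To illustrate the point about reusing the RoBERTa class and tokenizer, here is a minimal sketch using the Hugging Face transformers library directly. It is not the diff from this PR. The helper names `load_codebert` and `embed` are made up for the example, and using the first token's hidden state as the embedding is an assumption rather than something stated in this thread.

```python
CODEBERT_NAME = "microsoft/codebert-base"  # checkpoint name on the Hugging Face hub


def load_codebert(name: str = CODEBERT_NAME):
    """Load a CodeBERT checkpoint with the plain RoBERTa classes (hypothetical helper)."""
    # Imported lazily so that importing this module does not need transformers.
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(name)
    model = RobertaModel.from_pretrained(name)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed(text: str, tokenizer, model):
    """Return one vector per input text (assumption: first-token pooling)."""
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs[0] is the last hidden state; position 0 is the <s> token.
    return outputs[0][:, 0, :]


if __name__ == "__main__":
    # Downloads the checkpoint on first use.
    tok, mdl = load_codebert()
    print(embed("def add(a, b): return a + b", tok, mdl).shape)
```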

@Timoeller
Contributor

Thanks for the PR. CodeBERT looks really interesting to us, especially the code search functionality.

But does the code run for all CodeBERT variants?
I see model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm'), which is different from RobertaModel.

So the code search functionality looks like textpairclassification, which is possible with current FARM, and for this use case we should be able to load CodeBERT.
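
One way to picture code search as a text pair task is sketched below. This is only an assumption about how the data could be framed, not FARM's actual data pipeline: each (docstring, code) example becomes a positive pair, and the same docstring paired with another example's code becomes a negative pair. The function name `build_pairs` and the one-negative-per-example sampling are hypothetical.

```python
import random
from typing import List, Tuple


def build_pairs(
    examples: List[Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str, int]]:
    """Turn (docstring, code) examples into labelled (text_a, text_b, label) pairs.

    Label 1 marks a matching pair, label 0 a docstring paired with the code
    of a different example.
    """
    rng = random.Random(seed)
    pairs = []
    for i, (doc, code) in enumerate(examples):
        pairs.append((doc, code, 1))
        if len(examples) > 1:
            # Pick the code of any other example as a negative.
            j = rng.choice([k for k in range(len(examples)) if k != i])
            pairs.append((doc, examples[j][1], 0))
    return pairs


if __name__ == "__main__":
    data = [
        ("Add two numbers.", "def add(a, b): return a + b"),
        ("Reverse a string.", "def rev(s): return s[::-1]"),
    ]
    for text_a, text_b, label in build_pairs(data):
        print(label, "|", text_a, "|", text_b)
```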

The MLM objective for "code docstring generation" is something we would need to implement in FARM.

What are your thoughts on that?

@lambdaofgod
Contributor Author

I only checked the model for classification because I was interested in using it for retrieval (actually I used Haystack).

The docstring generation would be pretty limited given that there is only a model for MLM (I don't think it can handle variable-length output), and Microsoft doesn't plan to release a model that can actually be used for text generation (see this issue).

If you have spare compute, someone (for example me) could actually fine-tune CodeBERT for docstring generation, but for now at least it's pretty hard given that it was trained on 4 P40 GPUs and the data preprocessing step doesn't fit in 32GB RAM.

@Timoeller
Contributor

Nice, I am fine with just going for the textpairclassification case.

Will it create problems if we load the microsoft/codebert-base-mlm model without the MLM head? If so, I would rather exclude this model from being loaded, e.g.

elif 'codebert' in pretrained_model_name_or_path.lower():
    if "mlm" in pretrained_model_name_or_path.lower():
        raise NotImplementedError("MLM part of codebert is currently not supported in FARM")
    else:
        ....
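
As a standalone version of the check suggested above, the sketch below wraps the same name test in a function so it can be run on its own. The function name `check_codebert_name` and its boolean return value are hypothetical. The elided else branch of the original snippet is not reproduced here.

```python
def check_codebert_name(pretrained_model_name_or_path: str) -> bool:
    """Return True for a supported CodeBERT name, False for non-CodeBERT names.

    Raises NotImplementedError for CodeBERT MLM checkpoints, mirroring the
    guard proposed in the comment above.
    """
    name = pretrained_model_name_or_path.lower()
    if "codebert" not in name:
        return False
    if "mlm" in name:
        raise NotImplementedError(
            "MLM part of codebert is currently not supported in FARM"
        )
    return True


if __name__ == "__main__":
    print(check_codebert_name("microsoft/codebert-base"))
    try:
        check_codebert_name("microsoft/codebert-base-mlm")
    except NotImplementedError as err:
        print("rejected:", err)
```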

@lambdaofgod
Contributor Author

I've checked the microsoft/codebert-base-mlm model and it works fine.

@Timoeller
Contributor

Timoeller commented Aug 17, 2020

Hey, thanks a lot for checking. I would still prefer to exclude the MLM model from loading, because loading an MLM checkpoint and then doing classification will confuse people.
Would you like to adjust the code or should I do it?

@lambdaofgod
Contributor Author

Changed.

Contributor

@Timoeller left a comment

Thanks for updating the PR.
I made a comment on the additional changes you made.

@@ -154,7 +160,10 @@ def load(cls, pretrained_model_name_or_path, n_added_tokens=0, language_model_cl
if language_model_class:
    language_model = cls.subclasses[language_model_class].load(pretrained_model_name_or_path, **kwargs)
else:
    language_model = None
try:

Hey, this seems like a solid idea and we should soon add this AutoModel functionality, but this implementation will create errors downstream that might be tough for a user to understand.
Each model has to be implemented (it's often not much work) and checked for interoperability with FARM's LanguageModel class.

But I agree, we currently load models based on the model string, which is suboptimal.
We could load models based on the config and only select those that are already implemented in FARM.
If you want to work on this functionality please create a separate PR for this and flag it as work in progress, so we can give early feedback.
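
A rough sketch of the config-based loading idea could look like the following. It assumes the model's config is available as a dict with a `model_type` field, as in a Hugging Face config.json. The registry `IMPLEMENTED_MODEL_TYPES`, its contents, and the function name `select_language_model_class` are hypothetical and do not reflect FARM's actual code.

```python
# Hypothetical registry: maps the "model_type" field of a config to the name
# of a language model class that is assumed to be implemented already.
IMPLEMENTED_MODEL_TYPES = {
    "bert": "Bert",
    "roberta": "Roberta",
}


def select_language_model_class(config: dict) -> str:
    """Pick an implemented class from the config instead of the model name string."""
    model_type = config.get("model_type")
    if model_type is None:
        raise ValueError("config has no 'model_type' field")
    if model_type not in IMPLEMENTED_MODEL_TYPES:
        # Fail early with a clear message instead of erroring downstream.
        raise NotImplementedError(
            f"model_type '{model_type}' is not implemented yet"
        )
    return IMPLEMENTED_MODEL_TYPES[model_type]


if __name__ == "__main__":
    # A checkpoint that reuses the RoBERTa architecture would resolve to the
    # same class regardless of what its name string contains.
    print(select_language_model_class({"model_type": "roberta"}))
```

Failing early on an unknown `model_type` addresses the concern above about downstream errors that are hard for a user to understand.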

@mozzerela

This comment has been minimized.

@Timoeller
Contributor

I guess something went wrong with the mail reply?

In my experience it is nearly always best to use the GitHub UI for answering...

@lambdaofgod
Contributor Author

I removed the AutoModel functionality.

@tholor requested a review from Timoeller on August 31, 2020 06:59
Contributor

@Timoeller left a comment

Looking good

@Timoeller merged commit dd3945d into deepset-ai:master on Aug 31, 2020
3 participants