CodeBERT support for embeddings #488

lambdaofgod · 2020-08-09T12:37:21Z

I've added support for CodeBERT models. These models essentially use the RoBERTa class and tokenizer, so it was only necessary to load the Hugging Face model.

Timoeller · 2020-08-10T10:40:01Z

Thanks for the PR. CodeBERT looks really interesting to us, especially the code search functionality.

But does the code run for all variants of CodeBERT?
I see model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm'), which is a different class from RobertaModel.

The code search functionality looks like text pair classification, which is already possible with current FARM, so for this use case we should be able to load CodeBERT.
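To make the text pair framing concrete, here is a minimal sketch of how a (query, code) pair could be laid out as a single input sequence. It assumes the RoBERTa pair layout used by the Hugging Face tokenizer (`<s> A </s></s> B </s>`). The function name `build_pair_input` and the pre-split tokens are hypothetical and only for illustration; a real pipeline would use the actual tokenizer.

```python
from typing import List


def build_pair_input(query_tokens: List[str], code_tokens: List[str]) -> List[str]:
    """Join a natural-language query and a code snippet into one sequence.

    Assumption: RoBERTa-style pair layout, <s> A </s></s> B </s>.
    The tokens are assumed to be already split; this is not a tokenizer.
    """
    return ["<s>", *query_tokens, "</s>", "</s>", *code_tokens, "</s>"]


# A classifier on top of the encoder would then predict whether
# the query and the code snippet belong together.
pair = build_pair_input(["return", "the", "maximum"], ["def", "f", "(", "xs", ")"])
```

Framed this way, code search becomes a binary decision per (query, code) pair, which is why no MLM head is needed for it.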

The MLM objective for "code docstring generation" is something we would need to implement in FARM.

What are your thoughts on that?

lambdaofgod · 2020-08-11T08:54:09Z

I only checked the model for classification because I was interested in using it for retrieval (actually I used Haystack).

Docstring generation would be pretty limited given that there is only a model for MLM (I don't think it can handle variable-length output), and Microsoft doesn't plan to release a model that can actually be used for text generation (see this issue).

If you have spare compute, someone (for example me) could actually fine-tune CodeBERT for docstring generation, but for now at least that's pretty hard, given that it was trained on 4 P40 GPUs and the data preprocessing step doesn't fit in 32 GB of RAM.

Timoeller · 2020-08-11T09:03:39Z

Nice, I am fine with just going for the textpairclassification case.

Will it create problems if we load the microsoft/codebert-base-mlm model without the MLM head? If so, I would rather exclude this model from being loaded, e.g.:

elif 'codebert' in pretrained_model_name_or_path.lower():
    if "mlm" in pretrained_model_name_or_path.lower():
        raise NotImplementedError("MLM part of codebert is currently not supported in FARM")
    else:
        ....
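As a self-contained illustration of the check proposed above, the sketch below wraps it in a standalone function. The function name `select_language_model_class` and the returned string `"Roberta"` are assumptions made for this example; they are not taken from the PR diff. The non-MLM branch only reflects the statement earlier in this thread that CodeBERT uses the RoBERTa class and tokenizer.

```python
def select_language_model_class(pretrained_model_name_or_path: str) -> str:
    """Infer a language model class name from the model string.

    Hypothetical helper: the name and the return values are illustrative.
    """
    name = pretrained_model_name_or_path.lower()
    if "codebert" in name:
        if "mlm" in name:
            # Reject the MLM checkpoint explicitly instead of loading it
            # silently without its MLM head.
            raise NotImplementedError(
                "MLM part of codebert is currently not supported in FARM"
            )
        # Assumption: CodeBERT is handled like a RoBERTa model.
        return "Roberta"
    raise ValueError(
        f"Cannot infer a language model class from {pretrained_model_name_or_path!r}"
    )
```

With this check, "microsoft/codebert-base" resolves to the RoBERTa path, while "microsoft/codebert-base-mlm" fails early with a clear message.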

lambdaofgod · 2020-08-15T10:39:40Z

I've checked the microsoft/codebert-base-mlm model and it works fine.

Timoeller · 2020-08-17T18:09:42Z

Hey, thanks a lot for checking. I would still prefer to exclude the MLM model from loading, because it will confuse people who load the MLM checkpoint and then do classification.
Would you like to adjust the code, or should I do it?

lambdaofgod · 2020-08-17T20:27:07Z

Changed.

Timoeller

Thanks for updating the PR.
I made a comment on the additional changes you made.

Timoeller · 2020-08-18T08:36:29Z

farm/modeling/language_model.py

@@ -154,7 +160,10 @@ def load(cls, pretrained_model_name_or_path, n_added_tokens=0, language_model_cl
            if language_model_class:
                language_model = cls.subclasses[language_model_class].load(pretrained_model_name_or_path, **kwargs)
            else:
-                language_model = None
+                try:


Hey, this seems like a solid idea and we should soon add this AutoModel functionality, but this implementation will create errors downstream that might be tough for a user to understand.
Each model has to be implemented (it's often not much work) and checked for interoperability with FARM's LanguageModel class.

But I agree, we currently load models based on the model string, which is suboptimal.
We could load models based on the config and only select those that are already implemented in FARM.
If you want to work on this functionality, please create a separate PR for it and flag it as work in progress, so we can give early feedback.
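A rough sketch of what config-based selection could look like, not the FARM implementation. It assumes the model's config JSON carries a `model_type` field, as Hugging Face configs typically do. The `IMPLEMENTED_MODELS` mapping and the function name `class_from_config` are hypothetical.

```python
import json

# Hypothetical allow-list: config "model_type" -> language model class name
# assumed to be implemented and checked against the LanguageModel class.
IMPLEMENTED_MODELS = {
    "bert": "Bert",
    "roberta": "Roberta",
}


def class_from_config(config_json: str) -> str:
    """Select a language model class from a model config instead of its name."""
    config = json.loads(config_json)
    model_type = config.get("model_type")
    if model_type not in IMPLEMENTED_MODELS:
        # Fail early with a readable message rather than erroring downstream.
        raise NotImplementedError(
            f"Model type {model_type!r} is not implemented yet"
        )
    return IMPLEMENTED_MODELS[model_type]
```

The point of the allow-list is that an unsupported architecture is rejected at load time, instead of producing harder-to-understand errors later.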

Timoeller · 2020-08-18T12:35:36Z

I guess something went wrong with the mail reply?

In my experience it is nearly always best to use the GitHub UI for answering...

lambdaofgod · 2020-08-29T12:55:09Z

I removed the AutoModel functionality.

Timoeller

Looking good

CodeBERT support

70900f9

exclude mlm codebert model

61cfe05

Timoeller suggested changes Aug 18, 2020


removed automodel support

ade64ca

tholor requested a review from Timoeller August 31, 2020 06:59

Timoeller approved these changes Aug 31, 2020


Timoeller merged commit dd3945d into deepset-ai:master Aug 31, 2020
