Conversation
Thanks for the PR. CodeBERT looks really interesting to us, especially the code search functionality. But does the code run for all cases of CodeBERT? The code search functionality looks like text-pair classification, which is possible with the current FARM, and for that use case we should be able to load CodeBERT. The MLM objective for "code docstring generation" is something we would need to implement in FARM. What are your thoughts on that?
I only checked the model for classification because I was interested in using it for retrieval (I actually used Haystack). Docstring generation would be pretty limited given that there is only an MLM model (I don't think it can handle variable-length output), and Microsoft doesn't plan to release a model that can actually be used for text generation (see this issue). If there is spare compute, someone (for example me) could actually fine-tune CodeBERT for docstring generation, but for now at least it's pretty hard, given that it was trained on 4 P40 GPUs and the data preprocessing step doesn't fit in 32 GB of RAM.
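For context, the text-pair classification setup discussed here scores a (natural-language query, code snippet) pair in a single forward pass. A minimal sketch of the RoBERTa-style pair encoding such a model consumes; the `encode_pair` helper and the hand-tokenized inputs are illustrative only, not FARM or Hugging Face API:

```python
# Sketch of NL-PL pair encoding for code search.
# RoBERTa-style models join a pair as: <s> A </s></s> B </s>.
# A real pipeline would use the model's own tokenizer instead.

def encode_pair(query_tokens, code_tokens):
    """Join a natural-language query and a code snippet into one sequence."""
    return ["<s>"] + query_tokens + ["</s>", "</s>"] + code_tokens + ["</s>"]

pair = encode_pair(
    ["return", "maximum", "value"],                              # search query
    ["def", "f", "(", "xs", ")", ":", "return", "max", "(", "xs", ")"],  # code
)
print(pair)  # one sequence, wrapped in <s> ... </s>
```

The classification head then reads the representation of the leading `<s>` token to produce a relevance score for the pair.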
Nice, I am fine with just going for the text-pair classification case. Will it create problems if we load the
I've checked the
Hey, thanks a lot for checking. I would still prefer to exclude the MLM model from loading, because it will confuse people loading MLM and doing classification.
Changed.
Thanks for updating the PR.
I made a comment on the additional changes you made.
farm/modeling/language_model.py
Outdated
```diff
@@ -154,7 +160,10 @@ def load(cls, pretrained_model_name_or_path, n_added_tokens=0, language_model_cl
         if language_model_class:
             language_model = cls.subclasses[language_model_class].load(pretrained_model_name_or_path, **kwargs)
         else:
-            language_model = None
+            try:
```
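The registry lookup in the diff dispatches on a class-name string, with the disputed change adding a fallback when no class is given. A self-contained sketch of that pattern under assumed names; the classes and the string-matching fallback are illustrative, not the actual FARM implementation:

```python
class LanguageModel:
    subclasses = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register every concrete language model under its class name.
        LanguageModel.subclasses[cls.__name__] = cls

    def __init__(self, name):
        self.name = name

    @classmethod
    def load(cls, name, language_model_class=None):
        if language_model_class:
            # Explicit class: direct registry lookup.
            return cls.subclasses[language_model_class](name)
        # Fallback: guess the class from the model-name string.
        for class_name, subclass in cls.subclasses.items():
            if class_name.lower() in name.lower():
                return subclass(name)
        raise ValueError(f"Could not infer a language model class from '{name}'")

class Roberta(LanguageModel):
    pass

model = LanguageModel.load("microsoft/codebert-base", language_model_class="Roberta")
print(type(model).__name__)  # Roberta
```

Note that the name-based fallback is exactly what fails for a model like `microsoft/codebert-base`: the string contains no implemented class name, so the explicit `language_model_class` argument is required.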
Hey, this seems like a solid idea and we should add this AutoModel functionality soon, but this implementation will create downstream errors that might be tough for a user to understand.
Each model has to be implemented (it's often not much work) and checked for interoperability with FARM's LanguageModel class.
But I agree, we currently load models based on the model string, which is suboptimal.
We could load models based on the config and only select those that are already implemented in FARM.
If you want to work on this functionality, please create a separate PR for it and flag it as work in progress, so we can give early feedback.
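The config-based loading proposed here could look roughly like the following: read `model_type` from the checkpoint's `config.json` and accept only architectures that are already implemented. The mapping and function name are illustrative assumptions, not FARM code:

```python
import json
import os
import tempfile

# Assumed allow-list: only these architectures are implemented.
IMPLEMENTED = {"bert": "Bert", "roberta": "Roberta", "xlnet": "XLNet"}

def select_model_class(model_dir):
    """Pick a model class from config.json instead of the model-name string."""
    with open(os.path.join(model_dir, "config.json")) as f:
        model_type = json.load(f)["model_type"]
    if model_type not in IMPLEMENTED:
        raise ValueError(f"Model type '{model_type}' is not implemented")
    return IMPLEMENTED[model_type]

# CodeBERT ships a RoBERTa config, so it would resolve to the RoBERTa class
# regardless of its model name.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "config.json"), "w") as f:
        json.dump({"model_type": "roberta"}, f)
    print(select_model_class(d))  # Roberta
```

This avoids both failure modes above: no string guessing, and unimplemented architectures are rejected with a clear error instead of breaking downstream.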
I guess something went wrong with the mail reply? In my experience it is nearly always best to use the GitHub UI for answering...
I removed the AutoModel functionality.
Looking good
I've added support for CodeBERT models. These models essentially use the RoBERTa class and tokenizer, so it was only necessary to load the Hugging Face model.