Skip to content

Fix #441 : paren-brace, equals-NA and no-tab lints in strings or comments #467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

russHyde
Copy link
Collaborator

@russHyde russHyde commented Mar 16, 2020

Some regex-based linters were throwing false-positives in irrelevant bits of source code (in strings, or in comments; see #441 ).
For example,

  • the paren-brace linter was flagging the regex in grepl("(iss){2}", "Mississippi"),
  • the equals-na linter was flagging is.na(x) # test whether x == NA"
  • the no-tabs linter was flagging multi_line_tabbed_string <- "lorem ipsum\n\tblah de blah"

Tests based on these false-positives were added.

This PR abstracts some of the code that these regex-based linters use into a function "make_linter_from_regex"; this function can be tuned to ignore regex matches that occur in comments and/or string definitions. Then redefines paren_brace_linter, equals_na_linter and no_tabs_linter to use make_linter_from_regex.

I have had some trouble understanding comment-parsing and would like to continue working on this PR for a bit (I'm shifting it off my work computer since I might not be able to access my office for a while)

@russHyde russHyde requested a review from jimhester as a code owner March 16, 2020 11:28
@jimhester
Copy link
Member

The main question I guess is if these linters should remain using regexes. The no tabs linter I guess has to, as you don't have access to whitespace in the AST. But definitely the equals_na linter could be implemented by looking at the AST and probably the paren brace linter as well, which would likely be more robust than relying on regexes for their behavior.

@gaborcsardi
Copy link
Member

gaborcsardi commented Mar 16, 2020

I think even for the whitespace linters, it is a better solution to create the AST, and then get all the positions that are not accounted for by terminal tokens, plus the trailing whitespace.

Otherwise it is pretty difficult to handle this with regular expressions, especially with the new raw character constants.

@russHyde
Copy link
Collaborator Author

Totally agree. I originally tried (and stuggled) to implement the paren-brace linter using xpath; I'll have a look at it again

@gaborcsardi
Copy link
Member

Totally agree. I originally tried (and stuggled) to implement the paren-brace linter using xpath; I'll have a look at it again

Let me know if you need help with the xpath way, it is really difficult to use it AFAIR. Unfortunately, apart from xpath, there isn't much software out there for attributed sub-graph isomorphism.

@russHyde
Copy link
Collaborator Author

Thanks

@russHyde
Copy link
Collaborator Author

The the equals-NA linter has been reimplemented using xpath. I will try to do the same for paren-braces linter (but will leave the tab-linter as is).

xpath seems like a multi-dimensional regex.

Should we be concerned that making xpath-searches the norm for linter implementations may raise the bar for new contributors? As such, I'd still like to sure up the code for building linters based on regexes in a way that can optionally exclude comments, string literals etc. Will continue to work on this.


I'm not sure why the Travis run isn't being reported in the PR checks (that is, whether it's a github or a travis issue): it can be found here https://travis-ci.org/github/jimhester/lintr/builds/668856106

Observed failures in Travis on R3.3 and R3.2 because Rcpp-compilation fails (something I've recently seen for my own (lintr-dependent) package; I just dropped the tests/support for R < 3.4 in dupree when that happened).

@jimhester
Copy link
Member

I agree with you about using XPath making it more difficult for contributors. That is also the reason I didn't use C or C++ in lintr, I wanted to lower the bar to contributors. Unfortunately R is not terribly well suited to this particular task and the performance is not great, which drives us towards more esoteric solutions like XPath...

Unfortunately I don't have any great solutions :/

@gaborcsardi
Copy link
Member

gaborcsardi commented Mar 31, 2020

Well, I would say that, yes, xpath is more difficult to write. But imo it is still easier to write a correct linter with xpath than with regexes. Especially with the new raw string that is coming in R 4.0.0.

An alternative to the xpath search would be to walk the AST manually, but that is still quite error prone imo, except for maybe the simplest patterns.

(Being the author of the xpath based search I am probably a bit (?) biased. :)

@jimhester
Copy link
Member

I agree that xpath is definitely better than regexes

russHyde added 8 commits April 6, 2020 10:11
Tests added to ensure paren-brace linter, equals-NA linter and no-tab
linter do not flag lints in irrelevant sections of code (eg, strings /
comments).
This can check if a regex-match is covered by an expression in a
source_file and restrict analysis to tokens of a particular type, if
required.
Also, rename file containing equals_na_linter to match format of other
linter-files (`*_linter.R` not `*_lintr.R`)
@russHyde russHyde force-pushed the regex-based-linter-issue-441 branch from d42bf64 to 998de4a Compare April 6, 2020 09:32
@russHyde
Copy link
Collaborator Author

russHyde commented Apr 6, 2020

@jimhester
I've updated the code.

  • No-Tab linter is still defined using a regex (and tabs within strings are ignored by this regex)
  • Both paren-brace and equals-NA linter are defined using xpath searches

I have added tests that should capture all the original bugs (paren-braces in comments or strings, equals-NA in comments or strings, tab-prefixed lines in a multi-line string) and a couple of additional tests.

The travis run is here: https://travis-ci.org/github/jimhester/lintr/builds/671546861
It still fails on R3.2 and 3.3; due to an Rcpp installation issue

@russHyde russHyde changed the title [WIP] Regex-based linter issue 441 Fix #441 : paren-brace, equals-NA and no-tab lints in strings or comments Apr 6, 2020
@MichaelChirico
Copy link
Collaborator

superseded by #620 (starts from the same history, just from a branch originating on the main repo)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants