Skip to content

implement fixed_regex_linter as plain R + regex #1032

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 56 commits into from
May 19, 2022
Merged

Conversation

AshesITR
Copy link
Collaborator

NB merge target is the original fixed_regex PR #1021 so we can separately review the implementation of is_not_regex and the PR in full.

@AshesITR AshesITR changed the base branch from fixed_regex to master March 30, 2022 19:23
@AshesITR
Copy link
Collaborator Author

Changed target to master temporarily to get GHA goodness.

@AshesITR AshesITR changed the base branch from master to fixed_regex March 30, 2022 19:38
@AshesITR
Copy link
Collaborator Author

@MichaelChirico LMK if you would like me to adapt .dev/compare_branches.R to include an argument (--base-branch?)
Allowing a local branch from r-lib as reference should give enough flexibility I think.

@MichaelChirico
Copy link
Collaborator

yes please do. agree about scope.

@MichaelChirico
Copy link
Collaborator

Some patience here... my mirror has gathered some dust that's slow to remove :)

@AshesITR
Copy link
Collaborator Author

AshesITR commented Apr 1, 2022

No worries, thanks for the update 😊

@MichaelChirico
Copy link
Collaborator

On a sample of 2,000 packages, there are 8 false positives where the R version says fixed but the C version does not:

diff.csv

They are all the same expression:

gsub("\\[|\\]", "", s, perl = TRUE)

(also FWIW the two branches ran in almost indistinguishable time -- 13.18 (C) vs 13.49 minutes)

@AshesITR
Copy link
Collaborator Author

AshesITR commented Apr 3, 2022

Ah I see that must be because the special regex skips if a [ is before it. That needs another not-escaped check before it.

Great news on the performance side!

@MichaelChirico
Copy link
Collaborator

Assuming we can fix the issue, let's (1) merge #1021 then quickly (2) merge this to master as a follow-up. That way we can (1) keep the original C version in the repo history and (2) make it easier to give you credit for the great improvement!

@AshesITR
Copy link
Collaborator Author

AshesITR commented Apr 4, 2022

Okay, we can do that. Or merge once and not squash the changes during merge?

@AshesITR
Copy link
Collaborator Author

AshesITR commented Apr 4, 2022

Found a fix and added it. Also added a test case for that to make sure it works.
LMK how you want to proceed.

@AshesITR AshesITR changed the base branch from fixed_regex to master April 4, 2022 14:52
@MichaelChirico
Copy link
Collaborator

first I'll run again to make sure we didn't whack-a-mole any new issues.

I think we do want to squash some commits but not others which will be a pain... easier to merge the two PRs with squash

@AshesITR
Copy link
Collaborator Author

AshesITR commented Apr 4, 2022

Allright. I pre-approved the C implementation. Merge at your discretion; LMK if I need to take another look.

@AshesITR
Copy link
Collaborator Author

During some manual testing, I noticed that at least the most recent R versions don't have all of the features mentioned at the link I provided.

@AshesITR
Copy link
Collaborator Author

@MichaelChirico All tests succeed on all platforms now 🥳

@MichaelChirico
Copy link
Collaborator

awesome!! thanks again for your patience/diligence here!

I'll start another run tomorrow evening and hopefully we can (finally) merge

@AshesITR
Copy link
Collaborator Author

Sounds good, thanks a lot for the extensive testing and feedback!

@MichaelChirico
Copy link
Collaborator

Some issues... getting warnings running the linter on some packages, e.g. g3viz:

lintr::lint_package(linters=lintr::fixed_regex_linter())
.................
Warning messages:
1: In grepl(paste0("(?s)", rx_static_regex), str, perl = TRUE) || grepl(rx_static_regex,  :
  'length(x) = 2 > 1' in coercion to 'logical(1)'
2: In grepl(paste0("(?s)", rx_static_regex), str, perl = TRUE) || grepl(rx_static_regex,  :
  'length(x) = 2 > 1' in coercion to 'logical(1)'

(IINM that'll be an error in r-devel)

Packages that don't seem to want to lint on this branch:

addinsOutline, alignfigR, anyLib, ASIP, ast2ast, BaBooN, baseflow, bbw, Biocomb, caretEnsemble, cgam, cgmanalysis, compareODM, copBasic, crossword.r, DistPlotter, dscore, ecm, envDocument, ergmito, errorlocate, exp2flux, fail, fcr, g3viz, geno2proteo, GFE, ggpattern, ggpol, gllm, gmpoly, gsignal, huito, japanstat, JuliaCall, knotR, loo, lsa, marelac, mldr, MM4LMM, mma, mmaqshiny, mmm2, MRFcov, mwcsr, nat.utils, paletteer, plumberDeploy, polite, pxR, r2symbols, rcbayes, RcmdrPlugin.DCE, readmoRe, riskclustr, RLogicalOps, roloc, Rsgf, RWmisc, SCINA, semnova, shinybrms, spex, stpm, SuperLearner, tci, terrainr, thinkr, tidyvpc, TiPS, waterfalls, whereami

@MichaelChirico
Copy link
Collaborator

I think that may be throwing things off for the rest of the lints. The current results have a ton of false positives that I don't reproduce if I try and run the expressions individually.

@MichaelChirico
Copy link
Collaborator

simple enough fix 😅

@AshesITR
Copy link
Collaborator Author

I was somehow convinced that str was a single string and not a vector 😅

@MichaelChirico
Copy link
Collaborator

MichaelChirico commented May 17, 2022

OK another report. Ran on 1300 unique packages. Still a few cases where C & R disagree.

  1. str_replace_all() false positives. This is a known false positive when stringr functions are used in magrittr pipelines. examples:
  Package: threeBrain, file: R/class_brainatlas.R
self$atlas_type <- stringr::str_replace_all(atlas_type, '[\\W]', '_')

  Package: threeBrain, file: R/fs_brain2.R
atlas_t_alt <- stringr::str_replace_all(atlas_t, '[\\W]', '_')

  Package: campfin, file: R/normal-address.R
stringr::str_replace_all("^([:digit:]+)([:alpha:]+)", "\\1 \\2") %>%

  Package: campfin, file: R/normal-address.R
stringr::str_replace_all("([:alpha:]+)([:digit:]+)$", "\\1 \\2")

  Package: doseminer, file: R/extract.R
str_replace_all('([0-9]+)([x/])([a-z]+)', '\\1 \\2 \\3') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('([a-z]+)([0-9]+)', '\\1 \\2') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('([0-9]+)([a-z]+)', '\\1 \\2') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('([0-9]+) ?(-|(?:up )?to|or) ?([0-9]+)', '\\1 - \\3') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('(\\bq) ([1-8]) ([dh])', '\\1\\2\\3') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('(\\w+)(bd|[qt]ds)\\b', '\\1 \\2') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('(?<!take )([0-9]+) (?:times daily|a day)', '\\1 / day') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('(\\d+[.]?\\d*) (\\d+[.]?\\d* ml spoon)', '\\1 x \\2') %>%

  Package: torchaudio, file: R/temp.R
stringr::str_replace_all("(:[^,=\n]+)(,|( =)|\n)", "\\2") %>%

  Package: torchaudio, file: R/temp.R
stringr::str_replace_all("\n(.+)(\\-=)", "\n\\1 = \\1 -") %>%

  Package: torchaudio, file: R/temp.R
stringr::str_replace_all("\n(.+)(\\/=)", "\n\\1 = \\1 /") %>%

This looks like the C side is recognizing the \\1 substitutions as being regex-y and skipping while R side is not.

  1. others:
  • Package: pander, file: R/pandoc.R R is right, false negative for C, because duplicates in char class don't matter
    r x <- gsub('[\\\\]', '', x) # backslashes
  • Package: SGP, file: R/courseProgressionSGP.R C is right, false positive for R, this uses \< special
    r setattr(sgp_object_subset[["GRADE_CHAR"]], "levels", gsub("\\<CT\\>", "EOCT", levels(sgp_object_subset[["GRADE_CHAR"]])))
  • Package: stpm, file: R/spm_time-dependent.R C is right, false positive for R, because perl=FALSE
    r p.temp.coeff <- trim(unlist(strsplit(p.temp[i],"[\\*]",fixed=F)))
  • Package: FastRWeb, file: R/parse-multipart.R. R is right, false negative for C, because perl=TRUE
    r data$filename <- strsplit(filename,'[\\/]',perl=TRUE)[[1L]]

The ones that hinge on the value of perl I think we can punt on for now. \1 can't show up in the pattern argument of a regex without also using (), so I think we can safely ignore it. Better to focus effort on improving the XPath to skip the pipeline matches in the first place.

So I believe we should fix the \< case and we can be done. Note that in context I believe the author indeed intentionally used \< as specials:

https://github.com/CenterForAssessment/SGP/blob/153d35d4c3cfbc62240796156c879e36b731d4ba/R/courseProgressionSGP.R#L40-L42

@AshesITR
Copy link
Collaborator Author

self$atlas_type <- stringr::str_replace_all(atlas_type, '[\W]', '_')

That's not a FP related to pipes, no?
It is a false positive, though, since that is a regex as evident by this experiment:

> stringr::str_replace_all("\\W-", '[\\W]', '_')
[1] "_W_"
> gsub("[\\W]", "_", "\\W-")
[1] "__-"

\< seems to be fixed iff perl = TRUE, so that's also dependent on the value of perl

> gsub("\\<", "#", "\\<a<s<d")
[1] "\\<#a<#s<#d"
> gsub("\\<", "#", "\\<a<s<d", perl = TRUE)
[1] "\\#a#s#d"

So I'd suggest we assume perl = TRUE for the moment, i.e. let \< lint but make sure [\W] doesn't.
That way we'd at least be consistent in the flavor we try to detect and could at a later point extend to perl = FALSE, which affects the base regex functions. WDYT?

The necessary fix would be to disallow some characters after \\ within [, namely the character class shorthands.

@MichaelChirico
Copy link
Collaborator

ok that works for me, esp. since perl=TRUE is the engine for stringr functions too.

this PR is teaching me way more about regex than I ever cared to know 🥲

@MichaelChirico
Copy link
Collaborator

thanks!! starting the merge 🚀🚀🚀🚀

@AshesITR AshesITR mentioned this pull request May 19, 2022
# Conflicts:
#	R/fixed_regex_linter.R
#	tests/testthat/test-fixed_regex_linter.R
@MichaelChirico MichaelChirico merged commit dffa03f into master May 19, 2022
@MichaelChirico MichaelChirico deleted the fixed_regex-R branch May 19, 2022 19:19
@MichaelChirico
Copy link
Collaborator

thanks again!!

@AshesITR
Copy link
Collaborator Author

Thank you too!

@MichaelChirico MichaelChirico changed the title implement as plain R + regex implement fixed_regex_linter as plain R + regex Jun 4, 2022
@AshesITR
Copy link
Collaborator Author

AshesITR commented Oct 11, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants