implement fixed_regex_linter as plain R + regex #1032

AshesITR · 2022-03-30T19:22:42Z

NB merge target is the original fixed_regex PR #1021 so we can separately review the implementation of is_not_regex and the PR in full.

AshesITR · 2022-03-30T19:23:46Z

Changed target to master temporarily to get GHA goodness.

AshesITR · 2022-03-30T19:41:12Z

@MichaelChirico LMK if you would like me to adapt .dev/compare_branches.R to include an argument (--base-branch?)
Allowing a local branch from r-lib as reference should give enough flexibility I think.

MichaelChirico · 2022-03-30T19:49:01Z

yes please do. agree about scope.

MichaelChirico · 2022-04-01T05:12:59Z

Some patience here... my mirror has gathered some dust that's slow to remove :)

AshesITR · 2022-04-01T05:28:39Z

No worries, thanks for the update 😊

MichaelChirico · 2022-04-03T19:17:53Z

On a sample of 2,000 packages, there are 8 false positives where the R version says fixed but the C version does not:

diff.csv

They are all the same expression:

gsub("\\[|\\]", "", s, perl = TRUE)

(also FWIW the two branches ran in almost indistinguishable time -- 13.18 (C) vs 13.49 minutes)

AshesITR · 2022-04-03T20:17:16Z

Ah, I see. That must be because the regex for specials skips a character if a [ comes before it. It needs an additional check that the [ itself is not escaped.

Great news on the performance side!
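For illustration only, a minimal base R check of why `"\\[|\\]"` must not be reported as fixed. The input string `s` below is made up for the example.

```r
# "\\[|\\]" is an alternation of two escaped brackets, i.e. a genuine regex.
s <- "x[1]"  # hypothetical input

# As a regex it strips both brackets.
as_regex <- gsub("\\[|\\]", "", s, perl = TRUE)

# As a fixed string the literal text \[|\] never occurs in s, so nothing changes.
as_fixed <- gsub("\\[|\\]", "", s, fixed = TRUE)

print(as_regex)
print(as_fixed)
```

Since the two calls disagree, suggesting `fixed = TRUE` for this pattern would change behaviour.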

MichaelChirico · 2022-04-03T22:23:31Z

Assuming we can fix the issue, let's (1) merge #1021 then quickly (2) merge this to master as a follow-up. That way we can (1) keep the original C version in the repo history and (2) make it easier to give you credit for the great improvement!

AshesITR · 2022-04-04T14:26:05Z

Okay, we can do that. Or merge once and not squash the changes during merge?

AshesITR · 2022-04-04T14:52:33Z

Found a fix and added it. Also added a test case for that to make sure it works.
LMK how you want to proceed.

MichaelChirico · 2022-04-04T15:04:33Z

first I'll run again to make sure we didn't whack-a-mole any new issues.

I think we do want to squash some commits but not others which will be a pain... easier to merge the two PRs with squash

AshesITR · 2022-04-04T17:34:06Z

Alright. I pre-approved the C implementation. Merge at your discretion; LMK if I need to take another look.

AshesITR · 2022-05-16T07:59:31Z

During some manual testing, I noticed that at least the most recent R versions don't have all of the features mentioned at the link I provided.

AshesITR · 2022-05-16T08:17:47Z

@MichaelChirico All tests succeed on all platforms now 🥳

MichaelChirico · 2022-05-16T08:28:30Z

awesome!! thanks again for your patience/diligence here!

I'll start another run tomorrow evening and hopefully we can (finally) merge

AshesITR · 2022-05-16T08:29:26Z

Sounds good, thanks a lot for the extensive testing and feedback!

MichaelChirico · 2022-05-17T06:56:18Z

Some issues... getting warnings running the linter on some packages, e.g. g3viz:

lintr::lint_package(linters=lintr::fixed_regex_linter())
.................
Warning messages:
1: In grepl(paste0("(?s)", rx_static_regex), str, perl = TRUE) || grepl(rx_static_regex,  :
  'length(x) = 2 > 1' in coercion to 'logical(1)'
2: In grepl(paste0("(?s)", rx_static_regex), str, perl = TRUE) || grepl(rx_static_regex,  :
  'length(x) = 2 > 1' in coercion to 'logical(1)'

(IINM that'll be an error in r-devel)

Packages that don't seem to want to lint on this branch:

addinsOutline, alignfigR, anyLib, ASIP, ast2ast, BaBooN, baseflow, bbw, Biocomb, caretEnsemble, cgam, cgmanalysis, compareODM, copBasic, crossword.r, DistPlotter, dscore, ecm, envDocument, ergmito, errorlocate, exp2flux, fail, fcr, g3viz, geno2proteo, GFE, ggpattern, ggpol, gllm, gmpoly, gsignal, huito, japanstat, JuliaCall, knotR, loo, lsa, marelac, mldr, MM4LMM, mma, mmaqshiny, mmm2, MRFcov, mwcsr, nat.utils, paletteer, plumberDeploy, polite, pxR, r2symbols, rcbayes, RcmdrPlugin.DCE, readmoRe, riskclustr, RLogicalOps, roloc, Rsgf, RWmisc, SCINA, semnova, shinybrms, spex, stpm, SuperLearner, tci, terrainr, thinkr, tidyvpc, TiPS, waterfalls, whereami

MichaelChirico · 2022-05-17T07:12:27Z

I think that may be throwing things off for the rest of the lints. The current results have a ton of false positives that I don't reproduce if I try and run the expressions individually.

MichaelChirico · 2022-05-17T07:23:30Z

simple enough fix 😅

AshesITR · 2022-05-17T16:01:00Z

I was somehow convinced that str was a single string and not a vector 😅
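A small sketch of the underlying issue, assuming `str` holds more than one string. The pattern `rx` is a placeholder for the example, not the linter's actual `rx_static_regex`.

```r
# Hypothetical stand-ins for the linter's inputs.
str <- c("abc", "a.c")
rx <- "[.]"  # placeholder pattern

# grepl() is vectorised: each call returns one logical per element of str.
perl_hits <- grepl(rx, str, perl = TRUE)
tre_hits <- grepl(rx, str)

# `|` combines the two vectors elementwise; `||` expects length-one operands,
# which is what triggered the coercion warning quoted above.
either <- perl_hits | tre_hits
print(either)
```

The result has one entry per element of `str`, which is what a per-expression lint decision needs.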

MichaelChirico · 2022-05-17T21:16:35Z

OK another report. Ran on 1300 unique packages. Still a few cases where C & R disagree.

str_replace_all() false positives. This is a known false positive when stringr functions are used in magrittr pipelines. examples:

  Package: threeBrain, file: R/class_brainatlas.R
self$atlas_type <- stringr::str_replace_all(atlas_type, '[\\W]', '_')

  Package: threeBrain, file: R/fs_brain2.R
atlas_t_alt <- stringr::str_replace_all(atlas_t, '[\\W]', '_')

  Package: campfin, file: R/normal-address.R
stringr::str_replace_all("^([:digit:]+)([:alpha:]+)", "\\1 \\2") %>%

  Package: campfin, file: R/normal-address.R
stringr::str_replace_all("([:alpha:]+)([:digit:]+)$", "\\1 \\2")

  Package: doseminer, file: R/extract.R
str_replace_all('([0-9]+)([x/])([a-z]+)', '\\1 \\2 \\3') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('([a-z]+)([0-9]+)', '\\1 \\2') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('([0-9]+)([a-z]+)', '\\1 \\2') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('([0-9]+) ?(-|(?:up )?to|or) ?([0-9]+)', '\\1 - \\3') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('(\\bq) ([1-8]) ([dh])', '\\1\\2\\3') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('(\\w+)(bd|[qt]ds)\\b', '\\1 \\2') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('(?<!take )([0-9]+) (?:times daily|a day)', '\\1 / day') %>%

  Package: doseminer, file: R/extract.R
str_replace_all('(\\d+[.]?\\d*) (\\d+[.]?\\d* ml spoon)', '\\1 x \\2') %>%

  Package: torchaudio, file: R/temp.R
stringr::str_replace_all("(:[^,=\n]+)(,|( =)|\n)", "\\2") %>%

  Package: torchaudio, file: R/temp.R
stringr::str_replace_all("\n(.+)(\\-=)", "\n\\1 = \\1 -") %>%

  Package: torchaudio, file: R/temp.R
stringr::str_replace_all("\n(.+)(\\/=)", "\n\\1 = \\1 /") %>%

This looks like the C side is recognizing the \\1 substitutions as being regex-y and skipping while R side is not.

others:

  Package: pander, file: R/pandoc.R. R is right, false negative for C, because duplicates in char class don't matter
x <- gsub('[\\\\]', '', x) # backslashes

  Package: SGP, file: R/courseProgressionSGP.R. C is right, false positive for R, this uses \< special
setattr(sgp_object_subset[["GRADE_CHAR"]], "levels", gsub("\\<CT\\>", "EOCT", levels(sgp_object_subset[["GRADE_CHAR"]])))

  Package: stpm, file: R/spm_time-dependent.R. C is right, false positive for R, because perl=FALSE
p.temp.coeff <- trim(unlist(strsplit(p.temp[i],"[\\*]",fixed=F)))

  Package: FastRWeb, file: R/parse-multipart.R. R is right, false negative for C, because perl=TRUE
data$filename <- strsplit(filename,'[\\/]',perl=TRUE)[[1L]]

The ones that hinge on the value of perl I think we can punt on for now. \1 can't show up in the pattern argument of a regex without also using (), so I think we can safely ignore it. Better to focus effort on improving the XPath to skip the pipeline matches in the first place.

So I believe we should fix the \< case and we can be done. Note that in context I believe the author indeed intentionally used \< as specials:

https://github.com/CenterForAssessment/SGP/blob/153d35d4c3cfbc62240796156c879e36b731d4ba/R/courseProgressionSGP.R#L40-L42
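As a rough illustration of why the SGP pattern is not a fixed string under R's default engine, here is the same pattern applied to made-up level names (the vectors are invented for the example):

```r
# With the default engine, \< and \> act as word boundaries,
# so only the standalone "CT" is rewritten.
default_engine <- gsub("\\<CT\\>", "EOCT", c("CT", "ACT"))

# With perl = TRUE the same escapes are plain < and > characters,
# so only the literal text "<CT>" is rewritten.
perl_engine <- gsub("\\<CT\\>", "EOCT", c("CT", "<CT>"), perl = TRUE)

print(default_engine)
print(perl_engine)
```

Whether the pattern counts as "fixed" therefore depends on which engine the call ends up using.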

AshesITR · 2022-05-19T17:40:11Z

self$atlas_type <- stringr::str_replace_all(atlas_type, '[\\W]', '_')

That's not an FP related to pipes, is it?
It is a false positive, though, since that is a regex, as evidenced by this experiment:

> stringr::str_replace_all("\\W-", '[\\W]', '_')
[1] "_W_"
> gsub("[\\W]", "_", "\\W-")
[1] "__-"

\< seems to be fixed iff perl = TRUE, so that's also dependent on the value of perl

> gsub("\\<", "#", "\\<a<s<d")
[1] "\\<#a<#s<#d"
> gsub("\\<", "#", "\\<a<s<d", perl = TRUE)
[1] "\\#a#s#d"

So I'd suggest we assume perl = TRUE for the moment, i.e. let \< lint but make sure [\W] doesn't.
That way we'd at least be consistent in the flavor we try to detect and could at a later point extend to perl = FALSE, which affects the base regex functions. WDYT?

The necessary fix would be to disallow some characters after \\ within [, namely the character class shorthands.
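A rough sketch of the idea, not the linter's actual code. `has_class_shorthand()` is a hypothetical helper, and it only looks for the common `\d`, `\s`, `\w` shorthands (and their negations) inside a bracket expression.

```r
# Under perl = TRUE, \W inside [...] still means "any non-word character".
replaced <- gsub("[\\W]", "_", "a-b_c d", perl = TRUE)
print(replaced)

# Hypothetical heuristic: does a bracket expression contain a backslash
# followed by one of the shorthand letters? This is deliberately crude and
# does not handle every escaping corner case.
has_class_shorthand <- function(pattern) {
  grepl("\\[[^]]*\\\\[dDsSwW]", pattern)
}

print(has_class_shorthand("[\\W]"))   # class holding a shorthand
print(has_class_shorthand("[\\\\]"))  # class holding only a backslash
```

A pattern flagged by such a check would be treated as a real regex and left alone.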

MichaelChirico · 2022-05-19T18:08:15Z

ok that works for me, esp. since perl=TRUE is the engine for stringr functions too.

this PR is teaching me way more about regex than I ever cared to know 🥲

MichaelChirico · 2022-05-19T18:31:36Z

thanks!! starting the merge 🚀🚀🚀🚀

# Conflicts: # R/fixed_regex_linter.R # tests/testthat/test-fixed_regex_linter.R

MichaelChirico · 2022-05-19T19:19:32Z

thanks again!!

AshesITR · 2022-05-19T19:41:54Z

Thank you too!

AshesITR · 2022-10-11T07:17:12Z

I think converting the regex to a whitelist of to-be-linted regexes should eliminate the false positives.
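One way such a whitelist could look, purely as a sketch. `looks_fixed()` and its three building blocks are hypothetical, and the sketch assumes perl = TRUE semantics throughout, as agreed earlier in the thread.

```r
# Hypothetical whitelist: a pattern is reported only if every piece of it is
# one of three forms known to behave like a fixed string under perl = TRUE.
looks_fixed <- function(pattern) {
  literal <- "[^\\]\\[\\\\^$.|?*+(){}]"   # any single non-metacharacter
  escaped <- "\\\\[^A-Za-z0-9]"            # backslash + punctuation, e.g. \. or \[
  single  <- "\\[[^\\]\\\\^\\[]\\]"        # one plain character in brackets, e.g. [.]
  whitelist <- paste0("^(?:", literal, "|", escaped, "|", single, ")+$")
  grepl(whitelist, pattern, perl = TRUE)
}

print(looks_fixed(c("abc", "a\\.b", "[.]")))        # candidates for fixed = TRUE
print(looks_fixed(c("\\[|\\]", "[\\W]", "a+")))     # genuine regexes, left alone
```

Anything not matched by the whitelist is skipped, so an unanticipated construct produces a missed lint rather than a false positive. Note that under this assumption `"\\<CT\\>"` would still be reported, which matches the perl = TRUE behaviour but not the default engine.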

…

On 10.04.2022 at 06:26, Michael Chirico ***@***.***> wrote:

Mixed results -- some things caught by R only, some things caught by C only.

Mainly it looks like on the C branch, I am too strict for the [$CHAR] case (e.g. $CHAR can be an escaped character, or a \u-escaped string), while the R branch is too loose (namely, the default regex allows []...] to be a single character class including ] and ...).

False positives in the R-only branch:

strsplit(rangeStr, "[][ ]")
gsub("[][]", "", line)
strsplit(colnames(obj)[-1], "[],[]")
gsub("[]}]", ")", transformed)
gsub("[]:]","",betainfo)

False negatives in the C-only branch:

stringr::str_replace_all(lines, "[\u0451]", "\u0435")
gsub("\u{A0}", " ", out, useBytes = TRUE) (one other identical hit in roxygen2)
grep("[]]", x) (23 total hits like this or []])
gsub( '[\\]', '/', dirname( chname)) (10 total hits like this)
gsub("[\r]", "", config_char) (5 total hits like this or [\n])
strsplit(ls[1],"[\"]") (2 total hits like this)
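To make the `[]...]` point from the quoted email concrete, a short base R illustration (the input strings are invented for the example):

```r
# A ] placed first inside a bracket expression is a literal member of the
# class, so "[][]" means "either ] or [", not a fixed string.
no_brackets <- gsub("[][]", "", "a[b]c")

# Likewise "[]}]" is the class containing ] and }.
closed <- gsub("[]}]", ")", "f(x]}")

# Treating the first pattern as fixed changes the result.
unchanged <- gsub("[][]", "", "a[b]c", fixed = TRUE)

print(no_brackets)
print(closed)
print(unchanged)
```

Patterns of this shape are multi-character classes and should therefore not be reported.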

MichaelChirico and others added 5 commits March 28, 2022 22:38

New fixed_regex_linter

f67e9a0

fix package name in so

67c9e1d

back-compatible read-only macro

f7954ba

Merge branch 'master' into fixed_regex

4663843

implement as plain R + regex

aa4dcc6

AshesITR changed the base branch from fixed_regex to master March 30, 2022 19:23

Merge branch 'master' into fixed_regex-R

f874325

fix butchered two-way merge

2020b4b

AshesITR changed the base branch from master to fixed_regex March 30, 2022 19:38

Merge branch 'master' into fixed_regex

033de02

MichaelChirico added 3 commits April 3, 2022 15:13

Merge branch 'master' into fixed_regex

8b47bf8

Merge branch 'master' into fixed_regex-R

1c7b87a

Merge branch 'fixed_regex' into fixed_regex-R

bd70085

fix false positive "\\[|\\]" and add as a test case

716e2b4

AshesITR changed the base branch from fixed_regex to master April 4, 2022 14:52

Merge branch 'master' into fixed_regex-R

ada88ed

MichaelChirico added 2 commits April 4, 2022 08:18

Merge branch 'master' into fixed_regex

3a0739b

Merge branch 'fixed_regex' into fixed_regex-R

0b2d306

AshesITR added 2 commits May 16, 2022 10:03

Merge branch 'master' into fixed_regex-R

3ce4b1d

# Conflicts: # man/linters.Rd

\xA7 -> \x32

aeb2fc0

Merge branch 'master' into fixed_regex-R

3d8bf17

Merge branch 'master' into fixed_regex-R

1bd7e01

use | not ||

1d82bee

MichaelChirico added 2 commits May 17, 2022 00:25

add a test

9b24cac

test parsing

0482664

assume perl = TRUE, add some more tests

bb1cc73

AshesITR mentioned this pull request May 19, 2022

New fixed_regex_linter #1021

Merged

Merge branch 'master' into fixed_regex-R

4f5a637

# Conflicts: # R/fixed_regex_linter.R # tests/testthat/test-fixed_regex_linter.R

MichaelChirico approved these changes May 19, 2022

MichaelChirico merged commit dffa03f into master May 19, 2022

MichaelChirico deleted the fixed_regex-R branch May 19, 2022 19:19

MichaelChirico changed the title ~~implement as plain R + regex~~ implement fixed_regex_linter as plain R + regex Jun 4, 2022

MichaelChirico mentioned this pull request Oct 11, 2022

fixed_regex_linter incorrectly suggest that my regex with "\>" is static #1478

Open
