-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
gh-76960: Fix urljoin() and urldefrag() for URIs with empty components #123273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-76960: Fix urljoin() and urldefrag() for URIs with empty components #123273
Conversation
…ponents * urljoin() with relative reference "?" sets empty query and removes fragment. * Preserve empty components (authority, params, query, fragment) in urljoin(). * Preserve empty components (authority, params, query) in urldefrag(). Also refactor the code and get rid of double _coerce_args() and _coerce_result() calls in urljoin(), urldefrag(), urlparse() and urlunparse().
self.checkJoin(SIMPLE_BASE, 'http:g','http://a/b/c/g') | ||
self.checkJoin(SIMPLE_BASE, 'http:','http://a/b/c/d') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated below.
self.checkJoin(SIMPLE_BASE, 'http:','http://a/b/c/d') | ||
self.checkJoin(SIMPLE_BASE, 'http:?y','http://a/b/c/d?y') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated in new tests below.
v = SplitResult(scheme or '', netloc or '', url, query or '', fragment or '') | ||
return _coerce_result(v) | ||
|
||
def _urlsplit(url, scheme=None, allow_fragments=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if fragment is None: | ||
fragment = bfragment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This contradicts RFC 3986 according to which the target fragment is always set to the relative reference fragment. But this means that an empty string (which has no defined fragment) should remove fragment. I think this is an error in RFC 3986.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this means that an empty string (which has no defined fragment) should remove fragment. I think this is an error in RFC 3986.
Just to be explicit, this means we are keeping the existing behavior when a relative fragment is given an empty string (or None). Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Practically, this only affects 4 cases: an empty string, '//'
. and URIs like 'http:'
and http://
where the scheme is the same as in the base URI.
urljoin('http:/a/b/c/d;p?q#f', '')
currently returns the base URI 'http:/a/b/c/d;p?q#f'
. The same in Ruby and Go. According to the RFC 3986 pseudocode it should drop fragment. According to previous RFCs it should return the base URI. I think it is important to keep the current behavior in this case, so there should be some error in RFC 3986.
urljoin('http:/a/b/c/d;p?q#f', '//')
currently returns the base URI without fragment 'http:/a/b/c/d;p?q'
, which is consistent with RFC 3986 for non-strict parser. Ruby and Go do not distinguish between '//'
and ''
, so they preserve fragment. The current PR preserves it too.
urljoin('http:/a/b/c/d;p?q#f', 'http:')
currently returns the base URI without fragment 'http:/a/b/c/d;p?q'
, which is consistent with RFC 3986 for non-strict parser. But Ruby and Go return 'http:
which is consistent with RFC 3986 for strict parser. With this PR it will return the base URI unchanged, this is a change from the current behavior and is not consistent with RFC 3986 unless we suppose that RFC 3986 was wrong in this place.
But maybe RFC 3986 is wrong in other place. Previous RFCs contained special statement for empty relative URI. If it is missed in RFC 3986 by mistake, then we can only add special case for empty relative URI and preserve the current behavior for urljoin('http:/a/b/c/d;p?q#f', 'http:')
which is consistent with the pseudocode in RFC 3986.
We can only add a special case for empty relative URI (it is already here for performance reasons), with possible addition of '//'
. Then the behavior for 'http:'
and 'http://'
will be consistent with 3986 and match the current behavior (but Ruby and Go go other way).
|
||
if scheme is None: | ||
scheme = bscheme | ||
if scheme != bscheme or scheme not in uses_relative: | ||
return _coerce_result(url) | ||
if scheme in uses_netloc: | ||
if netloc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to RFC 3986, this should be netlog is not None
(undefined). But this gives results too different from Ruby and Go (they may be wrong) and from the current Python code. I leave this for a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being consistent with Ruby, Go, and Java (and usually with browser/curl) somes seems to be more beneficial, and going with that is expected by the programmer.
@orsenthil, please take a look at this PR. |
Fix :func:`urllib.parse.urljoin` and :func:`urllib.parse.urldefrag` for URIs | ||
containing empty components. For example, :func:`!urljoin()` with relative | ||
reference "?" now sets empty query and removes fragment. Empty components are | ||
preserved in the result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is meant by " Empty components are preserved in the result." in this news entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it what is meant in the ticket summary.
Preserve empty components (authority, params, query, fragment) in urljoin().
Preserve empty components (authority, params, query) in urldefrag().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that I should be more explicit (even if a little repetitive) here.
self.checkJoin(RFC1808_BASE, '?', 'http://a/b/c/d;p?', relroundtrip=False) | ||
self.checkJoin(RFC1808_BASE, '?#z', 'http://a/b/c/d;p?#z', relroundtrip=False) | ||
self.checkJoin(RFC1808_BASE, '?y', 'http://a/b/c/d;p?y') | ||
self.checkJoin(RFC1808_BASE, ';', 'http://a/b/c/;') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making a note, using parameter ;
removes the final portion of the path in the base url /d
in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this as per the rfc3986 requirement as given in the 5.4. Reference Resolution Examples
and these tests handle them properly for the empty ?
cases correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From RFC3986's point of view, ;
is not a delimiter, it is a normal character. So /b/c/d;p
is the old path, d;p
is its last part, and it will be replaced with relative path ;
.
Other example:
>>> urljoin('http:/a/b/c/d;?#f', '#z')
'http:///a/b/c/d#z'
Currently it removes ;
and ?
which correspond to empty params and query. But for RFC3986 they should be preserved. It is not only a matter of representation of empty component, params is not a thing in RFC3986. Removing trailing ;
from path /b/c/d;
is a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thank you @serhiy-storchaka for bringing this long pending and desirable change into urllib.parse module in a relatively safe way.
Thank you for your review @orsenthil. During writing responses to your comments I reconsider my approach to empty URI problem, tried different ways, but have not found good reasons to prefer one way or another for such corner cases like |
This commin change the behavior of urljoin. Consider this code |
Also refactor the code and get rid of double _coerce_args() and _coerce_result() calls in urljoin(), urldefrag(), urlparse() and urlunparse().