Skip to content

gh-76960: Fix urljoin() and urldefrag() for URIs with empty components #123273

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 31, 2024

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Aug 23, 2024

  • urljoin() with relative reference "?" sets empty query and removes fragment.
  • Preserve empty components (authority, params, query, fragment) in urljoin().
  • Preserve empty components (authority, params, query) in urldefrag().

Also refactor the code and get rid of double _coerce_args() and _coerce_result() calls in urljoin(), urldefrag(), urlparse() and urlunparse().

…ponents

* urljoin() with relative reference "?" sets empty query and removes fragment.
* Preserve empty components (authority, params, query, fragment) in urljoin().
* Preserve empty components (authority, params, query) in urldefrag().

Also refactor the code and get rid of double _coerce_args() and
_coerce_result() calls in urljoin(), urldefrag(), urlparse() and
urlunparse().
Comment on lines -529 to -530
self.checkJoin(SIMPLE_BASE, 'http:g','http://a/b/c/g')
self.checkJoin(SIMPLE_BASE, 'http:','http://a/b/c/d')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated below.

Comment on lines -551 to -552
self.checkJoin(SIMPLE_BASE, 'http:','http://a/b/c/d')
self.checkJoin(SIMPLE_BASE, 'http:?y','http://a/b/c/d?y')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated in new tests below.

v = SplitResult(scheme or '', netloc or '', url, query or '', fragment or '')
return _coerce_result(v)

def _urlsplit(url, scheme=None, allow_fragments=True):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refactoring is also a preparation for #67041. These private functions distinguish undefined components (None) from empty components (empty string), but public functions remove this difference. With new options added for #67041 this will be optional.

Comment on lines +592 to +593
if fragment is None:
fragment = bfragment
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This contradicts RFC 3986 according to which the target fragment is always set to the relative reference fragment. But this means that an empty string (which has no defined fragment) should remove fragment. I think this is an error in RFC 3986.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this means that an empty string (which has no defined fragment) should remove fragment. I think this is an error in RFC 3986.

Just to be explicit, this means we are keeping the existing behavior when a relative fragment is given an empty string (or None). Is that right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Practically, this only affects 4 cases: an empty string, '//'. and URIs like 'http:' and http:// where the scheme is the same as in the base URI.

urljoin('http:/a/b/c/d;p?q#f', '') currently returns the base URI 'http:/a/b/c/d;p?q#f'. The same in Ruby and Go. According to the RFC 3986 pseudocode it should drop fragment. According to previous RFCs it should return the base URI. I think it is important to keep the current behavior in this case, so there should be some error in RFC 3986.

urljoin('http:/a/b/c/d;p?q#f', '//') currently returns the base URI without fragment 'http:/a/b/c/d;p?q', which is consistent with RFC 3986 for non-strict parser. Ruby and Go do not distinguish between '//' and '', so they preserve fragment. The current PR preserves it too.

urljoin('http:/a/b/c/d;p?q#f', 'http:') currently returns the base URI without fragment 'http:/a/b/c/d;p?q', which is consistent with RFC 3986 for non-strict parser. But Ruby and Go return 'http: which is consistent with RFC 3986 for strict parser. With this PR it will return the base URI unchanged, this is a change from the current behavior and is not consistent with RFC 3986 unless we suppose that RFC 3986 was wrong in this place.

But maybe RFC 3986 is wrong in other place. Previous RFCs contained special statement for empty relative URI. If it is missed in RFC 3986 by mistake, then we can only add special case for empty relative URI and preserve the current behavior for urljoin('http:/a/b/c/d;p?q#f', 'http:') which is consistent with the pseudocode in RFC 3986.

We can only add a special case for empty relative URI (it is already here for performance reasons), with possible addition of '//'. Then the behavior for 'http:' and 'http://' will be consistent with 3986 and match the current behavior (but Ruby and Go go other way).


if scheme is None:
scheme = bscheme
if scheme != bscheme or scheme not in uses_relative:
return _coerce_result(url)
if scheme in uses_netloc:
if netloc:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to RFC 3986, this should be netlog is not None (undefined). But this gives results too different from Ruby and Go (they may be wrong) and from the current Python code. I leave this for a separate issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being consistent with Ruby, Go, and Java (and usually with browser/curl) somes seems to be more beneficial, and going with that is expected by the programmer.

@serhiy-storchaka
Copy link
Member Author

@orsenthil, please take a look at this PR.

Fix :func:`urllib.parse.urljoin` and :func:`urllib.parse.urldefrag` for URIs
containing empty components. For example, :func:`!urljoin()` with relative
reference "?" now sets empty query and removes fragment. Empty components are
preserved in the result.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is meant by " Empty components are preserved in the result." in this news entry?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it what is meant in the ticket summary.

Preserve empty components (authority, params, query, fragment) in urljoin().
Preserve empty components (authority, params, query) in urldefrag().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that I should be more explicit (even if a little repetitive) here.

self.checkJoin(RFC1808_BASE, '?', 'http://a/b/c/d;p?', relroundtrip=False)
self.checkJoin(RFC1808_BASE, '?#z', 'http://a/b/c/d;p?#z', relroundtrip=False)
self.checkJoin(RFC1808_BASE, '?y', 'http://a/b/c/d;p?y')
self.checkJoin(RFC1808_BASE, ';', 'http://a/b/c/;')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making a note, using parameter ; removes the final portion of the path in the base url /d in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this as per the rfc3986 requirement as given in the 5.4. Reference Resolution Examples and these tests handle them properly for the empty ? cases correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From RFC3986's point of view, ; is not a delimiter, it is a normal character. So /b/c/d;p is the old path, d;p is its last part, and it will be replaced with relative path ;.

Other example:

>>> urljoin('http:/a/b/c/d;?#f', '#z')
'http:///a/b/c/d#z'

Currently it removes ; and ? which correspond to empty params and query. But for RFC3986 they should be preserved. It is not only a matter of representation of empty component, params is not a thing in RFC3986. Removing trailing ; from path /b/c/d; is a bug.

Copy link
Member

@orsenthil orsenthil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Thank you @serhiy-storchaka for bringing this long pending and desirable change into urllib.parse module in a relatively safe way.

@serhiy-storchaka
Copy link
Member Author

Thank you for your review @orsenthil.

During writing responses to your comments I reconsider my approach to empty URI problem, tried different ways, but have not found good reasons to prefer one way or another for such corner cases like //, http: and http://. The current approach at least looks more consistent for all 4 cases, so I leave it. This is not the last PR for these functions, we will need to add more options for consistency with Ruby and Go, and maybe change some defaults in future.

@serhiy-storchaka serhiy-storchaka merged commit fc897fc into python:main Aug 31, 2024
34 checks passed
@serhiy-storchaka serhiy-storchaka deleted the urllib-urljoin2 branch August 31, 2024 09:42
@WinChua
Copy link

WinChua commented Sep 20, 2024

This commin change the behavior of urljoin. Consider this code urljoin('file:', '/home'), before this commit it produced 'file:///home', while it produces 'file:/home' now, which may break some code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants