Get raw file name in bytes from ZipFile #90139

accelerator0099 · 2021-12-04T13:39:27Z

BPO	45981
Nosy	@ericvsmith, @danifus

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-12-04.13:39:26.920>
labels = ['type-feature', 'library', '3.10']
title = 'Get raw file name in bytes from ZipFile'
updated_at = <Date 2021-12-15.12:18:57.899>
user = 'https://bugs.python.org/accelerator0099'

bugs.python.org fields:

activity = <Date 2021-12-15.12:18:57.899>
actor = 'accelerator0099'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2021-12-04.13:39:26.920>
creator = 'accelerator0099'
dependencies = []
files = []
hgrepos = []
issue_num = 45981
keywords = []
message_count = 7.0
messages = ['407665', '407666', '407669', '407696', '407697', '407765', '408596']
nosy_count = 3.0
nosy_names = ['eric.smith', 'dhillier', 'accelerator0099']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue45981'
versions = ['Python 3.10']

accelerator0099 · 2021-12-04T13:39:27Z

It's quite annoying that ZipFile corrupts the file name by simply replacing '\\' with '/' and does not give us the raw file name in bytes.

accelerator0099 · 2021-12-04T14:01:19Z

In file Lib/zipfile.py:
    flags = centdir[5]
    if flags & 0x800:
        # UTF-8 file names extension
        filename = filename.decode('utf-8')
    else:
        # Historical ZIP filename encoding
        filename = filename.decode('cp437')

ZipFile simply decodes every file name that does not have the UTF-8 flag set as CP437.

In file Lib/zipfile.py:
    # This is used to ensure paths in generated ZIP files always use
    # forward slashes as the directory separator, as required by the
    # ZIP format specification.
    if os.sep != "/" and os.sep in filename:
        filename = filename.replace(os.sep, "/")

And it replaces every '\\' with '/' on Windows.

Consider a file named '\x97\x5c\x92\x9b', which is '予兆' in Japanese, encoded in Shift_JIS.
You may have noticed the problem:

'\x5c' is '\\' (backslash) in ASCII.

So ZipFile decodes the bytes as CP437 and then replaces every '\\' with '/'.
The second byte of the Japanese character '予' is replaced, so it is no longer the same character.

Someone might say we can replace '/' with '\\' again and encode the result as CP437 to get the raw bytes back.
But what if both '/' ('\x2f') and '\\' ('\x5c') appear in the raw file name?

Simply replacing '\\' in a byte stream without knowing the encoding is by no means a good approach.
Maybe we could provide a rawname field in the ZipInfo struct?
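
To make the problem concrete, here is a minimal sketch of the round trip described above. It hard-codes the backslash replacement instead of relying on os.sep, so it behaves the same on every platform. The variable names and the 'dir...file' names are made up for illustration only.

```python
# Raw name bytes from the example above ('予兆' in Shift_JIS).
raw = b"\x97\x5c\x92\x9b"

# Step 1: the UTF-8 flag is not set, so the bytes are decoded as CP437.
decoded = raw.decode("cp437")

# Step 2: on Windows os.sep is "\\", so it gets replaced with "/".
# Hard-coded here so the sketch runs the same way on any platform.
mangled = decoded.replace("\\", "/")

# Try to get the raw bytes back by re-encoding as CP437.
recovered = mangled.encode("cp437")
print(recovered == raw)

# The recovered bytes are no longer valid Shift_JIS, because 0x2f is
# not a legal second byte of a double-byte character.
try:
    recovered.decode("shift_jis")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)

# Undoing the replacement blindly does not help either: two different
# raw names collapse to the same mangled name.
name_a = b"dir\x5cfile".decode("cp437").replace("\\", "/")
name_b = b"dir\x2ffile".decode("cp437").replace("\\", "/")
print(name_a == name_b)
```

Without the replacement in step 2, the CP437 round trip would be lossless, since CP437 assigns a character to every byte value.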

ericvsmith · 2021-12-04T14:16:18Z

You would also need to decide what to do with these lines, just before the os.sep test:

        # Terminate the file name at the first null byte.  Null bytes in file
        # names are used as tricks by viruses in archives.
        null_byte = filename.find(chr(0))
        if null_byte >= 0:
            filename = filename[0:null_byte]

I don't think you'd want to do this on an encoded (raw) filename, but on the other hand the comment makes a good point.

accelerator0099 · 2021-12-05T02:02:50Z

Null bytes appear in abnormal zip files. (I haven't seen any multibyte encoding that represents a character with null bytes.)

But non-UTF-8 encodings are common in normal zip files, as Windows uses different encodings for different language settings. (On the other hand, Linux suggests everyone use UTF-8 regardless of their language settings.)

It's a pity that few programs nowadays support specifying an encoding when extracting archives.
(We have the unzip-iconv patch on Linux, even though the patch was never accepted by unzip.)

Changing the language and rebooting my OS makes no sense, and I don't know why.

ericvsmith · 2021-12-05T02:06:34Z

UTF-16 uses null bytes. I'm sure there are other encodings that do, too.

But I don't know if these encodings are permitted or common in zip files.
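
As a small illustration of the point about UTF-16, the sketch below shows what the null-byte truncation quoted earlier would do to a UTF-16 encoded name. Whether such names actually occur in zip archives is not established here; the name "ab" is just an example.

```python
# A two-character name encoded as UTF-16 (little-endian, no BOM).
name_bytes = "ab".encode("utf-16-le")

# A name without the UTF-8 flag is decoded as CP437 ...
text = name_bytes.decode("cp437")

# ... and then cut at the first null character, as in the quoted code.
null_byte = text.find(chr(0))
truncated = text[0:null_byte] if null_byte >= 0 else text

print(name_bytes)
print(repr(truncated))
```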

danifus · 2021-12-06T01:27:54Z

Handling different character sets is not completely supported yet. There are a couple of open issues relating to this: https://bugs.python.org/issue40407 (reading file names), https://bugs.python.org/issue41928 (support for reading and writing filenames using the unicode filename extra field) and https://bugs.python.org/issue40172 (issues with reading and then writing a filename from and back into a zip where the initial filename isn't encoded in cp437).

Most modern zip programs that deal with characters outside ascii or cp437 either set the utf-8 flag or write both an ascii or cp437 compatible filename (to the original filename field in the zip header) and the actual filename with all non-ascii characters in the unicode filename extra field. I think adding support for the unicode field to Python would probably cover the majority of files generated by modern zip programs.

For complete support, including older zip programs that don't support the utf-8 flag or unicode filename extra field, we may need to provide another parameter in Python's ZipFile's read and write functions to be able to override the charset used for the filename stored directly in the zip file header.

I've added my thoughts on how to approach this in https://bugs.python.org/issue40172 but haven't had time to implement these myself.

accelerator0099 · 2021-12-15T12:18:58Z

I do think providing a rawname field in the ZipInfo struct helps.
As a library, ZipFile should let users know what they are dealing with.
Users can get data from zip files, and ZipFile shouldn't corrupt them.
I don't mean that we should provide everything in raw bytes.
What I mean is that DATA could be CONVERTED, but couldn't be CORRUPTED.
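
Until something like a raw name field exists, one possible workaround is sketched below. It relies on an assumption: that ZipInfo keeps the decoded name in its orig_filename attribute before the null-byte and separator handling is applied, which is how CPython's Lib/zipfile.py sets it but is treated here as an implementation detail rather than a guaranteed API. The in-memory archive and the "AAAA" placeholder are only a way to simulate a file written by a legacy tool.

```python
import io
import zipfile

raw = b"\x97\x5c\x92\x9b"  # Shift_JIS bytes from the earlier example

# Build a small archive in memory with a 4-byte ASCII placeholder name,
# then patch the placeholder with the raw bytes of the same length.
# This only simulates an archive written without the UTF-8 flag.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("AAAA", b"payload")
data = buf.getvalue().replace(b"AAAA", raw)

with zipfile.ZipFile(io.BytesIO(data)) as zf:
    info = zf.infolist()[0]

# Assumption: orig_filename is the decoded name before normalization.
# CP437 maps every byte value to a character, so re-encoding gives the
# stored bytes back when the UTF-8 flag (0x800) is not set.
if info.flag_bits & 0x800:
    raw_name = info.orig_filename.encode("utf-8")
else:
    raw_name = info.orig_filename.encode("cp437")

print(raw_name == raw)
```

The caller still has to guess or know the real encoding (Shift_JIS in this example) before decoding raw_name, and the sketch assumes the archive is read with the default CP437 fallback.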

accelerator0099 mannequin added the 3.10 (only security fixes), stdlib (Python modules in the Lib dir), and type-feature (A feature request or enhancement) labels Dec 4, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

serhiy-storchaka added this to @serhiy-storchaka's project python-zipfile May 8, 2022

serhiy-storchaka added this to Zipfile issues May 19, 2022

gerph mentioned this issue Jul 3, 2022

ZipInfo filename is mangled when os.sep is not '/' #94529

Open
