Skip to content

Scanning Speed #231

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
uwalkin opened this issue Mar 22, 2022 · 12 comments
Closed

Scanning Speed #231

uwalkin opened this issue Mar 22, 2022 · 12 comments
Assignees

Comments

@uwalkin
Copy link

uwalkin commented Mar 22, 2022

Server information.

phpMussel version used

v3.3.1

Signatures version

Latest from Git

PHP SAPI used

apache2handler

PHP version installed

8.0.11

Operating system used

Windows NT 10.0 build 19042 (Windows 10) AMD64

Issue

Hi,

I'm just starting with an implementation of phpMussle in an image / pdf uploader and I'm wondering about the scanning speed.
I'm just scanning files via
Scanner->scan()

I'm noticing that it takes about 13s to scan a 4 MB jpeg and about 40s to scan a 10 MB pdf.

Are these expected times, or do I have something misconfigured here?

@nhlm
Copy link
Member

nhlm commented Mar 22, 2022

You do scan 250% more data, which should statistically took about 33 seconds (mathematically). I would consider a period of additionally 6-8 seconds as a peak. Did you do any kind of benchmarking or is it a one-time value?

best regards.

@Maikuolan
Copy link
Member

Maikuolan commented Mar 22, 2022

Are these expected times, or do I have something misconfigured here?

Although I wouldn't say "misconfigured", expected times can be significantly influenced by the exact configuration at the implementation in question.

As to which would be the most correct way to configure your implementation, that would depend a lot on the exact needs of your implementation (because not every website/platform/service/etc serves the same purpose, or accepts the same kinds of files, not all will have the exact same kinds of needs).

For example: If an implementation is intended to accept only very specific types of files without exception, it may make sense for the implementation to utilise phpMussel's greylist feature as a means to automatically blacklist any non-greylisted filetypes. Or alternatively, if an implementation may potential accept a range of different types of files which can't always be anticipated, but there are certain already known, specific types of files which should never be accepted by the implementation, it may make sense for the implementation to utilise phpMussel's blacklist feature to directly blacklist those types of files. (Documentation).

Naturally, any blacklisted types of files would not need to be checked against signatures, thus should be processed almost instantaneously (improving at least the speed at which those particular files are dealt with, though not necessarily the speed of processing files across the spectrum as a whole).

It may be worth considering, too, exactly which signature files your implementation needs to utilise. Naturally, the more signature files utilised, the more thorough phpMussel should be in terms of what it's able to detect, but also the more time required to finish scanning each file, so being selective regarding which signature files are utilised would be trying to find an ideal balance between thoroughness and speed. However, knowing the exact needs of your implementation may help in being able to determine whether specific signature files are needed or not for your implementation.

For example: An implementation which accepts literally anything, without being able to predict the exact types of files expected, would in most cases probably need to utilise all signature files (thus being very thorough, but unfortunately also a little slow at times). However, an implementation which accepts only specific types of files, and which also properly utilises the whitelist/blacklist/greylist features provided by phpMussel to reduce unnecessary processing, would in most cases probably not need to utilise those signature files intended to deal solely with the specific types of files which wouldn't, in the case of the implementation in question, ever be scanned.

At the repository for phpMussel signatures, in each of the two directories for signatures at that repository (separated on the basis of whether the vendor for those signatures is ClamAV or something/someone else), there's a README file describing the intended purpose of each signature file and their intended purpose, types of files targeted, etc. When being selective about signature files, those READMEs are worth referencing.

There are various other little tricks that can be adopted in order to optimise the scanning speed for phpMussel, most which can be gaged by tweaking the configuration according to the needs of the implementation.

As a final note: As brought up by @nhlm, benchmarking is always a good idea as a means to have a more concrete idea about the expected scanning speed at the implementation in question, and if possible, ideally, both before and after any changes are made to the configuration and choice of signature files utilised. As well as the above mentioned points, the specifics of the hardware behind the implementation can have an influence, too (e.g., resources allocated and available to the particular PHP instance running the implementation, RAM, processor speed, etc).

@uwalkin
Copy link
Author

uwalkin commented Mar 23, 2022

Thank you to both @nhlm and @Maikuolan for the ultra fast replies and amazing pointers!

I've boiled it down a bit more: For my baseline I have disabled ALL signature files via the configuration (and actually deleted the signature files, just to be sure). Other than that the configuration is completely default.

Environment

System Resources

My dev environment is running on an AMD Ryzen 5950x and my PHP is configured with memory_limit=512M. So I'm not inclined to assume a bottleneck here. Although I may be wrong of course.

Application

My application ensures that the only files that even arrive for phpMussel to scan are of one of the following types: "image/jpeg", "image/png", "image/bmp", "image/tiff", "application/pdf". This is ensured by getting the mime type via mime_content_type() and comparing it against the list of allowed mime types. Only then does phpMussel get to scan the file.

Benchmarking

I am using the same 4MB jpeg from before as my test-candidate. This is a normal, valid, malware-free image file.

  • The baseline processing speed of my php script is basically instantaneous.
  • Once I add the Scanner->scan() directive into the mix, processing takes roughly 9 seconds on average. (With no signature files)
  • When I add some useful signature files to processing, processing time goes up to roughly 13 seconds.
  • I do see my measured processing time reflected in the scan_log output by phpMussel

Conclusions / Open Quesions

The big question for me is, why could phpMussel be taking about 9 seconds to process a regular file, while using no signatures to compare it to. And what options can I reconfigure to reduce this "baseline" processing time.

The second question is, with the above given parameters, would I even see any performance increase with greylisting/blacklisting, since only allowed file-types actually arrive for phpMussel to scan.

@Maikuolan
Copy link
Member

Maikuolan commented Mar 23, 2022

My application ensures that the only files that even arrive for phpMussel to scan are of one of the following types: "image/jpeg", "image/png", "image/bmp", "image/tiff", "application/pdf". This is ensured by getting the mime type via mime_content_type() and comparing it against the list of allowed mime types. Only then does phpMussel get to scan the file.

Although unrelated to the original question, something related to that particular point which might be of interest: phpMussel's "chameleon attack" protections. :-)

Sometimes uploaders will attempt to hide a file's type by changing extensions, sequences of bytes in files, and etc in order to sneak those files past mime_content_type() and similar such strategies often used to protect against unwanted types of files, often in an attempt to cause some kind of damage (seeing as exploitable types of files are normally the types we would regard as unwanted in the first place), which is what phpMussel refers to as a "chameleon attack". Its "chameleon attack" protections are intended to prevent that by checking for "magic numbers" and other such indicators to ensure that the file being uploaded is of the type it's being presented as, and block it if it's not (e.g., a file with an image extension but with an executable magic number instead of an image magic number).

They're all already enabled by default, so unless they've been explicitly disabled, you won't need to do anything to get them working anyway. But, just mentioning it, because it might be something useful to know about in case not already aware. :-)

  • Once I add the Scanner->scan() directive into the mix, processing takes roughly 9 seconds on average. (With no signature files)
  • When I add some useful signature files to processing, processing time goes up to roughly 13 seconds.

~9 seconds does feel like fair bit more time than I'd normally expect for it to finish scanning in the absence of any signature files at all, and from memory, I don't think it took that long the last time I'd tried benchmarking the speed of scanning files at around that size without any signature files, but that was a while back now. Though, I ran some new benchmarks just now under similar conditions (e.g., latest version, no signature files used) using a benign 4.78MB GIF file, and encountered actually slightly worse results at my end:

Thu, 24 Mar 2022 00:49:28 +0800 Started.
→ Checking "C:\[REDACTED].gif".
─→ No problems found.
Thu, 24 Mar 2022 00:49:41 +0800 Finished.

It's getting a little late at night now in my part of the world, so I think I'll call it a night after submitting this reply. But, I'm thinking that some time early tomorrow, I'll start trying out some break points at various points during the scan process at my local dev copy to try to narrow down exactly where this bottleneck is occurring.

The big question for me is, why could phpMussel be taking about 9 seconds to process a regular file, while using no signatures to compare it to. And what options can I reconfigure to reduce this "baseline" processing time.

Very good question, and I'm going to investigate further. I suspect that some changes introduced since the last time I'd benchmarked everything might've inadvertently introduced a bottleneck somewhere, but I haven't confirmed that yet (something I'm going to work on tomorrow).

Of course, I'll reply back here again with the results to share whatever I can find. :-)

The second question is, with the above given parameters, would I even see any performance increase with greylisting/blacklisting, since only allowed file-types actually arrive for phpMussel to scan.

In this particular case, probably not. That said, if/when possible, it's probably a good idea to use greylists/blacklists anyway, just as an added security precaution, but yeah.. probably won't make much difference to performance, relative to the bottlenecks we're currently seeing, and in light of the other measures you've mentioned as already being implemented (e.g., mime_content_type()).

@nhlm
Copy link
Member

nhlm commented Mar 24, 2022

@uwalkin May i ask which PHP Version is in use (besides the mentioned 8.0.1)? Did you try 8.1.*

@Maikuolan
Copy link
Member

Found the cause of the bottleneck. It's the readFileBlocks method of the Loader class. Working on a fix now.

Maikuolan added a commit that referenced this issue Mar 24, 2022
Changelog excerpt:
- Fixed a bottleneck in the scan process caused by the ReadFile closure.
Maikuolan added a commit that referenced this issue Mar 24, 2022
Changelog excerpt:
- Fixed a bottleneck in the scan process caused by the ReadFile closure.
Maikuolan added a commit to phpMussel/Core that referenced this issue Mar 24, 2022
Changelog excerpt:
- Fixed a bottleneck in the scan process caused by the readFileBlocks
  method.
Maikuolan added a commit to phpMussel/CLI that referenced this issue Mar 24, 2022
Changelog excerpt:
- Fixed a bottleneck in the scan process caused by the readFileBlocks
  method.
Maikuolan added a commit to phpMussel/FrontEnd that referenced this issue Mar 24, 2022
Changelog excerpt:
- Fixed a bottleneck in the scan process caused by the readFileBlocks
  method.
@Maikuolan
Copy link
Member

Almost there. Bottleneck seems resolved now at my end when I run my tests, but some of the tests also broke since the changes, so I'm investigating what's going on there now. I'll reply back again soon.

Maikuolan added a commit that referenced this issue Mar 24, 2022
Maikuolan added a commit that referenced this issue Mar 24, 2022
Maikuolan added a commit to phpMussel/Core that referenced this issue Mar 24, 2022
@Maikuolan
Copy link
Member

Maikuolan commented Mar 24, 2022

Not sure why workflow actions didn't trigger just now with the latest commits, but I performed the tests offline at my dev machine and they seem to be passing now there, at least. Would've been nice to have that green tick ✔️ next to those commits to show for it, but whatever; as long as it's working now, I guess.

Not quite ready yet to create a new tag and release (there are a few other small things I still hope to get done before then), but the commits for the bottleneck fix have been pushed out now, at least.

@Maikuolan
Copy link
Member

I'll aim to push a new tag/release by tomorrow, if there's enough time for it, and reply back again once that's done. (Composer users would need that new tag to be pushed in order to be able to update to the latest commits with latest fixes without compromising on minimum stability).

@Maikuolan
Copy link
Member

New tags/releases created. It should now be possible to update to the fix commits via Composer as per normal. :-)

@uwalkin
Copy link
Author

uwalkin commented Mar 26, 2022

Thank you so much @Maikuolan !

Updated to 3.3.2 and now baseline (no signatures) processing is instantaneous, and with my current signature set it takes about 4 seconds for my 4MB jpeg.

Thank you as well for all the tips regarding setup, chameleon attacks, strategically selecting the signatures etc.

Props to you and the whole phpMussel team. You guys are doing amazing work :)

@uwalkin uwalkin closed this as completed Mar 26, 2022
@nhlm
Copy link
Member

nhlm commented Mar 26, 2022

would be nice to have this case validated against PHP 8.0.x and PHP 8.1.x. @Maikuolan @uwalkin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants