Scanning Speed #231
Comments
You're scanning 250% as much data (2.5×), which should proportionally take about 33 seconds. I would consider an additional 6-8 seconds on top of that to be a reasonable peak. Did you do any kind of benchmarking, or is it a one-off measurement? Best regards.
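For anyone wanting to run that kind of benchmark at their own implementation, a minimal sketch along these lines could work. The `\phpMussel\Core\Loader` and `\phpMussel\Core\Scanner` class names follow typical phpMussel v3 usage, but the constructor arguments and the exact return value of `scan()` are assumptions here and may need adjusting for the version in use.

```php
<?php
// Hypothetical benchmarking harness: times repeated calls to Scanner->scan()
// and averages the results. Class names follow typical phpMussel v3 usage;
// constructor arguments and the scan() return value may differ by version.
require __DIR__ . '/vendor/autoload.php';

$Loader = new \phpMussel\Core\Loader();
$Scanner = new \phpMussel\Core\Scanner($Loader);

$Target = '/path/to/test-file.jpg'; // File to benchmark against (placeholder path).
$Runs = 5;                          // Average over several runs to smooth out noise.

$Total = 0.0;
for ($i = 0; $i < $Runs; $i++) {
    $Start = microtime(true);
    $Scanner->scan($Target);
    $Total += microtime(true) - $Start;
}

printf("Average scan time over %d runs: %.3f seconds\n", $Runs, $Total / $Runs);
```

Averaging over several runs helps smooth out caching effects between the first and subsequent reads of the same file.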
Although I wouldn't say "misconfigured", expected times can be significantly influenced by the exact configuration of the implementation in question. Which configuration would be most correct depends a lot on the exact needs of that implementation (not every website/platform/service serves the same purpose or accepts the same kinds of files, so not all will have the same needs).

For example: if an implementation is intended to accept only very specific types of files without exception, it may make sense to utilise phpMussel's greylist feature as a means to automatically blacklist any non-greylisted filetypes. Alternatively, if an implementation may potentially accept a range of different types of files which can't always be anticipated, but there are certain already known, specific types of files which should never be accepted, it may make sense to utilise phpMussel's blacklist feature to blacklist those types of files directly (Documentation). Naturally, any blacklisted types of files don't need to be checked against signatures, so they should be processed almost instantaneously (improving at least the speed at which those particular files are dealt with, though not necessarily the speed of processing files across the board).

It may also be worth considering exactly which signature files your implementation needs to utilise. Naturally, the more signature files utilised, the more thorough phpMussel should be in terms of what it's able to detect, but also the more time required to finish scanning each file, so being selective about signature files is a matter of finding an ideal balance between thoroughness and speed. Knowing the exact needs of your implementation may help in determining whether specific signature files are needed. For example: an implementation which accepts literally anything, without being able to predict the exact types of files expected, would in most cases probably need to utilise all signature files (thus being very thorough, but unfortunately also a little slow at times). However, an implementation which accepts only specific types of files, and which also properly utilises the whitelist/blacklist/greylist features provided by phpMussel to reduce unnecessary processing, would in most cases probably not need to utilise those signature files intended solely for types of files which, at that implementation, would never be scanned anyway.

At the repository for phpMussel signatures, each of the two directories for signatures (separated on the basis of whether the vendor is ClamAV or someone else) contains a README describing each signature file's intended purpose, the types of files targeted, and so on. When being selective about signature files, those READMEs are worth referencing.

There are various other little tricks that can be adopted to optimise phpMussel's scanning speed, most of which come down to tweaking the configuration according to the needs of the implementation.
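As an application-level sketch of the greylist/blacklist idea described above (not phpMussel's own feature, which is controlled through its configuration as per the documentation), an uploader could refuse disallowed extensions before a file ever reaches the scanner, so rejected types never incur any signature-checking time at all. The `upload` form field name below is a hypothetical example.

```php
<?php
// Application-level illustration only: refuse uploads whose extension isn't on
// a known-good list before they ever reach the scanner, so disallowed types
// incur no signature-checking time at all. This is not phpMussel's own
// greylist/blacklist feature, which is set through its configuration.
$AllowedExtensions = ['jpg', 'jpeg', 'png', 'gif', 'pdf']; // Example allow-list.

function isAllowedUpload(string $FileName, array $Allowed): bool
{
    $Extension = strtolower(pathinfo($FileName, PATHINFO_EXTENSION));
    return in_array($Extension, $Allowed, true);
}

// "upload" is a hypothetical form field name.
if (!isAllowedUpload($_FILES['upload']['name'] ?? '', $AllowedExtensions)) {
    http_response_code(415); // Unsupported Media Type.
    exit('File type not accepted.');
}
// Otherwise, hand the file over to phpMussel for scanning as usual.
```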
As a final note: as brought up by @nhlm, benchmarking is always a good idea as a means to get a more concrete idea of the expected scanning speed at the implementation in question, ideally both before and after any changes are made to the configuration and to the choice of signature files utilised. In addition to the points mentioned above, the specifics of the hardware behind the implementation can have an influence too (e.g., the resources allocated and available to the particular PHP instance running the implementation, RAM, processor speed, etc).
Thank you to both @nhlm and @Maikuolan for the ultra-fast replies and amazing pointers! I've boiled it down a bit more: for my baseline I have disabled ALL signature files via the configuration (and actually deleted the signature files, just to be sure). Other than that the configuration is completely default.

Environment
System Resources
My dev environment is running on an …

Application
My application ensures that the only files that even arrive for phpMussel to scan are of one of the following types: …

Benchmarking
I am using the same 4 MB jpeg from before as my test candidate. This is a normal, valid, malware-free image file.
Conclusions / Open Questions
The big question for me is: why could phpMussel be taking about 9 seconds to process a regular file while using no signatures to compare it against, and what options can I reconfigure to reduce this "baseline" processing time? The second question is: with the parameters given above, would I even see any performance increase from greylisting/blacklisting, since only allowed file types actually arrive for phpMussel to scan?
Although unrelated to the original question, something related to that particular point which might be of interest: phpMussel's "chameleon attack" protections. :-) Sometimes uploaders will attempt to hide a file's type by changing extensions, sequences of bytes in the file, etc., in order to sneak those files past. These protections are all enabled by default, so unless they've been explicitly disabled, you won't need to do anything to get them working. But, just mentioning it, because it might be useful to know about in case you weren't already aware. :-)
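For a rough picture of the kind of mismatch such protections are looking for, a sketch like the following (using PHP's standard fileinfo extension, and not phpMussel's actual chameleon checks) compares a file's claimed extension against the MIME type detected from its content; the paths and the expected-type map are illustrative assumptions.

```php
<?php
// Rough, application-level illustration of the kind of mismatch "chameleon
// attack" protections look for: a file whose extension claims one type while
// its content reports another. Not phpMussel's actual implementation.
$Path = '/path/to/uploaded-file.jpg'; // Hypothetical upload path.

$Extension = strtolower(pathinfo($Path, PATHINFO_EXTENSION));
$Finfo = new finfo(FILEINFO_MIME_TYPE);
$Mime = $Finfo->file($Path) ?: 'unknown';

// Expected MIME types for a few common extensions.
$Expected = [
    'jpg' => 'image/jpeg',
    'jpeg' => 'image/jpeg',
    'png' => 'image/png',
    'pdf' => 'application/pdf',
];

if (isset($Expected[$Extension]) && $Mime !== $Expected[$Extension]) {
    exit(sprintf('Rejected: extension "%s" does not match detected type "%s".', $Extension, $Mime));
}
```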
~9 seconds does feel like a fair bit more time than I'd normally expect for it to finish scanning in the absence of any signature files at all, and from memory, I don't think it took that long the last time I tried benchmarking the speed of scanning files of around that size without any signature files, though that was a while back now. I ran some new benchmarks just now under similar conditions (e.g., latest version, no signature files used) using a benign 4.78 MB GIF file, and actually encountered slightly worse results at my end:
It's getting a little late at night now in my part of the world, so I think I'll call it a night after submitting this reply. But I'm thinking that some time early tomorrow, I'll start trying out some breakpoints at various stages of the scan process at my local dev copy to try to narrow down exactly where this bottleneck is occurring.
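A lightweight complement to breakpoints for this kind of narrowing-down is to wrap suspect stages in high-resolution timers and compare where the time actually goes; the stage labels and calls in the sketch below are placeholders rather than actual phpMussel internals.

```php
<?php
// Timing-based complement to breakpoints: wrap each suspect stage with
// hrtime() and compare where the time goes. The stage labels and callables
// below are placeholders, not actual phpMussel internals.
$Timings = [];

function timeStage(string $Label, callable $Stage, array &$Timings)
{
    $Start = hrtime(true);
    $Result = $Stage();
    $Timings[$Label] = (hrtime(true) - $Start) / 1e9; // Nanoseconds to seconds.
    return $Result;
}

// Example usage with placeholder stages.
timeStage('read file', fn () => file_get_contents('/path/to/test-file.jpg'), $Timings);
timeStage('hash file', fn () => hash_file('sha256', '/path/to/test-file.jpg'), $Timings);

arsort($Timings); // Slowest stage first.
foreach ($Timings as $Label => $Seconds) {
    printf("%-12s %.4f s\n", $Label, $Seconds);
}
```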
Very good question, and I'm going to investigate further. I suspect that some changes introduced since the last time I'd benchmarked everything might've inadvertently introduced a bottleneck somewhere, but I haven't confirmed that yet (something I'm going to work on tomorrow). Of course, I'll reply back here again with the results to share whatever I can find. :-)
In this particular case, probably not. That said, if/when possible, it's probably a good idea to use greylists/blacklists anyway, just as an added security precaution, but yeah, it probably won't make much difference to performance, relative to the bottlenecks we're currently seeing, and in light of the other measures you've mentioned as already being implemented (e.g., …).
@uwalkin May I ask which PHP version is in use (besides the mentioned 8.0.1)? Did you try 8.1.*?
Found the cause of the bottleneck: it's the readFileBlocks method.
Changelog excerpt: - Fixed a bottleneck in the scan process caused by the readFileBlocks method.
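For anyone curious what a method like that does in general terms: block-wise reading means processing a large file in fixed-size chunks rather than loading it all at once, and how each chunk is handled afterwards is exactly the kind of detail where such a bottleneck can hide. The sketch below is a generic illustration of the pattern under those assumptions, not phpMussel's actual readFileBlocks code.

```php
<?php
// Generic illustration of block-wise file reading (not phpMussel's actual
// readFileBlocks code): process a large file in fixed-size chunks instead of
// loading it into memory in one go. How each chunk is handled afterwards
// (e.g., repeated string concatenation versus incremental hashing) is the
// kind of detail where a bottleneck like this one can hide.
$Path = '/path/to/large-file.pdf'; // Placeholder path.
$BlockSize = 128 * 1024;           // 128 KiB per read; a tunable trade-off.

$Handle = fopen($Path, 'rb');
if ($Handle === false) {
    exit('Unable to open file.');
}

$Context = hash_init('sha256'); // Example per-block work: incremental hashing.
while (!feof($Handle)) {
    $Block = fread($Handle, $BlockSize);
    if ($Block === false || $Block === '') {
        break;
    }
    hash_update($Context, $Block);
}
fclose($Handle);

echo hash_final($Context), PHP_EOL;
```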
Almost there. The bottleneck seems resolved now at my end when I run my tests, but some of the tests have also broken since the changes, so I'm investigating what's going on there now. I'll reply back again soon.
Not sure why the workflow actions didn't trigger just now with the latest commits, but I performed the tests offline on my dev machine and they seem to be passing there, at least. Would've been nice to have that green tick ✔️ next to those commits to show for it, but whatever; as long as it's working now, I guess. Not quite ready yet to create a new tag and release (there are a few other small things I still hope to get done before then), but the commits for the bottleneck fix have been pushed out now, at least.
I'll aim to push a new tag/release by tomorrow, if there's enough time for it, and reply back again once that's done. (Composer users would need that new tag to be pushed in order to be able to update to the latest commits with the latest fixes without compromising on minimum stability.)
New tags/releases created. It should now be possible to update to the fix commits via Composer as per normal. :-)
Thank you so much @Maikuolan! Updated to 3.3.2, and now baseline (no signatures) processing is instantaneous, and with my current signature set it takes about 4 seconds for my 4 MB jpeg. Thank you as well for all the tips regarding setup, chameleon attacks, strategically selecting the signatures, etc. Props to you and the whole phpMussel team. You guys are doing amazing work :)
Would be nice to have this case validated against PHP 8.0.x and PHP 8.1.x. @Maikuolan @uwalkin
Server information.
phpMussel version used
v3.3.1
Signatures version
Latest from Git
PHP SAPI used
apache2handler
PHP version installed
8.0.11
Operating system used
Windows NT 10.0 build 19042 (Windows 10) AMD64
Issue
Hi,
I'm just starting with an implementation of phpMussel in an image / pdf uploader, and I'm wondering about the scanning speed.
I'm just scanning files via Scanner->scan().
I'm noticing that it takes about 13s to scan a 4 MB jpeg and about 40s to scan a 10 MB pdf.
Are these expected times, or do I have something misconfigured here?