feat: improve throughput of http based storage#reader between 100 MiB/s and 200 MiB/s #1799
Conversation
Force-pushed from 9e10999 to 6878b73:
feat: improve throughput of http based storage#reader between 100 MiB/s and 200 MiB/s

### Work

Implement new BlobReadChannelV2 which replaces BlobReadChannel and improves on its resource usage, reducing the minimum number of RPCs from (objSize / chunkSize + 1) to 1 while still maintaining the ability to restart a stream that may have been interrupted.

### Results

Throughput in MiB/s has increased across the board:

```
ClassName                    mean    25%    50%    75%    90%    95%    99%    max
READ[0] BlobReadChannel      32.2   25.3   29.0   32.6   42.1   56.1  111.9  214.1
READ[1] BlobReadChannel      32.1   25.4   28.7   32.6   41.7   55.4  106.1  224.4
READ[2] BlobReadChannel      31.9   25.2   28.6   32.8   41.6   55.2  105.4  227.2
READ[0] BlobReadChannelV2   214.1  196.4  219.8  239.3  254.3  261.2  278.0  315.2
READ[1] BlobReadChannelV2   215.9  198.8  221.0  240.0  254.4  261.8  281.8  315.6
READ[2] BlobReadChannelV2   216.4  199.5  221.2  239.4  253.9  261.6  281.6  308.6
```

Data collected using all default settings, against a regional bucket, across a range of object sizes [256KiB, 2GiB]. Each object is read in full three times to account for any GCS caching variability.

### Internal implementation notes

Add ByteRangeSpec to encapsulate relative vs explicit(open) vs explicit(closed) vs null vs open-ended ranges and their associated logical subtleties.

The new StorageReadChannel interface is a possible candidate for a new storage-specific interface we can expose to folks for improvements independent of core and BigQuery.

ClientStuff is no longer Serializable. Instead, BlobReadChannelV2State will store the instance of HttpStorageOptions and then reconstitute the BlobReadChannelContext at restore() time.

In order to facilitate migrating any RestorableState<ReadChannel> customers might have, we leave the existing class hierarchy in place and update BlobReadChannel.StateImpl#restore() to produce a new BlobReadChannelV2. In the next major version this compatibility path will be removed. To test compatibility, a serialized instance of a BlobReadChannel from v2.16.0 has been created and serialized into blobWriteChannel.ser.properties, along with a comment describing how the ser was generated.
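For illustration, a minimal sketch of the caller-facing read path the benchmark above exercises; the bucket and object names are placeholders, and default-credentials setup is assumed rather than taken from the PR. With BlobReadChannelV2, a full sequential read like this can be served by a single ranged GET instead of one RPC per chunk.

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.nio.ByteBuffer;

public class ReaderThroughputSketch {
  public static void main(String[] args) throws IOException {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    long total = 0;
    // Storage.reader() still returns the public ReadChannel type; the V2
    // implementation behind it is an internal detail of the library.
    try (ReadChannel reader = storage.reader(BlobId.of("my-bucket", "my-object"))) {
      ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
      int read;
      while ((read = reader.read(buf)) >= 0) {
        total += read;
        buf.clear();
      }
    }
    System.out.println("read " + total + " bytes");
  }
}
```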
Force-pushed from 6878b73 to cea4cdc.
```diff
@@ -49,6 +48,6 @@ public void start() {

   @Override
   public void stop() {
-    BucketCleaner.doCleanup(bucketInfo.getName(), s);
+    // BucketCleaner.doCleanup(bucketInfo.getName(), s);
```
Uncomment
```java
/**
 * Hierarchy retained for {@link RestorableState#restore()}. Will be removed in next major version!
 */
@Deprecated
class BlobReadChannel implements ReadChannel {
```
I'm confused, this class has been reduced to being unimplemented; that should be okay because it's a package-private class and can only be initialized by the Storage.writer() method.
Per @BenWhitehead: this is to continue supporting restore() / deserializing BlobReadChannel state from a serialized object as the implementation moves to BlobReadChannelV2. The class attributes still need to exist for deserialization to continue working. For now this is kept for this particular case.
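As a rough sketch of the pattern this compatibility path keeps working (bucket and object names are placeholders, and the in-memory serialization round-trip simply stands in for whatever persistence a caller might use):

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.RestorableState;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class RestoreCompatibilitySketch {
  public static void main(String[] args) throws Exception {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    ReadChannel reader = storage.reader(BlobId.of("my-bucket", "my-object"));

    // Capture the channel's state; the concrete state class is what gets serialized.
    RestorableState<ReadChannel> state = reader.capture();

    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(state); // relies on the state implementation being Serializable
    }
    byte[] bytes = bos.toByteArray();

    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      @SuppressWarnings("unchecked")
      RestorableState<ReadChannel> revived = (RestorableState<ReadChannel>) ois.readObject();
      // After this PR, restore() is expected to hand back a BlobReadChannelV2-backed
      // channel even for state captured from the old implementation.
      ReadChannel restored = revived.restore();
      restored.close();
    }
  }
}
```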
```java
}

private ApiaryReadRequest getApiaryReadRequest() {
  // TODO: test what happens if you try to read from the last byte of an object
```
Does this mean pos=99 of size=100, or pos=-5, which is equivalent to pos=94-99?
Oops, I forgot to ask this question @BenWhitehead
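Not an answer from the thread, just an illustration of the two readings of the question in terms of the HTTP Range header the request would ultimately carry; the helper below is hypothetical and only formats the header value.

```java
final class RangeHeaderSketch {
  /**
   * Hypothetical helper showing the header forms being discussed:
   *   pos=99 of size=100 -> "bytes=99-"   (start at offset 99, read to the end)
   *   pos=-5             -> "bytes=-5"    (suffix range: the last 5 bytes, i.e. 94-99 of a 100-byte object)
   *   explicit 94..99    -> "bytes=94-99" (closed range)
   */
  static String rangeHeader(long begin, Long end) {
    if (begin < 0) {
      return "bytes=" + begin;             // suffix form, e.g. bytes=-5
    } else if (end == null) {
      return "bytes=" + begin + "-";       // open-ended form, e.g. bytes=99-
    }
    return "bytes=" + begin + "-" + end;   // closed form, e.g. bytes=94-99
  }
}
```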
```java
public static final class ReadableByteChannelSessionBuilder {

  private final BlobReadChannelContext blobReadChannelContext;
  // private Hasher hasher; // TODO: wire in Hasher
```
Given you're merging into main, please address the Hasher TODO.
Per @BenWhitehead: this is a feature request; the existing reader did not have checksum support.
```java
ImmutableList<HttpRequest> requests = getRequests();

List<T> actual =
    requests.stream()
        .map(HttpRequest::getUrl)
        .distinct() // todo: figure out why requests seem to be recorded twice for blob create
        // When a multipart (http, not MPU) request is sent it will show up as multiple requests
```
Is that due to the multiple parts, or is it a complete duplicate?
@BenWhitehead I forgot to ask this question
In this case it's due to the multiple parts: there will be a request for url `/path` for the object JSON, and a request for url `/path` for the content bytes. Here `distinct()` will collapse those to a single entry rather than two. I'm not totally sure why the internal structure of the requests is done like this, but it is (for multipart I would have expected some type of content rather than a whole new request, but 🤷🏻♂️).
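A small illustration of the collapse being described (the URLs are made up, but the shape matches a JSON API multipart insert): both recorded HttpRequests share the same upload URL, one carrying the object JSON and one the content bytes, so `distinct()` on the URL strings leaves a single entry.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class DistinctUrlSketch {
  public static void main(String[] args) {
    // Both "requests" of a multipart insert are recorded against the same URL.
    List<String> urls =
        Stream.of(
                "https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=multipart", // object JSON part
                "https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=multipart") // content bytes part
            .distinct()
            .collect(Collectors.toList());
    System.out.println(urls.size()); // prints 1: the two parts collapse to one entry
  }
}
```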
```java
import java.util.Map;
import java.util.Objects;

final class BlobReadChannelV2 implements StorageReadChannel {
```
How does BlobReadChannelV2 surface limit() so it's usable? It seems like it's package-private.
Per @BenWhitehead: yes, it's still usable because the accessibility of these methods is set by ReadChannel, which is where these classes/interfaces derive from, and public gets carried from there.
Correct, the interface dictates the access scope, not the implementation itself. In fact, the compiler will prevent a class from narrowing the scope of an overriding method declaration.
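A standalone illustration of that rule, using invented names rather than the storage classes: the interface's public methods stay public on a package-private implementation, and an override that tries to narrow access does not compile.

```java
// An interface method is implicitly public.
interface Limiter {
  Limiter limit(long limit);
}

// A package-private implementation still exposes limit() as public through the interface.
final class LimiterImpl implements Limiter {
  @Override
  public Limiter limit(long limit) { // must remain public
    return this;
  }

  // The compiler rejects any attempt to narrow the access of an override:
  //   Limiter limit(long limit) { ... }
  //   error: attempting to assign weaker access privileges; was public
}
```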
Source-Link: googleapis/synthtool@1fd6dff Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-java:latest@sha256:ad9cabee4c022f1aab04a71332369e0c23841062124818a4490f73337f790337
Source-Link: https://github.com/googleapis/synthtool/commit/1fd6dff029bb3d873a4780e616388f802f086907 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-java:latest@sha256:ad9cabee4c022f1aab04a71332369e0c23841062124818a4490f73337f790337
### Work
Implement new BlobReadChannelV2 which replaces BlobReadChannel and improves on
its resource usage to reduce min number of RPCs to 1 from (objSize / chunkSize + 1)
while still maintaining the ability to restart a stream that may have been
interrupted.
### Results
Throughput in MiB/s has increased across the board (see the benchmark table above).
Data collected using all default settings, against a regional bucket,
across a range of object sizes [256KiB, 2GiB]. Each object is read in full three
times to account for any GCS caching variability.
### Internal implementation notes
Add ByteRangeSpec to encapsulate relative vs explicit(open) vs explicit(closed)
vs null vs open-ended ranges and their associated logical subtleties.
The new StorageReadChannel interface is a possible candidate for a new storage-specific
interface we can expose to folks for improvements independent of core and BigQuery.
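A hypothetical sketch of the kind of encapsulation ByteRangeSpec provides; the real class in this PR may differ in names and shape. The point is that "no range", "offset only", "offset plus limit", and "relative offset" each need slightly different handling when they are eventually turned into a Range header.

```java
// Hypothetical stand-in for ByteRangeSpec; names and behavior are illustrative only.
final class RangeSpecSketch {
  private final Long begin; // null => from the start of the object
  private final Long end;   // null => open-ended (to the end of the object)

  private RangeSpecSketch(Long begin, Long end) {
    this.begin = begin;
    this.end = end;
  }

  static RangeSpecSketch all() { return new RangeSpecSketch(null, null); }
  static RangeSpecSketch from(long begin) { return new RangeSpecSketch(begin, null); }
  static RangeSpecSketch of(long begin, long end) { return new RangeSpecSketch(begin, end); }

  /** Relative adjustment, e.g. advancing past bytes already consumed before a retry. */
  RangeSpecSketch withShiftedBegin(long delta) {
    long newBegin = (begin == null ? 0L : begin) + delta;
    return new RangeSpecSketch(newBegin, end);
  }

  /** Null means "send no Range header at all", i.e. read the whole object. */
  String toRangeHeader() {
    if (begin == null && end == null) {
      return null;
    }
    long b = begin == null ? 0L : begin;
    return end == null ? "bytes=" + b + "-" : "bytes=" + b + "-" + end;
  }
}
```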
### Future Breaking Change
In order to facilitate migrating any RestorableState<ReadChannel> customers might have, we have left the existing class hierarchy in place and updated BlobReadChannel.StateImpl#restore() to produce a new BlobReadChannelV2 instance when called.
In the next major version this compatibility path will be removed.
### TODOs