Skip to content

speed up mem::swap #40454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jun 11, 2017
32 changes: 22 additions & 10 deletions src/libcore/mem.rs
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ pub use intrinsics::transmute;
/// [`Clone`][clone]. You need the value's destructor to run only once,
/// because a double `free` is undefined behavior.
///
/// An example is the definition of [`mem::swap`][swap] in this module:
/// An example is the (old) definition of [`mem::swap`][swap] in this module:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would phrase this differently - maybe "An example is a possible implementation of mem::swap" or something like that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should not intimate the existance of a former impl.

///
/// ```
/// use std::mem;
Expand Down Expand Up @@ -448,17 +448,29 @@ pub unsafe fn uninitialized<T>() -> T {
pub fn swap<T>(x: &mut T, y: &mut T) {
unsafe {
// Give ourselves some scratch space to work with
let mut t: T = uninitialized();
let mut t: [u8; 16] = uninitialized();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point: Instead of using a literal 16 here, it might be slightly better to stick a

const SWAP_BLOCK_SIZE: usize = 16;

at the top of the function. This would make it slightly less error-prone to tweak the value later, and it would make it easy to, e.g., pick different sizes for different architectures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good suggestion, thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for reference, the actual block size didn't matter too much but that was just on my machine - 16 may very well not be optimal


// Perform the swap, `&mut` pointers never alias
ptr::copy_nonoverlapping(&*x, &mut t, 1);
ptr::copy_nonoverlapping(&*y, x, 1);
ptr::copy_nonoverlapping(&t, y, 1);
let x = x as *mut T as *mut u8;
let y = y as *mut T as *mut u8;
let t = &mut t as *mut _ as *mut u8;

// y and t now point to the same thing, but we need to completely
// forget `t` because we do not want to run the destructor for `T`
// on its value, which is still owned somewhere outside this function.
forget(t);
// can't use a for loop as the `range` impl calls `mem::swap` recursively
let len = size_of::<T>() as isize;
let mut i = 0;
while i + 16 <= len {
// Perform the swap 16 bytes at a time, `&mut` pointers never alias
ptr::copy_nonoverlapping(x.offset(i), t, 16);
ptr::copy_nonoverlapping(y.offset(i), x.offset(i), 16);
ptr::copy_nonoverlapping(t, y.offset(i), 16);
i += 16;
}
if i < len {
// Swap any remaining bytes
let rem = (len - i) as usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it ever matters, but it occurs to me that calculating rem as len % 16 instead of len - i might be more obvious to an optimizer, as knowing what len - i is requires reasoning about a loop. Similarly, it would be better to test whether rem == 0 than to test i < len, as it is more trivially determined at compile time.

I doubt this matters in most cases, but I could see the optimizer failing when the size is large, since the loop might not be fully unrolled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it; looks like the generated assembly is the same either way. len here is a constant known at compile time so I guess it just gets completely eliminated

ptr::copy_nonoverlapping(x.offset(i), t, rem);
ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
ptr::copy_nonoverlapping(t, y.offset(i), rem);
}
}
}

Expand Down