Skip to content

Add to_boxed_slice() to clone slice into boxed slice #80460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

calebsander
Copy link
Contributor

@calebsander calebsander commented Dec 28, 2020

The current recommended approach seems to be to use .to_vec().into_boxed_slice(), but this misses some opportunities for optimization. A Godbolt comparison shows that Vec::into_boxed_slice() emits unnecessary calls to realloc() and free() and uses some extra registers for local variables: https://godbolt.org/z/eq7Pc7. I was able to remove 3 calls to Vec::into_boxed_slice() from the standard library.
If this addition looks reasonable, I can create a tracking issue for the slice_to_boxed feature.

@rust-highfive
Copy link
Contributor

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @sfackler (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 28, 2020
@the8472
Copy link
Member

the8472 commented Dec 29, 2020

A Godbolt comparison shows that Vec::into_boxed_slice() performs an unnecessary realloc()

Is that code actually executed? It might be dead code that just doesn't get elided by the compiler.

@calebsander
Copy link
Contributor Author

That's true, it internally calls Vec::shrink_to_fit(), which will skip calling realloc(). But this code doesn't need to be emitted, and it still performs an unnecessary comparison of capacity to len.

@the8472
Copy link
Member

the8472 commented Dec 29, 2020

There already exists a impl<T: Copy> From<&[T]> for Box<[T]>. Not sure why that's not T: Clone

@calebsander calebsander force-pushed the feature/slice-to-box branch 2 times, most recently from 76e0473 to 509280c Compare December 29, 2020 18:04
@calebsander
Copy link
Contributor Author

There already exists a impl<T: Copy> From<&[T]> for Box<[T]>. Not sure why that's not T: Clone

Perhaps because it was implemented before specialization? There is an impl<T: Clone> From<&[T]> for Vec<T>, so I've changed the bounds on the Box<[T]> implementation to match, and made it use `to_boxed_slice().

@the8472
Copy link
Member

the8472 commented Dec 30, 2020

If you extend the From implementation then I think the named method is redundant.

@calebsander
Copy link
Contributor Author

If you extend the From implementation then I think the named method is redundant.

Well there already exist Vec::from(), <[T]>::to_vec(), and <[T]>::to_owned() so I added <[T]>::to_boxed_slice() for consistency. I think some people also prefer postfix conversions and some prefer prefix methods, but it is true that both methods are not strictly necessary.

@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 19, 2021
@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 7, 2021
@Dylan-DPC-zz
Copy link

r? @m-ou-se

@rust-highfive rust-highfive assigned m-ou-se and unassigned sfackler Feb 9, 2021
@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 28, 2021
Copy link
Member

@m-ou-se m-ou-se left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good addition to me, to add unstably so we can try it out. Can you open a tracking issue? Thanks!

@calebsander calebsander force-pushed the feature/slice-to-box branch from 509280c to f50a592 Compare March 3, 2021 16:58
@calebsander
Copy link
Contributor Author

Thanks for the feedback! I think I've addressed it, but we should probably do a perf test to make sure that implementing <[T]>::to_vec() in terms of <[T]>::to_boxed_slice() didn't cause performance regressions.
I created a tracking issue: #82725.

@m-ou-se
Copy link
Member

m-ou-se commented Mar 3, 2021

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 3, 2021
@bors
Copy link
Collaborator

bors commented Mar 3, 2021

⌛ Trying commit f50a5923f16d92d2afaafe0b10bb68a67100aacd with merge 678aab286813c93b8a5589c664cc3534f33e46e6...

m-ou-se added a commit to m-ou-se/rust that referenced this pull request Mar 5, 2021
… r=m-ou-se

Avoid unnecessary Vec construction in BufReader

As mentioned in rust-lang#80460, creating a `Vec` and calling `Vec::into_boxed_slice()` emits unnecessary calls to `realloc()` and `free()`. Updated the code to use `Box::new_uninit_slice()` to create a boxed slice directly. I think this also makes it more explicit that the initial contents of the buffer are uninitialized.

r? `@m-ou-se`
m-ou-se added a commit to m-ou-se/rust that referenced this pull request Mar 5, 2021
… r=m-ou-se

Avoid unnecessary Vec construction in BufReader

As mentioned in rust-lang#80460, creating a `Vec` and calling `Vec::into_boxed_slice()` emits unnecessary calls to `realloc()` and `free()`. Updated the code to use `Box::new_uninit_slice()` to create a boxed slice directly. I think this also makes it more explicit that the initial contents of the buffer are uninitialized.

r? ``@m-ou-se``
@calebsander
Copy link
Contributor Author

It looks like the perf run was cancelled when I updated the PR to factor out the second commit into #82728. Would it be possible to re-run it?

@m-ou-se
Copy link
Member

m-ou-se commented Mar 8, 2021

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@bors
Copy link
Collaborator

bors commented Mar 8, 2021

⌛ Trying commit 47b1dbd with merge 15dafdb539a06e106a54773f45ddf0b7f750c6c2...

@calebsander
Copy link
Contributor Author

Hmm, in the interest of reducing code duplication, it looks like copying and cloning into a boxed slice can be easily implemented in terms of the recently-added functions on MaybeUninit in #79995. But we may want to wait for this feature to be closer to stabilization and see if there are any optimizations to be made to its implementation.

@bors
Copy link
Collaborator

bors commented Mar 8, 2021

☀️ Try build successful - checks-actions
Build commit: 15dafdb539a06e106a54773f45ddf0b7f750c6c2 (15dafdb539a06e106a54773f45ddf0b7f750c6c2)

@rust-timer
Copy link
Collaborator

Queued 15dafdb539a06e106a54773f45ddf0b7f750c6c2 with parent 1d6b0f6, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (15dafdb539a06e106a54773f45ddf0b7f750c6c2): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 8, 2021
@m-ou-se
Copy link
Member

m-ou-se commented Mar 21, 2021

Those results don't look great. :(

@m-ou-se m-ou-se added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 21, 2021
@calebsander
Copy link
Contributor Author

calebsander commented Mar 24, 2021

Hmm, there seem to be some significant improvements and some significant regressions for each of the performance measures. I'm not sure what is causing this, since the generated assembly for <[T]>::to_vec() is very similar. Here is a Godbolt link comparing the assembly: https://godbolt.org/z/149hxoddx. The main differences I see are:

  • Copying a slice into a Vec uses one more register for local variables (not sure why):
example::copy_slice_old:
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        ; ...

example::copy_slice_new:
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 8
        ; ...
  • Copying a slice has fewer instructions to manipulate the return value of __rust_alloc():
example::copy_slice_old:
        ; ...
        mov     rdi, rax
        mov     rax, r13
        shr     rax, 2
        test    rdi, rdi
        cmove   rax, rdi
        test    rdi, rdi
        je      .LBB0_7
        ; ...

example::copy_slice_new:
        ; ...
        mov     rbp, rax
        test    rbp, rbp
        je      .LBB1_7
        ; ...
  • Cloning a slice no longer uses the stack to store a local variable:
example::clone_slice_old:
        ; ...
        mov     qword ptr [rsp], r14
        ; ...
        mov     rax, qword ptr [rsp]
        ; ...
  • Cloning a slice now has fewer instructions and jumps in the loop:
example::clone_slice_old:
        ; ...
.LBB0_7:
        cmp     r12, rbx
        je      .LBB0_10
        mov     edi, 4
        mov     esi, 4
        call    r14
        test    rax, rax
        je      .LBB0_13
        mov     rcx, qword ptr [r15 + 8*rbx]
        mov     ecx, dword ptr [rcx]
        mov     dword ptr [rax], ecx
        mov     qword ptr [r13 + 8*rbx], rax
        lea     rax, [rbx + 1]
        mov     rbx, rax
        cmp     rbp, rax
        jne     .LBB0_7
        ; ...

example::clone_slice_new:
        ; ...
.LBB1_7:
        mov     edi, 4
        mov     esi, 4
        call    r13
        test    rax, rax
        je      .LBB1_12
        mov     rcx, qword ptr [r15 + 8*rbp]
        mov     ecx, dword ptr [rcx]
        mov     dword ptr [rax], ecx
        mov     qword ptr [rbx + 8*rbp], rax
        add     rbp, 1
        cmp     r12, rbp
        jne     .LBB1_7
        ; ...

I would expect the code involved in making the allocation or performing a memcpy() to take much longer than these small differences in instruction counts. Is it possible that the performance diff is picking up some unrelated changes?

@crlf0710
Copy link
Member

crlf0710 commented Apr 9, 2021

@calebsander Ping from triage, what's next steps on this?

@crlf0710 crlf0710 added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Apr 9, 2021
@calebsander
Copy link
Contributor Author

The concern @m-ou-se brought up was that the perf run showed some performance regressions (and some improvements) from implementing <[T]>::to_vec() in terms of <[T]>::to_boxed_slice(). I see a few possible options:

  1. Try to figure out what is causing the performance regressions. I looked at the assembly generated in Godbolt for <[T]>::to_vec() and it is nearly identical. I really have no idea how this change could be causing 1% or 2% slowdowns.
  2. Accept the mixed performance results and merge this PR as is. We could try to improve the performance in a future PR.
  3. Go back to the original version of the PR, which only implemented <[T]>::to_boxed_slice() and did not change the implementation of <[T]>::to_vec(). Unfortunately this version had a significant amount of duplicated code, which is why @m-ou-se suggested using <[T]>::to_boxed_slice() to implement <[T]>::to_vec().

@the8472
Copy link
Member

the8472 commented Apr 9, 2021

The generated assembly may not be the problem. perf rlo mostly measures compile time, not runtime. So the results aren't necessarily due to box/vec use, they may simply be due to code being being more difficult for the compiler even when it ultimately results in similar assembly.

E.g. this one spends more time in LLVM: https://perf.rust-lang.org/detailed-query.html?commit=15dafdb539a06e106a54773f45ddf0b7f750c6c2&base_commit=1d6b0f626aad4ee9f2eaec4d5582f45620ccab80&benchmark=ripgrep-opt&run_name=full

// allocated above with the capacity of `s`, and initialize to `s.len()` in
// ptr::copy_to_non_overlapping below.
fn to_boxed_slice<A: Allocator>(s: &[Self], alloc: A) -> Box<[Self], A> {
let mut boxed = Box::new_uninit_slice_in(s.len(), alloc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of effort went into the Vec allocation path to reduce the amount of IR generated, perhaps the Box equivalent simply isn't as fine-tuned and that causes the regressions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're identical except for the additional RawVec::into_box() call, which just moves some pointers and lengths around, so it seems unlikely that this would be the culprit:

impl<T, A: Allocator> Box<[T], A> {
    pub fn new_uninit_slice_in(len: usize, alloc: A) -> Box<[mem::MaybeUninit<T>], A> {
        unsafe { RawVec::with_capacity_in(len, alloc).into_box(len) }
    }
}

impl<T, A: Allocator> Vec<T, A> {
    pub fn with_capacity_in(capacity: usize, alloc: A) -> Self {
        Vec { buf: RawVec::with_capacity_in(capacity, alloc), len: 0 }
    }
}

impl<T, A: Allocator> RawVec<T, A> {
    pub unsafe fn into_box(self, len: usize) -> Box<[MaybeUninit<T>], A> {
        // Sanity-check one half of the safety requirement (we cannot check the other half).
        debug_assert!(
            len <= self.capacity(),
            "`len` must be smaller than or equal to `self.capacity()`"
        );

        let me = ManuallyDrop::new(self);
        unsafe {
            let slice = slice::from_raw_parts_mut(me.ptr() as *mut MaybeUninit<T>, len);
            Box::from_raw_in(slice, ptr::read(&me.alloc))
        }
    }
}

@calebsander
Copy link
Contributor Author

calebsander commented Apr 9, 2021

The generated assembly may not be the problem. perf rlo mostly measures compile time, not runtime. So the results aren't necessarily due to box/vec use, they may simply be due to code being being more difficult for the compiler even when it ultimately results in similar assembly.

Okay, that's good to know. It's definitely possible that more IR is being generated, although the Rust playground shows pretty similar LLVM IR generated (in release mode):

Old IR for <[i32]>::to_vec()
; ModuleID = 'playground.5mgwo5pt-cgu.0'
source_filename = "playground.5mgwo5pt-cgu.0"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%"std::vec::Vec<i32>" = type { [0 x i64], { i32*, i64 }, [0 x i64], i64, [0 x i64] }
%"unwind::libunwind::_Unwind_Exception" = type { [0 x i64], i64, [0 x i64], void (i32, %"unwind::libunwind::_Unwind_Exception"*)*, [0 x i64], [6 x i64], [0 x i64] }
%"unwind::libunwind::_Unwind_Context" = type { [0 x i8] }

; playground::copy_slice_old
; Function Attrs: nonlazybind uwtable
define void @_ZN10playground14copy_slice_old17h6d0009fe79b83180E(%"std::vec::Vec<i32>"* noalias nocapture sret(%"std::vec::Vec<i32>") dereferenceable(24) %0, [0 x i32]* noalias nocapture nonnull readonly align 4 %slice.0, i64 %slice.1) unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  tail call void @llvm.experimental.noalias.scope.decl(metadata !2)
  tail call void @llvm.experimental.noalias.scope.decl(metadata !5), !noalias !8
  tail call void @llvm.experimental.noalias.scope.decl(metadata !10), !noalias !13
  tail call void @llvm.experimental.noalias.scope.decl(metadata !15), !noalias !18
  tail call void @llvm.experimental.noalias.scope.decl(metadata !20), !noalias !23
  %1 = tail call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %slice.1, i64 4) #7
  %2 = extractvalue { i64, i1 } %1, 1
  %3 = extractvalue { i64, i1 } %1, 0
  %.sroa.3.0.i.i.i.i.i.i.i.i = select i1 %2, i64 0, i64 4
  %.sroa.0.0.i.i.i.i.i.i.i.i = select i1 %2, i64 undef, i64 %3
  br i1 %2, label %bb6.i.i.i.i.i.i.i, label %bb13.i.i.i.i.i.i.i

bb6.i.i.i.i.i.i.i:                                ; preds = %start
; call alloc::raw_vec::capacity_overflow
  tail call void @_ZN5alloc7raw_vec17capacity_overflow17h97153d40f6e2cd3bE(), !noalias !25
  unreachable

bb13.i.i.i.i.i.i.i:                               ; preds = %start
  %4 = icmp eq i64 %.sroa.0.0.i.i.i.i.i.i.i.i, 0
  br i1 %4, label %bb3.i.i.i.i.i.i.i.i.i, label %bb7.i.i.i.i.i.i.i.i.i

bb3.i.i.i.i.i.i.i.i.i:                            ; preds = %bb13.i.i.i.i.i.i.i
  %_2.i.i.i.i.i.i.i.i.i.i = inttoptr i64 %.sroa.3.0.i.i.i.i.i.i.i.i to i8*
  br label %"_ZN63_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$8allocate17hdb6fd4f64ab5e4e5E.exit.i.i.i.i.i.i.i"

bb7.i.i.i.i.i.i.i.i.i:                            ; preds = %bb13.i.i.i.i.i.i.i
  %5 = tail call i8* @__rust_alloc(i64 %.sroa.0.0.i.i.i.i.i.i.i.i, i64 %.sroa.3.0.i.i.i.i.i.i.i.i) #7, !noalias !25
  %6 = icmp eq i8* %5, null
  %.sroa.0.0.i.op.i.i.i.i.i.i.i = lshr i64 %.sroa.0.0.i.i.i.i.i.i.i.i, 2
  %phi.bo.i.i.i.i.i.i.i = select i1 %6, i64 0, i64 %.sroa.0.0.i.op.i.i.i.i.i.i.i
  br label %"_ZN63_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$8allocate17hdb6fd4f64ab5e4e5E.exit.i.i.i.i.i.i.i"

"_ZN63_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$8allocate17hdb6fd4f64ab5e4e5E.exit.i.i.i.i.i.i.i": ; preds = %bb7.i.i.i.i.i.i.i.i.i, %bb3.i.i.i.i.i.i.i.i.i
  %.sroa.4.0.i.i.i.i.i.i.i.i.i = phi i64 [ 0, %bb3.i.i.i.i.i.i.i.i.i ], [ %phi.bo.i.i.i.i.i.i.i, %bb7.i.i.i.i.i.i.i.i.i ]
  %.sroa.0.0.i.i.i.i.i.i.i.i.i = phi i8* [ %_2.i.i.i.i.i.i.i.i.i.i, %bb3.i.i.i.i.i.i.i.i.i ], [ %5, %bb7.i.i.i.i.i.i.i.i.i ]
  %7 = icmp eq i8* %.sroa.0.0.i.i.i.i.i.i.i.i.i, null
  br i1 %7, label %bb20.i.i.i.i.i.i.i, label %"_ZN5alloc5slice29_$LT$impl$u20$$u5b$T$u5d$$GT$6to_vec17h82a8a1f74f9ea9ddE.exit"

bb20.i.i.i.i.i.i.i:                               ; preds = %"_ZN63_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$8allocate17hdb6fd4f64ab5e4e5E.exit.i.i.i.i.i.i.i"
; call alloc::alloc::handle_alloc_error
  tail call void @_ZN5alloc5alloc18handle_alloc_error17hbbcab09c85c3ddd4E(i64 %.sroa.0.0.i.i.i.i.i.i.i.i, i64 %.sroa.3.0.i.i.i.i.i.i.i.i), !noalias !25
  unreachable

"_ZN5alloc5slice29_$LT$impl$u20$$u5b$T$u5d$$GT$6to_vec17h82a8a1f74f9ea9ddE.exit": ; preds = %"_ZN63_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$8allocate17hdb6fd4f64ab5e4e5E.exit.i.i.i.i.i.i.i"
  %8 = bitcast %"std::vec::Vec<i32>"* %0 to i8**
  store i8* %.sroa.0.0.i.i.i.i.i.i.i.i.i, i8** %8, align 8, !alias.scope !26, !noalias !27
  %9 = getelementptr inbounds %"std::vec::Vec<i32>", %"std::vec::Vec<i32>"* %0, i64 0, i32 1, i32 1
  store i64 %.sroa.4.0.i.i.i.i.i.i.i.i.i, i64* %9, align 8, !alias.scope !26, !noalias !27
  %10 = getelementptr inbounds %"std::vec::Vec<i32>", %"std::vec::Vec<i32>"* %0, i64 0, i32 3
  %11 = shl nuw i64 %slice.1, 2
  %12 = bitcast [0 x i32]* %slice.0 to i8*
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 4 %.sroa.0.0.i.i.i.i.i.i.i.i.i, i8* nonnull align 4 %12, i64 %11, i1 false) #7, !noalias !28
  store i64 %slice.1, i64* %10, align 8, !alias.scope !29, !noalias !27
  ret void
}

; Function Attrs: nofree nosync nounwind readnone speculatable willreturn
declare { i64, i1 } @llvm.umul.with.overflow.i64(i64, i64) #1

; Function Attrs: nounwind nonlazybind uwtable
declare i32 @rust_eh_personality(i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*) unnamed_addr #2

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #3

; Function Attrs: nounwind nonlazybind uwtable
declare noalias i8* @__rust_alloc(i64, i64) unnamed_addr #2

; alloc::raw_vec::capacity_overflow
; Function Attrs: noreturn nonlazybind uwtable
declare void @_ZN5alloc7raw_vec17capacity_overflow17h97153d40f6e2cd3bE() unnamed_addr #4

; alloc::alloc::handle_alloc_error
; Function Attrs: cold noreturn nounwind nonlazybind uwtable
declare void @_ZN5alloc5alloc18handle_alloc_error17hbbcab09c85c3ddd4E(i64, i64) unnamed_addr #5

; Function Attrs: inaccessiblememonly nofree nosync nounwind willreturn
declare void @llvm.experimental.noalias.scope.decl(metadata) #6

attributes #0 = { nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #1 = { nofree nosync nounwind readnone speculatable willreturn }
attributes #2 = { nounwind nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #3 = { argmemonly nofree nosync nounwind willreturn }
attributes #4 = { noreturn nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #5 = { cold noreturn nounwind nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #6 = { inaccessiblememonly nofree nosync nounwind willreturn }
attributes #7 = { nounwind }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 7, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}
!2 = !{!3}
!3 = distinct !{!3, !4, !"_ZN5alloc5slice29_$LT$impl$u20$$u5b$T$u5d$$GT$6to_vec17h82a8a1f74f9ea9ddE: argument 0"}
!4 = distinct !{!4, !"_ZN5alloc5slice29_$LT$impl$u20$$u5b$T$u5d$$GT$6to_vec17h82a8a1f74f9ea9ddE"}
!5 = !{!6}
!6 = distinct !{!6, !7, !"_ZN5alloc5slice29_$LT$impl$u20$$u5b$T$u5d$$GT$9to_vec_in17hf0efe825ef187949E: argument 0"}
!7 = distinct !{!7, !"_ZN5alloc5slice29_$LT$impl$u20$$u5b$T$u5d$$GT$9to_vec_in17hf0efe825ef187949E"}
!8 = !{!3, !9}
!9 = distinct !{!9, !4, !"_ZN5alloc5slice29_$LT$impl$u20$$u5b$T$u5d$$GT$6to_vec17h82a8a1f74f9ea9ddE: %self.0"}
!10 = !{!11}
!11 = distinct !{!11, !12, !"_ZN5alloc5slice4hack6to_vec17h4e5b681653d8de3eE: argument 0"}
!12 = distinct !{!12, !"_ZN5alloc5slice4hack6to_vec17h4e5b681653d8de3eE"}
!13 = !{!6, !14, !3, !9}
!14 = distinct !{!14, !7, !"_ZN5alloc5slice29_$LT$impl$u20$$u5b$T$u5d$$GT$9to_vec_in17hf0efe825ef187949E: %self.0"}
!15 = !{!16}
!16 = distinct !{!16, !17, !"_ZN52_$LT$T$u20$as$u20$alloc..slice..hack..ConvertVec$GT$6to_vec17h153ad63a55993150E: %v"}
!17 = distinct !{!17, !"_ZN52_$LT$T$u20$as$u20$alloc..slice..hack..ConvertVec$GT$6to_vec17h153ad63a55993150E"}
!18 = !{!11, !19, !6, !14, !3, !9}
!19 = distinct !{!19, !12, !"_ZN5alloc5slice4hack6to_vec17h4e5b681653d8de3eE: %s.0"}
!20 = !{!21}
!21 = distinct !{!21, !22, !"_ZN5alloc3vec16Vec$LT$T$C$A$GT$16with_capacity_in17hce1a244487fd289dE: argument 0"}
!22 = distinct !{!22, !"_ZN5alloc3vec16Vec$LT$T$C$A$GT$16with_capacity_in17hce1a244487fd289dE"}
!23 = !{!16, !24, !11, !19, !6, !14, !3, !9}
!24 = distinct !{!24, !17, !"_ZN52_$LT$T$u20$as$u20$alloc..slice..hack..ConvertVec$GT$6to_vec17h153ad63a55993150E: %s.0"}
!25 = !{!21, !16, !24, !11, !19, !6, !14, !3, !9}
!26 = !{!21, !16, !11, !6, !3}
!27 = !{!24, !19, !14, !9}
!28 = !{!16, !11, !6, !3}
!29 = !{!30, !16, !11, !6, !3}
!30 = distinct !{!30, !31, !"_ZN5alloc3vec16Vec$LT$T$C$A$GT$7set_len17hcd51c7b209261a71E: %self"}
!31 = distinct !{!31, !"_ZN5alloc3vec16Vec$LT$T$C$A$GT$7set_len17hcd51c7b209261a71E"}
New IR
; ModuleID = 'playground.5mgwo5pt-cgu.0'
source_filename = "playground.5mgwo5pt-cgu.0"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%"std::vec::Vec<i32>" = type { [0 x i64], { i32*, i64 }, [0 x i64], i64, [0 x i64] }
%"unwind::libunwind::_Unwind_Exception" = type { [0 x i64], i64, [0 x i64], void (i32, %"unwind::libunwind::_Unwind_Exception"*)*, [0 x i64], [6 x i64], [0 x i64] }
%"unwind::libunwind::_Unwind_Context" = type { [0 x i8] }

; playground::copy_slice_new
; Function Attrs: nonlazybind uwtable
define void @_ZN10playground14copy_slice_new17hbe14d17b96488007E(%"std::vec::Vec<i32>"* noalias nocapture sret(%"std::vec::Vec<i32>") dereferenceable(24) %0, [0 x i32]* noalias nocapture nonnull readonly align 4 %slice.0, i64 %slice.1) unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  tail call void @llvm.experimental.noalias.scope.decl(metadata !2)
  %1 = tail call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %slice.1, i64 4) #7
  %2 = extractvalue { i64, i1 } %1, 1
  %3 = extractvalue { i64, i1 } %1, 0
  %.sroa.3.0.i.i.i.i.i.i.i = select i1 %2, i64 0, i64 4
  %.sroa.0.0.i.i.i.i.i.i.i = select i1 %2, i64 undef, i64 %3
  br i1 %2, label %bb6.i.i.i.i.i.i, label %bb13.i.i.i.i.i.i

bb6.i.i.i.i.i.i:                                  ; preds = %start
; call alloc::raw_vec::capacity_overflow
  tail call void @_ZN5alloc7raw_vec17capacity_overflow17h97153d40f6e2cd3bE(), !noalias !5
  unreachable

bb13.i.i.i.i.i.i:                                 ; preds = %start
  %4 = icmp eq i64 %.sroa.0.0.i.i.i.i.i.i.i, 0
  br i1 %4, label %bb3.i.i.i.i.i.i.i.i, label %bb7.i.i.i.i.i.i.i.i

bb3.i.i.i.i.i.i.i.i:                              ; preds = %bb13.i.i.i.i.i.i
  %_2.i.i.i.i.i.i.i.i.i = inttoptr i64 %.sroa.3.0.i.i.i.i.i.i.i to i8*
  br label %"_ZN63_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$8allocate17hdb6fd4f64ab5e4e5E.exit.i.i.i.i.i.i"

bb7.i.i.i.i.i.i.i.i:                              ; preds = %bb13.i.i.i.i.i.i
  %5 = tail call i8* @__rust_alloc(i64 %.sroa.0.0.i.i.i.i.i.i.i, i64 %.sroa.3.0.i.i.i.i.i.i.i) #7, !noalias !5
  br label %"_ZN63_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$8allocate17hdb6fd4f64ab5e4e5E.exit.i.i.i.i.i.i"

"_ZN63_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$8allocate17hdb6fd4f64ab5e4e5E.exit.i.i.i.i.i.i": ; preds = %bb7.i.i.i.i.i.i.i.i, %bb3.i.i.i.i.i.i.i.i
  %.sroa.0.0.i.i.i.i.i.i.i.i = phi i8* [ %_2.i.i.i.i.i.i.i.i.i, %bb3.i.i.i.i.i.i.i.i ], [ %5, %bb7.i.i.i.i.i.i.i.i ]
  %6 = icmp eq i8* %.sroa.0.0.i.i.i.i.i.i.i.i, null
  br i1 %6, label %bb20.i.i.i.i.i.i, label %_ZN10playground4hack6to_vec17h302c903ca456a4daE.exit

bb20.i.i.i.i.i.i:                                 ; preds = %"_ZN63_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$8allocate17hdb6fd4f64ab5e4e5E.exit.i.i.i.i.i.i"
; call alloc::alloc::handle_alloc_error
  tail call void @_ZN5alloc5alloc18handle_alloc_error17hbbcab09c85c3ddd4E(i64 %.sroa.0.0.i.i.i.i.i.i.i, i64 %.sroa.3.0.i.i.i.i.i.i.i), !noalias !5
  unreachable

_ZN10playground4hack6to_vec17h302c903ca456a4daE.exit: ; preds = %"_ZN63_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$8allocate17hdb6fd4f64ab5e4e5E.exit.i.i.i.i.i.i"
  %7 = shl nuw i64 %slice.1, 2
  %8 = bitcast [0 x i32]* %slice.0 to i8*
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 4 %.sroa.0.0.i.i.i.i.i.i.i.i, i8* nonnull align 4 %8, i64 %7, i1 false) #7, !noalias !2
  %9 = bitcast %"std::vec::Vec<i32>"* %0 to i8**
  store i8* %.sroa.0.0.i.i.i.i.i.i.i.i, i8** %9, align 8, !alias.scope !11, !noalias !16
  %10 = getelementptr inbounds %"std::vec::Vec<i32>", %"std::vec::Vec<i32>"* %0, i64 0, i32 1, i32 1
  store i64 %slice.1, i64* %10, align 8, !alias.scope !11, !noalias !16
  %11 = getelementptr inbounds %"std::vec::Vec<i32>", %"std::vec::Vec<i32>"* %0, i64 0, i32 3
  store i64 %slice.1, i64* %11, align 8, !alias.scope !11, !noalias !16
  ret void
}

; Function Attrs: nofree nosync nounwind readnone speculatable willreturn
declare { i64, i1 } @llvm.umul.with.overflow.i64(i64, i64) #1

; Function Attrs: nounwind nonlazybind uwtable
declare i32 @rust_eh_personality(i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*) unnamed_addr #2

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #3

; Function Attrs: nounwind nonlazybind uwtable
declare noalias i8* @__rust_alloc(i64, i64) unnamed_addr #2

; alloc::raw_vec::capacity_overflow
; Function Attrs: noreturn nonlazybind uwtable
declare void @_ZN5alloc7raw_vec17capacity_overflow17h97153d40f6e2cd3bE() unnamed_addr #4

; alloc::alloc::handle_alloc_error
; Function Attrs: cold noreturn nounwind nonlazybind uwtable
declare void @_ZN5alloc5alloc18handle_alloc_error17hbbcab09c85c3ddd4E(i64, i64) unnamed_addr #5

; Function Attrs: inaccessiblememonly nofree nosync nounwind willreturn
declare void @llvm.experimental.noalias.scope.decl(metadata) #6

attributes #0 = { nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #1 = { nofree nosync nounwind readnone speculatable willreturn }
attributes #2 = { nounwind nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #3 = { argmemonly nofree nosync nounwind willreturn }
attributes #4 = { noreturn nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #5 = { cold noreturn nounwind nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #6 = { inaccessiblememonly nofree nosync nounwind willreturn }
attributes #7 = { nounwind }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 7, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}
!2 = !{!3}
!3 = distinct !{!3, !4, !"_ZN10playground4hack6to_vec17h302c903ca456a4daE: argument 0"}
!4 = distinct !{!4, !"_ZN10playground4hack6to_vec17h302c903ca456a4daE"}
!5 = !{!6, !8, !3, !10}
!6 = distinct !{!6, !7, !"_ZN52_$LT$T$u20$as$u20$playground..hack..ConvertBoxed$GT$14to_boxed_slice17hf040d5891f80ff7fE: %s.0"}
!7 = distinct !{!7, !"_ZN52_$LT$T$u20$as$u20$playground..hack..ConvertBoxed$GT$14to_boxed_slice17hf040d5891f80ff7fE"}
!8 = distinct !{!8, !9, !"_ZN10playground4hack14to_boxed_slice17hee7857f846790a13E: %s.0"}
!9 = distinct !{!9, !"_ZN10playground4hack14to_boxed_slice17hee7857f846790a13E"}
!10 = distinct !{!10, !4, !"_ZN10playground4hack6to_vec17h302c903ca456a4daE: %s.0"}
!11 = !{!12, !14, !3}
!12 = distinct !{!12, !13, !"_ZN5alloc3vec16Vec$LT$T$C$A$GT$17from_raw_parts_in17h851da6b596063aefE: argument 0"}
!13 = distinct !{!13, !"_ZN5alloc3vec16Vec$LT$T$C$A$GT$17from_raw_parts_in17h851da6b596063aefE"}
!14 = distinct !{!14, !15, !"_ZN10playground4hack8into_vec17hf50de4bbad45c759E: argument 0"}
!15 = distinct !{!15, !"_ZN10playground4hack8into_vec17hf50de4bbad45c759E"}
!16 = !{!17, !10}
!17 = distinct !{!17, !15, !"_ZN10playground4hack8into_vec17hf50de4bbad45c759E: argument 1"}

@the8472
Copy link
Member

the8472 commented Apr 9, 2021

Well, then I am out of guesses. If you want to dig further you can download (there's a tool for that) the try builds that were used for the perf run and then get additional profiling outputs for individual benchmarks that regressed to figure out what's causing it.

@JohnCSimon
Copy link
Member

ping from triage:
@calebsander what would you like to do with this PR?

@calebsander
Copy link
Contributor Author

I don't think I have enough experience with the internals of the compilation process to debug why compilation time has increased for some of the perf test cases. If someone with more expertise wants to take a look at it, they are welcome to.
I think it would make the most sense to merge these changes and let others contribute optimizations later. But if the team feels that the performance regressions are unacceptable, then it probably makes sense to just close this PR.

@bstrie
Copy link
Contributor

bstrie commented May 12, 2021

But if the team feels that the performance regressions are unacceptable, then it probably makes sense to just close this PR.

Since there's a tracking issue to keep this topic from being totally dropped, and in lieu of any immediate solutions to the perf problems, I'd recommend closing for now unless a libs team member disagrees.

@bstrie
Copy link
Contributor

bstrie commented Jun 2, 2021

I'll close for now, this is tracked by #82725 .

@bstrie bstrie closed this Jun 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author.
Projects
None yet
Development

Successfully merging this pull request may close these issues.