Linux: Always request and wait for an ACK from nl80211 #1510

johnthacker · 2025-04-26T01:28:17Z

The Linux kernel documentation for Netlink says:

"Note that unless the NLM_F_ACK flag is set on the request Netlink will not respond with NLMSG_ERROR if there is no error. To avoid having to special-case this quirk it is recommended to always set NLM_F_ACK." (https://docs.kernel.org/userspace-api/netlink/intro.html#nl-msg-type)

In other places, it suggests that sockets do have Auto-ACK enabled by default, however.

Regardless, follow instructions and always set NLM_F_ACK, and receive messages until we get the ACK.

Fix #1508

infrastation · 2025-05-17T22:40:39Z

Thank you for waiting. In my test with wlan0 on an RPi these changes do not restore the expected error message, could you double-check?

Without these changes:

tcpdump: SIOCGIFINDEX: No such device

With these changes:

tcpdump: wlan0: nl_wait_for_ack failed adding mon0 interface: Operation not supported

The expected error message is:

tcpdump: wlan0: That device doesn't support monitor mode

johnthacker · 2025-05-18T01:22:12Z

I agree I don't see that error message, but I don't see that error message by checking out the commit before 0ac887b either. I also don't see it before 24cf361 either. In fact, I checked out libpcap-1.10.4 and built that, and linked tcpdump against it, and saw the same:

tcpdump: wlan0: nl_wait_for_ack failed adding mon0 interface: Operation not supported

message. I also saw it when checking out libpcap-1.10.3, which surprised me somewhat as the RPi has libpcap-1.10.3 installed on it from Debian Bookworm, and the libpcap on the RPi did give the error message you're looking for.

Then a lightbulb went off:

$ ldd /usr/lib/aarch64-linux-gnu/libpcap.so.0.8
	linux-vdso.so.1 (0x0000007fabcc8000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007faba60000)
	/lib/ld-linux-aarch64.so.1 (0x0000007fabc80000)
	libdbus-1.so.3 => /lib/aarch64-linux-gnu/libdbus-1.so.3 (0x0000007fab9e0000)
	libsystemd.so.0 => /lib/aarch64-linux-gnu/libsystemd.so.0 (0x0000007fab8f0000)
	libcap.so.2 => /lib/aarch64-linux-gnu/libcap.so.2 (0x0000007fab8c0000)
	libgcrypt.so.20 => /lib/aarch64-linux-gnu/libgcrypt.so.20 (0x0000007fab7b0000)
	liblzma.so.5 => /lib/aarch64-linux-gnu/liblzma.so.5 (0x0000007fab760000)
	libzstd.so.1 => /lib/aarch64-linux-gnu/libzstd.so.1 (0x0000007fab6a0000)
	liblz4.so.1 => /lib/aarch64-linux-gnu/liblz4.so.1 (0x0000007fab650000)
	libgpg-error.so.0 => /lib/aarch64-linux-gnu/libgpg-error.so.0 (0x0000007fab600000)
$ ldd ../libpcap/libpcap.so.1.10.3
	linux-vdso.so.1 (0x0000007f81844000)
	libnl-genl-3.so.200 => /lib/aarch64-linux-gnu/libnl-genl-3.so.200 (0x0000007f81760000)
	libnl-3.so.200 => /lib/aarch64-linux-gnu/libnl-3.so.200 (0x0000007f81710000)
	libdbus-1.so.3 => /lib/aarch64-linux-gnu/libdbus-1.so.3 (0x0000007f81690000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007f814e0000)
	/lib/ld-linux-aarch64.so.1 (0x0000007f81800000)
	libsystemd.so.0 => /lib/aarch64-linux-gnu/libsystemd.so.0 (0x0000007f813f0000)
	libcap.so.2 => /lib/aarch64-linux-gnu/libcap.so.2 (0x0000007f813c0000)
	libgcrypt.so.20 => /lib/aarch64-linux-gnu/libgcrypt.so.20 (0x0000007f812b0000)
	liblzma.so.5 => /lib/aarch64-linux-gnu/liblzma.so.5 (0x0000007f81260000)
	libzstd.so.1 => /lib/aarch64-linux-gnu/libzstd.so.1 (0x0000007f811a0000)
	liblz4.so.1 => /lib/aarch64-linux-gnu/liblz4.so.1 (0x0000007f81150000)
	libgpg-error.so.0 => /lib/aarch64-linux-gnu/libgpg-error.so.0 (0x0000007f81100000)

The libpcap installed by Debian isn't built against libnl, so it's unable to set monitor mode for any interface, and libpcap returns PCAP_ERROR_RFMON_NOTSUP for the attempt, which results in getting the error message you expect. The libnl codepath has for a very long time given the "operation not supported" more confusing error message.

So I think that this PR does fix the issue that was introduced by my earlier commit. I do agree that the error message you mention above is the correct one to give, but that's a separate pre-existing issue. I am willing to work on it as well, but perhaps we can commit this fix first?

johnthacker · 2025-05-18T01:26:12Z

It probably wasn't being built against libnl because of #1358.

johnthacker · 2025-05-18T10:59:37Z

I added another commit to return PCAP_ERROR_RFMON_NOTSUP instead of the generic error in the case where Netlink reports that adding a device (already successfully queried as a mac80211 device not in monitor mode) in monitor mode is unsupported.

infrastation · 2025-05-18T13:18:51Z

You are right, the user-visible message is not a new problem and addressing one thing at a time would be best.

I tried comprehending the first proposed commit, among other things it seems to plug a memory leak on error. Specifically, in get_if_type() it adds calls to nl_cb_put() after nlmsg_free() everywhere except in the very last instance (where indeed it should not be done) and the branch where nlmsg_alloc() fails (where it seems it should). Would the following be easier to follow, if I understand this logic correctly? Would it be better to make the leak plug a separate change?

--- a/pcap-linux.c
+++ b/pcap-linux.c
@@ -648,9 +648,6 @@ get_if_type(pcap_t *handle, int sock_fd, struct nl80211_state *state,
 	if (ifindex == -1)
 		return PCAP_ERROR;
 
-	struct nl_cb *cb = nl_cb_alloc(NL_CB_DEFAULT);
-	nl_cb_set(cb, NL_CB_VALID, NL_CB_CUSTOM, if_type_cb, (void*)type);
-
 	msg = nlmsg_alloc();
 	if (!msg) {
 		snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
@@ -674,7 +671,6 @@ get_if_type(pcap_t *handle, int sock_fd, struct nl80211_state *state,
 			 * about that.)
 			 */
 			nlmsg_free(msg);
-			nl_cb_put(cb);
 			return 0;
 		} else {
 			/*
@@ -685,11 +681,13 @@ get_if_type(pcap_t *handle, int sock_fd, struct nl80211_state *state,
 			    "%s: nl_send_auto_complete failed getting interface type: %s",
 			    device, nl_geterror(-err));
 			nlmsg_free(msg);
-			nl_cb_put(cb);
 			return PCAP_ERROR;
 		}
 	}
 
+	struct nl_cb *cb = nl_cb_alloc(NL_CB_DEFAULT);
+	nl_cb_set(cb, NL_CB_VALID, NL_CB_CUSTOM, if_type_cb, (void*)type);
+
 	err = nl_recvmsgs(state->nl_sock, cb);
 	if (err < 0) {
 		snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
@@ -723,6 +721,7 @@ nla_put_failure:
 	    "%s: nl_put failed getting interface type",
 	    device);
 	nlmsg_free(msg);
+	// Do not call nl_cb_put(): nl_cb_alloc() has not been called.
 	return PCAP_ERROR;
 }

Another change is replacing nl_send_auto_complete() with nl_send_auto(), which is an unrelated change:

/**
 * @deprecated Please use nl_send_auto()
 */
int nl_send_auto_complete(struct nl_sock *sk, struct nl_msg *msg)
{
	return nl_send_auto(sk, msg);
}

One other thing that looks not quite right is that (notwithstanding the logic explained for the kernel end of a netlink socket) libnl is supposed to set NLM_F_ACK automatically: according to the documentation, this is the default (can be switched off using nl_socket_disable_auto_ack()), and the current source code looks consistent with that:

/**
 * Disable automatic request for ACK
 * @arg sk		Netlink socket.
 *
 * The default behaviour of a socket is to request an ACK for
 * each message sent to allow for the caller to synchronize to
 * the completion of the netlink operation. This function
 * disables this behaviour and will result in requests being
 * sent which will not have the NLM_F_ACK flag set automatically.
 * However, it is still possible for the caller to set the
 * NLM_F_ACK flag explicitely.
 */
void nl_socket_disable_auto_ack(struct nl_sock *sk)
{
	sk->s_flags |= NL_NO_AUTO_ACK;
}

If this is the case, the change to genlmsg_put() does not affect the logic because NLM_F_ACK will be set automatically and both NL_AUTO_PORT and NL_AUTO_SEQ are defined to 0, is this correct? In this case the latter could be a part of a separate style-focused commit to keep the actual bug fix focused.

johnthacker · 2025-05-18T14:32:54Z

Sounds good, I went with your proposed changes.

infrastation · 2025-05-18T15:50:45Z

pcap-linux.c

@@ -754,49 +754,73 @@ DIAG_ON_NARROWING

 	err = nl_send_auto_complete(state->nl_sock, msg);


As far as I understand, here nlmsg_free() could be called exactly once; furthermore, it could be called implicitly from nl_send_sync(), in which case this code block would be reduced to roughly this:

err = nl_send_sync(state->nl_sock, msg); switch (err) { // return one of the error codes } // success

Arguably, the latter would only make perfect sense if nl_wait_for_ack() indeed returned -NLE_OPNOTSUPP etc, which may be not the case.

I am a bit surprised too that nl_wait_for_ack returns -NLE_OPNOTSUPP instead of the preceding send command, but apparently at least on the RPi that is the case, as seen in the error:

tcpdump: wlan0: nl_wait_for_ack failed adding mon0 interface: Operation not supported

infrastation · 2025-05-18T15:54:55Z

Thank you, it is becoming less convoluted now.

johnthacker · 2025-05-18T17:40:37Z

Thanks for all the review. I think I'm happy with the current state now; let me know if you want me to squash any commits or anything.

infrastation · 2025-05-18T20:55:52Z

Thank you for preparing solutions to the following problems:

an incorrect reference to SIOCGIFINDEX (a relatively recent bug)
a not-sufficiently specific reference to nl_wait_for_ack instead of returning PCAP_ERROR_RFMON_NOTSUP (an older bug)

a memory leak in get_if_type():

opentest: wlan0: Generic error
(SIOCGIFINDEX: No such device)
==80869== 224 bytes in 1 blocks are definitely lost in loss record 2 of 3
==80869==    at 0x4889F94: calloc (vg_replace_malloc.c:1328)
==80869==    by 0x48FF9BB: nl_cb_alloc (in /usr/lib/aarch64-linux-gnu/libnl-3.so.200.26.0)
==80869==    by 0x12C69B: get_if_type (pcap-linux.c:647)
==80869==    by 0x12C69B: enter_rfmon_mode (pcap-linux.c:4934)
==80869==    by 0x12D127: setup_socket (pcap-linux.c:2505)
==80869==    by 0x12D127: pcap_activate_linux (pcap-linux.c:1249)
==80869==    by 0x110DAF: pcap_activate (pcap.c:2708)
==80869==    by 0x10DE87: main (opentest.c:174)

On the one hand, some of these changes overwrite each other, but on the other hand it is desirable to make the commits easy to cherry-pick into the libpcap-1.10 branch. Let me see if there is a better way to arrange these commits.

infrastation · 2025-05-23T19:47:08Z

Thank you for waiting. Please see the "nl80211_no_mon_rejig" branch in my libpcap repository, this represents the sum of your changes almost exactly less a few trivial style fixups, but the order and contents of commits is different (one lot of clean-ups followed by three minimal bug fixes), also the bug fix commit messages now include "before" and "after". This should be much easier to follow later if necessary.

I have tested that each of these commits works as expected both if applied to the master branch and if cherry-picked into the libpcap-1.10 branch. Please have a look and use for this pull request as much as it fits.

One thing that is still missing is the rationale in the commit that fixes the original SIOCGIFINDEX problem. I did not examine the netlink message exchange closely, my understanding of the diff is that returning of NL_STOP somehow prevented consuming of the NLMSG_ERROR part of the response and returning NL_SKIP unblocked that, but the call to nl_wait_for_ack() is still required. "Receive messages until we get the ACK." does not quite explain the root cause and the solution. A few more comments in the source code would be very appropriate to explain the context.

Use the constants NL_AUTO_PORT and NL_AUTO_SEQ instead of their value 0, for clarity. Use nl_send_auto() instead of nl_send_auto_complete(), which is deprecated. Factor out some common nlmsg_free() calls. Use nl_send_sync() in add_mon_if() and del_mon_if(), which combines nl_send_auto() and nl_wait_for_ack() when using the standard callback. Signed-off-by: John Thacker <johnthacker@gmail.com>

The function does not deallocate the struct nl_cb, no matter if it exits normally (wlan1) or due to an error (wlan0). To address the former code path, add a missing nl_cb_put() call and make it as early as possible; to address the latter code path, move the nl_cb_alloc() call to make it as late as possible. Before: # valgrind --quiet --leak-check=full ./testprogs/opentest -i wlan0 -I opentest: wlan0: Generic error (SIOCGIFINDEX: No such device) ==1748== 224 bytes in 1 blocks are definitely lost in loss record 2 of 3 ==1748== at 0x4889F94: calloc (vg_replace_malloc.c:1328) ==1748== by 0x48FF9BB: nl_cb_alloc (in /usr/lib/aarch64-linux-gnu/libnl-3.so.200.26.0) ==1748== by 0x12C6CB: get_if_type (pcap-linux.c:647) ==1748== by 0x12C6CB: enter_rfmon_mode (pcap-linux.c:4934) ==1748== by 0x12D157: setup_socket (pcap-linux.c:2505) ==1748== by 0x12D157: pcap_activate_linux (pcap-linux.c:1249) ==1748== by 0x110E2F: pcap_activate (pcap.c:2708) ==1748== by 0x10DF07: main (opentest.c:174) ==1748== # valgrind --quiet --leak-check=full ./testprogs/opentest -i wlan1 -I wlan1 opened successfully ==1754== 224 bytes in 1 blocks are definitely lost in loss record 1 of 1 ==1754== at 0x4889F94: calloc (vg_replace_malloc.c:1328) ==1754== by 0x48FF9BB: nl_cb_alloc (in /usr/lib/aarch64-linux-gnu/libnl-3.so.200.26.0) ==1754== by 0x12C6CB: get_if_type (pcap-linux.c:647) ==1754== by 0x12C6CB: enter_rfmon_mode (pcap-linux.c:4934) ==1754== by 0x12D157: setup_socket (pcap-linux.c:2505) ==1754== by 0x12D157: pcap_activate_linux (pcap-linux.c:1249) ==1754== by 0x110E2F: pcap_activate (pcap.c:2708) ==1754== by 0x10DF07: main (opentest.c:174) ==1754== After: # valgrind --quiet --leak-check=full ./testprogs/opentest -i wlan0 -I opentest: wlan0: Generic error (SIOCGIFINDEX: No such device) # valgrind --quiet --leak-check=full ./testprogs/opentest -i wlan1 -I wlan1 opened successfully Signed-off-by: John Thacker <johnthacker@gmail.com>

get_if_type and add_mon_if share the same generic netlink socket. By default ACKs are sent for every transaction. If get_if_type does not wait for the ACK, then add_mon_if will receive the ACK and wrongly interpret its NL80211_CMD_NEW_INTERFACE to add the monitor device as a success even if it failed. This can cause libpcap to assume that a monitor device was added even when it failed, later causing a ENODEV error when trying to open the non-existent device. Before: # ./testprogs/opentest -i wlan0 -I opentest: wlan0: Generic error (SIOCGIFINDEX: No such device) After: # ./testprogs/opentest -i wlan0 -I opentest: wlan0: Generic error (wlan0: nl_send_sync failed adding mon0 interface: Operation not supported) Signed-off-by: John Thacker <johnthacker@gmail.com>

If Netlink reports that an interface is a mac80211 device (not in monitor mode), but then reports NLE_OPNOTSUPP when trying to add that interface as a monitor mode device, report that as the PCAP_ERROR_RFMON_NOTSUP so the libpcap error message is used instead of the generic Netlink error message. For other error types, continue to report a generic error. Before: # ./testprogs/opentest -i wlan0 -I opentest: wlan0: Generic error (wlan0: nl_send_sync failed adding mon0 interface: Operation not supported) After: # ./testprogs/opentest -i wlan0 -I opentest: wlan0: That device doesn't support monitor mode (That device doesn't support monitor mode) Signed-off-by: John Thacker <johnthacker@gmail.com>

johnthacker · 2025-05-24T01:46:45Z

Thanks, I used your suggestion and added some details about why we must wait for the ACK when getting the interface type.

johnthacker force-pushed the nl80211_no_mon branch from e4b5d21 to 6b4510e Compare May 18, 2025 14:32

infrastation reviewed May 18, 2025

View reviewed changes

infrastation self-assigned this May 19, 2025

infrastation added the linux label May 19, 2025

johnthacker added 2 commits May 23, 2025 21:28

johnthacker force-pushed the nl80211_no_mon branch from ad41782 to 8d1f47b Compare May 24, 2025 01:35

johnthacker added 2 commits May 23, 2025 21:44

johnthacker force-pushed the nl80211_no_mon branch from 8d1f47b to 186f908 Compare May 24, 2025 01:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Linux: Always request and wait for an ACK from nl80211 #1510

Linux: Always request and wait for an ACK from nl80211 #1510

johnthacker commented Apr 26, 2025

Uh oh!

infrastation commented May 17, 2025

Uh oh!

johnthacker commented May 18, 2025 •

edited

Loading

Uh oh!

johnthacker commented May 18, 2025

Uh oh!

johnthacker commented May 18, 2025

Uh oh!

infrastation commented May 18, 2025

Uh oh!

johnthacker commented May 18, 2025

Uh oh!

infrastation May 18, 2025

Uh oh!

johnthacker May 18, 2025

Uh oh!

infrastation commented May 18, 2025

Uh oh!

johnthacker commented May 18, 2025

Uh oh!

infrastation commented May 18, 2025

Uh oh!

infrastation commented May 23, 2025 •

edited

Loading

Uh oh!

johnthacker commented May 24, 2025

Uh oh!

Uh oh!

		@@ -754,49 +754,73 @@ DIAG_ON_NARROWING

		err = nl_send_auto_complete(state->nl_sock, msg);

Linux: Always request and wait for an ACK from nl80211 #1510

Are you sure you want to change the base?

Linux: Always request and wait for an ACK from nl80211 #1510

Conversation

johnthacker commented Apr 26, 2025

Uh oh!

infrastation commented May 17, 2025

Uh oh!

johnthacker commented May 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

johnthacker commented May 18, 2025

Uh oh!

johnthacker commented May 18, 2025

Uh oh!

infrastation commented May 18, 2025

Uh oh!

johnthacker commented May 18, 2025

Uh oh!

infrastation May 18, 2025

Choose a reason for hiding this comment

Uh oh!

johnthacker May 18, 2025

Choose a reason for hiding this comment

Uh oh!

infrastation commented May 18, 2025

Uh oh!

johnthacker commented May 18, 2025

Uh oh!

infrastation commented May 18, 2025

Uh oh!

infrastation commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

johnthacker commented May 24, 2025

Uh oh!

Uh oh!

johnthacker commented May 18, 2025 •

edited

Loading

infrastation commented May 23, 2025 •

edited

Loading