linux-stable/net/sctp/ulpevent.c

1095 lines
30 KiB
C
Raw Normal View History

/* SCTP kernel implementation
* (C) Copyright IBM Corp. 2001, 2004
* Copyright (c) 1999-2000 Cisco, Inc.
* Copyright (c) 1999-2001 Motorola, Inc.
* Copyright (c) 2001 Intel Corp.
* Copyright (c) 2001 Nokia, Inc.
* Copyright (c) 2001 La Monte H.P. Yarroll
*
* These functions manipulate an sctp event. The struct ulpevent is used
* to carry notifications and data to the ULP (sockets).
*
* This SCTP implementation is free software;
* you can redistribute it and/or modify it under the terms of
* the GNU General Public License as published by
* the Free Software Foundation; either version 2, or (at your option)
* any later version.
*
* This SCTP implementation is distributed in the hope that it
* will be useful, but WITHOUT ANY WARRANTY; without even the implied
* ************************
* warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
* See the GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with GNU CC; see the file COPYING. If not, see
* <http://www.gnu.org/licenses/>.
*
* Please send any bug reports or fixes you make to the
* email address(es):
* lksctp developers <linux-sctp@vger.kernel.org>
*
* Written or modified by:
* Jon Grimm <jgrimm@us.ibm.com>
* La Monte H.P. Yarroll <piggy@acm.org>
* Ardelle Fan <ardelle.fan@intel.com>
* Sridhar Samudrala <sri@us.ibm.com>
*/
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/skbuff.h>
#include <net/sctp/structs.h>
#include <net/sctp/sctp.h>
#include <net/sctp/sm.h>
static void sctp_ulpevent_receive_data(struct sctp_ulpevent *event,
struct sctp_association *asoc);
static void sctp_ulpevent_release_data(struct sctp_ulpevent *event);
static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event);
/* Initialize an ULP event from an given skb. */
static void sctp_ulpevent_init(struct sctp_ulpevent *event,
int msg_flags,
unsigned int len)
{
memset(event, 0, sizeof(struct sctp_ulpevent));
event->msg_flags = msg_flags;
event->rmem_len = len;
}
/* Create a new sctp_ulpevent. */
static struct sctp_ulpevent *sctp_ulpevent_new(int size, int msg_flags,
gfp_t gfp)
{
struct sctp_ulpevent *event;
struct sk_buff *skb;
skb = alloc_skb(size, gfp);
if (!skb)
goto fail;
event = sctp_skb2event(skb);
sctp_ulpevent_init(event, msg_flags, skb->truesize);
return event;
fail:
return NULL;
}
/* Is this a MSG_NOTIFICATION? */
int sctp_ulpevent_is_notification(const struct sctp_ulpevent *event)
{
return MSG_NOTIFICATION == (event->msg_flags & MSG_NOTIFICATION);
}
/* Hold the association in case the msg_name needs read out of
* the association.
*/
static inline void sctp_ulpevent_set_owner(struct sctp_ulpevent *event,
const struct sctp_association *asoc)
{
struct sk_buff *skb;
/* Cast away the const, as we are just wanting to
* bump the reference count.
*/
sctp_association_hold((struct sctp_association *)asoc);
skb = sctp_event2skb(event);
event->asoc = (struct sctp_association *)asoc;
atomic_add(event->rmem_len, &event->asoc->rmem_alloc);
sctp_skb_set_owner_r(skb, asoc->base.sk);
}
/* A simple destructor to give up the reference to the association. */
static inline void sctp_ulpevent_release_owner(struct sctp_ulpevent *event)
{
struct sctp_association *asoc = event->asoc;
atomic_sub(event->rmem_len, &asoc->rmem_alloc);
sctp_association_put(asoc);
}
/* Create and initialize an SCTP_ASSOC_CHANGE event.
*
* 5.3.1.1 SCTP_ASSOC_CHANGE
*
* Communication notifications inform the ULP that an SCTP association
* has either begun or ended. The identifier for a new association is
* provided by this notification.
*
* Note: There is no field checking here. If a field is unused it will be
* zero'd out.
*/
struct sctp_ulpevent *sctp_ulpevent_make_assoc_change(
const struct sctp_association *asoc,
__u16 flags, __u16 state, __u16 error, __u16 outbound,
__u16 inbound, struct sctp_chunk *chunk, gfp_t gfp)
{
struct sctp_ulpevent *event;
struct sctp_assoc_change *sac;
struct sk_buff *skb;
/* If the lower layer passed in the chunk, it will be
* an ABORT, so we need to include it in the sac_info.
*/
if (chunk) {
/* Copy the chunk data to a new skb and reserve enough
* head room to use as notification.
*/
skb = skb_copy_expand(chunk->skb,
sizeof(struct sctp_assoc_change), 0, gfp);
if (!skb)
goto fail;
/* Embed the event fields inside the cloned skb. */
event = sctp_skb2event(skb);
sctp_ulpevent_init(event, MSG_NOTIFICATION, skb->truesize);
/* Include the notification structure */
sac = (struct sctp_assoc_change *)
skb_push(skb, sizeof(struct sctp_assoc_change));
/* Trim the buffer to the right length. */
skb_trim(skb, sizeof(struct sctp_assoc_change) +
ntohs(chunk->chunk_hdr->length) -
sizeof(sctp_chunkhdr_t));
} else {
event = sctp_ulpevent_new(sizeof(struct sctp_assoc_change),
MSG_NOTIFICATION, gfp);
if (!event)
goto fail;
skb = sctp_event2skb(event);
sac = (struct sctp_assoc_change *) skb_put(skb,
sizeof(struct sctp_assoc_change));
}
/* Socket Extensions for SCTP
* 5.3.1.1 SCTP_ASSOC_CHANGE
*
* sac_type:
* It should be SCTP_ASSOC_CHANGE.
*/
sac->sac_type = SCTP_ASSOC_CHANGE;
/* Socket Extensions for SCTP
* 5.3.1.1 SCTP_ASSOC_CHANGE
*
* sac_state: 32 bits (signed integer)
* This field holds one of a number of values that communicate the
* event that happened to the association.
*/
sac->sac_state = state;
/* Socket Extensions for SCTP
* 5.3.1.1 SCTP_ASSOC_CHANGE
*
* sac_flags: 16 bits (unsigned integer)
* Currently unused.
*/
sac->sac_flags = 0;
/* Socket Extensions for SCTP
* 5.3.1.1 SCTP_ASSOC_CHANGE
*
* sac_length: sizeof (__u32)
* This field is the total length of the notification data, including
* the notification header.
*/
sac->sac_length = skb->len;
/* Socket Extensions for SCTP
* 5.3.1.1 SCTP_ASSOC_CHANGE
*
* sac_error: 32 bits (signed integer)
*
* If the state was reached due to a error condition (e.g.
* COMMUNICATION_LOST) any relevant error information is available in
* this field. This corresponds to the protocol error codes defined in
* [SCTP].
*/
sac->sac_error = error;
/* Socket Extensions for SCTP
* 5.3.1.1 SCTP_ASSOC_CHANGE
*
* sac_outbound_streams: 16 bits (unsigned integer)
* sac_inbound_streams: 16 bits (unsigned integer)
*
* The maximum number of streams allowed in each direction are
* available in sac_outbound_streams and sac_inbound streams.
*/
sac->sac_outbound_streams = outbound;
sac->sac_inbound_streams = inbound;
/* Socket Extensions for SCTP
* 5.3.1.1 SCTP_ASSOC_CHANGE
*
* sac_assoc_id: sizeof (sctp_assoc_t)
*
* The association id field, holds the identifier for the association.
* All notifications for a given association have the same association
* identifier. For TCP style socket, this field is ignored.
*/
sctp_ulpevent_set_owner(event, asoc);
sac->sac_assoc_id = sctp_assoc2id(asoc);
return event;
fail:
return NULL;
}
/* Create and initialize an SCTP_PEER_ADDR_CHANGE event.
*
* Socket Extensions for SCTP - draft-01
* 5.3.1.2 SCTP_PEER_ADDR_CHANGE
*
* When a destination address on a multi-homed peer encounters a change
* an interface details event is sent.
*/
struct sctp_ulpevent *sctp_ulpevent_make_peer_addr_change(
const struct sctp_association *asoc,
const struct sockaddr_storage *aaddr,
int flags, int state, int error, gfp_t gfp)
{
struct sctp_ulpevent *event;
struct sctp_paddr_change *spc;
struct sk_buff *skb;
event = sctp_ulpevent_new(sizeof(struct sctp_paddr_change),
MSG_NOTIFICATION, gfp);
if (!event)
goto fail;
skb = sctp_event2skb(event);
spc = (struct sctp_paddr_change *)
skb_put(skb, sizeof(struct sctp_paddr_change));
/* Sockets API Extensions for SCTP
* Section 5.3.1.2 SCTP_PEER_ADDR_CHANGE
*
* spc_type:
*
* It should be SCTP_PEER_ADDR_CHANGE.
*/
spc->spc_type = SCTP_PEER_ADDR_CHANGE;
/* Sockets API Extensions for SCTP
* Section 5.3.1.2 SCTP_PEER_ADDR_CHANGE
*
* spc_length: sizeof (__u32)
*
* This field is the total length of the notification data, including
* the notification header.
*/
spc->spc_length = sizeof(struct sctp_paddr_change);
/* Sockets API Extensions for SCTP
* Section 5.3.1.2 SCTP_PEER_ADDR_CHANGE
*
* spc_flags: 16 bits (unsigned integer)
* Currently unused.
*/
spc->spc_flags = 0;
/* Sockets API Extensions for SCTP
* Section 5.3.1.2 SCTP_PEER_ADDR_CHANGE
*
* spc_state: 32 bits (signed integer)
*
* This field holds one of a number of values that communicate the
* event that happened to the address.
*/
spc->spc_state = state;
/* Sockets API Extensions for SCTP
* Section 5.3.1.2 SCTP_PEER_ADDR_CHANGE
*
* spc_error: 32 bits (signed integer)
*
* If the state was reached due to any error condition (e.g.
* ADDRESS_UNREACHABLE) any relevant error information is available in
* this field.
*/
spc->spc_error = error;
/* Socket Extensions for SCTP
* 5.3.1.1 SCTP_ASSOC_CHANGE
*
* spc_assoc_id: sizeof (sctp_assoc_t)
*
* The association id field, holds the identifier for the association.
* All notifications for a given association have the same association
* identifier. For TCP style socket, this field is ignored.
*/
sctp_ulpevent_set_owner(event, asoc);
spc->spc_assoc_id = sctp_assoc2id(asoc);
/* Sockets API Extensions for SCTP
* Section 5.3.1.2 SCTP_PEER_ADDR_CHANGE
*
* spc_aaddr: sizeof (struct sockaddr_storage)
*
* The affected address field, holds the remote peer's address that is
* encountering the change of state.
*/
memcpy(&spc->spc_aaddr, aaddr, sizeof(struct sockaddr_storage));
/* Map ipv4 address into v4-mapped-on-v6 address. */
sctp_get_pf_specific(asoc->base.sk->sk_family)->addr_v4map(
sctp_sk(asoc->base.sk),
(union sctp_addr *)&spc->spc_aaddr);
return event;
fail:
return NULL;
}
/* Create and initialize an SCTP_REMOTE_ERROR notification.
*
* Note: This assumes that the chunk->skb->data already points to the
* operation error payload.
*
* Socket Extensions for SCTP - draft-01
* 5.3.1.3 SCTP_REMOTE_ERROR
*
* A remote peer may send an Operational Error message to its peer.
* This message indicates a variety of error conditions on an
* association. The entire error TLV as it appears on the wire is
* included in a SCTP_REMOTE_ERROR event. Please refer to the SCTP
* specification [SCTP] and any extensions for a list of possible
* error formats.
*/
struct sctp_ulpevent *sctp_ulpevent_make_remote_error(
const struct sctp_association *asoc, struct sctp_chunk *chunk,
__u16 flags, gfp_t gfp)
{
struct sctp_ulpevent *event;
struct sctp_remote_error *sre;
struct sk_buff *skb;
sctp_errhdr_t *ch;
__be16 cause;
int elen;
ch = (sctp_errhdr_t *)(chunk->skb->data);
cause = ch->cause;
elen = WORD_ROUND(ntohs(ch->length)) - sizeof(sctp_errhdr_t);
/* Pull off the ERROR header. */
skb_pull(chunk->skb, sizeof(sctp_errhdr_t));
/* Copy the skb to a new skb with room for us to prepend
* notification with.
*/
skb = skb_copy_expand(chunk->skb, sizeof(struct sctp_remote_error),
0, gfp);
/* Pull off the rest of the cause TLV from the chunk. */
skb_pull(chunk->skb, elen);
if (!skb)
goto fail;
/* Embed the event fields inside the cloned skb. */
event = sctp_skb2event(skb);
sctp_ulpevent_init(event, MSG_NOTIFICATION, skb->truesize);
sre = (struct sctp_remote_error *)
skb_push(skb, sizeof(struct sctp_remote_error));
/* Trim the buffer to the right length. */
skb_trim(skb, sizeof(struct sctp_remote_error) + elen);
/* Socket Extensions for SCTP
* 5.3.1.3 SCTP_REMOTE_ERROR
*
* sre_type:
* It should be SCTP_REMOTE_ERROR.
*/
sre->sre_type = SCTP_REMOTE_ERROR;
/*
* Socket Extensions for SCTP
* 5.3.1.3 SCTP_REMOTE_ERROR
*
* sre_flags: 16 bits (unsigned integer)
* Currently unused.
*/
sre->sre_flags = 0;
/* Socket Extensions for SCTP
* 5.3.1.3 SCTP_REMOTE_ERROR
*
* sre_length: sizeof (__u32)
*
* This field is the total length of the notification data,
* including the notification header.
*/
sre->sre_length = skb->len;
/* Socket Extensions for SCTP
* 5.3.1.3 SCTP_REMOTE_ERROR
*
* sre_error: 16 bits (unsigned integer)
* This value represents one of the Operational Error causes defined in
* the SCTP specification, in network byte order.
*/
sre->sre_error = cause;
/* Socket Extensions for SCTP
* 5.3.1.3 SCTP_REMOTE_ERROR
*
* sre_assoc_id: sizeof (sctp_assoc_t)
*
* The association id field, holds the identifier for the association.
* All notifications for a given association have the same association
* identifier. For TCP style socket, this field is ignored.
*/
sctp_ulpevent_set_owner(event, asoc);
sre->sre_assoc_id = sctp_assoc2id(asoc);
return event;
fail:
return NULL;
}
/* Create and initialize a SCTP_SEND_FAILED notification.
*
* Socket Extensions for SCTP - draft-01
* 5.3.1.4 SCTP_SEND_FAILED
*/
struct sctp_ulpevent *sctp_ulpevent_make_send_failed(
const struct sctp_association *asoc, struct sctp_chunk *chunk,
__u16 flags, __u32 error, gfp_t gfp)
{
struct sctp_ulpevent *event;
struct sctp_send_failed *ssf;
struct sk_buff *skb;
/* Pull off any padding. */
int len = ntohs(chunk->chunk_hdr->length);
/* Make skb with more room so we can prepend notification. */
skb = skb_copy_expand(chunk->skb,
sizeof(struct sctp_send_failed), /* headroom */
0, /* tailroom */
gfp);
if (!skb)
goto fail;
/* Pull off the common chunk header and DATA header. */
skb_pull(skb, sizeof(struct sctp_data_chunk));
len -= sizeof(struct sctp_data_chunk);
/* Embed the event fields inside the cloned skb. */
event = sctp_skb2event(skb);
sctp_ulpevent_init(event, MSG_NOTIFICATION, skb->truesize);
ssf = (struct sctp_send_failed *)
skb_push(skb, sizeof(struct sctp_send_failed));
/* Socket Extensions for SCTP
* 5.3.1.4 SCTP_SEND_FAILED
*
* ssf_type:
* It should be SCTP_SEND_FAILED.
*/
ssf->ssf_type = SCTP_SEND_FAILED;
/* Socket Extensions for SCTP
* 5.3.1.4 SCTP_SEND_FAILED
*
* ssf_flags: 16 bits (unsigned integer)
* The flag value will take one of the following values
*
* SCTP_DATA_UNSENT - Indicates that the data was never put on
* the wire.
*
* SCTP_DATA_SENT - Indicates that the data was put on the wire.
* Note that this does not necessarily mean that the
* data was (or was not) successfully delivered.
*/
ssf->ssf_flags = flags;
/* Socket Extensions for SCTP
* 5.3.1.4 SCTP_SEND_FAILED
*
* ssf_length: sizeof (__u32)
* This field is the total length of the notification data, including
* the notification header.
*/
ssf->ssf_length = sizeof(struct sctp_send_failed) + len;
skb_trim(skb, ssf->ssf_length);
/* Socket Extensions for SCTP
* 5.3.1.4 SCTP_SEND_FAILED
*
* ssf_error: 16 bits (unsigned integer)
* This value represents the reason why the send failed, and if set,
* will be a SCTP protocol error code as defined in [SCTP] section
* 3.3.10.
*/
ssf->ssf_error = error;
/* Socket Extensions for SCTP
* 5.3.1.4 SCTP_SEND_FAILED
*
* ssf_info: sizeof (struct sctp_sndrcvinfo)
* The original send information associated with the undelivered
* message.
*/
memcpy(&ssf->ssf_info, &chunk->sinfo, sizeof(struct sctp_sndrcvinfo));
/* Per TSVWG discussion with Randy. Allow the application to
* reassemble a fragmented message.
*/
ssf->ssf_info.sinfo_flags = chunk->chunk_hdr->flags;
/* Socket Extensions for SCTP
* 5.3.1.4 SCTP_SEND_FAILED
*
* ssf_assoc_id: sizeof (sctp_assoc_t)
* The association id field, sf_assoc_id, holds the identifier for the
* association. All notifications for a given association have the
* same association identifier. For TCP style socket, this field is
* ignored.
*/
sctp_ulpevent_set_owner(event, asoc);
ssf->ssf_assoc_id = sctp_assoc2id(asoc);
return event;
fail:
return NULL;
}
/* Create and initialize a SCTP_SHUTDOWN_EVENT notification.
*
* Socket Extensions for SCTP - draft-01
* 5.3.1.5 SCTP_SHUTDOWN_EVENT
*/
struct sctp_ulpevent *sctp_ulpevent_make_shutdown_event(
const struct sctp_association *asoc,
__u16 flags, gfp_t gfp)
{
struct sctp_ulpevent *event;
struct sctp_shutdown_event *sse;
struct sk_buff *skb;
event = sctp_ulpevent_new(sizeof(struct sctp_shutdown_event),
MSG_NOTIFICATION, gfp);
if (!event)
goto fail;
skb = sctp_event2skb(event);
sse = (struct sctp_shutdown_event *)
skb_put(skb, sizeof(struct sctp_shutdown_event));
/* Socket Extensions for SCTP
* 5.3.1.5 SCTP_SHUTDOWN_EVENT
*
* sse_type
* It should be SCTP_SHUTDOWN_EVENT
*/
sse->sse_type = SCTP_SHUTDOWN_EVENT;
/* Socket Extensions for SCTP
* 5.3.1.5 SCTP_SHUTDOWN_EVENT
*
* sse_flags: 16 bits (unsigned integer)
* Currently unused.
*/
sse->sse_flags = 0;
/* Socket Extensions for SCTP
* 5.3.1.5 SCTP_SHUTDOWN_EVENT
*
* sse_length: sizeof (__u32)
* This field is the total length of the notification data, including
* the notification header.
*/
sse->sse_length = sizeof(struct sctp_shutdown_event);
/* Socket Extensions for SCTP
* 5.3.1.5 SCTP_SHUTDOWN_EVENT
*
* sse_assoc_id: sizeof (sctp_assoc_t)
* The association id field, holds the identifier for the association.
* All notifications for a given association have the same association
* identifier. For TCP style socket, this field is ignored.
*/
sctp_ulpevent_set_owner(event, asoc);
sse->sse_assoc_id = sctp_assoc2id(asoc);
return event;
fail:
return NULL;
}
/* Create and initialize a SCTP_ADAPTATION_INDICATION notification.
*
* Socket Extensions for SCTP
* 5.3.1.6 SCTP_ADAPTATION_INDICATION
*/
struct sctp_ulpevent *sctp_ulpevent_make_adaptation_indication(
const struct sctp_association *asoc, gfp_t gfp)
{
struct sctp_ulpevent *event;
struct sctp_adaptation_event *sai;
struct sk_buff *skb;
event = sctp_ulpevent_new(sizeof(struct sctp_adaptation_event),
MSG_NOTIFICATION, gfp);
if (!event)
goto fail;
skb = sctp_event2skb(event);
sai = (struct sctp_adaptation_event *)
skb_put(skb, sizeof(struct sctp_adaptation_event));
sai->sai_type = SCTP_ADAPTATION_INDICATION;
sai->sai_flags = 0;
sai->sai_length = sizeof(struct sctp_adaptation_event);
sai->sai_adaptation_ind = asoc->peer.adaptation_ind;
sctp_ulpevent_set_owner(event, asoc);
sai->sai_assoc_id = sctp_assoc2id(asoc);
return event;
fail:
return NULL;
}
/* A message has been received. Package this message as a notification
* to pass it to the upper layers. Go ahead and calculate the sndrcvinfo
* even if filtered out later.
*
* Socket Extensions for SCTP
* 5.2.2 SCTP Header Information Structure (SCTP_SNDRCV)
*/
struct sctp_ulpevent *sctp_ulpevent_make_rcvmsg(struct sctp_association *asoc,
struct sctp_chunk *chunk,
gfp_t gfp)
{
struct sctp_ulpevent *event = NULL;
struct sk_buff *skb;
size_t padding, len;
int rx_count;
/*
* check to see if we need to make space for this
* new skb, expand the rcvbuffer if needed, or drop
* the frame
*/
if (asoc->ep->rcvbuf_policy)
rx_count = atomic_read(&asoc->rmem_alloc);
else
rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
if (rx_count >= asoc->base.sk->sk_rcvbuf) {
if ((asoc->base.sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
netvm: prevent a stream-specific deadlock This patch series is based on top of "Swap-over-NBD without deadlocking v15" as it depends on the same reservation of PF_MEMALLOC reserves logic. When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. In diskless systems this is not an option so if swap if required then swapping over the network is considered. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap but this is not always an option. There is no guarantee that the network attached storage (NAS) device is running Linux or supports NBD. However, it is likely that it supports NFS so there are users that want support for swapping over NFS despite any performance concern. Some distributions currently carry patches that support swapping over NFS but it would be preferable to support it in the mainline kernel. Patch 1 avoids a stream-specific deadlock that potentially affects TCP. Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC reserves. Patch 3 adds three helpers for filesystems to handle swap cache pages. For example, page_file_mapping() returns page->mapping for file-backed pages and the address_space of the underlying swap file for swap cache pages. Patch 4 adds two address_space_operations to allow a filesystem to pin all metadata relevant to a swapfile in memory. Upon successful activation, the swapfile is marked SWP_FILE and the address space operation ->direct_IO is used for writing and ->readpage for reading in swap pages. Patch 5 notes that patch 3 is bolting filesystem-specific-swapfile-support onto the side and that the default handlers have different information to what is available to the filesystem. This patch refactors the code so that there are generic handlers for each of the new address_space operations. Patch 6 adds an API to allow a vector of kernel addresses to be translated to struct pages and pinned for IO. Patch 7 adds support for using highmem pages for swap by kmapping the pages before calling the direct_IO handler. Patch 8 updates NFS to use the helpers from patch 3 where necessary. Patch 9 avoids setting PF_private on PG_swapcache pages within NFS. Patch 10 implements the new swapfile-related address_space operations for NFS and teaches the direct IO handler how to manage kernel addresses. Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO where appropriate. Patch 12 fixes a NULL pointer dereference that occurs when using swap-over-NFS. With the patches applied, it is possible to mount a swapfile that is on an NFS filesystem. Swap performance is not great with a swap stress test taking roughly twice as long to complete than if the swap device was backed by NBD. This patch: netvm: prevent a stream-specific deadlock It could happen that all !SOCK_MEMALLOC sockets have buffered so much data that we're over the global rmem limit. This will prevent SOCK_MEMALLOC buffers from receiving data, which will prevent userspace from running, which is needed to reduce the buffered data. Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once this change it applied, it is important that sockets that set SOCK_MEMALLOC do not clear the flag until the socket is being torn down. If this happens, a warning is generated and the tokens reclaimed to avoid accounting errors until the bug is fixed. [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Rik van Riel <riel@redhat.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 16:44:41 -07:00
(!sk_rmem_schedule(asoc->base.sk, chunk->skb,
chunk->skb->truesize)))
goto fail;
}
/* Clone the original skb, sharing the data. */
skb = skb_clone(chunk->skb, gfp);
if (!skb)
goto fail;
/* Now that all memory allocations for this chunk succeeded, we
* can mark it as received so the tsn_map is updated correctly.
*/
if (sctp_tsnmap_mark(&asoc->peer.tsn_map,
sctp: be more restrictive in transport selection on bundled sacks It was noticed recently that when we send data on a transport, its possible that we might bundle a sack that arrived on a different transport. While this isn't a major problem, it does go against the SHOULD requirement in section 6.4 of RFC 2960: An endpoint SHOULD transmit reply chunks (e.g., SACK, HEARTBEAT ACK, etc.) to the same destination transport address from which it received the DATA or control chunk to which it is replying. This rule should also be followed if the endpoint is bundling DATA chunks together with the reply chunk. This patch seeks to correct that. It restricts the bundling of sack operations to only those transports which have moved the ctsn of the association forward since the last sack. By doing this we guarantee that we only bundle outbound saks on a transport that has received a chunk since the last sack. This brings us into stricter compliance with the RFC. Vlad had initially suggested that we strictly allow only sack bundling on the transport that last moved the ctsn forward. While this makes sense, I was concerned that doing so prevented us from bundling in the case where we had received chunks that moved the ctsn on multiple transports. In those cases, the RFC allows us to select any of the transports having received chunks to bundle the sack on. so I've modified the approach to allow for that, by adding a state variable to each transport that tracks weather it has moved the ctsn since the last sack. This I think keeps our behavior (and performance), close enough to our current profile that I think we can do this without a sysctl knob to enable/disable it. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Vlad Yaseivch <vyasevich@gmail.com> CC: David S. Miller <davem@davemloft.net> CC: linux-sctp@vger.kernel.org Reported-by: Michele Baldessari <michele@redhat.com> Reported-by: sorin serban <sserban@redhat.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-30 03:04:26 +00:00
ntohl(chunk->subh.data_hdr->tsn),
chunk->transport))
goto fail_mark;
/* First calculate the padding, so we don't inadvertently
* pass up the wrong length to the user.
*
* RFC 2960 - Section 3.2 Chunk Field Descriptions
*
* The total length of a chunk(including Type, Length and Value fields)
* MUST be a multiple of 4 bytes. If the length of the chunk is not a
* multiple of 4 bytes, the sender MUST pad the chunk with all zero
* bytes and this padding is not included in the chunk length field.
* The sender should never pad with more than 3 bytes. The receiver
* MUST ignore the padding bytes.
*/
len = ntohs(chunk->chunk_hdr->length);
padding = WORD_ROUND(len) - len;
/* Fixup cloned skb with just this chunks data. */
skb_trim(skb, chunk->chunk_end - padding - skb->data);
/* Embed the event fields inside the cloned skb. */
event = sctp_skb2event(skb);
/* Initialize event with flags 0 and correct length
* Since this is a clone of the original skb, only account for
* the data of this chunk as other chunks will be accounted separately.
*/
sctp_ulpevent_init(event, 0, skb->len + sizeof(struct sk_buff));
sctp_ulpevent_receive_data(event, asoc);
event->stream = ntohs(chunk->subh.data_hdr->stream);
event->ssn = ntohs(chunk->subh.data_hdr->ssn);
event->ppid = chunk->subh.data_hdr->ppid;
if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED) {
event->flags |= SCTP_UNORDERED;
event->cumtsn = sctp_tsnmap_get_ctsn(&asoc->peer.tsn_map);
}
event->tsn = ntohl(chunk->subh.data_hdr->tsn);
event->msg_flags |= chunk->chunk_hdr->flags;
event->iif = sctp_chunk_iif(chunk);
return event;
fail_mark:
kfree_skb(skb);
fail:
return NULL;
}
/* Create a partial delivery related event.
*
* 5.3.1.7 SCTP_PARTIAL_DELIVERY_EVENT
*
* When a receiver is engaged in a partial delivery of a
* message this notification will be used to indicate
* various events.
*/
struct sctp_ulpevent *sctp_ulpevent_make_pdapi(
const struct sctp_association *asoc, __u32 indication,
gfp_t gfp)
{
struct sctp_ulpevent *event;
struct sctp_pdapi_event *pd;
struct sk_buff *skb;
event = sctp_ulpevent_new(sizeof(struct sctp_pdapi_event),
MSG_NOTIFICATION, gfp);
if (!event)
goto fail;
skb = sctp_event2skb(event);
pd = (struct sctp_pdapi_event *)
skb_put(skb, sizeof(struct sctp_pdapi_event));
/* pdapi_type
* It should be SCTP_PARTIAL_DELIVERY_EVENT
*
* pdapi_flags: 16 bits (unsigned integer)
* Currently unused.
*/
pd->pdapi_type = SCTP_PARTIAL_DELIVERY_EVENT;
pd->pdapi_flags = 0;
/* pdapi_length: 32 bits (unsigned integer)
*
* This field is the total length of the notification data, including
* the notification header. It will generally be sizeof (struct
* sctp_pdapi_event).
*/
pd->pdapi_length = sizeof(struct sctp_pdapi_event);
/* pdapi_indication: 32 bits (unsigned integer)
*
* This field holds the indication being sent to the application.
*/
pd->pdapi_indication = indication;
/* pdapi_assoc_id: sizeof (sctp_assoc_t)
*
* The association id field, holds the identifier for the association.
*/
sctp_ulpevent_set_owner(event, asoc);
pd->pdapi_assoc_id = sctp_assoc2id(asoc);
return event;
fail:
return NULL;
}
struct sctp_ulpevent *sctp_ulpevent_make_authkey(
const struct sctp_association *asoc, __u16 key_id,
__u32 indication, gfp_t gfp)
{
struct sctp_ulpevent *event;
struct sctp_authkey_event *ak;
struct sk_buff *skb;
event = sctp_ulpevent_new(sizeof(struct sctp_authkey_event),
MSG_NOTIFICATION, gfp);
if (!event)
goto fail;
skb = sctp_event2skb(event);
ak = (struct sctp_authkey_event *)
skb_put(skb, sizeof(struct sctp_authkey_event));
ak->auth_type = SCTP_AUTHENTICATION_EVENT;
ak->auth_flags = 0;
ak->auth_length = sizeof(struct sctp_authkey_event);
ak->auth_keynumber = key_id;
ak->auth_altkeynumber = 0;
ak->auth_indication = indication;
/*
* The association id field, holds the identifier for the association.
*/
sctp_ulpevent_set_owner(event, asoc);
ak->auth_assoc_id = sctp_assoc2id(asoc);
return event;
fail:
return NULL;
}
/*
* Socket Extensions for SCTP
* 6.3.10. SCTP_SENDER_DRY_EVENT
*/
struct sctp_ulpevent *sctp_ulpevent_make_sender_dry_event(
const struct sctp_association *asoc, gfp_t gfp)
{
struct sctp_ulpevent *event;
struct sctp_sender_dry_event *sdry;
struct sk_buff *skb;
event = sctp_ulpevent_new(sizeof(struct sctp_sender_dry_event),
MSG_NOTIFICATION, gfp);
if (!event)
return NULL;
skb = sctp_event2skb(event);
sdry = (struct sctp_sender_dry_event *)
skb_put(skb, sizeof(struct sctp_sender_dry_event));
sdry->sender_dry_type = SCTP_SENDER_DRY_EVENT;
sdry->sender_dry_flags = 0;
sdry->sender_dry_length = sizeof(struct sctp_sender_dry_event);
sctp_ulpevent_set_owner(event, asoc);
sdry->sender_dry_assoc_id = sctp_assoc2id(asoc);
return event;
}
/* Return the notification type, assuming this is a notification
* event.
*/
__u16 sctp_ulpevent_get_notification_type(const struct sctp_ulpevent *event)
{
union sctp_notification *notification;
struct sk_buff *skb;
skb = sctp_event2skb(event);
notification = (union sctp_notification *) skb->data;
return notification->sn_header.sn_type;
}
/* Copy out the sndrcvinfo into a msghdr. */
void sctp_ulpevent_read_sndrcvinfo(const struct sctp_ulpevent *event,
struct msghdr *msghdr)
{
struct sctp_sndrcvinfo sinfo;
if (sctp_ulpevent_is_notification(event))
return;
/* Sockets API Extensions for SCTP
* Section 5.2.2 SCTP Header Information Structure (SCTP_SNDRCV)
*
* sinfo_stream: 16 bits (unsigned integer)
*
* For recvmsg() the SCTP stack places the message's stream number in
* this value.
*/
sinfo.sinfo_stream = event->stream;
/* sinfo_ssn: 16 bits (unsigned integer)
*
* For recvmsg() this value contains the stream sequence number that
* the remote endpoint placed in the DATA chunk. For fragmented
* messages this is the same number for all deliveries of the message
* (if more than one recvmsg() is needed to read the message).
*/
sinfo.sinfo_ssn = event->ssn;
/* sinfo_ppid: 32 bits (unsigned integer)
*
* In recvmsg() this value is
* the same information that was passed by the upper layer in the peer
* application. Please note that byte order issues are NOT accounted
* for and this information is passed opaquely by the SCTP stack from
* one end to the other.
*/
sinfo.sinfo_ppid = event->ppid;
/* sinfo_flags: 16 bits (unsigned integer)
*
* This field may contain any of the following flags and is composed of
* a bitwise OR of these values.
*
* recvmsg() flags:
*
* SCTP_UNORDERED - This flag is present when the message was sent
* non-ordered.
*/
sinfo.sinfo_flags = event->flags;
/* sinfo_tsn: 32 bit (unsigned integer)
*
* For the receiving side, this field holds a TSN that was
* assigned to one of the SCTP Data Chunks.
*/
sinfo.sinfo_tsn = event->tsn;
/* sinfo_cumtsn: 32 bit (unsigned integer)
*
* This field will hold the current cumulative TSN as
* known by the underlying SCTP layer. Note this field is
* ignored when sending and only valid for a receive
* operation when sinfo_flags are set to SCTP_UNORDERED.
*/
sinfo.sinfo_cumtsn = event->cumtsn;
/* sinfo_assoc_id: sizeof (sctp_assoc_t)
*
* The association handle field, sinfo_assoc_id, holds the identifier
* for the association announced in the COMMUNICATION_UP notification.
* All notifications for a given association have the same identifier.
* Ignored for one-to-one style sockets.
*/
sinfo.sinfo_assoc_id = sctp_assoc2id(event->asoc);
/* context value that is set via SCTP_CONTEXT socket option. */
sinfo.sinfo_context = event->asoc->default_rcv_context;
/* These fields are not used while receiving. */
sinfo.sinfo_timetolive = 0;
put_cmsg(msghdr, IPPROTO_SCTP, SCTP_SNDRCV,
sizeof(struct sctp_sndrcvinfo), (void *)&sinfo);
}
/* Do accounting for bytes received and hold a reference to the association
* for each skb.
*/
static void sctp_ulpevent_receive_data(struct sctp_ulpevent *event,
struct sctp_association *asoc)
{
struct sk_buff *skb, *frag;
skb = sctp_event2skb(event);
/* Set the owner and charge rwnd for bytes received. */
sctp_ulpevent_set_owner(event, asoc);
Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer") as it introduced a serious performance regression on SCTP over IPv4 and IPv6, though a not as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs. Current state: [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60 iperf version 3.0.1 (10 January 2014) Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64 Time: Fri, 11 Apr 2014 17:56:21 GMT Connecting to host 192.168.241.3, port 5201 Cookie: Lab200slot2.1397238981.812898.548918 [ 4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201 Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test [ ID] Interval Transfer Bandwidth [ 4] 0.00-1.09 sec 20.8 MBytes 161 Mbits/sec [ 4] 1.09-2.13 sec 10.8 MBytes 86.8 Mbits/sec [ 4] 2.13-3.15 sec 3.57 MBytes 29.5 Mbits/sec [ 4] 3.15-4.16 sec 4.33 MBytes 35.7 Mbits/sec [ 4] 4.16-6.21 sec 10.4 MBytes 42.7 Mbits/sec [ 4] 6.21-6.21 sec 0.00 Bytes 0.00 bits/sec [ 4] 6.21-7.35 sec 34.6 MBytes 253 Mbits/sec [ 4] 7.35-11.45 sec 22.0 MBytes 45.0 Mbits/sec [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec [ 4] 11.45-12.51 sec 16.0 MBytes 126 Mbits/sec [ 4] 12.51-13.59 sec 20.3 MBytes 158 Mbits/sec [ 4] 13.59-14.65 sec 13.4 MBytes 107 Mbits/sec [ 4] 14.65-16.79 sec 33.3 MBytes 130 Mbits/sec [ 4] 16.79-16.79 sec 0.00 Bytes 0.00 bits/sec [ 4] 16.79-17.82 sec 5.94 MBytes 48.7 Mbits/sec (etc) [root@Lab200slot2 ~]# iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60 iperf version 3.0.1 (10 January 2014) Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64 Time: Fri, 11 Apr 2014 19:08:41 GMT Connecting to host 2001:db8:0:f101::1, port 5201 Cookie: Lab200slot2.1397243321.714295.2b3f7c [ 4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201 Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test [ ID] Interval Transfer Bandwidth [ 4] 0.00-1.00 sec 169 MBytes 1.42 Gbits/sec [ 4] 1.00-2.00 sec 201 MBytes 1.69 Gbits/sec [ 4] 2.00-3.00 sec 188 MBytes 1.58 Gbits/sec [ 4] 3.00-4.00 sec 174 MBytes 1.46 Gbits/sec [ 4] 4.00-5.00 sec 165 MBytes 1.39 Gbits/sec [ 4] 5.00-6.00 sec 199 MBytes 1.67 Gbits/sec [ 4] 6.00-7.00 sec 163 MBytes 1.36 Gbits/sec [ 4] 7.00-8.00 sec 174 MBytes 1.46 Gbits/sec [ 4] 8.00-9.00 sec 193 MBytes 1.62 Gbits/sec [ 4] 9.00-10.00 sec 196 MBytes 1.65 Gbits/sec [ 4] 10.00-11.00 sec 157 MBytes 1.31 Gbits/sec [ 4] 11.00-12.00 sec 175 MBytes 1.47 Gbits/sec [ 4] 12.00-13.00 sec 192 MBytes 1.61 Gbits/sec [ 4] 13.00-14.00 sec 199 MBytes 1.67 Gbits/sec (etc) After patch: [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60 iperf version 3.0.1 (10 January 2014) Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64 Time: Mon, 14 Apr 2014 16:40:48 GMT Connecting to host 192.168.240.3, port 5201 Cookie: Lab200slot2.1397493648.413274.65e131 [ 4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201 Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test [ ID] Interval Transfer Bandwidth [ 4] 0.00-1.00 sec 240 MBytes 2.02 Gbits/sec [ 4] 1.00-2.00 sec 239 MBytes 2.01 Gbits/sec [ 4] 2.00-3.00 sec 240 MBytes 2.01 Gbits/sec [ 4] 3.00-4.00 sec 239 MBytes 2.00 Gbits/sec [ 4] 4.00-5.00 sec 245 MBytes 2.05 Gbits/sec [ 4] 5.00-6.00 sec 240 MBytes 2.01 Gbits/sec [ 4] 6.00-7.00 sec 240 MBytes 2.02 Gbits/sec [ 4] 7.00-8.00 sec 239 MBytes 2.01 Gbits/sec With the reverted patch applied, the SCTP/IPv4 performance is back to normal on latest upstream for IPv4 and IPv6 and has same throughput as 3.4.2 test kernel, steady and interval reports are smooth again. Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer") Reported-by: Peter Butler <pbutler@sonusnet.com> Reported-by: Dongsheng Song <dongsheng.song@gmail.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Peter Butler <pbutler@sonusnet.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com> Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-14 21:45:17 +02:00
sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
if (!skb->data_len)
return;
/* Note: Not clearing the entire event struct as this is just a
* fragment of the real event. However, we still need to do rwnd
* accounting.
* In general, the skb passed from IP can have only 1 level of
* fragments. But we allow multiple levels of fragments.
*/
skb_walk_frags(skb, frag)
sctp_ulpevent_receive_data(sctp_skb2event(frag), asoc);
}
/* Do accounting for bytes just read by user and release the references to
* the association.
*/
static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
{
struct sk_buff *skb, *frag;
unsigned int len;
/* Current stack structures assume that the rcv buffer is
* per socket. For UDP style sockets this is not true as
* multiple associations may be on a single UDP-style socket.
* Use the local private area of the skb to track the owning
* association.
*/
skb = sctp_event2skb(event);
len = skb->len;
if (!skb->data_len)
goto done;
/* Don't forget the fragments. */
skb_walk_frags(skb, frag) {
/* NOTE: skb_shinfos are recursive. Although IP returns
* skb's with only 1 level of fragments, SCTP reassembly can
* increase the levels.
*/
sctp_ulpevent_release_frag_data(sctp_skb2event(frag));
}
done:
Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer") as it introduced a serious performance regression on SCTP over IPv4 and IPv6, though a not as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs. Current state: [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60 iperf version 3.0.1 (10 January 2014) Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64 Time: Fri, 11 Apr 2014 17:56:21 GMT Connecting to host 192.168.241.3, port 5201 Cookie: Lab200slot2.1397238981.812898.548918 [ 4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201 Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test [ ID] Interval Transfer Bandwidth [ 4] 0.00-1.09 sec 20.8 MBytes 161 Mbits/sec [ 4] 1.09-2.13 sec 10.8 MBytes 86.8 Mbits/sec [ 4] 2.13-3.15 sec 3.57 MBytes 29.5 Mbits/sec [ 4] 3.15-4.16 sec 4.33 MBytes 35.7 Mbits/sec [ 4] 4.16-6.21 sec 10.4 MBytes 42.7 Mbits/sec [ 4] 6.21-6.21 sec 0.00 Bytes 0.00 bits/sec [ 4] 6.21-7.35 sec 34.6 MBytes 253 Mbits/sec [ 4] 7.35-11.45 sec 22.0 MBytes 45.0 Mbits/sec [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec [ 4] 11.45-12.51 sec 16.0 MBytes 126 Mbits/sec [ 4] 12.51-13.59 sec 20.3 MBytes 158 Mbits/sec [ 4] 13.59-14.65 sec 13.4 MBytes 107 Mbits/sec [ 4] 14.65-16.79 sec 33.3 MBytes 130 Mbits/sec [ 4] 16.79-16.79 sec 0.00 Bytes 0.00 bits/sec [ 4] 16.79-17.82 sec 5.94 MBytes 48.7 Mbits/sec (etc) [root@Lab200slot2 ~]# iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60 iperf version 3.0.1 (10 January 2014) Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64 Time: Fri, 11 Apr 2014 19:08:41 GMT Connecting to host 2001:db8:0:f101::1, port 5201 Cookie: Lab200slot2.1397243321.714295.2b3f7c [ 4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201 Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test [ ID] Interval Transfer Bandwidth [ 4] 0.00-1.00 sec 169 MBytes 1.42 Gbits/sec [ 4] 1.00-2.00 sec 201 MBytes 1.69 Gbits/sec [ 4] 2.00-3.00 sec 188 MBytes 1.58 Gbits/sec [ 4] 3.00-4.00 sec 174 MBytes 1.46 Gbits/sec [ 4] 4.00-5.00 sec 165 MBytes 1.39 Gbits/sec [ 4] 5.00-6.00 sec 199 MBytes 1.67 Gbits/sec [ 4] 6.00-7.00 sec 163 MBytes 1.36 Gbits/sec [ 4] 7.00-8.00 sec 174 MBytes 1.46 Gbits/sec [ 4] 8.00-9.00 sec 193 MBytes 1.62 Gbits/sec [ 4] 9.00-10.00 sec 196 MBytes 1.65 Gbits/sec [ 4] 10.00-11.00 sec 157 MBytes 1.31 Gbits/sec [ 4] 11.00-12.00 sec 175 MBytes 1.47 Gbits/sec [ 4] 12.00-13.00 sec 192 MBytes 1.61 Gbits/sec [ 4] 13.00-14.00 sec 199 MBytes 1.67 Gbits/sec (etc) After patch: [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60 iperf version 3.0.1 (10 January 2014) Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64 Time: Mon, 14 Apr 2014 16:40:48 GMT Connecting to host 192.168.240.3, port 5201 Cookie: Lab200slot2.1397493648.413274.65e131 [ 4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201 Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test [ ID] Interval Transfer Bandwidth [ 4] 0.00-1.00 sec 240 MBytes 2.02 Gbits/sec [ 4] 1.00-2.00 sec 239 MBytes 2.01 Gbits/sec [ 4] 2.00-3.00 sec 240 MBytes 2.01 Gbits/sec [ 4] 3.00-4.00 sec 239 MBytes 2.00 Gbits/sec [ 4] 4.00-5.00 sec 245 MBytes 2.05 Gbits/sec [ 4] 5.00-6.00 sec 240 MBytes 2.01 Gbits/sec [ 4] 6.00-7.00 sec 240 MBytes 2.02 Gbits/sec [ 4] 7.00-8.00 sec 239 MBytes 2.01 Gbits/sec With the reverted patch applied, the SCTP/IPv4 performance is back to normal on latest upstream for IPv4 and IPv6 and has same throughput as 3.4.2 test kernel, steady and interval reports are smooth again. Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer") Reported-by: Peter Butler <pbutler@sonusnet.com> Reported-by: Dongsheng Song <dongsheng.song@gmail.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Peter Butler <pbutler@sonusnet.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com> Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-14 21:45:17 +02:00
sctp_assoc_rwnd_increase(event->asoc, len);
sctp_ulpevent_release_owner(event);
}
static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)
{
struct sk_buff *skb, *frag;
skb = sctp_event2skb(event);
if (!skb->data_len)
goto done;
/* Don't forget the fragments. */
skb_walk_frags(skb, frag) {
/* NOTE: skb_shinfos are recursive. Although IP returns
* skb's with only 1 level of fragments, SCTP reassembly can
* increase the levels.
*/
sctp_ulpevent_release_frag_data(sctp_skb2event(frag));
}
done:
sctp_ulpevent_release_owner(event);
}
/* Free a ulpevent that has an owner. It includes releasing the reference
* to the owner, updating the rwnd in case of a DATA event and freeing the
* skb.
*/
void sctp_ulpevent_free(struct sctp_ulpevent *event)
{
if (sctp_ulpevent_is_notification(event))
sctp_ulpevent_release_owner(event);
else
sctp_ulpevent_release_data(event);
kfree_skb(sctp_event2skb(event));
}
/* Purge the skb lists holding ulpevents. */
unsigned int sctp_queue_purge_ulpevents(struct sk_buff_head *list)
{
struct sk_buff *skb;
unsigned int data_unread = 0;
while ((skb = skb_dequeue(list)) != NULL) {
struct sctp_ulpevent *event = sctp_skb2event(skb);
if (!sctp_ulpevent_is_notification(event))
data_unread += skb->len;
sctp_ulpevent_free(event);
}
return data_unread;
}