2005-04-16 15:20:36 -07:00
/*
 * sysctl_net_ipv4.c: sysctl interface to net IPV4 subsystem.
 *
 * Begun April 1, 1996, Mike Shaver.
 * Added /proc/sys/net/ipv4 directory entry (empty =) ). [MS]
 */

#include <linux/mm.h>
#include <linux/module.h>
#include <linux/sysctl.h>
2005-08-16 02:18:02 -03:00
#include <linux/igmp.h>
2005-12-27 02:43:12 -02:00
#include <linux/inetdevice.h>
2007-10-10 17:30:46 -07:00
#include <linux/seqlock.h>
2007-12-05 01:41:26 -08:00
#include <linux/init.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
#include <linux/slab.h>
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport header values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
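A minimal userspace sketch of the interface described in this commit message, assuming a Linux system with the patch applied. `open_ping_socket` and `ping_gid_allowed` are hypothetical names introduced for illustration. The second one only models the inclusive "low high" comparison implied by the examples for ping_group_range; it is not the kernel's code, and which of a process's groups the kernel consults is outside the scope of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Try to create an unprivileged ICMP echo ("ping") socket.
 * Returns a file descriptor, or -1 with errno set if the kernel
 * refuses the request.
 */
int open_ping_socket(void)
{
	return socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP);
}

/*
 * Hypothetical model of the range check: a gid is allowed when
 * low <= gid <= high. The default "1 0" is therefore an empty range.
 */
bool ping_gid_allowed(uint32_t gid, uint32_t low, uint32_t high)
{
	return low <= gid && gid <= high;
}
```

With the default "1 0", `ping_gid_allowed` is false for every gid, including 0, which matches the statement that not even root may create ping sockets.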
2011-05-13 10:01:00 +00:00
#include <linux/nsproxy.h>
2011-12-11 21:47:05 +00:00
#include <linux/swap.h>
2005-04-16 15:20:36 -07:00
#include <net/snmp.h>
2005-08-16 02:18:02 -03:00
#include <net/icmp.h>
2005-04-16 15:20:36 -07:00
#include <net/ip.h>
#include <net/route.h>
#include <net/tcp.h>
2007-12-31 00:29:24 -08:00
#include <net/udp.h>
2006-08-03 16:48:06 -07:00
#include <net/cipso_ipv4.h>
2007-10-15 02:33:45 -07:00
#include <net/inet_frag.h>
2011-05-13 10:01:00 +00:00
#include <net/ping.h>
2011-12-11 21:47:06 +00:00
#include <net/tcp_memcontrol.h>
2005-04-16 15:20:36 -07:00

2005-12-13 23:14:27 -08:00
static int zero;
2013-01-23 20:35:28 +00:00
static int one = 1;
tcp: Tail loss probe (TLP)
This patch series implements the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.
TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occurring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one outstanding packet.
TLP Algorithm
On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)
Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.
When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.
no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.
ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.
In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.
The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
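The PTO rules and the tcp_early_retrans modes listed in this commit message can be sketched as two small helpers. This illustrates the formulas as written in the message and is not the kernel's implementation. `tlp_probe_timeout_ms` and `tlp_enabled` are hypothetical names, times are in whole milliseconds, the smoothed RTT stands in for RTT in the single-packet case, and the caller is assumed to schedule a probe only when at least one packet is outstanding.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/*
 * Probe timeout per the commit message:
 *   packets_out > 1:  PTO = max(2*SRTT, 10ms)
 *   packets_out == 1: PTO = max(2*RTT, 1.5*RTT + 200ms)
 *   then              PTO = min(PTO, RTO)
 * Assumes packets_out >= 1. 1.5*RTT uses integer arithmetic, so odd
 * values round down by half a millisecond.
 */
uint32_t tlp_probe_timeout_ms(uint32_t srtt_ms, uint32_t rto_ms,
			      uint32_t packets_out)
{
	uint32_t pto;

	if (packets_out == 1)
		pto = max_u32(2 * srtt_ms, srtt_ms + srtt_ms / 2 + 200);
	else
		pto = max_u32(2 * srtt_ms, 10);

	return min_u32(pto, rto_ms);
}

/* tcp_early_retrans values 3 and 4 enable TLP; 0, 1 and 2 do not. */
bool tlp_enabled(int tcp_early_retrans)
{
	return tcp_early_retrans == 3 || tcp_early_retrans == 4;
}
```

For example, with SRTT = 50 ms and RTO = 1000 ms the helper yields 100 ms when several packets are outstanding and 275 ms when only one is, the extra time leaving room for a delayed ACK.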
2013-03-11 10:00:43 +00:00
static int four = 4;
2007-02-09 23:24:47 +09:00
static int tcp_retr1_max = 255;
2005-04-16 15:20:36 -07:00
static int ip_local_port_range_min[] = { 1, 1 };
static int ip_local_port_range_max[] = { 65535, 65535 };
2010-11-22 12:54:21 +00:00
static int tcp_adv_win_scale_min = -31;
static int tcp_adv_win_scale_max = 31;
2010-12-13 12:16:14 -08:00
static int ip_ttl_min = 1;
static int ip_ttl_max = 255;
2011-05-13 10:01:00 +00:00
static int ip_ping_group_range_min[] = { 0, 0 };
static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
2005-04-16 15:20:36 -07:00

2007-10-10 17:30:46 -07:00
/* Update system visible IP port range */
static void set_local_port_range(int range[2])
{
2008-10-08 14:18:04 -07:00
	write_seqlock(&sysctl_local_ports.lock);
	sysctl_local_ports.range[0] = range[0];
	sysctl_local_ports.range[1] = range[1];
	write_sequnlock(&sysctl_local_ports.lock);
2007-10-10 17:30:46 -07:00
}

/* Validate changes from /proc interface. */
2013-06-11 23:04:25 -07:00
static int ipv4_local_port_range(struct ctl_table *table, int write,
2007-10-10 17:30:46 -07:00
				 void __user *buffer,
				 size_t *lenp, loff_t *ppos)
{
	int ret;
2008-10-08 14:18:04 -07:00
	int range[2];
2013-06-11 23:04:25 -07:00
	struct ctl_table tmp = {
2007-10-10 17:30:46 -07:00
		.data = &range,
		.maxlen = sizeof(range),
		.mode = table->mode,
		.extra1 = &ip_local_port_range_min,
		.extra2 = &ip_local_port_range_max,
	};

2008-10-08 14:18:04 -07:00
	inet_get_local_port_range(range, range + 1);
2009-09-23 15:57:19 -07:00
	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
2007-10-10 17:30:46 -07:00

	if (write && ret == 0) {
2007-10-18 22:00:17 -07:00
		if (range[1] < range[0])
2007-10-10 17:30:46 -07:00
			ret = -EINVAL;
		else
			set_local_port_range(range);
	}

	return ret;
}

2011-05-13 10:01:00 +00:00
2012-05-24 10:34:21 -06:00
static void inet_get_ping_group_range_table(struct ctl_table *table, kgid_t *low, kgid_t *high)
2011-05-13 10:01:00 +00:00
{
2012-05-24 10:34:21 -06:00
	kgid_t *data = table->data;
2012-04-15 05:58:06 +00:00
	unsigned int seq;
2011-05-13 10:01:00 +00:00
	do {
		seq = read_seqbegin(&sysctl_local_ports.lock);

		*low = data[0];
		*high = data[1];
	} while (read_seqretry(&sysctl_local_ports.lock, seq));
}

/* Update system visible IP port range */
2012-05-24 10:34:21 -06:00
static void set_ping_group_range(struct ctl_table *table, kgid_t low, kgid_t high)
2011-05-13 10:01:00 +00:00
{
2012-05-24 10:34:21 -06:00
	kgid_t *data = table->data;
2011-05-13 10:01:00 +00:00
	write_seqlock(&sysctl_local_ports.lock);
2012-05-24 10:34:21 -06:00
	data[0] = low;
	data[1] = high;
2011-05-13 10:01:00 +00:00
	write_sequnlock(&sysctl_local_ports.lock);
}

/* Validate changes from /proc interface. */
2013-06-11 23:04:25 -07:00
static int ipv4_ping_group_range(struct ctl_table *table, int write,
2011-05-13 10:01:00 +00:00
				 void __user *buffer,
				 size_t *lenp, loff_t *ppos)
{
2012-05-24 10:34:21 -06:00
	struct user_namespace *user_ns = current_user_ns();
2011-05-13 10:01:00 +00:00
	int ret;
2012-05-24 10:34:21 -06:00
	gid_t urange[2];
	kgid_t low, high;
2013-06-11 23:04:25 -07:00
	struct ctl_table tmp = {
2012-05-24 10:34:21 -06:00
		.data = &urange,
		.maxlen = sizeof(urange),
2011-05-13 10:01:00 +00:00
		.mode = table->mode,
		.extra1 = &ip_ping_group_range_min,
		.extra2 = &ip_ping_group_range_max,
	};

2012-05-24 10:34:21 -06:00
	inet_get_ping_group_range_table(table, &low, &high);
	urange[0] = from_kgid_munged(user_ns, low);
	urange[1] = from_kgid_munged(user_ns, high);
2011-05-13 10:01:00 +00:00
	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);

2012-05-24 10:34:21 -06:00
	if (write && ret == 0) {
		low = make_kgid(user_ns, urange[0]);
		high = make_kgid(user_ns, urange[1]);
		if (!gid_valid(low) || !gid_valid(high) ||
		    (urange[1] < urange[0]) || gid_lt(high, low)) {
			low = make_kgid(&init_user_ns, 1);
			high = make_kgid(&init_user_ns, 0);
		}
		set_ping_group_range(table, low, high);
	}
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I) are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-06-11 23:04:25 -07:00
|
|
|
static int proc_tcp_congestion_control(struct ctl_table *ctl, int write,
|
2005-06-23 12:19:55 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
char val[TCP_CA_NAME_MAX];
|
2013-06-11 23:04:25 -07:00
|
|
|
struct ctl_table tbl = {
|
2005-06-23 12:19:55 -07:00
|
|
|
.data = val,
|
|
|
|
.maxlen = TCP_CA_NAME_MAX,
|
|
|
|
};
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
tcp_get_default_congestion_control(val);
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
|
2005-06-23 12:19:55 -07:00
|
|
|
if (write && ret == 0)
|
|
|
|
ret = tcp_set_default_congestion_control(val);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-06-11 23:04:25 -07:00
|
|
|
static int proc_tcp_available_congestion_control(struct ctl_table *ctl,
|
2009-09-23 15:57:19 -07:00
|
|
|
int write,
|
2006-11-09 16:32:06 -08:00
|
|
|
void __user *buffer, size_t *lenp,
|
|
|
|
loff_t *ppos)
|
|
|
|
{
|
2013-06-11 23:04:25 -07:00
|
|
|
struct ctl_table tbl = { .maxlen = TCP_CA_BUF_MAX, };
|
2006-11-09 16:32:06 -08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
tbl.data = kmalloc(tbl.maxlen, GFP_USER);
|
|
|
|
if (!tbl.data)
|
|
|
|
return -ENOMEM;
|
|
|
|
tcp_get_available_congestion_control(tbl.data, TCP_CA_BUF_MAX);
|
2009-09-23 15:57:19 -07:00
|
|
|
ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
|
2006-11-09 16:32:06 -08:00
|
|
|
kfree(tbl.data);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-06-11 23:04:25 -07:00
|
|
|
static int proc_allowed_congestion_control(struct ctl_table *ctl,
|
2009-09-23 15:57:19 -07:00
|
|
|
int write,
|
2006-11-09 16:35:15 -08:00
|
|
|
void __user *buffer, size_t *lenp,
|
|
|
|
loff_t *ppos)
|
|
|
|
{
|
2013-06-11 23:04:25 -07:00
|
|
|
struct ctl_table tbl = { .maxlen = TCP_CA_BUF_MAX };
|
2006-11-09 16:35:15 -08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
tbl.data = kmalloc(tbl.maxlen, GFP_USER);
|
|
|
|
if (!tbl.data)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
tcp_get_allowed_congestion_control(tbl.data, tbl.maxlen);
|
2009-09-23 15:57:19 -07:00
|
|
|
ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
|
2006-11-09 16:35:15 -08:00
|
|
|
if (write && ret == 0)
|
|
|
|
ret = tcp_set_allowed_congestion_control(tbl.data);
|
|
|
|
kfree(tbl.data);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-06-11 23:04:25 -07:00
|
|
|
static int ipv4_tcp_mem(struct ctl_table *ctl, int write,
|
2011-12-11 21:47:05 +00:00
|
|
|
void __user *buffer, size_t *lenp,
|
|
|
|
loff_t *ppos)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
unsigned long vec[3];
|
|
|
|
struct net *net = current->nsproxy->net_ns;
|
2012-07-31 16:43:02 -07:00
|
|
|
#ifdef CONFIG_MEMCG_KMEM
|
2011-12-11 21:47:06 +00:00
|
|
|
struct mem_cgroup *memcg;
|
|
|
|
#endif
|
2011-12-11 21:47:05 +00:00
|
|
|
|
2013-06-11 23:04:25 -07:00
|
|
|
struct ctl_table tmp = {
|
2011-12-11 21:47:05 +00:00
|
|
|
.data = &vec,
|
|
|
|
.maxlen = sizeof(vec),
|
|
|
|
.mode = ctl->mode,
|
|
|
|
};
|
|
|
|
|
|
|
|
if (!write) {
|
|
|
|
ctl->data = &net->ipv4.sysctl_tcp_mem;
|
|
|
|
return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2012-07-31 16:43:02 -07:00
|
|
|
#ifdef CONFIG_MEMCG_KMEM
|
2011-12-11 21:47:06 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
memcg = mem_cgroup_from_task(current);
|
|
|
|
|
|
|
|
tcp_prot_mem(memcg, vec[0], 0);
|
|
|
|
tcp_prot_mem(memcg, vec[1], 1);
|
|
|
|
tcp_prot_mem(memcg, vec[2], 2);
|
|
|
|
rcu_read_unlock();
|
|
|
|
#endif
|
|
|
|
|
2011-12-11 21:47:05 +00:00
|
|
|
net->ipv4.sysctl_tcp_mem[0] = vec[0];
|
|
|
|
net->ipv4.sysctl_tcp_mem[1] = vec[1];
|
|
|
|
net->ipv4.sysctl_tcp_mem[2] = vec[2];
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-06-11 23:04:25 -07:00
|
|
|
static int proc_tcp_fastopen_key(struct ctl_table *ctl, int write,
|
|
|
|
void __user *buffer, size_t *lenp,
|
|
|
|
loff_t *ppos)
|
2012-08-31 12:29:11 +00:00
|
|
|
{
|
2013-06-11 23:04:25 -07:00
|
|
|
struct ctl_table tbl = { .maxlen = (TCP_FASTOPEN_KEY_LENGTH * 2 + 10) };
|
2012-08-31 12:29:11 +00:00
|
|
|
struct tcp_fastopen_context *ctxt;
|
|
|
|
int ret;
|
|
|
|
u32 user_key[4]; /* 16 bytes, matching TCP_FASTOPEN_KEY_LENGTH */
|
|
|
|
|
|
|
|
tbl.data = kmalloc(tbl.maxlen, GFP_KERNEL);
|
|
|
|
if (!tbl.data)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
ctxt = rcu_dereference(tcp_fastopen_ctx);
|
|
|
|
if (ctxt)
|
|
|
|
memcpy(user_key, ctxt->key, TCP_FASTOPEN_KEY_LENGTH);
|
2012-10-11 06:24:14 +00:00
|
|
|
else
|
|
|
|
memset(user_key, 0, sizeof(user_key));
|
2012-08-31 12:29:11 +00:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
snprintf(tbl.data, tbl.maxlen, "%08x-%08x-%08x-%08x",
|
|
|
|
user_key[0], user_key[1], user_key[2], user_key[3]);
|
|
|
|
ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
|
|
|
|
|
|
|
|
if (write && ret == 0) {
|
|
|
|
if (sscanf(tbl.data, "%x-%x-%x-%x", user_key, user_key + 1,
|
|
|
|
user_key + 2, user_key + 3) != 4) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto bad_key;
|
|
|
|
}
|
|
|
|
tcp_fastopen_reset_cipher(user_key, TCP_FASTOPEN_KEY_LENGTH);
|
|
|
|
}
|
|
|
|
|
|
|
|
bad_key:
|
|
|
|
pr_debug("proc FO key set 0x%x-%x-%x-%x <- 0x%s: %d\n",
|
|
|
|
user_key[0], user_key[1], user_key[2], user_key[3],
|
|
|
|
(char *)tbl.data, ret);
|
|
|
|
kfree(tbl.data);
|
|
|
|
return ret;
|
|
|
|
}
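The text format handled by proc_tcp_fastopen_key() above (four 32-bit words printed as "%08x-%08x-%08x-%08x" and read back with "%x-%x-%x-%x") can be tried in userspace with the sketch below. The function names and the FASTOPEN_KEY_WORDS constant are hypothetical; only the two format strings are taken from the handler.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FASTOPEN_KEY_WORDS 4   /* 4 x 32 bits = 16 bytes (assumed key size) */

/* Render the key the way the handler does on read.
 * Returns the number of characters snprintf() would have written. */
int format_fastopen_key(char *out, size_t outlen,
                        const uint32_t key[FASTOPEN_KEY_WORDS])
{
    return snprintf(out, outlen, "%08x-%08x-%08x-%08x",
                    (unsigned int)key[0], (unsigned int)key[1],
                    (unsigned int)key[2], (unsigned int)key[3]);
}

/* Parse the text form as the handler does on write.
 * Returns 0 on success, -1 if fewer than four words were found. */
int parse_fastopen_key(const char *text, uint32_t key[FASTOPEN_KEY_WORDS])
{
    unsigned int words[FASTOPEN_KEY_WORDS];
    int i;

    if (sscanf(text, "%x-%x-%x-%x",
               &words[0], &words[1], &words[2], &words[3]) != 4)
        return -1;
    for (i = 0; i < FASTOPEN_KEY_WORDS; i++)
        key[i] = (uint32_t)words[i];
    return 0;
}
```

The formatted key is 35 characters long, which fits the handler's buffer of TCP_FASTOPEN_KEY_LENGTH * 2 + 10 bytes when the key length is 16.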
|
|
|
|
|
2007-12-05 01:41:26 -08:00
|
|
|
static struct ctl_table ipv4_table[] = {
|
2007-02-09 23:24:47 +09:00
|
|
|
{
|
2005-04-16 15:20:36 -07:00
|
|
|
.procname = "tcp_timestamps",
|
|
|
|
.data = &sysctl_tcp_timestamps,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2007-02-09 23:24:47 +09:00
|
|
|
{
|
2005-04-16 15:20:36 -07:00
|
|
|
.procname = "tcp_window_scaling",
|
|
|
|
.data = &sysctl_tcp_window_scaling,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2007-02-09 23:24:47 +09:00
|
|
|
{
|
2005-04-16 15:20:36 -07:00
|
|
|
.procname = "tcp_sack",
|
|
|
|
.data = &sysctl_tcp_sack,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2007-02-09 23:24:47 +09:00
|
|
|
{
|
2005-04-16 15:20:36 -07:00
|
|
|
.procname = "tcp_retrans_collapse",
|
|
|
|
.data = &sysctl_tcp_retrans_collapse,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2007-02-09 23:24:47 +09:00
|
|
|
{
|
2005-04-16 15:20:36 -07:00
|
|
|
.procname = "ip_default_ttl",
|
2007-02-09 23:24:47 +09:00
|
|
|
.data = &sysctl_ip_default_ttl,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2010-12-13 12:16:14 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &ip_ttl_min,
|
|
|
|
.extra2 = &ip_ttl_max,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2007-02-09 23:24:47 +09:00
|
|
|
{
|
2005-04-16 15:20:36 -07:00
|
|
|
.procname = "ip_no_pmtu_disc",
|
|
|
|
.data = &ipv4_config.no_pmtu_disc,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "ip_nonlocal_bind",
|
|
|
|
.data = &sysctl_ip_nonlocal_bind,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_syn_retries",
|
|
|
|
.data = &sysctl_tcp_syn_retries,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_synack_retries",
|
|
|
|
.data = &sysctl_tcp_synack_retries,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_max_orphans",
|
|
|
|
.data = &sysctl_tcp_max_orphans,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_max_tw_buckets",
|
2005-08-09 20:44:40 -07:00
|
|
|
.data = &tcp_death_row.sysctl_max_tw_buckets,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2012-06-21 13:58:31 +00:00
|
|
|
{
|
|
|
|
.procname = "ip_early_demux",
|
|
|
|
.data = &sysctl_ip_early_demux,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "ip_dynaddr",
|
|
|
|
.data = &sysctl_ip_dynaddr,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_keepalive_time",
|
|
|
|
.data = &sysctl_tcp_keepalive_time,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_keepalive_probes",
|
|
|
|
.data = &sysctl_tcp_keepalive_probes,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_keepalive_intvl",
|
|
|
|
.data = &sysctl_tcp_keepalive_intvl,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_retries1",
|
|
|
|
.data = &sysctl_tcp_retries1,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra2 = &tcp_retr1_max
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_retries2",
|
|
|
|
.data = &sysctl_tcp_retries2,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_fin_timeout",
|
|
|
|
.data = &sysctl_tcp_fin_timeout,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#ifdef CONFIG_SYN_COOKIES
|
|
|
|
{
|
|
|
|
.procname = "tcp_syncookies",
|
|
|
|
.data = &sysctl_tcp_syncookies,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#endif
|
2012-07-19 06:43:05 +00:00
|
|
|
{
|
|
|
|
.procname = "tcp_fastopen",
|
|
|
|
.data = &sysctl_tcp_fastopen,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
2012-08-31 12:29:11 +00:00
|
|
|
{
|
|
|
|
.procname = "tcp_fastopen_key",
|
|
|
|
.mode = 0600,
|
|
|
|
.maxlen = ((TCP_FASTOPEN_KEY_LENGTH * 2) + 10),
|
|
|
|
.proc_handler = proc_tcp_fastopen_key,
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "tcp_tw_recycle",
|
2005-08-09 20:44:40 -07:00
|
|
|
.data = &tcp_death_row.sysctl_tw_recycle,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_abort_on_overflow",
|
|
|
|
.data = &sysctl_tcp_abort_on_overflow,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_stdurg",
|
|
|
|
.data = &sysctl_tcp_stdurg,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_rfc1337",
|
|
|
|
.data = &sysctl_tcp_rfc1337,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_max_syn_backlog",
|
|
|
|
.data = &sysctl_max_syn_backlog,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "ip_local_port_range",
|
2008-10-08 14:18:04 -07:00
|
|
|
.data = &sysctl_local_ports.range,
|
|
|
|
.maxlen = sizeof(sysctl_local_ports.range),
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = ipv4_local_port_range,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2010-05-05 00:27:06 +00:00
|
|
|
{
|
|
|
|
.procname = "ip_local_reserved_ports",
|
|
|
|
.data = NULL, /* initialized in sysctl_ipv4_init */
|
|
|
|
.maxlen = 65536,
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_do_large_bitmap,
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "igmp_max_memberships",
|
|
|
|
.data = &sysctl_igmp_max_memberships,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "igmp_max_msf",
|
|
|
|
.data = &sysctl_igmp_max_msf,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "inet_peer_threshold",
|
|
|
|
.data = &inet_peer_threshold,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "inet_peer_minttl",
|
|
|
|
.data = &inet_peer_minttl,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "inet_peer_maxttl",
|
|
|
|
.data = &inet_peer_maxttl,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_orphan_retries",
|
|
|
|
.data = &sysctl_tcp_orphan_retries,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_fack",
|
|
|
|
.data = &sysctl_tcp_fack,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_reordering",
|
|
|
|
.data = &sysctl_tcp_reordering,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_dsack",
|
|
|
|
.data = &sysctl_tcp_dsack,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_wmem",
|
|
|
|
.data = &sysctl_tcp_wmem,
|
|
|
|
.maxlen = sizeof(sysctl_tcp_wmem),
|
|
|
|
.mode = 0644,
|
2013-01-23 20:35:28 +00:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &one,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_rmem",
|
|
|
|
.data = &sysctl_tcp_rmem,
|
|
|
|
.maxlen = sizeof(sysctl_tcp_rmem),
|
|
|
|
.mode = 0644,
|
2013-01-23 20:35:28 +00:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &one,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_app_win",
|
|
|
|
.data = &sysctl_tcp_app_win,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_adv_win_scale",
|
|
|
|
.data = &sysctl_tcp_adv_win_scale,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2010-11-22 12:54:21 +00:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &tcp_adv_win_scale_min,
|
|
|
|
.extra2 = &tcp_adv_win_scale_max,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_tw_reuse",
|
|
|
|
.data = &sysctl_tcp_tw_reuse,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_frto",
|
|
|
|
.data = &sysctl_tcp_frto,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_low_latency",
|
|
|
|
.data = &sysctl_tcp_low_latency,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_no_metrics_save",
|
|
|
|
.data = &sysctl_tcp_nometrics_save,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_moderate_rcvbuf",
|
|
|
|
.data = &sysctl_tcp_moderate_rcvbuf,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_tso_win_divisor",
|
|
|
|
.data = &sysctl_tcp_tso_win_divisor,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
2005-06-23 12:19:55 -07:00
|
|
|
.procname = "tcp_congestion_control",
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0644,
|
2005-06-23 12:19:55 -07:00
|
|
|
.maxlen = TCP_CA_NAME_MAX,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_tcp_congestion_control,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2006-03-20 17:53:41 -08:00
|
|
|
{
|
|
|
|
.procname = "tcp_mtu_probing",
|
|
|
|
.data = &sysctl_tcp_mtu_probing,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-03-20 17:53:41 -08:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "tcp_base_mss",
|
|
|
|
.data = &sysctl_tcp_base_mss,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-03-20 17:53:41 -08:00
|
|
|
},
|
2007-02-09 23:24:47 +09:00
|
|
|
{
|
2006-03-20 22:40:29 -08:00
|
|
|
.procname = "tcp_workaround_signed_windows",
|
|
|
|
.data = &sysctl_tcp_workaround_signed_windows,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2006-03-20 22:40:29 -08:00
|
|
|
},
|
tcp: TCP Small Queues
This introduces TSQ (TCP Small Queues).
TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.
sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and socket autotuning on both sides no longer uses 4 MBytes.
As the skb destructor cannot restart xmit itself (as the qdisc lock might
be taken at this point), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.
If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
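The two limits described above (a per-socket byte cap, and TSO packets sized to half of it) can be modelled with the sketch below. This is a simplified illustration, not the kernel code: the function names are hypothetical and the counters are plain integers rather than the socket's atomic sk_wmem_alloc.

```c
#include <stdbool.h>
#include <stdint.h>

#define GSO_MAX_DEFAULT 65536u   /* standard gso max limit quoted above */

/* TSO packets are capped to half the output limit, so that two of them
 * can be in flight below the limit; the device's gso maximum still applies. */
uint32_t tsq_tso_cap(uint32_t limit_bytes, uint32_t gso_max)
{
    uint32_t half = limit_bytes / 2;

    return half < gso_max ? half : gso_max;
}

/* Model of the throttle test: once the bytes queued in the qdisc/device
 * layers reach the limit, further transmission is deferred until
 * completed skbs are freed. */
bool tsq_should_throttle(uint32_t wmem_alloc, uint32_t limit_bytes)
{
    return wmem_alloc >= limit_bytes;
}
```

With the limit set to 40000, tsq_tso_cap(40000, GSO_MAX_DEFAULT) yields 20000, matching the 40000/2 figure in the commit message.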
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 05:50:31 +00:00
|
|
|
{
|
|
|
|
.procname = "tcp_limit_output_bytes",
|
|
|
|
.data = &sysctl_tcp_limit_output_bytes,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec
|
|
|
|
},
|
2012-07-17 10:13:05 +02:00
|
|
|
{
|
|
|
|
.procname = "tcp_challenge_ack_limit",
|
|
|
|
.data = &sysctl_tcp_challenge_ack_limit,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec
|
|
|
|
},
|
2006-05-23 18:02:55 -07:00
|
|
|
#ifdef CONFIG_NET_DMA
|
|
|
|
{
|
|
|
|
.procname = "tcp_dma_copybreak",
|
|
|
|
.data = &sysctl_tcp_dma_copybreak,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2006-05-23 18:02:55 -07:00
|
|
|
},
|
|
|
|
#endif
|
2006-06-13 22:33:04 -07:00
|
|
|
{
|
|
|
|
.procname = "tcp_slow_start_after_idle",
|
|
|
|
.data = &sysctl_tcp_slow_start_after_idle,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2006-06-13 22:33:04 -07:00
|
|
|
},
|
2006-08-03 16:48:06 -07:00
|
|
|
#ifdef CONFIG_NETLABEL
|
|
|
|
{
|
|
|
|
.procname = "cipso_cache_enable",
|
|
|
|
.data = &cipso_v4_cache_enabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-08-03 16:48:06 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "cipso_cache_bucket_size",
|
|
|
|
.data = &cipso_v4_cache_bucketsize,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-08-03 16:48:06 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "cipso_rbm_optfmt",
|
|
|
|
.data = &cipso_v4_rbm_optfmt,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-08-03 16:48:06 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "cipso_rbm_strictvalid",
|
|
|
|
.data = &cipso_v4_rbm_strictvalid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-08-03 16:48:06 -07:00
|
|
|
},
|
|
|
|
#endif /* CONFIG_NETLABEL */
|
2006-11-09 16:32:06 -08:00
|
|
|
{
|
|
|
|
.procname = "tcp_available_congestion_control",
|
|
|
|
.maxlen = TCP_CA_BUF_MAX,
|
|
|
|
.mode = 0444,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_tcp_available_congestion_control,
|
2006-11-09 16:32:06 -08:00
|
|
|
},
|
2006-11-09 16:35:15 -08:00
|
|
|
{
|
|
|
|
.procname = "tcp_allowed_congestion_control",
|
|
|
|
.maxlen = TCP_CA_BUF_MAX,
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_allowed_congestion_control,
|
2006-11-09 16:35:15 -08:00
|
|
|
},
|
2007-03-25 19:21:45 -07:00
|
|
|
{
|
|
|
|
.procname = "tcp_max_ssthresh",
|
|
|
|
.data = &sysctl_tcp_max_ssthresh,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-03-25 19:21:45 -07:00
|
|
|
},
|
2010-02-18 02:47:01 +00:00
|
|
|
{
|
|
|
|
.procname = "tcp_thin_linear_timeouts",
|
|
|
|
.data = &sysctl_tcp_thin_linear_timeouts,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec
|
|
|
|
},
|
2010-02-18 04:48:19 +00:00
|
|
|
{
|
|
|
|
.procname = "tcp_thin_dupack",
|
|
|
|
.data = &sysctl_tcp_thin_dupack,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec
|
|
|
|
},
|
2012-05-02 13:30:03 +00:00
|
|
|
{
|
|
|
|
.procname = "tcp_early_retrans",
|
|
|
|
.data = &sysctl_tcp_early_retrans,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
tcp: Tail loss probe (TLP)
This patch series implements the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.
TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occurring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one outstanding packet.
TLP Algorithm
On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)
Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.
When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.
no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.
ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.
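The PTO scheduling rules listed above can be written out as a short sketch. It works in microseconds and uses hypothetical names (tlp_probe_timeout_us and the two constants); it assumes the caller has already checked the scheduling conditions and that packets_out is at least 1.

```c
#include <stdint.h>

#define TLP_MIN_PTO_US   10000u    /* 10ms floor from the rules above */
#define TLP_DELACK_US    200000u   /* 200ms allowance for a delayed ACK */

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/* Compute the probe timeout for a connection in Open state.
 *   packets_out > 1:  max(2*SRTT, 10ms)
 *   packets_out == 1: max(2*SRTT, 1.5*SRTT + 200ms)
 * and in both cases the result is capped at the RTO. */
uint64_t tlp_probe_timeout_us(uint64_t srtt_us, uint64_t rto_us,
                              unsigned int packets_out)
{
    uint64_t pto = 2 * srtt_us;

    if (packets_out == 1)
        pto = max_u64(pto, srtt_us + srtt_us / 2 + TLP_DELACK_US);
    else
        pto = max_u64(pto, TLP_MIN_PTO_US);

    return min_u64(pto, rto_us);
}
```

For example, with an SRTT of 50ms and an RTO of 300ms, three packets in flight give a PTO of 100ms, while a single packet in flight gives 275ms because of the delayed-ACK allowance.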
In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.
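The five tcp_early_retrans values listed above can be decoded as in the sketch below. The struct and function names are hypothetical; the sketch only encodes the table from the commit message and rejects values outside 0..4, the bounds the sysctl entry enforces through extra1 and extra2.

```c
#include <stdbool.h>

/* Hypothetical decoded view of the tcp_early_retrans setting. */
struct early_retrans_mode {
    bool early_retransmit;   /* RFC5827 ER in some form */
    bool delayed_er;         /* ER is delayed */
    bool tail_loss_probe;    /* TLP */
};

/* Decode a sysctl value; returns false for values outside 0..4. */
bool decode_early_retrans(int value, struct early_retrans_mode *mode)
{
    if (value < 0 || value > 4)
        return false;
    mode->early_retransmit = (value >= 1 && value <= 3);
    mode->delayed_er       = (value == 2 || value == 3);
    mode->tail_loss_probe  = (value == 3 || value == 4);
    return true;
}
```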
The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-11 10:00:43 +00:00
|
|
|
.extra2 = &four,
|
2012-05-02 13:30:03 +00:00
|
|
|
},
|
2007-12-31 00:29:24 -08:00
|
|
|
{
|
|
|
|
.procname = "udp_mem",
|
|
|
|
.data = &sysctl_udp_mem,
|
|
|
|
.maxlen = sizeof(sysctl_udp_mem),
|
|
|
|
.mode = 0644,
|
2010-11-09 23:24:26 +00:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2007-12-31 00:29:24 -08:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "udp_rmem_min",
|
|
|
|
.data = &sysctl_udp_rmem_min,
|
|
|
|
.maxlen = sizeof(sysctl_udp_rmem_min),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2013-01-23 20:35:28 +00:00
|
|
|
.extra1 = &one
|
2007-12-31 00:29:24 -08:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "udp_wmem_min",
|
|
|
|
.data = &sysctl_udp_wmem_min,
|
|
|
|
.maxlen = sizeof(sysctl_udp_wmem_min),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2013-01-23 20:35:28 +00:00
|
|
|
.extra1 = &one
|
2007-12-31 00:29:24 -08:00
|
|
|
},
|
2009-11-05 13:32:03 -08:00
|
|
|
{ }
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
2007-12-05 01:41:26 -08:00
|
|
|
|
2008-03-26 01:56:24 -07:00
|
|
|
static struct ctl_table ipv4_net_table[] = {
|
|
|
|
{
|
|
|
|
.procname = "icmp_echo_ignore_all",
|
|
|
|
.data = &init_net.ipv4.sysctl_icmp_echo_ignore_all,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2008-03-26 01:56:24 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "icmp_echo_ignore_broadcasts",
|
|
|
|
.data = &init_net.ipv4.sysctl_icmp_echo_ignore_broadcasts,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2008-03-26 01:56:24 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "icmp_ignore_bogus_error_responses",
|
|
|
|
.data = &init_net.ipv4.sysctl_icmp_ignore_bogus_error_responses,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2008-03-26 01:56:24 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "icmp_errors_use_inbound_ifaddr",
|
|
|
|
.data = &init_net.ipv4.sysctl_icmp_errors_use_inbound_ifaddr,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2008-03-26 01:56:24 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "icmp_ratelimit",
|
|
|
|
.data = &init_net.ipv4.sysctl_icmp_ratelimit,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec_ms_jiffies,
|
2008-03-26 01:56:24 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "icmp_ratemask",
|
|
|
|
.data = &init_net.ipv4.sysctl_icmp_ratemask,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2008-11-03 18:21:05 -08:00
|
|
|
.proc_handler = proc_dointvec
|
2008-03-26 01:56:24 -07:00
|
|
|
},
|
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport header values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I) are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
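The group-range rule described in the commit message above can be sketched in userspace C. This is only an illustration under stated assumptions: `gid_in_ping_range()`, `ping_range_examples()` and `open_ping_socket()` are hypothetical names, not kernel functions, and the comparison is assumed to be inclusive on both ends because the text says "100 100" grants a single group. The sketch checks one gid only and does not model how the kernel decides which of a process's groups to test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: true when gid lies in the inclusive range
 * [low, high]. With the default "1 0" the range is empty (low > high),
 * so no gid can match, root's gid 0 included.
 */
bool gid_in_ping_range(uint32_t gid, uint32_t low, uint32_t high)
{
	return low <= gid && gid <= high;
}

/* The example settings quoted in the commit message, as assertions. */
void ping_range_examples(void)
{
	/* "1 0": nobody, not even root */
	assert(!gid_in_ping_range(0, 1, 0));
	assert(!gid_in_ping_range(1, 1, 0));

	/* "100 100": exactly one group */
	assert(gid_in_ping_range(100, 100, 100));
	assert(!gid_in_ping_range(99, 100, 100));

	/* "0 4294967295": everyone */
	assert(gid_in_ping_range(0, 0, 4294967295u));

	/* "100 4294967295": gids from 100 upward only */
	assert(gid_in_ping_range(1000, 100, 4294967295u));
	assert(!gid_in_ping_range(50, 100, 4294967295u));
}

/*
 * Attempt to create a ping socket as the commit message describes.
 * Returns a file descriptor, or -1 with errno set when the caller's
 * group is outside ping_group_range or the kernel lacks support.
 */
int open_ping_socket(void)
{
	return socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP);
}
```

Whether `open_ping_socket()` succeeds depends on the running system's `/proc/sys/net/ipv4/ping_group_range`, so a caller should check for -1 and fall back to another mechanism rather than assume success.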
|
|
|
{
|
|
|
|
.procname = "ping_group_range",
|
|
|
|
.data = &init_net.ipv4.sysctl_ping_group_range,
|
2012-05-24 10:34:21 -06:00
|
|
|
.maxlen = sizeof(gid_t)*2,
|
2011-05-13 10:01:00 +00:00
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = ipv4_ping_group_range,
|
|
|
|
},
|
2013-01-05 16:10:48 +00:00
|
|
|
{
|
|
|
|
.procname = "tcp_ecn",
|
|
|
|
.data = &init_net.ipv4.sysctl_tcp_ecn,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec
|
|
|
|
},
|
2011-12-11 21:47:05 +00:00
|
|
|
{
|
|
|
|
.procname = "tcp_mem",
|
|
|
|
.maxlen = sizeof(init_net.ipv4.sysctl_tcp_mem),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = ipv4_tcp_mem,
|
|
|
|
},
|
2008-03-26 01:56:24 -07:00
|
|
|
{ }
|
|
|
|
};
|
|
|
|
|
2008-03-26 01:54:18 -07:00
|
|
|
static __net_init int ipv4_sysctl_init_net(struct net *net)
|
|
|
|
{
|
2008-03-26 01:56:24 -07:00
|
|
|
struct ctl_table *table;
|
|
|
|
|
|
|
|
table = ipv4_net_table;
|
2009-11-25 15:14:13 -08:00
|
|
|
if (!net_eq(net, &init_net)) {
|
2008-03-26 01:56:24 -07:00
|
|
|
table = kmemdup(table, sizeof(ipv4_net_table), GFP_KERNEL);
|
|
|
|
if (table == NULL)
|
|
|
|
goto err_alloc;
|
|
|
|
|
|
|
|
table[0].data =
|
|
|
|
&net->ipv4.sysctl_icmp_echo_ignore_all;
|
|
|
|
table[1].data =
|
|
|
|
&net->ipv4.sysctl_icmp_echo_ignore_broadcasts;
|
|
|
|
table[2].data =
|
|
|
|
&net->ipv4.sysctl_icmp_ignore_bogus_error_responses;
|
|
|
|
table[3].data =
|
|
|
|
&net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr;
|
|
|
|
table[4].data =
|
|
|
|
&net->ipv4.sysctl_icmp_ratelimit;
|
|
|
|
table[5].data =
|
|
|
|
&net->ipv4.sysctl_icmp_ratemask;
|
2008-10-27 12:28:25 -07:00
|
|
|
table[6].data =
|
2011-05-13 10:01:00 +00:00
|
|
|
&net->ipv4.sysctl_ping_group_range;
|
2013-01-05 16:10:48 +00:00
|
|
|
table[7].data =
|
|
|
|
&net->ipv4.sysctl_tcp_ecn;
|
2011-05-13 10:01:00 +00:00
|
|
|
|
2012-11-16 03:02:59 +00:00
|
|
|
/* Don't export sysctls to unprivileged users */
|
|
|
|
if (net->user_ns != &init_user_ns)
|
|
|
|
table[0].procname = NULL;
|
2008-03-26 01:56:24 -07:00
|
|
|
}
|
|
|
|
|
2011-05-13 10:01:00 +00:00
|
|
|
/*
|
|
|
|
* Sane defaults - nobody may create ping sockets.
|
|
|
|
* Boot scripts should set this to a distro-specific group.
|
|
|
|
*/
|
2012-05-24 10:34:21 -06:00
|
|
|
net->ipv4.sysctl_ping_group_range[0] = make_kgid(&init_user_ns, 1);
|
|
|
|
net->ipv4.sysctl_ping_group_range[1] = make_kgid(&init_user_ns, 0);
|
2011-05-13 10:01:00 +00:00
|
|
|
|
2012-01-30 01:20:17 +00:00
|
|
|
tcp_init_mem(net);
|
2011-12-11 21:47:05 +00:00
|
|
|
|
2012-04-19 13:44:49 +00:00
|
|
|
net->ipv4.ipv4_hdr = register_net_sysctl(net, "net/ipv4", table);
|
2008-03-26 01:56:24 -07:00
|
|
|
if (net->ipv4.ipv4_hdr == NULL)
|
|
|
|
goto err_reg;
|
|
|
|
|
2008-03-26 01:54:18 -07:00
|
|
|
return 0;
|
2008-03-26 01:56:24 -07:00
|
|
|
|
|
|
|
err_reg:
|
2009-11-25 15:14:13 -08:00
|
|
|
if (!net_eq(net, &init_net))
|
2008-03-26 01:56:24 -07:00
|
|
|
kfree(table);
|
|
|
|
err_alloc:
|
|
|
|
return -ENOMEM;
|
2008-03-26 01:54:18 -07:00
|
|
|
}
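The per-namespace setup in ipv4_sysctl_init_net() follows a copy-then-rebind pattern: duplicate the static template table, then point each entry's `.data` at the field inside the new namespace. A minimal userspace sketch of that pattern follows. All names here (`demo_entry`, `demo_ns`, `demo_template`, `demo_table_for`) are hypothetical stand-ins, `malloc`/`memcpy` stand in for `kmemdup`, and unlike the kernel function the sketch always copies rather than reusing the static table for the initial namespace.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct ctl_table: a name plus a data pointer. */
struct demo_entry {
	const char *procname;
	int *data;
};

/* Hypothetical stand-in for the per-namespace state in struct net. */
struct demo_ns {
	int echo_ignore_all;
	int ratelimit;
};

/* Plays the role of init_net: the template points at its fields. */
static struct demo_ns demo_init_ns;

static const struct demo_entry demo_template[] = {
	{ "icmp_echo_ignore_all", &demo_init_ns.echo_ignore_all },
	{ "icmp_ratelimit",       &demo_init_ns.ratelimit },
	{ NULL, NULL }	/* sentinel, like the empty { } entry */
};

/*
 * Duplicate the template and rebind each .data pointer to the fields of
 * ns. Entries are addressed by index, so the assignments must stay in
 * step with the order of demo_template. The caller frees the result.
 */
struct demo_entry *demo_table_for(struct demo_ns *ns)
{
	struct demo_entry *table;

	assert(ns != NULL);

	table = malloc(sizeof(demo_template));
	if (table == NULL)
		return NULL;
	memcpy(table, demo_template, sizeof(demo_template));

	table[0].data = &ns->echo_ignore_all;
	table[1].data = &ns->ratelimit;
	return table;
}
```

Because the rebinding is index-based, inserting or reordering template entries without updating the assignments would silently bind a name to the wrong field; that coupling is the main cost of this pattern.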
|
|
|
|
|
|
|
|
static __net_exit void ipv4_sysctl_exit_net(struct net *net)
|
|
|
|
{
|
2008-03-26 01:56:24 -07:00
|
|
|
struct ctl_table *table;
|
|
|
|
|
|
|
|
table = net->ipv4.ipv4_hdr->ctl_table_arg;
|
|
|
|
unregister_net_sysctl_table(net->ipv4.ipv4_hdr);
|
|
|
|
kfree(table);
|
2008-03-26 01:54:18 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
static __net_initdata struct pernet_operations ipv4_sysctl_ops = {
|
|
|
|
.init = ipv4_sysctl_init_net,
|
|
|
|
.exit = ipv4_sysctl_exit_net,
|
|
|
|
};
|
|
|
|
|
2007-12-05 01:41:26 -08:00
|
|
|
static __init int sysctl_ipv4_init(void)
|
|
|
|
{
|
|
|
|
struct ctl_table_header *hdr;
|
2010-05-05 00:27:06 +00:00
|
|
|
struct ctl_table *i;
|
|
|
|
|
|
|
|
for (i = ipv4_table; i->procname; i++) {
|
|
|
|
if (strcmp(i->procname, "ip_local_reserved_ports") == 0) {
|
|
|
|
i->data = sysctl_local_reserved_ports;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!i->procname)
|
|
|
|
return -EINVAL;
|
2007-12-05 01:41:26 -08:00
|
|
|
|
2012-04-19 13:44:49 +00:00
|
|
|
hdr = register_net_sysctl(&init_net, "net/ipv4", ipv4_table);
|
2008-03-26 01:54:18 -07:00
|
|
|
if (hdr == NULL)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
if (register_pernet_subsys(&ipv4_sysctl_ops)) {
|
2012-04-19 13:24:33 +00:00
|
|
|
unregister_net_sysctl_table(hdr);
|
2008-03-26 01:54:18 -07:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
2007-12-05 01:41:26 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
__initcall(sysctl_ipv4_init);
|