TCP IP Topics

Coming here more on TCP/IP stuff... Keep watching

TCP Buffer Sizing

TCP uses what is called the "congestion window", or CWND, to determine how many packets can be sent at one time. The larger the congestion window size, the higher the throughput. The TCP "slow start" and "congestion avoidance" algorithms determine the size of the congestion window. The maximum congestion window is related to the amount of buffer space that the kernel allocates for each socket. For each socket, there is a default value for the buffer size, which can be changed by the program using a system library call just before opening the socket. There is also a kernel enforced maximum buffer size. The buffer size can be adjusted for both the send and receive ends of the socket.

To get maximal throughput it is critical to use optimal TCP send and receive socket buffer sizes for the link you are using. If the buffers are too small, the TCP congestion window will never fully open up. If the receiver buffers are too large, TCP flow control breaks and the sender can overrun the receiver, which will cause the TCP window to shut down. This is likely to happen if the sending host is faster than the receiving host. Overly large windows on the sending side is not usually a problem as long as you have excess memory.

The optimal buffer size is twice the bandwidth*delay product of the link:

buffer size = 2 * bandwidth * delay
The ping program can be used to get the delay. Determining the end-to-end capacity (the bandwidth of the slowest hop in your path) is trickier, and may require you to ask around to find out the capacity of various networks in the path. Tools such as pathrate will also give you an estimate of the network capacity. Since ping gives the round trip time (RTT), this formula can be used instead of the previous one:

buffer size = bandwidth * RTT
For example, if your ping time is 50 ms, and the end-to-end network consists of all 1G or 10G Ethernet, the TCP buffers should be:

.05 sec * (1 Gbit / 8 bits) = 6.25 MBytes.
Historically in order get full bandwidth required the the user to specify the buffer size for the network path being used, and the the application programmer had to set use the SO_SNDBUF and SO_RCVBUF options of the BSD setsockopt() call to set the buffer size for the sender an receiver. Luckily Linux, FreeBSD, Windows, and Mac OSX all now support TCP autotuning, so you no longer need to worry about setting the default buffer sizes

Linux Networking Configurations +
NIC Tuning

These can be added to /etc/rc.local to get run at boot time.

# increase txqueuelen for 10G NICS
/sbin/ifconfig eth2 txqueuelen 10000

Under LInux, For TCP/IP one may change any parameters using a command such as

% echo "10" > /proc/sys/net/ipv4/tcp_fin_timeout
% /etc/rc.d/init.d/network restart

It would set the TCP/FIN timeout to 10 seconds instead of its default 60 seconds.

You can also adjust the parameters in /etc/sysctl.conf ; in our example, the parameter is net.ipv4.tcp_fin_timeout.

sysctl -w net.ipv4.tcp_congestion_control=cubic
(Background: TCP Congestion Avoidance Algorithms
The TCP reno congestion avoidance algorithm was the default in all TCP implementations for many years. However, as networks got faster and faster it became clear that reno would not work well for high bandwidth delay product networks. To address this a number of new congestion avoidance algorithms were developed, including:

reno: Traditional TCP used by almost all other operating systems. (default)
cubic: CUBIC-TCP
bic: BIC-TCP
htcp: Hamilton TCP
vegas: TCP Vegas
westwood: optimized for lossy networks
Most Linux distributions now use cubic by default, and Windows now uses compound tcp. If you are using an older version of Linux, be sure to change the default from reno to cubic or htcp.)

List of TCP adjustable parameters and meaning
The below listing was extracted from this link.

/proc/sys/net/ipv4/* Variables:

ip_forward - BOOL : Forward Packets between interfaces.
0 - disabled (default)
not 0 - enabled
This variable is special, its change resets all configuration
parameters to their default state (RFC1122 for hosts, RFC1812
for routers)

ip_default_ttl - INT : used in IP Header
default 64

ip_no_pmtu_disc - BOOL
Disable Path MTU Discovery.
default FALSE

IP Fragmentation Related +
ipfrag_high_thresh - INT
Maximum memory used to reassemble IP fragments. When
ipfrag_high_thresh bytes of memory is allocated for this purpose,
the fragment handler will toss packets until ipfrag_low_thresh
is reached.

ipfrag_low_thresh - INT
See ipfrag_high_thresh

ipfrag_time - INT
Time in seconds to keep an IP fragment in memory.

INET peer storage Related +
inet_peer_threshold - INT
The approximate size of the storage. Starting from this threshold
entries will be thrown aggressively. This threshold also determines
entries' time-to-live and time intervals between garbage collection
passes. More entries, less time-to-live, less GC interval.

inet_peer_minttl - INT
Minimum time-to-live of entries. Should be enough to cover fragment
time-to-live on the reassembling side. This minimum time-to-live is
guaranteed if the pool size is less than inet_peer_threshold.
Measured in jiffies.

inet_peer_maxttl - INT
Maximum time-to-live of entries. Unused entries will expire after
this period of time if there is no memory pressure on the pool (i.e.
when the number of entries in the pool is very small).
Measured in jiffies.

inet_peer_gc_mintime - INT
Minimum interval between garbage collection passes. This interval is
in effect under high memory pressure on the pool.
Measured in jiffies.

inet_peer_gc_maxtime - INT
Minimum interval between garbage collection passes. This interval is
in effect under low (or absent) memory pressure on the pool.
Measured in jiffies.

TCP variables +
tcp_syn_retries - INT
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 255. Default value
is 5, which corresponds to ~180seconds.

tcp_synack_retries - INT
Number of times SYNACKs for a passive TCP connection attempt will
be retransmitted. Should not be higher than 255. Default value
is 5, which corresponds to ~180seconds.

tcp_keepalive_time - INT
How often TCP sends out keepalive messages when keepalive is enabled.
Default: 2hours.

tcp_keepalive_probes - INT
How many keepalive probes TCP sends out, until it decides that the
connection is broken. Default value: 9.

tcp_keepalive_interval - INT
How frequently the probes are send out. Multiplied by
tcp_keepalive_probes it is time to kill not responding connection,
after probes started. Default value: 75sec i.e. connection
will be aborted after ~11 minutes of retries.

tcp_retries1 - INT
How many times to retry before deciding that something is wrong
and it is necessary to report this suspection to network layer.
Minimal RFC value is 3, it is default, which corresponds
to ~3sec-8min depending on RTO.

tcp_retries2 - INT
How may times to retry before killing alive TCP connection.
RFC1122 says that the limit should be longer than 100 sec.
It is too small number. Default value 15 corresponds to ~13-30min
depending on RTO.

tcp_orphan_retries - INT
How may times to retry before killing TCP connection, closed
by our side. Default value 7 corresponds to ~50sec-16min
depending on RTO. If you machine is loaded WEB server,
you should think about lowering this value, such sockets
may consume significant resources. Cf. tcp_max_orphans.

tcp_fin_timeout - INT
Time to hold socket in state FIN-WAIT-2, if it was closed
by our side. Peer can be broken and never close its side,
or even died unexpectedly. Default value is 60sec.
Usual value used in 2.2 was 180 seconds, you may restore
it, but remember that if your machine is even underloaded WEB server,
you risk to overflow memory with kilotons of dead sockets,
FIN-WAIT-2 sockets are less dangerous than FIN-WAIT-1,
because they eat maximum 1.5K of memory, but they tend
to live longer. Cf. tcp_max_orphans.

tcp_max_tw_buckets - INT
Maximal number of timewait sockets held by system simultaneously.
If this number is exceeded time-wait socket is immediately destroyed
and warning is printed. This limit exists only to prevent
simple DoS attacks, you _must_ not lower the limit artificially,
but rather increase it (probably, after increasing installed memory),
if network conditions require more than default value.

tcp_tw_recycle - BOOL
Enable fast recycling TIME-WAIT sockets. Default value is 1.
It should not be changed without advice/request of technical
experts.

tcp_max_orphans - INT
Maximal number of TCP sockets not attached to any user file handle,
held by system. If this number is exceeded orphaned connections are
reset immediately and warning is printed. This limit exists
only to prevent simple DoS attacks, you _must_ not rely on this
or lower the limit artificially, but rather increase it
(probably, after increasing installed memory),
if network conditions require more than default value,
and tune network services to linger and kill such states
more aggressively. Let me to remind again: each orphan eats
up to ~64K of unswappable memory.

tcp_abort_on_overflow - BOOL
If listening service is too slow to accept new connections,
reset them. Default state is FALSE. It means that if overflow
occurred due to a burst, connection will recover. Enable this
option _only_ if you are really sure that listening daemon
cannot be tuned to accept connections faster. Enabling this
option can harm clients of your server.

tcp_syncookies - BOOL
Only valid when the kernel was compiled with CONFIG_SYNCOOKIES
Send out syncookies when the syn backlog queue of a socket
overflows. This is to prevent against the common 'syn flood attack'
Default: FALSE

Note, that syncookies is fallback facility.
It MUST NOT be used to help highly loaded servers to stand
against legal connection rate. If you see synflood warnings
in your logs, but investigation shows that they occur
because of overload with legal connections, you should tune
another parameters until this warning disappear.
See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.

syncookies seriously violate TCP protocol, do not allow
to use TCP extensions, can result in serious degradation
of some services (f.e. SMTP relaying), visible not by you,
but your clients and relays, contacting you. While you see
synflood warnings in logs not being really flooded, your server
is seriously misconfigured.

tcp_stdurg - BOOL
Use the Host requirements interpretation of the TCP urg pointer field.
Most hosts use the older BSD interpretation, so if you turn this on
Linux might not communicate correctly with them.
Default: FALSE

tcp_max_syn_backlog - INT
Maximal number of remembered connection requests, which are
still did not receive an acknowledgement from connecting client.
Default value is 1024 for systems with more than 128Mb of memory,
and 128 for low memory machines. If server suffers of overload,
try to increase this number. Warning! If you make it greater
than 1024, it would be better to change TCP_SYNQ_HSIZE in
include/net/tcp.h to keep TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog
and to recompile kernel.

tcp_window_scaling - BOOL
Enable window scaling as defined in RFC1323.

tcp_timestamps - BOOL
Enable timestamps as defined in RFC1323.

tcp_sack - BOOL
Enable select acknowledgments (SACKS).

tcp_fack - BOOL
Enable FACK congestion avoidance and fast restransmission.
The value is not used, if tcp_sack is not enabled.

tcp_dsack - BOOL
Allows TCP to send "duplicate" SACKs.

tcp_ecn - BOOL
Enable Explicit Congestion Notification in TCP.

tcp_reordering - INT
Maximal reordering of packets in a TCP stream.
Default: 3

tcp_retrans_collapse - BOOL
Bug-to-bug compatibility with some broken printers.
On retransmit try to send bigger packets to work around bugs in
certain TCP stacks.

tcp_wmem - vector of 3 INTs: min, default, max
min: Amount of memory reserved for send buffers for TCP socket.
Each TCP socket has rights to use it due to fact of its birth.
Default: 4K

default: Amount of memory allowed for send buffers for TCP socket
by default. This value overrides net.core.wmem_default used
by other protocols, it is usually lower than net.core.wmem_default.
Default: 16K

max: Maximal amount of memory allowed for automatically selected
send buffers for TCP socket. This value does not override
net.core.wmem_max, "static" selection via SO_SNDBUF does not use this.
Default: 128K

tcp_rmem - vector of 3 INTs: min, default, max
min: Minimal size of receive buffer used by TCP sockets.
It is guaranteed to each TCP socket, even under moderate memory
pressure.
Default: 8K

default: default size of receive buffer used by TCP sockets.
This value overrides net.core.rmem_default used by other protocols.
Default: 87380 bytes. This value results in window of 65535 with
default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
less for default tcp_app_win. See below about these variables.

max: maximal size of receive buffer allowed for automatically
selected receiver buffers for TCP socket. This value does not override
net.core.rmem_max, "static" selection via SO_RCVBUF does not use this.
Default: 87380*2 bytes.

tcp_mem - vector of 3 integer values: min, pressure, max
low: below this number of pages TCP is not bothered about its
memory appetite.

pressure: when amount of memory allocated by TCP exceeds this number
of pages, TCP moderates its memory consumption and enters memory
pressure mode, which is exited when memory consumtion falls
under "low".

high: number of pages allowed for queueing by all TCP sockets.

Defaults are calculated at boot time from amount of available
memory.

tcp_app_win - INT
Reserve max(window/2^tcp_app_win, mss) of window for application
buffer. Value 0 is special, it means that nothing is reserved.
Default: 31

tcp_adv_win_scale - INT
Count buffering overhead as bytes/2^tcp_adv_win_scale
(if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale),
if it is <= 0.
Default: 2

ip_local_port_range - 2 INTS
Defines the local port range that is used by TCP and UDP to
choose the local port. The first number is the first, the
second the last local port number. Default value depends on
amount of memory available on the system:
> 128Mb 32768-61000
< 128Mb 1024-4999 or even less.
This number defines number of active connections, which this
system can issue simultaneously to systems not supporting
TCP extensions (timestamps). With tcp_tw_recycle enabled
(i.e. by default) range 1024-4999 is enough to issue up to
2000 connections per second to systems supporting timestamps.

icmp_echo_ignore_all - BOOL
icmp_echo_ignore_broadcasts - BOOL
If either is set to true, then the kernel will ignore either all
ICMP ECHO requests sent to it or just those to broadcast/multicast
addresses, respectively.

icmp_destunreach_rate - INT
icmp_paramprob_rate - INT
icmp_timeexceed_rate - INT
icmp_echoreply_rate - INT (not enabled per default)
Limit the maximal rates for sending ICMP packets to specific targets.
0 to disable any limiting, otherwise the maximal rate in jiffies(1)
See the source for more information.

icmp_ignore_bogus_error_responses - BOOL
Some routers violate RFC 1122 by sending bogus responses to broadcast
frames. Such violations are normally logged via a kernel warning.
If this is set to TRUE, the kernel will not give such warnings, which
will avoid log file clutter.
Default: FALSE

(1) Jiffie: internal timeunit for the kernel. On the i386 1/100s, on the
Alpha 1/1024s. See the HZ define in /usr/include/asm/param.h for the exact
value on your system.

conf/interface/*:
conf/all/* is special and changes the settings for all interfaces.
To Change special settings per interface.

log_martians - BOOL
Log packets with impossible addresses to kernel log.

accept_redirects - BOOL
Accept ICMP redirect messages.
default TRUE (host)
FALSE (router)

forwarding - BOOL
Enable IP forwarding on this interface.

mc_forwarding - BOOL
Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE
and a multicast routing daemon is required.

proxy_arp - BOOL
Do proxy arp.

shared_media - BOOL
Send(router) or accept(host) RFC1620 shared media redirects.
Overrides ip_secure_redirects.
default TRUE

secure_redirects - BOOL
Accept ICMP redirect messages only for gateways,
listed in default gateway list.
default TRUE

send_redirects - BOOL
Send redirects, if router. Default: TRUE

bootp_relay - BOOL
Accept packets with source address 0.b.c.d destined
not to this host as local ones. It is supposed, that
BOOTP relay daemon will catch and forward such packets.

default FALSE
Not Implemented Yet.

accept_source_route - BOOL
Accept packets with SRR option.
default TRUE (router)
FALSE (host)

rp_filter - BOOL
1 - do source validation by reversed path, as specified in RFC1812
Recommended option for single homed hosts and stub network
routers. Could cause troubles for complicated (not loop free)
networks running a slow unreliable protocol (sort of RIP),
or using static routes.

0 - No source validation.

Default value is 0. Note that some distributions enable it
in startip scripts.r "http://books.google.com/books?id=ptSC4LpwGA0C&pg=PA228&lpg=PA228&dq=sctp+max+burst&source=bl&ots=Kq4GLldqSs&sig=Xz

19MnduKpD2wix8u2R0-QRlhQI&hl=en&ei=swxzStf0N8eAkQWUrYycDA&sa=X&oi=book_result&ct=result&resnum=6#v=onepage&q=sctp%

20max%20burst&f=false

http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch14_:_Linux_Firewalls_Using_iptables

http://linuxreviews.org/man/iptables/"

Mind hover

Search This Blog

TCP IP Topics

Comments

Popular posts from this blog

welcome

SCTP - A new transport protocol

NSSF - an 5G network function to support the network slicing