Imagined Cities


  Connection Timer
  Retransmission Timer
  Delayed ACK Timer
  Persist Timer
  Keepalive Timer
  FIN_WAIT_2 Timer
  TIME_WAIT Timer


  Nagle Algorithm
  Sliding Window
  Fast Sender Example
  Silly Window Syndrome
  Slow Start
  Congestion Window
  TCP segment
  Max Segment Size
  Urgent Mode
  Congestion Avoidance
  Fast Retransmit
  Fast Recovery


  Long Fat Network (Elephant)

  Out-of-band Concept
  Explicit Congestion Notification (ECN)

  Bandwidth-delay product

  Fields of TCP Options

All About TCP
W. Richard Stevens
Van Jacobson
Sally Floyd
TCP Sockets
TCP Protocol
TCP Security
TCP Mechanisms







Connection Timer
  To establish a new TCP session, a SYN segment is sent. The connection timer starts
when the SYN is sent. The timeout value is 75 seconds. If no response is received
within this time, the connection attempt is aborted.


Retransmission Timer
  TCP retransmits the data if this timer expires before the data is acknowledged.
The timeout value is calculated dynamically from the round-trip time (RTT) measured
by TCP for this connection. The value is bounded between 1 and 64 seconds.
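
The dynamic calculation can be sketched as follows. This is only an illustration: it assumes the common smoothed-RTT estimator with mean deviation (gains of 1/8 and 1/4, timeout = SRTT + 4 x RTTVAR), which the text above does not spell out. Only the 1 to 64 second bounds come from the text, and the class name RtoEstimator is hypothetical.

```python
class RtoEstimator:
    """Sketch of an RTT-based retransmission timeout, clamped to [1, 64] s."""

    MIN_RTO = 1.0   # lower bound from the text, in seconds
    MAX_RTO = 64.0  # upper bound from the text, in seconds

    def __init__(self):
        self.srtt = None    # smoothed round-trip time
        self.rttvar = None  # smoothed mean deviation of the RTT

    def update(self, rtt):
        """Feed one RTT measurement (seconds) and return the new timeout."""
        if self.srtt is None:
            # First measurement: assumed initialisation of both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Update the deviation first, using the previous smoothed RTT.
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        rto = self.srtt + 4 * self.rttvar
        return min(self.MAX_RTO, max(self.MIN_RTO, rto))
```

On a fast LAN the raw estimate is far below one second, so the lower bound is what actually applies. On a very slow path the upper bound caps the wait.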


Delayed ACK Timer
  TCP waits up to 200 ms before sending an ACK. If there is data to send during this
period, TCP sends the ACK along with that data (piggybacking).

Persist Timer
  The receiver can advertise a window size of zero as a flow-quenching mechanism.
It later "opens" the window by sending a further ACK with the updated window value.
This ACK is known as a window update. A particular problem arises if the window
update is lost. It is handled by the sender, which sends periodic probes whenever
the persist timer expires.

The timer value is increased exponentially and is kept within the interval [5, 60]
seconds.
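
A rough sketch of this bounded exponential growth is shown below. It assumes, for simplicity, that the first interval equals the lower bound and that each later interval doubles. Real implementations may derive the starting value from the retransmission timeout instead. The function name persist_intervals is hypothetical.

```python
def persist_intervals(count, first=5.0, limit=60.0):
    """Return `count` probe intervals: doubling from `first`, capped at `limit`."""
    intervals = []
    value = first
    for _ in range(count):
        intervals.append(min(value, limit))
        value *= 2  # exponential increase between probes
    return intervals


print(persist_intervals(6))  # [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]
```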


Keepalive Timer
  A process can enable this timer with the socket option SO_KEEPALIVE. If the
connection is idle for 2 hours, the keepalive timer expires and TCP sends probes.
If no response is received, TCP assumes that the other end has crashed.
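
As a small illustration, the option can be switched on through the standard sockets API. The sketch below uses Python's socket module on an unconnected socket, which is enough to show the call. The variable names are arbitrary.

```python
import socket

# Create a TCP socket and ask the kernel to send keepalive probes on it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Read the option back to confirm that it is now enabled.
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", keepalive_on)
sock.close()
```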


FIN_WAIT_2 Timer
  The purpose of this timer is to avoid leaving a connection in the FIN_WAIT_2 state
forever if the other end has crashed. This timer consists of two parts: 10 minutes
followed by 75 seconds.


TIME_WAIT Timer
  The reason for this timeout is that, while the local side of the connection has sent
an ACK in response to the other side's FIN segment, it does not know that the ACK was
successfully delivered. The other side might therefore retransmit its FIN segment, and
this second FIN segment might be delayed in the network. If the connection were allowed
to move directly to the CLOSED state, another pair of application processes might come
along and open the same connection, and the delayed FIN segment from the earlier
incarnation of the connection would immediately initiate the termination of the later
incarnation of that connection.
This timeout is set to 2MSL (twice the maximum segment lifetime).

Common implementation values of MSL are 30 seconds, 1 minute or 2 minutes.





Elephant
Sometimes the buffer space (sliding window) is too small. This happens in networks
with large bandwidth-delay products. The simplest solution is to increase the buffer,
but in extreme cases the protocol itself becomes the bottleneck (because it doesn't
support a large enough window size). Under these conditions, the network is termed an
LFN (Long Fat Network, pronounced "elephant").
   TCP Extensions for Long-Delay Paths, V. Jacobson, R. Braden, RFC 1072, 1988


Out-of-band Concept
Many implementations incorrectly call urgent mode out-of-band data. If an application
really wants a separate out-of-band channel, a second connection is the easiest way to
accomplish this (it could be used for out-of-band signalling).
The confusion between TCP urgent mode and out-of-band data also arises because the
predominant sockets API maps urgent mode into what sockets calls out-of-band data.
What is urgent mode used for? The two most commonly used applications are Telnet and
Rlogin, when the interactive user types the interrupt key. Another is FTP, when the
interactive user aborts a file transfer.
There are two concepts:
1) Out-of-band signalling (i.e., control)
2) Out-of-band data
TCP is a data-transport protocol that does not offer signalling. Thus, it cannot
provide out-of-band signalling. The question is whether TCP's urgent data should be
considered out-of-band. In fact, the early interpretation of urgent data was indeed
out-of-band: a receiving TCP placed standard data in one queue and urgent data in
another.
   Out-of-Band Control Signals in a Host-to-Host Protocol, Larry Garlick,
RFC 721, 1976

 To send an out-of-band message, the MSG_OOB flag is supplied to a send call. The DoS
attack WinNuke used an out-of-band message. Details are in the article
"Why WinNuking is Lame?" by Dan Finkelstein.

Explicit Congestion Notification

Explicit Congestion Notification (ECN) allows routers to notify clients about
network congestion, resulting in fewer dropped packets and increased network
performance. ECN is supported in the Linux kernel. How to use ECN with retransmitted
data on an ECN-capable TCP connection is discussed in:

TCP with ECN: The Treatment of Retransmitted Data Packets, Sally Floyd,
K. K. Ramakrishnan, 2001


Bandwidth-Delay Product

The capacity of a connection is calculated as
   capacity (bits) = bandwidth (bits/sec) x round-trip time (sec)
and is called the bandwidth-delay product.
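
The formula is simple enough to check by hand, but a short sketch makes the scale visible. The link speeds and round-trip time below are illustrative values, not measurements from this page. The 65535-byte figure is the largest window that fits in the 16-bit window field of the TCP header.

```python
def bandwidth_delay_product(bandwidth_bps, rtt_sec):
    """Capacity of the path in bits: bandwidth (bits/sec) x round-trip time (sec)."""
    return bandwidth_bps * rtt_sec


MAX_UNSCALED_WINDOW = 65535  # bytes, limit of the 16-bit window field

# Two hypothetical links with the same 60 ms round-trip time.
for name, bps in (("1.544 Mbit/s link", 1_544_000), ("45 Mbit/s link", 45_000_000)):
    capacity_bytes = bandwidth_delay_product(bps, 0.060) / 8
    fits = capacity_bytes <= MAX_UNSCALED_WINDOW
    print(name, round(capacity_bytes), "bytes, fits in 16-bit window:", fits)
```

When the capacity exceeds the largest window, the sender cannot keep the pipe full, which is the situation described in the Elephant section.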
Bandwidth (bits/sec) and latency (ms) are the basic network performance
characteristics.
   Bandwidth vs. Latency, Brent Chapman
Latency (delay) is the time that data takes to travel from one point to another.
In the computer world bandwidth is king, because it is easy to increase bandwidth
and hard to decrease latency.
   The Firefighter Problem: understanding latency and bandwidth, Rick Russel


TCP Options
The length of the TCP header without options is 20 bytes. The TCP options, if any,
start at the 21st byte:
   End of option list
   No Operation
   Maximum Segment Size
   Window Scale Factor
   Timestamp
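
The layout of these options can be illustrated with a small parser. The option kind numbers used here (0, 1, 2, 3 and 8) are the standard assigned values and are not given in the text above. The function name parse_tcp_options and the sample bytes are hypothetical.

```python
import struct

OPTION_NAMES = {
    0: "End of option list",
    1: "No Operation",
    2: "Maximum Segment Size",
    3: "Window Scale Factor",
    8: "Timestamp",
}


def parse_tcp_options(data):
    """Split the option bytes that follow the 20-byte header into (kind, value) pairs."""
    options = []
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == 0:            # End of option list: a single byte, stop parsing
            options.append((kind, b""))
            break
        if kind == 1:            # No Operation: a single padding byte
            options.append((kind, b""))
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option")
        length = data[i + 1]     # length covers the kind and length bytes as well
        if length < 2 or i + length > len(data):
            raise ValueError("bad option length")
        options.append((kind, data[i + 2:i + length]))
        i += length
    return options


# Sample: MSS 1460, one NOP for alignment, window scale 7, end of list.
raw = bytes([2, 4]) + struct.pack("!H", 1460) + bytes([1, 3, 3, 7, 0])
parsed = parse_tcp_options(raw)
for kind, value in parsed:
    print(OPTION_NAMES.get(kind, "unknown"), value.hex())
```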

   
Algorithm, in brief, with purpose and note:

Nagle Algorithm
   Briefly: Small segments cannot be sent until the outstanding data is acknowledged.
   Purpose: Prevent tinygram congestion.
   Note:    Enabled: Telnet, Rlogin. Disabled: X protocol.

Sliding Window
   Briefly: The receiver sends the ACK with the advertised window size.
   Purpose: Avoid overflowing the buffer.
   Note:    A fast sender can be stopped by an advertised window of 0.

Slow Start
   Briefly: The sender sets the congestion window (cwnd): 1, 2, 4, ... If a router
            starts discarding packets, cwnd is decreased.
   Purpose: Avoid network congestion.
   Note:    The sender can transmit up to the minimum of the congestion window and
            the advertised window.

Congestion Avoidance
   Briefly: Growth of cwnd is additive. Indications of packet loss: timeout or
            duplicate ACKs.
   Purpose: Avoid lost packets.
   Note:    Congestion Avoidance and Slow Start are independent algorithms. When
            congestion occurs, the transmission rate of packets should be slowed down.

Fast Retransmit
   Briefly: If three or more duplicate ACKs are received in a row, it is a strong
            indication that a segment has been lost.
   Purpose: Retransmit the missing segment without waiting for the retransmission
            timer to expire.
   Note:    If only 1 or 2 duplicate ACKs are received in a row, it indicates that
            segments were merely reordered.

Fast Recovery
   Briefly: After fast retransmit, congestion avoidance, but not slow start, is
            performed.
   Purpose: Do not reduce the flow suddenly by going into slow start.
   Note:    The third duplicate ACK is an indication of segment loss.

Nagle Algorithm
  This algorithm says that when a TCP connection has outstanding data that has not
yet been acknowledged, small segments cannot be sent until the outstanding data is
acknowledged. In the meantime, small amounts of data are collected by TCP and sent
in a single segment.
   This TCP mechanism was described by John Nagle in
"Congestion Control in IP/TCP Internetworks", RFC 896, 1984.
    Sometimes the Nagle algorithm needs to be turned off. For example, in the X Window
System small messages (mouse movements) must be delivered without delay
to provide real-time feedback for the interactive user.
   The Nagle algorithm is enabled by default for Telnet and Rlogin sessions.
   Telnet example: a test over a long-haul link with a 5-second round-trip time. The
user types 25 characters at 200 ms intervals. Without any mechanism to prevent
small-packet (tinygram) congestion, 25 new packets would be sent in 5 seconds. The
amount of data sent is 41 x 25 bytes (each packet carries 1 byte of data, a 20-byte
IP header and a 20-byte TCP header). The overhead is 4000%.
   With the Nagle algorithm, however, the first character from the user is sent
immediately. The next 24 characters, arriving from the user at 200 ms intervals, are
queued. When the ACK for the first packet arrives at the end of 5 seconds, a single
packet with the 24 queued characters is sent, i.e. 40 x 2 + 25 bytes are sent in total.
The overhead is only 320%, with no penalty in response time.
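
The two overhead figures in this example can be reproduced with a few lines of arithmetic. The helper name overhead_percent is hypothetical. Overhead is taken here as header bytes relative to user data bytes.

```python
HEADER = 40  # 20-byte IP header + 20-byte TCP header per packet


def overhead_percent(packets, data_bytes):
    """Header bytes sent, as a percentage of the user data carried."""
    return 100 * packets * HEADER / data_bytes


without_nagle = overhead_percent(packets=25, data_bytes=25)  # one packet per character
with_nagle = overhead_percent(packets=2, data_bytes=25)      # 1 character, then 24 queued
print(without_nagle, with_nagle)  # 4000.0 320.0
```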
   The Nagle algorithm is useful on a slow WAN, where reducing tinygram congestion
is desirable.
    A user application can disable the Nagle algorithm with the socket API option
TCP_NODELAY.
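
A minimal sketch of disabling the algorithm from an application, using Python's socket module on a socket that is not yet connected. The variable names are arbitrary.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# TCP_NODELAY is set at the TCP level; a non-zero value disables the Nagle algorithm.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

nagle_disabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print("Nagle disabled:", nagle_disabled)
sock.close()
```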


Sliding Window
  Each endpoint of a TCP connection has a buffer for storing data that arrives
over the network before the application is ready to read it. To avoid overflowing
the buffer, the TCP receiver sets an advertised window size field in each packet it
transmits. This field contains the amount of data that may be transmitted into the
buffer. If this number falls to zero, the remote TCP can send no more data. It must
wait until buffer space becomes available and it receives a packet announcing a non-zero
window size.
    TCP uses a 32-bit sequence number that counts bytes in the data stream.
Each TCP packet contains the starting sequence number of the data in that packet,
and a 32-bit acknowledgment number giving the next byte expected from the remote peer.
   With this information, a sliding-window protocol is implemented. Forward and
reverse sequence numbers are completely independent, and each TCP peer must
track both its own sequence numbering and the numbering being used by the remote peer.


Fast Sender Example:
  The sender is faster than the receiver. After transmitting some packets the sender
stops and waits for the ACK. The receiver sends the ACK with an advertised window
of 0, as mentioned under the Persist Timer. This means that the receiver
has all the data, but it is all in the receiver's TCP buffers and the application has
not had a chance to read it. Another ACK (called a window update) is sent later,
announcing that the receiver can now receive another buffer.
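
The sequence of advertised windows in this example can be imitated with a toy model of the receive buffer. The class name Receiver, the 4096-byte buffer and the 1024-byte segments are all assumptions chosen for illustration.

```python
class Receiver:
    """Toy receive buffer that reports the window it would advertise."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.buffered = 0  # bytes received but not yet read by the application

    def window(self):
        return self.buffer_size - self.buffered

    def receive(self, nbytes):
        """Accept a segment and return the window advertised in the ACK."""
        if nbytes > self.window():
            raise ValueError("segment does not fit in the advertised window")
        self.buffered += nbytes
        return self.window()

    def application_read(self, nbytes):
        """The application drains the buffer; return the window update."""
        self.buffered -= min(nbytes, self.buffered)
        return self.window()


rx = Receiver(4096)
windows = [rx.receive(1024) for _ in range(4)]  # fast sender, idle application
update = rx.application_read(4096)              # application finally reads
print(windows, update)  # [3072, 2048, 1024, 0] 4096
```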
   Some analysis is discussed in: Sliding Window Protocol. Experiments and analysis
The case when the network's bandwidth-delay product exceeds the buffer size is called
an elephant.


Silly Window Syndrome
  If a receiver advertises a small window, perhaps due to congestion, then the
sender may split its data into small segments and send them. These segments may be
even smaller than the receiver's window.
   The overheads of this behaviour, known as the silly window syndrome, are
significant. Rules to avoid it:
   The receiver must not increase its advertised window size unless the increase
is either a full segment size or half of the receiver's buffer space, whichever is
smaller.
   The sender should also refrain from sending segments unless a) a full-size
segment can be sent, or b) a segment that can be sent is at least half the size of
the largest window ever advertised by the receiver, or c) all outstanding data can
be sent and either no ACK is expected or the Nagle algorithm is disabled.
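
The sender-side rules a), b) and c) can be written as a single predicate. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and `sendable` stands for the number of bytes the sender could put in a segment right now.

```python
def sender_may_send(sendable, mss, max_window_seen, queued, unacked,
                    nagle_enabled=True):
    """Return True if sending now would not create a silly small segment.

    sendable        -- bytes that could be sent in one segment right now
    mss             -- full segment size
    max_window_seen -- largest window the receiver has ever advertised
    queued          -- total bytes waiting to be sent
    unacked         -- bytes already sent and not yet acknowledged
    """
    full_segment = sendable >= mss                            # rule a)
    half_window = sendable >= max_window_seen / 2             # rule b)
    no_ack_expected = unacked == 0
    everything = sendable >= queued and (no_ack_expected or not nagle_enabled)  # rule c)
    return full_segment or half_window or everything


# 100 of 500 queued bytes fit, with data still unacknowledged: better to wait.
print(sender_may_send(100, 1460, 8192, queued=500, unacked=1000))  # False
```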
   Details of sender and receiver behavior may be found in Requirements for
Internet Hosts -- Communication Layers, R. Braden, RFC 1122, 1989
(sections 4.2.3.3 and 4.2.3.4)


Slow Start Algorithm
  This algorithm operates by observing that the rate at which new packets should
be injected into the network is the rate at which acknowledgments are returned by
the other end. Slow start adds another window to the TCP sender: the
congestion window, called "cwnd".

Who imposes            Sender                          Receiver
Flow control           Congestion window               Advertised window
Algorithm used         Slow start                      Sliding window
Growth                 Start, increase (in segments)   Decrease/increase
Based on               Network congestion              Buffer space

    When a new connection is established with a host on another network, the
congestion window is initialized to one segment.
    Each time an ACK is received, the congestion window is increased by
one segment. The sender can transmit up to the minimum of the congestion
window and the advertised window. The congestion window is flow
control imposed by the sender, while the advertised window is flow control
imposed by the receiver. The former is based on the sender's assessment of
perceived network congestion; the latter is related to the amount of available
buffer space at the receiver for this connection.
   Requirements for Internet Hosts -- Communication Layers, R. Braden,
RFC 1122, 1989

  TCP Slow Start, Congestion Avoidance, Fast Retransmit,
and Fast Recovery Algorithms, W.Stevens, RFC 2001, 1997


Congestion Window Growth
  The sender starts by transmitting one segment and waiting for its ACK.
When that ACK is received, the congestion window is incremented from one to two,
and two segments can be sent. When each of those two segments is acknowledged,
the congestion window is increased to four. This provides an exponential growth,
although it is not exactly exponential because the receiver may delay its ACKs,
typically sending one ACK for every two segments that it receives.
   At some point the capacity of the internet can be reached, and an intermediate
router will start discarding packets. This tells the sender that its congestion
window has gotten too large.
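
The growth pattern can be sketched in units of segments. The sketch assumes that every segment is acknowledged individually (no delayed ACKs) and that nothing is lost. The function name slow_start_rounds is hypothetical.

```python
def slow_start_rounds(rounds, advertised_window, initial_cwnd=1):
    """Return the number of segments sent in each round trip during slow start."""
    cwnd = initial_cwnd
    sent_per_round = []
    for _ in range(rounds):
        # The sender may transmit up to min(cwnd, advertised window).
        in_flight = min(cwnd, advertised_window)
        sent_per_round.append(in_flight)
        # One ACK per segment, and each ACK grows cwnd by one segment.
        cwnd += in_flight
    return sent_per_round


print(slow_start_rounds(6, advertised_window=16))  # [1, 2, 4, 8, 16, 16]
```

Once cwnd passes the advertised window, the receiver's window becomes the limit and the amount sent per round trip stops growing.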
   Early implementations performed slow start only if the other end was on a
different network. Current implementations always perform slow start.
   On starting slow start with as many as four packets (instead of the usual
one packet) see:
   Increasing TCP's Initial Window,
Allman, M., Floyd, S., and C. Partridge, RFC 2414, 1998


TCP Segment
  The application data (for example, a file sent by FTP) is broken into chunks,
which are sent by TCP. The unit of information passed by TCP to IP
is called a segment.
   When TCP receives one or more segments it sends an acknowledgment.
Since TCP segments are transmitted as IP datagrams and IP datagrams can arrive
out of order, TCP segments can arrive out of order. A TCP receiver resequences the
data if necessary.
IP datagram and TCP segment

TCP Header Format, RFC 793
UDP Header Format, RFC 768
IP Header Format, RFC 791


Maximum Segment Size
  The maximum segment size (MSS) is set when a TCP connection is established.
An MSS option can only appear in a SYN segment. If one end does not receive
an MSS option from the other end, a default of 536 bytes is assumed. The resulting
datagram is 40 bytes larger: 20 bytes for the IP header and 20 bytes for the TCP header.
Some systems, such as SunOS, Solaris and AIX, announce an MSS of 1460
(i.e. IP datagram = 1460+40 = 1500 bytes). The MSS is selected as the minimum
of the MSS sent from the other end and the MSS of the local end.
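
The selection rule and the datagram sizes quoted above can be checked with a short sketch. The function names are hypothetical, and `peer_mss=None` stands for the case where the other end sent no MSS option.

```python
DEFAULT_MSS = 536  # assumed when the other end sends no MSS option
HEADERS = 40       # 20-byte IP header + 20-byte TCP header


def effective_mss(local_mss, peer_mss=None):
    """MSS used on the connection: the smaller of the two ends' values."""
    if peer_mss is None:
        peer_mss = DEFAULT_MSS
    return min(local_mss, peer_mss)


def datagram_size(mss):
    """Size of the IP datagram that carries one full segment."""
    return mss + HEADERS


print(datagram_size(effective_mss(1460, 1460)))  # 1500
print(datagram_size(effective_mss(1460)))        # 576
```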


Urgent Mode
  TCP provides what is called urgent mode, allowing one end to tell the other
end that "urgent data" of some form has been placed into the normal stream
of data. It's up to the receiving end to decide what to do.
   The notification from one end to the other that urgent data exists in the
data stream is done by setting two fields in the TCP header. The URG bit is turned
on and the 16-bit urgent pointer is set to a positive offset that must be added
to the sequence number field in the TCP header to obtain the sequence number
of the last byte of urgent data.
   TCP itself says little more about urgent data. There is no way to specify where
the urgent data starts in the data stream. The only information sent across the
connection by TCP is that urgent mode has begun and the pointer to the last byte
of urgent data. Everything else is left to the application.
 TCP, DARPA Internet Program, Protocol Specification, RFC 793, Information
Sciences Institute, University of Southern California, 1981

 Telnet Protocol Specification, J. Postel, J. Reynolds, RFC 854, 1983
 BSD Rlogin, B. Kantor, RFC 1258, 1991
    The out-of-band concept is closely connected with urgent mode.


Congestion Avoidance
  Congestion Avoidance is a way to deal with lost packets. There are two
indications of packet loss: a timeout occurring and the receipt of duplicate ACKs.
   Congestion Avoidance and slow start are independent algorithms. When
congestion occurs, the transmission rate of packets should be slowed down.
   Congestion Avoidance and slow start require two variables to be maintained for
each connection: a congestion window, cwnd, and a slow start
threshold size, ssthresh. Algorithm steps:
   1. Initialization for a given connection sets cwnd to 1 segment and ssthresh
to 65535 bytes.
   2. The TCP output routine never sends more than the minimum of cwnd and
the receiver's advertised window.
   3. When congestion occurs (timeout or the reception of duplicate ACKs),
one-half of the current window size (the minimum of cwnd and the receiver's
advertised window, but at least two segments) is saved in ssthresh.
Additionally, if the congestion is indicated by a timeout, cwnd is set to
one segment (i.e., slow start).
   4. When new data is acknowledged by the other end, cwnd is increased, but the way
it increases depends on whether TCP is performing slow start or congestion
avoidance.
   
Algorithm          Slow Start                 Congestion Avoidance
When to run        cwnd <= ssthresh           cwnd > ssthresh
Window growth      by 1 segment per ACK       by 1/cwnd per ACK
Rate of growth     Exponential                Additive
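
Steps 1 to 4 and the table can be combined into a small state holder. This is a sketch, not any particular implementation: the class name CongestionState, the 512-byte segment size and the integer arithmetic are assumptions, and windows are kept in bytes.

```python
class CongestionState:
    """Sketch of the cwnd/ssthresh bookkeeping described in steps 1 to 4."""

    def __init__(self, segment_size=512):
        self.segsize = segment_size
        self.cwnd = segment_size   # step 1: one segment
        self.ssthresh = 65535      # step 1: initial threshold in bytes

    def send_limit(self, advertised_window):
        # Step 2: never send more than min(cwnd, advertised window).
        return min(self.cwnd, advertised_window)

    def on_congestion(self, advertised_window, timeout):
        # Step 3: save half the current window, but at least two segments.
        window = min(self.cwnd, advertised_window)
        self.ssthresh = max(window // 2, 2 * self.segsize)
        if timeout:
            self.cwnd = self.segsize  # back to slow start

    def on_new_ack(self):
        # Step 4: growth depends on which algorithm is running.
        if self.cwnd <= self.ssthresh:
            self.cwnd += self.segsize                        # slow start
        else:
            self.cwnd += self.segsize * self.segsize // self.cwnd  # congestion avoidance
```

For example, with cwnd at 8192 bytes a timeout leaves ssthresh at 4096 and cwnd at 512. The next eight ACKs each add a full segment, and after that each ACK adds only a fraction of one.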

    The term slow start is not completely correct. It is a slower transmission of
packets than what caused the congestion, but the rate of growth in the number
of packets injected into the network increases during slow start. The rate of
growth doesn't slow down until ssthresh is reached, when congestion avoidance
takes over.


Fast Retransmit
  If an out-of-order segment is received, TCP sends an immediate acknowledgment
(a duplicate ACK).
This duplicate ACK should not be delayed. The purpose of this
duplicate ACK is to let the other end know that a segment was received out of order,
and to tell it what sequence number is expected.
  If three or more duplicate ACKs are received in a row, it is a strong indication
that a segment has been lost. TCP then retransmits the missing
segment without waiting for the retransmission timer to expire.
  If only one or two duplicate ACKs are received in a row, it is an indication that
segments have merely been reordered.
 Since TCP does not know whether a duplicate ACK is caused by a lost segment or
just a reordering of segments, it waits for a small number of duplicate ACKs to be
received.
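
The counting of duplicate ACKs can be sketched as follows. The class name DupAckCounter and the sample acknowledgment numbers are hypothetical. The sketch only decides when to retransmit and leaves out everything else a sender does.

```python
class DupAckCounter:
    """Signal a fast retransmit on the third duplicate ACK in a row."""

    THRESHOLD = 3

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_number):
        """Return True when the segment starting at ack_number should be resent."""
        if ack_number == self.last_ack:
            self.dup_count += 1
            # Fire exactly once, when the count reaches the threshold.
            return self.dup_count == self.THRESHOLD
        # A new acknowledgment number resets the count.
        self.last_ack = ack_number
        self.dup_count = 0
        return False


counter = DupAckCounter()
decisions = [counter.on_ack(a) for a in (1000, 2000, 2000, 2000, 2000, 3000)]
print(decisions)  # [False, False, False, False, True, False]
```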
  TCP Slow Start, Congestion Avoidance, Fast Retransmit,
and Fast Recovery Algorithms, W.Stevens, RFC 2001, 1997

   About Fast-Retransmit Algorithm


Fast Recovery
     After fast retransmit sends the missing segment, congestion avoidance, but not
slow start, is performed. This is the fast recovery algorithm. The reason for not
performing slow start in this case is that duplicate ACKs tell us more than just that
a packet has been lost. The receiver can only generate a duplicate ACK when another
segment is received, i.e. there is still data flowing between the two ends, and we
don't want to reduce the flow abruptly by going into slow start.
    The fast retransmit and fast recovery algorithms are usually implemented together
as follows:
   1. When the third duplicate ACK is received (segment loss), set ssthresh (the border
between slow start and congestion avoidance):
   ssthresh = 1/2 * minimum(cwnd, advertised window)
Retransmit the missing segment.
Set cwnd = ssthresh + 3 * segment size. This inflates the congestion window.
Since cwnd > ssthresh, the congestion avoidance algorithm is selected.
   2. Each time another duplicate ACK arrives, increment cwnd by
the segment size. This inflates the congestion window for the additional segment
that has left the network. Transmit a packet, if allowed by the new value of cwnd.
   3. When the next ACK arrives that acknowledges new data, set cwnd to ssthresh
(the value set in step 1). This ACK should be the acknowledgment of the retransmission
from step 1, one round-trip time after the retransmission. Additionally, this ACK should
acknowledge all the intermediate segments sent between the lost packet and the receipt
of the first duplicate ACK. This step is congestion avoidance, since TCP is down to
one-half the rate it was at when the packet was lost.
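
The three steps map directly onto three event handlers. The sketch below keeps windows in bytes. The class name FastRecovery, the starting values and the returned string are hypothetical, and actual retransmission is left to the caller.

```python
class FastRecovery:
    """Sketch of the window arithmetic in steps 1 to 3."""

    def __init__(self, cwnd, segment_size):
        self.cwnd = cwnd
        self.segsize = segment_size
        self.ssthresh = 65535
        self.in_recovery = False

    def on_third_duplicate_ack(self, advertised_window):
        # Step 1: halve the window, then inflate cwnd by three segments.
        self.ssthresh = min(self.cwnd, advertised_window) // 2
        self.cwnd = self.ssthresh + 3 * self.segsize
        self.in_recovery = True
        return "retransmit missing segment"

    def on_additional_duplicate_ack(self):
        # Step 2: another segment has left the network.
        self.cwnd += self.segsize

    def on_new_data_ack(self):
        # Step 3: deflate the window and continue in congestion avoidance.
        self.cwnd = self.ssthresh
        self.in_recovery = False


fr = FastRecovery(cwnd=8192, segment_size=512)
fr.on_third_duplicate_ack(advertised_window=65535)
print(fr.ssthresh, fr.cwnd)  # 4096 5632
```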
  TCP Slow Start, Congestion Avoidance, Fast Retransmit,
and Fast Recovery Algorithms, W.Stevens, RFC 2001, 1997







W. Richard Stevens

W. Richard Stevens' Homepage
Books and Papers of Richard Stevens, Papers of Others, IP Multicasting Information,
Biography

Funeral Notices
In Memoriam: W. Richard Stevens

TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery
Algorithms, W.Stevens, RFC 2001, 1997

Van Jacobson

TCP Extensions for Long-Delay Paths, V. Jacobson, R. Braden, RFC 1072, 1988

TCP Extensions for High Performance, V. Jacobson, R. Braden, D. Borman, RFC 1323, 1992
TCP extensions to improve performance over large bandwidth-delay product paths

Sally Floyd

TCP with ECN: The Treatment of Retransmitted Data Packets,
Sally Floyd, K. K. Ramakrishnan, 2001


Sally Floyd Homepage
Research scientist at ACIRI.
ACIRI is an AT&T-funded research institute at ICSI in Berkeley, California.

An Extension to the Selective Acknowledgement (SACK) Option for TCP,
S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky, RFC 2883

Discussion and Suggestions for Improvements. Selective Acknowledgement (SACK).

TCP Sockets

Beej's Guide to Network Programming. Using Internet Sockets     Simple Stream
Client/Server, select()--Synchronous I/O Multiplexing

Programming UNIX Sockets in C, FAQ, Vic Metcalfe, Andrew Gierth
Writing Client/Server Applications FAQ

Client/Server Model
Server code. Client code. Connectionless servers

BSD Sockets: A Quick And Dirty Primer, Jim Frost
New Phone, Dialing, Conversation, Hanging Up

TCP Protocol

An Overview of TCP/IP Protocols and the Internet, Gary C. Kessler
The TCP/IP Protocol Architecture, Network Interface Layer, Internet Layer,
Transport Layer, Application Layer

A Brief History of the Internet, Barry M. Leiner , Vinton G. Cerf , David D. Clark,
Robert E. Kahn, Leonard Kleinrock, Daniel C. Lynch, Jon Postel, Lawrence G. Roberts, Stephen Wolff.
Origins of the Internet, ARPANET,
DARPA - Defense Advanced Research Projects Agency.

Introduction to the Internet Protocols, Steven E. Newton
Domain System, Routing, Subnets and Broadcasting

Cisco - TCP/IP, OSI Reference Model
Relationship of the Internet Protocol Suite to the OSI Reference Model
Routing in IP Environments: Protocols RIP, IGRP, EGP, BGP

TCP, Transmission Control Protocol
TCP RFCs References

TCP Implementation (tcpimpl)
Proceedings of the 42nd IETF Meeting in Chicago, 1998
IETF - Internet Engineering Task Force

Advanced Topics of IPC Out-of-band. Non-blocking Sockets. Interrupt Driven Socket.
Pseudo Terminals. Address Binding. Broadcasting.
Socket Options. Three-way Handshake.

TCP Security

TCP SYN Flooding and IP Spoofing Attacks
Denial-of-Service attacks by creating TCP "half-open" connections.

Project Neptune. The comprehensive analysis of TCP SYN flooding
Comprehensive analysis of TCP SYN flooding. Trace, Discussion and Prevention

SYN Flood DoS Attack Experiments, Dan Forsberg
Patches and Information for many operating systems

SecurityTeam.com (Exploits). Vulnerability
Great Monthly Security Archive

TCP Mechanisms

Window and Acknowledgment Strategy in TCP, David D. Clark, RFC 813, 1982
Implementation strategies to deal with two mechanisms in TCP,
the window and the acknowledgement.

Transition of Internet Mail from Just-Send-8 to 8bit-SMTP/MIME, G. Vaudreuil,
RFC 1428, 1993
The interpretation of the "unknown-8bit" (out-of-band)
is up to the mail reader.

IAB Technical Comment on the Unique DNS Root, Internet Architecture Board,
RFC 2826, 2000

The DNS name space is a hierarchical name space derived from a single, globally unique root.
DNS - Domain Name Service, IAB - Internet Architecture Board

Home at LK.NET. Where cultures of the world converge.

Maintained by Rafael Stekolshchik       
klivlend1@yahoo.com
