Applications and the Speed of Light

How well do applications perform on long perfect networks?

This document describes some techniques for determining whether a network application is capable of performing well over a long perfect network in the presence of "speed of light" delays (also referred to as network latency or network round trip time (RTT)). We start by assuming that we can build a network that is ideal in every way (infinite bandwidth, no losses, etc.), except that it is subject to the inevitable finite data transmission delays. We find that even in such an ideal environment, many applications do not perform well because the application itself cannot overlap network delays with other processing, and is therefore subject to declining performance with increasing path length.

Introduction

Designing an application that can effectively overlap arbitrary network delays with other processing is a fairly hard problem. It is therefore not surprising that most applications do not perform well over long paths unless they are either very simple or explicitly designed to accommodate network delay.

For most end-users, the easiest way to determine whether an application is capable of performing well over a long path is to consider the application's reputation. If there are known examples of the application performing well over a long path, it is more likely that some other component, such as the network itself, is the bottleneck. If there are no known examples of the application performing well over a long path, the end-user has good reason to suspect that the application may have a problem.

Layer Diagram
Figure 1: Network layers. This document describes the effects of delay on Applications. For information on the effects of delay on the transport layer (TCP), see "Enabling High Performance Data Transfers" and "NPAD Diagnostics Servers".

We describe several techniques for testing performance of applications over long paths. Since network latency can impact both application performance and TCP performance (which in turn also affects application performance), an important aspect of these methods is that they isolate the impact of latency on the application performance from that of its impact on TCP performance. The impact of latency on TCP performance is addressed by other tools such as pathdiag.

Background

Over the last several years a huge amount of effort has been put into addressing the "end-to-end" network performance problem. Most of the tests emphasize the bandwidth-related issues of network and application performance. Most of them also tend to treat applications at one of two extremes: either the application is not considered at all, or it is assumed to have some complicated or subtle network requirements. In this document we take a slightly different stance: we assume the network is ideal and completely invisible in every way except for one parameter, transit delay, as determined by the physical length of the network path between the end-systems and the speed of light.

We find that even in this ideal environment, some applications may not perform well because the application itself cannot adequately overlap network delays with other processing, and as a consequence the network delays are added to the elapsed running time of the application. Under these conditions application performance intrinsically declines with increasing network delay, and in some cases the degradation may make the application impractical to use over a long network path. The examples below illustrate two cases: one in which network latency has not been addressed in the application, and another in which network latency is adequately addressed.

Applications that use middleware such as X11 fall in the class of applications that will suffer severe performance degradation in the presence of latency. A number of middleware libraries (such as X11) use a "remote procedure call" (RPC) communications model. In the traditional RPC model each procedure (function) call in the application is implemented by sending the arguments through the network to the remote end, which performs the requested procedure, and any results are returned to the application through the network. To the programmer, the remote procedure call has the same semantics as a local procedure call, except that it is performed on the remote system. The problem is that it is easy to overlook the time taken by the messages in the network. A simple application may do a thousand RPCs, which might not be noticeable across a campus local area network (LAN) because the total communications time will typically be under a second. However, the very same application will take more than a minute on a transcontinental path. For example, if the application does 1000 RPCs and the network round trip time is 60 milliseconds, then the communication adds a full minute to the application's elapsed time. While the X11 protocol itself does support multiple outstanding RPCs to compensate for delay, it is up to the libraries and applications built on top of it to make use of this pipelining. Most X11-based applications experience significant performance degradation with high delay. (For more on this, see "X Window System Network Performance".)
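
The arithmetic above can be written out as a short Python sketch. The function names, the 0.5 ms campus RTT, and the notion of a fixed "pipeline depth" (the number of RPCs kept outstanding at once) are illustrative assumptions, not part of X11 or any RPC library. Server processing time is taken to be zero, in keeping with the ideal network model.

```python
import math

def serial_rpc_delay(num_rpcs, rtt_s):
    """Network time when each RPC waits for the previous reply."""
    return num_rpcs * rtt_s

def pipelined_rpc_delay(num_rpcs, rtt_s, depth):
    """Network time when up to `depth` RPCs are outstanding at once.

    Simplified model (an assumption): requests go out in batches of
    `depth`, and each batch costs one round trip.
    """
    return math.ceil(num_rpcs / depth) * rtt_s

# Assumed RTTs: 0.5 ms for a campus LAN, 60 ms for a transcontinental path.
for label, rtt in [("campus LAN", 0.0005), ("transcontinental", 0.060)]:
    serial = serial_rpc_delay(1000, rtt)
    piped = pipelined_rpc_delay(1000, rtt, depth=50)
    print(f"{label}: serial {serial:.1f} s, 50 outstanding {piped:.2f} s")
```

In this model the same 1000 calls cost 60 seconds when issued one at a time over a 60 ms path, but only about 1.2 seconds if 50 are kept in flight, which is why pipelining matters so much more on long paths than on a LAN.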

The second example, one in which the latency issues of the network are addressed to some extent, can be illustrated by remote file system protocols, such as NFS, AFS or XXXX. These protocols are also typically RPC based, but partially compensate for network delays by overlapping multiple remote procedure calls. They all exploit caching and "read-ahead" to hide latency in all parts of the system. When a user application requests data in a file, additional data is read ahead and stored in a local data cache in anticipation of future requests. The goal is to (almost) always arrange for the data to be in the cache before it is requested by the user, so that the user does not have to wait for requests to traverse the network or for disk latency in the server at the far end of the network. If the cache (also called the read-ahead buffer) is large enough, this works very well. However, if the cache is inadequately sized it will not be sufficient at very high delay, because the file system protocol cannot request more read-ahead than fits into the cache, and the application will be able to read the data that arrives at the local cache more quickly than the requests to refill it can propagate through the network. This leads to "fixed window" behavior: the protocol performance can be predicted by imagining that a fixed amount of data (the cache size) is delivered on each possible round trip through the network.

Note that to get optimal performance, the application has to read ahead enough data to hide both the latency of the network and the latency of the disk in the server. If the path changes, then the ideal cache size also changes. This is the key point: to make an application that completely hides network delay, you need to fully overlap the client, the server and the network. The additional complication is that, since the amount of data in transit depends on the network delay, the application has to adapt to the network delay before it can be optimal.
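
The "fixed window" prediction can be sketched as follows. This is a minimal illustration: the 64 kB cache size and the list of RTTs are assumed values chosen for the example, not measurements of any particular file system.

```python
def fixed_window_throughput(window_bytes, rtt_s):
    """Upper bound on throughput (bytes/s) when at most `window_bytes`
    can be delivered per round trip."""
    return window_bytes / rtt_s

CACHE_BYTES = 64 * 1024   # hypothetical read-ahead cache size

for rtt_ms in (1, 10, 60, 100):
    rate = fixed_window_throughput(CACHE_BYTES, rtt_ms / 1000)
    print(f"RTT {rtt_ms:3d} ms -> at most {rate * 8 / 1e6:7.2f} Mbit/s")
```

The bound falls in inverse proportion to the RTT, regardless of how much bandwidth the path offers.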
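
As a rough sizing sketch, the read-ahead needed to keep a transfer running at a given rate is that rate multiplied by the total latency to be hidden. The model below is a simplification, and the target rate and disk latency are hypothetical values picked for illustration.

```python
def required_read_ahead(target_rate_bytes_per_s, rtt_s, server_latency_s=0.0):
    """Bytes that must be requested ahead to sustain the target rate
    while hiding both the network round trip and the server latency."""
    return target_rate_bytes_per_s * (rtt_s + server_latency_s)

RATE = 100e6 / 8        # hypothetical target: 100 Mbit/s, in bytes per second
DISK_LATENCY = 0.010    # hypothetical server disk latency: 10 ms

for rtt_ms in (1, 20, 60, 100):
    need = required_read_ahead(RATE, rtt_ms / 1000, DISK_LATENCY)
    print(f"RTT {rtt_ms:3d} ms -> read ahead about {need / 1024:7.0f} kB")
```

Because the result changes with the RTT, a cache size that is ideal on one path is too small on a longer one, which is why the application has to adapt to the measured delay.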

Note that the fixed application window problem is completely analogous to the TCP buffer tuning problem. However, the above description is independent of the underlying protocol.

Application Reputations

As mentioned in the previous section, before running your own tests to determine if an application is capable of running well over a long network path, it is useful to see if this information is already known for this application. This table provides some information about how various applications cope with high delay networks. It is far from complete: if you use the testing procedures below, please send the results to us, and we will save others the trouble of doing their own test.

Estimated or unconfirmed data is tagged with a question mark "?".

To reduce the clutter we abbreviate common results:

Pass
    The application works at least to reasonable transcontinental scales (e.g. 100 ms RTT). Note that all applications fail at sufficiently large scales, so some failure notations may also appear, indicating any known scale limits.
Fail: 64 kB Window
    The application is performance limited by a fixed window size, in this case 64 kB. For long paths, the performance is inversely proportional to the RTT and can be estimated by the window size divided by the RTT.
Fail: 100 RTT
    The performance is limited by excessive round trips, in this case 100, probably for some specific operations that are often repetitive.
setsockopt()
    The application includes user (or automatic) commands or options to adjust the kernel TCP tuning. See the information on TCP tuning.

Some of the well known applications:

Application: ftp, sftp, kftp, gftp
    Conditions: Large file transfers
    Pass/Fail: Pass
    Comments: Ftp is the traditional bulk data transfer tool. Note that classic ftp is no longer considered safe for the public Internet because it sends plain-text passwords and anybody "sniffing the network" can get your password. The many variants include methods for secure (encrypted) authentication and graphical user interfaces. Many ftp variants support setsockopt().

    Conditions: Multiple small file transfers using the mget and mput commands
    Pass/Fail: Fail: many RTTs per file

Application: ssh, scp
    Conditions: Files of any size
    Pass/Fail: Fail: 64 kB Window; 10-20 RTT per connection (depends on security negotiations)
    Comments: Several variants of ssh and ssh based tools except hpn-ssh (listed below) rely on a window based internal flow control mechanism to prevent deadlock under a number of conditions, such as when using port forwarding.

Application: hpn-ssh
    Conditions: Large file transfers
    Pass/Fail: Pass
    Comments: hpn-ssh (High Performance Network Secure Shell Protocol) adapts the internal flow control window to match TCP's window size. Due to the high overhead associated with negotiating authentication and encryption options, ssh, scp and related commands are not efficient inside of other applications or scripts that do extensive looping.

    Conditions: Multiple small file transfers
    Pass/Fail: Fail: 10-20 RTTs per connection

Please send information about additional applications to nettune@psc.edu
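
The "RTTs per connection" entries in the table translate directly into elapsed time for scripts that transfer many small files one connection at a time. The following sketch assumes an ideal network (transfer time of the file data itself is negligible unless given) and uses a hypothetical batch of 1000 files on a 60 ms path.

```python
def small_file_batch_time(num_files, rtts_per_file, rtt_s, per_file_transfer_s=0.0):
    """Elapsed seconds to move `num_files` files one after another when
    every file costs `rtts_per_file` serialized round trips."""
    return num_files * (rtts_per_file * rtt_s + per_file_transfer_s)

RTT = 0.060   # assumed transcontinental round trip time, in seconds

for rtts in (10, 20):
    total = small_file_batch_time(1000, rtts, RTT)
    print(f"{rtts} RTTs per connection: 1000 files take about {total / 60:.0f} minutes")
```

Under these assumptions the connection setup alone costs 10 to 20 minutes, before any file data is counted.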

Application Testing

If the application you are interested in is not listed above or is not a "common" application, you may want to test it using the methods suggested below.

The application testing techniques below are presented in order of difficulty. The first few may be suitable for an end-user trying to find a smoking gun to present to a recalcitrant application programmer who denies that their application may have a problem.

These techniques are designed to facilitate "bench-testing" applications. The goal is to provide the developer with easy access to all components of the application while emulating long paths between them.

Note that it is always harder to use real applications rather than diagnostic tools to confirm that the path and end system are healthy. Always confirm that the end-system and network path are healthy and properly tuned before starting on application testing. The NPAD diagnostic servers provide one-click diagnosis of most last-mile and end-system problems. In addition, we recommend using iperf or ttcp to verify that the entire end-to-end path does not have any difficulty maintaining the desired rate.

Alternate path

The simplest way to see whether an application is affected by network delay is to measure its performance from a number of different locations and determine whether the application running time is roughly proportional to the network RTT. All you need for this is an easily repeated set of application operations, a stopwatch to measure the application running time, ping to measure the network RTT, and some helpful colleagues at other institutions to collect additional data points.

If the application elapsed time fits a simple linear function of the RTT, and the path and end system test clean to the full application data rate, then you have a pretty convincing argument that there is a problem with the application itself.
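
One way to check the fit is an ordinary least-squares line through the (RTT, elapsed time) points. The sketch below is a minimal pure-Python version; the sample values are synthetic placeholders made up for illustration, not real measurements, and should be replaced with your own stopwatch and ping data.

```python
def fit_line(rtts_s, elapsed_s):
    """Least-squares fit of elapsed = intercept + slope * rtt.

    Returns (slope, intercept, r_squared).
    """
    n = len(rtts_s)
    mean_x = sum(rtts_s) / n
    mean_y = sum(elapsed_s) / n
    sxx = sum((x - mean_x) ** 2 for x in rtts_s)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rtts_s, elapsed_s))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in elapsed_s)
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(rtts_s, elapsed_s))
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic placeholder data: RTT in seconds from ping, elapsed seconds
# from a stopwatch, one pair per test location.
rtts = [0.002, 0.020, 0.045, 0.070, 0.095]
elapsed = [3.8, 11.0, 21.0, 31.0, 41.0]

slope, intercept, r2 = fit_line(rtts, elapsed)
print(f"about {slope:.0f} serialized round trips, "
      f"{intercept:.1f} s of delay-independent time, r^2 = {r2:.3f}")
```

The slope has a useful interpretation: it estimates how many round trips the application serializes, while the intercept estimates the running time that does not depend on delay. An r^2 close to 1 indicates that a straight line describes the data well.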

Delay using tunnels

When developing an application, it is often not feasible to test over a path with a long RTT because of the physical location of the machines running the application. One way to create a long-delay scenic path between two local hosts is to set up a tunnel decapsulator at a remote location, and configure the hosts so that packets between them go through that tunnel.

We have some instructions for Linux on setting up a universal GRE decapsulator.

Delay using kernel netemu

Instead of using a scenic path, it is possible to artificially add delay to an existing short path. This can be done by a kernel module that queues packets below the network layer. The most commonly used packages are Dummynet for BSD kernels, and Netem for Linux.

For information on configuring the kernel to delay packets, see the Netem Wiki or the Dummynet documentation.
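
As a minimal sketch for Linux, the helper below builds the standard `tc` commands that attach and remove a Netem delay on an interface. The interface name `eth0` and the 30 ms figure are assumptions for illustration, and the helper function names are hypothetical. Running the commands requires root privileges, and they replace any existing root queueing discipline on that interface, so consult the Netem documentation before using them on a production host.

```python
import subprocess

def netem_add_delay_cmd(device, delay_ms):
    """Command that delays every packet leaving `device` by `delay_ms`."""
    return ["tc", "qdisc", "add", "dev", device, "root",
            "netem", "delay", f"{delay_ms}ms"]

def netem_clear_cmd(device):
    """Command that removes the root qdisc, restoring the default."""
    return ["tc", "qdisc", "del", "dev", device, "root"]

def run(cmd):
    """Execute a command, raising CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)

# Print rather than execute, so the commands can be reviewed first.
print(" ".join(netem_add_delay_cmd("eth0", 30)))   # "eth0" is an assumed name
print(" ".join(netem_clear_cmd("eth0")))
```

Netem attached this way delays outgoing packets only, so a delay configured on one host is added once per round trip. To split the delay evenly between the two directions, configure half of it on each end host.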

A simple delay tool

Another way to study the impact of network latency on an application is to use a proxy server that introduces a delay, so that the actual tests can be conducted on a LAN. With almost no latency and plenty of available bandwidth, the LAN approximates the "ideal network" referred to in the introduction, and the delay can be controlled to simulate a long path.

One significant advantage of this approach over others is that it introduces delay above the transport layer. This isolates the effects of delay to the application only, eliminating possible effects on TCP from the debugging process.

We have written a simple tool for Linux to add delay to TCP applications. In use, it acts much like an SSH tunnel.

See the included README file for instructions on installation and use.
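
To illustrate the idea of adding delay above the transport layer, here is an independent minimal sketch of a delaying TCP proxy in Python. It is not the tool described above, and all function names and addresses are hypothetical. Each chunk of data is time-stamped on arrival and forwarded a fixed time later, so the proxy adds latency without limiting how much data is in flight. The internal queue is unbounded, which matches the "infinite bandwidth" assumption but is not suitable for production use.

```python
import asyncio
import time

async def pump(reader, writer, delay_s):
    """Copy bytes one way, releasing each chunk delay_s after it arrived."""
    queue = asyncio.Queue()

    async def collect():
        while True:
            try:
                data = await reader.read(65536)
            except ConnectionError:
                data = b""                  # treat a reset as end of stream
            await queue.put((time.monotonic() + delay_s, data))
            if not data:
                return

    async def release():
        while True:
            due, data = await queue.get()
            await asyncio.sleep(max(0.0, due - time.monotonic()))
            if not data:                    # delayed end of stream
                if writer.can_write_eof():
                    writer.write_eof()
                return
            writer.write(data)
            await writer.drain()

    await asyncio.gather(collect(), release())

async def handle_client(client_r, client_w, target_host, target_port, delay_s):
    """Relay one client connection to the target, delaying both directions."""
    server_r, server_w = await asyncio.open_connection(target_host, target_port)
    try:
        await asyncio.gather(pump(client_r, server_w, delay_s),
                             pump(server_r, client_w, delay_s))
    finally:
        client_w.close()
        server_w.close()

async def start_delay_proxy(listen_host, listen_port,
                            target_host, target_port, delay_s):
    """Start listening; the added round trip time is 2 * delay_s."""
    async def on_connect(reader, writer):
        await handle_client(reader, writer, target_host, target_port, delay_s)
    return await asyncio.start_server(on_connect, listen_host, listen_port)

# Example use (hypothetical addresses): add 60 ms of RTT in front of a server.
#
#   async def main():
#       proxy = await start_delay_proxy("127.0.0.1", 9000,
#                                       "server.example.org", 80, 0.030)
#       async with proxy:
#           await proxy.serve_forever()
#
#   asyncio.run(main())
```

The application under test is then pointed at the proxy's listening address instead of the real server. Because the delay is applied to the byte stream after TCP has delivered it, TCP itself still sees only the short LAN path.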

About NPAD

Network Path and Application Diagnosis is a joint project of the PSC and NCAR, funded under NSF grant ANI-0334061. This project is focused on using Web100 and other methods to extend fairly standard diagnostic techniques to compensate for the "symptom scaling" that leads to false positive diagnostic results on short paths.

Matt Mathis, John Heffner, and Raghu Reddy
Please send comments and suggestions to nettune@psc.edu