{saboo, kale}@cs.uiuc.edu
Parallel programs often deal with large problems which are computationally intensive and have huge memory requirements. In spite of the relatively large amount of memory available in today's computer systems, there are many parallel applications that require more memory than available. Virtual Memory mechanism satisfies the memory requirement of such applications by paging the required pages of memory from physical memory. However, page faults take several milliseconds compared to few nanoseconds for memory accesses, leading to some performance loss. Therefore, reducing or hiding page faults is crucial in achieving high performance. We aim at improving performance for such programs by exploring and analyzing the concept of Object Prefetching.
If the data referenced by an application is not in main memory, a page fault occurs and the page is fetched from secondary memory to main memory. CPU is essentially idle during that period. Software prefetching is a technique for tolerating memory latency by explicitly getting data in memory before it is actually needed. Thus, when the program asks for the page, the page is likely to be available instantaneously instead of going through a page fault.
The paper explores a multi-threading approach for data prefetching. It creates one or more threads that will fetch the data in memory before it is actually required. When one thread is blocked on paging, system passes control to other threads [1]. Thus the performance loss due to paging can be reduced by overlapping it with computation.
Experiments are conducted on Solaris and Linux platforms using POSIX-threads to study the behavior of paging with different parameters.
This approach presents a major challenge. To fetch the required data, program should correctly predict what data is going to be accessed in near future. Data-driven object paradigms overcome this challenge. In such systems, the execution of objects' method is triggered by availability of messages (method invocations) under the control of a prioritized scheduler. Thus the scheduler has the knowledge of objects that are going to receive messages in near future. In the following section, we discuss one such data-driven object oriented paradigm.
CHARM++ is an object oriented parallel language [2]. It is different from traditional parallel programming paradigms such as message passing and shared variable programming in that its execution model is message-driven. The computations in CHARM++ are triggered on arrival of associated messages. These computations in turn trigger more messages to other processors that might cause more computations on those processors. CHARM++ is based on a concept of parallel objects called Chares. Chares can communicate with each other only through messages. CHARM++ also supports notion of a Group and Array. A Group has one representative on each processor. Arrays are arbitrarily sized collection of chares. Each array element can send and receive messages just like chares, and computation is performed upon arrival of those messages.
Each chare contain a number of entry methods, which are methods that can be invoked from remote processors. CHARM++ provides system calls to asynchronously create remote chares and to asynchronously invoke entry methods on remote chares by sending messages to those chares. This asynchronous message passing is the basic interprocess communication mechanism in CHARM++. CHARM++ also allows application to associate priorities with these messages, so that high priority messages will be handled before lower priority messages. CHARM++ is built on top of the Converse run-time framework. Converse provides portable, efficient implementations of all the functions typically needed by a parallel language of library. For example, Converse provides an architecture-independent interface to most thread functions, thread scheduling, synchronization of variables, and message passing. The CHARM++ environment can run on combinations of many different operating systems, including Linux, Irix, Solaris, Windows NT as well as many specialized parallel computers like Cray T3E, IBM's SP, etc.
Whenever a message is sent from a chare or an array element to another chare or an array element, the message is inserted in the schedulers' queue which is a prioritized queue. The scheduler thread runs a loop which retrieves messages from a prioritized queue and pass it to a message handler function.
Thus, the scheduler has the knowledge of messages which are going to be executed in near future and therefore does not need any prediction mechanisms for prefetching.
In this section we study the feasibility of the concept of prefetching and provide theoretical support for it. We write a simple benchmark that models the behavior of a data-driven program which uses a high amount of memory and accesses the data so that it causes significant amount of paging. The performance of the application is then measured in terms of access time per object, which is averaged over total number of objects. Experiments are performed on Solaris and Linux with different scenarios: in absence of paging, with paging, with prefetching, different computation times.
The benchmark program consists of a dynamically allocated array of objects (N), each object of size of a page, 4K (S). The program creates a Computation Thread, which selects an object at a time and performs some computation on it, in a loop for some fixed number of iterations. Objects are accessed randomly so that consecutive accesses are on different pages. Number of iterations is 10 times the number of objects to avoid the effect of cold misses due to initial few access. Effective time per object is measured as the total time divided by the number of iterations.
A lock-free Producer-Consumer queue is maintained which holds the indices of the objects in the array. Prefetch thread accesses an object in the array and enqueues the index in the queue. The computation thread dequeues an index from the queue and performs computation on that object. Prefetch thread maintains a leash such that it will prefetch only when the queue size is less than the leash size. This is a programmable parameter to study the effect of different leash sizes on prefetching. To simplify the benchmark application, prefetch thread decides what data is going to be accessed in near future instead of computation thread governing it.
The idea behind this approach is that as prefetch thread accesses the objects, page faults will be taken by it. As computation thread comes to a point to access that object, it is already brought in memory. So it will not incur a page fault, assuming the object is not paged out by this time. Keeping leash to some reasonable size such as 10-30, we can statistically assume that pages will not be thrown out between the time prefetch thread brings it into memory and computation thread tries to access it. If LRU page replacement policy is used by the operating system, it becomes even less likely. By maintaining the leash, there is some work queued up for computation thread. As prefetch thread is taking a page fault, operating system will pass the control to other threads. computation thread thus can keep working on the objects brought in memory earlier. If these two times overlap for a maximum time, the program performs as though there is no paging. When the queue is filled up to leash size, prefetch thread yields to computation thread. Also, if queue is empty, computation thread yields to prefetch thread. Code for prefetch thread and computation thread is in Following figure.
The benchmark application is similar to CHARM++ scheduler we discussed in the previous section as prefetch thread knows what data the application needs in near future. Thus, the same idea of prefetching can be incorporated in CHARM++ scheduler. A prefetch thread can be introduced in CHARM++ scheduler which will access the schedulers queue and prefetch the objects which are going to receive message in near future, while computation thread only handles the messages and performs computation.
Let us start by considering how paging affects the system performance. Consider an access pattern in an application which repeatedly accesses some data and performs some computation on it.
N = the number of objects. S = the size of each object in bytes.
M = the available memory.= the computation time.
= page fault service time.
= page fault frequency.
t = total completion time., effective time taken per object.
This section analyses different memory access patterns when the computation time per
object is larger compared to page fault service time, i.e.
.
When all the required data is available in physical memory, access time is very
small compared to computation time. Most of the time is spent in computation.
The behavior is modeled in Figure 2 (a). Total time taken
is given by the equation
As the amount of memory used in the application increases, not all the data will fit
in memory and application might start paging.
When a page fault occurs, time spent in access is high as compared to memory access with
no page faults. This behavior degrades the overall performance of the application.
Figure 2 (b) models the paging effect.
Let us calculate
now. When required memory is less than the available memory, i.e.
, every access will be in memory, and there will not be any page
faults.
As the problem size increases, we can define probability of a page not being
in real memory as,
Asymptotically, required memory reaches infinity and page fault frequency
approaches 1.
From equations 2, 3, 4 and
5, we have
From equations 6, 7 and
8, we conclude that time per object remains constant at
when the available memory is sufficient to fit the given problem, starts
increasing with
for some time and then asymptotically approaches
for infinitely large problem size. These equations are plotted
as in Figure 3.
As seen in Figure 2 (b), computation thread needs
to wait for the data to be brought in memory. The performance can be improved by
overlapping page access with computation. To do so, a prefetch thread is
introduced that will access any data before a computation is performed on it.
At any point, when prefetch thread is blocked on a page fault, computation thread
still continues to perform computation on earlier data prefetched. As seen in
2 (c), most of the computation is overlapped by page
fault servicing time. Computation thread need not wait for next object access
as in 2(b). We observe that computation time
dominates the total time taken. Thus, we conclude that
From equations, 8 and 9, we conclude that
asymptotically,
will reach
without prefetching, while with
prefetching it remains constant at
. Figure 3 models
the above analysis.
The break in the curve demonstrates the asymptotic behavior at infinite memory requirements. This analysis ignores second-order effects, such as
The slight difference between
for prefetch thread and
is due to
overheads of thread creation, thread context switches and redundant accesses by the
prefetch thread.
The analysis in the previous section assumed that
. This section
analyzes the scenario when the computation time is small compared to page fault
service time. We observe that prefetching has advantages even when computation
time is smaller than page fault service time.
Figure 4 models the behavior of the application with small computation time. The analysis for the case with no prefetching remains same as given by the equations 6, 7 and 8.
When prefetching is introduced in such an application, an overlap is achieved
between the paging time and the computation time as shown in Figure
4 (c). When
, by the time
the computation thread finishes executing K objects in time
,
the prefetch thread incurs
page faults, taking
time units. Thus, assuming that the access time
spent by the prefetch thread is negligible and so is the context switching
overhead, the computation thread is never kept waiting by the prefetch thread.
Therefore, we have,
This analysis also ignores second-order effects, as analysis in previous section.
If
, on the other hand, the total time is decided by the
page fault time. For a service of K accesses, the prefetch thread incurs
time in servicing the page faults. The computation thread
finishes its computations in time
which is less than
during page fault servicing. Therefore,
The condition
can be rewritten as
According to our assumption,
. Thus, the quantity
is greater than 1. Thus, we observe that
when prefetching is introduced in the program,
remains at
for
times more than in program without prefetching.
Let us now consider the case when
approaches 1. According to equations
2 and 11,
Thus, when computation time is high (
), prefetching
makes the program twice as fast as the original program. The expected performance
of the application is shown in Figure 5
Keeping N constant at very large value such that
, if we vary
,
following graph can be plotted.
Without prefetching, effective time per object is given as
,
which is plotted as a straight line in Figure 6. When prefetching
is introduced,
depends on relative values of computation time per object &
page fault service time. When
, effective time taken is dominated
by the page fault service,
, which is independent of
. When
computation time is high compared to page fault service time,
is dominated
by the computation time,
.
The benchmark application we discussed in previous section is run on Linux
and Solaris machines to study the paging behavior. The function doWork
characterizes the work performed per object. It can be tuned to set time
per object,
. Graphs are plotted between number of objects (N) and
effective time taken per object,
. As we increase N, program starts paging
at some value of N. We study this behavior for different values of
,
different leash sizes. We also study the asymptotic behavior of the program
for different values of
. It takes several hours to complete all the
data points in the graphs depending on the values of
and N. To perform
more experiments in given time, we also wrote a simulator that will produce similar
results in much shorter time.
Figure 7 demonstrates the behavior of the program with computation time 20 ms. It shows that as N increases, effective time per object for No Prefetch program increases, but for Prefetch program it does not increase that much. The behavior is in accordance with equations, 8 and 9.
We discuss behavior of the benchmark application when computation time is small
compared to page fault servicing time.
Figure 8 and 9 show that for large values
of N, difference between No Prefetch and Prefetch curve is approximately
equal to
as in equation 13 and 14.
Also, we observe that With Prefetch curve starts rising at a later point
than the No Prefetch curve, as expected in analysis done in equation
12.
Benchmark application runs are limited by the length of time it takes to perform each run. It takes few days to perform a complete experiment to study a particular behavior. On the other hand, the equations we studied in section 4.1 ignore second order effects. Thus, we developed a a simulator which will take into account second order effects, but will not perform any actual work thus saving the execution time. The simulator does not perform any page accesses or any computation. It maintains two timer values, one for computation thread and one for prefetch thread. It simulates the probability of the page faults depending on the data size. It also simulates the context switch between the two threads when queue is full or empty or when prefetch thread is taking a page fault. With the simulator, we can perform the experiment for much larger data size and computation time per object in much shorter time. The simulation data confirms the equations derived in section 3. Figure 11 plots the simulation data for computation time per object as 5ms.
Figure 12 shows the asymptotic behavior of programs when the object size is very high. The simulation environment assumes a large data size so that page fault rate is 95%. Computation time is varied from 1 ms to 40 ms. We can observe the change in prefetch curve as computation size increases beyond 20ms. The simulation graph is in accordance with the theoretical analysis in figure 6.
If more than one prefetch thread are introduced in the program, even when one prefetch thread is blocked on a page fault, other prefetch threads can keep prefetching objects producing more work for computation thread. Of course, the benefits will be limited by the ability of the paging system and disks to overlap multiple disk accesses. More experiments need to be performed for N prefetch thread approach. Also, interaction between paging behavior and context switch behavior of pthreads need to be studied more carefully, since we found several instances of inexplicable behavior in the course of this work.
To get explicit control of objects that are to be paged out and to avoid the overheads of thread creation and context switching between threads, we plan to experiment with an approach using asynchronous I/O mechanisms. Objects are serialized and stored on disk. To prefetch an object, it is asynchronously read, deserialized and brought in memory. Asynchronous operations can be overlapped with the computation. This approach of prefetching needs more careful design and feasibility study. To compete well with the other explicit out-of-core approaches [4], [5], we plan to experiment with schedulers that are aware of which objects are in memory and reorder method invocation to minimize paging.
This document was generated using the LaTeX2HTML translator Version 99.2beta6 (1.42)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 prefetch.tex
The translation was initiated by Orion Lawlor on 2001-07-12