next_inactive up previous


Improving Paging Performance With Object Prefetching



Neelam Saboo
Laxmikant V. Kalé
{saboo, kale}@cs.uiuc.edu



Dept. of Computer Science
University of Illinois at Urbana-Champaign
1304 W. Springfield Ave.
Urbana, IL 61801





Abstract:

Many computationally intensive parallel programs are also memory intensive. Even when the memory required to model a particular application is larger than the available memory, the virtual memory system permits the program to run. However, this comes at a substantial performance cost because disk accesses incurred by paging are substantially slower than memory accesses. Out-of-core techniques are often developed for each application separately to deal with this issue. We present a generic, application independent, technique that automatically improves paging performance, by using prefetching and multithreading. We take advantage of the virtualization provided by Charm++ objects in parallel programs. The data driven objects provide the prediction mechanism necessary for an effective prefetching scheme. An analytical model as well as experimental data is presented to demonstrate the advantages of the proposed scheme.

Introduction

Parallel programs often deal with large problems which are computationally intensive and have huge memory requirements. In spite of the relatively large amount of memory available in today's computer systems, there are many parallel applications that require more memory than available. Virtual Memory mechanism satisfies the memory requirement of such applications by paging the required pages of memory from physical memory. However, page faults take several milliseconds compared to few nanoseconds for memory accesses, leading to some performance loss. Therefore, reducing or hiding page faults is crucial in achieving high performance. We aim at improving performance for such programs by exploring and analyzing the concept of Object Prefetching.

If the data referenced by an application is not in main memory, a page fault occurs and the page is fetched from secondary memory to main memory. CPU is essentially idle during that period. Software prefetching is a technique for tolerating memory latency by explicitly getting data in memory before it is actually needed. Thus, when the program asks for the page, the page is likely to be available instantaneously instead of going through a page fault.

The paper explores a multi-threading approach for data prefetching. It creates one or more threads that will fetch the data in memory before it is actually required. When one thread is blocked on paging, system passes control to other threads  [1]. Thus the performance loss due to paging can be reduced by overlapping it with computation.

Experiments are conducted on Solaris and Linux platforms using POSIX-threads to study the behavior of paging with different parameters.

This approach presents a major challenge. To fetch the required data, program should correctly predict what data is going to be accessed in near future. Data-driven object paradigms overcome this challenge. In such systems, the execution of objects' method is triggered by availability of messages (method invocations) under the control of a prioritized scheduler. Thus the scheduler has the knowledge of objects that are going to receive messages in near future. In the following section, we discuss one such data-driven object oriented paradigm.


Data-driven Object Paradigms: CHARM++

CHARM++ is an object oriented parallel language  [2]. It is different from traditional parallel programming paradigms such as message passing and shared variable programming in that its execution model is message-driven. The computations in CHARM++ are triggered on arrival of associated messages. These computations in turn trigger more messages to other processors that might cause more computations on those processors. CHARM++ is based on a concept of parallel objects called Chares. Chares can communicate with each other only through messages. CHARM++ also supports notion of a Group and Array. A Group has one representative on each processor. Arrays are arbitrarily sized collection of chares. Each array element can send and receive messages just like chares, and computation is performed upon arrival of those messages.

Each chare contain a number of entry methods, which are methods that can be invoked from remote processors. CHARM++ provides system calls to asynchronously create remote chares and to asynchronously invoke entry methods on remote chares by sending messages to those chares. This asynchronous message passing is the basic interprocess communication mechanism in CHARM++. CHARM++ also allows application to associate priorities with these messages, so that high priority messages will be handled before lower priority messages. CHARM++ is built on top of the Converse run-time framework. Converse provides portable, efficient implementations of all the functions typically needed by a parallel language of library. For example, Converse provides an architecture-independent interface to most thread functions, thread scheduling, synchronization of variables, and message passing. The CHARM++ environment can run on combinations of many different operating systems, including Linux, Irix, Solaris, Windows NT as well as many specialized parallel computers like Cray T3E, IBM's SP, etc.

Scheduler Overview

Each processor is running a Converse scheduler  [3], which is responsible for all message reception and scheduling. Each message contains a function pointer and some bytes of user data . The scheduler's job is to repeatedly pick and process the messages in the order of their priority. Messages need to register with converse environment informing it about the association of handlers with messages. Processing a message consists of calling the handler which is encapsulated in the message, passing in the message's data.

Whenever a message is sent from a chare or an array element to another chare or an array element, the message is inserted in the schedulers' queue which is a prioritized queue. The scheduler thread runs a loop which retrieves messages from a prioritized queue and pass it to a message handler function.

\begin{figure}\begin{center}
{\small\begin{boxedverbatim}while(true)
{
dequ...
...dle until a message is received.
}\end{boxedverbatim}}
\end{center}\end{figure}

Thus, the scheduler has the knowledge of messages which are going to be executed in near future and therefore does not need any prediction mechanisms for prefetching.


Object Prefetching

In this section we study the feasibility of the concept of prefetching and provide theoretical support for it. We write a simple benchmark that models the behavior of a data-driven program which uses a high amount of memory and accesses the data so that it causes significant amount of paging. The performance of the application is then measured in terms of access time per object, which is averaged over total number of objects. Experiments are performed on Solaris and Linux with different scenarios: in absence of paging, with paging, with prefetching, different computation times.

Benchmark Application

The benchmark program consists of a dynamically allocated array of objects (N), each object of size of a page, 4K (S). The program creates a Computation Thread, which selects an object at a time and performs some computation on it, in a loop for some fixed number of iterations. Objects are accessed randomly so that consecutive accesses are on different pages. Number of iterations is 10 times the number of objects to avoid the effect of cold misses due to initial few access. Effective time per object is measured as the total time divided by the number of iterations.

A lock-free Producer-Consumer queue is maintained which holds the indices of the objects in the array. Prefetch thread accesses an object in the array and enqueues the index in the queue. The computation thread dequeues an index from the queue and performs computation on that object. Prefetch thread maintains a leash such that it will prefetch only when the queue size is less than the leash size. This is a programmable parameter to study the effect of different leash sizes on prefetching. To simplify the benchmark application, prefetch thread decides what data is going to be accessed in near future instead of computation thread governing it.

The idea behind this approach is that as prefetch thread accesses the objects, page faults will be taken by it. As computation thread comes to a point to access that object, it is already brought in memory. So it will not incur a page fault, assuming the object is not paged out by this time. Keeping leash to some reasonable size such as 10-30, we can statistically assume that pages will not be thrown out between the time prefetch thread brings it into memory and computation thread tries to access it. If LRU page replacement policy is used by the operating system, it becomes even less likely. By maintaining the leash, there is some work queued up for computation thread. As prefetch thread is taking a page fault, operating system will pass the control to other threads. computation thread thus can keep working on the objects brought in memory earlier. If these two times overlap for a maximum time, the program performs as though there is no paging. When the queue is filled up to leash size, prefetch thread yields to computation thread. Also, if queue is empty, computation thread yields to prefetch thread. Code for prefetch thread and computation thread is in Following figure.

Figure 1: Worker Thread and Prefetch Thread Code
\begin{figure}\begin{center}
{\small\begin{boxedverbatim}void* computationThre...
...x);
countW++ ; countP++ ;
} }
} }\end{boxedverbatim}}
\end{center}\end{figure}

The benchmark application is similar to CHARM++ scheduler we discussed in the previous section as prefetch thread knows what data the application needs in near future. Thus, the same idea of prefetching can be incorporated in CHARM++ scheduler. A prefetch thread can be introduced in CHARM++ scheduler which will access the schedulers queue and prefetch the objects which are going to receive message in near future, while computation thread only handles the messages and performs computation.

Analysis

Let us start by considering how paging affects the system performance. Consider an access pattern in an application which repeatedly accesses some data and performs some computation on it.


		N = the number of objects. 				 S =  the size of each object in bytes. 

M = the available memory. $t_{c}$ = the computation time.
$t_{p}$ = page fault service time. $f_{p}$ = page fault frequency.
t = total completion time. $t_{0} = \frac{t}{N}$, effective time taken per object.

Large Computation Time

This section analyses different memory access patterns when the computation time per object is larger compared to page fault service time, i.e. $t_{c} > t_{p}$.

Figure 2: Timeline: Computation time higher than page fault servicing time

\includegraphics[width=5.0in]{LargeComputation}

When all the required data is available in physical memory, access time is very small compared to computation time. Most of the time is spent in computation. The behavior is modeled in Figure 2 (a). Total time taken is given by the equation

$\displaystyle t = N \cdot t_{c}$      
$\displaystyle t_{0} = \frac{t}{N} \ = \ t_{c}$     (1)

As the amount of memory used in the application increases, not all the data will fit in memory and application might start paging. When a page fault occurs, time spent in access is high as compared to memory access with no page faults. This behavior degrades the overall performance of the application. Figure 2 (b) models the paging effect.

$\displaystyle t = N \ ( t_{c} + f_{p} \cdot t_{p} \ )$      
$\displaystyle t_{0}= \frac{t}{N} \ = \ t_{c} + f_{p} \cdot t_{p}$     (2)

Let us calculate $f_{p}$ now. When required memory is less than the available memory, i.e. $S \cdot N < M$ , every access will be in memory, and there will not be any page faults.

\begin{displaymath}
f_{p} = 0 ,\ S \cdot N < M
\end{displaymath} (3)

As the problem size increases, we can define probability of a page not being in real memory as,

\begin{displaymath}
f_{p} = \frac{S \cdot N - M}{S \cdot N} \ = \ 1 - \frac{M}{S \cdot N}\ \ ,\ S \cdot N > M
\end{displaymath} (4)

Asymptotically, required memory reaches infinity and page fault frequency $f_{p}$ approaches 1.

\begin{displaymath}
f_{p} \to 1 , \ S \cdot N \to \infty
\end{displaymath} (5)

From equations 2, 3, 4 and 5, we have

$\displaystyle t_{0}$ $\textstyle =$ $\displaystyle t_{c} \hspace{1.4in} ,S \cdot N < M$ (6)
  $\textstyle =$ $\displaystyle t_{c} + (1 - \frac{M}{S \cdot N}) \cdot t_{p} \;\;\; ,S \cdot N > M$ (7)
  $\textstyle \to$ $\displaystyle t_{c} + t_{p} \hspace{1.1in} ,S \cdot N \to \infty$ (8)

From equations 6, 7 and 8, we conclude that time per object remains constant at $t_{c}$ when the available memory is sufficient to fit the given problem, starts increasing with $f_{p}$ for some time and then asymptotically approaches $t_{c} + t_{p}$ for infinitely large problem size. These equations are plotted as in Figure 3.

As seen in Figure 2 (b), computation thread needs to wait for the data to be brought in memory. The performance can be improved by overlapping page access with computation. To do so, a prefetch thread is introduced that will access any data before a computation is performed on it. At any point, when prefetch thread is blocked on a page fault, computation thread still continues to perform computation on earlier data prefetched. As seen in 2 (c), most of the computation is overlapped by page fault servicing time. Computation thread need not wait for next object access as in 2(b). We observe that computation time $t_{c}$ dominates the total time taken. Thus, we conclude that

\begin{displaymath}
t_{0} = t_{c}
\end{displaymath} (9)

From equations, 8 and 9, we conclude that asymptotically, $t_{0}$ will reach $t_{c} + t_{p}$ without prefetching, while with prefetching it remains constant at $t_{c}$. Figure 3 models the above analysis.

Figure 3: Large Computation Time: Time per Object Vs Problem Size

\includegraphics[width=4.0in]{Hight0_prefetch}

The break in the curve demonstrates the asymptotic behavior at infinite memory requirements. This analysis ignores second-order effects, such as

The slight difference between $t_{0}$ for prefetch thread and $t_{c}$ is due to overheads of thread creation, thread context switches and redundant accesses by the prefetch thread.

Small Computation Time

The analysis in the previous section assumed that $t_{c} > t_{p}$. This section analyzes the scenario when the computation time is small compared to page fault service time. We observe that prefetching has advantages even when computation time is smaller than page fault service time.

Figure 4: Timeline: Computation time smaller than page fault servicing time

\includegraphics[width=5.0in]{SmallComputation}

Figure 4 models the behavior of the application with small computation time. The analysis for the case with no prefetching remains same as given by the equations 6, 7 and 8.

When prefetching is introduced in such an application, an overlap is achieved between the paging time and the computation time as shown in Figure 4 (c). When $f_{p} \cdot t_{p} < t_{c}$, by the time the computation thread finishes executing K objects in time $K \cdot t_{c}$, the prefetch thread incurs $f_{p} \cdot K$ page faults, taking $K \cdot f_{p} \cdot t_{p}$ time units. Thus, assuming that the access time spent by the prefetch thread is negligible and so is the context switching overhead, the computation thread is never kept waiting by the prefetch thread. Therefore, we have,

$\displaystyle t = N \cdot t_{c} \ ,\ if \ f_{p} \cdot t_{p} < t_{c}$      
$\displaystyle t_{0} = t_{c} \ ,\ if \ f_{p} \cdot t_{p} < t_{c}$     (10)

This analysis also ignores second-order effects, as analysis in previous section.

If $f_{p} \cdot t_{p} > t_{c}$, on the other hand, the total time is decided by the page fault time. For a service of K accesses, the prefetch thread incurs $K \cdot f_{p} \cdot t_{p}$ time in servicing the page faults. The computation thread finishes its computations in time $K \cdot t_{c}$ which is less than $K \cdot f_{p} \cdot t_{p}$ during page fault servicing. Therefore,

$\displaystyle t = N \cdot (f_{p} \cdot t_{p}),\ if \ f_{p} \cdot t_{p} > t_{c}$      
$\displaystyle t_{0} = (f_{p} \cdot t_{p}),\ if \ f_{p} \cdot t_{p} > t_{c}$     (11)

The condition $f_{p} \cdot t_{p} < t_{c}$ can be rewritten as

$f_{p} \cdot t_{p} \; < \;\; t_{c} $

$(1 - \frac{M}{S \cdot N}) \cdot t_{p} \;\; < \;\;t_{c}$


\begin{displaymath}
\hspace{1.0in} i.e.\;\; S \cdot N \;\;\;\;< \;\;M (\frac{t_{p}}{t_{p}-t_{c}})
\end{displaymath} (12)

According to our assumption, $t_{c} < t_{p}$. Thus, the quantity $(\frac{t_{p}}{t_{p}-t_{c}})$ is greater than 1. Thus, we observe that when prefetching is introduced in the program, $t_{0}$ remains at $t_{c}$ for $(\frac{t_{p}}{t_{p}-t_{c}})$ times more than in program without prefetching.

Let us now consider the case when $f_{p}$ approaches 1. According to equations 2 and 11,

$\displaystyle t_{0} \to t_{c} + t_{p} \ , \ Without \ Prefetching$     (13)
$\displaystyle t_{0} \to t_{p} \ , \ With \ Prefetching$     (14)

Thus, when computation time is high ( $t_{c} \approx t_{p}$), prefetching makes the program twice as fast as the original program. The expected performance of the application is shown in Figure 5

Figure 5: Small Computation Time: Time per Object Vs Problem Size

\includegraphics[width=4.0in]{Smallt0_prefetch}

Asymptotic Behavior

Keeping N constant at very large value such that $f_{p} \to 1$, if we vary $t_{c}$, following graph can be plotted.

Figure 6: Asymptotic Analysis for different computation time per object

\includegraphics[width=4.0in]{asymptote}

Without prefetching, effective time per object is given as $t_{0} = t_{c} + t_{p}$, which is plotted as a straight line in Figure 6. When prefetching is introduced, $t_{0}$ depends on relative values of computation time per object & page fault service time. When $t_{c} < t_{p}$, effective time taken is dominated by the page fault service, $t_{0} = t_{p}$, which is independent of $t_{c}$. When computation time is high compared to page fault service time, $t_{0}$ is dominated by the computation time, $t_{0} = t_{c}$.


Performance Evaluation

The benchmark application we discussed in previous section is run on Linux and Solaris machines to study the paging behavior. The function doWork characterizes the work performed per object. It can be tuned to set time per object, $t_{c}$. Graphs are plotted between number of objects (N) and effective time taken per object, $t_{0}$. As we increase N, program starts paging at some value of N. We study this behavior for different values of $t_{c}$, different leash sizes. We also study the asymptotic behavior of the program for different values of $t_{c}$. It takes several hours to complete all the data points in the graphs depending on the values of $t_{c}$ and N. To perform more experiments in given time, we also wrote a simulator that will produce similar results in much shorter time.

Different Leash Size

Prefetch thread maintains a leash so that even when it is blocked at a page fault, computation thread can continue working. When the leash is 1, computation thread has to wait at every object until prefetch thread has enqueued it in the queue, showing no performance improvements. As the leash is increased, computation thread has more work to do when prefetch thread is waiting on a page fault. Thus, computation thread needs to wait for little or no time for work to be enqueued. Leash should not be selected so high that prefetch thread prefetches too much data which might get paged out before it is needed, thus causing even more page faults. Experimental results show that leash size of 10-30 show same performance benefits.

Large Computation Time

Figure 7 demonstrates the behavior of the program with computation time 20 ms. It shows that as N increases, effective time per object for No Prefetch program increases, but for Prefetch program it does not increase that much. The behavior is in accordance with equations, 8 and 9.

Figure 7: Runs on Linux with computation time of 20 ms


Small Computation Time

We discuss behavior of the benchmark application when computation time is small compared to page fault servicing time. Figure 8 and 9 show that for large values of N, difference between No Prefetch and Prefetch curve is approximately equal to $t_{c}$ as in equation 13 and 14. Also, we observe that With Prefetch curve starts rising at a later point than the No Prefetch curve, as expected in analysis done in equation 12.

Figure 8: Runs on Linux with computation time of 1 ms


Figure 9: Runs on Linux with computation time of 5 ms


Runs on Solaris

Similar experiments were performed on Solaris to confirm the behavior of prefetching on Solaris. Experiments are performed with a fixed computation time of 1 ms while varying object data sizes. Graph is shown in Figure 10.

Figure 10: Runs on Solaris with computation time of 1 ms


Simulator

Benchmark application runs are limited by the length of time it takes to perform each run. It takes few days to perform a complete experiment to study a particular behavior. On the other hand, the equations we studied in section 4.1 ignore second order effects. Thus, we developed a a simulator which will take into account second order effects, but will not perform any actual work thus saving the execution time. The simulator does not perform any page accesses or any computation. It maintains two timer values, one for computation thread and one for prefetch thread. It simulates the probability of the page faults depending on the data size. It also simulates the context switch between the two threads when queue is full or empty or when prefetch thread is taking a page fault. With the simulator, we can perform the experiment for much larger data size and computation time per object in much shorter time. The simulation data confirms the equations derived in section 3. Figure 11 plots the simulation data for computation time per object as 5ms.

Figure 11: Simulator runs with computation time as 5 ms


Figure 12 shows the asymptotic behavior of programs when the object size is very high. The simulation environment assumes a large data size so that page fault rate is 95%. Computation time is varied from 1 ms to 40 ms. We can observe the change in prefetch curve as computation size increases beyond 20ms. The simulation graph is in accordance with the theoretical analysis in figure 6.

Figure 12: Simulation runs for asymptotic behavior with varying computation time


Summary and Future Work

For applications with large memory requirements, reducing or hiding paging costs is crucial. We proposed a multithreaded approach with a prefetch thread that accesses objects that are needed in near future, while computation thread is working on the objects available. data-driven object paradigms, such as CHARM++, facilitates this approach, since they can ``predict'' the objects needed in near future. Benchmark application demonstrates the results of the prefetching concept and confirm that prefetching improves the performance of an application by overlapping paging and computation time. When computation time is higher than the page fault servicing time, computation time overlaps most of the paging costs and $t_{0}$ remains constant at $t_{c}$, while with no prefetching $t_{0}$ approaches $t_{c} + t_{p}$. For smaller computation times, prefetching performs better by $t_{c}$. Asymptotically, prefetching approaches $t_{p}$. Also, with prefetching, paging behavior shows effect at a larger data size hiding the effect of paging.

If more than one prefetch thread are introduced in the program, even when one prefetch thread is blocked on a page fault, other prefetch threads can keep prefetching objects producing more work for computation thread. Of course, the benefits will be limited by the ability of the paging system and disks to overlap multiple disk accesses. More experiments need to be performed for N prefetch thread approach. Also, interaction between paging behavior and context switch behavior of pthreads need to be studied more carefully, since we found several instances of inexplicable behavior in the course of this work.

To get explicit control of objects that are to be paged out and to avoid the overheads of thread creation and context switching between threads, we plan to experiment with an approach using asynchronous I/O mechanisms. Objects are serialized and stored on disk. To prefetch an object, it is asynchronously read, deserialized and brought in memory. Asynchronous operations can be overlapped with the computation. This approach of prefetching needs more careful design and feasibility study. To compete well with the other explicit out-of-core approaches  [4],  [5],   we plan to experiment with schedulers that are aware of which objects are in memory and reorder method invocation to minimize paging.

Bibliography

1
Lewis Bil and Berg Daniel.
Multithreaded programming with PTHREADS.
Prentice Hall, 1998.

2
L.V. Kale and S. Krishnan.
Charm++: A portable concurrent object oriented system based on C++.
In Proceedings of the Conference on Object Oriented Programming Systems, Languages and Applications, September 1993.

3
L.V. Kalé, M. Bhandarkar, N. Jagathesan, S. Krishnan, and J. Yelon.
Converse: An Interoperable Framework for Parallel Programming.
In International Parallel Processing Symposium 1996 (to appear), 1996.

4
Todd C. Mowry.
Tolerating latency in multiprocessors through compiler-inserted prefetching.
ACM Transactions on computer Systems, 16(1):55-92, February 1998.

5
R. Bordawekar, A. Choudhary, K. Kennedy, C. Koelbel, and M. Paleczny.
A model and compilation strategy for out-of-core data parallel programs.
30(8):1-10, August 1995.

6
Ken Klimkowski and Robert van de Geijn.
Anatomy of an out-of-core dense linear solver.
In Proceddings of the 1995 International Conference on Parallel Processing, pages 29-33, 1995.

About this document ...

This document was generated using the LaTeX2HTML translator Version 99.2beta6 (1.42)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -split 0 prefetch.tex

The translation was initiated by Orion Lawlor on 2001-07-12


next_inactive up previous
Orion Lawlor 2001-07-12