The AMPI implementation currently supports over 70 commonly used functions from the MPI 1.1 standard. It needs to be made fully standard-compliant. Communicator-related functionality could be implemented with array sections in Charm++. Topology-related functions of MPI could then be implemented more efficiently on top of this new implementation of communicators.
We have demonstrated that, for real applications, the overhead of AMPI is compensated for by several of its advantages over MPI. However, we believe that this overhead can itself be reduced further, especially in collective communication. In general, further work is needed to optimize collective communication in the presence of object virtualization. Work is currently under way on building Converse-level optimized communication routines, and we expect it to benefit AMPI.
On 32-bit processors, the unused address space available for isomalloc'ed threads can be limited, depending on the heap size and stack usage. While this problem becomes irrelevant on the new 64-bit processors, support for clusters based on Intel IA-32 chips, as well as for the new ASCI-class BG/L machine from IBM (which uses a 32-bit processor), is crucial. For this purpose, one can implement isomalloc'ed threads by limiting their migratability, for example to a subset of the processors. This leads to fewer divisions of the unused address space, and therefore to more of this address space being available on each processor.
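The effect of fewer divisions can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the AMPI implementation; it assumes a simplified model in which the unused address space is split evenly among the processors a thread may migrate to. The function names, the 2 GiB of unused address space, the 1 MiB reserved per thread, and the processor counts are all hypothetical values chosen for illustration.

```python
# Simplified model (an assumption, not AMPI code): the unused virtual
# address space is divided evenly, one division per processor that a
# thread is allowed to migrate to.

GIB = 1 << 30
MIB = 1 << 20


def per_processor_share(unused_bytes: int, num_divisions: int) -> int:
    """Bytes of unused address space reserved for each division."""
    if num_divisions <= 0:
        raise ValueError("num_divisions must be positive")
    return unused_bytes // num_divisions


def max_threads_per_processor(unused_bytes: int, num_divisions: int,
                              bytes_per_thread: int) -> int:
    """Threads whose address ranges fit into one processor's share."""
    if bytes_per_thread <= 0:
        raise ValueError("bytes_per_thread must be positive")
    return per_processor_share(unused_bytes, num_divisions) // bytes_per_thread


# Hypothetical figures for a 32-bit process.
unused = 2 * GIB          # assumed unused address space
per_thread = 1 * MIB      # assumed stack plus heap reserved per thread

# Fully migratable threads on 1024 processors: 1024 divisions.
full = max_threads_per_processor(unused, 1024, per_thread)

# Migration restricted to groups of 16 processors: 16 divisions.
limited = max_threads_per_processor(unused, 16, per_thread)

print(full, limited)
```

Under these assumed numbers, restricting migration to groups of 16 processors raises the per-processor capacity from 2 threads to 128, at the cost of reducing the freedom of the load balancer to place threads.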