Integrated memory controllers with parallel coherence streams

    Authors

    M. Chaudhuri;M. Heinrich

    Abbreviated Journal Title

    IEEE Trans. Parallel Distrib. Syst.

    Keywords

    distributed shared memory multiprocessor; directory protocol; multiple coherence controllers; coherence bandwidth; integrated memory controller; DUAL-CORE; PROCESSOR; Computer Science, Theory & Methods; Engineering, Electrical & Electronic

    Abstract

    Previous work in scalable hardware distributed shared memory (DSM) multiprocessors has established the critical and dominant role that protocol processing bandwidth (or its inverse, occupancy) plays in determining overall performance in architectures with standalone memory/coherence controllers. However, with recent architectural trends toward integrated (on-chip) memory controllers and the well-known fact that processor frequency is increasing more rapidly than memory systems', we must ask whether parallel coherence processing engines (either multiple integrated protocol processors/cores or multiple protocol threads) are needed in DSM machines constructed from modern processor architectures and, if so, when. We construct a useful analytical model to give the designer insight into when parallel coherence streams will improve performance and verify our model via detailed simulation on 64-threaded microbenchmarks and parallel applications and on single-node multiprogrammed workloads. Surprisingly, and contrary to related work, we find that, in these architectures, adding a second coherence engine has almost no impact on performance. Further, for less-tuned applications that suffer from hot spots (contentious requests to the same memory line), additional engines offer no benefit whatsoever. Even with double the memory bandwidth (or channels), an additional coherence processing stream yields only slight performance improvement. Only for a special class of DSM machines employing directoryless broadcast protocols over unordered interconnects does parallel "snoop" processing offer reasonable performance improvement for communication-intensive applications. Overall, given the architectural trends, this is good news for DSM designers who want to minimize the resources necessary (protocol threads or integrated protocol processor cores for maintaining internode coherence, respectively) to create SMTp-based or multi-CMP-based scalable DSM machines using directory protocols.

    Journal Title

    IEEE Transactions on Parallel and Distributed Systems

    Volume

    18

    Issue/Number

    8

    Publication Date

    1-1-2007

    Document Type

    Article

    Language

    English

    First Page

    1159

    Last Page

    1173

    WOS Identifier

    WOS:000247541500012

    ISSN

    1045-9219
