Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis
Computer Systems Laboratory
Stanford University
{leverich,arakida,sols,aminf13,horowitz,kozyraki}@stanford.edu
ABSTRACT
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors); D.1.3 [Programming Techniques]: Concurrent Programming
General Terms: Performance, Design
Keywords: Chip multiprocessors, coherent caches, streaming memory, parallel programming, locality optimizations
1. INTRODUCTION
The scaling limitations of uniprocessors [2] have led to an industry-wide turn towards chip multiprocessor (CMP) systems. CMPs are becoming ubiquitous in all computing domains. Unlike uniprocessors, which have a dominant, well-understood model
for on-chip memory structures, there is no widespread agreement on the memory model for CMP designs. The choice of memory model can significantly impact efficiency in terms of performance, energy consumption, and scalability. Moreover, it is closely coupled with the choice of parallel programming model, which in turn affects ease of use. While it is possible to map any programming model to any memory model, it is typically more efficient if the programming model builds upon the basic principles of the memory model.
Similar to larger parallel systems [8], there are two basic memory models for contemporary CMP systems: hardware-managed, implicitly-addressed, coherent caches and software-managed, explicitly-addressed, local memories (also called streaming memory). With the cache-based model, all on-chip storage is used for private and shared caches that are kept coherent by hardware. The advantage is that the hardware provides best-effort locality and communication management, even when the access and sharing patterns are difficult to statically analyze. With the streaming memory model, part of the on-chip storage is organized as independently addressable structures. Explicit accesses and DMA transfers are needed to move data to and from off-chip memory or between two on-chip structures. The advantage of streaming memory is that it provides software with full flexibility on locality and communication management in terms of addressing, granularity, and replacement policy. Since communication is explicit, it can also be proactive, unlike the mostly reactive behavior of the cache-based model. Hence, streaming allows software to exploit producer-consumer locality, avoid redundant write-backs for temporary results, manage selective data replication, and perform application-specific caching and macroscopic prefetching. Streaming eliminates the communication overhead and hardware complexity of the coordination protocol needed for cache coherence. On the other hand, it introduces software complexity, since either the programmer or the compiler must explicitly manage locality and communication.
Traditional desktop and enterprise applications are difficult to analyze and favor cache-based systems. In contrast, many upcoming applications from the multimedia, graphics, physical simulation, and scientific computing domains are being targeted by both cache-based [4, 45] and streaming [3, 39, 9, 16, 32] systems. The debate is particularly interesting vis-à-vis the latest game consoles. The CMPs for the Xbox 360 and PlayStation 3 differ dramatically in their on-chip memory model, as Xbox 360's Xenon processor is a cache-based CMP and PlayStation 3's Cell processor is a streaming memory CMP. Hence, it is interesting to evaluate if streaming provides specific benefits to this domain, or if using a cache-based approach across all application domains should be preferable.
The goal of this paper is to compare the efficiency of the two memory models under the same set of assumptions about technology, area, and computational capabilities. Specifically, we are interested in answering the following questions: How do the two models compare in terms of overall performance and energy consumption? How does the comparison change as we scale the number or compute throughput of the processor cores? How sensitive is each model to bandwidth or latency variations? We believe that such a direct comparison will provide valuable information for the CMP architecture debate and generate some guidelines for the development of future systems.
The major conclusions from our comparison are:
• For data-parallel applications with abundant data reuse, the two models perform and scale equally well. Caches are as effective as software-managed memories at capturing locality and reducing average memory access time. For some of these applications, streaming has an energy advantage of 10% to 25% over write-allocate caches because it avoids superfluous refills on output data streams. Using a no-write-allocate policy for output data in the cache-based system reduces the streaming advantage.
• For applications without significant data reuse, macroscopic prefetching (double-buffering) provides streaming memory systems with a performance advantage when we scale the number and computational capabilities of the cores. The use of hardware prefetching with the cache-based model eliminates the streaming advantage for some latency-bound applications. There are also cases where streaming performs worse, such as when it requires redundant copying of data or extra computation in order to manage local stores.
• Our results indicate that a pure streaming memory model is not sufficiently advantageous at the memory system level. With the addition of prefetching and non-allocating writes, the cache-based model provides similar performance, energy, and bandwidth behavior. On the other hand, we found that "streaming" at the programming model level is very important, even with the cache-based model. Properly blocking an application's working set, exposing producer-consumer locality, and identifying output-only data leads to significant efficiency gains. Moreover, stream programming leads to code that requires coarser-grain and lower-frequency coherence and consistency in a cache-based system. This observation will be increasingly relevant as CMPs scale to much larger numbers of cores.
• Finally, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution, even when the programmer cannot fully regularize an application's code. With the streaming memory model, the software must orchestrate locality and communication perfectly, even for irregular codes.
The rest of the paper is organized as follows. Section 2 summarizes the two memory models and discusses their advantages and drawbacks. Section 3 presents the architectural framework for our comparison and Section 4 describes the experimental methodology. Section 5 analyzes our evaluation results. Section 6 focuses on streaming at the programming level and its interactions with CMP architecture. Section 7 addresses limitations in our methodology. Finally, Section 8 reviews related work and Section 9 concludes the paper.
                      Communication in HW       Communication in SW
Locality in HW        Coherent Cache-based      Incoherent Cache-based
Locality in SW        (impractical)             Streaming Memory
Table 1. The design space for on-chip memory for CMPs. This work focuses on the two highlighted options: coherent cache-based and streaming memory. The third practical option, incoherent cache-based, is briefly discussed in Section 7.
2. ON-CHIP MEMORY MODELS FOR CHIP MULTIPROCESSORS
General-purpose systems are designed around a computational model and a memory model. The computational model defines the organization of execution resources and register files and may follow some combination of the superscalar, VLIW, vector, or SIMD approaches. The memory model defines the organization of on-chip and off-chip memories, as well as the communication mechanisms between the various computation and memory units. The two models are linked, and an efficient system is carefully designed along both dimensions. However, we believe that the on-chip memory model creates more interesting challenges for CMPs. First, it is the one that changes most dramatically compared to uniprocessor designs. Second, the memory model has broader implications on both software and hardware. Assuming we can move the data close to the execution units in an efficient manner, it is not difficult to select the proper computational model based on the type(s) of parallelism available in the computation kernels.
For both the cache-based and streaming models, certain aspects of the on-chip memory system are set by VLSI constraints such as wire delay [18]. Every node in the CMP system is directly associated with a limited amount of storage (first-level memory) that can be accessed within a small number of cycles. Nodes communicate by exchanging packets over an on-chip network that can range from a simple bus to a hierarchical structure. Additional memory structures (second-level memory) are also connected to the network fabric. The system scales with technology by increasing the number of nodes, the size of the network, and the capacity of second-level storage. Eventually, the scale of a CMP design may be limited by off-chip bandwidth or energy consumption [20].
Although they are under the same VLSI constraints, the cache-based and streaming models differ significantly in the way they manage data locality and inter-processor communication. As shown in Table 1, the cache-based model relies on hardware mechanisms for both, while the streaming model delegates management to software. The rest of this section overviews the protocol and operation for each memory model for CMP systems.
2.1 Coherent Cache-based Model
With the cache-based model [8], the only directly addressable storage is the off-chip memory. All on-chip storage is used for caches with hardware management of locality and communication. As cores perform memory accesses, the caches attempt to capture the application's working set by fetching or replacing data at the granularity of cache blocks. The cores communicate implicitly through loads and stores to the single memory image. Since many caches may store copies of a specific address, it is necessary to query multiple caches on load and store requests and potentially invalidate entries to keep caches coherent. A coherence protocol, such as MESI, minimizes the cases when remote cache lookups are necessary. Remote lookups can be distributed through a broadcast mechanism or by first consulting a directory structure. The cores synchronize using atomic operations such as compare-and-swap. Cache-based systems must also provide event ordering guarantees within and across cores following some consistency model [1]. Caching, coherence, synchronization, and consistency are implemented in hardware.
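As a concrete illustration of the kind of synchronization primitive referred to above, the sketch below builds a simple spin-lock from a compare-and-swap operation using C11 atomics; it is illustrative only and is not code from the systems evaluated here.

#include <stdatomic.h>

/* A minimal spin-lock built on compare-and-swap (CAS), the kind of
 * atomic operation the cores use to synchronize. Illustrative sketch. */
typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    int expected = 0;
    /* CAS: atomically replace 0 with 1; retry until we win the race. */
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
        expected = 0;                 /* CAS wrote the observed value here */
        while (atomic_load(&l->locked) != 0)
            ;                         /* spin locally until the lock looks free */
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);      /* release the critical section */
}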
Coherent caching techniques were developed for board-level and cabinet-level systems (SMPs and DSMs), for which communication latency ranges from tens to hundreds of cycles. In CMPs, coherence signals travel within one chip, where latency is much lower and bandwidth is much higher. Consequently, even algorithms with non-trivial amounts of communication and synchronization can scale reasonably well. Moreover, the efficient design points for coherent caching in CMPs are likely to be different from those for SMP and DSM systems.
2.2 Streaming Memory Model
With streaming, the local memory for data in each core is a separately addressable structure, called a scratch-pad, local store, or stream register file. We adopt the term local store in this work. Software is responsible for managing locality and communication across local stores. Software has full flexibility in placing frequently accessed data in local stores with respect to location, granularity, replacement policy, allocation policy, and even the number of copies. For applications with statically analyzable access patterns, software can exploit this flexibility to minimize communication and overlap it with useful computation in the best possible application-specific way. Data are communicated between local stores or to and from off-chip memory using explicit DMA transfers. The cores can access their local stores as FIFO queues or as randomly indexed structures [23]. The streaming model requires DMA engines, but no other special hardware support for coherence or consistency.
Streaming is essentially message-passing applied to CMP designs, though there are some important differences compared to conventional message-passing for clusters and massively parallel systems [8]. First, communication is always orchestrated at the user-level, and its overhead is low. Next, messages are exchanged at the first level of the memory hierarchy, not the last one. Since the communicating cores are on the same chip, communication latencies are low and bandwidth is high. Finally, software manages both the communication between cores and the communication between a core and off-chip memory.
The discussion above separates the two memory models from the selection of a computation model. The streaming memory model is general and has already been used with VLIW/SIMD systems [3], RISC cores [39], vector processors [30], and even DSPs [32]. The cache-based model can be used with any of these computation models as well.
2.3 Qualitative Comparison
When considering the memory system alone, we find several ways in which cache-based and streaming memory systems may differ: bandwidth utilization, latency tolerance, performance, energy consumption, and cost. This section summarizes each of these differences in a qualitative manner.
Off-chip Bandwidth & Local Memory Utilization: The cache-based model leads to bandwidth and storage capacity waste on sparsely strided memory accesses. In the absence of spatial locality, manipulating data at the granularity of wide cache lines is wasteful. Streaming memory systems, by virtue of strided scatter and gather DMA transfers, can use the minimum memory channel bandwidth necessary to deliver data, and also compact the data within the local store. Note, however, that memory and interconnect channels are typically optimized for block transfers and may not be bandwidth efficient for strided or scatter/gather accesses.
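To make the cost of sparse strides concrete (the numbers here are our own illustration, not results from the evaluation), consider a kernel that touches one 4-byte word out of every 32-byte cache line:
\[
\frac{\text{useful bytes}}{\text{fetched bytes}} = \frac{4}{32} = 12.5\%,
\]
so a cache with 32-byte lines moves roughly $8\times$ the data the kernel actually consumes, while a strided gather DMA can request only the 4-byte words, subject to the minimum burst size of the memory channel.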
Caching can also waste off-chip bandwidth on unnecessary refills for output data. Because caches often use write-allocate policies, store misses force memory reads before the data are overwritten in the cache. If an application has disjoint input and output streams, the refills may waste a significant percentage of bandwidth. Similarly, caching can waste bandwidth on write-backs of dead temporary data. A streaming system does not suffer from these problems, as the output and temporary buffers are managed explicitly. Output data are sent off-chip without refills, and dead temporary data can be ignored, as they are not mapped to off-chip addresses. To mitigate the refill problem, cache-based systems can use a no-write-allocate policy. In this case, it is necessary to group store data in write buffers before forwarding them to memory in order to avoid wasting bandwidth on narrow writes [4]. Another approach is to use cache control instructions, such as "Prepare For Store" [34], that instruct the cache to allocate a cache line but avoid retrieval of the old values from memory. Similarly, temporary data can be marked invalid at the end of a computation [7, 43]. In any case, software must determine when to use these mechanisms.
Streaming systems may also waste bandwidth and storage capacity on programs with statically unpredictable, irregular data patterns. A streaming system can sometimes cope with these patterns by fetching a superset of the needed input data. Alternatively, at the cost of enduring long latencies, the system could use a DMA transfer to collect required data on demand from main memory before each computational task. For programs that operate on overlapping blocks or graph structures with multiple references to the same data, the streaming system may naively re-fetch data. This can be avoided through increased address generation complexity or software caching. Finally, for applications that fetch a block and update some of its elements in-place, a streaming system will often write back the whole block to memory at the end of the computation, even if some data were not updated. In contrast to all of these scenarios, cache-based systems perform load and store accesses on demand, and hence only move cache lines as required. They may even search for copies of the required data in other on-chip caches before going off-chip.
Latency Tolerance: The cache-based model is traditionally reactive, meaning that a miss must occur before a fetch is triggered. Memory latency can be hidden using hardware prefetching techniques, which detect repetitive access patterns and issue memory accesses ahead of time, or proactive software prefetching. In practice, the DMA transfers in the streaming memory model are an efficient and accurate form of software prefetching. They can hide a significant amount of latency, especially if double-buffering is used. Unlike hardware prefetching, which requires a few misses before a pattern is detected (microscopic view), a DMA access can start arbitrarily early (macroscopic view) and can capture both regular and irregular (scatter/gather) accesses.
Performance: From the discussion so far, one can conclude that streaming memory may have a performance or scaling advantage for regular applications, due to potentially better latency tolerance and better usage of off-chip bandwidth or local storage. These advantages are important only if latency, bandwidth, or local storage capacity are significant bottlenecks to begin with. For example, reducing the number of misses is unimportant for a computationally intensive application that already has very good locality. In the event that an application is bandwidth-bound, latency tolerance measures will be ineffective. A drawback for streaming, even with regular code, is that it often has to execute additional instructions to set up DMA transfers. For applications with unpredictable data access patterns or control flow, a streaming system may execute a significantly higher number of instructions than that of a cache-based system in order to produce predictable patterns or to use the local store to emulate a software cache.
Energy Consumption: Any performance advantage also translates to an energy advantage, as it allows us to turn off the system early or scale down its power supply and clock frequency. Streaming accesses to the first-level storage eliminate the energy overhead of caches (tag access and tag comparison). The cache-based model consumes additional energy for on-chip coherence traffic, snoop requests or directory lookups. Moreover, efficient use of the available off-chip bandwidth by either of the two models (through fewer transfers or messages) reduces the energy consumption by the interconnect network and main memory.
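The first-order accounting behind these claims can be written as a simple, illustrative energy model (our notation, not an equation from the paper), where each structure contributes an access energy per event plus leakage over the run time:
\[
E_{\text{total}} \approx \sum_{i \in \{\text{core, L1/LS, network, L2, DRAM}\}} \left( N_i \cdot E_{\text{access},i} + P_{\text{leak},i} \cdot T \right),
\qquad
E_{\text{access}}^{\text{cache}} \approx E_{\text{tag}} + E_{\text{data}} \;>\; E_{\text{access}}^{\text{local store}} \approx E_{\text{data}} .
\]
Shortening the execution time $T$ reduces the leakage term, which is why a performance advantage also translates into an energy advantage, and reducing the number of off-chip events $N_{\text{DRAM}}$ attacks the largest per-event energy in the sum.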
Complexity&Cost:Itisdifficulttomakeaccurateestimatesofhardwarecostwithoutcomparableimplementations.Thehardwareforthecache-basedmodelisgenerallymorecomplextodesignandverify,ascoherence,synchronization,consistency,andprefetchinginteractinsubtleways.Still,reuseacrossserver,desktop,andem-beddedCMPdesignscansignificantlyreducesuchcosts.Ontheotherhand,streamingpassesthecomplexitytosoftware,thecom-piler,and/ortheprogrammer.Forapplicationsinthesynchronousdata-flow,DSP,ordensematrixdomains,itisoftenstraightfor-wardtoexpressastreamingalgorithm.Forotherapplications,itisnon-trivial,andasinglegoodalgorithmisoftenaresearchcon-tributioninitself[11,14].Finally,complexityandcostmustalsobeconsideredwithscalinginmind.Amemorymodelhasanad-vantageifitallowsefficientuseofmorecoreswithouttheneedfordisproportionalincreasesinbandwidthorsomeotherresourceinthememorysystem.
3.CMPARCHITECTUREFRAMEWORK
WecomparethetwomodelsusingtheCMParchitectureshowninFigure1.TherearenumerousdesignparametersinaCMPsys-tem,andevaluatingthetwomodelsunderallpossiblecombinationsisinfeasible.Hence,westartwithabaselinesystemthatrepresentsapracticaldesignpointandvaryonlythekeyparametersthatinter-actsignificantlywiththeon-chipmemorysystem.
3.1BaselineArchitecture
OurCMPdesignisbasedonin-orderprocessorssimilartoPi-ranha[5],RAW[39],UltrasparcT1[25],andXbox360[4].SuchCMPshavebeenshowntobeefficientformultimedia,communi-cations,andthroughputcomputingworkloadsastheyprovidehighcomputedensitywithouttheareaandpoweroverheadofout-of-orderprocessors[10].WeusetheTensilicaXtensaLX,3-wayVLIWcore[22].Tensilicacoreshavebeenusedinseveralem-beddedCMPdesigns,includingthe188-coreCiscoMetrorouterchip[12].Wealsoperformedexperimentswithasingle-issueXtensacorethatproducedsimilarresultsforthememorymodelcomparison.TheVLIWcoreisconsistentlyfasterby1.6xto2xfortheapplicationswestudied.Thecorehas3slotsperinstruc-tion,withupto2slotsforfloating-pointoperationsandupto1slotforloadsandstores.Duetotimeconstraints,wedonotusetheTensilicaSIMDextensionsatthispoint.Nevertheless,Sec-tion5.3includesanexperimentthatevaluatestheefficiencyofeachmodelwithprocessorsthatprovidehighercomputationalthrough-put.Eachprocessorhasa16-KByte,2-wayset-associativeinstruc-tioncache.Thefixedamountoflocaldatastorageisuseddiffer-entlyineachmemorymodel.
Weexploresystemswith1to16coresusingahierarchicalinter-connectsimilartothatsuggestedby[26].Wegroupcoresinclus-tersoffourwithawide,bidirectionalbus(localnetwork)provid-ingthenecessaryinterconnect.Theclusterstructureallowsforfastcommunicationbetweenneighboringcores.Ifthreadsaremappedintelligently,theintra-clusterbuswillhandlemostofthecore-to-corecommunication.Aglobalcrossbarconnectstheclusterstothesecond-levelstorage.Thereisbufferingatallinterfacestotoler-atearbitrationlatenciesandensureefficientbandwidthuseateachlevel.Thehierarchicalinterconnectprovidessufficientbandwidth,whileavoidingthebottlenecksoflongwiresandcentralizedarbi-tration[26].Finally,thesecondarystoragecommunicatestooff-chipmemorythroughsomenumberofmemorychannels.
Table2presentstheparametersoftheCMPsystem.Wevarythenumberofcores,thecoreclockfrequency,theavailableoff-chipbandwidth,andthedegreeofhardwareprefetching.Wekeepthecapacityofthefirst-leveldata-storageconstant.
3.2Cache-basedImplementation
Forthecache-basedmodel,thefirst-leveldatastorageineachcoreisorganizedasa32-KByte,2-wayset-associativedatacache.Thesecond-levelcacheisa512-KByte,16-wayset-associativecache.Bothcachesuseawrite-back,write-allocatepolicy.Co-herenceismaintainedusingtheMESIwrite-invalidateprotocol.Coherencerequestsarepropagatedinstepsoverthehierarchicalinterconnect.First,theyarebroadcasttootherprocessorswithinthecluster.Ifthecache-lineisnotavailableortherequestcannotbesatisfiedwithinonecluster(e.g.,upgraderequest),itisbroadcasttoallotherclustersaswell.Snoopingrequestsfromothercoresoc-cupythedatacacheforonecycle,forcingthecoretostallifittriestodoaload/storeaccessinthesamecycle.Eachcoreincludesastore-bufferthatallowsloadstobypassstoremisses.Asaresult,theconsistencymodelisweak.Sincetheprocessorsarein-order,itiseasytoprovidesufficientMSHRsforthemaximumpossiblenumberofconcurrentmisses.
Each core additionally includes a hardware stream-based prefetch engine that places data directly in the L1 cache. Modeled after the tagged prefetcher described in [41], the prefetcher keeps a history of the last 8 cache misses for identifying sequential accesses, runs a configurable number of cache lines ahead of the latest cache miss, and tracks 4 separate access streams. Our experiments include hardware prefetching only when explicitly stated.
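The following is a minimal behavioral sketch of a stream prefetcher of the kind described above (miss history, a few independent streams, configurable depth); the constants, structure, and issue_prefetch hook are illustrative, not the exact hardware that was evaluated.

#include <stdint.h>

#define LINE_SHIFT  5          /* 32-byte cache lines */
#define HIST_SIZE   8          /* remember the last 8 miss addresses */
#define NUM_STREAMS 4          /* up to 4 concurrent access streams */

typedef struct {
    uint64_t hist[HIST_SIZE];          /* recent miss history (line numbers) */
    int      hist_head;
    uint64_t stream_last[NUM_STREAMS]; /* last line seen by each stream */
    int      stream_valid[NUM_STREAMS];
    int      next_victim;              /* round-robin stream replacement */
    int      depth;                    /* lines to run ahead of the last miss */
} prefetcher_t;

extern void issue_prefetch(uint64_t line); /* hand a line address to memory */

/* Called on every L1 data miss. */
void prefetcher_on_miss(prefetcher_t *p, uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;

    /* Sequential pattern detected if the preceding line missed recently. */
    int sequential = 0;
    for (int i = 0; i < HIST_SIZE; i++)
        if (p->hist[i] == line - 1) sequential = 1;
    p->hist[p->hist_head] = line;
    p->hist_head = (p->hist_head + 1) % HIST_SIZE;

    /* Continue an existing stream that ends at the preceding line, or, when a
     * new pattern is detected, claim a stream slot round-robin. */
    int s = -1;
    for (int i = 0; i < NUM_STREAMS; i++)
        if (p->stream_valid[i] && p->stream_last[i] == line - 1) s = i;
    if (s < 0) {
        if (!sequential) return;       /* no pattern yet: do nothing */
        s = p->next_victim;
        p->next_victim = (p->next_victim + 1) % NUM_STREAMS;
    }
    p->stream_valid[s] = 1;
    p->stream_last[s]  = line;

    /* Run `depth` lines ahead of the latest miss. */
    for (int d = 1; d <= p->depth; d++)
        issue_prefetch(line + d);
}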
We use POSIX threads to manually parallelize and tune applications [27]. The applications used are regular and use locks to implement efficient task-queues and barriers to synchronize SPMD code. Higher-level programming models, such as OpenMP, are also applicable to these applications.
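A minimal sketch of this SPMD structure (a lock-protected task queue plus a barrier) is shown below; the task granularity and function names are illustrative stand-ins, not the actual application code.

#include <pthread.h>

#define NUM_THREADS 16
#define NUM_TASKS   1024   /* e.g., macroblocks or blocks of a frame */

static pthread_mutex_t   q_lock = PTHREAD_MUTEX_INITIALIZER;
static int               next_task = 0;       /* shared task-queue cursor */
static pthread_barrier_t phase_barrier;

extern void process_task(int id);             /* one unit of work */

/* Each core runs the same code, pulls tasks until the queue is empty,
 * then waits at a barrier before the next phase (e.g., the next frame). */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        int t = (next_task < NUM_TASKS) ? next_task++ : -1;
        pthread_mutex_unlock(&q_lock);
        if (t < 0) break;
        process_task(t);
    }
    pthread_barrier_wait(&phase_barrier);      /* synchronize SPMD phases */
    return NULL;
}

int run_parallel_phase(void)
{
    pthread_t tid[NUM_THREADS];
    pthread_barrier_init(&phase_barrier, NULL, NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&phase_barrier);
    return 0;
}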
3.3 Streaming Implementation
For the streaming model, the first-level data storage in each core is split between a 24-KByte local store and an 8-KByte cache. The small cache is used for stack data and global variables. It is particularly useful for the sequential portions of the code and helps simplify the programming and compilation of the streaming portion as well. The 24-KByte local store is indexed as a random access memory. Our implementation also provides hardware support for FIFO accesses to the local store, but we did not use this feature with any of our applications. Each core has a DMA engine that supports sequential, strided, and indexed transfers, as well as command queuing. At any point, the DMA engine may have up to 16 outstanding 32-byte accesses.
The local store is effectively smaller than a 24-KByte cache, since it has no tag or control bits overhead. We do not raise the size of the local store, since the small increment (2 KBytes) does not make a difference for our applications. Still, the energy consumption model accounts correctly for the reduced capacity. The secondary storage is again organized as a 16-way set-associative cache. L2 caches are useful with stream processors, as they capture long-term reuse patterns and avoid expensive accesses to main memory [36, 9]. The L2 cache avoids refills on write misses when DMA transfers overwrite entire lines.
Figure 1. The architecture of the CMP system with up to 16 processors. The core organizations for the cache-based and streaming models are shown on the left and right sides, respectively.
We developed streaming code using a simple library for DMA transfers within threaded C code. We manually applied the proper blocking factor and double-buffering in order to overlap DMA transfers with useful computation. We also run multiple computational kernels on each data block to benefit from producer-consumer locality without additional memory accesses or write-backs for intermediate results. Higher-level stream programming models should be applicable to most of our applications [15, 13]. In some cases, the DMA library uses a scheduling thread that queues pending transfers. We avoid taking up a whole core for this thread by multiplexing it with an application thread. The performance impact is negligible.
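The streaming kernels roughly follow the double-buffered pattern sketched below. The dma_get/dma_put/dma_wait calls are hypothetical stand-ins for the library primitives (here, dma_wait(tag) is assumed to block until every transfer issued with that tag has completed), not its actual interface, and the block size and kernel name are illustrative.

#include <stddef.h>

#define BLOCK 128                      /* elements per DMA transfer (illustrative) */

/* Hypothetical DMA primitives: asynchronous copies between off-chip memory
 * and the local store, identified by a tag that can be waited on. */
extern void dma_get (void *local_dst, const void *mem_src, size_t bytes, int tag);
extern void dma_put (void *mem_dst, const void *local_src, size_t bytes, int tag);
extern void dma_wait(int tag);

extern void compute_block(float *out, const float *in, int n); /* fused kernels */

/* Double-buffered streaming loop: while block i is processed out of one
 * local-store buffer, block i+1 is fetched into the other. */
void stream_process(const float *in, float *out, int nblocks)
{
    static float in_buf [2][BLOCK];    /* resides in the local store */
    static float out_buf[2][BLOCK];

    dma_get(in_buf[0], in, sizeof in_buf[0], 0);
    for (int i = 0; i < nblocks; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < nblocks)           /* prefetch the next block early */
            dma_get(in_buf[nxt], in + (i + 1) * BLOCK, sizeof in_buf[nxt], nxt);

        dma_wait(cur);                 /* input block i (and any older transfer
                                          reusing these buffers) is now done */
        compute_block(out_buf[cur], in_buf[cur], BLOCK);

        /* Output goes straight off-chip: no refill of the destination. */
        dma_put(out + i * BLOCK, out_buf[cur], sizeof out_buf[cur], cur);
    }
    dma_wait(0); dma_wait(1);          /* drain outstanding transfers */
}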
4. METHODOLOGY
4.1 Simulation and Energy Modeling
We used Tensilica's modeling tools to construct a CMP simulator for both memory models [40]. The simulator captures all stall and contention events in the core pipeline and the memory system. Table 2 summarizes the major system characteristics and the parameters we varied for this study. The default values are shown in bold. The applications were compiled with Tensilica's optimizing compiler at the -O3 optimization level. We fast-forward over the initialization for each application but simulate the rest to completion, excluding 179.art for which we measure 10 invocations of the train match function.
We also developed an energy model for the architecture in a 90nm CMOS process (1.0V power supply). For the cores, the model combines usage statistics (instruction mix, functional unit utilization, stalls and idle cycles, etc.) with energy data from the layout of actual Tensilica designs at 600 MHz in 90nm. The energy consumed by on-chip memory structures is calculated using CACTI 4.1 [38], which includes a leakage power model and improved circuit models compared to CACTI 3. Interconnect energy is calculated based on our measured activity statistics and scaled power measurements from [19]. The energy consumption for off-chip DRAM is derived from DRAMsim [42]. We model the effect of leakage and clock gating on energy at all levels of the model.
Table 2. Parameters for the CMP system. For parameters that vary, we denote the default value in bold. Latencies are for a 90nm CMOS process.
Core (CC and STR): 1, 2, 4, 8, or 16 Tensilica LX cores, 3-way VLIW, 7-stage pipeline; 800 MHz, 1.6 GHz, 3.2 GHz, or 6.4 GHz clock frequency; 2 FPUs, 2 integer units, 1 load/store unit
I-cache (CC and STR): 16 KB, 2-way associative, 32-byte blocks, 1 port
Data storage (CC): 32 KB, 2-way associative cache, 32-byte blocks, 1 port, MESI; hardware stream prefetcher
Data storage (STR): 24 KB local store, 1 port; 8 KB, 2-way associative cache, 32-byte blocks, 1 port; DMA engine with 16 outstanding accesses
Local network (CC and STR): bidirectional bus, 32 bytes wide, 2 cycle latency (after arbitration)
Global crossbar (CC and STR): 1 input and output port per cluster or L2 bank, 16 bytes wide, 2.5 ns latency (pipelined)
L2 cache (CC and STR): 512 KB, 16-way set associative, 1 port, 2.2 ns access latency, non-inclusive
DRAM (CC and STR): one memory channel at 1.6 GB/s, 3.2 GB/s, 6.4 GB/s, or 12.8 GB/s; 70 ns random access latency
4.2 Applications
Table 3 presents the set of applications used for this study. They represent applications from the multimedia, graphics, physical simulation, DSP, and data management domains. Such applications have been used to evaluate and motivate the development of streaming architectures. MPEG-2, H.264, Raytracer, JPEG, and Stereo Depth Extraction are compute-intensive applications and show exceptionally good cache performance despite their large datasets. They exhibit good spatial or temporal locality and have enough computation per data element to amortize the penalty for any misses. FEM is a scientific application, but has about the same compute intensity as multimedia applications. The remaining applications (Bitonic Sort, Merge Sort, FIR, and 179.art) perform a relatively small computation on each input element. They require considerably higher off-chip bandwidth and are sensitive to memory latency.
Table 3. Memory characteristics of the applications measured on the cache-based model using 16 cores running at 800 MHz. Columns: Application | Input Dataset | L1 D-Miss Rate | L2 D-Miss Rate | Instr. per L1 D-Miss | Cycles per L2 D-Miss | Off-chip B/W.
MPEG-2 Encoder [28] | 10 CIF frames (Foreman) | 0.58% | 85.3% | 324.8 | 135.4 | 292.4 MB/s
H.264 Encoder | 10 CIF frames (Foreman) | 0.06% | 30.8% | 3705.5 | 4225.9 | 10.8 MB/s
KD-tree Raytracer | 128x128, 16371 triangles | 1.06% | 98.9% | 256.3 | 6.6 | 45.1 MB/s
JPEG Encoder [21] | 128 PPMs of various sizes | 0.40% | 72.9% | 577.1 | 84.2 | 402.2 MB/s
JPEG Decoder [21] | 128 JPGs of various sizes | 0.58% | 76.2% | 352.9 | 44.9 | 1059.2 MB/s
Stereo Depth Extraction | 3 CIF image pairs | 0.03% | 46.1% | 8662.5 | 3995.3 | 11.4 MB/s
2D FEM | 5006 cell mesh, 7663 edges | 0.60% | 86.2% | 368.8 | 55.5 | 587.9 MB/s
FIR filter | 2^20 32-bit samples | 0.63% | 99.8% | 214.6 | 20.4 | 1839.1 MB/s
179.art | SPEC reference dataset | 1.79% | 7.4% | 150.1 | 230.9 | 227.7 MB/s
Bitonic Sort | 2^19 32-bit keys (2MB) | 2.22% | 98.2% | 140.9 | 26.1 | 1594.2 MB/s
Merge Sort | 2^19 32-bit keys (2MB) | 3.98% | 99.7% | 71.1 | 33.7 | 1167.8 MB/s
We manually optimized both versions of each application to eliminate bottlenecks and schedule its parallelism in the best possible way. Whenever appropriate, we applied the same data-locality optimizations (i.e. blocking, producer-consumer, etc.) to both models. In Section 6, we explore the impact of data-locality optimizations. The following is a brief description of how each application was parallelized.
MPEG-2 and H.264 are parallelized at the macroblock level. Both dynamically assign macroblocks to cores using a task queue. Macroblocks are entirely data-parallel in MPEG-2. In H.264, we schedule the processing of dependent macroblocks so as to minimize the length of the critical execution path. With the CIF resolution video frames we encode for this study, the macroblock parallelism available in H.264 is limited. Stereo Depth Extraction is parallelized by dividing input frames into 32x32 blocks and statically assigning them to processors.
KD-tree Raytracer is parallelized across camera rays. We assign rays to processors in chunks to improve locality. Our streaming version reads the KD-tree from the cache instead of streaming it with a DMA controller. JPEG Encode and JPEG Decode are parallelized across input images, in a manner similar to that done by an image thumbnail browser. Note that Encode reads a lot of data but outputs little; Decode behaves in the opposite way. The Finite Element Method (FEM) is parallelized across mesh cells. The FIR filter has 16 taps and is parallelized across long strips of samples. 179.art is parallelized across F1 neurons; this application is composed of several data-parallel vector operations and reductions between which we place barriers.
Merge Sort and Bitonic Sort are parallelized across sub-arrays of a large input array. The processors first sort chunks of 4096 keys in parallel using quicksort. Then, sorted chunks are merged or sorted until the full array is sorted. Merge Sort gradually reduces in parallelism as it progresses, whereas Bitonic Sort retains full parallelism for its duration. Merge Sort alternates writing output sublists to two buffer arrays, while Bitonic Sort operates on the list in situ.
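A minimal sketch of the first two phases just described (parallel chunk sort with quicksort, then pairwise merging of sorted runs) is shown below; thread management is omitted (it follows the task-queue sketch in Section 3.2) and the names are illustrative.

#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define CHUNK 4096                      /* keys sorted locally before merging */

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Phase 1: each core quicksorts its own CHUNK-sized slice (one call per
 * core, run in parallel). */
void sort_chunk(uint32_t *keys, size_t chunk_idx)
{
    qsort(keys + chunk_idx * CHUNK, CHUNK, sizeof(uint32_t), cmp_u32);
}

/* Phase 2: merge two sorted runs into dst (one step of the merge tree). */
void merge_runs(uint32_t *dst, const uint32_t *a, size_t na,
                const uint32_t *b, size_t nb)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        dst[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    memcpy(dst + k, a + i, (na - i) * sizeof(uint32_t));
    memcpy(dst + k + (na - i), b + j, (nb - j) * sizeof(uint32_t));
}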
5. EVALUATION
Our evaluation starts with a comparison of the streaming system to the baseline cache-based system without prefetching or other enhancements (Sections 5.1 and 5.2). We then study the bandwidth consumption and latency tolerance of the two systems (Section 5.3). We conclude by evaluating means to enhance the performance of caching systems (Sections 5.4 and 5.5).
5.1 Performance Comparison
Figure 2 presents the execution time for the coherent cache-based (CC) and streaming (STR) models as we vary the number of 800 MHz cores from 2 to 16. We normalize to the execution time of the sequential run with the cache-based system. The components of execution time are: useful execution (including fetch and non-memory pipeline stalls), synchronization (lock, barrier, wait for DMA), and stalls for data (caches). Lower bars are better. The cache-based results assume no hardware prefetching.
Figure 2. Execution times for the two memory models as the number of cores is increased, normalized to a single caching core.
For 7 out of 11 applications (MPEG-2, H.264, Depth, Raytracing, FEM, JPEG Encode and Decode), the two models perform almost identically for all processor counts. These programs perform a significant computation on each data element fetched and can be classified as compute-bound. Both caches and local stores capture their locality patterns equally well.
The remaining applications (179.art, FIR, Merge Sort, and Bitonic Sort) are data-bound and reveal some interesting differences between the two models. The cache-based versions stall regularly due to cache misses. Streaming versions eliminate many of these stalls using double-buffering (macroscopic prefetching). This is not the case for Bitonic Sort, because off-chip bandwidth is saturated at high processor counts. Bitonic Sort is an in-place sorting algorithm, and it is often the case that sublists are moderately in-order and elements don't need to be swapped, and consequently don't need to be written back. The cache-based system naturally discovers this behavior, while the streaming memory system writes the unmodified data back to main memory anyway. H.264 and Merge Sort have synchronization stalls with both models due to limited parallelism.
There are subtle differences in the useful cycles of some applications. FIR executes 14% more instructions in the streaming model than the caching model because of the management of DMA transfers (128 elements per transfer). In the streaming Merge Sort, the inner loop executes extra comparisons to check if an output buffer is full and needs to be drained to main memory, whereas the cache-based variant freely writes data sequentially. Even though double-buffering eliminates all data stalls, the application runs longer because of its higher instruction count. The streaming H.264 takes advantage of some boundary-condition optimizations that proved difficult in the cache-based variant. This resulted in a slight reduction in instruction count when streaming. MPEG-2 suffers a moderate number of instruction cache misses due to the cache's limited size.
Overall, Figure 2 shows that neither model has a consistent performance advantage. Even without prefetching, the cache-based model performs similarly to the streaming one, and sometimes it is actually faster (Bitonic Sort for 16 cores). Both models use data efficiently, and the fundamental parallelism available in these applications is not affected by how data are moved. Although the differences in performance are small, there is a larger variation in off-chip bandwidth utilization of the two models. Figure 3 shows that each model has an advantage in some situations.
Figure 3. Off-chip traffic for the cache-based and streaming systems with 16 CPUs, normalized to a single caching core.
Figure 4. Energy consumption for the cache-based and streaming systems with 16 CPUs, normalized to a single caching core.
5.2 Energy Comparison
Figure 4 presents energy consumption for the two models running FEM, MPEG-2, FIR, and Bitonic Sort. We normalize to the energy consumption of a single caching core for each application. Each bar indicates the energy consumed by the cores, the caches and local stores, the on-chip network, the second-level cache, and the main memory. The numbers include both static and dynamic power. Lower bars are better. In contrast to performance scaling, energy consumption does not always improve with more cores, since the amount of hardware used to run the application increases. For 5 out of 11 applications (JPEG Encode, JPEG Decode, FIR, 179.art, and Merge Sort), streaming consistently consumes less energy than cache-coherence, typically 10% to 25%. The energy differential in nearly every case comes from the DRAM system. This can be observed in the correlation between Figures 3 and 4. Specifically, the streaming applications typically transfer fewer bytes from main memory, often through the elimination of superfluous refills for output-only data. The opposite is true for our streaming Bitonic Sort, which tends to communicate more data with main memory than the caching version due to the write-back of unmodified data. For applications where there is little bandwidth difference between the two models (such as FEM) or the computational intensity is very high (such as Depth), the difference in energy consumption is insignificant.
We expected to see a greater difference between the local store and L1 data cache, but it never materialized. Since our applications are data-parallel and rarely share data, the energy cost of an average cache miss is dominated by the off-chip DRAM access rather than the modest tag broadcast and lookup. Hence, the per-access energy savings by eliminating tag lookups in the streaming system made little impact on the total energy footprint of the system.
Figure 5. Normalized execution time as the computation rate of processor cores is increased (16 cores).
5.3 Increased Computational Throughput
Up to now, the results assume 800 MHz cores, which are reasonable for embedded CMPs for consumer applications. To explore the efficiency of the two memory models as the computational throughput of the processor is increased, we vary the clock frequency of the cores while keeping constant the bandwidth and latency in the on-chip networks, L2 cache, and off-chip memory. In some sense, the higher clock frequencies tell us in general what would happen with more powerful processors that use SIMD units, out-of-order schemes, higher clock frequency, or a combination. For example, the 6.4 GHz configuration can be representative of the performance of an 800 MHz processor that uses 4- to 8-wide SIMD instructions. The experiment was performed with 16 cores to stress scaling and increase the system's sensitivity to both memory latency and memory bandwidth.
Applications with significant data reuse, such as H.264 and Depth, show no sensitivity to this experiment and perform equally well on both systems. Figure 5 shows the results for some of the applications that are affected by computational scaling. These applications fall into one of two categories: bandwidth-sensitive or latency-sensitive. Latency-sensitive programs, like MPEG-2 Encoding, perform a relatively large degree of computation between off-chip memory accesses (hundreds of instructions). While the higher core frequency shortens these computations, it does not reduce the amount of time (in nanoseconds, not cycles) required to fetch the data in between computations. The macroscopic prefetching in the streaming system can tolerate a significant percentage of the memory latency. Hence at 6.4 GHz, the streaming MPEG-2 Encoder is 9% faster.
Bandwidth-sensitive applications, like FIR and Bitonic Sort, eventually saturate the available off-chip bandwidth. Beyond that point, further increases in computational throughput do not improve overall performance. For FIR, the cache-based system saturates before the streaming system due to the superfluous refills on store misses to output-only data. At the highest computational throughput, the streaming system performs 36% faster. For Bitonic Sort, the streaming version saturates first, since it performs more writes than the cache-based version (as described in 5.1). This gives the cache-based version a 19% performance advantage.
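The saturation behavior can be summarized by a simple, illustrative bound (our notation): once the memory channel is the limiting resource, execution time cannot drop below the traffic divided by the available bandwidth, regardless of core frequency,
\[
T \;\gtrsim\; \max\!\left( \frac{N_{\text{ops}}}{\text{compute throughput}},\; \frac{\text{bytes moved off-chip}}{BW_{\text{off-chip}}} \right),
\]
so raising the clock shrinks only the first term. Streaming helps FIR by shrinking the bytes moved (no refills), while its extra write-backs enlarge that term for Bitonic Sort.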
5.4 Mitigating Latency and Bandwidth Issues
The previous section indicates that when a large number of cores with high computational throughput are used, the cache-based model faces latency and bandwidth issues with certain applications. To characterize these inefficiencies, we performed experiments with increased off-chip bandwidth and hardware prefetching.
Figure 6 shows the impact of increasing the available off-chip bandwidth for FIR. This can be achieved by using higher frequency DRAM (e.g. DDR2, DDR3, GDDR) or multiple memory channels. With more bandwidth available, the effect of superfluous refills is significantly reduced, and the cache-based system performs nearly as well as the streaming one. When hardware prefetching is introduced at 12.8 GB/s, load stalls are reduced to 3% of the total execution time. However, the additional off-chip bandwidth does not close the energy gap for this application. An energy-efficient solution for the cache-based system is to use a non-allocating write policy, which we explore in Section 5.5.
Figure 6. The effect of increased off-chip bandwidth on FIR performance. Measured on 16 cores at 3.2 GHz.
For Merge Sort and 179.art (Figure 7), hardware prefetching significantly improves the latency tolerance of the cache-based systems; data stalls are virtually eliminated. This is not to say that we never observed data stalls (at 16 cores, the cache-based Merge Sort saturates the memory channel due to superfluous refills), but a small degree of prefetching is sufficient to hide over 200 cycles of memory latency.
5.5 Mitigating Superfluous Refills
For some applications, the cache-based system uses more off-chip bandwidth (and consequently energy) because of superfluous refills for output-only data. This disadvantage can be addressed by using a non-write-allocate policy for output-only data streams. We mimic non-allocating stores by using an instruction similar to the MIPS32 "Prepare for Store" (PFS) instruction [34]. PFS allocates and validates a cache line without refilling it. The results for FIR, Merge Sort, and MPEG-2 are shown in Figure 8. For each application, the elimination of superfluous refills brings the memory traffic and energy consumption of the cache-based model into parity with the streaming model. For MPEG-2, the memory traffic due to write misses was reduced 56% compared to the cache-based application without PFS.
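A minimal sketch of how an output-only loop can apply such an instruction is shown below. PFS_LINE is a hypothetical macro standing in for the cache-control operation (on real hardware it would expand to something like a MIPS32 "Prepare for Store" cache op); it is not an interface of our simulator.

#include <stdint.h>
#include <stddef.h>

#define LINE_BYTES 32
#define LINE_WORDS (LINE_BYTES / sizeof(uint32_t))

/* Hypothetical stand-in: on real hardware this would emit a cache-control
 * instruction (e.g., MIPS32 "Prepare for Store") that allocates and
 * validates the line holding `p` without refilling it from memory. */
#define PFS_LINE(p) ((void)(p))

/* Output-only kernel: every word of dst is overwritten, so refilling its
 * cache lines from memory on store misses would be pure bandwidth waste. */
void produce_output(uint32_t *dst, const uint32_t *src, size_t n_lines)
{
    for (size_t l = 0; l < n_lines; l++) {
        PFS_LINE(&dst[l * LINE_WORDS]);            /* claim the line, no refill */
        for (size_t w = 0; w < LINE_WORDS; w++)
            dst[l * LINE_WORDS + w] = src[l * LINE_WORDS + w] + 1; /* example work */
    }
}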
Figure 7. The effect of hardware prefetching on performance. P4 refers to the prefetch depth of 4. Measured on 2 cores at 3.2 GHz with a 12.8 GB/s memory channel.
Figure 8. The effect of "Prepare for Store" (PFS) instructions on the off-chip traffic for the cache-based system, normalized to a single caching core. Also shown is energy consumption for FIR with 16 cores at 800 MHz. See Figure 4 for the second graph's legend.
Note that a full hardware implementation of a non-write-allocate cache policy, along with the necessary write-gathering buffer, might perform better than PFS as it would also eliminate cache replacements due to output-only data.
6. STREAMING AS A PROGRAMMING MODEL
Our evaluation so far shows that the two memory models lead to similar performance and scaling. It is important to remember that we took advantage of streaming optimizations, such as blocking and locality-aware scheduling, on both memory models. To illustrate this, it is educational to look at stream programming and its implications for CMP architecture.
Stream programming models encourage programmers to think about the locality, data movement, and storage capacity issues in their applications [15, 13]. While they do not necessarily require the programmer to manage these issues, the programmer structures the application in such a way that it is easy to reason about them. This exposed nature of a stream program is vitally important for streaming architectures, as it enables software or compiler management of data locality and asynchronous communication with architecturally visible on-chip memories. Despite its affinity for stream architectures, we find that stream programming is beneficial for cache-based architectures as well.
Figure 9. The effect of stream programming optimizations on the off-chip bandwidth and performance of MPEG-2 at 800 MHz.
Figure 10. The effect of stream programming optimizations on the performance of 179.art at 800 MHz.
Figure 9 shows the importance of streaming optimizations in the cache-based MPEG-2 Encoder. The original parallel code from [28] performs an application kernel on a whole video frame before the next kernel is invoked (i.e. Motion Estimation, DCT, Quantization, etc.). We restructured this code by hoisting the inner loops of several tasks into a single outer loop that calls each task in turn. In the optimized version, we execute all tasks on a block of a frame before moving to the next block. This also allowed us to condense a large temporary array into a small stack variable. The improved producer-consumer locality reduced write-backs from L1 caches by 60%. Data stalls are reduced by 41% at 6.4 GHz, even without prefetching. Furthermore, improving the parallel efficiency of the application became a simple matter of scheduling a single data-parallel loop, which alone is responsible for a 40% performance improvement at 16 cores. However, instruction cache misses are notably increased in the streaming-optimized code.
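The restructuring described above for MPEG-2 is sketched below with illustrative kernel and buffer names (the real encoder kernels are Motion Estimation, DCT, Quantization, and so on, and are more involved): the original code runs each kernel over the whole frame, while the stream-optimized code runs all kernels over one block before moving on, so intermediate data stay in the L1 cache or a small stack buffer.

#define NUM_BLOCKS  396     /* e.g., macroblocks in a CIF frame (illustrative) */
#define BLOCK_SIZE  384     /* bytes of intermediate data per block (illustrative) */

typedef unsigned char byte_t;

/* Illustrative kernel signatures standing in for the encoder stages. */
extern void kernel_a(byte_t *out, const byte_t *in, int block);
extern void kernel_b(byte_t *out, const byte_t *in, int block);
extern void kernel_c(byte_t *out, const byte_t *in, int block);

/* Original structure: each kernel sweeps the whole frame, so the large
 * temporaries tmp1/tmp2 are written back and re-fetched between kernels. */
void encode_frame_unfused(byte_t *out, const byte_t *frame,
                          byte_t *tmp1, byte_t *tmp2)
{
    for (int b = 0; b < NUM_BLOCKS; b++) kernel_a(tmp1 + b * BLOCK_SIZE, frame + b * BLOCK_SIZE, b);
    for (int b = 0; b < NUM_BLOCKS; b++) kernel_b(tmp2 + b * BLOCK_SIZE, tmp1 + b * BLOCK_SIZE, b);
    for (int b = 0; b < NUM_BLOCKS; b++) kernel_c(out  + b * BLOCK_SIZE, tmp2 + b * BLOCK_SIZE, b);
}

/* Stream-style structure: hoist the per-block loops into one outer loop and
 * run every kernel on a block before moving on; the frame-sized temporaries
 * collapse into small stack buffers that stay resident in the L1 cache. */
void encode_frame_fused(byte_t *out, const byte_t *frame)
{
    for (int b = 0; b < NUM_BLOCKS; b++) {
        byte_t t1[BLOCK_SIZE], t2[BLOCK_SIZE];   /* per-block temporaries */
        kernel_a(t1, frame + b * BLOCK_SIZE, b);
        kernel_b(t2, t1, b);
        kernel_c(out + b * BLOCK_SIZE, t2, b);
    }
}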
For 179.art, we reorganized the main data structure in the cache-based version in the same way as we did for the streaming code. We were also able to replace several large temporary vectors with scalar values by merging several loops. These optimizations reduced the sparseness of 179.art's memory access pattern, improved both temporal and spatial locality, and allowed us to use prefetching effectively. As shown in Figure 10, the impact on performance is dramatic, even at small core counts (7x speedup).
Overall, we observed performance, bandwidth, and energy benefits whenever stream programming optimizations were applied to our cache-based applications. This is not a novel result, since it is well-known that locality optimizations, such as blocking and loop fusion [29], increase computational intensity and cache efficiency. However, stream programming models encourage users to write code that explicitly exposes an application's parallelism and data-access pattern, more often allowing such optimizations.
Our experience is that stream programming is actually easier with the cache-based model rather than the streaming model. With streaming memory, the programmer or the compiler must orchestrate all data movement and positioning exactly right in order for the program to operate correctly and fast. This can be burdensome for irregular access patterns (overlapping blocks, search structures, unpredictable or data-dependent patterns, etc.), or for accesses that do not affect an application's performance. It can lead to additional instructions and memory references that reduce or eliminate streaming hardware's other advantages. With cache-based hardware, stream programming is just an issue of performance optimization. Even if the algorithm is not blocked exactly right, the caches will provide best-effort locality and communication management. Hence, the programmer or the compiler can focus on the most promising and most regular data structures instead of managing all data structures in a program.
Moreover, stream programming can address some of the coherence and consistency challenges when scaling cache-based CMPs to large numbers of cores. Since a streaming application typically operates in a data-parallel fashion on a sequence of data, there is little short-term communication or synchronization between processors. Communication is only necessary when processors move from one independent set of input/output blocks to the next or reach a cycle in an application's data-flow graph. This observation may be increasingly important as CMPs grow, as it implies that less aggressive, coarser-grain, or lower-frequency mechanisms can be employed to keep caches coherent.
7. DISCUSSION AND LIMITATIONS
It is important to recognize that our study has limitations. Our experiments focus on CMPs with 1-16 cores and uniform memory access. Some of the conclusions may not generalize to larger-scale CMPs with non-uniform memory access (NUMA). A large-scale, cache-based CMP, programmed in a locality-oblivious way, will undoubtedly suffer stalls due to long memory delays or excessive on-chip coherence traffic. We observe that the stream programming model may be able to address both limitations; it exposes the flow of data early enough that they can be prefetched, and motivates a far coarser-grained, lower-frequency coherence model.
Towards the goal of designing large CMPs that are still easy to program, a hybrid memory system that combines caching and software-managed memory structures can mitigate efficiency challenges without exacerbating the difficulty of software development. For example, bulk transfer primitives for cache-based systems could enable more efficient macroscopic prefetching. Conversely, small, potentially incoherent caches in streaming memory systems could vastly simplify the use of static data structures with abundant temporal locality. An industry example of a hybrid system is the NVIDIA G80 GPU [6] which, in addition to cached access to global memory, includes a small scratch-pad for application-specific locality and communication optimizations.
Besides the limits of scalability, we did not consider architectures that expose the streaming model all the way to the register file [39] or applications without abundant data parallelism. We also did not consider changes to the pipeline of our cores, since that is precisely what makes it difficult to evaluate existing streaming memory processors compared to cache-based processors. Finally, our study was performed using general-purpose CMPs. A comparison between the two memory models for specialized CMPs remains an open issue. Despite these limitations, we believe this study's conclusions are important in terms of understanding the actual behavior of CMP memory systems and motivating future research and development.
8. RELATED WORK
Several architectures [3, 39, 16, 32] use streaming hardware with multimedia and scientific code in order to get performance and energy benefits from software-managed memory hierarchies and regular control flow. There are also corresponding proposals for stream programming languages and compiler optimizations [15, 13]. Such tools can reduce the burden on the programmer for explicit locality and communication management. In parallel, there are significant efforts to enhance cache-based systems with traffic filters [35], replacement hints [43], or prefetching hints [44]. These enhancements target the same access patterns that streaming memory systems benefit from.
To the best of our knowledge, this is the first direct comparison of the two memory models for CMP systems under a unified set of assumptions. Jayasena [23] compared a stream register file to a single-level cache for a SIMD processor. He found that the stream register file provides performance and bandwidth advantages for applications with significant producer-consumer locality. Loghi and Poncino compared hardware cache coherence to not caching shared data at all for embedded CMPs with on-chip main memory [31]. The ALP report [28] evaluates multimedia codes on CMPs with streaming support. However, for all but one benchmark, streaming implied the use of enhanced SIMD instructions, not software-managed memory hierarchies. Suh et al. [37] compared a streaming SIMD processor, a streaming CMP chip, a vector design, and a superscalar processor for DSP kernels. However, the four systems varied vastly at all levels, hence it is difficult to compare memory models directly. There are several proposals for configurable or hybrid memory systems [33, 36, 23, 28]. In such systems, a level in the memory hierarchy can be configured as a cache or as a local store depending on an application's needs. Gummaraju and Rosenblum have shown benefits from a hybrid architecture that uses stream programming on a cache-based superscalar design for scientific code [17]. Our work supports this approach as we show that cache-based memory systems can be as efficient as streaming memory systems, but could benefit in terms of bandwidth consumption and latency tolerance from stream programming.
Our study follows the example of papers that compared shared memory and message passing for multi-chip systems [24, 8].
9. CONCLUSIONS
The choice of the on-chip memory model has far-reaching implications for CMP systems. In this paper, we performed a direct comparison of two basic models: coherent caches and streaming memory.
We conclude that for the majority of applications, both models perform and scale equally well. For some applications without significant data reuse, streaming has a performance and energy advantage when we scale the number and computational capabilities of the cores. However, the efficiency gap can be bridged by introducing prefetching and non-allocating write policies to cache-coherent systems. We also found some applications for which streaming scales worse than caching due to the redundancies introduced in order to make the code regular enough for streaming.
Through the adoption of stream programming methodologies, which encourage blocking, macroscopic prefetching, and locality-aware task scheduling, cache-based systems are equally as efficient as streaming memory systems. This indicates that there is not a sufficient advantage in building general-purpose cores that follow a pure streaming model, where all local memory structures and all data streams are explicitly managed. We also observed that stream programming is actually easier when targeting cache-based systems rather than streaming memory systems, and that it may be beneficial in scaling coherence and consistency for caches to larger systems.
10. ACKNOWLEDGEMENTS
We sincerely thank Bill Mark, Mattan Erez, and Ron Ho for their invaluable comments on early versions of this paper. We are also grateful for the insightful critique of this work by the reviewers. This work was partially funded by DARPA under contract number F29601-03-2-0117. Jacob Leverich is supported by a David Cheriton Stanford Graduate Fellowship.
11. REFERENCES
[1] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. IEEE Computer, 29(12):66–76, Dec. 1996.
[2] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock Rate versus IPC: the End of the Road for Conventional Microarchitectures. In Proceedings of the 27th Intl. Symp. on Computer Architecture, June 2000.
[3] J. Ahn et al. Evaluating the Imagine Stream Architecture. In Proceedings of the 31st Intl. Symp. on Computer Architecture, May 2004.
[4] J. Andrews and N. Backer. Xbox 360 System Architecture. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.
[5] L. A. Barroso et al. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Intl. Symp. on Computer Architecture, Vancouver, Canada, June 2000.
[6] I. Buck. GPU Computing: Programming a Massively Parallel Processor, Mar. 2005. Keynote presentation at the International Symposium on Code Generation and Optimization, San Jose, CA.
[7] T. Chiueh. A Generational Algorithm to Multiprocessor Cache Coherence. In International Conference on Parallel Processing, pages 20–24, Oct. 1993.
[8] D. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1999.
[9] W. Dally et al. Merrimac: Supercomputing with Streams. In Proceedings of the 2003 Conf. on Supercomputing, Nov. 2003.
[10] J. D. Davis, J. Laudon, and K. Olukotun. Maximizing CMP Throughput with Mediocre Cores. In Proceedings of the 14th Intl. Conf. on Parallel Architectures and Compilation Techniques, Sept. 2005.
[11] M. Drake, H. Hoffmann, R. Rabbah, and S. Amarasinghe. MPEG-2 Decoding in a Stream Programming Language. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Rhodes Island, Apr. 2006.
[12] W. Eatherton. The Push of Network Processing to the Top of the Pyramid, Oct. 2005. Keynote presentation at the Symposium on Architectures for Networking and Communication Systems, Princeton, NJ.
[13] K. Fatahalian et al. Sequoia: Programming The Memory Hierarchy. In Supercomputing Conference, Nov. 2006.
[14] T. Foley and J. Sugerman. KD-Tree Acceleration Structures for a GPU Raytracer. In Proceedings of the Graphics Hardware Conf., July 2005.
[15] M. I. Gordon et al. A Stream Compiler for Communication-exposed Architectures. In Proceedings of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 2002.
[16] M. Gschwind et al. A Novel SIMD Architecture for the Cell Heterogeneous Chip-Multiprocessor. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.
[17] J. Gummaraju and M. Rosenblum. Stream Programming on General-Purpose Processors. In Proceedings of the 38th Intl. Symp. on Microarchitecture, Nov. 2005.
[18] R. Ho, K. Mai, and M. Horowitz. The Future of Wires. Proceedings of the IEEE, (4), Apr. 2001.
[19] R. Ho, K. Mai, and M. Horowitz. Efficient On-chip Global Interconnects, June 2003.
[20] M. Horowitz and W. Dally. How Scaling Will Change Processor Architecture. In International Solid-State Circuits Conference, pages 132–133, Feb. 2004.
[21] Independent JPEG Group. IJG's JPEG Software Release 6b, 1998.
[22] D. Jani, G. Ezer, and J. Kim. Long Words and Wide Ports: Reinventing the Configurable Processor. In Conf. Record of Hot Chips 16, Stanford, CA, Aug. 2004.
[23] N. Jayasena. Memory Hierarchy Design for Stream Computing. PhD thesis, Stanford University, 2005.
[24] A. C. Klaiber and H. M. Levy. A Comparison of Message Passing and Shared Memory Architectures for Data Parallel Programs. In Proceedings of the 21st Intl. Symp. on Computer Architecture, Apr. 1994.
[25] P. Kongetira. A 32-way Multithreaded Sparc Processor. In Conf. Record of Hot Chips 16, Stanford, CA, Aug. 2004.
[26] R. Kumar, V. Zyuban, and D. M. Tullsen. Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling. In Proceedings of the 32nd Intl. Symp. on Computer Architecture, June 2005.
[27] B. Lewis and D. J. Berg. Multithreaded Programming with Pthreads. Prentice Hall, 1998.
[28] M. Li et al. ALP: Efficient Support for All Levels of Parallelism for Complex Media Applications. Technical Report UIUCDCS-R-2005-2605, UIUC CS, July 2005.
[29] A. W. Lim, S.-W. Liao, and M. S. Lam. Blocking and Array Contraction Across Arbitrarily Nested Loops Using Affine Partitioning. ACM SIGPLAN Notices, 36(7):103–112, July 2001.
[30] Y. Lin. A Programmable Vector Coprocessor Architecture for Wireless Applications. In Proceedings of the 3rd Workshop on Application Specific Processors, Sept. 2004.
[31] M. Loghi and M. Poncino. Exploring Energy/Performance Tradeoffs in Shared Memory MPSoCs: Snoop-Based Cache Coherence vs. Software Solutions. In Proceedings of the Design Automation and Test in Europe Conf., Mar. 2005.
[32] E. Machnicki. Ultra High Performance Scalable DSP Family for Multimedia. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.
[33] K. Mai et al. Smart Memories: a Modular Reconfigurable Architecture. In Proceedings of the 27th Intl. Symp. on Computer Architecture, June 2000.
[34] MIPS32 Architecture For Programmers Volume II: The MIPS32 Instruction Set. MIPS Technologies, Inc., 2001.
[35] A. Moshovos. RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence. In Proceedings of the 32nd Intl. Symp. on Computer Architecture, June 2005.
[36] K. Sankaralingam. TRIPS: A Polymorphous Architecture for Exploiting ILP, TLP, and DLP. ACM Trans. Archit. Code Optim., 1(1):62–93, Mar. 2004.
[37] J. Suh et al. A Performance Analysis of PIM, Stream Processing, and Tiled Processing on Memory-intensive Signal Processing Kernels. In Proceedings of the 30th Intl. Symp. on Computer Architecture, June 2003.
[38] D. Tarjan, S. Thoziyoor, and N. P. Jouppi. CACTI 4.0. Technical Report HPL-2006-86, HP Labs, 2006.
[39] M. Taylor et al. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. In Proceedings of the 31st Intl. Symp. on Computer Architecture, May 2004.
[40] Tensilica Software Tools. http://www.tensilica.com/products/software.htm.
[41] S. P. VanderWiel and D. J. Lilja. Data Prefetch Mechanisms. ACM Computing Surveys, 32(2):174–199, 2000.
[42] D. Wang et al. DRAMsim: A Memory-System Simulator. SIGARCH Computer Architecture News, 33(4), 2005.
[43] Z. Wang et al. Using the Compiler to Improve Cache Replacement Decisions. In Proceedings of the Conf. on Parallel Architectures and Compilation Techniques, Sept. 2002.
[44] Z. Wang et al. Guided Region Prefetching: A Cooperative Hardware/Software Approach. In Proceedings of the 30th Intl. Symp. on Computer Architecture, June 2003.
[45] T.-Y. Yeh. The Low-Power High-Performance Architecture of the PWRficient Processor Family. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.