Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis
Computer Systems Laboratory
Stanford University
{leverich,arakida,sols,aminf13,horowitz,kozyraki}@stanford.edu
ABSTRACT
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors); D.1.3 [Programming Techniques]: Concurrent Programming
General Terms: Performance, Design
Keywords: Chip multiprocessors, coherent caches, streaming memory, parallel programming, locality optimizations
1. INTRODUCTION
The scaling limitations of uniprocessors [2] have led to an industry-wide turn towards chip multiprocessor (CMP) systems. CMPs are becoming ubiquitous in all computing domains. Unlike uniprocessors, which have a dominant, well-understood model
for on-chip memory structures, there is no widespread agreement on the memory model for CMP designs. The choice of memory model can significantly impact efficiency in terms of performance, energy consumption, and scalability. Moreover, it is closely coupled with the choice of parallel programming model, which in turn affects ease of use. While it is possible to map any programming model to any memory model, it is typically more efficient if the programming model builds upon the basic principles of the memory model.
Similar to larger parallel systems [8], there are two basic memory models for contemporary CMP systems: hardware-managed, implicitly-addressed, coherent caches and software-managed, explicitly-addressed, local memories (also called streaming memory). With the cache-based model, all on-chip storage is used for private and shared caches that are kept coherent by hardware. The advantage is that the hardware provides best-effort locality and communication management, even when the access and sharing patterns are difficult to statically analyze. With the streaming memory model, part of the on-chip storage is organized as independently addressable structures. Explicit accesses and DMA transfers are needed to move data to and from off-chip memory or between two on-chip structures. The advantage of streaming memory is that it provides software with full flexibility on locality and communication management in terms of addressing, granularity, and replacement policy. Since communication is explicit, it can also be proactive, unlike the mostly reactive behavior of the cache-based model. Hence, streaming allows software to exploit producer-consumer locality, avoid redundant write-backs for temporary results, manage selective data replication, and perform application-specific caching and macroscopic prefetching. Streaming eliminates the communication overhead and hardware complexity of the coordination protocol needed for cache coherence. On the other hand, it introduces software complexity, since either the programmer or the compiler must explicitly manage locality and communication.
Traditional desktop and enterprise applications are difficult to analyze and favor cache-based systems. In contrast, many upcoming applications from the multimedia, graphics, physical simulation, and scientific computing domains are being targeted by both cache-based [4, 45] and streaming [3, 39, 9, 16, 32] systems. The debate is particularly interesting vis-à-vis the latest game consoles. The CMPs for the Xbox 360 and PlayStation 3 differ dramatically in their on-chip memory model, as Xbox 360's Xenon processor is a cache-based CMP and PlayStation 3's Cell processor is a streaming memory CMP. Hence, it is interesting to evaluate if streaming provides specific benefits to this domain, or if using a cache-based approach across all application domains should be preferable.
The goal of this paper is to compare the efficiency of the two memory models under the same set of assumptions about technology, area, and computational capabilities. Specifically, we are interested in answering the following questions: How do the two models compare in terms of overall performance and energy consumption? How does the comparison change as we scale the number or compute throughput of the processor cores? How sensitive is each model to bandwidth or latency variations? We believe that such a direct comparison will provide valuable information for the CMP architecture debate and generate some guidelines for the development of future systems.
The major conclusions from our comparison are:
• For data-parallel applications with abundant data reuse, the two models perform and scale equally well. Caches are as effective as software-managed memories at capturing locality and reducing average memory access time. For some of these applications, streaming has an energy advantage of 10% to 25% over write-allocate caches because it avoids superfluous refills on output data streams. Using a no-write-allocate policy for output data in the cache-based system reduces the streaming advantage.
• For applications without significant data reuse, macroscopic prefetching (double-buffering) provides streaming memory systems with a performance advantage when we scale the number and computational capabilities of the cores. The use of hardware prefetching with the cache-based model eliminates the streaming advantage for some latency-bound applications. There are also cases where streaming performs worse, such as when it requires redundant copying of data or extra computation in order to manage local stores.
• Our results indicate that a pure streaming memory model is not sufficiently advantageous at the memory system level. With the addition of prefetching and non-allocating writes, the cache-based model provides similar performance, energy, and bandwidth behavior. On the other hand, we found that "streaming" at the programming model level is very important, even with the cache-based model. Properly blocking an application's working set, exposing producer-consumer locality, and identifying output-only data leads to significant efficiency gains. Moreover, stream programming leads to code that requires coarser-grain and lower-frequency coherence and consistency in a cache-based system. This observation will be increasingly relevant as CMPs scale to much larger numbers of cores.
• Finally, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution, even when the programmer cannot fully regularize an application's code. With the streaming memory model, the software must orchestrate locality and communication perfectly, even for irregular codes.
The rest of the paper is organized as follows. Section 2 summarizes the two memory models and discusses their advantages and drawbacks. Section 3 presents the architectural framework for our comparison and Section 4 describes the experimental methodology. Section 5 analyzes our evaluation results. Section 6 focuses on streaming at the programming level and its interactions with CMP architecture. Section 7 addresses limitations in our methodology. Finally, Section 8 reviews related work and Section 9 concludes the paper.
                      Communication in HW       Communication in SW
Locality in HW        Coherent Cache-based      Incoherent Cache-based
Locality in SW        (impractical)             Streaming Memory
Table 1. The design space for on-chip memory for CMPs. This work focuses on the two highlighted options: coherent cache-based and streaming memory. The third practical option, incoherent cache-based, is briefly discussed in Section 7.
2. ON-CHIP MEMORY MODELS FOR CHIP MULTIPROCESSORS
General-purpose systems are designed around a computational model and a memory model. The computational model defines the organization of execution resources and register files and may follow some combination of the superscalar, VLIW, vector, or SIMD approaches. The memory model defines the organization of on-chip and off-chip memories, as well as the communication mechanisms between the various computation and memory units. The two models are linked, and an efficient system is carefully designed along both dimensions. However, we believe that the on-chip memory model creates more interesting challenges for CMPs. First, it is the one that changes most dramatically compared to uniprocessor designs. Second, the memory model has broader implications on both software and hardware. Assuming we can move the data close to the execution units in an efficient manner, it is not difficult to select the proper computational model based on the type(s) of parallelism available in the computation kernels.
For both the cache-based and streaming models, certain aspects of the on-chip memory system are set by VLSI constraints such as wire delay [18]. Every node in the CMP system is directly associated with a limited amount of storage (first-level memory) that can be accessed within a small number of cycles. Nodes communicate by exchanging packets over an on-chip network that can range from a simple bus to a hierarchical structure. Additional memory structures (second-level memory) are also connected to the network fabric. The system scales with technology by increasing the number of nodes, the size of the network, and the capacity of second-level storage. Eventually, the scale of a CMP design may be limited by off-chip bandwidth or energy consumption [20].
Although they are under the same VLSI constraints, the cache-based and streaming models differ significantly in the way they manage data locality and inter-processor communication. As shown in Table 1, the cache-based model relies on hardware mechanisms for both, while the streaming model delegates management to software. The rest of this section overviews the protocol and operation for each memory model for CMP systems.
2.1 Coherent Cache-based Model
With the cache-based model [8], the only directly addressable storage is the off-chip memory. All on-chip storage is used for caches with hardware management of locality and communication. As cores perform memory accesses, the caches attempt to capture the application's working set by fetching or replacing data at the granularity of cache blocks. The cores communicate implicitly through loads and stores to the single memory image. Since many caches may store copies of a specific address, it is necessary to query multiple caches on load and store requests and potentially invalidate entries to keep caches coherent. A coherence protocol, such as MESI, minimizes the cases when remote cache lookups are necessary. Remote lookups can be distributed through a broadcast mechanism or by first consulting a directory structure. The cores synchronize using atomic operations such as compare-and-swap. Cache-based systems must also provide event ordering guarantees within and across cores following some consistency model [1]. Caching, coherence, synchronization, and consistency are implemented in hardware.
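As a concrete illustration of the kind of synchronization primitive referred to above, the sketch below builds a simple spin-lock from a compare-and-swap operation using C11 atomics; it is illustrative only and is not code from the systems evaluated here.

#include <stdatomic.h>

/* A minimal spin-lock built on compare-and-swap (CAS), the kind of
 * atomic operation the cores use to synchronize. Illustrative sketch. */
typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    int expected = 0;
    /* CAS: atomically replace 0 with 1; retry until we win the race. */
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
        expected = 0;                 /* CAS wrote the observed value here */
        while (atomic_load(&l->locked) != 0)
            ;                         /* spin locally until the lock looks free */
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);      /* release the critical section */
}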
Coherent caching techniques were developed for board-level and cabinet-level systems (SMPs and DSMs), for which communication latency ranges from tens to hundreds of cycles. In CMPs, coherence signals travel within one chip, where latency is much lower and bandwidth is much higher. Consequently, even algorithms with non-trivial amounts of communication and synchronization can scale reasonably well. Moreover, the efficient design points for coherent caching in CMPs are likely to be different from those for SMP and DSM systems.
2.2 Streaming Memory Model
With streaming, the local memory for data in each core is a separately addressable structure, called a scratch-pad, local store, or stream register file. We adopt the term local store in this work. Software is responsible for managing locality and communication across local stores. Software has full flexibility in placing frequently accessed data in local stores with respect to location, granularity, replacement policy, allocation policy, and even the number of copies. For applications with statically analyzable access patterns, software can exploit this flexibility to minimize communication and overlap it with useful computation in the best possible application-specific way. Data are communicated between local stores or to and from off-chip memory using explicit DMA transfers. The cores can access their local stores as FIFO queues or as randomly indexed structures [23]. The streaming model requires DMA engines, but no other special hardware support for coherence or consistency.
Streaming is essentially message-passing applied to CMP designs, though there are some important differences compared to conventional message-passing for clusters and massively parallel systems [8]. First, communication is always orchestrated at the user-level, and its overhead is low. Next, messages are exchanged at the first level of the memory hierarchy, not the last one. Since the communicating cores are on the same chip, communication latencies are low and bandwidth is high. Finally, software manages both the communication between cores and the communication between a core and off-chip memory.
The discussion above separates the two memory models from the selection of a computation model. The streaming memory model is general and has already been used with VLIW/SIMD systems [3], RISC cores [39], vector processors [30], and even DSPs [32]. The cache-based model can be used with any of these computation models as well.
2.3 Qualitative Comparison
When considering the memory system alone, we find several ways in which cache-based and streaming memory systems may differ: bandwidth utilization, latency tolerance, performance, energy consumption, and cost. This section summarizes each of these differences in a qualitative manner.
Off-chip Bandwidth & Local Memory Utilization: The cache-based model leads to bandwidth and storage capacity waste on sparsely strided memory accesses. In the absence of spatial locality, manipulating data at the granularity of wide cache lines is wasteful. Streaming memory systems, by virtue of strided scatter and gather DMA transfers, can use the minimum memory channel bandwidth necessary to deliver data, and also compact the data within the local store. Note, however, that memory and interconnect channels are typically optimized for block transfers and may not be bandwidth efficient for strided or scatter/gather accesses.
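To make the cost of sparse strides concrete (the numbers here are our own illustration, not results from the evaluation), consider a kernel that touches one 4-byte word out of every 32-byte cache line:
\[
\frac{\text{useful bytes}}{\text{fetched bytes}} = \frac{4}{32} = 12.5\%,
\]
so a cache with 32-byte lines moves roughly $8\times$ the data the kernel actually consumes, while a strided gather DMA can request only the 4-byte words, subject to the minimum burst size of the memory channel.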
Caching can also waste off-chip bandwidth on unnecessary refills for output data. Because caches often use write-allocate policies, store misses force memory reads before the data are overwritten in the cache. If an application has disjoint input and output streams, the refills may waste a significant percentage of bandwidth. Similarly, caching can waste bandwidth on write-backs of dead temporary data. A streaming system does not suffer from these problems, as the output and temporary buffers are managed explicitly. Output data are sent off-chip without refills, and dead temporary data can be ignored, as they are not mapped to off-chip addresses. To mitigate the refill problem, cache-based systems can use a no-write-allocate policy. In this case, it is necessary to group store data in write buffers before forwarding them to memory in order to avoid wasting bandwidth on narrow writes [4]. Another approach is to use cache control instructions, such as "Prepare For Store" [34], that instruct the cache to allocate a cache line but avoid retrieval of the old values from memory. Similarly, temporary data can be marked invalid at the end of a computation [7, 43]. In any case, software must determine when to use these mechanisms.
Streaming systems may also waste bandwidth and storage capacity on programs with statically unpredictable, irregular data patterns. A streaming system can sometimes cope with these patterns by fetching a superset of the needed input data. Alternatively, at the cost of enduring long latencies, the system could use a DMA transfer to collect required data on demand from main memory before each computational task. For programs that operate on overlapping blocks or graph structures with multiple references to the same data, the streaming system may naively re-fetch data. This can be avoided through increased address generation complexity or software caching. Finally, for applications that fetch a block and update some of its elements in-place, a streaming system will often write back the whole block to memory at the end of the computation, even if some data were not updated. In contrast to all of these scenarios, cache-based systems perform load and store accesses on demand, and hence only move cache lines as required. They may even search for copies of the required data in other on-chip caches before going off-chip.
Latency Tolerance: The cache-based model is traditionally reactive, meaning that a miss must occur before a fetch is triggered. Memory latency can be hidden using hardware prefetching techniques, which detect repetitive access patterns and issue memory accesses ahead of time, or proactive software prefetching. In practice, the DMA transfers in the streaming memory model are an efficient and accurate form of software prefetching. They can hide a significant amount of latency, especially if double-buffering is used. Unlike hardware prefetching, which requires a few misses before a pattern is detected (microscopic view), a DMA access can start arbitrarily early (macroscopic view) and can capture both regular and irregular (scatter/gather) accesses.
Performance: From the discussion so far, one can conclude that streaming memory may have a performance or scaling advantage for regular applications, due to potentially better latency tolerance and better usage of off-chip bandwidth or local storage. These advantages are important only if latency, bandwidth, or local storage capacity are significant bottlenecks to begin with. For example, reducing the number of misses is unimportant for a computationally intensive application that already has very good locality. In the event that an application is bandwidth-bound, latency tolerance measures will be ineffective. A drawback for streaming, even with regular code, is that it often has to execute additional instructions to set up DMA transfers. For applications with unpredictable data access patterns or control flow, a streaming system may execute a significantly higher number of instructions than that of a cache-based system in order to produce predictable patterns or to use the local store to emulate a software cache.
Energy Consumption: Any performance advantage also translates to an energy advantage, as it allows us to turn off the system early or scale down its power supply and clock frequency. Streaming accesses to the first-level storage eliminate the energy overhead of caches (tag access and tag comparison). The cache-based model consumes additional energy for on-chip coherence traffic, snoop requests or directory lookups. Moreover, efficient use of the available off-chip bandwidth by either of the two models (through fewer transfers or messages) reduces the energy consumption by the interconnect network and main memory.
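The first-order accounting behind these claims can be written as a simple, illustrative energy model (our notation, not an equation from the paper), where each structure contributes an access energy per event plus leakage over the run time:
\[
E_{\text{total}} \approx \sum_{i \in \{\text{core, L1/LS, network, L2, DRAM}\}} \left( N_i \cdot E_{\text{access},i} + P_{\text{leak},i} \cdot T \right),
\qquad
E_{\text{access}}^{\text{cache}} \approx E_{\text{tag}} + E_{\text{data}} \;>\; E_{\text{access}}^{\text{local store}} \approx E_{\text{data}} .
\]
Shortening the execution time $T$ reduces the leakage term, which is why a performance advantage also translates into an energy advantage, and reducing the number of off-chip events $N_{\text{DRAM}}$ attacks the largest per-event energy in the sum.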
Complexity&Cost:Itisdifficulttomakeaccurateestimatesofhardwarecostwithoutcomparableimplementations.Thehardwareforthecache-basedmodelisgenerallymorecomplextodesignandverify,ascoherence,synchronization,consistency,andprefetchinginteractinsubtleways.Still,reuseacrossserver,desktop,andem-beddedCMPdesignscansignificantlyreducesuchcosts.Ontheotherhand,streamingpassesthecomplexitytosoftware,thecom-piler,and/ortheprogrammer.Forapplicationsinthesynchronousdata-flow,DSP,ordensematrixdomains,itisoftenstraightfor-wardtoexpressastreamingalgorithm.Forotherapplications,itisnon-trivial,andasinglegoodalgorithmisoftenaresearchcon-tributioninitself[11,14].Finally,complexityandcostmustalsobeconsideredwithscalinginmind.Amemorymodelhasanad-vantageifitallowsefficientuseofmorecoreswithouttheneedfordisproportionalincreasesinbandwidthorsomeotherresourceinthememorysystem.
3.CMPARCHITECTUREFRAMEWORK
WecomparethetwomodelsusingtheCMParchitectureshowninFigure1.TherearenumerousdesignparametersinaCMPsys-tem,andevaluatingthetwomodelsunderallpossiblecombinationsisinfeasible.Hence,westartwithabaselinesystemthatrepresentsapracticaldesignpointandvaryonlythekeyparametersthatinter-actsignificantlywiththeon-chipmemorysystem.
3.1BaselineArchitecture
OurCMPdesignisbasedonin-orderprocessorssimilartoPi-ranha[5],RAW[39],UltrasparcT1[25],andXbox360[4].SuchCMPshavebeenshowntobeefficientformultimedia,communi-cations,andthroughputcomputingworkloadsastheyprovidehighcomputedensitywithouttheareaandpoweroverheadofout-of-orderprocessors[10].WeusetheTensilicaXtensaLX,3-wayVLIWcore[22].Tensilicacoreshavebeenusedinseveralem-beddedCMPdesigns,includingthe188-coreCiscoMetrorouterchip[12].Wealsoperformedexperimentswithasingle-issueXtensacorethatproducedsimilarresultsforthememorymodelcomparison.TheVLIWcoreisconsistentlyfasterby1.6xto2xfortheapplicationswestudied.Thecorehas3slotsperinstruc-tion,withupto2slotsforfloating-pointoperationsandupto1slotforloadsandstores.Duetotimeconstraints,wedonotusetheTensilicaSIMDextensionsatthispoint.Nevertheless,Sec-tion5.3includesanexperimentthatevaluatestheefficiencyofeachmodelwithprocessorsthatprovidehighercomputationalthrough-put.Eachprocessorhasa16-KByte,2-wayset-associativeinstruc-tioncache.Thefixedamountoflocaldatastorageisuseddiffer-entlyineachmemorymodel.
Weexploresystemswith1to16coresusingahierarchicalinter-connectsimilartothatsuggestedby[26].Wegroupcoresinclus-tersoffourwithawide,bidirectionalbus(localnetwork)provid-ingthenecessaryinterconnect.Theclusterstructureallowsforfastcommunicationbetweenneighboringcores.Ifthreadsaremappedintelligently,theintra-clusterbuswillhandlemostofthecore-to-corecommunication.Aglobalcrossbarconnectstheclusterstothesecond-levelstorage.Thereisbufferingatallinterfacestotoler-atearbitrationlatenciesandensureefficientbandwidthuseateachlevel.Thehierarchicalinterconnectprovidessufficientbandwidth,whileavoidingthebottlenecksoflongwiresandcentralizedarbi-tration[26].Finally,thesecondarystoragecommunicatestooff-chipmemorythroughsomenumberofmemorychannels.
Table2presentstheparametersoftheCMPsystem.Wevarythenumberofcores,thecoreclockfrequency,theavailableoff-chipbandwidth,andthedegreeofhardwareprefetching.Wekeepthecapacityofthefirst-leveldata-storageconstant.
3.2Cache-basedImplementation
Forthecache-basedmodel,thefirst-leveldatastorageineachcoreisorganizedasa32-KByte,2-wayset-associativedatacache.Thesecond-levelcacheisa512-KByte,16-wayset-associativecache.Bothcachesuseawrite-back,write-allocatepolicy.Co-herenceismaintainedusingtheMESIwrite-invalidateprotocol.Coherencerequestsarepropagatedinstepsoverthehierarchicalinterconnect.First,theyarebroadcasttootherprocessorswithinthecluster.Ifthecache-lineisnotavailableortherequestcannotbesatisfiedwithinonecluster(e.g.,upgraderequest),itisbroadcasttoallotherclustersaswell.Snoopingrequestsfromothercoresoc-cupythedatacacheforonecycle,forcingthecoretostallifittriestodoaload/storeaccessinthesamecycle.Eachcoreincludesastore-bufferthatallowsloadstobypassstoremisses.Asaresult,theconsistencymodelisweak.Sincetheprocessorsarein-order,itiseasytoprovidesufficientMSHRsforthemaximumpossiblenumberofconcurrentmisses.
Each core additionally includes a hardware stream-based prefetch engine that places data directly in the L1 cache. Modeled after the tagged prefetcher described in [41], the prefetcher keeps a history of the last 8 cache misses for identifying sequential accesses, runs a configurable number of cache lines ahead of the latest cache miss, and tracks 4 separate access streams. Our experiments include hardware prefetching only when explicitly stated.
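The following is a minimal behavioral sketch of a stream prefetcher of the kind described above (miss history, a few independent streams, configurable depth); the constants, structure, and issue_prefetch hook are illustrative, not the exact hardware that was evaluated.

#include <stdint.h>

#define LINE_SHIFT  5          /* 32-byte cache lines */
#define HIST_SIZE   8          /* remember the last 8 miss addresses */
#define NUM_STREAMS 4          /* up to 4 concurrent access streams */

typedef struct {
    uint64_t hist[HIST_SIZE];          /* recent miss history (line numbers) */
    int      hist_head;
    uint64_t stream_last[NUM_STREAMS]; /* last line seen by each stream */
    int      stream_valid[NUM_STREAMS];
    int      next_victim;              /* round-robin stream replacement */
    int      depth;                    /* lines to run ahead of the last miss */
} prefetcher_t;

extern void issue_prefetch(uint64_t line); /* hand a line address to memory */

/* Called on every L1 data miss. */
void prefetcher_on_miss(prefetcher_t *p, uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;

    /* Sequential pattern detected if the preceding line missed recently. */
    int sequential = 0;
    for (int i = 0; i < HIST_SIZE; i++)
        if (p->hist[i] == line - 1) sequential = 1;
    p->hist[p->hist_head] = line;
    p->hist_head = (p->hist_head + 1) % HIST_SIZE;

    /* Continue an existing stream that ends at the preceding line, or, when a
     * new pattern is detected, claim a stream slot round-robin. */
    int s = -1;
    for (int i = 0; i < NUM_STREAMS; i++)
        if (p->stream_valid[i] && p->stream_last[i] == line - 1) s = i;
    if (s < 0) {
        if (!sequential) return;       /* no pattern yet: do nothing */
        s = p->next_victim;
        p->next_victim = (p->next_victim + 1) % NUM_STREAMS;
    }
    p->stream_valid[s] = 1;
    p->stream_last[s]  = line;

    /* Run `depth` lines ahead of the latest miss. */
    for (int d = 1; d <= p->depth; d++)
        issue_prefetch(line + d);
}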
We use POSIX threads to manually parallelize and tune applications [27]. The applications used are regular and use locks to implement efficient task-queues and barriers to synchronize SPMD code. Higher-level programming models, such as OpenMP, are also applicable to these applications.
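A minimal sketch of this SPMD structure (a lock-protected task queue plus a barrier) is shown below; the task granularity and function names are illustrative stand-ins, not the actual application code.

#include <pthread.h>

#define NUM_THREADS 16
#define NUM_TASKS   1024   /* e.g., macroblocks or blocks of a frame */

static pthread_mutex_t   q_lock = PTHREAD_MUTEX_INITIALIZER;
static int               next_task = 0;       /* shared task-queue cursor */
static pthread_barrier_t phase_barrier;

extern void process_task(int id);             /* one unit of work */

/* Each core runs the same code, pulls tasks until the queue is empty,
 * then waits at a barrier before the next phase (e.g., the next frame). */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        int t = (next_task < NUM_TASKS) ? next_task++ : -1;
        pthread_mutex_unlock(&q_lock);
        if (t < 0) break;
        process_task(t);
    }
    pthread_barrier_wait(&phase_barrier);      /* synchronize SPMD phases */
    return NULL;
}

int run_parallel_phase(void)
{
    pthread_t tid[NUM_THREADS];
    pthread_barrier_init(&phase_barrier, NULL, NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&phase_barrier);
    return 0;
}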
3.3 Streaming Implementation
For the streaming model, the first-level data storage in each core is split between a 24-KByte local store and an 8-KByte cache. The small cache is used for stack data and global variables. It is particularly useful for the sequential portions of the code and helps simplify the programming and compilation of the streaming portion as well. The 24-KByte local store is indexed as a random access memory. Our implementation also provides hardware support for FIFO accesses to the local store, but we did not use this feature with any of our applications. Each core has a DMA engine that supports sequential, strided, and indexed transfers, as well as command queuing. At any point, the DMA engine may have up to 16 outstanding 32-byte accesses.
The local store is effectively smaller than a 24-KByte cache, since it has no tag or control bits overhead. We do not raise the size of the local store, since the small increment (2 KBytes) does not make a difference for our applications. Still, the energy consumption model accounts correctly for the reduced capacity. The secondary storage is again organized as a 16-way set-associative cache. L2 caches are useful with stream processors, as they capture long-term reuse patterns and avoid expensive accesses to main memory [36, 9]. The L2 cache avoids refills on write misses when DMA transfers overwrite entire lines.
Figure 1. The architecture of the CMP system with up to 16 processors. The core organizations for the cache-based and streaming models are shown on the left and right sides, respectively.
We developed streaming code using a simple library for DMA transfers within threaded C code. We manually applied the proper blocking factor and double-buffering in order to overlap DMA transfers with useful computation. We also run multiple computational kernels on each data block to benefit from producer-consumer locality without additional memory accesses or write-backs for intermediate results. Higher-level stream programming models should be applicable to most of our applications [15, 13]. In some cases, the DMA library uses a scheduling thread that queues pending transfers. We avoid taking up a whole core for this thread by multiplexing it with an application thread. The performance impact is negligible.
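The streaming kernels roughly follow the double-buffered pattern sketched below. The dma_get/dma_put/dma_wait calls are hypothetical stand-ins for the library primitives (here, dma_wait(tag) is assumed to block until every transfer issued with that tag has completed), not its actual interface, and the block size and kernel name are illustrative.

#include <stddef.h>

#define BLOCK 128                      /* elements per DMA transfer (illustrative) */

/* Hypothetical DMA primitives: asynchronous copies between off-chip memory
 * and the local store, identified by a tag that can be waited on. */
extern void dma_get (void *local_dst, const void *mem_src, size_t bytes, int tag);
extern void dma_put (void *mem_dst, const void *local_src, size_t bytes, int tag);
extern void dma_wait(int tag);

extern void compute_block(float *out, const float *in, int n); /* fused kernels */

/* Double-buffered streaming loop: while block i is processed out of one
 * local-store buffer, block i+1 is fetched into the other. */
void stream_process(const float *in, float *out, int nblocks)
{
    static float in_buf [2][BLOCK];    /* resides in the local store */
    static float out_buf[2][BLOCK];

    dma_get(in_buf[0], in, sizeof in_buf[0], 0);
    for (int i = 0; i < nblocks; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < nblocks)           /* prefetch the next block early */
            dma_get(in_buf[nxt], in + (i + 1) * BLOCK, sizeof in_buf[nxt], nxt);

        dma_wait(cur);                 /* input block i (and any older transfer
                                          reusing these buffers) is now done */
        compute_block(out_buf[cur], in_buf[cur], BLOCK);

        /* Output goes straight off-chip: no refill of the destination. */
        dma_put(out + i * BLOCK, out_buf[cur], sizeof out_buf[cur], cur);
    }
    dma_wait(0); dma_wait(1);          /* drain outstanding transfers */
}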
4. METHODOLOGY
4.1 Simulation and Energy Modeling
We used Tensilica's modeling tools to construct a CMP simulator for both memory models [40]. The simulator captures all stall and contention events in the core pipeline and the memory system. Table 2 summarizes the major system characteristics and the parameters we varied for this study. The default values are shown in bold. The applications were compiled with Tensilica's optimizing compiler at the -O3 optimization level. We fast-forward over the initialization for each application but simulate the rest to completion, excluding 179.art for which we measure 10 invocations of the train match function.
We also developed an energy model for the architecture in a 90nm CMOS process (1.0V power supply). For the cores, the model combines usage statistics (instruction mix, functional unit utilization, stalls and idle cycles, etc.) with energy data from the layout of actual Tensilica designs at 600 MHz in 90nm. The energy consumed by on-chip memory structures is calculated using CACTI 4.1 [38], which includes a leakage power model and improved circuit models compared to CACTI 3. Interconnect energy is calculated based on our measured activity statistics and scaled power measurements from [19]. The energy consumption for off-chip DRAM is derived from DRAMsim [42]. We model the effect of leakage and clock gating on energy at all levels of the model.
Table 2. Parameters for the CMP system. For parameters that vary, we denote the default value in bold. Latencies are for a 90nm CMOS process.
Core (CC and STR): 1, 2, 4, 8, or 16 Tensilica LX cores, 3-way VLIW, 7-stage pipeline; 800 MHz, 1.6 GHz, 3.2 GHz, or 6.4 GHz clock frequency; 2 FPUs, 2 integer units, 1 load/store unit
I-cache (CC and STR): 16 KB, 2-way associative, 32-byte blocks, 1 port
Data storage (CC): 32 KB, 2-way associative cache, 32-byte blocks, 1 port, MESI; hardware stream prefetcher
Data storage (STR): 24 KB local store, 1 port; 8 KB, 2-way associative cache, 32-byte blocks, 1 port; DMA engine with 16 outstanding accesses
Local network (CC and STR): bidirectional bus, 32 bytes wide, 2 cycle latency (after arbitration)
Global crossbar (CC and STR): 1 input and output port per cluster or L2 bank, 16 bytes wide, 2.5 ns latency (pipelined)
L2 cache (CC and STR): 512 KB, 16-way set associative, 1 port, 2.2 ns access latency, non-inclusive
DRAM (CC and STR): one memory channel at 1.6 GB/s, 3.2 GB/s, 6.4 GB/s, or 12.8 GB/s; 70 ns random access latency
4.2 Applications
Table 3 presents the set of applications used for this study. They represent applications from the multimedia, graphics, physical simulation, DSP, and data management domains. Such applications have been used to evaluate and motivate the development of streaming architectures. MPEG-2, H.264, Raytracer, JPEG, and Stereo Depth Extraction are compute-intensive applications and show exceptionally good cache performance despite their large datasets. They exhibit good spatial or temporal locality and have enough computation per data element to amortize the penalty for any misses. FEM is a scientific application, but has about the same compute intensity as multimedia applications. The remaining applications (Bitonic Sort, Merge Sort, FIR, and 179.art) perform a relatively small computation on each input element. They require considerably higher off-chip bandwidth and are sensitive to memory latency.
Table 3. Memory characteristics of the applications measured on the cache-based model using 16 cores running at 800 MHz. Columns: Application | Input Dataset | L1 D-Miss Rate | L2 D-Miss Rate | Instr. per L1 D-Miss | Cycles per L2 D-Miss | Off-chip B/W.
MPEG-2 Encoder [28] | 10 CIF frames (Foreman) | 0.58% | 85.3% | 324.8 | 135.4 | 292.4 MB/s
H.264 Encoder | 10 CIF frames (Foreman) | 0.06% | 30.8% | 3705.5 | 4225.9 | 10.8 MB/s
KD-tree Raytracer | 128x128, 16371 triangles | 1.06% | 98.9% | 256.3 | 6.6 | 45.1 MB/s
JPEG Encoder [21] | 128 PPMs of various sizes | 0.40% | 72.9% | 577.1 | 84.2 | 402.2 MB/s
JPEG Decoder [21] | 128 JPGs of various sizes | 0.58% | 76.2% | 352.9 | 44.9 | 1059.2 MB/s
Stereo Depth Extraction | 3 CIF image pairs | 0.03% | 46.1% | 8662.5 | 3995.3 | 11.4 MB/s
2D FEM | 5006 cell mesh, 7663 edges | 0.60% | 86.2% | 368.8 | 55.5 | 587.9 MB/s
FIR filter | 2^20 32-bit samples | 0.63% | 99.8% | 214.6 | 20.4 | 1839.1 MB/s
179.art | SPEC reference dataset | 1.79% | 7.4% | 150.1 | 230.9 | 227.7 MB/s
Bitonic Sort | 2^19 32-bit keys (2MB) | 2.22% | 98.2% | 140.9 | 26.1 | 1594.2 MB/s
Merge Sort | 2^19 32-bit keys (2MB) | 3.98% | 99.7% | 71.1 | 33.7 | 1167.8 MB/s
We manually optimized both versions of each application to eliminate bottlenecks and schedule its parallelism in the best possible way. Whenever appropriate, we applied the same data-locality optimizations (i.e. blocking, producer-consumer, etc.) to both models. In Section 6, we explore the impact of data-locality optimizations. The following is a brief description of how each application was parallelized.
MPEG-2 and H.264 are parallelized at the macroblock level. Both dynamically assign macroblocks to cores using a task queue. Macroblocks are entirely data-parallel in MPEG-2. In H.264, we schedule the processing of dependent macroblocks so as to minimize the length of the critical execution path. With the CIF resolution video frames we encode for this study, the macroblock parallelism available in H.264 is limited. Stereo Depth Extraction is parallelized by dividing input frames into 32x32 blocks and statically assigning them to processors.
KD-tree Raytracer is parallelized across camera rays. We assign rays to processors in chunks to improve locality. Our streaming version reads the KD-tree from the cache instead of streaming it with a DMA controller. JPEG Encode and JPEG Decode are parallelized across input images, in a manner similar to that done by an image thumbnail browser. Note that Encode reads a lot of data but outputs little; Decode behaves in the opposite way. The Finite Element Method (FEM) is parallelized across mesh cells. The FIR filter has 16 taps and is parallelized across long strips of samples. 179.art is parallelized across F1 neurons; this application is composed of several data-parallel vector operations and reductions between which we place barriers.
Merge Sort and Bitonic Sort are parallelized across sub-arrays of a large input array. The processors first sort chunks of 4096 keys in parallel using quicksort. Then, sorted chunks are merged or sorted until the full array is sorted. Merge Sort gradually reduces in parallelism as it progresses, whereas Bitonic Sort retains full parallelism for its duration. Merge Sort alternates writing output sublists to two buffer arrays, while Bitonic Sort operates on the list in situ.
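A minimal sketch of the first two phases just described (parallel chunk sort with quicksort, then pairwise merging of sorted runs) is shown below; thread management is omitted (it follows the task-queue sketch in Section 3.2) and the names are illustrative.

#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define CHUNK 4096                      /* keys sorted locally before merging */

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Phase 1: each core quicksorts its own CHUNK-sized slice (one call per
 * core, run in parallel). */
void sort_chunk(uint32_t *keys, size_t chunk_idx)
{
    qsort(keys + chunk_idx * CHUNK, CHUNK, sizeof(uint32_t), cmp_u32);
}

/* Phase 2: merge two sorted runs into dst (one step of the merge tree). */
void merge_runs(uint32_t *dst, const uint32_t *a, size_t na,
                const uint32_t *b, size_t nb)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        dst[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    memcpy(dst + k, a + i, (na - i) * sizeof(uint32_t));
    memcpy(dst + k + (na - i), b + j, (nb - j) * sizeof(uint32_t));
}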
5. EVALUATION
Our evaluation starts with a comparison of the streaming system to the baseline cache-based system without prefetching or other enhancements (Sections 5.1 and 5.2). We then study the bandwidth consumption and latency tolerance of the two systems (Section 5.3). We conclude by evaluating means to enhance the performance of caching systems (Sections 5.4 and 5.5).
5.1 Performance Comparison
Figure 2 presents the execution time for the coherent cache-based (CC) and streaming (STR) models as we vary the number of 800 MHz cores from 2 to 16. We normalize to the execution time of the sequential run with the cache-based system. The components of execution time are: useful execution (including fetch and non-memory pipeline stalls), synchronization (lock, barrier, wait for DMA), and stalls for data (caches). Lower bars are better. The cache-based results assume no hardware prefetching.
Figure 2. Execution times for the two memory models as the number of cores is increased, normalized to a single caching core.
For 7 out of 11 applications (MPEG-2, H.264, Depth, Raytracing, FEM, JPEG Encode and Decode), the two models perform almost identically for all processor counts. These programs perform a significant computation on each data element fetched and can be classified as compute-bound. Both caches and local stores capture their locality patterns equally well.
The remaining applications (179.art, FIR, Merge Sort, and Bitonic Sort) are data-bound and reveal some interesting differences between the two models. The cache-based versions stall regularly due to cache misses. Streaming versions eliminate many of these stalls using double-buffering (macroscopic prefetching). This is not the case for Bitonic Sort, because off-chip bandwidth is saturated at high processor counts. Bitonic Sort is an in-place sorting algorithm, and it is often the case that sublists are moderately in-order and elements don't need to be swapped, and consequently don't need to be written back. The cache-based system naturally discovers this behavior, while the streaming memory system writes the unmodified data back to main memory anyway. H.264 and Merge Sort have synchronization stalls with both models due to limited parallelism.
There are subtle differences in the useful cycles of some applications. FIR executes 14% more instructions in the streaming model than the caching model because of the management of DMA transfers (128 elements per transfer). In the streaming Merge Sort, the inner loop executes extra comparisons to check if an output buffer is full and needs to be drained to main memory, whereas the cache-based variant freely writes data sequentially. Even though double-buffering eliminates all data stalls, the application runs longer because of its higher instruction count. The streaming H.264 takes advantage of some boundary-condition optimizations that proved difficult in the cache-based variant. This resulted in a slight reduction in instruction count when streaming. MPEG-2 suffers a moderate number of instruction cache misses due to the cache's limited size.
Overall, Figure 2 shows that neither model has a consistent performance advantage. Even without prefetching, the cache-based model performs similarly to the streaming one, and sometimes it is actually faster (Bitonic Sort for 16 cores). Both models use data efficiently, and the fundamental parallelism available in these applications is not affected by how data are moved. Although the differences in performance are small, there is a larger variation in off-chip bandwidth utilization of the two models. Figure 3 shows that each model has an advantage in some situations.
Figure 3. Off-chip traffic for the cache-based and streaming systems with 16 CPUs, normalized to a single caching core.
Figure 4. Energy consumption for the cache-based and streaming systems with 16 CPUs, normalized to a single caching core.
5.2 Energy Comparison
Figure 4 presents energy consumption for the two models running FEM, MPEG-2, FIR, and Bitonic Sort. We normalize to the energy consumption of a single caching core for each application. Each bar indicates the energy consumed by the cores, the caches and local stores, the on-chip network, the second-level cache, and the main memory. The numbers include both static and dynamic power. Lower bars are better. In contrast to performance scaling, energy consumption does not always improve with more cores, since the amount of hardware used to run the application increases. For 5 out of 11 applications (JPEG Encode, JPEG Decode, FIR, 179.art, and Merge Sort), streaming consistently consumes less energy than cache-coherence, typically 10% to 25%. The energy differential in nearly every case comes from the DRAM system. This can be observed in the correlation between Figures 3 and 4. Specifically, the streaming applications typically transfer fewer bytes from main memory, often through the elimination of superfluous refills for output-only data. The opposite is true for our streaming Bitonic Sort, which tends to communicate more data with main memory than the caching version due to the write-back of unmodified data. For applications where there is little bandwidth difference between the two models (such as FEM) or the computational intensity is very high (such as Depth), the difference in energy consumption is insignificant.
We expected to see a greater difference between the local store and L1 data cache, but it never materialized. Since our applications are data-parallel and rarely share data, the energy cost of an average cache miss is dominated by the off-chip DRAM access rather than the modest tag broadcast and lookup. Hence, the per-access energy savings by eliminating tag lookups in the streaming system made little impact on the total energy footprint of the system.
Figure 5. Normalized execution time as the computation rate of processor cores is increased (16 cores).
5.3 Increased Computational Throughput
Up to now, the results assume 800 MHz cores, which are reasonable for embedded CMPs for consumer applications. To explore the efficiency of the two memory models as the computational throughput of the processor is increased, we vary the clock frequency of the cores while keeping constant the bandwidth and latency in the on-chip networks, L2 cache, and off-chip memory. In some sense, the higher clock frequencies tell us in general what would happen with more powerful processors that use SIMD units, out-of-order schemes, higher clock frequency, or a combination. For example, the 6.4 GHz configuration can be representative of the performance of an 800 MHz processor that uses 4- to 8-wide SIMD instructions. The experiment was performed with 16 cores to stress scaling and increase the system's sensitivity to both memory latency and memory bandwidth.
Applications with significant data reuse, such as H.264 and Depth, show no sensitivity to this experiment and perform equally well on both systems. Figure 5 shows the results for some of the applications that are affected by computational scaling. These applications fall into one of two categories: bandwidth-sensitive or latency-sensitive. Latency-sensitive programs, like MPEG-2 Encoding, perform a relatively large degree of computation between off-chip memory accesses (hundreds of instructions). While the higher core frequency shortens these computations, it does not reduce the amount of time (in nanoseconds, not cycles) required to fetch the data in between computations. The macroscopic prefetching in the streaming system can tolerate a significant percentage of the memory latency. Hence at 6.4 GHz, the streaming MPEG-2 Encoder is 9% faster.
Bandwidth-sensitive applications, like FIR and Bitonic Sort, eventually saturate the available off-chip bandwidth. Beyond that point, further increases in computational throughput do not improve overall performance. For FIR, the cache-based system saturates before the streaming system due to the superfluous refills on store misses to output-only data. At the highest computational throughput, the streaming system performs 36% faster. For Bitonic Sort, the streaming version saturates first, since it performs more writes than the cache-based version (as described in 5.1). This gives the cache-based version a 19% performance advantage.
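The saturation behavior can be summarized by a simple, illustrative bound (our notation): once the memory channel is the limiting resource, execution time cannot drop below the traffic divided by the available bandwidth, regardless of core frequency,
\[
T \;\gtrsim\; \max\!\left( \frac{N_{\text{ops}}}{\text{compute throughput}},\; \frac{\text{bytes moved off-chip}}{BW_{\text{off-chip}}} \right),
\]
so raising the clock shrinks only the first term. Streaming helps FIR by shrinking the bytes moved (no refills), while its extra write-backs enlarge that term for Bitonic Sort.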
5.4 Mitigating Latency and Bandwidth Issues
The previous section indicates that when a large number of cores with high computational throughput are used, the cache-based model faces latency and bandwidth issues with certain applications. To characterize these inefficiencies, we performed experiments with increased off-chip bandwidth and hardware prefetching.
Figure 6 shows the impact of increasing the available off-chip bandwidth for FIR. This can be achieved by using higher frequency DRAM (e.g. DDR2, DDR3, GDDR) or multiple memory channels. With more bandwidth available, the effect of superfluous refills is significantly reduced, and the cache-based system performs nearly as well as the streaming one. When hardware prefetching is introduced at 12.8 GB/s, load stalls are reduced to 3% of the total execution time. However, the additional off-chip bandwidth does not close the energy gap for this application. An energy-efficient solution for the cache-based system is to use a non-allocating write policy, which we explore in Section 5.5.
Figure 6. The effect of increased off-chip bandwidth on FIR performance. Measured on 16 cores at 3.2 GHz.
For Merge Sort and 179.art (Figure 7), hardware prefetching significantly improves the latency tolerance of the cache-based systems; data stalls are virtually eliminated. This is not to say that we never observed data stalls (at 16 cores, the cache-based Merge Sort saturates the memory channel due to superfluous refills), but a small degree of prefetching is sufficient to hide over 200 cycles of memory latency.
5.5 Mitigating Superfluous Refills
For some applications, the cache-based system uses more off-chip bandwidth (and consequently energy) because of superfluous refills for output-only data. This disadvantage can be addressed by using a non-write-allocate policy for output-only data streams. We mimic non-allocating stores by using an instruction similar to the MIPS32 "Prepare for Store" (PFS) instruction [34]. PFS allocates and validates a cache line without refilling it. The results for FIR, Merge Sort, and MPEG-2 are shown in Figure 8. For each application, the elimination of superfluous refills brings the memory traffic and energy consumption of the cache-based model into parity with the streaming model. For MPEG-2, the memory traffic due to write misses was reduced 56% compared to the cache-based application without PFS.
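A minimal sketch of how an output-only loop can apply such an instruction is shown below. PFS_LINE is a hypothetical macro standing in for the cache-control operation (on real hardware it would expand to something like a MIPS32 "Prepare for Store" cache op); it is not an interface of our simulator.

#include <stdint.h>
#include <stddef.h>

#define LINE_BYTES 32
#define LINE_WORDS (LINE_BYTES / sizeof(uint32_t))

/* Hypothetical stand-in: on real hardware this would emit a cache-control
 * instruction (e.g., MIPS32 "Prepare for Store") that allocates and
 * validates the line holding `p` without refilling it from memory. */
#define PFS_LINE(p) ((void)(p))

/* Output-only kernel: every word of dst is overwritten, so refilling its
 * cache lines from memory on store misses would be pure bandwidth waste. */
void produce_output(uint32_t *dst, const uint32_t *src, size_t n_lines)
{
    for (size_t l = 0; l < n_lines; l++) {
        PFS_LINE(&dst[l * LINE_WORDS]);            /* claim the line, no refill */
        for (size_t w = 0; w < LINE_WORDS; w++)
            dst[l * LINE_WORDS + w] = src[l * LINE_WORDS + w] + 1; /* example work */
    }
}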
Figure 7. The effect of hardware prefetching on performance. P4 refers to the prefetch depth of 4. Measured on 2 cores at 3.2 GHz with a 12.8 GB/s memory channel.
Figure 8. The effect of "Prepare for Store" (PFS) instructions on the off-chip traffic for the cache-based system, normalized to a single caching core. Also shown is energy consumption for FIR with 16 cores at 800 MHz. See Figure 4 for the second graph's legend.
Note that a full hardware implementation of a non-write-allocate cache policy, along with the necessary write-gathering buffer, might perform better than PFS as it would also eliminate cache replacements due to output-only data.
6. STREAMING AS A PROGRAMMING MODEL
Our evaluation so far shows that the two memory models lead to similar performance and scaling. It is important to remember that we took advantage of streaming optimizations, such as blocking and locality-aware scheduling, on both memory models. To illustrate this, it is educational to look at stream programming and its implications for CMP architecture.
Stream programming models encourage programmers to think about the locality, data movement, and storage capacity issues in their applications [15, 13]. While they do not necessarily require the programmer to manage these issues, the programmer structures the application in such a way that it is easy to reason about them. This exposed nature of a stream program is vitally important for streaming architectures, as it enables software or compiler management of data locality and asynchronous communication with architecturally visible on-chip memories. Despite its affinity for stream architectures, we find that stream programming is beneficial for cache-based architectures as well.
Figure 9. The effect of stream programming optimizations on the off-chip bandwidth and performance of MPEG-2 at 800 MHz.
Figure 10. The effect of stream programming optimizations on the performance of 179.art at 800 MHz.
Figure 9 shows the importance of streaming optimizations in the cache-based MPEG-2 Encoder. The original parallel code from [28] performs an application kernel on a whole video frame before the next kernel is invoked (i.e. Motion Estimation, DCT, Quantization, etc.). We restructured this code by hoisting the inner loops of several tasks into a single outer loop that calls each task in turn. In the optimized version, we execute all tasks on a block of a frame before moving to the next block. This also allowed us to condense a large temporary array into a small stack variable. The improved producer-consumer locality reduced write-backs from L1 caches by 60%. Data stalls are reduced by 41% at 6.4 GHz, even without prefetching. Furthermore, improving the parallel efficiency of the application became a simple matter of scheduling a single data-parallel loop, which alone is responsible for a 40% performance improvement at 16 cores. However, instruction cache misses are notably increased in the streaming-optimized code.
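The restructuring described above for MPEG-2 is sketched below with illustrative kernel and buffer names (the real encoder kernels are Motion Estimation, DCT, Quantization, and so on, and are more involved): the original code runs each kernel over the whole frame, while the stream-optimized code runs all kernels over one block before moving on, so intermediate data stay in the L1 cache or a small stack buffer.

#define NUM_BLOCKS  396     /* e.g., macroblocks in a CIF frame (illustrative) */
#define BLOCK_SIZE  384     /* bytes of intermediate data per block (illustrative) */

typedef unsigned char byte_t;

/* Illustrative kernel signatures standing in for the encoder stages. */
extern void kernel_a(byte_t *out, const byte_t *in, int block);
extern void kernel_b(byte_t *out, const byte_t *in, int block);
extern void kernel_c(byte_t *out, const byte_t *in, int block);

/* Original structure: each kernel sweeps the whole frame, so the large
 * temporaries tmp1/tmp2 are written back and re-fetched between kernels. */
void encode_frame_unfused(byte_t *out, const byte_t *frame,
                          byte_t *tmp1, byte_t *tmp2)
{
    for (int b = 0; b < NUM_BLOCKS; b++) kernel_a(tmp1 + b * BLOCK_SIZE, frame + b * BLOCK_SIZE, b);
    for (int b = 0; b < NUM_BLOCKS; b++) kernel_b(tmp2 + b * BLOCK_SIZE, tmp1 + b * BLOCK_SIZE, b);
    for (int b = 0; b < NUM_BLOCKS; b++) kernel_c(out  + b * BLOCK_SIZE, tmp2 + b * BLOCK_SIZE, b);
}

/* Stream-style structure: hoist the per-block loops into one outer loop and
 * run every kernel on a block before moving on; the frame-sized temporaries
 * collapse into small stack buffers that stay resident in the L1 cache. */
void encode_frame_fused(byte_t *out, const byte_t *frame)
{
    for (int b = 0; b < NUM_BLOCKS; b++) {
        byte_t t1[BLOCK_SIZE], t2[BLOCK_SIZE];   /* per-block temporaries */
        kernel_a(t1, frame + b * BLOCK_SIZE, b);
        kernel_b(t2, t1, b);
        kernel_c(out + b * BLOCK_SIZE, t2, b);
    }
}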
For 179.art, we reorganized the main data structure in the cache-based version in the same way as we did for the streaming code. We were also able to replace several large temporary vectors with scalar values by merging several loops. These optimizations reduced the sparseness of 179.art's memory access pattern, improved both temporal and spatial locality, and allowed us to use prefetching effectively. As shown in Figure 10, the impact on performance is dramatic, even at small core counts (7x speedup).
Overall, we observed performance, bandwidth, and energy benefits whenever stream programming optimizations were applied to our cache-based applications. This is not a novel result, since it is well-known that locality optimizations, such as blocking and loop fusion [29], increase computational intensity and cache efficiency. However, stream programming models encourage users to write code that explicitly exposes an application's parallelism and data-access pattern, more often allowing such optimizations.
Our experience is that stream programming is actually easier with the cache-based model rather than the streaming model. With streaming memory, the programmer or the compiler must orchestrate all data movement and positioning exactly right in order for the program to operate correctly and fast. This can be burdensome for irregular access patterns (overlapping blocks, search structures, unpredictable or data-dependent patterns, etc.), or for accesses that do not affect an application's performance. It can lead to additional instructions and memory references that reduce or eliminate streaming hardware's other advantages. With cache-based hardware, stream programming is just an issue of performance optimization. Even if the algorithm is not blocked exactly right, the caches will provide best-effort locality and communication management. Hence, the programmer or the compiler can focus on the most promising and most regular data structures instead of managing all data structures in a program.
Moreover, stream programming can address some of the coherence and consistency challenges when scaling cache-based CMPs to large numbers of cores. Since a streaming application typically operates in a data-parallel fashion on a sequence of data, there is little short-term communication or synchronization between processors. Communication is only necessary when processors move from one independent set of input/output blocks to the next or reach a cycle in an application's data-flow graph. This observation may be increasingly important as CMPs grow, as it implies that less aggressive, coarser-grain, or lower-frequency mechanisms can be employed to keep caches coherent.
7. DISCUSSION AND LIMITATIONS
It is important to recognize that our study has limitations. Our experiments focus on CMPs with 1-16 cores and uniform memory access. Some of the conclusions may not generalize to larger-scale CMPs with non-uniform memory access (NUMA). A large-scale, cache-based CMP, programmed in a locality-oblivious way, will undoubtedly suffer stalls due to long memory delays or excessive on-chip coherence traffic. We observe that the stream programming model may be able to address both limitations; it exposes the flow of data early enough that they can be prefetched, and motivates a far coarser-grained, lower-frequency coherence model.
Towards the goal of designing large CMPs that are still easy to program, a hybrid memory system that combines caching and software-managed memory structures can mitigate efficiency challenges without exacerbating the difficulty of software development. For example, bulk transfer primitives for cache-based systems could enable more efficient macroscopic prefetching. Conversely, small, potentially incoherent caches in streaming memory systems could vastly simplify the use of static data structures with abundant temporal locality. An industry example of a hybrid system is the NVIDIA G80 GPU [6] which, in addition to cached access to global memory, includes a small scratch-pad for application-specific locality and communication optimizations.
Besides the limits of scalability, we did not consider architectures that expose the streaming model all the way to the register file [39] or applications without abundant data parallelism. We also did not consider changes to the pipeline of our cores, since that is precisely what makes it difficult to evaluate existing streaming memory processors compared to cache-based processors. Finally, our study was performed using general-purpose CMPs. A comparison between the two memory models for specialized CMPs remains an open issue. Despite these limitations, we believe this study's conclusions are important in terms of understanding the actual behavior of CMP memory systems and motivating future research and development.
8. RELATED WORK
Several architectures [3, 39, 16, 32] use streaming hardware with multimedia and scientific code in order to get performance and energy benefits from software-managed memory hierarchies and regular control flow. There are also corresponding proposals for stream programming languages and compiler optimizations [15, 13]. Such tools can reduce the burden on the programmer for explicit locality and communication management. In parallel, there are significant efforts to enhance cache-based systems with traffic filters [35], replacement hints [43], or prefetching hints [44]. These enhancements target the same access patterns that streaming memory systems benefit from.
To the best of our knowledge, this is the first direct comparison of the two memory models for CMP systems under a unified set of assumptions. Jayasena [23] compared a stream register file to a single-level cache for a SIMD processor. He found that the stream register file provides performance and bandwidth advantages for applications with significant producer-consumer locality. Loghi and Poncino compared hardware cache coherence to not caching shared data at all for embedded CMPs with on-chip main memory [31]. The ALP report [28] evaluates multimedia codes on CMPs with streaming support. However, for all but one benchmark, streaming implied the use of enhanced SIMD instructions, not software-managed memory hierarchies. Suh et al. [37] compared a streaming SIMD processor, a streaming CMP chip, a vector design, and a superscalar processor for DSP kernels. However, the four systems varied vastly at all levels, hence it is difficult to compare memory models directly. There are several proposals for configurable or hybrid memory systems [33, 36, 23, 28]. In such systems, a level in the memory hierarchy can be configured as a cache or as a local store depending on an application's needs. Gummaraju and Rosenblum have shown benefits from a hybrid architecture that uses stream programming on a cache-based superscalar design for scientific code [17]. Our work supports this approach as we show that cache-based memory systems can be as efficient as streaming memory systems, but could benefit in terms of bandwidth consumption and latency tolerance from stream programming.
Our study follows the example of papers that compared shared memory and message passing for multi-chip systems [24, 8].
9. CONCLUSIONS
The choice of the on-chip memory model has far-reaching implications for CMP systems. In this paper, we performed a direct comparison of two basic models: coherent caches and streaming memory.
We conclude that for the majority of applications, both models perform and scale equally well. For some applications without significant data reuse, streaming has a performance and energy advantage when we scale the number and computational capabilities of the cores. However, the efficiency gap can be bridged by introducing prefetching and non-allocating write policies to cache-coherent systems. We also found some applications for which streaming scales worse than caching due to the redundancies introduced in order to make the code regular enough for streaming.
Through the adoption of stream programming methodologies, which encourage blocking, macroscopic prefetching, and locality-aware task scheduling, cache-based systems are equally as efficient as streaming memory systems. This indicates that there is not a sufficient advantage in building general-purpose cores that follow a pure streaming model, where all local memory structures and all data streams are explicitly managed. We also observed that stream programming is actually easier when targeting cache-based systems rather than streaming memory systems, and that it may be beneficial in scaling coherence and consistency for caches to larger systems.
10. ACKNOWLEDGEMENTS
We sincerely thank Bill Mark, Mattan Erez, and Ron Ho for their invaluable comments on early versions of this paper. We are also grateful for the insightful critique of this work by the reviewers. This work was partially funded by DARPA under contract number F29601-03-2-0117. Jacob Leverich is supported by a David Cheriton Stanford Graduate Fellowship.
11. REFERENCES
[1] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. IEEE Computer, 29(12):66–76, Dec. 1996.
[2] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock Rate versus IPC: the End of the Road for Conventional Microarchitectures. In Proceedings of the 27th Intl. Symp. on Computer Architecture, June 2000.
[3] J. Ahn et al. Evaluating the Imagine Stream Architecture. In Proceedings of the 31st Intl. Symp. on Computer Architecture, May 2004.
[4] J. Andrews and N. Backer. Xbox 360 System Architecture. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.
[5] L. A. Barroso et al. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Intl. Symp. on Computer Architecture, Vancouver, Canada, June 2000.
[6] I. Buck. GPU Computing: Programming a Massively Parallel Processor, Mar. 2005. Keynote presentation at the International Symposium on Code Generation and Optimization, San Jose, CA.
[7] T. Chiueh. A Generational Algorithm to Multiprocessor Cache Coherence. In International Conference on Parallel Processing, pages 20–24, Oct. 1993.
[8] D. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1999.
[9] W. Dally et al. Merrimac: Supercomputing with Streams. In Proceedings of the 2003 Conf. on Supercomputing, Nov. 2003.
[10] J. D. Davis, J. Laudon, and K. Olukotun. Maximizing CMP Throughput with Mediocre Cores. In Proceedings of the 14th Intl. Conf. on Parallel Architectures and Compilation Techniques, Sept. 2005.
[11] M. Drake, H. Hoffmann, R. Rabbah, and S. Amarasinghe. MPEG-2 Decoding in a Stream Programming Language. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Rhodes Island, Apr. 2006.
[12] W. Eatherton. The Push of Network Processing to the Top of the Pyramid, Oct. 2005. Keynote presentation at the Symposium on Architectures for Networking and Communication Systems, Princeton, NJ.
[13] K. Fatahalian et al. Sequoia: Programming The Memory Hierarchy. In Supercomputing Conference, Nov. 2006.
[14] T. Foley and J. Sugerman. KD-Tree Acceleration Structures for a GPU Raytracer. In Proceedings of the Graphics Hardware Conf., July 2005.
[15] M. I. Gordon et al. A Stream Compiler for Communication-exposed Architectures. In Proceedings of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 2002.
[16] M. Gschwind et al. A Novel SIMD Architecture for the Cell Heterogeneous Chip-Multiprocessor. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.
[17] J. Gummaraju and M. Rosenblum. Stream Programming on General-Purpose Processors. In Proceedings of the 38th Intl. Symp. on Microarchitecture, Nov. 2005.
[18] R. Ho, K. Mai, and M. Horowitz. The Future of Wires. Proceedings of the IEEE, (4), Apr. 2001.
[19] R. Ho, K. Mai, and M. Horowitz. Efficient On-chip Global Interconnects, June 2003.
[20] M. Horowitz and W. Dally. How Scaling Will Change Processor Architecture. In International Solid-State Circuits Conference, pages 132–133, Feb. 2004.
[21] Independent JPEG Group. IJG's JPEG Software Release 6b, 1998.
[22] D. Jani, G. Ezer, and J. Kim. Long Words and Wide Ports: Reinventing the Configurable Processor. In Conf. Record of Hot Chips 16, Stanford, CA, Aug. 2004.
[23] N. Jayasena. Memory Hierarchy Design for Stream Computing. PhD thesis, Stanford University, 2005.
[24] A. C. Klaiber and H. M. Levy. A Comparison of Message Passing and Shared Memory Architectures for Data Parallel Programs. In Proceedings of the 21st Intl. Symp. on Computer Architecture, Apr. 1994.
[25] P. Kongetira. A 32-way Multithreaded Sparc Processor. In Conf. Record of Hot Chips 16, Stanford, CA, Aug. 2004.
[26] R. Kumar, V. Zyuban, and D. M. Tullsen. Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling. In Proceedings of the 32nd Intl. Symp. on Computer Architecture, June 2005.
[27] B. Lewis and D. J. Berg. Multithreaded Programming with Pthreads. Prentice Hall, 1998.
[28] M. Li et al. ALP: Efficient Support for All Levels of Parallelism for Complex Media Applications. Technical Report UIUCDCS-R-2005-2605, UIUC CS, July 2005.
[29] A. W. Lim, S.-W. Liao, and M. S. Lam. Blocking and Array Contraction Across Arbitrarily Nested Loops Using Affine Partitioning. ACM SIGPLAN Notices, 36(7):103–112, July 2001.
[30] Y. Lin. A Programmable Vector Coprocessor Architecture for Wireless Applications. In Proceedings of the 3rd Workshop on Application Specific Processors, Sept. 2004.
[31] M. Loghi and M. Poncino. Exploring Energy/Performance Tradeoffs in Shared Memory MPSoCs: Snoop-Based Cache Coherence vs. Software Solutions. In Proceedings of the Design Automation and Test in Europe Conf., Mar. 2005.
[32] E. Machnicki. Ultra High Performance Scalable DSP Family for Multimedia. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.
[33] K. Mai et al. Smart Memories: a Modular Reconfigurable Architecture. In Proceedings of the 27th Intl. Symp. on Computer Architecture, June 2000.
[34] MIPS32 Architecture For Programmers Volume II: The MIPS32 Instruction Set. MIPS Technologies, Inc., 2001.
[35] A. Moshovos. RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence. In Proceedings of the 32nd Intl. Symp. on Computer Architecture, June 2005.
[36] K. Sankaralingam. TRIPS: A Polymorphous Architecture for Exploiting ILP, TLP, and DLP. ACM Trans. Archit. Code Optim., 1(1):62–93, Mar. 2004.
[37] J. Suh et al. A Performance Analysis of PIM, Stream Processing, and Tiled Processing on Memory-intensive Signal Processing Kernels. In Proceedings of the 30th Intl. Symp. on Computer Architecture, June 2003.
[38] D. Tarjan, S. Thoziyoor, and N. P. Jouppi. CACTI 4.0. Technical Report HPL-2006-86, HP Labs, 2006.
[39] M. Taylor et al. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. In Proceedings of the 31st Intl. Symp. on Computer Architecture, May 2004.
[40] Tensilica Software Tools. http://www.tensilica.com/products/software.htm.
[41] S. P. VanderWiel and D. J. Lilja. Data Prefetch Mechanisms. ACM Computing Surveys, 32(2):174–199, 2000.
[42] D. Wang et al. DRAMsim: A Memory-System Simulator. SIGARCH Computer Architecture News, 33(4), 2005.
[43] Z. Wang et al. Using the Compiler to Improve Cache Replacement Decisions. In Proceedings of the Conf. on Parallel Architectures and Compilation Techniques, Sept. 2002.
[44] Z. Wang et al. Guided Region Prefetching: A Cooperative Hardware/Software Approach. In Proceedings of the 30th Intl. Symp. on Computer Architecture, June 2003.
[45] T.-Y. Yeh. The Low-Power High-Performance Architecture of the PWRficient Processor Family. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.