Comparing Memory Systems for Chip Multiprocessors

Jacob Leverich    Amin Firoozshahian
Hideho Arakida    Mark Horowitz
Alex Solomatnikov    Christos Kozyrakis

Computer Systems Laboratory
Stanford University
{leverich,arakida,sols,aminf13,horowitz,kozyraki}@stanford.edu

ABSTRACT

There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.

Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors); D.1.3 [Programming Techniques]: Concurrent Programming

General Terms: Performance, Design

Keywords: Chip multiprocessors, coherent caches, streaming memory, parallel programming, locality optimizations

1. INTRODUCTION

The scaling limitations of uniprocessors [2] have led to an industry-wide turn towards chip multiprocessor (CMP) systems. CMPs are becoming ubiquitous in all computing domains.


Unlike uniprocessors, which have a dominant, well-understood model for on-chip memory structures, there is no widespread agreement on the memory model for CMP designs. The choice of memory model can significantly impact efficiency in terms of performance, energy consumption, and scalability. Moreover, it is closely coupled with the choice of parallel programming model, which in turn affects ease of use. While it is possible to map any programming model to any memory model, it is typically more efficient if the programming model builds upon the basic principles of the memory model.

Similar to larger parallel systems [8], there are two basic memory models for contemporary CMP systems: hardware-managed, implicitly-addressed, coherent caches and software-managed, explicitly-addressed, local memories (also called streaming memory). With the cache-based model, all on-chip storage is used for private and shared caches that are kept coherent by hardware. The advantage is that the hardware provides best-effort locality and communication management, even when the access and sharing patterns are difficult to statically analyze. With the streaming memory model, part of the on-chip storage is organized as independently addressable structures. Explicit accesses and DMA transfers are needed to move data to and from off-chip memory or between two on-chip structures. The advantage of streaming memory is that it provides software with full flexibility on locality and communication management in terms of addressing, granularity, and replacement policy. Since communication is explicit, it can also be proactive, unlike the mostly reactive behavior of the cache-based model. Hence, streaming allows software to exploit producer-consumer locality, avoid redundant write-backs for temporary results, manage selective data replication, and perform application-specific caching and macroscopic prefetching. Streaming eliminates the communication overhead and hardware complexity of the coordination protocol needed for cache coherence. On the other hand, it introduces software complexity, since either the programmer or the compiler must explicitly manage locality and communication.

Traditional desktop and enterprise applications are difficult to analyze and favor cache-based systems. In contrast, many upcoming applications from the multimedia, graphics, physical simulation, and scientific computing domains are being targeted by both cache-based [4, 45] and streaming [3, 39, 9, 16, 32] systems. The debate is particularly interesting vis-à-vis the latest game consoles. The CMPs for the Xbox 360 and PlayStation 3 differ dramatically in their on-chip memory model, as Xbox 360's Xenon processor is a cache-based CMP and PlayStation 3's Cell processor is a streaming memory CMP. Hence, it is interesting to evaluate if streaming provides specific benefits to this domain, or if using a cache-based approach across all application domains should be preferable.

The goal of this paper is to compare the efficiency of the two memory models under the same set of assumptions about technology, area, and computational capabilities. Specifically, we are interested in answering the following questions: How do the two models compare in terms of overall performance and energy consumption? How does the comparison change as we scale the number or compute throughput of the processor cores? How sensitive is each model to bandwidth or latency variations? We believe that such a direct comparison will provide valuable information for the CMP architecture debate and generate some guidelines for the development of future systems.

The major conclusions from our comparison are:

• For data-parallel applications with abundant data reuse, the two models perform and scale equally well. Caches are as effective as software-managed memories at capturing locality and reducing average memory access time. For some of these applications, streaming has an energy advantage of 10% to 25% over write-allocate caches because it avoids superfluous refills on output data streams. Using a no-write-allocate policy for output data in the cache-based system reduces the streaming advantage.

• For applications without significant data reuse, macroscopic prefetching (double-buffering) provides streaming memory systems with a performance advantage when we scale the number and computational capabilities of the cores. The use of hardware prefetching with the cache-based model eliminates the streaming advantage for some latency-bound applications. There are also cases where streaming performs worse, such as when it requires redundant copying of data or extra computation in order to manage local stores.

• Our results indicate that a pure streaming memory model is not sufficiently advantageous at the memory system level. With the addition of prefetching and non-allocating writes, the cache-based model provides similar performance, energy, and bandwidth behavior. On the other hand, we found that "streaming" at the programming model level is very important, even with the cache-based model. Properly blocking an application's working set, exposing producer-consumer locality, and identifying output-only data leads to significant efficiency gains. Moreover, stream programming leads to code that requires coarser-grain and lower-frequency coherence and consistency in a cache-based system. This observation will be increasingly relevant as CMPs scale to much larger numbers of cores.

• Finally, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution, even when the programmer cannot fully regularize an application's code. With the streaming memory model, the software must orchestrate locality and communication perfectly, even for irregular codes.

The rest of the paper is organized as follows. Section 2 summarizes the two memory models and discusses their advantages and drawbacks. Section 3 presents the architectural framework for our comparison and Section 4 describes the experimental methodology. Section 5 analyzes our evaluation results. Section 6 focuses on streaming at the programming level and its interactions with CMP architecture. Section 7 addresses limitations in our methodology. Finally, Section 8 reviews related work and Section 9 concludes the paper.

                  Communication: HW        Communication: SW
Locality: HW      Coherent Cache-based     Incoherent Cache-based
Locality: SW      (impractical)            Streaming Memory

Table 1. The design space for on-chip memory for CMPs. This work focuses on two of these options: coherent cache-based and streaming memory. The third practical option, incoherent cache-based, is briefly discussed in Section 7.

2. ON-CHIP MEMORY MODELS FOR CHIP MULTIPROCESSORS

General-purpose systems are designed around a computational model and a memory model. The computational model defines the organization of execution resources and register files and may follow some combination of the superscalar, VLIW, vector, or SIMD approaches. The memory model defines the organization of on-chip and off-chip memories, as well as the communication mechanisms between the various computation and memory units. The two models are linked, and an efficient system is carefully designed along both dimensions. However, we believe that the on-chip memory model creates more interesting challenges for CMPs. First, it is the one that changes most dramatically compared to uniprocessor designs. Second, the memory model has broader implications on both software and hardware. Assuming we can move the data close to the execution units in an efficient manner, it is not difficult to select the proper computational model based on the type(s) of parallelism available in the computation kernels.

For both the cache-based and streaming models, certain aspects of the on-chip memory system are set by VLSI constraints such as wire delay [18]. Every node in the CMP system is directly associated with a limited amount of storage (first-level memory) that can be accessed within a small number of cycles. Nodes communicate by exchanging packets over an on-chip network that can range from a simple bus to a hierarchical structure. Additional memory structures (second-level memory) are also connected to the network fabric. The system scales with technology by increasing the number of nodes, the size of the network, and the capacity of second-level storage. Eventually, the scale of a CMP design may be limited by off-chip bandwidth or energy consumption [20].

Although they are under the same VLSI constraints, the cache-based and streaming models differ significantly in the way they manage data locality and inter-processor communication. As shown in Table 1, the cache-based model relies on hardware mechanisms for both, while the streaming model delegates management to software. The rest of this section overviews the protocol and operation of each memory model for CMP systems.

2.1 Coherent Cache-based Model

With the cache-based model [8], the only directly addressable storage is the off-chip memory. All on-chip storage is used for caches with hardware management of locality and communication. As cores perform memory accesses, the caches attempt to capture the application's working set by fetching or replacing data at the granularity of cache blocks. The cores communicate implicitly through loads and stores to the single memory image. Since many caches may store copies of a specific address, it is necessary to query multiple caches on load and store requests and potentially invalidate entries to keep caches coherent.

A coherence protocol, such as MESI, minimizes the cases when remote cache lookups are necessary. Remote lookups can be distributed through a broadcast mechanism or by first consulting a directory structure. The cores synchronize using atomic operations such as compare-and-swap. Cache-based systems must also provide event ordering guarantees within and across cores following some consistency model [1]. Caching, coherence, synchronization, and consistency are implemented in hardware.
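To make the synchronization mechanism concrete, the following is a minimal sketch (ours, not code from the paper) of a spinlock built on compare-and-swap using C11 atomics; the coherence protocol is what keeps the lock word consistent across the per-core caches.

    #include <stdatomic.h>

    /* Spinlock built on compare-and-swap. A MESI-style protocol keeps the lock
     * word coherent across the per-core caches; waiters spin on their locally
     * cached copy until the holder's release invalidates it. */
    typedef struct { atomic_int locked; } spinlock_t;

    static void spin_lock(spinlock_t *l) {
        int expected = 0;
        while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
            expected = 0;                      /* CAS failure overwrote it    */
            while (atomic_load(&l->locked))    /* spin in the local cache ... */
                ;                              /* ... until the line changes  */
        }
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_store(&l->locked, 0);           /* release: the write invalidates
                                                  remote copies of the line   */
    }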

Coherent caching techniques were developed for board-level and cabinet-level systems (SMPs and DSMs), for which communication latency ranges from tens to hundreds of cycles. In CMPs, coherence signals travel within one chip, where latency is much lower and bandwidth is much higher. Consequently, even algorithms with non-trivial amounts of communication and synchronization can scale reasonably well. Moreover, the efficient design points for coherent caching in CMPs are likely to be different from those for SMP and DSM systems.

2.2 Streaming Memory Model

Withstreaming,thelocalmemoryfordataineachcoreisasep-aratelyaddressablestructure,calledascratch-pad,localstore,orstreamregisterfile.Weadoptthetermlocalstoreinthiswork.Softwareisresponsibleformanaginglocalityandcommunicationacrosslocalstores.Softwarehasfullflexibilityinplacingfre-quentlyaccesseddatainlocalstoreswithrespecttolocation,gran-ularity,replacementpolicy,allocationpolicy,andeventhenumberofcopies.Forapplicationswithstaticallyanalyzableaccesspat-terns,softwarecanexploitthisflexibilitytominimizecommuni-cationandoverlapitwithusefulcomputationinthebestpossibleapplication-specificway.Dataarecommunicatedbetweenlocalstoresortoandfromoff-chipmemoryusingexplicitDMAtrans-fers.ThecorescanaccesstheirlocalstoresasFIFOqueuesorasrandomlyindexedstructures[23].ThestreamingmodelrequiresDMAengines,butnootherspecialhardwaresupportforcoherenceorconsistency.
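As an illustration of this model, here is a minimal sketch of the explicit get/compute/put pattern in threaded-C style. The dma_get/dma_put/dma_wait calls are hypothetical placeholders for a DMA library, not an API defined by the paper; buf stands for a buffer resident in the core's local store.

    #include <stddef.h>

    enum { BLOCK = 1024 };

    /* Hypothetical DMA interface: start a transfer, then wait on its tag. */
    extern void dma_get(void *local, const void *global, size_t bytes, int tag);
    extern void dma_put(const void *local, void *global, size_t bytes, int tag);
    extern void dma_wait(int tag);

    void scale_stream(const float *in, float *out, size_t n, float k)
    {
        static float buf[BLOCK];                   /* placed in the local store */
        for (size_t i = 0; i < n; i += BLOCK) {
            size_t len = (n - i < BLOCK) ? (n - i) : BLOCK;
            dma_get(buf, in + i, len * sizeof(float), 0);
            dma_wait(0);                           /* block until data arrives  */
            for (size_t j = 0; j < len; j++)       /* compute on on-chip data   */
                buf[j] *= k;
            dma_put(buf, out + i, len * sizeof(float), 0);
            dma_wait(0);                           /* drain before buffer reuse */
        }
    }

A realistic version would overlap the transfers with computation; a double-buffered variant of this sketch appears with the streaming implementation in Section 3.3.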

Streaming is essentially message-passing applied to CMP designs, though there are some important differences compared to conventional message-passing for clusters and massively parallel systems [8]. First, communication is always orchestrated at the user level, and its overhead is low. Next, messages are exchanged at the first level of the memory hierarchy, not the last one. Since the communicating cores are on the same chip, communication latencies are low and bandwidth is high. Finally, software manages both the communication between cores and the communication between a core and off-chip memory.

The discussion above separates the two memory models from the selection of a computation model. The streaming memory model is general and has already been used with VLIW/SIMD systems [3], RISC cores [39], vector processors [30], and even DSPs [32]. The cache-based model can be used with any of these computation models as well.

2.3 Qualitative Comparison

When considering the memory system alone, we find several ways in which cache-based and streaming memory systems may differ: bandwidth utilization, latency tolerance, performance, energy consumption, and cost. This section summarizes each of these differences in a qualitative manner.

Off-chip Bandwidth & Local Memory Utilization: The cache-based model leads to bandwidth and storage capacity waste on sparsely strided memory accesses. In the absence of spatial locality, manipulating data at the granularity of wide cache lines is wasteful.

Streaming memory systems, by virtue of strided scatter and gather DMA transfers, can use the minimum memory channel bandwidth necessary to deliver data, and also compact the data within the local store. Note, however, that memory and interconnect channels are typically optimized for block transfers and may not be bandwidth efficient for strided or scatter/gather accesses.

Caching can also waste off-chip bandwidth on unnecessary refills for output data. Because caches often use write-allocate policies, store misses force memory reads before the data are overwritten in the cache. If an application has disjoint input and output streams, the refills may waste a significant percentage of bandwidth. Similarly, caching can waste bandwidth on write-backs of dead temporary data. A streaming system does not suffer from these problems, as the output and temporary buffers are managed explicitly. Output data are sent off-chip without refills, and dead temporary data can be ignored, as they are not mapped to off-chip addresses. To mitigate the refill problem, cache-based systems can use a no-write-allocate policy. In this case, it is necessary to group store data in write buffers before forwarding them to memory in order to avoid wasting bandwidth on narrow writes [4]. Another approach is to use cache control instructions, such as "Prepare For Store" [34], that instruct the cache to allocate a cache line but avoid retrieval of the old values from memory. Similarly, temporary data can be marked invalid at the end of a computation [7, 43]. In any case, software must determine when to use these mechanisms.

Streaming systems may also waste bandwidth and storage capacity on programs with statically unpredictable, irregular data patterns. A streaming system can sometimes cope with these patterns by fetching a superset of the needed input data. Alternatively, at the cost of enduring long latencies, the system could use a DMA transfer to collect required data on demand from main memory before each computational task. For programs that operate on overlapping blocks or graph structures with multiple references to the same data, the streaming system may naively re-fetch data. This can be avoided through increased address generation complexity or software caching. Finally, for applications that fetch a block and update some of its elements in place, a streaming system will often write back the whole block to memory at the end of the computation, even if some data were not updated. In contrast to all of these scenarios, cache-based systems perform load and store accesses on demand, and hence only move cache lines as required. They may even search for copies of the required data in other on-chip caches before going off-chip.
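As a back-of-the-envelope illustration of the refill cost (our arithmetic, not a measured result): for a kernel that reads one stream of N bytes and writes a disjoint output stream of N bytes, a write-allocate cache moves roughly 3N bytes off-chip (N of input reads, N of refills for the output lines, and N of write-backs), whereas a no-write-allocate policy or an explicit DMA put moves only 2N. In this simple case the refills account for a third of the traffic.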

Latency Tolerance: The cache-based model is traditionally reactive, meaning that a miss must occur before a fetch is triggered. Memory latency can be hidden using hardware prefetching techniques, which detect repetitive access patterns and issue memory accesses ahead of time, or proactive software prefetching. In practice, the DMA transfers in the streaming memory model are an efficient and accurate form of software prefetching. They can hide a significant amount of latency, especially if double-buffering is used. Unlike hardware prefetching, which requires a few misses before a pattern is detected (microscopic view), a DMA access can start arbitrarily early (macroscopic view) and can capture both regular and irregular (scatter/gather) accesses.
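A rough calculation shows why the macroscopic view matters: with the 70 ns DRAM latency and 3.2 GHz cores used later in this study, a random access costs roughly 70 ns x 3.2 GHz = 224 core cycles, so a hardware prefetcher must first detect the stream and then run several cache lines ahead just to break even, whereas a DMA transfer for the next block can be issued an entire block of computation in advance.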

Performance: From the discussion so far, one can conclude that streaming memory may have a performance or scaling advantage for regular applications, due to potentially better latency tolerance and better usage of off-chip bandwidth or local storage. These advantages are important only if latency, bandwidth, or local storage capacity are significant bottlenecks to begin with. For example, reducing the number of misses is unimportant for a computationally intensive application that already has very good locality. In the event that an application is bandwidth-bound, latency tolerance measures will be ineffective.

A drawback for streaming, even with regular code, is that it often has to execute additional instructions to set up DMA transfers. For applications with unpredictable data access patterns or control flow, a streaming system may execute a significantly higher number of instructions than a cache-based system in order to produce predictable patterns or to use the local store to emulate a software cache.

Energy Consumption: Any performance advantage also translates to an energy advantage, as it allows us to turn off the system early or scale down its power supply and clock frequency. Streaming accesses to the first-level storage eliminate the energy overhead of caches (tag access and tag comparison). The cache-based model consumes additional energy for on-chip coherence traffic, snoop requests or directory lookups. Moreover, efficient use of the available off-chip bandwidth by either of the two models (through fewer transfers or messages) reduces the energy consumption by the interconnect network and main memory.

Complexity & Cost: It is difficult to make accurate estimates of hardware cost without comparable implementations. The hardware for the cache-based model is generally more complex to design and verify, as coherence, synchronization, consistency, and prefetching interact in subtle ways. Still, reuse across server, desktop, and embedded CMP designs can significantly reduce such costs. On the other hand, streaming passes the complexity to software, the compiler, and/or the programmer. For applications in the synchronous data-flow, DSP, or dense matrix domains, it is often straightforward to express a streaming algorithm. For other applications, it is non-trivial, and a single good algorithm is often a research contribution in itself [11, 14]. Finally, complexity and cost must also be considered with scaling in mind. A memory model has an advantage if it allows efficient use of more cores without the need for disproportionate increases in bandwidth or some other resource in the memory system.

3. CMP ARCHITECTURE FRAMEWORK

We compare the two models using the CMP architecture shown in Figure 1. There are numerous design parameters in a CMP system, and evaluating the two models under all possible combinations is infeasible. Hence, we start with a baseline system that represents a practical design point and vary only the key parameters that interact significantly with the on-chip memory system.

3.1 Baseline Architecture

Our CMP design is based on in-order processors similar to Piranha [5], RAW [39], UltraSPARC T1 [25], and Xbox 360 [4]. Such CMPs have been shown to be efficient for multimedia, communications, and throughput computing workloads, as they provide high compute density without the area and power overhead of out-of-order processors [10]. We use the Tensilica Xtensa LX, a 3-way VLIW core [22]. Tensilica cores have been used in several embedded CMP designs, including the 188-core Cisco Metro router chip [12]. We also performed experiments with a single-issue Xtensa core that produced similar results for the memory model comparison. The VLIW core is consistently faster by 1.6x to 2x for the applications we studied. The core has 3 slots per instruction, with up to 2 slots for floating-point operations and up to 1 slot for loads and stores. Due to time constraints, we do not use the Tensilica SIMD extensions at this point. Nevertheless, Section 5.3 includes an experiment that evaluates the efficiency of each model with processors that provide higher computational throughput. Each processor has a 16-KByte, 2-way set-associative instruction cache. The fixed amount of local data storage is used differently in each memory model.

We explore systems with 1 to 16 cores using a hierarchical interconnect similar to that suggested by [26]. We group cores in clusters of four, with a wide, bidirectional bus (local network) providing the necessary interconnect. The cluster structure allows for fast communication between neighboring cores. If threads are mapped intelligently, the intra-cluster bus will handle most of the core-to-core communication. A global crossbar connects the clusters to the second-level storage. There is buffering at all interfaces to tolerate arbitration latencies and ensure efficient bandwidth use at each level. The hierarchical interconnect provides sufficient bandwidth, while avoiding the bottlenecks of long wires and centralized arbitration [26]. Finally, the secondary storage communicates with off-chip memory through some number of memory channels.

Table 2 presents the parameters of the CMP system. We vary the number of cores, the core clock frequency, the available off-chip bandwidth, and the degree of hardware prefetching. We keep the capacity of the first-level data storage constant.

3.2 Cache-based Implementation

For the cache-based model, the first-level data storage in each core is organized as a 32-KByte, 2-way set-associative data cache. The second-level cache is a 512-KByte, 16-way set-associative cache. Both caches use a write-back, write-allocate policy. Coherence is maintained using the MESI write-invalidate protocol. Coherence requests are propagated in steps over the hierarchical interconnect. First, they are broadcast to other processors within the cluster. If the cache line is not available or the request cannot be satisfied within one cluster (e.g., an upgrade request), it is broadcast to all other clusters as well. Snooping requests from other cores occupy the data cache for one cycle, forcing the core to stall if it tries to do a load/store access in the same cycle. Each core includes a store buffer that allows loads to bypass store misses. As a result, the consistency model is weak. Since the processors are in-order, it is easy to provide sufficient MSHRs for the maximum possible number of concurrent misses.

Each core additionally includes a hardware stream-based prefetch engine that places data directly in the L1 cache. Modeled after the tagged prefetcher described in [41], the prefetcher keeps a history of the last 8 cache misses for identifying sequential accesses, runs a configurable number of cache lines ahead of the latest cache miss, and tracks 4 separate access streams. Our experiments include hardware prefetching only when explicitly stated.
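The paper does not give the prefetcher's logic in code, but a rough software model of the decision it makes on each L1 miss, under our reading of the description above, looks like the following. LINE, N_STREAMS, and HISTORY mirror the stated parameters; issue_prefetch is a placeholder hook into the memory system.

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE      32u         /* cache line size in bytes   */
    #define N_STREAMS 4           /* concurrent streams tracked */
    #define HISTORY   8           /* recent misses remembered   */

    typedef struct { uint64_t next_line; bool valid; } stream_t;

    static stream_t streams[N_STREAMS];
    static uint64_t history[HISTORY];
    static int hist_head, victim;

    extern void issue_prefetch(uint64_t line_addr);  /* memory-system hook */

    void on_l1_miss(uint64_t addr, int depth)
    {
        uint64_t line = addr / LINE;

        /* 1. Does the miss extend a stream we already track? */
        for (int s = 0; s < N_STREAMS; s++) {
            if (streams[s].valid && streams[s].next_line == line) {
                for (int d = 1; d <= depth; d++)
                    issue_prefetch((line + d) * LINE);   /* run 'depth' lines ahead */
                streams[s].next_line = line + 1;
                return;
            }
        }

        /* 2. Otherwise, if the previous line missed recently, start a new stream. */
        for (int h = 0; h < HISTORY; h++) {
            if (history[h] == line - 1) {
                streams[victim] = (stream_t){ .next_line = line + 1, .valid = true };
                victim = (victim + 1) % N_STREAMS;       /* round-robin replacement */
                break;
            }
        }
        history[hist_head] = line;
        hist_head = (hist_head + 1) % HISTORY;
    }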

We use POSIX threads to manually parallelize and tune applications [27]. The applications used are regular and use locks to implement efficient task queues and barriers to synchronize SPMD code. Higher-level programming models, such as OpenMP, are also applicable to these applications.
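The structure of these cache-based versions is conventional; a minimal sketch (ours, not the paper's source) of the lock-based task queue and barrier-synchronized SPMD pattern is shown below, with process_task standing in for an application kernel such as encoding one macroblock.

    #include <pthread.h>

    #define N_TASKS   1024
    #define N_THREADS 16

    static pthread_mutex_t   q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t phase_barrier;
    static int next_task = 0;

    extern void process_task(int id);        /* application kernel for one unit */

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&q_lock);     /* grab the next work unit         */
            int id = next_task < N_TASKS ? next_task++ : -1;
            pthread_mutex_unlock(&q_lock);
            if (id < 0) break;
            process_task(id);                /* plain loads/stores; caches and
                                                coherence handle data movement  */
        }
        pthread_barrier_wait(&phase_barrier);/* all threads finish the phase    */
        return NULL;
    }

    void run_phase(void)
    {
        pthread_t t[N_THREADS];
        pthread_barrier_init(&phase_barrier, NULL, N_THREADS);
        for (int i = 0; i < N_THREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < N_THREADS; i++) pthread_join(t[i], NULL);
        pthread_barrier_destroy(&phase_barrier);
    }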

3.3 Streaming Implementation

For the streaming model, the first-level data storage in each core is split between a 24-KByte local store and an 8-KByte cache. The small cache is used for stack data and global variables. It is particularly useful for the sequential portions of the code and helps simplify the programming and compilation of the streaming portion as well. The 24-KByte local store is indexed as a random access memory. Our implementation also provides hardware support for FIFO accesses to the local store, but we did not use this feature with any of our applications. Each core has a DMA engine that supports sequential, strided, and indexed transfers, as well as command queuing. At any point, the DMA engine may have up to 16 outstanding 32-byte accesses.

The local store is effectively smaller than a 24-KByte cache, since it has no tag or control bit overhead.

Figure 1. The architecture of the CMP system with up to 16 processors. The core organizations for the cache-based and streaming models are shown on the left and right, respectively: in the cache-based core, the CPU is paired with a D-cache and a cache-coherence engine; in the streaming core, the D-cache is replaced by a local store with a DMA engine. Clusters of cores share a local network and connect through a global network to L2 cache banks and the DRAM interface.

We do not raise the size of the local store, since the small increment (2 KBytes) does not make a difference for our applications. Still, the energy consumption model accounts correctly for the reduced capacity. The secondary storage is again organized as a 16-way set-associative cache. L2 caches are useful with stream processors, as they capture long-term reuse patterns and avoid expensive accesses to main memory [36, 9]. The L2 cache avoids refills on write misses when DMA transfers overwrite entire lines.

We developed streaming code using a simple library for DMA transfers within threaded C code. We manually applied the proper blocking factor and double-buffering in order to overlap DMA transfers with useful computation. We also run multiple computational kernels on each data block to benefit from producer-consumer locality without additional memory accesses or write-backs for intermediate results. Higher-level stream programming models should be applicable to most of our applications [15, 13]. In some cases, the DMA library uses a scheduling thread that queues pending transfers. We avoid taking up a whole core for this thread by multiplexing it with an application thread. The performance impact is negligible.
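The double-buffering pattern is the heart of these streaming versions. The sketch below (again ours, using the same hypothetical dma_get/dma_put/dma_wait placeholders as in Section 2.2) overlaps the DMA for block b+1 with the computation on block b; it assumes the input length is a multiple of BLOCK and that dma_wait on an idle tag returns immediately.

    #include <stddef.h>

    enum { BLOCK = 1024 };

    extern void dma_get(void *local, const void *global, size_t bytes, int tag);
    extern void dma_put(const void *local, void *global, size_t bytes, int tag);
    extern void dma_wait(int tag);
    extern void kernel(float *block, size_t len);   /* compute on one block */

    void stream_process(const float *in, float *out, size_t n_blocks)
    {
        static float buf[2][BLOCK];        /* both buffers live in the local store */
        dma_get(buf[0], in, BLOCK * sizeof(float), 0);          /* prime buffer 0  */

        for (size_t b = 0; b < n_blocks; b++) {
            int cur = (int)(b & 1), nxt = cur ^ 1;
            if (b + 1 < n_blocks) {
                dma_wait(nxt);             /* previous put of buf[nxt] has drained */
                dma_get(buf[nxt], in + (b + 1) * BLOCK, BLOCK * sizeof(float), nxt);
            }
            dma_wait(cur);                 /* input for the current block is here  */
            kernel(buf[cur], BLOCK);       /* compute while the next block streams */
            dma_put(buf[cur], out + b * BLOCK, BLOCK * sizeof(float), cur);
        }
        dma_wait(0);
        dma_wait(1);                       /* drain the final output transfers     */
    }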

4. METHODOLOGY

4.1 Simulation and Energy Modeling

We used Tensilica's modeling tools to construct a CMP simulator for both memory models [40]. The simulator captures all stall and contention events in the core pipeline and the memory system. Table 2 summarizes the major system characteristics and the parameters we varied for this study; the default value of each varied parameter is used unless otherwise stated. The applications were compiled with Tensilica's optimizing compiler at the -O3 optimization level. We fast-forward over the initialization for each application but simulate the rest to completion, except for 179.art, for which we measure 10 invocations of the train match function.

We also developed an energy model for the architecture in a 90nm CMOS process (1.0V power supply). For the cores, the model combines usage statistics (instruction mix, functional unit utilization, stalls and idle cycles, etc.) with energy data from the layout of actual Tensilica designs at 600 MHz in 90nm. The energy consumed by on-chip memory structures is calculated using CACTI 4.1 [38], which includes a leakage power model and improved circuit models compared to CACTI 3. Interconnect energy is calculated based on our measured activity statistics and scaled power measurements from [19]. The energy consumption for off-chip DRAM is derived from DRAMsim [42]. We model the effect of leakage and clock gating on energy at all levels of the model.

4.2 Applications

Table 3 presents the set of applications used for this study. They represent applications from the multimedia, graphics, physical simulation, DSP, and data management domains. Such applications have been used to evaluate and motivate the development of streaming architectures. MPEG-2, H.264, Raytracer, JPEG, and Stereo Depth Extraction are compute-intensive applications and show exceptionally good cache performance despite their large datasets. They exhibit good spatial or temporal locality and have enough computation per data element to amortize the penalty for any misses. FEM is a scientific application, but has about the same compute intensity as the multimedia applications. The remaining applications (Bitonic Sort, Merge Sort, FIR, and 179.art) perform a relatively small computation on each input element. They require considerably higher off-chip bandwidth and are sensitive to memory latency.

We manually optimized both versions of each application to eliminate bottlenecks and schedule its parallelism in the best possible way. Whenever appropriate, we applied the same data-locality optimizations (i.e., blocking, producer-consumer, etc.) to both models. In Section 6, we explore the impact of data-locality optimizations. The following is a brief description of how each application was parallelized.

MPEG-2 and H.264 are parallelized at the macroblock level. Both dynamically assign macroblocks to cores using a task queue. Macroblocks are entirely data-parallel in MPEG-2. In H.264, we schedule the processing of dependent macroblocks so as to minimize the length of the critical execution path. With the CIF-resolution video frames we encode for this study, the macroblock parallelism available in H.264 is limited. Stereo Depth Extraction is parallelized by dividing input frames into 32x32 blocks and statically assigning them to processors.

KD-tree Raytracer is parallelized across camera rays. We assign rays to processors in chunks to improve locality. Our streaming version reads the KD-tree from the cache instead of streaming it with a DMA controller. JPEG Encode and JPEG Decode are parallelized across input images, in a manner similar to that done by an image thumbnail browser. Note that Encode reads a lot of data but outputs little; Decode behaves in the opposite way. The Finite Element Method (FEM) is parallelized across mesh cells. The FIR filter has 16 taps and is parallelized across long strips of samples. 179.art is parallelized across F1 neurons; this application is composed of several data-parallel vector operations and reductions, between which we place barriers.

Merge Sort and Bitonic Sort are parallelized across sub-arrays of a large input array. The processors first sort chunks of 4096 keys in parallel using quicksort. Then, sorted chunks are merged or sorted until the full array is sorted. Merge Sort gradually reduces in parallelism as it progresses, whereas Bitonic Sort retains full parallelism for its duration. Merge Sort alternates writing output sublists to two buffer arrays, while Bitonic Sort operates on the list in situ.

Core (CC and STR): 1, 2, 4, 8, or 16 Tensilica LX cores, 3-way VLIW, 7-stage pipeline; 800 MHz, 1.6 GHz, 3.2 GHz, or 6.4 GHz clock frequency; 2 FPUs, 2 integer units, 1 load/store unit
I-cache (CC and STR): 16 KB, 2-way associative, 32-byte blocks, 1 port
Data storage (CC): 32 KB, 2-way associative cache, 32-byte blocks, 1 port, MESI; hardware stream prefetcher
Data storage (STR): 24 KB local store, 1 port; 8 KB, 2-way associative cache, 32-byte blocks, 1 port; DMA engine with 16 outstanding accesses
Local network (CC and STR): bidirectional bus, 32 bytes wide, 2-cycle latency (after arbitration)
Global crossbar (CC and STR): 1 input and output port per cluster or L2 bank, 16 bytes wide, 2.5 ns latency (pipelined)
L2 cache (CC and STR): 512 KB, 16-way set-associative, 1 port, 2.2 ns access latency, non-inclusive
DRAM (CC and STR): one memory channel at 1.6 GB/s, 3.2 GB/s, 6.4 GB/s, or 12.8 GB/s; 70 ns random access latency

Table 2. Parameters for the CMP system (CC = cache-coherent model, STR = streaming model). For parameters that vary, the value used is noted with each experiment; unless otherwise stated, the cores run at 800 MHz. Latencies are for a 90nm CMOS process.

Application | Input Dataset | L1 D-Miss Rate | L2 D-Miss Rate | Instr. per L1 D-Miss | Cycles per L2 D-Miss | Off-chip B/W
MPEG-2 Encoder [28] | 10 CIF frames (Foreman) | 0.58% | 85.3% | 324.8 | 135.4 | 292.4 MB/s
H.264 Encoder | 10 CIF frames (Foreman) | 0.06% | 30.8% | 3705.5 | 4225.9 | 10.8 MB/s
KD-tree Raytracer | 128x128, 16,371 triangles | 1.06% | 98.9% | 256.3 | 6.6 | 45.1 MB/s
JPEG Encoder [21] | 128 PPMs of various sizes | 0.40% | 72.9% | 577.1 | 84.2 | 402.2 MB/s
JPEG Decoder [21] | 128 JPGs of various sizes | 0.58% | 76.2% | 352.9 | 44.9 | 1059.2 MB/s
Stereo Depth Extraction | 3 CIF image pairs | 0.03% | 46.1% | 8662.5 | 3995.3 | 11.4 MB/s
2D FEM | 5006-cell mesh, 7663 edges | 0.60% | 86.2% | 368.8 | 55.5 | 587.9 MB/s
FIR filter | 2^20 32-bit samples | 0.63% | 99.8% | 214.6 | 20.4 | 1839.1 MB/s
179.art | SPEC reference dataset | 1.79% | 7.4% | 150.1 | 230.9 | 227.7 MB/s
Bitonic Sort | 2^19 32-bit keys (2 MB) | 2.22% | 98.2% | 140.9 | 26.1 | 1594.2 MB/s
Merge Sort | 2^19 32-bit keys (2 MB) | 3.98% | 99.7% | 71.1 | 33.7 | 1167.8 MB/s

Table 3. Memory characteristics of the applications, measured on the cache-based model using 16 cores running at 800 MHz.


5. EVALUATION

Our evaluation starts with a comparison of the streaming system to the baseline cache-based system without prefetching or other enhancements (Sections 5.1 and 5.2). We then study the bandwidth consumption and latency tolerance of the two systems (Section 5.3). We conclude by evaluating means to enhance the performance of caching systems (Sections 5.4 and 5.5).

5.1 Performance Comparison

Figure 2 presents the execution time for the coherent cache-based (CC) and streaming (STR) models as we vary the number of 800 MHz cores from 2 to 16. We normalize to the execution time of the sequential run with the cache-based system. The components of execution time are: useful execution (including fetch and non-memory pipeline stalls), synchronization (lock, barrier, wait for DMA), and stalls for data (caches). Lower bars are better. The cache-based results assume no hardware prefetching.

For 7 out of 11 applications (MPEG-2, H.264, Depth, Raytracing, FEM, JPEG Encode and Decode), the two models perform almost identically for all processor counts. These programs perform a significant computation on each data element fetched and can be classified as compute-bound. Both caches and local stores capture their locality patterns equally well.

The remaining applications (179.art, FIR, Merge Sort, and Bitonic Sort) are data-bound and reveal some interesting differences between the two models. The cache-based versions stall regularly due to cache misses. Streaming versions eliminate many of these stalls using double-buffering (macroscopic prefetching). This is not the case for Bitonic Sort, because off-chip bandwidth is saturated at high processor counts. Bitonic Sort is an in-place sorting algorithm, and it is often the case that sublists are moderately in-order and elements do not need to be swapped, and consequently do not need to be written back. The cache-based system naturally discovers this behavior, while the streaming memory system writes the unmodified data back to main memory anyway. H.264 and Merge Sort have synchronization stalls with both models due to limited parallelism.

There are subtle differences in the useful cycles of some applications. FIR executes 14% more instructions in the streaming model than the caching model because of the management of DMA transfers (128 elements per transfer). In the streaming Merge Sort, the inner loop executes extra comparisons to check if an output buffer is full and needs to be drained to main memory, whereas the cache-based variant freely writes data sequentially. Even though double-buffering eliminates all data stalls, the application runs longer because of its higher instruction count. The streaming H.264 takes advantage of some boundary-condition optimizations that proved difficult in the cache-based variant. This resulted in a slight reduction in instruction count when streaming. MPEG-2 suffers a moderate number of instruction cache misses due to the cache's limited size.

Overall, Figure 2 shows that neither model has a consistent performance advantage. Even without prefetching, the cache-based model performs similarly to the streaming one, and sometimes it is actually faster (Bitonic Sort for 16 cores). Both models use data efficiently, and the fundamental parallelism available in these applications is not affected by how data are moved. Although the differences in performance are small, there is a larger variation in off-chip bandwidth utilization of the two models. Figure 3 shows that each model has an advantage in some situations.

Figure 2. Execution times for the two memory models as the number of cores (2, 4, 8, and 16 CPUs) is increased, normalized to a single caching core. Each bar breaks execution time into useful execution, synchronization, load stalls, and store stalls for the CC and STR systems; the applications shown are MPEG-2 Encoder, H.264 Encoder, Ray Tracer, JPEG Encoder, JPEG Decoder, Depth Extraction, FEM, FIR, 179.art, Bitonic Sort, and Merge Sort.

Figure 3. Off-chip traffic (read and write) for the cache-based and streaming systems with 16 CPUs, normalized to a single caching core, for FEM, MPEG-2, FIR, and Bitonic Sort.

Figure 4. Energy consumption for the cache-based and streaming systems with 16 CPUs, normalized to a single caching core, for FEM, MPEG-2, FIR, and Bitonic Sort. Each bar breaks energy into core, I-cache, D-cache, local memory, network, L2 cache, and DRAM components.


5.2 Energy Comparison

Figure 4 presents energy consumption for the two models running FEM, MPEG-2, FIR, and Bitonic Sort. We normalize to the energy consumption of a single caching core for each application. Each bar indicates the energy consumed by the cores, the caches and local stores, the on-chip network, the second-level cache, and the main memory. The numbers include both static and dynamic power. Lower bars are better. In contrast to performance scaling, energy consumption does not always improve with more cores, since the amount of hardware used to run the application increases. For 5 out of 11 applications (JPEG Encode, JPEG Decode, FIR, 179.art, and Merge Sort), streaming consistently consumes less energy than cache coherence, typically 10% to 25%. The energy differential in nearly every case comes from the DRAM system. This can be observed in the correlation between Figures 3 and 4. Specifically, the streaming applications typically transfer fewer bytes from main memory, often through the elimination of superfluous refills for output-only data. The opposite is true for our streaming Bitonic Sort, which tends to communicate more data with main memory than the caching version due to the write-back of unmodified data. For applications where there is little bandwidth difference between the two models (such as FEM) or the computational intensity is very high (such as Depth), the difference in energy consumption is insignificant.

We expected to see a greater difference between the local store and L1 data cache, but it never materialized. Since our applications are data-parallel and rarely share data, the energy cost of an average cache miss is dominated by the off-chip DRAM access rather than the modest tag broadcast and lookup. Hence, the per-access energy savings from eliminating tag lookups in the streaming system made little impact on the total energy footprint of the system.

Figure 5. Normalized execution time as the computation rate of the processor cores is increased from 800 MHz to 6.4 GHz (16 cores), for MPEG-2 Encoder, FIR, and Bitonic Sort.

5.3 Increased Computational Throughput

Up to now, the results assume 800 MHz cores, which are reasonable for embedded CMPs for consumer applications. To explore the efficiency of the two memory models as the computational throughput of the processor is increased, we vary the clock frequency of the cores while keeping constant the bandwidth and latency in the on-chip networks, L2 cache, and off-chip memory. In some sense, the higher clock frequencies tell us in general what would happen with more powerful processors that use SIMD units, out-of-order schemes, higher clock frequency, or a combination. For example, the 6.4 GHz configuration can be representative of the performance of an 800 MHz processor that uses 4- to 8-wide SIMD instructions. The experiment was performed with 16 cores to stress scaling and increase the system's sensitivity to both memory latency and memory bandwidth.

Applications with significant data reuse, such as H.264 and Depth, show no sensitivity to this experiment and perform equally well on both systems. Figure 5 shows the results for some of the applications that are affected by computational scaling. These applications fall into one of two categories: bandwidth-sensitive or latency-sensitive. Latency-sensitive programs, like MPEG-2 Encoding, perform a relatively large amount of computation between off-chip memory accesses (hundreds of instructions). While the higher core frequency shortens these computations, it does not reduce the amount of time (in nanoseconds, not cycles) required to fetch the data in between computations. The macroscopic prefetching in the streaming system can tolerate a significant percentage of the memory latency. Hence, at 6.4 GHz, the streaming MPEG-2 Encoder is 9% faster.

Bandwidth-sensitive applications, like FIR and Bitonic Sort, eventually saturate the available off-chip bandwidth. Beyond that point, further increases in computational throughput do not improve overall performance. For FIR, the cache-based system saturates before the streaming system due to the superfluous refills on store misses to output-only data. At the highest computational throughput, the streaming system performs 36% faster. For Bitonic Sort, the streaming version saturates first, since it performs more writes than the cache-based version (as described in Section 5.1). This gives the cache-based version a 19% performance advantage.

5.4 Mitigating Latency and Bandwidth Issues

The previous section indicates that when a large number of cores with high computational throughput are used, the cache-based model faces latency and bandwidth issues with certain applications. To characterize these inefficiencies, we performed experiments with increased off-chip bandwidth and hardware prefetching.

Figure 6 shows the impact of increasing the available off-chip bandwidth for FIR.

Figure 6. The effect of increased off-chip bandwidth (1.6, 3.2, 6.4, and 12.8 GB/s) on FIR performance. Measured on 16 cores at 3.2 GHz.

This can be achieved by using higher-frequency DRAM (e.g., DDR2, DDR3, GDDR) or multiple memory channels. With more bandwidth available, the effect of superfluous refills is significantly reduced, and the cache-based system performs nearly as well as the streaming one. When hardware prefetching is introduced at 12.8 GB/s, load stalls are reduced to 3% of the total execution time. However, the additional off-chip bandwidth does not close the energy gap for this application. An energy-efficient solution for the cache-based system is to use a non-allocating write policy, which we explore in Section 5.5.

For Merge Sort and 179.art (Figure 7), hardware prefetching significantly improves the latency tolerance of the cache-based systems; data stalls are virtually eliminated. This is not to say that we never observed data stalls (at 16 cores, the cache-based Merge Sort saturates the memory channel due to superfluous refills), but rather that a small degree of prefetching is sufficient to hide over 200 cycles of memory latency.

5.5 Mitigating Superfluous Refills

For some applications, the cache-based system uses more off-chip bandwidth (and consequently energy) because of superfluous refills for output-only data. This disadvantage can be addressed by using a non-write-allocate policy for output-only data streams. We mimic non-allocating stores by using an instruction similar to the MIPS32 "Prepare for Store" (PFS) instruction [34]. PFS allocates and validates a cache line without refilling it. The results for FIR, Merge Sort, and MPEG-2 are shown in Figure 8. For each application, the elimination of superfluous refills brings the memory traffic and energy consumption of the cache-based model into parity with the streaming model. For MPEG-2, the memory traffic due to write misses was reduced by 56% compared to the cache-based application without PFS.
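To show what this looks like at the source level, here is a sketch of an output-only loop annotated with a prepare-for-store hint. prepare_for_store() is a stand-in for whatever intrinsic or inline assembly emits the actual cache-control instruction; we are not reproducing the code used in the paper, and the sketch assumes out[] is cache-line aligned.

    #include <stddef.h>

    #define LINE_BYTES 32                     /* cache line size in this study */

    /* Hypothetical wrapper around a MIPS32 "Prepare for Store"-style hint:
     * allocate and validate the line holding addr without refilling it. */
    extern void prepare_for_store(void *addr);

    void produce(const float *in, float *out, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++) {
            if ((i * sizeof(float)) % LINE_BYTES == 0)
                prepare_for_store(&out[i]);   /* claim the output line up front */
            out[i] = k * in[i];               /* stores hit; no refill traffic  */
        }
    }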

Figure 7. The effect of hardware prefetching on performance for Merge Sort and 179.art. P4 refers to a prefetch depth of 4. Measured on 2 cores at 3.2 GHz with a 12.8 GB/s memory channel.

Figure 8. The effect of "Prepare for Store" (PFS) instructions on the off-chip traffic for the cache-based system, normalized to a single caching core, for FIR, Merge Sort, and MPEG-2. Also shown is energy consumption for FIR with 16 cores at 800 MHz (see Figure 4 for that graph's legend). Note that a full hardware implementation of a non-write-allocate cache policy, along with the necessary write-gathering buffer, might perform better than PFS, as it would also eliminate cache replacements due to output-only data.

6. STREAMING AS A PROGRAMMING MODEL

Our evaluation so far shows that the two memory models lead to similar performance and scaling. It is important to remember that we took advantage of streaming optimizations, such as blocking and locality-aware scheduling, on both memory models. To illustrate this, it is educational to look at stream programming and its implications for CMP architecture.

Stream programming models encourage programmers to think about the locality, data movement, and storage capacity issues in their applications [15, 13]. While they do not necessarily require the programmer to manage these issues, the programmer structures the application in such a way that it is easy to reason about them. This exposed nature of a stream program is vitally important for streaming architectures, as it enables software or compiler management of data locality and asynchronous communication with architecturally visible on-chip memories. Despite its affinity for stream architectures, we find that stream programming is beneficial for cache-based architectures as well.

Figure 9 shows the importance of streaming optimizations in the cache-based MPEG-2 Encoder. The original parallel code from [28] performs an application kernel on a whole video frame before the next kernel is invoked (i.e., Motion Estimation, DCT, Quantization, etc.).

Figure 9. The effect of stream programming optimizations on the off-chip bandwidth (read and write traffic) and performance of MPEG-2 at 800 MHz, comparing the original and stream-optimized cache-based code on 2 to 16 CPUs.

Figure 10. The effect of stream programming optimizations on the performance of 179.art at 800 MHz, comparing the original and stream-optimized cache-based code on 2 to 16 CPUs.

We restructured this code by hoisting the inner loops of several tasks into a single outer loop that calls each task in turn. In the optimized version, we execute all tasks on a block of a frame before moving to the next block. This also allowed us to condense a large temporary array into a small stack variable. The improved producer-consumer locality reduced write-backs from L1 caches by 60%. Data stalls are reduced by 41% at 6.4 GHz, even without prefetching. Furthermore, improving the parallel efficiency of the application became a simple matter of scheduling a single data-parallel loop, which alone is responsible for a 40% performance improvement at 16 cores. However, instruction cache misses are notably increased in the streaming-optimized code.
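Schematically (this is an illustration of the transformation, not the actual MPEG-2 source), the restructuring turns per-frame kernel passes with a large intermediate array into a single per-block loop whose intermediates fit on the stack:

    #define N_BLOCKS   396        /* macroblocks in one CIF frame (illustrative) */
    #define BLOCK_SIZE 256        /* 16x16 pixels */

    extern void load_block(int b, short *blk);
    extern void motion_estimate(const short *blk, short *residual);
    extern void dct_quantize(const short *residual, short *coef);
    extern void entropy_code(const short *coef);

    void encode_frame_fused(void)
    {
        for (int b = 0; b < N_BLOCKS; b++) {
            short blk[BLOCK_SIZE], residual[BLOCK_SIZE], coef[BLOCK_SIZE];
            load_block(b, blk);
            motion_estimate(blk, residual);   /* producer ...                     */
            dct_quantize(residual, coef);     /* ... consumers: the intermediates */
            entropy_code(coef);               /*     never leave the L1 cache     */
        }
    }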

For 179.art, we reorganized the main data structure in the cache-based version in the same way as we did for the streaming code. We were also able to replace several large temporary vectors with scalar values by merging several loops. These optimizations reduced the sparseness of 179.art's memory access pattern, improved both temporal and spatial locality, and allowed us to use prefetching effectively. As shown in Figure 10, the impact on performance is dramatic, even at small core counts (7x speedup).

Overall, we observed performance, bandwidth, and energy benefits whenever stream programming optimizations were applied to our cache-based applications. This is not a novel result, since it is well known that locality optimizations, such as blocking and loop fusion [29], increase computational intensity and cache efficiency. However, stream programming models encourage users to write code that explicitly exposes an application's parallelism and data-access pattern, more often allowing such optimizations.

Our experience is that stream programming is actually easier with the cache-based model rather than the streaming model.

With streaming memory, the programmer or the compiler must orchestrate all data movement and positioning exactly right in order for the program to operate correctly and fast. This can be burdensome for irregular access patterns (overlapping blocks, search structures, unpredictable or data-dependent patterns, etc.), or for accesses that do not affect an application's performance. It can lead to additional instructions and memory references that reduce or eliminate streaming hardware's other advantages. With cache-based hardware, stream programming is just an issue of performance optimization. Even if the algorithm is not blocked exactly right, the caches will provide best-effort locality and communication management. Hence, the programmer or the compiler can focus on the most promising and most regular data structures instead of managing all data structures in a program.

Moreover, stream programming can address some of the coherence and consistency challenges when scaling cache-based CMPs to large numbers of cores. Since a streaming application typically operates in a data-parallel fashion on a sequence of data, there is little short-term communication or synchronization between processors. Communication is only necessary when processors move from one independent set of input/output blocks to the next or reach a cycle in an application's data-flow graph. This observation may be increasingly important as CMPs grow, as it implies that less aggressive, coarser-grain, or lower-frequency mechanisms can be employed to keep caches coherent.

7. DISCUSSION AND LIMITATIONS

It is important to recognize that our study has limitations. Our experiments focus on CMPs with 1 to 16 cores and uniform memory access. Some of the conclusions may not generalize to larger-scale CMPs with non-uniform memory access (NUMA). A large-scale, cache-based CMP, programmed in a locality-oblivious way, will undoubtedly suffer stalls due to long memory delays or excessive on-chip coherence traffic. We observe that the stream programming model may be able to address both limitations; it exposes the flow of data early enough that they can be prefetched, and motivates a far coarser-grained, lower-frequency coherence model.

Towards the goal of designing large CMPs that are still easy to program, a hybrid memory system that combines caching and software-managed memory structures can mitigate efficiency challenges without exacerbating the difficulty of software development. For example, bulk transfer primitives for cache-based systems could enable more efficient macroscopic prefetching. Conversely, small, potentially incoherent caches in streaming memory systems could vastly simplify the use of static data structures with abundant temporal locality. An industry example of a hybrid system is the NVIDIA G80 GPU [6], which, in addition to cached access to global memory, includes a small scratch-pad for application-specific locality and communication optimizations.

Besides the limits of scalability, we did not consider architectures that expose the streaming model all the way to the register file [39] or applications without abundant data parallelism. We also did not consider changes to the pipeline of our cores, since that is precisely what makes it difficult to evaluate existing streaming memory processors compared to cache-based processors. Finally, our study was performed using general-purpose CMPs. A comparison between the two memory models for specialized CMPs remains an open issue. Despite these limitations, we believe this study's conclusions are important in terms of understanding the actual behavior of CMP memory systems and motivating future research and development.

8. RELATED WORK

Several architectures [3, 39, 16, 32] use streaming hardware with multimedia and scientific code in order to get performance and energy benefits from software-managed memory hierarchies and regular control flow. There are also corresponding proposals for stream programming languages and compiler optimizations [15, 13]. Such tools can reduce the burden on the programmer for explicit locality and communication management. In parallel, there are significant efforts to enhance cache-based systems with traffic filters [35], replacement hints [43], or prefetching hints [44]. These enhancements target the same access patterns that streaming memory systems benefit from.

To the best of our knowledge, this is the first direct comparison of the two memory models for CMP systems under a unified set of assumptions. Jayasena [23] compared a stream register file to a single-level cache for a SIMD processor. He found that the stream register file provides performance and bandwidth advantages for applications with significant producer-consumer locality. Loghi and Poncino compared hardware cache coherence to not caching shared data at all for embedded CMPs with on-chip main memory [31]. The ALP report [28] evaluates multimedia codes on CMPs with streaming support. However, for all but one benchmark, streaming implied the use of enhanced SIMD instructions, not software-managed memory hierarchies. Suh et al. [37] compared a streaming SIMD processor, a streaming CMP chip, a vector design, and a superscalar processor for DSP kernels. However, the four systems varied vastly at all levels, hence it is difficult to compare memory models directly. There are several proposals for configurable or hybrid memory systems [33, 36, 23, 28]. In such systems, a level in the memory hierarchy can be configured as a cache or as a local store depending on an application's needs. Gummaraju and Rosenblum have shown benefits from a hybrid architecture that uses stream programming on a cache-based superscalar design for scientific code [17]. Our work supports this approach, as we show that cache-based memory systems can be as efficient as streaming memory systems, but could benefit in terms of bandwidth consumption and latency tolerance from stream programming.

Our study follows the example of papers that compared shared memory and message passing for multi-chip systems [24, 8].

9. CONCLUSIONS

The choice of the on-chip memory model has far-reaching implications for CMP systems. In this paper, we performed a direct comparison of two basic models: coherent caches and streaming memory.

We conclude that for the majority of applications, both models perform and scale equally well. For some applications without significant data reuse, streaming has a performance and energy advantage when we scale the number and computational capabilities of the cores. However, the efficiency gap can be bridged by introducing prefetching and non-allocating write policies to cache-coherent systems. We also found some applications for which streaming scales worse than caching due to the redundancies introduced in order to make the code regular enough for streaming.

Through the adoption of stream programming methodologies, which encourage blocking, macroscopic prefetching, and locality-aware task scheduling, cache-based systems are equally as efficient as streaming memory systems. This indicates that there is not a sufficient advantage in building general-purpose cores that follow a pure streaming model, where all local memory structures and all data streams are explicitly managed. We also observed that stream programming is actually easier when targeting cache-based systems rather than streaming memory systems, and that it may be beneficial in scaling coherence and consistency for caches to larger systems.

10. ACKNOWLEDGEMENTS

We sincerely thank Bill Mark, Mattan Erez, and Ron Ho for their invaluable comments on early versions of this paper. We are also grateful for the insightful critique of this work by the reviewers. This work was partially funded by DARPA under contract number F29601-03-2-0117. Jacob Leverich is supported by a David Cheriton Stanford Graduate Fellowship.

11. REFERENCES

[1] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. IEEE Computer, 29(12):66-76, Dec. 1996.
[2] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock Rate versus IPC: the End of the Road for Conventional Microarchitectures. In Proceedings of the 27th Intl. Symp. on Computer Architecture, June 2000.
[3] J. Ahn et al. Evaluating the Imagine Stream Architecture. In Proceedings of the 31st Intl. Symp. on Computer Architecture, May 2004.
[4] J. Andrews and N. Backer. Xbox 360 System Architecture. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.
[5] L. A. Barroso et al. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Intl. Symp. on Computer Architecture, Vancouver, Canada, June 2000.
[6] I. Buck. GPU Computing: Programming a Massively Parallel Processor, Mar. 2005. Keynote presentation at the International Symposium on Code Generation and Optimization, San Jose, CA.
[7] T. Chiueh. A Generational Algorithm to Multiprocessor Cache Coherence. In International Conference on Parallel Processing, pages 20-24, Oct. 1993.
[8] D. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1999.
[9] W. Dally et al. Merrimac: Supercomputing with Streams. In Proceedings of the 2003 Conf. on Supercomputing, Nov. 2003.
[10] J. D. Davis, J. Laudon, and K. Olukotun. Maximizing CMP Throughput with Mediocre Cores. In Proceedings of the 14th Intl. Conf. on Parallel Architectures and Compilation Techniques, Sept. 2005.
[11] M. Drake, H. Hoffmann, R. Rabbah, and S. Amarasinghe. MPEG-2 Decoding in a Stream Programming Language. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Rhodes Island, Apr. 2006.
[12] W. Eatherton. The Push of Network Processing to the Top of the Pyramid, Oct. 2005. Keynote presentation at the Symposium on Architectures for Networking and Communication Systems, Princeton, NJ.
[13] K. Fatahalian et al. Sequoia: Programming The Memory Hierarchy. In Supercomputing Conference, Nov. 2006.
[14] T. Foley and J. Sugerman. KD-Tree Acceleration Structures for a GPU Raytracer. In Proceedings of the Graphics Hardware Conf., July 2005.
[15] M. I. Gordon et al. A Stream Compiler for Communication-exposed Architectures. In Proceedings of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 2002.
[16] M. Gschwind et al. A Novel SIMD Architecture for the Cell Heterogeneous Chip-Multiprocessor. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.
[17] J. Gummaraju and M. Rosenblum. Stream Programming on General-Purpose Processors. In Proceedings of the 38th Intl. Symp. on Microarchitecture, Nov. 2005.
[18] R. Ho, K. Mai, and M. Horowitz. The Future of Wires. Proceedings of the IEEE, (4), Apr. 2001.
[19] R. Ho, K. Mai, and M. Horowitz. Efficient On-chip Global Interconnects, June 2003.
[20] M. Horowitz and W. Dally. How Scaling Will Change Processor Architecture. In International Solid-State Circuits Conference, pages 132-133, Feb. 2004.
[21] Independent JPEG Group. IJG's JPEG Software Release 6b, 1998.
[22] D. Jani, G. Ezer, and J. Kim. Long Words and Wide Ports: Reinventing the Configurable Processor. In Conf. Record of Hot Chips 16, Stanford, CA, Aug. 2004.
[23] N. Jayasena. Memory Hierarchy Design for Stream Computing. PhD thesis, Stanford University, 2005.
[24] A. C. Klaiber and H. M. Levy. A Comparison of Message Passing and Shared Memory Architectures for Data Parallel Programs. In Proceedings of the 21st Intl. Symp. on Computer Architecture, Apr. 1994.
[25] P. Kongetira. A 32-way Multithreaded Sparc Processor. In Conf. Record of Hot Chips 16, Stanford, CA, Aug. 2004.
[26] R. Kumar, V. Zyuban, and D. M. Tullsen. Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling. In Proceedings of the 32nd Intl. Symp. on Computer Architecture, June 2005.
[27] B. Lewis and D. J. Berg. Multithreaded Programming with Pthreads. Prentice Hall, 1998.
[28] M. Li et al. ALP: Efficient Support for All Levels of Parallelism for Complex Media Applications. Technical Report UIUCDCS-R-2005-2605, UIUC CS, July 2005.
[29] A. W. Lim, S.-W. Liao, and M. S. Lam. Blocking and Array Contraction Across Arbitrarily Nested Loops Using Affine Partitioning. ACM SIGPLAN Notices, 36(7):103-112, July 2001.
[30] Y. Lin. A Programmable Vector Coprocessor Architecture for Wireless Applications. In Proceedings of the 3rd Workshop on Application Specific Processors, Sept. 2004.
[31] M. Loghi and M. Poncino. Exploring Energy/Performance Tradeoffs in Shared Memory MPSoCs: Snoop-Based Cache Coherence vs. Software Solutions. In Proceedings of the Design Automation and Test in Europe Conf., Mar. 2005.
[32] E. Machnicki. Ultra High Performance Scalable DSP Family for Multimedia. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.
[33] K. Mai et al. Smart Memories: a Modular Reconfigurable Architecture. In Proceedings of the 27th Intl. Symp. on Computer Architecture, June 2000.
[34] MIPS32 Architecture For Programmers Volume II: The MIPS32 Instruction Set. MIPS Technologies, Inc., 2001.
[35] A. Moshovos. RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence. In Proceedings of the 32nd Intl. Symp. on Computer Architecture, June 2005.
[36] K. Sankaralingam. TRIPS: A Polymorphous Architecture for Exploiting ILP, TLP, and DLP. ACM Trans. Archit. Code Optim., 1(1):62-93, Mar. 2004.
[37] J. Suh et al. A Performance Analysis of PIM, Stream Processing, and Tiled Processing on Memory-intensive Signal Processing Kernels. In Proceedings of the 30th Intl. Symp. on Computer Architecture, June 2003.
[38] D. Tarjan, S. Thoziyoor, and N. P. Jouppi. CACTI 4.0. Technical Report HPL-2006-86, HP Labs, 2006.
[39] M. Taylor et al. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. In Proceedings of the 31st Intl. Symp. on Computer Architecture, May 2004.
[40] Tensilica Software Tools. http://www.tensilica.com/products/software.htm.
[41] S. P. VanderWiel and D. J. Lilja. Data Prefetch Mechanisms. ACM Computing Surveys, 32(2):174-199, 2000.
[42] D. Wang et al. DRAMsim: A Memory-System Simulator. SIGARCH Computer Architecture News, 33(4), 2005.
[43] Z. Wang et al. Using the Compiler to Improve Cache Replacement Decisions. In Proceedings of the Conf. on Parallel Architectures and Compilation Techniques, Sept. 2002.
[44] Z. Wang et al. Guided Region Prefetching: A Cooperative Hardware/Software Approach. In Proceedings of the 30th Intl. Symp. on Computer Architecture, June 2003.
[45] T.-Y. Yeh. The Low-Power High-Performance Architecture of the PWRficient Processor Family. In Conf. Record of Hot Chips 17, Stanford, CA, Aug. 2005.
