A Direct Method for Building Sparse Kernel Learning Algorithms
Mingrui Wu (mingrui.wu@tuebingen.mpg.de)
Bernhard Schölkopf (bernhard.schoelkopf@tuebingen.mpg.de)
Gökhan Bakır (goekhan.bakir@tuebingen.mpg.de)
Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany

Editor: Nello Cristianini
Abstract
Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach.
Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
1. Introduction
Many kernel learning algorithms (KLAs) have been proposed for solving different kinds of problems. For example, the support vector machine (SVM) (Vapnik, 1995) has been widely used for classification and regression, the minimax probability machine (MPM) (Lanckriet et al., 2002) is another competitive algorithm for classification, the one-class SVM (Schölkopf and Smola, 2002) is a useful tool for novelty detection, while the kernel Fisher discriminant (KFD) (Mika et al., 2003) and kernel PCA (KPCA) (Schölkopf and Smola, 2002) are powerful algorithms for feature extraction.
Many kernel learning algorithms result in a kernel machine (KM), such as a kernel classifier, whose output can be calculated as

    τ(x) = Σ_{i=1}^{N_XV} α̂_i K(x̂_i, x) + b,        (1)
where x ∈ X ⊂ R^d is the input data, X is the input space, x̂_i ∈ X, 1 ≤ i ≤ N_XV, are called expansion vectors (XVs) in this paper,¹ N_XV is the number of XVs, α̂_i ∈ R is the expansion coefficient associated with x̂_i, b ∈ R is the bias, and K : X × X → R is a kernel function.

1. The x̂_i are called support vectors in the SVM, and relevance vectors in the relevance vector machine (Tipping, 2001). In this paper we uniformly call them expansion vectors for the sake of simplicity.
Usually K is a positive definite kernel (Schölkopf and Smola, 2002), which implicitly introduces a feature space F. Let φ(·) denote the map from X to F; then

    K(x, x') = ⟨φ(x), φ(x')⟩,    ∀x, x' ∈ X.

Hence (1) can also be written as a linear function

    τ(x) = ⟨w, φ(x)⟩ + b,        (2)

where

    w = Σ_{i=1}^{N_XV} α̂_i φ(x̂_i)        (3)

is the weight vector of the KM; it equals the linear expansion of the XVs in the feature space F.
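To make (1)–(3) concrete, the following is a minimal sketch of how the output of such a kernel machine is evaluated; the function names and the Gaussian kernel width gamma are our own illustrative choices, not part of the paper.

```python
import numpy as np

def gaussian_kernel(u, v, gamma=0.1):
    # K(u, v) = exp(-gamma * ||u - v||^2); gamma is a hypothetical kernel width
    return np.exp(-gamma * np.linalg.norm(u - v) ** 2)

def km_output(x, xvs, alpha_hat, b, kernel=gaussian_kernel):
    # tau(x) = sum_i alpha_hat_i * K(xv_i, x) + b, as in (1);
    # the cost is proportional to the number of expansion vectors N_XV
    return sum(a * kernel(xv, x) for a, xv in zip(alpha_hat, xvs)) + b
```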
For all the KLAs mentioned above, the vector w is obtained by solving a convex optimization problem. For example, the SVM can be formulated as a quadratic programming problem (Vapnik, 1995); in (Mika et al., 2003), a convex formulation is proposed for the KFD; and a least squares SVM formulation is established for KPCA in (Suykens et al., 2002).
From the above description, we can see that although many KLAs have been proposed for solving different kinds of problems and have various formulations, there are three widely known common points among them. First, each of them results in a KM whose key component is a weight vector w, which can be expressed as a linear expansion of XVs in the feature space F. Second, the vector w is obtained by solving a convex optimization problem. Third, the output of the resulting KM is calculated as (1) (or (2)).
When solving practical problems, we want the time for computing the output to be as short as possible. For example, in real-time image recognition, in addition to good classification accuracy, high classification speed is also desirable. The time for calculating (1) (or (2)) is proportional to N_XV. Thus several sparse learning algorithms have been proposed to build KMs with small N_XV.
The reduced set (RS) method (Burges, 1996; Schölkopf and Smola, 2002) was proposed to simplify (1) by determining N_z vectors z_1, ..., z_{N_z} and corresponding expansion coefficients β_1, ..., β_{N_z} such that
    ‖w − Σ_{j=1}^{N_z} β_j φ(z_j)‖²        (4)
is minimized. RS methods approximate and replace w in (2) by Σ_{j=1}^{N_z} β_j φ(z_j), where N_z < N_XV. However, the RS method is not a direct way of building a sparse KM: the performance of the simplified KM depends on the KM it aims to simplify, and in order to apply RS methods, we need to build another KM in advance.

In (Lee and Mangasarian, 2001), the reduced support vector machine (RSVM) algorithm is proposed, which randomly selects N_z vectors from the training set as XVs and then computes the expansion coefficients. This algorithm can be applied to build sparse kernel classifiers. But as the XVs are chosen randomly, and may not be good representatives of the training data, good classification performance cannot be guaranteed when N_z is small (Lin and Lin, 2003).

The relevance vector machine (RVM) (Tipping, 2001) is another algorithm which leads to sparse KMs. The basic idea of the RVM is to assume a prior over the expansion coefficients which favors sparse solutions.

In this paper, based on the common points of KLAs mentioned before, we present a direct method to build sparse kernel learning algorithms (SKLA). In particular, given a KLA, we modify it by adding one more constraint to its corresponding convex optimization problem. The added constraint explicitly controls the sparseness of the resulting KM, and a gradient based approach is proposed to solve the modified optimization problem. We will also show that applying this method to the SVM results in a specific algorithm for building sparse large margin classifiers (SLMC).²

The remainder of this paper is organized as follows. In section 2, we describe the direct method for building SKLAs. After this, we focus on a particular application of this method to the SVM algorithm, leading to a detailed algorithm for building an SLMC. The SLMC algorithm is presented in section 3, where we also point out that it actually finds a discriminating subspace of the feature space F. Some comparisons with related approaches are given in section 4. Experimental results are provided in section 5, and we conclude the paper in the last section.³

2. Here we don't use the phrase "sparse SVM" because the XVs of the resulting classifier are not necessarily support vectors, i.e. they may not lie near the classification boundary.
3. This paper is an extension of our previous work (Wu et al., 2005).

2. A Direct Method of Building SKLAs

In this section, we propose a direct method for building SKLAs.

2.1 Basic Idea

As mentioned before, many KLAs can be formulated as an optimization problem, which can be written in a general form as follows:

    min_{w,b,ξ}  f(w, b, ξ),        (5)
    subject to   g_i(w, b, ξ) ≤ 0,    1 ≤ i ≤ N_g,        (6)
                 h_j(w, b, ξ) = 0,    1 ≤ j ≤ N_h,        (7)

where f(w, b, ξ) is the objective function to be minimized, w ∈ F and b ∈ R are respectively the weight vector and the bias of the KM to be built, ξ = [ξ_1, ..., ξ_{N_ξ}]ᵀ ∈ R^{N_ξ} is a vector of auxiliary variables (such as the slack variables in the soft margin training problem), N_ξ is the number of auxiliary variables, N_g is the number of inequality constraints specified by g_i(w, b, ξ), and N_h is the number of equality constraints specified by h_j(w, b, ξ).

Our objective is as follows: given a KLA and a positive integer N_z, we want to modify the given KLA such that the number of XVs of the resulting KM equals N_z, while at the same time the performance of the KM is kept as high as possible. To achieve this, we propose to solve the following problem:

    min_{w,b,ξ,β,Z}  f(w, b, ξ),        (8)
    subject to       g_i(w, b, ξ) ≤ 0,    1 ≤ i ≤ N_g,        (9)
                     h_j(w, b, ξ) = 0,    1 ≤ j ≤ N_h,        (10)
                     w = Σ_{i=1}^{N_z} φ(z_i) β_i,        (11)

where (8)–(10) are exactly the same as (5)–(7), while Z = [z_1, ..., z_{N_z}] ∈ R^{d×N_z} is the matrix of XVs and β = [β_1, ..., β_{N_z}]ᵀ ∈ R^{N_z} is the vector of expansion coefficients.

It can be seen that the above problem is the problem (5)–(7) with one added constraint (11), saying that the weight vector of the resulting KM equals the expansion of the φ(z_i), 1 ≤ i ≤ N_z. Note that the z_i are also variables, so they need to be computed when solving the optimization problem.
Due to the constraint (11), solving the problem (8)–(11) will naturally lead to a sparse KM. Furthermore, since the objective function of the problem (8)–(11) is exactly the same as that of the original problem (5)–(7), in principle the performance of the resulting KM can be kept as high as possible.

Because of the non-convex constraint (11), it is difficult to obtain the global optimum of the above problem, thus we propose a gradient based approach. However, our gradient based minimization will be performed only on the expansion vectors Z, not on all the variables. To this end, we define the following marginal function W(Z), which is obtained by keeping the expansion vectors in problem (8)–(11) fixed, i.e.

    W(Z) := min_{w,b,ξ,β}  f(w, b, ξ),        (12)
    subject to             g_i(w, b, ξ) ≤ 0,    1 ≤ i ≤ N_g,        (13)
                           h_j(w, b, ξ) = 0,    1 ≤ j ≤ N_h,        (14)
                           w = Σ_{i=1}^{N_z} φ(z_i) β_i.        (15)

The above problem is the same as problem (8)–(11) except that Z is not variable but fixed.

Clearly any global (or local) minimum of W(Z) is also a global (or local) minimum of problem (8)–(11), which means that the minima of the original problem (8)–(11) can be found by computing the minima of the function W(Z). Here we propose to minimize W(Z) by a gradient based algorithm. To this end, at any given Z ∈ R^{d×N_z}, we need to calculate both the function value W(Z) and the gradient ∇W(Z). These two problems are discussed in the next subsection.

2.2 Computing W(Z) and its Gradient ∇W(Z)

To compute the function value of W(Z) at any given Z, we need to solve the optimization problem (12)–(15). Obviously this is a problem dependent task. However, as mentioned before, the original optimization problems (5)–(7) of many current KLAs are convex, and (15) is just a linear constraint once Z is fixed. Therefore the problem (12)–(15) is still convex, which means its global optimum, and thus the function value of W(Z), can be readily computed.

Next we consider how to compute ∇W(Z), which requires more care since in general W(Z) is not necessarily differentiable. According to the constraint (11), the weight vector w is completely determined by β and Z, so the functions f, g_i and h_j can be regarded as functions of b, ξ, β and Z. Without causing confusion, we write them as f(x, Z), g_i(x, Z) and h_j(x, Z) in the following, where x := [b, ξᵀ, βᵀ]ᵀ.

Substituting (15) into (12)–(14), we have

    min_x  f(x, Z),        (16)
    subject to  g_i(x, Z) ≤ 0,    1 ≤ i ≤ N_g,        (17)
                h_j(x, Z) = 0,    1 ≤ j ≤ N_h,        (18)

and W(Z) is the optimal value of the above optimization problem.

To compute ∇W(Z), we can apply the following lemma, which gives both the conditions under which W(Z) is differentiable and an explicit form of the derivative.

Lemma 1 (Gauvin and Dubeau, 1982): Assume that all the functions f, g_i and h_j in problem (16)–(18) are continuously differentiable and suppose that problem (16)–(18) has a unique optimal solution x̄ at Z = Z̄. Furthermore, let ᾱ_i and β̄_j be the unique corresponding Lagrange multipliers associated with g_i and h_j respectively, 1 ≤ i ≤ N_g, 1 ≤ j ≤ N_h. Assume further that the feasible set of problem (16)–(18) is uniformly compact at Z̄ and that the optimal solution x̄ is Mangasarian-Fromovitz regular. Then the gradient of W(Z) exists at Z̄ and equals

    ∇W(Z)|_{Z=Z̄} = ∇_Z f(x̄, Z)|_{Z=Z̄} + Σ_{i=1}^{N_g} ᾱ_i ∇_Z g_i(x̄, Z)|_{Z=Z̄} + Σ_{j=1}^{N_h} β̄_j ∇_Z h_j(x̄, Z)|_{Z=Z̄},        (19)

where ∇_Z f(x̄, Z)|_{Z=Z̄} denotes the gradient of f(x, Z) with respect to Z at Z = Z̄ while fixing x at x̄.

As can be seen, in addition to the uniqueness of the optimal solution and its corresponding Lagrange multipliers, lemma 1 also requires the feasible set to be uniformly compact and the optimal solution to be Mangasarian-Fromovitz regular. The formal definitions of the last two conditions are given in appendix B. Since the properties of the feasible set and the optimal solutions depend on the specific KLA, we will discuss all these conditions for some particular KLAs in subsection 3.3 and appendix A. We will see that the conditions of lemma 1 are very mild for many current KLAs.
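The overall strategy of this section can be summarized in a short sketch: an outer gradient based loop over the expansion vectors Z, where each evaluation of W(Z) and ∇W(Z) is delegated to a problem specific inner solver (the convex problem (12)–(15) and the formula (19)). The sketch below is only illustrative; the callables W and grad_W are assumed to be supplied by such an inner solver, and scipy's L-BFGS-B is used as a stand-in for the LBFGS optimizer used later in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_expansion_vectors(W, grad_W, Z0, max_iter=200):
    """Minimize the marginal function W(Z) over the expansion vectors Z.

    W(Z)      : value of the inner convex problem (12)-(15) at fixed Z
    grad_W(Z) : gradient of W with respect to Z, e.g. via Lemma 1 / eq. (19)
    Z0        : initial expansion vectors, array of shape (d, Nz)
    """
    d, Nz = Z0.shape

    def fun(z_flat):
        Z = z_flat.reshape(d, Nz)
        return W(Z), grad_W(Z).ravel()

    res = minimize(fun, Z0.ravel(), jac=True, method="L-BFGS-B",
                   options={"maxiter": max_iter})
    return res.x.reshape(d, Nz)
```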
3. Building an SLMC

Having described our direct method for building SKLAs, we will focus on a concrete application of this method in the rest of this paper. In particular, we will apply this method to SVMs in order to obtain an algorithm for building SLMCs. We will also analyze its properties and compare it with other related approaches. In this section, we present the SLMC algorithm by closely following the discussion of section 2.

3.1 Objective

We now consider the binary classification problem, where we are given a set of training data {(x_i, y_i)}_{i=1}^N; here x_i ∈ X is the input data and y_i ∈ {−1, 1} is the class label.

Our objective is as follows: given a training data set and a positive integer N_z, we want to build a classifier such that the number of XVs of the classifier equals N_z and the margin of the classifier is as large as possible. This way we build a large margin classifier whose sparseness is explicitly controlled.

Based on the direct method described in the last section, we need to solve the problem (8)–(11) to achieve this goal. In this case, (8)–(10) become the SVM training problem and the constraint (11) controls the sparseness of the resulting classifier. So we should solve the following problem:

    min_{w,ξ,b,β,Z}  (1/2)⟨w, w⟩ + C Σ_{i=1}^N ξ_i,        (20)
    subject to       y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i,    ∀i,        (21)
                     ξ_i ≥ 0,    ∀i,        (22)
                     w = Σ_{i=1}^{N_z} φ(z_i) β_i,        (23)

where C is a positive constant, w ∈ F is the weight vector of the decision hyperplane in feature space, b ∈ R is the bias of the classifier, ξ = [ξ_1, ..., ξ_N]ᵀ ∈ R^N is the vector of slack variables, Z = [z_1, ..., z_{N_z}] ∈ R^{d×N_z} is the matrix of XVs and β = [β_1, ..., β_{N_z}]ᵀ ∈ R^{N_z} is the vector of expansion coefficients.

Following our proposed method, to solve the above problem we turn to minimizing the marginal function W(Z) defined in problem (12)–(15). For the current problem, the value of W(Z) is the minimum of the following optimization problem:

    min_{w,ξ,b,β}  (1/2)⟨w, w⟩ + C Σ_{i=1}^N ξ_i,        (24)
    subject to     y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i,    ∀i,        (25)
                   ξ_i ≥ 0,    ∀i,        (26)
                   w = Σ_{i=1}^{N_z} φ(z_i) β_i.        (27)

The above problem is the same as problem (20)–(23) except that Z is not variable but fixed. Following the discussion in section 2.1, the (local) minimum of the original problem (20)–(23) can be found by computing the (local) minimum of the function W(Z), and we will use a gradient based algorithm to do this. In the following two subsections we discuss how to compute the function value W(Z) and the gradient ∇W(Z) respectively.

3.2 Computing W(Z) and β

To compute the function value of W(Z) at any given Z, we need to solve the convex optimization problem (24)–(27), which is actually the problem of building an SVM with given XVs z_1, ..., z_{N_z}. This problem has already been considered in the RSVM algorithm (Lee and Mangasarian, 2001; Lin and Lin, 2003). But in the RSVM algorithm, only an approximation of the problem (24)–(27) is solved. Here we propose a different method which solves this problem exactly. (See section 4.2 for a discussion and section 5.5 for a comparison of the experimental results of these two methods.)

Substituting (27) into (24) and (25), we have

    min_{ξ,b,β}  (1/2) βᵀ K^z β + C Σ_{i=1}^N ξ_i,        (28)
    subject to   y_i(βᵀ ψ_z(x_i) + b) ≥ 1 − ξ_i,    ∀i,        (29)
                 ξ_i ≥ 0,    ∀i,        (30)

where

    ψ_z(x_i) = [K(z_1, x_i), ..., K(z_{N_z}, x_i)]ᵀ        (31)

is the empirical kernel map (Schölkopf and Smola, 2002) and K^z is the kernel matrix of the z_i, i.e. K^z_{ij} = K(z_i, z_j). Note that when N_z = N and z_i = x_i, 1 ≤ i ≤ N, this is the standard SVM training problem. In contrast, the problem (28)–(30) trains a linear SVM in a subspace spanned by φ(z_i), 1 ≤ i ≤ N_z, where the z_i are not necessarily training examples.

Now we investigate its dual problem. To derive it, we introduce the Lagrangian

    L(ξ, b, β, α, γ) = (1/2) βᵀ K^z β + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N γ_i ξ_i − Σ_{i=1}^N α_i [y_i(βᵀ ψ_z(x_i) + b) − 1 + ξ_i],        (32)

with Lagrange multipliers γ_i ≥ 0 and α_i ≥ 0.

The derivatives of L(ξ, b, β, α, γ) with respect to the primal variables must vanish:

    ∂L/∂β = K^z β − Σ_{i=1}^N α_i y_i ψ_z(x_i) = 0,        (33)
    ∂L/∂b = −Σ_{i=1}^N α_i y_i = 0,        (34)
    ∂L/∂ξ_i = C − α_i − γ_i = 0,    ∀i.        (35)

Equation (33) leads to

    β = (K^z)^{-1} Σ_{i=1}^N α_i y_i ψ_z(x_i).        (36)
Substituting (33)–(35) into (32) and using (36), we arrive at the dual form of the optimization problem:

    max_{α ∈ R^N}  Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j K̂^z(x_i, x_j),        (37)
    subject to     Σ_{i=1}^N y_i α_i = 0,        (38)
                   0 ≤ α_i ≤ C,    ∀i,        (39)

where

    K̂^z(x_i, x_j) = ψ_z(x_i)ᵀ (K^z)^{-1} ψ_z(x_j).        (40)

The function K̂^z(·,·) defined by (40) is a positive definite kernel function (Schölkopf and Smola, 2002). To see this, consider the following map,⁴

    φ_z(x_i) = T ψ_z(x_i),        (41)

where ψ_z(·) is defined by (31) and

    T = Λ^{-1/2} Vᵀ,        (42)

where Λ is a diagonal matrix of eigenvalues of the matrix K^z and V is a matrix whose columns are eigenvectors of K^z. So

    Tᵀ T = V Λ^{-1} Vᵀ = (K^z)^{-1}.        (43)

Combining equations (41) and (43), we have

    ⟨φ_z(x_i), φ_z(x_j)⟩ = ψ_z(x_i)ᵀ (K^z)^{-1} ψ_z(x_j) = K̂^z(x_i, x_j).

It can be seen that problem (37)–(39) has the same form as the dual of an SVM training problem. Therefore, given Z, computing the expansion coefficients of an SVM with kernel function K is equivalent to training an SVM with the modified kernel function K̂^z defined by (40).

4. The map defined in (41) is called the "whitened" empirical kernel map or "kernel PCA map" (Schölkopf and Smola, 2002).

Since problem (37)–(39) is the dual of problem (24)–(27), the optima of these two problems are equal to each other. So given Z, assuming α^z_i, 1 ≤ i ≤ N, is the solution of problem (37)–(39), we can compute W(Z) as

    W(Z) = Σ_{i=1}^N α^z_i − (1/2) Σ_{i,j=1}^N α^z_i α^z_j y_i y_j K̂^z(x_i, x_j).        (44)

According to (36), the expansion coefficients β can be calculated as

    β = (K^z)^{-1} Σ_{i=1}^N α^z_i y_i ψ_z(x_i) = (K^z)^{-1} (K^{zx}) Y α^z,        (45)

where ψ_z(·) is defined by (31), K^{zx} is the matrix defined by K^{zx}_{ij} = K(z_i, x_j), Y is a diagonal matrix of class labels, i.e. Y_{ii} = y_i, and α^z = [α^z_1, ..., α^z_N]ᵀ.

3.3 Computing ∇W(Z) of SLMC

To compute ∇W(Z), we can apply lemma 1 to the soft margin SVM training problem (37)–(39), which yields the following result.

Corollary 2 In the soft margin SVM training problem (37)–(39), assume that the kernel function K̂^z(·,·) is strictly positive definite and that the resulting support vectors come from both the positive and negative classes.⁵ Then the derivative of W(Z) with respect to z_{uv}, which denotes the v-th component of the vector z_u, 1 ≤ u ≤ N_z, 1 ≤ v ≤ d, exists and can be computed as follows:

    ∂W/∂z_{uv} = −(1/2) Σ_{i,j=1}^N α^z_i α^z_j y_i y_j ∂K̂^z(x_i, x_j)/∂z_{uv},        (46)

where α^z_i, 1 ≤ i ≤ N, denotes the solution of problem (37)–(39). In other words, ∇W(Z) can be computed as if α^z did not depend on Z.⁶

The proof of corollary 2 is given in appendix C.

5. Let α^z_i, 1 ≤ i ≤ N, denote the optimal solution of problem (37)–(39); support vectors are those input data x_i whose corresponding α^z_i are larger than 0 (Vapnik, 1995).
6. In corollary 2, α^z_i is bounded above by C as shown in (39). A similar conclusion is proposed in (Chapelle et al., 2002) for the hard margin SVM training problem, where there is no upper bound on α^z_i. This implies that the feasible set is not compact, hence lemma 1 cannot be applied anymore. Actually in (Chapelle et al., 2002), only the uniqueness of the optimal solution is emphasized, which, to our knowledge, is not enough to guarantee the differentiability of the marginal function W(Z).

As can be seen in corollary 2, to apply lemma 1 to calculate ∇W(Z), we only need to make two assumptions on problem (37)–(39): the kernel function K̂^z(·,·) is strictly positive definite and the resulting support vectors come from both classes. Certainly these are not strict assumptions for most practical applications. Similarly, one can verify that lemma 1 can also be applied to many other KLAs under mild assumptions, such as the one-class SVM. We will show another example in appendix A, applying our direct method to sparsify the KFD algorithm (Mika et al., 2003).

According to (40),

    ∂K̂^z(x_i, x_j)/∂z_{uv} = (∂ψ_z(x_i)/∂z_{uv})ᵀ (K^z)^{-1} ψ_z(x_j) + ψ_z(x_i)ᵀ (K^z)^{-1} (∂ψ_z(x_j)/∂z_{uv}) + ψ_z(x_i)ᵀ (∂(K^z)^{-1}/∂z_{uv}) ψ_z(x_j),

where ∂(K^z)^{-1}/∂z_{uv} can be calculated as

    ∂(K^z)^{-1}/∂z_{uv} = −(K^z)^{-1} (∂K^z/∂z_{uv}) (K^z)^{-1}.

So at any given Z, W(Z) and ∇W(Z) can be computed as (44) and (46) respectively.
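As a concrete illustration of equations (40), (44) and (45), the following is a minimal sketch of how W(Z) and β can be computed for a fixed Z with a Gaussian kernel. It uses scikit-learn's SVC with a precomputed Gram matrix as the inner SVM solver; the function names, the jitter term added to K^z for numerical stability, and the use of scikit-learn are our own choices, not the authors' implementation (which was written in C++).

```python
import numpy as np
from sklearn.svm import SVC

def rbf(A, B, gamma):
    # Gaussian kernel matrix: K(a, b) = exp(-gamma * ||a - b||^2)
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def W_and_beta(Z, X, y, C, gamma, jitter=1e-10):
    """Return W(Z) (eq. 44), beta (eq. 45) and the dual solution alpha^z,
    for labels y in {-1, +1}."""
    Kz = rbf(Z, Z, gamma) + jitter * np.eye(len(Z))      # K^z
    Kzx = rbf(Z, X, gamma)                               # K^{zx}; columns are psi_z(x_i)
    Kz_inv = np.linalg.inv(Kz)
    K_hat = Kzx.T @ Kz_inv @ Kzx                         # modified kernel, eq. (40)

    svm = SVC(C=C, kernel="precomputed").fit(K_hat, y)   # solve the dual (37)-(39)
    alpha = np.zeros(len(X))
    alpha[svm.support_] = np.abs(svm.dual_coef_[0])      # alpha^z_i (dual_coef_ stores y_i * alpha_i)

    ay = alpha * y
    W = alpha.sum() - 0.5 * ay @ K_hat @ ay              # eq. (44)
    beta = Kz_inv @ Kzx @ ay                             # eq. (45): (K^z)^{-1} K^{zx} Y alpha^z
    return W, beta, alpha
```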
In our implementation, we use the LBFGS algorithm (Liu and Nocedal, 1989) to minimize W(Z), which is an efficient gradient based optimization algorithm.

3.4 The Kernel Function K̂^z and Its Corresponding Feature Space F_z

The kernel function K̂^z plays an important role in our approach. In this section, some analysis of K̂^z is provided, which gives us insights into how to build an SLMC.

It is well known that training an SVM with a nonlinear kernel function K in the input space X is equivalent to building a linear SVM in a feature space F. The map φ(·) from X to F is implicitly introduced by K. In section 3.2, we derived that for a given set of XVs z_1, ..., z_{N_z}, training an SVM with kernel function K is equivalent to building an SVM with another kernel function K̂^z, which is in turn equivalent to constructing a linear SVM in another feature space. Let F_z denote this feature space; then the map from X to F_z is φ_z(·), which is explicitly defined by (41).

According to (41), φ_z(x) = T ψ_z(x). To investigate the role of the matrix T, consider U_z defined by

    U_z = [φ(z_1), ..., φ(z_{N_z})] Tᵀ.

Then

    (U_z)ᵀ U_z = T K^z Tᵀ = I,

where I is the unit matrix, which means that T orthonormalizes the φ(z_i) in the feature space F. Thus the columns of U_z can be regarded as an orthonormal basis of a subspace of F. For any x ∈ X, if we calculate the projection of φ(x) onto this subspace, we have

    (U_z)ᵀ φ(x) = T [φ(z_1), ..., φ(z_{N_z})]ᵀ φ(x) = T [K(z_1, x), ..., K(z_{N_z}, x)]ᵀ = T ψ_z(x) = φ_z(x).

This shows that the subspace spanned by the columns of U_z is identical to F_z. As U_z is obtained by orthonormalizing the φ(z_i), F_z is a subspace of F and it is spanned by φ(z_i), 1 ≤ i ≤ N_z.

Now that, for a given set of XVs z_i, building an SVM with a kernel function K̂^z is equivalent to building a linear SVM in F_z, in order to get good classification performance we have to find a discriminating subspace F_z where the two classes of data are linearly well separated. From this point of view, we can see that our proposed approach essentially finds a subspace F_z where the margin of the training data is maximized.

4. Comparison with Related Approaches

In this section we compare the SLMC algorithm with related approaches.

4.1 Modified RS Method

In the second step of the RS method, after the XVs z_1, ..., z_{N_z} are obtained, the expansion coefficients β are computed by minimizing (4), which leads to (Schölkopf and Smola, 2002)

    β = (K^z)^{-1} (K^{zx}) Y α,        (47)

where K^{zx} and Y are defined as in (45), and α is the solution of building an SVM with kernel function K on the training data set {(x_i, y_i)}_{i=1}^N.

We propose to modify the second step of the RS method as in (45). Clearly (47) and (45) are of the same form. The only difference is that in (47), α is the solution of training an SVM with kernel function K, while in (45), α^z is the solution of training an SVM with the kernel function K̂^z, which takes the XVs z_i into consideration. As the β calculated by (45) maximizes the margin of the resulting classifier, we can expect a better classification performance from this modified RS method. We will see this in the experimental results.

4.2 Comparison with RSVM and a Modified RSVM Algorithm

One might argue that our approach appears to be similar to the RSVM, because the RSVM algorithm also restricts the weight vector of the decision hyperplane to be a linear expansion of N_z XVs.

However, there are two important differences between the RSVM and our approach. The first one (and probably the fundamental one) is that in the RSVM approach, the N_z XVs are randomly selected from the training data in advance, and are not computed by finding a discriminating subspace F_z. The second difference lies in the method for computing the expansion coefficients β. Our method exactly solves the problem (28)–(30) without any simplifications. But in the RSVM approach, certain simplifications are performed, among which the most significant one is changing the first term in the objective function (28) from (1/2) βᵀ K^z β to (1/2) βᵀ β. This step immediately reduces the problem (28)–(30) to a standard linear SVM training problem (Lin and Lin, 2003), where β becomes the weight vector of the decision hyperplane and the training set becomes {(ψ_z(x_i), y_i)}_{i=1}^N.
On the other hand, our method of computing β is to build a linear SVM in the subspace F_z, which amounts to training a linear SVM on the data set {(φ_z(x_i), y_i)}_{i=1}^N.

Now let us compare the two training sets mentioned above, i.e. {(φ_z(x_i), y_i)}_{i=1}^N and {(ψ_z(x_i), y_i)}_{i=1}^N. As derived in section 3.4, φ_z(x_i) is calculated by projecting φ(x_i) onto a set of vectors obtained by orthonormalizing the φ(z_j) (1 ≤ j ≤ N_z), while ψ_z(x_i) is calculated by computing the dot products between φ(x_i) and the φ(z_j) (1 ≤ j ≤ N_z) directly, without the orthonormalization step.

Analogous to the modified RS method, we propose a modified RSVM algorithm: first, N_z training data are randomly selected as XVs, then the expansion coefficients β are computed by (45).

4.3 Comparison with the RVM

The RVM (Tipping, 2001) algorithm and many other sparse learning algorithms, such as sparse greedy algorithms (Nair et al., 2002) or SVMs with l1-norm regularization (Bennett, 1999), result in a classifier whose XVs are a subset of the training data. In contrast, the XVs of the SLMC do not necessarily belong to the training set. This means that the SLMC can in principle locate better discriminating XVs. Consequently, with the same number of XVs, the SLMC can have better classification performance than the RVM and other sparse learning algorithms which select the XVs only from the training data. This can be seen from the experimental results provided in section 5.5.

4.4 SLMC vs Neural Networks

Since the XVs of the SLMC do not necessarily belong to the training set and training an SLMC is a gradient based process,⁷ the SLMC can be thought of as a neural network with weight regularization (Bishop, 1995). However, there are clear differences between the SLMC algorithm and a feedforward neural network. First, analogous to an SVM, the SLMC considers the geometric concept of margin and aims to maximize it. To this end, the regularizer takes into account the kernel matrix K^z. Second, the SLMC minimizes the "hinge loss", which is different from the loss functions adopted by neural networks.⁸ Therefore, both the regularizer and the loss function of the SLMC are different from those of traditional perceptrons.

Furthermore, the SLMC algorithm is just one application of our 'direct sparse' method. It is straightforward to apply this method to build a sparse one-class SVM algorithm (Schölkopf and Smola, 2002), for which there is no obvious neural network counterpart.

On the other hand, analogous to neural networks, we also have an additional regularization via the number N_z determining the number of XVs, which is an advantage in some practical applications where runtime constraints exist and the maximum prediction time is known a priori. Note that the prediction time (the number of kernel evaluations) of a soft margin SVM scales linearly with the number of training patterns (Steinwart, 2003).

7. This is also similar to the recent work of (Snelson and Ghahramani, 2006) on building sparse Gaussian processes, which was done at almost the same time as our previous work (Wu et al., 2005).
8. Note that the shape of the hinge loss is similar to that of the loss function adopted in logistic regression, where the logarithm of the logistic sigmoid function (Bishop, 1995) is involved. Here the logistic sigmoid function refers to y(x) = 1/(1 + e^{−x}). So the shape of the hinge loss is different from that of the loss function used by the perceptron.

5. Experimental Results

We now conduct experiments to investigate the performance of the SLMC algorithm and compare it with other related approaches.

5.1 Approaches to be Compared

The following approaches are compared in the experiments: standard SVM, RS method, modified RS method (MRS, cf. section 4.1), RSVM, modified RSVM (MRSVM, cf. section 4.2), relevance vector machine (RVM), and the proposed SLMC approach.
Note that in our experiments, RS and MRS use exactly the same XVs, but they compute the expansion coefficients by (47) and (45) respectively. Similarly, RSVM and MRSVM also use the same set of XVs; the difference lies in the method for computing the expansion coefficients.

5.2 Data Sets

Seven classification benchmarks are considered: USPS, Banana, Breast Cancer, Titanic, Waveform, German and Image. The last six data sets are provided by Gunnar Rätsch and can be downloaded from http://ida.first.fraunhofer.de/projects/bench. For the USPS data set, 7291 examples are used for training and the remaining 2007 are used for testing. For each of the last six data sets, there are 100 training/test splits and we follow the same scheme as (Tipping, 2001): our results show averages over the first 10 of those.

5.3 Parameter Selection

A Gaussian kernel is used in the experiments:

    K(x, x') = exp(−γ ‖x − x'‖²).        (48)

The parameters for the different approaches are as follows:

Standard SVM: For the USPS data set, we use the same parameters as in (Schölkopf and Smola, 2002): C = 10 and γ = 1/128. For the other data sets, we use the parameters provided by Gunnar Rätsch, which are shown on the same website from which these data sets were downloaded.⁹

RSVM and MRSVM: We perform 5-fold cross validation on the training set to select parameters for the RSVM. MRSVM uses the same parameters as the RSVM.

RS method: The RS method uses the same kernel parameter as the standard SVM, since it aims to simplify the standard SVM solution.

SLMC and MRS: In our experiments, they use exactly the same parameters as the standard SVM on all the data sets.

RVM: The results for the RVM are taken directly from (Tipping, 2001), where 5-fold cross validation was performed for parameter selection.

5.4 Experimental Settings

For each data set, first a standard SVM is trained with the LIBSVM software.¹⁰ (For the USPS, ten SVMs are built, each trained to separate one digit from all others.) Then the other approaches are applied. The ratio N_z/N_SV varies from 5% to 10%.

For the RSVM, we use the implementation contained in the LIBSVM Tools.¹¹

9. See http://ida.first.fraunhofer.de/projects/bench
10. From http://www.csie.ntu.edu.tw/~cjlin/libsvm
11. From http://www.csie.ntu.edu.tw/~cjlin/libsvmtools

For the RS method, there is still no standard or widely accepted implementation, so we try three different ones: a program written by ourselves, the code contained in the machine learning toolbox SPIDER,¹² and the code contained in the statistical pattern recognition toolbox STPRTOOL.¹³ For each data set, we apply these three implementations and select the best one, corresponding to the minimal value of the objective function (4).

12. From http://www.kyb.mpg.de/bs/people/spider
13. From http://cmp.felk.cvut.cz/~xfrancv/stprtool

5.5 Numerical Results

Experimental results are shown in Table 1, where the initial XVs of the SLMC are randomly selected from the training data. In Table 1, N_SV stands for the number of support vectors (SVs) of the standard SVM, and N_z represents the number of XVs of the other sparse learning algorithms.
                 SVM             N_z/N_SV = 5%                     N_z/N_SV = 10%                    RVM
Data Set       N_SV   Error    RS    MRS   RSVM  MRSVM  SLMC     RS    MRS   RSVM  MRSVM  SLMC     N_z/N_SV(%)  Error
USPS           2683   4.3      4.9   4.9   11.6  11.5   4.9      4.7   4.8   8.2   8.0    4.7      11.8         5.1
Banana         86.7   11.8     39.4  27.6  29.9  28.1   16.5     21.9  17.5  17.5  16.9   11.0     13.2         10.8
Breast Cancer  112.8  28.6     28.8  28.8  29.5  29.4   27.9     27.9  29.0  31.0  30.3   27.9     5.6          29.9
Titanic        70.6   22.1     37.4  23.9  24.5  24.8   26.4     26.6  22.6  22.9  23.9   22.4     92.5         23.0
Waveform       158.9  9.9      9.9   10.0  15.1  14.7   9.9      10.0  9.9   11.6  11.8   9.9      9.2          10.9
German         408.2  22.5     22.9  22.5  23.6  23.9   22.3     22.9  22.6  24.5  23.7   22.9     3.1          22.2
Image          172.1  2.8      37.6  19.4  23.6  20.7   5.2      18.3  6.9   14.2  12.7   3.6      20.1         3.9

Table 1: Results on seven classification benchmarks. The test error rates (%) of each algorithm are presented. The N_SV for the last six data sets are the averages over 10 training/test splits. The best result in each group is shown in boldface in the original. The number of XVs of the RVM is not chosen a priori, but comes out as a result of training. So for the RVM, the ratio N_z/N_SV is given in order to compare it with the other algorithms. For each data set, the result of the RVM is shown in boldface if it is the best compared to the other sparse learning algorithms.

From Table 1, it can be seen that the classification accuracy of the SLMC is comparable with the full SVM when N_z/N_SV = 0.1.

Table 1 also illustrates that the SLMC outperforms the other sparse learning algorithms in most cases. The SLMC also usually improves the classification results of the RS method; in some cases the improvement is large, such as on the Banana and Image data sets.

When comparing MRS with RS, and MRSVM with RSVM, the results in Table 1 demonstrate that in most cases MRS beats RS and, similarly, MRSVM usually outperforms RSVM a little. This means that for a given set of XVs, computing the expansion coefficients according to (45) is a good choice.

5.6 Some Implementation Details

In Table 1, we report the results obtained by random initialization. The K-means algorithm has also been tried to choose the initial XVs, and the resulting classification results are similar. To illustrate this quantitatively, Table 2 presents the results obtained by using the K-means algorithm for initialization.

               N_z/N_SV = 5%          N_z/N_SV = 10%
Data Set       random    k-means      random    k-means
USPS           4.9       4.9          4.7       4.6
Banana         16.5      16.2         11.0      10.9
Breast Cancer  27.9      26.8         27.9      27.3
Titanic        26.4      24.4         22.4      23.2
Waveform       9.9       9.9          9.9       9.9
German         22.3      22.7         22.9      22.7
Image          5.2       6.0          3.6       3.8

Table 2: Results of the SLMC algorithm, obtained by random initialization and k-means initialization.

In our proposed approach, we need to compute the inverse of K^z (see for example (45)). Theoretically, if the Gaussian kernel is used and the z_i, 1 ≤ i ≤ N_z, are different from each other, then K^z should be full rank, and its inverse can be computed without any problems. However, in experiments we do observe cases where K^z is ill conditioned. We found that the reason for this is that there are some duplicated data points in the data sets, which are accidentally selected as the initial XVs. For example, in the first training set of the Image data set, the first and the 521st data points are exactly the same. So in our experiments we remove the duplicated points as a preprocessing step.

5.7 Some Results on XVs

It is known that the XVs of a standard SVM are support vectors, which lie near the classification boundary. Here we give two examples to illustrate what the XVs of the SLMC look like.

Example 1. Building an SVM involves solving problem (20)–(22), while building an SLMC means solving the same problem plus one more constraint (23). If we want to build an SLMC with the same number of XVs as a standard SVM, namely N_z = N_SV, then the optimal solution of problem (20)–(22) is also a global optimal solution of problem (20)–(23), since it satisfies all the constraints. So in this special case, the support vectors of the standard SVM are also an optimal choice of XVs for the SLMC.
Example 2. On the USPS data set, we built an SVM on the training data with γ = 1/128, C = 10 to separate digit '3' from digit '8'. The resulting SVM has 116 SVs and a test error rate of 1.8%. Then we built an SLMC with the same γ and C, while N_z = 12 (i.e. about 10% of the number of SVs). The resulting SLMC also has a test error rate of 1.8%. As shown in Figure 1, the images of the 12 XVs produced by the SLMC approach look like digits.

Figure 1: Images of XVs for separating '3' and '8'.

5.8 Training Time of SLMC

Building an SLMC is a gradient based process, where each iteration consists of computing φ_z(x_i), 1 ≤ i ≤ N, training a linear SVM over {(φ_z(x_i), y_i)}_{i=1}^N,¹⁴ and then computing the gradient ∇W(Z).

Let T_SVM(N, d) denote the time complexity of training a linear SVM over a data set containing N d-dimensional vectors; then the time complexity of training an SLMC is

    O(n × (N N_z d + T_SVM(N, N_z) + N_z³ + N_z² d)),

where n is the number of iterations of the SLMC training process. In experiments, we find that the SLMC algorithm requires 20–200 iterations to converge.

14. Equivalently, we can build an SVM with the kernel function K̂^z over {(x_i, y_i)}_{i=1}^N. But this is much slower because it is time consuming to compute K̂^z(·,·) defined by (40).

We cannot directly compare the training time of the SLMC with the RS methods and the RVM, because we used C++ to implement our approach, while the publicly available code of the RS methods and the RVM is written in Matlab. Using these implementations and a personal computer with a Pentium 4 CPU at 3 GHz, one gets the following numbers: on the USPS data set, the SLMC takes 6.9 hours to train, while the RS method takes 2.3 hours. On the Banana data set, SLMC training takes about 1.5 seconds, and RVM training takes about 5 seconds.

Training an SLMC is time consuming on large data sets. However, in practice, once the training is finished, the trained KM will be put into use, processing large amounts of test data. For applications where processing speed is important, such as real time computer vision, sparse KMs can be of great value. This is the reason why several SKLAs have been developed although they are more expensive than the standard SVM algorithm.

Furthermore, some kernel learning algorithms are not sparse at all, such as kernel ridge regression, KFD and KPCA, which means that all the training data need to be saved as XVs in the resulting KMs trained by these algorithms. Hence building sparse versions of these algorithms can not only accelerate the evaluation of the test data, but also dramatically reduce the space needed for storing the trained KM. An example of this will be given in appendix A.

6. Conclusions

We presented a direct method to build sparse kernel learning algorithms. There are mainly two advantages of this method. First, it can be applied to sparsify many current kernel learning algorithms. Second, it simultaneously considers the sparseness and the performance of the resulting KM. Based on this method, we proposed an approach to build sparse large margin classifiers, which essentially finds a discriminating subspace F_z of the feature space F. Experimental results indicate that this approach often exhibits better classification accuracy, at comparable sparsity, than the sparse learning algorithms to which we compared it.

A by-product of this paper is a method for calculating the expansion coefficients of SVMs for given XVs. Based on this method, we proposed modified versions of the RS method and the RSVM. Experimental results show that these two modified algorithms can improve the classification accuracy of their counterparts. One could also try this method on other algorithms such as the RVM.

Possible future work may include applying the proposed method to other kernel learning algorithms and running the direct method greedily to find the XVs one after another in order to accelerate the training procedure.

Acknowledgments

We would like to thank the anonymous reviewers for their comments, which significantly improved the quality of the manuscript.
Appendix A. Sparse KFD

We have derived the SLMC algorithm as an example of our direct sparse learning approach. Here we show another example: building a sparse KFD (Mika et al., 2003).

In the KFD algorithm, we need to consider the following Rayleigh quotient maximization problem (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004):¹⁵

    max_{w ∈ F}  (wᵀ A w) / (wᵀ B w + μ ‖w‖²).        (49)

In (49), A and B are respectively the between-class and within-class variances of the training data in the feature space F, while μ is a regularization parameter. It can be seen that both A and B are positive definite.

15. As mentioned before, convex formulations have been proposed for the KFD (Mika et al., 2003; Suykens et al., 2002). Here we only consider the traditional non-convex formulation, whose solution can be easily obtained via eigendecomposition. This also illustrates that our method can be applied to non-convex optimization problems in some special cases.

The following is an equivalent form of (49):

    min_{w ∈ F}  −wᵀ A w,        (50)
    subject to   wᵀ B̂ w = 1,        (51)

where B̂ = B + μI, implying that B̂ is strictly positive definite.

Following our proposed approach, in order to build a sparse KFD (SKFD), we have to solve the following problem:

    min_{w ∈ F, β ∈ R^{N_z}, Z ∈ R^{d×N_z}}  −wᵀ A w,        (52)
    subject to                               wᵀ B̂ w = 1,        (53)
                                             w = Σ_{i=1}^{N_z} φ(z_i) β_i.        (54)

Substituting (54) into (52) and (53), we have

    min_{β ∈ R^{N_z}, Z ∈ R^{d×N_z}}  −βᵀ A^z β,        (55)
    subject to                        βᵀ B^z β = 1,        (56)

where A^z = (φ(Z))ᵀ A (φ(Z)) and B^z = (φ(Z))ᵀ B̂ (φ(Z)), where with a slight abuse of notation we use φ(Z) to denote the matrix [φ(z_1), ..., φ(z_{N_z})]. Note that in the problem (55)–(56), A^z ∈ R^{N_z×N_z} is positive definite, while B^z ∈ R^{N_z×N_z} is strictly positive definite.

As before, we define the marginal function W(Z) as the minimal value of the following optimization problem:

    min_{β ∈ R^{N_z}}  −βᵀ A^z β,        (57)
    subject to         βᵀ B^z β = 1.        (58)

Note that the above problem is the same as the problem (55)–(56) except that in the above problem Z is fixed rather than variable.

Now we need to consider how to compute W(Z) and ∇W(Z). To compute the function value W(Z), we need to solve the problem (57)–(58). By the Lagrange multiplier method, this can be done by solving the following unconstrained optimization problem:

    min_{β ∈ R^{N_z}, λ ∈ R}  J(β, λ) = −βᵀ A^z β + λ(βᵀ B^z β − 1),        (59)

with Lagrange multiplier λ ∈ R.

The derivatives of J(β, λ) with respect to β and λ must vanish, leading to

    A^z β = λ B^z β,        (60)
    βᵀ B^z β = 1.        (61)

Equation (60) shows that β should be an eigenvector of the matrix (B^z)^{-1} A^z and λ should be the corresponding eigenvalue. Left multiplying both sides of equation (60) by βᵀ and using (61), we have

    βᵀ A^z β = λ.        (62)

Since −βᵀ A^z β is the objective function (57) we are minimizing, we can see that its minimal value equals the negative of the largest eigenvalue of (B^z)^{-1} A^z. Therefore the function value W(Z) is obtained.

Let β̄ and λ̄ denote the optimal solution and the corresponding Lagrange multiplier of problem (57)–(58). As derived above, λ̄ is the largest eigenvalue of (B^z)^{-1} A^z and β̄ is the corresponding eigenvector, multiplied by a constant such that equation (61) is satisfied.

As mentioned above, B^z is strictly positive definite. Here we assume that there is a unique eigenvector corresponding to λ̄. As equation (58) is the only constraint of problem (57)–(58), the optimal solution β̄ is Mangasarian-Fromovitz regular. And it is straightforward to verify that the set of feasible solutions S(Z) is uniformly compact if B^z is strictly positive definite. Therefore, according to lemma 1, the derivative of W(Z) with respect to z_{uv}, which denotes the v-th component of the vector z_u, 1 ≤ u ≤ N_z, 1 ≤ v ≤ d, exists and can be computed as follows:

    ∂W(Z)/∂z_{uv} = −β̄ᵀ (∂A^z/∂z_{uv}) β̄ + λ̄ β̄ᵀ (∂B^z/∂z_{uv}) β̄.        (63)

Now that both the function value W(Z) and the gradient ∇W(Z) can be computed, the (local) optimum of the problem (52)–(54) can be computed by the gradient based algorithm.
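For a fixed Z, the inner problem (57)–(58) thus reduces to an eigenproblem, as derived above. The following is a minimal sketch of that step; it assumes the matrices A^z and B^z have already been formed from the kernel expansions, and uses scipy's generalized symmetric eigensolver. The function name and the explicit rescaling are our own additions.

```python
import numpy as np
from scipy.linalg import eigh

def skfd_W(Az, Bz):
    """W(Z) for sparse KFD: minus the largest eigenvalue of (B^z)^{-1} A^z,
    obtained from the generalized eigenproblem A^z b = lambda B^z b (eqs. 60-62).
    Also returns beta scaled so that beta^T B^z beta = 1 (eq. 61)."""
    evals, evecs = eigh(Az, Bz)                 # eigenvalues in ascending order
    lam, beta = evals[-1], evecs[:, -1]
    beta = beta / np.sqrt(beta @ Bz @ beta)     # enforce the constraint (61)
    return -lam, beta
```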
After obtaining the XVs z_i by solving the problem (52)–(54), we can take the solution β_i of this problem as the expansion coefficients. However, we cannot get the bias b of the resulting KM in this way (cf. equation (1)). As mentioned before, having the z_i, we can apply our proposed method: the expansion coefficients and the bias can be calculated by solving the problem (37)–(39).

Experimental results on six classification benchmarks for the proposed SKFD algorithm are provided in Table 3.

               SVM             KFD             N_z/N_SV = 10%
Data Set       N_SV   Error    N_z    Error    RS    MRS   RSVM  MRSVM  SKFD
Banana         86.7   11.8     400    10.8     21.9  17.5  17.5  16.9   10.8
Breast Cancer  112.8  28.6     200    25.8     27.9  29.0  31.0  30.3   25.5
Titanic        70.6   22.1     150    23.2     26.6  22.6  22.9  23.9   23.5
Waveform       158.9  9.9      400    9.9      10.0  9.9   11.6  11.8   9.9
German         408.2  22.5     700    23.7     22.9  22.6  24.5  23.7   23.5
Image          172.1  2.8      1300   3.3      18.3  6.9   14.2  12.7   4.0

Table 3: Results on six classification benchmarks. The SKFD is initialized with the k-means algorithm. The best results among the sparse learning algorithms are in boldface in the original. As before, the results reported here are the averages over the first 10 training/test splits, except for the KFD algorithm, whose results are taken directly from (Schölkopf and Smola, 2002) and are the averages over all 100 training/test splits. For the KFD algorithm, the number of expansion vectors N_z is the same as the number of training data, since the KFD is not sparse at all.

The results presented in Table 3 validate the effectiveness of the proposed SKFD algorithm. Furthermore, the original KFD algorithm is not sparse at all, i.e. all the training data need to be stored as XVs. Therefore the proposed SKFD algorithm not only accelerates the test phase, but also significantly reduces the space needed for storing the resulting KM. For example, the Banana data set contains 400 training data, implying that on average only 86.7 × 10% ≈ 8.7 XVs need to be stored in the resulting KM trained by the SKFD, saving about 1 − 8.7/400 ≈ 97.8% storage compared with the original KFD algorithm.

Appendix B. Formal Definitions of the Two Conditions in Lemma 1

For each Z ∈ R^{d×N_z}, let S(Z) denote the feasible set of problem (16)–(18):

    S(Z) = {x | g_i(x, Z) ≤ 0, 1 ≤ i ≤ N_g} ∩ {x | h_j(x, Z) = 0, 1 ≤ j ≤ N_h}.

Definition 3 (Gauvin and Dubeau, 1982) The feasible set S(Z) of problem (16)–(18) is uniformly compact at Z̄ if there is a neighborhood N(Z̄) of Z̄ such that the closure of ∪_{Z ∈ N(Z̄)} S(Z) is compact.

Definition 4 (Mangasarian, 1969) For any Z, a feasible point x̄ ∈ S(Z) of problem (16)–(18) is said to be Mangasarian-Fromovitz regular if it satisfies the following Mangasarian-Fromovitz regularity condition:

1. There exists a vector v such that

    vᵀ ∇_x g_i(x, Z)|_{x=x̄} < 0,    i ∈ {i | g_i(x̄, Z) = 0},        (64)
    vᵀ ∇_x h_j(x, Z)|_{x=x̄} = 0,    1 ≤ j ≤ N_h,        (65)

2. The gradients {∇_x h_j(x, Z)|_{x=x̄}, 1 ≤ j ≤ N_h} are linearly independent,

where ∇_x denotes the gradient with respect to the variables x.

Appendix C. Proof of Corollary 2

Proof The proof consists of verifying all the conditions in lemma 1.

First, for the soft margin SVM training problem (37)–(39), it is known that if the kernel function K̂^z(·,·) is strictly positive definite, then the problem has a unique solution and the corresponding Lagrange multipliers are also unique.

Second, it can be seen that at any Z ∈ R^{d×N_z}, the set of feasible solutions S(Z) is compact and does not depend on Z; therefore S(Z) is uniformly compact at any Z ∈ R^{d×N_z}.
Third, obviously the second part of the Mangasarian-Fromovitz regularity condition holds since there is only one equality constraint. Now we prove that the first part of the Mangasarian-Fromovitz regularity condition also holds, by constructing a vector v = [v_1, ..., v_N]ᵀ ∈ R^N that satisfies both (64) and (65). For ease of description, we partition the index set {1, ..., N} into the following three subsets according to the α^z_i (1 ≤ i ≤ N), which are the solution of problem (37)–(39): S_1 = {i | α^z_i = 0}, S_2 = {i | α^z_i = C} and S_3 = {i | 0 < α^z_i < C}.

First, to make (64) hold, we can simply assign an arbitrary positive value to v_i if i ∈ S_1, and an arbitrary negative value to v_j if j ∈ S_2. To make (65) true, we distinguish two cases.

Case 1, |S_3| > 0. In this case, the vector v_{S_3} can be easily computed as

    v_{S_3} = −(y_{S_3} / ‖y_{S_3}‖²) Σ_{i ∈ S_1 ∪ S_2} v_i y_i,

where y = [y_1, ..., y_N]ᵀ. The above equation results in vᵀ y = 0, which is the same as (65).

Case 2, |S_3| = 0. In this case, all the resulting support vectors correspond to α^z_i for i ∈ S_2, which come from both classes according to the assumptions in corollary 2. Therefore the left side of equation (65) equals

    vᵀ y = Σ_{i ∈ S_1} v_i y_i + Σ_{i ∈ S_2} v_i y_i = Σ_{i ∈ S_1} v_i y_i + Σ_{i ∈ S_2, y_i > 0} v_i − Σ_{i ∈ S_2, y_i < 0} v_i.        (66)

Recall that we construct v such that v_i < 0 for i ∈ S_2. So if vᵀ y = 0, equation (65) already holds. If vᵀ y > 0, we can always decrease the values (or equivalently, increase the absolute values) of the v_i in the second term of equation (66) to make vᵀ y = 0, while at the same time keeping v_i < 0 for i ∈ S_2 so that (64) still holds. Similarly, if vᵀ y < 0, we can decrease the values of the v_i in the third term of equation (66).

Thus the first part of the Mangasarian-Fromovitz regularity condition also holds. Hence the optimal solution of problem (37)–(39) is Mangasarian-Fromovitz regular.

Therefore all the conditions in lemma 1 are satisfied, and (46) follows from (19), since the constraints of problem (37)–(39) do not depend on Z, which means both the second and the third terms in (19) are 0.

References

K. P. Bennett. Combining support vector and mathematical programming methods for classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods, pages 307–326. The MIT Press, Cambridge, MA, 1999.

C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK, 1995.

C. J. C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proc. 13th International Conference on Machine Learning, pages 71–77. Morgan Kaufmann, 1996.

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131–159, 2002.

J. Gauvin and F. Dubeau. Differential properties of the marginal function in mathematical programming. Mathematical Programming Study, 19:101–119, 1982.

G. R. G. Lanckriet, L. E. Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555–582, 2002.

Y. Lee and O. L. Mangasarian. RSVM: reduced support vector machines. In CD Proceedings of the First SIAM International Conference on Data Mining, Chicago, 2001.

K. Lin and C. Lin. A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14:1449–1459, 2003.

D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(3, Ser. B):503–528, 1989.

O. L. Mangasarian. Nonlinear Programming. McGraw-Hill, New York, 1969.
S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. J. Smola, and K.-R. Müller. Constructing descriptive and discriminative non-linear features: Rayleigh coefficients in kernel feature spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):623–628, 2003.

P. B. Nair, A. Choudhury, and A. J. Keane. Some greedy learning algorithms for sparse regression and classification with Mercer kernels. Journal of Machine Learning Research, 3:781–801, 2002.

B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.

E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1259–1266. MIT Press, Cambridge, MA, 2006.

I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071–1105, 2003.

J. A. K. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.

M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

M. Wu, B. Schölkopf, and G. Bakır. Building sparse large margin classifiers. In L. De Raedt and S. Wrobel, editors, Proc. 22nd International Conference on Machine Learning, pages 1001–1008. ACM, 2005.