Journal of Machine Learning Research 7 (2006) 603–624    Submitted 10/05; Published 4/06

A Direct Method for Building Sparse Kernel Learning Algorithms

Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany

Editor: Nello Cristianini

mingrui.wu@tuebingen.mpg.de

bernhard.schoelkopf@tuebingen.mpg.de

goekhan.bakir@tuebingen.mpg.de

Abstract

Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach.

Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant

1. Introduction

Many kernel learning algorithms (KLA) have been proposed for solving different kinds of problems. For example, the support vector machine (SVM) (Vapnik, 1995) has been widely used for classification and regression, the minimax probability machine (MPM) (Lanckriet et al., 2002) is another competitive algorithm for classification, the one-class SVM (Schölkopf and Smola, 2002) is a useful tool for novelty detection, while the kernel Fisher discriminant (KFD) (Mika et al., 2003) and kernel PCA (KPCA) (Schölkopf and Smola, 2002) are powerful algorithms for feature extraction.

Many kernel learning algorithms result in a kernel machine (KM) (such as a kernel classifier), whose output can be calculated as

τ(x) = Σ_{i=1}^{N_XV} α̂_i K(x̂_i, x) + b,    (1)

where x ∈ X ⊂ R^d is the input data, X is the input space, x̂_i ∈ X, 1 ≤ i ≤ N_XV, are called expansion vectors (XVs) in this paper,¹ N_XV is the number of XVs, α̂_i ∈ R is the expansion coefficient associated with x̂_i, b ∈ R is the bias and K: X × X → R is a kernel function.

Usually K is a positive definite kernel (Schölkopf and Smola, 2002), which implicitly introduces a feature space F. Let φ(·) denote the map from X to F, then

K(x, x′) = ⟨φ(x), φ(x′)⟩,    ∀x, x′ ∈ X.

Hence (1) can also be written as a linear function

τ(x) = ⟨w, φ(x)⟩ + b,    (2)

where

w = Σ_{i=1}^{N_XV} α̂_i φ(x̂_i)    (3)

is the weight vector of the KM and it equals the linear expansion of XVs in the feature space F.
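As a concrete illustration of (1)–(3), the following minimal sketch (in Python/NumPy; it is not from the paper, and the Gaussian kernel and all names are our own choices) evaluates the output τ(x) of a kernel machine from its expansion vectors, expansion coefficients and bias. The cost is proportional to N_XV, which is what motivates sparse KMs.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2), a positive definite kernel."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def km_output(x, XV, alpha, b, gamma=1.0):
    """Evaluate tau(x) = sum_i alpha_i * K(xv_i, x) + b, cf. (1).

    XV    : (N_XV, d) array of expansion vectors
    alpha : (N_XV,) expansion coefficients
    b     : bias
    The cost is proportional to N_XV, hence the interest in sparse KMs.
    """
    return sum(a * gaussian_kernel(xv, x, gamma) for a, xv in zip(alpha, XV)) + b

# toy usage: a KM with three expansion vectors in R^2
XV = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
alpha = np.array([0.7, -0.3, 0.5])
print(km_output(np.array([0.2, 0.1]), XV, alpha, b=-0.1, gamma=0.5))
```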

For all the KLAs mentioned above, the vector w is obtained by solving a convex optimization problem. For example, the SVM can be formulated as a quadratic programming problem (Vapnik, 1995); in (Mika et al., 2003), a convex formulation is proposed for KFD; and a least squares SVM formulation is established for KPCA in (Suykens et al., 2002).

From the above description, we can see that although many KLAs are proposed for solving different kinds of problems and have various formulations, there are three widely known common points among them. First, each of them results in a KM whose key component is a weight vector w, which can be expressed as the linear expansion of XVs in the feature space F. Second, the vector w is obtained by solving a convex optimization problem. Third, the output of the resulting KM is calculated as (1) (or (2)).

When solving practical problems, we want the time for computing the output to be as short as possible. For example in real-time image recognition, in addition to good classification accuracy, high classification speed is also desirable. The time of calculating (1) (or (2)) is proportional to N_XV. Thus several sparse learning algorithms have been proposed to build KMs with small N_XV.

The reduced set (RS) method (Burges, 1996; Schölkopf and Smola, 2002) was proposed to simplify (1) by determining N_z vectors z_1, ..., z_{N_z} and corresponding expansion coefficients β_1, ..., β_{N_z} such that

‖w − Σ_{j=1}^{N_z} β_j φ(z_j)‖²    (4)

is minimized. RS methods approximate and replace w in (2) by Σ_{j=1}^{N_z} β_j φ(z_j), where N_z < N_XV. Note that an RS method needs the weight vector w of the KM it aims to simplify, and in order to apply RS methods, we need to build another KM in advance.

1. The x̂_i, 1 ≤ i ≤ N_XV have different names in different kernel learning algorithms. For example, they are called support vectors in the SVM, and relevance vectors in the relevance vector machine (Tipping, 2001). In this paper we uniformly call them expansion vectors for the sake of simplicity.
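For reference, the RS objective (4) can be evaluated purely through kernel values: with w = Σ_i α_i φ(x_i), one has ‖w − Σ_j β_j φ(z_j)‖² = α⊤K^{xx}α − 2β⊤K^{zx}α + β⊤K^{zz}β. The sketch below is our own illustration of this identity (a Gaussian kernel and all names are assumptions, not taken from the paper).

```python
import numpy as np

def kernel_matrix(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def rs_objective(X, alpha, Z, beta, gamma=1.0):
    """||sum_i alpha_i phi(x_i) - sum_j beta_j phi(z_j)||^2, cf. (4)."""
    Kxx = kernel_matrix(X, X, gamma)
    Kzx = kernel_matrix(Z, X, gamma)
    Kzz = kernel_matrix(Z, Z, gamma)
    return alpha @ Kxx @ alpha - 2 * beta @ Kzx @ alpha + beta @ Kzz @ beta
```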

In (Lee and Mangasarian, 2001), the reduced support vector machine (RSVM) algorithm is proposed, which randomly selects N_z vectors from the training set as XVs, and then computes the expansion coefficients. This algorithm can be applied to build sparse kernel classifiers. But as the XVs are chosen randomly, and may not be good representatives of the training data, good classification performance cannot be guaranteed when N_z is small (Lin and Lin, 2003).

The relevance vector machine (RVM) (Tipping, 2001) is another algorithm which leads to sparse KMs. The basic idea of the RVM is to assume a prior of the expansion coefficients which favors sparse solutions.

In this paper, based on the common points of KLAs mentioned before, we present a direct method to build sparse kernel learning algorithms (SKLA). In particular, given a KLA, we modify it by adding one more constraint to its corresponding convex optimization problem. The added constraint explicitly controls the sparseness of the resulting KM, and a gradient based approach is proposed to solve the modified optimization problem. We will also show that applying this method to the SVM will result in a specific algorithm for building sparse large margin classifiers (SLMC).²

The remainder of this paper is organized as follows. In section 2, we describe a direct method for building SKLAs. After this, we will focus on a particular application of this method to the SVM algorithm, leading to a detailed algorithm for building SLMC. The SLMC algorithm is presented in section 3, where we will also point out that it actually finds a discriminating subspace of the feature space F. Some comparisons with related approaches are given in section 4. Experimental results are provided in section 5 and we conclude the paper in the last section.³

2. A Direct Method of Building SKLAs

In this section, we propose a direct method for building SKLAs.

2.1 Basic Idea

As mentioned before, many KLAs can be formulated as an optimization problem, which can be written in a general form as follows:

min_{w,b,ξ}  f(w, b, ξ),    (5)
subject to  g_i(w, b, ξ) ≤ 0,  1 ≤ i ≤ N_g,    (6)
            h_j(w, b, ξ) = 0,  1 ≤ j ≤ N_h,    (7)

where f(w, b, ξ) is the objective function to be minimized, w ∈ F and b ∈ R are respectively the weight vector and the bias of the KM to be built, ξ = [ξ_1, ..., ξ_{N_ξ}]⊤ ∈ R^{N_ξ} is a vector of some auxiliary variables (such as the slack variables in the soft margin training problem), N_ξ is the number of auxiliary variables, N_g is the number of inequality constraints specified by g_i(w, b, ξ), while N_h is the number of equality constraints specified by h_j(w, b, ξ).

2. Here we don't use the phrase "sparse SVM" because the XVs of the resulting classifier are not necessarily support vectors, i.e. they may not lie near the classification boundary.
3. This paper is an extension of our previous work (Wu et al., 2005).

Our objective is as follows: given a KLA and a positive integer N_z, we want to modify the given KLA such that the number of XVs of the resulting KM equals N_z, while at the same time the performance of the KM should be kept as well as possible. To achieve this, we propose to solve the following problem:

min_{w,b,ξ,β,Z}  f(w, b, ξ),    (8)
subject to  g_i(w, b, ξ) ≤ 0,  1 ≤ i ≤ N_g,    (9)
            h_j(w, b, ξ) = 0,  1 ≤ j ≤ N_h,    (10)
            w = Σ_{i=1}^{N_z} φ(z_i) β_i,    (11)

where (8)–(10) are exactly the same as (5)–(7), while Z = [z_1, ..., z_{N_z}] ∈ R^{d×N_z} is the matrix of XVs and β = [β_1, ..., β_{N_z}]⊤ ∈ R^{N_z} is the vector of expansion coefficients.

It can be seen that the above problem is the problem (5)–(7) with one added constraint (11), saying that the weight vector of the resulting KM equals the expansion of the φ(z_i), 1 ≤ i ≤ N_z. Note that the z_i are also variables, so they need to be computed when solving the optimization problem.

Due to the constraint (11), solving the problem (8)–(11) will naturally lead to a sparse KM. Furthermore, since the objective function of the problem (8)–(11) is exactly the same as that of the original problem (5)–(7), in principle the performance of the resulting KM can be kept as well as possible.

Because of the non-convex constraint (11), it is difficult to obtain the global optimum of the above problem, thus we propose a gradient based approach. However, our gradient based minimization will be performed only on the expansion vectors Z but not on all the variables. To this end, we define the following marginal function W(Z), which is obtained by keeping the expansion vectors in problem (8)–(11) fixed, i.e.:

W(Z) := min_{w,b,ξ,β}  f(w, b, ξ),    (12)
subject to  g_i(w, b, ξ) ≤ 0,  1 ≤ i ≤ N_g,    (13)
            h_j(w, b, ξ) = 0,  1 ≤ j ≤ N_h,    (14)
       and  w = Σ_{i=1}^{N_z} φ(z_i) β_i.    (15)

The above problem is the same as problem (8)–(11) except that Z is not variable but fixed.

Clearly any global (or local) minimum of W(Z) is also a global (or local) minimum of problem (8)–(11), which means that the minima of the original problem (8)–(11) can be found by computing the minima of the function W(Z). Here we propose to minimize W(Z) by a gradient based algorithm. To this end, at any given Z ∈ R^{d×N_z}, we need to calculate both the function value W(Z) and the gradient ∇W(Z). These two problems are discussed in the next subsection.

2.2 Computing W(Z) and its Gradient ∇W(Z)

To compute the function value of W(Z) at any given Z, we need to solve the optimization problem (12)–(15). Obviously this is a problem dependent task. However, as mentioned before, the original optimization problem (5)–(7) of many current KLAs is convex, and (15) is just a linear constraint once Z is fixed. Therefore the problem (12)–(15) is still convex, which means its global optimum, and thus the function value of W(Z), can be readily computed.

Next we turn to consider how to compute ∇W(Z), which requires more care since in general W(Z) is not necessarily differentiable. According to the constraint (11), the weight vector w can be completely determined by β and Z, so the functions f, g_i and h_j can be regarded as functions of b, ξ, β and Z. Without causing confusion, we write them as f(x, Z), g_i(x, Z) and h_j(x, Z) in the following, where x := [b, ξ⊤, β⊤]⊤.

Substituting (15) into (12)–(14), we have

min_x  f(x, Z),    (16)
subject to  g_i(x, Z) ≤ 0,  1 ≤ i ≤ N_g,    (17)
            h_j(x, Z) = 0,  1 ≤ j ≤ N_h,    (18)

and W(Z) is the optimal value of the above optimization problem.

To compute ∇W(Z), we can apply the following lemma, which gives both the conditions under which W(Z) is differentiable and an explicit form of the derivative:

Lemma 1 (Gauvin and Dubeau, 1982): Assume that all the functions f, g_i and h_j in problem (16)–(18) are continuously differentiable and suppose that problem (16)–(18) has a unique optimal solution x̄ at Z = Z̄. Furthermore let ᾱ_i and β̄_j be the unique corresponding Lagrange multipliers associated with g_i and h_j respectively, 1 ≤ i ≤ N_g, 1 ≤ j ≤ N_h. Assume further that the feasible set of problem (16)–(18) is uniformly compact at Z̄ and that the optimal solution x̄ is Mangasarian-Fromovitz regular; then the gradient of W(Z) exists at Z̄ and equals

∇W(Z)|_{Z=Z̄} = ∇_Z f(x̄, Z)|_{Z=Z̄} + Σ_{i=1}^{N_g} ᾱ_i ∇_Z g_i(x̄, Z)|_{Z=Z̄} + Σ_{j=1}^{N_h} β̄_j ∇_Z h_j(x̄, Z)|_{Z=Z̄},    (19)

where ∇_Z f(x̄, Z)|_{Z=Z̄} denotes the gradient of f(x, Z) with respect to Z at Z = Z̄ while fixing x at x̄.

As can be seen, in addition to the uniqueness of the optimal solution and its corresponding Lagrange multipliers, lemma 1 also requires the feasible set to be uniformly compact and the optimal solution to be Mangasarian-Fromovitz regular. The formal definitions of the last two conditions are given in appendix B. Since the properties of the feasible set and the optimal solutions depend on the specific KLA, we will discuss all these conditions for some particular KLAs in subsection 3.3 and appendix A. We will see that the conditions of lemma 1 are very mild for many current KLAs.

3. Building an SLMC

Having described our direct method for building SKLAs, we will focus on a concrete application of this method in the rest of this paper. In particular, we will apply this method to SVMs in order to obtain an algorithm for building SLMCs. We will also analyze its properties and compare it with other related approaches. In this section, we will present the SLMC algorithm by closely following the discussions of section 2.

3.1 Objective

Now we begin to consider the binary classification problem, where we are given a set of training data {(x_i, y_i)}_{i=1}^N, where x_i ∈ X is the input data and y_i ∈ {−1, 1} is the class label.

Our objective is as follows: given a training data set and a positive integer N_z, we want to build a classifier such that the number of XVs of the classifier equals N_z and the margin of the classifier is as large as possible. This way we build a large margin classifier whose sparseness is explicitly controlled.

Based on the direct method described in the last section, we need to solve the problem (8)–(11) to achieve this goal. For the moment, (8)–(10) become the SVM training problem and the constraint (11) controls the sparseness of the resulting classifier. So we should solve the following problem

min_{w,ξ,b,β,Z}  (1/2) w⊤w + C Σ_{i=1}^N ξ_i,    (20)
subject to  y_i(w⊤φ(x_i) + b) ≥ 1 − ξ_i,  ∀i,    (21)
            ξ_i ≥ 0,  ∀i,    (22)
            w = Σ_{i=1}^{N_z} φ(z_i) β_i,    (23)

where C is a positive constant, w ∈ F is the weight vector of the decision hyperplane in feature space, b ∈ R is the bias of the classifier, ξ = [ξ_1, ..., ξ_N]⊤ ∈ R^N is the vector of slack variables, Z = [z_1, ..., z_{N_z}] ∈ R^{d×N_z} is the matrix of XVs and β = [β_1, ..., β_{N_z}]⊤ ∈ R^{N_z} is the vector of expansion coefficients.

Following our proposed method, to solve the above problem we turn to minimize the marginal function W(Z) defined in problem (12)–(15). For the current problem, the value of W(Z) is the minimum of the following optimization problem,

min_{w,ξ,b,β}  (1/2) w⊤w + C Σ_{i=1}^N ξ_i,    (24)
subject to  y_i(w⊤φ(x_i) + b) ≥ 1 − ξ_i,  ∀i,    (25)
            ξ_i ≥ 0,  ∀i,    (26)
            w = Σ_{i=1}^{N_z} φ(z_i) β_i.    (27)

The above problem is the same as problem (20)–(23) except that Z is not variable but fixed. Following the discussion in section 2.1, the (local) minimum of the original problem (20)–(23) can be found by computing the (local) minimum of the function W(Z), and we will use the gradient based algorithm to do this. In the following two subsections we will discuss how to compute the function value W(Z) and the gradient ∇W(Z) respectively.

3.2 Computing W(Z) and β

To compute the function value of W(Z) at any given Z, we need to solve the convex optimization problem (24)–(27), which is actually a problem of building an SVM with given XVs z_1, ..., z_{N_z}. This problem has already been considered in the RSVM algorithm (Lee and Mangasarian, 2001; Lin and Lin, 2003). But in the RSVM algorithm, only an approximation of the problem (24)–(27) is solved. Here we will propose a different method which exactly solves this problem. (See section 4.2 for a discussion and section 5.5 for a comparison of the experimental results of these two methods.)

Substituting (27) into (24) and (25), we have

min_{ξ,b,β}  (1/2) β⊤K^z β + C Σ_{i=1}^N ξ_i,    (28)
subject to  y_i(β⊤ψ_z(x_i) + b) ≥ 1 − ξ_i,  ∀i,    (29)
            ξ_i ≥ 0,  ∀i,    (30)

where

ψ_z(x_i) = [K(z_1, x_i), ..., K(z_{N_z}, x_i)]⊤    (31)

is the empirical kernel map (Schölkopf and Smola, 2002) and K^z is the kernel matrix of the z_i, i.e. K^z_{ij} = K(z_i, z_j). Note that when N_z = N and z_i = x_i, 1 ≤ i ≤ N, this is the standard SVM training problem. In contrast, the problem (28)–(30) is to train a linear SVM in a subspace spanned by φ(z_i), 1 ≤ i ≤ N_z, where the z_i are not necessarily training examples.
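The quantities entering (28)–(31) are straightforward to form explicitly. The following sketch is our own illustration (a Gaussian kernel is assumed; none of the names come from the paper): it builds the empirical kernel map ψ_z(x_i) for all training points and the kernel matrix K^z, which is all the data the reduced problem needs.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def empirical_kernel_map(X, Z, gamma=1.0):
    """Psi[i, :] = psi_z(x_i) = [K(z_1, x_i), ..., K(z_Nz, x_i)], cf. (31)."""
    return gaussian_kernel_matrix(X, Z, gamma)          # shape (N, Nz)

def reduced_problem_data(X, Z, gamma=1.0):
    """Return (Psi, Kz): problem (28)-(30) is a linear SVM on Psi with
    regularizer (1/2) beta' Kz beta instead of (1/2) beta' beta."""
    Psi = empirical_kernel_map(X, Z, gamma)
    Kz = gaussian_kernel_matrix(Z, Z, gamma)
    return Psi, Kz
```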

Now we investigate its dual problem. To derive it, we introduce the Lagrangian

L(ξ, b, β, α, γ) = (1/2) β⊤K^z β + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N γ_i ξ_i − Σ_{i=1}^N α_i [y_i(β⊤ψ_z(x_i) + b) − 1 + ξ_i],    (32)

with Lagrange multipliers γ_i ≥ 0 and α_i ≥ 0.

The derivatives of L(ξ, b, β, α, γ) with respect to the primal variables must vanish,

∂L/∂β = K^z β − Σ_{i=1}^N α_i y_i ψ_z(x_i) = 0,    (33)
∂L/∂b = − Σ_{i=1}^N α_i y_i = 0,    (34)
∂L/∂ξ_i = C − α_i − γ_i = 0,  ∀i.    (35)

Equation (33) leads to

β = (K^z)^{−1} Σ_{i=1}^N α_i y_i ψ_z(x_i).    (36)

Substituting (33)–(35) into (32) and using (36), we arrive at the dual form of the optimization problem:

max_{α∈R^N}  Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j K̂^z(x_i, x_j),    (37)
subject to  Σ_{i=1}^N y_i α_i = 0,    (38)
       and  0 ≤ α_i ≤ C,  ∀i,    (39)

where

K̂^z(x_i, x_j) = ψ_z(x_i)⊤ (K^z)^{−1} ψ_z(x_j).    (40)

The function K̂^z(·,·) defined by (40) is a positive definite kernel function (Schölkopf and Smola, 2002). To see this, consider the following map,⁴

φ_z(x_i) = T ψ_z(x_i),    (41)

where ψ_z(·) is defined by (31) and

T = Λ^{−1/2} V⊤,    (42)

where Λ is a diagonal matrix of eigenvalues of the matrix K^z and V is a matrix whose columns are eigenvectors of K^z. So

T⊤T = V Λ^{−1} V⊤ = (K^z)^{−1}.    (43)

Combining equations (41) and (43) we have

⟨φ_z(x_i), φ_z(x_j)⟩ = ψ_z(x_i)⊤ (K^z)^{−1} ψ_z(x_j) = K̂^z(x_i, x_j).

It can be seen that problem (37)–(39) has the same form as the dual of an SVM training problem. Therefore given Z, computing the expansion coefficients of an SVM with kernel function K is equivalent to training an SVM with a modified kernel function K̂^z defined by (40).

4. The map defined in (41) is called the "whitened" empirical kernel map or "kernel PCA map" (Schölkopf and Smola, 2002).

Since problem (37)–(39) is the dual of problem (24)–(27), the optima of these two problems are equal to each other. So given Z, assuming α^z_i, 1 ≤ i ≤ N are the solution of (37)–(39), then we can compute W(Z) as

W(Z) = Σ_{i=1}^N α^z_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α^z_i α^z_j y_i y_j K̂^z(x_i, x_j).    (44)

According to (36), the expansion coefficients β can be calculated as

β = (K^z)^{−1} Σ_{i=1}^N α^z_i y_i ψ_z(x_i) = (K^z)^{−1} (K^{zx}) Y α^z,    (45)

where ψ_z(·) is defined by (31), K^{zx} is the matrix defined by K^{zx}_{ij} = K(z_i, x_j), Y is a diagonal matrix of class labels, i.e. Y_ii = y_i, and α^z = [α^z_1, ..., α^z_N]⊤.
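Given Z, the computation above therefore reduces to: form K̂^z on the training data, train a standard SVM with this precomputed kernel, read off the dual solution α^z, and evaluate (44) and (45). A minimal sketch follows (our own illustration using scikit-learn's SVC with a precomputed kernel; the Gaussian kernel, the small jitter added to K^z for numerical safety, and all names are our choices, not part of the paper).

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel_matrix(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def W_and_beta(X, y, Z, C=10.0, gamma=1.0, jitter=1e-10):
    """Return (W(Z), beta, b) for fixed expansion vectors Z, cf. (40), (44), (45)."""
    Kz = gaussian_kernel_matrix(Z, Z, gamma) + jitter * np.eye(len(Z))
    Kzx = gaussian_kernel_matrix(Z, X, gamma)          # K^{zx}
    Kz_inv = np.linalg.inv(Kz)
    K_hat = Kzx.T @ Kz_inv @ Kzx                       # \hat{K}^z(x_i, x_j), eq. (40)

    svm = SVC(C=C, kernel="precomputed").fit(K_hat, y)
    sv = svm.support_                                  # indices of the support vectors
    ya = svm.dual_coef_.ravel()                        # y_i * alpha_i^z on the SVs
    alpha = np.abs(ya)                                 # alpha_i^z (y_i in {-1, +1})

    # W(Z) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_hat(x_i, x_j), eq. (44)
    W = alpha.sum() - 0.5 * ya @ K_hat[np.ix_(sv, sv)] @ ya

    # beta = (K^z)^{-1} K^{zx} Y alpha^z, eq. (45); only support vectors contribute
    beta = Kz_inv @ (Kzx[:, sv] @ ya)
    return W, beta, float(svm.intercept_[0])
```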

3.3 Computing ∇W(Z) of SLMC

To compute ∇W(Z), we can apply lemma 1 to the soft margin SVM training problem (37)–(39), which yields the following result.

Corollary 2  In the soft margin SVM training problem (37)–(39), assume that the kernel function K̂^z(·,·) is strictly positive definite and that the resulting support vectors come from both the positive and the negative class;⁵ then the derivative of W(Z) with respect to z_uv, which denotes the v-th component of vector z_u, 1 ≤ u ≤ N_z, 1 ≤ v ≤ d, exists and can be computed as follows:

∂W/∂z_uv = − (1/2) Σ_{i,j=1}^N α^z_i α^z_j y_i y_j ∂K̂^z(x_i, x_j)/∂z_uv,    (46)

where α^z_i, 1 ≤ i ≤ N denote the solution of problem (37)–(39). In other words, ∇W(Z) can be computed as if α^z did not depend on Z.⁶

The proof of corollary 2 is given in appendix C.

As can be seen in corollary 2, to apply lemma 1 to calculate ∇W(Z), we only need to make two assumptions on problem (37)–(39): the kernel function K̂^z(·,·) is strictly positive definite and the resulting support vectors come from both classes. Certainly these are not strict assumptions for most practical applications. Similarly one can verify that lemma 1 can also be applied to many other KLAs with mild assumptions, such as the one-class SVM. We will show another example in appendix A of applying our direct method to sparsify the KFD algorithm (Mika et al., 2003).

5. Let α^z_i, 1 ≤ i ≤ N denote the optimal solution of problem (37)–(39); support vectors are those input data x_i whose corresponding α^z_i are larger than 0 (Vapnik, 1995).
6. In corollary 2, α^z_i is bounded above by C as shown in (39). A similar conclusion is proposed in (Chapelle et al., 2002) for the hard margin SVM training problem, where there is no upper bound on α^z_i. This implies that the feasible set is not compact, hence lemma 1 cannot be applied any more. Actually in (Chapelle et al., 2002), only the uniqueness of the optimal solution is emphasized, which, to our knowledge, is not enough to guarantee the differentiability of the marginal function W(Z).

According to (40),

∂K̂^z(x_i, x_j)/∂z_uv = (∂ψ_z(x_i)/∂z_uv)⊤ (K^z)^{−1} ψ_z(x_j) + ψ_z(x_i)⊤ (K^z)^{−1} (∂ψ_z(x_j)/∂z_uv) + ψ_z(x_i)⊤ (∂(K^z)^{−1}/∂z_uv) ψ_z(x_j),

where ∂(K^z)^{−1}/∂z_uv can be calculated as

∂(K^z)^{−1}/∂z_uv = −(K^z)^{−1} (∂K^z/∂z_uv) (K^z)^{−1}.
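For a Gaussian kernel K(z, x) = exp(−γ‖z − x‖²), the building blocks of these derivatives are simple: ∂K(z_u, x)/∂z_uv = −2γ(z_uv − x_v)K(z_u, x), only the u-th entry of ψ_z(x) depends on z_u, and only the u-th row and column of K^z do. The sketch below is our own illustration of these building blocks (it assumes this kernel and is not taken from the paper's implementation); from them, ∂K̂^z/∂z_uv and then (46) follow directly.

```python
import numpy as np

def gauss(a, b, gamma):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def d_gauss_dz(z, x, v, gamma):
    """d/dz_v exp(-gamma ||z - x||^2) = -2 gamma (z_v - x_v) K(z, x)."""
    return -2.0 * gamma * (z[v] - x[v]) * gauss(z, x, gamma)

def d_psi_dzuv(x, Z, u, v, gamma):
    """Derivative of psi_z(x) (eq. 31) w.r.t. z_uv: only entry u is nonzero."""
    d = np.zeros(len(Z))
    d[u] = d_gauss_dz(Z[u], x, v, gamma)
    return d

def d_Kz_dzuv(Z, u, v, gamma):
    """Derivative of K^z w.r.t. z_uv: only row u and column u are nonzero."""
    Nz = len(Z)
    D = np.zeros((Nz, Nz))
    for j in range(Nz):
        if j != u:
            D[u, j] = D[j, u] = d_gauss_dz(Z[u], Z[j], v, gamma)
    return D   # the diagonal entry K(z_u, z_u) = 1 is constant, so its derivative is 0
```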

So at any given Z, W(Z) and ∇W(Z) can be computed as (44) and (46) respectively.

In our implementation, we use the LBFGS algorithm (Liu and Nocedal, 1989) to minimize W(Z), which is an efficient gradient based optimization algorithm.
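Putting the pieces together, the outer loop is an unconstrained minimization of W(Z) over the entries of Z. The following compact sketch is our own illustration (it reuses the W_and_beta routine sketched above in section 3.2, relies on scipy's L-BFGS-B, and, for brevity, lets the optimizer approximate ∇W(Z) by finite differences instead of implementing the analytic gradient (46) as the paper does).

```python
import numpy as np
from scipy.optimize import minimize

def train_slmc(X, y, Nz, C=10.0, gamma=1.0, seed=0):
    """Sparse large margin classifier: optimize the Nz expansion vectors Z.

    Returns (Z, beta, b); prediction is sign(sum_i beta_i K(z_i, x) + b).
    """
    rng = np.random.default_rng(seed)
    # initialize Z with randomly chosen training points (as in the experiments)
    Z0 = X[rng.choice(len(X), size=Nz, replace=False)].copy()
    d = X.shape[1]

    def objective(z_flat):
        Z = z_flat.reshape(Nz, d)
        W, _, _ = W_and_beta(X, y, Z, C=C, gamma=gamma)   # inner convex problem, section 3.2
        return W

    res = minimize(objective, Z0.ravel(), method="L-BFGS-B",
                   options={"maxiter": 200})              # the paper reports 20-200 iterations
    Z = res.x.reshape(Nz, d)
    _, beta, b = W_and_beta(X, y, Z, C=C, gamma=gamma)
    return Z, beta, b
```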

3.4 The Kernel Function K̂^z and Its Corresponding Feature Space F_z

The kernel function K̂^z plays an important role in our approach. In this section, some analysis of K̂^z is provided, which will give us insights into how to build an SLMC.

It is well known that training an SVM with a nonlinear kernel function K in the input space X is equivalent to building a linear SVM in a feature space F. The map φ(·) from X to F is implicitly introduced by K. In section 3.2, we derived that for a given set of XVs z_1, ..., z_{N_z}, training an SVM with kernel function K is equivalent to building an SVM with another kernel function K̂^z, which is in turn equivalent to constructing a linear SVM in another feature space. Let F_z denote this feature space; then the map from X to F_z is φ_z(·), which is explicitly defined by (41).

According to (41), φ_z(x) = T ψ_z(x). To investigate the role of the matrix T, consider U^z defined by

U^z = [φ(z_1), ..., φ(z_{N_z})] T⊤.

Then

(U^z)⊤ U^z = T K^z T⊤ = I,

where I is the unit matrix, which means that T⊤ orthonormalizes the φ(z_i) in the feature space F. Thus the columns of U^z can be regarded as an orthonormal basis of a subspace of F. For any x ∈ X, if we calculate the projection of φ(x) onto this subspace, we have

(U^z)⊤ φ(x) = T [φ(z_1), ..., φ(z_{N_z})]⊤ φ(x) = T [K(z_1, x), ..., K(z_{N_z}, x)]⊤ = T ψ_z(x) = φ_z(x).

This shows that the subspace spanned by the columns of U^z is identical to F_z. As U^z is obtained by orthonormalizing the φ(z_i), F_z is a subspace of F and it is spanned by φ(z_i), 1 ≤ i ≤ N_z.
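The map φ_z and its properties can be checked numerically. The sketch below is our own illustration (a Gaussian kernel and a tiny eigenvalue floor are assumed): it computes T = Λ^{−1/2}V⊤ from the eigendecomposition of K^z and applies it to ψ_z(x), exactly as in (41)–(42).

```python
import numpy as np

def whitened_kernel_map(X, Z, gamma=1.0, eps=1e-12):
    """phi_z(x_i) = T psi_z(x_i) with T = Lambda^{-1/2} V', cf. (41)-(42).

    Returns Phi of shape (N, Nz); row i is the image of x_i in F_z.
    """
    def d2(A, B):
        return (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
                - 2 * A @ B.T)
    Kz = np.exp(-gamma * d2(Z, Z))
    Psi = np.exp(-gamma * d2(X, Z))                   # Psi[i] = psi_z(x_i)
    lam, V = np.linalg.eigh(Kz)                       # Kz = V diag(lam) V'
    T = np.diag(1.0 / np.sqrt(np.maximum(lam, eps))) @ V.T
    # by (43), T @ Kz @ T.T equals the identity up to numerical error,
    # i.e. T' orthonormalizes the phi(z_i) in feature space
    return Psi @ T.T                                  # phi_z(x_i) = T psi_z(x_i)
```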


Now that for a given set of XVs z_i, building an SVM with a kernel function K is equivalent to building a linear SVM in F_z, in order to get good classification performance we have to find a discriminating subspace F_z where the two classes of data are linearly well separated. Based on this point of view, we can see that our proposed approach essentially finds a subspace F_z where the margin of the training data is maximized.

4. Comparison with Related Approaches

In this section we compare the SLMC algorithm with related approaches.

4.1 Modified RS Method

In the second step of the RS method, after the XVs z_1, ..., z_{N_z} are obtained, the expansion coefficients β are computed by minimizing (4), which leads to (Schölkopf and Smola, 2002)

β = (K^z)^{−1} (K^{zx}) Y α,    (47)

where K^{zx} and Y are defined as in (45), and α is the solution of building an SVM with kernel function K on the training data set {(x_i, y_i)}_{i=1}^N.

We propose to modify the second step of the RS method as (45). Clearly (47) and (45) are of the same form. The only difference is that in (47), α is the solution of training an SVM with kernel function K, while in (45), α^z is the solution of training an SVM with the kernel function K̂^z, which takes the XVs z_i into consideration. As the β calculated by (45) maximizes the margin of the resulting classifier, we can expect a better classification performance from this modified RS method. We will see this in the experimental results.

4.2 Comparison with RSVM and a Modified RSVM Algorithm

One might argue that our approach appears to be similar to the RSVM, because the RSVM algorithm also restricts the weight vector of the decision hyperplane to be a linear expansion of N_z XVs.

However, there are two important differences between the RSVM and our approach. The first one (and probably the fundamental one) is that in the RSVM approach, N_z XVs are randomly selected from the training data in advance, but are not computed by finding a discriminating subspace F_z. The second difference lies in the method for computing the expansion coefficients β. Our method exactly solves the problem (28)–(30) without any simplifications. But in the RSVM approach, certain simplifications are performed, among which the most significant one is changing the first term in the objective function (28) from (1/2) β⊤K^z β to (1/2) β⊤β. This step immediately reduces the problem (28)–(30) to a standard linear SVM training problem (Lin and Lin, 2003), where β becomes the weight vector of the decision hyperplane and the training set becomes {ψ_z(x_i), y_i}_{i=1}^N.

On the other hand, our method of computing β is to build a linear SVM in the subspace F_z, which is to train a linear SVM for the training data set {φ_z(x_i), y_i}_{i=1}^N.

Now let us compare the two training sets mentioned above, i.e. {φ_z(x_i), y_i}_{i=1}^N and {ψ_z(x_i), y_i}_{i=1}^N. As derived in section 3.4, φ_z(x_i) is calculated by projecting φ(x_i) onto a set of vectors, which is obtained by orthonormalizing the φ(z_j) (1 ≤ j ≤ N_z), while ψ_z(x_i) is calculated by computing the dot product between φ(x_i) and the φ(z_j) (1 ≤ j ≤ N_z) directly, without the step of orthonormalization.

Analogous to the modified RS method, we propose a modified RSVM algorithm: firstly, N_z training data are randomly selected as XVs, then the expansion coefficients β are computed by (45).

4.3 Comparison with the RVM

The RVM (Tipping, 2001) algorithm and many other sparse learning algorithms, such as sparse greedy algorithms (Nair et al., 2002), or SVMs with l1-norm regularization (Bennett, 1999), result in a classifier whose XVs are a subset of the training data. In contrast, the XVs of SLMC do not necessarily belong to the training set. This means that SLMC can in principle locate better discriminating XVs. Consequently, with the same number of XVs, SLMC can have better classification performance than the RVM and other sparse learning algorithms which select the XVs only from the training data. This can be seen from the experimental results provided in section 5.5.

4.4 SLMC vs Neural Networks

Since the XVs of the SLMC do not necessarily belong to the training set and training an SLMC is a gradient based process,⁷ the SLMC can be thought of as a neural network with weight regularization (Bishop, 1995). However, there are clear differences between the SLMC algorithm and a feedforward neural network. First, analogous to an SVM, the SLMC considers the geometric concept of margin, and aims to maximize it. To this end, the regularizer takes into account the kernel matrix K^z. Second, SLMC minimizes the "hinge loss", which is different from the loss functions adopted by neural networks.⁸ Therefore, both the regularizer and the loss function of SLMC are different from those of traditional perceptrons.

Furthermore, the SLMC algorithm is just one application of our 'direct sparse' method. It is straightforward to apply this method to build a sparse one-class SVM algorithm (Schölkopf and Smola, 2002), to which there is no obvious neural network counterpart.

On the other hand, analogous to neural networks, we also have an additional regularization via the number N_z determining the number of XVs, which is an advantage in some practical applications where runtime constraints exist and the maximum prediction time is known a priori. Note that the prediction time (the number of kernel evaluations) of a soft margin SVM scales linearly with the number of training patterns (Steinwart, 2003).

5. Experimental Results

Now we conduct some experiments to investigate the performance of the SLMC algorithm and compare it with other related approaches.

7. This is also similar to the recent work of (Snelson and Ghahramani, 2006) on building sparse Gaussian processes, which was done at almost the same time as our previous work (Wu et al., 2005).
8. Note that the shape of the hinge loss is similar to that of the loss function adopted in logistic regression, where the logarithm of the logistic sigmoid function (Bishop, 1995) is involved. Here the logistic sigmoid function refers to y(x) = 1/(1 + e^{−x}). So the shape of the hinge loss is different from that of the loss function used by the perceptron.

5.1 Approaches to be Compared

The following approaches are compared in the experiments: standard SVM, RS method, modified RS method (MRS, cf. section 4.1), RSVM, modified RSVM (MRSVM, cf. section 4.2), relevance vector machine (RVM), and the proposed SLMC approach.

Note that in our experiments, RS and MRS use exactly the same XVs, but they compute the expansion coefficients by (47) and (45) respectively. Similarly, RSVM and MRSVM also use the same set of XVs; the difference lies in the method for computing the expansion coefficients.

5.2 Data Sets

Seven classification benchmarks are considered: USPS, Banana, Breast Cancer, Titanic, Waveform, German and Image. The last six data sets are provided by Gunnar Rätsch and can be downloaded from http://ida.first.fraunhofer.de/projects/bench. For the USPS data set, 7291 examples are used for training and the remaining 2007 are for testing. For each of the last six data sets, there are 100 training/test splits and we follow the same scheme as (Tipping, 2001): our results show averages over the first 10 of those.

5.3 Parameter Selection

A Gaussian kernel is used in the experiments:

K(x, x′) = exp(−γ‖x − x′‖²).    (48)

The parameters for different approaches are as follows:

Standard SVM: For the USPS data set, we use the same parameters as in (Schölkopf and Smola, 2002): C = 10 and γ = 1/128. For the other data sets, we use the parameters provided by Gunnar Rätsch, which are shown on the same website where these data sets are downloaded.⁹

RSVM and MRSVM: We perform 5-fold cross validation on the training set to select parameters for the RSVM. MRSVM uses the same parameters as the RSVM.

RS method: The RS method uses the same kernel parameter as the standard SVM, since it aims to simplify the standard SVM solution.

SLMC and MRS: In our experiments, they use exactly the same parameters as the standard SVM on all the data sets.

RVM: The results for the RVM are taken directly from (Tipping, 2001), where 5-fold cross validation was performed for parameter selection.

5.4 Experimental Settings

For each data set, first a standard SVM is trained with the LIBSVM software.¹⁰ (For the USPS, ten SVMs are built, each trained to separate one digit from all others.) Then the other approaches are applied. The ratio N_z/N_SV varies from 5% to 10%.

For the RSVM, we use the implementation contained in the LIBSVM Tools.¹¹

9. See http://ida.first.fraunhofer.de/projects/bench
10. From http://www.csie.ntu.edu.tw/˜cjlin/libsvm
11. From http://www.csie.ntu.edu.tw/˜cjlin/libsvmtools

For the RS method, there is still no standard or widely accepted implementation, so we try three different ones: a program written by ourselves, the code contained in the machine learning toolbox SPIDER,¹² and the code contained in the statistical pattern recognition toolbox STPRTOOL.¹³ For each data set, we apply these three implementations and select the best one corresponding to the minimal value of the objective function (4).

5.5 Numerical Results

Experimental results are shown in Table 1, where the initial XVs of the SLMC are randomly selected from the training data. In Table 1, N_SV stands for the number of support vectors (SVs) of the standard SVM, and N_z represents the number of XVs of the other sparse learning algorithms.

Data Set       |  SVM            |  Nz/NSV = 5%                     |  Nz/NSV = 10%                    |  RVM
               |  NSV    Err(%)  |  RS    MRS   RSVM  MRSVM  SLMC   |  RS    MRS   RSVM  MRSVM  SLMC   |  Nz/NSV(%)  Err(%)
USPS           |  2683   4.3     |  4.9   4.9   11.6  11.5   4.9    |  4.7   4.8   8.2   8.0    4.7    |  11.8       5.1
Banana         |  86.7   11.8    |  39.4  27.6  29.9  28.1   16.5   |  21.9  17.5  17.5  16.9   11.0   |  13.2       10.8
Breast Cancer  |  112.8  28.6    |  28.8  28.8  29.5  29.4   27.9   |  27.9  29.0  31.0  30.3   27.9   |  5.6        29.9
Titanic        |  70.6   22.1    |  37.4  23.9  24.5  24.8   26.4   |  26.6  22.6  22.9  23.9   22.4   |  92.5       23.0
Waveform       |  158.9  9.9     |  9.9   10.0  15.1  14.7   9.9    |  10.0  9.9   11.6  11.8   9.9    |  9.2        10.9
German         |  408.2  22.5    |  22.9  22.5  23.6  23.9   22.3   |  22.9  22.6  24.5  23.7   22.9   |  3.1        22.2
Image          |  172.1  2.8     |  37.6  19.4  23.6  20.7   5.2    |  18.3  6.9   14.2  12.7   3.6    |  20.1       3.9

Table 1: Results on seven classification benchmarks. The test error rates of each algorithm are presented. The N_SV for the last six data sets are the averages over 10 training/test splits. The best result in each group is shown in boldface. The number of XVs of the RVM is not chosen a priori, but comes out as a result of training. So for the RVM, the ratio N_z/N_SV is given in order to compare it with the other algorithms. For each data set, the result of the RVM is shown in boldface if it is the best compared to the other sparse learning algorithms.

From Table 1, it can be seen that the classification accuracy of SLMC is comparable with that of the full SVM when N_z/N_SV = 0.1.

Table 1 also illustrates that SLMC outperforms the other sparse learning algorithms in most cases. Also, the SLMC usually improves the classification results of the RS method. In some cases the improvement is large, such as on the Banana and Image data sets.

When comparing MRS with RS, and MRSVM with RSVM, the results in Table 1 demonstrate that in most cases MRS beats RS, and similarly, MRSVM usually outperforms RSVM a little. This means that for a given set of XVs, computing the expansion coefficients according to (45) is a good choice.

12. From http://www.kyb.mpg.de/bs/people/spider
13. From http://cmp.felk.cvut.cz/˜xfrancv/stprtool

5.6 Some Implementation Details

In Table 1, we report the results obtained by random initialization. The K-means algorithm has also been tried to choose the initial XVs and the resulting classification results are similar. To illustrate this quantitatively, in Table 2 we present the results obtained by using the K-means algorithm for initialization.

Data Set       |  Nz/NSV = 5%         |  Nz/NSV = 10%
               |  random   k-means    |  random   k-means
USPS           |  4.9      4.9        |  4.7      4.6
Banana         |  16.5     16.2       |  11.0     10.9
Breast Cancer  |  27.9     26.8       |  27.9     27.3
Titanic        |  26.4     24.4       |  22.4     23.2
Waveform       |  9.9      9.9        |  9.9      9.9
German         |  22.3     22.7       |  22.9     22.7
Image          |  5.2      6.0        |  3.6      3.8

Table 2: Results of the SLMC algorithm, obtained by random initialization and k-means initialization.

In our proposed approach, we need to compute the inverse of K^z (see for example (45)). Theoretically, if the Gaussian kernel is used and the z_i, 1 ≤ i ≤ N_z, are different from each other, then K^z should be full rank, so its inverse can be computed without any problems. However, in experiments we do observe cases where K^z is ill conditioned. We find that the reason for this is that there are some duplicated data points in the data sets, which are accidentally selected as the initial XVs. For example, in the first training set of the Image data set, the first and the 521st data points are exactly the same. So in experiments, we remove the duplicated points as a preprocessing step.
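This preprocessing step is easy to reproduce. The sketch below is our own illustration: it drops exact duplicate training points before the initial XVs are drawn. (Adding a small ridge to K^z would be an alternative safeguard, but that is our own suggestion rather than what the paper does.)

```python
import numpy as np

def remove_duplicates(X, y):
    """Drop exact duplicate rows of X (keeping the first occurrence) so that
    randomly chosen initial XVs are pairwise distinct and K^z stays full rank."""
    _, idx = np.unique(X, axis=0, return_index=True)
    idx = np.sort(idx)             # keep the original ordering of the data
    return X[idx], y[idx]
```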

5.7 Some Results on XVs

It is known that the XVs of a standard SVM are support vectors, which lie near the classification boundary. Here we give two examples to illustrate what the XVs of SLMC look like.

Example 1. Building an SVM involves solving problem (20)–(22), while building an SLMC is to solve the same problem plus one more constraint (23). If we want to build an SLMC with the same number of XVs as a standard SVM, namely N_z = N_SV, then the optimal solution of problem (20)–(22) is also a global optimal solution of problem (20)–(23), since it satisfies all the constraints. So in this special case, the support vectors of the standard SVM are also an optimal choice of XVs for SLMC.

Example 2. On the USPS data set, we built an SVM on training data with γ = 1/128, C = 10 to separate digit '3' from digit '8'. The resulting SVM has 116 SVs and a test error rate of 1.8%. Then we built an SLMC with the same γ and C, while N_z = 12 (i.e. about 10% of the number of SVs). The resulting SLMC also has a test error rate of 1.8%. As shown in Figure 1, the images of the 12 XVs produced by the SLMC approach look like digits.

Figure 1: Images of XVs for separating '3' and '8'.

5.8 Training Time of SLMC

Building an SLMC is a gradient based process, where each iteration consists of computing φ_z(x_i), 1 ≤ i ≤ N, training a linear SVM over {φ_z(x_i), y_i}_{i=1}^N,¹⁴ and then computing the gradient ∇W(Z).

Let T_SVM(N, d) denote the time complexity of training a linear SVM over a data set containing N d-dimensional vectors; then the time complexity of training an SLMC is

O(n × (N N_z d + T_SVM(N, N_z) + N_z³ + N_z² d)),

where n is the number of iterations of the SLMC training process. In experiments, we find that the SLMC algorithm requires 20–200 iterations to converge.

We cannot directly compare the training time of SLMC with the RS methods and the RVM, because we used C++ to implement our approach, while the publicly available code of the RS methods and the RVM is written in Matlab. Using these implementations and a personal computer with a Pentium 4 CPU of 3 GHz, one gets the following numbers: on the USPS data set, SLMC takes 6.9 hours to train, while the RS method takes 2.3 hours; on the Banana data set, SLMC training is about 1.5 seconds, and RVM training is about 5 seconds.

Training an SLMC is time consuming on large data sets. However, in practice, once the training is finished, the trained KM will be put into use processing large amounts of test data. For applications where processing speed is important, such as real-time computer vision, sparse KMs can be of great value. This is the reason why several SKLAs have been developed although they are more expensive than the standard SVM algorithm.

Furthermore, some kernel learning algorithms are not sparse at all, such as kernel ridge regression, KFD, and KPCA, which means that all the training data need to be saved as the XVs in the resulting KMs trained by these algorithms. Hence building sparse versions of these algorithms can not only accelerate the evaluation of the test data, but also dramatically reduce the space needed for storing the trained KM. An example of this will be given in appendix A.

14. Equivalently we can build an SVM with the kernel function K̂^z over {x_i, y_i}_{i=1}^N. But this is much slower because it is time consuming to compute K̂^z(·,·) defined by (40).

6. Conclusions

We present a direct method to build sparse kernel learning algorithms. There are mainly two advantages of this method: first, it can be applied to sparsify many current kernel learning algorithms; second, it simultaneously considers the sparseness and the performance of the resulting KM. Based on this method we propose an approach to build sparse large margin classifiers, which essentially finds a discriminating subspace F_z of the feature space F. Experimental results indicate that this approach often exhibits a better classification accuracy, at comparable sparsity, than the sparse learning algorithms to which we compared.

A by-product of this paper is a method for calculating the expansion coefficients of SVMs for given XVs. Based on this method we proposed a modified version of the RS method and of the RSVM. Experimental results show that these two modified algorithms can improve the classification accuracy of their counterparts. One could also try this method on other algorithms such as the RVM.

Possible future work may include applying the proposed method to other kernel learning algorithms and running the direct method greedily to find the XVs one after another in order to accelerate the training procedure.

Acknowledgments

We would like to acknowledge the anonymous reviewers for their comments that significantly improved the quality of the manuscript.

Appendix A. Sparse KFD

We have derived the SLMC algorithm as an example of our direct sparse learning approach. Here we will show another example, building a sparse KFD (Mika et al., 2003).

In the KFD algorithm, we need to consider the following Rayleigh quotient maximization problem (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004):¹⁵

max_{w∈F}  (w⊤Aw) / (w⊤Bw + µ‖w‖²).    (49)

In (49), A and B are respectively the between-class and within-class variances of the training data in the feature space F, while µ is a regularization parameter. It can be seen that both A and B are positive definite.

15. As mentioned before, convex formulations have been proposed for the KFD (Mika et al., 2003; Suykens et al., 2002). Here we only consider the traditional non-convex formulation whose solution can be easily obtained via eigendecomposition. This also illustrates that our method can also be applied to non-convex optimization problems in some special cases.

The following is an equivalent form of (49):

min_{w∈F}  −w⊤Aw,    (50)
subject to  w⊤B̂w = 1,    (51)

where B̂ = B + µI, implying that B̂ is strictly positive definite.

Following our proposed approach, in order to build sparse KFD (SKFD), we have to solve the following problem:

min_{w∈F, β∈R^{N_z}, Z∈R^{d×N_z}}  −w⊤Aw,    (52)
subject to  w⊤B̂w = 1,    (53)
            w = Σ_{i=1}^{N_z} φ(z_i) β_i.    (54)

Substituting (54) into (52) and (53), we have

min_{β∈R^{N_z}, Z∈R^{d×N_z}}  −β⊤A^z β,    (55)
subject to  β⊤B^z β = 1,    (56)

where A^z = (φ(Z))⊤A(φ(Z)) and B^z = (φ(Z))⊤B̂(φ(Z)), where with a little abuse of symbols, we use φ(Z) to denote the matrix [φ(z_1), ..., φ(z_{N_z})]. Note that in the problem (55)–(56), A^z ∈ R^{N_z×N_z} is positive definite, while B^z ∈ R^{N_z×N_z} is strictly positive definite.

As before, we define the marginal function W(Z) as the minimal value of the following optimization problem:

min_{β∈R^{N_z}}  −β⊤A^z β,    (57)
subject to  β⊤B^z β = 1.    (58)

Note the above problem is the same as the problem (55)–(56) except that in the above problem, Z is fixed rather than variable.

Now we need to consider how to compute W(Z) and ∇W(Z). To compute the function value W(Z), we need to solve the problem (57)–(58). By the Lagrange multiplier method, this problem can be solved via the following unconstrained optimization problem:

min_{β∈R^{N_z}, λ∈R}  J(β, λ) = −β⊤A^z β + λ(β⊤B^z β − 1),    (59)

with Lagrange multiplier λ ∈ R.

The derivatives of J(β, λ) with respect to β and λ must vanish, leading to

A^z β = λ B^z β,    (60)
β⊤B^z β = 1.    (61)

Equation (60) shows that β should be an eigenvector of the matrix (B^z)^{−1}A^z and λ should be the corresponding eigenvalue. Left multiplying both sides of equation (60) by β⊤ and using (61), we have

β⊤A^z β = λ.    (62)

Since −β⊤A^z β is the objective function (57) we are minimizing, we can see that its minimal value should equal the negative of the largest eigenvalue of (B^z)^{−1}A^z. Therefore the function value W(Z) is obtained.

Let β̄ and λ̄ denote the optimal solution and the corresponding Lagrange multiplier of problem (57)–(58). As derived above, λ̄ is the largest eigenvalue of (B^z)^{−1}A^z and β̄ is the corresponding eigenvector, multiplied by a constant such that equation (61) is satisfied.

As mentioned above, B^z is strictly positive definite. Here we assume that there is a unique eigenvector corresponding to λ̄. As equation (58) is the only constraint of problem (57)–(58), the optimal solution β̄ is Mangasarian-Fromovitz regular. And it is straightforward to verify that the set of feasible solutions S(Z) is uniformly compact if B^z is strictly positive definite. Therefore, according to lemma 1, the derivative of W(Z) with respect to z_uv, which denotes the v-th component of vector z_u, 1 ≤ u ≤ N_z, 1 ≤ v ≤ d, exists and can be computed as follows:

∂W(Z)/∂z_uv = −β̄⊤ (∂A^z/∂z_uv) β̄ + λ̄ β̄⊤ (∂B^z/∂z_uv) β̄.    (63)

Now that both the function value W(Z) and the gradient ∇W(Z) can be computed, the (local) optimum of the problem (52)–(54) can be computed by the gradient based algorithm.
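For fixed Z, W(Z) and β̄ can thus be obtained from a single generalized eigendecomposition. The sketch below is our own illustration (it assumes the matrices A^z and B^z have already been formed from the training data, e.g. via the empirical kernel map, and uses scipy's symmetric generalized eigensolver).

```python
import numpy as np
from scipy.linalg import eigh

def skfd_W_and_beta(Az, Bz):
    """Solve (57)-(58) for fixed Z.

    Az, Bz : (Nz, Nz) symmetric matrices, Bz strictly positive definite.
    Returns (W, beta) with W = -largest generalized eigenvalue of (Bz)^{-1} Az
    and beta the corresponding eigenvector scaled so that beta' Bz beta = 1,
    cf. (60)-(62).
    """
    lam, V = eigh(Az, Bz)                      # Az v = lam Bz v, eigenvalues ascending
    beta = V[:, -1]                            # eigenvector of the largest eigenvalue
    beta = beta / np.sqrt(beta @ Bz @ beta)    # enforce the constraint (61)
    W = -lam[-1]                               # W(Z) = -largest eigenvalue
    return W, beta
```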

After obtaining the XVs z_i by solving the problem (52)–(54), we can take the solution β_i of this problem as the expansion coefficients. However, we cannot get the bias b for the resulting KM in this way (cf. equation (1)). As mentioned before, having the z_i, we can apply our proposed method: the expansion coefficients and the bias can be calculated by solving the problem (37)–(39).

Experimental results on six classification benchmarks for the proposed SKFD algorithm are provided in Table 3.

Data Set       |  SVM            |  KFD            |  Nz/NSV = 10%
               |  NSV    Err(%)  |  Nz     Err(%)  |  RS    MRS   RSVM  MRSVM  SKFD
Banana         |  86.7   11.8    |  400    10.8    |  21.9  17.5  17.5  16.9   10.8
Breast Cancer  |  112.8  28.6    |  200    25.8    |  27.9  29.0  31.0  30.3   25.5
Titanic        |  70.6   22.1    |  150    23.2    |  26.6  22.6  22.9  23.9   23.5
Waveform       |  158.9  9.9     |  400    9.9     |  10.0  9.9   11.6  11.8   9.9
German         |  408.2  22.5    |  700    23.7    |  22.9  22.6  24.5  23.7   23.5
Image          |  172.1  2.8     |  1300   3.3     |  18.3  6.9   14.2  12.7   4.0

Table 3: Results on six classification benchmarks. The SKFD is initialized with the k-means algorithm. The best results among the sparse learning algorithms are in boldface. Similarly as before, the results reported here are the averages over the first 10 training/test splits, except for the KFD algorithm, whose results are taken directly from (Schölkopf and Smola, 2002) and are the averages over all the 100 training/test splits. For the KFD algorithm, the number of expansion vectors N_z is the same as the number of training data since KFD is not sparse at all.


The results presented in Table 3 validate the effectiveness of the proposed SKFD algorithm. Furthermore, the original KFD algorithm is not sparse at all, i.e. all the training data need to be stored as the XVs. Therefore the proposed SKFD algorithm not only accelerates the test phase, but also significantly reduces the space needed for storing the resulting KM. For example, the Banana data set contains 400 training data, implying that on average only 86.7 × 10% = 8.7 XVs need to be stored in the resulting KM trained by the SKFD, saving about 1 − 8.7/400 = 97.8% storage compared with the original KFD algorithm.

Appendix B. Formal Definitions of the Two Conditions in Lemma 1

For each Z ∈ R^{d×N_z}, let S(Z) denote the feasible set of problem (16)–(18):

S(Z) = {x | g_i(x, Z) ≤ 0, 1 ≤ i ≤ N_g} ∩ {x | h_j(x, Z) = 0, 1 ≤ j ≤ N_h}.

Definition 3 (Gauvin and Dubeau, 1982) The feasible set S(Z) of problem (16)–(18) is uniformly compact at Z̄ if there is a neighborhood N(Z̄) of Z̄ such that the closure of ∪_{Z∈N(Z̄)} S(Z) is compact.

Definition 4 (Mangasarian, 1969) For any Z, a feasible point x̄ ∈ S(Z) of problem (16)–(18) is said to be Mangasarian-Fromovitz regular if it satisfies the following Mangasarian-Fromovitz regularity condition:

1. There exists a vector v such that

v⊤∇_x g_i(x, Z)|_{x=x̄} < 0,   i ∈ {i | g_i(x̄, Z) = 0},    (64)
v⊤∇_x h_j(x, Z)|_{x=x̄} = 0,   1 ≤ j ≤ N_h,    (65)

2. The gradients {∇_x h_j(x, Z)|_{x=x̄}, 1 ≤ j ≤ N_h} are linearly independent,

where ∇_x denotes the gradient with respect to the variables x.

Appendix C. Proof of Corollary 2

Proof  The proof is just to verify all the conditions in lemma 1.

First, for the soft margin SVM training problem (37)–(39), it is known that if the kernel function K̂^z(·,·) is strictly positive definite then the problem has a unique solution and the corresponding Lagrange multipliers are also unique.

Second, it can be seen that at any Z ∈ R^{d×N_z}, the set of feasible solutions S(Z) is compact and does not depend on Z; therefore S(Z) is uniformly compact at any Z ∈ R^{d×N_z}.

Third, obviously the second part of the Mangasarian-Fromovitz regularity condition holds since there is only one equality constraint. Now we prove that the first part of the Mangasarian-Fromovitz regularity condition also holds by constructing a vector v = [v_1, ..., v_N]⊤ ∈ R^N that satisfies both (64) and (65). For ease of description, we partition the index set {1, ..., N} into the following three subsets according to the α^z_i (1 ≤ i ≤ N), which are the solution of problem (37)–(39): S_1 = {i | α^z_i = 0}, S_2 = {i | α^z_i = C} and S_3 = {i | 0 < α^z_i < C}. For any vector t ∈ R^N, we use t_{S_k} (1 ≤ k ≤ 3) to denote the sub-vector of t that is composed of the t_i for i ∈ S_k, 1 ≤ k ≤ 3.

First, to make (64) hold, we can simply assign an arbitrary positive value to v_i if i ∈ S_1, and an arbitrary negative value to v_j if j ∈ S_2. To make (65) true, we distinguish two cases:

Case 1, |S_3| > 0. In this case, the vector v_{S_3} can be easily computed as

v_{S_3} = − (y_{S_3} / ‖y_{S_3}‖²) Σ_{i∈S_1∪S_2} v_i y_i,

where y = [y_1, ..., y_N]⊤. The above equation results in v⊤y = 0, which is the same as (65).

Case 2, |S_3| = 0. In this case, all the resulting support vectors correspond to α^z_i for i ∈ S_2, which come from both classes according to the assumptions in corollary 2. Therefore the left side of equation (65) equals

v⊤y = Σ_{i∈S_1} v_i y_i + Σ_{i∈S_2, y_i>0} v_i y_i + Σ_{i∈S_2, y_i<0} v_i y_i
     = Σ_{i∈S_1} v_i y_i + Σ_{i∈S_2, y_i>0} v_i − Σ_{i∈S_2, y_i<0} v_i.    (66)

Recall that we construct v such that v_i < 0 for i ∈ S_2. So if v⊤y = 0, equation (65) already holds. If v⊤y > 0, we can always decrease the values (or equivalently, increase the absolute values) of the v_i in the second term of equation (66) to make v⊤y = 0, while at the same time keeping v_i < 0 for i ∈ S_2 so that (64) still holds. Similarly, if v⊤y < 0, we can decrease the values of the v_i in the third term of equation (66).

Thus the first part of the Mangasarian-Fromovitz regularity condition also holds. Hence the optimal solution of problem (37)–(39) is Mangasarian-Fromovitz regular.

Therefore all the conditions in lemma 1 are satisfied and (46) follows from (19), since the constraints of problem (37)–(39) do not depend on Z, which means both the second and the third terms in (19) are 0.

References

K. P. Bennett. Combining support vector and mathematical programming methods for classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods, pages 307–326. The MIT Press, Cambridge, MA, 1999.

C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK, 1995.

C. J. C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proc. 13th International Conference on Machine Learning, pages 71–77. Morgan Kaufmann, 1996.

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131–159, 2002.

J. Gauvin and F. Dubeau. Differential properties of the marginal function in mathematical programming. Mathematical Programming Study, 19:101–119, 1982.

G. R. G. Lanckriet, L. E. Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555–582, 2002.

Y. Lee and O. L. Mangasarian. RSVM: reduced support vector machines. In CD Proceedings of the First SIAM International Conference on Data Mining, Chicago, 2001.

K. Lin and C. Lin. A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14:1449–1459, 2003.

D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(3, Ser. B):503–528, 1989.

O. L. Mangasarian. Nonlinear Programming. McGraw-Hill, New York, 1969.

S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. J. Smola, and K.-R. Mueller. Constructing descriptive and discriminative non-linear features: Rayleigh coefficients in kernel feature spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):623–628, 2003.

P. B. Nair, A. Choudhury, and A. J. Keane. Some greedy learning algorithms for sparse regression and classification with Mercer kernels. Journal of Machine Learning Research, 3:781–801, 2002.

B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.

E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1259–1266. MIT Press, Cambridge, MA, 2006.

I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071–1105, 2003.

J. A. K. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.

M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

M. Wu, B. Schölkopf, and G. Bakir. Building sparse large margin classifiers. In L. D. Raedt and S. Wrobel, editors, Proc. 22nd International Conference on Machine Learning, pages 1001–1008. ACM, 2005.
