The subtoken context model exploits the fact that identifier names are often formed by concatenating words in a phrase, such as getLocation or setContentLengthHeader. We call each of the smaller words in an identifier a subtoken. We split identifier names into subtokens based on camel case and underscores, resulting in a set of subtokens that we use to compose new identifiers. To do this, we exploit the summation trick we used in $\hat{r}_{\text{context}}$. Recall that we constructed this vector as a sum of embedding vectors for particular features in the context. Here, we define the embedding of a target vector to be the sum of the embeddings of its subtokens.
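As an illustration of the splitting step, the following Python sketch breaks an identifier into subtokens on underscores and camel-case boundaries. The function name `split_subtokens`, the lowercasing of subtokens, and the treatment of acronyms and digit runs are assumptions made for this example; the text only states that splitting is based on camel case and underscores.

```python
import re

# Assumed tokenization rules (not specified in the text):
#   - a run of capitals not followed by a lowercase letter is an acronym,
#   - otherwise a subtoken is an optional capital followed by lowercase letters,
#   - a run of digits is its own subtoken.
_SUBTOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_subtokens(identifier):
    """Split an identifier into lowercase subtokens."""
    subtokens = []
    for chunk in identifier.split("_"):  # underscores separate chunks
        for match in _SUBTOKEN_RE.finditer(chunk):  # camel case within a chunk
            subtokens.append(match.group(0).lower())
    return subtokens


print(split_subtokens("setContentLengthHeader"))
```

Under these assumptions, getLocation yields the subtokens get and location, and an identifier such as MAX_VALUE yields max and value.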
Let $t$ be the token that we are trying to predict from a context $c$. As in the context model, $c$ can contain tokens before and after $t$, and tokens from the global context. In the subtoken model, we additionally suppose that $t$ is split up into a sequence of $M$ subtokens, that is, $t = s_1 s_2 \ldots s_M$, where $s_M$ is always a special END subtoken that signifies the end of the subtoken sequence. That is, the context model now needs to predict a sequence of subtokens in order to predict a full identifier. We begin by breaking up the prediction one subtoken at a time, using the chain rule of probability: $P(s_1 s_2 \ldots s_M \mid c) = \prod_{m=1}^{M} P(s_m \mid s_1 \ldots s_{m-1}, c)$. Then, we model the probability $P(s_m \mid s_1 \ldots s_{m-1}, c)$ of the next subtoken $s_m$ given all of the previous subtokens and the context. Since preliminary experiments with an n-gram version of a subtoken model showed that n-grams did not yield good results, we employ a log-bilinear model
$$P(s_m \mid s_1 \ldots s_{m-1}, c) = \frac{\exp\{s_\theta(s_m, s_1 \ldots s_{m-1}, c)\}}{\sum_{s'} \exp\{s_\theta(s', s_1 \ldots s_{m-1}, c)\}}. \tag{5}$$
As before, $s_\theta(s_m, s_1 \ldots s_{m-1}, c)$ can be interpreted as a score, which can be positive or negative and indicates how much the model “likes” to see the subtoken $s_m$, given the previous subtokens and the context. The exponential functions and the denominator are a mathematical device to convert the score into a probability distribution.
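To make Equation (5) and the chain rule concrete, the following Python sketch normalizes a set of scores into a probability distribution and accumulates the log-probability of a whole subtoken sequence. The function names and the toy score table are hypothetical and serve only as an illustration; in the model, the scores are produced by $s_\theta$.

```python
import math


def softmax(scores):
    """Eq. (5): turn real-valued scores into a probability distribution."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {s: math.exp(v - peak) for s, v in scores.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}


def sequence_log_prob(subtokens, score_fn, context):
    """Chain rule: sum over m of log P(s_m | s_1 ... s_{m-1}, c)."""
    log_p = 0.0
    for m, s in enumerate(subtokens):
        probs = softmax(score_fn(subtokens[:m], context))
        log_p += math.log(probs[s])
    return log_p


def toy_scores(history, context):
    """Made-up scores standing in for s_theta; depends only on history length."""
    table = {
        0: {"get": 2.0, "location": 0.0, "END": -1.0},
        1: {"get": -1.0, "location": 2.0, "END": 0.0},
    }
    return table.get(len(history), {"get": -1.0, "location": -1.0, "END": 3.0})


good = sequence_log_prob(["get", "location", "END"], toy_scores, context=None)
bad = sequence_log_prob(["location", "get", "END"], toy_scores, context=None)
print(good > bad)
```

Working in log space avoids underflow when many conditional probabilities are multiplied, which is a common design choice rather than something the text prescribes.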
We choose a bilinear form for $s_\theta$, with the difference being that in addition to tokens having embedding vectors, subtokens have embeddings as well. Mathematically, we define the score as
$$s_\theta(s_m, s_1 \ldots s_{m-1}, c) = \hat{r}_{\text{SUBC}}^{\top} q_{s_m} + b_{s_m}, \tag{6}$$
where $q_{s_m} \in \mathbb{R}^D$ is an embedding for the subtoken $s_m$, and $\hat{r}_{\text{SUBC}}$ is a continuous vector that represents the previous subtokens and the context. To define a continuous representation $\hat{r}_{\text{SUBC}}$ of the context, we break this down further into a sum of other embedding features as
$$\hat{r}_{\text{SUBC}} = \hat{r}_{\text{context}} + \hat{r}_{\text{SUBC-TOK}}. \tag{7}$$
In other words, the continuous representation of the context breaks down into a sum of two vectors: the first term $\hat{r}_{\text{context}}$ represents the effect of the surrounding tokens $c$, both local and global, and is defined exactly as in the context model via (4).
The new aspect is how we model the effect of the previous subtokens $s_1 \ldots s_{m-1}$ in the second term $\hat{r}_{\text{SUBC-TOK}}$. We handle this by assigning each subtoken $s$ a second embedding vector $r_s \in \mathbb{R}^D$ that represents its influence when used as a previous subtoken; we call this a history embedding. We weight these vectors by a diagonal matrix $C^{\text{SUBC}}_k$, to allow the model to learn that subtokens have decaying influence the farther that they are from the token that is being predicted. Putting this all together, we define
$$\hat{r}_{\text{SUBC-TOK}} = \sum_{i=1}^{M} C^{\text{SUBC}}_i \, r_{s_{m-i}}. \tag{8}$$
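Equations (6) to (8) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in this work: each diagonal matrix $C^{\text{SUBC}}_i$ is stored as a plain vector of its diagonal entries, history positions before the start of the identifier are simply skipped, and all numeric values are made up.

```python
def history_vector(history, r_hist, c_diag):
    """Eq. (8): sum_i C_i r_{s_{m-i}}, with each diagonal C_i stored as a vector."""
    dim = len(c_diag[0])
    out = [0.0] * dim
    for i, diag in enumerate(c_diag, start=1):
        if i > len(history):
            break  # assumption: positions before the first subtoken contribute nothing
        r = r_hist[history[-i]]  # history embedding of the i-th previous subtoken
        for d in range(dim):
            out[d] += diag[d] * r[d]
    return out


def score(candidate, history, r_context, q, b, r_hist, c_diag):
    """Eq. (6) with Eq. (7): (r_context + r_subc_tok) . q_s + b_s."""
    r_tok = history_vector(history, r_hist, c_diag)
    r_subc = [x + y for x, y in zip(r_context, r_tok)]  # Eq. (7)
    return sum(x * y for x, y in zip(r_subc, q[candidate])) + b[candidate]


# Toy parameters with D = 2 and M = 3 (all values hypothetical).
r_context = [1.0, 0.0]
q = {"location": [0.5, 1.0], "END": [0.0, -1.0]}
b = {"location": 0.1, "END": 0.0}
r_hist = {"get": [0.0, 2.0]}
c_diag = [[1.0, 0.5], [0.5, 0.25], [0.25, 0.125]]

print(score("location", ["get"], r_context, q, b, r_hist, c_diag))
print(score("END", ["get"], r_context, q, b, r_hist, c_diag))
```

Storing only the diagonal keeps the weighting an elementwise product, so each history position costs $O(D)$ rather than $O(D^2)$.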
This completes the definition of the subtoken context model. To sum up, the parameters of the subtoken context model are (a) the target embeddings $q_s$ for each subtoken $s$ that occurs in the data, (b) the history embeddings $r_s$ for each subtoken $s$, (c) the diagonal weight matrices $C^{\text{SUBC}}_m$ for $m = 1, 2, \ldots, M$ that represent the effect of distance on the subtoken history (we use $M = 3$, yielding a 4-gram-like model on subtokens) and the parameters that we carried over from the log-bilinear context model: (d) the local context embeddings $r_t$ for each token $t$ that appears in the context, (e) the local context weight matrices $C_{-k}$ and $C_k$ for $-K \le k \le K$, $k \ne 0$, and (f) the feature embeddings $r_f$ for each feature $f(c)$ of the global context. We estimate all of these parameters from the training corpus.
Although this may seem a large number of parameters, this is typical for language models, e.g., consider the $V^5$ parameters, if $V$ is the number of lexemes, required by a 5-gram language model. How can we handle so many parameters? The reason is simple: in the era of vast, publicly available source code repositories like GitHub and Bitbucket, code scarcity is a thing of the past.
Generating Neologisms. A final question is “Given the context $c$, how do we find the lexeme $t$ that maximizes $P(t \mid c)$?”. Previous models could answer this question simply by looping over all possible lexemes in the model, but this is impossible for a subtoken model, because there are infinitely many possible neologisms. So we employ beam search (see Russell and Norvig [44] for details) to find the $B$ tokens (i.e., subtoken sequences) with the highest probability.
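The beam search over subtoken sequences can be sketched as follows. This is a generic textbook-style beam search, not necessarily the exact variant used in this work; the `max_len` cutoff, the function names, and the toy next-subtoken distribution are assumptions introduced for the example. Beam search is approximate, so the returned sequences are not guaranteed to be the true top $B$.

```python
import math

END = "END"


def beam_search(next_log_probs, context, beam_width=3, max_len=6):
    """Return up to beam_width complete subtoken sequences, best first."""
    beam = [(0.0, [])]  # (log-probability, partial sequence)
    complete = []
    for _ in range(max_len):  # assumed cap on identifier length
        candidates = []
        for log_p, seq in beam:
            for s, lp in next_log_probs(seq, context).items():
                candidates.append((log_p + lp, seq + [s]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for log_p, seq in candidates[:beam_width]:
            if seq[-1] == END:
                complete.append((log_p, seq))  # finished identifier
            else:
                beam.append((log_p, seq))  # keep expanding
        if not beam:
            break
    complete.sort(key=lambda c: c[0], reverse=True)
    return complete[:beam_width]


def toy_next_log_probs(seq, context):
    """Made-up conditional distribution that depends only on sequence length."""
    if not seq:
        probs = {"get": 0.6, "set": 0.3, "location": 0.05, END: 0.05}
    elif len(seq) == 1:
        probs = {"get": 0.05, "set": 0.05, "location": 0.7, END: 0.2}
    else:
        probs = {"get": 0.05, "set": 0.05, "location": 0.1, END: 0.8}
    return {s: math.log(p) for s, p in probs.items()}


results = beam_search(toy_next_log_probs, context=None)
for log_p, seq in results:
    print(round(math.exp(log_p), 4), "".join(seq[:-1]))
```

In practice `next_log_probs` would be the log of Equation (5) evaluated for every known subtoken, and each returned sequence would be joined back into an identifier using the naming convention of the target language.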
2.4 Source Code Features for Context Models
In this section, we describe the features we use to capture global context. Identifying software measures and features that effectively capture semantic properties like comprehensibility or bug-proneness is a seminal software engineering problem that we do not tackle in this paper. Here, we have selected measures and features heavily used in the literature and industry. For instance, control flow is indisputably important; we selected Cyclomatic complexity, despite its correlation with code size, to measure it. The first column of Table 4 defines the features we used in this work. In the table, “Variable Type” tracks whether the type is generic, its type after erasure, and, if the type is an array, its size. “Contained Methods” and “Sibling Methods” exclude method overloads and recursion. The features of a target token are its target features; we assign an $r_f$ vector to each of them; this vector is added in the left summation of Equation 4 if a feature’s indicator function $f$ returns 1 for a particular token. Although features are binary, we describe some (like the modifiers of a declaration, the node type of an AST, etc.) as categorical. All categorical features are converted into binary using a 1-of-K encoding. For methods, we include Cyclomatic complexity, clipping it to 10 and treating it as categorical. When features do not make sense for a particular token, like the Cyclomatic complexity of a variable, the feature’s function simply returns zero.
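The 1-of-K conversion and the clipping of Cyclomatic complexity described above might look like the following Python sketch. The function names, the feature-key format, and the category range 1 to 10 are assumptions made for illustration; the text only states that the value is clipped to 10 and treated as categorical, and that inapplicable features return zero.

```python
def one_of_k(name, value, categories):
    """Expand one categorical feature into binary indicator features."""
    return {f"{name}={c}": int(value == c) for c in categories}


def cyclomatic_features(complexity):
    """Cyclomatic complexity clipped to 10, then 1-of-K encoded.

    Pass None when the feature does not apply (e.g. for a variable);
    every indicator is then 0.
    """
    categories = list(range(1, 11))  # assumed category range: 1..10
    clipped = None if complexity is None else min(complexity, 10)
    return one_of_k("cyclomatic", clipped, categories)


print(cyclomatic_features(25)["cyclomatic=10"])
print(sum(cyclomatic_features(None).values()))
```

Each indicator that evaluates to 1 would then select its own $r_f$ vector to be added into the context representation.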
3. METHODOLOGY
The core challenge of solving the method naming problem from code is data sparsity. Our guiding intuition is that source code contains rich structure that can alleviate the sparsity problem. We therefore pose the following question: How can we better exploit the structure inherent to source code? This question in turn leads us to the research questions:
RQ1. Can we identify and extract long and short-range context features of identifiers for naming?
RQ2. Do identifiers contain exploitable substructure?
Answering both of these questions in the affirmative, we turn our attention to exploiting the resulting naming information; here, we ask if this new information is sufficiently rich to allow us to accurately suggest names. More concretely:
RQ3. Can we accurately suggest method declaration names, looking only at the context of the declared method?
RQ4. Can we do the same for class (i.e. type) names?